secure

Title: Achieving Maximum Efficiency in Schnorr-based Multi-signature and Applications in Blockchain. (arXiv:2305.13699v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2305.13699
Code URL: null
Copy Paste: [[2305.13699] Achieving Maximum Efficiency in Schnorr-based Multi-signature and Applications in Blockchain](http://arxiv.org/abs/2305.13699) #secure
Summary:
A multi-signature scheme aggregates signatures from multiple users on the same message into a joint signature, and is widely applied in blockchain to reduce the percentage of signatures in blocks and improve transaction throughput. $k$-sum attacks are one of the major challenges in designing secure multi-signature schemes. In this work, we address $k$-sum attacks from a novel angle by defining a Public Third Party (PTP), an automatic, publicly verifiable process that prevents the signing phase from continuing until commitments have been received from all signers. Further, a two-round multi-signature scheme, MEMS, is proposed with the PTP; it is secure under the discrete logarithm assumption in the random oracle model. As each signer communicates directly with the PTP instead of with other co-signers, the total amount of communication is significantly reduced. In addition, as the PTP participates in the computation of the aggregation and signing algorithms, the computation cost left for each signer and verifier remains the same as in the basic Schnorr signature. To the best of our knowledge, this is the maximum efficiency that a Schnorr-based multi-signature scheme can achieve. Further, MEMS is applied to blockchain platforms, e.g., Fabric, to improve transaction efficiency.
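
The two-round flow described in the abstract can be illustrated with a toy sketch. This is not the MEMS scheme itself: it uses naive key aggregation (which is known to be open to rogue-key attacks), a deliberately tiny group, and a hypothetical `coordinate` function that only mimics the one property the abstract states for the PTP, namely that no challenge is released until every commitment has arrived. All names and parameters below are assumptions made for illustration.

```python
import hashlib
import secrets

# Toy group, far too small to be secure: P = 2*Q + 1 with Q prime, and
# G = 4 generates the subgroup of order Q.
P, Q, G = 2039, 1019, 4


def challenge(agg_key: int, agg_commit: int, message: bytes) -> int:
    """Hash the aggregate key, aggregate commitment and message into Z_Q."""
    data = f"{agg_key}|{agg_commit}|".encode() + message
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % Q


class Signer:
    def __init__(self) -> None:
        self.x = secrets.randbelow(Q - 1) + 1      # secret key in [1, Q-1]
        self.public_key = pow(G, self.x, P)

    def commit(self) -> int:
        self.r = secrets.randbelow(Q - 1) + 1      # fresh nonce per signature
        return pow(G, self.r, P)

    def respond(self, c: int) -> int:
        return (self.r + c * self.x) % Q           # ordinary Schnorr response


def coordinate(signers, message: bytes):
    """Hypothetical coordinator: collect ALL commitments, then release c."""
    commits = [s.commit() for s in signers]        # round 1
    agg_commit, agg_key = 1, 1
    for s, r_i in zip(signers, commits):
        agg_commit = agg_commit * r_i % P
        agg_key = agg_key * s.public_key % P
    c = challenge(agg_key, agg_commit, message)
    s_sum = sum(s.respond(c) for s in signers) % Q  # round 2
    return agg_key, (agg_commit, s_sum)


def verify(agg_key: int, message: bytes, signature) -> bool:
    """Same check as single-signer Schnorr: G^s == R * X^c (mod P)."""
    agg_commit, s_sum = signature
    c = challenge(agg_key, agg_commit, message)
    return pow(G, s_sum, P) == agg_commit * pow(agg_key, c, P) % P
```

In this sketch each signer performs one exponentiation and one modular multiply-add, and the verifier runs the same equation as for a single Schnorr signature, which is the cost profile the abstract attributes to MEMS.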

security

Title: Human Body Pose Estimation for Gait Identification: A Comprehensive Survey of Datasets and Models. (arXiv:2305.13765v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.13765
Code URL: null
Copy Paste: [[2305.13765] Human Body Pose Estimation for Gait Identification: A Comprehensive Survey of Datasets and Models](http://arxiv.org/abs/2305.13765) #security
Summary:
Person identification is a problem that has received substantial attention, particularly in security domains. Gait recognition is one of the most convenient approaches enabling person identification at a distance without the need for high-quality images. There are several review studies addressing person identification, such as those covering the use of facial images, silhouette images, and wearable sensors. Although skeleton-based person identification is gaining popularity while overcoming the challenges of traditional approaches, existing survey studies lack a comprehensive review of skeleton-based approaches to gait identification. We present a detailed review of the human pose estimation and gait analysis that make the skeleton-based approaches possible. The study covers various types of related datasets, tools, methodologies, and evaluation metrics with associated challenges, limitations, and application domains. Detailed comparisons are presented for each of these aspects with recommendations for potential research and alternatives. A common trend throughout this paper is the positive impact that deep learning techniques are beginning to have on topics such as human pose estimation and gait identification. The survey outcomes might be useful for the related research community and other stakeholders in terms of performance analysis of existing methodologies, potential research gaps, application domains, and possible contributions in the future.

Title: Extracting Protocol Format as State Machine via Controlled Static Loop Analysis. (arXiv:2305.13483v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2305.13483
Code URL: null
Copy Paste: [[2305.13483] Extracting Protocol Format as State Machine via Controlled Static Loop Analysis](http://arxiv.org/abs/2305.13483) #security
Summary:
Reverse engineering of protocol message formats is critical for many security applications. Mainstream techniques use dynamic analysis and inherit its low-coverage problem -- the inferred message formats only reflect the features of their inputs. To achieve high coverage, we choose to use static analysis to infer message formats from the implementation of protocol parsers. In this work, we focus on a class of extremely challenging protocols whose formats are described via constraint-enhanced regular expressions and parsed using finite-state machines. Such state machines are often implemented as complicated parsing loops, which are inherently difficult to analyze via conventional static analysis. Our new technique extracts a state machine by regarding each loop iteration as a state and the dependency between loop iterations as state transitions. To achieve high, i.e., path-sensitive, precision but avoid path explosion, the analysis is controlled to merge as many paths as possible based on carefully-designed rules. The evaluation results show that we can infer a state machine and, thus, the message formats, in five minutes with over 90% precision and recall, far better than state of the art. We also applied the state machines to enhance protocol fuzzers, which are improved by 20% to 230% in terms of coverage and detect ten more zero-days compared to baselines.

Title: Algorithmic Security is Insufficient: A Comprehensive Survey on Implementation Attacks Haunting Post-Quantum Security. (arXiv:2305.13544v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2305.13544
Code URL: null
Copy Paste: [[2305.13544] Algorithmic Security is Insufficient: A Comprehensive Survey on Implementation Attacks Haunting Post-Quantum Security](http://arxiv.org/abs/2305.13544) #security
Summary:
This survey covers forward-looking, emerging security concerns in the post-quantum era, i.e., implementation attacks on the 2022 winners of the NIST post-quantum cryptography (PQC) competition; its visions, insights, and discussions can thus be used as a step towards scrutinizing the new standards for applications ranging from the Metaverse and Web 3.0 to deeply-embedded systems. The rapid advances in quantum computing have brought immense opportunities for scientific discovery and technological progress; however, they pose a major risk to today's security since advanced quantum computers are believed to break all traditional public-key cryptographic algorithms. This has led to active research on PQC algorithms that are believed to be secure against both classical and powerful quantum computers. However, algorithmic security is unfortunately insufficient, and many cryptographic algorithms are vulnerable to side-channel attacks (SCA), where an attacker passively or actively obtains side-channel data to compromise security properties that are assumed to be safe theoretically. In this survey, we explore such imminent threats and their countermeasures with respect to PQC. We present the latest advancements in PQC research, as well as assessments of and visions on the different types of SCAs.

privacy

Title: Attribute-Guided Encryption with Facial Texture Masking. (arXiv:2305.13548v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.13548
Code URL: null
Copy Paste: [[2305.13548] Attribute-Guided Encryption with Facial Texture Masking](http://arxiv.org/abs/2305.13548) #privacy
Summary:
The increasingly pervasive facial recognition (FR) systems raise serious concerns about personal privacy, especially for billions of users who have publicly shared their photos on social media. Several attempts have been made to protect individuals from being identified by unauthorized FR systems by using adversarial attacks to generate encrypted face images. However, existing methods suffer from poor visual quality or low attack success rates, which limit their usability in practice. In this paper, we propose Attribute Guided Encryption with Facial Texture Masking (AGE-FTM), which performs a dual manifold adversarial attack on FR systems to achieve both good visual quality and high black-box attack success rates. In particular, AGE-FTM utilizes a high-fidelity generative adversarial network (GAN) to generate natural on-manifold adversarial samples by modifying facial attributes, and performs the facial texture masking attack to generate imperceptible off-manifold adversarial samples. Extensive experiments on the CelebA-HQ dataset demonstrate that our proposed method produces more natural-looking encrypted images than state-of-the-art methods while achieving competitive attack performance. We further evaluate the effectiveness of AGE-FTM in the real world using a commercial FR API and validate its usefulness in practice through a user study.

Title: DiffProtect: Generate Adversarial Examples with Diffusion Models for Facial Privacy Protection. (arXiv:2305.13625v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.13625
Code URL: null
Copy Paste: [[2305.13625] DiffProtect: Generate Adversarial Examples with Diffusion Models for Facial Privacy Protection](http://arxiv.org/abs/2305.13625) #privacy
Summary:
The increasingly pervasive facial recognition (FR) systems raise serious concerns about personal privacy, especially for billions of users who have publicly shared their photos on social media. Several attempts have been made to protect individuals from being identified by unauthorized FR systems utilizing adversarial attacks to generate encrypted face images. However, existing methods suffer from poor visual quality or low attack success rates, which limit their utility. Recently, diffusion models have achieved tremendous success in image generation. In this work, we ask: can diffusion models be used to generate adversarial examples to improve both visual quality and attack performance? We propose DiffProtect, which utilizes a diffusion autoencoder to generate semantically meaningful perturbations on FR systems. Extensive experiments demonstrate that DiffProtect produces more natural-looking encrypted images than state-of-the-art methods while achieving significantly higher attack success rates, e.g., 24.5% and 25.1% absolute improvements on the CelebA-HQ and FFHQ datasets.

Title: Mixup-Privacy: A simple yet effective approach for privacy-preserving segmentation. (arXiv:2305.13756v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.13756
Code URL: null
Copy Paste: [[2305.13756] Mixup-Privacy: A simple yet effective approach for privacy-preserving segmentation](http://arxiv.org/abs/2305.13756) #privacy
Summary:
Privacy protection in medical data is a legitimate obstacle for centralized machine learning applications. Here, we propose a client-server image segmentation system which allows for the analysis of multi-centric medical images while preserving patient privacy. In this approach, the client protects the to-be-segmented patient image by mixing it with a reference image. As shown in our work, it is challenging to separate the image mixture into its exact original content, thus making the data unworkable and unrecognizable for an unauthorized person. This proxy image is sent to a server for processing. The server then returns the mixture of segmentation maps, which the client can revert to a correct target segmentation. Our system has two components: 1) a segmentation network on the server side which processes the image mixture, and 2) a segmentation unmixing network which recovers the correct segmentation map from the segmentation mixture. Furthermore, the whole system is trained end-to-end. The proposed method is validated on the task of MRI brain segmentation using images from two different datasets. Results show that the segmentation accuracy of our method is comparable to a system trained on raw images, and outperforms other privacy-preserving methods with little computational overhead.

Title: Selective Pre-training for Private Fine-tuning. (arXiv:2305.13865v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.13865
Code URL: null
Copy Paste: [[2305.13865] Selective Pre-training for Private Fine-tuning](http://arxiv.org/abs/2305.13865) #privacy
Summary:
Suppose we want to train text prediction models in email clients or word processors. The models must preserve the privacy of user data and adhere to a specific fixed size to meet memory and inference time requirements. We introduce a generic framework to solve this problem. Specifically, we are given a public dataset $D_\text{pub}$ and a private dataset $D_\text{priv}$ corresponding to a downstream task $T$. How should we pre-train a fixed-size model $M$ on $D_\text{pub}$ and fine-tune it on $D_\text{priv}$ such that performance of $M$ with respect to $T$ is maximized and $M$ satisfies differential privacy with respect to $D_\text{priv}$? We show that pre-training on a *subset* of dataset $D_\text{pub}$ that brings the public distribution closer to the private distribution is a crucial ingredient to maximize the transfer learning abilities of $M$ after pre-training, especially in the regimes where model sizes are relatively small. Besides performance improvements, our framework also shows that with careful pre-training and private fine-tuning, *smaller models* can match the performance of much larger models, highlighting the promise of differentially private training as a tool for model compression and efficiency.

protect

Title: Towards Legally Enforceable Hate Speech Detection for Public Forums. (arXiv:2305.13677v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13677
Code URL: null
Copy Paste: [[2305.13677] Towards Legally Enforceable Hate Speech Detection for Public Forums](http://arxiv.org/abs/2305.13677) #protect
Summary:
Hate speech is a serious issue on public forums, and proper enforcement of hate speech laws is key for protecting groups of people against harmful and discriminatory language. However, determining what constitutes hate speech is a complex task that is highly open to subjective interpretations. Existing works do not align their systems with enforceable definitions of hate speech, which can make their outputs inconsistent with the goals of regulators. Our work introduces a new task for enforceable hate speech detection centred around legal definitions, and a dataset annotated on violations of eleven possible definitions by legal experts. Given the challenge of identifying clear, legally enforceable instances of hate speech, we augment the dataset with expert-generated samples and an automatically mined challenge set. We experiment with grounding the model decision in these definitions using zero-shot and few-shot prompting. We then report results on several large language models (LLMs). With this task definition, automatic hate speech detection can be more closely aligned to enforceable laws, and hence assist in more rigorous enforcement of legal protections against harmful speech in public forums.

defense

Title: Adversarial Defenses via Vector Quantization. (arXiv:2305.13651v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.13651
Code URL: null
Copy Paste: [[2305.13651] Adversarial Defenses via Vector Quantization](http://arxiv.org/abs/2305.13651) #defense
Summary:
Building upon Randomized Discretization, we develop two novel adversarial defenses against white-box PGD attacks, utilizing vector quantization in higher-dimensional spaces. These methods, termed pRD and swRD, not only offer a theoretical guarantee in terms of certified accuracy but are also shown, through extensive experiments, to perform comparably to, or even better than, current state-of-the-art adversarial defenses. These methods can be extended to a version that allows further training of the target classifier and demonstrates further improved performance.
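
As a rough illustration of the general idea, and not of pRD or swRD specifically, the sketch below adds Gaussian noise to an input and then snaps each fixed-length block to its nearest codebook vector. The block length, noise level, codebook, and function names are all assumptions chosen for the example; the paper's actual constructions and guarantees are not reproduced here.

```python
import random


def nearest_codeword(vec, codebook):
    """Return the codebook vector with the smallest squared distance to vec."""
    return min(codebook, key=lambda c: sum((a - b) ** 2 for a, b in zip(vec, c)))


def randomized_vq(signal, codebook, patch=2, sigma=0.1, rng=None):
    """Add Gaussian noise, then quantize each length-`patch` block."""
    if len(signal) % patch != 0:
        raise ValueError("signal length must be a multiple of patch")
    rng = rng or random.Random()
    noisy = [v + rng.gauss(0.0, sigma) for v in signal]
    out = []
    for i in range(0, len(noisy), patch):
        # Each block is treated as one point in a `patch`-dimensional space.
        out.extend(nearest_codeword(noisy[i:i + patch], codebook))
    return out
```

With a two-codeword codebook such as `[(0.0, 0.0), (1.0, 1.0)]`, a small perturbation that moves a block from `(0.1, 0.2)` to `(0.15, 0.25)` still maps to the same codeword, which is the intuition behind using quantization as a defense.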

Title: REGARD: Rules of EngaGement for Automated cybeR Defense to aid in Intrusion Response. (arXiv:2305.13967v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2305.13967
Code URL: null
Copy Paste: [[2305.13967] REGARD: Rules of EngaGement for Automated cybeR Defense to aid in Intrusion Response](http://arxiv.org/abs/2305.13967) #defense
Summary:
Automated Intelligent Cyberdefense Agents (AICAs) that are part Intrusion Detection Systems (IDS) and part Intrusion Response Systems (IRS) are being designed to protect against sophisticated and automated cyber-attacks. An AICA based on the ideas of Self-Adaptive Autonomic Computing Systems (SA-ACS) can be considered a managing system that protects a managed system such as a personal computer, web application, or critical infrastructure. An AICA, specifically its IRS components, can compute a wide range of potential responses to meet its security goals and objectives, such as taking actions to prevent the attack from completing, restoring the system to comply with the organizational security policy, containing or confining an attack, eradicating the attack, deploying forensic measures to enable future attack analysis, counterattacking, and so on. To restrict its activities in order to minimize collateral/organizational damage, such an automated system must have set Rules of Engagement (RoE). Automated systems must determine which operations can be completely automated (and when), which actions require human operator confirmation, and which actions must never be undertaken. In this paper, to enable this control functionality over an IRS, we create the Rules of EngaGement for Automated cybeR Defense (REGARD) system, which holds a set of Rules of Engagement (RoE) to protect the managed system according to the instructions provided by the human operator. These rules help limit the actions of the IRS on the managed system in compliance with the recommendations of the domain expert. We provide details of execution, management, operation, and conflict resolution for RoE to constrain the actions of an automated IRS. We also describe the REGARD system implementation, security case studies for cyber defense, and RoE demonstrations.
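
The three-way split the abstract describes (fully automated, human-confirmed, never allowed) can be sketched as a small rule lookup. This is not the REGARD implementation: the `Rule` fields, the wildcard convention, the "most restrictive rule wins" conflict policy, and the default of asking a human when no rule matches are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Decision(IntEnum):
    # Ordered so that a larger value is more restrictive.
    AUTOMATE = 0   # the IRS may act on its own
    CONFIRM = 1    # a human operator must approve first
    FORBID = 2     # the action must never be taken


@dataclass(frozen=True)
class Rule:
    action: str        # response the IRS wants to take
    asset: str         # managed asset the rule covers, or "*" for any asset
    decision: Decision


def evaluate(rules, action, asset, default=Decision.CONFIRM):
    """Resolve conflicts by letting the most restrictive matching rule win."""
    matches = [r.decision for r in rules
               if r.action == action and r.asset in (asset, "*")]
    return max(matches, default=default)


rules = [
    Rule("block_ip", "*", Decision.AUTOMATE),
    Rule("block_ip", "payments-db", Decision.CONFIRM),
    Rule("counterattack", "*", Decision.FORBID),
]
```

Here `evaluate(rules, "block_ip", "payments-db")` yields `Decision.CONFIRM`, because the asset-specific rule is stricter than the wildcard rule that would otherwise allow automation.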

attack

Title: Model Stealing Attack against Multi-Exit Networks. (arXiv:2305.13584v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2305.13584
Code URL: null
Copy Paste: [[2305.13584] Model Stealing Attack against Multi-Exit Networks](http://arxiv.org/abs/2305.13584) #attack
Summary:
Compared to traditional neural networks with a single exit, a multi-exit network has multiple exits that allow for early output from intermediate layers of the model, thus bringing significant improvement in computational efficiency while maintaining similar recognition accuracy. When attempting to steal such valuable models using traditional model stealing attacks, we found that conventional methods can only steal the model's classification function while failing to capture its output strategy. This results in a significant decrease in computational efficiency for the stolen substitute model, thereby losing the advantages of multi-exit networks. In this paper, we propose the first model stealing attack to extract both the model function and output strategy. We employ Bayesian changepoint detection to analyze the target model's output strategy and use performance loss and strategy loss to guide the training of the substitute model. Furthermore, we design a novel output strategy search algorithm that can find the optimal output strategy to maximize the consistency between the outputs of the victim model and the substitute model. Through experiments on multiple mainstream multi-exit networks and benchmark datasets, we thoroughly demonstrate the effectiveness of our method.
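
To make the notion of an "output strategy" concrete, the sketch below shows one common early-exit policy: return the prediction of the first exit whose top softmax probability reaches a threshold. Whether the networks studied in the paper use exactly this rule is an assumption; the function names and the threshold value are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(exit_logits, threshold=0.9):
    """Return (exit_index, predicted_class) for the first confident exit.

    `exit_logits` holds one logit vector per exit, ordered from the
    shallowest exit to the final one; the final exit always answers.
    """
    for idx, logits in enumerate(exit_logits):
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold or idx == len(exit_logits) - 1:
            return idx, best
    raise ValueError("exit_logits must contain at least one exit")
```

Under such a policy, the exit index depends on the input, so a substitute model that copies only the class labels reproduces the predictions but not the per-input compute savings.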

robust

Title: ColMix -- A Simple Data Augmentation Framework to Improve Object Detector Performance and Robustness in Aerial Images. (arXiv:2305.13509v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.13509
Code URL: null
Copy Paste: [[2305.13509] ColMix -- A Simple Data Augmentation Framework to Improve Object Detector Performance and Robustness in Aerial Images](http://arxiv.org/abs/2305.13509) #robust
Summary:
In the last decade, Convolutional Neural Network (CNN) and transformer-based object detectors have achieved high performance on a large variety of datasets. Though the majority of detection literature has developed this capability on datasets such as MS COCO, these detectors have still proven effective for remote sensing applications. Challenges in this particular domain, such as small numbers of annotated objects and low object density, hinder overall performance. In this work, we present a novel augmentation method, called collage pasting, for increasing the object density without a need for segmentation masks, thereby improving the detector performance. We demonstrate that collage pasting improves precision and recall beyond related methods, such as mosaic augmentation, and enables greater control of object density. However, we find that collage pasting is vulnerable to certain out-of-distribution shifts, such as image corruptions. To address this, we introduce two simple approaches for combining collage pasting with the PixMix augmentation method, and refer to our combined techniques as ColMix. Through extensive experiments, we show that employing ColMix results in detectors with superior performance on aerial imagery datasets and robustness to various corruptions.

Title: A Dive into SAM Prior in Image Restoration. (arXiv:2305.13620v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.13620
Code URL: null
Copy Paste: [[2305.13620] A Dive into SAM Prior in Image Restoration](http://arxiv.org/abs/2305.13620) #robust
Summary:
The goal of image restoration (IR), a fundamental issue in computer vision, is to restore a high-quality (HQ) image from its degraded low-quality (LQ) observation. Multiple HQ solutions may correspond to an LQ input in this ill-posed problem, creating an ambiguous solution space. This motivates the investigation and incorporation of prior knowledge in order to effectively constrain the solution space and enhance the quality of the restored images. In spite of the pervasive use of hand-crafted and learned priors in IR, limited attention has been paid to the incorporation of knowledge from large-scale foundation models. In this paper, we for the first time leverage the prior knowledge of the state-of-the-art segment anything model (SAM) to boost the performance of existing IR networks in a parameter-efficient tuning manner. In particular, the choice of SAM is based on its robustness to image degradations, such that HQ semantic masks can be extracted from it. In order to leverage semantic priors and enhance restoration quality, we propose a lightweight SAM prior tuning (SPT) unit. This plug-and-play component allows us to effectively integrate semantic priors into existing IR networks, resulting in significant improvements in restoration quality. As the only trainable module in our method, the SPT unit has the potential to improve both efficiency and scalability. We demonstrate the effectiveness of the proposed method in enhancing a variety of methods across multiple tasks, such as image super-resolution and color image denoising.

Title: RaSa: Relation and Sensitivity Aware Representation Learning for Text-based Person Search. (arXiv:2305.13653v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.13653
Code URL: null
Copy Paste: [[2305.13653] RaSa: Relation and Sensitivity Aware Representation Learning for Text-based Person Search](http://arxiv.org/abs/2305.13653) #robust
Summary:
Text-based person search aims to retrieve images of a specified person given a textual description. The key to tackling such a challenging task is to learn powerful multi-modal representations. Towards this, we propose a Relation and Sensitivity aware representation learning method (RaSa), including two novel tasks: Relation-Aware learning (RA) and Sensitivity-Aware learning (SA). First, existing methods cluster representations of all positive pairs without distinction and overlook the noise caused by weak positive pairs, in which the text and the paired image have noisy correspondences, which leads to overfitting. RA offsets the overfitting risk by introducing a novel positive relation detection task (i.e., learning to distinguish strong and weak positive pairs). Second, learning invariant representations under data augmentation (i.e., being insensitive to some transformations) is common practice for improving the robustness of representations in existing methods. Beyond that, we use SA to encourage the representation to perceive sensitive transformations (i.e., learning to detect replaced words), thus further promoting its robustness. Experiments demonstrate that RaSa outperforms existing state-of-the-art methods by 6.94%, 4.45% and 15.35% in terms of Rank@1 on the CUHK-PEDES, ICFG-PEDES and RSTPReid datasets, respectively. Code is available at: https://github.com/Flame-Chasers/RaSa.

Title: Leveraging Uncertainty Quantification for Picking Robust First Break Times. (arXiv:2305.13799v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.13799
Code URL: null
Copy Paste: [[2305.13799] Leveraging Uncertainty Quantification for Picking Robust First Break Times](http://arxiv.org/abs/2305.13799) #robust
Summary:
In seismic exploration, the selection of first break times is a crucial aspect in the determination of subsurface velocity models, which in turn significantly influences the placement of wells. Many deep neural network (DNN)-based automatic first break picking methods have been proposed to speed up this picking process. However, there has been no work on the uncertainty of the first break picks output by a DNN. In this paper, we propose a new framework for first break picking based on a Bayesian neural network to further explain the uncertainty of the output. In a large number of experiments, we show that the proposed method has better accuracy and robustness than the deterministic DNN-based model. In addition, we also verify that the measured uncertainty is meaningful and can provide a reference for human decision-making.

Title: Cross3DVG: Baseline and Dataset for Cross-Dataset 3D Visual Grounding on Different RGB-D Scans. (arXiv:2305.13876v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.13876
Code URL: null
Copy Paste: [[2305.13876] Cross3DVG: Baseline and Dataset for Cross-Dataset 3D Visual Grounding on Different RGB-D Scans](http://arxiv.org/abs/2305.13876) #robust
Summary:
We present Cross3DVG, a novel task for cross-dataset visual grounding in 3D scenes, revealing the limitations of existing 3D visual grounding models, which use restricted 3D resources and thus easily overfit to a specific 3D dataset. To facilitate Cross3DVG, we have created a large-scale 3D visual grounding dataset containing more than 63k diverse descriptions of 3D objects within 1,380 indoor RGB-D scans from 3RScan with human annotations, paired with the existing 52k descriptions on ScanRefer. We perform Cross3DVG by training a model on the source 3D visual grounding dataset and then evaluating it on the target dataset constructed in different ways (e.g., different sensors, 3D reconstruction methods, and language annotators) without using target labels. We conduct comprehensive experiments using established visual grounding models, as well as a CLIP-based 2D-3D integration method, designed to bridge the gaps between 3D datasets. By performing Cross3DVG tasks, we found that (i) cross-dataset 3D visual grounding has significantly lower performance than learning and evaluation with a single dataset, suggesting much room for improvement in cross-dataset generalization of 3D visual grounding, (ii) better detectors and transformer-based localization modules for 3D grounding are beneficial for enhancing 3D grounding performance, and (iii) fusing 2D-3D data using CLIP demonstrates further performance improvements. Our Cross3DVG task will provide a benchmark for developing robust 3D visual grounding models capable of handling diverse 3D scenes while leveraging deep language understanding.

Title: Deep Transductive Transfer Learning for Automatic Target Recognition. (arXiv:2305.13886v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.13886
Code URL: null
Copy Paste: [[2305.13886] Deep Transductive Transfer Learning for Automatic Target Recognition](http://arxiv.org/abs/2305.13886) #robust
Summary:
One of the major obstacles in designing an automatic target recognition (ATR) algorithm is that there are often labeled images in one domain (e.g., the infrared source domain) but no annotated images in the other target domains (e.g., visible, SAR, LIDAR). Therefore, automatically annotating these images is essential to build a robust classifier in the target domain based on the labeled images of the source domain. Transductive transfer learning is an effective way to adapt a network to a new target domain by utilizing a pretrained ATR network in the source domain. We propose an unpaired transductive transfer learning framework where a CycleGAN model and a well-trained ATR classifier in the source domain are used to construct an ATR classifier in the target domain without having any labeled data in the target domain. We employ a CycleGAN model to transfer the mid-wave infrared (MWIR) images to visible (VIS) domain images (or visible to MWIR domain). To train the transductive CycleGAN, we optimize a cost function consisting of the adversarial, identity, cycle-consistency, and categorical cross-entropy losses for both the source and target classifiers. In this paper, we perform a detailed experimental analysis on the challenging DSIAC ATR dataset. The dataset consists of ten classes of vehicles at different poses and at distances ranging from 1 to 5 kilometers in both the MWIR and VIS domains. In our experiment, we assume that the images in the VIS domain are the unlabeled target dataset. We first detect and crop the vehicles from the raw images and then project them into a common distance of 2 kilometers. Our proposed transductive CycleGAN achieves 71.56% accuracy in classifying the visible domain vehicles in the DSIAC ATR dataset.
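
One plausible way to write the cost function listed in the abstract is as a weighted sum of its four named terms. The weights $\lambda_\text{id}$, $\lambda_\text{cyc}$, and $\lambda_\text{cls}$, and the superscripts $S$ and $T$ for the source and target classifiers, are notation assumed here for illustration; the abstract does not state how the terms are weighted or combined.

$$
\mathcal{L} = \mathcal{L}_\text{adv} + \lambda_\text{id}\,\mathcal{L}_\text{id} + \lambda_\text{cyc}\,\mathcal{L}_\text{cyc} + \lambda_\text{cls}\left(\mathcal{L}_\text{CE}^{S} + \mathcal{L}_\text{CE}^{T}\right)
$$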

Title: Development and Whole-Body Validation of Personalizable Female and Male Pedestrian SAFER Human Body Models. (arXiv:2305.13918v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.13918
Code URL: null
Copy Paste: [[2305.13918] Development and Whole-Body Validation of Personalizable Female and Male Pedestrian SAFER Human Body Models](http://arxiv.org/abs/2305.13918) #robust
Summary:
Vulnerable road users are overrepresented in the worldwide number of road-traffic injury victims. Developing biofidelic male and female pedestrian human body models (HBMs) representing a range of anthropometries is imperative to follow through with the efforts to increase road safety and propose intervention strategies. In this study, 50th percentile male and female pedestrian versions of the SAFER HBM were developed via a newly developed image registration-based mesh morphing framework for subject personalization. The HBM and its accompanying personalization framework were evaluated by means of a set of cadaver experiments, where subjects were struck laterally by a generic sedan buck. In the simulated whole-body pedestrian collisions, the personalized HBMs demonstrate a good capability of reproducing the trajectories and head kinematics observed in lateral impacts. The presented pedestrian HBMs and personalization framework provide robust means to thoroughly and accurately reconstruct and evaluate pedestrian-to-vehicle collisions.

Title: DADA: Dialect Adaptation via Dynamic Aggregation of Linguistic Rules. (arXiv:2305.13406v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13406
Code URL: null
Copy Paste: [[2305.13406] DADA: Dialect Adaptation via Dynamic Aggregation of Linguistic Rules](http://arxiv.org/abs/2305.13406) #robust
Summary:
Existing large language models (LLMs) that mainly focus on Standard American English (SAE) often perform significantly worse when applied to other English dialects. While existing mitigations tackle discrepancies for individual target dialects, they assume access to high-accuracy dialect identification systems. The boundaries between dialects are inherently flexible, making it difficult to categorize language into discrete predefined categories. In this paper, we propose DADA (Dialect Adaptation via Dynamic Aggregation), a modular approach to imbue SAE-trained models with multi-dialectal robustness by composing adapters which handle specific linguistic features. The compositional architecture of DADA allows for both targeted adaptation to specific dialect variants and simultaneous adaptation to various dialects. We show that DADA is effective for both single-task and instruction-finetuned language models, offering an extensible and interpretable framework for adapting existing LLMs to different English dialects.

Title: Small Language Models Improve Giants by Rewriting Their Outputs. (arXiv:2305.13514v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13514
Code URL: null
Copy Paste: [[2305.13514] Small Language Models Improve Giants by Rewriting Their Outputs](http://arxiv.org/abs/2305.13514) #robust
Summary:
Large language models (LLMs) have demonstrated impressive few-shot learning capabilities, but they often underperform compared to fine-tuned models on challenging tasks. Furthermore, their large size and restricted access only through APIs make task-specific fine-tuning impractical. Moreover, LLMs are sensitive to different aspects of prompts (e.g., the selection and order of demonstrations) and can thus require time-consuming prompt engineering. In this light, we propose a method to correct LLM outputs without relying on their weights. First, we generate a pool of candidates by few-shot prompting an LLM. Second, we refine the LLM-generated outputs using a smaller model, the LM-corrector (LMCor), which is trained to rank, combine and rewrite the candidates to produce the final target output. Our experiments demonstrate that even a small LMCor model (250M) substantially improves the few-shot performance of LLMs (62B) across diverse tasks. Moreover, we illustrate that the LMCor exhibits robustness against different prompts, thereby minimizing the need for extensive prompt engineering. Finally, we showcase that the LMCor can be seamlessly integrated with different LLMs at inference time, serving as a plug-and-play module to improve their performance.

Title: Transfer-Free Data-Efficient Multilingual Slot Labeling. (arXiv:2305.13528v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13528
Code URL: null
Copy Paste: [[2305.13528] Transfer-Free Data-Efficient Multilingual Slot Labeling](http://arxiv.org/abs/2305.13528) #robust
Summary:
Slot labeling (SL) is a core component of task-oriented dialogue (ToD) systems, where slots and corresponding values are usually language-, task- and domain-specific. Therefore, extending the system to any new language-domain-task configuration requires (re)running an expensive and resource-intensive data annotation process. To mitigate the inherent data scarcity issue, current research on multilingual ToD assumes that sufficient English-language annotated data are always available for particular tasks and domains, and thus operates in a standard cross-lingual transfer setup. In this work, we depart from this often unrealistic assumption. We examine challenging scenarios where such transfer-enabling English annotated data cannot be guaranteed, and focus on bootstrapping multilingual data-efficient slot labelers in transfer-free scenarios directly in the target languages without any English-ready data. We propose a two-stage slot labeling approach (termed TWOSL) which transforms standard multilingual sentence encoders into effective slot labelers. In Stage 1, relying on SL-adapted contrastive learning with only a handful of SL-annotated examples, we turn sentence encoders into task-specific span encoders. In Stage 2, we recast SL from a token classification into a simpler, less data-intensive span classification task. Our results on two standard multilingual TOD datasets and across diverse languages confirm the effectiveness and robustness of TWOSL. It is especially effective for the most challenging transfer-free few-shot setups, paving the way for quick and data-efficient bootstrapping of multilingual slot labelers for ToD.

Title: Improving Classifier Robustness through Active Generation of Pairwise Counterfactuals. (arXiv:2305.13535v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13535
Code URL: null
Copy Paste: [[2305.13535] Improving Classifier Robustness through Active Generation of Pairwise Counterfactuals](http://arxiv.org/abs/2305.13535) #robust
Summary:
Counterfactual Data Augmentation (CDA) is a commonly used technique for improving robustness in natural language classifiers. However, one fundamental challenge is how to discover meaningful counterfactuals and efficiently label them, with minimal human labeling cost. Most existing methods either completely rely on human-annotated labels, an expensive process which limits the scale of counterfactual data, or implicitly assume label invariance, which may mislead the model with incorrect labels. In this paper, we present a novel framework that utilizes counterfactual generative models to generate a large number of diverse counterfactuals by actively sampling from regions of uncertainty, and then automatically label them with a learned pairwise classifier. Our key insight is that we can more correctly label the generated counterfactuals by training a pairwise classifier that interpolates the relationship between the original example and the counterfactual. We demonstrate that with a small amount of human-annotated counterfactual data (10%), we can generate a counterfactual augmentation dataset with learned labels, that provides an 18-20% improvement in robustness and a 14-21% reduction in errors on 6 out-of-domain datasets, comparable to that of a fully human-annotated counterfactual dataset for both sentiment classification and question paraphrase tasks.

Title: Better Low-Resource Entity Recognition Through Translation and Annotation Fusion. (arXiv:2305.13582v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13582
Code URL: https://github.com/edchengg/transfusion
Copy Paste: [[2305.13582] Better Low-Resource Entity Recognition Through Translation and Annotation Fusion](http://arxiv.org/abs/2305.13582) #robust
Summary:
Pre-trained multilingual language models have enabled significant advancements in cross-lingual transfer. However, these models often exhibit a performance disparity when transferring from high-resource languages to low-resource languages, especially for languages that are underrepresented or not in the pre-training data. Motivated by the superior performance of these models on high-resource languages compared to low-resource languages, we introduce a Translation-and-fusion framework, which translates low-resource language text into a high-resource language for annotation using fully supervised models before fusing the annotations back into the low-resource language. Based on this framework, we present TransFusion, a model trained to fuse predictions from a high-resource language to make robust predictions on low-resource languages. We evaluate our methods on two low-resource named entity recognition (NER) datasets, MasakhaNER2.0 and LORELEI NER, covering 25 languages, and show consistent improvement up to +16 F$_1$ over English fine-tuning systems, achieving state-of-the-art performance compared to Translate-train systems. Our analysis depicts the unique advantages of the TransFusion method, which is robust to translation errors and source language prediction errors, and complementary to adapted multilingual language models.

Title: Understanding and Mitigating Spurious Correlations in Text Classification. (arXiv:2305.13654v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13654
Code URL: null
Copy Paste: [[2305.13654] Understanding and Mitigating Spurious Correlations in Text Classification](http://arxiv.org/abs/2305.13654) #robust
Summary:
Recent work has shown that deep learning models are prone to exploit spurious correlations that are present in the training set, yet may not hold true in general. A sentiment classifier may erroneously learn that the token spielberg is always tied to positive movie reviews. Relying on spurious correlations may lead to significant degradation in generalizability and should be avoided. In this paper, we propose a neighborhood analysis framework to explain how exactly language models exploit spurious correlations. Driven by the analysis, we propose a family of regularization methods, NFL (do Not Forget your Language) to prevent the situation. Experiments on two text classification tasks show that NFL brings a significant improvement over standard fine-tuning in terms of robustness without sacrificing in-distribution accuracy.

Title: Physics of Language Models: Part 1, Context-Free Grammar. (arXiv:2305.13673v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13673
Code URL: null
Copy Paste: [[2305.13673] Physics of Language Models: Part 1, Context-Free Grammar](http://arxiv.org/abs/2305.13673) #robust
Summary:
We design experiments to study $\textit{how}$ generative language models, like GPT, learn context-free grammars (CFGs) -- diverse language systems with a tree-like structure capturing many aspects of natural languages, programs, and human logics. CFGs are as hard as pushdown automata, and can be ambiguous so that verifying if a string satisfies the rules requires dynamic programming. We construct synthetic data and demonstrate that even for very challenging CFGs, pre-trained transformers can learn to generate sentences with near-perfect accuracy and remarkable $\textit{diversity}$.

More importantly, we delve into the $\textit{physical principles}$ behind how transformers learn CFGs. We discover that the hidden states within the transformer implicitly and $\textit{precisely}$ encode the CFG structure (such as putting tree node information exactly on the subtree boundary), and learn to form "boundary to boundary" attentions that resemble dynamic programming. We also cover some extensions of CFGs as well as the robustness aspect of transformers against grammar mistakes. Overall, our research provides a comprehensive and empirical understanding of how transformers learn CFGs, and reveals the physical mechanisms utilized by transformers to capture the structure and rules of languages.

Title: PromptClass: Weakly-Supervised Text Classification with Prompting Enhanced Noise-Robust Self-Training. (arXiv:2305.13723v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13723
Code URL: null
Copy Paste: [[2305.13723] PromptClass: Weakly-Supervised Text Classification with Prompting Enhanced Noise-Robust Self-Training](http://arxiv.org/abs/2305.13723) #robust
Summary:
Recently proposed weakly-supervised text classification settings train a classifier using the label name of each target class as the only supervision. Such weakly-supervised settings have been gaining increasing attention since they can largely reduce human annotation efforts compared to fully-supervised and semi-supervised settings. Most existing methods follow the strategy that first uses the label names as static features to generate pseudo labels, which are then used for classifier training. While reasonable, such a commonly adopted framework suffers from two limitations: (1) words can have different meanings in different contexts, so using label names for context-free matching can induce very noisy pseudo labels; and (2) the errors made in the pseudo label generation stage will directly propagate to the classifier training stage without a chance of being corrected. In this paper, we propose a new method, PromptClass, consisting of two modules: (1) a pseudo label acquisition module that uses zero-shot prompting of pre-trained language models (PLM) to get pseudo labels based on contextualized text understanding, and (2) a noise-robust self-training module that iteratively trains the classifier and updates pseudo labels by utilizing two PLM fine-tuning strategies that regularize each other. Extensive experiments show that PromptClass achieves overall better performance than existing strong baselines on four benchmark datasets and even achieves similar performance to fully-supervised classifiers on sentiment classification tasks.

Title: Revealing User Familiarity Bias in Task-Oriented Dialogue via Interactive Evaluation. (arXiv:2305.13857v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13857
Code URL: null
Copy Paste: [[2305.13857] Revealing User Familiarity Bias in Task-Oriented Dialogue via Interactive Evaluation](http://arxiv.org/abs/2305.13857) #robust
Summary:
Most task-oriented dialogue (TOD) benchmarks assume users that know exactly how to use the system by constraining the user behaviors within the system's capabilities via strict user goals, namely "user familiarity" bias. This data bias deepens when it combines with data-driven TOD systems, as it is impossible to fathom the effect of it with existing static evaluations. Hence, we conduct an interactive user study to unveil how vulnerable TOD systems are against realistic scenarios. In particular, we compare users with 1) detailed goal instructions that conform to the system boundaries (closed-goal) and 2) vague goal instructions that are often unsupported but realistic (open-goal). Our study reveals that conversations in open-goal settings lead to catastrophic failures of the system, in which 92% of the dialogues had significant issues. Moreover, we conduct a thorough analysis to identify distinctive features between the two settings through error annotation. From this, we discover a novel "pretending" behavior, in which the system pretends to handle the user requests even though they are beyond the system's capabilities. We discuss its characteristics and toxicity while emphasizing transparency and a fallback strategy for robust TOD systems.

Title: Robust Instruction Optimization for Large Language Models with Distribution Shifts. (arXiv:2305.13954v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13954
Code URL: null
Copy Paste: [[2305.13954] Robust Instruction Optimization for Large Language Models with Distribution Shifts](http://arxiv.org/abs/2305.13954) #robust
Summary:
Large Language Models have demonstrated significant ability in accomplishing a wide range of Natural Language Processing (NLP) tasks. However, their performance is highly sensitive to even minor changes in the phrasing of the task instructions, leading to a line of research in automatic instruction optimization towards better performance for NLP tasks. Unfortunately, existing methods for instruction optimization fail to consider the distribution shift between the seen training data and the unseen test data, where testing on an unseen group of data with a different distribution could potentially lead to a performance drop. In this paper, we take an initial step of investigating the problem of LLM instruction optimization across data groups with distribution shifts. We find that the optimal instructions do encounter performance drops on LLMs under certain distribution shifts. To this end, we propose a framework to derive more robust optimal instructions that improve the performance on the unseen data group without a large sacrifice on the seen data group. Experimental results demonstrate the effectiveness of our proposed framework.

Title: Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction. (arXiv:2305.13981v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13981
Code URL: null
Copy Paste: [[2305.13981] Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction](http://arxiv.org/abs/2305.13981) #robust
Summary:
Robustness to distribution changes ensures that NLP models can be successfully applied in the real world, especially for information extraction tasks. However, most prior evaluation benchmarks have been devoted to validating pairwise matching correctness, ignoring the crucial measurement of robustness. In this paper, we present the first benchmark that simulates the evaluation of open information extraction models in the real world, where the syntactic and expressive distributions under the same knowledge meaning may drift variously. We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique that consists of sentences with structured knowledge of the same meaning but with different syntactic and expressive forms. By further elaborating the robustness metric, a model is judged to be robust if its performance is consistently accurate on the overall cliques. We perform experiments on typical models published in the last decade as well as a popular large language model; the results show that the existing successful models exhibit a frustrating degradation, with a maximum drop of 23.43 F1 score. Our resources and code will be publicly available.

Title: Developmental Curiosity and Social Interaction in Virtual Agents. (arXiv:2305.13396v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.13396
Code URL: null
Copy Paste: [[2305.13396] Developmental Curiosity and Social Interaction in Virtual Agents](http://arxiv.org/abs/2305.13396) #robust
Summary:
Infants explore their complex physical and social environment in an organized way. To gain insight into what intrinsic motivations may help structure this exploration, we create a virtual infant agent and place it in a developmentally-inspired 3D environment with no external rewards. The environment has a virtual caregiver agent with the capability to interact contingently with the infant agent in ways that resemble play. We test intrinsic reward functions that are similar to motivations that have been proposed to drive exploration in humans: surprise, uncertainty, novelty, and learning progress. These generic reward functions lead the infant agent to explore its environment and discover the contingencies that are embedded into the caregiver agent. The reward functions that are proxies for novelty and uncertainty are the most successful in generating diverse experiences and activating the environment contingencies. We also find that learning a world model in the presence of an attentive caregiver helps the infant agent learn how to predict scenarios with challenging social and physical dynamics. Taken together, our findings provide insight into how curiosity-like intrinsic rewards and contingent social interaction lead to dynamic social behavior and the creation of a robust predictive world model.

Title: DeepBern-Nets: Taming the Complexity of Certifying Neural Networks using Bernstein Polynomial Activations and Precise Bound Propagation. (arXiv:2305.13508v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.13508
Code URL: https://github.com/rcpsl/deepbern-nets
Copy Paste: [[2305.13508] DeepBern-Nets: Taming the Complexity of Certifying Neural Networks using Bernstein Polynomial Activations and Precise Bound Propagation](http://arxiv.org/abs/2305.13508) #robust
Summary:
Formal certification of Neural Networks (NNs) is crucial for ensuring their safety, fairness, and robustness. Unfortunately, on the one hand, sound and complete certification algorithms of ReLU-based NNs do not scale to large-scale NNs. On the other hand, incomplete certification algorithms are easier to compute, but they result in loose bounds that deteriorate with the depth of the NN, which diminishes their effectiveness. In this paper, we ask the following question: can we replace the ReLU activation function with one that opens the door to incomplete certification algorithms that are easy to compute but can produce tight bounds on the NN's outputs? We introduce DeepBern-Nets, a class of NNs with activation functions based on Bernstein polynomials instead of the commonly used ReLU activation. Bernstein polynomials are smooth and differentiable functions with desirable properties such as the so-called range enclosure and subdivision properties. We design a novel algorithm, called Bern-IBP, to efficiently compute tight bounds on DeepBern-Nets outputs. Our approach leverages the properties of Bernstein polynomials to improve the tractability of neural network certification tasks while maintaining the accuracy of the trained networks. We conduct comprehensive experiments in adversarial robustness and reachability analysis settings to assess the effectiveness of the proposed Bernstein polynomial activation in enhancing the certification process. Our proposed framework achieves high certified accuracy for adversarially-trained NNs, which is often a challenging task for certifiers of ReLU-based NNs. Moreover, using Bern-IBP bounds for certified training results in NNs with state-of-the-art certified accuracy compared to ReLU networks. This work establishes Bernstein polynomial activation as a promising alternative for improving NN certification tasks across various applications.
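
The range enclosure property mentioned in the summary can be illustrated with a small sketch. It is not the paper's Bern-IBP algorithm; it only shows, for a single polynomial in Bernstein form on [0, 1], that the smallest and largest coefficients bound the polynomial's values. The function names and the example coefficients are made up for illustration.

```python
from math import comb


def bernstein_eval(coeffs, x):
    """Evaluate sum_k c_k * C(n, k) * x^k * (1 - x)^(n - k) for x in [0, 1]."""
    n = len(coeffs) - 1
    return sum(
        c * comb(n, k) * x**k * (1 - x) ** (n - k)
        for k, c in enumerate(coeffs)
    )


def range_enclosure(coeffs):
    """Bounds on the polynomial over [0, 1].

    The Bernstein basis functions are non-negative and sum to 1, so every
    value of the polynomial is a convex combination of its coefficients.
    """
    return min(coeffs), max(coeffs)


# Illustrative coefficients (hypothetical, not taken from the paper).
coeffs = [0.0, -0.5, 2.0, 1.0]
lower, upper = range_enclosure(coeffs)
samples = [bernstein_eval(coeffs, i / 100) for i in range(101)]
```

The bounds cost only a pass over the coefficients, which is what makes this kind of enclosure attractive for bound propagation; how tight they are in practice depends on the polynomial.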

Title: Representing Input Transformations by Low-Dimensional Parameter Subspaces. (arXiv:2305.13536v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.13536
Code URL: https://github.com/osaukh/subspace-configurable-networks
Copy Paste: [[2305.13536] Representing Input Transformations by Low-Dimensional Parameter Subspaces](http://arxiv.org/abs/2305.13536) #robust
Summary:
Deep models lack robustness to simple input transformations such as rotation, scaling, and translation, unless they feature a particular invariant architecture or undergo specific training, e.g., learning the desired robustness from data augmentations. Alternatively, input transformations can be treated as a domain shift problem, and solved by post-deployment model adaptation. Although a large number of methods deal with transformed inputs, the fundamental relation between input transformations and optimal model weights is unknown. In this paper, we put forward the configuration subspace hypothesis that model weights optimal for parameterized continuous transformations can reside in low-dimensional linear subspaces. We introduce subspace-configurable networks to learn these subspaces and observe their structure and surprisingly low dimensionality on all tested transformations, datasets and architectures from computer vision and audio signal processing domains. Our findings enable efficient model reconfiguration, especially when limited storage and computing resources are at stake.
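
As a rough illustration of the configuration subspace hypothesis, the sketch below composes model weights as a linear combination of a few base weight vectors, with mixing coefficients that depend on the transformation parameter. Everything here is an assumption made for illustration: the `configure` map, the choice of three base vectors, and the toy weight values do not come from the paper or its repository.

```python
import math


def configure(alpha):
    """Hypothetical map from a rotation angle to D = 3 mixing coefficients."""
    return [1.0, math.cos(alpha), math.sin(alpha)]


def compose_weights(base_weights, beta):
    """Return theta = sum_i beta_i * theta_i over flat weight vectors."""
    size = len(base_weights[0])
    return [
        sum(b * w[j] for b, w in zip(beta, base_weights))
        for j in range(size)
    ]


# Three toy base weight vectors spanning the (assumed) configuration subspace.
base_weights = [
    [0.5, 0.5, 0.0, 1.0],
    [1.0, 0.0, 0.0, 1.0],
    [0.0, -1.0, 1.0, 0.0],
]

theta_0 = compose_weights(base_weights, configure(0.0))
theta_90 = compose_weights(base_weights, configure(math.pi / 2))
```

Only the base vectors need to be stored; reconfiguring for a new transformation parameter is a single weighted sum, which is the storage and compute saving the summary alludes to.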

Title: Property-Guided Generative Modelling for Robust Model-Based Design with Imbalanced Data. (arXiv:2305.13650v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.13650
Code URL: null
Copy Paste: [[2305.13650] Property-Guided Generative Modelling for Robust Model-Based Design with Imbalanced Data](http://arxiv.org/abs/2305.13650) #robust
Summary:
The problem of designing protein sequences with desired properties is challenging, as it requires exploring a high-dimensional protein sequence space with extremely sparse meaningful regions. This has led to the development of model-based optimization (MBO) techniques that aid in the design, by using effective search models guided by the properties over the sequence space. However, the intrinsic imbalanced nature of experimentally derived datasets causes existing MBO approaches to struggle or outright fail. We propose a property-guided variational auto-encoder (PGVAE) whose latent space is explicitly structured by the property values such that samples are prioritized according to these properties. Through extensive benchmarking on real and semi-synthetic protein datasets, we demonstrate that MBO with PGVAE robustly finds sequences with improved properties despite significant dataset imbalances. We further showcase the generality of our approach to continuous design spaces, and its robustness to dataset imbalance in an application to physics-informed neural networks.

Title: Enhancing Accuracy and Robustness through Adversarial Training in Class Incremental Continual Learning. (arXiv:2305.13678v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.13678
Code URL: null
Copy Paste: [[2305.13678] Enhancing Accuracy and Robustness through Adversarial Training in Class Incremental Continual Learning](http://arxiv.org/abs/2305.13678) #robust
Summary:
In real life, adversarial attacks on deep learning models are a fatal security issue. However, the issue has rarely been discussed in the widely used class-incremental continual learning (CICL) setting. In this paper, we address the problems of applying adversarial training, a well-known defense against adversarial attacks, to CICL. A well-known problem of CICL is class imbalance, which biases a model toward the current task because only a few samples of previous tasks are available. Combined with adversarial training, this imbalance causes a second imbalance in attack trials across tasks. With clean data lacking for minority classes due to the class imbalance, and attack trials from majority classes increasing due to the secondary imbalance, adversarial training distorts optimal decision boundaries. The distortion eventually decreases both accuracy and robustness compared with adversarial training. To exclude these effects, we propose a straightforward but significantly effective method, External Adversarial Training (EAT), which can be applied to methods using experience replay. This method conducts adversarial training on an auxiliary external model for the current task data at each time step, and applies the generated adversarial examples to train the target model. We verify the effects on a toy problem and show significance on CICL benchmarks of image classification. We expect that the results will be used as the first baseline for robustness research of CICL.

Title: Mitigating Label Noise through Data Ambiguation. (arXiv:2305.13764v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.13764
Code URL: null
Copy Paste: [[2305.13764] Mitigating Label Noise through Data Ambiguation](http://arxiv.org/abs/2305.13764) #robust
Summary:
Label noise poses an important challenge in machine learning, especially in deep learning, in which large models with high expressive power dominate the field. Models of that kind are prone to memorizing incorrect labels, thereby harming generalization performance. Many methods have been proposed to address this problem, including robust loss functions and more complex label correction approaches. Robust loss functions are appealing due to their simplicity, but typically lack flexibility, while label correction usually adds substantial complexity to the training setup. In this paper, we suggest addressing the shortcomings of both methodologies by "ambiguating" the target information, adding additional, complementary candidate labels in case the learner is not sufficiently convinced of the observed training label. More precisely, we leverage the framework of so-called superset learning to construct set-valued targets based on a confidence threshold, which deliver imprecise yet more reliable beliefs about the ground-truth, effectively helping the learner to suppress the memorization effect. In an extensive empirical evaluation, our method demonstrates favorable learning behavior on synthetic and real-world noise, confirming the effectiveness in detecting and correcting erroneous training labels.
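
A minimal sketch of how a set-valued target might be built from a confidence threshold is given below. The rule used here (keep the observed label and add every class whose predicted probability reaches a threshold `beta`) and the loss (negative log of the probability mass on the candidate set) are simple illustrative choices, not necessarily the construction or loss used in the paper.

```python
import math


def ambiguate(probs, observed, beta=0.5):
    """Build a set-valued target from model predictions.

    Assumed rule: the observed label always stays in the set, and any class
    the model predicts with probability >= beta is added as a candidate.
    """
    candidates = {observed}
    candidates.update(k for k, p in enumerate(probs) if p >= beta)
    return candidates


def superset_loss(probs, candidates):
    """Negative log of the total probability assigned to the candidate set.

    Chosen for illustration: the loss is small as soon as the model places
    its mass on any candidate, not only on the observed label.
    """
    return -math.log(sum(probs[k] for k in candidates))


# The model disagrees with observed label 0 and is confident in class 1.
noisy_target = ambiguate([0.1, 0.7, 0.2], observed=0)
# The model agrees with the observed label, so the target stays a singleton.
clean_target = ambiguate([0.8, 0.1, 0.1], observed=0)
```

With a singleton target the loss reduces to ordinary cross-entropy, so the ambiguation only changes training on examples where the model confidently disagrees with the given label.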

Title: SNEkhorn: Dimension Reduction with Symmetric Entropic Affinities. (arXiv:2305.13797v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.13797
Code URL: null
Copy Paste: [[2305.13797] SNEkhorn: Dimension Reduction with Symmetric Entropic Affinities](http://arxiv.org/abs/2305.13797) #robust
Summary:
Many approaches in machine learning rely on a weighted graph to encode the similarities between samples in a dataset. Entropic affinities (EAs), which are notably used in the popular Dimensionality Reduction (DR) algorithm t-SNE, are particular instances of such graphs. To ensure robustness to heterogeneous sampling densities, EAs assign a kernel bandwidth parameter to every sample in such a way that the entropy of each row in the affinity matrix is kept constant at a specific value, whose exponential is known as perplexity. EAs are inherently asymmetric and row-wise stochastic, but they are used in DR approaches after undergoing heuristic symmetrization methods that violate both the row-wise constant entropy and stochasticity properties. In this work, we uncover a novel characterization of EA as an optimal transport problem, allowing a natural symmetrization that can be computed efficiently using dual ascent. The corresponding novel affinity matrix derives advantages from symmetric doubly stochastic normalization in terms of clustering performance, while also effectively controlling the entropy of each row, thus making it particularly robust to varying noise levels. Following this, we present a new DR algorithm, SNEkhorn, that leverages this new affinity matrix. We show its clear superiority to state-of-the-art approaches with several indicators on both synthetic and real-world datasets.
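
The standard (asymmetric) entropic affinity described in the summary can be sketched as a per-row search: for one sample, find the kernel precision at which the row's entropy equals the log of the target perplexity. This is the classical t-SNE-style construction, not the symmetric variant the paper proposes; the function names and the toy distances are illustrative.

```python
import math


def row_affinities(sq_dists, precision):
    """Row-normalized Gaussian affinities for one sample."""
    shift = min(sq_dists)  # subtract the minimum for numerical stability
    weights = [math.exp(-precision * (d - shift)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)


def entropic_affinity_row(sq_dists, perplexity, iters=100):
    """Bisect on the precision so that exp(entropy) matches the perplexity.

    Assumes 1 < perplexity < len(sq_dists) and a unique nearest neighbor,
    so the entropy decreases from log(n) toward 0 as the precision grows.
    """
    target = math.log(perplexity)
    low, high = 0.0, 1.0
    while entropy(row_affinities(sq_dists, high)) > target:
        high *= 2.0  # widen the bracket until the entropy is low enough
    for _ in range(iters):
        mid = (low + high) / 2
        if entropy(row_affinities(sq_dists, mid)) > target:
            low = mid
        else:
            high = mid
    return row_affinities(sq_dists, (low + high) / 2)


# Squared distances from one sample to its five neighbors (toy values).
row = entropic_affinity_row([1.0, 4.0, 9.0, 16.0, 25.0], perplexity=2.0)
```

Each row is solved independently, which is why the resulting matrix is row-stochastic but not symmetric, the limitation the summary points to.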

Title: On the Optimal Batch Size for Byzantine-Robust Distributed Learning. (arXiv:2305.13856v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.13856
Code URL: null
Copy Paste: [[2305.13856] On the Optimal Batch Size for Byzantine-Robust Distributed Learning](http://arxiv.org/abs/2305.13856) #robust
Summary:
Byzantine-robust distributed learning (BRDL), in which computing devices are likely to behave abnormally due to accidental failures or malicious attacks, has recently become a hot research topic. However, even in the independent and identically distributed (i.i.d.) case, existing BRDL methods will suffer from a significant drop in model accuracy due to the large variance of stochastic gradients. Increasing batch sizes is a simple yet effective way to reduce the variance. However, when the total number of gradient computations is fixed, a too-large batch size will lead to a too-small iteration number (update number), which may also degrade the model accuracy. In view of this challenge, we mainly study the optimal batch size when the total number of gradient computations is fixed in this work. In particular, we theoretically and empirically show that when the total number of gradient computations is fixed, the optimal batch size in BRDL increases with the fraction of Byzantine workers. Therefore, compared to the case without attacks, the batch size should be set larger when under Byzantine attacks. However, for existing BRDL methods, large batch sizes will lead to a drop in model accuracy, even if there is no Byzantine attack. To deal with this problem, we propose a novel BRDL method, called Byzantine-robust stochastic gradient descent with normalized momentum (ByzSGDnm), which can alleviate the drop in model accuracy in large-batch cases. Moreover, we theoretically prove the convergence of ByzSGDnm for general non-convex cases under Byzantine attacks. Empirical results show that ByzSGDnm has a comparable performance to existing BRDL methods under bit-flipping failure, but can outperform existing BRDL methods under deliberately crafted attacks.
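
The idea of combining robust aggregation with a normalized momentum step can be sketched as follows. The summary does not say which aggregator ByzSGDnm uses or where the momentum is kept, so this sketch assumes worker-side momentum and a coordinate-wise median; both choices, along with the function names and hyperparameters, are illustrative assumptions.

```python
import math
from statistics import median


def coordinate_median(vectors):
    """Robust aggregator (assumed here): median of each coordinate."""
    return [median(column) for column in zip(*vectors)]


def normalized_momentum_step(w, momenta, grads, lr=0.1, beta=0.9):
    """One server update with worker-side momentum and a normalized step."""
    new_momenta = [
        [beta * m + (1 - beta) * g for m, g in zip(m_vec, g_vec)]
        for m_vec, g_vec in zip(momenta, grads)
    ]
    agg = coordinate_median(new_momenta)
    norm = math.sqrt(sum(a * a for a in agg))
    if norm == 0.0:
        return list(w), new_momenta  # nothing to normalize, skip the update
    # The step length is always lr, whatever the scale of the aggregate.
    new_w = [wi - lr * a / norm for wi, a in zip(w, agg)]
    return new_w, new_momenta


# Toy objective 0.5 * ||w||^2, so an honest gradient equals w itself.
w = [1.0, -2.0]
honest = [list(w) for _ in range(3)]
byzantine = [[-1000.0, 1000.0]]  # one worker sends an arbitrary vector
momenta = [[0.0, 0.0] for _ in range(4)]
new_w, momenta = normalized_momentum_step(w, momenta, honest + byzantine)
```

In this toy run the outlier does not move the median, and normalization fixes the step length at `lr`, so the update still points toward the minimizer.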

Title: Deep GEM-Based Network for Weakly Supervised UWB Ranging Error Mitigation. (arXiv:2305.13904v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.13904
Code URL: null
Copy Paste: [[2305.13904] Deep GEM-Based Network for Weakly Supervised UWB Ranging Error Mitigation](http://arxiv.org/abs/2305.13904) #robust
Summary:
Ultra-wideband (UWB)-based techniques, while becoming mainstream approaches for high-accuracy positioning, tend to be challenged by ranging bias in harsh environments. The emerging learning-based methods for error mitigation have shown great performance improvement by exploiting high-level semantic features from raw data. However, these methods rely heavily on fully labeled data, leading to a high cost for data acquisition. We present a learning framework based on weak supervision for UWB ranging error mitigation. Specifically, we propose a deep learning method based on the generalized expectation-maximization (GEM) algorithm for robust UWB ranging error mitigation under weak supervision. This method integrates probabilistic modeling into the deep learning scheme, and adopts weakly supervised labels as prior information. Extensive experiments in various supervision scenarios illustrate the superiority of the proposed method.

biometric

steal

extraction

Title: Flare-Aware Cross-modal Enhancement Network for Multi-spectral Vehicle Re-identification. (arXiv:2305.13659v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.13659
Code URL: null
Copy Paste: [[2305.13659] Flare-Aware Cross-modal Enhancement Network for Multi-spectral Vehicle Re-identification](http://arxiv.org/abs/2305.13659) #extraction
Summary:
Multi-spectral vehicle re-identification aims to address the challenge of identifying vehicles in complex lighting conditions by incorporating complementary visible and infrared information. However, in harsh environments, the discriminative cues in RGB and NIR modalities are often lost due to strong flares from vehicle lamps or sunlight, and existing multi-modal fusion methods are limited in their ability to recover these important cues. To address this problem, we propose a Flare-Aware Cross-modal Enhancement Network that adaptively restores flare-corrupted RGB and NIR features with guidance from the flare-immunized thermal infrared spectrum. First, to reduce the influence of locally degraded appearance due to intense flare, we propose a Mutual Flare Mask Prediction module to jointly obtain flare-corrupted masks in RGB and NIR modalities in a self-supervised manner. Second, to use the flare-immunized TI information to enhance the masked RGB and NIR, we propose a Flare-Aware Cross-modal Enhancement module that adaptively guides feature extraction of masked RGB and NIR spectra with prior flare-immunized knowledge from the TI spectrum. Third, to extract common informative semantic information from RGB and NIR, we propose an Inter-modality Consistency loss that enforces semantic consistency between the two modalities. Finally, to evaluate the proposed FACENet in handling intense flare, we introduce a new multi-spectral vehicle re-ID dataset, called WMVEID863, with additional challenges such as motion blur, significant background changes, and particularly intense flare degradation. Comprehensive experiments on both the newly collected dataset and public benchmark multi-spectral vehicle re-ID datasets demonstrate the superior performance of the proposed FACENet compared to state-of-the-art methods, especially in handling strong flares. The code and dataset will be released soon.

Title: Full Resolution Repetition Counting. (arXiv:2305.13778v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.13778
Code URL: null
Copy Paste: [[2305.13778] Full Resolution Repetition Counting](http://arxiv.org/abs/2305.13778) #extraction
Summary:
Given an untrimmed video, repetitive action counting aims to estimate the number of repetitions of class-agnostic actions. To handle the varying lengths of videos and repetitive actions, as well as optimization challenges in end-to-end video model training, recent state-of-the-art methods commonly rely on down-sampling, which causes some repetitions to be missed. In this paper, we attempt to understand repetitive actions from a full temporal resolution view by combining offline feature extraction and temporal convolution networks. The former step enables us to train the repetition counting network without down-sampling while preserving all repetitions regardless of the video length and action frequency, and the latter network models all frames in a flexible and dynamically expanding temporal receptive field to retrieve all repetitions from a global perspective. We experimentally demonstrate that our method achieves better or comparable performance on three public datasets, i.e., TransRAC, UCFRep and QUVA. We expect this work will encourage our community to think about the importance of full temporal resolution.

Title: Generalizable Synthetic Image Detection via Language-guided Contrastive Learning. (arXiv:2305.13800v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.13800
Code URL: https://github.com/highwaywu/lasted
Copy Paste: [[2305.13800] Generalizable Synthetic Image Detection via Language-guided Contrastive Learning](http://arxiv.org/abs/2305.13800) #extraction
Summary:
The heightened realism of AI-generated images can be attributed to the rapid development of synthetic models, including generative adversarial networks (GANs) and diffusion models (DMs). The malevolent use of synthetic images, such as the dissemination of fake news or the creation of fake profiles, however, raises significant concerns regarding the authenticity of images. Though many forensic algorithms have been developed for detecting synthetic images, their performance, especially the generalization capability, is still far from being adequate to cope with the increasing number of synthetic models. In this work, we propose a simple yet very effective synthetic image detection method via a language-guided contrastive learning and a new formulation of the detection problem. We first augment the training images with carefully-designed textual labels, enabling us to use a joint image-text contrastive learning for the forensic feature extraction. In addition, we formulate the synthetic image detection as an identification problem, which is vastly different from the traditional classification-based approaches. It is shown that our proposed LanguAge-guided SynThEsis Detection (LASTED) model achieves much improved generalizability to unseen image generation models and delivers promising performance that far exceeds state-of-the-art competitors by +22.66% accuracy and +15.24% AUC. The code is available at https://github.com/HighwayWu/LASTED.

Title: Leveraging BEV Representation for 360-degree Visual Place Recognition. (arXiv:2305.13814v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.13814
Code URL: https://github.com/maverickpeter/vdisco
Copy Paste: [[2305.13814] Leveraging BEV Representation for 360-degree Visual Place Recognition](http://arxiv.org/abs/2305.13814) #extraction
Summary:
This paper investigates the advantages of using Bird's Eye View (BEV) representation in 360-degree visual place recognition (VPR). We propose a novel network architecture that utilizes the BEV representation in feature extraction, feature aggregation, and vision-LiDAR fusion, which bridges visual cues and spatial awareness. Our method extracts image features using standard convolutional networks and combines the features according to pre-defined 3D grid spatial points. To alleviate the mechanical and time misalignments between cameras, we further introduce deformable attention to learn the compensation. Upon the BEV feature representation, we then employ the polar transform and the Discrete Fourier transform for aggregation, which is shown to be rotation-invariant. In addition, the image and point cloud cues can be easily stated in the same coordinates, which benefits sensor fusion for place recognition. The proposed BEV-based method is evaluated in ablation and comparative studies on two datasets, including on-the-road and off-the-road scenarios. The experimental results verify the hypothesis that BEV can benefit VPR by its superior performance compared to baseline methods. To the best of our knowledge, this is the first trial of employing BEV representation in this task.
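
The rotation-invariance claim in the summary above follows from a standard property of the Discrete Fourier transform: after a polar transform, rotating the scene about the sensor becomes a circular shift along the angular axis, and a circular shift changes only the phase of the DFT, not its magnitude. The sketch below checks this on a toy 1-D ring of features. The function names and the sample values are hypothetical, and the paper's actual pipeline operates on 2-D BEV feature maps that are not reproduced here.

```python
import cmath

def dft_magnitude(signal):
    """Magnitude of the naive discrete Fourier transform of a real sequence."""
    n = len(signal)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal)))
            for k in range(n)]

def rotate(signal, shift):
    """Circular shift along the angular axis, i.e. a rotation in polar coordinates."""
    shift %= len(signal)
    return signal[-shift:] + signal[:-shift]

# Toy angular feature profile sampled at 8 evenly spaced bearings.
profile = [0.0, 1.0, 3.0, 2.0, 0.5, 0.0, 4.0, 1.5]
rotated = rotate(profile, 3)

original_descriptor = dft_magnitude(profile)
rotated_descriptor = dft_magnitude(rotated)
max_difference = max(abs(a - b)
                     for a, b in zip(original_descriptor, rotated_descriptor))
```

Here `max_difference` is zero up to floating-point error, so the magnitude spectrum can serve as a descriptor that does not depend on the heading from which a place is observed.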

Title: A Novel Dataset Towards Extracting Virus-Host Interactions. (arXiv:2305.13317v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13317
Code URL: null
Copy Paste: [[2305.13317] A Novel Dataset Towards Extracting Virus-Host Interactions](http://arxiv.org/abs/2305.13317) #extraction
Summary:
We describe a novel dataset for the automated recognition of named taxonomic and other entities relevant to the association of viruses with their hosts. We further describe some initial results using pre-trained models on the named-entity recognition (NER) task on this novel dataset. We propose that our dataset of manually annotated abstracts now offers a Gold Standard Corpus for training future NER models in the automated extraction of host-pathogen detection methods from scientific publications, and further explain how our work makes first steps towards predicting the important human health-related concept of viral spillover risk automatically from the scientific literature.

Title: BioDEX: Large-Scale Biomedical Adverse Drug Event Extraction for Real-World Pharmacovigilance. (arXiv:2305.13395v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13395
Code URL: https://github.com/kareldo/biodex
Copy Paste: [[2305.13395] BioDEX: Large-Scale Biomedical Adverse Drug Event Extraction for Real-World Pharmacovigilance](http://arxiv.org/abs/2305.13395) #extraction
Summary:
Timely and accurate extraction of Adverse Drug Events (ADE) from biomedical literature is paramount for public safety, but involves slow and costly manual labor. We set out to improve drug safety monitoring (pharmacovigilance, PV) through the use of Natural Language Processing (NLP). We introduce BioDEX, a large-scale resource for Biomedical adverse Drug Event Extraction, rooted in the historical output of drug safety reporting in the U.S. BioDEX consists of 65k abstracts and 19k full-text biomedical papers with 256k associated document-level safety reports created by medical experts. The core features of these reports include the reported weight, age, and biological sex of a patient, a set of drugs taken by the patient, the drug dosages, the reactions experienced, and whether the reaction was life threatening. In this work, we consider the task of predicting the core information of the report given its originating paper. We estimate human performance to be 72.0% F1, whereas our best model achieves 62.3% F1, indicating significant headroom on this task. We also begin to explore ways in which these models could help professional PV reviewers. Our code and data are available: https://github.com/KarelDO/BioDEX.

Title: MAILEX: Email Event and Argument Extraction. (arXiv:2305.13469v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13469
Code URL: null
Copy Paste: [[2305.13469] MAILEX: Email Event and Argument Extraction](http://arxiv.org/abs/2305.13469) #extraction
Summary:
In this work, we present the first dataset, MAILEX, for performing event extraction from conversational email threads. To this end, we first propose a new taxonomy covering 10 event types and 76 arguments in the email domain. Our final dataset includes $\sim$4K emails annotated with $\sim$9K event instances. To understand the task challenges, we conducted a series of experiments comparing two common lines of approaches for event extraction, i.e., sequence labeling and generative end-to-end extraction (including few-shot GPT-3.5). Our results show that the task of email event extraction is far from solved, owing to challenges such as extracting non-continuous, shared trigger spans, extracting non-named entity arguments, and modeling the email conversational history. Our work thus calls for further investigation of this domain-specific event extraction task. The source code and dataset can be obtained from https://github.com/salokr/Email-Event-Extraction.

Title: Open-world Semi-supervised Generalized Relation Discovery Aligned in a Real-world Setting. (arXiv:2305.13533v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13533
Code URL: null
Copy Paste: [[2305.13533] Open-world Semi-supervised Generalized Relation Discovery Aligned in a Real-world Setting](http://arxiv.org/abs/2305.13533) #extraction
Summary:
Open-world Relation Extraction (OpenRE) has recently garnered significant attention. However, existing approaches tend to oversimplify the problem by assuming that all unlabeled texts belong to novel classes, thereby limiting the practicality of these methods. We argue that the OpenRE setting should be more aligned with the characteristics of real-world data. Specifically, we propose two key improvements: (a) unlabeled data should encompass known and novel classes, including hard-negative instances; and (b) the set of novel classes should represent long-tail relation types. Furthermore, we observe that popular relations such as titles and locations can often be implicitly inferred through specific patterns, while long-tail relations tend to be explicitly expressed in sentences. Motivated by these insights, we present a novel method called KNoRD (Known and Novel Relation Discovery), which effectively classifies explicitly and implicitly expressed relations from known and novel classes within unlabeled data. Experimental evaluations on several Open-world RE benchmarks demonstrate that KNoRD consistently outperforms other existing methods, achieving significant performance gains.

Title: EntRED: Benchmarking Relation Extraction with Fewer Shortcuts. (arXiv:2305.13551v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13551
Code URL: https://github.com/wangywust/entred
Copy Paste: [[2305.13551] EntRED: Benchmarking Relation Extraction with Fewer Shortcuts](http://arxiv.org/abs/2305.13551) #extraction
Summary:
Entity names play an effective role in relation extraction (RE) and often influence model performance. As a result, the entity names in the benchmarks' test sets significantly influence the evaluation of RE models. In this work, we find that the standard RE benchmark datasets contain a large portion of incorrect entity annotations, have low entity name diversity, and are prone to shortcuts from entity names to ground-truth relations. These issues make the standard benchmarks far from reflecting real-world scenarios. Hence, in this work, we present EntRED, a challenging RE benchmark with reduced shortcuts and higher diversity of entities. To build EntRED, we propose an end-to-end entity replacement pipeline based on causal inference (CI): ERIC. ERIC performs type-constrained replacements on entities to reduce the shortcuts from entity bias to ground-truth relations. ERIC applies CI in two aspects: 1) targeting the instances that need entity replacements, and 2) determining the candidate entities for replacements. We apply ERIC to TACRED to produce EntRED. EntRED evaluates whether an RE model can correctly extract relations from the text instead of relying on entity bias. Empirical results reveal that even strong RE models show a significant performance drop on EntRED, because they memorize entity name patterns instead of reasoning from the textual context. We release ERIC's source code and the EntRED benchmark at https://github.com/wangywUST/ENTRED.

Title: SPEECH: Structured Prediction with Energy-Based Event-Centric Hyperspheres. (arXiv:2305.13617v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13617
Code URL: https://github.com/zjunlp/speech
Copy Paste: [[2305.13617] SPEECH: Structured Prediction with Energy-Based Event-Centric Hyperspheres](http://arxiv.org/abs/2305.13617) #extraction
Summary:
Event-centric structured prediction involves predicting structured outputs of events. In most NLP cases, event structures are complex with manifold dependency, and it is challenging to effectively represent these complicated structured events. To address these issues, we propose Structured Prediction with Energy-based Event-Centric Hyperspheres (SPEECH). SPEECH models complex dependency among event structured components with energy-based modeling, and represents event classes with simple but effective hyperspheres. Experiments on two unified-annotated event datasets indicate that SPEECH is predominant in event detection and event-relation extraction tasks.

Title: mPMR: A Multilingual Pre-trained Machine Reader at Scale. (arXiv:2305.13645v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13645
Code URL: https://github.com/damo-nlp-sg/pmr
Copy Paste: [[2305.13645] mPMR: A Multilingual Pre-trained Machine Reader at Scale](http://arxiv.org/abs/2305.13645) #extraction
Summary:
We present multilingual Pre-trained Machine Reader (mPMR), a novel method for multilingual machine reading comprehension (MRC)-style pre-training. mPMR aims to guide multilingual pre-trained language models (mPLMs) to perform natural language understanding (NLU) including both sequence classification and span extraction in multiple languages. To achieve cross-lingual generalization when only source-language fine-tuning data is available, existing mPLMs solely transfer NLU capability from a source language to target languages. In contrast, mPMR allows the direct inheritance of multilingual NLU capability from the MRC-style pre-training to downstream tasks. Therefore, mPMR acquires better NLU capability for target languages. mPMR also provides a unified solver for tackling cross-lingual span extraction and sequence classification, thereby enabling the extraction of rationales to explain the sentence-pair classification process.

Title: Towards Zero-shot Relation Extraction in Web Mining: A Multimodal Approach with Relative XML Path. (arXiv:2305.13805v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13805
Code URL: null
Copy Paste: [[2305.13805] Towards Zero-shot Relation Extraction in Web Mining: A Multimodal Approach with Relative XML Path](http://arxiv.org/abs/2305.13805) #extraction
Summary:
The rapid growth of web pages and the increasing complexity of their structure pose a challenge for web mining models. Web mining models are required to understand semi-structured web pages, particularly when little is known about the subject or template of a new page. Current methods adapt language models to web mining by embedding the XML source code into the transformer or encoding the rendered layout with graph neural networks. However, these approaches do not take into account the relationships between text nodes within and across pages. In this paper, we propose a new approach, ReXMiner, for zero-shot relation extraction in web mining. ReXMiner encodes the shortest relative paths in the Document Object Model (DOM) tree, which are a more accurate and efficient signal for key-value pair extraction within a web page. It also incorporates the popularity of each text node by counting the occurrences of the same text node across different web pages. We use contrastive learning to address the issue of sparsity in relation extraction. Extensive experiments on public benchmarks show that our method, ReXMiner, outperforms the state-of-the-art baselines in the task of zero-shot relation extraction in web mining.
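
To make the notion of a shortest relative path in the DOM tree concrete, the sketch below derives one from two absolute XPath strings by dropping their common ancestor prefix. Representing nodes as XPath strings, the function name, and the (up, down) return format are assumptions for illustration only; the abstract does not describe how ReXMiner actually encodes these paths.

```python
def relative_dom_path(xpath_a, xpath_b):
    """Return the steps up from node A to the lowest common ancestor,
    then the steps down from that ancestor to node B."""
    a = xpath_a.strip("/").split("/")
    b = xpath_b.strip("/").split("/")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    up = list(reversed(a[common:]))   # climb from node A to the shared ancestor
    down = b[common:]                 # descend from the shared ancestor to node B
    return up, down

# A key cell and its value cell in the same table row (hypothetical page).
key_node = "/html/body/div[1]/table/tr[2]/td[1]"
value_node = "/html/body/div[1]/table/tr[2]/td[2]"
up, down = relative_dom_path(key_node, value_node)

# An unrelated footer node is much farther away in the tree.
far_up, far_down = relative_dom_path(key_node, "/html/body/footer/span")
```

In this toy page the key and value are one step up and one step down apart, while the footer text is four steps up and two steps down, which shows why a short relative path can hint that two text nodes form a key-value pair.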

Title: Detecting automatically the layout of clinical documents to enhance the performances of downstream natural language processing. (arXiv:2305.13817v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13817
Code URL: null
Copy Paste: [[2305.13817] Detecting automatically the layout of clinical documents to enhance the performances of downstream natural language processing](http://arxiv.org/abs/2305.13817) #extraction
Summary:
Objective: Develop and validate an algorithm for analyzing the layout of PDF clinical documents to improve the performance of downstream natural language processing tasks. Materials and Methods: We designed an algorithm to process clinical PDF documents and extract only clinically relevant text. The algorithm consists of several steps: initial text extraction using a PDF parser, followed by classification into categories such as body text, left notes, and footers using a Transformer deep neural network architecture, and finally an aggregation step to compile the lines of a given label in the text. We evaluated the technical performance of the body text extraction algorithm by applying it to a random sample of annotated documents. Medical performance was evaluated by examining the extraction of medical concepts of interest from the text in their respective sections. Finally, we tested an end-to-end system on a medical use case of automatic detection of acute infection described in the hospital report. Results: Our algorithm achieved per-line precision, recall, and F1 score of 98.4, 97.0, and 97.7, respectively, for body line extraction. The per-document precision, recall, and F1 score for the acute infection detection algorithm were 82.54 (95% CI 72.86-91.60), 85.24 (95% CI 76.61-93.70), and 83.87 (95% CI 76.92-90.08), respectively, when using the output of the advanced body extraction algorithm. Conclusion: We have developed and validated a system for extracting body text from clinical documents in PDF format by identifying their layout. We demonstrated that this preprocessing yields better performance on a common downstream task, i.e., the extraction of medical concepts in their respective sections, thus showing the value of this method on a clinical use case.

Title: Global Structure Knowledge-Guided Relation Extraction Method for Visually-Rich Document. (arXiv:2305.13850v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13850
Code URL: null
Copy Paste: [[2305.13850] Global Structure Knowledge-Guided Relation Extraction Method for Visually-Rich Document](http://arxiv.org/abs/2305.13850) #extraction
Summary:
Visual relation extraction (VRE) aims to extract relations between entities from visually-rich documents. Existing methods usually predict relations for each entity pair independently based on entity features but ignore global structure information, i.e., dependencies between entity pairs. The absence of global structure information may make the model struggle to learn long-range relations and prone to predicting conflicting results. To alleviate these limitations, we propose a GlObal Structure knowledge-guided relation Extraction (GOSE) framework, which captures dependencies between entity pairs in an iterative manner. Given a scanned image of the document, GOSE first generates preliminary relation predictions on entity pairs. Second, it mines global structure knowledge based on the prediction results of the previous iteration and further incorporates this knowledge into entity representations. This "generate-capture-incorporate" schema is performed multiple times so that entity representations and global structure knowledge can mutually reinforce each other. Extensive experiments show that GOSE not only outperforms previous methods in the standard fine-tuning setting but also shows promising superiority in cross-lingual learning; it also yields stronger data-efficient performance in the low-resource setting.

Title: Flexible Grammar-Based Constrained Decoding for Language Models. (arXiv:2305.13971v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13971
Code URL: null
Copy Paste: [[2305.13971] Flexible Grammar-Based Constrained Decoding for Language Models](http://arxiv.org/abs/2305.13971) #extraction
Summary:
LLMs have shown impressive few-shot performance across many tasks. However, they still struggle when it comes to generating complex output structures, such as those required for Information Extraction. This limitation stems from the fact that LLMs, without finetuning, tend to generate free text rather than precise structures that follow a specific grammar. In this work, we propose to enrich the decoding step with formal grammar constraints. During beam search, only valid token continuations compliant with the grammar production rules are considered. This enforces the generation of valid sequences exclusively. Our framework is highly general and flexible, allowing any Context-Free Grammar (CFG) to be integrated into our custom constrained beam search implementation. We demonstrate that the outputs of many NLP tasks can be represented as formal languages, making them suitable for direct use in our framework. For tasks where the output space is dependent on the input, we propose input-dependent grammars to constrain the generation. We conducted experiments with two challenging tasks involving large alphabets in their grammar (Wikidata entities and relations): information extraction and entity disambiguation. Our results with LLaMA models clearly indicate that grammar-constrained decoding outperforms few-shot prompting without constraints, and even competes with task-specific finetuned models. These findings suggest that integrating grammar-based constraints during decoding holds great promise in making LLMs reliably produce structured outputs, especially in settings where training data is scarce and finetuning is expensive.
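
The core mechanism described above, restricting each decoding step to continuations the grammar permits, can be sketched on a toy grammar of balanced parentheses. This is a simplified illustration under stated assumptions: it uses greedy decoding rather than the beam search the summary mentions, the grammar is hard-coded as a depth counter rather than supplied as a general CFG, and `toy_scores` is a hypothetical stand-in for a language model's token scores.

```python
def allowed_tokens(prefix, max_len=6):
    """Tokens that keep the prefix extendable to a balanced-parenthesis string."""
    depth = prefix.count("(") - prefix.count(")")
    remaining = max_len - len(prefix)
    allowed = []
    if remaining - 1 >= depth + 1:   # room left to close a newly opened bracket
        allowed.append("(")
    if depth > 0:                    # something is open, so it may be closed
        allowed.append(")")
    if depth == 0:                   # everything is closed, so stopping is valid
        allowed.append("<eos>")
    return allowed

def constrained_greedy_decode(score_fn, max_len=6):
    """Pick the best-scoring token at each step among grammar-valid ones only."""
    prefix = []
    while True:
        candidates = allowed_tokens(prefix, max_len)
        token = max(candidates, key=lambda t: score_fn(prefix, t))
        if token == "<eos>":
            return prefix
        prefix.append(token)

def toy_scores(prefix, token):
    """Hypothetical model that always prefers ')' regardless of context."""
    return {"(": 0.3, ")": 0.6, "<eos>": 0.1}[token]

output = "".join(constrained_greedy_decode(toy_scores))
```

Left unconstrained, this toy model would emit `)` first and produce an invalid string. With the mask applied, every emitted sequence is balanced by construction, which mirrors the guarantee of valid outputs that the summary describes.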

Title: An Autoencoder-based Snow Drought Index. (arXiv:2305.13646v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.13646
Code URL: null
Copy Paste: [[2305.13646] An Autoencoder-based Snow Drought Index](http://arxiv.org/abs/2305.13646) #extraction
Summary:
In several regions across the globe, snow has a significant impact on hydrology. The amounts of water that infiltrate the ground and flow as runoff are driven by the melting of snow. Therefore, it is crucial to study the magnitude and effect of snowmelt. Snow droughts, resulting from reduced snow storage, can drastically impact the water supplies in basins where snow predominates, such as in the western United States. Hence, it is important to detect the time and severity of snow droughts efficiently. We propose Snow Drought Response Index or SnoDRI, a novel indicator that could be used to identify and quantify snow drought occurrences. Our index is calculated using cutting-edge ML algorithms from various snow-related variables. The self-supervised learning of an autoencoder is combined with mutual information in the model. In this study, we use random forests for feature extraction for SnoDRI and assess the importance of each variable. We use reanalysis data (NLDAS-2) from 1981 to 2021 for the Pacific United States to study the efficacy of the new snow drought index. We evaluate the index by confirming the coincidence of its interpretation and the actual snow drought incidents.

membership infer

federate

Title: Asynchronous Multi-Model Federated Learning over Wireless Networks: Theory, Modeling, and Optimization. (arXiv:2305.13503v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.13503
Code URL: null
Copy Paste: [[2305.13503] Asynchronous Multi-Model Federated Learning over Wireless Networks: Theory, Modeling, and Optimization](http://arxiv.org/abs/2305.13503) #federate
Summary:
Federated learning (FL) has emerged as a key technique for distributed machine learning (ML). Most literature on FL has focused on systems with (i) ML model training for a single task/model and (ii) a synchronous setting for uplink/downlink transfer of model parameters, assumptions that are often unrealistic. To address this, we develop MA-FL, which considers FL with multiple downstream tasks to be trained over an asynchronous model transmission architecture. We first characterize the convergence of ML model training under MA-FL by introducing a family of scheduling tensors to capture the scheduling of devices. Our convergence analysis sheds light on the impact of resource allocation (e.g., the mini-batch size and number of gradient descent iterations), device scheduling, and individual model states (i.e., warmed vs. cold initialization) on the performance of ML models. We then formulate a non-convex mixed integer optimization problem for jointly configuring the resource allocation and device scheduling to strike an efficient trade-off between energy consumption and ML performance, which is solved via successive convex approximations. Through numerical simulations, we reveal the advantages of MA-FL in terms of model performance and network resource savings.

Title: Federated Variational Inference: Towards Improved Personalization and Generalization. (arXiv:2305.13672v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.13672
Code URL: null
Copy Paste: [[2305.13672] Federated Variational Inference: Towards Improved Personalization and Generalization](http://arxiv.org/abs/2305.13672) #federate
Summary:
Conventional federated learning algorithms train a single global model by leveraging all participating clients' data. However, due to heterogeneity in client generative distributions and predictive models, these approaches may not appropriately approximate the predictive process, converge to an optimal state, or generalize to new clients. We study personalization and generalization in stateless cross-device federated learning setups assuming heterogeneity in client data distributions and predictive models. We first propose a hierarchical generative model and formalize it using Bayesian Inference. We then approximate this process using Variational Inference to train our model efficiently. We call this algorithm Federated Variational Inference (FedVI). We use PAC-Bayes analysis to provide generalization bounds for FedVI. We evaluate our model on FEMNIST and CIFAR-100 image classification and show that FedVI beats the state-of-the-art on both tasks.

Title: Fair Differentially Private Federated Learning Framework. (arXiv:2305.13878v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.13878
Code URL: null
Copy Paste: [[2305.13878] Fair Differentially Private Federated Learning Framework](http://arxiv.org/abs/2305.13878) #federate
Summary:
Federated learning (FL) is a distributed machine learning strategy that enables participants to collaborate and train a shared model without sharing their individual datasets. Privacy and fairness are crucial considerations in FL. While FL promotes privacy by minimizing the amount of user data stored on central servers, it still poses privacy risks that need to be addressed. Industry standards such as differential privacy, secure multi-party computation, homomorphic encryption, and secure aggregation protocols are followed to ensure privacy in FL. Fairness is also a critical issue in FL, as models can inherit biases present in local datasets, leading to unfair predictions. Balancing privacy and fairness in FL is a challenge, as privacy requires protecting user data while fairness requires representative training data. This paper presents a "Fair Differentially Private Federated Learning Framework" that addresses the challenges of generating a fair global model without validation data and creating a globally private differential model. The framework employs clipping techniques for biased model updates and Gaussian mechanisms for differential privacy. The paper also reviews related works on privacy and fairness in FL, highlighting recent advancements and approaches to mitigate bias and ensure privacy. Achieving privacy and fairness in FL requires careful consideration of specific contexts and requirements, taking into account the latest developments in industry standards and techniques.
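
The summary above names two building blocks, clipping of model updates and the Gaussian mechanism. A minimal sketch of how the two are commonly combined on the server is given below: each client update is clipped to a maximum L2 norm, the clipped updates are summed, and Gaussian noise scaled to the clipping norm is added before averaging. The function names, default parameters, and this particular combination are assumptions rather than the paper's framework, and the calibration of the noise to a concrete privacy budget is omitted.

```python
import math
import random

def clip_update(update, clip_norm):
    """Scale an update down so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip_norm / norm) if norm > 0.0 else 1.0
    return [x * scale for x in update]

def private_average(updates, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """Clip each client update, sum, add Gaussian noise, then average."""
    rng = rng or random.Random(0)
    clipped = [clip_update(u, clip_norm) for u in updates]
    sigma = noise_multiplier * clip_norm   # noise scale tied to the clip norm
    summed = [sum(col) for col in zip(*clipped)]
    return [(s + rng.gauss(0.0, sigma)) / len(clipped) for s in summed]

# Three hypothetical client updates; the first is far larger than the others.
client_updates = [[3.0, 4.0], [0.3, 0.4], [0.0, 0.5]]
clipped_first = clip_update(client_updates[0], clip_norm=1.0)
noiseless = private_average(client_updates, noise_multiplier=0.0)
noisy = private_average(client_updates, noise_multiplier=1.0)
```

Clipping bounds how much any single client can move the average, which is what lets the added noise mask an individual contribution; it also limits the pull of an outlying update such as the first client's.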

fair

Title: Distribution-aware Fairness Test Generation. (arXiv:2305.13935v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.13935
Code URL: null
Copy Paste: [[2305.13935] Distribution-aware Fairness Test Generation](http://arxiv.org/abs/2305.13935) #fair
Summary:
This work addresses how to validate group fairness in image recognition software. We propose a distribution-aware fairness testing approach (called DistroFair) that systematically exposes class-level fairness violations in image classifiers via a synergistic combination of out-of-distribution (OOD) testing and semantic-preserving image mutation. DistroFair automatically learns the distribution (e.g., number/orientation) of objects in a set of images. Then it systematically mutates objects in the images to become OOD using three semantic-preserving image mutations -- object deletion, object insertion and object rotation. We evaluate DistroFair using two well-known datasets (CityScapes and MS-COCO) and three major commercial image recognition software systems (namely, Amazon Rekognition, Google Cloud Vision and Azure Computer Vision). Results show that about 21% of images generated by DistroFair reveal class-level fairness violations using either ground truth or metamorphic oracles. DistroFair is up to 2.3x more effective than two main baselines, i.e., (a) an approach which focuses on generating images only within the distribution (ID) and (b) fairness analysis using only the original image dataset. We further observed that DistroFair is efficient: it generates 460 images per hour, on average. Finally, we evaluate the semantic validity of our approach via a user study with 81 participants, using 30 real images and 30 corresponding mutated images generated by DistroFair. We found that images generated by DistroFair are 80% as realistic as real-world images.

Title: On the Limitations of Simulating Active Learning. (arXiv:2305.13342v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.13342
Code URL: null
Copy Paste: [[2305.13342] On the Limitations of Simulating Active Learning](http://arxiv.org/abs/2305.13342) #fair
Summary:
Active learning (AL) is a human-and-model-in-the-loop paradigm that iteratively selects informative unlabeled data for human annotation, aiming to improve over random sampling. However, performing AL experiments with human annotations on-the-fly is a laborious and expensive process, thus unrealistic for academic research. An easy fix to this impediment is to simulate AL, by treating an already labeled and publicly available dataset as the pool of unlabeled data. In this position paper, we first survey recent literature and highlight the challenges across all different steps within the AL loop. We further unveil neglected caveats in the experimental setup that can significantly affect the quality of AL research. We continue with an exploration of how the simulation setting can govern empirical findings, arguing that it might be one of the answers behind the ever posed question "why do active learning algorithms sometimes fail to outperform random sampling?". We argue that evaluating AL algorithms on available labeled datasets might provide a lower bound as to their effectiveness in real data. We believe it is essential to collectively shape the best practices for AL research, particularly as engineering advancements in LLMs push the research focus towards data-driven approaches (e.g., data efficiency, alignment, fairness). In light of this, we have developed guidelines for future work. Our aim is to draw attention to these limitations within the community, in the hope of finding ways to address them.

Title: Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models. (arXiv:2305.13707v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13707
Code URL: null
Copy Paste: [[2305.13707] Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models](http://arxiv.org/abs/2305.13707) #fair
Summary:
Language models have graduated from being research prototypes to commercialized products offered as web APIs, and recent works have highlighted the multilingual capabilities of these products. The API vendors charge their users based on usage, more specifically on the number of "tokens" processed or generated by the underlying language models. What constitutes a token, however, is training data and model dependent with a large variance in the number of tokens required to convey the same information in different languages. In this work, we analyze the effect of this non-uniformity on the fairness of an API's pricing policy across languages. We conduct a systematic analysis of the cost and utility of OpenAI's language model API on multilingual benchmarks in 22 typologically diverse languages. We show evidence that speakers of a large number of the supported languages are overcharged while obtaining poorer results. These speakers tend to also come from regions where the APIs are less affordable to begin with. Through these analyses, we aim to increase transparency around language model APIs' pricing policies and encourage the vendors to make them more equitable.
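
The pricing argument above comes down to simple arithmetic: under per-token billing, a language that needs more tokens for the same content pays proportionally more. The sketch below illustrates this with made-up numbers. The token counts and the price are hypothetical placeholders, not figures from the paper or from any vendor.

```python
def api_cost(num_tokens, price_per_1k_tokens):
    """Cost of a request under per-token billing."""
    return num_tokens / 1000 * price_per_1k_tokens

def cost_premium(tokens_in_language, tokens_in_reference):
    """How many times more a language pays than a reference language
    for text that conveys the same information."""
    return tokens_in_language / tokens_in_reference

# Hypothetical token counts for the same sentence in two languages.
reference_tokens = 100
other_tokens = 300
# Hypothetical price per 1,000 tokens.
price = 0.002

premium = cost_premium(other_tokens, reference_tokens)
extra_cost = api_cost(other_tokens, price) - api_cost(reference_tokens, price)
```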

Title: Reducing Sensitivity on Speaker Names for Text Generation from Dialogues. (arXiv:2305.13833v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13833
Code URL: null
Copy Paste: [[2305.13833] Reducing Sensitivity on Speaker Names for Text Generation from Dialogues](http://arxiv.org/abs/2305.13833) #fair
Summary:
Changing speaker names consistently throughout a dialogue should not affect its meaning and corresponding outputs for text generation from dialogues. However, pre-trained language models, serving as the backbone for dialogue-processing tasks, have been shown to be sensitive to nuances. This may result in unfairness in real-world applications. No comprehensive analysis of this problem has been done in the past. In this work, we propose to quantitatively measure a model's sensitivity on speaker names, and comprehensively evaluate a number of known methods for reducing speaker name sensitivity, including a novel approach of our own. Extensive experiments on multiple datasets provide a benchmark for this problem and show the favorable performance of our approach in sensitivity reduction and quality of generation.
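
One way to probe the invariance described above is to rename speakers consistently, run the model again, map the names back, and check whether the output changed. The sketch below does this with an exact-match criterion. The metric, the function names, and the `model` callable (a list of dialogue turns in, a string out) are assumptions for illustration and are not the measure defined in the paper. Mappings are assumed to be non-empty.

```python
import re

def rename_speakers(turns, mapping):
    """Replace whole-word speaker names in every turn in a single pass,
    so that swaps such as Amy -> Bob and Bob -> Amy work correctly."""
    names = "|".join(re.escape(name) for name in mapping)
    pattern = re.compile(r"\b(" + names + r")\b")
    return [pattern.sub(lambda m: mapping[m.group(1)], turn) for turn in turns]

def name_sensitivity(model, turns, mappings):
    """Fraction of renamings that change the output once names are mapped back."""
    baseline = model(turns)
    changed = 0
    for mapping in mappings:
        inverse = {new: old for old, new in mapping.items()}
        output = model(rename_speakers(turns, mapping))
        restored = rename_speakers([output], inverse)[0]
        if restored != baseline:
            changed += 1
    return changed / len(mappings)

dialogue = ["Amy: hi Bob", "Bob: hi Amy"]
swaps = [{"Amy": "Christina", "Bob": "Dan"}, {"Amy": "Bob", "Bob": "Amy"}]
```

A score of 0 means the toy criterion found no renaming that altered the output. Softer comparisons, such as a similarity score instead of exact match, would fit the same structure.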

Title: A Trip Towards Fairness: Bias and De-Biasing in Large Language Models. (arXiv:2305.13862v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13862
Code URL: null
Copy Paste: [[2305.13862] A Trip Towards Fairness: Bias and De-Biasing in Large Language Models](http://arxiv.org/abs/2305.13862) #fair
Summary:
A surge in the popularity of transformer-based Language Models (such as GPT (Brown et al., 2020) and PaLM (Chowdhery et al., 2022)) has opened the doors to new Machine Learning applications, particularly in Natural Language Processing, where pre-training on large text corpora is essential to achieving remarkable results in downstream tasks. However, these Language Models seem to have inherent biases toward certain demographics reflected in their training data. While research has attempted to mitigate this problem, existing methods either fail to remove bias altogether, degrade performance, or are expensive. This paper examines the bias produced by promising Language Models when varying parameters and pre-training data. Finally, we propose a de-biasing technique that produces robust de-biased models that maintain performance on downstream tasks.

Title: Fair Oversampling Technique using Heterogeneous Clusters. (arXiv:2305.13875v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.13875
Code URL: null
Copy Paste: [[2305.13875] Fair Oversampling Technique using Heterogeneous Clusters](http://arxiv.org/abs/2305.13875) #fair
Summary:
Class imbalance and group (e.g., race, gender, and age) imbalance are acknowledged as two reasons in data that hinder the trade-off between fairness and utility of machine learning classifiers. Existing techniques have jointly addressed issues regarding class imbalance and group imbalance by proposing fair over-sampling techniques. Unlike the common oversampling techniques, which only address class imbalance, fair oversampling techniques significantly improve the abovementioned trade-off, as they can also address group imbalance. However, if the size of the original clusters is too small, these techniques may cause classifier overfitting. To address this problem, we herein develop a fair oversampling technique using data from heterogeneous clusters. The proposed technique generates synthetic data that have class-mix features or group-mix features to make classifiers robust to overfitting. Moreover, we develop an interpolation method that can enhance the validity of generated synthetic data by considering the original cluster distribution and data noise. Finally, we conduct experiments on five realistic datasets and three classifiers, and the experimental results demonstrate the effectiveness of the proposed technique in terms of fairness and utility.
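
The idea of generating synthetic points with class-mix or group-mix features can be illustrated with a SMOTE-style linear interpolation between a sample from a small cluster and a sample from a different cluster. This is a simplified stand-in, not the paper's interpolation method, which also accounts for cluster distribution and noise. The `max_lam` cap that keeps synthetic points closer to the source cluster is an assumption.

```python
import random

def interpolate(a, b, lam):
    """Point on the segment from a to b; lam=0 gives a, lam=1 gives b."""
    return [x + lam * (y - x) for x, y in zip(a, b)]

def oversample(source_cluster, other_cluster, n_new, rng, max_lam=0.5):
    """Create n_new synthetic samples by mixing a source-cluster sample
    with a sample drawn from a heterogeneous (different) cluster."""
    samples = []
    for _ in range(n_new):
        a = rng.choice(source_cluster)
        b = rng.choice(other_cluster)
        lam = rng.uniform(0.0, max_lam)
        samples.append(interpolate(a, b, lam))
    return samples

rng = random.Random(0)
synthetic = oversample([[0.0, 0.0]], [[2.0, 2.0]], 5, rng)
```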

Title: On the relevance of APIs facing fairwashed audits. (arXiv:2305.13883v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.13883
Code URL: null
Copy Paste: [[2305.13883] On the relevance of APIs facing fairwashed audits](http://arxiv.org/abs/2305.13883) #fair
Summary:
Recent legislation required AI platforms to provide APIs for regulators to assess their compliance with the law. Research has nevertheless shown that platforms can manipulate their API answers through fairwashing. Facing this threat for reliable auditing, this paper studies the benefits of the joint use of platform scraping and of APIs. In this setup, we elaborate on the use of scraping to detect manipulated answers: since fairwashing only manipulates API answers, exploiting scraps may reveal a manipulation. To abstract the wide range of specific API-scrap situations, we introduce a notion of proxy that captures the consistency an auditor might expect between both data sources. If the regulator has a good proxy of the consistency, then she can easily detect manipulation and even bypass the API to conduct her audit. On the other hand, without a good proxy, relying on the API is necessary, and the auditor cannot defend against fairwashing.

We then simulate practical scenarios in which the auditor may mostly rely on the API to conveniently conduct the audit task, while maintaining her chances to detect a potential manipulation. To highlight the tension between the audit task and the API fairwashing detection task, we identify Pareto-optimal strategies in a practical audit scenario.

We believe this research sets the stage for reliable audits in practical and manipulation-prone setups.
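
The scraping-based check described above can be sketched as a comparison between API answers and scraped observations for the same queries. The disagreement rate and the `tolerance` threshold below are hypothetical stand-ins for the paper's notion of a proxy, whose formal definition is not given in the abstract.

```python
def disagreement_rate(api_answers, scraped_answers):
    """Share of queries seen in both sources whose answers differ."""
    shared = api_answers.keys() & scraped_answers.keys()
    if not shared:
        raise ValueError("no shared queries to compare")
    differing = sum(1 for q in shared if api_answers[q] != scraped_answers[q])
    return differing / len(shared)

def looks_manipulated(api_answers, scraped_answers, tolerance):
    """Flag the API when it disagrees with scraped data more often than
    the auditor expects from ordinary noise between the two sources."""
    return disagreement_rate(api_answers, scraped_answers) > tolerance

api = {"q1": "approve", "q2": "approve", "q3": "deny", "q4": "approve"}
scraped = {"q1": "approve", "q2": "deny", "q3": "deny", "q5": "deny"}
```

The tighter the tolerance the auditor can justify, the easier it is to detect manipulation, which mirrors the abstract's point that a good proxy makes detection easy.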

interpretability

Title: Regularization Through Simultaneous Learning: A Case Study for Hop Classification. (arXiv:2305.13447v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.13447
Code URL: null
Copy Paste: [[2305.13447] Regularization Through Simultaneous Learning: A Case Study for Hop Classification](http://arxiv.org/abs/2305.13447) #interpretability
Summary:
Overfitting remains a prevalent challenge in deep neural networks, leading to suboptimal real-world performance. Employing regularization techniques is a common strategy to counter this challenge, improving model generalization. This paper proposes Simultaneous Learning, a novel regularization approach drawing on Transfer Learning and Multi-task Learning principles, applied specifically to the classification of hop varieties - an integral component of beer production. Our approach harnesses the power of auxiliary datasets in synergy with the target dataset to amplify the acquisition of highly relevant features. Through a strategic modification of the model's final layer, we enable the simultaneous classification of both datasets without the necessity to treat them as disparate tasks. To realize this, we formulate a loss function that includes an inter-group penalty. We conducted experimental evaluations using the InceptionV3 and ResNet50 models, designating the UFOP-HVD hop leaf dataset as the target and ImageNet and PlantNet as auxiliary datasets. Our proposed method exhibited a substantial performance advantage over models without regularization and those adopting dropout regularization, with accuracy improvements ranging from 5 to 22 percentage points. Additionally, we introduce a technique for interpretability devised to assess the quality of features by analyzing correlations among class features in the network's convolutional layers.

Title: Syntactic Knowledge via Graph Attention with BERT in Machine Translation. (arXiv:2305.13413v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13413
Code URL: null
Copy Paste: [[2305.13413] Syntactic Knowledge via Graph Attention with BERT in Machine Translation](http://arxiv.org/abs/2305.13413) #interpretability
Summary:
Although the Transformer model can effectively acquire context features via a self-attention mechanism, deeper syntactic knowledge is still not effectively modeled. To alleviate the above problem, we propose Syntactic knowledge via Graph attention with BERT (SGB) in Machine Translation (MT) scenarios. Graph Attention Network (GAT) and BERT jointly represent syntactic dependency feature as explicit knowledge of the source language to enrich source language representations and guide target language generation. Our experiments use gold syntax-annotation sentences and Quality Estimation (QE) model to obtain interpretability of translation quality improvement regarding syntactic knowledge without being limited to a BLEU score. Experiments show that the proposed SGB engines improve translation quality across the three MT tasks without sacrificing BLEU scores. We investigate what length of source sentences benefits the most and what dependencies are better identified by the SGB engines. We also find that learning of specific dependency relations by GAT can be reflected in the translation quality containing such relations and that syntax on the graph leads to new modeling of syntactic aspects of source sentences in the middle and bottom layers of BERT.

explainability

Title: SAR-to-Optical Image Translation via Thermodynamics-inspired Network. (arXiv:2305.13839v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.13839
Code URL: null
Copy Paste: [[2305.13839] SAR-to-Optical Image Translation via Thermodynamics-inspired Network](http://arxiv.org/abs/2305.13839) #explainability
Summary:
Synthetic aperture radar (SAR) is prevalent in the remote sensing field but is difficult to interpret in human visual perception. Recently, SAR-to-optical (S2O) image conversion methods have provided a prospective solution for interpretation. However, since there is a huge domain difference between optical and SAR images, they suffer from low image quality and geometric distortion in the produced optical images. Motivated by the analogy between pixels during the S2O image translation and molecules in a heat field, Thermodynamics-inspired Network for SAR-to-Optical Image Translation (S2O-TDN) is proposed in this paper. Specifically, we design a Third-order Finite Difference (TFD) residual structure in light of the TFD equation of thermodynamics, which allows us to efficiently extract inter-domain invariant features and facilitate the learning of the nonlinear translation mapping. In addition, we exploit the first law of thermodynamics (FLT) to devise an FLT-guided branch that promotes the state transition of the feature values from the unstable diffusion state to the stable one, aiming to regularize the feature diffusion and preserve image structures during S2O image translation. S2O-TDN follows an explicit design principle derived from thermodynamic theory and enjoys the advantage of explainability. Experiments on the public SEN1-2 dataset show the advantages of the proposed S2O-TDN over the current methods with more delicate textures and higher quantitative results.

watermark

diffusion

Title: LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On. (arXiv:2305.13501v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.13501
Code URL: https://github.com/miccunifi/ladi-vton
Copy Paste: [[2305.13501] LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On](http://arxiv.org/abs/2305.13501) #diffusion
Summary:
The rapidly evolving fields of e-commerce and metaverse continue to seek innovative approaches to enhance the consumer experience. At the same time, recent advancements in the development of diffusion models have enabled generative networks to create remarkably realistic images. In this context, image-based virtual try-on, which consists in generating a novel image of a target model wearing a given in-shop garment, has yet to capitalize on the potential of these powerful generative solutions. This work introduces LaDI-VTON, the first Latent Diffusion textual Inversion-enhanced model for the Virtual Try-ON task. The proposed architecture relies on a latent diffusion model extended with a novel additional autoencoder module that exploits learnable skip connections to enhance the generation process preserving the model's characteristics. To effectively maintain the texture and details of the in-shop garment, we propose a textual inversion component that can map the visual features of the garment to the CLIP token embedding space and thus generate a set of pseudo-word token embeddings capable of conditioning the generation process. Experimental results on Dress Code and VITON-HD datasets demonstrate that our approach outperforms the competitors by a consistent margin, achieving a significant milestone for the task. Source code and trained models will be publicly released at: https://github.com/miccunifi/ladi-vton.

Title: LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models. (arXiv:2305.13655v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.13655
Code URL: null
Copy Paste: [[2305.13655] LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models](http://arxiv.org/abs/2305.13655) #diffusion
Summary:
Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, these models still encounter difficulties when generating images from prompts that demand spatial or common sense reasoning. We propose to equip diffusion models with enhanced reasoning capabilities by using off-the-shelf pretrained large language models (LLMs) in a novel two-stage generation process. First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, an LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately generating images according to prompts that necessitate both language and spatial reasoning. Additionally, our method naturally allows dialog-based scene specification and is able to handle prompts in a language that is not well-supported by the underlying diffusion model.

Title: DiffHand: End-to-End Hand Mesh Reconstruction via Diffusion Models. (arXiv:2305.13705v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.13705
Code URL: null
Copy Paste: [[2305.13705] DiffHand: End-to-End Hand Mesh Reconstruction via Diffusion Models](http://arxiv.org/abs/2305.13705) #diffusion
Summary:
Hand mesh reconstruction from a monocular image is a challenging task: because of depth ambiguity and severe occlusion, the mapping between the monocular image and the hand mesh is non-unique. To address this, we develop DiffHand, the first diffusion-based framework that approaches hand mesh reconstruction as a denoising diffusion process. Our one-stage pipeline utilizes noise to model the uncertainty distribution of the intermediate hand mesh in a forward process. We reformulate the denoising diffusion process to gradually refine a noisy hand mesh and then select the mesh with the highest probability of being correct based on the image itself, rather than relying on 2D joints extracted beforehand. To better model the connectivity of hand vertices, we design a novel network module called the cross-modality decoder. Extensive experiments on the popular benchmarks demonstrate that our method outperforms the state-of-the-art hand mesh reconstruction approaches by achieving 5.8mm PA-MPJPE on the Freihand test set and 4.98mm PA-MPJPE on the DexYCB test set.

Title: Understanding Text-driven Motion Synthesis with Keyframe Collaboration via Diffusion Models. (arXiv:2305.13773v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.13773
Code URL: null
Copy Paste: [[2305.13773] Understanding Text-driven Motion Synthesis with Keyframe Collaboration via Diffusion Models](http://arxiv.org/abs/2305.13773) #diffusion
Summary:
The emergence of text-driven motion synthesis techniques gives animators great potential to create efficiently. However, in most cases, textual expressions only contain general and qualitative motion descriptions and lack fine depiction and sufficient intensity, leading to synthesized motions that either (a) are semantically compliant but uncontrollable over specific pose details, or (b) deviate from the provided descriptions, leaving animators with undesired results. In this paper, we propose DiffKFC, a conditional diffusion model for text-driven motion synthesis with keyframe collaboration. Different from plain text-driven designs, full interaction among texts, keyframes and the remaining diffused frames is conducted at training, enabling realistic generation under efficient, collaborative dual-level control: coarse guidance at the semantic level, with only a few keyframes for direct and fine-grained depiction down to the body posture level, to satisfy animator requirements without tedious labor. Specifically, we customize efficient Dilated Mask Attention modules, where only partial valid tokens participate in local-to-global attention, as indicated by the dilated keyframe mask. For user flexibility, DiffKFC supports adjusting the importance of fine-grained keyframe control. Experimental results show that our model achieves state-of-the-art performance on text-to-motion datasets HumanML3D and KIT.

Title: WaveDM: Wavelet-Based Diffusion Models for Image Restoration. (arXiv:2305.13819v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.13819
Code URL: null
Copy Paste: [[2305.13819] WaveDM: Wavelet-Based Diffusion Models for Image Restoration](http://arxiv.org/abs/2305.13819) #diffusion
Summary:
The latest diffusion-based methods for many image restoration tasks outperform traditional models, but they suffer from long inference times. To tackle this, this paper proposes a Wavelet-Based Diffusion Model (WaveDM) with an Efficient Conditional Sampling (ECS) strategy. WaveDM learns the distribution of clean images in the wavelet domain conditioned on the wavelet spectrum of degraded images after wavelet transform, which is more time-saving in each step of sampling than modeling in the spatial domain. In addition, ECS follows the same procedure as the deterministic implicit sampling in the initial sampling period and then stops to predict clean images directly, which reduces the number of total sampling steps to around 5. Evaluations on four benchmark datasets including image raindrop removal, defocus deblurring, demoiréing, and denoising demonstrate that WaveDM achieves state-of-the-art performance with efficiency that is comparable to traditional one-pass methods and over 100 times faster than existing image restoration methods using vanilla diffusion models.
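
To make the wavelet-domain idea concrete, the sketch below applies one level of a 2D Haar transform, which splits an image into four sub-bands that each have half the height and half the width. The abstract does not say which wavelet WaveDM uses, so Haar is an assumption here, and the band names are illustrative.

```python
def haar_dwt2(image):
    """One-level orthonormal 2D Haar transform.

    image: list of rows with even height and width.
    Returns four sub-bands (low, col_diff, row_diff, diag), each with
    half the height and half the width of the input.
    """
    low, col_diff, row_diff, diag = [], [], [], []
    for i in range(0, len(image), 2):
        r_low, r_col, r_row, r_diag = [], [], [], []
        for j in range(0, len(image[0]), 2):
            a, b = image[i][j], image[i][j + 1]
            c, d = image[i + 1][j], image[i + 1][j + 1]
            r_low.append((a + b + c + d) / 2)   # local average (scaled)
            r_col.append((a - b + c - d) / 2)   # difference across columns
            r_row.append((a + b - c - d) / 2)   # difference across rows
            r_diag.append((a - b - c + d) / 2)  # diagonal difference
        low.append(r_low)
        col_diff.append(r_col)
        row_diff.append(r_row)
        diag.append(r_diag)
    return low, col_diff, row_diff, diag

bands = haar_dwt2([[1, 2], [3, 4]])
```

Because each sub-band covers a quarter of the original pixel grid, a network operating on the stacked sub-bands works at a lower spatial resolution, which is one plausible reading of the per-step time saving the abstract reports.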

Title: Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models. (arXiv:2305.13840v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.13840
Code URL: null
Copy Paste: [[2305.13840] Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models](http://arxiv.org/abs/2305.13840) #diffusion
Summary:
This paper presents a controllable text-to-video (T2V) diffusion model, named Video-ControlNet, that generates videos conditioned on a sequence of control signals, such as edge or depth maps. Video-ControlNet is built on a pre-trained conditional text-to-image (T2I) diffusion model by incorporating a spatial-temporal self-attention mechanism and trainable temporal layers for efficient cross-frame modeling. A first-frame conditioning strategy is proposed to facilitate the model to generate videos transferred from the image domain as well as arbitrary-length videos in an auto-regressive manner. Moreover, Video-ControlNet employs a novel residual-based noise initialization strategy to introduce motion prior from an input video, producing more coherent videos. With the proposed architecture and strategies, Video-ControlNet can achieve resource-efficient convergence and generate superior quality and consistent videos with fine-grained control. Extensive experiments demonstrate its success in various video generative tasks such as video editing and video style transfer, outperforming previous methods in terms of consistency and quality. Project Page: https://controlavideo.github.io/

Title: Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models. (arXiv:2305.13873v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.13873
Code URL: null
Copy Paste: [[2305.13873] Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models](http://arxiv.org/abs/2305.13873) #diffusion
Summary:
State-of-the-art Text-to-Image models like Stable Diffusion and DALL·E 2 are revolutionizing how people generate visual content. At the same time, society has serious concerns about how adversaries can exploit such models to generate unsafe images. In this work, we focus on demystifying the generation of unsafe images and hateful memes from Text-to-Image models. We first construct a typology of unsafe images consisting of five categories (sexually explicit, violent, disturbing, hateful, and political). Then, we assess the proportion of unsafe images generated by four advanced Text-to-Image models using four prompt datasets. We find that these models can generate a substantial percentage of unsafe images; across four models and four prompt datasets, 14.56% of all generated images are unsafe. When comparing the four models, we find different risk levels, with Stable Diffusion being the most prone to generating unsafe content (18.92% of all generated images are unsafe). Given Stable Diffusion's tendency to generate more unsafe content, we evaluate its potential to generate hateful meme variants if exploited by an adversary to attack a specific individual or community. We employ three image editing methods, DreamBooth, Textual Inversion, and SDEdit, which are supported by Stable Diffusion. Our evaluation result shows that 24% of the generated images using DreamBooth are hateful meme variants that present the features of the original hateful meme and the target individual/community; these generated images are comparable to hateful meme variants collected from the real world. Overall, our results demonstrate that the danger of large-scale generation of unsafe images is imminent. We discuss several mitigating measures, such as curating training data, regulating prompts, and implementing safety filters, and encourage better safeguard tools to be developed to prevent unsafe generation.

Title: Compositional Text-to-Image Synthesis with Attention Map Control of Diffusion Models. (arXiv:2305.13921v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.13921
Code URL: null
Copy Paste: [[2305.13921] Compositional Text-to-Image Synthesis with Attention Map Control of Diffusion Models](http://arxiv.org/abs/2305.13921) #diffusion
Summary:
Recent text-to-image (T2I) diffusion models show outstanding performance in generating high-quality images conditioned on textual prompts. However, these models fail to semantically align the generated images with the text descriptions due to their limited compositional capabilities, leading to attribute leakage, entity leakage, and missing entities. In this paper, we propose a novel attention mask control strategy based on predicted object boxes to address these three issues. In particular, we first train a BoxNet to predict a box for each entity that possesses the attribute specified in the prompt. Then, depending on the predicted boxes, unique mask control is applied to the cross- and self-attention maps. Our approach produces a more semantically accurate synthesis by constraining the attention regions of each token in the prompt to the image. In addition, the proposed method is straightforward and effective, and can be readily integrated into existing cross-attention-diffusion-based T2I generators. We compare our approach to competing methods and demonstrate that it not only faithfully conveys the semantics of the original text to the generated content, but also achieves high availability as a ready-to-use plugin.
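
The box-based mask control described above can be pictured as zeroing a token's attention outside its predicted box. The sketch below shows only that masking step on a single spatial attention map. The box format `(x0, y0, x1, y1)` with half-open ranges is an assumption, and the paper's actual control over cross- and self-attention maps may differ.

```python
def box_mask(height, width, box):
    """Binary mask that is 1.0 inside the box and 0.0 elsewhere.
    box = (x0, y0, x1, y1), covering x0 <= x < x1 and y0 <= y < y1."""
    x0, y0, x1, y1 = box
    return [
        [1.0 if (x0 <= x < x1 and y0 <= y < y1) else 0.0 for x in range(width)]
        for y in range(height)
    ]

def constrain_attention(attn, box):
    """Keep a token's attention values inside its box and zero the rest."""
    mask = box_mask(len(attn), len(attn[0]), box)
    return [
        [value if keep else 0.0 for value, keep in zip(attn_row, mask_row)]
        for attn_row, mask_row in zip(attn, mask)
    ]

attn_map = [[0.25, 0.25], [0.25, 0.25]]
left_half = constrain_attention(attn_map, (0, 0, 1, 2))
```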

noise learning

data-free

transformer

Title: Efficient Large-Scale Vision Representation Learning. (arXiv:2305.13399v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.13399
Code URL: null
Copy Paste: [[2305.13399] Efficient Large-Scale Vision Representation Learning](http://arxiv.org/abs/2305.13399) #transformer
Summary:
In this article, we present our approach to single-modality vision representation learning. Understanding vision representations of product content is vital for recommendations, search, and advertising applications in e-commerce. We detail and contrast techniques used to fine tune large-scale vision representation learning models in an efficient manner under low-resource settings, including several pretrained backbone architectures, both in the convolutional neural network as well as the vision transformer family. We highlight the challenges for e-commerce applications at-scale and highlight the efforts to more efficiently train, evaluate, and serve visual representations. We present ablation studies for several downstream tasks, including our visually similar ad recommendations. We evaluate the offline performance of the derived visual representations in downstream tasks. To this end, we present a novel text-to-image generative offline evaluation method for visually similar recommendation systems. Finally, we include online results from deployed machine learning systems in production at Etsy.

Title: Type-to-Track: Retrieve Any Object via Prompt-based Tracking. (arXiv:2305.13495v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.13495
Code URL: null
Copy Paste: [[2305.13495] Type-to-Track: Retrieve Any Object via Prompt-based Tracking](http://arxiv.org/abs/2305.13495) #transformer
Summary:
One of the recent trends in vision problems is to use natural language captions to describe the objects of interest. This approach can overcome some limitations of traditional methods that rely on bounding boxes or category annotations. This paper introduces a novel paradigm for Multiple Object Tracking called Type-to-Track, which allows users to track objects in videos by typing natural language descriptions. We present a new dataset for this Grounded Multiple Object Tracking task, called GroOT, that contains videos with various types of objects and their corresponding textual captions describing their appearance and action in detail. Additionally, we introduce two new evaluation protocols and formulate evaluation metrics specifically for this task. We develop a new efficient method that models a transformer-based eMbed-ENcoDE-extRact framework (MENDER) using the third-order tensor decomposition. The experiments in five scenarios show that our MENDER approach outperforms another two-stage design in terms of accuracy and efficiency, by up to 14.7% in accuracy and with up to 4× faster speed.

Title: Interpreting Transformer's Attention Dynamic Memory and Visualizing the Semantic Information Flow of GPT. (arXiv:2305.13417v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13417
Code URL: null
Copy Paste: [[2305.13417] Interpreting Transformer's Attention Dynamic Memory and Visualizing the Semantic Information Flow of GPT](http://arxiv.org/abs/2305.13417) #transformer
Summary:
Recent advances in interpretability suggest we can project weights and hidden states of transformer-based language models (LMs) to their vocabulary, a transformation that makes them human interpretable and enables us to assign semantics to what was seen only as numerical vectors. In this paper, we interpret LM attention heads and memory values, the vectors the models dynamically create and recall while processing a given input. By analyzing the tokens they represent through this projection, we identify patterns in the information flow inside the attention mechanism. Based on these discoveries, we create a tool to visualize a forward pass of Generative Pre-trained Transformers (GPTs) as an interactive flow graph, with nodes representing neurons or hidden states and edges representing the interactions between them. Our visualization simplifies huge amounts of data into easy-to-read plots that reflect why models output their results. We demonstrate the utility of our modeling by identifying the effect LM components have on the intermediate processing in the model before outputting a prediction. For instance, we discover that layer norms are used as semantic filters and find neurons that act as regularization vectors.
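
The projection to the vocabulary mentioned above can be sketched as a dot product between a hidden vector and each token's embedding, followed by ranking the tokens by score. The tiny vocabulary and embedding matrix below are invented for illustration, and real models typically apply a final layer norm and an unembedding matrix that this sketch leaves out.

```python
def project_to_vocab(hidden, embedding, vocab, top_k=3):
    """Return the top_k tokens whose embeddings best match the hidden vector."""
    scores = [sum(h * e for h, e in zip(hidden, row)) for row in embedding]
    ranked = sorted(range(len(vocab)), key=lambda i: scores[i], reverse=True)
    return [vocab[i] for i in ranked[:top_k]]

# Hypothetical 2-dimensional embeddings for a three-token vocabulary.
vocab = ["cat", "dog", "car"]
embedding = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

nearest = project_to_vocab([1.0, 0.0], embedding, vocab, top_k=2)
```

Reading off the top-ranked tokens is what turns an otherwise opaque vector into something a person can label.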

Title: Latent Positional Information is in the Self-Attention Variance of Transformer Language Models Without Positional Embeddings. (arXiv:2305.13571v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13571
Code URL: null
Copy Paste: [[2305.13571] Latent Positional Information is in the Self-Attention Variance of Transformer Language Models Without Positional Embeddings](http://arxiv.org/abs/2305.13571) #transformer
Summary:
The use of positional embeddings in transformer language models is widely accepted. However, recent research has called into question the necessity of such embeddings. We further extend this inquiry by demonstrating that a randomly initialized and frozen transformer language model, devoid of positional embeddings, inherently encodes strong positional information through the shrinkage of self-attention variance. To quantify this variance, we derive the underlying distribution of each step within a transformer layer. Through empirical validation using a fully pretrained model, we show that the variance shrinkage effect still persists after extensive gradient updates. Our findings serve to justify the decision to discard positional embeddings and thus facilitate more efficient pretraining of transformer language models.
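
One way to build intuition for the variance shrinkage mentioned above is the simulation below. It rests on a strong simplifying assumption that is ours, not the paper's derivation: a randomly initialized causal attention head attends roughly uniformly over the prefix, so the output at position t is the mean of t independent value scalars. All function names are hypothetical.

```python
import random
import statistics

def causal_uniform_attention(values):
    """Output at position t is the mean of values[0..t] (uniform causal attention)."""
    outputs, running_sum = [], 0.0
    for t, v in enumerate(values, start=1):
        running_sum += v
        outputs.append(running_sum / t)
    return outputs

def output_variance_by_position(seq_len=32, trials=2000, seed=0):
    """Estimate the variance of the attention output at each position."""
    rng = random.Random(seed)
    per_position = [[] for _ in range(seq_len)]
    for _ in range(trials):
        # Values are i.i.d. standard normal scalars, one per token.
        values = [rng.gauss(0.0, 1.0) for _ in range(seq_len)]
        for t, out in enumerate(causal_uniform_attention(values)):
            per_position[t].append(out)
    return [statistics.pvariance(samples) for samples in per_position]

variances = output_variance_by_position()
print(round(variances[0], 3), round(variances[-1], 3))
```

Under this assumption the variance decays roughly as 1/t, so later positions have visibly smaller output variance, which is the kind of signal a model could use to infer position.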

Title: Cross-Attention is Not Enough: Incongruity-Aware Multimodal Sentiment Analysis and Emotion Recognition. (arXiv:2305.13583v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13583
Code URL: null
Copy Paste: [[2305.13583] Cross-Attention is Not Enough: Incongruity-Aware Multimodal Sentiment Analysis and Emotion Recognition](http://arxiv.org/abs/2305.13583) #transformer
Summary:
Fusing multiple modalities for affective computing tasks has proven effective for performance improvement. However, how multimodal fusion works is not well understood, and its use in the real world usually results in large model sizes. In this work, on sentiment and emotion analysis, we first analyze how the salient affective information in one modality can be affected by the other in crossmodal attention. We find that inter-modal incongruity exists at the latent level due to crossmodal attention. Based on this finding, we propose a lightweight model via Hierarchical Crossmodal Transformer with Modality Gating (HCT-MG), which determines a primary modality according to its contribution to the target task and then hierarchically incorporates auxiliary modalities to alleviate inter-modal incongruity and reduce information redundancy. The experimental evaluation on three benchmark datasets: CMU-MOSI, CMU-MOSEI, and IEMOCAP verifies the efficacy of our approach, showing that it: 1) outperforms major prior work by achieving competitive results and can successfully recognize hard samples; 2) mitigates the inter-modal incongruity at the latent level when modalities have mismatched affective tendencies; 3) reduces model size to less than 1M parameters while outperforming existing models of similar sizes.

Title: AxomiyaBERTa: A Phonologically-aware Transformer Model for Assamese. (arXiv:2305.13641v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13641
Code URL: https://github.com/csu-signal/axomiyaberta
Copy Paste: [[2305.13641] AxomiyaBERTa: A Phonologically-aware Transformer Model for Assamese](http://arxiv.org/abs/2305.13641) #transformer
Summary:
Despite their successes in NLP, Transformer-based language models still require extensive computing resources and suffer in low-resource or low-compute settings. In this paper, we present AxomiyaBERTa, a novel BERT model for Assamese, a morphologically-rich low-resource language (LRL) of Eastern India. AxomiyaBERTa is trained only on the masked language modeling (MLM) task, without the typical additional next sentence prediction (NSP) objective, and our results show that in resource-scarce settings for very low-resource languages like Assamese, MLM alone can be successfully leveraged for a range of tasks. AxomiyaBERTa achieves SOTA on token-level tasks like Named Entity Recognition and also performs well on "longer-context" tasks like Cloze-style QA and Wiki Title Prediction, with the assistance of a novel embedding disperser and phonological signals respectively. Moreover, we show that AxomiyaBERTa can leverage phonological signals for even more challenging tasks, such as a novel cross-document coreference task on a translated version of the ECB+ corpus, where we present a new SOTA result for an LRL. Our source code and evaluation scripts may be found at https://github.com/csu-signal/axomiyaberta.

Title: Optimizing Non-Autoregressive Transformers with Contrastive Learning. (arXiv:2305.13667v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13667
Code URL: null
Copy Paste: [[2305.13667] Optimizing Non-Autoregressive Transformers with Contrastive Learning](http://arxiv.org/abs/2305.13667) #transformer
Summary:
Non-autoregressive Transformers (NATs) reduce the inference latency of Autoregressive Transformers (ATs) by predicting words all at once rather than in sequential order. They have achieved remarkable progress in machine translation as well as many other applications. However, a long-standing challenge for NATs is the learning of multi-modality data distribution, which is the main cause of the performance gap between NATs and ATs. In this paper, we propose to ease the difficulty of modality learning via sampling from the model distribution instead of the data distribution. We derive contrastive constraints to stabilize the training process and integrate this resulting objective with the state-of-the-art NAT architecture DA-Transformer. Our model is examined on 3 different tasks, including machine translation, text summarization, and paraphrasing, with 5 benchmarks. Results show that our approach outperforms previous non-autoregressive baselines by a significant margin and establishes new state-of-the-art results for non-autoregressive transformers on all the benchmarks.

Title: Grounding and Distinguishing Conceptual Vocabulary Through Similarity Learning in Embodied Simulations. (arXiv:2305.13668v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13668
Code URL: null
Copy Paste: [[2305.13668] Grounding and Distinguishing Conceptual Vocabulary Through Similarity Learning in Embodied Simulations](http://arxiv.org/abs/2305.13668) #transformer
Summary:
We present a novel method for using agent experiences gathered through an embodied simulation to ground contextualized word vectors to object representations. We use similarity learning to make comparisons between different object types based on their properties when interacted with, and to extract common features pertaining to the objects' behavior. We then use an affine transformation to calculate a projection matrix that transforms contextualized word vectors from different transformer-based language models into this learned space, and evaluate whether new test instances of transformed token vectors identify the correct concept in the object embedding space. Our results expose properties of the embedding spaces of four different transformer models and show that grounding object token vectors is usually more helpful to grounding verb and attribute token vectors than the reverse, which reflects earlier conclusions in the analogical reasoning and psycholinguistic literature.

Title: Causal Intervention for Abstractive Related Work Generation. (arXiv:2305.13685v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13685
Code URL: null
Copy Paste: [[2305.13685] Causal Intervention for Abstractive Related Work Generation](http://arxiv.org/abs/2305.13685) #transformer
Summary:
Abstractive related work generation has attracted increasing attention in generating coherent related work that better helps readers grasp the background in the current research. However, most existing abstractive models ignore the inherent causality of related work generation, leading to low quality of generated related work and spurious correlations that affect the models' generalizability. In this study, we argue that causal intervention can address these limitations and improve the quality and coherence of the generated related works. To this end, we propose a novel Causal Intervention Module for Related Work Generation (CaM) to effectively capture causalities in the generation process and improve the quality and coherence of the generated related works. Specifically, we first model the relations among sentence order, document relation, and transitional content in related work generation using a causal graph. Then, to implement the causal intervention and mitigate the negative impact of spurious correlations, we use do-calculus to derive ordinary conditional probabilities and identify causal effects through CaM. Finally, we subtly fuse CaM with Transformer to obtain an end-to-end generation model. Extensive experiments on two real-world datasets show that causal interventions in CaM can effectively promote the model to learn causal relations and produce related work of higher quality and coherence.

Title: UNIMO-3: Multi-granularity Interaction for Vision-Language Representation Learning. (arXiv:2305.13697v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13697
Code URL: null
Copy Paste: [[2305.13697] UNIMO-3: Multi-granularity Interaction for Vision-Language Representation Learning](http://arxiv.org/abs/2305.13697) #transformer
Summary:
Vision-and-language (VL) pre-training aims to learn a general representation of image-text pairs that can be transferred to various vision-and-language tasks. Compared with modeling uni-modal data, the main challenge of the VL model is how to learn the cross-modal interaction from multimodal data, especially the fine-grained interaction. Existing works have shown that fully transformer-based models that adopt attention mechanisms to learn in-layer cross-modal interaction can demonstrate impressive performance on various cross-modal downstream tasks. However, they ignored that the semantic information of the different modalities at the same layer is not uniform, which leads to the cross-modal interaction collapsing into a limited multi-modal semantic information interaction. In this work, we propose the UNIMO-3 model, which has the capacity to simultaneously learn the multimodal in-layer interaction and cross-layer interaction. The UNIMO-3 model can establish effective connections between different layers in a cross-modal encoder, and adaptively capture the interaction between two modalities at different levels. The experimental results show that our model achieves state-of-the-art performance in various downstream tasks, and an ablation study shows that effective cross-layer learning improves the model's ability of multimodal representation.

Title: Concept-aware Training Improves In-context Learning Ability of Language Models. (arXiv:2305.13775v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13775
Code URL: null
Copy Paste: [[2305.13775] Concept-aware Training Improves In-context Learning Ability of Language Models](http://arxiv.org/abs/2305.13775) #transformer
Summary:
Many recent language models (LMs) of the Transformer family exhibit so-called in-context learning (ICL) ability, manifested in the LMs' ability to modulate their function by a task described in a natural language input. Previous work curating these models assumes that ICL emerges from vast over-parametrization or the scale of multi-task training. However, a complementary branch of recent theoretical work attributes ICL emergence to specific properties of training data and creates functional in-context learners in small-scale, synthetic settings.

Inspired by recent findings on data properties driving the emergence of ICL, we propose a method to create LMs able to better utilize the in-context information, by constructing training scenarios where it is beneficial for the LM to capture the analogical reasoning concepts. We measure that data sampling of Concept-aware Training (CoAT) consistently improves models' reasoning ability. As a result, the in-context learners trained with CoAT on only two datasets of a single (QA) task perform comparably to larger models trained on 1600+ tasks.

Title: Probing Brain Context-Sensitivity with Masked-Attention Generation. (arXiv:2305.13863v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13863
Code URL: null
Copy Paste: [[2305.13863] Probing Brain Context-Sensitivity with Masked-Attention Generation](http://arxiv.org/abs/2305.13863) #transformer
Summary:
Two fundamental questions in neurolinguistics concern the brain regions that integrate information beyond the lexical level, and the size of their window of integration. To address these questions, we introduce a new approach named masked-attention generation. It uses GPT-2 transformers to generate word embeddings that capture a fixed amount of contextual information. We then tested whether these embeddings could predict fMRI brain activity in humans listening to naturalistic text. The results showed that most of the cortex within the language network is sensitive to contextual information, and that the right hemisphere is more sensitive to longer contexts than the left. Masked-attention generation supports previous analyses of context-sensitivity in the brain, and complements them by quantifying the window size of context integration per voxel.
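
A fixed context window of this kind can be expressed as an attention mask. The sketch below shows one plausible construction, assuming each token may attend only to itself and a fixed number of preceding tokens; the abstract does not specify the exact masking procedure, and the function name and parameters are hypothetical.

```python
def windowed_causal_mask(seq_len, window):
    """Build a boolean mask where mask[i][j] is True if token i may attend to token j.

    Token i sees itself and at most `window - 1` preceding tokens.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]

mask = windowed_causal_mask(seq_len=5, window=2)
for row in mask:
    # "x" marks an allowed attention position, "." a masked one.
    print("".join("x" if allowed else "." for allowed in row))
```

Varying `window` changes how much context each embedding can capture, which is the quantity the study relates to brain activity.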

Title: Narrative XL: A Large-scale Dataset For Long-Term Memory Models. (arXiv:2305.13877v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13877
Code URL: https://github.com/r-seny/narrativexl
Copy Paste: [[2305.13877] Narrative XL: A Large-scale Dataset For Long-Term Memory Models](http://arxiv.org/abs/2305.13877) #transformer
Summary:
Despite their tremendous successes, most large language models do not have any long-term memory mechanisms, which restricts their applications. Overcoming this limitation would not only require changes to the typical transformer architectures or training procedures, but also a dataset on which these new models could be trained and evaluated. We argue that existing resources lack a few key properties, and that at present, there are no naturalistic datasets of sufficient scale to train (and not only evaluate) long-term memory language models. We then present our solution that capitalizes on the advances in short-term memory language models to create such a dataset. Using GPT 3.5, we summarized each scene in 1500 hand-curated books from Project Gutenberg, which resulted in approximately 150 scene-level summaries per book. We then created a number of reading comprehension questions based on these summaries, including three types of multiple-choice scene recognition questions, as well as free-form narrative reconstruction questions. Each book is thus associated with more than 500 reading comprehension questions. Crucially, most questions have a known "retention demand", indicating how long-term a memory is needed to answer them, which should aid long-term memory performance evaluation. We validate our data in three small-scale experiments: one with human labelers, and two with existing language models. We show that our questions 1) adequately represent the source material, 2) can be used to diagnose the model's memory capacity, and 3) are not trivial for modern language models even when the memory demand does not exceed those models' context lengths. Lastly, we provide our code, which can be used to further expand the dataset in an automated manner.

Title: Neural Functional Transformers. (arXiv:2305.13546v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.13546
Code URL: https://github.com/allanyangzhou/nfn
Copy Paste: [[2305.13546] Neural Functional Transformers](http://arxiv.org/abs/2305.13546) #transformer
Summary:
The recent success of neural networks as implicit representation of data has driven growing interest in neural functionals: models that can process other neural networks as input by operating directly over their weight spaces. Nevertheless, constructing expressive and efficient neural functional architectures that can handle high-dimensional weight-space objects remains challenging. This paper uses the attention mechanism to define a novel set of permutation equivariant weight-space layers and composes them into deep equivariant models called neural functional Transformers (NFTs). NFTs respect weight-space permutation symmetries while incorporating the advantages of attention, which have exhibited remarkable success across multiple domains. In experiments processing the weights of feedforward MLPs and CNNs, we find that NFTs match or exceed the performance of prior weight-space methods. We also leverage NFTs to develop Inr2Array, a novel method for computing permutation invariant latent representations from the weights of implicit neural representations (INRs). Our proposed method improves INR classification accuracy by up to $+17\%$ over existing methods. We provide an implementation of our layers at https://github.com/AllanYangZhou/nfn.

generative

Title: Know Your Self-supervised Learning: A Survey on Image-based Generative and Discriminative Training. (arXiv:2305.13689v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.13689
Code URL: null
Copy Paste: [[2305.13689] Know Your Self-supervised Learning: A Survey on Image-based Generative and Discriminative Training](http://arxiv.org/abs/2305.13689) #generative
Summary:
Although supervised learning has been highly successful in improving the state-of-the-art in the domain of image-based computer vision in the past, the margin of improvement has diminished significantly in recent years, indicating that a plateau is in sight. Meanwhile, the use of self-supervised learning (SSL) for the purpose of natural language processing (NLP) has seen tremendous successes during the past couple of years, with this new learning paradigm yielding powerful language models. Inspired by the excellent results obtained in the field of NLP, self-supervised methods that rely on clustering, contrastive learning, distillation, and information-maximization, which all fall under the banner of discriminative SSL, have experienced a swift uptake in the area of computer vision. Shortly afterwards, generative SSL frameworks that are mostly based on masked image modeling, complemented and surpassed the results obtained with discriminative SSL. Consequently, within a span of three years, over $100$ unique general-purpose frameworks for generative and discriminative SSL, with a focus on imaging, were proposed. In this survey, we review a plethora of research efforts conducted on image-oriented SSL, providing a historic view and paying attention to best practices as well as useful software packages. While doing so, we discuss pretext tasks for image-based SSL, as well as techniques that are commonly used in image-based SSL. Lastly, to aid researchers who aim at contributing to image-focused SSL, we outline a number of promising research directions.

Title: VisorGPT: Learning Visual Prior via Generative Pre-Training. (arXiv:2305.13777v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.13777
Code URL: https://github.com/sierkinhane/visorgpt
Copy Paste: [[2305.13777] VisorGPT: Learning Visual Prior via Generative Pre-Training](http://arxiv.org/abs/2305.13777) #generative
Summary:
Various stuff and things in visual data possess specific traits, which can be learned by deep neural networks and are implicitly represented as the visual prior, e.g., object location and shape, in the model. Such prior potentially impacts many vision tasks. For example, in conditional image synthesis, spatial conditions failing to adhere to the prior can result in visually inaccurate synthetic results. This work aims to explicitly learn the visual prior and enable the customization of sampling. Inspired by advances in language modeling, we propose to learn Visual prior via Generative Pre-Training, dubbed VisorGPT. By discretizing visual locations of objects, e.g., bounding boxes, human pose, and instance masks, into sequences, VisorGPT can model visual prior through likelihood maximization. Besides, prompt engineering is investigated to unify various visual locations and enable customized sampling of sequential outputs from the learned prior. Experimental results demonstrate that VisorGPT can effectively model the visual prior, which can be employed for many vision tasks, such as customizing accurate human pose for conditional image synthesis models like ControlNet. Code will be released at https://github.com/Sierkinhane/VisorGPT.

Title: Variational Bayesian Framework for Advanced Image Generation with Domain-Related Variables. (arXiv:2305.13872v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.13872
Code URL: null
Copy Paste: [[2305.13872] Variational Bayesian Framework for Advanced Image Generation with Domain-Related Variables](http://arxiv.org/abs/2305.13872) #generative
Summary:
Deep generative models (DGMs) and their conditional counterparts provide a powerful ability for general-purpose generative modeling of data distributions. However, it remains challenging for existing methods to address advanced conditional generative problems without annotations, which can enable multiple applications like image-to-image translation and image editing. We present a unified Bayesian framework for such problems, which introduces an inference stage on latent variables within the learning process. In particular, we propose a variational Bayesian image translation network (VBITN) that enables multiple image translation and editing tasks. Comprehensive experiments show the effectiveness of our method on unsupervised image-to-image translation, and demonstrate the novel advanced capabilities for semantic editing and mixed domain translation.

Title: Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction. (arXiv:2305.13903v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13903
Code URL: null
Copy Paste: [[2305.13903] Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction](http://arxiv.org/abs/2305.13903) #generative
Summary:
Despite constituting 65% of all internet traffic in 2023, video content is underrepresented in generative AI research. Meanwhile, recent large language models (LLMs) have become increasingly integrated with capabilities in the visual modality. Integrating video with LLMs is a natural next step, so how can this gap be bridged? To advance video reasoning, we propose a new research direction of VideoCOT on video keyframes, which leverages the multimodal generative abilities of vision-language models to enhance video reasoning while reducing the computational complexity of processing hundreds or thousands of frames. We introduce VIP, an inference-time dataset that can be used to evaluate VideoCOT, containing 1) a variety of real-life videos with keyframes and corresponding unstructured and structured scene descriptions, and 2) two new video reasoning tasks: video infilling and scene prediction. We benchmark various vision-language models on VIP, demonstrating the potential to use vision-language models and LLMs to enhance video chain of thought reasoning.

Title: From Model-Based to Data-Driven Simulation: Challenges and Trends in Autonomous Driving. (arXiv:2305.13960v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.13960
Code URL: null
Copy Paste: [[2305.13960] From Model-Based to Data-Driven Simulation: Challenges and Trends in Autonomous Driving](http://arxiv.org/abs/2305.13960) #generative
Summary:
Simulation is an integral part in the process of developing autonomous vehicles and advantageous for training, validation, and verification of driving functions. Even though simulations come with a series of benefits compared to real-world experiments, various challenges still prevent virtual testing from entirely replacing physical test-drives. Our work provides an overview of these challenges with regard to different aspects and types of simulation and subsumes current trends to overcome them. We cover aspects around perception-, behavior- and content-realism as well as general hurdles in the domain of simulation. Among others, we observe a trend of data-driven, generative approaches and high-fidelity data synthesis to increasingly replace model-based simulation.

Title: A Study of Generative Large Language Model for Medical Research and Healthcare. (arXiv:2305.13523v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13523
Code URL: null
Copy Paste: [[2305.13523] A Study of Generative Large Language Model for Medical Research and Healthcare](http://arxiv.org/abs/2305.13523) #generative
Summary:
There are enormous enthusiasm and concerns about using large language models (LLMs) in healthcare, yet current assumptions are all based on general-purpose LLMs such as ChatGPT. This study develops a clinical generative LLM, GatorTronGPT, using 277 billion words of mixed clinical and English text with a GPT-3 architecture of 20 billion parameters. GatorTronGPT improves biomedical natural language processing for medical research. Synthetic NLP models trained using GatorTronGPT-generated text outperform NLP models trained using real-world clinical text. A physicians' Turing test using a 1 (worst) to 9 (best) scale shows that there is no significant difference in linguistic readability (p = 0.22; 6.57 of GatorTronGPT compared with 6.93 of human) and clinical relevance (p = 0.91; 7.0 of GatorTronGPT compared with 6.97 of human) and that physicians cannot differentiate them (p < 0.001). This study provides insights on the opportunities and challenges of LLMs for medical research and healthcare.

Title: Non-parametric, Nearest-neighbor-assisted Fine-tuning for Neural Machine Translation. (arXiv:2305.13648v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13648
Code URL: null
Copy Paste: [[2305.13648] Non-parametric, Nearest-neighbor-assisted Fine-tuning for Neural Machine Translation](http://arxiv.org/abs/2305.13648) #generative
Summary:
Non-parametric, k-nearest-neighbor algorithms have recently made inroads to assist generative models such as language models and machine translation decoders. We explore whether such non-parametric models can improve machine translation models at the fine-tuning stage by incorporating statistics from the kNN predictions to inform the gradient updates for a baseline translation model. There are multiple methods which could be used to incorporate kNN statistics and we investigate gradient scaling by a gating mechanism, the kNN's ground truth probability, and reinforcement learning. For four standard in-domain machine translation datasets, compared with classic fine-tuning, we report consistent improvements of all of the three methods by as much as 1.45 BLEU and 1.28 BLEU for German-English and English-German translations respectively. Through qualitative analysis, we found particular improvements when it comes to translating grammatical relations or function words, which results in increased fluency of our model.

large language model

Title: Can LLMs facilitate interpretation of pre-trained language models?. (arXiv:2305.13386v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13386
Code URL: null
Copy Paste: [[2305.13386] Can LLMs facilitate interpretation of pre-trained language models?](http://arxiv.org/abs/2305.13386) #large language model
Summary:
Work done to uncover the knowledge encoded within pre-trained language models relies on annotated corpora or human-in-the-loop methods. However, these approaches are limited in terms of scalability and the scope of interpretation. We propose using a large language model, ChatGPT, as an annotator to enable fine-grained interpretation analysis of pre-trained language models. We discover latent concepts within pre-trained language models by applying hierarchical clustering over contextualized representations and then annotate these concepts using GPT annotations. Our findings demonstrate that ChatGPT produces accurate and semantically richer annotations compared to human-annotated concepts. Additionally, we showcase how GPT-based annotations empower interpretation analysis methodologies, of which we demonstrate two: probing framework and neuron interpretation. To facilitate further exploration and experimentation in this field, we have made available a substantial ConceptNet dataset comprising 39,000 annotated latent concepts.
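
The clustering step can be sketched as follows. This is a minimal illustration that assumes average-linkage agglomerative clustering on Euclidean distance; the abstract does not state the linkage or distance used, and the tokens and 2-D vectors are made up stand-ins for contextualized representations. The subsequent ChatGPT annotation of each cluster is not shown.

```python
import math

def agglomerative_clusters(points, num_clusters):
    """Greedy bottom-up clustering with average linkage on Euclidean distance."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Average pairwise distance between the members of two clusters.
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > num_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        # Merge the closest pair of clusters.
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]

# Hypothetical tokens with toy 2-D "representations".
TOKENS = ["red", "blue", "green", "ran", "walked"]
VECTORS = [(0.9, 0.1), (1.0, 0.0), (0.95, 0.2), (0.1, 0.9), (0.0, 1.0)]

groups = agglomerative_clusters(VECTORS, num_clusters=2)
concepts = [[TOKENS[i] for i in group] for group in groups]
print(concepts)
```

Each resulting group plays the role of a latent concept (here, colors versus motion verbs) that an annotator, human or LLM, would then label.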

Title: Element-aware Summarization with Large Language Models: Expert-aligned Evaluation and Chain-of-Thought Method. (arXiv:2305.13412v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13412
Code URL: https://github.com/alsace08/sumcot
Copy Paste: [[2305.13412] Element-aware Summarization with Large Language Models: Expert-aligned Evaluation and Chain-of-Thought Method](http://arxiv.org/abs/2305.13412) #large language model
Summary:
Automatic summarization generates concise summaries that contain key ideas of source documents. As the most mainstream datasets for the news sub-domain, CNN/DailyMail and BBC XSum have been widely used for performance benchmarking. However, the reference summaries of those datasets turn out to be noisy, mainly in terms of factual hallucination and information redundancy. To address this challenge, we first annotate new expert-writing Element-aware test sets following the "Lasswell Communication Model" proposed by Lasswell (1948), allowing reference summaries to focus on more fine-grained news elements objectively and comprehensively. Utilizing the new test sets, we observe the surprising zero-shot summary ability of LLMs, which addresses the issue of the inconsistent results between human preference and automatic evaluation metrics of LLMs' zero-shot summaries in prior work. Further, we propose a Summary Chain-of-Thought (SumCoT) technique to elicit LLMs to generate summaries step by step, which helps them integrate more fine-grained details of source documents into the final summaries that correlate with the human writing mindset. Experimental results show our method outperforms state-of-the-art fine-tuned PLMs and zero-shot LLMs by +4.33/+4.77 in ROUGE-L on the two datasets, respectively. Dataset and code are publicly available at https://github.com/Alsace08/SumCoT.

Title: clembench: Using Game Play to Evaluate Chat-Optimized Language Models as Conversational Agents. (arXiv:2305.13455v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13455
Code URL: null
Copy Paste: [[2305.13455] clembench: Using Game Play to Evaluate Chat-Optimized Language Models as Conversational Agents](http://arxiv.org/abs/2305.13455) #large language model
Summary:
Recent work has proposed a methodology for the systematic evaluation of "Situated Language Understanding Agents" (agents that operate in rich linguistic and non-linguistic contexts) through testing them in carefully constructed interactive settings. Other recent work has argued that Large Language Models (LLMs), if suitably set up, can be understood as (simulators of) such agents. A connection suggests itself, which this paper explores: Can LLMs be evaluated meaningfully by exposing them to constrained game-like settings that are built to challenge specific capabilities? As a proof of concept, this paper investigates five interaction settings, showing that current chat-optimised LLMs are, to an extent, capable of following game-play instructions. Both this capability and the quality of the game play, measured by how well the objectives of the different games are met, follow the development cycle, with newer models performing better. The metrics even for the comparatively simple example games are far from being saturated, suggesting that the proposed instrument will continue to have diagnostic value. Our general framework for implementing and evaluating games with LLMs is available at https://github.com/clp-research/clembench.

Title: Can ChatGPT Detect Intent? Evaluating Large Language Models for Spoken Language Understanding. (arXiv:2305.13512v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13512
Code URL: null
Copy Paste: [[2305.13512] Can ChatGPT Detect Intent? Evaluating Large Language Models for Spoken Language Understanding](http://arxiv.org/abs/2305.13512) #large language model
Summary:
Recently, large pretrained language models have demonstrated strong language understanding capabilities. This is particularly reflected in their zero-shot and in-context learning abilities on downstream tasks through prompting. To assess their impact on spoken language understanding (SLU), we evaluate several such models like ChatGPT and OPT of different sizes on multiple benchmarks. We verify the emergent ability unique to the largest models as they can reach intent classification accuracy close to that of supervised models with zero or few shots on various languages given oracle transcripts. By contrast, the results for smaller models fitting a single GPU fall far behind. We note that the error cases often arise from the annotation scheme of the dataset; responses from ChatGPT are still reasonable. We show, however, that the model is worse at slot filling, and its performance is sensitive to ASR errors, suggesting serious challenges for the application of those textual models on SLU.

Title: Understanding Programs by Exploiting (Fuzzing) Test Cases. (arXiv:2305.13592v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.13592
Code URL: null
Copy Paste: [[2305.13592] Understanding Programs by Exploiting (Fuzzing) Test Cases](http://arxiv.org/abs/2305.13592) #large language model
Summary:
Semantic understanding of programs has attracted great attention in the community. Inspired by recent successes of large language models (LLMs) in natural language understanding, tremendous progress has been made by treating programming language as another sort of natural language and training LLMs on corpora of program code. However, programs are essentially different from texts after all, in the sense that they are normally heavily structured and syntax-strict. In particular, programs and their basic units (i.e., functions and subroutines) are designed to demonstrate a variety of behaviors and/or provide possible outputs, given different inputs. The relationship between inputs and possible outputs/behaviors represents the functions/subroutines and profiles the program as a whole. Therefore, we propose to incorporate such a relationship into learning, for achieving a deeper semantic understanding of programs. To obtain inputs that are representative enough to trigger the execution of most parts of the code, we resort to fuzz testing and propose fuzz tuning to boost the performance of program understanding and code representation learning, given a pre-trained LLM. The effectiveness of the proposed method is verified on two program understanding tasks including code clone detection and code classification, and it outperforms current state-of-the-art methods by large margins. Code is available at https://github.com/rabbitjy/FuzzTuning.

Title: Prompting and Evaluating Large Language Models for Proactive Dialogues: Clarification, Target-guided, and Non-collaboration. (arXiv:2305.13626v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13626
Code URL: null
Copy Paste: [[2305.13626] Prompting and Evaluating Large Language Models for Proactive Dialogues: Clarification, Target-guided, and Non-collaboration](http://arxiv.org/abs/2305.13626) #large language model
Summary:
Conversational systems based on Large Language Models (LLMs), such as ChatGPT, show exceptional proficiency in context understanding and response generation. However, despite their impressive capabilities, they still possess limitations, such as providing randomly-guessed answers to ambiguous queries or failing to refuse users' requests, both of which are considered aspects of a conversational agent's proactivity. This raises the question of whether LLM-based conversational systems are equipped to handle proactive dialogue problems. In this work, we conduct a comprehensive analysis of LLM-based conversational systems, specifically focusing on three aspects of proactive dialogue systems: clarification, target-guided, and non-collaborative dialogues. To trigger the proactivity of LLMs, we propose the Proactive Chain-of-Thought prompting scheme, which augments LLMs with the goal planning capability over descriptive reasoning chains. Empirical findings are discussed to promote future studies on LLM-based proactive dialogue systems.

Title: Instruct-Align: Teaching Novel Languages with to LLMs through Alignment-based Cross-Lingual Instruction. (arXiv:2305.13627v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13627
Code URL: null
Copy Paste: [[2305.13627] Instruct-Align: Teaching Novel Languages with to LLMs through Alignment-based Cross-Lingual Instruction](http://arxiv.org/abs/2305.13627) #large language model
Summary:
Instruction-tuned large language models (LLMs) have shown remarkable generalization capability over multiple tasks in multiple languages. Nevertheless, their generalization towards different languages varies especially to underrepresented languages or even to unseen languages. Prior works on adapting new languages to LLMs find that naively adapting new languages to instruction-tuned LLMs will result in catastrophic forgetting, which in turn causes the loss of multitasking ability in these LLMs. To tackle this, we propose the Instruct-Align a.k.a (IA)$^1$ framework, which enables instruction-tuned LLMs to learn cross-lingual alignment between unseen and previously learned languages via alignment-based cross-lingual instruction-tuning. Our preliminary result on BLOOMZ-560M shows that (IA)$^1$ is able to learn a new language effectively with only a limited amount of parallel data and at the same time prevent catastrophic forgetting by applying continual instruction-tuning through experience replay. Our work contributes to the progression of language adaptation methods for instruction-tuned LLMs and opens up the possibility of adapting underrepresented low-resource languages into existing instruction-tuned LLMs. Our code will be publicly released upon acceptance.
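
The experience-replay idea mentioned above can be sketched as mixing each batch of new-language examples with a sample of previously seen examples. This is only an illustration under stated assumptions: `mix_with_replay`, `replay_ratio`, and the uniform sampling strategy are hypothetical and not taken from the paper.

```python
import random


def mix_with_replay(new_examples, replay_buffer, replay_ratio=0.5, seed=0):
    """Return a shuffled batch of new examples plus replayed old examples.

    replay_ratio is the number of replayed examples per new example
    (an assumed knob, not a value reported by the authors).
    """
    rng = random.Random(seed)
    k = min(len(replay_buffer), round(len(new_examples) * replay_ratio))
    batch = list(new_examples) + rng.sample(list(replay_buffer), k)
    rng.shuffle(batch)
    return batch
```

With four new examples and `replay_ratio=0.5`, the batch contains the four new examples plus two drawn from the buffer, so earlier tasks keep appearing during continual instruction-tuning.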

Title: ChatGPT as your Personal Data Scientist. (arXiv:2305.13657v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13657
Code URL: null
Copy Paste: [[2305.13657] ChatGPT as your Personal Data Scientist](http://arxiv.org/abs/2305.13657) #large language model
Summary:
The rise of big data has amplified the need for efficient, user-friendly automated machine learning (AutoML) tools. However, the intricacy of understanding domain-specific data and defining prediction tasks necessitates human intervention making the process time-consuming while preventing full automation. Instead, envision an intelligent agent capable of assisting users in conducting AutoML tasks through intuitive, natural conversations without requiring in-depth knowledge of the underlying machine learning (ML) processes. This agent's key challenge is to accurately comprehend the user's prediction goals and, consequently, formulate precise ML tasks, adjust data sets and model parameters accordingly, and articulate results effectively. In this paper, we take a pioneering step towards this ambitious goal by introducing a ChatGPT-based conversational data-science framework to act as a "personal data scientist". Precisely, we utilize Large Language Models (ChatGPT) to build a natural interface between the users and the ML models (Scikit-Learn), which in turn, allows us to approach this ambitious problem with a realistic solution.

Our model pivots around four dialogue states: Data Visualization, Task Formulation, Prediction Engineering, and Result Summary and Recommendation. Each state marks a unique conversation phase, impacting the overall user-system interaction. Multiple LLM instances, serving as "micro-agents", ensure a cohesive conversation flow, granting us granular control over the conversation's progression. In summary, we developed an end-to-end system that not only proves the viability of the novel concept of conversational data science but also underscores the potency of LLMs in solving complex tasks. Interestingly, its development spotlighted several critical weaknesses in the current LLMs (ChatGPT) and highlighted substantial opportunities for improvement.

Title: Prompt-Based Monte-Carlo Tree Search for Goal-Oriented Dialogue Policy Planning. (arXiv:2305.13660v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13660
Code URL: null
Copy Paste: [[2305.13660] Prompt-Based Monte-Carlo Tree Search for Goal-Oriented Dialogue Policy Planning](http://arxiv.org/abs/2305.13660) #large language model
Summary:
Planning for goal-oriented dialogue often requires simulating future dialogue interactions and estimating task progress. Many approaches thus consider training neural networks to perform look-ahead search algorithms such as A* search and Monte Carlo Tree Search (MCTS). However, this training often requires abundant annotated data, which creates challenges when faced with noisy annotations or low-resource settings. We introduce GDP-Zero, an approach using Open-Loop MCTS to perform goal-oriented dialogue policy planning without any model training. GDP-Zero prompts a large language model to act as a policy prior, value function, user simulator, and system model during the tree search. We evaluate GDP-Zero on the goal-oriented task PersuasionForGood, and find that its responses are preferred over ChatGPT up to 59.32% of the time, and are rated more persuasive than ChatGPT during interactive evaluations.

Title: On the Risk of Misinformation Pollution with Large Language Models. (arXiv:2305.13661v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13661
Code URL: null
Copy Paste: [[2305.13661] On the Risk of Misinformation Pollution with Large Language Models](http://arxiv.org/abs/2305.13661) #large language model
Summary:
In this paper, we comprehensively investigate the potential misuse of modern Large Language Models (LLMs) for generating credible-sounding misinformation and its subsequent impact on information-intensive applications, particularly Open-Domain Question Answering (ODQA) systems. We establish a threat model and simulate potential misuse scenarios, both unintentional and intentional, to assess the extent to which LLMs can be utilized to produce misinformation. Our study reveals that LLMs can act as effective misinformation generators, leading to a significant degradation in the performance of ODQA systems. To mitigate the harm caused by LLM-generated misinformation, we explore three defense strategies: prompting, misinformation detection, and majority voting. While initial results show promising trends for these defensive strategies, much more work needs to be done to address the challenge of misinformation pollution. Our work highlights the need for further research and interdisciplinary collaboration to address LLM-generated misinformation and to promote responsible use of LLMs.
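
Of the three defenses listed above, majority voting is the simplest to illustrate. The sketch below assumes the QA reader has already produced one answer string per retrieved passage (or passage subset) and picks the most frequent answer after light normalization; the function name and the normalization rule are assumptions, not the paper's implementation.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer after lowercasing and trimming.

    Ties are broken in favor of the answer that appears first.
    Returns None when no non-empty answer is given.
    """
    normalized = [a.strip().lower() for a in answers if a.strip()]
    if not normalized:
        return None
    return Counter(normalized).most_common(1)[0][0]
```

The intuition is that a small number of polluted passages is outvoted as long as most retrieved passages still support the correct answer.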

Title: Efficient Open Domain Multi-Hop Question Answering with Few-Shot Data Synthesis. (arXiv:2305.13691v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13691
Code URL: null
Copy Paste: [[2305.13691] Efficient Open Domain Multi-Hop Question Answering with Few-Shot Data Synthesis](http://arxiv.org/abs/2305.13691) #large language model
Summary:
Few-shot learning for open domain multi-hop question answering typically relies on large language models (LLMs). While powerful, LLMs are inefficient at inference time. We propose a data synthesis framework for multi-hop question answering that allows for improving smaller language models with fewer than 10 human-annotated question answer pairs. The framework is built upon the data generation functions parameterized by LLMs and prompts, which requires minimal hand-crafted features. Empirically, we synthesize millions of multi-hop questions and claims. After finetuning language models on the synthetic data, we evaluate the models on popular benchmarks on multi-hop question answering and fact verification. Our experimental results show that finetuning on the synthetic data improves model performance significantly, allowing our finetuned models to be competitive with prior models while being almost one-third the size in terms of parameter counts.

Title: Exploring Large Language Models for Classical Philology. (arXiv:2305.13698v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13698
Code URL: null
Copy Paste: [[2305.13698] Exploring Large Language Models for Classical Philology](http://arxiv.org/abs/2305.13698) #large language model
Summary:
Recent advances in NLP have led to the creation of powerful language models for many languages including Ancient Greek and Latin. While prior work on Classical languages unanimously uses BERT, in this work we create four language models for Ancient Greek that vary along two dimensions to study their versatility for tasks of interest for Classical languages: we explore (i) encoder-only and encoder-decoder architectures using RoBERTa and T5 as strong model types, and create for each of them (ii) a monolingual Ancient Greek and a multilingual instance that includes Latin and English. We evaluate all models on morphological and syntactic tasks, including lemmatization, which demonstrates the added value of T5's decoding abilities. We further define two probing tasks to investigate the knowledge acquired by models pre-trained on Classical texts. Our experiments provide the first benchmarking analysis of existing models of Ancient Greek. Results show that our models provide significant improvements over the SoTA. The systematic analysis of model types can inform future research in designing language models for Classical languages, including the development of novel generative tasks. We make all our models available as community resources, along with a large curated pre-training corpus for Ancient Greek, to support the creation of a larger, comparable model zoo for Classical Philology. Our models and resources are available at https://github.com/Heidelberg-NLP/ancient-language-models.

Title: LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models. (arXiv:2305.13711v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13711
Code URL: null
Copy Paste: [[2305.13711] LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models](http://arxiv.org/abs/2305.13711) #large language model
Summary:
We propose LLM-Eval, a unified multi-dimensional automatic evaluation method for open-domain conversations with large language models (LLMs). Existing evaluation methods often rely on human annotations, ground-truth responses, or multiple LLM prompts, which can be expensive and time-consuming. To address these issues, we design a single prompt-based evaluation method that leverages a unified evaluation schema to cover multiple dimensions of conversation quality in a single model call. We extensively evaluate the performance of LLM-Eval on various benchmark datasets, demonstrating its effectiveness, efficiency, and adaptability compared to state-of-the-art evaluation methods. Our analysis also highlights the importance of choosing suitable LLMs and decoding strategies for accurate evaluation results. LLM-Eval offers a versatile and robust solution for evaluating open-domain conversation systems, streamlining the evaluation process and providing consistent performance across diverse scenarios.

Title: Knowledge of Knowledge: Exploring Known-Unknowns Uncertainty with Large Language Models. (arXiv:2305.13712v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13712
Code URL: null
Copy Paste: [[2305.13712] Knowledge of Knowledge: Exploring Known-Unknowns Uncertainty with Large Language Models](http://arxiv.org/abs/2305.13712) #large language model
Summary:
This paper investigates the capabilities of Large Language Models (LLMs) in the context of understanding their own knowledge and measuring their uncertainty. We argue this is an important feature for mitigating hallucinations. Specifically, we focus on addressing known-unknown questions, characterized by high uncertainty due to the absence of definitive answers. To facilitate our study, we collect a dataset with new Known-Unknown Questions (KUQ) and propose a novel categorization scheme to elucidate the sources of uncertainty. Subsequently, we assess the LLMs' ability to differentiate between known and unknown questions and classify them accordingly. Moreover, we evaluate the quality of their answers in an Open-Ended QA setting. To quantify the uncertainty expressed in the answers, we create a semantic evaluation method that measures the model's accuracy in expressing uncertainty between known vs unknown questions.

Title: LogicLLM: Exploring Self-supervised Logic-enhanced Training for Large Language Models. (arXiv:2305.13718v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13718
Code URL: null
Copy Paste: [[2305.13718] LogicLLM: Exploring Self-supervised Logic-enhanced Training for Large Language Models](http://arxiv.org/abs/2305.13718) #large language model
Summary:
Existing efforts to improve logical reasoning ability of language models have predominantly relied on supervised fine-tuning, hindering generalization to new domains and/or tasks. The development of Large Language Models (LLMs) has demonstrated the capacity of compressing abundant knowledge into a single proxy, enabling them to tackle multiple tasks effectively. Our preliminary experiments, nevertheless, show that LLMs do not show capability on logical reasoning. The performance of LLMs on logical reasoning benchmarks is far behind the existing state-of-the-art baselines. In this paper, we make the first attempt to investigate the feasibility of incorporating logical knowledge through self-supervised post-training, and activating it via in-context learning, which we term LogicLLM. Specifically, we devise an auto-regressive objective variant of MERIt and integrate it with two LLM series, i.e., FLAN-T5 and LLaMA, with parameter size ranging from 3 billion to 13 billion. The results on two challenging logical reasoning benchmarks demonstrate the effectiveness of LogicLLM. Besides, we conduct extensive ablation studies to analyze the key factors in designing logic-oriented proxy tasks.

Title: Self-Critique Prompting with Large Language Models for Inductive Instructions. (arXiv:2305.13733v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13733
Code URL: null
Copy Paste: [[2305.13733] Self-Critique Prompting with Large Language Models for Inductive Instructions](http://arxiv.org/abs/2305.13733) #large language model
Summary:
Numerous works have been proposed to improve or evaluate the capabilities of Large Language Models (LLMs) to fulfill user instructions. However, they neglect the possibility that user inputs may inherently contain incorrect information due to users' false beliefs or malicious intents. In this way, blindly adhering to users' false content will cause deception and harm. To address this problem, we propose a challenging benchmark consisting of Inductive Instructions (INDust) to evaluate whether LLMs could resist these instructions. The INDust includes 15K instructions across three categories: Fact-Checking Instructions, Questions based on False Premises, and Creative Instructions based on False Premises. Our experiments on several strong LLMs reveal that current LLMs can be easily deceived by INDust into generating misleading and malicious statements. Hence we employ Self-Critique prompting to encourage LLMs to not only critique themselves like in previous works but also the users, which shows remarkable improvement in handling inductive instructions under both zero-shot and few-shot settings.

Title: Aligning Large Language Models through Synthetic Feedback. (arXiv:2305.13735v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13735
Code URL: null
Copy Paste: [[2305.13735] Aligning Large Language Models through Synthetic Feedback](http://arxiv.org/abs/2305.13735) #large language model
Summary:
Aligning large language models (LLMs) to human values has become increasingly important as it enables sophisticated steering of LLMs, e.g., making them follow given instructions while keeping them less toxic. However, it requires a significant amount of human demonstrations and feedback. Recently, open-sourced models have attempted to replicate the alignment learning process by distilling data from already aligned LLMs like InstructGPT or ChatGPT. While this process reduces human efforts, constructing these datasets has a heavy dependency on the teacher models. In this work, we propose a novel framework for alignment learning with almost no human labor and no dependency on pre-aligned LLMs. First, we perform reward modeling (RM) with synthetic feedback by contrasting responses from vanilla LLMs with various sizes and prompts. Then, we use the RM for simulating high-quality demonstrations to train a supervised policy and for further optimizing the model with reinforcement learning. Our resulting model, Aligned Language Model with Synthetic Training dataset (ALMoST), outperforms open-sourced models, including Alpaca, Dolly, and OpenAssistant, which are trained on the outputs of InstructGPT or human-annotated instructions. Our 7B-sized model outperforms the 12-13B models in the A/B tests using GPT-4 as the judge with about 75% winning rate on average.

Title: Images in Language Space: Exploring the Suitability of Large Language Models for Vision & Language Tasks. (arXiv:2305.13782v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13782
Code URL: null
Copy Paste: [[2305.13782] Images in Language Space: Exploring the Suitability of Large Language Models for Vision & Language Tasks](http://arxiv.org/abs/2305.13782) #large language model
Summary:
Large language models have demonstrated robust performance on various language tasks using zero-shot or few-shot learning paradigms. While being actively researched, multimodal models that can additionally handle images as input have yet to catch up in size and generality with language-only models. In this work, we ask whether language-only models can be utilised for tasks that require visual input -- but also, as we argue, often require a strong reasoning component. Similar to some recent related work, we make visual information accessible to the language model using separate verbalisation models. Specifically, we investigate the performance of open-source, open-access language models against GPT-3 on five vision-language tasks when given textually-encoded visual information. Our results suggest that language models are effective for solving vision-language tasks even with limited samples. This approach also enhances the interpretability of a model's output by providing a means of tracing the output back through the verbalised image content.

Title: Can Large Language Models Infer and Disagree Like Humans?. (arXiv:2305.13788v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13788
Code URL: null
Copy Paste: [[2305.13788] Can Large Language Models Infer and Disagree Like Humans?](http://arxiv.org/abs/2305.13788) #large language model
Summary:
Large Language Models (LLMs) have shown stellar achievements in solving a broad range of tasks. When generating text, it is common to sample tokens from these models: whether LLMs closely align with the human disagreement distribution has not been well-studied, especially within the scope of Natural Language Inference (NLI). In this paper, we evaluate the performance and alignment of LLM distribution with humans using two different techniques: Monte Carlo Reconstruction (MCR) and Log Probability Reconstruction (LPR). As a result, we show LLMs exhibit limited ability in solving NLI tasks and simultaneously fail to capture human disagreement distribution, raising concerns about their natural language understanding (NLU) ability and their representativeness of human users.
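
The two reconstruction techniques named above can be sketched as follows. The abstract does not define them, so this assumes that Monte Carlo Reconstruction estimates the label distribution from the frequencies of repeatedly sampled outputs, and that Log Probability Reconstruction normalizes the model's log probabilities for each label; `LABELS` and both function names are hypothetical.

```python
import math
from collections import Counter

# Assumed three-way NLI label set.
LABELS = ("entailment", "neutral", "contradiction")


def monte_carlo_reconstruction(samples):
    """Estimate a label distribution from sampled model outputs.

    Outputs that are not valid labels are ignored.
    """
    counts = Counter(s for s in samples if s in LABELS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no valid label among the samples")
    return {label: counts[label] / total for label in LABELS}


def log_prob_reconstruction(label_logprobs):
    """Softmax-normalize per-label log probabilities into a distribution."""
    peak = max(label_logprobs[label] for label in LABELS)
    exps = {label: math.exp(label_logprobs[label] - peak) for label in LABELS}
    z = sum(exps.values())
    return {label: exps[label] / z for label in LABELS}
```

Either estimate can then be compared with the distribution of human annotator labels for the same item, for instance with a divergence measure.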

Title: "Is the Pope Catholic?" Applying Chain-of-Thought Reasoning to Understanding Conversational Implicatures. (arXiv:2305.13826v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13826
Code URL: null
Copy Paste: [[2305.13826] "Is the Pope Catholic?" Applying Chain-of-Thought Reasoning to Understanding Conversational Implicatures](http://arxiv.org/abs/2305.13826) #large language model
Summary:
Conversational implicatures are pragmatic inferences that require listeners to deduce the intended meaning conveyed by a speaker from their explicit utterances. Although such inferential reasoning is fundamental to human communication, recent research indicates that large language models struggle to comprehend these implicatures as effectively as the average human. This paper demonstrates that by incorporating Grice's Four Maxims into the model through chain-of-thought prompting, we can significantly enhance its performance, surpassing even the average human performance on this task.

Title: Learn from Mistakes through Cooperative Interaction with Study Assistant. (arXiv:2305.13829v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13829
Code URL: null
Copy Paste: [[2305.13829] Learn from Mistakes through Cooperative Interaction with Study Assistant](http://arxiv.org/abs/2305.13829) #large language model
Summary:
Large language models have demonstrated their ability to self-reflect and refine their generation, which can further improve their performance. However, this feedback mechanism faces challenges such as no guarantee of correctness and the lack of global insight into the model's weaknesses. In this paper, we propose a novel framework, Study Assistant for Large Language Model (SALAM), to aid LLMs in the reflection and refinement process. Motivated by the human study assistant, this framework grades previous responses with the ground truth and collects mistakes in the training phase. During inference, it identifies common misunderstandings based on the mistake collections and provides guidelines that help the model avoid similar mistakes. SALAM is a model-agnostic framework, focusing on providing general feedback and can adapt to any base model. Our evaluation of SALAM on two challenging benchmarks demonstrated a significant improvement over various baselines.

Title: PaD: Program-aided Distillation Specializes Large Models in Reasoning. (arXiv:2305.13888v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13888
Code URL: null
Copy Paste: [[2305.13888] PaD: Program-aided Distillation Specializes Large Models in Reasoning](http://arxiv.org/abs/2305.13888) #large language model
Summary:
While Large Language Models (LLMs) excel in several natural language processing tasks, their size and inaccessibility present challenges for extensive practical application. Previous studies acquire specialized skills through distillation on LLMs, which results in trading away generic abilities, called model specialization. As for reasoning ability, chain-of-thought was synthesized for subsequent distillation. However, due to hallucination, synthetic chain-of-thought from LLMs contains faulty reasoning. These incorrect reasoning steps damage the reasoning capability. To tackle the above issues, we propose Program-aided Distillation (PaD), which distills LLMs to obtain specialized small models in reasoning tasks. In PaD, we strengthen specialized models with program-aided reasoning, and help them overcome faulty reasoning steps with automated error checking. Experimental results demonstrate that, on the GSM8K benchmark, a 0.06B model using PaD can not only outperform certain LLMs (e.g., LLaMA), but also achieve a 10% improvement over baselines with a significantly smaller scale of parameters and data. Data pruning analysis reveals that PaD possesses higher training efficiency.
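
A rough sketch of what automated error checking for program-aided reasoning could look like: each synthesized rationale is a small Python program, and only programs that run cleanly and reproduce the known answer are kept as distillation data. The `solution()` convention, the helper names, and the tolerance are assumptions for illustration and do not describe the authors' pipeline.

```python
def run_program(source):
    """Execute a candidate program that defines solution().

    Returns (True, value) on success or (False, error_text) on any failure.
    Model-generated code should be sandboxed in practice; plain exec is
    used here only to keep the sketch short.
    """
    namespace = {}
    try:
        exec(source, namespace)
        return True, namespace["solution"]()
    except Exception as exc:
        return False, repr(exc)


def filter_programs(candidates, gold_answer, tol=1e-6):
    """Keep candidate programs that run and match the gold numeric answer."""
    kept = []
    for source in candidates:
        ok, value = run_program(source)
        if ok and isinstance(value, (int, float)) and abs(value - gold_answer) <= tol:
            kept.append(source)
    return kept
```

Unlike free-text chain-of-thought, a program either executes and yields the right number or it does not, which is what makes this kind of filtering automatic.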

Title: Generating Data for Symbolic Language with Large Language Models. (arXiv:2305.13917v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13917
Code URL: null
Copy Paste: [[2305.13917] Generating Data for Symbolic Language with Large Language Models](http://arxiv.org/abs/2305.13917) #large language model
Summary:
While large language models (LLMs) bring not only performance but also complexity, recent work has started to turn LLMs into data generators rather than task inferencers, where another affordable task model is trained for efficient deployment and inference. However, such an approach has primarily been applied to natural language tasks and has not yet been explored for symbolic language tasks with complex structured outputs (e.g., semantic parsing and code generation). In this paper, we propose SymGen which utilizes LLMs for generating various annotation-expensive symbolic language data. SymGen consists of an informative prompt to steer generation and an agreement-based verifier to improve data correctness. We conduct extensive experiments on six symbolic language tasks across various settings. Compared with the LLMs, we demonstrate the 1%-sized task model can achieve comparable or better performance, largely cutting inference and deployment costs. We also show that generated data with only a few human demonstrations can be as effective as over 10 times the amount of human-annotated data when training the task model, saving a considerable amount of annotation effort. SymGen sheds new light on data generation for complex tasks, and we release the code at https://github.com/HKUNLP/SymGen.

segmentation

Title: VDD: Varied Drone Dataset for Semantic Segmentation. (arXiv:2305.13608v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.13608
Code URL: null
Copy Paste: [[2305.13608] VDD: Varied Drone Dataset for Semantic Segmentation](http://arxiv.org/abs/2305.13608) #segmentation
Summary:
Semantic segmentation of drone images is critical to many aerial vision tasks as it provides essential semantic details that can compensate for the lack of depth information from monocular cameras. However, maintaining high accuracy of semantic segmentation models for drones requires diverse, large-scale, and high-resolution datasets, which are rare in the field of aerial image processing. Existing datasets are typically small and focus primarily on urban scenes, neglecting rural and industrial areas. Models trained on such datasets are not sufficiently equipped to handle the variety of inputs seen in drone imagery. In VDD (Varied Drone Dataset), we offer a large-scale and densely labeled dataset comprising 400 high-resolution images that feature carefully chosen scenes, camera angles, and varied light and weather conditions. Furthermore, we have adapted existing drone datasets to conform to our annotation standards and integrated them with VDD to create a dataset 1.5 times the size of the fine annotation of Cityscapes. We have developed a novel DeepLabT model, which combines CNN and Transformer backbones, to provide a reliable baseline for semantic segmentation in drone imagery. Our experiments indicate that DeepLabT performs admirably on VDD and other drone datasets. We expect that our dataset will generate considerable interest in drone image segmentation and serve as a foundation for other drone vision tasks. VDD is freely available on our website at https://vddvdd.com.

Title: Pulling Target to Source: A New Perspective on Domain Adaptive Semantic Segmentation. (arXiv:2305.13752v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.13752
Code URL: null
Copy Paste: [[2305.13752] Pulling Target to Source: A New Perspective on Domain Adaptive Semantic Segmentation](http://arxiv.org/abs/2305.13752) #segmentation
Summary:
Domain adaptive semantic segmentation aims to transfer knowledge from a labeled source domain to an unlabeled target domain. However, existing methods primarily focus on directly learning qualified target features, making it challenging to guarantee their discrimination in the absence of target labels. This work provides a new perspective. We observe that the features learned with source data manage to keep categorically discriminative during training, thereby enabling us to implicitly learn adequate target representations by simply pulling target features close to source features for each category. To this end, we propose T2S-DA, which we interpret as a form of pulling Target to Source for Domain Adaptation, encouraging the model in learning similar cross-domain features. Also, considering the pixel categories are heavily imbalanced for segmentation datasets, we come up with a dynamic re-weighting strategy to help the model concentrate on those underperforming classes. Extensive experiments confirm that T2S-DA learns a more discriminative and generalizable representation, significantly surpassing the state-of-the-art. We further show that our method is quite qualified for the domain generalization task, verifying its domain-invariant property.
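
The idea of pulling target features toward source features per category can be illustrated with class prototypes. The sketch below is a simplified stand-in, not the T2S-DA objective itself: it assumes source prototypes are per-class feature means, target pixels carry pseudo-labels, and the loss is a mean squared distance; all names are hypothetical.

```python
def class_prototypes(features, labels):
    """Mean source feature vector per class label."""
    sums, counts = {}, {}
    for feat, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(feat)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], feat)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def pull_loss(target_features, target_pseudo_labels, prototypes):
    """Mean squared distance between target features and the source
    prototype of their (pseudo-)class; unknown classes are skipped."""
    total, n = 0.0, 0
    for feat, label in zip(target_features, target_pseudo_labels):
        if label not in prototypes:
            continue
        total += sum((a - b) ** 2 for a, b in zip(feat, prototypes[label]))
        n += 1
    return total / n if n else 0.0
```

A per-class weight could be multiplied into each term to mimic the dynamic re-weighting toward underperforming classes, though the abstract does not specify how those weights are computed.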

Title: MIANet: Aggregating Unbiased Instance and General Information for Few-Shot Semantic Segmentation. (arXiv:2305.13864v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.13864
Code URL: https://github.com/aldrich2y/mianet
Copy Paste: [[2305.13864] MIANet: Aggregating Unbiased Instance and General Information for Few-Shot Semantic Segmentation](http://arxiv.org/abs/2305.13864) #segmentation
Summary:
Existing few-shot segmentation methods are based on the meta-learning strategy and extract instance knowledge from a support set and then apply the knowledge to segment target objects in a query set. However, the extracted knowledge is insufficient to cope with the variable intra-class differences since the knowledge is obtained from a few samples in the support set. To address the problem, we propose a multi-information aggregation network (MIANet) that effectively leverages the general knowledge, i.e., semantic word embeddings, and instance information for accurate segmentation. Specifically, in MIANet, a general information module (GIM) is proposed to extract a general class prototype from word embeddings as a supplement to instance information. To this end, we design a triplet loss that treats the general class prototype as an anchor and samples positive-negative pairs from local features in the support set. The calculated triplet loss can transfer semantic similarities among language identities from a word embedding space to a visual representation space. To alleviate the model's bias towards the seen training classes and to obtain multi-scale information, we then introduce a non-parametric hierarchical prior module (HPM) to generate unbiased instance-level information by calculating the pixel-level similarity between the support and query image features. Finally, an information fusion module (IFM) combines the general and instance information to make predictions for the query image. Extensive experiments on PASCAL-$5^i$ and COCO-$20^i$ show that MIANet yields superior performance and sets a new state-of-the-art. Code is available at https://github.com/Aldrich2y/MIANet.
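
The triplet loss described above can be sketched as follows. This is an illustrative reading, not the MIANet implementation: it assumes Euclidean distance, a hinge with a hypothetical `margin`, and that positives and negatives have already been sampled from the support features (the abstract does not say how).

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positives, negatives, margin=0.5):
    """Hinge triplet loss averaged over every positive/negative pair.

    anchor    : general class prototype (e.g. projected from a word embedding)
    positives : local support features assumed to belong to the class
    negatives : local support features assumed to belong to the background
    """
    losses = [
        max(0.0, euclidean(anchor, p) - euclidean(anchor, n) + margin)
        for p in positives
        for n in negatives
    ]
    return sum(losses) / len(losses) if losses else 0.0


# Toy usage: the positive is farther from the anchor than the negative,
# so the loss is positive and would push the features apart in training.
loss = triplet_loss([0.0, 0.0], [[3.0, 4.0]], [[0.0, 1.0]])
```

The loss is zero once every positive is closer to the anchor than every negative by at least the margin, which is what ties the visual features to the word-embedding prototype.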

Title: Topic-driven Distant Supervision Framework for Macro-level Discourse Parsing. (arXiv:2305.13755v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.13755
Code URL: null
Copy Paste: [[2305.13755] Topic-driven Distant Supervision Framework for Macro-level Discourse Parsing](http://arxiv.org/abs/2305.13755) #segmentation
Summary:
Discourse parsing, the task of analyzing the internal rhetorical structure of texts, is a challenging problem in natural language processing. Despite the recent advances in neural models, the lack of large-scale, high-quality corpora for training remains a major obstacle. Recent studies have attempted to overcome this limitation by using distant supervision, which utilizes results from other NLP tasks (e.g., sentiment polarity, attention matrix, and segmentation probability) to parse discourse trees. However, these methods do not take into account the differences between in-domain and out-of-domain tasks, resulting in lower performance and an inability to leverage the high-quality in-domain data for further improvement. To address these issues, we propose a distant supervision framework that leverages the relations between topic structure and rhetorical structure. Specifically, we propose two distantly supervised methods, based on transfer learning and the teacher-student model, that narrow the gap between in-domain and out-of-domain tasks through label mapping and oracle annotation. Experimental results on the MCDTB and RST-DT datasets show that our methods achieve the best performance in both distantly supervised and supervised scenarios.