secure

Title: From Zero to Hero: Detecting Leaked Data through Synthetic Data Injection and Model Querying. (arXiv:2310.04145v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2310.04145
Code URL: null
Copy Paste: [[2310.04145]] From Zero to Hero: Detecting Leaked Data through Synthetic Data Injection and Model Querying(http://arxiv.org/abs/2310.04145)
Summary:
Safeguarding the Intellectual Property (IP) of data has become critically important as machine learning applications continue to proliferate, and their success heavily relies on the quality of training data. While various mechanisms exist to secure data during storage, transmission, and consumption, fewer studies have been developed to detect whether they are already leaked for model training without authorization. This issue is particularly challenging due to the absence of information and control over the training process conducted by potential attackers.

In this paper, we concentrate on the domain of tabular data and introduce a novel methodology, Local Distribution Shifting Synthesis (\textsc{LDSS}), to detect leaked data that are used to train classification models. The core concept behind \textsc{LDSS} involves injecting a small volume of synthetic data--characterized by local shifts in class distribution--into the owner's dataset. This enables the effective identification of models trained on leaked data through model querying alone, as the synthetic data injection results in a pronounced disparity in the predictions of models trained on leaked and modified datasets. \textsc{LDSS} is \emph{model-oblivious} and hence compatible with a diverse range of classification models, such as Naive Bayes, Decision Tree, and Random Forest. We have conducted extensive experiments on seven types of classification models across five real-world datasets. The comprehensive results affirm the reliability, robustness, fidelity, security, and efficiency of \textsc{LDSS}.

security

Title: Integrating Audio-Visual Features for Multimodal Deepfake Detection. (arXiv:2310.03827v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2310.03827
Code URL: null
Copy Paste: [[2310.03827]] Integrating Audio-Visual Features for Multimodal Deepfake Detection(http://arxiv.org/abs/2310.03827)
Summary:
Deepfakes are AI-generated media in which an image or video has been digitally modified. The advancements made in deepfake technology have led to privacy and security issues. Most deepfake detection techniques rely on the detection of a single modality. Existing methods for audio-visual detection do not always surpass that of the analysis based on single modalities. Therefore, this paper proposes an audio-visual-based method for deepfake detection, which integrates fine-grained deepfake identification with binary classification. We categorize the samples into four types by combining labels specific to each single modality. This method enhances the detection under intra-domain and cross-domain testing.

Title: Hermes: Unlocking Security Analysis of Cellular Network Protocols by Synthesizing Finite State Machines from Natural Language Specifications. (arXiv:2310.04381v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2310.04381
Code URL: https://github.com/synsec-den/hermes-spec-to-fsm
Copy Paste: [[2310.04381]] Hermes: Unlocking Security Analysis of Cellular Network Protocols by Synthesizing Finite State Machines from Natural Language Specifications(http://arxiv.org/abs/2310.04381)
Summary:
In this paper, we present Hermes, an end-to-end framework to automatically generate formal representations from natural language cellular specifications. We first develop a neural constituency parser, NEUTREX, to process transition-relevant texts and extract transition components (i.e., states, conditions, and actions). We also design a domain-specific language to translate these transition components to logical formulas by leveraging dependency parse trees. Finally, we compile these logical formulas to generate transitions and create the formal model as finite state machines. To demonstrate the effectiveness of Hermes, we evaluate it on 4G NAS, 5G NAS, and 5G RRC specifications and obtain an overall accuracy of 81-87%, which is a substantial improvement over the state-of-the-art. Our security analysis of the extracted models uncovers 3 new vulnerabilities and identifies 19 previous attacks in 4G and 5G specifications, and 7 deviations in commercial 4G basebands.

Title: LaTeX: Language Pattern-aware Triggering Event Detection for Adverse Experience during Pandemics. (arXiv:2310.03941v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2310.03941
Code URL: null
Copy Paste: [[2310.03941]] LaTeX: Language Pattern-aware Triggering Event Detection for Adverse Experience during Pandemics(http://arxiv.org/abs/2310.03941)
Summary:
The COVID-19 pandemic has accentuated socioeconomic disparities across various racial and ethnic groups in the United States. While previous studies have utilized traditional survey methods like the Household Pulse Survey (HPS) to elucidate these disparities, this paper explores the role of social media platforms in both highlighting and addressing these challenges. Drawing from real-time data sourced from Twitter, we analyzed language patterns related to four major types of adverse experiences: loss of employment income (LI), food scarcity (FS), housing insecurity (HI), and unmet needs for mental health services (UM). We first formulate a sparsity optimization problem that extracts low-level language features from social media data sources. Second, we propose novel constraints on feature similarity exploiting prior knowledge about the similarity of the language patterns among the adverse experiences. The proposed problem is challenging to solve due to the non-convexity objective and non-smoothness penalties. We develop an algorithm based on the alternating direction method of multipliers (ADMM) framework to solve the proposed formulation. Extensive experiments and comparisons to other models on real-world social media and the detection of adverse experiences justify the efficacy of our model.

privacy

Title: Alice Benchmarks: Connecting Real World Object Re-Identification with the Synthetic. (arXiv:2310.04416v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2310.04416
Code URL: null
Copy Paste: [[2310.04416]] Alice Benchmarks: Connecting Real World Object Re-Identification with the Synthetic(http://arxiv.org/abs/2310.04416)
Summary:
For object re-identification (re-ID), learning from synthetic data has become a promising strategy to cheaply acquire large-scale annotated datasets and effective models, with few privacy concerns. Many interesting research problems arise from this strategy, e.g., how to reduce the domain gap between synthetic source and real-world target. To facilitate developing more new approaches in learning from synthetic data, we introduce the Alice benchmarks, large-scale datasets providing benchmarks as well as evaluation protocols to the research community. Within the Alice benchmarks, two object re-ID tasks are offered: person and vehicle re-ID. We collected and annotated two challenging real-world target datasets: AlicePerson and AliceVehicle, captured under various illuminations, image resolutions, etc. As an important feature of our real target, the clusterability of its training set is not manually guaranteed to make it closer to a real domain adaptation test scenario. Correspondingly, we reuse existing PersonX and VehicleX as synthetic source domains. The primary goal is to train models from synthetic data that can work effectively in the real world. In this paper, we detail the settings of Alice benchmarks, provide an analysis of existing commonly-used domain adaptation methods, and discuss some interesting future directions. An online server will be set up for the community to evaluate methods conveniently and fairly.

Title: PrIeD-KIE: Towards Privacy Preserved Document Key Information Extraction. (arXiv:2310.03777v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2310.03777
Code URL: null
Copy Paste: [[2310.03777]] PrIeD-KIE: Towards Privacy Preserved Document Key Information Extraction(http://arxiv.org/abs/2310.03777)
Summary:
In this paper, we introduce strategies for developing private Key Information Extraction (KIE) systems by leveraging large pretrained document foundation models in conjunction with differential privacy (DP), federated learning (FL), and Differentially Private Federated Learning (DP-FL). Through extensive experimentation on six benchmark datasets (FUNSD, CORD, SROIE, WildReceipts, XFUND, and DOCILE), we demonstrate that large document foundation models can be effectively fine-tuned for the KIE task under private settings to achieve adequate performance while maintaining strong privacy guarantees. Moreover, by thoroughly analyzing the impact of various training and model parameters on model performance, we propose simple yet effective guidelines for achieving an optimal privacy-utility trade-off for the KIE task under global DP. Finally, we introduce FeAm-DP, a novel DP-FL algorithm that enables efficiently upscaling global DP from a standalone context to a multi-client federated environment. We conduct a comprehensive evaluation of the algorithm across various client and privacy settings, and demonstrate its capability to achieve comparable performance and privacy guarantees to standalone DP, even when accommodating an increasing number of participating clients. Overall, our study offers valuable insights into the development of private KIE systems, and highlights the potential of document foundation models for privacy-preserved Document AI applications. To the best of authors' knowledge, this is the first work that explores privacy preserved document KIE using document foundation models.

Title: Chameleon: Increasing Label-Only Membership Leakage with Adaptive Poisoning. (arXiv:2310.03838v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2310.03838
Code URL: null
Copy Paste: [[2310.03838]] Chameleon: Increasing Label-Only Membership Leakage with Adaptive Poisoning(http://arxiv.org/abs/2310.03838)
Summary:
The integration of machine learning (ML) in numerous critical applications introduces a range of privacy concerns for individuals who provide their datasets for model training. One such privacy risk is Membership Inference (MI), in which an attacker seeks to determine whether a particular data sample was included in the training dataset of a model. Current state-of-the-art MI attacks capitalize on access to the model's predicted confidence scores to successfully perform membership inference, and employ data poisoning to further enhance their effectiveness. In this work, we focus on the less explored and more realistic label-only setting, where the model provides only the predicted label on a queried sample. We show that existing label-only MI attacks are ineffective at inferring membership in the low False Positive Rate (FPR) regime. To address this challenge, we propose a new attack Chameleon that leverages a novel adaptive data poisoning strategy and an efficient query selection method to achieve significantly more accurate membership inference than existing label-only attacks, especially at low FPRs.

Title: Learning via Look-Alike Clustering: A Precise Analysis of Model Generalization. (arXiv:2310.04015v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2310.04015
Code URL: null
Copy Paste: [[2310.04015]] Learning via Look-Alike Clustering: A Precise Analysis of Model Generalization(http://arxiv.org/abs/2310.04015)
Summary:
While personalized recommendations systems have become increasingly popular, ensuring user data protection remains a paramount concern in the development of these learning systems. A common approach to enhancing privacy involves training models using anonymous data rather than individual data. In this paper, we explore a natural technique called \emph{look-alike clustering}, which involves replacing sensitive features of individuals with the cluster's average values. We provide a precise analysis of how training models using anonymous cluster centers affects their generalization capabilities. We focus on an asymptotic regime where the size of the training set grows in proportion to the features dimension. Our analysis is based on the Convex Gaussian Minimax Theorem (CGMT) and allows us to theoretically understand the role of different model components on the generalization error. In addition, we demonstrate that in certain high-dimensional regimes, training over anonymous cluster centers acts as a regularization and improves generalization error of the trained models. Finally, we corroborate our asymptotic theory with finite-sample numerical experiments where we observe a perfect match when the sample size is only of order of a few hundreds.

protect

defense

attack

Title: HDNA: A graph-based change detection in HTML pages(Deface Attack Detection). (arXiv:2310.03891v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2310.03891
Code URL: null
Copy Paste: [[2310.03891]] HDNA: A graph-based change detection in HTML pages(Deface Attack Detection)(http://arxiv.org/abs/2310.03891)
Summary:
In this paper, a new approach called HDNA (HTML DNA) is introduced for analyzing and comparing Document Object Model (DOM) trees in order to detect differences in HTML pages. This method assigns an identifier to each HTML page based on its structure, which proves to be particularly useful for detecting variations caused by server-side updates, user interactions or potential security risks. The process involves preprocessing the HTML content generating a DOM tree and calculating the disparities between two or more trees. By assigning weights to the nodes valuable insights about their hierarchical importance are obtained. The effectiveness of the HDNA approach has been demonstrated in identifying changes in DOM trees even when dynamically generated content is involved. Not does this method benefit web developers, testers, and security analysts by offering a deeper understanding of how web pages evolve. It also helps ensure the functionality and performance of web applications. Additionally, it enables detection and response to vulnerabilities that may arise from modifications in DOM structures. As the web ecosystem continues to evolve HDNA proves to be a tool, for individuals engaged in web development, testing, or security analysis.

Title: Minors solve the elliptic curve discrete logarithm problem. (arXiv:2310.04132v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2310.04132
Code URL: null
Copy Paste: [[2310.04132]] Minors solve the elliptic curve discrete logarithm problem(http://arxiv.org/abs/2310.04132)
Summary:
The elliptic curve discrete logarithm problem is of fundamental importance in public-key cryptography. It is in use for a long time. Moreover, it is an interesting challenge in computational mathematics. Its solution is supposed to provide interesting research directions.

In this paper, we explore ways to solve the elliptic curve discrete logarithm problem. Our results are mostly computational. However, it seems, the methods that we develop and directions that we pursue can provide a potent attack on this problem. This work follows our earlier work, where we tried to solve this problem by finding a zero minor in a matrix over the same finite field on which the elliptic curve is defined. This paper is self-contained.

Title: Indirect Meltdown: Building Novel Side-Channel Attacks from Transient-Execution Attacks. (arXiv:2310.04183v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2310.04183
Code URL: null
Copy Paste: [[2310.04183]] Indirect Meltdown: Building Novel Side-Channel Attacks from Transient-Execution Attacks(http://arxiv.org/abs/2310.04183)
Summary:
The transient-execution attack Meltdown leaks sensitive information by transiently accessing inaccessible data during out-of-order execution. Although Meltdown is fixed in hardware for recent CPU generations, most currently-deployed CPUs have to rely on software mitigations, such as KPTI. Still, Meltdown is considered non-exploitable on current systems. In this paper, we show that adding another layer of indirection to Meltdown transforms a transient-execution attack into a side-channel attack, leaking metadata instead of data. We show that despite software mitigations, attackers can still leak metadata from other security domains by observing the success rate of Meltdown on non-secret data. With LeakIDT, we present the first cache-line granular monitoring of kernel addresses. LeakIDT allows an attacker to obtain cycle-accurate timestamps for attacker-chosen interrupts. We use our attack to get accurate inter-keystroke timings and fingerprint visited websites. While we propose a low-overhead software mitigation to prevent the exploitation of LeakIDT, we emphasize that the side-channel aspect of transient-execution attacks should not be underestimated.

Title: Reviving Meltdown 3a. (arXiv:2310.04192v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2310.04192
Code URL: null
Copy Paste: [[2310.04192]] Reviving Meltdown 3a(http://arxiv.org/abs/2310.04192)
Summary:
Since the initial discovery of Meltdown and Spectre in 2017, different variants of these attacks have been discovered. One often overlooked variant is Meltdown 3a, also known as Meltdown-CPL-REG. Even though Meltdown-CPL-REG was initially discovered in 2018, the available information regarding the vulnerability is still sparse. In this paper, we analyze Meltdown-CPL-REG on 19 different CPUs from different vendors using an automated tool. We observe that the impact is more diverse than documented and differs from CPU to CPU. Surprisingly, while the newest Intel CPUs do not seem affected by Meltdown-CPL-REG, the newest available AMD CPUs (Zen3+) are still affected by the vulnerability. Furthermore, given our attack primitive CounterLeak, we show that besides up-to-date patches, Meltdown-CPL-REG can still be exploited as we reenable performance-counter-based attacks on cryptographic algorithms, break KASLR, and mount Spectre attacks. Although Meltdown-CPL-REG is not as powerful as other transient-execution attacks, its attack surface should not be underestimated.

Title: Threat Trekker: An Approach to Cyber Threat Hunting. (arXiv:2310.04197v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2310.04197
Code URL: null
Copy Paste: [[2310.04197]] Threat Trekker: An Approach to Cyber Threat Hunting(http://arxiv.org/abs/2310.04197)
Summary:
Threat hunting is a proactive methodology for exploring, detecting and mitigating cyberattacks within complex environments. As opposed to conventional detection systems, threat hunting strategies assume adversaries have infiltrated the system; as a result they proactively search out any unusual patterns or activities which might indicate intrusion attempts.

Historically, this endeavour has been pursued using three investigation methodologies: (1) Hypothesis-Driven Investigations; (2) Indicator of Compromise (IOC); and (3) High-level machine learning analysis-based approaches. Therefore, this paper introduces a novel machine learning paradigm known as Threat Trekker. This proposal utilizes connectors to feed data directly into an event streaming channel for processing by the algorithm and provide feedback back into its host network.

Conclusions drawn from these experiments clearly establish the efficacy of employing machine learning for classifying more subtle attacks.

Title: Mapping the DeFi Crime Landscape: An Evidence-based Picture. (arXiv:2310.04356v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2310.04356
Code URL: null
Copy Paste: [[2310.04356]] Mapping the DeFi Crime Landscape: An Evidence-based Picture(http://arxiv.org/abs/2310.04356)
Summary:
Over the past years, decentralized finance (DeFi) has been the target of numerous profit-driven crimes. However, until now, the full prevalence and cumulative impact of these crimes have not been assessed. This study provides a first comprehensive assessment of profit-driven crimes targeting the DeFi sector. To achieve this, we collected data on 1155 crime events from 2017 to 2022. Of these, 1050 were related to the DeFi industry and 105 to the centralized finance (CeFi) industry. Focusing on the former, a taxonomy was developed to clarify the similarities and differences among these crimes. All events were mapped onto the DeFi stack to assess the impacted technical layers, and the financial damages were quantified to gauge their scale. The findings show that the entire cryptoasset industry has suffered a minimum loss of US$30B, with two thirds related to centralized finance (CeFi) and one third to DeFi. Focusing solely on the latter, the results highlight that during an attack, a DeFi actor (an entity developing a DeFi technology) can serve as a direct target, as a perpetrator, or as an intermediary. The findings show that DeFi actors are the first victims of crimes targeting the DeFi industry: 52% of crime events targeted them, primarily due to technical vulnerabilities at the protocol layer, and these events accounted for 83% of all recorded financial damages. On the other hand, in 40% of crime events, DeFi actors were themselves malicious perpetrators, predominantly misusing contracts at the cryptoasset layer (e.g., rug pull scams). However, these events accounted for only 17% of all financial damages. The study's findings offer a preliminary assessment of the size and scope of crime events within the DeFi sector and highlight the vulnerable position of DeFi actors in the ecosystem.

Title: Improving classifier decision boundaries using nearest neighbors. (arXiv:2310.03927v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2310.03927
Code URL: null
Copy Paste: [[2310.03927]] Improving classifier decision boundaries using nearest neighbors(http://arxiv.org/abs/2310.03927)
Summary:
Neural networks are not learning optimal decision boundaries. We show that decision boundaries are situated in areas of low training data density. They are impacted by few training samples which can easily lead to overfitting. We provide a simple algorithm performing a weighted average of the prediction of a sample and its nearest neighbors' (computed in latent space) leading to a minor favorable outcomes for a variety of important measures for neural networks. In our evaluation, we employ various self-trained and pre-trained convolutional neural networks to show that our approach improves (i) resistance to label noise, (ii) robustness against adversarial attacks, (iii) classification accuracy, and to some degree even (iv) interpretability. While improvements are not necessarily large in all four areas, our approach is conceptually simple, i.e., improvements come without any modification to network architecture, training procedure or dataset. Furthermore, they are in stark contrast to prior works that often require trade-offs among the four objectives or provide valuable, but non-actionable insights.

robust

Title: WLST: Weak Labels Guided Self-training for Weakly-supervised Domain Adaptation on 3D Object Detection. (arXiv:2310.03821v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2310.03821
Code URL: null
Copy Paste: [[2310.03821]] WLST: Weak Labels Guided Self-training for Weakly-supervised Domain Adaptation on 3D Object Detection(http://arxiv.org/abs/2310.03821)
Summary:
In the field of domain adaptation (DA) on 3D object detection, most of the work is dedicated to unsupervised domain adaptation (UDA). Yet, without any target annotations, the performance gap between the UDA approaches and the fully-supervised approach is still noticeable, which is impractical for real-world applications. On the other hand, weakly-supervised domain adaptation (WDA) is an underexplored yet practical task that only requires few labeling effort on the target domain. To improve the DA performance in a cost-effective way, we propose a general weak labels guided self-training framework, WLST, designed for WDA on 3D object detection. By incorporating autolabeler, which can generate 3D pseudo labels from 2D bounding boxes, into the existing self-training pipeline, our method is able to generate more robust and consistent pseudo labels that would benefit the training process on the target domain. Extensive experiments demonstrate the effectiveness, robustness, and detector-agnosticism of our WLST framework. Notably, it outperforms previous state-of-the-art methods on all evaluation tasks.

Title: Hard View Selection for Contrastive Learning. (arXiv:2310.03940v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2310.03940
Code URL: null
Copy Paste: [[2310.03940]] Hard View Selection for Contrastive Learning(http://arxiv.org/abs/2310.03940)
Summary:
Many Contrastive Learning (CL) methods train their models to be invariant to different "views" of an image input for which a good data augmentation pipeline is crucial. While considerable efforts were directed towards improving pre-text tasks, architectures, or robustness (e.g., Siamese networks or teacher-softmax centering), the majority of these methods remain strongly reliant on the random sampling of operations within the image augmentation pipeline, such as the random resized crop or color distortion operation. In this paper, we argue that the role of the view generation and its effect on performance has so far received insufficient attention. To address this, we propose an easy, learning-free, yet powerful Hard View Selection (HVS) strategy designed to extend the random view generation to expose the pretrained model to harder samples during CL training. It encompasses the following iterative steps: 1) randomly sample multiple views and create pairs of two views, 2) run forward passes for each view pair on the currently trained model, 3) adversarially select the pair yielding the worst loss, and 4) run the backward pass with the selected pair. In our empirical analysis we show that under the hood, HVS increases task difficulty by controlling the Intersection over Union of views during pretraining. With only 300-epoch pretraining, HVS is able to closely rival the 800-epoch DINO baseline which remains very favorable even when factoring in the slowdown induced by the additional forwards of HVS. Additionally, HVS consistently achieves accuracy improvements on ImageNet between 0.55% and 1.9% on linear evaluation and similar improvements on transfer tasks across multiple CL methods, such as DINO, SimSiam, and SimCLR.

Title: Towards Increasing the Robustness of Predictive Steering-Control Autonomous Navigation Systems Against Dash Cam Image Angle Perturbations Due to Pothole Encounters. (arXiv:2310.03959v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2310.03959
Code URL: null
Copy Paste: [[2310.03959]] Towards Increasing the Robustness of Predictive Steering-Control Autonomous Navigation Systems Against Dash Cam Image Angle Perturbations Due to Pothole Encounters(http://arxiv.org/abs/2310.03959)
Summary:
Vehicle manufacturers are racing to create autonomous navigation and steering control algorithms for their vehicles. These software are made to handle various real-life scenarios such as obstacle avoidance and lane maneuvering. There is some ongoing research to incorporate pothole avoidance into these autonomous systems. However, there is very little research on the effect of hitting a pothole on the autonomous navigation software that uses cameras to make driving decisions. Perturbations in the camera angle when hitting a pothole can cause errors in the predicted steering angle. In this paper, we present a new model to compensate for such angle perturbations and reduce any errors in steering control prediction algorithms. We evaluate our model on perturbations of publicly available datasets and show our model can reduce the errors in the estimated steering angle from perturbed images to 2.3%, making autonomous steering control robust against the dash cam image angle perturbations induced when one wheel of a car goes over a pothole.

Title: Robust Multimodal Learning with Missing Modalities via Parameter-Efficient Adaptation. (arXiv:2310.03986v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2310.03986
Code URL: null
Copy Paste: [[2310.03986]] Robust Multimodal Learning with Missing Modalities via Parameter-Efficient Adaptation(http://arxiv.org/abs/2310.03986)
Summary:
Multimodal learning seeks to utilize data from multiple sources to improve the overall performance of downstream tasks. It is desirable for redundancies in the data to make multimodal systems robust to missing or corrupted observations in some correlated modalities. However, we observe that the performance of several existing multimodal networks significantly deteriorates if one or multiple modalities are absent at test time. To enable robustness to missing modalities, we propose simple and parameter-efficient adaptation procedures for pretrained multimodal networks. In particular, we exploit low-rank adaptation and modulation of intermediate features to compensate for the missing modalities. We demonstrate that such adaptation can partially bridge performance drop due to missing modalities and outperform independent, dedicated networks trained for the available modality combinations in some cases. The proposed adaptation requires extremely small number of parameters (e.g., fewer than 0.7% of the total parameters in most experiments). We conduct a series of experiments to highlight the robustness of our proposed method using diverse datasets for RGB-thermal and RGB-Depth semantic segmentation, multimodal material segmentation, and multimodal sentiment analysis tasks. Our proposed method demonstrates versatility across various tasks and datasets, and outperforms existing methods for robust multimodal learning with missing modalities.

Title: Assessing Robustness via Score-Based Adversarial Image Generation. (arXiv:2310.04285v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2310.04285
Code URL: null
Copy Paste: [[2310.04285]] Assessing Robustness via Score-Based Adversarial Image Generation(http://arxiv.org/abs/2310.04285)
Summary:
Most adversarial attacks and defenses focus on perturbations within small $\ell_p$-norm constraints. However, $\ell_p$ threat models cannot capture all relevant semantic-preserving perturbations, and hence, the scope of robustness evaluations is limited. In this work, we introduce Score-Based Adversarial Generation (ScoreAG), a novel framework that leverages the advancements in score-based generative models to generate adversarial examples beyond $\ell_p$-norm constraints, so-called unrestricted adversarial examples, overcoming their limitations. Unlike traditional methods, ScoreAG maintains the core semantics of images while generating realistic adversarial examples, either by transforming existing images or synthesizing new ones entirely from scratch. We further exploit the generative capability of ScoreAG to purify images, empirically enhancing the robustness of classifiers. Our extensive empirical evaluation demonstrates that ScoreAG matches the performance of state-of-the-art attacks and defenses across multiple benchmarks. This work highlights the importance of investigating adversarial examples bounded by semantics rather than $\ell_p$-norm constraints. ScoreAG represents an important step towards more encompassing robustness assessments.

Title: Towards A Robust Group-level Emotion Recognition via Uncertainty-Aware Learning. (arXiv:2310.04306v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2310.04306
Code URL: null
Copy Paste: [[2310.04306]] Towards A Robust Group-level Emotion Recognition via Uncertainty-Aware Learning(http://arxiv.org/abs/2310.04306)
Summary:
Group-level emotion recognition (GER) is an inseparable part of human behavior analysis, aiming to recognize an overall emotion in a multi-person scene. However, the existing methods are devoted to combing diverse emotion cues while ignoring the inherent uncertainties under unconstrained environments, such as congestion and occlusion occurring within a group. Additionally, since only group-level labels are available, inconsistent emotion predictions among individuals in one group can confuse the network. In this paper, we propose an uncertainty-aware learning (UAL) method to extract more robust representations for GER. By explicitly modeling the uncertainty of each individual, we utilize stochastic embedding drawn from a Gaussian distribution instead of deterministic point embedding. This representation captures the probabilities of different emotions and generates diverse predictions through this stochasticity during the inference stage. Furthermore, uncertainty-sensitive scores are adaptively assigned as the fusion weights of individuals' face within each group. Moreover, we develop an image enhancement module to enhance the model's robustness against severe noise. The overall three-branch model, encompassing face, object, and scene component, is guided by a proportional-weighted fusion strategy and integrates the proposed uncertainty-aware method to produce the final group-level output. Experimental results demonstrate the effectiveness and generalization ability of our method across three widely used databases.

Title: SemStamp: A Semantic Watermark with Paraphrastic Robustness for Text Generation. (arXiv:2310.03991v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2310.03991
Code URL: null
Copy Paste: [[2310.03991]] SemStamp: A Semantic Watermark with Paraphrastic Robustness for Text Generation(http://arxiv.org/abs/2310.03991)
Summary:
Existing watermarking algorithms are vulnerable to paraphrase attacks because of their token-level design. To address this issue, we propose SemStamp, a robust sentence-level semantic watermarking algorithm based on locality-sensitive hashing (LSH), which partitions the semantic space of sentences. The algorithm encodes and LSH-hashes a candidate sentence generated by an LLM, and conducts sentence-level rejection sampling until the sampled sentence falls in watermarked partitions in the semantic embedding space. A margin-based constraint is used to enhance its robustness. To show the advantages of our algorithm, we propose a "bigram" paraphrase attack using the paraphrase that has the fewest bigram overlaps with the original sentence. This attack is shown to be effective against the existing token-level watermarking method. Experimental results show that our novel semantic watermark algorithm is not only more robust than the previous state-of-the-art method on both common and bigram paraphrase attacks, but also is better at preserving the quality of generation.

Title: Large-Scale Korean Text Dataset for Classifying Biased Speech in Real-World Online Services. (arXiv:2310.04313v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2310.04313
Code URL: null
Copy Paste: [[2310.04313]] Large-Scale Korean Text Dataset for Classifying Biased Speech in Real-World Online Services(http://arxiv.org/abs/2310.04313)
Summary:
With the growth of online services, the need for advanced text classification algorithms, such as sentiment analysis and biased text detection, has become increasingly evident. The anonymous nature of online services often leads to the presence of biased and harmful language, posing challenges to maintaining the health of online communities. This phenomenon is especially relevant in South Korea, where large-scale hate speech detection algorithms have not yet been broadly explored. In this paper, we introduce a new comprehensive, large-scale dataset collected from a well-known South Korean SNS platform. Our proposed dataset provides annotations including (1) Preferences, (2) Profanities, and (3) Nine types of Bias for the text samples, enabling multi-task learning for simultaneous classification of user-generated texts. Leveraging state-of-the-art BERT-based language models, our approach surpasses human-level accuracy across diverse classification tasks, as measured by various metrics. Beyond academic contributions, our work can provide practical solutions for real-world hate speech and bias mitigation, contributing directly to the improvement of online community health. Our work provides a robust foundation for future research aiming to improve the quality of online discourse and foster societal well-being. All source codes and datasets are publicly accessible at https://github.com/Dasol-Choi/KoMultiText.

Title: Fishnets: Information-Optimal, Scalable Aggregation for Sets and Graphs. (arXiv:2310.03812v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2310.03812
Code URL: null
Copy Paste: [[2310.03812]] Fishnets: Information-Optimal, Scalable Aggregation for Sets and Graphs(http://arxiv.org/abs/2310.03812)
Summary:
Set-based learning is an essential component of modern deep learning and network science. Graph Neural Networks (GNNs) and their edge-free counterparts Deepsets have proven remarkably useful on ragged and topologically challenging datasets. The key to learning informative embeddings for set members is a specified aggregation function, usually a sum, max, or mean. We propose Fishnets, an aggregation strategy for learning information-optimal embeddings for sets of data for both Bayesian inference and graph aggregation. We demonstrate that i) Fishnets neural summaries can be scaled optimally to an arbitrary number of data objects, ii) Fishnets aggregations are robust to changes in data distribution, unlike standard deepsets, iii) Fishnets saturate Bayesian information content and extend to regimes where MCMC techniques fail and iv) Fishnets can be used as a drop-in aggregation scheme within GNNs. We show that by adopting a Fishnets aggregation scheme for message passing, GNNs can achieve state-of-the-art performance versus architecture size on ogbn-protein data over existing benchmarks with a fraction of learnable parameters and faster training time.

Title: Leveraging Low-Rank and Sparse Recurrent Connectivity for Robust Closed-Loop Control. (arXiv:2310.03915v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2310.03915
Code URL: null
Copy Paste: [[2310.03915]] Leveraging Low-Rank and Sparse Recurrent Connectivity for Robust Closed-Loop Control(http://arxiv.org/abs/2310.03915)
Summary:
Developing autonomous agents that can interact with changing environments is an open challenge in machine learning. Robustness is particularly important in these settings as agents are often fit offline on expert demonstrations but deployed online where they must generalize to the closed feedback loop within the environment. In this work, we explore the application of recurrent neural networks to tasks of this nature and understand how a parameterization of their recurrent connectivity influences robustness in closed-loop settings. Specifically, we represent the recurrent connectivity as a function of rank and sparsity and show both theoretically and empirically that modulating these two variables has desirable effects on network dynamics. The proposed low-rank, sparse connectivity induces an interpretable prior on the network that proves to be most amenable for a class of models known as closed-form continuous-time neural networks (CfCs). We find that CfCs with fewer parameters can outperform their full-rank, fully-connected counterparts in the online setting under distribution shift. This yields memory-efficient and robust agents while opening a new perspective on how we can modulate network dynamics through connectivity.

Title: Joint Projection Learning and Tensor Decomposition Based Incomplete Multi-view Clustering. (arXiv:2310.04038v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2310.04038
Code URL: https://github.com/weilvnju/jpltd
Copy Paste: [[2310.04038]] Joint Projection Learning and Tensor Decomposition Based Incomplete Multi-view Clustering(http://arxiv.org/abs/2310.04038)
Summary:
Incomplete multi-view clustering (IMVC) has received increasing attention since it is often that some views of samples are incomplete in reality. Most existing methods learn similarity subgraphs from original incomplete multi-view data and seek complete graphs by exploring the incomplete subgraphs of each view for spectral clustering. However, the graphs constructed on the original high-dimensional data may be suboptimal due to feature redundancy and noise. Besides, previous methods generally ignored the graph noise caused by the inter-class and intra-class structure variation during the transformation of incomplete graphs and complete graphs. To address these problems, we propose a novel Joint Projection Learning and Tensor Decomposition Based method (JPLTD) for IMVC. Specifically, to alleviate the influence of redundant features and noise in high-dimensional data, JPLTD introduces an orthogonal projection matrix to project the high-dimensional features into a lower-dimensional space for compact feature learning.Meanwhile, based on the lower-dimensional space, the similarity graphs corresponding to instances of different views are learned, and JPLTD stacks these graphs into a third-order low-rank tensor to explore the high-order correlations across different views. We further consider the graph noise of projected data caused by missing samples and use a tensor-decomposition based graph filter for robust clustering.JPLTD decomposes the original tensor into an intrinsic tensor and a sparse tensor. The intrinsic tensor models the true data similarities. An effective optimization algorithm is adopted to solve the JPLTD model. Comprehensive experiments on several benchmark datasets demonstrate that JPLTD outperforms the state-of-the-art methods. The code of JPLTD is available at https://github.com/weilvNJU/JPLTD.

Title: Introducing the Attribution Stability Indicator: a Measure for Time Series XAI Attributions. (arXiv:2310.04178v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2310.04178
Code URL: https://github.com/visual-xai-for-time-series/attribution-stability-indicator
Copy Paste: [[2310.04178]] Introducing the Attribution Stability Indicator: a Measure for Time Series XAI Attributions(http://arxiv.org/abs/2310.04178)
Summary:
Given the increasing amount and general complexity of time series data in domains such as finance, weather forecasting, and healthcare, there is a growing need for state-of-the-art performance models that can provide interpretable insights into underlying patterns and relationships. Attribution techniques enable the extraction of explanations from time series models to gain insights but are hard to evaluate for their robustness and trustworthiness. We propose the Attribution Stability Indicator (ASI), a measure to incorporate robustness and trustworthiness as properties of attribution techniques for time series into account. We extend a perturbation analysis with correlations of the original time series to the perturbed instance and the attributions to include wanted properties in the measure. We demonstrate the wanted properties based on an analysis of the attributions in a dimension-reduced space and the ASI scores distribution over three whole time series classification datasets.

Title: Identifying Representations for Intervention Extrapolation. (arXiv:2310.04295v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2310.04295
Code URL: null
Copy Paste: [[2310.04295]] Identifying Representations for Intervention Extrapolation(http://arxiv.org/abs/2310.04295)
Summary:
The premise of identifiable and causal representation learning is to improve the current representation learning paradigm in terms of generalizability or robustness. Despite recent progress in questions of identifiability, more theoretical results demonstrating concrete advantages of these methods for downstream tasks are needed. In this paper, we consider the task of intervention extrapolation: predicting how interventions affect an outcome, even when those interventions are not observed at training time, and show that identifiable representations can provide an effective solution to this task even if the interventions affect the outcome non-linearly. Our setup includes an outcome Y, observed features X, which are generated as a non-linear transformation of latent features Z, and exogenous action variables A, which influence Z. The objective of intervention extrapolation is to predict how interventions on A that lie outside the training support of A affect Y. Here, extrapolation becomes possible if the effect of A on Z is linear and the residual when regressing Z on A has full support. As Z is latent, we combine the task of intervention extrapolation with identifiable representation learning, which we call Rep4Ex: we aim to map the observed features X into a subspace that allows for non-linear extrapolation in A. We show using Wiener's Tauberian theorem that the hidden representation is identifiable up to an affine transformation in Z-space, which is sufficient for intervention extrapolation. The identifiability is characterized by a novel constraint describing the linearity assumption of A on Z. Based on this insight, we propose a method that enforces the linear invariance constraint and can be combined with any type of autoencoder. We validate our theoretical findings through synthetic experiments and show that our approach succeeds in predicting the effects of unseen interventions.

Title: Adjustable Robust Reinforcement Learning for Online 3D Bin Packing. (arXiv:2310.04323v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2310.04323
Code URL: null
Copy Paste: [[2310.04323]] Adjustable Robust Reinforcement Learning for Online 3D Bin Packing(http://arxiv.org/abs/2310.04323)
Summary:
Designing effective policies for the online 3D bin packing problem (3D-BPP) has been a long-standing challenge, primarily due to the unpredictable nature of incoming box sequences and stringent physical constraints. While current deep reinforcement learning (DRL) methods for online 3D-BPP have shown promising results in optimizing average performance over an underlying box sequence distribution, they often fail in real-world settings where some worst-case scenarios can materialize. Standard robust DRL algorithms tend to overly prioritize optimizing the worst-case performance at the expense of performance under normal problem instance distribution. To address these issues, we first introduce a permutation-based attacker to investigate the practical robustness of both DRL-based and heuristic methods proposed for solving online 3D-BPP. Then, we propose an adjustable robust reinforcement learning (AR2L) framework that allows efficient adjustment of robustness weights to achieve the desired balance of the policy's performance in average and worst-case environments. Specifically, we formulate the objective function as a weighted sum of expected and worst-case returns, and derive the lower performance bound by relating to the return under a mixture dynamics. To realize this lower bound, we adopt an iterative procedure that searches for the associated mixture dynamics and improves the corresponding policy. We integrate this procedure into two popular robust adversarial algorithms to develop the exact and approximate AR2L algorithms. Experiments demonstrate that AR2L is versatile in the sense that it improves policy robustness while maintaining an acceptable level of performance for the nominal case.

Title: Robust Losses for Decision-Focused Learning. (arXiv:2310.04328v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2310.04328
Code URL: null
Copy Paste: [[2310.04328]] Robust Losses for Decision-Focused Learning(http://arxiv.org/abs/2310.04328)
Summary:
Optimization models used to make discrete decisions often contain uncertain parameters that are context-dependent and are estimated through prediction. To account for the quality of the decision made based on the prediction, decision-focused learning (end-to-end predict-then-optimize) aims at training the predictive model to minimize regret, i.e., the loss incurred by making a suboptimal decision. Despite the challenge of this loss function being possibly non-convex and in general non-differentiable, effective gradient-based learning approaches have been proposed to minimize the expected loss, using the empirical loss as a surrogate. However, empirical regret can be an ineffective surrogate because the uncertainty in the optimization model makes the empirical regret unequal to the expected regret in expectation. To illustrate the impact of this inequality, we evaluate the effect of aleatoric and epistemic uncertainty on the accuracy of empirical regret as a surrogate. Next, we propose three robust loss functions that more closely approximate expected regret. Experimental results show that training two state-of-the-art decision-focused learning approaches using robust regret losses improves test-sample empirical regret in general while keeping computational time equivalent relative to the number of training epochs.

biometric

Title: DEFT: A new distance-based feature set for keystroke dynamics. (arXiv:2310.04059v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2310.04059
Code URL: null
Copy Paste: [[2310.04059]] DEFT: A new distance-based feature set for keystroke dynamics(http://arxiv.org/abs/2310.04059)
Summary:
Keystroke dynamics is a behavioural biometric utilised for user identification and authentication. We propose a new set of features based on the distance between keys on the keyboard, a concept that has not been considered before in keystroke dynamics. We combine flight times, a popular metric, with the distance between keys on the keyboard and call them as Distance Enhanced Flight Time features (DEFT). This novel approach provides comprehensive insights into a person's typing behaviour, surpassing typing velocity alone. We build a DEFT model by combining DEFT features with other previously used keystroke dynamic features. The DEFT model is designed to be device-agnostic, allowing us to evaluate its effectiveness across three commonly used devices: desktop, mobile, and tablet. The DEFT model outperforms the existing state-of-the-art methods when we evaluate its effectiveness across two datasets. We obtain accuracy rates exceeding 99% and equal error rates below 10% on all three devices.

steal

extraction

Title: Investigating Alternative Feature Extraction Pipelines For Clinical Note Phenotyping. (arXiv:2310.03772v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2310.03772
Code URL: null
Copy Paste: [[2310.03772]] Investigating Alternative Feature Extraction Pipelines For Clinical Note Phenotyping(http://arxiv.org/abs/2310.03772)
Summary:
A common practice in the medical industry is the use of clinical notes, which consist of detailed patient observations. However, electronic health record systems frequently do not contain these observations in a structured format, rendering patient information challenging to assess and evaluate automatically. Using computational systems for the extraction of medical attributes offers many applications, including longitudinal analysis of patients, risk assessment, and hospital evaluation. Recent work has constructed successful methods for phenotyping: extracting medical attributes from clinical notes. BERT-based models can be used to transform clinical notes into a series of representations, which are then condensed into a single document representation based on their CLS embeddings and passed into an LSTM (Mulyar et al., 2020). Though this pipeline yields a considerable performance improvement over previous results, it requires extensive convergence time. This method also does not allow for predicting attributes not yet identified in clinical notes.

Considering the wide variety of medical attributes that may be present in a clinical note, we propose an alternative pipeline utilizing ScispaCy (Neumann et al., 2019) for the extraction of common diseases. We then train various supervised learning models to associate the presence of these conditions with patient attributes. Finally, we replicate a ClinicalBERT (Alsentzer et al., 2019) and LSTM-based approach for purposes of comparison. We find that alternative methods moderately underperform the replicated LSTM approach. Yet, considering a complex tradeoff between accuracy and runtime, in addition to the fact that the alternative approach also allows for the detection of medical conditions that are not already present in a clinical note, its usage may be considered as a supplement to established methods.

Title: Exploring the evolution of research topics during the COVID-19 pandemic. (arXiv:2310.03928v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2310.03928
Code URL: null
Copy Paste: [[2310.03928]] Exploring the evolution of research topics during the COVID-19 pandemic(http://arxiv.org/abs/2310.03928)
Summary:
The COVID-19 pandemic has changed the research agendas of most scientific communities, resulting in an overwhelming production of research articles in a variety of domains, including medicine, virology, epidemiology, economy, psychology, and so on. Several open-access corpora and literature hubs were established; among them, the COVID-19 Open Research Dataset (CORD-19) has systematically gathered scientific contributions for 2.5 years, by collecting and indexing over one million articles. Here, we present the CORD-19 Topic Visualizer (CORToViz), a method and associated visualization tool for inspecting the CORD-19 textual corpus of scientific abstracts. Our method is based upon a careful selection of up-to-date technologies (including large language models), resulting in an architecture for clustering articles along orthogonal dimensions and extraction techniques for temporal topic mining. Topic inspection is supported by an interactive dashboard, providing fast, one-click visualization of topic contents as word clouds and topic trends as time series, equipped with easy-to-drive statistical testing for analyzing the significance of topic emergence along arbitrarily selected time windows. The processes of data preparation and results visualization are completely general and virtually applicable to any corpus of textual documents - thus suited for effective adaptation to other contexts.

Title: Automatic Aspect Extraction from Scientific Texts. (arXiv:2310.04074v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2310.04074
Code URL: https://github.com/anna-marshalova/automatic-aspect-extraction-from-scientific-texts
Copy Paste: [[2310.04074]] Automatic Aspect Extraction from Scientific Texts(http://arxiv.org/abs/2310.04074)
Summary:
Being able to extract from scientific papers their main points, key insights, and other important information, referred to here as aspects, might facilitate the process of conducting a scientific literature review. Therefore, the aim of our research is to create a tool for automatic aspect extraction from Russian-language scientific texts of any domain. In this paper, we present a cross-domain dataset of scientific texts in Russian, annotated with such aspects as Task, Contribution, Method, and Conclusion, as well as a baseline algorithm for aspect extraction, based on the multilingual BERT model fine-tuned on our data. We show that there are some differences in aspect representation in different domains, but even though our model was trained on a limited number of scientific domains, it is still able to generalize to new domains, as was proved by cross-domain experiments. The code and the dataset are available at \url{https://github.com/anna-marshalova/automatic-aspect-extraction-from-scientific-texts}.

Title: SIFT -- File Fragment Classification Without Metadata. (arXiv:2310.03831v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2310.03831
Code URL: null
Copy Paste: [[2310.03831]] SIFT -- File Fragment Classification Without Metadata(http://arxiv.org/abs/2310.03831)
Summary:
A vital issue of file carving in digital forensics is type classification of file fragments when the filesystem metadata is missing. Over the past decades, there have been several efforts for developing methods to classify file fragments. In this research, a novel sifting approach, named SIFT (Sifting File Types), is proposed. SIFT outperforms the other state-of-the-art techniques by at least 8%. (1) One of the significant differences between SIFT and others is that SIFT uses a single byte as a separate feature, i.e., a total of 256 (0x00 - 0xFF) features. We also call this a lossless feature (information) extraction, i.e., there is no loss of information. (2) The other significant difference is the technique used to estimate inter-Classes and intra-Classes information gain of a feature. Unlike others, SIFT adapts TF-IDF for this purpose, and computes and assigns weight to each byte (feature) in a fragment (sample). With these significant differences and approaches, SIFT produces promising (better) results compared to other works.

Title: A Learnable Counter-condition Analysis Framework for Functional Connectivity-based Neurological Disorder Diagnosis. (arXiv:2310.03964v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2310.03964
Code URL: null
Copy Paste: [[2310.03964]] A Learnable Counter-condition Analysis Framework for Functional Connectivity-based Neurological Disorder Diagnosis(http://arxiv.org/abs/2310.03964)
Summary:
To understand the biological characteristics of neurological disorders with functional connectivity (FC), recent studies have widely utilized deep learning-based models to identify the disease and conducted post-hoc analyses via explainable models to discover disease-related biomarkers. Most existing frameworks consist of three stages, namely, feature selection, feature extraction for classification, and analysis, where each stage is implemented separately. However, if the results at each stage lack reliability, it can cause misdiagnosis and incorrect analysis in afterward stages. In this study, we propose a novel unified framework that systemically integrates diagnoses (i.e., feature selection and feature extraction) and explanations. Notably, we devised an adaptive attention network as a feature selection approach to identify individual-specific disease-related connections. We also propose a functional network relational encoder that summarizes the global topological properties of FC by learning the inter-network relations without pre-defined edges between functional networks. Last but not least, our framework provides a novel explanatory power for neuroscientific interpretation, also termed counter-condition analysis. We simulated the FC that reverses the diagnostic information (i.e., counter-condition FC): converting a normal brain to be abnormal and vice versa. We validated the effectiveness of our framework by using two large resting-state functional magnetic resonance imaging (fMRI) datasets, Autism Brain Imaging Data Exchange (ABIDE) and REST-meta-MDD, and demonstrated that our framework outperforms other competing methods for disease identification. Furthermore, we analyzed the disease-related neurological patterns based on counter-condition analysis.

membership infer

federate

Title: FedConv: Enhancing Convolutional Neural Networks for Handling Data Heterogeneity in Federated Learning. (arXiv:2310.04412v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2310.04412
Code URL: https://github.com/ucsc-vlaa/fedconv
Copy Paste: [[2310.04412]] FedConv: Enhancing Convolutional Neural Networks for Handling Data Heterogeneity in Federated Learning(http://arxiv.org/abs/2310.04412)
Summary:
Federated learning (FL) is an emerging paradigm in machine learning, where a shared model is collaboratively learned using data from multiple devices to mitigate the risk of data leakage. While recent studies posit that Vision Transformer (ViT) outperforms Convolutional Neural Networks (CNNs) in addressing data heterogeneity in FL, the specific architectural components that underpin this advantage have yet to be elucidated. In this paper, we systematically investigate the impact of different architectural elements, such as activation functions and normalization layers, on the performance within heterogeneous FL. Through rigorous empirical analyses, we are able to offer the first-of-its-kind general guidance on micro-architecture design principles for heterogeneous FL.

Intriguingly, our findings indicate that with strategic architectural modifications, pure CNNs can achieve a level of robustness that either matches or even exceeds that of ViTs when handling heterogeneous data clients in FL. Additionally, our approach is compatible with existing FL techniques and delivers state-of-the-art solutions across a broad spectrum of FL benchmarks. The code is publicly available at https://github.com/UCSC-VLAA/FedConv

Title: Kick Bad Guys Out! Zero-Knowledge-Proof-Based Anomaly Detection in Federated Learning. (arXiv:2310.04055v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2310.04055
Code URL: null
Copy Paste: [[2310.04055]] Kick Bad Guys Out! Zero-Knowledge-Proof-Based Anomaly Detection in Federated Learning(http://arxiv.org/abs/2310.04055)
Summary:
Federated learning (FL) systems are vulnerable to malicious clients that submit poisoned local models to achieve their adversarial goals, such as preventing the convergence of the global model or inducing the global model to misclassify some data. Many existing defense mechanisms are impractical in real-world FL systems, as they require prior knowledge of the number of malicious clients or rely on re-weighting or modifying submissions. This is because adversaries typically do not announce their intentions before attacking, and re-weighting might change aggregation results even in the absence of attacks. To address these challenges in real FL systems, this paper introduces a cutting-edge anomaly detection approach with the following features: i) Detecting the occurrence of attacks and performing defense operations only when attacks happen; ii) Upon the occurrence of an attack, further detecting the malicious client models and eliminating them without harming the benign ones; iii) Ensuring honest execution of defense mechanisms at the server by leveraging a zero-knowledge proof mechanism. We validate the superior performance of the proposed approach with extensive experiments.

fair

Title: ILSH: The Imperial Light-Stage Head Dataset for Human Head View Synthesis. (arXiv:2310.03952v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2310.03952
Code URL: null
Copy Paste: [[2310.03952]] ILSH: The Imperial Light-Stage Head Dataset for Human Head View Synthesis(http://arxiv.org/abs/2310.03952)
Summary:
This paper introduces the Imperial Light-Stage Head (ILSH) dataset, a novel light-stage-captured human head dataset designed to support view synthesis academic challenges for human heads. The ILSH dataset is intended to facilitate diverse approaches, such as scene-specific or generic neural rendering, multiple-view geometry, 3D vision, and computer graphics, to further advance the development of photo-realistic human avatars. This paper details the setup of a light-stage specifically designed to capture high-resolution (4K) human head images and describes the process of addressing challenges (preprocessing, ethical issues) in collecting high-quality data. In addition to the data collection, we address the split of the dataset into train, validation, and test sets. Our goal is to design and support a fair view synthesis challenge task for this novel dataset, such that a similar level of performance can be maintained and expected when using the test set, as when using the validation set. The ILSH dataset consists of 52 subjects captured using 24 cameras with all 82 lighting sources turned on, resulting in a total of 1,248 close-up head images, border masks, and camera pose pairs.

interpretability

Title: Bridging the Gap between Human Motion and Action Semantics via Kinematic Phrases. (arXiv:2310.04189v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2310.04189
Code URL: null
Copy Paste: [[2310.04189]] Bridging the Gap between Human Motion and Action Semantics via Kinematic Phrases(http://arxiv.org/abs/2310.04189)
Summary:
The goal of motion understanding is to establish a reliable mapping between motion and action semantics, while it is a challenging many-to-many problem. An abstract action semantic (i.e., walk forwards) could be conveyed by perceptually diverse motions (walk with arms up or swinging), while a motion could carry different semantics w.r.t. its context and intention. This makes an elegant mapping between them difficult. Previous attempts adopted direct-mapping paradigms with limited reliability. Also, current automatic metrics fail to provide reliable assessments of the consistency between motions and action semantics. We identify the source of these problems as the significant gap between the two modalities. To alleviate this gap, we propose Kinematic Phrases (KP) that take the objective kinematic facts of human motion with proper abstraction, interpretability, and generality characteristics. Based on KP as a mediator, we can unify a motion knowledge base and build a motion understanding system. Meanwhile, KP can be automatically converted from motions and to text descriptions with no subjective bias, inspiring Kinematic Prompt Generation (KPG) as a novel automatic motion generation benchmark. In extensive experiments, our approach shows superiority over other methods. Our code and data would be made publicly available at https://foruck.github.io/KP.

explainability

watermark

diffusion

Title: Characterizing the Features of Mitotic Figures Using a Conditional Diffusion Probabilistic Model. (arXiv:2310.03893v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2310.03893
Code URL: https://github.com/cagladbahadir/dpm-for-mitotic-figures
Copy Paste: [[2310.03893]] Characterizing the Features of Mitotic Figures Using a Conditional Diffusion Probabilistic Model(http://arxiv.org/abs/2310.03893)
Summary:
Mitotic figure detection in histology images is a hard-to-define, yet clinically significant task, where labels are generated with pathologist interpretations and where there is no ``gold-standard'' independent ground-truth. However, it is well-established that these interpretation based labels are often unreliable, in part, due to differences in expertise levels and human subjectivity. In this paper, our goal is to shed light on the inherent uncertainty of mitosis labels and characterize the mitotic figure classification task in a human interpretable manner. We train a probabilistic diffusion model to synthesize patches of cell nuclei for a given mitosis label condition. Using this model, we can then generate a sequence of synthetic images that correspond to the same nucleus transitioning into the mitotic state. This allows us to identify different image features associated with mitosis, such as cytoplasm granularity, nuclear density, nuclear irregularity and high contrast between the nucleus and the cell body. Our approach offers a new tool for pathologists to interpret and communicate the features driving the decision to recognize a mitotic figure.

Title: VI-Diff: Unpaired Visible-Infrared Translation Diffusion Model for Single Modality Labeled Visible-Infrared Person Re-identification. (arXiv:2310.04122v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2310.04122
Code URL: null
Copy Paste: [[2310.04122]] VI-Diff: Unpaired Visible-Infrared Translation Diffusion Model for Single Modality Labeled Visible-Infrared Person Re-identification(http://arxiv.org/abs/2310.04122)
Summary:
Visible-Infrared person re-identification (VI-ReID) in real-world scenarios poses a significant challenge due to the high cost of cross-modality data annotation. Different sensing cameras, such as RGB/IR cameras for good/poor lighting conditions, make it costly and error-prone to identify the same person across modalities. To overcome this, we explore the use of single-modality labeled data for the VI-ReID task, which is more cost-effective and practical. By labeling pedestrians in only one modality (e.g., visible images) and retrieving in another modality (e.g., infrared images), we aim to create a training set containing both originally labeled and modality-translated data using unpaired image-to-image translation techniques. In this paper, we propose VI-Diff, a diffusion model that effectively addresses the task of Visible-Infrared person image translation. Through comprehensive experiments, we demonstrate that VI-Diff outperforms existing diffusion and GAN models, making it a promising solution for VI-ReID with single-modality labeled data. Our approach can be a promising solution to the VI-ReID task with single-modality labeled data and serves as a good starting point for future study. Code will be available.

Title: Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference. (arXiv:2310.04378v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2310.04378
Code URL: https://github.com/luosiallen/latent-consistency-model
Copy Paste: [[2310.04378]] Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference(http://arxiv.org/abs/2310.04378)
Summary:
Latent Diffusion models (LDMs) have achieved remarkable results in synthesizing high-resolution images. However, the iterative sampling process is computationally intensive and leads to slow generation. Inspired by Consistency Models (song et al.), we propose Latent Consistency Models (LCMs), enabling swift inference with minimal steps on any pre-trained LDMs, including Stable Diffusion (rombach et al). Viewing the guided reverse diffusion process as solving an augmented probability flow ODE (PF-ODE), LCMs are designed to directly predict the solution of such ODE in latent space, mitigating the need for numerous iterations and allowing rapid, high-fidelity sampling. Efficiently distilled from pre-trained classifier-free guided diffusion models, a high-quality 768 x 768 2~4-step LCM takes only 32 A100 GPU hours for training. Furthermore, we introduce Latent Consistency Fine-tuning (LCF), a novel method that is tailored for fine-tuning LCMs on customized image datasets. Evaluation on the LAION-5B-Aesthetics dataset demonstrates that LCMs achieve state-of-the-art text-to-image generation performance with few-step inference. Project Page: https://latent-consistency-models.github.io/

Title: CIFAR-10-Warehouse: Broad and More Realistic Testbeds in Model Generalization Analysis. (arXiv:2310.04414v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2310.04414
Code URL: null
Copy Paste: [[2310.04414]] CIFAR-10-Warehouse: Broad and More Realistic Testbeds in Model Generalization Analysis(http://arxiv.org/abs/2310.04414)
Summary:
Analyzing model performance in various unseen environments is a critical research problem in the machine learning community. To study this problem, it is important to construct a testbed with out-of-distribution test sets that have broad coverage of environmental discrepancies. However, existing testbeds typically either have a small number of domains or are synthesized by image corruptions, hindering algorithm design that demonstrates real-world effectiveness. In this paper, we introduce CIFAR-10-Warehouse, consisting of 180 datasets collected by prompting image search engines and diffusion models in various ways. Generally sized between 300 and 8,000 images, the datasets contain natural images, cartoons, certain colors, or objects that do not naturally appear. With CIFAR-10-W, we aim to enhance the evaluation and deepen the understanding of two generalization tasks: domain generalization and model accuracy prediction in various out-of-distribution environments. We conduct extensive benchmarking and comparison experiments and show that CIFAR-10-W offers new and interesting insights inherent to these tasks. We also discuss other fields that would benefit from CIFAR-10-W.

Title: Observation-Guided Diffusion Probabilistic Models. (arXiv:2310.04041v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2310.04041
Code URL: null
Copy Paste: [[2310.04041]] Observation-Guided Diffusion Probabilistic Models(http://arxiv.org/abs/2310.04041)
Summary:
We propose a novel diffusion model called observation-guided diffusion probabilistic model (OGDM), which effectively addresses the trade-off between quality control and fast sampling. Our approach reestablishes the training objective by integrating the guidance of the observation process with the Markov chain in a principled way. This is achieved by introducing an additional loss term derived from the observation based on the conditional discriminator on noise level, which employs Bernoulli distribution indicating whether its input lies on the (noisy) real manifold or not. This strategy allows us to optimize the more accurate negative log-likelihood induced in the inference stage especially when the number of function evaluations is limited. The proposed training method is also advantageous even when incorporated only into the fine-tuning process, and it is compatible with various fast inference strategies since our method yields better denoising networks using the exactly same inference procedure without incurring extra computational cost. We demonstrate the effectiveness of the proposed training algorithm using diverse inference methods on strong diffusion model baselines.

noise learning

data-free

transformer

Title: Accelerated Neural Network Training with Rooted Logistic Objectives. (arXiv:2310.03890v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2310.03890
Code URL: null
Copy Paste: [[2310.03890]] Accelerated Neural Network Training with Rooted Logistic Objectives(http://arxiv.org/abs/2310.03890)
Summary:
Many neural networks deployed in the real world scenarios are trained using cross entropy based loss functions. From the optimization perspective, it is known that the behavior of first order methods such as gradient descent crucially depend on the separability of datasets. In fact, even in the most simplest case of binary classification, the rate of convergence depends on two factors: (1) condition number of data matrix, and (2) separability of the dataset. With no further pre-processing techniques such as over-parametrization, data augmentation etc., separability is an intrinsic quantity of the data distribution under consideration. We focus on the landscape design of the logistic function and derive a novel sequence of {\em strictly} convex functions that are at least as strict as logistic loss. The minimizers of these functions coincide with those of the minimum norm solution wherever possible. The strict convexity of the derived function can be extended to finetune state-of-the-art models and applications. In empirical experimental analysis, we apply our proposed rooted logistic objective to multiple deep models, e.g., fully-connected neural networks and transformers, on various of classification benchmarks. Our results illustrate that training with rooted loss function is converged faster and gains performance improvements. Furthermore, we illustrate applications of our novel rooted loss function in generative modeling based downstream applications, such as finetuning StyleGAN model with the rooted loss. The code implementing our losses and models can be found here for open source software development purposes: https://anonymous.4open.science/r/rooted_loss.

Title: Sub-token ViT Embedding via Stochastic Resonance Transformers. (arXiv:2310.03967v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2310.03967
Code URL: null
Copy Paste: [[2310.03967]] Sub-token ViT Embedding via Stochastic Resonance Transformers(http://arxiv.org/abs/2310.03967)
Summary:
We discover the presence of quantization artifacts in Vision Transformers (ViTs), which arise due to the image tokenization step inherent in these architectures. These artifacts result in coarsely quantized features, which negatively impact performance, especially on downstream dense prediction tasks. We present a zero-shot method to improve how pre-trained ViTs handle spatial quantization. In particular, we propose to ensemble the features obtained from perturbing input images via sub-token spatial translations, inspired by Stochastic Resonance, a method traditionally applied to climate dynamics and signal processing. We term our method ``Stochastic Resonance Transformer" (SRT), which we show can effectively super-resolve features of pre-trained ViTs, capturing more of the local fine-grained structures that might otherwise be neglected as a result of tokenization. SRT can be applied at any layer, on any task, and does not require any fine-tuning. The advantage of the former is evident when applied to monocular depth prediction, where we show that ensembling model outputs are detrimental while applying SRT on intermediate ViT features outperforms the baseline models by an average of 4.7% and 14.9% on the RMSE and RMSE-log metrics across three different architectures. When applied to semi-supervised video object segmentation, SRT also improves over the baseline models uniformly across all metrics, and by an average of 2.4% in F&J score. We further show that these quantization artifacts can be attenuated to some extent via self-distillation. On the unsupervised salient region segmentation, SRT improves upon the base model by an average of 2.1% on the maxF metric. Finally, despite operating purely on pixel-level features, SRT generalizes to non-dense prediction tasks such as image retrieval and object discovery, yielding consistent improvements of up to 2.6% and 1.0% respectively.

Title: ClusVPR: Efficient Visual Place Recognition with Clustering-based Weighted Transformer. (arXiv:2310.04099v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2310.04099
Code URL: null
Copy Paste: [[2310.04099]] ClusVPR: Efficient Visual Place Recognition with Clustering-based Weighted Transformer(http://arxiv.org/abs/2310.04099)
Summary:
Visual place recognition (VPR) is a highly challenging task that has a wide range of applications, including robot navigation and self-driving vehicles. VPR is particularly difficult due to the presence of duplicate regions and the lack of attention to small objects in complex scenes, resulting in recognition deviations. In this paper, we present ClusVPR, a novel approach that tackles the specific issues of redundant information in duplicate regions and representations of small objects. Different from existing methods that rely on Convolutional Neural Networks (CNNs) for feature map generation, ClusVPR introduces a unique paradigm called Clustering-based Weighted Transformer Network (CWTNet). CWTNet leverages the power of clustering-based weighted feature maps and integrates global dependencies to effectively address visual deviations encountered in large-scale VPR problems. We also introduce the optimized-VLAD (OptLAD) layer that significantly reduces the number of parameters and enhances model efficiency. This layer is specifically designed to aggregate the information obtained from scale-wise image patches. Additionally, our pyramid self-supervised strategy focuses on extracting representative and diverse information from scale-wise image patches instead of entire images, which is crucial for capturing representative and diverse information in VPR. Extensive experiments on four VPR datasets show our model's superior performance compared to existing models while being less complex.

Title: TiC: Exploring Vision Transformer in Convolution. (arXiv:2310.04134v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2310.04134
Code URL: https://github.com/zs670980918/msa-conv
Copy Paste: [[2310.04134]] TiC: Exploring Vision Transformer in Convolution(http://arxiv.org/abs/2310.04134)
Summary:
While models derived from Vision Transformers (ViTs) have been phonemically surging, pre-trained models cannot seamlessly adapt to arbitrary resolution images without altering the architecture and configuration, such as sampling the positional encoding, limiting their flexibility for various vision tasks. For instance, the Segment Anything Model (SAM) based on ViT-Huge requires all input images to be resized to 1024$\times$1024. To overcome this limitation, we propose the Multi-Head Self-Attention Convolution (MSA-Conv) that incorporates Self-Attention within generalized convolutions, including standard, dilated, and depthwise ones. Enabling transformers to handle images of varying sizes without retraining or rescaling, the use of MSA-Conv further reduces computational costs compared to global attention in ViT, which grows costly as image size increases. Later, we present the Vision Transformer in Convolution (TiC) as a proof of concept for image classification with MSA-Conv, where two capacity enhancing strategies, namely Multi-Directional Cyclic Shifted Mechanism and Inter-Pooling Mechanism, have been proposed, through establishing long-distance connections between tokens and enlarging the effective receptive field. Extensive experiments have been carried out to validate the overall effectiveness of TiC. Additionally, ablation studies confirm the performance improvement made by MSA-Conv and the two capacity enhancing strategies separately. Note that our proposal aims at studying an alternative to the global attention used in ViT, while MSA-Conv meets our goal by making TiC comparable to state-of-the-art on ImageNet-1K. Code will be released at https://github.com/zs670980918/MSA-Conv.

Title: Entropic Score metric: Decoupling Topology and Size in Training-free NAS. (arXiv:2310.04179v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2310.04179
Code URL: null
Copy Paste: [[2310.04179]] Entropic Score metric: Decoupling Topology and Size in Training-free NAS(http://arxiv.org/abs/2310.04179)
Summary:
Neural Networks design is a complex and often daunting task, particularly for resource-constrained scenarios typical of mobile-sized models. Neural Architecture Search is a promising approach to automate this process, but existing competitive methods require large training time and computational resources to generate accurate models. To overcome these limits, this paper contributes with: i) a novel training-free metric, named Entropic Score, to estimate model expressivity through the aggregated element-wise entropy of its activations; ii) a cyclic search algorithm to separately yet synergistically search model size and topology. Entropic Score shows remarkable ability in searching for the topology of the network, and a proper combination with LogSynflow, to search for model size, yields superior capability to completely design high-performance Hybrid Transformers for edge applications in less than 1 GPU hour, resulting in the fastest and most accurate NAS method for ImageNet classification.

Title: Degradation-Aware Self-Attention Based Transformer for Blind Image Super-Resolution. (arXiv:2310.04180v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2310.04180
Code URL: https://github.com/i2-multimedia-lab/dsat
Copy Paste: [[2310.04180]] Degradation-Aware Self-Attention Based Transformer for Blind Image Super-Resolution(http://arxiv.org/abs/2310.04180)
Summary:
Compared to CNN-based methods, Transformer-based methods achieve impressive image restoration outcomes due to their abilities to model remote dependencies. However, how to apply Transformer-based methods to the field of blind super-resolution (SR) and further make an SR network adaptive to degradation information is still an open problem. In this paper, we propose a new degradation-aware self-attention-based Transformer model, where we incorporate contrastive learning into the Transformer network for learning the degradation representations of input images with unknown noise. In particular, we integrate both CNN and Transformer components into the SR network, where we first use the CNN modulated by the degradation information to extract local features, and then employ the degradation-aware Transformer to extract global semantic features. We apply our proposed model to several popular large-scale benchmark datasets for testing, and achieve the state-of-the-art performance compared to existing methods. In particular, our method yields a PSNR of 32.43 dB on the Urban100 dataset at $\times$2 scale, 0.94 dB higher than DASR, and 26.62 dB on the Urban100 dataset at $\times$4 scale, 0.26 dB improvement over KDSR, setting a new benchmark in this area. Source code is available at: https://github.com/I2-Multimedia-Lab/DSAT/tree/main.

Title: Contextualized Structural Self-supervised Learning for Ontology Matching. (arXiv:2310.03840v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2310.03840
Code URL: https://github.com/ellenzhuwang/lakermap
Copy Paste: [[2310.03840]] Contextualized Structural Self-supervised Learning for Ontology Matching(http://arxiv.org/abs/2310.03840)
Summary:
Ontology matching (OM) entails the identification of semantic relationships between concepts within two or more knowledge graphs (KGs) and serves as a critical step in integrating KGs from various sources. Recent advancements in deep OM models have harnessed the power of transformer-based language models and the advantages of knowledge graph embedding. Nevertheless, these OM models still face persistent challenges, such as a lack of reference alignments, runtime latency, and unexplored different graph structures within an end-to-end framework. In this study, we introduce a novel self-supervised learning OM framework with input ontologies, called LaKERMap. This framework capitalizes on the contextual and structural information of concepts by integrating implicit knowledge into transformers. Specifically, we aim to capture multiple structural contexts, encompassing both local and global interactions, by employing distinct training objectives. To assess our methods, we utilize the Bio-ML datasets and tasks. The findings from our innovative approach reveal that LaKERMap surpasses state-of-the-art systems in terms of alignment quality and inference time. Our models and codes are available here: https://github.com/ellenzhuwang/lakermap.

Title: Quantized Transformer Language Model Implementations on Edge Devices. (arXiv:2310.03971v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2310.03971
Code URL: null
Copy Paste: [[2310.03971]] Quantized Transformer Language Model Implementations on Edge Devices(http://arxiv.org/abs/2310.03971)
Summary:
Large-scale transformer-based models like the Bidirectional Encoder Representations from Transformers (BERT) are widely used for Natural Language Processing (NLP) applications, wherein these models are initially pre-trained with a large corpus with millions of parameters and then fine-tuned for a downstream NLP task. One of the major limitations of these large-scale models is that they cannot be deployed on resource-constrained devices due to their large model size and increased inference latency. In order to overcome these limitations, such large-scale models can be converted to an optimized FlatBuffer format, tailored for deployment on resource-constrained edge devices. Herein, we evaluate the performance of such FlatBuffer transformed MobileBERT models on three different edge devices, fine-tuned for Reputation analysis of English language tweets in the RepLab 2013 dataset. In addition, this study encompassed an evaluation of the deployed models, wherein their latency, performance, and resource efficiency were meticulously assessed. Our experiment results show that, compared to the original BERT large model, the converted and quantized MobileBERT models have 160$\times$ smaller footprints for a 4.1% drop in accuracy while analyzing at least one tweet per second on edge devices. Furthermore, our study highlights the privacy-preserving aspect of TinyML systems as all data is processed locally within a serverless environment.

Title: ALBERTA: ALgorithm-Based Error Resilience in Transformer Architectures. (arXiv:2310.03841v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2310.03841
Code URL: null
Copy Paste: [[2310.03841]] ALBERTA: ALgorithm-Based Error Resilience in Transformer Architectures(http://arxiv.org/abs/2310.03841)
Summary:
Vision Transformers are being increasingly deployed in safety-critical applications that demand high reliability. It is crucial to ensure the correctness of their execution in spite of potential errors such as transient hardware errors. We propose a novel algorithm-based resilience framework called ALBERTA that allows us to perform end-to-end resilience analysis and protection of transformer-based architectures. First, our work develops an efficient process of computing and ranking the resilience of transformers layers. We find that due to the large size of transformer models, applying traditional network redundancy to a subset of the most vulnerable layers provides high error coverage albeit with impractically high overhead. We address this shortcoming by providing a software-directed, checksum-based error detection technique aimed at protecting the most vulnerable general matrix multiply (GEMM) layers in the transformer models that use either floating-point or integer arithmetic. Results show that our approach achieves over 99% coverage for errors that result in a mismatch at less than 0.2% computation overhead. Lastly, we present the applicability of our framework in various modern GPU architectures under different numerical precisions. We introduce an efficient self-correction mechanism for resolving erroneous detection with an average overhead of less than 0.002% (with a 2% overhead to resolve each erroneous detection).

Title: CrysFormer: Protein Structure Prediction via 3d Patterson Maps and Partial Structure Attention. (arXiv:2310.03899v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2310.03899
Code URL: null
Copy Paste: [[2310.03899]] CrysFormer: Protein Structure Prediction via 3d Patterson Maps and Partial Structure Attention(http://arxiv.org/abs/2310.03899)
Summary:
Determining the structure of a protein has been a decades-long open question. A protein's three-dimensional structure often poses nontrivial computation costs, when classical simulation algorithms are utilized. Advances in the transformer neural network architecture -- such as AlphaFold2 -- achieve significant improvements for this problem, by learning from a large dataset of sequence information and corresponding protein structures. Yet, such methods only focus on sequence information; other available prior knowledge, such as protein crystallography and partial structure of amino acids, could be potentially utilized. To the best of our knowledge, we propose the first transformer-based model that directly utilizes protein crystallography and partial structure information to predict the electron density maps of proteins. Via two new datasets of peptide fragments (2-residue and 15-residue) , we demonstrate our method, dubbed \texttt{CrysFormer}, can achieve accurate predictions, based on a much smaller dataset size and with reduced computation costs.

Title: RTDK-BO: High Dimensional Bayesian Optimization with Reinforced Transformer Deep kernels. (arXiv:2310.03912v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2310.03912
Code URL: null
Copy Paste: [[2310.03912]] RTDK-BO: High Dimensional Bayesian Optimization with Reinforced Transformer Deep kernels(http://arxiv.org/abs/2310.03912)
Summary:
Bayesian Optimization (BO), guided by Gaussian process (GP) surrogates, has proven to be an invaluable technique for efficient, high-dimensional, black-box optimization, a critical problem inherent to many applications such as industrial design and scientific computing. Recent contributions have introduced reinforcement learning (RL) to improve the optimization performance on both single function optimization and \textit{few-shot} multi-objective optimization. However, even few-shot techniques fail to exploit similarities shared between closely related objectives. In this paper, we combine recent developments in Deep Kernel Learning (DKL) and attention-based Transformer models to improve the modeling powers of GP surrogates with meta-learning. We propose a novel method for improving meta-learning BO surrogates by incorporating attention mechanisms into DKL, empowering the surrogates to adapt to contextual information gathered during the BO process. We combine this Transformer Deep Kernel with a learned acquisition function trained with continuous Soft Actor-Critic Reinforcement Learning to aid in exploration. This Reinforced Transformer Deep Kernel (RTDK-BO) approach yields state-of-the-art results in continuous high-dimensional optimization problems.

Title: Toward a Foundation Model for Time Series Data. (arXiv:2310.03916v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2310.03916
Code URL: null
Copy Paste: [[2310.03916]] Toward a Foundation Model for Time Series Data(http://arxiv.org/abs/2310.03916)
Summary:
A foundation model is a machine learning model trained on a large and diverse set of data, typically using self-supervised learning-based pre-training techniques, that can be adapted to various downstream tasks. However, current research on time series pre-training has mostly focused on models pre-trained solely on data from a single domain, resulting in a lack of knowledge about other types of time series. However, current research on time series pre-training has predominantly focused on models trained exclusively on data from a single domain. As a result, these models possess domain-specific knowledge that may not be easily transferable to time series from other domains. In this paper, we aim to develop an effective time series foundation model by leveraging unlabeled samples from multiple domains. To achieve this, we repurposed the publicly available UCR Archive and evaluated four existing self-supervised learning-based pre-training methods, along with a novel method, on the datasets. We tested these methods using four popular neural network architectures for time series to understand how the pre-training methods interact with different network designs. Our experimental results show that pre-training improves downstream classification tasks by enhancing the convergence of the fine-tuning process. Furthermore, we found that the proposed pre-training method, when combined with the Transformer model, outperforms the alternatives.

Title: Exploiting Transformer Activation Sparsity with Dynamic Inference. (arXiv:2310.04361v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2310.04361
Code URL: null
Copy Paste: [[2310.04361]] Exploiting Transformer Activation Sparsity with Dynamic Inference(http://arxiv.org/abs/2310.04361)
Summary:
Transformer models, despite their impressive performance, often face practical limitations due to their high computational requirements. At the same time, previous studies have revealed significant activation sparsity in these models, indicating the presence of redundant computations. In this paper, we propose Dynamic Sparsified Transformer Inference (DSTI), a method that radically reduces the inference cost of Transformer models by enforcing activation sparsity and subsequently transforming a dense model into its sparse Mixture of Experts (MoE) version. We demonstrate that it is possible to train small gating networks that successfully predict the relative contribution of each expert during inference. Furthermore, we introduce a mechanism that dynamically determines the number of executed experts individually for each token. DSTI can be applied to any Transformer-based architecture and has negligible impact on the accuracy. For the BERT-base classification model, we reduce inference cost by almost 60%.

Title: Functional Interpolation for Relative Positions Improves Long Context Transformers. (arXiv:2310.04418v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2310.04418
Code URL: null
Copy Paste: [[2310.04418]] Functional Interpolation for Relative Positions Improves Long Context Transformers(http://arxiv.org/abs/2310.04418)
Summary:
Preventing the performance decay of Transformers on inputs longer than those used for training has been an important challenge in extending the context length of these models. Though the Transformer architecture has fundamentally no limits on the input sequence lengths it can process, the choice of position encoding used during training can limit the performance of these models on longer inputs. We propose a novel functional relative position encoding with progressive interpolation, FIRE, to improve Transformer generalization to longer contexts. We theoretically prove that this can represent some of the popular relative position encodings, such as T5's RPE, Alibi, and Kerple. We next empirically show that FIRE models have better generalization to longer contexts on both zero-shot language modeling and long text benchmarks.

generative

Title: Class-Incremental Learning Using Generative Experience Replay Based on Time-aware Regularization. (arXiv:2310.03898v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2310.03898
Code URL: null
Copy Paste: [[2310.03898]] Class-Incremental Learning Using Generative Experience Replay Based on Time-aware Regularization(http://arxiv.org/abs/2310.03898)
Summary:
Learning new tasks accumulatively without forgetting remains a critical challenge in continual learning. Generative experience replay addresses this challenge by synthesizing pseudo-data points for past learned tasks and later replaying them for concurrent training along with the new tasks' data. Generative replay is the best strategy for continual learning under a strict class-incremental setting when certain constraints need to be met: (i) constant model size, (ii) no pre-training dataset, and (iii) no memory buffer for storing past tasks' data. Inspired by the biological nervous system mechanisms, we introduce a time-aware regularization method to dynamically fine-tune the three training objective terms used for generative replay: supervised learning, latent regularization, and data reconstruction. Experimental results on major benchmarks indicate that our method pushes the limit of brain-inspired continual learners under such strict settings, improves memory retention, and increases the average performance over continually arriving tasks.

large language model

Title: Automatic and Human-AI Interactive Text Generation. (arXiv:2310.03878v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2310.03878
Code URL: null
Copy Paste: [[2310.03878]] Automatic and Human-AI Interactive Text Generation(http://arxiv.org/abs/2310.03878)
Summary:
In this tutorial, we focus on text-to-text generation, a class of natural language generation (NLG) tasks, that takes a piece of text as input and then generates a revision that is improved according to some specific criteria (e.g., readability or linguistic styles), while largely retaining the original meaning and the length of the text. This includes many useful applications, such as text simplification, paraphrase generation, style transfer, etc. In contrast to text summarization and open-ended text completion (e.g., story), the text-to-text generation tasks we discuss in this tutorial are more constrained in terms of semantic consistency and targeted language styles. This level of control makes these tasks ideal testbeds for studying the ability of models to generate text that is both semantically adequate and stylistically appropriate. Moreover, these tasks are interesting from a technical standpoint, as they require complex combinations of lexical and syntactical transformations, stylistic control, and adherence to factual knowledge, -- all at once. With a special focus on text simplification and revision, this tutorial aims to provide an overview of the state-of-the-art natural language generation research from four major aspects -- Data, Models, Human-AI Collaboration, and Evaluation -- and to discuss and showcase a few significant and recent advances: (1) the use of non-retrogressive approaches; (2) the shift from fine-tuning to prompting with large language models; (3) the development of new learnable metric and fine-grained human evaluation framework; (4) a growing body of studies and datasets on non-English languages; (5) the rise of HCI+NLP+Accessibility interdisciplinary research to create real-world writing assistant systems.

Title: Evaluating Multi-Agent Coordination Abilities in Large Language Models. (arXiv:2310.03903v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2310.03903
Code URL: null
Copy Paste: [[2310.03903]] Evaluating Multi-Agent Coordination Abilities in Large Language Models(http://arxiv.org/abs/2310.03903)
Summary:
A pivotal aim in contemporary AI research is to develop agents proficient in multi-agent coordination, enabling effective collaboration with both humans and other systems. Large Language Models (LLMs), with their notable ability to understand, generate, and interpret language in a human-like manner, stand out as promising candidates for the development of such agents. In this study, we build and assess the effectiveness of agents crafted using LLMs in various coordination scenarios. We introduce the LLM-Coordination (LLM-Co) Framework, specifically designed to enable LLMs to play coordination games. With the LLM-Co framework, we conduct our evaluation with three game environments and organize the evaluation into five aspects: Theory of Mind, Situated Reasoning, Sustained Coordination, Robustness to Partners, and Explicit Assistance. First, the evaluation of the Theory of Mind and Situated Reasoning reveals the capabilities of LLM to infer the partner's intention and reason actions accordingly. Then, the evaluation around Sustained Coordination and Robustness to Partners further showcases the ability of LLMs to coordinate with an unknown partner in complex long-horizon tasks, outperforming Reinforcement Learning baselines. Lastly, to test Explicit Assistance, which refers to the ability of an agent to offer help proactively, we introduce two novel layouts into the Overcooked-AI benchmark, examining if agents can prioritize helping their partners, sacrificing time that could have been spent on their tasks. This research underscores the promising capabilities of LLMs in sophisticated coordination environments and reveals the potential of LLMs in building strong real-world agents for multi-agent coordination.

Title: Chain of Natural Language Inference for Reducing Large Language Model Ungrounded Hallucinations. (arXiv:2310.03951v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2310.03951
Code URL: null
Copy Paste: [[2310.03951]] Chain of Natural Language Inference for Reducing Large Language Model Ungrounded Hallucinations(http://arxiv.org/abs/2310.03951)
Summary:
Large language models (LLMs) can generate fluent natural language texts when given relevant documents as background context. This ability has attracted considerable interest in developing industry applications of LLMs. However, LLMs are prone to generate hallucinations that are not supported by the provided sources. In this paper, we propose a hierarchical framework to detect and mitigate such ungrounded hallucination. Our framework uses Chain of Natural Language Inference (CoNLI) for hallucination detection and hallucination reduction via post-editing. Our approach achieves state-of-the-art performance on hallucination detection and enhances text quality through rewrite, using LLMs without any fine-tuning or domain-specific prompt engineering. We show that this simple plug-and-play framework can serve as an effective choice for hallucination detection and reduction, achieving competitive performance across various contexts.

Title: Enhancing Financial Sentiment Analysis via Retrieval Augmented Large Language Models. (arXiv:2310.04027v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2310.04027
Code URL: https://github.com/AI4Finance-Foundation/FinGPT/tree/master/fingpt/FinGPT-RAG
Copy Paste: [[2310.04027]] Enhancing Financial Sentiment Analysis via Retrieval Augmented Large Language Models(http://arxiv.org/abs/2310.04027)
Summary:
Financial sentiment analysis is critical for valuation and investment decision-making. Traditional NLP models, however, are limited by their parameter size and the scope of their training datasets, which hampers their generalization capabilities and effectiveness in this field. Recently, Large Language Models (LLMs) pre-trained on extensive corpora have demonstrated superior performance across various NLP tasks due to their commendable zero-shot abilities. Yet, directly applying LLMs to financial sentiment analysis presents challenges: The discrepancy between the pre-training objective of LLMs and predicting the sentiment label can compromise their predictive performance. Furthermore, the succinct nature of financial news, often devoid of sufficient context, can significantly diminish the reliability of LLMs' sentiment analysis. To address these challenges, we introduce a retrieval-augmented LLMs framework for financial sentiment analysis. This framework includes an instruction-tuned LLMs module, which ensures LLMs behave as predictors of sentiment labels, and a retrieval-augmentation module which retrieves additional context from reliable external sources. Benchmarked against traditional models and LLMs like ChatGPT and LLaMA, our approach achieves 15\% to 48\% performance gain in accuracy and F1 score.

Title: Analysis of the Reasoning with Redundant Information Provided Ability of Large Language Models. (arXiv:2310.04039v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2310.04039
Code URL: null
Copy Paste: [[2310.04039]] Analysis of the Reasoning with Redundant Information Provided Ability of Large Language Models(http://arxiv.org/abs/2310.04039)
Summary:
Recent advancements in Large Language Models (LLMs) have demonstrated impressive capabilities across a range of natural language processing tasks, especially in reasoning, a cornerstone for achieving Artificial General Intelligence (AGI). However, commonly used benchmarks may not fully encapsulate the inferential abilities of these models in real-world scenarios. To address this gap, a new form of Question-Answering (QA) task, termed Reasoning with Redundant Information Provided (RRIP), is introduced. The study designed a modified version of the grade school math 8K (GSM-8K) dataset which has several variants focusing on different attributes of redundant information. This investigation evaluates two popular LLMs, LlaMA2-13B-chat and generative pre-trained transformer 3.5 (GPT-3.5), contrasting their performance on traditional QA tasks against the RRIP tasks. Findings indicate that while these models achieved moderate success on standard QA benchmarks, their performance notably declines when assessed on RRIP tasks. The study not only highlights the limitations of current LLMs in handling redundant information but also suggests that future training of these models should focus on incorporating redundant information into the training data to increase the performance on RRIP tasks.

Title: A Comprehensive Evaluation of Large Language Models on Benchmark Biomedical Text Processing Tasks. (arXiv:2310.04270v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2310.04270
Code URL: null
Copy Paste: [[2310.04270]] A Comprehensive Evaluation of Large Language Models on Benchmark Biomedical Text Processing Tasks(http://arxiv.org/abs/2310.04270)
Summary:
Recently, Large Language Models (LLM) have demonstrated impressive capability to solve a wide range of tasks. However, despite their success across various tasks, no prior work has investigated their capability in the biomedical domain yet. To this end, this paper aims to evaluate the performance of LLMs on benchmark biomedical tasks. For this purpose, we conduct a comprehensive evaluation of 4 popular LLMs in 6 diverse biomedical tasks across 26 datasets. To the best of our knowledge, this is the first work that conducts an extensive evaluation and comparison of various LLMs in the biomedical domain. Interestingly, we find based on our evaluation that in biomedical datasets that have smaller training sets, zero-shot LLMs even outperform the current state-of-the-art fine-tuned biomedical models. This suggests that pretraining on large text corpora makes LLMs quite specialized even in the biomedical domain. We also find that not a single LLM can outperform other LLMs in all tasks, with the performance of different LLMs may vary depending on the task. While their performance is still quite poor in comparison to the biomedical models that were fine-tuned on large training sets, our findings demonstrate that LLMs have the potential to be a valuable tool for various biomedical tasks that lack large annotated data.

Title: Amortizing intractable inference in large language models. (arXiv:2310.04363v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2310.04363
Code URL: null
Copy Paste: [[2310.04363]] Amortizing intractable inference in large language models(http://arxiv.org/abs/2310.04363)
Summary:
Autoregressive large language models (LLMs) compress knowledge from their training data through next-token conditional distributions. This limits tractable querying of this knowledge to start-to-end autoregressive sampling. However, many tasks of interest -- including sequence continuation, infilling, and other forms of constrained generation -- involve sampling from intractable posterior distributions. We address this limitation by using amortized Bayesian inference to sample from these intractable posteriors. Such amortization is algorithmically achieved by fine-tuning LLMs via diversity-seeking reinforcement learning algorithms: generative flow networks (GFlowNets). We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training and reward-maximizing policy optimization. As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem and demonstrate that our approach enables data-efficient adaptation of LLMs to tasks that require multi-step rationalization and tool use.

Title: Policy-Gradient Training of Language Models for Ranking. (arXiv:2310.04407v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2310.04407
Code URL: null
Copy Paste: [[2310.04407]] Policy-Gradient Training of Language Models for Ranking(http://arxiv.org/abs/2310.04407)
Summary:
Text retrieval plays a crucial role in incorporating factual knowledge for decision making into language processing pipelines, ranging from chat-based web search to question answering systems. Current state-of-the-art text retrieval models leverage pre-trained large language models (LLMs) to achieve competitive performance, but training LLM-based retrievers via typical contrastive losses requires intricate heuristics, including selecting hard negatives and using additional supervision as learning signals. This reliance on heuristics stems from the fact that the contrastive loss itself is heuristic and does not directly optimize the downstream metrics of decision quality at the end of the processing pipeline. To address this issue, we introduce Neural PG-RANK, a novel training algorithm that learns to rank by instantiating a LLM as a Plackett-Luce ranking policy. Neural PG-RANK provides a principled method for end-to-end training of retrieval models as part of larger decision systems via policy gradient, with little reliance on complex heuristics, and it effectively unifies the training objective with downstream decision-making quality. We conduct extensive experiments on various text retrieval benchmarks. The results demonstrate that when the training objective aligns with the evaluation setup, Neural PG-RANK yields remarkable in-domain performance improvement, with substantial out-of-domain generalization to some critical datasets employed in downstream question answering tasks.

Title: AUTOPARLLM: GNN-Guided Automatic Code Parallelization using Large Language Models. (arXiv:2310.04047v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2310.04047
Code URL: null
Copy Paste: [[2310.04047]] AUTOPARLLM: GNN-Guided Automatic Code Parallelization using Large Language Models(http://arxiv.org/abs/2310.04047)
Summary:
Parallelizing sequentially written programs is a challenging task. Even experienced developers need to spend considerable time finding parallelism opportunities and then actually writing parallel versions of sequentially written programs. To address this issue, we present AUTOPARLLM, a framework for automatically discovering parallelism and generating the parallel version of the sequentially written program. Our framework consists of two major components: i) a heterogeneous Graph Neural Network (GNN) based parallelism discovery and parallel pattern detection module, and ii) an LLM-based code generator to generate the parallel counterpart of the sequential programs. We use the GNN to learn the flow-aware characteristics of the programs to identify parallel regions in sequential programs and then construct an enhanced prompt using the GNN's results for the LLM-based generator to finally produce the parallel counterparts of the sequential programs. We evaluate AUTOPARLLM on 11 applications of 2 well-known benchmark suites: NAS Parallel Benchmark and Rodinia Benchmark. Our results show that AUTOPARLLM is indeed effective in improving the state-of-the-art LLM-based models for the task of parallel code generation in terms of multiple code generation metrics. AUTOPARLLM also improves the average runtime of the parallel code generated by the state-of-the-art LLMs by as high as 3.4% and 2.9% for the NAS Parallel Benchmark and Rodinia Benchmark respectively. Additionally, to overcome the issue that well-known metrics for translation evaluation have not been optimized to evaluate the quality of the generated parallel code, we propose OMPScore for evaluating the quality of the generated code. We show that OMPScore exhibits a better correlation with human judgment than existing metrics, measured by up to 75% improvement of Spearman correlation.

Title: A Language-Agent Approach to Formal Theorem-Proving. (arXiv:2310.04353v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2310.04353
Code URL: null
Copy Paste: [[2310.04353]] A Language-Agent Approach to Formal Theorem-Proving(http://arxiv.org/abs/2310.04353)
Summary:
Language agents, which use a large language model (LLM) capable of in-context learning to interact with an external environment, have recently emerged as a promising approach to control tasks. We present the first language-agent approach to formal theorem-proving. Our method, COPRA, uses a high-capacity, black-box LLM (GPT-4) as part of a policy for a stateful backtracking search. During the search, the policy can select proof tactics and retrieve lemmas and definitions from an external database. Each selected tactic is executed in the underlying proof framework, and the execution feedback is used to build the prompt for the next policy invocation. The search also tracks selected information from its history and uses it to reduce hallucinations and unnecessary LLM queries.

We evaluate COPRA on the miniF2F benchmark for Lean and a set of Coq tasks from the Compcert project. On these benchmarks, COPRA is significantly better than one-shot invocations of GPT-4, as well as state-of-the-art models fine-tuned on proof data, at finding correct proofs quickly.

Title: Confronting Reward Model Overoptimization with Constrained RLHF. (arXiv:2310.04373v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2310.04373
Code URL: null
Copy Paste: [[2310.04373]] Confronting Reward Model Overoptimization with Constrained RLHF(http://arxiv.org/abs/2310.04373)
Summary:
Large language models are typically aligned with human preferences by optimizing $\textit{reward models}$ (RMs) fitted to human feedback. However, human preferences are multi-faceted, and it is increasingly common to derive reward from a composition of simpler reward models which each capture a different aspect of language quality. This itself presents a challenge, as it is difficult to appropriately weight these component RMs when combining them. Compounding this difficulty, because any RM is only a proxy for human evaluation, this process is vulnerable to $\textit{overoptimization}$, wherein past a certain point, accumulating higher reward is associated with worse human ratings. In this paper, we perform, to our knowledge, the first study on overoptimization in composite RMs, showing that correlation between component RMs has a significant effect on the locations of these points. We then introduce an approach to solve this issue using constrained reinforcement learning as a means of preventing the agent from exceeding each RM's threshold of usefulness. Our method addresses the problem of weighting component RMs by learning dynamic weights, naturally given by the Lagrange multipliers. As a result, each RM stays within the range at which it is an effective proxy, improving evaluation performance. Finally, we introduce an adaptive method using gradient-free optimization to identify and optimize towards these points during a single run.

Title: Why Do We Need Weight Decay in Modern Deep Learning?. (arXiv:2310.04415v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2310.04415
Code URL: https://github.com/tml-epfl/why-weight-decay
Copy Paste: [[2310.04415]] Why Do We Need Weight Decay in Modern Deep Learning?(http://arxiv.org/abs/2310.04415)
Summary:
Weight decay is a broadly used technique for training state-of-the-art deep networks, including large language models. Despite its widespread usage, its role remains poorly understood. In this work, we highlight that the role of weight decay in modern deep learning is different from its regularization effect studied in classical learning theory. For overparameterized deep networks, we show how weight decay modifies the optimization dynamics enhancing the ever-present implicit regularization of SGD via the loss stabilization mechanism. In contrast, for underparameterized large language models trained with nearly online SGD, we describe how weight decay balances the bias-variance tradeoff in stochastic optimization leading to lower training loss. Moreover, we show that weight decay also prevents sudden loss divergences for bfloat16 mixed-precision training which is a crucial tool for LLM training. Overall, we present a unifying perspective from ResNets on vision tasks to LLMs: weight decay is never useful as an explicit regularizer but instead changes the training dynamics in a desirable way. Our code is available at https://github.com/tml-epfl/why-weight-decay.

Title: BrainSCUBA: Fine-Grained Natural Language Captions of Visual Cortex Selectivity. (arXiv:2310.04420v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2310.04420
Code URL: null
Copy Paste: [[2310.04420]] BrainSCUBA: Fine-Grained Natural Language Captions of Visual Cortex Selectivity(http://arxiv.org/abs/2310.04420)
Summary:
Understanding the functional organization of higher visual cortex is a central focus in neuroscience. Past studies have primarily mapped the visual and semantic selectivity of neural populations using hand-selected stimuli, which may potentially bias results towards pre-existing hypotheses of visual cortex functionality. Moving beyond conventional approaches, we introduce a data-driven method that generates natural language descriptions for images predicted to maximally activate individual voxels of interest. Our method -- Semantic Captioning Using Brain Alignments ("BrainSCUBA") -- builds upon the rich embedding space learned by a contrastive vision-language model and utilizes a pre-trained large language model to generate interpretable captions. We validate our method through fine-grained voxel-level captioning across higher-order visual regions. We further perform text-conditioned image synthesis with the captions, and show that our images are semantically coherent and yield high predicted activations. Finally, to demonstrate how our method enables scientific discovery, we perform exploratory investigations on the distribution of "person" representations in the brain, and discover fine-grained semantic selectivity in body-selective areas. Unlike earlier studies that decode text, our method derives voxel-wise captions of semantic selectivity. Our results show that BrainSCUBA is a promising means for understanding functional preferences in the brain, and provides motivation for further hypothesis-driven investigation of visual cortex.

segmentation

Title: Consistency Regularization Improves Placenta Segmentation in Fetal EPI MRI Time Series. (arXiv:2310.03870v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2310.03870
Code URL: https://github.com/firstmover/cr-seg
Copy Paste: [[2310.03870]] Consistency Regularization Improves Placenta Segmentation in Fetal EPI MRI Time Series(http://arxiv.org/abs/2310.03870)
Summary:
The placenta plays a crucial role in fetal development. Automated 3D placenta segmentation from fetal EPI MRI holds promise for advancing prenatal care. This paper proposes an effective semi-supervised learning method for improving placenta segmentation in fetal EPI MRI time series. We employ consistency regularization loss that promotes consistency under spatial transformation of the same image and temporal consistency across nearby images in a time series. The experimental results show that the method improves the overall segmentation accuracy and provides better performance for outliers and hard samples. The evaluation also indicates that our method improves the temporal coherency of the prediction, which could lead to more accurate computation of temporal placental biomarkers. This work contributes to the study of the placenta and prenatal clinical decision-making. Code is available at https://github.com/firstmover/cr-seg.

Title: Open-Fusion: Real-time Open-Vocabulary 3D Mapping and Queryable Scene Representation. (arXiv:2310.03923v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2310.03923
Code URL: null
Copy Paste: [[2310.03923]] Open-Fusion: Real-time Open-Vocabulary 3D Mapping and Queryable Scene Representation(http://arxiv.org/abs/2310.03923)
Summary:
Precise 3D environmental mapping is pivotal in robotics. Existing methods often rely on predefined concepts during training or are time-intensive when generating semantic maps. This paper presents Open-Fusion, a groundbreaking approach for real-time open-vocabulary 3D mapping and queryable scene representation using RGB-D data. Open-Fusion harnesses the power of a pre-trained vision-language foundation model (VLFM) for open-set semantic comprehension and employs the Truncated Signed Distance Function (TSDF) for swift 3D scene reconstruction. By leveraging the VLFM, we extract region-based embeddings and their associated confidence maps. These are then integrated with 3D knowledge from TSDF using an enhanced Hungarian-based feature-matching mechanism. Notably, Open-Fusion delivers outstanding annotation-free 3D segmentation for open-vocabulary without necessitating additional 3D training. Benchmark tests on the ScanNet dataset against leading zero-shot methods highlight Open-Fusion's superiority. Furthermore, it seamlessly combines the strengths of region-based VLFM and TSDF, facilitating real-time 3D scene comprehension that includes object concepts and open-world semantics. We encourage the readers to view the demos on our project page: https://uark-aicv.github.io/OpenFusion

Title: CUPre: Cross-domain Unsupervised Pre-training for Few-Shot Cell Segmentation. (arXiv:2310.03981v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2310.03981
Code URL: null
Copy Paste: [[2310.03981]] CUPre: Cross-domain Unsupervised Pre-training for Few-Shot Cell Segmentation(http://arxiv.org/abs/2310.03981)
Summary:
While pre-training on object detection tasks, such as Common Objects in Contexts (COCO) [1], could significantly boost the performance of cell segmentation, it still consumes on massive fine-annotated cell images [2] with bounding boxes, masks, and cell types for every cell in every image, to fine-tune the pre-trained model. To lower the cost of annotation, this work considers the problem of pre-training DNN models for few-shot cell segmentation, where massive unlabeled cell images are available but only a small proportion is annotated. Hereby, we propose Cross-domain Unsupervised Pre-training, namely CUPre, transferring the capability of object detection and instance segmentation for common visual objects (learned from COCO) to the visual domain of cells using unlabeled images. Given a standard COCO pre-trained network with backbone, neck, and head modules, CUPre adopts an alternate multi-task pre-training (AMT2) procedure with two sub-tasks -- in every iteration of pre-training, AMT2 first trains the backbone with cell images from multiple cell datasets via unsupervised momentum contrastive learning (MoCo) [3], and then trains the whole model with vanilla COCO datasets via instance segmentation. After pre-training, CUPre fine-tunes the whole model on the cell segmentation task using a few annotated images. We carry out extensive experiments to evaluate CUPre using LIVECell [2] and BBBC038 [4] datasets in few-shot instance segmentation settings. The experiment shows that CUPre can outperform existing pre-training methods, achieving the highest average precision (AP) for few-shot cell segmentation and detection.

Title: A Deeply Supervised Semantic Segmentation Method Based on GAN. (arXiv:2310.04081v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2310.04081
Code URL: null
Copy Paste: [[2310.04081]] A Deeply Supervised Semantic Segmentation Method Based on GAN(http://arxiv.org/abs/2310.04081)
Summary:
In recent years, the field of intelligent transportation has witnessed rapid advancements, driven by the increasing demand for automation and efficiency in transportation systems. Traffic safety, one of the tasks integral to intelligent transport systems, requires accurately identifying and locating various road elements, such as road cracks, lanes, and traffic signs. Semantic segmentation plays a pivotal role in achieving this task, as it enables the partition of images into meaningful regions with accurate boundaries. In this study, we propose an improved semantic segmentation model that combines the strengths of adversarial learning with state-of-the-art semantic segmentation techniques. The proposed model integrates a generative adversarial network (GAN) framework into the traditional semantic segmentation model, enhancing the model's performance in capturing complex and subtle features in transportation images. The effectiveness of our approach is demonstrated by a significant boost in performance on the road crack dataset compared to the existing methods, \textit{i.e.,} SEGAN. This improvement can be attributed to the synergistic effect of adversarial learning and semantic segmentation, which leads to a more refined and accurate representation of road structures and conditions. The enhanced model not only contributes to better detection of road cracks but also to a wide range of applications in intelligent transportation, such as traffic sign recognition, vehicle detection, and lane segmentation.

Title: Automated 3D Segmentation of Kidneys and Tumors in MICCAI KiTS 2023 Challenge. (arXiv:2310.04110v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2310.04110
Code URL: https://github.com/Project-MONAI/MONAI
Copy Paste: [[2310.04110]] Automated 3D Segmentation of Kidneys and Tumors in MICCAI KiTS 2023 Challenge(http://arxiv.org/abs/2310.04110)
Summary:
Kidney and Kidney Tumor Segmentation Challenge (KiTS) 2023 offers a platform for researchers to compare their solutions to segmentation from 3D CT. In this work, we describe our submission to the challenge using automated segmentation of Auto3DSeg available in MONAI. Our solution achieves the average dice of 0.835 and surface dice of 0.723, which ranks first and wins the KiTS 2023 challenge.

Title: Self-Supervised Neuron Segmentation with Multi-Agent Reinforcement Learning. (arXiv:2310.04148v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2310.04148
Code URL: https://github.com/ydchen0806/dbmim
Copy Paste: [[2310.04148]] Self-Supervised Neuron Segmentation with Multi-Agent Reinforcement Learning(http://arxiv.org/abs/2310.04148)
Summary:
The performance of existing supervised neuron segmentation methods is highly dependent on the number of accurate annotations, especially when applied to large scale electron microscopy (EM) data. By extracting semantic information from unlabeled data, self-supervised methods can improve the performance of downstream tasks, among which the mask image model (MIM) has been widely used due to its simplicity and effectiveness in recovering original information from masked images. However, due to the high degree of structural locality in EM images, as well as the existence of considerable noise, many voxels contain little discriminative information, making MIM pretraining inefficient on the neuron segmentation task. To overcome this challenge, we propose a decision-based MIM that utilizes reinforcement learning (RL) to automatically search for optimal image masking ratio and masking strategy. Due to the vast exploration space, using single-agent RL for voxel prediction is impractical. Therefore, we treat each input patch as an agent with a shared behavior policy, allowing for multi-agent collaboration. Furthermore, this multi-agent model can capture dependencies between voxels, which is beneficial for the downstream segmentation task. Experiments conducted on representative EM datasets demonstrate that our approach has a significant advantage over alternative self-supervised methods on the task of neuron segmentation. Code is available at \url{https://github.com/ydchen0806/dbMiM}.

Title: DiffPrompter: Differentiable Implicit Visual Prompts for Semantic-Segmentation in Adverse Conditions. (arXiv:2310.04181v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2310.04181
Code URL: null
Copy Paste: [[2310.04181]] DiffPrompter: Differentiable Implicit Visual Prompts for Semantic-Segmentation in Adverse Conditions(http://arxiv.org/abs/2310.04181)
Summary:
Semantic segmentation in adverse weather scenarios is a critical task for autonomous driving systems. While foundation models have shown promise, the need for specialized adaptors becomes evident for handling more challenging scenarios. We introduce DiffPrompter, a novel differentiable visual and latent prompting mechanism aimed at expanding the learning capabilities of existing adaptors in foundation models. Our proposed $\nabla$HFC image processing block excels particularly in adverse weather conditions, where conventional methods often fall short. Furthermore, we investigate the advantages of jointly training visual and latent prompts, demonstrating that this combined approach significantly enhances performance in out-of-distribution scenarios. Our differentiable visual prompts leverage parallel and series architectures to generate prompts, effectively improving object segmentation tasks in adverse conditions. Through a comprehensive series of experiments and evaluations, we provide empirical evidence to support the efficacy of our approach. Project page at https://diffprompter.github.io.

Title: Semantic segmentation of longitudinal thermal images for identification of hot and cool spots in urban areas. (arXiv:2310.04247v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2310.04247
Code URL: null
Copy Paste: [[2310.04247]] Semantic segmentation of longitudinal thermal images for identification of hot and cool spots in urban areas(http://arxiv.org/abs/2310.04247)
Summary:
This work presents the analysis of semantically segmented, longitudinally, and spatially rich thermal images collected at the neighborhood scale to identify hot and cool spots in urban areas. An infrared observatory was operated over a few months to collect thermal images of different types of buildings on the educational campus of the National University of Singapore. A subset of the thermal image dataset was used to train state-of-the-art deep learning models to segment various urban features such as buildings, vegetation, sky, and roads. It was observed that the U-Net segmentation model with `resnet34' CNN backbone has the highest mIoU score of 0.99 on the test dataset, compared to other models such as DeepLabV3, DeeplabV3+, FPN, and PSPnet. The masks generated using the segmentation models were then used to extract the temperature from thermal images and correct for differences in the emissivity of various urban features. Further, various statistical measure of the temperature extracted using the predicted segmentation masks is shown to closely match the temperature extracted using the ground truth masks. Finally, the masks were used to identify hot and cool spots in the urban feature at various instances of time. This forms one of the very few studies demonstrating the automated analysis of thermal images, which can be of potential use to urban planners for devising mitigation strategies for reducing the urban heat island (UHI) effect, improving building energy efficiency, and maximizing outdoor thermal comfort.