secure

Title: CoPriv: Network/Protocol Co-Optimization for Communication-Efficient Private Inference. (arXiv:2311.01737v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2311.01737
Code URL: null
Copy Paste: [[2311.01737]] CoPriv: Network/Protocol Co-Optimization for Communication-Efficient Private Inference(http://arxiv.org/abs/2311.01737)
Summary:
Deep neural network (DNN) inference based on secure 2-party computation (2PC) can offer cryptographically-secure privacy protection but suffers from orders of magnitude latency overhead due to enormous communication. Previous works heavily rely on a proxy metric of ReLU counts to approximate the communication overhead and focus on reducing the ReLUs to improve the communication efficiency. However, we observe these works achieve limited communication reduction for state-of-the-art (SOTA) 2PC protocols due to the ignorance of other linear and non-linear operations, which now contribute to the majority of communication. In this work, we present CoPriv, a framework that jointly optimizes the 2PC inference protocol and the DNN architecture. CoPriv features a new 2PC protocol for convolution based on Winograd transformation and develops DNN-aware optimization to significantly reduce the inference communication. CoPriv further develops a 2PC-aware network optimization algorithm that is compatible with the proposed protocol and simultaneously reduces the communication for all the linear and non-linear operations. We compare CoPriv with the SOTA 2PC protocol, CrypTFlow2, and demonstrate 2.1x communication reduction for both ResNet-18 and ResNet-32 on CIFAR-100. We also compare CoPriv with SOTA network optimization methods, including SNL, MetaPruning, etc. CoPriv achieves 9.98x and 3.88x online and total communication reduction with a higher accuracy compare to SNL, respectively. CoPriv also achieves 3.87x online communication reduction with more than 3% higher accuracy compared to MetaPruning.

security

Title: Adversary ML Resilience in Autonomous Driving Through Human Centered Perception Mechanisms. (arXiv:2311.01478v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.01478
Code URL: null
Copy Paste: [[2311.01478]] Adversary ML Resilience in Autonomous Driving Through Human Centered Perception Mechanisms(http://arxiv.org/abs/2311.01478)
Summary:
Physical adversarial attacks on road signs are continuously exploiting vulnerabilities in modern day autonomous vehicles (AVs) and impeding their ability to correctly classify what type of road sign they encounter. Current models cannot generalize input data well, resulting in overfitting or underfitting. In overfitting, the model memorizes the input data but cannot generalize to new scenarios. In underfitting, the model does not learn enough of the input data to accurately classify these road signs. This paper explores the resilience of autonomous driving systems against three main physical adversarial attacks (tape, graffiti, illumination), specifically targeting object classifiers. Several machine learning models were developed and evaluated on two distinct datasets: road signs (stop signs, speed limit signs, traffic lights, and pedestrian crosswalk signs) and geometric shapes (octagons, circles, squares, and triangles). The study compared algorithm performance under different conditions, including clean and adversarial training and testing on these datasets. To build robustness against attacks, defense techniques like adversarial training and transfer learning were implemented. Results demonstrated transfer learning models played a crucial role in performance by allowing knowledge gained from shape training to improve generalizability of road sign classification, despite the datasets being completely different. The paper suggests future research directions, including human-in-the-loop validation, security analysis, real-world testing, and explainable AI for transparency. This study aims to contribute to improving security and robustness of object classifiers in autonomous vehicles and mitigating adversarial example impacts on driving systems.

Title: VFCFinder: Seamlessly Pairing Security Advisories and Patches. (arXiv:2311.01532v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2311.01532
Code URL: null
Copy Paste: [[2311.01532]] VFCFinder: Seamlessly Pairing Security Advisories and Patches(http://arxiv.org/abs/2311.01532)
Summary:
Security advisories are the primary channel of communication for discovered vulnerabilities in open-source software, but they often lack crucial information. Specifically, 63% of vulnerability database reports are missing their patch links, also referred to as vulnerability fixing commits (VFCs). This paper introduces VFCFinder, a tool that generates the top-five ranked set of VFCs for a given security advisory using Natural Language Programming Language (NL-PL) models. VFCFinder yields a 96.6% recall for finding the correct VFC within the Top-5 commits, and an 80.0% recall for the Top-1 ranked commit. VFCFinder generalizes to nine different programming languages and outperforms state-of-the-art approaches by 36 percentage points in terms of Top-1 recall. As a practical contribution, we used VFCFinder to backfill over 300 missing VFCs in the GitHub Security Advisory (GHSA) database. All of the VFCs were accepted and merged into the GHSA database. In addition to demonstrating a practical pairing of security advisories to VFCs, our general open-source implementation will allow vulnerability database maintainers to drastically improve data quality, supporting efforts to secure the software supply chain.

Title: Architecture of Smart Certificates for Web3 Applications Against Cyberthreats in Financial Industry. (arXiv:2311.01956v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2311.01956
Code URL: null
Copy Paste: [[2311.01956]] Architecture of Smart Certificates for Web3 Applications Against Cyberthreats in Financial Industry(http://arxiv.org/abs/2311.01956)
Summary:
This study addresses the security challenges associated with the current internet transformations, specifically focusing on emerging technologies such as blockchain and decentralized storage. It also investigates the role of Web3 applications in shaping the future of the internet. The primary objective is to propose a novel design for 'smart certificates,' which are digital certificates that can be programmatically enforced. Utilizing such certificates, an enterprise can better protect itself from cyberattacks and ensure the security of its data and systems. Web3 recent security solutions by companies and projects like Certik, Forta, Slither, and Securify are the equivalent of code scanning tool that were originally developed for Web1 and Web2 applications, and definitely not like certificates to help enterprises feel safe against cyberthreats. We aim to improve the resilience of enterprises' digital infrastructure by building on top of Web3 application and put methodologies in place for vulnerability analysis and attack correlation, focusing on architecture of different layers, Wallet/Client, Application and Smart Contract, where specific components are provided to identify and predict threats and risks. Furthermore, Certificate Transparency is used for enhancing the security, trustworthiness and decentralized management of the certificates, and detecting misuses, compromises, and malfeasances.

privacy

Title: ProS: Facial Omni-Representation Learning via Prototype-based Self-Distillation. (arXiv:2311.01929v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.01929
Code URL: null
Copy Paste: [[2311.01929]] ProS: Facial Omni-Representation Learning via Prototype-based Self-Distillation(http://arxiv.org/abs/2311.01929)
Summary:
This paper presents a novel approach, called Prototype-based Self-Distillation (ProS), for unsupervised face representation learning. The existing supervised methods heavily rely on a large amount of annotated training facial data, which poses challenges in terms of data collection and privacy concerns. To address these issues, we propose ProS, which leverages a vast collection of unlabeled face images to learn a comprehensive facial omni-representation. In particular, ProS consists of two vision-transformers (teacher and student models) that are trained with different augmented images (cropping, blurring, coloring, etc.). Besides, we build a face-aware retrieval system along with augmentations to obtain the curated images comprising predominantly facial areas. To enhance the discrimination of learned features, we introduce a prototype-based matching loss that aligns the similarity distributions between features (teacher or student) and a set of learnable prototypes. After pre-training, the teacher vision transformer serves as a backbone for downstream tasks, including attribute estimation, expression recognition, and landmark alignment, achieved through simple fine-tuning with additional layers. Extensive experiments demonstrate that our method achieves state-of-the-art performance on various tasks, both in full and few-shot settings. Furthermore, we investigate pre-training with synthetic face images, and ProS exhibits promising performance in this scenario as well.

Title: MARRS: Multimodal Reference Resolution System. (arXiv:2311.01650v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.01650
Code URL: null
Copy Paste: [[2311.01650]] MARRS: Multimodal Reference Resolution System(http://arxiv.org/abs/2311.01650)
Summary:
Successfully handling context is essential for any dialog understanding task. This context maybe be conversational (relying on previous user queries or system responses), visual (relying on what the user sees, for example, on their screen), or background (based on signals such as a ringing alarm or playing music). In this work, we present an overview of MARRS, or Multimodal Reference Resolution System, an on-device framework within a Natural Language Understanding system, responsible for handling conversational, visual and background context. In particular, we present different machine learning models to enable handing contextual queries; specifically, one to enable reference resolution, and one to handle context via query rewriting. We also describe how these models complement each other to form a unified, coherent, lightweight system that can understand context while preserving user privacy.

Title: CiFlow: Dataflow Analysis and Optimization of Key Switching for Homomorphic Encryption. (arXiv:2311.01598v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2311.01598
Code URL: null
Copy Paste: [[2311.01598]] CiFlow: Dataflow Analysis and Optimization of Key Switching for Homomorphic Encryption(http://arxiv.org/abs/2311.01598)
Summary:
Homomorphic encryption (HE) is a privacy-preserving computation technique that enables computation on encrypted data. Today, the potential of HE remains largely unrealized as it is impractically slow, preventing it from being used in real applications. A major computational bottleneck in HE is the key-switching operation, accounting for approximately 70% of the overall HE execution time and involving a large amount of data for inputs, intermediates, and keys. Prior research has focused on hardware accelerators to improve HE performance, typically featuring large on-chip SRAMs and high off-chip bandwidth to deal with large scale data. In this paper, we present a novel approach to improve key-switching performance by rigorously analyzing its dataflow. Our primary goal is to optimize data reuse with limited on-chip memory to minimize off-chip data movement. We introduce three distinct dataflows: Max-Parallel (MP), Digit-Centric (DC), and Output-Centric (OC), each with unique scheduling approaches for key-switching computations. Through our analysis, we show how our proposed Output-Centric technique can effectively reuse data by significantly lowering the intermediate key-switching working set and alleviating the need for massive off-chip bandwidth. We thoroughly evaluate the three dataflows using the RPU, a recently published vector processor tailored for ring processing algorithms, which includes HE. This evaluation considers sweeps of bandwidth and computational throughput, and whether keys are buffered on-chip or streamed. With OC, we demonstrate up to 4.16x speedup over the MP dataflow and show how OC can save 16x on-chip SRAM by streaming keys for minimal performance penalty.

protect

defense

Title: Adversarial Examples in the Physical World: A Survey. (arXiv:2311.01473v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.01473
Code URL: null
Copy Paste: [[2311.01473]] Adversarial Examples in the Physical World: A Survey(http://arxiv.org/abs/2311.01473)
Summary:
Deep neural networks (DNNs) have demonstrated high vulnerability to adversarial examples. Besides the attacks in the digital world, the practical implications of adversarial examples in the physical world present significant challenges and safety concerns. However, current research on physical adversarial examples (PAEs) lacks a comprehensive understanding of their unique characteristics, leading to limited significance and understanding. In this paper, we address this gap by thoroughly examining the characteristics of PAEs within a practical workflow encompassing training, manufacturing, and re-sampling processes. By analyzing the links between physical adversarial attacks, we identify manufacturing and re-sampling as the primary sources of distinct attributes and particularities in PAEs. Leveraging this knowledge, we develop a comprehensive analysis and classification framework for PAEs based on their specific characteristics, covering over 100 studies on physical-world adversarial examples. Furthermore, we investigate defense strategies against PAEs and identify open challenges and opportunities for future research. We aim to provide a fresh, thorough, and systematic understanding of PAEs, thereby promoting the development of robust adversarial learning and its application in open-world scenarios.

attack

Title: Universal Perturbation-based Secret Key-Controlled Data Hiding. (arXiv:2311.01696v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2311.01696
Code URL: null
Copy Paste: [[2311.01696]] Universal Perturbation-based Secret Key-Controlled Data Hiding(http://arxiv.org/abs/2311.01696)
Summary:
Deep neural networks (DNNs) are demonstrated to be vulnerable to universal perturbation, a single quasi-perceptible perturbation that can deceive the DNN on most images. However, the previous works are focused on using universal perturbation to perform adversarial attacks, while the potential usability of universal perturbation as data carriers in data hiding is less explored, especially for the key-controlled data hiding method. In this paper, we propose a novel universal perturbation-based secret key-controlled data-hiding method, realizing data hiding with a single universal perturbation and data decoding with the secret key-controlled decoder. Specifically, we optimize a single universal perturbation, which serves as a data carrier that can hide multiple secret images and be added to most cover images. Then, we devise a secret key-controlled decoder to extract different secret images from the single container image constructed by the universal perturbation by using different secret keys. Moreover, a suppress loss function is proposed to prevent the secret image from leakage. Furthermore, we adopt a robust module to boost the decoder's capability against corruption. Finally, A co-joint optimization strategy is proposed to find the optimal universal perturbation and decoder. Extensive experiments are conducted on different datasets to demonstrate the effectiveness of the proposed method. Additionally, the physical test performed on platforms (e.g., WeChat and Twitter) verifies the usability of the proposed method in practice.

Title: Efficient Black-Box Adversarial Attacks on Neural Text Detectors. (arXiv:2311.01873v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.01873
Code URL: null
Copy Paste: [[2311.01873]] Efficient Black-Box Adversarial Attacks on Neural Text Detectors(http://arxiv.org/abs/2311.01873)
Summary:
Neural text detectors are models trained to detect whether a given text was generated by a language model or written by a human. In this paper, we investigate three simple and resource-efficient strategies (parameter tweaking, prompt engineering, and character-level mutations) to alter texts generated by GPT-3.5 that are unsuspicious or unnoticeable for humans but cause misclassification by neural text detectors. The results show that especially parameter tweaking and character-level mutations are effective strategies.

Title: Adversarial Attacks on Cooperative Multi-agent Bandits. (arXiv:2311.01698v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2311.01698
Code URL: null
Copy Paste: [[2311.01698]] Adversarial Attacks on Cooperative Multi-agent Bandits(http://arxiv.org/abs/2311.01698)
Summary:
Cooperative multi-agent multi-armed bandits (CMA2B) consider the collaborative efforts of multiple agents in a shared multi-armed bandit game. We study latent vulnerabilities exposed by this collaboration and consider adversarial attacks on a few agents with the goal of influencing the decisions of the rest. More specifically, we study adversarial attacks on CMA2B in both homogeneous settings, where agents operate with the same arm set, and heterogeneous settings, where agents have distinct arm sets. In the homogeneous setting, we propose attack strategies that, by targeting just one agent, convince all agents to select a particular target arm $T-o(T)$ times while incurring $o(T)$ attack costs in $T$ rounds. In the heterogeneous setting, we prove that a target arm attack requires linear attack costs and propose attack strategies that can force a maximum number of agents to suffer linear regrets while incurring sublinear costs and only manipulating the observations of a few target agents. Numerical experiments validate the effectiveness of our proposed attack strategies.

robust

Title: Assist Is Just as Important as the Goal: Image Resurfacing to Aid Model's Robust Prediction. (arXiv:2311.01563v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.01563
Code URL: null
Copy Paste: [[2311.01563]] Assist Is Just as Important as the Goal: Image Resurfacing to Aid Model's Robust Prediction(http://arxiv.org/abs/2311.01563)
Summary:
Adversarial patches threaten visual AI models in the real world. The number of patches in a patch attack is variable and determines the attack's potency in a specific environment. Most existing defenses assume a single patch in the scene, and the multiple patch scenarios are shown to overcome them. This paper presents a model-agnostic defense against patch attacks based on total variation for image resurfacing (TVR). The TVR is an image-cleansing method that processes images to remove probable adversarial regions. TVR can be utilized solely or augmented with a defended model, providing multi-level security for robust prediction. TVR nullifies the influence of patches in a single image scan with no prior assumption on the number of patches in the scene. We validate TVR on the ImageNet-Patch benchmark dataset and with real-world physical objects, demonstrating its ability to mitigate patch attack.

Title: Look-Ahead Selective Plasticity for Continual Learning of Visual Tasks. (arXiv:2311.01617v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.01617
Code URL: https://github.com/crouzbehmeshkin/lasp
Copy Paste: [[2311.01617]] Look-Ahead Selective Plasticity for Continual Learning of Visual Tasks(http://arxiv.org/abs/2311.01617)
Summary:
Contrastive representation learning has emerged as a promising technique for continual learning as it can learn representations that are robust to catastrophic forgetting and generalize well to unseen future tasks. Previous work in continual learning has addressed forgetting by using previous task data and trained models. Inspired by event models created and updated in the brain, we propose a new mechanism that takes place during task boundaries, i.e., when one task finishes and another starts. By observing the redundancy-inducing ability of contrastive loss on the output of a neural network, our method leverages the first few samples of the new task to identify and retain parameters contributing most to the transfer ability of the neural network, freeing up the remaining parts of the network to learn new features. We evaluate the proposed methods on benchmark computer vision datasets including CIFAR10 and TinyImagenet and demonstrate state-of-the-art performance in the task-incremental, class-incremental, and domain-incremental continual learning scenarios.

Title: SemiGPC: Distribution-Aware Label Refinement for Imbalanced Semi-Supervised Learning Using Gaussian Processes. (arXiv:2311.01646v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.01646
Code URL: null
Copy Paste: [[2311.01646]] SemiGPC: Distribution-Aware Label Refinement for Imbalanced Semi-Supervised Learning Using Gaussian Processes(http://arxiv.org/abs/2311.01646)
Summary:
In this paper we introduce SemiGPC, a distribution-aware label refinement strategy based on Gaussian Processes where the predictions of the model are derived from the labels posterior distribution. Differently from other buffer-based semi-supervised methods such as CoMatch and SimMatch, our SemiGPC includes a normalization term that addresses imbalances in the global data distribution while maintaining local sensitivity. This explicit control allows SemiGPC to be more robust to confirmation bias especially under class imbalance. We show that SemiGPC improves performance when paired with different Semi-Supervised methods such as FixMatch, ReMixMatch, SimMatch and FreeMatch and different pre-training strategies including MSN and Dino. We also show that SemiGPC achieves state of the art results under different degrees of class imbalance on standard CIFAR10-LT/CIFAR100-LT especially in the low data-regime. Using SemiGPC also results in about 2% avg.accuracy increase compared to a new competitive baseline on the more challenging benchmarks SemiAves, SemiCUB, SemiFungi and Semi-iNat.

Title: Detecting Spurious Correlations via Robust Visual Concepts in Real and AI-Generated Image Classification. (arXiv:2311.01655v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2311.01655
Code URL: null
Copy Paste: [[2311.01655]] Detecting Spurious Correlations via Robust Visual Concepts in Real and AI-Generated Image Classification(http://arxiv.org/abs/2311.01655)
Summary:
Often machine learning models tend to automatically learn associations present in the training data without questioning their validity or appropriateness. This undesirable property is the root cause of the manifestation of spurious correlations, which render models unreliable and prone to failure in the presence of distribution shifts. Research shows that most methods attempting to remedy spurious correlations are only effective for a model's known spurious associations. Current spurious correlation detection algorithms either rely on extensive human annotations or are too restrictive in their formulation. Moreover, they rely on strict definitions of visual artifacts that may not apply to data produced by generative models, as they are known to hallucinate contents that do not conform to standard specifications. In this work, we introduce a general-purpose method that efficiently detects potential spurious correlations, and requires significantly less human interference in comparison to the prior art. Additionally, the proposed method provides intuitive explanations while eliminating the need for pixel-level annotations. We demonstrate the proposed method's tolerance to the peculiarity of AI-generated images, which is a considerably challenging task, one where most of the existing methods fall short. Consequently, our method is also suitable for detecting spurious correlations that may propagate to downstream applications originating from generative models.

Title: Disentangled Representation Learning with Transmitted Information Bottleneck. (arXiv:2311.01686v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.01686
Code URL: null
Copy Paste: [[2311.01686]] Disentangled Representation Learning with Transmitted Information Bottleneck(http://arxiv.org/abs/2311.01686)
Summary:
Encoding only the task-related information from the raw data, \ie, disentangled representation learning, can greatly contribute to the robustness and generalizability of models. Although significant advances have been made by regularizing the information in representations with information theory, two major challenges remain: 1) the representation compression inevitably leads to performance drop; 2) the disentanglement constraints on representations are in complicated optimization. To these issues, we introduce Bayesian networks with transmitted information to formulate the interaction among input and representations during disentanglement. Building upon this framework, we propose \textbf{DisTIB} (\textbf{T}ransmitted \textbf{I}nformation \textbf{B}ottleneck for \textbf{Dis}entangled representation learning), a novel objective that navigates the balance between information compression and preservation. We employ variational inference to derive a tractable estimation for DisTIB. This estimation can be simply optimized via standard gradient descent with a reparameterization trick. Moreover, we theoretically prove that DisTIB can achieve optimal disentanglement, underscoring its superior efficacy. To solidify our claims, we conduct extensive experiments on various downstream tasks to demonstrate the appealing efficacy of DisTIB and validate our theoretical analyses.

Title: Towards Calibrated Robust Fine-Tuning of Vision-Language Models. (arXiv:2311.01723v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.01723
Code URL: null
Copy Paste: [[2311.01723]] Towards Calibrated Robust Fine-Tuning of Vision-Language Models(http://arxiv.org/abs/2311.01723)
Summary:
While fine-tuning unleashes the potential of a pre-trained model to a specific task, it trades off the model's generalization capability on out-of-distribution (OOD) datasets. To mitigate this, robust fine-tuning aims to ensure performance on OOD datasets as well as an in-distribution (ID) dataset for which the model is being tuned. However, another criterion for reliable machine learning (ML), confidence calibration, has been overlooked despite its increasing demand for real-world high-stakes ML applications (e.g., autonomous driving and medical diagnosis). For the first time, we raise concerns about the calibration of fine-tuned vision-language models (VLMs) under distribution shift by showing that naive fine-tuning and even state-of-the-art robust fine-tuning methods hurt the calibration of pre-trained VLMs, especially on OOD datasets. To address this, we provide a simple approach, called a calibrated robust fine-tuning (CaRot) that incentivizes the calibration and robustness on both ID and OOD datasets. Empirical results on ImageNet-1K distribution shift evaluation verify the effectiveness of our method.

Title: From Chaos to Calibration: A Geometric Mutual Information Approach to Target-Free Camera LiDAR Extrinsic Calibration. (arXiv:2311.01905v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.01905
Code URL: null
Copy Paste: [[2311.01905]] From Chaos to Calibration: A Geometric Mutual Information Approach to Target-Free Camera LiDAR Extrinsic Calibration(http://arxiv.org/abs/2311.01905)
Summary:
Sensor fusion is vital for the safe and robust operation of autonomous vehicles. Accurate extrinsic sensor to sensor calibration is necessary to accurately fuse multiple sensor's data in a common spatial reference frame. In this paper, we propose a target free extrinsic calibration algorithm that requires no ground truth training data, artificially constrained motion trajectories, hand engineered features or offline optimization and that is accurate, precise and extremely robust to initialization error.

Most current research on online camera-LiDAR extrinsic calibration requires ground truth training data which is impossible to capture at scale. We revisit analytical mutual information based methods first proposed in 2012 and demonstrate that geometric features provide a robust information metric for camera-LiDAR extrinsic calibration. We demonstrate our proposed improvement using the KITTI and KITTI-360 fisheye data set.

Title: Assessing Fidelity in XAI post-hoc techniques: A Comparative Study with Ground Truth Explanations Datasets. (arXiv:2311.01961v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.01961
Code URL: null
Copy Paste: [[2311.01961]] Assessing Fidelity in XAI post-hoc techniques: A Comparative Study with Ground Truth Explanations Datasets(http://arxiv.org/abs/2311.01961)
Summary:
The evaluation of the fidelity of eXplainable Artificial Intelligence (XAI) methods to their underlying models is a challenging task, primarily due to the absence of a ground truth for explanations. However, assessing fidelity is a necessary step for ensuring a correct XAI methodology. In this study, we conduct a fair and objective comparison of the current state-of-the-art XAI methods by introducing three novel image datasets with reliable ground truth for explanations. The primary objective of this comparison is to identify methods with low fidelity and eliminate them from further research, thereby promoting the development of more trustworthy and effective XAI techniques. Our results demonstrate that XAI methods based on the backpropagation of output information to input yield higher accuracy and reliability compared to methods relying on sensitivity analysis or Class Activation Maps (CAM). However, the backpropagation method tends to generate more noisy saliency maps. These findings have significant implications for the advancement of XAI methods, enabling the elimination of erroneous explanations and fostering the development of more robust and reliable XAI.

Title: Learning Historical Status Prompt for Accurate and Robust Visual Tracking. (arXiv:2311.02072v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.02072
Code URL: null
Copy Paste: [[2311.02072]] Learning Historical Status Prompt for Accurate and Robust Visual Tracking(http://arxiv.org/abs/2311.02072)
Summary:
Most trackers perform template and search region similarity matching to find the most similar object to the template during tracking. However, they struggle to make prediction when the target appearance changes due to the limited historical information introduced by roughly cropping the current search region based on the predicted result of previous frame. In this paper, we identify that the central impediment to improving the performance of existing trackers is the incapacity to integrate abundant and effective historical information. To address this issue, we propose a Historical Information Prompter (HIP) to enhance the provision of historical information. We also build HIPTrack upon HIP module. HIP is a plug-and-play module that make full use of search region features to introduce historical appearance information. It also incorporates historical position information by constructing refined mask of the target. HIP is a lightweight module to generate historical information prompts. By integrating historical information prompts, HIPTrack significantly enhances the tracking performance without the need to retrain the backbone. Experimental results demonstrate that our method outperforms all state-of-the-art approaches on LaSOT, LaSOT ext, GOT10k and NfS. Futhermore, HIP module exhibits strong generality and can be seamlessly integrated into trackers to improve tracking performance. The source code and models will be released for further research.

Title: Faithful and Robust Local Interpretability for Textual Predictions. (arXiv:2311.01605v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.01605
Code URL: null
Copy Paste: [[2311.01605]] Faithful and Robust Local Interpretability for Textual Predictions(http://arxiv.org/abs/2311.01605)
Summary:
Interpretability is essential for machine learning models to be trusted and deployed in critical domains. However, existing methods for interpreting text models are often complex, lack solid mathematical foundations, and their performance is not guaranteed. In this paper, we propose FRED (Faithful and Robust Explainer for textual Documents), a novel method for interpreting predictions over text. FRED identifies key words in a document that significantly impact the prediction when removed. We establish the reliability of FRED through formal definitions and theoretical analyses on interpretable classifiers. Additionally, our empirical evaluation against state-of-the-art methods demonstrates the effectiveness of FRED in providing insights into text models.

Title: $R^3$-NL2GQL: A Hybrid Models Approach for for Accuracy Enhancing and Hallucinations Mitigation. (arXiv:2311.01862v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.01862
Code URL: https://github.com/zhiqix/nl2gql
Copy Paste: [[2311.01862]] $R^3$-NL2GQL: A Hybrid Models Approach for for Accuracy Enhancing and Hallucinations Mitigation(http://arxiv.org/abs/2311.01862)
Summary:
While current NL2SQL tasks constructed using Foundation Models have achieved commendable results, their direct application to Natural Language to Graph Query Language (NL2GQL) tasks poses challenges due to the significant differences between GQL and SQL expressions, as well as the numerous types of GQL. Our extensive experiments reveal that in NL2GQL tasks, larger Foundation Models demonstrate superior cross-schema generalization abilities, while smaller Foundation Models struggle to improve their GQL generation capabilities through fine-tuning. However, after fine-tuning, smaller models exhibit better intent comprehension and higher grammatical accuracy. Diverging from rule-based and slot-filling techniques, we introduce R3-NL2GQL, which employs both smaller and larger Foundation Models as reranker, rewriter and refiner. The approach harnesses the comprehension ability of smaller models for information reranker and rewriter, and the exceptional generalization and generation capabilities of larger models to transform input natural language queries and code structure schema into any form of GQLs. Recognizing the lack of established datasets in this nascent domain, we have created a bilingual dataset derived from graph database documentation and some open-source Knowledge Graphs (KGs). We tested our approach on this dataset and the experimental results showed that delivers promising performance and robustness.Our code and dataset is available at https://github.com/zhiqix/NL2GQL

Title: The language of prompting: What linguistic properties make a prompt successful?. (arXiv:2311.01967v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.01967
Code URL: null
Copy Paste: [[2311.01967]] The language of prompting: What linguistic properties make a prompt successful?(http://arxiv.org/abs/2311.01967)
Summary:
The latest generation of LLMs can be prompted to achieve impressive zero-shot or few-shot performance in many NLP tasks. However, since performance is highly sensitive to the choice of prompts, considerable effort has been devoted to crowd-sourcing prompts or designing methods for prompt optimisation. Yet, we still lack a systematic understanding of how linguistic properties of prompts correlate with task performance. In this work, we investigate how LLMs of different sizes, pre-trained and instruction-tuned, perform on prompts that are semantically equivalent, but vary in linguistic structure. We investigate both grammatical properties such as mood, tense, aspect and modality, as well as lexico-semantic variation through the use of synonyms. Our findings contradict the common assumption that LLMs achieve optimal performance on lower perplexity prompts that reflect language use in pretraining or instruction-tuning data. Prompts transfer poorly between datasets or models, and performance cannot generally be explained by perplexity, word frequency, ambiguity or prompt length. Based on our results, we put forward a proposal for a more robust and comprehensive evaluation standard for prompting research.

Title: Robust Adversarial Reinforcement Learning via Bounded Rationality Curricula. (arXiv:2311.01642v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2311.01642
Code URL: null
Copy Paste: [[2311.01642]] Robust Adversarial Reinforcement Learning via Bounded Rationality Curricula(http://arxiv.org/abs/2311.01642)
Summary:
Robustness against adversarial attacks and distribution shifts is a long-standing goal of Reinforcement Learning (RL). To this end, Robust Adversarial Reinforcement Learning (RARL) trains a protagonist against destabilizing forces exercised by an adversary in a competitive zero-sum Markov game, whose optimal solution, i.e., rational strategy, corresponds to a Nash equilibrium. However, finding Nash equilibria requires facing complex saddle point optimization problems, which can be prohibitive to solve, especially for high-dimensional control. In this paper, we propose a novel approach for adversarial RL based on entropy regularization to ease the complexity of the saddle point optimization problem. We show that the solution of this entropy-regularized problem corresponds to a Quantal Response Equilibrium (QRE), a generalization of Nash equilibria that accounts for bounded rationality, i.e., agents sometimes play random actions instead of optimal ones. Crucially, the connection between the entropy-regularized objective and QRE enables free modulation of the rationality of the agents by simply tuning the temperature coefficient. We leverage this insight to propose our novel algorithm, Quantal Adversarial RL (QARL), which gradually increases the rationality of the adversary in a curriculum fashion until it is fully rational, easing the complexity of the optimization problem while retaining robustness. We provide extensive evidence of QARL outperforming RARL and recent baselines across several MuJoCo locomotion and navigation problems in overall performance and robustness.

Title: High Precision Causal Model Evaluation with Conditional Randomization. (arXiv:2311.01902v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2311.01902
Code URL: null
Copy Paste: [[2311.01902]] High Precision Causal Model Evaluation with Conditional Randomization(http://arxiv.org/abs/2311.01902)
Summary:
The gold standard for causal model evaluation involves comparing model predictions with true effects estimated from randomized controlled trials (RCT). However, RCTs are not always feasible or ethical to perform. In contrast, conditionally randomized experiments based on inverse probability weighting (IPW) offer a more realistic approach but may suffer from high estimation variance. To tackle this challenge and enhance causal model evaluation in real-world conditional randomization settings, we introduce a novel low-variance estimator for causal error, dubbed as the pairs estimator. By applying the same IPW estimator to both the model and true experimental effects, our estimator effectively cancels out the variance due to IPW and achieves a smaller asymptotic variance. Empirical studies demonstrate the improved of our estimator, highlighting its potential on achieving near-RCT performance. Our method offers a simple yet powerful solution to evaluate causal inference models in conditional randomization settings without complicated modification of the IPW estimator itself, paving the way for more robust and reliable model assessments.

Title: Universal Sharpness Dynamics in Neural Network Training: Fixed Point Analysis, Edge of Stability, and Route to Chaos. (arXiv:2311.02076v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2311.02076
Code URL: null
Copy Paste: [[2311.02076]] Universal Sharpness Dynamics in Neural Network Training: Fixed Point Analysis, Edge of Stability, and Route to Chaos(http://arxiv.org/abs/2311.02076)
Summary:
In gradient descent dynamics of neural networks, the top eigenvalue of the Hessian of the loss (sharpness) displays a variety of robust phenomena throughout training. This includes early time regimes where the sharpness may decrease during early periods of training (sharpness reduction), and later time behavior such as progressive sharpening and edge of stability. We demonstrate that a simple $2$-layer linear network (UV model) trained on a single training example exhibits all of the essential sharpness phenomenology observed in real-world scenarios. By analyzing the structure of dynamical fixed points in function space and the vector field of function updates, we uncover the underlying mechanisms behind these sharpness trends. Our analysis reveals (i) the mechanism behind early sharpness reduction and progressive sharpening, (ii) the required conditions for edge of stability, and (iii) a period-doubling route to chaos on the edge of stability manifold as learning rate is increased. Finally, we demonstrate that various predictions from this simplified model generalize to real-world scenarios and discuss its limitations.

biometric

Title: Keypoint Description by Symmetry Assessment -- Applications in Biometrics. (arXiv:2311.01651v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.01651
Code URL: null
Copy Paste: [[2311.01651]] Keypoint Description by Symmetry Assessment -- Applications in Biometrics(http://arxiv.org/abs/2311.01651)
Summary:
We present a model-based feature extractor to describe neighborhoods around keypoints by finite expansion, estimating the spatially varying orientation by harmonic functions. The iso-curves of such functions are highly symmetric w.r.t. the origin (a keypoint) and the estimated parameters have well defined geometric interpretations. The origin is also a unique singularity of all harmonic functions, helping to determine the location of a keypoint precisely, whereas the functions describe the object shape of the neighborhood. This is novel and complementary to traditional texture features which describe texture-shape properties i.e. they are purposively invariant to translation (within a texture). We report on experiments of verification and identification of keypoints in forensic fingerprints by using publicly available data (NIST SD27) and discuss the results in comparison to other studies. These support our conclusions that the novel features can equip single cores or single minutia with a significant verification power at 19% EER, and an identification power of 24-78% for ranks of 1-20. Additionally, we report verification results of periocular biometrics using near-infrared images, reaching an EER performance of 13%, which is comparable to the state of the art. More importantly, fusion of two systems, our and texture features (Gabor), result in a measurable performance improvement. We report reduction of the EER to 9%, supporting the view that the novel features capture relevant visual information, which traditional texture features do not.

steal

extraction

Title: MineSegSAT: An automated system to evaluate mining disturbed area extents from Sentinel-2 imagery. (arXiv:2311.01676v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.01676
Code URL: null
Copy Paste: [[2311.01676]] MineSegSAT: An automated system to evaluate mining disturbed area extents from Sentinel-2 imagery(http://arxiv.org/abs/2311.01676)
Summary:
Assessing the environmental impact of the mineral extraction industry plays a critical role in understanding and mitigating the ecological consequences of extractive activities. This paper presents MineSegSAT, a model that presents a novel approach to predicting environmentally impacted areas of mineral extraction sites using the SegFormer deep learning segmentation architecture trained on Sentinel-2 data. The data was collected from non-overlapping regions over Western Canada in 2021 containing areas of land that have been environmentally impacted by mining activities that were identified from high-resolution satellite imagery in 2021. The SegFormer architecture, a state-of-the-art semantic segmentation framework, is employed to leverage its advanced spatial understanding capabilities for accurate land cover classification. We investigate the efficacy of loss functions including Dice, Tversky, and Lovasz loss respectively. The trained model was utilized for inference over the test region in the ensuing year to identify potential areas of expansion or contraction over these same periods. The Sentinel-2 data is made available on Amazon Web Services through a collaboration with Earth Daily Analytics which provides corrected and tiled analytics-ready data on the AWS platform. The model and ongoing API to access the data on AWS allow the creation of an automated tool to monitor the extent of disturbed areas surrounding known mining sites to ensure compliance with their environmental impact goals.

Title: Taking a PEEK into YOLOv5 for Satellite Component Recognition via Entropy-based Visual Explanations. (arXiv:2311.01703v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.01703
Code URL: null
Copy Paste: [[2311.01703]] Taking a PEEK into YOLOv5 for Satellite Component Recognition via Entropy-based Visual Explanations(http://arxiv.org/abs/2311.01703)
Summary:
The escalating risk of collisions and the accumulation of space debris in Low Earth Orbit (LEO) has reached critical concern due to the ever increasing number of spacecraft. Addressing this crisis, especially in dealing with non-cooperative and unidentified space debris, is of paramount importance. This paper contributes to efforts in enabling autonomous swarms of small chaser satellites for target geometry determination and safe flight trajectory planning for proximity operations in LEO. Our research explores on-orbit use of the You Only Look Once v5 (YOLOv5) object detection model trained to detect satellite components. While this model has shown promise, its inherent lack of interpretability hinders human understanding, a critical aspect of validating algorithms for use in safety-critical missions. To analyze the decision processes, we introduce Probabilistic Explanations for Entropic Knowledge extraction (PEEK), a method that utilizes information theoretic analysis of the latent representations within the hidden layers of the model. Through both synthetic in hardware-in-the-loop experiments, PEEK illuminates the decision-making processes of the model, helping identify its strengths, limitations and biases.

Title: Support or Refute: Analyzing the Stance of Evidence to Detect Out-of-Context Mis- and Disinformation. (arXiv:2311.01766v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.01766
Code URL: null
Copy Paste: [[2311.01766]] Support or Refute: Analyzing the Stance of Evidence to Detect Out-of-Context Mis- and Disinformation(http://arxiv.org/abs/2311.01766)
Summary:
Mis- and disinformation online have become a major societal problem as major sources of online harms of different kinds. One common form of mis- and disinformation is out-of-context (OOC) information, where different pieces of information are falsely associated, e.g., a real image combined with a false textual caption or a misleading textual description. Although some past studies have attempted to defend against OOC mis- and disinformation through external evidence, they tend to disregard the role of different pieces of evidence with different stances. Motivated by the intuition that the stance of evidence represents a bias towards different detection results, we propose a stance extraction network (SEN) that can extract the stances of different pieces of multi-modal evidence in a unified framework. Moreover, we introduce a support-refutation score calculated based on the co-occurrence relations of named entities into the textual SEN. Extensive experiments on a public large-scale dataset demonstrated that our proposed method outperformed the state-of-the-art baselines, with the best model achieving a performance gain of 3.2% in accuracy.

Title: Bridging the Gap between Multi-focus and Multi-modal: A Focused Integration Framework for Multi-modal Image Fusion. (arXiv:2311.01886v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.01886
Code URL: null
Copy Paste: [[2311.01886]] Bridging the Gap between Multi-focus and Multi-modal: A Focused Integration Framework for Multi-modal Image Fusion(http://arxiv.org/abs/2311.01886)
Summary:
Multi-modal image fusion (MMIF) integrates valuable information from different modality images into a fused one. However, the fusion of multiple visible images with different focal regions and infrared images is a unprecedented challenge in real MMIF applications. This is because of the limited depth of the focus of visible optical lenses, which impedes the simultaneous capture of the focal information within the same scene. To address this issue, in this paper, we propose a MMIF framework for joint focused integration and modalities information extraction. Specifically, a semi-sparsity-based smoothing filter is introduced to decompose the images into structure and texture components. Subsequently, a novel multi-scale operator is proposed to fuse the texture components, capable of detecting significant information by considering the pixel focus attributes and relevant data from various modal images. Additionally, to achieve an effective capture of scene luminance and reasonable contrast maintenance, we consider the distribution of energy information in the structural components in terms of multi-directional frequency variance and information entropy. Extensive experiments on existing MMIF datasets, as well as the object detection and depth estimation tasks, consistently demonstrate that the proposed algorithm can surpass the state-of-the-art methods in visual perception and quantitative evaluation. The code is available at https://github.com/ixilai/MFIF-MMIF.

Title: Relation Extraction from News Articles (RENA): A Tool for Epidemic Surveillance. (arXiv:2311.01472v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.01472
Code URL: null
Copy Paste: [[2311.01472]] Relation Extraction from News Articles (RENA): A Tool for Epidemic Surveillance(http://arxiv.org/abs/2311.01472)
Summary:
Relation Extraction from News Articles (RENA) is a browser-based tool designed to extract key entities and their semantic relationships in English language news articles related to infectious diseases. Constructed using the React framework, this system presents users with an elegant and user-friendly interface. It enables users to input a news article and select from a choice of two models to generate a comprehensive list of relations within the provided text. As a result, RENA allows real-time parsing of news articles to extract key information for epidemic surveillance, contributing to EPIWATCH, an open-source intelligence-based epidemic warning system.

Title: UP4LS: User Profile Constructed by Multiple Attributes for Enhancing Linguistic Steganalysis. (arXiv:2311.01775v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.01775
Code URL: null
Copy Paste: [[2311.01775]] UP4LS: User Profile Constructed by Multiple Attributes for Enhancing Linguistic Steganalysis(http://arxiv.org/abs/2311.01775)
Summary:
Linguistic steganalysis (LS) tasks aim to effectively detect stegos generated by linguistic steganography. Existing LS methods overlook the distinctive user characteristics, leading to weak performance in social networks. The limited occurrence of stegos further complicates detection. In this paper, we propose the UP4LS, a novel framework with the User Profile for enhancing LS performance. Specifically, by delving into post content, we explore user attributes like writing habits, psychological states, and focal areas, thereby building the user profile for LS. For each attribute, we design the identified feature extraction module. The extracted features are mapped to high-dimensional user features via deep-learning networks from existing methods. Then the language model is employed to extract content features. The user and content features are integrated to optimize feature representation. During the training phase, we prioritize the distribution of stegos. Experiments demonstrate that UP4LS can significantly enhance the performance of existing methods, and an overall accuracy improvement of nearly 25%. In particular, the improvement is especially pronounced with fewer stego samples. Additionally, UP4LS also sets the stage for studies on related tasks, encouraging extensive applications on LS tasks.

membership infer

federate

Title: FedSN: A General Federated Learning Framework over LEO Satellite Networks. (arXiv:2311.01483v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2311.01483
Code URL: null
Copy Paste: [[2311.01483]] FedSN: A General Federated Learning Framework over LEO Satellite Networks(http://arxiv.org/abs/2311.01483)
Summary:
Recently, a large number of Low Earth Orbit (LEO) satellites have been launched and deployed successfully in space by commercial companies, such as SpaceX. Due to multimodal sensors equipped by the LEO satellites, they serve not only for communication but also for various machine learning applications, such as space modulation recognition, remote sensing image classification, etc. However, the ground station (GS) may be incapable of downloading such a large volume of raw sensing data for centralized model training due to the limited contact time with LEO satellites (e.g. 5 minutes). Therefore, federated learning (FL) has emerged as the promising solution to address this problem via on-device training. Unfortunately, to enable FL on LEO satellites, we still face three critical challenges that are i) heterogeneous computing and memory capabilities, ii) limited uplink rate, and iii) model staleness. To this end, we propose FedSN as a general FL framework to tackle the above challenges, and fully explore data diversity on LEO satellites. Specifically, we first present a novel sub-structure scheme to enable heterogeneous local model training considering different computing, memory, and communication constraints on LEO satellites. Additionally, we propose a pseudo-synchronous model aggregation strategy to dynamically schedule model aggregation for compensating model staleness. To further demonstrate the effectiveness of the FedSN, we evaluate it using space modulation recognition and remote sensing image classification tasks by leveraging the data from real-world satellite networks. Extensive experimental results demonstrate that FedSN framework achieves higher accuracy, lower computing, and communication overhead than the state-of-the-art benchmarks and the effectiveness of each components in FedSN.

Title: Communication-Efficient Federated Non-Linear Bandit Optimization. (arXiv:2311.01695v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2311.01695
Code URL: null
Copy Paste: [[2311.01695]] Communication-Efficient Federated Non-Linear Bandit Optimization(http://arxiv.org/abs/2311.01695)
Summary:
Federated optimization studies the problem of collaborative function optimization among multiple clients (e.g. mobile devices or organizations) under the coordination of a central server. Since the data is collected separately by each client and always remains decentralized, federated optimization preserves data privacy and allows for large-scale computing, which makes it a promising decentralized machine learning paradigm. Though it is often deployed for tasks that are online in nature, e.g., next-word prediction on keyboard apps, most works formulate it as an offline problem. The few exceptions that consider federated bandit optimization are limited to very simplistic function classes, e.g., linear, generalized linear, or non-parametric function class with bounded RKHS norm, which severely hinders its practical usage. In this paper, we propose a new algorithm, named Fed-GO-UCB, for federated bandit optimization with generic non-linear objective function. Under some mild conditions, we rigorously prove that Fed-GO-UCB is able to achieve sub-linear rate for both cumulative regret and communication cost. At the heart of our theoretical analysis are distributed regression oracle and individual confidence set construction, which can be of independent interests. Empirical evaluations also demonstrate the effectiveness of the proposed algorithm.

Title: Heterogeneous federated collaborative filtering using FAIR: Federated Averaging in Random Subspaces. (arXiv:2311.01722v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2311.01722
Code URL: https://github.com/apd10/flcf
Copy Paste: [[2311.01722]] Heterogeneous federated collaborative filtering using FAIR: Federated Averaging in Random Subspaces(http://arxiv.org/abs/2311.01722)
Summary:
Recommendation systems (RS) for items (e.g., movies, books) and ads are widely used to tailor content to users on various internet platforms. Traditionally, recommendation models are trained on a central server. However, due to rising concerns for data privacy and regulations like the GDPR, federated learning is an increasingly popular paradigm in which data never leaves the client device. Applying federated learning to recommendation models is non-trivial due to large embedding tables, which often exceed the memory constraints of most user devices. To include data from all devices in federated learning, we must enable collective training of embedding tables on devices with heterogeneous memory capacities. Current solutions to heterogeneous federated learning can only accommodate a small range of capacities and thus limit the number of devices that can participate in training. We present Federated Averaging in Random subspaces (FAIR), which allows arbitrary compression of embedding tables based on device capacity and ensures the participation of all devices in training. FAIR uses what we call consistent and collapsible subspaces defined by hashing-based random projections to jointly train large embedding tables while using varying amounts of compression on user devices. We evaluate FAIR on Neural Collaborative Filtering tasks with multiple datasets and verify that FAIR can gather and share information from a wide range of devices with varying capacities, allowing for seamless collaboration. We prove the convergence of FAIR in the homogeneous setting with non-i.i.d data distribution. Our code is open source at {https://github.com/apd10/FLCF}

Title: Epidemic Decision-making System Based Federated Reinforcement Learning. (arXiv:2311.01749v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2311.01749
Code URL: null
Copy Paste: [[2311.01749]] Epidemic Decision-making System Based Federated Reinforcement Learning(http://arxiv.org/abs/2311.01749)
Summary:
Epidemic decision-making can effectively help the government to comprehensively consider public security and economic development to respond to public health and safety emergencies. Epidemic decision-making can effectively help the government to comprehensively consider public security and economic development to respond to public health and safety emergencies. Some studies have shown that intensive learning can effectively help the government to make epidemic decision, thus achieving the balance between health security and economic development. Some studies have shown that intensive learning can effectively help the government to make epidemic decision, thus achieving the balance between health security and economic development. However, epidemic data often has the characteristics of limited samples and high privacy. However, epidemic data often has the characteristics of limited samples and high privacy. This model can combine the epidemic situation data of various provinces for cooperative training to use as an enhanced learning model for epidemic situation decision, while protecting the privacy of data. The experiment shows that the enhanced federated learning can obtain more optimized performance and return than the enhanced learning, and the enhanced federated learning can also accelerate the training convergence speed of the training model. accelerate the training convergence speed of the client. At the same time, through the experimental comparison, A2C is the most suitable reinforcement learning model for the epidemic situation decision-making. learning model for the epidemic situation decision-making scenario, followed by the PPO model, and the performance of DDPG is unsatisfactory.

fair

Title: Improving Fairness using Vision-Language Driven Image Augmentation. (arXiv:2311.01573v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.01573
Code URL: https://github.com/moreno98/vision-language-bias-control
Copy Paste: [[2311.01573]] Improving Fairness using Vision-Language Driven Image Augmentation(http://arxiv.org/abs/2311.01573)
Summary:
Fairness is crucial when training a deep-learning discriminative model, especially in the facial domain. Models tend to correlate specific characteristics (such as age and skin color) with unrelated attributes (downstream tasks), resulting in biases which do not correspond to reality. It is common knowledge that these correlations are present in the data and are then transferred to the models during training. This paper proposes a method to mitigate these correlations to improve fairness. To do so, we learn interpretable and meaningful paths lying in the semantic space of a pre-trained diffusion model (DiffAE) -- such paths being supervised by contrastive text dipoles. That is, we learn to edit protected characteristics (age and skin color). These paths are then applied to augment images to improve the fairness of a given dataset. We test the proposed method on CelebA-HQ and UTKFace on several downstream tasks with age and skin color as protected characteristics. As a proxy for fairness, we compute the difference in accuracy with respect to the protected characteristics. Quantitative results show how the augmented images help the model improve the overall accuracy, the aforementioned metric, and the disparity of equal opportunity. Code is available at: https://github.com/Moreno98/Vision-Language-Bias-Control.

Title: Multi-EuP: The Multilingual European Parliament Dataset for Analysis of Bias in Information Retrieval. (arXiv:2311.01870v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.01870
Code URL: null
Copy Paste: [[2311.01870]] Multi-EuP: The Multilingual European Parliament Dataset for Analysis of Bias in Information Retrieval(http://arxiv.org/abs/2311.01870)
Summary:
We present Multi-EuP, a new multilingual benchmark dataset, comprising 22K multi-lingual documents collected from the European Parliament, spanning 24 languages. This dataset is designed to investigate fairness in a multilingual information retrieval (IR) context to analyze both language and demographic bias in a ranking context. It boasts an authentic multilingual corpus, featuring topics translated into all 24 languages, as well as cross-lingual relevance judgments. Furthermore, it offers rich demographic information associated with its documents, facilitating the study of demographic bias. We report the effectiveness of Multi-EuP for benchmarking both monolingual and multilingual IR. We also conduct a preliminary experiment on language bias caused by the choice of tokenization strategy.

Title: Don't Make Your LLM an Evaluation Benchmark Cheater. (arXiv:2311.01964v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.01964
Code URL: null
Copy Paste: [[2311.01964]] Don't Make Your LLM an Evaluation Benchmark Cheater(http://arxiv.org/abs/2311.01964)
Summary:
Large language models~(LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity. To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs in different aspects. Despite that a number of high-quality benchmarks have been released, the concerns about the appropriate use of these benchmarks and the fair comparison of different models are increasingly growing. Considering these concerns, in this paper, we discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results. Specially, we focus on a special issue that would lead to inappropriate evaluation, \ie \emph{benchmark leakage}, referring that the data related to evaluation sets is occasionally used for model training. This phenomenon now becomes more common since pre-training data is often prepared ahead of model test. We conduct extensive experiments to study the effect of benchmark leverage, and find that it can dramatically boost the evaluation results, which would finally lead to an unreliable assessment of model performance. To improve the use of existing evaluation benchmarks, we finally present several guidelines for both LLM developers and benchmark maintainers. We hope this work can draw attention to appropriate training and evaluation of LLMs.

Title: Post Turing: Mapping the landscape of LLM Evaluation. (arXiv:2311.02049v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.02049
Code URL: null
Copy Paste: [[2311.02049]] Post Turing: Mapping the landscape of LLM Evaluation(http://arxiv.org/abs/2311.02049)
Summary:
In the rapidly evolving landscape of Large Language Models (LLMs), introduction of well-defined and standardized evaluation methodologies remains a crucial challenge. This paper traces the historical trajectory of LLM evaluations, from the foundational questions posed by Alan Turing to the modern era of AI research. We categorize the evolution of LLMs into distinct periods, each characterized by its unique benchmarks and evaluation criteria. As LLMs increasingly mimic human-like behaviors, traditional evaluation proxies, such as the Turing test, have become less reliable. We emphasize the pressing need for a unified evaluation system, given the broader societal implications of these models. Through an analysis of common evaluation methodologies, we advocate for a qualitative shift in assessment approaches, underscoring the importance of standardization and objective criteria. This work serves as a call for the AI community to collaboratively address the challenges of LLM evaluation, ensuring their reliability, fairness, and societal benefit.

Title: Better Fair than Sorry: Adversarial Missing Data Imputation for Fair GNNs. (arXiv:2311.01591v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2311.01591
Code URL: null
Copy Paste: [[2311.01591]] Better Fair than Sorry: Adversarial Missing Data Imputation for Fair GNNs(http://arxiv.org/abs/2311.01591)
Summary:
This paper addresses the problem of learning fair Graph Neural Networks (GNNs) under missing protected attributes. GNNs have achieved state-of-the-art results in many relevant tasks where decisions might disproportionately impact specific communities. However, existing work on fair GNNs assumes that either protected attributes are fully-observed or that the missing data imputation is fair. In practice, biases in the imputation will be propagated to the model outcomes, leading them to overestimate the fairness of their predictions. We address this challenge by proposing Better Fair than Sorry (BFtS), a fair missing data imputation model for protected attributes used by fair GNNs. The key design principle behind BFtS is that imputations should approximate the worst-case scenario for the fair GNN -- i.e. when optimizing fairness is the hardest. We implement this idea using a 3-player adversarial scheme where two adversaries collaborate against the fair GNN. Experiments using synthetic and real datasets show that BFtS often achieves a better fairness $\times$ accuracy trade-off than existing alternatives.

interpretability

Title: Occlusion-Aware 2D and 3D Centerline Detection for Urban Driving via Automatic Label Generation. (arXiv:2311.02044v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.02044
Code URL: null
Copy Paste: [[2311.02044]] Occlusion-Aware 2D and 3D Centerline Detection for Urban Driving via Automatic Label Generation(http://arxiv.org/abs/2311.02044)
Summary:
This research work seeks to explore and identify strategies that can determine road topology information in 2D and 3D under highly dynamic urban driving scenarios. To facilitate this exploration, we introduce a substantial dataset comprising nearly one million automatically labeled data frames. A key contribution of our research lies in developing an automatic label-generation process and an occlusion handling strategy. This strategy is designed to model a wide range of occlusion scenarios, from mild disruptions to severe blockages. Furthermore, we present a comprehensive ablation study wherein multiple centerline detection methods are developed and evaluated. This analysis not only benchmarks the performance of various approaches but also provides valuable insights into the interpretability of these methods. Finally, we demonstrate the practicality of our methods and assess their adaptability across different sensor configurations, highlighting their versatility and relevance in real-world scenarios. Our dataset and experimental models are publicly available.

Title: Proto-lm: A Prototypical Network-Based Framework for Built-in Interpretability in Large Language Models. (arXiv:2311.01732v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.01732
Code URL: null
Copy Paste: [[2311.01732]] Proto-lm: A Prototypical Network-Based Framework for Built-in Interpretability in Large Language Models(http://arxiv.org/abs/2311.01732)
Summary:
Large Language Models (LLMs) have significantly advanced the field of Natural Language Processing (NLP), but their lack of interpretability has been a major concern. Current methods for interpreting LLMs are post hoc, applied after inference time, and have limitations such as their focus on low-level features and lack of explainability at higher level text units. In this work, we introduce proto-lm, a prototypical network-based white-box framework that allows LLMs to learn immediately interpretable embeddings during the fine-tuning stage while maintaining competitive performance. Our method's applicability and interpretability are demonstrated through experiments on a wide range of NLP tasks, and our results indicate a new possibility of creating interpretable models without sacrificing performance. This novel approach to interpretability in LLMs can pave the way for more interpretable models without the need to sacrifice performance.

Title: Constructing Temporal Dynamic Knowledge Graphs from Interactive Text-based Games. (arXiv:2311.01928v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.01928
Code URL: https://github.com/yukw777/temporal-discrete-graph-updater
Copy Paste: [[2311.01928]] Constructing Temporal Dynamic Knowledge Graphs from Interactive Text-based Games(http://arxiv.org/abs/2311.01928)
Summary:
In natural language processing, interactive text-based games serve as a test bed for interactive AI systems. Prior work has proposed to play text-based games by acting based on discrete knowledge graphs constructed by the Discrete Graph Updater (DGU) to represent the game state from the natural language description. While DGU has shown promising results with high interpretability, it suffers from lower knowledge graph accuracy due to its lack of temporality and limited generalizability to complex environments with objects with the same label. In order to address DGU's weaknesses while preserving its high interpretability, we propose the Temporal Discrete Graph Updater (TDGU), a novel neural network model that represents dynamic knowledge graphs as a sequence of timestamped graph events and models them using a temporal point based graph neural network. Through experiments on the dataset collected from a text-based game TextWorld, we show that TDGU outperforms the baseline DGU. We further show the importance of temporal information for TDGU's performance through an ablation study and demonstrate that TDGU has the ability to generalize to more complex environments with objects with the same label. All the relevant code can be found at \url{https://github.com/yukw777/temporal-discrete-graph-updater}.

explainability

watermark

diffusion

Title: Exploring the Hyperparameter Space of Image Diffusion Models for Echocardiogram Generation. (arXiv:2311.01567v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.01567
Code URL: null
Copy Paste: [[2311.01567]] Exploring the Hyperparameter Space of Image Diffusion Models for Echocardiogram Generation(http://arxiv.org/abs/2311.01567)
Summary:
This work presents an extensive hyperparameter search on Image Diffusion Models for Echocardiogram generation. The objective is to establish foundational benchmarks and provide guidelines within the realm of ultrasound image and video generation. This study builds over the latest advancements, including cutting-edge model architectures and training methodologies. We also examine the distribution shift between real and generated samples and consider potential solutions, crucial to train efficient models on generated data. We determine an Optimal FID score of $0.88$ for our research problem and achieve an FID of $2.60$. This work is aimed at contributing valuable insights and serving as a reference for further developments in the specialized field of ultrasound image and video generation.

Title: PDF: Point Diffusion Implicit Function for Large-scale Scene Neural Representation. (arXiv:2311.01773v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.01773
Code URL: null
Copy Paste: [[2311.01773]] PDF: Point Diffusion Implicit Function for Large-scale Scene Neural Representation(http://arxiv.org/abs/2311.01773)
Summary:
Recent advances in implicit neural representations have achieved impressive results by sampling and fusing individual points along sampling rays in the sampling space. However, due to the explosively growing sampling space, finely representing and synthesizing detailed textures remains a challenge for unbounded large-scale outdoor scenes. To alleviate the dilemma of using individual points to perceive the entire colossal space, we explore learning the surface distribution of the scene to provide structural priors and reduce the samplable space and propose a Point Diffusion implicit Function, PDF, for large-scale scene neural representation. The core of our method is a large-scale point cloud super-resolution diffusion module that enhances the sparse point cloud reconstructed from several training images into a dense point cloud as an explicit prior. Then in the rendering stage, only sampling points with prior points within the sampling radius are retained. That is, the sampling space is reduced from the unbounded space to the scene surface. Meanwhile, to fill in the background of the scene that cannot be provided by point clouds, the region sampling based on Mip-NeRF 360 is employed to model the background representation. Expensive experiments have demonstrated the effectiveness of our method for large-scale scene novel view synthesis, which outperforms relevant state-of-the-art baselines.

Title: DiffDub: Person-generic Visual Dubbing Using Inpainting Renderer with Diffusion Auto-encoder. (arXiv:2311.01811v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.01811
Code URL: null
Copy Paste: [[2311.01811]] DiffDub: Person-generic Visual Dubbing Using Inpainting Renderer with Diffusion Auto-encoder(http://arxiv.org/abs/2311.01811)
Summary:
Generating high-quality and person-generic visual dubbing remains a challenge. Recent innovation has seen the advent of a two-stage paradigm, decoupling the rendering and lip synchronization process facilitated by intermediate representation as a conduit. Still, previous methodologies rely on rough landmarks or are confined to a single speaker, thus limiting their performance. In this paper, we propose DiffDub: Diffusion-based dubbing. We first craft the Diffusion auto-encoder by an inpainting renderer incorporating a mask to delineate editable zones and unaltered regions. This allows for seamless filling of the lower-face region while preserving the remaining parts. Throughout our experiments, we encountered several challenges. Primarily, the semantic encoder lacks robustness, constricting its ability to capture high-level features. Besides, the modeling ignored facial positioning, causing mouth or nose jitters across frames. To tackle these issues, we employ versatile strategies, including data augmentation and supplementary eye guidance. Moreover, we encapsulated a conformer-based reference encoder and motion generator fortified by a cross-attention mechanism. This enables our model to learn person-specific textures with varying references and reduces reliance on paired audio-visual data. Our rigorous experiments comprehensively highlight that our ground-breaking approach outpaces existing methods with considerable margins and delivers seamless, intelligible videos in person-generic and multilingual scenarios.

Title: On the Generalization Properties of Diffusion Models. (arXiv:2311.01797v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2311.01797
Code URL: https://github.com/lphleo/diffusion_generalization
Copy Paste: [[2311.01797]] On the Generalization Properties of Diffusion Models(http://arxiv.org/abs/2311.01797)
Summary:
Diffusion models are a class of generative models that serve to establish a stochastic transport map between an empirically observed, yet unknown, target distribution and a known prior. Despite their remarkable success in real-world applications, a theoretical understanding of their generalization capabilities remains underdeveloped. This work embarks on a comprehensive theoretical exploration of the generalization attributes of diffusion models. We establish theoretical estimates of the generalization gap that evolves in tandem with the training dynamics of score-based diffusion models, suggesting a polynomially small generalization error ($O(n^{-2/5}+m^{-4/5})$) on both the sample size $n$ and the model capacity $m$, evading the curse of dimensionality (i.e., not exponentially large in the data dimension) when early-stopped. Furthermore, we extend our quantitative analysis to a data-dependent scenario, wherein target distributions are portrayed as a succession of densities with progressively increasing distances between modes. This precisely elucidates the adverse effect of "modes shift" in ground truths on the model generalization. Moreover, these estimates are not solely theoretical constructs but have also been confirmed through numerical simulations. Our findings contribute to the rigorous understanding of diffusion models' generalization properties and provide insights that may guide practical applications.

noise learning

data-free

Title: Data-Free Distillation of Language Model by Text-to-Text Transfer. (arXiv:2311.01689v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.01689
Code URL: null
Copy Paste: [[2311.01689]] Data-Free Distillation of Language Model by Text-to-Text Transfer(http://arxiv.org/abs/2311.01689)
Summary:
Data-Free Knowledge Distillation (DFKD) plays a vital role in compressing the model when original training data is unavailable. Previous works for DFKD in NLP mainly focus on distilling encoder-only structures like BERT on classification tasks, which overlook the notable progress of generative language modeling. In this work, we propose a novel DFKD framework, namely DFKD-T$^{3}$, where the pretrained generative language model can also serve as a controllable data generator for model compression. This novel framework DFKD-T$^{3}$ leads to an end-to-end learnable text-to-text framework to transform the general domain corpus to compression-friendly task data, targeting to improve both the \textit{specificity} and \textit{diversity}. Extensive experiments show that our method can boost the distillation performance in various downstream tasks such as sentiment analysis, linguistic acceptability, and information extraction. Furthermore, we show that the generated texts can be directly used for distilling other language models and outperform the SOTA methods, making our method more appealing in a general DFKD setting. Our code is available at https://gitee.com/mindspore/models/tree/master/research/nlp/DFKD\_T3.

transformer

Title: Content Significance Distribution of Sub-Text Blocks in Articles and Its Application to Article-Organization Assessment. (arXiv:2311.01673v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.01673
Code URL: null
Copy Paste: [[2311.01673]] Content Significance Distribution of Sub-Text Blocks in Articles and Its Application to Article-Organization Assessment(http://arxiv.org/abs/2311.01673)
Summary:
We explore how to capture the significance of a sub-text block in an article and how it may be used for text mining tasks. A sub-text block is a sub-sequence of sentences in the article. We formulate the notion of content significance distribution (CSD) of sub-text blocks, referred to as CSD of the first kind and denoted by CSD-1. In particular, we leverage Hugging Face's SentenceTransformer to generate contextual sentence embeddings, and use MoverScore over text embeddings to measure how similar a sub-text block is to the entire text. To overcome the exponential blowup on the number of sub-text blocks, we present an approximation algorithm and show that the approximated CSD-1 is almost identical to the exact CSD-1. Under this approximation, we show that the average and median CSD-1's for news, scholarly research, argument, and narrative articles share the same pattern. We also show that under a certain linear transformation, the complement of the cumulative distribution function of the beta distribution with certain values of $\alpha$ and $\beta$ resembles a CSD-1 curve. We then use CSD-1's to extract linguistic features to train an SVC classifier for assessing how well an article is organized. Through experiments, we show that this method achieves high accuracy for assessing student essays. Moreover, we study CSD of sentence locations, referred to as CSD of the second kind and denoted by CSD-2, and show that average CSD-2's for different types of articles possess distinctive patterns, which either conform common perceptions of article structures or provide rectification with minor deviation.

Title: Towards a Unified Transformer-based Framework for Scene Graph Generation and Human-object Interaction Detection. (arXiv:2311.01755v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.01755
Code URL: null
Copy Paste: [[2311.01755]] Towards a Unified Transformer-based Framework for Scene Graph Generation and Human-object Interaction Detection(http://arxiv.org/abs/2311.01755)
Summary:
Scene graph generation (SGG) and human-object interaction (HOI) detection are two important visual tasks aiming at localising and recognising relationships between objects, and interactions between humans and objects, respectively.

Prevailing works treat these tasks as distinct tasks, leading to the development of task-specific models tailored to individual datasets. However, we posit that the presence of visual relationships can furnish crucial contextual and intricate relational cues that significantly augment the inference of human-object interactions. This motivates us to think if there is a natural intrinsic relationship between the two tasks, where scene graphs can serve as a source for inferring human-object interactions. In light of this, we introduce SG2HOI+, a unified one-step model based on the Transformer architecture. Our approach employs two interactive hierarchical Transformers to seamlessly unify the tasks of SGG and HOI detection. Concretely, we initiate a relation Transformer tasked with generating relation triples from a suite of visual features. Subsequently, we employ another transformer-based decoder to predict human-object interactions based on the generated relation triples. A comprehensive series of experiments conducted across established benchmark datasets including Visual Genome, V-COCO, and HICO-DET demonstrates the compelling performance of our SG2HOI+ model in comparison to prevalent one-stage SGG models. Remarkably, our approach achieves competitive performance when compared to state-of-the-art HOI methods. Additionally, we observe that our SG2HOI+ jointly trained on both SGG and HOI tasks in an end-to-end manner yields substantial improvements for both tasks compared to individualized training paradigms.

Title: EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision. (arXiv:2311.02077v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.02077
Code URL: null
Copy Paste: [[2311.02077]] EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision(http://arxiv.org/abs/2311.02077)
Summary:
We present EmerNeRF, a simple yet powerful approach for learning spatial-temporal representations of dynamic driving scenes. Grounded in neural fields, EmerNeRF simultaneously captures scene geometry, appearance, motion, and semantics via self-bootstrapping. EmerNeRF hinges upon two core components: First, it stratifies scenes into static and dynamic fields. This decomposition emerges purely from self-supervision, enabling our model to learn from general, in-the-wild data sources. Second, EmerNeRF parameterizes an induced flow field from the dynamic field and uses this flow field to further aggregate multi-frame features, amplifying the rendering precision of dynamic objects. Coupling these three fields (static, dynamic, and flow) enables EmerNeRF to represent highly-dynamic scenes self-sufficiently, without relying on ground truth object annotations or pre-trained models for dynamic object segmentation or optical flow estimation. Our method achieves state-of-the-art performance in sensor simulation, significantly outperforming previous methods when reconstructing static (+2.93 PSNR) and dynamic (+3.70 PSNR) scenes. In addition, to bolster EmerNeRF's semantic generalization, we lift 2D visual foundation model features into 4D space-time and address a general positional bias in modern Transformers, significantly boosting 3D perception performance (e.g., 37.50% relative improvement in occupancy prediction accuracy on average). Finally, we construct a diverse and challenging 120-sequence dataset to benchmark neural fields under extreme and highly-dynamic settings.

Title: A New Korean Text Classification Benchmark for Recognizing the Political Intents in Online Newspapers. (arXiv:2311.01712v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.01712
Code URL: https://github.com/kdavid2355/kopolitic-benchmark-dataset
Copy Paste: [[2311.01712]] A New Korean Text Classification Benchmark for Recognizing the Political Intents in Online Newspapers(http://arxiv.org/abs/2311.01712)
Summary:
Many users reading online articles in various magazines may suffer considerable difficulty in distinguishing the implicit intents in texts. In this work, we focus on automatically recognizing the political intents of a given online newspaper by understanding the context of the text. To solve this task, we present a novel Korean text classification dataset that contains various articles. We also provide deep-learning-based text classification baseline models trained on the proposed dataset. Our dataset contains 12,000 news articles that may contain political intentions, from the politics section of six of the most representative newspaper organizations in South Korea. All the text samples are labeled simultaneously in two aspects (1) the level of political orientation and (2) the level of pro-government. To the best of our knowledge, our paper is the most large-scale Korean news dataset that contains long text and addresses multi-task classification problems. We also train recent state-of-the-art (SOTA) language models that are based on transformer architectures and demonstrate that the trained models show decent text classification performance. All the codes, datasets, and trained models are available at https://github.com/Kdavid2355/KoPolitic-Benchmark-Dataset.

Title: An Empirical Study of Benchmarking Chinese Aspect Sentiment Quad Prediction. (arXiv:2311.01713v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.01713
Code URL: null
Copy Paste: [[2311.01713]] An Empirical Study of Benchmarking Chinese Aspect Sentiment Quad Prediction(http://arxiv.org/abs/2311.01713)
Summary:
Aspect sentiment quad prediction (ASQP) is a critical subtask of aspect-level sentiment analysis. Current ASQP datasets are characterized by their small size and low quadruple density, which hinders technical development. To expand capacity, we construct two large Chinese ASQP datasets crawled from multiple online platforms. The datasets hold several significant characteristics: larger size (each with 10,000+ samples) and rich aspect categories, more words per sentence, and higher density than existing ASQP datasets. Moreover, we are the first to evaluate the performance of Generative Pre-trained Transformer (GPT) series models on ASQP and exhibit potential issues. The experiments with state-of-the-art ASQP baselines underscore the need to explore additional techniques to address ASQP, as well as the importance of further investigation into methods to improve the performance of GPTs.

Title: GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling. (arXiv:2311.01927v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2311.01927
Code URL: null
Copy Paste: [[2311.01927]] GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling(http://arxiv.org/abs/2311.01927)
Summary:
Linear Recurrence has proven to be a powerful tool for modeling long sequences efficiently. In this work, we show that existing models fail to take full advantage of its potential. Motivated by this finding, we develop GateLoop, a foundational sequence model that generalizes linear recurrent models such as S4, S5, LRU and RetNet, by employing data-controlled state transitions. Utilizing this theoretical advance, GateLoop empirically outperforms existing models for auto-regressive language modeling. Our method comes with a low-cost $O(l)$ recurrent mode and an efficient $O(l \log_{2} l)$ parallel mode making use of highly optimized associative scan implementations. Furthermore, we derive an $O(l^2)$ surrogate attention mode, revealing remarkable implications for Transformer and recently proposed architectures. Specifically, we prove that our approach can be interpreted as providing data-controlled relative-positional information to Attention. While many existing models solely rely on data-controlled cumulative sums for context aggregation, our findings suggest that incorporating data-controlled complex cumulative products may be a crucial step towards more powerful sequence models.

Title: ProSG: Using Prompt Synthetic Gradients to Alleviate Prompt Forgetting of RNN-like Language Models. (arXiv:2311.01981v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.01981
Code URL: null
Copy Paste: [[2311.01981]] ProSG: Using Prompt Synthetic Gradients to Alleviate Prompt Forgetting of RNN-like Language Models(http://arxiv.org/abs/2311.01981)
Summary:
RNN-like language models are getting renewed attention from NLP researchers in recent years and several models have made significant progress, which demonstrates performance comparable to traditional transformers. However, due to the recurrent nature of RNNs, this kind of language model can only store information in a set of fixed-length state vectors. As a consequence, they still suffer from forgetfulness though after a lot of improvements and optimizations, when given complex instructions or prompts. As the prompted generation is the main and most concerned function of LMs, solving the problem of forgetting in the process of generation is no wonder of vital importance. In this paper, focusing on easing the prompt forgetting during generation, we proposed an architecture to teach the model memorizing prompt during generation by synthetic gradient. To force the model to memorize the prompt, we derive the states that encode the prompt, then transform it into model parameter modification using low-rank gradient approximation, which hard-codes the prompt into model parameters temporarily. We construct a dataset for experiments, and the results have demonstrated the effectiveness of our method in solving the problem of forgetfulness in the process of prompted generation. We will release all the code upon acceptance.

Title: On the Convergence of Encoder-only Shallow Transformers. (arXiv:2311.01575v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2311.01575
Code URL: null
Copy Paste: [[2311.01575]] On the Convergence of Encoder-only Shallow Transformers(http://arxiv.org/abs/2311.01575)
Summary:
In this paper, we aim to build the global convergence theory of encoder-only shallow Transformers under a realistic setting from the perspective of architectures, initialization, and scaling under a finite width regime. The difficulty lies in how to tackle the softmax in self-attention mechanism, the core ingredient of Transformer. In particular, we diagnose the scaling scheme, carefully tackle the input/output of softmax, and prove that quadratic overparameterization is sufficient for global convergence of our shallow Transformers under commonly-used He/LeCun initialization in practice. Besides, neural tangent kernel (NTK) based analysis is also given, which facilitates a comprehensive comparison. Our theory demonstrates the separation on the importance of different scaling schemes and initialization. We believe our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.

Title: TinyFormer: Efficient Transformer Design and Deployment on Tiny Devices. (arXiv:2311.01759v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2311.01759
Code URL: null
Copy Paste: [[2311.01759]] TinyFormer: Efficient Transformer Design and Deployment on Tiny Devices(http://arxiv.org/abs/2311.01759)
Summary:
Developing deep learning models on tiny devices (e.g. Microcontroller units, MCUs) has attracted much attention in various embedded IoT applications. However, it is challenging to efficiently design and deploy recent advanced models (e.g. transformers) on tiny devices due to their severe hardware resource constraints. In this work, we propose TinyFormer, a framework specifically designed to develop and deploy resource-efficient transformers on MCUs. TinyFormer mainly consists of SuperNAS, SparseNAS and SparseEngine. Separately, SuperNAS aims to search for an appropriate supernet from a vast search space. SparseNAS evaluates the best sparse single-path model including transformer architecture from the identified supernet. Finally, SparseEngine efficiently deploys the searched sparse models onto MCUs. To the best of our knowledge, SparseEngine is the first deployment framework capable of performing inference of sparse models with transformer on MCUs. Evaluation results on the CIFAR-10 dataset demonstrate that TinyFormer can develop efficient transformers with an accuracy of $96.1\%$ while adhering to hardware constraints of $1$MB storage and $320$KB memory. Additionally, TinyFormer achieves significant speedups in sparse inference, up to $12.2\times$, when compared to the CMSIS-NN library. TinyFormer is believed to bring powerful transformers into TinyML scenarios and greatly expand the scope of deep learning applications.

Title: Simplifying Transformer Blocks. (arXiv:2311.01906v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2311.01906
Code URL: https://github.com/bobby-he/simplified_transformers
Copy Paste: [[2311.01906]] Simplifying Transformer Blocks(http://arxiv.org/abs/2311.01906)
Summary:
A simple design recipe for deep Transformers is to compose identical building blocks. But standard transformer blocks are far from simple, interweaving attention and MLP sub-blocks with skip connections & normalisation layers in precise arrangements. This complexity leads to brittle architectures, where seemingly minor changes can significantly reduce training speed, or render models untrainable.

In this work, we ask to what extent the standard transformer block can be simplified? Combining signal propagation theory and empirical observations, we motivate modifications that allow many block components to be removed with no loss of training speed, including skip connections, projection or value parameters, sequential sub-blocks and normalisation layers. In experiments on both autoregressive decoder-only and BERT encoder-only models, our simplified transformers emulate the per-update training speed and performance of standard transformers, while enjoying 15% faster training throughput, and using 15% fewer parameters.

generative

Title: Efficient Cloud Pipelines for Neural Radiance Fields. (arXiv:2311.01659v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.01659
Code URL: null
Copy Paste: [[2311.01659]] Efficient Cloud Pipelines for Neural Radiance Fields(http://arxiv.org/abs/2311.01659)
Summary:
Since their introduction in 2020, Neural Radiance Fields (NeRFs) have taken the computer vision community by storm. They provide a multi-view representation of a scene or object that is ideal for eXtended Reality (XR) applications and for creative endeavors such as virtual production, as well as change detection operations in geospatial analytics. The computational cost of these generative AI models is quite high, however, and the construction of cloud pipelines to generate NeRFs is neccesary to realize their potential in client applications. In this paper, we present pipelines on a high performance academic computing cluster and compare it with a pipeline implemented on Microsoft Azure. Along the way, we describe some uses of NeRFs in enabling novel user interaction scenarios.

Title: Indo LEGO-ABSA: A Multitask Generative Aspect Based Sentiment Analysis for Indonesian Language. (arXiv:2311.01757v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.01757
Code URL: null
Copy Paste: [[2311.01757]] Indo LEGO-ABSA: A Multitask Generative Aspect Based Sentiment Analysis for Indonesian Language(http://arxiv.org/abs/2311.01757)
Summary:
Aspect-based sentiment analysis is a method in natural language processing aimed at identifying and understanding sentiments related to specific aspects of an entity. Aspects are words or phrases that represent an aspect or attribute of a particular entity. Previous research has utilized generative pre-trained language models to perform aspect-based sentiment analysis. LEGO-ABSA is one framework that has successfully employed generative pre-trained language models in aspect-based sentiment analysis, particularly in English. LEGO-ABSA uses a multitask learning and prompting approach to enhance model performance. However, the application of this approach has not been done in the context of Bahasa Indonesia. Therefore, this research aims to implement the multitask learning and prompting approach in aspect-based sentiment analysis for Bahasa Indonesia using generative pre-trained language models. In this study, the Indo LEGO-ABSA model is developed, which is an aspect-based sentiment analysis model utilizing generative pre-trained language models and trained with multitask learning and prompting. Indo LEGO-ABSA is trained with a hotel domain dataset in the Indonesian language. The obtained results include an f1-score of 79.55% for the Aspect Sentiment Triplet Extraction task, 86.09% for Unified Aspect-based Sentiment Analysis, 79.85% for Aspect Opinion Pair Extraction, 87.45% for Aspect Term Extraction, and 88.09% for Opinion Term Extraction. Indo LEGO-ABSA adopts the LEGO-ABSA framework that employs the T5 model, specifically mT5, by applying multitask learning to train all tasks within aspect-based sentiment analysis.

Title: Indicative Summarization of Long Discussions. (arXiv:2311.01882v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.01882
Code URL: https://github.com/webis-de/emnlp-23
Copy Paste: [[2311.01882]] Indicative Summarization of Long Discussions(http://arxiv.org/abs/2311.01882)
Summary:
Online forums encourage the exchange and discussion of different stances on many topics. Not only do they provide an opportunity to present one's own arguments, but may also gather a broad cross-section of others' arguments. However, the resulting long discussions are difficult to overview. This paper presents a novel unsupervised approach using large language models (LLMs) to generating indicative summaries for long discussions that basically serve as tables of contents. Our approach first clusters argument sentences, generates cluster labels as abstractive summaries, and classifies the generated cluster labels into argumentation frames resulting in a two-level summary. Based on an extensively optimized prompt engineering approach, we evaluate 19~LLMs for generative cluster labeling and frame classification. To evaluate the usefulness of our indicative summaries, we conduct a purpose-driven user study via a new visual interface called Discussion Explorer: It shows that our proposed indicative summaries serve as a convenient navigation tool to explore long discussions.

large language model

Title: Creating Trustworthy LLMs: Dealing with Hallucinations in Healthcare AI. (arXiv:2311.01463v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.01463
Code URL: null
Copy Paste: [[2311.01463]] Creating Trustworthy LLMs: Dealing with Hallucinations in Healthcare AI(http://arxiv.org/abs/2311.01463)
Summary:
Large language models have proliferated across multiple domains in as short period of time. There is however hesitation in the medical and healthcare domain towards their adoption because of issues like factuality, coherence, and hallucinations. Give the high stakes nature of healthcare, many researchers have even cautioned against its usage until these issues are resolved. The key to the implementation and deployment of LLMs in healthcare is to make these models trustworthy, transparent (as much possible) and explainable. In this paper we describe the key elements in creating reliable, trustworthy, and unbiased models as a necessary condition for their adoption in healthcare. Specifically we focus on the quantification, validation, and mitigation of hallucinations in the context in healthcare. Lastly, we discuss how the future of LLMs in healthcare may look like.

Title: What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning. (arXiv:2311.01487v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.01487
Code URL: https://github.com/rucaibox/comvint
Copy Paste: [[2311.01487]] What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning(http://arxiv.org/abs/2311.01487)
Summary:
Visual instruction tuning is an essential approach to improving the zero-shot generalization capability of Multi-modal Large Language Models (MLLMs). A surge of visual instruction datasets with various focuses and characteristics have been proposed recently, enabling MLLMs to achieve surprising results on evaluation benchmarks. To develop more capable MLLMs, in this paper, we aim to investigate a more fundamental question: ``what makes for good visual instructions?''. By conducting a comprehensive empirical study, we find that instructions focused on complex visual reasoning tasks are particularly effective in improving the performance of MLLMs on evaluation benchmarks. Building upon this finding, we design a systematic approach to automatically creating high-quality complex visual reasoning instructions. Our approach employs a synthesis-complication-reformulation paradigm, leveraging multiple stages to gradually increase the complexity of the instructions while guaranteeing quality. Based on this approach, we create the synthetic visual reasoning instruction dataset consisting of 32K examples, namely ComVint, and fine-tune four MLLMs on it. Experimental results demonstrate that our dataset consistently enhances the performance of all the compared MLLMs, e.g., improving the performance of MiniGPT-4 and BLIP-2 on MME-Cognition by 32.6% and 28.8%, respectively. Our code and data are publicly available at the link: https://github.com/RUCAIBox/ComVint.

Title: Remember what you did so you know what to do next. (arXiv:2311.01468v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.01468
Code URL: null
Copy Paste: [[2311.01468]] Remember what you did so you know what to do next(http://arxiv.org/abs/2311.01468)
Summary:
We explore using a moderately sized large language model (GPT-J 6B parameters) to create a plan for a simulated robot to achieve 30 classes of goals in ScienceWorld, a text game simulator for elementary science experiments. Previously published empirical work claimed that large language models (LLMs) are a poor fit (Wang et al., 2022) compared to reinforcement learning. Using the Markov assumption (a single previous step), the LLM outperforms the reinforcement learning-based approach by a factor of 1.4. When we fill the LLM's input buffer with as many prior steps as possible, improvement rises to 3.5x. Even when training on only 6.5% of the training data, we observe a 2.2x improvement over the reinforcement-learning-based approach. Our experiments show that performance varies widely across the 30 classes of actions, indicating that averaging over tasks can hide significant performance issues. In work contemporaneous with ours, Lin et al. (2023) demonstrated a two-part approach (SwiftSage) that uses a small LLM (T5-large) complemented by OpenAI's massive LLMs to achieve outstanding results in ScienceWorld. Our 6-B parameter, single-stage GPT-J matches the performance of SwiftSage's two-stage architecture when it incorporates GPT-3.5 turbo which has 29-times more parameters than GPT-J.

Title: Divergent Token Metrics: Measuring degradation to prune away LLM components -- and optimize quantization. (arXiv:2311.01544v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.01544
Code URL: null
Copy Paste: [[2311.01544]] Divergent Token Metrics: Measuring degradation to prune away LLM components -- and optimize quantization(http://arxiv.org/abs/2311.01544)
Summary:
Large Language Models (LLMs) have reshaped natural language processing with their impressive capabilities. Their ever-increasing size, however, raised concerns about their effective deployment and the need for LLM compressions. This study introduces the Divergent Token metrics (DTMs), a novel approach for assessing compressed LLMs, addressing the limitations of traditional measures like perplexity that fail to accurately reflect text generation quality. DTMs focus on token divergence, providing deeper insights into the subtleties of model compression. Our results indicate that significant levels of precision and sparsity can be achieved without compromising text generation quality. Moreover, DTMs offers a more precise evaluation of each component's impact individually. Utilizing the First Divergent Token metric (FDTM) in model sparsification reveals that nearly 20% of all components can be pruned over 90%. In terms of quantization, the FDTM suggests that over 80% of parameters can be straightforwardly transformed to int8 without special outlier management.

Title: Preserving the knowledge of long clinical texts using aggregated ensembles of large language models. (arXiv:2311.01571v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.01571
Code URL: null
Copy Paste: [[2311.01571]] Preserving the knowledge of long clinical texts using aggregated ensembles of large language models(http://arxiv.org/abs/2311.01571)
Summary:
Clinical texts, such as admission notes, discharge summaries, and progress notes, contain rich and valuable information that can be used for various clinical outcome prediction tasks. However, applying large language models, such as BERT-based models, to clinical texts poses two major challenges: the limitation of input length and the diversity of data sources. This paper proposes a novel method to preserve the knowledge of long clinical texts using aggregated ensembles of large language models. Unlike previous studies which use model ensembling or text aggregation methods separately, we combine ensemble learning with text aggregation and train multiple large language models on two clinical outcome tasks: mortality prediction and length of stay prediction. We show that our method can achieve better results than baselines, ensembling, and aggregation individually, and can improve the performance of large language models while handling long inputs and diverse datasets. We conduct extensive experiments on the admission notes from the MIMIC-III clinical database by combining multiple unstructured and high-dimensional datasets, demonstrating our method's effectiveness and superiority over existing approaches. We also provide a comprehensive analysis and discussion of our results, highlighting our method's applications and limitations for future research in the domain of clinical healthcare. The results and analysis of this study is supportive of our method assisting in clinical healthcare systems by enabling clinical decision-making with robust performance overcoming the challenges of long text inputs and varied datasets.

Title: DialogBench: Evaluating LLMs as Human-like Dialogue Systems. (arXiv:2311.01677v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.01677
Code URL: null
Copy Paste: [[2311.01677]] DialogBench: Evaluating LLMs as Human-like Dialogue Systems(http://arxiv.org/abs/2311.01677)
Summary:
Large language models (LLMs) have achieved remarkable breakthroughs in new dialogue capabilities, refreshing human's impressions on dialogue systems. The long-standing goal of dialogue systems is to be human-like enough to establish long-term connections with users by satisfying the need for communication, affection and social belonging. Therefore, there has been an urgent need to evaluate LLMs as human-like dialogue systems. In this paper, we propose DialogBench, a dialogue evaluation benchmark that currently contains $12$ dialogue tasks to assess the capabilities of LLMs as human-like dialogue systems should have. Specifically, we prompt GPT-4 to generate evaluation instances for each task. We first design the basic prompt based on widely-used design principles and further mitigate the existing biases to generate higher-quality evaluation instances. Our extensive test over $28$ LLMs (including pre-trained and supervised instruction-tuning) shows that instruction fine-tuning benefits improve the human likeness of LLMs to a certain extent, but there is still much room to improve those capabilities for most LLMs as human-like dialogue systems. In addition, experimental results also indicate that LLMs perform differently in various abilities that human-like dialogue systems should have. We will publicly release DialogBench, along with the associated evaluation code for the broader research community.

Title: PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task Completion. (arXiv:2311.01767v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.01767
Code URL: https://github.com/gydpku/pptc
Copy Paste: [[2311.01767]] PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task Completion(http://arxiv.org/abs/2311.01767)
Summary:
Recent evaluations of Large Language Models (LLMs) have centered around testing their zero-shot/few-shot capabilities for basic natural language tasks and their ability to translate instructions into tool APIs. However, the evaluation of LLMs utilizing complex tools to finish multi-turn, multi-modal instructions in a complex multi-modal environment has not been investigated. To address this gap, we introduce the PowerPoint Task Completion (PPTC) benchmark to assess LLMs' ability to create and edit PPT files based on user instructions. It contains 279 multi-turn sessions covering diverse topics and hundreds of instructions involving multi-modal operations. We also propose the PPTX-Match Evaluation System that evaluates if LLMs finish the instruction based on the prediction file rather than the label API sequence, thus it supports various LLM-generated API sequences. We measure 3 closed LLMs and 6 open-source LLMs. The results show that GPT-4 outperforms other LLMs with 75.1\% accuracy in single-turn dialogue testing but faces challenges in completing entire sessions, achieving just 6\% session accuracy. We find three main error causes in our benchmark: error accumulation in the multi-turn session, long PPT template processing, and multi-modality perception. These pose great challenges for future LLM and agent systems. We release the data, code, and evaluation system of PPTC at \url{https://github.com/gydpku/PPTC}.

Title: TCM-GPT: Efficient Pre-training of Large Language Models for Domain Adaptation in Traditional Chinese Medicine. (arXiv:2311.01786v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.01786
Code URL: null
Copy Paste: [[2311.01786]] TCM-GPT: Efficient Pre-training of Large Language Models for Domain Adaptation in Traditional Chinese Medicine(http://arxiv.org/abs/2311.01786)
Summary:
Pre-training and fine-tuning have emerged as a promising paradigm across various natural language processing (NLP) tasks. The effectiveness of pretrained large language models (LLM) has witnessed further enhancement, holding potential for applications in the field of medicine, particularly in the context of Traditional Chinese Medicine (TCM). However, the application of these general models to specific domains often yields suboptimal results, primarily due to challenges like lack of domain knowledge, unique objectives, and computational efficiency. Furthermore, their effectiveness in specialized domains, such as Traditional Chinese Medicine, requires comprehensive evaluation. To address the above issues, we propose a novel domain specific TCMDA (TCM Domain Adaptation) approach, efficient pre-training with domain-specific corpus. Specifically, we first construct a large TCM-specific corpus, TCM-Corpus-1B, by identifying domain keywords and retreving from general corpus. Then, our TCMDA leverages the LoRA which freezes the pretrained model's weights and uses rank decomposition matrices to efficiently train specific dense layers for pre-training and fine-tuning, efficiently aligning the model with TCM-related tasks, namely TCM-GPT-7B. We further conducted extensive experiments on two TCM tasks, including TCM examination and TCM diagnosis. TCM-GPT-7B archived the best performance across both datasets, outperforming other models by relative increments of 17% and 12% in accuracy, respectively. To the best of our knowledge, our study represents the pioneering validation of domain adaptation of a large language model with 7 billion parameters in TCM domain. We will release both TCMCorpus-1B and TCM-GPT-7B model once accepted to facilitate interdisciplinary development in TCM and NLP, serving as the foundation for further study.

Title: AFPQ: Asymmetric Floating Point Quantization for LLMs. (arXiv:2311.01792v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.01792
Code URL: https://github.com/zhangsichengsjtu/afpq
Copy Paste: [[2311.01792]] AFPQ: Asymmetric Floating Point Quantization for LLMs(http://arxiv.org/abs/2311.01792)
Summary:
Large language models (LLMs) show great performance in various tasks, but face deployment challenges from limited memory capacity and bandwidth. Low-bit weight quantization can save memory and accelerate inference. Although floating-point (FP) formats show good performance in LLM quantization, they tend to perform poorly with small group sizes or sub-4 bits. We find the reason is that the absence of asymmetry in previous FP quantization makes it unsuitable for handling asymmetric value distribution of LLM weight tensors. In this work, we propose asymmetric FP quantization (AFPQ), which sets separate scales for positive and negative values. Our method leads to large accuracy improvements and can be easily plugged into other quantization methods, including GPTQ and AWQ, for better performance. Besides, no additional storage is needed compared with asymmetric integer (INT) quantization. The code is available at https://github.com/zhangsichengsjtu/AFPQ.

Title: Towards Concept-Aware Large Language Models. (arXiv:2311.01866v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.01866
Code URL: null
Copy Paste: [[2311.01866]] Towards Concept-Aware Large Language Models(http://arxiv.org/abs/2311.01866)
Summary:
Concepts play a pivotal role in various human cognitive functions, including learning, reasoning and communication. However, there is very little work on endowing machines with the ability to form and reason with concepts. In particular, state-of-the-art large language models (LLMs) work at the level of tokens, not concepts.

In this work, we analyze how well contemporary LLMs capture human concepts and their structure. We then discuss ways to develop concept-aware LLMs, taking place at different stages of the pipeline. We sketch a method for pretraining LLMs using concepts, and also explore the simpler approach that uses the output of existing LLMs. Despite its simplicity, our proof-of-concept is shown to better match human intuition, as well as improve the robustness of predictions. These preliminary results underscore the promise of concept-aware LLMs.

Title: Large Language Models Illuminate a Progressive Pathway to Artificial Healthcare Assistant: A Review. (arXiv:2311.01918v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.01918
Code URL: https://github.com/mingze-yuan/awesome-llm-healthcare
Copy Paste: [[2311.01918]] Large Language Models Illuminate a Progressive Pathway to Artificial Healthcare Assistant: A Review(http://arxiv.org/abs/2311.01918)
Summary:
With the rapid development of artificial intelligence, large language models (LLMs) have shown promising capabilities in mimicking human-level language comprehension and reasoning. This has sparked significant interest in applying LLMs to enhance various aspects of healthcare, ranging from medical education to clinical decision support. However, medicine involves multifaceted data modalities and nuanced reasoning skills, presenting challenges for integrating LLMs. This paper provides a comprehensive review on the applications and implications of LLMs in medicine. It begins by examining the fundamental applications of general-purpose and specialized LLMs, demonstrating their utilities in knowledge retrieval, research support, clinical workflow automation, and diagnostic assistance. Recognizing the inherent multimodality of medicine, the review then focuses on multimodal LLMs, investigating their ability to process diverse data types like medical imaging and EHRs to augment diagnostic accuracy. To address LLMs' limitations regarding personalization and complex clinical reasoning, the paper explores the emerging development of LLM-powered autonomous agents for healthcare. Furthermore, it summarizes the evaluation methodologies for assessing LLMs' reliability and safety in medical contexts. Overall, this review offers an extensive analysis on the transformative potential of LLMs in modern medicine. It also highlights the pivotal need for continuous optimizations and ethical oversight before these models can be effectively integrated into clinical practice. Visit https://github.com/mingze-yuan/Awesome-LLM-Healthcare for an accompanying GitHub repository containing latest papers.

Title: Hint-enhanced In-Context Learning wakes Large Language Models up for knowledge-intensive tasks. (arXiv:2311.01949v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.01949
Code URL: null
Copy Paste: [[2311.01949]] Hint-enhanced In-Context Learning wakes Large Language Models up for knowledge-intensive tasks(http://arxiv.org/abs/2311.01949)
Summary:
In-context learning (ICL) ability has emerged with the increasing scale of large language models (LLMs), enabling them to learn input-label mappings from demonstrations and perform well on downstream tasks. However, under the standard ICL setting, LLMs may sometimes neglect query-related information in demonstrations, leading to incorrect predictions. To address this limitation, we propose a new paradigm called Hint-enhanced In-Context Learning (HICL) to explore the power of ICL in open-domain question answering, an important form in knowledge-intensive tasks. HICL leverages LLMs' reasoning ability to extract query-related knowledge from demonstrations, then concatenates the knowledge to prompt LLMs in a more explicit way. Furthermore, we track the source of this knowledge to identify specific examples, and introduce a Hint-related Example Retriever (HER) to select informative examples for enhanced demonstrations. We evaluate HICL with HER on 3 open-domain QA benchmarks, and observe average performance gains of 2.89 EM score and 2.52 F1 score on gpt-3.5-turbo, 7.62 EM score and 7.27 F1 score on LLaMA-2-Chat-7B compared with standard setting.

Title: Conditions on Preference Relations that Guarantee the Existence of Optimal Policies. (arXiv:2311.01990v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2311.01990
Code URL: null
Copy Paste: [[2311.01990]] Conditions on Preference Relations that Guarantee the Existence of Optimal Policies(http://arxiv.org/abs/2311.01990)
Summary:
Learning from Preferential Feedback (LfPF) plays an essential role in training Large Language Models, as well as certain types of interactive learning agents. However, a substantial gap exists between the theory and application of LfPF algorithms. Current results guaranteeing the existence of optimal policies in LfPF problems assume that both the preferences and transition dynamics are determined by a Markov Decision Process. We introduce the Direct Preference Process, a new framework for analyzing LfPF problems in partially-observable, non-Markovian environments. Within this framework, we establish conditions that guarantee the existence of optimal policies by considering the ordinal structure of the preferences. Using the von Neumann-Morgenstern Expected Utility Theorem, we show that the Direct Preference Process generalizes the standard reinforcement learning problem. Our findings narrow the gap between the empirical success and theoretical understanding of LfPF algorithms and provide future practitioners with the tools necessary for a more principled design of LfPF agents.

segmentation

Title: Patch-Based Deep Unsupervised Image Segmentation using Graph Cuts. (arXiv:2311.01475v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.01475
Code URL: null
Copy Paste: [[2311.01475]] Patch-Based Deep Unsupervised Image Segmentation using Graph Cuts(http://arxiv.org/abs/2311.01475)
Summary:
Unsupervised image segmentation aims at grouping different semantic patterns in an image without the use of human annotation. Similarly, image clustering searches for groupings of images based on their semantic content without supervision. Classically, both problems have captivated researchers as they drew from sound mathematical concepts to produce concrete applications. With the emergence of deep learning, the scientific community turned its attention to complex neural network-based solvers that achieved impressive results in those domains but rarely leveraged the advances made by classical methods. In this work, we propose a patch-based unsupervised image segmentation strategy that bridges advances in unsupervised feature extraction from deep clustering methods with the algorithmic help of classical graph-based methods. We show that a simple convolutional neural network, trained to classify image patches and iteratively regularized using graph cuts, naturally leads to a state-of-the-art fully-convolutional unsupervised pixel-level segmenter. Furthermore, we demonstrate that this is the ideal setting for leveraging the patch-level pairwise features generated by vision transformer models. Our results on real image data demonstrate the effectiveness of our proposed methodology.

Title: 4D-Former: Multimodal 4D Panoptic Segmentation. (arXiv:2311.01520v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.01520
Code URL: null
Copy Paste: [[2311.01520]] 4D-Former: Multimodal 4D Panoptic Segmentation(http://arxiv.org/abs/2311.01520)
Summary:
4D panoptic segmentation is a challenging but practically useful task that requires every point in a LiDAR point-cloud sequence to be assigned a semantic class label, and individual objects to be segmented and tracked over time. Existing approaches utilize only LiDAR inputs which convey limited information in regions with point sparsity. This problem can, however, be mitigated by utilizing RGB camera images which offer appearance-based information that can reinforce the geometry-based LiDAR features. Motivated by this, we propose 4D-Former: a novel method for 4D panoptic segmentation which leverages both LiDAR and image modalities, and predicts semantic masks as well as temporally consistent object masks for the input point-cloud sequence. We encode semantic classes and objects using a set of concise queries which absorb feature information from both data modalities. Additionally, we propose a learned mechanism to associate object tracks over time which reasons over both appearance and spatial location. We apply 4D-Former to the nuScenes and SemanticKITTI datasets where it achieves state-of-the-art results.

Title: MemorySeg: Online LiDAR Semantic Segmentation with a Latent Memory. (arXiv:2311.01556v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.01556
Code URL: null
Copy Paste: [[2311.01556]] MemorySeg: Online LiDAR Semantic Segmentation with a Latent Memory(http://arxiv.org/abs/2311.01556)
Summary:
Semantic segmentation of LiDAR point clouds has been widely studied in recent years, with most existing methods focusing on tackling this task using a single scan of the environment. However, leveraging the temporal stream of observations can provide very rich contextual information on regions of the scene with poor visibility (e.g., occlusions) or sparse observations (e.g., at long range), and can help reduce redundant computation frame after frame. In this paper, we tackle the challenge of exploiting the information from the past frames to improve the predictions of the current frame in an online fashion. To address this challenge, we propose a novel framework for semantic segmentation of a temporal sequence of LiDAR point clouds that utilizes a memory network to store, update and retrieve past information. Our framework also includes a regularizer that penalizes prediction variations in the neighborhood of the point cloud. Prior works have attempted to incorporate memory in range view representations for semantic segmentation, but these methods fail to handle occlusions and the range view representation of the scene changes drastically as agents nearby move. Our proposed framework overcomes these limitations by building a sparse 3D latent representation of the surroundings. We evaluate our method on SemanticKITTI, nuScenes, and PandaSet. Our experiments demonstrate the effectiveness of the proposed framework compared to the state-of-the-art.

Title: Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation. (arXiv:2311.01989v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.01989
Code URL: null
Copy Paste: [[2311.01989]] Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation(http://arxiv.org/abs/2311.01989)
Summary:
Recently, large-scale pre-trained models such as Segment-Anything Model (SAM) and Contrastive Language-Image Pre-training (CLIP) have demonstrated remarkable success and revolutionized the field of computer vision. These foundation vision models effectively capture knowledge from a large-scale broad data with their vast model parameters, enabling them to perform zero-shot segmentation on previously unseen data without additional training. While they showcase competence in 2D tasks, their potential for enhancing 3D scene understanding remains relatively unexplored. To this end, we present a novel framework that adapts various foundational models for the 3D point cloud segmentation task. Our approach involves making initial predictions of 2D semantic masks using different large vision models. We then project these mask predictions from various frames of RGB-D video sequences into 3D space. To generate robust 3D semantic pseudo labels, we introduce a semantic label fusion strategy that effectively combines all the results via voting. We examine diverse scenarios, like zero-shot learning and limited guidance from sparse 2D point labels, to assess the pros and cons of different vision foundation models. Our approach is experimented on ScanNet dataset for 3D indoor scenes, and the results demonstrate the effectiveness of adopting general 2D foundation models on solving 3D point cloud segmentation tasks.