secure

Title: Detecting inner-LAN anomalies using hierarchical forecasting. (arXiv:2304.13941v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2304.13941
Code URL: null
Copy Paste: [[2304.13941] Detecting inner-LAN anomalies using hierarchical forecasting](http://arxiv.org/abs/2304.13941) #secure
Summary:
Increasing activity and the number of devices online are leading to increasing and more diverse cyber attacks. This continuously evolving attack activity makes signature-based detection methods ineffective. Once malware has infiltrated into a LAN, bypassing an external gateway or entering via an unsecured mobile device, it can potentially infect all nodes in the LAN as well as carry out nefarious activities such as stealing valuable data, leading to financial damage and loss of reputation. Such infiltration could be viewed as an insider attack, increasing the need for LAN monitoring and security. In this paper we aim to detect such inner-LAN activity by studying the variations in Address Resolution Protocol (ARP) calls within the LAN. We find anomalous nodes by modelling inner-LAN traffic using hierarchical forecasting methods. We substantially reduce the false positives ever present in anomaly detection, by using an extreme value theory based method. We use a dataset from a real inner-LAN monitoring project, containing over 10M ARP calls from 362 nodes. Furthermore, the small number of false positives generated using our methods, is a potential solution to the "alert fatigue" commonly reported by security experts.

Title: Holo-Block Chain: A Hybrid Approach for Secured IoT Healthcare Ecosystem. (arXiv:2304.14175v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2304.14175
Code URL: null
Copy Paste: [[2304.14175] Holo-Block Chain: A Hybrid Approach for Secured IoT Healthcare Ecosystem](http://arxiv.org/abs/2304.14175) #secure
Summary:
The Internet-of-Things (IoT) is an imminent and corporal technology that enables the connectivity of smart physical devices with virtual objects contriving in distinct platforms with the help of the internet. The IoT is under massive experimentation to operate in a distributed manner, making it favourable to be utilized in the healthcare ecosystem. However, un- der the IoT healthcare ecosystem (IoT-HS), the nodes of the IoT networks are unveiled to an aberrant level of security threats. Regulating an adequate volume of sensitive and personal data, IoT-HS undergoes various security challenges for which a distributed mechanism to address such concerns plays a vital role. Although Blockchain, having a distributed ledger, is integral to solving security concerns in IoT-HSs, it undergoes major problems, including massive storage and computational requirements. Also, Holochain, which has low computational and memory requirements, lacks authentication distribution availability. Therefore, this paper proposes a hybrid Holochain and Blockchain-based privacy perseverance and security framework for IoT-HSs that combines the benefits Holochain and Blockchain provide, overcoming the computational, memory, and authentication challenges. This framework is more suited for IoT scenarios where resource needs to be optimally utilized. Comprehensive security and performance analysis is conducted to demonstrate the suitability and effectiveness of the proposed hybrid security approach for IoT-HSs in contrast to the Blockchain-only or Holochain-only based approaches.

security

Title: Composable Security of Distributed Symmetric Key Exchange Protocol. (arXiv:2304.13789v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2304.13789
Code URL: null
Copy Paste: [[2304.13789] Composable Security of Distributed Symmetric Key Exchange Protocol](http://arxiv.org/abs/2304.13789) #security
Summary:
The Distributed Symmetric Key Exchange (DSKE) protocol provides secure secret exchange (e.g., for key exchange) between two honest parties that need not have had prior contact, and use intermediaries with whom they each securely share confidential data. We show the composable security of the DSKE protocol in the constructive cryptography framework of Maurer. Specifically, we prove the security (correctness and confidentiality) and robustness of this protocol against any computationally unbounded adversary, who additionally may have fully compromised a bounded number of the intermediaries and can eavesdrop on all communication. As DSKE is highly scalable in a network setting with no distance limit, it is expected to be a cost-effective quantum-safe cryptographic solution to safeguarding the network security against the threat of quantum computers.

Title: CNN based IoT Device Identification. (arXiv:2304.13894v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2304.13894
Code URL: https://github.com/kahramankostas/cnn-based-iot-device-identification
Copy Paste: [[2304.13894] CNN based IoT Device Identification](http://arxiv.org/abs/2304.13894) #security
Summary:
While the use of the Internet of Things is becoming more and more popular, many security vulnerabilities are emerging with the large number of devices being introduced to the market. In this environment, IoT device identification methods provide a preventive security measure as an important factor in identifying these devices and detecting the vulnerabilities they suffer from. In this study, we present a method that identifies devices in the Aalto dataset using the convolutional neural network (CNN).

Title: LSTM based IoT Device Identification. (arXiv:2304.13905v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2304.13905
Code URL: https://github.com/kahramankostas/lstm-based-iot-device-identification
Copy Paste: [[2304.13905] LSTM based IoT Device Identification](http://arxiv.org/abs/2304.13905) #security
Summary:
While the use of the Internet of Things is becoming more and more popular, many security vulnerabilities are emerging with the large number of devices being introduced to the market. In this environment, IoT device identification methods provide a preventive security measure as an important factor in identifying these devices and detecting the vulnerabilities they suffer from. In this study, we present a method that identifies devices in the Aalto dataset using Long short-term memory (LSTM)

Title: You Can't Always Check What You Wanted: Selective Checking and Trusted Execution to Prevent False Actuations in Cyber-Physical Systems. (arXiv:2304.13956v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2304.13956
Code URL: null
Copy Paste: [[2304.13956] You Can't Always Check What You Wanted: Selective Checking and Trusted Execution to Prevent False Actuations in Cyber-Physical Systems](http://arxiv.org/abs/2304.13956) #security
Summary:
Cyber-physical systems (CPS) are vulnerable to attacks targeting outgoing actuation commands that modify their physical behaviors. The limited resources in such systems, coupled with their stringent timing constraints, often prevents the checking of every outgoing command. We present a "selective checking" mechanism that uses game-theoretic modeling to identify the right subset of commands to be checked in order to deter an adversary. This mechanism is coupled with a "delay-aware" trusted execution environment (TEE) to ensure that only verified actuation commands are ever sent to the physical system, thus maintaining their safety and integrity. The selective checking and trusted execution (SCATE) framework is implemented on an off-the-shelf ARM platform running standard embedded Linux. We demonstrate the effectiveness of SCATE using four realistic cyber-physical systems (a ground rover, a flight controller, a robotic arm and an automated syringe pump) and study design trade-offs. Not only does SCATE provide a high level of security and high performance, it also suffers from significantly lower overheads (30.48%-47.32% less) in the process. In fact, SCATE can work with more systems without negatively affecting the safety of the system. Considering that most CPS do not have any such checking mechanisms, and SCATE is guaranteed to meet all the timing requirements (i.e., ensure the safety/integrity of the system), our methods can significantly improve the security (and, hence, safety) of the system.

Title: Scalable, Distributed AI Frameworks: Leveraging Cloud Computing for Enhanced Deep Learning Performance and Efficiency. (arXiv:2304.13738v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.13738
Code URL: null
Copy Paste: [[2304.13738] Scalable, Distributed AI Frameworks: Leveraging Cloud Computing for Enhanced Deep Learning Performance and Efficiency](http://arxiv.org/abs/2304.13738) #security
Summary:
In recent years, the integration of artificial intelligence (AI) and cloud computing has emerged as a promising avenue for addressing the growing computational demands of AI applications. This paper presents a comprehensive study of scalable, distributed AI frameworks leveraging cloud computing for enhanced deep learning performance and efficiency. We first provide an overview of popular AI frameworks and cloud services, highlighting their respective strengths and weaknesses. Next, we delve into the critical aspects of data storage and management in cloud-based AI systems, discussing data preprocessing, feature engineering, privacy, and security. We then explore parallel and distributed training techniques for AI models, focusing on model partitioning, communication strategies, and cloud-based training architectures.

In subsequent chapters, we discuss optimization strategies for AI workloads in the cloud, covering load balancing, resource allocation, auto-scaling, and performance benchmarking. We also examine AI model deployment and serving in the cloud, outlining containerization, serverless deployment options, and monitoring best practices. To ensure the cost-effectiveness of cloud-based AI solutions, we present a thorough analysis of costs, optimization strategies, and case studies showcasing successful deployments. Finally, we summarize the key findings of this study, discuss the challenges and limitations of cloud-based AI, and identify emerging trends and future research opportunities in the field.

privacy

Title: Do SSL Models Have D\'ej`a Vu? A Case of Unintended Memorization in Self-supervised Learning. (arXiv:2304.13850v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.13850
Code URL: https://github.com/facebookresearch/dejavu
Copy Paste: [[2304.13850] Do SSL Models Have D\'ej\a Vu? A Case of Unintended Memorization in Self-supervised Learning](http://arxiv.org/abs/2304.13850) #privacy`
Summary:
Self-supervised learning (SSL) algorithms can produce useful image representations by learning to associate different parts of natural images with one another. However, when taken to the extreme, SSL models can unintendedly memorize specific parts in individual training samples rather than learning semantically meaningful associations. In this work, we perform a systematic study of the unintended memorization of image-specific information in SSL models -- which we refer to as d\'ej`a vu memorization. Concretely, we show that given the trained model and a crop of a training image containing only the background (e.g., water, sky, grass), it is possible to infer the foreground object with high accuracy or even visually reconstruct it. Furthermore, we show that d\'ej`a vu memorization is common to different SSL algorithms, is exacerbated by certain design choices, and cannot be detected by conventional techniques for evaluating representation quality. Our study of d\'ej`a vu memorization reveals previously unknown privacy risks in SSL models, as well as suggests potential practical mitigation strategies. Code is available at https://github.com/facebookresearch/DejaVu.

Title: Improving the Utility of Differentially Private Clustering through Dynamical Processing. (arXiv:2304.13886v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.13886
Code URL: null
Copy Paste: [[2304.13886] Improving the Utility of Differentially Private Clustering through Dynamical Processing](http://arxiv.org/abs/2304.13886) #privacy
Summary:
This study aims to alleviate the trade-off between utility and privacy in the task of differentially private clustering. Existing works focus on simple clustering methods, which show poor clustering performance for non-convex clusters. By utilizing Morse theory, we hierarchically connect the Gaussian sub-clusters to fit complex cluster distributions. Because differentially private sub-clusters are obtained through the existing methods, the proposed method causes little or no additional privacy loss. We provide a theoretical background that implies that the proposed method is inductive and can achieve any desired number of clusters. Experiments on various datasets show that our framework achieves better clustering performance at the same privacy level, compared to the existing methods.

protect

defense

attack

Title: Detection of Adversarial Physical Attacks in Time-Series Image Data. (arXiv:2304.13919v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.13919
Code URL: null
Copy Paste: [[2304.13919] Detection of Adversarial Physical Attacks in Time-Series Image Data](http://arxiv.org/abs/2304.13919) #attack
Summary:
Deep neural networks (DNN) have become a common sensing modality in autonomous systems as they allow for semantically perceiving the ambient environment given input images. Nevertheless, DNN models have proven to be vulnerable to adversarial digital and physical attacks. To mitigate this issue, several detection frameworks have been proposed to detect whether a single input image has been manipulated by adversarial digital noise or not. In our prior work, we proposed a real-time detector, called VisionGuard (VG), for adversarial physical attacks against single input images to DNN models. Building upon that work, we propose VisionGuard (VG), which couples VG with majority-vote methods, to detect adversarial physical attacks in time-series image data, e.g., videos. This is motivated by autonomous systems applications where images are collected over time using onboard sensors for decision-making purposes. We emphasize that majority-vote mechanisms are quite common in autonomous system applications (among many other applications), as e.g., in autonomous driving stacks for object detection. In this paper, we investigate, both theoretically and experimentally, how this widely used mechanism can be leveraged to enhance the performance of adversarial detectors. We have evaluated VG on videos of both clean and physically attacked traffic signs generated by a state-of-the-art robust physical attack. We provide extensive comparative experiments against detectors that have been designed originally for out-of-distribution data and digitally attacked images.

Title: Bitcoin Double-Spending Attack Detection using Graph Neural Network. (arXiv:2304.13935v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2304.13935
Code URL: null
Copy Paste: [[2304.13935] Bitcoin Double-Spending Attack Detection using Graph Neural Network](http://arxiv.org/abs/2304.13935) #attack
Summary:
Bitcoin transactions include unspent transaction outputs (UTXOs) as their inputs and generate one or more newly owned UTXOs at specified addresses. Each UTXO can only be used as an input in a transaction once, and using it in two or more different transactions is referred to as a double-spending attack. Ultimately, due to the characteristics of the Bitcoin protocol, double-spending is impossible. However, problems may arise when a transaction is considered final even though its finality has not been fully guaranteed in order to achieve fast payment. In this paper, we propose an approach to detecting Bitcoin double-spending attacks using a graph neural network (GNN). This model predicts whether all nodes in the network contain a given payment transaction in their own memory pool (mempool) using information only obtained from some observer nodes in the network. Our experiment shows that the proposed model can detect double-spending with an accuracy of at least 0.95 when more than about 1% of the entire nodes in the network are observer nodes.

Title: Attacks on Robust Distributed Learning Schemes via Sensitivity Curve Maximization. (arXiv:2304.14024v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.14024
Code URL: null
Copy Paste: [[2304.14024] Attacks on Robust Distributed Learning Schemes via Sensitivity Curve Maximization](http://arxiv.org/abs/2304.14024) #attack
Summary:
Distributed learning paradigms, such as federated or decentralized learning, allow a collection of agents to solve global learning and optimization problems through limited local interactions. Most such strategies rely on a mixture of local adaptation and aggregation steps, either among peers or at a central fusion center. Classically, aggregation in distributed learning is based on averaging, which is statistically efficient, but susceptible to attacks by even a small number of malicious agents. This observation has motivated a number of recent works, which develop robust aggregation schemes by employing robust variations of the mean. We present a new attack based on sensitivity curve maximization (SCM), and demonstrate that it is able to disrupt existing robust aggregation schemes by injecting small, but effective perturbations.

Title: Boosting Big Brother: Attacking Search Engines with Encodings. (arXiv:2304.14031v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2304.14031
Code URL: https://github.com/nickboucher/search-engine-attacks
Copy Paste: [[2304.14031] Boosting Big Brother: Attacking Search Engines with Encodings](http://arxiv.org/abs/2304.14031) #attack
Summary:
Search engines are vulnerable to attacks against indexing and searching via text encoding manipulation. By imperceptibly perturbing text using uncommon encoded representations, adversaries can control results across search engines for specific search queries. We demonstrate that this attack is successful against two major commercial search engines - Google and Bing - and one open source search engine - Elasticsearch. We further demonstrate that this attack is successful against LLM chat search including Bing's GPT-4 chatbot and Google's Bard chatbot. We also present a variant of the attack targeting text summarization and plagiarism detection models, two ML tasks closely tied to search. We provide a set of defenses against these techniques and warn that adversaries can leverage these attacks to launch disinformation campaigns against unsuspecting users, motivating the need for search engine maintainers to patch deployed systems.

robust

Title: Automatic Localization and Detection Applicable to Robust Image Watermarking Resisting against Camera Shooting. (arXiv:2304.13953v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.13953
Code URL: null
Copy Paste: [[2304.13953] Automatic Localization and Detection Applicable to Robust Image Watermarking Resisting against Camera Shooting](http://arxiv.org/abs/2304.13953) #robust
Summary:
Robust image watermarking that can resist camera shooting has become an active research topic in recent years due to the increasing demand for preventing sensitive information displayed on computer screens from being captured. However, many mainstream schemes require human assistance during the watermark detection process and cannot adapt to scenarios that require processing a large number of images. Although deep learning-based schemes enable end-to-end watermark embedding and detection, their limited generalization ability makes them vulnerable to failure in complex scenarios. In this paper, we propose a carefully crafted watermarking system that can resist camera shooting. The proposed scheme deals with two important problems: automatic watermark localization (AWL) and automatic watermark detection (AWD). AWL automatically identifies the region of interest (RoI), which contains watermark information, in the camera-shooting image by analyzing the local statistical characteristics. Meanwhile, AWD extracts the hidden watermark from the identified RoI after applying perspective correction. Compared with previous works, the proposed scheme is fully automatic, making it ideal for application scenarios. Furthermore, the proposed scheme is not limited to any specific watermark embedding strategy, allowing for improvements in the watermark embedding and extraction procedure. Extensive experimental results and analysis show that the embedded watermark can be automatically and reliably extracted from the camera-shooting image in different scenarios, demonstrating the superiority and applicability of the proposed approach.

Title: Moderately Distributional Exploration for Domain Generalization. (arXiv:2304.13976v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.13976
Code URL: https://github.com/rxsw/mode
Copy Paste: [[2304.13976] Moderately Distributional Exploration for Domain Generalization](http://arxiv.org/abs/2304.13976) #robust
Summary:
Domain generalization (DG) aims to tackle the distribution shift between training domains and unknown target domains. Generating new domains is one of the most effective approaches, yet its performance gain depends on the distribution discrepancy between the generated and target domains. Distributionally robust optimization is promising to tackle distribution discrepancy by exploring domains in an uncertainty set. However, the uncertainty set may be overwhelmingly large, leading to low-confidence prediction in DG. It is because a large uncertainty set could introduce domains containing semantically different factors from training domains. To address this issue, we propose to perform a $\textbf{mo}$derately $\textbf{d}$istributional $\textbf{e}$xploration (MODE) for domain generalization. Specifically, MODE performs distribution exploration in an uncertainty $\textit{subset}$ that shares the same semantic factors with the training domains. We show that MODE can endow models with provable generalization performance on unknown target domains. The experimental results show that MODE achieves competitive performance compared to state-of-the-art baselines.

Title: MCLFIQ: Mobile Contactless Fingerprint Image Quality. (arXiv:2304.14123v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.14123
Code URL: null
Copy Paste: [[2304.14123] MCLFIQ: Mobile Contactless Fingerprint Image Quality](http://arxiv.org/abs/2304.14123) #robust
Summary:
We propose MCLFIQ: Mobile Contactless Fingerprint Image Quality, the first quality assessment algorithm for mobile contactless fingerprint samples. To this end, we retrained the NIST Fingerprint Image Quality (NFIQ) 2 method, which was originally designed for contact-based fingerprints, with a synthetic contactless fingerprint database. We evaluate the predictive performance of the resulting MCLFIQ model in terms of Error-vs.-Discard Characteristic (EDC) curves on three real-world contactless fingerprint databases using two recognition algorithms. In experiments, the MCLFIQ method is compared against the original NFIQ 2 method and a sharpness-based quality assessment algorithm developed for contactless fingerprint images. Obtained results show that the re-training of NFIQ 2 on synthetic data is a viable alternative to training on real databases. Moreover, the evaluation shows that our MCLFIQ method works more accurate and robust compared to NFIQ 2 and the sharpness-based quality assessment. We suggest considering the proposed MCLFIQ method as a candidate for a new standard algorithm for contactless fingerprint quality assessment.

Title: Figments and Misalignments: A Framework for Fine-grained Crossmodal Misinformation Detection. (arXiv:2304.14133v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.14133
Code URL: https://github.com/stevejpapad/figments-and-misalignments
Copy Paste: [[2304.14133] Figments and Misalignments: A Framework for Fine-grained Crossmodal Misinformation Detection](http://arxiv.org/abs/2304.14133) #robust
Summary:
Multimedia content has become ubiquitous on social media platforms, leading to the rise of multimodal misinformation and the urgent need for effective strategies to detect and prevent its spread. This study focuses on CrossModal Misinformation (CMM) where image-caption pairs work together to spread falsehoods. We contrast CMM with Asymmetric Multimodal Misinformation (AMM), where one dominant modality propagates falsehoods while other modalities have little or no influence. We show that AMM adds noise to the training and evaluation process while exacerbating the unimodal bias, where text-only or image-only detectors can seemingly outperform their multimodal counterparts on an inherently multimodal task. To address this issue, we collect and curate FIGMENTS, a robust evaluation benchmark for CMM, which consists of real world cases of misinformation, excludes AMM and utilizes modality balancing to successfully alleviate unimodal bias. FIGMENTS also provides a first step towards fine-grained CMM detection by including three classes: truthful, out-of-context, and miscaptioned image-caption pairs. Furthermore, we introduce a method for generating realistic synthetic training data that maintains crossmodal relations between legitimate images and false human-written captions that we term Crossmodal HArd Synthetic MisAlignment (CHASMA). We conduct extensive comparative study using a Transformer-based architecture. Our results show that incorporating CHASMA in conjunction with other generated datasets consistently improved the overall performance on FIGMENTS in both binary (+6.26%) and multiclass settings (+15.8%).We release our code at: https://github.com/stevejpapad/figments-and-misalignments

Title: A Probabilistic Attention Model with Occlusion-aware Texture Regression for 3D Hand Reconstruction from a Single RGB Image. (arXiv:2304.14299v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.14299
Code URL: https://github.com/zhehengjianglancaster/amvur
Copy Paste: [[2304.14299] A Probabilistic Attention Model with Occlusion-aware Texture Regression for 3D Hand Reconstruction from a Single RGB Image](http://arxiv.org/abs/2304.14299) #robust
Summary:
Recently, deep learning based approaches have shown promising results in 3D hand reconstruction from a single RGB image. These approaches can be roughly divided into model-based approaches, which are heavily dependent on the model's parameter space, and model-free approaches, which require large numbers of 3D ground truths to reduce depth ambiguity and struggle in weakly-supervised scenarios. To overcome these issues, we propose a novel probabilistic model to achieve the robustness of model-based approaches and reduced dependence on the model's parameter space of model-free approaches. The proposed probabilistic model incorporates a model-based network as a prior-net to estimate the prior probability distribution of joints and vertices. An Attention-based Mesh Vertices Uncertainty Regression (AMVUR) model is proposed to capture dependencies among vertices and the correlation between joints and mesh vertices to improve their feature representation. We further propose a learning based occlusion-aware Hand Texture Regression model to achieve high-fidelity texture reconstruction. We demonstrate the flexibility of the proposed probabilistic model to be trained in both supervised and weakly-supervised scenarios. The experimental results demonstrate our probabilistic model's state-of-the-art accuracy in 3D hand and texture reconstruction from a single image in both training schemes, including in the presence of severe occlusions.

Title: Co-SLAM: Joint Coordinate and Sparse Parametric Encodings for Neural Real-Time SLAM. (arXiv:2304.14377v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.14377
Code URL: null
Copy Paste: [[2304.14377] Co-SLAM: Joint Coordinate and Sparse Parametric Encodings for Neural Real-Time SLAM](http://arxiv.org/abs/2304.14377) #robust
Summary:
We present Co-SLAM, a neural RGB-D SLAM system based on a hybrid representation, that performs robust camera tracking and high-fidelity surface reconstruction in real time. Co-SLAM represents the scene as a multi-resolution hash-grid to exploit its high convergence speed and ability to represent high-frequency local features. In addition, Co-SLAM incorporates one-blob encoding, to encourage surface coherence and completion in unobserved areas. This joint parametric-coordinate encoding enables real-time and robust performance by bringing the best of both worlds: fast convergence and surface hole filling. Moreover, our ray sampling strategy allows Co-SLAM to perform global bundle adjustment over all keyframes instead of requiring keyframe selection to maintain a small number of active keyframes as competing neural SLAM approaches do. Experimental results show that Co-SLAM runs at 10-17Hz and achieves state-of-the-art scene reconstruction results, and competitive tracking performance in various datasets and benchmarks (ScanNet, TUM, Replica, Synthetic RGBD). Project page: https://hengyiwang.github.io/projects/CoSLAM

Title: $\pi$-Tuning: Transferring Multimodal Foundation Models with Optimal Multi-task Interpolation. (arXiv:2304.14381v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.14381
Code URL: null
Copy Paste: [[2304.14381] $\pi$-Tuning: Transferring Multimodal Foundation Models with Optimal Multi-task Interpolation](http://arxiv.org/abs/2304.14381) #robust
Summary:
Foundation models have achieved great advances in multi-task learning with a unified interface of unimodal and multimodal tasks. However, the potential of such multi-task learners has not been exploited during transfer learning. In this work, we present a universal parameter-efficient transfer learning method, termed Predict-Interpolate Tuning ($\pi$-Tuning), for vision, language, and vision-language tasks. It aggregates the parameters of lightweight task-specific experts learned from similar tasks to aid the target downstream task. The task similarities are predicted in a unified modality-independent space, yielding a scalable graph to demonstrate task relationships. $\pi$-Tuning has several appealing benefits. First, it flexibly explores both intra- and inter-modal transferability between similar tasks to improve the accuracy and robustness of transfer learning, especially in data-scarce scenarios. Second, it offers a systematical solution for transfer learning with multi-task prediction-and-then-interpolation, compatible with diverse types of parameter-efficient experts, such as prompt and adapter. Third, an extensive study of task-level mutual benefits on 14 unimodal and 6 multimodal datasets shows that $\pi$-Tuning surpasses fine-tuning and other parameter-efficient transfer learning methods both in full-shot and low-shot regimes. The task graph also enables an in-depth interpretable analysis of task transferability across modalities.

Title: Translate to Disambiguate: Zero-shot Multilingual Word Sense Disambiguation with Pretrained Language Models. (arXiv:2304.13803v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2304.13803
Code URL: null
Copy Paste: [[2304.13803] Translate to Disambiguate: Zero-shot Multilingual Word Sense Disambiguation with Pretrained Language Models](http://arxiv.org/abs/2304.13803) #robust
Summary:
Pretrained Language Models (PLMs) learn rich cross-lingual knowledge and can be finetuned to perform well on diverse tasks such as translation and multilingual word sense disambiguation (WSD). However, they often struggle at disambiguating word sense in a zero-shot setting. To better understand this contrast, we present a new study investigating how well PLMs capture cross-lingual word sense with Contextual Word-Level Translation (C-WLT), an extension of word-level translation that prompts the model to translate a given word in context. We find that as the model size increases, PLMs encode more cross-lingual word sense knowledge and better use context to improve WLT performance. Building on C-WLT, we introduce a zero-shot approach for WSD, tested on 18 languages from the XL-WSD dataset. Our method outperforms fully supervised baselines on recall for many evaluation languages without additional training or finetuning. This study presents a first step towards understanding how to best leverage the cross-lingual knowledge inside PLMs for robust zero-shot reasoning in any language.

Title: Transferring Procedural Knowledge across Commonsense Tasks. (arXiv:2304.13867v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2304.13867
Code URL: null
Copy Paste: [[2304.13867] Transferring Procedural Knowledge across Commonsense Tasks](http://arxiv.org/abs/2304.13867) #robust
Summary:
Stories about everyday situations are an essential part of human communication, motivating the need to develop AI agents that can reliably understand these stories. Despite the long list of supervised methods for story completion and procedural understanding, current AI has no mechanisms to automatically track and explain procedures in unseen stories. To bridge this gap, we study the ability of AI models to transfer procedural knowledge to novel narrative tasks in a transparent manner. We design LEAP: a comprehensive framework that integrates state-of-the-art modeling architectures, training regimes, and augmentation strategies based on both natural and synthetic stories. To address the lack of densely annotated training data, we devise a robust automatic labeler based on few-shot prompting to enhance the augmented data. Our experiments with in- and out-of-domain tasks reveal insights into the interplay of different architectures, training regimes, and augmentation strategies. LEAP's labeler has a clear positive impact on out-of-domain datasets, while the resulting dense annotation provides native explainability.

Title: ChatLog: Recording and Analyzing ChatGPT Across Time. (arXiv:2304.14106v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2304.14106
Code URL: https://github.com/thu-keg/chatlog
Copy Paste: [[2304.14106] ChatLog: Recording and Analyzing ChatGPT Across Time](http://arxiv.org/abs/2304.14106) #robust
Summary:
While there are abundant researches about evaluating ChatGPT on natural language understanding and generation tasks, few studies have investigated how ChatGPT's behavior changes over time. In this paper, we collect a coarse-to-fine temporal dataset called ChatLog, consisting of two parts that update monthly and daily: ChatLog-Monthly is a dataset of 38,730 question-answer pairs collected every month including questions from both the reasoning and classification tasks. ChatLog-Daily, on the other hand, consists of ChatGPT's responses to 1000 identical questions for long-form generation every day. We conduct comprehensive automatic and human evaluation to provide the evidence for the existence of ChatGPT evolving patterns. We further analyze the unchanged characteristics of ChatGPT over time by extracting its knowledge and linguistic features. We find some stable features to improve the robustness of a RoBERTa-based detector on new versions of ChatGPT. We will continuously maintain our project at https://github.com/THU-KEG/ChatLog.

Title: Distance Weighted Supervised Learning for Offline Interaction Data. (arXiv:2304.13774v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.13774
Code URL: null
Copy Paste: [[2304.13774] Distance Weighted Supervised Learning for Offline Interaction Data](http://arxiv.org/abs/2304.13774) #robust
Summary:
Sequential decision making algorithms often struggle to leverage different sources of unstructured offline interaction data. Imitation learning (IL) methods based on supervised learning are robust, but require optimal demonstrations, which are hard to collect. Offline goal-conditioned reinforcement learning (RL) algorithms promise to learn from sub-optimal data, but face optimization challenges especially with high-dimensional data. To bridge the gap between IL and RL, we introduce Distance Weighted Supervised Learning or DWSL, a supervised method for learning goal-conditioned policies from offline data. DWSL models the entire distribution of time-steps between states in offline data with only supervised learning, and uses this distribution to approximate shortest path distances. To extract a policy, we weight actions by their reduction in distance estimates. Theoretically, DWSL converges to an optimal policy constrained to the data distribution, an attractive property for offline learning, without any bootstrapping. Across all datasets we test, DWSL empirically maintains behavior cloning as a lower bound while still exhibiting policy improvement. In high-dimensional image domains, DWSL surpasses the performance of both prior goal-conditioned IL and RL algorithms. Visualizations and code can be found at https://sites.google.com/view/dwsl/home .

Title: Self-discipline on multiple channels. (arXiv:2304.14224v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.14224
Code URL: https://github.com/jiutiannn/smc-self-discipline-on-multiple-channels
Copy Paste: [[2304.14224] Self-discipline on multiple channels](http://arxiv.org/abs/2304.14224) #robust
Summary:
Self-distillation relies on its own information to improve the generalization ability of the model and has a bright future. Existing self-distillation methods either require additional models, model modification, or batch size expansion for training, which increases the difficulty of use, memory consumption, and computational cost. This paper developed Self-discipline on multiple channels(SMC), which combines consistency regularization with self-distillation using the concept of multiple channels. Conceptually, SMC consists of two steps: 1) each channel data is simultaneously passed through the model to obtain its corresponding soft label, and 2) the soft label saved in the previous step is read together with the soft label obtained from the current channel data through the model to calculate the loss function. SMC uses consistent regularization and self-distillation to improve the generalization ability of the model and the robustness of the model to noisy labels. We named the SMC containing only two channels as SMC-2. Comparative experimental results on both datasets show that SMC-2 outperforms Label Smoothing Regularizaion and Self-distillation From The Last Mini-batch on all models, and outperforms the state-of-the-art Sharpness-Aware Minimization method on 83% of the models.Compatibility of SMC-2 and data augmentation experimental results show that using both SMC-2 and data augmentation improves the generalization ability of the model between 0.28% and 1.80% compared to using only data augmentation. Ultimately, the results of the label noise interference experiments show that SMC-2 curbs the tendency that the model's generalization ability decreases in the late training period due to the interference of label noise. The code is available at https://github.com/JiuTiannn/SMC-Self-discipline-on-multiple-channels.

biometric

steal

extraction

Title: Large Scale Genealogical Information Extraction From Handwritten Quebec Parish Records. (arXiv:2304.14044v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.14044
Code URL: null
Copy Paste: [[2304.14044] Large Scale Genealogical Information Extraction From Handwritten Quebec Parish Records](http://arxiv.org/abs/2304.14044) #extraction
Summary:
This paper presents a complete workflow designed for extracting information from Quebec handwritten parish registers. The acts in these documents contain individual and family information highly valuable for genetic, demographic and social studies of the Quebec population. From an image of parish records, our workflow is able to identify the acts and extract personal information. The workflow is divided into successive steps: page classification, text line detection, handwritten text recognition, named entity recognition and act detection and classification. For all these steps, different machine learning models are compared. Once the information is extracted, validation rules designed by experts are then applied to standardize the extracted information and ensure its consistency with the type of act (birth, marriage, and death). This validation step is able to reject records that are considered invalid or merged. The full workflow has been used to process over two million pages of Quebec parish registers from the 19-20th centuries. On a sample comprising 65% of registers, 3.2 million acts were recognized. Verification of the birth and death acts from this sample shows that 74% of them are considered complete and valid. These records will be integrated into the BALSAC database and linked together to recreate family and genealogical relations at large scale.

Title: MasonNLP+ at SemEval-2023 Task 8: Extracting Medical Questions, Experiences and Claims from Social Media using Knowledge-Augmented Pre-trained Language Models. (arXiv:2304.13875v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2304.13875
Code URL: null
Copy Paste: [[2304.13875] MasonNLP+ at SemEval-2023 Task 8: Extracting Medical Questions, Experiences and Claims from Social Media using Knowledge-Augmented Pre-trained Language Models](http://arxiv.org/abs/2304.13875) #extraction
Summary:
In online forums like Reddit, users share their experiences with medical conditions and treatments, including making claims, asking questions, and discussing the effects of treatments on their health. Building systems to understand this information can effectively monitor the spread of misinformation and verify user claims. The Task-8 of the 2023 International Workshop on Semantic Evaluation focused on medical applications, specifically extracting patient experience- and medical condition-related entities from user posts on social media. The Reddit Health Online Talk (RedHot) corpus contains posts from medical condition-related subreddits with annotations characterizing the patient experience and medical conditions. In Subtask-1, patient experience is characterized by personal experience, questions, and claims. In Subtask-2, medical conditions are characterized by population, intervention, and outcome. For the automatic extraction of patient experiences and medical condition information, as a part of the challenge, we proposed language-model-based extraction systems that ranked $3^{rd}$ on both subtasks' leaderboards. In this work, we describe our approach and, in addition, explore the automatic extraction of this information using domain-specific language models and the inclusion of external knowledge.

membership infer

federate

Title: Maximizing Model Generalization for Manufacturing with Self-Supervised Learning and Federated Learning. (arXiv:2304.14398v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.14398
Code URL: null
Copy Paste: [[2304.14398] Maximizing Model Generalization for Manufacturing with Self-Supervised Learning and Federated Learning](http://arxiv.org/abs/2304.14398) #federate
Summary:
Deep Learning (DL) can diagnose faults and assess machine health from raw condition monitoring data without manually designed statistical features. However, practical manufacturing applications remain extremely difficult for existing DL methods. Machine data is often unlabeled and from very few health conditions (e.g., only normal operating data). Furthermore, models often encounter shifts in domain as process parameters change and new categories of faults emerge. Traditional supervised learning may struggle to learn compact, discriminative representations that generalize to these unseen target domains since it depends on having plentiful classes to partition the feature space with decision boundaries. Transfer Learning (TL) with domain adaptation attempts to adapt these models to unlabeled target domains but assumes similar underlying structure that may not be present if new faults emerge. This study proposes focusing on maximizing the feature generality on the source domain and applying TL via weight transfer to copy the model to the target domain. Specifically, Self-Supervised Learning (SSL) with Barlow Twins may produce more discriminative features for monitoring health condition than supervised learning by focusing on semantic properties of the data. Furthermore, Federated Learning (FL) for distributed training may also improve generalization by efficiently expanding the effective size and diversity of training data by sharing information across multiple client machines. Results show that Barlow Twins outperforms supervised learning in an unlabeled target domain with emerging motor faults when the source training data contains very few distinct categories. Incorporating FL may also provide a slight advantage by diffusing knowledge of health conditions between machines.

fair

Title: FLAC: Fairness-Aware Representation Learning by Suppressing Attribute-Class Associations. (arXiv:2304.14252v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.14252
Code URL: https://github.com/gsarridis/flac
Copy Paste: [[2304.14252] FLAC: Fairness-Aware Representation Learning by Suppressing Attribute-Class Associations](http://arxiv.org/abs/2304.14252) #fair
Summary:
Bias in computer vision systems can perpetuate or even amplify discrimination against certain populations. Considering that bias is often introduced by biased visual datasets, many recent research efforts focus on training fair models using such data. However, most of them heavily rely on the availability of protected attribute labels in the dataset, which limits their applicability, while label-unaware approaches, i.e., approaches operating without such labels, exhibit considerably lower performance. To overcome these limitations, this work introduces FLAC, a methodology that minimizes mutual information between the features extracted by the model and a protected attribute, without the use of attribute labels. To do that, FLAC proposes a sampling strategy that highlights underrepresented samples in the dataset, and casts the problem of learning fair representations as a probability matching problem that leverages representations extracted by a bias-capturing classifier. It is theoretically shown that FLAC can indeed lead to fair representations, that are independent of the protected attributes. FLAC surpasses the current state-of-the-art on Biased MNIST, CelebA, and UTKFace, by 29.1%, 18.1%, and 21.9%, respectively. Additionally, FLAC exhibits 2.2% increased accuracy on ImageNet-A consisting of the most challenging samples of ImageNet. Finally, in most experiments, FLAC even outperforms the bias label-aware state-of-the-art methods.

Title: Proportionally Representative Clustering. (arXiv:2304.13917v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.13917
Code URL: null
Copy Paste: [[2304.13917] Proportionally Representative Clustering](http://arxiv.org/abs/2304.13917) #fair
Summary:
In recent years, there has been a surge in effort to formalize notions of fairness in machine learning. We focus on clustering -- one of the fundamental tasks in unsupervised machine learning. We propose a new axiom that captures proportional representation fairness (PRF). We make a case that the concept achieves the raison d'{\^{e}}tre of several existing concepts in the literature in an arguably more convincing manner. Our fairness concept is not satisfied by existing fair clustering algorithms. We design efficient algorithms to achieve PRF both for unconstrained and discrete clustering problems.

Title: Oversampling Higher-Performing Minorities During Machine Learning Model Training Reduces Adverse Impact Slightly but Also Reduces Model Accuracy. (arXiv:2304.13933v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.13933
Code URL: null
Copy Paste: [[2304.13933] Oversampling Higher-Performing Minorities During Machine Learning Model Training Reduces Adverse Impact Slightly but Also Reduces Model Accuracy](http://arxiv.org/abs/2304.13933) #fair
Summary:
Organizations are increasingly adopting machine learning (ML) for personnel assessment. However, concerns exist about fairness in designing and implementing ML assessments. Supervised ML models are trained to model patterns in data, meaning ML models tend to yield predictions that reflect subgroup differences in applicant attributes in the training data, regardless of the underlying cause of subgroup differences. In this study, we systematically under- and oversampled minority (Black and Hispanic) applicants to manipulate adverse impact ratios in training data and investigated how training data adverse impact ratios affect ML model adverse impact and accuracy. We used self-reports and interview transcripts from job applicants (N = 2,501) to train 9,702 ML models to predict screening decisions. We found that training data adverse impact related linearly to ML model adverse impact. However, removing adverse impact from training data only slightly reduced ML model adverse impact and tended to negatively affect ML model accuracy. We observed consistent effects across self-reports and interview transcripts, whether oversampling real (i.e., bootstrapping) or synthetic observations. As our study relied on limited predictor sets from one organization, the observed effects on adverse impact may be attenuated among more accurate ML models.

Title: Towards Efficient and Comprehensive Urban Spatial-Temporal Prediction: A Unified Library and Performance Benchmark. (arXiv:2304.14343v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.14343
Code URL: https://github.com/libcity/bigscity-libcity
Copy Paste: [[2304.14343] Towards Efficient and Comprehensive Urban Spatial-Temporal Prediction: A Unified Library and Performance Benchmark](http://arxiv.org/abs/2304.14343) #fair
Summary:
As deep learning technology advances and more urban spatial-temporal data accumulates, an increasing number of deep learning models are being proposed to solve urban spatial-temporal prediction problems. However, there are limitations in the existing field, including open-source data being in various formats and difficult to use, few papers making their code and data openly available, and open-source models often using different frameworks and platforms, making comparisons challenging. A standardized framework is urgently needed to implement and evaluate these methods. To address these issues, we provide a comprehensive review of urban spatial-temporal prediction and propose a unified storage format for spatial-temporal data called atomic files. We also propose LibCity, an open-source library that offers researchers a credible experimental tool and a convenient development framework. In this library, we have reproduced 65 spatial-temporal prediction models and collected 55 spatial-temporal datasets, allowing researchers to conduct comprehensive experiments conveniently. Using LibCity, we conducted a series of experiments to validate the effectiveness of different models and components, and we summarized promising future technology developments and research directions for spatial-temporal prediction. By enabling fair model comparisons, designing a unified data storage format, and simplifying the process of developing new models, LibCity is poised to make significant contributions to the spatial-temporal prediction field.

interpretability

explainability

watermark

diffusion

Title: DataComp: In search of the next generation of multimodal datasets. (arXiv:2304.14108v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.14108
Code URL: https://github.com/mlfoundations/datacomp
Copy Paste: [[2304.14108] DataComp: In search of the next generation of multimodal datasets](http://arxiv.org/abs/2304.14108) #diffusion
Summary:
Large multimodal datasets have been instrumental in recent breakthroughs such as CLIP, Stable Diffusion, and GPT-4. At the same time, datasets rarely receive the same research attention as model architectures or training algorithms. To address this shortcoming in the machine learning ecosystem, we introduce DataComp, a benchmark where the training code is fixed and researchers innovate by proposing new training sets. We provide a testbed for dataset experiments centered around a new candidate pool of 12.8B image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing on 38 downstream test sets. Our benchmark consists of multiple scales, with four candidate pool sizes and associated compute budgets ranging from 12.8M to 12.8B samples seen during training. This multi-scale design facilitates the study of scaling trends and makes the benchmark accessible to researchers with varying resources.

Our baseline experiments show that the DataComp workflow is a promising way of improving multimodal datasets. We introduce DataComp-1B, a dataset created by applying a simple filtering algorithm to the 12.8B candidate pool. The resulting 1.4B subset enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet. Our new ViT-L/14 model outperforms a larger ViT-g/14 trained on LAION-2B by 0.7 percentage points while requiring 9x less training compute. We also outperform OpenAI's CLIP ViT-L/14 by 3.7 percentage points, which is trained with the same compute budget as our model. These gains highlight the potential for improving model performance by carefully curating training sets. We view DataComp-1B as only the first step and hope that DataComp paves the way toward the next generation of multimodal datasets.

Title: Motion-Conditioned Diffusion Model for Controllable Video Synthesis. (arXiv:2304.14404v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.14404
Code URL: null
Copy Paste: [[2304.14404] Motion-Conditioned Diffusion Model for Controllable Video Synthesis](http://arxiv.org/abs/2304.14404) #diffusion
Summary:
Recent advancements in diffusion models have greatly improved the quality and diversity of synthesized content. To harness the expressive power of diffusion models, researchers have explored various controllable mechanisms that allow users to intuitively guide the content synthesis process. Although the latest efforts have primarily focused on video synthesis, there has been a lack of effective methods for controlling and describing desired content and motion. In response to this gap, we introduce MCDiff, a conditional diffusion model that generates a video from a starting image frame and a set of strokes, which allow users to specify the intended content and dynamics for synthesis. To tackle the ambiguity of sparse motion inputs and achieve better synthesis quality, MCDiff first utilizes a flow completion model to predict the dense video motion based on the semantic understanding of the video frame and the sparse motion control. Then, the diffusion model synthesizes high-quality future frames to form the output video. We qualitatively and quantitatively show that MCDiff achieves the state-the-of-art visual quality in stroke-guided controllable video synthesis. Additional experiments on MPII Human Pose further exhibit the capability of our model on diverse content and motion synthesis.

Title: Putting People in Their Place: Affordance-Aware Human Insertion into Scenes. (arXiv:2304.14406v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.14406
Code URL: https://github.com/adobe-research/affordance-insertion
Copy Paste: [[2304.14406] Putting People in Their Place: Affordance-Aware Human Insertion into Scenes](http://arxiv.org/abs/2304.14406) #diffusion
Summary:
We study the problem of inferring scene affordances by presenting a method for realistically inserting people into scenes. Given a scene image with a marked region and an image of a person, we insert the person into the scene while respecting the scene affordances. Our model can infer the set of realistic poses given the scene context, re-pose the reference person, and harmonize the composition. We set up the task in a self-supervised fashion by learning to re-pose humans in video clips. We train a large-scale diffusion model on a dataset of 2.4M video clips that produces diverse plausible poses while respecting the scene context. Given the learned human-scene composition, our model can also hallucinate realistic people and scenes when prompted without conditioning and also enables interactive editing. A quantitative evaluation shows that our method synthesizes more realistic human appearance and more natural human-scene interactions than prior work.

Title: Functional Diffusion Maps. (arXiv:2304.14378v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.14378
Code URL: null
Copy Paste: [[2304.14378] Functional Diffusion Maps](http://arxiv.org/abs/2304.14378) #diffusion
Summary:
Nowadays many real-world datasets can be considered as functional, in the sense that the processes which generate them are continuous. A fundamental property of this type of data is that in theory they belong to an infinite-dimensional space. Although in practice we usually receive finite observations, they are still high-dimensional and hence dimensionality reduction methods are crucial. In this vein, the main state-of-the-art method for functional data analysis is Functional PCA. Nevertheless, this classic technique assumes that the data lie in a linear manifold, and hence it could have problems when this hypothesis is not fulfilled. In this research, attention has been placed on a non-linear manifold learning method: Diffusion Maps. The article explains how to extend this multivariate method to functional data and compares its behavior against Functional PCA over different simulated and real examples.

noise learning

data-free

transformer

Title: Optimization-Inspired Cross-Attention Transformer for Compressive Sensing. (arXiv:2304.13986v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.13986
Code URL: https://github.com/songjiechong/octuf
Copy Paste: [[2304.13986] Optimization-Inspired Cross-Attention Transformer for Compressive Sensing](http://arxiv.org/abs/2304.13986) #transformer
Summary:
By integrating certain optimization solvers with deep neural networks, deep unfolding network (DUN) with good interpretability and high performance has attracted growing attention in compressive sensing (CS). However, existing DUNs often improve the visual quality at the price of a large number of parameters and have the problem of feature information loss during iteration. In this paper, we propose an Optimization-inspired Cross-attention Transformer (OCT) module as an iterative process, leading to a lightweight OCT-based Unfolding Framework (OCTUF) for image CS. Specifically, we design a novel Dual Cross Attention (Dual-CA) sub-module, which consists of an Inertia-Supplied Cross Attention (ISCA) block and a Projection-Guided Cross Attention (PGCA) block. ISCA block introduces multi-channel inertia forces and increases the memory effect by a cross attention mechanism between adjacent iterations. And, PGCA block achieves an enhanced information interaction, which introduces the inertia force into the gradient descent step through a cross attention block. Extensive CS experiments manifest that our OCTUF achieves superior performance compared to state-of-the-art methods while training lower complexity. Codes are available at https://github.com/songjiechong/OCTUF.

Title: Vision Conformer: Incorporating Convolutions into Vision Transformer Layers. (arXiv:2304.13991v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.13991
Code URL: https://github.com/uchidalab/vision-conformer
Copy Paste: [[2304.13991] Vision Conformer: Incorporating Convolutions into Vision Transformer Layers](http://arxiv.org/abs/2304.13991) #transformer
Summary:
Transformers are popular neural network models that use layers of self-attention and fully-connected nodes with embedded tokens. Vision Transformers (ViT) adapt transformers for image recognition tasks. In order to do this, the images are split into patches and used as tokens. One issue with ViT is the lack of inductive bias toward image structures. Because ViT was adapted for image data from language modeling, the network does not explicitly handle issues such as local translations, pixel information, and information loss in the structures and features shared by multiple patches. Conversely, Convolutional Neural Networks (CNN) incorporate this information. Thus, in this paper, we propose the use of convolutional layers within ViT. Specifically, we propose a model called a Vision Conformer (ViC) which replaces the Multi-Layer Perceptron (MLP) in a ViT layer with a CNN. In addition, to use the CNN, we proposed to reconstruct the image data after the self-attention in a reverse embedding layer. Through the evaluation, we demonstrate that the proposed convolutions help improve the classification ability of ViT.

Title: Lightweight, Pre-trained Transformers for Remote Sensing Timeseries. (arXiv:2304.14065v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.14065
Code URL: https://github.com/nasaharvest/presto
Copy Paste: [[2304.14065] Lightweight, Pre-trained Transformers for Remote Sensing Timeseries](http://arxiv.org/abs/2304.14065) #transformer
Summary:
Machine learning algorithms for parsing remote sensing data have a wide range of societally relevant applications, but labels used to train these algorithms can be difficult or impossible to acquire. This challenge has spurred research into self-supervised learning for remote sensing data aiming to unlock the use of machine learning in geographies or application domains where labelled datasets are small. Current self-supervised learning approaches for remote sensing data draw significant inspiration from techniques applied to natural images. However, remote sensing data has important differences from natural images -- for example, the temporal dimension is critical for many tasks and data is collected from many complementary sensors. We show that designing models and self-supervised training techniques specifically for remote sensing data results in both smaller and more performant models. We introduce the Pretrained Remote Sensing Transformer (Presto), a transformer-based model pre-trained on remote sensing pixel-timeseries data. Presto excels at a wide variety of globally distributed remote sensing tasks and outperforms much larger models. Presto can be used for transfer learning or as a feature extractor for simple models, enabling efficient deployment at scale.

Title: Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification. (arXiv:2304.14122v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.14122
Code URL: null
Copy Paste: [[2304.14122] Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification](http://arxiv.org/abs/2304.14122) #transformer
Summary:
Advanced deep Convolutional Neural Networks (CNNs) have shown great success in video-based person Re-Identification (Re-ID). However, they usually focus on the most obvious regions of persons with a limited global representation ability. Recently, it witnesses that Transformers explore the inter-patch relations with global observations for performance improvements. In this work, we take both sides and propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID. Firstly, we couple CNNs and Transformers to extract two kinds of visual features and experimentally verify their complementarity. Further, in spatial, we propose a Complementary Content Attention (CCA) to take advantages of the coupled structure and guide independent features for spatial complementary learning. In temporal, a Hierarchical Temporal Aggregation (HTA) is proposed to progressively capture the inter-frame dependencies and encode temporal information. Besides, a gated attention is utilized to deliver aggregated temporal information into the CNN and Transformer branches for temporal complementary learning. Finally, we introduce a self-distillation training strategy to transfer the superior spatial-temporal knowledge to backbone networks for higher accuracy and more efficiency. In this way, two kinds of typical features from same videos are integrated mechanically for more informative representations. Extensive experiments on four public Re-ID benchmarks demonstrate that our framework could attain better performances than most state-of-the-art methods.

Title: Exploiting Inductive Bias in Transformer for Point Cloud Classification and Segmentation. (arXiv:2304.14124v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.14124
Code URL: https://github.com/jiamang/ibt
Copy Paste: [[2304.14124] Exploiting Inductive Bias in Transformer for Point Cloud Classification and Segmentation](http://arxiv.org/abs/2304.14124) #transformer
Summary:
Discovering inter-point connection for efficient high-dimensional feature extraction from point coordinate is a key challenge in processing point cloud. Most existing methods focus on designing efficient local feature extractors while ignoring global connection, or vice versa. In this paper, we design a new Inductive Bias-aided Transformer (IBT) method to learn 3D inter-point relations, which considers both local and global attentions. Specifically, considering local spatial coherence, local feature learning is performed through Relative Position Encoding and Attentive Feature Pooling. We incorporate the learned locality into the Transformer module. The local feature affects value component in Transformer to modulate the relationship between channels of each point, which can enhance self-attention mechanism with locality based channel interaction. We demonstrate its superiority experimentally on classification and segmentation tasks. The code is available at: https://github.com/jiamang/IBT

Title: Analogy-Forming Transformers for Few-Shot 3D Parsing. (arXiv:2304.14382v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.14382
Code URL: null
Copy Paste: [[2304.14382] Analogy-Forming Transformers for Few-Shot 3D Parsing](http://arxiv.org/abs/2304.14382) #transformer
Summary:
We present Analogical Networks, a model that encodes domain knowledge explicitly, in a collection of structured labelled 3D scenes, in addition to implicitly, as model parameters, and segments 3D object scenes with analogical reasoning: instead of mapping a scene to part segments directly, our model first retrieves related scenes from memory and their corresponding part structures, and then predicts analogous part structures for the input scene, via an end-to-end learnable modulation mechanism. By conditioning on more than one retrieved memories, compositions of structures are predicted, that mix and match parts across the retrieved memories. One-shot, few-shot or many-shot learning are treated uniformly in Analogical Networks, by conditioning on the appropriate set of memories, whether taken from a single, few or many memory exemplars, and inferring analogous parses. We show Analogical Networks are competitive with state-of-the-art 3D segmentation transformers in many-shot settings, and outperform them, as well as existing paradigms of meta-learning and few-shot learning, in few-shot settings. Analogical Networks successfully segment instances of novel object categories simply by expanding their memory, without any weight updates. Our code and models are publicly available in the project webpage: this http URL

Title: SeqTrack: Sequence to Sequence Learning for Visual Object Tracking. (arXiv:2304.14394v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.14394
Code URL: https://github.com/microsoft/videox
Copy Paste: [[2304.14394] SeqTrack: Sequence to Sequence Learning for Visual Object Tracking](http://arxiv.org/abs/2304.14394) #transformer
Summary:
In this paper, we present a new sequence-to-sequence learning framework for visual tracking, dubbed SeqTrack. It casts visual tracking as a sequence generation problem, which predicts object bounding boxes in an autoregressive fashion. This is different from prior Siamese trackers and transformer trackers, which rely on designing complicated head networks, such as classification and regression heads. SeqTrack only adopts a simple encoder-decoder transformer architecture. The encoder extracts visual features with a bidirectional transformer, while the decoder generates a sequence of bounding box values autoregressively with a causal transformer. The loss function is a plain cross-entropy. Such a sequence learning paradigm not only simplifies tracking framework, but also achieves competitive performance on benchmarks. For instance, SeqTrack gets 72.5% AUC on LaSOT, establishing a new state-of-the-art performance. Code and models are available at here.

Title: IconShop: Text-Based Vector Icon Synthesis with Autoregressive Transformers. (arXiv:2304.14400v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.14400
Code URL: null
Copy Paste: [[2304.14400] IconShop: Text-Based Vector Icon Synthesis with Autoregressive Transformers](http://arxiv.org/abs/2304.14400) #transformer
Summary:
Scalable Vector Graphics (SVG) is a prevalent vector image format with good support for interactivity and animation. Despite such appealing characteristics, it is generally challenging for users to create their own SVG content because of the long learning curve to comprehend SVG grammars or acquaint themselves with professional editing software. Recent progress in text-to-image generation has inspired researchers to explore image-based icon synthesis (i.e., text -> raster image -> vector image) via differential rendering and language-based icon synthesis (i.e., text -> vector image script) via the "zero-shot" capabilities of large language models. However, these methods may suffer from several limitations regarding generation quality, diversity, flexibility, and speed. In this paper, we introduce IconShop, a text-guided vector icon synthesis method using an autoregressive transformer. The key to success of our approach is to sequentialize and tokenize the SVG paths (and textual descriptions) into a uniquely decodable command sequence. With such a single sequence as input, we are able to fully exploit the sequence learning power of autoregressive transformers, while enabling various icon synthesis and manipulation tasks. Through standard training to predict the next token on a large-scale icon dataset accompanied by textural descriptions, the proposed IconShop consistently exhibits better icon synthesis performance than existing image-based and language-based methods both quantitatively (using the FID and CLIP score) and qualitatively (through visual inspection). Meanwhile, we observe a dramatic improvement in generation diversity, which is supported by objective measures (Uniqueness and Novelty). More importantly, we demonstrate the flexibility of IconShop with two novel icon manipulation tasks
text-guided icon infilling, and text-combined icon synthesis.

Title: Neural Keyphrase Generation: Analysis and Evaluation. (arXiv:2304.13883v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2304.13883
Code URL: null
Copy Paste: [[2304.13883] Neural Keyphrase Generation: Analysis and Evaluation](http://arxiv.org/abs/2304.13883) #transformer
Summary:
Keyphrase generation aims at generating topical phrases from a given text either by copying from the original text (present keyphrases) or by producing new keyphrases (absent keyphrases) that capture the semantic meaning of the text. Encoder-decoder models are most widely used for this task because of their capabilities for absent keyphrase generation. However, there has been little to no analysis on the performance and behavior of such models for keyphrase generation. In this paper, we study various tendencies exhibited by three strong models: T5 (based on a pre-trained transformer), CatSeq-Transformer (a non-pretrained Transformer), and ExHiRD (based on a recurrent neural network). We analyze prediction confidence scores, model calibration, and the effect of token position on keyphrases generation. Moreover, we motivate and propose a novel metric framework, SoftKeyScore, to evaluate the similarity between two sets of keyphrases by using softscores to account for partial matching and semantic similarity. We find that SoftKeyScore is more suitable than the standard F1 metric for evaluating two sets of given keyphrases.

Title: SweCTRL-Mini: a data-transparent Transformer-based large language model for controllable text generation in Swedish. (arXiv:2304.13994v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2304.13994
Code URL: https://github.com/dkalpakchi/swectrl-mini
Copy Paste: [[2304.13994] SweCTRL-Mini: a data-transparent Transformer-based large language model for controllable text generation in Swedish](http://arxiv.org/abs/2304.13994) #transformer
Summary:
We present SweCTRL-Mini, a large Swedish language model that can be used for inference and fine-tuning on a single consumer-grade GPU. The model is based on the CTRL architecture by Keskar, McCann, Varshney, Xiong, and Socher (2019), which means that users of the SweCTRL-Mini model can control the genre of the generated text by inserting special tokens in the generation prompts. SweCTRL-Mini is trained on a subset of the Swedish part of the mC4 corpus and a set of Swedish novels. In this article, we provide (1) a detailed account of the utilized training data and text pre-processing steps, to the extent that it is possible to check whether a specific phrase/source was a part of the training data, and (2) an evaluation of the model on both discriminative tasks, using automatic evaluation methods, and generative tasks, using human referees. We also compare the generative capabilities of the model with those of GPT-3. SweCTRL-Mini is fully open and available for download.

Title: ChatGPT vs State-of-the-Art Models: A Benchmarking Study in Keyphrase Generation Task. (arXiv:2304.14177v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2304.14177
Code URL: null
Copy Paste: [[2304.14177] ChatGPT vs State-of-the-Art Models: A Benchmarking Study in Keyphrase Generation Task](http://arxiv.org/abs/2304.14177) #transformer
Summary:
Transformer-based language models, including ChatGPT, have demonstrated exceptional performance in various natural language generation tasks. However, there has been limited research evaluating ChatGPT's keyphrase generation ability, which involves identifying informative phrases that accurately reflect a document's content. This study seeks to address this gap by comparing ChatGPT's keyphrase generation performance with state-of-the-art models, while also testing its potential as a solution for two significant challenges in the field: domain adaptation and keyphrase generation from long documents. We conducted experiments on six publicly available datasets from scientific articles and news domains, analyzing performance on both short and long documents. Our results show that ChatGPT outperforms current state-of-the-art models in all tested datasets and environments, generating high-quality keyphrases that adapt well to diverse domains and document lengths.

Title: NAP at SemEval-2023 Task 3: Is Less Really More? (Back-)Translation as Data Augmentation Strategies for Detecting Persuasion Techniques. (arXiv:2304.14179v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2304.14179
Code URL: null
Copy Paste: [[2304.14179] NAP at SemEval-2023 Task 3: Is Less Really More? (Back-)Translation as Data Augmentation Strategies for Detecting Persuasion Techniques](http://arxiv.org/abs/2304.14179) #transformer
Summary:
Persuasion techniques detection in news in a multi-lingual setup is non-trivial and comes with challenges, including little training data. Our system successfully leverages (back-)translation as data augmentation strategies with multi-lingual transformer models for the task of detecting persuasion techniques. The automatic and human evaluation of our augmented data allows us to explore whether (back-)translation aid or hinder performance. Our in-depth analyses indicate that both data augmentation strategies boost performance; however, balancing human-produced and machine-generated data seems to be crucial.

Title: Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be. (arXiv:2304.13960v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.13960
Code URL: https://github.com/fkunstner/noise-sgd-adam-sign
Copy Paste: [[2304.13960] Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be](http://arxiv.org/abs/2304.13960) #transformer
Summary:
The success of the Adam optimizer on a wide array of architectures has made it the default in settings where stochastic gradient descent (SGD) performs poorly. However, our theoretical understanding of this discrepancy is lagging, preventing the development of significant improvements on either algorithm. Recent work advances the hypothesis that Adam and other heuristics like gradient clipping outperform SGD on language tasks because the distribution of the error induced by sampling has heavy tails. This suggests that Adam outperform SGD because it uses a more robust gradient estimate. We evaluate this hypothesis by varying the batch size, up to the entire dataset, to control for stochasticity. We present evidence that stochasticity and heavy-tailed noise are not major factors in the performance gap between SGD and Adam. Rather, Adam performs better as the batch size increases, while SGD is less effective at taking advantage of the reduction in noise. This raises the question as to why Adam outperforms SGD in the full-batch setting. Through further investigation of simpler variants of SGD, we find that the behavior of Adam with large batches is similar to sign descent with momentum.

generative

Title: Multimodal Composite Association Score: Measuring Gender Bias in Generative Multimodal Models. (arXiv:2304.13855v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.13855
Code URL: null
Copy Paste: [[2304.13855] Multimodal Composite Association Score: Measuring Gender Bias in Generative Multimodal Models](http://arxiv.org/abs/2304.13855) #generative
Summary:
Generative multimodal models based on diffusion models have seen tremendous growth and advances in recent years. Models such as DALL-E and Stable Diffusion have become increasingly popular and successful at creating images from texts, often combining abstract ideas. However, like other deep learning models, they also reflect social biases they inherit from their training data, which is often crawled from the internet. Manually auditing models for biases can be very time and resource consuming and is further complicated by the unbounded and unconstrained nature of inputs these models can take. Research into bias measurement and quantification has generally focused on small single-stage models working on a single modality. Thus the emergence of multistage multimodal models requires a different approach. In this paper, we propose Multimodal Composite Association Score (MCAS) as a new method of measuring gender bias in multimodal generative models. Evaluating both DALL-E 2 and Stable Diffusion using this approach uncovered the presence of gendered associations of concepts embedded within the models. We propose MCAS as an accessible and scalable method of quantifying potential bias for models with different modalities and a range of potential biases.

Title: Deep Learning Techniques for Hyperspectral Image Analysis in Agriculture: A Review. (arXiv:2304.13880v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.13880
Code URL: null
Copy Paste: [[2304.13880] Deep Learning Techniques for Hyperspectral Image Analysis in Agriculture: A Review](http://arxiv.org/abs/2304.13880) #generative
Summary:
In the recent years, hyperspectral imaging (HSI) has gained considerably popularity among computer vision researchers for its potential in solving remote sensing problems, especially in agriculture field. However, HSI classification is a complex task due to the high redundancy of spectral bands, limited training samples, and non-linear relationship between spatial position and spectral bands. Fortunately, deep learning techniques have shown promising results in HSI analysis. This literature review explores recent applications of deep learning approaches such as Autoencoders, Convolutional Neural Networks (1D, 2D, and 3D), Recurrent Neural Networks, Deep Belief Networks, and Generative Adversarial Networks in agriculture. The performance of these approaches has been evaluated and discussed on well-known land cover datasets including Indian Pines, Salinas Valley, and Pavia University.

Title: ContraNeRF: 3D-Aware Generative Model via Contrastive Learning with Unsupervised Implicit Pose Embedding. (arXiv:2304.14005v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.14005
Code URL: null
Copy Paste: [[2304.14005] ContraNeRF: 3D-Aware Generative Model via Contrastive Learning with Unsupervised Implicit Pose Embedding](http://arxiv.org/abs/2304.14005) #generative
Summary:
Although 3D-aware GANs based on neural radiance fields have achieved competitive performance, their applicability is still limited to objects or scenes with the ground-truths or prediction models for clearly defined canonical camera poses. To extend the scope of applicable datasets, we propose a novel 3D-aware GAN optimization technique through contrastive learning with implicit pose embeddings. To this end, we first revise the discriminator design and remove dependency on ground-truth camera poses. Then, to capture complex and challenging 3D scene structures more effectively, we make the discriminator estimate a high-dimensional implicit pose embedding from a given image and perform contrastive learning on the pose embedding. The proposed approach can be employed for the dataset, where the canonical camera pose is ill-defined because it does not look up or estimate camera poses. Experimental results show that our algorithm outperforms existing methods by large margins on the datasets with multiple object categories and inconsistent canonical camera poses.

Title: Edit Everything: A Text-Guided Generative System for Images Editing. (arXiv:2304.14006v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.14006
Code URL: https://github.com/defengxie/edit_everything
Copy Paste: [[2304.14006] Edit Everything: A Text-Guided Generative System for Images Editing](http://arxiv.org/abs/2304.14006) #generative
Summary:
We introduce a new generative system called Edit Everything, which can take image and text inputs and produce image outputs. Edit Everything allows users to edit images using simple text instructions. Our system designs prompts to guide the visual module in generating requested images. Experiments demonstrate that Edit Everything facilitates the implementation of the visual aspects of Stable Diffusion with the use of Segment Anything model and CLIP. Our system is publicly available at https://github.com/DefengXie/Edit_Everything.

Title: AI, write an essay for me: A large-scale comparison of human-written versus ChatGPT-generated essays. (arXiv:2304.14276v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2304.14276
Code URL: null
Copy Paste: [[2304.14276] AI, write an essay for me: A large-scale comparison of human-written versus ChatGPT-generated essays](http://arxiv.org/abs/2304.14276) #generative
Summary:
Background: Recently, ChatGPT and similar generative AI models have attracted hundreds of millions of users and become part of the public discourse. Many believe that such models will disrupt society and will result in a significant change in the education system and information generation in the future. So far, this belief is based on either colloquial evidence or benchmarks from the owners of the models -- both lack scientific rigour.

Objective: Through a large-scale study comparing human-written versus ChatGPT-generated argumentative student essays, we systematically assess the quality of the AI-generated content.

Methods: A large corpus of essays was rated using standard criteria by a large number of human experts (teachers). We augment the analysis with a consideration of the linguistic characteristics of the generated essays.

Results: Our results demonstrate that ChatGPT generates essays that are rated higher for quality than human-written essays. The writing style of the AI models exhibits linguistic characteristics that are different from those of the human-written essays, e.g., it is characterized by fewer discourse and epistemic markers, but more nominalizations and greater lexical diversity.

Conclusions: Our results clearly demonstrate that models like ChatGPT outperform humans in generating argumentative essays. Since the technology is readily available for anyone to use, educators must act immediately. We must re-invent homework and develop teaching concepts that utilize these AI models in the same way as math utilized the calculator: teach the general concepts first and then use AI tools to free up time for other learning objectives.

Title: LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions. (arXiv:2304.14402v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2304.14402
Code URL: https://github.com/mbzuai-nlp/lamini-lm
Copy Paste: [[2304.14402] LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions](http://arxiv.org/abs/2304.14402) #generative
Summary:
Large language models (LLMs) with instruction finetuning demonstrate superior generative capabilities. However, these models are resource intensive. To alleviate this issue, we explore distilling knowledge from instruction-tuned LLMs to much smaller ones. To this end, we carefully develop a large set of 2.58M instructions based on both existing and newly-generated instructions. In addition to being sizeable, we design our instructions to cover a broad set of topics to ensure. A thorough investigation of our instruction data demonstrate their diversity, and we generate responses for these instructions using gpt-3.5-turbo. We then exploit the instructions to tune a host of models, dubbed LaMini-LM, of varying sizes, both from the encoder-decoder as well as the decoder-only families. We evaluate our models both automatically (on 15 different NLP benchmarks) and manually. Results show that our proposed LaMini-LM are on par with competitive baselines while being nearly 10 times smaller in size.

Title: ChatGPT is all you need to decolonize sub-Saharan Vocational Education. (arXiv:2304.13728v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.13728
Code URL: null
Copy Paste: [[2304.13728] ChatGPT is all you need to decolonize sub-Saharan Vocational Education](http://arxiv.org/abs/2304.13728) #generative
Summary:
The advances of Generative AI models with interactive capabilities over the past few years offer unique opportunities for socioeconomic mobility. Their potential for scalability, accessibility, affordability, personalizing and convenience sets a first-class opportunity for poverty-stricken countries to adapt and modernize their educational order. As a result, this position paper makes the case for an educational policy framework that would succeed in this transformation by prioritizing vocational and technical training over academic education in sub-Saharan African countries. We highlight substantial applications of Large Language Models, tailor-made to their respective cultural background(s) and needs, that would reinforce their systemic decolonization. Lastly, we provide specific historical examples of diverse states successfully implementing such policies in the elementary steps of their socioeconomic transformation, in order to corroborate our proposal to sub-Saharan African countries to follow their lead.

Title: TR0N: Translator Networks for 0-Shot Plug-and-Play Conditional Generation. (arXiv:2304.13742v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.13742
Code URL: https://github.com/layer6ai-labs/tr0n
Copy Paste: [[2304.13742] TR0N: Translator Networks for 0-Shot Plug-and-Play Conditional Generation](http://arxiv.org/abs/2304.13742) #generative
Summary:
We propose TR0N, a highly general framework to turn pre-trained unconditional generative models, such as GANs and VAEs, into conditional models. The conditioning can be highly arbitrary, and requires only a pre-trained auxiliary model. For example, we show how to turn unconditional models into class-conditional ones with the help of a classifier, and also into text-to-image models by leveraging CLIP. TR0N learns a lightweight stochastic mapping which "translates" between the space of conditions and the latent space of the generative model, in such a way that the generated latent corresponds to a data sample satisfying the desired condition. The translated latent samples are then further improved upon through Langevin dynamics, enabling us to obtain higher-quality data samples. TR0N requires no training data nor fine-tuning, yet can achieve a zero-shot FID of 10.9 on MS-COCO, outperforming competing alternatives not only on this metric, but also in sampling speed -- all while retaining a much higher level of generality. Our code is available at https://github.com/layer6ai-labs/tr0n.

large language model

Title: mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality. (arXiv:2304.14178v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2304.14178
Code URL: https://github.com/x-plug/mplug-owl
Copy Paste: [[2304.14178] mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality](http://arxiv.org/abs/2304.14178) #large language model
Summary:
Large language models (LLMs) have demonstrated impressive zero-shot abilities on a variety of open-ended tasks, while recent research has also explored the use of LLMs for multi-modal generation. In this study, we introduce mPLUG-Owl, a novel training paradigm that equips LLMs with multi-modal abilities through modularized learning of foundation LLM, a visual knowledge module, and a visual abstractor module. This approach can support multiple modalities and facilitate diverse unimodal and multimodal abilities through modality collaboration. The training paradigm of mPLUG-Owl involves a two-stage method for aligning image and text, which learns visual knowledge with the assistance of LLM while maintaining and even improving the generation abilities of LLM. In the first stage, the visual knowledge module and abstractor module are trained with a frozen LLM module to align the image and text. In the second stage, language-only and multi-modal supervised datasets are used to jointly fine-tune a low-rank adaption (LoRA) module on LLM and the abstractor module by freezing the visual knowledge module. We carefully build a visually-related instruction evaluation set OwlEval. Experimental results show that our model outperforms existing multi-modal models, demonstrating mPLUG-Owl's impressive instruction and visual understanding ability, multi-turn conversation ability, and knowledge reasoning ability. Besides, we observe some unexpected and exciting abilities such as multi-image correlation and scene text understanding, which makes it possible to leverage it for harder real scenarios, such as vision-only document comprehension. Our code, pre-trained model, instruction-tuned models, and evaluation set are available at https://github.com/X-PLUG/mPLUG-Owl. The online demo is available at https://www.modelscope.cn/studios/damo/mPLUG-Owl.

Title: The Internal State of an LLM Knows When its Lying. (arXiv:2304.13734v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2304.13734
Code URL: null
Copy Paste: [[2304.13734] The Internal State of an LLM Knows When its Lying](http://arxiv.org/abs/2304.13734) #large language model
Summary:
While Large Language Models (LLMs) have shown exceptional performance in various tasks, their (arguably) most prominent drawback is generating inaccurate or false information with a confident tone. In this paper, we hypothesize that the LLM's internal state can be used to reveal the truthfulness of a statement. Therefore, we introduce a simple yet effective method to detect the truthfulness of LLM-generated statements, which utilizes the LLM's hidden layer activations to determine the veracity of statements. To train and evaluate our method, we compose a dataset of true and false statements in six different topics. A classifier is trained to detect which statement is true or false based on an LLM's activation values. Specifically, the classifier receives as input the activation values from the LLM for each of the statements in the dataset. Our experiments demonstrate that our method for detecting statement veracity significantly outperforms even few-shot prompting methods, highlighting its potential to enhance the reliability of LLM-generated content and its practical applicability in real-world scenarios.

Title: Multi-Party Chat: Conversational Agents in Group Settings with Humans and Models. (arXiv:2304.13835v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2304.13835
Code URL: https://github.com/facebookresearch/light
Copy Paste: [[2304.13835] Multi-Party Chat: Conversational Agents in Group Settings with Humans and Models](http://arxiv.org/abs/2304.13835) #large language model
Summary:
Current dialogue research primarily studies pairwise (two-party) conversations, and does not address the everyday setting where more than two speakers converse together. In this work, we both collect and evaluate multi-party conversations to study this more general case. We use the LIGHT environment to construct grounded conversations, where each participant has an assigned character to role-play. We thus evaluate the ability of language models to act as one or more characters in such conversations. Models require two skills that pairwise-trained models appear to lack: (1) being able to decide when to talk; (2) producing coherent utterances grounded on multiple characters. We compare models trained on our new dataset to existing pairwise-trained dialogue models, as well as large language models with few-shot prompting. We find that our new dataset, MultiLIGHT, which we will publicly release, can help bring significant improvements in the group setting.

Title: Origin Tracing and Detecting of LLMs. (arXiv:2304.14072v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2304.14072
Code URL: null
Copy Paste: [[2304.14072] Origin Tracing and Detecting of LLMs](http://arxiv.org/abs/2304.14072) #large language model
Summary:
The extraordinary performance of large language models (LLMs) heightens the importance of detecting whether the context is generated by an AI system. More importantly, while more and more companies and institutions release their LLMs, the origin can be hard to trace. Since LLMs are heading towards the time of AGI, similar to the origin tracing in anthropology, it is of great importance to trace the origin of LLMs. In this paper, we first raise the concern of the origin tracing of LLMs and propose an effective method to trace and detect AI-generated contexts. We introduce a novel algorithm that leverages the contrastive features between LLMs and extracts model-wise features to trace the text origins. Our proposed method works under both white-box and black-box settings therefore can be widely generalized to detect various LLMs.(e.g. can be generalized to detect GPT-3 models without the GPT-3 models). Also, our proposed method requires only limited data compared with the supervised learning methods and can be extended to trace new-coming model origins. We construct extensive experiments to examine whether we can trace the origins of given texts. We provide valuable observations based on the experimental results, such as the difficulty level of AI origin tracing, and the AI origin similarities, and call for ethical concerns of LLM providers. We are releasing all codes and data as a toolkit and benchmark for future AI origin tracing and detecting studies. \footnote{We are releasing all available resource at \url{https://github.com/OpenLMLab/}.}

Title: Large Language Models are Strong Zero-Shot Retriever. (arXiv:2304.14233v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2304.14233
Code URL: null
Copy Paste: [[2304.14233] Large Language Models are Strong Zero-Shot Retriever](http://arxiv.org/abs/2304.14233) #large language model
Summary:
In this work, we propose a simple method that applies a large language model (LLM) to large-scale retrieval in zero-shot scenarios. Our method, Language language model as Retriever (LameR) is built upon no other neural models but an LLM, while breaking up brute-force combinations of retrievers with LLMs and lifting the performance of zero-shot retrieval to be very competitive on benchmark datasets. Essentially, we propose to augment a query with its potential answers by prompting LLMs with a composition of the query and the query's in-domain candidates. The candidates, regardless of correct or wrong, are obtained by a vanilla retrieval procedure on the target collection. Such candidates, as a part of prompts, are likely to help LLM generate more precise answers by pattern imitation or candidate summarization. Even if all the candidates are wrong, the prompts at least make LLM aware of in-collection patterns and genres. Moreover, due to the low performance of a self-supervised retriever, the LLM-based query augmentation becomes less effective as the retriever bottlenecks the whole pipeline. So, we propose to leverage a non-parametric lexicon-based method (e.g., BM25) as the retrieval module to capture query-document overlap in a literal fashion. As such, LameR makes the retrieval procedure transparent to the LLM, so it circumvents the performance bottleneck.

Title: What's in a Name? Evaluating Assembly-Part Semantic Knowledge in Language Models through User-Provided Names in CAD Files. (arXiv:2304.14275v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2304.14275
Code URL: null
Copy Paste: [[2304.14275] What's in a Name? Evaluating Assembly-Part Semantic Knowledge in Language Models through User-Provided Names in CAD Files](http://arxiv.org/abs/2304.14275) #large language model
Summary:
Semantic knowledge of part-part and part-whole relationships in assemblies is useful for a variety of tasks from searching design repositories to the construction of engineering knowledge bases. In this work we propose that the natural language names designers use in Computer Aided Design (CAD) software are a valuable source of such knowledge, and that Large Language Models (LLMs) contain useful domain-specific information for working with this data as well as other CAD and engineering-related tasks.

In particular we extract and clean a large corpus of natural language part, feature and document names and use this to quantitatively demonstrate that a pre-trained language model can outperform numerous benchmarks on three self-supervised tasks, without ever having seen this data before. Moreover, we show that fine-tuning on the text data corpus further boosts the performance on all tasks, thus demonstrating the value of the text data which until now has been largely ignored. We also identify key limitations to using LLMs with text data alone, and our findings provide a strong motivation for further work into multi-modal text-geometry models.

To aid and encourage further work in this area we make all our data and code publicly available.

Title: Controlled Text Generation with Natural Language Instructions. (arXiv:2304.14293v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2304.14293
Code URL: null
Copy Paste: [[2304.14293] Controlled Text Generation with Natural Language Instructions](http://arxiv.org/abs/2304.14293) #large language model
Summary:
Large language models generate fluent texts and can follow natural language instructions to solve a wide range of tasks without task-specific training. Nevertheless, it is notoriously difficult to control their generation to satisfy the various constraints required by different applications. In this work, we present InstructCTG, a controlled text generation framework that incorporates different constraints by conditioning on natural language descriptions and demonstrations of the constraints. In particular, we first extract the underlying constraints of natural texts through a combination of off-the-shelf NLP tools and simple heuristics. We then verbalize the constraints into natural language instructions to form weakly supervised training data. By prepending natural language descriptions of the constraints and a few demonstrations, we fine-tune a pre-trained language model to incorporate various types of constraints. Compared to existing search-based or score-based methods, InstructCTG is more flexible to different constraint types and has a much smaller impact on the generation quality and speed because it does not modify the decoding procedure. Additionally, InstructCTG allows the model to adapt to new constraints without re-training through the use of few-shot task generalization and in-context learning abilities of instruction-tuned language models.

Title: q2d: Turning Questions into Dialogs to Teach Models How to Search. (arXiv:2304.14318v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2304.14318
Code URL: null
Copy Paste: [[2304.14318] q2d: Turning Questions into Dialogs to Teach Models How to Search](http://arxiv.org/abs/2304.14318) #large language model
Summary:
One of the exciting capabilities of recent language models for dialog is their ability to independently search for relevant information to ground a given dialog response. However, obtaining training data to teach models how to issue search queries is time and resource consuming. In this work, we propose q2d: an automatic data generation pipeline that generates information-seeking dialogs from questions. We prompt a large language model (PaLM) to create conversational versions of question answering datasets, and use it to improve query generation models that communicate with external search APIs to ground dialog responses. Unlike previous approaches which relied on human written dialogs with search queries, our method allows to automatically generate query-based grounded dialogs with better control and scale. Our experiments demonstrate that: (1) For query generation on the QReCC dataset, models trained on our synthetically-generated data achieve 90%--97% of the performance of models trained on the human-generated data; (2) We can successfully generate data for training dialog models in new domains without any existing dialog data as demonstrated on the multi-hop MuSiQue and Bamboogle QA datasets. (3) We perform a thorough analysis of the generated dialogs showing that humans find them of high quality and struggle to distinguish them from human-written dialogs.

Title: Industrial Engineering with Large Language Models: A case study of ChatGPT's performance on Oil & Gas problems. (arXiv:2304.14354v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2304.14354
Code URL: null
Copy Paste: [[2304.14354] Industrial Engineering with Large Language Models: A case study of ChatGPT's performance on Oil & Gas problems](http://arxiv.org/abs/2304.14354) #large language model
Summary:
Large Language Models (LLMs) have shown great potential in solving complex problems in various fields, including oil and gas engineering and other industrial engineering disciplines like factory automation, PLC programming etc. However, automatic identification of strong and weak solutions to fundamental physics equations governing several industrial processes remain a challenging task. This paper identifies the limitation of current LLM approaches, particularly ChatGPT in selected practical problems native to oil and gas engineering but not exclusively. The performance of ChatGPT in solving complex problems in oil and gas engineering is discussed and the areas where LLMs are most effective are presented.

Title: CONSCENDI: A Contrastive and Scenario-Guided Distillation Approach to Guardrail Models for Virtual Assistants. (arXiv:2304.14364v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2304.14364
Code URL: null
Copy Paste: [[2304.14364] CONSCENDI: A Contrastive and Scenario-Guided Distillation Approach to Guardrail Models for Virtual Assistants](http://arxiv.org/abs/2304.14364) #large language model
Summary:
A wave of new task-based virtual assistants has been fueled by increasingly powerful large language models, such as GPT-4. These conversational agents can be customized to serve customer-specific use cases, but ensuring that agent-generated text conforms to designer-specified rules included in prompt instructions alone is challenging. Therefore, chatbot designers often use another model, called a guardrail model, to verify that the agent output aligns with their rules and constraints. We explore using a distillation approach to guardrail models to monitor the output of the first model using training data from GPT-4. We find two crucial steps to our CONSCENDI process: scenario-augmented generation and contrastive training examples. When generating conversational data, we generate a set of rule-breaking scenarios, which enumerate a diverse set of high-level ways a rule can be violated. This scenario-guided approach produces a diverse training set of rule-violating conversations, and it provides chatbot designers greater control over the classification process. We also prompt GPT-4 to also generate contrastive examples by altering conversations with violations into acceptable conversations. This set of borderline, contrastive examples enables the distilled model to learn finer-grained distinctions between what is acceptable and what is not. We find that CONSCENDI results in guardrail models that improve over baselines.

segmentation

Title: Customized Segment Anything Model for Medical Image Segmentation. (arXiv:2304.13785v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.13785
Code URL: https://github.com/hitachinsk/samed
Copy Paste: [[2304.13785] Customized Segment Anything Model for Medical Image Segmentation](http://arxiv.org/abs/2304.13785) #segmentation
Summary:
We propose SAMed, a general solution for medical image segmentation. Different from the previous methods, SAMed is built upon the large-scale image segmentation model, Segment Anything Model (SAM), to explore the new research paradigm of customizing large-scale models for medical image segmentation. SAMed applies the low-rank-based (LoRA) finetuning strategy to the SAM image encoder and finetunes it together with the prompt encoder and the mask decoder on labeled medical image segmentation datasets. We also observe the warmup finetuning strategy and the AdamW optimizer lead SAMed to successful convergence and lower loss. Different from SAM, SAMed could perform semantic segmentation on medical images. Our trained SAMed model achieves 81.88 DSC and 20.64 HD on the Synapse multi-organ segmentation dataset, which is on par with the state-of-the-art methods. We conduct extensive experiments to validate the effectiveness of our design. Since SAMed only updates a small fraction of the SAM parameters, its deployment cost and storage cost are quite marginal in practical usage. The code of SAMed is available at https://github.com/hitachinsk/SAMed.

Title: GazeSAM: What You See is What You Segment. (arXiv:2304.13844v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.13844
Code URL: https://github.com/ukaukaaaa/gazesam
Copy Paste: [[2304.13844] GazeSAM: What You See is What You Segment](http://arxiv.org/abs/2304.13844) #segmentation
Summary:
This study investigates the potential of eye-tracking technology and the Segment Anything Model (SAM) to design a collaborative human-computer interaction system that automates medical image segmentation. We present the \textbf{GazeSAM} system to enable radiologists to collect segmentation masks by simply looking at the region of interest during image diagnosis. The proposed system tracks radiologists' eye movement and utilizes the eye-gaze data as the input prompt for SAM, which automatically generates the segmentation mask in real time. This study is the first work to leverage the power of eye-tracking technology and SAM to enhance the efficiency of daily clinical practice. Moreover, eye-gaze data coupled with image and corresponding segmentation labels can be easily recorded for further advanced eye-tracking research. The code is available in \url{https://github.com/ukaukaaaa/GazeSAM}.

Title: SkinSAM: Empowering Skin Cancer Segmentation with Segment Anything Model. (arXiv:2304.13973v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.13973
Code URL: null
Copy Paste: [[2304.13973] SkinSAM: Empowering Skin Cancer Segmentation with Segment Anything Model](http://arxiv.org/abs/2304.13973) #segmentation
Summary:
Skin cancer is a prevalent and potentially fatal disease that requires accurate and efficient diagnosis and treatment. Although manual tracing is the current standard in clinics, automated tools are desired to reduce human labor and improve accuracy. However, developing such tools is challenging due to the highly variable appearance of skin cancers and complex objects in the background. In this paper, we present SkinSAM, a fine-tuned model based on the Segment Anything Model that showed outstanding segmentation performance. The models are validated on HAM10000 dataset which includes 10015 dermatoscopic images. While larger models (ViT_L, ViT_H) performed better than the smaller one (ViT_b), the finetuned model (ViT_b_finetuned) exhibited the greatest improvement, with a Mean pixel accuracy of 0.945, Mean dice score of 0.8879, and Mean IoU score of 0.7843. Among the lesion types, vascular lesions showed the best segmentation results. Our research demonstrates the great potential of adapting SAM to medical image segmentation tasks.

Title: Adaptive-Mask Fusion Network for Segmentation of Drivable Road and Negative Obstacle With Untrustworthy Features. (arXiv:2304.13979v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.13979
Code URL: null
Copy Paste: [[2304.13979] Adaptive-Mask Fusion Network for Segmentation of Drivable Road and Negative Obstacle With Untrustworthy Features](http://arxiv.org/abs/2304.13979) #segmentation
Summary:
Segmentation of drivable roads and negative obstacles is critical to the safe driving of autonomous vehicles. Currently, many multi-modal fusion methods have been proposed to improve segmentation accuracy, such as fusing RGB and depth images. However, we find that when fusing two modals of data with untrustworthy features, the performance of multi-modal networks could be degraded, even lower than those using a single modality. In this paper, the untrustworthy features refer to those extracted from regions (e.g., far objects that are beyond the depth measurement range) with invalid depth data (i.e., 0 pixel value) in depth images. The untrustworthy features can confuse the segmentation results, and hence lead to inferior results. To provide a solution to this issue, we propose the Adaptive-Mask Fusion Network (AMFNet) by introducing adaptive-weight masks in the fusion module to fuse features from RGB and depth images with inconsistency. In addition, we release a large-scale RGB-depth dataset with manually-labeled ground truth based on the NPO dataset for drivable roads and negative obstacles segmentation. Extensive experimental results demonstrate that our network achieves state-of-the-art performance compared with other networks. Our code and dataset are available at: https://github.com/lab-sun/AMFNet.

Title: A Review of Panoptic Segmentation for Mobile Mapping Point Clouds. (arXiv:2304.13980v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.13980
Code URL: null
Copy Paste: [[2304.13980] A Review of Panoptic Segmentation for Mobile Mapping Point Clouds](http://arxiv.org/abs/2304.13980) #segmentation
Summary:
3D point cloud panoptic segmentation is the combined task to (i) assign each point to a semantic class and (ii) separate the points in each class into object instances. Recently there has been an increased interest in such comprehensive 3D scene understanding, building on the rapid advances of semantic segmentation due to the advent of deep 3D neural networks. Yet, to date there is very little work about panoptic segmentation of outdoor mobile-mapping data, and no systematic comparisons. The present paper tries to close that gap. It reviews the building blocks needed to assemble a panoptic segmentation pipeline and the related literature. Moreover, a modular pipeline is set up to perform comprehensive, systematic experiments to assess the state of panoptic segmentation in the context of street mapping. As a byproduct, we also provide the first public dataset for that task, by extending the NPM3D dataset to include instance labels.

Title: COSST: Multi-organ Segmentation with Partially Labeled Datasets Using Comprehensive Supervisions and Self-training. (arXiv:2304.14030v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.14030
Code URL: null
Copy Paste: [[2304.14030] COSST: Multi-organ Segmentation with Partially Labeled Datasets Using Comprehensive Supervisions and Self-training](http://arxiv.org/abs/2304.14030) #segmentation
Summary:
Deep learning models have demonstrated remarkable success in multi-organ segmentation but typically require large-scale datasets with all organs of interest annotated. However, medical image datasets are often low in sample size and only partially labeled, i.e., only a subset of organs are annotated. Therefore, it is crucial to investigate how to learn a unified model on the available partially labeled datasets to leverage their synergistic potential. In this paper, we empirically and systematically study the partial-label segmentation with in-depth analyses on the existing approaches and identify three distinct types of supervision signals, including two signals derived from ground truth and one from pseudo label. We propose a novel training framework termed COSST, which effectively and efficiently integrates comprehensive supervision signals with self-training. Concretely, we first train an initial unified model using two ground truth-based signals and then iteratively incorporate the pseudo label signal to the initial model using self-training. To mitigate performance degradation caused by unreliable pseudo labels, we assess the reliability of pseudo labels via outlier detection in latent space and exclude the most unreliable pseudo labels from each self-training iteration. Extensive experiments are conducted on six CT datasets for three partial-label segmentation tasks. Experimental results show that our proposed COSST achieves significant improvement over the baseline method, i.e., individual networks trained on each partially labeled dataset. Compared to the state-of-the-art partial-label segmentation methods, COSST demonstrates consistent superior performance on various segmentation tasks and with different training data size.

Title: Human Semantic Segmentation using Millimeter-Wave Radar Sparse Point Clouds. (arXiv:2304.14132v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.14132
Code URL: null
Copy Paste: [[2304.14132] Human Semantic Segmentation using Millimeter-Wave Radar Sparse Point Clouds](http://arxiv.org/abs/2304.14132) #segmentation
Summary:
This paper presents a framework for semantic segmentation on sparse sequential point clouds of millimeter-wave radar. Compared with cameras and lidars, millimeter-wave radars have the advantage of not revealing privacy, having a strong anti-interference ability, and having long detection distance. The sparsity and capturing temporal-topological features of mmWave data is still a problem. However, the issue of capturing the temporal-topological coupling features under the human semantic segmentation task prevents previous advanced segmentation methods (e.g PointNet, PointCNN, Point Transformer) from being well utilized in practical scenarios. To address the challenge caused by the sparsity and temporal-topological feature of the data, we (i) introduce graph structure and topological features to the point cloud, (ii) propose a semantic segmentation framework including a global feature-extracting module and a sequential feature-extracting module. In addition, we design an efficient and more fitting loss function for a better training process and segmentation results based on graph clustering. Experimentally, we deploy representative semantic segmentation algorithms (Transformer, GCNN, etc.) on a custom dataset. Experimental results indicate that our model achieves mean accuracy on the custom dataset by $\mathbf{82.31}\%$ and outperforms the state-of-the-art algorithms. Moreover, to validate the model's robustness, we deploy our model on the well-known S3DIS dataset. On the S3DIS dataset, our model achieves mean accuracy by $\mathbf{92.6}\%$, outperforming baseline algorithms.

Title: EDAPS: Enhanced Domain-Adaptive Panoptic Segmentation. (arXiv:2304.14291v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.14291
Code URL: https://github.com/susaha/edaps
Copy Paste: [[2304.14291] EDAPS: Enhanced Domain-Adaptive Panoptic Segmentation](http://arxiv.org/abs/2304.14291) #segmentation
Summary:
With autonomous industries on the rise, domain adaptation of the visual perception stack is an important research direction due to the cost savings promise. Much prior art was dedicated to domain-adaptive semantic segmentation in the synthetic-to-real context. Despite being a crucial output of the perception stack, panoptic segmentation has been largely overlooked by the domain adaptation community. Therefore, we revisit well-performing domain adaptation strategies from other fields, adapt them to panoptic segmentation, and show that they can effectively enhance panoptic domain adaptation. Further, we study the panoptic network design and propose a novel architecture (EDAPS) designed explicitly for domain-adaptive panoptic segmentation. It uses a shared, domain-robust transformer encoder to facilitate the joint adaptation of semantic and instance features, but task-specific decoders tailored for the specific requirements of both domain-adaptive semantic and instance segmentation. As a result, the performance gap seen in challenging panoptic benchmarks is substantially narrowed. EDAPS significantly improves the state-of-the-art performance for panoptic segmentation UDA by a large margin of 25% on SYNTHIA-to-Cityscapes and even 72% on the more challenging SYNTHIA-to-Mapillary Vistas. The implementation is available at https://github.com/susaha/edaps.

Title: Instance Segmentation in the Dark. (arXiv:2304.14298v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.14298
Code URL: https://github.com/linwei-chen/lis
Copy Paste: [[2304.14298] Instance Segmentation in the Dark](http://arxiv.org/abs/2304.14298) #segmentation
Summary:
Existing instance segmentation techniques are primarily tailored for high-visibility inputs, but their performance significantly deteriorates in extremely low-light environments. In this work, we take a deep look at instance segmentation in the dark and introduce several techniques that substantially boost the low-light inference accuracy. The proposed method is motivated by the observation that noise in low-light images introduces high-frequency disturbances to the feature maps of neural networks, thereby significantly degrading performance. To suppress this ``feature noise", we propose a novel learning method that relies on an adaptive weighted downsampling layer, a smooth-oriented convolutional block, and disturbance suppression learning. These components effectively reduce feature noise during downsampling and convolution operations, enabling the model to learn disturbance-invariant features. Furthermore, we discover that high-bit-depth RAW images can better preserve richer scene information in low-light conditions compared to typical camera sRGB outputs, thus supporting the use of RAW-input algorithms. Our analysis indicates that high bit-depth can be critical for low-light instance segmentation. To mitigate the scarcity of annotated RAW datasets, we leverage a low-light RAW synthetic pipeline to generate realistic low-light data. In addition, to facilitate further research in this direction, we capture a real-world low-light instance segmentation dataset comprising over two thousand paired low/normal-light images with instance-level pixel-wise annotations. Remarkably, without any image preprocessing, we achieve satisfactory performance on instance segmentation in very low light (4~\% AP higher than state-of-the-art competitors), meanwhile opening new opportunities for future research.

Title: Neural Field Conditioning Strategies for 2D Semantic Segmentation. (arXiv:2304.14371v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.14371
Code URL: null
Copy Paste: [[2304.14371] Neural Field Conditioning Strategies for 2D Semantic Segmentation](http://arxiv.org/abs/2304.14371) #segmentation
Summary:
Neural fields are neural networks which map coordinates to a desired signal. When a neural field should jointly model multiple signals, and not memorize only one, it needs to be conditioned on a latent code which describes the signal at hand. Despite being an important aspect, there has been little research on conditioning strategies for neural fields. In this work, we explore the use of neural fields as decoders for 2D semantic segmentation. For this task, we compare three conditioning methods, simple concatenation of the latent code, Feature Wise Linear Modulation (FiLM), and Cross-Attention, in conjunction with latent codes which either describe the full image or only a local region of the image. Our results show a considerable difference in performance between the examined conditioning strategies. Furthermore, we show that conditioning via Cross-Attention achieves the best results and is competitive with a CNN-based decoder for semantic segmentation.

Title: Zero-shot Unsupervised Transfer Instance Segmentation. (arXiv:2304.14376v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.14376
Code URL: https://github.com/noelshin/zutis
Copy Paste: [[2304.14376] Zero-shot Unsupervised Transfer Instance Segmentation](http://arxiv.org/abs/2304.14376) #segmentation
Summary:
Segmentation is a core computer vision competency, with applications spanning a broad range of scientifically and economically valuable domains. To date, however, the prohibitive cost of annotation has limited the deployment of flexible segmentation models. In this work, we propose Zero-shot Unsupervised Transfer Instance Segmentation (ZUTIS), a framework that aims to meet this challenge. The key strengths of ZUTIS are: (i) no requirement for instance-level or pixel-level annotations; (ii) an ability of zero-shot transfer, i.e., no assumption on access to a target data distribution; (iii) a unified framework for semantic and instance segmentations with solid performance on both tasks compared to state-of-the-art unsupervised methods. While comparing to previous work, we show ZUTIS achieves a gain of 2.2 mask AP on COCO-20K and 14.5 mIoU on ImageNet-S with 919 categories for instance and semantic segmentations, respectively. The code is made publicly available.