diffusion

Title: Text-to-Image Models for Counterfactual Explanations: a Black-Box Approach. (arXiv:2309.07944v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.07944
Code URL: null
Copy Paste: [[2309.07944]] Text-to-Image Models for Counterfactual Explanations: a Black-Box Approach(http://arxiv.org/abs/2309.07944)
Summary:
This paper addresses the challenge of generating Counterfactual Explanations (CEs), involving the identification and modification of the fewest necessary features to alter a classifier's prediction for a given image. Our proposed method, Text-to-Image Models for Counterfactual Explanations (TIME), is a black-box counterfactual technique based on distillation. Unlike previous methods, this approach requires solely the image and its prediction, omitting the need for the classifier's structure, parameters, or gradients. Before generating the counterfactuals, TIME introduces two distinct biases into Stable Diffusion in the form of textual embeddings: the context bias, associated with the image's structure, and the class bias, linked to class-specific features learned by the target classifier. After learning these biases, we find the optimal latent code applying the classifier's predicted class token and regenerate the image using the target embedding as conditioning, producing the counterfactual explanation. Extensive empirical studies validate that TIME can generate explanations of comparable effectiveness even when operating within a black-box setting.

Title: Viewpoint Textual Inversion: Unleashing Novel View Synthesis with Pretrained 2D Diffusion Models. (arXiv:2309.07986v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.07986
Code URL: null
Copy Paste: [[2309.07986]] Viewpoint Textual Inversion: Unleashing Novel View Synthesis with Pretrained 2D Diffusion Models(http://arxiv.org/abs/2309.07986)
Summary:
Text-to-image diffusion models understand spatial relationships between objects, but do they represent the true 3D structure of the world from only 2D supervision? We demonstrate that yes, 3D knowledge is encoded in 2D image diffusion models like Stable Diffusion, and we show that this structure can be exploited for 3D vision tasks. Our method, Viewpoint Neural Textual Inversion (ViewNeTI), controls the 3D viewpoint of objects in generated images from frozen diffusion models. We train a small neural mapper to take camera viewpoint parameters and predict text encoder latents; the latents then condition the diffusion generation process to produce images with the desired camera viewpoint.

ViewNeTI naturally addresses Novel View Synthesis (NVS). By leveraging the frozen diffusion model as a prior, we can solve NVS with very few input views; we can even do single-view novel view synthesis. Our single-view NVS predictions have good semantic details and photorealism compared to prior methods. Our approach is well suited for modeling the uncertainty inherent in sparse 3D vision problems because it can efficiently generate diverse samples. Our view-control mechanism is general, and can even change the camera view in images generated by user-defined prompts.

Title: Detail Reinforcement Diffusion Model: Augmentation Fine-Grained Visual Categorization in Few-Shot Conditions. (arXiv:2309.08097v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.08097
Code URL: null
Copy Paste: [[2309.08097]] Detail Reinforcement Diffusion Model: Augmentation Fine-Grained Visual Categorization in Few-Shot Conditions(http://arxiv.org/abs/2309.08097)
Summary:
The challenge in fine-grained visual categorization lies in how to explore the subtle differences between different subclasses and achieve accurate discrimination. Previous research has relied on large-scale annotated data and pre-trained deep models to achieve the objective. However, when only a limited number of samples is available, similar methods may become less effective. Diffusion models have been widely adopted in data augmentation due to their outstanding diversity in data generation. However, the high level of detail required for fine-grained images makes it challenging for existing methods to be directly employed. To address this issue, we propose a novel approach termed the detail reinforcement diffusion model (DRDM), which leverages the rich knowledge of large models for fine-grained data augmentation and comprises two key components: discriminative semantic recombination (DSR) and spatial knowledge reference (SKR). Specifically, DSR is designed to extract implicit similarity relationships from the labels and reconstruct the semantic mapping between labels and instances, which enables better discrimination of subtle differences between different subclasses. Furthermore, we introduce the SKR module, which incorporates the distributions of different datasets as references in the feature space. This allows the SKR to aggregate the high-dimensional distribution of subclass features in few-shot FGVC tasks, thus expanding the decision boundary. Through these two critical components, we effectively utilize the knowledge from large models to address the issue of data scarcity, resulting in improved performance for fine-grained visual recognition tasks. Extensive experiments demonstrate the consistent performance gain offered by our DRDM.

Title: Cartoondiff: Training-free Cartoon Image Generation with Diffusion Transformer Models. (arXiv:2309.08251v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.08251
Code URL: null
Copy Paste: [[2309.08251]] Cartoondiff: Training-free Cartoon Image Generation with Diffusion Transformer Models(http://arxiv.org/abs/2309.08251)
Summary:
Image cartoonization has attracted significant interest in the field of image generation. However, most of the existing image cartoonization techniques require re-training models using images of cartoon style. In this paper, we present CartoonDiff, a novel training-free sampling approach which generates image cartoonization using diffusion transformer models. Specifically, we decompose the reverse process of diffusion models into the semantic generation phase and the detail generation phase. Furthermore, we implement the image cartoonization process by normalizing high-frequency signal of the noisy image in specific denoising steps. CartoonDiff doesn't require any additional reference images, complex model designs, or the tedious adjustment of multiple parameters. Extensive experimental results show the powerful ability of our CartoonDiff. The project page is available at: https://cartoondiff.github.io/

Title: Unsupervised Disentangling of Facial Representations with 3D-aware Latent Diffusion Models. (arXiv:2309.08273v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.08273
Code URL: null
Copy Paste: [[2309.08273]] Unsupervised Disentangling of Facial Representations with 3D-aware Latent Diffusion Models(http://arxiv.org/abs/2309.08273)
Summary:
Unsupervised learning of facial representations has gained increasing attention for face understanding ability without heavily relying on large-scale annotated datasets. However, it remains unsolved due to the coupling of facial identities, expressions, and external factors like pose and light. Prior methods primarily focus on 2D factors and pixel-level consistency, leading to incomplete disentangling and suboptimal performance in downstream tasks. In this paper, we propose LatentFace, a novel unsupervised disentangling framework for facial expression and identity representation. We suggest the disentangling problem should be performed in latent space and propose the solution using a 3D-aware latent diffusion model. First, we introduce a 3D-aware autoencoder to encode face images into 3D latent embeddings. Second, we propose a novel representation diffusion model (RDM) to disentangle the 3D latent into facial identity and expression. Consequently, our method achieves state-of-the-art performance in facial expression recognition and face verification among unsupervised facial representation learning models.

Title: Large Intestine 3D Shape Refinement Using Point Diffusion Models for Digital Phantom Generation. (arXiv:2309.08289v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.08289
Code URL: null
Copy Paste: [[2309.08289]] Large Intestine 3D Shape Refinement Using Point Diffusion Models for Digital Phantom Generation(http://arxiv.org/abs/2309.08289)
Summary:
Accurate 3D modeling of human organs plays a crucial role in building computational phantoms for virtual imaging trials. However, generating anatomically plausible reconstructions of organ surfaces from computed tomography scans remains challenging for many structures in the human body. This challenge is particularly evident when dealing with the large intestine. In this study, we leverage recent advancements in geometric deep learning and denoising diffusion probabilistic models to refine the segmentation results of the large intestine. We begin by representing the organ as point clouds sampled from the surface of the 3D segmentation mask. Subsequently, we employ a hierarchical variational autoencoder to obtain global and local latent representations of the organ's shape. We train two conditional denoising diffusion models in the hierarchical latent space to perform shape refinement. To further enhance our method, we incorporate a state-of-the-art surface reconstruction model, allowing us to generate smooth meshes from the obtained complete point clouds. Experimental results demonstrate the effectiveness of our approach in capturing both the global distribution of the organ's shape and its fine details. Our complete refinement pipeline demonstrates remarkable enhancements in surface representation compared to the initial segmentation, reducing the Chamfer distance by 70%, the Hausdorff distance by 32%, and the Earth Mover's distance by 6%. By combining geometric deep learning, denoising diffusion models, and advanced surface reconstruction techniques, our proposed method offers a promising solution for accurately modeling the large intestine's surface and can easily be extended to other anatomical structures.
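
The Chamfer distance cited above can be illustrated with a minimal sketch. It assumes one common definition (mean nearest-neighbour Euclidean distance in each direction, summed); the abstract does not say which variant the authors use, and other works use squared distances or sums. The point clouds below are toy values, not data from the paper.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two non-empty point clouds.

    Assumed variant: mean nearest-neighbour Euclidean distance from a to b
    plus the same quantity from b to a.
    """
    def one_way(src, dst):
        # For every source point, find the distance to its closest target point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


# Toy example: a reference surface sample, a coarse estimate, and a refined one.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
coarse = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
refined = [(0.0, 0.0, 0.25), (1.0, 0.0, 0.25)]

before = chamfer_distance(coarse, reference)
after = chamfer_distance(refined, reference)
relative_reduction = 1.0 - after / before  # fraction by which the distance shrank
```

A relative reduction computed this way is how a statement such as "reducing the Chamfer distance by 70%" is usually read, though the exact evaluation protocol is not given in the abstract.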

self-supervised

Title: DA-RAW: Domain Adaptive Object Detection for Real-World Adverse Weather Conditions. (arXiv:2309.08152v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.08152
Code URL: null
Copy Paste: [[2309.08152]] DA-RAW: Domain Adaptive Object Detection for Real-World Adverse Weather Conditions(http://arxiv.org/abs/2309.08152)
Summary:
Despite the success of deep learning-based object detection methods in recent years, it is still challenging to make the object detector reliable in adverse weather conditions such as rain and snow. For the robust performance of object detectors, unsupervised domain adaptation has been utilized to adapt the detection network trained on clear weather images to adverse weather images. While previous methods do not explicitly address weather corruption during adaptation, the domain gap between clear and adverse weather can be decomposed into two factors with distinct characteristics: a style gap and a weather gap. In this paper, we present an unsupervised domain adaptation framework for object detection that can more effectively adapt to real-world environments with adverse weather conditions by addressing these two gaps separately. Our method resolves the style gap by concentrating on style-related information of high-level features using an attention module. Using self-supervised contrastive learning, our framework then reduces the weather gap and acquires instance features that are robust to weather corruption. Extensive experiments demonstrate that our method outperforms other methods for object detection in adverse weather conditions.

Title: Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens. (arXiv:2309.08531v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.08531
Code URL: null
Copy Paste: [[2309.08531]] Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens(http://arxiv.org/abs/2309.08531)
Summary:
In this paper, we propose methods to build a powerful and efficient Image-to-Speech captioning (Im2Sp) model. To this end, we start with importing the rich knowledge related to image comprehension and language modeling from a large-scale pre-trained vision-language model into Im2Sp. We set the output of the proposed Im2Sp as discretized speech units, i.e., the quantized speech features of a self-supervised speech model. The speech units mainly contain linguistic information while suppressing other characteristics of speech. This allows us to incorporate the language modeling capability of the pre-trained vision-language model into the spoken language modeling of Im2Sp. With the vision-language pre-training strategy, we set new state-of-the-art Im2Sp performances on two widely used benchmark databases, COCO and Flickr8k. Then, we further improve the efficiency of the Im2Sp model. Similar to the speech unit case, we convert the original image into image units, which are derived through vector quantization of the raw image. With these image units, we can drastically reduce the required data storage for saving image data to just 0.8% when compared to the original image data in terms of bits. Demo page: https://ms-dot-k.github.io/Image-to-Speech-Captioning.
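
As a rough illustration of how continuous features become discrete "units" through vector quantization, the sketch below assigns each feature vector to its nearest codebook entry. The codebook, feature values, and function name are hypothetical; the abstract does not describe the actual quantizer or codebook size.

```python
import math


def quantize(vectors, codebook):
    """Return, for each vector, the index of its nearest codebook entry (squared L2)."""
    def sq_dist(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(vec, codebook[k]))
        for vec in vectors
    ]


# Hypothetical 2-D codebook with three entries and three feature vectors.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
features = [(0.1, -0.1), (0.9, 1.2), (0.2, 0.8)]

units = quantize(features, codebook)

# Each unit can be stored as an index, which needs ceil(log2(K)) bits for K entries.
bits_per_unit = math.ceil(math.log2(len(codebook)))
```

Storing indices instead of raw values is the general reason quantized units are compact; the specific 0.8% figure depends on details not stated here.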

Title: Structural Self-Supervised Objectives for Transformers. (arXiv:2309.08272v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2309.08272
Code URL: https://github.com/lucadiliello/transformers-framework
Copy Paste: [[2309.08272]] Structural Self-Supervised Objectives for Transformers(http://arxiv.org/abs/2309.08272)
Summary:
This thesis focuses on improving the pre-training of natural language models using unsupervised raw data to make them more efficient and aligned with downstream applications.

In the first part, we introduce three alternative pre-training objectives to BERT's Masked Language Modeling (MLM), namely Random Token Substitution (RTS), Cluster-based Random Token Substitution (C-RTS), and Swapped Language Modeling (SLM). These objectives involve token swapping instead of masking, with RTS and C-RTS aiming to predict token originality and SLM predicting the original token values. Results show that RTS and C-RTS require less pre-training time while maintaining performance comparable to MLM. Surprisingly, SLM outperforms MLM on certain tasks despite using the same computational budget.
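
A minimal sketch of the data corruption behind an RTS-style objective, under stated assumptions: the substitution probability, the uniform choice of replacement, and the rule that a replacement always differs from the original are illustrative choices, not details taken from the thesis. The function name is hypothetical.

```python
import random


def random_token_substitution(tokens, vocab, prob, rng):
    """Replace some tokens with random vocabulary items.

    Returns the corrupted sequence and binary labels
    (1 = token was substituted, 0 = token is original).
    """
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < prob:
            # Assumption: draw uniformly from the vocabulary, excluding the original.
            corrupted.append(rng.choice([v for v in vocab if v != tok]))
            labels.append(1)
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels


vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
tokens = ["the", "cat", "sat", "on", "the", "mat"]
corrupted, labels = random_token_substitution(
    tokens, vocab, prob=0.15, rng=random.Random(0)
)
```

A classifier trained on `labels` predicts token originality, as described for RTS and C-RTS. For an SLM-style objective, the target at a substituted position would instead be the original token.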

In the second part, we propose self-supervised pre-training tasks that align structurally with downstream applications, reducing the need for labeled data. We use large corpora like Wikipedia and CC-News to train models to recognize if text spans originate from the same paragraph or document in several ways. By doing continuous pre-training, starting from existing models like RoBERTa, ELECTRA, DeBERTa, BART, and T5, we demonstrate significant performance improvements in tasks like Fact Verification, Answer Sentence Selection, and Summarization. These improvements are especially pronounced when limited annotation data is available. The proposed objectives also achieve state-of-the-art results on various benchmark datasets, including FEVER (dev set), ASNQ, WikiQA, and TREC-QA, as well as enhancing the quality of summaries. Importantly, these techniques can be easily integrated with other methods without altering the internal structure of Transformer models, making them versatile for various NLP applications.

Title: Headless Language Models: Learning without Predicting with Contrastive Weight Tying. (arXiv:2309.08351v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2309.08351
Code URL: null
Copy Paste: [[2309.08351]] Headless Language Models: Learning without Predicting with Contrastive Weight Tying(http://arxiv.org/abs/2309.08351)
Summary:
Self-supervised pre-training of language models usually consists in predicting probability distributions over extensive token vocabularies. In this study, we propose an innovative method that shifts away from probability prediction and instead focuses on reconstructing input embeddings in a contrastive fashion via Contrastive Weight Tying (CWT). We apply this approach to pretrain Headless Language Models in both monolingual and multilingual contexts. Our method offers practical advantages, substantially reducing training computational requirements by up to 20 times, while simultaneously enhancing downstream performance and data efficiency. We observe a significant +1.6 GLUE score increase and a notable +2.7 LAMBADA accuracy improvement compared to classical LMs within similar compute budgets.
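
One way to read "reconstructing input embeddings in a contrastive fashion" is an in-batch contrastive loss in which each output vector should score highest against the input embedding of its own target token. The sketch below is an assumption about that general shape, not the authors' implementation; the function name, the dot-product scoring, and the toy vectors are all hypothetical.

```python
import math


def contrastive_embedding_loss(outputs, target_ids, embeddings):
    """Mean in-batch contrastive loss.

    outputs[i] is the model's output vector at position i, target_ids[i] the
    token it should reconstruct, and embeddings the input embedding table.
    The other targets in the batch act as negatives (duplicate target ids in
    a batch are not handled specially in this sketch).
    """
    targets = [embeddings[t] for t in target_ids]
    total = 0.0
    for i, out in enumerate(outputs):
        scores = [sum(o * e for o, e in zip(out, tgt)) for tgt in targets]
        peak = max(scores)  # subtract the max for a stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[i]
    return total / len(outputs)


embeddings = [(1.0, 0.0), (0.0, 1.0)]       # toy input embedding table
aligned = [(5.0, 0.0), (0.0, 5.0)]          # outputs pointing at their targets
misaligned = [(0.0, 5.0), (5.0, 0.0)]       # outputs pointing at the wrong targets

low = contrastive_embedding_loss(aligned, [0, 1], embeddings)
high = contrastive_embedding_loss(misaligned, [0, 1], embeddings)
```

Because the negatives come from the batch rather than the whole vocabulary, a loss of this kind avoids a projection onto every token, which is consistent with the compute savings the abstract reports.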

Title: Supervised Stochastic Neighbor Embedding Using Contrastive Learning. (arXiv:2309.08077v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.08077
Code URL: https://github.com/imyizhang/manifold-learn
Copy Paste: [[2309.08077]] Supervised Stochastic Neighbor Embedding Using Contrastive Learning(http://arxiv.org/abs/2309.08077)
Summary:
Stochastic neighbor embedding (SNE) methods such as t-SNE and UMAP are two of the most popular dimensionality reduction methods for data visualization. Contrastive learning, especially self-supervised contrastive learning (SSCL), has shown great success in embedding features from unlabeled data. The conceptual connection between SNE and SSCL has been exploited. In this work, within the scope of preserving neighboring information of a dataset, we extend the self-supervised contrastive approach to the fully-supervised setting, allowing us to effectively leverage label information. Clusters of samples belonging to the same class are pulled together in low-dimensional embedding space, while simultaneously pushing apart clusters of samples from different classes.

Title: Understanding the limitations of self-supervised learning for tabular anomaly detection. (arXiv:2309.08374v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.08374
Code URL: null
Copy Paste: [[2309.08374]] Understanding the limitations of self-supervised learning for tabular anomaly detection(http://arxiv.org/abs/2309.08374)
Summary:
While self-supervised learning has improved anomaly detection in computer vision and natural language processing, it is unclear whether tabular data can benefit from it. This paper explores the limitations of self-supervision for tabular anomaly detection. We conduct several experiments spanning various pretext tasks on 26 benchmark datasets to understand why this is the case. Our results confirm representations derived from self-supervision do not improve tabular anomaly detection performance compared to using the raw representations of the data. We show this is due to neural networks introducing irrelevant features, which reduces the effectiveness of anomaly detectors. However, we demonstrate that using a subspace of the neural network's representation can recover performance.

foundation model

Title: Prompting Segmentation with Sound is Generalizable Audio-Visual Source Localizer. (arXiv:2309.07929v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.07929
Code URL: null
Copy Paste: [[2309.07929]] Prompting Segmentation with Sound is Generalizable Audio-Visual Source Localizer(http://arxiv.org/abs/2309.07929)
Summary:
Never having seen an object and heard its sound simultaneously, can the model still accurately localize its visual position from the input audio? In this work, we concentrate on the Audio-Visual Localization and Segmentation tasks but under the demanding zero-shot and few-shot scenarios. To achieve this goal, different from existing approaches that mostly employ the encoder-fusion-decoder paradigm to decode localization information from the fused audio-visual feature, we introduce the encoder-prompt-decoder paradigm, aiming to better fit the data scarcity and varying data distribution dilemmas with the help of abundant knowledge from pre-trained models. Specifically, we first propose to construct Semantic-aware Audio Prompt (SAP) to help the visual foundation model focus on sounding objects, meanwhile, the semantic gap between the visual and audio modalities is also encouraged to shrink. Then, we develop a Correlation Adapter (ColA) to keep minimal training efforts as well as maintain adequate knowledge of the visual foundation model. By equipping with these means, extensive experiments demonstrate that this new paradigm outperforms other fusion-based methods in both the unseen class and cross-dataset settings. We hope that our work can further promote the generalization study of Audio-Visual Localization and Segmentation in practical application scenarios.

Title: BROW: Better featuRes fOr Whole slide image based on self-distillation. (arXiv:2309.08259v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.08259
Code URL: null
Copy Paste: [[2309.08259]] BROW: Better featuRes fOr Whole slide image based on self-distillation(http://arxiv.org/abs/2309.08259)
Summary:
Whole slide image (WSI) processing is becoming part of the key components of standard clinical diagnosis for various diseases. However, the direct application of conventional image processing algorithms to WSI faces certain obstacles because of WSIs' distinct property: the super-high resolution. The performance of most WSI-related tasks relies on the efficacy of the backbone which extracts WSI patch feature representations. Hence, we proposed BROW, a foundation model for extracting better feature representations for WSIs, which can be conveniently adapted to downstream tasks without or with slight fine-tuning. The model uses a transformer architecture and is pretrained using a self-distillation framework. To improve the model's robustness, techniques such as patch shuffling have been employed. Additionally, the model leverages the unique properties of WSIs, utilizing WSI's multi-scale pyramid to incorporate an additional global view, thereby further enhancing its performance. We used both private and public data to make up a large pretraining dataset, containing more than 11000 slides, over 180M extracted patches, encompassing WSIs related to various organs and tissues. To assess the effectiveness of BROW, we run a wide range of downstream tasks, including slide-level subtyping, patch-level classification and nuclei instance segmentation. The results confirmed the efficacy, robustness and good generalization ability of the proposed model. This substantiates its potential as a foundation model for WSI feature extraction and highlights promising prospects for its application in WSI processing.

Title: Viewpoint Integration and Registration with Vision Language Foundation Model for Image Change Understanding. (arXiv:2309.08585v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.08585
Code URL: null
Copy Paste: [[2309.08585]] Viewpoint Integration and Registration with Vision Language Foundation Model for Image Change Understanding(http://arxiv.org/abs/2309.08585)
Summary:
Recently, the development of pre-trained vision language foundation models (VLFMs) has led to remarkable performance in many tasks. However, these models tend to have strong single-image understanding capability but lack the ability to understand multiple images. Therefore, they cannot be directly applied to cope with image change understanding (ICU), which requires models to capture actual changes between multiple images and describe them in language. In this paper, we discover that existing VLFMs perform poorly when applied directly to ICU because of the following problems: (1) VLFMs generally learn the global representation of a single image, while ICU requires capturing nuances between multiple images. (2) The ICU performance of VLFMs is significantly affected by viewpoint variations, which is caused by the altered relationships between objects when viewpoint changes. To address these problems, we propose a Viewpoint Integration and Registration method. Concretely, we introduce a fused adapter image encoder that fine-tunes pre-trained encoders by inserting designed trainable adapters and fused adapters, to effectively capture nuances between image pairs. Additionally, a viewpoint registration flow and a semantic emphasizing module are designed to reduce the performance degradation caused by viewpoint variations in the visual and semantic space, respectively. Experimental results on CLEVR-Change and Spot-the-Diff demonstrate that our method achieves state-of-the-art performance in all metrics.

Title: Scaling Laws for Sparsely-Connected Foundation Models. (arXiv:2309.08520v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.08520
Code URL: null
Copy Paste: [[2309.08520]] Scaling Laws for Sparsely-Connected Foundation Models(http://arxiv.org/abs/2309.08520)
Summary:
We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets (i.e., "foundation models"), in both vision and language domains. In this setting, we identify the first scaling law describing the relationship between weight sparsity, number of non-zero parameters, and amount of training data, which we validate empirically across model and data scales on ViT/JFT-4B and T5/C4. These results allow us to characterize the "optimal sparsity", the sparsity level which yields the best performance for a given effective model size and training budget. For a fixed number of non-zero parameters, we identify that the optimal sparsity increases with the amount of data used for training. We also extend our study to different sparsity structures (such as the hardware-friendly n:m pattern) and strategies (such as starting from a pretrained dense model). Our findings shed light on the power and limitations of weight sparsity across various parameter and computational settings, offering both theoretical understanding and practical implications for leveraging sparsity towards computational efficiency improvements.
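
To make the n:m pattern concrete, here is a small sketch that enforces it on a flat list of weights. It assumes the convention that n weights are kept in every block of m consecutive weights, and it selects them by magnitude, a common heuristic that is not necessarily the pruning method used in the paper. The function name is hypothetical.

```python
def nm_sparsify(weights, n, m):
    """Keep the n largest-magnitude weights in each block of m; zero the rest."""
    if len(weights) % m != 0:
        raise ValueError("number of weights must be a multiple of m")
    out = []
    for start in range(0, len(weights), m):
        block = weights[start:start + m]
        # Indices of the n entries with the largest absolute value in this block.
        keep = set(sorted(range(m), key=lambda i: abs(block[i]), reverse=True)[:n])
        out.extend(w if i in keep else 0.0 for i, w in enumerate(block))
    return out


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
sparse = nm_sparsify(row, n=2, m=4)
# sparse == [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
```

With n=2 and m=4 exactly half of the weights in every block are zero, and the regular block structure is what makes the pattern hardware-friendly.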

Title: Compositional Foundation Models for Hierarchical Planning. (arXiv:2309.08587v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.08587
Code URL: null
Copy Paste: [[2309.08587]] Compositional Foundation Models for Hierarchical Planning(http://arxiv.org/abs/2309.08587)
Summary:
To make effective decisions in novel environments with long-horizon goals, it is crucial to engage in hierarchical reasoning across spatial and temporal scales. This entails planning abstract subgoal sequences, visually reasoning about the underlying plans, and executing actions in accordance with the devised plan through visual-motor control. We propose Compositional Foundation Models for Hierarchical Planning (HiP), a foundation model which leverages multiple expert foundation models, trained individually on language, vision and action data, jointly to solve long-horizon tasks. We use a large language model to construct symbolic plans that are grounded in the environment through a large video diffusion model. Generated video plans are then grounded to visual-motor control, through an inverse dynamics model that infers actions from generated videos. To enable effective reasoning within this hierarchy, we enforce consistency between the models via iterative refinement. We illustrate the efficacy and adaptability of our approach in three different long-horizon table-top manipulation tasks.

generative

Title: Breathing New Life into 3D Assets with Generative Repainting. (arXiv:2309.08523v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.08523
Code URL: https://github.com/toshas/remesh_isotropic_planar
Copy Paste: [[2309.08523]] Breathing New Life into 3D Assets with Generative Repainting(http://arxiv.org/abs/2309.08523)
Summary:
Diffusion-based text-to-image models ignited immense attention from the vision community, artists, and content creators. Broad adoption of these models is due to significant improvement in the quality of generations and efficient conditioning on various modalities, not just text. However, lifting the rich generative priors of these 2D models into 3D is challenging. Recent works have proposed various pipelines powered by the entanglement of diffusion models and neural fields. We explore the power of pretrained 2D diffusion models and standard 3D neural radiance fields as independent, standalone tools and demonstrate their ability to work together in a non-learned fashion. Such modularity has the intrinsic advantage of eased partial upgrades, which became an important property in such a fast-paced domain. Our pipeline accepts any legacy renderable geometry, such as textured or untextured meshes, orchestrates the interaction between 2D generative refinement and 3D consistency enforcement tools, and outputs a painted input geometry in several formats. We conduct a large-scale study on a wide range of objects and categories from the ShapeNetSem dataset and demonstrate the advantages of our approach, both qualitatively and quantitatively. Project page: https://www.obukhov.ai/repainting_3d_assets

Title: An Empirical Evaluation of Prompting Strategies for Large Language Models in Zero-Shot Clinical Natural Language Processing. (arXiv:2309.08008v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2309.08008
Code URL: null
Copy Paste: [[2309.08008]] An Empirical Evaluation of Prompting Strategies for Large Language Models in Zero-Shot Clinical Natural Language Processing(http://arxiv.org/abs/2309.08008)
Summary:
Large language models (LLMs) have shown remarkable capabilities in Natural Language Processing (NLP), especially in domains where labeled data is scarce or expensive, such as the clinical domain. However, to unlock the clinical knowledge hidden in these LLMs, we need to design effective prompts that can guide them to perform specific clinical NLP tasks without any task-specific training data. This is known as in-context learning, which is an art and science that requires understanding the strengths and weaknesses of different LLMs and prompt engineering approaches. In this paper, we present a comprehensive and systematic experimental study on prompt engineering for five clinical NLP tasks: Clinical Sense Disambiguation, Biomedical Evidence Extraction, Coreference Resolution, Medication Status Extraction, and Medication Attribute Extraction. We assessed the prompts proposed in recent literature, including simple prefix, simple cloze, chain of thought, and anticipatory prompts, and introduced two new types of prompts, namely heuristic prompting and ensemble prompting. We evaluated the performance of these prompts on three state-of-the-art LLMs: GPT-3.5, BARD, and LLAMA2. We also contrasted zero-shot prompting with few-shot prompting, and provide novel insights and guidelines for prompt engineering for LLMs in clinical NLP. To the best of our knowledge, this is one of the first works on the empirical evaluation of different prompt engineering approaches for clinical NLP in this era of generative AI, and we hope that it will inspire and inform future research in this area.

Title: Reward Engineering for Generating Semi-structured Explanation. (arXiv:2309.08347v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2309.08347
Code URL: https://github.com/jiuzhouh/reward-engineering-for-generating-seg
Copy Paste: [[2309.08347]] Reward Engineering for Generating Semi-structured Explanation(http://arxiv.org/abs/2309.08347)
Summary:
Semi-structured explanation depicts the implicit process of a reasoner with an explicit representation. This explanation highlights how available information in a specific query is supplemented with information a reasoner produces from its internal weights towards generating an answer. Despite the recent improvements in generative capabilities of language models, producing structured explanations to verify a model's true reasoning capabilities remains a challenge. This issue is particularly pronounced for not-so-large LMs, as the reasoner is expected to couple a sequential answer with a structured explanation which embodies both the correct presentation and the correct reasoning process. In this work, we first underscore the limitations of supervised fine-tuning (SFT) in tackling this challenge, and then introduce a carefully crafted reward engineering method in reinforcement learning (RL) to better address this problem. We investigate multiple reward aggregation methods and provide a detailed discussion which sheds light on the promising potential of RL for future research. Our proposed reward on two semi-structured explanation generation benchmarks (ExplaGraph and COPA-SSE) achieves new state-of-the-art results.

Title: Masked Generative Modeling with Enhanced Sampling Scheme. (arXiv:2309.07945v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.07945
Code URL: null
Copy Paste: [[2309.07945]] Masked Generative Modeling with Enhanced Sampling Scheme(http://arxiv.org/abs/2309.07945)
Summary:
This paper presents a novel sampling scheme for masked non-autoregressive generative modeling. We identify the limitations of TimeVQVAE, MaskGIT, and Token-Critic in their sampling processes, and propose Enhanced Sampling Scheme (ESS) to overcome these limitations. ESS explicitly ensures both sample diversity and fidelity, and consists of three stages: Naive Iterative Decoding, Critical Reverse Sampling, and Critical Resampling. ESS starts by sampling a token set using the naive iterative decoding proposed in MaskGIT, ensuring sample diversity. Then, the token set undergoes critical reverse sampling, which masks tokens that lead to unrealistic samples. After that, critical resampling reconstructs the masked tokens until the final sampling step is reached, ensuring high fidelity. Critical resampling uses confidence scores obtained from a self-Token-Critic to better measure the realism of sampled tokens, while critical reverse sampling uses the structure of the quantized latent vector space to discover unrealistic sample paths. We demonstrate significant performance gains of ESS in both unconditional sampling and class-conditional sampling on all 128 datasets in the UCR Time Series archive.

Title: An Automated Machine Learning Approach for Detecting Anomalous Peak Patterns in Time Series Data from a Research Watershed in the Northeastern United States Critical Zone. (arXiv:2309.07992v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.07992
Code URL: null
Copy Paste: [[2309.07992]] An Automated Machine Learning Approach for Detecting Anomalous Peak Patterns in Time Series Data from a Research Watershed in the Northeastern United States Critical Zone(http://arxiv.org/abs/2309.07992)
Summary:
This paper presents an automated machine learning framework designed to assist hydrologists in detecting anomalies in time series data generated by sensors in a research watershed in the northeastern United States critical zone. The framework specifically focuses on identifying peak-pattern anomalies, which may arise from sensor malfunctions or natural phenomena. However, the use of classification methods for anomaly detection poses challenges, such as the requirement for labeled data as ground truth and the selection of the most suitable deep learning model for the given task and dataset. To address these challenges, our framework generates labeled datasets by injecting synthetic peak patterns into synthetically generated time series data and incorporates an automated hyperparameter optimization mechanism. This mechanism generates an optimized model instance with the best architectural and training parameters from a pool of five selected models, namely Temporal Convolutional Network (TCN), InceptionTime, MiniRocket, Residual Networks (ResNet), and Long Short-Term Memory (LSTM). The selection is based on the user's preferences regarding anomaly detection accuracy and computational cost. The framework employs Time-series Generative Adversarial Networks (TimeGAN) as the synthetic dataset generator. The generated model instances are evaluated using a combination of accuracy and computational cost metrics, including training time and memory, during the anomaly detection process. Performance evaluation of the framework was conducted using a dataset from a watershed, demonstrating consistent selection of the most fitting model instance that satisfies the user's preferences.
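
The preference-based selection described above could look roughly like the following. This is a minimal sketch under stated assumptions, not the paper's procedure: the `Candidate` fields, the `accuracy_weight` parameter, the equal weighting of training time and memory, and all numbers are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    accuracy: float       # in [0, 1], higher is better
    train_seconds: float  # lower is better
    memory_mb: float      # lower is better


def select_model(candidates: List[Candidate], accuracy_weight: float) -> Candidate:
    """Pick the candidate with the best accuracy/cost trade-off.

    accuracy_weight in [0, 1] is an assumed user preference:
    1.0 means "only accuracy matters", 0.0 means "only cost matters".
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    if not 0.0 <= accuracy_weight <= 1.0:
        raise ValueError("accuracy_weight must be in [0, 1]")

    # Normalise each cost by the largest value seen so both lie in [0, 1].
    max_time = max(c.train_seconds for c in candidates) or 1.0
    max_mem = max(c.memory_mb for c in candidates) or 1.0

    def score(c: Candidate) -> float:
        cost = 0.5 * (c.train_seconds / max_time) + 0.5 * (c.memory_mb / max_mem)
        return accuracy_weight * c.accuracy - (1.0 - accuracy_weight) * cost

    return max(candidates, key=score)


# Placeholder numbers, not results from the paper.
pool = [
    Candidate("model_a", accuracy=0.95, train_seconds=100.0, memory_mb=800.0),
    Candidate("model_b", accuracy=0.90, train_seconds=10.0, memory_mb=200.0),
    Candidate("model_c", accuracy=0.93, train_seconds=60.0, memory_mb=400.0),
]
accuracy_first = select_model(pool, accuracy_weight=1.0)
cost_aware = select_model(pool, accuracy_weight=0.2)
```

With these placeholder values, an accuracy-only preference picks `model_a`, while a cost-heavy preference picks the cheaper `model_b`.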

anomaly

in-context

Title: LASER: LLM Agent with State-Space Exploration for Web Navigation. (arXiv:2309.08172v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2309.08172
Code URL: null
Copy Paste: [[2309.08172]] LASER: LLM Agent with State-Space Exploration for Web Navigation(http://arxiv.org/abs/2309.08172)
Summary:
Large language models (LLMs) have been successfully adapted for interactive decision-making tasks like web navigation. While achieving decent performance, previous methods implicitly assume a forward-only execution mode for the model, where they only provide oracle trajectories as in-context examples to teach the model how to reason in the interactive environment. Consequently, the model could not handle more challenging scenarios not covered in the in-context examples, e.g., mistakes, leading to sub-optimal performance. To address this issue, we propose to model the interactive task as state space exploration, where the LLM agent transitions among a pre-defined set of states by performing actions to complete the task. This formulation enables flexible back-tracking, allowing the model to easily recover from errors. We evaluate our proposed LLM Agent with State-Space ExploRation (LASER) on the WebShop task. Experimental results show that our LASER agent significantly outperforms previous methods and closes the gap with human performance on the web navigation task.
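
The state-space formulation with back-tracking can be illustrated with a toy example. This is only a sketch under assumptions: the state names, the action names, and the scripted policy standing in for the LLM are all hypothetical and are not taken from the paper or from WebShop.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical state space: each state maps its allowed actions to a next state.
# The "back_*" actions are what make recovery from a wrong choice possible.
TRANSITIONS: Dict[str, Dict[str, str]] = {
    "search": {"submit_query": "results"},
    "results": {"open_item": "item", "back_to_search": "search"},
    "item": {"buy": "done", "back_to_results": "results"},
}

History = List[Tuple[str, str]]
Policy = Callable[[str, List[str], History], str]


def run_episode(policy: Policy, max_steps: int = 10) -> Tuple[str, History]:
    """Let the policy move through the state space until it reaches 'done'."""
    state = "search"
    history: History = []
    for _ in range(max_steps):
        if state == "done":
            break
        allowed = sorted(TRANSITIONS[state])
        action = policy(state, allowed, history)
        if action not in allowed:
            raise ValueError(f"action {action!r} not allowed in state {state!r}")
        history.append((state, action))
        state = TRANSITIONS[state][action]
    return state, history


def scripted_policy(state: str, allowed: List[str], history: History) -> str:
    """Stand-in for the LLM: rejects the first item it opens, buys the second."""
    if state == "search":
        return "submit_query"
    if state == "results":
        return "open_item"
    items_seen = sum(1 for s, _ in history if s == "item")
    return "back_to_results" if items_seen == 0 else "buy"


final_state, trace = run_episode(scripted_policy)
```

Here the agent opens an item, backs out to the results page, opens another item, and then buys, so the trace contains five state-action pairs. A forward-only agent would have no `back_to_results` action and would be stuck with its first choice.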

Title: Bridging Topic, Domain, and Language Shifts: An Evaluation of Comprehensive Out-of-Distribution Scenarios. (arXiv:2309.08316v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2309.08316
Code URL: null
Copy Paste: [[2309.08316]] Bridging Topic, Domain, and Language Shifts: An Evaluation of Comprehensive Out-of-Distribution Scenarios(http://arxiv.org/abs/2309.08316)
Summary:
Language models (LMs) excel in in-distribution (ID) scenarios where train and test data are independent and identically distributed. However, their performance often degrades in real-world applications like argument mining. Such degradation happens when new topics emerge, or other text domains and languages become relevant. To assess LMs' generalization abilities in such out-of-distribution (OOD) scenarios, we simulate these distribution shifts by deliberately withholding specific instances for testing, such as those from the social media domain or on the topic of Solar Energy.

Unlike prior studies focusing on specific shifts and metrics in isolation, we comprehensively analyze OOD generalization. We define three metrics to pinpoint generalization flaws and propose eleven classification tasks covering topic, domain, and language shifts. Overall, we find superior performance of prompt-based fine-tuning, notably when train and test splits primarily differ semantically. Simultaneously, in-context learning is more effective than prompt-based or vanilla fine-tuning for tasks where the label distribution of the training data differs heavily from that of the testing data. This reveals a crucial drawback of gradient-based learning: it biases LMs regarding such structural obstacles.

Title: ICLEF: In-Context Learning with Expert Feedback for Explainable Style Transfer. (arXiv:2309.08583v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2309.08583
Code URL: https://github.com/asaakyan/explain-st
Copy Paste: [[2309.08583]] ICLEF: In-Context Learning with Expert Feedback for Explainable Style Transfer(http://arxiv.org/abs/2309.08583)
Summary:
While state-of-the-art language models excel at the style transfer task, current work does not address explainability of style transfer systems. Explanations could be generated using large language models such as GPT-3.5 and GPT-4, but the use of such complex systems is inefficient when smaller, widely distributed, and transparent alternatives are available. We propose a framework to augment and improve a formality style transfer dataset with explanations via model distillation from ChatGPT. To further refine the generated explanations, we propose a novel way to incorporate scarce expert human feedback using in-context learning (ICLEF: In-Context Learning from Expert Feedback) by prompting ChatGPT to act as a critic to its own outputs. We use the resulting dataset of 9,960 explainable formality style transfer instances (e-GYAFC) to show that current openly distributed instruction-tuned models (and, in some settings, ChatGPT) perform poorly on the task, and that fine-tuning on our high-quality dataset leads to significant improvements as shown by automatic evaluation. In human evaluation, we show that models much smaller than ChatGPT fine-tuned on our data align better with expert preferences. Finally, we discuss two potential applications of models fine-tuned on the explainable style transfer task: interpretable authorship verification and interpretable adversarial attacks on AI-generated text detectors.

Title: Neural Machine Translation Models Can Learn to be Few-shot Learners. (arXiv:2309.08590v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2309.08590
Code URL: null
Copy Paste: [[2309.08590]] Neural Machine Translation Models Can Learn to be Few-shot Learners(http://arxiv.org/abs/2309.08590)
Summary:
Large Language Models exhibit an emergent ability to use a small number of examples to learn to perform in novel domains and tasks, also called in-context learning (ICL). In this work, we show that a much smaller model can be trained to perform ICL by fine-tuning towards a specialized training objective, exemplified on the task of domain adaptation for neural machine translation. With this capacity for ICL, the model can take advantage of relevant few-shot examples to adapt its output towards the domain. We compare the quality of this domain adaptation to traditional supervised techniques and to ICL with a 40B-parameter Large Language Model. Our approach allows efficient batch inference on a mix of domains and outperforms state-of-the-art baselines in terms of both translation quality and immediate adaptation rate, i.e., the ability to reproduce a specific term after being shown a single example.