diffusion

Title: Fast Adaptation with Bradley-Terry Preference Models in Text-To-Image Classification and Generation. (arXiv:2308.07929v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2308.07929
Code URL: null
Copy Paste: [[2308.07929]] Fast Adaptation with Bradley-Terry Preference Models in Text-To-Image Classification and Generation(http://arxiv.org/abs/2308.07929)
Summary:
Recently, large multimodal models, such as CLIP and Stable Diffusion have experimented tremendous successes in both foundations and applications. However, as these models increase in parameter size and computational requirements, it becomes more challenging for users to personalize them for specific tasks or preferences. In this work, we address the problem of adapting the previous models towards sets of particular human preferences, aligning the retrieved or generated images with the preferences of the user. We leverage the Bradley-Terry preference model to develop a fast adaptation method that efficiently fine-tunes the original model, with few examples and with minimal computing resources. Extensive evidence of the capabilities of this framework is provided through experiments in different domains related to multimodal text and image understanding, including preference prediction as a reward model, and generation tasks.

Title: YODA: You Only Diffuse Areas. An Area-Masked Diffusion Approach For Image Super-Resolution. (arXiv:2308.07977v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2308.07977
Code URL: null
Copy Paste: [[2308.07977]] YODA: You Only Diffuse Areas(http://arxiv.org/abs/2308.07977)
Summary:
This work introduces "You Only Diffuse Areas" (YODA), a novel method for partial diffusion in Single-Image Super-Resolution (SISR). The core idea is to utilize diffusion selectively on spatial regions based on attention maps derived from the low-resolution image and the current time step in the diffusion process. This time-dependent targeting enables a more effective conversion to high-resolution outputs by focusing on areas that benefit the most from the iterative refinement process, i.e., detail-rich objects. We empirically validate YODA by extending leading diffusion-based SISR methods SR3 and SRDiff. Our experiments demonstrate new state-of-the-art performance gains in face and general SR across PSNR, SSIM, and LPIPS metrics. A notable finding is YODA's stabilization effect on training by reducing color shifts, especially when induced by small batch sizes, potentially contributing to resource-constrained scenarios. The proposed spatial and temporal adaptive diffusion mechanism opens promising research directions, including developing enhanced attention map extraction techniques and optimizing inference latency based on sparser diffusion.

Title: DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory. (arXiv:2308.08089v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2308.08089
Code URL: null
Copy Paste: [[2308.08089]] DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory(http://arxiv.org/abs/2308.08089)
Summary:
Controllable video generation has gained significant attention in recent years. However, two main limitations persist: Firstly, most existing works focus on either text, image, or trajectory-based control, leading to an inability to achieve fine-grained control in videos. Secondly, trajectory control research is still in its early stages, with most experiments being conducted on simple datasets like Human3.6M. This constraint limits the models' capability to process open-domain images and effectively handle complex curved trajectories. In this paper, we propose DragNUWA, an open-domain diffusion-based video generation model. To tackle the issue of insufficient control granularity in existing works, we simultaneously introduce text, image, and trajectory information to provide fine-grained control over video content from semantic, spatial, and temporal perspectives. To resolve the problem of limited open-domain trajectory control in current research, We propose trajectory modeling with three aspects: a Trajectory Sampler (TS) to enable open-domain control of arbitrary trajectories, a Multiscale Fusion (MF) to control trajectories in different granularities, and an Adaptive Training (AT) strategy to generate consistent videos following trajectories. Our experiments validate the effectiveness of DragNUWA, demonstrating its superior performance in fine-grained control in video generation. The homepage link is \url{https://www.microsoft.com/en-us/research/project/dragnuwa/}

Title: Learning to Generate Semantic Layouts for Higher Text-Image Correspondence in Text-to-Image Synthesis. (arXiv:2308.08157v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2308.08157
Code URL: null
Copy Paste: [[2308.08157]] Learning to Generate Semantic Layouts for Higher Text-Image Correspondence in Text-to-Image Synthesis(http://arxiv.org/abs/2308.08157)
Summary:
Existing text-to-image generation approaches have set high standards for photorealism and text-image correspondence, largely benefiting from web-scale text-image datasets, which can include up to 5~billion pairs. However, text-to-image generation models trained on domain-specific datasets, such as urban scenes, medical images, and faces, still suffer from low text-image correspondence due to the lack of text-image pairs. Additionally, collecting billions of text-image pairs for a specific domain can be time-consuming and costly. Thus, ensuring high text-image correspondence without relying on web-scale text-image datasets remains a challenging task. In this paper, we present a novel approach for enhancing text-image correspondence by leveraging available semantic layouts. Specifically, we propose a Gaussian-categorical diffusion process that simultaneously generates both images and corresponding layout pairs. Our experiments reveal that we can guide text-to-image generation models to be aware of the semantics of different image regions, by training the model to generate semantic labels for each pixel. We demonstrate that our approach achieves higher text-image correspondence compared to existing text-to-image generation approaches in the Multi-Modal CelebA-HQ and the Cityscapes dataset, where text-image pairs are scarce. Codes are available in this https://pmh9960.github.io/research/GCDP

Title: Dual-Stream Diffusion Net for Text-to-Video Generation. (arXiv:2308.08316v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2308.08316
Code URL: null
Copy Paste: [[2308.08316]] Dual-Stream Diffusion Net for Text-to-Video Generation(http://arxiv.org/abs/2308.08316)
Summary:
With the emerging diffusion models, recently, text-to-video generation has aroused increasing attention. But an important bottleneck therein is that generative videos often tend to carry some flickers and artifacts. In this work, we propose a dual-stream diffusion net (DSDN) to improve the consistency of content variations in generating videos. In particular, the designed two diffusion streams, video content and motion branches, could not only run separately in their private spaces for producing personalized video variations as well as content, but also be well-aligned between the content and motion domains through leveraging our designed cross-transformer interaction module, which would benefit the smoothness of generated videos. Besides, we also introduce motion decomposer and combiner to faciliate the operation on video motion. Qualitative and quantitative experiments demonstrate that our method could produce amazing continuous videos with fewer flickers.

Title: Diff-CAPTCHA: An Image-based CAPTCHA with Security Enhanced by Denoising Diffusion Model. (arXiv:2308.08367v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2308.08367
Code URL: null
Copy Paste: [[2308.08367]] Diff-CAPTCHA: An Image-based CAPTCHA with Security Enhanced by Denoising Diffusion Model(http://arxiv.org/abs/2308.08367)
Summary:
To enhance the security of text CAPTCHAs, various methods have been employed, such as adding the interference lines on the text, randomly distorting the characters, and overlapping multiple characters. These methods partly increase the difficulty of automated segmentation and recognition attacks. However, facing the rapid development of the end-to-end breaking algorithms, their security has been greatly weakened. The diffusion model is a novel image generation model that can generate the text images with deep fusion of characters and background images. In this paper, an image-click CAPTCHA scheme called Diff-CAPTCHA is proposed based on denoising diffusion models. The background image and characters of the CAPTCHA are treated as a whole to guide the generation process of a diffusion model, thus weakening the character features available for machine learning, enhancing the diversity of character features in the CAPTCHA, and increasing the difficulty of breaking algorithms. To evaluate the security of Diff-CAPTCHA, this paper develops several attack methods, including end-to-end attacks based on Faster R-CNN and two-stage attacks, and Diff-CAPTCHA is compared with three baseline schemes, including commercial CAPTCHA scheme and security-enhanced CAPTCHA scheme based on style transfer. The experimental results show that diffusion models can effectively enhance CAPTCHA security while maintaining good usability in human testing.

Title: TeCH: Text-guided Reconstruction of Lifelike Clothed Humans. (arXiv:2308.08545v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2308.08545
Code URL: null
Copy Paste: [[2308.08545]] TeCH: Text-guided Reconstruction of Lifelike Clothed Humans(http://arxiv.org/abs/2308.08545)
Summary:
Despite recent research advancements in reconstructing clothed humans from a single image, accurately restoring the "unseen regions" with high-level details remains an unsolved challenge that lacks attention. Existing methods often generate overly smooth back-side surfaces with a blurry texture. But how to effectively capture all visual attributes of an individual from a single image, which are sufficient to reconstruct unseen areas (e.g., the back view)? Motivated by the power of foundation models, TeCH reconstructs the 3D human by leveraging 1) descriptive text prompts (e.g., garments, colors, hairstyles) which are automatically generated via a garment parsing model and Visual Question Answering (VQA), 2) a personalized fine-tuned Text-to-Image diffusion model (T2I) which learns the "indescribable" appearance. To represent high-resolution 3D clothed humans at an affordable cost, we propose a hybrid 3D representation based on DMTet, which consists of an explicit body shape grid and an implicit distance field. Guided by the descriptive prompts + personalized T2I diffusion model, the geometry and texture of the 3D humans are optimized through multi-view Score Distillation Sampling (SDS) and reconstruction losses based on the original observation. TeCH produces high-fidelity 3D clothed humans with consistent & delicate texture, and detailed full-body geometry. Quantitative and qualitative experiments demonstrate that TeCH outperforms the state-of-the-art methods in terms of reconstruction accuracy and rendering quality. The code will be publicly available for research purposes at https://huangyangyi.github.io/tech

self-supervised

Title: Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation. (arXiv:2308.07931v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2308.07931
Code URL: null
Copy Paste: [[2308.07931]] Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation(http://arxiv.org/abs/2308.07931)
Summary:
Self-supervised and language-supervised image models contain rich knowledge of the world that is important for generalization. Many robotic tasks, however, require a detailed understanding of 3D geometry, which is often lacking in 2D image features. This work bridges this 2D-to-3D gap for robotic manipulation by leveraging distilled feature fields to combine accurate 3D geometry with rich semantics from 2D foundation models. We present a few-shot learning method for 6-DOF grasping and placing that harnesses these strong spatial and semantic priors to achieve in-the-wild generalization to unseen objects. Using features distilled from a vision-language model, CLIP, we present a way to designate novel objects for manipulation via free-text natural language, and demonstrate its ability to generalize to unseen expressions and novel categories of objects.

Title: Contrastive Learning for Lane Detection via cross-similarity. (arXiv:2308.08242v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2308.08242
Code URL: null
Copy Paste: [[2308.08242]] Contrastive Learning for Lane Detection via cross-similarity(http://arxiv.org/abs/2308.08242)
Summary:
Detecting road lanes is challenging due to intricate markings vulnerable to unfavorable conditions. Lane markings have strong shape priors, but their visibility is easily compromised. Factors like lighting, weather, vehicles, pedestrians, and aging colors challenge the detection. A large amount of data is required to train a lane detection approach that can withstand natural variations caused by low visibility. This is because there are numerous lane shapes and natural variations that exist. Our solution, Contrastive Learning for Lane Detection via cross-similarity (CLLD), is a self-supervised learning method that tackles this challenge by enhancing lane detection models resilience to real-world conditions that cause lane low visibility. CLLD is a novel multitask contrastive learning that trains lane detection approaches to detect lane markings even in low visible situations by integrating local feature contrastive learning (CL) with our new proposed operation cross-similarity. Local feature CL focuses on extracting features for small image parts, which is necessary to localize lane segments, while cross-similarity captures global features to detect obscured lane segments using their surrounding. We enhance cross-similarity by randomly masking parts of input images for augmentation. Evaluated on benchmark datasets, CLLD outperforms state-of-the-art contrastive learning, especially in visibility-impairing conditions like shadows. Compared to supervised learning, CLLD excels in scenarios like shadows and crowded scenes.

Title: Stable and Causal Inference for Discriminative Self-supervised Deep Visual Representations. (arXiv:2308.08321v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2308.08321
Code URL: null
Copy Paste: [[2308.08321]] Stable and Causal Inference for Discriminative Self-supervised Deep Visual Representations(http://arxiv.org/abs/2308.08321)
Summary:
In recent years, discriminative self-supervised methods have made significant strides in advancing various visual tasks. The central idea of learning a data encoder that is robust to data distortions/augmentations is straightforward yet highly effective. Although many studies have demonstrated the empirical success of various learning methods, the resulting learned representations can exhibit instability and hinder downstream performance. In this study, we analyze discriminative self-supervised methods from a causal perspective to explain these unstable behaviors and propose solutions to overcome them. Our approach draws inspiration from prior works that empirically demonstrate the ability of discriminative self-supervised methods to demix ground truth causal sources to some extent. Unlike previous work on causality-empowered representation learning, we do not apply our solutions during the training process but rather during the inference process to improve time efficiency. Through experiments on both controlled image datasets and realistic image datasets, we show that our proposed solutions, which involve tempering a linear transformation with controlled synthetic data, are effective in addressing these issues.

Title: Self-Supervised Online Camera Calibration for Automated Driving and Parking Applications. (arXiv:2308.08495v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2308.08495
Code URL: null
Copy Paste: [[2308.08495]] Self-Supervised Online Camera Calibration for Automated Driving and Parking Applications(http://arxiv.org/abs/2308.08495)
Summary:
Camera-based perception systems play a central role in modern autonomous vehicles. These camera based perception algorithms require an accurate calibration to map the real world distances to image pixels. In practice, calibration is a laborious procedure requiring specialised data collection and careful tuning. This process must be repeated whenever the parameters of the camera change, which can be a frequent occurrence in autonomous vehicles. Hence there is a need to calibrate at regular intervals to ensure the camera is accurate. Proposed is a deep learning framework to learn intrinsic and extrinsic calibration of the camera in real time. The framework is self-supervised and doesn't require any labelling or supervision to learn the calibration parameters. The framework learns calibration without the need for any physical targets or to drive the car on special planar surfaces.

Title: Is Self-Supervised Pretraining Good for Extrapolation in Molecular Property Prediction?. (arXiv:2308.08129v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2308.08129
Code URL: null
Copy Paste: [[2308.08129]] Is Self-Supervised Pretraining Good for Extrapolation in Molecular Property Prediction?(http://arxiv.org/abs/2308.08129)
Summary:
The prediction of material properties plays a crucial role in the development and discovery of materials in diverse applications, such as batteries, semiconductors, catalysts, and pharmaceuticals. Recently, there has been a growing interest in employing data-driven approaches by using machine learning technologies, in combination with conventional theoretical calculations. In material science, the prediction of unobserved values, commonly referred to as extrapolation, is particularly critical for property prediction as it enables researchers to gain insight into materials beyond the limits of available data. However, even with the recent advancements in powerful machine learning models, accurate extrapolation is still widely recognized as a significantly challenging problem. On the other hand, self-supervised pretraining is a machine learning technique where a model is first trained on unlabeled data using relatively simple pretext tasks before being trained on labeled data for target tasks. As self-supervised pretraining can effectively utilize material data without observed property values, it has the potential to improve the model's extrapolation ability. In this paper, we clarify how such self-supervised pretraining can enhance extrapolation performance.We propose an experimental framework for the demonstration and empirically reveal that while models were unable to accurately extrapolate absolute property values, self-supervised pretraining enables them to learn relative tendencies of unobserved property values and improve extrapolation performance.

foundation model

Title: $A^2$Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models. (arXiv:2308.07997v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2308.07997
Code URL: null
Copy Paste: [[2308.07997]] $A^2$Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models(http://arxiv.org/abs/2308.07997)
Summary:
We study the task of zero-shot vision-and-language navigation (ZS-VLN), a practical yet challenging problem in which an agent learns to navigate following a path described by language instructions without requiring any path-instruction annotation data. Normally, the instructions have complex grammatical structures and often contain various action descriptions (e.g., "proceed beyond", "depart from"). How to correctly understand and execute these action demands is a critical problem, and the absence of annotated data makes it even more challenging. Note that a well-educated human being can easily understand path instructions without the need for any special training. In this paper, we propose an action-aware zero-shot VLN method ($A^2$Nav) by exploiting the vision-and-language ability of foundation models. Specifically, the proposed method consists of an instruction parser and an action-aware navigation policy. The instruction parser utilizes the advanced reasoning ability of large language models (e.g., GPT-3) to decompose complex navigation instructions into a sequence of action-specific object navigation sub-tasks. Each sub-task requires the agent to localize the object and navigate to a specific goal position according to the associated action demand. To accomplish these sub-tasks, an action-aware navigation policy is learned from freely collected action-specific datasets that reveal distinct characteristics of each action demand. We use the learned navigation policy for executing sub-tasks sequentially to follow the navigation instruction. Extensive experiments show $A^2$Nav achieves promising ZS-VLN performance and even surpasses the supervised learning methods on R2R-Habitat and RxR-Habitat datasets.

Title: LLM4TS: Two-Stage Fine-Tuning for Time-Series Forecasting with Pre-Trained LLMs. (arXiv:2308.08469v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2308.08469
Code URL: null
Copy Paste: [[2308.08469]] LLM4TS: Two-Stage Fine-Tuning for Time-Series Forecasting with Pre-Trained LLMs(http://arxiv.org/abs/2308.08469)
Summary:
In this work, we leverage pre-trained Large Language Models (LLMs) to enhance time-series forecasting. Mirroring the growing interest in unifying models for Natural Language Processing and Computer Vision, we envision creating an analogous model for long-term time-series forecasting. Due to limited large-scale time-series data for building robust foundation models, our approach LLM4TS focuses on leveraging the strengths of pre-trained LLMs. By combining time-series patching with temporal encoding, we have enhanced the capability of LLMs to handle time-series data effectively. Inspired by the supervised fine-tuning in chatbot domains, we prioritize a two-stage fine-tuning process: first conducting supervised fine-tuning to orient the LLM towards time-series data, followed by task-specific downstream fine-tuning. Furthermore, to unlock the flexibility of pre-trained LLMs without extensive parameter adjustments, we adopt several Parameter-Efficient Fine-Tuning (PEFT) techniques. Drawing on these innovations, LLM4TS has yielded state-of-the-art results in long-term forecasting. Our model has also shown exceptional capabilities as both a robust representation learner and an effective few-shot learner, thanks to the knowledge transferred from the pre-trained LLM.

generative

Title: Likelihood-Based Text-to-Image Evaluation with Patch-Level Perceptual and Semantic Credit Assignment. (arXiv:2308.08525v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2308.08525
Code URL: https://github.com/chenqi008/leica
Copy Paste: [[2308.08525]] Likelihood-Based Text-to-Image Evaluation with Patch-Level Perceptual and Semantic Credit Assignment(http://arxiv.org/abs/2308.08525)
Summary:
Text-to-image synthesis has made encouraging progress and attracted lots of public attention recently. However, popular evaluation metrics in this area, like the Inception Score and Fr'echet Inception Distance, incur several issues. First of all, they cannot explicitly assess the perceptual quality of generated images and poorly reflect the semantic alignment of each text-image pair. Also, they are inefficient and need to sample thousands of images to stabilise their evaluation results. In this paper, we propose to evaluate text-to-image generation performance by directly estimating the likelihood of the generated images using a pre-trained likelihood-based text-to-image generative model, i.e., a higher likelihood indicates better perceptual quality and better text-image alignment. To prevent the likelihood of being dominated by the non-crucial part of the generated image, we propose several new designs to develop a credit assignment strategy based on the semantic and perceptual significance of the image patches. In the experiments, we evaluate the proposed metric on multiple popular text-to-image generation models and datasets in accessing both the perceptual quality and the text-image alignment. Moreover, it can successfully assess the generation ability of these models with as few as a hundred samples, making it very efficient in practice.

Title: Enhancing Performance on Seen and Unseen Dialogue Scenarios using Retrieval-Augmented End-to-End Task-Oriented System. (arXiv:2308.08169v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2308.08169
Code URL: null
Copy Paste: [[2308.08169]] Enhancing Performance on Seen and Unseen Dialogue Scenarios using Retrieval-Augmented End-to-End Task-Oriented System(http://arxiv.org/abs/2308.08169)
Summary:
End-to-end task-oriented dialogue (TOD) systems have achieved promising performance by leveraging sophisticated natural language understanding and natural language generation capabilities of pre-trained models. This work enables the TOD systems with more flexibility through a simple cache. The cache provides the flexibility to dynamically update the TOD systems and handle both existing and unseen dialogue scenarios. Towards this end, we first fine-tune a retrieval module to effectively retrieve the most relevant information entries from the cache. We then train end-to-end TOD models that can refer to and ground on both dialogue history and retrieved information during TOD generation. The cache is straightforward to construct, and the backbone models of TOD systems are compatible with existing pre-trained generative models. Extensive experiments demonstrate the superior performance of our framework, with a notable improvement in non-empty joint goal accuracy by 6.7% compared to strong baselines.

Title: Deep Generative Imputation Model for Missing Not At Random Data. (arXiv:2308.08158v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2308.08158
Code URL: null
Copy Paste: [[2308.08158]] Deep Generative Imputation Model for Missing Not At Random Data(http://arxiv.org/abs/2308.08158)
Summary:
Data analysis usually suffers from the Missing Not At Random (MNAR) problem, where the cause of the value missing is not fully observed. Compared to the naive Missing Completely At Random (MCAR) problem, it is more in line with the realistic scenario whereas more complex and challenging. Existing statistical methods model the MNAR mechanism by different decomposition of the joint distribution of the complete data and the missing mask. But we empirically find that directly incorporating these statistical methods into deep generative models is sub-optimal. Specifically, it would neglect the confidence of the reconstructed mask during the MNAR imputation process, which leads to insufficient information extraction and less-guaranteed imputation quality. In this paper, we revisit the MNAR problem from a novel perspective that the complete data and missing mask are two modalities of incomplete data on an equal footing. Along with this line, we put forward a generative-model-specific joint probability decomposition method, conjunction model, to represent the distributions of two modalities in parallel and extract sufficient information from both complete data and missing mask. Taking a step further, we exploit a deep generative imputation model, namely GNR, to process the real-world missing mechanism in the latent space and concurrently impute the incomplete data and reconstruct the missing mask. The experimental results show that our GNR surpasses state-of-the-art MNAR baselines with significant margins (averagely improved from 9.9% to 18.8% in RMSE) and always gives a better mask reconstruction accuracy which makes the imputation more principle.

Title: It Ain't That Bad: Understanding the Mysterious Performance Drop in OOD Generalization for Generative Transformer Models. (arXiv:2308.08268v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2308.08268
Code URL: null
Copy Paste: [[2308.08268]] It Ain't That Bad: Understanding the Mysterious Performance Drop in OOD Generalization for Generative Transformer Models(http://arxiv.org/abs/2308.08268)
Summary:
Generative Transformer-based models have achieved remarkable proficiency on solving diverse problems. However, their generalization ability is not fully understood and not always satisfying. Researchers take basic mathematical tasks like n-digit addition or multiplication as important perspectives for investigating their generalization behaviors. Curiously, it is observed that when training on n-digit operations (e.g., additions) in which both input operands are n-digit in length, models generalize successfully on unseen n-digit inputs (in-distribution (ID) generalization), but fail miserably and mysteriously on longer, unseen cases (out-of-distribution (OOD) generalization). Studies try to bridge this gap with workarounds such as modifying position embedding, fine-tuning, and priming with more extensive or instructive data. However, without addressing the essential mechanism, there is hardly any guarantee regarding the robustness of these solutions. We bring this unexplained performance drop into attention and ask whether it is purely from random errors. Here we turn to the mechanistic line of research which has notable successes in model interpretability. We discover that the strong ID generalization stems from structured representations, while behind the unsatisfying OOD performance, the models still exhibit clear learned algebraic structures. Specifically, these models map unseen OOD inputs to outputs with equivalence relations in the ID domain. These highlight the potential of the models to carry useful information for improved generalization.

Title: Explainable AI for clinical risk prediction: a survey of concepts, methods, and modalities. (arXiv:2308.08407v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2308.08407
Code URL: null
Copy Paste: [[2308.08407]] Explainable AI for clinical risk prediction: a survey of concepts, methods, and modalities(http://arxiv.org/abs/2308.08407)
Summary:
Recent advancements in AI applications to healthcare have shown incredible promise in surpassing human performance in diagnosis and disease prognosis. With the increasing complexity of AI models, however, concerns regarding their opacity, potential biases, and the need for interpretability. To ensure trust and reliability in AI systems, especially in clinical risk prediction models, explainability becomes crucial. Explainability is usually referred to as an AI system's ability to provide a robust interpretation of its decision-making logic or the decisions themselves to human stakeholders. In clinical risk prediction, other aspects of explainability like fairness, bias, trust, and transparency also represent important concepts beyond just interpretability. In this review, we address the relationship between these concepts as they are often used together or interchangeably. This review also discusses recent progress in developing explainable models for clinical risk prediction, highlighting the importance of quantitative and clinical evaluation and validation across multiple common modalities in clinical practice. It emphasizes the need for external validation and the combination of diverse interpretability methods to enhance trust and fairness. Adopting rigorous testing, such as using synthetic datasets with known generative factors, can further improve the reliability of explainability methods. Open access and code-sharing resources are essential for transparency and reproducibility, enabling the growth and trustworthiness of explainable research. While challenges exist, an end-to-end approach to explainability in clinical risk prediction, incorporating stakeholders from clinicians to developers, is essential for success.

anomaly

in-context

Title: Time Travel in LLMs: Tracing Data Contamination in Large Language Models. (arXiv:2308.08493v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2308.08493
Code URL: null
Copy Paste: [[2308.08493]] Time Travel in LLMs: Tracing Data Contamination in Large Language Models(http://arxiv.org/abs/2308.08493)
Summary:
Data contamination, i.e., the presence of test data from downstream tasks in the training data of large language models (LLMs), is a potential major issue in understanding LLMs' effectiveness on other tasks. We propose a straightforward yet effective method for identifying data contamination within LLMs. At its core, our approach starts by identifying potential contamination in individual instances that are drawn from a small random sample; using this information, our approach then assesses if an entire dataset partition is contaminated. To estimate contamination of individual instances, we employ "guided instruction:" a prompt consisting of the dataset name, partition type, and the initial segment of a reference instance, asking the LLM to complete it. An instance is flagged as contaminated if the LLM's output either exactly or closely matches the latter segment of the reference. To understand if an entire partition is contaminated, we propose two ideas. The first idea marks a dataset partition as contaminated if the average overlap score with the reference instances (as measured by ROUGE or BLEURT) is statistically significantly better with the guided instruction vs. a general instruction that does not include the dataset and partition name. The second idea marks a dataset as contaminated if a classifier based on GPT-4 with in-context learning prompting marks multiple instances as contaminated. Our best method achieves an accuracy between 92% and 100% in detecting if an LLM is contaminated with seven datasets, containing train and test/validation partitions, when contrasted with manual evaluation by human expert. Further, our findings indicate that GPT-4 is contaminated with AG News, WNLI, and XSum datasets.