2023-12-29

language model

Title: An Explainable AI Approach to Large Language Model Assisted Causal Model Auditing and Development. (arXiv:2312.16211v1 [cs.AI])

Paper URL: http://arxiv.org/abs/2312.16211
Code URL: null
Copy Paste: [[2312.16211]] An Explainable AI Approach to Large Language Model Assisted Causal Model Auditing and Development(http://arxiv.org/abs/2312.16211)
Summary:
Causal networks are widely used in many fields, including epidemiology, social science, medicine, and engineering, to model the complex relationships between variables. While it can be convenient to algorithmically infer these models directly from observational data, the resulting networks are often plagued with erroneous edges. Auditing and correcting these networks may require domain expertise frequently unavailable to the analyst. We propose the use of large language models such as ChatGPT as an auditor for causal networks. Our method presents ChatGPT with a causal network, one edge at a time, to produce insights about edge directionality, possible confounders, and mediating variables. We ask ChatGPT to reflect on various aspects of each causal link and we then produce visualizations that summarize these viewpoints for the human analyst to direct the edge, gather more data, or test further hypotheses. We envision a system where large language models, automated causal inference, and the human analyst and domain expert work hand in hand as a team to derive holistic and comprehensive causal models for any given case scenario. This paper presents first results obtained with an emerging prototype.
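
The edge-at-a-time auditing loop described in this summary could be sketched roughly as follows. This is only an illustration: the function name `build_edge_prompts` and the prompt wording are hypothetical and not taken from the paper, and the call to the language model itself is left out.

```python
from typing import List, Tuple


def build_edge_prompts(edges: List[Tuple[str, str]]) -> List[str]:
    """Return one audit prompt per directed edge (cause, effect).

    The wording below is an assumption made for illustration; the paper's
    actual prompts are not given in the summary.
    """
    prompts = []
    for cause, effect in edges:
        prompts.append(
            f"Consider the proposed causal link '{cause}' -> '{effect}'. "
            "1) Is this direction plausible, or is the reverse more likely? "
            "2) Name possible confounders of both variables. "
            "3) Name possible mediating variables between them."
        )
    return prompts


# Hypothetical two-edge network, audited one edge at a time.
edges = [("smoking", "lung cancer"), ("ice cream sales", "drowning")]
prompts = build_edge_prompts(edges)
```

Each prompt would then be sent to the model separately, and the answers collected per edge for the analyst to review.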

Title: More than Correlation: Do Large Language Models Learn Causal Representations of Space?. (arXiv:2312.16257v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.16257
Code URL: null
Copy Paste: [[2312.16257]] More than Correlation: Do Large Language Models Learn Causal Representations of Space?(http://arxiv.org/abs/2312.16257)
Summary:
Recent work found high mutual information between the learned representations of large language models (LLMs) and the geospatial properties of their input, hinting at an emergent internal model of space. However, that work did not answer whether this internal spatial model has any causal effect on the LLMs' behavior, which led to criticism of the findings as mere statistical correlation. Our study focused on uncovering the causality of the spatial representations in LLMs. In particular, we discovered potential spatial representations in DeBERTa and GPT-Neo using representational similarity analysis and linear and non-linear probing. Our causal intervention experiments showed that the spatial representations influenced the model's performance on next-word prediction and on a downstream task that relies on geospatial information. Our experiments suggested that LLMs learn and use an internal model of space in solving geospatial tasks.

Title: Preference as Reward, Maximum Preference Optimization with Importance Sampling. (arXiv:2312.16430v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.16430
Code URL: null
Copy Paste: [[2312.16430]] Preference as Reward, Maximum Preference Optimization with Importance Sampling(http://arxiv.org/abs/2312.16430)
Summary:
Preference learning is a key technology for aligning language models with human values. Reinforcement Learning from Human Feedback (RLHF) is a model-based algorithm for preference learning: it first fits a reward model to preference scores and then optimizes the generating policy with the on-policy PPO algorithm to maximize the reward. The RLHF pipeline is complex, time-consuming, and unstable. The Direct Preference Optimization (DPO) algorithm uses an off-policy approach to optimize the generating policy directly, eliminating the need for a reward model, which makes it data-efficient and stable. DPO uses the Bradley-Terry model and a log-loss, which leads to over-fitting to the preference data at the expense of ignoring the KL-regularization term when preferences are near-deterministic. IPO uses a root-finding pairwise MSE loss to address the ignored KL-regularization and to learn an optimal policy. However, IPO's pairwise loss still cannot make the KL-regularization work. In this paper, we design a simple and intuitive off-policy preference optimization algorithm from an importance sampling view, and add an off-policy KL-regularization term that makes KL-regularization truly effective. To simplify the learning process and reduce memory usage, we can generate regularization data in advance, which eliminates the need for both a reward model and a reference policy during optimization.
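
To make the DPO versus IPO contrast in this summary concrete, here is a minimal sketch of the two pairwise losses in their commonly published forms. It does not implement the method proposed in the paper; the function names and the default values of `beta` and `tau` are assumptions chosen for illustration.

```python
import math


def log_ratio_margin(pi_w: float, pi_l: float, ref_w: float, ref_l: float) -> float:
    """Policy-vs-reference log-ratio of the preferred response (w) minus that
    of the rejected response (l). Inputs are sequence log-probabilities."""
    return (pi_w - ref_w) - (pi_l - ref_l)


def dpo_loss(margin: float, beta: float = 0.1) -> float:
    """DPO log-loss: -log sigmoid(beta * margin), written as log(1 + exp(-x))."""
    return math.log1p(math.exp(-beta * margin))


def ipo_loss(margin: float, tau: float = 0.1) -> float:
    """IPO squared loss: pulls the margin toward the finite target 1 / (2 * tau)."""
    return (margin - 1.0 / (2.0 * tau)) ** 2
```

With these definitions the DPO loss keeps decreasing as the margin grows, which is the over-fitting behaviour the summary attributes to near-deterministic preferences, while the IPO loss is minimized at a finite margin of `1 / (2 * tau)`.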

Title: A Large Language Model-based Computational Approach to Improve Identity-Related Write-Ups. (arXiv:2312.16659v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.16659
Code URL: null
Copy Paste: [[2312.16659]] A Large Language Model-based Computational Approach to Improve Identity-Related Write-Ups(http://arxiv.org/abs/2312.16659)
Summary:
Creating written products is essential to modern life, including writings about one's identity and personal experiences. However, writing is often a difficult activity that requires extensive effort to frame the central ideas, the approach pursued to communicate them (e.g., using analogies, metaphors, or other possible means), the needed presentation structure, and the actual verbal expression. Large Language Models, a recently emerged approach in Machine Learning, can offer significant help in reducing the effort and improving the quality of written products. This paper proposes a new computational approach to explore prompts that, given as inputs to a Large Language Model, can generate cues to improve the considered written products. Two case studies on improving write-ups, one based on an analogy and one on a metaphor, are also presented in the paper.

Title: Some things are more CRINGE than others: Preference Optimization with the Pairwise Cringe Loss. (arXiv:2312.16682v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.16682
Code URL: null
Copy Paste: [[2312.16682]] Some things are more CRINGE than others: Preference Optimization with the Pairwise Cringe Loss(http://arxiv.org/abs/2312.16682)
Summary:
Practitioners commonly align large language models using pairwise preferences, i.e., given labels of the type response A is preferred to response B for a given input. Perhaps less commonly, methods have also been developed for binary feedback, i.e. training models given labels of type response A is good or bad. We show how an existing performant binary feedback method, the Cringe Loss (Adolphs et al., 2022), can be generalized to the pairwise preference setting using a simple soft margin extension. Pairwise Cringe Loss is straightforward to implement and efficient to train, and we find it outperforms state-of-the-art preference optimization algorithms such as PPO and DPO on the AlpacaFarm benchmark.

Title: Rethinking Tabular Data Understanding with Large Language Models. (arXiv:2312.16702v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.16702
Code URL: https://github.com/Leolty/tablellm
Copy Paste: [[2312.16702]] Rethinking Tabular Data Understanding with Large Language Models(http://arxiv.org/abs/2312.16702)
Summary:
Large Language Models (LLMs) have been shown to be capable of various tasks, yet their capability in interpreting and reasoning over tabular data remains an underexplored area. In this context, this study investigates three core perspectives: the robustness of LLMs to structural perturbations in tables, the comparative analysis of textual and symbolic reasoning on tables, and the potential of boosting model performance through the aggregation of multiple reasoning pathways. We discover that structural variance among tables presenting the same content causes a notable performance decline, particularly in symbolic reasoning tasks. This prompts the proposal of a method for table structure normalization. Moreover, textual reasoning slightly edges out symbolic reasoning, and a detailed error analysis reveals that each exhibits different strengths depending on the specific tasks. Notably, the aggregation of textual and symbolic reasoning pathways, bolstered by a mix self-consistency mechanism, achieves SOTA performance, with an accuracy of 73.6% on WIKITABLEQUESTIONS, a substantial advancement over existing table processing paradigms of LLMs.
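
The aggregation of textual and symbolic reasoning pathways could be approximated by a plain majority vote over the pooled answers, as in the sketch below. The function name and the lowercase normalization step are assumptions; the summary does not specify how many samples are drawn from each pathway or how answers are normalized.

```python
from collections import Counter
from typing import List


def mix_self_consistency(textual_answers: List[str], symbolic_answers: List[str]) -> str:
    """Pool answers from both reasoning pathways and return the most frequent.

    Answers are lowercased and stripped before voting (an assumption made
    here so that trivially different spellings count as the same answer).
    """
    pooled = [a.strip().lower() for a in textual_answers + symbolic_answers]
    return Counter(pooled).most_common(1)[0][0]


# Hypothetical sampled answers for one table question.
answer = mix_self_consistency(["Paris", "paris", "Lyon"], ["paris ", "Nice"])
```

Pooling both pathways lets an answer win even when neither pathway alone produces a clear majority for it.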

Title: Observable Propagation: A Data-Efficient Approach to Uncover Feature Vectors in Transformers. (arXiv:2312.16291v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.16291
Code URL: https://github.com/jacobdunefsky/observablepropagation
Copy Paste: [[2312.16291]] Observable Propagation: A Data-Efficient Approach to Uncover Feature Vectors in Transformers(http://arxiv.org/abs/2312.16291)
Summary:
A key goal of current mechanistic interpretability research in NLP is to find linear features (also called "feature vectors") for transformers: directions in activation space corresponding to concepts that are used by a given model in its computation. Present state-of-the-art methods for finding linear features require large amounts of labelled data -- both laborious to acquire and computationally expensive to utilize. In this work, we introduce a novel method, called "observable propagation" (in short: ObsProp), for finding linear features used by transformer language models in computing a given task -- using almost no data. Our paradigm centers on the concept of observables, linear functionals corresponding to given tasks. We then introduce a mathematical theory for the analysis of feature vectors: we provide theoretical motivation for why LayerNorm nonlinearities do not affect the direction of feature vectors; we also introduce a similarity metric between feature vectors called the coupling coefficient which estimates the degree to which one feature's output correlates with another's. We use ObsProp to perform extensive qualitative investigations into several tasks, including gendered occupational bias, political party prediction, and programming language detection. Our results suggest that ObsProp surpasses traditional approaches for finding feature vectors in the low-data regime, and that ObsProp can be used to better understand the mechanisms responsible for bias in large language models. Code for experiments can be found at github.com/jacobdunefsky/ObservablePropagation.

Title: Task Contamination: Language Models May Not Be Few-Shot Anymore. (arXiv:2312.16337v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.16337
Code URL: null
Copy Paste: [[2312.16337]] Task Contamination: Language Models May Not Be Few-Shot Anymore(http://arxiv.org/abs/2312.16337)
Summary:
Large language models (LLMs) offer impressive performance in various zero-shot and few-shot tasks. However, their success in zero-shot and few-shot settings may be affected by task contamination, a potential limitation that has not been thoroughly examined. This paper investigates how zero-shot and few-shot performance of LLMs has changed chronologically over time. Utilizing GPT-3 series models and several other recent open-sourced LLMs, and controlling for dataset difficulty, we find that on datasets released before the LLM training data creation date, LLMs perform surprisingly better than on datasets released after. This strongly indicates that, for many LLMs, there exists task contamination on zero-shot and few-shot evaluation for datasets released prior to the LLMs' training data creation date. Additionally, we utilize training data inspection, task example extraction, and a membership inference attack, which reveal further evidence of task contamination. Importantly, we find that for classification tasks with no possibility of task contamination, LLMs rarely demonstrate statistically significant improvements over simple majority baselines, in both zero and few-shot settings.
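
The "simple majority baseline" that the summary compares against can be sketched as follows. The function name and the toy labels are hypothetical; the paper's datasets and significance tests are not reproduced here.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple


def majority_baseline_accuracy(
    train_labels: Sequence[Hashable], test_labels: Sequence[Hashable]
) -> Tuple[Hashable, float]:
    """Always predict the most frequent training label and score it on the test set."""
    majority = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for y in test_labels if y == majority)
    return majority, correct / len(test_labels)


# Toy example with made-up labels.
label, acc = majority_baseline_accuracy(
    ["pos", "pos", "neg"], ["pos", "neg", "pos", "pos"]
)
```

A model's zero-shot or few-shot accuracy would need to exceed `acc` by a statistically significant margin to count as a real improvement over this baseline.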

Title: Exploring intra-task relations to improve meta-learning algorithms. (arXiv:2312.16612v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.16612
Code URL: null
Copy Paste: [[2312.16612]] Exploring intra-task relations to improve meta-learning algorithms(http://arxiv.org/abs/2312.16612)
Summary:
Meta-learning has emerged as an effective methodology for modelling several real-world tasks and problems due to its extraordinary effectiveness in the low-data regime. There are many scenarios, ranging from the classification of rare diseases to language modelling of uncommon languages, where large datasets are rarely available. Similarly, in broader scenarios like self-driving, an autonomous vehicle needs to be trained to handle every situation well. This requires training the ML model on a variety of tasks with good-quality data. But oftentimes the data distribution across tasks is skewed, i.e., the data follows a long-tail distribution. This leads to the model performing well on some tasks and poorly on others, which causes model robustness issues. Meta-learning has recently emerged as a potential learning paradigm which can effectively learn from one task and generalize that learning to unseen tasks. In this study, we aim to exploit external knowledge of task relations to improve training stability via effective mini-batching of tasks. We hypothesize that selecting a diverse set of tasks in a mini-batch will lead to a better estimate of the full gradient and hence to a reduction of noise in training.

gpt

llm

Title: LLM Polygraph: Uncovering LLMs' Factual Discernment through Intermediate Data Analysis. (arXiv:2312.16374v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.16374
Code URL: null
Copy Paste: [[2312.16374]] LLM Polygraph: Uncovering LLMs' Factual Discernment through Intermediate Data Analysis(http://arxiv.org/abs/2312.16374)
Summary:
Large Language Models (LLMs) have revolutionized various domains with extensive knowledge and creative capabilities. However, a critical issue with LLMs is their tendency to produce outputs that diverge from factual reality. This phenomenon is particularly concerning in sensitive applications such as medical consultation and legal advice, where accuracy is paramount. In this paper, we introduce the LLM factoscope, a novel Siamese network-based model that leverages the inner states of LLMs for factual detection. Our investigation reveals distinguishable patterns in LLMs' inner states when generating factual versus non-factual content. We demonstrate the LLM factoscope's effectiveness across various architectures, achieving over 96% accuracy in factual detection. Our work opens a new avenue for utilizing LLMs' inner states for factual detection and encourages further exploration into LLMs' inner workings for enhanced reliability and transparency.

Title: Automating Knowledge Acquisition for Content-Centric Cognitive Agents Using LLMs. (arXiv:2312.16378v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.16378
Code URL: null
Copy Paste: [[2312.16378]] Automating Knowledge Acquisition for Content-Centric Cognitive Agents Using LLMs(http://arxiv.org/abs/2312.16378)
Summary:
The paper describes a system that uses large language model (LLM) technology to support the automatic learning of new entries in an intelligent agent's semantic lexicon. The process is bootstrapped by an existing non-toy lexicon and a natural language generator that converts formal, ontologically-grounded representations of meaning into natural language sentences. The learning method involves a sequence of LLM requests and includes an automatic quality control step. To date, this learning method has been applied to learning multiword expressions whose meanings are equivalent to those of transitive verbs in the agent's lexicon. The experiment demonstrates the benefits of a hybrid learning architecture that integrates knowledge-based methods and resources with both traditional data analytics and LLMs.

Title: How Robust are LLMs to In-Context Majority Label Bias?. (arXiv:2312.16549v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.16549
Code URL: null
Copy Paste: [[2312.16549]] How Robust are LLMs to In-Context Majority Label Bias?(http://arxiv.org/abs/2312.16549)
Summary:
In the In-Context Learning (ICL) setup, various forms of label biases can manifest. One such manifestation is majority label bias, which arises when the distribution of labeled examples in the in-context samples is skewed towards one or more specific classes making Large Language Models (LLMs) more prone to predict those labels. Such discrepancies can arise from various factors, including logistical constraints, inherent biases in data collection methods, limited access to diverse data sources, etc. which are unavoidable in a real-world industry setup. In this work, we study the robustness of in-context learning in LLMs to shifts that occur due to majority label bias within the purview of text classification tasks. Prior works have shown that in-context learning with LLMs is susceptible to such biases. In our study, we go one level deeper and show that the robustness boundary varies widely for different models and tasks, with certain LLMs being highly robust (~90%) to majority label bias. Additionally, our findings also highlight the impact of model size and the richness of instructional prompts contributing towards model robustness. We restrict our study to only publicly available open-source models to ensure transparency and reproducibility.

long context

lora

Title: Understanding News Creation Intents: Frame, Dataset, and Method. (arXiv:2312.16490v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.16490
Code URL: https://github.com/ictmcg/newsint
Copy Paste: [[2312.16490]] Understanding News Creation Intents: Frame, Dataset, and Method(http://arxiv.org/abs/2312.16490)
Summary:
With the disruptive changes in the media economy and the proliferation of alternative news media outlets, news intent has progressively deviated from ethical standards that serve the public interest. News intent refers to the purpose or intention behind the creation of a news article. While the significance of research on news intent has been widely acknowledged, the absence of a systematic news intent understanding framework hinders further exploration of news intent and its downstream applications. To bridge this gap, we propose News INTent (NINT) frame, the first component-aware formalism for understanding the news creation intent based on research in philosophy, psychology, and cognitive science. Within this frame, we define the news intent identification task and provide a benchmark dataset with fine-grained labels along with an efficient benchmark method. Experiments demonstrate that NINT is beneficial in both the intent identification task and downstream tasks that demand a profound understanding of news. This work marks a foundational step towards a more systematic exploration of news creation intents.

Title: FairCompass: Operationalising Fairness in Machine Learning. (arXiv:2312.16726v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.16726
Code URL: null
Copy Paste: [[2312.16726]] FairCompass: Operationalising Fairness in Machine Learning(http://arxiv.org/abs/2312.16726)
Summary:
As artificial intelligence (AI) increasingly becomes an integral part of our societal and individual activities, there is a growing imperative to develop responsible AI solutions. Although a diverse assortment of machine learning fairness solutions has been proposed in the literature, there is reportedly a lack of practical implementation of these tools in real-world applications. Industry experts have participated in thorough discussions on the challenges associated with operationalising fairness in the development of machine learning-empowered solutions, in which a shift toward human-centred approaches is promptly advocated to mitigate the limitations of existing techniques. In this work, we propose a human-in-the-loop approach for fairness auditing, presenting a mixed visual analytical system (hereafter referred to as 'FairCompass'), which integrates both a subgroup discovery technique and a decision tree-based schema for end users. Moreover, we innovatively integrate an Exploration, Guidance and Informed Analysis loop, to facilitate the use of the Knowledge Generation Model for Visual Analytics in FairCompass. We evaluate the effectiveness of FairCompass for fairness auditing in a real-world scenario, and the findings demonstrate the system's potential for real-world deployability. We anticipate this work will address the current gaps in research for fairness and facilitate the operationalisation of fairness in machine learning systems.

Title: Adaptive trajectory-constrained exploration strategy for deep reinforcement learning. (arXiv:2312.16456v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.16456
Code URL: https://github.com/buaawgj/tace
Copy Paste: [[2312.16456]] Adaptive trajectory-constrained exploration strategy for deep reinforcement learning(http://arxiv.org/abs/2312.16456)
Summary:
Deep reinforcement learning (DRL) faces significant challenges in addressing the hard-exploration problems in tasks with sparse or deceptive rewards and large state spaces. These challenges severely limit the practical application of DRL. Most previous exploration methods relied on complex architectures to estimate state novelty or introduced sensitive hyperparameters, resulting in instability. To mitigate these issues, we propose an efficient adaptive trajectory-constrained exploration strategy for DRL. The proposed method guides the policy of the agent away from suboptimal solutions by leveraging incomplete offline demonstrations as references. This approach gradually expands the exploration scope of the agent and strives for optimality in a constrained optimization manner. Additionally, we introduce a novel policy-gradient-based optimization algorithm that utilizes adaptively clipped trajectory-distance rewards for both single- and multi-agent reinforcement learning. We provide a theoretical analysis of our method, including a deduction of the worst-case approximation error bounds, highlighting the validity of our approach for enhancing exploration. To evaluate the effectiveness of the proposed method, we conducted experiments on two large 2D grid world mazes and several MuJoCo tasks. The extensive experimental results demonstrate the significant advantages of our method in achieving temporally extended exploration and avoiding myopic and suboptimal behaviors in both single- and multi-agent settings. Notably, the specific metrics and quantifiable results further support these findings. The code used in the study is available at https://github.com/buaawgj/TACE.

Title: Expressivity and Approximation Properties of Deep Neural Networks with ReLU$^k$ Activation. (arXiv:2312.16483v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.16483
Code URL: null
Copy Paste: [[2312.16483]] Expressivity and Approximation Properties of Deep Neural Networks with ReLU$^k$ Activation(http://arxiv.org/abs/2312.16483)
Summary:
In this paper, we investigate the expressivity and approximation properties of deep neural networks employing the ReLU$^k$ activation function for $k \geq 2$. Although deep ReLU networks can approximate polynomials effectively, deep ReLU$^k$ networks have the capability to represent higher-degree polynomials precisely. Our initial contribution is a comprehensive, constructive proof for polynomial representation using deep ReLU$^k$ networks. This allows us to establish an upper bound on both the size and count of network parameters. Consequently, we are able to demonstrate a suboptimal approximation rate for functions from Sobolev spaces as well as for analytic functions. Additionally, through an exploration of the representation power of deep ReLU$^k$ networks for shallow networks, we reveal that deep ReLU$^k$ networks can approximate functions from a range of variation spaces, extending beyond those generated solely by the ReLU$^k$ activation function. This finding demonstrates the adaptability of deep ReLU$^k$ networks in approximating functions within various variation spaces.
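
The claim that ReLU$^k$ units can represent polynomials exactly can be checked numerically for $k = 2$ with two standard identities: $x^2 = \mathrm{ReLU}^2(x) + \mathrm{ReLU}^2(-x)$ and $xy = ((x+y)^2 - (x-y)^2)/4$. The sketch below uses only these identities; the paper's own construction is not given in the summary and may differ.

```python
def relu_k(x: float, k: int = 2) -> float:
    """ReLU^k activation: max(x, 0) raised to the power k."""
    return max(x, 0.0) ** k


def square(x: float) -> float:
    """Exact x^2 from two ReLU^2 units: one on x, one on -x."""
    return relu_k(x) + relu_k(-x)


def product(x: float, y: float) -> float:
    """Exact x*y from four ReLU^2 units via the polarization identity."""
    return (square(x + y) - square(x - y)) / 4.0
```

For example, `product(2.0, -5.0)` evaluates `square(-3.0) = 9.0` and `square(7.0) = 49.0`, giving `(9.0 - 49.0) / 4.0 = -10.0`. Composing such product blocks in depth is one way higher-degree monomials can be built up.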

Title: Foundations of Reinforcement Learning and Interactive Decision Making. (arXiv:2312.16730v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.16730
Code URL: null
Copy Paste: [[2312.16730]] Foundations of Reinforcement Learning and Interactive Decision Making(http://arxiv.org/abs/2312.16730)
Summary:
These lecture notes give a statistical perspective on the foundations of reinforcement learning and interactive decision making. We present a unifying framework for addressing the exploration-exploitation dilemma using frequentist and Bayesian approaches, with connections and parallels between supervised learning/estimation and decision making as an overarching theme. Special attention is paid to function approximation and flexible model classes such as neural networks. Topics covered include multi-armed and contextual bandits, structured bandits, and reinforcement learning with high-dimensional feedback.

hallucination

prompt

Title: Chatbot is Not All You Need: Information-rich Prompting for More Realistic Responses. (arXiv:2312.16233v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.16233
Code URL: null
Copy Paste: [[2312.16233]] Chatbot is Not All You Need: Information-rich Prompting for More Realistic Responses(http://arxiv.org/abs/2312.16233)
Summary:
Recent Large Language Models (LLMs) have shown remarkable capabilities in mimicking fictional characters or real humans in conversational settings. However, the realism and consistency of these responses can be further enhanced by providing richer information about the agent being mimicked. In this paper, we propose a novel approach to generate more realistic and consistent responses from LLMs, leveraging the five senses, attributes, emotional states, relationship with the interlocutor, and memories. By incorporating these factors, we aim to increase the LLM's capacity for generating natural and realistic reactions in conversational exchanges. Through our research, we expect to contribute to the development of LLMs that demonstrate improved capabilities in mimicking fictional characters. We release a new benchmark dataset and all our code, prompts, and sample results on our Github: https://github.com/srafsasm/InfoRichBot

code

Title: Learning Time-aware Graph Structures for Spatially Correlated Time Series Forecasting. (arXiv:2312.16403v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.16403
Code URL: null
Copy Paste: [[2312.16403]] Learning Time-aware Graph Structures for Spatially Correlated Time Series Forecasting(http://arxiv.org/abs/2312.16403)
Summary:
Spatio-temporal forecasting of future values of spatially correlated time series is important across many cyber-physical systems (CPS). Recent studies offer evidence that the use of graph neural networks to capture latent correlations between time series holds potential for enhanced forecasting. However, most existing methods rely on pre-defined or self-learning graphs, which are either static or unintentionally dynamic, and thus cannot model the time-varying correlations that exhibit trends and periodicities caused by the regularity of the underlying processes in CPS. To tackle this limitation, we propose Time-aware Graph Structure Learning (TagSL), which extracts time-aware correlations among time series by measuring the interaction of node and time representations in high-dimensional spaces. Notably, we introduce time discrepancy learning that utilizes contrastive learning with distance-based regularization terms to constrain learned spatial correlations to a trend sequence. Additionally, we propose a periodic discriminant function to enable the capture of periodic changes from the state of nodes. Next, we present a Graph Convolution-based Gated Recurrent Unit (GCGRU) that jointly captures spatial and temporal dependencies while learning time-aware and node-specific patterns. Finally, we introduce a unified framework named Time-aware Graph Convolutional Recurrent Network (TGCRN), combining TagSL and GCGRU in an encoder-decoder architecture for multi-step spatio-temporal forecasting. We report on experiments with TGCRN and popular existing approaches on five real-world datasets, thus providing evidence that TGCRN is capable of advancing the state-of-the-art. We also cover a detailed ablation study and visualization analysis, offering detailed insight into the effectiveness of time-aware structure learning.

Title: Soft Contrastive Learning for Time Series. (arXiv:2312.16424v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.16424
Code URL: https://github.com/seunghan96/softclt
Copy Paste: [[2312.16424]] Soft Contrastive Learning for Time Series(http://arxiv.org/abs/2312.16424)
Summary:
Contrastive learning has been shown to be effective for learning representations from time series in a self-supervised way. However, contrasting similar time series instances, or values from adjacent timestamps within a time series, ignores their inherent correlations, which deteriorates the quality of the learned representations. To address this issue, we propose SoftCLT, a simple yet effective soft contrastive learning strategy for time series. This is achieved by introducing instance-wise and temporal contrastive loss with soft assignments ranging from zero to one. Specifically, we define soft assignments for 1) instance-wise contrastive loss by the distance between time series on the data space, and 2) temporal contrastive loss by the difference of timestamps. SoftCLT is a plug-and-play method for time series contrastive learning that improves the quality of learned representations without bells and whistles. In experiments, we demonstrate that SoftCLT consistently improves the performance in various downstream tasks including classification, semi-supervised learning, transfer learning, and anomaly detection, showing state-of-the-art performance. Code is available at this repository: https://github.com/seunghan96/softclt.
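
A soft assignment for the temporal contrastive loss, ranging from zero to one and driven by the difference of timestamps, might look like the sketch below. The sigmoid-based functional form, the function name, and the sharpness parameter `tau` are assumptions made for illustration; the summary only states that the assignment depends on the timestamp difference.

```python
import math


def soft_assignment(t1: int, t2: int, tau: float = 0.5) -> float:
    """Soft label in (0, 1] for a pair of timestamps.

    Computes 2 * sigmoid(-tau * |t1 - t2|), written as 2 / (1 + exp(tau * d)):
    identical timestamps get 1.0 and the value decays toward 0 as the gap grows.
    """
    diff = abs(t1 - t2)
    return 2.0 / (1.0 + math.exp(tau * diff))
```

Adjacent timestamps therefore receive weights close to one instead of being treated as hard negatives, which is the correlation-preserving behaviour the summary describes.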

Title: Learning to Embed Time Series Patches Independently. (arXiv:2312.16427v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.16427
Code URL: https://github.com/seunghan96/pits
Copy Paste: [[2312.16427]] Learning to Embed Time Series Patches Independently(http://arxiv.org/abs/2312.16427)
Summary:
Masked time series modeling has recently gained much attention as a self-supervised representation learning strategy for time series. Inspired by masked image modeling in computer vision, recent works first patchify and partially mask out time series, and then train Transformers to capture the dependencies between patches by predicting masked patches from unmasked patches. However, we argue that capturing such patch dependencies might not be an optimal strategy for time series representation learning; rather, learning to embed patches independently results in better time series representations. Specifically, we propose to use 1) the simple patch reconstruction task, which autoencodes each patch without looking at other patches, and 2) the simple patch-wise MLP that embeds each patch independently. In addition, we introduce complementary contrastive learning to hierarchically capture adjacent time series information efficiently. Our proposed method improves time series forecasting and classification performance compared to state-of-the-art Transformer-based models, while it is more efficient in terms of the number of parameters and training/inference time. Code is available at this repository: https://github.com/seunghan96/pits.
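
The idea of a patch-wise MLP that embeds each patch without looking at other patches can be illustrated with the dependency-free sketch below. The weights, patch length, and function names are hypothetical toy values, and training (the patch reconstruction objective) is omitted.

```python
from typing import List


def patchify(series: List[float], patch_len: int) -> List[List[float]]:
    """Split a series into non-overlapping patches, dropping any remainder."""
    n = len(series) // patch_len
    return [series[i * patch_len:(i + 1) * patch_len] for i in range(n)]


def mlp_embed(patch: List[float], w1: List[List[float]], w2: List[List[float]]) -> List[float]:
    """Two-layer MLP (linear, ReLU, linear) applied to a single patch."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, patch))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def embed_series(series, patch_len, w1, w2):
    """Embed every patch with the same MLP; no patch sees any other patch."""
    return [mlp_embed(p, w1, w2) for p in patchify(series, patch_len)]


# Toy weights (assumed values, not learned).
W1 = [[0.5, -0.5], [1.0, 1.0]]
W2 = [[1.0, 0.0], [0.0, 1.0]]
a = embed_series([1.0, 2.0, 3.0, 4.0], 2, W1, W2)
b = embed_series([1.0, 2.0, 9.0, 9.0], 2, W1, W2)
```

Changing only the second patch leaves the first patch's embedding untouched (`a[0] == b[0]`), which is the independence property that a Transformer attending across patches would not have.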

Title: FALCON: Feature-Label Constrained Graph Net Collapse for Memory Efficient GNNs. (arXiv:2312.16542v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.16542
Code URL: null
Copy Paste: [[2312.16542]] FALCON: Feature-Label Constrained Graph Net Collapse for Memory Efficient GNNs(http://arxiv.org/abs/2312.16542)
Summary:
Graph Neural Networks (GNNs) ushered in a new era of machine learning with interconnected datasets. While traditional neural networks can only be trained on independent samples, GNNs allow for the inclusion of inter-sample interactions in the training process. This gain, however, incurs additional memory cost, rendering most GNNs unscalable for real-world applications involving vast and complicated networks with tens of millions of nodes (e.g., social circles, web graphs, and brain graphs). This means that storing the graph in the main memory can be difficult, let alone training the GNN model with significantly less GPU memory. While much of the recent literature has focused on either mini-batching GNN methods or quantization, graph reduction methods remain scarce. Furthermore, present graph reduction approaches have several drawbacks. First, most graph reduction focuses only on the inference stage (e.g., condensation and distillation) and requires full graph GNN training, which does not reduce training memory footprint. Second, many methods focus solely on the graph's structural aspect, ignoring the initial population feature-label distribution, resulting in a skewed post-reduction label distribution. Here, we propose a Feature-Label COnstrained graph Net collapse, FALCON, to address these limitations. Our three core contributions lie in (i) designing FALCON, a topology-aware graph reduction technique that preserves feature-label distribution; (ii) implementing FALCON with other memory reduction methods (i.e., mini-batched GNN and quantization) for further memory reduction; (iii) extensive benchmarking and ablation studies against SOTA methods to evaluate FALCON memory reduction. Our extensive results show that FALCON can significantly collapse various public datasets while achieving equal prediction quality across GNN models. Code: https://github.com/basiralab/FALCON

Title: Mitigating Degree Biases in Message Passing Mechanism by Utilizing Community Structures. (arXiv:2312.16788v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.16788
Code URL: https://github.com/nslab-cuk/community-aware-graph-transformer
Copy Paste: [[2312.16788]] Mitigating Degree Biases in Message Passing Mechanism by Utilizing Community Structures(http://arxiv.org/abs/2312.16788)
Summary:
This study utilizes community structures to address node degree biases in message-passing (MP) via learnable graph augmentations and novel graph transformers. Recent augmentation-based methods showed that MP neural networks often perform poorly on low-degree nodes, leading to degree biases due to a lack of messages reaching low-degree nodes. Despite their success, most methods use heuristic or uniform random augmentations, which are non-differentiable and may not always generate valuable edges for learning representations. In this paper, we propose Community-aware Graph Transformers, namely CGT, to learn degree-unbiased representations based on learnable augmentations and graph transformers by extracting within-community structures. We first design a learnable graph augmentation to generate more within-community edges connecting low-degree nodes through edge perturbation. Second, we propose an improved self-attention to learn underlying proximity and the roles of nodes within the community. Third, we propose a self-supervised learning task that could learn the representations to preserve the global graph structure and regularize the graph augmentations. Extensive experiments on various benchmark datasets showed that CGT outperforms state-of-the-art baselines and significantly mitigates node degree biases. The source code is available at https://github.com/NSLab-CUK/Community-aware-Graph-Transformer.

Title: Transfer and Alignment Network for Generalized Category Discovery. (arXiv:2312.16467v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.16467
Code URL: https://github.com/lackel/tan
Copy Paste: [[2312.16467]] Transfer and Alignment Network for Generalized Category Discovery(http://arxiv.org/abs/2312.16467)
Summary:
Generalized Category Discovery is a crucial real-world task. Despite the improved performance on known categories, current methods perform poorly on novel categories. We attribute the poor performance to two reasons: biased knowledge transfer between labeled and unlabeled data and noisy representation learning on the unlabeled data. To mitigate these two issues, we propose a Transfer and Alignment Network (TAN), which incorporates two knowledge transfer mechanisms to calibrate the biased knowledge and two feature alignment mechanisms to learn discriminative features. Specifically, we model different categories with prototypes and transfer the prototypes in labeled data to correct model bias towards known categories. On the one hand, we pull instances with known categories in unlabeled data closer to these prototypes to form more compact clusters and avoid boundary overlap between known and novel categories. On the other hand, we use these prototypes to calibrate noisy prototypes estimated from unlabeled data based on category similarities, which allows for more accurate estimation of prototypes for novel categories that can be used as reliable learning targets later. After knowledge transfer, we further propose two feature alignment mechanisms to acquire both instance- and category-level knowledge from unlabeled data by aligning instance features with both augmented features and the calibrated prototypes, which can boost model performance on both known and novel categories with less noise. Experiments on three benchmark datasets show that our model outperforms SOTA methods, especially on novel categories. Theoretical analysis is provided for an in-depth understanding of our model in general. Our code and data are available at https://github.com/Lackel/TAN.

Title: Source Code is a Graph, Not a Sequence: A Cross-Lingual Perspective on Code Clone Detection. (arXiv:2312.16488v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.16488
Code URL: null
Copy Paste: [[2312.16488]] Source Code is a Graph, Not a Sequence: A Cross-Lingual Perspective on Code Clone Detection(http://arxiv.org/abs/2312.16488)
Summary:
Source code clone detection is the task of finding code fragments that have the same or similar functionality, but may differ in syntax or structure. This task is important for software maintenance, reuse, and quality assurance (Roy et al. 2009). However, code clone detection is challenging, as source code can be written in different languages, domains, and styles. In this paper, we argue that source code is inherently a graph, not a sequence, and that graph-based methods are more suitable for code clone detection than sequence-based methods. We compare the performance of two state-of-the-art models: CodeBERT (Feng et al. 2020), a sequence-based model, and CodeGraph (Yu et al. 2023), a graph-based model, on two benchmark data-sets: BCB (Svajlenko et al. 2014) and PoolC (PoolC no date). We show that CodeGraph outperforms CodeBERT on both data-sets, especially on cross-lingual code clones. To the best of our knowledge, this is the first work to demonstrate the superiority of graph-based methods over sequence-based methods on cross-lingual code clone detection.

Title: Make BERT-based Chinese Spelling Check Model Enhanced by Layerwise Attention and Gaussian Mixture Model. (arXiv:2312.16623v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.16623
Code URL: null
Copy Paste: [[2312.16623]] Make BERT-based Chinese Spelling Check Model Enhanced by Layerwise Attention and Gaussian Mixture Model(http://arxiv.org/abs/2312.16623)
Summary:
BERT-based models have recently shown a remarkable ability in the Chinese Spelling Check (CSC) task. However, traditional BERT-based methods still suffer from two limitations. First, although previous works have identified that explicit prior knowledge like Part-Of-Speech (POS) tagging can benefit the CSC task, they neglected the fact that spelling errors inherent in CSC data can lead to incorrect tags and therefore mislead models. Additionally, they ignored the correlation between the implicit hierarchical information encoded by BERT's intermediate layers and different linguistic phenomena. This results in sub-optimal accuracy. To alleviate the above two issues, we design a heterogeneous knowledge-infused framework to strengthen BERT-based CSC models. To incorporate explicit POS knowledge, we utilize an auxiliary task strategy driven by a Gaussian mixture model. Meanwhile, to incorporate implicit hierarchical linguistic knowledge within the encoder, we propose a novel form of n-gram-based layerwise self-attention to generate a multilayer representation. Experimental results show that our proposed framework yields a stable performance boost over four strong baseline models and outperforms the previous state-of-the-art methods on two datasets.

Title: Continuous-time Autoencoders for Regular and Irregular Time Series Imputation. (arXiv:2312.16581v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.16581
Code URL: null
Copy Paste: [[2312.16581]] Continuous-time Autoencoders for Regular and Irregular Time Series Imputation(http://arxiv.org/abs/2312.16581)
Summary:
Time series imputation is one of the most fundamental tasks for time series. Real-world time series datasets are frequently incomplete (or irregular with missing observations), in which case imputation is strongly required. Many different time series imputation methods have been proposed. Recent self-attention-based methods show state-of-the-art imputation performance. However, designing an imputation method based on continuous-time recurrent neural networks (RNNs), i.e., neural controlled differential equations (NCDEs), has long been overlooked. To this end, we redesign time series (variational) autoencoders based on NCDEs. Our method, called continuous-time autoencoder (CTA), encodes an input time series sample into a continuous hidden path (rather than a hidden vector) and decodes it to reconstruct and impute the input. In our experiments with 4 datasets and 19 baselines, our method shows the best imputation performance in almost all cases.

Title: Enhancing Traffic Flow Prediction using Outlier-Weighted AutoEncoders: Handling Real-Time Changes. (arXiv:2312.16596v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.16596
Code URL: https://github.com/himanshudce/owam
Copy Paste: [[2312.16596]] Enhancing Traffic Flow Prediction using Outlier-Weighted AutoEncoders: Handling Real-Time Changes(http://arxiv.org/abs/2312.16596)
Summary:
In today's urban landscape, traffic congestion poses a critical challenge, especially during outlier scenarios. These outliers can indicate abrupt traffic peaks, drops, or irregular trends, often arising from factors such as accidents, events, or roadwork. Moreover, given the dynamic nature of traffic, real-time traffic modeling also becomes crucial to ensure accurate and up-to-date traffic predictions. To address these challenges, we introduce the Outlier Weighted Autoencoder Modeling (OWAM) framework. OWAM employs autoencoders for local outlier detection and generates correlation scores to assess neighboring traffic's influence. These scores serve as a weighting factor for neighboring sensors before they are fused into the model. This information enhances the traffic model's performance and supports effective real-time updates, a crucial aspect for capturing dynamic traffic patterns. OWAM demonstrates a favorable trade-off between accuracy and efficiency, rendering it highly suitable for real-world applications. The research findings contribute significantly to the development of more efficient and adaptive traffic prediction models, advancing the field of transportation management for the future. The code and datasets of our framework are publicly available at https://github.com/himanshudce/OWAM.

Title: Learning the Dynamic Correlations and Mitigating Noise by Hierarchical Convolution for Long-term Sequence Forecasting. (arXiv:2312.16790v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.16790
Code URL: https://github.com/yzhhoward/hmnet
Copy Paste: [[2312.16790]] Learning the Dynamic Correlations and Mitigating Noise by Hierarchical Convolution for Long-term Sequence Forecasting(http://arxiv.org/abs/2312.16790)
Summary:
Deep learning algorithms, especially Transformer-based models, have achieved significant performance by capturing long-range dependencies and historical information. However, the power of convolution has not been fully investigated. Moreover, most existing works ignore the dynamic interaction among variables and evolutionary noise in series. Addressing these issues, we propose a Hierarchical Memorizing Network (HMNet). In particular, a hierarchical convolution structure is introduced to extract the information from the series at various scales. Besides, we propose a dynamic variable interaction module to learn the varying correlation and an adaptive denoising module to search and exploit similar patterns to alleviate noises. These modules can cooperate with the hierarchical structure from the perspective of fine to coarse grain. Experiments on five benchmarks demonstrate that HMNet significantly outperforms the state-of-the-art models by 10.6% on MSE and 5.7% on MAE. Our code is released at https://github.com/yzhHoward/HMNet.

chat

retrieval augmented generation

retrieval-augmented generation

rag

Title: OpenRL: A Unified Reinforcement Learning Framework. (arXiv:2312.16189v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.16189
Code URL: https://github.com/openrl-lab/openrl
Copy Paste: [[2312.16189]] OpenRL: A Unified Reinforcement Learning Framework(http://arxiv.org/abs/2312.16189)
Summary:
We present OpenRL, an advanced reinforcement learning (RL) framework designed to accommodate a diverse array of tasks, from single-agent challenges to complex multi-agent systems. OpenRL's robust support for self-play training empowers agents to develop advanced strategies in competitive settings. Notably, OpenRL integrates Natural Language Processing (NLP) with RL, enabling researchers to address a combination of RL training and language-centric tasks effectively. Leveraging PyTorch's robust capabilities, OpenRL exemplifies modularity and a user-centric approach. It offers a universal interface that simplifies the user experience for beginners while maintaining the flexibility experts require for innovation and algorithm development. This equilibrium enhances the framework's practicality, adaptability, and scalability, establishing a new standard in RL research. To delve into OpenRL's features, we invite researchers and enthusiasts to explore our GitHub repository at https://github.com/OpenRL-Lab/openrl and access our comprehensive documentation at https://openrl-docs.readthedocs.io.

Title: Learning temporal formulas from examples is hard. (arXiv:2312.16336v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.16336
Code URL: null
Copy Paste: [[2312.16336]] Learning temporal formulas from examples is hard(http://arxiv.org/abs/2312.16336)
Summary:
We study the problem of learning linear temporal logic (LTL) formulas from examples, as a first step towards expressing a property separating positive and negative instances in a way that is comprehensible for humans. In this paper we initiate the study of the computational complexity of the problem. Our main results are hardness results: we show that the LTL learning problem is NP-complete, both for the full logic and for almost all of its fragments. This motivates the search for efficient heuristics, and highlights the complexity of expressing separating properties in concise natural language.
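
Illustration (not from the paper): checking whether a candidate formula separates positive from negative examples can be sketched as below. The sketch assumes finite, non-empty traces given as lists of sets of atomic propositions and covers only a small fragment (atoms, negation, conjunction, X, F, G). The tuple encoding and the names `holds` and `separates` are hypothetical. Verifying one candidate like this is cheap; the hardness result concerns finding such a formula.

```python
def holds(formula, trace, i=0):
    """Evaluate a formula at position i of a finite, non-empty trace."""
    op = formula[0]
    if op == "atom":
        return formula[1] in trace[i]
    if op == "not":
        return not holds(formula[1], trace, i)
    if op == "and":
        return holds(formula[1], trace, i) and holds(formula[2], trace, i)
    if op == "X":  # next: false at the last position (finite-trace assumption)
        return i + 1 < len(trace) and holds(formula[1], trace, i + 1)
    if op == "F":  # eventually
        return any(holds(formula[1], trace, j) for j in range(i, len(trace)))
    if op == "G":  # always
        return all(holds(formula[1], trace, j) for j in range(i, len(trace)))
    raise ValueError(f"unknown operator: {op}")


def separates(formula, positives, negatives):
    """True if the formula holds on every positive and on no negative trace."""
    return all(holds(formula, t) for t in positives) and not any(
        holds(formula, t) for t in negatives
    )


if __name__ == "__main__":
    eventually_p = ("F", ("atom", "p"))
    print(separates(eventually_p, [[set(), {"p"}]], [[set(), {"q"}]]))
```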

Title: FCDNet: Frequency-Guided Complementary Dependency Modeling for Multivariate Time-Series Forecasting. (arXiv:2312.16450v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.16450
Code URL: https://github.com/oncecwj/fcdnet
Copy Paste: [[2312.16450]] FCDNet: Frequency-Guided Complementary Dependency Modeling for Multivariate Time-Series Forecasting(http://arxiv.org/abs/2312.16450)
Summary:
Multivariate time-series (MTS) forecasting is a challenging task in many real-world non-stationary dynamic scenarios. In addition to intra-series temporal signals, the inter-series dependency also plays a crucial role in shaping future trends. How to make the model aware of dependency information has attracted substantial research attention. Previous approaches have either presupposed dependency constraints based on domain knowledge or imposed them using real-time feature similarity. However, MTS data often exhibit both enduring long-term static relationships and transient short-term interactions, which mutually influence their evolving states. It is necessary to recognize and incorporate the complementary dependencies for more accurate MTS prediction. The frequency information in time series reflects the evolutionary rules behind complex temporal dynamics, and different frequency components can be used to construct long-term and short-term interactive dependency structures between variables. To this end, we propose FCDNet, a concise yet effective framework for multivariate time-series forecasting. Specifically, FCDNet overcomes the above limitations by applying two light-weight dependency constructors to help extract long- and short-term dependency information adaptively from multi-level frequency patterns. With the growth of input variables, the number of trainable parameters in FCDNet only increases linearly, which is conducive to the model's scalability and avoids over-fitting. Additionally, adopting a frequency-based perspective can effectively mitigate the influence of noise within MTS data, which helps capture more genuine dependencies. The experimental results on six real-world datasets from multiple fields show that FCDNet significantly exceeds strong baselines, with an average improvement of 6.82% on MAE, 4.98% on RMSE, and 4.91% on MAPE.

Title: Federated Continual Learning via Knowledge Fusion: A Survey. (arXiv:2312.16475v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.16475
Code URL: null
Copy Paste: [[2312.16475]] Federated Continual Learning via Knowledge Fusion: A Survey(http://arxiv.org/abs/2312.16475)
Summary:
Data privacy and silos are nontrivial and greatly challenging in many real-world applications. Federated learning is a decentralized approach to training models across multiple local clients without the exchange of raw data from client devices to global servers. However, existing works focus on a static data environment and ignore continual learning from streaming data with incremental tasks. Federated Continual Learning (FCL) is an emerging paradigm to address model learning in both federated and continual learning environments. The key objective of FCL is to fuse heterogeneous knowledge from different clients and retain knowledge of previous tasks while learning on new ones. In this work, we delineate federated learning and continual learning first and then discuss their integration, i.e., FCL, and in particular FCL via knowledge fusion. In summary, our motivations are four-fold: we (1) raise a fundamental problem called "spatial-temporal catastrophic forgetting" and evaluate its impact on the performance using a well-known method called federated averaging (FedAvg), (2) integrate most of the existing FCL methods into two generic frameworks, namely synchronous FCL and asynchronous FCL, (3) categorize a large number of methods according to the mechanism involved in knowledge fusion, and finally (4) showcase an outlook on the future work of FCL.
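
Illustration (not from the survey): the aggregation step of federated averaging (FedAvg), mentioned above, is commonly described as a sample-size-weighted mean of client parameters. A minimal sketch follows, with parameters flattened into plain lists; the function name `federated_average` is hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Average client parameter vectors, weighting each by its sample count."""
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("client_sizes must sum to a positive number")
    dim = len(client_weights[0])
    return [
        sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
        for j in range(dim)
    ]


if __name__ == "__main__":
    # Two clients; the second holds three times as much data.
    print(federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

Only parameters and sample counts are exchanged here, never raw data, which is the property the abstract highlights.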

Title: On the Granular Representation of Fuzzy Quantifier-Based Fuzzy Rough Sets. (arXiv:2312.16704v1 [cs.AI])

Paper URL: http://arxiv.org/abs/2312.16704
Code URL: null
Copy Paste: [[2312.16704]] On the Granular Representation of Fuzzy Quantifier-Based Fuzzy Rough Sets(http://arxiv.org/abs/2312.16704)
Summary:
Rough set theory is a well-known mathematical framework that can deal with inconsistent data by providing lower and upper approximations of concepts. A prominent property of these approximations is their granular representation: that is, they can be written as unions of simple sets, called granules. The latter can be identified with "if ..., then ..." rules, which form the backbone of rough set rule induction. It has been shown previously that this property can be maintained for various fuzzy rough set models, including those based on ordered weighted average (OWA) operators. In this paper, we will focus on some instances of the general class of fuzzy quantifier-based fuzzy rough sets (FQFRS). In these models, the lower and upper approximations are evaluated using binary and unary fuzzy quantifiers, respectively. One of the main targets of this study is to examine the granular representation of different models of FQFRS. The main findings reveal that Choquet-based fuzzy rough sets can be represented granularly under the same conditions as OWA-based fuzzy rough sets, whereas Sugeno-based FRS can always be represented granularly. This observation highlights the potential of these models for resolving data inconsistencies and managing noise.

Title: The Fourth International Verification of Neural Networks Competition (VNN-COMP 2023): Summary and Results. (arXiv:2312.16760v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.16760
Code URL: https://github.com/stanleybak/vnncomp2023
Copy Paste: [[2312.16760]] The Fourth International Verification of Neural Networks Competition (VNN-COMP 2023): Summary and Results(http://arxiv.org/abs/2312.16760)
Summary:
This report summarizes the 4th International Verification of Neural Networks Competition (VNN-COMP 2023), held as a part of the 6th Workshop on Formal Methods for ML-Enabled Autonomous Systems (FoMLAS), that was collocated with the 35th International Conference on Computer-Aided Verification (CAV). VNN-COMP is held annually to facilitate the fair and objective comparison of state-of-the-art neural network verification tools, encourage the standardization of tool interfaces, and bring together the neural network verification community. To this end, standardized formats for networks (ONNX) and specification (VNN-LIB) were defined, tools were evaluated on equal-cost hardware (using an automatic evaluation pipeline based on AWS instances), and tool parameters were chosen by the participants before the final test sets were made public. In the 2023 iteration, 7 teams participated on a diverse set of 10 scored and 4 unscored benchmarks. This report summarizes the rules, benchmarks, participating tools, results, and lessons learned from this iteration of this competition.

Title: AdapterDistillation: Non-Destructive Task Composition with Knowledge Distillation. (arXiv:2312.16261v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.16261
Code URL: null
Copy Paste: [[2312.16261]] AdapterDistillation: Non-Destructive Task Composition with Knowledge Distillation(http://arxiv.org/abs/2312.16261)
Summary:
Leveraging knowledge from multiple tasks by introducing a small number of task-specific parameters, known as adapters, into each transformer layer has received much attention recently. However, adding an extra fusion layer to implement knowledge composition not only increases the inference time but is also non-scalable for some applications. To avoid these issues, we propose a two-stage knowledge distillation algorithm called AdapterDistillation. In the first stage, we extract task-specific knowledge by using local data to train a student adapter. In the second stage, we distill the knowledge from the existing teacher adapters into the student adapter to help its inference. Extensive experiments on frequently asked question retrieval in task-oriented dialog systems validate the efficiency of AdapterDistillation. We show that AdapterDistillation outperforms existing algorithms in terms of accuracy, resource consumption and inference time.

Title: Dynamic Sub-graph Distillation for Robust Semi-supervised Continual Learning. (arXiv:2312.16409v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.16409
Code URL: https://github.com/fanyan0411/dsgd
Copy Paste: [[2312.16409]] Dynamic Sub-graph Distillation for Robust Semi-supervised Continual Learning(http://arxiv.org/abs/2312.16409)
Summary:
Continual learning (CL) has shown promising results and comparable performance to learning at once in a fully supervised manner. However, CL strategies typically require a large number of labeled samples, making their real-life deployment challenging. In this work, we focus on semi-supervised continual learning (SSCL), where the model progressively learns from partially labeled data with unknown categories. We provide a comprehensive analysis of SSCL and demonstrate that unreliable distributions of unlabeled data lead to unstable training and refinement of the progressing stages. This problem severely impacts the performance of SSCL. To address the limitations, we propose a novel approach called Dynamic Sub-Graph Distillation (DSGD) for semi-supervised continual learning, which leverages both semantic and structural information to achieve more stable knowledge distillation on unlabeled data and exhibit robustness against distribution bias. Firstly, we formalize a general model of structural distillation and design a dynamic graph construction for the continual learning progress. Next, we define a structure distillation vector and design a dynamic sub-graph distillation algorithm, which enables end-to-end training and adaptability to scale up tasks. The entire proposed method is adaptable to various CL methods and supervision settings. Finally, experiments conducted on three datasets CIFAR10, CIFAR100, and ImageNet-100, with varying supervision ratios, demonstrate the effectiveness of our proposed approach in mitigating the catastrophic forgetting problem in semi-supervised continual learning scenarios.

Title: MolSets: Molecular Graph Deep Sets Learning for Mixture Property Modeling. (arXiv:2312.16473v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.16473
Code URL: null
Copy Paste: [[2312.16473]] MolSets: Molecular Graph Deep Sets Learning for Mixture Property Modeling(http://arxiv.org/abs/2312.16473)
Summary:
Recent advances in machine learning (ML) have expedited materials discovery and design. One significant challenge faced in ML for materials is the expansive combinatorial space of potential materials formed by diverse constituents and their flexible configurations. This complexity is particularly evident in molecular mixtures, a frequently explored space for materials such as battery electrolytes. Owing to the complex structures of molecules and the sequence-independent nature of mixtures, conventional ML methods have difficulties in modeling such systems. Here we present MolSets, a specialized ML model for molecular mixtures. Representing individual molecules as graphs and their mixture as a set, MolSets leverages a graph neural network and the deep sets architecture to extract information at the molecule level and aggregate it at the mixture level, thus addressing local complexity while retaining global flexibility. We demonstrate the efficacy of MolSets in predicting the conductivity of lithium battery electrolytes and highlight its benefits in virtual screening of the combinatorial chemical space.
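
Illustration (not from the paper): the deep sets pattern described above can be sketched as an element-wise encoder, an order-independent pooling step, and a readout. Here each molecule is assumed to be already summarized as a feature vector (a stand-in for the graph neural network output), and `phi`, `rho`, and `deep_set` are hypothetical toy functions.

```python
import math


def phi(molecule_features):
    """Toy per-molecule encoder (stand-in for a graph neural network)."""
    return [math.tanh(v) for v in molecule_features]


def rho(pooled):
    """Toy mixture-level readout applied after pooling."""
    return math.fsum(pooled)


def deep_set(molecules):
    """Encode each molecule, sum-pool across the set, then read out."""
    encoded = [phi(m) for m in molecules]
    # fsum is exactly rounded, so the pooled value does not depend on element order.
    pooled = [math.fsum(column) for column in zip(*encoded)]
    return rho(pooled)


if __name__ == "__main__":
    mixture = [[0.1, 0.2], [1.5, -0.3], [0.7, 0.7]]
    print(deep_set(mixture) == deep_set(list(reversed(mixture))))
```

Sum pooling is what makes the prediction independent of the order in which the mixture's components are listed.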

Title: Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation. (arXiv:2312.16478v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.16478
Code URL: null
Copy Paste: [[2312.16478]] Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation(http://arxiv.org/abs/2312.16478)
Summary:
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious to collect in practice. Recently, to alleviate expensive data collection, co-occurring pairs from the Internet are automatically harvested for training. However, this inevitably includes mismatched pairs, i.e., noisy correspondences, undermining supervision reliability and degrading performance. Current methods leverage deep neural networks' memorization effect to address noisy correspondences, which overconfidently focus on similarity-guided training with hard negatives and suffer from self-reinforcing errors. In light of the above, we introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM). Specifically, by viewing sample matching as classification tasks within the batch, we generate classification logits for the given sample. Instead of a single similarity score, we refine sample filtration through energy uncertainty and estimate the model's sensitivity of selected clean samples using swapped classification entropy, in view of the overall prediction distribution. Additionally, we propose cross-modal biased complementary learning to leverage negative matches overlooked in hard-negative training, further improving model optimization stability and curbing self-reinforcing errors. Extensive experiments on challenging benchmarks affirm the efficacy and efficiency of SREM.
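
Illustration (not from the paper): the abstract does not give the formula for its energy uncertainty. The sketch below uses the energy score commonly used in energy-based out-of-distribution detection, the negative temperature-scaled log-sum-exp of the logits, which may differ from the paper's definition. The names `logsumexp`, `energy`, and `temperature` are assumptions for this example.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def energy(logits, temperature=1.0):
    """Energy of a logit vector; lower energy means a more peaked prediction."""
    return -temperature * logsumexp([z / temperature for z in logits])


if __name__ == "__main__":
    confident = [10.0, 0.0, 0.0]  # one in-batch match clearly dominates
    ambiguous = [1.0, 1.0, 1.0]   # no candidate stands out
    print(energy(confident) < energy(ambiguous))
```

Unlike a single similarity score, this quantity depends on the whole vector of in-batch logits, which matches the abstract's emphasis on the overall prediction distribution.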

Title: Agnostically Learning Multi-index Models with Queries. (arXiv:2312.16616v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.16616
Code URL: null
Copy Paste: [[2312.16616]] Agnostically Learning Multi-index Models with Queries(http://arxiv.org/abs/2312.16616)
Summary:
We study the power of query access for the task of agnostic learning under the Gaussian distribution. In the agnostic model, no assumptions are made on the labels and the goal is to compute a hypothesis that is competitive with the best-fit function in a known class, i.e., it achieves error $\mathrm{opt}+\epsilon$, where $\mathrm{opt}$ is the error of the best function in the class. We focus on a general family of Multi-Index Models (MIMs), which are $d$-variate functions that depend only on a few relevant directions, i.e., have the form $g(\mathbf{W} \mathbf{x})$ for an unknown link function $g$ and a $k \times d$ matrix $\mathbf{W}$. Multi-index models cover a wide range of commonly studied function classes, including constant-depth neural networks with ReLU activations, and intersections of halfspaces.

Our main result shows that query access gives significant runtime improvements over random examples for agnostically learning MIMs. Under standard regularity assumptions for the link function (namely, bounded variation or surface area), we give an agnostic query learner for MIMs with complexity $O(k)^{\mathrm{poly}(1/\epsilon)} \; \mathrm{poly}(d) $. In contrast, algorithms that rely only on random examples inherently require $d^{\mathrm{poly}(1/\epsilon)}$ samples and runtime, even for the basic problem of agnostically learning a single ReLU or a halfspace.

Our algorithmic result establishes a strong computational separation between the agnostic PAC and the agnostic PAC+Query models under the Gaussian distribution. Prior to our work, no such separation was known -- even for the special case of agnostically learning a single halfspace, for which it was an open problem first posed by Feldman. Our results are enabled by a general dimension-reduction technique that leverages query access to estimate gradients of (a smoothed version of) the underlying label function.

Title: Disentangled Continual Learning: Separating Memory Edits from Model Updates. (arXiv:2312.16731v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.16731
Code URL: null
Copy Paste: [[2312.16731]] Disentangled Continual Learning: Separating Memory Edits from Model Updates(http://arxiv.org/abs/2312.16731)
Summary:
The ability of machine learning systems to learn continually is hindered by catastrophic forgetting, the tendency of neural networks to overwrite existing knowledge when learning a new task. Existing continual learning methods alleviate this problem through regularisation, parameter isolation, or rehearsal, and are typically evaluated on benchmarks consisting of a handful of tasks. We propose a novel conceptual approach to continual classification that aims to disentangle class-specific information that needs to be memorised from the class-agnostic knowledge that encapsulates generalization. We store the former in a buffer that can be easily pruned or updated when new categories arrive, while the latter is represented with a neural network that generalizes across tasks. We show that the class-agnostic network does not suffer from catastrophic forgetting and by leveraging it to perform classification, we improve accuracy on past tasks over time. In addition, our approach supports open-set classification and one-shot generalization. To test our conceptual framework, we introduce Infinite dSprites, a tool for creating continual classification and disentanglement benchmarks of arbitrary length with full control over generative factors. We show that over a sufficiently long time horizon all major types of continual learning methods break down, while our approach enables continual learning over hundreds of tasks with explicit control over memorization and forgetting.

multi-run

chain-of-thought

tree-of-thought

agent

Title: Dynamic Knowledge Injection for AIXI Agents. (arXiv:2312.16184v1 [cs.AI])

Paper URL: http://arxiv.org/abs/2312.16184
Code URL: null
Copy Paste: [[2312.16184]] Dynamic Knowledge Injection for AIXI Agents(http://arxiv.org/abs/2312.16184)
Summary:
Prior approximations of AIXI, a Bayesian optimality notion for general reinforcement learning, can only approximate AIXI's Bayesian environment model using an a-priori defined set of models. This is a fundamental source of epistemic uncertainty for the agent in settings where the existence of systematic bias in the predefined model class cannot be resolved by simply collecting more data from the environment. We address this issue in the context of Human-AI teaming by considering a setup where additional knowledge for the agent in the form of new candidate models arrives from a human operator in an online fashion. We introduce a new agent called DynamicHedgeAIXI that maintains an exact Bayesian mixture over dynamically changing sets of models via a time-adaptive prior constructed from a variant of the Hedge algorithm. The DynamicHedgeAIXI agent is the richest direct approximation of AIXI known to date and comes with good performance guarantees. Experimental results on epidemic control on contact networks validate the agent's practical utility.
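
For orientation, the standard Hedge (exponential weights) update that the summary refers to can be sketched as below. This is a minimal illustration of plain Hedge over a set of experts that may grow over time, not the DynamicHedgeAIXI construction itself. The class name `Hedge`, the `learning_rate` value, and the rule for initialising a newly added expert are all assumptions made for this example.

```python
import math


class Hedge:
    """Plain Hedge (exponential weights) over a set of experts that can grow."""

    def __init__(self, learning_rate=0.5):
        self.learning_rate = learning_rate
        self.log_weights = {}  # expert name -> unnormalised log-weight

    def add_expert(self, name):
        # Assumption for this sketch: a newcomer starts at the current maximum
        # log-weight, so it is not penalised for arriving late.
        start = max(self.log_weights.values(), default=0.0)
        self.log_weights[name] = start

    def probabilities(self):
        # Normalise in log-space for numerical stability.
        top = max(self.log_weights.values())
        scaled = {k: math.exp(v - top) for k, v in self.log_weights.items()}
        total = sum(scaled.values())
        return {k: v / total for k, v in scaled.items()}

    def update(self, losses):
        # Multiplicative update w <- w * exp(-eta * loss), done in log-space.
        for name, loss in losses.items():
            self.log_weights[name] -= self.learning_rate * loss


hedge = Hedge()
hedge.add_expert("model_a")
hedge.add_expert("model_b")
hedge.update({"model_a": 1.0, "model_b": 0.0})  # model_a predicted worse
hedge.add_expert("model_c")                     # arrives online
print(hedge.probabilities())
```

In this toy version the weights play the role of a posterior over candidate models: experts with lower cumulative loss receive more mass, and new candidates can be admitted at any step.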

Title: XuanCe: A Comprehensive and Unified Deep Reinforcement Learning Library. (arXiv:2312.16248v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.16248
Code URL: https://github.com/agi-brain/xuance
Copy Paste: [[2312.16248]] XuanCe: A Comprehensive and Unified Deep Reinforcement Learning Library(http://arxiv.org/abs/2312.16248)
Summary:
In this paper, we present XuanCe, a comprehensive and unified deep reinforcement learning (DRL) library designed to be compatible with PyTorch, TensorFlow, and MindSpore. XuanCe offers a wide range of functionalities, including over 40 classical DRL and multi-agent DRL algorithms, with the flexibility to easily incorporate new algorithms and environments. It is a versatile DRL library that supports CPU, GPU, and Ascend, and can be executed on various operating systems such as Ubuntu, Windows, MacOS, and EulerOS. Extensive benchmarks conducted on popular environments including MuJoCo, Atari, and StarCraftII multi-agent challenge demonstrate the library's impressive performance. XuanCe is open-source and can be accessed at https://github.com/agi-brain/xuance.git.

Title: Active Third-Person Imitation Learning. (arXiv:2312.16365v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.16365
Code URL: null
Copy Paste: [[2312.16365]] Active Third-Person Imitation Learning(http://arxiv.org/abs/2312.16365)
Summary:
We consider the problem of third-person imitation learning with the additional challenge that the learner must select the perspective from which they observe the expert. In our setting, each perspective provides only limited information about the expert's behavior, and the learning agent must carefully select and combine information from different perspectives to achieve competitive performance. This setting is inspired by real-world imitation learning applications, e.g., in robotics, a robot might observe a human demonstrator via camera and receive information from different perspectives depending on the camera's position. We formalize the aforementioned active third-person imitation learning problem, theoretically analyze its characteristics, and propose a generative adversarial network-based active learning approach. Empirically, we demonstrate that our proposed approach can effectively learn from expert demonstrations and explore the importance of different architectural choices for the learner's performance.

Title: Adaptive Anytime Multi-Agent Path Finding Using Bandit-Based Large Neighborhood Search. (arXiv:2312.16767v1 [cs.AI])

Paper URL: http://arxiv.org/abs/2312.16767
Code URL: https://github.com/thomyphan/anytime-mapf
Copy Paste: [[2312.16767]] Adaptive Anytime Multi-Agent Path Finding Using Bandit-Based Large Neighborhood Search(http://arxiv.org/abs/2312.16767)
Summary:
Anytime multi-agent path finding (MAPF) is a promising approach to scalable path optimization in large-scale multi-agent systems. State-of-the-art anytime MAPF is based on Large Neighborhood Search (LNS), where a fast initial solution is iteratively optimized by destroying and repairing a fixed number of parts, i.e., the neighborhood, of the solution, using randomized destroy heuristics and prioritized planning. Despite their recent success in various MAPF instances, current LNS-based approaches lack exploration and flexibility due to greedy optimization with a fixed neighborhood size, which can lead to low-quality solutions in general. So far, these limitations have been addressed with extensive prior effort in tuning or offline machine learning beyond actual planning. In this paper, we focus on online learning in LNS and propose Bandit-based Adaptive LArge Neighborhood search Combined with Exploration (BALANCE). BALANCE uses a bi-level multi-armed bandit scheme to adapt the selection of destroy heuristics and neighborhood sizes on the fly during search. We evaluate BALANCE on multiple maps from the MAPF benchmark set and empirically demonstrate cost improvements of at least 50% compared to state-of-the-art anytime MAPF in large-scale scenarios. We find that Thompson Sampling performs particularly well compared to alternative multi-armed bandit algorithms.
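
The idea of a bi-level bandit with Thompson Sampling can be sketched roughly as follows. This is an illustrative toy, not the code from the linked repository: the class names, the heuristic labels, the candidate neighborhood sizes, and the binary "did the repair lower the cost" reward are all assumptions made for this example.

```python
import random


class BetaThompson:
    """Beta-Bernoulli Thompson Sampling over a fixed list of arms."""

    def __init__(self, arms, rng):
        self.arms = list(arms)
        self.rng = rng
        self.successes = {a: 1.0 for a in self.arms}  # Beta(1, 1) prior
        self.failures = {a: 1.0 for a in self.arms}

    def select(self):
        # Draw one sample per arm from its posterior and play the best draw.
        samples = {
            a: self.rng.betavariate(self.successes[a], self.failures[a])
            for a in self.arms
        }
        return max(samples, key=samples.get)

    def update(self, arm, improved):
        if improved:
            self.successes[arm] += 1.0
        else:
            self.failures[arm] += 1.0


class BiLevelSelector:
    """Top level picks a destroy heuristic; a per-heuristic bandit picks a size."""

    def __init__(self, heuristics, sizes, seed=0):
        rng = random.Random(seed)
        self.top = BetaThompson(heuristics, rng)
        self.bottom = {h: BetaThompson(sizes, rng) for h in heuristics}

    def select(self):
        heuristic = self.top.select()
        return heuristic, self.bottom[heuristic].select()

    def update(self, heuristic, size, improved):
        # Both levels receive the same binary feedback.
        self.top.update(heuristic, improved)
        self.bottom[heuristic].update(size, improved)


# Placeholder arm names and sizes, chosen only for illustration.
selector = BiLevelSelector(["random", "agent_based", "map_based"], [2, 4, 8, 16])
heuristic, size = selector.select()
# ... destroy `size` paths with `heuristic`, repair, compare solution cost ...
selector.update(heuristic, size, improved=True)
```

In an LNS loop, `select` would be called once per iteration and `update` once the repaired solution's cost is known, so the choice of heuristic and neighborhood size adapts during search rather than being tuned offline.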