language model

Title: Axiomatic Preference Modeling for Longform Question Answering. (arXiv:2312.02206v1 [cs.AI])

Paper URL: http://arxiv.org/abs/2312.02206
Code URL: null
Copy Paste: [[2312.02206]] Axiomatic Preference Modeling for Longform Question Answering(http://arxiv.org/abs/2312.02206)
Summary:
The remarkable abilities of large language models (LLMs) like GPT-4 partially stem from post-training processes like Reinforcement Learning from Human Feedback (RLHF) involving human preferences encoded in a reward model. However, these reward models (RMs) often lack direct knowledge of why, or under what principles, the preference annotations were made. In this study, we identify principles that guide RMs to better align with human preferences, and then develop an axiomatic framework to generate a rich variety of preference signals to uphold them. We use these axiomatic signals to train a model for scoring answers to longform questions. Our approach yields a Preference Model with only about 220M parameters that agrees with gold human-annotated preference labels more often than GPT-4. The contributions of this work include: training a standalone preference model that can score human- and LLM-generated answers on the same scale; developing an axiomatic framework for generating training data pairs tailored to certain principles; and showing that a small amount of axiomatic signals can help small models outperform GPT-4 in preference scoring. We release our model on Hugging Face: https://huggingface.co/corbyrosset/axiomatic_preference_model
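
Illustrative sketch (not from the paper): the abstract does not state the training objective, so the block below assumes the Bradley-Terry pairwise loss that is common in reward modeling. It shows how a scalar-scoring preference model can be trained on (preferred, rejected) pairs and how agreement with gold labels could be measured. The function names are hypothetical.

```python
import math


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood: -log(sigmoid(s_pref - s_rej))."""
    margin = score_preferred - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def agreement_rate(scored_pairs) -> float:
    """Fraction of (score_preferred, score_rejected) pairs ranked correctly."""
    correct = sum(1 for s_pref, s_rej in scored_pairs if s_pref > s_rej)
    return correct / len(scored_pairs)


# Made-up scores: the loss shrinks as the preferred answer pulls ahead.
loss_when_right = pairwise_preference_loss(3.0, 0.0)
loss_when_wrong = pairwise_preference_loss(0.0, 3.0)
rate = agreement_rate([(2.0, 1.0), (0.0, 1.0)])
```

Because every answer receives a single scalar, human- and LLM-written answers can be compared on one scale, which is the property the abstract highlights.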

Title: An Evaluation Framework for Mapping News Headlines to Event Classes in a Knowledge Graph. (arXiv:2312.02334v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.02334
Code URL: https://github.com/mbouadeus/news-headline-event-linking
Copy Paste: [[2312.02334]] An Evaluation Framework for Mapping News Headlines to Event Classes in a Knowledge Graph(http://arxiv.org/abs/2312.02334)
Summary:
Mapping ongoing news headlines to event-related classes in a rich knowledge base can be an important component in a knowledge-based event analysis and forecasting solution. In this paper, we present a methodology for creating a benchmark dataset of news headlines mapped to event classes in Wikidata, and resources for the evaluation of methods that perform the mapping. We use the dataset to study two classes of unsupervised methods for this task: 1) adaptations of classic entity linking methods, and 2) methods that treat the problem as a zero-shot text classification problem. For the first approach, we evaluate off-the-shelf entity linking systems. For the second approach, we explore a) pre-trained natural language inference (NLI) models, and b) pre-trained large generative language models. We present the results of our evaluation, lessons learned, and directions for future work. The dataset and scripts for evaluation are made publicly available.

Title: Visually Grounded Language Learning: a review of language games, datasets, tasks, and models. (arXiv:2312.02431v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.02431
Code URL: null
Copy Paste: [[2312.02431]] Visually Grounded Language Learning: a review of language games, datasets, tasks, and models(http://arxiv.org/abs/2312.02431)
Summary:
In recent years, several machine learning models have been proposed. They are trained with a language modelling objective on large-scale text-only data. With such pretraining, they can achieve impressive results on many Natural Language Understanding and Generation tasks. However, many facets of meaning cannot be learned by "listening to the radio" only. In the literature, many Vision+Language (V+L) tasks have been defined with the aim of creating models that can ground symbols in the visual modality. In this work, we provide a systematic literature review of several tasks and models proposed in the V+L field. We rely on Wittgenstein's idea of 'language games' to categorise such tasks into 3 different families: 1) discriminative games, 2) generative games, and 3) interactive games. Our analysis of the literature provides evidence that future work should be focusing on interactive games where communication in Natural Language is important to resolve ambiguities about object referents and action plans and that physical embodiment is essential to understand the semantics of situations and events. Overall, these represent key requirements for developing grounded meanings in neural models.

Title: MUFFIN: Curating Multi-Faceted Instructions for Improving Instruction-Following. (arXiv:2312.02436v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.02436
Code URL: null
Copy Paste: [[2312.02436]] MUFFIN: Curating Multi-Faceted Instructions for Improving Instruction-Following(http://arxiv.org/abs/2312.02436)
Summary:
In the realm of large language models (LLMs), enhancing instruction-following capability often involves curating expansive training data. This is achieved through two primary schemes: i) Scaling-Inputs: Amplifying (input, output) pairs per task instruction, aiming for better instruction adherence. ii) Scaling Input-Free Tasks: Enlarging tasks, each composed of an (instruction, output) pair (without requiring a separate input anymore). However, LLMs under Scaling-Inputs tend to be overly sensitive to inputs, leading to misinterpretation or non-compliance with instructions. Conversely, Scaling Input-Free Tasks demands a substantial number of tasks but is less effective in instruction following when dealing with instances in Scaling-Inputs. This work introduces MUFFIN, a new scheme of instruction-following dataset curation. Specifically, we automatically Scale Tasks per Input by diversifying these tasks with various input facets. Experimental results across four zero-shot benchmarks, spanning both Scaling-Inputs and Scaling Input-Free Tasks schemes, reveal that LLMs, at various scales, trained on MUFFIN generally demonstrate superior instruction-following capabilities compared to those trained on the two aforementioned schemes.

Title: Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation. (arXiv:2312.02439v1 [cs.AI])

Paper URL: http://arxiv.org/abs/2312.02439
Code URL: https://github.com/sail-sg/clot
Copy Paste: [[2312.02439]] Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation(http://arxiv.org/abs/2312.02439)
Summary:
Chain-of-Thought (CoT) guides large language models (LLMs) to reason step-by-step, and can motivate their logical reasoning ability. While effective for logical tasks, CoT is not conducive to creative problem-solving, which often requires out-of-the-box thinking and is crucial for innovation. In this paper, we explore the Leap-of-Thought (LoT) abilities within LLMs -- a non-sequential, creative paradigm involving strong associations and knowledge leaps. To this end, we study LLMs on the popular Oogiri game, which requires participants to have good creativity and strong associative thinking in order to respond unexpectedly and humorously to a given image, text, or both, and is thus suitable for LoT study. To investigate LLMs' LoT ability in the Oogiri game, we first build a multimodal and multilingual Oogiri-GO dataset which contains over 130,000 samples from the Oogiri game, and observe the insufficient LoT ability or failures of most existing LLMs on the Oogiri game. Accordingly, we introduce a creative Leap-of-Thought (CLoT) paradigm to improve LLMs' LoT ability. CLoT first formulates the Oogiri-GO dataset into LoT-oriented instruction tuning data to train a pretrained LLM to achieve certain LoT humor generation and discrimination abilities. Then CLoT designs an explorative self-refinement that encourages the LLM to generate more creative LoT data by exploring parallels between seemingly unrelated concepts, and selects high-quality data to train itself for self-refinement. CLoT not only excels in humor generation in the Oogiri game but also boosts creative abilities in various tasks such as the cloud guessing game and the divergent association task. These findings advance our understanding and offer a pathway to improve LLMs' creative capacities for innovative applications across domains. The dataset, code, and models will be released online: https://github.com/sail-sg/CLoT.

Title: ASPEN: High-Throughput LoRA Fine-Tuning of Large Language Models with a Single GPU. (arXiv:2312.02515v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02515
Code URL: null
Copy Paste: [[2312.02515]] ASPEN: High-Throughput LoRA Fine-Tuning of Large Language Models with a Single GPU(http://arxiv.org/abs/2312.02515)
Summary:
Transformer-based large language models (LLMs) have demonstrated outstanding performance across diverse domains, particularly when fine-tuned for specific domains. Recent studies suggest that the resources required for fine-tuning LLMs can be economized through parameter-efficient methods such as Low-Rank Adaptation (LoRA). While LoRA effectively reduces computational burdens and resource demands, it currently supports only a single-job fine-tuning setup.

In this paper, we present ASPEN, a high-throughput framework for fine-tuning LLMs. ASPEN efficiently trains multiple jobs on a single GPU using the LoRA method, leveraging a shared pre-trained model and adaptive scheduling. ASPEN is compatible with transformer-based language models such as LLaMA and ChatGLM. Experiments show that ASPEN saves 53% of GPU memory when training multiple LLaMA-7B models on an NVIDIA A100 80GB GPU and boosts training throughput by about 17% compared to existing methods when training with various pre-trained models on different GPUs. The adaptive scheduling algorithm reduces turnaround time by 24% and end-to-end training latency by 12%, prioritizing jobs and preventing out-of-memory issues.
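
Illustrative sketch (not ASPEN's implementation): the block below only shows the underlying LoRA idea of several fine-tuning jobs sharing one frozen base weight while each job keeps its own small low-rank pair. The scheduler, batching, and memory management described in the abstract are not modeled, and the job names and matrices are made up.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(base_weight, adapter_a, adapter_b, x, scale=1.0):
    """Compute y = W x + scale * B (A x); W is frozen and shared across jobs."""
    base_out = matvec(base_weight, x)
    low_rank_out = matvec(adapter_b, matvec(adapter_a, x))
    return [b + scale * l for b, l in zip(base_out, low_rank_out)]


# One frozen 2x2 base weight shared by two hypothetical jobs.
shared_w = [[1.0, 0.0], [0.0, 1.0]]
jobs = {
    "job_a": {"A": [[1.0, 1.0]], "B": [[0.5], [0.0]]},   # rank-1 adapter
    "job_b": {"A": [[1.0, -1.0]], "B": [[0.0], [2.0]]},  # rank-1 adapter
}
x = [1.0, 2.0]
outputs = {
    name: lora_forward(shared_w, params["A"], params["B"], x)
    for name, params in jobs.items()
}
```

Only the small A and B matrices differ between jobs, so the base weight is stored once no matter how many jobs are trained together.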

Title: Creative Agents: Empowering Agents with Imagination for Creative Tasks. (arXiv:2312.02519v1 [cs.AI])

Paper URL: http://arxiv.org/abs/2312.02519
Code URL: https://github.com/pku-rl/creative-agents
Copy Paste: [[2312.02519]] Creative Agents: Empowering Agents with Imagination for Creative Tasks(http://arxiv.org/abs/2312.02519)
Summary:
We study building embodied agents for open-ended creative tasks. While existing methods build instruction-following agents that can perform diverse open-ended tasks, none of them demonstrates creativity -- the ability to give novel and diverse task solutions implicit in the language instructions. This limitation comes from their inability to convert abstract language instructions into concrete task goals in the environment and perform long-horizon planning for such complicated goals. Given the observation that humans perform creative tasks with the help of imagination, we propose a class of solutions for creative agents, where the controller is enhanced with an imaginator that generates detailed imaginations of task outcomes conditioned on language instructions. We introduce several approaches to implementing the components of creative agents. We implement the imaginator with either a large language model for textual imagination or a diffusion model for visual imagination. The controller can either be a behavior-cloning policy learned from data or a pre-trained foundation model generating executable code in the environment. We benchmark creative tasks with the challenging open-world game Minecraft, where the agents are asked to create diverse buildings given free-form language instructions. In addition, we propose novel evaluation metrics for open-ended creative tasks utilizing GPT-4V, which holds many advantages over existing metrics. We perform a detailed experimental analysis of creative agents, showing that creative agents are the first AI agents accomplishing diverse building creation in the survival mode of Minecraft. Our benchmark and models are open-source for future research on creative agents (https://github.com/PKU-RL/Creative-Agents).

Title: Impact of Tokenization on LLaMa Russian Adaptation. (arXiv:2312.02598v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.02598
Code URL: null
Copy Paste: [[2312.02598]] Impact of Tokenization on LLaMa Russian Adaptation(http://arxiv.org/abs/2312.02598)
Summary:
The latest instruction-tuned large language models (LLMs) show great results on various tasks; however, they often face performance degradation for non-English input. There is evidence that the reason lies in inefficient tokenization caused by low language representation in pre-training data, which hinders the comprehension of non-English instructions and limits the potential of target-language instruction-tuning. In this work we investigate the possibility of addressing the issue with vocabulary substitution in the context of LLaMa Russian language adaptation. We explore three variants of vocabulary adaptation and test their performance on Saiga instruction-tuning and fine-tuning on the Russian SuperGLUE benchmark. The results of automatic evaluation show that vocabulary substitution not only improves the model's quality in Russian but also accelerates fine-tuning (35%) and inference (up to 60%) while reducing memory consumption. Additional human evaluation of the instruction-tuned models demonstrates that models with Russian-adapted vocabulary generate answers with higher user preference than the original Saiga-LLaMa model.

Title: Large Knowledge Model: Perspectives and Challenges. (arXiv:2312.02706v1 [cs.AI])

Paper URL: http://arxiv.org/abs/2312.02706
Code URL: null
Copy Paste: [[2312.02706]] Large Knowledge Model: Perspectives and Challenges(http://arxiv.org/abs/2312.02706)
Summary:
Humankind's understanding of the world is fundamentally linked to our perception and cognition, with human languages serving as one of the major carriers of world knowledge. In this vein, Large Language Models (LLMs) like ChatGPT epitomize the pre-training of extensive, sequence-based world knowledge into neural networks, facilitating the processing and manipulation of this knowledge in a parametric space. This article explores large models through the lens of "knowledge". We initially investigate the role of symbolic knowledge such as Knowledge Graphs (KGs) in enhancing LLMs, covering aspects like knowledge-augmented language models, structure-inducing pre-training, knowledgeable prompts, structured CoT, knowledge editing, semantic tools for LLMs, and knowledgeable AI agents. Subsequently, we examine how LLMs can amplify traditional symbolic knowledge bases, encompassing aspects like using LLMs as KG builders and controllers, structured knowledge pretraining, LLM-enhanced symbolic reasoning, and the amalgamation of perception with cognition. Considering the intricate nature of human knowledge, we advocate for the creation of Large Knowledge Models (LKM), specifically engineered to manage a diversified spectrum of knowledge structures. This ambitious undertaking could entail several key challenges, such as disentangling knowledge representation from language models, restructuring pre-training with structured knowledge, and building large commonsense models, among others. We finally propose a five-"A" principle to distinguish the concept of LKM.

Title: Toward autocorrection of chemical process flowsheets using large language models. (arXiv:2312.02873v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02873
Code URL: null
Copy Paste: [[2312.02873]] Toward autocorrection of chemical process flowsheets using large language models(http://arxiv.org/abs/2312.02873)
Summary:
The process engineering domain widely uses Process Flow Diagrams (PFDs) and Process and Instrumentation Diagrams (P&IDs) to represent process flows and equipment configurations. However, the P&IDs and PFDs, hereafter called flowsheets, can contain errors causing safety hazards, inefficient operation, and unnecessary expenses. Correcting and verifying flowsheets is a tedious, manual process. We propose a novel generative AI methodology for automatically identifying errors in flowsheets and suggesting corrections to the user, i.e., autocorrecting flowsheets. Inspired by the breakthrough of Large Language Models (LLMs) for grammatical autocorrection of human language, we investigate LLMs for the autocorrection of flowsheets. The input to the model is a potentially erroneous flowsheet, and the output is a set of suggestions for a corrected flowsheet. We train our autocorrection model on a synthetic dataset in a supervised manner. The model achieves a top-1 accuracy of 80% and a top-5 accuracy of 84% on an independent test dataset of synthetically generated flowsheets. The results suggest that the model can learn to autocorrect the synthetic flowsheets. We envision that flowsheet autocorrection will become a useful tool for chemical engineers.
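
Illustrative sketch (not from the paper): top-1 and top-5 accuracy, as reported above, count a test case as correct when the reference correction appears among the model's first k ranked suggestions. The strings below are made-up placeholders and do not reflect the flowsheet representation used by the authors.

```python
def top_k_accuracy(ranked_suggestions, references, k):
    """Share of cases whose reference is among the first k ranked suggestions."""
    hits = sum(
        1
        for suggestions, reference in zip(ranked_suggestions, references)
        if reference in suggestions[:k]
    )
    return hits / len(references)


# Hypothetical ranked suggestions for two erroneous flowsheets.
ranked = [
    ["valve", "pump", "tank"],
    ["pump", "valve", "tank"],
]
gold = ["valve", "tank"]
top1 = top_k_accuracy(ranked, gold, k=1)
top3 = top_k_accuracy(ranked, gold, k=3)
```

Top-k accuracy can only stay equal or rise as k grows, which matches the reported 80% (top-1) and 84% (top-5).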

Title: Revisiting Topic-Guided Language Models. (arXiv:2312.02331v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.02331
Code URL: https://github.com/carolinazheng/revisiting-tglms
Copy Paste: [[2312.02331]] Revisiting Topic-Guided Language Models(http://arxiv.org/abs/2312.02331)
Summary:
A recent line of work in natural language processing has aimed to combine language models and topic models. These topic-guided language models augment neural language models with topic models, unsupervised learning methods that can discover document-level patterns of word use. This paper compares the effectiveness of these methods in a standardized setting. We study four topic-guided language models and two baselines, evaluating the held-out predictive performance of each model on four corpora. Surprisingly, we find that none of these methods outperform a standard LSTM language model baseline, and most fail to learn good topics. Further, we train a probe of the neural language model that shows that the baseline's hidden states already encode topic information. We make public all code used for this study.

Title: Measuring Distributional Shifts in Text: The Advantage of Language Model-Based Embeddings. (arXiv:2312.02337v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.02337
Code URL: null
Copy Paste: [[2312.02337]] Measuring Distributional Shifts in Text: The Advantage of Language Model-Based Embeddings(http://arxiv.org/abs/2312.02337)
Summary:
An essential part of monitoring machine learning models in production is measuring input and output data drift. In this paper, we present a system for measuring distributional shifts in natural language data and highlight and investigate the potential advantage of using large language models (LLMs) for this problem. Recent advancements in LLMs and their successful adoption in different domains indicate their effectiveness in capturing semantic relationships for solving various natural language processing problems. The power of LLMs comes largely from the encodings (embeddings) generated in the hidden layers of the corresponding neural network. First we propose a clustering-based algorithm for measuring distributional shifts in text data by exploiting such embeddings. Then we study the effectiveness of our approach when applied to text embeddings generated by both LLMs and classical embedding algorithms. Our experiments show that general-purpose LLM-based embeddings provide a high sensitivity to data drift compared to other embedding methods. We propose drift sensitivity as an important evaluation metric to consider when comparing language models. Finally, we present insights and lessons learned from deploying our framework as part of the Fiddler ML Monitoring platform over a period of 18 months.
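
Illustrative sketch (not the authors' algorithm): the abstract says only that the method is clustering-based and operates on embeddings. The block below assumes one plausible design: assign each embedding to its nearest centroid, build a histogram of cluster frequencies for a reference window and a production window, and compare the two histograms with the Jensen-Shannon divergence. The centroids and 2-D "embeddings" are made up; in practice the centroids would be fit on reference data.

```python
import math


def nearest_centroid(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, centroid)) for centroid in centroids]
    return dists.index(min(dists))


def cluster_histogram(embeddings, centroids):
    """Relative frequency of each cluster among the given embeddings."""
    counts = [0] * len(centroids)
    for embedding in embeddings:
        counts[nearest_centroid(embedding, centroids)] += 1
    return [count / len(embeddings) for count in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical distributions."""
    mid = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(u, v):
        return sum(a * math.log2(a / b) for a, b in zip(u, v) if a > 0)

    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


centroids = [[0.0, 0.0], [10.0, 10.0]]
reference = [[0.0, 1.0], [1.0, 0.0], [9.0, 10.0], [10.0, 9.0]]
production = [[10.0, 10.0], [9.0, 9.0], [11.0, 10.0], [0.0, 0.0]]

ref_hist = cluster_histogram(reference, centroids)
prod_hist = cluster_histogram(production, centroids)
drift = js_divergence(ref_hist, prod_hist)
```

Under this design, an embedding model is more drift-sensitive if a given shift in the text moves more mass between clusters and so produces a larger divergence.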

Title: Efficient Online Data Mixing For Language Model Pre-Training. (arXiv:2312.02406v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.02406
Code URL: null
Copy Paste: [[2312.02406]] Efficient Online Data Mixing For Language Model Pre-Training(http://arxiv.org/abs/2312.02406)
Summary:
The data used to pretrain large language models has a decisive impact on a model's downstream performance, which has led to a large body of work on data selection methods that aim to automatically determine the most suitable data to use for pretraining. Existing data selection methods suffer from slow and computationally expensive processes, a problem amplified by the increasing size of models and of pretraining datasets. Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together and determining sampling probabilities across entire groups. However, data mixing proportions are typically fixed before training and therefore cannot adapt to changing training dynamics. To address these limitations, we develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing. Based on multi-armed bandit algorithms, our online approach optimizes the data mixing proportions during training. Remarkably, our method trains a model that reaches the final perplexity of the next best method with 19% fewer training iterations, and improves performance on the 5-shot MMLU benchmark by 1.9% relative accuracy, while adding negligible wall-clock time during pretraining.
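
Illustrative sketch (assumptions labeled): the abstract says the method builds on multi-armed bandit algorithms but does not name one or define the reward. The block below assumes the standard EXP3 algorithm, treating each data domain as an arm and using a reward in [0, 1] that a training loop would supply (for example, a normalized training loss). The class name and the simulated reward are hypothetical.

```python
import math
import random


class Exp3Mixer:
    """EXP3 bandit over data domains; probabilities() gives the mixing proportions."""

    def __init__(self, num_domains, gamma=0.1, seed=0):
        self.gamma = gamma                    # exploration rate
        self.weights = [1.0] * num_domains
        self.rng = random.Random(seed)

    def probabilities(self):
        total = sum(self.weights)
        k = len(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / k for w in self.weights]

    def sample_domain(self):
        probs = self.probabilities()
        return self.rng.choices(range(len(probs)), weights=probs, k=1)[0]

    def update(self, domain, reward):
        """Importance-weighted update; reward is assumed to lie in [0, 1]."""
        prob = self.probabilities()[domain]
        estimated_reward = reward / prob
        self.weights[domain] *= math.exp(self.gamma * estimated_reward / len(self.weights))


# Simulated training loop: only domain 2 yields reward (an assumption for the demo).
mixer = Exp3Mixer(num_domains=3)
for _ in range(200):
    chosen = mixer.sample_domain()
    mixer.update(chosen, reward=1.0 if chosen == 2 else 0.0)
final_probs = mixer.probabilities()
```

The proportions shift toward rewarding domains during training instead of being fixed beforehand, which is the contrast with static data mixing that the abstract draws.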

Title: ULMA: Unified Language Model Alignment with Demonstration and Point-wise Human Preference. (arXiv:2312.02554v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02554
Code URL: null
Copy Paste: [[2312.02554]] ULMA: Unified Language Model Alignment with Demonstration and Point-wise Human Preference(http://arxiv.org/abs/2312.02554)
Summary:
Language model alignment is a cutting-edge technique in large language model training to align the model output to users' intent, e.g., being helpful and harmless. Recent alignment frameworks consist of two steps: supervised fine-tuning with demonstration data and preference learning with human preference data. Previous preference learning methods, such as RLHF and DPO, mainly focus on pair-wise preference data. However, in many real-world scenarios where human feedback is intrinsically point-wise, these methods will suffer from information loss or even fail. To fill this gap, in this paper, we first develop a preference learning method called point-wise DPO to tackle point-wise preference data. Further insight into the connection between supervised fine-tuning and point-wise preference learning enables us to develop a unified framework for both human demonstration and point-wise preference data, which sheds new light on the construction of preference datasets. Extensive experiments on point-wise datasets with binary or continuous labels demonstrate the superior performance and efficiency of our proposed methods. A new dataset with high-quality demonstration samples on harmlessness is constructed and made publicly available.

Title: Towards Measuring Representational Similarity of Large Language Models. (arXiv:2312.02730v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02730
Code URL: https://github.com/mklabunde/llm_repsim
Copy Paste: [[2312.02730]] Towards Measuring Representational Similarity of Large Language Models(http://arxiv.org/abs/2312.02730)
Summary:
Understanding the similarity of the numerous released large language models (LLMs) has many uses, e.g., simplifying model selection, detecting illegal model reuse, and advancing our understanding of what makes LLMs perform well. In this work, we measure the similarity of representations of a set of LLMs with 7B parameters. Our results suggest that some LLMs are substantially different from others. We identify challenges of using representational similarity measures that suggest the need for careful study of similarity scores to avoid false conclusions.

Title: Scaling Laws for Adversarial Attacks on Language Model Activations. (arXiv:2312.02780v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02780
Code URL: null
Copy Paste: [[2312.02780]] Scaling Laws for Adversarial Attacks on Language Model Activations(http://arxiv.org/abs/2312.02780)
Summary:
We explore a class of adversarial attacks targeting the activations of language models. By manipulating a relatively small subset of model activations, $a$, we demonstrate the ability to control the exact prediction of a significant number (in some cases up to 1000) of subsequent tokens $t$. We empirically verify a scaling law where the maximum number of target tokens $t_\mathrm{max}$ predicted depends linearly on the number of tokens $a$ whose activations the attacker controls as $t_\mathrm{max} = \kappa a$. We find that the number of bits of control in the input space needed to control a single bit in the output space (what we call attack resistance $\chi$) is remarkably constant between $\approx 16$ and $\approx 25$ over 2 orders of magnitude of model sizes for different language models. Compared to attacks on tokens, attacks on activations are predictably much stronger; however, we identify a surprising regularity where one bit of input steered either via activations or via tokens is able to exert control over a similar amount of output bits. This gives support for the hypothesis that adversarial attacks are a consequence of dimensionality mismatch between the input and output spaces. A practical implication of the ease of attacking language model activations instead of tokens is for multi-modal and selected retrieval models, where additional data sources are added as activations directly, sidestepping the tokenized input. This opens up a new, broad attack surface. By using language models as a controllable test-bed to study adversarial attacks, we were able to experiment with input-output dimensions that are inaccessible in computer vision, especially where the output dimension dominates.
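
Illustrative sketch (made-up numbers, not the paper's data): given measurements of how many target tokens can be controlled for each number of attacked tokens, the slope $\kappa$ in $t_\mathrm{max} = \kappa a$ can be estimated by least squares through the origin. The fitting procedure is an assumption; the abstract does not say how the authors fit the law.

```python
def fit_kappa(attacked_tokens, max_target_tokens):
    """Least-squares slope through the origin for t_max = kappa * a."""
    numerator = sum(a * t for a, t in zip(attacked_tokens, max_target_tokens))
    denominator = sum(a * a for a in attacked_tokens)
    return numerator / denominator


# Hypothetical measurements that follow the linear law exactly.
a_values = [1, 2, 4, 8]
t_max_values = [10, 20, 40, 80]
kappa = fit_kappa(a_values, t_max_values)
predicted_for_16 = kappa * 16
```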

Title: Large Language Models on Graphs: A Comprehensive Survey. (arXiv:2312.02783v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.02783
Code URL: https://github.com/petergriffinjin/awesome-language-model-on-graphs
Copy Paste: [[2312.02783]] Large Language Models on Graphs: A Comprehensive Survey(http://arxiv.org/abs/2312.02783)
Summary:
Large language models (LLMs), such as ChatGPT and LLaMA, are creating significant advancements in natural language processing, due to their strong text encoding/decoding ability and newly found emergent capability (e.g., reasoning). While LLMs are mainly designed to process pure texts, there are many real-world scenarios where text data are associated with rich structure information in the form of graphs (e.g., academic networks, and e-commerce networks) or scenarios where graph data are paired with rich textual information (e.g., molecules with descriptions). Besides, although LLMs have shown their pure text-based reasoning ability, it is underexplored whether such ability can be generalized to graph scenarios (i.e., graph-based reasoning). In this paper, we provide a systematic review of scenarios and techniques related to large language models on graphs. We first summarize potential scenarios of adopting LLMs on graphs into three categories, namely pure graphs, text-rich graphs, and text-paired graphs. We then discuss detailed techniques for utilizing LLMs on graphs, including LLM as Predictor, LLM as Encoder, and LLM as Aligner, and compare the advantages and disadvantages of different schools of models. Furthermore, we mention the real-world applications of such methods and summarize open-source code and benchmark datasets. Finally, we conclude with potential future research directions in this fast-growing field. Related resources can be found at https://github.com/PeterGriffinJin/Awesome-Language-Model-on-Graphs.

Title: Can We Learn Communication-Efficient Optimizers? (arXiv:2312.02204v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02204
Code URL: null
Copy Paste: [[2312.02204]] Can We Learn Communication-Efficient Optimizers?(http://arxiv.org/abs/2312.02204)
Summary:
Communication-efficient variants of SGD, specifically local SGD, have received a great deal of interest in recent years. These approaches compute multiple gradient steps locally, that is on each worker, before averaging model parameters, helping relieve the critical communication bottleneck in distributed deep learning training. Although many variants of these approaches have been proposed, they can sometimes lag behind state-of-the-art adaptive optimizers for deep learning. In this work, we investigate if the recent progress in the emerging area of learned optimizers can potentially close this gap while remaining communication-efficient. Specifically, we meta-learn how to perform global updates given an update from local SGD iterations. Our results demonstrate that learned optimizers can substantially outperform local SGD and its sophisticated variants while maintaining their communication efficiency. Learned optimizers can even generalize to unseen and much larger datasets and architectures, including ImageNet and ViTs, and to unseen modalities such as language modeling. We therefore demonstrate the potential of learned optimizers for improving communication-efficient distributed learning.
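
Illustrative sketch (baseline only): the block below shows plain local SGD as described above, where each worker takes several gradient steps on its own data and the parameters are averaged once per communication round. It does not implement the paper's learned optimizer, which replaces the global update with a meta-learned one. The scalar quadratic losses and worker targets are made up.

```python
def local_sgd_round(global_param, worker_targets, local_steps=4, lr=0.1):
    """One communication round: local steps on each worker, then parameter averaging."""
    local_params = []
    for target in worker_targets:
        w = global_param
        for _ in range(local_steps):
            grad = 2.0 * (w - target)   # gradient of the worker loss (w - target)^2
            w -= lr * grad
        local_params.append(w)
    # The only communication in the round: average the workers' parameters.
    return sum(local_params) / len(local_params)


param = 0.0
for _ in range(20):
    param = local_sgd_round(param, worker_targets=[1.0, 3.0])
```

With four local steps per round, the workers communicate once for every four gradient steps, and in this toy problem the averaged parameter still approaches the minimizer of the mean loss (2.0).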

gpt

Title: MKA: A Scalable Medical Knowledge Assisted Mechanism for Generative Models on Medical Conversation Tasks. (arXiv:2312.02496v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.02496
Code URL: https://github.com/liangke23/knowledge_assisted_medical_dialogue_generation_mechanism
Copy Paste: [[2312.02496]] MKA: A Scalable Medical Knowledge Assisted Mechanism for Generative Models on Medical Conversation Tasks(http://arxiv.org/abs/2312.02496)
Summary:
Using natural language processing (NLP) technologies to develop medical chatbots makes the diagnosis of the patient more convenient and efficient, which is a typical application in healthcare AI. Because of its importance, much research has been published on the topic. Recently, neural generative models have shown impressive ability as the core of chatbots, but they do not scale well when directly applied to medical conversation due to the lack of medical-specific knowledge. To address this limitation, a scalable Medical Knowledge Assisted mechanism, MKA, is proposed in this paper. The mechanism aims to assist general neural generative models to achieve better performance on the medical conversation task. A medical-specific knowledge graph is designed within the mechanism, which contains 6 types of medical-related information: department, drug, check, symptom, disease, and food. In addition, a specific token concatenation policy is defined to effectively inject medical information into the input data. Evaluation of our method is carried out on two typical medical datasets, MedDG and MedDialog-CN. The evaluation results demonstrate that models combined with our mechanism outperform the original methods in multiple automatic evaluation metrics. Moreover, MKA-Bert-GPT achieves state-of-the-art performance. The code is publicly available: https://github.com/LIANGKE23/Knowledge_Assisted_Medical_Dialogue_Generation_Mechanism

llm

Title: JarviX: A LLM No code Platform for Tabular Data Analysis and Optimization. (arXiv:2312.02213v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02213
Code URL: null
Copy Paste: [[2312.02213]] JarviX: A LLM No code Platform for Tabular Data Analysis and Optimization(http://arxiv.org/abs/2312.02213)
Summary:
In this study, we introduce JarviX, a sophisticated data analytics framework. JarviX is designed to employ Large Language Models (LLMs) to provide automated guidance and execute high-precision data analyses on tabular datasets. This framework emphasizes the significance of varying column types, capitalizing on state-of-the-art LLMs to generate concise data insight summaries, propose relevant analysis inquiries, visualize data effectively, and provide comprehensive explanations for results drawn from an extensive data analysis pipeline. Moreover, JarviX incorporates an automated machine learning (AutoML) pipeline for predictive modeling. This integration forms a comprehensive and automated optimization cycle, which proves particularly advantageous for optimizing machine configuration. The efficacy and adaptability of JarviX are substantiated through a series of practical use case studies.

Title: LLMs Accelerate Annotation for Medical Information Extraction. (arXiv:2312.02296v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.02296
Code URL: null
Copy Paste: [[2312.02296]] LLMs Accelerate Annotation for Medical Information Extraction(http://arxiv.org/abs/2312.02296)
Summary:
The unstructured nature of clinical notes within electronic health records often conceals vital patient-related information, making it challenging to access or interpret. To uncover this hidden information, specialized Natural Language Processing (NLP) models are required. However, training these models necessitates large amounts of labeled data, a process that is both time-consuming and costly when relying solely on human experts for annotation. In this paper, we propose an approach that combines Large Language Models (LLMs) with human expertise to create an efficient method for generating ground truth labels for medical text annotation. By utilizing LLMs in conjunction with human annotators, we significantly reduce the human annotation burden, enabling the rapid creation of labeled datasets. We rigorously evaluate our method on a medical information extraction task, demonstrating that our approach not only substantially cuts down on human intervention but also maintains high accuracy. The results highlight the potential of using LLMs to improve the utilization of unstructured clinical data, allowing for the swift deployment of tailored NLP solutions in healthcare.

Title: When is Offline Policy Selection Sample Efficient for Reinforcement Learning?. (arXiv:2312.02355v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02355
Code URL: null
Copy Paste: [[2312.02355]] When is Offline Policy Selection Sample Efficient for Reinforcement Learning?(http://arxiv.org/abs/2312.02355)
Summary:
Offline reinforcement learning algorithms often require careful hyperparameter tuning. Consequently, before deployment, we need to select amongst a set of candidate policies. As yet, however, there is little understanding about the fundamental limits of this offline policy selection (OPS) problem. In this work we aim to provide clarity on when sample efficient OPS is possible, primarily by connecting OPS to off-policy policy evaluation (OPE) and Bellman error (BE) estimation. We first show a hardness result, that in the worst case, OPS is just as hard as OPE, by proving a reduction of OPE to OPS. As a result, no OPS method can be more sample efficient than OPE in the worst case. We then propose a BE method for OPS, called Identifiable BE Selection (IBES), that has a straightforward method for selecting its own hyperparameters. We highlight that using IBES for OPS generally has more requirements than OPE methods, but if satisfied, can be more sample efficient. We conclude with an empirical study comparing OPE and IBES, and by showing the difficulty of OPS on an offline Atari benchmark dataset.

Title: New Evaluation Metrics Capture Quality Degradation due to LLM Watermarking. (arXiv:2312.02382v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.02382
Code URL: null
Copy Paste: [[2312.02382]] New Evaluation Metrics Capture Quality Degradation due to LLM Watermarking(http://arxiv.org/abs/2312.02382)
Summary:
With the increasing use of large language models (LLMs) like ChatGPT, watermarking has emerged as a promising approach for tracing machine-generated content. However, research on LLM watermarking often relies on simple perplexity or diversity-based measures to assess the quality of watermarked text, which can mask important limitations in watermarking. Here we introduce two new easy-to-use methods for evaluating watermarking algorithms for LLMs: 1) evaluation by an LLM judger with specific guidelines; and 2) binary classification on text embeddings to distinguish between watermarked and unwatermarked text. We apply these methods to characterize the effectiveness of current watermarking techniques. Our experiments, conducted across various datasets, reveal that current watermarking methods are detectable by even simple classifiers, challenging the notion of watermarking subtlety. We also found, through the LLM judger, that watermarking impacts text quality, especially in degrading the coherence and depth of the response. Our findings underscore the trade-off between watermark robustness and text quality and highlight the importance of having more informative metrics to assess watermarking quality.

Title: MedDM:LLM-executable clinical guidance tree for clinical decision-making. (arXiv:2312.02441v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.02441
Code URL: null
Copy Paste: [[2312.02441]] MedDM:LLM-executable clinical guidance tree for clinical decision-making(http://arxiv.org/abs/2312.02441)
Summary:
There is increasing emphasis on the importance of LLMs participating in clinical diagnostic decision-making. However, because of their low specialization, current medical LLMs cannot provide specific medical advice and behave more like medical Q&A systems. In addition, there is no suitable clinical guidance tree dataset that can be used directly with LLMs. To address this issue, we first propose the LLM-executable clinical guidance tree (CGT), which can be directly used by large language models, and construct a medical diagnostic decision-making dataset (MedDM) from flowcharts in clinical practice guidelines. We propose an approach to screen flowcharts from medical literature, followed by their identification and conversion into standardized diagnostic decision trees. We constructed a knowledge base with 1202 decision trees, which came from 5000 pieces of medical literature and covered 12 hospital departments, including internal medicine, surgery, psychiatry, and over 500 diseases. Moreover, we propose a method for reasoning on LLM-executable CGT and a Patient-LLM multi-turn dialogue framework.

Title: Weakly Supervised Detection of Hallucinations in LLM Activations. (arXiv:2312.02798v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02798
Code URL: null
Copy Paste: [[2312.02798]] Weakly Supervised Detection of Hallucinations in LLM Activations(http://arxiv.org/abs/2312.02798)
Summary:
We propose an auditing method to identify whether a large language model (LLM) encodes patterns such as hallucinations in its internal states, which may propagate to downstream tasks. We introduce a weakly supervised auditing technique using a subset scanning approach to detect anomalous patterns in LLM activations from pre-trained models. Importantly, our method does not need a priori knowledge of the type of patterns. Instead, it relies on a reference dataset devoid of anomalies during testing. Further, our approach enables the identification of pivotal nodes responsible for encoding these patterns, which may offer crucial insights for fine-tuning specific sub-networks for bias mitigation. We introduce two new scanning methods to handle LLM activations for anomalous sentences that may deviate from the expected distribution in either direction. Our results confirm prior findings of BERT's limited internal capacity for encoding hallucinations, while OPT appears capable of encoding hallucination information internally. Importantly, our scanning approach, without prior exposure to false statements, performs comparably to a fully supervised out-of-distribution classifier.

long context

lora

Title: AdsorbRL: Deep Multi-Objective Reinforcement Learning for Inverse Catalysts Design. (arXiv:2312.02308v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02308
Code URL: https://github.com/rlacombe/adsorbrl
Copy Paste: [[2312.02308]] AdsorbRL: Deep Multi-Objective Reinforcement Learning for Inverse Catalysts Design(http://arxiv.org/abs/2312.02308)
Summary:
A central challenge of the clean energy transition is the development of catalysts for low-emissions technologies. Recent advances in Machine Learning for quantum chemistry drastically accelerate the computation of catalytic activity descriptors such as adsorption energies. Here we introduce AdsorbRL, a Deep Reinforcement Learning agent aiming to identify potential catalysts given a multi-objective binding energy target, trained using offline learning on the Open Catalyst 2020 and Materials Project data sets. We experiment with Deep Q-Network agents to traverse the space of all ~160,000 possible unary, binary and ternary compounds of 55 chemical elements, with very sparse rewards based on adsorption energy known for only between 2,000 and 3,000 catalysts per adsorbate. To constrain the action space, we introduce Random Edge Traversal and train a single-objective DQN agent on the known states subgraph, which we find strengthens target binding energy by an average of 4.1 eV. We extend this approach to multi-objective, goal-conditioned learning, and train a DQN agent to identify materials with the highest (respectively lowest) adsorption energies for multiple simultaneous target adsorbates. We experiment with Objective Sub-Sampling, a novel training scheme aimed at encouraging exploration in the multi-objective setup, and demonstrate simultaneous adsorption energy improvement across all target adsorbates, by an average of 0.8 eV. Overall, our results suggest strong potential for Deep Reinforcement Learning applied to the inverse catalysts design problem.

Title: Learning Energy-based Model via Dual-MCMC Teaching. (arXiv:2312.02469v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02469
Code URL: null
Copy Paste: [[2312.02469]] Learning Energy-based Model via Dual-MCMC Teaching(http://arxiv.org/abs/2312.02469)
Summary:
This paper studies the fundamental learning problem of the energy-based model (EBM). Learning the EBM can be achieved using the maximum likelihood estimation (MLE), which typically involves the Markov Chain Monte Carlo (MCMC) sampling, such as the Langevin dynamics. However, the noise-initialized Langevin dynamics can be challenging in practice and hard to mix. This motivates the exploration of joint training with the generator model where the generator model serves as a complementary model to bypass MCMC sampling. However, such a method can be less accurate than the MCMC and result in biased EBM learning. While the generator can also serve as an initializer model for better MCMC sampling, its learning can be biased since it only matches the EBM and has no access to empirical training examples. Such biased generator learning may limit the potential of learning the EBM. To address this issue, we present a joint learning framework that interweaves the maximum likelihood learning algorithm for both the EBM and the complementary generator model. In particular, the generator model is learned by MLE to match both the EBM and the empirical data distribution, making it a more informative initializer for MCMC sampling of EBM. Learning generator with observed examples typically requires inference of the generator posterior. To ensure accurate and efficient inference, we adopt the MCMC posterior sampling and introduce a complementary inference model to initialize such latent MCMC sampling. We show that three separate models can be seamlessly integrated into our joint framework through two (dual-) MCMC teaching, enabling effective and efficient EBM learning.

hallucination

Title: Compositional Generalization for Data-to-Text Generation. (arXiv:2312.02748v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.02748
Code URL: null
Copy Paste: [[2312.02748]] Compositional Generalization for Data-to-Text Generation(http://arxiv.org/abs/2312.02748)
Summary:
Data-to-text generation involves transforming structured data, often represented as predicate-argument tuples, into coherent textual descriptions. Despite recent advances, systems still struggle when confronted with unseen combinations of predicates, producing unfaithful descriptions (e.g. hallucinations or omissions). We refer to this issue as compositional generalization, and it encouraged us to create a benchmark for assessing the performance of different approaches on this specific problem. Furthermore, we propose a novel model that addresses compositional generalization by clustering predicates into groups. Our model generates text in a sentence-by-sentence manner, relying on one cluster of predicates at a time. This approach significantly outperforms T5 baselines across all evaluation metrics. Notably, it achieved a 31% improvement over T5 in terms of a metric focused on maintaining faithfulness to the input.

prompt

Title: Prompt Optimization via Adversarial In-Context Learning. (arXiv:2312.02614v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02614
Code URL: null
Copy Paste: [[2312.02614]] Prompt Optimization via Adversarial In-Context Learning(http://arxiv.org/abs/2312.02614)
Summary:
We propose a new method, Adversarial In-Context Learning (adv-ICL), to optimize prompts for in-context learning (ICL) by employing one LLM as a generator, another as a discriminator, and a third as a prompt modifier. As in traditional adversarial learning, adv-ICL is implemented as a two-player game between the generator and discriminator, where the generator tries to generate output realistic enough to fool the discriminator. In each round, given an input prefixed by task instructions and several exemplars, the generator produces an output. The discriminator is then tasked with classifying the generator input-output pair as model-generated or real data. Based on the discriminator loss, the prompt modifier proposes possible edits to the generator and discriminator prompts, and the edits that most improve the adversarial loss are selected. We show that adv-ICL results in significant improvements over state-of-the-art prompt optimization techniques for both open and closed-source models on 11 generation and classification tasks including summarization, arithmetic reasoning, machine translation, data-to-text generation, and the MMLU and BIG-Bench Hard benchmarks. In addition, because our method uses pre-trained models and updates only prompts rather than model parameters, it is computationally efficient, easy to extend to any LLM and task, and effective in low-resource settings.

code

Title: A Simple and Scalable Representation for Graph Generation. (arXiv:2312.02230v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02230
Code URL: null
Copy Paste: [[2312.02230]] A Simple and Scalable Representation for Graph Generation(http://arxiv.org/abs/2312.02230)
Summary:
Recently, there has been a surge of interest in employing neural networks for graph generation, a fundamental statistical learning problem with critical applications like molecule design and community analysis. However, most approaches encounter significant limitations when generating large-scale graphs. This is due to their requirement to output the full adjacency matrices whose size grows quadratically with the number of nodes. In response to this challenge, we introduce a new, simple, and scalable graph representation named gap encoded edge list (GEEL) that has a small representation size that aligns with the number of edges. In addition, GEEL significantly reduces the vocabulary size by incorporating the gap encoding and bandwidth restriction schemes. GEEL can be autoregressively generated with the incorporation of node positional encoding, and we further extend GEEL to deal with attributed graphs by designing a new grammar. Our findings reveal that the adoption of this compact representation not only enhances scalability but also bolsters performance by simplifying the graph generation process. We conduct a comprehensive evaluation across ten non-attributed and two molecular graph generation tasks, demonstrating the effectiveness of GEEL.
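
The idea of gap encoding an edge list can be illustrated with a minimal sketch. This is only one plausible reading of the summary above, not the authors' implementation: the function names `gap_encode` and `gap_decode`, the choice to store the gap between consecutive source nodes together with the gap between source and target, and the toy graph are all assumptions made for illustration.

```python
def gap_encode(edges):
    """Encode a sorted edge list as (source gap, target - source gap) pairs.

    Hypothetical illustration of gap encoding; not the GEEL reference code.
    """
    encoded = []
    prev_src = 0
    for src, dst in sorted(edges):
        # Store small differences instead of absolute node indices.
        encoded.append((src - prev_src, dst - src))
        prev_src = src
    return encoded


def gap_decode(encoded):
    """Invert gap_encode by accumulating the source gaps."""
    edges = []
    src = 0
    for src_gap, intra_gap in encoded:
        src += src_gap
        edges.append((src, src + intra_gap))
    return edges


edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
codes = gap_encode(edges)
print(codes)
print(gap_decode(codes) == sorted(edges))
```

Under this scheme the values that must be emitted depend on index differences rather than on the number of nodes, which is the intuition behind a smaller vocabulary when gaps are bounded.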

Title: Visual Encoders for Data-Efficient Imitation Learning in Modern Video Games. (arXiv:2312.02312v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02312
Code URL: null
Copy Paste: [[2312.02312]] Visual Encoders for Data-Efficient Imitation Learning in Modern Video Games(http://arxiv.org/abs/2312.02312)
Summary:
Video games have served as useful benchmarks for the decision making community, but going beyond Atari games towards training agents in modern games has been prohibitively expensive for the vast majority of the research community. Recent progress in the research, development and open release of large vision models has the potential to amortize some of these costs across the community. However, it is currently unclear which of these models have learnt representations that retain information critical for sequential decision making. Towards enabling wider participation in the research of gameplaying agents in modern games, we present a systematic study of imitation learning with publicly available visual encoders compared to the typical, task-specific, end-to-end training approach in Minecraft, Minecraft Dungeons and Counter-Strike: Global Offensive.

Title: GNN2R: Weakly-Supervised Rationale-Providing Question Answering over Knowledge Graphs. (arXiv:2312.02317v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.02317
Code URL: https://github.com/ruijie-wang-uzh/gnn2r
Copy Paste: [[2312.02317]] GNN2R: Weakly-Supervised Rationale-Providing Question Answering over Knowledge Graphs(http://arxiv.org/abs/2312.02317)
Summary:
Most current methods for multi-hop question answering (QA) over knowledge graphs (KGs) only provide final conclusive answers without explanations, such as a set of KG entities that is difficult for normal users to review and comprehend. This issue severely limits the application of KG-based QA in real-world scenarios. However, it is non-trivial to solve due to two challenges: First, annotations of reasoning chains of multi-hop questions, which could serve as supervision for explanation generation, are usually lacking. Second, it is difficult to maintain high efficiency when explicit KG triples need to be retrieved to generate explanations. In this paper, we propose a novel Graph Neural Network-based Two-Step Reasoning model (GNN2R) to solve this issue. GNN2R can provide both final answers and reasoning subgraphs as a rationale behind final answers efficiently with only weak supervision that is available through question-final answer pairs. We extensively evaluated GNN2R with detailed analyses in experiments. The results demonstrate that, in terms of effectiveness, efficiency, and quality of generated explanations, GNN2R outperforms existing state-of-the-art methods that are applicable to this task. Our code and pre-trained models are available at https://github.com/ruijie-wang-uzh/GNN2R.

Title: Expressive Sign Equivariant Networks for Spectral Geometric Learning. (arXiv:2312.02339v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02339
Code URL: https://github.com/cptq/sign-equivariant-nets
Copy Paste: [[2312.02339]] Expressive Sign Equivariant Networks for Spectral Geometric Learning(http://arxiv.org/abs/2312.02339)
Summary:
Recent work has shown the utility of developing machine learning models that respect the structure and symmetries of eigenvectors. These works promote sign invariance, since for any eigenvector v the negation -v is also an eigenvector. However, we show that sign invariance is theoretically limited for tasks such as building orthogonally equivariant models and learning node positional encodings for link prediction in graphs. In this work, we demonstrate the benefits of sign equivariance for these tasks. To obtain these benefits, we develop novel sign equivariant neural network architectures. Our models are based on a new analytic characterization of sign equivariant polynomials and thus inherit provable expressiveness properties. Controlled synthetic experiments show that our networks can achieve the theoretically predicted benefits of sign equivariant models. Code is available at https://github.com/cptq/Sign-Equivariant-Nets.
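
The difference between sign invariance and sign equivariance can be shown with a toy sketch. The two element-wise functions below are hypothetical stand-ins chosen only because they have the right symmetry; they are not the architectures proposed in the paper.

```python
def sign_invariant(v):
    """Toy sign-invariant map: f(-v) == f(v)."""
    return [x * x for x in v]


def sign_equivariant(v):
    """Toy sign-equivariant map: f(-v) == -f(v)."""
    return [x * abs(x) for x in v]


v = [0.5, -2.0, 3.0]
neg_v = [-x for x in v]

# Flipping the sign of the input leaves the invariant output unchanged...
print(sign_invariant(v) == sign_invariant(neg_v))
# ...and flips the sign of the equivariant output.
print(sign_equivariant(neg_v) == [-y for y in sign_equivariant(v)])
```

The invariant map discards the sign of its input, whereas the equivariant map carries it through to the output, which is the property the summary argues some tasks need.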

Title: BEDD: The MineRL BASALT Evaluation and Demonstrations Dataset for Training and Benchmarking Agents that Solve Fuzzy Tasks. (arXiv:2312.02405v1 [cs.AI])

Paper URL: http://arxiv.org/abs/2312.02405
Code URL: https://github.com/minerllabs/basalt-benchmark
Copy Paste: [[2312.02405]] BEDD: The MineRL BASALT Evaluation and Demonstrations Dataset for Training and Benchmarking Agents that Solve Fuzzy Tasks(http://arxiv.org/abs/2312.02405)
Summary:
The MineRL BASALT competition has served to catalyze advances in learning from human feedback through four hard-to-specify tasks in Minecraft, such as create and photograph a waterfall. Given the completion of two years of BASALT competitions, we offer to the community a formalized benchmark through the BASALT Evaluation and Demonstrations Dataset (BEDD), which serves as a resource for algorithm development and performance assessment. BEDD consists of a collection of 26 million image-action pairs from nearly 14,000 videos of human players completing the BASALT tasks in Minecraft. It also includes over 3,000 dense pairwise human evaluations of human and algorithmic agents. These comparisons serve as a fixed, preliminary leaderboard for evaluating newly-developed algorithms. To enable this comparison, we present a streamlined codebase for benchmarking new algorithms against the leaderboard. In addition to presenting these datasets, we conduct a detailed analysis of the data from both datasets to guide algorithm development and evaluation. The released code and data are available at https://github.com/minerllabs/basalt-benchmark .

Title: Decoding Data Quality via Synthetic Corruptions: Embedding-guided Pruning of Code Data. (arXiv:2312.02418v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.02418
Code URL: null
Copy Paste: [[2312.02418]] Decoding Data Quality via Synthetic Corruptions: Embedding-guided Pruning of Code Data(http://arxiv.org/abs/2312.02418)
Summary:
Code datasets, often collected from diverse and uncontrolled sources such as GitHub, potentially suffer from quality issues, thereby affecting the performance and training efficiency of Large Language Models (LLMs) optimized for code generation. Previous studies demonstrated the benefit of using embedding spaces for data pruning, but they mainly focused on duplicate removal or increasing variety, and in other modalities, such as images. Our work focuses on using embeddings to identify and remove "low-quality" code data. First, we explore features of "low-quality" code in embedding space, through the use of synthetic corruptions. Armed with this knowledge, we devise novel pruning metrics that operate in embedding space to identify and remove low-quality entries in the Stack dataset. We demonstrate the benefits of this synthetic corruption informed pruning (SCIP) approach on the well-established HumanEval and MBPP benchmarks, outperforming existing embedding-based methods. Importantly, we achieve up to a 3% performance improvement over no pruning, thereby showing the promise of insights from synthetic corruptions for data pruning.

Title: Structured World Representations in Maze-Solving Transformers. (arXiv:2312.02566v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02566
Code URL: null
Copy Paste: [[2312.02566]] Structured World Representations in Maze-Solving Transformers(http://arxiv.org/abs/2312.02566)
Summary:
Transformer models underpin many recent advances in practical machine learning applications, yet understanding their internal behavior continues to elude researchers. Given the size and complexity of these models, forming a comprehensive picture of their inner workings remains a significant challenge. To this end, we set out to understand small transformer models in a more tractable setting: that of solving mazes. In this work, we focus on the abstractions formed by these models and find evidence for the consistent emergence of structured internal representations of maze topology and valid paths. We demonstrate this by showing that the residual stream of only a single token can be linearly decoded to faithfully reconstruct the entire maze. We also find that the learned embeddings of individual tokens have spatial structure. Furthermore, we take steps towards deciphering the circuitry of path-following by identifying attention heads (dubbed "adjacency heads"), which are implicated in finding valid subsequent tokens.

Title: On the Initialization of Graph Neural Networks. (arXiv:2312.02622v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02622
Code URL: https://github.com/lspongebobjh/virgo_icml2023
Copy Paste: [[2312.02622]] On the Initialization of Graph Neural Networks(http://arxiv.org/abs/2312.02622)
Summary:
Graph Neural Networks (GNNs) have displayed considerable promise in graph representation learning across various applications. The core learning process requires the initialization of model weight matrices within each GNN layer, which is typically accomplished via classic initialization methods such as Xavier initialization. However, these methods were originally motivated to stabilize the variance of hidden embeddings and gradients across layers of Feedforward Neural Networks (FNNs) and Convolutional Neural Networks (CNNs) to avoid vanishing gradients and maintain steady information flow. In contrast, within the GNN context classical initializations disregard the impact of the input graph structure and message passing on variance. In this paper, we analyze the variance of forward and backward propagation across GNN layers and show that the variance instability of GNN initializations comes from the combined effect of the activation function, hidden dimension, graph structure and message passing. To better account for these influencing factors, we propose a new initialization method for Variance Instability Reduction within GNN Optimization (Virgo), which naturally tends to equate forward and backward variances across successive layers. We conduct comprehensive experiments on 15 datasets to show that Virgo can lead to superior model performance and more stable variance at initialization on node classification, link prediction and graph classification tasks. Code is available at https://github.com/LspongebobJH/virgo_icml2023.
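
For reference, the classic Xavier (Glorot) uniform initialization mentioned above as the usual baseline can be sketched as follows. This shows only the standard baseline, which ignores graph structure; it is not Virgo, and the helper name `xavier_uniform` and the layer sizes are assumptions for illustration.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out weight matrix from U(-a, a).

    a = sqrt(6 / (fan_in + fan_out)), so each weight has variance
    2 / (fan_in + fan_out). Graph structure plays no role here.
    """
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-bound, bound) for _ in range(fan_out)]
            for _ in range(fan_in)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
weights = xavier_uniform(64, 32, rng)
bound = math.sqrt(6.0 / (64 + 32))
print(bound)
```

The bound depends only on the layer's input and output widths, which is the limitation the summary points to in the GNN setting.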

Title: H-GAP: Humanoid Control with a Generalist Planner. (arXiv:2312.02682v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02682
Code URL: null
Copy Paste: [[2312.02682]] H-GAP: Humanoid Control with a Generalist Planner(http://arxiv.org/abs/2312.02682)
Summary:
Humanoid control is an important research challenge offering avenues for integration into human-centric infrastructures and enabling physics-driven humanoid animations. The daunting challenges in this field stem from the difficulty of optimizing in high-dimensional action spaces and the instability introduced by the bipedal morphology of humanoids. However, the extensive collection of human motion-captured data and the derived datasets of humanoid trajectories, such as MoCapAct, pave the way to tackle these challenges. In this context, we present Humanoid Generalist Autoencoding Planner (H-GAP), a state-action trajectory generative model trained on humanoid trajectories derived from human motion-captured data, capable of adeptly handling downstream control tasks with Model Predictive Control (MPC). For a humanoid with 56 degrees of freedom, we empirically demonstrate that H-GAP learns to represent and generate a wide range of motor behaviours. Further, without any learning from online interactions, it can also flexibly transfer these behaviours to solve novel downstream control tasks via planning. Notably, H-GAP outperforms established MPC baselines that have access to the ground truth dynamics model, and is superior or comparable to offline RL methods trained for individual tasks. Finally, we conduct a series of empirical studies on the scaling properties of H-GAP, showing the potential for performance gains via additional data but not additional compute. Code and videos are available at https://ycxuyingchen.github.io/hgap/.

Title: Clustering Pseudo Language Family in Multilingual Translation Models with Fisher Information Matrix. (arXiv:2312.02820v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.02820
Code URL: https://github.com/ecoli-hit/pseudofamily
Copy Paste: [[2312.02820]] Clustering Pseudo Language Family in Multilingual Translation Models with Fisher Information Matrix(http://arxiv.org/abs/2312.02820)
Summary:
In multilingual translation research, the comprehension and utilization of language families are of paramount importance. Nevertheless, clustering languages based solely on their ancestral families can yield suboptimal results due to variations in the datasets employed during the model's training phase. To mitigate this challenge, we introduce an innovative method that leverages the Fisher information matrix (FIM) to cluster language families, anchored on the multilingual translation model's characteristics. We hypothesize that language pairs with similar effects on model parameters exhibit a considerable degree of linguistic congruence and should thus be grouped cohesively. This concept has led us to define pseudo language families. We provide an in-depth discussion regarding the inception and application of these pseudo language families. Empirical evaluations reveal that employing these pseudo language families enhances performance over conventional language families in adapting a multilingual translation model to unfamiliar language pairs. The proposed methodology may also be extended to scenarios requiring language similarity measurements. The source code and associated scripts can be accessed at https://github.com/ecoli-hit/PseudoFamily.

Title: MIMONets: Multiple-Input-Multiple-Output Neural Networks Exploiting Computation in Superposition. (arXiv:2312.02829v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02829
Code URL: https://github.com/ibm/multiple-input-multiple-output-nets
Copy Paste: [[2312.02829]] MIMONets: Multiple-Input-Multiple-Output Neural Networks Exploiting Computation in Superposition(http://arxiv.org/abs/2312.02829)
Summary:
With the advent of deep learning, progressively larger neural networks have been designed to solve complex tasks. We take advantage of these capacity-rich models to lower the cost of inference by exploiting computation in superposition. To reduce the computational burden per input, we propose Multiple-Input-Multiple-Output Neural Networks (MIMONets) capable of handling many inputs at once. MIMONets augment various deep neural network architectures with variable binding mechanisms to represent an arbitrary number of inputs in a compositional data structure via fixed-width distributed representations. Accordingly, MIMONets adapt nonlinear neural transformations to process the data structure holistically, leading to a speedup nearly proportional to the number of superposed input items in the data structure. After processing in superposition, an unbinding mechanism recovers each transformed input of interest. MIMONets also provide a dynamic trade-off between accuracy and throughput by an instantaneous on-demand switching between a set of accuracy-throughput operating points, yet within a single set of fixed parameters. We apply the concept of MIMONets to both CNN and Transformer architectures resulting in MIMOConv and MIMOFormer, respectively. Empirical evaluations show that MIMOConv achieves about 2-4x speedup at an accuracy delta within [+0.68, -3.18]% compared to WideResNet CNNs on CIFAR10 and CIFAR100. Similarly, MIMOFormer can handle 2-4 inputs at once while maintaining a high average accuracy within a [-1.07, -3.43]% delta on the Long Range Arena benchmark. Finally, we provide mathematical bounds on the interference between superposition channels in MIMOFormer. Our code is available at https://github.com/IBM/multiple-input-multiple-output-nets.

Title: FaultFormer: Transformer-based Prediction of Bearing Faults. (arXiv:2312.02380v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02380
Code URL: null
Copy Paste: [[2312.02380]] FaultFormer: Transformer-based Prediction of Bearing Faults(http://arxiv.org/abs/2312.02380)
Summary:
The growth of deep learning in the past decade has motivated important applications to smart manufacturing and machine health monitoring. In particular, vibration data offers a rich and reliable source to provide meaningful insights into machine health and predictive maintenance. In this work, we present a Transformer based framework for analyzing vibration signals to predict different types of bearing faults (FaultFormer). In particular, we process signal data using data augmentations and extract their Fourier modes to train a transformer encoder to achieve state of the art accuracies. The attention mechanism as well as model outputs were analyzed to confirm the transformer's ability to automatically extract features within signals and learn both global and local relationships to make classifications. Lastly, two pretraining strategies were proposed to pave the way for large, generalizable transformers that could adapt to new data, situations, or machinery on the production floor.

Title: Robust Clustering using Hyperdimensional Computing. (arXiv:2312.02407v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02407
Code URL: null
Copy Paste: [[2312.02407]] Robust Clustering using Hyperdimensional Computing(http://arxiv.org/abs/2312.02407)
Summary:
This paper addresses the clustering of data in the hyperdimensional computing (HDC) domain. In prior work, an HDC-based clustering framework, referred to as HDCluster, has been proposed. However, the performance of the existing HDCluster is not robust: it degrades because the hypervectors for the clusters are chosen at random during the initialization step. To overcome this bottleneck, we assign the initial cluster hypervectors by exploring the similarity of the encoded data, referred to as query hypervectors. Intra-cluster hypervectors have a higher similarity than inter-cluster hypervectors. Harnessing the similarity results among query hypervectors, this paper proposes four HDC-based clustering algorithms: similarity-based k-means, equal bin-width histogram, equal bin-height histogram, and similarity-based affinity propagation. Experimental results illustrate that: (i) Compared to the existing HDCluster, our proposed HDC-based clustering algorithms can achieve better accuracy, more robust performance, fewer iterations, and less execution time. Similarity-based affinity propagation outperforms the other three HDC-based clustering algorithms on eight datasets by 2-38% in clustering accuracy. (ii) Even for one-pass clustering, i.e., without any iterative update of the cluster hypervectors, our proposed algorithms can provide more robust clustering accuracy than HDCluster. (iii) Five of the eight datasets achieve higher or comparable accuracy when projected onto the hyperdimensional space. Traditional clustering is more desirable than HDC when the number of clusters, k, is large.

Title: Dimensionality Reduction and Dynamical Mode Recognition of Circular Arrays of Flame Oscillators Using Deep Neural Network. (arXiv:2312.02462v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02462
Code URL: null
Copy Paste: [[2312.02462]] Dimensionality Reduction and Dynamical Mode Recognition of Circular Arrays of Flame Oscillators Using Deep Neural Network(http://arxiv.org/abs/2312.02462)
Summary:
Oscillatory combustion in aero engines and modern gas turbines often has significant adverse effects on their operation, and accurately recognizing various oscillation modes is the prerequisite for understanding and controlling combustion instability. However, the high-dimensional spatial-temporal data of a complex combustion system typically poses considerable challenges to the dynamical mode recognition. Based on a two-layer bidirectional long short-term memory variational autoencoder (Bi-LSTM-VAE) dimensionality reduction model and a two-dimensional Wasserstein distance-based classifier (WDC), this study proposes a promising method (Bi-LSTM-VAE-WDC) for recognizing dynamical modes in oscillatory combustion systems. Specifically, the Bi-LSTM-VAE dimension reduction model was introduced to reduce the high-dimensional spatial-temporal data of the combustion system to a low-dimensional phase space; Gaussian kernel density estimates (GKDE) were computed based on the distribution of phase points in a grid; two-dimensional WD values were calculated from the GKDE maps to recognize the oscillation modes. The time-series data used in this study were obtained from numerical simulations of circular arrays of laminar flame oscillators. The results show that the novel Bi-LSTM-VAE method can produce a non-overlapping distribution of phase points, indicating an effective unsupervised mode recognition and classification. Furthermore, the present method exhibits a more prominent performance than VAE and PCA (principal component analysis) for distinguishing dynamical modes in complex flame systems, implying its potential in studying turbulent combustion.

Title: Constrained Twin Variational Auto-Encoder for Intrusion Detection in IoT Systems. (arXiv:2312.02490v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02490
Code URL: null
Copy Paste: [[2312.02490]] Constrained Twin Variational Auto-Encoder for Intrusion Detection in IoT Systems(http://arxiv.org/abs/2312.02490)
Summary:
Intrusion detection systems (IDSs) play a critical role in protecting billions of IoT devices from malicious attacks. However, the IDSs for IoT devices face inherent challenges of IoT systems, including the heterogeneity of IoT data/devices, the high dimensionality of training data, and the imbalanced data. Moreover, the deployment of IDSs on IoT systems is challenging, and sometimes impossible, due to the limited resources such as memory/storage and computing capability of typical IoT devices. To tackle these challenges, this article proposes a novel deep neural network architecture called Constrained Twin Variational Auto-Encoder (CTVAE) that can feed classifiers of IDSs with more separable/distinguishable and lower-dimensional representation data. Additionally, in comparison to the state-of-the-art neural networks used in IDSs, CTVAE requires less memory/storage and computing power, hence making it more suitable for IoT IDS systems. Extensive experiments with the 11 most popular IoT botnet datasets show that CTVAE can improve accuracy and F-score in attack detection by around 1% compared to the state-of-the-art machine learning and representation learning methods, whilst the running time for attack detection is lower than 2E-6 seconds and the model size is lower than 1 MB. We also further investigate various characteristics of CTVAE in the latent space and in the reconstruction representation to demonstrate its efficacy compared with current well-known methods.

Title: Rethinking and Simplifying Bootstrapped Graph Latents. (arXiv:2312.02619v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02619
Code URL: null
Copy Paste: [[2312.02619]] Rethinking and Simplifying Bootstrapped Graph Latents(http://arxiv.org/abs/2312.02619)
Summary:
Graph contrastive learning (GCL) has emerged as a representative paradigm in graph self-supervised learning, where negative samples are commonly regarded as the key to preventing model collapse and producing distinguishable representations. Recent studies have shown that GCL without negative samples can achieve state-of-the-art performance as well as scalability improvement, with bootstrapped graph latent (BGRL) as a prominent step forward. However, BGRL relies on a complex architecture to maintain the ability to scatter representations, and the underlying mechanisms enabling the success remain largely unexplored. In this paper, we introduce an instance-level decorrelation perspective to tackle the aforementioned issue and leverage it as a springboard to reveal the potential unnecessary model complexity within BGRL. Based on our findings, we present SGCL, a simple yet effective GCL framework that utilizes the outputs from two consecutive iterations as positive pairs, eliminating the negative samples. SGCL only requires a single graph augmentation and a single graph encoder without additional parameters. Extensive experiments conducted on various graph benchmarks demonstrate that SGCL can achieve competitive performance with fewer parameters, lower time and space costs, and significant convergence speedup.

chat

Title: How Generative-AI can be Effectively used in Government Chatbots. (arXiv:2312.02181v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.02181
Code URL: null
Copy Paste: [[2312.02181]] How Generative-AI can be Effectively used in Government Chatbots(http://arxiv.org/abs/2312.02181)
Summary:
With the rapid development of artificial intelligence and breakthroughs in machine learning and natural language processing, intelligent question-answering robots have become widely used in government affairs. This paper conducts a horizontal comparison between Guangdong Province's government chatbots and two large language models, ChatGPT and Wenxin Ernie, to analyze the strengths and weaknesses of existing government chatbots and AIGC technology. The study finds significant differences between government chatbots and large language models. China's government chatbots are still in an exploratory stage and have a gap to close to achieve "intelligence." To explore the future direction of government chatbots more deeply, this research proposes targeted optimization paths to help generative AI be effectively applied in government chatbot conversations.

retrieval augmented generation

rag

Title: Low-Precision Mixed-Computation Models for Inference on Edge. (arXiv:2312.02210v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02210
Code URL: null
Copy Paste: [[2312.02210]] Low-Precision Mixed-Computation Models for Inference on Edge(http://arxiv.org/abs/2312.02210)
Summary:
This paper presents a mixed-computation neural network processing approach for edge applications that incorporates low-precision (low-width) Posit and low-precision fixed point (FixP) number systems. This mixed-computation approach employs 4-bit Posit (Posit4), which has higher precision around zero, for representing weights with high sensitivity, while it uses 4-bit FixP (FixP4) for representing other weights. A heuristic for analyzing the importance and the quantization error of the weights is presented to assign the proper number system to different weights. Additionally, a gradient approximation for Posit representation is introduced to improve the quality of weight updates in the backpropagation process. Due to the high energy consumption of the fully Posit-based computations, neural network operations are carried out in FixP or Posit/FixP. An efficient hardware implementation of a MAC operation with a first Posit operand and FixP for a second operand and accumulator is presented. The efficacy of the proposed low-precision mixed-computation approach is extensively assessed on vision and language models. The results show that, on average, the accuracy of the mixed-computation is about 1.5% higher than that of FixP with a cost of 0.19% energy overhead.
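
To make the fixed-point half of the scheme concrete, the sketch below quantizes a real-valued weight to a signed 4-bit fixed-point value. The abstract does not give the FixP4 bit layout, so the 3 fractional bits and the name `quantize_fixp4` are assumptions made only for illustration; Posit4 encoding and the paper's weight-assignment heuristic are not shown.

```python
def quantize_fixp4(x, frac_bits=3):
    """Round x to the nearest signed 4-bit fixed-point value, saturating at the range limits.

    A signed 4-bit code spans the integers -8..7; with frac_bits fractional bits
    the representable values are code / 2**frac_bits (layout assumed, not from the paper).
    """
    scale = 1 << frac_bits
    code = round(x * scale)          # Python rounds exact halves to the nearest even integer
    code = max(-8, min(7, code))     # saturate to the 4-bit two's-complement range
    return code / scale
```

With 3 fractional bits the step size is 0.125 and the representable range is [-1.0, 0.875]; the uniform spacing contrasts with the higher precision around zero that the abstract attributes to Posit4.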

Title: Rethinking Adversarial Training with Neural Tangent Kernel. (arXiv:2312.02236v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02236
Code URL: null
Copy Paste: [[2312.02236]] Rethinking Adversarial Training with Neural Tangent Kernel(http://arxiv.org/abs/2312.02236)
Summary:
Adversarial training (AT) is an important and attractive topic in deep learning security, exhibiting mysteries and odd properties. Recent studies of neural network training dynamics based on Neural Tangent Kernel (NTK) make it possible to revisit AT and deeply analyze its properties. In this paper, we perform an in-depth investigation of the AT process and its properties with NTK, such as NTK evolution. We uncover three new findings that were missed in previous works. First, we disclose the impact of data normalization on AT and the importance of unbiased estimators in batch normalization layers. Second, we experimentally explore the kernel dynamics and propose more time-saving AT methods. Third, we study the spectrum feature inside the kernel to address the catastrophic overfitting problem. To the best of our knowledge, this is the first work leveraging the observations of kernel dynamics to improve existing AT methods.

Title: Towards Fast and Stable Federated Learning: Confronting Heterogeneity via Knowledge Anchor. (arXiv:2312.02416v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02416
Code URL: https://github.com/J1nqianChen/FedKA
Copy Paste: [[2312.02416]] Towards Fast and Stable Federated Learning: Confronting Heterogeneity via Knowledge Anchor(http://arxiv.org/abs/2312.02416)
Summary:
Federated learning encounters a critical challenge of data heterogeneity, adversely affecting the performance and convergence of the federated model. Various approaches have been proposed to address this issue, yet their effectiveness is still limited. Recent studies have revealed that the federated model suffers severe forgetting in local training, leading to global forgetting and performance degradation. Although the analysis provides valuable insights, a comprehensive understanding of the vulnerable classes and their impact factors is yet to be established. In this paper, we aim to bridge this gap by systematically analyzing the forgetting degree of each class during local training across different communication rounds. Our observations are: (1) Both missing and non-dominant classes suffer similar severe forgetting during local training, while dominant classes show improvement in performance. (2) When dynamically reducing the sample size of a dominant class, catastrophic forgetting occurs abruptly when the proportion of its samples is below a certain threshold, indicating that the local model struggles to leverage a few samples of a specific class effectively to prevent forgetting. Motivated by these findings, we propose a novel and straightforward algorithm called Federated Knowledge Anchor (FedKA). Assuming that all clients have a single shared sample for each class, the knowledge anchor is constructed before each local training stage by extracting shared samples for missing classes and randomly selecting one sample per class for non-dominant classes. The knowledge anchor is then utilized to correct the gradient of each mini-batch towards the direction of preserving the knowledge of the missing and non-dominant classes. Extensive experimental results demonstrate that our proposed FedKA achieves fast and stable convergence, significantly improving accuracy on popular benchmarks.

Title: MASP: Scalable GNN-based Planning for Multi-Agent Navigation. (arXiv:2312.02522v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02522
Code URL: null
Copy Paste: [[2312.02522]] MASP: Scalable GNN-based Planning for Multi-Agent Navigation(http://arxiv.org/abs/2312.02522)
Summary:
We investigate the problem of decentralized multi-agent navigation tasks, where multiple agents need to reach initially unassigned targets in a limited time. Classical planning-based methods suffer from expensive computation overhead at each step and offer limited expressiveness for complex cooperation strategies. In contrast, reinforcement learning (RL) has recently become a popular paradigm for addressing this issue. However, RL struggles with low data efficiency and cooperation when directly exploring (nearly) optimal policies in the large search space, especially with an increased agent number (e.g., 10+ agents) or in complex environments (e.g., 3D simulators). In this paper, we propose Multi-Agent Scalable GNN-based Planner (MASP), a goal-conditioned hierarchical planner for navigation tasks with a substantial number of agents. MASP adopts a hierarchical framework to divide a large search space into multiple smaller spaces, thereby reducing the space complexity and accelerating training convergence. We also leverage graph neural networks (GNN) to model the interaction between agents and goals, improving goal achievement. Besides, to enhance generalization capabilities in scenarios with unseen team sizes, we divide agents into multiple groups, each with a previously trained number of agents. The results demonstrate that MASP outperforms classical planning-based competitors and RL baselines, achieving a nearly 100% success rate with minimal training data in both multi-agent particle environments (MPE) with 50 agents and a quadrotor 3-dimensional environment (OmniDrones) with 20 agents. Furthermore, the learned policy showcases zero-shot generalization across unseen team sizes.

Title: MEMTO: Memory-guided Transformer for Multivariate Time Series Anomaly Detection. (arXiv:2312.02530v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02530
Code URL: null
Copy Paste: [[2312.02530]] MEMTO: Memory-guided Transformer for Multivariate Time Series Anomaly Detection(http://arxiv.org/abs/2312.02530)
Summary:
Detecting anomalies in real-world multivariate time series data is challenging due to complex temporal dependencies and inter-variable correlations. Recently, reconstruction-based deep models have been widely used to solve the problem. However, these methods still suffer from an over-generalization issue and fail to deliver consistently high performance. To address this issue, we propose the MEMTO, a memory-guided Transformer using a reconstruction-based approach. It is designed to incorporate a novel memory module that can learn the degree to which each memory item should be updated in response to the input data. To stabilize the training procedure, we use a two-phase training paradigm which involves using K-means clustering for initializing memory items. Additionally, we introduce a bi-dimensional deviation-based detection criterion that calculates anomaly scores considering both input space and latent space. We evaluate our proposed method on five real-world datasets from diverse domains, and it achieves an average anomaly detection F1-score of 95.74%, significantly outperforming the previous state-of-the-art methods. We also conduct extensive experiments to empirically validate the effectiveness of our proposed model's key components.

Title: Towards the Inferrence of Structural Similarity of Combinatorial Landscapes. (arXiv:2312.02720v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02720
Code URL: null
Copy Paste: [[2312.02720]] Towards the Inferrence of Structural Similarity of Combinatorial Landscapes(http://arxiv.org/abs/2312.02720)
Summary:
One of the most common problem-solving heuristics is reasoning by analogy. For a given problem, a solver can be viewed as a strategic walk on its fitness landscape. Thus if a solver works for one problem instance, we expect it will also be effective for other instances whose fitness landscapes share structural similarities with it. However, due to the black-box nature of combinatorial optimization, it is far from trivial to infer such similarity in real-world scenarios. To bridge this gap, by using local optima networks as a proxy for fitness landscapes, this paper proposes to leverage graph data mining techniques to conduct qualitative and quantitative analyses to explore the latent topological structural information embedded in those landscapes. By conducting large-scale empirical experiments on three classic combinatorial optimization problems, we gain concrete evidence to support the existence of structural similarity between landscapes of the same classes within neighboring dimensions. We also investigate the relationship between landscapes of different problem classes.

Title: Leveraging Domain Adaptation and Data Augmentation to Improve Qur'anic IR in English and Arabic. (arXiv:2312.02803v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.02803
Code URL: null
Copy Paste: [[2312.02803]] Leveraging Domain Adaptation and Data Augmentation to Improve Qur'anic IR in English and Arabic(http://arxiv.org/abs/2312.02803)
Summary:
In this work, we approach the problem of Qur'anic information retrieval (IR) in Arabic and English. Using the latest state-of-the-art methods in neural IR, we research what helps to tackle this task more efficiently. Training retrieval models requires a lot of data, which is difficult to obtain in-domain. Therefore, we commence with training on a large amount of general domain data and then continue training on in-domain data. To handle the lack of in-domain data, we employed a data augmentation technique, which considerably improved results in MRR@10 and NDCG@5 metrics, setting the state-of-the-art in Qur'anic IR for both English and Arabic. The absence of an Islamic corpus and domain-specific model for the IR task in English motivated us to address this lack of resources and take preliminary steps toward compiling an Islamic corpus and pre-training a domain-specific language model (LM), which helped to improve the performance of the retrieval models that use the domain-specific LM as the shared backbone. We examined several language models (LMs) in Arabic to select one that efficiently deals with the Qur'anic IR task. Besides transferring successful experiments from English to Arabic, we conducted additional experiments with the retrieval task in Arabic to compensate for the scarcity of general domain datasets used to train the retrieval models. Handling the Qur'anic IR task in both English and Arabic allowed us to enhance the comparison and share valuable insights across models and languages.
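
For readers unfamiliar with the two metrics named above, the following sketch computes MRR@k and NDCG@k using their standard definitions. It is an illustration, not the paper's evaluation script; the function names are hypothetical, and the ideal ranking in NDCG is built only from the gains present in the supplied list, which assumes every relevant passage appears somewhere in that list.

```python
import math


def mrr_at_k(ranked_relevance, k=10):
    """Mean reciprocal rank over queries.

    ranked_relevance: one list per query of 0/1 relevance flags in ranked order.
    A query with no relevant item in the top k contributes 0.
    """
    total = 0.0
    for flags in ranked_relevance:
        for rank, relevant in enumerate(flags[:k], start=1):
            if relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)


def ndcg_at_k(gains, k=5):
    """NDCG@k for a single query; gains are graded relevance values in ranked order."""
    def dcg(values):
        # Discount each gain by log2(rank + 1), with ranks starting at 1.
        return sum(g / math.log2(rank + 1) for rank, g in enumerate(values, start=1))

    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0
```

For example, two queries whose first relevant passages sit at ranks 2 and 1 give an MRR of (1/2 + 1) / 2 = 0.75.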

Title: Calibrated Adaptive Teacher for Domain Adaptive Intelligent Fault Diagnosis. (arXiv:2312.02826v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02826
Code URL: null
Copy Paste: [[2312.02826]] Calibrated Adaptive Teacher for Domain Adaptive Intelligent Fault Diagnosis(http://arxiv.org/abs/2312.02826)
Summary:
Intelligent Fault Diagnosis (IFD) based on deep learning has proven to be an effective and flexible solution, attracting extensive research. Deep neural networks can learn rich representations from vast amounts of representative labeled data for various applications. In IFD, they achieve high classification performance from signals in an end-to-end manner, without requiring extensive domain knowledge. However, deep learning models usually only perform well on the data distribution they have been trained on. When applied to a different distribution, they may experience performance drops. This is also observed in IFD, where assets are often operated in working conditions different from those in which labeled data have been collected. Unsupervised domain adaptation (UDA) deals with the scenario where labeled data are available in a source domain, and only unlabeled data are available in a target domain, where domains may correspond to operating conditions. Recent methods rely on training with confident pseudo-labels for target samples. However, the confidence-based selection of pseudo-labels is hindered by poorly calibrated confidence estimates in the target domain, primarily due to over-confident predictions, which limits the quality of pseudo-labels and leads to error accumulation. In this paper, we propose a novel UDA method called Calibrated Adaptive Teacher (CAT), where we propose to calibrate the predictions of the teacher network throughout the self-training process, leveraging post-hoc calibration techniques. We evaluate CAT on domain-adaptive IFD and perform extensive experiments on the Paderborn benchmark for bearing fault diagnosis under varying operating conditions. Our proposed method achieves state-of-the-art performance on most transfer tasks.
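
The abstract does not say which post-hoc calibration techniques CAT uses. As one common example of the family, temperature scaling divides the logits by a scalar before the softmax, which softens over-confident predictions without changing the predicted class. The sketch below assumes that technique; the function name and the temperature values are illustrative only.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature; temperature > 1 flattens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                            # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


raw = softmax_with_temperature([2.0, 0.0], temperature=1.0)
calibrated = softmax_with_temperature([2.0, 0.0], temperature=2.0)
```

In practice the temperature is usually fitted on held-out labeled data; how a calibrated confidence then feeds into pseudo-label selection is specific to the paper and not reproduced here.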

Title: FlowHON: Representing Flow Fields Using Higher-Order Networks. (arXiv:2312.02243v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02243
Code URL: null
Copy Paste: [[2312.02243]] FlowHON: Representing Flow Fields Using Higher-Order Networks(http://arxiv.org/abs/2312.02243)
Summary:
Flow fields are often partitioned into data blocks for massively parallel computation and analysis based on blockwise relationships. However, most of the previous techniques only consider the first-order dependencies among blocks, which is insufficient in describing complex flow patterns. In this work, we present FlowHON, an approach to construct higher-order networks (HONs) from flow fields. FlowHON captures the inherent higher-order dependencies in flow fields as nodes and estimates the transitions among them as edges. We formulate the HON construction as an optimization problem with three linear transformations. The first two layers correspond to the node generation and the third one corresponds to edge estimation. Our formulation allows the node generation and edge estimation to be solved in a unified framework. With FlowHON, the rich set of traditional graph algorithms can be applied without any modification to analyze flow fields, while leveraging the higher-order information to understand the inherent structure and manage flow data for efficiency. We demonstrate the effectiveness of FlowHON using a series of downstream tasks, including estimating the density of particles during tracing, partitioning flow fields for data management, and understanding flow fields using the node-link diagram representation of networks.

Title: FLea: Improving federated learning on scarce and label-skewed data via privacy-preserving feature augmentation. (arXiv:2312.02327v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02327
Code URL: null
Copy Paste: [[2312.02327]] FLea: Improving federated learning on scarce and label-skewed data via privacy-preserving feature augmentation(http://arxiv.org/abs/2312.02327)
Summary:
Learning a global model by abstracting the knowledge, distributed across multiple clients, without aggregating the raw data is the primary goal of Federated Learning (FL). Typically, this works in rounds alternating between parallel local training at several clients, followed by model aggregation at a server. We found that existing FL methods under-perform when local datasets are small and present severe label skew as these lead to over-fitting and local model bias. This is a realistic setting in many real-world applications. To address the problem, we propose FLea, a unified framework that tackles over-fitting and local bias by encouraging clients to exchange privacy-protected features to aid local training. The features refer to activations from an intermediate layer of the model, which are obfuscated before being shared with other clients to protect sensitive information in the data. FLea leverages a novel way of combining local and shared features as augmentations to enhance local model learning. Our extensive experiments demonstrate that FLea outperforms the state-of-the-art FL methods, sharing only model parameters, by up to 17.6%, and FL methods that share data augmentations by up to 6.3%, while reducing the privacy vulnerability associated with shared data augmentations.

Title: Adaptive Instrument Design for Indirect Experiments. (arXiv:2312.02438v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02438
Code URL: null
Copy Paste: [[2312.02438]] Adaptive Instrument Design for Indirect Experiments(http://arxiv.org/abs/2312.02438)
Summary:
Indirect experiments provide a valuable framework for estimating treatment effects in situations where conducting randomized control trials (RCTs) is impractical or unethical. Unlike RCTs, indirect experiments estimate treatment effects by leveraging (conditional) instrumental variables, enabling estimation through encouragement and recommendation rather than strict treatment assignment. However, the sample efficiency of such estimators depends not only on the inherent variability in outcomes but also on the varying compliance levels of users with the instrumental variables and the choice of estimator being used, especially when dealing with numerous instrumental variables. While adaptive experiment design has a rich literature for direct experiments, in this paper we take the initial steps towards enhancing sample efficiency for indirect experiments by adaptively designing a data collection policy over instrumental variables. Our main contribution is a practical computational procedure that utilizes influence functions to search for an optimal data collection policy, minimizing the mean-squared error of the desired (non-linear) estimator. Through experiments conducted in various domains inspired by real-world applications, we showcase how our method can significantly improve the sample efficiency of indirect experiments.

Title: Generator Born from Classifier. (arXiv:2312.02470v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02470
Code URL: null
Copy Paste: [[2312.02470]] Generator Born from Classifier(http://arxiv.org/abs/2312.02470)
Summary:
In this paper, we make a bold attempt toward an ambitious task: given a pre-trained classifier, we aim to reconstruct an image generator, without relying on any data samples. From a black-box perspective, this challenge seems intractable, since it inevitably involves identifying the inverse function for a classifier, which is, by nature, an information extraction process. As such, we resort to leveraging the knowledge encapsulated within the parameters of the neural network. Grounded on the theory of Maximum-Margin Bias of gradient descent, we propose a novel learning paradigm, in which the generator is trained to ensure that the convergence conditions of the network parameters are satisfied over the generated distribution of the samples. Empirical validation from various image generation tasks substantiates the efficacy of our strategy.

Title: NeutronStream: A Dynamic GNN Training Framework with Sliding Window for Graph Streams. (arXiv:2312.02473v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02473
Code URL: null
Copy Paste: [[2312.02473]] NeutronStream: A Dynamic GNN Training Framework with Sliding Window for Graph Streams(http://arxiv.org/abs/2312.02473)
Summary:
Existing Graph Neural Network (GNN) training frameworks have been designed to help developers easily create performant GNN implementations. However, most existing GNN frameworks assume that the input graphs are static and ignore that most real-world graphs are constantly evolving. Though many dynamic GNN models have emerged to learn from evolving graphs, the training process of these dynamic GNNs is dramatically different from traditional GNNs in that it captures both the spatial and temporal dependencies of graph updates. This poses new challenges for designing dynamic GNN training frameworks. First, the traditional batched training method fails to capture real-time structural evolution information. Second, the time-dependent nature makes parallel training hard to design. Third, existing frameworks lack system support for users to efficiently implement dynamic GNNs. In this paper, we present NeutronStream, a framework for training dynamic GNN models. NeutronStream abstracts the input dynamic graph into a chronologically updated stream of events and processes the stream with an optimized sliding window to incrementally capture the spatial-temporal dependencies of events. Furthermore, NeutronStream provides a parallel execution engine to tackle the sequential event processing challenge to achieve high performance. NeutronStream also integrates a built-in graph storage structure that supports dynamic updates and provides a set of easy-to-use APIs that allow users to express their dynamic GNNs. Our experimental results demonstrate that, compared to state-of-the-art dynamic GNN implementations, NeutronStream achieves speedups ranging from 1.48X to 5.87X and an average accuracy improvement of 3.97%.

Title: Pseudo Replay-based Class Continual Learning for Online New Category Anomaly Detection in Additive Manufacturing. (arXiv:2312.02491v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02491
Code URL: null
Copy Paste: [[2312.02491]] Pseudo Replay-based Class Continual Learning for Online New Category Anomaly Detection in Additive Manufacturing(http://arxiv.org/abs/2312.02491)
Summary:
The incorporation of advanced sensors and machine learning techniques has enabled modern manufacturing enterprises to perform data-driven in-situ quality monitoring based on the sensor data collected in manufacturing processes. However, one critical challenge is that new defect categories may appear as the manufacturing process continues, degrading the monitoring performance of previously trained machine learning models. Hence, there is an increasing need to empower machine learning models to learn continually. Among all continual learning methods, memory-based continual learning has the best performance but is constrained by data storage capacity. To address this issue, this paper develops a novel pseudo replay-based continual learning framework by integrating class incremental learning and oversampling-based data generation. Without storing all the data, the developed framework can generate high-quality data representing previous classes to train a machine learning model incrementally when a new anomaly category occurs. In addition, it can even enhance the monitoring performance, since it also effectively improves the data quality. The effectiveness of the proposed framework is validated in an additive manufacturing process, where anomaly detection is formulated as a supervised classification problem. The experimental results show that the developed method is very promising in detecting novel anomalies while maintaining good performance on the previous task, and that it allows more flexibility in model architecture.

Title: A Self-Commissioning Edge Computing Method for Data-Driven Anomaly Detection in Power Electronic Systems. (arXiv:2312.02661v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02661
Code URL: null
Copy Paste: [[2312.02661]] A Self-Commissioning Edge Computing Method for Data-Driven Anomaly Detection in Power Electronic Systems(http://arxiv.org/abs/2312.02661)
Summary:
Ensuring the reliability of power electronic converters is a matter of great importance, and data-driven condition monitoring techniques are cementing themselves as an important tool for this purpose. However, translating methods that work well in controlled lab environments to field applications presents significant challenges, notably because of the limited diversity and accuracy of the lab training data. By enabling the use of field data, online machine learning can be a powerful tool to overcome this problem, but it introduces additional challenges in ensuring the stability and predictability of the training processes. This work presents an edge computing method that mitigates these shortcomings with minimal additional memory usage, by employing an autonomous algorithm that prioritizes the storage of training samples with larger prediction errors. The method is demonstrated on the use case of a self-commissioning condition monitoring system, in the form of a thermal anomaly detection scheme for a variable frequency motor drive, where the algorithm self-learned to distinguish normal and anomalous operation with minimal prior knowledge. The obtained results, based on experimental data, show a significant improvement in prediction accuracy and training speed, when compared to equivalent models trained online without the proposed data selection process.
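
Illustrative note: one simple way to keep only the training samples with the largest prediction errors in a fixed amount of memory is a bounded min-heap. The sketch below is a generic illustration under that assumption, not the paper's algorithm. The class name `ErrorPrioritizedBuffer` and its methods are hypothetical.

```python
import heapq
from itertools import count
from typing import Any, List


class ErrorPrioritizedBuffer:
    """Fixed-capacity store that retains the samples with the largest errors."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._heap: list = []      # min-heap of (error, tiebreak, sample)
        self._tiebreak = count()   # unique counter so samples are never compared

    def offer(self, sample: Any, error: float) -> bool:
        """Try to store a sample; return True if it was kept."""
        entry = (error, next(self._tiebreak), sample)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
            return True
        # Buffer is full: replace the smallest-error sample only if the
        # new sample has a strictly larger error.
        if error > self._heap[0][0]:
            heapq.heapreplace(self._heap, entry)
            return True
        return False

    def samples(self) -> List[Any]:
        """Return stored samples, largest error first."""
        return [sample for _, _, sample in sorted(self._heap, reverse=True)]


buf = ErrorPrioritizedBuffer(capacity=2)
kept = [buf.offer("a", 0.1), buf.offer("b", 0.5), buf.offer("c", 0.3), buf.offer("d", 0.05)]
```

With a capacity of 2, sample "a" is evicted when "c" arrives and "d" is rejected, so memory use stays constant while the hardest examples remain available for retraining.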

Title: Semi-Supervised Health Index Monitoring with Feature Generation and Fusion. (arXiv:2312.02867v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02867
Code URL: null
Copy Paste: [[2312.02867]] Semi-Supervised Health Index Monitoring with Feature Generation and Fusion(http://arxiv.org/abs/2312.02867)
Summary:
The Health Index (HI) is crucial for evaluating system health, aiding tasks like anomaly detection and predicting remaining useful life for systems demanding high safety and reliability. Tight monitoring is crucial for achieving high precision at a lower cost, with applications such as spray coating. Obtaining HI labels in real-world applications is often cost-prohibitive, requiring continuous, precise health measurements. Therefore, it is more convenient to leverage run-to-failure datasets that may provide potential indications of machine wear condition, making it necessary to apply semi-supervised tools for HI construction. In this study, we adapt the Deep Semi-supervised Anomaly Detection (DeepSAD) method for HI construction. We use the DeepSAD embedding as condition indicators to address interpretability challenges and sensitivity to system-specific factors. Then, we introduce a diversity loss to enrich the condition indicators. We employ an alternating projection algorithm with isotonic constraints to transform the DeepSAD embedding into a normalized HI with an increasing trend. Validation on the PHME 2010 milling dataset, a recognized benchmark with ground truth HIs, demonstrates meaningful HI estimations. Our methodology is then applied to monitor wear states of thermal spray coatings using high-frequency voltage. Our contributions create opportunities for more accessible and reliable HI estimation, particularly in cases where obtaining ground truth HI labels is unfeasible.
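
Illustrative note: the isotonic constraint mentioned above means the resulting HI must never decrease over time. The sketch below shows only that single ingredient on a one-dimensional signal, using the standard pool-adjacent-violators procedure followed by min-max scaling. It is a simplification and not the paper's alternating projection algorithm. The function names `isotonic_fit` and `normalized_health_index` are hypothetical.

```python
from typing import List, Sequence


def isotonic_fit(values: Sequence[float]) -> List[float]:
    """Least-squares non-decreasing fit via pool-adjacent-violators."""
    blocks: List[List[float]] = []  # each block is [sum, count]
    for v in values:
        blocks.append([float(v), 1])
        # Merge blocks while the previous block's mean exceeds the current one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted: List[float] = []
    for s, c in blocks:
        fitted.extend([s / c] * int(c))
    return fitted


def normalized_health_index(values: Sequence[float]) -> List[float]:
    """Monotone fit rescaled to [0, 1]; a constant signal maps to all zeros."""
    fitted = isotonic_fit(values)
    lo, hi = fitted[0], fitted[-1]  # the fit is sorted, so these are min and max
    if hi == lo:
        return [0.0] * len(fitted)
    return [(x - lo) / (hi - lo) for x in fitted]


raw = [1.0, 3.0, 2.0, 4.0]
hi_curve = normalized_health_index(raw)
```

Here the out-of-order pair (3.0, 2.0) is pooled to its mean of 2.5, which gives a non-decreasing curve before rescaling.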

multi-run

chain-of-thought

Title: Training Chain-of-Thought via Latent-Variable Inference. (arXiv:2312.02179v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02179
Code URL: null
Copy Paste: [[2312.02179]] Training Chain-of-Thought via Latent-Variable Inference(http://arxiv.org/abs/2312.02179)
Summary:
Large language models (LLMs) solve problems more accurately and interpretably when instructed to work out the answer step by step using a "chain-of-thought" (CoT) prompt. One can also improve LLMs' performance on a specific task by supervised fine-tuning, i.e., by using gradient ascent on some tunable parameters to maximize the average log-likelihood of correct answers from a labeled training set. Naively combining CoT with supervised tuning requires supervision not just of the correct answers, but also of detailed rationales that lead to those answers; these rationales are expensive to produce by hand. Instead, we propose a fine-tuning strategy that tries to maximize the marginal log-likelihood of generating a correct answer using CoT prompting, approximately averaging over all possible rationales. The core challenge is sampling from the posterior over rationales conditioned on the correct answer; we address it using a simple Markov-chain Monte Carlo (MCMC) expectation-maximization (EM) algorithm inspired by the self-taught reasoner (STaR), memoized wake-sleep, Markovian score climbing, and persistent contrastive divergence. This algorithm also admits a novel control-variate technique that drives the variance of our gradient estimates to zero as the model improves. Applying our technique to GSM8K and the tasks in BIG-Bench Hard, we find that this MCMC-EM fine-tuning technique typically improves the model's accuracy on held-out examples more than STaR or prompt-tuning with or without CoT.

language model

Title: Axiomatic Preference Modeling for Longform Question Answering. (arXiv:2312.02206v1 [cs.AI])

Title: An Evaluation Framework for Mapping News Headlines to Event Classes in a Knowledge Graph. (arXiv:2312.02334v1 [cs.CL])

Title: Visually Grounded Language Learning: a review of language games, datasets, tasks, and models. (arXiv:2312.02431v1 [cs.CL])

Title: MUFFIN: Curating Multi-Faceted Instructions for Improving Instruction-Following. (arXiv:2312.02436v1 [cs.CL])

Title: Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation. (arXiv:2312.02439v1 [cs.AI])

Title: ASPEN: High-Throughput LoRA Fine-Tuning of Large Language Models with a Single GPU. (arXiv:2312.02515v1 [cs.LG])

Title: Creative Agents: Empowering Agents with Imagination for Creative Tasks. (arXiv:2312.02519v1 [cs.AI])

Title: Impact of Tokenization on LLaMa Russian Adaptation. (arXiv:2312.02598v1 [cs.CL])

Title: Large Knowledge Model: Perspectives and Challenges. (arXiv:2312.02706v1 [cs.AI])

Title: Toward autocorrection of chemical process flowsheets using large language models. (arXiv:2312.02873v1 [cs.LG])

Title: Revisiting Topic-Guided Language Models. (arXiv:2312.02331v1 [cs.CL])

Title: Measuring Distributional Shifts in Text: The Advantage of Language Model-Based Embeddings. (arXiv:2312.02337v1 [cs.CL])

Title: Efficient Online Data Mixing For Language Model Pre-Training. (arXiv:2312.02406v1 [cs.CL])

Title: ULMA: Unified Language Model Alignment with Demonstration and Point-wise Human Preference. (arXiv:2312.02554v1 [cs.LG])

Title: Towards Measuring Representational Similarity of Large Language Models. (arXiv:2312.02730v1 [cs.LG])

Title: Scaling Laws for Adversarial Attacks on Language Model Activations. (arXiv:2312.02780v1 [cs.LG])

Title: Large Language Models on Graphs: A Comprehensive Survey. (arXiv:2312.02783v1 [cs.CL])

Title: Can We Learn Communication-Efficient Optimizers?. (arXiv:2312.02204v1 [cs.LG])

gpt

Title: MKA: A Scalable Medical Knowledge Assisted Mechanism for Generative Models on Medical Conversation Tasks. (arXiv:2312.02496v1 [cs.CL])

llm

Title: JarviX: A LLM No code Platform for Tabular Data Analysis and Optimization. (arXiv:2312.02213v1 [cs.LG])

Title: LLMs Accelerate Annotation for Medical Information Extraction. (arXiv:2312.02296v1 [cs.CL])

Title: When is Offline Policy Selection Sample Efficient for Reinforcement Learning?. (arXiv:2312.02355v1 [cs.LG])

Title: New Evaluation Metrics Capture Quality Degradation due to LLM Watermarking. (arXiv:2312.02382v1 [cs.CL])

Title: MedDM:LLM-executable clinical guidance tree for clinical decision-making. (arXiv:2312.02441v1 [cs.CL])

Title: Weakly Supervised Detection of Hallucinations in LLM Activations. (arXiv:2312.02798v1 [cs.LG])

long context

lora

Title: AdsorbRL: Deep Multi-Objective Reinforcement Learning for Inverse Catalysts Design. (arXiv:2312.02308v1 [cs.LG])

Title: Learning Energy-based Model via Dual-MCMC Teaching. (arXiv:2312.02469v1 [cs.LG])

hallucination

Title: Compositional Generalization for Data-to-Text Generation. (arXiv:2312.02748v1 [cs.CL])

prompt

Title: Prompt Optimization via Adversarial In-Context Learning. (arXiv:2312.02614v1 [cs.LG])

code

Title: A Simple and Scalable Representation for Graph Generation. (arXiv:2312.02230v1 [cs.LG])

Title: Visual Encoders for Data-Efficient Imitation Learning in Modern Video Games. (arXiv:2312.02312v1 [cs.LG])

Title: GNN2R: Weakly-Supervised Rationale-Providing Question Answering over Knowledge Graphs. (arXiv:2312.02317v1 [cs.CL])

Title: Expressive Sign Equivariant Networks for Spectral Geometric Learning. (arXiv:2312.02339v1 [cs.LG])

Title: BEDD: The MineRL BASALT Evaluation and Demonstrations Dataset for Training and Benchmarking Agents that Solve Fuzzy Tasks. (arXiv:2312.02405v1 [cs.AI])

Title: Decoding Data Quality via Synthetic Corruptions: Embedding-guided Pruning of Code Data. (arXiv:2312.02418v1 [cs.CL])

Title: Structured World Representations in Maze-Solving Transformers. (arXiv:2312.02566v1 [cs.LG])

Title: On the Initialization of Graph Neural Networks. (arXiv:2312.02622v1 [cs.LG])

Title: H-GAP: Humanoid Control with a Generalist Planner. (arXiv:2312.02682v1 [cs.LG])

Title: Clustering Pseudo Language Family in Multilingual Translation Models with Fisher Information Matrix. (arXiv:2312.02820v1 [cs.CL])

Title: MIMONets: Multiple-Input-Multiple-Output Neural Networks Exploiting Computation in Superposition. (arXiv:2312.02829v1 [cs.LG])

Title: FaultFormer: Transformer-based Prediction of Bearing Faults. (arXiv:2312.02380v1 [cs.LG])

Title: Robust Clustering using Hyperdimensional Computing. (arXiv:2312.02407v1 [cs.LG])

Title: Dimensionality Reduction and Dynamical Mode Recognition of Circular Arrays of Flame Oscillators Using Deep Neural Network. (arXiv:2312.02462v1 [cs.LG])

Title: Constrained Twin Variational Auto-Encoder for Intrusion Detection in IoT Systems. (arXiv:2312.02490v1 [cs.LG])

Title: Rethinking and Simplifying Bootstrapped Graph Latents. (arXiv:2312.02619v1 [cs.LG])

chat

Title: How Generative-AI can be Effectively used in Government Chatbots. (arXiv:2312.02181v1 [cs.CL])

retrieval augmented generation

rag

Title: Low-Precision Mixed-Computation Models for Inference on Edge. (arXiv:2312.02210v1 [cs.LG])

Title: Rethinking Adversarial Training with Neural Tangent Kernel. (arXiv:2312.02236v1 [cs.LG])

Title: Towards Fast and Stable Federated Learning: Confronting Heterogeneity via Knowledge Anchor. (arXiv:2312.02416v1 [cs.LG])

Title: MASP: Scalable GNN-based Planning for Multi-Agent Navigation. (arXiv:2312.02522v1 [cs.LG])

Title: MEMTO: Memory-guided Transformer for Multivariate Time Series Anomaly Detection. (arXiv:2312.02530v1 [cs.LG])

Title: Towards the Inferrence of Structural Similarity of Combinatorial Landscapes. (arXiv:2312.02720v1 [cs.LG])

Title: Leveraging Domain Adaptation and Data Augmentation to Improve Qur'anic IR in English and Arabic. (arXiv:2312.02803v1 [cs.CL])

Title: Calibrated Adaptive Teacher for Domain Adaptive Intelligent Fault Diagnosis. (arXiv:2312.02826v1 [cs.LG])

Title: FlowHON: Representing Flow Fields Using Higher-Order Networks. (arXiv:2312.02243v1 [cs.LG])

Title: FLea: Improving federated learning on scarce and label-skewed data via privacy-preserving feature augmentation. (arXiv:2312.02327v1 [cs.LG])

Title: Adaptive Instrument Design for Indirect Experiments. (arXiv:2312.02438v1 [cs.LG])

Title: Generator Born from Classifier. (arXiv:2312.02470v1 [cs.LG])

Title: NeutronStream: A Dynamic GNN Training Framework with Sliding Window for Graph Streams. (arXiv:2312.02473v1 [cs.LG])

Title: Pseudo Replay-based Class Continual Learning for Online New Category Anomaly Detection in Additive Manufacturing. (arXiv:2312.02491v1 [cs.LG])

Title: A Self-Commissioning Edge Computing Method for Data-Driven Anomaly Detection in Power Electronic Systems. (arXiv:2312.02661v1 [cs.LG])

Title: Semi-Supervised Health Index Monitoring with Feature Generation and Fusion. (arXiv:2312.02867v1 [cs.LG])

multi-run

chain-of-thought

Title: Training Chain-of-Thought via Latent-Variable Inference. (arXiv:2312.02179v1 [cs.LG])

tree-of-thought