2024-01-05

language model

Title: Generalist embedding models are better at short-context clinical semantic search than specialized embedding models. (arXiv:2401.01943v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.01943
Code URL: https://github.com/kaduceo/icd10cm_embedding_benchmark
Copy Paste: [[2401.01943]] Generalist embedding models are better at short-context clinical semantic search than specialized embedding models(http://arxiv.org/abs/2401.01943)
Summary:
The increasing use of tools and solutions based on Large Language Models (LLMs) for various tasks in the medical domain has become a prominent trend. Their use in this highly critical and sensitive domain has thus raised important questions about their robustness, especially in response to variations in input, and the reliability of the generated outputs. This study addresses these questions by constructing a textual dataset based on the ICD-10-CM code descriptions, widely used in US hospitals and containing many clinical terms, and their easily reproducible rephrasing. We then benchmarked existing embedding models, either generalist or specialized in the clinical domain, in a semantic search task where the goal was to correctly match the rephrased text to the original description. Our results showed that generalist models performed better than clinical models, suggesting that existing clinical specialized models are more sensitive to small changes in input that confuse them. The highlighted problem of specialized models may be due to the fact that they have not been trained on sufficient data, and in particular on datasets that are not diverse enough to have a reliable global language understanding, which is still necessary for accurate handling of medical documents.

Title: A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity. (arXiv:2401.01967v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.01967
Code URL: https://github.com/ajyl/dpo_toxic
Copy Paste: [[2401.01967]] A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity(http://arxiv.org/abs/2401.01967)
Summary:
While alignment algorithms are now commonly used to tune pre-trained language models towards a user's preferences, we lack explanations for the underlying mechanisms in which models become ``aligned'', thus making it difficult to explain phenomena like jailbreaks. In this work we study a popular algorithm, direct preference optimization (DPO), and the mechanisms by which it reduces toxicity. Namely, we first study how toxicity is represented and elicited in a pre-trained language model, GPT2-medium. We then apply DPO with a carefully crafted pairwise dataset to reduce toxicity. We examine how the resulting model averts toxic outputs, and find that capabilities learned from pre-training are not removed, but rather bypassed. We use this insight to demonstrate a simple method to un-align the model, reverting it back to its toxic behavior.

Title: Revisiting Zero-Shot Abstractive Summarization in the Era of Large Language Models from the Perspective of Position Bias. (arXiv:2401.01989v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.01989
Code URL: null
Copy Paste: [[2401.01989]] Revisiting Zero-Shot Abstractive Summarization in the Era of Large Language Models from the Perspective of Position Bias(http://arxiv.org/abs/2401.01989)
Summary:
We characterize and study zero-shot abstractive summarization in Large Language Models (LLMs) by measuring position bias, which we propose as a general formulation of the more restrictive lead bias phenomenon studied previously in the literature. Position bias captures the tendency of a model unfairly prioritizing information from certain parts of the input text over others, leading to undesirable behavior. Through numerous experiments on four diverse real-world datasets, we study position bias in multiple LLM models such as GPT 3.5-Turbo, Llama-2, and Dolly-v2, as well as state-of-the-art pretrained encoder-decoder abstractive summarization models such as Pegasus and BART. Our findings lead to novel insights and discussion on performance and position bias of models for zero-shot summarization tasks.

Title: Self-Contrast: Better Reflection Through Inconsistent Solving Perspectives. (arXiv:2401.02009v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.02009
Code URL: null
Copy Paste: [[2401.02009]] Self-Contrast: Better Reflection Through Inconsistent Solving Perspectives(http://arxiv.org/abs/2401.02009)
Summary:
The reflection capacity of Large Language Model (LLM) has garnered extensive attention. A post-hoc prompting strategy, e.g., reflexion and self-refine, refines LLM's response based on self-evaluated or external feedback. However, recent research indicates without external feedback, LLM's intrinsic reflection is unstable. Our investigation unveils that the key bottleneck is the quality of the self-evaluated feedback. We find LLMs often exhibit overconfidence or high randomness when self-evaluate, offering stubborn or inconsistent feedback, which causes poor reflection. To remedy this, we advocate Self-Contrast: It adaptively explores diverse solving perspectives tailored to the request, contrasts the differences, and summarizes these discrepancies into a checklist which could be used to re-examine and eliminate discrepancies. Our method endows LLM with diverse perspectives to alleviate stubborn biases. Moreover, their discrepancies indicate potential errors or inherent uncertainties that LLM often overlooks. Reflecting upon these can catalyze more accurate and stable reflection. Experiments conducted on a series of reasoning and translation tasks with different LLMs serve to underscore the effectiveness and generality of our strategy.

Title: DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and Improvement of Large Language Models. (arXiv:2401.02132v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.02132
Code URL: null
Copy Paste: [[2401.02132]] DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and Improvement of Large Language Models(http://arxiv.org/abs/2401.02132)
Summary:
Evaluating the quality and variability of text generated by Large Language Models (LLMs) poses a significant, yet unresolved research challenge. Traditional evaluation methods, such as ROUGE and BERTScore, which measure token similarity, often fail to capture the holistic semantic equivalence. This results in a low correlation with human judgments and intuition, which is especially problematic in high-stakes applications like healthcare and finance where reliability, safety, and robust decision-making are highly critical. This work proposes DCR, an automated framework for evaluating and improving the consistency of LLM-generated texts using a divide-conquer-reasoning approach. Unlike existing LLM-based evaluators that operate at the paragraph level, our method employs a divide-and-conquer evaluator (DCE) that breaks down the paragraph-to-paragraph comparison between two generated responses into individual sentence-to-paragraph comparisons, each evaluated based on predefined criteria. To facilitate this approach, we introduce an automatic metric converter (AMC) that translates the output from DCE into an interpretable numeric score. Beyond the consistency evaluation, we further present a reason-assisted improver (RAI) that leverages the analytical reasons with explanations identified by DCE to generate new responses aimed at reducing these inconsistencies. Through comprehensive and systematic empirical analysis, we show that our approach outperforms state-of-the-art methods by a large margin (e.g., +19.3% and +24.3% on the SummEval dataset) in evaluating the consistency of LLM generation across multiple benchmarks in semantic, factual, and summarization consistency tasks. Our approach also substantially reduces nearly 90% of output inconsistencies, showing promise for effective hallucination mitigation.

Title: TinyLlama: An Open-Source Small Language Model. (arXiv:2401.02385v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.02385
Code URL: https://github.com/jzhang38/tinyllama
Copy Paste: [[2401.02385]] TinyLlama: An Open-Source Small Language Model(http://arxiv.org/abs/2401.02385)
Summary:
We present TinyLlama, a compact 1.1B language model pretrained on around 1 trillion tokens for approximately 3 epochs. Building on the architecture and tokenizer of Llama 2, TinyLlama leverages various advances contributed by the open-source community (e.g., FlashAttention), achieving better computational efficiency. Despite its relatively small size, TinyLlama demonstrates remarkable performance in a series of downstream tasks. It significantly outperforms existing open-source language models with comparable sizes. Our model checkpoints and code are publicly available on GitHub at https://github.com/jzhang38/TinyLlama.

Title: ICE-GRT: Instruction Context Enhancement by Generative Reinforcement based Transformers. (arXiv:2401.02072v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.02072
Code URL: null
Copy Paste: [[2401.02072]] ICE-GRT: Instruction Context Enhancement by Generative Reinforcement based Transformers(http://arxiv.org/abs/2401.02072)
Summary:
The emergence of Large Language Models (LLMs) such as ChatGPT and LLaMA encounter limitations in domain-specific tasks, with these models often lacking depth and accuracy in specialized areas, and exhibiting a decrease in general capabilities when fine-tuned, particularly analysis ability in small sized models. To address these gaps, we introduce ICE-GRT, utilizing Reinforcement Learning from Human Feedback (RLHF) grounded in Proximal Policy Optimization (PPO), demonstrating remarkable ability in in-domain scenarios without compromising general task performance. Our exploration of ICE-GRT highlights its understanding and reasoning ability to not only generate robust answers but also to provide detailed analyses of the reasons behind the answer. This capability marks a significant progression beyond the scope of Supervised Fine-Tuning models. The success of ICE-GRT is dependent on several crucial factors, including Appropriate Data, Reward Size Scaling, KL-Control, Advantage Normalization, etc. The ICE-GRT model exhibits state-of-the-art performance in domain-specific tasks and across 12 general Language tasks against equivalent size and even larger size LLMs, highlighting the effectiveness of our approach. We provide a comprehensive analysis of the ICE-GRT, underscoring the significant advancements it brings to the field of LLM.

Title: DIALIGHT: Lightweight Multilingual Development and Evaluation of Task-Oriented Dialogue Systems with Large Language Models. (arXiv:2401.02208v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.02208
Code URL: null
Copy Paste: [[2401.02208]] DIALIGHT: Lightweight Multilingual Development and Evaluation of Task-Oriented Dialogue Systems with Large Language Models(http://arxiv.org/abs/2401.02208)
Summary:
We present DIALIGHT, a toolkit for developing and evaluating multilingual Task-Oriented Dialogue (ToD) systems which facilitates systematic evaluations and comparisons between ToD systems using fine-tuning of Pretrained Language Models (PLMs) and those utilising the zero-shot and in-context learning capabilities of Large Language Models (LLMs). In addition to automatic evaluation, this toolkit features (i) a secure, user-friendly web interface for fine-grained human evaluation at both local utterance level and global dialogue level, and (ii) a microservice-based backend, improving efficiency and scalability. Our evaluations reveal that while PLM fine-tuning leads to higher accuracy and coherence, LLM-based systems excel in producing diverse and likeable responses. However, we also identify significant challenges of LLMs in adherence to task-specific instructions and generating outputs in multiple languages, highlighting areas for future research. We hope this open-sourced toolkit will serve as a valuable resource for researchers aiming to develop and properly evaluate multilingual ToD systems and will lower, currently still high, entry barriers in the field.

Title: Beyond Extraction: Contextualising Tabular Data for Efficient Summarisation by Language Models. (arXiv:2401.02333v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.02333
Code URL: null
Copy Paste: [[2401.02333]] Beyond Extraction: Contextualising Tabular Data for Efficient Summarisation by Language Models(http://arxiv.org/abs/2401.02333)
Summary:
The conventional use of the Retrieval-Augmented Generation (RAG) architecture has proven effective for retrieving information from diverse documents. However, challenges arise in handling complex table queries, especially within PDF documents containing intricate tabular structures.This research introduces an innovative approach to enhance the accuracy of complex table queries in RAG-based systems. Our methodology involves storing PDFs in the retrieval database and extracting tabular content separately. The extracted tables undergo a process of context enrichment, concatenating headers with corresponding values. To ensure a comprehensive understanding of the enriched data, we employ a fine-tuned version of the Llama-2-chat language model for summarisation within the RAG architecture. Furthermore, we augment the tabular data with contextual sense using the ChatGPT 3.5 API through a one-shot prompt. This enriched data is then fed into the retrieval database alongside other PDFs. Our approach aims to significantly improve the precision of complex table queries, offering a promising solution to a longstanding challenge in information retrieval.

Title: LLaMA Pro: Progressive LLaMA with Block Expansion. (arXiv:2401.02415v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.02415
Code URL: https://github.com/tencentarc/llama-pro
Copy Paste: [[2401.02415]] LLaMA Pro: Progressive LLaMA with Block Expansion(http://arxiv.org/abs/2401.02415)
Summary:
Humans generally acquire new skills without compromising the old; however, the opposite holds for Large Language Models (LLMs), e.g., from LLaMA to CodeLLaMA. To this end, we propose a new post-pretraining method for LLMs with an expansion of Transformer blocks. We tune the expanded blocks using only new corpus, efficiently and effectively improving the model's knowledge without catastrophic forgetting. In this paper, we experiment on the corpus of code and math, yielding LLaMA Pro-8.3B, a versatile foundation model initialized from LLaMA2-7B, excelling in general tasks, programming, and mathematics. LLaMA Pro and its instruction-following counterpart (LLaMA Pro-Instruct) achieve advanced performance among various benchmarks, demonstrating superiority over existing open models in the LLaMA family and the immense potential of reasoning and addressing diverse tasks as an intelligent agent. Our findings provide valuable insights into integrating natural and programming languages, laying a solid foundation for developing advanced language agents that operate effectively in various environments.

gpt

Title: Text2MDT: Extracting Medical Decision Trees from Medical Texts. (arXiv:2401.02034v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.02034
Code URL: null
Copy Paste: [[2401.02034]] Text2MDT: Extracting Medical Decision Trees from Medical Texts(http://arxiv.org/abs/2401.02034)
Summary:
Knowledge of the medical decision process, which can be modeled as medical decision trees (MDTs), is critical to build clinical decision support systems. However, the current MDT construction methods rely heavily on time-consuming and laborious manual annotation. In this work, we propose a novel task, Text2MDT, to explore the automatic extraction of MDTs from medical texts such as medical guidelines and textbooks. We normalize the form of the MDT and create an annotated Text-to-MDT dataset in Chinese with the participation of medical experts. We investigate two different methods for the Text2MDT tasks: (a) an end-to-end framework which only relies on a GPT style large language models (LLM) instruction tuning to generate all the node information and tree structures. (b) The pipeline framework which decomposes the Text2MDT task to three subtasks. Experiments on our Text2MDT dataset demonstrate that: (a) the end-to-end method basd on LLMs (7B parameters or larger) show promising results, and successfully outperform the pipeline methods. (b) The chain-of-thought (COT) prompting method \cite{Wei2022ChainOT} can improve the performance of the fine-tuned LLMs on the Text2MDT test set. (c) the lightweight pipelined method based on encoder-based pretrained models can perform comparably with LLMs with model complexity two magnititudes smaller. Our Text2MDT dataset is open-sourced at \url{https://tianchi.aliyun.com/dataset/95414}, and the source codes are open-sourced at \url{https://github.com/michael-wzhu/text2dt}.

Title: Re-evaluating the Memory-balanced Pipeline Parallelism: BPipe. (arXiv:2401.02088v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.02088
Code URL: null
Copy Paste: [[2401.02088]] Re-evaluating the Memory-balanced Pipeline Parallelism: BPipe(http://arxiv.org/abs/2401.02088)
Summary:
Pipeline parallelism is an essential technique in the training of large-scale Transformer models. However, it suffers from imbalanced memory consumption, leading to insufficient memory utilization. The BPipe technique was proposed to address this issue and has proven effective in the GPT-3 model. Nevertheless, our experiments have not yielded similar benefits for LLaMA training. Additionally, BPipe only yields negligible benefits for GPT-3 training when applying flash attention. We analyze the underlying causes of the divergent performance of BPipe on GPT-3 and LLaMA. Furthermore, we introduce a novel method to estimate the performance of BPipe.

Title: Exploring Boundary of GPT-4V on Marine Analysis: A Preliminary Case Study. (arXiv:2401.02147v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.02147
Code URL: null
Copy Paste: [[2401.02147]] Exploring Boundary of GPT-4V on Marine Analysis: A Preliminary Case Study(http://arxiv.org/abs/2401.02147)
Summary:
Large language models (LLMs) have demonstrated a powerful ability to answer various queries as a general-purpose assistant. The continuous multi-modal large language models (MLLM) empower LLMs with the ability to perceive visual signals. The launch of GPT-4 (Generative Pre-trained Transformers) has generated significant interest in the research communities. GPT-4V(ison) has demonstrated significant power in both academia and industry fields, as a focal point in a new artificial intelligence generation. Though significant success was achieved by GPT-4V, exploring MLLMs in domain-specific analysis (e.g., marine analysis) that required domain-specific knowledge and expertise has gained less attention. In this study, we carry out the preliminary and comprehensive case study of utilizing GPT-4V for marine analysis. This report conducts a systematic evaluation of existing GPT-4V, assessing the performance of GPT-4V on marine research and also setting a new standard for future developments in MLLMs. The experimental results of GPT-4V show that the responses generated by GPT-4V are still far away from satisfying the domain-specific requirements of the marine professions. All images and prompts used in this study will be available at https://github.com/hkust-vgd/Marine_GPT-4V_Eval

llm

Title: LLM Augmented LLMs: Expanding Capabilities through Composition. (arXiv:2401.02412v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.02412
Code URL: null
Copy Paste: [[2401.02412]] LLM Augmented LLMs: Expanding Capabilities through Composition(http://arxiv.org/abs/2401.02412)
Summary:
Foundational models with billions of parameters which have been trained on large corpora of data have demonstrated non-trivial skills in a variety of domains. However, due to their monolithic structure, it is challenging and expensive to augment them or impart new skills. On the other hand, due to their adaptation abilities, several new instances of these models are being trained towards new domains and tasks. In this work, we study the problem of efficient and practical composition of existing foundation models with more specific models to enable newer capabilities. To this end, we propose CALM -- Composition to Augment Language Models -- which introduces cross-attention between models to compose their representations and enable new capabilities. Salient features of CALM are: (i) Scales up LLMs on new tasks by 're-using' existing LLMs along with a few additional parameters and data, (ii) Existing model weights are kept intact, and hence preserves existing capabilities, and (iii) Applies to diverse domains and settings. We illustrate that augmenting PaLM2-S with a smaller model trained on low-resource languages results in an absolute improvement of up to 13\% on tasks like translation into English and arithmetic reasoning for low-resource languages. Similarly, when PaLM2-S is augmented with a code-specific model, we see a relative improvement of 40\% over the base model for code generation and explanation tasks -- on-par with fully fine-tuned counterparts.

Title: Understanding LLMs: A Comprehensive Overview from Training to Inference. (arXiv:2401.02038v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.02038
Code URL: null
Copy Paste: [[2401.02038]] Understanding LLMs: A Comprehensive Overview from Training to Inference(http://arxiv.org/abs/2401.02038)
Summary:
The introduction of ChatGPT has led to a significant increase in the utilization of Large Language Models (LLMs) for addressing downstream tasks. There's an increasing focus on cost-efficient training and deployment within this context. Low-cost training and deployment of LLMs represent the future development trend. This paper reviews the evolution of large language model training techniques and inference deployment technologies aligned with this emerging trend. The discussion on training includes various aspects, including data preprocessing, training architecture, pre-training tasks, parallel training, and relevant content related to model fine-tuning. On the inference side, the paper covers topics such as model compression, parallel computation, memory scheduling, and structural optimization. It also explores LLMs' utilization and provides insights into their future development.

Title: Using LLM to select the right SQL Query from candidates. (arXiv:2401.02115v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.02115
Code URL: null
Copy Paste: [[2401.02115]] Using LLM to select the right SQL Query from candidates(http://arxiv.org/abs/2401.02115)
Summary:
Text-to-SQL models can generate a list of candidate SQL queries, and the best query is often in the candidate list, but not at the top of the list. An effective re-rank method can select the right SQL query from the candidate list and improve the model's performance. Previous studies on code generation automatically generate test cases and use them to re-rank candidate codes. However, automatic test case generation for text-to-SQL is an understudied field. We propose an automatic test case generation method that first generates a database and then uses LLMs to predict the ground truth, which is the expected execution results of the ground truth SQL query on this database. To reduce the difficulty for LLMs to predict, we conduct experiments to search for ways to generate easy databases for LLMs and design easy-to-understand prompts. Based on our test case generation method, we propose a re-rank method to select the right SQL query from the candidate list. Given a candidate list, our method can generate test cases and re-rank the candidate list according to their pass numbers on these test cases and their generation probabilities. The experiment results on the validation dataset of Spider show that the performance of some state-of-the-art models can get a 3.6\% improvement after applying our re-rank method.

Title: Are LLMs Robust for Spoken Dialogues?. (arXiv:2401.02297v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.02297
Code URL: null
Copy Paste: [[2401.02297]] Are LLMs Robust for Spoken Dialogues?(http://arxiv.org/abs/2401.02297)
Summary:
Large Pre-Trained Language Models have demonstrated state-of-the-art performance in different downstream tasks, including dialogue state tracking and end-to-end response generation. Nevertheless, most of the publicly available datasets and benchmarks on task-oriented dialogues focus on written conversations. Consequently, the robustness of the developed models to spoken interactions is unknown. In this work, we have evaluated the performance of LLMs for spoken task-oriented dialogues on the DSTC11 test sets. Due to the lack of proper spoken dialogue datasets, we have automatically transcribed a development set of spoken dialogues with a state-of-the-art ASR engine. We have characterized the ASR-error types and their distributions and simulated these errors in a large dataset of dialogues. We report the intrinsic (perplexity) and extrinsic (human evaluation) performance of fine-tuned GPT-2 and T5 models in two subtasks of response generation and dialogue state tracking, respectively. The results show that LLMs are not robust to spoken noise by default, however, fine-tuning/training such models on a proper dataset of spoken TODs can result in a more robust performance.

Title: SPEER: Sentence-Level Planning of Long Clinical Summaries via Embedded Entity Retrieval. (arXiv:2401.02369v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.02369
Code URL: null
Copy Paste: [[2401.02369]] SPEER: Sentence-Level Planning of Long Clinical Summaries via Embedded Entity Retrieval(http://arxiv.org/abs/2401.02369)
Summary:
Clinician must write a lengthy summary each time a patient is discharged from the hospital. This task is time-consuming due to the sheer number of unique clinical concepts covered in the admission. Identifying and covering salient entities is vital for the summary to be clinically useful. We fine-tune open-source LLMs (Mistral-7B-Instruct and Zephyr-7B-\b{eta}) on the task and find that they generate incomplete and unfaithful summaries. To increase entity coverage, we train a smaller, encoder-only model to predict salient entities, which are treated as content-plans to guide the LLM. To encourage the LLM to focus on specific mentions in the source notes, we propose SPEER: Sentence-level Planning via Embedded Entity Retrieval. Specifically, we mark each salient entity span with special "{{ }}" boundary tags and instruct the LLM to retrieve marked spans before generating each sentence. Sentence-level planning acts as a form of state tracking in that the model is explicitly recording the entities it uses. We fine-tune Mistral and Zephyr variants on a large-scale, diverse dataset of ~167k in-patient hospital admissions and evaluate on 3 datasets. SPEER shows gains in both coverage and faithfulness metrics over non-guided and guided baselines.

Title: A Robust Quantile Huber Loss With Interpretable Parameter Adjustment In Distributional Reinforcement Learning. (arXiv:2401.02325v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.02325
Code URL: https://github.com/parvin95/generalized-quantile-huber-loss
Copy Paste: [[2401.02325]] A Robust Quantile Huber Loss With Interpretable Parameter Adjustment In Distributional Reinforcement Learning(http://arxiv.org/abs/2401.02325)
Summary:
Distributional Reinforcement Learning (RL) estimates return distribution mainly by learning quantile values via minimizing the quantile Huber loss function, entailing a threshold parameter often selected heuristically or via hyperparameter search, which may not generalize well and can be suboptimal. This paper introduces a generalized quantile Huber loss function derived from Wasserstein distance (WD) calculation between Gaussian distributions, capturing noise in predicted (current) and target (Bellman-updated) quantile values. Compared to the classical quantile Huber loss, this innovative loss function enhances robustness against outliers. Notably, the classical Huber loss function can be seen as an approximation of our proposed loss, enabling parameter adjustment by approximating the amount of noise in the data during the learning process. Empirical tests on Atari games, a common application in distributional RL, and a recent hedging strategy using distributional RL, validate the effectiveness of our proposed loss function and its potential for parameter adjustments in distributional RL.

long context

lora

Title: Trajectory-Oriented Policy Optimization with Sparse Rewards. (arXiv:2401.02225v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.02225
Code URL: null
Copy Paste: [[2401.02225]] Trajectory-Oriented Policy Optimization with Sparse Rewards(http://arxiv.org/abs/2401.02225)
Summary:
Deep reinforcement learning (DRL) remains challenging in tasks with sparse rewards. These sparse rewards often only indicate whether the task is partially or fully completed, meaning that many exploration actions must be performed before the agent obtains useful feedback. Hence, most existing DRL algorithms fail to learn feasible policies within a reasonable time frame. To overcome this problem, we develop an approach that exploits offline demonstration trajectories for faster and more efficient online RL in sparse reward settings. Our key insight is that by regarding offline demonstration trajectories as guidance, instead of imitating them, our method learns a policy whose state-action visitation marginal distribution matches that of offline demonstrations. Specifically, we introduce a novel trajectory distance based on maximum mean discrepancy (MMD) and formulate policy optimization as a distance-constrained optimization problem. Then, we show that this distance-constrained optimization problem can be reduced into a policy-gradient algorithm with shaped rewards learned from offline demonstrations. The proposed algorithm is evaluated on extensive discrete and continuous control tasks with sparse and deceptive rewards. The experimental results indicate that our proposed algorithm is significantly better than the baseline methods regarding diverse exploration and learning the optimal policy.

hallucination

prompt

code

Title: Location Aware Modular Biencoder for Tourism Question Answering. (arXiv:2401.02187v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.02187
Code URL: https://github.com/haonan-li/lamb
Copy Paste: [[2401.02187]] Location Aware Modular Biencoder for Tourism Question Answering(http://arxiv.org/abs/2401.02187)
Summary:
Answering real-world tourism questions that seek Point-of-Interest (POI) recommendations is challenging, as it requires both spatial and non-spatial reasoning, over a large candidate pool. The traditional method of encoding each pair of question and POI becomes inefficient when the number of candidates increases, making it infeasible for real-world applications. To overcome this, we propose treating the QA task as a dense vector retrieval problem, where we encode questions and POIs separately and retrieve the most relevant POIs for a question by utilizing embedding space similarity. We use pretrained language models (PLMs) to encode textual information, and train a location encoder to capture spatial information of POIs. Experiments on a real-world tourism QA dataset demonstrate that our approach is effective, efficient, and outperforms previous methods across all metrics. Enabled by the dense retrieval architecture, we further build a global evaluation baseline, expanding the search space by 20 times compared to previous work. We also explore several factors that impact on the model's performance through follow-up experiments. Our code and model are publicly available at https://github.com/haonan-li/LAMB.

Title: Representation Learning of Multivariate Time Series using Attention and Adversarial Training. (arXiv:2401.01987v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.01987
Code URL: null
Copy Paste: [[2401.01987]] Representation Learning of Multivariate Time Series using Attention and Adversarial Training(http://arxiv.org/abs/2401.01987)
Summary:
A critical factor in trustworthy machine learning is to develop robust representations of the training data. Only under this guarantee methods are legitimate to artificially generate data, for example, to counteract imbalanced datasets or provide counterfactual explanations for blackbox decision-making systems. In recent years, Generative Adversarial Networks (GANs) have shown considerable results in forming stable representations and generating realistic data. While many applications focus on generating image data, less effort has been made in generating time series data, especially multivariate signals. In this work, a Transformer-based autoencoder is proposed that is regularized using an adversarial training scheme to generate artificial multivariate time series signals. The representation is evaluated using t-SNE visualizations, Dynamic Time Warping (DTW) and Entropy scores. Our results indicate that the generated signals exhibit higher similarity to an exemplary dataset than using a convolutional network approach.

Title: SwitchTab: Switched Autoencoders Are Effective Tabular Learners. (arXiv:2401.02013v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.02013
Code URL: null
Copy Paste: [[2401.02013]] SwitchTab: Switched Autoencoders Are Effective Tabular Learners(http://arxiv.org/abs/2401.02013)
Summary:
Self-supervised representation learning methods have achieved significant success in computer vision and natural language processing, where data samples exhibit explicit spatial or semantic dependencies. However, applying these methods to tabular data is challenging due to the less pronounced dependencies among data samples. In this paper, we address this limitation by introducing SwitchTab, a novel self-supervised method specifically designed to capture latent dependencies in tabular data. SwitchTab leverages an asymmetric encoder-decoder framework to decouple mutual and salient features among data pairs, resulting in more representative embeddings. These embeddings, in turn, contribute to better decision boundaries and lead to improved results in downstream tasks. To validate the effectiveness of SwitchTab, we conduct extensive experiments across various domains involving tabular data. The results showcase superior performance in end-to-end prediction tasks with fine-tuning. Moreover, we demonstrate that pre-trained salient embeddings can be utilized as plug-and-play features to enhance the performance of various traditional classification methods (e.g., Logistic Regression, XGBoost, etc.). Lastly, we highlight the capability of SwitchTab to create explainable representations through visualization of decoupled mutual and salient features in the latent space.

Title: Energy based diffusion generator for efficient sampling of Boltzmann distributions. (arXiv:2401.02080v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.02080
Code URL: null
Copy Paste: [[2401.02080]] Energy based diffusion generator for efficient sampling of Boltzmann distributions(http://arxiv.org/abs/2401.02080)
Summary:
We introduce a novel sampler called the energy based diffusion generator for generating samples from arbitrary target distributions. The sampling model employs a structure similar to a variational autoencoder, utilizing a decoder to transform latent variables from a simple distribution into random variables approximating the target distribution, and we design an encoder based on the diffusion model. Leveraging the powerful modeling capacity of the diffusion model for complex distributions, we can obtain an accurate variational estimate of the Kullback-Leibler divergence between the distributions of the generated samples and the target. Moreover, we propose a decoder based on generalized Hamiltonian dynamics to further enhance sampling performance. Through empirical evaluation, we demonstrate the effectiveness of our method across various complex distribution functions, showcasing its superiority compared to existing methods.

chat

retrieval augmented generation

retrieval-augmented generation

rag

Title: Shayona@SMM4H23: COVID-19 Self diagnosis classification using BERT and LightGBM models. (arXiv:2401.02158v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.02158
Code URL: null
Copy Paste: [[2401.02158]] Shayona@SMM4H23: COVID-19 Self diagnosis classification using BERT and LightGBM models(http://arxiv.org/abs/2401.02158)
Summary:
This paper describes approaches and results for shared Task 1 and 4 of SMMH4-23 by Team Shayona. Shared Task-1 was binary classification of english tweets self-reporting a COVID-19 diagnosis, and Shared Task-4 was Binary classification of English Reddit posts self-reporting a social anxiety disorder diagnosis. Our team has achieved the highest f1-score 0.94 in Task-1 among all participants. We have leveraged the Transformer model (BERT) in combination with the LightGBM model for both tasks.

Title: Uncertainty-Aware Deep Attention Recurrent Neural Network for Heterogeneous Time Series Imputation. (arXiv:2401.02258v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.02258
Code URL: null
Copy Paste: [[2401.02258]] Uncertainty-Aware Deep Attention Recurrent Neural Network for Heterogeneous Time Series Imputation(http://arxiv.org/abs/2401.02258)
Summary:
Missingness is ubiquitous in multivariate time series and poses an obstacle to reliable downstream analysis. Although recurrent network imputation achieved the SOTA, existing models do not scale to deep architectures that can potentially alleviate issues arising in complex data. Moreover, imputation carries the risk of biased estimations of the ground truth. Yet, confidence in the imputed values is always unmeasured or computed post hoc from model output. We propose DEep Attention Recurrent Imputation (DEARI), which jointly estimates missing values and their associated uncertainty in heterogeneous multivariate time series. By jointly representing feature-wise correlations and temporal dynamics, we adopt a self attention mechanism, along with an effective residual component, to achieve a deep recurrent neural network with good imputation performance and stable convergence. We also leverage self-supervised metric learning to boost performance by optimizing sample similarity. Finally, we transform DEARI into a Bayesian neural network through a novel Bayesian marginalization strategy to produce stochastic DEARI, which outperforms its deterministic equivalent. Experiments show that DEARI surpasses the SOTA in diverse imputation tasks using real-world datasets, namely air quality control, healthcare and traffic.

Title: PEFT for Speech: Unveiling Optimal Placement, Merging Strategies, and Ensemble Techniques. (arXiv:2401.02122v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.02122
Code URL: null
Copy Paste: [[2401.02122]] PEFT for Speech: Unveiling Optimal Placement, Merging Strategies, and Ensemble Techniques(http://arxiv.org/abs/2401.02122)
Summary:
Parameter-Efficient Fine-Tuning (PEFT) is increasingly recognized as an effective method in speech processing. However, the optimal approach and the placement of PEFT methods remain inconclusive. Our study conducts extensive experiments to compare different PEFT methods and their layer-wise placement adapting Differentiable Architecture Search (DARTS). We also explore the use of ensemble learning to leverage diverse PEFT strategies. The results reveal that DARTS does not outperform the baseline approach, which involves inserting the same PEFT method into all layers of a Self-Supervised Learning (SSL) model. In contrast, an ensemble learning approach, particularly one employing majority voting, demonstrates superior performance. Our statistical evidence indicates that different PEFT methods learn in varied ways. This variation might explain why the synergistic integration of various PEFT methods through ensemble learning can harness their unique learning capabilities more effectively compared to individual layer-wise optimization.

Title: L3Cube-IndicNews: News-based Short Text and Long Document Classification Datasets in Indic Languages. (arXiv:2401.02254v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.02254
Code URL: null
Copy Paste: [[2401.02254]] L3Cube-IndicNews: News-based Short Text and Long Document Classification Datasets in Indic Languages(http://arxiv.org/abs/2401.02254)
Summary:
In this work, we introduce L3Cube-IndicNews, a multilingual text classification corpus aimed at curating a high-quality dataset for Indian regional languages, with a specific focus on news headlines and articles. We have centered our work on 10 prominent Indic languages, including Hindi, Bengali, Marathi, Telugu, Tamil, Gujarati, Kannada, Odia, Malayalam, and Punjabi. Each of these news datasets comprises 10 or more classes of news articles. L3Cube-IndicNews offers 3 distinct datasets tailored to handle different document lengths that are classified as: Short Headlines Classification (SHC) dataset containing the news headline and news category, Long Document Classification (LDC) dataset containing the whole news article and the news category, and Long Paragraph Classification (LPC) containing sub-articles of the news and the news category. We maintain consistent labeling across all 3 datasets for in-depth length-based analysis. We evaluate each of these Indic language datasets using 4 different models including monolingual BERT, multilingual Indic Sentence BERT (IndicSBERT), and IndicBERT. This research contributes significantly to expanding the pool of available text classification datasets and also makes it possible to develop topic classification models for Indian regional languages. This also serves as an excellent resource for cross-lingual analysis owing to the high overlap of labels among languages. The datasets and models are shared publicly at https://github.com/l3cube-pune/indic-nlp

Title: Beyond Regrets: Geometric Metrics for Bayesian Optimization. (arXiv:2401.01981v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.01981
Code URL: null
Copy Paste: [[2401.01981]] Beyond Regrets: Geometric Metrics for Bayesian Optimization(http://arxiv.org/abs/2401.01981)
Summary:
Bayesian optimization is a principled optimization strategy for a black-box objective function. It shows its effectiveness in a wide variety of real-world applications such as scientific discovery and experimental design. In general, the performance of Bayesian optimization is assessed by regret-based metrics such as instantaneous, simple, and cumulative regrets. These metrics only rely on function evaluations, so that they do not consider geometric relationships between query points and global solutions, or query points themselves. Notably, they cannot discriminate if multiple global solutions are successfully found. Moreover, they do not evaluate Bayesian optimization's abilities to exploit and explore a search space given. To tackle these issues, we propose four new geometric metrics, i.e., precision, recall, average degree, and average distance. These metrics allow us to compare Bayesian optimization algorithms considering the geometry of both query points and global optima, or query points. However, they are accompanied by an extra parameter, which needs to be carefully determined. We therefore devise the parameter-free forms of the respective metrics by integrating out the additional parameter. Finally, we empirically validate that our proposed metrics can provide more convincing interpretation and understanding of Bayesian optimization algorithms from distinct perspectives, compared to the conventional metrics.

Title: Two-Stage Surrogate Modeling for Data-Driven Design Optimization with Application to Composite Microstructure Generation. (arXiv:2401.02008v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.02008
Code URL: null
Copy Paste: [[2401.02008]] Two-Stage Surrogate Modeling for Data-Driven Design Optimization with Application to Composite Microstructure Generation(http://arxiv.org/abs/2401.02008)
Summary:
This paper introduces a novel two-stage machine learning-based surrogate modeling framework to address inverse problems in scientific and engineering fields. In the first stage of the proposed framework, a machine learning model termed the "learner" identifies a limited set of candidates within the input design space whose predicted outputs closely align with desired outcomes. Subsequently, in the second stage, a separate surrogate model, functioning as an "evaluator," is employed to assess the reduced candidate space generated in the first stage. This evaluation process eliminates inaccurate and uncertain solutions, guided by a user-defined coverage level. The framework's distinctive contribution is the integration of conformal inference, providing a versatile and efficient approach that can be widely applicable. To demonstrate the effectiveness of the proposed framework compared to conventional single-stage inverse problems, we conduct several benchmark tests and investigate an engineering application focused on the micromechanical modeling of fiber-reinforced composites. The results affirm the superiority of our proposed framework, as it consistently produces more reliable solutions. Therefore, the introduced framework offers a unique perspective on fostering interactions between machine learning-based surrogate models in real-world applications.

Title: Fast & Fair: Efficient Second-Order Robust Optimization for Fairness in Machine Learning. (arXiv:2401.02012v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.02012
Code URL: null
Copy Paste: [[2401.02012]] Fast & Fair: Efficient Second-Order Robust Optimization for Fairness in Machine Learning(http://arxiv.org/abs/2401.02012)
Summary:
This project explores adversarial training techniques to develop fairer Deep Neural Networks (DNNs) to mitigate the inherent bias they are known to exhibit. DNNs are susceptible to inheriting bias with respect to sensitive attributes such as race and gender, which can lead to life-altering outcomes (e.g., demographic bias in facial recognition software used to arrest a suspect). We propose a robust optimization problem, which we demonstrate can improve fairness in several datasets, both synthetic and real-world, using an affine linear model. Leveraging second order information, we are able to find a solution to our optimization problem more efficiently than a purely first order method.

Title: Balancing Continual Learning and Fine-tuning for Human Activity Recognition. (arXiv:2401.02255v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.02255
Code URL: null
Copy Paste: [[2401.02255]] Balancing Continual Learning and Fine-tuning for Human Activity Recognition(http://arxiv.org/abs/2401.02255)
Summary:
Wearable-based Human Activity Recognition (HAR) is a key task in human-centric machine learning due to its fundamental understanding of human behaviours. Due to the dynamic nature of human behaviours, continual learning promises HAR systems that are tailored to users' needs. However, because of the difficulty in collecting labelled data with wearable sensors, existing approaches that focus on supervised continual learning have limited applicability, while unsupervised continual learning methods only handle representation learning while delaying classifier training to a later stage. This work explores the adoption and adaptation of CaSSLe, a continual self-supervised learning model, and Kaizen, a semi-supervised continual learning model that balances representation learning and down-stream classification, for the task of wearable-based HAR. These schemes re-purpose contrastive learning for knowledge retention and, Kaizen combines that with self-training in a unified scheme that can leverage unlabelled and labelled data for continual learning. In addition to comparing state-of-the-art self-supervised continual learning schemes, we further investigated the importance of different loss terms and explored the trade-off between knowledge retention and learning from new tasks. In particular, our extensive evaluation demonstrated that the use of a weighting factor that reflects the ratio between learned and new classes achieves the best overall trade-off in continual learning.

Title: Training Single-Layer Morphological Perceptron Using Convex-Concave Programming. (arXiv:2401.02296v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.02296
Code URL: null
Copy Paste: [[2401.02296]] Training Single-Layer Morphological Perceptron Using Convex-Concave Programming(http://arxiv.org/abs/2401.02296)
Summary:
This paper concerns the training of a single-layer morphological perceptron using disciplined convex-concave programming (DCCP). We introduce an algorithm referred to as K-DDCCP, which combines the existing single-layer morphological perceptron (SLMP) model proposed by Ritter and Urcid with the weighted disciplined convex-concave programming (WDCCP) algorithm by Charisopoulos and Maragos. The proposed training algorithm leverages the disciplined convex-concave procedure (DCCP) and formulates a non-convex optimization problem for binary classification. To tackle this problem, the constraints are expressed as differences of convex functions, enabling the application of the DCCP package. The experimental results confirm the effectiveness of the K-DDCCP algorithm in solving binary classification problems. Overall, this work contributes to the field of morphological neural networks by proposing an algorithm that extends the capabilities of the SLMP model.

Title: Not all Minorities are Equal: Empty-Class-Aware Distillation for Heterogeneous Federated Learning. (arXiv:2401.02329v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.02329
Code URL: null
Copy Paste: [[2401.02329]] Not all Minorities are Equal: Empty-Class-Aware Distillation for Heterogeneous Federated Learning(http://arxiv.org/abs/2401.02329)
Summary:
Data heterogeneity, characterized by disparities in local data distribution across clients, poses a significant challenge in federated learning. Substantial efforts have been devoted to addressing the heterogeneity in local label distribution. As minority classes suffer from worse accuracy due to overfitting on local imbalanced data, prior methods often incorporate class-balanced learning techniques during local training. Despite the improved mean accuracy across all classes, we observe that empty classes-referring to categories absent from a client's data distribution-are still not well recognized. This paper introduces FedED, a novel approach in heterogeneous federated learning that integrates both empty-class distillation and logit suppression simultaneously. Specifically, empty-class distillation leverages knowledge distillation during local training on each client to retain essential information related to empty classes from the global model. Moreover, logit suppression directly penalizes network logits for non-label classes, effectively addressing misclassifications in minority classes that may be biased toward majority classes. Extensive experiments validate the efficacy of FedED, surpassing previous state-of-the-art methods across diverse datasets with varying degrees of label distribution shift.

Title: Multi-Source Domain Adaptation with Transformer-based Feature Generation for Subject-Independent EEG-based Emotion Recognition. (arXiv:2401.02344v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.02344
Code URL: null
Copy Paste: [[2401.02344]] Multi-Source Domain Adaptation with Transformer-based Feature Generation for Subject-Independent EEG-based Emotion Recognition(http://arxiv.org/abs/2401.02344)
Summary:
Although deep learning-based algorithms have demonstrated excellent performance in automated emotion recognition via electroencephalogram (EEG) signals, variations across brain signal patterns of individuals can diminish the model's effectiveness when applied across different subjects. While transfer learning techniques have exhibited promising outcomes, they still encounter challenges related to inadequate feature representations and may overlook the fact that source subjects themselves can possess distinct characteristics. In this work, we propose a multi-source domain adaptation approach with a transformer-based feature generator (MSDA-TF) designed to leverage information from multiple sources. The proposed feature generator retains convolutional layers to capture shallow spatial, temporal, and spectral EEG data representations, while self-attention mechanisms extract global dependencies within these features. During the adaptation process, we group the source subjects based on correlation values and aim to align the moments of the target subject with each source as well as within the sources. MSDA-TF is validated on the SEED dataset and is shown to yield promising results.

multi-run

chain-of-thought

tree-of-thought

agent

Title: Decentralized Multi-Task Online Convex Optimization Under Random Link Failures. (arXiv:2401.02011v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.02011
Code URL: null
Copy Paste: [[2401.02011]] Decentralized Multi-Task Online Convex Optimization Under Random Link Failures(http://arxiv.org/abs/2401.02011)
Summary:
Decentralized optimization methods often entail information exchange between neighbors. Transmission failures can happen due to network congestion, hardware/software issues, communication outage, and other factors. In this paper, we investigate the random link failure problem in decentralized multi-task online convex optimization, where agents have individual decisions that are coupled with each other via pairwise constraints. Although widely used in constrained optimization, conventional saddle-point algorithms are not directly applicable here because of random packet dropping. To address this issue, we develop a robust decentralized saddle-point algorithm against random link failures with heterogeneous probabilities by replacing the missing decisions of neighbors with their latest received values. Then, by judiciously bounding the accumulated deviation stemming from this replacement, we first establish that our algorithm achieves $\mathcal{O}(\sqrt{T})$ regret and $\mathcal{O}(T^\frac{3}{4})$ constraint violations for the full information scenario, where the complete information on the local cost function is revealed to each agent at the end of each time slot. These two bounds match, in order sense, the performance bounds of algorithms with perfect communications. Further, we extend our algorithm and analysis to the two-point bandit feedback scenario, where only the values of the local cost function at two random points are disclosed to each agent sequentially. Performance bounds of the same orders as the full information case are derived. Finally, we corroborate the efficacy of the proposed algorithms and the analytical results through numerical simulations.