2024-01-12

language model

Title: Enhancing Essay Scoring with Adversarial Weights Perturbation and Metric-specific AttentionPooling. (arXiv:2401.05433v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05433
Code URL: null
Copy Paste: [[2401.05433]] Enhancing Essay Scoring with Adversarial Weights Perturbation and Metric-specific AttentionPooling(http://arxiv.org/abs/2401.05433)
Summary:
The objective of this study is to improve automated feedback tools designed for English Language Learners (ELLs) through the utilization of data science techniques encompassing machine learning, natural language processing, and educational data analytics. Automated essay scoring (AES) research has made strides in evaluating written essays, but it often overlooks the specific needs of English Language Learners (ELLs) in language development. This study explores the application of BERT-related techniques to enhance the assessment of ELLs' writing proficiency within AES.

To address the specific needs of ELLs, we propose the use of DeBERTa, a state-of-the-art neural language model, for improving automated feedback tools. DeBERTa, pretrained on large text corpora using self-supervised learning, learns universal language representations adaptable to various natural language understanding tasks. The model incorporates several innovative techniques, including adversarial training through Adversarial Weights Perturbation (AWP) and Metric-specific AttentionPooling (6 kinds of AP) for each label in the competition.

The primary focus of this research is to investigate the impact of hyperparameters, particularly the adversarial learning rate, on the performance of the model. By fine-tuning the hyperparameter tuning process, including the influence of 6AP and AWP, the resulting models can provide more accurate evaluations of language proficiency and support tailored learning tasks for ELLs. This work has the potential to significantly benefit ELLs by improving their English language proficiency and facilitating their educational journey.
摘要：
这项研究的目的是通过利用机器学习、自然语言处理和教育数据分析等数据科学技术来改进为英语学习者 (ELL) 设计的自动反馈工具。自动论文评分 (AES) 研究在评估书面论文方面取得了长足的进步，但它经常忽视英语学习者 (ELL) 在语言发展方面的具体需求。本研究探讨了 BERT 相关技术的应用，以增强 AES 中 ELL 写作能力的评估。

为了满足 ELL 的特定需求，我们建议使用最先进的神经语言模型 DeBERTa 来改进自动反馈工具。 DeBERTa 使用自我监督学习在大型文本语料库上进行预训练，学习适用于各种自然语言理解任务的通用语言表示。该模型融合了多项创新技术，包括通过对抗性权重扰动（AWP）进行对抗性训练，以及针对竞赛中每个标签的特定指标注意力池（6种AP）。

这项研究的主要重点是研究超参数（特别是对抗性学习率）对模型性能的影响。通过微调超参数调整过程，包括 6AP 和 AWP 的影响，生成的模型可以提供更准确的语言能力评估，并支持 ELL 的定制学习任务。这项工作有可能通过提高 ELL 的英语语言能力并促进他们的教育旅程而使他们受益匪浅。
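
The abstract names Metric-specific AttentionPooling but gives no architectural detail. The following is a minimal sketch of generic attention pooling only, assuming one learned scoring vector per metric; the function names, metric names, and weights are hypothetical and are not taken from the paper.

```python
import math


def attention_pool(token_vectors, score_weights):
    """Pool token vectors into one vector using softmax attention weights."""
    # One scalar score per token: dot product with the (hypothetical) learned vector.
    scores = [sum(w * x for w, x in zip(score_weights, vec)) for vec in token_vectors]
    # Numerically stable softmax over the token scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the token vectors, one output value per dimension.
    dim = len(token_vectors[0])
    return [sum(wt * vec[d] for wt, vec in zip(weights, token_vectors)) for d in range(dim)]


def pool_per_metric(token_vectors, metric_weights):
    """Apply a separate attention pooler for each metric name (assumed design)."""
    return {name: attention_pool(token_vectors, w) for name, w in metric_weights.items()}
```

With an all-zero scoring vector every token gets the same weight, so the pooled vector reduces to the plain mean of the token vectors.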

Title: REBUS: A Robust Evaluation Benchmark of Understanding Symbols. (arXiv:2401.05604v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05604
Code URL: https://github.com/cvndsh/rebus
Copy Paste: [[2401.05604]] REBUS: A Robust Evaluation Benchmark of Understanding Symbols(http://arxiv.org/abs/2401.05604)
Summary:
We propose a new benchmark evaluating the performance of multimodal large language models on rebus puzzles. The dataset covers 333 original examples of image-based wordplay, cluing 13 categories such as movies, composers, major cities, and food. To achieve good performance on the benchmark of identifying the clued word or phrase, models must combine image recognition and string manipulation with hypothesis testing, multi-step reasoning, and an understanding of human cognition, making for a complex, multimodal evaluation of capabilities. We find that proprietary models such as GPT-4V and Gemini Pro significantly outperform all other tested models. However, even the best model has a final accuracy of just 24%, highlighting the need for substantial improvements in reasoning. Further, models rarely understand all parts of a puzzle, and are almost always incapable of retroactively explaining the correct answer. Our benchmark can therefore be used to identify major shortcomings in the knowledge and reasoning of multimodal large language models.
摘要：
我们提出了一个新的基准，评估多模式大语言模型在画画谜题上的性能。该数据集涵盖 333 个基于图像的文字游戏的原始示例，涵盖电影、作曲家、主要城市和食物等 13 个类别。为了在识别线索单词或短语的基准上取得良好的性能，模型必须将图像识别和字符串操作与假设检验、多步骤推理和对人类认知的理解结合起来，从而对能力进行复杂的多模式评估。我们发现 GPT-4V 和 Gemini Pro 等专有模型的性能明显优于所有其他测试模型。然而，即使是最好的模型，最终准确率也仅为 24%，这凸显了推理方面需要大幅改进的必要性。此外，模型很少理解谜题的所有部分，并且几乎总是无法追溯解释正确的答案。因此，我们的基准可用于识别多模态大语言模型的知识和推理中的主要缺陷。

Title: The Benefits of a Concise Chain of Thought on Problem-Solving in Large Language Models. (arXiv:2401.05618v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05618
Code URL: null
Copy Paste: [[2401.05618]] The Benefits of a Concise Chain of Thought on Problem-Solving in Large Language Models(http://arxiv.org/abs/2401.05618)
Summary:
In this paper, we introduce Concise Chain-of-Thought (CCoT) prompting. We compared standard CoT and CCoT prompts to see how conciseness impacts response length and correct-answer accuracy. We evaluated this using GPT-3.5 and GPT-4 with a multiple-choice question-and-answer (MCQA) benchmark. CCoT reduced average response length by 48.70% for both GPT-3.5 and GPT-4 while having a negligible impact on problem-solving performance. However, on math problems, GPT-3.5 with CCoT incurs a performance penalty of 27.69%. Overall, CCoT leads to an average per-token cost reduction of 22.67%. These results have practical implications for AI systems engineers using LLMs to solve real-world problems with CoT prompt-engineering techniques. In addition, these results provide more general insight for AI researchers studying the emergent behavior of step-by-step reasoning in LLMs.
摘要：
在本文中，我们介绍了简洁思维链（CCoT）提示。我们比较了标准 CoT 和 CCoT 提示，以了解简洁性如何影响响应长度和正确答案的准确性。我们使用 GPT-3.5 和 GPT-4 以及多项选择问答 (MCQA) 基准对此进行了评估。 CCoT 将 GPT-3.5 和 GPT-4 的平均响应长度减少了 48.70%，同时对问题解决性能的影响可以忽略不计。然而，在数学问题上，带有 CCoT 的 GPT-3.5 会导致性能损失 27.69%。总体而言，CCoT 使每个代币的平均成本降低了 22.67%。这些结果对于使用法学硕士通过 CoT 即时工程技术解决现实世界问题的人工智能系统工程师具有实际意义。此外，这些结果为人工智能研究人员研究法学硕士中逐步推理的涌现行为提供了更普遍的见解。

Title: Towards Conversational Diagnostic AI. (arXiv:2401.05654v1 [cs.AI])

Paper URL: http://arxiv.org/abs/2401.05654
Code URL: null
Copy Paste: [[2401.05654]] Towards Conversational Diagnostic AI(http://arxiv.org/abs/2401.05654)
Summary:
At the heart of medicine lies the physician-patient dialogue, where skillful history-taking paves the way for accurate diagnosis, effective management, and enduring trust. Artificial Intelligence (AI) systems capable of diagnostic dialogue could increase accessibility, consistency, and quality of care. However, approximating clinicians' expertise is an outstanding grand challenge. Here, we introduce AMIE (Articulate Medical Intelligence Explorer), a Large Language Model (LLM) based AI system optimized for diagnostic dialogue.

AMIE uses a novel self-play based simulated environment with automated feedback mechanisms for scaling learning across diverse disease conditions, specialties, and contexts. We designed a framework for evaluating clinically-meaningful axes of performance including history-taking, diagnostic accuracy, management reasoning, communication skills, and empathy. We compared AMIE's performance to that of primary care physicians (PCPs) in a randomized, double-blind crossover study of text-based consultations with validated patient actors in the style of an Objective Structured Clinical Examination (OSCE). The study included 149 case scenarios from clinical providers in Canada, the UK, and India, 20 PCPs for comparison with AMIE, and evaluations by specialist physicians and patient actors. AMIE demonstrated greater diagnostic accuracy and superior performance on 28 of 32 axes according to specialist physicians and 24 of 26 axes according to patient actors. Our research has several limitations and should be interpreted with appropriate caution. Clinicians were limited to unfamiliar synchronous text-chat which permits large-scale LLM-patient interactions but is not representative of usual clinical practice. While further research is required before AMIE could be translated to real-world settings, the results represent a milestone towards conversational diagnostic AI.
摘要：
医学的核心在于医患对话，熟练的病史采集为准确的诊断、有效的管理和持久的信任铺平了道路。能够进行诊断对话的人工智能（AI）系统可以提高护理的可及性、一致性和质量。然而，接近临床医生的专业知识是一个巨大的挑战。在这里，我们介绍 AMIE（Articulate Medical Intelligence Explorer），这是一种基于大型语言模型（LLM）的人工智能系统，针对诊断对话进行了优化。

AMIE 使用一种新颖的基于自我游戏的模拟环境，具有自动反馈机制，可在不同的疾病状况、专业和背景下扩展学习。我们设计了一个框架，用于评估具有临床意义的绩效轴，包括病史采集、诊断准确性、管理推理、沟通技巧和同理心。我们在一项随机、双盲交叉研究中，以客观结构化临床检查 (OSCE) 的方式与经过验证的患者参与者进行基于文本的咨询，将 AMIE 的表现与初级保健医生 (PCP) 的表现进行了比较。该研究包括来自加拿大、英国和印度临床提供者的 149 个病例场景、20 个与 AMIE 进行比较的 PCP，以及专科医生和患者参与者的评估。根据专科医生的说法，AMIE 在 32 个轴中的 28 个轴上表现出了更高的诊断准确性和卓越的性能，根据患者参与者的说法，AMIE 在 26 个轴中的 24 个轴上表现出了更高的诊断准确性和卓越的性能。我们的研究有一些局限性，应谨慎解释。临床医生仅限于不熟悉的同步文本聊天，这允许大规模的法学硕士与患者互动，但不能代表通常的临床实践。虽然在将 AMIE 转化为现实世界环境之前还需要进一步研究，但结果代表了对话式诊断 AI 的一个里程碑。

Title: A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism. (arXiv:2401.05749v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05749
Code URL: null
Copy Paste: [[2401.05749]] A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism(http://arxiv.org/abs/2401.05749)
Summary:
We show that content on the web is often translated into many languages, and the low quality of these multi-way translations indicates they were likely created using Machine Translation (MT). Multi-way parallel, machine generated content not only dominates the translations in lower resource languages; it also constitutes a large fraction of the total web content in those languages. We also find evidence of a selection bias in the type of content which is translated into many languages, consistent with low quality English content being translated en masse into many lower resource languages, via MT. Our work raises serious concerns about training models such as multilingual large language models on both monolingual and bilingual data scraped from the web.
摘要：
我们发现网络上的内容通常会被翻译成多种语言，而这些多向翻译的质量较低，表明它们很可能是使用机器翻译 (MT) 创建的。多路并行的机器生成内容不仅在低资源语言的翻译中占主导地位，还占这些语言网络内容总量的很大一部分。我们还发现翻译成多种语言的内容类型存在选择偏差的证据，这与通过机器翻译将低质量英语内容批量翻译成许多较低资源语言的情况一致。我们的工作引发了对使用从网络抓取的单语和双语数据训练多语言大语言模型等模型的严重担忧。

Title: Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems. (arXiv:2401.05778v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05778
Code URL: null
Copy Paste: [[2401.05778]] Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems(http://arxiv.org/abs/2401.05778)
Summary:
Large language models (LLMs) have strong capabilities in solving diverse natural language processing tasks. However, the safety and security issues of LLM systems have become the major obstacle to their widespread application. Many studies have extensively investigated risks in LLM systems and developed the corresponding mitigation strategies. Leading-edge enterprises such as OpenAI, Google, Meta, and Anthropic have also made lots of efforts on responsible LLMs. Therefore, there is a growing need to organize the existing studies and establish comprehensive taxonomies for the community. In this paper, we delve into four essential modules of an LLM system, including an input module for receiving prompts, a language model trained on extensive corpora, a toolchain module for development and deployment, and an output module for exporting LLM-generated content. Based on this, we propose a comprehensive taxonomy, which systematically analyzes potential risks associated with each module of an LLM system and discusses the corresponding mitigation strategies. Furthermore, we review prevalent benchmarks, aiming to facilitate the risk assessment of LLM systems. We hope that this paper can help LLM participants embrace a systematic perspective to build their responsible LLM systems.
摘要：
大型语言模型（LLM）在解决各种自然语言处理任务方面具有强大的能力。然而，LLM系统的安全保障问题已成为其广泛应用的主要障碍。许多研究广泛调查了法学硕士系统的风险并制定了相应的缓解策略。 OpenAI、Google、Meta、Anthropic等前沿企业也在负责任的LLM方面做出了很多努力。因此，越来越需要组织现有的研究并为社区建立全面的分类法。在本文中，我们深入研究了LLM系统的四个基本模块，包括用于接收提示的输入模块、在广泛语料库上训练的语言模型、用于开发和部署的工具链模块以及用于导出LLM生成内容的输出模块。在此基础上，我们提出了一个全面的分类法，系统地分析了LLM系统每个模块相关的潜在风险，并讨论了相应的缓解策略。此外，我们审查流行的基准，旨在促进法学硕士系统的风险评估。我们希望本文能够帮助LLM参与者以系统的视角来构建他们负责任的LLM系统。

Title: How Teachers Can Use Large Language Models and Bloom's Taxonomy to Create Educational Quizzes. (arXiv:2401.05914v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05914
Code URL: null
Copy Paste: [[2401.05914]] How Teachers Can Use Large Language Models and Bloom's Taxonomy to Create Educational Quizzes(http://arxiv.org/abs/2401.05914)
Summary:
Question generation (QG) is a natural language processing task with an abundance of potential benefits and use cases in the educational domain. In order for this potential to be realized, QG systems must be designed and validated with pedagogical needs in mind. However, little research has assessed or designed QG approaches with the input from real teachers or students. This paper applies a large language model-based QG approach where questions are generated with learning goals derived from Bloom's taxonomy. The automatically generated questions are used in multiple experiments designed to assess how teachers use them in practice. The results demonstrate that teachers prefer to write quizzes with automatically generated questions, and that such quizzes have no loss in quality compared to handwritten versions. Further, several metrics indicate that automatically generated questions can even improve the quality of the quizzes created, showing the promise for large scale use of QG in the classroom setting.
摘要：
问题生成 (QG) 是一项自然语言处理任务，在教育领域具有大量潜在优势和用例。为了实现这一潜力，QG 系统的设计和验证必须考虑到教学需求。然而，很少有研究根据真实教师或学生的意见来评估或设计 QG 方法。本文应用了基于大型语言模型的 QG 方法，其中问题是根据从 Bloom 分类法得出的学习目标生成的。自动生成的问题用于多个实验，旨在评估教师在实践中如何使用它们。结果表明，教师更喜欢使用自动生成的问题编写测验，并且与手写版本相比，此类测验在质量上没有损失。此外，一些指标表明，自动生成的问题甚至可以提高所创建测验的质量，这表明 QG 在课堂环境中大规模使用的前景。

Title: Universal Vulnerabilities in Large Language Models: In-context Learning Backdoor Attacks. (arXiv:2401.05949v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05949
Code URL: null
Copy Paste: [[2401.05949]] Universal Vulnerabilities in Large Language Models: In-context Learning Backdoor Attacks(http://arxiv.org/abs/2401.05949)
Summary:
In-context learning, a paradigm bridging the gap between pre-training and fine-tuning, has demonstrated high efficacy in several NLP tasks, especially in few-shot settings. Unlike traditional fine-tuning methods, in-context learning adapts pre-trained models to unseen tasks without updating any parameters. Despite being widely applied, in-context learning is vulnerable to malicious attacks. In this work, we raise security concerns regarding this paradigm. Our studies demonstrate that an attacker can manipulate the behavior of large language models by poisoning the demonstration context, without the need for fine-tuning the model. Specifically, we have designed a new backdoor attack method, named ICLAttack, to target large language models based on in-context learning. Our method encompasses two types of attacks: poisoning demonstration examples and poisoning prompts, which can make models behave in accordance with predefined intentions. ICLAttack does not require additional fine-tuning to implant a backdoor, thus preserving the model's generality. Furthermore, the poisoned examples are correctly labeled, enhancing the natural stealth of our attack method. Extensive experimental results across several language models, ranging in size from 1.3B to 40B parameters, demonstrate the effectiveness of our attack method, exemplified by a high average attack success rate of 95.0% across the three datasets on OPT models. Our findings highlight the vulnerabilities of language models, and we hope this work will raise awareness of the possible security threats associated with in-context learning.
摘要：
上下文学习是一种弥合预训练和微调之间差距的范式，已在多项 NLP 任务中表现出高效能，尤其是在少量样本设置中。与传统的微调方法不同，上下文学习使预先训练的模型适应未见过的任务，而无需更新任何参数。尽管应用广泛，但上下文学习很容易受到恶意攻击。在这项工作中，我们提出了有关此范例的安全问题。我们的研究表明，攻击者可以通过毒害演示上下文来操纵大型语言模型的行为，而无需对模型进行微调。具体来说，我们设计了一种新的后门攻击方法，名为 ICLAttack，针对基于上下文学习的大型语言模型。我们的方法包含两种类型的攻击：中毒演示示例和中毒提示，这可以使模型按照预定义的意图行事。 ICLAttack 不需要额外的微调来植入后门，从而保留了模型的通用性。此外，中毒的例子被正确标记，增强了我们的攻击方法的自然隐蔽性。跨多种语言模型的广泛实验结果（参数大小从 1.3B 到 40B 不等）证明了我们的攻击方法的有效性，OPT 模型上的三个数据集的平均攻击成功率高达 95.0%。我们的研究结果强调了语言模型的漏洞，我们希望这项工作能够提高人们对与上下文学习相关的可能安全威胁的认识。

Title: Investigating Data Contamination for Pre-training Language Models. (arXiv:2401.06059v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.06059
Code URL: null
Copy Paste: [[2401.06059]] Investigating Data Contamination for Pre-training Language Models(http://arxiv.org/abs/2401.06059)
Summary:
Language models pre-trained on web-scale corpora demonstrate impressive capabilities on diverse downstream tasks. However, there is increasing concern whether such capabilities might arise from evaluation datasets being included in the pre-training corpus (a phenomenon known as data contamination) in a manner that artificially increases performance. There has been little understanding of how this potential contamination might influence LMs' performance on downstream tasks. In this paper, we explore the impact of data contamination at the pre-training stage by pre-training a series of GPT-2 models from scratch. We highlight the effect of both text contamination (i.e., input text of the evaluation samples) and ground-truth contamination (i.e., the prompts asked on the input and the desired outputs) from evaluation data. We also investigate the effects of repeating contamination for various downstream tasks. Additionally, we examine the prevailing n-gram-based definitions of contamination within current LLM reports, pinpointing their limitations and inadequacy. Our findings offer new insights into data contamination's effects on language model capabilities and underscore the need for independent, comprehensive contamination assessments in LLM studies.
摘要：
在网络规模的语料库上预先训练的语言模型在各种下游任务上展示了令人印象深刻的能力。然而，人们越来越担心这种能力是否可能是由于将评估数据集包含在预训练语料库中而产生的（这种现象称为数据污染），从而人为地提高了性能。人们对这种潜在污染如何影响语言模型在下游任务中的表现知之甚少。在本文中，我们通过从头开始预训练一系列 GPT-2 模型来探索预训练阶段数据污染的影响。我们强调评估数据中文本污染（即评估样本的输入文本）和真实标签污染（即对输入提出的提示和所需输出）的影响。我们还研究了重复污染对各种下游任务的影响。此外，我们还研究了当前 LLM 报告中流行的基于 n 元语法的污染定义，指出其局限性和不足。我们的研究结果为数据污染对语言模型能力的影响提供了新的见解，并强调了 LLM 研究中独立、全面的污染评估的必要性。
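
The abstract refers to n-gram-based definitions of contamination without spelling one out. A minimal sketch of one such definition follows, assuming whitespace tokenization and a simple overlap ratio; the function names and the default n are illustrative choices, not the definitions examined in the paper.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in a text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_ratio(eval_text, corpus_text, n=8):
    """Fraction of the evaluation sample's n-grams that also occur in the corpus."""
    eval_grams = ngrams(eval_text, n)
    if not eval_grams:
        # Sample is shorter than n tokens, so there is nothing to compare.
        return 0.0
    corpus_grams = ngrams(corpus_text, n)
    return len(eval_grams & corpus_grams) / len(eval_grams)
```

A verbatim copy of an evaluation sample scores 1.0, while a paraphrase of the same sample can score 0.0, which illustrates one way exact-match definitions of this kind may undercount contamination.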

Title: Secrets of RLHF in Large Language Models Part II: Reward Modeling. (arXiv:2401.06080v1 [cs.AI])

Paper URL: http://arxiv.org/abs/2401.06080
Code URL: null
Copy Paste: [[2401.06080]] Secrets of RLHF in Large Language Models Part II: Reward Modeling(http://arxiv.org/abs/2401.06080)
Summary:
Reinforcement Learning from Human Feedback (RLHF) has become a crucial technology for aligning language models with human values and intentions, enabling models to produce more helpful and harmless responses. Reward models are trained as proxies for human preferences to drive reinforcement learning optimization. While reward models are often considered central to achieving high performance, they face the following challenges in practical applications: (1) Incorrect and ambiguous preference pairs in the dataset may hinder the reward model from accurately capturing human intent. (2) Reward models trained on data from a specific distribution often struggle to generalize to examples outside that distribution and are not suitable for iterative RLHF training.

In this report, we attempt to address these two issues. (1) From a data perspective, we propose a method to measure the strength of preferences within the data, based on a voting mechanism of multiple reward models. Experimental results confirm that data with varying preference strengths have different impacts on reward model performance. We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset and fully leverage high-quality preference data. (2) From an algorithmic standpoint, we introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses, thereby improving model generalization. Furthermore, we employ meta-learning to enable the reward model to maintain the ability to differentiate subtle differences in out-of-distribution samples, and this approach can be utilized for iterative RLHF optimization.
摘要：
人类反馈强化学习 (RLHF) 已成为使语言模型与人类价值观和意图保持一致的关键技术，使模型能够产生更有帮助且无害的响应。奖励模型被训练为人类偏好的代理，以驱动强化学习优化。虽然奖励模型通常被认为是实现高性能的核心，但它们在实际应用中面临以下挑战：（1）数据集中不正确和模糊的偏好对可能会阻碍奖励模型准确捕捉人类意图。 (2) 根据特定分布的数据训练的奖励模型通常很难推广到该分布之外的示例，并且不适合迭代 RLHF 训练。

在本报告中，我们尝试解决这两个问题。 (1)从数据的角度来看，我们提出了一种基于多种奖励模型的投票机制来衡量数据内偏好强度的方法。实验结果证实，不同偏好强度的数据对奖励模型的性能有不同的影响。我们引入了一系列新颖的方法来减轻数据集中不正确和模糊偏好的影响，并充分利用高质量的偏好数据。（2）从算法的角度来看，我们引入对比学习来增强奖励模型区分选择和拒绝响应的能力，从而提高模型泛化能力。此外，我们采用元学习使奖励模型能够保持区分分布外样本中细微差异的能力，并且该方法可用于迭代 RLHF 优化。
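
The voting idea in point (1) can be sketched as follows, assuming each reward model in an ensemble scores both the chosen and the rejected response of a pair. The aggregation shown here (mean and spread of score differences, plus the share of models that agree with the label) is an illustrative assumption; the report's exact formulation is not given in the abstract.

```python
from statistics import mean, pstdev


def preference_strength(chosen_scores, rejected_scores):
    """Summarize how strongly an ensemble of reward models prefers the chosen response.

    chosen_scores[i] and rejected_scores[i] are the scores from the i-th reward model.
    """
    diffs = [c - r for c, r in zip(chosen_scores, rejected_scores)]
    # A model "votes" for the labeled preference when it scores the chosen response higher.
    votes = sum(1 for d in diffs if d > 0)
    return {
        "mean_diff": mean(diffs),     # average margin across the ensemble
        "std_diff": pstdev(diffs),    # disagreement between the models
        "vote_share": votes / len(diffs),
    }
```

Under this sketch, a pair with a negative mean margin would be a candidate for an incorrect label, and a pair with a margin near zero would be a candidate for an ambiguous one.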

Title: Autocompletion of Chief Complaints in the Electronic Health Records using Large Language Models. (arXiv:2401.06088v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.06088
Code URL: null
Copy Paste: [[2401.06088]] Autocompletion of Chief Complaints in the Electronic Health Records using Large Language Models(http://arxiv.org/abs/2401.06088)
Summary:
The Chief Complaint (CC) is a crucial component of a patient's medical record as it describes the main reason or concern for seeking medical care. It provides critical information for healthcare providers to make informed decisions about patient care. However, documenting CCs can be time-consuming for healthcare providers, especially in busy emergency departments. To address this issue, an autocompletion tool that suggests accurate and well-formatted phrases or sentences for clinical notes can be a valuable resource for triage nurses. In this study, we utilized text generation techniques to develop machine learning models using CC data. In our proposed work, we train a Long Short-Term Memory (LSTM) model and fine-tune three different variants of Biomedical Generative Pretrained Transformers (BioGPT), namely microsoft/biogpt, microsoft/BioGPT-Large, and microsoft/BioGPT-Large-PubMedQA. Additionally, we tune a prompt by incorporating exemplar CC sentences, utilizing the OpenAI API of GPT-4. We evaluate the models' performance based on the perplexity score, modified BERTScore, and cosine similarity score. The results show that BioGPT-Large exhibits superior performance compared to the other models. It consistently achieves a remarkably low perplexity score of 1.65 when generating CC, whereas the baseline LSTM model achieves the best perplexity score of 170. Further, we evaluate and assess the proposed models' performance and the outcome of GPT-4.0. Our study demonstrates that utilizing LLMs such as BioGPT leads to the development of an effective autocompletion tool for generating CC documentation in healthcare settings.
摘要：
主诉 (CC) 是患者医疗记录的重要组成部分，因为它描述了寻求医疗护理的主要原因或担忧。它为医疗保健提供者提供关键信息，以做出有关患者护理的明智决策。然而，对于医疗保健提供者来说，记录 CC 可能非常耗时，尤其是在繁忙的急诊科。为了解决这个问题，自动完成工具可以为临床记录提供准确且格式良好的短语或句子，这对于分诊护士来说可能是宝贵的资源。在本研究中，我们利用文本生成技术来开发使用 CC 数据的机器学习模型。在我们提出的工作中，我们训练了一个长短期记忆（LSTM）模型，并对生物医学生成式预训练 Transformer（BioGPT）的三种不同变体进行了微调，即 microsoft/biogpt、microsoft/BioGPT-Large 和 microsoft/BioGPT-Large-PubMedQA。此外，我们还利用 GPT-4 的 OpenAI API，通过合并示例 CC 句子来调整提示。我们根据困惑度得分、修改后的 BERTScore 和余弦相似度得分来评估模型的性能。结果表明，与其他模型相比，BioGPT-Large 表现出优越的性能。在生成 CC 时，它始终实现了 1.65 的非常低的困惑度分数，而基线 LSTM 模型实现了 170 的最佳困惑度分数。此外，我们评估和评估了所提出的模型的性能和 GPT-4.0 的结果。我们的研究表明，利用 BioGPT 等 LLM，可以开发出一种有效的自动完成工具，用于在医疗保健环境中生成 CC 文档。
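
Two of the metrics named above have standard definitions that can be sketched briefly. This assumes per-token natural-log probabilities are already available from a model and that sentences have already been embedded as numeric vectors; the function names are hypothetical, and the modified BERTScore used in the study is not reproduced here.

```python
import math


def perplexity(token_log_probs):
    """Perplexity as the exponential of the mean negative log-probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Lower perplexity means the model assigns higher probability to the reference text, which is the sense in which a score of 1.65 is better than a score of 170.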

Title: Patchscope: A Unifying Framework for Inspecting Hidden Representations of Language Models. (arXiv:2401.06102v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.06102
Code URL: null
Copy Paste: [[2401.06102]] Patchscope: A Unifying Framework for Inspecting Hidden Representations of Language Models(http://arxiv.org/abs/2401.06102)
Summary:
Inspecting the information encoded in hidden representations of large language models (LLMs) can explain models' behavior and verify their alignment with human values. Given the capabilities of LLMs in generating human-understandable text, we propose leveraging the model itself to explain its internal representations in natural language. We introduce a framework called Patchscopes and show how it can be used to answer a wide range of research questions about an LLM's computation. We show that prior interpretability methods based on projecting representations into the vocabulary space and intervening on the LLM computation, can be viewed as special instances of this framework. Moreover, several of their shortcomings such as failure in inspecting early layers or lack of expressivity can be mitigated by a Patchscope. Beyond unifying prior inspection techniques, Patchscopes also opens up new possibilities such as using a more capable model to explain the representations of a smaller model, and unlocks new applications such as self-correction in multi-hop reasoning.
摘要：
检查大型语言模型 (LLM) 隐藏表示中编码的信息可以解释模型的行为并验证其与人类价值观的一致性。鉴于法学硕士在生成人类可理解的文本方面的能力，我们建议利用模型本身来解释其自然语言的内部表示。我们介绍了一个名为 Patchscopes 的框架，并展示了如何使用它来回答有关法学硕士计算的各种研究问题。我们表明，基于将表示投影到词汇空间并干预 LLM 计算的先前可解释性方法可以被视为该框架的特殊实例。此外，它们的一些缺点，例如无法检查早期层或缺乏表现力，可以通过 Patchscope 来缓解。除了统一先前的检查技术之外，Patchscopes 还开辟了新的可能性，例如使用功能更强大的模型来解释较小模型的表示，并解锁了新的应用程序，例如多跳推理中的自我校正。

Title: TrustLLM: Trustworthiness in Large Language Models. (arXiv:2401.05561v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05561
Code URL: https://github.com/HowieHwong/TrustLLM
Copy Paste: [[2401.05561]] TrustLLM: Trustworthiness in Large Language Models(http://arxiv.org/abs/2401.05561)
Summary:
Large language models (LLMs), exemplified by ChatGPT, have gained considerable attention for their excellent natural language processing capabilities. Nonetheless, these LLMs present many challenges, particularly in the realm of trustworthiness. Therefore, ensuring the trustworthiness of LLMs emerges as an important topic. This paper introduces TrustLLM, a comprehensive study of trustworthiness in LLMs, including principles for different dimensions of trustworthiness, established benchmark, evaluation, and analysis of trustworthiness for mainstream LLMs, and discussion of open challenges and future directions. Specifically, we first propose a set of principles for trustworthy LLMs that span eight different dimensions. Based on these principles, we further establish a benchmark across six dimensions including truthfulness, safety, fairness, robustness, privacy, and machine ethics. We then present a study evaluating 16 mainstream LLMs in TrustLLM, consisting of over 30 datasets. Our findings firstly show that in general trustworthiness and utility (i.e., functional effectiveness) are positively related. Secondly, our observations reveal that proprietary LLMs generally outperform most open-source counterparts in terms of trustworthiness, raising concerns about the potential risks of widely accessible open-source LLMs. However, a few open-source LLMs come very close to proprietary ones. Thirdly, it is important to note that some LLMs may be overly calibrated towards exhibiting trustworthiness, to the extent that they compromise their utility by mistakenly treating benign prompts as harmful and consequently not responding. Finally, we emphasize the importance of ensuring transparency not only in the models themselves but also in the technologies that underpin trustworthiness. Knowing the specific trustworthy technologies that have been employed is crucial for analyzing their effectiveness.
摘要：
以 ChatGPT 为代表的大型语言模型 (LLM) 因其出色的自然语言处理能力而受到广泛关注。尽管如此，这些法学硕士提出了许多挑战，特别是在可信度领域。因此，确保LLM的可信度成为一个重要的话题。本文介绍了TrustLLM，这是一项关于法学硕士可信度的综合研究，包括可信度不同维度的原则、主流法学硕士可信度的建立基准、评估和分析，以及对开放挑战和未来方向的讨论。具体来说，我们首先提出了一套涵盖八个不同维度的值得信赖的法学硕士原则。基于这些原则，我们进一步建立了真实性、安全性、公平性、稳健性、隐私性和机器道德等六个维度的基准。然后，我们提出了一项评估 TrustLLM 中 16 个主流法学硕士的研究，其中包含 30 多个数据集。我们的研究结果首先表明，一般来说，可信度和效用（即功能有效性）呈正相关。其次，我们的观察表明，专有法学硕士在可信度方面通常优于大多数开源法学硕士，这引起了人们对广泛使用的开源法学硕士潜在风险的担忧。然而，一些开源法学硕士非常接近专有法学硕士。第三，值得注意的是，一些法学硕士可能会过度校准以表现出可信度，以至于他们错误地将良性提示视为有害提示并因此不响应，从而损害了其效用。最后，我们强调不仅要确保模型本身的透明度，还要确保支撑可信性的技术的透明度。了解已采用的具体可信技术对于分析其有效性至关重要。

Title: Scaling Laws for Forgetting When Fine-Tuning Large Language Models. (arXiv:2401.05605v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05605
Code URL: null
Copy Paste: [[2401.05605]] Scaling Laws for Forgetting When Fine-Tuning Large Language Models(http://arxiv.org/abs/2401.05605)
Summary:
We study and quantify the problem of forgetting when fine-tuning pre-trained large language models (LLMs) on a downstream task. We find that parameter-efficient fine-tuning (PEFT) strategies, such as Low-Rank Adapters (LoRA), still suffer from catastrophic forgetting. In particular, we identify a strong inverse linear relationship between the fine-tuning performance and the amount of forgetting when fine-tuning LLMs with LoRA. We further obtain precise scaling laws that show forgetting increases as a shifted power law in the number of parameters fine-tuned and the number of update steps. We also examine the impact of forgetting on knowledge, reasoning, and the safety guardrails trained into Llama 2 7B chat. Our study suggests that forgetting cannot be avoided through early stopping or by varying the number of parameters fine-tuned. We believe this opens up an important safety-critical direction for future research to evaluate and develop fine-tuning schemes which mitigate forgetting.
摘要：
我们研究并量化了在下游任务上微调预训练大型语言模型 (LLM) 时的遗忘问题。我们发现参数高效微调（PEFT）策略，例如低秩适配器（LoRA），仍然遭受灾难性遗忘的困扰。特别是，当使用 LoRA 微调 LLM 时，我们发现微调性能和遗忘量之间存在很强的逆线性关系。我们进一步获得了精确的缩放定律，该定律表明遗忘随着微调参数数量和更新步骤数量的幂律变化而增加。我们还研究了遗忘对 Llama 2 7B 聊天中训练的知识、推理和安全护栏的影响。我们的研究表明，不能通过提前停止或改变微调参数的数量来避免遗忘。我们相信，这为未来的研究开辟了一个重要的安全关键方向，以评估和开发减轻遗忘的微调方案。

Title: On Detecting Cherry-picking in News Coverage Using Large Language Models. (arXiv:2401.05650v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05650
Code URL: null
Copy Paste: [[2401.05650]] On Detecting Cherry-picking in News Coverage Using Large Language Models(http://arxiv.org/abs/2401.05650)
Summary:
Cherry-picking refers to the deliberate selection of evidence or facts that favor a particular viewpoint while ignoring or distorting evidence that supports an opposing perspective. Manually identifying instances of cherry-picked statements in news stories can be challenging, particularly when the opposing viewpoint's story is absent. This study introduces Cherry, an innovative approach for automatically detecting cherry-picked statements in news articles by finding missing important statements in the target news story. Cherry utilizes the analysis of news coverage from multiple sources to identify instances of cherry-picking. Our approach relies on language models that consider contextual information from other news sources to classify statements based on their importance to the event covered in the target news story. Furthermore, this research introduces a novel dataset specifically designed for cherry-picking detection, which was used to train and evaluate the performance of the models. Our best performing model achieves an F-1 score of about 89% in detecting important statements when tested on an unseen set of news stories. Moreover, results show the importance of incorporating external knowledge from alternative unbiased narratives when assessing a statement's importance.
摘要：
择优挑选是指故意选择支持特定观点的证据或事实，而忽略或歪曲支持相反观点的证据。手动识别新闻报道中精心挑选的陈述实例可能具有挑战性，特别是当反对观点的报道不存在时。这项研究引入了 Cherry，这是一种创新方法，通过查找目标新闻报道中缺失的重要陈述来自动检测新闻文章中精选的陈述。 Cherry 利用对多个来源的新闻报道进行分析来识别择优挑选的情况。我们的方法依赖于语言模型，该模型考虑其他新闻来源的上下文信息，根据语句对目标新闻报道中所涵盖事件的重要性对语句进行分类。此外，这项研究引入了一个专门为择优挑选检测而设计的新颖数据集，用于训练和评估模型的性能。当对未见过的新闻报道进行测试时，我们表现最好的模型在检测重要陈述方面取得了约 89% 的 F-1 分数。此外，结果表明，在评估陈述的重要性时，结合来自其他公正叙述的外部知识非常重要。
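
For reference, the F-1 score quoted above is the harmonic mean of precision and recall. A minimal sketch for a binary "important statement" label follows; the function name and the 0/1 label encoding are assumptions made for illustration, not the paper's evaluation code.

```python
def f1_score(y_true, y_pred):
    """Binary F-1: harmonic mean of precision and recall for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # important, predicted important
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # unimportant, predicted important
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # important, missed
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```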

Title: ConcEPT: Concept-Enhanced Pre-Training for Language Models. (arXiv:2401.05669v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05669
Code URL: null
Copy Paste: [[2401.05669]] ConcEPT: Concept-Enhanced Pre-Training for Language Models(http://arxiv.org/abs/2401.05669)
Summary:
Pre-trained language models (PLMs) have been prevailing in state-of-the-art methods for natural language processing, and knowledge-enhanced PLMs are further proposed to promote model performance in knowledge-intensive tasks. However, conceptual knowledge, one essential kind of knowledge for human cognition, still remains understudied in this line of research. This limits PLMs' performance in scenarios requiring human-like cognition, such as understanding long-tail entities with concepts. In this paper, we propose ConcEPT, which stands for Concept-Enhanced Pre-Training for language models, to infuse conceptual knowledge into PLMs. ConcEPT exploits external taxonomies with entity concept prediction, a novel pre-training objective to predict the concepts of entities mentioned in the pre-training contexts. Unlike previous concept-enhanced methods, ConcEPT can be readily adapted to various downstream applications without entity linking or concept mapping. Results of extensive experiments show the effectiveness of ConcEPT in four tasks such as entity typing, which validates that our model gains improved conceptual knowledge with concept-enhanced pre-training.
摘要：
预训练语言模型 (PLM) 已在最先进的自然语言处理方法中盛行，并且进一步提出了知识增强型 PLM，以提高知识密集型任务中的模型性能。然而，概念知识作为人类认知的一种重要知识，在这一领域的研究仍然不够深入。这限制了 PLM 在需要类人认知的场景中的性能，例如通过概念理解长尾实体。在本文中，我们提出了 ConcEPT，它代表语言模型的概念增强预训练，将概念知识注入到 PLM 中。 ConcEPT 利用外部分类法和实体概念预测，这是一种新颖的预训练目标，用于预测预训练上下文中提到的实体概念。与之前的概念增强方法不同，ConcEPT 可以轻松适应各种下游应用，无需实体链接或概念映射。大量实验的结果显示了 ConcEPT 在实体类型等四项任务中的有效性，这验证了我们的模型通过概念增强的预训练获得了改进的概念知识。

Title: Integrating Physician Diagnostic Logic into Large Language Models: Preference Learning from Process Feedback. (arXiv:2401.05695v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05695
Code URL: null
Copy Paste: [[2401.05695]] Integrating Physician Diagnostic Logic into Large Language Models: Preference Learning from Process Feedback(http://arxiv.org/abs/2401.05695)
Summary:
The use of large language models in medical dialogue generation has garnered significant attention, with a focus on improving response quality and fluency. While previous studies have made progress in optimizing model performance for single-round medical Q&A tasks, there is a need to enhance the model's capability for multi-round conversations to avoid logical inconsistencies. To address this, we propose an approach called preference learning from process feedback (PLPF), which integrates the doctor's diagnostic logic into LLMs. PLPF involves rule modeling, preference data generation, and preference alignment to train the model to adhere to the diagnostic process. Experimental results using Standardized Patient Testing show that PLPF enhances the diagnostic accuracy of the baseline model in medical conversations by 17.6%, outperforming traditional reinforcement learning from human feedback. Additionally, PLPF demonstrates effectiveness in both multi-round and single-round dialogue tasks, showcasing its potential for improving medical dialogue generation.
摘要：
大型语言模型在医学对话生成中的使用引起了广泛关注，重点是提高响应质量和流畅性。虽然之前的研究在优化单轮医疗问答任务的模型性能方面取得了进展，但仍需要增强模型的多轮对话能力，以避免逻辑不一致。为了解决这个问题，我们提出了一种称为从过程反馈中进行偏好学习（PLPF）的方法，它将医生的诊断逻辑集成到法学硕士中。 PLPF 涉及规则建模、偏好数据生成和偏好对齐，以训练模型遵循诊断过程。使用标准化患者测试的实验结果表明，PLPF 将医疗对话中基线模型的诊断准确性提高了 17.6%，优于基于人类反馈的传统强化学习。此外，PLPF 在多轮和单轮对话任务中都表现出了有效性，展示了其改善医学对话生成的潜力。

Title: CAT-LLM: Prompting Large Language Models with Text Style Definition for Chinese Article-style Transfer. (arXiv:2401.05707v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05707
Code URL: null
Copy Paste: [[2401.05707]] CAT-LLM: Prompting Large Language Models with Text Style Definition for Chinese Article-style Transfer(http://arxiv.org/abs/2401.05707)
Summary:
Text style transfer is increasingly prominent in online entertainment and social media. However, existing research mainly concentrates on style transfer within individual English sentences, while ignoring the complexity of long Chinese texts, which limits the wider applicability of style transfer in the digital media realm. To bridge this gap, we propose a Chinese Article-style Transfer framework (CAT-LLM), leveraging the capabilities of Large Language Models (LLMs). CAT-LLM incorporates a bespoke, pluggable Text Style Definition (TSD) module aimed at comprehensively analyzing text features in articles, prompting LLMs to efficiently transfer Chinese article style. The TSD module integrates a series of machine learning algorithms to analyze article style at both the word and sentence levels, thereby helping LLMs thoroughly grasp the target style without compromising the integrity of the original text. In addition, this module supports dynamic expansion of internal style trees, showcasing robust compatibility and allowing flexible optimization in subsequent research. Moreover, we select five Chinese articles with distinct styles and create five parallel datasets using ChatGPT, enhancing the accuracy of model performance evaluation and establishing a novel paradigm for evaluating subsequent research on article-style transfer. Extensive experimental results affirm that CAT-LLM outperforms current research in terms of transfer accuracy and content preservation, and has remarkable applicability to various types of LLMs.
摘要：
文本风格迁移在在线娱乐和社交媒体中越来越突出。然而，现有的研究主要集中在单个英语句子内的风格迁移，而忽略了中文长文本的复杂性，这限制了风格迁移在数字媒体领域的更广泛应用。为了弥补这一差距，我们提出了一个中文文章式传输框架（CAT-LLM），利用大型语言模型（LLM）的功能。 CAT-LLM采用了定制的、可插拔的文本风格定义（TSD）模块，旨在全面分析文章中的文本特征，促使法学硕士高效地迁移中文文章风格。 TSD模块集成了一系列机器学习算法，从单词和句子层面分析文章风格，从而帮助LLM在不损害原文完整性的情况下彻底掌握目标风格。此外，该模块支持内部样式树的动态扩展，具有强大的兼容性，并允许后续研究中的灵活优化。此外，我们选择了五篇风格独特的中文文章，并使用 ChatGPT 创建了五个并行数据集，提高了模型性能评估的准确性，并为评估后续文章风格迁移研究建立了一个新的范式。大量的实验结果证实，CAT-LLM在传输准确性和内容保存方面优于当前研究，并且对各种类型的LLM具有显着的适用性。

Title: Zero Resource Cross-Lingual Part Of Speech Tagging. (arXiv:2401.05727v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05727
Code URL: null
Copy Paste: [[2401.05727]] Zero Resource Cross-Lingual Part Of Speech Tagging(http://arxiv.org/abs/2401.05727)
Summary:
Part-of-speech (POS) tagging in zero-resource settings can be an effective approach for low-resource languages when no labeled training data is available. Existing systems use two main techniques for POS tagging: using pretrained multilingual large language models (LLMs), or projecting the source language labels into the zero-resource target language and training a sequence labeling model on them. We explore the latter approach using an off-the-shelf alignment module and train a hidden Markov model (HMM) to predict the POS tags. We evaluate a transfer learning setup with English as the source language and French, German, and Spanish as target languages for part-of-speech tagging. Our conclusion is that projected alignment data in a zero-resource language can be beneficial for predicting POS tags.
摘要：
当没有可用的标记训练数据时，零资源设置中的词性标记可能是低资源语言的有效方法。现有系统使用两种主要的词性标注技术，即预训练的多语言大语言模型（LLM）或将源语言标签投影到零资源目标语言并在其上训练序列标注模型。我们使用现成的对齐模块探索后一种方法，并训练隐马尔可夫模型（HMM）来预测 POS 标签。我们以英语作为源语言，以法语、德语和西班牙语作为词性标记的目标语言来评估迁移学习设置。我们的结论是，零资源语言的投影对齐数据有助于预测 POS 标签。
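
As a rough illustration of the HMM tagging step described in the abstract above, the sketch below decodes the most likely tag sequence with the Viterbi algorithm. The tiny tag set, the probability tables, and the function name `viterbi` are invented for illustration and are not taken from the paper; in the paper's setting the parameters would presumably be estimated from tags projected through word alignments.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-8):
    """Return the most probable tag sequence for `words` under an HMM."""
    def logp(table, key):
        # Unseen events get a small floor probability instead of zero.
        return math.log(table.get(key, floor))

    # best[i][t] = best log-probability of any path ending in tag t at position i
    best = [{t: logp(start_p, t) + logp(emit_p[t], words[0]) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + logp(trans_p[p], t)) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + logp(emit_p[t], words[i])
            back[i][t] = prev
    # Follow back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

# Hypothetical toy parameters (not from the paper).
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET": {"NOUN": 0.9, "VERB": 0.05, "DET": 0.05},
    "NOUN": {"VERB": 0.7, "NOUN": 0.2, "DET": 0.1},
    "VERB": {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1},
}
EMIT = {
    "DET": {"le": 0.9},
    "NOUN": {"chat": 0.8, "dort": 0.05},
    "VERB": {"dort": 0.9},
}

print(viterbi(["le", "chat", "dort"], TAGS, START, TRANS, EMIT))
# → ['DET', 'NOUN', 'VERB']
```

Working in log space avoids numerical underflow on longer sentences, and the probability floor stands in for whatever smoothing a real system would use.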

Title: Probing Structured Semantics Understanding and Generation of Language Models via Question Answering. (arXiv:2401.05777v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05777
Code URL: null
Copy Paste: [[2401.05777]] Probing Structured Semantics Understanding and Generation of Language Models via Question Answering(http://arxiv.org/abs/2401.05777)
Summary:
Recent advancement in the capabilities of large language models (LLMs) has triggered a new surge in LLM evaluation. Most recent evaluation works tend to evaluate the comprehensive ability of LLMs over a series of tasks. However, the deep structure understanding of natural language is rarely explored. In this work, we examine the ability of LLMs to deal with structured semantics on question answering tasks with the help of human-constructed formal languages. Specifically, we implement the inter-conversion of natural and formal language through in-context learning of LLMs to verify their ability to understand and generate structured logical forms. Extensive experiments with models of different sizes and in different formal languages show that today's state-of-the-art LLMs' understanding of logical forms can approach human level overall, but there is still plenty of room for improvement in generating correct logical forms, which suggests that it is more effective to use LLMs to generate more natural language training data to reinforce a small model than to directly answer questions with LLMs. Moreover, our results also indicate that models exhibit considerable sensitivity to different formal languages. In general, a formal language with a lower formalization level, i.e. one more similar to natural language, is more LLM-friendly.
摘要：
最近大型语言模型 (LLM) 功能的进步引发了 LLM 评估的新一轮激增。最近的评估工作倾向于评估法学硕士在一系列任务上的综合能力。然而，自然语言的深层结构理解却很少被探索。在这项工作中，我们研究了法学硕士在人类构建的形式语言的帮助下处理问答任务中的结构化语义的能力。具体来说，我们通过法学硕士的情境学习实现自然语言和形式语言的相互转换，以验证他们理解和生成结构化逻辑形式的能力。对不同规模和不同形式语言的模型进行的大量实验表明，当今最先进的法学硕士对逻辑形式的理解总体上可以接近人类水平，但在生成正确的逻辑形式方面仍然有很大的空间，这表明使用法学硕士生成更多自然语言训练数据来强化小型模型比直接使用法学硕士回答问题更有效。此外，我们的结果还表明模型对不同的形式语言表现出相当的敏感性。一般来说，形式化程度越低的形式语言，即与自然语言越相似，就越适合法学硕士。

Title: Discovering Low-rank Subspaces for Language-agnostic Multilingual Representations. (arXiv:2401.05792v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05792
Code URL: null
Copy Paste: [[2401.05792]] Discovering Low-rank Subspaces for Language-agnostic Multilingual Representations(http://arxiv.org/abs/2401.05792)
Summary:
Large pretrained multilingual language models (ML-LMs) have shown remarkable capabilities of zero-shot cross-lingual transfer, without direct cross-lingual supervision. While these results are promising, follow-up works found that, within the multilingual embedding spaces, there exists strong language identity information which hinders the expression of linguistic factors shared across languages. For semantic tasks like cross-lingual sentence retrieval, it is desired to remove such language identity signals to fully leverage semantic information. In this work, we provide a novel view of projecting away language-specific factors from a multilingual embedding space. Specifically, we discover that there exists a low-rank subspace that primarily encodes information irrelevant to semantics (e.g., syntactic information). To identify this subspace, we present a simple but effective unsupervised method based on singular value decomposition with multiple monolingual corpora as input. Once the subspace is found, we can directly project the original embeddings into the null space to boost language agnosticism without finetuning. We systematically evaluate our method on various tasks including the challenging language-agnostic QA retrieval task. Empirical results show that applying our method consistently leads to improvements over commonly used ML-LMs.
摘要：
大型预训练多语言语言模型 (ML-LM) 已显示出卓越的零样本跨语言迁移能力，无需直接跨语言监督。虽然这些结果很有希望，但后续工作发现，在多语言嵌入空间内，存在强大的语言身份信息，阻碍了跨语言共享的语言因素的表达。对于跨语言句子检索等语义任务，需要去除此类语言标识信号以充分利用语义信息。在这项工作中，我们提供了一种从多语言嵌入空间中投影特定于语言的因素的新颖观点。具体来说，我们发现存在一个低秩子空间，主要编码与语义无关的信息（例如句法信息）。为了识别这个子空间，我们提出了一种简单但有效的无监督方法，该方法基于以多个单语语料库作为输入的奇异值分解。一旦找到子空间，我们就可以直接将原始嵌入投影到零空间中，以增强语言不可知论，而无需进行微调。我们系统地评估我们在各种任务上的方法，包括具有挑战性的与语言无关的 QA 检索任务。实证结果表明，持续应用我们的方法可以比常用的 ML-LM 有所改进。
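
The subspace idea in the abstract above can be sketched with NumPy. This is a deliberately simplified stand-in, not the paper's procedure: it assumes the language-identity signal shows up in the per-language mean embeddings, takes the top right singular vectors of the centered means as the language subspace, and projects embeddings onto its orthogonal complement. The function names, the synthetic data, and the choice of `rank` are all assumptions made for illustration.

```python
import numpy as np

def language_subspace(embeddings_by_lang, rank):
    """Estimate an orthonormal basis (d x rank) for language-identity directions."""
    # One mean vector per language, shape (num_langs, d).
    means = np.stack([e.mean(axis=0) for e in embeddings_by_lang.values()])
    # Remove the component shared by all languages, then factorize.
    centered = means - means.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:rank].T

def project_out(embeddings, basis):
    """Project embeddings onto the orthogonal complement of span(basis)."""
    return embeddings - embeddings @ basis @ basis.T

# Synthetic data: identical "semantic" vectors plus one offset per language.
rng = np.random.default_rng(0)
d = 16
shared = rng.normal(size=(100, d))
offsets = {lang: rng.normal(size=d) for lang in ("en", "fr", "de")}
data = {lang: shared + off for lang, off in offsets.items()}

basis = language_subspace(data, rank=2)
cleaned = {lang: project_out(e, basis) for lang, e in data.items()}

gap_before = np.linalg.norm(data["en"].mean(0) - data["fr"].mean(0))
gap_after = np.linalg.norm(cleaned["en"].mean(0) - cleaned["fr"].mean(0))
print(f"gap between language means: {gap_before:.3f} -> {gap_after:.3e}")
```

With three languages the centered means span at most two directions, so `rank=2` removes the synthetic language offsets entirely; on real embeddings the rank would be a hyperparameter to tune.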

Title: Towards Boosting Many-to-Many Multilingual Machine Translation with Large Language Models. (arXiv:2401.05861v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05861
Code URL: null
Copy Paste: [[2401.05861]] Towards Boosting Many-to-Many Multilingual Machine Translation with Large Language Models(http://arxiv.org/abs/2401.05861)
Summary:
The training paradigm for machine translation has gradually shifted, from learning neural machine translation (NMT) models with extensive parallel corpora to instruction finetuning on pretrained multilingual large language models (LLMs) with high-quality translation pairs. In this paper, we focus on boosting the many-to-many multilingual translation performance of LLMs with an emphasis on zero-shot translation directions. We demonstrate that prompt strategies adopted during instruction finetuning are crucial to zero-shot translation performance and introduce a cross-lingual consistency regularization, XConST, to bridge the representation gap among different languages and improve zero-shot translation performance. XConST is not a new method, but a version of CrossConST (Gao et al., 2023a) adapted for multilingual finetuning on LLMs with translation instructions. Experimental results on ALMA (Xu et al., 2023) and LLaMA-2 (Touvron et al., 2023) show that our approach consistently improves translation performance. Our implementations are available at https://github.com/gpengzhi/CrossConST-LLM.
摘要：
机器翻译的训练范式已逐渐转变，从学习具有广泛并行语料库的神经机器翻译 (NMT) 模型，到对具有高质量翻译对的预训练多语言大语言模型 (LLM) 进行指令微调。在本文中，我们专注于提高法学硕士的多对多多语言翻译性能，重点是零样本翻译方向。我们证明，在指令微调期间采用的即时策略对于零样本翻译性能至关重要，并引入跨语言一致性正则化 XConST，以弥合不同语言之间的表示差距并提高零样本翻译性能。 XConST 不是一种新方法，而是 CrossConST（Gao 等人，2023a）的一个版本，适用于带有翻译指令的 LLM 多语言微调。 ALMA (Xu et al., 2023) 和 LLaMA-2 (Touvron et al., 2023) 的实验结果表明，我们的方法持续提高了翻译性能。我们的实现可在 https://github.com/gpengzhi/CrossConST-LLM 获取。

Title: EpilepsyLLM: Domain-Specific Large Language Model Fine-tuned with Epilepsy Medical Knowledge. (arXiv:2401.05908v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05908
Code URL: null
Copy Paste: [[2401.05908]] EpilepsyLLM: Domain-Specific Large Language Model Fine-tuned with Epilepsy Medical Knowledge(http://arxiv.org/abs/2401.05908)
Summary:
With large training datasets and massive amounts of computing resources, large language models (LLMs) achieve remarkable performance in comprehension and generation. Building on those powerful LLMs, a model fine-tuned with domain-specific datasets possesses more specialized knowledge and is thus more practical, as with medical LLMs. However, the existing fine-tuned medical LLMs are limited to general medical knowledge in English. For disease-specific problems, the model's response is inaccurate and sometimes even completely irrelevant, especially when using a language other than English. In this work, we focus on the particular disease of epilepsy in the Japanese language and introduce a customized LLM termed EpilepsyLLM. Our model is obtained by fine-tuning a pre-trained LLM using datasets from the epilepsy domain. The datasets contain basic information about the disease, common treatment methods and drugs, and important notes for life and work. The experimental results demonstrate that EpilepsyLLM can provide more reliable and specialized medical knowledge responses.
摘要：
凭借庞大的训练数据集和海量的计算源，大型语言模型（LLM）在综合能力和生成能力方面表现出色。基于这些强大的LLM，用特定领域的数据集进行微调的模型拥有更专业的知识，因此比医学LLM更实用。然而，现有的微调医学法学硕士仅限于英语语言的一般医学知识。对于特定疾病的问题，模型的响应不准确，有时甚至完全无关，尤其是在使用英语以外的语言时。在这项工作中，我们专注于日语中的癫痫这一特殊疾病，并引入了名为 EpilepsyLLM 的定制法学硕士。我们的模型是通过使用癫痫领域的数据集进行微调技术，从预训练的法学硕士进行训练的。数据集包含疾病的基本信息、常用治疗方法和药物的知识以及生活和工作中的重要注意事项。实验结果表明EpilepsyLLM可以提供更可靠、更专业的医学知识应答。

Title: LinguAlchemy: Fusing Typological and Geographical Elements for Unseen Language Generalization. (arXiv:2401.06034v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.06034
Code URL: null
Copy Paste: [[2401.06034]] LinguAlchemy: Fusing Typological and Geographical Elements for Unseen Language Generalization(http://arxiv.org/abs/2401.06034)
Summary:
Pretrained language models (PLMs) have shown remarkable generalization toward multiple tasks and languages. Nonetheless, the generalization of PLMs towards unseen languages is poor, resulting in significantly worse language performance, or even generating nonsensical responses that are comparable to a random baseline. This limitation has been a longstanding problem of PLMs, raising the problem of diversity and equal access to language modeling technology. In this work, we solve this limitation by introducing LinguAlchemy, a regularization technique that incorporates various aspects of languages, covering typological, geographical, and phylogenetic features, and constrains the resulting representation of PLMs to better characterize the corresponding linguistic constraints. LinguAlchemy significantly improves the accuracy of mBERT and XLM-R on unseen languages by ~18% and ~2%, respectively, compared to fully finetuned models, displaying a high degree of unseen language generalization. We further introduce AlchemyScale and AlchemyTune, extensions of LinguAlchemy that adjust the linguistic regularization weights automatically, alleviating the need for hyperparameter search. LinguAlchemy enables better cross-lingual generalization to unseen languages, which is vital for better inclusivity and accessibility of PLMs.
摘要：
预训练语言模型 (PLM) 对多种任务和语言表现出了卓越的泛化能力。尽管如此，PLM 对未见过的语言的泛化能力很差，导致语言性能明显较差，甚至产生与随机基线相当的无意义响应。这种限制一直是 PLM 长期存在的问题，引发了语言建模技术的多样性和平等访问问题。在这项工作中，我们通过引入 LinguAlchemy 来解决这一限制，这是一种正则化技术，它结合了语言的各个方面，涵盖类型学、地理和系统发育，约束 PLM 的结果表示，以更好地表征相应的语言学约束。与完全微调的模型相比，LinguAlchemy 显着提高了 mBERT 和 XLM-R 在未见过的语言上的准确度性能，分别提高了约 18% 和约 2%，并显示出高度的未见过的语言泛化能力。我们进一步介绍了 AlchemyScale 和 AlchemyTune，它们是 LinguAlchemy 的扩展，可自动调整语言正则化权重，从而减轻超参数搜索的需要。 LinguAlchemy 能够更好地跨语言泛化到未见过的语言，这对于提高 PLM 的包容性和可访问性至关重要。

Title: DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. (arXiv:2401.06066v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.06066
Code URL: null
Copy Paste: [[2401.06066]] DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models(http://arxiv.org/abs/2401.06066)
Summary:
In the era of large language models, Mixture-of-Experts (MoE) is a promising architecture for managing computational costs when scaling up model parameters. However, conventional MoE architectures like GShard, which activate the top-$K$ out of $N$ experts, face challenges in ensuring expert specialization, i.e. each expert acquires non-overlapping and focused knowledge. In response, we propose the DeepSeekMoE architecture towards ultimate expert specialization. It involves two principal strategies: (1) finely segmenting the experts into $mN$ ones and activating $mK$ from them, allowing for a more flexible combination of activated experts; (2) isolating $K_s$ experts as shared ones, aiming at capturing common knowledge and mitigating redundancy in routed experts. Starting from a modest scale with 2B parameters, we demonstrate that DeepSeekMoE 2B achieves comparable performance with GShard 2.9B, which has 1.5 times the expert parameters and computation. In addition, DeepSeekMoE 2B nearly approaches the performance of its dense counterpart with the same number of total parameters, which set the upper bound of MoE models. Subsequently, we scale up DeepSeekMoE to 16B parameters and show that it achieves comparable performance with LLaMA2 7B, with only about 40% of computations. Further, our preliminary efforts to scale up DeepSeekMoE to 145B parameters consistently validate its substantial advantages over the GShard architecture, and show its performance comparable with DeepSeek 67B, using only 28.5% (maybe even 18.2%) of computations.
摘要：
在大型语言模型时代，专家混合 (MoE) 是一种很有前途的架构，用于在扩展模型参数时管理计算成本。然而，传统的 MoE 架构（如 GShard）激活了 $N$ 专家中的顶级 $K$，在确保专家专业化方面面临着挑战，即每个专家都获得不重叠且有针对性的知识。作为回应，我们提出了 DeepSeekMoE 架构，以实现最终的专家专业化。它涉及两个主要策略：（1）将专家精细分割为$mN$个专家，并从中激活$mK$，从而允许更灵活的激活专家组合； (2) 将$K_s$专家隔离为共享专家，旨在捕获共同知识并减少路由专家中的冗余。从具有 2B 参数的适度规模开始，我们证明 DeepSeekMoE 2B 实现了与 GShard 2.9B 相当的性能，后者的参数和计算量是专家参数和计算的 1.5 倍。此外，DeepSeekMoE 2B 在总参数数量相同的情况下几乎接近其密集对应模型的性能，这设定了 MoE 模型的上限。随后，我们将 DeepSeekMoE 扩展到 16B 参数，并表明它实现了与 LLaMA2 7B 相当的性能，而计算量仅为约 40%。此外，我们将 DeepSeekMoE 扩展到 145B 参数的初步努力一致验证了其相对于 GShard 架构的巨大优势，并显示其性能与 DeepSeek 67B 相当，仅使用 28.5%（甚至可能是 18.2%）的计算量。
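
The two strategies in the abstract above (finer expert segmentation and always-on shared experts) can be illustrated with a toy routing function. This is only a sketch under stated assumptions: the function name `select_experts`, the gate scores, and the convention that the shared experts count toward the total number of active experts are choices made here for illustration, not details confirmed by the abstract.

```python
import math

def select_experts(gate_scores, num_shared, num_active):
    """Pick active expert indices: shared experts first, then top-scoring routed ones.

    Experts 0..num_shared-1 are treated as shared (always on); the rest are routed.
    `num_active` is the total number of experts that run for this token.
    """
    shared = list(range(num_shared))
    routed = range(num_shared, len(gate_scores))
    top_routed = sorted(routed, key=lambda i: gate_scores[i], reverse=True)
    return shared + sorted(top_routed[: num_active - num_shared])

# Conventional routing: N = 4 experts, K = 2 active, no shared experts.
print(select_experts([0.1, 0.7, 0.15, 0.05], num_shared=0, num_active=2))
# → [1, 2]

# Fine-grained routing with m = 2: mN = 8 smaller experts, mK = 4 active, K_s = 1 shared.
scores = [0.0, 0.30, 0.05, 0.20, 0.10, 0.02, 0.25, 0.08]
print(select_experts(scores, num_shared=1, num_active=4))
# → [0, 1, 3, 6]

# Number of possible sets of routed experts, before and after splitting.
print(math.comb(4, 2), math.comb(7, 3))
# → 6 35
```

The last line shows the "more flexible combination" point in miniature: with the same fraction of experts active, splitting each expert in two raises the number of possible routed combinations from 6 to 35 in this toy setting.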

Title: Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint. (arXiv:2401.06081v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.06081
Code URL: null
Copy Paste: [[2401.06081]] Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint(http://arxiv.org/abs/2401.06081)
Summary:
Reinforcement learning (RL) has been widely used in training large language models (LLMs) for preventing unexpected outputs, e.g., reducing harmfulness and errors. However, existing RL methods mostly adopt an instance-level reward, which is unable to provide fine-grained supervision for complex reasoning tasks and cannot focus on the few key tokens that lead to the incorrectness. To address this, we propose a new RL method named RLMEC that incorporates a generative model as the reward model, which is trained on the erroneous solution rewriting task under the minimum editing constraint and can produce token-level rewards for RL training. Based on the generative reward model, we design a token-level RL objective for training and an imitation-based regularization for stabilizing the RL process. Both objectives focus on learning the key tokens for the erroneous solution, reducing the effect of other unimportant tokens. The experimental results on mathematical tasks and question-answering tasks have demonstrated the effectiveness of our approach. Our code and data are available at https://github.com/RUCAIBox/RLMEC.
摘要：
强化学习（RL）已广泛用于训练大型语言模型〜（LLM）以防止意外输出，例如减少危害和错误。然而，现有的强化学习方法大多采用实例级奖励，无法为复杂的推理任务提供细粒度的监督，也无法关注导致错误的少数关键标记。为了解决这个问题，我们提出了一种名为 \textbf{RLMEC} 的新 RL 方法，该方法将生成模型作为奖励模型，在最小编辑约束下通过错误解重写任务进行训练，并且可以为 RL 产生 token 级奖励训练。基于生成奖励模型，我们设计了用于训练的代币级强化学习目标，以及用于稳定强化学习过程的基于模仿的正则化。这两个目标都集中在学习错误解决方案的关键标记，减少其他不重要标记的影响。数学任务和问答任务的实验结果证明了我们方法的有效性。我们的代码和数据可在 \url{https://github.com/RUCAIBox/RLMEC} 获取。

Title: Extreme Compression of Large Language Models via Additive Quantization. (arXiv:2401.06118v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.06118
Code URL: null
Copy Paste: [[2401.06118]] Extreme Compression of Large Language Models via Additive Quantization(http://arxiv.org/abs/2401.06118)
Summary:
The emergence of accurate open large language models (LLMs) has led to a race towards quantization techniques for such models enabling execution on end-user devices. In this paper, we revisit the problem of "extreme" LLM compression (defined as targeting extremely low bit counts, such as 2 to 3 bits per parameter) from the point of view of classic methods in Multi-Codebook Quantization (MCQ). Our work builds on top of Additive Quantization, a classic algorithm from the MCQ family, and adapts it to the quantization of language models. The resulting algorithm advances the state-of-the-art in LLM compression, outperforming all recently-proposed techniques in terms of accuracy at a given compression budget. For instance, when compressing Llama 2 models to 2 bits per parameter, our algorithm quantizes the 7B model to 6.93 perplexity (a 1.29 improvement relative to the best prior work, and 1.81 points from FP16), the 13B model to 5.70 perplexity (a 0.36 improvement) and the 70B model to 3.94 perplexity (a 0.22 improvement) on WikiText2. We release our implementation of Additive Quantization for Language Models (AQLM) as a baseline to facilitate future research in LLM quantization.
摘要：
准确的开放式大语言模型 (LLM) 的出现引发了一场针对此类模型的量化技术的竞赛，这些技术可在最终用户设备上执行。在本文中，我们从多码本量化 (MCQ) 中的经典方法的角度重新审视“极端”LLM 压缩问题——定义为针对极低的位数，例如每个参数 2 到 3 位。我们的工作建立在加性量化（MCQ 系列的经典算法）之上，并使其适应语言模型的量化。由此产生的算法推进了 LLM 压缩的最先进技术，在给定压缩预算的精度方面优于所有最近提出的技术。例如，当将 Llama 2 模型压缩到每个参数 2 位时，我们的算法将 7B 模型量化为 6.93 困惑度（相对于之前最好的工作提高了 1.29，与 FP16 相比提高了 1.81 点），将 13B 模型量化为 5.70 困惑度（a . 36 改进）和 WikiText2 上的 70B 模型到 3.94 困惑度（0.22 改进）。我们发布了语言模型 AQLM 的加性量化的实现作为基准，以促进 LLM 量化的未来研究。
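
To make the multi-codebook idea in the abstract above concrete, the sketch below represents a small group of weights as the sum of one codeword from each of several codebooks. Note the substitution: the encoder here is a simple greedy residual search, a simpler relative of additive quantization, and not the algorithm the paper uses. The codebooks, group size, and function names are hypothetical.

```python
import math

def encode_greedy(vector, codebooks):
    """Pick one codeword per codebook, greedily minimizing the remaining residual."""
    residual = list(vector)
    codes = []
    for book in codebooks:
        best = min(
            range(len(book)),
            key=lambda j: sum((r - c) ** 2 for r, c in zip(residual, book[j])),
        )
        codes.append(best)
        residual = [r - c for r, c in zip(residual, book[best])]
    return codes

def decode(codes, codebooks):
    """Reconstruct the vector as the sum of the selected codewords."""
    out = [0.0] * len(codebooks[0][0])
    for j, book in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, book[j])]
    return out

def bits_per_parameter(num_codebooks, codebook_size, group_size):
    """Index storage cost per weight, ignoring the codebooks themselves."""
    return num_codebooks * math.log2(codebook_size) / group_size

# Two hypothetical codebooks for groups of 4 weights: coarse, then fine.
coarse = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0],
          [-1.0, 0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0]]
fine = [[0.25, 0.0, 0.0, 0.0], [0.0, -0.25, 0.0, 0.0],
        [0.0, 0.0, 0.25, 0.0], [0.0, 0.0, 0.0, -0.25]]

weights = [1.0, 0.75, 0.0, 0.0]
codes = encode_greedy(weights, [coarse, fine])
print(codes, decode(codes, [coarse, fine]))
# → [0, 1] [1.0, 0.75, 0.0, 0.0]
print(bits_per_parameter(num_codebooks=2, codebook_size=4, group_size=4))
# → 1.0
```

The bit count follows from storing one index of log2(codebook size) bits per codebook for every group of weights; the codebooks themselves add an overhead that this helper ignores.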

Title: A Closer Look at AUROC and AUPRC under Class Imbalance. (arXiv:2401.06091v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.06091
Code URL: null
Copy Paste: [[2401.06091]] A Closer Look at AUROC and AUPRC under Class Imbalance(http://arxiv.org/abs/2401.06091)
Summary:
In machine learning (ML), a widespread adage is that the area under the precision-recall curve (AUPRC) is a superior metric for model comparison to the area under the receiver operating characteristic (AUROC) for binary classification tasks with class imbalance. This paper challenges this notion through novel mathematical analysis, illustrating that AUROC and AUPRC can be concisely related in probabilistic terms. We demonstrate that AUPRC, contrary to popular belief, is not superior in cases of class imbalance and might even be a harmful metric, given its inclination to unduly favor model improvements in subpopulations with more frequent positive labels. This bias can inadvertently heighten algorithmic disparities. Prompted by these insights, a thorough review of existing ML literature was conducted, utilizing large language models to analyze over 1.5 million papers from arXiv. Our investigation focused on the prevalence and substantiation of the purported AUPRC superiority. The results expose a significant deficit in empirical backing and a trend of misattributions that have fuelled the widespread acceptance of AUPRC's supposed advantages. Our findings represent a dual contribution: a significant technical advancement in understanding metric behaviors and a stark warning about unchecked assumptions in the ML community. All experiments are accessible at https://github.com/mmcdermott/AUC_is_all_you_need.
摘要：
在机器学习 (ML) 中，有一句广为流传的格言是，对于具有类别的二元分类任务，精确回忆曲线下面积 (AUPRC) 是模型比较的一个优越指标，优于接收者操作特征下面积 (AUROC)不平衡。本文通过新颖的数学分析挑战了这一概念，说明 AUROC 和 AUPRC 可以用概率术语简洁地关联起来。我们证明，与普遍看法相反，AUPRC 在类别不平衡的情况下并不优越，甚至可能是一个有害的指标，因为它倾向于过度支持在具有更频繁的积极标签的亚群中进行模型改进。这种偏差可能会无意中加剧算法差异。在这些见解的推动下，我们对现有 ML 文献进行了彻底审查，利用大型语言模型分析了 arXiv 超过 150 万篇论文。我们的调查重点是所谓的 AUPRC 优越性的普遍性和证实。结果暴露了实证支持的严重不足以及错误归因的趋势，这些都促使人们广泛接受 AUPRC 所谓的优势。我们的发现代表了双重贡献：理解度量行为方面的重大技术进步，以及对机器学习社区中未经检查的假设的严厉警告。所有实验均可在 https://github.com/mmcdermott/AUC_is_all_you_need 访问。
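
For readers who want the two metrics from the abstract above side by side, here is a minimal pure-Python sketch. It only implements the standard definitions (AUROC as the probability that a positive outranks a negative, AUPRC as average precision); it does not reproduce the paper's analysis of subpopulations. The scores and labels are made up, and the average-precision helper assumes there are no tied scores.

```python
def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auprc(scores, labels):
    """Average precision: mean of the precision at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Hypothetical model scores and binary labels, already sorted by score.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

print(auroc(scores, labels))            # → 0.625
print(round(auprc(scores, labels), 4))  # → 0.7095
```

Because precision depends on how many positives there are while the pairwise comparison behind AUROC does not, the two numbers respond differently to changes in class prevalence, which is the property the paper examines.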

gpt

llm

Title: POMP: Probability-driven Meta-graph Prompter for LLMs in Low-resource Unsupervised Neural Machine Translation. (arXiv:2401.05596v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05596
Code URL: null
Copy Paste: [[2401.05596]] POMP: Probability-driven Meta-graph Prompter for LLMs in Low-resource Unsupervised Neural Machine Translation(http://arxiv.org/abs/2401.05596)
Summary:
Low-resource languages (LRLs) face challenges in supervised neural machine translation due to limited parallel data, prompting research into unsupervised methods. Unsupervised neural machine translation (UNMT) methods, including back-translation, transfer learning, and pivot-based translation, offer practical solutions for LRL translation, but they are hindered by issues like synthetic data noise, language bias, and error propagation, which can potentially be mitigated by Large Language Models (LLMs). LLMs have advanced NMT with in-context learning (ICL) and supervised fine-tuning methods, but insufficient training data results in poor performance in LRLs. We argue that LLMs can mitigate the linguistic noise with auxiliary languages to improve translations in LRLs. In this paper, we propose Probability-driven Meta-graph Prompter (POMP), a novel approach employing a dynamic, sampling-based graph of multiple auxiliary languages to enhance LLMs' translation capabilities for LRLs. POMP involves constructing a directed acyclic meta-graph for each source language, from which we dynamically sample multiple paths to prompt LLMs to mitigate the linguistic noise and improve translations during training. We use the BLEURT metric to evaluate the translations and back-propagate rewards, estimated by scores, to update the probabilities of auxiliary languages in the paths. Our experiments show significant improvements in the translation quality of three LRLs, demonstrating the effectiveness of our approach.
摘要：
由于并行数据有限，低资源语言 (LRL) 在监督神经机器翻译方面面临挑战，这促使人们对无监督方法进行研究。无监督神经机器翻译 (UNMT) 方法，包括反向翻译、迁移学习和基于枢轴的翻译，为 LRL 翻译提供了实用的解决方案，但它们受到合成数据噪声、语言偏差和错误传播等问题的阻碍，这些问题可能会导致大型语言模型 (LLM) 可能会缓解这一问题。 LLM 拥有带有上下文学习 (ICL) 和监督微调方法的先进 NMT，但训练数据不足会导致 LRL 表现不佳。我们认为法学硕士可以通过辅助语言减轻语言噪音，从而改善法学硕士的翻译。在本文中，我们提出了概率驱动的元图提示器（POMP），这是一种采用动态、基于采样的多种辅助语言图来增强法学硕士对 LRL 的翻译能力的新颖方法。 POMP 涉及为每种源语言构建一个有向非循环元图，我们从中动态采样多个路径，以提示法学硕士在训练期间减轻语言噪音并改进翻译。我们使用 BLEURT 指标来评估翻译和反向传播奖励（通过分数估计），以更新路径中辅助语言的概率。我们的实验表明三个 LRL 的翻译质量显着提高，证明了我们方法的有效性。

Title: Designing Heterogeneous LLM Agents for Financial Sentiment Analysis. (arXiv:2401.05799v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05799
Code URL: null
Copy Paste: [[2401.05799]] Designing Heterogeneous LLM Agents for Financial Sentiment Analysis(http://arxiv.org/abs/2401.05799)
Summary:
Large language models (LLMs) have drastically changed the possible ways to design intelligent systems, shifting the focus from massive data acquisition and new model training to human alignment and strategic elicitation of the full potential of existing pre-trained models. This paradigm shift, however, is not fully realized in financial sentiment analysis (FSA), due to the discriminative nature of this task and a lack of prescriptive knowledge of how to leverage generative models in such a context. This study investigates the effectiveness of the new paradigm, i.e., using LLMs without fine-tuning for FSA. Rooted in Minsky's theory of mind and emotions, a design framework with heterogeneous LLM agents is proposed. The framework instantiates specialized agents using prior domain knowledge of the types of FSA errors and reasons on the aggregated agent discussions. Comprehensive evaluation on FSA datasets shows that the framework yields better accuracies, especially when the discussions are substantial. This study contributes to the design foundations and paves new avenues for LLM-based FSA. Implications for business and management are also discussed.
摘要：
大型语言模型 (LLM) 极大地改变了设计智能系统的可能方式，将重点从海量数据采集和新的建模训练转移到人类调整和战略性激发现有预训练模型的全部潜力。然而，由于这项任务的歧视性以及缺乏如何在这种背景下利用生成模型的规范性知识，这种范式转变在金融情绪分析（FSA）中并未完全实现。本研究调查了新范式的有效性，即使用法学硕士而不对 FSA 进行微调。植根于明斯基的心灵和情感理论，提出了一种具有异构 LLM 代理的设计框架。该框架使用 FSA 错误类型的先验领域知识以及聚合代理讨论的原因来实例化专用代理。对 FSA 数据集的综合评估表明，该框架具有更好的准确性，特别是在讨论大量时。这项研究为基于法学硕士的 FSA 奠定了设计基础并铺平了新的途径。还讨论了对业务和管理的影响。

Title: Tuning LLMs with Contrastive Alignment Instructions for Machine Translation in Unseen, Low-resource Languages. (arXiv:2401.05811v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05811
Code URL: null
Copy Paste: [[2401.05811]] Tuning LLMs with Contrastive Alignment Instructions for Machine Translation in Unseen, Low-resource Languages(http://arxiv.org/abs/2401.05811)
Summary:
This article introduces contrastive alignment instructions (AlignInstruct) to address two challenges in machine translation (MT) on large language models (LLMs). One is the expansion of supported languages to previously unseen ones. The second relates to the lack of data in low-resource languages. Model fine-tuning through MT instructions (MTInstruct) is a straightforward approach to the first challenge. However, MTInstruct is limited by weak cross-lingual signals inherent in the second challenge. AlignInstruct emphasizes cross-lingual supervision via a cross-lingual discriminator built using statistical word alignments. Our results based on fine-tuning the BLOOMZ models (1b1, 3b, and 7b1) in up to 24 unseen languages showed that: (1) LLMs can effectively translate unseen languages using MTInstruct; (2) AlignInstruct led to consistent improvements in translation quality across 48 translation directions involving English; (3) Discriminator-based instructions outperformed their generative counterparts as cross-lingual instructions; (4) AlignInstruct improved performance in 30 zero-shot directions.
摘要：
本文介绍了对比对齐指令 (AlignInstruct)，以解决大型语言模型 (LLM) 上机器翻译 (MT) 的两个挑战。一是将支持的语言扩展到以前未见过的语言。第二个与缺乏资源语言的数据有关。通过 MT 指令 (MTInstruct) 进行模型微调是应对第一个挑战的简单方法。然而，MTInstruct 受到第二个挑战中固有的微弱跨语言信号的限制。 AlignInstruct 强调通过使用统计单词对齐构建的跨语言鉴别器进行跨语言监督。我们基于最多 24 种未见过的语言对 BLOOMZ 模型（1b1、3b 和 7b1）进行微调的结果表明：（1）法学硕士可以使用 MTInstruct 有效翻译未见的语言； (2) AlignInstruct 导致 48 个涉及英语的翻译方向的翻译质量持续提高； (3) 作为跨语言指令，基于判别器的指令优于生成指令； (4) AlignInstruct 改进了 30 个零样本方向的性能。

Title: Chain of History: Learning and Forecasting with LLMs for Temporal Knowledge Graph Completion. (arXiv:2401.06072v1 [cs.AI])

Paper URL: http://arxiv.org/abs/2401.06072
Code URL: null
Copy Paste: [[2401.06072]] Chain of History: Learning and Forecasting with LLMs for Temporal Knowledge Graph Completion(http://arxiv.org/abs/2401.06072)
Summary:
Temporal Knowledge Graph Completion (TKGC) is a challenging task of predicting missing event links at future timestamps by leveraging established temporal structural knowledge. Given the formidable generative capabilities inherent in large language models (LLMs), this paper proposes a novel approach to conceptualize temporal link prediction as an event generation task within the context of a historical event chain. We employ efficient fine-tuning methods to make LLMs adapt to specific graph textual information and patterns discovered in temporal timelines. Furthermore, we introduce structure-based historical data augmentation and the integration of reverse knowledge to emphasize LLMs' awareness of structural information, thereby enhancing their reasoning capabilities. We conduct thorough experiments on multiple widely used datasets and find that our fine-tuned model outperforms existing embedding-based models on multiple metrics, achieving SOTA results. We also carry out sufficient ablation experiments to explore the key influencing factors when LLMs perform structured temporal knowledge inference tasks.
摘要：
时态知识图补全（TKGC）是一项具有挑战性的任务，它通过利用已建立的时态结构知识来预测未来时间戳的缺失事件链接。鉴于法学硕士（LLM）固有的强大生成能力，本文提出了一种新方法，将时间链接预测概念化为历史事件链背景下的事件生成任务。我们采用高效的微调方法，使法学硕士适应时间轴上发现的特定图形文本信息和模式。此外，我们引入基于结构的历史数据增强和逆向知识的集成，以强调法学硕士对结构信息的意识，从而增强他们的推理能力。我们对多个广泛使用的数据集进行了彻底的实验，发现我们的微调模型在多个指标上优于现有的基于嵌入的模型，实现了 SOTA 结果。我们还进行了足够的消融实验来探索法学硕士执行结构化时间知识推理任务时的关键影响因素。

Title: Natural Language Processing for Dialects of a Language: A Survey. (arXiv:2401.05632v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05632
Code URL: null
Copy Paste: [[2401.05632]] Natural Language Processing for Dialects of a Language: A Survey(http://arxiv.org/abs/2401.05632)
Summary:
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report superlative performance on evaluation datasets. This survey delves into an important attribute of these datasets: the dialect of a language. Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches. We describe a wide range of NLP tasks in terms of two categories: natural language understanding (NLU) (for tasks such as dialect classification, sentiment analysis, parsing, and NLU benchmarks) and natural language generation (NLG) (for summarisation, machine translation, and dialogue systems). The survey is also broad in its coverage of languages, which include English, Arabic, and German, among others. We observe that past work in NLP concerning dialects goes deeper than mere dialect classification. This includes early approaches that used sentence transduction, which led to the recent approaches that integrate hypernetworks into LoRA. We expect that this survey will be useful to NLP researchers interested in building equitable language technologies by rethinking LLM benchmarks and model architectures.
摘要：
最先进的自然语言处理 (NLP) 模型在大规模训练语料库上进行训练，并在评估数据集上报告了卓越的性能。这项调查深入研究了这些数据集的一个重要属性：语言的方言。由于方言数据集 NLP 模型的性能下降及其对语言技术公平性的影响，我们从数据集和方法方面调查了过去的方言 NLP 研究。我们根据两类描述广泛的 NLP 任务：自然语言理解 (NLU)（用于方言分类、情感分析、解析和 NLU 基准等任务）和自然语言生成 (NLG)（用于摘要、机器翻译）和对话系统）。该调查的语言覆盖面也很广泛，包括英语、阿拉伯语、德语等。我们观察到，过去有关方言的 NLP 工作比单纯的方言分类更深入，并且。这包括使用句子转导的早期方法，这些方法导致了最近将超网络集成到 LoRA 中的方法。我们希望这项调查对那些有兴趣通过重新思考 LLM 基准和模型架构来构建公平语言技术的 NLP 研究人员有所帮助。

Title: LLM-as-a-Coauthor: The Challenges of Detecting LLM-Human Mixcase. (arXiv:2401.05952v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05952
Code URL: https://github.com/Dongping-Chen/MixSet
Copy Paste: [[2401.05952]] LLM-as-a-Coauthor: The Challenges of Detecting LLM-Human Mixcase(http://arxiv.org/abs/2401.05952)
Summary:
With the remarkable development and widespread applications of large language models (LLMs), the use of machine-generated text (MGT) is becoming increasingly common. This trend brings potential risks, particularly to the quality and completeness of information in fields such as news and education. Current research predominantly addresses the detection of pure MGT without adequately addressing mixed scenarios, including AI-revised Human-Written Text (HWT) or human-revised MGT. To confront this challenge, we introduce mixcase, a novel concept representing a hybrid text form involving both machine-generated and human-generated content. We collected mixcase instances generated from multiple daily text-editing scenarios and composed MixSet, the first dataset dedicated to studying these mixed modification scenarios. We conduct experiments to evaluate the efficacy of popular MGT detectors, assessing their effectiveness, robustness, and generalization performance. Our findings reveal that existing detectors struggle to identify mixcase as a separate class or MGT, particularly in dealing with subtle modifications and style adaptability. This research underscores the urgent need for more fine-grained detectors tailored for mixcase, offering valuable insights for future research. Code and models are available at https://github.com/Dongping-Chen/MixSet.
摘要：
随着大型语言模型 (LLM) 的显着发展和广泛应用，机器生成文本 (MGT) 的使用变得越来越普遍。这种趋势带来了潜在的风险，尤其是新闻、教育等领域信息的质量和完整性。目前的研究主要针对纯 MGT 的检测，而没有充分解决混合场景，包括人工智能修订的人类书写文本 (HWT) 或人类修订的 MGT。为了应对这一挑战，我们引入了 mixcase，这是一个代表混合文本形式的新颖概念，涉及机器生成和人类生成的内容。我们收集了从多个日常文本编辑场景生成的 mixcase 实例，并组成了 MixSet，这是第一个致力于研究这些混合修改场景的数据集。我们进行实验来评估流行的 MGT 检测器的功效，评估其有效性、鲁棒性和泛化性能。我们的研究结果表明，现有的检测器很难将 mixcase 识别为单独的类或 MGT，特别是在处理细微的修改和风格适应性方面。这项研究强调了对针对混合情况定制的更多细粒度检测器的迫切需求，为未来的研究提供了宝贵的见解。代码和模型可在 https://github.com/Dongping-Chen/MixSet 获取。

Title: Transformers are Multi-State RNNs. (arXiv:2401.06104v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.06104
Code URL: https://github.com/schwartz-lab-NLP/TOVA
Copy Paste: [[2401.06104]] Transformers are Multi-State RNNs(http://arxiv.org/abs/2401.06104)
Summary:
Transformers are considered conceptually different compared to the previous generation of state-of-the-art NLP models - recurrent neural networks (RNNs). In this work, we demonstrate that decoder-only transformers can in fact be conceptualized as infinite multi-state RNNs - an RNN variant with unlimited hidden state size. We further show that pretrained transformers can be converted into $\textit{finite}$ multi-state RNNs by fixing the size of their hidden state. We observe that several existing transformer cache compression techniques can be framed as such conversion policies, and introduce a novel policy, TOVA, which is simpler compared to these policies. Our experiments with several long-range tasks indicate that TOVA outperforms all other baseline policies, while being nearly on par with the full (infinite) model, and using in some cases only $\frac{1}{8}$ of the original cache size. Our results indicate that transformer decoder LLMs often behave in practice as RNNs. They also lay out the option of mitigating one of their most painful computational bottlenecks - the size of their cache memory. We publicly release our code at https://github.com/schwartz-lab-NLP/TOVA.
摘要：
与上一代最先进的 NLP 模型——循环神经网络 (RNN) 相比，Transformer 在概念上被认为是不同的。在这项工作中，我们证明了仅解码器 Transformer 实际上可以被概念化为无限多状态 RNN——一种具有无限隐藏状态大小的 RNN 变体。我们进一步表明，通过固定隐藏状态的大小，预训练的 Transformer 可以转换为 $\textit{finite}$ 多状态 RNN。我们观察到一些现有的转换器缓存压缩技术可以被构建为这样的转换策略，并引入一种新的策略 TOVA，它比这些策略更简单。我们对多个远程任务进行的实验表明，TOVA 优于所有其他基线策略，同时几乎与完整（无限）模型相当，并且在某些情况下仅使用原始缓存大小的 $\frac{1}{8}$ 。我们的结果表明，变压器解码器 LLM 在实践中通常表现为 RNN。他们还提出了缓解最痛苦的计算瓶颈之一——缓存大小的选项。我们在 https://github.com/schwartz-lab-NLP/TOVA 公开发布我们的代码。
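
A minimal sketch of the "finite multi-state RNN" idea from the summary above: a key/value cache whose size is capped, so the model's state stays bounded as decoding proceeds. The class name `BoundedKVCache` and the eviction rule (drop the cached state that received the lowest attention weight at the current step) are illustrative assumptions, not necessarily the TOVA policy as implemented in the released code.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class BoundedKVCache:
    """Hypothetical fixed-size key/value cache (the bounded 'multi-state')."""

    def __init__(self, max_states):
        self.max_states = max_states
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        # Add the new token's state, then attend over every cached state.
        self.keys.append(key)
        self.values.append(value)
        weights = softmax([dot(query, k) for k in self.keys])
        dim = len(value)
        output = [
            sum(w * v[i] for w, v in zip(weights, self.values))
            for i in range(dim)
        ]
        # Assumed eviction rule: once over budget, drop the state that
        # received the lowest attention weight at this step.
        if len(self.keys) > self.max_states:
            drop = min(range(len(weights)), key=weights.__getitem__)
            del self.keys[drop]
            del self.values[drop]
        return output


cache = BoundedKVCache(max_states=4)
for t in range(10):
    vec = [float(t), 1.0]
    out = cache.step(query=vec, key=vec, value=vec)
```

With `max_states` set to the full sequence length nothing is ever evicted, which corresponds to the unbounded ("infinite") case described in the summary.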

Title: TOFU: A Task of Fictitious Unlearning for LLMs. (arXiv:2401.06121v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.06121
Code URL: null
Copy Paste: [[2401.06121]] TOFU: A Task of Fictitious Unlearning for LLMs(http://arxiv.org/abs/2401.06121)
Summary:
Large language models trained on massive corpora of data from the web can memorize and reproduce sensitive or private data, raising both legal and ethical concerns. Unlearning, or tuning models to forget information present in their training data, provides us with a way to protect private data after training. Although several methods exist for such unlearning, it is unclear to what extent they result in models equivalent to those where the data to be forgotten was never learned in the first place. To address this challenge, we present TOFU, a Task of Fictitious Unlearning, as a benchmark aimed at helping deepen our understanding of unlearning. We offer a dataset of 200 diverse synthetic author profiles, each consisting of 20 question-answer pairs, and a subset of these profiles called the forget set that serves as the target for unlearning. We compile a suite of metrics that work together to provide a holistic picture of unlearning efficacy. Finally, we provide a set of baseline results from existing unlearning algorithms. Importantly, none of the baselines we consider shows effective unlearning, motivating continued efforts to develop approaches for unlearning that effectively tune models so that they truly behave as if they were never trained on the forget data at all.
摘要：
基于网络上的大量数据训练的大型语言模型可以记忆和复制敏感或私人数据，从而引起法律和道德问题。忘却或调整模型以忘记训练数据中存在的信息，为我们提供了一种在训练后保护私人数据的方法。尽管存在几种用于这种遗忘的方法，但尚不清楚它们在多大程度上会产生与最初从未学习过要遗忘的数据的模型相当的模型。为了应对这一挑战，我们提出了 TOFU，一项虚构的忘却任务，作为基准，旨在帮助加深我们对忘却的理解。我们提供了一个由 200 个不同的合成作者资料组成的数据集，每个资料由 20 个问答对组成，这些资料的一个子集称为遗忘集，用作忘却的目标。我们编制了一套指标，这些指标共同作用，提供了遗忘效率的整体情况。最后，我们提供了一组来自现有遗忘算法的基线结果。重要的是，我们考虑的基线都没有显示出有效的遗忘数据，可以激励人们继续努力开发遗忘方法，从而有效地调整模型，使它们真正表现得好像从未接受过遗忘数据的训练一样。
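
To make the dataset layout in the summary above concrete, here is a small sketch that builds 200 placeholder author profiles with 20 question-answer pairs each (4,000 pairs in total) and splits them into a forget set and a retain set. The function names, the placeholder text, and the `forget_fraction` parameter are assumptions for illustration; the summary does not state how large the forget set is.

```python
import random


def build_profiles(num_authors=200, qa_per_author=20):
    # Placeholder profiles; the real benchmark uses synthetic author data.
    return {
        f"author_{a:03d}": [
            (f"question {q} about author_{a:03d}", f"answer {q}")
            for q in range(qa_per_author)
        ]
        for a in range(num_authors)
    }


def split_forget_retain(profiles, forget_fraction, seed=0):
    # Hypothetical split: sample whole author profiles into the forget set.
    rng = random.Random(seed)
    authors = sorted(profiles)
    n_forget = round(len(authors) * forget_fraction)
    forget_authors = set(rng.sample(authors, n_forget))
    forget = {a: profiles[a] for a in authors if a in forget_authors}
    retain = {a: profiles[a] for a in authors if a not in forget_authors}
    return forget, retain


profiles = build_profiles()
forget, retain = split_forget_retain(profiles, forget_fraction=0.05)
total_pairs = sum(len(qa) for qa in profiles.values())
```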

long context

lora

hallucination

Title: Hallucination Benchmark in Medical Visual Question Answering. (arXiv:2401.05827v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05827
Code URL: null
Copy Paste: [[2401.05827]] Hallucination Benchmark in Medical Visual Question Answering(http://arxiv.org/abs/2401.05827)
Summary:
The recent success of large language and vision models on visual question answering (VQA), particularly their applications in medicine (Med-VQA), has shown great potential for realizing effective visual assistants for healthcare. However, these models are not extensively tested on the hallucination phenomenon in clinical settings. Here, we created a hallucination benchmark of medical images paired with question-answer sets and conducted a comprehensive evaluation of the state-of-the-art models. The study provides an in-depth analysis of current models' limitations and reveals the effectiveness of various prompting strategies.
摘要：
大型语言和视觉模型最近在视觉问答（VQA）方面取得的成功，特别是它们在医学（Med-VQA）中的应用，已经显示出实现有效的医疗保健视觉助手的巨大潜力。然而，这些模型并未在临床环境中对幻觉现象进行广泛的测试。在这里，我们创建了与问答集配对的医学图像的幻觉基准，并对最先进的模型进行了全面评估。该研究深入分析了当前模型的局限性，并揭示了各种提示策略的有效性。

prompt

Title: CodePrompt: Improving Source Code-Related Classification with Knowledge Features through Prompt Learning. (arXiv:2401.05544v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05544
Code URL: null
Copy Paste: [[2401.05544]] CodePrompt: Improving Source Code-Related Classification with Knowledge Features through Prompt Learning(http://arxiv.org/abs/2401.05544)
Summary:
Researchers have explored the potential of utilizing pre-trained language models, such as CodeBERT, to improve source code-related tasks. Previous studies have mainly relied on CodeBERT's text embedding capability and the "[CLS]" sentence embedding information as semantic representations for fine-tuning downstream source code-related tasks. However, these methods require additional neural network layers to extract effective features, resulting in higher computational costs. Furthermore, existing approaches have not leveraged the rich knowledge contained in both source code and related text, which can lead to lower accuracy. This paper presents a novel approach, CodePrompt, which utilizes rich knowledge recalled from a pre-trained model by prompt learning and an attention mechanism to improve source code-related classification tasks. Our approach initially motivates the language model with prompt information to retrieve abundant knowledge associated with the input as representative features, thus avoiding the need for additional neural network layers and reducing computational costs. Subsequently, we employ an attention mechanism to aggregate multiple layers of related knowledge for each task as final features to boost their accuracy. We conducted extensive experiments on four downstream source code-related tasks to evaluate our approach, and our results demonstrate that CodePrompt achieves new state-of-the-art performance on the accuracy metric while also exhibiting computation cost-saving capabilities.
摘要：
研究人员探索了利用预训练语言模型（例如 CodeBERT）来改进源代码相关任务的潜力。之前的研究主要依靠 CodeBERT 的文本嵌入能力和“[CLS]”句子嵌入信息作为语义表示来微调下游源代码相关任务。然而，这些方法需要额外的神经网络层来提取有效特征，导致计算成本较高。此外，现有方法没有利用源代码和相关文本中包含的丰富知识，这可能导致准确性较低。本文提出了一种新颖的方法，CodePrompt，它利用通过提示学习和注意力机制从预训练模型中召回的丰富知识来改进与源代码相关的分类任务。我们的方法最初通过提示信息激发语言模型来检索与输入相关的丰富知识作为代表性特征，从而避免了对额外神经网络层的需要并降低了计算成本。随后，我们采用注意力机制来聚合每个任务的多层相关知识作为最终特征，以提高其准确性。我们对四个与下游源代码相关的任务进行了广泛的实验，以评估我们的方法，结果表明 CodePrompt 在准确性指标上实现了新的最先进的性能，同时还展示了节省计算成本的能力。
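
The summary above mentions an attention mechanism that aggregates features from multiple layers into a final representation. A minimal sketch of that general pattern follows: each layer's feature vector is scored against a query vector, the scores are normalized with a softmax, and the layers are combined as a weighted sum. The function name `attention_pool` and the use of a single dot-product query are assumptions; the summary does not specify CodePrompt's exact formulation.

```python
import math


def attention_pool(layer_features, query):
    # Score each layer's feature vector against the (assumed) query vector.
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in layer_features]
    # Softmax over layers, shifted by the max score for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Weighted sum of the layer features gives the pooled representation.
    dim = len(query)
    pooled = [
        sum(w * h[i] for w, h in zip(weights, layer_features))
        for i in range(dim)
    ]
    return pooled, weights


# Three hypothetical layer outputs of dimension 2.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(features, query=[0.0, 0.0])
```

With an all-zero query every layer gets the same weight, so the pooled vector reduces to the plain average of the layers; in a trained model the query would be learned so that more useful layers receive larger weights.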

Title: Unveiling the Tapestry of Automated Essay Scoring: A Comprehensive Investigation of Accuracy, Fairness, and Generalizability. (arXiv:2401.05655v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05655
Code URL: null
Copy Paste: [[2401.05655]] Unveiling the Tapestry of Automated Essay Scoring: A Comprehensive Investigation of Accuracy, Fairness, and Generalizability(http://arxiv.org/abs/2401.05655)
Summary:
Automatic Essay Scoring (AES) is a well-established educational pursuit that employs machine learning to evaluate student-authored essays. While much effort has been made in this area, current research primarily focuses on either (i) boosting the predictive accuracy of an AES model for a specific prompt (i.e., developing prompt-specific models), which often heavily relies on the use of the labeled data from the same target prompt; or (ii) assessing the applicability of AES models developed on non-target prompts to the intended target prompt (i.e., developing the AES models in a cross-prompt setting). Given the inherent bias in machine learning and its potential impact on marginalized groups, it is imperative to investigate whether such bias exists in current AES methods and, if identified, how it intervenes with an AES model's accuracy and generalizability. Thus, our study aimed to uncover the intricate relationship between an AES model's accuracy, fairness, and generalizability, contributing practical insights for developing effective AES models in real-world education. To this end, we meticulously selected nine prominent AES methods and evaluated their performance using seven metrics on an open-sourced dataset, which contains over 25,000 essays and various demographic information about students such as gender, English language learner status, and economic status. Through extensive evaluations, we demonstrated that: (1) prompt-specific models tend to outperform their cross-prompt counterparts in terms of predictive accuracy; (2) prompt-specific models frequently exhibit a greater bias towards students of different economic statuses compared to cross-prompt models; (3) in the pursuit of generalizability, traditional machine learning models coupled with carefully engineered features hold greater potential for achieving both high accuracy and fairness than complex neural network models.
摘要：
论文自动评分 (AES) 是一项成熟的教育活动，它利用机器学习来评估学生撰写的论文。虽然在这一领域做出了很多努力，但当前的研究主要集中在 (i) 提高 AES 模型对特定提示的预测准确性（即开发特定于提示的模型），这通常严重依赖于使用来自同一目标提示的标记数据； (ii) 评估在非目标提示上开发的 AES 模型对预期目标提示的适用性（即，在交叉提示设置中开发 AES 模型）。考虑到机器学习的固有偏差及其对边缘群体的潜在影响，有必要研究当前 AES 方法中是否存在这种偏差，如果存在，它如何影响 AES 模型的准确性和普遍性。因此，我们的研究旨在揭示 AES 模型的准确性、公平性和普遍性之间的复杂关系，为在现实教育中开发有效的 AES 模型提供实用见解。为此，我们精心挑选了九种著名的 AES 方法，并使用开源数据集的七个指标评估了它们的性能，该数据集包含超过 25,000 篇论文以及有关学生的各种人口统计信息，例如性别、英语学习者状况和经济状况。通过广泛的评估，我们证明：（1）特定提示模型在预测准确性方面往往优于交叉提示模型；（2）与交叉提示模型相比，特定提示模型经常对不同经济状况的学生表现出更大的偏见；（3）为了追求普遍性，传统的机器学习模型加上精心设计的特征，比复杂的神经网络模型具有更大的潜力来实现高精度和公平性。

Title: Evidence to Generate (E2G): A Single-agent Two-step Prompting for Context Grounded and Retrieval Augmented Reasoning. (arXiv:2401.05787v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05787
Code URL: null
Copy Paste: [[2401.05787]] Evidence to Generate (E2G): A Single-agent Two-step Prompting for Context Grounded and Retrieval Augmented Reasoning(http://arxiv.org/abs/2401.05787)
Summary:
While chain-of-thought (CoT) prompting has revolutionized how LLMs perform reasoning tasks, its current methods and variations (e.g., Self-consistency, ReACT, Reflexion, Tree-of-Thoughts (ToT), Cumulative Reasoning (CR)) suffer from limitations like slowness, limited context grounding, hallucination and inconsistent outputs. To overcome these challenges, we introduce Evidence to Generate (E2G), a novel single-agent, two-step prompting framework. Instead of unverified reasoning claims, this innovative approach leverages the power of "evidence for decision making" by first focusing exclusively on the thought sequences (the series of intermediate steps) explicitly mentioned in the context, which then serve as extracted evidence, guiding the LLM's output generation process with greater precision and efficiency. This simple yet powerful approach unlocks the true potential of chain-of-thought-like prompting, paving the way for faster, more reliable, and more contextually aware reasoning in LLMs. E2G achieves remarkable results robustly across a wide range of knowledge-intensive reasoning and generation tasks, surpassing baseline approaches with state-of-the-art LLMs. For example, (i) on the LogiQA benchmark using GPT-4 as the backbone model, E2G achieves a new state-of-the-art accuracy of 53.8%, exceeding CoT by 18%, ToT by 11%, and CR by 9%; (ii) a variant of E2G with PaLM2 outperforms the variable-shot performance of Gemini Ultra by 0.9 F1 points, reaching an F1 score of 83.3 on a subset of DROP.
摘要：
虽然思想链 (CoT) 提示彻底改变了法学硕士执行推理任务的方式，但其当前的方法和变体（例如，自我一致性、反应、反思、思想树 (ToT)、累积推理 (CR) )) 受到诸如缓慢、有限的背景基础、幻觉和不一致的输出等限制。为了克服这些挑战，我们引入了证据生成（E2G），这是一种新颖的单代理、两步提示框架。这种创新方法不是未经验证的推理主张，而是利用“决策证据”的力量，首先专门关注上下文中明确提到的思维序列（一系列中间步骤），然后将其作为提取的证据，指导法学硕士的输出生成过程具有更高的精度和效率。这种简单而强大的方法释放了思维链（如提示）的真正潜力，为法学硕士中更快、更可靠、更上下文相关的推理铺平了道路。 \tool 在广泛的知识密集型推理和生成任务中取得了显着的成果，超越了最先进的法学硕士的基线方法。例如，(i) 在使用 GPT-4 作为骨干模型的 LogiQA 基准测试中，工具达到了 53.8% 的新准确率，CoT 超出 18%，ToT 超出 11%，CR 超出 9% (ii)带有 PaLM2 的 E2G 变体比 Gemini Ultra 的可变镜头性能高出 0.9 F1 分，在 DROP 子集上达到 83.3 的 F1 分数。
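
The two-step flow described in the summary above can be sketched roughly as follows: a first call extracts evidence stated in the context, and a second call generates the answer guided by that evidence. The prompt wording, the function name `evidence_to_generate`, and the `fake_llm` stand-in are all hypothetical; the paper's actual prompts are not given in the summary, and passing only the extracted evidence to the second step is an assumption.

```python
from typing import Callable, Dict


def evidence_to_generate(
    context: str, question: str, llm: Callable[[str], str]
) -> Dict[str, str]:
    # Step 1: ask only for evidence that is explicitly stated in the context.
    evidence_prompt = (
        "Step 1. Quote the statements from the context that bear on the question.\n"
        f"Context: {context}\nQuestion: {question}"
    )
    evidence = llm(evidence_prompt)
    # Step 2: generate the answer from the extracted evidence (assumed design).
    answer_prompt = (
        "Step 2. Answer the question using only this evidence.\n"
        f"Evidence: {evidence}\nQuestion: {question}"
    )
    return {"evidence": evidence, "answer": llm(answer_prompt)}


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call so the sketch runs offline.
    if prompt.startswith("Step 1"):
        return "Paris is the capital of France."
    return "Paris"


result = evidence_to_generate(
    context="Paris is the capital of France. It lies on the Seine.",
    question="What is the capital of France?",
    llm=fake_llm,
)
```

In practice `llm` would wrap whatever model client is in use; keeping it as a plain callable makes the two steps easy to test in isolation.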

Title: Prompt-based mental health screening from social media text. (arXiv:2401.05912v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05912
Code URL: null
Copy Paste: [[2401.05912]] Prompt-based mental health screening from social media text(http://arxiv.org/abs/2401.05912)
Summary:
This article presents a method for prompt-based mental health screening from a large and noisy dataset of social media text. Our method uses GPT 3.5 prompting to distinguish publications that may be more relevant to the task, and then uses a straightforward bag-of-words text classifier to predict actual user labels. Results are found to be on par with a BERT mixture-of-experts classifier, while incurring only a fraction of its computational costs.
摘要：
本文提出了一种从大型且嘈杂的社交媒体文本数据集中进行基于提示的心理健康筛查的方法。我们的方法使用 GPT 3.5。提示区分可能与任务更相关的出版物，然后使用简单的词袋文本分类器来预测实际的用户标签。结果发现与 BERT 专家混合分类器配对，并且只产生其计算成本的一小部分。

code

Title: CoLafier: Collaborative Noisy Label Purifier With Local Intrinsic Dimensionality Guidance. (arXiv:2401.05458v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.05458
Code URL: https://github.com/zdy93/CoLafier
Copy Paste: [[2401.05458]] CoLafier: Collaborative Noisy Label Purifier With Local Intrinsic Dimensionality Guidance(http://arxiv.org/abs/2401.05458)
Summary:
Deep neural networks (DNNs) have advanced many machine learning tasks, but their performance is often harmed by noisy labels in real-world data. Addressing this, we introduce CoLafier, a novel approach that uses Local Intrinsic Dimensionality (LID) for learning with noisy labels. CoLafier consists of two subnets: LID-dis and LID-gen. LID-dis is a specialized classifier. Trained with our uniquely crafted scheme, LID-dis consumes both a sample's features and its label to predict the label - which allows it to produce an enhanced internal representation. We observe that LID scores computed from this representation effectively distinguish between correct and incorrect labels across various noise scenarios. In contrast to LID-dis, LID-gen, functioning as a regular classifier, operates solely on the sample's features. During training, CoLafier utilizes two augmented views per instance to feed both subnets. CoLafier considers the LID scores from the two views as produced by LID-dis to assign weights in an adapted loss function for both subnets. Concurrently, LID-gen, serving as classifier, suggests pseudo-labels. LID-dis then processes these pseudo-labels along with two views to derive LID scores. Finally, these LID scores along with the differences in predictions from the two subnets guide the label update decisions. This dual-view and dual-subnet approach enhances the overall reliability of the framework. Upon completion of the training, we deploy the LID-gen subnet of CoLafier as the final classification model. CoLafier demonstrates improved prediction accuracy, surpassing existing methods, particularly under severe label noise. For more details, see the code at https://github.com/zdy93/CoLafier.
摘要：
深度神经网络 (DNN) 已经推进了许多机器学习任务，但它们的性能常常受到现实数据中的噪声标签的损害。为了解决这个问题，我们引入了 CoLafier，这是一种使用局部固有维度 (LID) 来学习噪声标签的新颖方法。 CoLafier 由两个子网组成：LID-dis 和 LID-gen。 LID-dis 是一个专门的分类器。通过我们独特设计的方案进行训练，LID-dis 使用样本的特征及其标签来预测标签 - 这使其能够产生增强的内部表示。我们观察到，根据这种表示计算出的 LID 分数可以有效地区分各种噪声场景中的正确标签和错误标签。与 LID-dis 相比，LID-gen 作为常规分类器，仅对样本的特征进行操作。在训练期间，CoLafier 在每个实例中使用两个增强视图来为两个子网提供数据。 CoLafier 考虑 LID-dis 生成的两个视图的 LID 分数，以在两个子网的自适应损失函数中分配权重。同时，LID-gen 作为分类器，建议伪标签。然后，LID-dis 处理这些伪标签以及两个视图，以得出 LID 分数。最后，这些 LID 分数以及两个子网的预测差异指导标签更新决策。这种双视图和双子网方法增强了框架的整体可靠性。训练完成后，我们部署 CoLafier 的 LID-gen 子网作为最终的分类模型。 CoLafier 展示了更高的预测准确性，超越了现有方法，特别是在严重的标签噪声下。有关更多详细信息，请参阅 https://github.com/zdy93/CoLafier 上的代码。
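
For readers unfamiliar with Local Intrinsic Dimensionality, the sketch below shows the commonly used maximum-likelihood LID estimate computed from the distances to a point's k nearest neighbours. The summary above does not say which estimator CoLafier uses, so this choice, along with the function name `lid_mle`, is an assumption for illustration only.

```python
import math


def lid_mle(query, reference_points, k):
    # Distances from the query to its k nearest non-identical neighbours.
    dists = sorted(math.dist(query, p) for p in reference_points)
    dists = [d for d in dists if d > 0][:k]
    r_max = dists[-1]
    # Maximum-likelihood estimate: negative inverse of the mean log distance
    # ratio r_i / r_max over the k neighbours.
    mean_log = sum(math.log(d / r_max) for d in dists) / len(dists)
    if mean_log == 0:
        return float("inf")  # all neighbours are equidistant
    return -1.0 / mean_log


# Points laid out along a line should give an estimate close to 1.
line_points = [(float(i),) for i in range(1, 21)]
estimate_1d = lid_mle((0.0,), line_points, k=20)
```

In a setting like the one described, the query and reference points would be internal representations of samples within a batch rather than raw coordinates.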

Title: Knowledge Translation: A New Pathway for Model Compression. (arXiv:2401.05772v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.05772
Code URL: https://github.com/zju-SWJ/KT
Copy Paste: [[2401.05772]] Knowledge Translation: A New Pathway for Model Compression(http://arxiv.org/abs/2401.05772)
Summary:
Deep learning has witnessed significant advancements in recent years at the cost of increasing training, inference, and model storage overhead. While existing model compression methods strive to reduce the number of model parameters while maintaining high accuracy, they inevitably necessitate the re-training of the compressed model or impose architectural constraints. To overcome these limitations, this paper presents a novel framework, termed Knowledge Translation (KT), wherein a "translation" model is trained to receive the parameters of a larger model and generate compressed parameters. The concept of KT draws inspiration from language translation, which effectively employs neural networks to convert different languages, maintaining identical meaning. Accordingly, we explore the potential of neural networks to convert models of disparate sizes, while preserving their functionality. We propose a comprehensive framework for KT, introduce data augmentation strategies to enhance model performance despite restricted training data, and successfully demonstrate the feasibility of KT on the MNIST dataset. Code is available at https://github.com/zju-SWJ/KT.
摘要：
近年来，深度学习取得了显着进步，但代价是训练、推理和模型存储开销不断增加。虽然现有的模型压缩方法努力在保持高精度的同时减少模型参数的数量，但它们不可避免地需要重新训练压缩模型或施加架构限制。为了克服这些限制，本文提出了一种新颖的框架，称为 \textbf{K}nowledge \textbf{T}translation (KT)，其中训练“翻译”模型以接收更大模型的参数并生成压缩的参数。 KT的概念受到语言翻译的启发，它有效地利用神经网络来转换不同的语言，并保持相同的含义。因此，我们探索神经网络转换不同大小模型的潜力，同时保留其功能。我们提出了一个全面的 KT 框架，引入数据增强策略来增强模型性能，尽管训练数据有限，并成功证明了 KT 在 MNIST 数据集上的可行性。代码可在 \url{https://github.com/zju-SWJ/KT} 获取。

Title: Graph Spatiotemporal Process for Multivariate Time Series Anomaly Detection with Missing Values. (arXiv:2401.05800v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.05800
Code URL: https://github.com/huankoh/GST-Pro
Copy Paste: [[2401.05800]] Graph Spatiotemporal Process for Multivariate Time Series Anomaly Detection with Missing Values(http://arxiv.org/abs/2401.05800)
Summary:
The detection of anomalies in multivariate time series data is crucial for various practical applications, including smart power grids, traffic flow forecasting, and industrial process control. However, real-world time series data is usually not well-structured, posing significant challenges to existing approaches: (1) The existence of missing values in multivariate time series data along variable and time dimensions hinders the effective modeling of interwoven spatial and temporal dependencies, resulting in important patterns being overlooked during model training; (2) Anomaly scoring with irregularly-sampled observations is less explored, making it difficult to use existing detectors for multivariate series without fully-observed values. In this work, we introduce a novel framework called GST-Pro, which utilizes a graph spatiotemporal process and anomaly scorer to tackle the aforementioned challenges in detecting anomalies on irregularly-sampled multivariate time series. Our approach comprises two main components. First, we propose a graph spatiotemporal process based on neural controlled differential equations. This process enables effective modeling of multivariate time series from both spatial and temporal perspectives, even when the data contains missing values. Second, we present a novel distribution-based anomaly scoring mechanism that alleviates the reliance on complete uniform observations. By analyzing the predictions of the graph spatiotemporal process, our approach allows anomalies to be easily detected. Our experimental results show that the GST-Pro method can effectively detect anomalies in time series data and outperforms state-of-the-art methods, regardless of whether there are missing values present in the data. Our code is available at https://github.com/huankoh/GST-Pro.
摘要：
多元时间序列数据中的异常检测对于各种实际应用至关重要，包括智能电网、交通流量预测和工业过程控制。然而，现实世界的时间序列数据通常结构不佳，这对现有方法提出了重大挑战：（1）多元时间序列数据在变量和时间维度上存在缺失值，阻碍了交织的空间和时间依赖性的有效建模，导致模型训练过程中重要的模式被忽略； (2) 对不规则采样观测值的异常评分研究较少，这使得在没有完全观测值的情况下很难将现有的检测器用于多元序列。在这项工作中，我们介绍了一种名为 GST-Pro 的新颖框架，它利用图时空过程和异常评分器来解决上述在不规则采样的多元时间序列上检测异常的挑战。我们的方法包括两个主要部分。首先，我们提出了一种基于神经控制微分方程的图时空过程。即使数据包含缺失值，此过程也可以从空间和时间角度对多元时间序列进行有效建模。其次，我们提出了一种新颖的基于分布的异常评分机制，减轻了对完全统一观察的依赖。通过分析图时空过程的预测，我们的方法可以轻松检测到异常情况。我们的实验结果表明，无论数据中是否存在缺失值，GST-Pro 方法都可以有效地检测时间序列数据中的异常，并且优于最先进的方法。我们的代码可用：https://github.com/huankoh/GST-Pro。

Title: SH2: Self-Highlighted Hesitation Helps You Decode More Truthfully. (arXiv:2401.05930v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05930
Code URL: null
Copy Paste: [[2401.05930]] SH2: Self-Highlighted Hesitation Helps You Decode More Truthfully(http://arxiv.org/abs/2401.05930)
Summary:
Large language models (LLMs) demonstrate great performance in text generation. However, LLMs still suffer from hallucinations. In this work, we propose an inference-time method, Self-Highlighted Hesitation (SH2), to help LLMs decode more truthfully. SH2 is based on a simple fact rooted in information theory that for an LLM, the tokens predicted with lower probabilities are prone to be more informative than others. Our analysis shows that the tokens assigned lower probabilities by an LLM are more likely to be closely related to factual information, such as nouns, proper nouns, and adjectives. Therefore, we propose to "highlight" the factual information by selecting the tokens with the lowest probabilities and concatenating them to the original context, thus forcing the model to repeatedly read and hesitate on these tokens before generation. During decoding, we also adopt contrastive decoding to emphasize the difference in the output probabilities brought by the hesitation. Experimental results demonstrate that our SH2, requiring no additional data or models, can effectively help LLMs elicit factual knowledge and distinguish hallucinated contexts. Significant and consistent improvements are achieved by SH2 for LLaMA-7b and LLaMA2-7b on multiple hallucination tasks.
摘要：
大型语言模型 (LLM) 在文本生成方面表现出了出色的性能。然而，法学硕士仍然饱受幻觉之苦。在这项工作中，我们提出了一种推理时间方法，自我突出的犹豫（SH2），以帮助法学硕士更真实地解码。 SH2 基于一个植根于信息论的简单事实，即对于法学硕士来说，以较低概率预测的标记往往比其他标记提供更多信息。我们的分析表明，法学硕士分配的概率较低的标记更有可能与事实信息密切相关，例如名词、专有名词和形容词。因此，我们建议通过选择概率最低的标记并将它们连接到原始上下文来“突出”事实信息，从而迫使模型在生成之前反复读取和犹豫这些标记。在解码过程中，我们还采用对比解码来强调犹豫带来的输出概率的差异。实验结果表明，我们的 SH2 不需要额外的数据或模型，可以有效地帮助法学硕士获得事实知识并区分幻觉背景。 SH2 对 LLaMA-7b 和 LLaMA2-7b 在多个幻觉任务上取得了显着且一致的改进。
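
The "highlight" step described in the summary above can be sketched as follows: given each context token and the probability the model assigned to it, pick the lowest-probability tokens and concatenate them with the original context. The function name `highlight_hesitation`, the choice to prepend rather than append, and the example probabilities are assumptions; the contrastive decoding step is not shown.

```python
def highlight_hesitation(tokens, token_probs, num_highlight):
    # Rank token positions by the probability the model assigned to them.
    ranked = sorted(range(len(tokens)), key=lambda i: token_probs[i])
    # Keep the lowest-probability positions, restored to reading order.
    chosen = sorted(ranked[:num_highlight])
    highlighted = [tokens[i] for i in chosen]
    # Assumed layout: highlighted tokens first, then the original context.
    return highlighted + tokens


tokens = ["The", "Eiffel", "Tower", "is", "in", "Paris"]
probs = [0.9, 0.05, 0.3, 0.8, 0.7, 0.1]  # made-up values for illustration
new_context = highlight_hesitation(tokens, probs, num_highlight=2)
```

In this toy example the two least likely tokens are the proper nouns, which matches the observation in the summary that low-probability tokens tend to carry factual content.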

Title: Learning Cognitive Maps from Transformer Representations for Efficient Planning in Partially Observed Environments. (arXiv:2401.05946v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.05946
Code URL: null
Copy Paste: [[2401.05946]] Learning Cognitive Maps from Transformer Representations for Efficient Planning in Partially Observed Environments(http://arxiv.org/abs/2401.05946)
Summary:
Despite their stellar performance on a wide range of tasks, including in-context tasks only revealed during inference, vanilla transformers and variants trained for next-token predictions (a) do not learn an explicit world model of their environment which can be flexibly queried and (b) cannot be used for planning or navigation. In this paper, we consider partially observed environments (POEs), where an agent receives perceptually aliased observations as it navigates, which makes path planning hard. We introduce a transformer with (multiple) discrete bottleneck(s), TDB, whose latent codes learn a compressed representation of the history of observations and actions. After training a TDB to predict the future observation(s) given the history, we extract interpretable cognitive maps of the environment from its active bottleneck(s) indices. These maps are then paired with an external solver to solve (constrained) path planning problems. First, we show that a TDB trained on POEs (a) retains the near perfect predictive performance of a vanilla transformer or an LSTM while (b) solving shortest path problems exponentially faster. Second, a TDB extracts interpretable representations from text datasets, while reaching higher in-context accuracy than vanilla sequence models. Finally, in new POEs, a TDB (a) reaches near-perfect in-context accuracy, (b) learns accurate in-context cognitive maps (c) solves in-context path planning problems.
摘要：
尽管它们在广泛的任务上表现出色，包括仅在推理过程中揭示的上下文任务，但为下一个标记预测训练的普通变压器和变体（a）没有学习其环境的明确的世界模型，这可以是灵活查询，(b) 不能用于规划或导航。在本文中，我们考虑部分观察环境（POE），其中代理在导航时接收到感知混叠的观察结果，这使得路径规划变得困难。我们引入了一个具有（多个）离散瓶颈的变压器 TDB，其潜在代码学习观察和动作历史的压缩表示。在训练 TDB 来预测给定历史的未来观察之后，我们从其活动瓶颈索引中提取环境的可解释认知图。然后将这些地图与外部求解器配对以解决（受限）路径规划问题。首先，我们表明，在 POE 上训练的 TDB (a) 保留了普通 Transformer 或 LSTM 近乎完美的预测性能，同时 (b) 解决最短路径问题的速度呈指数级增长。其次，TDB 从文本数据集中提取可解释的表示，同时达到比普通序列模型更高的上下文准确性。最后，在新的 POE 中，TDB (a) 达到近乎完美的上下文准确性，(b) 学习准确的上下文认知图，(c) 解决上下文路径规划问题。

Title: Spatial-Aware Deep Reinforcement Learning for the Traveling Officer Problem. (arXiv:2401.05969v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.05969
Code URL: null
Copy Paste: [[2401.05969]] Spatial-Aware Deep Reinforcement Learning for the Traveling Officer Problem(http://arxiv.org/abs/2401.05969)
Summary:
The traveling officer problem (TOP) is a challenging stochastic optimization task. In this problem, a parking officer is guided through a city equipped with parking sensors to fine as many parking offenders as possible. A major challenge in TOP is the dynamic nature of parking offenses, which randomly appear and disappear after some time, regardless of whether they have been fined. Thus, solutions need to dynamically adjust to currently fineable parking offenses while also planning ahead to increase the likelihood that the officer arrives during the offense taking place. Though various solutions exist, these methods often struggle to take the implications of actions on the ability to fine future parking violations into account. This paper proposes SATOP, a novel spatial-aware deep reinforcement learning approach for TOP. Our novel state encoder creates a representation of each action, leveraging the spatial relationships between parking spots, the agent, and the action. Furthermore, we propose a novel message-passing module for learning future inter-action correlations in the given environment. Thus, the agent can estimate the potential to fine further parking violations after executing an action. We evaluate our method using an environment based on real-world data from Melbourne. Our results show that SATOP consistently outperforms state-of-the-art TOP agents and is able to fine up to 22% more parking offenses.
摘要：
旅行官员问题（TOP）是一项具有挑战性的随机优化任务。在这个问题中，停车管理人员被引导穿过一个配备停车传感器的城市，对尽可能多的停车违规者进行罚款。 TOP 的一个主要挑战是停车违法行为的动态性质，它们会在一段时间后随机出现和消失，无论是否被罚款。因此，解决方案需要动态调整当前可罚款的停车违法行为，同时还需要提前规划，以增加警察在违法行为发生期间到达的可能性。尽管存在各种解决方案，但这些方法通常很难考虑到行为对未来停车违规罚款能力的影响。本文提出了 SATOP，一种新颖的 TOP 空间感知深度强化学习方法。我们新颖的状态编码器利用停车位、代理和动作之间的空间关系创建每个动作的表示。此外，我们提出了一种新颖的消息传递模块，用于学习给定环境中未来的交互相关性。因此，代理可以估计在执行操作后进一步罚款违规停车的可能性。我们使用基于墨尔本真实数据的环境来评估我们的方法。我们的结果表明，SATOP 的表现始终优于最先进的 TOP 代理，并且能够将停车违规罚款最多增加 22%。

Title: Cross-modal Retrieval for Knowledge-based Visual Question Answering. (arXiv:2401.05736v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05736
Code URL: null
Copy Paste: [[2401.05736]] Cross-modal Retrieval for Knowledge-based Visual Question Answering(http://arxiv.org/abs/2401.05736)
Summary:
Knowledge-based Visual Question Answering about Named Entities is a challenging task that requires retrieving information from a multimodal Knowledge Base. Named entities have diverse visual representations and are therefore difficult to recognize. We argue that cross-modal retrieval may help bridge the semantic gap between an entity and its depictions, and is foremost complementary with mono-modal retrieval. We provide empirical evidence through experiments with a multimodal dual encoder, namely CLIP, on the recent ViQuAE, InfoSeek, and Encyclopedic-VQA datasets. Additionally, we study three different strategies to fine-tune such a model: mono-modal, cross-modal, or joint training. Our method, which combines mono- and cross-modal retrieval, is competitive with billion-parameter models on the three datasets, while being conceptually simpler and computationally cheaper.
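
The idea of combining mono-modal and cross-modal retrieval can be illustrated with a small score-fusion sketch. Everything here is an assumption made for illustration: the abstract does not say how the two retrieval signals are combined, so the weighted sum, the `rank_entities` helper, the `weight` parameter, and the toy two-dimensional vectors (standing in for CLIP embeddings) are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_entities(query_image, kb, weight=0.5):
    """Rank knowledge-base entities for a query image embedding.

    mono-modal score:  query image vs. the entity's reference image
    cross-modal score: query image vs. the entity's text (e.g. its name)
    The linear interpolation below is an illustrative assumption,
    not the fusion rule used in the paper.
    """
    scored = []
    for name, entry in kb.items():
        mono = cosine(query_image, entry["image"])
        cross = cosine(query_image, entry["text"])
        scored.append((weight * mono + (1.0 - weight) * cross, name))
    return [name for _, name in sorted(scored, reverse=True)]


# Toy knowledge base: entity "A" has an atypical reference image, so
# image-to-image similarity alone prefers the wrong entity "B".
toy_kb = {
    "A": {"image": (0.0, 1.0), "text": (1.0, 0.0)},
    "B": {"image": (0.8, 0.6), "text": (0.0, 1.0)},
}
mono_only = rank_entities((1.0, 0.0), toy_kb, weight=1.0)
combined = rank_entities((1.0, 0.0), toy_kb, weight=0.5)
```

In this toy setup the mono-modal ranking puts "B" first, while the combined ranking recovers "A", which is the kind of complementarity the abstract argues for.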

Title: WildGEN: Long-horizon Trajectory Generation for Wildlife. (arXiv:2401.05421v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.05421
Code URL: null
Copy Paste: [[2401.05421]] WildGEN: Long-horizon Trajectory Generation for Wildlife(http://arxiv.org/abs/2401.05421)
Summary:
Trajectory generation is an important concern in pedestrian, vehicle, and wildlife movement studies. Generated trajectories help enrich the training corpus for deep learning applications and may be used to facilitate simulation tasks. This is especially significant in the wildlife domain, where obtaining additional real data can be prohibitively expensive and time-consuming, and can raise ethical concerns. In this paper, we introduce WildGEN: a conceptual framework that addresses this challenge by employing a Variational Auto-encoder (VAE)-based method to acquire the movement characteristics exhibited by wild geese over a long horizon from a sparse set of truth samples. The generated trajectories are then post-processed with smoothing filters to reduce excessive wandering. Our evaluation is conducted through visual inspection and the computation of the Hausdorff distance between the generated and real trajectories. In addition, we use the Pearson Correlation Coefficient to measure how realistic the trajectories are, based on the similarity of clusters evaluated on the generated and real trajectories.
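
The Hausdorff distance used in the evaluation can be sketched in a few lines. This is a minimal illustration on two-dimensional points with Euclidean distance; the function names and the toy trajectories are assumptions, and the paper's actual coordinates, distance function, and preprocessing are not given in the abstract.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from any point in a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two point sequences."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Hypothetical trajectories as lists of (x, y) points.
real_track = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
generated_track = [(0.0, 1.0), (1.0, 1.0), (2.0, 3.0)]
distance = hausdorff(real_track, generated_track)
```

A single stray point dominates the result (here the generated point at (2, 3)), which is one reason a smoothing step that reduces excessive wandering can lower this metric.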

Title: Wavelet-Inspired Multiscale Graph Convolutional Recurrent Network for Traffic Forecasting. (arXiv:2401.06040v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.06040
Code URL: null
Copy Paste: [[2401.06040]] Wavelet-Inspired Multiscale Graph Convolutional Recurrent Network for Traffic Forecasting(http://arxiv.org/abs/2401.06040)
Summary:
Traffic forecasting is the foundation for intelligent transportation systems. Spatiotemporal graph neural networks have demonstrated state-of-the-art performance in traffic forecasting. However, these methods do not explicitly model some of the natural characteristics in traffic data, such as the multiscale structure that encompasses spatial and temporal variations at different levels of granularity or scale. To that end, we propose a Wavelet-Inspired Graph Convolutional Recurrent Network (WavGCRN), which combines a multiscale analysis (MSA)-based method with a deep learning (DL)-based method. In WavGCRN, the traffic data is decomposed into time-frequency components with the Discrete Wavelet Transformation (DWT), constructing a multi-stream input structure; then Graph Convolutional Recurrent Networks (GCRNs) are employed as encoders for each stream, extracting spatiotemporal features at different scales; and finally the learnable inverse DWT and a GCRN are combined as the decoder, fusing the information from all streams for traffic metrics reconstruction and prediction. Furthermore, road-network-informed graphs and data-driven graph learning are combined to accurately capture spatial correlation. The proposed method can offer well-defined interpretability, powerful learning capability, and competitive forecasting performance on real-world traffic data sets.
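
To make the decomposition step concrete, here is a minimal sketch of a one-level discrete wavelet transform and its inverse. It assumes the Haar wavelet and a fixed (non-learnable) inverse purely for illustration; the abstract does not name the wavelet family, and the learnable inverse DWT it mentions is not reproduced here. The readings in `readings` are made-up numbers.

```python
import math

_SCALE = 1.0 / math.sqrt(2.0)


def haar_dwt(signal):
    """One-level Haar DWT of an even-length sequence.

    Returns (approximation, detail): the low-frequency trend and the
    high-frequency fluctuation, each half the input length.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approximation = [(a + b) * _SCALE for a, b in pairs]
    detail = [(a - b) * _SCALE for a, b in pairs]
    return approximation, detail


def haar_idwt(approximation, detail):
    """Inverse of haar_dwt: interleave the reconstructed sample pairs."""
    signal = []
    for a, d in zip(approximation, detail):
        signal.append((a + d) * _SCALE)
        signal.append((a - d) * _SCALE)
    return signal


# Hypothetical traffic-speed readings from one sensor.
readings = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
trend, fluctuation = haar_dwt(readings)
reconstructed = haar_idwt(trend, fluctuation)
```

Each output stream (`trend`, `fluctuation`) would feed its own encoder in a multi-stream design, and applying `haar_dwt` again to `trend` yields coarser scales.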

chat

Title: Enhancing Personality Recognition in Dialogue by Data Augmentation and Heterogeneous Conversational Graph Networks. (arXiv:2401.05871v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05871
Code URL: null
Copy Paste: [[2401.05871]] Enhancing Personality Recognition in Dialogue by Data Augmentation and Heterogeneous Conversational Graph Networks(http://arxiv.org/abs/2401.05871)
Summary:
Personality recognition is useful for enhancing robots' ability to tailor user-adaptive responses, thus fostering rich human-robot interactions. One of the challenges in this task is a limited number of speakers in existing dialogue corpora, which hampers the development of robust, speaker-independent personality recognition models. Additionally, accurately modeling both the interdependencies among interlocutors and the intra-dependencies within the speaker in dialogues remains a significant issue. To address the first challenge, we introduce personality trait interpolation for speaker data augmentation. For the second, we propose heterogeneous conversational graph networks to independently capture both contextual influences and inherent personality traits. Evaluations on the RealPersonaChat corpus demonstrate our method's significant improvements over existing baselines.

retrieval augmented generation

retrieval-augmented generation

rag

Title: Functional Graphical Models: Structure Enables Offline Data-Driven Optimization. (arXiv:2401.05442v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.05442
Code URL: null
Copy Paste: [[2401.05442]] Functional Graphical Models: Structure Enables Offline Data-Driven Optimization(http://arxiv.org/abs/2401.05442)
Summary:
While machine learning models are typically trained to solve prediction problems, we might often want to use them for optimization problems. For example, given a dataset of proteins and their corresponding fluorescence levels, we might want to optimize for a new protein with the highest possible fluorescence. This kind of data-driven optimization (DDO) presents a range of challenges beyond those in standard prediction problems, since we need models that successfully predict the performance of new designs that are better than the best designs seen in the training set. It is not clear theoretically when existing approaches can even perform better than the naive approach that simply selects the best design in the dataset. In this paper, we study how structure can enable sample-efficient data-driven optimization. To formalize the notion of structure, we introduce functional graphical models (FGMs) and show theoretically how they can provide for principled data-driven optimization by decomposing the original high-dimensional optimization problem into smaller sub-problems. This allows us to derive much more practical regret bounds for DDO, and the result implies that DDO with FGMs can achieve nearly optimal designs in situations where naive approaches fail due to insufficient coverage of the offline data. We further present a data-driven optimization algorithm that infers the FGM structure itself, either over the original input variables or a latent variable representation of the inputs.
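
A toy example can show why decomposing the objective helps when the offline data covers the design space poorly. The sketch assumes an objective that splits into two known components over disjoint variable groups; the functions `f_a` and `f_b`, the dataset, and the helper names are all hypothetical, and in the setting of the paper the components would be learned from data rather than given.

```python
def f_a(x1, x2):
    """Hypothetical component depending only on (x1, x2); best at (2, 1)."""
    return -((x1 - 2) ** 2) - (x2 - 1) ** 2


def f_b(x3):
    """Hypothetical component depending only on x3; best at 3."""
    return -((x3 - 3) ** 2)


def total(design):
    x1, x2, x3 = design
    return f_a(x1, x2) + f_b(x3)


def best_in_dataset(dataset):
    """Naive approach: return the best complete design already observed."""
    return max(dataset, key=total)


def best_by_components(dataset):
    """Structured approach: optimize each component over the values seen
    in the data, then stitch the winning parts into one design."""
    best_12 = max({(x1, x2) for x1, x2, _ in dataset}, key=lambda p: f_a(*p))
    best_3 = max({x3 for _, _, x3 in dataset}, key=f_b)
    return (*best_12, best_3)


# No single observed design is good in both components.
offline_data = [(2, 1, 0), (0, 0, 3), (1, 1, 2)]
naive_design = best_in_dataset(offline_data)
structured_design = best_by_components(offline_data)
```

Here the naive choice is limited to designs in the dataset, while the structured choice combines the best observed `(x1, x2)` with the best observed `x3` into a design that never appears in the data.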

Title: EsaCL: Efficient Continual Learning of Sparse Models. (arXiv:2401.05667v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.05667
Code URL: null
Copy Paste: [[2401.05667]] EsaCL: Efficient Continual Learning of Sparse Models(http://arxiv.org/abs/2401.05667)
Summary:
A key challenge in the continual learning setting is to efficiently learn a sequence of tasks without forgetting how to perform previously learned tasks. Many existing approaches to this problem work by either retraining the model on previous tasks or by expanding the model to accommodate new tasks. However, these approaches typically suffer from increased storage and computational requirements, a problem that is worsened in the case of sparse models due to the need for expensive re-training after sparsification. To address this challenge, we propose a new method for efficient continual learning of sparse models (EsaCL) that can automatically prune redundant parameters without adversely impacting the model's predictive power, and that circumvents the need for retraining. We conduct a theoretical analysis of loss landscapes with parameter pruning, and design a directional pruning (SDP) strategy that is informed by the sharpness of the loss function with respect to the model parameters. SDP preserves the model with minimal loss of predictive accuracy, accelerating the learning of sparse models at each stage. To accelerate model updates, we introduce an intelligent data selection (IDS) strategy that can identify critical instances for estimating the loss landscape, yielding substantially improved data efficiency. The results of our experiments show that EsaCL achieves performance that is competitive with the state-of-the-art methods on three continual learning benchmarks, while using substantially reduced memory and computational resources.

Title: Revisiting Silhouette: From Micro to Macro Aggregation. (arXiv:2401.05831v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.05831
Code URL: null
Copy Paste: [[2401.05831]] Revisiting Silhouette: From Micro to Macro Aggregation(http://arxiv.org/abs/2401.05831)
Summary:
The silhouette coefficient is an established internal clustering evaluation measure that produces a score per data point, assessing the quality of its clustering assignment. To assess the quality of the clustering of the whole dataset, the scores of all the points in the dataset are typically averaged into a single value, a strategy that we call micro-averaging. As we illustrate in this work using a synthetic example, this micro-averaging strategy is sensitive both to cluster imbalance and to outliers (background noise). To address these issues, we propose an alternative aggregation strategy, which first averages the silhouette scores at the cluster level and then (macro) averages the scores across the clusters. Based on the same synthetic example, we show that the proposed macro-averaged silhouette score is robust to cluster imbalance and background noise. We have conducted an experimental study showing that our macro-averaged variant provides better estimates of the ground truth number of clusters in several cases compared to the typical micro-averaged score.
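
The difference between the two aggregation strategies can be sketched directly from the description above. The per-point score follows the standard silhouette definition s = (b - a) / max(a, b) with Euclidean distance; the function names, the convention of scoring singleton clusters as 0, and the toy data are assumptions for illustration, not the authors' code. The sketch requires at least two clusters.

```python
import math
from collections import defaultdict


def silhouette_scores(points, labels):
    """Per-point silhouette scores with Euclidean distance."""
    clusters = defaultdict(list)
    for index, label in enumerate(labels):
        clusters[label].append(index)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # singleton cluster: 0 by convention
            continue
        # a: mean distance to the other points of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the points of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def micro_average(scores):
    """Average over all points: large clusters dominate."""
    return sum(scores) / len(scores)


def macro_average(scores, labels):
    """Average within each cluster first, then across clusters."""
    per_cluster = defaultdict(list)
    for score, label in zip(scores, labels):
        per_cluster[label].append(score)
    cluster_means = [sum(v) / len(v) for v in per_cluster.values()]
    return sum(cluster_means) / len(cluster_means)


# Toy imbalanced clustering: two points in cluster 0, one in cluster 1.
toy_points = [(0.0,), (1.0,), (10.0,)]
toy_labels = [0, 0, 1]
toy_scores = silhouette_scores(toy_points, toy_labels)
micro = micro_average(toy_scores)
macro = macro_average(toy_scores, toy_labels)
```

With equal-sized clusters the two averages coincide; they diverge as soon as cluster sizes differ, because the macro average gives every cluster the same weight regardless of how many points it holds.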

Title: Binary Linear Tree Commitment-based Ownership Protection for Distributed Machine Learning. (arXiv:2401.05895v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.05895
Code URL: null
Copy Paste: [[2401.05895]] Binary Linear Tree Commitment-based Ownership Protection for Distributed Machine Learning(http://arxiv.org/abs/2401.05895)
Summary:
Distributed machine learning enables parallel training of extensive datasets by delegating computing tasks across multiple workers. Despite the cost reduction benefits of distributed machine learning, the dissemination of final model weights often leads to potential conflicts over model ownership as workers struggle to substantiate their involvement in the training computation. To address the above ownership issues and prevent accidental failures and malicious attacks, verifying the computational integrity and effectiveness of workers becomes particularly crucial in distributed machine learning. In this paper, we propose a novel binary linear tree commitment-based ownership protection model to ensure computational integrity with limited overhead and concise proofs. Due to the frequent updates of parameters during training, our commitment scheme introduces a maintainable tree structure to reduce the costs of updating proofs. Distinguished from SNARK-based verifiable computation, our model achieves efficient proof aggregation by leveraging inner product arguments. Furthermore, proofs of model weights are watermarked by worker identity keys to prevent commitments from being forged or duplicated. The performance analysis and comparison with SNARK-based hash commitments validate the efficacy of our model in preserving computational integrity within distributed machine learning.

Title: Machine Learning Insides OptVerse AI Solver: Design Principles and Applications. (arXiv:2401.05960v1 [cs.AI])

Paper URL: http://arxiv.org/abs/2401.05960
Code URL: null
Copy Paste: [[2401.05960]] Machine Learning Insides OptVerse AI Solver: Design Principles and Applications(http://arxiv.org/abs/2401.05960)
Summary:
In an era of digital ubiquity, efficient resource management and decision-making are paramount across numerous industries. To this end, we present a comprehensive study on the integration of machine learning (ML) techniques into Huawei Cloud's OptVerse AI Solver, which aims to mitigate the scarcity of real-world mathematical programming instances, and to surpass the capabilities of traditional optimization techniques. We showcase our methods for generating complex SAT and MILP instances utilizing generative models that mirror the multifaceted structures of real-world problems. Furthermore, we introduce a training framework leveraging augmentation policies to maintain solvers' utility in dynamic environments. Besides the data generation and augmentation, our proposed approaches also include novel ML-driven policies for personalized solver strategies, with an emphasis on applications like graph convolutional networks for initial basis selection and reinforcement learning for advanced presolving and cut selection. Additionally, we detail the incorporation of state-of-the-art parameter tuning algorithms which markedly elevate solver performance. Compared with traditional solvers such as Gurobi and SCIP, our ML-augmented OptVerse AI Solver demonstrates superior speed and precision across both established benchmarks and real-world scenarios, reinforcing the practical imperative and effectiveness of machine learning techniques in mathematical programming solvers.

Title: VI-PANN: Harnessing Transfer Learning and Uncertainty-Aware Variational Inference for Improved Generalization in Audio Pattern Recognition. (arXiv:2401.05531v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.05531
Code URL: null
Copy Paste: [[2401.05531]] VI-PANN: Harnessing Transfer Learning and Uncertainty-Aware Variational Inference for Improved Generalization in Audio Pattern Recognition(http://arxiv.org/abs/2401.05531)
Summary:
Transfer learning (TL) is an increasingly popular approach to training deep learning (DL) models that leverages the knowledge gained by training a foundation model on diverse, large-scale datasets for use on downstream tasks where less domain- or task-specific data is available. The literature is rich with TL techniques and applications; however, the bulk of the research makes use of deterministic DL models which are often uncalibrated and lack the ability to communicate a measure of epistemic (model) uncertainty in prediction. Unlike their deterministic counterparts, Bayesian DL (BDL) models are often well-calibrated, provide access to epistemic uncertainty for a prediction, and are capable of achieving competitive predictive performance. In this study, we propose variational inference pre-trained audio neural networks (VI-PANNs). VI-PANNs are a variational inference variant of the popular ResNet-54 architecture which are pre-trained on AudioSet, a large-scale audio event detection dataset. We evaluate the quality of the resulting uncertainty when transferring knowledge from VI-PANNs to other downstream acoustic classification tasks using the ESC-50, UrbanSound8K, and DCASE2013 datasets. We demonstrate, for the first time, that it is possible to transfer calibrated uncertainty information along with knowledge from upstream tasks to enhance a model's capability to perform downstream tasks.

Title: Dynamic Indoor Fingerprinting Localization based on Few-Shot Meta-Learning with CSI Images. (arXiv:2401.05711v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.05711
Code URL: null
Copy Paste: [[2401.05711]] Dynamic Indoor Fingerprinting Localization based on Few-Shot Meta-Learning with CSI Images(http://arxiv.org/abs/2401.05711)
Summary:
While fingerprinting localization is favored for its effectiveness, it is hindered by high data acquisition costs and the inaccuracy of static database-based estimates. Addressing these issues, this letter presents an innovative indoor localization method using a data-efficient meta-learning algorithm. This approach, grounded in the "Learning to Learn" paradigm of meta-learning, utilizes historical localization tasks to improve adaptability and learning efficiency in dynamic indoor environments. We introduce a task-weighted loss to enhance knowledge transfer within this framework. Our comprehensive experiments confirm the method's robustness and superiority over current benchmarks, achieving a notable 23.13% average gain in Mean Euclidean Distance, particularly effective in scenarios with limited CSI data.

Title: Optimistic Model Rollouts for Pessimistic Offline Policy Optimization. (arXiv:2401.05899v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.05899
Code URL: null
Copy Paste: [[2401.05899]] Optimistic Model Rollouts for Pessimistic Offline Policy Optimization(http://arxiv.org/abs/2401.05899)
Summary:
Model-based offline reinforcement learning (RL) has made remarkable progress, offering a promising avenue for improving generalization with synthetic model rollouts. Existing works primarily focus on incorporating pessimism for policy optimization, usually via constructing a Pessimistic Markov Decision Process (P-MDP). However, the P-MDP discourages the policies from learning in out-of-distribution (OOD) regions beyond the support of offline datasets, which can under-utilize the generalization ability of dynamics models. In contrast, we propose constructing an Optimistic MDP (O-MDP). We initially observed the potential benefits of optimism brought by encouraging more OOD rollouts. Motivated by this observation, we present ORPO, a simple yet effective model-based offline RL framework. ORPO generates Optimistic model Rollouts for Pessimistic offline policy Optimization. Specifically, we train an optimistic rollout policy in the O-MDP to sample more OOD model rollouts. Then we relabel the sampled state-action pairs with penalized rewards and optimize the output policy in the P-MDP. Theoretically, we demonstrate that the performance of policies trained with ORPO can be lower-bounded in linear MDPs. Experimental results show that our framework significantly outperforms P-MDP baselines by a margin of 30%, achieving state-of-the-art performance on the widely-used benchmark. Moreover, ORPO exhibits notable advantages in problems that require generalization.
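
The contrast between an optimistic and a pessimistic MDP can be sketched as two ways of shaping the reward of a model-generated transition. This is a hedged illustration only: the abstract does not give the reward definitions, so the uncertainty-based bonus and penalty (in the style of earlier uncertainty-penalized offline RL methods), the coefficient `coef`, and both function names are assumptions.

```python
def shaped_reward(reward, uncertainty, coef, optimistic):
    """Add an uncertainty bonus (optimistic) or subtract a penalty (pessimistic).

    `uncertainty` stands for some estimate of how unreliable the dynamics
    model is at this state-action pair; how it is computed is left open.
    """
    sign = 1.0 if optimistic else -1.0
    return reward + sign * coef * uncertainty


def relabel_for_pessimistic_training(transitions, coef):
    """Relabel rollout samples with penalized rewards before policy optimization.

    Each transition is (state, action, model_reward, uncertainty); the
    output keeps (state, action, penalized_reward).
    """
    return [
        (state, action, shaped_reward(reward, uncertainty, coef, optimistic=False))
        for state, action, reward, uncertainty in transitions
    ]


# Hypothetical samples collected by an optimistic rollout policy.
rollout_samples = [("s0", "a0", 1.0, 0.5), ("s1", "a1", 0.2, 0.0)]
penalized = relabel_for_pessimistic_training(rollout_samples, coef=2.0)
```

The optimistic shaping would be used only to decide where the rollout policy explores; the samples it gathers are then relabeled pessimistically before the output policy is trained on them.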

multi-run

chain-of-thought

tree-of-thought

agent

Title: Machine Teaching for Building Modular AI Agents based on Zero-shot Learners. (arXiv:2401.05467v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.05467
Code URL: null
Copy Paste: [[2401.05467]] Machine Teaching for Building Modular AI Agents based on Zero-shot Learners(http://arxiv.org/abs/2401.05467)
Summary:
The recent advances in large language models (LLMs) have led to the creation of many modular AI agents. These agents employ LLMs as zero-shot learners to perform sub-tasks in order to solve complex tasks set forth by human users. We propose an approach to enhance the robustness and performance of modular AI agents that utilize LLMs as zero-shot learners. Our iterative machine teaching method offers an efficient way to teach AI agents over time with limited human feedback, addressing the limit posed by the quality of zero-shot learning. We advocate leveraging the data traces from initial deployments and outputs or annotations from the zero-shot learners to train smaller, task-specific substitute models, which can reduce both the monetary costs and environmental impact. Our machine teaching process draws on human expertise to correct examples with a high likelihood of misannotation. Results on three tasks, common to conversational AI agents, show that close-to-oracle performance can be achieved with supervision on 20-70% of the dataset, depending upon the complexity of the task and the performance of the zero-shot learners.

Title: InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks. (arXiv:2401.05507v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05507
Code URL: null
Copy Paste: [[2401.05507]] InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks(http://arxiv.org/abs/2401.05507)
Summary:
In this paper, we introduce "InfiAgent-DABench", the first benchmark specifically designed to evaluate LLM-based agents in data analysis tasks. This benchmark contains DAEval, a dataset consisting of 311 data analysis questions derived from 55 CSV files, and an agent framework to evaluate LLMs as data analysis agents. We adopt a format-prompting technique to ensure that questions are closed-form and can be automatically evaluated. Our extensive benchmarking of 23 state-of-the-art LLMs uncovers the current challenges encountered in data analysis tasks. In addition, we have developed DAAgent, a specialized agent trained on instruction-tuning datasets. Evaluation datasets and toolkits for InfiAgent-DABench are released at https://github.com/InfiAgent/InfiAgent.

Title: Innate-Values-driven Reinforcement Learning for Cooperative Multi-Agent Systems. (arXiv:2401.05572v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.05572
Code URL: null
Copy Paste: [[2401.05572]] Innate-Values-driven Reinforcement Learning for Cooperative Multi-Agent Systems(http://arxiv.org/abs/2401.05572)
Summary:
Innate values describe agents' intrinsic motivations, which reflect their inherent interests and preferences in pursuing goals and drive them to develop diverse skills that satisfy their various needs. The essence of reinforcement learning (RL) is learning from interaction based on reward-driven behaviors (with rewards such as utilities), much like natural agents. It is therefore an excellent model for describing the innate-values-driven (IV) behaviors of AI agents. Especially in multi-agent systems (MAS), making AI agents aware of the need to balance group utilities and system costs and to satisfy group members' needs in their cooperation is a crucial problem if individuals are to learn to support their community and integrate into human society in the long term. This paper proposes a hierarchical compound intrinsic value reinforcement learning model, innate-values-driven reinforcement learning (IVRL), to describe the complex behaviors of multi-agent interaction in cooperation. We implement the IVRL architecture in the StarCraft Multi-Agent Challenge (SMAC) environment and compare the cooperative performance of three types of innate-value agents (Coward, Neutral, and Reckless) using three benchmark multi-agent RL algorithms: QMIX, IQL, and QTRAN. The results demonstrate that by rationally organizing individuals' various needs, the group can achieve better performance at lower cost.

Title: Towards Goal-Oriented Agents for Evolving Problems Observed via Conversation. (arXiv:2401.05822v1 [cs.AI])

Paper URL: http://arxiv.org/abs/2401.05822
Code URL: null
Copy Paste: [[2401.05822]] Towards Goal-Oriented Agents for Evolving Problems Observed via Conversation(http://arxiv.org/abs/2401.05822)
Summary:
The objective of this work is to train a chatbot capable of solving evolving problems through conversing with a user about a problem the chatbot cannot directly observe. The system consists of a virtual problem (in this case a simple game), a simulated user capable of answering natural language questions that can observe and perform actions on the problem, and a Deep Q-Network (DQN)-based chatbot architecture. The chatbot is trained with the goal of solving the problem through dialogue with the simulated user using reinforcement learning. The contributions of this paper are as follows: a proposed architecture to apply a conversational DQN-based agent to evolving problems, an exploration of training methods such as curriculum learning on model performance and the effect of modified reward functions in the case of increasing environment complexity.

Title: Combating Adversarial Attacks with Multi-Agent Debate. (arXiv:2401.05998v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05998
Code URL: null
Copy Paste: [[2401.05998]] Combating Adversarial Attacks with Multi-Agent Debate(http://arxiv.org/abs/2401.05998)
Summary:
While state-of-the-art language models have achieved impressive results, they remain susceptible to inference-time adversarial attacks, such as adversarial prompts generated by red teams (arXiv:2209.07858). One approach proposed to improve the general quality of language model generations is multi-agent debate, where language models self-evaluate through discussion and feedback (arXiv:2305.14325). We implement multi-agent debate between current state-of-the-art language models and evaluate models' susceptibility to red team attacks in both single- and multi-agent settings. We find that multi-agent debate can reduce model toxicity when jailbroken or less capable models are forced to debate with non-jailbroken or more capable models. We also find marginal improvements through the general usage of multi-agent interactions. We further perform adversarial prompt content classification via embedding clustering, and analyze the susceptibility of different models to different types of attack topics.

Title: XGBoost Learning of Dynamic Wager Placement for In-Play Betting on an Agent-Based Model of a Sports Betting Exchange. (arXiv:2401.06086v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.06086
Code URL: null
Copy Paste: [[2401.06086]] XGBoost Learning of Dynamic Wager Placement for In-Play Betting on an Agent-Based Model of a Sports Betting Exchange(http://arxiv.org/abs/2401.06086)
Summary:
We present first results from the use of XGBoost, a highly effective machine learning (ML) method, within the Bristol Betting Exchange (BBE), an open-source agent-based model (ABM) designed to simulate a contemporary sports-betting exchange with in-play betting during track-racing events such as horse races. We use the BBE ABM and its array of minimally-simple bettor-agents as a synthetic data generator which feeds into our XGBoost ML system, with the intention that XGBoost discovers profitable dynamic betting strategies by learning from the more profitable bets made by the BBE bettor-agents. After this XGBoost training, which results in one or more decision trees, a bettor-agent with a betting strategy determined by the XGBoost-learned decision tree(s) is added to the BBE ABM and made to bet on a sequence of races under various conditions and betting-market scenarios, with profitability serving as the primary metric of comparison and evaluation. Our initial findings presented here show that XGBoost trained in this way can indeed learn profitable betting strategies, and can generalise to learn strategies that outperform each of the set of strategies used for creation of the training data. To foster further research and enhancements, the complete version of our extended BBE, including the XGBoost integration, has been made freely available as an open-source release on GitHub.
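Abstract:
We present initial results from using XGBoost, a highly effective machine learning (ML) method, within the Bristol Betting Exchange (BBE), an open-source agent-based model (ABM) that simulates a contemporary sports-betting exchange with in-play betting during track-racing events such as horse races. We use the BBE ABM and its set of minimally simple bettor-agents as a synthetic data generator that feeds our XGBoost ML system, so that XGBoost can discover profitable dynamic betting strategies by learning from the more profitable bets made by the BBE bettor-agents. After XGBoost training, which produces one or more decision trees, a bettor-agent whose betting strategy is determined by the learned tree(s) is added to the BBE ABM and made to bet on a sequence of races under various conditions and betting-market scenarios, with profitability as the primary metric for comparison and evaluation. Our initial findings show that XGBoost trained in this way can indeed learn profitable betting strategies, and can generalise to strategies that outperform every strategy used to create the training data. To encourage further research and improvement, the complete version of our extended BBE, including the XGBoost integration, is freely available as an open-source release on GitHub.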
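
Illustrative sketch (not from the paper): the loop described in this abstract is to generate synthetic bets from simple agents, learn a rule from the profitable ones, and then bet with the learned rule. The toy version below shows that loop in plain Python. To stay dependency-free it swaps XGBoost's boosted decision trees for a single decision stump, and the race simulator, the two features (`lead`, `odds`), and all function names are hypothetical stand-ins rather than anything taken from the BBE code.

```python
import random


def simulate_bets(n, rng):
    """Toy stand-in for the BBE bettor-agents: n random in-play bets.

    Each record is ((lead, odds), profit) for a one-unit stake.
    """
    records = []
    for _ in range(n):
        lead = rng.uniform(-1.0, 1.0)   # backed runner's lead when the bet is placed
        odds = rng.uniform(1.5, 4.0)    # decimal odds offered
        won = rng.random() < 0.5 + 0.4 * lead
        profit = odds - 1.0 if won else -1.0
        records.append(((lead, odds), profit))
    return records


def total_profit(records, feature, threshold):
    """Profit of the rule 'bet when feature > threshold' on the given records."""
    return sum(p for x, p in records if x[feature] > threshold)


def fit_stump(records, n_features=2):
    """Search every (feature, threshold) pair for the most profitable rule."""
    best_total, best_feature, best_threshold = float("-inf"), 0, 0.0
    for feature in range(n_features):
        for x, _ in records:
            threshold = x[feature]
            total = total_profit(records, feature, threshold)
            if total > best_total:
                best_total, best_feature, best_threshold = total, feature, threshold
    return best_feature, best_threshold, best_total


rng = random.Random(0)
train, held_out = simulate_bets(300, rng), simulate_bets(300, rng)
feature, threshold, train_profit = fit_stump(train)

print("rule: bet when feature", feature, ">", round(threshold, 3))
print("held-out profit, learned rule:", round(total_profit(held_out, feature, threshold), 2))
print("held-out profit, bet on everything:", round(sum(p for _, p in held_out), 2))
```

Comparing the learned rule with the bet-on-everything baseline on held-out races mirrors the abstract's use of profitability as the evaluation metric. A real experiment would replace the stump with an XGBoost model and the simulator with BBE sessions.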

Title: Interpretable Concept Bottlenecks to Align Reinforcement Learning Agents. (arXiv:2401.05821v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.05821
Code URL: null
Copy Paste: [[2401.05821]] Interpretable Concept Bottlenecks to Align Reinforcement Learning Agents(http://arxiv.org/abs/2401.05821)
Summary:
Reward sparsity, difficult credit assignment, and misalignment are only a few of the many issues that make it difficult, if not impossible, for deep reinforcement learning (RL) agents to learn optimal policies. Unfortunately, the black-box nature of deep networks impedes the inclusion of domain experts who could interpret the model and correct wrong behavior. To this end, we introduce Successive Concept Bottlenecks Agents (SCoBots), which make the whole decision pipeline transparent via the integration of consecutive concept bottleneck layers. SCoBots make use of not only relevant object properties but also of relational concepts. Our experimental results provide strong evidence that SCoBots allow domain experts to efficiently understand and regularize their behavior, resulting in potentially better human-aligned RL. In this way, SCoBots enabled us to identify a misalignment problem in the most simple and iconic video game, Pong, and resolve it.
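Abstract:
Reward sparsity, difficult credit assignment, and misalignment are only a few of the many issues that make it difficult, if not impossible, for deep reinforcement learning (RL) agents to learn optimal policies. Unfortunately, the black-box nature of deep networks hinders the involvement of domain experts who could interpret the model and correct wrong behavior. To this end, we introduce Successive Concept Bottlenecks Agents (SCoBots), which make the whole decision pipeline transparent by integrating consecutive concept bottleneck layers. SCoBots make use not only of relevant object properties but also of relational concepts. Our experimental results provide strong evidence that SCoBots allow domain experts to efficiently understand and regularize the agents' behavior, potentially resulting in better human-aligned RL. In this way, SCoBots enabled us to identify a misalignment problem in the simplest and most iconic video game, Pong, and to resolve it.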
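
Illustrative sketch (not from the paper): the abstract describes a pipeline of consecutive bottlenecks, going from object properties to relational concepts to an action, that a domain expert can inspect and edit. The toy Pong-like example below shows one way such a pipeline could look. Every name, the two relational concepts, and the hand-picked weights are assumptions made for illustration and are not the SCoBots implementation.

```python
from dataclasses import dataclass


@dataclass
class GameObject:
    name: str
    x: float
    y: float  # screen coordinates: y grows downward


def object_properties(objects):
    """First bottleneck: named, human-readable properties of each object."""
    return {f"{o.name}.{axis}": getattr(o, axis) for o in objects for axis in ("x", "y")}


def relational_concepts(props):
    """Second bottleneck: relations computed only from the object properties."""
    return {
        "dy(player, ball)": props["ball.y"] - props["player.y"],
        "dy(player, enemy)": props["enemy.y"] - props["player.y"],
    }


def select_action(concepts, weights, dead_zone=2.0):
    """Interpretable action head: a weighted sum over named concepts.

    Concepts missing from `weights` are ignored, which is how an expert
    prunes a concept in this sketch.
    """
    score = sum(weights.get(name, 0.0) * value for name, value in concepts.items())
    if score > dead_zone:
        return "DOWN"
    if score < -dead_zone:
        return "UP"
    return "NOOP"


frame = [GameObject("player", 140, 50), GameObject("ball", 80, 80), GameObject("enemy", 16, 20)]
concepts = relational_concepts(object_properties(frame))

learned = {"dy(player, ball)": 0.2, "dy(player, enemy)": 1.0}  # hand-picked, hypothetical
pruned = {"dy(player, ball)": 0.2}                               # expert removes one concept

print(concepts)
print(select_action(concepts, learned), select_action(concepts, pruned))
```

Because every intermediate value has a name, an expert can see which concept drives the decision and remove it, here by dropping its weight. This is the kind of inspection and regularization the abstract refers to, shown on made-up numbers.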