2024-01-11

language model

Title: How predictable is language model benchmark performance? (arXiv:2401.04757v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.04757
Code URL: null
Copy Paste: [[2401.04757]] How predictable is language model benchmark performance?(http://arxiv.org/abs/2401.04757)
Summary:
We investigate large language model performance across five orders of magnitude of compute scaling in eleven recent model architectures. We show that average benchmark performance, aggregating over many individual tasks and evaluations as in the commonly-used BIG-Bench dataset, is decently predictable as a function of training compute scale. Specifically, when extrapolating BIG-Bench Hard performance across one order of magnitude in compute, we observe average absolute errors of 6 percentage points (pp). By contrast, extrapolation for individual BIG-Bench tasks across an order of magnitude in compute yields higher average errors of 18pp. Nonetheless, individual task performance remains significantly more predictable than chance. Overall, our work suggests compute scaling provides a promising basis to forecast AI capabilities in diverse benchmarks, though predicting performance in specific tasks poses challenges.
摘要：
我们在 11 个最新模型架构中研究了跨五个数量级的计算扩展的大型语言模型性能。我们表明，平均基准性能（如常用的 BIG-Bench 数据集中的许多单独任务和评估的汇总）作为训练计算规模的函数是可以很好预测的。具体来说，当在计算中将 BIG-Bench Hard 性能推断出一个数量级时，我们观察到平均绝对误差为 6 个百分点 (pp)。相比之下，在计算量级上对单个 BIG-Bench 任务进行外推会产生 18pp 的较高平均误差。尽管如此，个人任务的表现仍然比偶然更容易预测。总的来说，我们的工作表明，计算扩展为预测不同基准中的人工智能能力提供了一个有希望的基础，尽管预测特定任务的性能提出了挑战。

Title: MoSECroT: Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer. (arXiv:2401.04821v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.04821
Code URL: null
Copy Paste: [[2401.04821]] MoSECroT: Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer(http://arxiv.org/abs/2401.04821)
Summary:
Transformer-based pre-trained language models (PLMs) have achieved remarkable performance in various natural language processing (NLP) tasks. However, pre-training such models can take considerable resources that are available almost exclusively for high-resource languages. By contrast, static word embeddings are easier to train in terms of computing resources and the amount of data required. In this paper, we introduce MoSECroT (Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer), a novel and challenging task that is especially relevant to low-resource languages for which static word embeddings are available. To tackle the task, we present the first framework that leverages relative representations to construct a common space for the embeddings of a source language PLM and the static word embeddings of a target language. In this way, we can train the PLM on source-language training data and perform zero-shot transfer to the target language by simply swapping the embedding layer. However, through extensive experiments on two classification datasets, we show that although our proposed framework is competitive with weak baselines when addressing MoSECroT, it fails to achieve competitive results compared with some strong baselines. In this paper, we attempt to explain this negative result and provide several thoughts on possible improvement.
摘要：
基于 Transformer 的预训练语言模型 (PLM) 在各种自然语言处理 (NLP) 任务中取得了显着的性能。然而，预训练此类模型可能会占用大量资源，而这些资源几乎仅适用于高资源语言。相反，静态词嵌入在计算资源和所需数据量方面更容易训练。在本文中，我们介绍了用于跨语言零样本迁移的静态词嵌入的 MoSECroT 模型缝合，这是一项新颖且具有挑战性的任务，尤其与可使用静态词嵌入的低资源语言相关。为了解决这一任务，我们提出了第一个框架，该框架利用相对表示来为源语言 PLM 的嵌入和目标语言的静态词嵌入构建公共空间。这样，我们可以在源语言训练数据上训练 PLM，并通过简单地交换嵌入层来执行到目标语言的零样本迁移。然而，通过对两个分类数据集的大量实验，我们表明，尽管我们提出的框架在解决 MoSECroT 问题时与弱基线具有竞争力，但与一些强基线相比，它无法实现有竞争力的结果。在本文中，我们试图解释这一负面结果，并就可能的改进提供一些想法。
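The abstract above says the framework uses relative representations to place PLM embeddings and static word embeddings in a common space, without giving details. A minimal sketch of the general idea follows, assuming the usual formulation in which each vector is re-described by its cosine similarity to a shared set of anchor items. The toy vectors, the anchor choice, and the function names are hypothetical and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def relative_representation(vec, anchors):
    """Describe `vec` by its similarity to each anchor instead of by raw coordinates."""
    return [cosine(vec, anchor) for anchor in anchors]

# Hypothetical toy data: the "target" space is the "source" space rotated by
# 90 degrees and rescaled, so raw coordinates are not directly comparable.
source_anchors = [(1.0, 0.0), (0.0, 1.0)]
target_anchors = [(0.0, 1.0), (-1.0, 0.0)]   # the same anchor items, seen in the target space
source_word = (1.0, 1.0)
target_word = (-2.0, 2.0)                    # the same word, seen in the target space

rel_source = relative_representation(source_word, source_anchors)
rel_target = relative_representation(target_word, target_anchors)
print(rel_source, rel_target)
```

Because cosine similarity ignores rotation and scale, the two relative representations coincide in this toy case even though the raw vectors differ. That is the property that would make swapping an embedding layer plausible in principle.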

Title: ANGO: A Next-Level Evaluation Benchmark For Generation-Oriented Language Models In Chinese Domain. (arXiv:2401.04898v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.04898
Code URL: null
Copy Paste: [[2401.04898]] ANGO: A Next-Level Evaluation Benchmark For Generation-Oriented Language Models In Chinese Domain(http://arxiv.org/abs/2401.04898)
Summary:
Recently, various evaluation datasets for Large Language Models (LLMs) have emerged, but most of them suffer from distorted rankings and make it difficult to analyze model capabilities. To address these concerns, this paper introduces ANGO, a Chinese multiple-choice question evaluation benchmark. ANGO proposes a Keypoint categorization standard for the first time: each question in ANGO can correspond to multiple keypoints, which effectively enhances the interpretability of evaluation results. Based on the performance of real humans, we build a quantifiable question difficulty standard and divide ANGO questions into 9 difficulty levels, which provides more precise guidance for model training. To minimize the impact of data leakage and fully leverage ANGO's innovative features, we have engineered exclusive sampling strategies and a new evaluation framework that supports swift test-set iteration. Our experiments demonstrate that ANGO poses a stronger challenge to models and reveals more details in evaluation results than existing benchmarks.
摘要：
近年来，各种大型语言模型（LLM）评估数据集不断涌现，但大多存在排名扭曲、模型能力分析困难等问题。针对这些问题，本文介绍了中国多项选择题评估基准 ANGO。 ANGO首次提出\textit{Keypoint}分类标准，ANGO中每个问题可以对应多个关键点，有效增强评估结果的可解释性。基于真人的表现，我们建立了可量化的问题难度标准，将ANGO问题分为9个难度级别，为模型训练提供更精准的指导。为了最大限度地减少数据泄露的影响并充分利用 ANGO 的创新功能，我们设计了独家采样策略和支持快速测试集迭代的新评估框架。我们的实验表明，与现有基准相比，ANGO 对模型提出了更强的挑战，并且在评估结果中揭示了更多细节。

Title: The Impact of Reasoning Step Length on Large Language Models. (arXiv:2401.04925v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.04925
Code URL: null
Copy Paste: [[2401.04925]] The Impact of Reasoning Step Length on Large Language Models(http://arxiv.org/abs/2401.04925)
Summary:
Chain of Thought (CoT) is significant in improving the reasoning abilities of large language models (LLMs). However, the correlation between the effectiveness of CoT and the length of reasoning steps in prompts remains largely unknown. To shed light on this, we have conducted several empirical experiments to explore the relations. Specifically, we design experiments that expand and compress the rationale reasoning steps within CoT demonstrations, while keeping all other factors constant. We have the following key findings. First, the results indicate that lengthening the reasoning steps in prompts, even without adding new information into the prompt, considerably enhances LLMs' reasoning abilities across multiple datasets. Alternatively, shortening the reasoning steps, even while preserving the key information, significantly diminishes the reasoning abilities of models. This finding highlights the importance of the number of steps in CoT prompts and provides practical guidance to make better use of LLMs' potential in complex problem-solving scenarios. Second, we also investigated the relationship between the performance of CoT and the rationales used in demonstrations. Surprisingly, the result shows that even incorrect rationales can yield favorable outcomes if they maintain the requisite length of inference. Third, we observed that the advantages of increasing reasoning steps are task-dependent: simpler tasks require fewer steps, whereas complex tasks gain significantly from longer inference sequences.
摘要：
思维链（CoT）对于提高大型语言模型（LLM）的推理能力具有重要意义。然而，CoT 的有效性与提示中推理步骤的长度之间的相关性仍然很大程度上未知。为了阐明这一点，我们进行了几次实证实验来探索其中的关系。具体来说，我们设计了扩展和压缩 CoT 演示中的基本原理推理步骤的实验，同时保持所有其他因素不变。我们有以下主要发现。首先，结果表明，即使没有在提示中添加新信息，延长提示中的推理步骤也可以显着增强 LLM 跨多个数据集的推理能力。反之，即使在保留关键信息的情况下缩短推理步骤，也会显着降低模型的推理能力。这一发现强调了 CoT 提示中步骤数量的重要性，并为在复杂的问题解决场景中更好地发挥 LLM 的潜力提供了实用指导。其次，我们还研究了 CoT 的性能与演示中使用的基本原理之间的关系。令人惊讶的是，结果表明，即使是不正确的理由，如果保持必要的推理长度，也能产生有利的结果。第三，我们观察到增加推理步骤的优势与任务相关：更简单的任务需要更少的步骤，而复杂的任务可以从更长的推理序列中获得显着的收益。

Title: Generating Diverse and High-Quality Texts by Minimum Bayes Risk Decoding. (arXiv:2401.05054v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05054
Code URL: null
Copy Paste: [[2401.05054]] Generating Diverse and High-Quality Texts by Minimum Bayes Risk Decoding(http://arxiv.org/abs/2401.05054)
Summary:
One of the most important challenges in text generation systems is to produce outputs that are not only correct but also diverse. Recently, Minimum Bayes-Risk (MBR) decoding has gained prominence for generating sentences of the highest quality among the decoding algorithms. However, existing algorithms proposed for generating diverse outputs are predominantly based on beam search or random sampling, thus their output quality is capped by these underlying methods. In this paper, we investigate an alternative approach -- we develop diversity-promoting decoding algorithms by enforcing diversity objectives to MBR decoding. We propose two variants of MBR, Diverse MBR (DMBR) and $k$-medoids MBR (KMBR), methods to generate a set of sentences with high quality and diversity. We evaluate DMBR and KMBR on a variety of directed text generation tasks using encoder-decoder models and a large language model with prompting. The experimental results show that the proposed method achieves a better trade-off than the diverse beam search and sampling algorithms.
摘要：
文本生成系统中最重要的挑战之一是产生不仅正确而且多样化的输出。最近，最小贝叶斯风险（MBR）解码因在解码算法中生成最高质量的句子而受到重视。然而，现有的用于生成不同输出的算法主要基于波束搜索或随机采样，因此它们的输出质量受到这些底层方法的限制。在本文中，我们研究了一种替代方法——通过对 MBR 解码强制执行多样性目标来开发多样性促进解码算法。我们提出了 MBR 的两种变体，即多样化 MBR (DMBR) 和 $k$-medoids MBR (KMBR)，这是生成一组高质量和多样性句子的方法。我们使用编码器-解码器模型和带提示的大型语言模型在各种定向文本生成任务上评估 DMBR 和 KMBR。实验结果表明，所提出的方法比多样化的波束搜索和采样算法实现了更好的权衡。
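For readers unfamiliar with the baseline that DMBR and KMBR build on, here is a minimal sketch of plain Minimum Bayes-Risk decoding: given sampled candidates, pick the one with the highest average utility against the other samples. It does not implement the diversity objectives proposed in the paper. The token-overlap utility and the sample sentences are illustrative assumptions only.

```python
from typing import Callable, Sequence

def token_f1(hyp: str, ref: str) -> float:
    """Toy utility: F1 over unique whitespace tokens (an assumption, not the paper's metric)."""
    hyp_tokens, ref_tokens = set(hyp.split()), set(ref.split())
    overlap = len(hyp_tokens & ref_tokens)
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def mbr_decode(candidates: Sequence[str],
               utility: Callable[[str, str], float] = token_f1) -> str:
    """Return the candidate with the highest average utility against the other candidates."""
    def expected_utility(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        if not others:
            return 0.0
        return sum(utility(candidates[i], other) for other in others) / len(others)

    best_index = max(range(len(candidates)), key=expected_utility)
    return candidates[best_index]

# Hypothetical samples, standing in for outputs drawn from a generation model.
samples = [
    "the cat sat on the mat",
    "a cat sat on the mat",
    "the cat is on the mat",
    "dogs bark loudly",
]
best = mbr_decode(samples)
print(best)
```

The selected sentence is the one most similar to the rest of the sample set. A diversity-promoting variant would instead select a set of outputs and trade this expected utility off against similarity among the selected outputs.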

Title: Pre-trained Large Language Models for Financial Sentiment Analysis. (arXiv:2401.05215v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05215
Code URL: null
Copy Paste: [[2401.05215]] Pre-trained Large Language Models for Financial Sentiment Analysis(http://arxiv.org/abs/2401.05215)
Summary:
Financial sentiment analysis refers to classifying financial text content into sentiment categories (e.g. positive, negative, and neutral). In this paper, we focus on the classification of financial news titles, which is a challenging task due to the lack of a large number of training samples. To overcome this difficulty, we propose to adapt pretrained large language models (LLMs) [1, 2, 3] to solve this problem. LLMs, which are trained on huge text corpora, have an advantage in text understanding and can be effectively adapted to domain-specific tasks while requiring very few training samples. In particular, we adapt the open-source Llama2-7B model (2023) with the supervised fine-tuning (SFT) technique [4]. Experimental evaluation shows that even with the 7B model (which is relatively small for LLMs), our approach significantly outperforms the previous state-of-the-art algorithms.
摘要：
财经情感分析是指将财经文本内容分为情感类别（例如正面、负面和中性）。在本文中，我们关注财经新闻标题的分类，由于缺乏大量的训练样本，这是一项具有挑战性的任务。为了克服这个困难，我们建议采用预训练的大型语言模型（LLM）[1,2,3]来解决这个问题。从大量文本语料库中训练出来的 LLM 在文本理解方面具有优势，可以有效地适应特定领域的任务，同时只需要很少量的训练样本。特别是，我们使用监督微调（SFT）技术 [4] 来调整开源 Llama2-7B 模型（2023）。实验评估表明，即使使用 7B 模型（对于 LLM 来说相对较小），我们的方法也明显优于以前最先进的算法。

Title: INACIA: Integrating Large Language Models in Brazilian Audit Courts: Opportunities and Challenges. (arXiv:2401.05273v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05273
Code URL: null
Copy Paste: [[2401.05273]] INACIA: Integrating Large Language Models in Brazilian Audit Courts: Opportunities and Challenges(http://arxiv.org/abs/2401.05273)
Summary:
This paper introduces INACIA (Instrução Assistida com Inteligência Artificial), a groundbreaking system designed to integrate Large Language Models (LLMs) into the operational framework of Brazilian Federal Court of Accounts (TCU). The system automates various stages of case analysis, including basic information extraction, admissibility examination, Periculum in mora and Fumus boni iuris analyses, and recommendations generation. Through a series of experiments, we demonstrate INACIA's potential in extracting relevant information from case documents, evaluating its legal plausibility, and generating judicial recommendations. Utilizing a validation dataset alongside LLMs, our evaluation methodology presents an innovative approach to assessing system performance, correlating highly with human judgment. The results highlight INACIA's proficiency in handling complex legal tasks, indicating its suitability for augmenting efficiency and judicial fairness within legal systems. The paper also discusses potential enhancements and future applications, positioning INACIA as a model for worldwide AI integration in legal domains.
摘要：
本文介绍了 INACIA (Instrução Assistida com Inteligência Artificial)，这是一个突破性的系统，旨在将大型语言模型 (LLM) 集成到巴西联邦会计法院 (TCU) 的运营框架中。该系统自动执行案例分析的各个阶段，包括基本信息提取、可受理性审查、Periculum in mora 和 Fumus boni iuris 分析以及建议生成。通过一系列实验，我们展示了 INACIA 在从案件文件中提取相关信息、评估其法律合理性以及生成司法建议方面的潜力。通过利用验证数据集和 LLM，我们的评估方法提出了一种评估系统性能的创新方法，与人类判断高度相关。结果突显了 INACIA 在处理复杂法律任务方面的熟练程度，表明其适合提高法律体系内的效率和司法公平。该论文还讨论了潜在的增强功能和未来的应用，将 INACIA 定位为法律领域全球人工智能集成的模型。

Title: I am a Strange Dataset: Metalinguistic Tests for Language Models. (arXiv:2401.05300v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05300
Code URL: https://github.com/tristanthrush/i-am-a-strange-dataset
Copy Paste: [[2401.05300]] I am a Strange Dataset: Metalinguistic Tests for Language Models(http://arxiv.org/abs/2401.05300)
Summary:
Statements involving metalinguistic self-reference ("This paper has six sections.") are prevalent in many domains. Can large language models (LLMs) handle such language? In this paper, we present "I am a Strange Dataset", a new dataset for addressing this question. There are two subtasks: generation and verification. In generation, models continue statements like "The penultimate word in this sentence is" (where a correct continuation is "is"). In verification, models judge the truth of statements like "The penultimate word in this sentence is sentence." (false). We also provide minimally different metalinguistic non-self-reference examples to complement the main dataset by probing for whether models can handle metalinguistic language at all. The dataset is hand-crafted by experts and validated by non-expert annotators. We test a variety of open-source LLMs (7B to 70B parameters) as well as closed-source LLMs through APIs. All models perform close to chance across both subtasks and even on the non-self-referential metalinguistic control data, though we find some steady improvement with model scale. GPT 4 is the only model to consistently do significantly better than chance, and it is still only in the 60% range, while our untrained human annotators score well in the 89-93% range. The dataset and evaluation toolkit are available at https://github.com/TristanThrush/i-am-a-strange-dataset.
摘要：
涉及元语言自指的陈述（“本文有六个部分。”）在许多领域都很普遍。大型语言模型（LLM）可以处理这种语言吗？在本文中，我们提出了“我是一个奇怪的数据集”，这是一个解决这个问题的新数据集。有两个子任务：生成和验证。在生成过程中，模型会继续诸如“这句话中的倒数第二个词是”之类的语句（其中正确的延续是“is”）。在验证中，模型判断诸如“这句话中的倒数第二个词是句子”之类的陈述的真实性。（错误的）。我们还提供了最小差异的元语言非自引用示例，通过探索模型是否可以处理元语言语言来补充主数据集。该数据集由专家手工制作，并由非专家注释者验证。我们通过 API 测试各种开源 LLM（7B 至 70B 参数）以及闭源 LLM。尽管我们发现模型规模有了一些稳定的改进，但所有模型在两个子任务上，甚至在非自指元语言控制数据上的表现都接近偶然。 GPT 4 是唯一一个始终显着优于随机性的模型，但它仍然只在 60% 的范围内，而我们未经训练的人类注释者的得分在 89-93% 的范围内。数据集和评估工具包可从 https://github.com/TristanThrush/i-am-a-strange-dataset 获取。

Title: Entity Recognition from Colloquial Text. (arXiv:2401.04853v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.04853
Code URL: null
Copy Paste: [[2401.04853]] Entity Recognition from Colloquial Text(http://arxiv.org/abs/2401.04853)
Summary:
Extraction of concepts and entities of interest from non-formal texts such as social media posts and informal communication is an important capability for decision support systems in many domains, including healthcare, customer relationship management, and others. Despite the recent advances in training large language models for a variety of natural language processing tasks, the developed models and techniques have mainly focused on formal texts and do not perform as well on colloquial data, which is characterized by a number of distinct challenges. In our research, we focus on the healthcare domain and investigate the problem of symptom recognition from colloquial texts by designing and evaluating several training strategies for BERT-based model fine-tuning. These strategies are distinguished by the choice of the base model, the training corpora, and application of term perturbations in the training data. The best-performing models trained using these strategies outperform the state-of-the-art specialized symptom recognizer by a large margin. Through a series of experiments, we have found specific patterns of model behavior associated with the training strategies we designed. We present design principles for training strategies for effective entity recognition in colloquial texts based on our findings.
摘要：
从社交媒体帖子和非正式沟通等非正式文本中提取感兴趣的概念和实体是许多领域（包括医疗保健、客户关系管理等）决策支持系统的一项重要功能。尽管最近在训练用于各种自然语言处理任务的大型语言模型方面取得了进展，但开发的模型和技术主要集中在正式文本上，而在口语数据上表现不佳，口语数据的特点是存在许多独特的挑战。在我们的研究中，我们专注于医疗保健领域，通过设计和评估几种基于 BERT 的模型微调的训练策略来研究口语文本的症状识别问题。这些策略的特点是基础模型、训练语料库的选择以及训练数据中术语扰动的应用。使用这些策略训练的性能最佳的模型大大优于最先进的专门症状识别器。通过一系列实验，我们发现了与我们设计的训练策略相关的特定模型行为模式。根据我们的发现，我们提出了口语文本中有效实体识别的训练策略的设计原则。

Title: Are Language Models More Like Libraries or Like Librarians? Bibliotechnism, the Novel Reference Problem, and the Attitudes of LLMs. (arXiv:2401.04854v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.04854
Code URL: null
Copy Paste: [[2401.04854]] Are Language Models More Like Libraries or Like Librarians? Bibliotechnism, the Novel Reference Problem, and the Attitudes of LLMs(http://arxiv.org/abs/2401.04854)
Summary:
Are LLMs cultural technologies like photocopiers or printing presses, which transmit information but cannot create new content? A challenge for this idea, which we call bibliotechnism, is that LLMs often do generate entirely novel text. We begin by defending bibliotechnism against this challenge, showing how novel text may be meaningful only in a derivative sense, so that the content of this generated text depends in an important sense on the content of original human text. We go on to present a different, novel challenge for bibliotechnism, stemming from examples in which LLMs generate "novel reference", using novel names to refer to novel entities. Such examples could be smoothly explained if LLMs were not cultural technologies but possessed a limited form of agency (beliefs, desires, and intentions). According to interpretationism in the philosophy of mind, a system has beliefs, desires and intentions if and only if its behavior is well-explained by the hypothesis that it has such states. In line with this view, we argue that cases of novel reference provide evidence that LLMs do in fact have beliefs, desires, and intentions, and thus have a limited form of agency.
摘要：
LLM 是否像复印机或印刷机那样，是只能传输信息但无法创造新内容的文化技术？这种我们称之为“文献技术主义”（bibliotechnism）的想法面临的一个挑战是，LLM 经常会生成完全新颖的文本。我们首先针对这一挑战为文献技术主义辩护，展示新颖文本如何仅在派生意义上才有意义，因此生成文本的内容在重要意义上取决于原始人类文本的内容。我们继续对文献技术主义提出了一个不同的、新颖的挑战，源于 LLM 生成“新颖指称”的例子，即使用新颖的名称来指称新颖的实体。如果 LLM 不是文化技术而是拥有有限形式的能动性（信念、欲望和意图），那么这些例子就可以顺利地解释。根据心灵哲学中的解释主义，一个系统具有信念、欲望和意图，当且仅当它的行为可以通过它具有这些状态的假设得到很好的解释时。根据这一观点，我们认为新颖指称的案例提供了证据，证明 LLM 实际上有信念、欲望和意图，因此具有有限形式的能动性。

Title: Can AI Write Classical Chinese Poetry like Humans? An Empirical Study Inspired by Turing Test. (arXiv:2401.04952v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.04952
Code URL: null
Copy Paste: [[2401.04952]] Can AI Write Classical Chinese Poetry like Humans? An Empirical Study Inspired by Turing Test(http://arxiv.org/abs/2401.04952)
Summary:
Some argue that the essence of humanity, such as creativity and sentiment, can never be mimicked by machines. This paper casts doubt on this belief by studying a vital question: Can AI compose poetry as well as humans? To answer the question, we propose ProFTAP, a novel evaluation framework inspired by Turing test to assess AI's poetry writing capability. We apply it on current large language models (LLMs) and find that recent LLMs do indeed possess the ability to write classical Chinese poems nearly indistinguishable from those of humans. We also reveal that various open-source LLMs can outperform GPT-4 on this task.
摘要：
有人认为，人类的本质，例如创造力和情感，永远无法被机器模仿。本文通过研究一个重要问题对这一信念提出了质疑：人工智能能否像人类一样创作诗歌？为了回答这个问题，我们提出了ProFTAP，这是一种受图灵测试启发的新颖评估框架，用于评估人工智能的诗歌写作能力。我们将其应用到当前的大型语言模型（LLM）上，发现最近的 LLM 确实具有写出与人类几乎没有区别的中国古典诗歌的能力。我们还表明，各种开源 LLM 在此任务上的表现都优于 GPT-4。

Title: Aligning Translation-Specific Understanding to General Understanding in Large Language Models. (arXiv:2401.05072v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05072
Code URL: null
Copy Paste: [[2401.05072]] Aligning Translation-Specific Understanding to General Understanding in Large Language Models(http://arxiv.org/abs/2401.05072)
Summary:
Although large language models (LLMs) have shown surprising language understanding and generation capabilities, they have yet to gain a revolutionary advancement in the field of machine translation. One potential cause of the limited performance is the misalignment between the translation-specific understanding and general understanding inside LLMs. To align the translation-specific understanding to the general one, we propose a novel translation process xIoD (Cross-Lingual Interpretation of Difficult words), explicitly incorporating the general understanding on the content incurring inconsistent understanding to guide the translation. Specifically, xIoD performs the cross-lingual interpretation for the difficult-to-translate words and enhances the translation with the generated interpretations. Furthermore, we reframe the external tools of QE to tackle the challenges of xIoD in the detection of difficult words and the generation of helpful interpretations. We conduct experiments on the self-constructed benchmark ChallengeMT, which includes cases in which multiple SOTA translation systems consistently underperform. Experimental results show the effectiveness of our xIoD, which improves up to +3.85 COMET.
摘要：
尽管大型语言模型 (LLM) 表现出了令人惊讶的语言理解和生成能力，但它们尚未在机器翻译领域获得革命性的进步。性能有限的一个潜在原因是 LLM 内部翻译特定理解与一般理解之间的不一致。为了使特定翻译的理解与一般理解保持一致，我们提出了一种新颖的翻译过程xIoD（难词的跨语言解释），明确地将对产生不一致理解的内容的一般理解纳入指导翻译。具体来说，xIoD 对难以翻译的单词进行跨语言解释，并通过生成的解释来增强翻译。此外，我们重新构建了 QE 的外部工具，以应对 xIoD 在检测困难单词和生成有用解释方面的挑战。我们在自建的基准 ChallengeMT 上进行了实验，其中包括多个 SOTA 翻译系统始终表现不佳的情况。实验结果表明我们的 xIoD 的有效性，最高可提高 +3.85 COMET。

Title: Hierarchical Classification of Transversal Skills in Job Ads Based on Sentence Embeddings. (arXiv:2401.05073v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.05073
Code URL: null
Copy Paste: [[2401.05073]] Hierarchical Classification of Transversal Skills in Job Ads Based on Sentence Embeddings(http://arxiv.org/abs/2401.05073)
Summary:
This paper proposes a classification framework aimed at identifying correlations between job ad requirements and transversal skill sets, with a focus on predicting the necessary skills for individual job descriptions using a deep learning model. The approach involves data collection, preprocessing, and labeling using ESCO (European Skills, Competences, and Occupations) taxonomy. Hierarchical classification and multi-label strategies are used for skill identification, while augmentation techniques address data imbalance, enhancing model robustness. A comparison between results obtained with English-specific and multi-language sentence embedding models reveals close accuracy. The experimental case studies detail neural network configurations, hyperparameters, and cross-validation results, highlighting the efficacy of the hierarchical approach and the suitability of the multi-language model for the diverse European job market. Thus, a new approach is proposed for the hierarchical classification of transversal skills from job ads.
摘要：
本文提出了一个分类框架，旨在识别职位广告要求和横向技能组合之间的相关性，重点是使用深度学习模型预测个人职位描述的必要技能。该方法涉及使用 ESCO（欧洲技能、能力和职业）分类法进行数据收集、预处理和标记。分层分类和多标签策略用于技能识别，而增强技术则解决数据不平衡问题，增强模型的稳健性。通过英语特定句子嵌入模型和多语言句子嵌入模型获得的结果之间的比较显示出接近的准确性。实验案例研究详细介绍了神经网络配置、超参数和交叉验证结果，强调了分层方法的有效性以及多语言模型对多样化欧洲就业市场的适用性。因此，提出了一种新方法对招聘广告中的横向技能进行层次分类。

Title: Divide and Conquer for Large Language Models Reasoning. (arXiv:2401.05190v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05190
Code URL: https://github.com/aimijie/divide-and-conquer
Copy Paste: [[2401.05190]] Divide and Conquer for Large Language Models Reasoning(http://arxiv.org/abs/2401.05190)
Summary:
Large language models (LLMs) have shown impressive performance in various reasoning benchmarks with the emergence of Chain-of-Thought (CoT) and its derivative methods, particularly in tasks involving multi-choice questions (MCQs). However, current works all process data uniformly without considering problem-solving difficulty, which means an excessive focus on simple questions and insufficient attention to intricate ones. To address this challenge, inspired by how humans use heuristic strategies to categorize tasks and handle them individually, we propose to apply Divide and Conquer to LLM reasoning. First, we divide questions into different subsets based on the statistical confidence score (CS), then fix nearly resolved sets and conquer the demanding ones that require a more nuanced process with elaborately designed methods, including Prior Knowledge based Reasoning (PKR) and Filter Choices based Reasoning (FCR), as well as their integration variants. Our experiments demonstrate that this proposed strategy significantly boosts the models' reasoning abilities across nine datasets involving arithmetic, commonsense, and logic tasks. For instance, compared to the baseline, we make a striking improvement on low-confidence subsets of 8.72% for AQuA, 15.07% for ARC Challenge and 7.71% for RiddleSense. In addition, through extensive analysis of rationale length and number of options, we verify that longer reasoning paths in PKR could prevent models from resorting to shortcuts that harm inference, and also find that removing irrelevant choices in FCR would substantially reduce models' confusion. The code is at https://github.com/AiMijie/Divide-and-Conquer
摘要：
随着思维链 (CoT) 及其衍生方法的出现，大型语言模型 (LLM) 在各种推理基准测试中表现出了令人印象深刻的性能，特别是在涉及多项选择问题 (MCQ) 的任务中。然而，目前的工作都统一处理数据，没有考虑解决问题的难度，这意味着过分关注简单问题，而对复杂问题的关注不足。为了应对这一挑战，我们受到人类使用启发式策略对任务进行分类并单独处理的启发，建议将分而治之应用于 LLM 的推理。首先，我们根据统计置信度得分（CS）将问题划分为不同的子集，然后固定几乎已解决的集合，并通过精心设计的方法攻克要求更高、过程更细致的集合，这些方法包括基于先验知识的推理 (PKR) 和基于过滤选项的推理 (FCR) 及其集成变体。我们的实验表明，这种提出的策略显着提高了模型在涉及算术、常识和逻辑任务的九个数据集上的推理能力。例如，与基线相比，我们在低置信度子集上取得了显着的改进，AQuA 为 8.72%，ARC Challenge 为 15.07%，RiddleSense 为 7.71%。此外，通过对基本原理长度和选项数量的广泛分析，我们验证了 PKR 中较长的推理路径可以防止模型借助有害于推理的捷径，并且还发现删除 FCR 中不相关的选择将大大避免模型的混乱。代码位于 https://github.com/AiMijie/Divide-and-Conquer
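The "divide" step described above can be illustrated with a small sketch. The abstract does not define how the statistical confidence score is computed. Purely for illustration, the code below assumes it is the share of sampled answers that agree with the majority answer. The threshold value, the question IDs, and the function names are likewise hypothetical.

```python
from collections import Counter

def confidence_score(answers):
    """Return (majority answer, share of samples agreeing with it).

    Assumption for illustration: confidence is majority-vote agreement
    over several sampled answers to the same question.
    """
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / len(answers)

def divide(sampled_answers, threshold=0.8):
    """Split questions into high- and low-confidence subsets by a hypothetical threshold."""
    high, low = [], []
    for question_id, answers in sampled_answers.items():
        answer, score = confidence_score(answers)
        bucket = high if score >= threshold else low
        bucket.append((question_id, answer, score))
    return high, low

# Hypothetical sampled answers for two multiple-choice questions.
sampled = {
    "q1": ["B", "B", "B", "B", "B"],   # consistent answers, treated as nearly resolved
    "q2": ["A", "C", "A", "D", "B"],   # scattered answers, routed to further reasoning
}
high, low = divide(sampled)
print(high, low)
```

Under this assumption, only the low-confidence subset would be passed on to the more expensive follow-up methods such as PKR or FCR.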

Title: CASA: Causality-driven Argument Sufficiency Assessment. (arXiv:2401.05249v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05249
Code URL: https://github.com/xxxiaol/casa
Copy Paste: [[2401.05249]] CASA: Causality-driven Argument Sufficiency Assessment(http://arxiv.org/abs/2401.05249)
Summary:
The argument sufficiency assessment task aims to determine if the premises of a given argument support its conclusion. To tackle this task, existing works often train a classifier on data annotated by humans. However, annotating data is laborious, and annotations are often inconsistent due to subjective criteria. Motivated by the probability of sufficiency (PS) definition in the causal literature, we propose CASA, a zero-shot causality-driven argument sufficiency assessment framework. PS measures how likely introducing the premise event would lead to the conclusion, when both the premise and conclusion events are absent. To estimate this probability, we propose to use large language models (LLMs) to generate contexts that are inconsistent with the premise and conclusion, and revise them by injecting the premise event. Experiments on two logical fallacy detection datasets demonstrate that CASA accurately identifies insufficient arguments. We further deploy CASA in a writing assistance application, and find that suggestions generated by CASA enhance the sufficiency of student-written arguments. Code and data are available at https://github.com/xxxiaol/CASA.
摘要：
论证充分性评估任务旨在确定给定论证的前提是否支持其结论。为了解决这一任务，现有的工作通常会根据人类注释的数据来训练分类器。然而，对数据进行注释是费力的，并且由于主观标准，注释常常不一致。受因果文献中充分性概率（PS）定义的启发，我们提出了 CASA，一种零样本因果驱动的论证充分性评估框架。 PS 衡量当前提事件和结论事件都不存在时，引入前提事件导致结论的可能性有多大。为了估计这个概率，我们建议使用大型语言模型（LLM）来生成与前提和结论不一致的上下文，并通过注入前提事件来修改它们。对两个逻辑谬误检测数据集的实验表明，CASA 可以准确识别不充分的论点。我们进一步在写作辅助应用程序中部署 CASA，发现 CASA 生成的建议增强了学生书面论证的充分性。代码和数据可在 https://github.com/xxxiaol/CASA 获取。
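For reference, the probability of sufficiency that the abstract describes in words is commonly written in the causal inference literature as follows. The notation is an assumption here, since the abstract gives no formula: X is the premise event, Y is the conclusion event, x and y denote their presence, and x' and y' denote their absence.

```latex
\[
\mathrm{PS} \;=\; P\bigl(Y_{X=x} = y \,\bigm|\, X = x',\; Y = y'\bigr)
\]
```

Here Y_{X=x} is the value Y would take if the premise were made to occur. PS is therefore the probability that introducing the premise produces the conclusion in situations where neither currently holds, which matches the verbal description in the summary.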

Title: Leveraging Print Debugging to Improve Code Generation in Large Language Models. (arXiv:2401.05319v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05319
Code URL: null
Copy Paste: [[2401.05319]] Leveraging Print Debugging to Improve Code Generation in Large Language Models(http://arxiv.org/abs/2401.05319)
Summary:
Large language models (LLMs) have made significant progress in code generation tasks, but their performance in tackling programming problems with complex data structures and algorithms remains suboptimal. To address this issue, we propose an in-context learning approach that guides LLMs to debug by using a "print debugging" method, which involves inserting print statements to trace execution and analysing the logs to fix the bug. We collect a Leetcode problem dataset and evaluate our method using the Leetcode online judging system. Experiments with GPT-4 demonstrate the effectiveness of our approach, outperforming rubber duck debugging in easy and medium-level Leetcode problems by 1.5% and 17.9%, respectively.
摘要：
大型语言模型 (LLM) 在代码生成任务方面取得了重大进展，但它们在处理复杂数据结构和算法的编程问题方面的性能仍然不够理想。为了解决这个问题，我们提出了一种上下文学习方法，引导 LLM 使用“打印调试”方法进行调试，其中包括插入打印语句来跟踪执行并分析日志以修复错误。我们收集 Leetcode 问题数据集并使用 Leetcode 在线评审系统评估我们的方法。 GPT-4 的实验证明了我们方法的有效性，在简单和中等水平的 Leetcode 问题上，其性能比橡皮鸭调试分别高出 1.5% 和 17.9%。

gpt

Title: Can ChatGPT Rival Neural Machine Translation? A Comparative Study. (arXiv:2401.05176v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05176
Code URL: null
Copy Paste: [[2401.05176]] Can ChatGPT Rival Neural Machine Translation? A Comparative Study(http://arxiv.org/abs/2401.05176)
Summary:
Inspired by the increasing interest in leveraging large language models for translation, this paper evaluates the capabilities of large language models (LLMs) represented by ChatGPT in comparison to the mainstream neural machine translation (NMT) engines in translating Chinese diplomatic texts into English. Specifically, we examine the translation quality of ChatGPT and NMT engines as measured by four automated metrics and human evaluation based on an error-typology and six analytic rubrics. Our findings show that automated metrics yield similar results for ChatGPT under different prompts and NMT systems, while human annotators tend to assign noticeably higher scores to ChatGPT when it is provided an example or contextual information about the translation task. Pairwise correlation between automated metrics and dimensions of human evaluation produces weak and non-significant results, suggesting the divergence between the two methods of translation quality assessment. These findings provide valuable insights into the potential of ChatGPT as a capable machine translator, and the influence of prompt engineering on its performance.
摘要：
受人们对利用大型语言模型进行翻译的兴趣日益浓厚的启发，本文评估了以 ChatGPT 为代表的大型语言模型 (LLM) 与主流神经机器翻译 (NMT) 引擎在将中国外交文本翻译成英语方面的能力。具体来说，我们检查了 ChatGPT 和 NMT 引擎的翻译质量，通过四个自动指标和基于错误类型和六个分析规则的人工评估来衡量。我们的研究结果表明，在不同的提示和 NMT 系统下，自动化指标对 ChatGPT 产生相似的结果，而当提供有关翻译任务的示例或上下文信息时，人类注释者倾向于为 ChatGPT 分配明显更高的分数。自动指标和人类评估维度之间的成对相关性产生的结果较弱且不显着，这表明两种翻译质量评估方法之间存在差异。这些发现为了解 ChatGPT 作为强大的机器翻译器的潜力以及即时工程对其性能的影响提供了宝贵的见解。

Title: Monte Carlo Tree Search for Recipe Generation using GPT-2. (arXiv:2401.05199v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05199
Code URL: null
Copy Paste: [[2401.05199]] Monte Carlo Tree Search for Recipe Generation using GPT-2(http://arxiv.org/abs/2401.05199)
Summary:
Automatic food recipe generation methods provide a creative tool for chefs to explore and create new and interesting culinary delights. Given the recent success of large language models (LLMs), they have the potential to create new recipes that meet individual preferences and dietary constraints and adapt to what is in your refrigerator. Existing research on using LLMs to generate recipes has shown that LLMs can be finetuned to generate realistic-sounding recipes. However, on close examination, these generated recipes often fail to meet basic requirements like including chicken as an ingredient in chicken dishes. In this paper, we propose RecipeMC, a text generation method using GPT-2 that relies on Monte Carlo Tree Search (MCTS). RecipeMC allows us to define reward functions to put soft constraints on text generation and thus improve the credibility of the generated recipes. Our results show that human evaluators prefer recipes generated with RecipeMC more often than recipes generated with other baseline methods when compared with real recipes.
摘要：
自动食品食谱生成方法为厨师提供了一个创造性的工具来探索和创造新的、有趣的烹饪美食。鉴于大型语言模型（LLM）最近取得的成功，它们有潜力创造出新的食谱，可以满足个人喜好、饮食限制，并适应冰箱里的食物。关于使用 LLM 生成菜谱的现有研究表明，LLM 可以进行微调以生成听起来逼真的菜谱。然而，经过仔细检查，这些生成的食谱通常无法满足基本要求，例如将鸡肉作为鸡肉菜肴的成分。在本文中，我们提出了 RecipeMC，一种使用依赖于蒙特卡罗树搜索 (MCTS) 的 GPT-2 的文本生成方法。 RecipeMC 允许我们定义奖励函数，对文本生成施加软约束，从而提高生成菜谱的可信度。我们的结果表明，与真实菜谱相比，人类评估者更喜欢使用 RecipeMC 生成的菜谱，而不是使用其他基线方法生成的菜谱。

Title: Arabic Text Diacritization In The Age Of Transfer Learning: Token Classification Is All You Need. (arXiv:2401.04848v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.04848
Code URL: null
Copy Paste: [[2401.04848]] Arabic Text Diacritization In The Age Of Transfer Learning: Token Classification Is All You Need(http://arxiv.org/abs/2401.04848)
Summary:
Automatic diacritization of Arabic text involves adding diacritical marks (diacritics) to the text. This task poses a significant challenge with noteworthy implications for computational processing and comprehension. In this paper, we introduce PTCAD (Pre-FineTuned Token Classification for Arabic Diacritization), a novel two-phase approach for the Arabic Text Diacritization task. PTCAD comprises a pre-finetuning phase and a finetuning phase, treating Arabic Text Diacritization as a token classification task for pre-trained models. The effectiveness of PTCAD is demonstrated through evaluations on two benchmark datasets derived from the Tashkeela dataset, where it achieves state-of-the-art results, including a 20% reduction in Word Error Rate (WER) compared to existing benchmarks and superior performance over GPT-4 in ATD tasks.
摘要：
阿拉伯语文本的自动变音符号涉及向文本添加变音符号（变音符号）。这项任务提出了重大挑战，对计算处理和理解具有显著影响。在本文中，我们介绍了 PTCAD（阿拉伯语变音符号的预微调标记分类），这是一种用于阿拉伯语文本变音符号任务的新颖的两阶段方法。PTCAD 包括预微调阶段和微调阶段，将阿拉伯语文本变音符号视为预训练模型的标记分类任务。PTCAD 的有效性通过对源自 Tashkeela 数据集的两个基准数据集的评估得到证明，它取得了最先进的结果，包括与现有基准相比字错误率 (WER) 降低了 20%，并且在 ATD 任务中性能优于 GPT-4。

llm

Title: Bootstrapping LLM-based Task-Oriented Dialogue Agents via Self-Talk. (arXiv:2401.05033v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05033
Code URL: null
Copy Paste: [[2401.05033]] Bootstrapping LLM-based Task-Oriented Dialogue Agents via Self-Talk(http://arxiv.org/abs/2401.05033)
Summary:
Large language models (LLMs) are powerful dialogue agents, but specializing them towards fulfilling a specific function can be challenging. Instruction tuning, i.e. tuning models on instructions and sample responses generated by humans (Ouyang et al., 2022), has proven to be an effective method to do so, yet requires a number of data samples that a) might not be available or b) might be costly to generate. Furthermore, this cost increases when the goal is to make the LLM follow a specific workflow within a dialogue instead of single instructions. Inspired by the self-play technique in reinforcement learning and the use of LLMs to simulate human agents, we propose a more effective method for data collection through LLMs engaging in a conversation in various roles. This approach generates training data via "self-talk" of LLMs that can be refined and utilized for supervised fine-tuning. We introduce an automated way to measure the (partial) success of a dialogue. This metric is used to filter the generated conversational data that is fed back into the LLM for training. Based on our automated and human evaluations of conversation quality, we demonstrate that such self-talk data improves results. In addition, we examine the various characteristics that showcase the quality of generated dialogues and how they can be connected to their potential utility as training data.
摘要：
大型语言模型 (LLM) 是强大的对话代理，但将其专门用于实现特定功能可能具有挑战性。指令调优，即根据人类生成的指令和样本响应调整模型（Ouyang et al., 2022），已被证明是一种有效的方法，但需要大量数据样本，而这些数据样本 a) 可能不可用或 b) 生成成本高昂。此外，当目标是让 LLM 遵循对话中的特定工作流程而不是单一指令时，这种成本就会增加。受到强化学习中的自我对战技术和使用 LLM 来模拟人类代理的启发，我们提出了一种通过 LLM 参与各种角色对话来收集数据的更有效方法。这种方法通过 LLM 的“自言自语”生成训练数据，可以对其进行细化并用于监督微调。我们引入了一种自动化方法来衡量对话的（部分）成功。该指标用于过滤生成的对话数据，这些数据被反馈到 LLM 中用于训练。根据我们对对话质量的自动和人工评估，我们证明此类自言自语数据可以改善结果。此外，我们还研究了展示生成对话质量的各种特征，以及如何将它们与作为训练数据的潜在效用联系起来。

Title: Multi-User Chat Assistant (MUCA): a Framework Using LLMs to Facilitate Group Conversations. (arXiv:2401.04883v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.04883
Code URL: null
Copy Paste: [[2401.04883]] Multi-User Chat Assistant (MUCA): a Framework Using LLMs to Facilitate Group Conversations(http://arxiv.org/abs/2401.04883)
Summary:
Recent advancements in large language models (LLMs) have provided a new avenue for chatbot development, while most existing research has primarily centered on single-user chatbots that focus on deciding "What" to answer after user inputs. In this paper, we identified that multi-user chatbots have more complex 3W design dimensions -- "What" to say, "When" to respond, and "Who" to answer. Additionally, we proposed Multi-User Chat Assistant (MUCA), which is an LLM-based framework for chatbots specifically designed for group discussions. MUCA consists of three main modules: Sub-topic Generator, Dialog Analyzer, and Utterance Strategies Arbitrator. These modules jointly determine suitable response contents, timings, and the appropriate recipients. To make the optimizing process for MUCA easier, we further propose an LLM-based Multi-User Simulator (MUS) that can mimic real user behavior. This enables faster simulation of a conversation between the chatbot and simulated users, making the early development of the chatbot framework much more efficient. MUCA demonstrates effectiveness, including appropriate chime-in timing, relevant content, and positive user engagement, in goal-oriented conversations with a small to medium number of participants, as evidenced by case studies and experimental results from user studies.
摘要：
大型语言模型 (LLM) 的最新进展为聊天机器人的开发提供了新的途径，而大多数现有研究主要集中在单用户聊天机器人上，这些机器人专注于在用户输入后决定回答“什么”。在本文中，我们发现多用户聊天机器人具有更复杂的 3W 设计维度——“说什么”、“何时”响应以及“谁”回答。此外，我们还提出了多用户聊天助手（MUCA），这是一个基于 LLM 的聊天机器人框架，专门为小组讨论而设计。 MUCA由三个主要模块组成：子主题生成器、对话分析器和话语策略仲裁器。这些模块共同确定合适的响应内容、时间和合适的接收者。为了使 MUCA 的优化过程更容易，我们进一步提出了一种基于 LLM 的多用户模拟器（MUS），可以模仿真实的用户行为。这使得能够更快地模拟聊天机器人和模拟用户之间的对话，从而使聊天机器人框架的早期开发更加高效。案例研究和用户研究的实验结果证明，MUCA 在与中小数量参与者进行目标导向的对话中展示了有效性，包括适当的插话时机、相关内容和积极的用户参与。

Title: Taming "data-hungry" reinforcement learning? Stability in continuous state-action spaces. (arXiv:2401.05233v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.05233
Code URL: null
Copy Paste: [[2401.05233]] Taming "data-hungry" reinforcement learning? Stability in continuous state-action spaces(http://arxiv.org/abs/2401.05233)
Summary:
We introduce a novel framework for analyzing reinforcement learning (RL) in continuous state-action spaces, and use it to prove fast rates of convergence in both off-line and on-line settings. Our analysis highlights two key stability properties, relating to how changes in value functions and/or policies affect the Bellman operator and occupation measures. We argue that these properties are satisfied in many continuous state-action Markov decision processes, and demonstrate how they arise naturally when using linear function approximation methods. Our analysis offers fresh perspectives on the roles of pessimism and optimism in off-line and on-line RL, and highlights the connection between off-line RL and transfer learning.
摘要：
我们引入了一种新颖的框架，用于分析连续状态动作空间中的强化学习 (RL)，并用它来证明离线和在线设置中的快速收敛速度。我们的分析强调了两个关键的稳定性属性，涉及价值函数和/或政策的变化如何影响贝尔曼运营商和占领措施。我们认为这些属性在许多连续状态-动作马尔可夫决策过程中得到满足，并证明它们在使用线性函数逼近方法时如何自然出现。我们的分析为悲观主义和乐观主义在离线和在线强化学习中的作用提供了新的视角，并强调了离线强化学习和迁移学习之间的联系。

long context

Title: Attendre: Wait To Attend By Retrieval With Evicted Queries in Memory-Based Transformers for Long Context Processing. (arXiv:2401.04881v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.04881
Code URL: null
Copy Paste: [[2401.04881]] Attendre: Wait To Attend By Retrieval With Evicted Queries in Memory-Based Transformers for Long Context Processing(http://arxiv.org/abs/2401.04881)
Summary:
As LLMs have become capable of processing more complex types of inputs, researchers have recently studied how to efficiently and affordably process possibly arbitrarily long sequences. One effective approach is to use a FIFO memory to store keys and values of an attention sublayer from past chunks to allow subsequent queries to attend. However, this approach requires a large memory and/or takes into consideration the specific LM architecture. Moreover, due to the causal nature between the key-values in prior context and the queries at present, this approach cannot be extended to bidirectional attention such as in an encoder-decoder or PrefixLM decoder-only architecture. In this paper, we propose to use eviction policies, such as LRA and LFA, to reduce the memory size and adapt to various architectures, and we also propose the Attendre layer, a wait-to-attend mechanism by retrieving the key-value memory (K/V memory) with evicted queries in the query memory (Q memory). As a first step, we evaluate this method in the context length extension setup using the TriviaQA reading comprehension task, and show the effectiveness of the approach.
摘要：
随着 LLM 已经能够处理更复杂类型的输入，研究人员最近研究了如何高效且经济地处理可能任意长的序列。一种有效的方法是使用 FIFO 内存来存储过去块中注意力子层的键和值，以允许后续查询参与。然而，这种方法需要大内存和/或考虑特定的 LM 架构。此外，由于先前上下文中的键值与当前查询之间的因果关系，这种方法无法扩展到双向注意力，例如在编码器-解码器或仅 PrefixLM 解码器架构中。在本文中，我们提出使用驱逐策略（例如LRA和LFA）来减少内存大小并适应各种架构，并且我们还提出了Attendre层，这是一种通过检索键值内存来等待出席的机制（K/V 内存），并在查询内存（Q 内存）中驱逐查询。第一步，我们使用 TriviaQA 阅读理解任务在上下文长度扩展设置中评估该方法，并展示该方法的有效性。
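
To make the FIFO key/value memory baseline mentioned in the abstract concrete, here is a minimal sketch. It is not the Attendre layer and does not implement the LRA or LFA eviction policies, which the abstract names without defining; the class name, capacity, and string stand-ins for key and value vectors are all hypothetical.

```python
from collections import deque


class FifoKVMemory:
    """Fixed-capacity store of (key, value) pairs from past chunks.

    Hypothetical illustration: strings stand in for the key and value
    vectors of an attention sublayer.
    """

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest entry when full,
        # which is exactly first-in, first-out eviction.
        self._entries = deque(maxlen=capacity)

    def add_chunk(self, keys, values):
        """Append the keys and values of one processed chunk."""
        for key, value in zip(keys, values):
            self._entries.append((key, value))

    def contents(self):
        """Return the pairs that later queries could still attend to."""
        return list(self._entries)


memory = FifoKVMemory(capacity=3)
memory.add_chunk(["k1", "k2"], ["v1", "v2"])
memory.add_chunk(["k3", "k4"], ["v3", "v4"])  # evicts ("k1", "v1")
```

With a capacity of 3, the second chunk pushes out the oldest pair, so only the three most recent pairs remain available.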

lora

Title: Sample-and-Bound for Non-Convex Optimization. (arXiv:2401.04812v1 [cs.AI])

Paper URL: http://arxiv.org/abs/2401.04812
Code URL: https://github.com/aaucsd/mcir
Copy Paste: [[2401.04812]] Sample-and-Bound for Non-Convex Optimization(http://arxiv.org/abs/2401.04812)
Summary:
Standard approaches for global optimization of non-convex functions, such as branch-and-bound, maintain partition trees to systematically prune the domain. The tree size grows exponentially in the number of dimensions. We propose new sampling-based methods for non-convex optimization that adapt Monte Carlo Tree Search (MCTS) to improve efficiency. Instead of the standard use of visitation count in Upper Confidence Bounds, we utilize numerical overapproximations of the objective as an uncertainty metric, and also take into account sampled estimates of first-order and second-order information. The Monte Carlo tree in our approach avoids the usual fixed combinatorial patterns in growing the tree, and aggressively zooms into the promising regions, while still balancing exploration and exploitation. We evaluate the proposed algorithms on high-dimensional non-convex optimization benchmarks against competitive baselines and analyze the effects of the hyperparameters.
摘要：
非凸函数全局优化的标准方法（例如分支定界）维护分区树以系统地修剪域。树的大小随维数呈指数增长。我们提出了新的基于采样的非凸优化方法，该方法采用蒙特卡罗树搜索（MCTS）来提高效率。我们没有在置信上限中标准使用访问计数，而是利用目标的数值过度近似作为不确定性度量，并且还考虑一阶和二阶信息的采样估计。我们的方法中的蒙特卡罗树避免了生长树时通常的固定组合模式，并积极地放大到有希望的区域，同时仍然平衡探索和利用。我们根据竞争基线在高维非凸优化基准上评估所提出的算法，并分析超参数的影响。
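
For context on the "standard use of visitation count in Upper Confidence Bounds" that this paper replaces, the sketch below shows the textbook UCB1 score used in MCTS. It is the baseline, not the paper's overapproximation-based metric; the function name and the default exploration constant are illustrative assumptions.

```python
import math


def ucb1(mean_reward, visits, total_visits, c=math.sqrt(2)):
    """Textbook UCB1 score for one child node in MCTS.

    mean_reward:  average reward observed for this child
    visits:       how many times this child was visited
    total_visits: how many times the parent was visited
    c:            exploration constant (sqrt(2) is a common default)
    """
    if visits == 0:
        # Unvisited children are tried first.
        return math.inf
    exploration = c * math.sqrt(math.log(total_visits) / visits)
    return mean_reward + exploration
```

The exploration bonus shrinks as a node's visit count grows, so rarely visited nodes with the same mean reward score higher.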

Title: Temporal Analysis of World Disaster Risk:A Machine Learning Approach to Cluster Dynamics. (arXiv:2401.05007v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.05007
Code URL: null
Copy Paste: [[2401.05007]] Temporal Analysis of World Disaster Risk:A Machine Learning Approach to Cluster Dynamics(http://arxiv.org/abs/2401.05007)
Summary:
The evaluation of the impact of actions undertaken is essential in management. This paper assesses the impact of efforts considered to mitigate risk and create safe environments on a global scale. We measure this impact by looking at the probability of improvement over a specific short period of time. Using the World Risk Index, we conduct a temporal analysis of global disaster risk dynamics from 2011 to 2021. This temporal exploration through the lens of the World Risk Index provides insights into the complex dynamics of disaster risk. We found that, despite sustained efforts, the global landscape remains divided into two main clusters: high susceptibility and moderate susceptibility, regardless of geographical location. This clustering was achieved using a semi-supervised approach through the Label Spreading algorithm, with 98% accuracy. We also found that the prediction of clusters achieved through supervised learning on the period considered in this study (one, three, and five years) showed that the Logistic regression (almost 99% at each stage) performed better than other classifiers. This suggests that the current policies and mechanisms are not effective in helping countries move from a hazardous position to a safer one during the period considered. In fact, statistical projections using a scenario analysis indicate that there is only a 1% chance of such a shift occurring within a five-year timeframe. This sobering reality highlights the need for a paradigm shift. Traditional long-term disaster management strategies are not effective for countries that are highly vulnerable. Our findings indicate the need for an innovative approach that is tailored to the specific vulnerabilities of these nations. As the threat of vulnerability persists, our research calls for the development of new strategies that can effectively address the ongoing challenges of disaster risk management.
摘要：
评估所采取行动的影响对于管理至关重要。本文评估了在全球范围内降低风险和创建安全环境的努力的影响。我们通过查看特定短时间内改善的可能性来衡量这种影响。利用世界风险指数，我们对 2011 年至 2021 年全球灾害风险动态进行了时间分析。通过世界风险指数的视角进行的时间探索提供了对灾害风险复杂动态的见解。我们发现，尽管做出了持续的努力，全球格局仍然分为两个主要集群：高易感性和中度易感性，无论地理位置如何。这种聚类是通过标签传播算法使用半监督方法实现的，准确率高达 98%。我们还发现，通过监督学习在本研究所考虑的时期（一年、三年和五年）实现的聚类预测表明，Logistic 回归（每个阶段几乎 99%）的表现优于其他分类器。这表明，当前的政策和机制并不能有效地帮助各国在所考虑的时期从危险境地转向安全境地。事实上，使用情景分析的统计预测表明，在五年内发生这种转变的可能性只有 1% 。这一发人深省的现实凸显了范式转变的必要性。传统的长期灾害管理战略对于高度脆弱的国家来说并不有效。我们的研究结果表明需要针对这些国家的具体脆弱性制定创新方法。由于脆弱性的威胁持续存在，我们的研究呼吁制定新的战略，以有效应对灾害风险管理的持续挑战

hallucination

prompt

Title: User Embedding Model for Personalized Language Prompting. (arXiv:2401.04858v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.04858
Code URL: null
Copy Paste: [[2401.04858]] User Embedding Model for Personalized Language Prompting(http://arxiv.org/abs/2401.04858)
Summary:
Modeling long histories plays a pivotal role in enhancing recommendation systems, allowing them to capture users' evolving preferences and produce more precise and personalized recommendations. In this study we tackle the challenges of modeling long user histories for preference understanding in natural language. Specifically, we introduce a new User Embedding Module (UEM) that efficiently processes user histories in free-form text by compressing and representing them as embeddings, which are then used as soft prompts to an LM. Our experiments demonstrate the superior capability of this approach in handling significantly longer histories compared to conventional text-based prompting methods, yielding substantial improvements in predictive performance. The main contribution of this research is to demonstrate the ability to bias language models with user signals represented as embeddings.
摘要：
对长期历史进行建模在增强推荐系统方面发挥着关键作用，可以捕获用户不断变化的偏好，从而提供更精确和个性化的推荐。在这项研究中，我们解决了对长期用户历史进行建模以实现自然语言偏好理解的挑战。具体来说，我们引入了一个新的用户嵌入模块（UEM），它通过压缩并将其表示为嵌入来有效地处理自由格式文本中的用户历史记录，以将它们用作 LM 的软提示。我们的实验证明，与传统的基于文本的提示方法相比，这种方法在处理更长的历史方面具有卓越的能力，从而在预测性能方面产生了显着的改进。这项研究的主要贡献是证明用嵌入表示的用户信号来偏置语言模型的能力。

Title: A Novel Prompt-tuning Method: Incorporating Scenario-specific Concepts into a Verbalizer. (arXiv:2401.05204v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05204
Code URL: null
Copy Paste: [[2401.05204]] A Novel Prompt-tuning Method: Incorporating Scenario-specific Concepts into a Verbalizer(http://arxiv.org/abs/2401.05204)
Summary:
The verbalizer, which serves to map label words to class labels, is an essential component of prompt-tuning. In this paper, we present a novel approach to constructing verbalizers. While existing methods for verbalizer construction mainly rely on augmenting and refining sets of synonyms or related words based on class names, this paradigm suffers from a narrow perspective and lack of abstraction, resulting in limited coverage and high bias in the label-word space. To address this issue, we propose a label-word construction process that incorporates scenario-specific concepts. Specifically, we extract rich concepts from task-specific scenarios as label-word candidates and then develop a novel cascade calibration module to refine the candidates into a set of label words for each class. We evaluate the effectiveness of our proposed approach through extensive experiments on five widely used datasets for zero-shot text classification. The results demonstrate that our method outperforms existing methods and achieves state-of-the-art results.
摘要：
语言器，用于将标签词映射到类标签，是提示调整的重要组成部分。在本文中，我们提出了一种构建言语器的新方法。虽然现有的言语构建方法主要依赖于基于类名来增强和细化同义词或相关词的集合，但这种范式存在视角狭窄和缺乏抽象的问题，导致标签词空间的覆盖范围有限和偏差较大。为了解决这个问题，我们提出了一种结合特定场景概念的标签词构建过程。具体来说，我们从特定任务场景中提取丰富的概念作为标签词候选，然后开发一个新颖的级联校准模块，将候选细化为每个类别的一组标签词。我们通过对五个广泛使用的零样本文本分类数据集进行大量实验来评估我们提出的方法的有效性。结果表明，我们的方法优于现有方法并取得了最先进的结果。
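
As a rough illustration of what a verbalizer does, the sketch below maps hypothetical masked-LM scores for label words to a class label. It does not reproduce the paper's concept extraction or cascade calibration module; the label words, the scores, and the `classify` helper are made up for illustration.

```python
from statistics import mean

# Hypothetical verbalizer: each class is tied to a small set of label words.
VERBALIZER = {
    "sports": ["football", "match", "athlete"],
    "politics": ["election", "senate", "policy"],
}


def classify(word_scores, verbalizer=VERBALIZER):
    """Return the class whose label words get the highest mean score.

    word_scores maps a vocabulary word to the score a masked language model
    assigned to it at the prompt's mask position (assumed to be given).
    Label words missing from word_scores count as 0.0.
    """
    class_scores = {
        label: mean(word_scores.get(word, 0.0) for word in words)
        for label, words in verbalizer.items()
    }
    return max(class_scores, key=class_scores.get)


scores = {"football": 0.30, "match": 0.10, "election": 0.05}
predicted = classify(scores)
```

Averaging over several label words per class is one simple aggregation choice; the quality of the label-word sets is what verbalizer construction methods such as this paper's aim to improve.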

code

Title: Yes, this is what I was looking for! Towards Multi-modal Medical Consultation Concern Summary Generation. (arXiv:2401.05134v1 [cs.AI])

Paper URL: http://arxiv.org/abs/2401.05134
Code URL: https://github.com/nlp-rl/mmcsg
Copy Paste: [[2401.05134]] Yes, this is what I was looking for! Towards Multi-modal Medical Consultation Concern Summary Generation(http://arxiv.org/abs/2401.05134)
Summary:
Over the past few years, the use of the Internet for healthcare-related tasks has grown by leaps and bounds, posing a challenge in effectively managing and processing information to ensure its efficient utilization. During moments of emotional turmoil and psychological challenges, we frequently turn to the internet as our initial source of support, choosing this over discussing our feelings with others due to the associated social stigma. In this paper, we propose a new task of multi-modal medical concern summary (MMCS) generation, which provides a short and precise summary of patients' major concerns brought up during the consultation. Nonverbal cues, such as patients' gestures and facial expressions, aid in accurately identifying patients' concerns. Doctors also consider patients' personal information, such as age and gender, in order to describe the medical condition appropriately. Motivated by the potential efficacy of patients' personal context and visual gestures, we propose a transformer-based multi-task, multi-modal intent-recognition, and medical concern summary generation (IR-MMCSG) system. Furthermore, we propose a multitasking framework for intent recognition and medical concern summary generation for doctor-patient consultations. We construct the first multi-modal medical concern summary generation (MM-MediConSummation) corpus, which includes patient-doctor consultations annotated with medical concern summaries, intents, patient personal information, doctor's recommendations, and keywords. Our experiments and analysis demonstrate (a) the significant role of patients' expressions/gestures and their personal information in intent identification and medical concern summary generation, and (b) the strong correlation between intent recognition and patients' medical concern summary generation.

The dataset and source code are available at https://github.com/NLP-RL/MMCSG.
摘要：
在过去几年中，互联网在医疗保健相关任务中的使用突飞猛进，这对有效管理和处理信息以确保其有效利用提出了挑战。在情绪动荡和心理挑战的时刻，我们经常转向互联网作为我们最初的支持来源，由于相关的社会耻辱，我们选择互联网而不是与他人讨论我们的感受。在本文中，我们提出了多模式医疗问题摘要（MMCS）生成的新任务，它提供了患者在咨询期间提出的主要问题的简短而准确的摘要。非语言线索，例如患者的手势和面部表情，有助于准确识别患者的担忧。医生还会考虑患者的个人信息，例如年龄和性别，以便适当地描述医疗状况。受患者个人背景和视觉手势潜在功效的启发，我们提出了一种基于 Transformer 的多任务、多模式意图识别和医疗问题摘要生成（IR-MMCSG）系统。此外，我们提出了一个多任务框架，用于医患咨询的意图识别和医疗问题摘要生成。我们构建了第一个多模式医疗问题摘要生成（MM-MediConSummation）语料库，其中包括用医疗问题摘要、意图、患者个人信息、医生建议和关键词注释的医患咨询。我们的实验和分析证明了（a）患者的表情/手势及其个人信息在意图识别和医疗问题摘要生成中的重要作用，以及（b）意图识别与患者的医疗问题摘要生成之间的强相关性。

数据集和源代码可在 https://github.com/NLP-RL/MMCSG 获取。

Title: Whose wife is it anyway? Assessing bias against same-gender relationships in machine translation. (arXiv:2401.04972v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.04972
Code URL: null
Copy Paste: [[2401.04972]] Whose wife is it anyway? Assessing bias against same-gender relationships in machine translation(http://arxiv.org/abs/2401.04972)
Summary:
Machine translation often suffers from biased data and algorithms that can lead to unacceptable errors in system output. While bias in gender norms has been investigated, less is known about whether MT systems encode bias about social relationships, e.g. sentences such as "the lawyer kissed her wife." We investigate the degree of bias against same-gender relationships in MT systems, using generated template sentences drawn from several noun-gender languages (e.g. Spanish). We find that three popular MT services consistently fail to accurately translate sentences concerning relationships between nouns of the same gender. The error rate varies considerably based on the context, e.g. same-gender sentences referencing high female-representation occupations are translated with lower accuracy. We provide this work as a case study in the evaluation of intrinsic bias in NLP systems, with respect to social relationships.
摘要：
机器翻译经常受到数据和算法偏差的影响，可能导致系统输出出现不可接受的错误。虽然已经对性别规范的偏见进行了调查，但人们对机器翻译系统是否编码了有关社会关系的偏见知之甚少，例如诸如“律师亲吻了她的妻子”之类的句子。我们使用从几种名词性别语言（例如西班牙语）生成的模板句子来调查机器翻译系统中对同性关系的偏见程度。我们发现，三种流行的机器翻译服务始终无法准确翻译有关同性名词之间关系的句子。错误率根据上下文的不同而有很大差异，例如提及女性代表性较高的职业的同性别句子的翻译准确性较低。我们将这项工作作为评估 NLP 系统中与社会关系相关的内在偏见的案例研究。

Title: BELHD: Improving Biomedical Entity Linking with Homonoym Disambiguation. (arXiv:2401.05125v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05125
Code URL: null
Copy Paste: [[2401.05125]] BELHD: Improving Biomedical Entity Linking with Homonoym Disambiguation(http://arxiv.org/abs/2401.05125)
Summary:
Biomedical entity linking (BEL) is the task of grounding entity mentions to a knowledge base (KB). Popular approaches to the task are name-based methods, i.e. those identifying the most appropriate name in the KB for a given mention, either via dense retrieval or autoregressive modeling. However, as these methods directly return KB names, they cannot cope with homonyms, i.e. different KB entities sharing the exact same name. This significantly affects their performance, especially for KBs where homonyms account for a large amount of entity mentions (e.g. UMLS and NCBI Gene). We therefore present BELHD (Biomedical Entity Linking with Homonym Disambiguation), a new name-based method that copes with this challenge. Specifically, BELHD builds upon the BioSyn (Sung et al., 2020) model, introducing two crucial extensions. First, it performs a preprocessing of the KB in which it expands homonyms with an automatically chosen disambiguating string, thus enforcing unique linking decisions. Second, we introduce candidate sharing, a novel strategy to select candidates for contrastive learning that enhances the overall training signal. Experiments with 10 corpora and five entity types show that BELHD improves upon state-of-the-art approaches, achieving the best results in 6 out of 10 corpora with an average improvement of 4.55pp recall@1. Furthermore, the KB preprocessing is orthogonal to the core prediction model and thus can also improve other methods, which we exemplify for GenBioEL (Yuan et al., 2022), a generative name-based BEL approach. Code is available at: link added upon publication.
摘要：
生物医学实体链接（BEL）是将实体提及基础到知识库（KB）的任务。完成该任务的一种流行方法是基于名称的方法，即通过密集检索或自回归建模来识别知识库中给定提及的最合适名称的方法。然而，由于这些方法直接返回知识库名称，因此它们无法处理同音异义词，即不同的知识库实体共享完全相同的名称。这会显著影响它们的性能，特别是对于同音异义词占大量实体提及的知识库（例如 UMLS 和 NCBI Gene）。因此，我们提出了 BELHD（具有同音词消歧功能的生物医学实体链接），这是一种应对这一挑战的新的基于名称的方法。具体来说，BELHD 建立在 BioSyn（Sung 等人，2020）模型的基础上，引入了两个关键的扩展。首先，它对知识库进行预处理，其中使用自动选择的消歧字符串扩展同音异义词，从而强制执行唯一的链接决策。其次，我们引入候选共享，这是一种选择候选进行对比学习的新颖策略，可以增强整体训练信号。对 10 个语料库和 5 种实体类型的实验表明，BELHD 在最先进的方法的基础上进行了改进，在 10 个语料库中的 6 个中取得了最佳结果，平均提高了 4.55pp 召回率@1。此外，KB 预处理与核心预测模型正交，因此也可以改进其他方法，我们以 GenBioEL（Yuan 等人，2022）为例，这是一种基于名称的生成 BEL 方法。代码可在以下位置获得：发布时添加的链接。
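
The KB preprocessing idea (expanding homonyms with a disambiguating string so every name points to one entity) can be sketched as follows. This is only an illustration under stated assumptions: the entity IDs and hints are invented, and the hint is supplied by hand here, whereas the abstract says BELHD chooses the disambiguating string automatically without describing how.

```python
from collections import defaultdict


def disambiguate_homonyms(kb):
    """Make KB names unique by appending a hint to names shared by
    more than one entity.

    kb is a list of (entity_id, name, hint) tuples; the hint (for example
    a species) is a hypothetical, hand-supplied disambiguating string.
    Returns a dict mapping each now-unique name to its entity ID.
    """
    ids_by_name = defaultdict(set)
    for entity_id, name, _hint in kb:
        ids_by_name[name].add(entity_id)

    unique_names = {}
    for entity_id, name, hint in kb:
        if len(ids_by_name[name]) > 1:  # homonym: several entities share it
            name = f"{name} ({hint})"
        unique_names[name] = entity_id
    return unique_names


kb = [
    ("gene:1", "TP53", "human"),
    ("gene:2", "TP53", "mouse"),
    ("gene:3", "BRCA1", "human"),
]
names = disambiguate_homonyms(kb)
```

After this step a name-based linker that returns a name such as "TP53 (mouse)" implies exactly one entity, while non-homonymous names like "BRCA1" are left unchanged.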

Title: Masked AutoEncoder for Graph Clustering without Pre-defined Cluster Number k. (arXiv:2401.04741v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.04741
Code URL: null
Copy Paste: [[2401.04741]] Masked AutoEncoder for Graph Clustering without Pre-defined Cluster Number k(http://arxiv.org/abs/2401.04741)
Summary:
Graph clustering algorithms with autoencoder structures have recently gained popularity due to their efficient performance and low training cost. However, for existing graph autoencoder clustering algorithms based on GCN or GAT, not only do they lack good generalization ability, but also the number of clusters clustered by such autoencoder models is difficult to determine automatically. To solve this problem, we propose a new framework called Graph Clustering with Masked Autoencoders (GCMA). It employs our designed fusion autoencoder based on the graph masking method for the fusion coding of graph. It introduces our improved density-based clustering algorithm as a second decoder while decoding with multi-target reconstruction. By decoding the mask embedding, our model can capture more generalized and comprehensive knowledge. The number of clusters and clustering results can be output end-to-end while improving the generalization ability. As a nonparametric class method, extensive experiments demonstrate the superiority of GCMA over state-of-the-art baselines.
摘要：
具有自动编码器结构的图聚类算法由于其高效的性能和较低的训练成本而最近受到欢迎。然而，现有的基于GCN或GAT的图自编码器聚类算法，不仅缺乏良好的泛化能力，而且此类自编码器模型聚类的簇数也难以自动确定。为了解决这个问题，我们提出了一个新的框架，称为带有掩码自动编码器的图聚类（GCMA）。它采用我们设计的基于图掩蔽方法的融合自动编码器来进行图的融合编码。它引入了我们改进的基于密度的聚类算法作为第二个解码器，同时使用多目标重建进行解码。通过解码掩码嵌入，我们的模型可以捕获更广义和更全面的知识。可以端到端输出聚类数量和聚类结果，同时提高泛化能力。作为一种非参数类方法，大量的实验证明了 GCMA 相对于最先进的基线的优越性。

Title: T-PRIME: Transformer-based Protocol Identification for Machine-learning at the Edge. (arXiv:2401.04837v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.04837
Code URL: https://github.com/genesys-neu/t-prime
Copy Paste: [[2401.04837]] T-PRIME: Transformer-based Protocol Identification for Machine-learning at the Edge(http://arxiv.org/abs/2401.04837)
Summary:
Spectrum sharing allows different protocols of the same standard (e.g., 802.11 family) or different standards (e.g., LTE and DVB) to coexist in overlapping frequency bands. As this paradigm continues to spread, wireless systems must also evolve to identify active transmitters and unauthorized waveforms in real time under intentional distortion of preambles, extremely low signal-to-noise ratios and challenging channel conditions. We overcome limitations of correlation-based preamble matching methods in such conditions through the design of T-PRIME: a Transformer-based machine learning approach. T-PRIME learns the structural design of transmitted frames through its attention mechanism, looking at sequence patterns that go beyond the preamble alone. The paper makes three contributions: First, it compares Transformer models and demonstrates their superiority over traditional methods and state-of-the-art neural networks. Second, it rigorously analyzes T-PRIME's real-time feasibility on DeepWave's AIR-T platform. Third, it utilizes an extensive 66 GB dataset of over-the-air (OTA) WiFi transmissions for training, which is released along with the code for community use. Results reveal nearly perfect (i.e., >98%) classification accuracy under simulated scenarios, showing 100% detection improvement over legacy methods in low SNR ranges, 97% classification accuracy for OTA single-protocol transmissions and up to 75% double-protocol classification accuracy in interference scenarios.
摘要：
频谱共享允许同一标准（例如 802.11 系列）或不同标准（例如 LTE 和 DVB）的不同协议在重叠频段中共存。随着这种范例的不断传播，无线系统还必须不断发展，以在故意扭曲前导码、极低的信噪比和具有挑战性的信道条件下实时识别活动发射机和未经授权的波形。我们通过设计 T-PRIME：一种基于 Transformer 的机器学习方法，克服了在这种情况下基于相关性的前导码匹配方法的局限性。 T-PRIME 通过其注意力机制学习传输帧的结构设计，查看超出前导码范围的序列模式。这篇论文做出了三个贡献：首先，它比较了 Transformer 模型，并证明了它们相对于传统方法和最先进的神经网络的优越性。其次，严格分析了T-PRIME在DeepWave的AIR-T平台上的实时可行性。第三，它利用广泛的 66 GB 无线 (OTA) WiFi 传输数据集进行训练，该数据集与代码一起发布供社区使用。结果显示，在模拟场景下，分类精度近乎完美（即 >98%），在低 SNR 范围内，检测结果比传统方法提高了 100%，OTA 单协议传输的分类精度为 97%，干扰场景下双协议分类精度高达 75%。

Title: Rethinking Test-time Likelihood: The Likelihood Path Principle and Its Application to OOD Detection. (arXiv:2401.04933v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.04933
Code URL: https://github.com/XavierXiao/Likelihood-Regret
Copy Paste: [[2401.04933]] Rethinking Test-time Likelihood: The Likelihood Path Principle and Its Application to OOD Detection(http://arxiv.org/abs/2401.04933)
Summary:
While likelihood is attractive in theory, its estimates by deep generative models (DGMs) are often broken in practice, and perform poorly for out-of-distribution (OOD) detection. Various recent works have started to consider alternative scores and achieved better performance. However, such recipes do not come with provable guarantees, nor is it clear that their choices extract sufficient information.

We attempt to change this by conducting a case study on variational autoencoders (VAEs). First, we introduce the likelihood path (LPath) principle, generalizing the likelihood principle. This narrows the search for informative summary statistics down to the minimal sufficient statistics of VAEs' conditional likelihoods. Second, introducing new theoretic tools such as nearly essential support, essential distance and co-Lipschitzness, we obtain non-asymptotic provable OOD detection guarantees for certain distillation of the minimal sufficient statistics. The corresponding LPath algorithm demonstrates SOTA performance, even using simple and small VAEs with poor likelihood estimates. To the best of our knowledge, this is the first provable unsupervised OOD method that delivers excellent empirical results, better than any other VAE-based techniques. We use the same model as \cite{xiao2020likelihood}, open sourced from: https://github.com/XavierXiao/Likelihood-Regret
摘要：
虽然可能性在理论上很有吸引力，但深度生成模型 (DGM) 的估计在实践中经常被破坏，并且在分布外 (OOD) 检测中表现不佳。最近的各种作品开始考虑替代配乐并取得了更好的表现。然而，这些食谱并没有提供可证明的保证，也不清楚他们的选择是否提取了足够的信息。
我们试图通过对变分自动编码器（VAE）进行案例研究来改变这一点。首先，我们引入似然路径（LPath）原理，推广似然原理。这将信息性汇总统计的搜索范围缩小到 VAE 条件可能性的最小充分统计。其次，引入新的理论工具，例如近本质支持、本质距离和共同Lipschitzness，我们获得了对最小充分统计量的某些精炼的非渐进可证明的OOD检测保证。相应的 LPath 算法展示了 SOTA 性能，即使使用似然估计较差的简单且小型 VAE。据我们所知，这是第一个可证明的无监督 OOD 方法，它提供了出色的实证结果，优于任何其他基于 VAE 的技术。我们使用与 \cite{xiao2020likelihood} 相同的模型，开源自：https://github.com/XavierXiao/Likelihood-Regret

Title: HiMTM: Hierarchical Multi-Scale Masked Time Series Modeling for Long-Term Forecasting. (arXiv:2401.05012v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.05012
Code URL: null
Copy Paste: [[2401.05012]] HiMTM: Hierarchical Multi-Scale Masked Time Series Modeling for Long-Term Forecasting(http://arxiv.org/abs/2401.05012)
Summary:
Time series forecasting is crucial and challenging in the real world. The recent surge in interest regarding time series foundation models, which cater to a diverse array of downstream tasks, is noteworthy. However, existing methods often overlook the multi-scale nature of time series, an aspect crucial for precise forecasting. To bridge this gap, we propose HiMTM, a hierarchical multi-scale masked time series modeling method designed for long-term forecasting. Specifically, it comprises four integral components: (1) a hierarchical multi-scale transformer (HMT) to capture temporal information at different scales; (2) a decoupled encoder-decoder (DED) that forces the encoder to focus on feature extraction and the decoder to focus on pretext tasks; (3) multi-scale masked reconstruction (MMR) that provides multi-stage supervision signals for pre-training; (4) cross-scale attention fine-tuning (CSA-FT) to capture dependencies between different scales for forecasting. Collectively, these components enhance multi-scale feature extraction capabilities in masked time series modeling and contribute to improved prediction accuracy. We conduct extensive experiments on 7 mainstream datasets to show that HiMTM has clear advantages over contemporary self-supervised and end-to-end learning methods. The effectiveness of HiMTM is further showcased by its application in the industry of natural gas demand forecasting.
摘要：
时间序列预测在现实世界中至关重要且具有挑战性。值得注意的是，最近人们对时间序列基础模型的兴趣激增，这些模型可以满足各种下游任务的需求。然而，现有的方法常常忽视时间序列的多尺度性质，而这对于精确预测至关重要。为了弥补这一差距，我们提出了 HiMTM，一种专为长期预测而设计的分层多尺度屏蔽时间序列建模方法。具体来说，它包含四个组成部分：（1）分层多尺度变换器（HMT），用于捕获不同尺度的时间信息；（2）解耦编码器-解码器（DED）迫使编码器专注于特征提取，而解码器专注于借口任务；（3）多尺度掩蔽重建（MMR）为预训练提供多级监督信号；（4）跨尺度注意力微调（CSA-FT）以捕获不同尺度之间的依赖关系进行预测。总的来说，这些组件增强了屏蔽时间序列建模中的多尺度特征提取能力，并有助于提高预测精度。我们在 7 个主流数据集上进行了广泛的实验，证明 HiMTM 相对于当代的自监督和端到端学习方法具有明显的优势。 HiMTM在天然气需求预测行业的应用进一步证明了其有效性。

Title: An Information Theoretic Approach to Interaction-Grounded Learning. (arXiv:2401.05015v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.05015
Code URL: null
Copy Paste: [[2401.05015]] An Information Theoretic Approach to Interaction-Grounded Learning(http://arxiv.org/abs/2401.05015)
Summary:
Reinforcement learning (RL) problems where the learner attempts to infer an unobserved reward from some feedback variables have been studied in several recent papers. The setting of Interaction-Grounded Learning (IGL) is an example of such feedback-based reinforcement learning tasks where the learner optimizes the return by inferring latent binary rewards from the interaction with the environment. In the IGL setting, a relevant assumption used in the RL literature is that the feedback variable $Y$ is conditionally independent of the context-action $(X,A)$ given the latent reward $R$. In this work, we propose Variational Information-based IGL (VI-IGL) as an information-theoretic method to enforce the conditional independence assumption in the IGL-based RL problem. The VI-IGL framework learns a reward decoder using an information-based objective based on the conditional mutual information (MI) between the context-action $(X,A)$ and the feedback variable $Y$ observed from the environment. To estimate and optimize the information-based terms for the continuous random variables in the RL problem, VI-IGL leverages the variational representation of mutual information and results in a min-max optimization problem. Furthermore, we extend the VI-IGL framework to general $f$-Information measures in the information theory literature, leading to the generalized $f$-VI-IGL framework to address the RL problem under the IGL condition. Finally, we provide the empirical results of applying the VI-IGL method to several reinforcement learning settings, which indicate an improved performance in comparison to the previous IGL-based RL algorithm.
摘要：
最近的几篇论文对强化学习 (RL) 问题进行了研究，其中学习者试图从一些反馈变量中推断出未观察到的奖励。交互基础学习（IGL）的设置是这种基于反馈的强化学习任务的一个例子，其中学习者通过从与环境的交互中推断潜在的二元奖励来优化回报。在 IGL 设置中，RL 文献中使用的相关假设是，在给定潜在奖励 $R$ 的情况下，反馈变量 $Y$ 有条件地独立于上下文动作 $(X,A)$。在这项工作中，我们提出基于变分信息的 IGL（VI-IGL）作为一种信息论方法，以在基于 IGL 的 RL 问题中强制执行条件独立假设。VI-IGL 框架使用基于信息的目标来学习奖励解码器，该目标基于上下文动作 $(X,A)$ 和从环境中观察到的反馈变量 $Y$ 之间的条件互信息 (MI)。为了估计和优化 RL 问题中连续随机变量的基于信息的项，VI-IGL 利用互信息的变分表示并产生最小-最大优化问题。此外，我们将 VI-IGL 框架扩展到信息论文献中的通用 $f$-信息度量，从而产生了广义的 $f$-VI-IGL 框架来解决 IGL 条件下的 RL 问题。最后，我们提供了将 VI-IGL 方法应用于多种强化学习设置的实证结果，这表明与之前基于 IGL 的 RL 算法相比，性能有所提高。
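
Note: for readers unfamiliar with the quantities named in this abstract, the conditional-independence assumption can be written as a vanishing conditional mutual information, and the Donsker-Varadhan bound is one standard variational representation of mutual information. The abstract does not say which representation VI-IGL uses, so the choice below, and the symbols $U$, $V$, $T$, are assumptions for illustration only.

```latex
% Conditional independence of Y and (X,A) given R is equivalent to
I(Y; X, A \mid R) = 0 .

% Donsker--Varadhan representation for generic random variables U, V,
% where T ranges over critic functions with finite expectations:
I(U; V) = \sup_{T} \; \mathbb{E}_{P_{UV}}\big[T(U,V)\big]
          - \log \mathbb{E}_{P_U \otimes P_V}\big[e^{T(U,V)}\big] .
```

Minimizing a term of this form over a reward decoder while maximizing over the critic $T$ would give a min-max problem of the kind the abstract mentions.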

chat

retrieval augmented generation

retrieval-augmented generation

rag

Title: An Analysis of User Behaviours for Objectively Evaluating Spoken Dialogue Systems. (arXiv:2401.04867v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.04867
Code URL: null
Copy Paste: [[2401.04867]] An Analysis of User Behaviours for Objectively Evaluating Spoken Dialogue Systems(http://arxiv.org/abs/2401.04867)
Summary:
Establishing evaluation schemes for spoken dialogue systems is important, but it can also be challenging. While subjective evaluations are commonly used in user experiments, objective evaluations are necessary for research comparison and reproducibility. To address this issue, we propose a framework for indirectly but objectively evaluating systems based on users' behaviours. To this end, we investigate the relationship between user behaviours and subjective evaluation scores in three social dialogue tasks: attentive listening, job interview, and first-meeting conversation. The results reveal that in dialogue tasks where user utterances are primary, such as attentive listening and job interview, indicators like the number of utterances and words play a significant role in evaluation. Observing disfluency can also indicate the effectiveness of formal tasks, such as job interview. On the other hand, in dialogue tasks with high interactivity, such as first-meeting conversation, behaviours related to turn-taking, like average switch pause length, become more important. These findings suggest that selecting appropriate user behaviours can provide valuable insights for objective evaluation in each social dialogue task.
摘要：
建立语音对话系统的评估方案很重要，但也可能具有挑战性。虽然主观评价通常用于用户实验，但客观评价对于研究比较和可重复性是必要的。为了解决这个问题，我们提出了一个根据用户行为间接但客观地评估系统的框架。为此，我们在本文中研究了社交对话任务中用户行为与主观评价分数之间的关系：专注倾听、工作面试和初次见面对话。结果表明，在以用户话语为主的对话任务中，例如专心倾听和工作面试，话语数量和单词数量等指标在评估中发挥着重要作用。观察不流利程度还可以表明正式任务的有效性，例如工作面试。另一方面，在交互性较高的对话任务中，例如初次见面的对话，与轮流相关的行为（例如平均切换停顿长度）变得更加重要。这些发现表明，选择适当的用户行为可以为每个社交对话任务的客观评估提供有价值的见解。

Title: Structure-Preserving Physics-Informed Neural Networks With Energy or Lyapunov Structure. (arXiv:2401.04986v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.04986
Code URL: null
Copy Paste: [[2401.04986]] Structure-Preserving Physics-Informed Neural Networks With Energy or Lyapunov Structure(http://arxiv.org/abs/2401.04986)
Summary:
Recently, there has been growing interest in using physics-informed neural networks (PINNs) to solve differential equations. However, the preservation of structure, such as energy and stability, in a suitable manner has yet to be established. This limitation could be a potential reason why the learning process for PINNs is not always efficient and the numerical results may suggest nonphysical behavior. Besides, there is little research on their applications on downstream tasks. To address these issues, we propose structure-preserving PINNs to improve their performance and broaden their applications for downstream tasks. Firstly, by leveraging prior knowledge about the physical system, a structure-preserving loss function is designed to assist the PINN in learning the underlying structure. Secondly, a framework that utilizes structure-preserving PINN for robust image recognition is proposed. Here, preserving the Lyapunov structure of the underlying system ensures the stability of the system. Experimental results demonstrate that the proposed method improves the numerical accuracy of PINNs for partial differential equations. Furthermore, the robustness of the model against adversarial perturbations in image data is enhanced.
摘要：
最近，人们对使用物理信息神经网络 (PINN) 求解微分方程越来越感兴趣。然而，以适当的方式保存结构，例如能量和稳定性，尚未建立。这种限制可能是 PINN 的学习过程并不总是有效且数值结果可能表明非物理行为的潜在原因。此外，关于它们在下游任务中的应用的研究还很少。为了解决这些问题，我们提出了保留结构的 PINN，以提高其性能并扩大其在下游任务中的应用。首先，通过利用物理系统的先验知识，设计了结构保持损失函数来帮助 PINN 学习底层结构。其次，提出了一种利用结构保留 PINN 进行鲁棒图像识别的框架。这里，保留底层系统的李雅普诺夫结构保证了系统的稳定性。实验结果表明，该方法提高了偏微分方程 PINN 的数值精度。此外，模型针对图像数据中的对抗性扰动的鲁棒性也得到了增强。

multi-run

chain-of-thought

tree-of-thought

agent

Title: ReACT: Reinforcement Learning for Controller Parametrization using B-Spline Geometries. (arXiv:2401.05251v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.05251
Code URL: null
Copy Paste: [[2401.05251]] ReACT: Reinforcement Learning for Controller Parametrization using B-Spline Geometries(http://arxiv.org/abs/2401.05251)
Summary:
Robust and performant controllers are essential for industrial applications. However, deriving controller parameters for complex and nonlinear systems is challenging and time-consuming. To facilitate automatic controller parametrization, this work presents a novel approach using deep reinforcement learning (DRL) with N-dimensional B-spline geometries (BSGs). We focus on the control of parameter-variant systems, a class of systems with complex behavior which depends on the operating conditions. For this system class, gain-scheduling control structures are widely used in applications across industries due to well-known design principles. To facilitate the expensive controller parametrization task for these control structures, we deploy a DRL agent. Based on control system observations, the agent autonomously decides how to adapt the controller parameters. We make the adaptation process more efficient by introducing BSGs to map the controller parameters, which may depend on numerous operating conditions. To preprocess time-series data and extract a fixed-length feature vector, we use a long short-term memory (LSTM) neural network. Furthermore, this work contributes actor regularizations that are relevant to real-world environments which differ from training. Accordingly, we apply dropout layer normalization to the actor and critic networks of the truncated quantile critic (TQC) algorithm. To show our approach's working principle and effectiveness, we train and evaluate the DRL agent on the parametrization task of an industrial control structure with parameter lookup tables.
摘要：
稳健且高性能的控制器对于工业应用至关重要。然而，为复杂的非线性系统推导控制器参数具有挑战性且耗时。为了促进自动控制器参数化，这项工作提出了一种使用深度强化学习 (DRL) 和 N 维 B 样条几何 (BSG) 的新颖方法。我们专注于参数变化系统的控制，这是一类具有取决于操作条件的复杂行为的系统。对于此类系统，增益调度控制结构由于众所周知的设计原理而广泛应用于各行业的应用中。为了促进有关这些控制结构的昂贵的控制器参数化任务，我们部署了 DRL 代理。基于控制系统的观察，代理自主决定如何调整控制器参数。我们通过引入 BSG 来映射可能取决于多种操作条件的控制器参数，从而使适应过程更加高效。为了预处理时间序列数据并提取固定长度的特征向量，我们使用长短期记忆（LSTM）神经网络。此外，这项工作还贡献了与不同于训练的现实环境相关的参与者正则化。因此，我们将 dropout 层归一化应用于截断分位数批评家 (TQC) 算法的参与者和批评家网络。为了展示我们方法的工作原理和有效性，我们使用参数查找表在工业控制结构的参数化任务上训练和评估 DRL 代理。

Title: AUTOACT: Automatic Agent Learning from Scratch via Self-Planning. (arXiv:2401.05268v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05268
Code URL: https://github.com/zjunlp/autoact
Copy Paste: [[2401.05268]] AUTOACT: Automatic Agent Learning from Scratch via Self-Planning(http://arxiv.org/abs/2401.05268)
Summary:
Language agents have achieved considerable performance on various complex tasks. Despite the incessant exploration in this field, existing language agent systems still struggle with costly, non-reproducible data reliance and face the challenge of compelling a single model for multiple functions. To this end, we introduce AutoAct, an automatic agent learning framework that does not rely on large-scale annotated data and synthetic trajectories from closed-source models (e.g., GPT-4). Given limited data with a tool library, AutoAct first automatically synthesizes planning trajectories without any assistance from humans or strong closed-source models. Then, AutoAct leverages a division-of-labor strategy to automatically differentiate based on the target task information and synthesized trajectories, producing a sub-agent group to complete the task. We conduct comprehensive experiments with different LLMs, which demonstrate that AutoAct yields better or parallel performance compared to various strong baselines. We even notice that AutoAct, when using the Llama-2-13b model, can achieve performance comparable to that of the GPT-3.5-Turbo agent. Code will be available at https://github.com/zjunlp/AutoAct.
摘要：
语言代理在各种复杂任务上取得了相当可观的表现。尽管在这一领域不断进行探索，现有的语言代理系统仍然与成本高昂、不可重复的数据依赖作斗争，并面临着强制单一模型实现多种功能的挑战。为此，我们引入了 AutoAct，一个自动代理学习框架，它不依赖于大规模注释数据和来自闭源模型（例如 GPT-4）的合成轨迹。鉴于工具库的数据有限，AutoAct 首先自动合成规划轨迹，无需人类或强大的闭源模型的任何帮助。然后，AutoAct 利用分工策略，根据目标任务信息和合成轨迹自动区分，产生子代理组来完成任务。我们对不同的大语言模型（LLM）进行了全面的实验，这表明与各种强大的基线相比，AutoAct 可以产生更好或相当的性能。我们甚至注意到，当使用 Llama-2-13b 模型时，AutoAct 可以获得与 GPT-3.5-Turbo 代理相当的性能。代码将在 https://github.com/zjunlp/AutoAct 提供。