2024-01-22

Title: Top in Chinese Data Processing: English Code Models

Authors: Linghan Zheng, Hui Liu, Xiaojun Lin, Jiayuan Dong, Yue Sheng, Gang Shi, Zhiwei Liu, Hongwei Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2401.10286
Pdf URL: https://arxiv.org/pdf/2401.10286
Copy Paste: [[2401.10286]] Top in Chinese Data Processing: English Code Models(https://arxiv.org/abs/2401.10286)
Keywords: language model, llm, hallucination, code, retrieval-augmented generation, rag
Abstract: While the alignment between tasks and training corpora is a fundamental consensus in the application of language models, our series of experiments and the metrics we designed reveal that, on non-coding Chinese tasks, code-based Large Language Models (LLMs) significantly outperform models trained on data closely matched to those tasks. Moreover, in tasks that are highly sensitive to Chinese hallucinations, models exhibiting fewer linguistic features of the Chinese language achieve better performance. Our experimental results can be easily replicated in Chinese data processing tasks, such as preparing data for Retrieval-Augmented Generation (RAG), by simply replacing the base model with a code-based model. Additionally, our research offers a distinct perspective for discussion on the philosophical "Chinese Room" thought experiment.
摘要：虽然任务和训练语料库之间的一致性是语言模型应用中的基本共识，但我们的一系列实验和我们设计的指标表明，在非编码中文任务中，基于代码的大型语言模型（LLM）显著优于使用与任务密切匹配的数据训练的模型。此外，在对中文幻觉高度敏感的任务中，表现出较少中文语言特征的模型取得了更好的性能。我们的实验结果可以很容易地在中文数据处理任务中复制，例如通过简单地用基于代码的模型替换基本模型来为检索增强生成（RAG）准备数据。此外，我们的研究为哲学“中文房间”思想实验的讨论提供了独特的视角。

Title: On the Readiness of Scientific Data for a Fair and Transparent Use in Machine Learning

Authors: Joan Giner-Miguelez, Abel Gómez, Jordi Cabot
Subjects: cs.LG, cs.AI, cs.DL
Abstract URL: https://arxiv.org/abs/2401.10304
Pdf URL: https://arxiv.org/pdf/2401.10304
Copy Paste: [[2401.10304]] On the Readiness of Scientific Data for a Fair and Transparent Use in Machine Learning(https://arxiv.org/abs/2401.10304)
Keywords: rag
Abstract: To ensure the fairness and trustworthiness of machine learning (ML) systems, recent legislative initiatives and relevant research in the ML community have pointed out the need to document the data used to train ML models. Besides, data-sharing practices in many scientific domains have evolved in recent years for reproducibility purposes. In this sense, the adoption of these practices by academic institutions has encouraged researchers to publish their data and technical documentation in peer-reviewed publications such as data papers. In this study, we analyze how this scientific data documentation meets the needs of the ML community and regulatory bodies for its use in ML technologies. We examine a sample of 4041 data papers of different domains, assessing their completeness and coverage of the requested dimensions, and trends in recent years, putting special emphasis on the most and least documented dimensions. As a result, we propose a set of recommendation guidelines for data creators and scientific data publishers to increase their data's preparedness for its transparent and fairer use in ML technologies.
摘要：为了确保机器学习（ML）系统的公平性和可信性，ML社区最近的立法举措和相关研究指出需要记录用于训练ML模型的数据。此外，近年来，许多科学领域的数据共享实践出于可重复性的目的而不断发展。从这个意义上说，学术机构采用这些做法鼓励研究人员在同行评审的出版物（例如数据论文）中发布他们的数据和技术文档。在本研究中，我们分析了这些科学数据文档如何满足机器学习社区和监管机构在机器学习技术中使用的需求。我们检查了不同领域的 4041 篇数据论文样本，评估其完整性和所要求维度的覆盖范围以及近年来的趋势，特别强调记录最多和最少的维度。因此，我们为数据创建者和科学数据发布者提出了一套推荐指南，以增强他们的数据准备，使其在机器学习技术中透明和公平地使用。

Title: MELODY: Robust Semi-Supervised Hybrid Model for Entity-Level Online Anomaly Detection with Multivariate Time Series

Authors: Jingchao Ni, Gauthier Guinet, Peihong Jiang, Laurent Callot, Andrey Kan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2401.10338
Pdf URL: https://arxiv.org/pdf/2401.10338
Copy Paste: [[2401.10338]] MELODY: Robust Semi-Supervised Hybrid Model for Entity-Level Online Anomaly Detection with Multivariate Time Series(https://arxiv.org/abs/2401.10338)
Keywords: code
Abstract: In large IT systems, software deployment is a crucial process in online services as their code is regularly updated. However, a faulty code change may degrade the target service's performance and cause cascading outages in downstream services. Thus, software deployments should be comprehensively monitored, and their anomalies should be detected in a timely manner. In this paper, we study the problem of anomaly detection for deployments. We begin by identifying the challenges unique to this anomaly detection problem, which is at the entity level (e.g., deployments), relative to the more typical problem of anomaly detection in multivariate time series (MTS). The unique challenges include the heterogeneity of deployments, the low latency tolerance, the ambiguous anomaly definition, and the limited supervision. To address them, we propose a novel framework, semi-supervised hybrid Model for Entity-Level Online Detection of anomalY (MELODY). MELODY first transforms the MTS of different entities to the same feature space by an online feature extractor, then uses a newly proposed semi-supervised deep one-class model for detecting anomalous entities. We evaluated MELODY on real data of cloud services with 1.2M+ time series. The relative F1 score improvement of MELODY over the state-of-the-art methods ranges from 7.6% to 56.5%. The user evaluation suggests MELODY is suitable for monitoring deployments in large online systems.
摘要：在大型IT系统中，软件部署是在线服务中的一个关键过程，因为它们的代码会定期更新。然而，错误的代码更改可能会降低目标服务的性能并导致下游服务的级联中断。因此，应全面监控软件部署，及时发现异常情况。在本文中，我们研究了部署的异常检测问题。我们首先确定此异常检测问题所特有的挑战，相对于多元时间序列 (MTS) 中更典型的异常检测问题，该问题处于实体级别（例如部署）。独特的挑战包括部署的异构性、低延迟容忍度、不明确的异常定义以及有限的监管。为了解决这些问题，我们提出了一种新颖的框架，即用于实体级在线异常检测的半监督混合模型（MELODY）。 MELODY首先通过在线特征提取器将不同实体的MTS转换到相同的特征空间，然后使用新提出的半监督深度一类模型来检测异常实体。我们在 120 万+时间序列的云服务真实数据上对 MELODY 进行了评估。与最先进的方法相比，MELODY 的相对 F1 分数提高了 7.6% 至 56.5%。用户评价表明 MELODY 适合监控大型在线系统中的部署。

Title: Bridging Cultural Nuances in Dialogue Agents through Cultural Value Surveys

Authors: Yong Cao, Min Chen, Daniel Hershcovich
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2401.10352
Pdf URL: https://arxiv.org/pdf/2401.10352
Copy Paste: [[2401.10352]] Bridging Cultural Nuances in Dialogue Agents through Cultural Value Surveys(https://arxiv.org/abs/2401.10352)
Keywords: lora, agent
Abstract: The cultural landscape of interactions with dialogue agents is a compelling yet relatively unexplored territory. It's clear that various sociocultural aspects -- from communication styles and beliefs to shared metaphors and knowledge -- profoundly impact these interactions. To delve deeper into this dynamic, we introduce cuDialog, a first-of-its-kind benchmark for dialogue generation with a cultural lens. We also develop baseline models capable of extracting cultural attributes from dialogue exchanges, with the goal of enhancing the predictive accuracy and quality of dialogue agents. To effectively co-learn cultural understanding and multi-turn dialogue predictions, we propose to incorporate cultural dimensions with dialogue encoding features. Our experimental findings highlight that incorporating cultural value surveys boosts alignment with references and cultural markers, demonstrating its considerable influence on personalization and dialogue quality. To facilitate further exploration in this exciting domain, we make our benchmark publicly accessible at https://github.com/yongcaoplus/cuDialog.
摘要：与对话代理互动的文化景观是一个引人注目但相对未经探索的领域。很明显，各种社会文化方面——从沟通方式和信仰到共享的隐喻和知识——深刻地影响着这些互动。为了更深入地研究这种动态，我们引入了 cuDialog，这是第一个从文化角度生成对话的基准。我们还开发了能够从对话交流中提取文化属性的基线模型，目的是提高对话代理的预测准确性和质量。为了有效地共同学习文化理解和多轮对话预测，我们建议将文化维度与对话编码特征结合起来。我们的实验结果强调，纳入文化价值调查可以促进与参考文献和文化标记的一致性，证明其对个性化和对话质量的巨大影响。为了促进在这个令人兴奋的领域的进一步探索，我们发布了我们的基准测试，可在 https://github.com/yongcaoplus/cuDialog 上公开访问。

Title: Inconsistent dialogue responses and how to recover from them

Authors: Mian Zhang, Lifeng Jin, Linfeng Song, Haitao Mi, Dong Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2401.10353
Pdf URL: https://arxiv.org/pdf/2401.10353
Copy Paste: [[2401.10353]] Inconsistent dialogue responses and how to recover from them(https://arxiv.org/abs/2401.10353)
Keywords: language model, gpt, chat
Abstract: One critical issue for chat systems is staying consistent about their own preferences, opinions, beliefs and facts, which has been shown to be a difficult problem. In this work, we study methods to assess and bolster utterance consistency of chat systems. A dataset is first developed for studying the inconsistencies, where inconsistent dialogue responses, explanations of the inconsistencies, and recovery utterances are authored by annotators. This covers the life span of inconsistencies, namely introduction, understanding, and resolution. Building on this, we introduce a set of tasks centered on dialogue consistency, specifically focused on its detection and resolution. Our experimental findings indicate that our dataset significantly helps progress in identifying and resolving conversational inconsistencies, and that current popular large language models like ChatGPT, while good at resolving inconsistencies, still struggle with detection.
摘要：聊天系统的一个关键问题是保持偏好、观点、信念和事实本身的一致性，这已被证明是一个难题。在这项工作中，我们研究了评估和增强聊天系统话语一致性的方法。首先开发一个数据集来研究不一致，其中不一致的对话响应、不一致的解释和恢复话语由注释者编写。这涵盖了不一致的生命周期，即引入、理解和解决。在此基础上，我们引入了一组以对话一致性为中心的任务，特别关注其检测和解决。我们的实验结果表明，我们的数据集极大地帮助了识别和解决会话不一致方面的进展，而当前流行的大型语言模型（如 ChatGPT）擅长解决不一致问题，但仍然难以检测。

Title: Hierarchical Federated Learning in Multi-hop Cluster-Based VANETs

Authors: M. Saeid HaghighiFard, Sinem Coleri
Subjects: cs.LG, cs.AI, cs.DC, cs.NI, eess.SY
Abstract URL: https://arxiv.org/abs/2401.10361
Pdf URL: https://arxiv.org/pdf/2401.10361
Copy Paste: [[2401.10361]] Hierarchical Federated Learning in Multi-hop Cluster-Based VANETs(https://arxiv.org/abs/2401.10361)
Keywords: rag
Abstract: The usage of federated learning (FL) in Vehicular Ad hoc Networks (VANET) has garnered significant interest in research due to the advantages of reducing transmission overhead and protecting user privacy by communicating local dataset gradients instead of raw data. However, implementing FL in VANETs faces challenges, including limited communication resources, high vehicle mobility, and the statistical diversity of data distributions. In order to tackle these issues, this paper introduces a novel framework for hierarchical federated learning (HFL) over multi-hop clustering-based VANET. The proposed method utilizes a weighted combination of the average relative speed and cosine similarity of FL model parameters as a clustering metric to consider both data diversity and high vehicle mobility. This metric ensures convergence with minimum changes in cluster heads while tackling the complexities associated with non-independent and identically distributed (non-IID) data scenarios. Additionally, the framework includes a novel mechanism to manage seamless transitions of cluster heads (CHs), followed by transferring the most recent FL model parameter to the designated CH. Furthermore, the proposed approach considers the option of merging CHs, aiming to reduce their count and, consequently, mitigate associated overhead. Through extensive simulations, the proposed hierarchical federated learning over clustered VANET has been demonstrated to improve accuracy and convergence time significantly while maintaining an acceptable level of packet overhead compared to previously proposed clustering algorithms and non-clustered VANET.
摘要：联邦学习 (FL) 在车辆自组织网络 (VANET) 中的使用引起了人们的极大兴趣，因为它具有通过传输本地数据集梯度而不是原始数据来减少传输开销和保护用户隐私的优点。然而，在 VANET 中实施 FL 面临着挑战，包括有限的通信资源、高车辆移动性以及数据分布的统计多样性。为了解决这些问题，本文介绍了一种基于多跳集群的 VANET 的分层联合学习（HFL）的新框架。该方法利用FL模型参数的平均相对速度和余弦相似度的加权组合作为聚类指标来考虑数据多样性和高车辆移动性。该指标确保以最小的簇头变化实现收敛，同时解决与非独立和同分布（非 IID）数据场景相关的复杂性。此外，该框架还包括一种新颖的机制来管理簇头 (CH) 的无缝转换，然后将最新的 FL 模型参数传输到指定的 CH。此外，所提出的方法考虑了合并 CH 的选项，旨在减少 CH 的数量，从而减轻相关的开销。通过广泛的模拟，与先前提出的集群算法和非集群 VANET 相比，所提出的集群 VANET 分层联合学习已被证明可以显着提高准确性和收敛时间，同时保持可接受的数据包开销水平。
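
The clustering metric described in this abstract, a weighted combination of average relative speed and cosine similarity of FL model parameters, can be illustrated with a minimal sketch. The abstract does not give the formula, so everything concrete here is an assumption: the weight `alpha`, the normalisation constant `max_speed`, the convention that a lower score means a better cluster-head candidate, and the function names themselves.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length parameter vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: treat a zero vector as uncorrelated
    return dot / (norm_u * norm_v)


def clustering_score(avg_relative_speed, params_a, params_b,
                     alpha=0.5, max_speed=40.0):
    """Hypothetical score in [0, 1]; lower means a better match.

    alpha and max_speed (m/s) are illustrative values, not from the paper.
    """
    # Mobility term: relative speed clipped and scaled to [0, 1].
    speed_term = min(abs(avg_relative_speed) / max_speed, 1.0)
    # Data term: cosine similarity in [-1, 1] mapped to a dissimilarity in [0, 1].
    dissimilarity = (1.0 - cosine_similarity(params_a, params_b)) / 2.0
    return alpha * speed_term + (1.0 - alpha) * dissimilarity
```

Under these assumptions, two vehicles moving at the same speed with identical model parameters score 0.0, while vehicles at the maximum relative speed with opposite parameter vectors score 1.0.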

Title: Using LLM such as ChatGPT for Designing and Implementing a RISC Processor: Execution,Challenges and Limitations

Authors: Shadeeb Hossain, Aayush Gohil, Yizhou Wang
Subjects: cs.LG, cs.AR, cs.SE
Abstract URL: https://arxiv.org/abs/2401.10364
Pdf URL: https://arxiv.org/pdf/2401.10364
Copy Paste: [[2401.10364]] Using LLM such as ChatGPT for Designing and Implementing a RISC Processor: Execution,Challenges and Limitations(https://arxiv.org/abs/2401.10364)
Keywords: language model, gpt, llm, code, chat
Abstract: This paper discusses the feasibility of using Large Language Models (LLMs) for code generation, with a particular application in designing a RISC processor. The paper also reviews the associated steps such as parsing, tokenization, encoding, attention mechanism, sampling the tokens and iterations during code generation. The generated code for the RISC components is verified through testbenches and hardware implementation on an FPGA board. Four metrics are used to compare the efficiency of using LLMs in programming: correct output on the first iteration, number of errors embedded in the code, number of trials required to achieve the code, and failure to generate the code after three iterations. In all the cases, the generated code had significant errors and human intervention was always required to fix the bugs. LLMs can therefore be used to complement a programmer's code design.
摘要：本文讨论了使用大型语言模型 LLM 进行代码生成以及设计 RISC 中特定应用的可行性。本文还回顾了相关步骤，如解析、标记化、编码、注意力机制、标记采样和代码生成过程中的迭代。 RISC 组件生成的代码通过测试平台和 FPGA 板上的硬件实现进行验证。第一次迭代的正确输出、代码中嵌入的错误数量、实现代码所需的试验次数和三次迭代后未能生成代码这四个度量参数用于比较在编程中使用 LLM 的效率。在所有情况下，生成的代码都存在重大错误，并且始终需要人工干预来修复错误。因此，LLM 可以用来补充程序员的代码设计。

Title: Cooperative Multi-Agent Graph Bandits: UCB Algorithm and Regret Analysis

Authors: Phevos Paschalidis, Runyu Zhang, Na Li
Subjects: cs.LG, cs.MA, stat.ML
Abstract URL: https://arxiv.org/abs/2401.10383
Pdf URL: https://arxiv.org/pdf/2401.10383
Copy Paste: [[2401.10383]] Cooperative Multi-Agent Graph Bandits: UCB Algorithm and Regret Analysis(https://arxiv.org/abs/2401.10383)
Keywords: agent
Abstract: In this paper, we formulate the multi-agent graph bandit problem as a multi-agent extension of the graph bandit problem introduced by Zhang, Johansson, and Li [CISS 57, 1-6 (2023)]. In our formulation, $N$ cooperative agents travel on a connected graph $G$ with $K$ nodes. Upon arrival at each node, agents observe a random reward drawn from a node-dependent probability distribution. The reward of the system is modeled as a weighted sum of the rewards the agents observe, where the weights capture the decreasing marginal reward associated with multiple agents sampling the same node at the same time. We propose an Upper Confidence Bound (UCB)-based learning algorithm, Multi-G-UCB, and prove that its expected regret over $T$ steps is bounded by $O(N\log(T)[\sqrt{KT} + DK])$, where $D$ is the diameter of graph $G$. Lastly, we numerically test our algorithm by comparing it to alternative methods.
摘要：在本文中，我们将多智能体图强盗问题表述为由Zhang、Johansson 和Li [CISS 57, 1-6 (2023)] 提出的图强盗问题的多智能体扩展。在我们的公式中，$N$ 合作代理在具有 $K$ 节点的连通图 $G$ 上行驶。到达每个节点后，智能体会观察到从依赖于节点的概率分布中抽取的随机奖励。系统的奖励被建模为代理观察到的奖励的加权和，其中权重捕获与多个代理同时采样同一节点相关的边际奖励递减。我们提出了一种基于上置信界 (UCB) 的学习算法 Multi-G-UCB，并证明其对 $T$ 步骤的预期遗憾受到 $O(N\log(T)[\sqrt{KT} + DK])$，其中 $D$ 是图 $G$ 的直径。最后，我们通过与其他方法进行比较来对我们的算法进行数值测试。
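
To make the Upper Confidence Bound idea in this abstract concrete, here is a minimal sketch of the classic single-agent UCB1 index applied to graph nodes. It is not Multi-G-UCB: the abstract does not specify that algorithm, so graph travel, multiple agents, and the weighted rewards are all left out. The exploration constant `c` and the function names are assumptions. The last helper only evaluates the shape of the stated bound, $N\log(T)[\sqrt{KT} + DK]$, with the hidden constant ignored.

```python
import math


def ucb_index(mean_reward, pulls, t, c=2.0):
    """UCB1-style index: empirical mean plus an exploration bonus."""
    if pulls == 0:
        return math.inf  # unvisited nodes are tried first
    return mean_reward + math.sqrt(c * math.log(t) / pulls)


def select_node(means, counts, t):
    """Return the index of the node with the largest UCB index."""
    scores = [ucb_index(m, n, t) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


def regret_bound_shape(n_agents, n_nodes, diameter, horizon):
    """Shape of the bound from the abstract, constants ignored."""
    return n_agents * math.log(horizon) * (
        math.sqrt(n_nodes * horizon) + diameter * n_nodes
    )
```

The bonus term shrinks as a node is sampled more often, so the rule gradually shifts from exploring rarely visited nodes to exploiting the node with the best empirical mean.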

Title: Distribution Consistency based Self-Training for Graph Neural Networks with Sparse Labels

Authors: Fali Wang, Tianxiang Zhao, Suhang Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2401.10394
Pdf URL: https://arxiv.org/pdf/2401.10394
Copy Paste: [[2401.10394]] Distribution Consistency based Self-Training for Graph Neural Networks with Sparse Labels(https://arxiv.org/abs/2401.10394)
Keywords: rag
Abstract: Few-shot node classification poses a significant challenge for Graph Neural Networks (GNNs) due to insufficient supervision and potential distribution shifts between labeled and unlabeled nodes. Self-training has emerged as a widely popular framework to leverage the abundance of unlabeled data, which expands the training set by assigning pseudo-labels to selected unlabeled nodes. Efforts have been made to develop various selection strategies based on confidence, information gain, etc. However, none of these methods takes into account the distribution shift between the training and testing node sets. The pseudo-labeling step may amplify this shift and even introduce new ones, hindering the effectiveness of self-training. Therefore, in this work, we explore the potential of explicitly bridging the distribution shift between the expanded training set and test set during self-training. To this end, we propose a novel Distribution-Consistent Graph Self-Training (DC-GST) framework to identify pseudo-labeled nodes that are both informative and capable of redeeming the distribution discrepancy and formulate it as a differentiable optimization task. A distribution-shift-aware edge predictor is further adopted to augment the graph and increase the model's generalizability in assigning pseudo labels. We evaluate our proposed method on four publicly available benchmark datasets and extensive experiments demonstrate that our framework consistently outperforms state-of-the-art baselines.
摘要：由于监督不足以及标记和未标记节点之间潜在的分布变化，少样本节点分类对图神经网络（GNN）提出了重大挑战。自训练已成为一种广泛流行的框架，可以利用大量未标记数据，通过向选定的未标记节点分配伪标签来扩展训练集。人们已经努力开发基于置信度、信息增益等的各种选择策略。但是，这些方法都没有考虑训练和测试节点集之间的分布变化。伪标记步骤可能会放大这种转变，甚至引入新的转变，从而阻碍自我训练的有效性。因此，在这项工作中，我们探索了在自训练过程中明确桥接扩展训练集和测试集之间的分布转变的潜力。为此，我们提出了一种新颖的分布一致图自训练（DC-GST）框架来识别伪标记节点，这些节点既提供信息又能够弥补分布差异，并将其制定为可微优化任务。进一步采用分布偏移感知边缘预测器来增强图并提高模型在分配伪标签方面的通用性。我们在四个公开可用的基准数据集上评估了我们提出的方法，并且广泛的实验表明我们的框架始终优于最先进的基线。

Title: Learning High-Quality and General-Purpose Phrase Representations

Authors: Lihu Chen, Gaël Varoquaux, Fabian M. Suchanek
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2401.10407
Pdf URL: https://arxiv.org/pdf/2401.10407
Copy Paste: [[2401.10407]] Learning High-Quality and General-Purpose Phrase Representations(https://arxiv.org/abs/2401.10407)
Keywords: language model, code, rag
Abstract: Phrase representations play an important role in data science and natural language processing, benefiting various tasks like Entity Alignment, Record Linkage, Fuzzy Joins, and Paraphrase Classification. The current state-of-the-art method involves fine-tuning pre-trained language models for phrasal embeddings using contrastive learning. However, we have identified areas for improvement. First, these pre-trained models tend to be unnecessarily complex and need to be pre-trained on a corpus with context sentences. Second, leveraging the phrase type and morphology gives phrase representations that are both more precise and more flexible. We propose an improved framework to learn phrase representations in a context-free fashion. The framework employs phrase type classification as an auxiliary task and incorporates character-level information more effectively into the phrase representation. Furthermore, we design three granularities of data augmentation to increase the diversity of training samples. Our experiments across a wide range of tasks show that our approach generates superior phrase embeddings compared to previous methods while requiring a smaller model size. The code is available at https://github.com/tigerchen52/PEARL
摘要：短语表示在数据科学和自然语言处理中发挥着重要作用，有利于实体对齐、记录链接、模糊连接和释义分类等各种任务。当前最先进的方法涉及使用对比学习对短语嵌入的预训练语言模型进行微调。然而，我们已经确定了需要改进的领域。首先，这些预训练的模型往往过于复杂，并且需要在具有上下文句子的语料库上进行预训练。其次，利用短语类型和形态可以提供更精确、更灵活的短语表示。我们提出了一个改进的框架，以上下文无关的方式学习短语表示。该框架采用短语类型分类作为辅助任务，并将字符级信息更有效地合并到短语表示中。此外，我们设计了三种数据增强粒度以增加训练样本的多样性。我们在广泛任务中进行的实验表明，与以前的方法相比，我们的方法可以生成更出色的短语嵌入，同时需要更小的模型大小。代码可在 https://github.com/tigerchen52/PEARL 获取

Title: Can Large Language Model Summarizers Adapt to Diverse Scientific Communication Goals?

Authors: Marcio Fonseca, Shay B. Cohen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2401.10415
Pdf URL: https://arxiv.org/pdf/2401.10415
Copy Paste: [[2401.10415]] Can Large Language Model Summarizers Adapt to Diverse Scientific Communication Goals?(https://arxiv.org/abs/2401.10415)
Keywords: language model, llm, rag
Abstract: In this work, we investigate the controllability of large language models (LLMs) on scientific summarization tasks. We identify key stylistic and content coverage factors that characterize different types of summaries such as paper reviews, abstracts, and lay summaries. By controlling stylistic features, we find that non-fine-tuned LLMs outperform humans in the MuP review generation task, both in terms of similarity to reference summaries and human preferences. Also, we show that we can improve the controllability of LLMs with keyword-based classifier-free guidance (CFG) while achieving lexical overlap comparable to strong fine-tuned baselines on arXiv and PubMed. However, our results also indicate that LLMs cannot consistently generate long summaries with more than 8 sentences. Furthermore, these models exhibit limited capacity to produce highly abstractive lay summaries. Although LLMs demonstrate strong generic summarization competency, sophisticated content control without costly fine-tuning remains an open problem for domain-specific applications.
摘要：在这项工作中，我们研究了大型语言模型（LLM）在科学摘要任务上的可控性。我们确定了表征不同类型摘要（例如论文评论、摘要和简明摘要）的关键文体和内容覆盖因素。通过控制文体特征，我们发现未经微调的 LLM 在 MuP 评论生成任务中的表现优于人类，无论是在与参考摘要的相似性还是人类偏好方面。此外，我们还表明，我们可以通过基于关键字的无分类器指导（CFG）来提高 LLM 的可控性，同时实现与 arXiv 和 PubMed 上强大的微调基线相当的词汇重叠。然而，我们的结果还表明，LLM 无法始终如一地生成超过 8 个句子的长摘要。此外，这些模型产生高度抽象的外行摘要的能力有限。尽管 LLM 表现出强大的通用概括能力，但对于特定领域的应用程序来说，无需昂贵的微调即可实现复杂的内容控制仍然是一个悬而未决的问题。

Title: Understanding Learning through the Lens of Dynamical Invariants

Authors: Alex Ushveridze
Subjects: cs.AI, cs.IT
Abstract URL: https://arxiv.org/abs/2401.10428
Pdf URL: https://arxiv.org/pdf/2401.10428
Copy Paste: [[2401.10428]] Understanding Learning through the Lens of Dynamical Invariants(https://arxiv.org/abs/2401.10428)
Keywords: agent
Abstract: This paper proposes a novel perspective on learning, positing it as the pursuit of dynamical invariants -- data combinations that remain constant or exhibit minimal change over time as a system evolves. This concept is underpinned by both informational and physical principles, rooted in the inherent properties of these invariants. Firstly, their stability makes them ideal for memorization and integration into associative networks, forming the basis of our knowledge structures. Secondly, the predictability of these stable invariants makes them valuable sources of usable energy, quantifiable as $kT \ln 2$ per bit of accurately predicted information. This energy can be harnessed to explore new transformations, rendering learning systems energetically autonomous and increasingly effective. Such systems are driven to continuously seek new data invariants as energy sources. The paper further explores several meta-architectures of autonomous, self-propelled learning agents that utilize predictable information patterns as a source of usable energy.
摘要：本文提出了一种关于学习的新颖视角，将其视为对动态不变量的追求——随着系统的发展，数据组合保持不变或随着时间的推移表现出最小的变化。这一概念以信息原理和物理原理为基础，植根于这些不变量的固有属性。首先，它们的稳定性使它们非常适合记忆和融入联想网络，形成我们知识结构的基础。其次，这些稳定不变量的可预测性使它们成为宝贵的可用能量来源，可以量化为每比特准确预测信息的 kTln2。这种能量可以用来探索新的转变，使学习系统充满活力并日益有效。此类系统不断寻求新的数据不变量作为能源。该论文进一步探讨了几种自主、自我推进的学习代理的元架构，这些代理利用可预测的信息模式作为可用能量的来源。
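
The energy figure quoted in this abstract, kT ln 2 per bit, is easy to evaluate numerically. The sketch below assumes the usual reading of the symbols, with k as the Boltzmann constant and T as absolute temperature in kelvin; the abstract itself does not define them, and the function name is hypothetical.

```python
import math

# Boltzmann constant in joules per kelvin (exact value in the 2019 SI).
BOLTZMANN_J_PER_K = 1.380649e-23


def energy_per_bit(temperature_kelvin):
    """Return k * T * ln(2) in joules for the given temperature."""
    return BOLTZMANN_J_PER_K * temperature_kelvin * math.log(2)
```

At a room temperature of 300 K, `energy_per_bit(300.0)` comes to roughly 2.87e-21 joules per bit, which shows how small the per-bit quantity is at everyday temperatures.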

Title: Breaking the Curse of Multilinguality with Cross-lingual Expert Language Models

Authors: Terra Blevins, Tomasz Limisiewicz, Suchin Gururangan, Margaret Li, Hila Gonen, Noah A. Smith, Luke Zettlemoyer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2401.10440
Pdf URL: https://arxiv.org/pdf/2401.10440
Copy Paste: [[2401.10440]] Breaking the Curse of Multilinguality with Cross-lingual Expert Language Models(https://arxiv.org/abs/2401.10440)
Keywords: language model
Abstract: Despite their popularity in non-English NLP, multilingual language models often underperform monolingual ones due to inter-language competition for model parameters. We propose Cross-lingual Expert Language Models (X-ELM), which mitigate this competition by independently training language models on subsets of the multilingual corpus. This process specializes X-ELMs to different languages while remaining effective as a multilingual ensemble. Our experiments show that when given the same compute budget, X-ELM outperforms jointly trained multilingual models across all considered languages and that these gains transfer to downstream tasks. X-ELM provides additional benefits over performance improvements: new experts can be iteratively added, adapting X-ELM to new languages without catastrophic forgetting. Furthermore, training is asynchronous, reducing the hardware requirements for multilingual training and democratizing multilingual modeling.
摘要：尽管多语言模型在非英语 NLP 中很受欢迎，但由于模型参数的语言间竞争，多语言模型的性能通常不如单语言模型。我们提出了跨语言专家语言模型（X-ELM），它通过在多语言语料库的子集上独立训练语言模型来缓解这种竞争。此过程将 X-ELM 专门用于不同的语言，同时保持作为多语言集成的有效性。我们的实验表明，当给定相同的计算预算时，X-ELM 在所有考虑的语言中优于联合训练的多语言模型，并且这些收益会转移到下游任务。 X-ELM 提供了超越性能改进的额外好处：可以迭代地添加新的专家，使 X-ELM 适应新的语言，而不会发生灾难性的遗忘。此外，训练是异步的，减少了多语言训练的硬件要求并使多语言建模民主化。

Title: Can A Cognitive Architecture Fundamentally Enhance LLMs? Or Vice Versa?

Authors: Ron Sun
Subjects: cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2401.10444
Pdf URL: https://arxiv.org/pdf/2401.10444
Copy Paste: [[2401.10444]] Can A Cognitive Architecture Fundamentally Enhance LLMs? Or Vice Versa?(https://arxiv.org/abs/2401.10444)
Keywords: llm
Abstract: The paper discusses what is needed to address the limitations of current LLM-centered AI systems. The paper argues that incorporating insights from human cognition and psychology, as embodied by a computational cognitive architecture, can help develop systems that are more capable, more reliable, and more human-like. It emphasizes the importance of the dual-process architecture and the hybrid neuro-symbolic approach in addressing the limitations of current LLMs. In the opposite direction, the paper also highlights the need for an overhaul of computational cognitive architectures to better reflect advances in AI and computing technology. Overall, the paper advocates for a multidisciplinary, mutually beneficial approach towards developing better models both for AI and for understanding the human mind.
摘要：本文讨论了如何解决当前以 LLM 为中心的人工智能系统的局限性。该论文认为，结合人类认知和心理学的见解，如计算认知架构所体现的那样，可以帮助开发更强大、更可靠、更人性化的系统。它强调了双进程架构和混合神经符号方法在解决当前 LLM 的局限性方面的重要性。相反，该论文还强调需要彻底改革计算认知架构，以更好地反映人工智能和计算技术的进步。总体而言，本文主张采用多学科、互利的方法来开发更好的人工智能模型和理解人类思维。

Title: Large Language Models are Efficient Learners of Noise-Robust Speech Recognition

Authors: Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Ruizhe Li, Chao Zhang, Pin-Yu Chen, EnSiong Chng
Subjects: cs.CL, cs.AI, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2401.10446
Pdf URL: https://arxiv.org/pdf/2401.10446
Copy Paste: [[2401.10446]] Large Language Models are Efficient Learners of Noise-Robust Speech Recognition(https://arxiv.org/abs/2401.10446)
Keywords: language model, llm, code, rag
Abstract: Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR), which leverages the rich linguistic knowledge and powerful reasoning ability of LLMs to improve recognition results. The latest work proposes a GER benchmark with the HyPoradise dataset to learn the mapping from ASR N-best hypotheses to ground-truth transcription by efficient LLM finetuning, which shows great effectiveness but lacks specificity on noise-robust ASR. In this work, we extend the benchmark to noisy conditions and investigate whether we can teach LLMs to perform denoising for GER just as robust ASR does, where one solution is introducing noise information as a conditioner into the LLM. However, directly incorporating noise embeddings from the audio encoder could harm the LLM tuning due to the cross-modality gap. To this end, we propose to extract a language-space noise embedding from the N-best list to represent the noise conditions of source speech, which can promote the denoising process in GER. Furthermore, in order to enhance its ability to represent audio noise, we design a knowledge distillation (KD) approach via mutual information estimation to distill the real noise information in audio embeddings to our language embedding. Experiments on various recent LLMs demonstrate that our approach achieves a new breakthrough, with up to 53.9% correction improvement in terms of word error rate even with limited training data. Analysis shows that our language-space noise embedding can well represent the noise conditions of source speech, under which off-the-shelf LLMs show strong ability of language-space denoising.
摘要：大语言模型（LLM）的最新进展促进了自动语音识别（ASR）的生成错误纠正（GER），它利用LLM丰富的语言知识和强大的推理能力来提高识别结果。最新的工作提出了使用 HyPoradise 数据集的 GER 基准，通过高效的 LLM 微调来学习从 ASR N 最佳假设到真实转录的映射，这显示出很大的有效性，但缺乏抗噪声 ASR 的特异性。在这项工作中，我们将基准扩展到噪声条件，并研究是否可以教 LLM 对 GER 进行去噪，就像稳健的 ASR 所做的那样，其中一种解决方案是将噪声信息作为调节器引入 LLM。然而，由于跨模态间隙，直接合并来自音频编码器的噪声嵌入可能会损害 LLM 调优。为此，我们建议从N-best列表中提取语言空间噪声嵌入来表示源语音的噪声条件，这可以促进GER中的去噪过程。此外，为了增强其对音频噪声的表示能力，我们通过互信息估计设计了一种知识蒸馏（KD）方法，将音频嵌入中的真实噪声信息蒸馏到我们的语言嵌入中。对各种最新 LLM 的实验表明，我们的方法取得了新的突破，在训练数据有限的情况下，在单词错误率方面实现了高达 53.9% 的纠正改进。分析表明，我们的语言空间噪声嵌入可以很好地表示源语音的噪声条件，在这种情况下，现成的LLM表现出强大的语言空间去噪能力。
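
The results in this abstract are reported in terms of word error rate (WER). As background, here is a minimal sketch of the standard WER definition: word-level edit distance divided by the number of reference words. The whitespace tokenisation and the function name are assumptions made for illustration, and this is not the paper's evaluation code.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Each entry of an N-best list can be scored this way against the ground-truth transcription, which gives a simple measure of how much a correction step changes the error.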

Title: Investigating Training Strategies and Model Robustness of Low-Rank Adaptation for Language Modeling in Speech Recognition

Authors: Yu Yu, Chao-Han Huck Yang, Tuan Dinh, Sungho Ryu, Jari Kolehmainen, Roger Ren, Denis Filimonov, Prashanth G. Shivakumar, Ankur Gandhe, Ariya Rastow, Jia Xu, Ivan Bulyko, Andreas Stolcke
Subjects: cs.CL, cs.AI, cs.LG, cs.NE, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2401.10447
Pdf URL: https://arxiv.org/pdf/2401.10447
Copy Paste: [[2401.10447]] Investigating Training Strategies and Model Robustness of Low-Rank Adaptation for Language Modeling in Speech Recognition(https://arxiv.org/abs/2401.10447)
Keywords: language model, lora
Abstract: The use of low-rank adaptation (LoRA) with frozen pretrained language models (PLMs) has become increasingly popular as a mainstream, resource-efficient modeling approach for memory-constrained hardware. In this study, we first explore how to enhance model performance by introducing various LoRA training strategies, achieving relative word error rate reductions of 3.50% on the public Librispeech dataset and of 3.67% on an internal dataset in the messaging domain. To further characterize the stability of LoRA-based second-pass speech recognition models, we examine robustness against input perturbations. These perturbations are rooted in homophone replacements and a novel metric called N-best Perturbation-based Rescoring Robustness (NPRR), both designed to measure the relative degradation in the performance of rescoring models. Our experimental results indicate that while advanced variants of LoRA, such as dynamic rank-allocated LoRA, lead to performance degradation in $1$-best perturbation, they alleviate the degradation in $N$-best perturbation. This finding is in comparison to fully-tuned models and vanilla LoRA tuning baselines, suggesting that a comprehensive selection is needed when using LoRA-based adaptation for compute-cost savings and robust language modeling.
摘要：使用低秩自适应 (LoRA) 和冻结预训练语言模型 (PLM) 作为内存受限硬件的主流、资源高效的建模方法已变得越来越流行。在本研究中，我们首先探索如何通过引入各种 LoRA 训练策略来增强模型性能，在公共 Librispeech 数据集上实现相对单词错误率降低 3.50%，在消息传递领域的内部数据集上实现相对单词错误率降低 3.67%。为了进一步表征基于 LoRA 的第二遍语音识别模型的稳定性，我们检查了针对输入扰动的鲁棒性。这些扰动植根于同音词替换和一种称为 N 最佳基于扰动的重新评分鲁棒性 (NPRR) 的新指标，两者都旨在衡量重新评分模型性能的相对退化。我们的实验结果表明，虽然 LoRA 的高级变体（例如动态排名分配 LoRA）会导致 $1$ 最佳扰动中的性能下降，但它们会减轻 $N$ 最佳扰动中的性能下降。这一发现与完全调整的模型和普通 LoRA 调整基线进行了比较，表明在使用基于 LoRA 的自适应来节省计算成本和稳健的语言建模时需要进行全面的选择。
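
The "relative word error rate reduction" quoted in this abstract is a ratio rather than a difference in percentage points. A minimal sketch of the arithmetic follows; the function name is hypothetical, and the WER values in the usage note are made-up illustrations because the abstract does not report absolute WERs.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Return (baseline - new) / baseline as a fraction of the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer
```

For example, moving from an illustrative baseline WER of 4.00% to 3.86% is a drop of only 0.14 percentage points, yet `relative_wer_reduction(4.00, 3.86)` is about 0.035, which is a 3.5% relative reduction.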

Title: Contrastive Unlearning: A Contrastive Approach to Machine Unlearning

Authors: Hong kyu Lee, Qiuchen Zhang, Carl Yang, Jian Lou, Li Xiong
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2401.10458
Pdf URL: https://arxiv.org/pdf/2401.10458
Copy Paste: [[2401.10458]] Contrastive Unlearning: A Contrastive Approach to Machine Unlearning(https://arxiv.org/abs/2401.10458)
Keywords: rag
Abstract: Machine unlearning aims to eliminate the influence of a subset of training samples (i.e., unlearning samples) from a trained model. Effectively and efficiently removing the unlearning samples without negatively impacting the overall model performance is still challenging. In this paper, we propose a contrastive unlearning framework, leveraging the concept of representation learning for more effective unlearning. It removes the influence of unlearning samples by contrasting their embeddings against the remaining samples so that they are pushed away from their original classes and pulled toward other classes. By directly optimizing the representation space, it effectively removes the influence of unlearning samples while maintaining the representations learned from the remaining samples. Experiments on a variety of datasets and models on both class unlearning and sample unlearning showed that contrastive unlearning achieves the best unlearning effects and efficiency with the lowest performance loss compared with the state-of-the-art algorithms.
摘要：机器取消学习旨在消除训练模型中训练样本子集（即取消学习样本）的影响。在不对整体模型性能产生负面影响的情况下有效且高效地删除未学习样本仍然具有挑战性。在本文中，我们提出了一个对比性遗忘框架，利用表示学习的概念来实现更有效的遗忘。它通过将样本的嵌入与剩余样本进行对比来消除未学习样本的影响，从而将它们从原始类别中推开并拉向其他类别。通过直接优化表示空间，它有效地消除了未学习样本的影响，同时保持了从剩余样本中学到的表示。在各种数据集和模型上进行的类别遗忘学习和样本遗忘学习的实验表明，与最先进的算法相比，对比遗忘学习以最低的性能损失实现了最佳的遗忘效果和效率。

Title: Critical Data Size of Language Models from a Grokking Perspective

Authors: Xuekai Zhu, Yao Fu, Bowen Zhou, Zhouhan Lin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2401.10463
Pdf URL: https://arxiv.org/pdf/2401.10463
Copy Paste: [[2401.10463]] Critical Data Size of Language Models from a Grokking Perspective(https://arxiv.org/abs/2401.10463)
Keywords: language model
Abstract: We explore the critical data size in language models, a threshold that marks a fundamental shift from quick memorization to slow generalization. We formalize the phase transition under the grokking configuration into the Data Efficiency Hypothesis and identify data insufficiency, sufficiency, and surplus regimes in language model training dynamics. We develop a grokking configuration to stably reproduce grokking on simplistic language models by rescaling initialization and weight decay. We show that generalization occurs only when language models reach a critical size. We analyze grokking both sample-wise and model-wise, verifying the proposed data efficiency hypothesis. Our experiments reveal smoother phase transitions occurring at the critical dataset size for language datasets. As the model size increases, this critical point also becomes larger, indicating that larger models require more data. Our results deepen the understanding of language model training, offering a novel perspective on the role of data in the learning mechanism of language models.
摘要：我们探索语言模型中的关键数据大小，这是一个标志着从快速记忆到缓慢泛化的根本转变的阈值。我们将 grokking 配置下的相变形式化为数据效率假设，并识别语言模型训练动态中的数据不足、充足和过剩状态。我们开发了一种 grokking 配置，通过重新调整初始化和权重衰减，在简单的语言模型上稳定地重现 grokking。我们证明，只有当语言模型达到临界大小时才会发生泛化。我们对样本和模型进行了分析，验证了所提出的数据效率假设。我们的实验揭示了语言数据集在关键数据集大小下发生的更平滑的相变。随着模型尺寸的增加，这个临界点也变得更大，表明更大的模型需要更多的数据。我们的结果加深了对语言模型训练的理解，为数据在语言模型学习机制中的作用提供了新的视角。

Title: DeepEdit: Knowledge Editing as Decoding with Constraints

Authors: Yiwei Wang, Muhao Chen, Nanyun Peng, Kai-Wei Chang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2401.10471
Pdf URL: https://arxiv.org/pdf/2401.10471
Copy Paste: [[2401.10471]] DeepEdit: Knowledge Editing as Decoding with Constraints(https://arxiv.org/abs/2401.10471)
Keywords: language model, llm, code
Abstract: We develop a new perspective of knowledge editing for large language models (LLMs) as decoding with constraints. We propose DeepEdit (Depth-first Search based Progressive Decoding for Knowledge Editing), a neuro-symbolic method that improves knowledge editing with better coherence of reasoning, relevance to the question, and awareness of updated knowledge. DeepEdit can be flexibly applied to all black-box LLMs: it does not require any access to the model parameters, representations, or output vocabulary distributions. DeepEdit progressively produces high-quality reasoning steps towards effective knowledge editing. It utilizes a depth-first search to revise the LLMs' output, which improves the output's informativeness to the input question and awareness of the updated knowledge. Qualitatively, DeepEdit effectively controls LLMs to produce more succinct reasoning in accord with knowledge editing. Quantitatively, DeepEdit yields significant gains on MQuaKE, a challenging multi-hop question-answering dataset with knowledge editing. We release the source code at https://github.com/wangywUST/DeepEdit.
摘要：我们为大型语言模型（LLM）开发了一种新的知识编辑视角，即带约束的解码。我们提出了 DeepEdit（基于深度优先搜索的知识编辑渐进解码），这是一种神经符号方法，可以通过更好的推理连贯性、与问题的相关性以及对更新知识的认识来改进知识编辑。 DeepEdit 可以灵活地应用于所有黑盒法学硕士：它不需要访问模型参数、表示或输出词汇分布。 DeepEdit 逐步产生高质量的推理步骤，以实现有效的知识编辑。它利用深度优先搜索来修改法学硕士的输出，从而提高输出对输入问题的信息量以及对更新知识的认识。定性地讲，DeepEdit有效控制LLM根据知识编辑产生更简洁的推理。从数量上来说，DeepEdit 比 MQuaKE 取得了显着的进步，MQuaKE 是一个具有挑战性的多跳问答数据集，具有知识编辑功能。我们在 https://github.com/wangywUST/DeepEdit 发布了源代码。

Title: Name Tagging Under Domain Shift via Metric Learning for Life Sciences

Authors: Hongyi Liu, Qingyun Wang, Payam Karisani, Heng Ji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2401.10472
Pdf URL: https://arxiv.org/pdf/2401.10472
Copy Paste: [[2401.10472]] Name Tagging Under Domain Shift via Metric Learning for Life Sciences(https://arxiv.org/abs/2401.10472)
Keywords: language model, gpt, llm, chat
Abstract: Name tagging is a key component of Information Extraction (IE), particularly in scientific domains such as biomedicine and chemistry, where large language models (LLMs), e.g., ChatGPT, fall short. We investigate the applicability of transfer learning for enhancing a name tagging model trained in the biomedical domain (the source domain) to be used in the chemical domain (the target domain). A common practice for training such a model in a few-shot learning setting is to pretrain the model on the labeled source data, and then to finetune it on a handful of labeled target examples. In our experiments we observed that such a model is prone to mislabeling the source entities, which can often appear in the text, as the target entities. To alleviate this problem, we propose a model that transfers knowledge from the source domain to the target domain while, at the same time, projecting the source entities and target entities into separate regions of the feature space. This diminishes the risk of mislabeling the source entities as the target entities. Our model consists of two stages: 1) entity grouping in the source domain, which incorporates knowledge from annotated events to establish relations between entities, and 2) entity discrimination in the target domain, which relies on pseudo labeling and contrastive learning to enhance discrimination between the entities in the two domains. We carry out extensive experiments across three source and three target datasets, and demonstrate that our method outperforms the baselines, in some scenarios by 5% absolute value.
摘要：名称标签是信息提取 (IE) 的关键组成部分，特别是在生物医学和化学等科学领域，而大型语言模型 (LLM)（例如 ChatGPT）在这些领域存在不足。我们研究了迁移学习的适用性，以增强在生物医学领域（源领域）中训练的名称标签模型，以便在化学领域（目标领域）中使用。在几次学习设置中训练此类模型的常见做法是在标记的源数据上预训练模型，然后在大量标记的目标示例上对其进行微调。在我们的实验中，我们观察到这种模型很容易将经常出现在文本中的源实体错误地标记为目标实体。为了缓解这个问题，我们提出了一种模型，将知识从源域转移到目标域，但同时将源实体和目标实体投影到特征空间的不同区域。这降低了将源实体错误标记为目标实体的风险。我们的模型由两个阶段组成：1）源域中的实体分组，它结合了带注释事件的知识来建立实体之间的关系，2）目标域中的实体区分，它依赖于伪标签和对比学习来增强实体之间的区分两个域中的实体。我们在三个源数据集和三个目标数据集上进行了广泛的实验，并证明我们的方法优于基线，在某些情况下绝对值高出 5%。

Title: Escape Sky-high Cost: Early-stopping Self-Consistency for Multi-step Reasoning

Authors: Yiwei Li, Peiwen Yuan, Shaoxiong Feng, Boyuan Pan, Xinglin Wang, Bin Sun, Heda Wang, Kan Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2401.10480
Pdf URL: https://arxiv.org/pdf/2401.10480
Copy Paste: [[2401.10480]] Escape Sky-high Cost: Early-stopping Self-Consistency for Multi-step Reasoning(https://arxiv.org/abs/2401.10480)
Keywords: language model, rag, chain-of-thought
Abstract: Self-consistency (SC) has been a widely used decoding strategy for chain-of-thought reasoning. Despite bringing significant performance improvements across a variety of multi-step reasoning tasks, it is a high-cost method that requires multiple sampling with the preset size. In this paper, we propose a simple and scalable sampling process, Early-Stopping Self-Consistency (ESC), to greatly reduce the cost of SC without sacrificing performance. On this basis, one control scheme for ESC is further derived to dynamically choose the performance-cost balance for different tasks and models. To demonstrate ESC's effectiveness, we conducted extensive experiments on three popular categories of reasoning tasks: arithmetic, commonsense and symbolic reasoning over language models with varying scales. The empirical results show that ESC reduces the average number of samples drawn for chain-of-thought reasoning by a significant margin on six benchmarks, including MATH (-33.8%), GSM8K (-80.1%), StrategyQA (-76.8%), CommonsenseQA (-78.5%), Coin Flip (-84.2%) and Last Letters (-67.4%), while attaining comparable performances.
摘要：自一致性（SC）是一种广泛使用的思想链推理解码策略。尽管在各种多步推理任务中带来了显着的性能改进，但它是一种高成本方法，需要以预设大小进行多次采样。在本文中，我们提出了一种简单且可扩展的采样过程，\textbf{E}arly-Stopping \textbf{S}elf-\textbf{C}onsistency (ESC)，以在不牺牲性能的情况下大大降低 SC 的成本。在此基础上，进一步推导了一种ESC控制方案，以针对不同任务和模型动态选择性能-成本平衡。为了证明 ESC 的有效性，我们对三种流行的推理任务类别进行了广泛的实验：不同规模的语言模型上的算术推理、常识推理和符号推理。实证结果表明，ESC在MATH（-33.8%）、GSM8K（-80.1%）、StrategyQA（-76.8%）、CommonsenseQA等六个基准上显着降低了思想链推理的平均采样次数。 (-78.5%)、抛硬币 (-84.2%) 和最后一封信 (-67.4%)，同时获得了可比的性能。
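Illustrative sketch (not the authors' code): the abstract above describes cutting the sampling cost of self-consistency by stopping early, but it does not spell out the stopping rule. The Python below assumes one plausible rule: draw samples in small windows and stop as soon as every answer in a window agrees, then take a majority vote over everything sampled so far. The names early_stopping_self_consistency, sample_answer, max_samples and window_size are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Tuple


def early_stopping_self_consistency(
    sample_answer: Callable[[], str],
    max_samples: int = 40,
    window_size: int = 5,
) -> Tuple[str, int]:
    """Return (voted answer, number of samples actually drawn).

    sample_answer stands in for one sampled chain-of-thought run that
    yields a final answer string. The unanimous-window stopping rule is
    an assumption made for this sketch.
    """
    answers: List[str] = []
    while len(answers) < max_samples:
        # Never exceed the preset budget, even in the last window.
        take = min(window_size, max_samples - len(answers))
        window = [sample_answer() for _ in range(take)]
        answers.extend(window)
        # Assumed rule: a unanimous window means further sampling is
        # unlikely to change the vote, so stop here.
        if len(set(window)) == 1:
            break
    winner, _ = Counter(answers).most_common(1)[0]
    return winner, len(answers)
```

With a sampler that always returns "42", the call early_stopping_self_consistency(lambda: "42") stops after the first window and returns ("42", 5) instead of spending all 40 samples. When no window is ever unanimous, the function falls back to plain self-consistency over the full budget.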

Title: Generalization Error Guaranteed Auto-Encoder-Based Nonlinear Model Reduction for Operator Learning

Authors: Hao Liu, Biraj Dahal, Rongjie Lai, Wenjing Liao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2401.10490
Pdf URL: https://arxiv.org/pdf/2401.10490
Copy Paste: [[2401.10490]] Generalization Error Guaranteed Auto-Encoder-Based Nonlinear Model Reduction for Operator Learning(https://arxiv.org/abs/2401.10490)
Keywords: code
Abstract: Many physical processes in science and engineering are naturally represented by operators between infinite-dimensional function spaces. The problem of operator learning, in this context, seeks to extract these physical processes from empirical data, which is challenging due to the infinite or high dimensionality of data. An integral component in addressing this challenge is model reduction, which reduces both the data dimensionality and problem size. In this paper, we utilize low-dimensional nonlinear structures in model reduction by investigating Auto-Encoder-based Neural Network (AENet). AENet first learns the latent variables of the input data and then learns the transformation from these latent variables to corresponding output data. Our numerical experiments validate the ability of AENet to accurately learn the solution operator of nonlinear partial differential equations. Furthermore, we establish a mathematical and statistical estimation theory that analyzes the generalization error of AENet. Our theoretical framework shows that the sample complexity of training AENet is intricately tied to the intrinsic dimension of the modeled process, while also demonstrating the remarkable resilience of AENet to noise.
摘要：科学和工程中的许多物理过程自然地由无限维函数空间之间的算子来表示。在这种情况下，算子学习问题寻求从经验数据中提取这些物理过程，由于数据的无限或高维，这具有挑战性。解决这一挑战的一个不可或缺的组成部分是模型缩减，它可以减少数据维度和问题规模。在本文中，我们通过研究基于自动编码器的神经网络（AENet），在模型简化中利用低维非线性结构。 AENet 首先学习输入数据的潜在变量，然后学习从这些潜在变量到相应输出数据的转换。我们的数值实验验证了 AENet 准确学习非线性偏微分方程解算子的能力。此外，我们建立了数学和统计估计理论来分析 AENet 的泛化误差。我们的理论框架表明，训练 AENet 的样本复杂性与建模过程的内在维度密切相关，同时也证明了 AENet 对噪声的显着适应能力。

Title: Knowledge Fusion of Large Language Models

Authors: Fanqi Wan, Xinting Huang, Deng Cai, Xiaojun Quan, Wei Bi, Shuming Shi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2401.10491
Pdf URL: https://arxiv.org/pdf/2401.10491
Copy Paste: [[2401.10491]] Knowledge Fusion of Large Language Models(https://arxiv.org/abs/2401.10491)
Keywords: language model, llm, code, rag
Abstract: While training large language models (LLMs) from scratch can generate models with distinct functionalities and strengths, it comes at significant costs and may result in redundant capabilities. Alternatively, a cost-effective and compelling approach is to merge existing pre-trained LLMs into a more potent model. However, due to the varying architectures of these LLMs, directly blending their weights is impractical. In this paper, we introduce the notion of knowledge fusion for LLMs, aimed at combining the capabilities of existing LLMs and transferring them into a single LLM. By leveraging the generative distributions of source LLMs, we externalize their collective knowledge and unique strengths, thereby potentially elevating the capabilities of the target model beyond those of any individual source LLM. We validate our approach using three popular LLMs with different architectures (Llama-2, MPT, and OpenLLaMA) across various benchmarks and tasks. Our findings confirm that the fusion of LLMs can improve the performance of the target model across a range of capabilities such as reasoning, commonsense, and code generation. Our code, model weights, and data are public at https://github.com/fanqiwan/FuseLLM.
摘要：虽然从头开始训练大型语言模型 (LLM) 可以生成具有独特功能和优势的模型，但它的成本很高，并且可能会导致功能冗余。或者，一种经济有效且引人注目的方法是将现有的预训练法学硕士合并到更有效的模型中。然而，由于这些法学硕士的架构不同，直接混合它们的权重是不切实际的。在本文中，我们介绍了法学硕士知识融合的概念，旨在结合现有法学硕士的能力并将其转移到单个法学硕士中。通过利用来源法学硕士的生成分布，我们将他们的集体知识和独特优势具体化，从而有可能将目标模型的能力提升到超越任何单个来源法学硕士的能力。我们使用具有不同架构的三种流行的 LLM（Llama-2、MPT 和 OpenLLaMA）在各种基准和任务中验证我们的方法。我们的研究结果证实，法学硕士的融合可以提高目标模型在推理、常识和代码生成等一系列功能方面的性能。我们的代码、模型权重和数据在 \url{https://github.com/fanqiwan/FuseLLM} 上公开。

Title: FinSQL: Model-Agnostic LLMs-based Text-to-SQL Framework for Financial Analysis

Authors: Chao Zhang, Yuren Mao, Yijiang Fan, Yu Mi, Yunjun Gao, Lu Chen, Dongfang Lou, Jinshu Lin
Subjects: cs.CL, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2401.10506
Pdf URL: https://arxiv.org/pdf/2401.10506
Copy Paste: [[2401.10506]] FinSQL: Model-Agnostic LLMs-based Text-to-SQL Framework for Financial Analysis(https://arxiv.org/abs/2401.10506)
Keywords: language model, llm, prompt, code
Abstract: Text-to-SQL, which provides a zero-code interface for operating relational databases, has gained much attention in financial analysis, because financial professionals may not be well-skilled in SQL programming. However, until now, there has been no practical Text-to-SQL benchmark dataset for financial analysis, and existing Text-to-SQL methods have not considered the unique characteristics of databases in financial applications, such as commonly existing wide tables. To address these issues, we collect a practical Text-to-SQL benchmark dataset and propose a model-agnostic Large Language Models (LLMs)-based Text-to-SQL framework for financial analysis. The benchmark dataset, BULL, is collected from the practical financial analysis business of Hundsun Technologies Inc., including databases for fund, stock, and macro economy. In addition, the proposed LLMs-based Text-to-SQL framework, FinSQL, provides a systematic treatment for financial Text-to-SQL from the perspectives of prompt construction, parameter-efficient fine-tuning and output calibration. Extensive experimental results on BULL demonstrate that FinSQL achieves state-of-the-art Text-to-SQL performance at a small cost; furthermore, FinSQL can bring up to 36.64% performance improvement in scenarios requiring few-shot cross-database model transfer.
摘要：Text-to-SQL提供了操作关系数据库的零代码接口，在金融分析领域备受关注；因为，金融专业人士可能不擅长 SQL 编程。然而，到目前为止，还没有用于金融分析的实用的Text-to-SQL基准数据集，并且现有的Text-to-SQL方法没有考虑金融应用中数据库的独特特征，例如常见的宽表。为了解决这些问题，我们收集了一个实用的文本到 SQL 基准数据集，并提出了一个与模型无关的基于大型语言模型 (LLM) 的文本到 SQL 财务分析框架。基准数据集BULL来自恒生电子实战金融分析业务，包括基金、股票、宏观经济数据库。此外，所提出的基于LLM的Text-to-SQL框架FinSQL从提示构建、参数高效微调和输出校准的角度为金融Text-to-SQL提供了系统的处理。 BULL 上的大量实验结果表明，FinSQL 以较小的成本实现了最先进的 Text-to-SQL 性能；此外，FinSQL在需要few-shot跨数据库模型传输的场景下可以带来高达36.64%的性能提升。

Title: Spatial-temporal Forecasting for Regions without Observations

Authors: Xinyu Su, Jianzhong Qi, Egemen Tanin, Yanchuan Chang, Majid Sarvi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2401.10518
Pdf URL: https://arxiv.org/pdf/2401.10518
Copy Paste: [[2401.10518]] Spatial-temporal Forecasting for Regions without Observations(https://arxiv.org/abs/2401.10518)
Keywords: code
Abstract: Spatial-temporal forecasting plays an important role in many real-world applications, such as traffic forecasting, air pollutant forecasting, crowd-flow forecasting, and so on. State-of-the-art spatial-temporal forecasting models take data-driven approaches and rely heavily on data availability. Such models suffer from accuracy issues when data is incomplete, which is common in reality due to the heavy costs of deploying and maintaining sensors for data collection. A few recent studies attempted to address the issue of incomplete data. They typically assume some data availability in a region of interest either for a short period or at a few locations. In this paper, we further study spatial-temporal forecasting for a region of interest without any historical observations, to address scenarios such as unbalanced region development, progressive deployment of sensors or lack of open data. We propose a model named STSM for the task. The model takes a contrastive learning-based approach to learn spatial-temporal patterns from adjacent regions that have recorded data. Our key insight is to learn from the locations that resemble those in the region of interest, and we propose a selective masking strategy to enable the learning. As a result, our model outperforms adapted state-of-the-art models, reducing errors consistently over both traffic and air pollutant forecasting tasks. The source code is available at https://github.com/suzy0223/STSM.
摘要：时空预测在许多现实应用中发挥着重要作用，例如交通预测、空气污染物预测、人群流量预测等。最先进的时空预测模型采用数据驱动的方法并严重依赖数据可用性。当数据不完整时，此类模型会遇到准确性问题，这在现实中很常见，因为部署和维护数据收集传感器的成本很高。最近的一些研究试图解决数据不完整的问题。他们通常假设感兴趣区域内有一些数据在短时间内或在几个地点可用。在本文中，我们进一步研究在没有任何历史观测的情况下对感兴趣区域进行时空预测，以解决诸如区域发展不平衡、传感器逐步部署或缺乏开放数据等场景。我们为该任务提出了一个名为 STSM 的模型。该模型采用基于对比学习的方法来学习已记录数据的相邻区域的时空模式。我们的主要见解是从与感兴趣区域相似的位置进行学习，并且我们提出了一种选择性掩蔽策略来实现学习。因此，我们的模型优于经过调整的最先进模型，持续减少了交通和空气污染物预测任务的错误。源代码可在 https://github.com/suzy0223/STSM 获取。

Title: Cross-lingual Editing in Multilingual Language Models

Authors: Himanshu Beniwal, Kowsik Nandagopan D, Mayank Singh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2401.10521
Pdf URL: https://arxiv.org/pdf/2401.10521
Copy Paste: [[2401.10521]] Cross-lingual Editing in Multilingual Language Models(https://arxiv.org/abs/2401.10521)
Keywords: language model, llm, code
Abstract: The training of large language models (LLMs) necessitates substantial data and computational resources, and updating outdated LLMs entails significant efforts and resources. While numerous model editing techniques (METs) have emerged to efficiently update model outputs without retraining, their effectiveness in multilingual LLMs, where knowledge is stored in diverse languages, remains an underexplored research area. This research paper introduces the cross-lingual model editing (XME) paradigm, wherein a fact is edited in one language, and the subsequent update propagation is observed across other languages. To investigate the XME paradigm, we conducted experiments using BLOOM, mBERT, and XLM-RoBERTa using the two writing scripts: Latin (English, French, and Spanish) and Indic (Hindi, Gujarati, and Bengali). The results reveal notable performance limitations of state-of-the-art METs under the XME setting, mainly when the languages involved belong to two distinct script families. These findings highlight the need for further research and development of XME techniques to address these challenges. For more comprehensive information, the dataset used in this research and the associated code are publicly available at https://github.com/lingo-iitgn/XME.
摘要：大型语言模型（LLM）的训练需要大量的数据和计算资源，而更新过时的LLM则需要大量的努力和资源。尽管已经出现了许多模型编辑技术（MET）来有效更新模型输出而无需重新训练，但它们在多语言法学硕士（知识以多种语言存储）中的有效性仍然是一个尚未充分探索的研究领域。本研究论文介绍了跨语言模型编辑（\textbf{XME}）范例，其中用一种语言编辑事实，并在其他语言中观察随后的更新传播。为了研究 XME 范式，我们使用 BLOOM、mBERT 和 XLM-RoBERTa 进行了实验，使用两种书写脚本：\textit{Latin}（英语、法语和西班牙语）和 \textit{Indic}（印地语、古吉拉特语和孟加拉语））。结果揭示了 XME 设置下最先进的 MET 的显着性能限制，主要是当涉及的语言属于两个不同的脚本系列时。这些发现强调需要进一步研究和开发 XME 技术来应对这些挑战。如需更全面的信息，本研究中使用的数据集和相关代码可在以下 URL\url{https://github.com/lingo-iitgn/XME} 上公开获取。

Title: Speech Swin-Transformer: Exploring a Hierarchical Transformer with Shifted Windows for Speech Emotion Recognition

Authors: Yong Wang, Cheng Lu, Hailun Lian, Yan Zhao, Björn Schuller, Yuan Zong, Wenming Zheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2401.10536
Pdf URL: https://arxiv.org/pdf/2401.10536
Copy Paste: [[2401.10536]] Speech Swin-Transformer: Exploring a Hierarchical Transformer with Shifted Windows for Speech Emotion Recognition(https://arxiv.org/abs/2401.10536)
Keywords: code, rag
Abstract: Swin-Transformer has demonstrated remarkable success in computer vision by leveraging its hierarchical feature representation based on Transformer. In speech signals, emotional information is distributed across different scales of speech features, e.g., word, phrase, and utterance. Drawing on this inspiration, this paper presents a hierarchical speech Transformer with shifted windows to aggregate multi-scale emotion features for speech emotion recognition (SER), called Speech Swin-Transformer. Specifically, we first divide the speech spectrogram into segment-level patches in the time domain, composed of multiple frame patches. These segment-level patches are then encoded using a stack of Swin blocks, in which a local window Transformer is utilized to explore local inter-frame emotional information across frame patches of each segment patch. After that, we also design a shifted window Transformer to compensate for patch correlations near the boundaries of segment patches. Finally, we employ a patch merging operation to aggregate segment-level emotional features for hierarchical speech representation by expanding the receptive field of Transformer from frame-level to segment-level. Experimental results demonstrate that our proposed Speech Swin-Transformer outperforms the state-of-the-art methods.
摘要：Swin-Transformer 通过利用其基于 Transformer 的分层特征表示，在计算机视觉领域取得了显着的成功。在语音信号中，情感信息分布在不同尺度的语音特征中，例如单词、短语和话语。受上述启发，本文提出了一种分层语音变压器，其具有移动窗口来聚合语音情感识别（SER）的多尺度情感特征，称为语音 Swin-Transformer。具体来说，我们首先将语音频谱图在时域中划分为段级补丁，由多个帧补丁组成。然后使用一堆 Swin 块对这些片段级补丁进行编码，其中利用本地窗口 Transformer 来探索每个片段补丁的帧补丁之间的局部帧间情感信息。之后，我们还设计了一个移位窗口 Transformer 来补偿段补丁边界附近的补丁相关性。最后，我们通过将 Transformer 的感受野从帧级扩展到段级，采用补丁合并操作来聚合段级情感特征，以实现分层语音表示。实验结果表明，我们提出的 Speech Swin-Transformer 优于最先进的方法。

Title: PhoGAD: Graph-based Anomaly Behavior Detection with Persistent Homology Optimization

Authors: Ziqi Yuan, Haoyi Zhou, Tianyu Chen, Jianxin Li
Subjects: cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2401.10547
Pdf URL: https://arxiv.org/pdf/2401.10547
Copy Paste: [[2401.10547]] PhoGAD: Graph-based Anomaly Behavior Detection with Persistent Homology Optimization(https://arxiv.org/abs/2401.10547)
Keywords: rag
Abstract: A multitude of toxic online behaviors, ranging from network attacks to anonymous traffic and spam, have severely disrupted the smooth operation of networks. Due to the inherent sender-receiver nature of network behaviors, graph-based frameworks are commonly used for detecting anomalous behaviors. However, in real-world scenarios, the boundary between normal and anomalous behaviors tends to be ambiguous. The local heterophily of graphs interferes with the detection, and existing methods based on nodes or edges introduce unwanted noise into representation results, thereby impacting the effectiveness of detection. To address these issues, we propose PhoGAD, a graph-based anomaly detection framework. PhoGAD leverages persistent homology optimization to clarify behavioral boundaries. Building upon this, the weights of adjacent edges are designed to mitigate the effects of local heterophily. Subsequently, to tackle the noise problem, we conduct a formal analysis and propose a disentangled representation-based explicit embedding method, ultimately achieving anomaly behavior detection. Experiments on intrusion, traffic, and spam datasets verify that PhoGAD has surpassed the performance of state-of-the-art (SOTA) frameworks in detection efficacy. Notably, PhoGAD demonstrates robust detection even with diminished anomaly proportions, highlighting its applicability to real-world scenarios. The analysis of persistent homology demonstrates its effectiveness in capturing the topological structure formed by normal edge features. Additionally, ablation experiments validate the effectiveness of the innovative mechanisms integrated within PhoGAD.
摘要：从网络攻击到匿名流量和垃圾邮件，多种有毒在线行为严重扰乱了网络的平稳运行。由于网络行为固有的发送者-接收者性质，基于图的框架通常用于检测异常行为。然而，在现实场景中，正常行为和异常行为之间的界限往往是模糊的。图的局部异质性会干扰检测，并且现有基于节点或边的方法将不需要的噪声引入到表示结果中，从而影响检测的有效性。为了解决这些问题，我们提出了 PhoGAD，一种基于图的异常检测框架。 PhoGAD 利用持续同源优化来澄清行为边界。在此基础上，相邻边的权重被设计为减轻局部异质性的影响。随后，为了解决噪声问题，我们进行了形式化分析，并提出了一种基于解缠表示的显式嵌入方法，最终实现了异常行为检测。在入侵、流量和垃圾邮件数据集上的实验验证了 PhoGAD 在检测效率方面已经超越了最先进 (SOTA) 框架的性能。值得注意的是，即使异常比例减少，PhoGAD 也能表现出强大的检测能力，凸显了其对现实场景的适用性。持久同源性分析证明了其在捕获正常边缘特征形成的拓扑结构方面的有效性。此外，消融实验验证了 PhoGAD 中集成的创新机制的有效性。

Title: Unified View Imputation and Feature Selection Learning for Incomplete Multi-view Data

Authors: Yanyong Huang, Zongxin Shen, Tianrui Li, Fengmao Lv
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2401.10549
Pdf URL: https://arxiv.org/pdf/2401.10549
Copy Paste: [[2401.10549]] Unified View Imputation and Feature Selection Learning for Incomplete Multi-view Data(https://arxiv.org/abs/2401.10549)
Keywords: rag
Abstract: Although multi-view unsupervised feature selection (MUFS) is an effective technology for reducing dimensionality in machine learning, existing methods cannot directly deal with incomplete multi-view data where some samples are missing in certain views. These methods should first apply predetermined values to impute missing data, then perform feature selection on the complete dataset. Separating imputation and feature selection processes fails to capitalize on the potential synergy where local structural information gleaned from feature selection could guide the imputation, thereby improving the feature selection performance in turn. Additionally, previous methods only focus on leveraging samples' local structure information, while ignoring the intrinsic locality of the feature space. To tackle these problems, a novel MUFS method, called UNified view Imputation and Feature selectIon lEaRning (UNIFIER), is proposed. UNIFIER explores the local structure of multi-view data by adaptively learning similarity-induced graphs from both the sample and feature spaces. Then, UNIFIER dynamically recovers the missing views, guided by the sample and feature similarity graphs during the feature selection procedure. Furthermore, the half-quadratic minimization technique is used to automatically weight different instances, alleviating the impact of outliers and unreliable restored data. Comprehensive experimental results demonstrate that UNIFIER outperforms other state-of-the-art methods.
摘要：尽管多视图无监督特征选择（MUFS）是机器学习中降维的有效技术，但现有方法无法直接处理某些视图中丢失某些样本的不完整多视图数据。这些方法应首先应用预定值来估算缺失数据，然后对完整数据集执行特征选择。将插补和特征选择过程分开无法利用潜在的协同作用，从特征选择中收集的局部结构信息可以指导插补，从而反过来提高特征选择性能。此外，以前的方法只注重利用样本的局部结构信息，而忽略了特征空间的内在局部性。为了解决这些问题，提出了一种新颖的 MUFS 方法，称为统一视图插补和特征选择学习（UNIFIER）。 UNIFIER 通过从样本空间和特征空间自适应学习相似性诱导图来探索多视图数据的局部结构。然后，UNFIER 在特征选择过程中以样本和特征相似性图为指导，动态恢复丢失的视图。此外，使用半二次最小化技术自动对不同实例进行加权，减轻异常值和不可靠恢复数据的影响。综合实验结果表明 UNIFIER 优于其他最先进的方法。

Title: CivRealm: A Learning and Reasoning Odyssey in Civilization for Decision-Making Agents

Authors: Siyuan Qi, Shuo Chen, Yexin Li, Xiangyu Kong, Junqi Wang, Bangcheng Yang, Pring Wong, Yifan Zhong, Xiaoyuan Zhang, Zhaowei Zhang, Nian Liu, Wei Wang, Yaodong Yang, Song-Chun Zhu
Subjects: cs.AI
Abstract URL: https://arxiv.org/abs/2401.10568
Pdf URL: https://arxiv.org/pdf/2401.10568
Copy Paste: [[2401.10568]] CivRealm: A Learning and Reasoning Odyssey in Civilization for Decision-Making Agents(https://arxiv.org/abs/2401.10568)
Keywords: llm, code, agent
Abstract: The generalization of decision-making agents encompasses two fundamental elements: learning from past experiences and reasoning in novel contexts. However, the predominant emphasis in most interactive environments is on learning, often at the expense of complexity in reasoning. In this paper, we introduce CivRealm, an environment inspired by the Civilization game. Civilization's profound alignment with human history and society necessitates sophisticated learning, while its ever-changing situations demand strong reasoning to generalize. Particularly, CivRealm sets up an imperfect-information general-sum game with a changing number of players; it presents a plethora of complex features, challenging the agent to deal with open-ended stochastic environments that require diplomacy and negotiation skills. Within CivRealm, we provide interfaces for two typical agent types: tensor-based agents that focus on learning, and language-based agents that emphasize reasoning. To catalyze further research, we present initial results for both paradigms. The canonical RL-based agents exhibit reasonable performance in mini-games, whereas both RL- and LLM-based agents struggle to make substantial progress in the full game. Overall, CivRealm stands as a unique learning and reasoning challenge for decision-making agents. The code is available at https://github.com/bigai-ai/civrealm.
摘要：决策主体的泛化包含两个基本要素：从过去的经验中学习和在新的环境中进行推理。然而，大多数交互式环境的主要重点是学习，这通常会牺牲推理的复杂性。在本文中，我们介绍了 CivRealm，这是一个受《文明》游戏启发的环境。文明与人类历史和社会的深刻结合需要复杂的学习，而其不断变化的情况则需要强有力的推理来概括。特别是，CivRealm设置了一个不完全信息一般和博弈，参与者数量不断变化；它呈现出过多的复杂特征，挑战智能体应对需要外交和谈判技能的开放式随机环境。在 CivRealm 中，我们为两种典型的代理类型提供了接口：专注于学习的基于张量的代理，以及强调推理的基于语言的代理。为了促进进一步的研究，我们提出了这两种范式的初步结果。典型的基于 RL 的智能体在迷你游戏中表现出合理的性能，而基于 RL 和 LLM 的智能体都难以在完整的游戏中取得实质性进展。总体而言，CivRealm 对决策代理来说是一个独特的学习和推理挑战。代码可在 https://github.com/bigai-ai/civrealm 获取。

Title: PHOENIX: Open-Source Language Adaption for Direct Preference Optimization

Authors: Matthias Uhlig, Sigurd Schacht, Sudarshan Kamath Barkur
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2401.10580
Pdf URL: https://arxiv.org/pdf/2401.10580
Copy Paste: [[2401.10580]] PHOENIX: Open-Source Language Adaption for Direct Preference Optimization(https://arxiv.org/abs/2401.10580)
Keywords: language model
Abstract: Large language models have gained immense importance in recent years and have demonstrated outstanding results in solving various tasks. However, despite these achievements, many questions remain unanswered in the context of large language models. Besides the optimal use of the models for inference and the alignment of the results to the desired specifications, the transfer of models to other languages is still an underdeveloped area of research. The recent publication of models such as Llama-2 and Zephyr has provided new insights into architectural improvements and the use of human feedback. However, insights into adapting these techniques to other languages remain scarce. In this paper, we build on the latest improvements and apply the Direct Preference Optimization (DPO) approach to the German language. The model is available at https://huggingface.co/DRXD1000/Phoenix.
摘要：近年来，大型语言模型变得非常重要，并且在解决各种任务方面表现出了出色的成果。然而，尽管取得了这些成就，但在大型语言模型的背景下，许多问题仍然没有得到解答。除了优化使用模型进行推理以及将结果与所需规范保持一致之外，将模型转移到其他语言仍然是一个不发达的研究领域。最近发布的 Llama-2 和 Zephyr 等模型为架构改进和人类反馈的使用提供了新的见解。然而，将这些技术应用于其他语言的见解仍然很少。在本文中，我们基于最新的改进并将直接偏好优化（DPO）方法应用于德语。该模型可在 https://huggingface.co/DRXD1000/Phoenix 上获取。

Title: Polytopic Autoencoders with Smooth Clustering for Reduced-order Modelling of Flows

Authors: Jan Heiland, Yongho Kim
Subjects: cs.LG, cs.CV, math.DS
Abstract URL: https://arxiv.org/abs/2401.10620
Pdf URL: https://arxiv.org/pdf/2401.10620
Copy Paste: [[2401.10620]] Polytopic Autoencoders with Smooth Clustering for Reduced-order Modelling of Flows(https://arxiv.org/abs/2401.10620)
Keywords: code
Abstract: With the advancement of neural networks, there has been a notable increase, both in terms of quantity and variety, in research publications concerning the application of autoencoders to reduced-order models. We propose a polytopic autoencoder architecture that includes a lightweight nonlinear encoder, a convex combination decoder, and a smooth clustering network. Supported by several proofs, the model architecture ensures that all reconstructed states lie within a polytope, accompanied by a metric indicating the quality of the constructed polytopes, referred to as polytope error. Additionally, it offers a minimal number of convex coordinates for polytopic linear-parameter varying systems while achieving acceptable reconstruction errors compared to proper orthogonal decomposition (POD). To validate our proposed model, we conduct simulations involving two flow scenarios with the incompressible Navier-Stokes equation. Numerical results demonstrate the guaranteed properties of the model, low reconstruction errors compared to POD, and the improvement in error using a clustering network.
摘要：随着神经网络的进步，有关自动编码器在降阶模型中的应用的研究出版物在数量和种类上都有显着增加。我们提出了一种多面自动编码器架构，其中包括轻量级非线性编码器、凸组合解码器和平滑聚类网络。在多个证明的支持下，模型架构确保所有重建状态都位于多胞体内，并附有指示所构造多胞体质量的度量，称为多胞体误差。此外，与适当的正交分解 (POD) 相比，它为多面线性参数变化系统提供了最少数量的凸坐标，同时实现了可接受的重建误差。为了验证我们提出的模型，我们使用不可压缩的纳维-斯托克斯方程进行了涉及两种流动场景的模拟。数值结果证明了模型的保证特性、与 POD 相比较低的重建误差以及使用聚类网络的误差改进。
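The guarantee that reconstructed states stay inside a polytope follows from decoding with a convex combination. A minimal sketch, assuming softmax is used to obtain the convex coordinates (the paper's encoder may compute them differently) and using a toy triangle as the polytope:

```python
import math


def convex_coordinates(scores):
    """Map raw scores to nonnegative weights that sum to 1 (softmax, an assumption)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decode(weights, vertices):
    """Reconstruct a state as a convex combination of polytope vertices."""
    dim = len(vertices[0])
    return [sum(w * v[d] for w, v in zip(weights, vertices)) for d in range(dim)]


# Toy triangle with hypothetical vertices, not taken from the paper.
vertices = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
weights = convex_coordinates([2.0, 0.5, -1.0])
state = decode(weights, vertices)
```

Because the weights are nonnegative and sum to one, `state` cannot leave the triangle regardless of the input scores.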

Title: Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models

Authors: Rima Hazra, Sayan Layek, Somnath Banerjee, Soujanya Poria
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2401.10647
Pdf URL: https://arxiv.org/pdf/2401.10647
Copy Paste: [[2401.10647]] Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models(https://arxiv.org/abs/2401.10647)
Keywords: language model, llm
Abstract: In the rapidly advancing field of artificial intelligence, the concept of Red-Teaming or Jailbreaking large language models (LLMs) has emerged as a crucial area of study. This approach is especially significant in terms of assessing and enhancing the safety and robustness of these models. This paper investigates the intricate consequences of such modifications through model editing, uncovering a complex relationship between enhancing model accuracy and preserving its ethical integrity. Our in-depth analysis reveals a striking paradox: while injecting accurate information is crucial for model reliability, it can destabilize the model's foundational framework, resulting in unpredictable and potentially unsafe behaviors. Additionally, we propose a benchmark dataset NicheHazardQA to investigate this unsafe behavior both within the same topical domain and across topical domains. This aspect of our research sheds light on how the edits impact the model's safety metrics and guardrails. Our findings show that model editing serves as a cost-effective tool for topical red-teaming by methodically applying targeted edits and evaluating the resultant model behavior.
摘要：在快速发展的人工智能领域，红队或越狱大型语言模型（LLM）的概念已成为一个重要的研究领域。这种方法对于评估和增强这些模型的安全性和稳健性尤其重要。本文通过模型编辑研究了此类修改的复杂后果，揭示了提高模型准确性和维护其道德完整性之间的复杂关系。我们的深入分析揭示了一个惊人的悖论：虽然注入准确的信息对于模型可靠性至关重要，但它可能会矛盾地破坏模型的基础框架的稳定性，导致不可预测和潜在的不安全行为。此外，我们提出了一个基准数据集 NicheHazardQA 来调查同一和跨主题领域内的这种不安全行为。我们研究的这方面揭示了编辑如何影响模型的安全指标和护栏。我们的研究结果表明，通过有条不紊地应用有针对性的编辑并评估由此产生的模型行为，模型编辑可以成为主题红队的一种经济高效的工具

Title: FIMBA: Evaluating the Robustness of AI in Genomics via Feature Importance Adversarial Attacks

Authors: Heorhii Skovorodnikov, Hoda Alkhzaimi
Subjects: cs.LG, cs.CR, q-bio.GN
Abstract URL: https://arxiv.org/abs/2401.10657
Pdf URL: https://arxiv.org/pdf/2401.10657
Copy Paste: [[2401.10657]] FIMBA: Evaluating the Robustness of AI in Genomics via Feature Importance Adversarial Attacks(https://arxiv.org/abs/2401.10657)
Keywords: code
Abstract: With the steady rise of the use of AI in bio-technical applications and the widespread adoption of genomics sequencing, an increasing number of AI-based algorithms and tools are entering the research and production stage, affecting critical decision-making streams like drug discovery and clinical outcomes. This paper demonstrates the vulnerability of AI models often utilized in downstream tasks on recognized public genomics datasets. We undermine model robustness by deploying an attack that focuses on input transformation while mimicking the real data and confusing the model decision-making, ultimately yielding a pronounced deterioration in model performance. Further, we enhance our approach by generating poisoned data using a variational autoencoder-based model. Our empirical findings unequivocally demonstrate a decline in model performance, underscored by diminished accuracy and an upswing in false positives and false negatives. Furthermore, we analyze the resulting adversarial samples via spectral analysis, yielding conclusions for countermeasures against such attacks.
摘要：随着人工智能在生物技术应用中的稳步增长以及基因组测序的广泛采用，越来越多的基于人工智能的算法和工具正在进入研究和生产阶段，影响药物发现和临床等关键决策流程结果。本文展示了经常在公认的公共基因组数据集上使用下游任务的人工智能模型的脆弱性。我们通过部署专注于输入转换的攻击来破坏模型的鲁棒性，同时模仿真实数据并混淆模型决策，最终导致模型性能明显恶化。此外，我们通过使用基于变分自动编码器的模型生成中毒数据来增强我们的方法。我们的实证研究结果明确表明模型性能下降，准确性下降以及假阳性和假阴性的上升。此外，我们通过频谱分析来分析生成的对抗样本，得出针对此类攻击的对策的结论。

Title: A Simple Framework to Accelerate Multilingual Language Model for Monolingual Text Generation

Authors: Jimin Hong, Gibbeum Lee, Jaewoong Cho
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2401.10660
Pdf URL: https://arxiv.org/pdf/2401.10660
Copy Paste: [[2401.10660]] A Simple Framework to Accelerate Multilingual Language Model for Monolingual Text Generation(https://arxiv.org/abs/2401.10660)
Keywords: language model, code, rag
Abstract: Recent advancements in large language models have facilitated the execution of complex language tasks, not only in English but also in non-English languages. However, the tokenizers of most language models, such as Llama, trained on English-centric corpora, tend to excessively fragment tokens in non-English languages. This issue is especially pronounced in non-Roman alphabetic languages, which are often divided at a character or even Unicode level, leading to slower text generation. To address this, our study introduces a novel framework designed to expedite text generation in these languages. This framework predicts larger linguistic units than those of conventional multilingual tokenizers and is specifically tailored to the target language, thereby reducing the number of decoding steps required. Our empirical results demonstrate that the proposed framework increases the generation speed by a factor of 1.9 compared to standard decoding while maintaining the performance of a pre-trained multilingual model on monolingual tasks.
摘要：大型语言模型的最新进展促进了复杂语言任务的执行，不仅是英语，而且是非英语语言。然而，大多数语言模型的分词器（例如在以英语为中心的语料库上训练的 Llama）往往会过度分割非英语语言中的分词。这个问题在非罗马字母语言中尤其明显，这些语言通常在字符甚至 Unicode 级别上进行划分，从而导致文本生成速度变慢。为了解决这个问题，我们的研究引入了一种新颖的框架，旨在加速这些语言的文本生成。该框架可以预测比传统多语言分词器更大的语言单元，并且专门针对目标语言进行定制，从而减少了所需的解码步骤数量。我们的实证结果表明，与标准解码相比，所提出的框架将生成速度提高了 1.9 倍，同时保持了预训练多语言模型在单语言任务上的性能。
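The link between token fragmentation and generation speed can be seen by counting how many decoding steps a text needs when each step emits one unit. The sketch below is an illustration only, not the paper's framework: it compares UTF-8 bytes (the worst case of byte-level fallback), characters, and whitespace-separated words. The function name `decoding_steps` and the sample strings are assumptions.

```python
def decoding_steps(text, unit):
    """Count generation steps if each step emits exactly one unit of `text`."""
    if unit == "byte":
        return len(text.encode("utf-8"))
    if unit == "char":
        return len(text)
    if unit == "word":
        return len(text.split())
    raise ValueError(f"unknown unit: {unit}")


korean = "안녕하세요 세계"  # illustrative sample, not from the paper
english = "hello world"
for label, text in (("korean", korean), ("english", english)):
    counts = {u: decoding_steps(text, u) for u in ("byte", "char", "word")}
    print(label, counts)
```

Each Hangul syllable occupies three bytes in UTF-8, so byte-level decoding of the Korean sample takes far more steps than decoding the English sample of similar meaning, while word-level units need the same number of steps for both.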

Title: Beyond RMSE and MAE: Introducing EAUC to unmask hidden bias and unfairness in dyadic regression models

Authors: Jorge Paz-Ruza, Amparo Alonso-Betanzos, Bertha Guijarro-Berdiñas, Brais Cancela, Carlos Eiras-Franco
Subjects: cs.LG, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2401.10690
Pdf URL: https://arxiv.org/pdf/2401.10690
Copy Paste: [[2401.10690]] Beyond RMSE and MAE: Introducing EAUC to unmask hidden bias and unfairness in dyadic regression models(https://arxiv.org/abs/2401.10690)
Keywords: lora, rag
Abstract: Dyadic regression models, which predict real-valued outcomes for pairs of entities, are fundamental in many domains (e.g. predicting the rating of a user to a product in Recommender Systems) and promising and under exploration in many others (e.g. approximating the adequate dosage of a drug for a patient in personalized pharmacology). In this work, we demonstrate that non-uniformity in the observed value distributions of individual entities leads to severely biased predictions in state-of-the-art models, skewing predictions towards the average of observed past values for the entity and providing worse-than-random predictive power in eccentric yet equally important cases. We show that the usage of global error metrics like Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) is insufficient to capture this phenomenon, which we name eccentricity bias, and we introduce Eccentricity-Area Under the Curve (EAUC) as a new complementary metric that can quantify it in all studied models and datasets. We also prove the adequateness of EAUC by using naive de-biasing corrections to demonstrate that a lower model bias correlates with a lower EAUC and vice-versa. This work contributes a bias-aware evaluation of dyadic regression models to avoid potential unfairness and risks in critical real-world applications of such systems.
摘要：二元回归模型可以预测实体对的实际值结果，在许多领域中都是基础（例如，在推荐系统中预测用户对产品的评分），并且在许多其他领域中很有前景并正在探索中（例如，近似计算适当的剂量）个性化药理学中用于患者的药物）。在这项工作中，我们证明了单个实体观测值分布的不均匀性会导致最先进模型中的预测出现严重偏差，使预测偏向实体观测到的过去值的平均值，并提供比- 在古怪但同样重要的情况下的随机预测能力。我们表明，使用均方根误差 (RMSE) 和平均绝对误差 (MAE) 等全局误差指标不足以捕捉这种现象，我们将其称为偏心率偏差，并且我们将偏心率曲线下面积 (EAUC) 引入为一个新的补充指标，可以在所有研究的模型和数据集中对其进行量化。我们还通过使用朴素去偏差校正来证明 EAUC 的充分性，以证明较低的模型偏差与较低的 EAUC 相关，反之亦然。这项工作有助于对二元回归模型进行偏差感知评估，以避免此类系统的关键现实应用中潜在的不公平和风险。

Title: LangBridge: Multilingual Reasoning Without Multilingual Supervision

Authors: Dongkeun Yoon, Joel Jang, Sungdong Kim, Seungone Kim, Sheikh Shafayat, Minjoon Seo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2401.10695
Pdf URL: https://arxiv.org/pdf/2401.10695
Copy Paste: [[2401.10695]] LangBridge: Multilingual Reasoning Without Multilingual Supervision(https://arxiv.org/abs/2401.10695)
Keywords: language model, code
Abstract: We introduce LangBridge, a zero-shot approach to adapt language models for multilingual reasoning tasks without multilingual supervision. LangBridge operates by bridging two models, each specialized in different aspects: (1) one specialized in understanding multiple languages (e.g., mT5 encoder) and (2) one specialized in reasoning (e.g., Orca 2). LangBridge connects the two models by introducing minimal trainable parameters between them. Despite utilizing only English data for training, LangBridge considerably enhances the performance of language models on low-resource languages across mathematical reasoning, coding, and logical reasoning. Our analysis suggests that the efficacy of LangBridge stems from the language-agnostic characteristics of multilingual representations. We publicly release our code and models.
摘要：我们引入了 LangBridge，一种零样本方法，可以在没有多语言监督的情况下适应多语言推理任务的语言模型。 LangBridge 通过桥接两个模型来运行，每个模型专门从事不同的方面：(1) 一个专门负责理解多种语言（例如 mT5 编码器），(2) 一个专门负责推理（例如 Orca 2）。 LangBridge 通过在两个模型之间引入最小的可训练参数来连接这两个模型。尽管仅使用英语数据进行训练，但 LangBridge 显着增强了低资源语言的语言模型在数学推理、编码和逻辑推理方面的性能。我们的分析表明，LangBridge 的功效源于多语言表示的与语言无关的特征。我们公开发布我们的代码和模型。

Title: Structured Code Representations Enable Data-Efficient Adaptation of Code Language Models

Authors: Mayank Agarwal, Yikang Shen, Bailin Wang, Yoon Kim, Jie Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2401.10716
Pdf URL: https://arxiv.org/pdf/2401.10716
Copy Paste: [[2401.10716]] Structured Code Representations Enable Data-Efficient Adaptation of Code Language Models(https://arxiv.org/abs/2401.10716)
Keywords: language model, code
Abstract: Current language models tailored for code tasks often adopt the pre-training-then-fine-tuning paradigm from natural language processing, modeling source code as plain text. This approach, however, overlooks the unambiguous structures inherent in programming languages. In this work, we explore data-efficient adaptation of pre-trained code models by further pre-training and fine-tuning them with program structures. Specifically, we represent programs as parse trees -- also known as concrete syntax trees (CSTs) -- and adapt pre-trained models on serialized CSTs. Although the models that we adapt have been pre-trained only on the surface form of programs, we find that a small amount of continual pre-training and fine-tuning on CSTs without changing the model architecture yields improvements over the baseline approach across various code tasks. The improvements are found to be particularly significant when there are limited training examples, demonstrating the effectiveness of integrating program structures with plain-text representation even when working with backbone models that have not been pre-trained with structures.
摘要：当前为代码任务定制的语言模型通常采用自然语言处理的预训练然后微调范例，将源代码建模为纯文本。然而，这种方法忽略了编程语言固有的明确结构。在这项工作中，我们通过进一步的预训练和程序结构的微调来探索预训练代码模型的数据高效适应。具体来说，我们将程序表示为解析树（也称为具体语法树 (CST)），并在序列化 CST 上调整预训练模型。尽管我们采用的模型仅在程序的表面形式上进行了预训练，但我们发现，在不改变模型架构的情况下对 CST 进行少量的持续预训练和微调，可以在各种代码中比基准方法产生改进任务。当训练示例有限时，这些改进尤其显着，证明了将程序结构与纯文本表示相集成的有效性，即使在使用未经结构预训练的骨干模型时也是如此。

Title: FinLLMs: A Framework for Financial Reasoning Dataset Generation with Large Language Models

Authors: Ziqiang Yuan, Kaiyuan Wang, Shoutai Zhu, Ye Yuan, Jingya Zhou, Yanlin Zhu, Wenqi Wei
Subjects: cs.AI
Abstract URL: https://arxiv.org/abs/2401.10744
Pdf URL: https://arxiv.org/pdf/2401.10744
Copy Paste: [[2401.10744]] FinLLMs: A Framework for Financial Reasoning Dataset Generation with Large Language Models(https://arxiv.org/abs/2401.10744)
Keywords: language model, gpt, llm
Abstract: Large Language models (LLMs) usually rely on extensive training datasets. In the financial domain, creating numerical reasoning datasets that include a mix of tables and long text often involves substantial manual annotation expenses. To address the limited data resources and reduce the annotation cost, we introduce FinLLMs, a method for generating financial question-answering data based on common financial formulas using Large Language Models. First, we compile a list of common financial formulas and construct a graph based on the variables these formulas employ. We then augment the formula set by combining those that share identical variables as new elements. Specifically, we explore formulas obtained by manual annotation and merge those formulas with shared variables by traversing the constructed graph. Finally, utilizing GPT-3.5, we generate financial question-answering data that encompasses both tabular information and long textual content, building on the collected formula set. Our experiments demonstrate that synthetic data generated by FinLLMs effectively enhances the performance of several large-scale numerical reasoning models in the financial domain, outperforming two established benchmark financial question-answering datasets.
摘要：大型语言模型 (LLM) 通常依赖于广泛的训练数据集。在金融领域，创建包含表格和长文本混合的数值推理数据集通常需要大量的手动注释费用。为了解决数据资源有限的问题并降低标注成本，我们引入了 FinLLM，这是一种使用大型语言模型基于常见金融公式生成金融问答数据的方法。首先，我们列出了常见的金融公式，并根据这些公式使用的变量构建了一个图表。然后，我们通过将共享相同变量的变量组合为新元素来扩充公式集。具体来说，我们探索通过手动注释获得的公式，并通过遍历构造的图将这些公式与共享变量合并。最后，利用 GPT-3.5，我们在收集的公式集的基础上生成包含表格信息和长文本内容的金融问答数据。我们的实验表明，FinLLM 生成的合成数据有效增强了金融领域多个大规模数值推理模型的性能，优于两个已建立的基准金融问答数据集。

Title: Mitigating Hallucinations of Large Language Models via Knowledge Consistent Alignment

Authors: Fanqi Wan, Xinting Huang, Leyang Cui, Xiaojun Quan, Wei Bi, Shuming Shi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2401.10768
Pdf URL: https://arxiv.org/pdf/2401.10768
Copy Paste: [[2401.10768]] Mitigating Hallucinations of Large Language Models via Knowledge Consistent Alignment(https://arxiv.org/abs/2401.10768)
Keywords: language model, llm, hallucination, code
Abstract: While Large Language Models (LLMs) have proven to be exceptional on a variety of tasks after alignment, they may still confidently produce responses that contradict the context or world knowledge, a phenomenon known as "hallucination". In this paper, we demonstrate that reducing the inconsistency between the external knowledge encapsulated in the training data and the intrinsic knowledge inherited in the pretraining corpus could mitigate hallucination in alignment. Specifically, we introduce a novel knowledge consistent alignment (KCA) approach, which involves automatically formulating examinations based on external knowledge for assessing the comprehension of LLMs. For data encompassing knowledge inconsistency, KCA implements several simple yet efficient strategies for processing. We illustrate the superior performance of the proposed KCA approach in mitigating hallucinations across six benchmarks using LLMs of different backbones and scales. Furthermore, we confirm the correlation between knowledge inconsistency and hallucination, signifying the effectiveness of reducing knowledge inconsistency in alleviating hallucinations. Our code, model weights, and data are public at https://github.com/fanqiwan/KCA.
摘要：虽然大型语言模型 (LLM) 已被证明在对齐后在各种任务上表现出色，但它们仍然可能会自信地产生与上下文或世界知识相矛盾的响应，这种现象称为“幻觉”。在本文中，我们证明，减少训练数据中封装的外部知识与预训练语料库中继承的内在知识之间的不一致可以减轻对齐中的幻觉。具体来说，我们引入了一种新颖的知识一致性对齐（KCA）方法，该方法涉及根据外部知识自动制定考试，以获取法学硕士的理解。对于包含知识不一致的数据，KCA 实施了几种简单而有效的处理策略。我们使用不同骨干和规模的法学硕士在六个基准上展示了所提出的 KCA 方法在减轻幻觉方面的卓越性能。此外，我们确认了知识不一致与幻觉之间的相关性，表明减少知识不一致对于减轻幻觉是有效的。我们的代码、模型权重和数据在 \url{https://github.com/fanqiwan/KCA} 上公开。

Title: Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Authors: Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2401.10774
Pdf URL: https://arxiv.org/pdf/2401.10774
Copy Paste: [[2401.10774]] Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads(https://arxiv.org/abs/2401.10774)
Keywords: language model, llm, rag
Abstract: The inference process in Large Language Models (LLMs) is often limited due to the absence of parallelism in the auto-regressive decoding process, resulting in most operations being restricted by the memory bandwidth of accelerators. While methods such as speculative decoding have been suggested to address this issue, their implementation is impeded by the challenges associated with acquiring and maintaining a separate draft model. In this paper, we present Medusa, an efficient method that augments LLM inference by adding extra decoding heads to predict multiple subsequent tokens in parallel. Using a tree-based attention mechanism, Medusa constructs multiple candidate continuations and verifies them simultaneously in each decoding step. By leveraging parallel processing, Medusa introduces only minimal overhead in terms of single-step latency while substantially reducing the number of decoding steps required. We present two levels of fine-tuning procedures for Medusa to meet the needs of different use cases: Medusa-1: Medusa is directly fine-tuned on top of a frozen backbone LLM, enabling lossless inference acceleration. Medusa-2: Medusa is fine-tuned together with the backbone LLM, enabling better prediction accuracy of Medusa heads and higher speedup but needing a special training recipe that preserves the backbone model's capabilities. Moreover, we propose several extensions that improve or expand the utility of Medusa, including a self-distillation to handle situations where no training data is available and a typical acceptance scheme to boost the acceptance rate while maintaining generation quality. We evaluate Medusa on models of various sizes and training procedures. Our experiments demonstrate that Medusa-1 can achieve over 2.2x speedup without compromising generation quality, while Medusa-2 further improves the speedup to 2.3-3.6x.
摘要：由于自回归解码过程中缺乏并行性，大型语言模型（LLM）中的推理过程通常受到限制，导致大多数操作受到加速器内存带宽的限制。虽然已经建议使用推测解码等方法来解决这个问题，但它们的实施受到与获取和维护单独的草稿模型相关的挑战的阻碍。在本文中，我们提出了 Medusa，一种有效的方法，通过添加额外的解码头来并行预测多个后续标记，从而增强 LLM 推理。 Medusa 使用基于树的注意力机制构建多个候选延续，并在每个解码步骤中同时验证它们。通过利用并行处理，Medusa 仅在单步延迟方面引入了最小的开销，同时大大减少了所需的解码步骤数。我们为 Medusa 提供了两个级别的微调程序，以满足不同用例的需求： Medusa-1：Medusa 直接在冻结骨干 LLM 之上进行微调，从而实现无损推理加速。 Medusa-2：Medusa 与主干 LLM 一起进行微调，可以实现更好的 Medusa 头部预测精度和更高的加速，但需要特殊的训练方法来保留主干模型的功能。此外，我们提出了几种改进或扩展 Medusa 实用性的扩展，包括用于处理没有可用训练数据的情况的自蒸馏，以及用于在保持生成质量的同时提高接受率的典型接受方案。我们在不同尺寸和训练程序的模型上评估美杜莎。我们的实验表明，Medusa-1 可以在不影响生成质量的情况下实现超过 2.2 倍的加速，而 Medusa-2 进一步将加速提高到 2.3-3.6 倍。
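The propose-then-verify loop behind the reduced step count can be sketched with a toy model. Everything below is a hypothetical stand-in: `base_next` plays the backbone's greedy next-token choice, `propose` plays the extra heads (which deliberately err at every third position), and verification is done token by token rather than in one pass with tree attention as Medusa describes.

```python
def base_next(prefix):
    """Toy stand-in for the backbone LLM's greedy next token."""
    return (prefix[-1] * 3 + 1) % 7


def propose(prefix, k):
    """Return the backbone's next token plus k head guesses for later positions."""
    tokens = [base_next(prefix)]
    for _ in range(k):
        truth = base_next(prefix + tokens)
        # Hypothetical imperfection: a head errs at positions divisible by 3.
        wrong = (len(prefix) + len(tokens)) % 3 == 0
        tokens.append((truth + 1) % 7 if wrong else truth)
    return tokens


def verify(prefix, candidate):
    """Accept the longest candidate prefix that matches the backbone's greedy choices."""
    accepted = []
    for tok in candidate:
        if tok != base_next(prefix + accepted):
            break
        accepted.append(tok)
    return accepted


def generate(start, n_tokens, k):
    """Generate n_tokens after `start`; return the tokens and the number of steps."""
    seq, steps = [start], 0
    while len(seq) - 1 < n_tokens:
        seq += verify(seq, propose(seq, k))
        steps += 1
    return seq[1:n_tokens + 1], steps


plain_tokens, plain_steps = generate(1, 12, k=0)
fast_tokens, fast_steps = generate(1, 12, k=3)
```

With `k=0` the loop degenerates to ordinary one-token-per-step decoding. With `k=3` the output is identical, since only tokens the backbone would have chosen are accepted, but fewer steps are needed.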

Title: Novel Representation Learning Technique using Graphs for Performance Analytics

Authors: Tarek Ramadan, Ankur Lahiry, Tanzima Z. Islam
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2401.10799
Pdf URL: https://arxiv.org/pdf/2401.10799
Copy Paste: [[2401.10799]] Novel Representation Learning Technique using Graphs for Performance Analytics(https://arxiv.org/abs/2401.10799)
Keywords: rag
Abstract: The performance analytics domain in High Performance Computing (HPC) uses tabular data to solve regression problems, such as predicting the execution time. Existing Machine Learning (ML) techniques leverage the correlations among features given tabular datasets, not leveraging the relationships between samples directly. Moreover, since high-quality embeddings from raw features improve the fidelity of the downstream predictive models, existing methods rely on extensive feature engineering and pre-processing steps, costing time and manual effort. To fill these two gaps, we propose a novel idea of transforming tabular performance data into graphs to leverage the advancement of Graph Neural Network-based (GNN) techniques in capturing complex relationships between features and samples. In contrast to other ML application domains, such as social networks, the graph is not given; instead, we need to build it. To address this gap, we propose graph-building methods where nodes represent samples, and the edges are automatically inferred iteratively based on the similarity between the features in the samples. We evaluate the effectiveness of the generated embeddings from GNNs based on how well they make even a simple feed-forward neural network perform for regression tasks compared to other state-of-the-art representation learning techniques. Our evaluation demonstrates that even with up to 25% random missing values for each dataset, our method outperforms commonly used graph and Deep Neural Network (DNN)-based approaches and achieves up to 61.67% and 78.56% improvement in MSE loss over the DNN baseline for the HPC dataset and the machine learning datasets, respectively.
摘要：高性能计算 (HPC) 中的性能分析领域使用表格数据来解决回归问题，例如预测执行时间。现有的机器学习 (ML) 技术利用给定表格数据集的特征之间的相关性，而不是直接利用样本之间的关系。此外，由于原始特征的高质量嵌入提高了下游预测模型的保真度，因此现有方法依赖于广泛的特征工程和预处理步骤，从而耗费时间和人力。为了填补这两个空白，我们提出了一种将表格性能数据转换为图形的新想法，以利用基于图神经网络（GNN）技术的进步来捕获特征和样本之间的复杂关系。与其他 ML 应用领域（例如社交网络）相比，没有给出图表；相反，我们需要构建它。为了解决这个问题，我们提出了图构建方法，其中节点代表样本，并且根据样本中特征之间的相似性自动迭代地推断边。我们评估 GNN 生成的嵌入的有效性，基于与其他最先进的表示学习技术相比，即使是简单的前馈神经网络在回归任务中的执行情况。我们的评估表明，即使每个数据集的随机缺失值高达 25%，我们的方法也优于常用的基于图和深度神经网络 (DNN) 的方法，并且与 DNN 基线相比，MSE 损失分别提高了 61.67% 和 78.56%分别针对 HPC 数据集和机器学习数据集。
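One simple way to turn tabular samples into a graph is to link each sample to its nearest neighbors in feature space. The sketch below uses a one-shot k-nearest-neighbor rule with Euclidean distance; this is an assumed simplification, since the abstract describes an iterative edge-inference procedure whose details are not given here. The function name `knn_edges` and the toy data are hypothetical.

```python
import math


def knn_edges(samples, k):
    """Connect each sample (node) to its k nearest samples by Euclidean distance."""
    edges = set()
    for i, a in enumerate(samples):
        dists = sorted(
            (math.dist(a, b), j) for j, b in enumerate(samples) if j != i
        )
        for _, j in dists[:k]:
            edges.add((min(i, j), max(i, j)))  # store each undirected edge once
    return sorted(edges)


# Toy feature rows forming two tight pairs; values are illustrative only.
samples = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
edges = knn_edges(samples, k=1)
```

The resulting edge list can then be handed to a GNN library of choice together with the feature rows as node attributes.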

Title: Neglected Hessian component explains mysteries in Sharpness regularization

Authors: Yann N. Dauphin, Atish Agarwala, Hossein Mobahi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2401.10809
Pdf URL: https://arxiv.org/pdf/2401.10809
Copy Paste: [[2401.10809]] Neglected Hessian component explains mysteries in Sharpness regularization(https://arxiv.org/abs/2401.10809)
Keywords: lora
Abstract: Recent work has shown that methods like SAM which either explicitly or implicitly penalize second order information can improve generalization in deep learning. Seemingly similar methods like weight noise and gradient penalties often fail to provide such benefits. We show that these differences can be explained by the structure of the Hessian of the loss. First, we show that a common decomposition of the Hessian can be quantitatively interpreted as separating the feature exploitation from feature exploration. The feature exploration, which can be described by the Nonlinear Modeling Error matrix (NME), is commonly neglected in the literature since it vanishes at interpolation. Our work shows that the NME is in fact important as it can explain why gradient penalties are sensitive to the choice of activation function. Using this insight we design interventions to improve performance. We also provide evidence that challenges the long held equivalence of weight noise and gradient penalties. This equivalence relies on the assumption that the NME can be ignored, which we find does not hold for modern networks since they involve significant feature learning. We find that regularizing feature exploitation but not feature exploration yields performance similar to gradient penalties.
摘要：最近的研究表明，像 SAM 这样显式或隐式惩罚二阶信息的方法可以提高深度学习的泛化能力。看似相似的方法，如权重噪声和梯度惩罚，通常无法提供这样的好处。我们证明这些差异可以通过损失的 Hessian 矩阵的结构来解释。首先，我们证明 Hessian 矩阵的常见分解可以定量地解释为将特征利用与特征探索分开。特征探索可以通过非线性建模误差矩阵（NME）来描述，但在文献中通常被忽略，因为它在插值时消失。我们的工作表明，NME 实际上很重要，因为它可以解释为什么梯度惩罚对激活函数的选择敏感。利用这种洞察力，我们设计干预措施来提高绩效。我们还提供了证据来挑战长期以来权重噪声和梯度惩罚的等价性。这种等价依赖于 NME 可以被忽略的假设，我们发现这一假设不适用于现代网络，因为它们涉及重要的特征学习。我们发现正则化特征利用而不是特征探索会产生类似于梯度惩罚的性能。
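The decomposition referred to in this abstract is commonly written as the Gauss-Newton split of the loss Hessian. As a sketch, assume a model with outputs $z(\theta)$, Jacobian $J = \partial z / \partial \theta$, and loss $\mathcal{L}(z)$; the notation is chosen here for illustration and may differ from the paper's.

```latex
\nabla_\theta^2 \mathcal{L}
  = \underbrace{J^\top \left( \nabla_z^2 \mathcal{L} \right) J}_{\text{Gauss-Newton term}}
  + \underbrace{\sum_i \frac{\partial \mathcal{L}}{\partial z_i} \, \nabla_\theta^2 z_i}_{\text{NME term}}
```

The second term is weighted by the loss gradient with respect to the outputs, so it is zero wherever $\partial \mathcal{L} / \partial z_i = 0$ for all $i$, which is consistent with the abstract's remark that the NME vanishes at interpolation.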

Title: Optimisation in Neurosymbolic Learning Systems

Authors: Emile van Krieken
Subjects: cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2401.10819
Pdf URL: https://arxiv.org/pdf/2401.10819
Copy Paste: [[2401.10819]] Optimisation in Neurosymbolic Learning Systems(https://arxiv.org/abs/2401.10819)
Keywords: rag
Abstract: Neurosymbolic AI aims to integrate deep learning with symbolic AI. This integration has many promises, such as decreasing the amount of data required to train a neural network, improving the explainability and interpretability of answers given by models and verifying the correctness of trained systems. We study neurosymbolic learning, where we have both data and background knowledge expressed using symbolic languages. How do we connect the symbolic and neural components to communicate this knowledge? One option is fuzzy reasoning, which studies degrees of truth. For example, being tall is not a binary concept. Instead, probabilistic reasoning studies the probability that something is true or will happen. Our first research question studies how different forms of fuzzy reasoning combine with learning. We find surprising results like a connection to the Raven paradox stating we confirm "ravens are black" when we observe a green apple. In this study, we did not use the background knowledge when we deployed our models after training. In our second research question, we studied how to use background knowledge in deployed models. We developed a new neural network layer based on fuzzy reasoning. Probabilistic reasoning is a natural fit for neural networks, which we usually train to be probabilistic. However, they are expensive to compute and do not scale well to large tasks. In our third research question, we study how to connect probabilistic reasoning with neural networks by sampling to estimate averages, while in the final research question, we study scaling probabilistic neurosymbolic learning to much larger problems than before. Our insight is to train a neural network with synthetic data to predict the result of probabilistic reasoning.
摘要：神经符号人工智能旨在将深度学习与符号人工智能相结合。这种集成有很多前景，例如减少训练神经网络所需的数据量、提高模型给出的答案的可解释性和可解释性以及验证训练系统的正确性。我们研究神经符号学习，其中我们拥有使用符号语言表达的数据和背景知识。我们如何连接符号和神经组件来传达这些知识？一种选择是模糊推理，它研究真实程度。例如，身高并不是一个二元概念。相反，概率推理研究某件事为真或将要发生的概率。我们的第一个研究问题研究不同形式的模糊推理如何与学习相结合。我们发现了令人惊讶的结果，例如与乌鸦悖论的联系，即当我们观察青苹果时，我们确认“乌鸦是黑色的”。在本研究中，我们在训练后部署模型时没有使用背景知识。在我们的第二个研究问题中，我们研究了如何在部署的模型中使用背景知识。我们开发了一个基于模糊推理的新神经网络层。概率推理非常适合神经网络，我们通常将其训练为概率性的。然而，它们的计算成本很高，并且不能很好地扩展到大型任务。在我们的第三个研究问题中，我们研究如何通过采样来估计平均值，将概率推理与神经网络联系起来，而在最后一个研究问题中，我们研究将概率神经符号学习扩展到比以前更大的问题。我们的见解是使用合成数据训练神经网络来预测概率推理的结果。
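The Raven paradox mentioned in this abstract can be reproduced with a standard fuzzy implication operator. The sketch below evaluates the rule "raven implies black" with the Reichenbach and Kleene-Dienes implications; the choice of operators and the toy truth degrees are assumptions for illustration, not necessarily those studied in the thesis.

```python
def reichenbach_implies(a, b):
    """Fuzzy implication a -> b under the Reichenbach operator: 1 - a + a*b."""
    return 1.0 - a + a * b


def kleene_dienes_implies(a, b):
    """Fuzzy implication a -> b under Kleene-Dienes: max(1 - a, b)."""
    return max(1.0 - a, b)


# Truth degrees for ("is a raven", "is black"); toy values, not from the thesis.
black_raven = (1.0, 1.0)
green_apple = (0.0, 0.0)
white_raven = (1.0, 0.0)

for name, obs in (("black raven", black_raven),
                  ("green apple", green_apple),
                  ("white raven", white_raven)):
    print(name, reichenbach_implies(*obs), kleene_dienes_implies(*obs))
```

Under both operators the green apple satisfies the rule as fully as the black raven does, so an objective that rewards rule satisfaction treats a non-raven as confirming evidence.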

Title: A survey on recent advances in named entity recognition

Authors: Imed Keraghel, Stanislas Morbieu, Mohamed Nadif
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2401.10825
Pdf URL: https://arxiv.org/pdf/2401.10825
Copy Paste: [[2401.10825]] A survey on recent advances in named entity recognition(https://arxiv.org/abs/2401.10825)
Keywords: language model, llm, rag
Abstract: Named Entity Recognition seeks to extract substrings within a text that name real-world objects and to determine their type (for example, whether they refer to persons or organizations). In this survey, we first present an overview of recent popular approaches, but we also look at graph- and transformer-based methods including Large Language Models (LLMs) that have not had much coverage in other surveys. Second, we focus on methods designed for datasets with scarce annotations. Third, we evaluate the performance of the main NER implementations on a variety of datasets with differing characteristics (as regards their domain, their size, and their number of classes). We thus provide a deep comparison of algorithms that are never considered together. Our experiments shed some light on how the characteristics of datasets affect the behavior of the methods that we compare.
摘要：命名实体识别旨在提取文本中命名现实世界对象的子字符串并确定它们的类型（例如，它们是否指人或组织）。在本次调查中，我们首先概述了最近流行的方法，但我们也研究了基于图和变压器的方法，包括大型语言模型（LLM），这些方法在其他调查中没有太多报道。其次，我们专注于为注释稀缺的数据集设计的方法。第三，我们评估了主要 NER 实现在具有不同特征（关于其域、大小和类数量）的各种数据集上的性能。因此，我们对从未一起考虑的算法进行了深入比较。我们的实验揭示了数据集的特征如何影响我们比较的方法的行为。

Title: Using LLMs to discover emerging coded antisemitic hate-speech emergence in extremist social media

Authors: Dhanush Kikkisetti, Raza Ul Mustafa, Wendy Melillo, Roberto Corizzo, Zois Boukouvalas, Jeff Gill, Nathalie Japkowicz
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2401.10841
Pdf URL: https://arxiv.org/pdf/2401.10841
Copy Paste: [[2401.10841]] Using LLMs to discover emerging coded antisemitic hate-speech emergence in extremist social media(https://arxiv.org/abs/2401.10841)
Keywords: language model, llm, code
Abstract: Online hate speech proliferation has created a difficult problem for social media platforms. A particular challenge relates to the use of coded language by groups interested in both creating a sense of belonging for its users and evading detection. Coded language evolves quickly and its use varies over time. This paper proposes a methodology for detecting emerging coded hate-laden terminology. The methodology is tested in the context of online antisemitic discourse. The approach considers posts scraped from social media platforms, often used by extremist users. The posts are scraped using seed expressions related to previously known discourse of hatred towards Jews. The method begins by identifying the expressions most representative of each post and calculating their frequency in the whole corpus. It filters out grammatically incoherent expressions as well as previously encountered ones so as to focus on emergent well-formed terminology. This is followed by an assessment of semantic similarity to known antisemitic terminology using a fine-tuned large language model, and subsequent filtering out of the expressions that are too distant from known expressions of hatred. Emergent antisemitic expressions containing terms clearly relating to Jewish topics are then removed to return only coded expressions of hatred.
摘要：网上仇恨言论的泛滥给社交媒体平台带来了一个难题。一个特殊的挑战涉及到对为用户创造归属感和逃避检测感兴趣的团体对编码语言的使用。编码语言发展迅速，其用途随着时间的推移而变化。本文提出了一种检测新兴编码仇恨术语的方法。该方法在网上反犹太主义言论的背景下进行了测试。该方法考虑从极端主义用户经常使用的社交媒体平台上抓取的帖子。这些帖子是使用与先前已知的仇恨犹太人言论相关的种子表达方式进行抓取的。该方法首先识别每个帖子中最有代表性的表达方式，并计算它们在整个语料库中的频率。它过滤掉语法不连贯的表达以及以前遇到过的表达，以便专注于新出现的格式良好的术语。接下来是使用微调的大型语言模型评估与已知反犹太术语的语义相似性，并随后过滤掉与已知仇恨表达太远的表达。然后，删除包含明显与犹太主题相关的术语的新兴反犹太主义表达，仅返回编码的仇恨表达。

Title: Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning

Authors: Adib Hasan, Ileana Rugina, Alex Wang
Subjects: cs.LG, cs.AI, cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2401.10862
Pdf URL: https://arxiv.org/pdf/2401.10862
Copy Paste: [[2401.10862]] Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning(https://arxiv.org/abs/2401.10862)
Keywords: language model, llm, prompt, chat
Abstract: Large Language Models (LLMs) are vulnerable to 'Jailbreaking' prompts, a type of attack that can coax these models into generating harmful and illegal content. In this paper, we show that pruning up to 20% of LLM parameters markedly increases their resistance to such attacks without additional training and without sacrificing their performance in standard benchmarks. Intriguingly, we discovered that the enhanced safety observed post-pruning correlates with the initial safety training level of the model, hinting that the effect of pruning could be more general and may hold for other LLM behaviors beyond safety. Additionally, we introduce a curated dataset of 225 harmful tasks across five categories, inserted into ten different Jailbreaking prompts, showing that pruning aids LLMs in concentrating attention on task-relevant tokens in jailbreaking prompts. Lastly, our experiments reveal that the prominent chat models, such as LLaMA-2 Chat, Vicuna, and Mistral Instruct, exhibit high susceptibility to jailbreaking attacks, with some categories reaching success rates of nearly 70-100%. These insights underline the potential of pruning as a generalizable approach for improving LLM safety, reliability, and potentially other desired behaviors.
摘要：大型语言模型 (LLM) 很容易受到“越狱”提示的影响，这种攻击可以诱导这些模型生成有害和非法内容。在本文中，我们表明，修剪高达 20% 的 LLM 参数可显着提高其对此类攻击的抵抗力，无需额外训练，也不会牺牲其在标准基准测试中的性能。有趣的是，我们发现剪枝后观察到的安全性增强与模型的初始安全训练水平相关，这表明剪枝的效果可能更普遍，并且可能适用于安全之外的其他法学硕士行为。此外，我们引入了一个包含 5 个类别的 225 个有害任务的精选数据集，插入到 10 个不同的越狱提示中，表明修剪有助于法学硕士将注意力集中在越狱提示中与任务相关的标记上。最后，我们的实验表明，著名的聊天模型，如 LLaMA-2 Chat、Vicuna 和 Mistral Instruct 表现出对越狱攻击的高度敏感性，某些类别的成功率接近 70-100%。这些见解强调了修剪作为提高法学硕士安全性、可靠性和其他潜在期望行为的通用方法的潜力。
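To illustrate what removing 20% of parameters means mechanically, here is a minimal sketch of unstructured magnitude pruning on a flat list of weights. Magnitude pruning is swapped in as the simplest familiar criterion; the abstract does not state which pruning method the authors use, and the function name `magnitude_prune` and the weights are hypothetical.

```python
def magnitude_prune(weights, fraction):
    """Zero out the `fraction` of weights with the smallest absolute value."""
    n_prune = int(len(weights) * fraction)
    # Indices ordered from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


# Toy weights, not taken from any real model.
weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.3, -0.02, 0.6, 0.15, -0.8]
pruned = magnitude_prune(weights, fraction=0.2)
```

No retraining is involved: the two smallest-magnitude entries are set to zero and all other weights are left unchanged.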

Title: Reinforcement learning for question answering in programming domain using public community scoring as a human feedback

Authors: Alexey Gorbatovski, Sergey Kovalchuk
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2401.10882
Pdf URL: https://arxiv.org/pdf/2401.10882
Copy Paste: [[2401.10882]] Reinforcement learning for question answering in programming domain using public community scoring as a human feedback(https://arxiv.org/abs/2401.10882)
Keywords: language model, gpt
Abstract: In this study, we investigate the enhancement of GPT Neo 125M's performance in Community Question Answering (CQA) with a focus on programming, through the integration of Reinforcement Learning from Human Feedback (RLHF) and the utilization of scores from Stack Overflow. Two distinct reward model training strategies are employed for fine-tuning with Proximal Policy Optimization (PPO). Notably, the improvements in performance achieved through this method are comparable to those of the GPT Neo 2.7B parameter variant. Additionally, an auxiliary scoring mechanism is introduced, which demonstrates the limitations of conventional linguistic metrics in evaluating responses in the programming domain. Through accurate analysis, this paper looks at the divergence between traditional linguistic metrics and our human-preferences-based reward model, underscoring the imperative for domain-specific evaluation methods. By elucidating the complexities involved in applying RLHF to programming CQA and accentuating the significance of context-aware evaluation, this study contributes to the ongoing efforts in refining Large Language Models through focused human feedback.
摘要：在本研究中，我们通过集成人类反馈强化学习 (RLHF) 和利用 Stack Overflow 的分数，研究了 GPT Neo 125M 在社区问答 (CQA) 中的性能增强，重点是编程。采用两种不同的奖励模型训练策略通过近端策略优化（PPO）进行微调。值得注意的是，通过此方法实现的性能改进与 GPT Neo 2.7B 参数变体的性能改进相当。此外，还引入了辅助评分机制，证明了传统语言度量在评估编程领域的响应方面的局限性。通过准确的分析，本文探讨了传统语言指标与基于人类偏好的奖励模型之间的差异，强调了特定领域评估方法的必要性。通过阐明将 RLHF 应用于编程领域 CQA 所涉及的复杂性并强调上下文感知评估的重要性，本研究有助于通过集中的人类反馈来完善大型语言模型的持续努力。