2024-05-14

Title: Levels of AI Agents: from Rules to Large Language Models

Authors: Yu Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.06643
Pdf URL: https://arxiv.org/pdf/2405.06643
Copy Paste: [[2405.06643]] Levels of AI Agents: from Rules to Large Language Models(https://arxiv.org/abs/2405.06643)
Keywords: language model, llm, agent
Abstract: AI agents are defined as artificial entities that perceive the environment, make decisions, and take actions. Inspired by the six levels of autonomous driving defined by the Society of Automotive Engineers, AI agents are likewise categorized by utility and strength into the following levels: L0, no AI, with tools that handle perception and actions; L1, rule-based AI; L2, rule-based AI replaced by IL/RL-based AI, with additional reasoning and decision making; L3, LLM-based AI in place of IL/RL-based AI, with added memory and reflection; L4, building on L3, with autonomous learning and generalization; L5, building on L4, with personality (emotion and character) and collaborative multi-agent behavior.

Title: Large Language Models as Planning Domain Generators

Authors: James Oswald, Kavitha Srinivas, Harsha Kokel, Junkyu Lee, Michael Katz, Shirin Sohrabi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.06650
Pdf URL: https://arxiv.org/pdf/2405.06650
Copy Paste: [[2405.06650]] Large Language Models as Planning Domain Generators(https://arxiv.org/abs/2405.06650)
Keywords: language model, llm, chat
Abstract: Developing domain models is one of the few remaining places that require manual human labor in AI planning. Thus, in order to make planning more accessible, it is desirable to automate the process of domain model generation. To this end, we investigate if large language models (LLMs) can be used to generate planning domain models from simple textual descriptions. Specifically, we introduce a framework for automated evaluation of LLM-generated domains by comparing the sets of plans for domain instances. Finally, we perform an empirical analysis of 7 large language models, including coding and chat models across 9 different planning domains, and under three classes of natural language domain descriptions. Our results indicate that LLMs, particularly those with high parameter counts, exhibit a moderate level of proficiency in generating correct planning domains from natural language descriptions. Our code is available at this https URL.

Title: Large Language Model (LLM) AI text generation detection based on transformer deep learning algorithm

Authors: Yuhong Mo, Hao Qin, Yushan Dong, Ziyi Zhu, Zhenglin Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.06652
Pdf URL: https://arxiv.org/pdf/2405.06652
Copy Paste: [[2405.06652]] Large Language Model (LLM) AI text generation detection based on transformer deep learning algorithm(https://arxiv.org/abs/2405.06652)
Keywords: language model, llm
Abstract: In this paper, a tool for detecting LLM-generated text is developed based on the Transformer model, aiming to improve the accuracy of AI text generation detection and provide a reference for subsequent research. First, the text is Unicode-normalised and converted to lowercase; characters other than alphabetic characters and punctuation marks are removed with regular expressions; spaces are added around punctuation marks; leading and trailing spaces are removed; consecutive ellipses are replaced with single spaces; and the text is joined using the specified delimiter. Next, non-alphabetic characters and extra whitespace are removed, runs of whitespace are replaced with a single space, and the text is again converted to lowercase. The deep learning model combines layers such as LSTM, Transformer and CNN for text classification or sequence labelling tasks. On the training and validation sets, the model loss decreases from 0.127 to 0.005 and accuracy increases from 94.96 to 99.8, indicating that the model has good detection and classification ability for AI-generated text. The test-set confusion matrix and accuracy show that the model has 99% prediction accuracy for AI-generated text, with a precision of 0.99, a recall of 1, and an F1 score of 0.99. Looking forward, the approach has the prospect of wide application in the field of AI text detection.
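The two-pass text cleaning described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the NFKC normalisation form, the punctuation set, and the order of the ellipsis step are all assumptions.

```python
import re
import unicodedata

PUNCT = ".,!?;:"  # assumed punctuation set; the abstract does not list one

def normalize_text(text: str) -> str:
    """First pass: lowercase, keep letters and punctuation, pad punctuation."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"\.{2,}", " ", text)                      # ellipses -> one space
    text = re.sub(rf"[^a-z{re.escape(PUNCT)}\s]", "", text)  # drop other characters
    text = re.sub(rf"([{re.escape(PUNCT)}])", r" \1 ", text) # spaces around punctuation
    return text.strip()                                      # trim leading/trailing spaces

def letters_only(text: str) -> str:
    """Second pass: drop non-letters, collapse whitespace, lowercase again."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def clean(parts, delimiter=" "):
    """Join the normalised parts with a delimiter, then apply the second pass."""
    joined = delimiter.join(normalize_text(p) for p in parts)
    return letters_only(joined)

print(clean(["Hello, World...", "OK!"]))
```

Because the second pass strips everything except letters, the punctuation padding from the first pass only affects where word boundaries fall.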

Title: Enhancing Language Models for Financial Relation Extraction with Named Entities and Part-of-Speech

Authors: Menglin Li, Kwan Hui Lim
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2405.06665
Pdf URL: https://arxiv.org/pdf/2405.06665
Copy Paste: [[2405.06665]] Enhancing Language Models for Financial Relation Extraction with Named Entities and Part-of-Speech(https://arxiv.org/abs/2405.06665)
Keywords: language model
Abstract: The Financial Relation Extraction (FinRE) task involves identifying the entities and their relation, given a piece of financial statement/text. To solve this FinRE problem, we propose a simple but effective strategy that improves the performance of pre-trained language models by augmenting them with Named Entity Recognition (NER) and Part-Of-Speech (POS) information, as well as different approaches to combining this information. Experiments on a financial relations dataset show promising results and highlight the benefits of incorporating NER and POS in existing models. Our dataset and codes are available at this https URL.

Title: Parameter-Efficient Instruction Tuning of Large Language Models For Extreme Financial Numeral Labelling

Authors: Subhendu Khatuya, Rajdeep Mukherjee, Akash Ghosh, Manjunath Hegde, Koustuv Dasgupta, Niloy Ganguly, Saptarshi Ghosh, Pawan Goyal
Subjects: cs.CL, cs.CE, cs.LG
Abstract URL: https://arxiv.org/abs/2405.06671
Pdf URL: https://arxiv.org/pdf/2405.06671
Copy Paste: [[2405.06671]] Parameter-Efficient Instruction Tuning of Large Language Models For Extreme Financial Numeral Labelling(https://arxiv.org/abs/2405.06671)
Keywords: language model, llm
Abstract: We study the problem of automatically annotating relevant numerals (GAAP metrics) occurring in financial documents with their corresponding XBRL tags. Different from prior works, we investigate the feasibility of solving this extreme classification problem using a generative paradigm through instruction tuning of Large Language Models (LLMs). To this end, we leverage metric metadata information to frame our target outputs while proposing a parameter-efficient solution for the task using LoRA. We perform experiments on two recently released financial numeric labeling datasets. Our proposed model, FLAN-FinXC, achieves new state-of-the-art performance on both datasets, outperforming several strong baselines. We explain the better scores of our proposed model by demonstrating its capability on zero-shot as well as the least frequently occurring tags. Also, even when we fail to predict the XBRL tags correctly, our generated output has substantial overlap with the ground truth in the majority of cases.

Title: Open-SQL Framework: Enhancing Text-to-SQL on Open-source Large Language Models

Authors: Xiaojun Chen, Tianle Wang, Tianhao Qiu, Jianbin Qin, Min Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Open-SQL Framework: Enhancing Text-to-SQL on Open-source Large Language Models(https://arxiv.org/abs/)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: Despite the success of large language models (LLMs) in Text-to-SQL tasks, open-source LLMs encounter challenges in contextual understanding and response coherence. To tackle these issues, we present Open-SQL, a systematic methodology tailored for Text-to-SQL with open-source LLMs. Our contributions include a comprehensive evaluation of open-source LLMs in Text-to-SQL tasks, the \openprompt strategy for effective question representation, and novel strategies for supervised fine-tuning. We explore the benefits of Chain-of-Thought in step-by-step inference and propose the \openexample method for enhanced few-shot learning. Additionally, we introduce token-efficient techniques, such as Variable-length Open DB Schema, Target Column Truncation, and Example Column Truncation, addressing challenges in large-scale databases. Our findings emphasize the need for further investigation into the impact of supervised fine-tuning on contextual learning capabilities. Remarkably, our method significantly improved Llama2-7B from 2.54% to 41.04% and Code Llama-7B from 14.54% to 48.24% on the BIRD-Dev dataset. Notably, the performance of Code Llama-7B surpassed GPT-4 (46.35%) on the BIRD-Dev dataset.

Title: EDA Corpus: A Large Language Model Dataset for Enhanced Interaction with OpenROAD

Authors: Bing-Yue Wu, Utsav Sharma, Sai Rahul Dhanvi Kankipati, Ajay Yadav, Bintu Kappil George, Sai Ritish Guntupalli, Austin Rovinski, Vidya A. Chhabria
Subjects: cs.CL, cs.AI, cs.AR
Abstract URL: https://arxiv.org/abs/2405.06676
Pdf URL: https://arxiv.org/pdf/2405.06676
Copy Paste: [[2405.06676]] EDA Corpus: A Large Language Model Dataset for Enhanced Interaction with OpenROAD(https://arxiv.org/abs/2405.06676)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) serve as powerful tools for design, providing capabilities for both task automation and design assistance. Recent advancements have shown tremendous potential for facilitating LLM integration into the chip design process; however, many of these works rely on data that are not publicly available and/or not permissively licensed for use in LLM training and distribution. In this paper, we present a solution aimed at bridging this gap by introducing an open-source dataset tailored for OpenROAD, a widely adopted open-source EDA toolchain. The dataset features over 1000 data points and is structured in two formats: (i) a pairwise set comprised of question prompts with prose answers, and (ii) a pairwise set comprised of code prompts and their corresponding OpenROAD scripts. By providing this dataset, we aim to facilitate LLM-focused research within the EDA domain. The dataset is available at this https URL.

Title: ATG: Benchmarking Automated Theorem Generation for Generative Language Models

Authors: Xiaohan Lin, Qingxing Cao, Yinya Huang, Zhicheng Yang, Zhengying Liu, Zhenguo Li, Xiaodan Liang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.06677
Pdf URL: https://arxiv.org/pdf/2405.06677
Copy Paste: [[2405.06677]] ATG: Benchmarking Automated Theorem Generation for Generative Language Models(https://arxiv.org/abs/2405.06677)
Keywords: language model, agent
Abstract: Humans can develop new theorems to explore broader and more complex mathematical results. While current generative language models (LMs) have achieved significant improvement in automatically proving theorems, their ability to generate new or reusable theorems is still under-explored. Without new theorems, current LMs struggle to prove harder theorems that are distant from the given hypotheses, because the search space grows exponentially. Therefore, this paper proposes an Automated Theorem Generation (ATG) benchmark that evaluates whether an agent can automatically generate valuable (and possibly brand new) theorems that are applicable to downstream theorem proving as reusable knowledge. Specifically, we construct the ATG benchmark by splitting the Metamath library into three sets based on proving depth: axioms, library, and problem. We conduct extensive experiments to investigate whether current LMs can generate theorems in the library and thereby benefit the proving of the problem theorems. The results demonstrate that high-quality ATG data improves models' performance on downstream ATP. However, there is still room for current LMs to develop better ATG and generate more advanced and human-like theorems. We hope the new ATG challenge can shed some light on advanced complex theorem proving.

Title: Exploring the Compositional Deficiency of Large Language Models in Mathematical Reasoning

Authors: Jun Zhao, Jingqi Tong, Yurong Mou, Ming Zhang, Qi Zhang, Xuanjing Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.06680
Pdf URL: https://arxiv.org/pdf/2405.06680
Copy Paste: [[2405.06680]] Exploring the Compositional Deficiency of Large Language Models in Mathematical Reasoning(https://arxiv.org/abs/2405.06680)
Keywords: language model, llm, prompt
Abstract: Human cognition exhibits systematic compositionality, the algebraic ability to generate infinite novel combinations from finite learned components, which is the key to understanding and reasoning about complex logic. In this work, we investigate the compositionality of large language models (LLMs) in mathematical reasoning. Specifically, we construct a new dataset, MathTrap, by introducing carefully designed logical traps into the problem descriptions of MATH and GSM8k. Since problems with logical flaws are quite rare in the real world, these represent "unseen" cases to LLMs. Solving these requires the models to systematically compose (1) the mathematical knowledge involved in the original problems with (2) knowledge related to the introduced traps. Our experiments show that while LLMs possess both components of requisite knowledge, they do not spontaneously combine them to handle these novel cases. We explore several methods to mitigate this deficiency, such as natural language prompts, few-shot demonstrations, and fine-tuning. We find that LLMs' performance can be passively improved through the above external intervention. Overall, systematic compositionality remains an open challenge for large language models.

Title: Leveraging Lecture Content for Improved Feedback: Explorations with GPT-4 and Retrieval Augmented Generation

Authors: Sven Jacobs, Steffen Jaschke
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2405.06681
Pdf URL: https://arxiv.org/pdf/2405.06681
Copy Paste: [[2405.06681]] Leveraging Lecture Content for Improved Feedback: Explorations with GPT-4 and Retrieval Augmented Generation(https://arxiv.org/abs/2405.06681)
Keywords: language model, gpt, hallucination, retrieval augmented generation
Abstract: This paper presents the use of Retrieval Augmented Generation (RAG) to improve the feedback generated by Large Language Models for programming tasks. For this purpose, corresponding lecture recordings were transcribed and made available to the Large Language Model GPT-4 as external knowledge source together with timestamps as metainformation by using RAG. The purpose of this is to prevent hallucinations and to enforce the use of the technical terms and phrases from the lecture. In an exercise platform developed to solve programming problems for an introductory programming lecture, students can request feedback on their solutions generated by GPT-4. For this task GPT-4 receives the students' code solution, the compiler output, the result of unit tests and the relevant passages from the lecture notes available through the use of RAG as additional context. The feedback generated by GPT-4 should guide students to solve problems independently and link to the lecture content, using the time stamps of the transcript as meta-information. In this way, the corresponding lecture videos can be viewed immediately at the corresponding positions. For the evaluation, students worked with the tool in a workshop and decided for each feedback whether it should be extended by RAG or not. First results based on a questionnaire and the collected usage data show that the use of RAG can improve feedback generation and is preferred by students in some situations. Due to the slower speed of feedback generation, the benefits are situation dependent.

Title: Self-Reflection in LLM Agents: Effects on Problem-Solving Performance

Authors: Matthew Renze, Erhan Guven
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.06682
Pdf URL: https://arxiv.org/pdf/2405.06682
Copy Paste: [[2405.06682]] Self-Reflection in LLM Agents: Effects on Problem-Solving Performance(https://arxiv.org/abs/2405.06682)
Keywords: language model, llm, agent
Abstract: In this study, we investigated the effects of self-reflection in large language models (LLMs) on problem-solving performance. We instructed nine popular LLMs to answer a series of multiple-choice questions to provide a performance baseline. For each incorrectly answered question, we instructed eight types of self-reflecting LLM agents to reflect on their mistakes and provide themselves with guidance to improve problem-solving. Then, using this guidance, each self-reflecting agent attempted to re-answer the same questions. Our results indicate that LLM agents are able to significantly improve their problem-solving performance through self-reflection ($p < 0.001$). In addition, we compared the various types of self-reflection to determine their individual contribution to performance. All code and data are available on GitHub at this https URL
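The answer, reflect, re-answer procedure described in the abstract above might be sketched like this. It is an illustration only: `ask` and `reflect` are hypothetical callables standing in for LLM calls, and the stubs below exist purely to make the sketch runnable. None of these names come from the paper's released code.

```python
from typing import Callable

# Hypothetical interfaces (assumptions, not the paper's API):
#   ask(question, guidance) -> answer label
#   reflect(question, wrong_answer) -> guidance text
Ask = Callable[[str, str], str]
Reflect = Callable[[str, str], str]

def answer_with_reflection(question: str, correct: str,
                           ask: Ask, reflect: Reflect) -> str:
    """Return the baseline answer, or a second attempt guided by reflection."""
    first = ask(question, "")            # baseline attempt without guidance
    if first == correct:
        return first                     # only wrong answers trigger reflection
    guidance = reflect(question, first)  # agent reflects on its own mistake
    return ask(question, guidance)       # re-answer the same question

# Stub "model" for demonstration: answers "B" unaided, "A" when given guidance.
def stub_ask(question: str, guidance: str) -> str:
    return "A" if guidance else "B"

def stub_reflect(question: str, wrong_answer: str) -> str:
    return f"'{wrong_answer}' was wrong; re-check each option."

print(answer_with_reflection("Which option?", "A", stub_ask, stub_reflect))
```

Different reflection styles could be compared by swapping in different `reflect` callables while keeping `ask` fixed.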

Title: ERAGent: Enhancing Retrieval-Augmented Language Models with Improved Accuracy, Efficiency, and Personalization

Authors: Yunxiao Shi, Xing Zi, Zijing Shi, Haimin Zhang, Qiang Wu, Min Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.06683
Pdf URL: https://arxiv.org/pdf/2405.06683
Copy Paste: [[2405.06683]] ERAGent: Enhancing Retrieval-Augmented Language Models with Improved Accuracy, Efficiency, and Personalization(https://arxiv.org/abs/2405.06683)
Keywords: language model, retrieval-augmented generation, agent
Abstract: Retrieval-augmented generation (RAG) for language models significantly improves language understanding systems. The basic retrieval-then-read pipeline of response generation has evolved into a more extended process due to the integration of various components, sometimes even forming loop structures. Despite its advancements in improving response accuracy, challenges persist, such as poor retrieval quality for complex questions that require the search of multifaceted semantic information, inefficiencies in knowledge re-retrieval during long-term serving, and a lack of personalized responses. Motivated by transcending these limitations, we introduce ERAGent, a cutting-edge framework that embodies an advancement in the RAG area. Our contribution is the introduction of two synergistically operated modules, Enhanced Question Rewriter and Knowledge Filter, for better retrieval quality. Retrieval Trigger is incorporated to curtail extraneous external knowledge retrieval without sacrificing response quality. ERAGent also personalizes responses by incorporating a learned user profile. The efficiency and personalization characteristics of ERAGent are supported by the Experiential Learner module, which makes the AI assistant capable of expanding its knowledge and modeling the user profile incrementally. Rigorous evaluations across six datasets and three question-answering tasks prove ERAGent's superior accuracy, efficiency, and personalization, emphasizing its potential to advance the RAG field and its applicability in practical systems.

Title: QuakeBERT: Accurate Classification of Social Media Texts for Rapid Earthquake Impact Assessment

Authors: Jin Han, Zhe Zheng, Xin-Zheng Lu, Ke-Yin Chen, Jia-Rui Lin
Subjects: cs.CL, cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2405.06684
Pdf URL: https://arxiv.org/pdf/2405.06684
Copy Paste: [[2405.06684]] QuakeBERT: Accurate Classification of Social Media Texts for Rapid Earthquake Impact Assessment(https://arxiv.org/abs/2405.06684)
Keywords: language model, llm
Abstract: Social media aids disaster response but suffers from noise, hindering accurate impact assessment and decision making for resilient cities, which few studies considered. To address the problem, this study proposes the first domain-specific LLM model and an integrated method for rapid earthquake impact assessment. First, a few categories are introduced to classify and filter microblogs considering their relationship to the physical and social impacts of earthquakes, and a dataset comprising 7282 earthquake-related microblogs from twenty earthquakes in different locations is developed as well. Then, with a systematic analysis of various influential factors, QuakeBERT, a domain-specific large language model (LLM), is developed and fine-tuned for accurate classification and filtering of microblogs. Meanwhile, an integrated method integrating public opinion trend analysis, sentiment analysis, and keyword-based physical impact quantification is introduced to assess both the physical and social impacts of earthquakes based on social media texts. Experiments show that data diversity and data volume dominate the performance of QuakeBERT and increase the macro average F1 score by 27%, while the best classification model QuakeBERT outperforms the CNN- or RNN-based models by improving the macro average F1 score from 60.87% to 84.33%. Finally, the proposed approach is applied to assess two earthquakes with the same magnitude and focal depth. Results show that the proposed approach can effectively enhance the impact assessment process by accurate detection of noisy microblogs, which enables effective post-disaster emergency responses to create more resilient cities.
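The abstract above reports results as macro average F1. As a reminder of what that metric measures, here is a minimal sketch that computes per-class F1 and averages it with equal weight per class. The function name and toy labels are illustrative assumptions, not the authors' evaluation code or data.

```python
from typing import Hashable, Sequence

def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = set(y_true) | set(y_pred)
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels: class 0 has F1 = 2/3, class 1 has F1 = 0.8
print(macro_f1([0, 0, 1, 1], [0, 1, 1, 1]))
```

Because every class contributes equally, macro F1 penalizes a classifier that ignores rare categories, which matters when noisy microblogs far outnumber informative ones.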

Title: Multigenre AI-powered Story Composition

Authors: Edirlei Soares de Lima, Margot M. E. Neggers, Antonio L. Furtado
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.06685
Pdf URL: https://arxiv.org/pdf/2405.06685
Copy Paste: [[2405.06685]] Multigenre AI-powered Story Composition(https://arxiv.org/abs/2405.06685)
Keywords: agent
Abstract: This paper shows how to construct genre patterns, whose purpose is to guide interactive story composition in a way that enforces thematic consistency. To start the discussion we argue, based on previous seminal works, for the existence of five fundamental genres, namely comedy, romance - in the sense of epic plots, flourishing since the twelfth century -, tragedy, satire, and mystery. To construct the patterns, a simple two-phase process is employed: first retrieving examples that match our genre characterizations, and then applying a form of most specific generalization to the groups of examples in order to find their commonalities. In both phases, AI agents are instrumental, with our PatternTeller prototype being called to operate the story composition process, offering the opportunity to generate stories from a given premise of the user, to be developed under the guidance of the chosen pattern and trying to accommodate the user's suggestions along the composition stages.

Title: Word2World: Generating Stories and Worlds through Large Language Models

Authors: Muhammad U. Nasir, Steven James, Julian Togelius
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.06686
Pdf URL: https://arxiv.org/pdf/2405.06686
Copy Paste: [[2405.06686]] Word2World: Generating Stories and Worlds through Large Language Models(https://arxiv.org/abs/2405.06686)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have proven their worth across a diverse spectrum of disciplines. LLMs have shown great potential in Procedural Content Generation (PCG) as well, but directly generating a level through a pre-trained LLM is still challenging. This work introduces Word2World, a system that enables LLMs to procedurally design playable games through stories, without any task-specific fine-tuning. Word2World leverages the abilities of LLMs to create diverse content and extract information. Combining these abilities, LLMs can create a story for the game, design narrative, and place tiles in appropriate places to create coherent worlds and playable games. We test Word2World with different LLMs and perform a thorough ablation study to validate each step. We open-source the code at this https URL.

Title: Hire Me or Not? Examining Language Model's Behavior with Occupation Attributes

Authors: Damin Zhang, Yi Zhang, Geetanjali Bihani, Julia Rayz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.06687
Pdf URL: https://arxiv.org/pdf/2405.06687
Copy Paste: [[2405.06687]] Hire Me or Not? Examining Language Model's Behavior with Occupation Attributes(https://arxiv.org/abs/2405.06687)
Keywords: language model, gpt, llm, chat
Abstract: With the impressive performance in various downstream tasks, large language models (LLMs) have been widely integrated into production pipelines, like recruitment and recommendation systems. A known issue of models trained on natural language data is the presence of human biases, which can impact the fairness of the system. This paper investigates LLMs' behavior with respect to gender stereotypes, in the context of occupation decision making. Our framework is designed to investigate and quantify the presence of gender stereotypes in LLMs' behavior via multi-round question answering. Inspired by prior works, we construct a dataset by leveraging a standard occupation classification knowledge base released by authoritative agencies. We tested three LLMs (RoBERTa-large, GPT-3.5-turbo, and Llama2-70b-chat) and found that all models exhibit gender stereotypes analogous to human biases, but with different preferences. The distinct preferences of GPT-3.5-turbo and Llama2-70b-chat may imply the current alignment methods are insufficient for debiasing and could introduce new biases contradicting the traditional gender stereotypes.

Title: Fleet of Agents: Coordinated Problem Solving with Large Language Models using Genetic Particle Filtering

Authors: Akhil Arora, Lars Klein, Nearchos Potamitis, Roland Aydin, Caglar Gulcehre, Robert West
Subjects: cs.CL, cs.AI, cs.LG, cs.NE
Abstract URL: https://arxiv.org/abs/2405.06691
Pdf URL: https://arxiv.org/pdf/2405.06691
Copy Paste: [[2405.06691]] Fleet of Agents: Coordinated Problem Solving with Large Language Models using Genetic Particle Filtering(https://arxiv.org/abs/2405.06691)
Keywords: language model, llm, tree-of-thought, agent
Abstract: Large language models (LLMs) have significantly evolved, moving from simple output generation to complex reasoning and from stand-alone usage to being embedded into broader frameworks. In this paper, we introduce Fleet of Agents (FoA), a novel framework utilizing LLMs as agents to navigate through dynamic tree searches, employing a genetic-type particle filtering approach. FoA spawns a multitude of agents, each exploring autonomously, followed by a selection phase where resampling based on a heuristic value function optimizes the balance between exploration and exploitation. This mechanism enables dynamic branching, adapting the exploration strategy based on discovered solutions. We experimentally validate FoA using two benchmark tasks, "Game of 24" and "Mini-Crosswords". FoA outperforms the previously proposed Tree-of-Thoughts method in terms of efficacy and efficiency: it significantly decreases computational costs (by calling the value function less frequently) while preserving comparable or even superior accuracy.
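Summary: Large language models (LLMs) have evolved significantly, moving from simple output generation to complex reasoning and from stand-alone use to being embedded in broader frameworks. In this paper, we introduce Fleet of Agents (FoA), a novel framework that uses LLMs as agents to navigate dynamic tree searches with a genetic-type particle filtering approach. FoA spawns a large number of agents, each exploring autonomously, followed by a selection phase in which resampling based on a heuristic value function optimizes the balance between exploration and exploitation. This mechanism enables dynamic branching, adapting the exploration strategy to the solutions discovered. We validate FoA experimentally on two benchmark tasks, "Game of 24" and "Mini-Crosswords". FoA outperforms the previously proposed Tree-of-Thoughts method in both efficacy and efficiency: it significantly reduces computational cost (by calling the value function less often) while maintaining comparable or even higher accuracy.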
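
The explore-then-resample loop described in this abstract can be sketched in a few lines of Python. This is only an illustration of genetic-type particle filtering on a toy integer search, not the authors' implementation: the names fleet_search, resample, toy_step, toy_value and the resample_every parameter are all assumptions introduced here, and a real system would replace toy_step with an LLM call.

```python
import random

def resample(states, value_fn, rng):
    """Selection phase: draw len(states) states with replacement,
    with probability proportional to each state's heuristic value."""
    weights = [value_fn(s) for s in states]
    if sum(weights) <= 0:
        return list(states)  # no signal from the heuristic: keep the fleet unchanged
    return rng.choices(states, weights=weights, k=len(states))

def fleet_search(initial, step_fn, value_fn, n_agents=8, n_rounds=6,
                 resample_every=2, seed=0):
    """Run n_agents independent explorers; resample only every
    `resample_every` rounds, so the value function is called sparingly."""
    rng = random.Random(seed)
    fleet = [initial] * n_agents
    for t in range(1, n_rounds + 1):
        fleet = [step_fn(s, rng) for s in fleet]      # each agent explores on its own
        if t % resample_every == 0:
            fleet = resample(fleet, value_fn, rng)    # promising states get duplicated
    return max(fleet, key=value_fn)

# Toy problem (hypothetical): walk an integer toward the target 10.
def toy_step(state, rng):
    return state + rng.choice([-1, 1, 2])

def toy_value(state):
    return 1.0 / (1 + abs(10 - state))

best = fleet_search(0, toy_step, toy_value)
print(best)
```

Resampling with replacement is what gives the dynamic branching: a high-value state may be copied to several agents, while low-value states tend to drop out of the fleet.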

Title: SUTRA: Scalable Multilingual Language Model Architecture

Authors: Abhijit Bendale, Michael Sapienza, Steven Ripplinger, Simon Gibbs, Jaewon Lee, Pranav Mistry
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.06694
Pdf URL: https://arxiv.org/pdf/2405.06694
Copy Paste: [[2405.06694]] SUTRA: Scalable Multilingual Language Model Architecture(https://arxiv.org/abs/2405.06694)
Keywords: language model, gpt, llm, hallucination
Abstract: In this paper, we introduce SUTRA, a multilingual Large Language Model architecture capable of understanding, reasoning, and generating text in over 50 languages. SUTRA's design uniquely decouples core conceptual understanding from language-specific processing, which facilitates scalable and efficient multilingual alignment and learning. Employing a Mixture of Experts framework both in language and concept processing, SUTRA demonstrates both computational efficiency and responsiveness. Through extensive evaluations, SUTRA is demonstrated to surpass existing models such as GPT-3.5 and Llama2 by 20-30% on leading Massive Multitask Language Understanding (MMLU) benchmarks for multilingual tasks. SUTRA models are also online LLMs that can use knowledge from the internet to provide hallucination-free, factual and up-to-date responses while retaining their multilingual capabilities. Furthermore, we explore the broader implications of its architecture for the future of multilingual AI, highlighting its potential to democratize access to AI technology globally and to improve the equity and utility of AI in regions with predominantly non-English languages. Our findings suggest that SUTRA not only fills pivotal gaps in multilingual model capabilities but also establishes a new benchmark for operational efficiency and scalability in AI applications.
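Summary: In this paper, we introduce SUTRA, a multilingual large language model architecture capable of understanding, reasoning about, and generating text in more than 50 languages. SUTRA's design uniquely separates core conceptual understanding from language-specific processing, which facilitates scalable and efficient multilingual alignment and learning. SUTRA employs a Mixture of Experts framework in both language and concept processing and demonstrates computational efficiency and responsiveness. Extensive evaluations show that SUTRA surpasses existing models such as GPT-3.5 and Llama2 by 20-30% on leading Massive Multitask Language Understanding (MMLU) benchmarks for multilingual tasks. SUTRA models are also online LLMs that can use knowledge from the internet to provide hallucination-free, factual, and up-to-date responses while retaining their multilingual capabilities. In addition, we explore the broader implications of the architecture for the future of multilingual AI, highlighting its potential to democratize access to AI technology worldwide and to improve the equity and utility of AI in regions where non-English languages predominate. Our findings suggest that SUTRA not only fills key gaps in multilingual model capabilities but also sets a new benchmark for operational efficiency and scalability in AI applications.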

Title: Utilizing Large Language Models to Generate Synthetic Data to Increase the Performance of BERT-Based Neural Networks

Authors: Chancellor R. Woolsey, Prakash Bisht, Joshua Rothman, Gondy Leroy
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.06695
Pdf URL: https://arxiv.org/pdf/2405.06695
Copy Paste: [[2405.06695]] Utilizing Large Language Models to Generate Synthetic Data to Increase the Performance of BERT-Based Neural Networks(https://arxiv.org/abs/2405.06695)
Keywords: language model, gpt, llm, prompt, chat
Abstract: An important issue impacting healthcare is a lack of available experts. Machine learning (ML) models could resolve this by aiding in diagnosing patients. However, creating datasets large enough to train these models is expensive. We evaluated large language models (LLMs) for data creation. Using Autism Spectrum Disorders (ASD), we prompted ChatGPT and GPT-Premium to generate 4,200 synthetic observations to augment existing medical data. Our goal is to label behaviors corresponding to autism criteria and improve model accuracy with synthetic training data. We used a BERT classifier pre-trained on biomedical literature to assess differences in performance between models. A random sample (N=140) from the LLM-generated data was evaluated by a clinician and found to contain 83% correct example-label pairs. Augmenting data increased recall by 13% but decreased precision by 16%, correlating with higher quality and lower accuracy across pairs. Future work will analyze how different synthetic data traits affect ML outcomes.
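Summary: An important issue affecting healthcare is the lack of available experts. Machine learning (ML) models could address this by helping to diagnose patients. However, creating datasets large enough to train these models is expensive. We evaluated large language models (LLMs) for data creation. Using Autism Spectrum Disorders (ASD), we prompted ChatGPT and GPT-Premium to generate 4,200 synthetic observations to augment existing medical data. Our goal is to label behaviors that correspond to autism criteria and to improve model accuracy with synthetic training data. We used a BERT classifier pre-trained on biomedical literature to assess differences in performance between models. A clinician evaluated a random sample (N=140) of the LLM-generated data and found that it contained 83% correct example-label pairs. Augmenting the data increased recall by 13% but decreased precision by 16%, correlating with higher quality and lower accuracy across pairs. Future work will analyze how different traits of synthetic data affect ML outcomes.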

Title: Automated Conversion of Static to Dynamic Scheduler via Natural Language

Authors: Paul Mingzheng Tang, Kenji Kah Hoe Leong, Nowshad Shaik, Hoong Chuin Lau
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.06697
Pdf URL: https://arxiv.org/pdf/2405.06697
Copy Paste: [[2405.06697]] Automated Conversion of Static to Dynamic Scheduler via Natural Language(https://arxiv.org/abs/2405.06697)
Keywords: language model, llm, retrieval-augmented generation
Abstract: In this paper, we explore the potential application of Large Language Models (LLMs) that will automatically model constraints and generate code for dynamic scheduling problems given an existing static model. Static scheduling problems are modelled and coded by optimization experts. These models may easily become obsolete, as the underlying constraints may need to be fine-tuned in order to reflect changes in the scheduling rules. Furthermore, it may be necessary to turn a static model into a dynamic one in order to cope with disturbances in the environment. In this paper, we propose a Retrieval-Augmented Generation (RAG) based LLM model to automate the process of implementing constraints for Dynamic Scheduling (RAGDyS), without seeking help from an optimization modeling expert. Our framework aims to minimize technical complexities related to mathematical modelling and computational workload for end-users, thereby allowing end-users to quickly obtain a new schedule close to the original schedule with changes reflected by natural language constraint descriptions.
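Summary: In this paper, we explore the potential application of large language models (LLMs) that automatically model constraints and generate code for dynamic scheduling problems, given an existing static model. Static scheduling problems are modelled and coded by optimization experts. These models can easily become obsolete, because the underlying constraints may need to be fine-tuned to reflect changes in the scheduling rules. Moreover, it may be necessary to turn a static model into a dynamic one in order to cope with disturbances in the environment. In this paper, we propose an LLM model based on Retrieval-Augmented Generation (RAG) to automate the process of implementing constraints for Dynamic Scheduling (RAGDyS), without seeking help from an optimization modeling expert. Our framework aims to minimize the technical complexity of mathematical modelling and the computational workload for end users, allowing them to quickly obtain a new schedule that is close to the original one, with changes reflected through natural language descriptions of constraints.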

Title: ChatSOS: Vector Database Augmented Generative Question Answering Assistant in Safety Engineering

Authors: Haiyang Tang, Dongping Chen, Qingzhao Chu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.06699
Pdf URL: https://arxiv.org/pdf/2405.06699
Copy Paste: [[2405.06699]] ChatSOS: Vector Database Augmented Generative Question Answering Assistant in Safety Engineering(https://arxiv.org/abs/2405.06699)
Keywords: language model, llm, chat
Abstract: With the rapid advancement of natural language processing technologies, generative artificial intelligence techniques, represented by large language models (LLMs), are gaining increasing prominence and demonstrating significant potential for applications in safety engineering. However, fundamental LLMs face constraints such as limited training data coverage and unreliable responses. This study develops a vector database from 117 explosion accident reports in China spanning 2013 to 2023, employing techniques such as corpus segmenting and vector embedding. By utilizing the vector database, which outperforms the relational database in information retrieval quality, we provide LLMs with richer, more relevant knowledge. Comparative analysis of LLMs demonstrates that ChatSOS significantly enhances reliability, accuracy, and comprehensiveness, and improves the adaptability and clarity of responses. These results illustrate the effectiveness of supplementing LLMs with an external database, highlighting their potential to handle professional queries in safety engineering and laying a foundation for broader applications.
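Summary: With the rapid development of natural language processing technologies, generative artificial intelligence techniques represented by large language models (LLMs) are receiving growing attention and showing great potential for applications in safety engineering. However, foundation LLMs face limitations such as limited training data coverage and unreliable responses. Using techniques such as corpus segmentation and vector embedding, this study builds a vector database from 117 explosion accident reports in China covering 2013 to 2023. By using the vector database, which outperforms a relational database in information retrieval quality, we provide LLMs with richer and more relevant knowledge. A comparative analysis of LLMs shows that ChatSOS significantly enhances reliability, accuracy, and comprehensiveness, and improves the adaptability and clarity of responses. These results illustrate the effectiveness of supplementing LLMs with an external database, highlight their potential to handle professional queries in safety engineering, and lay a foundation for broader applications.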

Title: Interpretable Cross-Examination Technique (ICE-T): Using highly informative features to boost LLM performance

Authors: Goran Muric, Ben Delay, Steven Minton
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Interpretable Cross-Examination Technique (ICE-T): Using highly informative features to boost LLM performance(https://arxiv.org/abs/)
Keywords: language model, llm, prompt
Abstract: In this paper, we introduce the Interpretable Cross-Examination Technique (ICE-T), a novel approach that leverages structured multi-prompt techniques with Large Language Models (LLMs) to improve classification performance over zero-shot and few-shot methods. In domains where interpretability is crucial, such as medicine and law, standard models often fall short due to their "black-box" nature. ICE-T addresses these limitations by using a series of generated prompts that allow an LLM to approach the problem from multiple directions. The responses from the LLM are then converted into numerical feature vectors and processed by a traditional classifier. This method not only maintains high interpretability but also allows for smaller, less capable models to achieve or exceed the performance of larger, more advanced models under zero-shot conditions. We demonstrate the effectiveness of ICE-T across a diverse set of data sources, including medical records and legal documents, consistently surpassing the zero-shot baseline in terms of classification metrics such as F1 scores. Our results indicate that ICE-T can be used for improving both the performance and transparency of AI applications in complex decision-making environments.
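Summary: In this paper, we introduce the Interpretable Cross-Examination Technique (ICE-T), a novel approach that uses structured multi-prompt techniques with large language models (LLMs) to improve classification performance over zero-shot and few-shot methods. In domains where interpretability is crucial, such as medicine and law, standard models often fall short because of their "black-box" nature. ICE-T addresses these limitations with a series of generated prompts that allow the LLM to approach the problem from multiple directions. The LLM's responses are then converted into numerical feature vectors and processed by a traditional classifier. This method not only maintains high interpretability but also allows smaller, less capable models to match or exceed the performance of larger, more advanced models under zero-shot conditions. We demonstrate the effectiveness of ICE-T on a variety of data sources, including medical records and legal documents, where it consistently surpasses the zero-shot baseline on classification metrics such as F1 score. Our results indicate that ICE-T can be used to improve both the performance and the transparency of AI applications in complex decision-making environments.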
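
The pipeline in this abstract (several prompts, answers turned into a numeric feature vector, then a traditional classifier) can be illustrated with a small Python sketch. Everything concrete here is an assumption made for illustration: the answer-to-number mapping, the nearest-centroid classifier standing in for "a traditional classifier", and fake_llm, a keyword stub used in place of a real LLM call.

```python
# Hypothetical mapping from an LLM's short answer to a numeric feature.
ANSWER_TO_VALUE = {"yes": 1.0, "no": 0.0, "unknown": 0.5}

def featurize(document, questions, ask_llm):
    """Ask every secondary question about the document and encode the answers."""
    return [ANSWER_TO_VALUE.get(ask_llm(q, document).strip().lower(), 0.5)
            for q in questions]

def fit_centroids(vectors, labels):
    """Nearest-centroid 'training': average the feature vectors of each class."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc] for label, acc in sums.items()}

def predict(centroids, vec):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    def dist(center):
        return sum((a - b) ** 2 for a, b in zip(vec, center))
    return min(centroids, key=lambda label: dist(centroids[label]))

def fake_llm(question, document):
    """Stand-in for an LLM: answers 'yes' if the quoted keyword occurs in the text."""
    keyword = question.split("'")[1]
    return "yes" if keyword in document.lower() else "no"

questions = ["Does the note mention 'fever'?", "Does the note mention 'cough'?"]
train_docs = ["Fever and cough for three days", "High fever, dry cough",
              "Sprained ankle", "Mild cough only"]
train_labels = ["flu", "flu", "other", "other"]

centroids = fit_centroids([featurize(d, questions, fake_llm) for d in train_docs],
                          train_labels)
print(predict(centroids, featurize("Fever with cough", questions, fake_llm)))
```

Because each feature corresponds to one human-readable question, the feature vector itself shows which answers drove a prediction, which is the interpretability argument the abstract makes.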

Title: LLMs can Find Mathematical Reasoning Mistakes by Pedagogical Chain-of-Thought

Authors: Zhuoxuan Jiang, Haoyuan Peng, Shanshan Feng, Fan Li, Dongsheng Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.06705
Pdf URL: https://arxiv.org/pdf/2405.06705
Copy Paste: [[2405.06705]] LLMs can Find Mathematical Reasoning Mistakes by Pedagogical Chain-of-Thought(https://arxiv.org/abs/2405.06705)
Keywords: language model, llm, hallucination, prompt, chain-of-thought
Abstract: Self-correction is emerging as a promising approach to mitigate the issue of hallucination in Large Language Models (LLMs). To facilitate effective self-correction, recent research has proposed mistake detection as its initial step. However, current literature suggests that LLMs often struggle with reliably identifying reasoning mistakes when using simplistic prompting strategies. To address this challenge, we introduce a unique prompting strategy, termed the Pedagogical Chain-of-Thought (PedCoT), which is specifically designed to guide the identification of reasoning mistakes, particularly mathematical reasoning mistakes. PedCoT consists of pedagogical principles for prompts (PPP) design, two-stage interaction process (TIP) and grounded PedCoT prompts, all inspired by the educational theory of the Bloom Cognitive Model (BCM). We evaluate our approach on two public datasets featuring math problems of varying difficulty levels. The experiments demonstrate that our zero-shot prompting strategy significantly outperforms strong baselines. The proposed method can achieve the goal of reliable mathematical mistake identification and provide a foundation for automatic math answer grading. The results underscore the significance of educational theory, serving as domain knowledge, in guiding prompting strategy design for addressing challenging tasks with LLMs effectively.
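Summary: Self-correction is emerging as a promising approach to mitigating hallucination in large language models (LLMs). To facilitate effective self-correction, recent research has proposed mistake detection as the first step. However, the current literature suggests that LLMs often struggle to identify reasoning mistakes reliably when simple prompting strategies are used. To address this challenge, we introduce a unique prompting strategy called Pedagogical Chain-of-Thought (PedCoT), which is specifically designed to guide the identification of reasoning mistakes, particularly mathematical reasoning mistakes. PedCoT consists of pedagogical principles for prompt design (PPP), a two-stage interaction process (TIP), and grounded PedCoT prompts, all inspired by the educational theory of the Bloom Cognitive Model (BCM). We evaluate our approach on two public datasets of math problems with varying levels of difficulty. The experiments show that our zero-shot prompting strategy significantly outperforms strong baselines. The proposed method can achieve reliable identification of mathematical mistakes and provides a foundation for automatic grading of math answers. The results underscore the importance of educational theory, as domain knowledge, in guiding the design of prompting strategies for tackling challenging tasks with LLMs effectively.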

Title: Exploring the Capabilities of Large Multimodal Models on Dense Text

Authors: Shuo Zhang, Biao Yang, Zhang Li, Zhiyin Ma, Yuliang Liu, Xiang Bai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.06706
Pdf URL: https://arxiv.org/pdf/2405.06706
Copy Paste: [[2405.06706]] Exploring the Capabilities of Large Multimodal Models on Dense Text(https://arxiv.org/abs/2405.06706)
Keywords: gpt, prompt
Abstract: While large multi-modal models (LMMs) have shown notable progress in multi-modal tasks, their capabilities in tasks involving dense textual content remain to be fully explored. Dense text, which carries important information, is often found in documents, tables, and product descriptions. Understanding dense text enables us to obtain more accurate information, assisting in making better decisions. To further explore the capabilities of LMMs in complex text tasks, we propose the DT-VQA dataset, with 170k question-answer pairs. In this paper, we conduct a comprehensive evaluation of GPT4V, Gemini, and various open-source LMMs on our dataset, revealing their strengths and weaknesses. Furthermore, we evaluate the effectiveness of two strategies for LMMs: prompt engineering and downstream fine-tuning. We find that even with automatically labeled training datasets, significant improvements in model performance can be achieved. We hope that this research will promote the study of LMMs in dense text tasks. Code will be released at this https URL.
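Summary: Although large multi-modal models (LMMs) have made notable progress on multi-modal tasks, their capabilities on tasks involving dense textual content remain to be fully explored. Dense text, which carries important information, often appears in documents, tables, and product descriptions. Understanding dense text lets us obtain more accurate information and helps us make better decisions. To further explore the capabilities of LMMs on complex text tasks, we propose the DT-VQA dataset, which contains 170k question-answer pairs. In this paper, we comprehensively evaluate GPT4V, Gemini, and various open-source LMMs on our dataset, revealing their strengths and weaknesses. In addition, we evaluate the effectiveness of two strategies for LMMs: prompt engineering and downstream fine-tuning. We find that significant improvements in model performance can be achieved even with automatically labeled training datasets. We hope this research will promote the study of LMMs on dense text tasks. Code will be released at this https URL.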

Title: Hypothesis Testing Prompting Improves Deductive Reasoning in Large Language Models

Authors: Yitian Li, Jidong Tian, Hao He, Yaohui Jin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.06707
Pdf URL: https://arxiv.org/pdf/2405.06707
Copy Paste: [[2405.06707]] Hypothesis Testing Prompting Improves Deductive Reasoning in Large Language Models(https://arxiv.org/abs/2405.06707)
Keywords: language model, prompt, chain-of-thought
Abstract: Combining different forms of prompts with pre-trained large language models has yielded remarkable results on reasoning tasks (e.g. Chain-of-Thought prompting). However, along with testing on more complex reasoning, these methods also expose problems such as invalid reasoning and fictional reasoning paths. In this paper, we develop Hypothesis Testing Prompting, which adds conclusion assumptions, backward reasoning, and fact verification during intermediate reasoning steps. Hypothesis Testing Prompting involves multiple assumptions and reverse validation of conclusions, leading to its unique correct answer. Experiments on two challenging deductive reasoning datasets, ProofWriter and RuleTaker, show that Hypothesis Testing Prompting not only significantly improves the effect, but also generates a more reasonable and standardized reasoning process.
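Summary: Combining different forms of prompts with pre-trained large language models has achieved remarkable results on reasoning tasks (for example, Chain-of-Thought prompting). However, as more complex reasoning is tested, these methods also expose problems such as invalid reasoning and fictional reasoning paths. In this paper, we develop Hypothesis Testing Prompting, which adds conclusion assumptions, backward reasoning, and fact verification to the intermediate reasoning steps. Hypothesis Testing Prompting involves multiple assumptions and reverse validation of conclusions, leading to its unique correct answer. Experiments on two challenging deductive reasoning datasets, ProofWriter and RuleTaker, show that Hypothesis Testing Prompting not only significantly improves the effect but also produces a more reasonable and standardized reasoning process.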

Title: Evaluating the Efficacy of AI Techniques in Textual Anonymization: A Comparative Study

Authors: Dimitris Asimopoulos, Ilias Siniosoglou, Vasileios Argyriou, Sotirios K. Goudos, Konstantinos E. Psannis, Nikoleta Karditsioti, Theocharis Saoulidis, Panagiotis Sarigiannidis
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.06709
Pdf URL: https://arxiv.org/pdf/2405.06709
Copy Paste: [[2405.06709]] Evaluating the Efficacy of AI Techniques in Textual Anonymization: A Comparative Study(https://arxiv.org/abs/2405.06709)
Keywords: language model
Abstract: In the digital era, with escalating privacy concerns, it's imperative to devise robust strategies that protect private data while maintaining the intrinsic value of textual information. This research embarks on a comprehensive examination of text anonymisation methods, focusing on Conditional Random Fields (CRF), Long Short-Term Memory (LSTM), Embeddings from Language Models (ELMo), and the transformative capabilities of the Transformers architecture. Each model presents unique strengths: LSTM models long-term dependencies, CRF captures dependencies among word sequences, ELMo delivers contextual word representations using deep bidirectional language models, and Transformers introduce self-attention mechanisms that provide enhanced scalability. Our study is positioned as a comparative analysis of these models, emphasising their synergistic potential in addressing text anonymisation challenges. Preliminary results indicate that CRF, LSTM, and ELMo individually outperform traditional methods. The inclusion of Transformers, when compared alongside the other models, offers a broader perspective on achieving optimal text anonymisation in contemporary settings.
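Summary: In the digital era, as privacy concerns escalate, robust strategies must be devised to protect private data while preserving the intrinsic value of textual information. This research undertakes a comprehensive examination of text anonymisation methods, focusing on Conditional Random Fields (CRF), Long Short-Term Memory (LSTM), Embeddings from Language Models (ELMo), and the transformative capabilities of the Transformers architecture. Each model has unique strengths: LSTM models long-term dependencies, CRF captures dependencies among word sequences, ELMo provides contextual word representations using deep bidirectional language models, and Transformers introduce self-attention mechanisms that enhance scalability. Our study is positioned as a comparative analysis of these models, emphasising their synergistic potential in addressing the challenges of text anonymisation. Preliminary results indicate that CRF, LSTM, and ELMo each outperform traditional methods. The inclusion of Transformers, compared alongside the other models, offers a broader perspective on achieving optimal text anonymisation in contemporary settings.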

Title: Mobile Sequencers

Authors: Cem Bozsahin
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2405.06710
Pdf URL: https://arxiv.org/pdf/2405.06710
Copy Paste: [[2405.06710]] Mobile Sequencers(https://arxiv.org/abs/2405.06710)
Keywords: agent
Abstract: The article is an attempt to contribute to explorations of a common origin for language and planned-collaborative action. It gives 'semantics of change' the central stage in the synthesis, from its history and recordkeeping to its development, its syntax, delivery and reception, including substratal aspects. It is suggested that to arrive at a common core, linguistic semantics must be understood as studying, through syntax, a mobile agent's representing, tracking and coping with change and no change. Semantics of actions can be conceived the same way, but through plans instead of syntax. The key point is the following: Sequencing itself, of words and action sequences, brings in more structural interpretation to the sequence than is immediately evident from the sequents themselves. Mobile sequencers can be understood as subjects structuring reporting, understanding and keeping track of change and no change. The idea invites rethinking of the notion of category, both in language and in planning. Understanding how mobile agents understand change is suggested to be about human extended practice, not extended-human practice. That's why linguistics is as important as computer science in the synthesis. It must rely on representational history of acts, thoughts and expressions, personal and public, crosscutting overtness and covertness of these phenomena. It has implications for anthropology in the extended practice, which are covered briefly.
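Summary: This article attempts to contribute to explorations of a common origin for language and planned-collaborative action. It places the "semantics of change" at the center of the synthesis, from its history and recordkeeping to its development, syntax, delivery, and reception, including substratal aspects. It is suggested that, to arrive at a common core, linguistic semantics must be understood as studying, through syntax, how a mobile agent represents, tracks, and copes with change and no change. The semantics of actions can be conceived in the same way, but through plans rather than syntax. The key point is that the sequencing of words and of action sequences itself brings more structural interpretation to a sequence than is immediately evident from the sequents themselves. Mobile sequencers can be understood as subjects that structure the reporting, understanding, and tracking of change and no change. The idea invites a rethinking of the notion of category, both in language and in planning. Understanding how mobile agents understand change is suggested to be about human extended practice, not extended-human practice. This is why linguistics is as important as computer science in the synthesis. It must rely on the representational history of acts, thoughts, and expressions, personal and public, cutting across the overtness and covertness of these phenomena. It has implications for anthropology in the extended practice, which are covered briefly.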

Title: Digital Diagnostics: The Potential Of Large Language Models In Recognizing Symptoms Of Common Illnesses

Authors: Gaurav Kumar Gupta, Aditi Singh, Sijo Valayakkad Manikandan, Abul Ehtesham
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.06712
Pdf URL: https://arxiv.org/pdf/2405.06712
Copy Paste: [[2405.06712]] Digital Diagnostics: The Potential Of Large Language Models In Recognizing Symptoms Of Common Illnesses(https://arxiv.org/abs/2405.06712)
Keywords: language model, gpt, llm, prompt
Abstract: The recent swift development of LLMs like GPT-4, Gemini, and GPT-3.5 offers a transformative opportunity in medicine and healthcare, especially in digital diagnostics. This study evaluates each model's diagnostic abilities by interpreting a user's symptoms and determining diagnoses that fit well with common illnesses, and it demonstrates how each of these models could significantly increase diagnostic accuracy and efficiency. Through a series of diagnostic prompts based on symptoms from medical databases, GPT-4 demonstrates higher diagnostic accuracy from its deep and complete history of training on medical data. Meanwhile, Gemini performs with high precision as a critical tool in disease triage, demonstrating its potential to be a reliable model when physicians are trying to make high-risk diagnoses. GPT-3.5, though slightly less advanced, is a good tool for medical diagnostics. This study highlights the need to study LLMs for healthcare and clinical practices with more care and attention, ensuring that any system utilizing LLMs promotes patient privacy and complies with health information privacy laws such as HIPAA, and considering the social consequences that affect the varied individuals in complex healthcare contexts. This study marks the start of a larger future effort to study the various ways in which assigning ethical concerns to LLMs' task of learning from human biases could unearth new ways to apply AI in complex medical settings.
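Summary: The recent rapid development of LLMs such as GPT-4, Gemini, and GPT-3.5 offers a transformative opportunity for medicine and healthcare, especially for digital diagnostics. This study evaluates each model's diagnostic abilities by having it interpret a user's symptoms and determine diagnoses that fit common illnesses, and it demonstrates how each model could significantly increase diagnostic accuracy and efficiency. Across a series of diagnostic prompts based on symptoms from medical databases, GPT-4 shows higher diagnostic accuracy, drawing on its deep and complete history of training on medical data. Meanwhile, Gemini performs with high precision as a critical tool for disease triage, demonstrating its potential to be a reliable model when physicians are trying to make high-risk diagnoses. GPT-3.5, though slightly less advanced, is a good tool for medical diagnostics. This study highlights the need to study LLMs for healthcare and clinical practice with more care and attention, ensuring that any system using LLMs promotes patient privacy and complies with health information privacy laws such as HIPAA, and attending to the social consequences that affect diverse individuals in complex healthcare contexts. This study marks the start of a larger future effort to examine how assigning ethical concerns to the LLMs' task of learning from human biases could uncover new ways to apply AI in complex medical settings.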

Title: Unveiling the Competitive Dynamics: A Comparative Evaluation of American and Chinese LLMs

Authors: Zhenhui Jiang, Jiaxin Li, Yang Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.06713
Pdf URL: https://arxiv.org/pdf/2405.06713
Copy Paste: [[2405.06713]] Unveiling the Competitive Dynamics: A Comparative Evaluation of American and Chinese LLMs(https://arxiv.org/abs/2405.06713)
Keywords: language model, gpt, llm, chat
Abstract: The strategic significance of Large Language Models (LLMs) in economic expansion, innovation, societal development, and national security has been increasingly recognized since the advent of ChatGPT. This study provides a comprehensive comparative evaluation of American and Chinese LLMs in both English and Chinese contexts. We proposed a comprehensive evaluation framework that encompasses natural language proficiency, disciplinary expertise, and safety and responsibility, and systematically assessed 16 prominent models from the US and China under various operational tasks and scenarios. Our key findings show that GPT 4-Turbo is at the forefront in English contexts, whereas Ernie-Bot 4 stands out in Chinese contexts. The study also highlights disparities in LLM performance across languages and tasks, stressing the necessity for linguistically and culturally nuanced model development. The complementary strengths of American and Chinese LLMs point to the value of Sino-US collaboration in advancing LLM technology. The research presents the current LLM competition landscape and offers valuable insights for policymakers and businesses regarding strategic LLM investments and development. Future work will expand on this framework to include emerging LLM multimodal capabilities and business application assessments.
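Summary: Since the advent of ChatGPT, the strategic significance of large language models (LLMs) for economic expansion, innovation, societal development, and national security has been increasingly recognized. This study provides a comprehensive comparative evaluation of American and Chinese LLMs in both English and Chinese contexts. We propose a comprehensive evaluation framework covering natural language proficiency, disciplinary expertise, and safety and responsibility, and we systematically assess 16 prominent models from the US and China under various operational tasks and scenarios. Our key findings show that GPT 4-Turbo leads in English contexts, whereas Ernie-Bot 4 stands out in Chinese contexts. The study also highlights disparities in LLM performance across languages and tasks, stressing the need for linguistically and culturally nuanced model development. The complementary strengths of American and Chinese LLMs point to the value of Sino-US collaboration in advancing LLM technology. The research presents the current competitive landscape for LLMs and offers valuable insights for policymakers and businesses on strategic LLM investment and development. Future work will extend this framework to include emerging multimodal LLM capabilities and assessments of business applications.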

Title: Towards a path dependent account of category fluency

Authors: David Heineman, Reba Koenen, Sashank Varma
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Towards a path dependent account of category fluency(https://arxiv.org/abs/)
Keywords: language model, llm
Abstract: Category fluency is a widely studied cognitive phenomenon, yet two conflicting accounts have been proposed as the underlying retrieval mechanism: an optimal foraging process deliberately searching through memory (Hills et al., 2012) and a random walk sampling from a semantic network (Abbott et al., 2015). Evidence for both accounts has centered around predicting human patch switches, where both existing models of category fluency produce paradoxically identical results. We begin by peeling back the assumptions made by existing models, namely that each named example only depends on the previous example, by (i) adding an additional bias to model the category transition probability directly and (ii) relying on a large language model to predict based on the entire existing sequence. Then, we present evidence towards resolving the disagreement between each account of foraging by reformulating models as sequence generators. To evaluate, we compare generated category fluency runs to a bank of human-written sequences by proposing a metric based on n-gram overlap. We find that category switch predictors do not necessarily produce human-like sequences; in fact, the additional biases used by the Hills et al. (2012) model are required to improve generation quality, which is later improved further by our category modification. Even generating exclusively with an LLM requires an additional global cue to trigger the patch switching behavior during production. Further tests on only the search process on top of the semantic network highlight the importance of deterministic search to replicate human behavior.
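Summary: Category fluency is a widely studied cognitive phenomenon, yet two conflicting accounts of the underlying retrieval mechanism have been proposed: an optimal foraging process that deliberately searches memory (Hills et al., 2012) and a random walk that samples from a semantic network (Abbott et al., 2015). The evidence for both accounts has centered on predicting human patch switches, where the two existing models of category fluency paradoxically produce identical results. We begin by peeling back an assumption made by existing models, namely that each named example depends only on the previous example, by (i) adding an additional bias that models the category transition probability directly and (ii) relying on a large language model to predict from the entire existing sequence. We then present evidence toward resolving the disagreement between the foraging accounts by reformulating the models as sequence generators. For evaluation, we propose a metric based on n-gram overlap and compare generated category fluency runs with a bank of human-written sequences. We find that category switch predictors do not necessarily produce human-like sequences; in fact, the additional biases used by the Hills et al. (2012) model are required to improve generation quality, which is later improved further by our category modification. Even generation with an LLM alone requires an additional global cue to trigger patch-switching behavior during production. Further tests of only the search process on top of the semantic network highlight the importance of deterministic search for replicating human behavior.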
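
The abstract does not define its n-gram overlap metric, so the following Python sketch shows only one plausible reading: the fraction of n-grams in a generated run that also occur somewhere in a bank of human-written runs. The function names and the example animal sequences are hypothetical.

```python
def ngrams(sequence, n):
    """All contiguous n-grams of a list of items, as tuples."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]

def ngram_overlap(generated, human_bank, n=2):
    """Share of the generated run's n-grams that appear in any human run."""
    bank = {gram for run in human_bank for gram in ngrams(run, n)}
    generated_grams = ngrams(generated, n)
    if not generated_grams:
        return 0.0  # run shorter than n: no n-grams to compare
    hits = sum(gram in bank for gram in generated_grams)
    return hits / len(generated_grams)

human_bank = [
    ["dog", "cat", "lion", "tiger"],
    ["cat", "dog", "horse", "cow"],
]
generated = ["dog", "cat", "horse", "cow"]

# Bigrams (dog, cat) and (horse, cow) occur in the bank; (cat, horse) does not.
print(ngram_overlap(generated, human_bank, n=2))
```

A metric of this kind rewards locally human-like transitions, such as staying within a patch of related animals, without requiring the generated run to match any single human sequence.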

Title: Enhancing Creativity in Large Language Models through Associative Thinking Strategies

Authors: Pronita Mehrotra, Aishni Parab, Sumit Gulwani
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.06715
Pdf URL: https://arxiv.org/pdf/2405.06715
Copy Paste: [[2405.06715]] Enhancing Creativity in Large Language Models through Associative Thinking Strategies(https://arxiv.org/abs/2405.06715)
Keywords: language model, gpt, llm, prompt
Abstract: This paper explores the enhancement of creativity in Large Language Models (LLMs) like vGPT-4 through associative thinking, a cognitive process where creative ideas emerge from linking seemingly unrelated concepts. Associative thinking strategies have been found to effectively help humans boost creativity. However, whether the same strategies can help LLMs become more creative remains under-explored. In this work, we investigate whether prompting LLMs to connect disparate concepts can augment their creative outputs. Focusing on three domains -- Product Design, Storytelling, and Marketing -- we introduce creativity tasks designed to assess vGPT-4's ability to generate original and useful content. By challenging the models to form novel associations, we evaluate the potential of associative thinking to enhance the creative capabilities of LLMs. Our findings show that leveraging associative thinking techniques can significantly improve the originality of vGPT-4's responses.
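Summary: This paper explores enhancing the creativity of large language models (LLMs) such as vGPT-4 through associative thinking, a cognitive process in which creative ideas emerge from linking seemingly unrelated concepts. Associative thinking strategies have been found to help humans boost creativity effectively. However, whether the same strategies can help LLMs become more creative remains under-explored. In this work, we investigate whether prompting LLMs to connect disparate concepts can enhance their creative output. Focusing on three domains (Product Design, Storytelling, and Marketing), we introduce creativity tasks designed to assess vGPT-4's ability to generate original and useful content. By challenging the models to form novel associations, we evaluate the potential of associative thinking to enhance the creative capabilities of LLMs. Our findings show that using associative thinking techniques can significantly improve the originality of vGPT-4's responses.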

Title: Enhancing Traffic Prediction with Textual Data Using Large Language Models

Authors: Xiannan Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.06719
Pdf URL: https://arxiv.org/pdf/2405.06719
Copy Paste: [[2405.06719]] Enhancing Traffic Prediction with Textual Data Using Large Language Models(https://arxiv.org/abs/2405.06719)
Keywords: language model
Abstract: Traffic prediction is pivotal for rational transportation supply scheduling and allocation. Existing research into short-term traffic prediction, however, faces challenges in adequately addressing exceptional circumstances and integrating non-numerical contextual information like weather into models. Large language models offer a promising solution due to their inherent world knowledge. However, directly using them for traffic prediction presents drawbacks such as high cost, lack of determinism, and limited mathematical capability. To mitigate these issues, this study proposes a novel approach. Instead of directly employing large models for prediction, it utilizes them to process textual information and obtain embeddings. These embeddings are then combined with historical traffic data and fed into traditional spatiotemporal forecasting models. The study investigates two types of special scenarios: regional-level and node-level. For regional-level scenarios, textual information is represented as a node connected to the entire network. For node-level scenarios, embeddings from the large model represent additional nodes connected only to corresponding nodes. According to our experiments on the New York Bike dataset, this approach shows a significant improvement in prediction accuracy.
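Summary: Traffic prediction is essential for rational scheduling and allocation of transportation supply. However, existing research on short-term traffic prediction faces challenges in adequately handling exceptional circumstances and in integrating non-numerical contextual information, such as weather, into models. Large language models offer a promising solution because of their inherent world knowledge. However, using them directly for traffic prediction has drawbacks such as high cost, lack of determinism, and limited mathematical capability. To mitigate these issues, this study proposes a new approach. Rather than using large models directly for prediction, it uses them to process textual information and obtain embeddings. These embeddings are then combined with historical traffic data and fed into traditional spatiotemporal forecasting models. The study investigates two types of special scenarios: regional-level and node-level. For regional-level scenarios, textual information is represented as a node connected to the entire network. For node-level scenarios, embeddings from the large model represent additional nodes connected only to the corresponding nodes. According to our experiments on the New York Bike dataset, this approach shows a significant improvement in prediction accuracy.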
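
The two graph constructions named in this abstract can be illustrated on a plain adjacency matrix. This Python sketch covers only the topology: a regional-level text node linked to every station, and a node-level text node linked to a single station. How the text embedding is attached as that node's features is not specified in the abstract and is left out; the function names and the 3-station example are assumptions.

```python
def add_regional_node(adj):
    """Append one text node that is connected to every existing node."""
    n = len(adj)
    extended = [row + [1] for row in adj]   # every station links to the text node
    extended.append([1] * n + [0])          # text node links back, no self-loop
    return extended

def add_node_level_node(adj, target):
    """Append one text node that is connected only to station `target`."""
    n = len(adj)
    extended = [row + [1 if i == target else 0] for i, row in enumerate(adj)]
    extended.append([1 if j == target else 0 for j in range(n)] + [0])
    return extended

# Hypothetical network of three stations in a line: 0 - 1 - 2.
adj = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]

regional = add_regional_node(adj)          # e.g. a city-wide weather report
local = add_node_level_node(adj, target=2) # e.g. an event near station 2
print(regional[-1], local[-1])
```

Both helpers return a new matrix and leave the input untouched, so the original station graph can be reused for each scenario.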

Title: Opportunities for Persian Digital Humanities Research with Artificial Intelligence Language Models; Case Study: Forough Farrokhzad

Authors: Arash Rasti Meymandi, Zahra Hosseini, Sina Davari, Abolfazl Moshiri, Shabnam Rahimi-Golkhandan, Khashayar Namdar, Nikta Feizi, Mohamad Tavakoli-Targhi, Farzad Khalvati
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.06760
Pdf URL: https://arxiv.org/pdf/2405.06760
Copy Paste: [[2405.06760]] Opportunities for Persian Digital Humanities Research with Artificial Intelligence Language Models; Case Study: Forough Farrokhzad(https://arxiv.org/abs/2405.06760)
Keywords: language model
Abstract: This study explores the integration of advanced Natural Language Processing (NLP) and Artificial Intelligence (AI) techniques to analyze and interpret Persian literature, focusing on the poetry of Forough Farrokhzad. Utilizing computational methods, we aim to unveil thematic, stylistic, and linguistic patterns in Persian poetry. Specifically, the study employs AI models including transformer-based language models for clustering of the poems in an unsupervised framework. This research underscores the potential of AI in enhancing our understanding of Persian literary heritage, with Forough Farrokhzad's work providing a comprehensive case study. This approach not only contributes to the field of Persian Digital Humanities but also sets a precedent for future research in Persian literary studies using computational techniques.
摘要：本研究探讨了先进的自然语言处理 (NLP) 和人工智能 (AI) 技术的整合来分析和解释波斯文学，重点关注福罗·法罗赫扎德 (Forough Farrokhzad) 的诗歌。利用计算方法，我们的目标是揭示波斯诗歌的主题、风格和语言模式。具体来说，该研究采用人工智能模型，包括基于变压器的语言模型，在无监督框架中对诗歌进行聚类。这项研究强调了人工智能在增强我们对波斯文学遗产的理解方面的潜力，福罗·法罗赫扎德 (Forough Farrokhzad) 的作品提供了全面的案例研究。这种方法不仅对波斯数字人文领域做出了贡献，而且为未来使用计算技术进行波斯文学研究奠定了先例。

Title: LLM-Generated Black-box Explanations Can Be Adversarially Helpful

Authors: Rohan Ajwani, Shashidhar Reddy Javaji, Frank Rudzicz, Zining Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.06800
Pdf URL: https://arxiv.org/pdf/2405.06800
Copy Paste: [[2405.06800]] LLM-Generated Black-box Explanations Can Be Adversarially Helpful(https://arxiv.org/abs/2405.06800)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are becoming vital tools that help us solve and understand complex problems by acting as digital assistants. LLMs can generate convincing explanations, even when only given the inputs and outputs of these problems, i.e., in a "black-box" approach. However, our research uncovers a hidden risk tied to this approach, which we call *adversarial helpfulness*. This happens when an LLM's explanations make a wrong answer look right, potentially leading people to trust incorrect solutions. In this paper, we show that this issue affects not just humans, but also LLM evaluators. Digging deeper, we identify and examine key persuasive strategies employed by LLMs. Our findings reveal that these models employ strategies such as reframing the questions, expressing an elevated level of confidence, and cherry-picking evidence to paint misleading answers in a credible light. To examine if LLMs are able to navigate complex-structured knowledge when generating adversarially helpful explanations, we create a special task based on navigating through graphs. Some LLMs are not able to find alternative paths along simple graphs, indicating that their misleading explanations aren't produced only by logical deductions using complex knowledge. These findings shed light on the limitations of the black-box explanation setting. We provide some advice on how to use LLMs as explainers safely.
摘要：大型语言模型 (LLM) 正在成为重要的工具，通过充当数字助理来帮助我们解决和理解复杂的问题。即使仅给出这些问题的输入和输出，即采用“黑匣子”方法，法学硕士也可以生成令人信服的解释。然而，我们的研究发现了与这种方法相关的隐藏风险，我们称之为“对抗性帮助”。当法学硕士的解释使错误的答案看起来正确时，就会发生这种情况，可能会导致人们相信错误的解决方案。在本文中，我们表明这个问题不仅影响人类，还影响法学硕士评估者。通过更深入的挖掘，我们确定并研究了法学硕士所采用的关键说服策略。我们的研究结果表明，这些模型采用了重新构建问题、表达更高水平的信心以及挑选证据等策略，以可信的方式描绘出误导性的答案。为了检查法学硕士在生成对抗性有用的解释时是否能够导航复杂结构的知识，我们创建了一个基于图形导航的特殊任务。一些法学硕士无法沿着简单的图表找到替代路径，这表明他们的误导性解释并不是仅通过使用复杂知识的逻辑推论产生的。这些发现揭示了黑盒解释设置的局限性。我们提供一些关于如何安全地使用法学硕士作为解释者的建议。

Title: Tackling Execution-Based Evaluation for NL2Bash

Authors: Ngoc Phuoc An Vo, Brent Paulovicks, Vadim Sheinin
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2405.06807
Pdf URL: https://arxiv.org/pdf/2405.06807
Copy Paste: [[2405.06807]] Tackling Execution-Based Evaluation for NL2Bash(https://arxiv.org/abs/2405.06807)
Keywords: language model, llm, prompt
Abstract: Given recent advancements in Large Language Models (LLMs), the task of translating natural language prompts into different programming languages (code generation) has attracted immense attention for its wide application in different domains. In particular, code generation for Bash (NL2Bash) is widely used to generate Bash scripts for automating different tasks, such as performance monitoring, compilation, system administration, system diagnostics, etc. Besides code generation, validating synthetic code is critical before using it in any application. Different methods for code validation have been proposed, both direct (execution evaluation) and indirect (i.e., exact/partial match, BLEU score). Among these, Execution-based Evaluation (EE) can validate the predicted code by comparing the execution output of the model prediction with the expected output in the system. However, designing and implementing such an execution-based evaluation system for NL2Bash is not a trivial task. In this paper, we present a machinery for execution-based evaluation for NL2Bash. We create a set of 50 prompts to evaluate some popular LLMs for NL2Bash. We also analyze several advantages and challenges of EE, such as syntactically different yet semantically equivalent Bash scripts generated by different LLMs, or syntactically correct but semantically incorrect Bash scripts, and how we capture and process them correctly.
摘要：鉴于大型语言模型（LLM）的最新进展，从自然语言提示翻译为不同编程语言（代码生成）的任务因其在不同领域的广泛应用而引起了极大的关注。特别是 Bash 代码生成 (NL2Bash) 广泛用于生成 Bash 脚本，用于自动执行不同的任务，例如性能监控、编译、系统管理、系统诊断等。除了代码生成之外，在将合成代码用于任何应用程序之前验证合成代码也至关重要。提出了不同的代码验证方法，包括直接（执行评估）和间接验证（即精确/部分匹配、BLEU 分数）。其中，基于执行的评估（EE）可以通过比较模型预测的执行输出和系统中的预期输出来验证预测的代码。然而，为 NL2Bash 设计和实现这样一个基于执行的评估系统并不是一项简单的任务。在本文中，我们提出了一种用于 NL2Bash 基于执行的评估的机制。我们创建了一组 50 个提示来评估一些流行的 NL2Bash 法学硕士。我们还分析了 EE 的几个优点和挑战，例如不同 LLM 生成的语法不同但语义等效的 Bash 脚本，或者语法正确但语义不正确的 Bash 脚本，以及我们如何正确捕获和处理它们。
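
Note: the core idea of execution-based evaluation described in the abstract above, running the predicted script and comparing its output with the expected output, can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's machinery: the function names, the whitespace normalization, and the timeout are hypothetical choices, and the sketch assumes a `bash` executable on the PATH. Model-generated scripts are untrusted, so a real harness would run them in a sandbox or container.

```python
import subprocess


def run_bash(script: str, timeout: float = 5.0) -> str:
    """Run a Bash script string and return its standard output.

    Assumes `bash` is installed. Untrusted scripts should only be run
    inside an isolated environment.
    """
    result = subprocess.run(
        ["bash", "-c", script],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout


def normalize(output: str) -> str:
    """Strip trailing whitespace on each line and drop trailing blank lines."""
    lines = [line.rstrip() for line in output.splitlines()]
    while lines and not lines[-1]:
        lines.pop()
    return "\n".join(lines)


def outputs_match(predicted_output: str, expected_output: str) -> bool:
    """Compare two outputs after normalization (a hypothetical matching rule)."""
    return normalize(predicted_output) == normalize(expected_output)


def evaluate(predicted_script: str, expected_output: str) -> bool:
    """Execute the predicted script and check its output against the expectation."""
    return outputs_match(run_bash(predicted_script), expected_output)


if __name__ == "__main__":
    # Two syntactically different scripts that produce the same output.
    print(evaluate("echo hello", "hello"))
    print(evaluate("printf 'hello\\n'", "hello"))
    # A syntactically valid script with the wrong behaviour.
    print(evaluate("echo goodbye", "hello"))
```

Comparing outputs rather than script text is what lets two differently written but equivalent scripts both pass, while a well-formed script with the wrong behaviour fails.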

Title: TacoERE: Cluster-aware Compression for Event Relation Extraction

Authors: Yong Guan, Xiaozhi Wang, Lei Hou, Juanzi Li, Jeff Pan, Jiaoyan Chen, Freddy Lecue
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.06890
Pdf URL: https://arxiv.org/pdf/2405.06890
Copy Paste: [[2405.06890]] TacoERE: Cluster-aware Compression for Event Relation Extraction(https://arxiv.org/abs/2405.06890)
Keywords: language model, gpt, chat
Abstract: Event relation extraction (ERE) is a critical and fundamental challenge for natural language processing. Existing work mainly focuses on directly modeling the entire document, which cannot effectively handle long-range dependencies and information redundancy. To address these issues, we propose a cluster-aware compression method for improving event relation extraction (TacoERE), which explores a compression-then-extraction paradigm. Specifically, we first introduce document clustering for modeling event dependencies. It splits the document into intra- and inter-clusters, where intra-clusters aim to enhance the relations within the same cluster, while inter-clusters attempt to model the related events at arbitrary distances. Secondly, we utilize cluster summarization to simplify and highlight important text content of clusters for mitigating information redundancy and event distance. We have conducted extensive experiments on both pre-trained language models, such as RoBERTa, and large language models, such as ChatGPT and GPT-4, on three ERE datasets, i.e., MAVEN-ERE, EventStoryLine and HiEve. Experimental results demonstrate that TacoERE is an effective method for ERE.
摘要：事件关系提取（ERE）是自然语言处理的一个关键且基本的挑战。现有工作主要集中于直接对整个文档进行建模，无法有效处理长程依赖和信息冗余。为了解决这些问题，我们提出了一种用于改进事件关系提取的集群感知压缩方法（TacoERE），该方法探索了压缩然后提取的范例。具体来说，我们首先引入文档聚类来建模事件依赖关系。它将文档分为簇内和簇间，其中簇内旨在增强同一簇内的关系，而簇间则尝试对任意距离的相关事件进行建模。其次，我们利用聚类摘要来简化和突出聚类的重要文本内容，以减轻信息冗余和事件距离。我们在 MAVEN-ERE、EventStoryLine 和 HiEve 三个 ERE 数据集上对预训练语言模型（例如 RoBERTa）和大型语言模型（例如 ChatGPT 和 GPT-4）进行了广泛的实验。实验结果表明TacoERE是一种有效的ERE方法。

Title: Finding structure in logographic writing with library learning

Authors: Guangyuan Jiang, Matthias Hofer, Jiayuan Mao, Lionel Wong, Joshua B. Tenenbaum, Roger P. Levy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.06906
Pdf URL: https://arxiv.org/pdf/2405.06906
Copy Paste: [[2405.06906]] Finding structure in logographic writing with library learning(https://arxiv.org/abs/2405.06906)
Keywords: llm
Abstract: One hallmark of human language is its combinatoriality -- reusing a relatively small inventory of building blocks to create a far larger inventory of increasingly complex structures. In this paper, we explore the idea that combinatoriality in language reflects a human inductive bias toward representational efficiency in symbol systems. We develop a computational framework for discovering structure in a writing system. Built on top of state-of-the-art library learning and program synthesis techniques, our computational framework discovers known linguistic structures in the Chinese writing system and reveals how the system evolves towards simplification under pressures for representational efficiency. We demonstrate how a library learning approach, utilizing learned abstractions and compression, may help reveal the fundamental computational principles that underlie the creation of combinatorial structures in human cognition, and offer broader insights into the evolution of efficient communication systems.
摘要：人类语言的一个标志是它的组合性——重复使用相对较小的构建块库存来创建更大的日益复杂的结构库存。在本文中，我们探讨了这样的观点：语言中的组合性反映了人类对符号系统中表示效率的归纳偏见。我们开发了一个用于发现书写系统结构的计算框架。我们的计算框架建立在最先进的图书馆学习和程序综合技术之上，可以发现中文书写系统中已知的语言结构，并揭示该系统如何在表示效率的压力下朝着简化的方向发展。我们展示了图书馆学习方法如何利用学习到的抽象和压缩来帮助揭示人类认知中组合结构创建的基本计算原理，并为高效通信系统的演变提供更广泛的见解。

Title: CoRE: LLM as Interpreter for Natural Language Programming, Pseudo-Code Programming, and Flow Programming of AI Agents

Authors: Shuyuan Xu, Zelong Li, Kai Mei, Yongfeng Zhang
Subjects: cs.CL, cs.AI, cs.LG, cs.PL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] CoRE: LLM as Interpreter for Natural Language Programming, Pseudo-Code Programming, and Flow Programming of AI Agents(https://arxiv.org/abs/)
Keywords: language model, llm, agent
Abstract: Since their inception, programming languages have trended towards greater readability and lower barriers for programmers. Following this trend, natural language can be a promising type of programming language that provides great flexibility and usability and helps towards the democracy of programming. However, the inherent vagueness, ambiguity, and verbosity of natural language pose significant challenges in developing an interpreter that can accurately understand the programming logic and execute instructions written in natural language. Fortunately, recent advancements in Large Language Models (LLMs) have demonstrated remarkable proficiency in interpreting complex natural language. Inspired by this, we develop a novel system for Code Representation and Execution (CoRE), which employs LLM as interpreter to interpret and execute natural language instructions. The proposed system unifies natural language programming, pseudo-code programming, and flow programming under the same representation for constructing language agents, while LLM serves as the interpreter to interpret and execute the agent programs. In this paper, we begin with defining the programming syntax that structures natural language instructions logically. During the execution, we incorporate external memory to minimize redundancy. Furthermore, we equip the designed interpreter with the capability to invoke external tools, compensating for the limitations of LLM in specialized domains or when accessing real-time information. This work is open-source at this https URL.
摘要：自诞生以来，编程语言一直朝着提高程序员可读性和降低障碍的方向发展。遵循这一趋势，自然语言可能成为一种有前途的编程语言，它提供了极大的灵活性和可用性，并有助于实现编程的民主化。然而，自然语言固有的模糊性、歧义性和冗长性给开发能够准确理解编程逻辑并执行用自然语言编写的指令的解释器带来了重大挑战。幸运的是，大型语言模型 (LLM) 的最新进展已经证明了其在解释复杂自然语言方面的卓越能力。受此启发，我们开发了一种新颖的代码表示和执行（CoRE）系统，该系统采用LLM作为解释器来解释和执行自然语言指令。该系统将自然语言编程、伪代码编程和流程编程统一在同一表示下来构建语言代理，而LLM则充当解释器来解释和执行代理程序。在本文中，我们首先定义逻辑地构造自然语言指令的编程语法。在执行过程中，我们合并了外部存储器以最大限度地减少冗余。此外，我们为设计的解释器配备了调用外部工具的能力，弥补了LLM在专业领域或访问实时信息时的局限性。这项工作是在 https URL 上开源的。

Title: Quite Good, but Not Enough: Nationality Bias in Large Language Models -- A Case Study of ChatGPT

Authors: Shucheng Zhu, Weikang Wang, Ying Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Quite Good, but Not Enough: Nationality Bias in Large Language Models -- A Case Study of ChatGPT(https://arxiv.org/abs/)
Keywords: language model, gpt, llm, prompt, chat
Abstract: While nationality is a pivotal demographic element that enhances the performance of language models, it has received far less scrutiny regarding inherent biases. This study investigates nationality bias in ChatGPT (GPT-3.5), a large language model (LLM) designed for text generation. The research covers 195 countries, 4 temperature settings, and 3 distinct prompt types, generating 4,680 discourses about nationality descriptions in Chinese and English. Automated metrics were used to analyze the nationality bias, and expert annotators alongside ChatGPT itself evaluated the perceived bias. The results show that ChatGPT's generated discourses are predominantly positive, especially compared to its predecessor, GPT-2. However, when prompted with negative inclinations, it occasionally produces negative content. Despite ChatGPT considering its generated text as neutral, it shows consistent self-awareness about nationality bias when subjected to the same pair-wise comparison annotation framework used by human annotators. In conclusion, while ChatGPT's generated texts seem friendly and positive, they reflect the inherent nationality biases in the real world. This bias may vary across different language versions of ChatGPT, indicating diverse cultural perspectives. The study highlights the subtle and pervasive nature of biases within LLMs, emphasizing the need for further scrutiny.
摘要：虽然国籍是提高语言模型性能的关键人口统计因素，但它受到的关于固有偏见的审查要少得多。本研究调查了 ChatGPT (GPT-3.5) 中的国籍偏见，这是一种专为文本生成而设计的大型语言模型 (LLM)。该研究涵盖了195个国家、4种温度设置和3种不同的提示类型，生成了4,680条有关中英文国籍描述的话语。使用自动化指标来分析国籍偏见，专家注释者与 ChatGPT 本身一起评估感知到的偏见。结果表明，ChatGPT 生成的话语主要是积极的，特别是与其前身 GPT-2 相比。然而，当出现负面倾向时，它偶尔会产生负面内容。尽管 ChatGPT 认为其生成的文本是中性的，但当受到与人类注释者使用的相同的成对比较注释框架时，它表现出了对国籍偏见的一致的自我意识。总之，虽然 ChatGPT 生成的文本看起来友好且积极，但它们反映了现实世界中固有的国籍偏见。这种偏见可能因 ChatGPT 的不同语言版本而异，表明文化观点不同。该研究强调了法学硕士内部偏见的微妙和普遍性，强调需要进一步审查。
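
Note: the figure of 4,680 discourses in the abstract above is consistent with a full grid over the stated factors, assuming exactly one generated discourse per combination of country, temperature setting, prompt type, and language. That one-per-combination reading is an assumption, not something the abstract states; the constant and function names below are hypothetical.

```python
from itertools import product

NUM_COUNTRIES = 195
NUM_TEMPERATURES = 4
NUM_PROMPT_TYPES = 3
LANGUAGES = ("Chinese", "English")


def count_discourses() -> int:
    """Count every (country, temperature, prompt type, language) combination."""
    grid = product(
        range(NUM_COUNTRIES),
        range(NUM_TEMPERATURES),
        range(NUM_PROMPT_TYPES),
        LANGUAGES,
    )
    return sum(1 for _ in grid)


if __name__ == "__main__":
    # 195 * 4 * 3 = 2,340 combinations per language, 4,680 across both.
    print(count_discourses())
```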

Title: Evaluating Task-based Effectiveness of MLLMs on Charts

Authors: Yifan Wu, Lutao Yan, Yuyu Luo, Yunhai Wang, Nan Tang
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2405.07001
Pdf URL: https://arxiv.org/pdf/2405.07001
Copy Paste: [[2405.07001]] Evaluating Task-based Effectiveness of MLLMs on Charts(https://arxiv.org/abs/2405.07001)
Keywords: gpt, llm, prompt
Abstract: In this paper, we explore a forward-thinking question: Is GPT-4V effective at low-level data analysis tasks on charts? To this end, we first curate a large-scale dataset, named ChartInsights, consisting of 89,388 quartets (chart, task, question, answer) and covering 10 widely-used low-level data analysis tasks on 7 chart types. Firstly, we conduct systematic evaluations to understand the capabilities and limitations of 18 advanced MLLMs, which include 12 open-source models and 6 closed-source models. Starting with a standard textual prompt approach, the average accuracy rate across the 18 MLLMs is 36.17%. Among all the models, GPT-4V achieves the highest accuracy, reaching 56.13%. To understand the limitations of multimodal large models in low-level data analysis tasks, we have designed various experiments to conduct an in-depth test of capabilities of GPT-4V. We further investigate how visual modifications to charts, such as altering visual elements (e.g. changing color schemes) and introducing perturbations (e.g. adding image noise), affect performance of GPT-4V. Secondly, we present 12 experimental findings. These findings suggest potential of GPT-4V to revolutionize interaction with charts and uncover the gap between human analytic needs and capabilities of GPT-4V. Thirdly, we propose a novel textual prompt strategy, named Chain-of-Charts, tailored for low-level analysis tasks, which boosts model performance by 24.36%, resulting in an accuracy of 80.49%. Furthermore, by incorporating a visual prompt strategy that directs attention of GPT-4V to question-relevant visual elements, we further improve accuracy to 83.83%. Our study not only sheds light on the capabilities and limitations of GPT-4V in low-level data analysis tasks but also offers valuable insights for future research.
摘要：在本文中，我们探讨了一个前瞻性问题：GPT-4V 在图表上的低级数据分析任务中是否有效？为此，我们首先整理了一个名为 ChartInsights 的大型数据集，由 89,388 个四组（图表、任务、问题、答案）组成，涵盖 7 种图表类型的 10 个广泛使用的低级数据分析任务。首先，我们进行系统评估，以了解 18 个高级 MLLM 的能力和局限性，其中包括 12 个开源模型和 6 个闭源模型。从标准文本提示方法开始，18 个 MLLM 的平均准确率为 36.17%。在所有模型中，GPT-4V 的准确率最高，达到 56.13%。为了了解多模态大模型在低级数据分析任务中的局限性，我们设计了各种实验来对GPT-4V的能力进行深入测试。我们进一步研究对图表的视觉修改，例如改变视觉元素（例如改变配色方案）和引入扰动（例如添加图像噪声）如何影响 GPT-4V 的性能。其次，我们提出了 12 项实验结果。这些发现表明 GPT-4V 有可能彻底改变与图表的交互，并揭示人类分析需求与 GPT-4V 功能之间的差距。第三，我们提出了一种新颖的文本提示策略，名为Chain-of-Charts，专为低级分析任务量身定制，将模型性能提高了24.36%，准确率达到80.49%。此外，通过结合视觉提示策略，将 GPT-4V 的注意力引导到与问题相关的视觉元素，我们进一步将准确率提高到 83.83%。我们的研究不仅揭示了 GPT-4V 在低级数据分析任务中的能力和局限性，而且为未来的研究提供了宝贵的见解。
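
Note: the accuracy figures in the abstract above fit together if the 24.36% gain from Chain-of-Charts is read as absolute percentage points added to GPT-4V's 56.13% baseline. That reading is an assumption checked by the arithmetic below, not a statement from the authors; the variable and function names are hypothetical. `Decimal` is used so the comparison is exact.

```python
from decimal import Decimal

BASELINE = Decimal("56.13")   # GPT-4V accuracy with the standard textual prompt
GAIN = Decimal("24.36")       # reported boost from Chain-of-Charts
REPORTED = Decimal("80.49")   # reported accuracy after Chain-of-Charts


def with_absolute_gain(baseline: Decimal, gain: Decimal) -> Decimal:
    """Treat the gain as percentage points added to the baseline."""
    return baseline + gain


def with_relative_gain(baseline: Decimal, gain: Decimal) -> Decimal:
    """Treat the gain as a relative improvement over the baseline."""
    return baseline * (1 + gain / Decimal(100))


if __name__ == "__main__":
    print(with_absolute_gain(BASELINE, GAIN) == REPORTED)
    print(with_relative_gain(BASELINE, GAIN) == REPORTED)
```

Only the absolute reading reproduces the reported 80.49%; a relative gain of 24.36% over 56.13% would give roughly 69.8%.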

Title: A Turkish Educational Crossword Puzzle

Authors: Kamyar Zeinalipour, Yusuf Gökberk Keptiğ, Marco Maggini, Leonardo Rigutini, Marco Gori
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] A Turkish Educational Crossword Puzzle(https://arxiv.org/abs/)
Keywords: language model, llm
Abstract: This paper introduces the first Turkish crossword puzzle generator designed to leverage the capabilities of large language models (LLMs) for educational purposes. In this work, we introduced two specially created datasets: one with over 180,000 unique answer-clue pairs for generating relevant clues from the given answer, and another with over 35,000 samples containing text, answer, category, and clue data, aimed at producing clues for specific texts and keywords within certain categories. Beyond entertainment, this generator emerges as an interactive educational tool that enhances memory, vocabulary, and problem-solving skills. It's a notable step in AI-enhanced education, merging game-like engagement with learning for Turkish and setting new standards for interactive, intelligent learning tools in Turkish.
摘要：本文介绍了第一个土耳其语填字游戏生成器，旨在利用大语言模型 (LLM) 的功能来实现教育目的。在这项工作中，我们引入了两个专门创建的数据集：一个包含超过 180,000 个独特的答案线索对，用于从给定答案生成相关线索，另一个包含超过 35,000 个包含文本、答案、类别和线索数据的样本，旨在生成线索对于某些类别内的特定文本和关键字。除了娱乐之外，该生成器还作为一种交互式教育工具出现，可以增强记忆力、词汇量和解决问题的能力。这是人工智能增强教育领域迈出的重要一步，将游戏式参与与土耳其语学习相结合，并为土耳其语交互式智能学习工具设定了新标准。

Title: Length-Aware Multi-Kernel Transformer for Long Document Classification

Authors: Guangzeng Han, Jack Tsao, Xiaolei Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.07052
Pdf URL: https://arxiv.org/pdf/2405.07052
Copy Paste: [[2405.07052]] Length-Aware Multi-Kernel Transformer for Long Document Classification(https://arxiv.org/abs/2405.07052)
Keywords: language model
Abstract: Lengthy documents pose a unique challenge to neural language models due to substantial memory consumption. While existing state-of-the-art (SOTA) models segment long texts into equal-length snippets (e.g., 128 tokens per snippet) or deploy sparse attention networks, these methods face new challenges of context fragmentation and generalizability due to sentence boundaries and varying text lengths. For example, our empirical analysis has shown that SOTA models consistently overfit one set of lengthy documents (e.g., 2000 tokens) while performing worse on texts with other lengths (e.g., 1000 or 4000). In this study, we propose a Length-Aware Multi-Kernel Transformer (LAMKIT) to address these new challenges for long document classification. LAMKIT encodes lengthy documents by diverse transformer-based kernels for bridging context boundaries and vectorizes text length by the kernels to promote model robustness over varying document lengths. Experiments on five standard benchmarks from health and law domains show LAMKIT outperforms SOTA models by up to an absolute 10.9%. We conduct extensive ablation analyses to examine model robustness and effectiveness over varying document lengths.
摘要：由于大量的内存消耗，冗长的文档对神经语言模型构成了独特的挑战。虽然现有最先进的 (SOTA) 模型将长文本分割成等长的片段（例如，每个片段 128 个标记）或部署稀疏注意力网络，但由于句子边界和不同的文本长度。例如，我们的实证分析表明，SOTA 模型始终过度拟合一组冗长的文档（例如 2000 个标记），而在其他长度的文本（例如 1000 或 4000）上表现较差。在本研究中，我们提出了一种长度感知多内核变压器（LAMKIT）来解决长文档分类的新挑战。 LAMKIT 通过各种基于转换器的内核对长文档进行编码，以桥接上下文边界，并通过内核对文本长度进行矢量化，以提高模型在不同文档长度上的鲁棒性。对来自健康和法律领域的五个标准基准的实验表明，LAMKIT 的性能优于 SOTA 模型，绝对提升了 10.9%。我们进行广泛的消融分析，以检查模型在不同文档长度下的稳健性和有效性。

Title: Integrating Emotional and Linguistic Models for Ethical Compliance in Large Language Models

Authors: Edward Y. Chang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.07076
Pdf URL: https://arxiv.org/pdf/2405.07076
Copy Paste: [[2405.07076]] Integrating Emotional and Linguistic Models for Ethical Compliance in Large Language Models(https://arxiv.org/abs/2405.07076)
Keywords: language model, llm
Abstract: This research develops advanced methodologies for Large Language Models (LLMs) to better manage linguistic behaviors related to emotions and ethics. We introduce DIKE, an adversarial framework that enhances the LLMs' ability to internalize and reflect global human values, adapting to varied cultural contexts to promote transparency and trust among users. The methodology involves detailed modeling of emotions, classification of linguistic behaviors, and implementation of ethical guardrails. Our innovative approaches include mapping emotions and behaviors using self-supervised learning techniques, refining these guardrails through adversarial reviews, and systematically adjusting outputs to ensure ethical alignment. This framework establishes a robust foundation for AI systems to operate with ethical integrity and cultural sensitivity, paving the way for more responsible and context-aware AI interactions.
摘要：这项研究开发了大型语言模型（LLM）的先进方法，以更好地管理与情感和道德相关的语言行为。我们引入了 DIKE，这是一个对抗性框架，可以增强法学硕士内化和反映全球人类价值观的能力，适应不同的文化背景，以提高用户之间的透明度和信任。该方法涉及详细的情感建模、语言行为分类以及道德护栏的实施。我们的创新方法包括使用自我监督学习技术绘制情绪和行为，通过对抗性审查完善这些护栏，以及系统地调整输出以确保道德一致性。该框架为人工智能系统以道德诚信和文化敏感性运作奠定了坚实的基础，为更负责任和情境感知的人工智能交互铺平了道路。

Title: Do Pretrained Contextual Language Models Distinguish between Hebrew Homograph Analyses?

Authors: Avi Shmidman, Cheyn Shmuel Shmidman, Dan Bareket, Moshe Koppel, Reut Tsarfaty
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.07099
Pdf URL: https://arxiv.org/pdf/2405.07099
Copy Paste: [[2405.07099]] Do Pretrained Contextual Language Models Distinguish between Hebrew Homograph Analyses?(https://arxiv.org/abs/2405.07099)
Keywords: language model
Abstract: Semitic morphologically-rich languages (MRLs) are characterized by extreme word ambiguity. Because most vowels are omitted in standard texts, many of the words are homographs with multiple possible analyses, each with a different pronunciation and different morphosyntactic properties. This ambiguity goes beyond word-sense disambiguation (WSD), and may include token segmentation into multiple word units. Previous research on MRLs claimed that standardly trained pre-trained language models (PLMs) based on word-pieces may not sufficiently capture the internal structure of such tokens in order to distinguish between these analyses. Taking Hebrew as a case study, we investigate the extent to which Hebrew homographs can be disambiguated and analyzed using PLMs. We evaluate all existing models for contextualized Hebrew embeddings on a novel Hebrew homograph challenge set that we deliver. Our empirical results demonstrate that contemporary Hebrew contextualized embeddings outperform non-contextualized embeddings, and that they are most effective for disambiguating segmentation and morphosyntactic features, less so regarding pure word-sense disambiguation. We show that these embeddings are more effective when the number of word-piece splits is limited, and they are more effective for 2-way and 3-way ambiguities than for 4-way ambiguity. We show that the embeddings are equally effective for homographs of both balanced and skewed distributions, whether calculated as masked or unmasked tokens. Finally, we show that these embeddings are as effective for homograph disambiguation with extensive supervised training as with a few-shot setup.
摘要：闪米特形态丰富的语言（MRL）的特点是单词极度歧义。由于标准文本中省略了大多数元音，因此许多单词是具有多种可能分析的同形异义词，每种分析都有不同的发音和不同的形态句法属性。这种歧义超出了词义消歧 (WSD) 范围，并且可能包括将标记分割为多个词单元。之前关于 MRL 的研究声称，基于单词片段的标准训练的预训练语言模型 (PLM) 可能无法充分捕获此类标记的内部结构，从而无法区分这些分析。以希伯来语作为案例研究，我们研究了使用 PLM 消除和分析希伯来语同形异义词的程度。我们在我们提供的新颖希伯来语同形词挑战集上评估所有现有的上下文希伯来语嵌入模型。我们的实证结果表明，当代希伯来语情境化嵌入优于非情境化嵌入；并且它们对于消除分段和形态句法特征的歧义最为有效，而对于纯粹的词义消歧则效果较差。我们证明，当词片分割的数量有限时，这些嵌入更有效，并且它们对于 2 向和 3 向歧义比对于 4 向歧义更有效。我们证明，嵌入对于平衡分布和倾斜分布的同形异义词同样有效，无论是计算为屏蔽标记还是未屏蔽标记。最后，我们证明这些嵌入对于通过广泛的监督训练进行同形异义词消歧与通过几次镜头设置一样有效。

Title: Advanced Natural-based interaction for the ITAlian language: LLaMAntino-3-ANITA

Authors: Marco Polignano, Pierpaolo Basile, Giovanni Semeraro
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.07101
Pdf URL: https://arxiv.org/pdf/2405.07101
Copy Paste: [[2405.07101]] Advanced Natural-based interaction for the ITAlian language: LLaMAntino-3-ANITA(https://arxiv.org/abs/2405.07101)
Keywords: language model, llm
Abstract: In the pursuit of advancing natural language processing for the Italian language, we introduce a state-of-the-art Large Language Model (LLM) based on the novel Meta LLaMA-3 model: LLaMAntino-3-ANITA-8B-Inst-DPO-ITA. We fine-tuned the original 8B-parameter instruction-tuned model using the Supervised Fine-tuning (SFT) technique on English and Italian language datasets in order to improve the original performance. Consequently, a Dynamic Preference Optimization (DPO) process has been used to align preferences, avoid dangerous and inappropriate answers, and limit biases and prejudices. Our model leverages the efficiency of QLoRA to fine-tune the model on a smaller portion of the original model weights and then adapt the model specifically for the Italian linguistic structure, achieving significant improvements in both performance and computational efficiency. Concurrently, DPO is employed to refine the model's output, ensuring that generated content aligns with quality answers. The synergy between SFT, QLoRA's parameter efficiency and DPO's user-centric optimization results in a robust LLM that excels in a variety of tasks, including but not limited to text completion, zero-shot classification, and contextual understanding. The model has been extensively evaluated over standard benchmarks for the Italian and English languages, showing outstanding results. The model is freely available on the HuggingFace hub, and examples of use can be found in our GitHub repository at this https URL.
摘要：为了推进意大利语自然语言处理的发展，我们引入了基于新颖的 Meta LLaMA-3 模型的最先进的大语言模型 (LLM)：LLaMAntino-3-ANITA-8B-Inst-DPO -ITA。我们在英语和意大利语数据集上使用监督微调（SFT）技术对原始 8B 参数指令调整模型进行微调，以提高原始性能。因此，动态偏好优化 (DPO) 流程已用于调整偏好、避免危险和不适当的答案，并限制偏见和成见。我们的模型利用 QLoRA 的效率，在原始模型权重的较小部分上对模型进行微调，然后专门针对意大利语语言结构调整模型，从而在性能和计算效率上实现显着提高。同时，DPO 用于完善模型的输出，确保生成的内容与质量答案保持一致。 SFT、QLoRA 的参数效率和 DPO 以用户为中心的优化之间的协同作用产生了强大的法学硕士，在各种任务中表现出色，包括但不限于文本完成、零样本分类和上下文理解。该模型已根据意大利语和英语的标准基准进行了广泛评估，显示出出色的结果。该模型可通过 HuggingFace 中心免费获取，并且可以在我们的 GitHub 存储库中找到使用示例。这个 https 网址

Title: Designing and Evaluating Dialogue LLMs for Co-Creative Improvised Theatre

Authors: Boyd Branch, Piotr Mirowski, Kory Mathewson, Sophia Ppali, Alexandra Covaci
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.07111
Pdf URL: https://arxiv.org/pdf/2405.07111
Copy Paste: [[2405.07111]] Designing and Evaluating Dialogue LLMs for Co-Creative Improvised Theatre(https://arxiv.org/abs/2405.07111)
Keywords: language model, llm, agent
Abstract: Social robotics researchers are increasingly interested in multi-party trained conversational agents. With a growing demand for real-world evaluations, our study presents Large Language Models (LLMs) deployed in a month-long live show at the Edinburgh Festival Fringe. This case study investigates human improvisers co-creating with conversational agents in a professional theatre setting. We explore the technical capabilities and constraints of on-the-spot multi-party dialogue, providing comprehensive insights from both audience and performer experiences with AI on stage. Our human-in-the-loop methodology underlines the challenges of these LLMs in generating context-relevant responses, stressing the user interface's crucial role. Audience feedback indicates an evolving interest in AI-driven live entertainment and direct human-AI interaction, along with a diverse range of expectations about AI's conversational competence and utility as a creativity support tool. Human performers express immense enthusiasm and varied satisfaction, and the evolving public opinion highlights mixed emotions about AI's role in the arts.
摘要：社交机器人研究人员对多方训练的对话代理越来越感兴趣。随着对现实世界评估的需求不断增长，我们的研究提出了在爱丁堡边缘艺术节为期一个月的现场表演中部署的大型语言模型 (LLM)。本案例研究调查了人类即兴创作者在专业剧院环境中与对话代理共同创作的情况。我们探索现场多方对话的技术能力和限制，从观众和表演者在舞台上的人工智能体验中提供全面的见解。我们的人机交互方法强调了这些法学硕士在生成上下文相关响应方面面临的挑战，强调了用户界面的关键作用。观众反馈表明，人们对人工智能驱动的现场娱乐、直接人机交互的兴趣日益浓厚，并对人工智能的对话能力和作为创造力支持工具的实用性抱有各种各样的期望。人类表演者表达了巨大的热情和不同的满足感，不断变化的公众舆论突显了人们对人工智能在艺术中的作用的复杂情绪。

Title: InsightNet: Structured Insight Mining from Customer Feedback

Authors: Sandeep Sricharan Mukku, Manan Soni, Jitenkumar Rana, Chetan Aggarwal, Promod Yenigalla, Rashmi Patange, Shyam Mohan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.07195
Pdf URL: https://arxiv.org/pdf/2405.07195
Copy Paste: [[2405.07195]] InsightNet: Structured Insight Mining from Customer Feedback(https://arxiv.org/abs/2405.07195)
Keywords: llm
Abstract: We propose InsightNet, a novel approach for the automated extraction of structured insights from customer reviews. Our end-to-end machine learning framework is designed to overcome the limitations of current solutions, including the absence of structure for identified topics, non-standard aspect names, and lack of abundant training data. The proposed solution builds a semi-supervised multi-level taxonomy from raw reviews, uses a semantic similarity heuristic approach to generate labelled data, and employs a multi-task insight extraction architecture by fine-tuning an LLM. InsightNet identifies granular actionable topics with customer sentiments and verbatim for each topic. Evaluations on real-world customer review data show that InsightNet performs better than existing solutions in terms of structure, hierarchy and completeness. We empirically demonstrate that InsightNet outperforms the current state-of-the-art methods in multi-label topic classification, achieving an F1 score of 0.85, which is an improvement of 11% F1-score over the previous best results. Additionally, InsightNet generalises well for unseen aspects and suggests new topics to be added to the taxonomy.
摘要：我们提出了 InsightNet，这是一种从客户评论中自动提取结构化见解的新颖方法。我们的端到端机器学习框架旨在克服当前解决方案的局限性，包括缺乏已识别主题的结构、不标准的方面名称以及缺乏丰富的训练数据。所提出的解决方案从原始评论中构建半监督的多级分类法，使用语义相似性启发式方法来生成标记数据，并通过微调法学硕士来采用多任务洞察提取架构。 InsightNet 通过客户情绪和每个主题的逐字来识别细粒度的可操作主题。对真实客户评论数据的评估表明，InsightNet 在结构、层次结构和完整性方面优于现有解决方案。我们凭经验证明，InsightNet 在多标签主题分类方面优于当前最先进的方法，实现了 0.85 的 F1 分数，比之前的最佳结果提高了 11%。此外，InsightNet 很好地概括了未见的方面，并建议将新主题添加到分类法中。

Title: Limited Ability of LLMs to Simulate Human Psychological Behaviours: a Psychometric Analysis

Authors: Nikolay B Petrov, Gregory Serapio-García, Jason Rentfrow
Subjects: cs.CL, cs.AI, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2405.07248
Pdf URL: https://arxiv.org/pdf/2405.07248
Copy Paste: [[2405.07248]] Limited Ability of LLMs to Simulate Human Psychological Behaviours: a Psychometric Analysis(https://arxiv.org/abs/2405.07248)
Keywords: language model, gpt, llm, prompt
Abstract: The humanlike responses of large language models (LLMs) have prompted social scientists to investigate whether LLMs can be used to simulate human participants in experiments, opinion polls and surveys. Of central interest in this line of research has been mapping out the psychological profiles of LLMs by prompting them to respond to standardized questionnaires. The conflicting findings of this research are unsurprising given that mapping out underlying, or latent, traits from LLMs' text responses to questionnaires is no easy task. To address this, we use psychometrics, the science of psychological measurement. In this study, we prompt OpenAI's flagship models, GPT-3.5 and GPT-4, to assume different personas and respond to a range of standardized measures of personality constructs. We used two kinds of persona descriptions: either generic (four or five random person descriptions) or specific (mostly demographics of actual humans from a large-scale human dataset). We found that the responses from GPT-4, but not GPT-3.5, using generic persona descriptions show promising, albeit not perfect, psychometric properties, similar to human norms, but the data from both LLMs, when using specific demographic profiles, show poor psychometric properties. We conclude that, currently, when LLMs are asked to simulate silicon personas, their responses are poor signals of potentially underlying latent traits. Thus, our work casts doubt on LLMs' ability to simulate individual-level human behaviour across multiple-choice question answering tasks.

Title: Human-interpretable clustering of short-text using large language models

Authors: Justin K. Miller, Tristram J. Alexander
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2405.07278
Pdf URL: https://arxiv.org/pdf/2405.07278
Copy Paste: [[2405.07278]] Human-interpretable clustering of short-text using large language models(https://arxiv.org/abs/2405.07278)
Keywords: language model, gpt, chat
Abstract: Large language models have seen extraordinary growth in popularity due to their human-like content generation capabilities. We show that these models can also be used to successfully cluster human-generated content, with success defined through the measures of distinctiveness and interpretability. This success is validated by both human reviewers and ChatGPT, providing an automated means to close the 'validation gap' that has challenged short-text clustering. Comparing the machine and human approaches we identify the biases inherent in each, and question the reliance on human-coding as the 'gold standard'. We apply our methodology to Twitter bios and find characteristic ways humans describe themselves, agreeing well with prior specialist work, but with interesting differences characteristic of the medium used to express identity.

Title: Humor Mechanics: Advancing Humor Generation with Multistep Reasoning

Authors: Alexey Tikhonov, Pavel Shtykovskiy
Subjects: cs.CL, cs.AI, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2405.07280
Pdf URL: https://arxiv.org/pdf/2405.07280
Copy Paste: [[2405.07280]] Humor Mechanics: Advancing Humor Generation with Multistep Reasoning(https://arxiv.org/abs/2405.07280)
Keywords: gpt
Abstract: In this paper, we explore the generation of one-liner jokes through multi-step reasoning. Our work involved reconstructing the process behind creating humorous one-liners and developing a working prototype for humor generation. We conducted comprehensive experiments with human participants to evaluate our approach, comparing it with human-created jokes, zero-shot GPT-4 generated humor, and other baselines. The evaluation focused on the quality of humor produced, using human labeling as a benchmark. Our findings demonstrate that the multi-step reasoning approach consistently improves the quality of generated humor. We present the results and share the datasets used in our experiments, offering insights into enhancing humor generation with artificial intelligence.

Title: Branching Narratives: Character Decision Points Detection

Authors: Alexey Tikhonov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.07282
Pdf URL: https://arxiv.org/pdf/2405.07282
Copy Paste: [[2405.07282]] Branching Narratives: Character Decision Points Detection(https://arxiv.org/abs/2405.07282)
Keywords: llm
Abstract: This paper presents the Character Decision Points Detection (CHADPOD) task, the task of identifying points within narratives where characters make decisions that may significantly influence the story's direction. We propose a novel dataset based on CYOA-like game graphs to be used as a benchmark for such a task. We provide a comparative analysis of different models' performance on this task, including a couple of LLMs and several MLMs as baselines, achieving up to 89% accuracy. This underscores the complexity of narrative analysis, showing the challenges associated with understanding character-driven story dynamics. Additionally, we show how such a model can be applied to the existing text to produce linear segments divided by potential branching points, demonstrating the practical application of our findings in narrative analysis.

Title: L(u)PIN: LLM-based Political Ideology Nowcasting

Authors: Ken Kato, Annabelle Purnomo, Christopher Cochrane, Raeid Saqur
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.07320
Pdf URL: https://arxiv.org/pdf/2405.07320
Copy Paste: [[2405.07320]] L(u)PIN: LLM-based Political Ideology Nowcasting(https://arxiv.org/abs/2405.07320)
Keywords: gpt, llm, prompt
Abstract: The quantitative analysis of political ideological positions is a difficult task. In the past, various literature focused on parliamentary voting data of politicians, party manifestos and parliamentary speech to estimate political disagreement and polarization in various political systems. However, previous methods of quantitative political analysis suffered from a common challenge, which was the amount of data available for analysis. Also, previous methods frequently focused on a more general analysis of politics, such as overall polarization of the parliament or party-wide political ideological positions. In this paper, we present a method to analyze ideological positions of individual parliamentary representatives by leveraging the latent knowledge of LLMs. The method allows us to evaluate the stance of politicians on an axis of our choice, allowing us to flexibly measure the stance of politicians with regard to a topic/controversy of our choice. We achieve this by using a fine-tuned BERT classifier to extract the opinion-based sentences from the speeches of representatives and projecting the average BERT embeddings for each representative on a pair of reference seeds. These reference seeds are either manually chosen representatives known to have opposing views on a particular topic or generated sentences that were created using the GPT-4 model of OpenAI. We created the sentences by prompting the GPT-4 model to generate a speech that would come from a politician defending a particular position.
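
The projection step in the abstract above can be illustrated with a minimal sketch. It assumes each representative is summarised by the mean of their sentence embeddings and that the two reference seeds define a one-dimensional axis; the function names, the toy 3-dimensional vectors, and the 0-to-1 normalisation are all hypothetical and are not taken from the paper.

```python
def mean_vector(vectors):
    """Average a non-empty list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def project_on_axis(embedding, seed_a, seed_b):
    """Scalar position of `embedding` along the axis from seed_a to seed_b.

    Returns 0.0 at seed_a and 1.0 at seed_b (an illustrative normalisation,
    not necessarily the one used by the authors).
    """
    axis = [b - a for a, b in zip(seed_a, seed_b)]
    norm_sq = sum(x * x for x in axis)
    if norm_sq == 0:
        raise ValueError("reference seeds must differ")
    offset = [e - a for e, a in zip(embedding, seed_a)]
    return sum(o * x for o, x in zip(offset, axis)) / norm_sq


# Toy "sentence embeddings" (hypothetical values, not real BERT outputs).
seed_for = [1.0, 0.0, 0.0]
seed_against = [0.0, 1.0, 0.0]
speaker_sentences = [[0.8, 0.2, 0.1], [0.6, 0.4, 0.3]]

position = project_on_axis(mean_vector(speaker_sentences), seed_for, seed_against)
```

Here `position` falls between the two seeds; values nearer 0 sit closer to `seed_for` and values nearer 1 closer to `seed_against`.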

Title: MedConceptsQA -- Open Source Medical Concepts QA Benchmark

Authors: Ofir Ben Shoham, Nadav Rappoport
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] MedConceptsQA -- Open Source Medical Concepts QA Benchmark(https://arxiv.org/abs/)
Keywords: language model, gpt
Abstract: We present MedConceptsQA, a dedicated open source benchmark for medical concepts question answering. The benchmark comprises questions on various medical concepts across different vocabularies: diagnoses, procedures, and drugs. The questions are categorized into three levels of difficulty: easy, medium, and hard. We conducted evaluations of the benchmark using various Large Language Models. Our findings show that pre-trained clinical Large Language Models achieved accuracy levels close to random guessing on this benchmark, despite being pre-trained on medical data. However, GPT-4 achieves an absolute average improvement of nearly 27%-37% (27% for zero-shot learning and 37% for few-shot learning) when compared to clinical Large Language Models. Our benchmark serves as a valuable resource for evaluating the understanding and reasoning of medical concepts by Large Language Models. Our benchmark is available at this https URL

Title: Evaluation of Retrieval-Augmented Generation: A Survey

Authors: Hao Yu, Aoran Gan, Kai Zhang, Shiwei Tong, Qi Liu, Zhaofeng Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.07437
Pdf URL: https://arxiv.org/pdf/2405.07437
Copy Paste: [[2405.07437]] Evaluation of Retrieval-Augmented Generation: A Survey(https://arxiv.org/abs/2405.07437)
Keywords: retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a pivotal innovation in natural language processing, enhancing generative models by incorporating external information retrieval. Evaluating RAG systems, however, poses distinct challenges due to their hybrid structure and reliance on dynamic knowledge sources. We consequently conducted an extensive survey and proposed an analysis framework for benchmarks of RAG systems, RGAR (Retrieval, Generation, Additional Requirement), designed to systematically analyze RAG benchmarks by focusing on measurable outputs and established truths. Specifically, we scrutinize and contrast multiple quantifiable metrics of the Retrieval and Generation components, such as relevance, accuracy, and faithfulness, of the internal links within the current RAG evaluation methods, covering the possible output and ground truth pairs. We also analyze the integration of additional requirements of different works, discuss the limitations of current benchmarks, and propose potential directions for further research to address these shortcomings and advance the field of RAG evaluation. In conclusion, this paper collates the challenges associated with RAG evaluation. It presents a thorough analysis and examination of existing methodologies for RAG benchmark design based on the proposed RGAR framework.

Title: MCS-SQL: Leveraging Multiple Prompts and Multiple-Choice Selection For Text-to-SQL Generation

Authors: Dongjun Lee, Choongwon Park, Jaehyuk Kim, Heesoo Park
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.07467
Pdf URL: https://arxiv.org/pdf/2405.07467
Copy Paste: [[2405.07467]] MCS-SQL: Leveraging Multiple Prompts and Multiple-Choice Selection For Text-to-SQL Generation(https://arxiv.org/abs/2405.07467)
Keywords: language model, llm, prompt
Abstract: Recent advancements in large language models (LLMs) have enabled in-context learning (ICL)-based methods that significantly outperform fine-tuning approaches for text-to-SQL tasks. However, their performance is still considerably lower than that of human experts on benchmarks that include complex schemas and queries, such as BIRD. This study considers the sensitivity of LLMs to the prompts and introduces a novel approach that leverages multiple prompts to explore a broader search space for possible answers and effectively aggregate them. Specifically, we robustly refine the database schema through schema linking using multiple prompts. Thereafter, we generate various candidate SQL queries based on the refined schema and diverse prompts. Finally, the candidate queries are filtered based on their confidence scores, and the optimal query is obtained through a multiple-choice selection that is presented to the LLM. When evaluated on the BIRD and Spider benchmarks, the proposed method achieved execution accuracies of 65.5% and 89.6%, respectively, significantly outperforming previous ICL-based methods. Moreover, we established a new SOTA performance on the BIRD in terms of both the accuracy and efficiency of the generated queries.

Title: Evaluating large language models in medical applications: a survey

Authors: Xiaolan Chen, Jiayang Xiang, Shanfu Lu, Yexin Liu, Mingguang He, Danli Shi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.07468
Pdf URL: https://arxiv.org/pdf/2405.07468
Copy Paste: [[2405.07468]] Evaluating large language models in medical applications: a survey(https://arxiv.org/abs/2405.07468)
Keywords: language model, llm
Abstract: Large language models (LLMs) have emerged as powerful tools with transformative potential across numerous domains, including healthcare and medicine. In the medical domain, LLMs hold promise for tasks ranging from clinical decision support to patient education. However, evaluating the performance of LLMs in medical contexts presents unique challenges due to the complex and critical nature of medical information. This paper provides a comprehensive overview of the landscape of medical LLM evaluation, synthesizing insights from existing studies and highlighting evaluation data sources, task scenarios, and evaluation methods. Additionally, it identifies key challenges and opportunities in medical LLM evaluation, emphasizing the need for continued research and innovation to ensure the responsible integration of LLMs into clinical practice.

Title: Strategic Data Ordering: Enhancing Large Language Model Performance through Curriculum Learning

Authors: Jisu Kim, Juhwan Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.07490
Pdf URL: https://arxiv.org/pdf/2405.07490
Copy Paste: [[2405.07490]] Strategic Data Ordering: Enhancing Large Language Model Performance through Curriculum Learning(https://arxiv.org/abs/2405.07490)
Keywords: language model, llm, prompt
Abstract: The rapid advancement of Large Language Models (LLMs) has improved text understanding and generation but poses challenges in computational resources. This study proposes a curriculum learning-inspired, data-centric training strategy that begins with simpler tasks and progresses to more complex ones, using criteria such as prompt length, attention scores, and loss values to structure the training data. Experiments with Mistral-7B (Jiang et al., 2023) and Gemma-7B (Team et al., 2024) models demonstrate that curriculum learning slightly improves performance compared to traditional random data shuffling. Notably, we observed that sorting data based on our proposed attention criteria generally led to better performance. This approach offers a sustainable method to enhance LLM performance without increasing model size or dataset volume, addressing scalability challenges in LLM training.
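
As a minimal sketch of the easy-to-hard ordering described in the abstract above, the snippet below sorts training examples by prompt length, one of the three criteria the abstract names. The helper names, the whitespace token count, and the toy examples are assumptions made for illustration; the attention-score and loss-value criteria would need a model and are not shown.

```python
def curriculum_order(examples, difficulty):
    """Return examples sorted from easiest to hardest under `difficulty`."""
    return sorted(examples, key=difficulty)


def prompt_length(example):
    """Difficulty proxy: number of whitespace-separated tokens in the prompt."""
    return len(example["prompt"].split())


# Hypothetical training examples.
train = [
    {"prompt": "Summarise the following three paragraphs in one sentence", "id": 0},
    {"prompt": "Add 2 and 3", "id": 1},
    {"prompt": "Translate this sentence into French", "id": 2},
]

ordered_ids = [ex["id"] for ex in curriculum_order(train, prompt_length)]
```

Passing the difficulty function as an argument keeps the ordering logic unchanged if a different criterion is swapped in.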

Title: MacBehaviour: An R package for behavioural experimentation on large language models

Authors: Xufeng Duan, Shixuan Li, Zhenguang G. Cai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.07495
Pdf URL: https://arxiv.org/pdf/2405.07495
Copy Paste: [[2405.07495]] MacBehaviour: An R package for behavioural experimentation on large language models(https://arxiv.org/abs/2405.07495)
Keywords: language model, gpt, llm, chat
Abstract: There has been increasing interest in investigating the behaviours of large language models (LLMs) and LLM-powered chatbots by treating an LLM as a participant in a psychological experiment. We therefore developed an R package called "MacBehaviour" that aims to interact with more than 60 language models in one package (e.g., OpenAI's GPT family, the Claude family, Gemini, Llama family, and open-source models) and streamline the experimental process of LLM behaviour experiments. The package offers a comprehensive set of functions designed for LLM experiments, covering experiment design, stimuli presentation, model behaviour manipulation, and logging of responses and token probabilities. To demonstrate the utility and effectiveness of "MacBehaviour," we conducted three validation experiments on three LLMs (GPT-3.5, Llama-2 7B, and Vicuna-1.5 13B) to replicate sound-gender association in LLMs. The results consistently showed that they exhibit human-like tendencies to infer gender from novel personal names based on their phonology, as previously demonstrated (Cai et al., 2023). In summary, "MacBehaviour" is an R package for machine behaviour studies which offers a user-friendly interface and comprehensive features to simplify and standardize the experimental process.

Title: EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models

Authors: Yunsheng Ni, Chuanjian Liu, Yehui Tang, Kai Han, Yunhe Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.07542
Pdf URL: https://arxiv.org/pdf/2405.07542
Copy Paste: [[2405.07542]] EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models(https://arxiv.org/abs/2405.07542)
Keywords: language model, llm
Abstract: Speculative decoding emerges as a pivotal technique for enhancing the inference speed of Large Language Models (LLMs). Despite recent research aiming to improve prediction efficiency, multi-sample speculative decoding has been overlooked due to varying numbers of accepted tokens within a batch in the verification phase. The vanilla method adds padding tokens in order to ensure that the number of new tokens remains consistent across samples. However, this increases the computational and memory access overhead, thereby reducing the speedup ratio. We propose a novel method that can resolve the issue of inconsistent tokens accepted by different samples without necessitating an increase in memory or computing overhead. Furthermore, our proposed method can handle the situation where the prediction tokens of different samples are inconsistent without the need to add padding tokens. Extensive experiments demonstrate the efficacy of our method. Our code is available at this https URL.
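
To make the padding overhead of the vanilla approach concrete, here is a small sketch that counts how many padding tokens a batch needs when samples accept different numbers of tokens in one verification step. It illustrates only the problem the abstract describes, not the authors' padding-free method; the function name and the acceptance counts are hypothetical.

```python
def padding_overhead(accepted_per_sample):
    """Padding needed to equalise new-token counts across a batch.

    Returns (number of padding tokens, fraction of slots holding real tokens).
    """
    longest = max(accepted_per_sample)
    padded_total = longest * len(accepted_per_sample)
    useful_total = sum(accepted_per_sample)
    return padded_total - useful_total, useful_total / padded_total


# Hypothetical numbers of accepted tokens for a batch of four samples.
accepted = [1, 4, 2, 3]
pad_tokens, utilisation = padding_overhead(accepted)
```

With these counts, 6 of the 16 token slots are padding, so only 62.5% of the slots carry accepted tokens.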

Title: MuMath-Code: Combining Tool-Use Large Language Models with Multi-perspective Data Augmentation for Mathematical Reasoning

Authors: Shuo Yin, Weihao You, Zhilong Ji, Guoqiang Zhong, Jinfeng Bai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.07551
Pdf URL: https://arxiv.org/pdf/2405.07551
Copy Paste: [[2405.07551]] MuMath-Code: Combining Tool-Use Large Language Models with Multi-perspective Data Augmentation for Mathematical Reasoning(https://arxiv.org/abs/2405.07551)
Keywords: language model, llm
Abstract: The tool-use Large Language Models (LLMs) that integrate with external Python interpreters have significantly enhanced mathematical reasoning capabilities for open-source LLMs, while tool-free methods chose another track: augmenting math reasoning data. However, an effective method to integrate the above two research paths and combine their advantages remains to be explored. In this work, we first include new math questions via multi-perspective data augmenting methods and then synthesize code-nested solutions to them. The open LLMs (i.e., Llama-2) are finetuned on the augmented dataset to get the resulting models, MuMath-Code ($\mu$-Math-Code). During the inference phase, our MuMath-Code generates code and interacts with the external Python interpreter to get the execution results. Therefore, MuMath-Code leverages the advantages of both the external tool and data augmentation. To fully leverage the advantages of our augmented data, we propose a two-stage training strategy: In Stage-1, we finetune Llama-2 on pure CoT data to get an intermediate model, which is then trained on the code-nested data in Stage-2 to get the resulting MuMath-Code. Our MuMath-Code-7B achieves 83.8 on GSM8K and 52.4 on MATH, while the MuMath-Code-70B model achieves new state-of-the-art performance among open methods -- achieving 90.7% on GSM8K and 55.1% on MATH. Extensive experiments validate the combination of tool use and data augmentation, as well as our two-stage training strategy. We release the proposed dataset along with the associated code for public use.

Title: NoiseBench: Benchmarking the Impact of Real Label Noise on Named Entity Recognition

Authors: Elena Merdjanovska, Ansar Aynetdinov, Alan Akbik
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2405.07609
Pdf URL: https://arxiv.org/pdf/2405.07609
Copy Paste: [[2405.07609]] NoiseBench: Benchmarking the Impact of Real Label Noise on Named Entity Recognition(https://arxiv.org/abs/2405.07609)
Keywords: llm
Abstract: Available training data for named entity recognition (NER) often contains a significant percentage of incorrect labels for entity types and entity boundaries. Such label noise poses challenges for supervised learning and may significantly deteriorate model quality. To address this, prior work proposed various noise-robust learning approaches capable of learning from data with partially incorrect labels. These approaches are typically evaluated using simulated noise where the labels in a clean dataset are automatically corrupted. However, as we show in this paper, this leads to unrealistic noise that is far easier to handle than real noise caused by human error or semi-automatic annotation. To enable the study of the impact of various types of real noise, we introduce NoiseBench, an NER benchmark consisting of clean training data corrupted with 6 types of real noise, including expert errors, crowdsourcing errors, automatic annotation errors and LLM errors. We present an analysis that shows that real noise is significantly more challenging than simulated noise, and show that current state-of-the-art models for noise-robust learning fall far short of their theoretically achievable upper bound. We release NoiseBench to the research community.

Title: ViWikiFC: Fact-Checking for Vietnamese Wikipedia-Based Textual Knowledge Source

Authors: Hung Tuan Le, Long Truong To, Manh Trong Nguyen, Kiet Van Nguyen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.07615
Pdf URL: https://arxiv.org/pdf/2405.07615
Copy Paste: [[2405.07615]] ViWikiFC: Fact-Checking for Vietnamese Wikipedia-Based Textual Knowledge Source(https://arxiv.org/abs/2405.07615)
Keywords: language model
Abstract: Fact-checking is essential due to the explosion of misinformation in the media ecosystem. Although false information exists in every language and country, most research on the problem has concentrated on large language communities such as English and Chinese. Low-resource languages like Vietnamese still need corpora and models for fact verification. To bridge this gap, we construct ViWikiFC, the first manually annotated open-domain corpus for Vietnamese Wikipedia fact checking, with more than 20K claims generated by converting evidence sentences extracted from Wikipedia articles. We analyze our corpus through many linguistic aspects, including the new dependency rate, the new n-gram rate, and the new word rate. We conducted various experiments for Vietnamese fact-checking, including evidence retrieval and verdict prediction. BM25 and InfoXLM (Large) achieved the best results in the two tasks: in the evidence retrieval task, BM25 achieved an accuracy of 88.30% for SUPPORTS, 86.93% for REFUTES, and only 56.67% for the NEI label, while InfoXLM (Large) achieved an F1 score of 86.51%. Furthermore, we also conducted a pipeline approach, which only achieved a strict accuracy of 67.00% when using InfoXLM (Large) and BM25. These results demonstrate that our dataset is challenging for Vietnamese language models in fact-checking tasks.

Title: COBias and Debias: Minimizing Language Model Pairwise Accuracy Bias via Nonlinear Integer Programming

Authors: Ruixi Lin, Yang You
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.07623
Pdf URL: https://arxiv.org/pdf/2405.07623
Copy Paste: [[2405.07623]] COBias and Debias: Minimizing Language Model Pairwise Accuracy Bias via Nonlinear Integer Programming(https://arxiv.org/abs/2405.07623)
Keywords: language model, llm
Abstract: For language model classification, would you prefer having only one workable class or having every class working? The latter is more useful in practice. Especially for large language models (LLMs), the fact that they achieve a fair overall accuracy by in-context learning (ICL) obscures a large difference in individual class accuracies. In this work, we uncover and tackle language models' imbalance in per-class prediction accuracy by reconceptualizing it as the Contextual Oddity Bias (COBias), and we are the first to engage nonlinear integer programming (NIP) to debias it. Briefly, COBias refers to the difference in accuracy between a class A and its "odd" class, which holds the majority of the wrong predictions for class A. With the COBias metric, we reveal that LLMs of varied scales and families exhibit large per-class accuracy differences. Then we propose Debiasing as Nonlinear Integer Programming (DNIP) to correct ICL per-class probabilities for lower bias and higher overall accuracy. Our optimization objective is directly based on the evaluation scores by COBias and accuracy metrics, solved by simulated annealing. Evaluations on three LLMs across seven NLP classification tasks show that DNIP simultaneously achieves significant COBias reduction (-27%) and accuracy improvement (+12%) over the conventional ICL approach, suggesting that modeling pairwise class accuracy differences is a direction in pushing forward more accurate, more reliable LLM predictions.
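
One possible reading of the COBias definition in the abstract above is sketched below. It assumes the "odd" class of a class is the label that receives most of that class's wrong predictions, and it averages the absolute accuracy gap over all classes that have an odd class. The averaging step and every function name are assumptions; the paper's exact formula may differ.

```python
from collections import Counter


def per_class_accuracy(gold, pred):
    """Accuracy computed separately for each gold class."""
    correct, total = Counter(), Counter()
    for g, p in zip(gold, pred):
        total[g] += 1
        correct[g] += int(g == p)
    return {c: correct[c] / total[c] for c in total}


def odd_class(gold, pred, cls):
    """Label receiving most of `cls`'s wrong predictions, or None if no errors."""
    wrong = Counter(p for g, p in zip(gold, pred) if g == cls and p != cls)
    return wrong.most_common(1)[0][0] if wrong else None


def cobias_sketch(gold, pred):
    """Mean absolute accuracy gap between each class and its odd class."""
    acc = per_class_accuracy(gold, pred)
    gaps = []
    for cls in acc:
        odd = odd_class(gold, pred, cls)
        if odd is not None and odd in acc:
            gaps.append(abs(acc[cls] - acc[odd]))
    return sum(gaps) / len(gaps) if gaps else 0.0


# Toy labels (hypothetical): class A is mostly mistaken for class B.
gold = ["A", "A", "A", "A", "B", "B", "B", "B"]
pred = ["A", "B", "B", "B", "B", "B", "B", "B"]
score = cobias_sketch(gold, pred)
```

In this toy case overall accuracy is 62.5%, yet class A is right only 25% of the time while class B is always right, which is the kind of gap the metric is meant to expose.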

Title: Age-Dependent Analysis and Stochastic Generation of Child-Directed Speech

Authors: Okko Räsänen, Daniil Kocharov
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2405.07700
Pdf URL: https://arxiv.org/pdf/2405.07700
Copy Paste: [[2405.07700]] Age-Dependent Analysis and Stochastic Generation of Child-Directed Speech(https://arxiv.org/abs/2405.07700)
Keywords: language model
Abstract: Child-directed speech (CDS) is a particular type of speech that adults use when addressing young children. Its properties also change as a function of extralinguistic factors, such as the age of the child being addressed. Access to large amounts of representative and varied CDS would be useful for child language research, as this would enable controlled computational modeling experiments of infant language acquisition with realistic input in terms of quality and quantity. In this study, we describe an approach to model age-dependent linguistic properties of CDS using a language model (LM) trained on CDS transcripts and ages of the recipient children, as obtained from North American English corpora of the CHILDES database. The created LM can then be used to stochastically generate synthetic CDS transcripts in an age-appropriate manner, thereby scaling beyond the original datasets in size. We compare characteristics of the generated CDS against the real speech addressed to children of different ages, showing that the LM manages to capture age-dependent changes in CDS, except for a slight difference in the effective vocabulary size. As a side product, we also provide a systematic characterization of age-dependent linguistic properties of CDS in CHILDES, illustrating how all measured aspects of the CDS change with children's age.

Title: OpenLLM-Ro -- Technical Report on Open-source Romanian LLMs trained starting from Llama 2

Authors: Mihai Masala, Denis C. Ilie-Ablachim, Dragos Corlatescu, Miruna Zavelca, Marius Leordeanu, Horia Velicu, Marius Popescu, Mihai Dascalu, Traian Rebedea
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] OpenLLM-Ro -- Technical Report on Open-source Romanian LLMs trained starting from Llama 2(https://arxiv.org/abs/)
Keywords: language model, llm, chat
Abstract: In recent years, Large Language Models (LLMs) have achieved almost human-like performance on various tasks. While some LLMs have been trained on multilingual data, most of the training data is in English. Hence, their performance in English greatly exceeds their performance in other languages. This document presents our approach to training and evaluating the first foundational and chat LLM specialized for Romanian.
摘要：近年来，大型语言模型（LLM）在各种任务上取得了几乎与人类相似的性能。虽然一些法学硕士接受了多语言数据的培训，但大多数培训数据都是英语的。因此，他们的英语表现大大超过了其他语言的表现。本文件介绍了我们培训和评估第一个专门针对罗马尼亚语的基础和聊天法学硕士的方法。

Title: Quantifying and Optimizing Global Faithfulness in Persona-driven Role-playing

Authors: Letian Peng, Jingbo Shang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.07726
Pdf URL: https://arxiv.org/pdf/2405.07726
Copy Paste: [[2405.07726]] Quantifying and Optimizing Global Faithfulness in Persona-driven Role-playing(https://arxiv.org/abs/2405.07726)
Keywords: gpt, llm
Abstract: Persona-driven role-playing (PRP) aims to build AI characters that can respond to user queries by faithfully sticking with all persona statements. Unfortunately, existing faithfulness criteria for PRP are limited to coarse-grained LLM-based scoring without a clear definition or formulation. This paper presents a pioneering exploration to quantify PRP faithfulness as a fine-grained and explainable criterion, which also serves as a reliable reference for optimization. Our criterion first discriminates persona statements into active and passive constraints by identifying the query-statement relevance. Then, we incorporate all constraints following the principle that the AI character's response should be (a) entailed by active (relevant) constraints and (b) not contradicted by passive (irrelevant) constraints. We translate this principle mathematically into a novel Active-Passive-Constraint (APC) score, a constraint-wise sum of natural language inference (NLI) scores weighted by relevance scores. In practice, we build the APC scoring system by symbolically distilling small discriminators from GPT-4 for efficiency. We validate the quality of the APC score against human evaluation based on example personas with tens of statements, and the results show a high correlation. We further leverage it as a reward system in direct preference optimization (DPO) for better AI characters. Our experiments offer a fine-grained and explainable comparison between existing PRP techniques, revealing their advantages and limitations. We further find APC-based DPO to be one of the most competitive techniques for sticking with all constraints and can be well incorporated with other techniques. We then extend the scale of the experiments to real persons with hundreds of statements and reach a consistent conclusion.
摘要：角色驱动的角色扮演 (PRP) 旨在构建能够通过忠实地遵循所有角色陈述来响应用户查询的 AI 角色。不幸的是，现有的 PRP 忠实度标准仅限于基于 LLM 的粗粒度评分，没有明确的定义或表述。本文提出了将 PRP 忠实度量化为细粒度且可解释的标准的开创性探索，这也为优化提供了可靠的参考。我们的标准首先通过识别查询语句相关性将角色陈述区分为主动和被动约束。然后，我们按照以下原则合并所有约束：AI 角色的响应应该 (a) 受到主动（相关）约束的约束，并且 (b) 不与被动（不相关）约束相矛盾。我们以数学方式将这一原理转化为一种新颖的主动-被动-约束（APC）分数，即按相关性分数加权的自然语言推理（NLI）分数的约束总和。在实践中，我们通过从 GPT-4 中象征性地提取小判别器来构建 APC 评分系统以提高效率。我们根据具有数十个陈述的示例人物角色对照人类评估来验证 APC 分数的质量，结果显示出高度相关性。我们进一步利用它作为直接偏好优化（DPO）中的奖励系统，以获得更好的人工智能角色。我们的实验对现有 PRP 技术进行了细粒度且可解释的比较，揭示了它们的优点和局限性。我们进一步发现基于 APC 的 DPO 是遵守所有限制的最具竞争力的技术之一，并且可以与其他技术很好地结合。然后，我们将实验规模扩展到真人，并发表数百条陈述，并得出一致的结论。
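
Illustrative sketch (not from the paper): the abstract describes the APC score only as a constraint-wise sum of NLI scores weighted by relevance scores, following the principle that a response should be entailed by active (relevant) statements and not contradicted by passive (irrelevant) ones. One plausible reading of that sentence is shown below. The exact weighting, the `StatementScores` container, and the function name `apc_like_score` are assumptions for illustration; in practice the three probabilities would come from trained discriminators rather than hand-set values.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class StatementScores:
    """Hypothetical per-statement probabilities, each in [0, 1]."""
    relevance: float      # probability the persona statement is relevant to the query
    entailment: float     # probability the statement entails the response
    contradiction: float  # probability the statement contradicts the response


def apc_like_score(scores: Iterable[StatementScores]) -> float:
    """Sum, over persona statements, of a relevance-weighted NLI term.

    Assumed form: relevant statements reward entailment, while
    irrelevant statements reward the absence of contradiction.
    """
    total = 0.0
    for s in scores:
        active_term = s.relevance * s.entailment
        passive_term = (1.0 - s.relevance) * (1.0 - s.contradiction)
        total += active_term + passive_term
    return total
```

Under this reading, a persona with N statements yields a score between 0 and N, so higher values indicate a response that respects more of the persona.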

Title: LlamaTurk: Adapting Open-Source Generative Large Language Models for Low-Resource Language

Authors: Cagri Toraman
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] LlamaTurk: Adapting Open-Source Generative Large Language Models for Low-Resource Language(https://arxiv.org/abs/)
Keywords: language model
Abstract: Despite advancements in English-dominant generative large language models, further development is needed for low-resource languages to enhance global accessibility. The primary methods for representing these languages are monolingual and multilingual pretraining. Monolingual pretraining is expensive due to hardware requirements, and multilingual models often have uneven performance across languages. This study explores an alternative solution by adapting large language models, primarily trained on English, to low-resource languages. We assess various strategies, including continual training, instruction fine-tuning, task-specific fine-tuning, and vocabulary extension. The results show that continual training improves language comprehension, as reflected in perplexity scores, and task-specific tuning generally enhances performance of downstream tasks. However, extending the vocabulary shows no substantial benefits. Additionally, while larger models improve task performance with few-shot tuning, multilingual models perform worse than their monolingual counterparts when adapted.
摘要：尽管以英语为主的生成大语言模型取得了进步，但资源匮乏的语言还需要进一步开发，以增强全球可访问性。表示这些语言的主要方法是单语言和多语言预训练。由于硬件要求，单语言预训练的成本很高，而多语言模型在不同语言之间的性能往往参差不齐。这项研究通过将主要以英语训练的大型语言模型应用于低资源语言来探索替代解决方案。我们评估各种策略，包括持续培训、教学微调、针对特定任务的微调和词汇扩展。结果表明，持续训练可以提高语言理解能力，这反映在困惑度分数上，而特定于任务的调整通常可以提高下游任务的性能。然而，扩大词汇量并没有带来实质性的好处。此外，虽然较大的模型可以通过几次调整来提高任务性能，但多语言模型在调整后的表现却比单语言模型差。

Title: TANQ: An open domain dataset of table answered questions

Authors: Mubashara Akhtar, Chenxi Pang, Andreea Marzoca, Yasemin Altun, Julian Martin Eisenschlos
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.07765
Pdf URL: https://arxiv.org/pdf/2405.07765
Copy Paste: [[2405.07765]] TANQ: An open domain dataset of table answered questions(https://arxiv.org/abs/2405.07765)
Keywords: language model, gpt
Abstract: Language models, potentially augmented with tool usage such as retrieval, are becoming the go-to means of answering questions. Understanding and answering questions in real-world settings often requires retrieving information from different sources, processing and aggregating data to extract insights, and presenting complex findings in the form of structured artifacts such as novel tables, charts, or infographics. In this paper, we introduce TANQ, the first open domain question answering dataset where the answers require building tables from information across multiple sources. We release the full source attribution for every cell in the resulting table and benchmark state-of-the-art language models in open, oracle, and closed book setups. Our best-performing baseline, GPT4, reaches an overall F1 score of 29.1, lagging behind human performance by 19.7 points. We analyse baselines' performance across different dataset attributes such as different skills required for this task, including multi-hop reasoning, math operations, and unit conversions. We further discuss common failures in model-generated answers, suggesting that TANQ is a complex task with many challenges ahead.
摘要：语言模型可能会通过检索等工具的使用而得到增强，正在成为回答问题的首选手段。理解和回答现实世界中的问题通常需要从不同来源检索信息，处理和聚合数据以提取见解，并以结构化工件（例如新颖的表格、图表或信息图表）的形式呈现复杂的发现。在本文中，我们介绍了 TANQ，这是第一个开放域问答数据集，其中答案需要根据跨多个来源的信息构建表格。我们发布了结果表中每个单元格的完整源属性，并在开放式、预言机和封闭式书籍设置中对最先进的语言模型进行了基准测试。我们表现最好的基线 GPT4 的 F1 总体得分为 29.1，落后人类表现 19.7 分。我们分析不同数据集属性的基线性能，例如此任务所需的不同技能，包括多跳推理、数学运算和单位转换。我们进一步讨论了模型生成答案中的常见失败，表明 TANQ 是一项复杂的任务，面临着许多挑战。

Title: DEPTH: Discourse Education through Pre-Training Hierarchically

Authors: Zachary Bamberger, Ofek Glick, Chaim Baskin, Yonatan Belinkov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.07788
Pdf URL: https://arxiv.org/pdf/2405.07788
Copy Paste: [[2405.07788]] DEPTH: Discourse Education through Pre-Training Hierarchically(https://arxiv.org/abs/2405.07788)
Keywords: language model
Abstract: Language Models (LMs) often struggle with linguistic understanding at the discourse level, even though discourse patterns such as coherence, cohesion, and narrative flow are prevalent in their pre-training data. Current methods address these challenges only after the pre-training phase, relying on expensive human annotated data to align the model. To improve the discourse capabilities of LMs already at the pre-training stage, we introduce DEPTH, an encoder-decoder model that learns to represent sentences using a discourse-oriented pre-training objective. DEPTH combines hierarchical sentence representations with two objectives: (1) Sentence Un-Shuffling, and (2) Span-Corruption. This approach trains the model to represent both sub-word-level and sentence-level dependencies over a massive amount of unstructured text. When trained either from scratch or continuing from a pre-trained T5 checkpoint, DEPTH learns semantic and discourse-level representations faster than T5, outperforming it in span-corruption loss despite the additional sentence-un-shuffling objective. Evaluations on the GLUE, DiscoEval, and NI benchmarks demonstrate DEPTH's ability to quickly learn diverse downstream tasks, which require syntactic, semantic, and discourse capabilities. Overall, our approach extends the discourse capabilities of T5, while minimally impacting other natural language understanding (NLU) capabilities in the resulting LM.
摘要：语言模型 (LM) 经常在话语层面的语言理解上遇到困难，尽管连贯性、衔接性和叙述流等话语模式在其预训练数据中很普遍。当前的方法仅在预训练阶段之后才解决这些挑战，依靠昂贵的人工注释数据来对齐模型。为了提高语言模型在预训练阶段的话语能力，我们引入了 DEPTH，一种编码器-解码器模型，它学习使用面向话语的预训练目标来表示句子。 DEPTH 将分层句子表示与两个目标结合起来：(1) 句子不洗牌，(2) 跨度损坏。这种方法训练模型来表示大量非结构化文本的子词级和句子级依赖关系。当从头开始训练或从预先训练的 T5 检查点继续训练时，DEPTH 学习语义和话语级表示的速度比 T5 更快，尽管有额外的句子非洗牌目标，但在跨度损坏损失方面优于它。对 GLUE、DiscoEval 和 NI 基准的评估证明了 DEPTH 快速学习各种下游任务的能力，这需要句法、语义和话语能力。总的来说，我们的方法扩展了 T5 的话语能力，同时对最终 LM 中的其他自然语言理解 (NLU) 能力的影响最小。
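
Illustrative sketch (not from the paper): the abstract names Sentence Un-Shuffling as one of DEPTH's two pre-training objectives but does not specify how examples are encoded. The snippet below shows one simple way such a training pair could be built: shuffle the sentences of a passage and keep, as the target, the positions that restore the original order. The function name `make_unshuffling_example`, the index-based target, and the seeding are assumptions; the model's actual input and target format may differ.

```python
import random
from typing import List, Tuple


def make_unshuffling_example(sentences: List[str], seed: int = 0) -> Tuple[List[str], List[int]]:
    """Return (shuffled_sentences, target).

    target[k] is the position in the shuffled list of the k-th
    original sentence, so reading the shuffled list in target order
    reproduces the original passage.
    """
    rng = random.Random(seed)  # local generator keeps the example reproducible
    order = list(range(len(sentences)))
    rng.shuffle(order)
    shuffled = [sentences[i] for i in order]
    target = [order.index(k) for k in range(len(sentences))]
    return shuffled, target
```

For example, `shuffled, target = make_unshuffling_example(["A.", "B.", "C."])` gives a permuted list, and `[shuffled[j] for j in target]` recovers the original sentence order.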

Title: Zero-Shot Tokenizer Transfer

Authors: Benjamin Minixhofer, Edoardo Maria Ponti, Ivan Vulić
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.07883
Pdf URL: https://arxiv.org/pdf/2405.07883
Copy Paste: [[2405.07883]] Zero-Shot Tokenizer Transfer(https://arxiv.org/abs/2405.07883)
Keywords: language model, llm
Abstract: Language models (LMs) are bound to their tokenizer, which maps raw text to a sequence of vocabulary items (tokens). This restricts their flexibility: for example, LMs trained primarily on English may still perform well in other natural and programming languages, but have vastly decreased efficiency due to their English-centric tokenizer. To mitigate this, we should be able to swap the original LM tokenizer with an arbitrary one, on the fly, without degrading performance. Hence, in this work we define a new problem: Zero-Shot Tokenizer Transfer (ZeTT). The challenge at the core of ZeTT is finding embeddings for the tokens in the vocabulary of the new tokenizer. Since prior heuristics for initializing embeddings often perform at chance level in a ZeTT setting, we propose a new solution: we train a hypernetwork taking a tokenizer as input and predicting the corresponding embeddings. We empirically demonstrate that the hypernetwork generalizes to new tokenizers both with encoder (e.g., XLM-R) and decoder LLMs (e.g., Mistral-7B). Our method comes close to the original models' performance in cross-lingual and coding tasks while markedly reducing the length of the tokenized sequence. We also find that the remaining gap can be quickly closed by continued training on less than 1B tokens. Finally, we show that a ZeTT hypernetwork trained for a base (L)LM can also be applied to fine-tuned variants without extra training. Overall, our results make substantial strides toward detaching LMs from their tokenizer.
摘要：语言模型 (LM) 与其标记器绑定，该标记器将原始文本映射到一系列词汇项（标记）。这限制了它们的灵活性：例如，主要接受英语训练的语言模型可能在其他自然语言和编程语言中仍然表现良好，但由于其以英语为中心的分词器，效率大大降低。为了缓解这一问题，我们应该能够动态地将原始 LM 分词器替换为任意分词器，而不会降低性能。因此，在这项工作中，我们定义了一个新问题：零样本分词器传输（ZeTT）。 ZeTT 的核心挑战是在新标记器的词汇表中找到标记的嵌入。由于先前用于初始化嵌入的启发式方法通常在 ZeTT 设置中以机会级别执行，因此我们提出了一种新的解决方案：我们训练一个超网络，以标记器作为输入并预测相应的嵌入。我们凭经验证明，超网络可以推广到具有编码器（例如 XLM-R）和解码器 LLM（例如 Mistral-7B）的新分词器。我们的方法在跨语言和编码任务中接近原始模型的性能，同时显着减少了标记序列的长度。我们还发现，剩余的差距可以通过对少于 1B 代币的持续训练来快速缩小。最后，我们表明，针对基础 (L)LM 训练的 ZeTT 超网络也可以应用于微调变体，而无需额外训练。总体而言，我们的结果在将 LM 与其标记器分离方面取得了重大进展。

Title: Russian-Language Multimodal Dataset for Automatic Summarization of Scientific Papers

Authors: Alena Tsanda, Elena Bruches
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.07886
Pdf URL: https://arxiv.org/pdf/2405.07886
Copy Paste: [[2405.07886]] Russian-Language Multimodal Dataset for Automatic Summarization of Scientific Papers(https://arxiv.org/abs/2405.07886)
Keywords: language model, gpt, chat
Abstract: The paper discusses the creation of a multimodal dataset of Russian-language scientific papers and testing of existing language models for the task of automatic text summarization. A feature of the dataset is its multimodal data, which includes texts, tables and figures. The paper presents the results of experiments with two language models: Gigachat from SBER and YandexGPT from Yandex. The dataset consists of 420 papers and is publicly available on this https URL.
摘要：本文讨论了俄语科学论文多模态数据集的创建以及用于自动文本摘要任务的现有语言模型的测试。该数据集的一个特点是它的多模态数据，包括文本、表格和图形。该论文介绍了两种语言模型的实验结果：来自 SBER 的 Gigachat 和来自 Yandex 的 YandexGPT。该数据集由 420 篇论文组成，可通过此 https URL 公开获取。

Title: PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition

Authors: Ziyang Zhang, Qizhen Zhang, Jakob Foerster
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2405.07932
Pdf URL: https://arxiv.org/pdf/2405.07932
Copy Paste: [[2405.07932]] PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition(https://arxiv.org/abs/2405.07932)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have shown success in many natural language processing tasks. Despite rigorous safety alignment processes, supposedly safety-aligned LLMs like Llama 2 and Claude 2 are still susceptible to jailbreaks, leading to security risks and abuse of the models. One option to mitigate such risks is to augment the LLM with a dedicated "safeguard", which checks the LLM's inputs or outputs for undesired behaviour. A promising approach is to use the LLM itself as the safeguard. Nonetheless, baseline methods, such as prompting the LLM to self-classify toxic content, demonstrate limited efficacy. We hypothesise that this is due to domain shift: the alignment training imparts a self-censoring behaviour to the model ("Sorry I can't do that"), while the self-classify approach shifts it to a classification format ("Is this prompt malicious"). In this work, we propose PARDEN, which avoids this domain shift by simply asking the model to repeat its own outputs. PARDEN neither requires finetuning nor white box access to the model. We empirically verify the effectiveness of our method and show that PARDEN significantly outperforms existing jailbreak detection baselines for Llama-2 and Claude-2. Code and data are available at this https URL. We find that PARDEN is particularly powerful in the relevant regime of high True Positive Rate (TPR) and low False Positive Rate (FPR). For instance, for Llama2-7B, at TPR equal to 90%, PARDEN accomplishes a roughly 11x reduction in the FPR from 24.8% to 2.0% on the harmful behaviours dataset.
摘要：大型语言模型 (LLM) 在许多自然语言处理任务中都取得了成功。尽管有严格的安全调整流程，Llama 2 和 Claude 2 等所谓的安全调整法学硕士仍然容易越狱，从而导致安全风险和模型滥用。减轻此类风险的一种选择是为法学硕士提供专门的“保障措施”，以检查法学硕士的输入或输出是否存在不良行为。一个有前途的方法是使用法学硕士本身作为保障。尽管如此，基线方法（例如提示法学硕士对有毒成分进行自我分类）的效果有限。我们假设这是由于领域转移造成的：对齐训练赋予模型一种自我审查行为（“抱歉，我不能这样做”），而自分类方法将其转换为分类格式（“这是提示恶意”）。在这项工作中，我们提出了 PARDEN，它通过简单地要求模型重复其自己的输出来避免这种域转移。 PARDEN 既不需要微调也不需要对模型进行白盒访问。我们凭经验验证了我们方法的有效性，并表明 PARDEN 显着优于 Llama-2 和 Claude-2 的现有越狱检测基线。代码和数据可从此 https URL 获取。我们发现 PARDEN 在高真阳性率（TPR）和低假阳性率（FPR）的相关机制中特别强大。例如，对于 Llama2-7B，在 TPR 等于 90% 时，PARDEN 在有害行为数据集上实现了 FPR 大约 11 倍的降低，从 24.8% 降至 2.0%。
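
Illustrative sketch (not from the paper): the abstract says PARDEN works by asking the model to repeat its own output, without finetuning or white-box access. A minimal way to turn that idea into a detector is to compare the original output with the attempted repetition and flag the output when the two diverge, for instance because the model refuses to repeat it. The prompt wording, the use of `difflib.SequenceMatcher` as the similarity measure, the `threshold` value, and the `generate` callable are all assumptions made for this example, not the authors' implementation.

```python
from difflib import SequenceMatcher
from typing import Callable

# Hypothetical prompt wording; the paper's actual prompt is not given in the abstract.
REPEAT_PROMPT = "Please repeat the following text exactly:\n\n{output}"


def is_flagged(output: str, generate: Callable[[str], str], threshold: float = 0.8) -> bool:
    """Flag `output` if the model does not faithfully repeat it.

    `generate` is any black-box function mapping a prompt string to
    the model's reply string.
    """
    repeated = generate(REPEAT_PROMPT.format(output=output))
    # Character-level similarity in [0, 1]; 1.0 means an exact repetition.
    similarity = SequenceMatcher(None, output, repeated).ratio()
    return similarity < threshold
```

Because the check only needs a prompt-in, text-out function, it can wrap a hosted API or a local model alike; the threshold trades off true positive rate against false positive rate.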

Title: EconLogicQA: A Question-Answering Benchmark for Evaluating Large Language Models in Economic Sequential Reasoning

Authors: Yinzhu Quan, Zefang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2405.07938
Pdf URL: https://arxiv.org/pdf/2405.07938
Copy Paste: [[2405.07938]] EconLogicQA: A Question-Answering Benchmark for Evaluating Large Language Models in Economic Sequential Reasoning(https://arxiv.org/abs/2405.07938)
Keywords: language model, llm
Abstract: In this paper, we introduce EconLogicQA, a rigorous benchmark designed to assess the sequential reasoning capabilities of large language models (LLMs) within the intricate realms of economics, business, and supply chain management. Diverging from traditional benchmarks that predict subsequent events individually, EconLogicQA poses a more challenging task: it requires models to discern and sequence multiple interconnected events, capturing the complexity of economic logics. EconLogicQA comprises an array of multi-event scenarios derived from economic articles, which necessitate an insightful understanding of both temporal and logical event relationships. Through comprehensive evaluations, we show that EconLogicQA effectively gauges an LLM's proficiency in navigating the sequential complexities inherent in economic contexts. We provide a detailed description of the EconLogicQA dataset and show the outcomes from evaluating the benchmark across various leading-edge LLMs, thereby offering a thorough perspective on their sequential reasoning potential in economic contexts. Our benchmark dataset is available at this https URL.
摘要：在本文中，我们介绍了 EconLogicQA，这是一个严格的基准测试，旨在评估经济、商业和供应链管理等复杂领域内大型语言模型 (LLM) 的顺序推理能力。与单独预测后续事件的传统基准不同，EconLogicQA 提出了更具挑战性的任务：它需要模型来识别和排序多个相互关联的事件，捕获经济逻辑的复杂性。 EconLogicQA 包含一系列源自经济文章的多事件场景，这需要对时间和逻辑事件关系有深刻的理解。通过综合评估，我们表明 EconLogicQA 有效地衡量了法学硕士在应对经济环境中固有的连续复杂性方面的熟练程度。我们提供了 EconLogicQA 数据集的详细描述，并展示了评估各种前沿法学硕士的基准的结果，从而提供了关于其在经济背景下的顺序推理潜力的全面视角。我们的基准数据集可通过此 https URL 获取。

Title: Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots

Authors: Chengyue Wu, Yixiao Ge, Qiushan Guo, Jiahao Wang, Zhixuan Liang, Zeyu Lu, Ying Shan, Ping Luo
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2405.07990
Pdf URL: https://arxiv.org/pdf/2405.07990
Copy Paste: [[2405.07990]] Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots(https://arxiv.org/abs/2405.07990)
Keywords: language model, gpt, llm
Abstract: The remarkable progress of Multi-modal Large Language Models (MLLMs) has attracted significant attention due to their superior performance in visual contexts. However, their capabilities in turning visual figures into executable code have not been evaluated thoroughly. To address this, we introduce Plot2Code, a comprehensive visual coding benchmark designed for a fair and in-depth assessment of MLLMs. We carefully collect 132 manually selected high-quality matplotlib plots across six plot types from publicly available matplotlib galleries. For each plot, we carefully offer its source code and a descriptive instruction summarized by GPT-4. This approach enables Plot2Code to extensively evaluate MLLMs' code capabilities across various input modalities. Furthermore, we propose three automatic evaluation metrics, including code pass rate, text-match ratio, and GPT-4V overall rating, for a fine-grained assessment of the output code and rendered images. Instead of simply judging pass or fail, we employ GPT-4V to make an overall judgement between the generated and reference images, which has been shown to be consistent with human evaluation. The evaluation results, which include analyses of 14 MLLMs such as the proprietary GPT-4V, Gemini-Pro, and the open-sourced Mini-Gemini, highlight the substantial challenges presented by Plot2Code. With Plot2Code, we reveal that most existing MLLMs struggle with visual coding for text-dense plots, heavily relying on textual instruction. We hope that the evaluation results from Plot2Code on visual coding will guide the future development of MLLMs. All data involved with Plot2Code are available at this https URL.
摘要：多模态大语言模型（MLLM）的显着进步由于其在视觉上下文中的卓越性能而引起了人们的广泛关注。然而，它们将视觉图形转化为可执行代码的能力尚未得到彻底评估。为了解决这个问题，我们引入了 Plot2Code，这是一个综合性的视觉编码基准测试，旨在对 MLLM 进行公平和深入的评估。我们从公开可用的 matplotlib 库中仔细收集了 132 个手动选择的跨六种绘图类型的高质量 matplotlib 绘图。对于每个图，我们都会仔细提供其源代码以及 GPT-4 总结的描述性指令。这种方法使 Plot2Code 能够跨各种输入模式广泛评估 MLLM 的代码功能。此外，我们提出了三个自动评估指标，包括代码通过率、文本匹配率和 GPT-4V 总体评分，用于对输出代码和渲染图像进行细粒度评估。我们不是简单地判断通过或失败，而是使用 GPT-4V 对生成图像和参考图像进行整体判断，这已被证明与人类评估一致。评估结果包括对 14 个 MLLM（例如专有的 GPT-4V、Gemini-Pro 和开源 Mini-Gemini）的分析，突显了 Plot2Code 所面临的重大挑战。通过 Plot2Code，我们发现大多数现有的 MLLM 在文本密集图的视觉编码方面遇到困难，严重依赖文本指令。我们希望 Plot2Code 对视觉编码的评估结果能够指导 MLLM 的未来发展。与 Plot2Code 相关的所有数据均可在此 https URL 中获取。