2024-07-29

Title: The Need for Guardrails with Large Language Models in Medical Safety-Critical Settings: An Artificial Intelligence Application in the Pharmacovigilance Ecosystem

Authors: Joe B Hakim, Jeffery L Painter, Darmendra Ramcharran, Vijay Kara, Greg Powell, Paulina Sobczak, Chiho Sato, Andrew Bate, Andrew Beam
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2407.18322
Pdf URL: https://arxiv.org/pdf/2407.18322
Copy Paste: [[2407.18322]] The Need for Guardrails with Large Language Models in Medical Safety-Critical Settings: An Artificial Intelligence Application in the Pharmacovigilance Ecosystem(https://arxiv.org/abs/2407.18322)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) are useful tools with the capacity for performing specific types of knowledge work at an effective scale. However, LLM deployments in high-risk and safety-critical domains pose unique challenges, notably the issue of "hallucination," where LLMs can generate fabricated information. This is particularly concerning in settings such as drug safety, where inaccuracies could lead to patient harm. To mitigate these risks, we have developed and demonstrated a proof-of-concept suite of guardrails specifically designed to mitigate certain types of hallucinations and errors for drug safety, and potentially applicable to other medical safety-critical contexts. These guardrails include mechanisms to detect anomalous documents to prevent the ingestion of inappropriate data, identify incorrect drug names or adverse event terms, and convey uncertainty in generated content. We integrated these guardrails with an LLM fine-tuned for a text-to-text task, which involves converting both structured and unstructured data within adverse event reports into natural language. This method was applied to translate individual case safety reports, demonstrating effective application in a pharmacovigilance processing task. Our guardrail framework offers a set of tools with broad applicability across various domains, ensuring LLMs can be safely used in high-risk situations by eliminating the occurrence of key errors, including the generation of incorrect pharmacovigilance-related terms, thus adhering to stringent regulatory and quality standards in medical safety-critical environments.
摘要：大型语言模型 (LLM) 是一种有用的工具，能够以有效的规模执行特定类型的知识工作。然而，在高风险和安全关键领域部署 LLM 带来了独特的挑战，尤其是“幻觉”问题，LLM 可能会生成虚假信息。这在药品安全等环境中尤其令人担忧，因为不准确的信息可能会导致患者受到伤害。为了减轻这些风险，我们开发并展示了一套概念验证防护措施，专门用于减轻某些类型的幻觉和药品安全错误，并可能适用于其他医疗安全关键环境。这些防护措施包括检测异常文档的机制，以防止摄入不适当的数据，识别错误的药品名称或不良事件术语，并在生成的内容中传达不确定性。我们将这些防护措施与针对文本到文本任务进行微调的 LLM 集成在一起，该任务涉及将不良事件报告中的结构化和非结构化数据转换为自然语言。该方法已用于翻译个案安全报告，在药物警戒处理任务中表现出了良好的应用效果。我们的护栏框架提供了一套适用于各个领域的广泛工具，通过消除关键错误（包括生成错误的药物警戒相关术语）来确保 LLM 可以在高风险情况下安全使用，从而在医疗安全至关重要的环境中遵守严格的监管和质量标准。
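
The guardrail that identifies incorrect drug names can be illustrated with a minimal sketch. The abstract does not describe the implementation, so everything here is an assumption for illustration: the vocabulary `KNOWN_DRUGS`, the function names, and the idea of a plain dictionary lookup are all hypothetical stand-ins for whatever curated terminology a real system would use.

```python
# Hypothetical mini-vocabulary; a real system would load a curated drug dictionary.
KNOWN_DRUGS = {"ibuprofen", "metformin", "warfarin"}


def flag_unknown_drugs(candidate_terms, vocabulary=KNOWN_DRUGS):
    """Return the candidate drug terms that are absent from the vocabulary."""
    return [t for t in candidate_terms if t.strip().lower() not in vocabulary]


def guard_output(generated_text, candidate_terms):
    """Mark generated text as unverified when it mentions unknown drug terms."""
    unknown = flag_unknown_drugs(candidate_terms)
    return {
        "text": generated_text,
        "ok": not unknown,              # True only if every term was found
        "unverified_terms": unknown,    # surfaced to a human reviewer
    }


result = guard_output("Patient took Warfarin and warfarex.", ["Warfarin", "warfarex"])
print(result["ok"], result["unverified_terms"])
```

The sketch flags rather than rewrites: in a safety-critical pipeline it seems safer to route an unverified term to review than to guess a correction.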

Title: Unveiling Scoring Processes: Dissecting the Differences between LLMs and Human Graders in Automatic Scoring

Authors: Xuansheng Wu, Padmaja Pravin Saraf, Gyeong-Geon Lee, Ehsan Latif, Ninghao Liu, Xiaoming Zhai
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2407.18328
Pdf URL: https://arxiv.org/pdf/2407.18328
Copy Paste: [[2407.18328]] Unveiling Scoring Processes: Dissecting the Differences between LLMs and Human Graders in Automatic Scoring(https://arxiv.org/abs/2407.18328)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have demonstrated strong potential in performing automatic scoring for constructed response assessments. While constructed responses graded by humans are usually based on given grading rubrics, the methods by which LLMs assign scores remain largely unclear. It is also uncertain how closely AI's scoring process mirrors that of humans, or if it adheres to the same grading criteria. To address this gap, this paper uncovers the grading rubrics that LLMs used to score students' written responses to science tasks and their alignment with human scores. We also examine whether enhancing the alignments can improve scoring accuracy. Specifically, we prompt LLMs to generate analytic rubrics that they use to assign scores and study the alignment gap with human grading rubrics. Based on a series of experiments with various configurations of LLM settings, we reveal a notable alignment gap between human and LLM graders. While LLMs can adapt quickly to scoring tasks, they often resort to shortcuts, bypassing deeper logical reasoning expected in human grading. We found that incorporating high-quality analytical rubrics designed to reflect human grading logic can mitigate this gap and enhance LLMs' scoring accuracy. These results caution against the simplistic application of LLMs in science education and highlight the importance of aligning LLM outputs with human expectations to ensure efficient and accurate automatic scoring.
摘要：大型语言模型 (LLM) 在自动评分结构化反应评估方面表现出了巨大的潜力。虽然人类评分的结构化反应通常基于给定的评分标准，但 LLM 评分的方法在很大程度上仍不清楚。人工智能的评分过程与人类的评分过程有多接近，或者是否遵循相同的评分标准，也尚不确定。为了解决这一差距，本文揭示了 LLM 用于评分学生对科学任务的书面回答的评分标准及其与人类分数的一致性。我们还研究了增强一致性是否可以提高评分准确性。具体来说，我们提示 LLM 生成用于分配分数的分析标准，并研究与人类评分标准的一致性差距。基于对各种 LLM 设置配置的一系列实验，我们发现人类和 LLM 评分者之间存在明显的一致性差距。虽然 LLM 可以快速适应评分任务，但它们经常走捷径，绕过人类评分中期望的更深层次的逻辑推理。我们发现，采用旨在反映人类评分逻辑的高质量分析评分标准可以弥补这一差距，并提高 LLM 的评分准确性。这些结果警告不要将 LLM 简单应用于科学教育，并强调将 LLM 输出与人类期望相一致以确保高效准确的自动评分的重要性。

Title: Robust Claim Verification Through Fact Detection

Authors: Nazanin Jafari, James Allan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.18367
Pdf URL: https://arxiv.org/pdf/2407.18367
Copy Paste: [[2407.18367]] Robust Claim Verification Through Fact Detection(https://arxiv.org/abs/2407.18367)
Keywords: language model, llm, prompt
Abstract: Claim verification can be a challenging task. In this paper, we present a method to enhance the robustness and reasoning capabilities of automated claim verification through the extraction of short facts from evidence. Our novel approach, FactDetect, leverages Large Language Models (LLMs) to generate concise factual statements from evidence and label these facts based on their semantic relevance to the claim and evidence. The generated facts are then combined with the claim and evidence. To train a lightweight supervised model, we incorporate a fact-detection task into the claim verification process as a multitasking approach to improve both performance and explainability. We also show that augmenting FactDetect in the claim verification prompt enhances performance in zero-shot claim verification using LLMs. In the supervised claim verification model, our method achieves competitive results, improving the F1 score by 15% on challenging scientific claim verification datasets. We also demonstrate that FactDetect can be augmented with claim and evidence for zero-shot prompting (AugFactDetect) in LLMs for verdict prediction. We show that AugFactDetect outperforms the baseline with statistical significance on three challenging scientific claim verification datasets, with an average performance gain of 17.3% over the best-performing baselines.
摘要：声明验证可能是一项具有挑战性的任务。在本文中，我们提出了一种通过从证据中提取简短事实来增强自动声明验证的稳健性和推理能力的方法。我们的新方法 FactDetect 利用大型语言模型 (LLM) 从证据中生成简洁的事实陈述，并根据这些事实与声明和证据的语义相关性对这些事实进行标记。然后将生成的事实与声明和证据相结合。为了训练轻量级监督模型，我们将事实检测任务纳入声明验证过程，作为一种多任务方法，以提高性能和可解释性。我们还表明，在声明验证提示中增强 FactDetect 可提高使用 LLM 的零样本声明验证的性能。在针对具有挑战性的科学声明验证数据集进行评估时，我们的方法在监督声明验证模型中表现出具有竞争力的结果，F1 分数提高了 15%。我们还证明，可以使用声明和证据增强 FactDetect，以便在 LLM 中进行零样本提示 (AugFactDetect) 以进行判决预测。我们表明，AugFactDetect 在三个具有挑战性的科学声明验证数据集上的表现具有统计意义，优于基线，与表现最好的基线相比，平均性能提高了 17.3%。

Title: PersonaGym: Evaluating Persona Agents and LLMs

Authors: Vinay Samuel, Henry Peng Zou, Yue Zhou, Shreyas Chaudhari, Ashwin Kalyan, Tanmay Rajpurohit, Ameet Deshpande, Karthik Narasimhan, Vishvak Murahari
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2407.18416
Pdf URL: https://arxiv.org/pdf/2407.18416
Copy Paste: [[2407.18416]] PersonaGym: Evaluating Persona Agents and LLMs(https://arxiv.org/abs/2407.18416)
Keywords: gpt, llm, agent
Abstract: Persona agents, which are LLM agents that act according to an assigned persona, have demonstrated impressive contextual response capabilities across various applications. These persona agents offer significant enhancements across diverse sectors, such as education, healthcare, and entertainment, where model developers can align agent responses to different user requirements, thereby broadening the scope of agent applications. However, evaluating persona agent performance is incredibly challenging due to the complexity of assessing persona adherence in free-form interactions across various environments that are relevant to each persona agent. We introduce PersonaGym, the first dynamic evaluation framework for assessing persona agents, and PersonaScore, the first automated human-aligned metric grounded in decision theory for comprehensive large-scale evaluation of persona agents. Our evaluation of 6 open and closed-source LLMs, using a benchmark encompassing 200 personas and 10,000 questions, reveals significant opportunities for advancement in persona agent capabilities across state-of-the-art models. For example, Claude 3.5 Sonnet shows only a 2.97% relative improvement in PersonaScore over GPT 3.5 despite being a much more advanced model. Importantly, we find that increased model size and complexity do not necessarily imply enhanced persona agent capabilities, thereby highlighting the pressing need for algorithmic and architectural invention towards faithful and performant persona agents.
摘要：角色代理是根据指定角色行事的 LLM 代理，已在各种应用中展示了令人印象深刻的情境响应能力。这些角色代理在教育、医疗保健和娱乐等不同领域提供了显著的增强，模型开发人员可以根据不同的用户需求调整代理响应，从而扩大代理应用的范围。然而，评估角色代理的性能极具挑战性，因为评估与每个角色代理相关的各种环境中自由形式交互中角色依从性的复杂性。我们推出了 PersonaGym，这是第一个用于评估角色代理的动态评估框架，以及 PersonaScore，这是第一个基于决策理论的自动化人机对齐指标，用于对角色代理进行全面的大规模评估。我们对 6 个开源和闭源 LLM 进行了评估，使用包含 200 个角色和 10,000 个问题的基准，揭示了在最先进模型中角色代理能力的巨大提升机会。例如，尽管 Claude 3.5 Sonnet 是一个更先进的模型，但它的 PersonaScore 仅比 GPT 3.5 提高了 2.97%。重要的是，我们发现模型大小和复杂性的增加并不一定意味着角色代理能力的增强，从而凸显了对算法和架构创新的迫切需求，以实现忠实而高性能的角色代理。
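
For readers unfamiliar with the term, a relative improvement such as the 2.97% figure above is the score difference divided by the baseline score. The sketch below shows that arithmetic only; the two scores are hypothetical placeholders, not the PersonaScore values from the paper, and `relative_improvement` is a name chosen for illustration.

```python
def relative_improvement(new_score, baseline_score):
    """Percentage gain of new_score over baseline_score."""
    if baseline_score == 0:
        raise ValueError("baseline_score must be non-zero")
    return (new_score - baseline_score) / baseline_score * 100.0


# Hypothetical scores chosen for easy arithmetic, not taken from the paper.
print(relative_improvement(5.0, 4.0))  # 25.0
```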

Title: The Art of Refusal: A Survey of Abstention in Large Language Models

Authors: Bingbing Wen, Jihan Yao, Shangbin Feng, Chenjun Xu, Yulia Tsvetkov, Bill Howe, Lucy Lu Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.18418
Pdf URL: https://arxiv.org/pdf/2407.18418
Copy Paste: [[2407.18418]] The Art of Refusal: A Survey of Abstention in Large Language Models(https://arxiv.org/abs/2407.18418)
Keywords: language model, llm, hallucination
Abstract: Abstention, the refusal of large language models (LLMs) to provide an answer, is increasingly recognized for its potential to mitigate hallucinations and enhance safety in building LLM systems. In this survey, we introduce a framework to examine abstention behavior from three perspectives: the query, the model, and human values. We review the literature on abstention methods (categorized based on the development stages of LLMs), benchmarks, and evaluation metrics, and discuss the merits and limitations of prior work. We further identify and motivate areas for future research, such as encouraging the study of abstention as a meta-capability across tasks and customizing abstention abilities based on context. In doing so, we aim to broaden the scope and impact of abstention methodologies in AI systems.
摘要：弃权，即大型语言模型 (LLM) 拒绝提供答案，因其在构建 LLM 系统时减轻幻觉和增强安全性的潜力而日益受到认可。在本次调查中，我们引入了一个框架，从三个角度检查弃权行为：查询、模型和人类价值观。我们回顾了弃权方法（根据 LLM 的开发阶段分类）、基准和评估指标的文献，并讨论了先前研究的优点和局限性。我们进一步确定并激发未来研究的领域，例如鼓励将弃权作为跨任务的元能力进行研究，并根据上下文定制弃权能力。通过这样做，我们旨在扩大人工智能系统中弃权方法的范围和影响。

Title: Self-Directed Synthetic Dialogues and Revisions Technical Report

Authors: Nathan Lambert, Hailey Schoelkopf, Aaron Gokaslan, Luca Soldaini, Valentina Pyatkin, Louis Castricato
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2407.18421
Pdf URL: https://arxiv.org/pdf/2407.18421
Copy Paste: [[2407.18421]] Self-Directed Synthetic Dialogues and Revisions Technical Report(https://arxiv.org/abs/2407.18421)
Keywords: language model
Abstract: Synthetic data has become an important tool in the fine-tuning of language models to follow instructions and solve complex problems. Nevertheless, most open data to date lacks multi-turn conversations and is collected from closed models, limiting progress on advancing open fine-tuning methods. We introduce Self Directed Synthetic Dialogues (SDSD), an experimental dataset consisting of guided conversations of language models talking to themselves. The dataset consists of multi-turn conversations generated with DBRX, Llama 2 70B, and Mistral Large, all instructed to follow a conversation plan generated prior to the conversation. We also explore including principles from Constitutional AI and other related works to create synthetic preference data via revisions to the final conversation turn. We hope this work encourages further exploration in multi-turn data and the use of open models for expanding the impact of synthetic data.
摘要：合成数据已成为语言模型微调以遵循指令和解决复杂问题的重要工具。然而，迄今为止，大多数开放数据通常缺乏多轮数据并且是在封闭模型上收集的，这限制了开放微调方法的进展。我们引入了自导合成对话 (SDSD)，这是一个实验数据集，由语言模型自言自语的引导对话组成。该数据集包括使用 DBRX、Llama 2 70B 和 Mistral Large 生成的多轮对话，所有对话均被指示遵循对话前生成的对话计划。我们还探索包括 Constitutional AI 和其他相关作品中的原则，通过对最终对话轮次的修订来创建合成偏好数据。我们希望这项工作能够鼓励进一步探索多轮数据并使用开放模型来扩大合成数据的影响力。

Title: Guidance-Based Prompt Data Augmentation in Specialized Domains for Named Entity Recognition

Authors: Hyeonseok Kang, Hyein Seo, Jeesu Jung, Sangkeun Jung, Du-Seong Chang, Riwoo Chung
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.18442
Pdf URL: https://arxiv.org/pdf/2407.18442
Copy Paste: [[2407.18442]] Guidance-Based Prompt Data Augmentation in Specialized Domains for Named Entity Recognition(https://arxiv.org/abs/2407.18442)
Keywords: prompt
Abstract: While the abundance of rich and vast datasets across numerous fields has facilitated the advancement of natural language processing, sectors in need of specialized data types continue to struggle with the challenge of finding quality data. Our study introduces a novel guidance data augmentation technique utilizing abstracted context and sentence structures to produce varied sentences while maintaining context-entity relationships, addressing data scarcity challenges. By fostering a closer relationship between context, sentence structure, and role of entities, our method enhances data augmentation's effectiveness. Consequently, it diversifies both entity-related vocabulary and overall sentence structure while improving the training performance of the named entity recognition task.
摘要：虽然众多领域中丰富而庞大的数据集促进了自然语言处理的发展，但需要专门数据类型的部门仍然面临着寻找优质数据的挑战。我们的研究引入了一种新颖的指导数据增强技术，利用抽象的上下文和句子结构来生成不同的句子，同时保持上下文-实体关系，从而解决数据稀缺的挑战。通过建立上下文、句子结构和实体角色之间的更紧密关系，我们的方法提高了数据增强的有效性。因此，通过展示与实体相关的词汇和整体句子结构的多样化，同时提高命名实体识别任务的训练性能。

Title: Fairness Definitions in Language Models Explained

Authors: Thang Viet Doan, Zhibo Chu, Zichong Wang, Wenbin Zhang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2407.18454
Pdf URL: https://arxiv.org/pdf/2407.18454
Copy Paste: [[2407.18454]] Fairness Definitions in Language Models Explained(https://arxiv.org/abs/2407.18454)
Keywords: language model
Abstract: Language Models (LMs) have demonstrated exceptional performance across various Natural Language Processing (NLP) tasks. Despite these advancements, LMs can inherit and amplify societal biases related to sensitive attributes such as gender and race, limiting their adoption in real-world applications. Therefore, fairness has been extensively explored in LMs, leading to the proposal of various fairness notions. However, the lack of clear agreement on which fairness definition to apply in specific contexts (e.g., medium-sized LMs versus large-sized LMs) and the complexity of understanding the distinctions between these definitions can create confusion and impede further progress. To this end, this paper proposes a systematic survey that clarifies the definitions of fairness as they apply to LMs. Specifically, we begin with a brief introduction to LMs and fairness in LMs, followed by a comprehensive, up-to-date overview of existing fairness notions in LMs and the introduction of a novel taxonomy that categorizes these concepts based on their foundational principles and operational distinctions. We further illustrate each definition through experiments, showcasing their practical implications and outcomes. Finally, we discuss current research challenges and open questions, aiming to foster innovative ideas and advance the field. The implementation and additional resources are publicly available at this https URL.
摘要：语言模型 (LM) 在各种自然语言处理 (NLP) 任务中表现出色。尽管取得了这些进步，但 LM 可能会继承和放大与性别和种族等敏感属性相关的社会偏见，从而限制其在现实世界中的应用。因此，公平性在 LM 中得到了广泛的探索，从而提出了各种公平概念。然而，由于缺乏明确的共识，无法确定在特定情况下应应用哪种公平性定义（例如，中型 LM 与大型 LM），而且理解这些定义之间的区别很复杂，可能会造成混乱并阻碍进一步的进展。为此，本文提出了一项系统调查，以澄清公平性在 LM 中的定义。具体来说，我们首先简要介绍 LM 和 LM 中的公平性，然后全面、最新地概述 LM 中现有的公平性概念，并介绍一种新颖的分类法，该分类法根据其基本原则和操作区别对这些概念进行分类。我们通过实验进一步阐述每个定义，展示它们的实际意义和结果。最后，我们讨论当前的研究挑战和未解决的问题，旨在培养创新思想并推动该领域发展。实施和其他资源可在此 https URL 上公开获取。

Title: Multi-turn Response Selection with Commonsense-enhanced Language Models

Authors: Yuandong Wang, Xuhui Ren, Tong Chen, Yuxiao Dong, Nguyen Quoc Viet Hung, Jie Tang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.18479
Pdf URL: https://arxiv.org/pdf/2407.18479
Copy Paste: [[2407.18479]] Multi-turn Response Selection with Commonsense-enhanced Language Models(https://arxiv.org/abs/2407.18479)
Keywords: language model, chat
Abstract: As a branch of advanced artificial intelligence, dialogue systems are prospering. Multi-turn response selection is a general research problem in dialogue systems. With the assistance of background information and pre-trained language models, the performance of state-of-the-art methods on this problem has improved impressively. However, existing studies neglect the importance of external commonsense knowledge. Hence, we design a Siamese network where a pre-trained Language model merges with a Graph neural network (SinLG). SinLG takes advantage of Pre-trained Language Models (PLMs) to catch the word correlations in the context and response candidates and utilizes a Graph Neural Network (GNN) to reason helpful common sense from an external knowledge graph. The GNN aims to assist the PLM during fine-tuning and to arouse its related memories to attain better performance. Specifically, we first extract related concepts as nodes from an external knowledge graph to construct a subgraph with the context response pair as a super node for each sample. Next, we learn two representations for the context response pair via both the PLM and GNN. A similarity loss between the two representations is utilized to transfer the commonsense knowledge from the GNN to the PLM. Then only the PLM is used to infer online so that efficiency can be guaranteed. Finally, we conduct extensive experiments on two variants of the PERSONA-CHAT dataset, which show that our solution can not only improve the performance of the PLM but also achieve efficient inference.
摘要：对话系统作为高级人工智能的一个分支，正在蓬勃发展。多轮响应选择是对话系统中一个普遍的研究问题。在背景信息和预训练语言模型的帮助下，最先进的方法在这个问题上的表现得到了显著的提升。然而，现有的研究忽视了外部常识知识的重要性。因此，我们设计了一个孪生网络，其中预训练的语言模型与图神经网络 (SinLG) 融合。SinLG 利用预训练语言模型 (PLM) 来捕捉上下文和响应候选中的单词相关性，并利用图神经网络 (GNN) 从外部知识图谱中推理出有用的常识。GNN 旨在协助 PLM 进行微调，并唤醒其相关记忆以获得更好的性能。具体来说，我们首先从外部知识图谱中提取相关概念作为节点，为每个样本构建一个以上下文响应对为超节点的子图。接下来，我们通过 PLM 和 GNN 学习上下文响应对的两种表示。利用两种表示之间的相似性损失将常识知识从 GNN 迁移到 PLM。然后只使用 PLM 进行在线推理，从而保证效率。最后，我们在 PERSONA-CHAT 数据集的两个变体上进行了广泛的实验，证明了我们的解决方案不仅可以提高 PLM 的性能，还可以实现高效的推理。
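
The similarity loss between the PLM and GNN representations can be sketched as follows. The abstract does not state which similarity measure SinLG uses, so the cosine-based form below is an assumption for illustration, and the function names and toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("representations must be non-zero vectors")
    return dot / (norm_u * norm_v)


def similarity_loss(plm_repr, gnn_repr):
    """Assumed loss: 0 when the two representations align, larger as they diverge."""
    return 1.0 - cosine_similarity(plm_repr, gnn_repr)


# Toy representations of one context-response pair (hypothetical values).
print(similarity_loss([1.0, 0.0], [1.0, 0.0]))  # aligned
print(similarity_loss([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

Minimizing such a term during training pulls the PLM representation toward the GNN one, which is consistent with the abstract's point that only the PLM is needed at inference time.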

Title: A Role-specific Guided Large Language Model for Ophthalmic Consultation Based on Stylistic Differentiation

Authors: Laiyi Fu, Binbin Fan, Hongkai Du, Yanxiang Feng, Chunhua Li, Huping Song
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.18483
Pdf URL: https://arxiv.org/pdf/2407.18483
Copy Paste: [[2407.18483]] A Role-specific Guided Large Language Model for Ophthalmic Consultation Based on Stylistic Differentiation(https://arxiv.org/abs/2407.18483)
Keywords: language model, gpt, chat
Abstract: Ophthalmology consultations are crucial for diagnosing, treating, and preventing eye diseases. However, the growing demand for consultations exceeds the availability of ophthalmologists. By leveraging large pre-trained language models, we can design effective dialogues for specific scenarios, aiding in consultations. Traditional fine-tuning strategies for question-answering tasks are impractical because of increasing model size, and they often ignore the patient-doctor role function during consultations. In this paper, we propose EyeDoctor, an ophthalmic medical questioning large language model that enhances accuracy through doctor-patient role perception guidance and a knowledge base augmented with external disease information. Experimental results show EyeDoctor achieves higher question-answering precision in ophthalmology consultations. Notably, EyeDoctor demonstrated a 7.25% improvement in Rouge-1 scores and a 10.16% improvement in F1 scores on multi-round datasets compared to the second-best model, ChatGPT, highlighting the importance of doctor-patient role differentiation and dynamic knowledge base expansion for intelligent medical consultations. EyeDoc is also available as a free web-based service, and the source code is available at this https URL.
摘要：眼科问诊是诊断、治疗和预防眼部疾病的关键。然而，问诊需求的不断增长超过了眼科医生的可用人数。通过利用大型预训练语言模型，我们可以为特定场景设计有效的对话，帮助问诊。传统的问答任务微调策略由于模型规模不断增加且在问诊过程中经常忽略医患角色功能而变得不切实际。在本文中，我们提出了 EyeDoctor，这是一个眼科医学问答大型语言模型，通过引导医患角色感知和利用外部疾病信息增强知识库来提高准确率。实验结果表明，EyeDoctor 在眼科问诊中实现了更高的问答准确率。值得注意的是，与第二好的模型 ChatGPT 相比，EyeDoctor 在多轮数据集上的 Rouge-1 分数提高了 7.25%，F1 分数提高了 10.16%，凸显了医患角色区分和动态知识库扩展对于智能医疗问诊的重要性。 EyeDoc 也可作为一项免费的基于网络的服务，其源代码可通过此 https URL 获得。

Title: A Reliable Common-Sense Reasoning Socialbot Built Using LLMs and Goal-Directed ASP

Authors: Yankai Zeng, Abhiramon Rajashekharan, Kinjal Basu, Huaduo Wang, Joaquín Arias, Gopal Gupta
Subjects: cs.CL, cs.AI, cs.LO
Abstract URL: https://arxiv.org/abs/2407.18498
Pdf URL: https://arxiv.org/pdf/2407.18498
Copy Paste: [[2407.18498]] A Reliable Common-Sense Reasoning Socialbot Built Using LLMs and Goal-Directed ASP(https://arxiv.org/abs/2407.18498)
Keywords: language model, gpt, llm, chat
Abstract: The development of large language models (LLMs), such as GPT, has enabled the construction of several socialbots, like ChatGPT, that are receiving a lot of attention for their ability to simulate a human conversation. However, the conversation is not guided by a goal and is hard to control. In addition, because LLMs rely more on pattern recognition than deductive reasoning, they can give confusing answers and have difficulty integrating multiple topics into a cohesive response. These limitations often lead the LLM to deviate from the main topic to keep the conversation interesting. We propose AutoCompanion, a socialbot that uses an LLM model to translate natural language into predicates (and vice versa) and employs commonsense reasoning based on Answer Set Programming (ASP) to hold a social conversation with a human. In particular, we rely on s(CASP), a goal-directed implementation of ASP as the backend. This paper presents the framework design and how an LLM is used to parse user messages and generate a response from the s(CASP) engine output. To validate our proposal, we describe (real) conversations in which the chatbot's goal is to keep the user entertained by talking about movies and books, and s(CASP) ensures (i) correctness of answers, (ii) coherence (and precision) during the conversation, which it dynamically regulates to achieve its specific purpose, and (iii) no deviation from the main topic.
摘要：大型语言模型 (LLM)（例如 GPT）的发展使得构建多个社交机器人（例如 ChatGPT）成为可能，这些机器人因其模拟人类对话的能力而备受关注。然而，对话不受目标引导，难以控制。此外，由于 LLM 更多地依赖于模式识别而不是演绎推理，因此它们可能会给出令人困惑的答案，并且难以将多个主题整合成一个连贯的响应。这些限制通常会导致 LLM 偏离主要主题以保持对话的趣味性。我们提出了 AutoCompanion，这是一个社交机器人，它使用 LLM 模型将自然语言转换为谓词（反之亦然），并使用基于答案集编程 (ASP) 的常识推理与人类进行社交对话。具体来说，我们依赖 s(CASP)，这是 ASP 的目标导向实现作为后端。本文介绍了框架设计以及如何使用 LLM 解析用户消息并从 s(CASP) 引擎输出生成响应。为了验证我们的提议，我们描述了（真实的）对话，其中聊天机器人的目标是通过谈论电影和书籍来让用户感到愉悦，并且 s（CASP）确保（i）答案的正确性，（ii）对话过程中的连贯性（和准确性），它会动态地调节以实现其特定目的，以及（iii）不偏离主要话题。

Title: Is larger always better? Evaluating and prompting large language models for non-generative medical tasks

Authors: Yinghao Zhu, Junyi Gao, Zixiang Wang, Weibin Liao, Xiaochen Zheng, Lifang Liang, Yasha Wang, Chengwei Pan, Ewen M. Harrison, Liantao Ma
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2407.18525
Pdf URL: https://arxiv.org/pdf/2407.18525
Copy Paste: [[2407.18525]] Is larger always better? Evaluating and prompting large language models for non-generative medical tasks(https://arxiv.org/abs/2407.18525)
Keywords: language model, gpt, llm, prompt
Abstract: The use of Large Language Models (LLMs) in medicine is growing, but their ability to handle both structured Electronic Health Record (EHR) data and unstructured clinical notes is not well-studied. This study benchmarks various models, including GPT-based LLMs, BERT-based models, and traditional clinical predictive models, for non-generative medical tasks utilizing renowned datasets. We assessed 14 language models (9 GPT-based and 5 BERT-based) and 7 traditional predictive models using the MIMIC dataset (ICU patient records) and the TJH dataset (early COVID-19 EHR data), focusing on tasks such as mortality and readmission prediction, disease hierarchy reconstruction, and biomedical sentence matching, comparing both zero-shot and finetuned performance. Results indicated that LLMs exhibited robust zero-shot predictive capabilities on structured EHR data when using well-designed prompting strategies, frequently surpassing traditional models. However, for unstructured medical texts, LLMs did not outperform finetuned BERT models, which excelled in both supervised and unsupervised tasks. Consequently, while LLMs are effective for zero-shot learning on structured data, finetuned BERT models are more suitable for unstructured texts, underscoring the importance of selecting models based on specific task requirements and data characteristics to optimize the application of NLP technology in healthcare.
摘要：大型语言模型 (LLM) 在医学中的使用正在增长，但它们处理结构化电子健康记录 (EHR) 数据和非结构化临床笔记的能力尚不清楚。本研究利用知名数据集对各种模型进行了基准测试，包括基于 GPT 的 LLM、基于 BERT 的模型和传统的临床预测模型，用于非生成性医疗任务。我们使用 MIMIC 数据集（ICU 患者记录）和 TJH 数据集（早期 COVID-19 EHR 数据）评估了 14 种语言模型（9 种基于 GPT 和 5 种基于 BERT）和 7 种传统预测模型，重点关注死亡率和再入院预测、疾病层次重建和生物医学句子匹配等任务，比较了零样本和微调性能。结果表明，在使用精心设计的提示策略时，LLM 在结构化 EHR 数据上表现出强大的零样本预测能力，经常超越传统模型。然而，对于非结构化的医学文本，LLM 的表现并不优于微调的 BERT 模型，后者在监督和无监督任务中均表现出色。因此，虽然 LLM 适用于结构化数据的零样本学习，但微调的 BERT 模型更适合非结构化文本，这凸显了根据特定任务需求和数据特征选择模型的重要性，以优化 NLP 技术在医疗保健领域的应用。

Title: Towards a Multidimensional Evaluation Framework for Empathetic Conversational Systems

Authors: Aravind Sesagiri Raamkumar, Siyuan Brandon Loh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.18538
Pdf URL: https://arxiv.org/pdf/2407.18538
Copy Paste: [[2407.18538]] Towards a Multidimensional Evaluation Framework for Empathetic Conversational Systems(https://arxiv.org/abs/2407.18538)
Keywords: language model, llm
Abstract: Empathetic Conversational Systems (ECS) are built to respond empathetically to the user's emotions and sentiments, regardless of the application domain. Current evaluation approaches in ECS studies are restricted to offline evaluation experiments, primarily for gold-standard comparison and benchmarking, and user evaluation studies for collecting human ratings on specific constructs. These methods are inadequate for measuring the actual quality of empathy in conversations. In this paper, we propose a multidimensional empathy evaluation framework with three new methods for measuring empathy at (i) the structural level using three empathy-related dimensions, (ii) the behavioral level using empathy behavioral types, and (iii) the overall level using an empathy lexicon, thereby fortifying the evaluation process. Experiments were conducted with the state-of-the-art ECS models and large language models (LLMs) to show the framework's usefulness.
摘要：无论应用领域如何，共情对话系统 (ECS) 都旨在对用户的情绪和情感做出共情反应。当前的 ECS 研究评估方法仅限于离线评估实验，主要用于黄金标准比较和基准测试，以及用户评估研究，用于收集人类对特定构造的评分。这些方法不足以衡量对话中共情的实际质量。在本文中，我们提出了一个多维共情评估框架，该框架包含三种新方法，用于在 (i) 使用三个共情相关维度的结构层面、(ii) 使用共情行为类型的行为层面和 (iii) 使用共情词典的整体层面测量共情，从而强化评估过程。使用最先进的 ECS 模型和大型语言模型 (LLM) 进行了实验，以证明该框架的实用性。

Title: A Universal Prompting Strategy for Extracting Process Model Information from Natural Language Text using Large Language Models

Authors: Julian Neuberger, Lars Ackermann, Han van der Aa, Stefan Jablonski
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.18540
Pdf URL: https://arxiv.org/pdf/2407.18540
Copy Paste: [[2407.18540]] A Universal Prompting Strategy for Extracting Process Model Information from Natural Language Text using Large Language Models(https://arxiv.org/abs/2407.18540)
Keywords: language model, llm, prompt
Abstract: Over the past decade, extensive research efforts have been dedicated to the extraction of information from textual process descriptions. Despite the remarkable progress witnessed in natural language processing (NLP), information extraction within the Business Process Management domain remains predominantly reliant on rule-based systems and machine learning methodologies. Data scarcity has so far prevented the successful application of deep learning techniques. However, the rapid progress in generative large language models (LLMs) makes it possible to solve many NLP tasks with very high quality without the need for extensive data. Therefore, we systematically investigate the potential of LLMs for extracting information from textual process descriptions, targeting the detection of process elements such as activities and actors, and relations between them. Using a heuristic algorithm, we demonstrate the suitability of the extracted information for process model generation. Based on a novel prompting strategy, we show that LLMs are able to outperform state-of-the-art machine learning approaches with absolute performance improvements of up to 8% $F_1$ score across three different datasets. We evaluate our prompting strategy on eight different LLMs, showing it is universally applicable, while also analyzing the impact of certain prompt parts on extraction quality. The number of example texts, the specificity of definitions, and the rigour of format instructions are identified as key for improving the accuracy of extracted information. Our code, prompts, and data are publicly available.
摘要：在过去十年中，人们投入了大量研究来从文本流程描述中提取信息。尽管自然语言处理 (NLP) 取得了显著进展，但业务流程管理领域内的信息提取仍然主要依赖于基于规则的系统和机器学习方法。数据稀缺迄今为止阻碍了深度学习技术的成功应用。然而，生成式大型语言模型 (LLM) 的快速发展使得人们能够以非常高的质量解决许多 NLP 任务，而无需大量数据。因此，我们系统地研究了 LLM 从文本流程描述中提取信息的潜力，目标是检测活动和参与者等流程元素以及它们之间的关系。使用启发式算法，我们证明了提取的信息适用于流程模型生成。基于一种新颖的提示策略，我们表明 LLM 能够超越最先进的机器学习方法，在三个不同的数据集上将绝对性能提高高达 8% 的 $F_1$ 分数。我们在 8 个不同的 LLM 上评估了我们的提示策略，表明它具有普遍适用性，同时还分析了某些提示部分对提取质量的影响。示例文本的数量、定义的特异性和格式说明的严谨性被认为是提高提取信息准确性的关键。我们的代码、提示和数据都是公开的。
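
The $F_1$ score reported above is the harmonic mean of precision and recall. A minimal sketch of how it could be computed for one text follows, assuming extracted process elements are compared to gold annotations by exact match; the matching rule, the function name, and the example elements are assumptions, since the abstract does not specify the evaluation protocol.

```python
def f1_score(predicted, gold):
    """F1 over two collections of extracted elements, using exact-match overlap."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical extracted activities versus gold annotations.
print(f1_score({"check invoice", "send email"}, {"check invoice", "approve order"}))  # 0.5
```

An "absolute" improvement of 8% then means a gain of 8 percentage points in this score, as opposed to a gain relative to the baseline value.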

Title: Dynamic Language Group-Based MoE: Enhancing Efficiency and Flexibility for Code-Switching Speech Recognition

Authors: Hukai Huang, Shenghui Lu, Yahui Shan, He Qu, Wenhao Guan, Qingyang Hong, Lin Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.18581
Pdf URL: https://arxiv.org/pdf/2407.18581
Copy Paste: [[2407.18581]] Dynamic Language Group-Based MoE: Enhancing Efficiency and Flexibility for Code-Switching Speech Recognition(https://arxiv.org/abs/2407.18581)
Keywords: language model
Abstract: The Mixture of Experts (MoE) approach is ideally suited for tackling multilingual and code-switching (CS) challenges due to its multi-expert architecture. This work introduces the DLG-MoE, which is optimized for bilingual and CS scenarios. Our novel Dynamic Language Group-based MoE layer features a language router with shared weights for explicit language modeling, while independent unsupervised routers within the language group handle attributes beyond language. This structure not only enhances expert extension capabilities but also supports dynamic top-k training, allowing for flexible inference across various top-k values and improving overall performance. The model requires no pre-training and supports streaming recognition, achieving state-of-the-art (SOTA) results with unmatched flexibility compared to other methods. The code will be released.

Title: Every Part Matters: Integrity Verification of Scientific Figures Based on Multimodal Large Language Models

Authors: Xiang Shi, Jiawei Liu, Yinpeng Liu, Qikai Cheng, Wei Lu
Subjects: cs.CL, cs.AI, cs.CV, cs.DL, cs.MM
Abstract URL: https://arxiv.org/abs/2407.18626
Pdf URL: https://arxiv.org/pdf/2407.18626
Copy Paste: [[2407.18626]] Every Part Matters: Integrity Verification of Scientific Figures Based on Multimodal Large Language Models(https://arxiv.org/abs/2407.18626)
Keywords: language model, llm
Abstract: This paper tackles a key issue in the interpretation of scientific figures: the fine-grained alignment of text and figures. It advances beyond prior research that primarily dealt with straightforward, data-driven visualizations such as bar and pie charts and only offered a basic understanding of diagrams through captioning and classification. We introduce a novel task, Figure Integrity Verification, designed to evaluate the precision of technologies in aligning textual knowledge with visual elements in scientific figures. To support this, we develop a semi-automated method for constructing a large-scale dataset, Figure-seg, specifically designed for this task. Additionally, we propose an innovative framework, Every Part Matters (EPM), which leverages Multimodal Large Language Models (MLLMs) to not only incrementally improve the alignment and verification of text-figure integrity but also enhance integrity through analogical reasoning. Our comprehensive experiments show that these innovations substantially improve upon existing methods, allowing for more precise and thorough analysis of complex scientific figures. This progress not only enhances our understanding of multimodal technologies but also stimulates further research and practical applications across fields requiring the accurate interpretation of complex visual data.

Title: The BIAS Detection Framework: Bias Detection in Word Embeddings and Language Models for European Languages

Authors: Alexandre Puttick, Leander Rankwiler, Catherine Ikae, Mascha Kurpicz-Briki
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.18689
Pdf URL: https://arxiv.org/pdf/2407.18689
Copy Paste: [[2407.18689]] The BIAS Detection Framework: Bias Detection in Word Embeddings and Language Models for European Languages(https://arxiv.org/abs/2407.18689)
Keywords: language model
Abstract: The project BIAS: Mitigating Diversity Biases of AI in the Labor Market is a four-year project funded by the European Commission and supported by the Swiss State Secretariat for Education, Research and Innovation (SERI). As part of the project, novel bias detection methods to identify societal bias in language models and word embeddings in European languages are developed, with particular attention to linguistic and geographic particularities. This technical report describes the overall architecture and components of the BIAS Detection Framework. The code described in this technical report is available and will be updated and expanded continuously with upcoming results from the BIAS project. The details about the datasets for the different languages are described in corresponding papers at scientific venues.

Title: Adaptive Contrastive Search: Uncertainty-Guided Decoding for Open-Ended Text Generation

Authors: Esteban Garces Arias, Julian Rodemann, Meimingwei Li, Christian Heumann, Matthias Aßenmacher
Subjects: cs.CL, cs.LG, stat.ME, stat.ML
Abstract URL: https://arxiv.org/abs/2407.18698
Pdf URL: https://arxiv.org/pdf/2407.18698
Copy Paste: [[2407.18698]] Adaptive Contrastive Search: Uncertainty-Guided Decoding for Open-Ended Text Generation(https://arxiv.org/abs/2407.18698)
Keywords: language model
Abstract: Decoding from the output distributions of large language models to produce high-quality text is a complex challenge in language modeling. Various approaches, such as beam search, sampling with temperature, $k$-sampling, nucleus $p$-sampling, typical decoding, contrastive decoding, and contrastive search, have been proposed to address this problem, aiming to improve coherence, diversity, as well as resemblance to human-generated text. In this study, we introduce adaptive contrastive search, a novel decoding strategy extending contrastive search by incorporating an adaptive degeneration penalty, guided by the estimated uncertainty of the model at each generation step. This strategy is designed to enhance both the creativity and diversity of the language modeling process while at the same time producing coherent and high-quality generated text output. Our findings indicate performance enhancement in both aspects, across different model architectures and datasets, underscoring the effectiveness of our method in text generation tasks. Our code base, datasets, and models are publicly available.
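
Note: the abstract above describes adaptive contrastive search only at a high level. The following is a minimal sketch of a single decoding step, not the authors' implementation. It uses the standard contrastive search score, (1 - alpha) * p(token) - alpha * max cosine similarity to the context, and sets alpha from the normalized entropy of the next-token distribution. That mapping from uncertainty to alpha is an assumption made for illustration; the function names and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def normalized_entropy(probs):
    """Shannon entropy divided by log(vocabulary size), giving a value in [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs)) if len(probs) > 1 else 0.0


def adaptive_contrastive_step(probs, cand_vecs, context_vecs, k=3):
    """Choose the next token id among the top-k candidates.

    probs:        next-token probabilities, one per vocabulary id
    cand_vecs:    one representation vector per vocabulary id
    context_vecs: representation vectors of the tokens generated so far
    """
    # ASSUMPTION: the penalty weight is the normalized entropy itself.
    # The paper's actual uncertainty-to-penalty mapping may differ.
    alpha = normalized_entropy(probs)
    top_k = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

    def score(i):
        # Degeneration penalty: similarity to the closest context token.
        penalty = max((cosine(cand_vecs[i], c) for c in context_vecs), default=0.0)
        return (1 - alpha) * probs[i] - alpha * penalty

    return max(top_k, key=score), alpha


# Toy example: token 0 repeats the context, token 1 is orthogonal to it.
vecs = [[1, 0], [0, 1], [1, 1]]
context = [[1, 0]]
uncertain_choice, alpha_uncertain = adaptive_contrastive_step([0.5, 0.3, 0.2], vecs, context)
confident_choice, alpha_confident = adaptive_contrastive_step([0.98, 0.01, 0.01], vecs, context)
```

In this toy setup, the flat distribution yields a large alpha, so the penalty dominates and the step avoids the repetitive token 0 in favor of token 1. The peaked distribution yields a small alpha, so the step keeps the most probable token.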

Title: ChatSchema: A pipeline of extracting structured information with Large Multimodal Models based on schema

Authors: Fei Wang, Yuewen Zheng, Qin Li, Jingyi Wu, Pengfei Li, Luxia Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.18716
Pdf URL: https://arxiv.org/pdf/2407.18716
Copy Paste: [[2407.18716]] ChatSchema: A pipeline of extracting structured information with Large Multimodal Models based on schema(https://arxiv.org/abs/2407.18716)
Keywords: gpt, chat
Abstract: Objective: This study introduces ChatSchema, an effective method for extracting and structuring information from unstructured data in medical paper reports using a combination of Large Multimodal Models (LMMs) and Optical Character Recognition (OCR) based on the schema. By integrating predefined schema, we intend to enable LMMs to directly extract and standardize information according to the schema specifications, facilitating further data entry. Method: Our approach involves a two-stage process, including classification and extraction for categorizing report scenarios and structuring information. We established and annotated a dataset to verify the effectiveness of ChatSchema, and evaluated key extraction using precision, recall, F1-score, and accuracy metrics. Based on key extraction, we further assessed value extraction. We conducted ablation studies on two LMMs to illustrate the improvement of structured information extraction with different input modalities and methods. Result: We analyzed 100 medical reports from Peking University First Hospital and established a ground truth dataset with 2,945 key-value pairs. We evaluated ChatSchema using GPT-4o and Gemini 1.5 Pro and found a higher overall performance of GPT-4o. The results are as follows: For key extraction, key-precision was 98.6%, key-recall was 98.5%, and key-F1-score was 98.6%. For value extraction based on correct key extraction, the overall accuracy was 97.2%, precision was 95.8%, recall was 95.8%, and F1-score was 95.8%. An ablation study demonstrated that ChatSchema achieved significantly higher overall accuracy and overall F1-score of key-value extraction, compared to the Baseline, with increases of 26.9% in overall accuracy and 27.4% in overall F1-score, respectively.
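
Note: the key-extraction metrics named in the abstract above (precision, recall, F1-score) can be computed from the set of extracted keys and the set of ground-truth keys. The sketch below shows the standard set-based definitions. The function name and the example keys are hypothetical, and the abstract does not say whether the authors matched keys exactly or with some normalization, so exact string matching is an assumption.

```python
def precision_recall_f1(predicted_keys, gold_keys):
    """Set-based precision, recall, and F1 for extracted keys.

    ASSUMPTION: a predicted key counts as correct only if it matches
    a ground-truth key exactly.
    """
    predicted, gold = set(predicted_keys), set(gold_keys)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: three of four predicted keys appear in the ground truth.
gold = ["patient_name", "age", "diagnosis", "report_date"]
predicted = ["patient_name", "age", "diagnosis", "hospital"]
p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```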

Title: Towards Generalized Offensive Language Identification

Authors: Alphaeus Dmonte, Tejas Arya, Tharindu Ranasinghe, Marcos Zampieri
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.18738
Pdf URL: https://arxiv.org/pdf/2407.18738
Copy Paste: [[2407.18738]] Towards Generalized Offensive Language Identification(https://arxiv.org/abs/2407.18738)
Keywords: language model, llm, prompt
Abstract: The prevalence of offensive content on the internet, encompassing hate speech and cyberbullying, is a pervasive issue worldwide. Consequently, it has garnered significant attention from the machine learning (ML) and natural language processing (NLP) communities. As a result, numerous systems have been developed to automatically identify potentially harmful content and mitigate its impact. These systems can follow two approaches: (1) use publicly available models and application endpoints, including prompting large language models (LLMs), or (2) annotate datasets and train ML models on them. However, both approaches lack an understanding of how generalizable they are. Furthermore, the applicability of these systems is often questioned in off-domain and practical environments. This paper empirically evaluates the generalizability of offensive language detection models and datasets across a novel generalized benchmark. We answer three research questions on generalizability. Our findings will be useful in creating robust real-world offensive language detection systems.

Title: Towards Effective and Efficient Continual Pre-training of Large Language Models

Authors: Jie Chen, Zhipeng Chen, Jiapeng Wang, Kun Zhou, Yutao Zhu, Jinhao Jiang, Yingqian Min, Wayne Xin Zhao, Zhicheng Dou, Jiaxin Mao, Yankai Lin, Ruihua Song, Jun Xu, Xu Chen, Rui Yan, Zhewei Wei, Di Hu, Wenbing Huang, Ji-Rong Wen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.18743
Pdf URL: https://arxiv.org/pdf/2407.18743
Copy Paste: [[2407.18743]] Towards Effective and Efficient Continual Pre-training of Large Language Models(https://arxiv.org/abs/2407.18743)
Keywords: language model
Abstract: Continual pre-training (CPT) has been an important approach for adapting language models to specific domains or tasks. To make the CPT approach more traceable, this paper presents a technical report for continually pre-training Llama-3 (8B), which significantly enhances the Chinese language ability and scientific reasoning ability of the backbone model. To enhance the new abilities while retaining the original abilities, we design specific data mixture and curriculum strategies by utilizing existing datasets and synthesizing high-quality datasets. Specifically, we synthesize multidisciplinary scientific question and answer (QA) pairs based on related web pages, and subsequently incorporate these synthetic data to improve the scientific reasoning ability of Llama-3. We refer to the model after CPT as Llama-3-SynE (Synthetic data Enhanced Llama-3). We also present the tuning experiments with a relatively small model, TinyLlama, and employ the derived findings to train the backbone model. Extensive experiments on a number of evaluation benchmarks show that our approach can largely improve the performance of the backbone models, including both the general abilities (+8.81 on C-Eval and +6.31 on CMMLU) and the scientific reasoning abilities (+12.00 on MATH and +4.13 on SciEval), without hurting the original capacities. Our model, data, and code are available at this https URL.

Title: Knowledge Graph Structure as Prompt: Improving Small Language Models Capabilities for Knowledge-based Causal Discovery

Authors: Yuni Susanti, Michael Färber
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.18752
Pdf URL: https://arxiv.org/pdf/2407.18752
Copy Paste: [[2407.18752]] Knowledge Graph Structure as Prompt: Improving Small Language Models Capabilities for Knowledge-based Causal Discovery(https://arxiv.org/abs/2407.18752)
Keywords: language model, llm, prompt
Abstract: Causal discovery aims to estimate causal structures among variables based on observational data. Large Language Models (LLMs) offer a fresh perspective to tackle the causal discovery problem by reasoning on the metadata associated with variables rather than their actual data values, an approach referred to as knowledge-based causal discovery. In this paper, we investigate the capabilities of Small Language Models (SLMs, defined as LLMs with fewer than 1 billion parameters) with prompt-based learning for knowledge-based causal discovery. Specifically, we present KG Structure as Prompt, a novel approach for integrating structural information from a knowledge graph, such as common neighbor nodes and metapaths, into prompt-based learning to enhance the capabilities of SLMs. Experimental results on three types of biomedical and open-domain datasets under few-shot settings demonstrate the effectiveness of our approach, surpassing most baselines and even conventional fine-tuning approaches trained on full datasets. Our findings further highlight the strong capabilities of SLMs: in combination with knowledge graphs and prompt-based learning, SLMs demonstrate the potential to surpass LLMs with a larger number of parameters. Our code and datasets are available on GitHub.
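
Note: the abstract above mentions injecting knowledge-graph structure, such as common neighbor nodes, into a prompt. The sketch below illustrates that general idea only. It finds the common neighbors of two variables in an undirected edge list and writes them into a text prompt. The prompt wording, the function names, and the toy graph are all hypothetical and are not taken from the paper; metapaths are not covered.

```python
def common_neighbors(edges, a, b):
    """Return the sorted nodes adjacent to both a and b in an undirected graph."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    shared = neighbors.get(a, set()) & neighbors.get(b, set())
    return sorted(shared - {a, b})


def build_prompt(edges, a, b):
    """Compose a prompt that carries the common-neighbor context.

    ASSUMPTION: this template is illustrative; the paper's actual
    prompt format is not given in the abstract.
    """
    shared = common_neighbors(edges, a, b)
    context = ", ".join(shared) if shared else "none"
    return (
        f"Knowledge graph context: {a} and {b} share these neighbor nodes: {context}.\n"
        f"Question: is there a causal relation between {a} and {b}? Answer yes or no."
    )


# Made-up toy graph for illustration only.
toy_edges = [
    ("smoking", "tar"),
    ("tar", "lung cancer"),
    ("smoking", "inflammation"),
    ("inflammation", "lung cancer"),
    ("smoking", "coffee"),
]
print(build_prompt(toy_edges, "smoking", "lung cancer"))
```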

Title: The power of Prompts: Evaluating and Mitigating Gender Bias in MT with LLMs

Authors: Aleix Sant, Carlos Escolano, Audrey Mash, Francesca De Luca Fornaciari, Maite Melero
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.18786
Pdf URL: https://arxiv.org/pdf/2407.18786
Copy Paste: [[2407.18786]] The power of Prompts: Evaluating and Mitigating Gender Bias in MT with LLMs(https://arxiv.org/abs/2407.18786)
Keywords: language model, llm, prompt
Abstract: This paper studies gender bias in machine translation through the lens of Large Language Models (LLMs). Four widely-used test sets are employed to benchmark various base LLMs, comparing their translation quality and gender bias against state-of-the-art Neural Machine Translation (NMT) models for English to Catalan (En $\rightarrow$ Ca) and English to Spanish (En $\rightarrow$ Es) translation directions. Our findings reveal pervasive gender bias across all models, with base LLMs exhibiting a higher degree of bias compared to NMT models. To combat this bias, we explore prompt engineering techniques applied to an instruction-tuned LLM. We identify a prompt structure that significantly reduces gender bias by up to 12% on the WinoMT evaluation dataset compared to more straightforward prompts. These results significantly reduce the gender bias accuracy gap between LLMs and traditional NMT systems.