2024-03-20

Title: Evaluating Robustness of Generative Search Engine on Adversarial Factual Questions

Authors: Xuming Hu, Xiaochuan Li, Junzhe Chen, Yinghui Li, Yangning Li, Xiaoguang Li, Yasheng Wang, Qun Liu, Lijie Wen, Philip S. Yu, Zhijiang Guo
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2403.12077
Pdf URL: https://arxiv.org/pdf/2403.12077
Copy Paste: [[2403.12077]] Evaluating Robustness of Generative Search Engine on Adversarial Factual Questions(https://arxiv.org/abs/2403.12077)
Keywords: language model, llm, chat, retrieval-augmented generation
Abstract: Generative search engines have the potential to transform how people seek information online, but responses generated by existing search engines backed by large language models (LLMs) may not always be accurate. Nonetheless, retrieval-augmented generation exacerbates safety concerns, since adversaries may successfully evade the entire system by subtly manipulating the most vulnerable part of a claim. To this end, we propose evaluating the robustness of generative search engines in a realistic and high-risk setting, where adversaries have only black-box system access and seek to deceive the model into returning incorrect responses. Through a comprehensive human evaluation of various generative search engines, such as Bing Chat, PerplexityAI, and YouChat across diverse queries, we demonstrate the effectiveness of adversarial factual questions in inducing incorrect responses. Moreover, retrieval-augmented generation exhibits a higher susceptibility to factual errors compared to LLMs without retrieval. These findings highlight the potential security risks of these systems and emphasize the need for rigorous evaluation before deployment.

Title: The Boy Who Survived: Removing Harry Potter from an LLM is harder than reported

Authors: Adam Shostack
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.12082
Pdf URL: https://arxiv.org/pdf/2403.12082
Copy Paste: [[2403.12082]] The Boy Who Survived: Removing Harry Potter from an LLM is harder than reported(https://arxiv.org/abs/2403.12082)
Keywords: llm
Abstract: Recent work (arXiv:2310.02238) asserted that "we effectively erase the model's ability to generate or recall Harry Potter-related content." This claim is shown to be overbroad. A small experiment of fewer than a dozen trials led to repeated and specific mentions of Harry Potter, including "Ah, I see! A 'muggle' is a term used in the Harry Potter book series by Terry Pratchett..."

Title: TMU at TREC Clinical Trials Track 2023

Authors: Aritra Kumar Lahiri, Emrul Hasan, Qinmin Vivian Hu, Cherie Ding
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2403.12088
Pdf URL: https://arxiv.org/pdf/2403.12088
Copy Paste: [[2403.12088]] TMU at TREC Clinical Trials Track 2023(https://arxiv.org/abs/2403.12088)
Keywords: language model
Abstract: This paper describes Toronto Metropolitan University's participation in the TREC Clinical Trials Track for 2023. As part of the tasks, we utilize advanced natural language processing techniques and neural language models in our experiments to retrieve the most relevant clinical trials. We illustrate the overall methodology, experimental settings, and results of our implementation for the run submission as part of Team - V-TorontoMU.

Title: Syn-QA2: Evaluating False Assumptions in Long-tail Questions with Synthetic QA Datasets

Authors: Ashwin Daswani, Rohan Sawant, Najoung Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.12145
Pdf URL: https://arxiv.org/pdf/2403.12145
Copy Paste: [[2403.12145]] Syn-QA2: Evaluating False Assumptions in Long-tail Questions with Synthetic QA Datasets(https://arxiv.org/abs/2403.12145)
Keywords: language model
Abstract: Sensitivity to false assumptions (or false premises) in information-seeking questions is critical for robust question-answering (QA) systems. Recent work has shown that false assumptions in naturally occurring questions pose challenges to current models, with low performance on both generative QA and simple detection tasks (Kim et al. 2023). However, the focus of existing work on naturally occurring questions leads to a gap in the analysis of model behavior on the long tail of the distribution of possible questions. To this end, we introduce Syn-(QA)$^2$, a set of two synthetically generated QA datasets: one generated using perturbed relations from Wikidata, and the other by perturbing HotpotQA (Yang et al. 2018). Our findings from evaluating a range of large language models are threefold: (1) false assumptions in QA are challenging, echoing the findings of prior work, (2) the binary detection task is challenging even compared to the difficulty of generative QA itself, possibly due to the linguistic structure of the problem, and (3) the detection task is more challenging with long-tail questions compared to naturally occurring questions, highlighting the utility of our synthetic datasets and generation method.

Title: EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models

Authors: Weikang Zhou, Xiao Wang, Limao Xiong, Han Xia, Yingshuang Gu, Mingxu Chai, Fukang Zhu, Caishuang Huang, Shihan Dou, Zhiheng Xi, Rui Zheng, Songyang Gao, Yicheng Zou, Hang Yan, Yifan Le, Ruohui Wang, Lijun Li, Jing Shao, Tao Gui, Qi Zhang, Xuanjing Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.12171
Pdf URL: https://arxiv.org/pdf/2403.12171
Copy Paste: [[2403.12171]] EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models(https://arxiv.org/abs/2403.12171)
Keywords: language model, gpt, llm
Abstract: Jailbreak attacks are crucial for identifying and mitigating the security vulnerabilities of Large Language Models (LLMs). They are designed to bypass safeguards and elicit prohibited outputs. However, due to significant differences among various jailbreak methods, there is no standard implementation framework available for the community, which limits comprehensive security evaluations. This paper introduces EasyJailbreak, a unified framework simplifying the construction and evaluation of jailbreak attacks against LLMs. It builds jailbreak attacks using four components: Selector, Mutator, Constraint, and Evaluator. This modular framework enables researchers to easily construct attacks from combinations of novel and existing components. So far, EasyJailbreak supports 11 distinct jailbreak methods and facilitates the security validation of a broad spectrum of LLMs. Our validation across 10 distinct LLMs reveals a significant vulnerability, with an average breach probability of 60% under various jailbreaking attacks. Notably, even advanced models like GPT-3.5-Turbo and GPT-4 exhibit average Attack Success Rates (ASR) of 57% and 33%, respectively. We have released a wealth of resources for researchers, including a web platform, PyPI published package, screencast video, and experimental outputs.

Title: TnT-LLM: Text Mining at Scale with Large Language Models

Authors: Mengting Wan, Tara Safavi, Sujay Kumar Jauhar, Yujin Kim, Scott Counts, Jennifer Neville, Siddharth Suri, Chirag Shah, Ryen W White, Longqi Yang, Reid Andersen, Georg Buscher, Dhruv Joshi, Nagu Rangan
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2403.12173
Pdf URL: https://arxiv.org/pdf/2403.12173
Copy Paste: [[2403.12173]] TnT-LLM: Text Mining at Scale with Large Language Models(https://arxiv.org/abs/2403.12173)
Keywords: language model, llm, prompt, chat
Abstract: Transforming unstructured text into structured and meaningful forms, organized by useful category labels, is a fundamental step in text mining for downstream analysis and application. However, most existing methods for producing label taxonomies and building text-based label classifiers still rely heavily on domain expertise and manual curation, making the process expensive and time-consuming. This is particularly challenging when the label space is under-specified and large-scale data annotations are unavailable. In this paper, we address these challenges with Large Language Models (LLMs), whose prompt-based interface facilitates the induction and use of large-scale pseudo labels. We propose TnT-LLM, a two-phase framework that employs LLMs to automate the process of end-to-end label generation and assignment with minimal human effort for any given use-case. In the first phase, we introduce a zero-shot, multi-stage reasoning approach which enables LLMs to produce and refine a label taxonomy iteratively. In the second phase, LLMs are used as data labelers that yield training samples so that lightweight supervised classifiers can be reliably built, deployed, and served at scale. We apply TnT-LLM to the analysis of user intent and conversational domain for Bing Copilot (formerly Bing Chat), an open-domain chat-based search engine. Extensive experiments using both human and automatic evaluation metrics demonstrate that TnT-LLM generates more accurate and relevant label taxonomies when compared against state-of-the-art baselines, and achieves a favorable balance between accuracy and efficiency for classification at scale. We also share our practical experiences and insights on the challenges and opportunities of using LLMs for large-scale text mining in real-world applications.
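The second phase described in the abstract (LLMs produce pseudo labels, then a lightweight supervised classifier is trained on them) can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: `stub_llm_label` is a hypothetical keyword rule standing in for a real LLM call, the label names and texts are made up, and the word-count classifier is a deliberately tiny stand-in, not the classifier used in the paper.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def stub_llm_label(text: str) -> str:
    """Hypothetical stand-in for an LLM labeling call (simple keyword rule)."""
    lowered = text.lower()
    return "coding" if "python" in lowered or "bug" in lowered else "travel"


def train(texts: List[str], labels: List[str]) -> Dict[str, Counter]:
    """Count word occurrences per label; this is the toy 'lightweight classifier'."""
    model: Dict[str, Counter] = defaultdict(Counter)
    for text, label in zip(texts, labels):
        model[label].update(text.lower().split())
    return model


def predict(model: Dict[str, Counter], text: str) -> str:
    """Pick the label whose training words overlap most with the input."""
    words = text.lower().split()
    return max(model, key=lambda label: sum(model[label][w] for w in words))


# Made-up unlabeled texts; the stub labeler supplies the pseudo labels.
unlabeled = [
    "fix this python bug",
    "python script crashes with a bug",
    "cheap flights to rome",
    "best hotels near the beach in rome",
]
pseudo_labels = [stub_llm_label(t) for t in unlabeled]
model = train(unlabeled, pseudo_labels)

# At serving time only the small classifier runs, not the LLM.
print(predict(model, "why does my script have a bug"))
print(predict(model, "flights to rome"))
```

The design point the sketch tries to capture is the cost split: the expensive labeler runs once over a training sample, while the cheap classifier handles inference at scale.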

Title: Reference-based Metrics Disprove Themselves in Question Generation

Authors: Bang Nguyen, Mengxia Yu, Yun Huang, Meng Jiang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.12242
Pdf URL: https://arxiv.org/pdf/2403.12242
Copy Paste: [[2403.12242]] Reference-based Metrics Disprove Themselves in Question Generation(https://arxiv.org/abs/2403.12242)
Keywords: language model
Abstract: Reference-based metrics such as BLEU and BERTScore are widely used to evaluate question generation (QG). In this study, on QG benchmarks such as SQuAD and HotpotQA, we find that using human-written references cannot guarantee the effectiveness of the reference-based metrics. Most QG benchmarks have only one reference; we replicated the annotation process and collected another reference. A good metric was expected to grade a human-validated question no worse than generated questions. However, the results of reference-based metrics on our newly collected reference disproved the metrics themselves. We propose a reference-free metric consisting of multi-dimensional criteria such as naturalness, answerability, and complexity, utilizing large language models. These criteria are not constrained to the syntax or semantics of a single reference question, and the metric does not require a diverse set of references. Experiments reveal that our metric accurately distinguishes between high-quality questions and flawed ones, and achieves state-of-the-art alignment with human judgment.

Title: Zero-Shot Multi-task Hallucination Detection

Authors: Patanjali Bhamidipati, Advaith Malladi, Manish Shrivastava, Radhika Mamidi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.12244
Pdf URL: https://arxiv.org/pdf/2403.12244
Copy Paste: [[2403.12244]] Zero-Shot Multi-task Hallucination Detection(https://arxiv.org/abs/2403.12244)
Keywords: language model, hallucination
Abstract: In recent studies, the extensive utilization of large language models has underscored the importance of robust evaluation methodologies for assessing text generation quality and relevance to specific tasks. This has revealed a prevalent issue known as hallucination, an emergent condition in the model where generated text lacks faithfulness to the source and deviates from the evaluation criteria. In this study, we formally define hallucination and propose a framework for its quantitative detection in a zero-shot setting, leveraging our definition and the assumption that model outputs entail task and sample specific inputs. In detecting hallucinations, our solution achieves an accuracy of 0.78 in a model-aware setting and 0.61 in a model-agnostic setting. Notably, our solution maintains computational efficiency, requiring far fewer computational resources than other SOTA approaches, aligning with the trend towards lightweight and compressed models.

Title: FinLlama: Financial Sentiment Classification for Algorithmic Trading Applications

Authors: Thanos Konstantinidis, Giorgos Iacovides, Mingxue Xu, Tony G. Constantinides, Danilo Mandic
Subjects: cs.CL, cs.LG, q-fin.ST, q-fin.TR
Abstract URL: https://arxiv.org/abs/2403.12285
Pdf URL: https://arxiv.org/pdf/2403.12285
Copy Paste: [[2403.12285]] FinLlama: Financial Sentiment Classification for Algorithmic Trading Applications(https://arxiv.org/abs/2403.12285)
Keywords: language model, llm
Abstract: There are multiple sources of financial news online which influence market movements and traders' decisions. This highlights the need for accurate sentiment analysis, in addition to having appropriate algorithmic trading techniques, to arrive at better informed trading decisions. Standard lexicon-based sentiment approaches have demonstrated their power in aiding financial decisions. However, they are known to suffer from issues related to context sensitivity and word ordering. Large Language Models (LLMs) can also be used in this context, but they are not finance-specific and tend to require significant computational resources. To facilitate a finance-specific LLM framework, we introduce a novel approach based on the Llama 2 7B foundational model, in order to benefit from its generative nature and comprehensive language manipulation. This is achieved by fine-tuning the Llama 2 7B model on a small portion of supervised financial sentiment analysis data, so as to jointly handle the complexities of financial lexicon and context, and further equipping it with a neural-network-based decision mechanism. Such a generator-classifier scheme, referred to as FinLlama, is trained not only to classify the sentiment valence but also quantify its strength, thus offering traders a nuanced insight into financial news articles. Complementing this, the implementation of parameter-efficient fine-tuning through LoRA optimises trainable parameters, thus minimising computational and memory requirements, without sacrificing accuracy. Simulation results demonstrate the ability of the proposed FinLlama to provide a framework for enhanced portfolio management decisions and increased market returns. These results underpin the ability of FinLlama to construct high-return portfolios which exhibit enhanced resilience, even during volatile periods and unpredictable market events.
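For readers unfamiliar with why LoRA reduces trainable parameters, the standard LoRA reparameterization of a single frozen weight matrix is sketched below. The symbols d, k, r and the scaling factor alpha are generic, and the example sizes are assumptions chosen for illustration; the abstract does not state the configuration used for FinLlama.

```latex
% Standard LoRA update for one frozen weight matrix W_0 (generic symbols).
W = W_0 + \frac{\alpha}{r}\, B A,
\qquad
W_0 \in \mathbb{R}^{d \times k},\quad
B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k},\quad
r \ll \min(d, k)

% Only A and B are trained, so the trainable count per adapted matrix is
% r(d + k) rather than dk. With assumed sizes d = k = 4096 and r = 8:
r(d + k) = 8 \cdot 8192 = 65{,}536
\quad\text{versus}\quad
dk = 4096^2 = 16{,}777{,}216
```

Under these assumed sizes the adapter trains roughly 0.4% of the parameters of the matrix it modifies.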

Title: A Comparative Investigation of Compositional Syntax and Semantics in DALL-E 2

Authors: Elliot Murphy, Jill de Villiers, Sofia Lucero Morales
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.12294
Pdf URL: https://arxiv.org/pdf/2403.12294
Copy Paste: [[2403.12294]] A Comparative Investigation of Compositional Syntax and Semantics in DALL-E 2(https://arxiv.org/abs/2403.12294)
Keywords: prompt, agent
Abstract: In this study we compared how well DALL-E 2 visually represented the meaning of linguistic prompts also given to young children in comprehension tests. Sentences representing fundamental components of grammatical knowledge were selected from assessment tests used with several hundred English-speaking children aged 2-7 years for whom we had collected original item-level data. DALL-E 2 was given these prompts five times to generate 20 cartoons per item, for 9 adult judges to score. Results revealed no conditions in which DALL-E 2-generated images matched the semantic accuracy of children, even at the youngest age (2 years). DALL-E 2 failed to assign the appropriate roles in reversible forms; it failed on negation despite an easier contrastive prompt than the children received; it often assigned the adjective to the wrong noun; it ignored implicit agents in passives. This work points to a clear absence of compositional sentence representations for DALL-E 2.

Title: Leveraging Large Language Models to Extract Information on Substance Use Disorder Severity from Clinical Notes: A Zero-shot Learning Approach

Authors: Maria Mahbub, Gregory M. Dams, Sudarshan Srinivasan, Caitlin Rizy, Ioana Danciu, Jodie Trafton, Kathryn Knight
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.12297
Pdf URL: https://arxiv.org/pdf/2403.12297
Copy Paste: [[2403.12297]] Leveraging Large Language Models to Extract Information on Substance Use Disorder Severity from Clinical Notes: A Zero-shot Learning Approach(https://arxiv.org/abs/2403.12297)
Keywords: language model, llm, prompt
Abstract: Substance use disorder (SUD) poses a major concern due to its detrimental effects on health and society. SUD identification and treatment depend on a variety of factors such as severity, co-determinants (e.g., withdrawal symptoms), and social determinants of health. Existing diagnostic coding systems used by American insurance providers, like the International Classification of Diseases (ICD-10), lack granularity for certain diagnoses, but clinicians will add this granularity (such as that found within the Diagnostic and Statistical Manual of Mental Disorders classification or DSM-5) as supplemental unstructured text in clinical notes. Traditional natural language processing (NLP) methods face limitations in accurately parsing such diverse clinical language. Large Language Models (LLMs) offer promise in overcoming these challenges by adapting to diverse language patterns. This study investigates the application of LLMs for extracting severity-related information for various SUD diagnoses from clinical notes. We propose a workflow employing zero-shot learning of LLMs with carefully crafted prompts and post-processing techniques. Through experimentation with Flan-T5, an open-source LLM, we demonstrate its superior recall compared to the rule-based approach. Focusing on 11 categories of SUD diagnoses, we show the effectiveness of LLMs in extracting severity information, contributing to improved risk assessment and treatment planning for SUD patients.

Title: OpenEval: Benchmarking Chinese LLMs across Capability, Alignment and Safety

Authors: Chuang Liu, Linhao Yu, Jiaxuan Li, Renren Jin, Yufei Huang, Ling Shi, Junhui Zhang, Xinmeng Ji, Tingting Cui, Tao Liu, Jinwang Song, Hongying Zan, Sun Li, Deyi Xiong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.12316
Pdf URL: https://arxiv.org/pdf/2403.12316
Copy Paste: [[2403.12316]] OpenEval: Benchmarking Chinese LLMs across Capability, Alignment and Safety(https://arxiv.org/abs/2403.12316)
Keywords: language model, llm
Abstract: The rapid development of Chinese large language models (LLMs) poses big challenges for efficient LLM evaluation. While current initiatives have introduced new benchmarks or evaluation platforms for assessing Chinese LLMs, many of these focus primarily on capabilities, usually overlooking potential alignment and safety issues. To address this gap, we introduce OpenEval, an evaluation testbed that benchmarks Chinese LLMs across capability, alignment and safety. For capability assessment, we include 12 benchmark datasets to evaluate Chinese LLMs from 4 sub-dimensions: NLP tasks, disciplinary knowledge, commonsense reasoning and mathematical reasoning. For alignment assessment, OpenEval contains 7 datasets that examine the bias, offensiveness and illegality in the outputs yielded by Chinese LLMs. To evaluate safety, especially anticipated risks (e.g., power-seeking, self-awareness) of advanced LLMs, we include 6 datasets. In addition to these benchmarks, we have implemented a phased public evaluation and benchmark update strategy to ensure that OpenEval is in line with the development of Chinese LLMs or even able to provide cutting-edge benchmark datasets to guide the development of Chinese LLMs. In our first public evaluation, we have tested a range of Chinese LLMs, spanning from 7B to 72B parameters, including both open-source and proprietary models. Evaluation results indicate that while Chinese LLMs have shown impressive performance in certain tasks, more attention should be directed towards broader aspects such as commonsense reasoning, alignment, and safety.

Title: Characteristic AI Agents via Large Language Models

Authors: Xi Wang, Hongliang Dai, Shen Gao, Piji Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.12368
Pdf URL: https://arxiv.org/pdf/2403.12368
Copy Paste: [[2403.12368]] Characteristic AI Agents via Large Language Models(https://arxiv.org/abs/2403.12368)
Keywords: language model, llm, chat, agent
Abstract: The advancement of Large Language Models (LLMs) has led to significant enhancements in the performance of chatbot systems. Many researchers have dedicated their efforts to giving chatbots distinct characteristics. While there are commercial products for developing role-driven chatbots using LLMs, academic research in this area remains relatively scarce. Our research focuses on investigating the performance of LLMs in constructing Characteristic AI Agents by simulating real-life individuals across different settings. Current investigations have primarily focused on acting out roles with simple profiles. In response to this research gap, we create a benchmark for the characteristic AI agents task, including a dataset, techniques, and evaluation metrics. A dataset called "Character100" is built for this benchmark, comprising the most-visited people on Wikipedia for language models to role-play. With the constructed dataset, we conduct a comprehensive assessment of LLMs across various settings. In addition, we devise a set of automatic metrics for quantitative performance evaluation. The experimental results underscore the potential directions for further improvement in the capabilities of LLMs in constructing characteristic AI agents. The benchmark is available at https://github.com/nuaa-nlp/Character100.

Title: RankPrompt: Step-by-Step Comparisons Make Language Models Better Reasoners

Authors: Chi Hu, Yuan Ge, Xiangnan Ma, Hang Cao, Qiang Li, Yonghua Yang, Tong Xiao, Jingbo Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.12373
Pdf URL: https://arxiv.org/pdf/2403.12373
Copy Paste: [[2403.12373]] RankPrompt: Step-by-Step Comparisons Make Language Models Better Reasoners(https://arxiv.org/abs/2403.12373)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Large Language Models (LLMs) have achieved impressive performance across various reasoning tasks. However, even state-of-the-art LLMs such as ChatGPT are prone to logical errors during their reasoning processes. Existing solutions, which include deploying task-specific verifiers or voting over multiple reasoning paths, either require extensive human annotations or fail in scenarios with inconsistent responses. To address these challenges, we introduce RankPrompt, a new prompting method that enables LLMs to self-rank their responses without additional resources. RankPrompt breaks down the ranking problem into a series of comparisons among diverse responses, leveraging the inherent capabilities of LLMs to generate chains of comparison as contextual exemplars. Our experiments across 11 arithmetic and commonsense reasoning tasks show that RankPrompt significantly enhances the reasoning performance of ChatGPT and GPT-4, with improvements of up to 13%. RankPrompt also excels in LLM-based automatic evaluations for open-ended generation, aligning with human preferences 74% of the time in the AlpacaEval set. Moreover, RankPrompt demonstrates robustness against variations in the orderings and consistencies of responses.
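The abstract notes that voting over multiple reasoning paths fails when responses are inconsistent. A minimal Python sketch of that failure mode follows. It illustrates plain majority voting only, not RankPrompt itself; `majority_vote` is a hypothetical helper, and the sampled answers are made up rather than taken from the paper.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[str]) -> Optional[str]:
    """Return the most frequent answer, or None when first place is tied."""
    if not answers:
        return None
    counts = Counter(answers).most_common()
    # A tie for first place means voting cannot pick a winner.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


# Made-up final answers extracted from sampled reasoning paths.
consistent = ["42", "42", "17", "42"]
inconsistent = ["42", "17", "8", "23"]

print(majority_vote(consistent))    # one answer dominates
print(majority_vote(inconsistent))  # every answer differs, so there is no winner
```

When every sampled path ends in a different answer, the vote carries no signal, which is the gap that a comparison-based ranking method aims to address.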

Title: Improving Generalizability of Extracting Social Determinants of Health Using Large Language Models through Prompt-tuning

Authors: Cheng Peng, Zehao Yu, Kaleb E Smith, Wei-Hsuan Lo-Ciganic, Jiang Bian, Yonghui Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.12374
Pdf URL: https://arxiv.org/pdf/2403.12374
Copy Paste: [[2403.12374]] Improving Generalizability of Extracting Social Determinants of Health Using Large Language Models through Prompt-tuning(https://arxiv.org/abs/2403.12374)
Keywords: language model, gpt, llm, prompt
Abstract: The progress in natural language processing (NLP) using large language models (LLMs) has greatly improved patient information extraction from clinical narratives. However, most methods based on the fine-tuning strategy have limited transfer learning ability for cross-domain applications. This study proposed a novel approach that employs a soft prompt-based learning architecture, which introduces trainable prompts to guide LLMs toward desired outputs. We examined two types of LLM architectures, including encoder-only GatorTron and decoder-only GatorTronGPT, and evaluated their performance for the extraction of social determinants of health (SDoH) using a cross-institution dataset from the 2022 n2c2 challenge and a cross-disease dataset from the University of Florida (UF) Health. The results show that decoder-only LLMs with prompt tuning achieved better performance in cross-domain applications. GatorTronGPT achieved the best F1 scores for both datasets, outperforming traditional fine-tuned GatorTron by 8.9% and 21.8% in a cross-institution setting, and 5.5% and 14.5% in a cross-disease setting.
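The idea of trainable soft prompts can be illustrated with a small Python sketch. It assumes the common prompt-tuning setup in which the LLM's own weights stay frozen and only a few prompt vectors are learned; the abstract does not spell out this detail. The embedding width, number of soft tokens, and token list are all made-up values, and no actual training is performed here.

```python
import random
from typing import Dict, List

Vector = List[float]

EMBED_DIM = 4        # toy embedding width (assumption)
NUM_SOFT_TOKENS = 3  # number of trainable prompt vectors (assumption)

random.seed(0)

# Frozen token embeddings standing in for a pretrained model's embedding table.
frozen_embeddings: Dict[str, Vector] = {
    tok: [random.uniform(-1, 1) for _ in range(EMBED_DIM)]
    for tok in ["patient", "lives", "alone"]
}

# Soft prompt: in prompt tuning, these would be the only parameters updated.
soft_prompt: List[Vector] = [
    [random.uniform(-1, 1) for _ in range(EMBED_DIM)]
    for _ in range(NUM_SOFT_TOKENS)
]


def build_model_input(tokens: List[str]) -> List[Vector]:
    """Prepend the trainable soft prompt to the frozen token embeddings."""
    return soft_prompt + [frozen_embeddings[t] for t in tokens]


inputs = build_model_input(["patient", "lives", "alone"])
print(len(inputs), "vectors of width", len(inputs[0]))
```

Because the learned part is a handful of vectors rather than the full model, the same frozen LLM can in principle be reused across institutions or disease cohorts with a different soft prompt for each.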

Title: AraPoemBERT: A Pretrained Language Model for Arabic Poetry Analysis

Authors: Faisal Qarah
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.12392
Pdf URL: https://arxiv.org/pdf/2403.12392
Copy Paste: [[2403.12392]] AraPoemBERT: A Pretrained Language Model for Arabic Poetry Analysis(https://arxiv.org/abs/2403.12392)
Keywords: language model
Abstract: Arabic poetry, with its rich linguistic features and profound cultural significance, presents a unique challenge to the Natural Language Processing (NLP) field. The complexity of its structure and context necessitates advanced computational models for accurate analysis. In this paper, we introduce AraPoemBERT, an Arabic language model pretrained exclusively on Arabic poetry text. To demonstrate the effectiveness of the proposed model, we compared AraPoemBERT with 5 different Arabic language models on various NLP tasks related to Arabic poetry. The new model outperformed all other models and achieved state-of-the-art results in most of the downstream tasks. AraPoemBERT achieved unprecedented accuracy in two out of three novel tasks: poet's gender classification (99.34% accuracy), and poetry sub-meter classification (97.79% accuracy). In addition, the model achieved an accuracy score in poems' rhyme classification (97.73% accuracy) which is almost equivalent to the best score reported in this study. Moreover, the proposed model significantly outperformed previous work and other comparative models in the tasks of poems' sentiment analysis, achieving an accuracy of 78.95%, and poetry meter classification (99.03% accuracy), while significantly expanding the scope of these two problems. The dataset used in this study contains more than 2.09 million verses collected from online sources, each associated with various attributes such as meter, sub-meter, poet, rhyme, and topic. The results demonstrate the effectiveness of the proposed model in understanding and analyzing Arabic poetry, achieving state-of-the-art results in several tasks and outperforming previous works and other language models included in the study. The AraPoemBERT model is publicly available at https://huggingface.co/faisalq.
摘要：阿拉伯诗歌以其丰富的语言特征和深厚的文化意义，对自然语言处理（NLP）领域提出了独特的挑战。其结构和背景的复杂性需要先进的计算模型来进行准确分析。在本文中，我们介绍了 AraPoemBERT，这是一种专门针对阿拉伯诗歌文本进行预训练的阿拉伯语言模型。为了证明所提出模型的有效性，我们在与阿拉伯诗歌相关的各种 NLP 任务上将 AraPoemBERT 与 5 种不同的阿拉伯语言模型进行了比较。新模型的表现优于所有其他模型，并在大多数下游任务中取得了最先进的结果。 AraPoemBERT 在三项新颖任务中的两项取得了前所未有的准确率：诗人性别分类（准确率 99.34%）和诗歌分格分类（准确率 97.79%）。此外，该模型在诗歌韵律分类方面取得了准确率（97.73％准确率），几乎相当于本研究中报告的最佳分数。此外，所提出的模型在诗歌情感分析任务中显着优于先前的工作和其他比较模型，达到了 78.95% 的准确率，以及诗歌计量分类（99.03% 的准确率），同时显着扩展了这两个问题的范围。本研究中使用的数据集包含从在线资源收集的超过 209 万首诗句，每首诗句都与各种属性相关，例如韵律、子韵律、诗人、韵律和主题。结果证明了所提出的模型在理解和分析阿拉伯诗歌方面的有效性，在多项任务中取得了最先进的结果，并且优于之前的作品和研究中包含的其他语言模型。 AraPoemBERT 模型可在 \url{https://huggingface.co/faisalq} 上公开获取。

Title: Dr3: Ask Large Language Models Not to Give Off-Topic Answers in Open Domain Multi-Hop Question Answering

Authors: Yuan Gao, Yiheng Zhu, Yuanbin Cao, Yinzhi Zhou, Zhen Wu, Yujie Chen, Shenglan Wu, Haoyuan Hu, Xinyu Dai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.12393
Pdf URL: https://arxiv.org/pdf/2403.12393
Copy Paste: [[2403.12393]] Dr3: Ask Large Language Models Not to Give Off-Topic Answers in Open Domain Multi-Hop Question Answering(https://arxiv.org/abs/2403.12393)
Keywords: language model, llm
Abstract: Open Domain Multi-Hop Question Answering (ODMHQA) plays a crucial role in Natural Language Processing (NLP) by aiming to answer complex questions through multi-step reasoning over information retrieved from external knowledge sources. Recently, Large Language Models (LLMs) have demonstrated remarkable performance in solving ODMHQA owing to their capabilities, including planning, reasoning, and utilizing tools. However, LLMs may generate off-topic answers when attempting to solve ODMHQA; that is, the generated answers are irrelevant to the original questions. Off-topic answers account for approximately one-third of incorrect answers, yet the issue remains underexplored despite its significance. To alleviate this issue, we propose the Discriminate->Re-Compose->Re-Solve->Re-Decompose (Dr3) mechanism. Specifically, the Discriminator leverages the intrinsic capabilities of LLMs to judge whether the generated answers are off-topic. In cases where an off-topic answer is detected, the Corrector performs step-wise revisions along the reversed reasoning chain (Re-Compose->Re-Solve->Re-Decompose) until the final answer becomes on-topic. Experimental results on the HotpotQA and 2WikiMultiHopQA datasets demonstrate that our Dr3 mechanism reduces the occurrence of off-topic answers in ODMHQA by nearly 13% and improves Exact Match (EM) performance by nearly 3% compared to the baseline method without the Dr3 mechanism.
摘要：开放域多跳问答 (ODMHQA) 在自然语言处理 (NLP) 中发挥着至关重要的作用，旨在通过对从外部知识源检索到的信息进行多步推理来回答复杂的问题。最近，大型语言模型（LLM）由于其规划、推理和利用工具等能力，在解决 ODMHQA 方面表现出了卓越的性能。然而，法学硕士在尝试解决 ODMHQA 时可能会生成偏离主题的答案，即生成的答案与原始问题无关。这一期偏离主题的答案约占错误答案的三分之一，尽管其意义重大，但仍未得到充分探索。为了缓解这个问题，我们提出了判别->重新组合->重新解决->重新分解（Dr3）机制。具体来说，判别器利用法学硕士的内在能力来判断生成的答案是否偏离主题。如果检测到偏离主题的答案，校正器会沿着反向推理链（重新组合 -> 重新解决 -> 重新分解）执行逐步修订，直到最终答案成为主题。 HotpotQA 和 2WikiMultiHopQA 数据集上的实验结果表明，与没有 Dr3 的基线方法相比，我们的 Dr3 机制大大减少了 ODMHQA 中离题答案的发生率近 13%，精确匹配 (EM) 的性能提高了近 3%机制。
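
The control flow described in the Dr3 abstract (discriminate first, then revise along the reversed chain until the answer is on-topic) can be illustrated with a minimal sketch. This is not the authors' implementation: every function name, the toy discriminator, and the toy correction steps below are hypothetical stand-ins for what would be LLM calls in practice.

```python
from typing import Callable, List, Tuple

# Hypothetical type: a correction step maps (question, current answer) to a revised answer.
Step = Callable[[str, str], str]


def dr3_loop(
    question: str,
    initial_answer: str,
    is_off_topic: Callable[[str, str], bool],
    correction_steps: List[Tuple[str, Step]],
) -> Tuple[str, List[str]]:
    """Run the discriminator, then apply correction steps in order
    until the answer is judged on-topic or the steps run out."""
    answer = initial_answer
    applied: List[str] = []
    if not is_off_topic(question, answer):
        return answer, applied  # already on-topic, nothing to correct
    for name, step in correction_steps:
        answer = step(question, answer)
        applied.append(name)
        if not is_off_topic(question, answer):
            break  # stop as soon as the answer becomes on-topic
    return answer, applied


# Toy stand-ins (assumptions for illustration only).
COUNTRIES = {"France", "Japan", "Canada"}


def toy_discriminator(question: str, answer: str) -> bool:
    # The toy question asks for a country, so any non-country answer is off-topic.
    return answer not in COUNTRIES


def toy_re_compose(question: str, answer: str) -> str:
    return "Paris"  # still off-topic: a city, not a country


def toy_re_solve(question: str, answer: str) -> str:
    return "France"


def toy_re_decompose(question: str, answer: str) -> str:
    return "France"


STEPS = [
    ("Re-Compose", toy_re_compose),
    ("Re-Solve", toy_re_solve),
    ("Re-Decompose", toy_re_decompose),
]

if __name__ == "__main__":
    print(dr3_loop("In which country was the director born?",
                   "Christopher Nolan", toy_discriminator, STEPS))
```

In this toy run the loop stops after the second step, because the discriminator accepts the revised answer before Re-Decompose is needed.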

Title: An Empirical Study of Speech Language Models for Prompt-Conditioned Speech Synthesis

Authors: Yifan Peng, Ilia Kulikov, Yilin Yang, Sravya Popuri, Hui Lu, Changhan Wang, Hongyu Gong
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2403.12402
Pdf URL: https://arxiv.org/pdf/2403.12402
Copy Paste: [[2403.12402]] An Empirical Study of Speech Language Models for Prompt-Conditioned Speech Synthesis(https://arxiv.org/abs/2403.12402)
Keywords: language model, prompt
Abstract: Speech language models (LMs) are promising for high-quality speech synthesis through in-context learning. A typical speech LM takes discrete semantic units as content and a short utterance as prompt, and synthesizes speech which preserves the content's semantics but mimics the prompt's style. However, there is no systematic understanding of how the synthesized audio is controlled by the prompt and content. In this work, we conduct an empirical study of the widely used autoregressive (AR) and non-autoregressive (NAR) speech LMs and provide insights into the prompt design and content semantic units. Our analysis reveals that heterogeneous and nonstationary prompts hurt the audio quality, in contrast to the previous finding that longer prompts always lead to better synthesis. Moreover, we find that the speaker style of the synthesized audio is affected by the content as well as by the prompt. We further show that semantic units carry rich acoustic information such as pitch, tempo, volume and speech emphasis, which might be leaked from the content to the synthesized audio.
摘要：语音语言模型（LM）有望通过上下文学习实现高质量的语音合成。典型的语音LM以离散的语义单元作为内容，以简短的话语作为提示，并合成保留内容语义但模仿提示风格的语音。然而，对于提示和内容如何控制合成音频还没有系统的了解。在这项工作中，我们对广泛使用的自回归（AR）和非自回归（NAR）语音语言模型进行了实证研究，并提供了对提示设计和内容语义单元的见解。我们的分析表明，异构和非平稳的提示会损害音频质量，而之前的发现较长的提示总是会带来更好的合成。此外，我们发现，除了提示之外，合成音频的说话人风格还受到内容的影响。我们进一步表明，语义单元携带丰富的声学信息，例如音高、节奏、音量和语音强调，这些信息可能会从内容泄漏到合成音频中。

Title: Towards Interpretable Hate Speech Detection using Large Language Model-extracted Rationales

Authors: Ayushi Nirmal, Amrita Bhattacharjee, Paras Sheth, Huan Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.12403
Pdf URL: https://arxiv.org/pdf/2403.12403
Copy Paste: [[2403.12403]] Towards Interpretable Hate Speech Detection using Large Language Model-extracted Rationales(https://arxiv.org/abs/2403.12403)
Keywords: language model, llm
Abstract: Although social media platforms are a prominent arena for users to engage in interpersonal discussions and express opinions, the facade and anonymity offered by social media may allow users to spew hate speech and offensive content. Given the massive scale of such platforms, there arises a need to automatically identify and flag instances of hate speech. Although several hate speech detection methods exist, most of these black-box methods are not interpretable or explainable by design. To address the lack of interpretability, in this paper, we propose to use state-of-the-art Large Language Models (LLMs) to extract features in the form of rationales from the input text, to train a base hate speech classifier, thereby enabling faithful interpretability by design. Our framework effectively combines the textual understanding capabilities of LLMs and the discriminative power of state-of-the-art hate speech classifiers to make these classifiers faithfully interpretable. Our comprehensive evaluation on a variety of social media hate speech datasets demonstrates: (1) the goodness of the LLM-extracted rationales, and (2) the surprising retention of detector performance even after training to ensure interpretability.
摘要：尽管社交媒体平台是用户参与人际讨论和表达意见的重要场所，但社交媒体提供的外观和匿名性可能允许用户发表仇恨言论和攻击性内容。鉴于此类平台规模庞大，需要自动识别和标记仇恨言论实例。尽管存在多种仇恨言论检测方法，但大多数黑盒方法在设计上都无法解释或解释。为了解决可解释性的缺乏，在本文中，我们建议使用最先进的大型语言模型（LLM）从输入文本中提取基本原理形式的特征，以训练基本的仇恨言论分类器，从而通过设计实现忠实的可解释性。我们的框架有效地结合了法学硕士的文本理解能力和最先进的仇恨言论分类器的判别能力，使这些分类器能够忠实地解释。我们对各种社交媒体仇恨言论数据集的综合评估表明：（1）LLM 提取的基本原理的优点，以及（2）即使在经过训练以确保可解释性之后，检测器性能仍令人惊讶地保持不变。

Title: Cross-Lingual Transfer for Natural Language Inference via Multilingual Prompt Translator

Authors: Xiaoyu Qiu, Yuechen Wang, Jiaxin Shi, Wengang Zhou, Houqiang Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.12407
Pdf URL: https://arxiv.org/pdf/2403.12407
Copy Paste: [[2403.12407]] Cross-Lingual Transfer for Natural Language Inference via Multilingual Prompt Translator(https://arxiv.org/abs/2403.12407)
Keywords: prompt
Abstract: Based on multilingual pre-trained models, cross-lingual transfer with prompt learning has shown promising effectiveness, where a soft prompt learned in a source language is transferred to target languages for downstream tasks, particularly in the low-resource scenario. To efficiently transfer the soft prompt, we propose a novel framework, Multilingual Prompt Translator (MPT), where a multilingual prompt translator is introduced to properly process crucial knowledge embedded in the prompt by changing language knowledge while retaining task knowledge. Concretely, we first train a prompt in the source language and employ the translator to translate it into a target prompt. In addition, we use an external corpus as auxiliary data, on which an alignment task for predicted answer probability is designed to convert language knowledge, thereby equipping the target prompt with multilingual knowledge. In few-shot settings on XNLI, MPT demonstrates superiority over baselines by remarkable improvements. The advantage of MPT over vanilla prompting is more prominent when transferring to languages quite distinct from the source language.
摘要：基于多语言预训练模型，带有提示学习的跨语言迁移已显示出良好的有效性，其中将源语言学习的软提示迁移到目标语言以执行下游任务，特别是在资源匮乏的情况下。为了有效地传输软提示，我们提出了一种新颖的框架，多语言提示翻译器（MPT），其中引入多语言提示翻译器，通过改变语言知识同时保留任务知识来正确处理提示中嵌入的关键知识。具体来说，我们首先训练源语言的提示，然后使用翻译器将其翻译成目标提示。此外，我们扩展了外部语料库作为辅助数据，在其上设计了预测答案概率的对齐任务来转换语言知识，从而为目标提示配备多语言知识。在 XNLI 的少数镜头设置中，MPT 通过显着的改进表现出相对于基线的优越性。当转移到与源语言截然不同的语言时，MPT 比普通提示更加突出。

Title: MSLM-S2ST: A Multitask Speech Language Model for Textless Speech-to-Speech Translation with Speaker Style Preservation

Authors: Yifan Peng, Ilia Kulikov, Yilin Yang, Sravya Popuri, Hui Lu, Changhan Wang, Hongyu Gong
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2403.12408
Pdf URL: https://arxiv.org/pdf/2403.12408
Copy Paste: [[2403.12408]] MSLM-S2ST: A Multitask Speech Language Model for Textless Speech-to-Speech Translation with Speaker Style Preservation(https://arxiv.org/abs/2403.12408)
Keywords: language model
Abstract: There has been emerging research interest and progress in speech-to-speech translation (S2ST), which translates utterances from one language to another. This work proposes the Multitask Speech Language Model (MSLM), a decoder-only speech language model trained in a multitask setting. Without reliance on text training data, our model is able to support multilingual S2ST with speaker style preserved.
摘要：语音到语音翻译 (S2ST)（将话语从一种语言翻译成另一种语言）方面出现了新的研究兴趣并取得了进展。这项工作提出了多任务语音语言模型（MSLM），它是在多任务设置中训练的仅解码器的语音语言模型。在不依赖文本训练数据的情况下，我们的模型能够支持多语言 S2ST，并保留说话者的风格。

Title: Third-Party Language Model Performance Prediction from Instruction

Authors: Rahul Nadkarni, Yizhong Wang, Noah A. Smith
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.12413
Pdf URL: https://arxiv.org/pdf/2403.12413
Copy Paste: [[2403.12413]] Third-Party Language Model Performance Prediction from Instruction(https://arxiv.org/abs/2403.12413)
Keywords: language model, prompt
Abstract: Language model-based instruction-following systems have lately shown increasing performance on many benchmark tasks, demonstrating the capability of adapting to a broad variety of instructions. However, such systems are often not designed to be transparent about their limitations; a user may easily prompt a model with an instruction without any idea of whether the responses should be expected to be accurate, or whether the system is even capable of performing the task. We propose a third-party performance prediction framework, where a separate model is trained to predict the metric resulting from evaluating an instruction-following system on a task while assuming access only to its inputs and outputs at inference time. We perform this analysis with a variety of both open and closed instruction-following models as well as multiple performance predictors, and examine the effect of various factors such as model size, number of training tasks, and prompt format. Our findings indicate that third-party performance prediction is very challenging, and much work remains in developing predictors that can automatically reveal the limitations of modern instruction-following natural language processing systems.
摘要：基于语言模型的指令跟踪系统最近在许多基准任务上表现出不断提高的性能，展示了适应各种指令的能力。然而，此类系统的设计往往并不明确其局限性；用户可以轻松地用指令提示模型，而不知道响应是否应该准确，或者系统是否能够执行任务。我们提出了一个第三方性能预测框架，其中训练一个单独的模型来预测评估任务上的指令跟踪系统所产生的指标，同时假设在推理时仅访问其输入和输出。我们使用各种开放式和封闭式指令跟踪模型以及多个性能预测器来执行此分析，并检查模型大小、训练任务数量和提示格式等各种因素的影响。我们的研究结果表明，第三方性能预测非常具有挑战性，在开发可以自动揭示现代指令跟踪自然语言处理系统的局限性的预测器方面仍有大量工作要做。

Title: CrossTune: Black-Box Few-Shot Classification with Label Enhancement

Authors: Danqing Luo, Chen Zhang, Yan Zhang, Haizhou Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.12468
Pdf URL: https://arxiv.org/pdf/2403.12468
Copy Paste: [[2403.12468]] CrossTune: Black-Box Few-Shot Classification with Label Enhancement(https://arxiv.org/abs/2403.12468)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Training or finetuning large-scale language models (LLMs) requires substantial computation resources, motivating recent efforts to explore parameter-efficient adaptation to downstream tasks. One approach is to treat these models as black boxes and use forward passes (Inference APIs) to interact with them. Current research focuses on adapting these black-box models to downstream tasks using gradient-free prompt optimization, but this often involves an expensive process of searching task-specific prompts. Therefore, we are motivated to study black-box language model adaptation without prompt search. Specifically, we introduce a label-enhanced cross-attention network called CrossTune, which models the semantic relatedness between the input text sequence and task-specific label descriptions. Its effectiveness is examined in the context of few-shot text classification. To improve the generalization of CrossTune, we utilize ChatGPT to generate additional training data through in-context learning. A switch mechanism is implemented to exclude low-quality ChatGPT-generated data. Through extensive experiments on seven benchmark text classification datasets, we demonstrate that our proposed approach outperforms the previous state-of-the-art gradient-free black-box tuning method by 5.7% on average. Even without using ChatGPT-augmented data, CrossTune performs better than or comparably to previous black-box tuning methods, suggesting the effectiveness of our approach.
摘要：训练或微调大规模语言模型（LLM）需要大量的计算资源，这促使人们最近努力探索对下游任务的参数高效适应。一种方法是将这些模型视为黑匣子，并使用前向传递（推理 API）与它们交互。当前的研究重点是使用无梯度提示优化使这些黑盒模型适应下游任务，但这通常涉及搜索特定于任务的提示的昂贵过程。因此，我们有动力在没有即时搜索的情况下研究黑盒语言模型自适应。具体来说，我们引入了一种称为 CrossTune 的标签增强交叉注意力网络，它对输入文本序列和特定于任务的标签描述之间的语义相关性进行建模。它的有效性在少镜头文本分类的背景下进行了检验。为了提高 CrossTune 的泛化能力，我们利用 ChatGPT 通过上下文学习生成额外的训练数据。实施切换机制以排除 ChatGPT 生成的低质量数据。通过对七个基准文本分类数据集的广泛实验，我们证明我们提出的方法平均优于之前最先进的无梯度黑盒调整方法 5.7%。即使不使用 ChatGPT 增强数据，CrossTune 的性能也比以前的黑盒调整方法更好或相当，这表明我们方法的有效性。

Title: Prompt-based Graph Model for Joint Liberal Event Extraction and Event Schema Induction

Authors: Haochen Li, Di Geng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.12526
Pdf URL: https://arxiv.org/pdf/2403.12526
Copy Paste: [[2403.12526]] Prompt-based Graph Model for Joint Liberal Event Extraction and Event Schema Induction(https://arxiv.org/abs/2403.12526)
Keywords: prompt
Abstract: Events are essential components of speech and texts, describing the changes in the state of entities. The event extraction task aims to identify and classify events and find their participants according to event schemas. Manually predefined event schemas have limited coverage and are hard to migrate across domains. Therefore, researchers have proposed Liberal Event Extraction (LEE), which aims to extract events and discover event schemas simultaneously. However, existing LEE models rely heavily on external language knowledge bases and require the manual development of numerous rules for noise removal and knowledge alignment, which is complex and laborious. To this end, we propose a Prompt-based Graph Model for Liberal Event Extraction (PGLEE). Specifically, we use a prompt-based model to obtain candidate triggers and arguments, and then build heterogeneous event graphs to encode the structures within and between events. Experimental results show that our approach achieves excellent performance with or without predefined event schemas, and that the automatically detected event schemas are of high quality.
摘要：事件是语音和文本的重要组成部分，描述实体状态的变化。事件提取任务旨在对事件进行识别和分类，并根据事件模式找到其参与者。手动预定义的事件模式覆盖范围有限，并且很难跨域迁移。因此，研究人员提出了自由事件提取（LEE），旨在同时提取事件并发现事件模式。然而，现有的LEE模型严重依赖外部语言知识库，需要手动开发大量的噪声去除和知识对齐规则，复杂且费力。为此，我们提出了一种基于提示的自由事件提取图模型（PGLEE）。具体来说，我们使用基于提示的模型来获取候选触发器和参数，然后构建异构事件图来对事件内部和事件之间的结构进行编码。实验结果证明，无论有没有预定义的事件模式，我们的方法都能实现出色的性能，而自动检测到的事件模式也被证明是高质量的。

Title: Factorized Learning Assisted with Large Language Model for Gloss-free Sign Language Translation

Authors: Zhigang Chen, Benjia Zhou, Jun Li, Jun Wan, Zhen Lei, Ning Jiang, Quan Lu, Guoqing Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.12556
Pdf URL: https://arxiv.org/pdf/2403.12556
Copy Paste: [[2403.12556]] Factorized Learning Assisted with Large Language Model for Gloss-free Sign Language Translation(https://arxiv.org/abs/2403.12556)
Keywords: language model, llm
Abstract: Previous Sign Language Translation (SLT) methods achieve superior performance by relying on gloss annotations. However, labeling high-quality glosses is a labor-intensive task, which limits the further development of SLT. Although some approaches work towards gloss-free SLT through jointly training the visual encoder and translation network, these efforts still suffer from poor performance and inefficient use of the powerful Large Language Model (LLM). Most seriously, we find that directly introducing LLM into SLT will lead to insufficient learning of visual representations as LLM dominates the learning curve. To address these problems, we propose Factorized Learning assisted with Large Language Model (FLa-LLM) for gloss-free SLT. Concretely, we factorize the training process into two stages. In the visual initialing stage, we employ a lightweight translation model after the visual encoder to pre-train the visual encoder. In the LLM fine-tuning stage, we freeze the acquired knowledge in the visual encoder and integrate it with a pre-trained LLM to inspire the LLM's translation potential. This factorized training strategy proves to be highly effective as evidenced by significant improvements achieved across three SLT datasets which are all conducted under the gloss-free setting.
摘要：以前的手语翻译 (SLT) 方法通过依赖注释来实现卓越的性能。然而，高质量光泽标签是一项劳动密集型任务，这限制了SLT的进一步发展。尽管一些方法通过联合训练视觉编码器和翻译网络来实现无光泽 SLT，但这些努力仍然受到性能不佳和强大的大语言模型 (LLM) 使用效率低下的困扰。最严重的是，我们发现直接将LLM引入SLT会导致视觉表征学习不足，因为LLM在学习曲线中占主导地位。为了解决这些问题，我们提出了用于无光泽 SLT 的大型语言模型辅助分解学习 (FLa-LLM)。具体来说，我们将训练过程分为两个阶段。在视觉初始化阶段，我们在视觉编码器之后采用轻量级翻译模型来预训练视觉编码器。在LLM微调阶段，我们将获得的知识冻结在视觉编码器中，并将其与预先训练的LLM集成，以激发LLM的翻译潜力。这种分解训练策略被证明是非常有效的，三个 SLT 数据集都取得了显着的改进，这三个数据集都是在无光泽设置下进行的。

Title: AlphaFin: Benchmarking Financial Analysis with Retrieval-Augmented Stock-Chain Framework

Authors: Xiang Li, Zhenyu Li, Chen Shi, Yong Xu, Qing Du, Mingkui Tan, Jun Huang, Wei Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.12582
Pdf URL: https://arxiv.org/pdf/2403.12582
Copy Paste: [[2403.12582]] AlphaFin: Benchmarking Financial Analysis with Retrieval-Augmented Stock-Chain Framework(https://arxiv.org/abs/2403.12582)
Keywords: language model, llm, hallucination, retrieval-augmented generation, chain-of-thought
Abstract: The task of financial analysis primarily encompasses two key areas: stock trend prediction and the corresponding financial question answering. Currently, machine learning and deep learning algorithms (ML&DL) have been widely applied for stock trend predictions, leading to significant progress. However, these methods fail to provide reasons for predictions, lacking interpretability and reasoning processes. Also, they cannot integrate textual information such as financial news or reports. Meanwhile, large language models (LLMs) have remarkable textual understanding and generation ability. However, due to the scarcity of financial training datasets and limited integration with real-time knowledge, LLMs still suffer from hallucinations and are unable to keep up with the latest information. To tackle these challenges, we first release the AlphaFin datasets, combining traditional research datasets, real-time financial data, and handwritten chain-of-thought (CoT) data. They have a positive impact on training LLMs for completing financial analysis. We then use the AlphaFin datasets to benchmark a state-of-the-art method, called Stock-Chain, for effectively tackling the financial analysis task, which integrates retrieval-augmented generation (RAG) techniques. Extensive experiments are conducted to demonstrate the effectiveness of our framework on financial analysis.
摘要：财务分析的任务主要包括两个关键领域：股票走势预测和相应的财务问题解答。目前，机器学习和深度学习算法（ML&DL）已广泛应用于股票走势预测，并取得了显着进展。然而，这些方法无法提供预测的理由，缺乏可解释性和推理过程。此外，它们无法集成财经新闻或报告等文本信息。同时，大型语言模型（LLM）具有出色的文本理解和生成能力。但由于金融培训数据集的稀缺以及与实时知识的整合有限，法学硕士仍然饱受幻觉之苦，无法跟上最新信息。为了应对这些挑战，我们首先发布了 AlphaFin 数据集，结合了传统研究数据集、实时金融数据和手写思想链（CoT）数据。它对培训法学硕士完成财务分析具有积极影响。然后，我们使用 AlphaFin 数据集对称为 Stock-Chain 的最先进方法进行基准测试，以有效地处理财务分析任务，该方法集成了检索增强生成 (RAG) 技术。我们进行了大量的实验来证明我们的财务分析框架的有效性。

Title: Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs

Authors: Victor Carbune, Hassan Mansoor, Fangyu Liu, Rahul Aralikatte, Gilles Baechler, Jindong Chen, Abhanshu Sharma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.12596
Pdf URL: https://arxiv.org/pdf/2403.12596
Copy Paste: [[2403.12596]] Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs(https://arxiv.org/abs/2403.12596)
Keywords: language model, gpt, llm, prompt
Abstract: Vision-language models (VLMs) are achieving increasingly strong performance on multimodal tasks. However, reasoning capabilities remain limited, particularly for smaller VLMs, while those of large language models (LLMs) have seen numerous improvements. We propose a technique to transfer capabilities from LLMs to VLMs. On the recently introduced ChartQA, our method obtains state-of-the-art performance when applied on the PaLI3-5B VLM by Chen et al. (2023), while also enabling much better performance on PlotQA and FigureQA. We first improve the chart representation by continuing the pre-training stage using an improved version of the chart-to-table translation task by Liu et al. (2023). We then propose constructing a 20x larger dataset than the original training set. To improve general reasoning capabilities and improve numerical operations, we synthesize reasoning traces using the table representation of charts. Lastly, our model is fine-tuned using the multitask loss introduced by Hsieh et al. (2023). Our variant ChartPaLI-5B outperforms even 10x larger models such as PaLIX-55B without using an upstream OCR system, while keeping inference time constant compared to the PaLI3-5B baseline. When rationales are further refined with a simple program-of-thought prompt (Chen et al., 2023), our model outperforms the recently introduced Gemini Ultra and GPT-4V.
摘要：视觉语言模型 (VLM) 在多模式任务上取得了越来越强的性能。然而，推理能力仍然有限，特别是对于较小的 VLM，而大型语言模型 (LLM) 的推理能力已经有了许多改进。我们提出了一种将能力从 LLM 转移到 VLM 的技术。在最近推出的 ChartQA 上，我们的方法在 \citet{chen2023pali3} 应用于 PaLI3-5B VLM 时获得了最先进的性能，同时在 PlotQA 和 FigureQA 上也实现了更好的性能。我们首先使用 \citet{liu2023deplot} 的图表到表格翻译任务的改进版本继续预训练阶段来改进图表表示。然后，我们建议构建一个比原始训练集大 20 倍的数据集。为了提高一般推理能力并改进数值运算，我们使用图表的表格表示来综合推理轨迹。最后，我们的模型使用 \citet{hsieh2023distilling} 引入的多任务损失进行微调。我们的变体 ChartPaLI-5B 在不使用上游 OCR 系统的情况下甚至优于 10 倍大的模型（例如 PaLIX-55B），同时与 PaLI3-5B 基线相比保持推理时间恒定。当使用简单的思维程序提示 \cite{chen2023program} 进一步完善基本原理时，我们的模型优于最近推出的 Gemini Ultra 和 GPT-4V。

Title: LHMKE: A Large-scale Holistic Multi-subject Knowledge Evaluation Benchmark for Chinese Large Language Models

Authors: Chuang Liu, Renren Jin, Yuqi Ren, Deyi Xiong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.12601
Pdf URL: https://arxiv.org/pdf/2403.12601
Copy Paste: [[2403.12601]] LHMKE: A Large-scale Holistic Multi-subject Knowledge Evaluation Benchmark for Chinese Large Language Models(https://arxiv.org/abs/2403.12601)
Keywords: language model, gpt, llm
Abstract: Chinese Large Language Models (LLMs) have recently demonstrated impressive capabilities across various NLP benchmarks and real-world applications. However, the existing benchmarks for comprehensively evaluating these LLMs are still insufficient, particularly in terms of measuring knowledge that LLMs capture. Current datasets collect questions from Chinese examinations across different subjects and educational levels to address this issue. Yet, these benchmarks primarily focus on objective questions such as multiple-choice questions, leading to a lack of diversity in question types. To tackle this problem, we propose LHMKE, a Large-scale, Holistic, and Multi-subject Knowledge Evaluation benchmark in this paper. LHMKE is designed to provide a comprehensive evaluation of the knowledge acquisition capabilities of Chinese LLMs. It encompasses 10,465 questions across 75 tasks covering 30 subjects, ranging from primary school to professional certification exams. Notably, LHMKE includes both objective and subjective questions, offering a more holistic evaluation of the knowledge level of LLMs. We have assessed 11 Chinese LLMs under the zero-shot setting, which aligns with real examinations, and compared their performance across different subjects. We also conduct an in-depth analysis to check whether GPT-4 can automatically score subjective predictions. Our findings suggest that LHMKE is a challenging and advanced testbed for Chinese LLMs.
摘要：中文大语言模型 (LLM) 最近在各种 NLP 基准测试和实际应用中展示了令人印象深刻的功能。然而，全面评估这些法学硕士的现有基准仍然不足，特别是在衡量法学硕士所掌握的知识方面。当前的数据集收集了不同学科和教育水平的中国考试问题来解决这个问题。然而，这些基准主要关注多项选择题等客观题，导致问题类型缺乏多样性。为了解决这个问题，我们在本文中提出了 LHMKE，一种大规模、整体、多学科知识评估基准。 LHMKE旨在对中国法学硕士的知识获取能力进行综合评估。它包含 75 项任务中的 10,465 个问题，涵盖 30 个科目，从小学到专业认证考试。值得注意的是，LHMKE 包括客观和主观问题，为法学硕士的知识水平提供更全面的评估。我们在零样本环境下评估了 11 名中国法学硕士，与真实考试相符，并比较了他们在不同科目上的表现。我们还进行了深入分析，以检查 GPT-4 是否可以自动对主观预测进行评分。我们的研究结果表明，LHMKE 对于中国法学硕士来说是一个具有挑战性且先进的测试平台。

Title: Multi-Dimensional Machine Translation Evaluation: Model Evaluation and Resource for Korean

Authors: Dojun Park, Sebastian Padó
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.12666
Pdf URL: https://arxiv.org/pdf/2403.12666
Copy Paste: [[2403.12666]] Multi-Dimensional Machine Translation Evaluation: Model Evaluation and Resource for Korean(https://arxiv.org/abs/2403.12666)
Keywords: language model
Abstract: Almost all frameworks for the manual or automatic evaluation of machine translation characterize the quality of an MT output with a single number. An exception is the Multidimensional Quality Metrics (MQM) framework, which offers a fine-grained ontology of quality dimensions for scoring (such as style, fluency, accuracy, and terminology). Previous studies have demonstrated the feasibility of MQM annotation but there are, to our knowledge, no computational models that predict MQM scores for novel texts, due to a lack of resources. In this paper, we address these shortcomings by (a) providing a 1200-sentence MQM evaluation benchmark for the language pair English-Korean and (b) reframing MT evaluation as the multi-task problem of simultaneously predicting several MQM scores using SOTA language models, both in a reference-based MT evaluation setup and a reference-free quality estimation (QE) setup. We find that the reference-free setup outperforms its counterpart in the style dimension while reference-based models retain an edge regarding accuracy. Overall, RemBERT emerges as the most promising model. Through our evaluation, we offer insight into translation quality in a more fine-grained, interpretable manner.
摘要：几乎所有用于手动或自动评估机器翻译的框架都用单个数字来表征机器翻译输出的质量。多维质量指标 (MQM) 框架是一个例外，它提供了用于评分的细粒度质量维度本体（例如风格、流畅性、准确性和术语）。先前的研究已经证明了 MQM 注释的可行性，但据我们所知，由于缺乏资源，没有预测小说文本 MQM 分数的计算模型。在本文中，我们通过（a）为英语-韩语语言对提供 1200 句 MQM 评估基准，以及（b）将 MT 评估重新定义为使用 SOTA 语言模型同时预测多个 MQM 分数的多任务问题来解决这些缺点，无论是在基于参考的 MT 评估设置还是无参考质量估计 (QE) 设置中。我们发现无参考设置在风格维度上优于其对应设置，而基于参考的模型在准确性方面保持优势。总体而言，RemBERT 成为最有前途的模型。通过我们的评估，我们以更细粒度、可解释的方式深入了解翻译质量。

Title: Pragmatic Competence Evaluation of Large Language Models for Korean

Authors: Dojun Park, Jiwoo Lee, Hyeyun Jeong, Seohyun Park, Sungeun Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.12675
Pdf URL: https://arxiv.org/pdf/2403.12675
Copy Paste: [[2403.12675]] Pragmatic Competence Evaluation of Large Language Models for Korean(https://arxiv.org/abs/2403.12675)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: The current evaluation of Large Language Models (LLMs) predominantly relies on benchmarks focusing on their embedded knowledge by testing through multiple-choice questions (MCQs), a format inherently suited for automated evaluation. Our study extends this evaluation to explore LLMs' pragmatic competence--a facet underexamined before the advent of sophisticated LLMs, specifically in the context of Korean. We employ two distinct evaluation setups: the conventional MCQ format, adapted for automatic evaluation, and Open-Ended Questions (OEQs), assessed by human experts, to examine LLMs' narrative response capabilities without predefined options. Our findings reveal that GPT-4 excels, scoring 81.11 and 85.69 in the MCQ and OEQ setups, respectively. HyperCLOVA X, an LLM optimized for Korean, follows closely, especially in the OEQ setup, where it scores 81.56, a marginal difference of 4.13 points compared to GPT-4. Furthermore, while few-shot learning strategies generally enhance LLM performance, Chain-of-Thought (CoT) prompting introduces a bias toward literal interpretations, hindering accurate pragmatic inference. Considering the growing expectation for LLMs to understand and produce language that aligns with human communicative norms, our findings emphasize the importance of advancing LLMs' abilities to grasp and convey sophisticated meanings beyond mere literal interpretations.
摘要：当前对大型语言模型（LLM）的评估主要依赖于通过多项选择题（MCQ）进行测试来关注其嵌入知识的基准，MCQ 是一种本质上适合自动评估的格式。我们的研究将这一评估扩展到探索法学硕士的实用能力——在复杂的法学硕士出现之前，这一方面以前未被充分审查，特别是在韩语背景下。我们采用两种不同的评估设置：适用于自动评估的传统 MCQ 格式和由人类专家评估的开放式问题 (OEQ)，以在没有预定义选项的情况下检查法学硕士的叙述反应能力。我们的研究结果表明，GPT-4 表现出色，在 MCQ 和 OEQ 设置中分别得分 81.11 和 85.69，而针对韩语优化的法学硕士 HyperCLOVA X 紧随其后，尤其是在 OEQ 设置中，得分为 81.56，略有差异与 GPT-4 相比为 4.13 分。此外，虽然少样本学习策略通常可以提高法学硕士的表现，但思想链（CoT）提示会引入对字面解释的偏见，阻碍准确的实用推理。考虑到人们对法学硕士理解和产生符合人类交际规范的语言的期望日益增长，我们的研究结果强调了提高法学硕士掌握和传达超越单纯字面解释的复杂含义的能力的重要性。

Title: Empowering Air Travelers: A Chatbot for Canadian Air Passenger Rights

Authors: Maksym Taranukhin, Sahithya Ravi, Gabor Lukacs, Evangelos Milios, Vered Shwartz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.12678
Pdf URL: https://arxiv.org/pdf/2403.12678
Copy Paste: [[2403.12678]] Empowering Air Travelers: A Chatbot for Canadian Air Passenger Rights(https://arxiv.org/abs/2403.12678)
Keywords: hallucination, chat
Abstract: The Canadian air travel sector has seen a significant increase in flight delays, cancellations, and other issues concerning passenger rights. Recognizing this demand, we present a chatbot to assist passengers and educate them about their rights. Our system breaks a complex user input into simple queries which are used to retrieve information from a collection of documents detailing air travel regulations. The most relevant passages from these documents are presented along with links to the original documents and the generated queries, enabling users to dissect and leverage the information for their unique circumstances. The system successfully overcomes two predominant challenges: understanding complex user inputs, and delivering accurate answers, free of hallucinations, that passengers can rely on for making informed decisions. A user study comparing the chatbot to a Google search demonstrated the chatbot's usefulness and ease of use. Beyond the primary goal of providing accurate and timely information to air passengers regarding their rights, we hope that this system will also enable further research exploring the tradeoff between the user-friendly conversational interface of chatbots and the accuracy of retrieval systems.
摘要：加拿大航空旅行业的航班延误、取消和其他涉及乘客权利的问题大幅增加。认识到这一需求，我们推出了一个聊天机器人来帮助乘客并教育他们了解自己的权利。我们的系统将复杂的用户输入分解为简单的查询，用于从详细说明航空旅行法规的文档集合中检索信息。这些文档中最相关的段落以及原始文档和生成的查询的链接一起呈现，使用户能够根据自己的独特情况剖析和利用这些信息。该系统成功克服了两个主要挑战：理解复杂的用户输入，并提供准确的答案，没有幻觉，乘客可以依靠这些答案做出明智的决定。一项将聊天机器人与 Google 搜索进行比较的用户研究证明了聊天机器人的实用性和易用性。除了向航空乘客提供有关其权利的准确及时信息的主要目标之外，我们希望该系统还能够进行进一步的研究，探索聊天机器人的用户友好对话界面与检索系统的准确性之间的权衡。
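
The pipeline outlined in the chatbot abstract (split a complex input into simple queries, retrieve the most relevant passage for each, and present it with a link and the generated query) can be sketched as follows. This is a rough illustration under loud assumptions, not the authors' system: the splitting rule and token-overlap scoring are simple stand-ins for whatever models the paper uses, and the passages and example.org links are invented placeholders, not real regulations.

```python
import re
from typing import Dict, List

# Placeholder document collection (invented text and links, for illustration only).
PASSAGES: List[Dict[str, str]] = [
    {"text": "Placeholder passage about flight delay compensation.",
     "link": "https://example.org/doc/delays"},
    {"text": "Placeholder passage about lost baggage claims.",
     "link": "https://example.org/doc/baggage"},
]


def tokens(text: str) -> set:
    """Lowercase alphabetic tokens of a string."""
    return set(re.findall(r"[a-z]+", text.lower()))


def decompose(user_input: str) -> List[str]:
    """Stand-in for query generation: split on ' and ' and sentence punctuation."""
    parts = re.split(r"\s+and\s+|[.?!]", user_input)
    return [p.strip() for p in parts if p.strip()]


def retrieve(query: str) -> Dict[str, str]:
    """Return the passage sharing the most tokens with the query."""
    q = tokens(query)
    return max(PASSAGES, key=lambda p: len(q & tokens(p["text"])))


def answer(user_input: str) -> List[Dict[str, str]]:
    """Pair each generated query with its top passage and source link."""
    results = []
    for query in decompose(user_input):
        best = retrieve(query)
        results.append({"query": query, "passage": best["text"], "link": best["link"]})
    return results


if __name__ == "__main__":
    for row in answer("My flight had a long delay and my baggage was lost"):
        print(row)
```

Returning the original passage and its link, rather than a generated paraphrase, mirrors the abstract's emphasis on answers that users can verify against the source documents.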

Title: Instructing Large Language Models to Identify and Ignore Irrelevant Conditions

Authors: Zhenyu Wu, Chao Shen, Meng Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.12744
Pdf URL: https://arxiv.org/pdf/2403.12744
Copy Paste: [[2403.12744]] Instructing Large Language Models to Identify and Ignore Irrelevant Conditions(https://arxiv.org/abs/2403.12744)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: Math word problem (MWP) solving requires generating a reasoning path based on a given problem description that often contains irrelevant conditions. Existing chain-of-thought (CoT) prompting methods elicited multi-step reasoning abilities of large language models (LLMs) to solve MWPs. However, they were seriously confused by the irrelevant conditions, resulting in low accuracy. In this paper, we propose a novel approach named I$^3$C that instructs LLMs to identify and ignore irrelevant conditions. It identifies a set of irrelevant condition candidates that have a weak semantic relevance with the question. Then it prompts LLMs to verify the irrelevant conditions. Lastly, it instructs the LLMs with the verification on relevant and irrelevant conditions to avoid confusion and improve reasoning paths. Moreover, we propose to select (problem, reasoning paths) pairs as demonstrations to enhance I$^3$C with few-shot reasoning. We develop I$^3$C-Select that selects the most confusing problems based on the semantic relevance measurement. We conduct extensive experiments on eight MWP datasets. I$^3$C can be combined with any CoT prompting method to improve the performance of solving MWPs. Notably, with GPT-3.5-Turbo and I$^3$C-Select, we achieve an accuracy of 96.0 and 94.1 on GSM-IC2-1K and GSM-ICM-1K, respectively, significantly outperforming the state-of-the-art few-shot prompting method Complex-CoT by +11.7 and +11.1. Our implementation is made publicly available at https://wzy6642.github.io/I3C.github.io/.
摘要：数学应用题 (MWP) 解决需要根据给定的问题描述生成推理路径，该问题描述通常包含不相关的条件。现有的思想链（CoT）提示方法引发了大语言模型（LLM）的多步推理能力来解决 MWP。然而，他们被不相关的条件严重迷惑，导致准确率低下。在本文中，我们提出了一种名为 I$^3$C 的新颖方法，指导法学硕士识别并忽略不相关的条件。它识别出一组与问题语义相关性较弱的不相关候选条件。然后它会提示法学硕士验证不相关的条件。最后指导法学硕士对相关和不相关条件进行验证，以避免混淆并改进推理路径。此外，我们建议选择（问题，推理路径）对作为演示，以通过少样本推理增强 I$^3$C。我们开发了 I$^3$C-Select，它根据语义相关性测量来选择最令人困惑的问题。我们对八个 MWP 数据集进行了广泛的实验。 I$^3$C 可以与任何 CoT 提示方法相结合，以提高求解 MWP 的性能。值得注意的是，使用 GPT-3.5-Turbo 和 I$^3$C-Select，我们在 GSM-IC2-1K 和 GSM-ICM-1K 上分别实现了 96.0 和 94.1 的准确度，显着优于现有技术艺术少样本提示方法 Complex-CoT 为 +11.7 和 +11.1。我们的实现已在 https://wzy6642.github.io/I3C.github.io/ 上公开发布。

Title: NovelQA: A Benchmark for Long-Range Novel Question Answering

Authors: Cunxiang Wang, Ruoxi Ning, Boqi Pan, Tonghui Wu, Qipeng Guo, Cheng Deng, Guangsheng Bao, Qian Wang, Yue Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.12766
Pdf URL: https://arxiv.org/pdf/2403.12766
Copy Paste: [[2403.12766]] NovelQA: A Benchmark for Long-Range Novel Question Answering(https://arxiv.org/abs/2403.12766)
Keywords: language model, llm
Abstract: The rapid advancement of Large Language Models (LLMs) has introduced a new frontier in natural language processing, particularly in understanding and processing long-context information. However, the evaluation of these models' long-context abilities remains a challenge due to the limitations of current benchmarks. To address this gap, we introduce NovelQA, a benchmark specifically designed to test the capabilities of LLMs with extended texts. Constructed from English novels, NovelQA offers a unique blend of complexity, length, and narrative coherence, making it an ideal tool for assessing deep textual understanding in LLMs. This paper presents the design and construction of NovelQA, highlighting its manual annotation and diverse question types. Our evaluation of long-context LLMs on NovelQA reveals significant insights into the models' performance, particularly emphasizing the challenges they face with multi-hop reasoning, detail-oriented questions, and extremely long input with more than 100,000 tokens. The results underscore the necessity for further advancements in LLMs to improve their long-context comprehension and computational literary studies.
摘要：大型语言模型（LLM）的快速发展为自然语言处理开辟了新的前沿，特别是在理解和处理长上下文信息方面。然而，由于当前基准的限制，评估这些模型的长上下文能力仍然是一个挑战。为了弥补这一差距，我们引入了NovelQA，这是一个专门为测试法学硕士使用扩展文本的能力而设计的基准。 NovelQA 以英语小说为基础，提供了复杂性、长度和叙事连贯性的独特融合，使其成为评估法学硕士深度文本理解的理想工具。本文介绍了 NovelQA 的设计和构建，重点介绍了其手动注释和多样化的问题类型。我们对 NovelQA 上的长上下文法学硕士的评估揭示了对模型性能的重要见解，特别强调了它们在多跳推理、面向细节的问题以及超过 100,000 个标记的超长输入方面面临的挑战。结果强调了法学硕士进一步进步的必要性，以提高他们的长语境理解和计算文学研究。

Title: Automated Data Curation for Robust Language Model Fine-Tuning

Authors: Jiuhai Chen, Jonas Mueller
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.12776
Pdf URL: https://arxiv.org/pdf/2403.12776
Copy Paste: [[2403.12776]] Automated Data Curation for Robust Language Model Fine-Tuning(https://arxiv.org/abs/2403.12776)
Keywords: language model, gpt, llm, prompt
Abstract: Large Language Models have become the de facto approach to sequence-to-sequence text generation tasks, but for specialized tasks/domains, a pretrained LLM lacks specific capabilities to produce accurate or well-formatted responses. Supervised fine-tuning specializes an LLM by training it on a dataset of example prompts with target responses, but real-world data tends to be noisy. While many fine-tuning algorithms exist, here we consider a data-centric AI perspective on LLM fine-tuning, studying how to systematically curate the training dataset to improve the LLM produced via any fine-tuning algorithm. We introduce an automated data curation pipeline CLEAR (Confidence-based LLM Evaluation And Rectification) for instruction tuning datasets, which can be used with any LLM and fine-tuning procedure. CLEAR estimates which training data is low-quality and either filters or corrects it. Automatically identifying which data to filter or correct is done via LLM-derived confidence estimates, to ensure only confident modifications to the dataset. Unlike existing data curation techniques, CLEAR is a comprehensive framework that can improve a dataset (and trained model outputs) without additional fine-tuning computations. We don't assume access to a stronger LLM than the model being fine-tuned (e.g., relying on GPT-4 when fine-tuning GPT-3.5), to see whether CLEAR can meaningfully improve the capabilities of any LLM. Experiments reveal that CLEAR consistently improves the performance of fine-tuned models across many datasets and models (like GPT-3.5 and Llama2).
摘要：大型语言模型已成为序列到序列文本生成任务的事实上的方法，但对于专门的任务/领域，预训练的 LLM 缺乏生成准确或格式良好的响应的特定功能。监督微调通过在具有目标响应的示例提示数据集上训练法学硕士来专业化法学硕士，但现实世界的数据往往充满噪音。虽然存在许多微调算法，但在这里我们从 \emph{以数据为中心的 AI} 角度考虑 LLM 微调，研究如何 \emph{系统地} 管理训练数据集，以改进通过 \emph{any} 生成的 LLM微调算法。我们引入了用于指令调整数据集的自动数据管理管道 CLEAR（基于置信度的 LLM 评估和纠正），可与任何 LLM 和微调程序一起使用。 CLEAR 会估计哪些训练数据质量较低，然后对其进行过滤或纠正。通过 LLM 导出的置信度估计自动识别要过滤或纠正的数据，以确保仅对数据集进行可信修改。与现有的数据管理技术不同，CLEAR 是一个综合框架，可以改进数据集（和经过训练的模型输出），而无需额外的微调计算。我们不假设访问比正在微调的模型更强大的 LLM（例如，在微调 GPT-3.5 时依赖 GPT-4），以了解 CLEAR 是否可以有意义地提高任何 LLM 的能力。实验表明，CLEAR 持续改进了许多数据集和模型（如 GPT-3.5 和 Llama2）中微调模型的性能。

Title: Comparing Explanation Faithfulness between Multilingual and Monolingual Fine-tuned Language Models

Authors: Zhixue Zhao, Nikolaos Aletras
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.12809
Pdf URL: https://arxiv.org/pdf/2403.12809
Copy Paste: [[2403.12809]] Comparing Explanation Faithfulness between Multilingual and Monolingual Fine-tuned Language Models(https://arxiv.org/abs/2403.12809)
Keywords: language model
Abstract: In many real natural language processing application scenarios, practitioners not only aim to maximize predictive performance but also seek faithful explanations for the model predictions. Rationales and importance distribution given by feature attribution methods (FAs) provide insights into how different parts of the input contribute to a prediction. Previous studies have explored how different factors affect faithfulness, mainly in the context of monolingual English models. On the other hand, the differences in FA faithfulness between multilingual and monolingual models have yet to be explored. Our extensive experiments, covering five languages and five popular FAs, show that FA faithfulness varies between multilingual and monolingual models. We find that the larger the multilingual model, the less faithful the FAs are compared to its counterpart monolingual models. Our further analysis shows that the faithfulness disparity is potentially driven by the differences between model tokenizers. Our code is available at https://github.com/casszhao/multilingual-faith.
摘要：在许多真实的自然语言处理应用场景中，从业者不仅致力于最大化预测性能，而且还寻求对模型预测的忠实解释。特征归因方法 (FA) 给出的基本原理和重要性分布提供了有关输入的不同部分如何有助于预测的见解。先前的研究主要在单语英语模型的背景下探讨了不同因素如何影响忠诚度。另一方面，多语言和单语言模型之间 FA 忠实度的差异仍有待探索。我们的广泛实验涵盖五种语言和五种流行的 FA，表明多语言和单语言模型之间的 FA 忠诚度有所不同。我们发现，多语言模型越大，与对应的单语言模型相比，FA 的可信度就越低。我们的进一步分析表明，可信度差异可能是由模型分词器之间的差异造成的。我们的代码可用：https://github.com/casszhao/multilingual-faith。

Title: Epistemology of Language Models: Do Language Models Have Holistic Knowledge?

Authors: Minsu Kim, James Thorne
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.12862
Pdf URL: https://arxiv.org/pdf/2403.12862
Copy Paste: [[2403.12862]] Epistemology of Language Models: Do Language Models Have Holistic Knowledge?(https://arxiv.org/abs/2403.12862)
Keywords: language model, llm
Abstract: This paper investigates the inherent knowledge in language models from the perspective of epistemological holism. The purpose of this paper is to explore whether LLMs exhibit characteristics consistent with epistemological holism. These characteristics suggest that core knowledge, such as general scientific knowledge, each plays a specific role, serving as the foundation of our knowledge system and being difficult to revise. To assess these traits related to holism, we created a scientific reasoning dataset and examined the epistemology of language models through three tasks: Abduction, Revision, and Argument Generation. In the abduction task, the language models explained situations while avoiding revising the core knowledge. However, in other tasks, the language models were revealed not to distinguish between core and peripheral knowledge, showing an incomplete alignment with holistic knowledge principles.
摘要：本文从认识论整体论的角度探讨了语言模型中的内在知识。本文的目的是探讨法学硕士是否表现出与认识论整体论一致的特征。这些特征表明，核心知识，例如一般科学知识，各自发挥着特定的作用，是我们知识体系的基础，并且难以修改。为了评估与整体论相关的这些特征，我们创建了一个科学推理数据集，并通过三个任务检查了语言模型的认识论：溯因、修订和论证生成。在溯因任务中，语言模型解释了情况，同时避免了修改核心知识。然而，在其他任务中，语言模型并未区分核心知识和外围知识，显示出与整体知识原则的不完全一致。

Title: Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models

Authors: Zehui Chen, Kuikun Liu, Qiuchen Wang, Wenwei Zhang, Jiangning Liu, Dahua Lin, Kai Chen, Feng Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.12881
Pdf URL: https://arxiv.org/pdf/2403.12881
Copy Paste: [[2403.12881]] Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models(https://arxiv.org/abs/2403.12881)
Keywords: language model, llm, hallucination, agent
Abstract: Open-sourced Large Language Models (LLMs) have achieved great success in various NLP tasks; however, they are still far inferior to API-based models when acting as agents. How to integrate agent ability into general LLMs becomes a crucial and urgent problem. This paper first delivers three key observations: (1) the current agent training corpus is entangled with both format following and agent reasoning, which significantly shifts from the distribution of its pre-training data; (2) LLMs exhibit different learning speeds on the capabilities required by agent tasks; and (3) current approaches have side-effects when improving agent abilities by introducing hallucinations. Based on the above findings, we propose Agent-FLAN to effectively Fine-tune LANguage models for Agents. Through careful decomposition and redesign of the training corpus, Agent-FLAN enables Llama2-7B to outperform prior best works by 3.5% across various agent evaluation datasets. With comprehensively constructed negative samples, Agent-FLAN greatly alleviates the hallucination issues based on our established evaluation benchmark. Besides, it consistently improves the agent capability of LLMs when scaling model sizes while slightly enhancing the general capability of LLMs. The code will be available at https://github.com/InternLM/Agent-FLAN.
摘要：开源的大型语言模型（LLM）在各种 NLP 任务中取得了巨大的成功，但在充当代理时，它们仍然远远不如基于 API 的模型。如何将代理能力融入到普通法学硕士课程中成为一个至关重要而紧迫的问题。本文首先提出了三个关键观察结果：（1）当前的智能体训练语料库与格式遵循和智能体推理纠缠在一起，这与预训练数据的分布发生了显着的变化； (2) LLM 对代理任务所需的能力表现出不同的学习速度； (3)当前的方法在通过引入幻觉来提高代理能力时存在副作用。基于上述发现，我们提出 Agent-FLAN 来有效地微调 Agent 的语言模型。通过对训练语料库的仔细分解和重新设计，Agent-FLAN 使 Llama2-7B 在各种代理评估数据集上的表现比之前的最佳作品高出 3.5%。通过全面构建负样本，Agent-FLAN 根据我们建立的评估基准极大地缓解了幻觉问题。此外，它在扩展模型大小时持续提高了 LLM 的代理能力，同时略微增强了 LLM 的一般能力。该代码可在 https://github.com/InternLM/Agent-FLAN 上获取。

Title: Generalizable and Stable Finetuning of Pretrained Language Models on Low-Resource Texts

Authors: Sai Ashish Somayajula, Youwei Liang, Abhishek Singh, Li Zhang, Pengtao Xie
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.12918
Pdf URL: https://arxiv.org/pdf/2403.12918
Copy Paste: [[2403.12918]] Generalizable and Stable Finetuning of Pretrained Language Models on Low-Resource Texts(https://arxiv.org/abs/2403.12918)
Keywords: language model
Abstract: Pretrained Language Models (PLMs) have advanced Natural Language Processing (NLP) tasks significantly, but finetuning PLMs on low-resource datasets poses significant challenges such as instability and overfitting. Previous methods tackle these issues by finetuning a strategically chosen subnetwork on a downstream task, while keeping the remaining weights fixed to the pretrained weights. However, they rely on a suboptimal criterion for sub-network selection, leading to suboptimal solutions. To address these limitations, we propose a regularization method based on attention-guided weight mixup for finetuning PLMs. Our approach represents each network weight as a mixup of task-specific weight and pretrained weight, controlled by a learnable attention parameter, providing finer control over sub-network selection. Furthermore, we employ a bi-level optimization (BLO) based framework on two separate splits of the training dataset, improving generalization and combating overfitting. We validate the efficacy of our proposed method through extensive experiments, demonstrating its superiority over previous methods, particularly in the context of finetuning PLMs on low-resource datasets.
摘要：预训练语言模型 (PLM) 显着推进了自然语言处理 (NLP) 任务，但在低资源数据集上微调 PLM 会带来不稳定和过度拟合等重大挑战。以前的方法通过对下游任务上战略选择的子网络进行微调来解决这些问题，同时将剩余权重固定为预训练权重。然而，它们依赖于次优标准来选择子网，从而导致次优解决方案。为了解决这些限制，我们提出了一种基于注意力引导权重混合的正则化方法，用于微调 PLM。我们的方法将每个网络权重表示为特定于任务的权重和预训练权重的混合，由可学习的注意力参数控制，从而提供对子网络选择的更精细的控制。此外，我们在训练数据集的两个单独的分割上采用基于双层优化（BLO）的框架，提高泛化能力并防止过度拟合。我们通过大量实验验证了我们提出的方法的有效性，证明了其相对于以前方法的优越性，特别是在低资源数据集上微调 PLM 的情况下。
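
Illustrative sketch (not from the paper): the abstract describes each network weight as a mixup of a task-specific weight and a pretrained weight, controlled by a learnable attention parameter. The snippet below shows one way such an interpolation could look on plain Python lists. The sigmoid parametrization, the function name mixup_weights, and the argument names are assumptions made for illustration; the abstract does not specify them.

```python
import math


def sigmoid(x):
    """Map a real-valued logit to a mixing coefficient in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def mixup_weights(pretrained, task_specific, attention_logits):
    """Interpolate per weight between task-specific and pretrained values.

    Assumed form (hypothetical, for illustration only):
        w = a * w_task + (1 - a) * w_pretrained,  a = sigmoid(logit)
    A coefficient near 0 keeps the pretrained weight; near 1 uses the
    task-specific weight.
    """
    if not (len(pretrained) == len(task_specific) == len(attention_logits)):
        raise ValueError("all inputs must have the same length")
    mixed = []
    for w_pre, w_task, logit in zip(pretrained, task_specific, attention_logits):
        a = sigmoid(logit)
        mixed.append(a * w_task + (1.0 - a) * w_pre)
    return mixed
```

With all logits at 0 the result is the midpoint of the two weight vectors; strongly negative logits effectively freeze a weight at its pretrained value, which is how a soft form of sub-network selection could emerge under this assumed parametrization.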

Title: Supporting Energy Policy Research with Large Language Models

Authors: Grant Buster, Pavlo Pinchuk, Jacob Barrons, Ryan McKeever, Aaron Levine, Anthony Lopez
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.12924
Pdf URL: https://arxiv.org/pdf/2403.12924
Copy Paste: [[2403.12924]] Supporting Energy Policy Research with Large Language Models(https://arxiv.org/abs/2403.12924)
Keywords: language model, llm
Abstract: The recent growth in renewable energy development in the United States has been accompanied by a simultaneous surge in renewable energy siting ordinances. These zoning laws play a critical role in dictating the placement of wind and solar resources that are critical for achieving low-carbon energy futures. In this context, efficient access to and management of siting ordinance data becomes imperative. The National Renewable Energy Laboratory (NREL) recently introduced a public wind and solar siting database to fill this need. This paper presents a method for harnessing Large Language Models (LLMs) to automate the extraction of these siting ordinances from legal documents, enabling this database to maintain accurate up-to-date information in the rapidly changing energy policy landscape. A novel contribution of this research is the integration of a decision tree framework with LLMs. Our results show that this approach is 85 to 90% accurate with outputs that can be used directly in downstream quantitative modeling. We discuss opportunities to use this work to support similar large-scale policy research in the energy sector. By unlocking new efficiencies in the extraction and analysis of legal documents using LLMs, this study enables a path forward for automated large-scale energy policy research.
摘要：近年来，美国可再生能源开发的增长伴随着可再生能源选址条例的同步激增。这些分区法在规定风能和太阳能资源的布局方面发挥着关键作用，这对于实现低碳能源未来至关重要。在这种背景下，有效访问和管理选址条例数据变得势在必行。国家可再生能源实验室 (NREL) 最近推出了公共风能和太阳能选址数据库来满足这一需求。本文提出了一种利用大型语言模型 (LLM) 自动从法律文档中提取这些选址条例的方法，使该数据库能够在快速变化的能源政策环境中保持准确的最新信息。这项研究的一个新颖贡献是将决策树框架与法学硕士集成。我们的结果表明，这种方法的准确度为 85% 至 90%，输出可直接用于下游定量建模。我们讨论利用这项工作来支持能源领域类似的大规模政策研究的机会。通过利用法学硕士提高法律文件提取和分析的新效率，这项研究为自动化大规模能源政策研究开辟了道路。

Title: Automatic Information Extraction From Employment Tribunal Judgements Using Large Language Models

Authors: Joana Ribeiro de Faria, Huiyuan Xie, Felix Steffek
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.12936
Pdf URL: https://arxiv.org/pdf/2403.12936
Copy Paste: [[2403.12936]] Automatic Information Extraction From Employment Tribunal Judgements Using Large Language Models(https://arxiv.org/abs/2403.12936)
Keywords: language model, gpt, llm
Abstract: Court transcripts and judgments are rich repositories of legal knowledge, detailing the intricacies of cases and the rationale behind judicial decisions. The extraction of key information from these documents provides a concise overview of a case, crucial for both legal experts and the public. With the advent of large language models (LLMs), automatic information extraction has become increasingly feasible and efficient. This paper presents a comprehensive study on the application of GPT-4, a large language model, for automatic information extraction from UK Employment Tribunal (UKET) cases. We meticulously evaluated GPT-4's performance in extracting critical information with a manual verification process to ensure the accuracy and relevance of the extracted data. Our research is structured around two primary extraction tasks: the first involves a general extraction of eight key aspects that hold significance for both legal specialists and the general public, including the facts of the case, the claims made, references to legal statutes, references to precedents, general case outcomes and corresponding labels, detailed order and remedies and reasons for the decision. The second task is more focused, aimed at analysing three of those extracted features, namely facts, claims and outcomes, in order to facilitate the development of a tool capable of predicting the outcome of employment law disputes. Through our analysis, we demonstrate that LLMs like GPT-4 can obtain high accuracy in legal information extraction, highlighting the potential of LLMs in revolutionising the way legal information is processed and utilised, offering significant implications for legal research and practice.
摘要：法庭笔录和判决书是丰富的法律知识宝库，详细介绍了案件的复杂性和司法判决背后的理由。从这些文件中提取关键信息可以提供案件的简明概述，这对法律专家和公众都至关重要。随着大型语言模型（LLM）的出现，自动信息提取变得越来越可行和高效。本文对大型语言模型 GPT-4 在英国就业法庭 (UKET) 案件中自动信息提取的应用进行了全面研究。我们通过手动验证过程仔细评估了 GPT-4 在提取关键信息方面的性能，以确保提取数据的准确性和相关性。我们的研究围绕两个主要提取任务进行：第一个涉及对法律专家和公众都具有重要意义的八个关键方面的一般提取，包括案件事实、提出的主张、对法律法规的引用、对先例、一般案件结果和相应标签、详细顺序和补救措施以及决定的理由。第二项任务更有针对性，旨在分析其中三个提取的特征，即事实、主张和结果，以促进开发能够预测就业法纠纷结果的工具。通过我们的分析，我们证明像 GPT-4 这样的法学硕士可以在法律信息提取方面获得高精度，凸显了法学硕士在彻底改变法律信息处理和使用方式方面的潜力，为法律研究和实践提供了重大影响。

Title: Dated Data: Tracing Knowledge Cutoffs in Large Language Models

Authors: Jeffrey Cheng, Marc Marone, Orion Weller, Dawn Lawrie, Daniel Khashabi, Benjamin Van Durme
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.12958
Pdf URL: https://arxiv.org/pdf/2403.12958
Copy Paste: [[2403.12958]] Dated Data: Tracing Knowledge Cutoffs in Large Language Models(https://arxiv.org/abs/2403.12958)
Keywords: language model, llm
Abstract: Released Large Language Models (LLMs) are often paired with a claimed knowledge cutoff date, or the dates at which training data was gathered. Such information is crucial for applications where the LLM must provide up-to-date information. However, this statement only scratches the surface: do all resources in the training data share the same knowledge cutoff date? Does the model's demonstrated knowledge for these subsets closely align to their cutoff dates? In this work, we define the notion of an effective cutoff. This is distinct from the cutoff reported by the LLM designer and applies separately to sub-resources and topics. We propose a simple approach to estimate effective cutoffs on the resource-level temporal alignment of an LLM by probing across versions of the data. Using this analysis, we find that effective cutoffs often differ from reported cutoffs. To understand the root cause of this observation, we conduct a direct large-scale analysis on open pre-training datasets. Our analysis reveals two reasons for these inconsistencies: (1) temporal biases of CommonCrawl data due to non-trivial amounts of old data in new dumps and (2) complications in LLM deduplication schemes involving semantic duplicates and lexical near-duplicates. Overall, our results show that knowledge cutoffs are not as simple as they have seemed and that care must be taken both by LLM dataset curators and by practitioners who seek to use information from these models.
摘要：发布的大型语言模型 (LLM) 通常与声称的知识截止日期或收集训练数据的日期配对。此类信息对于法学硕士必须提供最新信息的申请至关重要。然而，这种说法只触及表面：训练数据中的所有资源是否共享相同的知识截止日期？模型对这些子集的展示知识是否与它们的截止日期紧密一致？在这项工作中，我们定义了有效截止的概念。这与法学硕士设计师报告的截止值不同，并分别适用于子资源和主题。我们提出了一种简单的方法，通过跨版本的数据探测来估计法学硕士资源级时间对齐的有效截止值。通过此分析，我们发现有效截止值通常与报告的截止值不同。为了了解这一观察结果的根本原因，我们对开放的预训练数据集进行了直接的大规模分析。我们的分析揭示了造成这些不一致的两个原因：（1）由于新转储中大量旧数据而导致 CommonCrawl 数据的时间偏差；（2）LLM 重复数据删除方案中涉及语义重复和词汇近似重复的复杂性。总的来说，我们的结果表明，知识界限并不像看起来那么简单，法学硕士数据集管理者以及寻求使用这些模型信息的从业者都必须小心谨慎。

Title: LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression

Authors: Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin, Victor Rühle, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, Dongmei Zhang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2403.12968
Pdf URL: https://arxiv.org/pdf/2403.12968
Copy Paste: [[2403.12968]] LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression(https://arxiv.org/abs/2403.12968)
Keywords: language model, llm, prompt
Abstract: This paper focuses on task-agnostic prompt compression for better generalizability and efficiency. Considering the redundancy in natural language, existing approaches compress prompts by removing tokens or lexical units according to their information entropy obtained from a causal language model such as LLaMa-7B. The challenge is that information entropy may be a suboptimal compression metric: (i) it only leverages unidirectional context and may fail to capture all essential information needed for prompt compression; (ii) it is not aligned with the prompt compression objective. To address these issues, we propose a data distillation procedure to derive knowledge from an LLM to compress prompts without losing crucial information, and, in the meantime, introduce an extractive text compression dataset. We formulate prompt compression as a token classification problem to guarantee the faithfulness of the compressed prompt to the original one, and use a Transformer encoder as the base architecture to capture all essential information for prompt compression from the full bidirectional context. Our approach leads to lower latency by explicitly learning the compression objective with smaller models such as XLM-RoBERTa-large and mBERT. We evaluate our method on both in-domain and out-of-domain datasets, including MeetingBank, LongBench, ZeroScrolls, GSM8K, and BBH. Despite its small size, our model shows significant performance gains over strong baselines and demonstrates robust generalization ability across different LLMs. Additionally, our model is 3x-6x faster than existing prompt compression methods, while accelerating the end-to-end latency by 1.6x-2.9x with compression ratios of 2x-5x.
摘要：本文重点关注与任务无关的提示压缩，以提高通用性和效率。考虑到自然语言中的冗余，现有方法通过根据从因果语言模型（例如LLaMa-7B）获得的信息熵删除标记或词汇单元来压缩提示。挑战在于信息熵可能是次优压缩指标：（i）它仅利用单向上下文，可能无法捕获即时压缩所需的所有基本信息； (ii) 它与即时压缩目标不一致。为了解决这些问题，我们提出了一种数据蒸馏程序，从法学硕士中获取知识来压缩提示，而不会丢失关键信息，同时引入提取文本压缩数据集。我们将提示压缩制定为令牌分类问题，以保证压缩提示与原始提示的忠实度，并使用 Transformer 编码器作为基础架构，从完整的双向上下文中捕获提示压缩的所有基本信息。我们的方法通过使用较小的模型（例如 XLM-RoBERTa-large 和 mBERT）显式学习压缩目标来降低延迟。我们在域内和域外数据集（包括 MeetingBank、LongBench、ZeroScrolls、GSM8K 和 BBH）上评估我们的方法。尽管规模很小，但我们的模型在强大的基线上显示出显着的性能提升，并在不同的法学硕士中展示了强大的泛化能力。此外，我们的模型比现有的即时压缩方法快 3 倍到 6 倍，同时将端到端延迟加快 1.6 倍到 2.9 倍，压缩率为 2 倍到 5 倍。
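
Illustrative sketch (not from the paper): the abstract frames prompt compression as token classification, where each token is either kept or dropped so that the compressed prompt contains only original tokens in their original order. The snippet below assumes per-token "keep" probabilities are already available from some classifier; the function name compress_prompt, the keep_ratio parameter, and the top-k selection rule are hypothetical choices made for illustration.

```python
import math


def compress_prompt(tokens, keep_probs, keep_ratio=0.5):
    """Keep the highest-scoring tokens while preserving original order.

    tokens:     list of token strings from the original prompt
    keep_probs: hypothetical per-token probabilities of the "keep" class
    keep_ratio: fraction of tokens to retain (assumed knob, not from the paper)
    """
    if len(tokens) != len(keep_probs):
        raise ValueError("tokens and keep_probs must have the same length")
    if not tokens:
        return []
    n_keep = max(1, math.ceil(len(tokens) * keep_ratio))
    # Rank token positions by keep probability, highest first.
    ranked = sorted(range(len(tokens)), key=lambda i: keep_probs[i], reverse=True)
    # Restore original order so the output is a subsequence of the input.
    kept = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept]


tokens = ["please", "summarize", "the", "meeting", "notes", "now"]
probs = [0.2, 0.9, 0.1, 0.8, 0.7, 0.3]
compressed = compress_prompt(tokens, probs, keep_ratio=0.5)
```

Because the output is always a subsequence of the input, no token is invented or reordered, which is the sense in which an extractive formulation can stay faithful to the original prompt.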