2024-08-20

Title: See What LLMs Cannot Answer: A Self-Challenge Framework for Uncovering LLM Weaknesses

Authors: Yulong Chen, Yang Liu, Jianhao Yan, Xuefeng Bai, Ming Zhong, Yinghao Yang, Ziyi Yang, Chenguang Zhu, Yue Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.08978
Pdf URL: https://arxiv.org/pdf/2408.08978
Copy Paste: [[2408.08978]] See What LLMs Cannot Answer: A Self-Challenge Framework for Uncovering LLM Weaknesses(https://arxiv.org/abs/2408.08978)
Keywords: language model, gpt, llm, prompt
Abstract: The impressive performance of Large Language Models (LLMs) has consistently surpassed numerous human-designed benchmarks, presenting new challenges in assessing the shortcomings of LLMs. Designing tasks and finding LLMs' limitations are becoming increasingly important. In this paper, we investigate the question of whether an LLM can discover its own limitations from the errors it makes. To this end, we propose a Self-Challenge evaluation framework with human-in-the-loop. Starting from seed instances that GPT-4 fails to answer, we prompt GPT-4 to summarize error patterns that can be used to generate new instances and incorporate human feedback on them to refine these patterns for generating more challenging data, iteratively. We end up with 8 diverse patterns, such as text manipulation and questions with assumptions. We then build a benchmark, SC-G4, consisting of 1,835 instances generated by GPT-4 using these patterns, with human-annotated gold responses. SC-G4 serves as a challenging benchmark that allows for a detailed assessment of LLMs' abilities. Our results show that only 44.96% of instances in SC-G4 can be answered correctly by GPT-4. Interestingly, our pilot study indicates that these error patterns also challenge other LLMs, such as Claude-3 and Llama-3, and cannot be fully resolved through fine-tuning. Our work takes the first step to demonstrate that LLMs can autonomously identify their inherent flaws and provide insights for future dynamic and automatic evaluation.
摘要：大型语言模型 (LLM) 的出色表现一直超越众多人工设计的基准，为评估 LLM 的缺点带来了新的挑战。设计任务和发现 LLM 的局限性变得越来越重要。在本文中，我们研究了 LLM 是否能从其所犯的错误中发现自身的局限性。为此，我们提出了一个人在回路 (human-in-the-loop) 的自我挑战评估框架。从 GPT-4 无法回答的种子实例开始，我们提示 GPT-4 总结可用于生成新实例的错误模式，并结合人工反馈来改进这些模式，以迭代方式生成更具挑战性的数据。我们最终得到了 8 种不同的模式，例如文本操作和带有假设的问题。然后，我们构建了一个基准 SC-G4，它由 GPT-4 使用这些模式生成的 1,835 个实例组成，并带有人工注释的黄金响应。SC-G4 是一个具有挑战性的基准，可以对 LLM 的能力进行详细评估。我们的结果表明，GPT-4 只能正确回答 SC-G4 中 44.96% 的实例。有趣的是，我们的初步研究表明，这些错误模式也对其他 LLM（例如 Claude-3 和 Llama-3）构成挑战，并且无法通过微调完全解决。我们的工作迈出了第一步，证明了 LLM 可以自主识别其固有缺陷，并为未来的动态和自动评估提供见解。

Title: Language Models Show Stable Value Orientations Across Diverse Role-Plays

Authors: Bruce W. Lee, Yeongheon Lee, Hyunsoo Cho
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2408.09049
Pdf URL: https://arxiv.org/pdf/2408.09049
Copy Paste: [[2408.09049]] Language Models Show Stable Value Orientations Across Diverse Role-Plays(https://arxiv.org/abs/2408.09049)
Keywords: language model, llm, prompt
Abstract: We demonstrate that large language models (LLMs) exhibit consistent value orientations despite adopting diverse personas, revealing a persistent inertia in their responses that remains stable across the variety of roles they are prompted to assume. To systematically explore this phenomenon, we introduce the role-play-at-scale methodology, which involves prompting LLMs with randomized, diverse personas and analyzing the macroscopic trend of their responses. Unlike previous works that simply feed these questions to LLMs as if testing human subjects, our role-play-at-scale methodology diagnoses inherent tendencies in a systematic and scalable manner by: (1) prompting the model to act in different random personas and (2) asking the same question multiple times for each random persona. This approach reveals consistent patterns in LLM responses across diverse role-play scenarios, indicating deeply encoded inherent tendencies. Our findings contribute to the discourse on value alignment in foundation models and demonstrate the efficacy of role-play-at-scale as a diagnostic tool for uncovering encoded biases in LLMs.
摘要：我们证明，大型语言模型 (LLM) 尽管采用了不同的角色，但仍表现出一致的价值取向，这表明它们的反应具有持久的惯性，这种惯性在它们被要求承担的各种角色中都保持稳定。为了系统地探索这一现象，我们引入了大规模角色扮演方法，该方法涉及用随机、多样化的角色提示 LLM，并分析其反应的宏观趋势。与之前的研究不同，这些研究只是将这些问题提供给 LLM，就像测试人类受试者一样，我们的大规模角色扮演方法通过以下方式以系统和可扩展的方式诊断固有倾向：(1) 提示模型以不同的随机角色行事，(2) 对每个随机角色多次提出相同的问题。这种方法揭示了 LLM 在不同角色扮演场景中的反应的一致模式，表明了深深编码的固有倾向。我们的研究结果有助于讨论基础模型中的价值观一致性，并证明了大规模角色扮演作为发现 LLM 中编码偏见的诊断工具的有效性。

Title: CodeTaxo: Enhancing Taxonomy Expansion with Limited Examples via Code Language Prompts

Authors: Qingkai Zeng, Yuyang Bai, Zhaoxuan Tan, Zhenyu Wu, Shangbin Feng, Meng Jiang
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2408.09070
Pdf URL: https://arxiv.org/pdf/2408.09070
Copy Paste: [[2408.09070]] CodeTaxo: Enhancing Taxonomy Expansion with Limited Examples via Code Language Prompts(https://arxiv.org/abs/2408.09070)
Keywords: language model, prompt
Abstract: Taxonomies play a crucial role in various applications by providing a structural representation of knowledge. The task of taxonomy expansion involves integrating emerging concepts into existing taxonomies by identifying appropriate parent concepts for these new query concepts. Previous approaches typically relied on self-supervised methods that generate annotation data from existing taxonomies. However, these methods are less effective when the existing taxonomy is small (fewer than 100 entities). In this work, we introduce CodeTaxo, a novel approach that leverages large language models through code language prompts to capture the taxonomic structure. Extensive experiments on five real-world benchmarks from different domains demonstrate that CodeTaxo consistently achieves superior performance across all evaluation metrics, significantly outperforming previous state-of-the-art methods. The code and data are available at this https URL.
摘要：分类法通过提供知识的结构化表示，在各种应用中发挥着至关重要的作用。分类法扩展的任务包括通过为这些新查询概念确定适当的父概念，将新兴概念集成到现有分类法中。以前的方法通常依赖于从现有分类法生成注释数据的自监督方法。但是，当现有分类法较小（少于 100 个实体）时，这些方法效果较差。在这项工作中，我们引入了 CodeTaxo，这是一种新颖的方法，它通过代码语言提示利用大型语言模型来捕获分类结构。对来自不同领域的五个真实基准进行的大量实验表明，CodeTaxo 在所有评估指标上始终保持卓越性能，明显优于以前最先进的方法。代码和数据可在此 https URL 上找到。

Title: CogLM: Tracking Cognitive Development of Large Language Models

Authors: Xinglin Wang, Peiwen Yuan, Shaoxiong Feng, Yiwei Li, Boyuan Pan, Heda Wang, Yao Hu, Kan Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.09150
Pdf URL: https://arxiv.org/pdf/2408.09150
Copy Paste: [[2408.09150]] CogLM: Tracking Cognitive Development of Large Language Models(https://arxiv.org/abs/2408.09150)
Keywords: language model, gpt, llm
Abstract: Piaget's Theory of Cognitive Development (PTC) posits that the development of cognitive levels forms the foundation for human learning across various abilities. As Large Language Models (LLMs) have recently shown remarkable abilities across a wide variety of tasks, we are curious about the cognitive levels of current LLMs: to what extent they have developed and how this development has been achieved. To this end, we construct a benchmark CogLM (Cognitive Ability Evaluation for Language Model) based on PTC to assess the cognitive levels of LLMs. CogLM comprises 1,220 questions spanning 10 cognitive abilities crafted by more than 20 human experts, providing a comprehensive testbed for the cognitive levels of LLMs. Through extensive experiments across multiple mainstream LLMs with CogLM, we find that: (1) Human-like cognitive abilities have emerged in advanced LLMs (GPT-4), comparable to those of a 20-year-old human. (2) The parameter size and optimization objective are two key factors affecting the cognitive levels of LLMs. (3) The performance on downstream tasks is positively correlated with the level of cognitive abilities. These findings fill the gap in research on the cognitive abilities of LLMs, tracing the development of LLMs from a cognitive perspective and guiding the future direction of their evolution.
摘要：皮亚杰的认知发展理论 (PTC) 认为，认知水平的发展构成了人类学习各种能力的基础。由于大型语言模型 (LLM) 最近在各种任务中都表现出非凡的能力，我们对当前 LLM 的认知水平感到好奇：它们发展到了什么程度，以及这种发展是如何实现的。为此，我们基于 PTC 构建了一个基准 CogLM（语言模型认知能力评估）来评估 LLM 的认知水平。CogLM 包含 1,220 个问题，涵盖 10 种认知能力，由 20 多位人类专家精心设计，为 LLM 的认知水平提供了全面的测试平台。通过在多个主流 LLM 上使用 CogLM 进行大量实验，我们发现：（1）高级 LLM（GPT-4）已经出现了类似人类的认知能力，可与 20 岁人类的认知能力相媲美。（2）参数大小和优化目标是影响 LLM 认知水平的两个关键因素。（3）下游任务的表现与认知能力水平呈正相关。这些发现填补了 LLM 认知能力研究的空白，从认知视角追溯 LLM 的发展历程，并指导其未来的演进方向。

Title: TableBench: A Comprehensive and Complex Benchmark for Table Question Answering

Authors: Xianjie Wu, Jian Yang, Linzheng Chai, Ge Zhang, Jiaheng Liu, Xinrun Du, Di Liang, Daixin Shu, Xianfu Cheng, Tianzhen Sun, Guanglin Niu, Tongliang Li, Zhoujun Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.09174
Pdf URL: https://arxiv.org/pdf/2408.09174
Copy Paste: [[2408.09174]] TableBench: A Comprehensive and Complex Benchmark for Table Question Answering(https://arxiv.org/abs/2408.09174)
Keywords: language model, gpt, llm
Abstract: Recent advancements in Large Language Models (LLMs) have markedly enhanced the interpretation and processing of tabular data, introducing previously unimaginable capabilities. Despite these achievements, LLMs still encounter significant challenges when applied in industrial scenarios, particularly due to the increased complexity of reasoning required with real-world tabular data, underscoring a notable disparity between academic benchmarks and practical applications. To address this discrepancy, we conduct a detailed investigation into the application of tabular data in industrial scenarios and propose a comprehensive and complex benchmark TableBench, including 18 fields within four major categories of table question answering (TableQA) capabilities. Furthermore, we introduce TableLLM, trained on our meticulously constructed training set TableInstruct, achieving comparable performance with GPT-3.5. Massive experiments conducted on TableBench indicate that both open-source and proprietary LLMs still have significant room for improvement to meet real-world demands, where the most advanced model, GPT-4, achieves only a modest score compared to humans.
摘要：大型语言模型 (LLM) 的最新进展显著增强了表格数据的解释和处理能力，带来了以前难以想象的能力。尽管取得了这些成就，但 LLM 在应用于工业场景时仍然面临重大挑战，特别是由于现实世界表格数据所需的推理复杂性增加，凸显了学术基准与实际应用之间的明显差异。为了解决这一差异，我们对表格数据在工业场景中的应用进行了详细调查，并提出了一个全面而复杂的基准 TableBench，包括四大类表格问答 (TableQA) 功能中的 18 个字段。此外，我们引入了 TableLLM，在我们精心构建的训练集 TableInstruct 上进行训练，实现了与 GPT-3.5 相当的性能。在 TableBench 上进行的大量实验表明，开源和专有 LLM 仍有很大改进空间以满足现实世界的需求，其中最先进的模型 GPT-4 与人类相比仅取得了适中的分数。

Title: Chinese Metaphor Recognition Using a Multi-stage Prompting Large Language Model

Authors: Jie Wang, Jin Wang, Xuejie Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.09177
Pdf URL: https://arxiv.org/pdf/2408.09177
Copy Paste: [[2408.09177]] Chinese Metaphor Recognition Using a Multi-stage Prompting Large Language Model(https://arxiv.org/abs/2408.09177)
Keywords: language model, llm, prompt
Abstract: Metaphors are common in everyday language, and models that identify and understand metaphors can achieve a better understanding of the text. Metaphors are mainly identified and generated by pre-trained models in existing research, but situations where tenors or vehicles are not included in the metaphor cannot be handled. The problem can be effectively solved by using Large Language Models (LLMs), but significant room for exploration remains in this early-stage research area. A multi-stage generative heuristic-enhanced prompt framework is proposed in this study to enhance the ability of LLMs to recognize tenors, vehicles, and grounds in Chinese metaphors. In the first stage, a small model is trained to obtain the required confidence score for answer candidate generation. In the second stage, questions are clustered and sampled according to specific rules. Finally, the heuristic-enhanced prompt needed is formed by combining the generated answer candidates and demonstrations. The proposed model achieved 3rd place in Track 1 of Subtask 1, 1st place in Track 2 of Subtask 1, and 1st place in both tracks of Subtask 2 at the NLPCC-2024 Shared Task 9.
摘要：隐喻是日常语言中常见的一种语言，通过模型来识别和理解隐喻，从而更好地理解文本。现有研究主要通过预训练模型来识别和生成隐喻，但无法处理隐喻中不包含基调或喻体的情形。该问题可以通过使用大型语言模型（LLM）得到有效解决，但在这一早期研究领域仍有很大探索空间。本研究提出了一种多阶段的生成式启发式增强提示框架，以增强LLM对汉语隐喻中基调、喻体和喻因的识别能力。在第一阶段，训练一个小模型以获得生成答案候选所需的置信度分数。在第二阶段，按照特定规则对问题进行聚类和抽样。最后，通过结合生成的答案候选和演示形成所需的启发式增强提示。所提出的模型在 NLPCC-2024 共享任务 9 中在子任务 1 的 Track 1 中取得了第 3 名，在子任务 1 的 Track 2 中取得了第 1 名，在子任务 2 的两个 Track 中均取得了第 1 名。

Title: Architectural Foundations and Strategic Considerations for the Large Language Model Infrastructures

Authors: Hongyin Zhu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.09205
Pdf URL: https://arxiv.org/pdf/2408.09205
Copy Paste: [[2408.09205]] Architectural Foundations and Strategic Considerations for the Large Language Model Infrastructures(https://arxiv.org/abs/2408.09205)
Keywords: language model, llm
Abstract: The development of a large language model (LLM) infrastructure is a pivotal undertaking in artificial intelligence. This paper explores the intricate landscape of LLM infrastructure, software, and data management. By analyzing these core components, we emphasize the pivotal considerations and safeguards crucial for successful LLM development. This work presents a concise synthesis of the challenges and strategies inherent in constructing a robust and effective LLM infrastructure, offering valuable insights for researchers and practitioners alike.
摘要：大型语言模型 (LLM) 基础设施的开发是人工智能领域的一项关键任务。本文探讨了 LLM 基础设施、软件和数据管理的复杂情况。通过分析这些核心组件，我们强调了成功开发 LLM 的关键考虑因素和保障措施。这项工作简要总结了构建强大而有效的 LLM 基础设施所固有的挑战和策略，为研究人员和从业者提供了宝贵的见解。

Title: Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text

Authors: Sher Badshah, Hassan Sajjad
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.09235
Pdf URL: https://arxiv.org/pdf/2408.09235
Copy Paste: [[2408.09235]] Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text(https://arxiv.org/abs/2408.09235)
Keywords: language model, llm
Abstract: The rapid advancements in Large Language Models (LLMs) have highlighted the critical need for robust evaluation methods that can accurately assess the quality of generated text, particularly in free-form tasks. Traditional metrics like BLEU and ROUGE, while useful, often fail to capture the semantic richness and contextual relevance of free-form text compared to reference answers. In this study, we introduce a reference-guided verdict method that leverages multiple LLMs-as-judges to provide a more reliable and accurate evaluation of open-ended LLM generations. By integrating diverse LLMs, our approach mitigates individual model biases and significantly improves alignment with human judgments, especially in challenging tasks where traditional metrics and single-model evaluations fall short. Through experiments across multiple question-answering tasks, we show that our method closely aligns with human evaluations, establishing it as a scalable, reproducible, and effective alternative to human evaluation. Our approach not only enhances evaluation reliability but also opens new avenues for refining automated assessment in generative AI.
摘要：大型语言模型 (LLM) 的快速发展凸显了对能够准确评估生成文本质量的稳健评估方法的迫切需求，尤其是在自由格式任务中。BLEU 和 ROUGE 等传统指标虽然有用，但与参考答案相比，它们往往无法捕捉自由格式文本的语义丰富性和上下文相关性。在本研究中，我们引入了一种参考引导的裁决方法，该方法利用多个 LLM 作为评判者，为开放式 LLM 生成提供更可靠、更准确的评估。通过整合不同的 LLM，我们的方法可以减轻单个模型的偏差，并显著提高与人类判断的一致性，尤其是在传统指标和单一模型评估不足的具有挑战性的任务中。通过对多个问答任务的实验，我们表明我们的方法与人工评估紧密结合，使其成为一种可扩展、可重复且有效的人工评估替代方案。我们的方法不仅提高了评估可靠性，还为改进生成式 AI 中的自动评估开辟了新途径。
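
Example: The abstract above says that verdicts from multiple LLM judges are combined but does not state the combination rule. A minimal sketch, assuming a simple majority vote over per-judge verdicts (the function name `aggregate_verdicts` and the tie handling are hypothetical, not taken from the paper):

```python
from collections import Counter


def aggregate_verdicts(verdicts):
    """Return the majority verdict among judges, or None on a tie.

    Assumption: each judge has already compared a candidate answer
    with the reference answer and returned a hashable verdict
    (for example True for "correct" and False for "incorrect").
    """
    if not verdicts:
        raise ValueError("at least one verdict is required")
    ranked = Counter(verdicts).most_common()
    # A tie between the two most frequent verdicts gives no majority.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]
```

Using an odd number of judges avoids ties for binary verdicts; how the paper itself resolves disagreement is not stated in the abstract.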

Title: ConVerSum: A Contrastive Learning based Approach for Data-Scarce Solution of Cross-Lingual Summarization Beyond Direct Equivalents

Authors: Sanzana Karim Lora, Rifat Shahriyar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.09273
Pdf URL: https://arxiv.org/pdf/2408.09273
Copy Paste: [[2408.09273]] ConVerSum: A Contrastive Learning based Approach for Data-Scarce Solution of Cross-Lingual Summarization Beyond Direct Equivalents(https://arxiv.org/abs/2408.09273)
Keywords: language model, gpt, llm
Abstract: Cross-lingual summarization (CLS) is a sophisticated branch of Natural Language Processing that requires models to accurately translate and summarize articles from different source languages. Despite improvements in subsequent studies, this area still needs data-efficient solutions along with effective training methodologies. To the best of our knowledge, there is no feasible solution for CLS when there is no available high-quality CLS data. In this paper, we propose a novel data-efficient approach, ConVerSum, for CLS leveraging the power of contrastive learning, generating versatile candidate summaries in different languages based on the given source document and contrasting these summaries with reference summaries concerning the given documents. After that, we train the model with a contrastive ranking loss. Then, we rigorously evaluate the proposed approach against current methodologies and compare it to powerful Large Language Models (LLMs), namely Gemini, GPT 3.5, and GPT 4, showing that our model performs better for CLS in low-resource languages. These findings represent a substantial improvement in the area, opening the door to more efficient and accurate cross-lingual summarizing techniques.
摘要：跨语言摘要 (CLS) 是自然语言处理中的一个复杂分支，它要求模型能够准确地翻译和总结来自不同源语言的文章。尽管后续研究有所改进，但该领域仍然需要数据高效的解决方案以及有效的训练方法。据我们所知，如果没有可用的高质量 CLS 数据，则 CLS 没有可行的解决方案。在本文中，我们为 CLS 提出了一种新颖的数据高效方法 ConVerSum，利用对比学习的力量，根据给定的源文档生成不同语言的多功能候选摘要，并将这些摘要与给定文档的参考摘要进行对比。之后，我们使用对比排名损失来训练模型。然后，我们根据当前方法严格评估所提出的方法，并将其与强大的大型语言模型 (LLM) - Gemini、GPT 3.5 和 GPT 4 进行比较，证明我们的模型在低资源语言的 CLS 方面表现更好。这些发现代表了该领域的重大进步，为更高效、更准确的跨语言总结技术打开了大门。

Title: CyberPal.AI: Empowering LLMs with Expert-Driven Cybersecurity Instructions

Authors: Matan Levi, Yair Alluouche, Daniel Ohayon, Anton Puzanov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.09304
Pdf URL: https://arxiv.org/pdf/2408.09304
Copy Paste: [[2408.09304]] CyberPal.AI: Empowering LLMs with Expert-Driven Cybersecurity Instructions(https://arxiv.org/abs/2408.09304)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have significantly advanced natural language processing (NLP), providing versatile capabilities across various applications. However, their application to complex, domain-specific tasks, such as cyber-security, often faces substantial challenges. In this study, we introduce SecKnowledge and CyberPal.AI to address these challenges and train security-expert LLMs. SecKnowledge is a domain-knowledge-driven cyber-security instruction dataset, meticulously designed using years of accumulated expert knowledge in the domain through a multi-phase generation process. CyberPal.AI refers to a family of LLMs fine-tuned using SecKnowledge, aimed at building security-specialized LLMs capable of answering and following complex security-related instructions. Additionally, we introduce SecKnowledge-Eval, a comprehensive and diverse cyber-security evaluation benchmark, composed of an extensive set of cyber-security tasks we specifically developed to assess LLMs in the field of cyber-security, along with other publicly available security benchmarks. Our results show a significant average improvement of up to 24% over the baseline models, underscoring the benefits of our expert-driven instruction dataset generation process. These findings contribute to the advancement of AI-based cyber-security applications, paving the way for security-expert LLMs that can enhance threat-hunting and investigation processes.
摘要：大型语言模型 (LLM) 显著提高了自然语言处理 (NLP) 的性能，为各种应用提供了多功能功能。然而，将它们应用于复杂的特定领域任务（例如网络安全）往往面临巨大挑战。在本研究中，我们引入了 SecKnowledge 和 CyberPal.AI 来应对这些挑战并训练安全专家 LLM。SecKnowledge 是一个领域知识驱动的网络安全指令数据集，通过多阶段生成过程，使用多年积累的领域专家知识精心设计而成。CyberPal.AI 指的是使用 SecKnowledge 微调的 LLM 系列，旨在构建能够回答和遵循复杂安全相关指令的安全专用 LLM。此外，我们还推出了 SecKnowledge-Eval，这是一个全面而多样的网络安全评估基准，由我们专门为评估网络安全领域的 LLM 而开发的一套广泛的网络安全任务以及其他公开可用的安全基准组成。我们的结果表明，与基线模型相比，平均提升幅度高达 24%，凸显了我们专家驱动的指令数据集生成流程的优势。这些发现有助于推动基于 AI 的网络安全应用的发展，为能够增强威胁搜寻和调查流程的安全专家 LLM 铺平了道路。

Title: Characterizing and Evaluating the Reliability of LLMs against Jailbreak Attacks

Authors: Kexin Chen, Yi Liu, Dongxia Wang, Jiaying Chen, Wenhai Wang
Subjects: cs.CL, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2408.09326
Pdf URL: https://arxiv.org/pdf/2408.09326
Copy Paste: [[2408.09326]] Characterizing and Evaluating the Reliability of LLMs against Jailbreak Attacks(https://arxiv.org/abs/2408.09326)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have increasingly become pivotal in content generation with notable societal impact. These models hold the potential to generate content that could be deemed harmful. Efforts to mitigate this risk include implementing safeguards to ensure LLMs adhere to social ethics. However, despite such measures, the phenomenon of "jailbreaking" -- where carefully crafted prompts elicit harmful responses from models -- persists as a significant challenge. Recognizing the continuous threat posed by jailbreaking tactics and their repercussions for the trustworthy use of LLMs, a rigorous assessment of the models' robustness against such attacks is essential. This study introduces a comprehensive evaluation framework and conducts a large-scale empirical experiment to address this need. We concentrate on 10 cutting-edge jailbreak strategies across three categories, 1525 questions from 61 specific harmful categories, and 13 popular LLMs. We adopt multi-dimensional metrics such as Attack Success Rate (ASR), Toxicity Score, Fluency, Token Length, and Grammatical Errors to thoroughly assess the LLMs' outputs under jailbreak. By normalizing and aggregating these metrics, we present a detailed reliability score for different LLMs, coupled with strategic recommendations to reduce their susceptibility to such vulnerabilities. Additionally, we explore the relationships among the models, attack strategies, and types of harmful content, as well as the correlations between the evaluation metrics, which proves the validity of our multifaceted evaluation framework. Our extensive experimental results demonstrate a lack of resilience among all tested LLMs against certain strategies, and highlight the need to concentrate on the reliability facets of LLMs. We believe our study can provide valuable insights into enhancing the security evaluation of LLMs against jailbreak within the domain.
摘要：大型语言模型 (LLM) 在内容生成中越来越重要，具有显著的社会影响。这些模型有可能生成可能被视为有害的内容。为降低这种风险，人们采取了各种措施，包括实施保障措施，确保 LLM 遵守社会道德规范。然而，尽管采取了这些措施，“越狱”现象（即精心设计的提示引发模型的有害反应）仍然是一项重大挑战。认识到越狱策略带来的持续威胁及其对 LLM 可信使用的影响，对模型抵御此类攻击的稳健性进行严格评估至关重要。本研究引入了一个全面的评估框架，并进行了一项大规模的实证实验来满足这一需求。我们专注于三个类别的 10 种前沿越狱策略、61 个特定有害类别的 1525 个问题和 13 个流行的 LLM。我们采用攻击成功率 (ASR)、毒性评分、流畅度、令牌长度和语法错误等多维指标来全面评估越狱下的 LLM 输出。通过规范化和汇总这些指标，我们为不同的 LLM 提供了详细的可靠性分数，并提出了降低其对此类漏洞敏感性的战略建议。此外，我们还探讨了模型、攻击策略和有害内容类型之间的关系，以及评估指标之间的相关性，这证明了我们多方面评估框架的有效性。我们广泛的实验结果表明，所有测试的 LLM 对某些策略都缺乏弹性，并强调需要集中精力于 LLM 的可靠性方面。我们相信我们的研究可以为增强领域内 LLM 对越狱的安全性评估提供宝贵的见解。
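
Example: The abstract above mentions normalizing and aggregating several metrics into a reliability score without giving the formula. A minimal sketch, assuming min-max normalization across models followed by an unweighted mean (the function names, the `lower_is_better` parameter, and the equal weighting are all hypothetical, not the paper's method):

```python
def min_max(values):
    """Scale values to [0, 1]; a constant column maps to 0.5."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def reliability_scores(metrics, lower_is_better=()):
    """Combine per-model metric values into one score per model.

    metrics: dict mapping a metric name to a list with one value
             per model (same model order in every list).
    lower_is_better: metric names where a smaller raw value is
             better, such as an attack success rate.
    Assumption: every metric carries equal weight.
    """
    n_models = len(next(iter(metrics.values())))
    totals = [0.0] * n_models
    for name, values in metrics.items():
        normalized = min_max(values)
        if name in lower_is_better:
            # Flip so that 1.0 always means "most reliable".
            normalized = [1.0 - v for v in normalized]
        for i, v in enumerate(normalized):
            totals[i] += v
    return [t / len(metrics) for t in totals]
```

For two models with attack success rates 0.2 and 0.8 and toxicity scores 0.1 and 0.3, both marked as lower-is-better, the first model scores 1.0 and the second 0.0 under these assumptions.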

Title: Fostering Natural Conversation in Large Language Models with NICO: a Natural Interactive COnversation dataset

Authors: Renliang Sun, Mengyuan Liu, Shiping Yang, Rui Wang, Junqing He, Jiaxing Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.09330
Pdf URL: https://arxiv.org/pdf/2408.09330
Copy Paste: [[2408.09330]] Fostering Natural Conversation in Large Language Models with NICO: a Natural Interactive COnversation dataset(https://arxiv.org/abs/2408.09330)
Keywords: language model, gpt, llm, chat
Abstract: Benefiting from diverse instruction datasets, contemporary Large Language Models (LLMs) perform effectively as AI assistants in collaborating with humans. However, LLMs still struggle to generate natural and colloquial responses in real-world applications such as chatbots and psychological counseling that require more human-like interactions. To address these limitations, we introduce NICO, a Natural Interactive COnversation dataset in Chinese. We first use GPT-4-turbo to generate dialogue drafts and make them cover 20 daily-life topics and 5 types of social interactions. Then, we hire workers to revise these dialogues to ensure that they are free of grammatical errors and unnatural utterances. We define two dialogue-level natural conversation tasks and two sentence-level tasks for identifying and rewriting unnatural sentences. Multiple open-source and closed-source LLMs are tested and analyzed in detail. The experimental results highlight the challenge of the tasks and demonstrate how NICO can help foster the natural dialogue capabilities of LLMs. The dataset will be released.
摘要：得益于多样化的教学数据集，当代大型语言模型 (LLM) 在与人类协作方面表现得非常出色，可作为 AI 助手。然而，LLM 在聊天机器人和心理咨询等需要更多类似人类交互的现实应用中仍然难以生成自然和口语化的响应。为了解决这些限制，我们引入了 NICO，一个中文自然互动对话数据集。我们首先使用 GPT-4-turbo 生成对话草稿，并使其涵盖 20 个日常生活话题和 5 种社交互动类型。然后，我们雇佣工人修改这些对话，以确保它们没有语法错误和不自然的话语。我们定义了两个对话级自然对话任务和两个句子级任务来识别和重写不自然的句子。对多个开源和闭源 LLM 进行了详细测试和分析。实验结果突出了任务的挑战性，并展示了 NICO 如何帮助培养 LLM 的自然对话能力。该数据集即将发布。

Title: Improving and Assessing the Fidelity of Large Language Models Alignment to Online Communities

Authors: Minh Duc Chu, Zihao He, Rebecca Dorn, Kristina Lerman
Subjects: cs.CL, cs.CY, cs.SI
Abstract URL: https://arxiv.org/abs/2408.09366
Pdf URL: https://arxiv.org/pdf/2408.09366
Copy Paste: [[2408.09366]] Improving and Assessing the Fidelity of Large Language Models Alignment to Online Communities(https://arxiv.org/abs/2408.09366)
Keywords: language model, llm
Abstract: Large language models (LLMs) have shown promise in representing individuals and communities, offering new ways to study complex social dynamics. However, effectively aligning LLMs with specific human groups and systematically assessing the fidelity of the alignment remains a challenge. This paper presents a robust framework for aligning LLMs with online communities via instruction-tuning and comprehensively evaluating alignment across various aspects of language, including authenticity, emotional tone, toxicity, and harm. We demonstrate the utility of our approach by applying it to online communities centered on dieting and body image. We administer an eating disorder psychometric test to the aligned LLMs to reveal unhealthy beliefs and successfully differentiate communities with varying levels of eating disorder risk. Our results highlight the potential of LLMs in automated moderation and broader applications in public health and social science research.
摘要：大型语言模型 (LLM) 在表示个人和社区方面显示出良好的前景，为研究复杂的社会动态提供了新方法。然而，有效地将 LLM 与特定的人类群体对齐并系统地评估对齐的保真度仍然是一个挑战。本文提出了一个强大的框架，通过指令调整将 LLM 与在线社区对齐，并全面评估语言各个方面的对齐，包括真实性、情绪基调、毒性和伤害。我们通过将我们的方法应用于以节食和身体形象为中心的在线社区来证明其实用性。我们对对齐的 LLM 进行饮食失调心理测试，以揭示不健康的信念，并成功区分具有不同程度饮食失调风险的社区。我们的研究结果凸显了 LLM 在自动审核和公共卫生和社会科学研究领域的更广泛应用方面的潜力。

Title: Offline RLHF Methods Need More Accurate Supervision Signals

Authors: Shiqi Wang, Zhengze Zhang, Rui Zhao, Fei Tan, Cam Tu Nguyen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.09385
Pdf URL: https://arxiv.org/pdf/2408.09385
Copy Paste: [[2408.09385]] Offline RLHF Methods Need More Accurate Supervision Signals(https://arxiv.org/abs/2408.09385)
Keywords: language model, llm
Abstract: With the rapid advances in Large Language Models (LLMs), aligning LLMs with human preferences becomes increasingly important. Although Reinforcement Learning with Human Feedback (RLHF) proves effective, it is complicated and highly resource-intensive. As such, offline RLHF has been introduced as an alternative solution, which directly optimizes LLMs with ranking losses on a fixed preference dataset. Current offline RLHF only captures the "ordinal relationship" between responses, overlooking the crucial aspect of "how much" one is preferred over the others. To address this issue, we propose a simple yet effective solution called Reward Difference Optimization, shortened as RDO. Specifically, we introduce reward difference coefficients to reweigh sample pairs in offline RLHF. We then develop a difference model involving rich interactions between a pair of responses for predicting these difference coefficients. Experiments with 7B LLMs on the HH and TL;DR datasets substantiate the effectiveness of our method in both automatic metrics and human evaluation, thereby highlighting its potential for aligning LLMs with human intent and values.
摘要：随着大型语言模型 (LLM) 的快速发展，将 LLM 与人类偏好对齐变得越来越重要。尽管带人类反馈的强化学习 (RLHF) 被证明是有效的，但它非常复杂且资源密集。因此，离线 RLHF 已被引入作为替代解决方案，它在固定的偏好数据集上使用排名损失直接优化 LLM。当前的离线 RLHF 仅捕获响应之间的“序数关系”，而忽略了一个响应比其他响应“更受偏好多少”这一关键方面。为了解决这个问题，我们提出了一个简单而有效的解决方案，称为 Reward Difference Optimization，简称 RDO。具体来说，我们引入了奖励差异系数来重新加权离线 RLHF 中的样本对。然后，我们开发了一个差异模型，该模型涉及一对响应之间的丰富交互，用于预测这些差异系数。在 HH 和 TL;DR 数据集上使用 7B LLM 进行的实验证实了我们的方法在自动指标和人工评估方面的有效性，从而凸显了其将 LLM 与人类意图和价值观对齐的潜力。
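
Example: The abstract above describes reweighing preference pairs with reward difference coefficients but does not give the loss. A minimal sketch, assuming each coefficient simply scales a standard pairwise logistic ranking loss (the function names and this exact weighting form are hypothetical, not taken from the paper):

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def weighted_ranking_loss(margins, coefficients):
    """Mean of -c_i * log(sigmoid(m_i)) over preference pairs.

    margins: model score of the preferred response minus the score
             of the rejected response, one value per pair.
    coefficients: non-negative weights, one per pair; a larger value
             marks a pair whose preferred response is judged better
             by a wider gap (assumed to come from a separate
             difference model).
    """
    if len(margins) != len(coefficients):
        raise ValueError("margins and coefficients must align")
    if not margins:
        raise ValueError("at least one pair is required")
    total = sum(-c * log_sigmoid(m) for m, c in zip(margins, coefficients))
    return total / len(margins)
```

With all coefficients equal to 1 this reduces to the unweighted ranking loss, which only uses the ordering of each pair; unequal coefficients let pairs with a large preference gap contribute more to the gradient.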

Title: Challenges and Responses in the Practice of Large Language Models

Authors: Hongyin Zhu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.09416
Pdf URL: https://arxiv.org/pdf/2408.09416
Copy Paste: [[2408.09416]] Challenges and Responses in the Practice of Large Language Models(https://arxiv.org/abs/2408.09416)
Keywords: language model
Abstract: This paper carefully summarizes extensive and profound questions from all walks of life, focusing on the current high-profile AI field, covering multiple dimensions such as industry trends, academic research, technological innovation and business applications. This paper meticulously curates questions that are both thought-provoking and practically relevant, providing nuanced and insightful answers to each. To facilitate readers' understanding and reference, this paper specifically classifies and organizes these questions systematically and meticulously from the five core dimensions of computing power infrastructure, software architecture, data resources, application scenarios, and brain science. This work aims to provide readers with a comprehensive, in-depth and cutting-edge AI knowledge framework to help people from all walks of life grasp the pulse of AI development, stimulate innovative thinking, and promote industrial progress.
摘要：本文精心总结了各界广泛而深刻的问题，聚焦当前备受关注的人工智能领域，涵盖了行业趋势、学术研究、技术创新和商业应用等多个维度。本文精心挑选了既发人深省又具有实践意义的问题，并针对每个问题提供了细致入微的答案。为方便读者理解和参考，本文特意从算力基础设施、软件架构、数据资源、应用场景、脑科学五个核心维度对这些问题进行了系统细致的分类和整理。旨在为读者提供全面、深入、前沿的人工智能知识框架，帮助各界人士把握人工智能发展脉搏，激发创新思维，推动产业进步。

Title: FASST: Fast LLM-based Simultaneous Speech Translation

Authors: Siqi Ouyang, Xi Xu, Chinmay Dandekar, Lei Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.09430
Pdf URL: https://arxiv.org/pdf/2408.09430
Copy Paste: [[2408.09430]] FASST: Fast LLM-based Simultaneous Speech Translation(https://arxiv.org/abs/2408.09430)
Keywords: language model, llm
Abstract: Simultaneous speech translation (SST) takes streaming speech input and generates text translation on the fly. Existing methods either have high latency due to recomputation of input representations, or fall behind offline ST in translation quality. In this paper, we propose FASST, a fast large language model based method for streaming speech translation. We propose blockwise-causal speech encoding and consistency mask, so that streaming speech input can be encoded incrementally without recomputation. Furthermore, we develop a two-stage training strategy to optimize FASST for simultaneous inference. We evaluate FASST and multiple strong prior models on the MuST-C dataset. Experiment results show that FASST achieves the best quality-latency trade-off. It outperforms the previous best model by an average of 1.5 BLEU under the same latency for English to Spanish translation.
摘要：同步语音翻译 (SST) 接收流式语音输入并即时生成文本翻译。现有方法要么由于重新计算输入表示而导致高延迟，要么在翻译质量上落后于离线 ST。在本文中，我们提出了一种基于快速大型语言模型的流式语音翻译方法 FASST。我们提出了分块因果语音编码和一致性掩码，以便流式语音输入可以增量编码而无需重新计算。此外，我们开发了一种两阶段训练策略来优化 FASST 以实现同步推理。我们在 MuST-C 数据集上评估了 FASST 和多个强先验模型。实验结果表明，FASST 实现了最佳的质量-延迟权衡。在英语到西班牙语的翻译中，在相同的延迟下，它比之前的最佳模型平均高出 1.5 BLEU。

Title: HySem: A context length optimized LLM pipeline for unstructured tabular extraction

Authors: Narayanan PP, Anantharaman Palacode Narayana Iyer
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.09434
Pdf URL: https://arxiv.org/pdf/2408.09434
Copy Paste: [[2408.09434]] HySem: A context length optimized LLM pipeline for unstructured tabular extraction(https://arxiv.org/abs/2408.09434)
Keywords: language model, gpt, llm, agent
Abstract: Regulatory compliance reporting in the pharmaceutical industry relies on detailed tables, but these are often under-utilized beyond compliance due to their unstructured format and arbitrary content. Extracting and semantically representing tabular data is challenging due to diverse table presentations. Large Language Models (LLMs) demonstrate substantial potential for semantic representation, yet they encounter challenges related to accuracy and context size limitations, which are crucial considerations for the industry applications. We introduce HySem, a pipeline that employs a novel context length optimization technique to generate accurate semantic JSON representations from HTML tables. This approach utilizes a custom fine-tuned model specifically designed for cost- and privacy-sensitive small and medium pharmaceutical enterprises. Running on commodity hardware and leveraging open-source models, our auto-correcting agents rectify both syntax and semantic errors in LLM-generated content. HySem surpasses its peer open-source models in accuracy and provides competitive performance when benchmarked against OpenAI GPT-4o and effectively addresses context length limitations, which is a crucial factor for supporting larger tables.

Title: Identifying Speakers and Addressees of Quotations in Novels with Prompt Learning

Authors: Yuchen Yan, Hanjie Zhao, Senbin Zhu, Hongde Liu, Zhihong Zhang, Yuxiang Jia
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.09452
Pdf URL: https://arxiv.org/pdf/2408.09452
Copy Paste: [[2408.09452]] Identifying Speakers and Addressees of Quotations in Novels with Prompt Learning(https://arxiv.org/abs/2408.09452)
Keywords: language model, prompt
Abstract: Quotations in literary works, especially novels, are important to create characters, reflect character relationships, and drive plot development. Current research on quotation extraction in novels primarily focuses on quotation attribution, i.e., identifying the speaker of the quotation. However, the addressee of the quotation is also important to construct the relationship between the speaker and the addressee. To tackle the problem of dataset scarcity, we annotate the first Chinese quotation corpus with elements including speaker, addressee, speaking mode and linguistic cue. We propose prompt learning-based methods for speaker and addressee identification based on fine-tuned pre-trained models. Experiments on both Chinese and English datasets show the effectiveness of the proposed methods, which outperform methods based on zero-shot and few-shot large language models.

Title: WPN: An Unlearning Method Based on N-pair Contrastive Learning in Language Models

Authors: Guitao Chen, Yunshen Wang, Hongye Sun, Guang Chen
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2408.09459
Pdf URL: https://arxiv.org/pdf/2408.09459
Copy Paste: [[2408.09459]] WPN: An Unlearning Method Based on N-pair Contrastive Learning in Language Models(https://arxiv.org/abs/2408.09459)
Keywords: language model, gpt, prompt
Abstract: Generative language models (LMs) offer numerous advantages but may produce inappropriate or harmful outputs due to the harmful knowledge acquired during pre-training. This knowledge often manifests as undesirable correspondences, such as "harmful prompts" leading to "harmful outputs," which our research aims to mitigate through unlearning techniques. However, existing unlearning methods based on gradient ascent can significantly impair the performance of LMs. To address this issue, we propose a novel approach called Weighted Positional N-pair (WPN) Learning, which leverages position-weighted mean pooling within an n-pair contrastive learning framework. WPN is designed to modify the output distribution of LMs by eliminating specific harmful outputs (e.g., replacing toxic responses with neutral ones), thereby transforming the model's behavior from "harmful prompt-harmful output" to "harmful prompt-harmless response". Experiments on OPT and GPT-NEO LMs show that WPN effectively reduces the proportion of harmful responses, achieving a harmless rate of up to 95.8% while maintaining stable performance on nine common benchmarks (with less than 2% degradation on average). Moreover, we provide empirical evidence to demonstrate WPN's ability to weaken the harmful correspondences in terms of generalizability and robustness, as evaluated on out-of-distribution test sets and under adversarial attacks.
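
To make "position-weighted mean pooling" concrete, here is a small sketch that pools per-token hidden states into one vector while giving later positions more weight. The linear weighting (weight proportional to position index) is an assumption chosen for illustration; the abstract does not specify the weighting WPN uses.

```python
def position_weighted_mean_pool(hidden_states: list[list[float]]) -> list[float]:
    """Pool token vectors into one vector, weighting later tokens more.

    Assumption: token at 1-based position p gets weight p / sum(1..n).
    """
    if not hidden_states:
        raise ValueError("need at least one token vector")
    dim = len(hidden_states[0])
    weights = [pos + 1 for pos in range(len(hidden_states))]
    total = sum(weights)
    pooled = [0.0] * dim
    for weight, vector in zip(weights, hidden_states):
        for d in range(dim):
            pooled[d] += weight * vector[d] / total
    return pooled


pooled = position_weighted_mean_pool([[1.0, 0.0], [0.0, 1.0]])
print(pooled)
```

In a causal LM, later tokens have seen more context, which is the usual motivation for weighting them more heavily than a plain mean would.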

Title: PanoSent: A Panoptic Sextuple Extraction Benchmark for Multimodal Conversational Aspect-based Sentiment Analysis

Authors: Meng Luo, Hao Fei, Bobo Li, Shengqiong Wu, Qian Liu, Soujanya Poria, Erik Cambria, Mong-Li Lee, Wynne Hsu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.09481
Pdf URL: https://arxiv.org/pdf/2408.09481
Copy Paste: [[2408.09481]] PanoSent: A Panoptic Sextuple Extraction Benchmark for Multimodal Conversational Aspect-based Sentiment Analysis(https://arxiv.org/abs/2408.09481)
Keywords: language model
Abstract: While existing Aspect-based Sentiment Analysis (ABSA) has received extensive effort and advancement, there are still gaps in defining a more holistic research target seamlessly integrating multimodality, conversation context, fine-granularity, and also covering the changing sentiment dynamics as well as cognitive causal rationales. This paper bridges the gaps by introducing a multimodal conversational ABSA, where two novel subtasks are proposed: 1) Panoptic Sentiment Sextuple Extraction, panoramically recognizing holder, target, aspect, opinion, sentiment, rationale from multi-turn multi-party multimodal dialogue. 2) Sentiment Flipping Analysis, detecting the dynamic sentiment transformation throughout the conversation with the causal reasons. To benchmark the tasks, we construct PanoSent, a dataset annotated both manually and automatically, featuring high quality, large scale, multimodality, multilingualism, multi-scenarios, and covering both implicit and explicit sentiment elements. To effectively address the tasks, we devise a novel Chain-of-Sentiment reasoning framework, together with a novel multimodal large language model (namely Sentica) and a paraphrase-based verification mechanism. Extensive evaluations demonstrate the superiority of our methods over strong baselines, validating the efficacy of all our proposed methods. The work is expected to open up a new era for the ABSA community, and thus all our codes and data are open at this https URL

Title: REFINE-LM: Mitigating Language Model Stereotypes via Reinforcement Learning

Authors: Rameez Qureshi, Naïm Es-Sebbani, Luis Galárraga, Yvette Graham, Miguel Couceiro, Zied Bouraoui
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.09489
Pdf URL: https://arxiv.org/pdf/2408.09489
Copy Paste: [[2408.09489]] REFINE-LM: Mitigating Language Model Stereotypes via Reinforcement Learning(https://arxiv.org/abs/2408.09489)
Keywords: language model
Abstract: With the introduction of (large) language models, there has been significant concern about the unintended bias such models may inherit from their training data. A number of studies have shown that such models propagate gender stereotypes, as well as geographical and racial bias, among other biases. While existing works tackle this issue by preprocessing data and debiasing embeddings, the proposed methods require a lot of computational resources and annotation effort while being limited to certain types of biases. To address these issues, we introduce REFINE-LM, a debiasing method that uses reinforcement learning to handle different types of biases without any fine-tuning. By training a simple model on top of the word probability distribution of an LM, our bias-agnostic reinforcement learning method enables model debiasing without human annotations or significant computational resources. Experiments conducted on a wide range of models, including several LMs, show that our method (i) significantly reduces stereotypical biases while preserving LM performance; (ii) is applicable to different types of biases, generalizing across contexts such as gender, ethnicity, religion, and nationality-based biases; and (iii) is not expensive to train.

Title: Out-of-distribution generalization via composition: a lens through induction heads in Transformers

Authors: Jiajun Song, Zhuoyan Xu, Yiqiao Zhong
Subjects: cs.CL, cs.AI, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2408.09503
Pdf URL: https://arxiv.org/pdf/2408.09503
Copy Paste: [[2408.09503]] Out-of-distribution generalization via composition: a lens through induction heads in Transformers(https://arxiv.org/abs/2408.09503)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) such as GPT-4 sometimes appear to be creative, solving novel tasks often with a few demonstrations in the prompt. These tasks require the models to generalize on distributions different from those of the training data, which is known as out-of-distribution (OOD) generalization. Despite the tremendous success of LLMs, how they approach OOD generalization remains an open and underexplored question. We examine OOD generalization in settings where instances are generated according to hidden rules, including in-context learning with symbolic reasoning. Models are required to infer the hidden rules behind input prompts without any fine-tuning. We empirically examined the training dynamics of Transformers on a synthetic example and conducted extensive experiments on a variety of pretrained LLMs, focusing on a type of component known as induction heads. We found that OOD generalization and composition are tied together: models can learn rules by composing two self-attention layers, thereby achieving OOD generalization. Furthermore, a shared latent subspace in the embedding (or feature) space acts as a bridge for composition by aligning early layers and later layers, which we refer to as the common bridge representation hypothesis.
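
The behavior usually attributed to induction heads is the copy rule "[A][B] ... [A] -> [B]". The sketch below imitates that rule symbolically, as plain list lookups rather than attention, to show why it generalizes to symbols never seen before. It is an illustrative assumption about the mechanism, not the paper's experimental setup, and `induction_copy_prediction` is a hypothetical name.

```python
from typing import Optional


def induction_copy_prediction(tokens: list[str]) -> Optional[str]:
    """Predict the next token with the induction rule [A][B] ... [A] -> [B].

    Conceptual two-step composition:
      1. a "previous-token" step lets each position know its predecessor;
      2. an "induction" step looks for the position whose predecessor matches
         the current token and copies the token found there.
    """
    if not tokens:
        return None
    current = tokens[-1]
    # Scan from the most recent candidate backwards.
    for pos in range(len(tokens) - 1, 0, -1):
        if tokens[pos - 1] == current:
            return tokens[pos]
    return None


print(induction_copy_prediction(["x", "q", "k", "z", "x"]))
```

The rule never inspects what the symbols mean, only whether two of them are equal, so it works unchanged on token sequences far from the training distribution.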

Title: Revisiting the Graph Reasoning Ability of Large Language Models: Case Studies in Translation, Connectivity and Shortest Path

Authors: Xinnan Dai, Qihao Wen, Yifei Shen, Hongzhi Wen, Dongsheng Li, Jiliang Tang, Caihua Shan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.09529
Pdf URL: https://arxiv.org/pdf/2408.09529
Copy Paste: [[2408.09529]] Revisiting the Graph Reasoning Ability of Large Language Models: Case Studies in Translation, Connectivity and Shortest Path(https://arxiv.org/abs/2408.09529)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have achieved great success in various reasoning tasks. In this work, we focus on the graph reasoning ability of LLMs. Although theoretical studies proved that LLMs are capable of handling graph reasoning tasks, empirical evaluations reveal numerous failures. To deepen our understanding of this discrepancy, we revisit the ability of LLMs on three fundamental graph tasks: graph description translation, graph connectivity, and the shortest-path problem. Our findings suggest that LLMs can fail to understand graph structures through text descriptions and exhibit varying performance for all these three fundamental tasks. Meanwhile, we perform a real-world investigation on knowledge graphs and make observations consistent with our findings. The codes and datasets are available.
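
For reference, the connectivity and shortest-path tasks have simple exact solutions that can serve as ground truth when scoring an LLM's answers. The sketch below parses an edge-list text description and runs breadth-first search. The "u-v, u-v" description format and the function names are assumptions made for this example, not the paper's prompt format.

```python
from collections import deque
from typing import Optional


def parse_edge_list(description: str) -> dict[int, set[int]]:
    """Parse an undirected graph written as '0-1, 1-2, ...' (assumed format)."""
    graph: dict[int, set[int]] = {}
    for item in description.split(","):
        u_text, v_text = item.strip().split("-")
        u, v = int(u_text), int(v_text)
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def shortest_path_length(graph: dict[int, set[int]], source: int, target: int) -> Optional[int]:
    """Return the number of edges on a shortest path, or None if disconnected."""
    if source not in graph or target not in graph:
        return None
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return distance[node]
        for neighbor in graph[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return None


graph = parse_edge_list("0-1, 1-2, 2-3, 0-3, 4-5")
print(shortest_path_length(graph, 0, 2))
print(shortest_path_length(graph, 0, 4))
```

Connectivity falls out of the same search: two nodes are connected exactly when the function returns a number rather than `None`.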

Title: Using ChatGPT to Score Essays and Short-Form Constructed Responses

Authors: Mark D. Shermis
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.09540
Pdf URL: https://arxiv.org/pdf/2408.09540
Copy Paste: [[2408.09540]] Using ChatGPT to Score Essays and Short-Form Constructed Responses(https://arxiv.org/abs/2408.09540)
Keywords: language model, gpt, chat
Abstract: This study aimed to determine if ChatGPT's large language models could match the scoring accuracy of human and machine scores from the ASAP competition. The investigation focused on various prediction models, including linear regression, random forest, gradient boost, and boost. ChatGPT's performance was evaluated against human raters using quadratic weighted kappa (QWK) metrics. Results indicated that while ChatGPT's gradient boost model achieved QWKs close to human raters for some data sets, its overall performance was inconsistent and often lower than human scores. The study highlighted the need for further refinement, particularly in handling biases and ensuring scoring fairness. Despite these challenges, ChatGPT demonstrated potential for scoring efficiency, especially with domain-specific fine-tuning. The study concludes that ChatGPT can complement human scoring but requires additional development to be reliable for high-stakes assessments. Future research should improve model accuracy, address ethical considerations, and explore hybrid models combining ChatGPT with empirical methods.
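
Quadratic weighted kappa (QWK), the agreement metric named in the abstract, penalizes disagreements by the squared distance between the two scores. Below is a self-contained sketch of the standard formula, 1 - sum(w * O) / sum(w * E) with w_ij = (i - j)^2 / (N - 1)^2. The function name and the integer score range arguments are choices made for this example.

```python
def quadratic_weighted_kappa(rater_a: list[int], rater_b: list[int],
                             min_score: int, max_score: int) -> float:
    """Compute QWK between two lists of integer scores on [min_score, max_score]."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("score lists must be non-empty and of equal length")
    num_levels = max_score - min_score + 1
    if num_levels < 2:
        raise ValueError("need at least two score levels")

    # Observed agreement matrix O[i][j]: count of items scored i by A and j by B.
    observed = [[0] * num_levels for _ in range(num_levels)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1

    total = len(rater_a)
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_levels)) for j in range(num_levels)]

    numerator = 0.0
    denominator = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            expected = hist_a[i] * hist_b[j] / total  # chance agreement
            numerator += weight * observed[i][j]
            denominator += weight * expected
    if denominator == 0.0:
        return 1.0  # both raters gave one identical score to every item
    return 1.0 - numerator / denominator


print(quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4], 1, 4))
print(quadratic_weighted_kappa([1, 2, 3], [3, 2, 1], 1, 3))
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement.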

Title: No Such Thing as a General Learner: Language models and their dual optimization

Authors: Emmanuel Chemla, Ryan M. Nefdt
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.09544
Pdf URL: https://arxiv.org/pdf/2408.09544
Copy Paste: [[2408.09544]] No Such Thing as a General Learner: Language models and their dual optimization(https://arxiv.org/abs/2408.09544)
Keywords: language model, llm
Abstract: What role can the otherwise successful Large Language Models (LLMs) play in the understanding of human cognition, and in particular in terms of informing language acquisition debates? To contribute to this question, we first argue that neither humans nor LLMs are general learners, in a variety of senses. We make a novel case for how in particular LLMs follow a dual-optimization process: they are optimized during their training (which is typically compared to language acquisition), and modern LLMs have also been selected, through a process akin to natural selection in a species. From this perspective, we argue that the performance of LLMs, whether similar or dissimilar to that of humans, does not weigh easily on important debates about the importance of human cognitive biases for language.

Title: HiAgent: Hierarchical Working Memory Management for Solving Long-Horizon Agent Tasks with Large Language Model

Authors: Mengkang Hu, Tianxing Chen, Qiguang Chen, Yao Mu, Wenqi Shao, Ping Luo
Subjects: cs.CL, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2408.09559
Pdf URL: https://arxiv.org/pdf/2408.09559
Copy Paste: [[2408.09559]] HiAgent: Hierarchical Working Memory Management for Solving Long-Horizon Agent Tasks with Large Language Model(https://arxiv.org/abs/2408.09559)
Keywords: language model, llm, prompt, agent
Abstract: Large Language Model (LLM)-based agents exhibit significant potential across various domains, operating as interactive systems that process environmental observations to generate executable actions for target tasks. The effectiveness of these agents is significantly influenced by their memory mechanism, which records historical experiences as sequences of action-observation pairs. We categorize memory into two types: cross-trial memory, accumulated across multiple attempts, and in-trial memory (working memory), accumulated within a single attempt. While considerable research has optimized performance through cross-trial memory, the enhancement of agent performance through improved working memory utilization remains underexplored. Instead, existing approaches often involve directly inputting entire historical action-observation pairs into LLMs, leading to redundancy in long-horizon tasks. Inspired by human problem-solving strategies, this paper introduces HiAgent, a framework that leverages subgoals as memory chunks to manage the working memory of LLM-based agents hierarchically. Specifically, HiAgent prompts LLMs to formulate subgoals before generating executable actions and enables LLMs to decide proactively to replace previous subgoals with summarized observations, retaining only the action-observation pairs relevant to the current subgoal. Experimental results across five long-horizon tasks demonstrate that HiAgent achieves a twofold increase in success rate and reduces the average number of steps required by 3.8. Additionally, our analysis shows that HiAgent consistently improves performance across various steps, highlighting its robustness and generalizability. Project Page: this https URL .
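
The working-memory idea described above can be sketched as a small data structure: detailed action-observation pairs are kept only for the current subgoal, and finished subgoals are collapsed into a summary. The class and method names below are hypothetical, and the `summarize` callable stands in for an LLM call. This is a sketch under those assumptions rather than HiAgent's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Subgoal:
    description: str
    steps: list[tuple[str, str]] = field(default_factory=list)  # (action, observation)
    summary: Optional[str] = None


class HierarchicalWorkingMemory:
    def __init__(self, summarize: Callable[[str, list[tuple[str, str]]], str]):
        self._summarize = summarize  # stand-in for an LLM summarization call
        self._subgoals: list[Subgoal] = []

    def start_subgoal(self, description: str) -> None:
        """Open a new subgoal, collapsing the previous one into a summary."""
        if self._subgoals:
            finished = self._subgoals[-1]
            finished.summary = self._summarize(finished.description, finished.steps)
            finished.steps = []  # drop the detailed pairs to shorten the context
        self._subgoals.append(Subgoal(description))

    def record(self, action: str, observation: str) -> None:
        if not self._subgoals:
            raise RuntimeError("start a subgoal before recording steps")
        self._subgoals[-1].steps.append((action, observation))

    def render_context(self) -> str:
        """Build the text that would be placed in the next prompt."""
        lines = [f"[done] {sg.description}: {sg.summary}" for sg in self._subgoals[:-1]]
        if self._subgoals:
            current = self._subgoals[-1]
            lines.append(f"[current] {current.description}")
            lines.extend(f"  action: {a} -> observation: {o}" for a, o in current.steps)
        return "\n".join(lines)


def toy_summarize(description: str, steps: list[tuple[str, str]]) -> str:
    last = steps[-1][1] if steps else "nothing observed"
    return f"{len(steps)} steps, ended with '{last}'"


memory = HierarchicalWorkingMemory(toy_summarize)
memory.start_subgoal("find the key")
memory.record("look under mat", "nothing there")
memory.record("open drawer", "found key")
memory.start_subgoal("open the door")
memory.record("use key on door", "door unlocked")
context = memory.render_context()
print(context)
```

The rendered context grows with the number of subgoals rather than with the number of raw steps, which is where the redundancy reduction on long-horizon tasks would come from.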

Title: Grammatical Error Feedback: An Implicit Evaluation Approach

Authors: Stefano Bannò, Kate Knill, Mark J. F. Gales
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.09565
Pdf URL: https://arxiv.org/pdf/2408.09565
Copy Paste: [[2408.09565]] Grammatical Error Feedback: An Implicit Evaluation Approach(https://arxiv.org/abs/2408.09565)
Keywords: language model, llm, prompt
Abstract: Grammatical feedback is crucial for consolidating second language (L2) learning. Most research in computer-assisted language learning has focused on feedback through grammatical error correction (GEC) systems, rather than examining more holistic feedback that may be more useful for learners. This holistic feedback will be referred to as grammatical error feedback (GEF). In this paper, we present a novel implicit evaluation approach to GEF that eliminates the need for manual feedback annotations. Our method adopts a grammatical lineup approach where the task is to pair feedback and essay representations from a set of possible alternatives. This matching process can be performed by appropriately prompting a large language model (LLM). An important aspect of this process, explored here, is the form of the lineup, i.e., the selection of foils. This paper exploits this framework to examine the quality and need for GEC to generate feedback, as well as the system used to generate feedback, using essays from the Cambridge Learner Corpus.

Title: Refining Packing and Shuffling Strategies for Enhanced Performance in Generative Language Models

Authors: Yanbing Chen, Ruilin Wang, Zihao Yang, Lavender Yao Jiang, Eric Karl Oermann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.09621
Pdf URL: https://arxiv.org/pdf/2408.09621
Copy Paste: [[2408.09621]] Refining Packing and Shuffling Strategies for Enhanced Performance in Generative Language Models(https://arxiv.org/abs/2408.09621)
Keywords: language model
Abstract: Packing and shuffling tokens is a common practice in training auto-regressive language models (LMs) to prevent overfitting and improve efficiency. Typically, documents are concatenated into chunks of maximum sequence length (MSL) and then shuffled. However, setting the atom size (the length of each data chunk that is randomly shuffled) to MSL may lead to contextual incoherence, because tokens from different documents are packed into the same chunk. An alternative approach is to utilize padding, another common data packing strategy, to avoid contextual incoherence by only including one document in each shuffled chunk. To optimize both packing strategies (concatenation vs padding), we investigated the optimal atom size for shuffling and compared their performance and efficiency. We found that matching atom size to MSL optimizes performance for both packing methods (concatenation and padding), and padding yields lower final perplexity (higher performance) than concatenation at the cost of more training steps and lower compute efficiency. This trade-off informs the choice of packing methods in training language models.
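
The two packing strategies compared above can be illustrated on toy token-id lists. In this sketch the atom size equals the maximum sequence length, the `eos_id` and `pad_id` values are arbitrary, and over-long documents are truncated under padding. All of these are simplifying assumptions for illustration.

```python
import random


def pack_by_concatenation(documents: list[list[int]], max_len: int, eos_id: int) -> list[list[int]]:
    """Join all documents into one stream, then cut it into max_len chunks.

    A chunk can contain tokens from several documents; the final chunk may be short.
    """
    stream: list[int] = []
    for doc in documents:
        stream.extend(doc)
        stream.append(eos_id)
    return [stream[i:i + max_len] for i in range(0, len(stream), max_len)]


def pack_by_padding(documents: list[list[int]], max_len: int, pad_id: int) -> list[list[int]]:
    """Put exactly one document in each chunk and pad the rest (truncating long ones)."""
    chunks = []
    for doc in documents:
        tokens = doc[:max_len]
        chunks.append(tokens + [pad_id] * (max_len - len(tokens)))
    return chunks


def shuffle_atoms(chunks: list[list[int]], seed: int) -> list[list[int]]:
    """Shuffle whole chunks (atoms) reproducibly without changing their contents."""
    shuffled = list(chunks)
    random.Random(seed).shuffle(shuffled)
    return shuffled


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
concatenated = pack_by_concatenation(docs, max_len=4, eos_id=0)
padded = pack_by_padding(docs, max_len=4, pad_id=-1)
print(concatenated)
print(padded)
```

In the concatenated output the second chunk mixes the end of one document with the start of another, which is the contextual incoherence the abstract refers to. The padded output avoids that but spends positions on padding tokens.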

Title: A Strategy to Combine 1stGen Transformers and Open LLMs for Automatic Text Classification

Authors: Claudio M. V. de Andrade, Washington Cunha, Davi Reis, Adriana Silvina Pagano, Leonardo Rocha, Marcos André Gonçalves
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.09629
Pdf URL: https://arxiv.org/pdf/2408.09629
Copy Paste: [[2408.09629]] A Strategy to Combine 1stGen Transformers and Open LLMs for Automatic Text Classification(https://arxiv.org/abs/2408.09629)
Keywords: language model, llm
Abstract: Transformer models have achieved state-of-the-art results, with Large Language Models (LLMs), an evolution of first-generation transformers (1stTR), being considered the cutting edge in several NLP tasks. However, the literature has yet to conclusively demonstrate that LLMs consistently outperform 1stTRs across all NLP tasks. This study compares three 1stTRs (BERT, RoBERTa, and BART) with two open LLMs (Llama 2 and Bloom) across 11 sentiment analysis datasets. The results indicate that open LLMs may moderately outperform or match 1stTRs in 8 out of 11 datasets but only when fine-tuned. Given this substantial cost for only moderate gains, the practical applicability of these models in cost-sensitive scenarios is questionable. In this context, a confidence-based strategy that seamlessly integrates 1stTRs with open LLMs based on prediction certainty is proposed. High-confidence documents are classified by the more cost-effective 1stTRs, while uncertain cases are handled by LLMs in zero-shot or few-shot modes, at a much lower cost than fine-tuned versions. Experiments in sentiment analysis demonstrate that our solution not only outperforms 1stTRs, zero-shot, and few-shot LLMs but also competes closely with fine-tuned LLMs at a fraction of the cost.
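
The confidence-based strategy can be sketched as a simple router: accept the cheap model's label when its confidence clears a threshold, and otherwise defer to the LLM. The threshold value, the function names, and the two stub classifiers below are hypothetical placeholders, not the paper's configuration.

```python
from typing import Callable


def route_by_confidence(
    documents: list[str],
    cheap_classifier: Callable[[str], tuple[str, float]],
    llm_classifier: Callable[[str], str],
    threshold: float = 0.9,
) -> list[tuple[str, str, str]]:
    """Return (document, label, source) triples.

    cheap_classifier returns (label, confidence in [0, 1]); llm_classifier is
    only called for documents whose confidence falls below the threshold.
    """
    results = []
    for doc in documents:
        label, confidence = cheap_classifier(doc)
        if confidence >= threshold:
            results.append((doc, label, "1stTR"))
        else:
            results.append((doc, llm_classifier(doc), "LLM"))
    return results


# Toy stand-ins for a fine-tuned small transformer and a zero-shot LLM.
def toy_small_model(doc: str) -> tuple[str, float]:
    return ("positive", 0.97) if "great" in doc else ("negative", 0.55)


def toy_llm(doc: str) -> str:
    return "negative" if "not" in doc else "positive"


routed = route_by_confidence(["great phone", "not sure I like it"], toy_small_model, toy_llm)
print(routed)
```

Raising the threshold sends more documents to the LLM, trading cost for (potentially) higher accuracy on uncertain cases.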

Title: How to Make the Most of LLMs' Grammatical Knowledge for Acceptability Judgments

Authors: Yusuke Ide, Yuto Nishida, Miyu Oba, Yusuke Sakai, Justin Vasselli, Hidetaka Kamigaito, Taro Watanabe
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.09639
Pdf URL: https://arxiv.org/pdf/2408.09639
Copy Paste: [[2408.09639]] How to Make the Most of LLMs' Grammatical Knowledge for Acceptability Judgments(https://arxiv.org/abs/2408.09639)
Keywords: language model, llm, prompt
Abstract: The grammatical knowledge of language models (LMs) is often measured using a benchmark of linguistic minimal pairs, where LMs are presented with a pair of acceptable and unacceptable sentences and required to judge which is acceptable. The existing dominant approach, however, naively calculates and compares the probabilities of paired sentences using LMs. Additionally, large language models (LLMs) have yet to be thoroughly examined in this field. We thus investigate how to make the most of LLMs' grammatical knowledge to comprehensively evaluate it. Through extensive experiments of nine judgment methods in English and Chinese, we demonstrate that a probability readout method, in-template LP, and a prompting-based method, Yes/No probability computing, achieve particularly high performance, surpassing the conventional approach. Our analysis reveals their different strengths, e.g., Yes/No probability computing is robust against token-length bias, suggesting that they harness different aspects of LLMs' grammatical knowledge. Consequently, we recommend using diverse judgment methods to evaluate LLMs comprehensively.
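
The conventional minimal-pair readout compares the total log-probability an LM assigns to each sentence. The sketch below uses made-up per-token log-probabilities (no model is called) to show how that comparison can be swayed by sentence length, and how a length-normalized variant behaves differently. The normalization shown is a generic illustration and is not one of the paper's named methods.

```python
def sequence_score(token_log_probs: list[float], length_normalize: bool = False) -> float:
    """Sum of token log-probabilities, optionally divided by the token count."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    total = sum(token_log_probs)
    return total / len(token_log_probs) if length_normalize else total


def judges_correctly(acceptable: list[float], unacceptable: list[float],
                     length_normalize: bool = False) -> bool:
    """True when the acceptable sentence receives the higher score."""
    return (sequence_score(acceptable, length_normalize)
            > sequence_score(unacceptable, length_normalize))


# Hypothetical log-probabilities: the acceptable sentence has more tokens.
acceptable_lp = [-1.0, -1.0, -1.0, -1.0]   # sum -4.0, mean -1.0
unacceptable_lp = [-1.5, -1.5]             # sum -3.0, mean -1.5

print(judges_correctly(acceptable_lp, unacceptable_lp))
print(judges_correctly(acceptable_lp, unacceptable_lp, length_normalize=True))
```

With these toy numbers the raw sum prefers the shorter, unacceptable sentence, while the per-token average prefers the acceptable one, which illustrates the kind of token-length bias the abstract mentions.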

Title: Acquiring Bidirectionality via Large and Small Language Models

Authors: Takumi Goto, Hiroyoshi Nagao, Yuta Koreeda
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.09640
Pdf URL: https://arxiv.org/pdf/2408.09640
Copy Paste: [[2408.09640]] Acquiring Bidirectionality via Large and Small Language Models(https://arxiv.org/abs/2408.09640)
Keywords: language model
Abstract: Using token representations from bidirectional language models (LMs) such as BERT is still a widely used approach for token-classification tasks. Even though there exist much larger unidirectional LMs such as Llama-2, they are rarely used to replace the token representations of bidirectional LMs. In this work, we hypothesize that their lack of bidirectionality is holding them back. To that end, we propose to newly train a small backward LM and concatenate its representations to those of an existing LM for downstream tasks. Through experiments in named entity recognition, we demonstrate that introducing a backward model improves the benchmark performance by more than 10 points. Furthermore, we show that the proposed method is especially effective for rare domains and in few-shot learning settings.
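
A rough sketch of the concatenation idea: run a left-to-right encoder on the tokens, run a second encoder on the reversed tokens, re-align its outputs, and concatenate the two vectors per token. The toy encoder below only accumulates token lengths and is a stand-in for real LMs. Feeding the backward model the reversed sequence is an assumption about how such a model would be applied.

```python
def toy_causal_encoder(tokens: list[str]) -> list[list[float]]:
    """Stand-in for a unidirectional LM: state i depends only on tokens[0..i]."""
    states = []
    running = 0
    for token in tokens:
        running += len(token)
        states.append([float(running)])
    return states


def bidirectional_token_features(tokens: list[str]) -> list[list[float]]:
    """Concatenate forward states with re-aligned backward states per token."""
    forward = toy_causal_encoder(tokens)
    # Encode the reversed sequence, then reverse the outputs so that index i
    # again refers to tokens[i]; that state now summarizes tokens[i..end].
    backward = toy_causal_encoder(tokens[::-1])[::-1]
    return [f + b for f, b in zip(forward, backward)]


features = bidirectional_token_features(["New", "York", "City"])
print(features)
```

Each token's feature vector now carries information about both its left context and its right context, which a token classifier such as an NER tagger can consume directly.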

Title: BLADE: Benchmarking Language Model Agents for Data-Driven Science

Authors: Ken Gu, Ruoxi Shang, Ruien Jiang, Keying Kuang, Richard-John Lin, Donghe Lyu, Yue Mao, Youran Pan, Teng Wu, Jiaqian Yu, Yikun Zhang, Tianmai M. Zhang, Lanyi Zhu, Mike A. Merrill, Jeffrey Heer, Tim Althoff
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.09667
Pdf URL: https://arxiv.org/pdf/2408.09667
Copy Paste: [[2408.09667]] BLADE: Benchmarking Language Model Agents for Data-Driven Science(https://arxiv.org/abs/2408.09667)
Keywords: language model, agent
Abstract: Data-driven scientific discovery requires the iterative integration of scientific domain knowledge, statistical expertise, and an understanding of data semantics to make nuanced analytical decisions, e.g., about which variables, transformations, and statistical models to consider. LM-based agents equipped with planning, memory, and code execution capabilities have the potential to support data-driven science. However, evaluating agents on such open-ended tasks is challenging due to multiple valid approaches, partially correct steps, and different ways to express the same decisions. To address these challenges, we present BLADE, a benchmark to automatically evaluate agents' multifaceted approaches to open-ended research questions. BLADE consists of 12 datasets and research questions drawn from existing scientific literature, with ground truth collected from independent analyses by expert data scientists and researchers. To automatically evaluate agent responses, we developed corresponding computational methods to match different representations of analyses to this ground truth. Though language models possess considerable world knowledge, our evaluation shows that they are often limited to basic analyses. However, agents capable of interacting with the underlying data demonstrate improved, but still non-optimal, diversity in their analytical decision making. Our work enables the evaluation of agents for data-driven science and provides researchers deeper insights into agents' analysis approaches.

Title: Recording for Eyes, Not Echoing to Ears: Contextualized Spoken-to-Written Conversion of ASR Transcripts

Authors: Jiaqing Liu, Chong Deng, Qinglin Zhang, Qian Chen, Hai Yu, Wen Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.09688
Pdf URL: https://arxiv.org/pdf/2408.09688
Copy Paste: [[2408.09688]] Recording for Eyes, Not Echoing to Ears: Contextualized Spoken-to-Written Conversion of ASR Transcripts(https://arxiv.org/abs/2408.09688)
Keywords: language model, llm
Abstract: Automatic Speech Recognition (ASR) transcripts exhibit recognition errors and various spoken language phenomena such as disfluencies, ungrammatical sentences, and incomplete sentences, and therefore suffer from poor readability. To improve readability, we propose a Contextualized Spoken-to-Written conversion (CoS2W) task that addresses ASR and grammar errors and also transfers the informal text into a formal style with content preserved, utilizing contexts and auxiliary information. This task naturally matches the in-context learning capabilities of Large Language Models (LLMs). To facilitate comprehensive comparisons of various LLMs, we construct a document-level Spoken-to-Written conversion of ASR Transcripts Benchmark (SWAB) dataset. Using SWAB, we study the impact of different granularity levels on CoS2W performance, and propose methods to exploit contexts and auxiliary information to enhance the outputs. Experimental results reveal that LLMs have the potential to excel in the CoS2W task, particularly in grammaticality and formality, and that our methods help LLMs make effective use of contexts and auxiliary information. We further investigate the effectiveness of using LLMs as evaluators and find that LLM evaluators show strong correlations with human evaluations on rankings of faithfulness and formality, which validates the reliability of LLM evaluators for the CoS2W task.

Title: Bridging the Language Gap: Enhancing Multilingual Prompt-Based Code Generation in LLMs via Zero-Shot Cross-Lingual Transfer

Authors: Mingda Li, Abhijit Mishra, Utkarsh Mujumdar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.09701
Pdf URL: https://arxiv.org/pdf/2408.09701
Copy Paste: [[2408.09701]] Bridging the Language Gap: Enhancing Multilingual Prompt-Based Code Generation in LLMs via Zero-Shot Cross-Lingual Transfer(https://arxiv.org/abs/2408.09701)
Keywords: language model, llm, prompt
Abstract: The use of Large Language Models (LLMs) for program code generation has gained substantial attention, but their biases and limitations with non-English prompts challenge global inclusivity. This paper investigates the complexities of multilingual prompt-based code generation. Our evaluations of LLMs, including CodeLLaMa and CodeGemma, reveal significant disparities in code quality for non-English prompts; we also demonstrate the inadequacy of simple approaches like prompt translation, bootstrapped data augmentation, and fine-tuning. To address this, we propose a zero-shot cross-lingual approach using a neural projection technique, integrating a cross-lingual encoder such as LASER and mapping its multilingual embeddings into the LLM's token space. This method requires training only on English data and scales effectively to other languages. Results on a translated and quality-checked MBPP dataset show substantial improvements in code quality. This research promotes a more inclusive code generation landscape by empowering LLMs with multilingual capabilities to support the diverse linguistic spectrum in programming.

Title: SEMDR: A Semantic-Aware Dual Encoder Model for Legal Judgment Prediction with Legal Clue Tracing

Authors: Pengjie Liu, Wang Zhang, Yulong Ding, Xuefeng Zhang, Shuang-Hua Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.09717
Pdf URL: https://arxiv.org/pdf/2408.09717
Copy Paste: [[2408.09717]] SEMDR: A Semantic-Aware Dual Encoder Model for Legal Judgment Prediction with Legal Clue Tracing(https://arxiv.org/abs/2408.09717)
Keywords: language model
Abstract: Legal Judgment Prediction (LJP) aims to form legal judgments based on the criminal fact description. However, researchers struggle to classify confusing criminal cases, such as robbery and theft, which requires LJP models to distinguish the nuances between similar crimes. Existing methods usually design handcrafted features to pick up the semantic legal clues needed for more accurate legal judgment predictions. In this paper, we propose a Semantic-Aware Dual Encoder Model (SEMDR), which designs a novel legal clue tracing mechanism to conduct fine-grained semantic reasoning between criminal facts and instruments. Our legal clue tracing mechanism is built from three reasoning levels: 1) Lexicon-Tracing, which aims to extract criminal facts from criminal descriptions; 2) Sentence Representation Learning, which contrastively trains language models to better represent confusing criminal facts; 3) Multi-Fact Reasoning, which builds a reasons graph to propagate semantic clues among fact nodes to capture the subtle differences among criminal facts. Our legal clue tracing mechanism helps SEMDR achieve state-of-the-art results on the CAIL2018 dataset and shows its advantage in few-shot scenarios. Our experiments show that SEMDR has a strong ability to learn more uniform and distinguishable representations for criminal facts, which helps to make more accurate predictions on confusing criminal cases and reduces model uncertainty when making judgments. All code will be released via GitHub.

Title: Paired Completion: Flexible Quantification of Issue-framing at Scale with LLMs

Authors: Simon D Angus, Lachlan O'Neill
Subjects: cs.CL, cs.AI, econ.GN
Abstract URL: https://arxiv.org/abs/2408.09742
Pdf URL: https://arxiv.org/pdf/2408.09742
Copy Paste: [[2408.09742]] Paired Completion: Flexible Quantification of Issue-framing at Scale with LLMs(https://arxiv.org/abs/2408.09742)
Keywords: language model, llm, prompt
Abstract: Detecting and quantifying issue framing in textual discourse - the perspective one takes to a given topic (e.g. climate science vs. denialism, misogyny vs. gender equality) - is highly valuable to a range of end-users from social and political scientists to program evaluators and policy analysts. However, conceptual framing is notoriously challenging for automated natural language processing (NLP) methods since the words and phrases used by either 'side' of an issue are often held in common, with only subtle stylistic flourishes separating their use. Here we develop and rigorously evaluate new detection methods for issue framing and narrative analysis within large text datasets. By introducing a novel application of next-token log probabilities derived from generative large language models (LLMs) we show that issue framing can be reliably and efficiently detected in large corpora with only a few examples of either perspective on a given issue, a method we call 'paired completion'. Through 192 independent experiments over three novel, synthetic datasets, we evaluate paired completion against prompt-based LLM methods and labelled methods using traditional NLP and recent LLM contextual embeddings. We additionally conduct a cost-based analysis to mark out the feasible set of performant methods at production-level scales, and a model bias analysis. Together, our work demonstrates a feasible path to scalable, accurate and low-bias issue-framing in large corpora.
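
Illustrative sketch (not the authors' implementation): the abstract describes scoring text with next-token log probabilities conditioned on a few examples of either perspective. The Python below shows one way such a comparison might look. The names `framing_score` and `make_toy_logprob` are hypothetical, and the toy unigram scorer merely stands in for an LLM's log-probability output so that the example runs on its own.

```python
import math
from collections import Counter
from typing import Callable, Sequence

# Signature assumed for illustration: log P(continuation | context).
LogProbFn = Callable[[str, str], float]


def make_toy_logprob() -> LogProbFn:
    """Toy stand-in for an LLM: add-one-smoothed unigram model fit on the context."""

    def logprob(context: str, continuation: str) -> float:
        counts = Counter(context.lower().split())
        total = sum(counts.values())
        vocab = len(counts) + 1  # one extra bucket for unseen words
        return sum(
            math.log((counts[w] + 1) / (total + vocab))
            for w in continuation.lower().split()
        )

    return logprob


def framing_score(
    text: str,
    side_a: Sequence[str],
    side_b: Sequence[str],
    logprob: LogProbFn,
) -> float:
    """Mean log-prob of `text` after side-A primers minus the same after side-B primers.

    A positive value means the text is a more likely continuation of side A.
    """
    lp_a = sum(logprob(p, text) for p in side_a) / len(side_a)
    lp_b = sum(logprob(p, text) for p in side_b) / len(side_b)
    return lp_a - lp_b


scorer = make_toy_logprob()
side_a = ["warming is driven by human emissions"]
side_b = ["the climate has always changed naturally"]
print(framing_score("emissions drive warming", side_a, side_b, scorer) > 0)
```

In practice the toy scorer would be replaced by log probabilities from a generative LLM; how the paper pairs and aggregates completions is not specified in the abstract, so the simple mean difference here is an assumption.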

Title: Are Large Language Models More Honest in Their Probabilistic or Verbalized Confidence?

Authors: Shiyu Ni, Keping Bi, Lulu Yu, Jiafeng Guo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.09773
Pdf URL: https://arxiv.org/pdf/2408.09773
Copy Paste: [[2408.09773]] Are Large Language Models More Honest in Their Probabilistic or Verbalized Confidence?(https://arxiv.org/abs/2408.09773)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) have been found to produce hallucinations when the question exceeds their internal knowledge boundaries. A reliable model should have a clear perception of its knowledge boundaries, providing correct answers within its scope and refusing to answer when it lacks knowledge. Existing research on LLMs' perception of their knowledge boundaries typically uses either the probability of the generated tokens or the verbalized confidence as the model's confidence in its response. However, these studies overlook the differences and connections between the two. In this paper, we conduct a comprehensive analysis and comparison of LLMs' probabilistic perception and verbalized perception of their factual knowledge boundaries. First, we investigate the pros and cons of these two perceptions. Then, we study how they change under questions of varying frequencies. Finally, we measure the correlation between LLMs' probabilistic confidence and verbalized confidence. Experimental results show that 1) LLMs' probabilistic perception is generally more accurate than verbalized perception but requires an in-domain validation set to adjust the confidence threshold. 2) Both perceptions perform better on less frequent questions. 3) It is challenging for LLMs to accurately express their internal confidence in natural language.
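
Illustrative sketch (an assumption, not the paper's procedure): the abstract notes that probabilistic confidence needs an in-domain validation set to adjust the confidence threshold. One simple way to do that is to pick the threshold at which "answer if confidence >= t, otherwise refuse" agrees most often with whether the model was actually correct. The function name `pick_threshold` and the agreement criterion are hypothetical choices for this example.

```python
from typing import Sequence


def pick_threshold(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Choose t so that (confidence >= t) best matches correctness on validation data.

    confidences: per-question confidence, e.g. probability of the generated answer.
    correct:     whether the model's answer to that question was right.
    """
    # Candidate thresholds: every observed confidence, plus "always refuse".
    candidates = sorted(set(confidences)) + [float("inf")]

    def agreement(t: float) -> int:
        # Count questions where answering/refusing lines up with right/wrong.
        return sum((c >= t) == ok for c, ok in zip(confidences, correct))

    return max(candidates, key=agreement)


val_conf = [0.9, 0.8, 0.4, 0.3]
val_correct = [True, True, False, False]
print(pick_threshold(val_conf, val_correct))
```

The chosen threshold would then be applied to unseen in-domain questions; a threshold tuned on one domain may transfer poorly to another, which is consistent with the abstract's remark about needing in-domain validation data.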

Title: Summarizing long regulatory documents with a multi-step pipeline

Authors: Mika Sie, Ruby Beek, Michiel Bots, Sjaak Brinkkemper, Albert Gatt
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.09777
Pdf URL: https://arxiv.org/pdf/2408.09777
Copy Paste: [[2408.09777]] Summarizing long regulatory documents with a multi-step pipeline(https://arxiv.org/abs/2408.09777)
Keywords: language model
Abstract: Due to their length and complexity, long regulatory texts are challenging to summarize. To address this, a multi-step extractive-abstractive architecture is proposed to handle lengthy regulatory documents more effectively. In this paper, we show that the effectiveness of a two-step architecture for summarizing long regulatory texts varies significantly depending on the model used. Specifically, the two-step architecture improves the performance of decoder-only models. For abstractive encoder-decoder models with short context lengths, the effectiveness of an extractive step varies, whereas for long-context encoder-decoder models, the extractive step worsens their performance. This research also highlights the challenges of evaluating generated texts, as evidenced by the differing results from human and automated evaluations. Most notably, human evaluations favoured language models pretrained on legal text, while automated metrics rank general-purpose language models higher. The results underscore the importance of selecting the appropriate summarization strategy based on model architecture and context length.

Title: Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation

Authors: Yunxin Li, Haoyuan Shi, Baotian Hu, Longyue Wang, Jiashun Zhu, Jinyi Xu, Zhen Zhao, Min Zhang
Subjects: cs.CL, cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2408.09787
Pdf URL: https://arxiv.org/pdf/2408.09787
Copy Paste: [[2408.09787]] Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation(https://arxiv.org/abs/2408.09787)
Keywords: prompt, agent
Abstract: Traditional animation generation methods depend on training generative models with human-labelled data, entailing a sophisticated multi-stage pipeline that demands substantial human effort and incurs high training costs. Due to limited prompting plans, these methods typically produce brief, information-poor, and context-incoherent animations. To overcome these limitations and automate the animation process, we pioneer the introduction of large multimodal models (LMMs) as the core processor to build an autonomous animation-making agent, named Anim-Director. This agent mainly harnesses the advanced understanding and reasoning capabilities of LMMs and generative AI tools to create animated videos from concise narratives or simple instructions. Specifically, it operates in three main stages: Firstly, the Anim-Director generates a coherent storyline from user inputs, followed by a detailed director's script that encompasses settings of character profiles and interior/exterior descriptions, and context-coherent scene descriptions that include appearing characters, interiors or exteriors, and scene events. Secondly, we employ LMMs with the image generation tool to produce visual images of settings and scenes. These images are designed to maintain visual consistency across different scenes using a visual-language prompting method that combines scene descriptions and images of the appearing character and setting. Thirdly, scene images serve as the foundation for producing animated videos, with LMMs generating prompts to guide this process. The whole process is notably autonomous without manual intervention, as the LMMs interact seamlessly with generative tools to generate prompts, evaluate visual quality, and select the best one to optimize the final output.

Title: CMoralEval: A Moral Evaluation Benchmark for Chinese Large Language Models

Authors: Linhao Yu, Yongqi Leng, Yufei Huang, Shang Wu, Haixin Liu, Xinmeng Ji, Jiahui Zhao, Jinwang Song, Tingting Cui, Xiaoqing Cheng, Tao Liu, Deyi Xiong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.09819
Pdf URL: https://arxiv.org/pdf/2408.09819
Copy Paste: [[2408.09819]] CMoralEval: A Moral Evaluation Benchmark for Chinese Large Language Models(https://arxiv.org/abs/2408.09819)
Keywords: language model, llm
Abstract: How would a large language model (LLM) respond in ethically relevant contexts? In this paper, we curate a large benchmark, CMoralEval, for morality evaluation of Chinese LLMs. The data sources of CMoralEval are two-fold: 1) a Chinese TV program discussing Chinese moral norms with stories from the society and 2) a collection of Chinese moral anomies from various newspapers and academic papers on morality. With these sources, we aim to create a moral evaluation dataset characterized by diversity and authenticity. We develop a morality taxonomy and a set of fundamental moral principles that are not only rooted in traditional Chinese culture but also consistent with contemporary societal norms. To facilitate efficient construction and annotation of instances in CMoralEval, we establish a platform with AI-assisted instance generation to streamline the annotation process. These help us curate CMoralEval, which encompasses both explicit moral scenarios (14,964 instances) and moral dilemma scenarios (15,424 instances), each with instances from different data sources. We conduct extensive experiments with CMoralEval to examine a variety of Chinese LLMs. Experiment results demonstrate that CMoralEval is a challenging benchmark for Chinese LLMs. The dataset is publicly available.

Title: Continual Dialogue State Tracking via Reason-of-Select Distillation

Authors: Yujie Feng, Bo Liu, Xiaoyu Dong, Zexin Lu, Li-Ming Zhan, Xiao-Ming Wu, Albert Y.S. Lam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.09846
Pdf URL: https://arxiv.org/pdf/2408.09846
Copy Paste: [[2408.09846]] Continual Dialogue State Tracking via Reason-of-Select Distillation(https://arxiv.org/abs/2408.09846)
Keywords: hallucination
Abstract: An ideal dialogue system requires continuous skill acquisition and adaptation to new tasks while retaining prior knowledge. Dialogue State Tracking (DST), vital in these systems, often involves learning new services and confronting catastrophic forgetting, along with a critical capability loss termed the "Value Selection Quandary." To address these challenges, we introduce the Reason-of-Select (RoS) distillation method by enhancing smaller models with a novel 'meta-reasoning' capability. Meta-reasoning employs an enhanced multi-domain perspective, combining fragments of meta-knowledge from domain-specific dialogues during continual learning. This transcends traditional single-perspective reasoning. The domain bootstrapping process enhances the model's ability to dissect intricate dialogues from multiple possible values. Its domain-agnostic property aligns data distribution across different domains, effectively mitigating forgetting. Additionally, two novel improvements, "multi-value resolution" strategy and Semantic Contrastive Reasoning Selection method, significantly enhance RoS by generating DST-specific selection chains and mitigating hallucinations in teachers' reasoning, ensuring effective and reliable knowledge transfer. Extensive experiments validate the exceptional performance and robust generalization capabilities of our method. The source code is provided for reproducibility.

Title: Importance Weighting Can Help Large Language Models Self-Improve

Authors: Chunyang Jiang, Chi-min Chan, Wei Xue, Qifeng Liu, Yike Guo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.09849
Pdf URL: https://arxiv.org/pdf/2408.09849
Copy Paste: [[2408.09849]] Importance Weighting Can Help Large Language Models Self-Improve(https://arxiv.org/abs/2408.09849)
Keywords: language model, llm
Abstract: Large language models (LLMs) have shown remarkable capability in numerous tasks and applications. However, fine-tuning LLMs using high-quality datasets under external supervision remains prohibitively expensive. In response, LLM self-improvement approaches have been actively developed recently. The typical paradigm of LLM self-improvement involves training an LLM on self-generated data, part of which may be detrimental and should be filtered out because of its unstable quality. While current work primarily employs filtering strategies based on answer correctness, in this paper we demonstrate that filtering out samples that are correct but have a high distribution shift extent (DSE) can also benefit self-improvement. Given that the actual sample distribution is usually inaccessible, we propose a new metric called DS weight to approximate DSE, inspired by Importance Weighting methods. We then integrate DS weight with self-consistency to comprehensively filter the self-generated samples and fine-tune the language model. Experiments show that with only a tiny validation set (up to 5% of the size of the training set) to compute DS weight, our approach can notably improve the reasoning ability of current LLM self-improvement methods. The resulting performance is on par with methods that rely on external supervision from pre-trained reward models.
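
Illustrative sketch (heavily simplified and hypothetical): the abstract describes combining a distribution-shift measure with self-consistency to filter self-generated samples. The Python below shows only the general shape of such a filter. The `shift_score` field is a placeholder supplied by the caller, where larger is assumed to mean further from the reference distribution; it is not the paper's DS weight, whose formula the abstract does not give. The cutoff `max_shift` is likewise an assumed parameter.

```python
from collections import Counter
from typing import List, NamedTuple


class Sample(NamedTuple):
    answer: str         # final answer extracted from a self-generated solution
    shift_score: float  # hypothetical: larger = further from the reference distribution


def filter_samples(samples: List[Sample], max_shift: float) -> List[Sample]:
    """Keep samples that agree with the majority answer and have low shift."""
    if not samples:
        return []
    # Self-consistency: treat the most frequent answer as the pseudo-label.
    majority, _ = Counter(s.answer for s in samples).most_common(1)[0]
    # Drop minority answers and majority answers with high shift.
    return [s for s in samples if s.answer == majority and s.shift_score <= max_shift]


samples = [Sample("42", 0.1), Sample("42", 0.9), Sample("7", 0.2), Sample("42", 0.3)]
print(filter_samples(samples, max_shift=0.5))
```

The kept samples would then serve as fine-tuning data; the second "42" sample is dropped even though it matches the majority answer, which mirrors the abstract's point that answer agreement alone is not the only useful filter.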

Title: Self-Directed Turing Test for Large Language Models

Authors: Weiqi Wu, Hongqiu Wu, Hai Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.09853
Pdf URL: https://arxiv.org/pdf/2408.09853
Copy Paste: [[2408.09853]] Self-Directed Turing Test for Large Language Models(https://arxiv.org/abs/2408.09853)
Keywords: language model, gpt, llm
Abstract: The Turing test examines whether AIs can exhibit human-like behaviour in natural language conversations. Traditional Turing tests adopt a rigid dialogue format where each participant sends only one message each time and require continuous human involvement to direct the entire interaction with the test subject. This fails to reflect a natural conversational style and hinders the evaluation of Large Language Models (LLMs) in complex and prolonged dialogues. This paper proposes the Self-Directed Turing Test, which extends the original test with a burst dialogue format, allowing more dynamic exchanges by multiple consecutive messages. It further efficiently reduces human workload by having the LLM self-direct the majority of the test process, iteratively generating dialogues that simulate its interaction with humans. With the pseudo-dialogue history, the model then engages in a shorter dialogue with a human, which is paired with a human-human conversation on the same topic to be judged using questionnaires. We introduce the X-Turn Pass-Rate metric to assess the human likeness of LLMs across varying durations. While LLMs like GPT-4 initially perform well, achieving pass rates of 51.9% and 38.9% during 3 turns and 10 turns of dialogues respectively, their performance drops as the dialogue progresses, which underscores the difficulty in maintaining consistency in the long term.

Title: Performance Law of Large Language Models

Authors: Chuhan Wu, Ruiming Tang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2408.09895
Pdf URL: https://arxiv.org/pdf/2408.09895
Copy Paste: [[2408.09895]] Performance Law of Large Language Models(https://arxiv.org/abs/2408.09895)
Keywords: language model, llm
Abstract: Guided by the belief of the scaling law, large language models (LLMs) have achieved impressive performance in recent years. However, scaling law only gives a qualitative estimation of loss, which is influenced by various factors such as model architectures, data distributions, tokenizers, and computation precision. Thus, estimating the real performance of LLMs with different training settings rather than loss may be quite useful in practical development. In this article, we present an empirical equation named "Performance Law" to directly predict the MMLU score of an LLM, which is a widely used metric to indicate the general capability of LLMs in real-world conversations and applications. Based on only a few key hyperparameters of the LLM architecture and the size of training data, we obtain a quite accurate MMLU prediction of various LLMs with diverse sizes and architectures developed by different organizations in different years. Performance law can be used to guide the choice of LLM architecture and the effective allocation of computational resources without extensive experiments.

Title: Benchmarking LLMs for Translating Classical Chinese Poetry:Evaluating Adequacy, Fluency, and Elegance

Authors: Andong Chen, Lianzhang Lou, Kehai Chen, Xuefeng Bai, Yang Xiang, Muyun Yang, Tiejun Zhao, Min Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.09945
Pdf URL: https://arxiv.org/pdf/2408.09945
Copy Paste: [[2408.09945]] Benchmarking LLMs for Translating Classical Chinese Poetry:Evaluating Adequacy, Fluency, and Elegance(https://arxiv.org/abs/2408.09945)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) have shown remarkable performance in general translation tasks. However, there is an increasing demand for high-quality translations that are not only adequate but also fluent and elegant. To assess the extent to which current LLMs can meet these demands, we introduce a suitable benchmark for translating classical Chinese poetry into English. This task requires not only adequacy in translating culturally and historically significant content but also a strict adherence to linguistic fluency and poetic elegance. Our study reveals that existing LLMs fall short of this task. To address these issues, we propose RAT, a Retrieval-Augmented machine Translation method that enhances the translation process by incorporating knowledge related to classical poetry. Additionally, we propose an automatic evaluation metric based on GPT-4, which better assesses translation quality in terms of adequacy, fluency, and elegance, overcoming the limitations of traditional metrics. Our dataset and code will be made available.

Title: Privacy Checklist: Privacy Violation Detection Grounding on Contextual Integrity Theory

Authors: Haoran Li, Wei Fan, Yulin Chen, Jiayang Cheng, Tianshu Chu, Xuebing Zhou, Peizhao Hu, Yangqiu Song
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2408.10053
Pdf URL: https://arxiv.org/pdf/2408.10053
Copy Paste: [[2408.10053]] Privacy Checklist: Privacy Violation Detection Grounding on Contextual Integrity Theory(https://arxiv.org/abs/2408.10053)
Keywords: language model, llm
Abstract: Privacy research has attracted wide attention as individuals worry that their private data can be easily leaked during interactions with smart devices, social platforms, and AI applications. Computer science researchers, on the other hand, commonly study privacy issues through privacy attacks and defenses on segmented fields. Privacy research is conducted on various sub-fields, including Computer Vision (CV), Natural Language Processing (NLP), and Computer Networks. Within each field, privacy has its own formulation. Though pioneering works on attacks and defenses reveal sensitive privacy issues, they are narrowly scoped and cannot fully cover people's actual privacy concerns. Consequently, general and human-centric privacy research remains largely unexplored. In this paper, we formulate the privacy issue as a reasoning problem rather than simple pattern matching. We build on the Contextual Integrity (CI) theory, which posits that people's perceptions of privacy are highly correlated with the corresponding social context. Based on this assumption, we develop the first comprehensive checklist that covers social identities, private attributes, and existing privacy regulations. Unlike prior works on CI that either cover limited expert-annotated norms or model incomplete social context, our proposed privacy checklist uses the whole Health Insurance Portability and Accountability Act of 1996 (HIPAA) as an example, to show that we can resort to large language models (LLMs) to completely cover HIPAA's regulations. Additionally, our checklist gathers expert annotations across multiple ontologies to determine private information including but not limited to personally identifiable information (PII). We use our preliminary results on HIPAA to shed light on future context-centric privacy research to cover more privacy regulations, social norms and standards.

Title: GLIMMER: Incorporating Graph and Lexical Features in Unsupervised Multi-Document Summarization

Authors: Ran Liu, Ming Liu, Min Yu, Jianguo Jiang, Gang Li, Dan Zhang, Jingyuan Li, Xiang Meng, Weiqing Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.10115
Pdf URL: https://arxiv.org/pdf/2408.10115
Copy Paste: [[2408.10115]] GLIMMER: Incorporating Graph and Lexical Features in Unsupervised Multi-Document Summarization(https://arxiv.org/abs/2408.10115)
Keywords: language model
Abstract: Pre-trained language models are increasingly being used in multi-document summarization tasks. However, these models need large-scale corpora for pre-training and are domain-dependent. Other non-neural unsupervised summarization approaches mostly rely on key sentence extraction, which can lead to information loss. To address these challenges, we propose a lightweight yet effective unsupervised approach called GLIMMER: a Graph and LexIcal features based unsupervised Multi-docuMEnt summaRization approach. It first constructs a sentence graph from the source documents, then automatically identifies semantic clusters by mining low-level features from raw texts, thereby improving intra-cluster correlation and the fluency of generated sentences. Finally, it summarizes clusters into natural sentences. Experiments conducted on Multi-News, Multi-XScience and DUC-2004 demonstrate that our approach outperforms existing unsupervised approaches. Furthermore, it surpasses state-of-the-art pre-trained multi-document summarization models (e.g. PEGASUS and PRIMERA) under zero-shot settings in terms of ROUGE scores. Additionally, human evaluations indicate that summaries generated by GLIMMER achieve high readability and informativeness scores. Our code is available at this https URL.
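
The first two steps described in the abstract (building a sentence graph, then grouping sentences into clusters) can be illustrated with a minimal sketch. This is not GLIMMER's actual algorithm: the Jaccard word-overlap edge criterion, the `threshold` value, and the use of connected components as clusters are all assumptions chosen only to make the idea concrete.

```python
import re
from itertools import combinations


def tokenize(sentence):
    """Return the set of lowercase word tokens (a crude lexical feature)."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def jaccard(a, b):
    """Jaccard overlap between two token sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_sentence_graph(sentences, threshold=0.3):
    """Connect two sentences when their lexical overlap meets the threshold.

    The threshold is a hypothetical value, not one taken from the paper.
    """
    tokens = [tokenize(s) for s in sentences]
    graph = {i: set() for i in range(len(sentences))}
    for i, j in combinations(range(len(sentences)), 2):
        if jaccard(tokens[i], tokens[j]) >= threshold:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def connected_clusters(graph):
    """Treat each connected component of the graph as one cluster."""
    seen, clusters = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(sorted(component))
    return clusters


sentences = [
    "The storm hit the coast on Monday.",
    "A storm hit the coast Monday night.",
    "Stocks rose sharply in early trading.",
]
clusters = connected_clusters(build_sentence_graph(sentences))
```

In this toy input the two storm sentences share most of their words and end up in one cluster, while the unrelated sentence forms its own. The final step in the abstract, turning each cluster into a natural sentence, is not sketched here.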

Title: Rhyme-aware Chinese lyric generator based on GPT

Authors: Yixiao Yuan, Yangchen Huang, Yu Ma, Xinjin Li, Zhenglin Li, Yiming Shi, Huapeng Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10130
Pdf URL: https://arxiv.org/pdf/2408.10130
Copy Paste: [[2408.10130]] Rhyme-aware Chinese lyric generator based on GPT(https://arxiv.org/abs/2408.10130)
Keywords: language model, gpt
Abstract: Neural language representation models such as GPT, pre-trained on large-scale corpora, can effectively capture rich semantic patterns from plain text and be fine-tuned to consistently improve natural language generation performance. However, existing pre-trained language models used to generate lyrics rarely consider rhyme information, which is crucial in lyrics. Using a pre-trained model directly results in poor performance. To enhance the rhyming quality of generated lyrics, we incorporate integrated rhyme information into our model, thereby improving lyric generation performance.

Title: Instruction Finetuning for Leaderboard Generation from Empirical AI Research

Authors: Salomon Kabongo, Jennifer D'Souza
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.10141
Pdf URL: https://arxiv.org/pdf/2408.10141
Copy Paste: [[2408.10141]] Instruction Finetuning for Leaderboard Generation from Empirical AI Research(https://arxiv.org/abs/2408.10141)
Keywords: language model, llm
Abstract: This study demonstrates the application of instruction finetuning of pretrained Large Language Models (LLMs) to automate the generation of AI research leaderboards, extracting (Task, Dataset, Metric, Score) quadruples from articles. It aims to streamline the dissemination of advancements in AI research by transitioning from traditional, manual community curation, or otherwise taxonomy-constrained natural language inference (NLI) models, to an automated, generative LLM-based approach. Utilizing the FLAN-T5 model, this research enhances LLMs' adaptability and reliability in information extraction, offering a novel method for structured knowledge representation.
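
To make the (Task, Dataset, Metric, Score) quadruple concrete, here is a minimal sketch of how generated output might be turned into structured records. The JSON output format, the field names, and the `LeaderboardEntry` and `parse_entries` names are hypothetical; the abstract does not specify how the model's output is serialized.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class LeaderboardEntry:
    """One (Task, Dataset, Metric, Score) quadruple."""
    task: str
    dataset: str
    metric: str
    score: str  # kept as text, since papers report scores in varied formats


def parse_entries(model_output):
    """Parse a JSON list of objects into entries, skipping malformed items.

    Assumes (hypothetically) that the model was prompted to emit JSON.
    """
    try:
        items = json.loads(model_output)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    entries = []
    for item in items:
        if not isinstance(item, dict):
            continue
        try:
            entries.append(
                LeaderboardEntry(
                    task=str(item["task"]),
                    dataset=str(item["dataset"]),
                    metric=str(item["metric"]),
                    score=str(item["score"]),
                )
            )
        except KeyError:
            continue  # drop incomplete quadruples instead of guessing fields
    return entries


# Placeholder values for illustration only, not results from any paper.
raw = '[{"task": "T1", "dataset": "D1", "metric": "M1", "score": "0.5"}, {"task": "T2"}]'
entries = parse_entries(raw)
```

Dropping incomplete items is one possible design choice; a real pipeline might instead log them for manual review.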

Title: Multilingual Needle in a Haystack: Investigating Long-Context Behavior of Multilingual Large Language Models

Authors: Amey Hengle, Prasoon Bajpai, Soham Dan, Tanmoy Chakraborty
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2408.10151
Pdf URL: https://arxiv.org/pdf/2408.10151
Copy Paste: [[2408.10151]] Multilingual Needle in a Haystack: Investigating Long-Context Behavior of Multilingual Large Language Models(https://arxiv.org/abs/2408.10151)
Keywords: language model, llm
Abstract: While recent large language models (LLMs) demonstrate remarkable abilities in responding to queries in diverse languages, their ability to handle long multilingual contexts is unexplored. As such, a systematic evaluation of the long-context capabilities of LLMs in multilingual settings is crucial, specifically in the context of information retrieval. To address this gap, we introduce the MultiLingual Needle-in-a-Haystack (MLNeedle) test, designed to assess a model's ability to retrieve relevant information (the needle) from a collection of multilingual distractor texts (the haystack). This test serves as an extension of the multilingual question-answering task, encompassing both monolingual and cross-lingual retrieval. We evaluate four state-of-the-art LLMs on MLNeedle. Our findings reveal that model performance can vary significantly with language and needle position. Specifically, we observe that model performance is the lowest when the needle is (i) in a language outside the English language family and (ii) located in the middle of the input context. Furthermore, although some models claim a context size of 8k tokens or greater, none demonstrate satisfactory cross-lingual retrieval performance as the context length increases. Our analysis provides key insights into the long-context behavior of LLMs in multilingual settings to guide future evaluation protocols. To our knowledge, this is the first study to investigate the multilingual long-context behavior of LLMs.
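
The needle-position variable mentioned in the abstract can be illustrated with a minimal sketch of how a test context might be assembled. This is an assumption about the general needle-in-a-haystack setup, not the MLNeedle implementation; the `build_context` name, the relative `position` parameter, and the blank-line separator are hypothetical.

```python
def build_context(needle, distractors, position):
    """Place the needle among distractor passages at a relative depth.

    position is in [0.0, 1.0]: 0.0 puts the needle first, 1.0 puts it last.
    Returns the joined context and the index at which the needle was placed.
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be between 0.0 and 1.0")
    index = int(position * len(distractors))
    passages = distractors[:index] + [needle] + distractors[index:]
    return "\n\n".join(passages), index


# Placeholder passages; in a multilingual test these would be in different languages.
distractors = ["d0", "d1", "d2", "d3"]
start_context, start_index = build_context("NEEDLE", distractors, 0.0)
middle_context, middle_index = build_context("NEEDLE", distractors, 0.5)
end_context, end_index = build_context("NEEDLE", distractors, 1.0)
```

Sweeping `position` over several values while also varying the needle's language would give the kind of grid over which retrieval accuracy can be compared.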