2024-04-29

Title: Rumour Evaluation with Very Large Language Models

Authors: Dahlia Shehata, Robin Cohen, Charles Clarke
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2404.16859
Pdf URL: https://arxiv.org/pdf/2404.16859
Copy Paste: [[2404.16859]] Rumour Evaluation with Very Large Language Models(https://arxiv.org/abs/2404.16859)
Keywords: language model, gpt, llm, prompt
Abstract: Conversational prompt-engineering-based large language models (LLMs) have enabled targeted control over the output creation, enhancing versatility, adaptability and ad hoc retrieval. From another perspective, digital misinformation has reached alarming levels. The anonymity, availability and reach of social media offer fertile ground for rumours to propagate. This work proposes to leverage the advancement of prompting-dependent LLMs to combat misinformation by extending the research efforts of the RumourEval task on its Twitter dataset. To this end, we employ two prompting-based LLM variants (GPT-3.5-turbo and GPT-4) to extend the two RumourEval subtasks: (1) veracity prediction, and (2) stance classification. For veracity prediction, three classification schemes are evaluated per GPT variant. Each scheme is tested in zero-, one- and few-shot settings. Our best results outperform the previous ones by a substantial margin. For stance classification, prompting-based approaches show comparable performance to prior results, with no improvement over finetuning methods. The rumour stance subtask is also extended beyond the original setting to allow multiclass classification. All of the generated predictions for both subtasks are equipped with confidence scores determining their trustworthiness degree according to the LLM, and post-hoc justifications for explainability and interpretability purposes. Our primary aim is AI for social good.
摘要：基于会话提示工程的大语言模型 (LLM) 实现了对输出创建的有针对性的控制，增强了多功能性、适应性和即席检索。从另一个角度来看，数字错误信息已经达到了令人震惊的程度。社交媒体的匿名性、可用性和影响力为谣言传播提供了肥沃的土壤。这项工作建议通过扩展 RumourEval 任务在 Twitter 数据集上的研究工作，利用依赖提示的 LLM 的进步来打击错误信息。为此，我们采用两种基于提示的 LLM 变体（GPT-3.5-turbo 和 GPT-4）来扩展两个 RumourEval 子任务：（1）真实性预测和（2）立场分类。对于真实性预测，每个 GPT 变体都试验了三种分类方案。每个方案都在零样本、单样本和少样本设置中进行测试。我们的最佳结果大大优于之前的结果。对于立场分类，基于提示的方法显示出与之前结果相当的性能，与微调方法相比没有任何改进。谣言立场子任务也扩展到原始设置之外，以允许多类分类。两个子任务生成的所有预测都配备了根据 LLM 确定其可信度的置信度分数，以及用于可解释性和可解释性目的的事后理由。我们的主要目标是人工智能造福社会。

Title: Samsung Research China-Beijing at SemEval-2024 Task 3: A multi-stage framework for Emotion-Cause Pair Extraction in Conversations

Authors: Shen Zhang, Haojie Zhang, Jing Zhang, Xudong Zhang, Yimeng Zhuang, Jinting Wu
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2404.16905
Pdf URL: https://arxiv.org/pdf/2404.16905
Copy Paste: [[2404.16905]] Samsung Research China-Beijing at SemEval-2024 Task 3: A multi-stage framework for Emotion-Cause Pair Extraction in Conversations(https://arxiv.org/abs/2404.16905)
Keywords: agent
Abstract: In human-computer interaction, it is crucial for agents to respond to humans by understanding their emotions. Unraveling the causes of emotions is more challenging. A new task named Multimodal Emotion-Cause Pair Extraction in Conversations is responsible for recognizing emotion and identifying causal expressions. In this study, we propose a multi-stage framework to generate emotion and extract the emotion causal pairs given the target emotion. In the first stage, Llama-2-based InstructERC is utilized to extract the emotion category of each utterance in a conversation. After emotion recognition, a two-stream attention model is employed to extract the emotion causal pairs given the target emotion for subtask 2 while MuTEC is employed to extract causal span for subtask 1. Our approach achieved first place for both of the two subtasks in the competition.
摘要：在人机交互中，智能体通过理解人类的情绪来做出反应至关重要。揭开情绪的成因更具挑战性。一项名为“对话中的多模态情绪-原因对提取”的新任务负责识别情绪并识别因果表达。在本研究中，我们提出了一个多阶段框架来生成情感并提取给定目标情感的情感因果对。在第一阶段，利用基于 Llama-2 的 InstructERC 来提取对话中每个话语的情感类别。在情感识别之后，使用双流注意力模型来提取给定子任务 2 的目标情感的情感因果对，同时使用 MuTEC 来提取子任务 1 的因果跨度。我们的方法在竞赛的两个子任务中都获得了第一名。

Title: Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks

Authors: Melissa Ailem, Katerina Marazopoulou, Charlotte Siska, James Bono
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.16966
Pdf URL: https://arxiv.org/pdf/2404.16966
Copy Paste: [[2404.16966]] Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks(https://arxiv.org/abs/2404.16966)
Keywords: language model, llm, prompt
Abstract: Benchmarks have emerged as the central approach for evaluating Large Language Models (LLMs). The research community often relies on a model's average performance across the test prompts of a benchmark to evaluate the model's performance. This is consistent with the assumption that the test prompts within a benchmark represent a random sample from a real-world distribution of interest. We note that this is generally not the case; instead, we hold that the distribution of interest varies according to the specific use case. We find that (1) the correlation in model performance across test prompts is non-random, (2) accounting for correlations across test prompts can change model rankings on major benchmarks, (3) explanatory factors for these correlations include semantic similarity and common LLM failure points.
摘要：基准已成为评估大型语言模型 (LLM) 的核心方法。研究社区通常依赖模型在基准测试提示下的平均性能来评估模型的性能。这与基准测试中的测试提示代表来自真实世界感兴趣分布的随机样本的假设是一致的。我们注意到，通常情况并非如此；相反，我们认为兴趣的分配根据具体用例而变化。我们发现（1）测试提示之间的模型性能相关性是非随机的，（2）考虑测试提示之间的相关性可以改变主要基准上的模型排名，（3）这些相关性的解释因素包括语义相似性和常见的 LLM故障点。
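
The point that a benchmark average presumes one particular distribution over test prompts can be illustrated with a toy calculation. The sketch below is not the authors' method: the per-prompt results and the weights are made-up values, and `weighted_score` is a hypothetical helper. It only shows that re-weighting prompts for a different use case can flip a model ranking.

```python
def weighted_score(correct, weights):
    """Weighted accuracy of one model over a list of test prompts."""
    total = sum(weights)
    return sum(c * w for c, w in zip(correct, weights)) / total

# Hypothetical per-prompt correctness (1 = correct, 0 = wrong).
model_a = [1, 1, 1, 1, 0]
model_b = [0, 0, 1, 1, 1]

# Uniform weights correspond to the usual benchmark average.
uniform = [1, 1, 1, 1, 1]
# Hypothetical use case in which the last prompt type dominates.
use_case = [1, 1, 1, 1, 6]

for name, weights in (("uniform", uniform), ("use case", use_case)):
    score_a = weighted_score(model_a, weights)
    score_b = weighted_score(model_b, weights)
    print(name, "A:", round(score_a, 2), "B:", round(score_b, 2))
```

Under uniform weights model A ranks first; under the hypothetical use-case weights model B does.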

Title: Evaluating Class Membership Relations in Knowledge Graphs using Large Language Models

Authors: Bradley P. Allen, Paul T. Groth
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.17000
Pdf URL: https://arxiv.org/pdf/2404.17000
Copy Paste: [[2404.17000]] Evaluating Class Membership Relations in Knowledge Graphs using Large Language Models(https://arxiv.org/abs/2404.17000)
Keywords: language model, gpt, chain-of-thought
Abstract: Class membership relations, which assign entities to a given class, form a backbone of knowledge graphs. As part of the knowledge engineering process, we propose a new method for evaluating the quality of these relations by processing descriptions of a given entity and class using a zero-shot chain-of-thought classifier that uses a natural language intensional definition of a class. We evaluate the method using two publicly available knowledge graphs, Wikidata and CaLiGraph, and 7 large language models. Using the gpt-4-0125-preview large language model, the method's classification performance achieves a macro-averaged F1-score of 0.830 on data from Wikidata and 0.893 on data from CaLiGraph. Moreover, a manual analysis of the classification errors shows that 40.9% of errors were due to the knowledge graphs, with 16.0% due to missing relations and 24.9% due to incorrectly asserted relations. These results show how large language models can assist knowledge engineers in the process of knowledge graph refinement. The code and data are available on GitHub.
摘要：知识图的支柱是它们的类成员关系，它将实体分配给给定的类。作为知识工程过程的一部分，我们提出了一种新方法来评估这些关系的质量，通过使用零镜头思想链分类器处理给定实体和类的描述，该分类器使用类的自然语言内涵定义。我们使用两个公开可用的知识图（Wikidata 和 CaLiGraph）以及 7 个大型语言模型来评估该方法。使用 gpt-4-0125-preview 大语言模型，该方法的分类性能在 Wikidata 数据上实现了 0.830 的宏观平均 F1 分数，在 CaLiGraph 数据上实现了 0.893 的宏观平均 F1 分数。此外，对分类错误的手动分析显示，40.9%的错误是由于知识图造成的，其中16.0%是由于缺失关系造成的，24.9%是由于错误断言关系造成的。这些结果表明大型语言模型可以如何帮助知识工程师进行知识图细化过程。代码和数据可以在 Github 上找到。
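
For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a macro-averaged F1-score is computed: F1 is calculated per class and the per-class values are averaged with equal weight. The labels below are invented for illustration and are not data from the paper; `f1_for_class` and `macro_f1` are hypothetical helper names.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, l) for l in labels) / len(labels)

# Invented example: 1 = "entity is a member of the class", 0 = "not a member".
y_true = [1, 1, 1, 0]
y_pred = [1, 1, 0, 0]
print(round(macro_f1(y_true, y_pred), 3))
```

Because each class contributes equally, a rare class affects the macro average as much as a frequent one.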

Title: Türkçe Dil Modellerinin Performans Karşılaştırması Performance Comparison of Turkish Language Models

Authors: Eren Dogan, M. Egemen Uzun, Atahan Uz, H. Emre Seyrek, Ahmed Zeer, Ezgi Sevi, H. Toprak Kesgin, M. Kaan Yuce, M. Fatih Amasyali
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.17010
Pdf URL: https://arxiv.org/pdf/2404.17010
Copy Paste: [[2404.17010]] Türkçe Dil Modellerinin Performans Karşılaştırması Performance Comparison of Turkish Language Models(https://arxiv.org/abs/2404.17010)
Keywords: language model
Abstract: The developments that language models have provided in fulfilling almost all kinds of tasks have attracted the attention of not only researchers but also the society and have enabled them to become products. There are commercially successful language models available. However, users may prefer open-source language models due to cost, data privacy, or regulations. Yet, despite the increasing number of these models, there is no comprehensive comparison of their performance for Turkish. This study aims to fill this gap in the literature. A comparison is made among seven selected language models based on their contextual learning and question-answering abilities. Turkish datasets for contextual learning and question-answering were prepared, and both automatic and human evaluations were conducted. The results show that for question-answering, continuing pretraining before fine-tuning with instructional datasets is more successful in adapting multilingual models to Turkish and that in-context learning performance is not closely related to question-answering performance.
摘要：语言模型在完成几乎所有类型的任务方面所提供的发展不仅引起了研究人员的关注，也引起了社会的关注，并使其成为产品。有商业上成功的语言模型可用。然而，由于成本、数据隐私或法规的原因，用户可能更喜欢开源语言模型。然而，尽管这些模型的数量不断增加，但还没有对它们在土耳其语中的性能进行全面比较。本研究旨在填补这一文献空白。根据语境学习和问答能力对七种选定的语言模型进行了比较。准备了用于情境学习和问答的土耳其数据集，并进行了自动和人工评估。结果表明，对于问答来说，在使用教学数据集进行微调之前继续进行预训练可以更成功地使多语言模型适应土耳其语，并且上下文学习性能与问答性能没有太大关系。

Title: Player-Driven Emergence in LLM-Driven Game Narrative

Authors: Xiangyu Peng, Jessica Quaye, Weijia Xu, Chris Brockett, Bill Dolan, Nebojsa Jojic, Gabriel DesGarennes, Ken Lobb, Michael Xu, Jorge Leandro, Claire Jin, Sudha Rao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.17027
Pdf URL: https://arxiv.org/pdf/2404.17027
Copy Paste: [[2404.17027]] Player-Driven Emergence in LLM-Driven Game Narrative(https://arxiv.org/abs/2404.17027)
Keywords: language model, gpt, llm
Abstract: We explore how interaction with large language models (LLMs) can give rise to emergent behaviors, empowering players to participate in the evolution of game narratives. Our testbed is a text-adventure game in which players attempt to solve a mystery under a fixed narrative premise, but can freely interact with non-player characters generated by GPT-4, a large language model. We recruit 28 gamers to play the game and use GPT-4 to automatically convert the game logs into a node-graph representing the narrative in the player's gameplay. We find that through their interactions with the non-deterministic behavior of the LLM, players are able to discover interesting new emergent nodes that were not a part of the original narrative but have potential for being fun and engaging. Players that created the most emergent nodes tended to be those that often enjoy games that facilitate discovery, exploration and experimentation.
摘要：我们探索与大型语言模型 (LLM) 的交互如何引发突发行为，使玩家能够参与游戏叙事的演变。我们的测试平台是一款文本冒险游戏，玩家试图在固定的叙事前提下解开谜团，但可以与大型语言模型 GPT-4 生成的非玩家角色自由互动。我们招募 28 名玩家来玩游戏，并使用 GPT-4 自动将游戏日志转换为代表玩家游戏中的叙述的节点图。我们发现，通过与 LLM 的非确定性行为的互动，玩家能够发现有趣的新出现的节点，这些节点不是原始叙述的一部分，但有可能变得有趣和吸引人。创建最多新兴节点的玩家往往是那些经常喜欢促进发现、探索和实验的游戏的玩家。

Title: Talking Nonsense: Probing Large Language Models' Understanding of Adversarial Gibberish Inputs

Authors: Valeriia Cherepanova, James Zou
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2404.17120
Pdf URL: https://arxiv.org/pdf/2404.17120
Copy Paste: [[2404.17120]] Talking Nonsense: Probing Large Language Models' Understanding of Adversarial Gibberish Inputs(https://arxiv.org/abs/2404.17120)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) exhibit excellent ability to understand human languages, but do they also understand their own language that appears gibberish to us? In this work we delve into this question, aiming to uncover the mechanisms underlying such behavior in LLMs. We employ the Greedy Coordinate Gradient optimizer to craft prompts that compel LLMs to generate coherent responses from seemingly nonsensical inputs. We call these inputs LM Babel and this work systematically studies the behavior of LLMs manipulated by these prompts. We find that the manipulation efficiency depends on the target text's length and perplexity, with the Babel prompts often located in lower loss minima compared to natural prompts. We further examine the structure of the Babel prompts and evaluate their robustness. Notably, we find that guiding the model to generate harmful texts is no more difficult than guiding it to generate benign texts, suggesting lack of alignment for out-of-distribution prompts.
摘要：大型语言模型（LLM）表现出出色的理解人类语言的能力，但它们是否也理解自己的语言，而这些语言对我们来说似乎是无稽之谈？在这项工作中，我们深入研究了这个问题，旨在揭示 LLM 此类行为背后的机制。我们使用贪婪坐标梯度优化器来制作提示，迫使 LLM 从看似无意义的输入中生成连贯的响应。我们将这些输入称为 LM Babel，这项工作系统地研究了受这些提示操纵的 LLM 的行为。我们发现操纵效率取决于目标文本的长度和困惑度，与自然提示相比，Babel 提示通常位于较低的损失最小值。我们进一步检查 Babel 提示的结构并评估其稳健性。值得注意的是，我们发现引导模型生成有害文本并不比生成良性文本更困难，这表明分布外提示缺乏对齐。

Title: Small Language Models Need Strong Verifiers to Self-Correct Reasoning

Authors: Yunxiang Zhang, Muhammad Khalifa, Lajanugen Logeswaran, Jaekyeom Kim, Moontae Lee, Honglak Lee, Lu Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.17140
Pdf URL: https://arxiv.org/pdf/2404.17140
Copy Paste: [[2404.17140]] Small Language Models Need Strong Verifiers to Self-Correct Reasoning(https://arxiv.org/abs/2404.17140)
Keywords: language model, gpt, llm, prompt
Abstract: Self-correction has emerged as a promising solution to boost the reasoning performance of large language models (LLMs), where LLMs refine their solutions using self-generated critiques that pinpoint the errors. This work explores whether smaller-size (<= 13B) language models (LMs) have the ability of self-correction on reasoning tasks with minimal inputs from stronger LMs. We propose a novel pipeline that prompts smaller LMs to collect self-correction data that supports the training of self-refinement abilities. First, we leverage correct solutions to guide the model in critiquing their incorrect responses. Second, the generated critiques, after filtering, are used for supervised fine-tuning of the self-correcting reasoner through solution refinement. Our experimental results show improved self-correction abilities of two models on five datasets spanning math and commonsense reasoning, with notable performance gains when paired with a strong GPT-4-based verifier, though limitations are identified when using a weak self-verifier for determining when to correct.
摘要：自我修正已成为一种有前景的解决方案，可提高大型语言模型 (LLM) 的推理性能，其中 LLM 使用自我生成的批评来精确指出错误，从而完善其解决方案。这项工作探讨了较小规模 (<= 13B) 的语言模型 (LM) 是否能够在仅需来自更强 LM 的最少输入的情况下，在推理任务上进行自我修正。我们提出了一种新颖的管道，可以促使较小的语言模型收集支持自我完善能力训练的自我修正数据。首先，我们利用正确的解决方案来指导模型批评其错误反应。其次，生成的批评经过过滤后，用于通过解决方案细化对自校正推理器进行监督微调。我们的实验结果表明，两个模型在涵盖数学和常识推理的五个数据集上的自我校正能力得到了提高，与基于 GPT-4 的强大验证器配合使用时，性能显着提升，尽管在使用弱自我验证器来确定何时纠正时存在局限性。

Title: Quantifying Memorization of Domain-Specific Pre-trained Language Models using Japanese Newspaper and Paywalls

Authors: Shotaro Ishihara
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.17143
Pdf URL: https://arxiv.org/pdf/2404.17143
Copy Paste: [[2404.17143]] Quantifying Memorization of Domain-Specific Pre-trained Language Models using Japanese Newspaper and Paywalls(https://arxiv.org/abs/2404.17143)
Keywords: language model, gpt, prompt
Abstract: Dominant pre-trained language models (PLMs) have been successful in high-quality natural language generation. However, the analysis of their generation is not mature: do they acquire generalizable linguistic abstractions, or do they simply memorize and recover substrings of the training data? In particular, few studies focus on domain-specific PLMs. In this study, we pre-trained domain-specific GPT-2 models using a limited corpus of Japanese newspaper articles and quantified memorization of training data by comparing them with general Japanese GPT-2 models. Our experiments revealed that domain-specific PLMs sometimes "copy and paste" on a large scale. Furthermore, we replicated in Japanese the empirical finding from previous English studies that memorization is related to duplication, model size, and prompt length. Our evaluations are free from data contamination concerns because they focus on newspaper paywalls, which prevent the articles' use as training data. We hope that our paper encourages a sound discussion of issues such as the security and copyright of PLMs.
摘要：主流预训练语言模型 (PLM) 在高质量自然语言生成方面取得了成功。然而，对它们生成的分析还不成熟：它们是否获得了可推广的语言抽象，还是只是记忆和恢复训练数据的子字符串？特别是，很少有研究关注特定领域的 PLM。在本研究中，我们使用有限的日语报纸文章语料库对特定领域的 GPT-2 模型进行了预训练，并通过将它们与一般的日语 GPT-2 模型进行比较来量化训练数据的记忆。我们的实验表明，特定领域的 PLM 有时会大规模“复制和粘贴”。此外，我们复制了经验发现，即记忆与重复、模型大小和提示长度有关，日语与之前的英语研究一样。我们的评估通过关注报纸付费墙（这阻止了它们用作训练数据）而摆脱了数据污染问题。我们希望我们的论文能够鼓励对 PLM 的安全性和版权等进行合理的讨论。

Title: A Unified Label-Aware Contrastive Learning Framework for Few-Shot Named Entity Recognition

Authors: Haojie Zhang, Yimeng Zhuang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.17178
Pdf URL: https://arxiv.org/pdf/2404.17178
Copy Paste: [[2404.17178]] A Unified Label-Aware Contrastive Learning Framework for Few-Shot Named Entity Recognition(https://arxiv.org/abs/2404.17178)
Keywords: prompt
Abstract: Few-shot Named Entity Recognition (NER) aims to extract named entities using only a limited number of labeled examples. Existing contrastive learning methods often suffer from insufficient distinguishability in context vector representation because they either solely rely on label semantics or completely disregard them. To tackle this issue, we propose a unified label-aware token-level contrastive learning framework. Our approach enriches the context by utilizing label semantics as suffix prompts. Additionally, it simultaneously optimizes context-context and context-label contrastive learning objectives to enhance generalized discriminative contextual representations. Extensive experiments on various traditional test domains (OntoNotes, CoNLL'03, WNUT'17, GUM, I2B2) and the large-scale few-shot NER dataset (FEWNERD) demonstrate the effectiveness of our approach. It outperforms prior state-of-the-art models by a significant margin, achieving an average absolute gain of 7% in micro F1 scores across most scenarios. Further analysis reveals that our model benefits from its powerful transfer capability and improved contextual representations.
摘要：Few-shot 命名实体识别 (NER) 旨在仅使用有限数量的标记示例来提取命名实体。现有的对比学习方法常常面临上下文向量表示的可区分性不足的问题，因为它们要么仅仅依赖标签语义，要么完全忽视它们。为了解决这个问题，我们提出了一个统一的标签感知令牌级对比学习框架。我们的方法通过利用标签语义作为后缀提示来丰富上下文。此外，它同时优化上下文-上下文和上下文-标签对比学习目标，以增强广义判别性上下文表示。在各种传统测试领域（OntoNotes、CoNLL'03、WNUT'17、GUM、I2B2）和大规模 few-shot NER 数据集（FEWNERD）上进行的大量实验证明了我们方法的有效性。它的性能显着优于之前最先进的模型，在大多数情况下，微 F1 分数的平均绝对增益达到 7%。进一步的分析表明，我们的模型受益于其强大的迁移能力和改进的上下文表示。

Title: Prompting Towards Alleviating Code-Switched Data Scarcity in Under-Resourced Languages with GPT as a Pivot

Authors: Michelle Terblanche, Kayode Olaleye, Vukosi Marivate
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.17216
Pdf URL: https://arxiv.org/pdf/2404.17216
Copy Paste: [[2404.17216]] Prompting Towards Alleviating Code-Switched Data Scarcity in Under-Resourced Languages with GPT as a Pivot(https://arxiv.org/abs/2404.17216)
Keywords: language model, gpt, prompt
Abstract: Many multilingual communities, including numerous in Africa, frequently engage in code-switching during conversations. This behaviour stresses the need for natural language processing technologies adept at processing code-switched text. However, data scarcity, particularly in African languages, poses a significant challenge, as many are low-resourced and under-represented. In this study, we prompted GPT 3.5 to generate Afrikaans-English and Yoruba-English code-switched sentences, enhancing diversity using topic-keyword pairs, linguistic guidelines, and few-shot examples. Our findings indicate that the quality of generated sentences for languages using non-Latin scripts, like Yoruba, is considerably lower when compared with the high Afrikaans-English success rate. There is therefore a notable opportunity to refine prompting guidelines to yield sentences suitable for the fine-tuning of language models. We propose a framework for augmenting the diversity of synthetically generated code-switched data using GPT and propose leveraging this technology to mitigate data scarcity in low-resourced languages, underscoring the essential role of native speakers in this process.
摘要：许多多语言社区，包括非洲的许多多语言社区，在对话过程中经常进行语码转换。这种行为强调了对擅长处理语码转换文本的自然语言处理技术的需求。然而，数据稀缺，特别是非洲语言的数据稀缺，构成了重大挑战，因为许多语言资源匮乏且代表性不足。在这项研究中，我们提示 GPT 3.5 生成南非荷兰语-英语和约鲁巴语-英语语码转换句子，使用主题关键字对、语言指南和少量示例来增强多样性。我们的研究结果表明，与南非荷兰语-英语的高成功率相比，使用非拉丁文字（如约鲁巴语）的语言生成的句子质量要低得多。因此，这是一个改进提示准则以产生适合语言模型微调的句子的显着机会。我们提出了一个使用 GPT 增强合成生成的语码转换数据多样性的框架，并建议利用该技术来缓解资源匮乏语言的数据稀缺性，强调母语人士在这一过程中的重要作用。

Title: Prompting Techniques for Reducing Social Bias in LLMs through System 1 and System 2 Cognitive Processes

Authors: Mahammed Kamruzzaman, Gene Louis Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.17218
Pdf URL: https://arxiv.org/pdf/2404.17218
Copy Paste: [[2404.17218]] Prompting Techniques for Reducing Social Bias in LLMs through System 1 and System 2 Cognitive Processes(https://arxiv.org/abs/2404.17218)
Keywords: llm, prompt, chain-of-thought
Abstract: Dual process theory posits that human cognition arises via two systems: System 1, a quick, emotional, and intuitive process that is subject to cognitive biases, and System 2, a slow, onerous, and deliberate process. NLP researchers often compare zero-shot prompting in LLMs to System 1 reasoning and chain-of-thought (CoT) prompting to System 2. In line with this interpretation, prior research has found that using CoT prompting in LLMs leads to reduced gender bias. We investigate the relationship between bias, CoT prompting, and dual process theory in LLMs directly. We compare zero-shot, CoT, and a variety of dual process theory-based prompting strategies on two bias datasets spanning nine different social bias categories. We also use human and machine personas to determine whether the effects of dual process theory in LLMs are based on modeling human cognition or inherent to the system. We find that a human persona, System 2, and CoT prompting all tend to reduce social biases in LLMs, though the best combination of features depends on the exact model and bias category, resulting in up to a 13 percent drop in stereotypical judgments by an LLM.
摘要：双过程理论认为人类认知是通过两个系统产生的。系统1是一个快速、情绪化和直觉的过程，容易受到认知偏差的影响；系统2是一个缓慢、繁重和深思熟虑的过程。NLP 研究人员经常将 LLM 中的零样本提示与系统 1 推理进行比较，将思维链 (CoT) 提示与系统 2 进行比较。根据这种解释，先前的研究发现，在 LLM 中使用 CoT 提示可以减少性别偏见。我们直接研究 LLM 中的偏见、CoT 提示和双过程理论之间的关系。我们在跨越九个不同社会偏见类别的两个偏见数据集上比较了零样本、CoT 和各种基于双过程理论的提示策略。我们还使用人类和机器角色来确定 LLM 中双过程理论的效果是基于人类认知建模还是系统固有的。我们发现，人类角色、系统 2 和 CoT 提示都倾向于减少 LLM 的社会偏见，尽管特征的最佳组合取决于确切的模型和偏见类别，最多可使 LLM 的刻板判断下降 13%。

Title: Reinforcement Retrieval Leveraging Fine-grained Feedback for Fact Checking News Claims with Black-Box LLM

Authors: Xuan Zhang, Wei Gao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.17283
Pdf URL: https://arxiv.org/pdf/2404.17283
Copy Paste: [[2404.17283]] Reinforcement Retrieval Leveraging Fine-grained Feedback for Fact Checking News Claims with Black-Box LLM(https://arxiv.org/abs/2404.17283)
Keywords: language model, llm
Abstract: Retrieval-augmented language models have exhibited promising performance across various areas of natural language processing (NLP), including fact-critical tasks. However, due to the black-box nature of advanced large language models (LLMs) and the non-retrieval-oriented supervision signal of specific tasks, training the retrieval model faces significant challenges in the black-box LLM setting. We propose an approach leveraging Fine-grained Feedback with Reinforcement Retrieval (FFRR) to enhance fact-checking on news claims using a black-box LLM. FFRR adopts a two-level strategy to gather fine-grained feedback from the LLM, which serves as a reward for optimizing the retrieval policy, by rating the retrieved documents based on the non-retrieval ground truth of the task. We evaluate our model on two public datasets for real-world news claim verification, and the results demonstrate that FFRR achieves significant improvements over strong LLM-enabled and non-LLM baselines.
摘要：检索增强语言模型在自然语言处理（NLP）的各个领域（包括事实关键任务）都表现出了良好的性能。然而，由于高级大语言模型（LLM）的黑盒性质以及特定任务的非检索导向的监督信号，检索模型的训练在黑盒LLM的背景下面临着重大挑战。我们提出了一种利用细粒度反馈与强化检索（FFRR）的方法，通过使用黑盒 LLM 来增强对新闻声明的事实核查。FFRR 采用两级策略从 LLM 收集细粒度的反馈，通过根据任务的非检索基本事实对检索到的文档进行评级，作为优化检索策略的奖励。我们在两个公共数据集上评估我们的模型，以验证真实世界的新闻主张，结果表明 FFRR 比支持 LLM 和非 LLM 的强大基线取得了显着改进。

Title: When to Trust LLMs: Aligning Confidence with Response Quality

Authors: Shuchang Tao, Liuyi Yao, Hanxing Ding, Yuexiang Xie, Qi Cao, Fei Sun, Jinyang Gao, Huawei Shen, Bolin Ding
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.17287
Pdf URL: https://arxiv.org/pdf/2404.17287
Copy Paste: [[2404.17287]] When to Trust LLMs: Aligning Confidence with Response Quality(https://arxiv.org/abs/2404.17287)
Keywords: language model, llm
Abstract: Despite the success of large language models (LLMs) in natural language generation, much evidence shows that LLMs may produce incorrect or nonsensical text. This limitation highlights the importance of discerning when to trust LLMs, especially in safety-critical domains. Existing methods, which rely on verbalized confidence to indicate reliability by inducing top-k responses and sampling and aggregating multiple responses, often fail due to the lack of objective guidance for confidence. To address this, we propose the CONfidence-Quality-ORDer-preserving alignment approach (CONQORD), leveraging reinforcement learning with a tailored dual-component reward function. This function encompasses quality reward and order-preserving alignment reward functions. Specifically, the order-preserving reward incentivizes the model to verbalize greater confidence for responses of higher quality to align the order of confidence and quality. Experiments demonstrate that our CONQORD significantly improves the alignment performance between confidence levels and response accuracy, without causing the model to become over-cautious. Furthermore, the aligned confidence provided by CONQORD informs when to trust LLMs, and acts as a determinant for initiating the retrieval process of external knowledge. Aligning confidence with response quality ensures more transparent and reliable responses, providing better trustworthiness.
摘要：尽管大型语言模型 (LLM) 在自然语言生成方面取得了成功，但许多证据表明 LLM 可能会生成不正确或无意义的文本。这一限制凸显了辨别何时信任 LLM 的重要性，尤其是在安全关键领域。现有的方法依靠言语表达信心，通过诱导 top-k 响应和采样聚合多个响应来判断可靠性，但由于缺乏客观的信心指导，常常会失败。为了解决这个问题，我们提出了 CONfidence-Quality-ORDerpreserving 对齐方法（CONQORD），利用强化学习和定制的双成分奖励函数。该函数包括质量奖励和保序对齐奖励函数。具体来说，保序奖励会激励模型对更高质量的响应表达更大的信心，以对齐信心和质量的顺序。实验表明，我们的 CONQORD 显着提高了置信水平和响应准确性之间的对齐性能，而不会导致模型变得过于谨慎。此外，CONQORD 提供的对齐置信度告知何时信任 LLM，并充当启动外部知识检索过程的决定因素。将信心与响应质量对齐可确保响应更加透明和可靠，从而提供更好的可信度。
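
The idea that confidence should rise and fall with response quality can be checked with a simple pairwise diagnostic. The sketch below is an illustration only and is not CONQORD's reward function: `order_agreement` is a hypothetical helper, and the confidence and quality values are invented. It measures the fraction of response pairs in which the more confident response is also the higher-quality one.

```python
from itertools import combinations

def order_agreement(confidences, qualities):
    """Fraction of response pairs ranked the same way by confidence and quality.

    Pairs tied on either value are skipped. Returns 0.0 if no pair is comparable.
    """
    agree = 0
    total = 0
    for (c1, q1), (c2, q2) in combinations(zip(confidences, qualities), 2):
        if c1 == c2 or q1 == q2:
            continue
        total += 1
        if (c1 - c2) * (q1 - q2) > 0:
            agree += 1
    return agree / total if total else 0.0

# Invented values: verbalized confidence and a quality score per response.
confidences = [0.9, 0.5, 0.1]
qualities = [1.0, 0.6, 0.2]
print(order_agreement(confidences, qualities))
```

A value near 1 means confidence preserves the quality ordering; a value near 0 means it inverts it.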

Title: Introducing cosmosGPT: Monolingual Training for Turkish Language Models

Authors: H. Toprak Kesgin, M. Kaan Yuce, Eren Dogan, M. Egemen Uzun, Atahan Uz, H. Emre Seyrek, Ahmed Zeer, M. Fatih Amasyali
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.17336
Pdf URL: https://arxiv.org/pdf/2404.17336
Copy Paste: [[2404.17336]] Introducing cosmosGPT: Monolingual Training for Turkish Language Models(https://arxiv.org/abs/2404.17336)
Keywords: language model, gpt
Abstract: The number of open-source language models that can produce Turkish is increasing day by day, as in other languages. In order to create the basic versions of such models, the training of multilingual models is usually continued with Turkish corpora. The alternative is to train the model with only Turkish corpora. In this study, we first introduce the cosmosGPT models that we created with this alternative method. Then, we introduce new fine-tuning datasets for basic language models to fulfill user requests and new evaluation datasets for measuring the capabilities of Turkish language models. Finally, a comprehensive comparison of the adapted Turkish language models on different capabilities is presented. The results show that the language models we built with the monolingual corpus have promising performance despite being about 10 times smaller than the others.
摘要：与其他语言一样，可以生成土耳其语的开源语言模型的数量正在日益增加。为了创建此类模型的基本版本，通常使用土耳其语料库继续进行多语言模型的训练。另一种方法是仅使用土耳其语料库来训练模型。在本研究中，我们首先介绍用这种替代方法创建的 cosmosGPT 模型。然后，我们引入了用于基本语言模型的新微调数据集以满足用户请求，以及用于测量土耳其语言模型能力的新评估数据集。最后，对不同功能的适应土耳其语言模型进行了全面比较。结果表明，我们用单语语料库构建的语言模型尽管比其他模型小 10 倍左右，但仍具有良好的性能。

Title: Evaluation of Geographical Distortions in Language Models: A Crucial Step Towards Equitable Representations

Authors: Rémy Decoupes, Roberto Interdonato, Mathieu Roche, Maguelonne Teisseire, Sarah Valentin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.17401
Pdf URL: https://arxiv.org/pdf/2404.17401
Copy Paste: [[2404.17401]] Evaluation of Geographical Distortions in Language Models: A Crucial Step Towards Equitable Representations(https://arxiv.org/abs/2404.17401)
Keywords: language model
Abstract: Language models now constitute essential tools for improving efficiency for many professional tasks such as writing, coding, or learning. For this reason, it is imperative to identify inherent biases. In the field of Natural Language Processing, five sources of bias are well-identified: data, annotation, representation, models, and research design. This study focuses on biases related to geographical knowledge. We explore the connection between geography and language models by highlighting their tendency to misrepresent spatial information, thus leading to distortions in the representation of geographical distances. This study introduces four indicators to assess these distortions, by comparing geographical and semantic distances. Experiments are conducted from these four indicators with ten widely used language models. Results underscore the critical necessity of inspecting and rectifying spatial biases in language models to ensure accurate and equitable representations.
摘要：语言模型现在成为提高写作、编码或学习等许多专业任务效率的重要工具。因此，必须识别固有偏见。在自然语言处理领域，有五个明显的偏见来源：数据、注释、表示、模型和研究设计。本研究的重点是与地理知识相关的偏见。我们通过强调地理和语言模型歪曲空间信息的倾向来探索地理和语言模型之间的联系，从而导致地理距离表示的扭曲。本研究通过比较地理和语义距离，引入了四个指标来评估这些扭曲。根据这四个指标，用十种广泛使用的语言模型进行了实验。结果强调了检查和纠正语言模型中的空间偏差以确保准确和公平的表示的至关重要性。
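
Comparing geographical and semantic distances, as described above, requires two distance functions. The sketch below is not one of the paper's four indicators: it simply pairs the haversine great-circle distance with cosine distance between embeddings. The three-dimensional embedding vectors are invented toy values, not outputs of any real language model, and the function names are hypothetical.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# City coordinates (latitude, longitude) and invented toy embeddings.
cities = {
    "Paris": ((48.8566, 2.3522), [0.9, 0.1, 0.2]),
    "London": ((51.5074, -0.1278), [0.8, 0.2, 0.3]),
}
(lat_p, lon_p), emb_p = cities["Paris"]
(lat_l, lon_l), emb_l = cities["London"]
print("geographic km:", round(haversine_km(lat_p, lon_p, lat_l, lon_l)))
print("semantic distance:", round(cosine_distance(emb_p, emb_l), 3))
```

With real embeddings, a large mismatch between the two distances across many city pairs would suggest the kind of distortion the study sets out to measure.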

Title: Ruffle&Riley: Insights from Designing and Evaluating a Large Language Model-Based Conversational Tutoring System

Authors: Robin Schmucker, Meng Xia, Amos Azaria, Tom Mitchell
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.17460
Pdf URL: https://arxiv.org/pdf/2404.17460
Copy Paste: [[2404.17460]] Ruffle&Riley: Insights from Designing and Evaluating a Large Language Model-Based Conversational Tutoring System(https://arxiv.org/abs/2404.17460)
Keywords: language model, llm, chat, agent
Abstract: Conversational tutoring systems (CTSs) offer learning experiences through interactions based on natural language. They are recognized for promoting cognitive engagement and improving learning outcomes, especially in reasoning tasks. Nonetheless, the cost associated with authoring CTS content is a major obstacle to widespread adoption and to research on effective instructional design. In this paper, we discuss and evaluate a novel type of CTS that leverages recent advances in large language models (LLMs) in two ways: First, the system enables AI-assisted content authoring by inducing an easily editable tutoring script automatically from a lesson text. Second, the system automates the script orchestration in a learning-by-teaching format via two LLM-based agents (Ruffle&Riley) acting as a student and a professor. The system allows for free-form conversations that follow the ITS-typical inner and outer loop structure. We evaluate Ruffle&Riley's ability to support biology lessons in two between-subject online user studies (N = 200) comparing the system to simpler QA chatbots and reading activity. Analyzing system usage patterns, pre/post-test scores and user experience surveys, we find that Ruffle&Riley users report high levels of engagement, understanding and perceive the offered support as helpful. Even though Ruffle&Riley users require more time to complete the activity, we did not find significant differences in short-term learning gains over the reading activity. Our system architecture and user study provide various insights for designers of future CTSs. We further open-source our system to support ongoing research on effective instructional design of LLM-based learning technologies.

Title: CEval: A Benchmark for Evaluating Counterfactual Text Generation

Authors: Van Bach Nguyen, Jörg Schlötterer, Christin Seifert
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.17475
Pdf URL: https://arxiv.org/pdf/2404.17475
Copy Paste: [[2404.17475]] CEval: A Benchmark for Evaluating Counterfactual Text Generation(https://arxiv.org/abs/2404.17475)
Keywords: language model, llm, prompt
Abstract: Counterfactual text generation aims to minimally change a text, such that it is classified differently. Judging advancements in method development for counterfactual text generation is hindered by a non-uniform usage of data sets and metrics in related work. We propose CEval, a benchmark for comparing counterfactual text generation methods. CEval unifies counterfactual and text quality metrics, includes common counterfactual datasets with human annotations, standard baselines (MICE, GDBA, CREST) and the open-source language model LLAMA-2. Our experiments found no perfect method for generating counterfactual text. Methods that excel at counterfactual metrics often produce lower-quality text while LLMs with simple prompts generate high-quality text but struggle with counterfactual criteria. By making CEval available as an open-source Python library, we encourage the community to contribute more methods and maintain consistent evaluation in future work.

Title: A Comprehensive Evaluation on Event Reasoning of Large Language Models

Authors: Zhengwei Tao, Zhi Jin, Yifan Zhang, Xiancai Chen, Xiaoying Bai, Yue Fang, Haiyan Zhao, Jia Li, Chongyang Tao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.17513
Pdf URL: https://arxiv.org/pdf/2404.17513
Copy Paste: [[2404.17513]] A Comprehensive Evaluation on Event Reasoning of Large Language Models(https://arxiv.org/abs/2404.17513)
Keywords: language model, llm
Abstract: Event reasoning is a fundamental ability that underlies many applications. It requires event schema knowledge to perform global reasoning and must handle the diversity of inter-event relations and reasoning paradigms. How well LLMs accomplish event reasoning across various relations and reasoning paradigms remains unknown. To fill this gap, we comprehensively evaluate the event reasoning abilities of LLMs. We introduce a novel benchmark, EV2, for EValuation of EVent reasoning. EV2 consists of two levels of evaluation, schema and instance, and is comprehensive in relations and reasoning paradigms. We conduct extensive experiments on EV2. We find that LLMs are able to accomplish event reasoning, but their performance is far from satisfactory. We also observe an imbalance in the event reasoning abilities of LLMs. In addition, LLMs have event schema knowledge but are not aligned with humans on how to utilize it. Based on these findings, we introduce two methods to guide LLMs to utilize event schema knowledge. Both methods achieve improvements.