2024-06-11

Title: LLMs Are Not Intelligent Thinkers: Introducing Mathematical Topic Tree Benchmark for Comprehensive Evaluation of LLMs

Authors: Arash Gholami Davoodi, Seyed Pouyan Mousavi Davoudi, Pouya Pezeshkpour
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.05194
Pdf URL: https://arxiv.org/pdf/2406.05194
Copy Paste: [[2406.05194]] LLMs Are Not Intelligent Thinkers: Introducing Mathematical Topic Tree Benchmark for Comprehensive Evaluation of LLMs(https://arxiv.org/abs/2406.05194)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: Large language models (LLMs) demonstrate impressive capabilities in mathematical reasoning. However, despite these achievements, current evaluations are mostly limited to specific mathematical topics, and it remains unclear whether LLMs are genuinely engaging in reasoning. To address these gaps, we present the Mathematical Topics Tree (MaTT) benchmark, a challenging and structured benchmark that offers 1,958 questions across a wide array of mathematical subjects, each paired with a detailed hierarchical chain of topics. Upon assessing different LLMs using the MaTT benchmark, we find that the most advanced model, GPT-4, achieved a mere 54% accuracy in a multiple-choice scenario. Interestingly, even when employing Chain-of-Thought prompting, we observe mostly no notable improvement. Moreover, LLMs' accuracy dropped dramatically, by up to 24.2 percentage points, when the questions were presented without providing choices. Further detailed analysis of the LLMs' performance across a range of topics showed significant discrepancies even for closely related subtopics within the same general mathematical area. In an effort to pinpoint the reasons behind LLMs' performance, we conducted a manual evaluation of the completeness and correctness of the explanations generated by GPT-4 when choices were available. Surprisingly, we find that in only 53.3% of the instances where the model provided a correct answer, the accompanying explanations were deemed complete and accurate, i.e., the model engaged in genuine reasoning.

Title: On Subjective Uncertainty Quantification and Calibration in Natural Language Generation

Authors: Ziyu Wang, Chris Holmes
Subjects: cs.CL, cs.AI, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2406.05213
Pdf URL: https://arxiv.org/pdf/2406.05213
Copy Paste: [[2406.05213]] On Subjective Uncertainty Quantification and Calibration in Natural Language Generation(https://arxiv.org/abs/2406.05213)
Keywords: language model, gpt
Abstract: Applications of large language models often involve the generation of free-form responses, in which case uncertainty quantification becomes challenging. This is due to the need to identify task-specific uncertainties (e.g., about the semantics), which appear difficult to define in general cases. This work addresses these challenges from the perspective of Bayesian decision theory, starting from the assumption that our utility is characterized by a similarity measure that compares a generated response with a hypothetical true response. We discuss how this assumption enables principled quantification of the model's subjective uncertainty and its calibration. We further derive a measure for epistemic uncertainty, based on a missing data perspective and its characterization as an excess risk. The proposed measures can be applied to black-box language models. We demonstrate the proposed methods on question answering and machine translation tasks, where they extract broadly meaningful uncertainty estimates from GPT and Gemini models and quantify their calibration.

Title: Improving Logits-based Detector without Logits from Black-box LLMs

Authors: Cong Zeng, Shengkun Tang, Xianjun Yang, Yuanzhou Chen, Yiyou Sun, Zhiqiang Xu, Yao Li, Haifeng Chen, Wei Cheng, Dongkuan Xu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.05232
Pdf URL: https://arxiv.org/pdf/2406.05232
Copy Paste: [[2406.05232]] Improving Logits-based Detector without Logits from Black-box LLMs(https://arxiv.org/abs/2406.05232)
Keywords: language model, gpt, llm, chat
Abstract: The advent of Large Language Models (LLMs) has revolutionized text generation, producing outputs that closely mimic human writing. This blurring of lines between machine- and human-written text presents new challenges in distinguishing one from the other, a task further complicated by the frequent updates and closed nature of leading proprietary LLMs. Traditional logits-based detection methods leverage surrogate models for identifying LLM-generated content when the exact logits are unavailable from black-box LLMs. However, these methods grapple with the misalignment between the distributions of the surrogate and the often undisclosed target models, leading to performance degradation, particularly with the introduction of new, closed-source models. Furthermore, while current methodologies are generally effective when the source model is identified, they falter in scenarios where the model version remains unknown, or the test set comprises outputs from various source models. To address these limitations, we present Distribution-Aligned LLMs Detection (DALD), an innovative framework that redefines the state-of-the-art performance in black-box text detection even without logits from source LLMs. DALD is designed to align the surrogate model's distribution with that of unknown target LLMs, ensuring enhanced detection capability and resilience against rapid model iterations with minimal training investment. By leveraging corpus samples from publicly accessible outputs of advanced models such as ChatGPT, GPT-4 and Claude-3, DALD fine-tunes surrogate models to synchronize with unknown source model distributions effectively.

Title: Generative Explore-Exploit: Training-free Optimization of Generative Recommender Systems using LLM Optimizers

Authors: Lütfi Kerem Senel, Besnik Fetahu, Davis Yoshida, Zhiyu Chen, Giuseppe Castellucci, Nikhita Vedula, Jason Choi, Shervin Malmasi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.05255
Pdf URL: https://arxiv.org/pdf/2406.05255
Copy Paste: [[2406.05255]] Generative Explore-Exploit: Training-free Optimization of Generative Recommender Systems using LLM Optimizers(https://arxiv.org/abs/2406.05255)
Keywords: language model, llm
Abstract: Recommender systems are widely used to suggest engaging content, and Large Language Models (LLMs) have given rise to generative recommenders. Such systems can directly generate items, including for open-set tasks like question suggestion. While the world knowledge of LLMs enables good recommendations, improving the generated content through user feedback is challenging as continuously fine-tuning LLMs is prohibitively expensive. We present a training-free approach for optimizing generative recommenders by connecting user feedback loops to LLM-based optimizers. We propose a generative explore-exploit method that can not only exploit generated items with known high engagement, but also actively explore and discover hidden population preferences to improve recommendation quality. We evaluate our approach on question generation in two domains (e-commerce and general knowledge), and model user feedback with Click Through Rate (CTR). Experiments show our LLM-based explore-exploit approach can iteratively improve recommendations, and consistently increase CTR. Ablation analysis shows that generative exploration is key to learning user preferences, avoiding the pitfalls of greedy exploit-only approaches. A human evaluation strongly supports our quantitative findings.

Title: SuperPos-Prompt: Enhancing Soft Prompt Tuning of Language Models with Superposition of Multi Token Embeddings

Authors: MohammadAli SadraeiJavaeri, Ehsaneddin Asgari, Alice Carolyn McHardy, Hamid Reza Rabiee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.05279
Pdf URL: https://arxiv.org/pdf/2406.05279
Copy Paste: [[2406.05279]] SuperPos-Prompt: Enhancing Soft Prompt Tuning of Language Models with Superposition of Multi Token Embeddings(https://arxiv.org/abs/2406.05279)
Keywords: language model, prompt
Abstract: Soft prompt tuning techniques have recently gained traction as an effective strategy for the parameter-efficient tuning of pretrained language models, particularly minimizing the required adjustment of model parameters. Despite their growing use, achieving optimal tuning with soft prompts, especially for smaller datasets, remains a substantial challenge. This study makes two contributions in this domain: (i) we introduce SuperPos-Prompt, a new reparameterization technique employing the superposition of multiple pretrained vocabulary embeddings to improve the learning of soft prompts. Our experiments across several GLUE and SuperGLUE benchmarks consistently highlight SuperPos-Prompt's superiority over Residual Prompt tuning, exhibiting an average score increase of +6.4 in T5-Small and +5.0 in T5-Base along with a faster convergence. Remarkably, SuperPos-Prompt occasionally outperforms even full fine-tuning methods. (ii) Additionally, we demonstrate enhanced performance and rapid convergence by omitting dropouts from the frozen network, yielding consistent improvements across various scenarios and tuning methods.
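
The superposition idea in the abstract above can be illustrated with a minimal sketch. It assumes each soft-prompt vector is a weighted sum of a fixed set of sampled vocabulary embeddings, with the weights as the trainable part; the abstract does not give the exact parameterization, so the function name `superpose`, the sizes, and the choice to keep the basis fixed are all hypothetical.

```python
import random


def superpose(weights, basis):
    """Weighted sum of basis embeddings: p[d] = sum_j weights[j] * basis[j][d]."""
    dim = len(basis[0])
    return [sum(w * e[d] for w, e in zip(weights, basis)) for d in range(dim)]


rng = random.Random(0)
dim, vocab_size, num_basis, prompt_len = 4, 50, 8, 3  # toy sizes (assumed)

# Random vectors stand in for a frozen pretrained embedding table.
vocab = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(vocab_size)]

# Sample a few token embeddings to act as the basis (kept fixed in this sketch).
basis = rng.sample(vocab, num_basis)

# One weight vector per prompt position; these would be the trainable parameters.
weights = [[rng.gauss(0, 0.1) for _ in range(num_basis)] for _ in range(prompt_len)]

# Each soft-prompt vector is a superposition of the basis embeddings.
soft_prompt = [superpose(w, basis) for w in weights]
print(len(soft_prompt), len(soft_prompt[0]))
```

In a real setup the weights would be updated by gradient descent while the language model stays frozen; this sketch only shows how the prompt vectors are formed.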

Title: Concept Formation and Alignment in Language Models: Bridging Statistical Patterns in Latent Space to Concept Taxonomy

Authors: Mehrdad Khatir, Chandan K. Reddy
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.05315
Pdf URL: https://arxiv.org/pdf/2406.05315
Copy Paste: [[2406.05315]] Concept Formation and Alignment in Language Models: Bridging Statistical Patterns in Latent Space to Concept Taxonomy(https://arxiv.org/abs/2406.05315)
Keywords: language model
Abstract: This paper explores concept formation and alignment within the realm of language models (LMs). We propose a mechanism for identifying concepts and their hierarchical organization within the semantic representations learned by various LMs, encompassing a spectrum from early models like GloVe to transformer-based language models like ALBERT and T5. Our approach leverages the inherent structure present in the semantic embeddings generated by these models to extract a taxonomy of concepts and their hierarchical relationships. This investigation sheds light on how LMs develop conceptual understanding and opens doors to further research to improve their ability to reason and leverage real-world knowledge. We further conducted experiments and observed the possibility of isolating these extracted conceptual representations from the reasoning modules of the transformer-based LMs. The observed concept formation, along with the isolation of conceptual representations from the reasoning modules, can enable targeted token engineering to open the door for potential applications in knowledge transfer, explainable AI, and the development of more modular and conceptually grounded language models.

Title: Teaching-Assistant-in-the-Loop: Improving Knowledge Distillation from Imperfect Teacher Models in Low-Budget Scenarios

Authors: Yuhang Zhou, Wei Ai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.05322
Pdf URL: https://arxiv.org/pdf/2406.05322
Copy Paste: [[2406.05322]] Teaching-Assistant-in-the-Loop: Improving Knowledge Distillation from Imperfect Teacher Models in Low-Budget Scenarios(https://arxiv.org/abs/2406.05322)
Keywords: language model, gpt, llm
Abstract: There is increasing interest in distilling task-specific knowledge from large language models (LLMs) to smaller student models. Nonetheless, LLM distillation presents a dual challenge: 1) there is a high cost associated with querying the teacher LLM, such as GPT-4, for gathering an ample number of demonstrations; 2) the teacher LLM might provide imperfect outputs with a negative impact on the student's learning process. To enhance sample efficiency within resource-constrained, imperfect teacher scenarios, we propose a three-component framework leveraging three signal types. The first signal is the student's self-consistency (consistency across the student's multiple outputs), which is a proxy for the student's confidence. In addition, we introduce a "teaching assistant" (TA) model to assess the uncertainty of both the student's and the teacher's outputs via confidence scoring, which provides the other two signals for student training. Furthermore, we propose a two-stage training schema to first warm up the student with a small proportion of data to better utilize the student's signal. Experiments have shown the superiority of our proposed framework for four complex reasoning tasks. On average, our proposed two-stage framework brings a relative improvement of up to 20.79% compared to fine-tuning without any signals across datasets.
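
The self-consistency signal mentioned in the abstract above can be sketched as the share of sampled outputs that agree with the majority answer. This is a minimal illustration under that assumption, not the paper's scoring procedure; the function name `self_consistency` and the string normalization are hypothetical.

```python
from collections import Counter


def self_consistency(answers):
    """Return (majority_answer, agreement) for a list of sampled answers.

    agreement is the fraction of samples that match the majority answer
    after trimming whitespace and lowercasing.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(a.strip().lower() for a in answers)
    majority, votes = counts.most_common(1)[0]
    return majority, votes / len(answers)


# Five hypothetical answers sampled from a student model for one question.
samples = ["42", "42", "41", "42 ", "40"]
answer, confidence = self_consistency(samples)
print(answer, confidence)  # prints: 42 0.6
```

A low agreement value would flag an example where the student is unsure, which is the kind of case where spending teacher queries is most likely to help.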

Title: Hidden Question Representations Tell Non-Factuality Within and Across Large Language Models

Authors: Yanling Wang, Haoyang Li, Hao Zou, Jing Zhang, Xinlei He, Qi Li, Ke Xu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.05328
Pdf URL: https://arxiv.org/pdf/2406.05328
Copy Paste: [[2406.05328]] Hidden Question Representations Tell Non-Factuality Within and Across Large Language Models(https://arxiv.org/abs/2406.05328)
Keywords: language model, llm
Abstract: Despite the remarkable advance of large language models (LLMs), the prevalence of non-factual responses remains a common issue. This work studies non-factuality prediction (NFP), which predicts whether an LLM will generate non-factual responses to a question before the generation process. Previous efforts on NFP usually rely on extensive computation. In this work, we conduct extensive analysis to explore the capabilities of using a lightweight probe to elicit "whether an LLM knows" from the hidden representations of questions. Additionally, we discover that the non-factuality probe employs similar patterns for NFP across multiple LLMs. Motivated by the intriguing finding, we conduct effective transfer learning for cross-LLM NFP and propose a question-aligned strategy to ensure the efficacy of mini-batch based training.

Title: MemeGuard: An LLM and VLM-based Framework for Advancing Content Moderation via Meme Intervention

Authors: Prince Jha, Raghav Jain, Konika Mandal, Aman Chadha, Sriparna Saha, Pushpak Bhattacharyya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.05344
Pdf URL: https://arxiv.org/pdf/2406.05344
Copy Paste: [[2406.05344]] MemeGuard: An LLM and VLM-based Framework for Advancing Content Moderation via Meme Intervention(https://arxiv.org/abs/2406.05344)
Keywords: language model, llm
Abstract: In the digital world, memes present a unique challenge for content moderation due to their potential to spread harmful content. Although detection methods have improved, proactive solutions such as intervention are still limited, with current research focusing mostly on text-based content, neglecting the widespread influence of multimodal content like memes. Addressing this gap, we present MemeGuard, a comprehensive framework leveraging Large Language Models (LLMs) and Visual Language Models (VLMs) for meme intervention. MemeGuard harnesses a specially fine-tuned VLM, VLMeme, for meme interpretation, and a multimodal knowledge selection and ranking mechanism (MKS) for distilling relevant knowledge. This knowledge is then employed by a general-purpose LLM to generate contextually appropriate interventions. Another key contribution of this work is the Intervening Cyberbullying in Multimodal Memes (ICMM) dataset, a high-quality, labeled dataset featuring toxic memes and their corresponding human-annotated interventions. We leverage ICMM to test MemeGuard, demonstrating its proficiency in generating relevant and effective responses to toxic memes.

Title: Toward Reliable Ad-hoc Scientific Information Extraction: A Case Study on Two Materials Datasets

Authors: Satanu Ghosh, Neal R. Brodnik, Carolina Frey, Collin Holgate, Tresa M. Pollock, Samantha Daly, Samuel Carton
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2406.05348
Pdf URL: https://arxiv.org/pdf/2406.05348
Copy Paste: [[2406.05348]] Toward Reliable Ad-hoc Scientific Information Extraction: A Case Study on Two Materials Datasets(https://arxiv.org/abs/2406.05348)
Keywords: gpt, prompt
Abstract: We explore the ability of GPT-4 to perform ad-hoc schema-based information extraction from scientific literature. We assess specifically whether it can, with a basic prompting approach, replicate two existing materials science datasets, given the manuscripts from which they were originally manually extracted. We employ materials scientists to perform a detailed manual error analysis to assess where the model struggles to faithfully extract the desired information, and draw on their insights to suggest research directions to address this broadly important task.

Title: Flexible and Adaptable Summarization via Expertise Separation

Authors: Xiuying Chen, Mingzhe Li, Shen Gao, Xin Cheng, Qingqing Zhu, Rui Yan, Xin Gao, Xiangliang Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.05360
Pdf URL: https://arxiv.org/pdf/2406.05360
Copy Paste: [[2406.05360]] Flexible and Adaptable Summarization via Expertise Separation(https://arxiv.org/abs/2406.05360)
Keywords: language model, llm
Abstract: A proficient summarization model should exhibit both flexibility -- the capacity to handle a range of in-domain summarization tasks, and adaptability -- the competence to acquire new knowledge and adjust to unseen out-of-domain tasks. Unlike large language models (LLMs) that achieve this through parameter scaling, we propose a more parameter-efficient approach in this study. Our motivation rests on the principle that the general summarization ability to capture salient information can be shared across different tasks, while the domain-specific summarization abilities need to be distinct and tailored. Concretely, we propose MoeSumm, a Mixture-of-Expert Summarization architecture, which utilizes a main expert for gaining the general summarization capability and deputy experts that selectively collaborate to meet specific summarization task requirements. We further propose a max-margin loss to stimulate the separation of these abilities. Our model's distinct separation of general and domain-specific summarization abilities grants it notable flexibility and adaptability, all while maintaining parameter efficiency. MoeSumm achieves flexibility by managing summarization across multiple domains with a single model, utilizing a shared main expert and selected deputy experts. It exhibits adaptability by tailoring deputy experts to cater to out-of-domain few-shot and zero-shot scenarios. Experimental results on 11 datasets show the superiority of our model compared with recent baselines and LLMs. We also provide statistical and visual evidence of the distinct separation of the two abilities in MoeSumm (this https URL).

Title: Write Summary Step-by-Step: A Pilot Study of Stepwise Summarization

Authors: Xiuying Chen, Shen Gao, Mingzhe Li, Qingqing Zhu, Xin Gao, Xiangliang Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.05361
Pdf URL: https://arxiv.org/pdf/2406.05361
Copy Paste: [[2406.05361]] Write Summary Step-by-Step: A Pilot Study of Stepwise Summarization(https://arxiv.org/abs/2406.05361)
Keywords: language model
Abstract: Nowadays, neural text generation has made tremendous progress in abstractive summarization tasks. However, most of the existing summarization models take in the whole document all at once, which sometimes cannot meet the needs in practice. Practically, social text streams such as news events and tweets keep growing over time, and can only be fed to the summarization system step by step. Hence, in this paper, we propose the task of Stepwise Summarization, which aims to generate a new appended summary each time a new document is proposed. The appended summary should not only summarize the newly added content but also be coherent with the previous summary, to form an up-to-date complete summary. To tackle this challenge, we design an adversarial learning model, named Stepwise Summary Generator (SSG). First, SSG selectively processes the new document under the guidance of the previous summary, obtaining a polished document representation. Next, SSG generates the summary considering both the previous summary and the document. Finally, a convolution-based discriminator is employed to determine whether the newly generated summary is coherent with the previous summary. For the experiment, we extend the traditional two-step update summarization setting to a multi-step stepwise setting, and re-propose a large-scale stepwise summarization dataset based on a public story generation dataset. Extensive experiments on this dataset show that SSG achieves state-of-the-art performance in terms of both automatic metrics and human evaluations. Ablation studies demonstrate the effectiveness of each module in our framework. We also discuss the benefits and limitations of recent large language models on this task.

Title: CaLM: Contrasting Large and Small Language Models to Verify Grounded Generation

Authors: I-Hung Hsu, Zifeng Wang, Long T. Le, Lesly Miculicich, Nanyun Peng, Chen-Yu Lee, Tomas Pfister
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.05365
Pdf URL: https://arxiv.org/pdf/2406.05365
Copy Paste: [[2406.05365]] CaLM: Contrasting Large and Small Language Models to Verify Grounded Generation(https://arxiv.org/abs/2406.05365)
Keywords: language model
Abstract: Grounded generation aims to equip language models (LMs) with the ability to produce more credible and accountable responses by accurately citing verifiable sources. However, existing methods, by either feeding LMs with raw or preprocessed materials, remain prone to errors. To address this, we introduce CaLM, a novel verification framework. CaLM leverages the insight that a robust grounded response should be consistent with information derived solely from its cited sources. Our framework empowers smaller LMs, which rely less on parametric memory and excel at processing relevant information given a query, to validate the output of larger LMs. Larger LM responses that closely align with the smaller LMs' output, which relies exclusively on cited documents, are verified. Responses showing discrepancies are iteratively refined through a feedback loop. Experiments on three open-domain question-answering datasets demonstrate significant performance gains of 1.5% to 7% absolute average without any required model fine-tuning.

Title: Venn Diagram Prompting : Accelerating Comprehension with Scaffolding Effect

Authors: Sakshi Mahendru, Tejul Pandit
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.05369
Pdf URL: https://arxiv.org/pdf/2406.05369
Copy Paste: [[2406.05369]] Venn Diagram Prompting : Accelerating Comprehension with Scaffolding Effect(https://arxiv.org/abs/2406.05369)
Keywords: language model, llm, prompt
Abstract: We introduce Venn Diagram (VD) Prompting, an innovative prompting technique which allows Large Language Models (LLMs) to combine and synthesize information across complex, diverse and long-context documents in knowledge-intensive question-answering tasks. Generating answers from multiple documents involves numerous steps to extract relevant and unique information and amalgamate it into a cohesive response. To improve the quality of the final answer, multiple LLM calls or pretrained models are used to perform different tasks such as summarization, reorganization and customization. The approach covered in the paper focuses on replacing the multi-step strategy via a single LLM call using VD prompting. Our proposed technique also aims to eliminate the inherent position bias in the LLMs, enhancing consistency in answers by removing sensitivity to the sequence of input information. It overcomes the challenge of inconsistency traditionally associated with varying input sequences. We also explore the practical applications of the VD prompt based on our examination of the prompt's outcomes. In the experiments performed on four public benchmark question-answering datasets, VD prompting continually matches or surpasses the performance of a meticulously crafted instruction prompt which adheres to optimal guidelines and practices.

Title: VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers

Authors: Sanyuan Chen, Shujie Liu, Long Zhou, Yanqing Liu, Xu Tan, Jinyu Li, Sheng Zhao, Yao Qian, Furu Wei
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2406.05370
Pdf URL: https://arxiv.org/pdf/2406.05370
Copy Paste: [[2406.05370]] VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers(https://arxiv.org/abs/2406.05370)
Keywords: language model
Abstract: This paper introduces VALL-E 2, the latest advancement in neural codec language models that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity for the first time. Based on its predecessor, VALL-E, the new iteration introduces two significant enhancements: Repetition Aware Sampling refines the original nucleus sampling process by accounting for token repetition in the decoding history. It not only stabilizes the decoding but also circumvents the infinite loop issue. Grouped Code Modeling organizes codec codes into groups to effectively shorten the sequence length, which not only boosts inference speed but also addresses the challenges of long sequence modeling. Our experiments on the LibriSpeech and VCTK datasets show that VALL-E 2 surpasses previous systems in speech robustness, naturalness, and speaker similarity. It is the first of its kind to reach human parity on these benchmarks. Moreover, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrases. The advantages of this work could contribute to valuable endeavors, such as generating speech for individuals with aphasia or people with amyotrophic lateral sclerosis. Demos of VALL-E 2 will be posted to this https URL.
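
A rough sketch of how a sampler might account for token repetition in the decoding history, as the abstract above describes for Repetition Aware Sampling. The specific rule here is an assumption and may differ from the paper: draw a token with nucleus sampling, and if that token already dominates a recent window of the history, redraw from the full distribution. The names `nucleus_sample` and `repetition_aware_sample`, and the `top_p`, `window`, and `max_ratio` defaults, are hypothetical.

```python
import random


def nucleus_sample(probs, top_p, rng):
    """Sample a token index from the smallest set of tokens whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_aware_sample(probs, history, rng, top_p=0.8, window=10, max_ratio=0.5):
    """Nucleus-sample a token; fall back to the full distribution if it repeats too often."""
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    # Assumed rule: measure how much of the recent window the candidate already fills.
    if recent and recent.count(token) / len(recent) > max_ratio:
        token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token


rng = random.Random(0)
probs = [0.9, 0.05, 0.05]   # toy next-token distribution over 3 codec tokens
history = [0] * 10          # token 0 has been emitted ten times in a row
print(repetition_aware_sample(probs, history, rng))
```

With the toy distribution above, plain nucleus sampling at `top_p=0.8` can only ever return token 0, so the fallback is what gives the other tokens a chance of breaking the loop.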

Title: Planning Like Human: A Dual-process Framework for Dialogue Planning

Authors: Tao He, Lizi Liao, Yixin Cao, Yuanxing Liu, Ming Liu, Zerui Chen, Bing Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.05374
Pdf URL: https://arxiv.org/pdf/2406.05374
Copy Paste: [[2406.05374]] Planning Like Human: A Dual-process Framework for Dialogue Planning(https://arxiv.org/abs/2406.05374)
Keywords: language model, llm, prompt
Abstract: In proactive dialogue, the challenge lies not just in generating responses but in steering conversations toward predetermined goals, a task where Large Language Models (LLMs) typically struggle due to their reactive nature. Traditional approaches to enhance dialogue planning in LLMs, ranging from elaborate prompt engineering to the integration of policy networks, either face efficiency issues or deliver suboptimal performance. Inspired by the dual-process theory in psychology, which identifies two distinct modes of thinking, intuitive (fast) and analytical (slow), we propose the Dual-Process Dialogue Planning (DPDP) framework. DPDP embodies this theory through two complementary planning systems: an instinctive policy model for familiar contexts and a deliberative Monte Carlo Tree Search (MCTS) mechanism for complex, novel scenarios. This dual strategy is further coupled with a novel two-stage training regimen: offline Reinforcement Learning for robust initial policy model formation followed by MCTS-enhanced on-the-fly learning, which ensures a dynamic balance between efficiency and strategic depth. Our empirical evaluations across diverse dialogue tasks affirm DPDP's superiority in achieving both high-quality dialogues and operational efficiency, outpacing existing methods.

Title: Deconstructing The Ethics of Large Language Models from Long-standing Issues to New-emerging Dilemmas

Authors: Chengyuan Deng, Yiqun Duan, Xin Jin, Heng Chang, Yijun Tian, Han Liu, Henry Peng Zou, Yiqiao Jin, Yijia Xiao, Yichen Wang, Shenghao Wu, Zongxing Xie, Kuofeng Gao, Sihong He, Jun Zhuang, Lu Cheng, Haohan Wang
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2406.05392
Pdf URL: https://arxiv.org/pdf/2406.05392
Copy Paste: [[2406.05392]] Deconstructing The Ethics of Large Language Models from Long-standing Issues to New-emerging Dilemmas(https://arxiv.org/abs/2406.05392)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have achieved unparalleled success across diverse language modeling tasks in recent years. However, this progress has also intensified ethical concerns, impacting the deployment of LLMs in everyday contexts. This paper provides a comprehensive survey of ethical challenges associated with LLMs, from longstanding issues such as copyright infringement, systematic bias, and data privacy, to emerging problems like truthfulness and social norms. We critically analyze existing research aimed at understanding, examining, and mitigating these ethical risks. Our survey underscores the importance of integrating ethical standards and societal values into the development of LLMs, thereby guiding the development of responsible and ethically aligned language models.

Title: MaTableGPT: GPT-based Table Data Extractor from Materials Science Literature

Authors: Gyeong Hoon Yi, Jiwoo Choi, Hyeongyun Song, Olivia Miano, Jaewoong Choi, Kihoon Bang, Byungju Lee, Seok Su Sohn, David Buttler, Anna Hiszpanski, Sang Soo Han, Donghun Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.05431
Pdf URL: https://arxiv.org/pdf/2406.05431
Copy Paste: [[2406.05431]] MaTableGPT: GPT-based Table Data Extractor from Materials Science Literature(https://arxiv.org/abs/2406.05431)
Keywords: gpt
Abstract: Efficiently extracting data from tables in the scientific literature is pivotal for building large-scale databases. However, the tables reported in materials science papers exist in highly diverse forms; thus, rule-based extractions are an ineffective approach. To overcome this challenge, we present MaTableGPT, which is a GPT-based table data extractor from the materials science literature. MaTableGPT features key strategies of table data representation and table splitting for better GPT comprehension and filtering hallucinated information through follow-up questions. When applied to a vast volume of water splitting catalysis literature, MaTableGPT achieved an extraction accuracy (total F1 score) of up to 96.8%. Through comprehensive evaluations of the GPT usage cost, labeling cost, and extraction accuracy for the learning methods of zero-shot, few-shot and fine-tuning, we present a Pareto-front mapping where the few-shot learning method was found to be the most balanced solution owing to both its high extraction accuracy (total F1 score>95%) and low cost (GPT usage cost of 5.97 US dollars and labeling cost of 10 I/O paired examples). The statistical analyses conducted on the database generated by MaTableGPT revealed valuable insights into the distribution of the overpotential and elemental utilization across the reported catalysts in the water splitting literature.

Title: Fighting Against the Repetitive Training and Sample Dependency Problem in Few-shot Named Entity Recognition

Authors: Chang Tian, Wenpeng Yin, Dan Li, Marie-Francine Moens
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.05460
Pdf URL: https://arxiv.org/pdf/2406.05460
Copy Paste: [[2406.05460]] Fighting Against the Repetitive Training and Sample Dependency Problem in Few-shot Named Entity Recognition(https://arxiv.org/abs/2406.05460)
Keywords: language model, gpt, llm, chat
Abstract: Few-shot named entity recognition (NER) systems recognize entities using a few labeled training examples. The general pipeline consists of a span detector to identify entity spans in text and an entity-type classifier to assign types to entities. Current span detectors rely on extensive manual labeling to guide training. Almost every span detector requires initial training on basic span features followed by adaptation to task-specific features. This process leads to repetitive training of the basic span features among span detectors. Additionally, metric-based entity-type classifiers, such as prototypical networks, typically employ a specific metric that gauges the distance between the query sample and entity-type referents, ultimately assigning the most probable entity type to the query sample. However, these classifiers encounter the sample dependency problem, primarily stemming from the limited samples available for each entity-type referent. To address these challenges, we propose an improved few-shot NER pipeline. First, we introduce a steppingstone span detector that is pre-trained on open-domain Wikipedia data. It can be used to initialize the pipeline span detector to reduce the repetitive training of basic features. Second, we leverage a large language model (LLM) to set reliable entity-type referents, eliminating reliance on few-shot samples of each type. Our model exhibits superior performance with fewer training steps and less human-labeled data compared with baselines, as demonstrated through extensive experiments on various datasets. Particularly in fine-grained few-shot NER settings, our model outperforms strong baselines, including ChatGPT. We will publicly release the code, datasets, LLM outputs, and model checkpoints.
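The metric-based classification step mentioned in the abstract above can be sketched as a nearest-referent lookup. In this minimal example each entity-type referent is the mean of a few support vectors, which is the standard prototypical-network construction and also where the sample dependency comes from: with only two samples per type, one unusual sample moves the referent noticeably. The toy 2-D vectors, the Euclidean metric, and the function names are assumptions for illustration, not the authors' pipeline (which, per the abstract, derives referents with an LLM instead).

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def classify(query, referents):
    """Return the entity type whose referent vector is closest to the query."""
    return min(referents, key=lambda label: euclidean(query, referents[label]))

# Hypothetical few-shot support set: two embedded samples per entity type.
support = {
    "PERSON": [[1.0, 0.0], [0.8, 0.2]],
    "LOCATION": [[0.0, 1.0], [0.2, 0.8]],
}
referents = {label: mean_vector(vecs) for label, vecs in support.items()}
predicted = classify([0.7, 0.1], referents)  # closest referent is PERSON
```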

Title: Investigating and Addressing Hallucinations of LLMs in Tasks Involving Negation

Authors: Neeraj Varshney, Satyam Raj, Venkatesh Mishra, Agneet Chatterjee, Ritika Sarkar, Amir Saeidi, Chitta Baral
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.05494
Pdf URL: https://arxiv.org/pdf/2406.05494
Copy Paste: [[2406.05494]] Investigating and Addressing Hallucinations of LLMs in Tasks Involving Negation(https://arxiv.org/abs/2406.05494)
Keywords: language model, llm, hallucination, chat
Abstract: Large Language Models (LLMs) have achieved remarkable performance across a wide variety of natural language tasks. However, they have been shown to suffer from a critical limitation pertinent to 'hallucination' in their output. Recent research has focused on investigating and addressing this problem for a variety of tasks such as biography generation, question answering, abstractive summarization, and dialogue generation. However, the crucial aspect pertaining to 'negation' has remained considerably underexplored. Negation is important because it adds depth and nuance to the understanding of language and is also crucial for logical reasoning and inference. In this work, we address the above limitation and particularly focus on studying the impact of negation in LLM hallucinations. Specifically, we study four tasks with negation: 'false premise completion', 'constrained fact generation', 'multiple choice question answering', and 'fact generation'. We show that open-source state-of-the-art LLMs such as LLaMA-2-chat, Vicuna, and Orca-2 hallucinate considerably on all these tasks involving negation which underlines a critical shortcoming of these models. Addressing this problem, we further study numerous strategies to mitigate these hallucinations and demonstrate their impact.

Title: ThatiAR: Subjectivity Detection in Arabic News Sentences

Authors: Reem Suwaileh, Maram Hasanain, Fatema Hubail, Wajdi Zaghouani, Firoj Alam
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.05559
Pdf URL: https://arxiv.org/pdf/2406.05559
Copy Paste: [[2406.05559]] ThatiAR: Subjectivity Detection in Arabic News Sentences(https://arxiv.org/abs/2406.05559)
Keywords: gpt, llm
Abstract: Detecting subjectivity in news sentences is crucial for identifying media bias, enhancing credibility, and combating misinformation by flagging opinion-based content. It provides insights into public sentiment, empowers readers to make informed decisions, and encourages critical thinking. While research has developed methods and systems for this purpose, most efforts have focused on English and other high-resource languages. In this study, we present the first large dataset for subjectivity detection in Arabic, consisting of ~3.6K manually annotated sentences along with GPT-4o-based explanations. In addition, we included instructions (both in English and Arabic) to facilitate LLM-based fine-tuning. We provide an in-depth analysis of the dataset, annotation process, and extensive benchmark results, including PLMs and LLMs. Our analysis of the annotation process highlights that annotators were strongly influenced by their political, cultural, and religious backgrounds, especially at the beginning of the annotation process. The experimental results suggest that LLMs with in-context learning provide better performance. We aim to release the dataset and resources to the community.

Title: Do LLMs Recognize me, When I is not me: Assessment of LLMs Understanding of Turkish Indexical Pronouns in Indexical Shift Contexts

Authors: Metehan Oğuz, Yusuf Umut Ciftci, Yavuz Faruk Bakman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.05569
Pdf URL: https://arxiv.org/pdf/2406.05569
Copy Paste: [[2406.05569]] Do LLMs Recognize me, When I is not me: Assessment of LLMs Understanding of Turkish Indexical Pronouns in Indexical Shift Contexts(https://arxiv.org/abs/2406.05569)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) have shown impressive capabilities in tasks such as machine translation, text summarization, question answering, and solving complex mathematical problems. However, their primary training on data-rich languages like English limits their performance in low-resource languages. This study addresses this gap by focusing on the Indexical Shift problem in Turkish. The Indexical Shift problem involves resolving pronouns in indexical shift contexts, a grammatical challenge not present in high-resource languages like English. We present the first study examining indexical shift in any language, releasing a Turkish dataset specifically designed for this purpose. Our Indexical Shift Dataset consists of 156 multiple-choice questions, each annotated with necessary linguistic details, to evaluate LLMs in a few-shot setting. We evaluate recent multilingual LLMs, including GPT-4, GPT-3.5, Cohere-AYA, Trendyol-LLM, and Turkcell-LLM, using this dataset. Our analysis reveals that even advanced models like GPT-4 struggle with the grammatical nuances of indexical shift in Turkish, achieving only moderate performance. These findings underscore the need for focused research on the grammatical challenges posed by low-resource languages. We released the dataset and code at https://anonymous.4open.science/r/indexical_shift_llm-E1B4.

Title: Creativity Has Left the Chat: The Price of Debiasing Language Models

Authors: Behnam Mohammadi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.05587
Pdf URL: https://arxiv.org/pdf/2406.05587
Copy Paste: [[2406.05587]] Creativity Has Left the Chat: The Price of Debiasing Language Models(https://arxiv.org/abs/2406.05587)
Keywords: language model, llm, prompt, chat
Abstract: Large Language Models (LLMs) have revolutionized natural language processing but can exhibit biases and may generate toxic content. While alignment techniques like Reinforcement Learning from Human Feedback (RLHF) reduce these issues, their impact on creativity, defined as syntactic and semantic diversity, remains unexplored. We investigate the unintended consequences of RLHF on the creativity of LLMs through three experiments focusing on the Llama-2 series. Our findings reveal that aligned models exhibit lower entropy in token predictions, form distinct clusters in the embedding space, and gravitate towards "attractor states", indicating limited output diversity. Our findings have significant implications for marketers who rely on LLMs for creative tasks such as copywriting, ad creation, and customer persona generation. The trade-off between consistency and creativity in aligned models should be carefully considered when selecting the appropriate model for a given application. We also discuss the importance of prompt engineering in harnessing the creative potential of base models.
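The "lower entropy in token predictions" finding in the abstract above refers to how peaked a model's next-token distribution is. A minimal sketch of that measurement follows; the two example distributions are invented for illustration and are not data from the paper.

```python
import math

def token_entropy(probs):
    """Shannon entropy, in bits, of a next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical next-token distributions over a four-token vocabulary.
diverse = [0.25, 0.25, 0.25, 0.25]   # spread out: many plausible continuations
peaked = [0.97, 0.01, 0.01, 0.01]    # concentrated: one continuation dominates

diverse_bits = token_entropy(diverse)  # 2.0 bits, the maximum for four tokens
peaked_bits = token_entropy(peaked)    # well below 2.0 bits
```

Lower entropy means sampling returns the same few tokens more often, which is one way the reduced output diversity described in the abstract could show up.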

Title: CERET: Cost-Effective Extrinsic Refinement for Text Generation

Authors: Jason Cai, Hang Su, Monica Sunkara, Igor Shalyminov, Saab Mansour
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.05588
Pdf URL: https://arxiv.org/pdf/2406.05588
Copy Paste: [[2406.05588]] CERET: Cost-Effective Extrinsic Refinement for Text Generation(https://arxiv.org/abs/2406.05588)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are powerful models for generation tasks, but they may not generate good quality outputs on their first attempt. Apart from model fine-tuning, existing approaches to improve prediction accuracy and quality typically involve LLM self-improvement / self-reflection that incorporates feedback from the models themselves. Despite their effectiveness, these methods are hindered by their high computational cost and lack of scalability. In this work, we propose CERET, a method for refining text generations by considering semantic stability, entailment and inter-sample uncertainty measures. Experimental results show that CERET outperforms Self-consistency and Self-rerank baselines consistently under various task setups, by ~1.6% in Rouge-1 for abstractive summarization and ~3.5% in hit rate for question answering. Compared to the LLM Self-rerank method, our approach only requires 9.4% of its latency and is more cost-effective.

Title: GrowOVER: How Can LLMs Adapt to Growing Real-World Knowledge?

Authors: Dayoon Ko, Jinyoung Kim, Hahyeon Choi, Gunhee Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.05606
Pdf URL: https://arxiv.org/pdf/2406.05606
Copy Paste: [[2406.05606]] GrowOVER: How Can LLMs Adapt to Growing Real-World Knowledge?(https://arxiv.org/abs/2406.05606)
Keywords: language model, llm
Abstract: In the real world, knowledge is constantly evolving, which can render existing knowledge-based datasets outdated. This unreliability highlights the critical need for continuous updates to ensure both accuracy and relevance in knowledge-intensive tasks. To address this, we propose GrowOVER-QA and GrowOVER-Dialogue, dynamic open-domain QA and dialogue benchmarks that undergo a continuous cycle of updates, keeping pace with the rapid evolution of knowledge. Our research indicates that retrieval-augmented language models (RaLMs) struggle with knowledge that has not been trained on or recently updated. Consequently, we introduce a novel retrieval-interactive language model framework, where the language model evaluates and reflects on its answers for further re-retrieval. Our exhaustive experiments demonstrate that our training-free framework significantly improves upon existing methods, performing comparably to or even surpassing continuously trained language models.

Title: How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States

Authors: Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Yongbin Li
Subjects: cs.CL, cs.AI, cs.CR, cs.CY
Abstract URL: https://arxiv.org/abs/2406.05644
Pdf URL: https://arxiv.org/pdf/2406.05644
Copy Paste: [[2406.05644]] How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States(https://arxiv.org/abs/2406.05644)
Keywords: language model, llm
Abstract: Large language models (LLMs) rely on safety alignment to avoid responding to malicious user inputs. Unfortunately, jailbreak can circumvent safety guardrails, resulting in LLMs generating harmful content and raising concerns about LLM safety. Because language models with intensive parameters are often regarded as black boxes, the mechanisms of alignment and jailbreak are challenging to elucidate. In this paper, we employ weak classifiers to explain LLM safety through the intermediate hidden states. We first confirm that LLMs learn ethical concepts during pre-training rather than alignment and can identify malicious and normal inputs in the early layers. Alignment actually associates the early concepts with emotion guesses in the middle layers and then refines them to the specific reject tokens for safe generations. Jailbreak disturbs the transformation of early unethical classification into negative emotions. We conduct experiments on models from 7B to 70B across various model families to prove our conclusion. Overall, our paper indicates the intrinsic mechanism of LLM safety and how jailbreaks circumvent safety guardrails, offering a new perspective on LLM safety and reducing concerns.

Title: DomainRAG: A Chinese Benchmark for Evaluating Domain-specific Retrieval-Augmented Generation

Authors: Shuting Wang, Jiongnan Liu, Shiren Song, Jiehan Cheng, Yuqi Fu, Peidong Guo, Kun Fang, Yutao Zhu, Zhicheng Dou
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2406.05654
Pdf URL: https://arxiv.org/pdf/2406.05654
Copy Paste: [[2406.05654]] DomainRAG: A Chinese Benchmark for Evaluating Domain-specific Retrieval-Augmented Generation(https://arxiv.org/abs/2406.05654)
Keywords: language model, gpt, llm, hallucination, chat, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) offers a promising solution to address various limitations of Large Language Models (LLMs), such as hallucination and difficulties in keeping up with real-time updates. This approach is particularly critical in expert and domain-specific applications where LLMs struggle to cover expert knowledge. Therefore, evaluating RAG models in such scenarios is crucial, yet current studies often rely on general knowledge sources like Wikipedia to assess the models' abilities in solving common-sense problems. In this paper, we evaluated LLMs in RAG settings in a domain-specific context, college enrollment. We identified six required abilities for RAG models, including the ability in conversational RAG, analyzing structural information, faithfulness to external knowledge, denoising, solving time-sensitive problems, and understanding multi-document interactions. Each ability has an associated dataset with shared corpora to evaluate the RAG models' performance. We evaluated popular LLMs such as Llama, Baichuan, ChatGLM, and GPT models. Experimental results indicate that existing closed-book LLMs struggle with domain-specific questions, highlighting the need for RAG models to solve expert problems. Moreover, there is room for RAG models to improve their abilities in comprehending conversational history, analyzing structural information, denoising, processing multi-document interactions, and faithfulness to expert knowledge. We expect future studies could solve these problems better.

Title: Do LLMs Exhibit Human-Like Reasoning? Evaluating Theory of Mind in LLMs for Open-Ended Responses

Authors: Maryam Amirizaniani, Elias Martin, Maryna Sivachenko, Afra Mashhadi, Chirag Shah
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.05659
Pdf URL: https://arxiv.org/pdf/2406.05659
Copy Paste: [[2406.05659]] Do LLMs Exhibit Human-Like Reasoning? Evaluating Theory of Mind in LLMs for Open-Ended Responses(https://arxiv.org/abs/2406.05659)
Keywords: language model, llm, prompt
Abstract: Theory of Mind (ToM) reasoning entails recognizing that other individuals possess their own intentions, emotions, and thoughts, which is vital for guiding one's own thought processes. Although large language models (LLMs) excel in tasks such as summarization, question answering, and translation, they still face challenges with ToM reasoning, especially in open-ended questions. Despite advancements, the extent to which LLMs truly understand ToM reasoning and how closely it aligns with human ToM reasoning remains inadequately explored in open-ended scenarios. Motivated by this gap, we assess the abilities of LLMs to perceive and integrate human intentions and emotions into their ToM reasoning processes within open-ended questions. Our study utilizes posts from Reddit's ChangeMyView platform, which demands nuanced social reasoning to craft persuasive responses. Our analysis, comparing semantic similarity and lexical overlap metrics between responses generated by humans and LLMs, reveals clear disparities in ToM reasoning capabilities in open-ended questions, with even the most advanced models showing notable limitations. To enhance LLM capabilities, we implement a prompt tuning method that incorporates human intentions and emotions, resulting in improvements in ToM reasoning performance. However, despite these improvements, the enhancement still falls short of fully achieving human-like reasoning. This research highlights the deficiencies in LLMs' social reasoning and demonstrates how integrating human intentions and emotions can boost their effectiveness.

Title: MS-HuBERT: Mitigating Pre-training and Inference Mismatch in Masked Language Modelling methods for learning Speech Representations

Authors: Hemant Yadav, Sunayana Sitaram, Rajiv Ratn Shah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.05661
Pdf URL: https://arxiv.org/pdf/2406.05661
Copy Paste: [[2406.05661]] MS-HuBERT: Mitigating Pre-training and Inference Mismatch in Masked Language Modelling methods for learning Speech Representations(https://arxiv.org/abs/2406.05661)
Keywords: language model
Abstract: In recent years, self-supervised pre-training methods have gained significant traction in learning high-level information from raw speech. Among these methods, HuBERT has demonstrated SOTA performance in automatic speech recognition (ASR). However, HuBERT's performance lags behind data2vec due to disparities in pre-training strategies. In this paper, we (i) propose a Swap method to address the pre-training and inference mismatch observed in HuBERT and (ii) incorporate a Multicluster masked prediction loss for more effective utilization of the model's capacity. The resulting method, MS-HuBERT, is an end-to-end self-supervised pre-training method for learning robust speech representations. It beats vanilla HuBERT on the ASR Librispeech benchmark on average by a 5% margin when evaluated on different finetuning splits. Additionally, we demonstrate that the learned embeddings obtained during pre-training encode essential information for improving performance of content-based tasks such as ASR.

Title: SinkLoRA: Enhanced Efficiency and Chat Capabilities for Long-Context Large Language Models

Authors: Hengyu Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.05678
Pdf URL: https://arxiv.org/pdf/2406.05678
Copy Paste: [[2406.05678]] SinkLoRA: Enhanced Efficiency and Chat Capabilities for Long-Context Large Language Models(https://arxiv.org/abs/2406.05678)
Keywords: language model, chat
Abstract: Extending the functionality of the Transformer model to accommodate longer sequence lengths has become a critical challenge. This extension is crucial not only for improving tasks such as language translation and long-context processing but also for enabling novel applications like chatbots, code generation, and multimedia content creation. The primary obstacle is the self-attention mechanism, which scales quadratically with sequence length in terms of computation time and memory requirements. LongLoRA proposed shifted sparse attention (S$^2$-Attn), effectively enabling context extension and leading to non-trivial computation savings with similar performance to fine-tuning with vanilla attention. However, LongLoRA is still not as efficient as vanilla attention, reaching only 39\% of the perplexity improvement compared to full attention. This inefficiency is due to the cyclic shift applied within different attention head patterns, causing either chaos in the attention head structure or unnecessary information exchange between token groups. To address these issues, We propose \textbf{SinkLoRA}, which features better work partitioning. Specifically, (1) we developed SF-Attn with a segmentation and reassembly algorithm to proportionally return cyclically shifted groups of attention heads to their un-shifted state together with global attention of "sink attention tokens", achieving 92\% of the perplexity improvement compared to full attention after fine tuning, and (2) applied a SOTA KV cache compression algorithm H$_2$O to accelerate inference. Furthermore, We conducted supervised fine-tuning with SinkLoRA using a self collected LongAlpaca-plus dataset. All our code, models, datasets, and demos are available at \url{this https URL}.
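To make the quadratic-scaling remark in the abstract above concrete, the sketch below counts attention score entries for full attention versus attention restricted to fixed-size token groups, and shows how a cyclic shift changes which tokens share a group. The half-group shift is an assumption based on a common description of LongLoRA's S$^2$-Attn; the abstract itself gives no such detail, and the function names are hypothetical. This is not SinkLoRA's SF-Attn.

```python
def full_attention_pairs(seq_len):
    """Number of query-key score entries when every token attends to every token."""
    return seq_len * seq_len

def grouped_attention_pairs(seq_len, group_size):
    """Number of score entries when attention is restricted to within each group."""
    if seq_len % group_size != 0:
        raise ValueError("seq_len must be a multiple of group_size")
    return (seq_len // group_size) * group_size * group_size

def token_groups(seq_len, group_size, shift=0):
    """Split token indices into groups after cyclically shifting them by `shift`."""
    order = [(i + shift) % seq_len for i in range(seq_len)]
    return [order[i:i + group_size] for i in range(0, seq_len, group_size)]

plain = token_groups(8, 4)             # [[0, 1, 2, 3], [4, 5, 6, 7]]
shifted = token_groups(8, 4, shift=2)  # [[2, 3, 4, 5], [6, 7, 0, 1]]
```

Grouping cuts the cost from `seq_len * seq_len` to `seq_len * group_size` entries. The shifted grouping lets information cross group boundaries, but its last group mixes the end of the sequence with the start, which is one plausible reading of the "unnecessary information exchange between token groups" the abstract mentions.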
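Summary: Extending the Transformer model to handle longer sequence lengths has become a key challenge. This extension matters not only for improving tasks such as language translation and long-context processing, but also for enabling new applications such as chatbots, code generation, and multimedia content creation. The main obstacle is the self-attention mechanism, whose computation time and memory requirements grow quadratically with sequence length. LongLoRA proposed shifted sparse attention (S$^2$-Attn), which effectively enables context extension and saves substantial computation while performing similarly to fine-tuning with vanilla attention. However, LongLoRA is still not as efficient as vanilla attention, reaching only 39% of the perplexity improvement obtained with full attention. This inefficiency stems from the cyclic shift applied within different attention head patterns, which causes either disorder in the attention head structure or unnecessary information exchange between token groups. To address these issues, we propose SinkLoRA, which features better work partitioning. Specifically, (1) we developed SF-Attn, which uses a segmentation and reassembly algorithm to proportionally return cyclically shifted groups of attention heads to their un-shifted state, together with global attention of "sink attention tokens", achieving 92% of the perplexity improvement of full attention after fine-tuning, and (2) we applied the SOTA KV cache compression algorithm H$_2$O to accelerate inference. In addition, we performed supervised fine-tuning with SinkLoRA on a self-collected LongAlpaca-plus dataset. All of our code, models, datasets, and demos are available at this https URL.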

Title: Peer Review as A Multi-Turn and Long-Context Dialogue with Role-Based Interactions

Authors: Cheng Tan, Dongxin Lyu, Siyuan Li, Zhangyang Gao, Jingxuan Wei, Siqi Ma, Zicheng Liu, Stan Z. Li
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.05688
Pdf URL: https://arxiv.org/pdf/2406.05688
Copy Paste: [[2406.05688]] Peer Review as A Multi-Turn and Long-Context Dialogue with Role-Based Interactions(https://arxiv.org/abs/2406.05688)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated wide-ranging applications across various fields and have shown significant potential in the academic peer-review process. However, existing applications are primarily limited to static review generation based on submitted papers, which fails to capture the dynamic and iterative nature of real-world peer reviews. In this paper, we reformulate the peer-review process as a multi-turn, long-context dialogue, incorporating distinct roles for authors, reviewers, and decision makers. We construct a comprehensive dataset containing over 26,841 papers with 92,017 reviews collected from multiple sources, including a top-tier conference and a prestigious journal. This dataset is meticulously designed to facilitate the applications of LLMs for multi-turn dialogues, effectively simulating the complete peer-review process. Furthermore, we propose a series of metrics to evaluate the performance of LLMs for each role under this reformulated peer-review setting, ensuring fair and comprehensive evaluations. We believe this work provides a promising perspective on enhancing the LLM-driven peer-review process by incorporating dynamic, role-based interactions. It aligns closely with the iterative and interactive nature of real-world academic peer review, offering a robust foundation for future research and development in this area. We open-source the dataset at this https URL.
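Summary: Large language models (LLMs) have shown wide-ranging applications across many fields and considerable potential in the academic peer-review process. However, existing applications are mostly limited to generating static reviews from submitted papers, and fail to capture the dynamic, iterative nature of real-world peer review. In this paper, we reformulate the peer-review process as a multi-turn, long-context dialogue with distinct roles for authors, reviewers, and decision makers. We construct a comprehensive dataset containing over 26,841 papers and 92,017 reviews collected from multiple sources, including a top-tier conference and a prestigious journal. The dataset is carefully designed to support the application of LLMs to multi-turn dialogue, effectively simulating the complete peer-review process. In addition, we propose a series of metrics for evaluating LLM performance in each role under this reformulated setting, ensuring fair and comprehensive evaluation. We believe this work offers a promising perspective on enhancing LLM-driven peer review through dynamic, role-based interactions. It aligns closely with the iterative and interactive nature of real-world academic peer review and lays a solid foundation for future research and development in this area. We open-source the dataset at this https URL.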

Title: MoPS: Modular Story Premise Synthesis for Open-Ended Automatic Story Generation

Authors: Yan Ma, Yu Qiao, Pengfei Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.05690
Pdf URL: https://arxiv.org/pdf/2406.05690
Copy Paste: [[2406.05690]] MoPS: Modular Story Premise Synthesis for Open-Ended Automatic Story Generation(https://arxiv.org/abs/2406.05690)
Keywords: language model, llm
Abstract: A story premise succinctly defines a story's main idea, foundation, and trajectory. It serves as the initial trigger in automatic story generation. Existing sources of story premises are limited by a lack of diversity, uneven quality, and high costs that make them difficult to scale. In response, we introduce Modular Story Premise Synthesis (MoPS) which breaks down story premises into modules like background and persona for automated design and generation. MoPS consists of three phases: (1) Precollect a consistent set of candidates for each module to form a nested dictionary. (2) Extract a key path from the nested dictionary as the premise design. (3) Instruct an LLM to integrate the design into a coherent premise sentence. Thorough evaluations demonstrate that our synthesized premises excel in diversity, fascination, completeness, and originality compared to those induced from large language models and captured from public story datasets. Similarly, the extended novels and scripts generated from our premises also exhibit higher quality. In supplementary materials, we provide the MoPS code suite, along with 7.6k generated premises and 1k extended stories. Code: this https URL.
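Summary: A story premise concisely defines a story's main idea, foundation, and trajectory. It is the initial trigger in automatic story generation. Existing sources of story premises suffer from a lack of diversity, uneven quality, and high costs, which makes them hard to scale. In response, we introduce Modular Story Premise Synthesis (MoPS), which breaks story premises down into modules such as background and persona for automated design and generation. MoPS consists of three phases: (1) Precollect a consistent set of candidates for each module to form a nested dictionary. (2) Extract a key path from the nested dictionary as the premise design. (3) Instruct an LLM to integrate the design into a coherent premise sentence. Thorough evaluations show that our synthesized premises are superior in diversity, fascination, completeness, and originality to premises induced from large language models and to those captured from public story datasets. Likewise, the extended novels and scripts generated from our premises are of higher quality. In the supplementary materials, we provide the MoPS code suite, along with 7.6k generated premises and 1k extended stories. Code: this https URL.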
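
The three MoPS phases listed above can be illustrated with a minimal sketch. This is not the authors' code: the module order (theme, background, persona), the candidate values, and the helper names `sample_key_path` and `build_prompt` are all assumptions made for illustration, and the final LLM call is left out.

```python
import random

# Phase 1 (assumed layout): a nested dictionary of candidates,
# theme -> background -> list of personas. All values are made up.
CANDIDATES = {
    "mystery": {
        "a remote lighthouse": ["a retired detective", "a night-shift keeper"],
        "a crowded night train": ["a nervous courier"],
    },
    "science fiction": {
        "a failing orbital station": ["a maintenance engineer", "a stowaway botanist"],
    },
}


def sample_key_path(tree, rng):
    """Phase 2: walk from the root to a leaf, picking one key per level."""
    path = []
    node = tree
    while isinstance(node, dict):
        key = rng.choice(sorted(node))  # sorted() keeps sampling reproducible
        path.append(key)
        node = node[key]
    path.append(rng.choice(node))  # the innermost level is a list of personas
    return path


def build_prompt(path):
    """Phase 3: turn the key path into an instruction for an LLM."""
    theme, background, persona = path
    return (
        "Write a one-sentence story premise. "
        f"Theme: {theme}. Background: {background}. Persona: {persona}."
    )


rng = random.Random(0)
path = sample_key_path(CANDIDATES, rng)
prompt = build_prompt(path)
print(prompt)
```

Because each level of the dictionary is conditioned on the key chosen above it, every sampled path is internally consistent, which is one plausible reading of the "consistent set of candidates" in phase 1.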

Title: MrRank: Improving Question Answering Retrieval System through Multi-Result Ranking Model

Authors: Danupat Khamnuansin, Tawunrat Chalothorn, Ekapol Chuangsuwanich
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.05733
Pdf URL: https://arxiv.org/pdf/2406.05733
Copy Paste: [[2406.05733]] MrRank: Improving Question Answering Retrieval System through Multi-Result Ranking Model(https://arxiv.org/abs/2406.05733)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) often struggle with hallucinations and outdated information. To address this, Information Retrieval (IR) systems can be employed to augment LLMs with up-to-date knowledge. However, existing IR techniques contain deficiencies, posing a performance bottleneck. Given the extensive array of IR systems, combining diverse approaches presents a viable strategy. Nevertheless, prior attempts have yielded restricted efficacy. In this work, we propose an approach that leverages learning-to-rank techniques to combine heterogeneous IR systems. We demonstrate the method on two Retrieval Question Answering (ReQA) tasks. Our empirical findings exhibit a significant performance enhancement, outperforming previous approaches and achieving state-of-the-art results on ReQA SQuAD.
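Summary: Large language models (LLMs) often struggle with hallucinations and outdated information. To address this, information retrieval (IR) systems can be used to supply LLMs with up-to-date knowledge. However, existing IR techniques have deficiencies that create a performance bottleneck. Given the wide variety of IR systems, combining diverse approaches is a viable strategy. Nevertheless, previous attempts have had limited success. In this work, we propose an approach that uses learning-to-rank techniques to combine heterogeneous IR systems. We demonstrate the method on two Retrieval Question Answering (ReQA) tasks. Our empirical results show a significant performance gain, outperforming previous approaches and achieving state-of-the-art results on ReQA SQuAD.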

Title: Arabic Diacritics in the Wild: Exploiting Opportunities for Improved Diacritization

Authors: Salman Elgamal, Ossama Obeid, Tameem Kabbani, Go Inoue, Nizar Habash
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.05760
Pdf URL: https://arxiv.org/pdf/2406.05760
Copy Paste: [[2406.05760]] Arabic Diacritics in the Wild: Exploiting Opportunities for Improved Diacritization(https://arxiv.org/abs/2406.05760)
Keywords: gpt, chat
Abstract: The widespread absence of diacritical marks in Arabic text poses a significant challenge for Arabic natural language processing (NLP). This paper explores instances of naturally occurring diacritics, referred to as "diacritics in the wild," to unveil patterns and latent information across six diverse genres: news articles, novels, children's books, poetry, political documents, and ChatGPT outputs. We present a new annotated dataset that maps real-world partially diacritized words to their maximal full diacritization in context. Additionally, we propose extensions to the analyze-and-disambiguate approach in Arabic NLP to leverage these diacritics, resulting in notable improvements. Our contributions encompass a thorough analysis, valuable datasets, and an extended diacritization algorithm. We release our code and datasets as open source.
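Summary: The widespread absence of diacritical marks in Arabic text poses a significant challenge for Arabic natural language processing (NLP). This paper explores instances of naturally occurring diacritics, referred to as "diacritics in the wild," to reveal patterns and latent information across six diverse genres: news articles, novels, children's books, poetry, political documents, and ChatGPT outputs. We present a new annotated dataset that maps real-world partially diacritized words to their maximal full diacritization in context. In addition, we propose extensions to the analyze-and-disambiguate approach in Arabic NLP to make use of these diacritics, resulting in notable improvements. Our contributions include a thorough analysis, valuable datasets, and an extended diacritization algorithm. We release our code and datasets as open source.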

Title: The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models

Authors: Seungone Kim, Juyoung Suk, Ji Yong Cho, Shayne Longpre, Chaeeun Kim, Dongkeun Yoon, Guijin Son, Yejin Cho, Sheikh Shafayat, Jinheon Baek, Sue Hyun Park, Hyeonbin Hwang, Jinkyung Jo, Hyowon Cho, Haebin Shin, Seongyun Lee, Hanseok Oh, Noah Lee, Namgyu Ho, Se June Joo, Miyoung Ko, Yoonjoo Lee, Hyungjoo Chae, Jamin Shin, Joel Jang, Seonghyeon Ye, Bill Yuchen Lin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, Minjoon Seo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.05761
Pdf URL: https://arxiv.org/pdf/2406.05761
Copy Paste: [[2406.05761]] The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models(https://arxiv.org/abs/2406.05761)
Keywords: language model
Abstract: As language models (LMs) become capable of handling a wide range of tasks, their evaluation is becoming as challenging as their development. Most generation benchmarks currently assess LMs using abstract evaluation criteria like helpfulness and harmlessness, which often lack the flexibility and granularity of human assessment. Additionally, these benchmarks tend to focus disproportionately on specific capabilities such as instruction following, leading to coverage bias. To overcome these limitations, we introduce the BiGGen Bench, a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks. A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation. We apply this benchmark to assess 103 frontier LMs using five evaluator LMs. Our code, data, and evaluation results are all publicly available at this https URL.
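Summary: As language models (LMs) become capable of handling a wide range of tasks, evaluating them is becoming as challenging as developing them. Most generation benchmarks currently assess LMs using abstract evaluation criteria such as helpfulness and harmlessness, which often lack the flexibility and granularity of human assessment. In addition, these benchmarks tend to focus disproportionately on specific capabilities such as instruction following, which leads to coverage bias. To overcome these limitations, we introduce the BiGGen Bench, a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks. A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, which closely mirror the nuanced discernment of human evaluation. We apply this benchmark to assess 103 frontier LMs using five evaluator LMs. Our code, data, and evaluation results are all publicly available at this https URL.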

Title: RE-RAG: Improving Open-Domain QA Performance and Interpretability with Relevance Estimator in Retrieval-Augmented Generation

Authors: Kiseung Kim, Jay-Yoon Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.05794
Pdf URL: https://arxiv.org/pdf/2406.05794
Copy Paste: [[2406.05794]] RE-RAG: Improving Open-Domain QA Performance and Interpretability with Relevance Estimator in Retrieval-Augmented Generation(https://arxiv.org/abs/2406.05794)
Keywords: gpt, llm, chat, retrieval-augmented generation
Abstract: The retrieval-augmented generation (RAG) framework shows state-of-the-art performance on open-domain question answering tasks by referencing external knowledge. However, the RAG system faces challenges with performance degradation when it is fed contexts of low relevance or when the relative relevance among the input contexts is inaccurately assessed. In this work, we propose a RE-RAG framework that injects an explicit context relevance estimator (RE) into the RAG system. RE-RAG re-evaluates the retrieved contexts with the proposed context RE and passes the more relevant contexts along with their measured importance to the generator. To train the context RE, we propose an unsupervised learning method, which does not utilize any labeled document ranking data. To examine the efficacy of RE-RAG, we evaluate its performance on the Natural Questions and TriviaQA datasets. RE-RAG achieves on-par performance compared to the FiD variants while utilizing fewer contexts (0.25x). We show that the proposed context RE, which was trained with the T5 model, is also applicable to RAG with LLMs (ChatGPT) by improving the performance on NQ (+6.4 EM) and TQA (+2.8 EM), respectively. Lastly, we show that RE can add interpretability to the RAG framework, as the RE score correlates highly with RE-RAG accuracy. Consequently, RE can be utilized to filter out unanswerable scenarios where the context does not contain answers with 38.9%-51.3% accuracy just by examining a set of retrieved contexts.
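Summary: The retrieval-augmented generation (RAG) framework achieves state-of-the-art performance on open-domain question answering tasks by referencing external knowledge. However, a RAG system suffers performance degradation when it is given contexts of low relevance or when the relative relevance among the input contexts is assessed inaccurately. In this work, we propose the RE-RAG framework, which injects an explicit context relevance estimator (RE) into the RAG system. RE-RAG re-evaluates the retrieved contexts with the proposed context RE and passes the more relevant contexts, along with their measured importance, to the generator. To train the context RE, we propose an unsupervised learning method that does not use any labeled document ranking data. To examine the efficacy of RE-RAG, we evaluate its performance on the Natural Questions and TriviaQA datasets. RE-RAG achieves on-par performance compared to the FiD variants while using fewer contexts (0.25x). We show that the proposed context RE, trained with the T5 model, is also applicable to RAG with LLMs (ChatGPT), improving performance on NQ (+6.4 EM) and TQA (+2.8 EM), respectively. Finally, we show that RE can add interpretability to the RAG framework, because the RE score correlates highly with RE-RAG accuracy. Consequently, RE can be used to filter out unanswerable scenarios in which the context does not contain the answer, with 38.9%-51.3% accuracy, just by examining a set of retrieved contexts.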
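
The two uses of the relevance estimator described above (passing the more relevant contexts with their importance to the generator, and flagging unanswerable cases) can be sketched as follows. This is a simplified illustration, not the RE-RAG implementation: the scores are assumed to come from some estimator returning values in [0, 1], and the function name `select_contexts`, the `keep` count, and the `answerable_threshold` value are hypothetical.

```python
def select_contexts(scored, keep=2, answerable_threshold=0.5):
    """Pick the most relevant contexts and normalize their scores into weights.

    `scored` is a list of (context, relevance) pairs, with relevance assumed
    to lie in [0, 1]. Returns None when even the best context scores below
    the threshold, which this sketch treats as "unanswerable".
    """
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    if not ranked or ranked[0][1] < answerable_threshold:
        return None
    top = ranked[:keep]
    total = sum(score for _, score in top)
    # Each kept context carries its share of the total relevance mass.
    return [(context, score / total) for context, score in top]


retrieved = [("passage A", 0.9), ("passage B", 0.3), ("passage C", 0.6)]
weighted = select_contexts(retrieved)
print(weighted)

print(select_contexts([("passage X", 0.1)]))
```

A generator could then condition on the kept passages and use the weights when aggregating answers; how RE-RAG actually combines them is described in the paper, not here.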

Title: Hidden Holes: topological aspects of language models

Authors: Stephen Fitz, Peter Romero, Jiyan Jonas Schneider
Subjects: cs.CL, cs.AI, cs.NE
Abstract URL: https://arxiv.org/abs/2406.05798
Pdf URL: https://arxiv.org/pdf/2406.05798
Copy Paste: [[2406.05798]] Hidden Holes: topological aspects of language models(https://arxiv.org/abs/2406.05798)
Keywords: language model, gpt
Abstract: We explore the topology of representation manifolds arising in autoregressive neural language models trained on raw text data. In order to study their properties, we introduce tools from computational algebraic topology, which we use as a basis for a measure of topological complexity that we call perforation. Using this measure, we study the evolution of topological structure in GPT-based large language models across depth and time during training. We then compare these to gated recurrent models, and show that the latter exhibit more topological complexity, with a distinct pattern of changes common to all natural languages but absent from synthetically generated data. The paper presents a detailed analysis of the representation manifolds derived by these models based on studying the shapes of vector clouds induced by them as they are conditioned on sentences from corpora of natural language text. The methods developed in this paper are novel in the field and based on mathematical apparatus that might be unfamiliar to the target audience. To help with that, we introduce the minimum necessary theory and provide additional visualizations in the appendices. The main contribution of the paper is a striking observation about the topological structure of the transformer as compared to LSTM-based neural architectures. It suggests that further research into mathematical properties of these neural networks is necessary to understand the operation of large transformer language models. We hope this work inspires further explorations in this direction within the NLP community.
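Summary: We explore the topology of representation manifolds that arise in autoregressive neural language models trained on raw text data. To study their properties, we introduce tools from computational algebraic topology and use them as the basis for a measure of topological complexity that we call perforation. Using this measure, we study how topological structure evolves in GPT-based large language models across depth and over time during training. We then compare these models with gated recurrent models and show that the latter exhibit more topological complexity, with a distinct pattern of changes that is common to all natural languages but absent from synthetically generated data. The paper presents a detailed analysis of the representation manifolds derived by these models, based on the shapes of the vector clouds they induce when conditioned on sentences from natural language corpora. The methods developed in this paper are novel in the field and rely on mathematical apparatus that may be unfamiliar to the target audience. To help with that, we introduce the minimum necessary theory and provide additional visualizations in the appendices. The main contribution of the paper is a striking observation about the topological structure of the transformer compared with LSTM-based neural architectures. It suggests that further research into the mathematical properties of these neural networks is needed to understand how large transformer language models operate. We hope this work inspires further exploration in this direction within the NLP community.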

Title: Do Prompts Really Prompt? Exploring the Prompt Understanding Capability of Whisper

Authors: Chih-Kai Yang, Kuan-Po Huang, Hung-yi Lee
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2406.05806
Pdf URL: https://arxiv.org/pdf/2406.05806
Copy Paste: [[2406.05806]] Do Prompts Really Prompt? Exploring the Prompt Understanding Capability of Whisper(https://arxiv.org/abs/2406.05806)
Keywords: prompt
Abstract: This research explores the interaction between Whisper, a high-performing speech recognition model, and information in prompts. Our results unexpectedly show that Whisper may not fully grasp textual prompts as anticipated. Additionally, we find that performance improvement is not guaranteed even with stronger adherence to the topic information in textual prompts. It is also noted that English prompts generally outperform Mandarin ones on datasets of both languages, likely due to differences in training data distributions for these languages. Conversely, we discover that Whisper exhibits awareness of misleading information in language tokens by effectively ignoring incorrect language tokens and focusing on the correct ones. In summary, this work raises questions about Whisper's prompt understanding capability and encourages further studies.
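Summary: This research explores the interaction between Whisper, a high-performing speech recognition model, and information in prompts. Our results unexpectedly show that Whisper may not fully grasp textual prompts as anticipated. In addition, we find that performance improvement is not guaranteed even with stronger adherence to the topic information in textual prompts. We also note that English prompts generally outperform Mandarin ones on datasets of both languages, likely because of differences in the training data distributions for these languages. Conversely, we find that Whisper shows awareness of misleading information in language tokens by effectively ignoring incorrect language tokens and focusing on the correct ones. In summary, this work raises questions about Whisper's prompt understanding capability and encourages further study.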

Title: Seventeenth-Century Spanish American Notary Records for Fine-Tuning Spanish Large Language Models

Authors: Shraboni Sarker, Ahmad Tamim Hamad, Hulayyil Alshammari, Viviana Grieco, Praveen Rao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.05812
Pdf URL: https://arxiv.org/pdf/2406.05812
Copy Paste: [[2406.05812]] Seventeenth-Century Spanish American Notary Records for Fine-Tuning Spanish Large Language Models(https://arxiv.org/abs/2406.05812)
Keywords: language model, gpt, llm, chat
Abstract: Large language models have gained tremendous popularity in domains such as e-commerce, finance, healthcare, and education. Fine-tuning is a common approach to customize an LLM on a domain-specific dataset for a desired downstream task. In this paper, we present a valuable resource for fine-tuning LLMs developed for the Spanish language to perform a variety of tasks such as classification, masked language modeling, clustering, and others. Our resource is a collection of handwritten notary records from the seventeenth century obtained from the National Archives of Argentina. This collection contains a combination of original images and transcribed text (and metadata) of 160+ pages that were handwritten by two notaries, namely, Estenban Agreda de Vergara and Nicolas de Valdivia y Brisuela nearly 400 years ago. Through empirical evaluation, we demonstrate that our collection can be used to fine-tune Spanish LLMs for tasks such as classification and masked language modeling, and can outperform pre-trained Spanish models and ChatGPT-3.5/ChatGPT-4o. Our resource will be an invaluable resource for historical text analysis and is publicly available on GitHub.
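Summary: Large language models have gained tremendous popularity in domains such as e-commerce, finance, healthcare, and education. Fine-tuning is a common way to customize an LLM on a domain-specific dataset for a desired downstream task. In this paper, we present a valuable resource for fine-tuning LLMs developed for the Spanish language to perform a variety of tasks such as classification, masked language modeling, clustering, and others. Our resource is a collection of handwritten notary records from the seventeenth century obtained from the National Archives of Argentina. The collection combines original images and transcribed text (and metadata) for 160+ pages that were handwritten by two notaries, Estenban Agreda de Vergara and Nicolas de Valdivia y Brisuela, nearly 400 years ago. Through empirical evaluation, we demonstrate that our collection can be used to fine-tune Spanish LLMs for tasks such as classification and masked language modeling, and that the resulting models can outperform pre-trained Spanish models and ChatGPT-3.5/ChatGPT-4o. Our collection will be an invaluable resource for historical text analysis and is publicly available on GitHub.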

Title: MedREQAL: Examining Medical Knowledge Recall of Large Language Models via Question Answering

Authors: Juraj Vladika, Phillip Schneider, Florian Matthes
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.05845
Pdf URL: https://arxiv.org/pdf/2406.05845
Copy Paste: [[2406.05845]] MedREQAL: Examining Medical Knowledge Recall of Large Language Models via Question Answering(https://arxiv.org/abs/2406.05845)
Keywords: language model, gpt, llm
Abstract: In recent years, Large Language Models (LLMs) have demonstrated an impressive ability to encode knowledge during pre-training on large text corpora. They can leverage this knowledge for downstream tasks like question answering (QA), even in complex areas involving health topics. Considering their high potential for facilitating clinical work in the future, understanding the quality of encoded medical knowledge and its recall in LLMs is an important step forward. In this study, we examine the capability of LLMs to exhibit medical knowledge recall by constructing a novel dataset derived from systematic reviews -- studies synthesizing evidence-based answers for specific medical questions. Through experiments on the new MedREQAL dataset, comprising question-answer pairs extracted from rigorous systematic reviews, we assess six LLMs, such as GPT and Mixtral, analyzing their classification and generation performance. Our experimental insights into LLM performance on the novel biomedical QA dataset reveal the still challenging nature of this task.
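Summary: In recent years, large language models (LLMs) have demonstrated an impressive ability to encode knowledge during pre-training on large text corpora. They can use this knowledge for downstream tasks such as question answering (QA), even in complex areas involving health topics. Given their high potential for facilitating clinical work in the future, understanding the quality of the encoded medical knowledge and its recall in LLMs is an important step forward. In this study, we examine the ability of LLMs to exhibit medical knowledge recall by constructing a novel dataset derived from systematic reviews, that is, studies that synthesize evidence-based answers to specific medical questions. Through experiments on the new MedREQAL dataset, which comprises question-answer pairs extracted from rigorous systematic reviews, we assess six LLMs, such as GPT and Mixtral, and analyze their classification and generation performance. Our experimental insights into LLM performance on this novel biomedical QA dataset reveal that the task remains challenging.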

Title: II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models

Authors: Ziqiang Liu, Feiteng Fang, Xi Feng, Xinrun Du, Chenhao Zhang, Zekun Wang, Yuelin Bai, Qixuan Zhao, Liyang Fan, Chengguang Gan, Hongquan Lin, Jiaming Li, Yuansheng Ni, Haihong Wu, Yaswanth Narsupalli, Zhigang Zheng, Chengming Li, Xiping Hu, Ruifeng Xu, Xiaojun Chen, Min Yang, Jiaheng Liu, Ruibo Liu, Wenhao Huang, Ge Zhang, Shiwen Ni
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2406.05862
Pdf URL: https://arxiv.org/pdf/2406.05862
Copy Paste: [[2406.05862]] II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models(https://arxiv.org/abs/2406.05862)
Keywords: language model, llm, prompt
Abstract: The rapid advancements in the development of multimodal large language models (MLLMs) have consistently led to new breakthroughs on various benchmarks. In response, numerous challenging and comprehensive benchmarks have been proposed to more accurately assess the capabilities of MLLMs. However, there is a dearth of exploration of the higher-order perceptual capabilities of MLLMs. To fill this gap, we propose the Image Implication understanding Benchmark, II-Bench, which aims to evaluate the model's higher-order perception of images. Through extensive experiments on II-Bench across multiple MLLMs, we have made significant findings. Initially, a substantial gap is observed between the performance of MLLMs and humans on II-Bench. The pinnacle accuracy of MLLMs attains 74.8%, whereas human accuracy averages 90%, peaking at an impressive 98%. Subsequently, MLLMs perform worse on abstract and complex images, suggesting limitations in their ability to understand high-level semantics and capture image details. Finally, it is observed that most models exhibit enhanced accuracy when image sentiment polarity hints are incorporated into the prompts. This observation underscores a notable deficiency in their inherent understanding of image sentiment. We believe that II-Bench will inspire the community to develop the next generation of MLLMs, advancing the journey towards expert artificial general intelligence (AGI). II-Bench is publicly available at this https URL.
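Summary: Rapid advances in multimodal large language models (MLLMs) have consistently led to new breakthroughs on various benchmarks. In response, many challenging and comprehensive benchmarks have been proposed to assess the capabilities of MLLMs more accurately. However, the higher-order perceptual capabilities of MLLMs remain little explored. To fill this gap, we propose the Image Implication understanding Benchmark, II-Bench, which aims to evaluate a model's higher-order perception of images. Through extensive experiments on II-Bench across multiple MLLMs, we make several significant findings. First, there is a substantial gap between the performance of MLLMs and humans on II-Bench. The best MLLM accuracy reaches 74.8%, whereas human accuracy averages 90% and peaks at an impressive 98%. Second, MLLMs perform worse on abstract and complex images, which suggests limitations in their ability to understand high-level semantics and capture image details. Finally, most models show improved accuracy when image sentiment polarity hints are included in the prompts. This observation underscores a notable deficiency in their inherent understanding of image sentiment. We believe that II-Bench will inspire the community to develop the next generation of MLLMs, advancing the journey towards expert artificial general intelligence (AGI). II-Bench is publicly available at this https URL.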

Title: Zero-Shot End-To-End Spoken Question Answering In Medical Domain

Authors: Yanis Labrak, Adel Moumen, Richard Dufour, Mickael Rouvier
Subjects: cs.CL, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2406.05876
Pdf URL: https://arxiv.org/pdf/2406.05876
Copy Paste: [[2406.05876]] Zero-Shot End-To-End Spoken Question Answering In Medical Domain(https://arxiv.org/abs/2406.05876)
Keywords: language model, llm
Abstract: In the rapidly evolving landscape of spoken question-answering (SQA), the integration of large language models (LLMs) has emerged as a transformative development. Conventional approaches often entail the use of separate models for question audio transcription and answer selection, resulting in significant resource utilization and error accumulation. To tackle these challenges, we explore the effectiveness of end-to-end (E2E) methodologies for SQA in the medical domain. Our study introduces a novel zero-shot SQA approach and compares it with traditional cascade systems. Through a comprehensive evaluation conducted on a new open benchmark of 8 medical tasks and 48 hours of synthetic audio, we demonstrate that our approach requires up to 14.7 times fewer resources than a 1.3B-parameter LLM combined with a 1.55B-parameter ASR model while improving average accuracy by 0.5%. These findings underscore the potential of E2E methodologies for SQA in resource-constrained contexts.
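Summary: In the rapidly evolving field of spoken question answering (SQA), the integration of large language models (LLMs) has become a transformative development. Conventional approaches usually rely on separate models for transcribing the question audio and selecting the answer, which leads to heavy resource use and error accumulation. To tackle these challenges, we explore the effectiveness of end-to-end (E2E) methods for SQA in the medical domain. Our study introduces a novel zero-shot SQA approach and compares it with traditional cascade systems. Through a comprehensive evaluation on a new open benchmark of 8 medical tasks and 48 hours of synthetic audio, we demonstrate that our approach requires up to 14.7 times fewer resources than a 1.3B-parameter LLM combined with a 1.55B-parameter ASR model, while improving average accuracy by 0.5%. These findings underscore the potential of E2E methods for SQA in resource-constrained settings.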

Title: Are Large Language Models Actually Good at Text Style Transfer?

Authors: Sourabrata Mukherjee, Atul Kr. Ojha, Ondřej Dušek
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.05885
Pdf URL: https://arxiv.org/pdf/2406.05885
Copy Paste: [[2406.05885]] Are Large Language Models Actually Good at Text Style Transfer?(https://arxiv.org/abs/2406.05885)
Keywords: language model, gpt, llm, prompt
Abstract: We analyze the performance of large language models (LLMs) on Text Style Transfer (TST), specifically focusing on sentiment transfer and text detoxification across three languages: English, Hindi, and Bengali. Text Style Transfer involves modifying the linguistic style of a text while preserving its core content. We evaluate the capabilities of pre-trained LLMs using zero-shot and few-shot prompting as well as parameter-efficient finetuning on publicly available datasets. Our evaluation using automatic metrics, GPT-4 and human evaluations reveals that while some prompted LLMs perform well in English, their performance on other languages (Hindi, Bengali) remains average. However, finetuning significantly improves results compared to zero-shot and few-shot prompting, making them comparable to previous state-of-the-art. This underscores the necessity of dedicated datasets and specialized models for effective TST.
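Summary: We analyze the performance of large language models (LLMs) on Text Style Transfer (TST), focusing on sentiment transfer and text detoxification in three languages: English, Hindi, and Bengali. Text Style Transfer involves modifying the linguistic style of a text while preserving its core content. We evaluate the capabilities of pre-trained LLMs using zero-shot and few-shot prompting as well as parameter-efficient finetuning on publicly available datasets. Our evaluation with automatic metrics, GPT-4, and human evaluations shows that while some prompted LLMs perform well in English, their performance on the other languages (Hindi, Bengali) remains average. However, finetuning significantly improves results compared to zero-shot and few-shot prompting, making them comparable to the previous state of the art. This underscores the need for dedicated datasets and specialized models for effective TST.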

Title: Feriji: A French-Zarma Parallel Corpus, Glossary & Translator

Authors: Mamadou K. Keita, Elysabhete Amadou Ibrahim, Habibatou Abdoulaye Alfari, Christopher Homan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.05888
Pdf URL: https://arxiv.org/pdf/2406.05888
Copy Paste: [[2406.05888]] Feriji: A French-Zarma Parallel Corpus, Glossary & Translator(https://arxiv.org/abs/2406.05888)
Keywords: language model
Abstract: Machine translation (MT) is a rapidly expanding field that has experienced significant advancements in recent years with the development of models capable of translating multiple languages with remarkable accuracy. However, the representation of African languages in this field still needs to improve due to linguistic complexities and limited resources. This applies to the Zarma language, a dialect of Songhay (of the Nilo-Saharan language family) spoken by over 5 million people across Niger and neighboring countries \cite{lewis2016ethnologue}. This paper introduces Feriji, the first robust French-Zarma parallel corpus and glossary designed for MT. The corpus, containing 61,085 sentences in Zarma and 42,789 in French, and a glossary of 4,062 words represent a significant step in addressing the need for more resources for Zarma. We fine-tune three large language models on our dataset, obtaining a BLEU score of 30.06 on the best-performing model. We further evaluate the models on human judgments of fluency, comprehension, and readability and the importance and impact of the corpus and models. Our contributions help to bridge a significant language gap and promote an essential and overlooked indigenous African language.
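Summary: Machine translation (MT) is a rapidly expanding field that has advanced significantly in recent years with the development of models capable of translating multiple languages with remarkable accuracy. However, the representation of African languages in this field still needs to improve, owing to linguistic complexities and limited resources. This applies to the Zarma language, a dialect of Songhay (of the Nilo-Saharan language family) spoken by over 5 million people across Niger and neighboring countries \cite{lewis2016ethnologue}. This paper introduces Feriji, the first robust French-Zarma parallel corpus and glossary designed for MT. The corpus, containing 61,085 sentences in Zarma and 42,789 in French, and a glossary of 4,062 words represent a significant step towards meeting the need for more Zarma resources. We fine-tune three large language models on our dataset, obtaining a BLEU score of 30.06 with the best-performing model. We further evaluate the models using human judgments of fluency, comprehension, and readability, and assess the importance and impact of the corpus and models. Our contributions help to bridge a significant language gap and promote an essential and overlooked indigenous African language.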

Title: Why Don't Prompt-Based Fairness Metrics Correlate?

Authors: Abdelrahman Zayed, Goncalo Mordido, Ioana Baldini, Sarath Chandar
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2406.05918
Pdf URL: https://arxiv.org/pdf/2406.05918
Copy Paste: [[2406.05918]] Why Don't Prompt-Based Fairness Metrics Correlate?(https://arxiv.org/abs/2406.05918)
Keywords: language model, prompt
Abstract: The widespread use of large language models has brought up essential questions about the potential biases these models might learn. This led to the development of several metrics aimed at evaluating and mitigating these biases. In this paper, we first demonstrate that prompt-based fairness metrics exhibit poor agreement, as measured by correlation, raising important questions about the reliability of fairness assessment using prompts. Then, we outline six relevant reasons why such a low correlation is observed across existing metrics. Based on these insights, we propose a method called Correlated Fairness Output (CAIRO) to enhance the correlation between fairness metrics. CAIRO augments the original prompts of a given fairness metric by using several pre-trained language models and then selects the combination of the augmented prompts that achieves the highest correlation across metrics. We show a significant improvement in Pearson correlation from 0.3 and 0.18 to 0.90 and 0.98 across metrics for gender and religion biases, respectively. Our code is available at this https URL.
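Summary: The widespread use of large language models has raised essential questions about the potential biases these models might learn. This has led to the development of several metrics aimed at evaluating and mitigating these biases. In this paper, we first demonstrate that prompt-based fairness metrics exhibit poor agreement, as measured by correlation, which raises important questions about the reliability of fairness assessment using prompts. We then outline six relevant reasons why such low correlation is observed across existing metrics. Based on these insights, we propose a method called Correlated Fairness Output (CAIRO) to enhance the correlation between fairness metrics. CAIRO augments the original prompts of a given fairness metric using several pre-trained language models and then selects the combination of augmented prompts that achieves the highest correlation across metrics. We show a significant improvement in Pearson correlation, from 0.3 and 0.18 to 0.90 and 0.98 across metrics for gender and religion biases, respectively. Our code is available at this https URL.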
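
The selection step described above (choose the prompt combination with the highest correlation across metrics) can be illustrated with a small sketch. It is a simplification under stated assumptions, not the CAIRO code: there are only two metrics, each prompt variant is assumed to have already produced one bias score per model, and all scores and the names `metric_a`, `metric_b`, and `best_pair` are made up.

```python
from itertools import product
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Made-up bias scores for four models under each prompt variant.
metric_a = {
    "original": [0.10, 0.40, 0.20, 0.35],
    "augmented_1": [0.12, 0.30, 0.22, 0.41],
}
metric_b = {
    "original": [0.50, 0.20, 0.45, 0.30],
    "augmented_1": [0.15, 0.33, 0.20, 0.44],
}


def best_pair(a, b):
    """Return the (variant_a, variant_b) pair whose scores correlate most."""
    return max(product(a, b), key=lambda pair: pearson(a[pair[0]], b[pair[1]]))


best = best_pair(metric_a, metric_b)
best_r = pearson(metric_a[best[0]], metric_b[best[1]])
print(best, round(best_r, 3))
```

The exhaustive search over pairs is only practical for a handful of variants; with many metrics and augmentations, the number of combinations grows quickly and a smarter search would be needed.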

Title: Hello Again! LLM-powered Personalized Agent for Long-term Dialogue

Authors: Hao Li, Chenghao Yang, An Zhang, Yang Deng, Xiang Wang, Tat-Seng Chua
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.05925
Pdf URL: https://arxiv.org/pdf/2406.05925
Copy Paste: [[2406.05925]] Hello Again! LLM-powered Personalized Agent for Long-term Dialogue(https://arxiv.org/abs/2406.05925)
Keywords: language model, llm, chat, agent
Abstract: Open-domain dialogue systems have seen remarkable advancements with the development of large language models (LLMs). Nonetheless, most existing dialogue systems predominantly focus on brief single-session interactions, neglecting the real-world demands for long-term companionship and personalized interactions with chatbots. Crucial to addressing this real-world need are event summary and persona management, which enable reasoning for appropriate long-term dialogue responses. Recent progress in the human-like cognitive and reasoning capabilities of LLMs suggests that LLM-based agents could significantly enhance automated perception, decision-making, and problem-solving. In response to this potential, we introduce a model-agnostic framework, the Long-term Dialogue Agent (LD-Agent), which incorporates three independently tunable modules dedicated to event perception, persona extraction, and response generation. For the event memory module, long and short-term memory banks are employed to separately focus on historical and ongoing sessions, while a topic-based retrieval mechanism is introduced to enhance the accuracy of memory retrieval. Furthermore, the persona module conducts dynamic persona modeling for both users and agents. The integration of retrieved memories and extracted personas is subsequently fed into the generator to induce appropriate responses. The effectiveness, generality, and cross-domain capabilities of LD-Agent are empirically demonstrated across various illustrative benchmarks, models, and tasks. The code is released at this https URL.

Title: HOLMES: Hyper-Relational Knowledge Graphs for Multi-hop Question Answering using LLMs

Authors: Pranoy Panda, Ankush Agarwal, Chaitanya Devaguptapu, Manohar Kaul, Prathosh A P
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.06027
Pdf URL: https://arxiv.org/pdf/2406.06027
Copy Paste: [[2406.06027]] HOLMES: Hyper-Relational Knowledge Graphs for Multi-hop Question Answering using LLMs(https://arxiv.org/abs/2406.06027)
Keywords: language model, llm
Abstract: Given unstructured text, Large Language Models (LLMs) are adept at answering simple (single-hop) questions. However, as the complexity of the questions increases, the performance of LLMs degrades. We believe this is due to the overhead associated with understanding the complex question followed by filtering and aggregating unstructured information in the raw text. Recent methods try to reduce this burden by integrating structured knowledge triples into the raw text, aiming to provide a structured overview that simplifies information processing. However, this simplistic approach is query-agnostic and the extracted facts are ambiguous as they lack context. To address these drawbacks and to enable LLMs to answer complex (multi-hop) questions with ease, we propose to use a knowledge graph (KG) that is context-aware and is distilled to contain query-relevant information. The use of our compressed distilled KG as input to the LLM results in our method utilizing up to 67% fewer tokens to represent the query-relevant information present in the supporting documents, compared to the state-of-the-art (SoTA) method. Our experiments show consistent improvements over the SoTA across several metrics (EM, F1, BERTScore, and Human Eval) on two popular benchmark datasets (HotpotQA and MuSiQue).

Title: The Curse of Popularity: Popular Entities have Catastrophic Side Effects when Deleting Knowledge from Language Models

Authors: Ryosuke Takahashi, Go Kamoda, Benjamin Heinzerling, Keisuke Sakaguchi, Kentaro Inui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.06032
Pdf URL: https://arxiv.org/pdf/2406.06032
Copy Paste: [[2406.06032]] The Curse of Popularity: Popular Entities have Catastrophic Side Effects when Deleting Knowledge from Language Models(https://arxiv.org/abs/2406.06032)
Keywords: language model
Abstract: Language models (LMs) encode world knowledge in their internal parameters through training. However, LMs may learn personal and confidential information from the training data, leading to privacy concerns such as data leakage. Therefore, research on knowledge deletion from LMs is essential. This study focuses on the knowledge stored in LMs and analyzes the relationship between the side effects of knowledge deletion and the entities related to the knowledge. Our findings reveal that deleting knowledge related to popular entities can have catastrophic side effects. Furthermore, this research is the first to analyze knowledge deletion in models trained on synthetic knowledge graphs, indicating a new direction for controlled experiments.

Title: MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models

Authors: Zichun Yu, Spandan Das, Chenyan Xiong
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.06046
Pdf URL: https://arxiv.org/pdf/2406.06046
Copy Paste: [[2406.06046]] MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models(https://arxiv.org/abs/2406.06046)
Keywords: language model
Abstract: Pretraining data selection has the potential to improve language model pretraining efficiency by utilizing higher-quality data from massive web data corpora. Current data selection methods, which rely on either hand-crafted rules or larger reference models, are conducted statically and do not capture the evolving data preferences during pretraining. In this paper, we introduce model-aware data selection with data influence models (MATES), where a data influence model continuously adapts to the evolving data preferences of the pretraining model and then selects the data most effective for the current pretraining progress. Specifically, we fine-tune a small data influence model to approximate oracle data preference signals collected by locally probing the pretraining model and to select data accordingly for the next pretraining stage. Experiments on Pythia and the C4 dataset demonstrate that MATES significantly outperforms random data selection on extensive downstream tasks in both zero- and few-shot settings. It doubles the gains achieved by recent data selection approaches that leverage larger reference models and reduces the total FLOPs required to reach certain performances by half. Further analysis validates the ever-changing data preferences of pretraining models and the effectiveness of our data influence models to capture them. Our code is open-sourced at this https URL.

Title: Synth-SBDH: A Synthetic Dataset of Social and Behavioral Determinants of Health for Clinical Text

Authors: Avijit Mitra, Emily Druhl, Raelene Goodwin, Hong Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.06056
Pdf URL: https://arxiv.org/pdf/2406.06056
Copy Paste: [[2406.06056]] Synth-SBDH: A Synthetic Dataset of Social and Behavioral Determinants of Health for Clinical Text(https://arxiv.org/abs/2406.06056)
Keywords: llm
Abstract: Social and behavioral determinants of health (SBDH) play a crucial role in health outcomes and are frequently documented in clinical text. Automatically extracting SBDH information from clinical text relies on publicly available good-quality datasets. However, existing SBDH datasets exhibit substantial limitations in their availability and coverage. In this study, we introduce Synth-SBDH, a novel synthetic dataset with detailed SBDH annotations, encompassing status, temporal information, and rationale across 15 SBDH categories. We showcase the utility of Synth-SBDH on three tasks using real-world clinical datasets from two distinct hospital settings, highlighting its versatility, generalizability, and distillation capabilities. Models trained on Synth-SBDH consistently outperform counterparts with no Synth-SBDH training, achieving up to 62.5% macro-F improvements. Additionally, Synth-SBDH proves effective for rare SBDH categories and under-resource constraints. Human evaluation demonstrates a Human-LLM alignment of 71.06% and uncovers areas for future refinements.

Title: Recurrent Context Compression: Efficiently Expanding the Context Window of LLM

Authors: Chensen Huang, Guibo Zhu, Xuepeng Wang, Yifei Luo, Guojing Ge, Haoran Chen, Dong Yi, Jinqiao Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.06110
Pdf URL: https://arxiv.org/pdf/2406.06110
Copy Paste: [[2406.06110]] Recurrent Context Compression: Efficiently Expanding the Context Window of LLM(https://arxiv.org/abs/2406.06110)
Keywords: language model, llm
Abstract: To extend the context length of Transformer-based large language models (LLMs) and improve comprehension capabilities, we often face limitations due to computational resources and bounded memory storage capacity. This work introduces a method called Recurrent Context Compression (RCC), designed to efficiently expand the context window length of LLMs within constrained storage space. We also investigate the issue of poor model responses when both instructions and context are compressed in downstream tasks, and propose an instruction reconstruction method to mitigate this problem. We validated the effectiveness of our approach on multiple tasks, achieving a compression rate of up to 32x on text reconstruction tasks with a BLEU4 score close to 0.95, and nearly 100% accuracy on a passkey retrieval task with a sequence length of 1M. Finally, our method demonstrated competitive performance in long-text question-answering tasks compared to non-compressed methods, while significantly saving storage resources in long-text inference tasks. Our code, models, and demo are available at this https URL.

Title: Enhancing Long-Term Memory using Hierarchical Aggregate Tree for Retrieval Augmented Generation

Authors: Aadharsh Aadhithya A, Sachin Kumar S, Soman K.P
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.06124
Pdf URL: https://arxiv.org/pdf/2406.06124
Copy Paste: [[2406.06124]] Enhancing Long-Term Memory using Hierarchical Aggregate Tree for Retrieval Augmented Generation(https://arxiv.org/abs/2406.06124)
Keywords: language model, llm, retrieval augmented generation
Abstract: Large language models have limited context capacity, hindering reasoning over long conversations. We propose the Hierarchical Aggregate Tree (HAT) memory structure to recursively aggregate relevant dialogue context through conditional tree traversals. HAT encapsulates information from child nodes, enabling broad coverage with depth control. We formulate finding the best context as an optimal tree traversal. Experiments show HAT improves dialog coherence and summary quality over baseline contexts, demonstrating the technique's effectiveness for multi-turn reasoning without exponential parameter growth. This memory augmentation enables more consistent, grounded long-form conversations from LLMs.

Title: Verifiable Generation with Subsentence-Level Fine-Grained Citations

Authors: Shuyang Cao, Lu Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.06125
Pdf URL: https://arxiv.org/pdf/2406.06125
Copy Paste: [[2406.06125]] Verifiable Generation with Subsentence-Level Fine-Grained Citations(https://arxiv.org/abs/2406.06125)
Keywords: language model, llm
Abstract: Verifiable generation requires large language models (LLMs) to cite source documents supporting their outputs, thereby improving output transparency and trustworthiness. Yet, previous work mainly targets the generation of sentence-level citations, lacking specificity about which parts of a sentence are backed by the cited sources. This work studies verifiable generation with subsentence-level fine-grained citations for more precise location of generated content supported by the cited sources. We first present a dataset, SCiFi, comprising 10K Wikipedia paragraphs with subsentence-level citations. Each paragraph is paired with a set of candidate source documents for citation and a query that triggers the generation of the paragraph content. On SCiFi, we evaluate the performance of state-of-the-art LLMs and strategies for processing long documents designed for these models. Our experimental results reveal key factors that could enhance the quality of citations, including the expansion of the source documents' context accessible to the models and the implementation of specialized model tuning.

Title: Can I understand what I create? Self-Knowledge Evaluation of Large Language Models

Authors: Zhiquan Tan, Lai Wei, Jindong Wang, Xing Xie, Weiran Huang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.06140
Pdf URL: https://arxiv.org/pdf/2406.06140
Copy Paste: [[2406.06140]] Can I understand what I create? Self-Knowledge Evaluation of Large Language Models(https://arxiv.org/abs/2406.06140)
Keywords: language model, llm
Abstract: Large language models (LLMs) have achieved remarkable progress in linguistic tasks, necessitating robust evaluation frameworks to understand their capabilities and limitations. Inspired by Feynman's principle of understanding through creation, we introduce a self-knowledge evaluation framework that is easy to implement, evaluating models on their ability to comprehend and respond to self-generated questions. Our findings, based on testing multiple models across diverse tasks, reveal significant gaps in the models' self-knowledge ability. Further analysis indicates these gaps may be due to misalignment with human attention mechanisms. Additionally, fine-tuning on self-generated math tasks may enhance the model's math performance. These results highlight the potential of the framework for efficient and insightful model evaluation, and the framework may also contribute to the improvement of LLMs.

Title: Language Models Resist Alignment

Authors: Jiaming Ji, Kaile Wang, Tianyi Qiu, Boyuan Chen, Jiayi Zhou, Changye Li, Hantao Lou, Yaodong Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.06144
Pdf URL: https://arxiv.org/pdf/2406.06144
Copy Paste: [[2406.06144]] Language Models Resist Alignment(https://arxiv.org/abs/2406.06144)
Keywords: language model, llm
Abstract: Large language models (LLMs) may exhibit undesirable behaviors. Recent efforts have focused on aligning these models to prevent harmful generation. Despite these efforts, studies have shown that even a well-conducted alignment process can be easily circumvented, whether intentionally or accidentally. Does alignment fine-tuning have robust effects on models, or are its effects merely superficial? In this work, we answer this question through both theoretical and empirical means. Empirically, we demonstrate the elasticity of post-alignment models, i.e., the tendency to revert to the behavior distribution formed during the pre-training phase upon further fine-tuning. Using compression theory, we formally derive that such a fine-tuning process disproportionately undermines alignment compared to pre-training, potentially by orders of magnitude. We conduct experimental validations to confirm the presence of elasticity across models of varying types and sizes. Specifically, we find that model performance declines rapidly before reverting to the pre-training distribution, after which the rate of decline drops significantly. We further reveal that elasticity positively correlates with increased model size and the expansion of pre-training data. Our discovery signifies the importance of taming the inherent elasticity of LLMs, thereby overcoming the resistance of LLMs to alignment fine-tuning.

Title: LINGOLY: A Benchmark of Olympiad-Level Linguistic Reasoning Puzzles in Low-Resource and Extinct Languages

Authors: Andrew M. Bean, Simi Hellsten, Harry Mayne, Jabez Magomere, Ethan A. Chi, Ryan Chi, Scott A. Hale, Hannah Rose Kirk
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.06196
Pdf URL: https://arxiv.org/pdf/2406.06196
Copy Paste: [[2406.06196]] LINGOLY: A Benchmark of Olympiad-Level Linguistic Reasoning Puzzles in Low-Resource and Extinct Languages(https://arxiv.org/abs/2406.06196)
Keywords: language model, llm
Abstract: In this paper, we present the LingOly benchmark, a novel benchmark for advanced reasoning abilities in large language models. Using challenging Linguistic Olympiad puzzles, we evaluate (i) capabilities for in-context identification and generalisation of linguistic patterns in very low-resource or extinct languages, and (ii) abilities to follow complex task instructions. The LingOly benchmark covers more than 90 mostly low-resource languages, minimising issues of data contamination, and contains 1,133 problems across 6 formats and 5 levels of human difficulty. We assess performance with both direct accuracy and comparison to a no-context baseline to penalise memorisation. Scores from 11 state-of-the-art LLMs demonstrate the benchmark to be challenging, and models perform poorly on the higher difficulty problems. On harder problems, even the top model only achieved 35.3% accuracy, a 21.7% improvement over the no-context baseline. Large closed models typically outperform open models, and in general, the higher-resource the language, the better the scores. These results indicate that, in the absence of memorisation, true multi-step out-of-domain reasoning remains a challenge for current language models.

Title: Multi-Prompting Decoder Helps Better Language Understanding

Authors: Zifeng Cheng, Zhaoling Chen, Zhiwei Jiang, Yafeng Yin, Shiping Ge, Yuliang Liu, Qing Gu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.06279
Pdf URL: https://arxiv.org/pdf/2406.06279
Copy Paste: [[2406.06279]] Multi-Prompting Decoder Helps Better Language Understanding(https://arxiv.org/abs/2406.06279)
Keywords: language model, prompt
Abstract: Recent Pre-trained Language Models (PLMs) usually only provide users with the inference APIs, namely the emerging Model-as-a-Service (MaaS) setting. To adapt MaaS PLMs to downstream tasks without accessing their parameters and gradients, some existing methods focus on the output-side adaptation of PLMs, viewing the PLM as an encoder and then optimizing a task-specific decoder for decoding the output hidden states and class scores of the PLM. Despite the effectiveness of these methods, they only use a single prompt to query PLMs for decoding, leading to a heavy reliance on the quality of the adopted prompt. In this paper, we propose a simple yet effective Multi-Prompting Decoder (MPD) framework for MaaS adaptation. The core idea is to query PLMs with multiple different prompts for each sample, thereby obtaining multiple output hidden states and class scores for subsequent decoding. Such a multi-prompting decoding paradigm can simultaneously mitigate reliance on the quality of a single prompt, alleviate the issue of data scarcity under the few-shot setting, and provide richer knowledge extracted from PLMs. Specifically, we propose two decoding strategies: multi-prompting decoding with optimal transport for hidden states and calibrated decoding for class scores. Extensive experiments demonstrate that our method achieves new state-of-the-art results on multiple natural language understanding datasets under the few-shot setting.

Title: Tx-LLM: A Large Language Model for Therapeutics

Authors: Juan Manuel Zambrano Chaves, Eric Wang, Tao Tu, Eeshit Dhaval Vaishnav, Byron Lee, S. Sara Mahdavi, Christopher Semturs, David Fleet, Vivek Natarajan, Shekoofeh Azizi
Subjects: cs.CL, cs.AI, cs.CE, cs.LG
Abstract URL: https://arxiv.org/abs/2406.06316
Pdf URL: https://arxiv.org/pdf/2406.06316
Copy Paste: [[2406.06316]] Tx-LLM: A Large Language Model for Therapeutics(https://arxiv.org/abs/2406.06316)
Keywords: language model, llm, prompt
Abstract: Developing therapeutics is a lengthy and expensive process that requires the satisfaction of many different criteria, and AI models capable of expediting the process would be invaluable. However, the majority of current AI approaches address only a narrowly defined set of tasks, often circumscribed within a particular domain. To bridge this gap, we introduce Tx-LLM, a generalist large language model (LLM) fine-tuned from PaLM-2 which encodes knowledge about diverse therapeutic modalities. Tx-LLM is trained using a collection of 709 datasets that target 66 tasks spanning various stages of the drug discovery pipeline. Using a single set of weights, Tx-LLM simultaneously processes a wide variety of chemical or biological entities (small molecules, proteins, nucleic acids, cell lines, diseases) interleaved with free-text, allowing it to predict a broad range of associated properties, achieving performance competitive with the state of the art (SOTA) on 43 out of 66 tasks and exceeding SOTA on 22. Among these, Tx-LLM is particularly powerful and exceeds best-in-class performance on average for tasks combining molecular SMILES representations with text such as cell line names or disease names, likely due to context learned during pretraining. We observe evidence of positive transfer between tasks with diverse drug types (e.g., tasks involving small molecules and tasks involving proteins), and we study the impact of model size, domain finetuning, and prompting strategies on performance. We believe Tx-LLM represents an important step towards LLMs encoding biochemical knowledge and could have a future role as an end-to-end tool across the drug discovery development pipeline.

Title: Self-Tuning: Instructing LLMs to Effectively Acquire New Knowledge through Self-Teaching

Authors: Xiaoying Zhang, Baolin Peng, Ye Tian, Jingyan Zhou, Yipeng Zhang, Haitao Mi, Helen Meng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.06326
Pdf URL: https://arxiv.org/pdf/2406.06326
Copy Paste: [[2406.06326]] Self-Tuning: Instructing LLMs to Effectively Acquire New Knowledge through Self-Teaching(https://arxiv.org/abs/2406.06326)
Keywords: language model, llm
Abstract: Large language models (LLMs) often struggle to provide up-to-date information due to their one-time training and the constantly evolving nature of the world. To keep LLMs current, existing approaches typically involve continued pre-training on new documents. However, they frequently face difficulties in extracting stored knowledge. Motivated by the remarkable success of the Feynman Technique in efficient human learning, we introduce Self-Tuning, a learning framework aimed at improving an LLM's ability to effectively acquire new knowledge from raw documents through self-teaching. Specifically, we develop a Self-Teaching strategy that augments the documents with a set of knowledge-intensive tasks created in a self-supervised manner, focusing on three crucial aspects: memorization, comprehension, and self-reflection. Additionally, we introduce three Wiki-Newpages-2023-QA datasets to facilitate an in-depth analysis of an LLM's knowledge acquisition ability concerning memorization, extraction, and reasoning. Extensive experimental results on Llama2 family models reveal that Self-Tuning consistently exhibits superior performance across all knowledge acquisition tasks and excels in preserving previous knowledge.

Title: MedExQA: Medical Question Answering Benchmark with Multiple Explanations

Authors: Yunsoo Kim, Jinge Wu, Yusuf Abdulle, Honghan Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.06331
Pdf URL: https://arxiv.org/pdf/2406.06331
Copy Paste: [[2406.06331]] MedExQA: Medical Question Answering Benchmark with Multiple Explanations(https://arxiv.org/abs/2406.06331)
Keywords: language model, gpt, llm
Abstract: This paper introduces MedExQA, a novel benchmark in medical question-answering, to evaluate large language models' (LLMs) understanding of medical knowledge through explanations. By constructing datasets across five distinct medical specialties that are underrepresented in current datasets and further incorporating multiple explanations for each question-answer pair, we address a major gap in current medical QA benchmarks which is the absence of comprehensive assessments of LLMs' ability to generate nuanced medical explanations. Our work highlights the importance of explainability in medical LLMs, proposes an effective methodology for evaluating models beyond classification accuracy, and sheds light on one specific domain, speech language pathology, where current LLMs including GPT4 lack good understanding. Our results show generation evaluation with multiple explanations aligns better with human assessment, highlighting an opportunity for a more robust automated comprehension assessment for LLMs. To diversify open-source medical LLMs (currently mostly based on Llama2), this work also proposes a new medical model, MedPhi-2, based on Phi-2 (2.7B). The model outperformed medical LLMs based on Llama2-70B in generating explanations, showing its effectiveness in the resource-constrained medical domain. We will share our benchmark datasets and the trained model.

Title: MASSW: A New Dataset and Benchmark Tasks for AI-Assisted Scientific Workflows

Authors: Xingjian Zhang, Yutong Xie, Jin Huang, Jinge Ma, Zhaoying Pan, Qijia Liu, Ziyang Xiong, Tolga Ergen, Dongsub Shim, Honglak Lee, Qiaozhu Mei
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.06357
Pdf URL: https://arxiv.org/pdf/2406.06357
Copy Paste: [[2406.06357]] MASSW: A New Dataset and Benchmark Tasks for AI-Assisted Scientific Workflows(https://arxiv.org/abs/2406.06357)
Keywords: language model, llm
Abstract: Scientific innovation relies on detailed workflows, which include critical steps such as analyzing literature, generating ideas, validating these ideas, interpreting results, and inspiring follow-up research. However, scientific publications that document these workflows are extensive and unstructured. This makes it difficult for both human researchers and AI systems to effectively navigate and explore the space of scientific innovation. To address this issue, we introduce MASSW, a comprehensive text dataset on Multi-Aspect Summarization of Scientific Workflows. MASSW includes more than 152,000 peer-reviewed publications from 17 leading computer science conferences spanning the past 50 years. Using Large Language Models (LLMs), we automatically extract five core aspects from these publications (context, key idea, method, outcome, and projected impact), which correspond to five key steps in the research workflow. These structured summaries facilitate a variety of downstream tasks and analyses. The quality of the LLM-extracted summaries is validated by comparing them with human annotations. We demonstrate the utility of MASSW through multiple novel machine-learning tasks that can be benchmarked using this new dataset, which make various types of predictions and recommendations along the scientific workflow. MASSW holds significant potential for researchers to create and benchmark new AI methods for optimizing scientific workflows and fostering scientific innovation in the field. Our dataset is openly available at this https URL.

Title: Symmetric Dot-Product Attention for Efficient Training of BERT Language Models

Authors: Martin Courtois, Malte Ostendorff, Leonhard Hennig, Georg Rehm
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.06366
Pdf URL: https://arxiv.org/pdf/2406.06366
Copy Paste: [[2406.06366]] Symmetric Dot-Product Attention for Efficient Training of BERT Language Models(https://arxiv.org/abs/2406.06366)
Keywords: language model
Abstract: Initially introduced as a machine translation model, the Transformer architecture has now become the foundation for modern deep learning architecture, with applications in a wide range of fields, from computer vision to natural language processing. Nowadays, to tackle increasingly complex tasks, Transformer-based models are stretched to enormous sizes, requiring increasingly large training datasets and an unsustainable amount of compute resources. The ubiquitous Transformer and its core component, the attention mechanism, are thus prime targets for efficiency research. In this work, we propose an alternative compatibility function for the self-attention mechanism introduced by the Transformer architecture. This compatibility function exploits an overlap in the learned representation of the traditional scaled dot-product attention, leading to a symmetric dot-product attention with pairwise coefficients. When applied to the pre-training of BERT-like models, this new symmetric attention mechanism reaches a score of 79.36 on the GLUE benchmark against 78.74 for the traditional implementation, leads to a reduction of 6% in the number of trainable parameters, and reduces the number of training steps required before convergence by half.
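One way to picture a symmetric compatibility function is to let queries and keys share a single projection, so that the score matrix equals its own transpose and one projection matrix's worth of parameters is no longer needed. The sketch below illustrates only that idea. The function names are hypothetical, and the diagonal `coeffs` vector is an assumed stand-in for the paper's pairwise coefficients, whose exact form the abstract does not give.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def symmetric_scores(x, w_shared, coeffs):
    """Scores P diag(coeffs) P^T / sqrt(d), where P = x @ w_shared.

    Queries and keys use the same projection (assumption), so the
    resulting score matrix is symmetric.
    """
    p = matmul(x, w_shared)
    d = len(coeffs)
    scaled = [[v * c for v, c in zip(row, coeffs)] for row in p]
    raw = matmul(scaled, transpose(p))
    return [[v / math.sqrt(d) for v in row] for row in raw]

# Toy input (invented): 3 tokens with 2 features each.
x = [[1.0, 0.0], [0.5, -1.0], [2.0, 1.0]]
w_shared = [[0.2, -0.1], [0.4, 0.3]]  # one projection used for both roles
coeffs = [1.0, 0.5]                   # hypothetical per-dimension coefficients

scores = symmetric_scores(x, w_shared, coeffs)
weights = softmax_rows(scores)
```

Note that the softmax is applied per row, so the attention weights themselves are generally not symmetric even though the scores are.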

Title: Annotation alignment: Comparing LLM and human annotations of conversational safety

Authors: Rajiv Movva, Pang Wei Koh, Emma Pierson
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.06369
Pdf URL: https://arxiv.org/pdf/2406.06369
Copy Paste: [[2406.06369]] Annotation alignment: Comparing LLM and human annotations of conversational safety(https://arxiv.org/abs/2406.06369)
Keywords: gpt, llm, chat
Abstract: To what extent do LLMs align with human perceptions of safety? We study this question via *annotation alignment*, the extent to which LLMs and humans agree when annotating the safety of user-chatbot conversations. We leverage the recent DICES dataset (Aroyo et al., 2023), in which 350 conversations are each rated for safety by 112 annotators spanning 10 race-gender groups. GPT-4 achieves a Pearson correlation of $r = 0.59$ with the average annotator rating, higher than the median annotator's correlation with the average ($r=0.51$). We show that larger datasets are needed to resolve whether GPT-4 exhibits disparities in how well it correlates with demographic groups. Also, there is substantial idiosyncratic variation in correlation *within* groups, suggesting that race & gender do not fully capture differences in alignment. Finally, we find that GPT-4 cannot predict when one demographic group finds a conversation more unsafe than another.
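The comparison in this abstract rests on Pearson correlation against the mean annotator rating. A minimal sketch of that computation follows. The ratings are invented toy numbers, and the names (`pearson`, `mean_rating`, `model_ratings`) are hypothetical rather than taken from the paper or the DICES release.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Toy data (invented): rows are annotators, columns are conversations.
annotator_ratings = [
    [0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 0, 0],
]
model_ratings = [0.1, 0.9, 0.6, 0.2, 0.7]  # hypothetical model safety scores

# Average rating per conversation across annotators.
mean_rating = [statistics.fmean(col) for col in zip(*annotator_ratings)]

# Correlation of the model with the average annotator.
model_r = pearson(model_ratings, mean_rating)

# Each annotator's correlation with the average, then the median of those.
# (Assumption: the annotator's own rating is left inside the average.)
annotator_rs = [pearson(row, mean_rating) for row in annotator_ratings]
median_annotator_r = statistics.median(annotator_rs)

print(f"model r = {model_r:.2f}, median annotator r = {median_annotator_r:.2f}")
```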

Title: Should We Fine-Tune or RAG? Evaluating Different Techniques to Adapt LLMs for Dialogue

Authors: Simone Alghisi, Massimo Rizzoli, Gabriel Roccabruna, Seyed Mahed Mousavi, Giuseppe Riccardi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.06399
Pdf URL: https://arxiv.org/pdf/2406.06399
Copy Paste: [[2406.06399]] Should We Fine-Tune or RAG? Evaluating Different Techniques to Adapt LLMs for Dialogue(https://arxiv.org/abs/2406.06399)
Keywords: language model, llm, retrieval-augmented generation
Abstract: We study the limitations of Large Language Models (LLMs) for the task of response generation in human-machine dialogue. Several techniques have been proposed in the literature for different dialogue types (e.g., Open-Domain). However, the evaluations of these techniques have been limited in terms of base LLMs, dialogue types and evaluation metrics. In this work, we extensively analyze different LLM adaptation techniques when applied to different dialogue types. We have selected two base LLMs, Llama-2 and Mistral, and four dialogue types: Open-Domain, Knowledge-Grounded, Task-Oriented, and Question Answering. We evaluate the performance of in-context learning and fine-tuning techniques across datasets selected for each dialogue type. We assess the impact of incorporating external knowledge to ground the generation in both scenarios of Retrieval-Augmented Generation (RAG) and gold knowledge. We adopt consistent evaluation and explainability criteria for automatic metrics and human evaluation protocols. Our analysis shows that there is no universally best technique for adapting large language models, as the efficacy of each technique depends on both the base LLM and the specific type of dialogue. Last but not least, the assessment of the best adaptation technique should include human evaluation to avoid false expectations and outcomes derived from automatic metrics.

Title: Controlling Emotion in Text-to-Speech with Natural Language Prompts

Authors: Thomas Bott, Florian Lux, Ngoc Thang Vu
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2406.06406
Pdf URL: https://arxiv.org/pdf/2406.06406
Copy Paste: [[2406.06406]] Controlling Emotion in Text-to-Speech with Natural Language Prompts(https://arxiv.org/abs/2406.06406)
Keywords: prompt
Abstract: In recent years, prompting has quickly become one of the standard ways of steering the outputs of generative machine learning models, due to its intuitive use of natural language. In this work, we propose a system conditioned on embeddings derived from an emotionally rich text that serves as a prompt. Thereby, a joint representation of speaker and prompt embeddings is integrated at several points within a transformer-based architecture. Our approach is trained on merged emotional speech and text datasets and varies prompts in each training iteration to increase the generalization capabilities of the model. Objective and subjective evaluation results demonstrate the ability of the conditioned synthesis system to accurately transfer the emotions present in a prompt to speech. At the same time, precise tractability of speaker identities as well as overall high speech quality and intelligibility are maintained.

Title: Language Models are Alignable Decision-Makers: Dataset and Application to the Medical Triage Domain

Authors: Brian Hu, Bill Ray, Alice Leung, Amy Summerville, David Joy, Christopher Funk, Arslan Basharat
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.06435
Pdf URL: https://arxiv.org/pdf/2406.06435
Copy Paste: [[2406.06435]] Language Models are Alignable Decision-Makers: Dataset and Application to the Medical Triage Domain(https://arxiv.org/abs/2406.06435)
Keywords: language model, llm, prompt
Abstract: In difficult decision-making scenarios, it is common to have conflicting opinions among expert human decision-makers as there may not be a single right answer. Such decisions may be guided by different attributes that can be used to characterize an individual's decision. We introduce a novel dataset for medical triage decision-making, labeled with a set of decision-maker attributes (DMAs). This dataset consists of 62 scenarios, covering six different DMAs, including ethical principles such as fairness and moral desert. We present a novel software framework for human-aligned decision-making by utilizing these DMAs, paving the way for trustworthy AI with better guardrails. Specifically, we demonstrate how large language models (LLMs) can serve as ethical decision-makers, and how their decisions can be aligned to different DMAs using zero-shot prompting. Our experiments focus on different open-source models with varying sizes and training techniques, such as Falcon, Mistral, and Llama 2. Finally, we also introduce a new form of weighted self-consistency that improves the overall quantified performance. Our results provide new research directions in the use of LLMs as alignable decision-makers. The dataset and open-source software are publicly available at: this https URL.

Title: Multimodal Contextualized Semantic Parsing from Speech

Authors: Jordan Voas, Raymond Mooney, David Harwath
Subjects: cs.CL, cs.CV, cs.HC, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2406.06438
Pdf URL: https://arxiv.org/pdf/2406.06438
Copy Paste: [[2406.06438]] Multimodal Contextualized Semantic Parsing from Speech(https://arxiv.org/abs/2406.06438)
Keywords: agent
Abstract: We introduce Semantic Parsing in Contextual Environments (SPICE), a task designed to enhance artificial agents' contextual awareness by integrating multimodal inputs with prior contexts. SPICE goes beyond traditional semantic parsing by offering a structured, interpretable framework for dynamically updating an agent's knowledge with new information, mirroring the complexity of human communication. We develop the VG-SPICE dataset, crafted to challenge agents with visual scene graph construction from spoken conversational exchanges, highlighting speech and visual data integration. We also present the Audio-Vision Dialogue Scene Parser (AViD-SP) developed for use on VG-SPICE. These innovations aim to improve multimodal information processing and integration. Both the VG-SPICE dataset and the AViD-SP model are publicly available.

Title: Interpretability of Language Models via Task Spaces

Authors: Lucas Weber, Jaap Jumelet, Elia Bruni, Dieuwke Hupkes
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.06441
Pdf URL: https://arxiv.org/pdf/2406.06441
Copy Paste: [[2406.06441]] Interpretability of Language Models via Task Spaces(https://arxiv.org/abs/2406.06441)
Keywords: language model
Abstract: The usual way to interpret language models (LMs) is to test their performance on different benchmarks and subsequently infer their internal processes. In this paper, we present an alternative approach, concentrating on the quality of LM processing, with a focus on their language abilities. To this end, we construct 'linguistic task spaces' -- representations of an LM's language conceptualisation -- that shed light on the connections LMs draw between language phenomena. Task spaces are based on the interactions of the learning signals from different linguistic phenomena, which we assess via a method we call 'similarity probing'. To disentangle the learning signals of linguistic phenomena, we further introduce a method called 'fine-tuning via gradient differentials' (FTGD). We apply our methods to language models of three different scales and find that larger models generalise better to overarching general concepts for linguistic tasks, making better use of their shared structure. Further, the distributedness of linguistic processing increases with pre-training through increased parameter sharing between related linguistic tasks. The overall generalisation patterns are mostly stable throughout training and not marked by incisive stages, potentially explaining the lack of successful curriculum strategies for LMs.

Title: Evaluating the Retrieval Component in LLM-Based Question Answering Systems

Authors: Ashkan Alinejad, Krtin Kumar, Ali Vahdat
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2406.06458
Pdf URL: https://arxiv.org/pdf/2406.06458
Copy Paste: [[2406.06458]] Evaluating the Retrieval Component in LLM-Based Question Answering Systems(https://arxiv.org/abs/2406.06458)
Keywords: language model, llm, hallucination, chat, retrieval-augmented generation
Abstract: Question answering (QA) systems utilizing Large Language Models (LLMs) heavily depend on the retrieval component to provide them with domain-specific information and reduce the risk of generating inaccurate responses or hallucinations. Although the evaluation of retrievers dates back to the early research in Information Retrieval, assessing their performance within LLM-based chatbots remains a challenge. This study proposes a straightforward baseline for evaluating retrievers in Retrieval-Augmented Generation (RAG)-based chatbots. Our findings demonstrate that this evaluation framework provides a clearer picture of how the retriever performs and is more aligned with the overall performance of the QA system. Although conventional metrics such as precision, recall, and F1 score may not fully capture LLMs' capabilities - as they can yield accurate responses despite imperfect retrievers - our method accounts for LLMs' ability to ignore irrelevant contexts, as well as potential errors and hallucinations in their responses.
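For reference, the conventional retriever metrics named in this abstract can be computed per query as below. This is a minimal set-based sketch with invented document IDs and a hypothetical function name. It shows only the baseline metrics, not the evaluation method the paper proposes, which the abstract does not specify.

```python
def retrieval_prf(retrieved, relevant):
    """Set-based precision, recall, and F1 for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F1 is the harmonic mean of precision and recall; define it as 0 with no hits.
    f1 = (2 * precision * recall / (precision + recall)) if hits else 0.0
    return precision, recall, f1

# Toy example (invented IDs): 4 retrieved documents, 3 relevant ones, 2 in common.
p, r, f1 = retrieval_prf(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

These scores judge the retriever in isolation, which is the limitation the abstract points to: a generator may still answer correctly from an imperfect retrieved set.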

Title: Reasoning in Token Economies: Budget-Aware Evaluation of LLM Reasoning Strategies

Authors: Junlin Wang, Siddhartha Jain, Dejiao Zhang, Baishakhi Ray, Varun Kumar, Ben Athiwaratkun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.06461
Pdf URL: https://arxiv.org/pdf/2406.06461
Copy Paste: [[2406.06461]] Reasoning in Token Economies: Budget-Aware Evaluation of LLM Reasoning Strategies(https://arxiv.org/abs/2406.06461)
Keywords: language model, llm, chain-of-thought, agent
Abstract: A diverse array of reasoning strategies has been proposed to elicit the capabilities of large language models. However, in this paper, we point out that traditional evaluations which focus solely on performance metrics miss a key factor: the increased effectiveness due to additional compute. By overlooking this aspect, a skewed view of strategy efficiency is often presented. This paper introduces a framework that incorporates the compute budget into the evaluation, providing a more informative comparison that takes into account both performance metrics and computational cost. In this budget-aware perspective, we find that complex reasoning strategies often don't surpass simpler baselines purely due to algorithmic ingenuity, but rather due to the larger computational resources allocated. When we provide a simple baseline like chain-of-thought self-consistency with comparable compute resources, it frequently outperforms reasoning strategies proposed in the literature. In this scale-aware perspective, we find that unlike self-consistency, certain strategies such as multi-agent debate or Reflexion can become worse if more compute budget is utilized.
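The self-consistency baseline mentioned here samples several chain-of-thought answers and takes a majority vote, so a fixed token budget translates directly into a number of samples. The sketch below illustrates that relationship under stated assumptions: `make_noisy_solver` is a hypothetical stand-in for an LLM call, and the flat `tokens_per_sample` cost is an invented simplification, not the paper's cost accounting.

```python
import random
from collections import Counter

def self_consistency(sample_answer, token_budget, tokens_per_sample):
    """Majority vote over as many sampled answers as the token budget allows."""
    n_samples = token_budget // tokens_per_sample
    if n_samples < 1:
        raise ValueError("budget too small for a single sample")
    votes = Counter(sample_answer() for _ in range(n_samples))
    answer, _ = votes.most_common(1)[0]
    return answer, n_samples

def make_noisy_solver(correct, wrong_options, p_correct, rng):
    """Hypothetical stand-in for one chain-of-thought LLM call."""
    def sample():
        if rng.random() < p_correct:
            return correct
        return rng.choice(wrong_options)
    return sample

rng = random.Random(0)
solver = make_noisy_solver("42", ["41", "43", "44"], p_correct=0.5, rng=rng)

# Same solver, increasing budgets: more tokens buy more votes.
for budget in (500, 2000, 8000):
    answer, n = self_consistency(solver, token_budget=budget, tokens_per_sample=500)
    print(f"budget={budget:>5} tokens  samples={n:>2}  voted answer={answer}")
```

Comparing another strategy at the same `token_budget`, rather than at the same number of calls, is the kind of budget-matched comparison the abstract argues for.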

Title: Can Language Models Serve as Text-Based World Simulators?

Authors: Ruoyao Wang, Graham Todd, Ziang Xiao, Xingdi Yuan, Marc-Alexandre Côté, Peter Clark, Peter Jansen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.06485
Pdf URL: https://arxiv.org/pdf/2406.06485
Copy Paste: [[2406.06485]] Can Language Models Serve as Text-Based World Simulators?(https://arxiv.org/abs/2406.06485)
Keywords: language model, gpt, llm
Abstract: Virtual environments play a key role in benchmarking advances in complex planning and decision-making tasks but are expensive and complicated to build by hand. Can current language models themselves serve as world simulators, correctly predicting how actions change different world states, thus bypassing the need for extensive manual coding? Our goal is to answer this question in the context of text-based simulators. Our approach is to build and use a new benchmark, called ByteSized32-State-Prediction, containing a dataset of text game state transitions and accompanying game tasks. We use this to directly quantify, for the first time, how well LLMs can serve as text-based world simulators. We test GPT-4 on this dataset and find that, despite its impressive performance, it is still an unreliable world simulator without further innovations. This work thus contributes both new insights into current LLMs' capabilities and weaknesses and a novel benchmark to track future progress as new models appear.