2024-06-24

Title: Can LLMs Learn by Teaching? A Preliminary Study

Authors: Xuefei Ning, Zifu Wang, Shiyao Li, Zinan Lin, Peiran Yao, Tianyu Fu, Matthew B. Blaschko, Guohao Dai, Huazhong Yang, Yu Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.14629
Pdf URL: https://arxiv.org/pdf/2406.14629
Copy Paste: [[2406.14629]] Can LLMs Learn by Teaching? A Preliminary Study(https://arxiv.org/abs/2406.14629)
Keywords: llm, prompt
Abstract: Teaching to improve student models (e.g., knowledge distillation) is an extensively studied methodology in LLMs. However, for humans, teaching not only improves students but also improves teachers. We ask: Can LLMs also learn by teaching (LbT)? If yes, we can potentially unlock the possibility of continuously advancing the models without solely relying on human-produced data or stronger models. In this paper, we provide a preliminary exploration of this ambitious agenda. We show that LbT ideas can be incorporated into existing LLM training/prompting pipelines and provide noticeable improvements. Specifically, we design three methods, each mimicking one of the three levels of LbT in humans: observing students' feedback, learning from the feedback, and learning iteratively, with the goals of improving answer accuracy without training and improving models' inherent capability with fine-tuning. The findings are encouraging. For example, similar to LbT in humans, we see that: (1) LbT can induce weak-to-strong generalization: strong models can improve themselves by teaching other weak models; (2) Diversity in students might help: teaching multiple students could be better than teaching one student or the teacher itself. We hope that this early promise can inspire future research on LbT and, more broadly, on adopting advanced techniques from education to improve LLMs. The code is available at this https URL.

Title: Unveiling the Spectrum of Data Contamination in Language Models: A Survey from Detection to Remediation

Authors: Chunyuan Deng, Yilun Zhao, Yuzhao Heng, Yitong Li, Jiannan Cao, Xiangru Tang, Arman Cohan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14644
Pdf URL: https://arxiv.org/pdf/2406.14644
Copy Paste: [[2406.14644]] Unveiling the Spectrum of Data Contamination in Language Models: A Survey from Detection to Remediation(https://arxiv.org/abs/2406.14644)
Keywords: language model, llm
Abstract: Data contamination has garnered increased attention in the era of large language models (LLMs) due to the reliance on extensive internet-derived training corpora. The issue of training corpus overlap with evaluation benchmarks--referred to as contamination--has been the focus of significant recent research. This body of work aims to identify contamination, understand its impacts, and explore mitigation strategies from diverse perspectives. However, comprehensive studies that provide a clear pathway from foundational concepts to advanced insights are lacking in this nascent field. Therefore, we present a comprehensive survey in the field of data contamination, laying out the key issues, methodologies, and findings to date, and highlighting areas in need of further research and development. In particular, we begin by examining the effects of data contamination across various stages and forms. We then provide a detailed analysis of current contamination detection methods, categorizing them to highlight their focus, assumptions, strengths, and limitations. We also discuss mitigation strategies, offering a clear guide for future research. This survey serves as a succinct overview of the most recent advancements in data contamination research, providing a straightforward guide for the benefit of future research endeavors.
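The overlap between a training corpus and an evaluation benchmark that this abstract calls contamination can be illustrated with a word-level n-gram check. This is a minimal sketch of one generic detection heuristic, not a method taken from the survey; the function names, the toy corpus, and the choice of n = 5 are all assumptions made for illustration.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams in `text` (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_docs: Iterable[str], n: int = 5) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the corpus."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n words: nothing to compare
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Hypothetical toy data: one "training" document and two benchmark items.
corpus = ["the quick brown fox jumps over the lazy dog near the river bank"]
leaked = "the quick brown fox jumps over the lazy dog"
clean = "a completely different sentence about tokenizers and vocabulary size"

print(overlap_ratio(leaked, corpus))  # every 5-gram of the leaked item is in the corpus
print(overlap_ratio(clean, corpus))   # no 5-gram of the clean item is in the corpus
```

A high ratio only flags verbatim overlap; paraphrased or translated benchmark items would pass this check unnoticed.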

Title: Major Entity Identification: A Generalizable Alternative to Coreference Resolution

Authors: Kawshik Manikantan (1), Shubham Toshniwal (2), Makarand Tapaswi (1), Vineet Gandhi (1) ((1) CVIT, IIIT Hyderabad, (2) NVIDIA)
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.14654
Pdf URL: https://arxiv.org/pdf/2406.14654
Copy Paste: [[2406.14654]] Major Entity Identification: A Generalizable Alternative to Coreference Resolution(https://arxiv.org/abs/2406.14654)
Keywords: llm, prompt
Abstract: The limited generalization of coreference resolution (CR) models has been a major bottleneck in the task's broad application. Prior work has identified annotation differences, especially for mention detection, as one of the main reasons for the generalization gap and proposed using additional annotated target domain data. Rather than relying on this additional annotation, we propose an alternative formulation of the CR task, Major Entity Identification (MEI), where we: (a) assume the target entities to be specified in the input, and (b) limit the task to only the frequent entities. Through extensive experiments, we demonstrate that MEI models generalize well across domains on multiple datasets with supervised models and LLM-based few-shot prompting. Additionally, the MEI task fits the classification framework, which enables the use of classification-based metrics that are more robust than the current CR metrics. Finally, MEI is also of practical use as it allows a user to search for all mentions of a particular entity or a group of entities of interest.
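Because MEI assigns each mention to one of the entities given in the input, standard classification metrics apply directly. The sketch below computes per-entity and macro F1 over mention labels. It assumes a hypothetical "OTHER" label for mentions that belong to no major entity, and the entity names and predictions are invented for illustration; the paper's exact metric definitions are not stated in the abstract.

```python
from typing import Dict, List


def per_class_f1(gold: List[str], pred: List[str]) -> Dict[str, float]:
    """Compute F1 for every label that appears in the gold or predicted labels."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    scores: Dict[str, float] = {}
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# One label per mention, in document order (hypothetical example).
gold = ["Alice", "Bob", "Alice", "OTHER", "Bob"]
pred = ["Alice", "Bob", "Bob", "OTHER", "Bob"]

scores = per_class_f1(gold, pred)
macro_f1 = sum(scores.values()) / len(scores)
print(scores)
print(round(macro_f1, 4))
```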

Title: OpenDebateEvidence: A Massive-Scale Argument Mining and Summarization Dataset

Authors: Allen Roush, Yusuf Shabazz, Arvind Balaji, Peter Zhang, Stefano Mezza, Markus Zhang, Sanjay Basu, Sriram Vishwanath, Mehdi Fatemi, Ravid Schwartz-Ziv
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.14657
Pdf URL: https://arxiv.org/pdf/2406.14657
Copy Paste: [[2406.14657]] OpenDebateEvidence: A Massive-Scale Argument Mining and Summarization Dataset(https://arxiv.org/abs/2406.14657)
Keywords: language model
Abstract: We introduce OpenDebateEvidence, a comprehensive dataset for argument mining and summarization sourced from the American Competitive Debate community. This dataset includes over 3.5 million documents with rich metadata, making it one of the most extensive collections of debate evidence. OpenDebateEvidence captures the complexity of arguments in high school and college debates, providing valuable resources for training and evaluation. Our extensive experiments demonstrate the efficacy of fine-tuning state-of-the-art large language models for argumentative abstractive summarization across various methods, models, and datasets. By providing this comprehensive resource, we aim to advance computational argumentation and support practical applications for debaters, educators, and researchers. OpenDebateEvidence is publicly available to support further research and innovation in computational argumentation. Access it here: this https URL

Title: Exploring Design Choices for Building Language-Specific LLMs

Authors: Atula Tejaswi, Nilesh Gupta, Eunsol Choi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.14670
Pdf URL: https://arxiv.org/pdf/2406.14670
Copy Paste: [[2406.14670]] Exploring Design Choices for Building Language-Specific LLMs(https://arxiv.org/abs/2406.14670)
Keywords: language model, llm
Abstract: Despite rapid progress in large language models (LLMs), their performance on the vast majority of languages remains unsatisfactory. In this paper, we study building language-specific LLMs by adapting monolingual and multilingual LLMs. We conduct systematic experiments on how design choices (base model selection, vocabulary extension, and continued fine-tuning) impact the adapted LLM, both in terms of efficiency (how many tokens are needed to encode the same amount of information) and end task performance. We find that (1) the initial performance before adaptation is not always indicative of the final performance; (2) efficiency can easily be improved with simple vocabulary extension and continued fine-tuning in most LLMs we study; and (3) the optimal adaptation method is highly language-dependent, and the simplest approach works well across various experimental settings. Adapting English-centric models can yield better results than adapting multilingual models despite their worse initial performance on low-resource languages. Together, our work lays foundations for efficiently building language-specific LLMs by adapting existing LLMs.
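The efficiency notion used here, the number of tokens needed to encode the same text, can be illustrated with a toy tokenizer before and after vocabulary extension. The sketch below uses a greedy longest-match tokenizer with single-character fallback. The tokenizer, vocabularies, and sample text are assumptions chosen for illustration; the paper's models use their own subword tokenizers.

```python
from typing import List, Set


def greedy_tokenize(text: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match tokenization; unknown spans fall back to single characters."""
    max_len = max((len(v) for v in vocab), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


text = "banana bandana"                              # hypothetical target-language text
base_vocab = {"an"}                                  # hypothetical base vocabulary
extended_vocab = base_vocab | {"ban", "ana", "dana"}  # after vocabulary extension

base_tokens = greedy_tokenize(text, base_vocab)
extended_tokens = greedy_tokenize(text, extended_vocab)
print(len(base_tokens), base_tokens)
print(len(extended_tokens), extended_tokens)
```

Fewer tokens for the same text means shorter sequences at training and inference time, which is the sense in which vocabulary extension improves efficiency.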

Title: Insights into LLM Long-Context Failures: When Transformers Know but Don't Tell

Authors: Taiming Lu, Muhan Gao, Kuai Yu, Adam Byerly, Daniel Khashabi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14673
Pdf URL: https://arxiv.org/pdf/2406.14673
Copy Paste: [[2406.14673]] Insights into LLM Long-Context Failures: When Transformers Know but Don't Tell(https://arxiv.org/abs/2406.14673)
Keywords: language model, llm, long context
Abstract: Large Language Models (LLMs) exhibit positional bias, struggling to utilize information from the middle or end of long contexts. Our study explores LLMs' long-context reasoning by probing their hidden representations. We find that while LLMs encode the position of target information, they often fail to leverage this in generating accurate responses. This reveals a disconnect between information retrieval and utilization, a "know but don't tell" phenomenon. We further analyze the relationship between extraction time and final accuracy, offering insights into the underlying mechanics of transformer models.

Title: Bidirectional Transformer Representations of (Spanish) Ambiguous Words in Context: A New Lexical Resource and Empirical Analysis

Authors: Pamela D. Rivière (1), Anne L. Beatty-Martínez (1), Sean Trott (1 and 2) ((1) Department of Cognitive Science UC San Diego, (2) Computational Social Science UC San Diego)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14678
Pdf URL: https://arxiv.org/pdf/2406.14678
Copy Paste: [[2406.14678]] Bidirectional Transformer Representations of (Spanish) Ambiguous Words in Context: A New Lexical Resource and Empirical Analysis(https://arxiv.org/abs/2406.14678)
Keywords: language model, llm
Abstract: Lexical ambiguity -- where a single wordform takes on distinct, context-dependent meanings -- serves as a useful tool for comparing different large language models' (LLMs') ability to form distinct, contextualized representations of the same stimulus. Few studies have systematically compared LLMs' contextualized word embeddings for languages beyond English. Here, we evaluate multiple bidirectional transformers' (BERTs') semantic representations of Spanish ambiguous nouns in context. We develop a novel dataset of minimal-pair sentences evoking the same or different sense for a target ambiguous noun. In a pre-registered study, we collect contextualized human relatedness judgments for each sentence pair. We find that various BERT-based LLMs' contextualized semantic representations capture some variance in human judgments but fall short of the human benchmark, and for Spanish -- unlike English -- model scale is uncorrelated with performance. We also identify stereotyped trajectories of target noun disambiguation as a proportion of traversal through a given LLM family's architecture, which we partially replicate in English. We contribute (1) a dataset of controlled Spanish sentence stimuli with human relatedness norms, and (2) insight into the impact that LLM specification (architectures, training protocols) exerts on contextualized embeddings.

Title: Do LLMs Have Distinct and Consistent Personality? TRAIT: Personality Testset designed for LLMs with Psychometrics

Authors: Seungbeen Lee, Seungwon Lim, Seungju Han, Giyeong Oh, Hyungjoo Chae, Jiwan Chung, Minju Kim, Beong-woo Kwak, Yeonsoo Lee, Dongha Lee, Jinyoung Yeo, Youngjae Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.14703
Pdf URL: https://arxiv.org/pdf/2406.14703
Copy Paste: [[2406.14703]] Do LLMs Have Distinct and Consistent Personality? TRAIT: Personality Testset designed for LLMs with Psychometrics(https://arxiv.org/abs/2406.14703)
Keywords: language model, llm, prompt
Abstract: The idea of personality in descriptive psychology, traditionally defined through observable behavior, has now been extended to Large Language Models (LLMs) to better understand their behavior. This raises a question: do LLMs exhibit distinct and consistent personality traits, similar to humans? Existing self-assessment personality tests, while applicable, lack the necessary validity and reliability for precise personality measurements. To address this, we introduce TRAIT, a new tool consisting of 8K multi-choice questions designed to assess the personality of LLMs with validity and reliability. TRAIT is built on the psychometrically validated human questionnaires Big Five Inventory (BFI) and Short Dark Triad (SD-3), enhanced with the ATOMIC10X knowledge graph for testing personality in a variety of real scenarios. TRAIT overcomes the reliability and validity issues that arise when measuring the personality of LLMs with self-assessment, showing the highest scores across three metrics: refusal rate, prompt sensitivity, and option order sensitivity. It reveals notable insights into the personality of LLMs: 1) LLMs exhibit distinct and consistent personality, which is highly influenced by their training data (i.e., data used for alignment tuning), and 2) current prompting techniques have limited effectiveness in eliciting certain traits, such as high psychopathy or low conscientiousness, suggesting the need for further research in this direction.
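Option order sensitivity, one of the three metrics named above, can be sketched as the fraction of option reorderings under which a model's chosen answer changes. The abstract does not give TRAIT's exact definition, so the measure below is an assumed, generic formulation, and `always_first` and `content_based` are hypothetical stand-ins for a real model call.

```python
from itertools import permutations
from typing import Callable, Sequence

# An answer function takes a question and its options and returns the chosen option text.
AnswerFn = Callable[[str, Sequence[str]], str]


def option_order_sensitivity(answer_fn: AnswerFn, question: str,
                             options: Sequence[str]) -> float:
    """Fraction of reorderings whose chosen option differs from the original ordering's."""
    baseline = answer_fn(question, options)
    orders = [p for p in permutations(options) if list(p) != list(options)]
    if not orders:
        return 0.0
    changed = sum(answer_fn(question, order) != baseline for order in orders)
    return changed / len(orders)


def always_first(question: str, options: Sequence[str]) -> str:
    """Position-biased stand-in: always picks whatever is listed first."""
    return options[0]


def content_based(question: str, options: Sequence[str]) -> str:
    """Order-independent stand-in: picks by content (alphabetically smallest)."""
    return min(options)


question = "I enjoy meeting new people."
options = ["agree", "disagree", "neutral"]
print(option_order_sensitivity(always_first, question, options))
print(option_order_sensitivity(content_based, question, options))
```

Under this formulation a lower value is better: a model whose answer depends only on option content scores 0.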

Title: Factual Dialogue Summarization via Learning from Large Language Models

Authors: Rongxin Zhu, Jey Han Lau, Jianzhong Qi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14709
Pdf URL: https://arxiv.org/pdf/2406.14709
Copy Paste: [[2406.14709]] Factual Dialogue Summarization via Learning from Large Language Models(https://arxiv.org/abs/2406.14709)
Keywords: language model, llm
Abstract: Factual consistency is an important quality in dialogue summarization. Large language model (LLM)-based automatic text summarization models generate more factually consistent summaries compared to those by smaller pretrained language models, but they face deployment challenges in real-world applications due to privacy or resource constraints. In this paper, we investigate the use of symbolic knowledge distillation to improve the factual consistency of smaller pretrained models for dialogue summarization. We employ zero-shot learning to extract symbolic knowledge from LLMs, generating both factually consistent (positive) and inconsistent (negative) summaries. We then apply two contrastive learning objectives on these summaries to enhance smaller summarization models. Experiments with BART, PEGASUS, and Flan-T5 indicate that our approach surpasses strong baselines that rely on complex data augmentation strategies. Our approach achieves better factual consistency while maintaining coherence, fluency, and relevance, as confirmed by various automatic evaluation metrics. We also provide access to the data and code to facilitate future research.

Title: MultiAgent Collaboration Attack: Investigating Adversarial Attacks in Large Language Model Collaborations via Debate

Authors: Alfonso Amayuelas, Xianjun Yang, Antonis Antoniades, Wenyue Hua, Liangming Pan, William Wang
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2406.14711
Pdf URL: https://arxiv.org/pdf/2406.14711
Copy Paste: [[2406.14711]] MultiAgent Collaboration Attack: Investigating Adversarial Attacks in Large Language Model Collaborations via Debate(https://arxiv.org/abs/2406.14711)
Keywords: language model, llm, prompt, agent
Abstract: Large Language Models (LLMs) have shown exceptional results on current benchmarks when working individually. The advancement in their capabilities, along with a reduction in parameter size and inference times, has facilitated the use of these models as agents, enabling interactions among multiple models to execute complex tasks. Such collaborations offer several advantages, including the use of specialized models (e.g. coding), improved confidence through multiple computations, and enhanced divergent thinking, leading to more diverse outputs. Thus, the collaborative use of language models is expected to grow significantly in the coming years. In this work, we evaluate the behavior of a network of models collaborating through debate under the influence of an adversary. We introduce pertinent metrics to assess the adversary's effectiveness, focusing on system accuracy and model agreement. Our findings highlight the importance of a model's persuasive ability in influencing others. Additionally, we explore inference-time methods to generate more compelling arguments and evaluate the potential of prompt-based mitigation as a defensive strategy.

Title: 1+1>2: Can Large Language Models Serve as Cross-Lingual Knowledge Aggregators?

Authors: Yue Huang, Chenrui Fan, Yuan Li, Siyuan Wu, Tianyi Zhou, Xiangliang Zhang, Lichao Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14721
Pdf URL: https://arxiv.org/pdf/2406.14721
Copy Paste: [[2406.14721]] 1+1>2: Can Large Language Models Serve as Cross-Lingual Knowledge Aggregators?(https://arxiv.org/abs/2406.14721)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have garnered significant attention due to their remarkable ability to process information across various languages. Despite their capabilities, they exhibit inconsistencies in handling identical queries in different languages, presenting challenges for further advancement. This paper introduces a method to enhance the multilingual performance of LLMs by aggregating knowledge from diverse languages. This approach incorporates a low-resource knowledge detector specific to a language, a language selection process, and mechanisms for answer replacement and integration. Our experiments demonstrate notable performance improvements, particularly in reducing language performance disparity. An ablation study confirms that each component of our method significantly contributes to these enhancements. This research highlights the inherent potential of LLMs to harmonize multilingual capabilities and offers valuable insights for further exploration.

Title: TTQA-RS- A break-down prompting approach for Multi-hop Table-Text Question Answering with Reasoning and Summarization

Authors: Jayetri Bardhan, Bushi Xiao, Daisy Zhe Wang
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2406.14732
Pdf URL: https://arxiv.org/pdf/2406.14732
Copy Paste: [[2406.14732]] TTQA-RS- A break-down prompting approach for Multi-hop Table-Text Question Answering with Reasoning and Summarization(https://arxiv.org/abs/2406.14732)
Keywords: language model, gpt, llm, prompt
Abstract: Question answering (QA) over tables and text has gained much popularity over the years. Multi-hop table-text QA requires multiple hops between the table and text, making it a challenging QA task. Although several works have attempted to solve the table-text QA task, most involve training the models and require labeled data. In this paper, we propose TTQA-RS, a break-down prompting approach for multi-hop table-text question answering with reasoning and summarization. Our model uses augmented knowledge, including a table-text summary and decomposed sub-questions with answers, for reasoning-based table-text QA. Using open-source language models, our model outperformed all existing prompting methods for table-text QA tasks on existing table-text QA datasets such as HybridQA and OTT-QA's development set. Our results are comparable with the training-based state-of-the-art models, demonstrating the potential of prompt-based approaches using open-source LLMs. Additionally, by using GPT-4 with LLaMA3-70B, our model achieved state-of-the-art performance for prompting-based methods on multi-hop table-text QA.

Title: Dissecting the Ullman Variations with a SCALPEL: Why do LLMs fail at Trivial Alterations to the False Belief Task?

Authors: Zhiqiang Pi, Annapurna Vadaparty, Benjamin K. Bergen, Cameron R. Jones
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14737
Pdf URL: https://arxiv.org/pdf/2406.14737
Copy Paste: [[2406.14737]] Dissecting the Ullman Variations with a SCALPEL: Why do LLMs fail at Trivial Alterations to the False Belief Task?(https://arxiv.org/abs/2406.14737)
Keywords: language model, llm
Abstract: Recent empirical results have sparked a debate about whether or not Large Language Models (LLMs) are capable of Theory of Mind (ToM). While some have found LLMs to be successful on ToM evaluations such as the False Belief task (Kosinski, 2023), others have argued that LLMs solve these tasks by exploiting spurious correlations -- not representing beliefs -- since they fail on trivial alterations to these tasks (Ullman, 2023). In this paper, we introduce SCALPEL: a technique to generate targeted modifications for False Belief tasks to test different specific hypotheses about why LLMs fail. We find that modifications which make explicit common inferences -- such as that looking at a transparent object implies recognizing its contents -- preserve LLMs' performance. This suggests that LLMs' failures on modified ToM tasks could result from a lack of more general commonsense reasoning, rather than a failure to represent mental states. We argue that SCALPEL could be helpful for explaining LLM successes and failures in other cases.

Title: Learning to Retrieve Iteratively for In-Context Learning

Authors: Yunmo Chen, Tongfei Chen, Harsh Jhamtani, Patrick Xia, Richard Shin, Jason Eisner, Benjamin Van Durme
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14739
Pdf URL: https://arxiv.org/pdf/2406.14739
Copy Paste: [[2406.14739]] Learning to Retrieve Iteratively for In-Context Learning(https://arxiv.org/abs/2406.14739)
Keywords: language model, llm
Abstract: We introduce iterative retrieval, a novel framework that empowers retrievers to make iterative decisions through policy optimization. Finding an optimal portfolio of retrieved items is a combinatorial optimization problem, generally considered NP-hard. This approach provides a learned approximation to such a solution, meeting specific task requirements under a given family of large language models (LLMs). We propose a training procedure based on reinforcement learning, incorporating feedback from LLMs. We instantiate an iterative retriever for composing in-context learning (ICL) exemplars and apply it to various semantic parsing tasks that demand synthesized programs as outputs. By adding only 4M additional parameters for state encoding, we convert an off-the-shelf dense retriever into a stateful iterative retriever, outperforming previous methods in selecting ICL exemplars on semantic parsing datasets such as CalFlow, TreeDST, and MTOP. Additionally, the trained iterative retriever generalizes across different inference LLMs beyond the one used during training.

Title: Relation Extraction with Fine-Tuned Large Language Models in Retrieval Augmented Generation Frameworks

Authors: Sefika Efeoglu, Adrian Paschke
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.14745
Pdf URL: https://arxiv.org/pdf/2406.14745
Copy Paste: [[2406.14745]] Relation Extraction with Fine-Tuned Large Language Models in Retrieval Augmented Generation Frameworks(https://arxiv.org/abs/2406.14745)
Keywords: language model, llm, prompt, retrieval augmented generation
Abstract: Information Extraction (IE) is crucial for converting unstructured data into structured formats like Knowledge Graphs (KGs). A key task within IE is Relation Extraction (RE), which identifies relationships between entities in text. Various RE methods exist, including supervised, unsupervised, weakly supervised, and rule-based approaches. Recent studies leveraging pre-trained language models (PLMs) have shown significant success in this area. In the current era dominated by Large Language Models (LLMs), fine-tuning these models can overcome limitations associated with zero-shot LLM prompting-based RE methods, especially regarding domain adaptation challenges and identifying implicit relations between entities in sentences. These implicit relations, which cannot be easily extracted from a sentence's dependency tree, require logical inference for accurate identification. This work explores the performance of fine-tuned LLMs and their integration into the Retrieval Augmented-based (RAG) RE approach to address the challenges of identifying implicit relations at the sentence level, particularly when LLMs act as generators within the RAG framework. Empirical evaluations on the TACRED, TACRED-Revisited (TACREV), Re-TACRED, and SemEVAL datasets show significant performance improvements with fine-tuned LLMs, including Llama2-7B, Mistral-7B, and T5 (Large). Notably, our approach achieves substantial gains on SemEVAL, where implicit relations are common, surpassing previous results on this dataset. Additionally, our method outperforms previous works on TACRED, TACREV, and Re-TACRED, demonstrating exceptional performance across diverse evaluation scenarios.

Title: An LLM Feature-based Framework for Dialogue Constructiveness Assessment

Authors: Lexin Zhou, Youmna Farag, Andreas Vlachos
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.14760
Pdf URL: https://arxiv.org/pdf/2406.14760
Copy Paste: [[2406.14760]] An LLM Feature-based Framework for Dialogue Constructiveness Assessment(https://arxiv.org/abs/2406.14760)
Keywords: language model, llm, prompt
Abstract: Research on dialogue constructiveness assessment focuses on (i) analysing conversational factors that influence individuals to take specific actions, win debates, change their perspectives or broaden their open-mindedness and (ii) predicting constructive outcomes following dialogues for such use cases. These objectives can be achieved by training either interpretable feature-based models (which often involve costly human annotations) or neural models such as pre-trained language models (which have empirically shown higher task accuracy but lack interpretability). We propose a novel LLM feature-based framework that combines the strengths of feature-based and neural approaches while mitigating their downsides, in assessing dialogue constructiveness. The framework first defines a set of dataset-independent and interpretable linguistic features, which can be extracted by both prompting an LLM and simple heuristics. Such features are then used to train LLM feature-based models. We apply this framework to three datasets of dialogue constructiveness and find that our LLM feature-based models significantly outperform standard feature-based models and neural models, and tend to learn more robust prediction rules instead of relying on superficial shortcuts (as seen with neural models). Further, we demonstrate that interpreting these LLM feature-based models can yield valuable insights into what makes a dialogue constructive.
摘要：对话建设性评估研究主要集中在 (i) 分析影响个人采取特定行动、赢得辩论、改变观点或拓宽开放心态的对话因素，以及 (ii) 预测此类用例对话后的建设性结果。这些目标可以通过训练可解释的基于特征的模型（通常涉及昂贵的人工注释）或神经模型（例如预训练的语言模型（经验表明具有更高的任务准确性但缺乏可解释性））来实现。我们提出了一种新颖的基于 LLM 特征的框架，该框架结合了基于特征和神经方法的优势，同时减轻了它们的缺点，以评估对话的建设性。该框架首先定义一组独立于数据集且可解释的语言特征，可以通过提示 LLM 和简单的启发式方法提取这些特征。然后使用这些特征来训练基于 LLM 特征的模型。我们将此框架应用于三个对话建设性数据集，发现我们的 LLM 基于特征的模型明显优于基于特征的标准模型和神经模型，并且倾向于学习更稳健的预测规则，而不是依赖于肤浅的捷径（如神经模型所见）。此外，我们证明，解释这些基于特征的 LLM 模型可以产生有价值的见解，了解是什么让对话变得有建设性。

Title: A Learn-Then-Reason Model Towards Generalization in Knowledge Base Question Answering

Authors: Lingxi Zhang, Jing Zhang, Yanling Wang, Cuiping Li, Hong Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.14763
Pdf URL: https://arxiv.org/pdf/2406.14763
Copy Paste: [[2406.14763]] A Learn-Then-Reason Model Towards Generalization in Knowledge Base Question Answering(https://arxiv.org/abs/2406.14763)
Keywords: language model
Abstract: Large-scale knowledge bases (KBs) like Freebase and Wikidata house millions of pieces of structured knowledge. Knowledge Base Question Answering (KBQA) provides a user-friendly way to access these valuable KBs by asking natural language questions. In order to improve the generalization capabilities of KBQA models, extensive research has embraced a retrieve-then-reason framework to retrieve relevant evidence for logical expression generation. These multi-stage efforts prioritize acquiring external sources but overlook the incorporation of new knowledge into their model parameters. In effect, even advanced language models and retrievers have knowledge boundaries, thereby limiting the generalization capabilities of previous KBQA models. Therefore, this paper develops KBLLaMA, which follows a learn-then-reason framework to inject new KB knowledge into a large language model for flexible end-to-end KBQA. At the core of KBLLaMA, we study (1) how to organize new knowledge about KBQA and (2) how to facilitate the learning of the organized knowledge. Extensive experiments on various KBQA generalization tasks showcase the state-of-the-art performance of KBLLaMA. In particular, on the general benchmark GrailQA and the domain-specific benchmark Bio-chemical, KBLLaMA achieves performance gains of up to 3.8% and 9.8%, respectively, over the baselines.
摘要：大型知识库 (KB)（如 Freebase 和 Wikidata）包含数百万个结构化知识。知识库问答 (KBQA) 提供了一种用户友好的方式来通过提出自然语言问题来访问这些有价值的知识库。为了提高 KBQA 模型的泛化能力，大量研究采用了检索然后推理的框架来检索逻辑表达式生成的相关证据。这些多阶段的努力优先考虑获取外部资源，但忽略了将新知识纳入其模型参数。实际上，即使是高级语言模型和检索器也有知识边界，从而限制了以前的 KBQA 模型的泛化能力。因此，本文开发了 KBLLaMA，它遵循学习然后推理的框架，将新的知识库知识注入大型语言模型，以实现灵活的端到端 KBQA。在 KBLLaMA 的核心，我们研究 (1) 如何组织有关 KBQA 的新知识和 (2) 如何促进有组织的知识的学习。在各种 KBQA 泛化任务上进行的大量实验展示了 KBLLaMA 的领先性能。特别是在通用基准 GrailQA 和领域特定基准 Bio-chemical 上，与基线相比，KBLLaMA 分别获得了高达 3.8% 和 9.8% 的性能提升。

Title: Understanding Finetuning for Factual Knowledge Extraction

Authors: Gaurav Ghosal, Tatsunori Hashimoto, Aditi Raghunathan
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.14785
Pdf URL: https://arxiv.org/pdf/2406.14785
Copy Paste: [[2406.14785]] Understanding Finetuning for Factual Knowledge Extraction(https://arxiv.org/abs/2406.14785)
Keywords: language model
Abstract: In this work, we study the impact of QA fine-tuning data on downstream factuality. We show that fine-tuning on lesser-known facts that are poorly stored during pretraining yields significantly worse factuality than fine-tuning on well-known facts, even when all facts are seen during pretraining. We prove this phenomenon theoretically, showing that training on lesser-known facts can lead the model to ignore subject entity names and instead output a generic plausible response even when the relevant factual knowledge is encoded in the model. On three question answering benchmarks (PopQA, Entity Questions, and MMLU) and two language models (Llama-2-7B and Mistral-7B), we find that (i) finetuning on a completely factual but lesser-known subset of the data deteriorates downstream factuality (5-10%) and (ii) finetuning on a subset of better-known examples matches or outperforms finetuning on the entire dataset. Ultimately, our results shed light on the interaction between pretrained knowledge and finetuning data and demonstrate the importance of taking into account how facts are stored in the pretrained model when fine-tuning for knowledge-intensive tasks.
摘要：在这项工作中，我们研究了 QA 微调数据对下游事实性的影响。我们表明，对预训练期间存储不充分的鲜为人知的事实进行微调，其事实性明显低于对众所周知的事实进行微调，即使在预训练期间看到了所有事实。我们从理论上证明了这一现象，表明对鲜为人知的事实进行训练可能会导致模型忽略主题实体名称，而输出一般合理的响应，即使相关事实知识已编码在模型中。在三个问答基准（PopQA、Entity Questions 和 MMLU）和两个语言模型（Llama-2-7B 和 Mistral-7B）上，我们发现 (i) 对完全事实但鲜为人知的数据子集进行微调会降低下游事实性（5-10%）；(ii) 对一组知名示例进行微调，其效果与对整个数据集进行微调相当或优于对整个数据集进行微调。最终，我们的结果阐明了预训练知识和微调数据之间的相互作用，并证明了在针对知识密集型任务进行微调时考虑事实如何存储在预训练模型中的重要性。

Title: How Well Do LLMs Represent Values Across Cultures? Empirical Analysis of LLM Responses Based on Hofstede Cultural Dimensions

Authors: Julia Kharchenko, Tanya Roosta, Aman Chadha, Chirag Shah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14805
Pdf URL: https://arxiv.org/pdf/2406.14805
Copy Paste: [[2406.14805]] How Well Do LLMs Represent Values Across Cultures? Empirical Analysis of LLM Responses Based on Hofstede Cultural Dimensions(https://arxiv.org/abs/2406.14805)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) attempt to imitate human behavior by responding to humans in a way that pleases them, including by adhering to their values. However, humans come from diverse cultures with different values. It is critical to understand whether LLMs showcase different values to the user based on the stereotypical values of a user's known country. We prompt different LLMs with a series of advice requests based on 5 Hofstede Cultural Dimensions -- a quantifiable way of representing the values of a country. Throughout each prompt, we incorporate personas representing 36 different countries and, separately, languages predominantly tied to each country to analyze the consistency in the LLMs' cultural understanding. Through our analysis of the responses, we found that LLMs can differentiate between one side of a value and another, as well as understand that countries have differing values, but will not always uphold the values when giving advice, and fail to understand the need to answer differently based on different cultural values. Rooted in these findings, we present recommendations for training value-aligned and culturally sensitive LLMs. More importantly, the methodology and the framework developed here can help further understand and mitigate culture and language alignment issues with LLMs.
摘要：大型语言模型 (LLM) 试图模仿人类行为，以一种让人类愉悦的方式回应人类，包括坚持他们的价值观。然而，人类来自不同的文化，有着不同的价值观。了解 LLM 是否根据用户已知国家的刻板价值观向用户展示不同的价值观至关重要。我们向不同的 LLM 提出一系列建议请求，这些请求基于 5 个霍夫斯泰德文化维度——一种可量化的代表国家价值观的方式。在每个提示中，我们结合了代表 36 个不同国家的角色，并分别结合了与每个国家主要相关的语言，以分析 LLM 对文化理解的一致性。通过对回复的分析，我们发现 LLM 可以区分价值观的一面和另一面，并理解各国有不同的价值观，但在提供建议时并不总是坚持这些价值观，并且无法理解根据不同的文化价值观做出不同回答的必要性。基于这些发现，我们提出了培养价值观一致和文化敏感的 LLM 的建议。更重要的是，这里开发的方法和框架可以帮助进一步理解和缓解 LLM 的文化和语言协调问题。

Title: TemPrompt: Multi-Task Prompt Learning for Temporal Relation Extraction in RAG-based Crowdsourcing Systems

Authors: Jing Yang, Yu Zhao, Yang Linyao, Xiao Wang, Fei-Yue Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14825
Pdf URL: https://arxiv.org/pdf/2406.14825
Copy Paste: [[2406.14825]] TemPrompt: Multi-Task Prompt Learning for Temporal Relation Extraction in RAG-based Crowdsourcing Systems(https://arxiv.org/abs/2406.14825)
Keywords: language model, prompt
Abstract: Temporal relation extraction (TRE) aims to grasp the evolution of events or actions, and thus shape the workflow of associated tasks, so it holds promise in helping understand task requests initiated by requesters in crowdsourcing systems. However, existing methods still struggle with limited and unevenly distributed annotated data. Therefore, inspired by the abundant global knowledge stored within pre-trained language models (PLMs), we propose a multi-task prompt learning framework for TRE (TemPrompt), incorporating prompt tuning and contrastive learning to tackle these issues. To elicit more effective prompts for PLMs, we introduce a task-oriented prompt construction approach that thoroughly takes the myriad factors of TRE into consideration for automatic prompt generation. In addition, we present temporal event reasoning as a supplement to bolster the model's focus on events and temporal cues. The experimental results demonstrate that TemPrompt outperforms all compared baselines across the majority of metrics under both standard and few-shot settings. A case study is provided to validate its effectiveness in crowdsourcing scenarios.
摘要：时间关系提取 (TRE) 旨在掌握事件或动作的演变，从而塑造相关任务的工作流程，因此它有望帮助理解众包系统中请求者发起的任务请求。然而，现有方法仍然难以应对有限且分布不均的注释数据。因此，受预训练语言模型 (PLM) 中存储的丰富全局知识的启发，我们提出了一个用于 TRE (TemPrompt) 的多任务提示学习框架，结合提示调整和对比学习来解决这些问题。为了为 PLM 引出更有效的提示，我们引入了一种面向任务的提示构建方法，该方法彻底考虑了 TRE 的无数因素以自动生成提示。此外，我们提出了时间事件推理作为补充，以加强模型对事件和时间线索的关注。实验结果表明，在标准和小样本设置下，TemPrompt 在大多数指标上都优于所有比较的基线。提供了一个案例研究来验证其在众包场景中的有效性。

Title: Word Matters: What Influences Domain Adaptation in Summarization?

Authors: Yinghao Li, Siyu Miao, Heyan Huang, Yang Gao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14828
Pdf URL: https://arxiv.org/pdf/2406.14828
Copy Paste: [[2406.14828]] Word Matters: What Influences Domain Adaptation in Summarization?(https://arxiv.org/abs/2406.14828)
Keywords: language model, llm
Abstract: Domain adaptation aims to enable Large Language Models (LLMs) to generalize effectively to domain datasets unseen during the training phase. However, factors such as the size of the model parameters and the scale of training data are general influencers and do not reflect the nuances of domain adaptation performance. This paper investigates the fine-grained factors affecting domain adaptation performance, analyzing the specific impact of 'words' in training data on summarization tasks. We propose quantifying dataset learning difficulty as the learning difficulty of generative summarization, which is determined by two indicators: word-based compression rate and abstraction level. Our experiments conclude that, when considering dataset learning difficulty, the cross-domain overlap and the performance gain in summarization tasks exhibit an approximate linear relationship, which is not directly related to the number of words. Based on this finding, predicting a model's performance on unknown domain datasets is possible without undergoing training.
摘要：领域自适应旨在使大型语言模型 (LLM) 能够有效地泛化训练阶段未见过的领域数据集。但模型参数大小、训练数据规模等因素只是一般性影响因素，并未体现领域自适应性能的细微差别。本文探讨影响领域自适应性能的细粒度因素，分析训练数据中的“单词”对摘要任务的具体影响。我们提出将数据集学习难度量化为生成性摘要的学习难度，由基于单词的压缩率和抽象级别两个指标决定。实验表明，当考虑数据集学习难度时，跨领域重叠度与摘要任务的性能增益呈现近似线性关系，且与单词数量无直接关系。基于此发现，无需经过训练即可预测模型在未知领域数据集上的性能。

Title: Efficient Continual Pre-training by Mitigating the Stability Gap

Authors: Yiduo Guo, Jie Fu, Huishuai Zhang, Dongyan Zhao, Yikang Shen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14833
Pdf URL: https://arxiv.org/pdf/2406.14833
Copy Paste: [[2406.14833]] Efficient Continual Pre-training by Mitigating the Stability Gap(https://arxiv.org/abs/2406.14833)
Keywords: language model, gpt, llm
Abstract: Continual pre-training has increasingly become the predominant approach for adapting Large Language Models (LLMs) to new domains. This process involves updating the pre-trained LLM with a corpus from a new domain, resulting in a shift in the training distribution. To study the behavior of LLMs during this shift, we measured the model's performance throughout the continual pre-training process. We observed a temporary performance drop at the beginning, followed by a recovery phase, a phenomenon known as the "stability gap," previously noted in vision models classifying new classes. To address this issue and enhance LLM performance within a fixed compute budget, we propose three effective strategies: (1) Continually pre-training the LLM on a subset with a proper size for multiple epochs, resulting in faster performance recovery than pre-training the LLM on a large corpus in a single epoch; (2) Pre-training the LLM only on a high-quality sub-corpus, which rapidly boosts domain performance; and (3) Using a data mixture similar to the pre-training data to reduce the distribution gap. We conduct various experiments on Llama-family models to validate the effectiveness of our strategies in both medical continual pre-training and instruction tuning. For example, our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% with only 40% of the original training budget and enhance the average general task performance without causing forgetting. Furthermore, we apply our strategies to the Llama-3-8B model. The resulting model, Llama-3-Physician, achieves the best medical performance among current open-source models, and performs comparably to or even better than GPT-4 on several medical benchmarks. We release our models at this https URL.
摘要：持续预训练已日益成为将大型语言模型 (LLM) 适应新领域的主要方法。此过程涉及使用来自新领域的语料库更新预训练的 LLM，从而导致训练分布发生变化。为了研究 LLM 在此转变过程中的行为，我们在整个持续预训练过程中测量了模型的性能。我们观察到开始时性能暂时下降，然后进入恢复阶段，这种现象称为“稳定性差距”，此前在对新类别进行分类的视觉模型中曾提到过。为了解决这个问题并在固定的计算预算内提高 LLM 性能，我们提出了三种有效的策略：(1) 在具有适当大小的子集上对 LLM 进行多个 epoch 的持续预训练，与在单个 epoch 上对大型语料库进行预训练相比，其性能恢复速度更快；(2) 仅在高质量子语料库上对 LLM 进行预训练，可快速提升领域性能； (3) 使用类似于预训练数据的数据混合来减少分布差距。我们对 Llama 系列模型进行了各种实验，以验证我们的策略在医学持续预训练和指令调整方面的有效性。例如，我们的策略仅使用原始训练预算的 40% 将 OpenLlama-3B 模型的平均医疗任务性能从 36.2% 提高到 40.7%，并且在不引起遗忘的情况下提高了平均一般任务性能。此外，我们将我们的策略应用于 Llama-3-8B 模型。由此产生的模型 Llama-3-Physician 在当前的开源模型中实现了最好的医疗性能，并且在几个医学基准测试中的表现与 GPT-4 相当甚至更好。我们在 \url{this https URL} 发布我们的模型。

Title: ToVo: Toxicity Taxonomy via Voting

Authors: Tinh Son Luong, Thanh-Thien Le, Thang Viet Doan, Linh Ngo Van, Thien Huu Nguyen, Diep Thi-Ngoc Nguyen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.14835
Pdf URL: https://arxiv.org/pdf/2406.14835
Copy Paste: [[2406.14835]] ToVo: Toxicity Taxonomy via Voting(https://arxiv.org/abs/2406.14835)
Keywords: chain-of-thought
Abstract: Existing toxic detection models face significant limitations, such as lack of transparency, customization, and reproducibility. These challenges stem from the closed-source nature of their training data and the paucity of explanations for their evaluation mechanism. To address these issues, we propose a dataset creation mechanism that integrates voting and chain-of-thought processes, producing a high-quality open-source dataset for toxic content detection. Our methodology ensures diverse classification metrics for each sample and includes both classification scores and explanatory reasoning for the classifications. We utilize the dataset created through our proposed mechanism to train our model, which is then compared against existing widely-used detectors. Our approach not only enhances transparency and customizability but also facilitates better fine-tuning for specific use cases. This work contributes a robust framework for developing toxic content detection models, emphasizing openness and adaptability, thus paving the way for more effective and user-specific content moderation solutions.
摘要：现有的毒性检测模型面临着重大限制，例如缺乏透明度、定制化和可重复性。这些挑战源于其训练数据的闭源性质以及其评估机制缺乏解释。为了解决这些问题，我们提出了一种数据集创建机制，该机制集成了投票和思维链过程，从而为毒性内容检测生成了高质量的开源数据集。我们的方法确保每个样本的分类指标多样化，包括分类分数和分类的解释推理。我们利用通过我们提出的机制创建的数据集来训练我们的模型，然后将其与现有的广泛使用的检测器进行比较。我们的方法不仅提高了透明度和可定制性，而且还有助于针对特定用例进行更好的微调。这项工作为开发毒性内容检测模型提供了一个强大的框架，强调开放性和适应性，从而为更有效和针对用户的内容审核解决方案铺平了道路。

Title: Leveraging Passage Embeddings for Efficient Listwise Reranking with Large Language Models

Authors: Qi Liu, Bo Wang, Nan Wang, Jiaxin Mao
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2406.14848
Pdf URL: https://arxiv.org/pdf/2406.14848
Copy Paste: [[2406.14848]] Leveraging Passage Embeddings for Efficient Listwise Reranking with Large Language Models(https://arxiv.org/abs/2406.14848)
Keywords: language model, gpt, llm
Abstract: Recent studies have demonstrated the effectiveness of using large language models (LLMs) in passage ranking. The listwise approaches, such as RankGPT, have become the new state of the art in this task. However, the efficiency of RankGPT models is limited by the maximum context length and relatively high latency of LLM inference. To address these issues, in this paper, we propose PE-Rank, leveraging the single passage embedding as a good context compression for efficient listwise passage reranking. By treating each passage as a special token, we can directly input passage embeddings into LLMs, thereby reducing input length. Additionally, we introduce an inference method that dynamically constrains the decoding space to these special tokens, accelerating the decoding process. For adapting the model to reranking, we employ a listwise learning-to-rank loss for training. Evaluation results on multiple benchmarks demonstrate that PE-Rank significantly improves efficiency in both prefilling and decoding, while maintaining competitive ranking effectiveness. The code is available at this https URL.
摘要：最近的研究证明了在段落排名中使用大型语言模型 (LLM) 的有效性。列表式方法（例如 RankGPT）已成为此任务的新前沿。然而，RankGPT 模型的效率受到最大上下文长度和 LLM 推理相对较高的延迟的限制。为了解决这些问题，在本文中，我们提出了 PE-Rank，利用单个段落嵌入作为良好的上下文压缩，实现高效的列表式段落重新排名。通过将每个段落视为特殊标记，我们可以将段落嵌入直接输入到 LLM 中，从而减少输入长度。此外，我们引入了一种推理方法，可以动态地将解码空间限制为这些特殊标记，从而加速解码过程。为了使模型适应重新排名，我们采用列表式学习对排名损失进行训练。多个基准的评估结果表明，PE-Rank 显着提高了预填充和解码的效率，同时保持了有竞争力的排名效果。{代码可在 \url{此 https URL} 处获得。}

Title: From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking

Authors: Siyuan Wang, Zhuohan Long, Zhihao Fan, Zhongyu Wei
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.14859
Pdf URL: https://arxiv.org/pdf/2406.14859
Copy Paste: [[2406.14859]] From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking(https://arxiv.org/abs/2406.14859)
Keywords: language model, llm
Abstract: The rapid development of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) has exposed vulnerabilities to various adversarial attacks. This paper provides a comprehensive overview of jailbreaking research targeting both LLMs and MLLMs, highlighting recent advancements in evaluation benchmarks, attack techniques and defense strategies. Compared to the more advanced state of unimodal jailbreaking, the multimodal domain remains underexplored. We summarize the limitations and potential research directions of multimodal jailbreaking, aiming to inspire future research and further enhance the robustness and security of MLLMs.
摘要：大型语言模型 (LLM) 和多模态大型语言模型 (MLLM) 的快速发展暴露了其易受各种对抗性攻击的弱点。本文全面概述了针对 LLM 和 MLLM 的越狱研究，重点介绍了评估基准、攻击技术和防御策略方面的最新进展。与更先进的单模态越狱相比，多模态领域仍未得到充分探索。我们总结了多模态越狱的局限性和潜在的研究方向，旨在启发未来的研究并进一步增强 MLLM 的稳健性和安全性。

Title: Direct Multi-Turn Preference Optimization for Language Agents

Authors: Wentao Shi, Mengqi Yuan, Junkang Wu, Qifan Wang, Fuli Feng
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.14868
Pdf URL: https://arxiv.org/pdf/2406.14868
Copy Paste: [[2406.14868]] Direct Multi-Turn Preference Optimization for Language Agents(https://arxiv.org/abs/2406.14868)
Keywords: language model, llm, agent
Abstract: Adapting Large Language Models (LLMs) for agent tasks is critical in developing language agents. Direct Preference Optimization (DPO) is a promising technique for this adaptation with the alleviation of compounding errors, offering a means to directly optimize Reinforcement Learning (RL) objectives. However, applying DPO to multi-turn tasks presents challenges due to the inability to cancel the partition function. Overcoming this obstacle involves making the partition function independent of the current state and addressing length disparities between preferred and dis-preferred trajectories. In this light, we replace the policy constraint with the state-action occupancy measure constraint in the RL objective and add length normalization to the Bradley-Terry model, yielding a novel loss function named DMPO for multi-turn agent tasks with theoretical explanations. Extensive experiments on three multi-turn agent task datasets confirm the effectiveness and superiority of the DMPO loss.
摘要：在开发语言代理时，将大型语言模型 (LLM) 调整为代理任务至关重要。直接偏好优化 (DPO) 是一种很有前途的适应技术，可以减轻复合错误，提供一种直接优化强化学习 (RL) 目标的方法。然而，由于无法取消分区函数，将 DPO 应用于多轮任务会带来挑战。克服这一障碍需要使分区函数独立于当前状态，并解决首选和不首选轨迹之间的长度差异。鉴于此，我们在 RL 目标中用状态动作占用度量约束替换策略约束，并在 Bradley-Terry 模型中添加长度规范化，从而产生一种名为 DMPO 的新型损失函数，用于多轮代理任务，并提供理论解释。在三个多轮代理任务数据集上进行的大量实验证实了 DMPO 损失的有效性和优越性。
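
The length-normalization idea mentioned in this abstract can be illustrated with a small sketch. This is not the DMPO loss from the paper: it assumes, purely for illustration, that each trajectory is scored by the mean of hypothetical per-step reward values, and it plugs that score into a standard Bradley-Terry preference likelihood. The function name and its inputs are assumptions.

```python
import math
from typing import Sequence


def length_normalized_bt_loss(win_rewards: Sequence[float],
                              lose_rewards: Sequence[float]) -> float:
    """Negative log-likelihood that the preferred trajectory wins.

    Illustrative assumption: each trajectory is scored by its mean
    per-step reward, so a longer trajectory is not favored (or
    penalized) merely for having more steps.
    """
    win_score = sum(win_rewards) / len(win_rewards)
    lose_score = sum(lose_rewards) / len(lose_rewards)
    margin = win_score - lose_score
    # -log(sigmoid(margin)) = log(1 + exp(-margin)), computed stably
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

With this normalization, a one-step and a four-step trajectory with the same per-step reward receive the same score, which is the kind of length disparity the abstract says must be addressed.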

Title: Sports Intelligence: Assessing the Sports Understanding Capabilities of Language Models through Question Answering from Text to Video

Authors: Zhengbang Yang, Haotian Xia, Jingxi Li, Zezhi Chen, Zhuangdi Zhu, Weining Shen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14877
Pdf URL: https://arxiv.org/pdf/2406.14877
Copy Paste: [[2406.14877]] Sports Intelligence: Assessing the Sports Understanding Capabilities of Language Models through Question Answering from Text to Video(https://arxiv.org/abs/2406.14877)
Keywords: language model, chain-of-thought
Abstract: Understanding sports is crucial for the advancement of Natural Language Processing (NLP) due to its intricate and dynamic nature. Reasoning over complex sports scenarios has posed significant challenges to current NLP technologies which require advanced cognitive capabilities. Toward addressing the limitations of existing benchmarks on sports understanding in the NLP field, we extensively evaluated mainstream large language models for various sports tasks. Our evaluation spans from simple queries on basic rules and historical facts to complex, context-specific reasoning, leveraging strategies from zero-shot to few-shot learning, and chain-of-thought techniques. In addition to unimodal analysis, we further assessed the sports reasoning capabilities of mainstream video language models to bridge the gap in multimodal sports understanding benchmarking. Our findings highlighted the critical challenges of sports understanding for NLP. We proposed a new benchmark based on a comprehensive overview of existing sports datasets and provided extensive error analysis which we hope can help identify future research priorities in this field.
摘要：由于体育运动的复杂性和动态性，理解体育运动对于自然语言处理 (NLP) 的发展至关重要。对复杂体育场景的推理对需要高级认知能力的当前 NLP 技术提出了重大挑战。为了解决 NLP 领域现有体育理解基准的局限性，我们广泛评估了各种体育任务的主流大型语言模型。我们的评估范围从对基本规则和历史事实的简单查询到复杂的、特定于上下文的推理，利用从零样本学习到少样本学习的策略以及思路链技术。除了单模态分析外，我们还进一步评估了主流视频语言模型的体育推理能力，以弥补多模态体育理解基准测试的差距。我们的研究结果强调了 NLP 在体育理解方面面临的关键挑战。我们根据对现有体育数据集的全面概述提出了一个新的基准，并进行了广泛的错误分析，我们希望这可以帮助确定该领域未来的研究重点。

Title: 70B-parameter large language models in Japanese medical question-answering

Authors: Issey Sukeda, Risa Kishikawa, Satoshi Kodera
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14882
Pdf URL: https://arxiv.org/pdf/2406.14882
Copy Paste: [[2406.14882]] 70B-parameter large language models in Japanese medical question-answering(https://arxiv.org/abs/2406.14882)
Keywords: language model, llm, prompt
Abstract: Since the rise of large language models (LLMs), domain adaptation has been one of the hot topics in various domains. Many medical LLMs trained with English medical datasets have been made public recently. However, research on Japanese LLMs in the medical domain is still lacking. Here we utilize multiple 70B-parameter LLMs for the first time and show that instruction tuning using a Japanese medical question-answering dataset significantly improves the ability of Japanese LLMs to solve Japanese medical license exams, surpassing 50% in accuracy. In particular, the Japanese-centric models exhibit a more significant leap in improvement through instruction tuning compared to their English-centric counterparts. This underscores the importance of continual pretraining and the adjustment of the tokenizer in our local language. We also examine two slightly different prompt formats, resulting in non-negligible performance improvement.
摘要：自大型语言模型 (LLM) 兴起以来，领域自适应一直是各个领域的热门话题之一。最近，许多使用英语医学数据集训练的医学 LLM 已经面世。然而，日语 LLM 在医学领域的研究仍然不足。在这里，我们首次使用多个 70B 参数的 LLM，并表明使用日语医学问答数据集进行指令调整可显著提高日语 LLM 解决日语医师执照考试的能力，准确率超过 50%。特别是，与以英语为中心的模型相比，以日语为中心的模型通过指令调整表现出更显著的飞跃改进。这强调了持续预训练和调整本地语言标记器的重要性。我们还研究了两种略有不同的提示格式，从而带来了不可忽略的性能提升。

Title: OATH-Frames: Characterizing Online Attitudes Towards Homelessness with LLM Assistants

Authors: Jaspreet Ranjit, Brihi Joshi, Rebecca Dorn, Laura Petry, Olga Koumoundouros, Jayne Bottarini, Peichen Liu, Eric Rice, Swabha Swayamdipta
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2406.14883
Pdf URL: https://arxiv.org/pdf/2406.14883
Copy Paste: [[2406.14883]] OATH-Frames: Characterizing Online Attitudes Towards Homelessness with LLM Assistants(https://arxiv.org/abs/2406.14883)
Keywords: language model, llm
Abstract: Warning: Contents of this paper may be upsetting. Public attitudes towards key societal issues, expressed on online media, are of immense value in policy and reform efforts, yet challenging to understand at scale. We study one such social issue, homelessness in the U.S., by leveraging the remarkable capabilities of large language models to assist social work experts in analyzing millions of posts from Twitter. We introduce a framing typology, Online Attitudes Towards Homelessness (OATH) Frames: nine hierarchical frames capturing critiques, responses and perceptions. We release annotations with varying degrees of assistance from language models, with immense benefits in scaling: 6.5x speedup in annotation time while only incurring a 3 point F1 reduction in performance with respect to the domain experts. Our experiments demonstrate the value of modeling OATH-Frames over existing sentiment and toxicity classifiers. Our large-scale analysis with predicted OATH-Frames on 2.4M posts on homelessness reveals key trends in attitudes across states, time periods and vulnerable populations, enabling new insights on the issue. Our work provides a general framework to understand nuanced public attitudes at scale, on issues beyond homelessness.
摘要：警告：本文的内容可能会令人不安。公众对关键社会问题的态度在网络媒体上表达出来，对政策和改革工作具有巨大的价值，但要大规模理解却具有挑战性。我们研究了这样一个社会问题：美国的无家可归问题，利用大型语言模型的卓越能力帮助社会工作专家分析数百万条来自 Twitter 的帖子。我们引入了一种框架类型：在线无家可归态度 (OATH) 框架：九个层次框架，捕捉批评、回应和看法。我们发布了不同程度的语言模型辅助的注释，在扩展方面具有巨大的优势：注释时间加快了 6.5 倍，而与领域专家相比，性能仅降低了 3 点 F1。我们的实验证明了建模 OATH-Frames 优于现有情绪和毒性分类器的价值。我们对 240 万条关于无家可归的帖子进行了大规模分析，预测了 OATH-Frames，揭示了各州、各个时期和弱势群体态度的主要趋势，从而对这一问题有了新的见解。我们的工作提供了一个总体框架，以了解公众对无家可归问题以外问题的细微态度。

Title: FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents

Authors: Ruixuan Xiao, Wentao Ma, Ke Wang, Yuchuan Wu, Junbo Zhao, Haobo Wang, Fei Huang, Yongbin Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14884
Pdf URL: https://arxiv.org/pdf/2406.14884
Copy Paste: [[2406.14884]] FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents(https://arxiv.org/abs/2406.14884)
Keywords: llm, hallucination, agent
Abstract: LLM-based agents have emerged as promising tools, which are crafted to fulfill complex tasks by iterative planning and action. However, these agents are susceptible to undesired planning hallucinations when lacking specific knowledge for expertise-intensive tasks. To address this, preliminary attempts are made to enhance planning reliability by incorporating external workflow-related knowledge. Despite the promise, such infused knowledge is mostly disorganized and diverse in formats, lacking rigorous formalization and comprehensive comparisons. Motivated by this, we formalize different formats of workflow knowledge and present FlowBench, the first benchmark for workflow-guided planning. FlowBench covers 51 different scenarios from 6 domains, with knowledge presented in diverse formats. To assess different LLMs on FlowBench, we design a multi-tiered evaluation framework. We evaluate the efficacy of workflow knowledge across multiple formats, and the results indicate that current LLM agents need considerable improvements for satisfactory planning. We hope that our challenging benchmark can pave the way for future agent planning research.
摘要：基于 LLM 的代理已成为一种有前途的工具，它们被设计成通过迭代规划和行动来完成复杂任务。然而，当缺乏专业知识密集型任务的特定知识时，这些代理容易产生不良的规划幻觉。为了解决这个问题，人们进行了初步尝试，通过整合外部工作流相关知识来提高规划可靠性。尽管前景光明，但这些注入的知识大多杂乱无章、格式各异，缺乏严格的形式化和全面的比较。受此启发，我们将不同格式的工作流知识形式化，并提出了第一个工作流引导规划基准 FlowBench。FlowBench 涵盖了 6 个领域的 51 种不同场景，知识以多种格式呈现。为了在 FlowBench 上评估不同的 LLM，我们设计了一个多层次的评估框架。我们评估了多种格式的工作流知识的有效性，结果表明，当前的 LLM 代理需要进行相当大的改进才能实现令人满意的规划。我们希望我们具有挑战性的基准可以为未来的代理规划研究铺平道路。

Title: InternLM-Law: An Open Source Chinese Legal Large Language Model

Authors: Zhiwei Fei, Songyang Zhang, Xiaoyu Shen, Dawei Zhu, Xiao Wang, Maosong Cao, Fengzhe Zhou, Yining Li, Wenwei Zhang, Dahua Lin, Kai Chen, Jidong Ge
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14887
Pdf URL: https://arxiv.org/pdf/2406.14887
Copy Paste: [[2406.14887]] InternLM-Law: An Open Source Chinese Legal Large Language Model(https://arxiv.org/abs/2406.14887)
Keywords: language model, gpt, llm
Abstract: While large language models (LLMs) have showcased impressive capabilities, they struggle with addressing legal queries due to the intricate complexities and specialized expertise required in the legal field. In this paper, we introduce InternLM-Law, a specialized LLM tailored for addressing diverse legal queries related to Chinese laws, spanning from responding to standard legal questions (e.g., legal exercises in textbooks) to analyzing complex real-world legal situations. We meticulously construct a dataset in the Chinese legal domain, encompassing over 1 million queries, and implement a data filtering and processing pipeline to ensure its diversity and quality. Our training approach involves a novel two-stage process: initially fine-tuning LLMs on both legal-specific and general-purpose content to equip the models with broad knowledge, followed by exclusive fine-tuning on high-quality legal data to enhance structured output generation. InternLM-Law achieves the highest average performance on LawBench, outperforming state-of-the-art models, including GPT-4, on 13 out of 20 subtasks. We make InternLM-Law and our dataset publicly available to facilitate future research in applying LLMs within the legal domain.
摘要：尽管大型语言模型 (LLM) 已经展示了令人印象深刻的功能，但由于法律领域错综复杂的问题和专业知识要求，它们在解决法律问题方面仍举步维艰。在本文中，我们介绍了 InternLM-Law，这是一种专门的 LLM，专门用于解决与中国法律相关的各种法律问题，从回答标准法律问题（例如教科书中的法律练习）到分析复杂的现实法律情况。我们精心构建了一个涵盖 100 多万个查询的中国法律领域数据集，并实施了数据过滤和处理流程以确保其多样性和质量。我们的训练方法涉及一个新颖的两阶段流程：首先对法律特定内容和通用内容对 LLM 进行微调，使模型具备广泛的知识，然后对高质量法律数据进行专门微调，以增强结构化输出生成。InternLM-Law 在 LawBench 上取得了最高的平均性能，在 20 个子任务中的 13 个子任务中，其表现优于包括 GPT-4 在内的最先进模型。我们向公众开放 InternLM-Law 和我们的数据集，以促进未来在法律领域应用 LLM 的研究。

Title: Generate-then-Ground in Retrieval-Augmented Generation for Multi-hop Question Answering

Authors: Zhengliang Shi, Shuo Zhang, Weiwei Sun, Shen Gao, Pengjie Ren, Zhumin Chen, Zhaochun Ren
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Generate-then-Ground in Retrieval-Augmented Generation for Multi-hop Question Answering(https://arxiv.org/abs/)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Multi-Hop Question Answering (MHQA) tasks present a significant challenge for large language models (LLMs) due to the intensive knowledge required. Current solutions, like Retrieval-Augmented Generation, typically retrieve potential documents from an external corpus to read an answer. However, the performance of this retrieve-then-read paradigm is constrained by the retriever and the inevitable noise in the retrieved documents. To mitigate these challenges, we introduce a novel generate-then-ground (GenGround) framework, synergizing the parametric knowledge of LLMs and external documents to solve a multi-hop question. GenGround empowers LLMs to alternate two phases until the final answer is derived: (1) formulate a simpler, single-hop question and directly generate the answer; (2) ground the question-answer pair in retrieved documents, amending any wrong predictions in the answer. We also propose an instructional grounding distillation method to generalize our method into smaller models. Extensive experiments conducted on four datasets illustrate the superiority of our method.
摘要：由于需要大量知识，多跳问答 (MHQA) 任务对大型语言模型 (LLM) 提出了重大挑战。当前的解决方案（如检索增强生成）通常从外部语料库中检索潜在文档以读取答案。但是，这种检索后读取范式的性能受到检索器和检索到的文档中不可避免的噪音的限制。为了缓解这些挑战，我们引入了一个新颖的生成后接地 (GenGround) 框架，协同 LLM 和外部文档的参数知识来解决多跳问题。GenGround 使 LLM 能够交替进行两个阶段，直到得出最终答案：(1) 制定一个更简单的单跳问题并直接生成答案；(2) 在检索到的文档中接地问答对，修改答案中的任何错误预测。我们还提出了一种教学接地蒸馏方法，将我们的方法推广到更小的模型中。在四个数据集上进行的大量实验说明了我们方法的优越性。
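
The two alternating phases described in this abstract can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: `propose`, `retrieve`, and `ground` are hypothetical callables standing in for an LLM call that proposes a single-hop question with a draft answer, a document retriever, and an LLM call that revises the draft against the retrieved documents. The `max_hops` cap is also an assumption.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces, assumed for illustration only.
Trace = List[Tuple[str, str]]                        # (sub-question, grounded answer)
Propose = Callable[[str, Trace], Tuple[str, str, bool]]
Retrieve = Callable[[str], List[str]]
Ground = Callable[[str, str, List[str]], str]


def generate_then_ground(question: str, propose: Propose, retrieve: Retrieve,
                         ground: Ground, max_hops: int = 4) -> Tuple[str, Trace]:
    """Alternate between generating and grounding until a final answer."""
    trace: Trace = []
    answer = ""
    for _ in range(max_hops):
        # Phase 1: form a simpler single-hop question and a draft answer.
        sub_question, draft, is_final = propose(question, trace)
        # Phase 2: check the draft against retrieved documents and revise it.
        docs = retrieve(sub_question)
        answer = ground(sub_question, draft, docs)
        trace.append((sub_question, answer))
        if is_final:
            break
    return answer, trace
```

Keeping the grounded answers in `trace` lets each later hop build on corrected intermediate answers rather than on the model's unverified drafts.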

Title: Talking the Talk Does Not Entail Walking the Walk: On the Limits of Large Language Models in Lexical Entailment Recognition

Authors: Candida M. Greco, Lucio La Cava, Andrea Tagarelli
Subjects: cs.CL, cs.AI, cs.CY, cs.IR, physics.soc-ph
Abstract URL: https://arxiv.org/abs/2406.14894
Pdf URL: https://arxiv.org/pdf/2406.14894
Copy Paste: [[2406.14894]] Talking the Talk Does Not Entail Walking the Walk: On the Limits of Large Language Models in Lexical Entailment Recognition(https://arxiv.org/abs/2406.14894)
Keywords: language model, llm, prompt
Abstract: Verbs form the backbone of language, providing the structure and meaning to sentences. Yet, their intricate semantic nuances pose a longstanding challenge. Understanding verb relations through the concept of lexical entailment is crucial for comprehending sentence meanings and grasping verb dynamics. This work investigates the capabilities of eight Large Language Models in recognizing lexical entailment relations among verbs through differently devised prompting strategies and zero-/few-shot settings over verb pairs from two lexical databases, namely WordNet and HyperLex. Our findings unveil that the models can tackle the lexical entailment recognition task with moderately good performance, although at varying degree of effectiveness and under different conditions. Also, utilizing few-shot prompting can enhance the models' performance. However, perfectly solving the task arises as an unmet challenge for all examined LLMs, which raises an emergence for further research developments on this topic.
摘要：动词是语言的支柱，为句子提供结构和意义。然而，它们复杂的语义细微差别带来了长期的挑战。通过词汇蕴涵的概念来理解动词关系对于理解句子意义和掌握动词动态至关重要。这项研究调查了八个大型语言模型通过不同设计的提示策略和零次/少次设置识别动词之间的词汇蕴涵关系的能力，这些动词来自两个词汇数据库，即 WordNet 和 HyperLex。我们的研究结果表明，这些模型可以以中等良好的性能处理词汇蕴涵识别任务，尽管有效性程度不同，条件也不同。此外，利用少次提示可以提高模型的性能。然而，完美地解决这个任务是所有被研究的 LLM 都面临的一个未解决的挑战，这为进一步研究这一主题提供了契机。

Title: Towards Retrieval Augmented Generation over Large Video Libraries

Authors: Yannis Tevissen, Khalil Guetari, Frédéric Petitpont
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.14938
Pdf URL: https://arxiv.org/pdf/2406.14938
Copy Paste: [[2406.14938]] Towards Retrieval Augmented Generation over Large Video Libraries(https://arxiv.org/abs/2406.14938)
Keywords: language model, llm, retrieval augmented generation
Abstract: Video content creators need efficient tools to repurpose content, a task that often requires complex manual or automated searches. Crafting a new video from large video libraries remains a challenge. In this paper we introduce the task of Video Library Question Answering (VLQA) through an interoperable architecture that applies Retrieval Augmented Generation (RAG) to video libraries. We propose a system that uses large language models (LLMs) to generate search queries, retrieving relevant video moments indexed by speech and visual metadata. An answer generation module then integrates user queries with this metadata to produce responses with specific video timestamps. This approach shows promise in multimedia content retrieval, and AI-assisted video content creation.
摘要：视频内容创建者需要高效的工具来重新利用内容，这项任务通常需要复杂的手动或自动搜索。从大型视频库中制作新视频仍然是一项挑战。在本文中，我们通过一种可互操作的架构介绍了视频库问答 (VLQA) 任务，该架构将检索增强生成 (RAG) 应用于视频库。我们提出了一个使用大型语言模型 (LLM) 生成搜索查询的系统，检索由语音和视觉元数据索引的相关视频时刻。然后，答案生成模块将用户查询与此元数据集成，以生成具有特定视频时间戳的响应。这种方法在多媒体内容检索和 AI 辅助视频内容创建方面显示出良好的前景。

Title: ESC-Eval: Evaluating Emotion Support Conversations in Large Language Models

Authors: Haiquan Zhao, Lingyu Li, Shisong Chen, Shuqi Kong, Jiaan Wang, Kexing Huang, Tianle Gu, Yixu Wang, Dandan Liang, Zhixu Li, Tan Teng, Yanghua Xiao, Yingchun Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14952
Pdf URL: https://arxiv.org/pdf/2406.14952
Copy Paste: [[2406.14952]] ESC-Eval: Evaluating Emotion Support Conversations in Large Language Models(https://arxiv.org/abs/2406.14952)
Keywords: language model, gpt, llm, chat, agent
Abstract: Emotion Support Conversation (ESC) is a crucial application, which aims to reduce human stress, offer emotional guidance, and ultimately enhance human mental and physical well-being. With the advancement of Large Language Models (LLMs), many researchers have employed LLMs as the ESC models. However, the evaluation of these LLM-based ESCs remains uncertain. Inspired by the awesome development of role-playing agents, we propose an ESC Evaluation framework (ESC-Eval), which uses a role-playing agent to interact with ESC models, followed by a manual evaluation of the interactive dialogues. In detail, we first re-organize 2,801 role-playing cards from seven existing datasets to define the roles of the role-playing agent. Second, we train a specific role-playing model called ESC-Role which behaves more like a confused person than GPT-4. Third, through ESC-Role and organized role cards, we systematically conduct experiments using 14 LLMs as the ESC models, including general AI-assistant LLMs (ChatGPT) and ESC-oriented LLMs (ExTES-Llama). We conduct comprehensive human annotations on interactive multi-turn dialogues of different ESC models. The results show that ESC-oriented LLMs exhibit superior ESC abilities compared to general AI-assistant LLMs, but there is still a gap behind human performance. Moreover, to automate the scoring process for future ESC models, we developed ESC-RANK, which trained on the annotated data, achieving a scoring performance surpassing 35 points of GPT-4. Our data and code are available at this https URL.
摘要：情绪支持对话 (ESC) 是一项至关重要的应用，旨在减轻人类压力、提供情绪指导并最终增强人类的身心健康。随着大型语言模型 (LLM) 的进步，许多研究人员已将 LLM 用作 ESC 模型。然而，对这些基于 LLM 的 ESC 的评估仍然不确定。受到角色扮演代理的惊人发展的启发，我们提出了一个 ESC 评估框架 (ESC-Eval)，该框架使用角色扮演代理与 ESC 模型交互，然后对交互式对话进行手动评估。具体来说，我们首先从七个现有数据集中重新组织 2,801 张角色扮演卡片，以定义角色扮演代理的角色。其次，我们训练一个名为 ESC-Role 的特定角色扮演模型，它的行为比 GPT-4 更像一个困惑的人。第三，通过 ESC-Role 和组织好的角色卡，我们系统地使用 14 个 LLM 作为 ESC 模型进行实验，包括通用 AI 辅助 LLM（ChatGPT）和面向 ESC 的 LLM（ExTES-Llama）。我们对不同 ESC 模型的交互式多轮对话进行了全面的人工注释。结果表明，与通用 AI 辅助 LLM 相比，面向 ESC 的 LLM 表现出了更出色的 ESC 能力，但与人类的表现仍有差距。此外，为了使未来 ESC 模型的评分过程自动化，我们开发了 ESC-RANK，它在带注释的数据上进行训练，取得了超过 GPT-4 35 分的评分表现。我们的数据和代码可在此 https URL 上找到。

Title: ICLEval: Evaluating In-Context Learning Ability of Large Language Models

Authors: Wentong Chen, Yankai Lin, ZhenHao Zhou, HongYun Huang, Yantao Jia, Zhao Cao, Ji-Rong Wen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14955
Pdf URL: https://arxiv.org/pdf/2406.14955
Copy Paste: [[2406.14955]] ICLEval: Evaluating In-Context Learning Ability of Large Language Models(https://arxiv.org/abs/2406.14955)
Keywords: language model, llm
Abstract: In-Context Learning (ICL) is a critical capability of Large Language Models (LLMs) as it empowers them to comprehend and reason across interconnected inputs. Evaluating the ICL ability of LLMs can enhance their utilization and deepen our understanding of how this ability is acquired at the training stage. However, existing evaluation frameworks primarily focus on language abilities and knowledge, often overlooking the assessment of ICL ability. In this work, we introduce the ICLEval benchmark to evaluate the ICL abilities of LLMs, which encompasses two key sub-abilities: exact copying and rule learning. Through the ICLEval benchmark, we demonstrate that ICL ability is universally present in different LLMs, and model size is not the sole determinant of ICL efficacy. Surprisingly, we observe that ICL abilities, particularly copying, develop early in the pretraining process and stabilize afterward. Our source codes and benchmark are released at this https URL.
摘要：上下文学习 (ICL) 是大型语言模型 (LLM) 的一项关键能力，因为它使它们能够理解和推理相互关联的输入。评估 LLM 的 ICL 能力可以提高它们的利用率，并加深我们对如何在训练阶段获得这种能力的理解。然而，现有的评估框架主要关注语言能力和知识，往往忽略了对 ICL 能力的评估。在这项工作中，我们引入了 ICLEval 基准来评估 LLM 的 ICL 能力，它包含两个关键的子能力：精确复制和规则学习。通过 ICLEval 基准，我们证明 ICL 能力普遍存在于不同的 LLM 中，模型大小并不是决定 ICL 功效的唯一因素。令人惊讶的是，我们观察到 ICL 能力，尤其是复制能力，在预训练过程的早期发展，之后趋于稳定。我们的源代码和基准发布在此 https URL 上。

Title: Domain Adaptation of Llama3-70B-Instruct through Continual Pre-Training and Model Merging: A Comprehensive Evaluation

Authors: Shamane Siriwardhana, Mark McQuade, Thomas Gauthier, Lucas Atkins, Fernando Fernandes Neto, Luke Meyers, Anneketh Vij, Tyler Odenthal, Charles Goddard, Mary MacCarthy, Jacob Solawetz
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.14971
Pdf URL: https://arxiv.org/pdf/2406.14971
Copy Paste: [[2406.14971]] Domain Adaptation of Llama3-70B-Instruct through Continual Pre-Training and Model Merging: A Comprehensive Evaluation(https://arxiv.org/abs/2406.14971)
Keywords: language model
Abstract: We conducted extensive experiments on domain adaptation of the Meta-Llama-3-70B-Instruct model on SEC data, exploring its performance on both general and domain-specific benchmarks. Our focus included continual pre-training (CPT) and model merging, aiming to enhance the model's domain-specific capabilities while mitigating catastrophic forgetting. Through this study, we evaluated the impact of integrating financial regulatory data into a robust language model and examined the effectiveness of our model merging techniques in preserving and improving the model's instructive abilities. The model is accessible at hugging face: this https URL, arcee-ai/Llama-3-SEC-Base. This is an intermediate checkpoint of our final model, which has seen 20B tokens so far. The full model is still in the process of training. This is a preprint technical report with thorough evaluations to understand the entire process.
摘要：我们对 Meta-Llama-3-70B-Instruct 模型在 SEC 数据上的领域适应性进行了广泛的实验，探索了它在通用和领域特定基准上的表现。我们的重点包括持续预训练 (CPT) 和模型合并，旨在增强模型的领域特定能力，同时减轻灾难性遗忘。通过这项研究，我们评估了将金融监管数据整合到强大的语言模型中的影响，并检查了我们的模型合并技术在保持和提高模型指导能力方面的有效性。该模型可通过 hugging face 访问：此 https URL，arcee-ai/Llama-3-SEC-Base。这是我们最终模型的中间检查点，到目前为止已经看到了 200 亿个 token。完整模型仍在训练过程中。这是一份预印本技术报告，其中包含全面的评估，以了解整个过程。

Title: A Tale of Trust and Accuracy: Base vs. Instruct LLMs in RAG Systems

Authors: Florin Cuconasu, Giovanni Trappolini, Nicola Tonellotto, Fabrizio Silvestri
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2406.14972
Pdf URL: https://arxiv.org/pdf/2406.14972
Copy Paste: [[2406.14972]] A Tale of Trust and Accuracy: Base vs. Instruct LLMs in RAG Systems(https://arxiv.org/abs/2406.14972)
Keywords: language model, llm, retrieval augmented generation
Abstract: Retrieval Augmented Generation (RAG) represents a significant advancement in artificial intelligence combining a retrieval phase with a generative phase, with the latter typically being powered by large language models (LLMs). The current common practices in RAG involve using "instructed" LLMs, which are fine-tuned with supervised training to enhance their ability to follow instructions and are aligned with human preferences using state-of-the-art techniques. Contrary to popular belief, our study demonstrates that base models outperform their instructed counterparts in RAG tasks by 20% on average under our experimental settings. This finding challenges the prevailing assumptions about the superiority of instructed LLMs in RAG applications. Further investigations reveal a more nuanced situation, questioning fundamental aspects of RAG and suggesting the need for broader discussions on the topic; or, as Fromm would have it, "Seldom is a glance at the statistics enough to understand the meaning of the figures".
摘要：检索增强生成 (RAG) 代表了人工智能的一项重大进步，它将检索阶段与生成阶段相结合，后者通常由大型语言模型 (LLM) 提供支持。RAG 中当前的常见做法是使用“指导式”LLM，这些 LLM 通过监督训练进行微调，以增强其遵循指令的能力，并使用最先进的技术与人类偏好保持一致。与普遍看法相反，我们的研究表明，在我们的实验环境下，基础模型在 RAG 任务中的表现平均比指导式模型高出 20%。这一发现挑战了关于指导式 LLM 在 RAG 应用中优越性的普遍假设。进一步的调查揭示了一个更微妙的情况，质疑了 RAG 的基本方面，并表明需要就该主题进行更广泛的讨论；或者，正如弗洛姆所说，“很少能一眼看懂统计数据就足以理解数字的含义”。

Title: Retrieve-Plan-Generation: An Iterative Planning and Answering Framework for Knowledge-Intensive LLM Generation

Authors: Yuanjie Lyu, Zihan Niu, Zheyong Xie, Chao Zhang, Tong Xu, Yang Wang, Enhong Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14979
Pdf URL: https://arxiv.org/pdf/2406.14979
Copy Paste: [[2406.14979]] Retrieve-Plan-Generation: An Iterative Planning and Answering Framework for Knowledge-Intensive LLM Generation(https://arxiv.org/abs/2406.14979)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Despite the significant progress of large language models (LLMs) in various tasks, they often produce factual errors due to their limited internal knowledge. Retrieval-Augmented Generation (RAG), which enhances LLMs with external knowledge sources, offers a promising solution. However, these methods can be misled by irrelevant paragraphs in retrieved documents. Due to the inherent uncertainty in LLM generation, inputting the entire document may introduce off-topic information, causing the model to deviate from the central topic and affecting the relevance of the generated content. To address these issues, we propose the Retrieve-Plan-Generation (RPG) framework. RPG generates plan tokens to guide subsequent generation in the plan stage. In the answer stage, the model selects relevant fine-grained paragraphs based on the plan and uses them for further answer generation. This plan-answer process is repeated iteratively until completion, enhancing generation relevance by focusing on specific topics. To implement this framework efficiently, we utilize a simple but effective multi-task prompt-tuning method, enabling the existing LLMs to handle both planning and answering. We comprehensively compare RPG with baselines across 5 knowledge-intensive generation tasks, demonstrating the effectiveness of our approach.
摘要：尽管大型语言模型 (LLM) 在各种任务中取得了重大进展，但由于其内部知识有限，它们经常会产生事实错误。检索增强生成 (RAG) 通过利用外部知识源增强 LLM，提供了一种有前途的解决方案。然而，这些方法可能会被检索到的文档中不相关的段落误导。由于 LLM 生成中固有的不确定性，输入整个文档可能会引入离题信息，导致模型偏离中心主题并影响生成内容的相关性。为了解决这些问题，我们提出了检索-计划-生成 (RPG) 框架。RPG 生成计划标记以指导计划阶段的后续生成。在答案阶段，模型根据计划选择相关的细粒度段落并将其用于进一步的答案生成。这个计划-答案过程迭代重复直至完成，通过关注特定主题来增强生成的相关性。为了高效地实施该框架，我们采用了一种简单但有效的多任务提示调整方法，使现有的 LLM 能够同时处理规划和回答。我们在 5 个知识密集型生成任务中全面比较了 RPG 与基线，证明了我们方法的有效性。

Title: SpreadsheetBench: Towards Challenging Real World Spreadsheet Manipulation

Authors: Zeyao Ma, Bohan Zhang, Jing Zhang, Jifan Yu, Xiaokang Zhang, Xiaohan Zhang, Sijia Luo, Xi Wang, Jie Tang
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2406.14991
Pdf URL: https://arxiv.org/pdf/2406.14991
Copy Paste: [[2406.14991]] SpreadsheetBench: Towards Challenging Real World Spreadsheet Manipulation(https://arxiv.org/abs/2406.14991)
Keywords: language model, llm
Abstract: We introduce SpreadsheetBench, a challenging spreadsheet manipulation benchmark exclusively derived from real-world scenarios, designed to immerse current large language models (LLMs) in the actual workflow of spreadsheet users. Unlike existing benchmarks that rely on synthesized queries and simplified spreadsheet files, SpreadsheetBench is built from 912 real questions gathered from online Excel forums, which reflect the intricate needs of users. The associated spreadsheets from the forums contain a variety of tabular data such as multiple tables, non-standard relational tables, and abundant non-textual elements. Furthermore, we propose a more reliable evaluation metric akin to online judge platforms, where multiple spreadsheet files are created as test cases for each instruction, ensuring the evaluation of robust solutions capable of handling spreadsheets with varying values. Our comprehensive evaluation of various LLMs under both single-round and multi-round inference settings reveals a substantial gap between the state-of-the-art (SOTA) models and human performance, highlighting the benchmark's difficulty.
摘要：我们推出了 SpreadsheetBench，这是一个具有挑战性的电子表格操作基准，完全源自真实场景，旨在将当前的大型语言模型 (LLM) 融入电子表格用户的实际工作流程中。与依赖合成查询和简化电子表格文件的现有基准不同，SpreadsheetBench 是根据从在线 Excel 论坛收集的 912 个真实问题构建的，这些问题反映了用户的复杂需求。论坛中的相关电子表格包含各种表格数据，例如多个表格、非标准关系表和丰富的非文本元素。此外，我们提出了一种更可靠的评估指标，类似于在线评判平台，其中为每个指令创建多个电子表格文件作为测试用例，确保评估能够处理具有不同值的电子表格的稳健解决方案。我们在单轮和多轮推理设置下对各种 LLM 进行了全面评估，发现最先进 (SOTA) 模型与人类表现之间存在巨大差距，凸显了基准的难度。

Title: Unveiling the Impact of Multi-Modal Interactions on User Engagement: A Comprehensive Evaluation in AI-driven Conversations

Authors: Lichao Zhang, Jia Yu, Shuai Zhang, Long Li, Yangyang Zhong, Guanbao Liang, Yuming Yan, Qing Ma, Fangsheng Weng, Fayu Pan, Jing Li, Renjun Xu, Zhenzhong Lan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.15000
Pdf URL: https://arxiv.org/pdf/2406.15000
Copy Paste: [[2406.15000]] Unveiling the Impact of Multi-Modal Interactions on User Engagement: A Comprehensive Evaluation in AI-driven Conversations(https://arxiv.org/abs/2406.15000)
Keywords: language model, llm, chat
Abstract: Large Language Models (LLMs) have significantly advanced user-bot interactions, enabling more complex and coherent dialogues. However, the prevalent text-only modality might not fully exploit the potential for effective user engagement. This paper explores the impact of multi-modal interactions, which incorporate images and audio alongside text, on user engagement in chatbot conversations. We conduct a comprehensive analysis using a diverse set of chatbots and real-user interaction data, employing metrics such as retention rate and conversation length to evaluate user engagement. Our findings reveal a significant enhancement in user engagement with multi-modal interactions compared to text-only dialogues. Notably, the incorporation of a third modality significantly amplifies engagement beyond the benefits observed with just two modalities. These results suggest that multi-modal interactions optimize cognitive processing and facilitate richer information comprehension. This study underscores the importance of multi-modality in chatbot design, offering valuable insights for creating more engaging and immersive AI communication experiences and informing the broader AI community about the benefits of multi-modal interactions in enhancing user engagement.
摘要：大型语言模型 (LLM) 显著提高了用户与机器人之间的交互，使对话更加复杂和连贯。然而，普遍的纯文本模式可能无法充分发挥有效用户参与的潜力。本文探讨了多模态交互（将图像和音频与文本结合在一起）对聊天机器人对话中用户参与度的影响。我们使用多种聊天机器人和真实用户交互数据进行全面分析，采用留存率和对话长度等指标来评估用户参与度。我们的研究结果表明，与纯文本对话相比，多模态交互显著提高了用户参与度。值得注意的是，加入第三种模式显著提高了参与度，超过了仅使用两种模式所观察到的好处。这些结果表明，多模态交互优化了认知处理并促进了更丰富的信息理解。这项研究强调了多模态在聊天机器人设计中的重要性，为创造更具吸引力和沉浸感的人工智能通信体验提供了宝贵的见解，并让更广泛的人工智能社区了解多模态交互在增强用户参与度方面的好处。

Title: MedOdyssey: A Medical Domain Benchmark for Long Context Evaluation Up to 200K Tokens

Authors: Yongqi Fan, Hongli Sun, Kui Xue, Xiaofan Zhang, Shaoting Zhang, Tong Ruan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.15019
Pdf URL: https://arxiv.org/pdf/2406.15019
Copy Paste: [[2406.15019]] MedOdyssey: A Medical Domain Benchmark for Long Context Evaluation Up to 200K Tokens(https://arxiv.org/abs/2406.15019)
Keywords: language model, llm, long context
Abstract: Numerous advanced Large Language Models (LLMs) now support context lengths up to 128K, and some extend to 200K. Some benchmarks in the generic domain have also followed up on evaluating long-context capabilities. In the medical domain, tasks are distinctive due to the unique contexts and need for domain expertise, necessitating further evaluation. However, despite the frequent presence of long texts in medical scenarios, evaluation benchmarks of long-context capabilities for LLMs in this field are still rare. In this paper, we propose MedOdyssey, the first medical long-context benchmark with seven length levels ranging from 4K to 200K tokens. MedOdyssey consists of two primary components: the medical-context "needles in a haystack" task and a series of tasks specific to medical applications, together comprising 10 datasets. The first component includes challenges such as counter-intuitive reasoning and novel (unknown) facts injection to mitigate knowledge leakage and data contamination of LLMs. The second component confronts the challenge of requiring professional medical expertise. Especially, we design the "Maximum Identical Context" principle to improve fairness by guaranteeing that different LLMs observe as many identical contexts as possible. Our experiment evaluates advanced proprietary and open-source LLMs tailored for processing long contexts and presents detailed performance analyses. This highlights that LLMs still face challenges and need for further research in this area. Our code and data are released in the repository: this https URL.
摘要：现在，许多先进的大型语言模型 (LLM) 支持高达 128K 的上下文长度，有些甚至扩展到 200K。通用领域的一些基准测试也跟进了对长上下文能力的评估。在医学领域，由于独特的上下文和对领域专业知识的需求，任务具有独特性，需要进一步评估。然而，尽管在医学场景中经常出现长文本，但该领域 LLM 的长上下文能力评估基准测试仍然很少。在本文中，我们提出了 MedOdyssey，这是第一个医学长上下文基准测试，具有七个长度级别，范围从 4K 到 200K 个标记。MedOdyssey 由两个主要部分组成：医学上下文“大海捞针”任务和一系列特定于医学应用的任务，总共包含 10 个数据集。第一个组件包括反直觉推理和新（未知）事实注入等挑战，以减轻 LLM 的知识泄漏和数据污染。第二部分面临着需要专业医疗专业知识的挑战。特别是，我们设计了“最大相同上下文”原则，通过保证不同的 LLM 观察到尽可能多的相同上下文来提高公平性。我们的实验评估了为处理长上下文而定制的高级专有和开源 LLM，并提供了详细的性能分析。这突出表明 LLM 仍然面临挑战，需要在这一领域进一步研究。我们的代码和数据发布在存储库中：此 https URL。

Title: GiusBERTo: A Legal Language Model for Personal Data De-identification in Italian Court of Auditors Decisions

Authors: Giulio Salierno, Rosamaria Bertè, Luca Attias, Carla Morrone, Dario Pettazzoni, Daniela Battisti
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.15032
Pdf URL: https://arxiv.org/pdf/2406.15032
Copy Paste: [[2406.15032]] GiusBERTo: A Legal Language Model for Personal Data De-identification in Italian Court of Auditors Decisions(https://arxiv.org/abs/2406.15032)
Keywords: language model
Abstract: Recent advances in Natural Language Processing have demonstrated the effectiveness of pretrained language models like BERT for a variety of downstream tasks. We present GiusBERTo, the first BERT-based model specialized for anonymizing personal data in Italian legal documents. GiusBERTo is trained on a large dataset of Court of Auditors decisions to recognize entities to anonymize, including names, dates, locations, while retaining contextual relevance. We evaluate GiusBERTo on a held-out test set and achieve 97% token-level accuracy. GiusBERTo provides the Italian legal community with an accurate and tailored BERT model for de-identification, balancing privacy and data protection.
摘要：自然语言处理领域的最新进展证明了 BERT 等预训练语言模型对各种下游任务的有效性。我们推出了 GiusBERTo，这是第一个基于 BERT 的模型，专门用于匿名化意大利法律文件中的个人数据。GiusBERTo 经过大量审计法院判决数据集的训练，可以识别要匿名化的实体，包括姓名、日期、地点，同时保留上下文相关性。我们在保留的测试集上评估了 GiusBERTo，并实现了 97% 的标记级准确率。GiusBERTo 为意大利法律界提供了一个准确且量身定制的 BERT 模型，用于去识别化，平衡隐私和数据保护。

Title: Harnessing Knowledge Retrieval with Large Language Models for Clinical Report Error Correction

Authors: Jinge Wu, Zhaolong Wu, Abul Hasan, Yunsoo Kim, Jason P.Y. Cheung, Teng Zhang, Honghan Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.15045
Pdf URL: https://arxiv.org/pdf/2406.15045
Copy Paste: [[2406.15045]] Harnessing Knowledge Retrieval with Large Language Models for Clinical Report Error Correction(https://arxiv.org/abs/2406.15045)
Keywords: language model, llm, retrieval-augmented generation
Abstract: This study proposes an approach for error correction in clinical radiology reports, leveraging large language models (LLMs) and retrieval-augmented generation (RAG) techniques. The proposed framework employs internal and external retrieval mechanisms to extract relevant medical entities and relations from the report and external knowledge sources. A three-stage inference process is introduced, decomposing the task into error detection, localization, and correction subtasks, which enhances the explainability and performance of the system. The effectiveness of the approach is evaluated using a benchmark dataset created by corrupting real-world radiology reports with realistic errors, guided by domain experts. Experimental results demonstrate the benefits of the proposed methods, with the combination of internal and external retrieval significantly improving the accuracy of error detection, localization, and correction across various state-of-the-art LLMs. The findings contribute to the development of more robust and reliable error correction systems for clinical documentation.
摘要：本研究提出了一种利用大型语言模型 (LLM) 和检索增强生成 (RAG) 技术来纠正临床放射学报告中的错误的方法。所提出的框架采用内部和外部检索机制，从报告和外部知识源中提取相关的医疗实体和关系。引入了一个三阶段推理过程，将任务分解为错误检测、定位和纠正子任务，从而增强了系统的可解释性和性能。在领域专家的指导下，使用通过破坏真实世界放射学报告而创建的基准数据集来评估该方法的有效性。实验结果证明了所提出方法的优势，内部和外部检索的结合显著提高了各种最先进的 LLM 中错误检测、定位和纠正的准确性。这些发现有助于开发更强大、更可靠的临床文档纠错系统。

Title: PARIKSHA : A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data

Authors: Ishaan Watts, Varun Gumma, Aditya Yadavalli, Vivek Seshadri, Manohar Swaminathan, Sunayana Sitaram
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.15053
Pdf URL: https://arxiv.org/pdf/2406.15053
Copy Paste: [[2406.15053]] PARIKSHA : A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data(https://arxiv.org/abs/2406.15053)
Keywords: language model, gpt, llm
Abstract: Evaluation of multilingual Large Language Models (LLMs) is challenging due to a variety of factors -- the lack of benchmarks with sufficient linguistic diversity, contamination of popular benchmarks into LLM pre-training data and the lack of local, cultural nuances in translated benchmarks. In this work, we study human and LLM-based evaluation in a multilingual, multi-cultural setting. We evaluate 30 models across 10 Indic languages by conducting 90K human evaluations and 30K LLM-based evaluations and find that models such as GPT-4o and Llama-3 70B consistently perform best for most Indic languages. We build leaderboards for two evaluation settings - pairwise comparison and direct assessment and analyse the agreement between humans and LLMs. We find that humans and LLMs agree fairly well in the pairwise setting but the agreement drops for direct assessment evaluation especially for languages such as Bengali and Odia. We also check for various biases in human and LLM-based evaluation and find evidence of self-bias in the GPT-based evaluator. Our work presents a significant step towards scaling up multilingual evaluation of LLMs.
摘要：多语言大型语言模型 (LLM) 的评估具有挑战性，原因有很多——缺乏具有足够语言多样性的基准、流行基准污染了 LLM 预训练数据以及翻译后的基准缺乏本地文化细微差别。在这项工作中，我们研究了多语言、多文化环境中的人工评估和基于 LLM 的评估。我们通过进行 90K 次人工评估和 30K 次基于 LLM 的评估，评估了 10 种印度语中的 30 个模型，发现 GPT-4o 和 Llama-3 70B 等模型在大多数印度语中始终表现最佳。我们为两种评估设置（成对比较和直接评估）建立了排行榜，并分析了人工和 LLM 之间的一致性。我们发现，在成对设置中，人工和 LLM 的一致性相当好，但在直接评估中一致性下降，尤其是对于孟加拉语和奥迪亚语等语言。我们还检查了人工评估和 LLM 评估中的各种偏见，并发现了 GPT 评估器中存在自我偏见的证据。我们的工作为扩大 LLM 的多语言评估迈出了重要一步。

Title: Brain-Like Language Processing via a Shallow Untrained Multihead Attention Network

Authors: Badr AlKhamissi, Greta Tuckute, Antoine Bosselut, Martin Schrimpf
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.15109
Pdf URL: https://arxiv.org/pdf/2406.15109
Copy Paste: [[2406.15109]] Brain-Like Language Processing via a Shallow Untrained Multihead Attention Network(https://arxiv.org/abs/2406.15109)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have been shown to be effective models of the human language system, with some models predicting most explainable variance of brain activity in current datasets. Even in untrained models, the representations induced by architectural priors can exhibit reasonable alignment to brain data. In this work, we investigate the key architectural components driving the surprising alignment of untrained models. To estimate LLM-to-brain similarity, we first select language-selective units within an LLM, similar to how neuroscientists identify the language network in the human brain. We then benchmark the brain alignment of these LLM units across five different brain recording datasets. By isolating critical components of the Transformer architecture, we identify tokenization strategy and multihead attention as the two major components driving brain alignment. A simple form of recurrence further improves alignment. We further demonstrate this quantitative brain alignment of our model by reproducing landmark studies in the language neuroscience field, showing that localized model units -- just like language voxels measured empirically in the human brain -- discriminate more reliably between lexical than syntactic differences, and exhibit similar response profiles under the same experimental conditions. Finally, we demonstrate the utility of our model's representations for language modeling, achieving improved sample and parameter efficiency over comparable architectures. Our model's estimates of surprisal sets a new state-of-the-art in the behavioral alignment to human reading times. Taken together, we propose a highly brain- and behaviorally-aligned model that conceptualizes the human language system as an untrained shallow feature encoder, with structural priors, combined with a trained decoder to achieve efficient and performant language processing.
摘要：大型语言模型 (LLM) 已被证明是人类语言系统的有效模型，其中一些模型可以预测当前数据集中大脑活动的最可解释的方差。即使在未经训练的模型中，由架构先验引起的表示也可以表现出与大脑数据的合理对齐。在这项工作中，我们研究了驱动未经训练模型令人惊讶的对齐的关键架构组件。为了估计 LLM 与大脑的相似性，我们首先在 LLM 中选择语言选择性单元，类似于神经科学家识别人脑中语言网络的方式。然后，我们在五个不同的大脑记录数据集中对这些 LLM 单元的大脑对齐进行基准测试。通过隔离 Transformer 架构的关键组件，我们将标记化策略和多头注意力确定为驱动大脑对齐的两个主要组件。简单的递归形式进一步改善了对齐。我们通过重现语言神经科学领域的里程碑式研究进一步证明了我们模型的这种定量大脑对齐，表明局部模型单元——就像人类大脑中经验测量的语言体素一样——能够更可靠地区分词汇差异而非句法差异，并且在相同的实验条件下表现出相似的响应曲线。最后，我们展示了我们的模型表示对语言建模的实用性，与同类架构相比，实现了更高的样本和参数效率。我们模型的意外估计在行为与人类阅读时间的对齐方面树立了新的先进水平。总之，我们提出了一个高度大脑和行为对齐的模型，将人类语言系统概念化为未经训练的浅层特征编码器，具有结构先验，并结合训练有素的解码器以实现高效和高性能的语言处理。

Title: On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey

Authors: Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, Haobo Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.15126
Pdf URL: https://arxiv.org/pdf/2406.15126
Copy Paste: [[2406.15126]] On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey(https://arxiv.org/abs/2406.15126)
Keywords: language model, llm
Abstract: Within the evolving landscape of deep learning, the dilemma of data quantity and quality has been a long-standing problem. The recent advent of Large Language Models (LLMs) offers a data-centric solution to alleviate the limitations of real-world data with synthetic data generation. However, current investigations into this field lack a unified framework and mostly stay on the surface. Therefore, this paper provides an organization of relevant studies based on a generic workflow of synthetic data generation. By doing so, we highlight the gaps within existing research and outline prospective avenues for future study. This work aims to shepherd the academic and industrial communities towards deeper, more methodical inquiries into the capabilities and applications of LLMs-driven synthetic data generation.
摘要：在不断发展的深度学习领域中，数据数量和质量的困境一直是一个长期存在的问题。最近出现的大型语言模型 (LLM) 提供了一种以数据为中心的解决方案，以缓解合成数据生成对现实世界数据的局限性。然而，目前对该领域的研究缺乏统一的框架，大多停留在表面。因此，本文基于合成数据生成的通用工作流程对相关研究进行了组织。通过这样做，我们突出了现有研究中的差距并概述了未来研究的预期途径。这项工作旨在引导学术界和工业界对 LLM 驱动的合成数据生成的功能和应用进行更深入、更有条理的研究。

Title: Assessing Good, Bad and Ugly Arguments Generated by ChatGPT: a New Dataset, its Methodology and Associated Tasks

Authors: Victor Hugo Nascimento Rocha, Igor Cataneo Silveira, Paulo Pirozelli, Denis Deratani Mauá, Fabio Gagliardi Cozman
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.15130
Pdf URL: https://arxiv.org/pdf/2406.15130
Copy Paste: [[2406.15130]] Assessing Good, Bad and Ugly Arguments Generated by ChatGPT: a New Dataset, its Methodology and Associated Tasks(https://arxiv.org/abs/2406.15130)
Keywords: language model, gpt, llm, chat
Abstract: The recent success of Large Language Models (LLMs) has sparked concerns about their potential to spread misinformation. As a result, there is a pressing need for tools to identify "fake arguments" generated by such models. To create these tools, examples of texts generated by LLMs are needed. This paper introduces a methodology to obtain good, bad and ugly arguments from argumentative essays produced by ChatGPT, OpenAI's LLM. We then describe a novel dataset containing a set of diverse arguments, ArGPT. We assess the effectiveness of our dataset and establish baselines for several argumentation-related tasks. Finally, we show that the artificially generated data relates well to human argumentation and thus is useful as a tool to train and test systems for the defined tasks.
摘要：大型语言模型 (LLM) 最近的成功引发了人们对其传播错误信息的可能性的担忧。因此，迫切需要一些工具来识别此类模型生成的“假论点”。要创建这些工具，需要由 LLM 生成的文本示例。本文介绍了一种从 OpenAI 的 LLM ChatGPT 生成的论证文章中获取好、坏和丑陋论点的方法。然后，我们描述了一个包含一组不同论点的新数据集 ArGPT。我们评估了数据集的有效性并为几个与论证相关的任务建立了基线。最后，我们表明人工生成的数据与人类论证有很好的相关性，因此可用作训练和测试定义任务系统的工具。

Title: Enhancing Idiomatic Representation in Multiple Languages via an Adaptive Contrastive Triplet Loss

Authors: Wei He, Marco Idiart, Carolina Scarton, Aline Villavicencio
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.15175
Pdf URL: https://arxiv.org/pdf/2406.15175
Copy Paste: [[2406.15175]] Enhancing Idiomatic Representation in Multiple Languages via an Adaptive Contrastive Triplet Loss(https://arxiv.org/abs/2406.15175)
Keywords: language model
Abstract: Accurately modeling idiomatic or non-compositional language has been a longstanding challenge in Natural Language Processing (NLP). This is partly because these expressions do not derive their meanings solely from their constituent words, and partly because of the scarcity of relevant data resources and their impact on the performance of downstream tasks such as machine translation and simplification. In this paper we propose an approach to model idiomaticity effectively using a triplet loss that incorporates the asymmetric contribution of component words to an idiomatic meaning. To train language models, we use adaptive contrastive learning and resampling miners to build an idiomatic-aware learning objective. Our proposed method is evaluated on a SemEval challenge and outperforms previous alternatives significantly in many metrics.
摘要：准确地对惯用语言或非组合语言进行建模一直是自然语言处理 (NLP) 领域的长期挑战。部分原因是这些表达的含义并非仅来自其组成词，还因为相关数据资源的稀缺性，以及它们对机器翻译和简化等下游任务性能的影响。在本文中，我们提出了一种使用三重损失来有效地对惯用性进行建模的方法，该方法将组成词对惯用意义的不对称贡献结合起来，通过使用自适应对比学习和重采样挖掘器来构建惯用感知学习目标，以训练语言模型。我们提出的方法在 SemEval 挑战中进行了评估，在许多指标上都远远优于以前的替代方案。
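For readers unfamiliar with the triplet loss that the abstract above builds on, here is a minimal sketch of the standard (non-adaptive) form. It is not the authors' adaptive variant, whose details the abstract does not give. The function names, the Euclidean distance, and the margin value are illustrative assumptions.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: pull the positive closer to the anchor than
    the negative by at least `margin` (margin value is an assumption)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Hypothetical 2-d embeddings, for illustration only.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])   # negative is far away
hard = triplet_loss([0.0, 0.0], [0.0, 1.0], [0.0, 1.5])   # negative is too close
```

In the first call the negative already lies beyond the margin, so the loss is zero; in the second it does not, so the loss is positive.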

Title: Hybrid Alignment Training for Large Language Models

Authors: Chenglong Wang, Hang Zhou, Kaiyan Chang, Bei Li, Yongyu Mu, Tong Xiao, Tongran Liu, Jingbo Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.15178
Pdf URL: https://arxiv.org/pdf/2406.15178
Copy Paste: [[2406.15178]] Hybrid Alignment Training for Large Language Models(https://arxiv.org/abs/2406.15178)
Keywords: language model, llm
Abstract: Alignment training is crucial for enabling large language models (LLMs) to cater to human intentions and preferences. It is typically performed based on two stages with different objectives: instruction-following alignment and human-preference alignment. However, aligning LLMs with these objectives in sequence suffers from an inherent problem: the objectives may conflict, and the LLMs cannot guarantee to simultaneously align with the instructions and human preferences well. To respond to these, in this work, we propose a Hybrid Alignment Training (Hbat) approach, based on alternating alignment and modified elastic weight consolidation methods. The basic idea is to alternate between different objectives during alignment training, so that better collaboration can be achieved between the two alignment tasks. We experiment with Hbat on summarization and dialogue tasks. Experimental results show that the proposed Hbat can significantly outperform all baselines. Notably, Hbat yields consistent performance gains over the traditional two-stage alignment training when using both proximal policy optimization and direct preference optimization.
摘要：对齐训练对于使大型语言模型 (LLM) 能够满足人类的意图和偏好至关重要。它通常基于具有不同目标的两个阶段执行：指令遵循对齐和人类偏好对齐。然而，按顺序将 LLM 与这些目标对齐存在一个固有问题：目标可能会发生冲突，并且 LLM 无法保证同时与指令和人类偏好很好地对齐。为了解决这些问题，在这项工作中，我们提出了一种混合对齐训练 (Hbat) 方法，该方法基于交替对齐和改进的弹性权重合并方法。基本思想是在对齐训练期间在不同目标之间交替，以便在两个对齐任务之间实现更好的协作。我们在总结和对话任务上尝试了 Hbat。实验结果表明，提出的 \textsc{Hbat} 可以显著优于所有基线。值得注意的是，当同时使用近端策略优化和直接偏好优化时，Hbat 比传统的两阶段对齐训练具有一致的性能提升。
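The two ingredients named in the abstract above, alternating objectives and an elastic weight consolidation (EWC) penalty, can be sketched as follows. This uses the standard quadratic EWC penalty, not the paper's modified version, which the abstract does not describe. The function names, the strict every-other-step schedule, and the toy numbers are all assumptions.

```python
def alternating_schedule(num_steps):
    """Assumed schedule: switch objective on every step."""
    return ["instruction" if s % 2 == 0 else "preference" for s in range(num_steps)]

def ewc_penalty(params, anchor_params, fisher, lam=1.0):
    """Standard EWC penalty: 0.5 * lam * sum_i F_i * (theta_i - theta*_i)^2.
    It discourages moving parameters that mattered for the other objective."""
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )

# Toy values, for illustration only.
schedule = alternating_schedule(4)
penalty = ewc_penalty([1.0, 2.0], [0.0, 2.0], [2.0, 5.0])
```

Only the first parameter moved from its anchor, so only its Fisher weight contributes to the penalty.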

Title: Reward Steering with Evolutionary Heuristics for Decoding-time Alignment

Authors: Chia-Yu Hung, Navonil Majumder, Ambuj Mehrish, Soujanya Poria
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.15193
Pdf URL: https://arxiv.org/pdf/2406.15193
Copy Paste: [[2406.15193]] Reward Steering with Evolutionary Heuristics for Decoding-time Alignment(https://arxiv.org/abs/2406.15193)
Keywords: llm
Abstract: The widespread applicability and increasing omnipresence of LLMs have instigated a need to align LLM responses to user and stakeholder preferences. Many preference optimization approaches have been proposed that fine-tune LLM parameters to achieve good alignment. However, such parameter tuning is known to interfere with model performance on many tasks. Moreover, keeping up with shifting user preferences is tricky in such a situation. Decoding-time alignment with reward model guidance solves these issues at the cost of increased inference time. However, most such methods fail to strike the right balance between exploration and exploitation of reward, often due to the conflated formulation of these two aspects, and so fail to give well-aligned responses. To remedy this we decouple these two aspects and implement them in an evolutionary fashion: exploration is enforced by decoding from mutated instructions and exploitation is represented as the periodic replacement of poorly-rewarded generations with well-rewarded ones. Empirical evidence indicates that this strategy outperforms many preference optimization and decode-time alignment approaches on two widely accepted alignment benchmarks, AlpacaEval 2 and MT-Bench. Our implementation will be available at: this https URL.
摘要：LLM 的广泛适用性和日益普遍的存在促使人们需要将 LLM 响应与用户和利益相关者的偏好保持一致。已经提出了许多偏好优化方法，这些方法可以微调 LLM 参数以实现良好的一致性。然而，众所周知，这种参数调整会干扰许多任务上的模型性能。此外，在这种情况下，跟上不断变化的用户偏好是很棘手的。使用奖励模型指导的解码时间对齐解决了这些问题，但代价是增加推理时间。然而，大多数此类方法都未能在奖励的探索和利用之间取得适当的平衡——通常是由于这两个方面的表述混为一谈——以给出良好一致的响应。为了解决这个问题，我们将这两个方面分离并以进化的方式实现它们：通过从变异指令解码来强制探索，而利用则表示为定期用奖励丰厚的代际替换奖励较少的代际。经验证据表明，该策略在两个广为接受的对齐基准 AlpacaEval 2 和 MT-Bench 上优于许多偏好优化和解码时间对齐方法。我们的实现将在以下网址提供：此 https URL。
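The explore/exploit split described in the abstract above can be illustrated with a toy loop. This is a sketch under stated assumptions, not the authors' implementation: `mutate`, `extend`, and `reward` are hypothetical stand-ins for an instruction mutator, one decoding step of an LLM, and a reward model, and the population size and replacement period are arbitrary.

```python
import random

def evolve_decode(instruction, mutate, extend, reward,
                  pop_size=4, steps=6, period=2, seed=0):
    rng = random.Random(seed)
    # Exploration: each candidate decodes from its own mutated instruction.
    pop = [(mutate(instruction, rng), "") for _ in range(pop_size)]
    for step in range(1, steps + 1):
        # Extend every partial generation by one chunk.
        pop = [(ins, text + extend(ins, text, rng)) for ins, text in pop]
        if step % period == 0:
            # Exploitation: keep the best half and fill the remaining
            # slots with copies of the top-rewarded candidates.
            pop.sort(key=lambda c: reward(c[1]), reverse=True)
            half = pop_size // 2
            pop = pop[:half] + pop[:pop_size - half]
    return max(pop, key=lambda c: reward(c[1]))[1]

# Toy stand-ins (assumptions): "tokens" are digits, reward is their sum.
def toy_mutate(ins, rng):
    return ins + " v" + str(rng.randint(0, 9))

def toy_extend(ins, text, rng):
    return str(rng.randint(0, 9))

def toy_reward(text):
    return sum(int(ch) for ch in text)

best = evolve_decode("Summarize the text.", toy_mutate, toy_extend, toy_reward)
```

With a fixed seed the loop is deterministic, which makes the effect of the replacement period easy to inspect.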

Title: How Effective is GPT-4 Turbo in Generating School-Level Questions from Textbooks Based on Bloom's Revised Taxonomy?

Authors: Subhankar Maity, Aniket Deroy, Sudeshna Sarkar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.15211
Pdf URL: https://arxiv.org/pdf/2406.15211
Copy Paste: [[2406.15211]] How Effective is GPT-4 Turbo in Generating School-Level Questions from Textbooks Based on Bloom's Revised Taxonomy?(https://arxiv.org/abs/2406.15211)
Keywords: gpt
Abstract: We evaluate the effectiveness of GPT-4 Turbo in generating educational questions from NCERT textbooks in zero-shot mode. Our study highlights GPT-4 Turbo's ability to generate questions that require higher-order thinking skills, especially at the "understanding" level according to Bloom's Revised Taxonomy. While we find a notable consistency between questions generated by GPT-4 Turbo and those assessed by humans in terms of complexity, there are occasional differences. Our evaluation also uncovers variations in how humans and machines evaluate question quality, with a trend inversely related to Bloom's Revised Taxonomy levels. These findings suggest that while GPT-4 Turbo is a promising tool for educational question generation, its efficacy varies across different cognitive levels, indicating a need for further refinement to fully meet educational standards.
摘要：我们评估了 GPT-4 Turbo 在零样本模式下从 NCERT 教科书中生成教育问题的有效性。我们的研究强调了 GPT-4 Turbo 生成需要高阶思维技能的问题的能力，尤其是在根据布鲁姆修订分类法的“理解”层面。虽然我们发现 GPT-4 Turbo 生成的问题与人类评估的问题在复杂性方面存在显著的一致性，但偶尔也存在差异。我们的评估还揭示了人类和机器评估问题质量的方式存在差异，其趋势与布鲁姆修订分类法的水平成反比。这些发现表明，虽然 GPT-4 Turbo 是一种很有前途的教育问题生成工具，但它的功效在不同的认知水平上有所不同，这表明需要进一步改进才能完全满足教育标准。

Title: Unsupervised Extraction of Dialogue Policies from Conversations

Authors: Makesh Narsimhan Sreedhar, Traian Rebedea, Christopher Parisien
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.15214
Pdf URL: https://arxiv.org/pdf/2406.15214
Copy Paste: [[2406.15214]] Unsupervised Extraction of Dialogue Policies from Conversations(https://arxiv.org/abs/2406.15214)
Keywords: language model, llm, prompt
Abstract: Dialogue policies play a crucial role in developing task-oriented dialogue systems, yet their development and maintenance are challenging and typically require substantial effort from experts in dialogue modeling. While in many situations, large amounts of conversational data are available for the task at hand, people lack an effective solution able to extract dialogue policies from this data. In this paper, we address this gap by first illustrating how Large Language Models (LLMs) can be instrumental in extracting dialogue policies from datasets, through the conversion of conversations into a unified intermediate representation consisting of canonical forms. We then propose a novel method for generating dialogue policies utilizing a controllable and interpretable graph-based methodology. By combining canonical forms across conversations into a flow network, we find that running graph traversal algorithms helps in extracting dialogue flows. These flows are a better representation of the underlying interactions than flows extracted by prompting LLMs. Our technique focuses on giving conversation designers greater control, offering a productivity tool to improve the process of developing dialogue policies.
摘要：对话策略在开发面向任务的对话系统中起着至关重要的作用，但它们的开发和维护具有挑战性，通常需要对话建模专家付出大量努力。虽然在许多情况下，手头的任务有大量的对话数据可用，但人们缺乏能够从这些数据中提取对话策略的有效解决方案。在本文中，我们首先说明大型语言模型 (LLM) 如何通过将对话转换为由规范形式组成的统一中间表示，帮助从数据集中提取对话策略，从而解决这一差距。然后，我们提出了一种利用可控制和可解释的基于图的方法生成对话策略的新方法。通过将对话中的规范形式组合成流网络，我们发现运行图遍历算法有助于提取对话流。与通过提示 LLM 提取的流相比，这些流更好地表示了底层交互。我们的技术侧重于赋予对话设计者更大的控制权，提供一种生产力工具来改进对话策略的开发过程。
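As a rough illustration of combining canonical forms into a flow network and traversing it, the sketch below counts transitions between consecutive canonical forms and then greedily follows the most frequent edge. The greedy traversal, the function names, and the example canonical forms are assumptions; the abstract does not specify which graph algorithms the authors run.

```python
from collections import Counter, defaultdict

def build_flow_graph(conversations):
    """Count transitions between consecutive canonical forms."""
    edges = defaultdict(Counter)
    for conv in conversations:
        for src, dst in zip(conv, conv[1:]):
            edges[src][dst] += 1
    return edges

def most_common_flow(edges, start, max_len=10):
    """Greedy traversal: always follow the most frequent outgoing edge,
    stopping at a dead end, a repeated node, or max_len."""
    flow, node = [start], start
    while node in edges and len(flow) < max_len:
        node = edges[node].most_common(1)[0][0]
        if node in flow:
            break
        flow.append(node)
    return flow

# Hypothetical conversations already converted to canonical forms.
conversations = [
    ["greet", "ask order id", "give status", "bye"],
    ["greet", "ask order id", "offer refund", "bye"],
    ["greet", "ask order id", "give status", "bye"],
]
flow = most_common_flow(build_flow_graph(conversations), "greet")
```

Here the majority path through "give status" is recovered as the dominant flow.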

Title: A LLM-Based Ranking Method for the Evaluation of Automatic Counter-Narrative Generation

Authors: Irune Zubiaga, Aitor Soroa, Rodrigo Agerri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.15227
Pdf URL: https://arxiv.org/pdf/2406.15227
Copy Paste: [[2406.15227]] A LLM-Based Ranking Method for the Evaluation of Automatic Counter-Narrative Generation(https://arxiv.org/abs/2406.15227)
Keywords: language model, llm, chat
Abstract: The proliferation of misinformation and harmful narratives in online discourse has underscored the critical need for effective Counter Narrative (CN) generation techniques. However, existing automatic evaluation methods often lack interpretability and fail to capture the nuanced relationship between generated CNs and human perception. Aiming to achieve a higher correlation with human judgments, this paper proposes a novel approach to assess generated CNs that consists of using a Large Language Model (LLM) as an evaluator. By comparing generated CNs pairwise in a tournament-style format, we establish a model ranking pipeline that achieves a correlation of 0.88 with human preference. As an additional contribution, we leverage LLMs as zero-shot (ZS) CN generators and conduct a comparative analysis of chat, instruct, and base models, exploring their respective strengths and limitations. Through meticulous evaluation, including fine-tuning experiments, we elucidate the differences in performance and responsiveness to domain-specific data. We conclude that chat-aligned models in ZS are the best option for carrying out the task, provided they do not refuse to generate an answer due to security concerns.
摘要：在线话语中错误信息和有害叙述的泛滥凸显了对有效的反叙述 (CN) 生成技术的迫切需求。然而，现有的自动评估方法往往缺乏可解释性，无法捕捉生成的 CN 与人类感知之间的细微关系。为了实现与人类判断的更高相关性，本文提出了一种评估生成的 CN 的新方法，该方法包括使用大型语言模型 (LLM) 作为评估器。通过以锦标赛风格的形式成对比较生成的 CN，我们建立了一个模型排名管道，该管道与人类偏好的相关性达到 0.88。作为额外的贡献，我们利用 LLM 作为零样本 (ZS) CN 生成器并对聊天、指导和基础模型进行比较分析，探索它们各自的优势和局限性。通过细致的评估，包括微调实验，我们阐明了性能和对特定领域数据的响应能力的差异。我们得出结论，只要 ZS 中的聊天对齐模型不会因为安全问题而拒绝生成答案，那么它们就是执行该任务的最佳选择。
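A tournament-style pairwise ranking of the kind described above might look like the following round-robin sketch. The `judge` callable is a hypothetical stand-in for the LLM evaluator (here it simply prefers the longer text), and ranking by win count is an assumption; the abstract does not state the exact tournament rules.

```python
from itertools import combinations

def rank_by_tournament(outputs, judge):
    """Round-robin: every pair of systems is compared once and systems
    are ranked by number of wins. judge(x, y) returns True if x is preferred."""
    wins = {name: 0 for name in outputs}
    for a, b in combinations(outputs, 2):
        winner = a if judge(outputs[a], outputs[b]) else b
        wins[winner] += 1
    return sorted(wins, key=lambda name: wins[name], reverse=True)

# Hypothetical counter-narratives from three systems.
outputs = {
    "m1": "short",
    "m2": "a much longer reply",
    "m3": "medium one",
}
ranking = rank_by_tournament(outputs, judge=lambda x, y: len(x) >= len(y))
```

Swapping the lambda for a call to an actual LLM judge would leave the ranking logic unchanged.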

Title: Detecting Synthetic Lyrics with Few-Shot Inference

Authors: Yanis Labrak, Gabriel Meseguer-Brocal, Elena V. Epure
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.15231
Pdf URL: https://arxiv.org/pdf/2406.15231
Copy Paste: [[2406.15231]] Detecting Synthetic Lyrics with Few-Shot Inference(https://arxiv.org/abs/2406.15231)
Keywords: language model, llm
Abstract: In recent years, generated content in music has gained significant popularity, with large language models being effectively utilized to produce human-like lyrics in various styles, themes, and linguistic structures. This technological advancement supports artists in their creative processes but also raises issues of authorship infringement, consumer satisfaction and content spamming. To address these challenges, methods for detecting generated lyrics are necessary. However, existing works on machine-generated content detection methods and datasets have not yet focused on this specific modality or on creative text in general. In response, we have curated the first dataset of high-quality synthetic lyrics and conducted a comprehensive quantitative evaluation of various few-shot content detection approaches, testing their generalization capabilities and complementing this with a human evaluation. Our best few-shot detector, based on LLM2Vec, surpasses stylistic and statistical methods, which have been shown to be competitive in other domains at distinguishing human-written from machine-generated content. It also shows good generalization capabilities to new artists and models, and effectively detects post-generation paraphrasing. This study emphasizes the need for further research on creative content detection, particularly in terms of generalization and scalability with larger song catalogs. All datasets, pre-processing scripts, and code are available publicly on GitHub and Hugging Face under the Apache 2.0 license.
摘要：近年来，音乐中的生成内容越来越受欢迎，大型语言模型被有效地用于生成各种风格、主题和语言结构的类似人类的歌词。这项技术进步支持艺术家的创作过程，但也引发了侵犯作者权、消费者满意度和内容垃圾邮件的问题。为了应对这些挑战，检测生成歌词的方法是必不可少的。然而，现有的研究还没有关注这种特定的模式，也没有关注机器生成内容检测方法和数据集的一般创意文本。为此，我们整理了第一个高质量的合成歌词数据集，并对各种少样本内容检测方法进行了全面的定量评估，测试了它们的泛化能力，并辅以人工评估。我们最好的少样本检测器基于 LLM2Vec，超越了风格和统计方法，这些方法在其他领域表现出色，可以区分人类编写的内容和机器生成的内容。它还对新艺术家和模型表现出良好的泛化能力，并有效地检测生成后的释义。这项研究强调了对创意内容检测进行进一步研究的必要性，特别是在歌曲目录规模较大的情况下的通用性和可扩展性方面。所有数据集、预处理脚本和代码均可在 GitHub 和 Hugging Face 上根据 Apache 2.0 许可公开获取。

Title: Unsupervised Morphological Tree Tokenizer

Authors: Qingyang Zhu, Xiang Hu, Pengyu Ji, Wei Wu, Kewei Tu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.15245
Pdf URL: https://arxiv.org/pdf/2406.15245
Copy Paste: [[2406.15245]] Unsupervised Morphological Tree Tokenizer(https://arxiv.org/abs/2406.15245)
Keywords: language model
Abstract: As a cornerstone in language modeling, tokenization involves segmenting text inputs into pre-defined atomic units. Conventional statistical tokenizers often disrupt constituent boundaries within words, thereby corrupting semantic information. To address this drawback, we introduce morphological structure guidance to tokenization and propose a deep model to induce character-level structures of words. Specifically, the deep model jointly encodes internal structures and representations of words with a mechanism named MorphOverriding to ensure the indecomposability of morphemes. By training the model with self-supervised objectives, our method is capable of inducing character-level structures that align with morphological rules without annotated training data. Based on the induced structures, our algorithm tokenizes words through vocabulary matching in a top-down manner. Empirical results indicate that the proposed method effectively retains complete morphemes and outperforms widely adopted methods such as BPE and WordPiece on both morphological segmentation tasks and language modeling tasks. The code will be released later.
摘要：作为语言建模的基石，分词涉及将文本输入分割成预定义的原子单元。传统的统计分词器经常会破坏单词内的成分边界，从而破坏语义信息。为了解决这一缺点，我们引入了形态结构指导来分词，并提出了一个深度模型来归纳单词的字符级结构。具体来说，深度模型使用名为 $\textit{MorphOverriding}$ 的机制联合编码单词的内部结构和表示，以确保词素的不可分解性。通过使用自监督目标训练模型，我们的方法能够在没有注释训练数据的情况下归纳出符合形态规则的字符级结构。基于归纳的结构，我们的算法以自上而下的方式通过词汇匹配对单词进行分词。实证结果表明，所提出的方法有效地保留了完整的词素，并且在形态分割任务和语言建模任务上均优于 BPE 和 WordPiece 等广泛采用的方法。代码将在稍后发布。

Title: Evaluating Diversity in Automatic Poetry Generation

Authors: Yanran Chen, Hannes Gröner, Sina Zarrieß, Steffen Eger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.15267
Pdf URL: https://arxiv.org/pdf/2406.15267
Copy Paste: [[2406.15267]] Evaluating Diversity in Automatic Poetry Generation(https://arxiv.org/abs/2406.15267)
Keywords: llm
Abstract: Natural Language Generation (NLG), and more generally generative AI, are among the currently most impactful research fields. Creative NLG, such as automatic poetry generation, is a fascinating niche in this area. While most previous research has focused on forms of the Turing test when evaluating automatic poetry generation (can humans distinguish between automatic and human generated poetry?), we evaluate the diversity of automatically generated poetry, by comparing distributions of generated poetry to distributions of human poetry along structural, lexical, semantic and stylistic dimensions, assessing different model types (word vs. character-level, general purpose LLMs vs. poetry-specific models), including the very recent LLaMA3, and types of fine-tuning (conditioned vs. unconditioned). We find that current automatic poetry systems are considerably underdiverse along multiple dimensions: they often do not rhyme sufficiently, are semantically too uniform and do not even match the length distribution of human poetry. Our experiments reveal, however, that style-conditioning and character-level modeling clearly increase diversity across virtually all dimensions we explore. Our identified limitations may serve as the basis for more genuinely diverse future poetry generation models.
摘要：自然语言生成 (NLG) 和更一般的生成式 AI 是目前最具影响力的研究领域之一。创意 NLG（例如自动诗歌生成）是该领域中一个令人着迷的领域。虽然大多数先前的研究都集中在评估自动诗歌生成时的图灵测试形式上——人类能否区分自动生成的诗歌和人类生成的诗歌——但我们通过比较生成的诗歌与人类诗歌在结构、词汇、语义和风格维度上的分布来评估自动生成的诗歌的多样性，评估不同的模型类型（单词与字符级别、通用 LLM 与诗歌特定模型），包括最近的 LLaMA3，以及微调类型（条件与非条件）。我们发现当前的自动诗歌系统在多个维度上都相当缺乏多样性——它们通常押韵不够，语义上过于统一，甚至与人类诗歌的长度分布不匹配。然而，我们的实验表明，风格条件和字符级建模显然增加了我们探索的几乎所有维度的多样性。我们发现的局限性可以作为未来更加真正多样化的诗歌生成模型的基础。

Title: Cognitive Map for Language Models: Optimal Planning via Verbally Representing the World Model

Authors: Doyoung Kim, Jongwon Lee, Jinho Park, Minjoon Seo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.15275
Pdf URL: https://arxiv.org/pdf/2406.15275
Copy Paste: [[2406.15275]] Cognitive Map for Language Models: Optimal Planning via Verbally Representing the World Model(https://arxiv.org/abs/2406.15275)
Keywords: language model
Abstract: Language models have demonstrated impressive capabilities across various natural language processing tasks, yet they struggle with planning tasks requiring multi-step simulations. Inspired by human cognitive processes, this paper investigates the optimal planning power of language models that can construct a cognitive map of a given environment. Our experiments demonstrate that a cognitive map significantly enhances both optimal and reachable planning generation ability in the Gridworld path planning task. We observe that our method showcases two key characteristics similar to human cognition: generalization of its planning ability to extrapolated environments and rapid adaptation with limited training data. We hope our findings in the Gridworld task provide insights into modeling human cognitive processes in language models, potentially leading to the development of more advanced and robust systems that better resemble human cognition.
摘要：语言模型在各种自然语言处理任务中表现出令人印象深刻的能力，但它们在需要多步骤模拟的规划任务中却举步维艰。受人类认知过程的启发，本文研究了可以构建给定环境的认知图的语言模型的最佳规划能力。我们的实验表明，认知图显著提高了 Gridworld 路径规划任务中最佳和可达规划生成能力的性能。我们观察到我们的方法展示了与人类认知相似的两个关键特征：\textbf{将其规划能力推广到推断环境并在有限的训练数据下快速适应。}我们希望我们在 Gridworld 任务中的发现能够为在语言模型中建模人类认知过程提供见解，从而有可能导致开发出更先进、更强大的系统，更好地模拟人类认知。

Title: NLP-KG: A System for Exploratory Search of Scientific Literature in Natural Language Processing

Authors: Tim Schopf, Florian Matthes
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.15294
Pdf URL: https://arxiv.org/pdf/2406.15294
Copy Paste: [[2406.15294]] NLP-KG: A System for Exploratory Search of Scientific Literature in Natural Language Processing(https://arxiv.org/abs/2406.15294)
Keywords: chat
Abstract: Scientific literature searches are often exploratory, whereby users are not yet familiar with a particular field or concept but are interested in learning more about it. However, existing systems for scientific literature search are typically tailored to keyword-based lookup searches, limiting the possibilities for exploration. We propose NLP-KG, a feature-rich system designed to support the exploration of research literature in unfamiliar natural language processing (NLP) fields. In addition to a semantic search, NLP-KG allows users to easily find survey papers that provide a quick introduction to a field of interest. Further, a Fields of Study hierarchy graph enables users to familiarize themselves with a field and its related areas. Finally, a chat interface allows users to ask questions about unfamiliar concepts or specific articles in NLP and obtain answers grounded in knowledge retrieved from scientific publications. Our system provides users with comprehensive exploration possibilities, supporting them in investigating the relationships between different fields, understanding unfamiliar concepts in NLP, and finding relevant research literature. Demo, video, and code are available at: this https URL.
摘要：科学文献搜索通常是探索性的，用户尚不熟悉某个特定领域或概念，但有兴趣了解更多信息。然而，现有的科学文献搜索系统通常针对基于关键字的查找搜索，限制了探索的可能性。我们提出了 NLP-KG，这是一个功能丰富的系统，旨在支持探索不熟悉的自然语言处理 (NLP) 领域的研究文献。除了语义搜索之外，NLP-KG 还允许用户轻松找到提供感兴趣领域快速介绍的调查论文。此外，研究领域层次结构图使用户能够熟悉某个领域及其相关领域。最后，聊天界面允许用户询问有关 NLP 中不熟悉的概念或特定文章的问题，并获得基于从科学出版物中检索到的知识的答案。我们的系统为用户提供了全面的探索可能性，支持他们调查不同领域之间的关系，理解 NLP 中不熟悉的概念，并查找相关的研究文献。演示、视频和代码可在以下网址获得：此 https URL。

Title: LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs

Authors: Ziyan Jiang, Xueguang Ma, Wenhu Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs(https://arxiv.org/abs/)
Keywords: llm, retrieval-augmented generation
Abstract: In the traditional RAG framework, the basic retrieval units are normally short. Common retrievers like DPR normally work with 100-word Wikipedia paragraphs. Such a design forces the retriever to search over a large corpus to find the 'needle' unit. In contrast, the readers only need to extract answers from the short retrieved units. Such an imbalanced 'heavy' retriever and 'light' reader design can lead to sub-optimal performance. In order to alleviate the imbalance, we propose a new framework, LongRAG, consisting of a 'long retriever' and a 'long reader'. LongRAG processes the entire Wikipedia into 4K-token units, which is 30x longer than before. By increasing the unit size, we significantly reduce the total units from 22M to 700K. This significantly lowers the burden on the retriever, which leads to a remarkable retrieval score: answer recall@1=71% on NQ (previously 52%) and answer recall@2=72% (previously 47%) on HotpotQA (full-wiki). Then we feed the top-k retrieved units (≈ 30K tokens) to an existing long-context LLM to perform zero-shot answer extraction. Without requiring any training, LongRAG achieves an EM of 62.7% on NQ, which is the best known result. LongRAG also achieves 64.3% on HotpotQA (full-wiki), which is on par with the SoTA model. Our study offers insights into the future roadmap for combining RAG with long-context LLMs.
摘要：在传统的 RAG 框架中，基本检索单元通常较短。常见的检索器（如 DPR）通常处理 100 字的维基百科段落。这样的设计迫使检索器搜索大型语料库以找到“针”单元。相比之下，读者只需从较短的检索单元中提取答案。这种不平衡的“重”检索器和“轻”阅读器设计会导致性能不佳。为了缓解这种不平衡，我们提出了一个由“长”检索器和“长”阅读器组成的新框架 LongRAG。LongRAG 将整个维基百科处理成 4K 标记单元，比以前长 30 倍。通过增加单元大小，我们将总单元从 22M 显著减少到 700K。这大大降低了检索者的负担，从而获得了显著的检索分数：NQ 上的答案召回率@1=71%（之前为 52%），HotpotQA（完整维基）上的答案召回率@2=72%（之前为 47%）。然后，我们将前 k 个检索到的单元（约 30K 个标记）输入到现有的长上下文 LLM 中以执行零样本答案提取。无需任何训练，LongRAG 在 NQ 上实现了 62.7% 的 EM，这是已知的最佳结果。LongRAG 在 HotpotQA（完整维基）上也实现了 64.3%，与 SoTA 模型相当。我们的研究为将 RAG 与长上下文 LLM 相结合的未来路线图提供了见解。
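To make the unit-size arithmetic in the abstract above concrete, the sketch below greedily packs short passages into long units under a token budget and checks the reported reduction ratio. Greedy packing in corpus order and whitespace token counting are simplifying assumptions; the abstract does not say how the 4K-token units are actually formed.

```python
def group_into_long_units(passages, max_tokens=4000):
    """Greedily pack consecutive passages into units of at most max_tokens.
    Whitespace splitting is used as a crude token count (an assumption)."""
    units, current, size = [], [], 0
    for passage in passages:
        n = len(passage.split())
        if current and size + n > max_tokens:
            units.append(" ".join(current))
            current, size = [], 0
        current.append(passage)
        size += n
    if current:
        units.append(" ".join(current))
    return units

# 100 synthetic passages of 100 "words" each, mimicking short retrieval units.
passages = ["word " * 100 for _ in range(100)]
units = group_into_long_units(passages)

# Reduction ratio reported in the abstract: 22M units down to 700K.
reduction = 22_000_000 / 700_000
```

The ratio comes out at roughly 31, which is consistent with the abstract's statement that the new units are about 30x longer.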

Title: A SMART Mnemonic Sounds like "Glue Tonic": Mixing LLMs with Student Feedback to Make Mnemonic Learning Stick

Authors: Nishant Balepur, Matthew Shu, Alexander Hoyle, Alison Robey, Shi Feng, Seraphina Goldfarb-Tarrant, Jordan Boyd-Graber
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] A SMART Mnemonic Sounds like "Glue Tonic": Mixing LLMs with Student Feedback to Make Mnemonic Learning Stick(https://arxiv.org/abs/)
Keywords: gpt, llm
Abstract: Keyword mnemonics are memorable explanations that link new terms to simpler keywords. Prior works generate mnemonics for students, but they do not guide models toward mnemonics that students prefer and that aid learning. We build SMART, a mnemonic generator trained on feedback from real students learning new terms. To train SMART, we first fine-tune LLaMA-2 on a curated set of user-written mnemonics. We then use LLM alignment to enhance SMART: we deploy mnemonics generated by SMART in a flashcard app to find preferences on mnemonics students favor. We gather 2684 preferences from 45 students across two types: expressed (inferred from ratings) and observed (inferred from student learning), yielding three key findings. First, expressed and observed preferences disagree; what students think is helpful does not fully capture what is truly helpful. Second, Bayesian models can synthesize complementary data from multiple preference types into a single effectiveness signal. SMART is tuned via Direct Preference Optimization on this signal, which we show resolves ties and missing labels in the typical method of pairwise comparisons, augmenting data for LLM output quality gains. Third, mnemonic experts assess SMART as matching GPT-4, at much lower deployment costs, showing the utility of capturing diverse student feedback to align LLMs in education.
摘要：关键词助记符是将新术语与更简单的关键词联系起来的令人难忘的解释。先前的研究为学生生成了助记符，但它们并没有引导模型走向学生喜欢并有助于学习的助记符。我们构建了 SMART，这是一个助记符生成器，根据真实学生学习新术语的反馈进行训练。为了训练 SMART，我们首先在一组精选的用户编写助记符上微调 LLaMA-2。然后，我们使用 LLM 对齐来增强 SMART：我们在抽认卡应用程序中部署由 SMART 生成的助记符，以找到学生喜欢的助记符偏好。我们从 45 名学生那里收集了 2684 个偏好，分为两种类型：表达的（从评分中推断）和观察到的（从学生学习中推断），得出三个关键发现。首先，表达的和观察到的偏好不一致；学生认为有用的并不能完全捕捉到真正有用的。其次，贝叶斯模型可以将来自多种偏好类型的互补数据合成为一个有效性信号。 SMART 通过直接偏好优化对此信号进行调整，我们表明它可以解决典型成对比较方法中的平局和缺失标签问题，从而增强 LLM 输出质量提升的数据。第三，助记专家评估 SMART 与 GPT-4 相当，但部署成本要低得多，这表明捕捉多样化学生反馈以协调 LLM 教育的实用性。
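For context on the Direct Preference Optimization step mentioned above, here is a minimal sketch of the standard per-pair DPO loss. It shows only the generic objective, not how SMART derives its preference pairs from the Bayesian effectiveness signal. The argument names, the `beta` value, and the toy log-probabilities are assumptions.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair. Inputs are sequence
    log-probabilities under the policy and the frozen reference model."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy log-probabilities, for illustration only.
at_reference = dpo_loss(-1.0, -2.0, -1.0, -2.0)   # policy equals reference
improved = dpo_loss(-0.5, -3.0, -1.0, -2.0)       # policy favors the chosen output more
```

When the policy matches the reference the loss equals log 2, and it falls as the policy shifts probability toward the preferred mnemonic.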