2024-09-06

Title: CLUE: Concept-Level Uncertainty Estimation for Large Language Models

Authors: Yu-Hsiang Wang, Andrew Bai, Che-Ping Tsai, Cho-Jui Hsieh
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2409.03021
Pdf URL: https://arxiv.org/pdf/2409.03021
Copy Paste: [[2409.03021]] CLUE: Concept-Level Uncertainty Estimation for Large Language Models(https://arxiv.org/abs/2409.03021)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in various natural language generation (NLG) tasks. Previous studies suggest that LLMs' generation process involves uncertainty. However, existing approaches to uncertainty estimation mainly focus on sequence-level uncertainty, overlooking individual pieces of information within sequences. These methods fall short in separately assessing the uncertainty of each component in a sequence. In response, we propose a novel framework for Concept-Level Uncertainty Estimation (CLUE) for LLMs. We leverage LLMs to convert output sequences into concept-level representations, breaking down sequences into individual concepts and measuring the uncertainty of each concept separately. We conduct experiments to demonstrate that CLUE can provide more interpretable uncertainty estimation results compared with sentence-level uncertainty, and could be a useful tool for various tasks such as hallucination detection and story generation.

Title: Oddballness: universal anomaly detection with language models

Authors: Filip Graliński, Ryszard Staruch, Krzysztof Jurkiewicz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.03046
Pdf URL: https://arxiv.org/pdf/2409.03046
Copy Paste: [[2409.03046]] Oddballness: universal anomaly detection with language models(https://arxiv.org/abs/2409.03046)
Keywords: language model
Abstract: We present a new method to detect anomalies in texts (in general: in sequences of any data), using language models, in a totally unsupervised manner. The method considers probabilities (likelihoods) generated by a language model, but instead of focusing on low-likelihood tokens, it considers a new metric introduced in this paper: oddballness. Oddballness measures how "strange" a given token is according to the language model. We demonstrate in grammatical error detection tasks (a specific case of text anomaly detection) that oddballness is better than just considering low-likelihood events, if a totally unsupervised setup is assumed.

Title: Debate on Graph: a Flexible and Reliable Reasoning Framework for Large Language Models

Authors: Jie Ma, Zhitao Gao, Qi Chai, Wangchun Sun, Pinghui Wang, Hongbin Pei, Jing Tao, Lingyun Song, Jun Liu, Chen Zhang, Lizhen Cui
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.03155
Pdf URL: https://arxiv.org/pdf/2409.03155
Copy Paste: [[2409.03155]] Debate on Graph: a Flexible and Reliable Reasoning Framework for Large Language Models(https://arxiv.org/abs/2409.03155)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) may suffer from hallucinations in real-world applications due to the lack of relevant knowledge. In contrast, knowledge graphs encompass extensive, multi-relational structures that store a vast array of symbolic facts. Consequently, integrating LLMs with knowledge graphs has been extensively explored, with Knowledge Graph Question Answering (KGQA) serving as a critical touchstone for the integration. This task requires LLMs to answer natural language questions by retrieving relevant triples from knowledge graphs. However, existing methods face two significant challenges: excessively long reasoning paths distracting from the answer generation, and false-positive relations hindering the path refinement. In this paper, we propose an iterative interactive KGQA framework that leverages the interactive learning capabilities of LLMs to perform reasoning and Debating over Graphs (DoG). Specifically, DoG employs a subgraph-focusing mechanism, allowing LLMs to attempt an answer after each reasoning step, thereby mitigating the impact of lengthy reasoning paths. On the other hand, DoG utilizes a multi-role debate team to gradually simplify complex questions, reducing the influence of false-positive relations. This debate mechanism ensures the reliability of the reasoning process. Experimental results on five public datasets demonstrate the effectiveness and superiority of our architecture. Notably, DoG outperforms the state-of-the-art method ToG by 23.7% and 9.1% in accuracy on WebQuestions and GrailQA, respectively. Furthermore, the integration experiments with various LLMs on the mentioned datasets highlight the flexibility of DoG. Code is available at this https URL.

Title: MaterialBENCH: Evaluating College-Level Materials Science Problem-Solving Abilities of Large Language Models

Authors: Michiko Yoshitake (1), Yuta Suzuki (2), Ryo Igarashi (1), Yoshitaka Ushiku (1), Keisuke Nagato (3) ((1) OMRON SINIC X, (2) Osaka Univ., (3) Univ. Tokyo)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.03161
Pdf URL: https://arxiv.org/pdf/2409.03161
Copy Paste: [[2409.03161]] MaterialBENCH: Evaluating College-Level Materials Science Problem-Solving Abilities of Large Language Models(https://arxiv.org/abs/2409.03161)
Keywords: language model, gpt, llm, chat
Abstract: A college-level benchmark dataset for large language models (LLMs) in the materials science field, MaterialBENCH, is constructed. This dataset consists of problem-answer pairs, based on university textbooks. There are two types of problems: one is the free-response answer type, and the other is the multiple-choice type. Multiple-choice problems are constructed by adding three incorrect answers as choices to a correct answer, so that LLMs can choose one of the four as a response. Most of the problems for free-response answer and multiple-choice types overlap except for the format of the answers. We also conduct experiments using the MaterialBENCH on LLMs, including ChatGPT-3.5, ChatGPT-4, Bard (at the time of the experiments), and GPT-3.5 and GPT-4 with the OpenAI API. The differences and similarities in the performance of LLMs measured by the MaterialBENCH are analyzed and discussed. Performance differences between the free-response type and multiple-choice type in the same models and the influence of using system messages on multiple-choice problems are also studied. We anticipate that MaterialBENCH will encourage further developments of LLMs in reasoning abilities to solve more complicated problems and eventually contribute to materials research and discovery.
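
The multiple-choice construction described in the MaterialBENCH abstract (one correct answer plus three incorrect ones, giving four options) can be sketched as follows. This is only an illustration under stated assumptions: the function name `make_multiple_choice`, the A-D labelling, the shuffling step, and the sample question are all hypothetical and are not taken from the paper or its dataset.

```python
import random


def make_multiple_choice(question, correct, distractors, rng=None):
    """Build a four-option item from one correct answer and three incorrect ones.

    Hypothetical helper for illustration; not the authors' code.
    """
    if len(distractors) != 3:
        raise ValueError("expected exactly three incorrect answers")
    rng = rng or random.Random()
    options = [correct, *distractors]
    rng.shuffle(options)  # assumption: shuffle so the correct answer is not always first
    labels = "ABCD"
    return {
        "question": question,
        "choices": dict(zip(labels, options)),
        "answer": labels[options.index(correct)],
    }


# Made-up example item (not from MaterialBENCH).
item = make_multiple_choice(
    "Which crystal structure does alpha-iron have at room temperature?",
    "body-centered cubic",
    ["face-centered cubic", "hexagonal close-packed", "simple cubic"],
    rng=random.Random(0),
)
print(item["question"])
for label, text in item["choices"].items():
    print(f"  {label}. {text}")
print("answer:", item["answer"])
```

Keeping the free-response answer and the labelled choices in the same record would make it straightforward to score both formats against the same problem, which is the comparison the abstract mentions.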

Title: MARAGS: A Multi-Adapter System for Multi-Task Retrieval Augmented Generation Question Answering

Authors: Mitchell DeHaven
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.03171
Pdf URL: https://arxiv.org/pdf/2409.03171
Copy Paste: [[2409.03171]] MARAGS: A Multi-Adapter System for Multi-Task Retrieval Augmented Generation Question Answering(https://arxiv.org/abs/2409.03171)
Keywords: llm, retrieval augmented generation
Abstract: In this paper we present a multi-adapter retrieval augmented generation system (MARAGS) for Meta's Comprehensive RAG (CRAG) competition for KDD CUP 2024. CRAG is a question answering dataset that contains 3 different subtasks aimed at realistic RAG-related question answering tasks, with a diverse set of question topics, question types, time-dynamic answers, and questions featuring entities of varying popularity. Our system follows a standard setup for web-based RAG, which uses processed web pages to provide context for an LLM to produce generations, while also querying API endpoints for additional information. MARAGS also utilizes multiple different adapters to solve the various requirements for these tasks, with a standard cross-encoder model for ranking candidate passages relevant for answering the question. Our system achieved 2nd place on Task 1 as well as 3rd place on Task 2.

Title: Bypassing DARCY Defense: Indistinguishable Universal Adversarial Triggers

Authors: Zuquan Peng, Yuanyuan He, Jianbing Ni, Ben Niu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.03183
Pdf URL: https://arxiv.org/pdf/2409.03183
Copy Paste: [[2409.03183]] Bypassing DARCY Defense: Indistinguishable Universal Adversarial Triggers(https://arxiv.org/abs/2409.03183)
Keywords: language model, gpt
Abstract: Neural network (NN) classification models for Natural Language Processing (NLP) are vulnerable to the Universal Adversarial Triggers (UAT) attack that triggers a model to produce a specific prediction for any input. DARCY borrows the "honeypot" concept to bait multiple trapdoors, effectively detecting the adversarial examples generated by UAT. Unfortunately, we find a new UAT generation method, called IndisUAT, which produces triggers (i.e., tokens) and uses them to craft adversarial examples whose feature distribution is indistinguishable from that of the benign examples in a randomly-chosen category at the detection layer of DARCY. The produced adversarial examples incur the maximal loss of predicting results in the DARCY-protected models. Meanwhile, the produced triggers are effective in black-box models for text generation, text inference, and reading comprehension. Finally, the evaluation results under NN models for NLP tasks indicate that the IndisUAT method can effectively circumvent DARCY and penetrate other defenses. For example, IndisUAT can reduce the true positive rate of DARCY's detection by at least 40.8% and 90.6%, and drop the accuracy by at least 33.3% and 51.6% in the RNN and CNN models, respectively. IndisUAT reduces the accuracy of BERT's adversarial defense model by at least 34.0%, and makes the GPT-2 language model spew racist outputs even when conditioned on non-racial context.

Title: An Effective Deployment of Diffusion LM for Data Augmentation in Low-Resource Sentiment Classification

Authors: Zhuowei Chen, Lianxi Wang, Yuben Wu, Xinfeng Liao, Yujia Tian, Junyang Zhong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.03203
Pdf URL: https://arxiv.org/pdf/2409.03203
Copy Paste: [[2409.03203]] An Effective Deployment of Diffusion LM for Data Augmentation in Low-Resource Sentiment Classification(https://arxiv.org/abs/2409.03203)
Keywords: language model
Abstract: Sentiment classification (SC) often suffers from low-resource challenges such as domain-specific contexts, imbalanced label distributions, and few-shot scenarios. The potential of the diffusion language model (LM) for textual data augmentation (DA) remains unexplored. Moreover, textual DA methods struggle to balance the diversity and consistency of new samples. Most DA methods either perform logical modifications or rephrase less important tokens in the original sequence with the language model. In the context of SC, strong emotional tokens could act critically on the sentiment of the whole sequence. Therefore, contrary to rephrasing less important context, we propose DiffusionCLS to leverage a diffusion LM to capture in-domain knowledge and generate pseudo samples by reconstructing strong label-related tokens. This approach ensures a balance between consistency and diversity, avoiding the introduction of noise and augmenting crucial features of datasets. DiffusionCLS also comprises a Noise-Resistant Training objective to help the model generalize. Experiments demonstrate the effectiveness of our method in various low-resource scenarios including domain-specific and domain-general problems. Ablation studies confirm the effectiveness of our framework's modules, and visualization studies highlight optimal deployment conditions, reinforcing our conclusions.

Title: xLAM: A Family of Large Action Models to Empower AI Agent Systems

Authors: Jianguo Zhang, Tian Lan, Ming Zhu, Zuxin Liu, Thai Hoang, Shirley Kokane, Weiran Yao, Juntao Tan, Akshara Prabhakar, Haolin Chen, Zhiwei Liu, Yihao Feng, Tulika Awalgaonkar, Rithesh Murthy, Eric Hu, Zeyuan Chen, Ran Xu, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Silvio Savarese, Caiming Xiong
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2409.03215
Pdf URL: https://arxiv.org/pdf/2409.03215
Copy Paste: [[2409.03215]] xLAM: A Family of Large Action Models to Empower AI Agent Systems(https://arxiv.org/abs/2409.03215)
Keywords: language model, gpt, llm, agent
Abstract: Autonomous agents powered by large language models (LLMs) have attracted significant research interest. However, the open-source community faces many challenges in developing specialized models for agent tasks, driven by the scarcity of high-quality agent datasets and the absence of standard protocols in this area. We introduce and publicly release xLAM, a series of large action models designed for AI agent tasks. The xLAM series includes five models with both dense and mixture-of-expert architectures, ranging from 1B to 8x22B parameters, trained using a scalable, flexible pipeline that unifies, augments, and synthesizes diverse datasets to enhance AI agents' generalizability and performance across varied environments. Our experimental results demonstrate that xLAM consistently delivers exceptional performance across multiple agent ability benchmarks, notably securing the 1st position on the Berkeley Function-Calling Leaderboard, outperforming GPT-4, Claude-3, and many other models in terms of tool use. By releasing the xLAM series, we aim to advance the performance of open-source LLMs for autonomous AI agents, potentially accelerating progress and democratizing access to high-performance models for agent tasks. Models are available at this https URL

Title: Enhancing Healthcare LLM Trust with Atypical Presentations Recalibration

Authors: Jeremy Qin, Bang Liu, Quoc Dinh Nguyen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.03225
Pdf URL: https://arxiv.org/pdf/2409.03225
Copy Paste: [[2409.03225]] Enhancing Healthcare LLM Trust with Atypical Presentations Recalibration(https://arxiv.org/abs/2409.03225)
Keywords: language model, llm
Abstract: Black-box large language models (LLMs) are increasingly deployed in various environments, making it essential for these models to effectively convey their confidence and uncertainty, especially in high-stakes settings. However, these models often exhibit overconfidence, leading to potential risks and misjudgments. Existing techniques for eliciting and calibrating LLM confidence have primarily focused on general reasoning datasets, yielding only modest improvements. Accurate calibration is crucial for informed decision-making and preventing adverse outcomes but remains challenging due to the complexity and variability of tasks these models perform. In this work, we investigate the miscalibration behavior of black-box LLMs within the healthcare setting. We propose a novel method, Atypical Presentations Recalibration, which leverages atypical presentations to adjust the model's confidence estimates. Our approach significantly improves calibration, reducing calibration errors by approximately 60% on three medical question answering datasets and outperforming existing methods such as vanilla verbalized confidence, CoT verbalized confidence and others. Additionally, we provide an in-depth analysis of the role of atypicality within the recalibration framework.
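
To make "calibration error" in the abstract above concrete, here is a minimal sketch of one widely used measure, the expected calibration error (ECE) with equal-width confidence bins. This is an assumption for illustration: the abstract does not say which calibration metric the authors report, and the function name, bin count, and sample numbers below are made up.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |avg confidence - accuracy| per bin.

    Illustrative sketch; the paper's exact metric is not given in the abstract.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin rather than overflowing.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Made-up example: an overconfident model (states 0.9, is right half the time).
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, True, False, False]
)
print(f"ECE = {overconfident:.2f}")
```

A perfectly calibrated model would score 0 under this measure, so a recalibration method is judged by how far it pushes the value toward 0.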

Title: E2CL: Exploration-based Error Correction Learning for Embodied Agents

Authors: Hanlin Wang, Chak Tou Leong, Jian Wang, Wenjie Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.03256
Pdf URL: https://arxiv.org/pdf/2409.03256
Copy Paste: [[2409.03256]] E2CL: Exploration-based Error Correction Learning for Embodied Agents(https://arxiv.org/abs/2409.03256)
Keywords: language model, agent
Abstract: Language models are exhibiting increasing capability in knowledge utilization and reasoning. However, when applied as agents in embodied environments, they often suffer from misalignment between their intrinsic knowledge and environmental knowledge, leading to infeasible actions. Traditional environment alignment methods, such as supervised learning on expert trajectories and reinforcement learning, face limitations in covering environmental knowledge and achieving efficient convergence, respectively. Inspired by human learning, we propose Exploration-based Error Correction Learning (E2CL), a novel framework that leverages exploration-induced errors and environmental feedback to enhance environment alignment for LM-based agents. E2CL incorporates teacher-guided and teacher-free exploration to gather environmental feedback and correct erroneous actions. The agent learns to provide feedback and self-correct, thereby enhancing its adaptability to target environments. Evaluations in the Virtualhome environment demonstrate that E2CL-trained agents outperform those trained by baseline methods and exhibit superior self-correction capabilities.

Title: Understanding LLM Development Through Longitudinal Study: Insights from the Open Ko-LLM Leaderboard

Authors: Chanjun Park, Hyeonwoo Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.03257
Pdf URL: https://arxiv.org/pdf/2409.03257
Copy Paste: [[2409.03257]] Understanding LLM Development Through Longitudinal Study: Insights from the Open Ko-LLM Leaderboard(https://arxiv.org/abs/2409.03257)
Keywords: language model, llm
Abstract: This paper conducts a longitudinal study over eleven months to address the limitations of prior research on the Open Ko-LLM Leaderboard, which has relied on empirical studies with restricted observation periods of only five months. By extending the analysis duration, we aim to provide a more comprehensive understanding of the progression in developing Korean large language models (LLMs). Our study is guided by three primary research questions: (1) What are the specific challenges in improving LLM performance across diverse tasks on the Open Ko-LLM Leaderboard over time? (2) How does model size impact task performance correlations across various benchmarks? (3) How have the patterns in leaderboard rankings shifted over time on the Open Ko-LLM Leaderboard? By analyzing 1,769 models over this period, our research offers a comprehensive examination of the ongoing advancements in LLMs and the evolving nature of evaluation frameworks.

Title: GraphInsight: Unlocking Insights in Large Language Models for Graph Structure Understanding

Authors: Yukun Cao, Shuo Han, Zengyi Gao, Zezhong Ding, Xike Xie, S. Kevin Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.03258
Pdf URL: https://arxiv.org/pdf/2409.03258
Copy Paste: [[2409.03258]] GraphInsight: Unlocking Insights in Large Language Models for Graph Structure Understanding(https://arxiv.org/abs/2409.03258)
Keywords: language model, llm, prompt, retrieval-augmented generation, agent
Abstract: Although Large Language Models (LLMs) have demonstrated potential in processing graphs, they struggle with comprehending graphical structure information through prompts of graph description sequences, especially as the graph size increases. We attribute this challenge to the uneven memory performance of LLMs across different positions in graph description sequences, known as "positional biases". To address this, we propose GraphInsight, a novel framework aimed at improving LLMs' comprehension of both macro- and micro-level graphical information. GraphInsight is grounded in two key strategies: 1) placing critical graphical information in positions where LLMs exhibit stronger memory performance, and 2) investigating a lightweight external knowledge base for regions with weaker memory performance, inspired by retrieval-augmented generation (RAG). Moreover, GraphInsight explores integrating these two strategies into LLM agent processes for composite graph tasks that require multi-step reasoning. Extensive empirical studies on benchmarks with a wide range of evaluation tasks show that GraphInsight significantly outperforms all other graph description methods (e.g., prompting techniques and reordering strategies) in understanding graph structures of varying sizes.

Title: LLM Detectors Still Fall Short of Real World: Case of LLM-Generated Short News-Like Posts

Authors: Henrique Da Silva Gameiro, Andrei Kucharavy, Ljiljana Dolamic
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2409.03291
Pdf URL: https://arxiv.org/pdf/2409.03291
Copy Paste: [[2409.03291]] LLM Detectors Still Fall Short of Real World: Case of LLM-Generated Short News-Like Posts(https://arxiv.org/abs/2409.03291)
Keywords: language model, llm
Abstract: With the emergence of widely available powerful LLMs, disinformation generated by large language models (LLMs) has become a major concern. Historically, LLM detectors have been touted as a solution, but their effectiveness in the real world is still to be proven. In this paper, we focus on an important setting in information operations -- short news-like posts generated by moderately sophisticated attackers. We demonstrate that existing LLM detectors, whether zero-shot or purpose-trained, are not ready for real-world use in that setting. All tested zero-shot detectors perform inconsistently with prior benchmarks and are highly vulnerable to sampling temperature increase, a trivial attack absent from recent benchmarks. A purpose-trained detector generalizing across LLMs and unseen attacks can be developed, but it fails to generalize to new human-written texts. We argue that the former indicates domain-specific benchmarking is needed, while the latter suggests a trade-off between adversarial evasion resilience and overfitting to the reference human text, both of which need evaluation in benchmarks and are currently absent. We believe this calls for a reconsideration of current LLM detector benchmarking approaches, and we provide a dynamically extensible benchmark to allow it (this https URL).
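
The "sampling temperature increase" attack mentioned in the abstract above relies on a standard property of temperature-scaled sampling: dividing the logits by a larger temperature flattens the next-token distribution, so less likely tokens are sampled more often. The sketch below only illustrates that general mechanism; the logits are made up, and nothing here reproduces the detectors or the experimental setup from the paper.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical next-token logits for a four-token vocabulary.
logits = [4.0, 2.0, 1.0, 0.5]
entropies = {}
for t in (0.7, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    entropies[t] = entropy(probs)
    print(f"T={t}: top-token prob={max(probs):.3f}, entropy={entropies[t]:.3f}")
```

Because text sampled at a higher temperature contains more low-probability tokens, a detector that leans on likelihood statistics may see it as less "machine-like", which is one plausible reading of why the abstract calls the attack trivial.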

Title: N-gram Prediction and Word Difference Representations for Language Modeling

Authors: DongNyeong Heo, Daniela Noemi Rim, Heeyoul Choi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.03295
Pdf URL: https://arxiv.org/pdf/2409.03295
Copy Paste: [[2409.03295]] N-gram Prediction and Word Difference Representations for Language Modeling(https://arxiv.org/abs/2409.03295)
Keywords: language model, llm
Abstract: Causal language modeling (CLM) serves as the foundational framework underpinning remarkable successes of recent large language models (LLMs). Despite its success, the training approach for next word prediction poses a potential risk of causing the model to overly focus on local dependencies within a sentence. While prior studies have proposed predicting the future N words simultaneously, they were primarily applied to tasks such as masked language modeling (MLM) and neural machine translation (NMT). In this study, we introduce a simple N-gram prediction framework for the CLM task. Moreover, we introduce word difference representation (WDR) as a surrogate and contextualized target representation during model training on the basis of the N-gram prediction framework. To further enhance the quality of next word prediction, we propose an ensemble method that incorporates the future N words' prediction results. Empirical evaluations across multiple benchmark datasets encompassing CLM and NMT tasks demonstrate the significant advantages of our proposed methods over the conventional CLM.

Title: Sketch: A Toolkit for Streamlining LLM Operations

Authors: Xin Jiang, Xiang Li, Wenjia Ma, Xuezhi Fang, Yiqun Yao, Naitong Yu, Xuying Meng, Peng Han, Jing Li, Aixin Sun, Yequan Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.03346
Pdf URL: https://arxiv.org/pdf/2409.03346
Copy Paste: [[2409.03346]] Sketch: A Toolkit for Streamlining LLM Operations(https://arxiv.org/abs/2409.03346)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) represented by the GPT family have achieved remarkable success. The characteristics of LLMs lie in their ability to accommodate a wide range of tasks through a generative approach. However, the flexibility of their output format poses challenges in controlling and harnessing the model's outputs, thereby constraining the application of LLMs in various domains. In this work, we present Sketch, an innovative toolkit designed to streamline LLM operations across diverse fields. Sketch comprises the following components: (1) a suite of task description schemas and prompt templates encompassing various NLP tasks; (2) a user-friendly, interactive process for building structured output LLM services tailored to various NLP tasks; (3) an open-source dataset for output format control, along with tools for dataset construction; and (4) an open-source model based on LLaMA3-8B-Instruct that adeptly comprehends and adheres to output formatting instructions. We anticipate this initiative to bring considerable convenience to LLM users, achieving the goal of "plug-and-play" for various applications. The components of Sketch will be progressively open-sourced at this https URL.

Title: Con-ReCall: Detecting Pre-training Data in LLMs via Contrastive Decoding

Authors: Cheng Wang, Yiwei Wang, Bryan Hooi, Yujun Cai, Nanyun Peng, Kai-Wei Chang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.03363
Pdf URL: https://arxiv.org/pdf/2409.03363
Copy Paste: [[2409.03363]] Con-ReCall: Detecting Pre-training Data in LLMs via Contrastive Decoding(https://arxiv.org/abs/2409.03363)
Keywords: language model, llm
Abstract: The training data in large language models is key to their success, but it also presents privacy and security risks, as it may contain sensitive information. Detecting pre-training data is crucial for mitigating these concerns. Existing methods typically analyze target text in isolation or solely with non-member contexts, overlooking potential insights from simultaneously considering both member and non-member contexts. While previous work suggested that member contexts provide little information due to the minor distributional shift they induce, our analysis reveals that these subtle shifts can be effectively leveraged when contrasted with non-member contexts. In this paper, we propose Con-ReCall, a novel approach that leverages the asymmetric distributional shifts induced by member and non-member contexts through contrastive decoding, amplifying subtle differences to enhance membership inference. Extensive empirical evaluations demonstrate that Con-ReCall achieves state-of-the-art performance on the WikiMIA benchmark and is robust against various text manipulation techniques.
摘要：大型语言模型中的训练数据是其成功的关键，但它也存在隐私和安全风险，因为其中可能包含敏感信息。检测预训练数据对于缓解这些问题至关重要。现有方法通常孤立地或仅使用非成员上下文来分析目标文本，而忽略了同时考虑成员和非成员上下文的潜在见解。虽然之前的研究表明成员上下文由于其引起的分布偏移很小而提供的信息很少，但我们的分析表明，与非成员上下文相比，这些细微的偏移可以得到有效利用。在本文中，我们提出了 Con-ReCall，这是一种新颖的方法，它通过对比解码利用成员和非成员上下文引起的不对称分布偏移，放大细微的差异以增强成员推断。大量的实证评估表明，Con-ReCall 在 WikiMIA 基准上实现了最先进的性能，并且对各种文本操作技术具有鲁棒性。

Title: Leveraging Large Language Models through Natural Language Processing to provide interpretable Machine Learning predictions of mental deterioration in real time

Authors: Francisco de Arriba-Pérez, Silvia García-Méndez
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2409.03375
Pdf URL: https://arxiv.org/pdf/2409.03375
Copy Paste: [[2409.03375]] Leveraging Large Language Models through Natural Language Processing to provide interpretable Machine Learning predictions of mental deterioration in real time(https://arxiv.org/abs/2409.03375)
Keywords: language model, llm, prompt, chat
Abstract: Based on official estimates, 50 million people worldwide are affected by dementia, and this number increases by 10 million new patients every year. Without a cure, clinical prognostication and early intervention represent the most effective ways to delay its progression. To this end, Artificial Intelligence and computational linguistics can be exploited for natural language analysis, personalized assessment, monitoring, and treatment. However, traditional approaches need more semantic knowledge management and explicability capabilities. Moreover, using Large Language Models (LLMs) for cognitive decline diagnosis is still scarce, even though these models represent the most advanced way for clinical-patient communication using intelligent systems. Consequently, we leverage an LLM using the latest Natural Language Processing (NLP) techniques in a chatbot solution to provide interpretable Machine Learning prediction of cognitive decline in real time. Linguistic-conceptual features are exploited for appropriate natural language analysis. Through explainability, we aim to fight potential biases of the models and improve their potential to help clinical workers in their diagnosis decisions. In more detail, the proposed pipeline is composed of (i) data extraction employing NLP-based prompt engineering; (ii) stream-based data processing including feature engineering, analysis, and selection; (iii) real-time classification; and (iv) the explainability dashboard to provide visual and natural language descriptions of the prediction outcome. Classification results exceed 80% in all evaluation metrics, with a recall value for the mental deterioration class of about 85%. To sum up, this work contributes an affordable, flexible, non-invasive, personalized diagnostic system.
摘要：根据官方估计，全世界有 5000 万人患有痴呆症，而且每年新增患者人数增加 1000 万。如果没有治愈方法，临床预测和早期干预是延缓其进展的最有效方法。为此，可以利用人工智能和计算语言学进行自然语言分析、个性化评估、监测和治疗。然而，传统方法需要更多的语义知识管理和可解释性能力。此外，使用大型语言模型 (LLM) 进行认知衰退诊断的情况仍然很少，尽管这些模型代表了使用智能系统进行临床与患者沟通的最先进方式。因此，我们在聊天机器人解决方案中利用使用最新自然语言处理 (NLP) 技术的 LLM 来实时提供可解释的机器学习认知衰退预测。利用语言概念特征进行适当的自然语言分析。通过可解释性，我们旨在对抗模型的潜在偏见并提高其帮助临床工作者做出诊断决策的潜力。更详细地说，提出的管道由 (i) 采用基于 NLP 的提示工程进行数据提取组成； (ii) 基于流的数据处理，包括特征工程、分析和选择；(iii) 实时分类；(iv) 可解释性仪表板，提供预测结果的视觉和自然语言描述。分类结果在所有评估指标中均超过 80%，精神衰退类的召回率约为 85%。总而言之，我们为这项工作贡献了一种经济实惠、灵活、非侵入性、个性化的诊断系统。

Title: CogniDual Framework: Self-Training Large Language Models within a Dual-System Theoretical Framework for Improving Cognitive Tasks

Authors: Yongxin Deng (1), Xihe Qiu (1), Xiaoyu Tan (2), Chao Qu (2), Jing Pan (3), Yuan Cheng (3), Yinghui Xu (4), Wei Chu (2) ((1) School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai, China, (2) INF Technology (Shanghai) Co., Ltd., Shanghai, China, (3) School of Art, Design and Architecture, Monash University, Melbourne, Australia, (4) Artificial Intelligence Innovation and Incubation Institute, Fudan University, Shanghai, China)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.03381
Pdf URL: https://arxiv.org/pdf/2409.03381
Copy Paste: [[2409.03381]] CogniDual Framework: Self-Training Large Language Models within a Dual-System Theoretical Framework for Improving Cognitive Tasks(https://arxiv.org/abs/2409.03381)
Keywords: language model, llm
Abstract: Cognitive psychology investigates perception, attention, memory, language, problem-solving, decision-making, and reasoning. Kahneman's dual-system theory elucidates the human decision-making process, distinguishing between the rapid, intuitive System 1 and the deliberative, rational System 2. Recent advancements have positioned large language models (LLMs) as formidable tools nearing human-level proficiency in various cognitive tasks. Nonetheless, the presence of a dual-system framework analogous to human cognition in LLMs remains unexplored. This study introduces the CogniDual Framework for LLMs (CFLLMs), designed to assess whether LLMs can, through self-training, evolve from deliberate deduction to intuitive responses, thereby emulating the human process of acquiring and mastering new information. Our findings reveal the cognitive mechanisms behind LLMs' response generation, enhancing our understanding of their capabilities in cognitive psychology. Practically, self-trained models can provide faster responses to certain queries, reducing computational demands during inference.
摘要：认知心理学研究感知、注意力、记忆、语言、解决问题、决策和推理。卡尼曼的双系统理论阐明了人类的决策过程，区分了快速、直观的系统 1 和深思熟虑、理性的系统 2。最近的进展已将大型语言模型 (LLM) 定位为在各种认知任务中接近人类水平的强大工具。尽管如此，在 LLM 中是否存在类似于人类认知的双系统框架仍未得到探索。本研究引入了 LLM 的认知双框架 (CFLLM)，旨在评估 LLM 是否可以通过自我训练从深思熟虑的推理发展到直觉反应，从而模拟人类获取和掌握新信息的过程。我们的研究结果揭示了 LLM 反应生成背后的认知机制，增强了我们对其在认知心理学中能力的理解。实际上，自我训练的模型可以对某些查询提供更快的响应，从而减少推理过程中的计算需求。

Title: Rx Strategist: Prescription Verification using LLM Agents System

Authors: Phuc Phan Van, Dat Nguyen Minh, An Dinh Ngoc, Huy Phan Thanh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.03440
Pdf URL: https://arxiv.org/pdf/2409.03440
Copy Paste: [[2409.03440]] Rx Strategist: Prescription Verification using LLM Agents System(https://arxiv.org/abs/2409.03440)
Keywords: language model, llm, agent
Abstract: To protect patient safety, modern pharmaceutical complexity demands strict prescription verification. We offer a new approach - Rx Strategist - that makes use of knowledge graphs and different search strategies to enhance the power of Large Language Models (LLMs) inside an agentic framework. This multifaceted technique allows for a multi-stage LLM pipeline and reliable information retrieval from a custom-built active ingredient database. Different facets of prescription verification, such as indication, dose, and possible drug interactions, are covered in each stage of the pipeline. We alleviate the drawbacks of monolithic LLM techniques by spreading reasoning over these stages, improving correctness and reliability while reducing memory demands. Our findings demonstrate that Rx Strategist surpasses many current LLMs, achieving performance comparable to that of a highly experienced clinical pharmacist. In the complicated world of modern medications, this combination of LLMs with organized knowledge and sophisticated search methods presents a viable avenue for reducing prescription errors and enhancing patient outcomes.
摘要：为了保护患者安全，现代药物的复杂性要求严格的处方验证。我们提供了一种新方法 - Rx Strategist - 它利用知识图谱和不同的搜索策略来增强代理框架内大型语言模型 (LLM) 的功能。这种多方面的技术允许多阶段 LLM 管道和从定制的活性成分数据库中进行可靠的信息检索。管道的每个阶段都涵盖处方验证的不同方面，例如适应症、剂量和可能的药物相互作用。我们通过在这些阶段分散推理来减轻单片 LLM 技术的缺点，提高正确性和可靠性，同时减少内存需求。我们的研究结果表明，Rx Strategist 超越了许多当前的 LLM，其性能可与经验丰富的临床药剂师相媲美。在复杂的现代药物世界中，这种 LLM 与有组织的知识和复杂的搜索方法的结合为减少处方错误和改善患者治疗效果提供了可行的途径。

Title: Fine-tuning large language models for domain adaptation: Exploration of training strategies, scaling, model merging and synergistic capabilities

Authors: Wei Lu, Rachel K. Luu, Markus J. Buehler
Subjects: cs.CL, cond-mat.mtrl-sci, cs.AI
Abstract URL: https://arxiv.org/abs/2409.03444
Pdf URL: https://arxiv.org/pdf/2409.03444
Copy Paste: [[2409.03444]] Fine-tuning large language models for domain adaptation: Exploration of training strategies, scaling, model merging and synergistic capabilities(https://arxiv.org/abs/2409.03444)
Keywords: language model, llm, prompt, chat
Abstract: The advancement of Large Language Models (LLMs) for domain applications in fields such as materials science and engineering depends on the development of fine-tuning strategies that adapt models for specialized, technical capabilities. In this work, we explore the effects of Continued Pretraining (CPT), Supervised Fine-Tuning (SFT), and various preference-based optimization approaches, including Direct Preference Optimization (DPO) and Odds Ratio Preference Optimization (ORPO), on fine-tuned LLM performance. Our analysis shows how these strategies influence model outcomes and reveals that the merging of multiple fine-tuned models can lead to the emergence of capabilities that surpass the individual contributions of the parent models. We find that model merging leads to new functionalities that neither parent model could achieve alone, leading to improved performance in domain-specific assessments. Experiments with different model architectures are presented, including Llama 3.1 8B and Mistral 7B models, where similar behaviors are observed. Exploring whether the results hold also for much smaller models, we use a tiny LLM with 1.7 billion parameters and show that very small LLMs do not necessarily feature emergent capabilities under model merging, suggesting that model scaling may be a key component. In open-ended yet consistent chat conversations between a human and AI models, our assessment reveals detailed insights into how different model variants perform and shows that the smallest model achieves a high intelligence score across key criteria including reasoning depth, creativity, clarity, and quantitative precision. Other experiments include the development of image generation prompts based on disparate biological material design concepts, to create new microstructures, architectural concepts, and urban design based on biological materials-inspired construction principles.
摘要：大型语言模型 (LLM) 在材料科学和工程等领域的应用进展取决于微调策略的开发，这些策略可使模型适应专门的技术能力。在这项工作中，我们探索了持续预训练 (CPT)、监督微调 (SFT) 和各种基于偏好的优化方法（包括直接偏好优化 (DPO) 和比值偏好优化 (ORPO)）对微调 LLM 性能的影响。我们的分析显示了这些策略如何影响模型结果，并揭示了多个微调模型的合并可以产生超越父模型单独贡献的能力。我们发现模型合并会带来父模型无法单独实现的新功能，从而提高特定领域评估的性能。我们介绍了使用不同模型架构的实验，包括 Llama 3.1 8B 和 Mistral 7B 模型，其中观察到了类似的行为。为了探索结果是否也适用于更小的模型，我们使用了一个具有 17 亿个参数的微型 LLM，并表明非常小的 LLM 在模型合并下不一定具有新兴能力，这表明模型扩展可能是关键因素。在人类与人工智能模型之间开放式但一致的聊天对话中，我们的评估揭示了不同模型变体如何表现的详细见解，并表明最小的模型在推理深度、创造力、清晰度和定量精度等关键标准方面获得了较高的智能分数。其他实验包括基于不同的生物材料设计概念开发图像生成提示，以基于受生物材料启发的建筑原理创建新的微结构、建筑概念和城市设计。

Title: How Much Data is Enough Data? Fine-Tuning Large Language Models for In-House Translation: Performance Evaluation Across Multiple Dataset Sizes

Authors: Inacio Vieira, Will Allred, Seamus Lankford, Sheila Castilho Monteiro De Sousa, Andy Way
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.03454
Pdf URL: https://arxiv.org/pdf/2409.03454
Copy Paste: [[2409.03454]] How Much Data is Enough Data? Fine-Tuning Large Language Models for In-House Translation: Performance Evaluation Across Multiple Dataset Sizes(https://arxiv.org/abs/2409.03454)
Keywords: language model, llm
Abstract: Decoder-only LLMs have shown impressive performance in MT due to their ability to learn from extensive datasets and generate high-quality translations. However, LLMs often struggle with the nuances and style required for organisation-specific translation. In this study, we explore the effectiveness of fine-tuning Large Language Models (LLMs), particularly Llama 3 8B Instruct, leveraging translation memories (TMs), as a valuable resource to enhance accuracy and efficiency. We investigate the impact of fine-tuning the Llama 3 model using TMs from a specific organisation in the software sector. Our experiments cover five translation directions across languages of varying resource levels (English to Brazilian Portuguese, Czech, German, Finnish, and Korean). We analyse diverse sizes of training datasets (1k to 207k segments) to evaluate their influence on translation quality. We fine-tune separate models for each training set and evaluate their performance based on automatic metrics, BLEU, chrF++, TER, and COMET. Our findings reveal improvement in translation performance with larger datasets across all metrics. On average, BLEU and COMET scores increase by 13 and 25 points, respectively, on the largest training set against the baseline model. Notably, there is a performance deterioration in comparison with the baseline model when fine-tuning on only 1k and 2k examples; however, we observe a substantial improvement as the training dataset size increases. The study highlights the potential of integrating TMs with LLMs to create bespoke translation models tailored to the specific needs of businesses, thus enhancing translation quality and reducing turn-around times. This approach offers a valuable insight for organisations seeking to leverage TMs and LLMs for optimal translation outcomes, especially in narrower domains.
摘要：仅使用解码器的 LLM 在机器翻译中表现出色，因为它们能够从大量数据集中学习并生成高质量的翻译。然而，LLM 经常难以处理特定组织翻译所需的细微差别和风格。在本研究中，我们探索了微调大型语言模型 (LLM)（尤其是 Llama 3 8B Instruct）的有效性，利用翻译记忆库 (TM) 作为宝贵的资源来提高准确性和效率。我们研究了使用来自软件行业特定组织的 TM 对 Llama 3 模型进行微调的影响。我们的实验涵盖了不同资源水平的语言（英语到巴西葡萄牙语、捷克语、德语、芬兰语和韩语）的五个翻译方向。我们分析了不同大小的训练数据集（1k 到 207k 个片段）以评估它们对翻译质量的影响。我们为每个训练集微调单独的模型，并根据自动指标、BLEU、chrF++、TER 和 COMET 评估它们的性能。我们的研究结果表明，在所有指标中，使用更大的数据集可以提高翻译性能。平均而言，与基线模型相比，在最大的训练集上，BLEU 和 COMET 得分分别提高了 13 分和 25 分。值得注意的是，与基线模型相比，在仅对 1k 和 2k 个示例进行微调时，性能有所下降；然而，随着训练数据集大小的增加，我们观察到了显着的改善。该研究强调了将 TM 与 LLM 相结合以创建定制翻译模型的潜力，以满足企业的特定需求，从而提高翻译质量并缩短周转时间。这种方法为寻求利用 TM 和 LLM 获得最佳翻译结果的组织提供了宝贵的见解，尤其是在较窄的领域。

Title: 100 instances is all you need: predicting the success of a new LLM on unseen data by testing on a few instances

Authors: Lorenzo Pacchiardi, Lucy G. Cheke, José Hernández-Orallo
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2409.03563
Pdf URL: https://arxiv.org/pdf/2409.03563
Copy Paste: [[2409.03563]] 100 instances is all you need: predicting the success of a new LLM on unseen data by testing on a few instances(https://arxiv.org/abs/2409.03563)
Keywords: gpt, llm
Abstract: Predicting the performance of LLMs on individual task instances is essential to ensure their reliability in high-stakes applications. To do so, a possibility is to evaluate the considered LLM on a set of task instances and train an assessor to predict its performance based on features of the instances. However, this approach requires evaluating each new LLM on a sufficiently large set of task instances to train an assessor specific to it. In this work, we leverage the evaluation results of previously tested LLMs to reduce the number of evaluations required to predict the performance of a new LLM. In practice, we propose to test the new LLM on a small set of reference instances and train a generic assessor which predicts the performance of the LLM on an instance based on the performance of the former on the reference set and features of the instance of interest. We conduct empirical studies on HELM-Lite and KindsOfReasoning, a collection of existing reasoning datasets that we introduce, where we evaluate all instruction-fine-tuned OpenAI models until the January 2024 version of GPT4. When predicting performance on instances with the same distribution as those used to train the generic assessor, we find this achieves performance comparable to the LLM-specific assessors trained on the full set of instances. Additionally, we find that randomly selecting the reference instances performs as well as some advanced selection methods we tested. For out-of-distribution instances, however, no clear winner emerges and the overall performance is worse, suggesting that the inherent predictability of LLMs is low.
摘要：预测 LLM 在单个任务实例上的性能对于确保其在高风险应用中的可靠性至关重要。为此，一种可能性是在一组任务实例上评估所考虑的 LLM，并训练评估者根据实例的特征预测其性能。但是，这种方法需要在足够大的任务实例集上评估每个新的 LLM，以训练针对它的评估者。在这项工作中，我们利用之前测试过的 LLM 的评估结果来减少预测新 LLM 性能所需的评估次数。在实践中，我们建议在一小组参考实例上测试新的 LLM，并训练一个通用评估者，该评估者根据前者在参考集上的性能和感兴趣的实例的特征来预测 LLM 在实例上的性能。我们对 HELM-Lite 和 KindsOfReasoning 进行了实证研究，这是我们引入的现有推理数据集的集合，我们在其中评估所有指令微调的 OpenAI 模型，直到 2024 年 1 月版本的 GPT4。当预测与用于训练通用评估器的分布相同的实例的性能时，我们发现这可以实现与在完整实例集上训练的 LLM 特定评估器相当的性能。此外，我们发现随机选择参考实例的表现与我们测试的一些高级选择方法一样好。然而，对于分布外的实例，没有出现明显的赢家，整体表现更差，这表明 LLM 的固有可预测性较低。

Title: Attend First, Consolidate Later: On the Importance of Attention in Different LLM Layers

Authors: Amit Ben Artzy, Roy Schwartz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.03621
Pdf URL: https://arxiv.org/pdf/2409.03621
Copy Paste: [[2409.03621]] Attend First, Consolidate Later: On the Importance of Attention in Different LLM Layers(https://arxiv.org/abs/2409.03621)
Keywords: llm, prompt
Abstract: In decoder-based LLMs, the representation of a given layer serves two purposes: as input to the next layer during the computation of the current token; and as input to the attention mechanism of future tokens. In this work, we show that the importance of the latter role might be overestimated. To show that, we start by manipulating the representations of previous tokens; e.g., by replacing the hidden states at some layer k with random vectors. Our experiments with four LLMs and four tasks show that this operation often leads to a small to negligible drop in performance. Importantly, this happens if the manipulation occurs in the top part of the model, i.e., when k is in the final 30-50% of the layers. In contrast, doing the same manipulation in earlier layers might lead to chance-level performance. We continue by switching the hidden state of certain tokens with hidden states of other tokens from another prompt; e.g., replacing the word "Italy" with "France" in "What is the capital of Italy?". We find that when applying this switch in the top 1/3 of the model, the model ignores it (answering "Rome"). However, if we apply it before, the model conforms to the switch ("Paris"). Our results hint at a two-stage process in transformer-based LLMs: the first part gathers input from previous tokens, while the second mainly processes that information internally.
摘要：在基于解码器的 LLM 中，给定层的表示有两个用途：在计算当前 token 时作为下一层输入；以及作为未来 token 注意力机制的输入。在这项工作中，我们表明后一个角色的重要性可能被高估了。为了证明这一点，我们首先操纵先前 token 的表示；例如，用随机向量替换某个层 k 的隐藏状态。我们对四个 LLM 和四个任务进行的实验表明，此操作通常会导致性能轻微下降甚至可忽略不计。重要的是，如果操纵发生在模型的顶部（k 位于最后 30-50% 的层中），就会发生这种情况。相反，在较早的层中进行相同的操作可能会导致偶然级别的性能。我们继续将某些 token 的隐藏状态与来自另一个提示的其他 token 的隐藏状态进行切换；例如，在“意大利的首都是什么？”中将“意大利”一词替换为“法国”。我们发现，当在模型的前 1/3 部分应用此切换时，模型会忽略它（回答“罗马”）。但是，如果我们在之前应用它，模型就会遵循切换（“巴黎”）。我们的结果暗示了基于转换器的 LLM 中的两个阶段过程：第一部分收集来自先前标记的输入，而第二部分主要在内部处理这些信息。

Title: LLM-based multi-agent poetry generation in non-cooperative environments

Authors: Ran Zhang, Steffen Eger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.03659
Pdf URL: https://arxiv.org/pdf/2409.03659
Copy Paste: [[2409.03659]] LLM-based multi-agent poetry generation in non-cooperative environments(https://arxiv.org/abs/2409.03659)
Keywords: language model, gpt, llm, prompt, agent
Abstract: Despite substantial progress of large language models (LLMs) for automatic poetry generation, the generated poetry lacks diversity while the training process differs greatly from human learning. Under the rationale that the learning process of the poetry generation systems should be more human-like and their output more diverse and novel, we introduce a framework based on social learning where we emphasize non-cooperative interactions besides cooperative interactions to encourage diversity. Our experiments are the first attempt at LLM-based multi-agent systems in non-cooperative environments for poetry generation employing both TRAINING-BASED agents (GPT-2) and PROMPTING-BASED agents (GPT-3 and GPT-4). Our evaluation based on 96k generated poems shows that our framework benefits the poetry generation process for TRAINING-BASED agents, resulting in a 3.0-3.7 percentage point (pp) increase in diversity and a 5.6-11.3 pp increase in novelty according to distinct and novel n-grams. The generated poetry from TRAINING-BASED agents also exhibits group divergence in terms of lexicons, styles and semantics. PROMPTING-BASED agents in our framework also benefit from non-cooperative environments and a more diverse ensemble of models with non-homogeneous agents has the potential to further enhance diversity, with an increase of 7.0-17.5 pp according to our experiments. However, PROMPTING-BASED agents show a decrease in lexical diversity over time and do not exhibit the group-based divergence intended in the social network. Our paper argues for a paradigm shift in creative tasks such as automatic poetry generation to include social learning processes (via LLM-based agent modeling) similar to human interaction.
摘要：尽管用于自动诗歌生成的大型语言模型 (LLM) 取得了实质性进展，但生成的诗歌缺乏多样性，而训练过程与人类学习有很大不同。在诗歌生成系统的学习过程应该更像人类、输出应该更加多样化和新颖的理念下，我们引入了一个基于社会学习的框架，除了合作互动外，我们还强调非合作互动以鼓励多样性。我们的实验是基于 LLM 的多智能体系统在非合作环境中的首次尝试，该系统同时使用基于训练的智能体 (GPT-2) 和基于提示的智能体 (GPT-3 和 GPT-4) 进行诗歌生成。我们基于 96k 首生成的诗歌的评估表明，我们的框架有利于基于训练的智能体的诗歌生成过程，结果为 1) 多样性增加了 3.0-3.7 个百分点 (pp)，新颖性增加了 5.6-11.3 pp (根据独特和新颖的 n-gram)。 TRAINING-BASED 代理生成的诗歌在词汇、风格和语义方面也表现出群体差异。我们框架中的 PROMPTING-BASED 代理也受益于非合作环境，并且具有非同质代理的更多样化模型集合有可能进一步增强多样性，根据我们的实验，多样性将增加 7.0-17.5 pp。然而，PROMPTING-BASED 代理的词汇多样性会随着时间的推移而减少，并且不会表现出社交网络中预期的基于群体的差异。我们的论文主张在自动诗歌生成等创造性任务中进行范式转变，以包括类似于人类互动的社会学习过程（通过基于 LLM 的代理建模）。
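
Note: The abstract above reports diversity and novelty "according to distinct and novel n-grams" without giving formulas. A minimal sketch of how such ratios are commonly computed follows; the function names, whitespace tokenization, and lowercasing are assumptions made for illustration and are not taken from the paper.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> List[NGram]:
    """Return all word n-grams of a text (assumed: lowercase, whitespace split)."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(texts: Iterable[str], n: int) -> float:
    """Share of n-grams in the generated texts that are unique (diversity proxy)."""
    all_grams: List[NGram] = []
    for text in texts:
        all_grams.extend(ngrams(text, n))
    return len(set(all_grams)) / len(all_grams) if all_grams else 0.0


def novel_n(generated: Iterable[str], reference: Iterable[str], n: int) -> float:
    """Share of generated n-grams that never occur in the reference texts (novelty proxy)."""
    seen: Set[NGram] = set()
    for text in reference:
        seen.update(ngrams(text, n))
    gen_grams: List[NGram] = []
    for text in generated:
        gen_grams.extend(ngrams(text, n))
    if not gen_grams:
        return 0.0
    return sum(1 for g in gen_grams if g not in seen) / len(gen_grams)


if __name__ == "__main__":
    poems = ["the moon is bright", "the moon is cold"]
    print(distinct_n(poems, 2))
    print(novel_n(["the moon is cold"], ["the moon is bright"], 2))
```

A difference between two such ratios, multiplied by 100, is a change in percentage points, which is the unit the abstract uses.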

Title: The representation landscape of few-shot learning and fine-tuning in large language models

Authors: Diego Doimo, Alessandro Serra, Alessio Ansuini, Alberto Cazzaniga
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2409.03662
Pdf URL: https://arxiv.org/pdf/2409.03662
Copy Paste: [[2409.03662]] The representation landscape of few-shot learning and fine-tuning in large language models(https://arxiv.org/abs/2409.03662)
Keywords: language model, llm
Abstract: In-context learning (ICL) and supervised fine-tuning (SFT) are two common strategies for improving the performance of modern large language models (LLMs) on specific tasks. Despite their different natures, these strategies often lead to comparable performance gains. However, little is known about whether they induce similar representations inside LLMs. We approach this problem by analyzing the probability landscape of their hidden representations in the two cases. More specifically, we compare how LLMs solve the same question-answering task, finding that ICL and SFT create very different internal structures, in both cases undergoing a sharp transition in the middle of the network. In the first half of the network, ICL shapes interpretable representations hierarchically organized according to their semantic content. In contrast, the probability landscape obtained with SFT is fuzzier and semantically mixed. In the second half of the model, the fine-tuned representations develop probability modes that better encode the identity of answers, while the landscape of ICL representations is characterized by less defined peaks. Our approach reveals the diverse computational strategies developed inside LLMs to solve the same task across different conditions, allowing us to make a step towards designing optimal methods to extract information from language models.
摘要：上下文学习 (ICL) 和监督微调 (SFT) 是两种常见的策略，用于提高现代大型语言模型 (LLM) 在特定任务上的性能。尽管这些策略性质不同，但它们通常可带来相当的性能提升。然而，人们对它们是否会在 LLM 内部产生类似的表示知之甚少。我们通过分析这两种情况下隐藏表示的概率景观来解决这个问题。更具体地说，我们比较了 LLM 如何解决相同的问答任务，发现 ICL 和 SFT 创建了非常不同的内部结构，在这两种情况下，网络中间都会经历一个急剧的转变。在网络的前半部分，ICL 根据语义内容分层组织可解释的表示。相比之下，使用 SFT 获得的概率景观更加模糊且语义混乱。在模型的后半部分，微调后的表示会形成更好地编码答案身份的概率模式，而 ICL 表示的景观则以不太明确的峰值为特征。我们的方法揭示了 LLM 内部开发的用于在不同条件下解决同一任务的多种计算策略，使我们能够朝着设计从语言模型中提取信息的最佳方法迈出一步。

Title: LAST: Language Model Aware Speech Tokenization

Authors: Arnon Turetzky, Yossi Adi
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2409.03701
Pdf URL: https://arxiv.org/pdf/2409.03701
Copy Paste: [[2409.03701]] LAST: Language Model Aware Speech Tokenization(https://arxiv.org/abs/2409.03701)
Keywords: language model
Abstract: Speech tokenization serves as the foundation of speech language models (LMs), enabling them to perform various tasks such as spoken language modeling, text-to-speech, speech-to-text, etc. Most speech tokenizers are trained independently of the LM training process, relying on separate acoustic models and quantization methods. Following such an approach may create a mismatch between the tokenization process and its usage afterward. In this study, we propose a novel approach to training a speech tokenizer by leveraging objectives from pre-trained textual LMs. We advocate for the integration of this objective into the process of learning discrete speech representations. Our aim is to transform features from a pre-trained speech model into a new feature space that enables better clustering for speech LMs. We empirically investigate the impact of various model design choices, including speech vocabulary size and text LM size. Our results demonstrate that the proposed tokenization method outperforms the evaluated baselines on both spoken language modeling and speech-to-text. More importantly, unlike prior work, the proposed method allows the utilization of a single pre-trained LM for processing both speech and text inputs, setting it apart from conventional tokenization approaches.
摘要：语音标记化是语音语言模型 (LM) 的基础，使其能够执行各种任务，例如口语语言建模、文本转语音、语音转文本等。大多数语音标记器都是独立于 LM 训练过程进行训练的，依赖于单独的声学模型和量化方法。采用这种方法可能会导致标记化过程与其后续使用不匹配。在本研究中，我们提出了一种新方法来训练语音标记器，即利用预先训练的文本 LM 中的目标。我们主张将这个目标整合到学习离散语音表示的过程中。我们的目标是将预训练语音模型中的特征转换为新的特征空间，从而使语音 LM 能够更好地进行聚类。我们通过实证研究了各种模型设计选择的影响，包括语音词汇量和文本 LM 大小。我们的结果表明，提出的标记化方法在考虑口语语言建模和语音转文本时均优于评估的基线。更重要的是，与之前的研究不同，所提出的方法允许利用单个预训练的 LM 来处理语音和文本输入，这使其有别于传统的标记化方法。

Title: RAG based Question-Answering for Contextual Response Prediction System

Authors: Sriram Veturi, Saurabh Vaichal, Nafis Irtiza Tripto, Reshma Lal Jagadheesh, Nian Yan
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2409.03708
Pdf URL: https://arxiv.org/pdf/2409.03708
Copy Paste: [[2409.03708]] RAG based Question-Answering for Contextual Response Prediction System(https://arxiv.org/abs/2409.03708)
Keywords: language model, llm, hallucination, chat, retrieval augmented generation, agent
Abstract: Large Language Models (LLMs) have shown versatility in various Natural Language Processing (NLP) tasks, including their potential as effective question-answering systems. However, to provide precise and relevant information in response to specific customer queries in industry settings, LLMs require access to a comprehensive knowledge base to avoid hallucinations. Retrieval Augmented Generation (RAG) emerges as a promising technique to address this challenge. Yet, developing an accurate question-answering framework for real-world applications using RAG entails several challenges: 1) data availability issues, 2) evaluating the quality of generated content, and 3) the costly nature of human evaluation. In this paper, we introduce an end-to-end framework that employs LLMs with RAG capabilities for industry use cases. Given a customer query, the proposed system retrieves relevant knowledge documents and leverages them, along with previous chat history, to generate response suggestions for customer service agents in the contact centers of a major retail company. Through comprehensive automated and human evaluations, we show that this solution outperforms the current BERT-based algorithms in accuracy and relevance. Our findings suggest that RAG-based LLMs can be an excellent support to human customer service representatives by lightening their workload.
摘要：大型语言模型 (LLM) 在各种自然语言处理 (NLP) 任务中表现出多功能性，包括它们作为有效问答系统的潜力。然而，为了在行业环境中针对特定客户查询提供准确和相关的信息，LLM 需要访问全面的知识库以避免产生幻觉。检索增强生成 (RAG) 作为解决这一挑战的一种有前途的技术而出现。然而，使用 RAG 为实际应用开发准确的问答框架面临着几个挑战：1) 数据可用性问题，2) 评估生成内容的质量，以及 3) 人工评估的成本高昂。在本文中，我们介绍了一个端到端框架，该框架将具有 RAG 功能的 LLM 用于行业用例。给定客户查询，所提出的系统将检索相关知识文档，并利用它们以及之前的聊天记录为大型零售公司的联络中心的客服代理生成响应建议。通过全面的自动化和人工评估，我们表明该解决方案在准确性和相关性方面优于当前基于 BERT 的算法。我们的研究结果表明，基于 RAG 的 LLM 可以减轻人工客户服务代表的工作量，从而为他们提供出色的支持。

Title: Attention Heads of Large Language Models: A Survey

Authors: Zifan Zheng, Yezhaohui Wang, Yuxin Huang, Shichao Song, Bo Tang, Feiyu Xiong, Zhiyu Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.03752
Pdf URL: https://arxiv.org/pdf/2409.03752
Copy Paste: [[2409.03752]] Attention Heads of Large Language Models: A Survey(https://arxiv.org/abs/2409.03752)
Keywords: language model, gpt, llm, chat
Abstract: Since the advent of ChatGPT, Large Language Models (LLMs) have excelled in various tasks but remain largely as black-box systems. Consequently, their development relies heavily on data-driven approaches, limiting performance enhancement through changes in internal architecture and reasoning pathways. As a result, many researchers have begun exploring the potential internal mechanisms of LLMs, aiming to identify the essence of their reasoning bottlenecks, with most studies focusing on attention heads. Our survey aims to shed light on the internal reasoning processes of LLMs by concentrating on the interpretability and underlying mechanisms of attention heads. We first distill the human thought process into a four-stage framework: Knowledge Recalling, In-Context Identification, Latent Reasoning, and Expression Preparation. Using this framework, we systematically review existing research to identify and categorize the functions of specific attention heads. Furthermore, we summarize the experimental methodologies used to discover these special heads, dividing them into two categories: Modeling-Free methods and Modeling-Required methods. Also, we outline relevant evaluation methods and benchmarks. Finally, we discuss the limitations of current research and propose several potential future directions. Our reference list is open-sourced at this https URL.
摘要：自 ChatGPT 问世以来，大型语言模型 (LLM) 在各种任务中表现出色，但在很大程度上仍然是黑箱系统。因此，它们的开发严重依赖于数据驱动的方法，限制了通过改变内部架构和推理路径来提高性能。因此，许多研究人员开始探索 LLM 的潜在内部机制，旨在找出其推理瓶颈的本质，大多数研究都集中在注意力头上。我们的调查旨在通过专注于注意力头的可解释性和潜在机制来阐明 LLM 的内部推理过程。我们首先将人类思维过程提炼为一个四阶段框架：知识回忆、上下文识别、潜在推理和表达准备。利用这个框架，我们系统地回顾现有的研究，以识别和分类特定注意力头的功能。此外，我们总结了用于发现这些特殊头部的实验方法，将它们分为两类：无建模方法和需要建模的方法。此外，我们概述了相关的评估方法和基准。最后，我们讨论了当前研究的局限性并提出了几个潜在的未来发展方向。我们的参考文献列表在此 https URL 上开源。

Title: WildVis: Open Source Visualizer for Million-Scale Chat Logs in the Wild

Authors: Yuntian Deng, Wenting Zhao, Jack Hessel, Xiang Ren, Claire Cardie, Yejin Choi
Subjects: cs.CL, cs.AI, cs.HC, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2409.03753
Pdf URL: https://arxiv.org/pdf/2409.03753
Copy Paste: [[2409.03753]] WildVis: Open Source Visualizer for Million-Scale Chat Logs in the Wild(https://arxiv.org/abs/2409.03753)
Keywords: chat
Abstract: The increasing availability of real-world conversation data offers exciting opportunities for researchers to study user-chatbot interactions. However, the sheer volume of this data makes manually examining individual conversations impractical. To overcome this challenge, we introduce WildVis, an interactive tool that enables fast, versatile, and large-scale conversation analysis. WildVis provides search and visualization capabilities in the text and embedding spaces based on a list of criteria. To manage million-scale datasets, we implemented optimizations including search index construction, embedding precomputation and compression, and caching to ensure responsive user interactions within seconds. We demonstrate WildVis's utility through three case studies: facilitating chatbot misuse research, visualizing and comparing topic distributions across datasets, and characterizing user-specific conversation patterns. WildVis is open-source and designed to be extendable, supporting additional datasets and customized search and visualization functionalities.
摘要：现实世界对话数据的日益增多为研究人员研究用户与聊天机器人之间的互动提供了令人兴奋的机会。然而，这些数据的庞大数量使得手动检查单个对话变得不切实际。为了克服这一挑战，我们推出了 WildVis，这是一种交互式工具，可以实现快速、多功能和大规模的对话分析。WildVis 根据一系列标准在文本和嵌入空间中提供搜索和可视化功能。为了管理百万级数据集，我们实施了优化，包括搜索索引构建、嵌入预计算和压缩以及缓存，以确保在几秒钟内响应用户交互。我们通过三个案例研究展示了 WildVis 的实用性：促进聊天机器人滥用研究、可视化和比较数据集中的主题分布以及描述用户特定的对话模式。WildVis 是开源的，旨在可扩展，支持其他数据集和自定义搜索和可视化功能。