2024-02-20

Title: Taxonomy-based CheckList for Large Language Model Evaluation

Authors: Damin Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.10899
Pdf URL: https://arxiv.org/pdf/2402.10899
Copy Paste: [[2402.10899]] Taxonomy-based CheckList for Large Language Model Evaluation(https://arxiv.org/abs/2402.10899)
Keywords: language model, llm
Abstract: As large language models (LLMs) have been used in many downstream tasks, the internal stereotypical representation may affect the fairness of the outputs. In this work, we introduce human knowledge into natural language interventions and study pre-trained language models' (LMs) behaviors within the context of gender bias. Inspired by CheckList behavioral testing, we present a checklist-style task that aims to probe and quantify LMs' unethical behaviors through question-answering (QA). We design three comparison studies to evaluate LMs from four aspects: consistency, biased tendency, model preference, and gender preference switch. We probe one transformer-based QA model trained on SQuAD-v2 dataset and one autoregressive large language model. Our results indicate that transformer-based QA model's biased tendency positively correlates with its consistency, whereas LLM shows the opposite relation. Our proposed task provides the first dataset that involves human knowledge for LLM bias evaluation.
摘要：由于大型语言模型（LLM）已在许多下游任务中使用，内部刻板表示可能会影响输出的公平性。在这项工作中，我们将人类知识引入自然语言干预中，并研究性别偏见背景下预先训练的语言模型（LM）行为。受 CheckList 行为测试的启发，我们提出了一个清单式的任务，旨在通过问答（QA）来探索和量化 LM 的不道德行为。我们设计了三项比较研究，从一致性、偏差倾向、模型偏好和性别偏好转换四个方面评估 LM。我们探讨了一种在 SQuAD-v2 数据集上训练的基于 Transformer 的 QA 模型和一种自回归大型语言模型。我们的结果表明，基于 Transformer 的 QA 模型的偏差趋势与其一致性呈正相关，而 LLM 则显示出相反的关系。我们提出的任务提供了第一个涉及 LLM 偏差评估人类知识的数据集。

Title: LLM-Assisted Crisis Management: Building Advanced LLM Platforms for Effective Emergency Response and Public Collaboration

Authors: Hakan T. Otal, M. Abdullah Canbaz
Subjects: cs.CL, cs.AI, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2402.10908
Pdf URL: https://arxiv.org/pdf/2402.10908
Copy Paste: [[2402.10908]] LLM-Assisted Crisis Management: Building Advanced LLM Platforms for Effective Emergency Response and Public Collaboration(https://arxiv.org/abs/2402.10908)
Keywords: language model, llm
Abstract: Emergencies and critical incidents often unfold rapidly, necessitating a swift and effective response. In this research, we introduce a novel approach to identify and classify emergency situations from social media posts and direct emergency messages using an open source Large Language Model, LLAMA2. The goal is to harness the power of natural language processing and machine learning to assist public safety telecommunicators and huge crowds during countrywide emergencies. Our research focuses on developing a language model that can understand how users describe their situation in a 911 call, enabling LLAMA2 to analyze the content and offer relevant instructions to the telecommunicator, while also creating workflows to notify government agencies with the caller's information when necessary. Another benefit this language model provides is its ability to assist people during a significant emergency incident when the 911 system is overwhelmed, by assisting the users with simple instructions and informing authorities with their location and emergency information.
摘要：紧急情况和重大事件往往迅速发生，需要迅速有效的应对。在这项研究中，我们引入了一种新颖的方法，使用开源大语言模型 LLAMA2 从社交媒体帖子和直接紧急消息中识别和分类紧急情况。目标是利用自然语言处理和机器学习的力量，在全国紧急情况下为公共安全电信人员和大量人群提供帮助。我们的研究重点是开发一种语言模型，可以理解用户在 911 呼叫中描述的情况，使 LLAMA2 能够分析内容并向电信员提供相关说明，同时还创建工作流程以在必要时向政府机构通知呼叫者的信息。该语言模型提供的另一个好处是，当 911 系统不堪重负时，它能够在重大紧急事件中为人们提供帮助，通过简单的指令协助用户并向当局通报他们的位置和紧急信息。

Title: Advances and Limitations in Open Source Arabic-Script OCR: A Case Study

Authors: Benjamin Kiessling (PSL), Gennady Kurin, Matthew Thomas Miller, Kader Smail
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2402.10943
Pdf URL: https://arxiv.org/pdf/2402.10943
Copy Paste: [[2402.10943]] Advances and Limitations in Open Source Arabic-Script OCR: A Case Study(https://arxiv.org/abs/2402.10943)
Keywords: language model
Abstract: This work presents an accuracy study of the open source OCR engine, Kraken, on the leading Arabic scholarly journal, al-Abhath. In contrast with other commercially available OCR engines, Kraken is shown to be capable of producing highly accurate Arabic-script OCR. The study also assesses the relative accuracy of typeface-specific and generalized models on the al-Abhath data and provides a microanalysis of the "error instances" and the contextual features that may have contributed to OCR misrecognition. Building on this analysis, the paper argues that Arabic-script OCR can be significantly improved through (1) a more systematic approach to training data production, and (2) the development of key technological components, especially multi-language models and improved line segmentation and layout analysis. Cet article présente une étude d'exactitude du moteur ROC open source, Kraken, sur la revue académique arabe de premier rang, al-Abhath. Contrairement à d'autres moteurs ROC disponibles sur le marché, Kraken se révèle être capable de produire de la ROC extrêmement exacte de l'écriture arabe. L'étude évalue aussi l'exactitude relative des modèles spécifiquement configurés à des polices et celle des modèles généralisés sur les données d'al-Abhath et fournit une microanalyse des "occurrences d'erreurs", ainsi qu'une microanalyse des éléments contextuels qui pourraient avoir contribué à la méreconnaissance ROC. S'appuyant sur cette analyse, cet article fait valoir que la ROC de l'écriture arabe peut être considérablement améliorée grâce à (1) une approche plus systématique d'entraînement de la production de données et (2) grâce au développement de composants technologiques fondamentaux, notamment l'amélioration des modèles multilingues, de la segmentation de ligne et de l'analyse de la mise en page.
摘要：这项工作在领先的阿拉伯学术期刊 al-Abhath 上介绍了开源 OCR 引擎 Kraken 的准确性研究。与其他商用 OCR 引擎相比，Kraken 能够生成高度准确的阿拉伯文字 OCR。该研究还评估了 al-Abhath 数据上特定字体和通用模型的相对准确性，并对“错误实例”和可能导致 OCR 误识别的上下文特征进行了微观分析。基于这一分析，本文认为，阿拉伯文字 OCR 可以通过以下方式得到显著改进：(1) 更系统的训练数据生成方法，以及 (2) 关键技术组件的开发，特别是多语言模型和改进的行分割和布局分析。

Title: CultureLLM: Incorporating Cultural Differences into Large Language Models

Authors: Cheng Li, Mengzhou Chen, Jindong Wang, Sunayana Sitaram, Xing Xie
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.10946
Pdf URL: https://arxiv.org/pdf/2402.10946
Copy Paste: [[2402.10946]] CultureLLM: Incorporating Cultural Differences into Large Language Models(https://arxiv.org/abs/2402.10946)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) are reported to be partial to certain cultures owing to the training data dominance from the English corpora. Since multilingual cultural data are often expensive to collect, existing efforts handle this by prompt engineering or culture-specific pre-training. However, they might overlook the knowledge deficiency of low-resource culture and require extensive computing resources. In this paper, we propose CultureLLM, a cost-effective solution to incorporate cultural differences into LLMs. CultureLLM adopts World Value Survey (WVS) as seed data and generates semantically equivalent training data via the proposed semantic data augmentation. Using only 50 seed samples from WVS with augmented data, we fine-tune culture-specific LLMs and one unified model (CultureLLM-One) for 9 cultures covering rich and low-resource languages. Extensive experiments on 60 culture-related datasets demonstrate that CultureLLM significantly outperforms various counterparts such as GPT-3.5 (by 8.1%) and Gemini Pro (by 9.5%) with comparable performance to GPT-4 or even better. Our human study shows that the generated samples are semantically equivalent to the original samples, providing an effective solution for LLMs augmentation.
摘要：据报道，由于英语语料库的训练数据占主导地位，大型语言模型（LLM）偏向于某些文化。由于多语言文化数据的收集成本通常很高，因此现有的工作通过提示工程或针对特定文化的预训练来解决这个问题。然而，它们可能忽视了低资源文化的知识缺陷，并且需要大量的计算资源。在本文中，我们提出了 CultureLLM，这是一种将文化差异纳入 LLM 的经济有效的解决方案。CultureLLM 采用世界价值观调查（WVS）作为种子数据，并通过所提出的语义数据增强生成语义等效的训练数据。我们仅使用来自 WVS 的 50 个种子样本和增强数据，对特定文化的 LLM 和一个统一模型 (CultureLLM-One) 进行了微调，适用于涵盖资源丰富和资源匮乏语言的 9 种文化。对 60 个文化相关数据集的大量实验表明，CultureLLM 的性能显著优于 GPT-3.5（8.1%）和 Gemini Pro（9.5%）等各种对应模型，其性能与 GPT-4 相当甚至更好。我们的人类研究表明，生成的样本在语义上与原始样本等效，为 LLM 增强提供了有效的解决方案。

Title: Zero-shot Explainable Mental Health Analysis on Social Media by incorporating Mental Scales

Authors: Wenyu Li, Yinuo Zhu, Xin Lin, Ming Li, Ziyue Jiang, Ziqian Zeng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.10948
Pdf URL: https://arxiv.org/pdf/2402.10948
Copy Paste: [[2402.10948]] Zero-shot Explainable Mental Health Analysis on Social Media by incorporating Mental Scales(https://arxiv.org/abs/2402.10948)
Keywords: language model, llm
Abstract: Traditional discriminative approaches in mental health analysis are known for their strong capacity but lack interpretability and demand large-scale annotated data. On the other hand, generative approaches, such as those based on large language models (LLMs), have the potential to get rid of heavy annotations and provide explanations. However, their capabilities still fall short compared to discriminative approaches, and their explanations may be unreliable due to the fact that the generation of explanation is a black-box process. Inspired by the psychological assessment practice of using scales to evaluate mental states, our method incorporates two procedures via LLMs. First, the patient completes mental health questionnaires, and second, the psychologist interprets the collected information from the mental health questions and makes informed decisions. Experimental results show that our method outperforms other zero-shot methods. Our method can generate more rigorous explanation based on the outputs of mental questionnaires.
摘要：心理健康分析中的传统判别方法以其强大的能力而闻名，但缺乏可解释性并且需要大规模的注释数据。另一方面，生成方法，例如基于大型语言模型（LLM）的方法，有可能摆脱繁重的注释并提供解释。然而，与判别性方法相比，它们的能力仍然存在不足，并且由于解释的生成是一个黑盒过程，因此它们的解释可能不可靠。受到使用量表评估心理状态的心理评估实践的启发，我们的方法通过 LLM 结合了两个程序。首先，患者填写心理健康问卷，其次，心理学家解释从心理健康问题中收集到的信息并做出明智的决定。实验结果表明，我们的方法优于其他零样本方法。我们的方法可以根据心理问卷的输出生成更严格的解释。

Title: The Unreasonable Effectiveness of Eccentric Automatic Prompts

Authors: Rick Battle, Teja Gollapudi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.10949
Pdf URL: https://arxiv.org/pdf/2402.10949
Copy Paste: [[2402.10949]] The Unreasonable Effectiveness of Eccentric Automatic Prompts(https://arxiv.org/abs/2402.10949)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have demonstrated remarkable problem-solving and basic mathematics abilities. However, their efficacy is highly contingent on the formulation of the prompt. This study endeavors to quantify the influence of incorporating "positive thinking" into the system message of the prompt, then compare that to systematic prompt optimization. We assess the performance of 60 combinations of system message snippets, tested with and without Chain of Thought prompting, across three models with parameters ranging from 7 to 70 billion on the GSM8K dataset. Our findings reveal that results do not universally generalize across models. In most instances, the inclusion of "positive thinking" prompts positively affected model performance. Notably, however, Llama2-70B exhibited an exception when not utilizing Chain of Thought, as the optimal system message was found to be none at all. Given the combinatorial complexity, and thus computation time, of experimenting with hand-tuning prompts for large black-box models, we then compared the performance of the best "positive thinking" prompt against the output of systematic prompt optimization. We show that employing an automated prompt optimizer emerges as the most effective method for enhancing performance, even when working with smaller open-source models. Additionally, our findings reveal that the highest-scoring, automatically-optimized prompt exhibits a degree of peculiarity far beyond expectations.
摘要：大型语言模型 (LLM) 已表现出卓越的解决问题和基本数学能力。然而，它们的功效很大程度上取决于提示的制定。本研究试图量化将“积极思考”纳入提示系统信息的影响，然后将其与系统提示优化进行比较。我们评估了 60 种系统消息片段组合的性能，在有或没有思想链提示的情况下进行了测试，在 GSM8K 数据集上参数范围从 7 到 700 亿的三个模型中进行了测试。我们的研究结果表明，结果并不能普遍推广到各个模型。在大多数情况下，包含“积极思考”会对模型性能产生积极影响。然而值得注意的是，Llama2-70B 在不使用思想链时表现出异常，因为发现最佳系统消息根本没有。考虑到对大型黑盒模型进行手动调整提示的组合复杂性和计算时间，然后我们将最佳“积极思考”提示的性能与系统提示优化的输出进行比较。我们表明，即使在使用较小的开源模型时，使用自动提示优化器也是提高性能的最有效方法。此外，我们的研究结果表明，得分最高、自动优化的提示表现出一定程度的特殊性，远远超出预期。

Title: DAEDRA: A language model for predicting outcomes in passive pharmacovigilance reporting

Authors: Chris von Csefalvay
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2402.10951
Pdf URL: https://arxiv.org/pdf/2402.10951
Copy Paste: [[2402.10951]] DAEDRA: A language model for predicting outcomes in passive pharmacovigilance reporting(https://arxiv.org/abs/2402.10951)
Keywords: language model, llm
Abstract: Over the recent years, the emergence of large language models (LLMs) has given rise to a proliferation of domain-specific models that are intended to reflect the particularities of linguistic context and content as a correlate of the originating domain. This paper details the conception, design, training and evaluation of DAEDRA, an LLM designed to detect regulatory-relevant outcomes (mortality, ER attendance and hospitalisation) in adverse event reports elicited through passive reporting (PR). While PR is a highly cost-efficient way of eliciting information from a wide and diverse audience -- typically including not only physicians and healthcare providers but also patients, family members and other lay stakeholders --, this diversity makes PR corpora difficult to analyse. Generic language models may not capture the complex clinical dimensions while specific clinical or biomedical models may not perform well on lay reports. To evaluate the utility of a subdomain-specific language model, an adaptive training approach was adopted, wherein base language model candidates were evaluated on a subset of the corpus, and the best performer was trained on the entire corpus. This yielded a small but significant improvement in $F_1$ (+1%), precision (+2.5%) and recall (+3.8%), at a relatively low training cost and a single-day training time. Subdomain-specific LLMs continue to be viable options for better results when analysing highly specialised corpora.
摘要：近年来，大型语言模型（LLM）的出现引发了特定领域模型的激增，这些模型旨在反映语言背景和内容的特殊性作为原始领域的相关性。本文详细介绍了 DAEDRA 的概念、设计、训练和评估，DAEDRA 是一个 LLM，旨在检测通过被动报告 (PR) 引发的不良事件报告中的监管相关结果（死亡率、急诊就诊和住院治疗）。虽然 PR 是一种从广泛且多样化的受众（通常不仅包括医生和医疗保健提供者，还包括患者、家庭成员和其他非专业利益相关者）获取信息的极具成本效益的方式，但这种多样性使得 PR 语料库难以分析。通用语言模型可能无法捕捉复杂的临床维度，而特定的临床或生物医学模型可能无法在外行报告中表现良好。为了评估子域特定语言模型的实用性，采用了自适应训练方法，其中在语料库的子集上评估基本语言模型候选者，并在整个语料库上训练表现最好的模型。这以相对较低的训练成本和单日训练时间，在 $F_1$ (+1%)、精确度 (+2.5%) 和召回率 (+3.8%) 方面产生了小但显著的改进。在分析高度专业化的语料库时，特定子领域的 LLM 仍然是获得更好结果的可行选择。

Title: Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts

Authors: Yueqin Yin, Zhendong Wang, Yi Gu, Hai Huang, Weizhu Chen, Mingyuan Zhou
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.10958
Pdf URL: https://arxiv.org/pdf/2402.10958
Copy Paste: [[2402.10958]] Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts(https://arxiv.org/abs/2402.10958)
Keywords: language model, llm, prompt
Abstract: In the field of large language models (LLMs), aligning models with the diverse preferences of users is a critical challenge. Direct Preference Optimization (DPO) has played a key role in this area. It works by using pairs of preferences derived from the same prompts, and it functions without needing an additional reward model. However, DPO does not fully reflect the complex nature of human learning, which often involves understanding contrasting responses to not only identical but also similar questions. To overcome this shortfall, we propose Relative Preference Optimization (RPO). RPO is designed to discern between more and less preferred responses derived from both identical and related prompts. It introduces a contrastive weighting mechanism, enabling the tuning of LLMs using a broader range of preference data, including both paired and unpaired sets. This approach expands the learning capabilities of the model, allowing it to leverage insights from a more varied set of prompts. Through empirical tests, including dialogue and summarization tasks, and evaluations using the AlpacaEval2.0 leaderboard, RPO has demonstrated a superior ability to align LLMs with user preferences and to improve their adaptability during the training process. The PyTorch code necessary to reproduce the results presented in the paper will be made available on GitHub for public access.
摘要：在大语言模型（LLM）领域，使模型与用户的不同偏好保持一致是一项严峻的挑战。直接偏好优化（DPO）在这一领域发挥了关键作用。它的工作原理是使用源自相同提示的偏好对，并且无需额外的奖励模型即可发挥作用。然而，DPO 并没有完全反映人类学习的复杂本质，人类学习通常涉及理解对相同问题和相似问题的不同反应。为了克服这一不足，我们提出了相对偏好优化（RPO）。RPO 旨在区分源自相同和相关提示的更受欢迎和更不受欢迎的响应。它引入了对比加权机制，可以使用更广泛的偏好数据（包括配对和不配对的数据集）来调整 LLM。这种方法扩展了模型的学习能力，使其能够利用来自更多样化提示的见解。通过包括对话和总结任务在内的实证测试以及使用 AlpacaEval2.0 排行榜的评估，RPO 展示了使 LLM 与用户偏好保持一致并提高其在训练过程中的适应性的卓越能力。重现论文中提出的结果所需的 PyTorch 代码将在 GitHub 上提供以供公众访问。
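
Note: The abstract describes DPO as learning from preference pairs drawn from the same prompt, with no separate reward model. As an illustration only, here is a minimal sketch of the standard per-pair DPO loss, assuming the sequence log-probabilities under the policy and a frozen reference model have already been computed. The function name, argument names, and the default beta are hypothetical. This is not RPO: the abstract does not specify RPO's contrastive weighting, so it is not reproduced here.

```python
import math


def dpo_pair_loss(policy_logp_chosen, policy_logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair on the same prompt.

    All inputs are summed log-probabilities of a full response.
    beta is an assumed temperature; it is not taken from the abstract.
    """
    # How much more the policy favors each response than the reference does.
    chosen_gain = policy_logp_chosen - ref_logp_chosen
    rejected_gain = policy_logp_rejected - ref_logp_rejected
    z = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(z)) written in a numerically stable form.
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# The loss equals log(2) when the policy matches the reference, and it
# shrinks as the policy shifts probability toward the chosen response.
baseline = dpo_pair_loss(-1.0, -1.0, -1.0, -1.0)
improved = dpo_pair_loss(-1.0, -3.0, -2.0, -2.0)
```

In this form the implicit reward is the log-ratio between policy and reference, which is why no separate reward model appears in the computation.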

Title: Measuring and Controlling Persona Drift in Language Model Dialogs

Authors: Kenneth Li, Tianle Liu, Naomi Bashkansky, David Bau, Fernanda Viégas, Hanspeter Pfister, Martin Wattenberg
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.10962
Pdf URL: https://arxiv.org/pdf/2402.10962
Copy Paste: [[2402.10962]] Measuring and Controlling Persona Drift in Language Model Dialogs(https://arxiv.org/abs/2402.10962)
Keywords: language model, prompt, chat
Abstract: Prompting is a standard tool for customizing language-model chatbots, enabling them to take on a specific "persona". An implicit assumption in the use of prompts is that they will be stable, so the chatbot will continue to generate text according to the stipulated persona for the duration of a conversation. We propose a quantitative benchmark to test this assumption, evaluating persona stability via self-chats between two personalized chatbots. Testing popular models like LLaMA2-chat-70B, we reveal a significant persona drift within eight rounds of conversations. An empirical and theoretical analysis of this phenomenon suggests the transformer attention mechanism plays a role, due to attention decay over long exchanges. To combat attention decay and persona drift, we propose a lightweight method called split-softmax, which compares favorably against two strong baselines.
摘要：提示是用于定制语言模型聊天机器人的标准工具，使它们能够呈现特定的“角色”。使用提示的隐含假设是它们是稳定的，因此聊天机器人将在对话期间继续根据规定的角色生成文本。我们提出了一个定量基准来测试这一假设，通过两个个性化聊天机器人之间的自聊天来评估角色稳定性。通过测试 LLaMA2-chat-70B 等流行模型，我们发现在八轮对话中存在显著的角色漂移。对这种现象的实证和理论分析表明，由于长时间交流中的注意力衰减，Transformer 注意力机制发挥了作用。为了对抗注意力衰减和角色漂移，我们提出了一种称为 split-softmax 的轻量级方法，它与两个强大的基线相比具有优势。

Title: GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements

Authors: Alex Havrilla, Sharath Raparthy, Christoforus Nalmpantis, Jane Dwivedi-Yu, Maksym Zhuravinskyi, Eric Hambro, Roberta Raileanu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2402.10963
Pdf URL: https://arxiv.org/pdf/2402.10963
Copy Paste: [[2402.10963]] GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements(https://arxiv.org/abs/2402.10963)
Keywords: language model, llm
Abstract: State-of-the-art language models can exhibit impressive reasoning refinement capabilities on math, science or coding tasks. However, recent work demonstrates that even the best models struggle to identify when and where to refine without access to external feedback. Outcome-based Reward Models (ORMs), trained to predict correctness of the final answer indicating when to refine, offer one convenient solution for deciding when to refine. Process Based Reward Models (PRMs), trained to predict correctness of intermediate steps, can then be used to indicate where to refine. But they are expensive to train, requiring extensive human annotations. In this paper, we propose Stepwise ORMs (SORMs) which are trained, only on synthetic data, to approximate the expected future reward of the optimal policy or $V^{\star}$. More specifically, SORMs are trained to predict the correctness of the final answer when sampling the current policy many times (rather than only once as in the case of ORMs). Our experiments show that SORMs can more accurately detect incorrect reasoning steps compared to ORMs, thus improving downstream accuracy when doing refinements. We then train global refinement models, which take only the question and a draft solution as input and predict a corrected solution, and local refinement models which also take as input a critique indicating the location of the first reasoning error. We generate training data for both models synthetically by reusing data used to train the SORM. We find combining global and local refinements, using the ORM as a reranker, significantly outperforms either one individually, as well as a best of three sample baseline. With this strategy we can improve the accuracy of a LLaMA-2 13B model (already fine-tuned with RL) on GSM8K from 53% to 65% when greedily sampled.
摘要：最先进的语言模型可以在数学、科学或编码任务中展现出令人印象深刻的推理细化能力。然而，最近的研究表明，即使是最好的模型，在无法获得外部反馈的情况下也很难确定何时何地进行优化。基于结果的奖励模型（ORMs）经过训练来预测最终答案的正确性（指示何时进行优化），为决定何时进行优化提供了一种方便的解决方案。基于流程的奖励模型（PRMs）经过训练可以预测中间步骤的正确性，然后可以用来指示需要改进的地方。但它们的训练成本很高，需要大量的人工注释。在本文中，我们提出了逐步 ORM（SORMs），仅在合成数据上进行训练，以近似最优策略或 $V^{\star}$ 的预期未来奖励。更具体地说，SORM 经过训练，可以在对当前策略进行多次采样（而不是像 ORM 那样只采样一次）时预测最终答案的正确性。我们的实验表明，与 ORM 相比，SORM 可以更准确地检测到不正确的推理步骤，从而提高进行细化时的下游准确性。然后，我们训练全局（global）细化模型，该模型仅将问题和草稿解决方案作为输入并预测校正后的解决方案，以及局部（local）细化模型，该模型还将指出第一个推理错误位置的批评作为输入。我们通过重用用于训练 SORM 的数据来综合生成两个模型的训练数据。我们发现，使用 ORM 作为重新排序器，结合全局和局部改进，显著优于其中任何一个单独的改进，以及三个样本基线中的最佳结果。通过此策略，我们可以将 GSM8K 上的 LLaMA-2 13B 模型（已使用 RL 进行微调）的贪婪采样精度从 53% 提高到 65%。

Title: Generalization in Healthcare AI: Evaluation of a Clinical Large Language Model

Authors: Salman Rahman, Lavender Yao Jiang, Saadia Gabriel, Yindalon Aphinyanaphongs, Eric Karl Oermann, Rumi Chunara
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2402.10965
Pdf URL: https://arxiv.org/pdf/2402.10965
Copy Paste: [[2402.10965]] Generalization in Healthcare AI: Evaluation of a Clinical Large Language Model(https://arxiv.org/abs/2402.10965)
Keywords: language model, llm
Abstract: Advances in large language models (LLMs) provide new opportunities in healthcare for improved patient care, clinical decision-making, and enhancement of physician and administrator workflows. However, the potential of these models importantly depends on their ability to generalize effectively across clinical environments and populations, a challenge often underestimated in early development. To better understand reasons for these challenges and inform mitigation approaches, we evaluated ClinicLLM, an LLM trained on [HOSPITAL]'s clinical notes, analyzing its performance on 30-day all-cause readmission prediction focusing on variability across hospitals and patient characteristics. We found poorer generalization particularly in hospitals with fewer samples, among patients with government and unspecified insurance, the elderly, and those with high comorbidities. To understand reasons for lack of generalization, we investigated sample sizes for fine-tuning, note content (number of words per note), patient characteristics (comorbidity level, age, insurance type, borough), and health system aspects (hospital, all-cause 30-day readmission, and mortality rates). We used descriptive statistics and supervised classification to identify features. We found that, along with sample size, patient age, number of comorbidities, and the number of words in notes are all important factors related to generalization. Finally, we compared local fine-tuning (hospital specific), instance-based augmented fine-tuning and cluster-based fine-tuning for improving generalization. Among these, local fine-tuning proved most effective, increasing AUC by 0.25% to 11.74% (most helpful in settings with limited data). Overall, this study provides new insights for enhancing the deployment of large language models in the societally important domain of healthcare, and improving their performance for broader populations.
摘要：大语言模型 (LLM) 的进步为医疗保健领域提供了新的机会，可以改善患者护理、临床决策以及增强医生和管理人员的工作流程。然而，这些模型的潜力在很大程度上取决于它们在临床环境和人群中有效泛化的能力，这是早期开发中经常被低估的挑战。为了更好地了解这些挑战的原因并为缓解方法提供信息，我们评估了 ClinicLLM（一个基于 [HOSPITAL] 临床记录训练的 LLM），分析了其在 30 天全因再入院预测方面的表现，重点关注医院之间的差异和患者特征。我们发现泛化能力较差，尤其是在样本较少的医院、有政府保险和未指定保险的患者、老年人和合并症较高的患者中。为了了解缺乏泛化能力的原因，我们调查了微调的样本量、记录内容（每条记录的字数）、患者特征（合并症水平、年龄、保险类型、行政区）和卫生系统方面（医院、全因 30 天再入院率和死亡率）。我们使用描述性统计和监督分类来识别特征。我们发现，除了样本量之外，患者年龄、合并症数量以及记录中的字数都是与泛化相关的重要因素。最后，我们比较了局部微调（特定于医院）、基于实例的增强微调和基于集群的微调，以提高泛化能力。其中，局部微调被证明是最有效的，将 AUC 提高了 0.25% 至 11.74%（在数据有限的情况下最有帮助）。总体而言，这项研究为加强大型语言模型在具有重要社会意义的医疗保健领域的部署以及提高其在更广泛人群中的表现提供了新的见解。

Title: Generative AI and Process Systems Engineering: The Next Frontier

Authors: Benjamin Decardi-Nelson, Abdulelah S. Alshehri, Akshay Ajagekar, Fengqi You
Subjects: cs.LG, eess.SY, math.OC
Abstract URL: https://arxiv.org/abs/2402.10977
Pdf URL: https://arxiv.org/pdf/2402.10977
Copy Paste: [[2402.10977]] Generative AI and Process Systems Engineering: The Next Frontier(https://arxiv.org/abs/2402.10977)
Keywords: language model, llm
Abstract: This article explores how emerging generative artificial intelligence (GenAI) models, such as large language models (LLMs), can enhance solution methodologies within process systems engineering (PSE). These cutting-edge GenAI models, particularly foundation models (FMs), which are pre-trained on extensive, general-purpose datasets, offer versatile adaptability for a broad range of tasks, including responding to queries, image generation, and complex decision-making. Given the close relationship between advancements in PSE and developments in computing and systems technologies, exploring the synergy between GenAI and PSE is essential. We begin our discussion with a compact overview of both classic and emerging GenAI models, including FMs, and then dive into their applications within key PSE domains: synthesis and design, optimization and integration, and process monitoring and control. In each domain, we explore how GenAI models could potentially advance PSE methodologies, providing insights and prospects for each area. Furthermore, the article identifies and discusses potential challenges in fully leveraging GenAI within PSE, including multiscale modeling, data requirements, evaluation metrics and benchmarks, and trust and safety, thereby deepening the discourse on effective GenAI integration into systems analysis, design, optimization, operations, monitoring, and control. This paper provides a guide for future research focused on the applications of emerging GenAI in PSE.
摘要：本文探讨了新兴的生成人工智能 (GenAI) 模型（例如大语言模型 (LLM)）如何增强流程系统工程 (PSE) 中的解决方案方法。这些尖端的 GenAI 模型，特别是基础模型 (FM)，在广泛的通用数据集上进行了预训练，为广泛的任务提供了多功能的适应性，包括响应查询、图像生成和复杂的决策。鉴于 PSE 的进步与计算和系统技术的发展之间的密切关系，探索 GenAI 和 PSE 之间的协同作用至关重要。我们首先对经典和新兴 GenAI 模型（包括 FM）进行简要概述，然后深入探讨它们在关键 PSE 领域的应用：综合和设计、优化和集成以及过程监控和控制。在每个领域，我们探索 GenAI 模型如何潜在地推进 PSE 方法，为每个领域提供见解和前景。此外，本文还确定并讨论了在 PSE 中充分利用 GenAI 的潜在挑战，包括多尺度建模、数据要求、评估指标和基准以及信任和安全，从而加深了关于将 GenAI 有效集成到系统分析、设计、优化、运营、监视和控制中的讨论。本文为未来关注新兴 GenAI 在 PSE 中的应用的研究提供了指导。

Title: Language Models with Conformal Factuality Guarantees

Authors: Christopher Mohri, Tatsunori Hashimoto
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2402.10978
Pdf URL: https://arxiv.org/pdf/2402.10978
Copy Paste: [[2402.10978]] Language Models with Conformal Factuality Guarantees(https://arxiv.org/abs/2402.10978)
Keywords: language model
Abstract: Guaranteeing the correctness and factuality of language model (LM) outputs is a major open problem. In this work, we propose conformal factuality, a framework that can ensure high probability correctness guarantees for LMs by connecting language modeling and conformal prediction. We observe that the correctness of an LM output is equivalent to an uncertainty quantification problem, where the uncertainty sets are defined as the entailment set of an LM's output. Using this connection, we show that conformal prediction in language models corresponds to a back-off algorithm that provides high probability correctness guarantees by progressively making LM outputs less specific (and expanding the associated uncertainty sets). This approach applies to any black-box LM and requires very few human-annotated samples. Evaluations of our approach on closed book QA (FActScore, NaturalQuestions) and reasoning tasks (MATH) show that our approach can provide 80-90% correctness guarantees while retaining the majority of the LM's original output.
摘要：保证语言模型（LM）输出的正确性和真实性是一个主要的开放问题。在这项工作中，我们提出了共形事实性，这是一个通过连接语言建模和共形预测来确保 LM 的高概率正确性保证的框架。我们观察到，LM 输出的正确性相当于不确定性量化问题，其中不确定性集被定义为 LM 输出的蕴涵集。利用这种联系，我们表明语言模型中的保形预测对应于一种退避算法，该算法通过逐步降低 LM 输出的具体性（并扩展相关的不确定性集）来提供高概率正确性保证。这种方法适用于任何黑盒 LM，并且需要很少的人工注释样本。对我们的闭卷 QA（FActScore、NaturalQuestions）和推理任务（MATH）方法的评估表明，我们的方法可以提供 80-90% 的正确性保证，同时保留 LM 的大部分原始输出。
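
Note: The abstract describes a back-off procedure that makes outputs less specific until a conformal correctness guarantee holds. The sketch below is one possible reading of that idea, using ordinary split conformal prediction. It assumes each output has already been broken into sub-claims with a confidence score, and that calibration claims carry human correctness labels. The claim decomposition, the scoring, and all function names are hypothetical and not taken from the paper.

```python
import math


def worst_error_score(claims):
    """Highest confidence score among the incorrect claims of one output.

    claims: list of (score, is_correct). Returns -inf if every claim is
    correct, since no filtering is then needed for this output.
    """
    return max((score for score, is_correct in claims if not is_correct),
               default=float("-inf"))


def conformal_threshold(calibration, alpha):
    """Split-conformal quantile of the per-output worst-error scores."""
    scores = sorted(worst_error_score(claims) for claims in calibration)
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration outputs for this alpha: retain nothing.
        return float("inf")
    return scores[k - 1]


def back_off(claims, threshold):
    """Keep only claims scored strictly above the calibrated threshold."""
    return [(score, text) for score, text in claims if score > threshold]


# Three labeled calibration outputs (toy data, purely illustrative).
calibration = [
    [(0.9, True), (0.2, False)],
    [(0.5, False), (0.7, True)],
    [(0.9, False)],
]
threshold = conformal_threshold(calibration, alpha=0.5)
kept = back_off([(0.8, "claim a"), (0.5, "claim b"), (0.3, "claim c")], threshold)
```

Under the usual exchangeability assumption, the standard split-conformal argument gives probability at least 1 - alpha that every retained claim in a new output is correct. Lowering alpha raises the threshold, so fewer claims are kept.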

Title: SportsMetrics: Blending Text and Numerical Data to Understand Information Fusion in LLMs

Authors: Yebowen Hu, Kaiqiang Song, Sangwoo Cho, Xiaoyang Wang, Hassan Foroosh, Dong Yu, Fei Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.10979
Pdf URL: https://arxiv.org/pdf/2402.10979
Copy Paste: [[2402.10979]] SportsMetrics: Blending Text and Numerical Data to Understand Information Fusion in LLMs(https://arxiv.org/abs/2402.10979)
Keywords: language model, llm
Abstract: Large language models hold significant potential for integrating various data types, such as text documents and database records, for advanced analytics. However, blending text and numerical data presents substantial challenges. LLMs need to process and cross-reference entities and numbers, handle data inconsistencies and redundancies, and develop planning capabilities such as building a working memory for managing complex data queries. In this paper, we introduce four novel tasks centered around sports data analytics to evaluate the numerical reasoning and information fusion capabilities of LLMs. These tasks involve providing LLMs with detailed, play-by-play sports game descriptions, then challenging them with adversarial scenarios such as new game rules, longer durations, scrambled narratives, and analyzing key statistics in game summaries. We conduct extensive experiments on NBA and NFL games to assess the performance of LLMs on these tasks. Our benchmark, SportsMetrics, introduces a new mechanism for assessing LLMs' numerical reasoning and fusion skills.
摘要：大型语言模型在集成各种数据类型（例如文本文档和数据库记录）以进行高级分析方面具有巨大的潜力。然而，混合文本和数字数据带来了巨大的挑战。LLM 需要处理和交叉引用实体和数字，处理数据不一致和冗余，并开发规划能力，例如构建用于管理复杂数据查询的工作记忆。在本文中，我们介绍了以体育数据分析为中心的四项新颖任务，以评估 LLM 的数值推理和信息融合能力。这些任务包括向 LLM 提供详细的、逐回合的体育比赛描述，然后用新的比赛规则、更长的持续时间、混乱的叙述等对抗性场景来挑战它们，并分析比赛摘要中的关键统计数据。我们对 NBA 和 NFL 比赛进行了大量实验，以评估 LLM 在这些任务上的表现。我们的基准 SportsMetrics 引入了一种评估 LLM 数值推理和融合技能的新机制。

Title: FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models

Authors: Gagan Bhatia, El Moatez Billah Nagoudi, Hasan Cavusoglu, Muhammad Abdul-Mageed
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.10986
Pdf URL: https://arxiv.org/pdf/2402.10986
Copy Paste: [[2402.10986]] FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models(https://arxiv.org/abs/2402.10986)
Keywords: language model, gpt, llm, hallucination, chat
Abstract: We introduce FinTral, a suite of state-of-the-art multimodal large language models (LLMs) built upon the Mistral-7b model and tailored for financial analysis. FinTral integrates textual, numerical, tabular, and image data. We enhance FinTral with domain-specific pretraining, instruction fine-tuning, and RLAIF training by exploiting a large collection of textual and visual datasets we curate for this work. We also introduce an extensive benchmark featuring nine tasks and 25 datasets for evaluation, including hallucinations in the financial domain. Our FinTral model trained with direct preference optimization employing advanced Tools and Retrieval methods, dubbed FinTral-DPO-T&R, demonstrates an exceptional zero-shot performance. It outperforms ChatGPT-3.5 in all tasks and surpasses GPT-4 in five out of nine tasks, marking a significant advancement in AI-driven financial technology. We also demonstrate that FinTral has the potential to excel in real-time analysis and decision-making in diverse financial contexts.
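Summary: We introduce FinTral, a suite of state-of-the-art multimodal large language models (LLMs) built on the Mistral-7b model and tailored for financial analysis. FinTral integrates textual, numerical, tabular, and image data. We enhance FinTral with domain-specific pretraining, instruction fine-tuning, and RLAIF training, drawing on a large collection of textual and visual datasets that we curated for this work. We also introduce an extensive benchmark of nine tasks and 25 datasets for evaluation, including hallucinations in the financial domain. Our FinTral model trained with direct preference optimization and employing advanced Tools and Retrieval methods, dubbed FinTral-DPO-T&R, demonstrates exceptional zero-shot performance. It outperforms ChatGPT-3.5 on all tasks and surpasses GPT-4 on five of the nine tasks, marking a significant advance in AI-driven financial technology. We also show that FinTral has the potential to excel at real-time analysis and decision-making in diverse financial contexts.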

Title: WilKE: Wise-Layer Knowledge Editor for Lifelong Knowledge Editing

Authors: Chenhui Hu, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.10987
Pdf URL: https://arxiv.org/pdf/2402.10987
Copy Paste: [[2402.10987]] WilKE: Wise-Layer Knowledge Editor for Lifelong Knowledge Editing(https://arxiv.org/abs/2402.10987)
Keywords: language model, gpt, llm
Abstract: Knowledge editing aims to rectify inaccuracies in large language models (LLMs) without costly retraining for outdated or erroneous knowledge. However, current knowledge editing methods primarily focus on single editing, failing to meet the requirements for lifelong editing. In this paper, lifelong editing is synonymous with lifelong knowledge editing. This study reveals a performance degradation encountered by knowledge editing in lifelong editing, characterized by toxicity buildup and toxicity flash, with the primary cause identified as pattern unmatch. We introduce a knowledge editing approach named WilKE, which selects editing layer based on the pattern matching degree of editing knowledge across different layers. Experimental results demonstrate that, in lifelong editing, WilKE exhibits an average improvement of 46.2% and 67.8% on editing GPT2-XL and GPT-J relative to state-of-the-art knowledge editing methods.
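Summary: Knowledge editing aims to correct inaccuracies in large language models (LLMs) without costly retraining for outdated or erroneous knowledge. However, current knowledge editing methods focus mainly on single edits and fail to meet the requirements of lifelong editing. In this paper, lifelong editing is synonymous with lifelong knowledge editing. This study reveals a performance degradation that knowledge editing encounters in lifelong editing, characterized by toxicity buildup and toxicity flash, whose primary cause is identified as pattern unmatch. We introduce a knowledge editing approach named WilKE, which selects the editing layer based on the pattern matching degree of the editing knowledge across different layers. Experimental results show that, in lifelong editing, WilKE achieves an average improvement of 46.2% and 67.8% on editing GPT2-XL and GPT-J, respectively, relative to state-of-the-art knowledge editing methods.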

Title: "Understanding AI": Semantic Grounding in Large Language Models

Authors: Holger Lyre
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.10992
Pdf URL: https://arxiv.org/pdf/2402.10992
Copy Paste: [[2402.10992]] "Understanding AI": Semantic Grounding in Large Language Models(https://arxiv.org/abs/2402.10992)
Keywords: language model, llm
Abstract: Do LLMs understand the meaning of the texts they generate? Do they possess a semantic grounding? And how could we understand whether and what they understand? I start the paper with the observation that we have recently witnessed a generative turn in AI, since generative models, including LLMs, are key for self-supervised learning. To assess the question of semantic grounding, I distinguish and discuss five methodological ways. The most promising way is to apply core assumptions of theories of meaning in philosophy of mind and language to LLMs. Grounding proves to be a gradual affair with a three-dimensional distinction between functional, social and causal grounding. LLMs show basic evidence in all three dimensions. A strong argument is that LLMs develop world models. Hence, LLMs are neither stochastic parrots nor semantic zombies, but already understand the language they generate, at least in an elementary sense.
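Summary: Do LLMs understand the meaning of the texts they generate? Do they possess a semantic grounding? And how could we find out whether and what they understand? I begin the paper by observing that we have recently witnessed a generative turn in AI, since generative models, including LLMs, are key to self-supervised learning. To assess the question of semantic grounding, I distinguish and discuss five methodological approaches. The most promising is to apply core assumptions of theories of meaning in the philosophy of mind and language to LLMs. Grounding proves to be a gradual affair, with a three-dimensional distinction between functional, social, and causal grounding. LLMs show basic evidence in all three dimensions. A strong argument is that LLMs develop world models. Hence, LLMs are neither stochastic parrots nor semantic zombies, but already understand the language they generate, at least in an elementary sense.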

Title: The Evolution of Statistical Induction Heads: In-Context Learning Markov Chains

Authors: Benjamin L. Edelman, Ezra Edelman, Surbhi Goel, Eran Malach, Nikolaos Tsilivis
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2402.11004
Pdf URL: https://arxiv.org/pdf/2402.11004
Copy Paste: [[2402.11004]] The Evolution of Statistical Induction Heads: In-Context Learning Markov Chains(https://arxiv.org/abs/2402.11004)
Keywords: language model
Abstract: Large language models have the ability to generate text that mimics patterns in their inputs. We introduce a simple Markov Chain sequence modeling task in order to study how this in-context learning (ICL) capability emerges. In our setting, each example is sampled from a Markov chain drawn from a prior distribution over Markov chains. Transformers trained on this task form statistical induction heads which compute accurate next-token probabilities given the bigram statistics of the context. During the course of training, models pass through multiple phases: after an initial stage in which predictions are uniform, they learn to sub-optimally predict using in-context single-token statistics (unigrams); then, there is a rapid phase transition to the correct in-context bigram solution. We conduct an empirical and theoretical investigation of this multi-phase process, showing how successful learning results from the interaction between the transformer's layers, and uncovering evidence that the presence of the simpler unigram solution may delay formation of the final bigram solution. We examine how learning is affected by varying the prior distribution over Markov chains, and consider the generalization of our in-context learning of Markov chains (ICL-MC) task to n-grams for n > 2.
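Summary: Large language models can generate text that mimics patterns in their inputs. We introduce a simple Markov chain sequence modeling task to study how this in-context learning (ICL) capability emerges. In our setting, each example is sampled from a Markov chain that is itself drawn from a prior distribution over Markov chains. Transformers trained on this task form statistical induction heads, which compute accurate next-token probabilities from the bigram statistics of the context. Over the course of training, models pass through multiple phases: after an initial stage in which predictions are uniform, they learn to predict sub-optimally using in-context single-token statistics (unigrams); then there is a rapid phase transition to the correct in-context bigram solution. We conduct an empirical and theoretical investigation of this multi-phase process, showing how successful learning results from the interaction between the transformer's layers, and uncovering evidence that the presence of the simpler unigram solution may delay formation of the final bigram solution. We examine how varying the prior distribution over Markov chains affects learning, and consider generalizing our in-context learning of Markov chains (ICL-MC) task to n-grams for n > 2.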
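
The two in-context strategies named in this entry can be written out as plain estimators. The following is a minimal sketch, not the authors' code: the function names are hypothetical, and the add-one smoothing (equivalent to a uniform Dirichlet prior over each row of the transition matrix) is an assumption not stated in the abstract.

```python
from collections import Counter

def unigram_predict(context, vocab):
    """Next-token distribution from in-context token frequencies only."""
    counts = Counter(context)
    return {tok: counts[tok] / len(context) for tok in vocab}

def bigram_predict(context, vocab, smoothing=1.0):
    """Next-token distribution conditioned on the last token.

    Counts which tokens followed earlier occurrences of the last token,
    then applies additive smoothing (an assumed prior).
    """
    last = context[-1]
    follow = Counter(b for a, b in zip(context, context[1:]) if a == last)
    total = sum(follow.values()) + smoothing * len(vocab)
    return {tok: (follow[tok] + smoothing) / total for tok in vocab}

# A strictly alternating chain: the bigram estimator sees that 0 is
# always followed by 1, while the unigram estimator only sees frequencies.
context = [0, 1, 0, 1, 0, 1, 0]
vocab = [0, 1]
uni = unigram_predict(context, vocab)
bi = bigram_predict(context, vocab)
print(uni, bi)
```

On this context the unigram estimator slightly favors token 0, the wrong continuation, whereas the bigram estimator puts most of its mass on token 1, which shows why the unigram solution is sub-optimal for Markov data.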

Title: Exploring Value Biases: How LLMs Deviate Towards the Ideal

Authors: Sarath Sivaprasad, Pramod Kaushik, Sahar Abdelnabi, Mario Fritz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.11005
Pdf URL: https://arxiv.org/pdf/2402.11005
Copy Paste: [[2402.11005]] Exploring Value Biases: How LLMs Deviate Towards the Ideal(https://arxiv.org/abs/2402.11005)
Keywords: llm, prompt
Abstract: Large-Language-Models (LLMs) are deployed in a wide range of applications, and their response has an increasing social impact. Understanding the non-deliberate(ive) mechanism of LLMs in giving responses is essential in explaining their performance and discerning their biases in real-world applications. This is analogous to human studies, where such inadvertent responses are referred to as sampling. We study this sampling of LLMs in light of value bias and show that the sampling of LLMs tends to favour high-value options. Value bias corresponds to this shift of response from the most likely towards an ideal value represented in the LLM. In fact, this effect can be reproduced even with new entities learnt via in-context prompting. We show that this bias manifests in unexpected places and has implications on relevant application scenarios, like choosing exemplars. The results show that value bias is strong in LLMs across different categories, similar to the results found in human studies.
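Summary: Large language models (LLMs) are deployed in a wide range of applications, and their responses have a growing social impact. Understanding the non-deliberate(ive) mechanism by which LLMs give responses is essential for explaining their performance and discerning their biases in real-world applications. This is analogous to human studies, where such inadvertent responses are referred to as sampling. We study this sampling of LLMs in light of value bias and show that it tends to favour high-value options. Value bias corresponds to this shift of the response from the most likely value towards an ideal value represented in the LLM. In fact, this effect can be reproduced even with new entities learnt via in-context prompting. We show that this bias manifests in unexpected places and has implications for relevant application scenarios, such as choosing exemplars. The results show that value bias is strong in LLMs across different categories, similar to the results found in human studies.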

Title: PAT-Questions: A Self-Updating Benchmark for Present-Anchored Temporal Question-Answering

Authors: Jannat Ara Meem, Muhammad Shihab Rashid, Yue Dong, Vagelis Hristidis
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11034
Pdf URL: https://arxiv.org/pdf/2402.11034
Copy Paste: [[2402.11034]] PAT-Questions: A Self-Updating Benchmark for Present-Anchored Temporal Question-Answering(https://arxiv.org/abs/2402.11034)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Existing work on Temporal Question Answering (TQA) has predominantly focused on questions anchored to specific timestamps or events (e.g. "Who was the US president in 1970?"). Little work has studied questions whose temporal context is relative to the present time (e.g. "Who was the previous US president?"). We refer to this problem as Present-Anchored Temporal QA (PATQA). PATQA poses unique challenges: (1) large language models (LLMs) may have outdated knowledge, (2) complex temporal relationships (e.g. 'before', 'previous') are hard to reason, (3) multi-hop reasoning may be required, and (4) the gold answers of benchmarks must be continuously updated. To address these challenges, we introduce the PAT-Questions benchmark, which includes single and multi-hop temporal questions. The answers in PAT-Questions can be automatically refreshed by re-running SPARQL queries on a knowledge graph, if available. We evaluate several state-of-the-art LLMs and a SOTA temporal reasoning model (TEMPREASON-T5) on PAT-Questions through direct prompting and retrieval-augmented generation (RAG). The results highlight the limitations of existing solutions in PATQA and motivate the need for new methods to improve PATQA reasoning capabilities.
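Summary: Existing work on Temporal Question Answering (TQA) has focused mainly on questions anchored to specific timestamps or events (e.g. "Who was the US president in 1970?"). Little work has studied questions whose temporal context is relative to the present time (e.g. "Who was the previous US president?"). We refer to this problem as Present-Anchored Temporal QA (PATQA). PATQA poses unique challenges: (1) large language models (LLMs) may have outdated knowledge, (2) complex temporal relationships (e.g. 'before', 'previous') are hard to reason about, (3) multi-hop reasoning may be required, and (4) the gold answers of benchmarks must be continuously updated. To address these challenges, we introduce the PAT-Questions benchmark, which includes single-hop and multi-hop temporal questions. The answers in PAT-Questions can be refreshed automatically by re-running SPARQL queries on a knowledge graph, if available. We evaluate several state-of-the-art LLMs and a SOTA temporal reasoning model (TEMPREASON-T5) on PAT-Questions through direct prompting and retrieval-augmented generation (RAG). The results highlight the limitations of existing solutions in PATQA and motivate the need for new methods to improve PATQA reasoning capabilities.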

Title: Retrieval-Augmented Generation: Is Dense Passage Retrieval Retrieving?

Authors: Benjamin Reichman, Larry Heck
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2402.11035
Pdf URL: https://arxiv.org/pdf/2402.11035
Copy Paste: [[2402.11035]] Retrieval-Augmented Generation: Is Dense Passage Retrieval Retrieving?(https://arxiv.org/abs/2402.11035)
Keywords: language model, llm, retrieval augmented generation, retrieval-augmented generation
Abstract: Dense passage retrieval (DPR) is the first step in the retrieval augmented generation (RAG) paradigm for improving the performance of large language models (LLM). DPR fine-tunes pre-trained networks to enhance the alignment of the embeddings between queries and relevant textual data. A deeper understanding of DPR fine-tuning will be required to fundamentally unlock the full potential of this approach. In this work, we explore DPR-trained models mechanistically by using a combination of probing, layer activation analysis, and model editing. Our experiments show that DPR training decentralizes how knowledge is stored in the network, creating multiple access pathways to the same information. We also uncover a limitation in this training style: the internal knowledge of the pre-trained model bounds what the retrieval model can retrieve. These findings suggest a few possible directions for dense retrieval: (1) expose the DPR training process to more knowledge so more can be decentralized, (2) inject facts as decentralized representations, (3) model and incorporate knowledge uncertainty in the retrieval process, and (4) directly map internal model knowledge to a knowledge base.
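Summary: Dense passage retrieval (DPR) is the first step in the retrieval-augmented generation (RAG) paradigm for improving the performance of large language models (LLMs). DPR fine-tunes pre-trained networks to improve the alignment of the embeddings between queries and relevant textual data. A deeper understanding of DPR fine-tuning is needed to fundamentally unlock the full potential of this approach. In this work, we explore DPR-trained models mechanistically, using a combination of probing, layer activation analysis, and model editing. Our experiments show that DPR training decentralizes how knowledge is stored in the network, creating multiple access pathways to the same information. We also uncover a limitation of this training style: the internal knowledge of the pre-trained model bounds what the retrieval model can retrieve. These findings suggest several possible directions for dense retrieval: (1) expose the DPR training process to more knowledge so that more can be decentralized, (2) inject facts as decentralized representations, (3) model and incorporate knowledge uncertainty in the retrieval process, and (4) directly map internal model knowledge to a knowledge base.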

Title: Large Language Models Fall Short: Understanding Complex Relationships in Detective Narratives

Authors: Runcong Zhao, Qinglin Zhu, Hainiu Xu, Jiazheng Li, Yuxiang Zhou, Yulan He, Lin Gui
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.11051
Pdf URL: https://arxiv.org/pdf/2402.11051
Copy Paste: [[2402.11051]] Large Language Models Fall Short: Understanding Complex Relationships in Detective Narratives(https://arxiv.org/abs/2402.11051)
Keywords: language model, gpt, llm
Abstract: Existing datasets for narrative understanding often fail to represent the complexity and uncertainty of relationships in real-life social scenarios. To address this gap, we introduce a new benchmark, Conan, designed for extracting and analysing intricate character relation graphs from detective narratives. Specifically, we designed hierarchical relationship categories and manually extracted and annotated role-oriented relationships from the perspectives of various characters, incorporating both public relationships known to most characters and secret ones known to only a few. Our experiments with advanced Large Language Models (LLMs) like GPT-3.5, GPT-4, and Llama2 reveal their limitations in inferencing complex relationships and handling longer narratives. The combination of the Conan dataset and our pipeline strategy is geared towards understanding the ability of LLMs to comprehend nuanced relational dynamics in narrative contexts.
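Summary: Existing datasets for narrative understanding often fail to represent the complexity and uncertainty of relationships in real-life social scenarios. To address this gap, we introduce a new benchmark, Conan, designed for extracting and analysing intricate character relation graphs from detective narratives. Specifically, we designed hierarchical relationship categories and manually extracted and annotated role-oriented relationships from the perspectives of various characters, covering both public relationships known to most characters and secret ones known to only a few. Our experiments with advanced large language models (LLMs) such as GPT-3.5, GPT-4, and Llama2 reveal their limitations in inferring complex relationships and handling longer narratives. The combination of the Conan dataset and our pipeline strategy is geared towards understanding the ability of LLMs to comprehend nuanced relational dynamics in narrative contexts.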

Title: Persona-DB: Efficient Large Language Model Personalization for Response Prediction with Collaborative Data Refinement

Authors: Chenkai Sun, Ke Yang, Revanth Gangi Reddy, Yi R. Fung, Hou Pong Chan, ChengXiang Zhai, Heng Ji
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2402.11060
Pdf URL: https://arxiv.org/pdf/2402.11060
Copy Paste: [[2402.11060]] Persona-DB: Efficient Large Language Model Personalization for Response Prediction with Collaborative Data Refinement(https://arxiv.org/abs/2402.11060)
Keywords: language model, llm
Abstract: The increasing demand for personalized interactions with large language models (LLMs) calls for the development of methodologies capable of accurately and efficiently identifying user opinions and preferences. Retrieval augmentation emerges as an effective strategy, as it can accommodate a vast number of users without the costs from fine-tuning. Existing research, however, has largely focused on enhancing the retrieval stage and devoted limited exploration toward optimizing the representation of the database, a crucial aspect for tasks such as personalization. In this work, we examine the problem from a novel angle, focusing on how data can be better represented for more efficient retrieval in the context of LLM customization. To tackle this challenge, we introduce Persona-DB, a simple yet effective framework consisting of a hierarchical construction process to improve generalization across task contexts and collaborative refinement to effectively bridge knowledge gaps among users. In the task of response forecasting, Persona-DB demonstrates superior efficiency in maintaining accuracy with a significantly reduced retrieval size, a critical advantage in scenarios with extensive histories or limited context windows. Our experiments also indicate a marked improvement of over 15% under cold-start scenarios, when users have extremely sparse data. Furthermore, our analysis reveals the increasing importance of collaborative knowledge as the retrieval capacity expands.
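Summary: The growing demand for personalized interactions with large language models (LLMs) calls for methodologies that can identify user opinions and preferences accurately and efficiently. Retrieval augmentation emerges as an effective strategy, as it can accommodate a vast number of users without the cost of fine-tuning. Existing research, however, has largely focused on enhancing the retrieval stage and has given limited attention to optimizing the representation of the database, a crucial aspect for tasks such as personalization. In this work, we examine the problem from a novel angle, focusing on how data can be better represented for more efficient retrieval in the context of LLM customization. To tackle this challenge, we introduce Persona-DB, a simple yet effective framework consisting of a hierarchical construction process to improve generalization across task contexts and collaborative refinement to bridge knowledge gaps among users. In the task of response forecasting, Persona-DB maintains accuracy with a significantly reduced retrieval size, a critical advantage in scenarios with extensive histories or limited context windows. Our experiments also indicate an improvement of over 15% under cold-start scenarios, when users have extremely sparse data. Furthermore, our analysis reveals the increasing importance of collaborative knowledge as the retrieval capacity expands.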

Title: Bridging Causal Discovery and Large Language Models: A Comprehensive Survey of Integrative Approaches and Future Directions

Authors: Guangya Wan, Yuqi Wu, Mengxuan Hu, Zhixuan Chu, Sheng Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.11068
Pdf URL: https://arxiv.org/pdf/2402.11068
Copy Paste: [[2402.11068]] Bridging Causal Discovery and Large Language Models: A Comprehensive Survey of Integrative Approaches and Future Directions(https://arxiv.org/abs/2402.11068)
Keywords: language model, gpt, llm
Abstract: Causal discovery (CD) and Large Language Models (LLMs) represent two emerging fields of study with significant implications for artificial intelligence. Despite their distinct origins, CD focuses on uncovering cause-effect relationships from data, and LLMs on processing and generating humanlike text, the convergence of these domains offers novel insights and methodologies for understanding complex systems. This paper presents a comprehensive survey of the integration of LLMs, such as GPT4, into CD tasks. We systematically review and compare existing approaches that leverage LLMs for various CD tasks and highlight their innovative use of metadata and natural language to infer causal structures. Our analysis reveals the strengths and potential of LLMs in both enhancing traditional CD methods and as an imperfect expert, alongside the challenges and limitations inherent in current practices. Furthermore, we identify gaps in the literature and propose future research directions aimed at harnessing the full potential of LLMs in causality research. To our knowledge, this is the first survey to offer a unified and detailed examination of the synergy between LLMs and CD, setting the stage for future advancements in the field.
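Summary: Causal discovery (CD) and large language models (LLMs) represent two emerging fields of study with significant implications for artificial intelligence. Despite their distinct origins (CD focuses on uncovering cause-effect relationships from data, and LLMs on processing and generating humanlike text), the convergence of these domains offers novel insights and methodologies for understanding complex systems. This paper presents a comprehensive survey of the integration of LLMs, such as GPT4, into CD tasks. We systematically review and compare existing approaches that leverage LLMs for various CD tasks and highlight their innovative use of metadata and natural language to infer causal structures. Our analysis reveals the strengths and potential of LLMs both in enhancing traditional CD methods and as an imperfect expert, alongside the challenges and limitations inherent in current practices. Furthermore, we identify gaps in the literature and propose future research directions aimed at harnessing the full potential of LLMs in causality research. To our knowledge, this is the first survey to offer a unified and detailed examination of the synergy between LLMs and CD, setting the stage for future advances in the field.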

Title: AFaCTA: Assisting the Annotation of Factual Claim Detection with Reliable LLM Annotators

Authors: Jingwei Ni, Minjing Shi, Dominik Stammbach, Mrinmaya Sachan, Elliott Ash, Markus Leippold
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.11073
Pdf URL: https://arxiv.org/pdf/2402.11073
Copy Paste: [[2402.11073]] AFaCTA: Assisting the Annotation of Factual Claim Detection with Reliable LLM Annotators(https://arxiv.org/abs/2402.11073)
Keywords: language model, llm
Abstract: With the rise of generative AI, automated fact-checking methods to combat misinformation are becoming more and more important. However, factual claim detection, the first step in a fact-checking pipeline, suffers from two key issues that limit its scalability and generalizability: (1) inconsistency in definitions of the task and what a claim is, and (2) the high cost of manual annotation. To address (1), we review the definitions in related work and propose a unifying definition of factual claims that focuses on verifiability. To address (2), we introduce AFaCTA (Automatic Factual Claim deTection Annotator), a novel framework that assists in the annotation of factual claims with the help of large language models (LLMs). AFaCTA calibrates its annotation confidence with consistency along three predefined reasoning paths. Extensive evaluation and experiments in the domain of political speech reveal that AFaCTA can efficiently assist experts in annotating factual claims and training high-quality classifiers, and can work with or without expert supervision. Our analyses also result in PoliClaim, a comprehensive claim detection dataset spanning diverse political topics.
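Summary: With the rise of generative AI, automated fact-checking methods to combat misinformation are becoming increasingly important. However, factual claim detection, the first step in a fact-checking pipeline, suffers from two key issues that limit its scalability and generalizability: (1) inconsistency in definitions of the task and of what a claim is, and (2) the high cost of manual annotation. To address (1), we review the definitions in related work and propose a unifying definition of factual claims that focuses on verifiability. To address (2), we introduce AFaCTA (Automatic Factual Claim deTection Annotator), a novel framework that assists in the annotation of factual claims with the help of large language models (LLMs). AFaCTA calibrates its annotation confidence with consistency along three predefined reasoning paths. Extensive evaluation and experiments in the domain of political speech show that AFaCTA can efficiently assist experts in annotating factual claims and training high-quality classifiers, and can work with or without expert supervision. Our analyses also result in PoliClaim, a comprehensive claim detection dataset spanning diverse political topics.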

Title: Word Embeddings Revisited: Do LLMs Offer Something New?

Authors: Matthew Freestone, Shubhra Kanti Karmaker Santu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11094
Pdf URL: https://arxiv.org/pdf/2402.11094
Copy Paste: [[2402.11094]] Word Embeddings Revisited: Do LLMs Offer Something New?(https://arxiv.org/abs/2402.11094)
Keywords: language model, llm
Abstract: Learning meaningful word embeddings is key to training a robust language model. The recent rise of Large Language Models (LLMs) has provided us with many new word/sentence/document embedding models. Although LLMs have shown remarkable advancement in various NLP tasks, it is still unclear whether the performance improvement is merely because of scale or whether underlying embeddings they produce significantly differ from classical encoding models like Sentence-BERT (SBERT) or Universal Sentence Encoder (USE). This paper systematically investigates this issue by comparing classical word embedding techniques against LLM-based word embeddings in terms of their latent vector semantics. Our results show that LLMs tend to cluster semantically related words more tightly than classical models. LLMs also yield higher average accuracy on the Bigger Analogy Test Set (BATS) over classical methods. Finally, some LLMs tend to produce word embeddings similar to SBERT, a relatively lighter classical model.
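Summary: Learning meaningful word embeddings is key to training a robust language model. The recent rise of large language models (LLMs) has given us many new word, sentence, and document embedding models. Although LLMs have shown remarkable advances on various NLP tasks, it remains unclear whether the performance improvement is merely due to scale or whether the underlying embeddings they produce differ significantly from classical encoding models such as Sentence-BERT (SBERT) or Universal Sentence Encoder (USE). This paper systematically investigates this issue by comparing classical word embedding techniques against LLM-based word embeddings in terms of their latent vector semantics. Our results show that LLMs tend to cluster semantically related words more tightly than classical models. LLMs also yield higher average accuracy on the Bigger Analogy Test Set (BATS) than classical methods. Finally, some LLMs tend to produce word embeddings similar to SBERT, a relatively lighter classical model.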
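
Analogy benchmarks such as BATS are commonly scored with the vector-offset method: for "a is to b as c is to ?", pick the word whose embedding is closest in cosine similarity to b - a + c. The sketch below assumes that method (the abstract does not say which scoring rule the authors use) and runs on hand-made three-dimensional toy vectors, not real embeddings; `solve_analogy` is a hypothetical helper name.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def solve_analogy(emb, a, b, c):
    """Return the word closest to emb[b] - emb[a] + emb[c], excluding a, b, c."""
    target = [y - x + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

# Toy embeddings: the second axis loosely encodes gender, the third royalty.
toy = {
    "man": [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.2, 0.1],
}
answer = solve_analogy(toy, "man", "woman", "king")
print(answer)
```

Accuracy on an analogy set is then the fraction of questions for which the returned word matches the gold answer, which allows classical and LLM-based embeddings to be compared on the same footing.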

Title: When LLMs Meet Cunning Questions: A Fallacy Understanding Benchmark for Large Language Models

Authors: Yinghui Li, Qingyu Zhou, Yuanzhen Luo, Shirong Ma, Yangning Li, Hai-Tao Zheng, Xuming Hu, Philip S. Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11100
Pdf URL: https://arxiv.org/pdf/2402.11100
Copy Paste: [[2402.11100]] When LLMs Meet Cunning Questions: A Fallacy Understanding Benchmark for Large Language Models(https://arxiv.org/abs/2402.11100)
Keywords: language model, llm
Abstract: Recently, Large Language Models (LLMs) have made remarkable evolutions in language understanding and generation. Following this, various benchmarks for measuring all kinds of capabilities of LLMs have sprung up. In this paper, we challenge the reasoning and understanding abilities of LLMs by proposing a FaLlacy Understanding Benchmark (FLUB) containing cunning questions that are easy for humans to understand but difficult for models to grasp. Specifically, the cunning questions that FLUB focuses on mainly consist of the tricky, humorous, and misleading questions collected from the real internet environment. And we design three tasks with increasing difficulty in the FLUB benchmark to evaluate the fallacy understanding ability of LLMs. Based on FLUB, we investigate the performance of multiple representative and advanced LLMs, reflecting our FLUB is challenging and worthy of more future study. Interesting discoveries and valuable insights are achieved in our extensive experiments and detailed analyses. We hope that our benchmark can encourage the community to improve LLMs' ability to understand fallacies.
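Summary: Recently, large language models (LLMs) have made remarkable progress in language understanding and generation. Following this, various benchmarks for measuring all kinds of LLM capabilities have sprung up. In this paper, we challenge the reasoning and understanding abilities of LLMs by proposing a FaLlacy Understanding Benchmark (FLUB) containing cunning questions that are easy for humans to understand but difficult for models to grasp. Specifically, the cunning questions that FLUB focuses on consist mainly of tricky, humorous, and misleading questions collected from the real internet environment. We design three tasks of increasing difficulty in the FLUB benchmark to evaluate the fallacy understanding ability of LLMs. Based on FLUB, we investigate the performance of multiple representative and advanced LLMs; the results indicate that FLUB is challenging and worthy of further study. Our extensive experiments and detailed analyses yield interesting discoveries and valuable insights. We hope that our benchmark can encourage the community to improve LLMs' ability to understand fallacies.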

Title: Whose Emotions and Moral Sentiments Do Language Models Reflect?

Authors: Zihao He, Siyi Guo, Ashwin Rao, Kristina Lerman
Subjects: cs.CL, cs.CY, cs.SI
Abstract URL: https://arxiv.org/abs/2402.11114
Pdf URL: https://arxiv.org/pdf/2402.11114
Copy Paste: [[2402.11114]] Whose Emotions and Moral Sentiments Do Language Models Reflect?(https://arxiv.org/abs/2402.11114)
Keywords: language model
Abstract: Language models (LMs) are known to represent the perspectives of some social groups better than others, which may impact their performance, especially on subjective tasks such as content moderation and hate speech detection. To explore how LMs represent different perspectives, existing research focused on positional alignment, i.e., how closely the models mimic the opinions and stances of different groups, e.g., liberals or conservatives. However, human communication also encompasses emotional and moral dimensions. We define the problem of affective alignment, which measures how LMs' emotional and moral tone represents those of different groups. By comparing the affect of responses generated by 36 LMs to the affect of Twitter messages, we observe significant misalignment of LMs with both ideological groups. This misalignment is larger than the partisan divide in the U.S. Even after steering the LMs towards specific ideological perspectives, the misalignment and liberal tendencies of the model persist, suggesting a systemic bias within LMs.
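Summary: Language models (LMs) are known to represent the perspectives of some social groups better than others, which may affect their performance, especially on subjective tasks such as content moderation and hate speech detection. To explore how LMs represent different perspectives, existing research has focused on positional alignment, i.e., how closely the models mimic the opinions and stances of different groups, e.g., liberals or conservatives. However, human communication also encompasses emotional and moral dimensions. We define the problem of affective alignment, which measures how well LMs' emotional and moral tone represents that of different groups. By comparing the affect of responses generated by 36 LMs to the affect of Twitter messages, we observe significant misalignment of LMs with both ideological groups. This misalignment is larger than the partisan divide in the U.S. Even after steering the LMs towards specific ideological perspectives, the misalignment and liberal tendencies of the model persist, suggesting a systemic bias within LMs.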

Title: Navigating the Dual Facets: A Comprehensive Evaluation of Sequential Memory Editing in Large Language Models

Authors: Zihao Lin, Mohammad Beigi, Hongxuan Li, Yufan Zhou, Yuxiang Zhang, Qifan Wang, Wenpeng Yin, Lifu Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.11122
Pdf URL: https://arxiv.org/pdf/2402.11122
Copy Paste: [[2402.11122]] Navigating the Dual Facets: A Comprehensive Evaluation of Sequential Memory Editing in Large Language Models(https://arxiv.org/abs/2402.11122)
Keywords: language model, llm
Abstract: Memory Editing (ME) has emerged as an efficient method to modify erroneous facts or inject new facts into Large Language Models (LLMs). Two mainstream ME methods exist: parameter-modifying ME and parameter-preserving ME (integrating extra modules while preserving original parameters). Regrettably, previous studies on ME evaluation have two critical limitations: (i) evaluating LLMs with single edit only, neglecting the need for continuous editing, and (ii) evaluations focusing solely on basic factual triples, overlooking broader LLM capabilities like logical reasoning and reading understanding. This study addresses these limitations with contributions threefold: (i) We explore how ME affects a wide range of fundamental capabilities of LLMs under sequential editing. Experimental results reveal an intriguing phenomenon: Most parameter-modifying ME consistently degrade performance across all tasks after a few sequential edits. In contrast, parameter-preserving ME effectively maintains LLMs' fundamental capabilities but struggles to accurately recall edited knowledge presented in a different format. (ii) We extend our evaluation to different editing settings, such as layers to edit, model size, instruction tuning, etc. Experimental findings indicate several strategies that can potentially mitigate the adverse effects of ME. (iii) We further explain why parameter-modifying ME damages LLMs from three dimensions: parameter changes after editing, language modeling capability, and the in-context learning capability. Our in-depth study advocates more careful use of ME in real-world scenarios.
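Summary: Memory Editing (ME) has emerged as an efficient method to modify erroneous facts or inject new facts into large language models (LLMs). Two mainstream ME methods exist: parameter-modifying ME and parameter-preserving ME (integrating extra modules while preserving the original parameters). Regrettably, previous studies on ME evaluation have two critical limitations: (i) they evaluate LLMs with a single edit only, neglecting the need for continuous editing, and (ii) they focus solely on basic factual triples, overlooking broader LLM capabilities such as logical reasoning and reading understanding. This study addresses these limitations with three contributions. (i) We explore how ME affects a wide range of fundamental capabilities of LLMs under sequential editing. Experimental results reveal an intriguing phenomenon: most parameter-modifying ME methods consistently degrade performance across all tasks after a few sequential edits. In contrast, parameter-preserving ME effectively maintains LLMs' fundamental capabilities but struggles to accurately recall edited knowledge presented in a different format. (ii) We extend our evaluation to different editing settings, such as the layers to edit, model size, and instruction tuning. Experimental findings indicate several strategies that can potentially mitigate the adverse effects of ME. (iii) We further explain why parameter-modifying ME damages LLMs from three dimensions: parameter changes after editing, language modeling capability, and in-context learning capability. Our in-depth study advocates more careful use of ME in real-world scenarios.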

Title: BlendFilter: Advancing Retrieval-Augmented Large Language Models via Query Generation Blending and Knowledge Filtering

Authors: Haoyu Wang, Tuo Zhao, Jing Gao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11129
Pdf URL: https://arxiv.org/pdf/2402.11129
Copy Paste: [[2402.11129]] BlendFilter: Advancing Retrieval-Augmented Large Language Models via Query Generation Blending and Knowledge Filtering(https://arxiv.org/abs/2402.11129)
Keywords: language model, llm
Abstract: Retrieval-augmented Large Language Models (LLMs) offer substantial benefits in enhancing performance across knowledge-intensive scenarios. However, these methods often face challenges with complex inputs and encounter difficulties due to noisy knowledge retrieval, notably hindering model effectiveness. To address this issue, we introduce BlendFilter, a novel approach that elevates retrieval-augmented LLMs by integrating query generation blending with knowledge filtering. BlendFilter proposes the blending process through its query generation method, which integrates both external and internal knowledge augmentation with the original query, ensuring comprehensive information gathering. Additionally, our distinctive knowledge filtering module capitalizes on the intrinsic capabilities of the LLM, effectively eliminating extraneous data. We conduct extensive experiments on three open-domain question answering benchmarks, and the findings clearly indicate that our innovative BlendFilter surpasses state-of-the-art baselines significantly.
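Summary: Retrieval-augmented large language models (LLMs) offer substantial benefits in improving performance across knowledge-intensive scenarios. However, these methods often struggle with complex inputs and with noisy knowledge retrieval, which notably hinders model effectiveness. To address this issue, we introduce BlendFilter, a novel approach that improves retrieval-augmented LLMs by integrating query generation blending with knowledge filtering. BlendFilter performs the blending through its query generation method, which integrates both external and internal knowledge augmentation with the original query, ensuring comprehensive information gathering. Additionally, our knowledge filtering module capitalizes on the intrinsic capabilities of the LLM to eliminate extraneous data. We conduct extensive experiments on three open-domain question answering benchmarks, and the findings indicate that BlendFilter significantly surpasses state-of-the-art baselines.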

Title: Speculative Streaming: Fast LLM Inference without Auxiliary Models

Authors: Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, Mahyar Najibi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.11131
Pdf URL: https://arxiv.org/pdf/2402.11131
Copy Paste: [[2402.11131]] Speculative Streaming: Fast LLM Inference without Auxiliary Models(https://arxiv.org/abs/2402.11131)
Keywords: language model, llm
Abstract: Speculative decoding is a prominent technique to speed up the inference of a large target language model based on predictions of an auxiliary draft model. While effective, in application-specific settings, it often involves fine-tuning both draft and target models to achieve high acceptance rates. As the number of downstream tasks grows, these draft models add significant complexity to inference systems. We propose Speculative Streaming, a single-model speculative decoding method that fuses drafting into the target model by changing the fine-tuning objective from next token prediction to future n-gram prediction. Speculative Streaming speeds up decoding by 1.8 - 3.1X in a diverse set of tasks, such as Summarization, Structured Queries, and Meaning Representation, without sacrificing generation quality. Additionally, Speculative Streaming is parameter-efficient. It achieves on-par/higher speed-ups than Medusa-style architectures while using ~10000X fewer extra parameters, making it well-suited for resource-constrained devices.
摘要：推测解码是一种基于辅助草稿模型的预测来加速大型目标语言模型推理的重要技术。虽然在特定于应用程序的设置中有效，但它通常需要微调草稿模型和目标模型以实现高接受率。随着下游任务数量的增加，这些草案模型显着增加了推理系统的复杂性。我们提出了 Speculative Streaming，这是一种单模型推测解码方法，通过将微调目标从下一个 token 预测更改为未来的 n-gram 预测，将绘图融合到目标模型中。推测流在摘要、结构化查询和含义表示等各种任务中将解码速度提高 1.8 - 3.1 倍，而不会牺牲生成质量。此外，推测流是参数高效的。它实现了与 Medusa 式架构同等/更高的速度，同时使用的额外参数减少了大约 10000 倍，使其非常适合资源受限的设备。
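Sketch: The abstract's first sentence refers to classic two-model speculative decoding, which Speculative Streaming replaces with a single model. As a rough illustration of that baseline only, the following minimal greedy draft-and-verify loop may help. It is not the authors' method: `toy_target`, `toy_draft`, `speculative_step`, and `generate` are hypothetical names, and the "models" are toy functions over integer tokens.

```python
from typing import Callable, List

Token = int
NextTokenFn = Callable[[List[Token]], Token]  # greedy next-token predictor


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Hypothetical draft model: agrees with the target except after token 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


def speculative_step(prefix: List[Token], draft_next: NextTokenFn,
                     target_next: NextTokenFn, k: int) -> List[Token]:
    """Draft k tokens, keep the longest run the target agrees with,
    then append one token produced by the target itself."""
    drafted: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    accepted: List[Token] = []
    ctx = list(prefix)
    for tok in drafted:
        # A real system would run these checks in one batched target pass.
        if target_next(ctx) != tok:
            break
        accepted.append(tok)
        ctx.append(tok)
    # The target always contributes one token: a correction or a bonus token.
    accepted.append(target_next(ctx))
    return accepted


def generate(prefix: List[Token], draft_next: NextTokenFn,
             target_next: NextTokenFn, k: int, max_new: int) -> List[Token]:
    """Repeat speculative steps until max_new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[: len(prefix) + max_new]
```

Because every accepted token is checked against the target's greedy choice, the output matches plain greedy decoding with the target; the draft only changes how many target calls can be batched together.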

Title: TuneTables: Context Optimization for Scalable Prior-Data Fitted Networks

Authors: Benjamin Feuer, Robin Tibor Schirrmeister, Valeriia Cherepanova, Chinmay Hegde, Frank Hutter, Micah Goldblum, Niv Cohen, Colin White
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2402.11137
Pdf URL: https://arxiv.org/pdf/2402.11137
Copy Paste: [[2402.11137]] TuneTables: Context Optimization for Scalable Prior-Data Fitted Networks(https://arxiv.org/abs/2402.11137)
Keywords: language model, prompt
Abstract: While tabular classification has traditionally relied on from-scratch training, a recent breakthrough called prior-data fitted networks (PFNs) challenges this approach. Similar to large language models, PFNs make use of pretraining and in-context learning to achieve strong performance on new tasks in a single forward pass. However, current PFNs have limitations that prohibit their widespread adoption. Notably, TabPFN achieves very strong performance on small tabular datasets but is not designed to make predictions for datasets of size larger than 1000. In this work, we overcome these limitations and substantially improve the performance of PFNs by developing context optimization techniques for PFNs. Specifically, we propose TuneTables, a novel prompt-tuning strategy that compresses large datasets into a smaller learned context. TuneTables scales TabPFN to be competitive with state-of-the-art tabular classification methods on larger datasets, while having a substantially lower inference time than TabPFN. Furthermore, we show that TuneTables can be used as an interpretability tool and can even be used to mitigate biases by optimizing a fairness objective.
摘要：虽然表格分类传统上依赖于从头开始训练，但最近一项称为先验数据拟合网络 (PFN) 的突破挑战了这种方法。与大型语言模型类似，PFN 利用预训练和上下文学习在单次前向传递中在新任务上实现强大的性能。然而，当前的 PFN 存在局限性，阻碍了其广泛采用。值得注意的是，TabPFN 在小型表格数据集上实现了非常强大的性能，但并非旨在对大小超过 1000 的数据集进行预测。在这项工作中，我们克服了这些限制，并通过开发 PFN 的上下文优化技术来大幅提高 PFN 的性能。具体来说，我们提出了 TuneTables，这是一种新颖的即时调整策略，可将大型数据集压缩到较小的学习上下文中。 TuneTables 对 TabPFN 进行了扩展，使其能够与较大数据集上最先进的表格分类方法竞争，同时推理时间比 TabPFN 低得多。此外，我们还表明 TuneTables 可以用作可解释性工具，甚至可以通过优化公平性目标来减轻偏差。

Title: Contrastive Instruction Tuning

Authors: Tianyi Yan, Fei Wang, James Y. Huang, Wenxuan Zhou, Fan Yin, Aram Galstyan, Wenpeng Yin, Muhao Chen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.11138
Pdf URL: https://arxiv.org/pdf/2402.11138
Copy Paste: [[2402.11138]] Contrastive Instruction Tuning(https://arxiv.org/abs/2402.11138)
Keywords: language model, llm, prompt
Abstract: Instruction tuning is a promising approach for improving the performance of large language models (LLMs) on unseen tasks. However, current LLMs exhibit limited robustness to unseen instructions, generating inconsistent outputs when the same instruction is phrased in slightly varied forms or language styles. This behavior indicates LLMs' lack of robustness to textual variations and of generalizability to unseen instructions, potentially leading to trustworthiness issues. Accordingly, we propose Contrastive Instruction Tuning (CoIN), which maximizes the similarity between the hidden representations of semantically equivalent instruction-instance pairs while minimizing the similarity between semantically different ones. To facilitate this approach, we augment the existing FLAN collection by paraphrasing task instructions. Experiments on the PromptBench benchmark show that CoIN consistently improves LLMs' robustness to unseen instructions with variations across character, word, sentence, and semantic levels by an average of +2.5% in accuracy.
摘要：指令调优已被用作一种有前途的方法，可以提高大型语言模型（LLM）在未见过的任务上的性能。然而，当前的法学硕士对看不见的指令表现出有限的鲁棒性，当相同的指令以略有不同的形式或语言风格表达时，会产生不一致的输出。这种行为表明法学硕士缺乏对文本变化的鲁棒性和对未见过的指令的通用性，可能导致可信度问题。因此，我们提出对比指令调整，它最大化语义等效指令实例对的隐藏表示之间的相似性，同时最小化语义不同指令实例对之间的相似性。为了促进这种方法，我们通过解释任务指令来扩充现有的 FLAN 集合。 PromptBench 基准测试表明，CoIN 持续提高了法学硕士对字符、单词、句子和语义级别变化的未见指令的鲁棒性，准确度平均提高了 2.5%。
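Sketch: The abstract describes pulling together the hidden representations of semantically equivalent instruction-instance pairs and pushing apart different ones, but does not state the loss. One common way to express that objective is an InfoNCE-style loss with in-batch negatives; the version below is an assumption for illustration, not the paper's formulation. `cosine`, `contrastive_loss`, and the `temperature` default are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchors: List[Vector], positives: List[Vector],
                     temperature: float = 0.1) -> float:
    """InfoNCE over a batch.

    positives[i] is the representation of a paraphrase of anchors[i];
    every other positives[j] serves as an in-batch negative.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)
```

The loss is small when each anchor is most similar to its own paraphrase and large when it is closer to a different instruction's representation.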

Title: Boosting of Thoughts: Trial-and-Error Problem Solving with Large Language Models

Authors: Sijia Chen, Baochun Li, Di Niu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11140
Pdf URL: https://arxiv.org/pdf/2402.11140
Copy Paste: [[2402.11140]] Boosting of Thoughts: Trial-and-Error Problem Solving with Large Language Models(https://arxiv.org/abs/2402.11140)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: The reasoning performance of Large Language Models (LLMs) on a wide range of problems critically relies on chain-of-thought prompting, which involves providing a few chain-of-thought demonstrations as exemplars in prompts. Recent work, e.g., Tree of Thoughts, has pointed out the importance of exploration and self-evaluation in reasoning step selection for complex problem solving. In this paper, we present Boosting of Thoughts (BoT), an automated prompting framework for problem solving with LLMs that iteratively explores and self-evaluates many trees of thoughts in order to acquire an ensemble of trial-and-error reasoning experiences, which serves as a new form of prompting to solve the complex problem. Starting from a simple prompt without requiring examples, BoT iteratively explores and evaluates a large collection of reasoning steps and, more importantly, uses error analysis obtained from the LLM on them to explicitly revise prompting, which in turn enhances reasoning step generation, until a final answer is attained. Our experiments with GPT-4 and Llama2 across extensive complex mathematical problems demonstrate that BoT consistently achieves problem-solving rates higher than or comparable to those of other advanced prompting approaches.
摘要：大型语言模型（LLM）对各种问题的推理性能严重依赖于思维链提示，其中涉及提供一些思维链演示作为提示中的范例。最近的工作，例如《思想之树》，指出了探索和自我评估在解决复杂问题的推理步骤选择中的重要性。在本文中，我们提出了思想提升（BoT），这是一种通过迭代探索和自我评估许多思想树来解决法学硕士问题的自动提示框架，以获得试错推理经验的集合，这将作为解决复杂问题的新形式。 BoT 从不需要示例的简单提示开始，迭代地探索和评估大量推理步骤，更重要的是，使用从 LLM 获得的错误分析来显式修改提示，从而增强推理步骤的生成，直到最终的推理步骤。得到了答案。我们使用 GPT-4 和 Llama2 在广泛的复杂数学问题上进行的实验表明，与其他高级提示方法相比，BoT 始终能够实现更高或相当的问题解决率。

Title: Grasping the Essentials: Tailoring Large Language Models for Zero-Shot Relation Extraction

Authors: Sizhe Zhou, Yu Meng, Bowen Jin, Jiawei Han
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11142
Pdf URL: https://arxiv.org/pdf/2402.11142
Copy Paste: [[2402.11142]] Grasping the Essentials: Tailoring Large Language Models for Zero-Shot Relation Extraction(https://arxiv.org/abs/2402.11142)
Keywords: language model, llm
Abstract: Relation extraction (RE), a crucial task in NLP, aims to identify semantic relationships between entities mentioned in texts. Despite significant advancements in this field, existing models typically rely on extensive annotated data for training, which can be both costly and time-consuming to acquire. Moreover, these models often struggle to adapt to new or unseen relationships. In contrast, few-shot learning settings, which aim to reduce annotation requirements, may offer incomplete and biased supervision for understanding target relation semantics, leading to degraded and unstable performance. To provide the model with accurate and explicit descriptions of the relation types while minimizing the annotation requirements, we study the definition-only zero-shot RE setting, where only relation definitions expressed in natural language are used to train an RE model. Motivated by the strong synthetic data generation power of LLMs, we propose a framework, REPaL, which consists of three stages: (1) We utilize LLMs to generate initial seed instances based on relation definitions and an unlabeled corpus. (2) We fine-tune a bidirectional Small Language Model (SLM) using these initial seeds to learn the relations for the target domain. (3) We enhance pattern coverage and mitigate bias resulting from the limited number of initial seeds by incorporating feedback acquired from the SLM's predictions on the unlabeled corpus. To accomplish this, we leverage the multi-turn conversation ability of LLMs to generate new instances in follow-up dialogues. Experiments on two datasets show that REPaL achieves better zero-shot performance with large margins over baseline methods.
摘要：关系提取（RE）是 NLP 中的一项关键任务，旨在识别文本中提到的实体之间的语义关系。尽管该领域取得了重大进展，但现有模型通常依赖于大量带注释的数据进行训练，而获取这些数据可能既昂贵又耗时。此外，这些模型常常难以适应新的或看不见的关系。相反，旨在减少注释要求的小样本学习设置可能会为理解目标关系语义提供不完整且有偏差的监督，从而导致性能下降和不稳定。为了向模型提供对关系类型的准确和明确的描述，同时最大限度地减少注释需求，我们研究了仅定义零样本RE设置，其中仅使用自然语言表达的关系定义来训练RE模型。受 LLM 强大的合成数据生成能力的启发，我们提出了一个框架 REPaL，该框架由三个阶段组成：（1）我们利用 LLM 基于关系定义和未标记的语料库生成初始种子实例。 (2) 我们使用这些初始种子来微调双向小语言模型 (SLM)，以学习目标域的关系。 (3) 我们通过纳入从 SLM 对未标记语料库的预测中获得的反馈，增强了模式覆盖范围并减轻了由于初始种子数量有限而导致的偏差。为了实现这一目标，我们利用法学硕士的多轮对话能力在后续对话中生成新实例。对两个数据集的实验表明 REPaL 实现了更好的零样本性能，并且比基线方法有更大的优势。

Title: Understanding News Thumbnail Representativeness by Counterfactual Text-Guided Contrastive Language-Image Pretraining

Authors: Yejun Yoon, Seunghyun Yoon, Kunwoo Park
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2402.11159
Pdf URL: https://arxiv.org/pdf/2402.11159
Copy Paste: [[2402.11159]] Understanding News Thumbnail Representativeness by Counterfactual Text-Guided Contrastive Language-Image Pretraining(https://arxiv.org/abs/2402.11159)
Keywords: language model
Abstract: This paper delves into the critical challenge of understanding the representativeness of news thumbnail images, which often serve as the first visual engagement for readers when an article is disseminated on social media. We focus on whether a news image represents the main subject discussed in the news text. To address this challenge, we introduce NewsTT, a manually annotated dataset of news thumbnail image and text pairs. We found that pretrained vision and language models, such as CLIP and BLIP-2, struggle with this task. Since news subjects frequently involve named entities or proper nouns, a pretrained model may lack the ability to match their visual and textual appearances. To fill the gap, we propose CFT-CLIP, a counterfactual text-guided contrastive language-image pretraining framework. We hypothesize that learning to contrast news text with its counterfactual, in which named entities are replaced, can enhance the cross-modal matching ability in the target task. Evaluation experiments using NewsTT show that CFT-CLIP outperforms pretrained models such as CLIP and BLIP-2. Our code and data will be made accessible to the public after the paper is accepted.
摘要：本文深入探讨了理解新闻缩略图的代表性的关键挑战，当文章在社交媒体上传播时，新闻缩略图通常是读者的第一个视觉参与。我们关注新闻图像是否代表新闻文本中讨论的主要主题。为了应对挑战，我们引入了 \textsc{NewsTT}，这是一个手动注释的新闻缩略图和文本对的数据集。我们发现预训练的视觉和语言模型（例如 CLIP 和 BLIP-2）难以完成这项任务。由于新闻主题经常涉及命名实体或专有名词，因此预训练模型无法匹配其视觉和文本外观。为了填补这一空白，我们提出了 CFT-CLIP，一种反事实文本引导的对比语言图像预训练框架。我们假设学习将新闻文本与其反事实（其中命名实体被替换）进行对比，可以增强目标任务中的跨模态匹配能力。使用 NewsTT 进行的评估实验表明，CFT-CLIP 优于预训练模型，例如 CLIP 和 BLIP-2。论文被接受后，我们的代码和数据将向公众开放。

Title: PANDA (Pedantic ANswer-correctness Determination and Adjudication): Improving Automatic Evaluation for Question Answering and Text Generation

Authors: Zongxia Li, Ishani Mondal, Yijun Liang, Huy Nghiem, Jordan Lee Boyd-Graber
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.11161
Pdf URL: https://arxiv.org/pdf/2402.11161
Copy Paste: [[2402.11161]] PANDA (Pedantic ANswer-correctness Determination and Adjudication):Improving Automatic Evaluation for Question Answering and Text Generation(https://arxiv.org/abs/2402.11161)
Keywords: language model, llm
Abstract: Question answering (QA) can only make progress if we know whether an answer is correct, but for many of the most challenging and interesting QA examples, current answer correctness (AC) metrics do not align with human judgments, particularly for verbose, free-form answers from large language models (LLMs). There are two challenges: a lack of data and models that are too big. LLM-based scorers correlate better with humans, but this expensive task has only been tested on limited QA datasets. We rectify these issues by providing clear guidelines for evaluating machine QA adopted from human QA contests. We also introduce Precise ANswer correctness Determination and Adjudication (PANDA), a small, efficient, deterministic AC classifier (812 KB) that more accurately evaluates answer correctness.
摘要：只有知道答案是否正确，问答 (QA) 才能取得进展，但对于许多最具挑战性和有趣的 QA 示例，当前答案正确性 (AC) 指标与人类判断不一致，尤其是冗长、自由形式的答案来自大型语言模型（LLM）。存在两个挑战：缺乏数据和模型太大。基于 LLM 的评分器与人类的相关性更好，但这项昂贵的任务仅在有限的 QA 数据集上进行了测试。我们通过提供明确的指南来评估从人类 QA 竞赛中采用的机器 QA 来纠正这些问题。我们还引入了 Precise ANswer 正确性确定和裁决 (PANDA)，这是一种小型、高效、确定性的 AC 分类器 (812 KB)，可以更准确地评估答案的正确性。

Title: KG-Agent: An Efficient Autonomous Agent Framework for Complex Reasoning over Knowledge Graph

Authors: Jinhao Jiang, Kun Zhou, Wayne Xin Zhao, Yang Song, Chen Zhu, Hengshu Zhu, Ji-Rong Wen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11163
Pdf URL: https://arxiv.org/pdf/2402.11163
Copy Paste: [[2402.11163]] KG-Agent: An Efficient Autonomous Agent Framework for Complex Reasoning over Knowledge Graph(https://arxiv.org/abs/2402.11163)
Keywords: language model, llm, agent
Abstract: In this paper, we aim to improve the reasoning ability of large language models (LLMs) over knowledge graphs (KGs) to answer complex questions. Inspired by existing methods that design the interaction strategy between LLMs and KGs, we propose an autonomous LLM-based agent framework, called KG-Agent, which enables a small LLM to actively make decisions until it finishes the reasoning process over KGs. In KG-Agent, we integrate the LLM, a multifunctional toolbox, a KG-based executor, and knowledge memory, and develop an iteration mechanism that autonomously selects a tool and then updates the memory for reasoning over the KG. To guarantee effectiveness, we leverage program language to formulate the multi-hop reasoning process over the KG, and synthesize a code-based instruction dataset to fine-tune the base LLM. Extensive experiments demonstrate that tuning LLaMA-7B with only 10K samples can outperform state-of-the-art methods that use larger LLMs or more data, on both in-domain and out-of-domain datasets. Our code and data will be publicly released.
摘要：在本文中，我们的目标是提高大型语言模型（LLM）相对于知识图（KG）的推理能力，以回答复杂的问题。受设计LLM和KG之间交互策略的现有方法的启发，我们提出了一种基于LLM的自主代理框架，称为KG-Agent，它使得小型LLM能够主动做出决策，直到完成对KG的推理过程。在KG-Agent中，我们集成了LLM、多功能工具箱、基于KG的执行器和知识记忆，并开发了一种迭代机制，自主选择工具然后更新记忆以对KG进行推理。为了保证有效性，我们利用程序语言在KG上制定多跳推理过程，并合成基于代码的指令数据集来微调基础LLM。大量实验表明，在域内和域外数据集上，仅使用 10K 样本来调整 LLaMA-7B 就可以优于使用更大的 LLM 或更多数据的最先进方法。我们的代码和数据将公开发布。

Title: GenDec: A robust generative Question-decomposition method for Multi-hop reasoning

Authors: Jian Wu, Linyi Yang, Yuliang Ji, Wenhao Huang, Börje F. Karlsson, Manabu Okumura
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11166
Pdf URL: https://arxiv.org/pdf/2402.11166
Copy Paste: [[2402.11166]] GenDec: A robust generative Question-decomposition method for Multi-hop reasoning(https://arxiv.org/abs/2402.11166)
Keywords: language model, gpt, llm
Abstract: Multi-hop QA (MHQA) involves step-by-step reasoning to answer complex questions and find multiple relevant supporting facts. However, the reasoning ability of existing large language models (LLMs) in multi-hop question answering remains underexplored and is inadequate for answering multi-hop questions. Moreover, it is unclear whether LLMs follow a desired reasoning chain to reach the right final answer. In this paper, we propose a generative question decomposition method (GenDec) from the perspective of explainable QA, which generates independent and complete sub-questions by incorporating additional extracted evidence to enhance LLMs' reasoning ability in RAG. To demonstrate the impact, generalization, and robustness of GenDec, we conduct two experiments. First, we combine GenDec with small QA systems on paragraph retrieval and QA tasks. Second, we examine the reasoning capabilities of various state-of-the-art LLMs, including GPT-4 and GPT-3.5, combined with GenDec. We experiment on the HotpotQA, 2WikihopMultiHopQA, MuSiQue, and PokeMQA datasets.
摘要：多跳 QA (MHQA) 涉及逐步推理来回答复杂问题并找到多个相关支持事实。然而，现有的大型语言模型（LLM）在多跳问答中的推理能力仍处于探索阶段，在回答多跳问题时存在不足。此外，尚不清楚法学硕士是否遵循所需的推理链来得出正确的最终答案。在本文中，我们从可解释的QA的角度提出了一种\textbf{gen}生成问题\textbf{dec}组合方法（GenDec），通过基于合并额外提取的证据生成独立且完整的子问题来增强LLM的推理能力在RAG。为了证明 Gendec 的影响、泛化和鲁棒性，我们进行了两个实验，第一个实验是将 GenDec 与小型 QA 系统结合起来执行段落检索和 QA 任务。其次，我们检查了各种最先进的 LLM 的推理能力，包括 GPT-4 和 GPT-3.5 与 GenDec 的结合。我们在 HotpotQA、2WikihopMultiHopQA、MuSiQue 和 PokeMQA 数据集上进行了实验。

Title: Token-Ensemble Text Generation: On Attacking the Automatic AI-Generated Text Detection

Authors: Fan Huang, Haewoon Kwak, Jisun An
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.11167
Pdf URL: https://arxiv.org/pdf/2402.11167
Copy Paste: [[2402.11167]] Token-Ensemble Text Generation: On Attacking the Automatic AI-Generated Text Detection(https://arxiv.org/abs/2402.11167)
Keywords: llm, prompt
Abstract: The robustness of AI-content detection models against cultivated attacks (e.g., paraphrasing or word switching) remains a significant concern. This study proposes a novel token-ensemble generation strategy to challenge the robustness of current AI-content detection approaches. We explore the ensemble attack strategy by completing the prompt with the next token generated from random candidate LLMs. We find that the token-ensemble approach significantly degrades the performance of AI-content detection models (the code and test sets will be released). Our findings reveal that token-ensemble generation poses a vital challenge to current detection models and underline the need for advancing detection technologies to counter sophisticated adversarial strategies.
摘要：人工智能内容检测模型针对精心设计的攻击（例如释义或换词）的稳健性仍然是一个重大问题。本研究提出了一种新颖的令牌集成生成策略，以挑战当前人工智能内容检测方法的稳健性。我们通过使用随机候选 LLM 生成的下一个标记完成提示来探索整体攻击策略。我们发现令牌集成方法显着降低了人工智能内容检测模型的性能（代码和测试集将被发布）。我们的研究结果表明，令牌集合的生成对当前的检测模型提出了重大挑战，并强调需要先进的检测技术来应对复杂的对抗策略。
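Sketch: The abstract describes completing a prompt with the next token generated from randomly chosen candidate LLMs. A minimal sketch of that loop is given below, under the assumption that each candidate is a callable mapping the current token list to one next token. `token_ensemble_generate`, `model_a`, and `model_b` are hypothetical stand-ins; no real LLM API is used.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical interface: a "model" maps the tokens so far to one next token.
NextToken = Callable[[List[str]], str]


def model_a(tokens: List[str]) -> str:
    """Toy stand-in for one candidate LLM."""
    return "a"


def model_b(tokens: List[str]) -> str:
    """Toy stand-in for a second candidate LLM."""
    return "b"


def token_ensemble_generate(prompt: Sequence[str], models: Sequence[NextToken],
                            max_new_tokens: int, rng: random.Random) -> List[str]:
    """At each step, pick one candidate model at random and let it
    produce the next token, conditioned on everything generated so far."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        model = rng.choice(models)
        tokens.append(model(tokens))
    return tokens
```

Passing a seeded `random.Random` keeps runs reproducible, which is convenient when comparing detector scores across attack settings.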

Title: M4GT-Bench: Evaluation Benchmark for Black-Box Machine-Generated Text Detection

Authors: Yuxia Wang, Jonibek Mansurov, Petar Ivanov, Jinyan Su, Artem Shelmanov, Akim Tsvigun, Osama Mohanned Afzal, Tarek Mahmoud, Giovanni Puccetti, Thomas Arnold, Alham Fikri Aji, Nizar Habash, Iryna Gurevych, Preslav Nakov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11175
Pdf URL: https://arxiv.org/pdf/2402.11175
Copy Paste: [[2402.11175]] M4GT-Bench: Evaluation Benchmark for Black-Box Machine-Generated Text Detection(https://arxiv.org/abs/2402.11175)
Keywords: language model, llm
Abstract: The advent of Large Language Models (LLMs) has brought an unprecedented surge in machine-generated text (MGT) across diverse channels. This raises legitimate concerns about its potential misuse and societal implications. The need to identify and differentiate such content from genuine human-generated text is critical in combating disinformation, preserving the integrity of education and scientific fields, and maintaining trust in communication. In this work, we address this problem by introducing M4GT-Bench, a new multilingual, multi-domain, and multi-generator benchmark for MGT detection. It is collected for three task formulations: (1) monolingual and multilingual binary MGT detection; (2) multi-way detection, which identifies which particular model generated the text; and (3) human-machine mixed text detection, where the word boundary separating MGT from human-written content must be determined. Human evaluation for Task 2 shows below-random-guess performance, demonstrating how challenging it is to distinguish individual LLMs. Promising results always occur when training and test data come from the same domain or generators.
摘要：大型语言模型 (LLM) 的出现带来了跨不同渠道的机器生成文本 (MGT) 前所未有的激增。这引起了对其潜在滥用和社会影响的合理担忧。识别和区分这些内容与真实的人类生成文本的需要对于打击虚假信息、维护教育和科学领域的完整性以及维护通信信任至关重要。在这项工作中，我们通过引入一个涉及多语言、多域和多生成器的 MGT 检测新基准——M4GT-Bench 来解决这个问题。它是针对三种任务形式收集的：（1）单语言和多语言二进制 MGT 检测； (2) 多路检测识别哪个特定模型生成文本； (3)人机混合文本检测，其中应确定将MGT与人写内容分隔开的单词边界。对任务 2 的人工评估显示出低于随机猜测的性能，这表明区分独特的法学硕士面临的挑战。当训练和测试数据分布在同一域或生成器内时，总会出现有希望的结果。

Title: KnowTuning: Knowledge-aware Fine-tuning for Large Language Models

Authors: Yougang Lyu, Lingyong Yan, Shuaiqiang Wang, Haibo Shi, Dawei Yin, Pengjie Ren, Zhumin Chen, Maarten de Rijke, Zhaochun Ren
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.11176
Pdf URL: https://arxiv.org/pdf/2402.11176
Copy Paste: [[2402.11176]] KnowTuning: Knowledge-aware Fine-tuning for Large Language Models(https://arxiv.org/abs/2402.11176)
Keywords: language model, llm
Abstract: Despite their success at many natural language processing (NLP) tasks, large language models (LLMs) still struggle to effectively leverage knowledge for knowledge-intensive tasks, manifesting limitations such as generating incomplete, non-factual, or illogical answers. These limitations stem from inadequate knowledge awareness of LLMs during vanilla fine-tuning. To address these problems, we propose a knowledge-aware fine-tuning (KnowTuning) method to explicitly and implicitly improve the knowledge awareness of LLMs. We devise an explicit knowledge-aware generation stage to train LLMs to explicitly identify knowledge triples in answers. We also propose an implicit knowledge-aware comparison stage to train LLMs to implicitly distinguish between reliable and unreliable knowledge, in three aspects: completeness, factuality, and logicality. Extensive experiments on both generic and medical question answering (QA) datasets confirm the effectiveness of KnowTuning, through automatic and human evaluations, across various sizes of LLMs. Finally, we demonstrate that the improvements of KnowTuning generalize to unseen QA datasets.
摘要：尽管大型语言模型 (LLM) 在许多自然语言处理 (NLP) 任务中取得了成功，但它们仍然难以有效利用知识来完成知识密集型任务，表现出诸如生成不完整、非事实或不合逻辑的答案等局限性。这些局限性源于法学硕士在普通微调过程中知识意识不足。为了解决这些问题，我们提出了一种知识感知微调（KnowTuning）方法来显式和隐式地提高法学硕士的知识意识。我们设计了一个明确的知识感知生成阶段来训练法学硕士明确识别答案中的知识三元组。我们还提出了一个隐式知识感知比较阶段，以训练法学硕士在完整性、事实性和逻辑性三个方面隐式区分可靠和不可靠的知识。对通用数据集和医学问答 (QA) 数据集进行的广泛实验通过自动和人工评估，在各种规模的法学硕士中证实了 KnowTuning 的有效性。最后，我们证明 KnowTuning 的改进可以推广到未见过的 QA 数据集。

Title: RENOVI: A Benchmark Towards Remediating Norm Violations in Socio-Cultural Conversations

Authors: Haolan Zhan, Zhuang Li, Xiaoxi Kang, Tao Feng, Yuncheng Hua, Lizhen Qu, Yi Ying, Mei Rianto Chandra, Kelly Rosalin, Jureynolds Jureynolds, Suraj Sharma, Shilin Qu, Linhao Luo, Lay-Ki Soon, Zhaleh Semnani Azad, Ingrid Zukerman, Gholamreza Haffari
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11178
Pdf URL: https://arxiv.org/pdf/2402.11178
Copy Paste: [[2402.11178]] RENOVI: A Benchmark Towards Remediating Norm Violations in Socio-Cultural Conversations(https://arxiv.org/abs/2402.11178)
Keywords: gpt, llm, prompt, chat
Abstract: Norm violations occur when individuals fail to conform to culturally accepted behaviors, which may lead to potential conflicts. Remediating norm violations requires social awareness and cultural sensitivity of the nuances at play. To equip interactive AI systems with a remediation ability, we offer ReNoVi - a large-scale corpus of 9,258 multi-turn dialogues annotated with social norms, as well as define a sequence of tasks to help understand and remediate norm violations step by step. ReNoVi consists of two parts: 512 human-authored dialogues (real data), and 8,746 synthetic conversations generated by ChatGPT through prompt learning. While collecting sufficient human-authored data is costly, synthetic conversations provide suitable amounts of data to help mitigate the scarcity of training data, as well as the chance to assess the alignment between LLMs and humans in the awareness of social norms. We thus harness the power of ChatGPT to generate synthetic training data for our task. To ensure the quality of both human-authored and synthetic data, we follow a quality control protocol during data collection. Our experimental results demonstrate the importance of remediating norm violations in socio-cultural conversations, as well as the improvement in performance obtained from synthetic data.
摘要：当个人未能遵守文化上接受的行为时，就会发生违反规范的情况，这可能会导致潜在的冲突。纠正违反规范的行为需要社会意识和对细微差别的文化敏感性。为了让交互式人工智能系统具备补救能力，我们提供了 ReNoVi——一个由 9,258 条多轮对话组成的大规模语料库，带有社会规范注释，并定义了一系列任务来帮助逐步理解和补救违反规范的行为。 ReNoVi 由两部分组成：512 个人工创作的对话（真实数据），以及 ChatGPT 通过即时学习生成的 8,746 个合成对话。虽然收集足够的人类编写的数据成本高昂，但合成对话可以提供适量的数据，以帮助缓解培训数据的稀缺性，并有机会评估法学硕士和人类在社会规范意识方面的一致性。因此，我们利用 ChatGPT 的强大功能为我们的任务生成合成训练数据。为了确保人工数据和合成数据的质量，我们在数据收集过程中遵循质量控制协议。我们的实验结果证明了纠正社会文化对话中违反规范的行为的重要性，以及从合成数据中获得的性能改进的重要性。

Title: LaCo: Large Language Model Pruning via Layer Collapse

Authors: Yifei Yang, Zouying Cao, Hai Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.11187
Pdf URL: https://arxiv.org/pdf/2402.11187
Copy Paste: [[2402.11187]] LaCo: Large Language Model Pruning via Layer Collapse(https://arxiv.org/abs/2402.11187)
Keywords: language model, llm
Abstract: Large language models (LLMs) based on transformer are witnessing a notable trend of size expansion, which brings considerable costs to both model training and inference. However, existing methods such as model quantization, knowledge distillation, and model pruning are constrained by various issues, including hardware support limitations, the need for extensive training, and alterations to the internal structure of the model. In this paper, we propose a concise layer-wise pruning method called Layer Collapse (LaCo), in which rear model layers collapse into a prior layer, enabling a rapid reduction in model size while preserving the model structure. Comprehensive experiments show that our method maintains an average task performance of over 80% at pruning ratios of 25-30%, significantly outperforming existing state-of-the-art structured pruning methods. We also conduct post-training experiments to confirm that the proposed pruning method effectively inherits the parameters of the original model. Finally, we discuss our motivation from the perspective of layer-wise similarity and evaluate the performance of the pruned LLMs across various pruning ratios.
摘要：基于Transformer的大型语言模型（LLM）呈现出明显的规模扩张趋势，这给模型训练和推理带来了相当大的成本。然而，现有的模型量化、知识蒸馏和模型剪枝等方法受到各种问题的限制，包括硬件支持限制、需要大量训练以及对模型内部结构的改变等。在本文中，我们提出了一种简洁的分层剪枝方法，称为 \textit{Layer Collapse (LaCo)}，其中后模型层折叠到前一层，从而能够在保留模型结构的同时快速减小模型大小。综合实验表明，我们的方法在剪枝率为 25-30% 的情况下保持了超过 80% 的平均任务性能，显着优于现有最先进的结构化剪枝方法。我们还进行了训练后实验，以确认所提出的剪枝方法有效地继承了原始模型的参数。最后，我们从分层相似性的角度讨论了我们的动机，并评估了不同剪枝率下剪枝后的 LLM 的性能。
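Sketch: The abstract says that rear layers collapse into a prior layer but does not give the merge rule. The sketch below assumes one plausible rule: add each rear layer's parameter difference from the base layer onto the base layer, then drop the rear layers. This rule, the function name `collapse_layers`, and the dict-of-lists layer format are assumptions for illustration, not the paper's confirmed procedure.

```python
from typing import Dict, List

# Hypothetical layer format: parameter name -> flat list of weights.
Layer = Dict[str, List[float]]


def collapse_layers(layers: List[Layer], start: int, count: int) -> List[Layer]:
    """Fold the `count` layers after index `start` into layers[start].

    Assumed merge rule: merged = base + sum(rear_k - base) for each rear layer.
    The input list is left unmodified; a new, shorter list is returned.
    """
    base = layers[start]
    rear = layers[start + 1 : start + 1 + count]
    merged: Layer = {}
    for name, weights in base.items():
        new_weights = list(weights)
        for layer in rear:
            for i, value in enumerate(layer[name]):
                # Accumulate how far this rear layer sits from the base layer.
                new_weights[i] += value - weights[i]
        merged[name] = new_weights
    return layers[:start] + [merged] + layers[start + 1 + count :]
```

Because all layers share the same parameter shapes, the merged layer is a drop-in replacement, so the model becomes shallower without any change to its per-layer structure.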

Title: Disclosure and Mitigation of Gender Bias in LLMs

Authors: Xiangjue Dong, Yibo Wang, Philip S. Yu, James Caverlee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11190
Pdf URL: https://arxiv.org/pdf/2402.11190
Copy Paste: [[2402.11190]] Disclosure and Mitigation of Gender Bias in LLMs(https://arxiv.org/abs/2402.11190)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) can generate biased responses. Yet previous direct probing techniques contain either gender mentions or predefined gender stereotypes, which are challenging to comprehensively collect. Hence, we propose an indirect probing framework based on conditional generation. This approach aims to induce LLMs to disclose their gender bias even without explicit gender or stereotype mentions. We explore three distinct strategies to disclose explicit and implicit gender bias in LLMs. Our experiments demonstrate that all tested LLMs exhibit explicit and/or implicit gender bias, even when gender stereotypes are not present in the inputs. In addition, an increased model size or model alignment amplifies bias in most cases. Furthermore, we investigate three methods to mitigate bias in LLMs via Hyperparameter Tuning, Instruction Guiding, and Debias Tuning. Remarkably, these methods prove effective even in the absence of explicit genders or stereotypes.
摘要：大型语言模型 (LLM) 可能会产生有偏见的响应。然而，以前的直接探究技术要么包含性别提及，要么包含预定义的性别刻板印象，很难全面收集。因此，我们提出了一种基于条件生成的间接探测框架。这种方法旨在促使法学硕士披露他们的性别偏见，即使没有明确的性别或刻板印象提及。我们探索了三种不同的策略来披露法学硕士中显性和隐性的性别偏见。我们的实验表明，所有测试的法学硕士都表现出显性和/或隐性的性别偏见，即使输入中不存在性别刻板印象。此外，在大多数情况下，增加的模型大小或模型对齐会放大偏差。此外，我们还研究了三种通过超参数调优、指令指导和去偏调来减轻法学硕士偏差的方法。值得注意的是，即使在没有明确的性别或刻板印象的情况下，这些方法也被证明是有效的。

Title: I Learn Better If You Speak My Language: Enhancing Large Language Model Fine-Tuning with Style-Aligned Response Adjustments

Authors: Xuan Ren, Biao Wu, Lingqiao Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.11192
Pdf URL: https://arxiv.org/pdf/2402.11192
Copy Paste: [[2402.11192]] I Learn Better If You Speak My Language: Enhancing Large Language Model Fine-Tuning with Style-Aligned Response Adjustments(https://arxiv.org/abs/2402.11192)
Keywords: language model, llm
Abstract: Fine-tuning large language models (LLMs) with a small data set for particular tasks is a widely encountered yet complex challenge. The potential for overfitting on a limited number of examples can negatively impact the model's ability to generalize and retain its original skills. Our research explores the impact of the style of ground-truth responses during the fine-tuning process. We found that matching the ground-truth response style with the LLM's inherent style results in better learning outcomes. Building on this insight, we developed a method that minimally alters the LLM's pre-existing responses to correct errors, using these adjusted responses as training targets. This technique enables precise corrections in line with the model's native response style, safeguarding the model's core capabilities and thus avoiding overfitting. Our findings show that this approach not only improves the LLM's task-specific accuracy but also crucially maintains its original competencies and effectiveness.
摘要：针对特定任务使用小数据集微调大型语言模型 (LLM) 是一个广泛遇到但复杂的挑战。有限数量的示例的过度拟合可能会对模型泛化和保留其原始技能的能力产生负面影响。我们的研究探讨了微调过程中真实响应风格的影响。我们发现，将真实的回答风格与法学硕士的固有风格相匹配会带来更好的学习成果。基于这一见解，我们开发了一种方法，可以最大限度地改变法学硕士预先存在的反应以纠正错误，并使用这些调整后的反应作为培训目标。该技术可以根据模型的原生响应风格进行精确修正，保护模型的核心功能，从而避免过度拟合。我们的研究结果表明，这种方法不仅提高了法学硕士特定任务的准确性，而且还至关重要地保持了其原有的能力和有效性。

Title: Assessing LLMs' Mathematical Reasoning in Financial Document Question Answering

Authors: Pragya Srivastava, Manuj Malik, Tanuja Ganu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11194
Pdf URL: https://arxiv.org/pdf/2402.11194
Copy Paste: [[2402.11194]] Assessing LLMs' Mathematical Reasoning in Financial Document Question Answering(https://arxiv.org/abs/2402.11194)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) excel in natural language understanding, but their capability for complex mathematical reasoning over a combination of structured tables and unstructured text is uncertain. This study explores LLMs' mathematical reasoning on four financial tabular question-answering datasets: TATQA, FinQA, ConvFinQA, and Multihiertt. Through extensive experiments with various models and prompting techniques, we assess how LLMs adapt to complex tables and mathematical tasks. We focus on sensitivity to table complexity and performance variations with an increasing number of arithmetic reasoning steps. The results provide insights into LLMs' capabilities and limitations in handling complex mathematical scenarios for semi-structured tables. Ultimately, we introduce a novel prompting technique tailored to semi-structured documents, matching or outperforming other baselines in performance while providing a nuanced understanding of LLMs' abilities for such a task.
摘要：大型语言模型（LLM）在自然语言理解方面表现出色，但它们通过结构化表格和非结构化文本的合并进行复杂数学推理的能力尚不确定。本研究探讨了法学硕士在四个金融表格问答数据集上的数学推理：TATQA、FinQA、ConvFinQA 和 Multihiertt。通过对各种模型和提示技术的广泛实验，我们评估了法学硕士如何适应复杂的表格和数学任务。我们关注对表复杂性和性能变化的敏感性，并增加算术推理步骤的数量。结果让我们深入了解法学硕士在处理半结构化表的复杂数学场景方面的能力和局限性。最终，我们引入了一种针对半结构化文档量身定制的新颖提示技术，在性能上匹配或超越其他基线，同时提供对法学硕士完成此类任务的能力的细致入微的理解。

Title: Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs

Authors: Minh-Vuong Nguyen, Linhao Luo, Fatemeh Shiri, Dinh Phung, Yuan-Fang Li, Thuy-Trang Vu, Gholamreza Haffari
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11199
Pdf URL: https://arxiv.org/pdf/2402.11199
Copy Paste: [[2402.11199]] Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs(https://arxiv.org/abs/2402.11199)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large language models (LLMs) demonstrate strong reasoning abilities when prompted to generate chain-of-thought (CoT) explanations alongside answers. However, previous research on evaluating LLMs has solely focused on answer accuracy, neglecting the correctness of the generated CoT. In this paper, we delve deeper into the CoT reasoning capabilities of LLMs in multi-hop question answering by utilizing knowledge graphs (KGs). We propose a novel discriminative and generative CoT evaluation paradigm to assess LLMs' knowledge of reasoning and the accuracy of the generated CoT. Through experiments conducted on 5 different families of LLMs across 2 multi-hop question-answering datasets, we find that LLMs possess sufficient knowledge to perform reasoning. However, there exists a significant disparity between answer accuracy and faithfulness of the CoT reasoning generated by LLMs, indicating that they often arrive at correct answers through incorrect reasoning.
摘要：当被提示生成思维链 (CoT) 解释和答案时，大型语言模型 (LLM) 表现出强大的推理能力。然而，之前评估LLM的研究仅仅关注答案的准确性，而忽略了生成的CoT的正确性。在本文中，我们利用知识图（KG）更深入地研究了法学硕士在多跳问答中的 CoT 推理能力。我们提出了一种新颖的判别性和生成性 CoT 评估范式，以评估法学硕士的推理知识和生成的 CoT 的准确性。通过对 2 个多跳问答数据集的 5 个不同的 LLM 家族进行实验，我们发现 LLM 拥有足够的知识来执行推理。然而，LLM 生成的 CoT 推理的答案准确性和忠实度之间存在显着差异，这表明他们经常通过错误的推理得出正确的答案。

Title: Asclepius: A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language Models

Authors: Wenxuan Wang, Yihang Su, Jingyuan Huan, Jie Liu, Wenting Chen, Yudi Zhang, Cheng-Yi Li, Kao-Jung Chang, Xiaohan Xin, Linlin Shen, Michael R. Lyu
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2402.11217
Pdf URL: https://arxiv.org/pdf/2402.11217
Copy Paste: [[2402.11217]] Asclepius: A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language Models(https://arxiv.org/abs/2402.11217)
Keywords: language model, llm
Abstract: The significant breakthroughs of Medical Multi-Modal Large Language Models (Med-MLLMs) renovate modern healthcare with robust information synthesis and medical decision support. However, these models are often evaluated on benchmarks that are unsuitable for the Med-MLLMs due to the intricate nature of the real-world diagnostic frameworks, which encompass diverse medical specialties and involve complex clinical decisions. Moreover, these benchmarks are susceptible to data leakage, since Med-MLLMs are trained on large assemblies of publicly available data. Thus, an isolated and clinically representative benchmark is highly desirable for credible Med-MLLMs evaluation. To this end, we introduce Asclepius, a novel Med-MLLM benchmark that rigorously and comprehensively assesses model capability in terms of: distinct medical specialties (cardiovascular, gastroenterology, etc.) and different diagnostic capacities (perception, disease analysis, etc.). Grounded in 3 proposed core principles, Asclepius ensures a comprehensive evaluation by encompassing 15 medical specialties, stratifying into 3 main categories and 8 sub-categories of clinical tasks, and exempting from train-validate contamination. We further provide an in-depth analysis of 6 Med-MLLMs and compare them with 5 human specialists, providing insights into their competencies and limitations in various medical contexts. Our work not only advances the understanding of Med-MLLMs' capabilities but also sets a precedent for future evaluations and the safe deployment of these models in clinical environments. We launch and maintain a leaderboard for community assessment of Med-MLLM capabilities (https://asclepius-med.github.io/).
摘要：医学多模态大语言模型 (Med-MLLM) 的重大突破通过强大的信息综合和医疗决策支持革新了现代医疗保健。然而，由于现实世界诊断框架的复杂性（涵盖不同的医学专业并涉及复杂的临床决策），这些模型通常根据不适合 Med-MLLM 的基准进行评估。此外，这些基准很容易受到数据泄露的影响，因为 Med-MLLM 是在大型公开数据集上进行训练的。因此，对于可靠的 Med-MLLM 评估来说，非常需要一个独立且具有临床代表性的基准。为此，我们引入了 Asclepius，一种新颖的 Med-MLLM 基准，它严格、全面地评估模型在以下方面的能力：不同的医学专业（心血管、胃肠病学等）和不同的诊断能力（感知、疾病分析等）。基于提出的 3 项核心原则，Asclepius 通过涵盖 15 个医学专业、将临床任务分为 3 个主要类别和 8 个子类别，并避免训练验证污染来确保全面评估。我们进一步对 6 位 Med-MLLM 进行深入分析，并将他们与 5 位人类专家进行比较，深入了解他们在各种医疗环境中的能力和局限性。我们的工作不仅增进了对 Med-MLLM 功能的理解，还为未来的评估以及这些模型在临床环境中的安全部署奠定了先例。我们启动并维护了 Med-MLLM 能力社区评估排行榜 (https://asclepius-med.github.io/)。

Title: Controlled Text Generation for Large Language Model with Dynamic Attribute Graphs

Authors: Xun Liang, Hanyu Wang, Shichao Song, Mengting Hu, Xunzhi Wang, Zhiyu Li, Feiyu Xiong, Bo Tang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11218
Pdf URL: https://arxiv.org/pdf/2402.11218
Copy Paste: [[2402.11218]] Controlled Text Generation for Large Language Model with Dynamic Attribute Graphs(https://arxiv.org/abs/2402.11218)
Keywords: language model, llm
Abstract: Controlled Text Generation (CTG) aims to produce texts that exhibit specific desired attributes. In this study, we introduce a pluggable CTG framework for Large Language Models (LLMs) named Dynamic Attribute Graphs-based controlled text generation (DATG). This framework utilizes an attribute scorer to evaluate the attributes of sentences generated by LLMs and constructs dynamic attribute graphs. DATG modulates the occurrence of key attribute words and key anti-attribute words, achieving effective attribute control without compromising the original capabilities of the model. We conduct experiments across four datasets in two tasks: toxicity mitigation and sentiment transformation, employing five LLMs as foundational models. Our findings highlight a remarkable enhancement in control accuracy, achieving a peak improvement of 19.29% over baseline methods in the most favorable task across four datasets. Additionally, we observe a significant decrease in perplexity, markedly improving text fluency.
摘要：受控文本生成 (CTG) 旨在生成具有特定所需属性的文本。在本研究中，我们引入了一种用于大型语言模型 (LLM) 的可插入 CTG 框架，名为基于动态属性图的受控文本生成 (DATG)。该框架利用属性评分器来评估法学硕士生成的句子的属性并构建动态属性图。 DATG对关键属性词和关键反属性词的出现进行调制，在不损害模型原有能力的情况下实现有效的属性控制。我们在两个任务中的四个数据集上进行了实验：毒性减轻和情绪转换，采用五个法学硕士作为基础模型。我们的研究结果强调了控制精度的显着提高，在四个数据集的最有利任务中，与基线方法相比，峰值提高了 19.29%。此外，我们观察到困惑度显着降低，文本流畅度显着提高。

Title: ZeroG: Investigating Cross-dataset Zero-shot Transferability in Graphs

Authors: Yuhan Li, Peisong Wang, Zhixun Li, Jeffrey Xu Yu, Jia Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2402.11235
Pdf URL: https://arxiv.org/pdf/2402.11235
Copy Paste: [[2402.11235]] ZeroG: Investigating Cross-dataset Zero-shot Transferability in Graphs(https://arxiv.org/abs/2402.11235)
Keywords: language model, gpt, prompt
Abstract: With the development of foundation models such as large language models, zero-shot transfer learning has become increasingly significant. This is highlighted by the generative capabilities of NLP models like GPT-4, and the retrieval-based approaches of CV models like CLIP, both of which effectively bridge the gap between seen and unseen data. In the realm of graph learning, the continuous emergence of new graphs and the challenges of human labeling also amplify the necessity for zero-shot transfer learning, driving the exploration of approaches that can generalize across diverse graph data without necessitating dataset-specific and label-specific fine-tuning. In this study, we extend such paradigms to zero-shot transferability in graphs by introducing ZeroG, a new framework tailored to enable cross-dataset generalization. Addressing the inherent challenges such as feature misalignment, mismatched label spaces, and negative transfer, we leverage a language model to encode both node attributes and class semantics, ensuring consistent feature dimensions across datasets. We also propose a prompt-based subgraph sampling module that enriches the semantic information and structure information of extracted subgraphs using prompting nodes and neighborhood aggregation, respectively. We further adopt a lightweight fine-tuning strategy that reduces the risk of overfitting and maintains the zero-shot learning efficacy of the language model. The results underscore the effectiveness of our model in achieving significant cross-dataset zero-shot transferability, opening pathways for the development of graph foundation models. Especially, ZeroG, as a zero-shot method, can even achieve results comparable to those of semi-supervised learning on Pubmed.
摘要：随着大型语言模型等基础模型的发展，零样本迁移学习变得越来越重要。 GPT-4 等 NLP 模型的生成能力和 CLIP 等基于检索的 CV 模型突出了这一点，这两者都有效地弥合了可见数据和不可见数据之间的差距。在图学习领域，新图的不断出现和人类标记的挑战也放大了零样本迁移学习的必要性，推动了对无需针对特定数据集和特定标签进行微调即可跨不同图数据泛化的方法的探索。在本研究中，我们通过引入 ZeroG（一种专为实现跨数据集泛化而定制的新框架），将此类范例扩展到图中的零样本可转移性。为了解决特征错位、标签空间不匹配和负迁移等固有挑战，我们利用语言模型对节点属性和类语义进行编码，确保跨数据集的特征维度一致。我们还提出了一种基于提示的子图采样模块，该模块分别使用提示节点和邻域聚合来丰富提取的子图的语义信息和结构信息。我们进一步采用轻量级微调策略，降低过度拟合的风险并保持语言模型的零样本学习功效。结果强调了我们的模型在实现显着的跨数据集零样本可迁移性方面的有效性，为图基础模型的开发开辟了途径。尤其是ZeroG作为一种零样本方法，甚至可以在Pubmed上取得与半监督学习相媲美的结果。

Title: Can Large Language Models perform Relation-based Argument Mining?

Authors: Deniz Gorur, Antonio Rago, Francesca Toni
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.11243
Pdf URL: https://arxiv.org/pdf/2402.11243
Copy Paste: [[2402.11243]] Can Large Language Models perform Relation-based Argument Mining?(https://arxiv.org/abs/2402.11243)
Keywords: language model, llm, prompt
Abstract: Argument mining (AM) is the process of automatically extracting arguments, their components and/or relations amongst arguments and components from text. As the number of platforms supporting online debate increases, the need for AM becomes ever more urgent, especially in support of downstream tasks. Relation-based AM (RbAM) is a form of AM focusing on identifying agreement (support) and disagreement (attack) relations amongst arguments. RbAM is a challenging classification task, with existing methods failing to perform satisfactorily. In this paper, we show that general-purpose Large Language Models (LLMs), appropriately primed and prompted, can significantly outperform the best performing (RoBERTa-based) baseline. Specifically, we experiment with two open-source LLMs (Llama-2 and Mistral) with ten datasets.
摘要：论据挖掘（AM）是从文本中自动提取论据、其组成部分和/或论据和组成部分之间的关系的过程。随着支持在线辩论的平台数量的增加，对AM的需求变得越来越迫切，特别是在支持下游任务方面。基于关系的 AM (RbAM) 是 AM 的一种形式，专注于识别论点之间的同意（支持）和不同意（攻击）关系。 RbAM 是一项具有挑战性的分类任务，现有方法未能令人满意地执行。在本文中，我们表明，经过适当启动和提示的通用大型语言模型 (LLM) 可以显着优于性能最佳（基于 RoBERTa）的基线。具体来说，我们用两个开源法学硕士（Llama-2 和 Mistral）和 10 个数据集进行实验。

Title: LLM can Achieve Self-Regulation via Hyperparameter Aware Generation

Authors: Siyin Wang, Shimin Li, Tianxiang Sun, Jinlan Fu, Qinyuan Cheng, Jiasheng Ye, Junjie Ye, Xipeng Qiu, Xuanjing Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11251
Pdf URL: https://arxiv.org/pdf/2402.11251
Copy Paste: [[2402.11251]] LLM can Achieve Self-Regulation via Hyperparameter Aware Generation(https://arxiv.org/abs/2402.11251)
Keywords: language model, llm
Abstract: In the realm of Large Language Models (LLMs), users commonly employ diverse decoding strategies and adjust hyperparameters to control the generated text. However, a critical question emerges: Are LLMs conscious of the existence of these decoding strategies and capable of regulating themselves? The current decoding generation process often relies on empirical and heuristic manual adjustments to hyperparameters based on types of tasks and demands. However, this process is typically cumbersome, and the decoding hyperparameters may not always be optimal for each sample. To address the aforementioned challenges, we propose a novel text generation paradigm termed Hyperparameter Aware Generation (HAG). By leveraging hyperparameter-aware instruction tuning, the LLM autonomously determines the optimal decoding strategy and configs based on the input samples, enabling self-regulation. Our approach eliminates the need for extensive manual tuning, offering more autonomous, self-regulating model behavior. Experimental results spanning six datasets across reasoning, creativity, translation, and mathematics tasks demonstrate that hyperparameter-aware instruction tuning empowers the LLMs to self-regulate the decoding strategy and hyperparameters. HAG extends the current paradigm in the text generation process, highlighting the feasibility of endowing the LLMs with self-regulating decoding strategies.
摘要：在大型语言模型（LLM）领域，用户通常采用不同的解码策略并调整超参数来控制生成的文本。然而，一个关键问题出现了：法学硕士是否意识到这些解码策略的存在并有能力自我调节？当前的解码生成过程通常依赖于根据任务类型和需求对超参数进行经验和启发式手动调整。然而，这个过程通常很麻烦，并且解码超参数对于每个样本可能并不总是最佳的。为了解决上述挑战，我们提出了一种新颖的文本生成范例，称为超参数感知生成（HAG）。通过利用超参数感知指令调整，LLM 根据输入样本自主确定最佳解码策略和配置，从而实现自我调节。我们的方法消除了大量手动调整的需要，提供了更加自主、自我调节的模型行为。涵盖推理、创造力、翻译和数学任务的六个数据集的实验结果表明，超参数感知指令调整使法学硕士能够自我调节解码策略和超参数。 HAG 扩展了当前文本生成过程的范式，强调了赋予法学硕士自我调节解码策略的可行性。
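
For readers unfamiliar with the decoding hyperparameters this abstract refers to, the following is a minimal sketch of two common ones, temperature and top-p (nucleus) filtering, applied to a toy list of logits. It is not the HAG method itself; the function names `softmax_with_temperature` and `top_p_filter` are hypothetical and chosen only for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of top tokens whose cumulative mass reaches top_p.

    Returns a dict mapping token index to renormalized probability.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


# Toy example: the same logits decoded at two temperatures.
base = softmax_with_temperature([2.0, 1.0, 0.0], temperature=1.0)
sharp = softmax_with_temperature([2.0, 1.0, 0.0], temperature=0.5)
nucleus = top_p_filter([0.5, 0.3, 0.2], top_p=0.7)
```

In manual tuning, a user picks `temperature` and `top_p` per task; the abstract's proposal is that the model chooses such settings itself for each input.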

Title: Aligning Large Language Models by On-Policy Self-Judgment

Authors: Sangkyu Lee, Sungdong Kim, Ashkan Yousefpour, Minjoon Seo, Kang Min Yoo, Youngjae Yu
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2402.11253
Pdf URL: https://arxiv.org/pdf/2402.11253
Copy Paste: [[2402.11253]] Aligning Large Language Models by On-Policy Self-Judgment(https://arxiv.org/abs/2402.11253)
Keywords: language model
Abstract: To align large language models with human preferences, existing research either utilizes a separate reward model (RM) to perform on-policy learning or simplifies the training procedure by discarding the on-policy learning and the need for a separate RM. In this paper, we present a novel alignment framework, SELF-JUDGE, that is (1) on-policy and (2) parameter-efficient, as it does not require an additional RM for evaluating the samples for on-policy learning. To this end, we propose Judge-augmented Supervised Fine-Tuning (JSFT) to train a single model acting as both a policy and a judge. Specifically, we view the pairwise judgment task as a special case of the instruction-following task, choosing the better response from a response pair. Thus, the resulting model can judge preferences of on-the-fly responses from the current policy initialized from itself. Experimental results show the efficacy of SELF-JUDGE, outperforming baselines in preference benchmarks. We also show that self-rejection with oversampling can improve further without an additional evaluator. Our code is available at https://github.com/oddqueue/self-judge.
摘要：为了使大型语言模型与人类偏好保持一致，现有研究要么利用单独的奖励模型（RM）来执行策略学习，要么通过放弃策略学习和单独 RM 的需要来简化训练过程。在本文中，我们提出了一种新颖的对齐框架，SELF-JUDGE，它具有（1）策略学习和 2）参数效率，因为它不需要额外的 RM 来评估策略学习的样本。为此，我们提出法官增强监督微调（JSFT）来训练一个既充当政策又充当法官的模型。具体来说，我们将成对判断任务视为指令跟踪任务的特例，从响应对中选择更好的响应。因此，生成的模型可以根据自身初始化的当前策略来判断动态响应的偏好。实验结果表明 SELF-JUDGE 的有效性，在偏好基准中优于基线。我们还表明，在没有额外评估器的情况下，通过过采样进行自我拒绝可以进一步改进。我们的代码可在 https://github.com/oddqueue/self-judge 获取。

Title: C-ICL: Contrastive In-context Learning for Information Extraction

Authors: Ying Mo, Jian Yang, Jiahao Liu, Shun Zhang, Jingang Wang, Zhoujun Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11254
Pdf URL: https://arxiv.org/pdf/2402.11254
Copy Paste: [[2402.11254]] C-ICL: Contrastive In-context Learning for Information Extraction(https://arxiv.org/abs/2402.11254)
Keywords: language model, llm, prompt
Abstract: Recently, there has been increasing interest in exploring the capabilities of advanced large language models (LLMs) in the field of information extraction (IE), specifically focusing on tasks related to named entity recognition (NER) and relation extraction (RE). Although researchers are exploring the use of few-shot information extraction through in-context learning with LLMs, they tend to focus only on using correct or positive examples for demonstration, neglecting the potential value of incorporating incorrect or negative examples into the learning process. In this paper, we present c-ICL, a novel few-shot technique that leverages both correct and incorrect sample constructions to create in-context learning demonstrations. This approach enhances the ability of LLMs to extract entities and relations by utilizing prompts that incorporate not only the positive samples but also the reasoning behind them. This method allows for the identification and correction of potential interface errors. Specifically, our proposed method taps into the inherent contextual information and valuable information in hard negative samples and the nearest positive neighbors to the test and then applies the in-context learning demonstrations based on LLMs. Our experiments on various datasets indicate that c-ICL outperforms previous few-shot in-context learning methods, delivering substantial enhancements in performance across a broad spectrum of related tasks. These improvements are noteworthy, showcasing the versatility of our approach in miscellaneous scenarios.
摘要：最近，人们对探索信息提取 (IE) 领域的高级大语言模型 (LLM) 的功能越来越感兴趣，特别关注与命名实体识别 (NER) 和关系提取 (RE) 相关的任务。尽管研究人员正在探索通过法学硕士的情境学习来使用少样本信息提取，但他们往往只专注于使用正确或正面的例子进行演示，而忽略了将不正确或负面的例子纳入学习过程的潜在价值。在本文中，我们提出了 c-ICL，这是一种新颖的少样本技术，它利用正确和不正确的样本结构来创建上下文学习演示。这种方法通过利用不仅包含正样本而且还包含它们背后的推理的提示，增强了法学硕士提取实体和关系的能力。该方法可以识别和纠正潜在的接口错误。具体来说，我们提出的方法利用了硬负样本和测试的最近正邻居中的固有上下文信息和有价值的信息，然后应用基于法学硕士的上下文学习演示。我们对各种数据集的实验表明，c-ICL 优于之前的几次上下文学习方法，在广泛的相关任务中显着提高了性能。这些改进值得注意，展示了我们的方法在各种场景中的多功能性。

Title: MoRAL: MoE Augmented LoRA for LLMs' Lifelong Learning

Authors: Shu Yang, Muhammad Asif Ali, Cheng-Long Wang, Lijie Hu, Di Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.11260
Pdf URL: https://arxiv.org/pdf/2402.11260
Copy Paste: [[2402.11260]] MoRAL: MoE Augmented LoRA for LLMs' Lifelong Learning(https://arxiv.org/abs/2402.11260)
Keywords: language model, llm
Abstract: Adapting large language models (LLMs) to new domains/tasks and enabling them to be efficient lifelong learners is a pivotal challenge. In this paper, we propose MoRAL, i.e., Mixture-of-Experts augmented Low-Rank Adaptation for Lifelong Learning. MoRAL combines the multi-tasking abilities of MoE with the fine-tuning abilities of LoRA for effective life-long learning of LLMs. In contrast to the conventional approaches that use factual triplets as inputs, MoRAL relies on simple question-answer pairs, which is a more practical and effective strategy for robust and efficient learning. Owing to new data settings, we introduce a new evaluation benchmark, namely Life Long Learning of LLM (5L-bench), encompassing a newly curated dataset of question-answer pairs and a set of evaluation metrics for rigorous evaluation of MoRAL in open-book and closed-book settings. Experimental evaluation shows (i) LLMs learn fast in open-book settings with up to 30.15% improvement in "RA" for Phi-2-2.7B compared to closed-book (for models fine-tuned with MoRAL); (ii) MoRAL shows higher performance improvement for models with a greater number of parameters; (iii) MoRAL is robust to catastrophic forgetting, offering better knowledge retention compared to baselines.
摘要：使大型语言模型（LLM）适应新的领域/任务并使其成为高效的终身学习者是一项关键挑战。在本文中，我们提出了 MoRAL，即专家混合增强的终身学习低阶适应。 MoRAL 将 MoE 的多任务能力与 LoRA 的微调能力相结合，以实现法学硕士的有效终身学习。与使用事实三元组作为输入的传统方法相比，MoRAL 依赖于简单的问答对，这是一种更实用、更有效的稳健、高效学习策略。由于新的数据设置，我们引入了一个新的评估基准，即：LLM 的终身学习（5L-bench），包含新策划的问答对数据集，以及一组用于在开卷中严格评估 MoRAL 的评估指标和闭卷设置。实验评估表明 (i) 法学硕士在开卷环境中学习速度很快，与闭卷相比（对于使用 MoRAL 微调的模型），Phi-2-2.7B 的“RA”提高了 30.15%； (ii) MoRAL 对于具有更多参数的模型显示出更高的性能改进； (iii) MoRAL 对灾难性遗忘具有鲁棒性，与基线相比可以提供更好的知识保留。

Title: Multi-Perspective Consistency Enhances Confidence Estimation in Large Language Models

Authors: Pei Wang, Yejie Wang, Muxi Diao, Keqing He, Guanting Dong, Weiran Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.11279
Pdf URL: https://arxiv.org/pdf/2402.11279
Copy Paste: [[2402.11279]] Multi-Perspective Consistency Enhances Confidence Estimation in Large Language Models(https://arxiv.org/abs/2402.11279)
Keywords: language model, llm
Abstract: In the deployment of large language models (LLMs), accurate confidence estimation is critical for assessing the credibility of model predictions. However, existing methods often fail to overcome the issue of overconfidence on incorrect answers. In this work, we focus on improving the confidence estimation of large language models. Considering the fragility of self-awareness in language models, we introduce a Multi-Perspective Consistency (MPC) method. We leverage complementary insights from different perspectives within models (MPC-Internal) and across different models (MPC-Across) to mitigate the issue of overconfidence arising from a singular viewpoint. The experimental results on eight publicly available datasets show that our MPC achieves state-of-the-art performance. Further analyses indicate that MPC can mitigate the problem of overconfidence and is effectively scalable to other models.
摘要：在大型语言模型（LLM）的部署中，准确的置信度估计对于评估模型预测的可信度至关重要。然而，现有的方法往往无法克服对错误答案过度自信的问题。在这项工作中，我们专注于改进大型语言模型的置信度估计。考虑到语言模型中自我意识的脆弱性，我们引入了多视角一致性（MPC）方法。我们利用模型内（MPC-Internal）和不同模型之间（MPC-Across）不同视角的互补见解来缓解单一观点引起的过度自信问题。在八个公开数据集上的实验结果表明，我们的 MPC 实现了最先进的性能。进一步的分析表明，MPC 可以缓解过度自信的问题，并且可以有效地扩展到其他模型。
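
As a rough illustration of consistency-based confidence in general, the sketch below scores an answer by how often repeated samples agree with it, then blends agreement within one model with agreement across other models. This is an assumption-laden toy, not the MPC method from the paper: the helper names, the exact-match normalization, and the `weight` parameter are all hypothetical.

```python
from collections import Counter


def agreement_confidence(answers):
    """Return (majority_answer, fraction of samples that agree with it)."""
    if not answers:
        raise ValueError("need at least one answer")
    # Hypothetical normalization: trim whitespace and lowercase before comparing.
    normalized = [answer.strip().lower() for answer in answers]
    majority, count = Counter(normalized).most_common(1)[0]
    return majority, count / len(normalized)


def combined_confidence(internal_answers, external_answers, weight=0.5):
    """Blend agreement within one model with agreement from other models."""
    majority, internal = agreement_confidence(internal_answers)
    normalized = [answer.strip().lower() for answer in external_answers]
    if normalized:
        external = sum(answer == majority for answer in normalized) / len(normalized)
    else:
        external = 0.0
    return majority, weight * internal + (1 - weight) * external


# Toy example: four samples from one model, two answers from other models.
answer, internal_score = agreement_confidence(["Paris", "paris ", "Lyon", "Paris"])
_, blended_score = combined_confidence(
    ["Paris", "paris ", "Lyon", "Paris"], ["Paris", "Lyon"]
)
```

The point of the second function is that a model can be consistently wrong on its own; disagreement from other sources pulls the blended score down.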

Title: Can Large Multimodal Models Uncover Deep Semantics Behind Images?

Authors: Yixin Yang, Zheng Li, Qingxiu Dong, Heming Xia, Zhifang Sui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11281
Pdf URL: https://arxiv.org/pdf/2402.11281
Copy Paste: [[2402.11281]] Can Large Multimodal Models Uncover Deep Semantics Behind Images?(https://arxiv.org/abs/2402.11281)
Keywords: gpt
Abstract: Understanding the deep semantics of images is essential in the era dominated by social media. However, current research focuses primarily on the superficial description of images, revealing a notable deficiency in the systematic investigation of the inherent deep semantics. In this work, we introduce DEEPEVAL, a comprehensive benchmark to assess Large Multimodal Models' (LMMs) capacities for visual deep semantics. DEEPEVAL includes a human-annotated dataset and three progressive subtasks: fine-grained description selection, in-depth title matching, and deep semantics understanding. Utilizing DEEPEVAL, we evaluate 9 open-source LMMs and GPT-4V(ision). Our evaluation demonstrates a substantial gap between the deep semantic comprehension capabilities of existing LMMs and humans. For example, GPT-4V is 30% behind humans in understanding deep semantics, even though it achieves human-comparable performance in image description. Further analysis indicates that the integration of description texts during the inference process notably enhances LMMs' ability to perceive deep semantics. Furthermore, our dataset is divided into multiple categories, and we conducted a more detailed analysis within these categories.
摘要：在社交媒体主导的时代，理解图像的深层语义至关重要。然而，目前的研究主要集中在图像的表面描述上，对图像内在深层语义的系统研究还存在显着不足。在这项工作中，我们引入了 DEEPEVAL，这是一个评估大型多模态模型 (LMM) 视觉深度语义能力的综合基准。 DEEPEVAL 包括人工注释的数据集和三个渐进式子任务：细粒度描述选择、深度标题匹配和深度语义理解。利用 DEEPEVAL，我们评估了 9 个开源 LMM 和 GPT-4V(ision)。我们的评估表明现有 LMM 和人类的深层语义理解能力之间存在巨大差距。例如，GPT-4V 在理解深层语义方面落后人类 30%，尽管它在图像描述方面实现了与人类相当的性能。进一步分析表明，推理过程中描述文本的整合显着增强了 LMM 感知深层语义的能力。此外，我们的数据集分为多个类别，并且我们在这些类别内进行了更详细的分析。

Title: Puzzle Solving using Reasoning of Large Language Models: A Survey

Authors: Panagiotis Giadikiaroglou, Maria Lymperaiou, Giorgos Filandrianos, Giorgos Stamou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.11291
Pdf URL: https://arxiv.org/pdf/2402.11291
Copy Paste: [[2402.11291]] Puzzle Solving using Reasoning of Large Language Models: A Survey(https://arxiv.org/abs/2402.11291)
Keywords: language model, llm, prompt
Abstract: Exploring the capabilities of Large Language Models (LLMs) in puzzle solving unveils critical insights into their potential and challenges in artificial intelligence, marking a significant step towards understanding their applicability in complex reasoning tasks. This survey leverages a unique taxonomy -- dividing puzzles into rule-based and rule-less categories -- to critically assess LLMs through various methodologies, including prompting techniques, neuro-symbolic approaches, and fine-tuning. Through a critical review of relevant datasets and benchmarks, we assess LLMs' performance, identifying significant challenges in complex puzzle scenarios. Our findings highlight the disparity between LLM capabilities and human-like reasoning, particularly in those requiring advanced logical inference. The survey underscores the necessity for novel strategies and richer datasets to advance LLMs' puzzle-solving proficiency and contribute to AI's logical reasoning and creative problem-solving advancements.
摘要：探索大型语言模型（LLM）在解决难题方面的能力揭示了对其在人工智能中的潜力和挑战的重要见解，标志着在理解其在复杂推理任务中的适用性方面迈出了重要的一步。这项调查利用独特的分类法——将谜题分为有规则的和无规则的类别——通过各种方法，包括提示技术、神经符号方法和微调，批判性地评估法学硕士。通过对相关数据集和基准的严格审查，我们评估了法学硕士的表现，识别复杂难题场景中的重大挑战。我们的研究结果凸显了法学硕士能力与类人推理之间的差异，特别是在那些需要高级逻辑推理的领域。该调查强调了新颖的策略和更丰富的数据集的必要性，以提高法学硕士解决难题的能力，并为人工智能的逻辑推理和创造性解决问题的进步做出贡献。

Title: OneBit: Towards Extremely Low-bit Large Language Models

Authors: Yuzhuang Xu, Xu Han, Zonghan Yang, Shuo Wang, Qingfu Zhu, Zhiyuan Liu, Weidong Liu, Wanxiang Che
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11295
Pdf URL: https://arxiv.org/pdf/2402.11295
Copy Paste: [[2402.11295]] OneBit: Towards Extremely Low-bit Large Language Models(https://arxiv.org/abs/2402.11295)
Keywords: language model, llm
Abstract: Model quantization uses low bit-width values to represent the weight matrices of models, which is a promising approach to reduce both storage and computational overheads of deploying highly anticipated LLMs. However, existing quantization methods suffer severe performance degradation when the bit-width is extremely reduced, and thus focus on utilizing 4-bit or 8-bit values to quantize models. This paper boldly quantizes the weight matrices of LLMs to 1-bit, paving the way for the extremely low bit-width deployment of LLMs. For this target, we introduce a 1-bit quantization-aware training (QAT) framework named OneBit, including a novel 1-bit parameter representation method to better quantize LLMs as well as an effective parameter initialization method based on matrix decomposition to improve the convergence speed of the QAT framework. Extensive experimental results indicate that OneBit achieves good performance (at least 83% of the non-quantized performance) with robust training processes when only using 1-bit weight matrices.
摘要：模型量化使用低位宽值来表示模型的权重矩阵，这是一种有前途的方法，可以减少部署备受期待的 LLM 的存储和计算开销。然而，当位宽极度减小时，现有的量化方法会遭受严重的性能下降，因此集中于利用4位或8位值来量化模型。本文大胆地将LLM的权重矩阵量化为1位，为LLM的极低位宽部署铺平了道路。针对这个目标，我们引入了一个名为 OneBit 的 1 位量化感知训练（QAT）框架，包括一种新颖的 1 位参数表示方法以更好地量化 LLM，以及一种基于矩阵分解的有效参数初始化方法以提高收敛性QAT 框架的速度。足够的实验结果表明，OneBit 在仅使用 1 位权重矩阵时通过鲁棒的训练过程实现了良好的性能（至少是非量化性能的 83%）。
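
To make the idea of a 1-bit weight matrix concrete, here is a minimal sketch of generic sign-based binarization with one floating-point scale per row. It is not OneBit's parameter representation or its initialization scheme, which the abstract does not spell out; the function names are hypothetical.

```python
def binarize_rows(weights):
    """Quantize each row to signs (+1 or -1) plus one float scale per row."""
    signs, scales = [], []
    for row in weights:
        # For a fixed sign pattern, the mean absolute value is the scale
        # that minimizes the squared reconstruction error of the row.
        scales.append(sum(abs(value) for value in row) / len(row))
        signs.append([1 if value >= 0 else -1 for value in row])
    return signs, scales


def dequantize(signs, scales):
    """Rebuild an approximate float matrix from signs and per-row scales."""
    return [[sign * scale for sign in row] for row, scale in zip(signs, scales)]


def matvec(signs, scales, vector):
    """Multiply the binarized matrix by a vector using adds and subtracts per row."""
    result = []
    for row, scale in zip(signs, scales):
        accumulated = sum(x if sign > 0 else -x for sign, x in zip(row, vector))
        result.append(scale * accumulated)
    return result


# Toy example: a 2x3 weight matrix reduced to signs and two scales.
weights = [[0.5, -1.5, 1.0], [-0.2, 0.2, -0.2]]
signs, scales = binarize_rows(weights)
output = matvec(signs, scales, [1.0, 2.0, 3.0])
```

Each weight then costs one bit plus a shared scale, which is where the storage saving comes from; the accuracy loss from such naive rounding is what quantization-aware training is meant to recover.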

Title: Dissecting Human and LLM Preferences

Authors: Junlong Li, Fan Zhou, Shichao Sun, Yikai Zhang, Hai Zhao, Pengfei Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.11296
Pdf URL: https://arxiv.org/pdf/2402.11296
Copy Paste: [[2402.11296]] Dissecting Human and LLM Preferences(https://arxiv.org/abs/2402.11296)
Keywords: language model, gpt, llm
Abstract: As a relative quality comparison of model responses, human and Large Language Model (LLM) preferences serve as common alignment goals in model fine-tuning and criteria in evaluation. Yet, these preferences merely reflect broad tendencies, resulting in less explainable and controllable models with potential safety risks. In this work, we dissect the preferences of humans and 32 different LLMs to understand their quantitative composition, using annotations from real-world user-model conversations for a fine-grained, scenario-wise analysis. We find that humans are less sensitive to errors, favor responses that support their stances, and show clear dislike when models admit their limits. On the contrary, advanced LLMs like GPT-4-Turbo emphasize correctness, clarity, and harmlessness more. Additionally, LLMs of similar sizes tend to exhibit similar preferences, regardless of their training methods, and fine-tuning for alignment does not significantly alter the preferences of pretrained-only LLMs. Finally, we show that preference-based evaluation can be intentionally manipulated. In both training-free and training-based settings, aligning a model with the preferences of judges boosts scores, while injecting the least preferred properties lowers them. This results in notable score shifts: up to 0.59 on MT-Bench (1-10 scale) and 31.94 on AlpacaEval 2.0 (0-100 scale), highlighting the significant impact of this strategic adaptation. Interactive Demo: https://huggingface.co/spaces/GAIR/Preference-Dissection-Visualization Dataset: https://huggingface.co/datasets/GAIR/preference-dissection Code: https://github.com/GAIR-NLP/Preference-Dissection
摘要：作为模型响应的相对质量比较，人类和大语言模型 (LLM) 偏好可作为模型微调和评估标准中的共同对齐目标。然而，这些偏好仅仅反映了广泛的趋势，导致模型难以解释和可控，并存在潜在的安全风险。在这项工作中，我们剖析了人类和 32 个不同 LLM 的偏好，以了解它们的定量构成，使用来自现实世界用户模型对话的注释进行细粒度、场景分析。我们发现，人类对错误不太敏感，更喜欢支持自己立场的反应，并且当模型承认自己的局限性时表现出明显的厌恶。相反，像GPT-4-Turbo这样的高级LLM更强调正确性、清晰度和无害性。此外，无论其训练方法如何，规模相似的 LLM 往往表现出相似的偏好，并且对齐的微调不会显着改变仅预训练的 LLM 的偏好。最后，我们表明基于偏好的评估可以被有意操纵。在无训练和基于训练的环境中，将模型与评委的偏好保持一致会提高分数，而注入最不喜欢的属性会降低分数。这导致了显着的分数变化：MT-Bench（1-10 范围）上高达 0.59，AlpacaEval 2.0（0-100 范围）上高达 31.94，凸显了这种策略调整的重大影响。交互式演示：https://huggingface.co/spaces/GAIR/Preference-Dissection-Visualization 数据集：https://huggingface.co/datasets/GAIR/preference-dissection 代码：https://github.com/GAIR-NLP/Preference-Dissection

Title: MMMModal -- Multi-Images Multi-Audio Multi-turn Multi-Modal

Authors: Husein Zolkepli, Aisyah Razak, Kamarul Adha, Ariff Nazhan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11297
Pdf URL: https://arxiv.org/pdf/2402.11297
Copy Paste: [[2402.11297]] MMMModal -- Multi-Images Multi-Audio Multi-turn Multi-Modal(https://arxiv.org/abs/2402.11297)
Keywords: language model, llm
Abstract: Our contribution introduces a groundbreaking multimodal large language model designed to comprehend multi-images, multi-audio, and multi-images-multi-audio within a single multiturn session. Leveraging state-of-the-art models, we utilize the SigLIP encoder for visual inputs and the Whisper Encoder for audio inputs. Notably, this multimodal large language model is bilingual, proficient in understanding both English and Malay simultaneously. We proudly unveil two versions of this model: TinyLlama with 1.1B parameters, and Mistral with 7B parameters. With its ability to navigate diverse modalities and languages, our model represents a significant advancement for the Malaysian context and beyond. All models are released at https://huggingface.co/collections/mesolitica/multimodal-malaysian-llm-65c6f893e03f78fa9e5c8859
摘要：我们的贡献引入了一种突破性的多模态大语言模型，旨在在单个多轮会话中理解多图像、多音频和多图像多音频。利用最先进的模型，我们使用 SigLIP 编码器进行视觉输入，使用 Whisper 编码器进行音频输入。值得注意的是，这种多模态大语言模型是双语的，能够同时熟练地理解英语和马来语。我们自豪地推出该模型的两个版本：具有 1.1B 参数的 TinyLlama 和具有 7B 参数的 Mistral。凭借其驾驭多种模式和语言的能力，我们的模型代表了马来西亚及其他地区的重大进步。所有模型发布于 https://huggingface.co/collections/mesolitica/multimodal-malaysian-llm-65c6f893e03f78fa9e5c8859

Title: EVEDIT: Event-based Knowledge Editing with Deductive Editing Boundaries

Authors: Jiateng Liu, Pengfei Yu, Yuji Zhang, Sha Li, Zixuan Zhang, Heng Ji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11324
Pdf URL: https://arxiv.org/pdf/2402.11324
Copy Paste: [[2402.11324]] EVEDIT: Event-based Knowledge Editing with Deductive Editing Boundaries(https://arxiv.org/abs/2402.11324)
Keywords: language model, llm
Abstract: The dynamic nature of real-world information necessitates efficient knowledge editing (KE) in large language models (LLMs) for knowledge updating. However, current KE approaches, which typically operate on (subject, relation, object) triples, ignore the contextual information and the relation among different knowledge. Such editing methods could thus encounter an uncertain editing boundary, leaving a lot of relevant knowledge in ambiguity: Queries that could be answered pre-edit cannot be reliably answered afterward. In this work, we analyze this issue by introducing a theoretical framework for KE that highlights an overlooked set of knowledge that remains unchanged and aids in knowledge deduction during editing, which we name as the deduction anchor. We further address this issue by proposing a novel task of event-based knowledge editing that pairs facts with event descriptions. This task manifests not only a closer simulation of real-world editing scenarios but also a more logically sound setting, implicitly defining the deduction anchor to address the issue of indeterminate editing boundaries. We empirically demonstrate the superiority of event-based editing over the existing setting on resolving uncertainty in edited models, and curate a new benchmark dataset EvEdit derived from the CounterFact dataset. Moreover, while we observe that the event-based setting is significantly challenging for existing approaches, we propose a novel approach Self-Edit that showcases stronger performance, achieving 55.6% consistency improvement while maintaining the naturalness of generation.
摘要：现实世界信息的动态本质需要在大型语言模型（LLM）中进行高效的知识编辑（KE）以进行知识更新。然而，当前的 KE 方法通常对（主体、关系、客体）三元组进行操作，忽略了上下文信息和不同知识之间的关系。因此，这种编辑方法可能会遇到不确定的编辑边界，从而使许多相关知识变得模糊：可以在编辑前回答的查询在编辑后无法得到可靠回答。在这项工作中，我们通过引入KE的理论框架来分析这个问题，该框架突出了一组保持不变的被忽视的知识，并有助于编辑过程中的知识推演，我们将其称为推演锚点。我们通过提出一项基于事件的知识编辑的新任务来进一步解决这个问题，该任务将事实与事件描述配对。这项任务不仅体现了对现实世界编辑场景的更接近的模拟，而且体现了更符合逻辑的设置，隐式定义了推导锚点以解决编辑边界不确定的问题。我们凭经验证明了基于事件的编辑相对于现有设置在解决编辑模型中的不确定性方面的优越性，并策划了一个从 CounterFact 数据集派生的新基准数据集 EvEdit。此外，虽然我们观察到基于事件的设置对现有方法具有显着的挑战性，但我们提出了一种新颖的自我编辑方法，它展示了更强的性能，在保持生成自然性的同时实现了 55.6% 的一致性改进。

Title: PhaseEvo: Towards Unified In-Context Prompt Optimization for Large Language Models

Authors: Wendi Cui, Jiaxin Zhang, Zhuohang Li, Hao Sun, Damien Lopez, Kamalika Das, Bradley Malin, Sricharan Kumar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11347
Pdf URL: https://arxiv.org/pdf/2402.11347
Copy Paste: [[2402.11347]] PhaseEvo: Towards Unified In-Context Prompt Optimization for Large Language Models(https://arxiv.org/abs/2402.11347)
Keywords: language model, llm, prompt
Abstract: Crafting an ideal prompt for Large Language Models (LLMs) is a challenging task that demands significant resources and expert human input. Existing work treats the optimization of prompt instruction and in-context learning examples as distinct problems, leading to sub-optimal prompt performance. This research addresses this limitation by establishing a unified in-context prompt optimization framework, which aims to achieve joint optimization of the prompt instruction and examples. However, formulating such optimization in the discrete and high-dimensional natural language space introduces challenges in terms of convergence and computational efficiency. To overcome these issues, we present PhaseEvo, an efficient automatic prompt optimization framework that combines the generative capability of LLMs with the global search proficiency of evolution algorithms. Our framework features a multi-phase design incorporating innovative LLM-based mutation operators to enhance search efficiency and accelerate convergence. We conduct an extensive evaluation of our approach across 35 benchmark tasks. The results demonstrate that PhaseEvo significantly outperforms the state-of-the-art baseline methods by a large margin whilst maintaining good efficiency.
摘要：为大型语言模型 (LLM) 制作理想的提示是一项具有挑战性的任务，需要大量资源和专家的人力投入。现有的工作将提示教学和情境学习示例的优化视为不同的问题，导致提示性能不佳。本研究通过建立统一的上下文提示优化框架来解决这一限制，旨在实现提示指令和示例的联合优化。然而，在离散和高维自然语言空间中制定这种优化在收敛和计算效率方面带来了挑战。为了克服这些问题，我们提出了 PhaseEvo，这是一种高效的自动提示优化框架，它将法学硕士的生成能力与进化算法的全局搜索能力相结合。我们的框架采用多阶段设计，结合了基于 LLM 的创新变异算子，以提高搜索效率并加速收敛。我们对 35 项基准任务的方法进行了广泛的评估。结果表明，PhaseEvo 的性能大幅优于最先进的基线方法，同时保持良好的效率。

Title: Tasks That Language Models Don't Learn

Authors: Bruce W. Lee, JaeHyuk Lim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.11349
Pdf URL: https://arxiv.org/pdf/2402.11349
Copy Paste: [[2402.11349]] Tasks That Language Models Don't Learn(https://arxiv.org/abs/2402.11349)
Keywords: language model, llm, chain-of-thought
Abstract: We argue that there are certain properties of language that our current large language models (LLMs) don't learn. We present an empirical investigation of visual-auditory properties of language through a series of tasks, termed H-TEST. This benchmark highlights a fundamental gap between human linguistic comprehension, which naturally integrates sensory experiences, and the sensory-deprived processing capabilities of LLMs. In support of our hypothesis, 1. deliberate reasoning (Chain-of-Thought), 2. few-shot examples, or 3. stronger LLM from the same model family (LLaMA 2 13B -> LLaMA 2 70B) do not trivially bring improvements in H-TEST performance. Therefore, we make a particular connection to the philosophical case of Mary, who learns about the world in a sensory-deprived environment (Jackson, 1986). Our experiments show that some of the strongest proprietary LLMs stay near random chance baseline accuracy of 50%, highlighting the limitations of knowledge acquired in the absence of sensory experience.
摘要：我们认为，我们当前的大型语言模型（LLM）无法学习语言的某些属性。我们通过一系列称为 H-TEST 的任务对语言的视觉听觉特性进行实证研究。该基准凸显了人类语言理解（自然地整合了感官体验）与法学硕士的感官剥夺处理能力之间的根本差距。为了支持我们的假设，1. 深思熟虑的推理（思想链），2. 少数样本，或 3. 来自同一模型系列的更强的 LLM（LLaMA 2 13B -> LLaMA 2 70B）不会轻易带来改进在 H-TEST 性能中。因此，我们与玛丽的哲学案例有着特殊的联系，她在感官被剥夺的环境中了解世界（杰克逊，1986）。我们的实验表明，一些最强的专有法学硕士的随机机会基线准确度接近 50%，这凸显了在缺乏感官经验的情况下获取知识的局限性。

Title: What Changed? Converting Representational Interventions to Natural Language

Authors: Matan Avitan, Ryan Cotterell, Yoav Goldberg, Shauli Ravfogel
Subjects: cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2402.11355
Pdf URL: https://arxiv.org/pdf/2402.11355
Copy Paste: [[2402.11355]] What Changed? Converting Representational Interventions to Natural Language(https://arxiv.org/abs/2402.11355)
Keywords: language model
Abstract: Interventions targeting the representation space of language models (LMs) have emerged as effective means to influence model behavior. These methods are employed, for example, to eliminate or alter the encoding of demographic information such as gender within the model's representations, creating a counterfactual representation. However, since the intervention operates within the representation space, understanding precisely which features it modifies poses a challenge. We show that representation-space counterfactuals can be converted into natural language counterfactuals. We demonstrate that this approach enables us to analyze the linguistic alterations corresponding to a given representation-space intervention and to interpret the features utilized for encoding a specific concept. Moreover, the resulting counterfactuals can be used to mitigate bias in classification.
摘要：针对语言模型（LM）表示空间的干预措施已成为影响模型行为的有效手段。例如，这些方法用于消除或改变模型表示中的人口统计信息（例如性别）的编码，从而创建反事实表示。然而，由于干预是在表示空间内进行的，因此准确理解它修改了哪些特征构成了挑战。我们证明表征空间反事实可以转换为自然语言反事实。我们证明，这种方法使我们能够分析与给定表示空间干预相对应的语言变化，并解释用于编码特定概念的特征。此外，由此产生的反事实可用于减轻分类中的偏差。

Title: Training Language Model Agents without Modifying Language Models

Authors: Shaokun Zhang, Jieyu Zhang, Jiale Liu, Linxin Song, Chi Wang, Ranjay Krishna, Qingyun Wu
Subjects: cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2402.11359
Pdf URL: https://arxiv.org/pdf/2402.11359
Copy Paste: [[2402.11359]] Training Language Model Agents without Modifying Language Models(https://arxiv.org/abs/2402.11359)
Keywords: language model, llm, agent
Abstract: Researchers and practitioners have recently reframed powerful Large Language Models (LLMs) as agents, enabling them to automate complex tasks largely via the use of specialized functions. To facilitate the development of LLM agents, we present a novel paradigm of training LLM agents without modifying the LLM weights, which is particularly useful when the LLMs are difficult to modify or inaccessible for modification. Inspired by how humans continuously forge tools to adapt to real-world tasks, rather than change our biological structure to fit a static set of tools, we propose to progressively forge the agent's functions to better solve the downstream tasks instead of modifying the LLM weights. By treating the functions as learnable 'agent parameters' and leveraging the fundamental idea of model training in artificial intelligence, we develop AgentOptimizer, which employs the LLM to update agents' functions, and devise an agent training algorithm with two strategies, roll-back and early-stop, to streamline the training process. With extensive experiments, we showcase that the agent training paradigm could significantly improve the performance of representative LLM agents in various downstream tasks. We also study the behavior of the agent training regarding aspects like the learning curve and domain transferability.
摘要：研究人员和从业者最近将强大的大型语言模型（LLM）重新构建为代理，使它们能够主要通过使用专门的功能来自动执行复杂的任务。为了促进LLM代理的开发，我们提出了一种在不修改LLM权重的情况下训练LLM代理的新范式，这在LLM难以或无法修改时特别有用。受人类如何不断打造工具来适应现实世界任务，而不是改变我们的生物结构以适应一组静态工具的启发，我们建议逐步打造代理的功能以更好地解决下游任务，而不是修改 LLM 权重。通过将函数视为可学习的“代理参数”，并利用人工智能中模型训练的基本思想，我们开发了 AgentOptimizer，利用 LLM 来更新代理函数，并设计了具有回滚和早期两种策略的代理训练算法。 -stop，简化训练过程。通过大量的实验，我们展示了代理训练范例可以显着提高代表性 LLM 代理在各种下游任务中的性能。我们还研究了代理训练在学习曲线和域可转移性等方面的行为。

Title: Multi Task Inverse Reinforcement Learning for Common Sense Reward

Authors: Neta Glazer, Aviv Navon, Aviv Shamsian, Ethan Fetaya
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2402.11367
Pdf URL: https://arxiv.org/pdf/2402.11367
Copy Paste: [[2402.11367]] Multi Task Inverse Reinforcement Learning for Common Sense Reward(https://arxiv.org/abs/2402.11367)
Keywords: agent
Abstract: One of the challenges in applying reinforcement learning in a complex real-world environment lies in providing the agent with a sufficiently detailed reward function. Any misalignment between the reward and the desired behavior can result in unwanted outcomes. This may lead to issues like "reward hacking", where the agent maximizes rewards through unintended behavior. In this work, we propose to disentangle the reward into two distinct parts: a simple task-specific reward, outlining the particulars of the task at hand, and an unknown common-sense reward, indicating the expected behavior of the agent within the environment. We then explore how this common-sense reward can be learned from expert demonstrations. We first show that inverse reinforcement learning, even when it succeeds in training an agent, does not learn a useful reward function. That is, training a new agent with the learned reward does not produce the desired behaviors. We then demonstrate that this problem can be solved by training simultaneously on multiple tasks. That is, multi-task inverse reinforcement learning can be applied to learn a useful reward function.
摘要：在复杂的现实环境中应用强化学习的挑战之一在于为代理提供足够详细的奖励函数。奖励与期望行为之间的任何不一致都可能导致不良结果。这可能会导致诸如“奖励黑客”之类的问题，即代理通过无意的行为来最大化奖励。在这项工作中，我们建议将奖励分为两个不同的部分。一个简单的特定于任务的奖励，概述了手头任务的细节，以及一个未知的常识性奖励，表明代理在环境中的预期行为。然后我们探讨如何从专家演示中学习这种常识性奖励。我们首先表明，逆强化学习即使成功训练了代理，也无法学习有用的奖励函数。也就是说，用学到的奖励训练新代理不会损害所需的行为。然后我们证明这个问题可以通过同时训练多个任务来解决。也就是说，可以应用多任务逆强化学习来学习有用的奖励函数。

Title: Reinforcement learning to maximise wind turbine energy generation

Authors: Daniel Soler, Oscar Mariño, David Huergo, Martín de Frutos, Esteban Ferrer
Subjects: cs.LG, math-ph, math.OC
Abstract URL: https://arxiv.org/abs/2402.11384
Pdf URL: https://arxiv.org/pdf/2402.11384
Copy Paste: [[2402.11384]] Reinforcement learning to maximise wind turbine energy generation(https://arxiv.org/abs/2402.11384)
Keywords: agent
Abstract: We propose a reinforcement learning strategy to control wind turbine energy generation by actively changing the rotor speed, the rotor yaw angle and the blade pitch angle. A double deep Q-learning with a prioritized experience replay agent is coupled with a blade element momentum model and is trained to allow control for changing winds. The agent is trained to decide the best control (speed, yaw, pitch) for simple steady winds and is subsequently challenged with real dynamic turbulent winds, showing good performance. The double deep Q-learning is compared with a classic value iteration reinforcement learning control and both strategies outperform a classic PID control in all environments. Furthermore, the reinforcement learning approach is well suited to changing environments including turbulent/gusty winds, showing great adaptability. Finally, we compare all control strategies with real winds and compute the annual energy production. In this case, the double deep Q-learning algorithm also outperforms classic methodologies.
摘要：我们提出了一种强化学习策略，通过主动改变转子速度、转子偏航角和叶片桨距角来控制风力涡轮机发电。具有优先经验回放代理的双重深度 Q 学习与叶片元件动量模型相结合，并经过训练以允许控制不断变化的风。该智能体经过训练，可以决定简单稳定风的最佳控制（速度、偏航、俯仰），随后接受真实动态湍流风的挑战，表现出良好的性能。将双深度 Q 学习与经典的值迭代强化学习控制进行比较，两种策略在所有环境下都优于经典的 PID 控制。此外，强化学习方法非常适合包括湍流/阵风在内的不断变化的环境，表现出很强的适应性。最后，我们将所有控制策略与真实风进行比较并计算年发电量。在这种情况下，双深度 Q 学习算法也优于经典方法。
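
The abstract above names double deep Q-learning but does not spell out the update. As a minimal sketch of the standard double Q-learning target only (not the authors' turbine controller), the online network picks the next action and a separate target network scores it. The function names, the list-based Q-values, and the discount `gamma` are illustrative assumptions.

```python
def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN bootstrap target for one transition.

    next_q_online / next_q_target are per-action Q-value lists for the next
    state, from the online and target networks respectively (hypothetical inputs).
    """
    if done:
        return reward
    # Action selection uses the online network...
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    # ...while action evaluation uses the target network.
    return reward + gamma * next_q_target[best_action]


def dqn_target(reward, next_q_target, gamma=0.99, done=False):
    """Plain DQN target for comparison: the target network both selects and evaluates."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)
```

Decoupling selection from evaluation is what is generally credited with reducing the overestimation that the plain max-based target can introduce.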

Title: Reasoning before Comparison: LLM-Enhanced Semantic Similarity Metrics for Domain Specialized Text Analysis

Authors: Shaochen Xu, Zihao Wu, Huaqin Zhao, Peng Shu, Zhengliang Liu, Wenxiong Liao, Sheng Li, Andrea Sikora, Tianming Liu, Xiang Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.11398
Pdf URL: https://arxiv.org/pdf/2402.11398
Copy Paste: [[2402.11398]] Reasoning before Comparison: LLM-Enhanced Semantic Similarity Metrics for Domain Specialized Text Analysis(https://arxiv.org/abs/2402.11398)
Keywords: gpt, llm
Abstract: In this study, we leverage LLM to enhance the semantic analysis and develop similarity metrics for texts, addressing the limitations of traditional unsupervised NLP metrics like ROUGE and BLEU. We develop a framework where LLMs such as GPT-4 are employed for zero-shot text identification and label generation for radiology reports, where the labels are then used as measurements for text similarity. By testing the proposed framework on the MIMIC data, we find that GPT-4 generated labels can significantly improve the semantic similarity assessment, with scores more closely aligned with clinical ground truth than traditional NLP metrics. Our work demonstrates the possibility of conducting semantic analysis of the text data using semi-quantitative reasoning results by the LLMs for highly specialized domains. While the framework is implemented for radiology report similarity analysis, its concept can be extended to other specialized domains as well.
摘要：在本研究中，我们利用 LLM 来增强语义分析并开发文本相似性度量，解决 ROUGE 和 BLEU 等传统无监督 NLP 度量的局限性。我们开发了一个框架，其中 GPT-4 等法学硕士用于零样本文本识别和放射学报告的标签生成，然后将标签用作文本相似性的测量。通过在 MIMIC 数据上测试所提出的框架，我们发现 GPT-4 生成的标签可以显着改善语义相似性评估，其分数比传统 NLP 指标更符合临床基本事实。我们的工作展示了法学硕士使用半定量推理结果对高度专业化领域的文本数据进行语义分析的可能性。虽然该框架是为了放射学报告相似性分析而实现的，但其概念也可以扩展到其他专业领域。

Title: Don't Go To Extremes: Revealing the Excessive Sensitivity and Calibration Limitations of LLMs in Implicit Hate Speech Detection

Authors: Min Zhang, Jianfeng He, Taoran Ji, Chang-Tien Lu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11406
Pdf URL: https://arxiv.org/pdf/2402.11406
Copy Paste: [[2402.11406]] Don't Go To Extremes: Revealing the Excessive Sensitivity and Calibration Limitations of LLMs in Implicit Hate Speech Detection(https://arxiv.org/abs/2402.11406)
Keywords: language model, llm, prompt
Abstract: The fairness and trustworthiness of Large Language Models (LLMs) are receiving increasing attention. Implicit hate speech, which employs indirect language to convey hateful intentions, occupies a significant portion of practice. However, the extent to which LLMs effectively address this issue remains insufficiently examined. This paper delves into the capability of LLMs to detect implicit hate speech (Classification Task) and express confidence in their responses (Calibration Task). Our evaluation meticulously considers various prompt patterns and mainstream uncertainty estimation methods. Our findings highlight that LLMs exhibit two extremes: (1) LLMs display excessive sensitivity towards groups or topics that may cause fairness issues, resulting in misclassifying benign statements as hate speech. (2) LLMs' confidence scores for each method excessively concentrate on a fixed range, remaining unchanged regardless of the dataset's complexity. Consequently, the calibration performance is heavily reliant on primary classification accuracy. These discoveries unveil new limitations of LLMs, underscoring the need for caution when optimizing models to ensure they do not veer towards extremes. This serves as a reminder to carefully consider sensitivity and confidence in the pursuit of model fairness.
摘要：大型语言模型（LLM）的公平性和可信性越来越受到关注。隐性仇恨言论，即使用间接语言来传达仇恨意图，占据了实践的很大一部分。然而，法学硕士在多大程度上有效解决这个问题仍然没有得到充分的研究。本文深入研究了法学硕士检测隐含仇恨言论（分类任务）并表达对其回应的信心（校准任务）的能力。我们的评估仔细考虑了各种提示模式和主流的不确定性估计方法。我们的研究结果强调，法学硕士表现出两个极端：（1）法学硕士对可能引起公平问题的群体或话题表现出过度敏感，导致将善意的言论错误地归类为仇恨言论。 (2) 法学硕士对每种方法的置信度得分过度集中在固定范围内，无论数据集的复杂程度如何，都保持不变。因此，校准性能很大程度上依赖于初级分类精度。这些发现揭示了法学硕士的新局限性，强调在优化模型时需要谨慎，以确保它们不会走向极端。这提醒我们在追求模型公平性时仔细考虑敏感性和信心。

Title: Multi-dimensional Evaluation of Empathetic Dialog Responses

Authors: Zhichao Xu, Jiepu Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11409
Pdf URL: https://arxiv.org/pdf/2402.11409
Copy Paste: [[2402.11409]] Multi-dimensional Evaluation of Empathetic Dialog Responses(https://arxiv.org/abs/2402.11409)
Keywords: language model, gpt, llm, prompt
Abstract: Empathy is a critical element of effective and satisfactory conversational communication, yet previous studies in measuring conversational empathy mostly focus on expressed communicative intents, that is, the way in which empathy is expressed, ignoring the fact that conversation is also a collaborative practice involving both speakers and listeners. In contrast, we propose a multi-dimensional empathy evaluation framework that extends existing work to measure both expressed intents from the speaker's perspective and perceived empathy from the listener's perspective. Applying the proposed framework to analyzing our internal customer-service dialogue shows that the two dimensions (expressed intent types and perceived empathy) are inter-connected, while perceived empathy has a high correlation with the satisfaction level of dialogue sessions. This proposed framework still requires subjective assessments from trained annotators, which can be non-trivial to collect. To scale up evaluation without excessive reliance on carefully annotated data, we explore different modeling options to automatically measure conversational empathy with (1) prompting frozen large language models (LLMs) and (2) training language model-based classifiers. Extensive experiments on both internal and external dialogue datasets show that measuring conversational empathy remains a challenging task for prompting frozen LLMs, reflected by the less satisfying performance of GPT-4 and Flan family models. On the other hand, our proposed instruction-finetuned classifiers based on sequence-to-sequence (Seq2Seq) language models are able to achieve the best performance compared to prior works and competitive baselines. Finally, we perform comprehensive ablation studies on the performance of the proposed instruction-finetuned classifiers and give recommendations on potentially adopting them as automatic conversational empathy evaluation metrics.
摘要：同理心是有效和令人满意的会话沟通的关键要素，但以往测量会话同理心的研究大多集中于所表达的交际意图——同理心的表达方式，忽略了会话也是一种涉及说话者和听众的协作实践这一事实。相比之下，我们提出了一个多维同理心评估框架，该框架在现有工作的基础上进行了扩展，以衡量从说话者的角度表达的意图和从听众的角度感知的同理心。应用所提出的框架来分析我们的内部客户服务对话表明，两个维度（表达的意图类型和感知的同理心）是相互关联的，而感知的同理心与对话会话的满意度具有高度相关性。这个提议的框架仍然需要训练有素的注释者进行主观评估，而收集这些评估可能并不容易。为了扩大评估规模而不过度依赖仔细注释的数据，我们探索了不同的建模选项来自动测量会话同理心，包括（1）提示冻结大语言模型（LLM）和（2）训练基于语言模型的分类器。对内部和外部对话数据集的大量实验表明，测量对话同理心对于促使冻结的 LLM 来说仍然是一项具有挑战性的任务，GPT-4 和 Flan 系列模型的表现不太令人满意就反映了这一点。另一方面，与先前的工作和竞争基线相比，我们提出的基于序列到序列（Seq2Seq）语言模型的指令微调分类器能够实现最佳性能。最后，我们对所提出的指令微调分类器的性能进行了全面的消融研究，并就可能采用它们作为自动会话同理心评估指标提出了建议。

Title: Aligning Modalities in Vision Large Language Models via Preference Fine-tuning

Authors: Yiyang Zhou, Chenhang Cui, Rafael Rafailov, Chelsea Finn, Huaxiu Yao
Subjects: cs.LG, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2402.11411
Pdf URL: https://arxiv.org/pdf/2402.11411
Copy Paste: [[2402.11411]] Aligning Modalities in Vision Large Language Models via Preference Fine-tuning(https://arxiv.org/abs/2402.11411)
Keywords: language model, gpt, llm, hallucination, prompt
Abstract: Instruction-following Vision Large Language Models (VLLMs) have achieved significant progress recently on a variety of tasks. These approaches merge strong pre-trained vision models and large language models (LLMs). Since these components are trained separately, the learned representations need to be aligned with joint training on additional image-language pairs. This procedure is not perfect and can cause the model to hallucinate, that is, provide answers that do not accurately reflect the image, even when the core LLM is highly factual and the vision backbone has sufficiently complete representations. In this work, we frame the hallucination problem as an alignment issue and tackle it with preference tuning. Specifically, we propose POVID to generate feedback data with AI models. We use ground-truth instructions as the preferred response and a two-stage approach to generate dispreferred data. First, we prompt GPT-4V to inject plausible hallucinations into the correct answer. Second, we distort the image to trigger the inherent hallucination behavior of the VLLM. This is an automated approach, which does not rely on human data generation or require a perfect expert, which makes it easily scalable. Finally, both of these generation strategies are integrated into an RLHF pipeline via Direct Preference Optimization. In experiments across broad benchmarks, we show that we can not only reduce hallucinations, but also improve model performance across standard benchmarks, outperforming prior approaches. Our data and code are available at https://github.com/YiyangZhou/POVID.
摘要：指令跟随视觉大型语言模型（VLLM）最近在各种任务上取得了重大进展。这些方法融合了强大的预训练视觉模型和大型语言模型（LLM）。由于这些组件是单独训练的，因此学习到的表示需要与其他图像语言对的联合训练保持一致。这个过程并不完美，可能会导致模型产生幻觉——即使核心法学硕士非常真实并且视觉主干具有足够完整的表示，但提供的答案也不能准确反映图像。在这项工作中，我们将幻觉问题视为对齐问题，并通过偏好调整来解决它。具体来说，我们建议 POVID 使用 AI 模型生成反馈数据。我们使用真实指令作为首选响应，并使用两阶段方法来生成非首选数据。首先，我们提示 GPT-4V 将合理的幻觉注入正确答案中。其次，我们扭曲图像以触发 VLLM 固有的幻觉行为。这是一种自动化方法，不依赖于人类数据生成或需要完美的专家，这使得它易于扩展。最后，这两种生成策略都通过直接偏好优化集成到 RLHF 管道中。在跨广泛基准的实验中，我们表明我们不仅可以减少幻觉，还可以提高跨标准基准的模型性能，优于之前的方法。我们的数据和代码可在 https://github.com/YiyangZhou/POVID 获取。
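
The abstract above says the preferred and dispreferred responses are fed into Direct Preference Optimization. A minimal sketch of the standard per-pair DPO loss follows. It is not taken from the POVID repository, and the argument names and `beta` default are illustrative assumptions. The inputs are summed log-probabilities of each response under the trained policy and under a frozen reference model.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair: -log(sigmoid(beta * margin))."""
    # Margin: how much more the policy favors the chosen response than the
    # reference does, relative to the rejected response.
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    x = beta * margin
    # Numerically stable -log(sigmoid(x)) = log(1 + exp(-x)).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

When the policy equals the reference the margin is zero and the loss is log 2. Raising the chosen response's log-probability relative to the rejected one pushes the loss below that value.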

Title: LoRETTA: Low-Rank Economic Tensor-Train Adaptation for Ultra-Low-Parameter Fine-Tuning of Large Language Models

Authors: Yifan Yang, Jiajun Zhou, Ngai Wong, Zheng Zhang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.11417
Pdf URL: https://arxiv.org/pdf/2402.11417
Copy Paste: [[2402.11417]] LoRETTA: Low-Rank Economic Tensor-Train Adaptation for Ultra-Low-Parameter Fine-Tuning of Large Language Models(https://arxiv.org/abs/2402.11417)
Keywords: language model, llm
Abstract: Various parameter-efficient fine-tuning (PEFT) techniques have been proposed to enable computationally efficient fine-tuning while maintaining model performance. However, existing PEFT methods are still limited by the growing number of trainable parameters with the rapid deployment of Large Language Models (LLMs). To address this challenge, we present LoRETTA, an ultra-parameter-efficient framework that significantly reduces trainable parameters through tensor-train decomposition. Specifically, we propose two methods, named LoRETTA$_{adp}$ and LoRETTA$_{rep}$. The former employs tensorized adapters, offering a high-performance yet lightweight approach for the fine-tuning of LLMs. The latter emphasizes fine-tuning via weight parameterization with a set of small tensor factors. LoRETTA achieves comparable or better performance than most widely used PEFT methods with up to $100\times$ fewer parameters on the LLaMA-2-7B models. Furthermore, empirical results demonstrate that the proposed method effectively improves training efficiency, enjoys better multi-task learning performance, and enhances the anti-overfitting capability. Plug-and-play codes built upon the Huggingface framework and PEFT library will be released.
摘要：人们提出了各种参数高效微调（PEFT）技术，以在保持模型性能的同时实现计算高效的微调。然而，随着大型语言模型（LLM）的快速部署，现有的 PEFT 方法仍然受到可训练参数数量不断增加的限制。为了应对这一挑战，我们提出了 LoRETTA，这是一种超参数高效框架，可通过张量训练分解显着减少可训练参数。具体来说，我们提出了两种方法，名为 {LoRETTA}$_{adp}$ 和 {LoRETTA}$_{rep}$。前者采用张量化适配器，为 LLM 的微调提供高性能且轻量级的方法。后者强调通过一组小张量因子的权重参数化进行微调。 LoRETTA 实现了与最广泛使用的 PEFT 方法相当或更好的性能，并且在 LLaMA-2-7B 模型上参数减少了 100 倍。此外，实证结果表明，该方法有效提高了训练效率，具有更好的多任务学习性能，增强了抗过拟合能力。基于 Huggingface 框架和 PEFT 库构建的即插即用代码将被发布。
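
To give a feel for why a tensor-train decomposition shrinks the parameter count, here is a small counting sketch for the generic TT-matrix format, where core k has shape (r_{k-1}, m_k, n_k, r_k) and the boundary ranks are 1. The factor shapes and the uniform rank below are illustrative assumptions, not LoRETTA's actual configuration.

```python
from math import prod


def dense_param_count(in_factors, out_factors):
    """Parameters of the full weight matrix of shape (prod(in), prod(out))."""
    return prod(in_factors) * prod(out_factors)


def tt_param_count(in_factors, out_factors, rank):
    """Parameters of a TT-matrix with one uniform internal rank (assumed for simplicity)."""
    if len(in_factors) != len(out_factors):
        raise ValueError("in_factors and out_factors must have the same length")
    d = len(in_factors)
    ranks = [1] + [rank] * (d - 1) + [1]  # boundary ranks are 1
    # Core k holds ranks[k] * m_k * n_k * ranks[k + 1] numbers.
    return sum(ranks[k] * in_factors[k] * out_factors[k] * ranks[k + 1]
               for k in range(d))


# Hypothetical 4096 x 4096 layer factored as (8, 8, 8, 8) x (8, 8, 8, 8), rank 4.
dense = dense_param_count([8, 8, 8, 8], [8, 8, 8, 8])
tt = tt_param_count([8, 8, 8, 8], [8, 8, 8, 8], rank=4)
```

In this toy setting the dense matrix has 16,777,216 entries while the four cores together hold 2,560, which illustrates the kind of reduction the decomposition makes possible.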

Title: Rethinking the Roles of Large Language Models in Chinese Grammatical Error Correction

Authors: Yinghui Li, Shang Qin, Jingheng Ye, Shirong Ma, Yangning Li, Libo Qin, Xuming Hu, Wenhao Jiang, Hai-Tao Zheng, Philip S. Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11420
Pdf URL: https://arxiv.org/pdf/2402.11420
Copy Paste: [[2402.11420]] Rethinking the Roles of Large Language Models in Chinese Grammatical Error Correction(https://arxiv.org/abs/2402.11420)
Keywords: language model, llm
Abstract: Recently, Large Language Models (LLMs) have been widely studied by researchers for their roles in various downstream NLP tasks. As a fundamental task in the NLP field, Chinese Grammatical Error Correction (CGEC) aims to correct all potential grammatical errors in the input sentences. Previous studies have shown that LLMs' performance as correctors on CGEC remains unsatisfactory due to its challenging task focus. To promote the CGEC field to better adapt to the era of LLMs, we rethink the roles of LLMs in the CGEC task so that they can be better utilized and explored in CGEC. Considering the rich grammatical knowledge stored in LLMs and their powerful semantic understanding capabilities, we utilize LLMs as explainers to provide explanation information for the CGEC small models during error correction to enhance performance. We also use LLMs as evaluators to bring more reasonable CGEC evaluations, thus alleviating the troubles caused by the subjectivity of the CGEC task. In particular, our work is also an active exploration of how LLMs and small models better collaborate in downstream tasks. Extensive experiments and detailed analyses on widely used datasets verify the effectiveness of our thinking intuition and the proposed methods.
摘要：最近，大型语言模型（LLM）因其在各种下游 NLP 任务中的作用而被研究人员广泛研究。作为自然语言处理领域的一项基本任务，中文语法错误纠正（CGEC）旨在纠正输入句子中所有潜在的语法错误。之前的研究表明，由于任务重点具有挑战性，法学硕士作为 CGEC 校正者的表现仍然不能令人满意。为了推动CGEC领域更好地适应LLM时代，我们重新思考LLM在CGEC任务中的角色，以便它们在CGEC中得到更好的利用和探索。考虑到LLM存储的丰富语法知识及其强大的语义理解能力，我们利用LLM作为解释器，在纠错过程中为CGEC小模型提供解释信息，以提高性能。我们还利用LLM作为评估者来带来更合理的CGEC评估，从而减轻CGEC任务的主观性带来的麻烦。特别是，我们的工作也是积极探索法学硕士和小型模型如何在下游任务中更好地协作。对广泛使用的数据集进行的大量实验和详细分析验证了我们的思维直觉和所提出的方法的有效性。

Title: EventRL: Enhancing Event Extraction with Outcome Supervision for Large Language Models

Authors: Jun Gao, Huan Zhao, Wei Wang, Changlong Yu, Ruifeng Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11430
Pdf URL: https://arxiv.org/pdf/2402.11430
Copy Paste: [[2402.11430]] EventRL: Enhancing Event Extraction with Outcome Supervision for Large Language Models(https://arxiv.org/abs/2402.11430)
Keywords: language model, gpt, llm, hallucination, prompt
Abstract: In this study, we present EventRL, a reinforcement learning approach developed to enhance event extraction for large language models (LLMs). EventRL utilizes outcome supervision with specific reward functions to tackle prevalent challenges in LLMs, such as instruction following and hallucination, manifested as the mismatch of event structure and the generation of undefined event types. We evaluate EventRL against existing methods like Few-Shot Prompting (FSP) (based on GPT4) and Supervised Fine-Tuning (SFT) across various LLMs, including GPT-4, LLaMa, and CodeLLaMa models. Our findings show that EventRL significantly outperforms these conventional approaches by improving the performance in identifying and structuring events, particularly in handling novel event types. The study emphasizes the critical role of reward function selection and demonstrates the benefits of incorporating code data for better event extraction. While increasing model size leads to higher accuracy, maintaining the ability to generalize is essential to avoid overfitting.
摘要：在这项研究中，我们提出了 EventRL，这是一种强化学习方法，旨在增强大型语言模型 (LLM) 的事件提取。 EventRL 利用具有特定奖励函数的结果监督来解决法学硕士中普遍存在的挑战，例如指令遵循和幻觉，表现为事件结构的不匹配和未定义事件类型的生成。我们针对各种 LLM（包括 GPT-4、LLaMa 和 CodeLLaMa 模型）中的现有方法（例如少样本提示 (FSP)（基于 GPT4）和监督微调 (SFT)）评估 EventRL。我们的研究结果表明，EventRL 通过提高识别和构建事件的性能，特别是在处理新事件类型方面，显着优于这些传统方法。该研究强调了奖励函数选择的关键作用，并证明了合并代码数据以更好地提取事件的好处。虽然增加模型大小可以提高准确性，但保持泛化能力对于避免过度拟合至关重要。

Title: Can Deception Detection Go Deeper? Dataset, Evaluation, and Benchmark for Deception Reasoning

Authors: Kang Chen, Zheng Lian, Haiyang Sun, Bin Liu, Jianhua Tao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11432
Pdf URL: https://arxiv.org/pdf/2402.11432
Copy Paste: [[2402.11432]] Can Deception Detection Go Deeper? Dataset, Evaluation, and Benchmark for Deception Reasoning(https://arxiv.org/abs/2402.11432)
Keywords: language model, gpt
Abstract: Deception detection has attracted increasing attention due to its importance in many practical scenarios. Currently, data scarcity harms the development of this field. On the one hand, it is costly to hire participants to simulate deception scenarios. On the other hand, it is difficult to collect videos containing deceptive behaviors on the Internet. To address data scarcity, this paper proposes a new data collection pipeline. Specifically, we use GPT-4 to simulate a role-play between a suspect and a police officer. During interrogation, the suspect lies to the police officer to evade responsibility for the crime, while the police officer uncovers the truth and gathers evidence. Compared with previous datasets, this strategy reduces data collection costs, providing a promising way to increase the dataset size. Meanwhile, we extend the traditional deception detection task to deception reasoning, further providing evidence for deceptive parts. This dataset can also be used to evaluate the complex reasoning capability of current large language models and serve as a reasoning benchmark for further research.
摘要：欺骗检测由于其在许多实际场景中的重要性而引起了越来越多的关注。目前，数据稀缺损害了该领域的发展。一方面，雇用参与者来模拟欺骗场景的成本很高。另一方面，网络上含有欺骗行为的视频很难采集。为了解决数据稀缺问题，本文提出了一种新的数据收集管道。具体来说，我们使用 GPT-4 来模拟嫌疑人和警察之间的角色扮演。审讯过程中，犯罪嫌疑人向警官撒谎以逃避犯罪责任，而警官则揭露真相并收集证据。与以前的数据集相比，该策略降低了数据收集成本，为增加数据集大小提供了一种有前途的方法。同时，我们将传统的欺骗检测任务扩展到欺骗推理，进一步为欺骗部分提供证据。该数据集还可用于评估当前大型语言模型的复杂推理能力，并作为进一步研究的推理基准。

Title: Perils of Self-Feedback: Self-Bias Amplifies in Large Language Models

Authors: Wenda Xu, Guanglei Zhu, Xuandong Zhao, Liangming Pan, Lei Li, William Yang Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.11436
Pdf URL: https://arxiv.org/pdf/2402.11436
Copy Paste: [[2402.11436]] Perils of Self-Feedback: Self-Bias Amplifies in Large Language Models(https://arxiv.org/abs/2402.11436)
Keywords: language model, llm
Abstract: Recent studies show that self-feedback improves large language models (LLMs) on certain tasks while worsening performance on others. We discover that this discrepancy is due to LLMs' bias towards their own output. In this paper, we formally define an LLM's self-bias, the tendency to favor its own generation, using two statistics. We analyze six LLMs on translation, constrained text generation, and mathematical reasoning tasks. We find that self-bias is prevalent in all examined LLMs across multiple languages and tasks. Our analysis reveals that while the self-refine pipeline improves the fluency and understandability of model outputs, it further amplifies self-bias. To mitigate such biases, we discover that larger model size and external feedback with accurate assessment can significantly reduce bias in the self-refine pipeline, leading to actual performance improvement in downstream tasks.
摘要：最近的研究表明，自我反馈可以改善某些任务的大型语言模型（LLM），但会恶化其他任务。我们发现这种相反的情况是由于LLM对自己产出的偏见造成的。在本文中，我们使用两项统计数据正式定义了法学硕士的自我偏见——偏向自己这一代人的倾向。我们分析了六位关于翻译、受限文本生成和数学推理任务的法学硕士。我们发现，在所有跨多种语言和任务的法学硕士中，自我偏见普遍存在。我们的分析表明，虽然自我完善管道提高了模型输出的流畅性和可理解性，但它进一步放大了自我偏差。为了减轻这种偏差，我们发现更大的模型大小和具有准确评估的外部反馈可以显着减少自优化管道中的偏差，从而导致下游任务的实际性能提高。

Title: InfuserKI: Enhancing Large Language Models with Knowledge Graphs via Infuser-Guided Knowledge Integration

Authors: Fali Wang, Runxue Bao, Suhang Wang, Wenchao Yu, Yanchi Liu, Wei Cheng, Haifeng Chen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.11441
Pdf URL: https://arxiv.org/pdf/2402.11441
Copy Paste: [[2402.11441]] InfuserKI: Enhancing Large Language Models with Knowledge Graphs via Infuser-Guided Knowledge Integration(https://arxiv.org/abs/2402.11441)
Keywords: language model, llm
Abstract: Though Large Language Models (LLMs) have shown remarkable open-generation capabilities across diverse domains, they struggle with knowledge-intensive tasks. To alleviate this issue, knowledge integration methods have been proposed to enhance LLMs with domain-specific knowledge graphs using external modules. However, they suffer from data inefficiency as they require both known and unknown knowledge for fine-tuning. Thus, we study a novel problem of integrating unknown knowledge into LLMs efficiently without unnecessary overlap of known knowledge. Injecting new knowledge poses the risk of forgetting previously acquired knowledge. To tackle this, we propose a novel Infuser-Guided Knowledge Integration (InfuserKI) framework that utilizes transformer internal states to determine whether to enhance the original LLM output with additional information, thereby effectively mitigating knowledge forgetting. Evaluations on the UMLS-2.5k and MetaQA domain knowledge graphs demonstrate that InfuserKI can effectively acquire new knowledge and outperform state-of-the-art baselines by 9% and 6%, respectively, in reducing knowledge forgetting.
摘要：尽管大型语言模型（LLM）在不同领域表现出了卓越的开放生成能力，但它们在处理知识密集型任务时遇到了困难。为了缓解这个问题，人们提出了知识集成方法，通过使用外部模块的特定领域知识图来增强法学硕士。然而，它们面临着数据效率低下的问题，因为它们需要已知和未知的知识来进行微调。因此，我们研究了一个新问题，即如何有效地将未知知识整合到法学硕士中，而无需与已知知识进行不必要的重叠。注入新知识会带来忘记以前获得的知识的风险。为了解决这个问题，我们提出了一种新颖的 Infuser-Guided Knowledge Integration (InfuserKI) 框架，该框架利用 Transformer 内部状态来确定是否使用附加信息来增强原始 LLM 输出，从而有效减少知识遗忘。对 UMLS-2.5k 和 MetaQA 领域知识图的评估表明，InfuserKI 可以有效获取新知识，并且在减少知识遗忘方面分别比最先进的基线高出 9% 和 6%。

Title: Can LLMs Reason with Rules? Logic Scaffolding for Stress-Testing and Improving LLMs

Authors: Siyuan Wang, Zhongyu Wei, Yejin Choi, Xiang Ren
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11442
Pdf URL: https://arxiv.org/pdf/2402.11442
Copy Paste: [[2402.11442]] Can LLMs Reason with Rules? Logic Scaffolding for Stress-Testing and Improving LLMs(https://arxiv.org/abs/2402.11442)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) have achieved impressive human-like performance across various reasoning tasks. However, their mastery of underlying inferential rules still falls short of human capabilities. To investigate this, we propose a logic scaffolding inferential rule generation framework, to construct an inferential rule base, ULogic, comprising both primitive and compositional rules across five domains. Our analysis of GPT-series models over a rule subset reveals significant gaps in LLMs' logic understanding compared to human performance, especially in compositional and structural complex rules with certain bias patterns. We further distill these rules into a smaller-scale inference engine for flexible rule generation and enhancing downstream reasoning. Through a multi-judger evaluation, our inference engine proves effective in generating accurate, complex and abstract conclusions and premises, and improves various commonsense reasoning tasks. Overall, our work sheds light on LLMs' limitations in grasping inferential rules and suggests ways to enhance their logical reasoning abilities. Code and data are available at https://github.com/SiyuanWangw/ULogic.
摘要：大型语言模型 (LLM) 在各种推理任务中取得了令人印象深刻的类人性能。然而，它们对潜在推理规则的掌握仍然低于人类的能力。为了研究这一点，我们提出了一个逻辑脚手架推理规则生成框架，以构建推理规则库 ULogic，其中包含跨五个领域的原始规则和组合规则。我们对规则子集上的 GPT 系列模型的分析揭示了 LLM 的逻辑理解与人类表现相比的显着差距，特别是在具有某些偏差模式的组成和结构复杂规则中。我们进一步将这些规则提炼成较小规模的推理引擎，以实现灵活的规则生成并增强下游推理。通过多评判者评估，我们的推理引擎被证明可以有效地生成准确、复杂和抽象的结论和前提，并改善各种常识性推理任务。总的来说，我们的工作揭示了 LLM 在掌握推理规则方面的局限性，并提出了增强其逻辑推理能力的方法。代码和数据可在 https://github.com/SiyuanWangw/ULogic 获得。

Title: Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation

Authors: Siyuan Wang, Zhuohan Long, Zhihao Fan, Zhongyu Wei, Xuanjing Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11443
Pdf URL: https://arxiv.org/pdf/2402.11443
Copy Paste: [[2402.11443]] Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation(https://arxiv.org/abs/2402.11443)
Keywords: language model, llm, agent
Abstract: This paper presents a benchmark self-evolving framework to dynamically evaluate rapidly advancing Large Language Models (LLMs), aiming for a more accurate assessment of their capabilities and limitations. We utilize a multi-agent system to manipulate the context or question of original instances, reframing new evolving instances with high confidence that dynamically extend existing benchmarks. Towards a more scalable, robust and fine-grained evaluation, we implement six reframing operations to construct evolving instances testing LLMs against diverse queries, data noise and probing their problem-solving sub-abilities. With this framework, we extend benchmark datasets of four tasks. Experimental results show a general performance decline in most LLMs against their original results. This decline under our scalable and robust evaluations, alongside our fine-grained evaluation, more accurately reflects models' capabilities. Besides, our framework widens performance discrepancies both between different models and within the same model across various tasks, facilitating more informed model selection for specific tasks (Code and data are available at https://github.com/NanshineLoong/Self-Evolving-Benchmark).
摘要：本文提出了一个基准自我演化框架，用于动态评估快速发展的大型语言模型（LLM），旨在更准确地评估其能力和局限性。我们利用多代理系统来操纵原始实例的上下文或问题，以高置信度重新构建新的不断发展的实例，从而动态扩展现有基准。为了实现更具可扩展性、稳健性和细粒度的评估，我们实施了六种重构操作来构建不断发展的实例，针对不同的查询、数据噪声测试 LLM 并探索其解决问题的子能力。通过这个框架，我们扩展了四个任务的基准数据集。实验结果表明，与最初的结果相比，大多数 LLM 的总体表现有所下降。在我们的可扩展和稳健的评估以及细粒度的评估下，这种下降更准确地反映了模型的能力。此外，我们的框架扩大了不同模型之间以及同一模型内跨各种任务的性能差异，有助于为特定任务选择更明智的模型（代码和数据可在 https://github.com/NanshineLoong/Self-Evolving-Benchmark 获取）。

Title: In-Context Example Ordering Guided by Label Distributions

Authors: Zhichao Xu, Daniel Cohen, Bei Wang, Vivek Srikumar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11447
Pdf URL: https://arxiv.org/pdf/2402.11447
Copy Paste: [[2402.11447]] In-Context Example Ordering Guided by Label Distributions(https://arxiv.org/abs/2402.11447)
Keywords: llm
Abstract: By allowing models to predict without task-specific training, in-context learning (ICL) with pretrained LLMs has enormous potential in NLP. However, a number of problems persist in ICL. In particular, its performance is sensitive to the choice and order of in-context examples. Given the same set of in-context examples with different orderings, model performance may vary between near random to near state-of-the-art. In this work, we formulate in-context example ordering as an optimization problem. We examine three problem settings that differ in the assumptions they make about what is known about the task. Inspired by the idea of learning from label proportions, we propose two principles for in-context example ordering guided by model's probability predictions. We apply our proposed principles to thirteen text classification datasets and nine different autoregressive LLMs with 700M to 13B parameters. We demonstrate that our approach outperforms the baselines by improving the classification accuracy, reducing model miscalibration, and also by selecting better in-context examples.
摘要：通过允许模型在没有特定任务训练的情况下进行预测，预训练的法学硕士的情境学习 (ICL) 在 NLP 领域具有巨大的潜力。然而，ICL 仍然存在一些问题。特别是，它的性能对上下文示例的选择和顺序很敏感。给定具有不同排序的同一组上下文示例，模型性能可能会在接近随机到接近最先进之间变化。在这项工作中，我们将上下文示例排序表述为优化问题。我们检查了三个问题设置，它们对已知任务的假设有所不同。受到从标签比例学习的想法的启发，我们提出了由模型概率预测指导的上下文示例排序的两个原则。我们将我们提出的原则应用于 13 个文本分类数据集和 9 个具有 700M 到 13B 参数的不同自回归 LLM。我们证明，我们的方法通过提高分类准确性、减少模型校准错误以及选择更好的上下文示例来优于基线。
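
Illustrative sketch (not from the paper): the abstract above frames in-context example ordering as an optimization problem guided by the model's probability predictions. The Python below is a minimal sketch under stated assumptions: it brute-forces all orderings and keeps the one whose average predicted label distribution is closest (in KL divergence) to an assumed label prior. The KL objective, the function names, and the toy recency-biased model are hypothetical stand-ins; the abstract does not specify the two principles the authors actually propose.

```python
import itertools
import math
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (input text, label)
LABELS = ["pos", "neg"]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) with a small epsilon to avoid log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def best_ordering(
    examples: Sequence[Example],
    predict_proba: Callable[[List[Example], str], List[float]],
    probe_inputs: Sequence[str],
    prior: Sequence[float],
) -> Tuple[List[Example], float]:
    """Exhaustively score every ordering (only feasible for a handful of examples)."""
    best, best_score = None, float("inf")
    for perm in itertools.permutations(examples):
        ordered = list(perm)
        # Average the predicted label distribution over the probe inputs.
        avg = [0.0] * len(prior)
        for x in probe_inputs:
            for i, p in enumerate(predict_proba(ordered, x)):
                avg[i] += p / len(probe_inputs)
        score = kl_divergence(avg, prior)
        if score < best_score:
            best, best_score = ordered, score
    return best, best_score


def toy_predict_proba(ordered: List[Example], x: str) -> List[float]:
    """Hypothetical stand-in for an LLM: later examples pull harder on the prediction."""
    mass = {label: 0.0 for label in LABELS}
    for position, (_, label) in enumerate(ordered, start=1):
        mass[label] += position
    total = sum(mass.values())
    return [mass[label] / total for label in LABELS]


demo_examples = [("c", "neg"), ("a", "pos"), ("b", "pos")]
```

With the toy model, the selected ordering places the single "neg" example last, which offsets the two "pos" examples and yields a balanced predicted distribution. A real setup would replace toy_predict_proba with calls to an actual LLM and would need a search strategy cheaper than full enumeration.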

Title: SciAgent: Tool-augmented Language Models for Scientific Reasoning

Authors: Yubo Ma, Zhibin Gou, Junheng Hao, Ruochen Xu, Shuohang Wang, Liangming Pan, Yujiu Yang, Yixin Cao, Aixin Sun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.11451
Pdf URL: https://arxiv.org/pdf/2402.11451
Copy Paste: [[2402.11451]] SciAgent: Tool-augmented Language Models for Scientific Reasoning(https://arxiv.org/abs/2402.11451)
Keywords: language model, gpt, llm, chat, agent
Abstract: Scientific reasoning poses an excessive challenge for even the most advanced Large Language Models (LLMs). To make this task more practical and solvable for LLMs, we introduce a new task setting named tool-augmented scientific reasoning. This setting supplements LLMs with scalable toolsets, and shifts the focus from pursuing an omniscient problem solver to a proficient tool-user. To facilitate the research of such setting, we construct a tool-augmented training corpus named MathFunc which encompasses over 30,000 samples and roughly 6,000 tools. Building on MathFunc, we develop SciAgent to retrieve, understand and, if necessary, use tools for scientific problem solving. Additionally, we craft a benchmark, SciToolBench, spanning five scientific domains to evaluate LLMs' abilities with tool assistance. Extensive experiments on SciToolBench confirm the effectiveness of SciAgent. Notably, SciAgent-Mistral-7B surpasses other LLMs with the same size by more than 13% in absolute accuracy. Furthermore, SciAgent-DeepMath-7B shows much superior performance than ChatGPT.
摘要：即使对于最先进的大型语言模型（LLM）来说，科学推理也构成了巨大的挑战。为了使这项任务对于法学硕士来说更加实用和可解决，我们引入了一种名为工具增强科学推理的新任务设置。这种设置为法学硕士提供了可扩展的工具集，并将重点从追求全知的问题解决者转移到熟练的工具用户。为了促进此类设置的研究，我们构建了一个名为 MathFunc 的工具增强训练语料库，其中包含超过 30,000 个样本和大约 6,000 个工具。在 MathFunc 的基础上，我们开发了 SciAgent 来检索、理解并在必要时使用工具来解决科学问题。此外，我们还制定了一个跨越五个科学领域的基准 SciToolBench，用于评估法学硕士在工具辅助下的能力。 SciToolBench 上的大量实验证实了 SciAgent 的有效性。值得注意的是，SciAgent-Mistral-7B 的绝对准确率比其他同等规模的法学硕士高出了 13% 以上。此外，SciAgent-DeepMath-7B 显示出比 ChatGPT 优越得多的性能。

Title: AutoPRM: Automating Procedural Supervision for Multi-Step Reasoning via Controllable Question Decomposition

Authors: Zhaorun Chen, Zhuokai Zhao, Zhihong Zhu, Ruiqi Zhang, Xiang Li, Bhiksha Raj, Huaxiu Yao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11452
Pdf URL: https://arxiv.org/pdf/2402.11452
Copy Paste: [[2402.11452]] AutoPRM: Automating Procedural Supervision for Multi-Step Reasoning via Controllable Question Decomposition(https://arxiv.org/abs/2402.11452)
Keywords: language model, llm
Abstract: Recent advancements in large language models (LLMs) have shown promise in multi-step reasoning tasks, yet their reliance on extensive manual labeling to provide procedural feedback remains a significant impediment. To address this challenge, in this paper, we propose a novel self-supervised framework AutoPRM that efficiently enhances the fine-tuning of LLMs for intricate reasoning challenges. Specifically, AutoPRM first decomposes complex problems into more manageable subquestions with a controllable granularity switch, then sequentially applies reinforcement learning to iteratively improve the subquestion solver. Additionally, we propose context-guided-decoding to avoid reward tampering and guide the subquestion solver towards the solution of the holistic problem. Extensive experiments show that AutoPRM significantly improves performance on mathematical and commonsense reasoning tasks over SOTA. More encouragingly, AutoPRM can be easily integrated with other orthogonal reasoning pipelines.
摘要：大型语言模型 (LLM) 的最新进展在多步骤推理任务中显示出了希望，但它们依赖大量手动标记来提供程序反馈仍然是一个重大障碍。为了应对这一挑战，在本文中，我们提出了一种新颖的自监督框架 AutoPRM，它可以有效地增强法学硕士针对复杂推理挑战的微调。具体来说，AutoPRM首先通过可控粒度开关将复杂问题分解为更易于管理的子问题，然后依次应用强化学习来迭代改进子问题求解器。此外，我们提出上下文引导解码以避免奖励篡改并引导子问题求解器解决整体问题。大量实验表明，与 SOTA 相比，AutoPRM 显着提高了数学和常识推理任务的性能。更令人鼓舞的是，AutoPRM 可以轻松地与其他正交推理管道集成。

Title: MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization

Authors: Zhiyu Yang, Zihan Zhou, Shuo Wang, Xin Cong, Xu Han, Yukun Yan, Zhenghao Liu, Zhixing Tan, Pengyuan Liu, Dong Yu, Zhiyuan Liu, Xiaodong Shi, Maosong Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11453
Pdf URL: https://arxiv.org/pdf/2402.11453
Copy Paste: [[2402.11453]] MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization(https://arxiv.org/abs/2402.11453)
Keywords: language model, gpt, llm, agent
Abstract: Scientific data visualization plays a crucial role in research by enabling the direct display of complex information and assisting researchers in identifying implicit patterns. Despite its importance, the use of Large Language Models (LLMs) for scientific data visualization remains rather unexplored. In this study, we introduce MatPlotAgent, an efficient model-agnostic LLM agent framework designed to automate scientific data visualization tasks. Leveraging the capabilities of both code LLMs and multi-modal LLMs, MatPlotAgent consists of three core modules: query understanding, code generation with iterative debugging, and a visual feedback mechanism for error correction. To address the lack of benchmarks in this field, we present MatPlotBench, a high-quality benchmark consisting of 100 human-verified test cases. Additionally, we introduce a scoring approach that utilizes GPT-4V for automatic evaluation. Experimental results demonstrate that MatPlotAgent can improve the performance of various LLMs, including both commercial and open-source models. Furthermore, the proposed evaluation method shows a strong correlation with human-annotated scores.
摘要：科学数据可视化通过直接显示复杂信息并帮助研究人员识别隐含模式，在研究中发挥着至关重要的作用。尽管大型语言模型 (LLM) 很重要，但它在科学数据可视化中的使用仍未得到探索。在本研究中，我们介绍了 MatPlotAgent，这是一种高效的模型无关的 LLM 代理框架，旨在自动化科学数据可视化任务。 MatPlotAgent 利用代码 LLM 和多模式 LLM 的功能，由三个核心模块组成：查询理解、迭代调试的代码生成以及用于纠错的可视化反馈机制。为了解决该领域缺乏基准的问题，我们推出了 MatPlotBench，这是一个由 100 个经过人工验证的测试用例组成的高质量基准。此外，我们引入了一种利用 GPT-4V 进行自动评估的评分方法。实验结果表明，MatPlotAgent 可以提高各种 LLM 的性能，包括商业模型和开源模型。此外，所提出的评估方法显示出与人工注释的分数有很强的相关性。

Title: LoRA-Flow: Dynamic LoRA Fusion for Large Language Models in Generative Tasks

Authors: Hanqing Wang, Bowen Ping, Shuo Wang, Xu Han, Yun Chen, Zhiyuan Liu, Maosong Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11455
Pdf URL: https://arxiv.org/pdf/2402.11455
Copy Paste: [[2402.11455]] LoRA-Flow: Dynamic LoRA Fusion for Large Language Models in Generative Tasks(https://arxiv.org/abs/2402.11455)
Keywords: language model, llm
Abstract: LoRA employs lightweight modules to customize large language models (LLMs) for each downstream task or domain, where different learned additional modules represent diverse skills. Combining existing LoRAs to address new tasks can enhance the reusability of learned LoRAs, particularly beneficial for tasks with limited annotated data. Most prior works on LoRA combination primarily rely on task-level weights for each involved LoRA, making different examples and tokens share the same LoRA weights. However, in generative tasks, different tokens may necessitate diverse skills to manage. Taking the Chinese math task as an example, understanding the problem description may depend more on the Chinese LoRA, while the calculation part may rely more on the math LoRA. To this end, we propose LoRA-Flow, which utilizes dynamic weights to adjust the impact of different LoRAs. The weights at each step are determined by a fusion gate with extremely few parameters, which can be learned with only 200 training examples. Experiments across six generative tasks demonstrate that our method consistently outperforms baselines with task-level fusion weights. This underscores the necessity of introducing dynamic fusion weights for LoRA combination.
摘要：LoRA 采用轻量级模块为每个下游任务或领域定制大型语言模型 (LLM)，其中不同的学习附加模块代表不同的技能。结合现有的 LoRA 来解决新任务可以增强学习到的 LoRA 的可重用性，特别有利于注释数据有限的任务。大多数先前关于 LoRA 组合的工作主要依赖于每个涉及的 LoRA 的任务级权重，使得不同的示例和令牌共享相同的 LoRA 权重。然而，在生成任务中，不同的代币可能需要不同的技能来管理。以中文数学任务为例，理解问题描述可能更多依赖中文LoRA，而计算部分可能更依赖数学LoRA。为此，我们提出了 LoRA-Flow，它利用动态权重来调整不同 LoRA 的影响。每一步的权重由参数极少的融合门决定，只需 200 个训练样例即可学习。六个生成任务的实验表明，我们的方法始终优于任务级融合权重的基线。这强调了为 LoRA 组合引入动态融合权重的必要性。
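
Illustrative sketch (not from the paper): the abstract above describes a fusion gate that produces dynamic, per-step weights over several LoRA modules. The Python below is a minimal sketch of that idea, assuming the gate is a single linear layer followed by a softmax over the current hidden state. That gate form, the names GATE_W, GATE_B, fusion_weights and fuse, and the toy numbers are all assumptions for illustration; the abstract only states that the gate has very few parameters.

```python
import math
from typing import List


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fusion_weights(hidden: List[float], gate_w: List[List[float]], gate_b: List[float]) -> List[float]:
    """One weight per LoRA module, computed from the current token's hidden state."""
    logits = [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(gate_w, gate_b)]
    return softmax(logits)


def fuse(
    hidden: List[float],
    lora_outputs: List[List[float]],
    gate_w: List[List[float]],
    gate_b: List[float],
) -> List[float]:
    """Weighted sum of the LoRA modules' outputs for this token."""
    weights = fusion_weights(hidden, gate_w, gate_b)
    dim = len(lora_outputs[0])
    return [sum(w * out[d] for w, out in zip(weights, lora_outputs)) for d in range(dim)]


# Hypothetical gate for two LoRA modules (say, a language LoRA and a math LoRA)
# over a 2-dimensional hidden state.
GATE_W = [[2.0, 0.0], [0.0, 2.0]]
GATE_B = [0.0, 0.0]
```

Because the weights are recomputed from each hidden state, a token whose state activates the first gate row leans on the first LoRA, and a different token can lean on the second. A task-level scheme would instead apply one fixed weight vector to every token.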

Title: FactPICO: Factuality Evaluation for Plain Language Summarization of Medical Evidence

Authors: Sebastian Antony Joseph, Lily Chen, Jan Trienes, Hannah Louisa Göke, Monika Coers, Wei Xu, Byron C Wallace, Junyi Jessy Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11456
Pdf URL: https://arxiv.org/pdf/2402.11456
Copy Paste: [[2402.11456]] FactPICO: Factuality Evaluation for Plain Language Summarization of Medical Evidence(https://arxiv.org/abs/2402.11456)
Keywords: gpt, llm
Abstract: Plain language summarization with LLMs can be useful for improving textual accessibility of technical content. But how factual are these summaries in a high-stakes domain like medicine? This paper presents FactPICO, a factuality benchmark for plain language summarization of medical texts describing randomized controlled trials (RCTs), which are the basis of evidence-based medicine and can directly inform patient treatment. FactPICO consists of 345 plain language summaries of RCT abstracts generated from three LLMs (i.e., GPT-4, Llama-2, and Alpaca), with fine-grained evaluation and natural language rationales from experts. We assess the factuality of critical elements of RCTs in those summaries: Populations, Interventions, Comparators, Outcomes (PICO), as well as the reported findings concerning these. We also evaluate the correctness of the extra information (e.g., explanations) added by LLMs. Using FactPICO, we benchmark a range of existing factuality metrics, including the newly devised ones based on LLMs. We find that plain language summarization of medical evidence is still challenging, especially when balancing between simplicity and factuality, and that existing metrics correlate poorly with expert judgments on the instance level.
摘要：法学硕士的简单语言摘要对于提高技术内容的文本可访问性很有用。但这些总结在医学等高风险领域的真实性如何？本文介绍了 FactPICO，这是一种事实性基准，用于描述随机对照试验 (RCT) 的医学文本的简单语言摘要，这是循证医学的基础，可以直接为患者治疗提供信息。 FactPICO 由三个 LLM（即 GPT-4、Llama-2 和 Alpaca）生成的 345 个 RCT 摘要的简单语言摘要组成，并包含专家的细粒度评估和自然语言原理。我们评估这些摘要中随机对照试验关键要素的真实性：人群、干预措施、比较器、结果 (PICO) 以及与这些相关的报告结果。我们还评估法学硕士添加的额外信息（例如解释）的正确性。使用 FactPICO，我们对一系列现有的事实性指标进行了基准测试，包括基于法学硕士新设计的指标。我们发现，用通俗易懂的语言总结医学证据仍然具有挑战性，特别是在简单性和事实性之间进行平衡时，并且现有指标与实例级别的专家判断相关性较差。

Title: When Do LLMs Need Retrieval Augmentation? Mitigating LLMs' Overconfidence Helps Retrieval Augmentation

Authors: Shiyu Ni, Keping Bi, Jiafeng Guo, Xueqi Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11457
Pdf URL: https://arxiv.org/pdf/2402.11457
Copy Paste: [[2402.11457]] When Do LLMs Need Retrieval Augmentation? Mitigating LLMs' Overconfidence Helps Retrieval Augmentation(https://arxiv.org/abs/2402.11457)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) have been found to have difficulty knowing they do not possess certain knowledge and tend to provide specious answers in such cases. Retrieval Augmentation (RA) has been extensively studied to mitigate LLMs' hallucinations. However, due to the extra overhead and unassured quality of retrieval, it may not be optimal to conduct RA all the time. A straightforward idea is to only conduct retrieval when LLMs are uncertain about a question. This motivates us to enhance the LLMs' ability to perceive their knowledge boundaries to help RA. In this paper, we first quantitatively measure LLMs' such ability and confirm their overconfidence. Then, we study how LLMs' certainty about a question correlates with their dependence on external retrieved information. We propose several methods to enhance LLMs' perception of knowledge boundaries and show that they are effective in reducing overconfidence. Additionally, equipped with these methods, LLMs can achieve comparable or even better performance of RA with much fewer retrieval calls.
摘要：人们发现，大型语言模型（LLM）很难知道自己不具备某些知识，并且在这种情况下往往会提供似是而非的答案。检索增强（RA）已被广泛研究以减轻法学硕士的幻觉。然而，由于额外的开销和不确定的检索质量，始终进行 RA 可能不是最佳选择。一个简单的想法是仅当法学硕士不确定某个问题时才进行检索。这激励我们增强法学硕士感知其知识边界的能力，以帮助 RA。在本文中，我们首先定量衡量法学硕士的这种能力并证实他们的过度自信。然后，我们研究法学硕士对问题的确定性与他们对外部检索信息的依赖如何相关。我们提出了几种方法来增强法学硕士对知识边界的感知，并表明它们可以有效减少过度自信。此外，借助这些方法，法学硕士可以通过更少的检索调用实现与 RA 相当甚至更好的性能。
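
Illustrative sketch (not from the paper): the abstract above motivates retrieving only when the model is uncertain about a question. The Python below is a minimal sketch of such a gate, assuming a scalar confidence score and a fixed threshold. The function names, the 0.7 threshold, and the toy stand-ins for the model and retriever are hypothetical; the abstract does not describe how the authors elicit or calibrate certainty.

```python
from typing import Callable, List, Optional, Tuple


def answer_with_gated_retrieval(
    question: str,
    answer_fn: Callable[[str, Optional[List[str]]], str],
    confidence_fn: Callable[[str], float],
    retrieve_fn: Callable[[str], List[str]],
    threshold: float = 0.7,
) -> Tuple[str, bool]:
    """Return (answer, used_retrieval); retrieval runs only below the threshold."""
    if confidence_fn(question) >= threshold:
        return answer_fn(question, None), False
    passages = retrieve_fn(question)
    return answer_fn(question, passages), True


# Toy stand-ins so the sketch runs without a model or a search index.
KNOWN = {"capital of France?": "Paris"}


def toy_confidence(question: str) -> float:
    return 0.9 if question in KNOWN else 0.2


def toy_retrieve(question: str) -> List[str]:
    return ["(retrieved passage for: " + question + ")"]


def toy_answer(question: str, passages: Optional[List[str]]) -> str:
    if passages:
        return "answer grounded in " + str(len(passages)) + " passage(s)"
    return KNOWN.get(question, "unknown")
```

The gate only saves retrieval calls if the confidence signal is trustworthy. An overconfident model would skip retrieval exactly when it needs it, which is the failure mode the abstract targets.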

Title: A Curious Case of Searching for the Correlation between Training Data and Adversarial Robustness of Transformer Textual Models

Authors: Cuong Dang, Dung D. Le, Thai Le
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2402.11469
Pdf URL: https://arxiv.org/pdf/2402.11469
Copy Paste: [[2402.11469]] A Curious Case of Searching for the Correlation between Training Data and Adversarial Robustness of Transformer Textual Models(https://arxiv.org/abs/2402.11469)
Keywords: gpt
Abstract: Existing works have shown that fine-tuned textual transformer models achieve state-of-the-art prediction performances but are also vulnerable to adversarial text perturbations. Traditional adversarial evaluation is often done only after fine-tuning the models and ignoring the training data. In this paper, we want to prove that there is also a strong correlation between training data and model robustness. To this end, we extract 13 different features representing a wide range of input fine-tuning corpora properties and use them to predict the adversarial robustness of the fine-tuned models. Focusing mostly on encoder-only transformer models BERT and RoBERTa with additional results for BART, ELECTRA and GPT2, we provide diverse evidence to support our argument. First, empirical analyses show that (a) extracted features can be used with a lightweight classifier such as Random Forest to effectively predict the attack success rate and (b) features with the most influence on the model robustness have a clear correlation with the robustness. Second, our framework can be used as a fast and effective additional tool for robustness evaluation since it (a) saves 30x-193x runtime compared to the traditional technique, (b) is transferable across models, (c) can be used under adversarial training, and (d) is robust to statistical randomness. Our code will be publicly available.
摘要：现有的工作表明，经过微调的文本转换器模型可以实现最先进的预测性能，但也容易受到对抗性文本扰动的影响。传统的对抗性评估通常是在微调模型并忽略训练数据之后才进行的。在本文中，我们想证明训练数据和模型鲁棒性之间也存在很强的相关性。为此，我们提取了代表各种输入微调语料库属性的 13 个不同特征，并使用它们来预测微调模型的对抗鲁棒性。我们主要关注仅编码器的 Transformer 模型 BERT 和 RoBERTa，以及 BART、ELECTRA 和 GPT2 的其他结果，提供了多种证据来支持我们的论点。首先，实证分析表明（a）提取的特征可以与随机森林等轻量级分类器一起使用，以有效预测攻击成功率；（b）对模型鲁棒性影响最大的特征与鲁棒性有明显的相关性。其次，我们的框架可以用作鲁棒性评估的快速有效的附加工具，因为它（a）与传统技术相比节省了 30 倍至 193 倍的运行时间，（b）可以跨模型转移，（c）可以在对抗性训练下使用，以及 (d) 对统计随机性具有鲁棒性。我们的代码将公开。

Title: DictLLM: Harnessing Key-Value Data Structures with Large Language Models for Enhanced Medical Diagnostics

Authors: YiQiu Guo, Yuchen Yang, Ya Zhang, Yu Wang, Yanfeng Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11481
Pdf URL: https://arxiv.org/pdf/2402.11481
Copy Paste: [[2402.11481]] DictLLM: Harnessing Key-Value Data Structures with Large Language Models for Enhanced Medical Diagnostics(https://arxiv.org/abs/2402.11481)
Keywords: language model, gpt, llm
Abstract: Structured data offers a sophisticated mechanism for the organization of information. Existing methodologies for the text-serialization of structured data in the context of large language models fail to adequately address the heterogeneity inherent in key-value structured data. These methods are not ideal and frequently result in larger input sizes and poor adaptability to input changes. In this paper, we introduce DictLLM, an innovative framework designed to improve the modeling of key-value structured data, like medical laboratory reports, for generating medical diagnoses. DictLLM integrates three key components: (1) group positional encoding to maintain permutation invariance, (2) hierarchical attention bias to capture the inherent bias in structured data, and (3) an optimal transport alignment layer that aligns the embedding generated by the dictionary encoder with the LLM, thereby producing a sequence of fixed-length virtual tokens. We carry out experiments using various LLM models on a comprehensive real-world medical laboratory report dataset for automatic diagnosis generation. Our findings illustrate that DictLLM significantly outperforms established baseline methods and few-shot GPT-4 implementations in terms of both Rouge-L and Knowledge F1 scores. Furthermore, our evaluation of the framework's scalability and robustness, through a series of experiments, underscores its exceptional capability in accurately modeling the complex key-value data structure of medical dictionary data.
摘要：结构化数据提供了一种复杂的信息组织机制。在大型语言模型的背景下，现有的结构化数据文本序列化方法无法充分解决键值结构化数据固有的异构性。这些方法并不理想，并且经常导致输入规模较大且对输入变化的适应性较差。在本文中，我们介绍了 DictLLM，这是一个创新框架，旨在改进键值结构化数据（例如医学实验室报告）的建模，以生成医学诊断。 DictLLM 集成了三个关键组件：(1) 组位置编码以保持排列不变性，(2) 分层注意力偏差以捕获结构化数据中的固有偏差，以及 (3) 最佳传输对齐层，用于对齐字典编码器生成的嵌入与LLM，从而产生一系列固定长度的虚拟令牌。我们在综合的真实世界医学实验室报告数据集上使用各种 LLM 模型进行实验以自动生成诊断，我们的研究结果表明，DictLLM 在 Rouge-L 和 Knowledge 方面显着优于既定的基线方法和少样本 GPT-4 实现F1成绩。此外，我们通过一系列实验对该框架的可扩展性和鲁棒性进行了评估，强调了其在精确建模医学词典数据的复杂键值数据结构方面的卓越能力。

Title: LEIA: Facilitating Cross-Lingual Knowledge Transfer in Language Models with Entity-based Data Augmentation

Authors: Ikuya Yamada, Ryokan Ri
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.11485
Pdf URL: https://arxiv.org/pdf/2402.11485
Copy Paste: [[2402.11485]] LEIA: Facilitating Cross-Lingual Knowledge Transfer in Language Models with Entity-based Data Augmentation(https://arxiv.org/abs/2402.11485)
Keywords: language model, llm
Abstract: Adapting English-based large language models (LLMs) to other languages has become increasingly popular due to the efficiency and potential of cross-lingual transfer. However, existing language adaptation methods often overlook the benefits of cross-lingual supervision. In this study, we introduce LEIA, a language adaptation tuning method that utilizes Wikipedia entity names aligned across languages. This method involves augmenting the target language corpus with English entity names and training the model using left-to-right language modeling. We assess LEIA on diverse question answering datasets using 7B-parameter LLMs, demonstrating significant performance gains across various non-English languages. The source code is available at https://github.com/studio-ousia/leia.
摘要：由于跨语言迁移的效率和潜力，将基于英语的大语言模型 (LLM) 应用于其他语言已变得越来越流行。然而，现有的语言适应方法常常忽视跨语言监督的好处。在本研究中，我们引入了 LEIA，这是一种利用跨语言对齐的维基百科实体名称的语言适应调整方法。该方法涉及使用英语实体名称扩充目标语言语料库，并使用从左到右的语言建模来训练模型。我们使用 7B 参数 LLM 在不同的问答数据集上评估 LEIA，证明了各种非英语语言的显着性能提升。源代码可在 https://github.com/studio-ousia/leia 获取。
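
Illustrative sketch (not from the paper): the abstract above describes augmenting a target-language corpus with English entity names before ordinary left-to-right language-model training. The Python below is a minimal sketch of the augmentation step only, assuming entity mentions arrive as character spans with an aligned English name. Placing the English name directly after the mention, the marker strings, and the function name are assumptions for illustration; the abstract does not specify the insertion format.

```python
from typing import List, Tuple

Mention = Tuple[int, int, str]  # (start offset, end offset, aligned English name)


def augment_with_english_names(
    text: str,
    mentions: List[Mention],
    open_tag: str = "<translate>",    # hypothetical marker
    close_tag: str = "</translate>",  # hypothetical marker
) -> str:
    """Insert each mention's English name right after the mention in the text."""
    pieces: List[str] = []
    cursor = 0
    for start, end, english_name in sorted(mentions):
        if start < cursor or end > len(text) or start >= end:
            raise ValueError("mentions must be valid, non-overlapping spans")
        pieces.append(text[cursor:end])  # text up to and including the mention
        pieces.append(open_tag + english_name + close_tag)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Example: a Japanese sentence with two entity mentions.
sentence = "東京は日本の首都です"
spans = [(0, 2, "Tokyo"), (3, 5, "Japan")]
augmented = augment_with_english_names(sentence, spans)
```

Here augmented is "東京<translate>Tokyo</translate>は日本<translate>Japan</translate>の首都です". In a real pipeline the spans and aligned names would come from Wikipedia hyperlinks and inter-language links rather than hand-written offsets.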

Title: What's the Plan? Evaluating and Developing Planning-Aware Techniques for LLMs

Authors: Eran Hirsch, Guy Uziel, Ateret Anaby-Tavor
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11489
Pdf URL: https://arxiv.org/pdf/2402.11489
Copy Paste: [[2402.11489]] What's the Plan? Evaluating and Developing Planning-Aware Techniques for LLMs(https://arxiv.org/abs/2402.11489)
Keywords: language model, llm, agent
Abstract: Planning is a fundamental task in artificial intelligence that involves finding a sequence of actions that achieve a specified goal in a given environment. Large language models (LLMs) are increasingly used for applications that require planning capabilities, such as web or embodied agents. In line with recent studies, we demonstrate through experimentation that LLMs lack necessary skills required for planning. Based on these observations, we advocate for the potential of a hybrid approach that combines LLMs with classical planning methodology. Then, we introduce SimPlan, a novel hybrid-method, and evaluate its performance in a new challenging setup. Our extensive experiments across various planning domains demonstrate that SimPlan significantly outperforms existing LLM-based planners.
摘要：规划是人工智能中的一项基本任务，涉及寻找在给定环境中实现指定目标的一系列行动。大型语言模型 (LLM) 越来越多地用于需要规划功能的应用程序，例如 Web 或实体代理。根据最近的研究，我们通过实验证明法学硕士缺乏规划所需的必要技能。基于这些观察，我们主张将法学硕士与经典规划方法相结合的混合方法的潜力。然后，我们介绍 SimPlan，一种新颖的混合方法，并在新的具有挑战性的设置中评估其性能。我们在各个规划领域进行的广泛实验表明，SimPlan 的性能显着优于现有的基于 LLM 的规划器。

Title: Benchmarking Knowledge Boundary for Large Language Model: A Different Perspective on Model Evaluation

Authors: Xunjian Yin, Xu Zhang, Jie Ruan, Xiaojun Wan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11493
Pdf URL: https://arxiv.org/pdf/2402.11493
Copy Paste: [[2402.11493]] Benchmarking Knowledge Boundary for Large Language Model: A Different Perspective on Model Evaluation(https://arxiv.org/abs/2402.11493)
Keywords: language model, prompt
Abstract: In recent years, substantial advancements have been made in the development of large language models, achieving remarkable performance across diverse tasks. To evaluate the knowledge ability of language models, previous studies have proposed lots of benchmarks based on question-answering pairs. We argue that it is not reliable and comprehensive to evaluate language models with a fixed question or limited paraphrases as the query, since language models are sensitive to prompt. Therefore, we introduce a novel concept named knowledge boundary to encompass both prompt-agnostic and prompt-sensitive knowledge within language models. Knowledge boundary avoids prompt sensitivity in language model evaluations, rendering them more dependable and robust. To explore the knowledge boundary for a given model, we propose projected gradient descent method with semantic constraints, a new algorithm designed to identify the optimal prompt for each piece of knowledge. Experiments demonstrate a superior performance of our algorithm in computing the knowledge boundary compared to existing methods. Furthermore, we evaluate the ability of multiple language models in several domains with knowledge boundary.
摘要：近年来，大型语言模型的开发取得了实质性进展，在各种任务中取得了显着的性能。为了评估语言模型的知识能力，之前的研究提出了很多基于问答对的基准。我们认为，用固定问题或有限释义作为查询来评估语言模型是不可靠和全面的，因为语言模型对提示很敏感。因此，我们引入了一个名为知识边界的新概念，以涵盖语言模型中的提示不可知论和提示敏感知识。知识边界避免了语言模型评估中的即时敏感性，使它们更加可靠和稳健。为了探索给定模型的知识边界，我们提出了具有语义约束的投影梯度下降法，这是一种旨在识别每条知识的最佳提示的新算法。实验证明，与现有方法相比，我们的算法在计算知识边界方面具有优越的性能。此外，我们评估了具有知识边界的多个领域中多种语言模型的能力。

Title: Federated Fine-tuning of Large Language Models under Heterogeneous Language Tasks and Client Resources

Authors: Jiamu Bai, Daoyuan Chen, Bingchen Qian, Liuyi Yao, Yaliang Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.11505
Pdf URL: https://arxiv.org/pdf/2402.11505
Copy Paste: [[2402.11505]] Federated Fine-tuning of Large Language Models under Heterogeneous Language Tasks and Client Resources(https://arxiv.org/abs/2402.11505)
Keywords: language model, llm
Abstract: Federated Learning (FL) has recently been applied to the parameter-efficient fine-tuning of Large Language Models (LLMs). While promising, it raises significant challenges due to the heterogeneous resources and data distributions of clients. This study introduces FlexLoRA, a simple yet effective aggregation scheme for LLM fine-tuning, which mitigates the "buckets effect" in traditional FL that restricts the potential of clients with ample resources by tying them to the capabilities of the least-resourced participants. FlexLoRA allows for dynamic adjustment of local LoRA ranks, fostering the development of a global model imbued with broader, less task-specific knowledge. By synthesizing a full-size LoRA weight from individual client contributions and employing Singular Value Decomposition (SVD) for weight redistribution, FlexLoRA fully leverages heterogeneous client resources. Involving over 1,600 clients performing diverse NLP tasks, our experiments validate the efficacy of FlexLoRA, with the federated global model achieving up to a 3.1% average improvement in downstream NLP task performance. FlexLoRA's practicality is further underscored by its seamless integration with existing LoRA-based FL methods and theoretical analysis, offering a path toward scalable, privacy-preserving federated tuning for LLMs.
摘要：联邦学习（FL）最近被应用于大型语言模型（LLM）的参数高效微调。虽然前景光明，但由于客户的异构资源和数据分布，它提出了重大挑战。本研究引入了 FlexLoRA，这是一种简单而有效的 LLM 微调聚合方案，它减轻了传统 FL 中的“水桶效应”，限制了传统 FL 的潜力。通过将客户与资源最少的参与者的能力联系起来，来吸引拥有充足资源的客户。 FlexLoRA 允许动态调整本地 LoRA 排名，促进充满更广泛、更少特定任务知识的全球模型的开发。通过从各个客户贡献中合成全尺寸 LoRA 权重并采用奇异值分解 (SVD) 进行权重重新分配，FlexLoRA 充分利用异构客户资源。我们的实验涉及超过 1,600 个执行不同 NLP 任务的客户，验证了 FlexLoRA 的功效，联合全局模型使下游 NLP 任务性能平均提高了 3.1%。 FlexLoRA 与现有基于 LoRA 的 FL 方法和理论分析的无缝集成进一步强调了其实用性，为法学硕士提供了一条可扩展、保护隐私的联合调优之路。

Title: From Prejudice to Parity: A New Approach to Debiasing Large Language Model Word Embeddings

Authors: Aishik Rakshit, Smriti Singh, Shuvam Keshari, Arijit Ghosh Chowdhury, Vinija Jain, Aman Chadha
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2402.11512
Pdf URL: https://arxiv.org/pdf/2402.11512
Copy Paste: [[2402.11512]] From Prejudice to Parity: A New Approach to Debiasing Large Language Model Word Embeddings(https://arxiv.org/abs/2402.11512)
Keywords: language model
Abstract: Embeddings play a pivotal role in the efficacy of Large Language Models. They are the bedrock on which these models grasp contextual relationships and foster a more nuanced understanding of language and consequently perform remarkably on a plethora of complex tasks that require a fundamental understanding of human language. Given that these embeddings themselves often reflect or exhibit bias, it stands to reason that these models may also inadvertently learn this bias. In this work, we build on the seminal previous work and propose DeepSoftDebias, an algorithm that uses a neural network to perform `soft debiasing'. We exhaustively evaluate this algorithm across a variety of SOTA datasets, accuracy metrics, and challenging NLP tasks. We find that DeepSoftDebias outperforms the current state-of-the-art methods at reducing bias across gender, race, and religion.
摘要：嵌入在大型语言模型的功效中发挥着关键作用。它们是这些模型掌握上下文关系并促进对语言更细致的理解的基石，因此在需要对人类语言有基本理解的大量复杂任务上表现出色。鉴于这些嵌入本身经常反映或表现出偏差，因此这些模型也可能无意中学习到这种偏差，这是理所当然的。在这项工作中，我们在之前的开创性工作的基础上提出了 DeepSoftDebias，这是一种使用神经网络执行“软去偏”的算法。我们在各种 SOTA 数据集、准确性指标和具有挑战性的 NLP 任务中详尽地评估了该算法。我们发现 DeepSoftDebias 在减少性别、种族和宗教偏见方面优于当前最先进的方法。

Title: Knowledge-to-SQL: Enhancing SQL Generation with Data Expert LLM

Authors: Zijin Hong, Zheng Yuan, Hao Chen, Qinggang Zhang, Feiran Huang, Xiao Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11517
Pdf URL: https://arxiv.org/pdf/2402.11517
Copy Paste: [[2402.11517]] Knowledge-to-SQL: Enhancing SQL Generation with Data Expert LLM(https://arxiv.org/abs/2402.11517)
Keywords: language model, llm
Abstract: Generating accurate SQL for user queries (text-to-SQL) is a long-standing problem, since generating the SQL requires comprehending the query and the database and retrieving the accurate data from the database accordingly. Existing models rely on the comprehensive ability of Large Language Models (LLMs) to generate the SQL according to the database schema. However, some necessary knowledge is neither explicitly included in the database schema nor learned by LLMs. Thus, the generated SQL of the knowledge-insufficient queries may be inaccurate, which negatively impacts the robustness of the text-to-SQL models. To deal with this situation, we propose the Knowledge-to-SQL framework, which employs a tailored Data Expert LLM (DELLM) to provide helpful knowledge for all types of text-to-SQL models. Specifically, we provide the detailed design of DELLM, in terms of table reading, and the basic fine-tuning process. We further provide a Reinforcement Learning via Database Feedback (RLDBF) training strategy to guide the DELLM to generate more helpful knowledge for LLMs. Extensive experiments verify DELLM can enhance the state-of-the-art LLMs on text-to-SQL tasks. The model structure and the parameter weight of DELLM are released for further research.

Title: Large Language Model-driven Meta-structure Discovery in Heterogeneous Information Network

Authors: Lin Chen, Fengli Xu, Nian Li, Zhenyu Han, Meng Wang, Yong Li, Pan Hui
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2402.11518
Pdf URL: https://arxiv.org/pdf/2402.11518
Copy Paste: [[2402.11518]] Large Language Model-driven Meta-structure Discovery in Heterogeneous Information Network(https://arxiv.org/abs/2402.11518)
Keywords: language model, llm
Abstract: Heterogeneous information networks (HIN) have gained increasing popularity for being able to capture complex relations between nodes of diverse types. Meta-structure was proposed to identify important patterns of relations on HIN, which has been proven effective for extracting rich semantic information and facilitating graph neural networks to learn expressive representations. However, hand-crafted meta-structures pose challenges for scaling up, which draws wide research attention for developing automatic meta-structure search algorithms. Previous efforts concentrate on searching for meta-structures with good empirical prediction performance, overlooking explainability. Thus, they often produce meta-structures prone to overfitting and incomprehensible to humans. To address this, we draw inspiration from the emergent reasoning abilities of large language models (LLMs). We propose a novel REasoning meta-STRUCTure search (ReStruct) framework that integrates LLM reasoning into the evolutionary procedure. ReStruct uses a grammar translator to encode meta-structures into natural language sentences, and leverages the reasoning power of LLMs to evaluate semantically feasible meta-structures. ReStruct also employs performance-oriented evolutionary operations. These two competing forces jointly optimize for semantic explainability and empirical performance of meta-structures. We also design a differential LLM explainer that can produce natural language explanations for the discovered meta-structures, and refine the explanation by reasoning through the search history. Experiments on five datasets demonstrate that ReStruct achieves SOTA performance in node classification and link recommendation tasks. Additionally, a survey study involving 73 graduate students shows that the meta-structures and natural language explanations generated by ReStruct are substantially more comprehensible.

Title: Unveiling the Secrets of Engaging Conversations: Factors that Keep Users Hooked on Role-Playing Dialog Agents

Authors: Shuai Zhang, Yu Lu, Junwen Liu, Jia Yu, Huachuan Qiu, Yuming Yan, Zhenzhong Lan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11522
Pdf URL: https://arxiv.org/pdf/2402.11522
Copy Paste: [[2402.11522]] Unveiling the Secrets of Engaging Conversations: Factors that Keep Users Hooked on Role-Playing Dialog Agents(https://arxiv.org/abs/2402.11522)
Keywords: language model, agent
Abstract: With the growing humanlike nature of dialog agents, people are now engaging in extended conversations that can stretch from brief moments to substantial periods of time. Understanding the factors that contribute to sustaining these interactions is crucial, yet existing studies primarily focus on short-term simulations that rarely explore such prolonged and real conversations. In this paper, we investigate the factors influencing retention rates in real interactions with roleplaying models. By analyzing a large dataset of interactions between real users and thousands of characters, we systematically examine multiple factors and assess their impact on user retention rate. Surprisingly, we find that the degree to which the bot embodies the roles it plays has limited influence on retention rates, while the length of each turn it speaks significantly affects retention rates. This study sheds light on the critical aspects of user engagement with role-playing models and provides valuable insights for future improvements in the development of large language models for role-playing purposes.

Title: Chain-of-Instructions: Compositional Instruction Tuning on Large Language Models

Authors: Shirley Anugrah Hayati, Taehee Jung, Tristan Bodding-Long, Sudipta Kar, Abhinav Sethy, Joo-Kyung Kim, Dongyeop Kang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11532
Pdf URL: https://arxiv.org/pdf/2402.11532
Copy Paste: [[2402.11532]] Chain-of-Instructions: Compositional Instruction Tuning on Large Language Models(https://arxiv.org/abs/2402.11532)
Keywords: language model, llm
Abstract: Fine-tuning large language models (LLMs) with a collection of large and diverse instructions has improved the model's generalization to different tasks, even for unseen tasks. However, most existing instruction datasets include only single instructions, and models trained on them struggle to follow complex instructions composed of multiple subtasks (Wang et al., 2023a). In this work, we propose a novel concept of compositional instructions called chain-of-instructions (CoI), where the output of one instruction becomes an input for the next like a chain. Unlike the conventional practice of solving single instruction tasks, our proposed method encourages a model to solve each subtask step by step until the final answer is reached. CoI-tuning (i.e., fine-tuning with CoI instructions) improves the model's ability to handle instructions composed of multiple subtasks. CoI-tuned models also outperformed baseline models on multilingual summarization, demonstrating the generalizability of CoI models on unseen composite downstream tasks.
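
The chaining idea in the abstract above, where the output of one instruction becomes the input of the next, can be illustrated with a minimal sketch. This is not the authors' training or inference code: `run_chain`, `summarize`, and `to_title` are hypothetical names, and the two step functions are simple string operations standing in for model calls.

```python
from typing import Callable, List

# A step maps the previous output (a string) to a new output.
Step = Callable[[str], str]


def run_chain(steps: List[Step], text: str) -> List[str]:
    """Apply each step to the previous step's output and keep every intermediate result."""
    outputs = []
    current = text
    for step in steps:
        current = step(current)
        outputs.append(current)
    return outputs


# Hypothetical stand-ins for model calls (assumptions, not from the paper).
def summarize(text: str) -> str:
    """Keep only the first sentence."""
    return text.split(".")[0].strip() + "."


def to_title(text: str) -> str:
    """Turn a sentence into a title-cased headline."""
    return text.rstrip(".").title()


outputs = run_chain([summarize, to_title], "the cat sat on the mat. it then slept.")
print(outputs)
```

Keeping the intermediate outputs mirrors the step-by-step behaviour the abstract describes, where each subtask is solved before the final answer is produced.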

Title: PreAct: Predicting Future in ReAct Enhances Agent's Planning Ability

Authors: Dayuan Fu, Jianzhao Huang, Siyuan Lu, Guanting Dong, Yejie Wang, Keqing He, Weiran Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.11534
Pdf URL: https://arxiv.org/pdf/2402.11534
Copy Paste: [[2402.11534]] PreAct: Predicting Future in ReAct Enhances Agent's Planning Ability(https://arxiv.org/abs/2402.11534)
Keywords: language model, llm, prompt, agent
Abstract: Addressing the discrepancies between predictions and actual outcomes often aids individuals in expanding their thought processes and engaging in reflection, thereby facilitating reasoning in the correct direction. In this paper, we introduce PreAct, an agent framework that integrates prediction with reasoning and action. Leveraging the information provided by predictions, a large language model (LLM) based agent can offer more diversified and strategically oriented reasoning, which in turn leads to more effective actions that help the agent complete complex tasks. Our experiments demonstrate that PreAct outperforms the ReAct approach in accomplishing complex tasks and that PreAct can be co-enhanced when combined with Reflexion methods. We prompt the model with different numbers of historical predictions and find that historical predictions have a sustained positive effect on LLM planning. The differences in single-step reasoning between PreAct and ReAct show that PreAct indeed offers advantages in terms of diversity and strategic directivity over ReAct.

Title: Deciphering the Impact of Pretraining Data on Large Language Models through Machine Unlearning

Authors: Yang Zhao, Li Du, Xiao Ding, Kai Xiong, Zhouhao Sun, Jun Shi, Ting Liu, Bing Qin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.11537
Pdf URL: https://arxiv.org/pdf/2402.11537
Copy Paste: [[2402.11537]] Deciphering the Impact of Pretraining Data on Large Language Models through Machine Unlearning(https://arxiv.org/abs/2402.11537)
Keywords: language model, llm
Abstract: Through pretraining on a corpus with various sources, Large Language Models (LLMs) have gained impressive performance. However, the impact of each component of the pretraining corpus remains opaque. As a result, the organization of the pretraining corpus is still empirical and may deviate from the optimal. To address this issue, we systematically analyze the impact of 48 datasets from 5 major categories of pretraining data of LLMs and measure their impacts on LLMs using benchmarks about nine major categories of model capabilities. Our analyses provide empirical results about the contribution of multiple corpora on the performances of LLMs, along with their joint impact patterns, including complementary, orthogonal, and correlational relationships. We also identify a set of "high-impact data" such as Books that is significantly related to a set of model capabilities. These findings provide insights into the organization of data to support more efficient pretraining of LLMs.

Title: Counter-intuitive: Large Language Models Can Better Understand Knowledge Graphs Than We Thought

Authors: Xinbang Dai, Yuncheng Hua, Tongtong Wu, Yang Sheng, Guilin Qi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.11541
Pdf URL: https://arxiv.org/pdf/2402.11541
Copy Paste: [[2402.11541]] Counter-intuitive: Large Language Models Can Better Understand Knowledge Graphs Than We Thought(https://arxiv.org/abs/2402.11541)
Keywords: language model, llm, hallucination, prompt
Abstract: Although the method of enhancing large language models' (LLMs') reasoning ability and reducing their hallucinations through the use of knowledge graphs (KGs) has received widespread attention, the exploration of how to enable LLMs to integrate the structured knowledge in KGs on-the-fly remains inadequate. Researchers often co-train KG embeddings and LLM parameters to equip LLMs with the ability of comprehending KG knowledge. However, this resource-hungry training paradigm significantly increases the model learning cost and is also unsuitable for non-open-source, black-box LLMs. In this paper, we employ complex question answering (CQA) as a task to assess the LLM's ability of comprehending KG knowledge. We conducted a comprehensive comparison of KG knowledge injection methods (from triples to natural language text), aiming to explore the optimal prompting method for supplying KG knowledge to LLMs, thereby enhancing their comprehension of KG. Contrary to our initial expectations, our analysis revealed that LLMs effectively handle messy, noisy, and linearized KG knowledge, outperforming methods that employ well-designed natural language (NL) textual prompts. This counter-intuitive finding provides substantial insights for future research on LLMs' comprehension of structured knowledge.

Title: KMMLU: Measuring Massive Multitask Language Understanding in Korean

Authors: Guijin Son, Hanwool Lee, Sungdong Kim, Seungone Kim, Niklas Muennighoff, Taekyoon Choi, Cheonbok Park, Kang Min Yoo, Stella Biderman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11548
Pdf URL: https://arxiv.org/pdf/2402.11548
Copy Paste: [[2402.11548]] KMMLU: Measuring Massive Multitask Language Understanding in Korean(https://arxiv.org/abs/2402.11548)
Keywords: language model, gpt, llm
Abstract: We propose KMMLU, a new Korean benchmark with 35,030 expert-level multiple-choice questions across 45 subjects ranging from humanities to STEM. Unlike previous Korean benchmarks that are translated from existing English benchmarks, KMMLU is collected from original Korean exams, capturing linguistic and cultural aspects of the Korean language. We test 26 publicly available and proprietary LLMs, identifying significant room for improvement. The best publicly available model achieves 50.54% on KMMLU, far below the average human performance of 62.6%. This model was primarily trained for English and Chinese, not Korean. Current LLMs tailored to Korean, such as Polyglot-Ko, perform far worse. Surprisingly, even the most capable proprietary LLMs, e.g., GPT-4 and HyperCLOVA X, achieve 59.95% and 53.40%, respectively. This suggests that further work is needed to improve Korean LLMs, and KMMLU offers the right tool to track this progress. We make our dataset publicly available on the Hugging Face Hub and integrate the benchmark into EleutherAI's Language Model Evaluation Harness.

Title: LongAgent: Scaling Language Models to 128k Context through Multi-Agent Collaboration

Authors: Jun Zhao, Can Zu, Hao Xu, Yi Lu, Wei He, Yiwen Ding, Tao Gui, Qi Zhang, Xuanjing Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.11550
Pdf URL: https://arxiv.org/pdf/2402.11550
Copy Paste: [[2402.11550]] LongAgent: Scaling Language Models to 128k Context through Multi-Agent Collaboration(https://arxiv.org/abs/2402.11550)
Keywords: language model, gpt, llm, long context, hallucination, agent
Abstract: Large language models (LLMs) have demonstrated impressive performance in understanding language and executing complex reasoning tasks. However, LLMs with long context windows have been notorious for their expensive training costs and high inference latency. Even the most advanced models such as GPT-4 and Claude2 often make mistakes when processing inputs of over 100k tokens, a phenomenon also known as "lost in the middle". In this paper, we propose LongAgent, a method based on multi-agent collaboration, which scales LLMs (e.g., LLaMA) to a context of 128K and demonstrates potential superiority in long-text processing compared to GPT-4. In LongAgent, a leader is responsible for understanding user intent and directing team members to acquire information from documents. Due to members' hallucinations, it is non-trivial for a leader to obtain accurate information from the responses of dozens to hundreds of members. To address this, we develop an inter-member communication mechanism to resolve response conflicts caused by hallucinations through information sharing. Our experimental results indicate that LongAgent offers a promising alternative for long-text processing. The agent team instantiated with LLaMA-7B achieves significant improvements in tasks such as 128k-long text retrieval and multi-hop question answering, compared to GPT-4.

Title: Cobra Effect in Reference-Free Image Captioning Metrics

Authors: Zheng Ma, Changxin Wang, Yawen Ouyang, Fei Zhao, Jianbing Zhang, Shujian Huang, Jiajun Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11572
Pdf URL: https://arxiv.org/pdf/2402.11572
Copy Paste: [[2402.11572]] Cobra Effect in Reference-Free Image Captioning Metrics(https://arxiv.org/abs/2402.11572)
Keywords: gpt
Abstract: Evaluating the compatibility between textual descriptions and corresponding images represents a core endeavor within multi-modal research. In recent years, a proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged. Empirical evidence has substantiated that these innovative approaches exhibit a higher correlation with human judgment, marking a significant advancement in the field. However, does a higher correlation with human evaluations alone sufficiently denote the completeness of a metric? In response to this question, in this paper, we study if there are any deficiencies in reference-free metrics. Specifically, inspired by the Cobra Effect, we utilize metric scores as rewards to direct the captioning model toward generating descriptions that closely align with the metric's criteria. If a certain metric has flaws, it will be exploited by the model and reflected in the generated sentences. Our findings reveal that descriptions guided by these metrics contain significant flaws, e.g. incoherent statements and excessive repetition. Subsequently, we propose a novel method termed Self-Improving to rectify the identified shortcomings within these metrics. We employ GPT-4V as an evaluative tool to assess generated sentences and the result reveals that our approach achieves state-of-the-art (SOTA) performance. In addition, we also introduce a challenging evaluation benchmark called Flaws Caption to evaluate reference-free image captioning metrics comprehensively. Our code is available at https://github.com/aaronma2020/robust_captioning_metric

Title: BGE Landmark Embedding: A Chunking-Free Embedding Method For Retrieval Augmented Long-Context Large Language Models

Authors: Kun Luo, Zheng Liu, Shitao Xiao, Kang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11573
Pdf URL: https://arxiv.org/pdf/2402.11573
Copy Paste: [[2402.11573]] BGE Landmark Embedding: A Chunking-Free Embedding Method For Retrieval Augmented Long-Context Large Language Models(https://arxiv.org/abs/2402.11573)
Keywords: language model, llm
Abstract: Large language models (LLMs) call for extension of context to handle many critical applications. However, the existing approaches are prone to expensive costs and inferior quality of context extension. In this work, we propose Extensible Embedding, which realizes high-quality extension of LLM's context with strong flexibility and cost-effectiveness. Extensible embedding stands as an enhancement of typical token embedding, which represents the information for an extensible scope of context instead of a single token. By leveraging such compact input units of higher information density, the LLM can access a vast scope of context even with a small context window. Extensible embedding is systematically optimized in architecture and training method, which leads to multiple advantages. 1) High flexibility of context extension, which flexibly supports ad-hoc extension of diverse context lengths. 2) Strong sample efficiency of training, which enables the embedding model to be learned in a cost-effective way. 3) Superior compatibility with the existing LLMs, where the extensible embedding can be seamlessly introduced as a plug-in component. Comprehensive evaluations on long-context language modeling and understanding tasks verify extensible embedding as an effective, efficient, flexible, and compatible method to extend the LLM's context.

Title: Extensible Embedding: A Flexible Multipler For LLM's Context Length

Authors: Ninglu Shao, Shitao Xiao, Zheng Liu, Peitian Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11577
Pdf URL: https://arxiv.org/pdf/2402.11577
Copy Paste: [[2402.11577]] Extensible Embedding: A Flexible Multipler For LLM's Context Length(https://arxiv.org/abs/2402.11577)
Keywords: language model, llm
Abstract: Large language models (LLMs) call for extension of context to handle many critical applications. However, the existing approaches are prone to expensive costs and inferior quality of context extension. In this work, we propose Extensible Embedding, which realizes high-quality extension of LLM's context with strong flexibility and cost-effectiveness. Extensible embedding stands as an enhancement of typical token embedding, which represents the information for an extensible scope of context instead of a single token. By leveraging such compact input units of higher information density, the LLM can access a vast scope of context even with a small context window. Extensible embedding is systematically optimized in architecture and training method, which leads to multiple advantages. 1) High flexibility of context extension, which flexibly supports ad-hoc extension of diverse context lengths. 2) Strong sample efficiency of training, which enables the embedding model to be learned in a cost-effective way. 3) Superior compatibility with the existing LLMs, where the extensible embedding can be seamlessly introduced as a plug-in component. Comprehensive evaluations on long-context language modeling and understanding tasks verify extensible embedding as an effective, efficient, flexible, and compatible method to extend the LLM's context.

Title: Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark

Authors: Yihua Zhang, Pingzhi Li, Junyuan Hong, Jiaxiang Li, Yimeng Zhang, Wenqing Zheng, Pin-Yu Chen, Jason D. Lee, Wotao Yin, Mingyi Hong, Zhangyang Wang, Sijia Liu, Tianlong Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2402.11592
Pdf URL: https://arxiv.org/pdf/2402.11592
Copy Paste: [[2402.11592]] Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark(https://arxiv.org/abs/2402.11592)
Keywords: language model, llm
Abstract: In the evolving landscape of natural language processing (NLP), fine-tuning pre-trained Large Language Models (LLMs) with first-order (FO) optimizers like SGD and Adam has become standard. Yet, as LLMs grow in size, the substantial memory overhead from back-propagation (BP) for FO gradient computation presents a significant challenge. Addressing this issue is crucial, especially for applications like on-device training where memory efficiency is paramount. This paper proposes a shift towards BP-free, zeroth-order (ZO) optimization as a solution for reducing memory costs during LLM fine-tuning, building on the initial concept introduced by MeZO. Unlike traditional ZO-SGD methods, our work expands the exploration to a wider array of ZO optimization techniques, through a comprehensive, first-of-its-kind benchmarking study across five LLM families (Roberta, OPT, LLaMA, Vicuna, Mistral), three task complexities, and five fine-tuning schemes. Our study unveils previously overlooked optimization principles, highlighting the importance of task alignment, the role of the forward gradient method, and the balance between algorithm complexity and fine-tuning performance. We further introduce novel enhancements to ZO optimization, including block-wise descent, hybrid training, and gradient sparsity. Our study offers a promising direction for achieving further memory-efficient LLM fine-tuning. Codes to reproduce all our experiments are at https://github.com/ZO-Bench/ZO-LLM .
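
For readers unfamiliar with BP-free optimization, the following is a minimal sketch of the basic ZO-SGD idea the abstract builds on: the gradient is estimated from two forward evaluations along a random direction, so no back-propagation is needed. It is a toy on a quadratic objective, not the benchmark's code; the names `zo_gradient`, `zo_sgd`, and `quadratic`, and the values of `lr`, `mu`, and `steps`, are illustrative assumptions.

```python
import random
from typing import Callable, List

Objective = Callable[[List[float]], float]


def zo_gradient(f: Objective, x: List[float], mu: float, rng: random.Random) -> List[float]:
    """Two-point estimate: ((f(x + mu*u) - f(x - mu*u)) / (2*mu)) * u with u ~ N(0, I)."""
    u = [rng.gauss(0.0, 1.0) for _ in x]
    x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
    x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
    scale = (f(x_plus) - f(x_minus)) / (2.0 * mu)  # only forward evaluations of f
    return [scale * ui for ui in u]


def zo_sgd(f: Objective, x0: List[float], lr: float = 0.05, steps: int = 500,
           mu: float = 1e-3, seed: int = 0) -> List[float]:
    """Plain SGD that uses the zeroth-order estimate in place of a true gradient."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        g = zo_gradient(f, x, mu, rng)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


def quadratic(x: List[float]) -> float:
    """Toy objective with its minimum at (1, 1, 1); stands in for a fine-tuning loss."""
    return sum((xi - 1.0) ** 2 for xi in x)


start = [0.0, 0.0, 0.0]
final = zo_sgd(quadratic, start)
print(quadratic(start), quadratic(final))
```

Because each step needs only function values, no activations have to be stored for a backward pass, which is the source of the memory savings; the price is a noisier gradient estimate whose variance grows with the number of parameters.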

Title: Multi-Task Inference: Can Large Language Models Follow Multiple Instructions at Once?

Authors: Guijin Son, Sangwon Baek, Sangdae Nam, Ilgyun Jeong, Seungone Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11597
Pdf URL: https://arxiv.org/pdf/2402.11597
Copy Paste: [[2402.11597]] Multi-Task Inference: Can Large Language Models Follow Multiple Instructions at Once?(https://arxiv.org/abs/2402.11597)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Large language models (LLMs) are typically prompted to follow a single instruction per inference call. In this work, we analyze whether LLMs also hold the capability to handle multiple instructions simultaneously, denoted as Multi-Task Inference. For this purpose, we introduce the MTI Bench (Multi-Task Inference Benchmark), a comprehensive evaluation benchmark encompassing 5,000 instances across 25 tasks. Each task in the MTI Bench involves 2 to 3 sub-tasks. As expected, we first demonstrate that Multi-Task Inference reduces the total inference time by 1.46 times on average since it does not require multiple inference calls. Interestingly, contrary to the expectation that LLMs would perform better when tasks are divided, we find that state-of-the-art LLMs, such as Llama-2-Chat-70B and GPT-4, show up to 7.3% and 12.4% improved performance with Multi-Task Inference compared to Single-Task Inference on the MTI Bench. We release the MTI Bench dataset and our code at this link https://github.com/guijinSON/MTI-Bench.
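
The difference between Single-Task and Multi-Task Inference described above comes down to how prompts are assembled: one call per instruction versus one call carrying all instructions. A minimal sketch follows; the prompt wording and the function names `single_task_prompts` and `multi_task_prompt` are assumptions for illustration and are not the MTI Bench templates.

```python
from typing import List


def single_task_prompts(context: str, instructions: List[str]) -> List[str]:
    """One prompt (and therefore one inference call) per instruction."""
    return [f"{context}\n\nInstruction: {ins}" for ins in instructions]


def multi_task_prompt(context: str, instructions: List[str]) -> str:
    """A single prompt that carries every instruction as a numbered list."""
    numbered = "\n".join(f"{i}. {ins}" for i, ins in enumerate(instructions, start=1))
    return f"{context}\n\nFollow all instructions in order:\n{numbered}"


context = "Sentence: The weather is nice today."
instructions = [
    "Translate the sentence into French.",
    "Count the words in the translation.",
]

separate = single_task_prompts(context, instructions)
combined = multi_task_prompt(context, instructions)
print(len(separate), "calls versus 1 call")
print(combined)
```

The combined prompt sends the shared context once rather than once per sub-task, which is consistent with the reduction in total inference time reported in the abstract.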

Title: Self-evolving Autoencoder Embedded Q-Network

Authors: J. Senthilnath, Bangjian Zhou, Zhen Wei Ng, Deeksha Aggarwal, Rajdeep Dutta, Ji Wei Yoon, Aye Phyu Phyu Aung, Keyu Wu, Min Wu, Xiaoli Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2402.11604
Pdf URL: https://arxiv.org/pdf/2402.11604
Copy Paste: [[2402.11604]] Self-evolving Autoencoder Embedded Q-Network(https://arxiv.org/abs/2402.11604)
Keywords: agent
Abstract: In the realm of sequential decision-making tasks, the exploration capability of a reinforcement learning (RL) agent is paramount for achieving high rewards through interactions with the environment. To enhance this crucial ability, we propose SAQN, a novel approach wherein a self-evolving autoencoder (SA) is embedded with a Q-Network (QN). In SAQN, the self-evolving autoencoder architecture adapts and evolves as the agent explores the environment. This evolution enables the autoencoder to capture a diverse range of raw observations and represent them effectively in its latent space. By leveraging the disentangled states extracted from the encoder generated latent space, the QN is trained to determine optimal actions that improve rewards. During the evolution of the autoencoder architecture, a bias-variance regulatory strategy is employed to elicit the optimal response from the RL agent. This strategy involves two key components: (i) fostering the growth of nodes to retain previously acquired knowledge, ensuring a rich representation of the environment, and (ii) pruning the least contributing nodes to maintain a more manageable and tractable latent space. Extensive experimental evaluations conducted on three distinct benchmark environments and a real-world molecular environment demonstrate that the proposed SAQN significantly outperforms state-of-the-art counterparts. The results highlight the effectiveness of the self-evolving autoencoder and its collaboration with the Q-Network in tackling sequential decision-making tasks.
摘要：在顺序决策任务领域，强化学习（RL）代理的探索能力对于通过与环境的交互获得高奖励至关重要。为了增强这一关键能力，我们提出了 SAQN，这是一种新颖的方法，其中将自进化自动编码器（SA）嵌入到 Q 网络（QN）中。在 SAQN 中，自进化自动编码器架构随着代理探索环境而适应和进化。这种演变使自动编码器能够捕获各种原始观察结果，并在其潜在空间中有效地表示它们。通过利用从编码器生成的潜在空间中提取的解纠缠状态，QN 被训练以确定改善奖励的最佳动作。在自动编码器架构的演化过程中，采用偏差-方差调节策略来引发 RL 代理的最佳响应。该策略涉及两个关键组成部分：（i）促进节点的增长以保留先前获得的知识，确保环境的丰富表示，以及（ii）修剪贡献最少的节点以维持更易于管理和处理的潜在空间。在三个不同的基准环境和真实分子环境上进行的广泛实验评估表明，所提出的 SAQN 显著优于最先进的同类方法。结果凸显了自进化自动编码器及其与 Q 网络在处理顺序决策任务方面的协作的有效性。

Title: Metric-Learning Encoding Models Identify Processing Profiles of Linguistic Features in BERT's Representations

Authors: Louis Jalouzot, Robin Sobczyk, Bastien Lhopitallier, Jeanne Salle, Nur Lan, Emmanuel Chemla, Yair Lakretz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11608
Pdf URL: https://arxiv.org/pdf/2402.11608
Copy Paste: [[2402.11608]] Metric-Learning Encoding Models Identify Processing Profiles of Linguistic Features in BERT's Representations(https://arxiv.org/abs/2402.11608)
Keywords: language model
Abstract: We introduce Metric-Learning Encoding Models (MLEMs) as a new approach to understand how neural systems represent the theoretical features of the objects they process. As a proof-of-concept, we apply MLEMs to neural representations extracted from BERT, and track a wide variety of linguistic features (e.g., tense, subject person, clause type, clause embedding). We find that: (1) linguistic features are ordered: they separate representations of sentences to different degrees in different layers; (2) neural representations are organized hierarchically: in some layers, we find clusters of representations nested within larger clusters, following successively important linguistic features; (3) linguistic features are disentangled in middle layers: distinct, selective units are activated by distinct linguistic features. Methodologically, MLEMs are superior (4) to multivariate decoding methods, being more robust to type-I errors, and (5) to univariate encoding methods, in being able to predict both local and distributed representations. Together, this demonstrates the utility of Metric-Learning Encoding Methods for studying how linguistic features are neurally encoded in language models and the advantage of MLEMs over traditional methods. MLEMs can be extended to other domains (e.g. vision) and to other neural systems, such as the human brain.
摘要：我们引入度量学习编码模型（MLEM）作为一种新方法来理解神经系统如何表示它们处理的对象的理论特征。作为概念验证，我们将 MLEM 应用于从 BERT 中提取的神经表示，并跟踪各种语言特征（例如时态、主语、子句类型、子句嵌入）。我们发现：（1）语言特征是有序的：它们在不同层中不同程度地分离句子的表示；（2）神经表征是分层组织的：在某些层中，我们发现表征簇嵌套在更大的簇中，遵循连续重要的语言特征；（3）语言特征在中间层被解开：不同的、选择性的单元由不同的语言特征激活。从方法上讲，MLEM 优于 (4) 多变量解码方法，对 I 类错误更稳健；(5) 优于单变量编码方法，能够预测局部和分布式表示。总之，这证明了度量学习编码方法在研究语言特征如何在语言模型中进行神经编码方面的实用性，以及 MLEM 相对于传统方法的优势。 MLEM 可以扩展到其他领域（例如视觉）和其他神经系统，例如人脑。

Title: Decoding News Narratives: A Critical Analysis of Large Language Models in Framing Bias Detection

Authors: Valeria Pastorino, Jasivan A. Sivakumar, Nafise Sadat Moosavi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11621
Pdf URL: https://arxiv.org/pdf/2402.11621
Copy Paste: [[2402.11621]] Decoding News Narratives: A Critical Analysis of Large Language Models in Framing Bias Detection(https://arxiv.org/abs/2402.11621)
Keywords: language model, gpt, llm, prompt
Abstract: This work contributes to the expanding research on the applicability of LLMs in social sciences by examining the performance of GPT-3.5 Turbo, GPT-4, and Flan-T5 models in detecting framing bias in news headlines through zero-shot, few-shot, and explainable prompting methods. A key insight from our evaluation is the notable efficacy of explainable prompting in enhancing the reliability of these models, highlighting the importance of explainable settings for social science research on framing bias. GPT-4, in particular, demonstrated enhanced performance in few-shot scenarios when presented with a range of relevant, in-domain examples. Flan-T5's poor performance indicates that smaller models may require additional task-specific fine-tuning for framing bias detection. Our study also found that models, particularly GPT-4, often misinterpret emotional language as an indicator of framing bias, underscoring the challenge of distinguishing between reporting genuine emotional expression and intentionally using framing bias in news headlines. We further evaluated the models on two subsets of headlines where the presence or absence of framing bias was either clear-cut or more contested, with the results suggesting that these models can be useful in flagging potential annotation inaccuracies within existing or new datasets. Finally, the study evaluates the models in real-world conditions ("in the wild"), moving beyond the initial dataset focused on U.S. Gun Violence, assessing the models' performance on framed headlines covering a broad range of topics.
摘要：这项工作考察了 GPT-3.5 Turbo、GPT-4 和 Flan-T5 模型通过零样本、少样本以及可解释的提示方法检测新闻标题中框架偏见的表现，为不断扩展的关于 LLM 在社会科学中适用性的研究做出了贡献。我们评估的一个关键见解是可解释的提示在增强这些模型的可靠性方面具有显著的功效，强调了可解释的设置对于社会科学研究框架偏见的重要性。尤其是 GPT-4，在提供一系列相关的领域内示例时，在少样本场景中展示了增强的性能。Flan-T5 的较差性能表明较小的模型可能需要额外的特定于任务的微调来进行框架偏见检测。我们的研究还发现，模型，尤其是 GPT-4，经常将情感语言误解为框架偏见的指标，这凸显了区分报道真实情感表达和故意在新闻标题中使用框架偏见的挑战。我们进一步评估了两个标题子集上的模型，其中框架偏见的存在或不存在要么是明确的，要么是更有争议的，结果表明这些模型可用于标记现有或新数据集中潜在的注释不准确性。最后，该研究评估了现实条件下（“野外”）的模型，超越了专注于美国枪支暴力的初始数据集，评估了模型在涵盖广泛主题的框架标题上的表现。

Title: SpeCrawler: Generating OpenAPI Specifications from API Documentation Using Large Language Models

Authors: Koren Lazar, Matan Vetzler, Guy Uziel, David Boaz, Esther Goldbraich, David Amid, Ateret Anaby-Tavor
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11625
Pdf URL: https://arxiv.org/pdf/2402.11625
Copy Paste: [[2402.11625]] SpeCrawler: Generating OpenAPI Specifications from API Documentation Using Large Language Models(https://arxiv.org/abs/2402.11625)
Keywords: language model, llm
Abstract: In the digital era, the widespread use of APIs is evident. However, scalable utilization of APIs poses a challenge due to structure divergence observed in online API documentation. This underscores the need for automatic tools to facilitate API consumption. A viable approach involves the conversion of documentation into an API Specification format. While previous attempts have been made using rule-based methods, these approaches encountered difficulties in generalizing across diverse documentation. In this paper we introduce SpeCrawler, a comprehensive system that utilizes large language models (LLMs) to generate OpenAPI Specifications from diverse API documentation through a carefully crafted pipeline. By creating a standardized format for numerous APIs, SpeCrawler aids in streamlining integration processes within API orchestrating systems and facilitating the incorporation of tools into LLMs. The paper explores SpeCrawler's methodology, supported by empirical evidence and case studies, demonstrating its efficacy through LLM capabilities.
摘要：在数字时代，API 的广泛使用是显而易见的。然而，由于在线 API 文档中观察到的结构差异，API 的可扩展利用提出了挑战。这强调了需要自动化工具来促进 API 的使用。一种可行的方法是将文档转换为 API 规范格式。虽然之前已经尝试使用基于规则的方法，但这些方法在跨不同文档进行推广时遇到了困难。在本文中，我们介绍了 SpeCrawler，这是一个综合系统，它利用大型语言模型 (LLM) 通过精心设计的管道从不同的 API 文档生成 OpenAPI 规范。通过为众多 API 创建标准化格式，SpeCrawler 有助于简化 API 编排系统内的集成流程，并促进将工具合并到 LLM 中。本文探讨了 SpeCrawler 的方法，并以经验证据和案例研究为支持，通过 LLM 的能力证明了其有效性。

Title: Metacognitive Retrieval-Augmented Large Language Models

Authors: Yujia Zhou, Zheng Liu, Jiajie Jin, Jian-Yun Nie, Zhicheng Dou
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2402.11626
Pdf URL: https://arxiv.org/pdf/2402.11626
Copy Paste: [[2402.11626]] Metacognitive Retrieval-Augmented Large Language Models(https://arxiv.org/abs/2402.11626)
Keywords: language model, retrieval-augmented generation
Abstract: Retrieval-augmented generation models have become central in natural language processing due to their efficacy in generating factual content. While traditional methods employ single-time retrieval, more recent approaches have shifted towards multi-time retrieval for multi-hop reasoning tasks. However, these strategies are bound by predefined reasoning steps, potentially leading to inaccuracies in response generation. This paper introduces MetaRAG, an approach that combines the retrieval-augmented generation process with metacognition. Drawing from cognitive psychology, metacognition allows an entity to self-reflect and critically evaluate its cognitive processes. By integrating this, MetaRAG enables the model to monitor, evaluate, and plan its response strategies, enhancing its introspective reasoning abilities. Through a three-step metacognitive regulation pipeline, the model can identify inadequacies in initial cognitive responses and fix them. Empirical evaluations show that MetaRAG significantly outperforms existing methods.
摘要：检索增强生成因其在生成事实内容方面的功效而成为自然语言处理的核心。虽然传统方法采用单次检索，但最近的方法已转向针对多跳推理任务的多次检索。然而，这些策略受到预定义推理步骤的约束，可能导致响应生成不准确。本文介绍了 MetaRAG，一种将检索增强生成过程与元认知相结合的方法。元认知借鉴认知心理学，允许实体自我反思并批判性地评估其认知过程。通过整合这一点，MetaRAG 使模型能够监控、评估和规划其响应策略，从而增强其内省推理能力。通过三步元认知调节流程，该模型可以识别初始认知反应中的不足之处并加以修复。实证评估表明 MetaRAG 显著优于现有方法。

Title: Self-seeding and Multi-intent Self-instructing LLMs for Generating Intent-aware Information-Seeking dialogs

Authors: Arian Askari, Roxana Petcu, Chuan Meng, Mohammad Aliannejadi, Amin Abolghasemi, Evangelos Kanoulas, Suzan Verberne
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11633
Pdf URL: https://arxiv.org/pdf/2402.11633
Copy Paste: [[2402.11633]] Self-seeding and Multi-intent Self-instructing LLMs for Generating Intent-aware Information-Seeking dialogs(https://arxiv.org/abs/2402.11633)
Keywords: language model, llm, prompt
Abstract: Identifying user intents in information-seeking dialogs is crucial for a system to meet user's information needs. Intent prediction (IP) is challenging and demands sufficient dialogs with human-labeled intents for training. However, manually annotating intents is resource-intensive. While large language models (LLMs) have been shown to be effective in generating synthetic data, there is no study on using LLMs to generate intent-aware information-seeking dialogs. In this paper, we focus on leveraging LLMs for zero-shot generation of large-scale, open-domain, and intent-aware information-seeking dialogs. We propose SOLID, which has novel self-seeding and multi-intent self-instructing schemes. The former improves the generation quality by using the LLM's own knowledge scope to initiate dialog generation; the latter prompts the LLM to generate utterances sequentially, and mitigates the need for manual prompt design by asking the LLM to autonomously adapt its prompt instruction when generating complex multi-intent utterances. Furthermore, we propose SOLID-RL, which is further trained to generate a dialog in one step on the data generated by SOLID. We propose a length-based quality estimation mechanism to assign varying weights to SOLID-generated dialogs based on their quality during the training process of SOLID-RL. We use SOLID and SOLID-RL to generate more than 300k intent-aware dialogs, surpassing the size of existing datasets. Experiments show that IP methods trained on dialogs generated by SOLID and SOLID-RL achieve better IP quality than ones trained on human-generated dialogs.
摘要：识别信息寻求对话中的用户意图对于系统满足用户的信息需求至关重要。意图预测（IP）具有挑战性，需要足够多带有人工标注意图的对话用于训练。然而，手动注释意图是资源密集型的。虽然大型语言模型 (LLM) 已被证明可以有效生成合成数据，但目前还没有关于使用 LLM 生成意图感知信息寻求对话的研究。在本文中，我们重点关注利用 LLM 来零样本生成大规模、开放域和意图感知的信息寻求对话。我们提出了 SOLID，它具有新颖的自播种和多意图自指导方案。前者利用 LLM 自身的知识范围发起对话生成，提高生成质量；后者提示 LLM 按顺序生成话语，并通过要求 LLM 在生成复杂的多意图话语时自主调整其提示指令来减少手动提示设计的需要。此外，我们提出了 SOLID-RL，它经过进一步训练，可以根据 SOLID 生成的数据一步生成对话。我们提出了一种基于长度的质量估计机制，在 SOLID-RL 的训练过程中根据 SOLID 生成的对话的质量为其分配不同的权重。我们使用 SOLID 和 SOLID-RL 生成超过 30 万个意图感知对话，超过了现有数据集的大小。实验表明，在 SOLID 和 SOLID-RL 生成的对话上训练的 IP 方法比在人工生成的对话上训练的 IP 方法具有更好的 IP 质量。

Title: Stumbling Blocks: Stress Testing the Robustness of Machine-Generated Text Detectors Under Attacks

Authors: Yichen Wang, Shangbin Feng, Abe Bohan Hou, Xiao Pu, Chao Shen, Xiaoming Liu, Yulia Tsvetkov, Tianxing He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11638
Pdf URL: https://arxiv.org/pdf/2402.11638
Copy Paste: [[2402.11638]] Stumbling Blocks: Stress Testing the Robustness of Machine-Generated Text Detectors Under Attacks(https://arxiv.org/abs/2402.11638)
Keywords: language model, llm, prompt
Abstract: The widespread use of large language models (LLMs) is increasing the demand for methods that detect machine-generated text to prevent misuse. The goal of our study is to stress test the detectors' robustness to malicious attacks under realistic scenarios. We comprehensively study the robustness of popular machine-generated text detectors under attacks from diverse categories: editing, paraphrasing, prompting, and co-generating. Our attacks assume limited access to the generator LLMs, and we compare the performance of detectors on different attacks under different budget levels. Our experiments reveal that almost none of the existing detectors remain robust under all the attacks, and all detectors exhibit different loopholes. Averaging all detectors, the performance drops by 35% across all attacks. Further, we investigate the reasons behind these defects and propose initial out-of-the-box patches to improve robustness.
摘要：大语言模型 (LLM) 的广泛使用增加了对检测机器生成文本以防止误用的方法的需求。我们研究的目的是在现实场景下对检测器对恶意攻击的鲁棒性进行压力测试。我们全面研究流行的机器生成文本检测器在不同类别攻击下的鲁棒性：编辑、释义、提示和共同生成。我们的攻击假设对生成器 LLM 的访问受到限制，并且我们比较了不同预算水平下不同攻击的检测器的性能。我们的实验表明，现有的检测器几乎没有一个能够在所有攻击下保持鲁棒性，并且所有检测器都表现出不同的漏洞。对所有检测器进行平均后，所有攻击的性能都会下降 35%。此外，我们调查了这些缺陷背后的原因，并提出了初始开箱即用的补丁来提高鲁棒性。

Title: Towards Versatile Graph Learning Approach: from the Perspective of Large Language Models

Authors: Lanning Wei, Jun Gao, Huan Zhao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2402.11641
Pdf URL: https://arxiv.org/pdf/2402.11641
Copy Paste: [[2402.11641]] Towards Versatile Graph Learning Approach: from the Perspective of Large Language Models(https://arxiv.org/abs/2402.11641)
Keywords: language model, llm
Abstract: Graph-structured data are commonly used and have wide application scenarios in the real world. For these diverse applications, the vast variety of learning tasks, graph domains, and complex graph learning procedures present challenges for human experts when designing versatile graph learning approaches. Facing these challenges, large language models (LLMs) offer a potential solution due to their extensive knowledge and human-like intelligence. This paper proposes a novel conceptual prototype for designing versatile graph learning methods with LLMs, with a particular focus on the "where" and "how" perspectives. From the "where" perspective, we summarize four key graph learning procedures: task definition, graph data feature engineering, model selection and optimization, and deployment and serving. We then explore the application scenarios of LLMs in these procedures across a wider spectrum. From the "how" perspective, we align the abilities of LLMs with the requirements of each procedure. Finally, we point out the promising directions that could better leverage the strength of LLMs towards versatile graph learning methods.
摘要：图结构数据是常用的数据，在现实世界中有着广泛的应用场景。对于这些不同的应用，各种各样的学习任务、图域和复杂的图学习过程给人类专家在设计多功能图学习方法时带来了挑战。面对这些挑战，大型语言模型（LLM）凭借丰富的知识和类人智能提供了潜在的解决方案。本文提出了一种新颖的概念原型，用于利用 LLM 设计多功能图学习方法，特别关注“哪里”和“如何”的视角。从“哪里”的角度，我们总结了四个关键的图学习过程，包括任务定义、图数据特征工程、模型选择和优化、部署和服务。然后，我们在更广泛的范围内探索 LLM 在这些过程中的应用场景。从“如何”的角度来看，我们将 LLM 的能力与每个过程的要求对应起来。最后，我们指出了可以更好地利用 LLM 的优势来实现多功能图学习方法的有希望的方向。

Title: Learning From Failure: Integrating Negative Examples when Fine-tuning Large Language Models as Agents

Authors: Renxi Wang, Haonan Li, Xudong Han, Yixuan Zhang, Timothy Baldwin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11651
Pdf URL: https://arxiv.org/pdf/2402.11651
Copy Paste: [[2402.11651]] Learning From Failure: Integrating Negative Examples when Fine-tuning Large Language Models as Agents(https://arxiv.org/abs/2402.11651)
Keywords: language model, gpt, llm, agent
Abstract: Large language models (LLMs) have achieved success in acting as agents, which interact with environments through tools like search engines. However, LLMs are not optimized specifically for tool use during training or alignment, limiting their effectiveness as agents. To resolve this problem, previous work has collected interaction trajectories between GPT-4 and environments, and fine-tuned smaller models with them. As part of this, the standard approach has been to simply discard trajectories that do not finish the task successfully, which, on the one hand, leads to a significant waste of data and resources, and on the other hand, has the potential to limit the possible optimization paths during fine-tuning. In this paper, we contend that large language models can learn from failures through appropriate data cleaning and fine-tuning strategies. We conduct experiments on mathematical reasoning, multi-hop question answering, and strategic question answering tasks. Experimental results demonstrate that compared to solely using positive examples, incorporating negative examples enhances model performance by a large margin.
摘要：大型语言模型（LLM）在充当代理方面取得了成功，它通过搜索引擎等工具与环境进行交互。然而，LLM 并未专门针对训练或对齐期间的工具使用进行优化，这限制了它们作为代理的有效性。为了解决这个问题，之前的工作收集了 GPT-4 和环境之间的交互轨迹，并用它们对较小的模型进行了微调。作为其中的一部分，标准方法是简单地丢弃未成功完成任务的轨迹，这一方面会导致数据和资源的严重浪费，另一方面有可能限制微调期间可能的优化路径。在本文中，我们认为大型语言模型可以通过适当的数据清理和微调策略从失败中学习。我们对数学推理、多跳问答和策略问答任务进行实验。实验结果表明，与仅使用正例相比，合并负例可以大幅提高模型性能。

Title: Combinatorial Client-Master Multiagent Deep Reinforcement Learning for Task Offloading in Mobile Edge Computing

Authors: Tesfay Zemuy Gebrekidan, Sebastian Stein, Timothy J. Norman
Subjects: cs.AI, cs.DC, cs.NI
Abstract URL: https://arxiv.org/abs/2402.11653
Pdf URL: https://arxiv.org/pdf/2402.11653
Copy Paste: [[2402.11653]] Combinatorial Client-Master Multiagent Deep Reinforcement Learning for Task Offloading in Mobile Edge Computing(https://arxiv.org/abs/2402.11653)
Keywords: agent
Abstract: Recently, there has been an explosion of mobile applications that perform computationally intensive tasks such as video streaming, data mining, virtual reality, augmented reality, image processing, video processing, face recognition, and online gaming. However, user devices (UDs), such as tablets and smartphones, have a limited ability to meet the computational needs of the tasks. Mobile edge computing (MEC) has emerged as a promising technology to meet the increasing computing demands of UDs. Task offloading in MEC is a strategy that meets the demands of UDs by distributing tasks between UDs and MEC servers. Deep reinforcement learning (DRL) is gaining attention in task-offloading problems because it can adapt to dynamic changes and minimize online computational complexity. However, the various types of continuous and discrete resource constraints on UDs and MEC servers pose challenges to the design of an efficient DRL-based task-offloading strategy. Existing DRL-based task-offloading algorithms focus on the constraints of the UDs, assuming the availability of enough storage resources on the server. Moreover, existing multiagent DRL (MADRL)-based task-offloading algorithms use homogeneous agents and consider homogeneous constraints as a penalty in their reward function. We propose a novel combinatorial client-master MADRL (CCM_MADRL) algorithm for task offloading in MEC (CCM_MADRL_MEC) that enables UDs to decide their resource requirements and the server to make a combinatorial decision based on the requirements of the UDs. CCM_MADRL_MEC is the first MADRL in task offloading to consider server storage capacity in addition to the constraints in the UDs. By taking advantage of the combinatorial action selection, CCM_MADRL_MEC has shown superior convergence over existing MADDPG and heuristic algorithms.
摘要：最近，执行计算密集型任务的移动应用程序激增，例如视频流、数据挖掘、虚拟现实、增强现实、图像处理、视频处理、人脸识别和在线游戏。然而，平板电脑和智能手机等用户设备 (UD) 执行任务计算需求的能力有限。移动边缘计算 (MEC) 已成为一项有前途的技术，可以满足 UD 日益增长的计算需求。MEC 中的任务卸载是一种通过在 UD 和 MEC 服务器之间分配任务来满足 UD 需求的策略。深度强化学习（DRL）在任务卸载问题中越来越受到关注，因为它可以适应动态变化并最大限度地减少在线计算复杂性。然而，UD 和 MEC 服务器上各种类型的连续和离散资源约束对设计高效的基于 DRL 的任务卸载策略提出了挑战。现有的基于 DRL 的任务卸载算法侧重于 UD 的约束，假设服务器上有足够的存储资源。此外，现有的基于多智能体 DRL (MADRL) 的任务卸载算法是同质智能体，并将同质约束视为其奖励函数中的惩罚。我们提出了一种新颖的组合客户端-主站 MADRL (CCM_MADRL) 算法，用于 MEC 中的任务卸载 (CCM_MADRL_MEC)，该算法使 UD 能够决定其资源需求，并且服务器能够根据 UD 的需求做出组合决策。CCM_MADRL_MEC 是任务卸载中第一个除了 UD 限制之外还考虑服务器存储容量的 MADRL。通过利用组合动作选择，CCM_MADRL_MEC 表现出优于现有 MADDPG 和启发式算法的收敛性。

Title: Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals

Authors: Francesco Ortu, Zhijing Jin, Diego Doimo, Mrinmaya Sachan, Alberto Cazzaniga, Bernhard Schölkopf
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11655
Pdf URL: https://arxiv.org/pdf/2402.11655
Copy Paste: [[2402.11655]] Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals(https://arxiv.org/abs/2402.11655)
Keywords: language model, llm
Abstract: Interpretability research aims to bridge the gap between the empirical success and our scientific understanding of the inner workings of large language models (LLMs). However, most existing research in this area focused on analyzing a single mechanism, such as how models copy or recall factual knowledge. In this work, we propose the formulation of competition of mechanisms, which instead of individual mechanisms focuses on the interplay of multiple mechanisms, and traces how one of them becomes dominant in the final prediction. We uncover how and where the competition of mechanisms happens within LLMs using two interpretability methods, logit inspection and attention modification. Our findings show traces of the mechanisms and their competition across various model components, and reveal attention positions that effectively control the strength of certain mechanisms. Our code and data are at https://github.com/francescortu/Competition_of_Mechanisms.
摘要：可解释性研究旨在弥合实证成功与我们对大型语言模型（LLM）内部运作的科学理解之间的差距。然而，该领域的大多数现有研究都集中于分析单一机制，例如模型如何复制或回忆事实知识。在这项工作中，我们提出了机制竞争的表述，它不是单个机制，而是关注多种机制的相互作用，并追踪其中一个机制如何在最终预测中占据主导地位。我们使用两种可解释性方法（logit 检查和注意力修改）揭示了 LLM 中机制竞争如何以及在何处发生。我们的研究结果显示了各种模型组件之间的机制及其竞争的痕迹，并揭示了有效控制某些机制强度的注意力位置。我们的代码和数据位于 https://github.com/francescortu/Competition_of_Mechanisms。

Title: Dynamic planning in hierarchical active inference

Authors: Matteo Priorelli, Ivilin Peev Stoianov
Subjects: cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2402.11658
Pdf URL: https://arxiv.org/pdf/2402.11658
Copy Paste: [[2402.11658]] Dynamic planning in hierarchical active inference(https://arxiv.org/abs/2402.11658)
Keywords: agent
Abstract: By dynamic planning, we refer to the ability of the human brain to infer and impose motor trajectories related to cognitive decisions. A recent paradigm, active inference, brings fundamental insights into the adaptation of biological organisms, constantly striving to minimize prediction errors to restrict themselves to life-compatible states. Over the past years, many studies have shown how human and animal behavior could be explained in terms of an active inferential process -- either as discrete decision-making or continuous motor control -- inspiring innovative solutions in robotics and artificial intelligence. Still, the literature lacks a comprehensive outlook on how to effectively plan actions in changing environments. Setting ourselves the goal of modeling tool use, we delve into the topic of dynamic planning in active inference, keeping in mind two crucial aspects of biological goal-directed behavior: the capacity to understand and exploit affordances for object manipulation, and to learn the hierarchical interactions between the self and the environment, including other agents. We start from a simple unit and gradually describe more advanced structures, comparing recently proposed design choices and providing basic examples for each section. This study distances itself from traditional views centered on neural networks and reinforcement learning, and points toward a yet unexplored direction in active inference: hybrid representations in hierarchical models.
摘要：通过动态规划，我们指的是人脑推断和施加与认知决策相关的运动轨迹的能力。最近的一个范式，主动推理，为生物有机体的适应带来了基本的见解，不断努力最小化预测错误，以将自身限制在与生命相容的状态。在过去的几年里，许多研究表明，如何用主动推理过程来解释人类和动物的行为——无论是离散决策还是连续运动控制——激发了机器人和人工智能领域的创新解决方案。尽管如此，文献仍然缺乏关于如何在不断变化的环境中有效规划行动的全面观点。为自己设定建模工具使用的目标，我们深入研究主动推理中动态规划的主题，同时牢记生物目标导向行为的两个关键方面：理解和利用对象操作的可供性的能力，以及学习自我与环境（包括其他主体）之间层次化相互作用的能力。我们从一个简单的单元开始，逐渐描述更高级的结构，比较最近提出的设计选择并为每个部分提供基本示例。这项研究与以神经网络和强化学习为中心的传统观点保持了距离，并指出了主动推理中尚未探索的方向：分层模型中的混合表示。

Title: Autocorrect for Estonian texts: final report from project EKTB25

Authors: Agnes Luhtaru, Martin Vainikko, Krista Liin, Kais Allkivi-Metsoja, Jaagup Kippar, Pille Eslon, Mark Fishel
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.11671
Pdf URL: https://arxiv.org/pdf/2402.11671
Copy Paste: [[2402.11671]] Autocorrect for Estonian texts: final report from project EKTB25(https://arxiv.org/abs/2402.11671)
Keywords: language model, gpt
Abstract: The project was funded in 2021-2023 by the National Programme of Estonian Language Technology. Its main aim was to develop spelling and grammar correction tools for the Estonian language. The main challenge was the very small amount of available error correction data needed for such development. To mitigate this, (1) we annotated more correction data for model training and testing, (2) we tested transfer-learning, i.e. retraining machine learning models created for other tasks, so as not to depend solely on correction data, (3) we compared the developed method and model with alternatives, including large language models. We also developed automatic evaluation, which can calculate the accuracy and yield of corrections by error category, so that the effectiveness of different methods can be compared in detail. There has been a breakthrough in large language models during the project: GPT4, a commercial language model with Estonian-language support, has been created. We took into account the existence of the model when adjusting plans and in the report we present a comparison with the ability of GPT4 to improve the Estonian language text. The final results show that the approach we have developed provides better scores than GPT4 and the result is usable but not entirely reliable yet. The report also contains ideas on how GPT4 and other major language models can be implemented in the future, focusing on open-source solutions. All results of this project are open-data/open-source, with licenses that allow them to be used for purposes including commercial ones.
摘要：该项目于 2021-2023 年由爱沙尼亚语言技术国家计划资助。其主要目标是开发爱沙尼亚语的拼写和语法纠正工具。主要挑战是此类开发所需的可用纠错数据非常少。为了缓解这个问题，（1）我们注释了更多用于模型训练和测试的校正数据，（2）我们测试了迁移学习，即重新训练为其他任务创建的机器学习模型，以免仅依赖于校正数据，（3）我们将开发的方法和模型与替代方案（包括大型语言模型）进行了比较。我们还开发了自动评估，可以按错误类别计算纠正的准确性和良率，以便详细比较不同方法的有效性。该项目期间在大型语言模型方面取得了突破：创建了支持爱沙尼亚语的商业语言模型GPT4。我们在调整计划时考虑了模型的存在，并在报告中与 GPT4 改进爱沙尼亚语文本的能力进行了比较。最终结果表明，我们开发的方法提供了比 GPT4 更好的分数，结果可用但不完全可靠。该报告还包含关于未来如何实施 GPT4 和其他主要语言模型的想法，重点关注开源解决方案。该项目的所有成果都是开放数据/开源的，并具有允许它们用于包括商业用途在内的用途的许可证。

Title: A Multi-Aspect Framework for Counter Narrative Evaluation using Large Language Models

Authors: Jaylen Jones, Lingbo Mo, Eric Fosler-Lussier, Huan Sun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.11676
Pdf URL: https://arxiv.org/pdf/2402.11676
Copy Paste: [[2402.11676]] A Multi-Aspect Framework for Counter Narrative Evaluation using Large Language Models(https://arxiv.org/abs/2402.11676)
Keywords: language model, llm, prompt
Abstract: Counter narratives - informed responses to hate speech contexts designed to refute hateful claims and de-escalate encounters - have emerged as an effective hate speech intervention strategy. While previous work has proposed automatic counter narrative generation methods to aid manual interventions, the evaluation of these approaches remains underdeveloped. Previous automatic metrics for counter narrative evaluation lack alignment with human judgment as they rely on superficial reference comparisons instead of incorporating key aspects of counter narrative quality as evaluation criteria. To address prior evaluation limitations, we propose a novel evaluation framework prompting LLMs to provide scores and feedback for generated counter narrative candidates using 5 defined aspects derived from guidelines from counter narrative specialized NGOs. We found that LLM evaluators achieve strong alignment to human-annotated scores and feedback and outperform alternative metrics, indicating their potential as multi-aspect, reference-free and interpretable evaluators for counter narrative evaluation.
摘要：反叙事——对仇恨言论环境做出明智的反应，旨在反驳仇恨言论并缓和冲突升级——已成为一种有效的仇恨言论干预策略。虽然之前的工作提出了自动反叙事生成方法来辅助手动干预，但对这些方法的评估仍然不发达。以前用于反叙事评估的自动指标缺乏与人类判断的一致性，因为它们依赖于肤浅的参考比较，而不是将反叙事质量的关键方面作为评估标准。为了解决先前的评估局限性，我们提出了一种新颖的评估框架，提示 LLM 使用源自反叙事专业非政府组织指南的 5 个定义方面，为生成的候选反叙事提供分数和反馈。我们发现 LLM 评估者与人工注释的分数和反馈实现了高度一致，并且优于替代指标，这表明它们作为反叙事评估的多方面、无参考和可解释的评估者的潜力。

Title: Opening the black box of language acquisition

Authors: Jérôme Michaud, Anna Jon-and
Subjects: cs.CL, math.NA
Abstract URL: https://arxiv.org/abs/2402.11681
Pdf URL: https://arxiv.org/pdf/2402.11681
Copy Paste: [[2402.11681]] Opening the black box of language acquisition(https://arxiv.org/abs/2402.11681)
Keywords: language model
Abstract: Recent advances in large language models using deep learning techniques have renewed interest in how languages can be learned from data. However, it is unclear whether or how these models represent grammatical information from the learned languages. In addition, the models must be pre-trained on large corpora before they can be used. In this work, we propose an alternative, more transparent and cognitively plausible architecture for learning language. Instead of using deep learning, our approach uses a minimal cognitive architecture based on sequence memory and chunking. The learning mechanism is based on the principles of reinforcement learning. We test our architecture on a number of natural-like toy languages. Results show that the model can learn these artificial languages from scratch and extract grammatical information that supports learning. Our study demonstrates the power of this simple architecture and stresses the importance of sequence memory as a key component of the language learning process. Since other animals do not seem to have a faithful sequence memory, this may explain why only humans have developed complex languages.
摘要：使用深度学习技术的大型语言模型的最新进展重新引起了人们对如何从数据中学习语言的兴趣。然而，尚不清楚这些模型是否或如何表示所学语言的语法信息。此外，模型必须在大型语料库上进行预训练才能使用。在这项工作中，我们提出了一种替代的、更透明且认知上合理的语言学习架构。我们的方法不使用深度学习，而是使用基于序列记忆和分块的最小认知架构。学习机制基于强化学习的原理。我们在许多类似自然的玩具语言上测试了我们的架构。结果表明，该模型可以从头开始学习这些人工语言，并提取支持学习的语法信息。我们的研究证明了这种简单架构的力量，并强调了序列记忆作为语言学习过程关键组成部分的重要性。由于其他动物似乎没有忠实的序列记忆，这可以解释为什么只有人类发展出复杂的语言。

Title: One Prompt To Rule Them All: LLMs for Opinion Summary Evaluation

Authors: Tejpalsingh Siledar, Swaroop Nath, Sankara Sri Raghava Ravindra Muddu, Rupasai Rangaraju, Swaprava Nath, Pushpak Bhattacharyya, Suman Banerjee, Amey Patil, Sudhanshu Shekhar Singh, Muthusamy Chelliah, Nikesh Garera
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11683
Pdf URL: https://arxiv.org/pdf/2402.11683
Copy Paste: [[2402.11683]] One Prompt To Rule Them All: LLMs for Opinion Summary Evaluation(https://arxiv.org/abs/2402.11683)
Keywords: language model, llm, prompt
Abstract: Evaluation of opinion summaries using conventional reference-based metrics rarely provides a holistic evaluation and has been shown to have a relatively low correlation with human judgments. Recent studies suggest using Large Language Models (LLMs) as reference-free metrics for NLG evaluation; however, they remain unexplored for opinion summary evaluation. Moreover, limited opinion summary evaluation datasets inhibit progress. To address this, we release the SUMMEVAL-OP dataset covering 7 dimensions related to the evaluation of opinion summaries: fluency, coherence, relevance, faithfulness, aspect coverage, sentiment consistency, and specificity. We investigate Op-I-Prompt, a dimension-independent prompt, and Op-Prompts, a dimension-dependent set of prompts for opinion summary evaluation. Experiments indicate that Op-I-Prompt emerges as a good alternative for evaluating opinion summaries, achieving an average Spearman correlation of 0.70 with humans and outperforming all previous approaches. To the best of our knowledge, we are the first to investigate LLMs as evaluators on both closed-source and open-source models in the opinion summarization domain.
摘要：使用传统的基于参考的指标对意见摘要的评估很少提供整体评估，并且已被证明与人类判断的相关性相对较低。最近的研究建议使用大型语言模型 (LLM) 作为 NLG 评估的无参考指标，但是，它们在意见摘要评估方面仍未得到探索。此外，有限的意见摘要评估数据集阻碍了进展。为了解决这个问题，我们发布了 SUMMEVAL-OP 数据集，涵盖与意见摘要评估相关的 7 个维度：流畅性、连贯性、相关性、忠实性、方面覆盖率、情感一致性和特异性。我们研究 Op-I-Prompt（与维度无关的提示）和 Op-Prompts（与维度相关的一组用于意见摘要评估的提示）。实验表明，Op-I-Prompt 成为评估意见摘要的良好替代方案，与人类的平均 Spearman 相关性达到 0.70，优于之前的所有方法。据我们所知，我们是第一个在意见摘要领域研究将闭源和开源 LLM 用作评估者的工作。
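
Note: the abstract above reports agreement with humans as a Spearman correlation. As a minimal sketch of what that statistic measures (not the authors' evaluation code), the snippet below computes Spearman's rho as the Pearson correlation of ranks, averaging ranks over ties. The function names and the example scores are hypothetical.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for five summaries: human ratings vs. metric ratings.
human_scores = [1, 2, 3, 4, 5]
metric_scores = [1, 3, 2, 5, 4]
rho = spearman(human_scores, metric_scores)
```

A value near 1 means the metric orders the summaries almost as the human raters do; the scale of the scores does not matter, only their ranking.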

Title: ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model

Authors: Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, Benyou Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.11684
Pdf URL: https://arxiv.org/pdf/2402.11684
Copy Paste: [[2402.11684]] ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model(https://arxiv.org/abs/2402.11684)
Keywords: language model, gpt
Abstract: Recent advancements in Large Vision-Language Models (LVLMs) have enabled processing of multimodal inputs in language models but require significant computational resources for deployment, especially in edge devices. This study aims to bridge the performance gap between traditional-scale LVLMs and resource-friendly lite versions by adopting high-quality training data. To do this, a synthetic dataset is created by leveraging GPT-4V's ability to generate detailed captions, complex reasoning instructions and detailed answers from images. The resulting model trained with our data, ALLaVA, achieves competitive performance on 12 benchmarks among LVLMs of up to 3B parameters. This work highlights the feasibility of adopting high-quality data in crafting more efficient LVLMs. Our online demo is available at https://allava.freedomai.cn.
摘要：大型视觉语言模型 (LVLM) 的最新进展使得能够在语言模型中处理多模态输入，但需要大量的计算资源进行部署，尤其是在边缘设备中。本研究旨在通过采用高质量的训练数据来缩小传统规模的 LVLM 和资源友好型精简版之间的性能差距。为此，我们利用 GPT-4V 根据图像生成详细描述、复杂推理指令和详细答案的能力来创建合成数据集。使用我们的数据训练得到的模型 ALLaVA 在 12 个基准测试中取得了在 3B 及以下规模 LVLM 中具有竞争力的性能。这项工作强调了采用高质量数据来构建更高效的 LVLM 的可行性。我们的在线演示位于 https://allava.freedomai.cn。

Title: Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning

Authors: Zhiyang Xu, Chao Feng, Rulin Shao, Trevor Ashby, Ying Shen, Di Jin, Yu Cheng, Qifan Wang, Lifu Huang
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2402.11690
Pdf URL: https://arxiv.org/pdf/2402.11690
Copy Paste: [[2402.11690]] Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning(https://arxiv.org/abs/2402.11690)
Keywords: language model, gpt, llm, hallucination
Abstract: Despite vision-language models' (VLMs) remarkable capabilities as versatile visual assistants, two substantial challenges persist within the existing VLM frameworks: (1) lacking task diversity in pretraining and visual instruction tuning, and (2) annotation error and bias in GPT-4 synthesized instruction tuning data. Both challenges lead to issues such as poor generalizability, hallucination, and catastrophic forgetting. To address these challenges, we construct Vision-Flan, the most diverse publicly available visual instruction tuning dataset to date, comprising 187 diverse tasks and 1,664,261 instances sourced from academic datasets, and each task is accompanied by an expert-written instruction. In addition, we propose a two-stage instruction tuning framework, in which VLMs are firstly finetuned on Vision-Flan and further tuned on GPT-4 synthesized data. We find this two-stage tuning framework significantly outperforms the traditional single-stage visual instruction tuning framework and achieves the state-of-the-art performance across a wide range of multi-modal evaluation benchmarks. Finally, we conduct in-depth analyses to understand visual instruction tuning and our findings reveal that: (1) GPT-4 synthesized data does not substantially enhance VLMs' capabilities but rather modulates the model's responses to human-preferred formats; (2) A minimal quantity (e.g., 1,000) of GPT-4 synthesized data can effectively align VLM responses with human-preference; (3) Visual instruction tuning mainly helps large-language models (LLMs) to understand visual features.
摘要：尽管视觉语言模型 (VLM) 作为多功能视觉助手具有卓越的功能，但现有 VLM 框架中仍然存在两个重大挑战：(1) 预训练和视觉指令调整方面缺乏任务多样性，(2) GPT-4 合成指令调整数据中的注释错误和偏差。这两个挑战都会导致泛化性差、幻觉和灾难性遗忘等问题。为了应对这些挑战，我们构建了 Vision-Flan，这是迄今为止最多样化的公开视觉指令调整数据集，包含来自学术数据集的 187 个不同任务和 1,664,261 个实例，每个任务都附有专家编写的指令。此外，我们提出了一个两阶段的指令调优框架，其中 VLM 首先在 Vision-Flan 上进行微调，并进一步在 GPT-4 合成数据上进行调优。我们发现这种两阶段调优框架显着优于传统的单阶段视觉指令调优框架，并在各种多模态评估基准中实现了最先进的性能。最后，我们进行了深入分析以了解视觉指令调整，我们的研究结果表明：(1) GPT-4 合成数据并没有显着增强 VLM 的功能，而是调节模型对人类首选格式的响应；(2) 最小数量（例如 1,000 个）的 GPT-4 合成数据可以有效地将 VLM 响应与人类偏好保持一致；(3) 视觉指令调优主要帮助大语言模型 (LLM) 理解视觉特征。

Title: Why Lift so Heavy? Slimming Large Language Models by Cutting Off the Layers

Authors: Shuzhou Yuan, Ercong Nie, Bolei Ma, Michael Färber
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11700
Pdf URL: https://arxiv.org/pdf/2402.11700
Copy Paste: [[2402.11700]] Why Lift so Heavy? Slimming Large Language Models by Cutting Off the Layers(https://arxiv.org/abs/2402.11700)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) possess outstanding capabilities in addressing various natural language processing (NLP) tasks. However, the sheer size of these models poses challenges in terms of storage, training and inference due to the inclusion of billions of parameters through layer stacking. While traditional approaches such as model pruning or distillation offer ways for reducing model size, they often come at the expense of performance retention. In our investigation, we systematically explore the approach of reducing the number of layers in LLMs. Surprisingly, we observe that even with fewer layers, LLMs maintain similar or better performance levels, particularly in prompt-based fine-tuning for text classification tasks. Remarkably, in certain cases, models with a single layer outperform their fully layered counterparts. These findings offer valuable insights for future work aimed at mitigating the size constraints of LLMs while preserving their performance, thereby opening avenues for significantly more efficient use of LLMs.
摘要：大型语言模型 (LLM) 在解决各种自然语言处理 (NLP) 任务方面具有出色的能力。然而，由于通过层堆叠包含数十亿个参数，这些模型的庞大规模在存储、训练和推理方面带来了挑战。虽然模型剪枝或蒸馏等传统方法提供了减小模型大小的方法，但它们通常是以牺牲性能为代价的。在我们的调查中，我们系统地探索了减少 LLM 层数的方法。令人惊讶的是，我们观察到，即使层数较少，LLM 也能保持相似或更好的性能水平，特别是在文本分类任务的基于提示的微调方面。值得注意的是，在某些情况下，单层模型的性能优于全分层模型。这些发现为未来的工作提供了宝贵的见解，旨在减轻 LLM 的规模限制，同时保持其性能，从而为更有效地利用 LLM 开辟途径。

Title: GNNavi: Navigating the Information Flow in Large Language Models by Graph Neural Network

Authors: Shuzhou Yuan, Ercong Nie, Michael Färber, Helmut Schmid, Hinrich Schütze
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.11709
Pdf URL: https://arxiv.org/pdf/2402.11709
Copy Paste: [[2402.11709]] GNNavi: Navigating the Information Flow in Large Language Models by Graph Neural Network(https://arxiv.org/abs/2402.11709)
Keywords: language model, gpt, llm, prompt
Abstract: Large Language Models (LLMs) exhibit strong In-Context Learning (ICL) capabilities when prompts with demonstrations are applied to them. However, fine-tuning remains crucial to further enhance their adaptability. Prompt-based fine-tuning proves to be an effective fine-tuning method in low-data scenarios, but high demands on computing resources limit its practicality. We address this issue by introducing a prompt-based parameter-efficient fine-tuning (PEFT) approach. GNNavi leverages insights into ICL's information flow dynamics, which indicate that label words act in prompts as anchors for information propagation. GNNavi employs a Graph Neural Network (GNN) layer to precisely guide the aggregation and distribution of information flow during the processing of prompts by hardwiring the desired information flow into the GNN. Our experiments on text classification tasks with GPT-2 and Llama2 show that GNNavi surpasses standard prompt-based fine-tuning methods in few-shot settings by updating just 0.2% to 0.5% of parameters. We compare GNNavi with prevalent PEFT approaches, such as prefix tuning, LoRA and Adapter, in terms of performance and efficiency. Our analysis reveals that GNNavi enhances information flow and ensures a clear aggregation process.
摘要：当将演示提示应用于大型语言模型 (LLM) 时，它们会表现出强大的情境学习 (ICL) 功能。然而，微调对于进一步增强其适应性仍然至关重要。基于提示的微调被证明是低数据场景下有效的微调方法，但对计算资源的高要求限制了其实用性。我们通过引入基于提示的参数高效微调（PEFT）方法来解决这个问题。 GNNavi 利用对 ICL 信息流动态的洞察，这表明标签词在提示中充当信息传播的锚点。 GNNavi 采用图神经网络 (GNN) 层，通过将所需的信息流硬连接到 GNN 中，在提示处理过程中精确指导信息流的聚合和分发。我们使用 GPT-2 和 Llama2 进行文本分类任务的实验表明，GNNavi 仅更新 0.2% 到 0.5% 的参数，在少量设置中超越了标准的基于提示的微调方法。我们将 GNNavi 与流行的 PEFT 方法（例如前缀调整、LoRA 和 Adapter）在性能和效率方面进行比较。我们的分析表明，GNNavi 增强了信息流并确保了清晰的聚合过程。

Title: A Note on Bias to Complete

Authors: Jia Xu, Mona Diab
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11710
Pdf URL: https://arxiv.org/pdf/2402.11710
Copy Paste: [[2402.11710]] A Note on Bias to Complete(https://arxiv.org/abs/2402.11710)
Keywords: llm
Abstract: Minimizing social bias strengthens societal bonds, promoting shared understanding and better decision-making. We revisit the definition of bias by discovering new bias types (e.g., societal status) in dynamic environments and describe them relative to context, such as culture, region, time, and personal background. Our framework includes eight hypotheses about bias and a minimizing bias strategy for each assumption as well as five methods as proposed solutions in LLM. The realization of the framework is yet to be completed.
摘要：最大限度地减少社会偏见可以加强社会纽带，促进共同理解和更好的决策。我们通过在动态环境中发现新的偏见类型（例如社会地位）来重新审视偏见的定义，并根据文化、地区、时间和个人背景等背景来描述它们。我们的框架包括八个关于偏见的假设和针对每个假设的最小化偏见策略，以及五种作为 LLM 解决方案提出的方法。该框架的实现尚未完成。

Title: MORL-Prompt: An Empirical Analysis of Multi-Objective Reinforcement Learning for Discrete Prompt Optimization

Authors: Yasaman Jafari, Dheeraj Mekala, Rose Yu, Taylor Berg-Kirkpatrick
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11711
Pdf URL: https://arxiv.org/pdf/2402.11711
Copy Paste: [[2402.11711]] MORL-Prompt: An Empirical Analysis of Multi-Objective Reinforcement Learning for Discrete Prompt Optimization(https://arxiv.org/abs/2402.11711)
Keywords: language model, prompt
Abstract: RL-based techniques can be used to search for prompts that when fed into a target language model maximize a set of user-specified reward functions. However, in many target applications, the natural reward functions are in tension with one another -- for example, content preservation vs. style matching in style transfer tasks. Current techniques focus on maximizing the average of reward functions, which does not necessarily lead to prompts that achieve balance across rewards -- an issue that has been well-studied in the multi-objective and robust optimization literature. In this paper, we adapt several techniques for multi-objective optimization to RL-based discrete prompt optimization -- two that consider volume of the Pareto reward surface, and another that chooses an update direction that benefits all rewards simultaneously. We conduct an empirical analysis of these methods on two NLP tasks: style transfer and machine translation, each using three competing reward functions. Our experiments demonstrate that multi-objective methods that directly optimize volume perform better and achieve a better balance of all rewards than those that attempt to find monotonic update directions.
摘要：基于强化学习的技术可用于搜索提示，当将其输入目标语言模型时，可以最大化一组用户指定的奖励函数。然而，在许多目标应用中，自然奖励函数彼此之间存在紧张关系——例如，风格迁移任务中的内容保留与风格匹配。当前的技术侧重于最大化奖励函数的平均值，这并不一定会导致实现奖励平衡的提示——这个问题在多目标和鲁棒优化文献中已经得到了充分研究。在本文中，我们将多种多目标优化技术应用于基于强化学习的离散提示优化——其中两种考虑帕累托奖励曲面的体积，另一种选择同时有利于所有奖励的更新方向。我们对两种 NLP 任务（风格迁移和机器翻译）的这些方法进行了实证分析，每个任务都使用三个相互竞争的奖励函数。我们的实验表明，直接优化交易量的多目标方法比那些试图寻找单调更新方向的方法表现更好，并实现了所有奖励的更好平衡。
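
Note: the abstract above contrasts maximizing the average of rewards with optimizing the volume of the Pareto reward surface. The following is a minimal two-objective sketch of a hypervolume computation under maximization, not the paper's method (which uses three rewards per task). The function names, the reference point, and the example reward pairs are assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def pareto_front(points: Sequence[Point]) -> List[Point]:
    """Return the non-dominated points, assuming both objectives are maximized."""
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] >= p[0] and q[1] >= p[1] for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(set(front))


def hypervolume_2d(points: Sequence[Point], ref: Point = (0.0, 0.0)) -> float:
    """Area dominated by the Pareto front and bounded below by the reference point."""
    # Sweep from the largest first reward to the smallest; on a Pareto front
    # the second reward then increases, so each point adds one new strip.
    front = sorted(pareto_front(points), key=lambda p: p[0], reverse=True)
    area = 0.0
    prev_y = ref[1]
    for x, y in front:
        if x <= ref[0] or y <= prev_y:
            continue  # contributes nothing beyond the reference point
        area += (x - ref[0]) * (y - prev_y)
        prev_y = y
    return area


# Two hypothetical prompts with the same average reward (0.5):
unbalanced = [(0.9, 0.1)]
balanced = [(0.5, 0.5)]
```

Here both candidates have the same mean reward, yet `hypervolume_2d(balanced)` is larger than `hypervolume_2d(unbalanced)`, which illustrates why a volume-based objective can favor prompts that balance competing rewards.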

Title: Modelling Political Coalition Negotiations Using LLM-based Agents

Authors: Farhad Moghimifar, Yuan-Fang Li, Robert Thomson, Gholamreza Haffari
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11712
Pdf URL: https://arxiv.org/pdf/2402.11712
Copy Paste: [[2402.11712]] Modelling Political Coalition Negotiations Using LLM-based Agents(https://arxiv.org/abs/2402.11712)
Keywords: language model, llm, agent
Abstract: Coalition negotiations are a cornerstone of parliamentary democracies, characterised by complex interactions and strategic communications among political parties. Despite their significance, the modelling of these negotiations has remained unexplored within the domain of Natural Language Processing (NLP), mostly due to a lack of proper data. In this paper, we introduce coalition negotiations as a novel NLP task, and model it as a negotiation between large language model-based agents. We introduce a multilingual dataset, POLCA, comprising manifestos of European political parties and coalition agreements over a number of elections in these countries. This dataset addresses the challenge of the current scope limitations in political negotiation modelling by providing a diverse, real-world basis for simulation. Additionally, we propose a hierarchical Markov decision process designed to simulate the process of coalition negotiation between political parties and predict the outcomes. We evaluate the performance of state-of-the-art large language models (LLMs) as agents in handling coalition negotiations, offering insights into their capabilities and paving the way for future advancements in political modelling.
摘要：联盟谈判是议会民主的基石，其特点是政党之间复杂的互动和战略沟通。尽管其意义重大，但这些谈判的建模尚未在自然语言处理（NLP）领域得到探索，这主要是由于缺乏适当的数据。在本文中，我们将联盟协商作为一种新颖的 NLP 任务引入，并将其建模为基于大型语言模型的代理之间的协商。我们引入了一个多语言数据集 POLCA，其中包含欧洲政党的宣言和这些国家多次选举的联盟协议。该数据集通过提供多样化的真实世界模拟基础，解决了当前政治谈判建模范围限制的挑战。此外，我们提出了一种分层马尔可夫决策过程，旨在模拟政党之间的联盟谈判过程并预测结果。我们评估最先进的大型语言模型（LLM）作为代理在处理联盟谈判中的表现，提供对其能力的见解，并为政治建模的未来进步铺平道路。

Title: How Susceptible are Large Language Models to Ideological Manipulation?

Authors: Kai Chen, Zihao He, Jun Yan, Taiwei Shi, Kristina Lerman
Subjects: cs.CL, cs.CR, cs.CY
Abstract URL: https://arxiv.org/abs/2402.11725
Pdf URL: https://arxiv.org/pdf/2402.11725
Copy Paste: [[2402.11725]] How Susceptible are Large Language Models to Ideological Manipulation?(https://arxiv.org/abs/2402.11725)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) possess the potential to exert substantial influence on public perceptions and interactions with information. This raises concerns about the societal impact that could arise if the ideologies within these models can be easily manipulated. In this work, we investigate how effectively LLMs can learn and generalize ideological biases from their instruction-tuning data. Our findings reveal a concerning vulnerability: exposure to only a small amount of ideologically driven samples significantly alters the ideology of LLMs. Notably, LLMs demonstrate a startling ability to absorb ideology from one topic and generalize it to even unrelated ones. The ease with which LLMs' ideologies can be skewed underscores the risks associated with intentionally poisoned training data by malicious actors or inadvertently introduced biases by data annotators. It also emphasizes the imperative for robust safeguards to mitigate the influence of ideological manipulations on LLMs.
摘要：大型语言模型（LLM）具有对公众认知和信息交互产生重大影响的潜力。这引发了人们的担忧，即如果这些模型中的意识形态很容易被操纵，可能会产生社会影响。在这项工作中，我们研究了 LLM 如何有效地从其指令调整数据中学习和概括意识形态偏见。我们的研究结果揭示了一个令人担忧的弱点：仅接触少量意识形态驱动的样本就会显着改变 LLM 的意识形态。值得注意的是，LLM 表现出一种惊人的能力，可以从一个主题中吸收意识形态，并将其推广到甚至不相关的主题。LLM 的意识形态很容易被扭曲，这凸显了与恶意行为者故意毒害训练数据或数据注释者无意中引入偏见相关的风险。它还强调必须采取强有力的保障措施，以减轻意识形态操纵对 LLM 的影响。

Title: Numerical Claim Detection in Finance: A New Financial Dataset, Weak-Supervision Model, and Market Analysis

Authors: Agam Shah, Arnav Hiray, Pratvi Shah, Arkaprabha Banerjee, Anushka Singh, Dheeraj Eidnani, Bhaskar Chaudhury, Sudheer Chava
Subjects: cs.CL, cs.LG, q-fin.CP
Abstract URL: https://arxiv.org/abs/2402.11728
Pdf URL: https://arxiv.org/pdf/2402.11728
Copy Paste: [[2402.11728]] Numerical Claim Detection in Finance: A New Financial Dataset, Weak-Supervision Model, and Market Analysis(https://arxiv.org/abs/2402.11728)
Keywords: language model
Abstract: In this paper, we investigate the influence of claims in analyst reports and earnings calls on financial market returns, considering them as significant quarterly events for publicly traded companies. To facilitate a comprehensive analysis, we construct a new financial dataset for the claim detection task in the financial domain. We benchmark various language models on this dataset and propose a novel weak-supervision model that incorporates the knowledge of subject matter experts (SMEs) in the aggregation function, outperforming existing approaches. Furthermore, we demonstrate the practical utility of our proposed model by constructing a novel measure, "optimism". We also observe the dependence of earnings surprise and return on our optimism measure. Our dataset, models, and code will be made publicly available (under the CC BY 4.0 license) on GitHub and Hugging Face.
摘要：在本文中，我们研究了分析师报告和财报电话会议中的索赔对金融市场回报的影响，将其视为上市公司的重要季度事件。为了便于全面分析，我们为金融领域的索赔检测任务构建了一个新的金融数据集。我们在此数据集上对各种语言模型进行了基准测试，并提出了一种新颖的弱监督模型，该模型在聚合函数中融入了主题专家（SME）的知识，其性能优于现有方法。此外，我们通过构建一种新颖的“乐观”衡量标准来证明我们提出的模型的实用性。此外，我们观察了盈利惊喜和回报对我们的乐观衡量标准的依赖。我们的数据集、模型和代码将公开（根据CC BY 4.0 许可证）可在 GitHub 和 Hugging Face 上获取。

Title: Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic

Authors: Rishabh Bhardwaj, Do Duc Anh, Soujanya Poria
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.11746
Pdf URL: https://arxiv.org/pdf/2402.11746
Copy Paste: [[2402.11746]] Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic(https://arxiv.org/abs/2402.11746)
Keywords: language model, llm
Abstract: Aligned language models face a significant limitation as their fine-tuning often results in compromised safety. To tackle this, we propose a simple method RESTA that performs LLM safety realignment. RESTA stands for REstoring Safety through Task Arithmetic. At its core, it involves a simple arithmetic addition of a safety vector to the weights of the compromised model. We demonstrate the effectiveness of RESTA in both parameter-efficient and full fine-tuning, covering a wide range of downstream tasks, including instruction following in Chinese, English, and Hindi, as well as problem-solving capabilities in Code and Math. We also showcase the generalizability of RESTA on three existing safety evaluation benchmarks and a multilingual benchmark dataset proposed as a part of this work, consisting of 550 harmful questions covering 11 categories, each with 5 sub-categories of harm. Overall, RESTA decreases the harmfulness of the compromised model from 18.6% to 5.1% and from 9.2% to 1.5% in parameter-efficient and full fine-tuning, respectively, while maintaining most of the model's performance on the task. We release the source codes at: https://github.com/declare-lab/resta.
摘要：对齐的语言模型面临着重大限制，因为它们的微调通常会导致安全性受到损害。为了解决这个问题，我们提出了一种简单的 RESTA 方法来执行 LLM 安全重新调整。 RESTA 代表通过任务算法恢复安全。从本质上讲，它涉及将安全向量与受损模型的权重进行简单算术相加。我们展示了 RESTA 在参数高效和全面微调方面的有效性，涵盖了广泛的下游任务，包括中文、英语和印地语的指令遵循，以及代码和数学的问题解决能力。我们还展示了 RESTA 在三个现有安全评估基准和作为本工作一部分提出的多语言基准数据集上的普遍性，该数据集由涵盖 11 个类别的 550 个有害问题组成，每个类别又包含 5 个危害子类别。总体而言，RESTA 在参数效率和完全微调方面分别将受损模型的危害性从 18.6% 降低到 5.1%，从 9.2% 降低到 1.5%，同时保持模型在任务上的大部分性能。我们在以下位置发布源代码：https://github.com/declare-lab/resta。
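
Note: the abstract above describes RESTA's core step as a simple arithmetic addition of a safety vector to the weights of the compromised model. The sketch below illustrates only that addition on plain Python lists; it is not the released implementation. How the safety vector is obtained is not stated in the abstract, so it is taken as given here, and the `scale` factor, the function name, and the parameter names are assumptions.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def add_safety_vector(
    compromised: Weights,
    safety_vector: Weights,
    scale: float = 1.0,  # hypothetical scaling factor, not from the abstract
) -> Weights:
    """Return new weights equal to compromised + scale * safety_vector."""
    realigned: Weights = {}
    for name, values in compromised.items():
        delta = safety_vector.get(name)
        if delta is None:
            # Parameters without a safety-vector entry are copied unchanged.
            realigned[name] = list(values)
            continue
        if len(delta) != len(values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        realigned[name] = [w + scale * d for w, d in zip(values, delta)]
    return realigned


# Toy weights standing in for a fine-tuned (compromised) model.
compromised_weights = {"layer.weight": [0.5, -1.0], "layer.bias": [0.25]}
toy_safety_vector = {"layer.weight": [0.25, 0.5]}
realigned_weights = add_safety_vector(compromised_weights, toy_safety_vector)
```

With real models the same elementwise addition would be applied to each parameter tensor; the input dictionaries are left unmodified so the compromised weights can still be compared against the realigned ones.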

Title: In-Context Learning Demonstration Selection via Influence Analysis

Authors: Vinay M.S., Minh-Hao Van, Xintao Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11750
Pdf URL: https://arxiv.org/pdf/2402.11750
Copy Paste: [[2402.11750]] In-Context Learning Demonstration Selection via Influence Analysis(https://arxiv.org/abs/2402.11750)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated their In-Context Learning (ICL) capabilities, which provide an opportunity to perform few-shot learning without any gradient update. Despite its multiple benefits, ICL generalization performance is sensitive to the selected demonstrations. Selecting effective demonstrations for ICL is still an open research challenge. To address this challenge, we propose a demonstration selection method called InfICL, which analyzes the influence of training samples through influence functions. Identifying highly influential training samples can potentially aid in improving the ICL generalization performance. To limit the running cost of InfICL, we only employ the LLM to generate sample embeddings, and don't perform any costly fine-tuning. We perform an empirical study on multiple real-world datasets and show the merits of our InfICL against state-of-the-art baselines.
摘要：大型语言模型 (LLM) 已经展示了其上下文学习 (ICL) 功能，该功能提供了在无需任何梯度更新的情况下执行少量镜头学习的机会。尽管具有多种优点，但 ICL 泛化性能对所选演示很敏感。选择有效的 ICL 演示仍然是一项开放的研究挑战。为了应对这一挑战，我们提出了一种名为 InfICL 的演示选择方法，该方法通过影响函数分析训练样本的影响。识别具有高度影响力的训练样本可能有助于提升 ICL 泛化性能。为了限制 InfICL 的运行成本，我们只使用 LLM 来生成样本嵌入，而不执行任何昂贵的微调。我们对多个现实世界数据集进行实证研究，并根据最先进的基线展示我们的 InfICL 的优点。

Title: ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs

Authors: Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, Radha Poovendran
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.11753
Pdf URL: https://arxiv.org/pdf/2402.11753
Copy Paste: [[2402.11753]] ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs(https://arxiv.org/abs/2402.11753)
Keywords: language model, gpt, llm, prompt
Abstract: Safety is critical to the usage of large language models (LLMs). Multiple techniques such as data filtering and supervised fine-tuning have been developed to strengthen LLM safety. However, currently known techniques presume that corpora used for safety alignment of LLMs are solely interpreted by semantics. This assumption, however, does not hold in real-world applications, which leads to severe vulnerabilities in LLMs. For example, users of forums often use ASCII art, a form of text-based art, to convey image information. In this paper, we propose a novel ASCII art-based jailbreak attack and introduce a comprehensive benchmark Vision-in-Text Challenge (ViTC) to evaluate the capabilities of LLMs in recognizing prompts that cannot be solely interpreted by semantics. We show that five SOTA LLMs (GPT-3.5, GPT-4, Gemini, Claude, and Llama2) struggle to recognize prompts provided in the form of ASCII art. Based on this observation, we develop the jailbreak attack ArtPrompt, which leverages the poor performance of LLMs in recognizing ASCII art to bypass safety measures and elicit undesired behaviors from LLMs. ArtPrompt only requires black-box access to the victim LLMs, making it a practical attack. We evaluate ArtPrompt on five SOTA LLMs, and show that ArtPrompt can effectively and efficiently induce undesired behaviors from all five LLMs.
摘要：安全对于大型语言模型 (LLM) 的使用至关重要。数据过滤和监督微调等多种技术已被开发出来，以加强 LLM 的安全性。然而，目前已知的技术假设用于 LLM 安全对齐的语料库仅通过语义来解释。然而，这种假设在实际应用中并不成立，这导致 LLM 存在严重漏洞。例如，论坛的用户经常使用 ASCII 艺术（一种基于文本的艺术形式）来传达图像信息。在本文中，我们提出了一种新颖的基于 ASCII 艺术的越狱攻击，并引入了综合基准 Vision-in-Text Challenge（ViTC）来评估 LLM 识别不能仅通过语义解释的提示的能力。我们展示了五个 SOTA LLM（GPT-3.5、GPT-4、Gemini、Claude 和 Llama2）难以识别以 ASCII 艺术形式提供的提示。基于这一观察，我们开发了越狱攻击 ArtPrompt，它利用 LLM 在识别 ASCII 艺术方面的糟糕表现来绕过安全措施并引发 LLM 的不良行为。ArtPrompt 仅需要对受害者 LLM 进行黑盒访问，使其成为一种实用的攻击。我们在五个 SOTA LLM 上评估了 ArtPrompt，并表明 ArtPrompt 可以有效且高效地诱导所有五个 LLM 的不良行为。

Title: SPML: A DSL for Defending Language Models Against Prompt Attacks

Authors: Reshabh K Sharma, Vinayak Gupta, Dan Grossman
Subjects: cs.LG, cs.CL, cs.CR, cs.PL
Abstract URL: https://arxiv.org/abs/2402.11755
Pdf URL: https://arxiv.org/pdf/2402.11755
Copy Paste: [[2402.11755]] SPML: A DSL for Defending Language Models Against Prompt Attacks(https://arxiv.org/abs/2402.11755)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Large language models (LLMs) have profoundly transformed natural language applications, with a growing reliance on instruction-based definitions for designing chatbots. However, post-deployment the chatbot definitions are fixed and are vulnerable to attacks by malicious users, emphasizing the need to prevent unethical applications and financial losses. Existing studies explore user prompts' impact on LLM-based chatbots, yet practical methods to contain attacks on application-specific chatbots remain unexplored. This paper presents System Prompt Meta Language (SPML), a domain-specific language for refining prompts and monitoring the inputs to the LLM-based chatbots. SPML actively checks attack prompts, ensuring user inputs align with chatbot definitions to prevent malicious execution on the LLM backbone, optimizing costs. It also streamlines chatbot definition crafting with programming language capabilities, overcoming natural language design challenges. Additionally, we introduce a groundbreaking benchmark with 1.8k system prompts and 20k user inputs, offering the inaugural language and benchmark for chatbot definition evaluation. Experiments across datasets demonstrate SPML's proficiency in understanding attacker prompts, surpassing models like GPT-4, GPT-3.5, and LLAMA. Our data and codes are publicly available at: https://prompt-compiler.github.io/SPML/.
摘要：大型语言模型 (LLM) 深刻地改变了自然语言应用程序，越来越依赖基于指令的定义来设计聊天机器人。然而，部署后的聊天机器人定义是固定的，很容易受到恶意用户的攻击，这强调了防止不道德应用和财务损失的必要性。现有研究探讨了用户提示对基于 LLM 的聊天机器人的影响，但遏制对特定应用程序聊天机器人攻击的实用方法仍未探索。本文介绍了系统提示元语言 (SPML)，这是一种特定于领域的语言，用于细化提示并监控基于 LLM 的聊天机器人的输入。 SPML 主动检查攻击提示，确保用户输入与聊天机器人定义一致，以防止 LLM 主干上的恶意执行，从而优化成本。它还通过编程语言功能简化了聊天机器人定义的制作，克服了自然语言设计的挑战。此外，我们还引入了具有 1.8k 系统提示和 20k 用户输入的突破性基准，为聊天机器人定义评估提供了首个语言和基准。跨数据集的实验证明 SPML 在理解攻击者提示方面的能力，超越了 GPT-4、GPT-3.5 和 LLAMA 等模型。我们的数据和代码可公开获取：https://prompt-compiler.github.io/SPML/。

Title: MARS: Meaning-Aware Response Scoring for Uncertainty Estimation in Generative LLMs

Authors: Yavuz Faruk Bakman, Duygu Nur Yaldiz, Baturalp Buyukates, Chenyang Tao, Dimitrios Dimitriadis, Salman Avestimehr
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2402.11756
Pdf URL: https://arxiv.org/pdf/2402.11756
Copy Paste: [[2402.11756]] MARS: Meaning-Aware Response Scoring for Uncertainty Estimation in Generative LLMs(https://arxiv.org/abs/2402.11756)
Keywords: language model, llm
Abstract: Generative Large Language Models (LLMs) are widely utilized for their excellence in various tasks. However, their tendency to produce inaccurate or misleading outputs poses a potential risk, particularly in high-stakes environments. Therefore, estimating the correctness of generative LLM outputs is an important task for enhanced reliability. Uncertainty Estimation (UE) in generative LLMs is an evolving domain, where SOTA probability-based methods commonly employ length-normalized scoring. In this work, we propose Meaning-Aware Response Scoring (MARS) as an alternative to length-normalized scoring for UE methods. MARS is a novel scoring function that considers the semantic contribution of each token in the generated sequence in the context of the question. We demonstrate that integrating MARS into UE methods results in a universal and significant improvement in UE performance. We conduct experiments using three distinct closed-book question-answering datasets across five popular pre-trained LLMs. Lastly, we validate the efficacy of MARS on a Medical QA dataset. Code can be found at https://anonymous.4open.science/r/LLM_Uncertainity-309B.
摘要：生成式大语言模型 (LLM) 因其在各种任务中的卓越表现而被广泛使用。然而，它们产生不准确或误导性输出的倾向会带来潜在风险，特别是在高风险环境中。因此，估计生成式 LLM 输出的正确性是增强可靠性的一项重要任务。生成式 LLM 中的不确定性估计 (UE) 是一个不断发展的领域，其中 SOTA 的基于概率的方法通常采用长度归一化评分。在这项工作中，我们提出意义感知响应评分（MARS）作为 UE 方法长度归一化评分的替代方案。MARS 是一种新颖的评分函数，它考虑问题上下文中生成序列中每个标记的语义贡献。我们证明，将 MARS 集成到 UE 方法中可以普遍且显着地提高 UE 性能。我们在五个流行的预训练 LLM 上使用三个不同的闭卷问答数据集进行实验。最后，我们在医学 QA 数据集上验证了 MARS 的有效性。代码可以在 https://anonymous.4open.science/r/LLM_Uncertainity-309B 找到。
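
Note: the abstract above positions MARS as an alternative to length-normalized scoring, in which every generated token contributes equally. The sketch below shows the length-normalized baseline and a generic importance-weighted variant for comparison. The weighted variant is an illustration of the general idea only and is not the MARS formula; the per-token weights and all function names are hypothetical.

```python
import math
from typing import Sequence


def length_normalized_score(token_probs: Sequence[float]) -> float:
    """Geometric mean of token probabilities: exp of the mean log-probability."""
    if not token_probs or any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("token probabilities must lie in (0, 1]")
    mean_logprob = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_logprob)


def importance_weighted_score(
    token_probs: Sequence[float],
    weights: Sequence[float],  # hypothetical per-token importance weights
) -> float:
    """Weighted geometric mean: tokens with larger weights dominate the score."""
    if len(token_probs) != len(weights) or not token_probs:
        raise ValueError("token_probs and weights must be non-empty and aligned")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("token probabilities must lie in (0, 1]")
    total = sum(weights)
    if total <= 0 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative with a positive sum")
    weighted_logprob = sum(w * math.log(p) for p, w in zip(token_probs, weights))
    return math.exp(weighted_logprob / total)


# Toy answer with one confident token and one uncertain token.
probs = [0.9, 0.1]
uniform = importance_weighted_score(probs, [1.0, 1.0])
focused = importance_weighted_score(probs, [1.0, 0.0])
```

With uniform weights the weighted score reduces to the length-normalized one, so any difference between the two comes entirely from how the weights distribute importance across tokens.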

Title: ChatGPT Based Data Augmentation for Improved Parameter-Efficient Debiasing of LLMs

Authors: Pengrui Han, Rafal Kocielnik, Adhithya Saravanan, Roy Jiang, Or Sharir, Anima Anandkumar
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2402.11764
Pdf URL: https://arxiv.org/pdf/2402.11764
Copy Paste: [[2402.11764]] ChatGPT Based Data Augmentation for Improved Parameter-Efficient Debiasing of LLMs(https://arxiv.org/abs/2402.11764)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Large Language Models (LLMs), while powerful, exhibit harmful social biases. Debiasing is often challenging due to computational costs, data constraints, and potential degradation of multi-task language capabilities. This work introduces a novel approach utilizing ChatGPT to generate synthetic training data, aiming to enhance the debiasing of LLMs. We propose two strategies: Targeted Prompting, which provides effective debiasing for known biases but necessitates prior specification of the bias in question; and General Prompting, which, while slightly less effective, offers debiasing across various categories. We leverage resource-efficient LLM debiasing using adapter tuning and compare the effectiveness of our synthetic data to existing debiasing datasets. Our results reveal that: (1) ChatGPT can efficiently produce high-quality training data for debiasing other LLMs; (2) data produced via our approach surpasses existing datasets in debiasing performance while also preserving internal knowledge of a pre-trained LLM; and (3) synthetic data exhibits generalizability across categories, effectively mitigating various biases, including intersectional ones. These findings underscore the potential of synthetic data in advancing the fairness of LLMs with minimal retraining cost.
摘要：大型语言模型（LLM）虽然强大，但表现出有害的社会偏见。由于计算成本、数据限制以及多任务语言能力的潜在退化，去偏通常具有挑战性。这项工作介绍了一种利用 ChatGPT 生成合成训练数据的新方法，旨在增强 LLM 的去偏性。我们提出了两种策略：有针对性的提示，它为已知的偏见提供有效的去偏见，但需要事先指定有问题的偏见；和一般提示，虽然效果稍差，但可以在各个类别中消除偏差。我们通过适配器调整来利用资源高效的 LLM 去偏，并将我们的合成数据与现有去偏数据集的有效性进行比较。我们的结果表明：(1) ChatGPT 可以有效地生成高质量的训练数据，以消除其他 LLM 的偏差；(2) 通过我们的方法生成的数据在去偏性能方面超越了现有数据集，同时还保留了预训练 LLM 的内部知识；(3) 合成数据表现出跨类别的普遍性，有效地减少了各种偏差，包括交叉偏差。这些发现强调了合成数据在以最小的再训练成本提升 LLM 公平性方面的潜力。

Title: Structured Chain-of-Thought Prompting for Few-Shot Generation of Content-Grounded QA Conversations

Authors: Md Arafat Sultan, Jatin Ganhotra, Ramón Fernandez Astudillo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11770
Pdf URL: https://arxiv.org/pdf/2402.11770
Copy Paste: [[2402.11770]] Structured Chain-of-Thought Prompting for Few-Shot Generation of Content-Grounded QA Conversations(https://arxiv.org/abs/2402.11770)
Keywords: language model, llm, hallucination, prompt, chain-of-thought, agent
Abstract: We introduce a structured chain-of-thought (SCoT) prompting approach to generating content-grounded multi-turn question-answer conversations using a pre-trained large language model (LLM). At the core of our proposal is a structured breakdown of the complex task into a number of states in a state machine, so that actions corresponding to various subtasks, e.g., content reading and utterance generation, can be executed in their own dedicated states. Each state leverages a unique set of resources including prompts and (optionally) additional tools to augment the generation process. Our experimental results show that SCoT prompting with designated states for hallucination mitigation increases agent faithfulness to grounding documents by up to 16.8%. When used as training data, our open-domain conversations synthesized from only 6 Wikipedia-based seed demonstrations train strong conversational QA agents; in out-of-domain evaluation, for example, we observe improvements of up to 13.9% over target domain gold data when the latter is augmented with our generated examples.
摘要：我们引入了一种结构化思维链（SCoT）提示方法，使用预先训练的大语言模型（LLM）生成基于内容的多轮问答对话。我们建议的核心是将复杂任务结构化分解为状态机中的多个状态，以便与各种子任务相对应的动作（例如内容阅读和话语生成）可以在它们自己的专用状态中执行。每个状态都利用一组独特的资源，包括提示和（可选）附加工具来增强生成过程。我们的实验结果表明，带有用于缓解幻觉的指定状态的 SCoT 提示可将代理对依据文档的忠实度提高高达 16.8%。当用作训练数据时，我们的开放域对话仅由 6 个基于维基百科的种子演示合成，可训练强大的对话 QA 代理；例如，在域外评估中，当目标域黄金数据用我们生成的示例进行增强时，我们观察到比仅使用目标域黄金数据提高了高达 13.9%。

Title: Evaluating the Effectiveness of Index-Based Treatment Allocation

Authors: Niclas Boehmer, Yash Nair, Sanket Shah, Lucas Janson, Aparna Taneja, Milind Tambe
Subjects: cs.LG, cs.AI, stat.ME, stat.ML
Abstract URL: https://arxiv.org/abs/2402.11771
Pdf URL: https://arxiv.org/pdf/2402.11771
Copy Paste: [[2402.11771]] Evaluating the Effectiveness of Index-Based Treatment Allocation(https://arxiv.org/abs/2402.11771)
Keywords: agent
Abstract: When resources are scarce, an allocation policy is needed to decide who receives a resource. This problem occurs, for instance, when allocating scarce medical resources and is often solved using modern ML methods. This paper introduces methods to evaluate index-based allocation policies -- that allocate a fixed number of resources to those who need them the most -- by using data from a randomized control trial. Such policies create dependencies between agents, which render the assumptions behind standard statistical tests invalid and limit the effectiveness of estimators. Addressing these challenges, we translate and extend recent ideas from the statistics literature to present an efficient estimator and methods for computing asymptotically correct confidence intervals. This enables us to effectively draw valid statistical conclusions, a critical gap in previous work. Our extensive experiments validate our methodology in practical settings, while also showcasing its statistical power. We conclude by proposing and empirically verifying extensions of our methodology that enable us to reevaluate a past randomized control trial to evaluate different ML allocation policies in the context of a mHealth program, drawing previously invisible conclusions.
摘要：当资源稀缺时，需要制定分配策略来决定谁获得资源。例如，在分配稀缺医疗资源时会出现此问题，并且通常可以使用现代机器学习方法来解决。本文介绍了通过使用随机对照试验的数据来评估基于指数的分配政策的方法，该政策将固定数量的资源分配给最需要的人。此类策略在代理之间创建了依赖性，这使得标准统计测试背后的假设无效并限制了估计器的有效性。为了解决这些挑战，我们翻译并扩展了统计文献中的最新想法，以提出一种有效的估计器和方法来计算渐近正确的置信区间。这使我们能够有效地得出有效的统计结论，这是之前工作中的一个关键差距。我们广泛的实验在实际环境中验证了我们的方法，同时也展示了其统计能力。最后，我们提出并实证验证了我们方法的扩展，使我们能够重新评估过去的随机对照试验，以评估 mHealth 计划背景下的不同 ML 分配政策，得出以前看不见的结论。

Title: Uncovering Latent Human Wellbeing in Language Model Embeddings

Authors: Pedro Freire, ChengCheng Tan, Adam Gleave, Dan Hendrycks, Scott Emmons
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.11777
Pdf URL: https://arxiv.org/pdf/2402.11777
Copy Paste: [[2402.11777]] Uncovering Latent Human Wellbeing in Language Model Embeddings(https://arxiv.org/abs/2402.11777)
Keywords: language model, prompt
Abstract: Do language models implicitly learn a concept of human wellbeing? We explore this through the ETHICS Utilitarianism task, assessing if scaling enhances pretrained models' representations. Our initial finding reveals that, without any prompt engineering or finetuning, the leading principal component from OpenAI's text-embedding-ada-002 achieves 73.9% accuracy. This closely matches the 74.6% of BERT-large finetuned on the entire ETHICS dataset, suggesting pretraining conveys some understanding about human wellbeing. Next, we consider four language model families, observing how Utilitarianism accuracy varies with increased parameters. We find performance is nondecreasing with increased model size when using sufficient numbers of principal components.
摘要：语言模型是否隐含地学习了人类福祉的概念？我们通过伦理功利主义任务探索这一点，评估缩放是否增强了预训练模型的表示。我们的初步发现表明，在没有任何及时的工程或微调的情况下，OpenAI 的 text-embedding-ada-002 的主要主要组件达到了 73.9% 的准确率。这与整个 ETHICS 数据集上 74.6% 的 BERT-large 微调结果非常吻合，表明预训练传达了对人类福祉的一些理解。接下来，我们考虑四种语言模型系列，观察功利主义准确性如何随着参数的增加而变化。我们发现，当使用足够数量的主成分时，性能不会随着模型大小的增加而下降。

Title: What Evidence Do Language Models Find Convincing?

Authors: Alexander Wan, Eric Wallace, Dan Klein
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2402.11782
Pdf URL: https://arxiv.org/pdf/2402.11782
Copy Paste: [[2402.11782]] What Evidence Do Language Models Find Convincing?(https://arxiv.org/abs/2402.11782)
Keywords: language model, llm
Abstract: Retrieval-augmented language models are being increasingly tasked with subjective, contentious, and conflicting queries such as "is aspartame linked to cancer". To resolve these ambiguous queries, one must search through a large range of websites and consider "which, if any, of this evidence do I find convincing?". In this work, we study how LLMs answer this question. In particular, we construct ConflictingQA, a dataset that pairs controversial queries with a series of real-world evidence documents that contain different facts (e.g., quantitative results), argument styles (e.g., appeals to authority), and answers (Yes or No). We use this dataset to perform sensitivity and counterfactual analyses to explore which text features most affect LLM predictions. Overall, we find that current models rely heavily on the relevance of a website to the query, while largely ignoring stylistic features that humans find important such as whether a text contains scientific references or is written with a neutral tone. Taken together, these results highlight the importance of RAG corpus quality (e.g., the need to filter misinformation), and possibly even a shift in how LLMs are trained to better align with human judgements.
摘要：检索增强语言模型越来越多地承担主观、有争议和冲突的查询的任务，例如“阿斯巴甜与癌症有关吗”。为了解决这些模棱两可的问题，人们必须搜索大量网站并考虑“我认为哪些证据（如果有的话）令人信服？”。在这项工作中，我们研究了 LLM 如何回答这个问题。特别是，我们构建了 ConflictingQA，一个数据集，将有争议的查询与一系列包含不同事实（例如定量结果）、论证风格（例如诉诸权威）和答案（是或否）的现实世界证据文档配对。我们使用该数据集进行敏感性和反事实分析，以探索哪些文本特征对 LLM 预测影响最大。总的来说，我们发现当前的模型严重依赖网站与查询的相关性，而很大程度上忽略了人类认为重要的文体特征，例如文本是否包含科学参考文献或以中性语气编写。总而言之，这些结果凸显了 RAG 语料库质量的重要性（例如，过滤错误信息的需要），甚至可能改变 LLM 的训练方式，以更好地与人类判断保持一致。

Title: Unveiling the Magic: Investigating Attention Distillation in Retrieval-augmented Generation

Authors: Zizhong Li, Haopeng Zhang, Jiawei Zhang
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2402.11794
Pdf URL: https://arxiv.org/pdf/2402.11794
Copy Paste: [[2402.11794]] Unveiling the Magic: Investigating Attention Distillation in Retrieval-augmented Generation(https://arxiv.org/abs/2402.11794)
Keywords: language model, retrieval-augmented generation
Abstract: The retrieval-augmented generation framework can address the limitations of large language models by enabling real-time knowledge updates for more accurate answers. An efficient technique for training retrieval-augmented models is attention distillation, which uses attention scores as a supervision signal instead of manually annotated query-document pairs. Despite its growing popularity, the detailed mechanisms behind the success of attention distillation remain unexplored, particularly the specific patterns it leverages to benefit training. In this paper, we address this gap by conducting a comprehensive review of the attention distillation workflow and identifying key factors influencing the learning quality of retrieval-augmented language models. We further propose indicators for optimizing models' training methods and avoiding ineffective training.
摘要：检索增强生成框架可以通过实现实时知识更新以获得更准确的答案来解决大型语言模型的局限性。检索增强模型训练阶段的一种有效方法是注意力蒸馏，它使用注意力分数作为监督信号，而不是手动注释的查询文档对。尽管它越来越受欢迎，但注意力蒸馏成功背后的详细机制仍未被探索，特别是它用来促进训练的具体模式。在本文中，我们通过对注意力蒸馏工作流程进行全面回顾并确定影响检索增强语言模型学习质量的关键因素来解决这一差距。我们进一步提出了优化模型训练方法并避免无效训练的指标。

Title: Stochastic Approximation with Delayed Updates: Finite-Time Rates under Markovian Sampling

Authors: Arman Adibi, Nicolo Dal Fabbro, Luca Schenato, Sanjeev Kulkarni, H. Vincent Poor, George J. Pappas, Hamed Hassani, Aritra Mitra
Subjects: cs.LG, cs.AI, cs.MA, eess.SY, math.OC
Abstract URL: https://arxiv.org/abs/2402.11800
Pdf URL: https://arxiv.org/pdf/2402.11800
Copy Paste: [[2402.11800]] Stochastic Approximation with Delayed Updates: Finite-Time Rates under Markovian Sampling(https://arxiv.org/abs/2402.11800)
Keywords: agent
Abstract: Motivated by applications in large-scale and multi-agent reinforcement learning, we study the non-asymptotic performance of stochastic approximation (SA) schemes with delayed updates under Markovian sampling. While the effect of delays has been extensively studied for optimization, the manner in which they interact with the underlying Markov process to shape the finite-time performance of SA remains poorly understood. In this context, our first main contribution is to show that under time-varying bounded delays, the delayed SA update rule guarantees exponentially fast convergence of the last iterate to a ball around the SA operator's fixed point. Notably, our bound is tight in its dependence on both the maximum delay $\tau_{max}$, and the mixing time $\tau_{mix}$. To achieve this tight bound, we develop a novel inductive proof technique that, unlike various existing delayed-optimization analyses, relies on establishing uniform boundedness of the iterates. As such, our proof may be of independent interest. Next, to mitigate the impact of the maximum delay on the convergence rate, we provide the first finite-time analysis of a delay-adaptive SA scheme under Markovian sampling. In particular, we show that the exponent of convergence of this scheme gets scaled down by $\tau_{avg}$, as opposed to $\tau_{max}$ for the vanilla delayed SA rule; here, $\tau_{avg}$ denotes the average delay across all iterations. Moreover, the adaptive scheme requires no prior knowledge of the delay sequence for step-size tuning. Our theoretical findings shed light on the finite-time effects of delays for a broad class of algorithms, including TD learning, Q-learning, and stochastic gradient descent under Markovian sampling.
摘要：受大规模多智能体强化学习应用的推动，我们研究了马尔可夫采样下延迟更新的随机逼近（SA）方案的非渐近性能。虽然延迟的影响已被广泛研究以优化，但它们与底层马尔可夫过程相互作用以塑造 SA 的有限时间性能的方式仍然知之甚少。在这种情况下，我们的第一个主要贡献是表明，在时变有界延迟下，延迟的 SA 更新规则保证了 \emph{last iterate} 以指数方式快速收敛到围绕 SA 算子固定点的球。值得注意的是，我们的界限是 \emph{tight}，因为它依赖于最大延迟 $\tau_{max}$ 和混合时间 $\tau_{mix}$。为了实现这种紧界，我们开发了一种新颖的归纳证明技术，与各种现有的延迟优化分析不同，它依赖于建立迭代的统一有界性。因此，我们的证明可能具有独立利益。接下来，为了减轻最大延迟对收敛速度的影响，我们提供了马尔可夫采样下延迟自适应 SA 方案的第一个有限时间分析。特别是，我们表明该方案的收敛指数按比例缩小了 $\tau_{avg}$，而不是普通延迟 SA 规则的 $\tau_{max}$；这里，$\tau_{avg}$表示所有迭代的平均延迟。此外，自适应方案不需要步长调整的延迟序列的先验知识。我们的理论发现揭示了延迟对一类广泛算法的有限时间效应，包括 TD 学习、Q 学习和马尔可夫采样下的随机梯度下降。
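
Illustration (not from the paper): the abstract above describes a delayed SA update whose last iterate converges quickly to a ball around the operator's fixed point. The toy below is only a sketch under stated assumptions: a scalar operator g(x, o) = x + noise(o) with fixed point 0, noise driven by a hypothetical two-state Markov chain, and delays drawn uniformly from 0..tau_max. The function name, step size, and all constants are made up for illustration and are not the authors' setup.

```python
import random

def delayed_sa(steps=2000, alpha=0.05, tau_max=5, seed=0):
    """Run x_{t+1} = x_t - alpha * (x_{t - tau_t} + noise_t) with bounded random delays."""
    rng = random.Random(seed)
    history = [10.0]              # iterates x_0, x_1, ... (x_0 is far from the fixed point 0)
    state = 0                     # state of a two-state Markov chain driving the noise
    noise_by_state = (-0.1, 0.1)  # assumed noise values; zero mean under the stationary law
    for t in range(steps):
        # Markovian sampling: switch state with probability 0.1, otherwise stay.
        if rng.random() > 0.9:
            state = 1 - state
        tau = rng.randint(0, tau_max)     # time-varying delay with 0 <= tau <= tau_max
        stale = history[max(0, t - tau)]  # stale iterate the update is computed from
        history.append(history[-1] - alpha * (stale + noise_by_state[state]))
    return history

traj = delayed_sa()
print("start:", abs(traj[0]), "end:", abs(traj[-1]))
```

With these assumed constants the product of step size and maximum delay is small, so the iterate contracts toward 0 and then fluctuates in a small neighborhood set by the noise; larger step sizes or delays can destabilize the toy.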

Title: LLM as Prompter: Low-resource Inductive Reasoning on Arbitrary Knowledge Graphs

Authors: Kai Wang, Yuwei Xu, Zhiyong Wu, Siqiang Luo
Subjects: cs.AI, cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2402.11804
Pdf URL: https://arxiv.org/pdf/2402.11804
Copy Paste: [[2402.11804]] LLM as Prompter: Low-resource Inductive Reasoning on Arbitrary Knowledge Graphs(https://arxiv.org/abs/2402.11804)
Keywords: language model, llm, prompt
Abstract: Knowledge Graph (KG) inductive reasoning, which aims to infer missing facts from new KGs that are not seen during training, has been widely adopted in various applications. One critical challenge of KG inductive reasoning is handling low-resource scenarios with scarcity in both textual and structural aspects. In this paper, we attempt to address this challenge with Large Language Models (LLMs). Particularly, we utilize the state-of-the-art LLMs to generate a graph-structural prompt to enhance the pre-trained Graph Neural Networks (GNNs), which brings us new methodological insights into the KG inductive reasoning methods, as well as high generalizability in practice. On the methodological side, we introduce a novel pretraining and prompting framework ProLINK, designed for low-resource inductive reasoning across arbitrary KGs without requiring additional training. On the practical side, we experimentally evaluate our approach on 36 low-resource KG datasets and find that ProLINK outperforms previous methods in three-shot, one-shot, and zero-shot reasoning tasks, exhibiting average performance improvements by 20%, 45%, and 147%, respectively. Furthermore, ProLINK demonstrates strong robustness for various LLM promptings as well as full-shot scenarios.
摘要：知识图谱（KG）归纳推理旨在从训练过程中未见的新知识图谱中推断出缺失的事实，已在各种应用中得到广泛采用。 KG 归纳推理的一项关键挑战是处理文本和结构方面都稀缺的低资源场景。在本文中，我们尝试通过大型语言模型（LLM）来应对这一挑战。特别是，我们利用最先进的 LLM 生成图结构提示来增强预训练的图神经网络（GNN），这为我们带来了对 KG 归纳推理方法的新方法论见解，以及高实践中的普遍性。在方法方面，我们引入了一种新颖的预训练和提示框架 ProLINK，专为跨任意 KG 的低资源归纳推理而设计，无需额外的训练。在实践方面，我们在 36 个低资源 KG 数据集上对我们的方法进行了实验评估，发现 ProLINK 在三样本、单样本和零样本推理任务中优于以前的方法，平均性能提高了 20%、45% 、和 147%。此外，ProLINK 对各种 LLM 提示以及全面场景表现出强大的鲁棒性。

Title: Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding

Authors: Hanling Yi, Feng Lin, Hongbin Li, Peiyang Ning, Xiaotian Yu, Rong Xiao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.11809
Pdf URL: https://arxiv.org/pdf/2402.11809
Copy Paste: [[2402.11809]] Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding(https://arxiv.org/abs/2402.11809)
Keywords: language model, llm
Abstract: This research aims to accelerate the inference speed of large language models (LLMs) with billions of parameters. We propose Smart Parallel Auto-Correct dEcoding (SPACE), an innovative approach designed for achieving lossless acceleration of LLMs. By integrating semi-autoregressive inference and speculative decoding capabilities, SPACE uniquely enables autoregressive LLMs to parallelize token generation and verification. This is realized through a specialized semi-autoregressive supervised fine-tuning process that equips existing LLMs with the ability to simultaneously predict multiple tokens. Additionally, an auto-correct decoding algorithm facilitates the simultaneous generation and verification of token sequences within a single model invocation. Through extensive experiments on a range of LLMs, SPACE has demonstrated inference speedup ranging from 2.7x to 4.0x on HumanEval-X while maintaining output quality.
摘要：这项研究旨在加快具有数十亿参数的大型语言模型（LLM）的推理速度。我们提出 Smart Parallel Auto-Correct dEcoding (SPACE)，这是一种旨在实现 LLM 无损加速的创新方法。通过集成半自回归推理和推测解码功能，SPACE 独特地使自回归 LLM 能够并行化令牌生成和验证。这是通过专门的半自回归监督微调过程实现的，该过程使现有的 LLM 能够同时预测多个标记。此外，自动更正解码算法有助于在单个模型调用中同时生成和验证令牌序列。通过对一系列 LLM 的广泛实验，SPACE 证明 HumanEval-X 的推理速度提高了 2.7 倍至 4.0 倍，同时保持了输出质量。

Title: FIPO: Free-form Instruction-oriented Prompt Optimization with Preference Dataset and Modular Fine-tuning Schema

Authors: Junru Lu, Siyu An, Min Zhang, Yulan He, Di Yin, Xing Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11811
Pdf URL: https://arxiv.org/pdf/2402.11811
Copy Paste: [[2402.11811]] FIPO: Free-form Instruction-oriented Prompt Optimization with Preference Dataset and Modular Fine-tuning Schema(https://arxiv.org/abs/2402.11811)
Keywords: language model, llm, prompt
Abstract: In the quest to make the deep intelligence of Large Language Models (LLMs) accessible in final-end user-bot interactions, the art of prompt crafting emerges as a critical yet complex task for the average user. In contrast to previous model-oriented yet instruction-agnostic Automatic Prompt Optimization methodologies, which yield polished results for predefined target models while suffering rapid degradation with out-of-box models, we present Free-form Instruction-oriented Prompt Optimization (FIPO). This approach is supported by our large-scale prompt preference dataset and employs a modular fine-tuning schema. The FIPO schema reimagines the optimization process into manageable modules, anchored by a meta prompt that dynamically adapts content. This allows for the flexible integration of the raw task instruction, the optional instruction response, and the optional ground truth to produce finely optimized task prompts. The FIPO preference dataset is meticulously constructed using the optimal and suboptimal LLMs, undergoing rigorous cross-verification by human experts and analytical models. Applying the insights from the data with Tulu2 models and fine-tuning strategies, we validate the efficacy of FIPO schema across five public benchmarks. Code, data, and scripts are here: https://github.com/LuJunru/FIPO_Project.
摘要：为了促进最终用户与机器人交互中可访问的大型语言模型 (LLM) 的深度智能，提示制作艺术对于普通用户来说成为一项关键而复杂的任务。与以前的面向模型但与指令无关的自动提示优化方法相比，我们提出了自由格式的面向指令的提示优化（FIPO），为预定义的目标模型提供了完美的结果，同时遭受开箱即用模型的快速退化。这种方法得到了我们的大规模提示偏好数据集的支持，并采用了模块化的微调模式。 FIPO 模式将优化过程重新构想为可管理的模块，并以动态调整内容的元提示为基础。这允许原始任务指令、可选指令响应和可选基本事实的灵活集成，以产生精细优化的任务提示。 FIPO 偏好数据集是使用最佳和次优 LLM 精心构建的，并经过人类专家和分析模型的严格交叉验证。通过 Tulu2 模型和微调策略应用数据洞察，我们在五个公共基准上验证了 FIPO 模式的有效性。代码、数据和脚本位于：https://github.com/LuJunru/FIPO_Project。

Title: HU at SemEval-2024 Task 8A: Can Contrastive Learning Learn Embeddings to Detect Machine-Generated Text?

Authors: Shubhashis Roy Dipta, Sadat Shahriar
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.11815
Pdf URL: https://arxiv.org/pdf/2402.11815
Copy Paste: [[2402.11815]] HU at SemEval-2024 Task 8A: Can Contrastive Learning Learn Embeddings to Detect Machine-Generated Text?(https://arxiv.org/abs/2402.11815)
Keywords: language model, llm
Abstract: This paper describes our system developed for SemEval-2024 Task 8, "Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection." Machine-generated texts have been one of the main concerns due to the use of large language models (LLM) in fake text generation, phishing, cheating in exams, or even plagiarizing copyright materials. A lot of systems have been developed to detect machine-generated text. Nonetheless, the majority of these systems rely on the text-generating model, a limitation that is impractical in real-world scenarios, as it's often impossible to know which specific model the user has used for text generation. In this work, we propose a single model based on contrastive learning, which uses ~40% of the baseline's parameters (149M vs. 355M) but shows a comparable performance on the test dataset (21st out of 137 participants). Our key finding is that even without an ensemble of multiple models, a single base model can have comparable performance with the help of data augmentation and contrastive learning.
摘要：本文介绍了我们为 SemEval-2024 任务 8“多生成器、多域和多语言黑盒机器生成的文本检测”开发的系统。由于在虚假文本生成、网络钓鱼、考试作弊甚至抄袭版权材料中使用大型语言模型 (LLM)，机器生成的文本一直是主要问题之一。已经开发了许多系统来检测机器生成的文本。尽管如此，这些系统中的大多数都依赖于文本生成模型，这种限制在现实场景中是不切实际的，因为通常不可能知道用户使用哪个特定模型来生成文本。在这项工作中，我们提出了一个基于对比学习的单一模型，该模型使用约 40% 的基线参数（149M 与 355M），但在测试数据集上显示出可比较的性能（137 名参与者中的第 21 名）。我们的主要发现是，即使没有多个模型的集成，在数据增强和对比学习的帮助下，单个基本模型也可以具有相当的性能。

Title: Where It Really Matters: Few-Shot Environmental Conservation Media Monitoring for Low-Resource Languages

Authors: Sameer Jain, Sedrick Scott Keh, Shova Chettri, Karun Dewan, Pablo Izquierdo, Johanna Prussman, Pooja Shreshtha, Cesar Suarez, Zheyuan Ryan Shi, Lei Li, Fei Fang
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2402.11818
Pdf URL: https://arxiv.org/pdf/2402.11818
Copy Paste: [[2402.11818]] Where It Really Matters: Few-Shot Environmental Conservation Media Monitoring for Low-Resource Languages(https://arxiv.org/abs/2402.11818)
Keywords: language model, llm
Abstract: Environmental conservation organizations routinely monitor news content on conservation in protected areas to maintain situational awareness of developments that can have an environmental impact. Existing automated media monitoring systems require large amounts of data labeled by domain experts, which is only feasible at scale for high-resource languages like English. However, such tools are most needed in the global south where news of interest is mainly in local low-resource languages, and far fewer experts are available to annotate datasets sustainably. In this paper, we propose NewsSerow, a method to automatically recognize environmental conservation content in low-resource languages. NewsSerow is a pipeline of summarization, in-context few-shot classification, and self-reflection using large language models (LLMs). Using at most 10 demonstration example news articles in Nepali, NewsSerow significantly outperforms other few-shot methods and achieves comparable performance with models fully fine-tuned using thousands of examples. The World Wide Fund for Nature (WWF) has deployed NewsSerow for media monitoring in Nepal, significantly reducing their operational burden, and ensuring that AI tools for conservation actually reach the communities that need them the most. NewsSerow has also been deployed for countries with other languages like Colombia.
摘要：环境保护组织定期监测有关保护区保护的新闻内容，以保持对可能对环境产生影响的发展的态势感知。现有的自动化媒体监控系统需要由领域专家标记的大量数据，这仅适用于英语等高资源语言。然而，南半球国家最需要此类工具，那里感兴趣的新闻主要是当地资源匮乏的语言，而能够对数据集进行可持续注释的专家却少得多。在本文中，我们提出了 NewsSerow，一种自动识别低资源语言环境保护内容的方法。 NewsSerow 是一个使用大型语言模型 (LLM) 进行摘要、上下文中的小样本分类和自我反思的管道。 NewsSerow 使用最多 10 篇尼泊尔语演示新闻文章示例，显着优于其他小样本方法，并达到了与使用数千个示例完全微调的模型相当的性能。世界自然基金会 (WWF) 在尼泊尔部署了 NewsSerow 进行媒体监控，显着减轻了运营负担，并确保人工智能保护工具真正到达最需要它们的社区。 NewsSerow 还被部署到使用其他语言的国家，例如哥伦比亚。

Title: Head-wise Shareable Attention for Large Language Models

Authors: Zouying Cao, Yifei Yang, Hai Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11819
Pdf URL: https://arxiv.org/pdf/2402.11819
Copy Paste: [[2402.11819]] Head-wise Shareable Attention for Large Language Models(https://arxiv.org/abs/2402.11819)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) suffer from a huge number of parameters, which restricts their deployment on edge devices. Weight sharing is one promising solution that encourages weight reuse, effectively reducing memory usage with less performance drop. However, current weight sharing techniques primarily focus on small-scale models like BERT and employ coarse-grained sharing rules, e.g., layer-wise. This becomes limiting given the prevalence of LLMs, and sharing an entire layer or block obviously diminishes the flexibility of weight sharing. In this paper, we present a perspective on head-wise shareable attention for large language models. We further propose two memory-efficient methods that share parameters across attention heads, with a specific focus on LLMs. Both of them use the same dynamic strategy to select the shared weight matrices. The first method directly reuses the pre-trained weights without retraining, denoted as DirectShare. The second method first post-trains with constraint on weight matrix similarity and then shares, denoted as PostShare. Experimental results reveal our head-wise shared models still maintain satisfactory capabilities, demonstrating the feasibility of fine-grained weight sharing applied to LLMs.
摘要：大型语言模型 (LLM) 存在大量参数，这限制了它们在边缘设备上的部署。权重共享是一种很有前途的解决方案，它鼓励权重重用，有效减少内存使用，同时减少性能下降。然而，当前的权重共享技术主要集中在像 BERT 这样的小规模模型上，并采用粗粒度的共享规则，例如逐层。鉴于 LLM 的盛行，并且共享整个层或块明显降低了权重共享的灵活性，这变得很有限。在本文中，我们提出了大型语言模型的头向共享注意力的观点。我们进一步提出了两种内存高效的方法，它们在注意力头之间共享参数，特别关注 LLM。它们都使用相同的动态策略来选择共享权重矩阵。第一种方法直接重用预训练的权重而不需要重新训练，记为 DirectShare。第二种方法首先在权重矩阵相似度约束下进行后训练，然后进行共享，表示为 PostShare。实验结果表明，我们的 head-wise 共享模型仍然保持了令人满意的能力，证明了细粒度权重共享应用于 LLM 的可行性。

Title: Microstructures and Accuracy of Graph Recall by Large Language Models

Authors: Yanbang Wang, Hejie Cui, Jon Kleinberg
Subjects: cs.LG, cs.CL, cs.IR, cs.SI
Abstract URL: https://arxiv.org/abs/2402.11821
Pdf URL: https://arxiv.org/pdf/2402.11821
Copy Paste: [[2402.11821]] Microstructures and Accuracy of Graph Recall by Large Language Models(https://arxiv.org/abs/2402.11821)
Keywords: language model, llm
Abstract: Graph data is crucial for many applications, and much of it exists in the relations described in textual format. As a result, being able to accurately recall and encode a graph described in earlier text is a basic yet pivotal ability that LLMs need to demonstrate if they are to perform reasoning tasks that involve graph-structured information. Human performance at graph recall has been studied by cognitive scientists for decades, and has been found to often exhibit certain structural patterns of bias that align with human handling of social relationships. To date, however, we know little about how LLMs behave in analogous graph recall tasks: do their recalled graphs also exhibit certain biased patterns, and if so, how do they compare with humans and affect other graph reasoning tasks? In this work, we perform the first systematic study of graph recall by LLMs, investigating the accuracy and biased microstructures (local structural patterns) in their recall. We find that LLMs not only underperform often in graph recall, but also tend to favor more triangles and alternating 2-paths. Moreover, we find that more advanced LLMs have a striking dependence on the domain that a real-world graph comes from -- by yielding the best recall accuracy when the graph is narrated in a language style consistent with its original domain.
摘要：图数据对于许多应用程序至关重要，其中大部分存在于以文本格式描述的关系中。因此，如果 LLM 要执行涉及图结构信息的推理任务，那么能够准确回忆和编码前面文本中描述的图是一项基本但关键的能力。认知科学家对人类在图回忆方面的表现进行了数十年的研究，发现其经常表现出某些与人类处理社会关系相一致的结构性偏见模式。然而，迄今为止，我们对 LLM 在类似图回忆任务中的表现知之甚少：它们回忆的图是否也表现出某些偏见模式，如果是的话，它们与人类相比如何，又如何影响其他图推理任务？在这项工作中，我们首次对 LLM 的图回忆进行了系统研究，调查了回忆中的准确性和有偏差的微观结构（局部结构模式）。我们发现 LLM 不仅在图回忆方面经常表现不佳，而且倾向于更多的三角形和交替 2 路径。此外，我们发现更高级的 LLM 对现实世界图的来源领域有显着的依赖性——当图以与其原始领域一致的语言风格叙述时，会产生最佳的回忆准确性。

Title: Easy as ABCs: Unifying Boltzmann Q-Learning and Counterfactual Regret Minimization

Authors: Luca D'Amico-Wong, Hugh Zhang, Marc Lanctot, David C. Parkes
Subjects: cs.LG, cs.GT, cs.MA
Abstract URL: https://arxiv.org/abs/2402.11835
Pdf URL: https://arxiv.org/pdf/2402.11835
Copy Paste: [[2402.11835]] Easy as ABCs: Unifying Boltzmann Q-Learning and Counterfactual Regret Minimization(https://arxiv.org/abs/2402.11835)
Keywords: agent
Abstract: We propose ABCs (Adaptive Branching through Child stationarity), a best-of-both-worlds algorithm combining Boltzmann Q-learning (BQL), a classic reinforcement learning algorithm for single-agent domains, and counterfactual regret minimization (CFR), a central algorithm for learning in multi-agent domains. ABCs adaptively chooses what fraction of the environment to explore each iteration by measuring the stationarity of the environment's reward and transition dynamics. In Markov decision processes, ABCs converges to the optimal policy with at most an O(A) factor slowdown compared to BQL, where A is the number of actions in the environment. In two-player zero-sum games, ABCs is guaranteed to converge to a Nash equilibrium (assuming access to a perfect oracle for detecting stationarity), while BQL has no such guarantees. Empirically, ABCs demonstrates strong performance when benchmarked across environments drawn from the OpenSpiel game library and OpenAI Gym and exceeds all prior methods in environments which are neither fully stationary nor fully nonstationary.
摘要：我们提出了 ABC（通过子平稳性进行自适应分支），这是一种两全其美的算法，结合了玻尔兹曼 Q 学习（BQL）（一种用于单智能体领域的经典强化学习算法）和反事实遗憾最小化（CFR）（一种中心算法）。多智能体领域学习算法。 ABC 通过测量环境奖励和过渡动态的平稳性，自适应地选择环境的哪一部分来探索每次迭代。在马尔可夫决策过程中，与 BQL 相比，ABC 收敛到最优策略最多具有 O(A) 因子减速，其中 A 是环境中的操作数量。在两人零和博弈中，ABC 保证收敛到纳什均衡（假设可以访问完美的预言机来检测平稳性），而 BQL 则没有这样的保证。根据经验，ABCs 在对来自 OpenSpiel 游戏库和 OpenAI Gym 的环境进行基准测试时表现出强大的性能，并且超过了既不是完全静止也不是完全非静止环境中的所有现有方法。

Title: UniST: A Prompt-Empowered Universal Model for Urban Spatio-Temporal Prediction

Authors: Yuan Yuan, Jingtao Ding, Jie Feng, Depeng Jin, Yong Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2402.11838
Pdf URL: https://arxiv.org/pdf/2402.11838
Copy Paste: [[2402.11838]] UniST: A Prompt-Empowered Universal Model for Urban Spatio-Temporal Prediction(https://arxiv.org/abs/2402.11838)
Keywords: language model, prompt
Abstract: Urban spatio-temporal prediction is crucial for informed decision-making, such as transportation management, resource optimization, and urban planning. Although pretrained foundation models for natural languages have experienced remarkable breakthroughs, wherein one general-purpose model can tackle multiple tasks across various domains, urban spatio-temporal modeling lags behind. Existing approaches for urban prediction are usually tailored for specific spatio-temporal scenarios, requiring task-specific model designs and extensive in-domain training data. In this work, we propose a universal model, UniST, for urban spatio-temporal prediction. Drawing inspiration from large language models, UniST achieves success through: (i) flexibility towards diverse spatio-temporal data characteristics, (ii) effective generative pre-training with elaborated masking strategies to capture complex spatio-temporal relationships, (iii) spatio-temporal knowledge-guided prompts that align and leverage intrinsic and shared knowledge across scenarios. These designs together unlock the potential of a one-for-all model for spatio-temporal prediction with powerful generalization capability. Extensive experiments on 15 cities and 6 domains demonstrate the universality of UniST in advancing state-of-the-art prediction performance, especially in few-shot and zero-shot scenarios.
摘要：城市时空预测对于交通管理、资源优化和城市规划等明智决策至关重要。尽管自然语言的预训练基础模型已经取得了显着的突破，其中一种通用模型可以处理跨不同领域的多项任务，但城市时空建模仍然落后。现有的城市预测方法通常是针对特定的时空场景量身定制的，需要特定于任务的模型设计和广泛的域内训练数据。在这项工作中，我们提出了一种用于城市时空预测的通用模型 UniST。从大型语言模型中汲取灵感，UniST 通过以下方式取得成功：(i) 针对不同时空数据特征的灵活性，(ii) 通过精心设计的掩蔽策略进行有效的生成预训练，以捕获复杂的时空关系，(iii) 时空关系知识引导的提示，可以跨场景调整和利用内在的和共享的知识。这些设计共同释放了具有强大泛化能力的时空预测的全能模型的潜力。在 15 个城市和 6 个领域进行的广泛实验证明了 UniST 在提升最先进的预测性能方面的普遍性，特别是在少样本和零样本场景中。

Title: Modularized Networks for Few-shot Hateful Meme Detection

Authors: Rui Cao, Roy Ka-Wei Lee, Jing Jiang
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2402.11845
Pdf URL: https://arxiv.org/pdf/2402.11845
Copy Paste: [[2402.11845]] Modularized Networks for Few-shot Hateful Meme Detection(https://arxiv.org/abs/2402.11845)
Keywords: language model, llm
Abstract: In this paper, we address the challenge of detecting hateful memes in the low-resource setting where only a few labeled examples are available. Our approach leverages the compositionality of Low-rank adaptation (LoRA), a widely used parameter-efficient tuning technique. We commence by fine-tuning large language models (LLMs) with LoRA on selected tasks pertinent to hateful meme detection, thereby generating a suite of LoRA modules. These modules are capable of essential reasoning skills for hateful meme detection. We then use the few available annotated samples to train a module composer, which assigns weights to the LoRA modules based on their relevance. The model's learnable parameters are directly proportional to the number of LoRA modules. This modularized network, underpinned by LLMs and augmented with LoRA modules, exhibits enhanced generalization in the context of hateful meme detection. Our evaluation spans three datasets designed for hateful meme detection in a few-shot learning context. The proposed method demonstrates superior performance to traditional in-context learning, which is also more computationally intensive during inference.
摘要：在本文中，我们解决了在资源匮乏的环境中检测仇恨模因的挑战，其中只有少数标记的示例可用。我们的方法利用了低秩自适应（LoRA）的组合性，这是一种广泛使用的参数高效调整技术。我们首先使用 LoRA 对与仇恨模因检测相关的选定任务进行微调，从而生成一套 LoRA 模块。这些模块具有用于检测仇恨模因的基本推理技能。然后，我们使用少数可用的带注释样本来训练模块组合器，该模块组合器根据 LoRA 模块的相关性为其分配权重。该模型的可学习参数与 LoRA 模块的数量成正比。这种模块化网络以 LLM 为基础，并通过 LoRA 模块进行增强，在仇恨模因检测的背景下表现出增强的泛化能力。我们的评估涵盖了三个数据集，这些数据集专为在少样本学习环境中检测仇恨模因而设计。所提出的方法表现出优于传统上下文学习的性能，而传统的上下文学习在推理过程中的计算量也更大。

Title: How Interpretable are Reasoning Explanations from Prompting Large Language Models?

Authors: Yeo Wei Jie, Ranjan Satapathy, Goh Siow Mong, Rick, Erik Cambria
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11863
Pdf URL: https://arxiv.org/pdf/2402.11863
Copy Paste: [[2402.11863]] How Interpretable are Reasoning Explanations from Prompting Large Language Models?(https://arxiv.org/abs/2402.11863)
Keywords: language model, prompt, chain-of-thought
Abstract: Prompt Engineering has garnered significant attention for enhancing the performance of large language models across a multitude of tasks. Techniques such as the Chain-of-Thought not only bolster task performance but also delineate a clear trajectory of reasoning steps, offering a tangible form of explanation for the audience. Prior works on interpretability assess the reasoning chains yielded by Chain-of-Thought solely along a singular axis, namely faithfulness. We present a comprehensive and multifaceted evaluation of interpretability, examining not only faithfulness but also robustness and utility across multiple commonsense reasoning benchmarks. Likewise, our investigation is not confined to a single prompting technique; it expansively covers a multitude of prevalent prompting techniques employed in large language models, thereby ensuring a wide-ranging and exhaustive evaluation. In addition, we introduce a simple interpretability alignment technique, termed Self-Entailment-Alignment Chain-of-thought, that yields more than 70% improvements across multiple dimensions of interpretability. Code is available at https://github.com/wj210/CoT_interpretability
摘要：Prompt Engineering 在增强大型语言模型在多种任务中的性能方面引起了极大的关注。思想链等技术不仅可以提高任务绩效，还可以勾画出清晰的推理步骤轨迹，为观众提供切实可行的解释形式。先前关于可解释性的研究评估了思想链仅沿着单一轴（即忠实度）产生的推理链。我们对可解释性进行了全面、多方面的评估，不仅检查了忠实性，还检查了多个常识推理基准的稳健性和实用性。同样，我们的调查并不局限于单一的提示技巧；它广泛涵盖了大型语言模型中使用的多种流行的提示技术，从而确保了广泛而详尽的评估。此外，我们引入了一种简单的可解释性对齐技术，称为“自我蕴含对齐思想链”，该技术在可解释性的多个维度上产生了超过 70% 的改进。代码可在 https://github.com/wj210/CoT_interpretability 获取

Title: LoRA Training in the NTK Regime has No Spurious Local Minima

Authors: Uijeong Jang, Jason D. Lee, Ernest K. Ryu
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2402.11867
Pdf URL: https://arxiv.org/pdf/2402.11867
Copy Paste: [[2402.11867]] LoRA Training in the NTK Regime has No Spurious Local Minima(https://arxiv.org/abs/2402.11867)
Keywords: language model, llm
Abstract: Low-rank adaptation (LoRA) has become the standard approach for parameter-efficient fine-tuning of large language models (LLM), but our theoretical understanding of LoRA has been limited. In this work, we theoretically analyze LoRA fine-tuning in the neural tangent kernel (NTK) regime with $N$ data points, showing: (i) full fine-tuning (without LoRA) admits a low-rank solution of rank $r\lesssim \sqrt{N}$; (ii) using LoRA with rank $r\gtrsim \sqrt{N}$ eliminates spurious local minima, allowing gradient descent to find the low-rank solutions; (iii) the low-rank solution found using LoRA generalizes well.
摘要：低秩适应 (LoRA) 已成为大型语言模型 (LLM) 参数高效微调的标准方法，但我们对 LoRA 的理论理解仍然有限。在这项工作中，我们从理论上分析了具有 $N$ 个数据点的神经正切核 (NTK) 体系中的 LoRA 微调，结果表明：(i) 完全微调（不使用 LoRA）存在秩为 $r\lesssim \sqrt{N}$ 的低秩解；(ii) 使用秩 $r\gtrsim \sqrt{N}$ 的 LoRA 可消除虚假局部极小值，使梯度下降能够找到低秩解；(iii) 使用 LoRA 找到的低秩解具有良好的泛化性。
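
Note: to make the rank condition above concrete, the following is a minimal Python sketch of a LoRA-style forward pass and of the $\sqrt{N}$ rank scale. It is not the authors' code. The function names are hypothetical, and the constant hidden in $r\gtrsim \sqrt{N}$ is assumed to be 1 purely for illustration.

```python
import math


def lora_trainable_params(d_out, d_in, rank):
    """Trainable parameters of a LoRA update B @ A, with B: d_out x rank, A: rank x d_in."""
    return rank * (d_in + d_out)


def rank_scale_for_dataset(num_points):
    """Rank on the order of sqrt(N); the hidden constant is assumed to be 1 here."""
    return math.ceil(math.sqrt(num_points))


def lora_forward(W, A, B, x):
    """Compute y = W x + B (A x) with plain nested lists (W is frozen, A and B are trained)."""
    Ax = [sum(a * xi for a, xi in zip(row, x)) for row in A]
    BAx = [sum(b * v for b, v in zip(row, Ax)) for row in B]
    Wx = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    return [w + d for w, d in zip(Wx, BAx)]
```

With B initialized to zero the adapted layer reproduces the frozen layer exactly, which is the usual LoRA starting point. The low-rank factors store `rank * (d_in + d_out)` values instead of `d_in * d_out`.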

Title: M2K-VDG: Model-Adaptive Multimodal Knowledge Anchor Enhanced Video-grounded Dialogue Generation

Authors: Hongcheng Liu, Pingjie Wang, Yu Wang, Yanfeng Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11875
Pdf URL: https://arxiv.org/pdf/2402.11875
Copy Paste: [[2402.11875]] M2K-VDG: Model-Adaptive Multimodal Knowledge Anchor Enhanced Video-grounded Dialogue Generation(https://arxiv.org/abs/2402.11875)
Keywords: hallucination
Abstract: Video-grounded dialogue generation (VDG) requires the system to generate a fluent and accurate answer based on multimodal knowledge. However, the difficulty in multimodal knowledge utilization brings serious hallucinations to VDG models in practice. Although previous works mitigate the hallucination in a variety of ways, they hardly take notice of the importance of the multimodal knowledge anchor answer tokens. In this paper, we reveal via perplexity that different VDG models experience varying hallucinations and exhibit diverse anchor tokens. Based on this observation, we propose M2K-VDG, a model-adaptive multimodal knowledge anchor enhancement framework for hallucination reduction. Furthermore, we introduce the counterfactual effect for more accurate anchor token detection. The experimental results on three popular benchmarks exhibit the superiority of our approach over state-of-the-art methods, demonstrating its effectiveness in reducing hallucinations.
摘要：基于视频的对话生成（VDG）要求系统基于多模态知识生成流畅且准确的答案。然而，多模态知识利用的困难给VDG模型在实践中带来了严重的幻觉。尽管以前的工作以多种方式减轻了幻觉，但他们几乎没有注意到多模态知识锚答案标记的重要性。在本文中，我们通过困惑揭示了不同的 VDG 模型会经历不同的幻觉并表现出不同的锚标记。基于这一观察，我们提出了 M2K-VDG，一种用于减少幻觉的模型自适应多模态知识锚增强框架。此外，我们引入了反事实效应，以实现更准确的锚标记检测。三个流行基准的实验结果展示了我们的方法相对于最先进方法的优越性，证明了其在减少幻觉方面的有效性。

Title: The Colorful Future of LLMs: Evaluating and Improving LLMs as Emotional Supporters for Queer Youth

Authors: Shir Lissak, Nitay Calderon, Geva Shenkman, Yaakov Ophir, Eyal Fruchter, Anat Brunstein Klomek, Roi Reichart
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.11886
Pdf URL: https://arxiv.org/pdf/2402.11886
Copy Paste: [[2402.11886]] The Colorful Future of LLMs: Evaluating and Improving LLMs as Emotional Supporters for Queer Youth(https://arxiv.org/abs/2402.11886)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Queer youth face increased mental health risks, such as depression, anxiety, and suicidal ideation. Hindered by negative stigma, they often avoid seeking help and rely on online resources, which may provide incompatible information. Although access to a supportive environment and reliable information is invaluable, many queer youth worldwide have no access to such support. However, this could soon change due to the rapid adoption of Large Language Models (LLMs) such as ChatGPT. This paper aims to comprehensively explore the potential of LLMs to revolutionize emotional support for queers. To this end, we conduct a qualitative and quantitative analysis of LLMs' interactions with queer-related content. To evaluate response quality, we develop a novel ten-question scale that is inspired by psychological standards and expert input. We apply this scale to score several LLMs and human comments on posts where queer youth seek advice and share experiences. We find that LLM responses are supportive and inclusive, outscoring humans. However, they tend to be generic, not empathetic enough, and lack personalization, resulting in unreliable and potentially harmful advice. We discuss these challenges, demonstrate that a dedicated prompt can improve the performance, and propose a blueprint of an LLM-supporter that actively (but sensitively) seeks user context to provide personalized, empathetic, and reliable responses. Our annotated dataset is available for further research.
摘要：酷儿青少年面临着更高的心理健康风险，例如抑郁、焦虑和自杀意念。由于受到负面污名的阻碍，他们常常避免寻求帮助并依赖可能提供不兼容信息的在线资源。尽管获得支持性环境和可靠信息非常宝贵，但世界各地的许多酷儿青年却无法获得此类支持。然而，由于 ChatGPT 等大型语言模型 (LLM) 的快速采用，这种情况可能很快就会改变。本文旨在全面探索 LLM 彻底改变酷儿情感支持的潜力。为此，我们对 LLM 与酷儿相关内容的互动进行了定性和定量分析。为了评估回答质量，我们受心理学标准和专家意见的启发，开发了一种新颖的十个问题量表。我们应用这个量表，对多个 LLM 的回复以及人类对酷儿青年寻求建议和分享经验的帖子的评论进行评分。我们发现 LLM 的回复是支持性和包容性的，得分超过了人类。然而，它们往往比较通用，缺乏同理心，并且缺乏个性化，导致建议不可靠且可能有害。我们讨论了这些挑战，证明专门的提示可以提高性能，并提出了一个 LLM 支持者的蓝图，该支持者积极（但敏感地）寻求用户上下文以提供个性化、同理心和可靠的响应。我们带注释的数据集可用于进一步研究。

Title: ROSE Doesn't Do That: Boosting the Safety of Instruction-Tuned Large Language Models with Reverse Prompt Contrastive Decoding

Authors: Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, Dacheng Tao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11889
Pdf URL: https://arxiv.org/pdf/2402.11889
Copy Paste: [[2402.11889]] ROSE Doesn't Do That: Boosting the Safety of Instruction-Tuned Large Language Models with Reverse Prompt Contrastive Decoding(https://arxiv.org/abs/2402.11889)
Keywords: language model, llm, prompt
Abstract: With the development of instruction-tuned large language models (LLMs), improving the safety of LLMs has become more critical. However, the current approaches for aligning LLM outputs with expected safety usually require substantial training efforts, e.g., high-quality safety data and expensive computational resources, which are costly and inefficient. To this end, we present reverse prompt contrastive decoding (ROSE), a simple-yet-effective method to directly boost the safety of existing instruction-tuned LLMs without any additional training. The principle of ROSE is to improve the probability of desired safe output by suppressing the undesired output induced by carefully designed reverse prompts. Experiments on 6 safety and 2 general-purpose tasks show that ROSE not only brings consistent and significant safety improvements (up to +13.8% safety score) on 5 types of instruction-tuned LLMs, but also benefits the general-purpose ability of LLMs. In-depth analyses explore the underlying mechanism of ROSE, and reveal when and where to use it.
摘要：随着指令调整大语言模型（LLM）的发展，提高 LLM 的安全性变得更加重要。然而，目前使 LLM 输出与预期安全性保持一致的方法通常需要大量的训练工作，例如高质量的安全数据和昂贵的计算资源，这些方法成本高昂且效率低下。为此，我们提出了反向提示对比解码（ROSE），这是一种简单而有效的方法，可以直接提高现有指令调整的 LLM 的安全性，而无需任何额外的训练。ROSE 的原理是通过抑制精心设计的反向提示引起的非期望输出来提高期望安全输出的概率。6 个安全性任务和 2 个通用任务的实验表明，ROSE 不仅对 5 种指令调整的 LLM 带来了一致且显着的安全性改进（高达 +13.8% 的安全分数），而且还有益于 LLM 的通用能力。深入分析探索了 ROSE 的底层机制，并揭示了何时何地使用它。
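
Note: the idea of suppressing the output induced by a reverse prompt can be illustrated with a generic contrastive-decoding step. The abstract does not give the exact ROSE scoring rule, so the subtraction below, the weight `alpha`, and all function names are assumptions for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_next_token_probs(logits_normal, logits_reverse, alpha=0.5):
    """Down-weight tokens that the reverse (unsafe) prompt makes more likely.

    logits_normal:  next-token logits under the regular system prompt
    logits_reverse: next-token logits under the reverse prompt
    alpha:          hypothetical suppression strength (0 disables the contrast)
    """
    adjusted = [n - alpha * r for n, r in zip(logits_normal, logits_reverse)]
    return softmax(adjusted)
```

For a two-token vocabulary where both tokens are equally likely under the normal prompt but the reverse prompt favors the second, the contrast shifts probability mass to the first token. No extra training is involved: only two forward passes per decoding step.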

Title: Revisiting Knowledge Distillation for Autoregressive Language Models

Authors: Qihuang Zhong, Liang Ding, Li Shen, Juhua Liu, Bo Du, Dacheng Tao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11890
Pdf URL: https://arxiv.org/pdf/2402.11890
Copy Paste: [[2402.11890]] Revisiting Knowledge Distillation for Autoregressive Language Models(https://arxiv.org/abs/2402.11890)
Keywords: language model
Abstract: Knowledge distillation (KD) is a common approach to compress a teacher model to reduce its inference cost and memory footprint, by training a smaller student model. However, in the context of autoregressive language models (LMs), we empirically find that larger teacher LMs might result in a dramatically poorer student. In response to this problem, we conduct a series of analyses and reveal that different tokens have different teaching modes, and that neglecting this leads to performance degradation. Motivated by this, we propose a simple yet effective adaptive teaching approach (ATKD) to improve KD. The core of ATKD is to reduce rote learning and make teaching more diverse and flexible. Extensive experiments on 8 LM tasks show that, with the help of ATKD, various baseline KD methods can achieve consistent and significant performance gains (up to +3.04% average score) across all model types and sizes. More encouragingly, ATKD can improve the student model generalization effectively.
摘要：知识蒸馏 (KD) 是一种常见方法，通过训练较小的学生模型来压缩教师模型，以减少其推理成本和内存占用。然而，在自回归语言模型 (LM) 的背景下，我们凭经验发现较大的教师 LM 可能会显着导致学生成绩较差。针对这个问题，我们进行了一系列的分析，发现不同的代币有不同的教学模式，忽视这会导致性能下降。受此启发，我们提出了一种简单而有效的适应性教学方法（ATKD）来提高 KD。 ATKD的核心是减少死记硬背的学习，让教学更加多样化和灵活。对 8 项 LM 任务的大量实验表明，在 ATKD 的帮助下，各种基线 KD 方法可以在所有模型类型和规模上实现一致且显着的性能提升（平均得分高达 +3.04%）。更令人鼓舞的是，ATKD 可以有效提高学生模型的泛化能力。
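
Note: for readers unfamiliar with the baseline that ATKD builds on, here is a minimal sketch of the classic temperature-scaled token-level KD loss (KL divergence from teacher to student). This is the standard formulation, not ATKD itself. The function names and the default temperature are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled, numerically stable softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions for one token position.

    The T^2 factor is the usual scaling that keeps gradient magnitudes
    comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature * temperature * kl
```

The loss is zero when the student matches the teacher and grows as their next-token distributions diverge.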

Title: Discerning and Resolving Knowledge Conflicts through Adaptive Decoding with Contextual Information-Entropy Constraint

Authors: Xiaowei Yuan, Zhao Yang, Yequan Wang, Shengping Liu, Jun Zhao, Kang Liu
Subjects: cs.AI
Abstract URL: https://arxiv.org/abs/2402.11893
Pdf URL: https://arxiv.org/pdf/2402.11893
Copy Paste: [[2402.11893]] Discerning and Resolving Knowledge Conflicts through Adaptive Decoding with Contextual Information-Entropy Constraint(https://arxiv.org/abs/2402.11893)
Keywords: language model
Abstract: Large language models internalize enormous parametric knowledge during pre-training. Concurrently, realistic applications necessitate external contextual knowledge to aid models on the underlying tasks. This raises a crucial dilemma known as knowledge conflicts, where the contextual knowledge clashes with the parametric knowledge. However, existing decoding works are specialized in resolving knowledge conflicts and could inadvertently deteriorate performance in the absence of conflicts. In this paper, we propose an adaptive decoding method, termed contextual information-entropy constraint decoding (COIECD), to discern whether knowledge conflicts occur and resolve them. It can improve the model's faithfulness to conflicting context, and simultaneously maintain high performance on non-conflicting context. Our experiments show that COIECD exhibits strong performance and robustness over knowledge conflicts in realistic datasets. Code is available.
摘要：大型语言模型在预训练期间内化了大量的参数知识。同时，现实的应用程序需要外部上下文知识来帮助模型完成底层任务。这提出了一个被称为知识冲突的关键困境，即上下文知识与参数知识相冲突。但是，现有的解码工作专门用于解决知识冲突，并且在没有冲突的情况下可能会无意中降低性能。在本文中，我们提出了一种自适应解码方法，称为上下文信息熵约束解码（COIECD），以识别知识冲突是否发生并解决它们。它可以提高模型对冲突上下文的忠实度，同时在非冲突环境中保持高性能。我们的实验表明，COIECD 在现实数据集中的知识冲突方面表现出强大的性能和鲁棒性。代码可用。

Title: Have Seen Me Before? Automating Dataset Updates Towards Reliable and Timely Evaluation

Authors: Jiahao Ying, Yixin Cao, Bo Wang, Wei Tang, Yizhe Yang, Shuicheng Yan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11894
Pdf URL: https://arxiv.org/pdf/2402.11894
Copy Paste: [[2402.11894]] Have Seen Me Before? Automating Dataset Updates Towards Reliable and Timely Evaluation(https://arxiv.org/abs/2402.11894)
Keywords: language model, llm
Abstract: Due to the expanding capabilities and pre-training data, Large Language Models (LLMs) are facing increasingly serious evaluation challenges. On one hand, the data leakage issue causes over-estimation on existing benchmarks. On the other hand, periodically curating datasets manually is costly. In this paper, we propose to automate dataset updates for reliable and timely evaluation. The basic idea is to generate unseen and high-quality testing samples based on existing ones to mitigate leakage issues. Specifically, we propose two strategies with systematic verification. First, the mimicking strategy employs LLMs to create new samples resembling existing ones, preserving the style of the original dataset to the maximum extent. Our experiments demonstrate its evaluation stability across multiple instantiations and its effectiveness in dealing with data leakage issues in most cases. Second, for the cases where mimicking the dataset works poorly, we design an extending strategy that adjusts the difficulty of the generated samples according to varying cognitive levels. This not only makes our evaluation more systematic, but also, with a balanced difficulty, discerns model capabilities better at fine-grained levels.
摘要：由于能力和预训练数据的不断扩展，大型语言模型（LLM）面临着日益严峻的评估挑战。一方面，数据泄露问题导致对现有基准的高估。另一方面，定期手动整理数据集的成本高昂。在本文中，我们建议自动化数据集更新，以进行可靠且及时的评估。基本思想是根据现有样本生成看不见的高质量测试样本，以减轻泄漏问题。具体来说，我们提出了两种策略并进行了系统验证。首先，模仿策略利用LLM创建与现有样本相似的新样本，最大限度地保留原始数据集的风格。我们的实验证明了它在多个实例中的评估稳定性以及在大多数情况下处理数据泄漏问题的有效性。其次，针对模仿数据集效果不佳的情况，我们设计了一种扩展策略，根据不同的认知水平调整生成样本的难度。这不仅使我们的评估更加系统化，而且在难度均衡的情况下，甚至可以在细粒度层面更好地辨别模型能力。

Title: SIBO: A Simple Booster for Parameter-Efficient Fine-Tuning

Authors: Zhihao Wen, Jie Zhang, Yuan Fang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11896
Pdf URL: https://arxiv.org/pdf/2402.11896
Copy Paste: [[2402.11896]] SIBO: A Simple Booster for Parameter-Efficient Fine-Tuning(https://arxiv.org/abs/2402.11896)
Keywords: language model, llm
Abstract: Fine-tuning all parameters of large language models (LLMs) necessitates substantial computational power and extended time. Latest advancements in parameter-efficient fine-tuning (PEFT) techniques, such as Adapter tuning and LoRA, allow for adjustments to only a minor fraction of the parameters of these LLMs. Concurrently, it has been noted that the issue of over-smoothing diminishes the effectiveness of these Transformer-based LLMs, resulting in suboptimal performances in downstream tasks. In this paper, we present SIBO, which is a SImple BOoster to enhance PEFT, by injecting an initial residual. SIBO is straight-forward and readily extensible to a range of state-of-the-art PEFT techniques to alleviate over-smoothing and enhance performance. Extensive experiments on 22 benchmark datasets demonstrate that SIBO significantly enhances the performance of various strong baselines, achieving up to 15.7% and 23.5% improvement over existing PEFT methods on the arithmetic and commonsense reasoning tasks, respectively.
摘要：微调大型语言模型 (LLM) 的所有参数需要大量的计算能力和较长的时间。参数高效微调 (PEFT) 技术的最新进展，例如适配器调整和 LoRA，仅允许调整这些 LLM 的一小部分参数。同时，人们注意到过度平滑的问题降低了这些基于 Transformer 的 LLM 的有效性，导致下游任务的性能不佳。在本文中，我们提出了 SIBO，它是一个简单的 BOoster，通过注入初始残差来增强 PEFT。 SIBO 非常简单，并且可以轻松扩展到一系列最先进的 PEFT 技术，以减轻过度平滑并提高性能。对 22 个基准数据集的大量实验表明，SIBO 显着增强了各种强基线的性能，在算术和常识推理任务上分别比现有 PEFT 方法提高了 15.7% 和 23.5%。

Title: Investigating Multi-Hop Factual Shortcuts in Knowledge Editing of Large Language Models

Authors: Tianjie Ju, Yijin Chen, Xinwei Yuan, Zhuosheng Zhang, Wei Du, Yubin Zheng, Gongshen Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11900
Pdf URL: https://arxiv.org/pdf/2402.11900
Copy Paste: [[2402.11900]] Investigating Multi-Hop Factual Shortcuts in Knowledge Editing of Large Language Models(https://arxiv.org/abs/2402.11900)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Recent work has showcased the powerful capability of large language models (LLMs) in recalling knowledge and reasoning. However, the reliability of LLMs in combining these two capabilities into reasoning through multi-hop facts has not been widely explored. This paper systematically investigates the possibilities for LLMs to utilize shortcuts based on direct connections between the initial and terminal entities of multi-hop knowledge. We first explore the existence of factual shortcuts through Knowledge Neurons, revealing that: (i) the strength of factual shortcuts is highly correlated with the frequency of co-occurrence of initial and terminal entities in the pre-training corpora; (ii) few-shot prompting leverages more shortcuts in answering multi-hop questions compared to chain-of-thought prompting. Then, we analyze the risks posed by factual shortcuts from the perspective of multi-hop knowledge editing. Analysis shows that approximately 20% of the failures are attributed to shortcuts, and the initial and terminal entities in these failure instances usually have higher co-occurrences in the pre-training corpus. Finally, we propose erasing shortcut neurons to mitigate the associated risks and find that this approach significantly reduces failures in multi-hop knowledge editing caused by shortcuts.
摘要：最近的工作展示了大型语言模型（LLM）在回忆知识和推理方面的强大能力。然而，LLM 通过多跳事实将这两种能力结合到推理中的可靠性尚未得到广泛探索。本文系统地研究了 LLM 利用基于多跳知识的初始实体和终端实体之间的直接连接的捷径的可能性。我们首先通过知识神经元探索事实捷径的存在，揭示：（i）事实捷径的强度与预训练语料库中初始实体和终端实体共现的频率高度相关； (ii) 与思维链提示相比，少样本提示在回答多跳问题时利用更多捷径。然后，我们从多跳知识编辑的角度分析了事实捷径带来的风险。分析表明，大约20%的失败归因于捷径，并且这些失败实例中的初始实体和终端实体通常在预训练语料库中具有较高的共现性。最后，我们建议擦除捷径神经元以减轻相关风险，并发现这种方法显着减少了由捷径引起的多跳知识编辑的失败。

Title: SoLA: Solver-Layer Adaption of LLM for Better Logic Reasoning

Authors: Yu Zhang, Hui-Ling Zhen, Zehua Pei, Yingzhao Lian, Lihao Yin, Mingxuan Yuan, Bei Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.11903
Pdf URL: https://arxiv.org/pdf/2402.11903
Copy Paste: [[2402.11903]] SoLA: Solver-Layer Adaption of LLM for Better Logic Reasoning(https://arxiv.org/abs/2402.11903)
Keywords: language model, llm
Abstract: Considering the challenges faced by large language models (LLMs) on logical reasoning, prior efforts have sought to transform problem-solving through tool learning. While progress has been made on small-scale problems, solving industrial cases remains difficult due to their large scale and intricate expressions. In this paper, we propose a novel solver-layer adaptation (SoLA) method, where we introduce a solver as a new layer of the LLM to differentially guide solutions towards satisfiability. In SoLA, LLM aims to comprehend the search space described in natural language and identify local solutions of the highest quality, while the solver layer focuses solely on constraints not satisfied by the initial solution. Leveraging MaxSAT as a bridge, we define forward and backward transfer gradients, enabling the final model to converge to a satisfied solution or prove unsatisfiability. The backdoor theory ensures that SoLA can obtain accurate solutions within polynomial loops. We evaluate the performance of SoLA on various datasets and empirically demonstrate its consistent outperformance against existing symbolic solvers (including Z3 and Kissat) and tool-learning methods in terms of efficiency in large-scale problem-solving.
摘要：考虑到大型语言模型（LLM）在逻辑推理方面面临的挑战，先前的努力试图通过工具学习来转变问题解决方式。虽然小规模问题取得了一定进展，但行业案例由于规模庞大、表现形式复杂，解决起来仍然困难重重。在本文中，我们提出了一种新颖的求解器层适应（SoLA）方法，其中我们引入求解器作为 LLM 的新层，以差异化方式引导解决方案实现可满足性。在 SoLA 中，LLM 旨在理解以自然语言描述的搜索空间并识别最高质量的本地解决方案，而求解器层仅关注初始解决方案不满足的约束。利用 MaxSAT 作为桥梁，我们定义前向和后向传递梯度，使最终模型能够收敛到满意的解决方案或证明不可满足。后门理论保证了SoLA能够在多项式循环内获得准确的解。我们评估了 SoLA 在各种数据集上的性能，并凭经验证明其在大规模问题解决效率方面始终优于现有符号求解器（包括 Z3 和 Kissat）和工具学习方法。

Title: Learning to Edit: Aligning LLMs with Knowledge Editing

Authors: Yuxin Jiang, Yufei Wang, Chuhan Wu, Wanjun Zhong, Xingshan Zeng, Jiahui Gao, Liangyou Li, Xin Jiang, Lifeng Shang, Ruiming Tang, Qun Liu, Wei Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11905
Pdf URL: https://arxiv.org/pdf/2402.11905
Copy Paste: [[2402.11905]] Learning to Edit: Aligning LLMs with Knowledge Editing(https://arxiv.org/abs/2402.11905)
Keywords: language model, llm
Abstract: Knowledge editing techniques, aiming to efficiently modify a minor proportion of knowledge in large language models (LLMs) without negatively impacting performance across other inputs, have garnered widespread attention. However, existing methods predominantly rely on memorizing the updated knowledge, impeding LLMs from effectively combining the new knowledge with their inherent knowledge when answering questions. To this end, we propose a Learning to Edit (LTE) framework, focusing on teaching LLMs to apply updated knowledge into input questions, inspired by the philosophy of "Teach a man to fish." LTE features a two-phase process: (i) the Alignment Phase, which fine-tunes LLMs on a meticulously curated parallel dataset to make reliable, in-scope edits while preserving out-of-scope information and linguistic proficiency; and (ii) the Inference Phase, which employs a retrieval-based mechanism for real-time and mass knowledge editing. By comparing our approach with seven advanced baselines across four popular knowledge editing benchmarks and two LLM architectures, we demonstrate LTE's superiority in knowledge editing performance, robustness in both batch and sequential editing, minimal interference on general tasks, and rapid editing speeds. The data and code are available at https://github.com/YJiangcm/LTE.
摘要：知识编辑技术旨在有效地修改大型语言模型（LLM）中的一小部分知识，而不会对其他输入的性能产生负面影响，已经引起了广泛的关注。然而，现有的方法主要依赖于记忆更新的知识，阻碍了 LLM 在回答问题时有效地将新知识与其固有知识结合起来。为此，我们提出了学习编辑（LTE）框架，重点是教导 LLM 将更新的知识应用到输入问题中，其灵感来自于“授人以渔”的哲学。 LTE 具有两个阶段的过程：(i) 对齐阶段，在精心策划的并行数据集上对 LLM 进行微调，以进行可靠的范围内编辑，同时保留范围外的信息和语言熟练程度； (ii) 推理阶段，采用基于检索的机制进行实时和大规模知识编辑。通过将我们的方法与四个流行知识编辑基准和两个 LLM 架构的七个高级基线进行比较，我们展示了 LTE 在知识编辑性能、批量和顺序编辑的鲁棒性、对一般任务的最小干扰以及快速编辑速度方面的优势。数据和代码可在 https://github.com/YJiangcm/LTE 获取。

Title: Direct Large Language Model Alignment Through Self-Rewarding Contrastive Prompt Distillation

Authors: Aiwei Liu, Haoping Bai, Zhiyun Lu, Xiang Kong, Simon Wang, Jiulong Shan, Meng Cao, Lijie Wen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11907
Pdf URL: https://arxiv.org/pdf/2402.11907
Copy Paste: [[2402.11907]] Direct Large Language Model Alignment Through Self-Rewarding Contrastive Prompt Distillation(https://arxiv.org/abs/2402.11907)
Keywords: language model, llm, prompt
Abstract: Aligning large language models (LLMs) with human expectations without human-annotated preference data is an important problem. In this paper, we propose a method to evaluate the response preference by using the output probabilities of response pairs under contrastive prompt pairs, which could achieve better performance on LLaMA2-7B and LLaMA2-13B compared to RLAIF. Based on this, we propose an automatic alignment method, Direct Large Model Alignment (DLMA). First, we use contrastive prompt pairs to automatically generate preference data. Then, we continue to evaluate the generated preference data using contrastive prompt pairs and calculate a self-rewarding score. Finally, we use the DPO algorithm to effectively align LLMs by combining this self-rewarding score. In the experimental stage, our DLMA method could surpass the RLHF method without relying on human-annotated preference data.
摘要：在没有人类注释偏好数据的情况下使大型语言模型 (LLM) 与人类期望保持一致是一个重要问题。在本文中，我们提出了一种利用对比提示对下响应对的输出概率来评估响应偏好的方法，与 RLAIF 相比，该方法可以在 LLaMA2-7B 和 LLaMA2-13B 上获得更好的性能。基于此，我们提出了一种自动对齐方法，直接大模型对齐（DLMA）。首先，我们使用对比提示对自动生成偏好数据。然后，我们继续使用对比提示对评估生成的偏好数据并计算自我奖励分数。最后，我们使用 DPO 算法通过结合这种自我奖励分数来有效地调整 LLM。在实验阶段，我们的 DLMA 方法可以超越 RLHF 方法，而无需依赖人类注释的偏好数据。
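
Note: the last DLMA step relies on DPO. As a reference point, the standard DPO objective for a single preference pair can be sketched as below. How DLMA combines its self-rewarding score with this objective is not specified in the abstract and is not modeled here. The function and argument names, and the default `beta`, are illustrative assumptions.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    Loss = -log(sigmoid(beta * margin)), where margin is the difference of
    the policy-vs-reference log-ratios of the chosen and rejected responses.
    """
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    z = beta * margin
    # -log(sigmoid(z)) == softplus(-z), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy equals the reference model the margin is zero and the loss is log 2. The loss decreases as the policy assigns relatively more probability to the chosen response.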

Title: A Generative Pre-Training Framework for Spatio-Temporal Graph Transfer Learning

Authors: Yuan Yuan, Chenyang Shao, Jingtao Ding, Depeng Jin, Yong Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2402.11922
Pdf URL: https://arxiv.org/pdf/2402.11922
Copy Paste: [[2402.11922]] A Generative Pre-Training Framework for Spatio-Temporal Graph Transfer Learning(https://arxiv.org/abs/2402.11922)
Keywords: prompt
Abstract: Spatio-temporal graph (STG) learning is foundational for smart city applications, yet it is often hindered by data scarcity in many cities and regions. To bridge this gap, we propose a novel generative pre-training framework, GPDiff, for STG transfer learning. Unlike conventional approaches that heavily rely on common feature extraction or intricate transfer learning designs, our solution takes a novel approach by performing generative pre-training on a collection of model parameters optimized with data from source cities. We recast STG transfer learning as pre-training a generative hypernetwork, which generates tailored model parameters guided by prompts, allowing for adaptability to diverse data distributions and city-specific characteristics. GPDiff employs a diffusion model with a transformer-based denoising network, which is model-agnostic to integrate with powerful STG models. By addressing challenges arising from data gaps and the complexity of generalizing knowledge across cities, our framework consistently outperforms state-of-the-art baselines on multiple real-world datasets for tasks such as traffic speed prediction and crowd flow prediction. The implementation of our approach is available: https://github.com/PLUTO-SCY/GPDiff.
摘要：时空图（STG）学习是智慧城市应用的基础，但在许多城市和地区，它常常受到数据稀缺的阻碍。为了弥补这一差距，我们提出了一种用于 STG 迁移学习的新型生成预训练框架 GPDiff。与严重依赖常见特征提取或复杂的迁移学习设计的传统方法不同，我们的解决方案采用一种新颖的方法，对使用源城市数据优化的模型参数集合进行生成预训练。我们将 STG 迁移学习重新定义为预训练生成超网络，该网络在提示的指导下生成定制的模型参数，从而能够适应不同的数据分布和特定于城市的特征。 GPDiff 采用带有基于变压器的去噪网络的扩散模型，该网络与模型无关，可以与强大的 STG 模型集成。通过解决数据差距和跨城市推广知识的复杂性带来的挑战，我们的框架在交通速度预测和人群流量预测等任务中始终优于多个现实世界数据集上的最新基线。我们的方法的实施可参见：https://github.com/PLUTO-SCY/GPDiff。

Title: MRKE: The Multi-hop Reasoning Evaluation of LLMs by Knowledge Edition

Authors: Jian Wu, Linyi Yang, Manabu Okumura, Yue Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11924
Pdf URL: https://arxiv.org/pdf/2402.11924
Copy Paste: [[2402.11924]] MRKE: The Multi-hop Reasoning Evaluation of LLMs by Knowledge Edition(https://arxiv.org/abs/2402.11924)
Keywords: language model, gpt, llm
Abstract: Although Large Language Models (LLMs) have shown strong performance in Multi-hop Question Answering (MHQA) tasks, their real reasoning ability remains underexplored. Current LLM QA evaluation benchmarks have shown limitations, including 1) data contamination, where the evaluation data are potentially exposed to LLMs during the pretraining stage; and 2) neglect of reasoning chain evaluation. Thus we introduce an LLM MHQA evaluation benchmark, the first QA benchmark based on new, unprecedented knowledge obtained by editing the off-the-shelf HotpotQA dataset. In addition, we annotate and evaluate the reasoning chain in the form of sub-questions and intermediate answers corresponding to the multi-hop questions. Specifically, we observe that 1) LLMs show a performance gap between the original HotpotQA and our edited data, suggesting that current MHQA benchmarks carry a risk of data contamination that makes it hard to evaluate LLMs' performance objectively and scientifically; 2) LLMs get only a small percentage of reasoning chains right, e.g., GPT-4 gets only 36.3% of reasoning chains right. We believe this new Multi-hop QA evaluation benchmark and novel evaluation methods will facilitate the development of trustworthy LLM evaluation on the MHQA task.
摘要：尽管大型语言模型（LLM）在多跳问答（MHQA）任务中表现出了强大的性能，但它们真正的推理能力仍有待探索。目前的LLM QA评估基准已显示出局限性，包括1）数据污染，评估数据在预训练阶段可能会暴露给LLM； 2）忽视推理链评估。因此，我们引入了 LLM MHQA 评估基准，这是第一个基于新的、前所未有的知识的 QA 基准，通过编辑现成的 HotpotQA 数据集得到；此外，我们还以多跳问题对应的子问题和中间答案的形式对推理链进行注释和评估。具体而言，根据观察，1）LLM在原始HotpotQA和我们编辑的数据之间存在性能差距，表明当前的MHQA基准存在数据污染的潜在风险，难以客观、科学地评估LLM的性能； 2）LLM 只能得到一小部分正确的推理链，例如 GPT-4 仅获得 36.3% 的正确推理链。我们相信，这种新的多跳 QA 评估基准和新颖的评估方法将有助于在 MHQA 任务上开发值得信赖的 LLM 评估。

Title: Comprehensive Cognitive LLM Agent for Smartphone GUI Automation

Authors: Xinbei Ma, Zhuosheng Zhang, Hai Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11941
Pdf URL: https://arxiv.org/pdf/2402.11941
Copy Paste: [[2402.11941]] Comprehensive Cognitive LLM Agent for Smartphone GUI Automation(https://arxiv.org/abs/2402.11941)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) have shown remarkable potential as human-like autonomous language agents to interact with real-world environments, especially for graphical user interface (GUI) automation. However, those GUI agents require comprehensive cognition ability including exhaustive perception and reliable action response. We propose Comprehensive Cognitive LLM Agent, CoCo-Agent, with two novel approaches, comprehensive environment perception (CEP) and conditional action prediction (CAP), to systematically improve the GUI automation performance. First, CEP facilitates the GUI perception through different aspects and granularity, including screenshots and complementary detailed layouts for the visual channel and historical actions for the textual channel. Second, CAP decomposes the action prediction into sub-problems: action type prediction and action target conditioned on the action type. With our technical design, our agent achieves new state-of-the-art performance on AITW and META-GUI benchmarks, showing promising abilities in realistic scenarios.
摘要：大型语言模型 (LLM) 作为类人自主语言代理，在与现实世界环境交互方面表现出了巨大的潜力，特别是在图形用户界面 (GUI) 自动化方面。然而，这些GUI代理需要全面的认知能力，包括详尽的感知和可靠的动作响应。我们提出 Comprehensive Cognitive LLM Agent，即 CoCo-Agent，采用两种新颖的方法，综合环境感知（CEP）和条件动作预测（CAP），系统地提高 GUI 自动化性能。首先，CEP通过不同方面和粒度促进GUI感知，包括视觉通道的屏幕截图和补充详细布局以及文本通道的历史动作。其次，CAP将动作预测分解为子问题：动作类型预测和以动作类型为条件的动作目标。通过我们的技术设计，我们的智能体在 AITW 和 META-GUI 基准测试中实现了最先进的性能，在现实场景中展现出了良好的能力。

Title: LEMMA: Towards LVLM-Enhanced Multimodal Misinformation Detection with External Knowledge Augmentation

Authors: Keyang Xuan, Li Yi, Fan Yang, Ruochen Wu, Yi R. Fung, Heng Ji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11943
Pdf URL: https://arxiv.org/pdf/2402.11943
Copy Paste: [[2402.11943]] LEMMA: Towards LVLM-Enhanced Multimodal Misinformation Detection with External Knowledge Augmentation(https://arxiv.org/abs/2402.11943)
Keywords: language model, llm
Abstract: The rise of multimodal misinformation on social platforms poses significant challenges for individuals and societies. Its increased credibility and broader impact compared to textual misinformation make detection complex, requiring robust reasoning across diverse media types and profound knowledge for accurate verification. The emergence of Large Vision Language Model (LVLM) offers a potential solution to this problem. Leveraging their proficiency in processing visual and textual information, LVLM demonstrates promising capabilities in recognizing complex information and exhibiting strong reasoning skills. In this paper, we first investigate the potential of LVLM on multimodal misinformation detection. We find that even though LVLM has a superior performance compared to LLMs, its profound reasoning may present limited power with a lack of evidence. Based on these observations, we propose LEMMA: LVLM-Enhanced Multimodal Misinformation Detection with External Knowledge Augmentation. LEMMA leverages LVLM intuition and reasoning capabilities while augmenting them with external knowledge to enhance the accuracy of misinformation detection. Our method improves the accuracy over the top baseline LVLM by 7% and 13% on Twitter and Fakeddit datasets respectively.
摘要：社交平台上多模式错误信息的兴起给个人和社会带来了重大挑战。与文本错误信息相比，它的可信度更高，影响更广泛，使得检测变得复杂，需要跨不同媒体类型的强大推理和深厚的知识来进行准确验证。大视觉语言模型（LVLM）的出现为这一问题提供了潜在的解决方案。 LVLM 凭借其处理视觉和文本信息的能力，在识别复杂信息和表现出强大的推理能力方面表现出了良好的能力。在本文中，我们首先研究了 LVLM 在多模态错误信息检测方面的潜力。我们发现，尽管 LVLM 与 LLM 相比具有优越的性能，但其深刻的推理可能在缺乏证据的情况下表现有限。基于这些观察，我们提出 LEMMA：具有外部知识增强的 LVLM 增强型多模态错误信息检测。 LEMMA 利用 LVLM 直觉和推理功能，同时利用外部知识对其进行增强，以提高错误信息检测的准确性。我们的方法在 Twitter 和 Fakeddit 数据集上的准确度比顶级基线 LVLM 分别提高了 7% 和 13%。

Title: Automatic Evaluation for Mental Health Counseling using LLMs

Authors: Anqi Li, Yu Lu, Nirui Song, Shuai Zhang, Lizhi Ma, Zhenzhong Lan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11958
Pdf URL: https://arxiv.org/pdf/2402.11958
Copy Paste: [[2402.11958]] Automatic Evaluation for Mental Health Counseling using LLMs(https://arxiv.org/abs/2402.11958)
Keywords: language model, llm
Abstract: High-quality psychological counseling is crucial for mental health worldwide, and timely evaluation is vital for ensuring its effectiveness. However, obtaining professional evaluation for each counseling session is expensive and challenging. Existing methods that rely on self or third-party manual reports to assess the quality of counseling suffer from subjective biases and are time-consuming. To address the above challenges, this paper proposes an innovative and efficient automatic approach using large language models (LLMs) to evaluate the working alliance in counseling conversations. We collected a comprehensive counseling dataset and conducted multiple third-party evaluations based on therapeutic relationship theory. Our LLM-based evaluation, combined with our guidelines, shows high agreement with human evaluations and provides valuable insights into counseling scripts. This highlights the potential of LLMs as supervisory tools for psychotherapists. By integrating LLMs into the evaluation process, our approach offers a cost-effective and dependable means of assessing counseling quality, enhancing overall effectiveness.
摘要：高质量的心理咨询对于全世界的心理健康至关重要，而及时的评估对于确保其有效性至关重要。然而，为每次咨询课程获得专业评估是昂贵且具有挑战性的。现有的依靠自我或第三方手动报告来评估咨询质量的方法存在主观偏见和耗时的局限性。为了解决上述挑战，本文提出了一种创新且高效的自动方法，使用大语言模型（LLM）来评估咨询对话中的工作联盟。我们收集了全面的咨询数据集，并根据治疗关系理论进行了多次第三方评估。我们基于法学硕士的评估与我们的指南相结合，显示出与人类评估的高度一致性，并为咨询脚本提供了宝贵的见解。这凸显了法学硕士作为心理治疗师监督工具的潜力。通过将法学硕士纳入评估过程，我们的方法提供了一种经济有效且可靠的方法来评估咨询质量，从而提高整体效率。

Title: DB-LLM: Accurate Dual-Binarization for Efficient LLMs

Authors: Hong Chen, Chengtao Lv, Liang Ding, Haotong Qin, Xiabin Zhou, Yifu Ding, Xuebo Liu, Min Zhang, Jinyang Guo, Xianglong Liu, Dacheng Tao
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2402.11960
Pdf URL: https://arxiv.org/pdf/2402.11960
Copy Paste: [[2402.11960]] DB-LLM: Accurate Dual-Binarization for Efficient LLMs(https://arxiv.org/abs/2402.11960)
Keywords: language model, llm
Abstract: Large language models (LLMs) have significantly advanced the field of natural language processing, while the expensive memory and computation consumption impede their practical deployment. Quantization emerges as one of the most effective methods for improving the computational efficiency of LLMs. However, existing ultra-low-bit quantization always causes severe accuracy drops. In this paper, we empirically reveal the micro and macro characteristics of ultra-low-bit quantization and present a novel Dual-Binarization method for LLMs, namely DB-LLM. For the micro-level, we take both the accuracy advantage of 2-bit-width and the efficiency advantage of binarization into account, introducing Flexible Dual Binarization (FDB). By splitting 2-bit quantized weights into two independent sets of binaries, FDB ensures the accuracy of representations and introduces flexibility, utilizing the efficient bitwise operations of binarization while retaining the inherent high sparsity of ultra-low-bit quantization. For the macro-level, we find the distortion that exists in the prediction of LLM after quantization, which is specified as the deviations related to the ambiguity of samples. We propose the Deviation-Aware Distillation (DAD) method, enabling the model to focus differently on various samples. Comprehensive experiments show that our DB-LLM not only significantly surpasses the current State-of-The-Art (SoTA) in ultra-low-bit quantization (e.g., perplexity decreased from 9.64 to 7.23), but also achieves an additional 20% reduction in computational consumption compared to the SoTA method under the same bit-width. Our code will be released soon.
摘要：大型语言模型（LLM）极大地推进了自然语言处理领域的发展，而昂贵的内存和计算消耗阻碍了它们的实际部署。量化成为提高 LLM 计算效率的最有效方法之一。然而，现有的超低比特量化总是会导致精度严重下降。在本文中，我们根据经验揭示了超低比特量化的微观和宏观特征，并提出了一种新颖的LLM双二值化方法，即DB-LLM。对于微观层面，我们同时考虑2位宽度的精度优势和二值化的效率优势，引入灵活双二值化（FDB）。通过将 2 位量化权重拆分为两个独立的二进制集，FDB 确保了表示的准确性并引入了灵活性，利用二值化的高效按位运算，同时保留了超低位量化固有的高稀疏性。对于宏观层面，我们发现量化后LLM的预测中存在的失真，其被指定为与样本模糊度相关的偏差。我们提出了偏差感知蒸馏（DAD）方法，使模型能够以不同的方式关注不同的样本。综合实验表明，我们的DB-LLM不仅在超低比特量化方面显着超越了当前的State-of-The-Art（SoTA）（例如，困惑度从9.64下降到7.23），而且与相同位宽下的 SoTA 方法相比，计算消耗还额外降低了 20%。我们的代码很快就会发布。
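Sketch (not from the paper): the abstract describes splitting a 2-bit weight representation into two sets of binaries. The following is a minimal illustration of that general idea using plain residual binarization, a standard technique swapped in here as a stand-in; it is not the authors' FDB procedure. The function names and the example weights are hypothetical.

```python
def sign(x):
    # Map a value to {-1.0, +1.0}; zero is treated as +1 by convention here.
    return 1.0 if x >= 0 else -1.0


def binarize(values):
    """Return (alpha, bits) with bits in {-1, +1}.

    For bits = sign(values), alpha = mean(|values|) minimises
    sum((v - alpha * b) ** 2).
    """
    alpha = sum(abs(v) for v in values) / len(values)
    bits = [sign(v) for v in values]
    return alpha, bits


def dual_binarize(weights):
    """Approximate weights as alpha1 * b1 + alpha2 * b2 (residual binarization)."""
    alpha1, b1 = binarize(weights)
    residual = [w - alpha1 * b for w, b in zip(weights, b1)]
    alpha2, b2 = binarize(residual)
    return (alpha1, b1), (alpha2, b2)


def reconstruct(terms, n):
    # Sum the scaled binary vectors element by element.
    return [sum(alpha * bits[i] for alpha, bits in terms) for i in range(n)]


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical weights, for illustration only.
weights = [0.9, -0.2, 0.4, -1.1, 0.05, -0.6]
first, second = dual_binarize(weights)
err_one = squared_error(weights, reconstruct([first], len(weights)))
err_two = squared_error(weights, reconstruct([first, second], len(weights)))
print(err_one, err_two)
```

Each reconstructed weight takes one of four values (plus or minus alpha1, plus or minus alpha2), which is the same number of levels as a 2-bit code, while each of the two components is a pure binary vector.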

Title: Compress to Impress: Unleashing the Potential of Compressive Memory in Real-World Long-Term Conversations

Authors: Nuo Chen, Hongguang Li, Juhua Huang, Baoyuan Wang, Jia Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.11975
Pdf URL: https://arxiv.org/pdf/2402.11975
Copy Paste: [[2402.11975]] Compress to Impress: Unleashing the Potential of Compressive Memory in Real-World Long-Term Conversations(https://arxiv.org/abs/2402.11975)
Keywords: language model, chat
Abstract: Existing retrieval-based methods have made significant strides in maintaining long-term conversations. However, these approaches face challenges in memory database management and accurate memory retrieval, hindering their efficacy in dynamic, real-world interactions. This study introduces a novel framework, COmpressive Memory-Enhanced Dialogue sYstems (COMEDY), which eschews traditional retrieval modules and memory databases. Instead, COMEDY adopts a "One-for-All" approach, utilizing a single language model to manage memory generation, compression, and response generation. Central to this framework is the concept of compressive memory, which integrates session-specific summaries, user-bot dynamics, and past events into a concise memory format. To support COMEDY, we curated a large-scale Chinese instruction-tuning dataset, Dolphin, derived from real user-chatbot interactions. Comparative evaluations demonstrate COMEDY's superiority over traditional retrieval-based methods in producing more nuanced and human-like conversational experiences. Our code is available at https://github.com/nuochenpku/COMEDY.
摘要：现有的基于检索的方法在维持长期对话方面取得了重大进展。然而，这些方法在内存数据库管理和准确的内存检索方面面临挑战，阻碍了它们在动态、现实世界交互中的有效性。这项研究引入了一种新颖的框架，即压缩记忆增强对话系统（COMEDY），它避开了传统的检索模块和记忆数据库。相反，COMEDY 采用“One-for-All”方法，利用单一语言模型来管理内存生成、压缩和响应生成。该框架的核心是压缩内存的概念，它将特定于会话的摘要、用户机器人动态和过去的事件集成到简洁的内存格式中。为了支持 COMEDY，我们策划了一个大规模的中文指令调整数据集 Dolphin，它源自真实的用户与聊天机器人交互。比较评估表明，COMEDY 在产生更细致、更人性化的对话体验方面优于传统的基于检索的方法。我们的代码可在 https://github.com/nuochenpku/COMEDY 获取。

Title: Remember This Event That Year? Assessing Temporal Information and Reasoning in Large Language Models

Authors: Himanshu Beniwal, Kowsik Nandagopan D, Mayank Singh
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.11997
Pdf URL: https://arxiv.org/pdf/2402.11997
Copy Paste: [[2402.11997]] Remember This Event That Year? Assessing Temporal Information and Reasoning in Large Language Models(https://arxiv.org/abs/2402.11997)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are increasingly becoming ubiquitous, yet their ability to reason about and retain temporal information remains limited. This hinders their application in real-world scenarios where understanding the sequential nature of events is crucial. This paper experiments with state-of-the-art models on a novel, large-scale temporal dataset, TempUN, to reveal significant limitations in temporal retention and reasoning abilities. Interestingly, closed-source models indicate knowledge gaps more frequently, potentially suggesting a trade-off between uncertainty awareness and incorrect responses. Further, exploring various fine-tuning approaches yielded no major performance improvements. The associated dataset and code are available at the following URL (https://github.com/lingoiitgn/TempUN).
摘要：大型语言模型 (LLM) 越来越普遍，但它们推理和保留时间信息的能力仍然有限。这阻碍了它们在现实场景中的应用，在现实场景中，理解事件的顺序性质至关重要。本文在新颖的大规模时间数据集 TempUN 上使用最先进的模型进行实验，以揭示时间保留和推理能力的显着局限性。有趣的是，闭源模型更频繁地表明知识差距，这可能表明不确定性意识和错误反应之间的权衡。此外，探索各种微调方法并没有产生重大的性能改进。相关数据集和代码可从以下 URL (https://github.com/lingoiitgn/TempUN) 获取。

Title: A Systematic Comparison of Contextualized Word Embeddings for Lexical Semantic Change

Authors: Francesco Periti, Nina Tahmasebi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12011
Pdf URL: https://arxiv.org/pdf/2402.12011
Copy Paste: [[2402.12011]] A Systematic Comparison of Contextualized Word Embeddings for Lexical Semantic Change(https://arxiv.org/abs/2402.12011)
Keywords: gpt
Abstract: Contextualized embeddings are the preferred tool for modeling Lexical Semantic Change (LSC). Current evaluations typically focus on a specific task known as Graded Change Detection (GCD). However, performance comparisons across works are often misleading due to their reliance on diverse settings. In this paper, we evaluate state-of-the-art models and approaches for GCD under equal conditions. We further break the LSC problem into Word-in-Context (WiC) and Word Sense Induction (WSI) tasks, and compare models across these different levels. Our evaluation is performed across different languages on eight available benchmarks for LSC, and shows that (i) APD outperforms other approaches for GCD; (ii) XL-LEXEME outperforms other contextualized models for WiC, WSI, and GCD, while being comparable to GPT-4; (iii) there is a clear need for improving the modeling of word meanings, as well as for focusing on how, when, and why these meanings change, rather than solely on the extent of semantic change.
摘要：上下文嵌入是词汇语义变化 (LSC) 建模的首选工具。当前的评估通常侧重于称为分级变化检测（GCD）的特定任务。然而，由于不同工作环境的依赖，跨工作的绩效比较往往会产生误导。在本文中，我们评估了同等条件下最先进的 GCD 模型和方法。我们进一步将 LSC 问题分解为上下文中的词 (WiC) 和词义归纳 (WSI) 任务，并比较这些不同级别的模型。我们的评估是在八个可用的 LSC 基准上跨不同语言进行的，结果表明 (i) APD 优于其他 GCD 方法； (ii) XL-LEXEME 优于 WiC、WSI 和 GCD 的其他情境化模型，同时与 GPT-4 相当； (iii) 显然需要改进词义的建模，并关注这些含义如何、何时以及为何变化，而不是仅仅关注语义变化的程度。
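Sketch (not from the paper): the abstract names APD without expanding it. In the LSC literature APD commonly stands for Average Pairwise Distance, i.e. the mean distance between every embedding of a word from one time period and every embedding from the other. The version below assumes cosine distance and uses toy 2-dimensional vectors; both choices are illustrative assumptions.

```python
import math


def cosine_distance(u, v):
    # 1 - cosine similarity; assumes neither vector is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def average_pairwise_distance(period_1, period_2):
    """Mean cosine distance over all cross-period pairs of usage embeddings."""
    total = sum(cosine_distance(u, v) for u in period_1 for v in period_2)
    return total / (len(period_1) * len(period_2))


# Toy embeddings: same direction in both periods (no change) ...
stable = average_pairwise_distance([[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0]])
# ... versus orthogonal directions (maximal change under this toy setup).
shifted = average_pairwise_distance([[1.0, 0.0]], [[0.0, 1.0], [0.0, 3.0]])
print(stable, shifted)
```

A higher score is read as a larger graded change between the two periods.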

Title: Distilling Large Language Models for Text-Attributed Graph Learning

Authors: Bo Pan, Zheng Zhang, Yifei Zhang, Yuntong Hu, Liang Zhao
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2402.12022
Pdf URL: https://arxiv.org/pdf/2402.12022
Copy Paste: [[2402.12022]] Distilling Large Language Models for Text-Attributed Graph Learning(https://arxiv.org/abs/2402.12022)
Keywords: language model, llm
Abstract: Text-Attributed Graphs (TAGs) are graphs of connected textual documents. Graph models can efficiently learn TAGs, but their training heavily relies on human-annotated labels, which are scarce or even unavailable in many applications. Large language models (LLMs) have recently demonstrated remarkable capabilities in few-shot and zero-shot TAG learning, but they suffer from scalability, cost, and privacy issues. Therefore, in this work, we focus on synergizing LLMs and graph models with their complementary strengths by distilling the power of LLMs to a local graph model on TAG learning. To address the inherent gaps between LLMs (generative models for texts) and graph models (discriminative models for graphs), we propose first to let LLMs teach an interpreter with rich textual rationale and then let a student model mimic the interpreter's reasoning without LLMs' textual rationale. Extensive experiments validate the efficacy of our proposed framework.
摘要：文本属性图（TAG）是相互连接的文本文档的图。图模型可以有效地学习标签，但它们的训练严重依赖于人工注释的标签，而这些标签在许多应用中很少甚至不可用。大型语言模型 (LLM) 最近在少样本和零样本 TAG 学习方面表现出了卓越的能力，但它们面临可扩展性、成本和隐私问题。因此，在这项工作中，我们专注于通过将 LLM 的力量提炼为 TAG 学习的本地图模型，来协同 LLM 和图模型的互补优势。为了解决LLM（文本生成模型）和图模型（图判别模型）之间的固有差距，我们建议首先让LLM教授具有丰富文本原理的解释器，然后让学生模型在没有LLM文本原理的情况下模仿解释器的推理。大量的实验验证了我们提出的框架的有效性。

Title: Speech Translation with Speech Foundation Models and Large Language Models: What is There and What is Missing?

Authors: Marco Gaido, Sara Papi, Matteo Negri, Luisa Bentivogli
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12025
Pdf URL: https://arxiv.org/pdf/2402.12025
Copy Paste: [[2402.12025]] Speech Translation with Speech Foundation Models and Large Language Models: What is There and What is Missing?(https://arxiv.org/abs/2402.12025)
Keywords: language model, llm
Abstract: The field of natural language processing (NLP) has recently witnessed a transformative shift with the emergence of foundation models, particularly Large Language Models (LLMs) that have revolutionized text-based NLP. This paradigm has extended to other modalities, including speech, where researchers are actively exploring the combination of Speech Foundation Models (SFMs) and LLMs into single, unified models capable of addressing multimodal tasks. Among such tasks, this paper focuses on speech-to-text translation (ST). By examining the published papers on the topic, we propose a unified view of the architectural solutions and training strategies presented so far, highlighting similarities and differences among them. Based on this examination, we not only organize the lessons learned but also show how diverse settings and evaluation approaches hinder the identification of the best-performing solution for each architectural building block and training choice. Lastly, we outline recommendations for future works on the topic aimed at better understanding the strengths and weaknesses of the SFM+LLM solutions for ST.
摘要：随着基础模型的出现，自然语言处理 (NLP) 领域最近发生了一场变革，特别是大型语言模型 (LLM)，它彻底改变了基于文本的 NLP。这种范式已经扩展到包括语音在内的其他模式，研究人员正在积极探索将语音基础模型（SFM）和 LLM 结合成能够解决多模式任务的单一统一模型。在这些任务中，本文重点关注语音到文本翻译（ST）。通过检查有关该主题的已发表论文，我们对迄今为止提出的架构解决方案和培训策略提出了统一的看法，强调了它们之间的相似性和差异。基于这次检查，我们不仅整理了所学到的经验教训，还展示了不同的设置和评估方法如何阻碍为每个架构构建块和培训选择确定最佳性能的解决方案。最后，我们概述了对该主题的未来工作的建议，旨在更好地了解 ST 的 SFM+LLM 解决方案的优点和缺点。

Title: Acquiring Clean Language Models from Backdoor Poisoned Datasets by Downscaling Frequency Space

Authors: Zongru Wu, Zhuosheng Zhang, Pengzhou Cheng, Gongshen Liu
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2402.12026
Pdf URL: https://arxiv.org/pdf/2402.12026
Copy Paste: [[2402.12026]] Acquiring Clean Language Models from Backdoor Poisoned Datasets by Downscaling Frequency Space(https://arxiv.org/abs/2402.12026)
Keywords: language model
Abstract: Despite the notable success of language models (LMs) in various natural language processing (NLP) tasks, the reliability of LMs is susceptible to backdoor attacks. Prior research attempts to mitigate backdoor learning while training the LMs on the poisoned dataset, yet struggles against complex backdoor attacks in real-world scenarios. In this paper, we investigate the learning mechanisms of backdoor LMs in the frequency space by Fourier analysis. Our findings indicate that the backdoor mapping presented on the poisoned datasets exhibits a more discernible inclination towards lower frequency compared to clean mapping, resulting in the faster convergence of backdoor mapping. To alleviate this dilemma, we propose Multi-Scale Low-Rank Adaptation (MuScleLoRA), which deploys multiple radial scalings in the frequency space with low-rank adaptation to the target model and further aligns the gradients when updating parameters. Through downscaling in the frequency space, MuScleLoRA encourages the model to prioritize the learning of relatively high-frequency clean mapping, consequently mitigating backdoor learning. Experimental results demonstrate that MuScleLoRA outperforms baselines significantly. Notably, MuScleLoRA reduces the average success rate of diverse backdoor attacks to below 15% across multiple datasets and generalizes to various backbone LMs, including BERT, RoBERTa, and Llama2. The code is available at https://github.com/ZrW00/MuScleLoRA.
摘要：尽管语言模型 (LM) 在各种自然语言处理 (NLP) 任务中取得了显着的成功，但 LM 的可靠性很容易受到后门攻击。先前的研究尝试在有毒数据集上训练 LM 时减轻后门学习，但在现实场景中却难以应对复杂的后门攻击。在本文中，我们通过傅里叶分析研究了频率空间中后门 LM 的学习机制。我们的研究结果表明，与干净的映射相比，中毒数据集上呈现的后门映射表现出更明显的低频倾向，从而导致后门映射的收敛速度更快。为了缓解这种困境，我们提出了多尺度低秩自适应（MuScleLoRA），它在频率空间中部署多个径向缩放，对目标模型进行低秩自适应，并在更新参数时进一步对齐梯度。通过频率空间的缩小，MuScleLoRA 鼓励模型优先学习相对高频的干净映射，从而减少后门学习。实验结果表明 MuScleLoRA 显着优于基线。值得注意的是，MuScleLoRA 将多个数据集中各种后门攻击的平均成功率降低到 15% 以下，并推广到各种骨干 LM，包括 BERT、RoBERTa 和 Llama2。代码可在 https://github.com/ZrW00/MuScleLoRA 获取。

Title: Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs

Authors: Nicolas Boizard, Kevin El-Haddad, Céline Hudelot, Pierre Colombo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12030
Pdf URL: https://arxiv.org/pdf/2402.12030
Copy Paste: [[2402.12030]] Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs(https://arxiv.org/abs/2402.12030)
Keywords: language model, llm
Abstract: Deploying large language models (LLMs) of several billion parameters can be impractical in most industrial use cases due to constraints such as cost, latency limitations, and hardware accessibility. Knowledge distillation (KD) offers a solution by compressing knowledge from resource-intensive large models to smaller ones. Various strategies exist, some relying on the text generated by the teacher model and optionally utilizing its logits to enhance learning. However, these logit-based methods often require both teacher and student models to share the same tokenizer, limiting their applicability across different LLM families. In this paper, we introduce Universal Logit Distillation (ULD) loss, grounded in optimal transport, to address this limitation. Our experimental results demonstrate the effectiveness of ULD loss in enabling distillation across models with different architectures and tokenizers, paving the way to a more widespread use of distillation techniques.
摘要：由于成本、延迟限制和硬件可访问性等限制，在大多数工业用例中部署具有数十亿参数的大型语言模型 (LLM) 可能不切实际。知识蒸馏（KD）提供了一种解决方案，它将知识从资源密集型大型模型压缩到较小的模型。存在各种策略，一些策略依赖于教师模型生成的文本，并可选择利用其 logits 来增强学习。然而，这些基于 logits 的方法通常要求教师和学生模型共享相同的分词器，从而限制了它们在不同 LLM 系列中的适用性。在本文中，我们引入了基于最优传输的通用 logit 蒸馏（ULD）损失来解决这一限制。我们的实验结果证明了 ULD 损失在跨具有不同架构和分词器的模型进行蒸馏方面的有效性，为更广泛地使用蒸馏技术铺平了道路。
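Sketch (not from the paper): the abstract says the loss is grounded in optimal transport so that teacher and student no longer need a shared tokenizer. One simple way to compare two next-token distributions without aligning token ids is to sort each probability vector in decreasing order, zero-pad the shorter one, and take the L1 distance. Treating this as the form of the ULD loss is an assumption made here for illustration; the abstract does not specify the formula, and the function names are hypothetical.

```python
import math


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sorted_l1_distance(teacher_logits, student_logits):
    """L1 distance between descending-sorted, zero-padded probability vectors.

    Token identities are ignored, so the two vocabularies may differ in
    both size and ordering.
    """
    p = sorted(softmax(teacher_logits), reverse=True)
    q = sorted(softmax(student_logits), reverse=True)
    size = max(len(p), len(q))
    p += [0.0] * (size - len(p))
    q += [0.0] * (size - len(q))
    return sum(abs(a - b) for a, b in zip(p, q))


# Same distribution under a different token ordering: distance is ~0.
same_shape = sorted_l1_distance([1.0, 2.0, 3.0], [3.0, 1.0, 2.0])
# Vocabularies of different sizes (2 vs 4 tokens, both uniform).
different_vocab = sorted_l1_distance([0.0, 0.0], [0.0, 0.0, 0.0, 0.0])
print(same_shape, different_vocab)
```

Because only the sorted shape of each distribution is compared, such a term would typically be combined with a standard cross-entropy loss on the student's own tokens.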

Title: Language Model Adaptation to Specialized Domains through Selective Masking based on Genre and Topical Characteristics

Authors: Anas Belfathi, Ygor Gallina, Nicolas Hernandez, Richard Dufour, Laura Monceaux
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12036
Pdf URL: https://arxiv.org/pdf/2402.12036
Copy Paste: [[2402.12036]] Language Model Adaptation to Specialized Domains through Selective Masking based on Genre and Topical Characteristics(https://arxiv.org/abs/2402.12036)
Keywords: language model
Abstract: Recent advances in pre-trained language modeling have facilitated significant progress across various natural language processing (NLP) tasks. Word masking during model training constitutes a pivotal component of language modeling in architectures like BERT. However, the prevalent method of word masking relies on random selection, potentially disregarding domain-specific linguistic attributes. In this article, we introduce an innovative masking approach leveraging genre and topicality information to tailor language models to specialized domains. Our method incorporates a ranking process that prioritizes words based on their significance, subsequently guiding the masking procedure. Experiments conducted using continual pre-training within the legal domain have underscored the efficacy of our approach on the LegalGLUE benchmark in the English language. Pre-trained language models and code are freely available for use.
摘要：预训练语言建模的最新进展促进了各种自然语言处理 (NLP) 任务的重大进展。模型训练期间的单词屏蔽是 BERT 等架构中语言建模的关键组成部分。然而，流行的单词屏蔽方法依赖于随机选择，可能忽略特定领域的语言属性。在本文中，我们介绍了一种创新的屏蔽方法，利用流派和主题信息来定制针对专业领域的语言模型。我们的方法结合了一个排名过程，根据单词的重要性对单词进行优先级排序，随后指导屏蔽过程。在法律领域内使用持续预训练进行的实验强调了我们的方法在英语 LegalGLUE 基准上的有效性。预先训练的语言模型和代码可以免费使用。
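Sketch (not from the paper): the abstract describes ranking words by significance and letting that ranking guide masking, instead of choosing masked positions at random. A minimal version of that idea follows. The `scores` dictionary is a hypothetical stand-in for whatever genre and topicality weights the method computes, and `mask_ratio` is likewise an illustrative parameter.

```python
def select_mask_positions(tokens, scores, mask_ratio=0.15):
    """Pick the positions of the highest-scoring tokens, up to a masking budget."""
    budget = max(1, round(len(tokens) * mask_ratio))
    ranked = sorted(
        range(len(tokens)),
        key=lambda i: scores.get(tokens[i], 0.0),  # unknown words score 0
        reverse=True,
    )
    return sorted(ranked[:budget])


def apply_mask(tokens, positions, mask_token="[MASK]"):
    chosen = set(positions)
    return [mask_token if i in chosen else t for i, t in enumerate(tokens)]


# Hypothetical significance scores for a legal-domain sentence.
tokens = ["the", "court", "dismissed", "the", "appeal", "yesterday"]
scores = {"court": 0.9, "appeal": 0.8, "dismissed": 0.6, "yesterday": 0.2, "the": 0.01}

positions = select_mask_positions(tokens, scores, mask_ratio=0.34)
masked = apply_mask(tokens, positions)
print(masked)
```

With a budget of two tokens, the domain-specific words "court" and "appeal" are masked while function words are left intact.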

Title: Self-AMPLIFY: Improving Small Language Models with Self Post Hoc Explanations

Authors: Milan Bhan, Jean-Noel Vittaut, Nicolas Chesneau, Marie-Jeanne Lesot
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2402.12038
Pdf URL: https://arxiv.org/pdf/2402.12038
Copy Paste: [[2402.12038]] Self-AMPLIFY: Improving Small Language Models with Self Post Hoc Explanations(https://arxiv.org/abs/2402.12038)
Keywords: language model, llm, prompt
Abstract: Incorporating natural language rationales in the prompt and In-Context Learning (ICL) has led to a significant improvement in Large Language Model (LLM) performance. However, rationales currently require human annotation or the use of auxiliary proxy models to target promising samples or generate high-quality rationales. In this work, we propose Self-AMPLIFY to automatically generate rationales from post hoc explanation methods applied to Small Language Models (SLMs) to improve their own performance. Self-AMPLIFY is a 3-step method that targets samples, generates rationales and builds a final prompt to leverage ICL. Self-AMPLIFY performance is evaluated on two SLMs and two datasets requiring reasoning abilities: these experiments show that Self-AMPLIFY achieves good results against competitors. Self-AMPLIFY is the first method to apply post hoc explanation methods to SLMs to generate rationales that improve their own performance in a fully automated manner.
摘要：将自然语言原理融入提示和情境学习 (ICL) 中，使大型语言模型 (LLM) 的性能得到显着提高。然而，目前的基本原理需要人工注释或使用辅助代理模型来定位有希望的样本或生成高质量的基本原理。在这项工作中，我们提出 Self-AMPLIFY，从应用于小语言模型（SLM）的事后解释方法自动生成基本原理，以提高其自身的性能。 Self-AMPLIFY 是一种分 3 步的方法，它针对样本、生成理由并构建利用 ICL 的最终提示。 Self-AMPLIFY 的性能在两个 SLM 和两个需要推理能力的数据集上进行评估：这些实验表明，Self-AMPLIFY 相对于竞争对手取得了良好的结果。 Self-AMPLIFY 是第一种将事后解释方法应用于 SLM 的方法，以产生以完全自动化的方式提高自身绩效的理由。

Title: Model Tailor: Mitigating Catastrophic Forgetting in Multi-modal Large Language Models

Authors: Didi Zhu, Zhongyi Sun, Zexi Li, Tao Shen, Ke Yan, Shouhong Ding, Kun Kuang, Chao Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12048
Pdf URL: https://arxiv.org/pdf/2402.12048
Copy Paste: [[2402.12048]] Model Tailor: Mitigating Catastrophic Forgetting in Multi-modal Large Language Models(https://arxiv.org/abs/2402.12048)
Keywords: language model, llm
Abstract: Catastrophic forgetting emerges as a critical challenge when fine-tuning multi-modal large language models (MLLMs), where improving performance on unseen tasks often leads to a significant performance drop on the original tasks. This paper presents a comprehensive analysis of catastrophic forgetting in MLLMs and introduces a post-training adjustment method called Model Tailor. Our method primarily preserves the pre-trained parameters while replacing a small number (≤ 10%) of fine-tuned parameters, maintaining ~99% effectiveness on original tasks versus pre-training, and achieving ~97% on new tasks compared to standard fine-tuning. Specifically, we derive a sparse mask to identify the "model patch", based on a fusion strategy that integrates salience and sensitivity analysis. Subsequently, a compensation mechanism is introduced to "decorate the patch", enhancing the model's performance on both target and original tasks. Additionally, our method is adaptable to multi-task scenarios. Through extensive experiments on InstructBLIP and LLaVA-1.5 in both image captioning and visual question answering tasks, our approach demonstrates significant task adaptability while preserving inherent pre-trained capabilities.
摘要：在微调多模态大语言模型（MLLM）时，灾难性遗忘成为一个关键挑战，其中提高未见过的任务的性能通常会导致原始任务的性能显着下降。本文对 MLLM 中的灾难性遗忘进行了全面分析，并介绍了一种称为 Model Tailor 的训练后调整方法。我们的方法主要保留了预训练的参数，同时替换了少量（≤ 10%）的微调参数，与预训练相比，在原始任务上保持了约 99% 的有效性，与标准微调相比，在新任务上达到了约 97%。具体来说，我们基于集成显着性和敏感性分析的融合策略，得出稀疏掩模来识别“模型补丁”。随后，引入补偿机制来“装饰补丁”，增强模型在目标任务和原始任务上的性能。此外，我们的方法适用于多任务场景。通过在图像字幕和视觉问答任务中对 InstructBLIP 和 LLaVA-1.5 进行大量实验，我们的方法展示了显着的任务适应性，同时保留了固有的预训练功能。
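Sketch (not from the paper): the abstract describes keeping the pre-trained parameters and swapping in only a small fraction of fine-tuned ones through a sparse mask. The toy version below illustrates only that merge step. It scores each parameter by the absolute size of its fine-tuning change, which is a simple stand-in assumed here in place of the salience and sensitivity fusion, and it omits the compensation step entirely. All names and values are hypothetical.

```python
def build_patch_mask(pretrained, finetuned, ratio=0.10):
    """Return a 0/1 mask selecting the `ratio` fraction of most-changed parameters."""
    budget = max(1, int(len(pretrained) * ratio))
    change = [abs(f - p) for p, f in zip(pretrained, finetuned)]
    top = sorted(range(len(change)), key=lambda i: change[i], reverse=True)[:budget]
    mask = [0] * len(pretrained)
    for i in top:
        mask[i] = 1
    return mask


def apply_patch(pretrained, finetuned, mask):
    # Take the fine-tuned value where mask is 1, otherwise keep the pre-trained value.
    return [f if m else p for p, f, m in zip(pretrained, finetuned, mask)]


# Hypothetical flattened parameter vectors with ten entries.
pretrained = [0.10, -0.30, 0.25, 0.00, 0.40, -0.15, 0.05, 0.20, -0.45, 0.35]
finetuned = [0.12, -0.31, 0.90, 0.01, 0.41, -0.14, 0.05, 0.22, -0.44, 0.36]

mask = build_patch_mask(pretrained, finetuned, ratio=0.10)
merged = apply_patch(pretrained, finetuned, mask)
print(mask, merged)
```

With ten parameters and a 10% budget, exactly one entry (the one that moved most during fine-tuning) is replaced; the other nine keep their pre-trained values.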

Title: Small Models, Big Insights: Leveraging Slim Proxy Models To Decide When and What to Retrieve for LLMs

Authors: Jiejun Tan, Zhicheng Dou, Yutao Zhu, Peidong Guo, Kun Fang, Ji-Rong Wen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12052
Pdf URL: https://arxiv.org/pdf/2402.12052
Copy Paste: [[2402.12052]] Small Models, Big Insights: Leveraging Slim Proxy Models To Decide When and What to Retrieve for LLMs(https://arxiv.org/abs/2402.12052)
Keywords: language model, llm
Abstract: The integration of large language models (LLMs) and search engines represents a significant evolution in knowledge acquisition methodologies. However, determining the knowledge that an LLM already possesses and the knowledge that requires the help of a search engine remains an unresolved issue. Most existing methods solve this problem through the results of preliminary answers or reasoning done by the LLM itself, but this incurs excessively high computational costs. This paper introduces a novel collaborative approach, namely SlimPLM, that detects missing knowledge in LLMs with a slim proxy model, to enhance the LLM's knowledge acquisition process. We employ a proxy model which has far fewer parameters, and take its answers as heuristic answers. Heuristic answers are then utilized to predict the knowledge required to answer the user question, as well as the known and unknown knowledge within the LLM. We only conduct retrieval for the missing knowledge in questions that the LLM does not know. Extensive experimental results on five datasets with two LLMs demonstrate a notable improvement in the end-to-end performance of LLMs in question-answering tasks, achieving or surpassing current state-of-the-art models with lower LLM inference costs.
摘要：大语言模型 (LLM) 和搜索引擎的集成代表了知识获取方法的重大演变。然而，确定LLM已经拥有的知识和需要搜索引擎帮助的知识仍然是一个悬而未决的问题。现有的方法大多通过LLM本身的初步答案或推理结果来解决这个问题，但这会带来过高的计算成本。本文介绍了一种新颖的协作方法，即 SlimPLM，它通过瘦代理模型检测 LLM 中缺失的知识，以增强 LLM 的知识获取过程。我们采用参数少得多的代理模型，并将其答案作为启发式答案。然后利用启发式答案来预测回答用户问题所需的知识，以及 LLM 内的已知和未知知识。我们只对LLM不知道的问题中缺失的知识进行检索。使用两个 LLM 在五个数据集上进行的广泛实验结果表明，LLM 在问答任务中的端到端性能显着提高，以较低的 LLM 推理成本达到或超越了当前最先进的模型。

Title: Are LLM-based Evaluators Confusing NLG Quality Criteria?

Authors: Xinyu Hu, Mingqi Gao, Sen Hu, Yang Zhang, Yicheng Chen, Teng Xu, Xiaojun Wan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12055
Pdf URL: https://arxiv.org/pdf/2402.12055
Copy Paste: [[2402.12055]] Are LLM-based Evaluators Confusing NLG Quality Criteria?(https://arxiv.org/abs/2402.12055)
Keywords: llm
Abstract: Some prior work has shown that LLMs perform well in NLG evaluation for different tasks. However, we discover that LLMs seem to confuse different evaluation criteria, which reduces their reliability. For further verification, we first consider avoiding issues of inconsistent conceptualization and vague expression in existing NLG quality criteria themselves. We therefore summarize a clear hierarchical classification system for 11 common aspects, with the corresponding criteria drawn from previous studies. Inspired by behavioral testing, we elaborately design 18 types of aspect-targeted perturbation attacks for fine-grained analysis of the evaluation behaviors of different LLMs. We also conduct human annotations beyond the guidance of the classification system to validate the impact of the perturbations. Our experimental results reveal confusion issues inherent in LLMs, as well as other noteworthy phenomena, and call for further research on and improvements to LLM-based evaluation.
摘要：之前的一些工作表明，LLM 在不同任务的 NLG 评估中表现良好。然而，我们发现 LLM 似乎混淆了不同的评估标准，从而降低了其可靠性。为了进一步验证，我们首先考虑避免现有NLG质量标准本身概念化不一致和表达模糊的问题。因此，我们针对 11 个常见方面总结了一个清晰的层次分类系统，并给出了取自以往研究的相应标准。受行为测试的启发，我们精心设计了18种面向方面的扰动攻击，用于细粒度分析不同LLM的评估行为。我们还在分类系统的指导之外进行人工注释，以验证扰动的影响。我们的实验结果揭示了 LLM 固有的混淆问题以及其他值得注意的现象，需要对基于 LLM 的评估进行进一步的研究和改进。

Title: All Language Models Large and Small

Authors: Zhixun Chen, Yali Du, David Mguni
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2402.12061
Pdf URL: https://arxiv.org/pdf/2402.12061
Copy Paste: [[2402.12061]] All Language Models Large and Small(https://arxiv.org/abs/2402.12061)
Keywords: language model, llm
Abstract: Many leading language models (LMs) use high-intensity computational resources both during training and execution. This poses the challenge of lowering resource costs for deployment and faster execution of decision-making tasks among others. We introduce a novel plug-and-play LM framework named Language Optimising Network Distribution (LONDI) framework. LONDI learns to selectively employ large LMs only where complex decision-making and reasoning are required while using low-resource LMs everywhere else. LONDI consists of a system of two (off-)policy networks, an LM, a large LM (LLM), and a reinforcement learning module that uses switching controls to quickly learn in which system states to call the LLM. We then introduce a variant of LONDI that maintains budget constraints on LLM calls and hence on its resource usage. Theoretically, we prove that LONDI learns the subset of system states in which to activate the LLM in order to solve the task. We then prove that LONDI converges to optimal solutions while also preserving budgetary constraints on LLM calls almost surely, enabling it to solve various tasks while significantly lowering computational costs. We test LONDI's performance in a range of tasks in ScienceWorld and BabyAI-Text and demonstrate that LONDI can solve tasks only solvable by resource-intensive LLMs while reducing GPU usage by up to 30%.
摘要：许多领先的语言模型 (LM) 在训练和执行过程中都使用高强度的计算资源。这带来了降低部署资源成本和更快执行决策任务等方面的挑战。我们引入了一种新颖的即插即用 LM 框架，名为语言优化网络分布（LONDI）框架。 LONDI 学会仅在需要复杂决策和推理的情况下选择性地使用大型 LM，而在其他地方使用低资源 LM。 LONDI 由两个（离）策略网络、一个 LM、一个大型 LM (LLM) 和一个强化学习模块组成，该模块使用切换控制来快速学习在哪些系统状态下调用 LLM。然后，我们引入 LONDI 的一个变体，它可以维持 LLM 调用的预算限制，从而限制其资源使用。理论上，我们证明 LONDI 可以学习为解决任务而需要激活 LLM 的系统状态子集。然后，我们证明 LONDI 收敛于最佳解决方案，同时几乎必然地保持对 LLM 调用的预算限制，使其能够解决各种任务，同时显着降低计算成本。我们在 ScienceWorld 和 BabyAI-Text 中测试了 LONDI 在一系列任务中的性能，并证明 LONDI 可以解决只有资源密集型 LLM 才能解决的任务，同时将 GPU 使用率降低高达 30%。

Title: WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More

Authors: Yuxuan Yue, Zhihang Yuan, Haojie Duanmu, Sifan Zhou, Jianlong Wu, Liqiang Nie
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2402.12065
Pdf URL: https://arxiv.org/pdf/2402.12065
Copy Paste: [[2402.12065]] WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More(https://arxiv.org/abs/2402.12065)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process. This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers. We critically analyze the existing quantization approaches, identifying their limitations in balancing the accuracy and efficiency of the quantized LLMs. To advance beyond these limitations, we propose WKVQuant, a PTQ framework especially designed for quantizing weights and the key/value (KV) cache of LLMs. Specifically, we incorporate past-only quantization to improve the computation of attention. Additionally, we introduce a two-dimensional quantization strategy to handle the distribution of the KV cache, along with a cross-block reconstruction regularization for parameter optimization. Experiments show that WKVQuant achieves almost comparable memory savings to weight-activation quantization, while also approaching the performance of weight-only quantization.
摘要：由于大量的内存需求和自回归文本生成过程的计算需求，大型语言模型 (LLM) 面临着重大的部署挑战。本文通过关注 LLM 的量化来解决这些挑战，这是一种通过将模型参数和激活转换为低位整数来减少内存消耗的技术。我们批判性地分析了现有的量化方法，确定了它们在平衡量化后 LLM 的准确性和效率方面的局限性。为了超越这些限制，我们提出了 WKVQuant，这是一个专门为量化 LLM 的权重和键/值 (KV) 缓存而设计的 PTQ 框架。具体来说，我们结合了仅过去的量化来改进注意力的计算。此外，我们引入了二维量化策略来处理 KV 缓存的分布，以及用于参数优化的跨块重建正则化。实验表明，WKVQuant 实现了与权重激活量化几乎相当的内存节省，同时也接近仅权重量化的性能。

Title: Interpretable Brain-Inspired Representations Improve RL Performance on Visual Navigation Tasks

Authors: Moritz Lange, Raphael C. Engelhardt, Wolfgang Konen, Laurenz Wiskott
Subjects: cs.LG, cs.NE, cs.RO
Abstract URL: https://arxiv.org/abs/2402.12067
Pdf URL: https://arxiv.org/pdf/2402.12067
Copy Paste: [[2402.12067]] Interpretable Brain-Inspired Representations Improve RL Performance on Visual Navigation Tasks(https://arxiv.org/abs/2402.12067)
Keywords: agent
Abstract: Visual navigation requires a whole range of capabilities. A crucial one of these is the ability of an agent to determine its own location and heading in an environment. Prior works commonly assume this information as given, or use methods which lack a suitable inductive bias and accumulate error over time. In this work, we show how the method of slow feature analysis (SFA), inspired by neuroscience research, overcomes both limitations by generating interpretable representations of visual data that encode location and heading of an agent. We employ SFA in a modern reinforcement learning context, analyse and compare representations and illustrate where hierarchical SFA can outperform other feature extractors on navigation tasks.
摘要：视觉导航需要全方位的功能。其中至关重要的一项是智能体确定其自身位置和在环境中前进方向的能力。先前的工作通常假设该信息是给定的，或者使用缺乏合适的归纳偏差并随着时间的推移累积误差的方法。在这项工作中，我们展示了受神经科学研究启发的慢特征分析（SFA）方法如何通过生成编码代理位置和航向的视觉数据的可解释表示来克服这两个限制。我们在现代强化学习环境中使用 SFA，分析和比较表示，并说明分层 SFA 在导航任务上可以优于其他特征提取器的地方。

Title: EmoBench: Evaluating the Emotional Intelligence of Large Language Models

Authors: Sahand Sabour, Siyang Liu, Zheyuan Zhang, June M. Liu, Jinfeng Zhou, Alvionna S. Sunaryo, Juanzi Li, Tatia M.C. Lee, Rada Mihalcea, Minlie Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.12071
Pdf URL: https://arxiv.org/pdf/2402.12071
Copy Paste: [[2402.12071]] EmoBench: Evaluating the Emotional Intelligence of Large Language Models(https://arxiv.org/abs/2402.12071)
Keywords: language model, llm
Abstract: Recent advances in Large Language Models (LLMs) have highlighted the need for robust, comprehensive, and challenging benchmarks. Yet, research on evaluating their Emotional Intelligence (EI) is considerably limited. Existing benchmarks have two major shortcomings: first, they mainly focus on emotion recognition, neglecting essential EI capabilities such as emotion regulation and thought facilitation through emotion understanding; second, they are primarily constructed from existing datasets, which include frequent patterns, explicit information, and annotation errors, leading to unreliable evaluation. We propose EmoBench, a benchmark that draws upon established psychological theories and proposes a comprehensive definition for machine EI, including Emotional Understanding and Emotional Application. EmoBench includes a set of 400 hand-crafted questions in English and Chinese, which are meticulously designed to require thorough reasoning and understanding. Our findings reveal a considerable gap between the EI of existing LLMs and the average human, highlighting a promising direction for future research. Our code and data will be publicly available from https://github.com/Sahandfer/EmoBench.
摘要：大型语言模型 (LLM) 的最新进展凸显了对稳健、全面且具有挑战性的基准的需求。然而，评估他们的情商（EI）的研究相当有限。现有的基准有两个主要缺点：首先，它们主要关注情绪识别，忽视了情绪调节和通过情绪理解促进思维等重要的 EI 能力；其次，它们主要是根据现有数据集构建的，其中包括频繁模式、显式信息和注释错误，导致评估不可靠。我们提出了 EmoBench，这是一个基准，它借鉴了现有的心理学理论，并提出了机器 EI 的全面定义，包括情感理解和情感应用。 EmoBench 包含 400 道手工制作的英文和中文问题，这些问题经过精心设计，需要彻底的推理和理解。我们的研究结果揭示了现有法学硕士的 EI 与普通人之间存在相当大的差距，凸显了未来研究的一个有希望的方向。我们的代码和数据将在 https://github.com/Sahandfer/EmoBench 上公开提供。

Title: Can LLMs Compute with Reasons?

Authors: Harshit Sandilya, Peehu Raj, Jainit Sushil Bafna, Srija Mukhopadhyay, Shivansh Sharma, Ellwil Sharma, Arastu Sharma, Neeta Trivedi, Manish Shrivastava, Rajesh Kumar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12080
Pdf URL: https://arxiv.org/pdf/2402.12080
Copy Paste: [[2402.12080]] Can LLMs Compute with Reasons?(https://arxiv.org/abs/2402.12080)
Keywords: language model, llm
Abstract: Large language models (LLMs) often struggle with complex mathematical tasks and are prone to "hallucinating" incorrect answers due to their reliance on statistical patterns. This limitation is further amplified in average Small Language Models (SLMs) with limited context and training data. To address this challenge, we propose an "Inductive Learning" approach utilizing a distributed network of SLMs. This network leverages error-based learning and hint incorporation to refine the reasoning capabilities of SLMs. Our goal is to provide a framework that empowers SLMs to approach the level of logic-based applications achieved by high-parameter models, potentially benefiting any language model. Ultimately, this novel concept paves the way for bridging the logical gap between humans and LLMs across various fields.
摘要：大型语言模型 (LLM) 常常难以应对复杂的数学任务，由于依赖统计模式，容易产生“幻觉”错误答案。在上下文和训练数据有限的普通小型语言模型 (SLM) 中，这种限制被进一步放大。为了应对这一挑战，我们提出了一种利用分布式 SLM 网络的“归纳学习”方法。该网络利用基于错误的学习和提示合并来完善 SLM 的推理能力。我们的目标是提供一个框架，使 SLM 能够达到由高参数模型实现的基于逻辑的应用程序的水平，从而可能使任何语言模型受益。最终，这个新颖的概念为弥合人类和各个领域的 LLM 之间的逻辑差距铺平了道路。

Title: Do Large Language Models Understand Logic or Just Mimick Context?

Authors: Junbing Yan, Chengyu Wang, Jun Huang, Wei Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.12091
Pdf URL: https://arxiv.org/pdf/2402.12091
Copy Paste: [[2402.12091]] Do Large Language Models Understand Logic or Just Mimick Context?(https://arxiv.org/abs/2402.12091)
Keywords: language model, llm, prompt
Abstract: Over the past few years, the abilities of large language models (LLMs) have received extensive attention; these models have performed exceptionally well in complicated scenarios such as logical reasoning and symbolic inference. A significant factor contributing to this progress is the benefit of in-context learning and few-shot prompting. However, the reasons behind the success of such models using contextual reasoning have not been fully explored. Do LLMs understand logical rules to draw inferences, or do they "guess" the answers by learning a type of probabilistic mapping through context? This paper investigates the reasoning capabilities of LLMs on two logical reasoning datasets by using counterfactual methods to replace context text and modify logical concepts. Based on our analysis, we find that LLMs do not truly understand logical rules; rather, in-context learning has simply enhanced the likelihood of these models arriving at the correct answers. If one alters certain words in the context text or changes the concepts of logical terms, the outputs of LLMs can be significantly disrupted, leading to counter-intuitive responses. This work provides critical insights into the limitations of LLMs, underscoring the need for more robust mechanisms to ensure reliable logical reasoning in LLMs.
摘要：过去几年，大语言模型（LLM）的能力受到广泛关注，在逻辑推理、符号推理等复杂场景中表现异常出色。促成这一进步的一个重要因素是情境学习和少量提示的好处。然而，此类使用上下文推理的模型成功背后的原因尚未得到充分探讨。法学硕士是否理解逻辑规则来进行推论，或者他们是否通过上下文学习一种概率映射来“猜测”答案？本文通过使用反事实方法替换上下文文本并修改逻辑概念，研究了法学硕士在两个逻辑推理数据集上的推理能力。根据我们的分析，LLM并没有真正理解逻辑规则；相反，情境学习只是提高了这些模型得出正确答案的可能性。如果更改上下文文本中的某些单词或更改逻辑术语的概念，法学硕士的输出可能会受到严重干扰，从而导致反直觉的反应。这项工作提供了对法学硕士局限性的重要见解，强调需要更强大的机制来确保法学硕士的可靠逻辑推理。

Title: Groot: Adversarial Testing for Generative Text-to-Image Models with Tree-based Semantic Transformation

Authors: Yi Liu, Guowei Yang, Gelei Deng, Feiyue Chen, Yuqi Chen, Ling Shi, Tianwei Zhang, Yang Liu
Subjects: cs.CL, cs.AI, cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2402.12100
Pdf URL: https://arxiv.org/pdf/2402.12100
Copy Paste: [[2402.12100]] Groot: Adversarial Testing for Generative Text-to-Image Models with Tree-based Semantic Transformation(https://arxiv.org/abs/2402.12100)
Keywords: llm, prompt
Abstract: With the prevalence of text-to-image generative models, their safety becomes a critical concern. Adversarial testing techniques have been developed to probe whether such models can be prompted to produce Not-Safe-For-Work (NSFW) content. However, existing solutions face several challenges, including low success rate and inefficiency. We introduce Groot, the first automated framework leveraging tree-based semantic transformation for adversarial testing of text-to-image models. Groot employs semantic decomposition and sensitive element drowning strategies in conjunction with LLMs to systematically refine adversarial prompts. Our comprehensive evaluation confirms the efficacy of Groot, which not only exceeds the performance of current state-of-the-art approaches but also achieves a remarkable success rate (93.66%) on leading text-to-image models such as DALL-E 3 and Midjourney.
摘要：随着文本到图像生成模型的流行，其安全性成为一个关键问题。对抗性测试技术已经被开发出来，以探究是否可以提示此类模型生成不适合工作场合（NSFW）的内容。然而，现有的解决方案面临着一些挑战，包括成功率低和效率低下。我们介绍 Groot，这是第一个利用基于树的语义转换进行文本到图像模型的对抗性测试的自动化框架。 Groot 将语义分解和敏感元素淹没策略与 LLM 结合使用，系统地完善对抗性提示。我们的综合评估证实了 Groot 的功效，它不仅超过了当前最先进方法的性能，而且在领先的文本到图像模型（例如 DALL-E 3 和 Midjourney）上取得了显著的成功率（93.66%）。

Title: Is It a Free Lunch for Removing Outliers during Pretraining?

Authors: Baohao Liao, Christof Monz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.12102
Pdf URL: https://arxiv.org/pdf/2402.12102
Copy Paste: [[2402.12102]] Is It a Free Lunch for Removing Outliers during Pretraining?(https://arxiv.org/abs/2402.12102)
Keywords: language model
Abstract: With the growing size of large language models, the role of quantization becomes increasingly significant. However, outliers present in weights or activations notably influence the performance of quantized models. Recently, \citet{qtransformer} introduced a novel softmax function aimed at pretraining models in an outlier-free manner, thereby enhancing their suitability for quantization. Interestingly, we observed that such an approach leads to performance degradation in full precision. Building on this insight, we enhance the method by ensuring its normalization is invariant to sequence length, a crucial factor for bridging the gap between pretraining and fine-tuning. Moreover, this improved method also facilitates successful pretraining of causal language models.
摘要：随着大型语言模型规模的不断扩大，量化的作用变得越来越重要。然而，权重或激活中存在的异常值显着影响量化模型的性能。最近，\citet{qtransformer} 引入了一种新颖的 softmax 函数，旨在以无异常值的方式预训练模型，从而增强其量化的适用性。有趣的是，我们观察到这种方法会导致完全精度的性能下降。基于这一见解，我们通过确保其归一化对序列长度不变来增强该方法，这是弥合预训练和微调之间差距的关键因素。此外，这种改进的方法还有助于因果语言模型的成功预训练。

Title: Evaluating Image Review Ability of Vision Language Models

Authors: Shigeki Saito, Kazuki Hayashi, Yusuke Ide, Yusuke Sakai, Kazuma Onishi, Toma Suzuki, Seiji Gobara, Hidetaka Kamigaito, Katsuhiko Hayashi, Taro Watanabe
Subjects: cs.CL, cs.AI, cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2402.12121
Pdf URL: https://arxiv.org/pdf/2402.12121
Copy Paste: [[2402.12121]] Evaluating Image Review Ability of Vision Language Models(https://arxiv.org/abs/2402.12121)
Keywords: language model
Abstract: Large-scale vision language models (LVLMs) are language models that are capable of processing images and text inputs by a single model. This paper explores the use of LVLMs to generate review texts for images. The ability of LVLMs to review images is not fully understood, highlighting the need for a methodical evaluation of their review abilities. Unlike image captions, review texts can be written from various perspectives such as image composition and exposure. This diversity of review perspectives makes it difficult to uniquely determine a single correct review for an image. To address this challenge, we introduce an evaluation method based on rank correlation analysis, in which review texts are ranked by humans and LVLMs and the correlation between these rankings is then measured. We further validate this approach by creating a benchmark dataset aimed at assessing the image review ability of recent LVLMs. Our experiments with the dataset reveal that LVLMs, particularly those with proven superiority in other evaluative contexts, excel at distinguishing between high-quality and substandard image reviews.
摘要：大规模视觉语言模型（LVLM）是能够通过单个模型处理图像和文本输入的语言模型。本文探讨了使用 LVLM 生成图像评论文本。 LVLM 审查图像的能力尚未完全了解，这凸显了对其审查能力进行系统评估的必要性。与图像说明不同，评论文本可以从图像构图和曝光等多种角度进行撰写。评论视角的多样性使得很难唯一地确定对图像的单一正确评论。为了应对这一挑战，我们引入了一种基于排名相关分析的评估方法，其中评论文本由人类和 LVLM 进行排名，然后测量这些排名之间的相关性。我们通过创建一个旨在评估最近 LVLM 的图像审查能力的基准数据集来进一步验证这种方法。我们对数据集的实验表明，LVLM，尤其是那些在其他评估环境中已被证明具有优越性的 LVLM，擅长区分高质量和不合格的图像评论。
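
Sketch: the rank-correlation step described in this abstract can be illustrated with a few lines of Python. The abstract does not say which correlation coefficient is used, so Spearman's rho is an assumption here, and the function name `spearman_rho` and the example ranks are hypothetical. The formula below is valid only for tie-free rankings of the same items.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rankings of the same items."""
    if len(rank_a) != len(rank_b):
        raise ValueError("rankings must have the same length")
    n = len(rank_a)
    if n < 2:
        raise ValueError("need at least two items")
    # Sum of squared rank differences, item by item.
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Hypothetical ranks (1 = best) assigned to five review texts of one image.
human_ranks = [1, 2, 3, 4, 5]
model_ranks = [2, 1, 3, 5, 4]
rho = spearman_rho(human_ranks, model_ranks)
print(f"Spearman rho between human and model rankings: {rho:.2f}")
```

A value near 1 would mean the model orders the review texts much as the human annotators do; a value near -1 would mean it reverses their order.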

Title: Meta Ranking: Less Capable Language Models are Capable for Single Response Judgement

Authors: Zijun Liu, Boqun Kou, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Yang Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.12146
Pdf URL: https://arxiv.org/pdf/2402.12146
Copy Paste: [[2402.12146]] Meta Ranking: Less Capable Language Models are Capable for Single Response Judgement(https://arxiv.org/abs/2402.12146)
Keywords: language model, gpt, llm, hallucination
Abstract: Although Large Language Models (LLMs) have demonstrated strong performance on a wide range of tasks, they still face reliability challenges such as hallucination. Previous studies reveal that highly capable LLMs like GPT-4 are effective in judging the reliability of individual responses, while less capable ones are often tuned to evaluate the relative reliability of responses to the same query. To enable less capable LLMs to effectively judge the reliability of individual responses, we propose a novel method named Meta Ranking (MR). Unlike previous methods, which assess the response directly, we achieve the judgement by comparing the target query-response pair with reference query-response pairs. We find it remarkably effective at error detection for LLM responses on reasoning tasks, where less capable LLMs could outperform strong baselines, even without fine-tuning. We further demonstrate that MR can be used to enhance the performance of LLMs in two practical applications: query routing and iterative training data filtering. The former achieves performance comparable to GPT-4-turbo with less than half the token consumption, while the latter makes the instruction-tuned LLaMA-7B and Phi-2, a 2.7B model, significantly surpass Alpaca-13B with fewer training samples, underscoring the high potential of our proposed method.
摘要：尽管大型语言模型（LLM）在广泛的任务中表现出了强大的性能，但它们仍然面临着幻觉等可靠性挑战。先前的研究表明，像 GPT-4 这样能力强的法学硕士可以有效地判断个人回答的可靠性，而能力较差的法学硕士通常会被调整来评估对同一查询的回答的相对可靠性。为了使能力较差的法学硕士能够有效地判断个人回答的可靠性，我们提出了一种名为 $\textit{Meta}$ $\textit{Ranking}$ (MR) 的新方法。与之前直接评估响应的方法不同，我们通过将目标查询-响应对与参考查询-响应对进行比较来实现判断。我们发现它在推理任务中 LLM 响应的错误检测方面具有显着的有效性，其中能力较差的 LLM 甚至可以在没有微调的情况下胜过强大的基线。我们进一步证明，MR 可用于增强 LLM 在两个实际应用中的性能：查询路由和迭代训练数据过滤。前者以不到一半的代币消耗实现了与 GPT-4-turbo 相当的性能，而后者使指令调整的 LLaMA-7B 和 Phi-2（2.7B 模型）在更少的训练样本上显着超越 Alpaca-13B，强调了这一点我们提出的方法的巨大潜力。

Title: End-to-end multilingual fact-checking at scale

Authors: Vinay Setty
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.12147
Pdf URL: https://arxiv.org/pdf/2402.12147
Copy Paste: [[2402.12147]] End-to-end multilingual fact-checking at scale(https://arxiv.org/abs/2402.12147)
Keywords: language model, gpt
Abstract: In this article, we describe how you can perform end-to-end fact-checking in over 100 languages using Factiverse AI models. We also show through an experimental benchmark that fine-tuned models tailored for fact-checking tasks outperform Large Language Models such as GPT-4, GPT-3.5-Turbo, and Mistral-7b.
摘要：在本文中，我们将介绍如何使用 Factiverse AI 模型以 100 多种语言执行端到端事实检查。我们还通过实验基准表明，为事实检查任务量身定制的微调模型优于 GPT-4、GPT-3.5-Turbo 和 Mistral-7b 等大型语言模型。

Title: Your Large Language Model is Secretly a Fairness Proponent and You Should Prompt it Like One

Authors: Tianlin Li, Xiaoyu Zhang, Chao Du, Tianyu Pang, Qian Liu, Qing Guo, Chao Shen, Yang Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.12150
Pdf URL: https://arxiv.org/pdf/2402.12150
Copy Paste: [[2402.12150]] Your Large Language Model is Secretly a Fairness Proponent and You Should Prompt it Like One(https://arxiv.org/abs/2402.12150)
Keywords: language model, gpt, llm, prompt
Abstract: The widespread adoption of large language models (LLMs) underscores the urgent need to ensure their fairness. However, LLMs frequently present dominant viewpoints while ignoring alternative perspectives from minority parties, resulting in potential biases. We hypothesize that these fairness-violating behaviors occur because LLMs express their viewpoints using a human personality that represents the majority of training data. In response to this, we validate that prompting LLMs with specific roles can allow LLMs to express diverse viewpoints. Building on this insight and observation, we develop FairThinking, a pipeline designed to automatically generate roles that enable LLMs to articulate diverse perspectives for fair expressions. To evaluate FairThinking, we create a dataset with a thousand items covering three fairness-related topics and conduct experiments on GPT-3.5, GPT-4, Llama2, and Mistral to demonstrate its superior performance.
摘要：大型语言模型（LLM）的广泛采用凸显了确保其公平性的迫切需要。然而，法学硕士经常提出主流观点，而忽略少数党派的替代观点，从而导致潜在的偏见。我们假设这些违反公平行为的发生是因为法学硕士使用代表大多数训练数据的人格来表达他们的观点。针对这一点，我们验证了提示法学硕士具有特定的角色可以让法学硕士表达不同的观点。基于这种洞察和观察，我们开发了 FairThinking，这是一个旨在自动生成角色的管道，使法学硕士能够阐明公平表达的不同观点。为了评估 FairThinking，我们创建了一个包含 1000 个项目的数据集，涵盖三个公平相关主题，并在 GPT-3.5、GPT-4、Llama2 和 Mistral 上进行实验，以证明其优越的性能。

Title: Transformer-based Causal Language Models Perform Clustering

Authors: Xinbo Wu, Lav R. Varshney
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.12151
Pdf URL: https://arxiv.org/pdf/2402.12151
Copy Paste: [[2402.12151]] Transformer-based Causal Language Models Perform Clustering(https://arxiv.org/abs/2402.12151)
Keywords: language model, llm
Abstract: Even though large language models (LLMs) have demonstrated remarkable capability in solving various natural language tasks, the capability of an LLM to follow human instructions is still a concern. Recent works have shown great improvements in the instruction-following capability via additional training for instruction-following tasks. However, the mechanisms responsible for effective instruction-following capabilities remain inadequately understood. Here, we introduce a simplified instruction-following task and use synthetic datasets to analyze a Transformer-based causal language model. Our findings suggest that the model learns task-specific information by clustering data within its hidden space, with this clustering process evolving dynamically during learning. We also demonstrate how this phenomenon assists the model in handling unseen instances and validate our results in a more realistic setting.
摘要：尽管大型语言模型（LLM）在解决各种自然语言任务方面表现出了卓越的能力，但 LLM 遵循人类指令的能力仍然是一个问题。最近的研究表明，通过对指令执行任务进行额外的训练，指令执行能力得到了很大的提高。然而，负责有效遵循指令能力的机制仍然没有得到充分理解。在这里，我们引入了一个简化的指令跟踪任务，并使用合成数据集来分析基于 Transformer 的因果语言模型。我们的研究结果表明，该模型通过在其隐藏空间内聚类数据来学习特定于任务的信息，并且该聚类过程在学习过程中动态发展。我们还演示了这种现象如何帮助模型处理未见过的实例，并在更现实的环境中验证我们的结果。

Title: Endowing Pre-trained Graph Models with Provable Fairness

Authors: Zhongjian Zhang, Mengmei Zhang, Yue Yu, Cheng Yang, Jiawei Liu, Chuan Shi
Subjects: cs.LG, cs.AI, cs.CY, cs.SI
Abstract URL: https://arxiv.org/abs/2402.12161
Pdf URL: https://arxiv.org/pdf/2402.12161
Copy Paste: [[2402.12161]] Endowing Pre-trained Graph Models with Provable Fairness(https://arxiv.org/abs/2402.12161)
Keywords: language model
Abstract: Pre-trained graph models (PGMs) aim to capture transferable inherent structural properties and apply them to different downstream tasks. Similar to pre-trained language models, PGMs also inherit biases from human society, resulting in discriminatory behavior in downstream applications. The debiasing process of existing fair methods is generally coupled with parameter optimization of GNNs. However, different downstream tasks may be associated with different sensitive attributes in reality, so directly employing existing methods to improve the fairness of PGMs is inflexible and inefficient. Moreover, most of them lack a theoretical guarantee, i.e., provable lower bounds on the fairness of model predictions, which directly provides assurance in a practical scenario. To overcome these limitations, we propose a novel adapter-tuning framework that endows pre-trained Graph models with Provable fAiRness (called GraphPAR). GraphPAR freezes the parameters of PGMs and trains a parameter-efficient adapter to flexibly improve the fairness of PGMs in downstream tasks. Specifically, we design a sensitive semantic augmenter on node representations, to extend the node representations with different sensitive attribute semantics for each node. The extended representations will be used to further train an adapter, to prevent the propagation of sensitive attribute semantics from PGMs to task predictions. Furthermore, with GraphPAR, we quantify whether the fairness of each node is provable, i.e., predictions are always fair within a certain range of sensitive attribute semantics. Experimental evaluations on real-world datasets demonstrate that GraphPAR achieves state-of-the-art prediction performance and fairness on node classification tasks. Furthermore, based on our GraphPAR, around 90% of nodes have provable fairness.
摘要：预训练图模型（PGM）旨在捕获可转移的固有结构属性并将其应用于不同的下游任务。与预先训练的语言模型类似，PGM 也继承了人类社会的偏见，导致下游应用中出现歧视行为。现有公平方法的去偏过程通常与 GNN 的参数优化相结合。然而，现实中不同的下游任务可能与不同的敏感属性相关联，直接利用现有的方法来提高PGM的公平性是不灵活和低效的。此外，它们中的大多数缺乏理论保证，即模型预测公平性的可证明下限，这直接在实际场景中提供保证。为了克服这些限制，我们提出了一种新颖的适配器调整框架，该框架赋予预训练的 \textbf{Graph} 模型以 \textbf{P}rovable f\textbf{A}i\textbf{R}ness（称为 GraphPAR）。 GraphPAR 冻结 PGM 的参数并训练参数高效的适配器，以灵活地提高 PGM 在下游任务中的公平性。具体来说，我们在节点表示上设计了一个敏感语义增强器，为每个节点扩展具有不同敏感属性语义的节点表示。扩展表示将用于进一步训练适配器，以防止敏感属性语义从 PGM 传播到任务预测。此外，通过 GraphPAR，我们量化每个节点的公平性是否可证明，即预测在敏感属性语义的一定范围内始终是公平的。对真实数据集的实验评估表明，GraphPAR 在节点分类任务上实现了最先进的预测性能和公平性。此外，根据我们的 GraphPAR，大约 90% 的节点具有可证明的公平性。

Title: Unsupervised LLM Adaptation for Question Answering

Authors: Kuniaki Saito, Kihyuk Sohn, Chen-Yu Lee, Yoshitaka Ushiku
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.12170
Pdf URL: https://arxiv.org/pdf/2402.12170
Copy Paste: [[2402.12170]] Unsupervised LLM Adaptation for Question Answering(https://arxiv.org/abs/2402.12170)
Keywords: language model, llm
Abstract: Large language models (LLMs) learn diverse knowledge present in the large-scale training dataset via self-supervised training. After subsequent instruction tuning, an LLM acquires the ability to return correct information for diverse questions. However, adapting these pre-trained LLMs to new target domains, such as different organizations or periods, for the question-answering (QA) task incurs a substantial annotation cost. To tackle this challenge, we propose a novel task, unsupervised LLM adaptation for question answering. In this task, we leverage a pre-trained LLM, a publicly available QA dataset (source data), and unlabeled documents from the target domain. Our goal is to learn an LLM that can answer questions about the target domain. We introduce one synthetic and two real datasets to evaluate models fine-tuned on the source and target data, and reveal intriguing insights: (i) fine-tuned models exhibit the ability to provide correct answers for questions about the target domain even though they do not see any questions about the information described in the unlabeled documents, but (ii) they have difficulties in accessing information located in the middle or at the end of documents, and (iii) this challenge can be partially mitigated by replacing input tokens with random ones during adaptation.
摘要：大型语言模型（LLM）通过自我监督训练学习大规模训练数据集中存在的各种知识。通过指令调整，法学硕士获得了针对不同问题返回正确信息的能力。然而，将这些预先训练的法学硕士适应新的目标领域，例如不同的组织或时期，以进行问答（QA）任务会产生大量的注释成本。为了应对这一挑战，我们提出了一项新颖的任务，即无监督的法学硕士适应问题回答。在此任务中，我们利用预先训练的 LLM、公开可用的 QA 数据集（源数据）以及来自目标域的未标记文档。我们的目标是学习能够回答有关目标领域问题的法学硕士。我们引入一个合成数据集和两个真实数据集来评估根据源数据和目标数据微调的模型，并揭示有趣的见解； (i) 微调模型表现出为有关目标域的问题提供正确答案的能力，即使它们没有看到有关未标记文档中描述的信息的任何问题，但是 (ii) 它们在访问位于目标域中的信息时遇到困难文档的中间或末尾，以及（iii）可以通过在适应过程中用随机令牌替换输入令牌来部分缓解这一挑战。
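
Sketch: finding (iii) above, replacing input tokens with random ones during adaptation, could look roughly like the following. The abstract gives no details, so everything specific here is an assumption: the function name `randomly_replace_tokens`, the per-position replacement probability, uniform sampling over the vocabulary, and the example token ids are all hypothetical and are not the authors' procedure.

```python
import random


def randomly_replace_tokens(token_ids, vocab_size, replace_prob, rng):
    """Return a copy of token_ids in which each position is, with
    probability replace_prob, swapped for a uniformly sampled token id."""
    noised = []
    for tok in token_ids:
        if rng.random() < replace_prob:
            # Assumption: replacements are drawn uniformly from the vocabulary.
            noised.append(rng.randrange(vocab_size))
        else:
            noised.append(tok)
    return noised


rng = random.Random(0)  # fixed seed so the sketch is reproducible
original = [101, 2023, 2003, 1037, 6254, 102]  # hypothetical token ids
noised = randomly_replace_tokens(original, vocab_size=30000, replace_prob=0.3, rng=rng)
print(original)
print(noised)
```

In a fine-tuning loop, the noised ids would be fed as model input while the training targets stay unchanged; whether the paper applies the noise this way is not stated in the abstract.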

Title: BIDER: Bridging Knowledge Inconsistency for Efficient Retrieval-Augmented LLMs via Key Supporting Evidence

Authors: Jiajie Jin, Yutao Zhu, Yujia Zhou, Zhicheng Dou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12174
Pdf URL: https://arxiv.org/pdf/2402.12174
Copy Paste: [[2402.12174]] BIDER: Bridging Knowledge Inconsistency for Efficient Retrieval-Augmented LLMs via Key Supporting Evidence(https://arxiv.org/abs/2402.12174)
Keywords: language model, llm
Abstract: Retrieval-augmented large language models (LLMs) have demonstrated efficacy in knowledge-intensive tasks such as open-domain QA, addressing inherent challenges in knowledge update and factual inadequacy. However, inconsistencies between retrieval knowledge and the necessary knowledge for LLMs lead to a decline in LLMs' answer quality. This paper introduces BIDER, an approach that refines retrieval documents into Key Supporting Evidence (KSE) through knowledge synthesis, supervised fine-tuning (SFT), and preference alignment. We train BIDER by learning from crafting KSE, while maximizing its output to align with LLMs' information acquisition preferences through reinforcement learning. Evaluations across five datasets show BIDER boosts LLMs' answer quality by 7% while reducing input content length in retrieval documents by 80%, outperforming existing methods. The proposed KSE simulation effectively equips LLMs with essential information for accurate question answering.
摘要：检索增强的大型语言模型 (LLM) 在知识密集型任务（例如开放域 QA）中表现出了有效性，解决了知识更新和事实不足方面的固有挑战。然而，检索知识与LLM必备知识不一致，导致LLM的答案质量下降。本文介绍了 BIDER，这是一种通过知识合成、监督微调 (SFT) 和偏好对齐将检索文档细化为关键支持证据 (KSE) 的方法。我们通过学习制作 KSE 来训练 BIDER，同时通过强化学习最大化其输出以符合 LLM 的信息获取偏好。对五个数据集的评估显示，BIDER 将法学硕士的答案质量提高了 7%，同时将检索文档中的输入内容长度减少了 80%，优于现有方法。所提出的 KSE 模拟有效地为法学硕士提供了准确回答问题的基本信息。

Title: Mafin: Enhancing Black-Box Embeddings with Model Augmented Fine-tuning

Authors: Mingtian Zhang, Shawn Lan, Peter Hayes, David Barber
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2402.12177
Pdf URL: https://arxiv.org/pdf/2402.12177
Copy Paste: [[2402.12177]] Mafin: Enhancing Black-Box Embeddings with Model Augmented Fine-tuning(https://arxiv.org/abs/2402.12177)
Keywords: language model, llm, hallucination, retrieval augmented generation
Abstract: Retrieval Augmented Generation (RAG) has emerged as an effective solution for mitigating hallucinations in Large Language Models (LLMs). The retrieval stage in RAG typically involves a pre-trained embedding model, which converts queries and passages into vectors to capture their semantics. However, a standard pre-trained embedding model may exhibit sub-optimal performance when applied to specific domain knowledge, necessitating fine-tuning. This paper addresses scenarios where the embeddings are only available from a black-box model. We introduce Model augmented fine-tuning (Mafin) -- a novel approach for fine-tuning a black-box embedding model by augmenting it with a trainable embedding model. Our results demonstrate that Mafin significantly enhances the performance of the black-box embeddings by only requiring the training of a small augmented model. We validate the effectiveness of our method on both labeled and unlabeled datasets, illustrating its broad applicability and efficiency.
摘要：检索增强生成 (RAG) 已成为减轻大型语言模型 (LLM) 中幻觉的有效解决方案。 RAG 中的检索阶段通常涉及预先训练的嵌入模型，该模型将查询和段落转换为向量以捕获其语义。然而，标准的预训练嵌入模型在应用于特定领域知识时可能会表现出次优的性能，因此需要进行微调。本文解决了只能从黑盒模型获得嵌入的场景。我们引入模型增强微调（Mafin）——一种通过使用可训练的嵌入模型来增强黑盒嵌入模型来微调黑盒嵌入模型的新颖方法。我们的结果表明，Mafin 只需训练一个小型增强模型即可显着增强黑盒嵌入的性能。我们在标记和未标记数据集上验证了我们的方法的有效性，说明了其广泛的适用性和效率。

Title: Amplifying Training Data Exposure through Fine-Tuning with Pseudo-Labeled Memberships

Authors: Myung Gyo Oh, Hong Eun Ahn, Leo Hyun Park, Taekyoung Kwon
Subjects: cs.CL, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2402.12189
Pdf URL: https://arxiv.org/pdf/2402.12189
Copy Paste: [[2402.12189]] Amplifying Training Data Exposure through Fine-Tuning with Pseudo-Labeled Memberships(https://arxiv.org/abs/2402.12189)
Keywords: language model
Abstract: Neural language models (LMs) are vulnerable to training data extraction attacks due to data memorization. This paper introduces a novel attack scenario wherein an attacker adversarially fine-tunes pre-trained LMs to amplify the exposure of the original training data. This strategy differs from prior studies by aiming to intensify the LM's retention of its pre-training dataset. To achieve this, the attacker needs to collect generated texts that are closely aligned with the pre-training data. However, without knowledge of the actual dataset, quantifying the amount of pre-training data within generated texts is challenging. To address this, we propose the use of pseudo-labels for these generated texts, leveraging membership approximations indicated by machine-generated probabilities from the target LM. We subsequently fine-tune the LM to favor generations with higher likelihoods of originating from the pre-training data, based on their membership probabilities. Our empirical findings indicate a remarkable outcome: LMs with over 1B parameters exhibit a four to eight-fold increase in training data exposure. We discuss potential mitigations and suggest future research directions.
摘要：由于数据记忆，神经语言模型 (LM) 很容易受到训练数据提取攻击。本文介绍了一种新颖的攻击场景，其中攻击者对预先训练的 LM 进行对抗性微调，以放大原始训练数据的暴露程度。该策略与之前的研究不同，旨在加强 LM 对预训练数据集的保留。为了实现这一点，攻击者需要收集与预训练数据密切相关的生成文本。然而，在不了解实际数据集的情况下，量化生成文本中的预训练数据量具有挑战性。为了解决这个问题，我们建议对这些生成的文本使用伪标签，利用目标 LM 中机器生成的概率指示的隶属度近似值。随后，我们根据成员概率对 LM 进行微调，以支持更有可能源自预训练数据的世代。我们的实证研究结果表明了一个显着的结果：参数超过 1B 的 LM 的训练数据暴露量增加了四到八倍。我们讨论潜在的缓解措施并提出未来的研究方向。

Title: A Chinese Dataset for Evaluating the Safeguards in Large Language Models

Authors: Yuxia Wang, Zenan Zhai, Haonan Li, Xudong Han, Lizhi Lin, Zhenxuan Zhang, Jingru Zhao, Preslav Nakov, Timothy Baldwin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12193
Pdf URL: https://arxiv.org/pdf/2402.12193
Copy Paste: [[2402.12193]] A Chinese Dataset for Evaluating the Safeguards in Large Language Models(https://arxiv.org/abs/2402.12193)
Keywords: language model, llm, prompt
Abstract: Many studies have demonstrated that large language models (LLMs) can produce harmful responses, exposing users to unexpected risks when LLMs are deployed. Previous studies have proposed comprehensive taxonomies of the risks posed by LLMs, as well as corresponding prompts that can be used to examine the safety mechanisms of LLMs. However, the focus has been almost exclusively on English, and little has been explored for other languages. Here we aim to bridge this gap. We first introduce a dataset for the safety evaluation of Chinese LLMs, and then extend it to two other scenarios that can be used to better identify false negative and false positive examples in terms of risky prompt rejections. We further present a set of fine-grained safety assessment criteria for each risk type, facilitating both manual annotation and automatic evaluation in terms of LLM response harmfulness. Our experiments on five LLMs show that region-specific risks are the prevalent type of risk, presenting the major issue with all Chinese LLMs we experimented with. Warning: this paper contains example data that may be offensive, harmful, or biased.
摘要：许多研究表明，大型语言模型 (LLM) 可能会产生有害的响应，使用户在部署 LLM 时面临意想不到的风险。先前的研究提出了法学硕士所带来的风险的全面分类，以及可用于检查法学硕士安全机制的相应提示。然而，焦点几乎完全集中在英语上，而很少对其他语言进行探索。我们的目标是弥合这一差距。我们首先引入了一个用于中国法学硕士安全评估的数据集，然后将其扩展到其他两个场景，可用于更好地识别风险提示拒绝方面的假阴性和假阳性示例。我们进一步为每种风险类型提出了一套细粒度的安全评估标准，促进了LLM响应危害性方面的手动注释和自动评估。我们对五个法学硕士的实验表明，地区特定风险是普遍存在的风险类型，这对我们实验过的所有中国法学硕士来说都是一个主要问题。警告：本文包含可能令人反感、有害或有偏见的示例数据。

Title: Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion

Authors: Ziyue Wang, Chi Chen, Yiqi Zhu, Fuwen Luo, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Maosong Sun, Yang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12195
Pdf URL: https://arxiv.org/pdf/2402.12195
Copy Paste: [[2402.12195]] Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion(https://arxiv.org/abs/2402.12195)
Keywords: language model, llm
Abstract: With the bloom of Large Language Models (LLMs), Multimodal Large Language Models (MLLMs) that incorporate LLMs with pre-trained vision models have recently demonstrated impressive performance across diverse vision-language tasks. However, they fall short in comprehending context involving multiple images. A primary reason for this shortcoming is that the visual features for each image are encoded individually by frozen encoders before feeding into the LLM backbone, lacking awareness of other images and the multimodal instructions. We term this issue prior-LLM modality isolation and propose a two-phase paradigm, browse-and-concentrate, to enable in-depth multimodal context fusion prior to feeding the features into LLMs. This paradigm initially "browses" through the inputs for essential insights, and then revisits the inputs to "concentrate" on crucial details, guided by these insights, to achieve a more comprehensive understanding of the multimodal inputs. Additionally, we develop training strategies specifically to enhance the understanding of multi-image inputs. Our method markedly boosts the performance on 7 multi-image scenarios, increasing average accuracy by 2.13% and 7.60% against strong MLLM baselines with 3B and 11B LLMs, respectively.
摘要：随着大型语言模型 (LLM) 的蓬勃发展，将 LLM 与预训练视觉模型相结合的多模态大型语言模型 (MLLM) 最近在各种视觉语言任务中表现出了令人印象深刻的性能。然而，它们无法理解涉及多个图像的上下文。造成这一缺点的主要原因是，每个图像的视觉特征在输入 LLM 主干之前由冻结编码器单独编码，缺乏对其他图像和多模态指令的了解。我们将这个问题称为先前的 LLM 模态隔离，并提出了一个两阶段范例，即浏览和集中，以在将特征输入 LLM 之前实现深入的多模态上下文融合。该范式首先“浏览”输入以获取基本见解，然后在这些见解的指导下重新审视输入以“集中”关键细节，以实现对多模式输入的更全面的理解。此外，我们还专门开发了训练策略来增强对多图像输入的理解。我们的方法显着提高了 7 个多图像场景的性能，与 3B 和 11B LLM 的强 MLLM 基线相比，平均准确度分别提高了 2.13% 和 7.60%。

Title: Zero shot VLMs for hate meme detection: Are we there yet?

Authors: Naquee Rizwan, Paramananda Bhaskar, Mithun Das, Swadhin Satyaprakash Majhi, Punyajoy Saha, Animesh Mukherjee
Subjects: cs.CL, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2402.12198
Pdf URL: https://arxiv.org/pdf/2402.12198
Copy Paste: [[2402.12198]] Zero shot VLMs for hate meme detection: Are we there yet?(https://arxiv.org/abs/2402.12198)
Keywords: language model, prompt
Abstract: Multimedia content on social media is rapidly evolving, with memes gaining prominence as a distinctive form. Unfortunately, some malicious users exploit memes to target individuals or vulnerable communities, making it imperative to identify and address such instances of hateful memes. Extensive research has been conducted to address this issue by developing hate meme detection models. However, a notable limitation of traditional machine/deep learning models is the requirement for labeled datasets for accurate classification. Recently, the research community has witnessed the emergence of several visual language models that have exhibited outstanding performance across various tasks. In this study, we aim to investigate the efficacy of these visual language models in handling intricate tasks such as hate meme detection. We use various prompt settings to focus on zero-shot classification of hateful/harmful memes. Through our analysis, we observe that large VLMs remain vulnerable in zero-shot hate meme detection.
摘要：社交媒体上的多媒体内容正在迅速发展，模因作为一种独特的形式日益受到重视。不幸的是，一些恶意用户利用模因来针对个人或弱势社区，因此必须识别和解决此类仇恨模因的实例。为了通过开发仇恨模因检测模型来解决这个问题，人们进行了广泛的研究。然而，传统机器/深度学习模型的一个显着限制是需要标记数据集才能进行准确分类。最近，研究界见证了几种视觉语言模型的出现，这些模型在各种任务中表现出了出色的性能。在这项研究中，我们的目标是研究这些视觉语言模型在处理复杂任务（例如仇恨模因检测）方面的功效。我们使用各种提示设置来专注于仇恨/有害模因的零样本分类。通过我们的分析，我们观察到大型 VLM 仍然容易受到零样本仇恨模因检测的影响。

Title: Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT

Authors: Zhengfu He, Xuyang Ge, Qiong Tang, Tianxiang Sun, Qinyuan Cheng, Xipeng Qiu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2402.12201
Pdf URL: https://arxiv.org/pdf/2402.12201
Copy Paste: [[2402.12201]] Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT(https://arxiv.org/abs/2402.12201)
Keywords: gpt
Abstract: Sparse dictionary learning has been a rapidly growing technique in mechanistic interpretability to attack superposition and extract more human-understandable features from model activations. We ask a further question based on the extracted more monosemantic features: How do we recognize circuits connecting the enormous amount of dictionary features? We propose a circuit discovery framework alternative to activation patching. Our framework suffers less from out-of-distribution and proves to be more efficient in terms of asymptotic complexity. The basic unit in our framework is dictionary features decomposed from all modules writing to the residual stream, including embedding, attention output and MLP output. Starting from any logit, dictionary feature or attention score, we manage to trace down to lower-level dictionary features of all tokens and compute their contribution to these more interpretable and local model behaviors. We dig in a small transformer trained on a synthetic task named Othello and find a number of human-understandable fine-grained circuits inside of it.

Title: Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages

Authors: Yuanchi Zhang, Yile Wang, Zijun Liu, Shuo Wang, Xiaolong Wang, Peng Li, Maosong Sun, Yang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12204
Pdf URL: https://arxiv.org/pdf/2402.12204
Copy Paste: [[2402.12204]] Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages(https://arxiv.org/abs/2402.12204)
Keywords: language model, llm
Abstract: While large language models (LLMs) have been pre-trained on multilingual corpora, their performance in most languages still lags behind that in a few resource-rich languages. One common approach to mitigate this issue is to translate training data from resource-rich languages into other languages and then continue training. However, relying solely on translated data while ignoring the original capabilities of LLMs across languages is not always effective, and we show that it limits the performance of cross-lingual knowledge transfer. In this work, we propose SDRRL, a method based on Self-Distillation from Resource-Rich Languages that effectively improves multilingual performance by leveraging the internal capabilities of LLMs on resource-rich languages. We evaluate on different LLMs (LLaMA-2 and SeaLLM) and source languages across various comprehension and generation tasks. Experimental results demonstrate that SDRRL can significantly enhance multilingual capabilities while minimizing the impact on original performance in resource-rich languages.

Title: Polarization of Autonomous Generative AI Agents Under Echo Chambers

Authors: Masaya Ohagi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12212
Pdf URL: https://arxiv.org/pdf/2402.12212
Copy Paste: [[2402.12212]] Polarization of Autonomous Generative AI Agents Under Echo Chambers(https://arxiv.org/abs/2402.12212)
Keywords: language model, gpt, prompt, chat, agent
Abstract: Online social networks often create echo chambers where people only hear opinions reinforcing their beliefs. An echo chamber often generates polarization, leading to conflicts caused by people with radical opinions, such as the January 6, 2021, attack on the US Capitol. The echo chamber has been viewed as a human-specific problem, but this implicit assumption is becoming less reasonable as large language models, such as ChatGPT, acquire social abilities. In response to this situation, we investigated the potential for polarization to occur among a group of autonomous AI agents based on generative language models in an echo chamber environment. We had AI agents discuss specific topics and analyzed how the group's opinions changed as the discussion progressed. As a result, we found that the group of agents based on ChatGPT tended to become polarized in echo chamber environments. The analysis of opinion transitions shows that this result is caused by ChatGPT's high prompt understanding ability to update its opinion by considering its own and surrounding agents' opinions. We conducted additional experiments to investigate under what specific conditions AI agents tended to polarize. As a result, we identified factors that strongly influence polarization, such as the agent's persona. These factors should be monitored to prevent the polarization of AI agents.

Title: Reformatted Alignment

Authors: Run-Ze Fan, Xuefeng Li, Haoyang Zou, Junlong Li, Shwai He, Ethan Chern, Jiewen Hu, Pengfei Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.12219
Pdf URL: https://arxiv.org/pdf/2402.12219
Copy Paste: [[2402.12219]] Reformatted Alignment(https://arxiv.org/abs/2402.12219)
Keywords: language model, llm, hallucination
Abstract: The quality of finetuning data is crucial for aligning large language models (LLMs) with human values. Current methods to improve data quality are either labor-intensive or prone to factual errors caused by LLM hallucinations. This paper explores elevating the quality of existing instruction data to better align with human values, introducing a simple and effective approach named ReAlign, which reformats the responses of instruction data into a format that better aligns with pre-established criteria and the collated evidence. This approach minimizes human annotation, hallucination, and the difficulty in scaling, remaining orthogonal to existing alignment techniques. Experimentally, ReAlign significantly boosts the general alignment ability, math reasoning, factuality, and readability of the LLMs. Encouragingly, without introducing any additional data or advanced training techniques, and merely by reformatting the response, LLaMA-2-13B's mathematical reasoning ability on GSM8K can be improved from 46.77% to 56.63% in accuracy. Additionally, a mere 5% of ReAlign data yields a 67% boost in general alignment ability measured by the Alpaca dataset. This work highlights the need for further research into the science and mechanistic interpretability of LLMs. We have made the associated code and data publicly accessible to support future studies at https://github.com/GAIR-NLP/ReAlign.

Title: AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling

Authors: Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, Hang Yan, Jie Fu, Tao Gui, Tianxiang Sun, Yugang Jiang, Xipeng Qiu
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2402.12226
Pdf URL: https://arxiv.org/pdf/2402.12226
Copy Paste: [[2402.12226]] AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling(https://arxiv.org/abs/2402.12226)
Keywords: language model, gpt, llm
Abstract: We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities, including speech, text, images, and music. AnyGPT can be trained stably without any alterations to the current large language model (LLM) architecture or training paradigms. Instead, it relies exclusively on data-level preprocessing, facilitating the seamless integration of new modalities into LLMs, akin to the incorporation of new languages. We build a multimodal text-centric dataset for multimodal alignment pre-training. Utilizing generative models, we synthesize the first large-scale any-to-any multimodal instruction dataset. It consists of 108k samples of multi-turn conversations that intricately interweave various modalities, thus equipping the model to handle arbitrary combinations of multimodal inputs and outputs. Experimental results demonstrate that AnyGPT is capable of facilitating any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities, proving that discrete representations can effectively and conveniently unify multiple modalities within a language model. Demos are shown in https://junzhan2000.github.io/AnyGPT.github.io/

Title: Empirical Study on Updating Key-Value Memories in Transformer Feed-forward Layers

Authors: Zihan Qiu, Zeyu Huang, Youcheng Huang, Jie Fu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12233
Pdf URL: https://arxiv.org/pdf/2402.12233
Copy Paste: [[2402.12233]] Empirical Study on Updating Key-Value Memories in Transformer Feed-forward Layers(https://arxiv.org/abs/2402.12233)
Keywords: language model
Abstract: The feed-forward networks (FFNs) in transformers are recognized as a group of key-value neural memories to restore abstract high-level knowledge. In this work, we conduct an empirical ablation study on updating keys (the 1st layer in the FFNs layer) or values (the 2nd layer in the FFNs layer). We compare those two methods in various knowledge editing and fine-tuning tasks of large language models to draw insights to understand FFNs further. Code is available at https://github.com/qiuzh20/Tuning-keys-v.s.-values.

Title: Task-Oriented Dialogue with In-Context Learning

Authors: Tom Bocklisch, Thomas Werkmeister, Daksh Varshneya, Alan Nichol
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12234
Pdf URL: https://arxiv.org/pdf/2402.12234
Copy Paste: [[2402.12234]] Task-Oriented Dialogue with In-Context Learning(https://arxiv.org/abs/2402.12234)
Keywords: language model, llm, chat
Abstract: We describe a system for building task-oriented dialogue systems combining the in-context learning abilities of large language models (LLMs) with the deterministic execution of business logic. LLMs are used to translate between the surface form of the conversation and a domain-specific language (DSL) which is used to progress the business logic. We compare our approach to the intent-based NLU approach predominantly used in industry today. Our experiments show that developing chatbots with our system requires significantly less effort than established approaches, that these chatbots can successfully navigate complex dialogues which are extremely challenging for NLU-based systems, and that our system has desirable properties for scaling task-oriented dialogue systems to a large number of tasks. We make our implementation available for use and further study.

Title: Understanding the Effects of Noise in Text-to-SQL: An Examination of the BIRD-Bench Benchmark

Authors: Niklas Wretblad, Fredrik Gordh Riseby, Rahul Biswas, Amin Ahmadi, Oskar Holmström
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12243
Pdf URL: https://arxiv.org/pdf/2402.12243
Copy Paste: [[2402.12243]] Understanding the Effects of Noise in Text-to-SQL: An Examination of the BIRD-Bench Benchmark(https://arxiv.org/abs/2402.12243)
Keywords: prompt
Abstract: Text-to-SQL, which involves translating natural language into Structured Query Language (SQL), is crucial for enabling broad access to structured databases without expert knowledge. However, designing models for such tasks is challenging due to numerous factors, including the presence of 'noise,' such as ambiguous questions and syntactical errors. This study provides an in-depth analysis of the distribution and types of noise in the widely used BIRD-Bench benchmark and the impact of noise on models. While BIRD-Bench was created to model dirty and noisy database values, it was not created to contain noise and errors in the questions and gold queries. We found that noise in questions and gold queries are prevalent in the dataset, with varying amounts across domains, and with an uneven distribution between noise types. The presence of incorrect gold SQL queries, which then generate incorrect gold answers, has a significant impact on the benchmark's reliability. Surprisingly, when evaluating models on corrected SQL queries, zero-shot baselines surpassed the performance of state-of-the-art prompting methods. We conclude that informative noise labels and reliable benchmarks are crucial to developing new Text-to-SQL methods that can handle varying types of noise.

Title: Shallow Synthesis of Knowledge in GPT-Generated Texts: A Case Study in Automatic Related Work Composition

Authors: Anna Martin-Boyle, Aahan Tyagi, Marti A. Hearst, Dongyeop Kang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12255
Pdf URL: https://arxiv.org/pdf/2402.12255
Copy Paste: [[2402.12255]] Shallow Synthesis of Knowledge in GPT-Generated Texts: A Case Study in Automatic Related Work Composition(https://arxiv.org/abs/2402.12255)
Keywords: gpt
Abstract: Numerous AI-assisted scholarly applications have been developed to aid different stages of the research process. We present an analysis of AI-assisted scholarly writing generated with ScholaCite, a tool we built that is designed for organizing literature and composing Related Work sections for academic papers. Our evaluation method focuses on the analysis of citation graphs to assess the structural complexity and inter-connectedness of citations in texts and involves a three-way comparison between (1) original human-written texts, (2) purely GPT-generated texts, and (3) human-AI collaborative texts. We find that GPT-4 can generate reasonable coarse-grained citation groupings to support human users in brainstorming, but fails to perform detailed synthesis of related works without human intervention. We suggest that future writing assistant tools should not be used to draft text independently of the human author.

Title: NEO-BENCH: Evaluating Robustness of Large Language Models with Neologisms

Authors: Jonathan Zheng, Alan Ritter, Wei Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12261
Pdf URL: https://arxiv.org/pdf/2402.12261
Copy Paste: [[2402.12261]] NEO-BENCH: Evaluating Robustness of Large Language Models with Neologisms(https://arxiv.org/abs/2402.12261)
Keywords: language model, llm
Abstract: The performance of Large Language Models (LLMs) degrades from the temporal drift between data used for model training and newer text seen during inference. One understudied avenue of language change causing data drift is the emergence of neologisms -- new word forms -- over time. We create a diverse resource of recent English neologisms by using several popular collection methods. We analyze temporal drift using neologisms by comparing sentences containing new words with near-identical sentences that replace neologisms with existing substitute words. Model performance is nearly halved in machine translation when a single neologism is introduced in a sentence. Motivated by these results, we construct a benchmark to evaluate LLMs' ability to generalize to neologisms with various natural language understanding tasks and model perplexity. Models with later knowledge cutoff dates yield lower perplexities and perform better in downstream tasks. LLMs are also affected differently based on the linguistic origins of words, indicating that neologisms are complex for static LLMs to address. We will release our benchmark and code for reproducing our experiments.
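
Note: the abstract above uses model perplexity as one of its measures. As a minimal sketch, assuming per-token natural-log probabilities have already been obtained from some language model (the function name and its input are illustrative and are not taken from the NEO-BENCH code), perplexity is the exponential of the negative mean log-probability:

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a sequence from its per-token natural-log probabilities.

    `token_logprobs` is a hypothetical input: one log p(token | context)
    value per token, as produced by whatever model is being evaluated.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(-mean_logprob)
```

Under this definition, a model that assigns probability 1/4 to every token has perplexity 4, so comparing the value on a neologism sentence with the value on its near-identical substitute sentence gives a simple measure of how much the new word surprises the model.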

Title: Uncertainty quantification in fine-tuned LLMs using LoRA ensembles

Authors: Oleksandr Balabanov, Hampus Linander
Subjects: cs.LG, cs.AI, cs.CL, stat.ML
Abstract URL: https://arxiv.org/abs/2402.12264
Pdf URL: https://arxiv.org/pdf/2402.12264
Copy Paste: [[2402.12264]] Uncertainty quantification in fine-tuned LLMs using LoRA ensembles(https://arxiv.org/abs/2402.12264)
Keywords: language model, llm
Abstract: Fine-tuning large language models can improve task specific performance, although a general understanding of what the fine-tuned model has learned, forgotten and how to trust its predictions is still missing. We derive principled uncertainty quantification for fine-tuned LLMs with posterior approximations using computationally efficient low-rank adaptation ensembles. We analyze three common multiple-choice datasets using low-rank adaptation ensembles based on Mistral-7b, and draw quantitative and qualitative conclusions on their perceived complexity and model efficacy on the different target domains during and after fine-tuning. In particular, backed by the numerical experiments, we hypothesise about signals from entropic uncertainty measures for data domains that are inherently difficult for a given architecture to learn.
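
Note: the abstract above refers to entropic uncertainty measures computed from an ensemble, but does not spell them out. The sketch below shows one common decomposition, assumed here for illustration and not confirmed as the authors' exact measures: the entropy of the ensemble-averaged prediction (total), the average entropy of the individual members (aleatoric), and their difference (epistemic, the mutual information). The function names and the list-of-lists input format are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ensemble_uncertainty(member_probs):
    """Split predictive uncertainty for one input across ensemble members.

    `member_probs` holds one probability vector per ensemble member, e.g.
    the answer-choice probabilities from each LoRA adapter (assumed format).
    Returns (total, aleatoric, epistemic).
    """
    if not member_probs:
        raise ValueError("need at least one ensemble member")
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    # Average the members' distributions class by class.
    mean_probs = [
        sum(member[c] for member in member_probs) / n_members
        for c in range(n_classes)
    ]
    total = entropy(mean_probs)
    aleatoric = sum(entropy(member) for member in member_probs) / n_members
    return total, aleatoric, total - aleatoric
```

When all members agree, the epistemic term is zero; when confident members disagree with each other, the epistemic term dominates, which is the kind of signal an ensemble can expose and a single fine-tuned model cannot.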

Title: High-quality Data-to-Text Generation for Severely Under-Resourced Languages with Out-of-the-box Large Language Models

Authors: Michela Lorandi, Anya Belz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12267
Pdf URL: https://arxiv.org/pdf/2402.12267
Copy Paste: [[2402.12267]] High-quality Data-to-Text Generation for Severely Under-Resourced Languages with Out-of-the-box Large Language Models(https://arxiv.org/abs/2402.12267)
Keywords: language model, llm
Abstract: The performance of NLP methods for severely under-resourced languages cannot currently hope to match the state of the art in NLP methods for well resourced languages. We explore the extent to which pretrained large language models (LLMs) can bridge this gap, via the example of data-to-text generation for Irish, Welsh, Breton and Maltese. We test LLMs on these under-resourced languages and English, in a range of scenarios. We find that LLMs easily set the state of the art for the under-resourced languages by substantial margins, as measured by both automatic and human evaluations. For all our languages, human evaluation shows on-a-par performance with humans for our best systems, but BLEU scores collapse compared to English, casting doubt on the metric's suitability for evaluating non-task-specific systems. Overall, our results demonstrate the great potential of LLMs to bridge the performance gap for under-resourced languages.

Title: WorldCoder, a Model-Based LLM Agent: Building World Models by Writing Code and Interacting with the Environment

Authors: Hao Tang, Darren Key, Kevin Ellis
Subjects: cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2402.12275
Pdf URL: https://arxiv.org/pdf/2402.12275
Copy Paste: [[2402.12275]] WorldCoder, a Model-Based LLM Agent: Building World Models by Writing Code and Interacting with the Environment(https://arxiv.org/abs/2402.12275)
Keywords: llm, agent
Abstract: We give a model-based agent that builds a Python program representing its knowledge of the world based on its interactions with the environment. The world model tries to explain its interactions, while also being optimistic about what reward it can achieve. We do this by extending work on program synthesis via LLMs. We study our agent on gridworlds, finding our approach is more sample-efficient compared to deep RL, and more compute-efficient compared to ReAct-style agents.

Title: Key ingredients for effective zero-shot cross-lingual knowledge transfer in generative tasks

Authors: Nadezhda Chirkova, Vassilina Nikoulina
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.12279
Pdf URL: https://arxiv.org/pdf/2402.12279
Copy Paste: [[2402.12279]] Key ingredients for effective zero-shot cross-lingual knowledge transfer in generative tasks(https://arxiv.org/abs/2402.12279)
Keywords: language model
Abstract: Zero-shot cross-lingual generation implies finetuning of the multilingual pretrained language model on a generation task in one language and then using it to make predictions for this task in other languages. Previous works notice a frequent problem of generation in a wrong language and propose approaches to address it, usually using mT5 as a backbone model. In this work we compare various approaches proposed from the literature in unified settings, also including alternative backbone models, namely mBART and NLLB-200. We first underline the importance of tuning learning rate used for finetuning, which helps to substantially alleviate the problem of generation in the wrong language. Then, we show that with careful learning rate tuning, the simple full finetuning of the model acts as a very strong baseline and alternative approaches bring only marginal improvements. Finally, we find that mBART performs similarly to mT5 of the same size, and NLLB-200 can be competitive in some cases. Our final models reach the performance of the approach based on data translation which is usually considered as an upper baseline for zero-shot cross-lingual generation.

Title: Adaptive Skeleton Graph Decoding

Authors: Shuowei Jin, Yongji Wu, Haizhong Zheng, Qingzhao Zhang, Matthew Lentz, Z. Morley Mao, Atul Prakash, Feng Qian, Danyang Zhuo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.12280
Pdf URL: https://arxiv.org/pdf/2402.12280
Copy Paste: [[2402.12280]] Adaptive Skeleton Graph Decoding(https://arxiv.org/abs/2402.12280)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have seen significant adoption for natural language tasks, owing their success to massive numbers of model parameters (e.g., 70B+); however, LLM inference incurs significant computation and memory costs. Recent approaches propose parallel decoding strategies, such as Skeleton-of-Thought (SoT), to improve performance by breaking prompts down into sub-problems that can be decoded in parallel; however, they often suffer from reduced response quality. Our key insight is that we can request additional information, specifically dependencies and difficulty, when generating the sub-problems to improve both response quality and performance. In this paper, we propose Skeleton Graph Decoding (SGD), which uses dependencies exposed between sub-problems to support information forwarding between dependent sub-problems for improved quality while exposing parallelization opportunities for decoding independent sub-problems. Additionally, we leverage difficulty estimates for each sub-problem to select an appropriately-sized model, improving performance without significantly reducing quality. Compared to standard autoregressive generation and SoT, SGD achieves a 1.69x speedup while improving quality by up to 51%.
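
Note: the abstract above describes decoding independent sub-problems in parallel while forwarding information between dependent ones. The following is only a rough sketch of the scheduling idea, not the authors' implementation: given a hypothetical mapping from each sub-problem to the sub-problems it depends on, it groups them into waves in which every member can be decoded concurrently.

```python
def parallel_waves(dependencies):
    """Group sub-problems into waves that can be decoded in parallel.

    `dependencies` maps each sub-problem name to the names it depends on
    (an assumed input format). Each wave contains only sub-problems whose
    dependencies were all completed in earlier waves.
    """
    remaining = {node: set(parents) for node, parents in dependencies.items()}
    waves = []
    while remaining:
        # A node is ready once none of its dependencies are outstanding.
        ready = sorted(node for node, parents in remaining.items() if not parents)
        if not ready:
            raise ValueError("dependency cycle or unknown dependency")
        waves.append(ready)
        for node in ready:
            del remaining[node]
        for parents in remaining.values():
            parents.difference_update(ready)
    return waves
```

For example, `parallel_waves({"a": [], "b": [], "c": ["a", "b"], "d": ["c"]})` yields three waves: `a` and `b` together, then `c`, then `d`. Model selection by difficulty, which the abstract also mentions, is left out of this sketch.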

Title: Refining Minimax Regret for Unsupervised Environment Design

Authors: Michael Beukman, Samuel Coward, Michael Matthews, Mattie Fellows, Minqi Jiang, Michael Dennis, Jakob Foerster
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2402.12284
Pdf URL: https://arxiv.org/pdf/2402.12284
Copy Paste: [[2402.12284]] Refining Minimax Regret for Unsupervised Environment Design(https://arxiv.org/abs/2402.12284)
Keywords: agent
Abstract: In unsupervised environment design, reinforcement learning agents are trained on environment configurations (levels) generated by an adversary that maximises some objective. Regret is a commonly used objective that theoretically results in a minimax regret (MMR) policy with desirable robustness guarantees; in particular, the agent's maximum regret is bounded. However, once the agent reaches this regret bound on all levels, the adversary will only sample levels where regret cannot be further reduced. Although there are possible performance improvements to be made outside of these regret-maximising levels, learning stagnates. In this work, we introduce Bayesian level-perfect MMR (BLP), a refinement of the minimax regret objective that overcomes this limitation. We formally show that solving for this objective results in a subset of MMR policies, and that BLP policies act consistently with a Perfect Bayesian policy over all levels. We further introduce an algorithm, ReMiDi, that results in a BLP policy at convergence. We empirically demonstrate that training on levels from a minimax regret adversary causes learning to prematurely stagnate, but that ReMiDi continues learning.
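
Note: the minimax regret objective in the abstract above can be illustrated on a small table of returns. This is a toy sketch under stated assumptions: the policies and levels are enumerated explicitly, and the optimal return on each level is approximated by the best return among the listed policies. It does not implement BLP or ReMiDi.

```python
def minimax_regret_policy(returns):
    """Pick the policy whose worst-case regret over all levels is smallest.

    `returns[policy][level]` is the return a policy obtains on a level
    (hypothetical toy data). Regret on a level is the best listed return
    on that level minus the policy's own return.
    """
    if not returns:
        raise ValueError("need at least one policy")
    levels = list(next(iter(returns.values())))
    best_return = {
        level: max(per_level[level] for per_level in returns.values())
        for level in levels
    }
    worst_regret = {
        policy: max(best_return[level] - per_level[level] for level in levels)
        for policy, per_level in returns.items()
    }
    chosen = min(worst_regret, key=worst_regret.get)
    return chosen, worst_regret[chosen]
```

With `{"pi_a": {"l1": 1.0, "l2": 0.2}, "pi_b": {"l1": 0.6, "l2": 0.7}}`, the worst-case regrets are 0.5 for `pi_a` and 0.4 for `pi_b`, so `pi_b` is selected. The limitation the abstract describes arises because an adversary driven only by maximum regret has no reason to sample levels where regret could still be reduced below that bound.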

Title: KARL: Knowledge-Aware Retrieval and Representations aid Retention and Learning in Students

Authors: Matthew Shu, Nishant Balepur, Shi Feng, Jordan Boyd-Graber
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12291
Pdf URL: https://arxiv.org/pdf/2402.12291
Copy Paste: [[2402.12291]] KARL: Knowledge-Aware Retrieval and Representations aid Retention and Learning in Students(https://arxiv.org/abs/2402.12291)
Keywords: language model
Abstract: Flashcard schedulers are tools that rely on 1) student models to predict the flashcards a student knows; and 2) teaching policies to schedule cards based on these predictions. Existing student models, however, only use flashcard-level features, like the student's past responses, ignoring the semantic ties of flashcards. Deep Knowledge Tracing (DKT) models can capture semantic relations with language models, but are inefficient, lack content-rich datasets for evaluation, and require robust teaching policies. To address these issues, we design KARL, a DKT-inspired student model that uses retrieval and BERT embeddings for efficient and accurate student recall predictions. To test KARL, we collect a new dataset of diverse study history on trivia questions. KARL bests existing student models in AUC and calibration error. Finally, we propose a novel teaching policy that exploits the predictive power of DKT models to deploy KARL online. Based on 27 learners and 32 6-day study trajectories, KARL shows the ability to enhance medium-term educational learning, proving its efficacy for scheduling.

Title: Is Open-Source There Yet? A Comparative Study on Commercial and Open-Source LLMs in Their Ability to Label Chest X-Ray Reports

Authors: Felix J. Dorfner, Liv Jürgensen, Leonhard Donle, Fares Al Mohamad, Tobias R. Bodenmann, Mason C. Cleveland, Felix Busch, Lisa C. Adams, James Sato, Thomas Schultz, Albert E. Kim, Jameson Merkow, Keno K. Bressem, Christopher P. Bridge
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.12298
Pdf URL: https://arxiv.org/pdf/2402.12298
Copy Paste: [[2402.12298]] Is Open-Source There Yet? A Comparative Study on Commercial and Open-Source LLMs in Their Ability to Label Chest X-Ray Reports(https://arxiv.org/abs/2402.12298)
Keywords: language model, gpt, llm, prompt
Abstract: Introduction: With the rapid advances in large language models (LLMs), there have been numerous new open-source as well as commercial models. While recent publications have explored GPT-4 in its application to extracting information of interest from radiology reports, there has not been a real-world comparison of GPT-4 to different leading open-source models. Materials and Methods: Two different and independent datasets were used. The first dataset consists of 540 chest x-ray reports that were created at the Massachusetts General Hospital between July 2019 and July 2021. The second dataset consists of 500 chest x-ray reports from the ImaGenome dataset. We then compared the commercial models GPT-3.5 Turbo and GPT-4 from OpenAI to the open-source models Mistral-7B, Mixtral-8x7B, Llama2-13B, Llama2-70B, QWEN1.5-72B, CheXbert, and CheXpert-labeler in their ability to accurately label the presence of multiple findings in x-ray text reports using different prompting techniques. Results: On the ImaGenome dataset, the best performing open-source model was Llama2-70B with micro F1-scores of 0.972 and 0.970 for zero- and few-shot prompts, respectively. GPT-4 achieved micro F1-scores of 0.975 and 0.984, respectively. On the institutional dataset, the best performing open-source model was QWEN1.5-72B with micro F1-scores of 0.952 and 0.965 for zero- and few-shot prompting, respectively. GPT-4 achieved micro F1-scores of 0.975 and 0.973, respectively. Conclusion: In this paper, we show that while GPT-4 is superior to open-source models in zero-shot report labeling, the implementation of few-shot prompting can bring open-source models on par with GPT-4. This shows that open-source models could be a performant and privacy-preserving alternative to GPT-4 for the task of radiology report classification.
摘要：简介：随着大型语言模型（LLM）的快速发展，出现了许多新的开源和商业模型。虽然最近的出版物探讨了 GPT-4 在从放射学报告中提取感兴趣信息的应用，但尚未将 GPT-4 与不同领先的开源模型进行现实世界的比较。材料和方法：使用了两个不同且独立的数据集。第一个数据集包含 2019 年 7 月至 2021 年 7 月期间在马萨诸塞州总医院创建的 540 份胸部 X 射线报告。第二个数据集包含来自 ImaGenome 数据集的 500 份胸部 X 射线报告。然后，我们将 OpenAI 的商业模型 GPT-3.5 Turbo 和 GPT-4 与开源模型 Mistral-7B、Mixtral-8x7B、Llama2-13B、Llama2-70B、QWEN1.5-72B 以及 CheXbert 和 CheXpert-labeler 进行了比较他们能够使用不同的提示技术准确地标记 X 射线文本报告中多个发现的存在。结果：在 ImaGenome 数据集上，性能最好的开源模型是 Llama2-70B，零次和几次提示的微 F1 分数分别为 0.972 和 0.970。 GPT-4 的微 F1 分数分别为 0.975 和 0.984。在机构数据集上，表现最好的开源模型是 QWEN1.5-72B，零次和几次提示的微 F1 分数分别为 0.952 和 0.965。 GPT-4 的微 F1 分数分别为 0.975 和 0.973。结论：在本文中，我们表明，虽然 GPT-4 在零样本报告标记方面优于开源模型，但少样本提示的实现可以使开源模型与 GPT-4 相媲美。这表明开源模型可以成为 GPT-4 的高性能和隐私保护替代品，用于放射学报告分类任务。
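
For readers unfamiliar with the micro F1-score reported above, the following is a minimal sketch of how it is commonly computed for multi-label report labeling. True positives, false positives, and false negatives are pooled over all findings before the score is taken. The finding names and the toy reports are hypothetical, and this is not the study's evaluation code.

```python
from typing import Dict, List


def micro_f1(gold: List[Dict[str, int]], pred: List[Dict[str, int]]) -> float:
    """Micro-averaged F1 over binary finding labels.

    Counts are pooled across every report and every finding, so
    frequent findings weigh more than rare ones.
    """
    tp = fp = fn = 0
    for gold_report, pred_report in zip(gold, pred):
        for finding, truth in gold_report.items():
            guess = pred_report.get(finding, 0)  # missing label = absent
            if truth == 1 and guess == 1:
                tp += 1
            elif truth == 0 and guess == 1:
                fp += 1
            elif truth == 1 and guess == 0:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical labels for two reports (1 = finding present).
gold = [{"pneumothorax": 1, "edema": 0}, {"pneumothorax": 0, "edema": 1}]
pred = [{"pneumothorax": 1, "edema": 1}, {"pneumothorax": 0, "edema": 0}]
print(micro_f1(gold, pred))  # 0.5
```

In the toy data there is one true positive, one false positive, and one false negative, giving 2 / (2 + 1 + 1) = 0.5.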

Title: ARKS: Active Retrieval in Knowledge Soup for Code Generation

Authors: Hongjin Su, Shuyang Jiang, Yuhang Lai, Haoyuan Wu, Boao Shi, Che Liu, Qian Liu, Tao Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.12317
Pdf URL: https://arxiv.org/pdf/2402.12317
Copy Paste: [[2402.12317]] ARKS: Active Retrieval in Knowledge Soup for Code Generation(https://arxiv.org/abs/2402.12317)
Keywords: language model, gpt, llm, chat, retrieval-augmented generation
Abstract: Recently the retrieval-augmented generation (RAG) paradigm has raised much attention for its potential in incorporating external knowledge into large language models (LLMs) without further training. While widely explored in natural language applications, its utilization in code generation remains under-explored. In this paper, we introduce Active Retrieval in Knowledge Soup (ARKS), an advanced strategy for generalizing large language models for code. In contrast to relying on a single source, we construct a knowledge soup integrating web search, documentation, execution feedback, and evolved code snippets. We employ an active retrieval strategy that iteratively refines the query and updates the knowledge soup. To assess the performance of ARKS, we compile a new benchmark comprising realistic coding problems associated with frequently updated libraries and long-tail programming languages. Experimental results on ChatGPT and CodeLlama demonstrate a substantial improvement in the average execution accuracy of ARKS on LLMs. The analysis confirms the effectiveness of our proposed knowledge soup and active retrieval strategies, offering rich insights into the construction of effective retrieval-augmented code generation (RACG) pipelines. Our model, code, and data are available at https://arks-codegen.github.io.
摘要：最近，检索增强生成（RAG）范式因其无需进一步训练即可将外部知识整合到大型语言模型（LLM）中的潜力而引起了广泛关注。虽然在自然语言应用程序中得到了广泛的探索，但它在代码生成中的利用仍未得到充分探索。在本文中，我们介绍了知识汤中的主动检索（ARKS），这是一种泛化代码大型语言模型的高级策略。与依赖单一来源不同，我们构建了一个集成网络搜索、文档、执行反馈和进化代码片段的知识汤。我们采用主动检索策略来迭代地细化查询并更新知识汤。为了评估 ARKS 的性能，我们编译了一个新的基准，其中包含与频繁更新的库和长尾编程语言相关的实际编码问题。 ChatGPT 和 CodeLlama 上的实验结果表明，ARKS 在 LLM 上的平均执行精度有了显着提高。该分析证实了我们提出的知识汤和主动检索策略的有效性，为构建有效的检索增强代码生成（RACG）管道提供了丰富的见解。我们的模型、代码和数据可在 https://arks-codegen.github.io 上获取。

Title: LLM Agents for Psychology: A Study on Gamified Assessments

Authors: Qisen Yang, Zekun Wang, Honghui Chen, Shenzhi Wang, Yifan Pu, Xin Gao, Wenhao Huang, Shiji Song, Gao Huang
Subjects: cs.CL, cs.CY, cs.HC, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2402.12326
Pdf URL: https://arxiv.org/pdf/2402.12326
Copy Paste: [[2402.12326]] LLM Agents for Psychology: A Study on Gamified Assessments(https://arxiv.org/abs/2402.12326)
Keywords: llm, agent
Abstract: Psychological measurement is essential for mental health, self-understanding, and personal development. Traditional methods, such as self-report scales and psychologist interviews, often face challenges with engagement and accessibility. While game-based and LLM-based tools have been explored to improve user interest and automate assessment, they struggle to balance engagement with generalizability. In this work, we propose PsychoGAT (Psychological Game AgenTs) to achieve a generic gamification of psychological assessment. The main insight is that powerful LLMs can function both as adept psychologists and innovative game designers. By incorporating LLM agents into designated roles and carefully managing their interactions, PsychoGAT can transform any standardized scales into personalized and engaging interactive fiction games. To validate the proposed method, we conduct psychometric evaluations to assess its effectiveness and employ human evaluators to examine the generated content across various psychological constructs, including depression, cognitive distortions, and personality traits. Results demonstrate that PsychoGAT serves as an effective assessment tool, achieving statistically significant excellence in psychometric metrics such as reliability, convergent validity, and discriminant validity. Moreover, human evaluations confirm PsychoGAT's enhancements in content coherence, interactivity, interest, immersion, and satisfaction.
摘要：心理测量对于心理健康、自我理解和个人发展至关重要。传统方法，例如自我报告量表和心理学家访谈，经常面临参与度和可及性方面的挑战。虽然基于游戏和基于法学硕士的工具已经被探索来提高用户兴趣和自动化评估，但它们很难在参与度和普遍性之间取得平衡。在这项工作中，我们提出了 PsychoGAT（心理游戏代理）来实现心理评估的通用游戏化。主要见解是，强大的法学硕士既可以充当熟练的心理学家，也可以充当创新的游戏设计师。通过将 LLM 代理纳入指定角色并仔细管理他们的互动，PsychoGAT 可以将任何标准化规模转化为个性化且引人入胜的互动小说游戏。为了验证所提出的方法，我们进行心理测量评估以评估其有效性，并聘请人类评估员检查各种心理结构（包括抑郁、认知扭曲和人格特质）生成的内容。结果表明，PsychoGAT 是一种有效的评估工具，在可靠性、收敛效度和判别效度等心理测量指标上实现了统计上显着的卓越性。此外，人类评估证实了 PsychoGAT 在内容连贯性、交互性、兴趣、沉浸感和满意度方面的增强。

Title: Shall We Talk: Exploring Spontaneous Collaborations of Competing LLM Agents

Authors: Zengqing Wu, Shuyuan Zheng, Qianying Liu, Xu Han, Brian Inhyuk Kwon, Makoto Onizuka, Shaojie Tang, Run Peng, Chuan Xiao
Subjects: cs.AI, cs.CL, cs.CY, cs.MA, econ.GN
Abstract URL: https://arxiv.org/abs/2402.12327
Pdf URL: https://arxiv.org/pdf/2402.12327
Copy Paste: [[2402.12327]] Shall We Talk: Exploring Spontaneous Collaborations of Competing LLM Agents(https://arxiv.org/abs/2402.12327)
Keywords: language model, llm, agent
Abstract: Recent advancements have shown that agents powered by large language models (LLMs) possess capabilities to simulate human behaviors and societal dynamics. However, the potential for LLM agents to spontaneously establish collaborative relationships in the absence of explicit instructions has not been studied. To address this gap, we conduct three case studies, revealing that LLM agents are capable of spontaneously forming collaborations even within competitive settings. This finding not only demonstrates the capacity of LLM agents to mimic competition and cooperation in human societies but also validates a promising vision of computational social science. Specifically, it suggests that LLM agents could be utilized to model human social interactions, including those with spontaneous collaborations, thus offering insights into social phenomena. The source code for this study is available at https://github.com/wuzengqing001225/SABM_ShallWeTalk.
摘要：最近的进展表明，由大型语言模型 (LLM) 驱动的智能体具有模拟人类行为和社会动态的能力。然而，LLM 代理人在没有明确指示的情况下自发建立合作关系的潜力尚未得到研究。为了解决这一差距，我们进行了三个案例研究，表明法学硕士代理人即使在竞争环境中也能够自发地形成合作。这一发现不仅证明了法学硕士代理模仿人类社会竞争与合作的能力，而且验证了计算社会科学的前景。具体来说，它表明法学硕士代理可以用来模拟人类社会互动，包括自发合作的互动，从而提供对社会现象的见解。本研究的源代码可在 https://github.com/wuzengqing001225/SABM_ShallWeTalk 获取。

Title: Query-Based Adversarial Prompt Generation

Authors: Jonathan Hayase, Ema Borevkovic, Nicholas Carlini, Florian Tramèr, Milad Nasr
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2402.12329
Pdf URL: https://arxiv.org/pdf/2402.12329
Copy Paste: [[2402.12329]] Query-Based Adversarial Prompt Generation(https://arxiv.org/abs/2402.12329)
Keywords: language model, gpt, prompt
Abstract: Recent work has shown it is possible to construct adversarial examples that cause an aligned language model to emit harmful strings or perform harmful behavior. Existing attacks work either in the white-box setting (with full access to the model weights), or through transferability: the phenomenon that adversarial examples crafted on one model often remain effective on other models. We improve on prior work with a query-based attack that leverages API access to a remote language model to construct adversarial examples that cause the model to emit harmful strings with (much) higher probability than with transfer-only attacks. We validate our attack on GPT-3.5 and OpenAI's safety classifier; we can cause GPT-3.5 to emit harmful strings that current transfer attacks fail at, and we can evade the safety classifier with nearly 100% probability.
摘要：最近的工作表明，构建对抗性示例是可能的，这些示例会导致对齐的语言模型发出有害字符串或执行有害行为。现有的攻击要么在白盒设置（可以完全访问模型权重）中起作用，要么通过可转移性起作用：在一个模型上制作的对抗性示例通常在其他模型上仍然有效的现象。我们通过基于查询的攻击改进了之前的工作，该攻击利用对远程语言模型的 API 访问来构建对抗性示例，导致模型以比仅传输攻击高得多的概率发出有害字符串。我们验证了对 GPT-3.5 和 OpenAI 安全分类器的攻击；我们可以使 GPT-3.5 发出当前传输攻击失败的有害字符串，并且我们可以以近 100% 的概率逃避安全分类器。

Title: Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models

Authors: Christian Schlarmann, Naman Deep Singh, Francesco Croce, Matthias Hein
Subjects: cs.LG, cs.AI, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2402.12336
Pdf URL: https://arxiv.org/pdf/2402.12336
Copy Paste: [[2402.12336]] Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models(https://arxiv.org/abs/2402.12336)
Keywords: language model, gpt
Abstract: Multi-modal foundation models like OpenFlamingo, LLaVA, and GPT-4 are increasingly used for various real-world tasks. Prior work has shown that these models are highly vulnerable to adversarial attacks on the vision modality. These attacks can be leveraged to spread fake information or defraud users, and thus pose a significant risk, which makes the robustness of large multi-modal foundation models a pressing problem. The CLIP model, or one of its variants, is used as a frozen vision encoder in many vision-language models (VLMs), e.g. LLaVA and OpenFlamingo. We propose an unsupervised adversarial fine-tuning scheme to obtain a robust CLIP vision encoder, which yields robustness on all vision down-stream tasks (VLMs, zero-shot classification) that rely on CLIP. In particular, we show that stealth-attacks on users of VLMs by a malicious third party providing manipulated images are no longer possible once one replaces the original CLIP model with our robust one. No retraining or fine-tuning of the VLM is required. The code and robust models are available at https://github.com/chs20/RobustVLM
摘要：OpenFlamingo、LLaVA 和 GPT-4 等多模态基础模型越来越多地用于各种实际任务。先前的工作表明，这些模型非常容易受到视觉模式的对抗性攻击。这些攻击可用于传播虚假信息或欺骗用户，从而带来重大风险，这使得大型多模态基础模型的稳健性成为一个紧迫的问题。 CLIP 模型或其变体之一在许多视觉语言模型 (VLM) 中用作冻结视觉编码器，例如LLaVA 和 OpenFlamingo。我们提出了一种无监督的对抗性微调方案，以获得鲁棒的 CLIP 视觉编码器，该编码器对依赖于 CLIP 的所有视觉下游任务（VLM、零样本分类）产生鲁棒性。特别是，我们表明，一旦用我们强大的模型替换了原始的 CLIP 模型，恶意第三方提供经过操纵的图像对 VLM 用户进行秘密攻击就不再可能了。无需对 VLM 进行重新训练或微调。代码和稳健模型可在 https://github.com/chs20/RobustVLM 获取

Title: Emulated Disalignment: Safety Alignment for Large Language Models May Backfire!

Authors: Zhanhui Zhou, Jie Liu, Zhichen Dong, Jiaheng Liu, Chao Yang, Wanli Ouyang, Yu Qiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12343
Pdf URL: https://arxiv.org/pdf/2402.12343
Copy Paste: [[2402.12343]] Emulated Disalignment: Safety Alignment for Large Language Models May Backfire!(https://arxiv.org/abs/2402.12343)
Keywords: language model, llm
Abstract: Large language models (LLMs) need to undergo safety alignment to ensure safe conversations with humans. However, in this work, we introduce an inference-time attack framework, demonstrating that safety alignment can also unintentionally facilitate harmful outcomes under adversarial manipulation. This framework, named Emulated Disalignment (ED), adversely combines a pair of open-source pre-trained and safety-aligned language models in the output space to produce a harmful language model without any training. Our experiments with ED across three datasets and four model families (Llama-1, Llama-2, Mistral, and Alpaca) show that ED doubles the harmfulness of pre-trained models and outperforms strong baselines, achieving the highest harmful rate in 43 out of 48 evaluation subsets by a large margin. Crucially, our findings highlight the importance of reevaluating the practice of open-sourcing language models even after safety alignment.
摘要：大型语言模型 (LLM) 需要进行安全对齐，以确保与人类的安全对话。然而，在这项工作中，我们引入了一个推理时攻击框架，证明安全对齐也可能在对抗性操纵下无意中促成有害结果。该框架名为 Emulated Disalignment (ED)，在输出空间中对抗性地结合一对开源的预训练语言模型和安全对齐语言模型，无需任何训练即可生成有害的语言模型。我们在三个数据集和四个模型系列（Llama-1、Llama-2、Mistral 和 Alpaca）上进行的 ED 实验表明，ED 使预训练模型的危害性增加了一倍，并且优于强基线，在 48 个评估子集中的 43 个上以较大优势取得了最高的有害率。至关重要的是，我们的研究结果强调了即使在安全对齐之后也要重新评估开源语言模型这一做法的重要性。

Title: GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations

Authors: Jinhao Duan, Renming Zhang, James Diffenderfer, Bhavya Kailkhura, Lichao Sun, Elias Stengel-Eskin, Mohit Bansal, Tianlong Chen, Kaidi Xu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2402.12348
Pdf URL: https://arxiv.org/pdf/2402.12348
Copy Paste: [[2402.12348]] GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations(https://arxiv.org/abs/2402.12348)
Keywords: language model, gpt, llm, chain-of-thought, tree-of-thought
Abstract: As Large Language Models (LLMs) are integrated into critical real-world applications, their strategic and logical reasoning abilities are increasingly crucial. This paper evaluates LLMs' reasoning abilities in competitive environments through game-theoretic tasks, e.g., board and card games that require pure logic and strategic reasoning to compete with opponents. We first propose GTBench, a language-driven environment comprising 10 widely recognized tasks, across a comprehensive game taxonomy: complete versus incomplete information, dynamic versus static, and probabilistic versus deterministic scenarios. Then, we investigate two key problems: (1) Characterizing game-theoretic reasoning of LLMs; (2) LLM-vs-LLM competitions as reasoning evaluation. We observe that (1) LLMs have distinct behaviors regarding various gaming scenarios; for example, LLMs fail in complete and deterministic games, yet they are competitive in probabilistic gaming scenarios; (2) Open-source LLMs, e.g., CodeLlama-34b-Instruct, are less competitive than commercial LLMs, e.g., GPT-4, in complex games. In addition, code-pretraining greatly benefits strategic reasoning, while advanced reasoning methods such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT) do not always help. Detailed error profiles are also provided for a better understanding of LLMs' behavior.
摘要：随着大型语言模型（LLM）被集成到关键的现实应用中，它们的战略和逻辑推理能力变得越来越重要。本文通过博弈论任务评估法学硕士在竞争环境中的推理能力，例如需要纯粹逻辑和战略推理来与对手竞争的棋盘游戏和纸牌游戏。我们首先提出 GTBench，这是一种语言驱动的环境，由 10 个广泛认可的任务组成，涵盖全面的游戏分类：完整信息与不完整信息、动态与静态、概率场景与确定性场景。然后，我们研究了两个关键问题：（1）表征法学硕士的博弈论推理； (2) LLM-vs-LLM竞赛作为推理评估。我们观察到（1）法学硕士对于各种游戏场景有不同的行为；例如，法学硕士在完整和确定性游戏中失败，但在概率游戏场景中具有竞争力； (2) 在复杂游戏中，开源 LLM（例如 CodeLlama-34b-Instruct）的竞争力低于商业 LLM（例如 GPT-4）。此外，代码预训练极大地有利于策略推理，而思想链（CoT）和思想树（ToT）等高级推理方法并不总是有帮助。还提供了详细的错误配置文件，以便更好地理解法学硕士的行为。

Title: Graph-Based Retriever Captures the Long Tail of Biomedical Knowledge

Authors: Julien Delile, Srayanta Mukherjee, Anton Van Pamel, Leonid Zhukov
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2402.12352
Pdf URL: https://arxiv.org/pdf/2402.12352
Copy Paste: [[2402.12352]] Graph-Based Retriever Captures the Long Tail of Biomedical Knowledge(https://arxiv.org/abs/2402.12352)
Keywords: language model, llm, prompt, retrieval augmented generation
Abstract: Large language models (LLMs) are transforming the way information is retrieved, with vast amounts of knowledge being summarized and presented via natural language conversations. Yet, LLMs are prone to highlight the most frequently seen pieces of information from the training set and to neglect the rare ones. In the field of biomedical research, the latest discoveries are key to academic and industrial actors but are obscured by the abundance of an ever-increasing literature corpus (the information overload problem). Surfacing new associations between biomedical entities, e.g., drugs, genes, diseases, with LLMs becomes a challenge of capturing the long-tail knowledge of the biomedical scientific production. To overcome this challenge, Retrieval Augmented Generation (RAG) has been proposed to alleviate some of the shortcomings of LLMs by augmenting the prompts with context retrieved from external datasets. RAG methods typically select the context via maximum similarity search over text embeddings. In this study, we show that RAG methods leave out a significant proportion of relevant information due to clusters of over-represented concepts in the biomedical literature. We introduce a novel information-retrieval method that leverages a knowledge graph to downsample these clusters and mitigate the information overload problem. Its retrieval performance is roughly twice as good as that of embedding-similarity alternatives on both precision and recall. Finally, we demonstrate that both embedding similarity and knowledge graph retrieval methods can be advantageously combined into a hybrid model that outperforms both, enabling potential improvements to biomedical question-answering models.
摘要：大型语言模型 (LLM) 正在改变信息检索的方式，通过自然语言对话总结和呈现大量知识。然而，法学硕士很容易强调训练集中最常见的信息，而忽略罕见的信息。在生物医学研究领域，最新的发现对于学术和工业参与者来说至关重要，但由于不断增加的文献语料库（信息过载问题）而变得模糊。揭示生物医学实体（例如药物、基因、疾病）与法学硕士之间的新关联成为捕获生物医学科学成果的长尾知识的挑战。为了克服这一挑战，人们提出了检索增强生成（RAG），通过使用从外部数据集检索的上下文来增强提示，以减轻法学硕士的一些缺点。 RAG 方法通常通过文本嵌入的最大相似度搜索来选择上下文。在这项研究中，我们表明，由于生物医学文献中存在过多的概念，RAG 方法遗漏了很大一部分相关信息。我们引入了一种新颖的信息检索方法，该方法利用知识图来对这些集群进行下采样并缓解信息过载问题。它的检索性能在精确度和召回率方面比嵌入相似性替代方案要好大约两倍。最后，我们证明嵌入相似性和知识图检索方法都可以有利地组合成优于两者的混合模型，从而能够对生物医学问答模型进行潜在的改进。
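
The baseline the abstract above describes, selecting context by maximum similarity search over text embeddings, can be sketched as follows. This illustrates only that baseline, not the paper's knowledge-graph retriever. The two-dimensional vectors, document ids, and function names are hypothetical stand-ins for real text embeddings.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_by_similarity(
    query: Sequence[float], docs: Dict[str, Sequence[float]], k: int
) -> List[str]:
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(docs, key=lambda doc_id: cosine(query, docs[doc_id]), reverse=True)
    return ranked[:k]


# Hypothetical embeddings: "a" and "c" point almost the same way as the query.
docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1]}
print(top_k_by_similarity([1.0, 0.0], docs, k=2))  # ['a', 'c']
```

Because the ranking depends only on closeness to the query, a dense cluster of near-duplicate documents can fill all k slots, which is the over-representation effect the abstract points to.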

Title: Emergent Word Order Universals from Cognitively-Motivated Language Models

Authors: Tatsuki Kuribayashi, Ryo Ueda, Ryo Yoshida, Yohei Oseki, Ted Briscoe, Timothy Baldwin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12363
Pdf URL: https://arxiv.org/pdf/2402.12363
Copy Paste: [[2402.12363]] Emergent Word Order Universals from Cognitively-Motivated Language Models(https://arxiv.org/abs/2402.12363)
Keywords: language model
Abstract: The world's languages exhibit certain so-called typological or implicational universals; for example, Subject-Object-Verb (SOV) word order typically employs postpositions. Explaining the source of such biases is a key goal in linguistics. We study the word-order universals through a computational simulation with language models (LMs). Our experiments show that typologically typical word orders tend to have lower perplexity estimated by LMs with cognitively plausible biases: syntactic biases, specific parsing strategies, and memory limitations. This suggests that the interplay of these cognitive biases and predictability (perplexity) can explain many aspects of word-order universals. This also showcases the advantage of cognitively-motivated LMs, which are typically employed in cognitive modeling, in the computational simulation of language universals.
摘要：世界上的语言表现出某些所谓的类型学或蕴涵共性；例如，主语-宾语-动词（SOV）词序通常采用后置词。解释这种偏见的根源是语言学的一个关键目标。我们通过语言模型（LM）的计算模拟来研究词序共性。我们的实验表明，类型学上典型的词序往往具有较低的困惑度，由具有认知合理偏差的 LM 估计：句法偏差、特定的解析策略和记忆限制。这表明这些认知偏差和可预测性（困惑）的相互作用可以解释词序共性的许多方面。这也展示了认知驱动的语言模型的优势，这种语言模型通常用于认知建模和语言共性的计算模拟。
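
The perplexity measure used in the abstract above is the exponential of the average negative log-probability a language model assigns to each token. The sketch below shows that standard definition on made-up token probabilities. The two "word orders" and their probabilities are hypothetical and do not come from the paper.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """Perplexity = exp of the mean negative log-probability per token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# Hypothetical per-token probabilities for the same sentence in two word orders.
typical_order = [0.5, 0.4, 0.5]
atypical_order = [0.2, 0.1, 0.25]

# Lower perplexity means the model finds the sequence more predictable.
print(perplexity(typical_order) < perplexity(atypical_order))  # True
```

A model that assigns every token probability 0.25 has perplexity 4, which matches the intuition of choosing uniformly among four options at each step.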

Title: A Critical Evaluation of AI Feedback for Aligning Large Language Models

Authors: Archit Sharma, Sedrick Keh, Eric Mitchell, Chelsea Finn, Kushal Arora, Thomas Kollar
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2402.12366
Pdf URL: https://arxiv.org/pdf/2402.12366
Copy Paste: [[2402.12366]] A Critical Evaluation of AI Feedback for Aligning Large Language Models(https://arxiv.org/abs/2402.12366)
Keywords: language model, gpt
Abstract: Reinforcement learning with AI feedback (RLAIF) is a popular paradigm for improving the instruction-following abilities of powerful pre-trained language models. RLAIF first performs supervised fine-tuning (SFT) using demonstrations from a teacher model and then further fine-tunes the model with reinforcement learning (RL), using feedback from a critic model. While recent popular open-source models have demonstrated substantial improvements in performance from the RL step, in this paper we question whether the complexity of this RL step is truly warranted for AI feedback. We show that the improvements of the RL step are virtually entirely due to the widespread practice of using a weaker teacher model (e.g. GPT-3.5) for SFT data collection than the critic (e.g., GPT-4) used for AI feedback generation. Specifically, we show that simple supervised fine-tuning with GPT-4 as the teacher outperforms existing RLAIF pipelines. More generally, we find that the gains from RLAIF vary substantially across base model families, test-time evaluation protocols, and critic models. Finally, we provide a mechanistic explanation for when SFT may outperform the full two-step RLAIF pipeline as well as suggestions for making RLAIF maximally useful in practice.
摘要：带人工智能反馈的强化学习（RLAIF）是一种流行的范例，用于提高强大的预训练语言模型的指令跟踪能力。 RLAIF 首先使用教师模型的演示执行监督微调 (SFT)，然后使用评论家模型的反馈通过强化学习 (RL) 进一步微调模型。虽然最近流行的开源模型已经证明了 RL 步骤的性能有了显着提高，但在本文中，我们质疑这个 RL 步骤的复杂性是否真正能够保证 AI 反馈。我们表明，RL 步骤的改进实际上完全归因于使用比用于 AI 反馈生成的批评家模型（例如 GPT-4）更弱的教师模型（例如 GPT-3.5）进行 SFT 数据收集的广泛实践。具体来说，我们证明了以 GPT-4 作为教师的简单监督微调优于现有的 RLAIF 管道。更一般地说，我们发现 RLAIF 的收益在基础模型系列、测试时评估协议和批评模型之间存在很大差异。最后，我们提供了 SFT 何时优于完整的两步 RLAIF 流程的机制解释，以及使 RLAIF 在实践中发挥最大作用的建议。

Title: A synthetic data approach for domain generalization of NLI models

Authors: Mohammad Javad Hosseini, Andrey Petrov, Alex Fabrikant, Annie Louis
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12368
Pdf URL: https://arxiv.org/pdf/2402.12368
Copy Paste: [[2402.12368]] A synthetic data approach for domain generalization of NLI models(https://arxiv.org/abs/2402.12368)
Keywords: llm
Abstract: Natural Language Inference (NLI) remains an important benchmark task for LLMs. NLI datasets are a springboard for transfer learning to other semantic tasks, and NLI models are standard tools for identifying the faithfulness of model-generated text. There are several large scale NLI datasets today, and models have improved greatly by hill-climbing on these collections. Yet their realistic performance on out-of-distribution/domain data is less well-understood. We present an in-depth exploration of the problem of domain generalization of NLI models. We demonstrate a new approach for generating synthetic NLI data in diverse domains and lengths, so far not covered by existing training sets. The resulting examples have meaningful premises, the hypotheses are formed in creative ways rather than simple edits to a few premise tokens, and the labels have high accuracy. We show that models trained on this data ($685$K synthetic examples) have the best generalization to completely new downstream test settings. On the TRUE benchmark, a T5-small model trained with our data improves around $7\%$ on average compared to training on the best alternative dataset. The improvements are more pronounced for smaller models, while still meaningful on a T5 XXL model. We also demonstrate gains on test sets when in-domain training data is augmented with our domain-general synthetic data.
摘要：自然语言推理（NLI）仍然是 LLM 的一项重要基准任务。 NLI 数据集是迁移学习到其他语义任务的跳板，而 NLI 模型是用于识别模型生成文本的可信度的标准工具。如今有几个大规模的 NLI 数据集，并且通过在这些数据集上进行爬山，模型得到了极大的改进。然而，它们在分布外/域外数据上的实际性能却不太为人所知。我们对 NLI 模型的领域泛化问题进行了深入探索。我们展示了一种生成不同领域和长度的合成 NLI 数据的新方法，迄今为止，现有训练集尚未涵盖这些数据。生成的示例具有有意义的前提，假设是以创造性的方式形成的，而不是对几个前提标记进行简单的编辑，并且标签具有很高的准确性。我们表明，基于此数据（$685$K 合成示例）训练的模型对全新的下游测试设置具有最佳泛化能力。在 TRUE 基准上，与在最佳替代数据集上训练相比，使用我们的数据训练的 T5-small 模型平均提高了约 $7\%$。这些改进对于较小的模型来说更为明显，而对于 T5 XXL 模型来说仍然有意义。当用我们的领域通用合成数据增强领域内训练数据时，我们还展示了测试集上的增益。

Title: AnaloBench: Benchmarking the Identification of Abstract and Long-context Analogies

Authors: Xiao Ye, Andrew Wang, Jacob Choi, Yining Lu, Shreya Sharma, Lingfeng Shen, Vijay Tiyyala, Nicholas Andrews, Daniel Khashabi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2402.12370
Pdf URL: https://arxiv.org/pdf/2402.12370
Copy Paste: [[2402.12370]] AnaloBench: Benchmarking the Identification of Abstract and Long-context Analogies(https://arxiv.org/abs/2402.12370)
Keywords: language model, gpt
Abstract: Humans regularly engage in analogical thinking, relating personal experiences to current situations ($X$ is analogous to $Y$ because of $Z$). Analogical thinking allows humans to solve problems in creative ways, grasp difficult concepts, and articulate ideas more effectively. Can language models (LMs) do the same? To answer this question, we propose ANALOBENCH, a benchmark to determine analogical reasoning ability in LMs. Our benchmarking approach focuses on aspects of this ability that are common among humans: (i) recalling related experiences from a large amount of information, and (ii) applying analogical reasoning to complex and lengthy scenarios. We test a broad collection of proprietary models (e.g., GPT family, Claude V2) and open-source models such as LLaMA2. As in prior results, scaling up LMs results in some performance boosts. Surprisingly, scale offers minimal gains when (i) analogies involve lengthy scenarios, or (ii) relevant scenarios must be recalled from a large pool of information, a process analogous to finding a needle in a haystack. We hope these observations encourage further research in this field.
摘要：人类经常进行类比思维，将个人经历与当前情况联系起来（$X$ 类似于 $Y$，因为 $Z$）。类比思维使人类能够以创造性的方式解决问题，掌握困难的概念，并更有效地表达想法。语言模型（LM）可以做同样的事情吗？为了回答这个问题，我们提出了 ANALOBENCH，这是一个确定 LM 类比推理能力的基准。我们的基准测试方法侧重于人类常见的这种能力的各个方面：（i）从大量信息中回忆相关经验，以及（ii）将类比推理应用于复杂而冗长的场景。我们测试了广泛的专有模型（例如 GPT 系列、Claude V2）和开源模型（例如 LLaMA2）。与之前的结果一样，扩展 LM 会带来一些性能提升。令人惊讶的是，当（i）类比涉及冗长的场景，或（ii）从大量信息中回忆相关场景（类似于大海捞针的过程）时，规模提供的收益最小。我们希望这些观察结果能够鼓励该领域的进一步研究。

Title: Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding

Authors: Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, Beidi Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2402.12374
Pdf URL: https://arxiv.org/pdf/2402.12374
Copy Paste: [[2402.12374]] Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding(https://arxiv.org/abs/2402.12374)
Keywords: language model, llm
Abstract: As the usage of large language models (LLMs) grows, performing efficient inference with these models becomes increasingly important. While speculative decoding has recently emerged as a promising direction for speeding up inference, existing methods are limited in their ability to scale to larger speculation budgets, and adapt to different hyperparameters and hardware. This paper introduces Sequoia, a scalable, robust, and hardware-aware algorithm for speculative decoding. To attain better scalability, Sequoia introduces a dynamic programming algorithm to find the optimal tree structure for the speculated tokens. To achieve robust speculative performance, Sequoia uses a novel sampling and verification method that outperforms prior work across different decoding temperatures. Finally, Sequoia introduces a hardware-aware tree optimizer that maximizes speculative performance by automatically selecting the token tree size and depth for a given hardware platform. Evaluation shows that Sequoia improves the decoding speed of Llama2-7B, Llama2-13B, and Vicuna-33B on an A100 by up to $4.04\times$, $3.84\times$, and $2.37\times$, and Llama2-70B offloading by up to $10.33\times$ on L40.
摘要：随着大型语言模型 (LLM) 的使用不断增长，使用这些模型进行高效推理变得越来越重要。虽然推测解码最近已成为加速推理的一个有前途的方向，但现有方法在扩展到更大的推测预算以及适应不同的超参数和硬件方面的能力受到限制。本文介绍了 Sequoia，一种可扩展、稳健且硬件感知的推测解码算法。为了获得更好的可扩展性，Sequoia 引入了动态规划算法来寻找推测 token 的最佳树结构。为了实现稳健的推测性能，Sequoia 使用了一种新颖的采样和验证方法，该方法在不同的解码温度下优于之前的工作。最后，Sequoia 引入了一个硬件感知的树优化器，它通过自动选择给定硬件平台的 token 树大小和深度来最大化推测性能。评估表明，Sequoia 将 Llama2-7B、Llama2-13B 和 Vicuna-33B 在 A100 上的解码速度分别提高了最高 $4.04\times$、$3.84\times$ 和 $2.37\times$，并将 L40 上 Llama2-70B 卸载（offloading）场景的解码速度提高了最高 $10.33\times$。