2024-06-21

Title: SHIELD: Evaluation and Defense Strategies for Copyright Compliance in LLM Text Generation

Authors: Xiaoze Liu, Ting Sun, Tianyang Xu, Feijie Wu, Cunxiang Wang, Xiaoqian Wang, Jing Gao
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2406.12975
Pdf URL: https://arxiv.org/pdf/2406.12975
Copy Paste: [[2406.12975]] SHIELD: Evaluation and Defense Strategies for Copyright Compliance in LLM Text Generation(https://arxiv.org/abs/2406.12975)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have transformed machine learning but raised significant legal concerns due to their potential to produce text that infringes on copyrights, resulting in several high-profile lawsuits. The legal landscape is struggling to keep pace with these rapid advancements, with ongoing debates about whether generated text might plagiarize copyrighted materials. Current LLMs may infringe on copyrights or overly restrict non-copyrighted texts, leading to these challenges: (i) the need for a comprehensive evaluation benchmark to assess copyright compliance from multiple aspects; (ii) evaluating robustness against safeguard bypassing attacks; and (iii) developing effective defenses targeted against the generation of copyrighted text. To tackle these challenges, we introduce a curated dataset to evaluate methods, test attack strategies, and propose lightweight, real-time defenses to prevent the generation of copyrighted text, ensuring the safe and lawful use of LLMs. Our experiments demonstrate that current LLMs frequently output copyrighted text, and that jailbreaking attacks can significantly increase the volume of copyrighted output. Our proposed defense mechanisms significantly reduce the volume of copyrighted text generated by LLMs by effectively refusing malicious requests. Code is publicly available at this https URL

Title: Detecting Errors through Ensembling Prompts (DEEP): An End-to-End LLM Framework for Detecting Factual Errors

Authors: Alex Chandler, Devesh Surve, Hui Su
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Detecting Errors through Ensembling Prompts (DEEP): An End-to-End LLM Framework for Detecting Factual Errors(https://arxiv.org/abs/)
Keywords: language model, llm, hallucination, prompt
Abstract: Accurate text summarization is one of the most common and important tasks performed by Large Language Models, where the costs of human review for an entire document may be high, but the costs of errors in summarization may be even greater. We propose Detecting Errors through Ensembling Prompts (DEEP) - an end-to-end large language model framework for detecting factual errors in text summarization. Our framework uses a diverse set of LLM prompts to identify factual inconsistencies, treating their outputs as binary features, which are then fed into ensembling models. We then calibrate the ensembled models to produce empirically accurate probabilities that a text is factually consistent or free of hallucination. We demonstrate that prior models for detecting factual errors in summaries perform significantly worse without optimizing the thresholds on subsets of the evaluated dataset. Our framework achieves state-of-the-art (SOTA) balanced accuracy on the AggreFact-XSUM FTSOTA, TofuEval Summary-Level, and HaluEval Summarization benchmarks in detecting factual errors within transformer-generated text summaries. It does so without any fine-tuning of the language model or reliance on thresholding techniques not available in practical settings.
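The ensembling idea in this abstract can be sketched roughly as follows. This is a minimal illustration, not the DEEP implementation: a plain majority vote stands in for the paper's trained ensembling models, calibration is omitted, and the names `majority_vote` and `balanced_accuracy` as well as the toy data are assumptions made for this example.

```python
from typing import Sequence


def majority_vote(prompt_outputs: Sequence[int]) -> int:
    """Combine binary prompt outputs (1 = consistent, 0 = factual error).

    Assumption: a simple majority vote replaces the learned ensembling model.
    """
    return 1 if 2 * sum(prompt_outputs) > len(prompt_outputs) else 0


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of the true-positive rate and the true-negative rate."""
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present in y_true")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return 0.5 * (tp / pos + tn / neg)


# Hypothetical toy data: each row holds the binary verdicts of three prompts
# for one summary; labels mark whether the summary is factually consistent.
features = [[1, 1, 0], [0, 0, 1], [1, 0, 0], [1, 1, 1]]
labels = [1, 0, 1, 1]
predictions = [majority_vote(row) for row in features]
score = balanced_accuracy(labels, predictions)
```

Balanced accuracy is used here because it is the metric the abstract reports; it weights the consistent and inconsistent classes equally regardless of how imbalanced the benchmark is.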

Title: D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models

Authors: Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, Zhihong Zhu, Xin Wang, Siqi Luo, Jing Xiong, Mi Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models(https://arxiv.org/abs/)
Keywords: language model, llm, hallucination
Abstract: Efficient inference in Large Language Models (LLMs) is impeded by the growing memory demands of key-value (KV) caching, especially for longer sequences. Traditional KV cache eviction strategies, which prioritize less critical KV-pairs based on attention scores, often degrade generation quality, leading to issues such as context loss or hallucinations. To address this, we introduce Dynamic Discriminative Operations (D2O), a novel method that utilizes two-level discriminative strategies to optimize KV cache size without fine-tuning, while preserving essential context. Initially, by observing varying densities of attention weights between shallow and deep layers, we use this insight to determine which layers should avoid excessive eviction to minimize information loss. Subsequently, for the eviction strategy in each layer, D2O innovatively incorporates a compensation mechanism that maintains a similarity threshold to re-discriminate the importance of previously discarded tokens, determining whether they should be recalled and merged with similar tokens. Our approach not only achieves significant memory savings and enhances inference throughput by more than 3x but also maintains high-quality long-text generation. Extensive experiments across various benchmarks and LLM architectures have demonstrated that D2O significantly enhances performance with a constrained KV cache budget.

Title: Think-then-Act: A Dual-Angle Evaluated Retrieval-Augmented Generation

Authors: Yige Shen, Hao Jiang, Hua Qu, Jihong Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Think-then-Act: A Dual-Angle Evaluated Retrieval-Augmented Generation(https://arxiv.org/abs/)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Despite their impressive capabilities, large language models (LLMs) often face challenges such as temporal misalignment and generating hallucinatory content. Enhancing LLMs with retrieval mechanisms to fetch relevant information from external sources offers a promising solution. Inspired by the proverb "Think twice before you act," we propose a dual-angle evaluated retrieval-augmented generation framework, Think-then-Act. Unlike previous approaches that indiscriminately rewrite queries or perform retrieval regardless of necessity, or generate temporary responses before deciding on additional retrieval, which increases model generation costs, our framework employs a two-phase process: (i) assessing the input query for clarity and completeness to determine if rewriting is necessary; and (ii) evaluating the model's capability to answer the query and deciding if additional retrieval is needed. Experimental results on five datasets show that the Think-then-Act framework significantly improves performance. Our framework demonstrates notable improvements in accuracy and efficiency compared to existing baselines and performs well in both English and non-English contexts. Ablation studies validate the optimal model confidence threshold, highlighting the resource optimization benefits of our approach.
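The two-phase process described in this abstract can be sketched as plain control flow. This is only an illustration under stated assumptions: every callable (`is_clear`, `rewrite`, `confidence`, `retrieve`, `generate`) and the default `threshold` of 0.5 are hypothetical placeholders, not the authors' components or their validated threshold.

```python
from typing import Callable, List


def think_then_act(
    query: str,
    is_clear: Callable[[str], bool],        # hypothetical clarity/completeness check
    rewrite: Callable[[str], str],          # hypothetical query rewriter
    confidence: Callable[[str], float],     # hypothetical model self-assessment in [0, 1]
    retrieve: Callable[[str], List[str]],   # hypothetical retriever
    generate: Callable[[str, List[str]], str],
    threshold: float = 0.5,                 # assumed value, not taken from the paper
) -> str:
    # Phase (i): rewrite only when the query is judged unclear or incomplete.
    if not is_clear(query):
        query = rewrite(query)
    # Phase (ii): retrieve only when the model seems unable to answer alone.
    passages = retrieve(query) if confidence(query) < threshold else []
    return generate(query, passages)
```

Passing the components in as callables keeps the sketch independent of any particular model or retriever; retrieval is skipped entirely whenever the confidence estimate meets the threshold.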

Title: Evaluating $n$-Gram Novelty of Language Models Using Rusty-DAWG

Authors: William Merrill, Noah A. Smith, Yanai Elazar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.13069
Pdf URL: https://arxiv.org/pdf/2406.13069
Copy Paste: [[2406.13069]] Evaluating $n$-Gram Novelty of Language Models Using Rusty-DAWG(https://arxiv.org/abs/2406.13069)
Keywords: language model
Abstract: How novel are texts generated by language models (LMs) relative to their training corpora? In this work, we investigate the extent to which modern LMs generate $n$-grams from their training data, evaluating both (i) the probability LMs assign to complete training $n$-grams and (ii) $n$-novelty, the proportion of $n$-grams generated by an LM that did not appear in the training data (for arbitrarily large $n$). To enable arbitrary-length $n$-gram search over a corpus in constant time, we develop Rusty-DAWG, a novel search tool inspired by indexing of genomic data. We compare the novelty of LM-generated text to human-written text and explore factors that affect generation novelty, focusing on the Pythia models. We find that, for $n > 4$, LM-generated text is less novel than human-written text, though it is more novel for smaller $n$. Larger LMs and more constrained decoding strategies both decrease novelty. Finally, we show that LMs complete $n$-grams with lower loss if they are less frequent in the training data. Overall, our results reveal factors influencing the novelty of LM-generated text, and we release Rusty-DAWG to facilitate further pretraining data research.
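As a rough illustration of the $n$-novelty metric defined in this abstract, the sketch below computes the proportion of generated $n$-grams that do not occur in a training corpus. It uses an ordinary Python set rather than Rusty-DAWG's index, so it is not the paper's tool and does not offer its constant-time arbitrary-length search; the function names and the whitespace tokenization are assumptions made for this example.

```python
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of a token list, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def n_novelty(generated: List[str], corpus: List[str], n: int) -> float:
    """Fraction of generated n-grams that never appear in the corpus."""
    seen = set(ngrams(corpus, n))
    candidates = ngrams(generated, n)
    if not candidates:
        return 0.0  # convention for this sketch: text shorter than n has no n-grams
    novel = sum(1 for gram in candidates if gram not in seen)
    return novel / len(candidates)


# Toy example with whitespace tokenization (an assumption of this sketch).
corpus = "the cat sat on the mat".split()
generated = "the cat sat on a mat".split()
unigram_novelty = n_novelty(generated, corpus, 1)
bigram_novelty = n_novelty(generated, corpus, 2)
```

A set of $n$-grams must be rebuilt for every value of $n$ and grows with the corpus, which is the cost an index over the whole corpus is meant to avoid.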

Title: Exploring and Benchmarking the Planning Capabilities of Large Language Models

Authors: Bernd Bohnet, Azade Nova, Aaron T Parisi, Kevin Swersky, Katayoon Goshvadi, Hanjun Dai, Dale Schuurmans, Noah Fiedel, Hanie Sedghi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.13094
Pdf URL: https://arxiv.org/pdf/2406.13094
Copy Paste: [[2406.13094]] Exploring and Benchmarking the Planning Capabilities of Large Language Models(https://arxiv.org/abs/2406.13094)
Keywords: language model, llm
Abstract: We seek to elevate the planning capabilities of Large Language Models (LLMs), investigating four main directions. First, we construct a comprehensive benchmark suite encompassing both classical planning domains and natural language scenarios. This suite includes algorithms to generate instances with varying levels of difficulty, allowing for rigorous and systematic evaluation of LLM performance. Second, we investigate the use of in-context learning (ICL) to enhance LLM planning, exploring the direct relationship between increased context length and improved planning performance. Third, we demonstrate the positive impact of fine-tuning LLMs on optimal planning paths, as well as the effectiveness of incorporating model-driven search procedures. Finally, we investigate the performance of the proposed methods in out-of-distribution scenarios, assessing the ability to generalize to novel and unseen planning challenges.

Title: Multi-Stage Balanced Distillation: Addressing Long-Tail Challenges in Sequence-Level Knowledge Distillation

Authors: Yuhang Zhou, Jing Zhu, Paiheng Xu, Xiaoyu Liu, Xiyao Wang, Danai Koutra, Wei Ai, Furong Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.13114
Pdf URL: https://arxiv.org/pdf/2406.13114
Copy Paste: [[2406.13114]] Multi-Stage Balanced Distillation: Addressing Long-Tail Challenges in Sequence-Level Knowledge Distillation(https://arxiv.org/abs/2406.13114)
Keywords: language model, llm
Abstract: Large language models (LLMs) have significantly advanced various natural language processing tasks, but deploying them remains computationally expensive. Knowledge distillation (KD) is a promising solution, enabling the transfer of capabilities from larger teacher LLMs to more compact student models. In particular, sequence-level KD, which distills rationale-based reasoning processes instead of merely final outcomes, shows great potential in enhancing students' reasoning capabilities. However, current methods struggle with sequence-level KD under long-tailed data distributions, adversely affecting generalization on sparsely represented domains. We introduce the Multi-Stage Balanced Distillation (BalDistill) framework, which iteratively balances training data within a fixed computational budget. By dynamically selecting representative head domain examples and synthesizing tail domain examples, BalDistill achieves state-of-the-art performance across diverse long-tailed datasets, enhancing both the efficiency and efficacy of the distilled models.

Title: Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?

Authors: Jinhyuk Lee, Anthony Chen, Zhuyun Dai, Dheeru Dua, Devendra Singh Sachan, Michael Boratko, Yi Luan, Sébastien M. R. Arnold, Vincent Perot, Siddharth Dalmia, Hexiang Hu, Xudong Lin, Panupong Pasupat, Aida Amini, Jeremy R. Cole, Sebastian Riedel, Iftekhar Naim, Ming-Wei Chang, Kelvin Guu
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2406.13121
Pdf URL: https://arxiv.org/pdf/2406.13121
Copy Paste: [[2406.13121]] Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?(https://arxiv.org/abs/2406.13121)
Keywords: language model, prompt
Abstract: Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases. Leveraging LCLMs' ability to natively ingest and process entire corpora of information offers numerous advantages. It enhances user-friendliness by eliminating the need for specialized knowledge of tools, provides robust end-to-end modeling that minimizes cascading errors in complex pipelines, and allows for the application of sophisticated prompting techniques across the entire system. To assess this paradigm shift, we introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning. Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks. However, LCLMs still face challenges in areas like compositional reasoning that are required in SQL-like tasks. Notably, prompting strategies significantly influence performance, emphasizing the need for continued research as context lengths grow. Overall, LOFT provides a rigorous testing ground for LCLMs, showcasing their potential to supplant existing paradigms and tackle novel tasks as model capabilities scale.

Title: Learning to Generate Answers with Citations via Factual Consistency Models

Authors: Rami Aly, Zhiqiang Tang, Samson Tan, George Karypis
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.13124
Pdf URL: https://arxiv.org/pdf/2406.13124
Copy Paste: [[2406.13124]] Learning to Generate Answers with Citations via Factual Consistency Models(https://arxiv.org/abs/2406.13124)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) frequently hallucinate, impeding their reliability in mission-critical situations. One approach to address this issue is to provide citations to relevant sources alongside generated content, enhancing the verifiability of generations. However, citing passages accurately in answers remains a substantial challenge. This paper proposes a weakly-supervised fine-tuning method leveraging factual consistency models (FCMs). Our approach alternates between generating texts with citations and supervised fine-tuning with FCM-filtered citation data. Focused learning is integrated into the objective, directing the fine-tuning process to emphasise the factual unit tokens, as measured by an FCM. Results on the ALCE few-shot citation benchmark with various instruction-tuned LLMs demonstrate superior performance compared to in-context learning, vanilla supervised fine-tuning, and state-of-the-art methods, with an average improvement of $34.1$, $15.5$, and $10.5$ citation F$_1$ points, respectively. Moreover, in a domain transfer setting we show that the obtained citation generation ability robustly transfers to unseen datasets. Notably, our citation improvements contribute to the lowest factual error rate across baselines.

Title: When Parts are Greater Than Sums: Individual LLM Components Can Outperform Full Models

Authors: Ting-Yun Chang, Jesse Thomason, Robin Jia
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.13131
Pdf URL: https://arxiv.org/pdf/2406.13131
Copy Paste: [[2406.13131]] When Parts are Greater Than Sums: Individual LLM Components Can Outperform Full Models(https://arxiv.org/abs/2406.13131)
Keywords: language model, llm, prompt
Abstract: This paper studies in-context learning (ICL) by decomposing the output of large language models into the individual contributions of attention heads and MLPs (components). We observe curious components: good-performing ones that individually do well on a classification task, even when the model performs poorly; bad-performing ones that do much worse than chance; and label-biased components that always predict the same label. We find that component accuracies are well-correlated across different demonstration sets and perturbations of prompt templates, even when the full-model accuracy varies greatly. Based on our findings, we propose component reweighting, which learns to linearly re-scale the component activations from a few labeled examples. Given 24 labeled examples, our method improves by an average of 6.0% accuracy points over 24-shot ICL across 8 tasks on Llama-2-7B. Overall, this paper both enriches our understanding of ICL and provides a practical method for improvement by examining model internals.

Title: PathoLM: Identifying pathogenicity from the DNA sequence through the Genome Foundation Model

Authors: Sajib Acharjee Dip, Uddip Acharjee Shuvo, Tran Chau, Haoqiu Song, Petra Choi, Xuan Wang, Liqing Zhang
Subjects: cs.CL, cs.LG, q-bio.GN
Abstract URL: https://arxiv.org/abs/2406.13133
Pdf URL: https://arxiv.org/pdf/2406.13133
Copy Paste: [[2406.13133]] PathoLM: Identifying pathogenicity from the DNA sequence through the Genome Foundation Model(https://arxiv.org/abs/2406.13133)
Keywords: language model
Abstract: Pathogen identification is pivotal in diagnosing, treating, and preventing diseases, crucial for controlling infections and safeguarding public health. Traditional alignment-based methods, though widely used, are computationally intense and reliant on extensive reference databases, often failing to detect novel pathogens due to their low sensitivity and specificity. Similarly, conventional machine learning techniques, while promising, require large annotated datasets and extensive feature engineering and are prone to overfitting. Addressing these challenges, we introduce PathoLM, a cutting-edge pathogen language model optimized for the identification of pathogenicity in bacterial and viral sequences. Leveraging the strengths of pre-trained DNA models such as the Nucleotide Transformer, PathoLM requires minimal data for fine-tuning, thereby enhancing pathogen detection capabilities. It effectively captures a broader genomic context, significantly improving the identification of novel and divergent pathogens. We developed a comprehensive data set comprising approximately 30 species of viruses and bacteria, including ESKAPEE pathogens, seven notably virulent bacterial strains resistant to antibiotics. Additionally, we curated a species classification dataset centered specifically on the ESKAPEE group. In comparative assessments, PathoLM dramatically outperforms existing models like DciPatho, demonstrating robust zero-shot and few-shot capabilities. Furthermore, we expanded PathoLM-Sp for ESKAPEE species classification, where it showed superior performance compared to other advanced deep learning methods, despite the complexities of the task.

Title: Large Language Models are Biased Because They Are Large Language Models

Authors: Philip Resnik
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.13138
Pdf URL: https://arxiv.org/pdf/2406.13138
Copy Paste: [[2406.13138]] Large Language Models are Biased Because They Are Large Language Models(https://arxiv.org/abs/2406.13138)
Keywords: language model, llm
Abstract: This paper's primary goal is to provoke thoughtful discussion about the relationship between bias and fundamental properties of large language models. We do this by seeking to convince the reader that harmful biases are an inevitable consequence arising from the design of any large language model as LLMs are currently formulated. To the extent that this is true, it suggests that the problem of harmful bias cannot be properly addressed without a serious reconsideration of AI driven by LLMs, going back to the foundational assumptions underlying their design.

Title: DialSim: A Real-Time Simulator for Evaluating Long-Term Dialogue Understanding of Conversational Agents

Authors: Jiho Kim, Woosog Chay, Hyeonji Hwang, Daeun Kyung, Hyunseung Chung, Eunbyeol Cho, Yohan Jo, Edward Choi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.13144
Pdf URL: https://arxiv.org/pdf/2406.13144
Copy Paste: [[2406.13144]] DialSim: A Real-Time Simulator for Evaluating Long-Term Dialogue Understanding of Conversational Agents(https://arxiv.org/abs/2406.13144)
Keywords: language model, llm, agent
Abstract: Recent advancements in Large Language Models (LLMs) have significantly enhanced the capabilities of conversational agents, making them applicable to various fields (e.g., education). Despite their progress, the evaluation of the agents often overlooks the complexities of real-world conversations, such as real-time interactions, multi-party dialogues, and extended contextual dependencies. To bridge this gap, we introduce DialSim, a real-time dialogue simulator. In this simulator, an agent is assigned the role of a character from popular TV shows, requiring it to respond to spontaneous questions using past dialogue information and to distinguish between known and unknown information. Key features of DialSim include evaluating the agent's ability to respond within a reasonable time limit, handling long-term multi-party dialogues, and managing adversarial settings (e.g., swap character names) to challenge the agent's reliance on pre-trained knowledge. We utilized this simulator to evaluate the latest conversational agents and analyze their limitations. Our experiments highlight both the strengths and weaknesses of these agents, providing valuable insights for future improvements in the field of conversational AI. DialSim is available at this https URL.

Title: Analyzing Diversity in Healthcare LLM Research: A Scientometric Perspective

Authors: David Restrepo, Chenwei Wu, Constanza Vásquez-Venegas, João Matos, Jack Gallifant, Luis Filipe
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.13152
Pdf URL: https://arxiv.org/pdf/2406.13152
Copy Paste: [[2406.13152]] Analyzing Diversity in Healthcare LLM Research: A Scientometric Perspective(https://arxiv.org/abs/2406.13152)
Keywords: language model, llm
Abstract: The deployment of large language models (LLMs) in healthcare has demonstrated substantial potential for enhancing clinical decision-making, administrative efficiency, and patient outcomes. However, the underrepresentation of diverse groups in the development and application of these models can perpetuate biases, leading to inequitable healthcare delivery. This paper presents a comprehensive scientometric analysis of LLM research for healthcare, including data from January 1, 2021, to June 16, 2024. By analyzing metadata from PubMed and Dimensions, including author affiliations, countries, and funding sources, we assess the diversity of contributors to LLM research. Our findings highlight significant gender and geographic disparities, with a predominance of male authors and contributions primarily from high-income countries (HICs). We introduce a novel journal diversity index based on Gini impurity to measure the inclusiveness of scientific publications. Our results underscore the necessity for greater representation in order to ensure the equitable application of LLMs in healthcare. We propose actionable strategies to enhance diversity and inclusivity in artificial intelligence research, with the ultimate goal of fostering a more inclusive and equitable future in healthcare innovation.
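For readers unfamiliar with the measure, Gini impurity over a set of category labels is 1 minus the sum of squared category shares. The sketch below computes that standard quantity; how the paper turns it into a journal diversity index is not specified in the abstract, so the function name `gini_impurity` and the example labels (such as "HIC" and "LMIC") are assumptions for illustration only.

```python
from collections import Counter
from typing import Hashable, Iterable


def gini_impurity(labels: Iterable[Hashable]) -> float:
    """Standard Gini impurity: 1 - sum_i p_i**2 over category shares p_i.

    0.0 means every item falls in one category; values rise as the
    categories become more evenly mixed.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # convention for this sketch: empty input has no diversity
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


# Hypothetical author-country income groups for one journal.
journal_authors = ["HIC", "HIC", "HIC", "LMIC"]
diversity = gini_impurity(journal_authors)
```

With k equally represented categories the value reaches its maximum of 1 - 1/k, so scores are only directly comparable between groupings that use the same number of categories.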

Title: QRMeM: Unleash the Length Limitation through Question then Reflection Memory Mechanism

Authors: Bo Wang, Heyan Huang, Yixin Cao, Jiahao Ying, Wei Tang, Chong Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.13167
Pdf URL: https://arxiv.org/pdf/2406.13167
Copy Paste: [[2406.13167]] QRMeM: Unleash the Length Limitation through Question then Reflection Memory Mechanism(https://arxiv.org/abs/2406.13167)
Keywords: language model, llm, long context
Abstract: While large language models (LLMs) have made notable advancements in natural language processing, they continue to struggle with processing extensive text. Memory mechanisms offer a flexible solution for managing long contexts, utilizing techniques such as compression, summarization, and structuring to facilitate nuanced and efficient handling of large volumes of text. However, existing techniques face challenges with static knowledge integration, leading to insufficient adaptation to task-specific needs and missing multi-segmentation relationships, which hinders the dynamic reorganization and logical combination of relevant segments during the response process. To address these issues, we introduce a novel strategy, Question then Reflection Memory Mechanism (QRMeM), incorporating a dual-structured memory pool. This pool synergizes static textual content with structured graph guidance, fostering a reflective trial-and-error approach for navigating and identifying relevant segments. Our evaluation across multiple-choice questions (MCQ) and multi-document question answering (Multi-doc QA) benchmarks showcases QRMeM's enhanced performance compared to existing approaches.

Title: Locating and Extracting Relational Concepts in Large Language Models

Authors: Zijian Wang, Britney White, Chang Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.13184
Pdf URL: https://arxiv.org/pdf/2406.13184
Copy Paste: [[2406.13184]] Locating and Extracting Relational Concepts in Large Language Models(https://arxiv.org/abs/2406.13184)
Keywords: language model, llm, prompt
Abstract: Relational concepts are indeed foundational to the structure of knowledge representation, as they facilitate the association between various entity concepts, allowing us to express and comprehend complex world knowledge. By expressing relational concepts in natural language prompts, people can effortlessly interact with large language models (LLMs) and recall desired factual knowledge. However, the process of knowledge recall lacks interpretability, and representations of relational concepts within LLMs remain unknown to us. In this paper, we identify hidden states that can express entity and relational concepts through causal mediation analysis in fact recall processes. Our finding reveals that at the last token position of the input prompt, there are hidden states that solely express the causal effects of relational concepts. Based on this finding, we assume that these hidden states can be treated as relational representations and we can successfully extract them from LLMs. The experimental results demonstrate high credibility of the relational representations: they can be flexibly transplanted into other fact recall processes, and can also be used as robust entity connectors. Moreover, we also show that the relational representations exhibit significant potential for controllable fact recall through relation rewriting.
摘要：关系概念确实是知识表示结构的基础，因为它们促进了各种实体概念之间的关联，使我们能够表达和理解复杂的世界知识。通过在自然语言提示中表达关系概念，人们可以毫不费力地与大型语言模型 (LLM) 交互并回忆所需的事实知识。然而，知识回忆的过程缺乏可解释性，而且 LLM 中关系概念的表示对我们来说仍然未知。在本文中，我们通过事实回忆过程中的因果中介分析来识别可以表达实体和关系概念的隐藏状态。我们的发现表明，在输入提示的最后一个标记位置，存在仅表达关系概念的因果效应的隐藏状态。基于这一发现，我们假设这些隐藏状态可以被视为关系表示，我们可以成功地从 LLM 中提取它们。实验结果表明，关系表示具有很高的可信度：它们可以灵活地移植到其他事实回忆过程中，也可以用作强大的实体连接器。此外，我们还表明，关系表示通过关系重写表现出可控事实回忆的巨大潜力。

Title: Learnable In-Context Vector for Visual Question Answering

Authors: Yingzhe Peng, Chenduo Hao, Xu Yang, Jiawei Peng, Xinting Hu, Xin Geng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.13185
Pdf URL: https://arxiv.org/pdf/2406.13185
Copy Paste: [[2406.13185]] Learnable In-Context Vector for Visual Question Answering(https://arxiv.org/abs/2406.13185)
Keywords: language model, llm
Abstract: As language models continue to scale, Large Language Models (LLMs) have exhibited emerging capabilities in In-Context Learning (ICL), enabling them to solve language tasks by prefixing a few in-context demonstrations (ICDs) as context. Inspired by these advancements, researchers have extended these techniques to develop Large Multimodal Models (LMMs) with ICL capabilities. However, applying ICL usually faces two major challenges: 1) using more ICDs will largely increase the inference time and 2) the performance is sensitive to the selection of ICDs. These challenges are further exacerbated in LMMs due to the integration of multiple data types and the combinational complexity of multimodal ICDs. Recently, to address these challenges, some NLP studies introduce non-learnable In-Context Vectors (ICVs) which extract useful task information from ICDs into a single vector and then insert it into the LLM to help solve the corresponding task. However, although useful in simple NLP tasks, these non-learnable methods fail to handle complex multimodal tasks like Visual Question Answering (VQA). In this study, we propose Learnable ICV (L-ICV) to distill essential task information from demonstrations, improving ICL performance in LMMs. Experiments show that L-ICV can significantly reduce computational costs while enhancing accuracy in VQA tasks compared to traditional ICL and other non-learnable ICV methods.
摘要：随着语言模型的不断扩展，大型语言模型 (LLM) 在上下文学习 (ICL) 方面展现出了新兴的能力，使它们能够通过在前缀中添加一些上下文演示 (ICD) 作为上下文来解决语言任务。受这些进步的启发，研究人员扩展了这些技术，开发了具有 ICL 功能的大型多模态模型 (LMM)。然而，应用 ICL 通常面临两大挑战：1) 使用更多的 ICD 将大大增加推理时间；2) 性能对 ICD 的选择很敏感。由于多种数据类型的集成和多模态 ICD 的组合复杂性，这些挑战在 LMM 中进一步加剧。最近，为了应对这些挑战，一些 NLP 研究引入了不可学习的上下文向量 (ICV)，它们从 ICD 中提取有用的任务信息到单个向量中，然后将其插入到 LLM 中以帮助解决相应的任务。然而，尽管这些不可学习的方法在简单的 NLP 任务中很有用，但它们无法处理复杂的多模态任务，如视觉问答 (VQA)。在本研究中，我们提出了可学习 ICV (L-ICV)，从演示中提取基本任务信息，从而提高 LMM 中的 ICL 性能。实验表明，与传统 ICL 和其他不可学习的 ICV 方法相比，L-ICV 可以显著降低计算成本，同时提高 VQA 任务的准确性。

Title: Synthetic Context Generation for Question Generation

Authors: Naiming Liu, Zichao Wang, Richard Baraniuk
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.13188
Pdf URL: https://arxiv.org/pdf/2406.13188
Copy Paste: [[2406.13188]] Synthetic Context Generation for Question Generation(https://arxiv.org/abs/2406.13188)
Keywords: language model, llm, prompt
Abstract: Despite rapid advancements in large language models (LLMs), question generation (QG) remains a challenging problem due to its complicated process, open-ended nature, and the diverse settings in which it occurs. A common approach to address these challenges involves fine-tuning smaller, custom models using datasets containing background context, question, and answer. However, obtaining suitable domain-specific datasets with appropriate context is often more difficult than acquiring question-answer pairs. In this paper, we investigate training QG models using synthetic contexts generated by LLMs from readily available question-answer pairs. We conduct a comprehensive study to answer critical research questions related to the performance of models trained on synthetic contexts and their potential impact on QG research and applications. Our empirical results reveal: 1) contexts are essential for QG tasks, even if they are synthetic; 2) fine-tuning smaller language models can achieve better performance than prompting larger language models; and 3) synthetic context and real context can achieve comparable performance. These findings highlight the effectiveness of synthetic contexts in QG and pave the way for future advancements in the field.
摘要：尽管大型语言模型 (LLM) 取得了快速发展，但由于其复杂的过程、开放的性质以及问题生成的多样化环境，QG 仍然是一个具有挑战性的问题。解决这些挑战的一种常见方法是使用包含背景上下文、问题和答案的数据集对较小的自定义模型进行微调。然而，获得具有适当上下文的合适领域特定数据集通常比获取问答对更困难。在本文中，我们研究使用 LLM 从现成的问答对生成的合成上下文来训练 QG 模型。我们进行了一项全面的研究，以回答与在合成上下文上训练的模型的性能及其对 QG 研究和应用的潜在影响相关的关键研究问题。我们的实证结果表明：1）上下文对于 QG 任务至关重要，即使是合成的；2）与提示更大的语言模型相比，微调较小的语言模型能够实现更好的性能；3）合成上下文和真实上下文可以实现相当的性能。这些发现凸显了合成上下文在 QG 中的有效性，并为该领域的未来发展铺平了道路。

Title: Multi-Meta-RAG: Improving RAG for Multi-Hop Queries using Database Filtering with LLM-Extracted Metadata

Authors: Mykhailo Poliakov, Nadiya Shvai
Subjects: cs.CL, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2406.13213
Pdf URL: https://arxiv.org/pdf/2406.13213
Copy Paste: [[2406.13213]] Multi-Meta-RAG: Improving RAG for Multi-Hop Queries using Database Filtering with LLM-Extracted Metadata(https://arxiv.org/abs/2406.13213)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) enables retrieval of relevant information from an external knowledge source and allows large language models (LLMs) to answer queries over previously unseen document collections. However, it has been demonstrated that traditional RAG applications perform poorly in answering multi-hop questions, which require retrieving and reasoning over multiple elements of supporting evidence. We introduce a new method called Multi-Meta-RAG, which uses database filtering with LLM-extracted metadata to improve the RAG selection of documents from various sources that are relevant to the question. While database filtering is specific to a set of questions from a particular domain and format, we found that Multi-Meta-RAG greatly improves the results on the MultiHop-RAG benchmark. The code is available at this https URL.
摘要：检索增强生成 (RAG) 可以从外部知识源检索相关信息，并允许大型语言模型 (LLM) 回答针对以前未见过的文档集合的查询。然而，事实证明，传统的 RAG 应用程序在回答多跳问题方面表现不佳，因为多跳问题需要检索和推理多个支持证据元素。我们引入了一种称为 Multi-Meta-RAG 的新方法，该方法使用数据库过滤和 LLM 提取的元数据来改进 RAG 从各种来源中选择与问题相关的相关文档。虽然数据库过滤特定于来自特定领域和格式的一组问题，但我们发现 Multi-Meta-RAG 极大地改善了 MultiHop-RAG 基准测试的结果。代码可在此 https URL 上找到。
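
The database-filtering idea in the Multi-Meta-RAG abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the chunk layout, the `source` metadata key, the function names, and the word-overlap scoring are all assumptions made for illustration, and the set of allowed sources is assumed to have already been extracted from the query by an LLM.

```python
def filter_chunks(chunks, allowed_sources):
    """Keep only chunks whose 'source' metadata is in allowed_sources (hypothetical schema)."""
    return [c for c in chunks if c["metadata"].get("source") in allowed_sources]


def rank_by_overlap(chunks, query):
    """Toy relevance score: number of query words that also appear in the chunk text."""
    query_words = set(query.lower().split())
    return sorted(
        chunks,
        key=lambda c: len(query_words & set(c["text"].lower().split())),
        reverse=True,
    )


def retrieve(chunks, query, allowed_sources, top_k=2):
    """Filter by metadata first, then rank the remaining chunks."""
    return rank_by_overlap(filter_chunks(chunks, allowed_sources), query)[:top_k]


chunks = [
    {"text": "the company reported record revenue", "metadata": {"source": "A"}},
    {"text": "revenue fell sharply at the company", "metadata": {"source": "B"}},
    {"text": "a new stadium opened downtown", "metadata": {"source": "A"}},
]
hits = retrieve(chunks, "company revenue", allowed_sources={"A"}, top_k=1)
```

Filtering before ranking means that a chunk from an unrelated source cannot crowd out relevant evidence, no matter how well it matches the query text.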

Title: Bridging Law and Data: Augmenting Reasoning via a Semi-Structured Dataset with IRAC methodology

Authors: Xiaoxi Kang, Lizhen Qu, Lay-Ki Soon, Zhuang Li, Adnan Trakic
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.13217
Pdf URL: https://arxiv.org/pdf/2406.13217
Copy Paste: [[2406.13217]] Bridging Law and Data: Augmenting Reasoning via a Semi-Structured Dataset with IRAC methodology(https://arxiv.org/abs/2406.13217)
Keywords: language model, llm
Abstract: The effectiveness of Large Language Models (LLMs) in legal reasoning is often limited due to the unique legal terminologies and the necessity for highly specialized knowledge. These limitations highlight the need for high-quality data tailored for complex legal reasoning tasks. This paper introduces LEGALSEMI, a benchmark specifically curated for legal scenario analysis. LEGALSEMI comprises 54 legal scenarios, each rigorously annotated by legal experts, based on the comprehensive IRAC (Issue, Rule, Application, Conclusion) framework. In addition, LEGALSEMI is accompanied by a structured knowledge graph (SKG). A series of experiments were conducted to assess the usefulness of LEGALSEMI for IRAC analysis. The experimental results demonstrate the effectiveness of incorporating the SKG for issue identification, rule retrieval, application and conclusion generation using four different LLMs. LEGALSEMI will be publicly available upon acceptance of this paper.
摘要：由于法律术语独特且需要高度专业化的知识，大型语言模型 (LLM) 在法律推理中的有效性通常受到限制。这些限制凸显了对针对复杂法律推理任务量身定制的高质量数据的需求。本文介绍了 LEGALSEMI，这是一个专门为法律场景分析而策划的基准。LEGALSEMI 包含 54 个法律场景，每个场景均由法律专家严格注释，基于全面的 IRAC（问题、规则、应用、结论）框架。此外，LEGALSEMI 还附带一个结构化知识图谱 (SKG)。进行了一系列实验以评估 LEGALSEMI 对 IRAC 分析的有用性。实验结果证明了使用四种不同的 LLM 结合 SKG 进行问题识别、规则检索、应用和结论生成的有效性。本文被接受后，LEGALSEMI 将公开。

Title: Probing the Emergence of Cross-lingual Alignment during LLM Training

Authors: Hetong Wang, Pasquale Minervini, Edoardo M. Ponti
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.13229
Pdf URL: https://arxiv.org/pdf/2406.13229
Copy Paste: [[2406.13229]] Probing the Emergence of Cross-lingual Alignment during LLM Training(https://arxiv.org/abs/2406.13229)
Keywords: language model, llm
Abstract: Multilingual Large Language Models (LLMs) achieve remarkable levels of zero-shot cross-lingual transfer performance. We speculate that this is predicated on their ability to align languages without explicit supervision from parallel sentences. While representations of translationally equivalent sentences in different languages are known to be similar after convergence, it remains unclear how such cross-lingual alignment emerges during pre-training of LLMs. Our study leverages intrinsic probing techniques, which identify which subsets of neurons encode linguistic features, to correlate the degree of cross-lingual neuron overlap with the zero-shot cross-lingual transfer performance for a given model. In particular, we rely on checkpoints of BLOOM, a multilingual autoregressive LLM, across different training steps and model scales. We observe a high correlation between neuron overlap and downstream performance, which supports our hypothesis on the conditions leading to effective cross-lingual transfer. Interestingly, we also detect a degradation of both implicit alignment and multilingual abilities in certain phases of the pre-training process, providing new insights into the multilingual pretraining dynamics.
摘要：多语言大型语言模型 (LLM) 实现了卓越的零样本跨语言迁移性能。我们推测这是因为它们能够在没有来自平行句子的明确监督的情况下对齐语言。虽然已知不同语言中翻译等效句子的表示在收敛后是相似的，但这种跨语言对齐是如何在 LLM 的预训练过程中出现的仍不清楚。我们的研究利用内在探测技术，确定哪些神经元子集编码了语言特征，以将跨语言神经元重叠程度与给定模型的零样本跨语言迁移性能相关联。具体来说，我们依赖于 BLOOM（一种多语言自回归 LLM）的检查点，该检查点跨越不同的训练步骤和模型规模。我们观察到神经元重叠与下游性能之间存在高度相关性，这支持了我们关于导致有效跨语言迁移的条件的假设。有趣的是，我们还发现在预训练过程的某些阶段隐式对齐和多语言能力都有所下降，这为多语言预训练动态提供了新的见解。

Title: Enhancing Language Model Factuality via Activation-Based Confidence Calibration and Guided Decoding

Authors: Xin Liu, Farima Fatahi Bayat, Lu Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.13230
Pdf URL: https://arxiv.org/pdf/2406.13230
Copy Paste: [[2406.13230]] Enhancing Language Model Factuality via Activation-Based Confidence Calibration and Guided Decoding(https://arxiv.org/abs/2406.13230)
Keywords: language model
Abstract: Calibrating language models (LMs) aligns their generation confidence with the actual likelihood of answer correctness, which can inform users about LMs' reliability and mitigate hallucinated content. However, prior calibration methods, such as self-consistency-based and logit-based approaches, are either limited in inference-time efficiency or fall short of providing informative signals. Moreover, simply filtering out low-confidence responses reduces the LM's helpfulness when the answers are correct. Therefore, effectively using calibration techniques to enhance an LM's factuality remains an unsolved challenge. In this paper, we first propose an activation-based calibration method, ActCab, which trains a linear layer on top of the LM's last-layer activations that can better capture the representations of knowledge. Built on top of ActCab, we further propose CoDec, a confidence-guided decoding strategy to elicit truthful answers with high confidence from LMs. Evaluated on five popular QA benchmarks, ActCab achieves better calibration performance than all competitive baselines, e.g., by reducing the average expected calibration error (ECE) score by up to 39%. Further experiments on CoDec show consistent improvements in several LMs' factuality on challenging QA datasets, such as TruthfulQA, highlighting the value of confidence signals in enhancing factuality.
摘要：校准语言模型 (LM) 可将其生成置信度与答案正确的实际可能性保持一致，从而让用户了解 LM 的可靠性并减轻幻觉内容。然而，先前的校准方法（例如基于自洽性和基于逻辑的方法）要么在推理时间效率方面受到限制，要么无法提供信息信号。此外，当答案正确时，简单地过滤掉低置信度的响应会降低 LM 的帮助作用。因此，有效地使用校准技术来增强 LM 的真实性仍然是一个尚未解决的挑战。在本文中，我们首先提出了一种基于激活的校准方法 ActCab，它在 LM 的最后一层激活之上训练一个线性层，可以更好地捕捉知识的表示。在 ActCab 的基础上，我们进一步提出了 CoDec，这是一种置信度引导的解码策略，可以从 LM 中以高置信度得出真实的答案。通过对五个流行的 QA 基准进行评估，ActCab 实现了比所有竞争基线更出色的校准性能，例如，将平均预期校准误差 (ECE) 分数降低了多达 39%。在 CoDec 上进行的进一步实验表明，多个 LM 在具有挑战性的 QA 数据集（例如 TruthfulQA）上的真实性得到了持续改进，凸显了置信度信号在增强真实性方面的价值。
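
The expected calibration error (ECE) mentioned in the abstract above is a standard metric, and a small worked example may help. The sketch below uses the common equal-width binning formulation; the function name, the bin count, and the sample data are illustrative assumptions, and the paper's exact evaluation setup may differ.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, is_correct))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Four answers, each given with confidence 0.9, three of them correct:
# a single bin with mean confidence 0.9 and accuracy 0.75.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

A lower ECE means that stated confidence tracks actual accuracy more closely, which is the property a calibration method aims to improve.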

Title: Towards Robust Evaluation: A Comprehensive Taxonomy of Datasets and Metrics for Open Domain Question Answering in the Era of Large Language Models

Authors: Akchay Srivastava, Atif Memon
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2406.13232
Pdf URL: https://arxiv.org/pdf/2406.13232
Copy Paste: [[2406.13232]] Towards Robust Evaluation: A Comprehensive Taxonomy of Datasets and Metrics for Open Domain Question Answering in the Era of Large Language Models(https://arxiv.org/abs/2406.13232)
Keywords: language model
Abstract: Open Domain Question Answering (ODQA) within natural language processing involves building systems that answer factual questions using large-scale knowledge corpora. Recent advances stem from the confluence of several factors, such as large-scale training datasets, deep learning techniques, and the rise of large language models. High-quality datasets are used to train models on realistic scenarios and enable the evaluation of the system on potentially unseen data. Standardized metrics facilitate comparisons between different ODQA systems, allowing researchers to objectively track advancements in the field. Our study presents a thorough examination of the current landscape of ODQA benchmarking by reviewing 52 datasets and 20 evaluation techniques across textual and multimodal modalities. We introduce a novel taxonomy for ODQA datasets that incorporates both the modality and difficulty of the question types. Additionally, we present a structured organization of ODQA evaluation metrics along with a critical analysis of their inherent trade-offs. Our study aims to empower researchers by providing a framework for the robust evaluation of modern question-answering systems. We conclude by identifying the current challenges and outlining promising avenues for future research and development.
摘要：自然语言处理中的开放域问答 (ODQA) 涉及构建使用大规模知识语料库回答事实问题的系统。最近的进展源于多种因素的融合，例如大规模训练数据集、深度学习技术和大型语言模型的兴起。高质量数据集用于在现实场景中训练模型，并支持对可能看不见的数据进行系统评估。标准化指标有助于比较不同的 ODQA 系统，使研究人员能够客观地跟踪该领域的进展。我们的研究通过回顾 52 个数据集和 20 种跨文本和多模态模态的评估技术，对当前 ODQA 基准测试的现状进行了全面检查。我们引入了一种新的 ODQA 数据集分类法，该分类法结合了问题类型的模态和难度。此外，我们还介绍了 ODQA 评估指标的结构化组织以及对其固有权衡的批判性分析。我们的研究旨在通过提供一个对现代问答系统进行稳健评估的框架来增强研究人员的能力。最后，我们确定了当前的挑战并概述了未来研究和开发的有希望的途径。

Title: Data Contamination Can Cross Language Barriers

Authors: Feng Yao, Yufan Zhuang, Zihao Sun, Sunan Xu, Animesh Kumar, Jingbo Shang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.13236
Pdf URL: https://arxiv.org/pdf/2406.13236
Copy Paste: [[2406.13236]] Data Contamination Can Cross Language Barriers(https://arxiv.org/abs/2406.13236)
Keywords: language model, llm
Abstract: The opacity in developing large language models (LLMs) is raising growing concerns about the potential contamination of public benchmarks in the pre-training data. Existing contamination detection methods are typically based on the text overlap between training and evaluation data, which can be too superficial to reflect deeper forms of contamination. In this paper, we first present a cross-lingual form of contamination that inflates LLMs' performance while evading current detection methods, deliberately injected by overfitting LLMs on the translated versions of benchmark test sets. Then, we propose generalization-based approaches to unmask such deeply concealed contamination. Specifically, we examine the LLM's performance change after modifying the original benchmark by replacing the false answer choices with correct ones from other questions. Contaminated models can hardly generalize to such easier situations, where the false choices can be "not even wrong", as all choices are correct in their memorization. Experimental results demonstrate that cross-lingual contamination can easily fool existing detection methods, but not ours. In addition, we discuss the potential utilization of cross-lingual contamination in interpreting LLMs' working mechanisms and in post-training LLMs for enhanced multilingual capabilities. The code and dataset we use can be obtained from this https URL.
摘要：大型语言模型 (LLM) 开发过程中的不透明性引发了人们对公共基准在预训练数据中可能受到污染的担忧。现有的污染检测方法通常基于训练和评估数据之间的文本重叠，这可能过于肤浅，无法反映更深层次的污染形式。在本文中，我们首先介绍了一种跨语言污染形式，这种污染会夸大 LLM 的性能，同时避开当前的检测方法，这种污染是通过将 LLM 过度拟合到基准测试集的翻译版本上而故意注入的。然后，我们提出了基于泛化的方法来揭露这种深藏不露的污染。具体来说，我们通过将错误的答案选项替换为其他问题的正确答案选项来修改原始基准后，检查 LLM 的性能变化。受污染的模型很难推广到这种更简单的情况，在这些情况下，错误的选择可能“甚至不是错误的”，因为所有选择在记忆中都是正确的。实验结果表明，跨语言污染很容易欺骗现有的检测方法，但不能欺骗我们的方法。此外，我们讨论了跨语言污染在解释 LLM 的工作机制和训练后 LLM 中的潜在用途，以增强多语言能力。我们使用的代码和数据集可以从此 https URL 获取。
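
The benchmark modification described in the abstract above (replacing the false answer choices with correct answers from other questions) can be sketched as follows. This is only an illustration of the idea, not the authors' code: the item schema (`question`, `choices`, `answer` as an index) and the function name are assumptions.

```python
import random


def build_generalized_item(item, other_items, rng):
    """Replace an item's wrong choices with correct answers drawn from other questions."""
    correct = item["choices"][item["answer"]]
    # Correct answers of other questions, excluding any that equal this item's answer.
    pool = [
        other["choices"][other["answer"]]
        for other in other_items
        if other["choices"][other["answer"]] != correct
    ]
    n_distractors = len(item["choices"]) - 1
    if len(pool) < n_distractors:
        raise ValueError("not enough other questions to draw distractors from")
    choices = rng.sample(pool, n_distractors) + [correct]
    rng.shuffle(choices)
    return {
        "question": item["question"],
        "choices": choices,
        "answer": choices.index(correct),
    }


items = [
    {"question": "Capital of France?", "choices": ["Paris", "Rome", "Oslo"], "answer": 0},
    {"question": "2 + 2?", "choices": ["3", "4", "5"], "answer": 1},
    {"question": "Chemical symbol for gold?", "choices": ["Ag", "Fe", "Au"], "answer": 2},
    {"question": "Largest planet?", "choices": ["Mars", "Jupiter", "Venus"], "answer": 1},
]
new_item = build_generalized_item(items[0], items[1:], random.Random(0))
```

The new distractors are plainly irrelevant to the question, so a model that understands it should find the modified item easier. According to the abstract, a model that has memorized the benchmark instead sees every choice as a "correct" answer and struggles to generalize.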

Title: GSR-BENCH: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs

Authors: Navid Rajabi, Jana Kosecka
Subjects: cs.CL, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2406.13246
Pdf URL: https://arxiv.org/pdf/2406.13246
Copy Paste: [[2406.13246]] GSR-BENCH: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs(https://arxiv.org/abs/2406.13246)
Keywords: language model, llm
Abstract: The ability to understand and reason about spatial relationships between objects in images is an important component of visual reasoning. This skill rests on the ability to recognize and localize objects of interest and determine their spatial relation. Early vision and language models (VLMs) have been shown to struggle to recognize spatial relations. We extend the previously released What'sUp dataset and propose a novel comprehensive evaluation for spatial relationship understanding that highlights the strengths and weaknesses of 27 different models. In addition to the VLMs evaluated in What'sUp, our extensive evaluation encompasses 3 classes of Multimodal LLMs (MLLMs) that vary in their parameter sizes (ranging from 7B to 110B), training/instruction-tuning methods, and visual resolution to benchmark their performances and scrutinize the scaling laws in this task.
摘要：理解和推理图像中物体之间的空间关系的能力是视觉推理的重要组成部分。这种技能依赖于识别和定位感兴趣的物体并确定其空间关系的能力。早期的视觉和语言模型 (VLM) 已被证明难以识别空间关系。我们扩展了之前发布的 What'sUp 数据集，并提出了一种新颖的综合评估方法来理解空间关系，突出了 27 种不同模型的优缺点。除了 What'sUp 中评估的 VLM 之外，我们的广泛评估还涵盖了 3 类多模态 LLM (MLLM)，它们的参数大小（范围从 7B 到 110B）、训练/指令调整方法和视觉分辨率各不相同，以对其性能进行基准测试并仔细研究此任务中的缩放定律。

Title: R^2AG: Incorporating Retrieval Information into Retrieval Augmented Generation

Authors: Fuda Ye, Shuangyin Li, Yongqi Zhang, Lei Chen
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2406.13249
Pdf URL: https://arxiv.org/pdf/2406.13249
Copy Paste: [[2406.13249]] R^2AG: Incorporating Retrieval Information into Retrieval Augmented Generation(https://arxiv.org/abs/2406.13249)
Keywords: language model, llm, prompt, retrieval augmented generation
Abstract: Retrieval augmented generation (RAG) has been applied in many scenarios to augment large language models (LLMs) with external documents provided by retrievers. However, a semantic gap exists between LLMs and retrievers due to differences in their training objectives and architectures. This misalignment forces LLMs to passively accept the documents provided by the retrievers, leading to incomprehension in the generation process, where the LLMs are burdened with the task of distinguishing these documents using their inherent knowledge. This paper proposes R$^2$AG, a novel enhanced RAG framework to fill this gap by incorporating Retrieval information into Retrieval Augmented Generation. Specifically, R$^2$AG utilizes the nuanced features from the retrievers and employs a R$^2$-Former to capture retrieval information. Then, a retrieval-aware prompting strategy is designed to integrate retrieval information into LLMs' generation. Notably, R$^2$AG suits low-source scenarios where LLMs and retrievers are frozen. Extensive experiments across five datasets validate the effectiveness, robustness, and efficiency of R$^2$AG. Our analysis reveals that retrieval information serves as an anchor to aid LLMs in the generation process, thereby filling the semantic gap.
摘要：检索增强生成 (RAG) 已应用于许多场景，以使用检索器提供的外部文档来增强大型语言模型 (LLM)。然而，由于训练目标和架构的差异，LLM 和检索器之间存在语义鸿沟。这种不一致迫使 LLM 被动接受检索器提供的文档，导致生成过程中无法理解，而 LLM 则需要使用其固有知识来区分这些文档。本文提出了 R$^2$AG，这是一种新颖的增强型 RAG 框架，通过将检索信息纳入检索增强生成来填补这一空白。具体而言，R$^2$AG 利用检索器的细微特征并使用 R$^2$-Former 来捕获检索信息。然后，设计了一种检索感知提示策略，将检索信息集成到 LLM 的生成中。值得注意的是，R$^2$AG 适用于 LLM 和检索器被冻结的低源场景。在五个数据集上进行的大量实验验证了 R$^2$AG 的有效性、稳健性和效率。我们的分析表明，检索信息可以作为辅助 LLM 生成过程的锚点，从而填补语义空白。

Title: BeHonest: Benchmarking Honesty of Large Language Models

Authors: Steffi Chern, Zhulin Hu, Yuqing Yang, Ethan Chern, Yuan Guo, Jiahe Jin, Binjie Wang, Pengfei Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] BeHonest: Benchmarking Honesty of Large Language Models(https://arxiv.org/abs/)
Keywords: language model, llm
Abstract: Previous works on Large Language Models (LLMs) have mainly focused on evaluating their helpfulness or harmlessness. However, honesty, another crucial alignment criterion, has received relatively less attention. Dishonest behaviors in LLMs, such as spreading misinformation and defrauding users, eroding user trust, and causing real-world harm, present severe risks that intensify as these models approach superintelligence levels. Enhancing honesty in LLMs addresses critical deficiencies and helps uncover latent capabilities that are not readily expressed. This underscores the urgent need for reliable methods and benchmarks to effectively ensure and evaluate the honesty of LLMs. In this paper, we introduce BeHonest, a pioneering benchmark specifically designed to assess honesty in LLMs comprehensively. BeHonest evaluates three essential aspects of honesty: awareness of knowledge boundaries, avoidance of deceit, and consistency in responses. Building on this foundation, we designed 10 scenarios to evaluate and analyze 9 popular LLMs on the market, including both closed-source and open-source models from different model families with varied model sizes. Our findings indicate that there is still significant room for improvement in the honesty of LLMs. We also encourage the AI community to prioritize honesty alignment in LLMs. Our benchmark and code can be found at this https URL.
摘要：先前对大型语言模型 (LLM) 的研究主要集中在评估其有用性或无害性。然而，另一个关键的对齐标准诚实性却受到的关注相对较少。LLM 中的不诚实行为，例如传播错误信息和欺骗用户、削弱用户信任以及造成现实世界的伤害，都带来了严重的风险，而且随着这些模型接近超级智能水平，这种风险会加剧。提高 LLM 中的诚实性可以解决关键缺陷，并有助于发现不易表达的潜在能力。这凸显了对可靠方法和基准的迫切需求，以有效地确保和评估 LLM 的诚实性。在本文中，我们介绍了 BeHonest，这是一个专门为全面评估 LLM 中的诚实性而设计的先驱基准。BeHonest 评估诚实的三个基本方面：对知识边界的认识、避免欺骗和回答的一致性。在此基础上，我们设计了 10 个场景来评估和分析市场上 9 个流行的 LLM，包括来自不同模型系列、不同模型大小的闭源和开源模型。我们的研究结果表明，LLM 的诚实性仍有很大的改进空间。我们还鼓励 AI 社区优先考虑 LLM 中的诚实性对齐。我们的基准和代码可以在此 https URL 找到。

Title: Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective

Authors: Meizhi Zhong, Chen Zhang, Yikun Lei, Xikai Liu, Yan Gao, Yao Hu, Kehai Chen, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.13282
Pdf URL: https://arxiv.org/pdf/2406.13282
Copy Paste: [[2406.13282]] Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective(https://arxiv.org/abs/2406.13282)
Keywords: llm
Abstract: Enabling LLMs to handle lengthy context is currently a research hotspot. Most LLMs are built upon rotary position embedding (RoPE), a popular position encoding method. Therefore, a prominent path is to extrapolate the RoPE trained on comparably short texts to far longer texts. Considerable effort has been dedicated to boosting the extrapolation by extending the formulations of the RoPE; however, few of these works have attempted to showcase their inner workings comprehensively. In this paper, we aim to offer a straightforward yet in-depth understanding of RoPE extensions from an attention perspective and on two benchmarking tasks. A broad array of experiments reveals several valuable findings: 1) Maintaining attention patterns close to those at the pretrained length improves extrapolation; 2) Large attention uncertainty leads to retrieval errors; 3) Using longer continual pretraining lengths for RoPE extensions could reduce attention uncertainty and significantly enhance extrapolation.
摘要：使 LLM 能够处理较长的上下文是当前的研究热点。大多数 LLM 都建立在旋转位置嵌入 (RoPE) 的基础上，这是一种流行的位置编码方法。因此，一种突出的方法是将在相对较短的文本上训练的 RoPE 推广到更长的文本。人们付出了大量的努力来通过扩展 RoPE 的公式来增强推广能力，然而，很少有人试图全面展示它们的内部工作原理。在本文中，我们致力于从注意力的角度和两个基准测试任务对 RoPE 扩展提供直接而深入的理解。广泛的实验揭示了几个有价值的发现：1) 将注意力模式保持在预训练长度可以改善推广能力；2) 较大的注意力不确定性会导致检索错误；3) 对 RoPE 扩展使用更长的连续预训练长度可以减少注意力不确定性并显着增强推广能力。

Title: Improving Zero-shot LLM Re-Ranker with Risk Minimization

Authors: Xiaowei Yuan, Zhao Yang, Yequan Wang, Jun Zhao, Kang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.13331
Pdf URL: https://arxiv.org/pdf/2406.13331
Copy Paste: [[2406.13331]] Improving Zero-shot LLM Re-Ranker with Risk Minimization(https://arxiv.org/abs/2406.13331)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: In Retrieval-Augmented Generation (RAG) systems, advanced Large Language Models (LLMs) have emerged as effective unsupervised Query Likelihood Models (QLMs), which re-rank documents based on the probability of generating the query given the content of a document. However, directly prompting LLMs to approximate QLMs is inherently biased, as the estimated distribution might diverge from the actual document-specific distribution. In this study, we introduce a novel framework, $\mathrm{UR^3}$, which leverages Bayesian decision theory to both quantify and mitigate this estimation bias. Specifically, $\mathrm{UR^3}$ reformulates the problem as maximizing the probability of document generation, thereby harmonizing the optimization of query and document generation probabilities under a unified risk minimization objective. Our empirical results indicate that $\mathrm{UR^3}$ significantly enhances re-ranking, particularly in improving the Top-1 accuracy. It benefits the QA tasks by achieving higher accuracy with fewer input documents.
摘要：在检索增强生成 (RAG) 系统中，先进的大型语言模型 (LLM) 已成为一种有效的无监督查询似然模型 (QLM)，它根据给定文档内容生成查询的概率对文档进行重新排序。但是，直接提示 LLM 近似 QLM 本质上是有偏差的，其中估计的分布可能与实际的文档特定分布不同。在本研究中，我们引入了一个新框架 $\mathrm{UR^3}$，它利用贝叶斯决策理论来量化和减轻这种估计偏差。具体而言，$\mathrm{UR^3}$ 将问题重新表述为最大化文档生成的概率，从而在统一的风险最小化目标下协调查询和文档生成概率的优化。我们的实证结果表明，$\mathrm{UR^3}$ 显着增强了重新排名，尤其是在提高 Top-1 准确率方面。它可以通过更少的输入文档实现更高的准确性，从而使 QA 任务受益。

Title: SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words

Authors: Junyi Ao, Yuancheng Wang, Xiaohai Tian, Dekun Chen, Jun Zhang, Lu Lu, Yuxuan Wang, Haizhou Li, Zhizheng Wu
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2406.13340
Pdf URL: https://arxiv.org/pdf/2406.13340
Copy Paste: [[2406.13340]] SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words(https://arxiv.org/abs/2406.13340)
Keywords: language model, llm, chat
Abstract: Speech encompasses a wealth of information, including but not limited to content, paralinguistic, and environmental information. This comprehensive nature of speech significantly impacts communication and is crucial for human-computer interaction. Chat-Oriented Large Language Models (LLMs), known for their general-purpose assistance capabilities, have evolved to handle multi-modal inputs, including speech. Although these models can be adept at recognizing and analyzing speech, they often fall short of generating appropriate responses. We argue that this is due to the lack of principles on task definition and model development, which requires open-source datasets and metrics suitable for model evaluation. To bridge the gap, we present SD-Eval, a benchmark dataset aimed at multidimensional evaluation of spoken dialogue understanding and generation. SD-Eval focuses on paralinguistic and environmental information and includes 7,303 utterances, amounting to 8.76 hours of speech data. The data is aggregated from eight public datasets, representing four perspectives: emotion, accent, age, and background sound. To assess the SD-Eval benchmark dataset, we implement three different models and construct a training set following a similar process as SD-Eval. The training set contains 1,052.72 hours of speech data and 724.4k utterances. We also conduct a comprehensive evaluation using objective evaluation methods (e.g. BLEU and ROUGE), subjective evaluations and LLM-based metrics for the generated responses. Models conditioned with paralinguistic and environmental information outperform their counterparts in both objective and subjective measures. Moreover, experiments demonstrate LLM-based metrics show a higher correlation with human evaluation compared to traditional metrics. We open-source SD-Eval at this https URL.
摘要：语音包含大量信息，包括但不限于内容、副语言和环境信息。语音的这种综合性质对交流有重大影响，对于人机交互至关重要。以通用辅助功能而闻名的面向聊天的大型语言模型 (LLM) 已经发展到可以处理包括语音在内的多模态输入。尽管这些模型可以熟练地识别和分析语音，但它们往往无法生成适当的响应。我们认为这是由于缺乏任务定义和模型开发的原则，这需要适合模型评估的开源数据集和指标。为了弥补这一差距，我们提出了 SD-Eval，这是一个基准数据集，旨在多维评估口语对话的理解和生成。SD-Eval 专注于副语言和环境信息，包括 7,303 条话语，相当于 8.76 小时的语音数据。数据来自八个公共数据集，代表四个视角：情感、口音、年龄和背景声音。为了评估 SD-Eval 基准数据集，我们实施了三种不同的模型，并按照与 SD-Eval 类似的过程构建了一个训练集。训练集包含 1,052.72 小时的语音数据和 724.4k 条话语。我们还使用客观评估方法（例如 BLEU 和 ROUGE）、主观评估和基于 LLM 的指标对生成的响应进行了全面评估。以副语言和环境信息为条件的模型在客观和主观测量中都优于其他模型。此外，实验表明，与传统指标相比，基于 LLM 的指标与人工评估的相关性更高。我们在此 https URL 上开源了 SD-Eval。

Title: ZeroDL: Zero-shot Distribution Learning for Text Clustering via Large Language Models

Authors: Hwiyeol Jo, Hyunwoo Lee, Taiwoo Park
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.13342
Pdf URL: https://arxiv.org/pdf/2406.13342
Copy Paste: [[2406.13342]] ZeroDL: Zero-shot Distribution Learning for Text Clustering via Large Language Models(https://arxiv.org/abs/2406.13342)
Keywords: language model, llm
Abstract: The recent advancements in large language models (LLMs) have brought significant progress in solving NLP tasks. Notably, in-context learning (ICL) is the key enabling mechanism for LLMs to understand specific tasks and grasp nuances. In this paper, we propose a simple yet effective method to contextualize a task toward a specific LLM, by (1) observing how a given LLM describes (all or a part of) target datasets, i.e., open-ended zero-shot inference, (2) aggregating the open-ended inference results by the LLM, and (3) finally incorporating the aggregated meta-information for the actual task. We show the effectiveness of this approach in text clustering tasks, and also highlight the importance of the contextualization through examples of the above procedure.
摘要：大型语言模型 (LLM) 的最新进展为解决 NLP 任务带来了重大进展。值得注意的是，上下文学习 (ICL) 是 LLM 理解特定任务和掌握细微差别的关键机制。在本文中，我们提出了一种简单而有效的方法，将任务情境化到特定的 LLM，通过 (1) 观察给定的 LLM 如何描述（全部或部分）目标数据集，即开放式零样本推理，以及 (2) 由 LLM 聚合开放式推理结果，以及 (3) 最后将聚合的元信息合并到实际任务中。我们展示了这种方法在文本聚类任务中的有效性，并通过上述过程的示例强调了情境化的重要性。

Title: Transferable speech-to-text large language model alignment module

Authors: Boyong Wu, Chao Yan, Haoran Pu
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2406.13357
Pdf URL: https://arxiv.org/pdf/2406.13357
Copy Paste: [[2406.13357]] Transferable speech-to-text large language model alignment module(https://arxiv.org/abs/2406.13357)
Keywords: language model, llm, chat
Abstract: By leveraging the power of Large Language Models (LLMs) and speech foundation models, state-of-the-art speech-text bimodal works can achieve challenging tasks like spoken translation (ST) and question answering (SQA) together with much simpler architectures. In this paper, we utilize the capability of the Whisper encoder and pre-trained Yi-6B. Empirical results reveal that modal alignment can be achieved with a one-layer module and a hundred hours of speech-text multitask corpus. We further swap Yi-6B with its human-preference-aligned version, Yi-6B-Chat, during inference, and discover that the alignment capability is applicable as well. In addition, the alignment subspace revealed by singular value decomposition (SVD) also implies that the linear alignment subspace is sparse, which leaves the possibility to concatenate other features like voice-print or video to expand modality.
摘要：通过利用大型语言模型 (LLM) 和语音基础模型的强大功能，最先进的语音文本双模态工作可以使用更简单的架构完成诸如口语翻译 (ST) 和问答 (SQA) 等具有挑战性的任务。在本文中，我们利用 Whisper 编码器和预训练的 Yi-6B 的功能。实证结果表明，使用一层模块和数百小时的语音文本多任务语料库即可实现模态对齐。我们进一步在推理过程中将 Yi-6B 与人类偏好对齐的 Yi-6B-Chat 版本交换，发现对齐功能也适用。此外，奇异值分解 (SVD) 揭示的对齐子空间也意味着线性对齐子空间是稀疏的，这使得连接声纹或视频等其他特征以扩展模态成为可能。

Title: ALiiCE: Evaluating Positional Fine-grained Citation Generation

Authors: Yilong Xu, Jinhua Gao, Xiaoming Yu, Baolong Bi, Huawei Shen, Xueqi Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.13375
Pdf URL: https://arxiv.org/pdf/2406.13375
Copy Paste: [[2406.13375]] ALiiCE: Evaluating Positional Fine-grained Citation Generation(https://arxiv.org/abs/2406.13375)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) can enhance credibility and verifiability by generating text with citations. However, existing tasks and evaluation methods are predominantly limited to sentence-level statements, neglecting the significance of positional fine-grained citations that can appear anywhere within sentences. To facilitate further exploration of fine-grained citation generation, we propose ALiiCE, the first automatic evaluation framework for this task. Our framework first parses the sentence claim into atomic claims via dependency analysis and then calculates citation quality at the atomic claim level. ALiiCE introduces three novel metrics for positional fine-grained citation quality assessment, including positional fine-grained citation recall and precision, and coefficient of variation of citation positions. We evaluate the positional fine-grained citation generation performance of several LLMs on two long-form QA datasets. Our experiments and analyses demonstrate the effectiveness and reasonableness of ALiiCE. The results also indicate that existing LLMs still struggle to provide positional fine-grained citations.
摘要：大型语言模型 (LLM) 可以通过生成带引文的文本来增强可信度和可验证性。然而现有的任务和评估方法主要局限于句子级别的陈述，忽略了可以出现在句子任何位置的细粒度引文的重要性。为了进一步探索细粒度引文生成，我们提出了 ALiiCE，这是该任务的第一个自动评估框架。我们的框架首先通过依赖关系分析将句子声明解析为原子声明，然后在原子声明级别计算引文质量。ALiiCE 引入了三个新的位置细粒度引文质量评估指标，包括位置细粒度引文召回率和准确率、引文位置的变异系数。我们在两个长篇问答数据集上评估了几个 LLM 的位置细粒度引文生成性能。我们的实验和分析证明了 ALiiCE 的有效性和合理性。结果还表明，现有的法学硕士 (LLM) 仍然难以提供位置细粒度的引用。
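The third ALiiCE metric, the coefficient of variation of citation positions, can be illustrated with a minimal sketch. The abstract does not say how positions are encoded or which standard deviation is used, so the normalized positions and the population standard deviation below are assumptions made for illustration only. The function name `citation_position_cv` is hypothetical.

```python
from statistics import mean, pstdev

def citation_position_cv(positions):
    """Coefficient of variation (std / mean) of citation positions.

    `positions` holds hypothetical normalized positions in (0, 1], for
    example the token index of each citation marker divided by the
    sentence length. This encoding is an assumption, not taken from the paper.
    """
    if not positions:
        raise ValueError("need at least one citation position")
    mu = mean(positions)
    if mu == 0:
        raise ValueError("mean position is zero; CV is undefined")
    # Population standard deviation is an assumed choice.
    return pstdev(positions) / mu

# Citations always at the end of the sentence: no positional variation.
end_only = [1.0, 1.0, 1.0]
# Citations spread across the sentence: higher variation.
spread = [0.2, 0.6, 1.0]

print(citation_position_cv(end_only))
print(citation_position_cv(spread))
```

Under these assumptions, a model that only ever cites at the end of a sentence scores zero, while citations spread through the sentence give a larger value.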

Title: CoAct: A Global-Local Hierarchy for Autonomous Agent Collaboration

Authors: Xinming Hou, Mingming Yang, Wenxiang Jiao, Xing Wang, Zhaopeng Tu, Wayne Xin Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.13381
Pdf URL: https://arxiv.org/pdf/2406.13381
Copy Paste: [[2406.13381]] CoAct: A Global-Local Hierarchy for Autonomous Agent Collaboration(https://arxiv.org/abs/2406.13381)
Keywords: llm, agent
Abstract: Existing LLMs exhibit remarkable performance on various NLP tasks, but still struggle with complex real-world tasks, even when equipped with advanced strategies like CoT and ReAct. In this work, we propose the CoAct framework, which transfers the hierarchical planning and collaboration patterns in human society to LLM systems. Specifically, our CoAct framework involves two agents: (1) A global planning agent, to comprehend the problem scope, formulate macro-level plans and provide detailed sub-task descriptions to local execution agents, which serves as the initial rendition of a global plan. (2) A local execution agent, to operate within the multi-tier task execution structure, focusing on detailed execution and implementation of specific tasks within the global plan. Experimental results on the WebArena benchmark show that CoAct can re-arrange the process trajectory when facing failures, and achieves superior performance over baseline methods on long-horizon web tasks. Code is available at this https URL.
摘要：现有的 LLM 在各种 NLP 任务上表现出色，但即使配备了 CoT 和 ReAct 等高级策略，在处理复杂的现实任务时仍然举步维艰。在这项工作中，我们提出了 CoAct 框架，将人类社会中的分层规划和协作模式转移到 LLM 系统中。具体来说，我们的 CoAct 框架涉及两个代理：（1）全局规划代理，用于理解问题范围、制定宏观计划并向本地执行代理提供详细的子任务描述，作为全局计划的初始版本。（2）本地执行代理，在多层任务执行结构中运行，专注于全局计划中特定任务的详细执行和实施。WebArena 基准测试的实验结果表明，CoAct 可以在遇到故障时重新安排流程轨迹，并且在长期 Web 任务上取得了优于基线方法的性能。代码可在此 https URL 上找到。

Title: MoreHopQA: More Than Multi-hop Reasoning

Authors: Julian Schnitzler, Xanh Ho, Jiahao Huang, Florian Boudin, Saku Sugawara, Akiko Aizawa
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.13397
Pdf URL: https://arxiv.org/pdf/2406.13397
Copy Paste: [[2406.13397]] MoreHopQA: More Than Multi-hop Reasoning(https://arxiv.org/abs/2406.13397)
Keywords: language model, gpt
Abstract: Most existing multi-hop datasets are extractive answer datasets, where the answers to the questions can be extracted directly from the provided context. This often leads models to use heuristics or shortcuts instead of performing true multi-hop reasoning. In this paper, we propose a new multi-hop dataset, MoreHopQA, which shifts from extractive to generative answers. Our dataset is created by utilizing three existing multi-hop datasets: HotpotQA, 2WikiMultihopQA, and MuSiQue. Instead of relying solely on factual reasoning, we enhance the existing multi-hop questions by adding another layer of questioning that involves one, two, or all three of the following types of reasoning: commonsense, arithmetic, and symbolic. Our dataset is created through a semi-automated process, resulting in a dataset with 1,118 samples that have undergone human verification. We then use our dataset to evaluate five different large language models: Mistral 7B, Gemma 7B, Llama 3 (8B and 70B), and GPT-4. We also design various cases to analyze the reasoning steps in the question-answering process. Our results show that models perform well on initial multi-hop questions but struggle with our extended questions, indicating that our dataset is more challenging than previous ones. Our analysis of question decomposition reveals that although models can correctly answer questions, only a portion (38.7% for GPT-4 and 33.4% for Llama3-70B) achieve perfect reasoning, where all corresponding sub-questions are answered correctly. Evaluation code and data are available at this https URL
摘要：大多数现有的多跳数据集都是提取式答案数据集，其中问题的答案可以直接从提供的上下文中提取。这通常会导致模型使用启发式或捷径，而不是执行真正的多跳推理。在本文中，我们提出了一个新的多跳数据集 MoreHopQA，它从提取式答案转向生成式答案。我们的数据集是利用三个现有的多跳数据集创建的：HotpotQA、2WikiMultihopQA 和 MuSiQue。我们不是仅仅依靠事实推理，而是通过添加另一层涉及以下一种、两种或所有三种推理类型的提问来增强现有的多跳问题：常识、算术和符号。我们的数据集是通过半自动化过程创建的，最终生成了一个包含 1,118 个经过人工验证的样本的数据集。然后，我们使用我们的数据集评估五种不同的大型语言模型：Mistral 7B、Gemma 7B、Llama 3（8B 和 70B）和 GPT-4。我们还设计了各种案例来分析问答过程中的推理步骤。我们的结果表明，模型在初始多跳问题上表现良好，但在处理我们的扩展问题时却举步维艰，这表明我们的数据集比以前的数据集更具挑战性。我们对问题分解的分析表明，尽管模型可以正确回答问题，但只有一部分（GPT-4 为 38.7%，Llama3-70B 为 33.4%）实现了完美的推理，所有相应的子问题都得到了正确的回答。评估代码和数据可在此 https URL 上找到

Title: SQLFixAgent: Towards Semantic-Accurate SQL Generation via Multi-Agent Collaboration

Authors: Jipeng Cen, Jiaxin Liu, Zhixu Li, Jingjing Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.13408
Pdf URL: https://arxiv.org/pdf/2406.13408
Copy Paste: [[2406.13408]] SQLFixAgent: Towards Semantic-Accurate SQL Generation via Multi-Agent Collaboration(https://arxiv.org/abs/2406.13408)
Keywords: language model, llm, agent
Abstract: While fine-tuned large language models (LLMs) excel in generating grammatically valid SQL in Text-to-SQL parsing, they often struggle to ensure semantic accuracy in queries, leading to user confusion and diminished system usability. To tackle this challenge, we introduce SQLFixAgent, an innovative multi-agent collaborative framework designed for detecting and repairing erroneous SQL. Our framework comprises a core agent, SQLRefiner, alongside two auxiliary agents: SQLReviewer and QueryCrafter. The SQLReviewer agent employs the rubber duck debugging method to identify potential semantic mismatches between the SQL statement and the user query. If an error is detected, the QueryCrafter agent generates multiple SQL statements as candidate repairs using a fine-tuned SQLTool. Subsequently, leveraging similar repair retrieval and failure memory reflexion, the SQLRefiner agent selects the most fitting SQL statement from the candidates as the final repair. We evaluated our proposed framework on five Text-to-SQL benchmarks. The experimental results show that our method consistently enhances the performance of the baseline model, specifically achieving an execution accuracy improvement of over 3% on the Bird benchmark. Our framework also has a higher token efficiency compared to other advanced methods, making it more competitive.
摘要：虽然经过微调的大型语言模型 (LLM) 在文本到 SQL 解析中生成语法有效的 SQL 方面表现出色，但它们通常难以确保查询的语义准确性，从而导致用户困惑和系统可用性下降。为了应对这一挑战，我们引入了 SQLFixAgent，这是一个创新的多代理协作框架，旨在检测和修复错误的 SQL。我们的框架包括一个核心代理 SQLRefiner，以及两个辅助代理：SQLReviewer 和 QueryCrafter。SQLReviewer 代理采用橡皮鸭调试方法来识别 SQL 语句和用户查询之间潜在的语义不匹配。如果检测到错误，QueryCrafter 代理将使用经过微调的 SQLTool 生成多个 SQL 语句作为候选修复。随后，利用类似的修复检索和故障内存反射，SQLRefiner 代理从候选中选择最合适的 SQL 语句作为最终修复。我们在五个文本到 SQL 基准上评估了我们提出的框架。实验结果表明，我们的方法持续提升了基线模型的性能，特别是在 Bird 基准上实现了超过 3% 的执行准确率提升。与其他先进方法相比，我们的框架还具有更高的 token 效率，使其更具竞争力。

Title: Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators

Authors: Matéo Mahaut, Laura Aina, Paula Czarnowska, Momchil Hardalov, Thomas Müller, Lluís Màrquez
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.13415
Pdf URL: https://arxiv.org/pdf/2406.13415
Copy Paste: [[2406.13415]] Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators(https://arxiv.org/abs/2406.13415)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) tend to be unreliable in the factuality of their answers. To address this problem, NLP researchers have proposed a range of techniques to estimate an LLM's confidence over facts. However, due to the lack of a systematic comparison, it is not clear how the different methods compare to one another. To fill this gap, we present a survey and empirical comparison of estimators of factual confidence. We define an experimental framework allowing for fair comparison, covering both fact-verification and question answering. Our experiments across a series of LLMs indicate that trained hidden-state probes provide the most reliable confidence estimates, albeit at the expense of requiring access to weights and training data. We also conduct a deeper assessment of factual confidence by measuring the consistency of model behavior under meaning-preserving variations in the input. We find that the confidence of LLMs is often unstable across semantically equivalent inputs, suggesting that there is much room for improvement of the stability of models' parametric knowledge. Our code is available at (this https URL).
摘要：大型语言模型 (LLM) 往往无法可靠地确定其答案的真实性。为了解决这个问题，NLP 研究人员提出了一系列技术来估计 LLM 对事实的信心。然而，由于缺乏系统的比较，目前尚不清楚不同方法之间的比较情况。为了填补这一空白，我们对事实信心的估计量进行了调查和实证比较。我们定义了一个允许公平比较的实验框架，涵盖事实验证和问答。我们在一系列 LLM 上进行的实验表明，经过训练的隐藏状态探测提供了最可靠的信心估计，尽管需要访问权重和训练数据。我们还通过测量在输入中保留意义的变化下模型行为的一致性来对事实信心进行更深入的评估。我们发现 LLM 的信心在语义等效的输入中往往不稳定，这表明模型参数知识的稳定性有很大的改进空间。我们的代码可在（此 https URL）获得。

Title: Finding Blind Spots in Evaluator LLMs with Interpretable Checklists

Authors: Sumanth Doddapaneni, Mohammed Safi Ur Rahman Khan, Sshubam Verma, Mitesh M. Khapra
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.13439
Pdf URL: https://arxiv.org/pdf/2406.13439
Copy Paste: [[2406.13439]] Finding Blind Spots in Evaluator LLMs with Interpretable Checklists(https://arxiv.org/abs/2406.13439)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are increasingly relied upon to evaluate text outputs of other LLMs, thereby influencing leaderboards and development decisions. However, concerns persist over the accuracy of these assessments and the potential for misleading conclusions. In this work, we investigate the effectiveness of LLMs as evaluators for text generation tasks. We propose FBI, a novel framework designed to examine the proficiency of Evaluator LLMs in assessing four critical abilities in other LLMs: factual accuracy, instruction following, coherence in long-form writing, and reasoning proficiency. By introducing targeted perturbations in answers generated by LLMs that clearly impact one of these key capabilities, we test whether an Evaluator LLM can detect these quality drops. By creating a total of 2400 perturbed answers covering 22 perturbation categories, we conduct a comprehensive study using different evaluation strategies on five prominent LLMs commonly used as evaluators in the literature. Our findings reveal significant shortcomings in current Evaluator LLMs, which failed to identify quality drops in over 50% of cases on average. Single-answer and pairwise evaluations demonstrated notable limitations, whereas reference-based evaluations showed comparatively better performance. These results underscore the unreliable nature of current Evaluator LLMs and advocate for cautious implementation in practical applications. Code and data are available at this https URL.
摘要：大型语言模型 (LLM) 越来越多地被用来评估其他 LLM 的文本输出，从而影响排行榜和开发决策。然而，人们仍然担心这些评估的准确性以及得出误导性结论的可能性。在这项工作中，我们研究了 LLM 作为文本生成任务评估器的有效性。我们提出了 FBI，这是一个新颖的框架，旨在检验评估器 LLM 在评估其他 LLM 的四项关键能力方面的能力：事实准确性、指令遵循、长篇写作连贯性和推理能力。通过在 LLM 生成的答案中引入有针对性的扰动，这些扰动显然会影响其中一项关键能力，我们测试了评估器 LLM 是否可以检测到这些质量下降。通过创建总共 2400 个扰动答案，涵盖 22 个扰动类别，我们使用不同的评估策略对文献中常用作评估器的五种著名 LLM 进行了全面研究。我们的研究结果揭示了当前 Evaluator LLM 的重大缺陷，平均有超过 50% 的案例无法识别质量下降。单答案和成对评估表现出明显的局限性，而基于参考的评估则表现出相对更好的性能。这些结果强调了当前 Evaluator LLM 的不可靠性，并提倡在实际应用中谨慎实施。代码和数据可在此 https URL 上获取。

Title: Dual-Phase Accelerated Prompt Optimization

Authors: Muchen Yang, Moxin Li, Yongle Li, Zijun Chen, Chongming Gao, Junqi Zhang, Yangyang Li, Fuli Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.13443
Pdf URL: https://arxiv.org/pdf/2406.13443
Copy Paste: [[2406.13443]] Dual-Phase Accelerated Prompt Optimization(https://arxiv.org/abs/2406.13443)
Keywords: language model, llm, prompt
Abstract: Gradient-free prompt optimization methods have made significant strides in enhancing the performance of closed-source Large Language Models (LLMs) across a wide range of tasks. However, existing approaches make light of the importance of high-quality prompt initialization and the identification of effective optimization directions, thus requiring substantial optimization steps to obtain satisfactory performance. In this light, we aim to accelerate the prompt optimization process to tackle the challenge of low convergence rate. We propose a dual-phase approach which starts by generating high-quality initial prompts, adopting a well-designed meta-instruction to delve into task-specific information, and then iteratively optimizes the prompts at the sentence level, leveraging previous tuning experience to expand prompt candidates and accept effective ones. Extensive experiments on eight datasets demonstrate the effectiveness of our proposed method, achieving a consistent accuracy gain over baselines with fewer than five optimization steps.
摘要：无梯度提示优化方法在提升闭源大型语言模型 (LLM) 在广泛任务中的性能方面取得了重大进展。然而，现有的方法忽视了高质量提示初始化和确定有效优化方向的重要性，从而需要大量的优化步骤才能获得令人满意的性能。有鉴于此，我们旨在加速提示优化过程以应对低收敛速度的挑战。我们提出了一种双阶段方法，首先采用精心设计的元指令深入研究特定于任务的信息来生成高质量的初始提示，然后在句子级别迭代优化提示，利用以前的调优经验来扩展提示候选并接受有效的提示。在八个数据集上进行的大量实验证明了我们提出的方法的有效性，在少于五个优化步骤的情况下实现了与基线相比一致的准确率提升。

Title: VDebugger: Harnessing Execution Feedback for Debugging Visual Programs

Authors: Xueqing Wu, Zongyu Lin, Songyan Zhao, Te-Lin Wu, Pan Lu, Nanyun Peng, Kai-Wei Chang
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2406.13444
Pdf URL: https://arxiv.org/pdf/2406.13444
Copy Paste: [[2406.13444]] VDebugger: Harnessing Execution Feedback for Debugging Visual Programs(https://arxiv.org/abs/2406.13444)
Keywords: language model
Abstract: Visual programs are executable code generated by large language models to address visual reasoning problems. They decompose complex questions into multiple reasoning steps and invoke specialized models for each step to solve the problems. However, these programs are prone to logic errors, with our preliminary evaluation showing that 58% of the total errors are caused by program logic errors. Debugging complex visual programs remains a major bottleneck for visual reasoning. To address this, we introduce VDebugger, a novel critic-refiner framework trained to localize and debug visual programs by tracking execution step by step. VDebugger identifies and corrects program errors leveraging detailed execution feedback, improving interpretability and accuracy. The training data is generated through an automated pipeline that injects errors into correct visual programs using a novel mask-best decoding technique. Evaluations on six datasets demonstrate VDebugger's effectiveness, showing performance improvements of up to 3.2% in downstream task accuracy. Further studies show VDebugger's ability to generalize to unseen tasks, bringing a notable improvement of 2.3% on the unseen COVR task. Code, data and models are made publicly available at this https URL
摘要：视觉程序是由大型语言模型生成的可执行代码，用于解决视觉推理问题。它们将复杂的问题分解为多个推理步骤，并为每个步骤调用专门的模型来解决问题。然而，这些程序容易出现逻辑错误，我们的初步评估显示，58% 的总错误是由程序逻辑错误引起的。调试复杂的视觉程序仍然是视觉推理的主要瓶颈。为了解决这个问题，我们引入了 VDebugger，这是一个新颖的批评者-细化器框架，经过训练可以通过逐步跟踪执行来定位和调试视觉程序。VDebugger 利用详细的执行反馈来识别和纠正程序错误，从而提高可解释性和准确性。训练数据是通过一个自动化管道生成的，该管道使用一种新颖的掩码最佳解码技术将错误注入正确的视觉程序中。对六个数据集的评估证明了 VDebugger 的有效性，显示下游任务准确率的性能提高了 3.2%。进一步的研究表明 VDebugger 能够推广到未见任务，在未见 COVR 任务上带来了 2.3% 的显着提高。代码、数据和模型在此 https URL 上公开发布

Title: Encoder vs Decoder: Comparative Analysis of Encoder and Decoder Language Models on Multilingual NLU Tasks

Authors: Dan Saattrup Nielsen, Kenneth Enevoldsen, Peter Schneider-Kamp
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.13469
Pdf URL: https://arxiv.org/pdf/2406.13469
Copy Paste: [[2406.13469]] Encoder vs Decoder: Comparative Analysis of Encoder and Decoder Language Models on Multilingual NLU Tasks(https://arxiv.org/abs/2406.13469)
Keywords: language model
Abstract: This paper explores the performance of encoder and decoder language models on multilingual Natural Language Understanding (NLU) tasks, with a broad focus on Germanic languages. Building upon the ScandEval benchmark, which initially was restricted to evaluating encoder models, we extend the evaluation framework to include decoder models. We introduce a method for evaluating decoder models on NLU tasks and apply it to the languages Danish, Swedish, Norwegian, Icelandic, Faroese, German, Dutch, and English. Through a series of experiments and analyses, we address key research questions regarding the comparative performance of encoder and decoder models, the impact of NLU task types, and the variation across language resources. Our findings reveal that decoder models can achieve significantly better NLU performance than encoder models, with nuances observed across different tasks and languages. Additionally, we investigate the correlation between decoders and task performance via a UMAP analysis, shedding light on the unique capabilities of decoder and encoder models. This study contributes to a deeper understanding of language model paradigms in NLU tasks and provides valuable insights for model selection and evaluation in multilingual settings.
摘要：本文探讨了编码器和解码器语言模型在多语言自然语言理解 (NLU) 任务中的表现，重点关注日耳曼语。在 ScandEval 基准测试的基础上，我们扩展了评估框架以包括解码器模型，该基准测试最初仅限于评估编码器模型。我们介绍了一种在 NLU 任务中评估解码器模型的方法，并将其应用于丹麦语、瑞典语、挪威语、冰岛语、法罗语、德语、荷兰语和英语。通过一系列实验和分析，我们解决了有关编码器和解码器模型的比较性能、NLU 任务类型的影响以及语言资源之间的差异的关键研究问题。我们的研究结果表明，解码器模型可以实现比编码器模型更好的 NLU 性能，并且在不同的任务和语言中观察到细微差别。此外，我们通过 UMAP 分析研究了解码器与任务性能之间的相关性，揭示了解码器和编码器模型的独特功能。这项研究有助于更深入地了解 NLU 任务中的语言模型范式，并为多语言环境中的模型选择和评估提供了宝贵的见解。

Title: LLMs Are Zero-Shot Context-Aware Simultaneous Translators

Authors: Roman Koshkin, Katsuhito Sudoh, Satoshi Nakamura
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.13476
Pdf URL: https://arxiv.org/pdf/2406.13476
Copy Paste: [[2406.13476]] LLMs Are Zero-Shot Context-Aware Simultaneous Translators(https://arxiv.org/abs/2406.13476)
Keywords: language model, llm
Abstract: The advent of transformers has fueled progress in machine translation. More recently, large language models (LLMs) have come into the spotlight thanks to their generality and strong performance in a wide range of language tasks, including translation. Here we show that open-source LLMs perform on par with or better than some state-of-the-art baselines in simultaneous machine translation (SiMT) tasks, zero-shot. We also demonstrate that injection of minimal background information, which is easy with an LLM, brings further performance gains, especially on challenging technical subject matter. This highlights LLMs' potential for building the next generation of massively multilingual, context-aware and terminologically accurate SiMT systems that require no resource-intensive training or fine-tuning.
摘要：Transformer 的出现推动了机器翻译的发展。最近，大型语言模型 (LLM) 因其通用性和在包括翻译在内的各种语言任务中的出色表现而备受关注。在这里，我们展示了开源 LLM 在零样本同步机器翻译 (SiMT) 任务中的表现与一些最先进的基线相当甚至更好。我们还证明了注入最少的背景信息（使用 LLM 很容易）可以带来进一步的性能提升，尤其是在具有挑战性的技术主题上。这凸显了 LLM 在构建下一代大规模多语言、上下文感知和术语准确的 SiMT 系统方面的潜力，这些系统不需要资源密集型的训练或微调。

Title: Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models

Authors: Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, Jingren Zhou
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.13542
Pdf URL: https://arxiv.org/pdf/2406.13542
Copy Paste: [[2406.13542]] Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models(https://arxiv.org/abs/2406.13542)
Keywords: language model, llm
Abstract: One core capability of large language models (LLMs) is to follow natural language instructions. However, the issue of automatically constructing high-quality training data to enhance the complex instruction-following abilities of LLMs without manual annotation remains unresolved. In this paper, we introduce AutoIF, the first scalable and reliable method for automatically generating instruction-following training data. AutoIF transforms the validation of instruction-following data quality into code verification, requiring LLMs to generate instructions, the corresponding code to check the correctness of the instruction responses, and unit test samples to verify the code's correctness. Then, execution feedback-based rejection sampling can generate data for Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) training. AutoIF achieves significant improvements across three training algorithms, SFT, Offline DPO, and Online DPO, when applied to the top open-source LLMs, Qwen2 and LLaMA3, in self-alignment and strong-to-weak distillation settings. Our code is publicly available at this https URL.
摘要：大型语言模型 (LLM) 的核心能力之一是遵循自然语言指令。然而，在没有人工注释的情况下，如何自动构建高质量的训练数据来增强 LLM 的复杂指令跟随能力，这一问题仍未得到解决。在本文中，我们介绍了 AutoIF，这是第一种可扩展且可靠的自动生成指令跟随训练数据的方法。AutoIF 将指令跟随数据质量的验证转化为代码验证，要求 LLM 生成指令、相应的代码来检查指令响应的正确性，以及单元测试样本来验证代码的正确性。然后，基于执行反馈的拒绝采样可以生成数据，用于监督微调 (SFT) 和从人工反馈进行强化学习 (RLHF) 训练。在自对齐和强到弱蒸馏设置中，AutoIF 在三种训练算法（SFT、离线 DPO 和在线 DPO）上实现了显着的改进。我们的代码可通过此 https URL 公开获取。
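The execution-feedback filtering described for AutoIF can be pictured with a small sketch. It is not the authors' pipeline: the instruction, the verifier, the unit tests and the candidate responses are hand-written stand-ins for what an LLM would generate, and the function names `passes_unit_tests` and `rejection_sample` are hypothetical.

```python
def passes_unit_tests(verifier, unit_tests):
    """Keep a verifier only if it agrees with every (response, expected) case."""
    return all(verifier(response) == expected for response, expected in unit_tests)

def rejection_sample(instruction, verifier, unit_tests, candidate_responses):
    """Return (instruction, response) pairs whose responses the verifier accepts."""
    if not passes_unit_tests(verifier, unit_tests):
        # Discard the whole instruction when its checker is unreliable.
        return []
    return [(instruction, r) for r in candidate_responses if verifier(r)]

# Hypothetical instruction and hand-written stand-ins for LLM outputs.
instruction = "Answer in exactly three words."

def three_word_verifier(response):
    """Verification code: accept responses made of exactly three words."""
    return len(response.split()) == 3

unit_tests = [("one two three", True), ("too short", False)]
candidates = ["Paris is lovely", "It is the capital of France", "Yes it is"]

kept = rejection_sample(instruction, three_word_verifier, unit_tests, candidates)
for _, response in kept:
    print(response)
```

In this toy setting, the accepted pairs would serve as fine-tuning data. The rejected responses could plausibly act as negative examples for preference training, though the abstract does not spell out that detail.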

Title: Mitigating Social Biases in Language Models through Unlearning

Authors: Omkar Dige, Diljot Singh, Tsz Fung Yau, Qixuan Zhang, Borna Bolandraftar, Xiaodan Zhu, Faiza Khan Khattak
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.13551
Pdf URL: https://arxiv.org/pdf/2406.13551
Copy Paste: [[2406.13551]] Mitigating Social Biases in Language Models through Unlearning(https://arxiv.org/abs/2406.13551)
Keywords: language model
Abstract: Mitigating bias in language models (LMs) has become a critical problem due to the widespread deployment of LMs. Numerous approaches revolve around data pre-processing and fine-tuning of language models, tasks that can be both time-consuming and computationally demanding. Consequently, there is a growing interest in machine unlearning techniques given their capacity to induce the forgetting of undesired behaviors of the existing pre-trained or fine-tuned models with lower computational cost. In this work, we explore two unlearning methods, (1) Partitioned Contrastive Gradient Unlearning (PCGU) applied on decoder models and (2) Negation via Task Vector, to reduce social biases in state-of-the-art and open-source LMs such as LLaMA-2 and OPT. We also implement distributed PCGU for large models. It is empirically shown, through quantitative and qualitative analyses, that the Negation via Task Vector method outperforms PCGU in debiasing with minimum deterioration in performance and perplexity of the models. On LLaMA-2 7B, Negation via Task Vector reduces the bias score by 11.8%.
摘要：由于语言模型 (LM) 的广泛部署，减轻语言模型中的偏见已成为一个关键问题。许多方法都围绕着数据预处理和语言模型的微调，这些任务既耗时又耗计算。因此，人们对机器反学习技术的兴趣日益浓厚，因为它们能够以较低的计算成本诱导遗忘现有预训练或微调模型的不良行为。在这项工作中，我们探索了两种反学习方法，(1) 应用于解码器模型的分区对比梯度反学习 (PCGU) 和 (2) 通过任务向量进行否定，以减少 LLaMA-2 和 OPT 等最先进和开源 LM 中的社会偏见。我们还为大型模型实现了分布式 PCGU。通过定量和定性分析，经验表明，通过任务向量方法进行否定在消除偏差方面优于 PCGU，同时对模型性能和困惑度的影响最小。在 LLaMA-27B 上，通过任务向量进行否定可将偏差分数降低 11.8%
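The idea behind negating a task vector can be sketched in a few lines. This follows the general task-arithmetic recipe (subtract the difference between a fine-tuned and a pretrained model from the pretrained weights). The setup below, with a copy fine-tuned on biased data, a `scale` factor, and weights stored as plain lists, is an assumption for illustration rather than the paper's exact procedure.

```python
def negate_task_vector(pretrained, finetuned, scale=1.0):
    """Return pretrained - scale * (finetuned - pretrained), per parameter.

    Both arguments map hypothetical parameter names to lists of floats.
    `finetuned` is assumed to be a copy of the model tuned on biased data.
    """
    edited = {}
    for name, base in pretrained.items():
        tuned = finetuned[name]
        # Task vector for this parameter is (tuned - base); subtract it.
        edited[name] = [b - scale * (t - b) for b, t in zip(base, tuned)]
    return edited

pretrained = {"layer.weight": [1.0, 2.0]}
biased = {"layer.weight": [1.5, 1.0]}

edited = negate_task_vector(pretrained, biased, scale=0.5)
print(edited)
```

The `scale` value trades off debiasing strength against drift from the original weights. How it is chosen is not stated in the abstract.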

Title: BiLD: Bi-directional Logits Difference Loss for Large Language Model Distillation

Authors: Minchong Li, Feng Zhou, Xiaohui Song
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.13555
Pdf URL: https://arxiv.org/pdf/2406.13555
Copy Paste: [[2406.13555]] BiLD: Bi-directional Logits Difference Loss for Large Language Model Distillation(https://arxiv.org/abs/2406.13555)
Keywords: language model, llm
Abstract: In recent years, large language models (LLMs) have shown exceptional capabilities across various natural language processing (NLP) tasks. However, such impressive performance often comes with the trade-off of an increased parameter size, posing significant challenges for widespread deployment. Knowledge distillation (KD) provides a solution by transferring knowledge from a large teacher model to a smaller student model. In this paper, we explore the task-specific distillation of LLMs at the logit level. Our investigation reveals that the logits of fine-tuned LLMs exhibit a more extreme long-tail distribution than those from vision models, with hidden "noise" in the long tail affecting distillation performance. Furthermore, existing logits distillation methods often struggle to effectively utilize the internal ranking information from the logits. To address these, we propose the Bi-directional Logits Difference (BiLD) loss. The BiLD loss filters out the long-tail noise by utilizing only top-$k$ teacher and student logits, and leverages the internal logits ranking information by constructing logits differences. To evaluate BiLD loss, we conduct comprehensive experiments on 13 datasets using two types of LLMs. Our results show that the BiLD loss, with only the top-8 logits, outperforms supervised fine-tuning (SFT), vanilla KL loss, and five other distillation methods from both NLP and CV fields.
摘要：近年来，大型语言模型 (LLM) 在各种自然语言处理 (NLP) 任务中表现出卓越的能力。然而，如此令人印象深刻的性能往往伴随着参数大小的增加，这对广泛部署构成了重大挑战。知识蒸馏 (KD) 通过将知识从大型教师模型转移到较小的学生模型来提供解决方案。在本文中，我们探讨了在 logit 级别对 LLM 进行任务特定的蒸馏。我们的调查显示，微调后的 LLM 的 logit 比视觉模型的 logit 表现出更极端的长尾分布，长尾中隐藏的“噪音”会影响蒸馏性能。此外，现有的 logits 蒸馏方法通常难以有效利用来自 logits 的内部排名信息。为了解决这些问题，我们提出了双向 Logits 差异 (BiLD) 损失。BiLD 损失通过仅利用前 $k$ 个教师和学生 logit 来滤除长尾噪音，并通过构建 logits 差异来利用内部 logits 排名信息。为了评估 BiLD 损失，我们使用两种类型的 LLM 在 13 个数据集上进行了全面的实验。结果表明，仅使用前 8 个 logit 的 BiLD 损失优于监督微调 (SFT)、普通 KL 损失以及来自 NLP 和 CV 领域的其他五种蒸馏方法。
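To make the two ingredients of the BiLD loss concrete (keeping only top-k logits and comparing pairwise logit differences), here is a rough pure-Python sketch. The abstract does not give the formula, so the specific form below is an assumption: a teacher-led term plus a student-led term, each a KL divergence between softmax-normalized difference vectors, with no temperature. All function names are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def pairwise_differences(values):
    """Differences values[i] - values[j] for every pair i < j."""
    n = len(values)
    return [values[i] - values[j] for i in range(n) for j in range(i + 1, n)]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def top_k_indices(logits, k):
    """Indices of the k largest logits, highest first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

def one_direction(teacher_logits, student_logits, indices):
    """Compare teacher and student logit differences at the given indices."""
    t_diff = pairwise_differences([teacher_logits[i] for i in indices])
    s_diff = pairwise_differences([student_logits[i] for i in indices])
    return kl_divergence(softmax(t_diff), softmax(s_diff))

def bild_loss_sketch(teacher_logits, student_logits, k=8):
    """Sum of a teacher-led and a student-led top-k difference loss (assumed form)."""
    if k < 2:
        raise ValueError("k must be at least 2 to form a difference")
    teacher_led = one_direction(
        teacher_logits, student_logits, top_k_indices(teacher_logits, k))
    student_led = one_direction(
        teacher_logits, student_logits, top_k_indices(student_logits, k))
    return teacher_led + student_led

teacher = [4.0, 2.0, 1.0, 0.0, -3.0]
student = [1.0, 3.0, 0.0, 2.0, -1.0]
print(bild_loss_sketch(teacher, student, k=3))
```

Restricting both terms to the top-k positions means the long tail of the vocabulary never enters the loss, which is the noise-filtering effect the abstract describes.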

Title: Evaluating Short-Term Temporal Fluctuations of Social Biases in Social Media Data and Masked Language Models

Authors: Yi Zhou, Danushka Bollegala, Jose Camacho-Collados
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.13556
Pdf URL: https://arxiv.org/pdf/2406.13556
Copy Paste: [[2406.13556]] Evaluating Short-Term Temporal Fluctuations of Social Biases in Social Media Data and Masked Language Models(https://arxiv.org/abs/2406.13556)
Keywords: language model
Abstract: Social biases such as gender or racial biases have been reported in language models (LMs), including Masked Language Models (MLMs). Given that MLMs are continuously trained with increasing amounts of additional data collected over time, an important yet unanswered question is how the social biases encoded in MLMs vary over time. In particular, the number of social media users continues to grow at an exponential rate, and it is a valid concern for MLMs trained specifically on social media data whether their social biases (if any) would also amplify over time. To empirically analyse this problem, we use a series of MLMs pretrained on chronologically ordered temporal snapshots of corpora. Our analysis reveals that, although social biases are present in all MLMs, most types of social bias remain relatively stable over time (with a few exceptions). To further understand the mechanisms that influence social biases in MLMs, we analyse the temporal corpora used to train the MLMs. Our findings show that some demographic groups, such as male, are consistently preferred over others, such as female, in the training corpora.
摘要：语言模型 (LM)（包括蒙版语言模型 (MLM)）中已报告了诸如性别或种族偏见之类的社会偏见。鉴于 MLM 会随着时间推移收集越来越多的额外数据而不断进行训练，一个重要但尚未解答的问题是，用 MLM 编码的社会偏见如何随时间变化。特别是，社交媒体用户的数量继续以指数级增长，对于专门在社交媒体数据上训练的 MLM 来说，它们的社会偏见（如果有的话）是否也会随着时间的推移而放大是一个合理的担忧。为了实证分析这个问题，我们使用了一系列在按时间顺序排列的语料库时间快照上预先训练的 MLM。我们的分析表明，尽管所有 MLM 中都存在社会偏见，但大多数类型的社会偏见随着时间的推移保持相对稳定（除了少数例外）。为了进一步了解影响 MLM 中社会偏见的机制，我们分析了用于训练 MLM 的时间语料库。我们的研究结果表明，某些人口群体（例如男性）在训练语料库中比其他群体（例如女性）获得更高的偏好。

Title: Enhancing Distractor Generation for Multiple-Choice Questions with Retrieval Augmented Pretraining and Knowledge Graph Integration

Authors: Han-Cheng Yu, Yu-An Shih, Kin-Man Law, Kai-Yu Hsieh, Yu-Chen Cheng, Hsin-Chih Ho, Zih-An Lin, Wen-Chuan Hsu, Yao-Chung Fan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.13578
Pdf URL: https://arxiv.org/pdf/2406.13578
Copy Paste: [[2406.13578]] Enhancing Distractor Generation for Multiple-Choice Questions with Retrieval Augmented Pretraining and Knowledge Graph Integration(https://arxiv.org/abs/2406.13578)
Keywords: language model
Abstract: In this paper, we tackle the task of distractor generation (DG) for multiple-choice questions. Our study introduces two key designs. First, we propose retrieval augmented pretraining, which involves refining the language model pretraining to align it more closely with the downstream task of DG. Second, we explore the integration of knowledge graphs to enhance the performance of DG. Through experiments with benchmarking datasets, we show that our models significantly outperform the state-of-the-art results. Our best-performing model advances the F1@3 score from 14.80 to 16.47 on the MCQ dataset and from 15.92 to 16.50 on the Sciq dataset.
摘要：在本文中，我们解决了多项选择题的干扰项生成 (DG) 任务。我们的研究引入了两种关键设计。首先，我们提出了 \textit{检索增强预训练}，这涉及改进语言模型预训练，使其与 DG 的下游任务更紧密地保持一致。其次，我们探索知识图谱的集成以增强 DG 的性能。通过对基准数据集的实验，我们表明我们的模型明显优于最先进的结果。我们表现最佳的模型将 MCQ 数据集中的 F1@3 得分从 14.80 提高到 16.47，将 Sciq 数据集中的 F1@3 得分从 15.92 提高到 16.50。

Title: Optimizing Psychological Counseling with Instruction-Tuned Large Language Models

Authors: Wenjie Li, Tianyu Sun, Kun Qian, Wenhong Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.13617
Pdf URL: https://arxiv.org/pdf/2406.13617
Copy Paste: [[2406.13617]] Optimizing Psychological Counseling with Instruction-Tuned Large Language Models(https://arxiv.org/abs/2406.13617)
Keywords: language model, llm, prompt
Abstract: The advent of large language models (LLMs) has significantly advanced various fields, including natural language processing and automated dialogue systems. This paper explores the application of LLMs in psychological counseling, addressing the increasing demand for mental health services. We present a method for instruction tuning LLMs with specialized prompts to enhance their performance in providing empathetic, relevant, and supportive responses. Our approach involves developing a comprehensive dataset of counseling-specific prompts, refining them through feedback from professional counselors, and conducting rigorous evaluations using both automatic metrics and human assessments. The results demonstrate that our instruction-tuned model outperforms several baseline LLMs, highlighting its potential as a scalable and accessible tool for mental health support.
摘要：大型语言模型 (LLM) 的出现极大地推动了各个领域的发展，包括自然语言处理和自动对话系统。本文探讨了 LLM 在心理咨询中的应用，以满足日益增长的心理健康服务需求。我们提出了一种使用专门的提示对 LLM 进行指令调整的方法，以提高其在提供富有同理心、相关性和支持性的响应方面的表现。我们的方法包括开发一个全面的咨询专用提示数据集，通过专业咨询师的反馈对其进行改进，并使用自动指标和人工评估进行严格的评估。结果表明，我们的指令调整模型优于几个基线 LLM，凸显了其作为心理健康支持可扩展且可访问的工具的潜力。

Title: In-Context Former: Lightning-fast Compressing Context for Large Language Model

Authors: Xiangfeng Wang, Zaiyi Chen, Zheyong Xie, Tong Xu, Yongyi He, Enhong Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.13618
Pdf URL: https://arxiv.org/pdf/2406.13618
Copy Paste: [[2406.13618]] In-Context Former: Lightning-fast Compressing Context for Large Language Model(https://arxiv.org/abs/2406.13618)
Keywords: language model, llm
Abstract: With the rising popularity of Transformer-based large language models (LLMs), reducing their high inference costs has become a significant research focus. One effective approach is to compress the long input contexts. Existing methods typically leverage the self-attention mechanism of the LLM itself for context compression. While these methods have achieved notable results, the compression process still involves quadratic time complexity, which limits their applicability. To mitigate this limitation, we propose the In-Context Former (IC-Former). Unlike previous methods, IC-Former does not depend on the target LLMs. Instead, it leverages the cross-attention mechanism and a small number of learnable digest tokens to directly condense information from the contextual word embeddings. This approach significantly reduces inference time, achieving linear growth in time complexity within the compression range. Experimental results indicate that our method requires only 1/32 of the floating-point operations of the baseline during compression and improves processing speed by 68 to 112 times while achieving over 90% of the baseline performance on evaluation metrics. Overall, our model effectively reduces compression costs and makes real-time compression scenarios feasible.
摘要：随着基于 Transformer 的大型语言模型 (LLM) 越来越受欢迎，降低其高昂的推理成本已成为研究的重点。一种有效的方法是压缩长输入上下文。现有方法通常利用 LLM 本身的自注意力机制进行上下文压缩。虽然这些方法已经取得了显著的成果，但压缩过程仍然涉及二次时间复杂度，这限制了它们的适用性。为了缓解这一限制，我们提出了上下文形成器 (IC-Former)。与以前的方法不同，IC-Former 不依赖于目标 LLM。相反，它利用交叉注意机制和少量可学习的摘要标记直接从上下文词嵌入中压缩信息。这种方法显著减少了推理时间，在压缩范围内实现了时间复杂度的线性增长。实验结果表明，我们的方法在压缩过程中只需要基线浮点运算的 1/32，处理速度提高了 68 到 112 倍，同时在评估指标上实现了基线性能的 90% 以上。总体而言，我们的模型有效地降低了压缩成本并使实时压缩场景变得可行。
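To make the complexity claim in the IC-Former abstract concrete, here is a minimal sketch of cross-attention in which a fixed number of digest vectors attend over the context embeddings. It is not the authors' architecture: there are no learned projections, no multiple heads and no training, and the names `cross_attention` and `digest_queries` are hypothetical. With k digest vectors and n context embeddings the loop does k * n dot products, so the cost grows linearly in n for fixed k.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(digest_queries, context_embeddings):
    """Condense n context embeddings into len(digest_queries) vectors.

    Assumption: context embeddings serve as both keys and values,
    and scaled dot-product attention is used with a single head.
    """
    dim = len(context_embeddings[0])
    scale = 1.0 / math.sqrt(dim)
    compressed = []
    for query in digest_queries:
        scores = [
            scale * sum(q * k for q, k in zip(query, key))
            for key in context_embeddings
        ]
        weights = softmax(scores)
        compressed.append([
            sum(w * value[d] for w, value in zip(weights, context_embeddings))
            for d in range(dim)
        ])
    return compressed


if __name__ == "__main__":
    context = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]  # n = 4
    digests = [[1.0, 0.0], [0.0, 1.0]]                          # k = 2
    print(cross_attention(digests, context))                    # 2 vectors of size 2
```

By contrast, self-attention over the same context would compare every context position with every other one, which is the quadratic cost the abstract refers to.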

Title: Improving Visual Commonsense in Language Models via Multiple Image Generation

Authors: Guy Yariv, Idan Schwartz, Yossi Adi, Sagie Benaim
Subjects: cs.CL, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2406.13621
Pdf URL: https://arxiv.org/pdf/2406.13621
Copy Paste: [[2406.13621]] Improving Visual Commonsense in Language Models via Multiple Image Generation(https://arxiv.org/abs/2406.13621)
Keywords: language model, llm, prompt
Abstract: Commonsense reasoning is fundamentally based on multimodal knowledge. However, existing large language models (LLMs) are primarily trained using textual data only, limiting their ability to incorporate essential visual information. In contrast, Visual Language Models, which excel at visually-oriented tasks, often fail at non-visual tasks such as basic commonsense reasoning. This divergence highlights a critical challenge - the integration of robust visual understanding with foundational text-based language reasoning. To this end, we introduce a method aimed at enhancing LLMs' visual commonsense. Specifically, our method generates multiple images based on the input text prompt and integrates these into the model's decision-making process by mixing their prediction probabilities. To facilitate multimodal grounded language modeling, we employ a late-fusion layer that combines the projected visual features with the output of a pre-trained LLM conditioned on text only. This late-fusion layer enables predictions based on comprehensive image-text knowledge as well as text only when this is required. We evaluate our approach using several visual commonsense reasoning tasks together with traditional NLP tasks, including common sense reasoning and reading comprehension. Our experimental results demonstrate significant superiority over existing baselines. When applied to recent state-of-the-art LLMs (e.g., Llama3), we observe improvements not only in visual common sense but also in traditional NLP benchmarks. Code and models are available under this https URL.
摘要：常识推理从根本上来说基于多模态知识。然而，现有的大型语言模型 (LLM) 主要仅使用文本数据进行训练，这限制了它们整合基本视觉信息的能力。相比之下，擅长视觉导向任务的视觉语言模型往往无法完成非视觉任务，例如基本的常识推理。这种分歧凸显了一个关键挑战——将强大的视觉理解与基于文本的基础语言推理相结合。为此，我们引入了一种旨在增强 LLM 视觉常识的方法。具体来说，我们的方法根据输入文本提示生成多幅图像，并通过混合它们的预测概率将它们集成到模型的决策过程中。为了促进多模态基础语言建模，我们采用了后期融合层，将投影的视觉特征与仅以文本为条件的预训练 LLM 的输出相结合。这个后期融合层可以基于全面的图像文本知识以及仅在需要时基于文本进行预测。我们使用几个视觉常识推理任务以及传统的 NLP 任务（包括常识推理和阅读理解）来评估我们的方法。我们的实验结果表明，该方法比现有基准方法具有显著优势。当应用于最近最先进的 LLM（例如 Llama3）时，我们不仅观察到视觉常识方面的改进，还观察到传统 NLP 基准方面的改进。代码和模型可在此 https URL 下找到。
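The abstract above says the method mixes prediction probabilities obtained from several generated images. The sketch below shows one simple way such mixing could work, a weighted average of per-image next-token distributions. The uniform default weights and the name `mix_predictions` are assumptions; the paper's actual mixing rule is not given in the abstract.

```python
def mix_predictions(per_image_probs, weights=None):
    """Combine one probability distribution per image into a single one.

    Assumption: a convex combination (uniform by default) followed by
    renormalisation; each inner list is a distribution over the same vocabulary.
    """
    count = len(per_image_probs)
    if weights is None:
        weights = [1.0 / count] * count
    vocab_size = len(per_image_probs[0])
    mixed = [
        sum(w * probs[token] for w, probs in zip(weights, per_image_probs))
        for token in range(vocab_size)
    ]
    total = sum(mixed)
    return [m / total for m in mixed]


if __name__ == "__main__":
    # Two hypothetical images, toy vocabulary of two tokens.
    print(mix_predictions([[0.5, 0.5], [1.0, 0.0]]))
```

Non-uniform weights could, for example, favour images judged more relevant to the prompt, but that choice is outside what the abstract states.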

Title: Fine-Tuning Gemma-7B for Enhanced Sentiment Analysis of Financial News Headlines

Authors: Kangtong Mo, Wenyan Liu, Xuanzhen Xu, Chang Yu, Yuelin Zou, Fangqing Xia
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.13626
Pdf URL: https://arxiv.org/pdf/2406.13626
Copy Paste: [[2406.13626]] Fine-Tuning Gemma-7B for Enhanced Sentiment Analysis of Financial News Headlines(https://arxiv.org/abs/2406.13626)
Keywords: language model, llm
Abstract: In this study, we explore the application of sentiment analysis on financial news headlines to understand investor sentiment. By leveraging Natural Language Processing (NLP) and Large Language Models (LLM), we analyze sentiment from the perspective of retail investors. The FinancialPhraseBank dataset, which contains categorized sentiments of financial news headlines, serves as the basis for our analysis. We fine-tuned several models, including distilbert-base-uncased, Llama, and gemma-7b, to evaluate their effectiveness in sentiment classification. Our experiments demonstrate that the fine-tuned gemma-7b model outperforms others, achieving the highest precision, recall, and F1 score. Specifically, the gemma-7b model showed significant improvements in accuracy after fine-tuning, indicating its robustness in capturing the nuances of financial sentiment. This model can be instrumental in providing market insights, risk management, and aiding investment decisions by accurately predicting the sentiment of financial news. The results highlight the potential of advanced LLMs in transforming how we analyze and interpret financial information, offering a powerful tool for stakeholders in the financial industry.
摘要：在本研究中，我们探索了情绪分析在金融新闻标题中的应用，以了解投资者情绪。通过利用自然语言处理 (NLP) 和大型语言模型 (LLM)，我们从散户投资者的角度分析情绪。FinancialPhraseBank 数据集包含金融新闻标题的分类情绪，是我们分析的基础。我们对几个模型进行了微调，包括 distilbert-base-uncased、Llama 和 gemma-7b，以评估它们在情绪分类中的有效性。我们的实验表明，经过微调的 gemma-7b 模型优于其他模型，实现了最高的精度、召回率和 F1 分数。具体而言，gemma-7b 模型在微调后准确率显著提高，表明其在捕捉金融情绪细微差别方面具有很强的鲁棒性。该模型可以通过准确预测金融新闻的情绪，在提供市场洞察、风险管理和协助投资决策方面发挥重要作用。结果凸显了高级 LLM 在改变我们分析和解释金融信息的方式方面的潜力，为金融行业的利益相关者提供了强大的工具。

Title: InstructRAG: Instructing Retrieval-Augmented Generation with Explicit Denoising

Authors: Zhepei Wei, Wei-Lin Chen, Yu Meng
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.13629
Pdf URL: https://arxiv.org/pdf/2406.13629
Copy Paste: [[2406.13629]] InstructRAG: Instructing Retrieval-Augmented Generation with Explicit Denoising(https://arxiv.org/abs/2406.13629)
Keywords: language model, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) has shown promising potential to enhance the accuracy and factuality of language models (LMs). However, imperfect retrievers or noisy corpora can introduce misleading or even erroneous information to the retrieved contents, posing a significant challenge to the generation quality. Existing RAG methods typically address this challenge by directly predicting final answers despite potentially noisy inputs, resulting in an implicit denoising process that is difficult to interpret and verify. On the other hand, the acquisition of explicit denoising supervision is often costly, involving significant human effort. In this work, we propose InstructRAG, where LMs explicitly learn the denoising process through self-synthesized rationales. First, we instruct the LM to explain how the ground-truth answer is derived from retrieved documents. Then, these rationales can be used either as demonstrations for in-context learning of explicit denoising or as supervised fine-tuning data to train the model. Compared to standard RAG approaches, InstructRAG requires no additional supervision, allows for easier verification of the predicted answers, and effectively improves generation accuracy. Experiments show InstructRAG consistently outperforms existing RAG methods in both training-free and trainable scenarios, achieving a relative improvement of 8.3% over the best baseline method on average across five knowledge-intensive benchmarks. Extensive analysis indicates that InstructRAG scales well with increased numbers of retrieved documents and consistently exhibits robust denoising ability even in out-of-domain datasets, demonstrating strong generalizability.
摘要：检索增强生成 (RAG) 已显示出提高语言模型 (LM) 准确性和真实性的良好潜力。然而，不完善的检索器或嘈杂的语料库可能会将误导性甚至错误的信息引入检索到的内容中，对生成质量构成重大挑战。现有的 RAG 方法通常通过直接预测最终答案来解决这一挑战，尽管输入可能有噪声，但这会导致隐式去噪过程难以解释和验证。另一方面，获得显式去噪监督通常成本高昂，需要大量人力。在这项工作中，我们提出了 InstructRAG，其中 LM 通过自我合成的原理显式学习去噪过程——首先，我们指示 LM 解释如何从检索到的文档中得出真实答案。然后，这些原理可以用作显式去噪的上下文学习的演示，也可以用作训练模型的监督微调数据。与标准 RAG 方法相比，InstructRAG 无需额外监督，可以更轻松地验证预测答案，并有效提高生成准确率。实验表明，InstructRAG 在无需训练和可训练的场景中始终优于现有的 RAG 方法，在五个知识密集型基准测试中，平均比最佳基线方法相对提高了 8.3%。广泛的分析表明，随着检索到的文档数量的增加，InstructRAG 可以很好地扩展，并且即使在域外数据集中也始终表现出强大的去噪能力，表现出很强的通用性。

Title: Can Few-shot Work in Long-Context? Recycling the Context to Generate Demonstrations

Authors: Arie Cattan, Alon Jacovi, Alex Fabrikant, Jonathan Herzig, Roee Aharoni, Hannah Rashkin, Dror Marcus, Avinatan Hassidim, Yossi Matias, Idan Szpektor, Avi Caciularu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.13632
Pdf URL: https://arxiv.org/pdf/2406.13632
Copy Paste: [[2406.13632]] Can Few-shot Work in Long-Context? Recycling the Context to Generate Demonstrations(https://arxiv.org/abs/2406.13632)
Keywords: language model, llm, long context, prompt
Abstract: Despite recent advancements in Large Language Models (LLMs), their performance on tasks involving long contexts remains sub-optimal. In-Context Learning (ICL) with few-shot examples may be an appealing solution to enhance LLM performance in this scenario; however, naively adding ICL examples with long context introduces challenges, including substantial token overhead added for each few-shot example and context mismatch between the demonstrations and the target query. In this work, we propose to automatically generate few-shot examples for long context QA tasks by recycling contexts. Specifically, given a long input context (1-3k tokens) and a query, we generate additional query-output pairs from the given context as few-shot examples, while introducing the context only once. This ensures that the demonstrations are leveraging the same context as the target query while only adding a small number of tokens to the prompt. We further enhance each demonstration by instructing the model to explicitly identify the relevant paragraphs before the answer, which improves performance while providing fine-grained attribution to the answer source. We apply our method on multiple LLMs and obtain substantial improvements on various QA datasets with long context, especially when the answer lies within the middle of the context. Surprisingly, despite introducing only single-hop ICL examples, LLMs also successfully generalize to multi-hop long-context QA using our approach.
摘要：尽管大型语言模型 (LLM) 近期取得了进展，但它们在涉及长上下文的任务上的表现仍然不理想。使用少样本示例的上下文内学习 (ICL) 可能是在此场景中增强 LLM 性能的一种有吸引力的解决方案；然而，天真地添加具有长上下文的 ICL 示例会带来挑战，包括为每个少样本示例增加大量标记开销以及演示和目标查询之间的上下文不匹配。在这项工作中，我们建议通过回收上下文来自动生成用于长上下文 QA 任务的少样本示例。具体来说，给定一个长输入上下文（1-3k 个标记）和一个查询，我们从给定的上下文中生成额外的查询输出对作为少样本示例，同时只引入一次上下文。这确保演示利用与目标查询相同的上下文，同时只向提示添加少量标记。我们通过指示模型在答案之前明确识别相关段落来进一步增强每个演示，这提高了性能，同时为答案来源提供了细粒度的归因。我们将我们的方法应用于多个 LLM，并在具有长上下文的各种 QA 数据集上取得了显著的改进，尤其是当答案位于上下文中间时。令人惊讶的是，尽管只引入了单跳 ICL 示例，但 LLM 也成功地利用我们的方法推广到多跳长上下文 QA。
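The prompt layout described in the abstract above (context included once, followed by demonstrations generated from that same context, each naming its relevant paragraphs before the answer) can be sketched as follows. The field labels, the `build_prompt` name and the sample data are all hypothetical; the paper's actual template is not given in the abstract.

```python
def build_prompt(context, demonstrations, query):
    """Assemble a long-context QA prompt that states the context only once.

    `demonstrations` is a list of (question, paragraph_ids, answer) tuples
    that are assumed to have been generated from `context` beforehand.
    """
    parts = ["Context:", context, ""]
    for question, paragraph_ids, answer in demonstrations:
        parts.append(f"Question: {question}")
        parts.append("Relevant paragraphs: " + ", ".join(str(i) for i in paragraph_ids))
        parts.append(f"Answer: {answer}")
        parts.append("")
    # The target query ends with the same cue, so the model is nudged to
    # cite paragraphs first and answer afterwards.
    parts.append(f"Question: {query}")
    parts.append("Relevant paragraphs:")
    return "\n".join(parts)


if __name__ == "__main__":
    context = "[1] Ada was born in 1815.\n[2] She worked with Babbage."
    demos = [("When was Ada born?", [1], "1815")]
    print(build_prompt(context, demos, "Who did Ada work with?"))
```

Because each demonstration adds only a question, a paragraph list and a short answer, the token overhead per example stays small compared with repeating a full context for every demonstration.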

Title: Towards Minimal Targeted Updates of Language Models with Targeted Negative Training

Authors: Lily H. Zhang, Rajesh Ranganath, Arya Tafvizi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.13660
Pdf URL: https://arxiv.org/pdf/2406.13660
Copy Paste: [[2406.13660]] Towards Minimal Targeted Updates of Language Models with Targeted Negative Training(https://arxiv.org/abs/2406.13660)
Keywords: language model
Abstract: Generative models of language exhibit impressive capabilities but still place non-negligible probability mass over undesirable outputs. In this work, we address the task of updating a model to avoid unwanted outputs while minimally changing model behavior otherwise, a challenge we refer to as a minimal targeted update. We first formalize the notion of a minimal targeted update and propose a method to achieve such updates using negative examples from a model's generations. Our proposed Targeted Negative Training (TNT) results in updates that keep the new distribution close to the original, unlike existing losses for negative signal which push down probability but do not control what the updated distribution will be. In experiments, we demonstrate that TNT yields a better trade-off between reducing unwanted behavior and maintaining model generation behavior than baselines, paving the way towards a modeling paradigm based on iterative training updates that constrain models from generating undesirable outputs while preserving their impressive capabilities.
摘要：语言生成模型展现出令人印象深刻的能力，但仍然将不可忽略的概率质量置于不良输出之上。在这项工作中，我们解决了更新模型以避免不良输出的任务，同时尽量减少模型行为的改变，我们将这一挑战称为最小目标更新。我们首先将最小目标更新的概念形式化，并提出一种使用模型生成中的负面示例实现此类更新的方法。我们提出的目标负面训练 (TNT) 可使更新保持新分布接近原始分布，这与现有的负面信号损失不同，后者会降低概率但不控制更新后的分布。在实验中，我们证明 TNT 在减少不良行为和保持模型生成行为之间比基线取得了更好的平衡，为基于迭代训练更新的建模范式铺平了道路，这种范式可以限制模型生成不良输出，同时保留其令人印象深刻的能力。

Title: ObscurePrompt: Jailbreaking Large Language Models via Obscure Input

Authors: Yue Huang, Jingyu Tang, Dongping Chen, Bingda Tang, Yao Wan, Lichao Sun, Xiangliang Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.13662
Pdf URL: https://arxiv.org/pdf/2406.13662
Copy Paste: [[2406.13662]] ObscurePrompt: Jailbreaking Large Language Models via Obscure Input(https://arxiv.org/abs/2406.13662)
Keywords: language model, llm, prompt
Abstract: Recently, Large Language Models (LLMs) have garnered significant attention for their exceptional natural language processing capabilities. However, concerns about their trustworthiness remain unresolved, particularly in addressing "jailbreaking" attacks on aligned LLMs. Previous research predominantly relies on scenarios with white-box LLMs or specific and fixed prompt templates, which are often impractical and lack broad applicability. In this paper, we introduce a straightforward and novel method, named ObscurePrompt, for jailbreaking LLMs, inspired by the observed fragile alignments in Out-of-Distribution (OOD) data. Specifically, we first formulate the decision boundary in the jailbreaking process and then explore how obscure text affects LLM's ethical decision boundary. ObscurePrompt starts with constructing a base prompt that integrates well-known jailbreaking techniques. Powerful LLMs are then utilized to obscure the original prompt through iterative transformations, aiming to bolster the attack's robustness. Comprehensive experiments show that our approach substantially improves upon previous methods in terms of attack effectiveness, maintaining efficacy against two prevalent defense mechanisms. We believe that our work can offer fresh insights for future research on enhancing LLM alignment.
摘要：最近，大型语言模型 (LLM) 因其出色的自然语言处理能力而备受关注。然而，人们对其可信度的担忧仍未得到解决，特别是在解决针对对齐 LLM 的“越狱”攻击方面。以前的研究主要依赖于白盒 LLM 或特定和固定提示模板的场景，这些场景通常不切实际且缺乏广泛的适用性。在本文中，我们介绍了一种用于越狱 LLM 的简单而新颖的方法，称为 ObscurePrompt，该方法的灵感来自于在分布外 (OOD) 数据中观察到的脆弱对齐。具体来说，我们首先在越狱过程中制定决策边界，然后探索模糊文本如何影响 LLM 的道德决策边界。ObscurePrompt 首先构建一个集成了众所周知的越狱技术的基本提示。然后利用强大的 LLM 通过迭代转换来模糊原始提示，旨在增强攻击的鲁棒性。综合实验表明，我们的方法在攻击有效性方面大大优于以前的方法，并且对两种常见的防御机制保持了有效性。我们相信我们的工作可以为未来增强 LLM 对齐的研究提供新的见解。

Title: Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation

Authors: Jirui Qi, Gabriele Sarti, Raquel Fernández, Arianna Bisazza
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.13663
Pdf URL: https://arxiv.org/pdf/2406.13663
Copy Paste: [[2406.13663]] Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation(https://arxiv.org/abs/2406.13663)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Ensuring the verifiability of model answers is a fundamental challenge for retrieval-augmented generation (RAG) in the question answering (QA) domain. Recently, self-citation prompting was proposed to make large language models (LLMs) generate citations to supporting documents along with their answers. However, self-citing LLMs often struggle to match the required format, refer to non-existent sources, and fail to faithfully reflect LLMs' context usage throughout the generation. In this work, we present MIRAGE -- Model Internals-based RAG Explanations -- a plug-and-play approach using model internals for faithful answer attribution in RAG applications. MIRAGE detects context-sensitive answer tokens and pairs them with retrieved documents contributing to their prediction via saliency methods. We evaluate our proposed approach on a multilingual extractive QA dataset, finding high agreement with human answer attribution. On open-ended QA, MIRAGE achieves citation quality and efficiency comparable to self-citation while also allowing for a finer-grained control of attribution parameters. Our qualitative evaluation highlights the faithfulness of MIRAGE's attributions and underscores the promising application of model internals for RAG answer attribution.
摘要：确保模型答案的可验证性是问答 (QA) 领域中检索增强生成 (RAG) 面临的一项基本挑战。最近，有人提出了自引提示，以使大型语言模型 (LLM) 在生成答案的同时生成对支持文档的引用。然而，自引的 LLM 通常难以匹配所需的格式、引用不存在的来源，并且无法忠实反映 LLM 在整个生成过程中的上下文使用情况。在这项工作中，我们提出了 MIRAGE——基于模型内部的 RAG 解释——一种使用模型内部在 RAG 应用程序中忠实地进行答案归因的即插即用方法。MIRAGE 检测上下文相关的答案标记，并通过显着性方法将它们与有助于其预测的检索到的文档配对。我们在多语言提取 QA 数据集上评估了我们提出的方法，发现与人工答案归因高度一致。在开放式问答中，MIRAGE 实现了与自我引用相当的引用质量和效率，同时还允许更精细地控制归因参数。我们的定性评估突出了 MIRAGE 归因的忠实性，并强调了模型内部在 RAG 答案归因方面的良好应用。

Title: Leveraging Large Language Models to Measure Gender Bias in Gendered Languages

Authors: Erik Derner, Sara Sansalvador de la Fuente, Yoan Gutiérrez, Paloma Moreda, Nuria Oliver
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2406.13677
Pdf URL: https://arxiv.org/pdf/2406.13677
Copy Paste: [[2406.13677]] Leveraging Large Language Models to Measure Gender Bias in Gendered Languages(https://arxiv.org/abs/2406.13677)
Keywords: language model, llm
Abstract: Gender bias in text corpora used in various natural language processing (NLP) contexts, such as for training large language models (LLMs), can lead to the perpetuation and amplification of societal inequalities. This is particularly pronounced in gendered languages like Spanish or French, where grammatical structures inherently encode gender, making the bias analysis more challenging. Existing methods designed for English are inadequate for this task due to the intrinsic linguistic differences between English and gendered languages. This paper introduces a novel methodology that leverages the contextual understanding capabilities of LLMs to quantitatively analyze gender representation in Spanish corpora. By utilizing LLMs to identify and classify gendered nouns and pronouns in relation to their reference to human entities, our approach provides a nuanced analysis of gender biases. We empirically validate our method on four widely-used benchmark datasets, uncovering significant gender disparities with a male-to-female ratio ranging from 4:1 to 6:1. These findings demonstrate the value of our methodology for bias quantification in gendered languages and suggest its application in NLP, contributing to the development of more equitable language technologies.
摘要：在各种自然语言处理 (NLP) 环境（例如用于训练大型语言模型 (LLM)）中使用的文本语料库中的性别偏见可能会导致社会不平等的延续和扩大。这在西班牙语或法语等性别语言中尤为明显，这些语言的语法结构本身就编码了性别，这使得偏见分析更具挑战性。由于英语和性别语言之间存在内在的语言差异，现有的为英语设计的方法不足以完成这项任务。本文介绍了一种新方法，该方法利用 LLM 的上下文理解能力来定量分析西班牙语语料库中的性别表现。通过利用 LLM 识别和分类与人类实体相关的性别名词和代词，我们的方法提供了对性别偏见的细致分析。我们在四个广泛使用的基准数据集上对我们的方法进行了实证验证，发现男女比例从 4:1 到 6:1 不等的显著性别差异。这些发现证明了我们在性别语言偏见量化方法方面的价值，并表明其可应用于 NLP，从而有助于开发更公平的语言技术。

Title: Synchronous Faithfulness Monitoring for Trustworthy Retrieval-Augmented Generation

Authors: Di Wu, Jia-Chen Gu, Fan Yin, Nanyun Peng, Kai-Wei Chang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.13692
Pdf URL: https://arxiv.org/pdf/2406.13692
Copy Paste: [[2406.13692]] Synchronous Faithfulness Monitoring for Trustworthy Retrieval-Augmented Generation(https://arxiv.org/abs/2406.13692)
Keywords: language model, retrieval-augmented generation
Abstract: Retrieval-augmented language models (RALMs) have shown strong performance and wide applicability in knowledge-intensive tasks. However, there are significant trustworthiness concerns as RALMs are prone to generating unfaithful outputs, including baseless information or contradictions with the retrieved context. This paper proposes SynCheck, a lightweight monitor that leverages fine-grained decoding dynamics including sequence likelihood, uncertainty quantification, context influence, and semantic alignment to synchronously detect unfaithful sentences. By integrating efficiently measurable and complementary signals, SynCheck enables accurate and immediate feedback and intervention, achieving 0.85 AUROC in detecting faithfulness errors across six long-form retrieval-augmented generation tasks, improving on the prior best method by 4%. Leveraging SynCheck, we further introduce FOD, a faithfulness-oriented decoding algorithm guided by beam search for long-form retrieval-augmented generation. Empirical results demonstrate that FOD significantly outperforms traditional strategies such as abstention, reranking, or contrastive decoding in terms of faithfulness, achieving over 10% improvement across six datasets.
摘要：检索增强语言模型 (RALM) 在知识密集型任务中表现出色且适用性广泛。然而，由于 RALM 容易生成不真实的输出，包括毫无根据的信息或与检索到的上下文相矛盾，因此存在严重的可信度问题。本文提出了 SynCheck，这是一个轻量级监视器，它利用细粒度的解码动态（包括序列似然、不确定性量化、上下文影响和语义对齐）来同步检测不真实的句子。通过整合有效可测量和互补的信号，SynCheck 可实现准确、即时的反馈和干预，在六个长格式检索增强生成任务中检测忠诚度错误的 AUROC 达到 0.85，比之前的最佳方法提高了 4%。利用 SynCheck，我们进一步引入了 FOD，一种以忠诚度为导向的解码算法，由集束搜索引导，用于长格式检索增强生成。实证结果表明，FOD 在忠诚度方面明显优于弃权、重新排序或对比解码等传统策略，在六个数据集上实现了超过 10% 的提升。
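The SynCheck abstract reports detection quality as AUROC. For readers unfamiliar with the metric, the sketch below computes it from its pairwise definition: the probability that a randomly chosen positive example (here, an unfaithful sentence) receives a higher score than a randomly chosen negative one, with ties counted as half. The scores and labels are made up for illustration and do not come from the paper.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison (O(P * N)).

    `labels` uses 1 for the positive class and 0 for the negative class.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical monitor scores for four sentences; 1 marks an unfaithful one.
    print(auroc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives some context for the reported 0.85.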

Title: MMTE: Corpus and Metrics for Evaluating Machine Translation Quality of Metaphorical Language

Authors: Shun Wang, Ge Zhang, Han Wu, Tyler Loakman, Wenhao Huang, Chenghua Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.13698
Pdf URL: https://arxiv.org/pdf/2406.13698
Copy Paste: [[2406.13698]] MMTE: Corpus and Metrics for Evaluating Machine Translation Quality of Metaphorical Language(https://arxiv.org/abs/2406.13698)
Keywords: language model
Abstract: Machine Translation (MT) has developed rapidly since the release of Large Language Models and current MT evaluation is performed through comparison with reference human translations or by predicting quality scores from human-labeled data. However, these mainstream evaluation methods mainly focus on fluency and factual reliability, whilst paying little attention to figurative quality. In this paper, we investigate the figurative quality of MT and propose a set of human evaluation metrics focused on the translation of figurative language. We additionally present a multilingual parallel metaphor corpus generated by post-editing. Our evaluation protocol is designed to estimate four aspects of MT: Metaphorical Equivalence, Emotion, Authenticity, and Quality. In doing so, we observe that translations of figurative expressions display different traits from literal ones.
摘要：自大型语言模型发布以来，机器翻译 (MT) 发展迅速，目前的机器翻译评估是通过与参考人工翻译进行比较或通过从人工标记数据中预测质量分数来进行的。然而，这些主流评估方法主要关注流畅性和事实可靠性，而很少关注比喻质量。在本文中，我们研究了机器翻译的比喻质量，并提出了一套专注于比喻语言翻译的人工评估指标。我们还展示了一个由译后编辑生成的多语言平行隐喻语料库。我们的评估协议旨在评估机器翻译的四个方面：隐喻等价性、情感、真实性和质量。在此过程中，我们观察到比喻表达的翻译表现出与字面表达不同的特征。

Title: Breaking News: Case Studies of Generative AI's Use in Journalism

Authors: Natalie Grace Brigham, Chongjiu Gao, Tadayoshi Kohno, Franziska Roesner, Niloofar Mireshghallah
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2406.13706
Pdf URL: https://arxiv.org/pdf/2406.13706
Copy Paste: [[2406.13706]] Breaking News: Case Studies of Generative AI's Use in Journalism(https://arxiv.org/abs/2406.13706)
Keywords: language model, llm, prompt, chat
Abstract: Journalists are among the many users of large language models (LLMs). To better understand the journalist-AI interactions, we conduct a study of LLM usage by two news agencies through browsing the WildChat dataset, identifying candidate interactions, and verifying them by matching to online published articles. Our analysis uncovers instances where journalists provide sensitive material such as confidential correspondence with sources or articles from other agencies to the LLM as stimuli and prompt it to generate articles, and publish these machine-generated articles with limited intervention (median output-publication ROUGE-L of 0.62). Based on our findings, we call for further research into what constitutes responsible use of AI, and the establishment of clear guidelines and best practices on using LLMs in a journalistic context.
摘要：记者是大型语言模型 (LLM) 的众多用户之一。为了更好地了解记者与人工智能的互动，我们对两家新闻机构的 LLM 使用情况进行了研究，研究方式是浏览 WildChat 数据集、确定候选互动，并通过与在线发布的文章进行匹配来验证它们。我们的分析发现了一些案例，记者向 LLM 提供敏感材料（例如与消息来源的机密通信或其他机构的文章）作为刺激，并促使其生成文章，并在有限的干预下发布这些机器生成的文章（中位输出出版物 ROUGE-L 为 0.62）。根据我们的研究结果，我们呼吁进一步研究什么是负责任地使用人工智能，并制定在新闻环境中使用 LLM 的明确指导方针和最佳实践。
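The journalism study above quantifies how closely published articles follow the LLM output with ROUGE-L, which is based on the longest common subsequence (LCS) of tokens. The following is a minimal sketch of a ROUGE-L F1 score; lowercased whitespace tokenisation and equal weighting of precision and recall are assumptions, and the study may have used a different ROUGE implementation or variant.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 between two strings (illustrative tokenisation)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical model output and published sentence.
    output = "the council approved the new budget on monday"
    published = "the council approved a revised budget on monday"
    print(round(rouge_l_f1(output, published), 2))
```

Under this reading, a median score of 0.62 means that a large share of the published wording appears, in order, in the machine-generated text.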

Title: Benchmarking Open-Source Language Models for Efficient Question Answering in Industrial Applications

Authors: Mahaman Sanoussi Yahaya Alassan, Jessica López Espejel, Merieme Bouhandi, Walid Dahhane, El Hassane Ettifouri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.13713
Pdf URL: https://arxiv.org/pdf/2406.13713
Copy Paste: [[2406.13713]] Benchmarking Open-Source Language Models for Efficient Question Answering in Industrial Applications(https://arxiv.org/abs/2406.13713)
Keywords: language model, llm
Abstract: In the rapidly evolving landscape of Natural Language Processing (NLP), Large Language Models (LLMs) have demonstrated remarkable capabilities in tasks such as question answering (QA). However, the accessibility and practicality of utilizing these models for industrial applications pose significant challenges, particularly concerning cost-effectiveness, inference speed, and resource efficiency. This paper presents a comprehensive benchmarking study comparing open-source LLMs with their non-open-source counterparts on the task of question answering. Our objective is to identify open-source alternatives capable of delivering comparable performance to proprietary models while being lightweight in terms of resource requirements and suitable for Central Processing Unit (CPU)-based inference. Through rigorous evaluation across various metrics including accuracy, inference speed, and resource consumption, we aim to provide insights into selecting efficient LLMs for real-world applications. Our findings shed light on viable open-source alternatives that offer acceptable performance and efficiency, addressing the pressing need for accessible and efficient NLP solutions in industry settings.
摘要：在快速发展的自然语言处理 (NLP) 领域，大型语言模型 (LLM) 在问答 (QA) 等任务中表现出了卓越的能力。然而，将这些模型用于工业应用的可访问性和实用性带来了重大挑战，特别是在成本效益、推理速度和资源效率方面。本文对开源 LLM 和非开源 LLM 在问答任务上进行了全面的基准测试比较。我们的目标是找到能够提供与专有模型相当的性能，同时在资源需求方面轻量级且适合基于中央处理单元 (CPU) 的推理的开源替代方案。通过对准确度、推理速度和资源消耗等各种指标进行严格评估，我们旨在为选择适用于实际应用的高效 LLM 提供见解。我们的研究结果揭示了可行的开源替代方案，这些替代方案提供了可接受的性能和效率，满足了行业环境中对可访问且高效的 NLP 解决方案的迫切需求。

Title: Evaluating Large Language Models along Dimensions of Language Variation: A Systematik Invesdigatiom uv Cross-lingual Generalization

Authors: Niyati Bafna, Kenton Murray, David Yarowsky
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.13718
Pdf URL: https://arxiv.org/pdf/2406.13718
Copy Paste: [[2406.13718]] Evaluating Large Language Models along Dimensions of Language Variation: A Systematik Invesdigatiom uv Cross-lingual Generalization(https://arxiv.org/abs/2406.13718)
Keywords: language model
Abstract: While large language models exhibit certain cross-lingual generalization capabilities, they suffer from performance degradation (PD) on unseen closely-related languages (CRLs) and dialects relative to their high-resource language neighbour (HRLN). However, we currently lack a fundamental understanding of what kinds of linguistic distances contribute to PD, and to what extent. Furthermore, studies of cross-lingual generalization are confounded by unknown quantities of CRL language traces in the training data, and by the frequent lack of availability of evaluation data in lower-resource related languages and dialects. To address these issues, we model phonological, morphological, and lexical distance as Bayesian noise processes to synthesize artificial languages that are controllably distant from the HRLN. We analyse PD as a function of underlying noise parameters, offering insights on model robustness to isolated and composed linguistic phenomena, and the impact of task and HRL characteristics on PD. We calculate parameter posteriors on real CRL-HRLN pair data and show that they follow computed trends of artificial languages, demonstrating the viability of our noisers. Our framework offers a cheap solution to estimating task performance on an unseen CRL given HRLN performance using its posteriors, as well as for diagnosing observed PD on a CRL in terms of its linguistic distances from its HRLN, and opens doors to principled methods of mitigating performance degradation.
摘要：虽然大型语言模型表现出一定的跨语言泛化能力，但它们在看不见的密切相关语言 (CRL) 和方言上相对于其高资源语言邻居 (HRLN) 会遭受性能下降 (PD) 的困扰。然而，我们目前缺乏对哪些语言距离会导致 PD 以及导致 PD 的程度的根本了解。此外，跨语言泛化研究受到训练数据中 CRL 语言痕迹数量未知以及资源较少的语言和方言评估数据经常缺乏的困扰。为了解决这些问题，我们将音系、形态和词汇距离建模为贝叶斯噪声过程，以合成与 HRLN 距离可控的人工语言。我们分析 PD 作为底层噪声参数的函数，提供有关模型对孤立和组合语言现象的鲁棒性的见解，以及任务和 HRL 特征对 PD 的影响。我们计算了真实 CRL-HRLN 对数据的参数后验，并表明它们遵循人工语言的计算趋势，证明了我们的噪声器的可行性。我们的框架提供了一种廉价的解决方案，可以使用后验估计未见 CRL 上的任务性能（给定 HRLN 性能），以及根据 CRL 与 HRLN 的语言距离诊断观察到的 PD，并为缓解性能下降的原则性方法打开了大门。

Title: On the Utility of Domain-Adjacent Fine-Tuned Model Ensembles for Few-shot Problems

Authors: Md Ibrahim Ibne Alam, Parikshit Ram, Soham Dan, Horst Samulowitz, Koushik Kar
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.13720
Pdf URL: https://arxiv.org/pdf/2406.13720
Copy Paste: [[2406.13720]] On the Utility of Domain-Adjacent Fine-Tuned Model Ensembles for Few-shot Problems(https://arxiv.org/abs/2406.13720)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have been observed to perform well on a wide range of downstream tasks when fine-tuned on domain-specific data. However, such data may not be readily available in many applications, motivating zero-shot or few-shot approaches using domain-adjacent models. While several fine-tuned models for various tasks are available, finding an appropriate domain-adjacent model for a given task is often not straightforward. In this paper, we study DAFT-E, a framework that utilizes an Ensemble of Domain-Adjacent Fine-Tuned Foundation Models for few-shot problems. We show that for zero-shot problems, this ensembling method provides an accuracy performance close to that of the single best model. With few-shot problems, this performance improves further, at which point DAFT-E can outperform any single domain-adjacent model while requiring much less data for domain-specific fine-tuning.
摘要：据观察，大型语言模型 (LLM) 在针对特定领域数据进行微调时，在广泛的下游任务中表现良好。然而，在许多应用中，这样的数据可能并不容易获得，这促使人们使用领域相邻模型进行零样本或少样本方法。虽然有几种针对各种任务的微调模型可用，但为给定任务找到合适的领域相邻模型往往并不是一件容易的事。在本文中，我们研究了 DAFT-E，这是一个利用领域相邻微调基础模型集合来解决少样本问题的框架。我们表明，对于零样本问题，这种集成方法提供的准确度性能接近单个最佳模型。对于少样本问题，这种性能进一步提高，此时 DAFT-E 可以胜过任何单个领域相邻模型，同时需要更少的数据进行领域特定微调。

Title: Every Language Counts: Learn and Unlearn in Multilingual LLMs

Authors: Taiming Lu, Philipp Koehn
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.13748
Pdf URL: https://arxiv.org/pdf/2406.13748
Copy Paste: [[2406.13748]] Every Language Counts: Learn and Unlearn in Multilingual LLMs(https://arxiv.org/abs/2406.13748)
Keywords: language model, llm
Abstract: This paper investigates the propagation of harmful information in multilingual large language models (LLMs) and evaluates the efficacy of various unlearning methods. We demonstrate that fake information, regardless of the language it is in, once introduced into these models through training data, can spread across different languages, compromising the integrity and reliability of the generated content. Our findings reveal that standard unlearning techniques, which typically focus on English data, are insufficient in mitigating the spread of harmful content in multilingual contexts and could inadvertently reinforce harmful content across languages. We show that only by addressing harmful responses in both English and the original language of the harmful data can we effectively eliminate generations for all languages. This underscores the critical need for comprehensive unlearning strategies that consider the multilingual nature of modern LLMs to enhance their safety and reliability across diverse linguistic landscapes.

Title: Can LLMs Reason in the Wild with Programs?

Authors: Yuan Yang, Siheng Xiong, Ali Payani, Ehsan Shareghi, Faramarz Fekri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.13764
Pdf URL: https://arxiv.org/pdf/2406.13764
Copy Paste: [[2406.13764]] Can LLMs Reason in the Wild with Programs?(https://arxiv.org/abs/2406.13764)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have shown superior capability to solve reasoning problems with programs. While being a promising direction, most of such frameworks are trained and evaluated in settings with prior knowledge of task requirements. However, as LLMs become more capable, it is necessary to assess their reasoning abilities in more realistic scenarios where many real-world problems are open-ended with ambiguous scope, and often require multiple formalisms to solve. To investigate this, we introduce the task of reasoning in the wild, where an LLM is tasked to solve a reasoning problem of unknown type by identifying the subproblems and their corresponding formalisms, and writing a program to solve each subproblem, guided by a tactic. We create a large tactic-guided trajectory dataset containing detailed solutions to a diverse set of reasoning problems, ranging from well-defined single-form reasoning (e.g., math, logic), to ambiguous and hybrid ones (e.g., commonsense, combined math and logic). This allows us to test various aspects of LLMs' reasoning at the fine-grained level such as the selection and execution of tactics, and the tendency to take undesired shortcuts. In experiments, we highlight that existing LLMs fail significantly on problems with ambiguous and mixed scope, revealing critical limitations and overfitting issues (e.g. accuracy on GSM8K drops by at least 50%). We further show the potential of finetuning a local LLM on the tactic-guided trajectories in achieving better performance. Project repo is available at this http URL

Title: FoRAG: Factuality-optimized Retrieval Augmented Generation for Web-enhanced Long-form Question Answering

Authors: Tianchi Cai, Zhiwen Tan, Xierui Song, Tao Sun, Jiyan Jiang, Yunqi Xu, Yinger Zhang, Jinjie Gu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] FoRAG: Factuality-optimized Retrieval Augmented Generation for Web-enhanced Long-form Question Answering(https://arxiv.org/abs/)
Keywords: gpt, chat, retrieval augmented generation
Abstract: Retrieval Augmented Generation (RAG) has become prevalent in question-answering (QA) tasks due to its ability to use search engines to enhance the quality of long-form question-answering (LFQA). Despite the emergence of various open source methods and web-enhanced commercial systems such as Bing Chat, two critical problems remain unsolved, i.e., the lack of factuality and clear logic in the generated long-form answers. In this paper, we remedy these issues via a systematic study on answer generation in web-enhanced LFQA. Specifically, we first propose a novel outline-enhanced generator to achieve clear logic in the generation of multifaceted answers and construct two datasets accordingly. Then we propose a factuality optimization method based on a carefully designed doubly fine-grained RLHF framework, which contains automatic evaluation and reward modeling at different levels of granularity. Our generic framework comprises conventional fine-grained RLHF methods as special cases. Extensive experiments verify the superiority of our proposed Factuality-optimized RAG (FoRAG) method on both English and Chinese benchmarks. In particular, when applying our method to Llama2-7B-chat, the derived model FoRAG-L-7B outperforms WebGPT-175B in terms of three commonly used metrics (i.e., coherence, helpfulness, and factuality), while the number of parameters is much smaller (only 1/24 of that of WebGPT-175B). Our datasets and models are made publicly available for better reproducibility: this https URL.

Title: Semantic Structure-Mapping in LLM and Human Analogical Reasoning

Authors: Sam Musker, Alex Duchnowski, Raphaël Millière, Ellie Pavlick
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.13803
Pdf URL: https://arxiv.org/pdf/2406.13803
Copy Paste: [[2406.13803]] Semantic Structure-Mapping in LLM and Human Analogical Reasoning(https://arxiv.org/abs/2406.13803)
Keywords: language model, llm
Abstract: Analogical reasoning is considered core to human learning and cognition. Recent studies have compared the analogical reasoning abilities of human subjects and Large Language Models (LLMs) on abstract symbol manipulation tasks, such as letter string analogies. However, these studies largely neglect analogical reasoning over semantically meaningful symbols, such as natural language words. This ability to draw analogies that link language to non-linguistic domains, which we term semantic structure-mapping, is thought to play a crucial role in language acquisition and broader cognitive development. We test human subjects and LLMs on analogical reasoning tasks that require the transfer of semantic structure and content from one domain to another. Advanced LLMs match human performance across many task variations. However, humans and LLMs respond differently to certain task variations and semantic distractors. Overall, our data suggest that LLMs are approaching human-level performance on these important cognitive tasks, but are not yet entirely human-like.

Title: WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia

Authors: Yufang Hou, Alessandra Pascale, Javier Carnerero-Cano, Tigran Tchrakian, Radu Marinescu, Elizabeth Daly, Inkit Padhi, Prasanna Sattigeri
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.13805
Pdf URL: https://arxiv.org/pdf/2406.13805
Copy Paste: [[2406.13805]] WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia(https://arxiv.org/abs/2406.13805)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs), such as hallucinations and outdated information. However, it remains unclear how LLMs handle knowledge conflicts arising from different augmented retrieved passages, especially when these passages originate from the same source and have equal trustworthiness. In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions that have varying answers based on contradictory passages from Wikipedia, a dataset widely regarded as a high-quality pre-training resource for most LLMs. Specifically, we introduce WikiContradict, a benchmark consisting of 253 high-quality, human-annotated instances designed to assess LLM performance when augmented with retrieved passages containing real-world knowledge conflicts. We benchmark a diverse range of both closed and open-source LLMs under different QA scenarios, including RAG with a single passage, and RAG with 2 contradictory passages. Through rigorous human evaluations on a subset of WikiContradict instances involving 5 LLMs and over 3,500 judgements, we shed light on the behaviour and limitations of these models. For instance, when provided with two passages containing contradictory facts, all models struggle to generate answers that accurately reflect the conflicting nature of the context, especially for implicit conflicts requiring reasoning. Since human evaluation is costly, we also introduce an automated model that estimates LLM performance using a strong open-source language model, achieving an F-score of 0.8. Using this automated metric, we evaluate more than 1,500 answers from seven LLMs across all WikiContradict instances. To facilitate future work, we release WikiContradict on: this https URL.

Title: Neuro-symbolic Training for Reasoning over Spatial Language

Authors: Tanawan Premsri, Parisa Kordjamshidi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.13828
Pdf URL: https://arxiv.org/pdf/2406.13828
Copy Paste: [[2406.13828]] Neuro-symbolic Training for Reasoning over Spatial Language(https://arxiv.org/abs/2406.13828)
Keywords: language model
Abstract: Recent research shows that more data and larger models can provide more accurate solutions to natural language problems requiring reasoning. However, models can easily fail to provide solutions in unobserved complex input compositions due to not achieving the level of abstraction required for generalizability. To alleviate this issue, we propose training the language models with neuro-symbolic techniques that can exploit the logical rules of reasoning as constraints and provide additional supervision sources to the model. Training models to adhere to the regulations of reasoning pushes them to make more effective abstractions needed for generalizability and transfer learning. We focus on a challenging problem of spatial reasoning over text. Our results on various benchmarks using multiple language models confirm our hypothesis of effective domain transfer based on neuro-symbolic training.

Title: Text Serialization and Their Relationship with the Conventional Paradigms of Tabular Machine Learning

Authors: Kyoka Ono, Simon A. Lee
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.13846
Pdf URL: https://arxiv.org/pdf/2406.13846
Copy Paste: [[2406.13846]] Text Serialization and Their Relationship with the Conventional Paradigms of Tabular Machine Learning(https://arxiv.org/abs/2406.13846)
Keywords: language model
Abstract: Recent research has explored how Language Models (LMs) can be used for feature representation and prediction in tabular machine learning tasks. This involves employing text serialization and supervised fine-tuning (SFT) techniques. Despite the simplicity of these techniques, significant gaps remain in our understanding of the applicability and reliability of LMs in this context. Our study assesses how emerging LM technologies compare with traditional paradigms in tabular machine learning and evaluates the feasibility of adopting similar approaches with these advanced technologies. At the data level, we investigate various methods of data representation and curation of serialized tabular data, exploring their impact on prediction performance. At the classification level, we examine whether text serialization combined with LMs enhances performance on tabular datasets (e.g. class imbalance, distribution shift, biases, and high dimensionality), and assess whether this method represents a state-of-the-art (SOTA) approach for addressing tabular machine learning challenges. Our findings reveal current pre-trained models should not replace conventional approaches.

Title: Distributional reasoning in LLMs: Parallel reasoning processes in multi-hop reasoning

Authors: Yuval Shalev, Amir Feder, Ariel Goldstein
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.13858
Pdf URL: https://arxiv.org/pdf/2406.13858
Copy Paste: [[2406.13858]] Distributional reasoning in LLMs: Parallel reasoning processes in multi-hop reasoning(https://arxiv.org/abs/2406.13858)
Keywords: language model, llm
Abstract: Large language models (LLMs) have shown an impressive ability to perform tasks believed to require thought processes. When the model does not document an explicit thought process, it becomes difficult to understand the processes occurring within its hidden layers and to determine if these processes can be referred to as reasoning. We introduce a novel and interpretable analysis of internal multi-hop reasoning processes in LLMs. We demonstrate that the prediction process for compositional reasoning questions can be modeled using a simple linear transformation between two semantic category spaces. We show that during inference, the middle layers of the network generate highly interpretable embeddings that represent a set of potential intermediate answers for the multi-hop question. We use statistical analyses to show that a corresponding subset of tokens is activated in the model's output, implying the existence of parallel reasoning paths. These observations hold true even when the model lacks the necessary knowledge to solve the task. Our findings can help uncover the strategies that LLMs use to solve reasoning tasks, offering insights into the types of thought processes that can emerge from artificial intelligence. Finally, we also discuss the implications of these results for cognitive modeling.

Title: Knowledge Graph-Enhanced Large Language Models via Path Selection

Authors: Haochen Liu, Song Wang, Yaochen Zhu, Yushun Dong, Jundong Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.13862
Pdf URL: https://arxiv.org/pdf/2406.13862
Copy Paste: [[2406.13862]] Knowledge Graph-Enhanced Large Language Models via Path Selection(https://arxiv.org/abs/2406.13862)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) have shown unprecedented performance in various real-world applications. However, they are known to generate factually inaccurate outputs, a.k.a. the hallucination problem. In recent years, incorporating external knowledge extracted from Knowledge Graphs (KGs) has become a promising strategy to improve the factual accuracy of LLM-generated outputs. Nevertheless, most existing explorations rely on LLMs themselves to perform KG knowledge extraction, which is highly inflexible as LLMs can only provide binary judgment on whether a certain knowledge (e.g., a knowledge path in KG) should be used. In addition, LLMs tend to pick only knowledge with direct semantic relationship with the input text, while potentially useful knowledge with indirect semantics can be ignored. In this work, we propose a principled framework KELP with three stages to handle the above problems. Specifically, KELP is able to achieve finer granularity of flexible knowledge extraction by generating scores for knowledge paths with input texts via latent semantic matching. Meanwhile, knowledge paths with indirect semantic relationships with the input text can also be considered via trained encoding between the selected paths in KG and the input text. Experiments on real-world datasets validate the effectiveness of KELP.
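Illustrative sketch: the abstract describes scoring knowledge paths against the input text rather than making a binary keep-or-drop decision. The following toy example shows that general idea with cosine similarity over hand-written vectors. It is not the KELP implementation: the function names, the example paths, and the three-dimensional "embeddings" are all assumptions made for illustration, and a real system would obtain the vectors from a trained encoder.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_paths(query_vec: Sequence[float],
                 path_vecs: Dict[str, Sequence[float]],
                 k: int = 2) -> List[str]:
    """Rank candidate knowledge paths by similarity to the query and keep the top k."""
    scored = sorted(
        ((cosine(query_vec, vec), path) for path, vec in path_vecs.items()),
        reverse=True,
    )
    return [path for _, path in scored[:k]]


# Hypothetical embeddings for an input text and three candidate KG paths.
query = [1.0, 0.0, 1.0]
paths = {
    "Paris -capital_of-> France": [0.9, 0.1, 0.8],
    "France -member_of-> EU": [0.5, 0.5, 0.5],
    "Tokyo -capital_of-> Japan": [0.0, 1.0, 0.0],
}

top = select_paths(query, paths, k=2)
```

Because every path receives a graded score, a path that is only indirectly related to the input (the second one here) can still be selected, which a strict yes-or-no filter might discard.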

Title: Knowledge Tagging System on Math Questions via LLMs with Flexible Demonstration Retriever

Authors: Hang Li, Tianlong Xu, Jiliang Tang, Qingsong Wen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.13885
Pdf URL: https://arxiv.org/pdf/2406.13885
Copy Paste: [[2406.13885]] Knowledge Tagging System on Math Questions via LLMs with Flexible Demonstration Retriever(https://arxiv.org/abs/2406.13885)
Keywords: language model, llm
Abstract: Knowledge tagging for questions plays a crucial role in contemporary intelligent educational applications, including learning progress diagnosis, practice question recommendations, and course content organization. Traditionally, these annotations are always conducted by pedagogical experts, as the task requires not only a strong semantic understanding of both question stems and knowledge definitions but also deep insights into connecting question-solving logic with corresponding knowledge concepts. With the recent emergence of advanced text encoding algorithms, such as pre-trained language models, many researchers have developed automatic knowledge tagging systems based on calculating the semantic similarity between the knowledge and question embeddings. In this paper, we explore automating the task using Large Language Models (LLMs), in response to the inability of prior encoding-based methods to deal with the hard cases which involve strong domain knowledge and complicated concept definitions. By showing the strong performance of zero- and few-shot results over math questions knowledge tagging tasks, we demonstrate LLMs' great potential in conquering the challenges faced by prior methods. Furthermore, by proposing a reinforcement learning-based demonstration retriever, we successfully exploit the great potential of different-sized LLMs in achieving better performance results while keeping the in-context demonstration usage efficiency high.

Title: ClinicalLab: Aligning Agents for Multi-Departmental Clinical Diagnostics in the Real World

Authors: Weixiang Yan, Haitian Liu, Tengxiao Wu, Qian Chen, Wen Wang, Haoyuan Chai, Jiayi Wang, Weishan Zhao, Yixin Zhang, Renjun Zhang, Li Zhu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.13890
Pdf URL: https://arxiv.org/pdf/2406.13890
Copy Paste: [[2406.13890]] ClinicalLab: Aligning Agents for Multi-Departmental Clinical Diagnostics in the Real World(https://arxiv.org/abs/2406.13890)
Keywords: llm, agent
Abstract: LLMs have achieved significant performance progress in various NLP applications. However, LLMs still struggle to meet the strict requirements for accuracy and reliability in the medical field and face many challenges in clinical applications. Existing clinical diagnostic evaluation benchmarks for evaluating medical agents powered by LLMs have severe limitations. Firstly, most existing medical evaluation benchmarks face the risk of data leakage or contamination. Secondly, existing benchmarks often neglect the characteristics of multiple departments and specializations in modern medical practice. Thirdly, existing evaluation methods are limited to multiple-choice questions, which do not align with the real-world diagnostic scenarios. Lastly, existing evaluation methods lack comprehensive evaluations of end-to-end real clinical scenarios. These limitations in benchmarks in turn obstruct advancements of LLMs and agents for medicine. To address these limitations, we introduce ClinicalLab, a comprehensive clinical diagnosis agent alignment suite. ClinicalLab includes ClinicalBench, an end-to-end multi-departmental clinical diagnostic evaluation benchmark for evaluating medical agents and LLMs. ClinicalBench is based on real cases that cover 24 departments and 150 diseases. ClinicalLab also includes four novel metrics (ClinicalMetrics) for evaluating the effectiveness of LLMs in clinical diagnostic tasks. We evaluate 17 LLMs and find that their performance varies significantly across different departments. Based on these findings, in ClinicalLab, we propose ClinicalAgent, an end-to-end clinical agent that aligns with real-world clinical diagnostic practices. We systematically investigate the performance and applicable scenarios of variants of ClinicalAgent on ClinicalBench. Our findings demonstrate the importance of aligning with modern medical practices in designing medical agents.

Title: Adaptable Logical Control for Large Language Models

Authors: Honghua Zhang, Po-Nien Kung, Masahiro Yoshida, Guy Van den Broeck, Nanyun Peng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.13892
Pdf URL: https://arxiv.org/pdf/2406.13892
Copy Paste: [[2406.13892]] Adaptable Logical Control for Large Language Models(https://arxiv.org/abs/2406.13892)
Keywords: language model, gpt, llm
Abstract: Despite the success of Large Language Models (LLMs) on various tasks following human instructions, controlling model generation at inference time poses a persistent challenge. In this paper, we introduce Ctrl-G, an adaptable framework that facilitates tractable and flexible control of LLM generation to reliably follow logical constraints. Ctrl-G combines any production-ready LLM with a Hidden Markov Model, enabling LLM outputs to adhere to logical constraints represented as deterministic finite automata. We show that Ctrl-G, when applied to a TULU2-7B model, outperforms GPT3.5 and GPT4 on the task of interactive text editing: specifically, for the task of generating text insertions/continuations following logical constraints, Ctrl-G achieves over 30% higher satisfaction rate in human evaluation compared to GPT4. When applied to medium-size language models (e.g., GPT2-large), Ctrl-G also beats its counterparts for constrained generation by large margins on standard benchmarks. Additionally, as a proof-of-concept study, we experiment with Ctrl-G on the Grade School Math benchmark to assist LLM reasoning, foreshadowing the application of Ctrl-G, as well as other constrained generation approaches, beyond traditional language generation tasks.
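Illustrative sketch: the abstract states that constraints are represented as deterministic finite automata (DFAs). The toy example below shows what such a representation can look like for one simple constraint, "the output must contain a given keyword". It only checks a finished token sequence and does not reproduce Ctrl-G, which per the abstract additionally couples the LLM with a Hidden Markov Model to steer generation. The names `make_keyword_dfa` and `accepts` are hypothetical.

```python
from typing import Callable, Iterable, Set, Tuple

# A DFA is modeled here as (transition function, start state, accepting states).
DFA = Tuple[Callable[[int, str], int], int, Set[int]]


def make_keyword_dfa(keyword: str) -> DFA:
    """Build a two-state DFA: state 0 = keyword not yet seen, state 1 = keyword seen."""
    def step(state: int, token: str) -> int:
        # Once the keyword has been seen, the automaton stays in the accepting state.
        return 1 if state == 1 or token == keyword else 0
    return step, 0, {1}


def accepts(dfa: DFA, tokens: Iterable[str]) -> bool:
    """Run the DFA over a token sequence and report whether it ends in an accepting state."""
    step, state, accepting = dfa
    for token in tokens:
        state = step(state, token)
    return state in accepting


dfa = make_keyword_dfa("cat")
ok = accepts(dfa, "the cat sat".split())
bad = accepts(dfa, "the dog sat".split())
```

More involved constraints (several keywords, required orderings, length limits) can be expressed by adding states and transitions, which is what makes the automaton form convenient for specifying logical conditions on text.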

Title: Open Generative Large Language Models for Galician

Authors: Pablo Gamallo, Pablo Rodríguez, Iria de-Dios-Flores, Susana Sotelo, Silvia Paniagua, Daniel Bardanca, José Ramom Pichel, Marcos Garcia
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.13893
Pdf URL: https://arxiv.org/pdf/2406.13893
Copy Paste: [[2406.13893]] Open Generative Large Language Models for Galician(https://arxiv.org/abs/2406.13893)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) have transformed natural language processing. Yet, their predominantly English-centric training has led to biases and performance disparities across languages. This imbalance marginalizes minoritized languages, making equitable access to NLP technologies more difficult for languages with lower resources, such as Galician. We present the first two generative LLMs focused on Galician to bridge this gap. These models, freely available as open-source resources, were trained using a GPT architecture with 1.3B parameters on a corpus of 2.1B words. Leveraging continual pretraining, we adapt to Galician two existing LLMs trained on larger corpora, thus mitigating the data constraints that would arise if the training were performed from scratch. The models were evaluated using human judgments and task-based datasets from standardized benchmarks. These evaluations reveal a promising performance, underscoring the importance of linguistic diversity in generative models.

Title: Generative AI for Enhancing Active Learning in Education: A Comparative Study of GPT-3.5 and GPT-4 in Crafting Customized Test Questions

Authors: Hamdireza Rouzegar, Masoud Makrehchi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.13903
Pdf URL: https://arxiv.org/pdf/2406.13903
Copy Paste: [[2406.13903]] Generative AI for Enhancing Active Learning in Education: A Comparative Study of GPT-3.5 and GPT-4 in Crafting Customized Test Questions(https://arxiv.org/abs/2406.13903)
Keywords: gpt, llm
Abstract: This study investigates how LLMs, specifically GPT-3.5 and GPT-4, can develop tailored questions for Grade 9 math, aligning with active learning principles. By utilizing an iterative method, these models adjust questions based on difficulty and content, responding to feedback from a simulated 'student' model. A novel aspect of the research involved using GPT-4 as a 'teacher' to create complex questions, with GPT-3.5 as the 'student' responding to these challenges. This setup mirrors active learning, promoting deeper engagement. The findings demonstrate GPT-4's superior ability to generate precise, challenging questions and notable improvements in GPT-3.5's ability to handle more complex problems after receiving instruction from GPT-4. These results underscore the potential of LLMs to mimic and enhance active learning scenarios, offering a promising path for AI in customized education. This research contributes to understanding how AI can support personalized learning experiences, highlighting the need for further exploration in various educational contexts.

Title: Persuasiveness of Generated Free-Text Rationales in Subjective Decisions: A Case Study on Pairwise Argument Ranking

Authors: Mohamed Elaraby, Diane Litman, Xiang Lorraine Li, Ahmed Magooda
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.13905
Pdf URL: https://arxiv.org/pdf/2406.13905
Copy Paste: [[2406.13905]] Persuasiveness of Generated Free-Text Rationales in Subjective Decisions: A Case Study on Pairwise Argument Ranking(https://arxiv.org/abs/2406.13905)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Generating free-text rationales is among the emergent capabilities of Large Language Models (LLMs). These rationales have been found to enhance LLM performance across various NLP tasks. Recently, there has been growing interest in using these rationales to provide insights for various important downstream tasks. In this paper, we analyze generated free-text rationales in tasks with subjective answers, emphasizing the importance of rationalization in such scenarios. We focus on pairwise argument ranking, a highly subjective task with significant potential for real-world applications, such as debate assistance. We evaluate the persuasiveness of rationales generated by nine LLMs to support their subjective choices. Our findings suggest that open-source LLMs, particularly Llama2-70B-chat, are capable of providing highly persuasive rationalizations, surpassing even GPT models. Additionally, our experiments show that rationale persuasiveness can be improved by controlling its parameters through prompting or through self-refinement.
摘要：生成自由文本原理是大型语言模型 (LLM) 的新兴功能之一。这些原理已被发现可以提高 LLM 在各种 NLP 任务中的表现。最近，人们对使用这些原理为各种重要的下游任务提供见解的兴趣日益浓厚。在本文中，我们分析了在具有主观答案的任务中生成的自由文本原理，强调了在这种情况下合理化的重要性。我们专注于成对论证排名，这是一项高度主观的任务，具有巨大的实际应用潜力，例如辩论辅助。我们评估了九个 LLM 生成的理由以支持其主观选择的说服力。我们的研究结果表明，开源 LLM，尤其是 Llama2-70B-chat，能够提供极具说服力的合理化，甚至超越了 GPT 模型。此外，我们的实验表明，可以通过提示或自我完善来控制其参数，从而提高原理说服力。

Title: GenderAlign: An Alignment Dataset for Mitigating Gender Bias in Large Language Models

Authors: Tao Zhang, Ziqian Zeng, Yuxiang Xiao, Huiping Zhuang, Cen Chen, James Foulds, Shimei Pan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.13925
Pdf URL: https://arxiv.org/pdf/2406.13925
Copy Paste: [[2406.13925]] GenderAlign: An Alignment Dataset for Mitigating Gender Bias in Large Language Models(https://arxiv.org/abs/2406.13925)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are prone to generating content that exhibits gender biases, raising significant ethical concerns. Alignment, the process of fine-tuning LLMs to better align with desired behaviors, is recognized as an effective approach to mitigate gender biases. Although proprietary LLMs have made significant strides in mitigating gender bias, their alignment datasets are not publicly available. The commonly used and publicly available alignment dataset, HH-RLHF, still exhibits gender bias to some extent. There is a lack of publicly available alignment datasets specifically designed to address gender bias. Hence, we developed a new dataset named GenderAlign, aiming at mitigating a comprehensive set of gender biases in LLMs. This dataset comprises 8k single-turn dialogues, each paired with a "chosen" and a "rejected" response. Compared to the "rejected" responses, the "chosen" responses demonstrate lower levels of gender bias and higher quality. Furthermore, we categorized the gender biases in the "rejected" responses of GenderAlign into 4 principal categories. The experimental results show the effectiveness of GenderAlign in reducing gender bias in LLMs.
摘要：大型语言模型 (LLM) 容易生成表现出性别偏见的内容，从而引发重大的道德问题。对齐，即对 LLM 进行微调以更好地与期望行为保持一致的过程，被认为是减轻性别偏见的有效方法。尽管专有 LLM 在减轻性别偏见方面取得了重大进展，但它们的对齐数据集并未公开。常用且公开可用的对齐数据集 HH-RLHF 仍然在一定程度上表现出性别偏见。缺乏专门设计用于解决性别偏见的公开对齐数据集。因此，我们开发了一个名为 GenderAlign 的新数据集，旨在减轻 LLM 中的一系列性别偏见。该数据集包含 8k 个单轮对话，每个对话都与“选择”和“拒绝”响应配对。与“拒绝”响应相比，“选择”响应表现出较低程度的性别偏见和更高的质量。此外，我们将 GenderAlign 的“拒绝”响应中的性别偏见分为 4 个主要类别。实验结果证明了 GenderAlign 在减少 LLM 中的性别偏见方面的有效性。
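The dataset layout described above (a single-turn dialogue paired with a "chosen" and a "rejected" response) can be pictured as a small record type. This is a minimal sketch under assumptions: `PreferencePair`, `to_training_record`, the field names, and the example text are hypothetical and not taken from GenderAlign itself.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PreferencePair:
    """One single-turn dialogue with a preferred and a dispreferred reply."""
    prompt: str
    chosen: str    # lower-bias, higher-quality response
    rejected: str  # response exhibiting gender bias


def to_training_record(pair):
    """Convert a pair into a plain dict with prompt/chosen/rejected keys.

    The key names are an assumption; real pipelines may expect others.
    """
    if pair.chosen == pair.rejected:
        raise ValueError("chosen and rejected responses must differ")
    return asdict(pair)


# Hypothetical example text, written for this sketch; not from GenderAlign.
pair = PreferencePair(
    prompt="Who makes a better nurse?",
    chosen="Nursing skill depends on training and empathy, not gender.",
    rejected="Women, because caring comes naturally to them.",
)
record = to_training_record(pair)
print(sorted(record))
```

Keeping the two responses side by side in one record is what lets a preference-based alignment method compare them directly.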

Title: Large Language Models are Skeptics: False Negative Problem of Input-conflicting Hallucination

Authors: Jongyoon Song, Sangwon Yu, Sungroh Yoon
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.13929
Pdf URL: https://arxiv.org/pdf/2406.13929
Copy Paste: [[2406.13929]] Large Language Models are Skeptics: False Negative Problem of Input-conflicting Hallucination(https://arxiv.org/abs/2406.13929)
Keywords: language model, llm, hallucination
Abstract: In this paper, we identify a new category of bias that induces input-conflicting hallucinations, where large language models (LLMs) generate responses inconsistent with the content of the input context. This issue, which we term the false negative problem, refers to the phenomenon where LLMs are predisposed to return negative judgments when assessing the correctness of a statement given the context. In experiments involving pairs of statements that contain the same information but have contradictory factual directions, we observe that LLMs exhibit a bias toward false negatives. Specifically, the model presents greater overconfidence when responding with False. Furthermore, we analyze the relationship between the false negative problem and context and query rewriting and observe that both effectively tackle false negatives in LLMs.
摘要：在本文中，我们确定了一种新的偏见，这种偏见会导致输入冲突的幻觉，即大型语言模型 (LLM) 生成的响应与输入上下文的内容不一致。我们将这个问题称为假阴性问题，指的是 LLM 在评估给定上下文的陈述的正确性时倾向于返回负面判断的现象。在涉及包含相同信息但具有矛盾事实方向的陈述对的实验中，我们观察到 LLM 表现出对假阴性的偏见。具体而言，当用 False 回答时，该模型表现出更大的过度自信。此外，我们分析了假阴性问题与上下文和查询重写之间的关系，并观察到两者都有效地解决了 LLM 中的假阴性问题。
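The bias described above can be made concrete by comparing two error rates over labeled statements. The sketch below is illustrative only: the abstract does not state the paper's metrics, so `error_rates`, the record layout, and the toy data are all assumptions.

```python
def error_rates(records):
    """Compute (false_negative_rate, false_positive_rate).

    `records` is a list of (gold, predicted) booleans, where `gold` says
    whether the statement is actually supported by the context and
    `predicted` is the model's True/False judgment. Hypothetical layout.
    """
    true_total = sum(1 for gold, _ in records if gold)
    false_total = sum(1 for gold, _ in records if not gold)
    false_negatives = sum(1 for gold, pred in records if gold and not pred)
    false_positives = sum(1 for gold, pred in records if not gold and pred)
    # Guard against empty classes to avoid division by zero.
    fnr = false_negatives / true_total if true_total else 0.0
    fpr = false_positives / false_total if false_total else 0.0
    return fnr, fpr


# Toy data (made up for illustration): the model rejects half of the
# correct statements but never accepts an incorrect one.
records = [
    (True, True), (True, False), (True, True), (True, False),
    (False, False), (False, False), (False, False), (False, False),
]
fnr, fpr = error_rates(records)
print(f"false negative rate: {fnr:.2f}, false positive rate: {fpr:.2f}")
```

A false negative rate well above the false positive rate on balanced pairs would be one sign of the skew toward "False" that the abstract reports.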

Title: Reasoning Like a Doctor: Improving Medical Dialogue Systems via Diagnostic Reasoning Process Alignment

Authors: Kaishuai Xu, Yi Cheng, Wenjun Hou, Qiaoyu Tan, Wenjie Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.13934
Pdf URL: https://arxiv.org/pdf/2406.13934
Copy Paste: [[2406.13934]] Reasoning Like a Doctor: Improving Medical Dialogue Systems via Diagnostic Reasoning Process Alignment(https://arxiv.org/abs/2406.13934)
Keywords: language model
Abstract: Medical dialogue systems have attracted significant attention for their potential to act as medical assistants. Enabling these medical systems to emulate clinicians' diagnostic reasoning process has been a long-standing research focus. Previous studies achieved a rudimentary simulation of clinicians' diagnostic process by fine-tuning language models on high-quality dialogue datasets. Nonetheless, they focus too heavily on the outcomes of the clinician's reasoning process while ignoring clinicians' internal thought processes and alignment with clinician preferences. Our work aims to build a medical dialogue system that aligns with clinicians' diagnostic reasoning processes. We propose a novel framework, Emulation, designed to generate an appropriate response that relies on abductive and deductive diagnostic reasoning analyses and aligns with clinician preferences through thought process modeling. Experimental results on two datasets confirm the efficacy of Emulation. Crucially, our framework furnishes clear explanations for the generated responses, enhancing its transparency in medical consultations.
摘要：医疗对话系统因其作为医疗助手的潜力而备受关注。使这些医疗系统能够模拟临床医生的诊断推理过程一直是研究的重点。先前的研究通过在高质量对话数据集上微调语言模型，初步实现了对临床医生诊断过程的模拟。尽管如此，他们过于关注临床医生推理过程的结果，而忽略了他们的内部思维过程和与临床医生偏好的一致性。我们的工作旨在建立一个与临床医生诊断推理过程相一致的医疗对话系统。我们提出了一个新颖的框架 Emulation，旨在生成依赖于溯因和演绎诊断推理分析的适当反应，并通过思维过程建模与临床医生的偏好保持一致。两个数据集的实验结果证实了 Emulation 的有效性。至关重要的是，我们的框架为生成的响应提供了清晰的解释，提高了其在医疗咨询中的透明度。

Title: AutoCAP: Towards Automatic Cross-lingual Alignment Planning for Zero-shot Chain-of-Thought

Authors: Yongheng Zhang, Qiguang Chen, Min Li, Wanxiang Che, Libo Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.13940
Pdf URL: https://arxiv.org/pdf/2406.13940
Copy Paste: [[2406.13940]] AutoCAP: Towards Automatic Cross-lingual Alignment Planning for Zero-shot Chain-of-Thought(https://arxiv.org/abs/2406.13940)
Keywords: llm, prompt, chain-of-thought
Abstract: Cross-lingual chain-of-thought can effectively complete reasoning tasks across languages and has gained increasing attention. Recently, dominant approaches in the literature improve cross-lingual alignment capabilities by integrating reasoning knowledge from different languages. Despite achieving excellent performance, current methods still face two main challenges: (1) Manual language specification: They still rely heavily on manually selecting the languages to integrate, severely affecting their generalizability; (2) Static weight allocation: Current methods simply integrate all languages equally. In fact, different language reasoning paths should have different weights to achieve better complementation and integration. Motivated by this, we introduce Automatic Cross-lingual Alignment Planning (AutoCAP) for zero-shot chain-of-thought to address the above challenges. The core of AutoCAP consists of two components: (1) Automatic Language Selection Prompting to guide LLMs to select appropriate languages and (2) Automatic Weight Allocation Prompting to automatically allocate alignment weight scores to each reasoning path. Extensive experiments on several benchmarks reveal that AutoCAP achieves state-of-the-art performance, surpassing previous methods that required manual effort.
摘要：跨语言思路链能有效完成跨语言的推理任务，受到越来越多的关注。最近，文献中的主流方法通过整合来自不同语言的推理知识来提高跨语言对齐能力。尽管取得了优异的性能，但当前方法仍然面临两个主要挑战：（1）手动语言指定：它们仍然高度依赖于手动选择要集成的语言，严重影响了它们的通用性；（2）静态权重分配：当前方法只是平等地整合所有语言。事实上，不同的语言推理路径应该具有不同的权重，以实现更好的互补和集成。受此启发，我们引入了一种用于零样本思路链的自动跨语言对齐规划（AutoCAP）来解决上述挑战。AutoCAP 的核心由两个部分组成：（1）自动语言选择提示，用于指导 LLM 选择合适的语言；（2）自动权重分配提示，用于自动为每条推理路径分配对齐权重分数。在多个基准测试上的大量实验表明，AutoCAP 实现了最先进的性能，超越了以前需要手动操作的方法。
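The contrast between equal integration and per-path weights can be illustrated with a weighted vote over the answers produced in each language. This is only a sketch under assumptions: the abstract does not say how AutoCAP combines weighted paths, so weighted voting, the function `weighted_vote`, and the sample languages, answers, and weights are all hypothetical.

```python
from collections import defaultdict


def weighted_vote(paths):
    """Return the answer with the largest total weight.

    `paths` is a list of (language, answer, weight) tuples. Ties go to the
    answer seen first. An empty list raises ValueError (from max()).
    """
    totals = defaultdict(float)
    for _language, answer, weight in paths:
        totals[answer] += weight
    # max() over a dict walks keys in insertion order and keeps the first
    # maximal one, which gives the tie-breaking rule described above.
    return max(totals, key=totals.get)


# Made-up reasoning-path results: two low-weight paths agree on "41",
# one high-weight path says "42".
paths = [
    ("en", "42", 0.9),
    ("de", "41", 0.3),
    ("zh", "41", 0.3),
]
equal = [(lang, ans, 1.0) for lang, ans, _ in paths]

print(weighted_vote(equal))  # equal weights: simple majority
print(weighted_vote(paths))  # allocated weights can overturn the majority
```

With equal weights the two-path majority wins, while the allocated weights let a single well-aligned path prevail, which is the behavior static allocation cannot express.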

Title: Evolving to be Your Soulmate: Personalized Dialogue Agents with Dynamically Adapted Personas

Authors: Yi Cheng, Wenge Liu, Kaishuai Xu, Wenjun Hou, Yi Ouyang, Chak Tou Leong, Xian Wu, Yefeng Zheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.13960
Pdf URL: https://arxiv.org/pdf/2406.13960
Copy Paste: [[2406.13960]] Evolving to be Your Soulmate: Personalized Dialogue Agents with Dynamically Adapted Personas(https://arxiv.org/abs/2406.13960)
Keywords: agent
Abstract: Previous research on persona-based dialogue agents typically presets the agent's persona before deployment, and the persona remains static thereafter. In this paper, we take a step further and explore a new paradigm called Self-evolving Personalized Dialogue Agents (SPDA), where the agent continuously evolves during the conversation to better align with the user's anticipation by dynamically adapting its persona. This paradigm could enable better personalization for each user, but it also introduces unique challenges, which mainly lie in the process of persona adaptation. Two key issues are how to achieve persona alignment with the user and how to ensure a smooth transition in the adaptation process. To address them, we propose a novel framework that refines the persona at hierarchical levels to progressively align better with the user in a controllable way. Experiments show that integrating the personas adapted by our framework consistently enhances personalization and overall dialogue performance across various base systems.
摘要：先前对基于角色的对话代理的研究通常在部署之前预设代理的角色，此后该角色保持不变。在本文中，我们更进一步，探索一种称为自进化个性化对话代理 (SPDA) 的新范式，其中代理在对话过程中不断进化，通过动态调整其角色来更好地符合用户的预期。这种范式可以为每个用户提供更好的个性化，但也带来了独特的挑战，这主要在于角色适应的过程。两个关键问题包括如何实现与用户的角色一致以及如何确保适应过程中的平稳过渡。为了解决这些问题，我们提出了一个新颖的框架，该框架在层次结构上细化角色，以可控的方式逐步更好地与用户保持一致。实验表明，集成我们框架所适应的角色可以持续增强各种基础系统的个性化和整体对话性能。

Title: MR-BEN: A Comprehensive Meta-Reasoning Benchmark for Large Language Models

Authors: Zhongshen Zeng, Yinhong Liu, Yingjia Wan, Jingyao Li, Pengguang Chen, Jianbo Dai, Yuxuan Yao, Rongwu Xu, Zehan Qi, Wanru Zhao, Linling Shen, Jianqiao Lu, Haochen Tan, Yukang Chen, Hao Zhang, Zhan Shi, Bailin Wang, Zhijiang Guo, Jiaya Jia
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.13975
Pdf URL: https://arxiv.org/pdf/2406.13975
Copy Paste: [[2406.13975]] MR-BEN: A Comprehensive Meta-Reasoning Benchmark for Large Language Models(https://arxiv.org/abs/2406.13975)
Keywords: language model, gpt, llm, chain-of-thought
Abstract: Large language models (LLMs) have shown increasing capability in problem-solving and decision-making, largely based on step-by-step chain-of-thought reasoning processes. However, it has become increasingly challenging to evaluate the reasoning capability of LLMs. Concretely, existing outcome-based benchmarks are beginning to saturate and are becoming insufficient for monitoring progress. To this end, we present a process-based benchmark, MR-BEN, that demands a meta-reasoning skill, where LMs are asked to locate and analyse potential errors in automatically generated reasoning steps. MR-BEN is a comprehensive benchmark comprising 5,975 questions collected from human experts, covering various subjects such as physics, chemistry, logic, coding, and more. Through our designed metrics for assessing meta-reasoning on this benchmark, we identify interesting limitations and weaknesses of current LLMs (open-source and closed-source models). For example, open-source models are seemingly comparable to GPT-4 on outcome-based benchmarks, but they lag far behind on our benchmark, revealing the underlying reasoning capability gap between them. Our dataset and code are available at this https URL.
摘要：大型语言模型 (LLM) 表现出越来越强的解决问题和决策能力，这主要基于循序渐进的思路链推理过程。然而，评估 LLM 的推理能力变得越来越具有挑战性。具体而言，现有的基于结果的基准开始饱和，变得不足以监控进度。为此，我们提出了一个基于过程的基准 MR-BEN，它需要元推理技能，其中要求 LM 定位和分析自动生成的推理步骤中的潜在错误。MR-BEN 是一个综合基准，包含从人类专家收集的 5,975 个问题，涵盖物理、化学、逻辑、编码等各种学科。通过我们设计的用于评估此基准上的元推理的指标，我们发现了当前 LLM（开源和闭源模型）有趣的局限性和弱点。例如，开源模型在基于结果的基准上似乎与 GPT-4 相当，但它们在我们的基准上远远落后，揭示了它们之间潜在的推理能力差距。我们的数据集和代码可在此 https URL 上获取。

Title: Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation

Authors: Qin Zhu, Qingyuan Cheng, Runyu Peng, Xiaonan Li, Tengxiao Liu, Ru Peng, Xipeng Qiu, Xuanjing Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.13990
Pdf URL: https://arxiv.org/pdf/2406.13990
Copy Paste: [[2406.13990]] Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation(https://arxiv.org/abs/2406.13990)
Keywords: language model, llm
Abstract: The training process of large language models (LLMs) often involves varying degrees of test data contamination. Although current LLMs are achieving increasingly better performance on various benchmarks, their performance in practical applications does not always match their benchmark results. Leakage of benchmarks can prevent the accurate assessment of LLMs' true performance. However, constructing new benchmarks is costly, labor-intensive and still carries the risk of leakage. Therefore, in this paper, we ask: can we reuse these leaked benchmarks for LLM evaluation? We propose Inference-Time Decontamination (ITD) to address this issue by detecting and rewriting leaked samples without altering their difficulty. ITD can mitigate performance inflation caused by memorizing leaked benchmarks. Our proof-of-concept experiments demonstrate that ITD reduces inflated accuracy by 22.9% on GSM8K and 19.0% on MMLU. On MMLU, using Inference-Time Decontamination can lead to a decrease in the results of Phi3 and Mistral by 6.7% and 3.6% respectively. We hope that ITD can provide more truthful evaluation results for large language models.
摘要：大型语言模型 (LLM) 的训练过程经常涉及不同程度的测试数据污染。尽管目前的 LLM 在各种基准测试上的表现越来越好，但它们在实际应用中的表现并不总是与基准测试结果相匹配。基准测试的泄漏会妨碍对 LLM 真实性能的准确评估。然而，构建新的基准测试成本高、劳动密集，并且仍然存在泄漏的风险。因此，在本文中，我们提出了一个问题，我们是否可以重用这些泄漏的基准测试进行 LLM 评估？我们提出了推理时间去污 (ITD) 来解决这个问题，它在不改变难度的情况下检测和重写泄漏的样本。ITD 可以减轻因记忆泄漏的基准测试而导致的性能膨胀。我们的概念验证实验表明，ITD 在 GSM8K 上将膨胀的准确率降低了 22.9%，在 MMLU 上将膨胀的准确率降低了 19.0%。在 MMLU 上，使用推理时间去污会导致 Phi3 和 Mistral 的结果分别下降 6.7% 和 3.6%。我们希望ITD能够为大型语言模型提供更加真实的评估结果。

Title: Exploring Changes in Nation Perception with Nationality-Assigned Personas in LLMs

Authors: Mahammed Kamruzzaman, Gene Louis Kim
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2406.13993
Pdf URL: https://arxiv.org/pdf/2406.13993
Copy Paste: [[2406.13993]] Exploring Changes in Nation Perception with Nationality-Assigned Personas in LLMs(https://arxiv.org/abs/2406.13993)
Keywords: llm
Abstract: Persona assignment has become a common strategy for customizing LLM use to particular tasks and contexts. In this study, we explore how perceptions of different nations change when LLMs are assigned specific nationality personas. We assign 193 different nationality personas (e.g., an American person) to four LLMs and examine how the LLM perceptions of countries change. We find that all LLM-persona combinations tend to favor Western European nations, though nation-personas push LLM behaviors to focus more on and view more favorably the nation-persona's own region. Eastern European, Latin American, and African nations are viewed more negatively by different nationality personas. Our study provides insight into how biases and stereotypes are realized within LLMs when adopting different national personas. In line with the "Blueprint for an AI Bill of Rights", our findings underscore the critical need for developing mechanisms to ensure LLMs uphold fairness and not over-generalize at a global scale.
摘要：角色分配已成为一种常见的策略，用于根据特定任务和环境定制 LLM 的使用。在本研究中，我们探讨了当 LLM 被分配特定国籍角色时，对不同国家的看法如何变化。我们为四个 LLM 分配了 193 个不同国籍的角色（例如，一个美国人），并研究了 LLM 对国家的看法如何变化。我们发现，所有 LLM 角色组合都倾向于偏爱西欧国家，尽管国家角色促使 LLM 行为更加关注和更有利地看待国家角色自己的地区。不同国籍的人对东欧、拉丁美洲和非洲国家的看法更为负面。我们的研究深入了解了在采用不同国家角色时 LLM 内部如何实现偏见和刻板印象。根据“人工智能权利法案蓝图”，我们的研究结果强调了制定机制以确保 LLM 在全球范围内维护公平性而不是过度概括的迫切需要。

Title: "Global is Good, Local is Bad?": Understanding Brand Bias in LLMs

Authors: Mahammed Kamruzzaman, Hieu Minh Nguyen, Gene Louis Kim
Subjects: cs.CL, cs.CE
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] "Global is Good, Local is Bad?": Understanding Brand Bias in LLMs(https://arxiv.org/abs/)
Keywords: llm
Abstract: Many recent studies have investigated social biases in LLMs but brand bias has received little attention. This research examines the biases exhibited by LLMs towards different brands, a significant concern given the widespread use of LLMs in affected use cases such as product recommendation and market analysis. Biased models may perpetuate societal inequalities, unfairly favoring established global brands while marginalizing local ones. Using a curated dataset across four brand categories, we probe the behavior of LLMs in this space. We find a consistent pattern of bias in this space -- both in terms of disproportionately associating global brands with positive attributes and disproportionately recommending luxury gifts for individuals in high-income countries. We also find LLMs are subject to country-of-origin effects which may boost local brand preference in LLM outputs in specific contexts.
摘要：最近，许多研究都调查了 LLM 中的社会偏见，但品牌偏见却很少受到关注。这项研究考察了 LLM 对不同品牌表现出的偏见，这是一个值得关注的重要问题，因为 LLM 在产品推荐和市场分析等受影响的用例中得到了广泛的应用。有偏见的模型可能会加剧社会不平等，不公平地偏袒成熟的全球品牌，而边缘化本地品牌。使用四个品牌类别的精选数据集，我们探究了 LLM 在这个领域的行为。我们发现这个领域存在着一致的偏见模式：既不成比例地将全球品牌与积极的属性联系起来，也不成比例地向高收入国家的个人推荐奢侈礼品。我们还发现 LLM 受到原产国效应的影响，这可能会在特定情况下增强 LLM 输出中的本地品牌偏好。

Title: Information Guided Regularization for Fine-tuning Language Models

Authors: Mandar Sharma, Nikhil Muralidhar, Shengzhe Xu, Raquib Bin Yosuf, Naren Ramakrishnan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.14005
Pdf URL: https://arxiv.org/pdf/2406.14005
Copy Paste: [[2406.14005]] Information Guided Regularization for Fine-tuning Language Models(https://arxiv.org/abs/2406.14005)
Keywords: language model
Abstract: The pretraining-fine-tuning paradigm has been the de facto strategy for transfer learning in modern language modeling. With the understanding that task adaptation in LMs is often a function of parameters shared across tasks, we argue that a more surgical approach to regularization needs to exist for smoother transfer learning. Towards this end, we investigate how the pretraining loss landscape is affected by these task-sensitive parameters through an information-theoretic lens. We then leverage the findings from our investigations to devise a novel approach to dropout for improved model regularization and better downstream generalization. This approach, named guided dropout, is both task & architecture agnostic and adds no computational overhead to the fine-tuning process. Through empirical evaluations, we showcase that our approach to regularization yields consistently better performance, even in scenarios of data paucity, compared to standardized baselines.
摘要：预训练-微调范式已成为现代语言建模中迁移学习的事实上的策略。考虑到语言模型中的任务适应性通常是跨任务共享参数的函数，我们认为需要一种更精准的正则化方法来实现更顺畅的迁移学习。为此，我们从信息论的角度研究了这些任务敏感参数如何影响预训练损失状况。然后，我们利用调查结果设计了一种新颖的 dropout 方法，以改进模型正则化和更好的下游泛化。这种方法称为引导式 dropout，与任务和架构无关，并且不会给微调过程增加任何计算开销。通过实证评估，我们展示了与标准化基线相比，我们的正则化方法即使在数据稀缺的情况下也能始终获得更好的性能。

Title: Seeing Through AI's Lens: Enhancing Human Skepticism Towards LLM-Generated Fake News

Authors: Navid Ayoobi, Sadat Shahriar, Arjun Mukherjee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.14012
Pdf URL: https://arxiv.org/pdf/2406.14012
Copy Paste: [[2406.14012]] Seeing Through AI's Lens: Enhancing Human Skepticism Towards LLM-Generated Fake News(https://arxiv.org/abs/2406.14012)
Keywords: llm
Abstract: LLMs offer valuable capabilities, yet they can be utilized by malicious users to disseminate deceptive information and generate fake news. The growing prevalence of LLMs poses difficulties in crafting detection approaches that remain effective across various text domains. Additionally, the absence of precautionary measures for AI-generated news on online social platforms is concerning. Therefore, there is an urgent need to improve people's ability to differentiate between news articles written by humans and those produced by LLMs. By providing cues in human-written and LLM-generated news, we can help individuals increase their skepticism towards fake LLM-generated news. This paper aims to elucidate simple markers that help individuals distinguish between articles penned by humans and those created by LLMs. To achieve this, we first collected a dataset comprising 39k news articles authored by humans or generated by four distinct LLMs with varying degrees of fakeness. We then devised a metric named Entropy-Shift Authorship Signature (ESAS) based on information theory and entropy principles. The proposed ESAS ranks terms or entities, like POS tagging, within news articles based on their relevance in discerning article authorship. We demonstrate the effectiveness of our metric by showing the high accuracy attained by a basic method, i.e., TF-IDF combined with a logistic regression classifier, using a small set of terms with the highest ESAS scores. Consequently, we introduce and scrutinize these top ESAS-ranked terms to aid individuals in strengthening their skepticism towards LLM-generated fake news.
摘要：LLM 提供了宝贵的功能，但恶意用户可以利用它们传播欺骗性信息并生成虚假新闻。LLM 的日益普及给设计在各种文本领域都有效的检测方法带来了困难。此外，在线社交平台上缺乏针对人工智能生成的新闻的预防措施令人担忧。因此，迫切需要提高人们区分人类撰写的新闻文章和 LLM 生成的新闻文章的能力。通过在人类撰写的新闻和 LLM 生成的新闻中提供线索，我们可以帮助个人增加对虚假 LLM 生成的新闻的怀疑。本文旨在阐明简单的标记，帮助个人区分人类撰写的文章和 LLM 创建的文章。为此，我们最初收集了一个数据集，其中包含 39,000 篇由人类撰写或由四个具有不同程度虚假性的 LLM 生成的新闻文章。然后，我们根据信息论和熵原理设计了一个名为熵移位作者签名 (ESAS) 的指标。提议的 ESAS 根据新闻文章中的术语或实体（如 POS 标记）在辨别文章作者方面的相关性对其进行排名。我们通过展示基本方法（即 TF-IDF 与逻辑回归分类器相结合）获得的高精度来证明我们指标的有效性，该方法使用一小部分具有最高 ESAS 分数的术语。因此，我们引入并仔细检查这些 ESAS 排名靠前的术语，以帮助个人加强对 LLM 生成的虚假新闻的怀疑。

Title: HIGHT: Hierarchical Graph Tokenization for Graph-Language Alignment

Authors: Yongqiang Chen, Quanming Yao, Juzheng Zhang, James Cheng, Yatao Bian
Subjects: cs.CL, cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2406.14021
Pdf URL: https://arxiv.org/pdf/2406.14021
Copy Paste: [[2406.14021]] HIGHT: Hierarchical Graph Tokenization for Graph-Language Alignment(https://arxiv.org/abs/2406.14021)
Keywords: language model, llm, hallucination
Abstract: Recently there has been a surge of interest in extending the success of large language models (LLMs) to graph modalities, such as social networks and molecules. As LLMs are predominantly trained with 1D text data, most existing approaches adopt a graph neural network to represent a graph as a series of node tokens and feed these tokens to LLMs for graph-language alignment. Despite achieving some successes, existing approaches have overlooked the hierarchical structures that are inherent in graph data. In molecular graphs in particular, the high-order structural information contains rich semantics of molecular functional groups, which encode crucial biochemical functionalities of the molecules. We establish a simple benchmark showing that neglecting the hierarchical information in graph tokenization leads to subpar graph-language alignment and severe hallucination in generated outputs. To address this problem, we propose a novel strategy called HIerarchical GrapH Tokenization (HIGHT). HIGHT employs a hierarchical graph tokenizer that extracts and encodes the hierarchy of node, motif, and graph levels of informative tokens to improve the graph perception of LLMs. HIGHT also adopts an augmented graph-language supervised fine-tuning dataset, enriched with the hierarchical graph information, to further enhance the graph-language alignment. Extensive experiments on 7 molecule-centric benchmarks confirm the effectiveness of HIGHT in reducing hallucination by 40%, as well as significant improvements in various molecule-language downstream tasks.
摘要：最近，人们对将大型语言模型 (LLM) 的成功扩展到图形模态（例如社交网络和分子）的兴趣激增。由于 LLM 主要使用一维文本数据进行训练，因此大多数现有方法采用图神经网络将图表示为一系列节点标记，并将这些标记提供给 LLM 进行图形语言对齐。尽管取得了一些成功，但现有方法忽略了图形数据固有的层次结构。特别是在分子图中，高阶结构信息包含分子功能组的丰富语义，这些功能组编码了分子的关键生化功能。我们建立了一个简单的基准，表明忽略图形标记化中的层次信息将导致图形语言对齐效果不佳和生成的输出中出现严重的幻觉。为了解决这个问题，我们提出了一种称为分层图形标记化 (HIGHT) 的新策略。 HIGHT 采用分层图形标记器，提取并编码信息标记的节点、主题和图形级别的层次结构，以改善 LLM 的图形感知。HIGHT 还采用增强的图形语言监督微调数据集，丰富了分层图形信息，以进一步增强图形语言对齐。在 7 个以分子为中心的基准上进行的大量实验证实了 HIGHT 在减少 40% 幻觉方面的有效性，以及各种分子语言下游任务的显著改进。

Title: Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective

Authors: Yuchen Wen, Keping Bi, Wei Chen, Jiafeng Guo, Xueqi Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.14023
Pdf URL: https://arxiv.org/pdf/2406.14023
Copy Paste: [[2406.14023]] Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective(https://arxiv.org/abs/2406.14023)
Keywords: language model, gpt, llm, prompt
Abstract: As Large Language Models (LLMs) become an important means of seeking information, there have been increasing concerns about the unethical content LLMs may generate. In this paper, we conduct a rigorous evaluation of LLMs' implicit bias towards certain groups by attacking them with carefully crafted instructions to elicit biased responses. Our attack methodology is inspired by psychometric principles in cognitive and social psychology. We propose three attack approaches, i.e., Disguise, Deception, and Teaching, based on which we built evaluation datasets for four common bias types. Each prompt attack has bilingual versions. Extensive evaluation of representative LLMs shows that 1) all three attack methods work effectively, especially the Deception attacks; 2) GLM-3 performs the best in defending against our attacks, compared to GPT-3.5 and GPT-4; 3) LLMs could output content of other bias types when being taught with one type of bias. Our methodology provides a rigorous and effective way of evaluating LLMs' implicit bias and will benefit the assessments of LLMs' potential ethical risks.
摘要：随着大型语言模型 (LLM) 成为一种重要的信息搜索方式，人们越来越担心 LLM 可能产生的不道德内容。在本文中，我们通过用精心设计的指令攻击某些群体以引发偏见反应，对 LLM 对某些群体的隐性偏见进行了严格的评估。我们的攻击方法受到认知和社会心理学中的心理测量原理的启发。我们提出了三种攻击方法，即伪装、欺骗和教学，并在此基础上为四种常见的偏见类型构建了评估数据集。每种提示攻击都有双语版本。对代表性 LLM 的广泛评估表明：1）所有三种攻击方法都有效，尤其是欺骗攻击；2）与 GPT-3.5 和 GPT-4 相比，GLM-3 在防御我们的攻击方面表现最佳；3）当用一种偏见类型进行教学时，LLM 可能会输出其他偏见类型的内容。我们的方法提供了一种严格有效的评估 LLM 隐性偏见的方法，并将有利于评估 LLM 的潜在道德风险。

Title: Prompt Injection Attacks in Defended Systems

Authors: Daniil Khomsky, Narek Maloyan, Bulat Nutfullin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14048
Pdf URL: https://arxiv.org/pdf/2406.14048
Copy Paste: [[2406.14048]] Prompt Injection Attacks in Defended Systems(https://arxiv.org/abs/2406.14048)
Keywords: language model, prompt
Abstract: Large language models play a crucial role in modern natural language processing technologies. However, their extensive use also introduces potential security risks, such as the possibility of black-box attacks. These attacks can embed hidden malicious features into the model, leading to adverse consequences during its deployment. This paper investigates methods for black-box attacks on large language models with a three-tiered defense mechanism. It analyzes the challenges and significance of these attacks, highlighting their potential implications for language processing system security. Existing attack and defense methods are examined, evaluating their effectiveness and applicability across various scenarios. Special attention is given to the detection algorithm for black-box attacks, identifying hazardous vulnerabilities in language models and retrieving sensitive information. This research presents a methodology for vulnerability detection and the development of defensive strategies against black-box attacks on large language models.
摘要：大型语言模型在现代自然语言处理技术中发挥着至关重要的作用。然而，它们的广泛使用也带来了潜在的安全风险，例如黑盒攻击的可能性。这些攻击可以将隐藏的恶意特性嵌入到模型中，从而导致模型部署期间产生不良后果。本文研究了针对大型语言模型的黑盒攻击方法，并提出了三层防御机制。它分析了这些攻击的挑战和意义，强调了它们对语言处理系统安全的潜在影响。研究了现有的攻击和防御方法，评估了它们在各种场景中的有效性和适用性。特别关注黑盒攻击的检测算法，识别语言模型中的危险漏洞并检索敏感信息。本研究提出了一种漏洞检测方法和针对大型语言模型黑盒攻击的防御策略的开发方法。

Title: How Many Parameters Does it Take to Change a Light Bulb? Evaluating Performance in Self-Play of Conversational Games as a Function of Model Characteristics

Authors: Nidhir Bhavsar, Jonathan Jordan, Sherzod Hakimov, David Schlangen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.14051
Pdf URL: https://arxiv.org/pdf/2406.14051
Copy Paste: [[2406.14051]] How Many Parameters Does it Take to Change a Light Bulb? Evaluating Performance in Self-Play of Conversational Games as a Function of Model Characteristics(https://arxiv.org/abs/2406.14051)
Keywords: language model, llm, agent
Abstract: What makes a good Large Language Model (LLM)? That it performs well on the relevant benchmarks -- which hopefully measure, with some validity, the presence of capabilities that are also challenged in real application. But what makes the model perform well? What gives a model its abilities? We take a recently introduced type of benchmark that is meant to challenge capabilities in a goal-directed, agentive context through self-play of conversational games, and analyse how performance develops as a function of model characteristics like number of parameters, or type of training. We find that while there is a clear relationship between number of parameters and performance, there is still a wide spread of performance points within a given size bracket, which is to be accounted for by training parameters such as fine-tuning data quality and method. From a more practical angle, we also find a certain degree of unpredictability about performance across access methods, possibly due to unexposed sampling parameters, and a very welcome performance stability against at least moderate weight quantisation during inference.
摘要：什么才是好的大型语言模型 (LLM)？它在相关基准上表现良好——希望这些基准能够在一定程度上衡量实际应用中同样受到挑战的能力。但是，是什么让模型表现良好？是什么赋予了模型这种能力？我们采用最近引入的一种基准，该基准旨在通过对话游戏的自我对战来挑战目标导向、代理环境中的能力，并分析性能如何随着模型特征（如参数数量或训练类型）的发展而发展。我们发现，虽然参数数量和性能之间存在明显的关系，但在给定的大小范围内，性能点仍然分布广泛，这可以通过训练参数（如微调数据质量和方法）来解释。从更实际的角度来看，我们还发现跨访问方法的性能存在一定程度的不可预测性，这可能是由于未公开的采样参数造成的，并且非常受欢迎的是，在推理过程中至少对中等权重量化具有性能稳定性。

Title: Protecting Privacy Through Approximating Optimal Parameters for Sequence Unlearning in Language Models

Authors: Dohyun Lee, Daniel Rim, Minseok Choi, Jaegul Choo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14091
Pdf URL: https://arxiv.org/pdf/2406.14091
Copy Paste: [[2406.14091]] Protecting Privacy Through Approximating Optimal Parameters for Sequence Unlearning in Language Models(https://arxiv.org/abs/2406.14091)
Keywords: language model
Abstract: Although language models (LMs) demonstrate exceptional capabilities on various tasks, they are potentially vulnerable to extraction attacks, which represent a significant privacy risk. To mitigate the privacy concerns of LMs, machine unlearning has emerged as an important research area, which is utilized to induce the LM to selectively forget about some of its training data. While completely retraining the model will guarantee successful unlearning and privacy assurance, it is impractical for LMs, as it would be time-consuming and resource-intensive. Prior works efficiently unlearn the target token sequences, but upon subsequent iterations, the LM displays significant degradation in performance. In this work, we propose Privacy Protection via Optimal Parameters (POP), a novel unlearning method that effectively forgets the target token sequences from the pretrained LM by applying optimal gradient updates to the parameters. Inspired by the gradient derivation of complete retraining, we approximate the optimal training objective that successfully unlearns the target sequence while retaining the knowledge from the rest of the training data. Experimental results demonstrate that POP exhibits remarkable retention performance post-unlearning across 9 classification and 4 dialogue benchmarks, outperforming the state-of-the-art by a large margin. Furthermore, we introduce Remnant Memorization Accuracy that quantifies privacy risks based on token likelihood and validate its effectiveness through both qualitative and quantitative analyses.
摘要：尽管语言模型 (LM) 在各种任务上都表现出卓越的能力，但它们可能容易受到提取攻击，这代表着重大的隐私风险。为了缓解 LM 的隐私问题，机器遗忘已成为一个重要的研究领域，它被用来诱导 LM 有选择地忘记一些训练数据。虽然完全重新训练模型将保证成功遗忘和隐私保障，但对于 LM 来说，这是不切实际的，因为它会耗费时间和资源。先前的工作能高效地遗忘目标标记序列，但在后续迭代中，LM 的性能显著下降。在这项工作中，我们提出了通过最佳参数 (POP) 进行隐私保护，这是一种新颖的遗忘方法，通过对参数应用最佳梯度更新，有效地使预训练 LM 忘记目标标记序列。受完全重新训练的梯度推导的启发，我们近似了成功忘记目标序列的最佳训练目标，同时保留了其余训练数据中的知识。实验结果表明，POP 在 9 个分类和 4 个对话基准上表现出卓越的遗忘后保留性能，远远超过最先进的技术。此外，我们引入了残余记忆准确度，根据标记可能性量化隐私风险，并通过定性和定量分析验证其有效性。

Title: Take the essence and discard the dross: A Rethinking on Data Selection for Fine-Tuning Large Language Models

Authors: Ziche Liu, Rui Ke, Feng Jiang, Haizhou Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14115
Pdf URL: https://arxiv.org/pdf/2406.14115
Copy Paste: [[2406.14115]] Take the essence and discard the dross: A Rethinking on Data Selection for Fine-Tuning Large Language Models(https://arxiv.org/abs/2406.14115)
Keywords: language model, llm
Abstract: Data selection for fine-tuning Large Language Models (LLMs) aims to select a high-quality subset from a given candidate dataset to train a Pending Fine-tune Model (PFM) into a Selective-Enhanced Model (SEM). It can improve the model performance and accelerate the training process. Although a few surveys have investigated related works of data selection, there is a lack of comprehensive comparison between existing methods due to their various experimental settings. To address this issue, we first propose a three-stage scheme for data selection and comprehensively review existing works according to this scheme. Then, we design a unified comparing method with ratio-based efficiency indicators and ranking-based feasibility indicators to overcome the difficulty of comparing various models with diverse experimental settings. After an in-depth comparative analysis, we find that the more targeted method with data-specific and model-specific quality labels has higher efficiency, but the introduction of additional noise information should be avoided when designing selection algorithms. Finally, we summarize the trends in data selection and highlight the short-term and long-term challenges to guide future research.
摘要：大型语言模型（LLM）微调的数据选择旨在从给定的候选数据集中选择一个高质量的子集，将待定微调模型（PFM）训练为选择性增强模型（SEM）。它可以提高模型性能并加速训练过程。尽管一些调查研究了数据选择的相关工作，但由于现有方法的实验设置各异，因此缺乏对现有方法的全面比较。为了解决这个问题，我们首先提出了一个三阶段的数据选择方案，并根据该方案全面回顾了现有的工作。然后，我们设计了一种统一的比较方法，该方法具有基于比率的效率指标和基于排名的可行性指标，以克服比较具有不同实验设置的各种模型的困难。经过深入的比较分析，我们发现具有特定于数据和特定于模型的质量标签的更有针对性的方法效率更高，但在设计选择算法时应避免引入额外的噪声信息。最后，我们总结了数据选择的趋势，并强调了短期和长期的挑战，以指导未来的研究。

Title: MACAROON: Training Vision-Language Models To Be Your Engaged Partners

Authors: Shujin Wu, Yi R. Fung, Sha Li, Yixin Wan, Kai-Wei Chang, Heng Ji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14137
Pdf URL: https://arxiv.org/pdf/2406.14137
Copy Paste: [[2406.14137]] MACAROON: Training Vision-Language Models To Be Your Engaged Partners(https://arxiv.org/abs/2406.14137)
Keywords: language model, gpt, hallucination
Abstract: Large vision-language models (LVLMs), while proficient in following instructions and responding to diverse questions, invariably generate detailed responses even when questions are ambiguous or unanswerable, leading to hallucinations and bias issues. Thus, it is essential for LVLMs to proactively engage with humans to ask for clarifications or additional information for better responses. In this study, we aim to shift LVLMs from passive answer providers to proactive engaged partners. We begin by establishing a three-tiered hierarchy for questions of invalid, ambiguous, and personalizable nature to measure the proactive engagement capabilities of LVLMs. Utilizing this hierarchy, we create PIE (ProactIve Engagement Evaluation) through GPT-4o and human annotators, consisting of 853 questions across six distinct, fine-grained question types that are verified by human annotators and accompanied by well-defined metrics. Our evaluations on PIE indicate poor performance of existing LVLMs, with the best-performing open-weights model only achieving an Aggregate Align Rate (AAR) of 0.28. In response, we introduce MACAROON, self-iMaginAtion for ContrAstive pReference OptimizatiON, which instructs LVLMs to autonomously generate contrastive response pairs for unlabeled questions given the task description and human-crafted criteria. Then, the self-imagined data is formatted for conditional reinforcement learning. Experimental results show MACAROON effectively improves LVLMs' capabilities to be proactively engaged (0.84 AAR) while maintaining comparable performance on general tasks.
摘要：大型视觉语言模型 (LVLM) 虽然能够熟练地遵循指令并回答各种问题，但即使问题模棱两可或无法回答，也总是会生成详细的答案，从而导致幻觉和偏见问题。因此，LVLM 必须主动与人类互动，以寻求澄清或更多信息，以获得更好的答案。在本研究中，我们的目标是将 LVLM 从被动的答案提供者转变为主动参与的合作伙伴。我们首先为无效、模棱两可和可个性化的问题建立一个三层层次结构，以衡量 LVLM 的主动参与能力。利用这个层次结构，我们通过 GPT-4o 和人类注释者创建了 PIE（主动参与评估），包括六种不同的细粒度问题类型的 853 个问题，这些问题由人类注释者验证并附有明确定义的指标。我们对 \benchmark 的评估表明，现有 LVLM 的性能不佳，性能最佳的开放权重模型仅实现了 0.28 的总体对齐率 (AAR)。为此，我们引入了 MACAROON，即用于对比偏好优化的自我想象，它指示 LVLM 根据任务描述和人为制定的标准自主生成未标记问题的对比响应对。然后，将自我想象的数据格式化为条件强化学习。实验结果表明，MACAROON 有效提高了 LVLM 主动参与的能力（0.84 AAR），同时在一般任务上保持了相当的性能。

Title: Finding Safety Neurons in Large Language Models

Authors: Jianhui Chen, Xiaozhi Wang, Zijun Yao, Yushi Bai, Lei Hou, Juanzi Li
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.14144
Pdf URL: https://arxiv.org/pdf/2406.14144
Copy Paste: [[2406.14144]] Finding Safety Neurons in Large Language Models(https://arxiv.org/abs/2406.14144)
Keywords: language model, llm
Abstract: Large language models (LLMs) excel in various capabilities but also pose safety risks such as generating harmful content and misinformation, even after safety alignment. In this paper, we explore the inner mechanisms of safety alignment from the perspective of mechanistic interpretability, focusing on identifying and analyzing safety neurons within LLMs that are responsible for safety behaviors. We propose generation-time activation contrasting to locate these neurons and dynamic activation patching to evaluate their causal effects. Experiments on multiple recent LLMs show that: (1) Safety neurons are sparse and effective. We can restore 90% safety performance with intervention only on about 5% of all the neurons. (2) Safety neurons encode transferable mechanisms. They exhibit consistent effectiveness on different red-teaming datasets. The finding of safety neurons also interprets "alignment tax". We observe that the identified key neurons for safety and helpfulness significantly overlap, but they require different activation patterns of the shared neurons. Furthermore, we demonstrate an application of safety neurons in detecting unsafe outputs before generation. Our findings may promote further research on understanding LLM alignment. The source codes will be publicly released to facilitate future research.
摘要：大型语言模型 (LLM) 在各种功能上都很出色，但也存在安全风险，例如生成有害内容和错误信息，即使在安全对齐之后也是如此。在本文中，我们从机械可解释性的角度探索安全对齐的内部机制，重点是识别和分析 LLM 中负责安全行为的安全神经元。我们提出生成时间激活对比来定位这些神经元，并动态激活修补来评估它们的因果效应。对多个最近的 LLM 进行的实验表明：（1）安全神经元稀疏且有效。我们只需对大约 5% 的神经元进行干预，就可以恢复 90% 的安全性能。（2）安全神经元编码可转移机制。它们在不同的红队数据集上表现出一致的有效性。安全神经元的发现也解释了“对齐税”。我们观察到，确定的安全性和有用性的关键神经元显着重叠，但它们需要共享神经元的不同激活模式。此外，我们展示了安全神经元在生成之前检测不安全输出的应用。我们的发现可能会促进对 LLM 对齐的进一步研究。源代码将公开发布，以方便未来的研究。

Title: Aligning Large Language Models with Diverse Political Viewpoints

Authors: Dominik Stammbach, Philine Widmer, Eunjung Cho, Caglar Gulcehre, Elliott Ash
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14155
Pdf URL: https://arxiv.org/pdf/2406.14155
Copy Paste: [[2406.14155]] Aligning Large Language Models with Diverse Political Viewpoints(https://arxiv.org/abs/2406.14155)
Keywords: language model, gpt, llm, chat
Abstract: Large language models such as ChatGPT often exhibit striking political biases. If users query them about political information, they might take a normative stance and reinforce such biases. To overcome this, we align LLMs with diverse political viewpoints from 100,000 comments written by candidates running for national parliament in Switzerland. Such aligned models are able to generate more accurate political viewpoints from Swiss parties compared to commercial models such as ChatGPT. We also propose a procedure to generate balanced overviews from multiple viewpoints using such models.
摘要：大型语言模型（例如 ChatGPT）通常表现出明显的政治偏见。如果用户向他们询问政治信息，他们可能会采取规范立场并强化这种偏见。为了克服这个问题，我们将 LLM 与瑞士国家议会候选人撰写的 100,000 条评论中的不同政治观点对齐。与 ChatGPT 等商业模型相比，这种对齐模型能够从瑞士政党中生成更准确的政治观点。我们还提出了一种使用此类模型从多个观点生成平衡概述的程序。

Title: Definition generation for lexical semantic change detection

Authors: Mariia Fedorova, Andrey Kutuzov, Yves Scherrer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14167
Pdf URL: https://arxiv.org/pdf/2406.14167
Copy Paste: [[2406.14167]] Definition generation for lexical semantic change detection(https://arxiv.org/abs/2406.14167)
Keywords: language model
Abstract: We use contextualized word definitions generated by large language models as semantic representations in the task of diachronic lexical semantic change detection (LSCD). In short, generated definitions are used as 'senses', and the change score of a target word is retrieved by comparing their distributions in two time periods under comparison. On the material of five datasets and three languages, we show that generated definitions are indeed specific and general enough to convey a signal sufficient to rank sets of words by the degree of their semantic change over time. Our approach is on par with or outperforms prior non-supervised sense-based LSCD methods. At the same time, it preserves interpretability and allows inspecting the reasons behind a specific shift in terms of discrete definitions-as-senses. This is another step in the direction of explainable semantic change modeling.
摘要：在历时词汇语义变化检测 (LSCD) 任务中，我们使用大型语言模型生成的语境化单词定义作为语义表示。简而言之，生成的定义用作“意义”，通过比较两个时间段内的分布来检索目标单词的变化分数。在五个数据集和三种语言的材料上，我们表明生成的定义确实足够具体和通用，可以传达足够的信号，根据单词随时间变化的程度对单词集进行排序。我们的方法与之前的非监督式基于意义的 LSCD 方法相当或优于它们。同时，它保留了可解释性，并允许根据离散的定义作为意义来检查特定转变背后的原因。这是朝着可解释的语义变化建模方向迈出的又一步。
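
The idea in the abstract above (definitions as senses, with a change score obtained by comparing their distributions across two periods) can be illustrated with a minimal sketch. It is not the authors' implementation: the use of Jensen-Shannon distance, the function names, and the assumption that identical definition strings count as the same sense are all choices made here for illustration, since the abstract does not name the comparison measure.

```python
from collections import Counter
import math


def sense_distribution(definitions, vocabulary):
    """Relative frequency of each definition label (assumes a non-empty list)."""
    counts = Counter(definitions)
    total = len(definitions)
    return [counts[sense] / total for sense in vocabulary]


def jensen_shannon_distance(p, q):
    """Square root of the base-2 Jensen-Shannon divergence; lies in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL divergence.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(jsd, 0.0))


def change_score(defs_period1, defs_period2):
    """Hypothetical change score: distance between the two sense distributions."""
    vocabulary = sorted(set(defs_period1) | set(defs_period2))
    p = sense_distribution(defs_period1, vocabulary)
    q = sense_distribution(defs_period2, vocabulary)
    return jensen_shannon_distance(p, q)
```

Under these assumptions, a word whose generated definitions are identically distributed in both periods scores 0, a word whose definitions are entirely different scores 1, and target words can be ranked by the score.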

Title: In Tree Structure Should Sentence Be Generated

Authors: Yaguang Li, Xin Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14189
Pdf URL: https://arxiv.org/pdf/2406.14189
Copy Paste: [[2406.14189]] In Tree Structure Should Sentence Be Generated(https://arxiv.org/abs/2406.14189)
Keywords: hallucination
Abstract: Generative models reliant on sequential autoregression have been at the forefront of language generation for an extensive period, particularly following the introduction of widely acclaimed transformers. Despite its excellent performance, there are always some issues that we face today. For example, problems such as hallucinations and getting trapped in a logic loop may occur. To enhance the performance of existing systems, this paper introduces a new method for generating sequences in natural language, which involves generating the targeted sentence in a tree-traversing order. The paper includes an illustration of the theoretical basis and validity of the approach, as well as a comparison of its fundamentals with the diffusion model in graphic generation. Finally, a module called SenTree is introduced for generating an approximating binary tree. It is already available at this https URL. Additionally, a joint training framework based on this approach is proposed, incorporating the intrinsics of generative adversarial networks.
摘要：依赖于顺序自回归的生成模型长期以来一直处于语言生成的前沿，尤其是在广受好评的 Transformer 推出之后。尽管它具有出色的性能，但今天我们总是面临一些问题。例如，可能会出现幻觉和陷入逻辑循环等问题。为了提高现有系统的性能，本文介绍了一种生成自然语言序列的新方法，该方法涉及以树遍历顺序生成目标句子。本文说明了该方法的理论基础和有效性，并将其基本原理与图形生成中的扩散模型进行了比较。最后，介绍了一个名为 SenTree 的模块，用于生成近似二叉树。它已经在这个 https URL 上可用。此外，提出了一个基于这种方法的联合训练框架，结合了生成对抗网络的内在特性。

Title: Timo: Towards Better Temporal Reasoning for Language Models

Authors: Zhaochen Su, Jun Zhang, Tong Zhu, Xiaoye Qu, Juntao Li, Min Zhang, Yu Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.14192
Pdf URL: https://arxiv.org/pdf/2406.14192
Copy Paste: [[2406.14192]] Timo: Towards Better Temporal Reasoning for Language Models(https://arxiv.org/abs/2406.14192)
Keywords: language model, llm
Abstract: Reasoning about time is essential for Large Language Models (LLMs) to understand the world. Previous works focus on solving specific tasks, primarily on time-sensitive question answering. While these methods have proven effective, they cannot generalize to a wider spectrum of temporal reasoning tasks. Therefore, we propose a crucial question: Can we build a universal framework to handle a variety of temporal reasoning tasks? To that end, we systematically study 38 temporal reasoning tasks. Based on the observation that 19 tasks are directly related to mathematics, we first leverage the available mathematical dataset to set a solid foundation for temporal reasoning. However, the in-depth study indicates that focusing solely on mathematical enhancement falls short of addressing pure temporal reasoning tasks. To mitigate this limitation, we propose a simple but effective self-critic temporal optimization method to enhance the model's temporal reasoning capabilities without sacrificing general task abilities. Finally, we develop Timo, a model designed to excel in temporal reasoning at the 7B and 13B scales. Notably, Timo outperforms the counterpart LLMs by 10.0 and 7.6 in average accuracy scores and achieves the new state-of-the-art (SOTA) performance of comparable size. Extensive experiments further validate our framework's effectiveness and its generalization across diverse temporal tasks. The code is available at this https URL.
摘要：时间推理对于大型语言模型 (LLM) 理解世界至关重要。以前的研究主要侧重于解决特定任务，主要是时间敏感型问答。虽然这些方法已被证明是有效的，但它们无法推广到更广泛的时间推理任务。因此，我们提出了一个关键问题：我们能否建立一个通用框架来处理各种时间推理任务？为此，我们系统地研究了 38 个时间推理任务。基于 19 个任务与数学直接相关的观察，我们首先利用可用的数学数据集为时间推理奠定坚实的基础。然而，深入研究表明，仅仅关注数学增强不足以解决纯时间推理任务。为了缓解这一限制，我们提出了一种简单但有效的自我批评时间优化方法，以增强模型的时间推理能力，而不会牺牲一般任务能力。最后，我们开发了 Timo，这是一个旨在在 7B 和 13B 规模上表现出色时间推理能力的模型。值得注意的是，Timo 的平均准确率比同类 LLM 高出 10.0 和 7.6，并达到了同等规模的最新 (SOTA) 性能。大量实验进一步验证了我们框架的有效性及其在不同时间任务中的泛化能力。代码可在此 https URL 上获取。

Title: On the Representational Capacity of Neural Language Models with Chain-of-Thought Reasoning

Authors: Franz Nowak, Anej Svete, Alexandra Butoi, Ryan Cotterell
Subjects: cs.CL, cs.FL
Abstract URL: https://arxiv.org/abs/2406.14197
Pdf URL: https://arxiv.org/pdf/2406.14197
Copy Paste: [[2406.14197]] On the Representational Capacity of Neural Language Models with Chain-of-Thought Reasoning(https://arxiv.org/abs/2406.14197)
Keywords: language model, chain-of-thought
Abstract: The performance of modern language models (LMs) has been improved by chain-of-thought (CoT) reasoning, i.e., the process of generating intermediate results that guide the model towards a final answer. A possible explanation for this improvement is that CoT reasoning extends an LM's computational power, as RNNs and transformers with additional scratch space are known to be Turing complete. Comparing LMs to Turing machines, however, introduces a category error: Turing machines decide language membership, whereas LMs define distributions over strings. To bridge this gap, we formalize CoT reasoning in a probabilistic setting. We present several results on the representational capacity of recurrent and transformer LMs with CoT reasoning, showing that they can represent the same family of distributions over strings as probabilistic Turing machines.
摘要：现代语言模型 (LM) 的性能已通过思路链 (CoT) 推理得到改善，即生成中间结果以引导模型得出最终答案的过程。这种改进的一个可能解释是，CoT 推理扩展了 LM 的计算能力，因为已知具有额外临时空间的 RNN 和转换器是图灵完备的。然而，将 LM 与图灵机进行比较会引入分类错误 - 图灵机决定语言成员资格，而 LM 定义字符串上的分布。为了弥合这一差距，我们在概率设置中形式化了 CoT 推理。我们展示了具有 CoT 推理的循环和转换器 LM 的表示能力的几个结果，表明它们可以表示与概率图灵机相同的字符串分布系列。

Title: Raising the Bar: Investigating the Values of Large Language Models via Generative Evolving Testing

Authors: Han Jiang, Xiaoyuan Yi, Zhihua Wei, Shu Wang, Xing Xie
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2406.14230
Pdf URL: https://arxiv.org/pdf/2406.14230
Copy Paste: [[2406.14230]] Raising the Bar: Investigating the Values of Large Language Models via Generative Evolving Testing(https://arxiv.org/abs/2406.14230)
Keywords: language model, llm
Abstract: Warning: this paper contains model outputs exhibiting unethical information. Large Language Models (LLMs) have achieved significant breakthroughs, but their generated unethical content poses potential risks. Measuring value alignment of LLMs becomes crucial for their regulation and responsible deployment. Numerous datasets have been constructed to assess social bias, toxicity, and ethics in LLMs, but they suffer from evaluation chronoeffect, that is, as models rapidly evolve, existing data becomes leaked or undemanding, overestimating ever-developing LLMs. To tackle this problem, we propose GETA, a novel generative evolving testing approach that dynamically probes the underlying moral baselines of LLMs. Distinct from previous adaptive testing methods that rely on static datasets with limited difficulty, GETA incorporates an iteratively-updated item generator which infers each LLM's moral boundaries and generates difficulty-tailored testing items, accurately reflecting the true alignment extent. This process theoretically learns a joint distribution of item and model response, with item difficulty and value conformity as latent variables, where the generator co-evolves with the LLM, addressing chronoeffect. We evaluate various popular LLMs with diverse capabilities and demonstrate that GETA can create difficulty-matching testing items and more accurately assess LLMs' values, better consistent with their performance on unseen OOD and i.i.d. items, laying the groundwork for future evaluation paradigms.
摘要：警告：本文包含显示不道德信息的模型输出。大型语言模型 (LLM) 取得了重大突破，但其生成的不道德内容构成了潜在风险。衡量 LLM 的价值观一致性对于其监管和负责任的部署至关重要。已经构建了大量数据集来评估 LLM 中的社会偏见、毒性和道德问题，但它们受到评估时间效应的影响，即随着模型的快速发展，现有数据变得泄露或要求不高，从而高估了不断发展的 LLM。为了解决这个问题，我们提出了 GETA，这是一种新颖的生成式进化测试方法，可以动态探测 LLM 的潜在道德基线。与以前依赖难度有限的静态数据集的自适应测试方法不同，GETA 结合了一个迭代更新的项目生成器，可以推断每个 LLM 的道德界限并生成难度定制的测试项目，准确反映真实的一致性程度。这一过程在理论上学习了项目和模型响应的联合分布，以项目难度和价值一致性作为潜在变量，其中生成器与 LLM 共同进化，解决时间效应。我们评估了具有不同功能的各种流行 LLM，并证明 GETA 可以创建难度匹配的测试项目，并更准确地评估 LLM 的价值，更好地与它们在未见过的 OOD 和 i.i.d. 项目上的表现保持一致，为未来的评估范式奠定了基础。

Title: On the Evaluation Practices in Multilingual NLP: Can Machine Translation Offer an Alternative to Human Translations?

Authors: Rochelle Choenni, Sara Rajaee, Christof Monz, Ekaterina Shutova
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.14267
Pdf URL: https://arxiv.org/pdf/2406.14267
Copy Paste: [[2406.14267]] On the Evaluation Practices in Multilingual NLP: Can Machine Translation Offer an Alternative to Human Translations?(https://arxiv.org/abs/2406.14267)
Keywords: language model
Abstract: While multilingual language models (MLMs) have been trained on 100+ languages, they are typically only evaluated across a handful of them due to a lack of available test data in most languages. This is particularly problematic when assessing MLMs' potential for low-resource and unseen languages. In this paper, we present an analysis of existing evaluation frameworks in multilingual NLP, discuss their limitations, and propose several directions for more robust and reliable evaluation practices. Furthermore, we empirically study to what extent machine translation offers a reliable alternative to human translation for large-scale evaluation of MLMs across a wide set of languages. We use a SOTA translation model to translate test data from 4 tasks to 198 languages and use them to evaluate three MLMs. We show that while the selected subsets of high-resource test languages are generally sufficiently representative of a wider range of high-resource languages, we tend to overestimate MLMs' ability on low-resource languages. Finally, we show that simpler baselines can achieve relatively strong performance without having benefited from large-scale multilingual pretraining.
摘要：虽然多语言模型 (MLM) 已在 100 多种语言上进行了训练，但由于大多数语言缺乏可用的测试数据，因此通常只对其中少数几种语言进行评估。这在评估 MLM 对资源匮乏和未见过的语言的潜力时尤其成问题。在本文中，我们分析了多语言 NLP 中现有的评估框架，讨论了它们的局限性，并提出了几个更稳健、更可靠的评估实践方向。此外，我们通过实证研究了机器翻译在多大程度上为跨多种语言的 MLM 大规模评估提供了{可靠的人工翻译替代方案}。我们使用 SOTA 翻译模型将 4 个任务的测试数据翻译成 198 种语言，并使用它们来评估三个 MLM。我们表明，虽然所选的高资源测试语言子集通常足以代表更广泛的高资源语言，但我们倾向于高估 MLM 在低资源语言上的能力。最后，我们表明，更简单的基线无需得益于大规模多语言预训练即可实现相对较强的性能。

Title: Step-Back Profiling: Distilling User History for Personalized Scientific Writing

Authors: Xiangru Tang, Xingyao Zhang, Yanjun Shao, Jie Wu, Yilun Zhao, Arman Cohan, Ming Gong, Dongmei Zhang, Mark Gerstein
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.14275
Pdf URL: https://arxiv.org/pdf/2406.14275
Copy Paste: [[2406.14275]] Step-Back Profiling: Distilling User History for Personalized Scientific Writing(https://arxiv.org/abs/2406.14275)
Keywords: language model, llm
Abstract: Large language models (LLMs) excel at a variety of natural language processing tasks, yet they struggle to generate personalized content for individuals, particularly in real-world scenarios like scientific writing. Addressing this challenge, we introduce Step-Back Profiling to personalize LLMs by distilling user history into concise profiles, including essential traits and preferences of users. Regarding our experiments, we construct a Personalized Scientific Writing (PSW) dataset to study multiuser personalization. PSW requires the models to write scientific papers given specialized author groups with diverse academic backgrounds. As for the results, we demonstrate the effectiveness of capturing user characteristics via Step-Back Profiling for collaborative writing. Moreover, our approach outperforms the baselines by up to 3.6 points on the general personalization benchmark (LaMP), including 7 personalization LLM tasks. Our extensive ablation studies validate the contributions of different components in our method and provide insights into our task definition. Our dataset and code are available at this https URL.
摘要：大型语言模型 (LLM) 在各种自然语言处理任务中表现出色，但它们很难为个人生成个性化内容，尤其是在科学写作等现实场景中。为了应对这一挑战，我们引入了 Step-Back Profiling 来个性化 LLM，方法是将用户历史记录提炼为简明的个人资料，包括用户的基本特征和偏好。关于我们的实验，我们构建了一个个性化科学写作 (PSW) 数据集来研究多用户个性化。PSW 要求模型根据具有不同学术背景的专业作者群体撰写科学论文。至于结果，我们证明了通过 Step-Back Profiling 捕捉用户特征进行协作写作的有效性。此外，我们的方法在通用个性化基准 (LaMP) 上的表现比基线高出多达 3.6 分，包括 7 个个性化 LLM 任务。我们广泛的消融研究验证了我们方法中不同组件的贡献，并为我们的任务定义提供了见解。我们的数据集和代码可在 \url{this https URL} 获得。

Title: Augmenting Query and Passage for Retrieval-Augmented Generation using LLMs for Open-Domain Question Answering

Authors: Minsang Kim, Cheoneum Park, Seungjun Baek
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.14277
Pdf URL: https://arxiv.org/pdf/2406.14277
Copy Paste: [[2406.14277]] Augmenting Query and Passage for Retrieval-Augmented Generation using LLMs for Open-Domain Question Answering(https://arxiv.org/abs/2406.14277)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) has received much attention for Open-domain question-answering (ODQA) tasks as a means to compensate for the parametric knowledge of large language models (LLMs). While previous approaches focused on processing retrieved passages to remove irrelevant context, they still rely heavily on the quality of retrieved passages which can degrade if the question is ambiguous or complex. In this paper, we propose a simple yet efficient method called question and passage augmentation via LLMs for open-domain QA. Our method first decomposes the original questions into multiple-step sub-questions. By augmenting the original question with detailed sub-questions and planning, we are able to make the query more specific on what needs to be retrieved, improving the retrieval performance. In addition, to compensate for the case where the retrieved passages contain distracting information or divided opinions, we augment the retrieved passages with self-generated passages by LLMs to guide the answer extraction. Experimental results show that the proposed scheme outperforms the previous state-of-the-art and achieves significant performance gain over existing RAG methods.
摘要：检索增强生成 (RAG) 已在开放域问答 (ODQA) 任务中引起广泛关注，它可作为弥补大型语言模型 (LLM) 参数知识不足的一种手段。虽然以前的方法侧重于处理检索到的段落以删除不相关的上下文，但它们仍然严重依赖检索到的段落的质量，如果问题模糊或复杂，检索到的段落质量可能会下降。在本文中，我们提出了一种简单而有效的方法，即通过 LLM 进行问题和段落增强，用于开放域问答。我们的方法首先将原始问题分解为多步骤子问题。通过用详细的子问题和计划来增强原始问题，我们能够使查询更具体到需要检索的内容，从而提高检索性能。此外，为了弥补检索到的段落包含分散注意力的信息或分歧意见的情况，我们使用 LLM 自生成的段落来增强检索到的段落，以指导答案提取。实验结果表明，所提出的方案优于之前的最先进的技术，并且比现有的 RAG 方法取得了显著的性能提升。

Title: Learning to Plan for Retrieval-Augmented Large Language Models from Knowledge Graphs

Authors: Junjie Wang, Mingyang Chen, Binbin Hu, Dan Yang, Ziqi Liu, Yue Shen, Peng Wei, Zhiqiang Zhang, Jinjie Gu, Jun Zhou, Jeff Z. Pan, Wen Zhang, Huajun Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.14282
Pdf URL: https://arxiv.org/pdf/2406.14282
Copy Paste: [[2406.14282]] Learning to Plan for Retrieval-Augmented Large Language Models from Knowledge Graphs(https://arxiv.org/abs/2406.14282)
Keywords: language model, gpt, llm
Abstract: Improving the performance of large language models (LLMs) in complex question-answering (QA) scenarios has always been a research focal point. Recent studies have attempted to enhance LLMs' performance by combining step-wise planning with external retrieval. While effective for advanced models like GPT-3.5, smaller LLMs face challenges in decomposing complex questions, necessitating supervised fine-tuning. Previous work has relied on manual annotation and knowledge distillation from teacher LLMs, which are time-consuming and not accurate enough. In this paper, we introduce a novel framework for enhancing LLMs' planning capabilities by using planning data derived from knowledge graphs (KGs). LLMs fine-tuned with this data have improved planning capabilities, better equipping them to handle complex QA tasks that involve retrieval. Evaluations on multiple datasets, including our newly proposed benchmark, highlight the effectiveness of our framework and the benefits of KG-derived planning data.
摘要：提高大型语言模型 (LLM) 在复杂问答 (QA) 场景中的性能一直是研究重点。最近的研究尝试通过将分步规划与外部检索相结合来提高 LLM 的性能。虽然较小的 LLM 对 GPT-3.5 等高级模型有效，但在分解复杂问题方面面临挑战，需要进行监督微调。以前的工作依赖于教师 LLM 的手动注释和知识提炼，这既耗时又不够准确。在本文中，我们介绍了一个新框架，用于通过使用来自知识图谱 (KG) 的规划数据来增强 LLM 的规划能力。使用这些数据进行微调的 LLM 具有改进的规划能力，使其能够更好地处理涉及检索的复杂 QA 任务。对包括我们新提出的基准在内的多个数据集的评估凸显了我们框架的有效性以及 KG 衍生的规划数据的优势。

Title: VAIYAKARANA : A Benchmark for Automatic Grammar Correction in Bangla

Authors: Pramit Bhattacharyya, Arnab Bhattacharya
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.14284
Pdf URL: https://arxiv.org/pdf/2406.14284
Copy Paste: [[2406.14284]] VAIYAKARANA : A Benchmark for Automatic Grammar Correction in Bangla(https://arxiv.org/abs/2406.14284)
Keywords: llm
Abstract: Bangla (Bengali) is the fifth most spoken language globally and, yet, the problem of automatic grammar correction in Bangla is still in its nascent stage. This is mostly due to the need for a large corpus of grammatically incorrect sentences, with their corresponding correct counterparts. The present state-of-the-art techniques to curate a corpus for grammatically wrong sentences involve random swapping, insertion and deletion of words. However, these steps may not always generate grammatically wrong sentences in Bangla. In this work, we propose a pragmatic approach to generate grammatically wrong sentences in Bangla. We first categorize the different kinds of errors in Bangla into 5 broad classes and 12 finer classes. We then use these to generate grammatically wrong sentences systematically from a correct sentence. This approach can generate a large number of wrong sentences and can, thus, mitigate the challenge of lacking a large corpus for neural networks. We provide a dataset, Vaiyakarana, consisting of 92,830 grammatically incorrect sentences as well as 18,426 correct sentences. We also collected 619 human-generated sentences from essays written by Bangla native speakers. This helped us to understand errors that are more frequent. We evaluated our corpus against neural models and LLMs and also benchmarked it against human evaluators who are native speakers of Bangla. Our analysis shows that native speakers are far more accurate than state-of-the-art models at detecting whether a sentence is grammatically correct. Our methodology of generating erroneous sentences can be applied to most other Indian languages as well.
摘要：孟加拉语是全球第五大语言，但孟加拉语的自动语法纠正问题仍处于起步阶段。这主要是因为需要大量语法错误的句子及其对应的正确句子的语料库。目前，整理语法错误句子语料库的最先进技术包括随机交换、插入和删除单词。然而，这些步骤并不总是能生成孟加拉语语法错误的句子。在这项工作中，我们提出了一种生成孟加拉语语法错误句子的实用方法。我们首先将孟加拉语中的不同类型错误分为 5 个大类和 12 个细类。然后，我们使用这些从正确句子系统地生成语法错误的句子。这种方法可以生成大量错误句子，从而可以缓解神经网络缺乏大型语料库的挑战。我们提供了一个数据集 Vaiyakarana，其中包含 92,830 个语法错误的句子以及 18,426 个正确的句子。我们还从孟加拉语母语人士撰写的文章中收集了 619 个人工生成的句子。这有助于我们了解更常见的错误。我们根据神经模型和 LLM 评估了我们的语料库，并将其与以孟加拉语为母语的人类评估者进行了对比。我们的分析表明，母语人士在检测句子是否语法正确方面比最先进的模型准确得多。我们生成错误句子的方法也可以应用于大多数其他印度语言。

Title: Infusing clinical knowledge into tokenisers for language models

Authors: Abul Hasan, Jinge Wu, Quang Ngoc Nguyen, Salomé Andres, Imane Guellil, Huayu Zhang, Arlene Casey, Beatrice Alex, Bruce Guthrie, Honghan Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.14312
Pdf URL: https://arxiv.org/pdf/2406.14312
Copy Paste: [[2406.14312]] Infusing clinical knowledge into tokenisers for language models(https://arxiv.org/abs/2406.14312)
Keywords: language model
Abstract: This study introduces a novel knowledge enhanced tokenisation mechanism, K-Tokeniser, for clinical text processing. Technically, at initialisation stage, K-Tokeniser populates global representations of tokens based on semantic types of domain concepts (such as drugs or diseases) from either a domain ontology like Unified Medical Language System or the training data of the task related corpus. At training or inference stage, sentence level localised context will be utilised for choosing the optimal global token representation to realise the semantic-based tokenisation. To avoid pretraining using the new tokeniser, an embedding initialisation approach is proposed to generate representations for new tokens. Using three transformer-based language models, a comprehensive set of experiments are conducted on four real-world datasets for evaluating K-Tokeniser in a wide range of clinical text analytics tasks including clinical concept and relation extraction, automated clinical coding, clinical phenotype identification, and clinical research article classification. Overall, our models demonstrate consistent improvements over their counterparts in all tasks. In particular, substantial improvements are observed in the automated clinical coding task with 13% increase on Micro $F_1$ score. Furthermore, K-Tokeniser also shows significant capacities in facilitating quicker convergence of language models. Specifically, using K-Tokeniser, the language models would only require 50% of the training data to achieve the best performance of the baseline tokeniser using all training data in the concept extraction task and less than 20% of the data for the automated coding task. It is worth mentioning that all these improvements require no pre-training process, making the approach generalisable.
摘要：本研究介绍了一种用于临床文本处理的新型知识增强标记化机制 K-Tokeniser。从技术上讲，在初始化阶段，K-Tokeniser 根据领域概念（如药物或疾病）的语义类型从领域本体（如统一医学语言系统）或任务相关语料库的训练数据中填充标记的全局表示。在训练或推理阶段，将利用句子级局部上下文来选择最佳全局标记表示以实现基于语义的标记化。为了避免使用新的标记器进行预训练，提出了一种嵌入初始化方法来生成新标记的表示。使用三个基于转换器的语言模型，在四个真实世界数据集上进行了一组全面的实验，以评估 K-Tokeniser 在广泛的临床文本分析任务中的表现，包括临床概念和关系提取、自动临床编码、临床表型识别和临床研究文章分类。总体而言，我们的模型在所有任务中都表现出比同类模型一致的改进。具体而言，在自动临床编码任务中观察到了显著的改进，Micro $F_1$ 分数增加了 13\%。此外，K-Tokeniser 还表现出促进语言模型更快收敛的显著能力。具体来说，使用 K-Tokeniser，语言模型只需要 50\% 的训练数据就可以在概念提取任务中使用所有训练数据实现基线标记器的最佳性能，在自动编码任务中只需要不到 20\% 的数据。值得一提的是，所有这些改进都不需要预训练过程，这使得该方法具有普遍性。

Title: Robust Few-shot Transfer Learning for Knowledge Base Question Answering with Unanswerable Questions

Authors: Riya Sawhney, Indrajit Bhattacharya, Mausam
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.14313
Pdf URL: https://arxiv.org/pdf/2406.14313
Copy Paste: [[2406.14313]] Robust Few-shot Transfer Learning for Knowledge Base Question Answering with Unanswerable Questions(https://arxiv.org/abs/2406.14313)
Keywords: llm, prompt
Abstract: Real-world KBQA applications require models that are (1) robust -- e.g., can differentiate between answerable and unanswerable questions, and (2) low-resource -- do not require large training data. Towards this goal, we propose the novel task of few-shot transfer for KBQA with unanswerable questions. We present FUn-FuSIC that extends the state-of-the-art (SoTA) few-shot transfer model for answerable-only KBQA to handle unanswerability. It iteratively prompts an LLM to generate logical forms for the question by providing feedback using a diverse suite of syntactic, semantic and execution guided checks, and adapts self-consistency to assess confidence of the LLM to decide answerability. Experiments over newly constructed datasets show that FUn-FuSIC outperforms suitable adaptations of the SoTA model for KBQA with unanswerability, and the SoTA model for answerable-only few-shot-transfer KBQA.
摘要：现实世界的 KBQA 应用程序需要具备以下特点的模型：(1) 稳健（例如，可以区分可回答和不可回答的问题）；(2) 低资源（不需要大量训练数据）。为了实现这一目标，我们提出了一种新颖的任务，即针对具有不可回答问题的 KBQA 进行少样本迁移。我们提出了 FUn-FuSIC，它扩展了最先进的（SoTA）少样本迁移模型，用于仅可回答的 KBQA，以处理不可回答问题。它通过使用多种句法、语义和执行引导检查提供反馈，迭代地提示 LLM 为问题生成逻辑形式，并调整自洽性以评估 LLM 决定可回答性的信心。在新构建的数据集上进行的实验表明，FUn-FuSIC 的表现优于针对具有不可回答性的 KBQA 的 SoTA 模型的适当改编，以及针对仅可回答的少样本迁移 KBQA 的 SoTA 模型。
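
The abstract above mentions adapting self-consistency to decide answerability. The following is a minimal sketch of one generic way such a check could work, not the authors' implementation: several logical forms are sampled for the same question, and the question is treated as unanswerable when no single form wins a large enough share of the votes. The function name, the use of `None` for a sample that failed its checks, and the `threshold` value are all assumptions made for illustration.

```python
from collections import Counter


def decide_answerability(sampled_forms, threshold=0.6):
    """Majority-vote sketch of a self-consistency check (hypothetical helper).

    sampled_forms: logical forms from repeated LLM samples for one question;
                   None marks a sample that produced no valid form.
    threshold:     assumed minimum vote share needed to trust the top form.
    Returns (logical_form, confidence); logical_form is None when the
    question is judged unanswerable.
    """
    if not sampled_forms:
        return None, 0.0
    counts = Counter(sampled_forms)
    top_form, top_count = counts.most_common(1)[0]
    confidence = top_count / len(sampled_forms)
    # Low agreement, or agreement on "no valid form", is read as unanswerable.
    if top_form is None or confidence < threshold:
        return None, confidence
    return top_form, confidence
```

With five samples of which four agree, the sketch returns the agreed form with confidence 0.8; with five mutually different samples it returns `None`.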

Title: Identifying User Goals from UI Trajectories

Authors: Omri Berkovitch, Sapir Caduri, Noam Kahlon, Anatoly Efros, Avi Caciularu, Ido Dagan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.14314
Pdf URL: https://arxiv.org/pdf/2406.14314
Copy Paste: [[2406.14314]] Identifying User Goals from UI Trajectories(https://arxiv.org/abs/2406.14314)
Keywords: gpt, agent
Abstract: Autonomous agents that interact with graphical user interfaces (GUIs) hold significant potential for enhancing user experiences. To further improve these experiences, agents need to be personalized and proactive. By effectively comprehending user intentions through their actions and interactions with GUIs, agents will be better positioned to achieve these goals. This paper introduces the task of goal identification from observed UI trajectories, aiming to infer the user's intended task based on their GUI interactions. We propose a novel evaluation metric to assess whether two task descriptions are paraphrases within a specific UI environment. By leveraging the inverse relation with the UI automation task, we utilized the Android-In-The-Wild and Mind2Web datasets for our experiments. Using our metric and these datasets, we conducted several experiments comparing the performance of humans and state-of-the-art models, specifically GPT-4 and Gemini-1.5 Pro. Our results show that Gemini performs better than GPT but still underperforms compared to humans, indicating significant room for improvement.
摘要：与图形用户界面 (GUI) 交互的自主代理在增强用户体验方面具有巨大潜力。为了进一步改善这些体验，代理需要个性化和主动性。通过有效地理解用户的行为和与 GUI 的交互，代理将能够更好地实现这些目标。本文介绍了从观察到的 UI 轨迹中识别目标的任务，旨在根据用户的 GUI 交互推断用户的预期任务。我们提出了一种新颖的评估指标来评估两个任务描述在特定 UI 环境中是否是释义。通过利用与 UI 自动化任务的逆关系，我们在实验中使用了 Android-In-The-Wild 和 Mind2Web 数据集。使用我们的指标和这些数据集，我们进行了几次实验，比较了人类和最先进模型（特别是 GPT-4 和 Gemini-1.5 Pro）的性能。我们的结果表明，Gemini 的表现优于 GPT，但与人类相比仍然表现不佳，表明有很大的改进空间。

Title: Mind the Privacy Unit! User-Level Differential Privacy for Language Model Fine-Tuning

Authors: Lynn Chua, Badih Ghazi, Yangsibo Huang, Pritish Kamath, Daogao Liu, Pasin Manurangsi, Amer Sinha, Chiyuan Zhang
Subjects: cs.CL, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2406.14322
Pdf URL: https://arxiv.org/pdf/2406.14322
Copy Paste: [[2406.14322]] Mind the Privacy Unit! User-Level Differential Privacy for Language Model Fine-Tuning(https://arxiv.org/abs/2406.14322)
Keywords: language model, llm
Abstract: Large language models (LLMs) have emerged as powerful tools for tackling complex tasks across diverse domains, but they also raise privacy concerns when fine-tuned on sensitive data due to potential memorization. While differential privacy (DP) offers a promising solution by ensuring models are `almost indistinguishable' with or without any particular privacy unit, current evaluations on LLMs mostly treat each example (text record) as the privacy unit. This leads to uneven user privacy guarantees when contributions per user vary. We therefore study user-level DP, motivated by applications where it is necessary to ensure uniform privacy protection across users. We present a systematic evaluation of user-level DP for LLM fine-tuning on natural language generation tasks. Focusing on two mechanisms for achieving user-level DP guarantees, Group Privacy and User-wise DP-SGD, we investigate design choices like data selection strategies and parameter tuning for the best privacy-utility tradeoff.
摘要：大型语言模型 (LLM) 已成为解决不同领域复杂任务的有力工具，但由于潜在的记忆性，它们在对敏感数据进行微调时也会引发隐私问题。虽然差异隐私 (DP) 通过确保模型在有或没有任何特定隐私单元的情况下“几乎无法区分”而提供了一种有前途的解决方案，但目前对 LLM 的评估大多将每个示例（文本记录）视为隐私单元。当每个用户的贡献有所不同时，这会导致用户隐私保障不均衡。因此，我们研究用户级 DP，其动机是需要确保跨用户的统一隐私保护的应用程序。我们对用于自然语言生成任务的 LLM 微调的用户级 DP 进行了系统评估。我们专注于实现用户级 DP 保证的两种机制，即群组隐私和用户级 DP-SGD，研究数据选择策略和参数调整等设计选择，以实现最佳隐私效用权衡。
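
To make the difference between example-level and user-level privacy units concrete, here is a minimal sketch of the aggregation step of a user-wise DP-SGD update in its generic textbook form: each user's example gradients are first averaged into one vector, that per-user vector is clipped, and Gaussian noise scaled to the clipping norm is added before averaging over users. It is not taken from the paper; the function name, the parameter defaults, and the plain-list gradient representation are assumptions, and privacy accounting is omitted entirely.

```python
import random


def user_level_dp_mean(user_grads, clip_norm=1.0, noise_multiplier=1.0, seed=0):
    """Noisy mean of per-user clipped gradients (illustrative sketch only).

    user_grads: dict mapping user id -> list of per-example gradient vectors
                (each a list of floats of equal length).
    clip_norm, noise_multiplier: assumed hyperparameters.
    """
    rng = random.Random(seed)
    dim = len(next(iter(user_grads.values()))[0])
    total = [0.0] * dim
    for grads in user_grads.values():
        # Collapse all of a user's examples into one vector, so the user
        # (not the example) is the unit whose contribution is bounded.
        avg = [sum(g[i] for g in grads) / len(grads) for i in range(dim)]
        norm = sum(x * x for x in avg) ** 0.5
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i in range(dim):
            total[i] += avg[i] * scale
    sigma = noise_multiplier * clip_norm
    n_users = len(user_grads)
    # Gaussian noise calibrated to the per-user clipping bound.
    return [(t + rng.gauss(0.0, sigma)) / n_users for t in total]
```

Because clipping is applied after per-user averaging, a user with many records influences the update no more than a user with a single record, which is the uniformity the abstract refers to.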

Title: medIKAL: Integrating Knowledge Graphs as Assistants of LLMs for Enhanced Clinical Diagnosis on EMRs

Authors: Mingyi Jia, Junwen Duan, Yan Song, Jianxin Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14326
Pdf URL: https://arxiv.org/pdf/2406.14326
Copy Paste: [[2406.14326]] medIKAL: Integrating Knowledge Graphs as Assistants of LLMs for Enhanced Clinical Diagnosis on EMRs(https://arxiv.org/abs/2406.14326)
Keywords: language model, llm, prompt
Abstract: Electronic Medical Records (EMRs), while integral to modern healthcare, present challenges for clinical reasoning and diagnosis due to their complexity and information redundancy. To address this, we propose medIKAL (Integrating Knowledge Graphs as Assistants of LLMs), a framework that combines Large Language Models (LLMs) with knowledge graphs (KGs) to enhance diagnostic capabilities. medIKAL assigns weighted importance to entities in medical records based on their type, enabling precise localization of candidate diseases within KGs. It innovatively employs a residual network-like approach, allowing the initial diagnosis by the LLM to be merged into KG search results. Through a path-based reranking algorithm and a fill-in-the-blank style prompt template, it further refines the diagnostic process. We validate medIKAL's effectiveness through extensive experiments on a newly introduced open-sourced Chinese EMR dataset, demonstrating its potential to improve clinical diagnosis in real-world settings.
摘要：电子病历 (EMR) 是现代医疗保健不可或缺的一部分，但由于其复杂性和信息冗余，给临床推理和诊断带来了挑战。为了解决这个问题，我们提出了 medIKAL（集成知识图谱作为 LLM 的助手），这是一个将大型语言模型 (LLM) 与知识图谱 (KG) 相结合以增强诊断能力的框架。medIKAL 根据病历中的实体类型为其分配加权重要性，从而能够在 KG 中精确定位候选疾病。它创新地采用了类似残差网络的方法，允许将 LLM 的初步诊断合并到 KG 搜索结果中。通过基于路径的重新排序算法和填空式提示模板，它进一步完善了诊断过程。我们通过对新推出的开源中文 EMR 数据集进行大量实验验证了 medIKAL 的有效性，证明了其在现实环境中改善临床诊断的潜力。

Title: Self-supervised Interpretable Concept-based Models for Text Classification

Authors: Francesco De Santis, Philippe Bich, Gabriele Ciravegna, Pietro Barbiero, Danilo Giordano, Tania Cerquitelli
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.14335
Pdf URL: https://arxiv.org/pdf/2406.14335
Copy Paste: [[2406.14335]] Self-supervised Interpretable Concept-based Models for Text Classification(https://arxiv.org/abs/2406.14335)
Keywords: language model, llm
Abstract: Despite their success, Large Language Models (LLMs) still face criticism, as their lack of interpretability limits their controllability and reliability. Traditional post-hoc interpretation methods, based on attention and gradient-based analysis, offer limited insight into the model's decision-making processes. In the image field, concept-based models have emerged as explainable-by-design architectures, employing human-interpretable features as intermediate representations. However, these methods have not yet been adapted to textual data, mainly because they require expensive concept annotations, which are impractical for real-world text data. This paper addresses this challenge by proposing self-supervised Interpretable Concept Embedding Models (ICEMs). We leverage the generalization abilities of LLMs to predict the concept labels in a self-supervised way, while we deliver the final predictions with an interpretable function. The results of our experiments show that ICEMs can be trained in a self-supervised way, achieving similar performance to fully supervised concept-based models and end-to-end black-box ones. Additionally, we show that our models are (i) interpretable, offering meaningful logical explanations for their predictions; (ii) interactable, allowing humans to modify intermediate predictions through concept interventions; and (iii) controllable, guiding the LLMs' decoding process to follow a required decision-making path.
摘要：尽管大语言模型 (LLM) 取得了成功，但它仍然面临批评，因为其缺乏可解释性限制了可控性和可靠性。传统的事后解释方法基于注意力和基于梯度的分析，对模型的决策过程提供有限的洞察。在图像领域，基于概念的模型已经成为可解释的设计架构，采用人类可解释的特征作为中间表示。然而，这些方法尚未适应文本数据，主要是因为它们需要昂贵的概念注释，这对于现实世界的文本数据来说是不切实际的。本文通过提出一种自监督的可解释概念嵌入模型 (ICEM) 来解决这一挑战。我们利用 LLM 的泛化能力以自监督的方式预测概念标签，同时我们使用可解释函数提供最终预测。我们的实验结果表明，ICEM 可以以自监督的方式进行训练，实现与完全监督的基于概念的模型和端到端黑盒模型类似的性能。此外，我们表明我们的模型（i）可解释，为其预测提供有意义的逻辑解释；（ii）可交互，允许人类通过概念干预修改中间预测；（iii）可控，指导 LLM 的解码过程遵循所需的决策路径。

Title: Exploring Spatial Representations in the Historical Lake District Texts with LLM-based Relation Extraction

Authors: Erum Haris, Anthony G. Cohn, John G. Stell
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.14336
Pdf URL: https://arxiv.org/pdf/2406.14336
Copy Paste: [[2406.14336]] Exploring Spatial Representations in the Historical Lake District Texts with LLM-based Relation Extraction(https://arxiv.org/abs/2406.14336)
Keywords: language model, llm
Abstract: Navigating historical narratives poses a challenge in unveiling the spatial intricacies of past landscapes. The proposed work addresses this challenge within the context of the English Lake District, employing the Corpus of the Lake District Writing. The method utilizes a generative pre-trained transformer model to extract spatial relations from the textual descriptions in the corpus. The study applies this large language model to understand the spatial dimensions inherent in historical narratives comprehensively. The outcomes are presented as semantic triples, capturing the nuanced connections between entities and locations, and visualized as a network, offering a graphical representation of the spatial narrative. The study contributes to a deeper comprehension of the English Lake District's spatial tapestry and provides an approach to uncovering spatial relations within diverse historical contexts.
摘要：探索历史叙事对于揭示过去景观的空间复杂性提出了挑战。本研究利用湖区写作语料库，在英格兰湖区背景下应对这一挑战。该方法利用生成式预训练转换器模型从语料库中的文本描述中提取空间关系。本研究应用这种大型语言模型来全面理解历史叙事中固有的空间维度。结果以语义三元组的形式呈现，捕捉实体和位置之间的细微联系，并以网络形式可视化，提供空间叙事的图形表示。本研究有助于更深入地理解英格兰湖区的空间格局，并提供一种在不同历史背景下揭示空间关系的方法。

Title: SEC-QA: A Systematic Evaluation Corpus for Financial QA

Authors: Viet Dac Lai, Michael Krumdick, Charles Lovering, Varshini Reddy, Craig Schmidt, Chris Tanner
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14394
Pdf URL: https://arxiv.org/pdf/2406.14394
Copy Paste: [[2406.14394]] SEC-QA: A Systematic Evaluation Corpus for Financial QA(https://arxiv.org/abs/2406.14394)
Keywords: llm, long context, retrieval augmented generation
Abstract: The financial domain frequently deals with large numbers of long documents that are essential for daily operations. Significant effort is put towards automating financial data analysis. However, a persistent challenge, not limited to the finance domain, is the scarcity of datasets that accurately reflect real-world tasks for model evaluation. Existing datasets are often constrained by size, context, or relevance to practical applications. Moreover, LLMs are currently trained on trillions of tokens of text, which limits the availability of novel data or documents, unseen during training, for unbiased evaluation. We propose SEC-QA, a continuous dataset generation framework with two key features: 1) the semi-automatic generation of Question-Answer (QA) pairs spanning multiple long-context financial documents, which better represent real-world financial scenarios; 2) the ability to continually refresh the dataset using the most recent public document collections, not yet ingested by LLMs. Our experiments show that current retrieval augmented generation methods systematically fail to answer these challenging multi-document questions. In response, we introduce a QA system based on program-of-thought that improves the ability to perform complex information retrieval and quantitative reasoning pipelines, thereby increasing QA accuracy.
摘要：金融领域经常处理大量长文档，这些文档对于日常运营至关重要。人们投入了大量精力来实现金融数据分析的自动化。然而，一个持续存在的挑战（不限于金融领域）是，缺乏能够准确反映模型评估的实际任务的数据集。现有数据集通常受到大小、上下文或与实际应用的相关性的限制。此外，LLM 目前是在数万亿个文本标记上进行训练的，这限制了模型在无偏评估训练期间对新数据或文档的访问。我们提出了 SEC-QA，这是一个连续数据集生成框架，具有两个主要功能：1) 半自动生成跨多个长上下文金融文档的问答 (QA) 对，更好地代表现实世界的金融场景；2) 能够使用 LLM 尚未摄取的最新公共文档集合不断刷新数据集。我们的实验表明，当前的检索增强生成方法系统性地无法回答这些具有挑战性的多文档问题。作为回应，我们引入了一种基于思维程序的 QA 系统，该系统提高了执行复杂信息检索和定量推理流程的能力，从而提高了 QA 准确性。

Title: SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages

Authors: Gayane Ghazaryan, Erik Arakelyan, Pasquale Minervini, Isabelle Augenstein
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.14425
Pdf URL: https://arxiv.org/pdf/2406.14425
Copy Paste: [[2406.14425]] SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages(https://arxiv.org/abs/2406.14425)
Keywords: language model, llm
Abstract: Question Answering (QA) datasets have been instrumental in developing and evaluating Large Language Model (LLM) capabilities. However, such datasets are scarce for languages other than English due to the cost and difficulties of collection and manual annotation. This means that producing novel models and measuring the performance of multilingual LLMs in low-resource languages is challenging. To mitigate this, we propose $\textbf{S}$yn$\textbf{DAR}$in, a method for generating and validating QA datasets for low-resource languages. We utilize parallel content mining to obtain $\textit{human-curated}$ paragraphs between English and the target language. We use the English data as context to $\textit{generate}$ synthetic multiple-choice (MC) question-answer pairs, which are automatically translated and further validated for quality. Combining these with their designated non-English $\textit{human-curated}$ paragraphs forms the final QA dataset. The method maintains content quality, reduces the likelihood of factual errors, and circumvents the need for costly annotation. To test the method, we created a QA dataset with $1.2$K samples for the Armenian language. The human evaluation shows that $98\%$ of the generated English data maintains quality and diversity in the question types and topics, while the translation validation pipeline can filter out $\sim70\%$ of data with poor quality. We use the dataset to benchmark state-of-the-art LLMs, showing their inability to achieve human accuracy, with some model performances closer to random chance. This shows that the generated dataset is non-trivial and can be used to evaluate reasoning capabilities in low-resource languages.
摘要：问答 (QA) 数据集在开发和评估大型语言模型 (LLM) 功能方面发挥了重要作用。然而，由于成本高昂以及收集和人工注释的困难，除英语之外的其他语言的此类数据集非常稀缺。这意味着在资源匮乏的语言中生成新模型并衡量多语言 LLM 的性能具有挑战性。为了缓解这一问题，我们提出了 $\textbf{S}$yn$\textbf{DAR}$in，这是一种为资源匮乏的语言生成和验证 QA 数据集的方法。我们利用并行内容挖掘来获取英语和目标语言之间的 $\textit{人工策划}$ 段落。我们使用英语数据作为上下文来 $\textit{生成}$ 合成多项选择 (MC) 问答对，这些问答对会自动翻译并进一步验证质量。将这些问答对与指定的非英语 $\textit{人工策划}$ 段落相结合，形成最终的 QA 数据集。该方法可以保持内容质量，降低事实错误的可能性，并避免昂贵的注释需求。为了测试该方法，我们为亚美尼亚语创建了一个包含 1.2K 个样本的 QA 数据集。人工评估表明，生成的英语数据中有 98% 保持了问题类型和主题的质量和多样性，而翻译验证流程可以过滤掉质量较差的数据。我们使用该数据集对最先进的 LLM 进行基准测试，结果表明它们无法达到人类的准确度，某些模型性能更接近随机机会。这表明生成的数据集并不简单，可用于评估低资源语言的推理能力。

Title: Towards Truthful Multilingual Large Language Models: Benchmarking and Alignment Strategies

Authors: Weihao Liu, Ning Wu, Wenbiao Ding, Shining Liang, Ming Gong, Dongmei Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14434
Pdf URL: https://arxiv.org/pdf/2406.14434
Copy Paste: [[2406.14434]] Towards Truthful Multilingual Large Language Models: Benchmarking and Alignment Strategies(https://arxiv.org/abs/2406.14434)
Keywords: language model, llm
Abstract: In the era of large language models (LLMs), building multilingual large language models (MLLMs) that can serve users worldwide holds great significance. However, existing research seldom focuses on the truthfulness of MLLMs. Meanwhile, contemporary multilingual aligning technologies struggle to balance massive languages and often exhibit serious truthfulness gaps across different languages, especially those that differ greatly from English. In our work, we construct a benchmark for truthfulness evaluation in multilingual scenarios and explore the ways to align facts across languages to enhance the truthfulness of MLLMs. Furthermore, we propose Fact-aware Multilingual Selective Synergy (FaMSS) to optimize the data allocation across a large number of languages and different data types. Experimental results demonstrate that our approach can effectively reduce the multilingual representation disparity and enhance the multilingual capabilities of LLMs.
摘要：在大型语言模型（LLM）的时代，构建能够服务于全球用户的多语言大型语言模型（MLLM）具有重要意义。然而，现有的研究很少关注MLLM的真实性。同时，当代的多语言对齐技术难以平衡大量语言，并且经常在不同语言之间表现出严重的真实性差距，特别是那些与英语差异很大的语言。在我们的工作中，我们构建了多语言场景下真实性评估的基准，并探索了跨语言对齐事实的方法以增强MLLM的真实性。此外，我们提出了事实感知的多语言选择性协同（FaMSS）来优化在大量语言和不同数据类型之间的数据分配。实验结果表明，我们的方法可以有效减少多语言表示差异并增强LLM的多语言能力。

Title: Healing Powers of BERT: How Task-Specific Fine-Tuning Recovers Corrupted Language Models

Authors: Shijie Han, Zhenyu Zhang, Andrei Arsene Simion
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14459
Pdf URL: https://arxiv.org/pdf/2406.14459
Copy Paste: [[2406.14459]] Healing Powers of BERT: How Task-Specific Fine-Tuning Recovers Corrupted Language Models(https://arxiv.org/abs/2406.14459)
Keywords: language model
Abstract: Language models like BERT excel at sentence classification tasks due to extensive pre-training on general data, but their robustness to parameter corruption is unexplored. To understand this better, we look at what happens if a language model is "broken", in the sense that some of its parameters are corrupted and then recovered by fine-tuning. Strategically corrupting BERT variants at different levels, we find corrupted models struggle to fully recover their original performance, with higher corruption causing more severe degradation. Notably, bottom-layer corruption affecting fundamental linguistic features is more detrimental than top-layer corruption. Our insights contribute to understanding language model robustness and adaptability under adverse conditions, informing strategies for developing resilient NLP systems against parameter perturbations.
摘要：由于对一般数据进行了大量的预训练，像 BERT 这样的语言模型在句子分类任务中表现出色，但它们对参数损坏的鲁棒性尚未得到探索。为了更好地理解这一点，我们研究了如果语言模型“损坏”会发生什么，即它的一些参数被损坏，然后通过微调恢复。通过在不同级别策略性地破坏 BERT 变体，我们发现损坏的模型很难完全恢复其原始性能，损坏程度越高，性能下降越严重。值得注意的是，影响基本语言特征的底层损坏比顶层损坏更有害。我们的见解有助于理解语言模型在不利条件下的鲁棒性和适应性，为开发针对参数扰动的弹性 NLP 系统提供策略。

Title: Explicit and Implicit Large Language Model Personas Generate Opinions but Fail to Replicate Deeper Perceptions and Biases

Authors: Salvatore Giorgi, Tingting Liu, Ankit Aich, Kelsey Isman, Garrick Sherman, Zachary Fried, João Sedoc, Lyle H. Ungar, Brenda Curtis
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14462
Pdf URL: https://arxiv.org/pdf/2406.14462
Copy Paste: [[2406.14462]] Explicit and Implicit Large Language Model Personas Generate Opinions but Fail to Replicate Deeper Perceptions and Biases(https://arxiv.org/abs/2406.14462)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are increasingly being used in human-centered social scientific tasks, such as data annotation, synthetic data creation, and engaging in dialog. However, these tasks are highly subjective and dependent on human factors, such as one's environment, attitudes, beliefs, and lived experiences. Thus, employing LLMs (which do not have such human factors) in these tasks may result in a lack of variation in data, failing to reflect the diversity of human experiences. In this paper, we examine the role of prompting LLMs with human-like personas and asking the models to answer as if they were a specific human. This is done explicitly, with exact demographics, political beliefs, and lived experiences, or implicitly via names prevalent in specific populations. The LLM personas are then evaluated via (1) a subjective annotation task (e.g., detecting toxicity) and (2) a belief generation task, where both tasks are known to vary across human factors. We examine the impact of explicit vs. implicit personas and investigate which human factors LLMs recognize and respond to. Results show that LLM personas show mixed results when reproducing known human biases, but generally fail to demonstrate implicit biases. We conclude that LLMs lack the intrinsic cognitive mechanisms of human thought, while capturing the statistical patterns of how people speak, which may restrict their effectiveness in complex social science applications.
摘要：大型语言模型 (LLM) 越来越多地用于以人为中心的社会科学任务，例如数据注释、合成数据创建和参与对话。然而，这些任务具有高度主观性，并且依赖于人为因素，例如一个人的环境、态度、信仰和生活经历。因此，在这些任务中使用 LLM（没有这样的人为因素）可能会导致数据缺乏变化，无法反映人类经验的多样性。在本文中，我们研究了使用类似人类的角色提示 LLM 并要求模型像特定人类一样回答的作用。这是通过确切的人口统计、政治信仰和生活经历明确完成的，或者通过特定人群中流行的名称隐式完成的。然后通过 (1) 主观注释任务（例如，检测毒性）和 (2) 信念生成任务评估 LLM 角色，其中已知这两项任务因人为因素而异。我们研究了显性与隐性角色的影响，并调查了 LLM 识别和响应哪些人为因素。结果表明，LLM 角色在重现已知的人类偏见时表现出不同的结果，但通常无法表现出隐性偏见。我们得出结论，LLM 缺乏人类思维的内在认知机制，同时捕捉人们说话的统计模式，这可能会限制其在复杂社会科学应用中的有效性。

Title: Instruction Pre-Training: Language Models are Supervised Multitask Learners

Authors: Daixuan Cheng, Yuxian Gu, Shaohan Huang, Junyu Bi, Minlie Huang, Furu Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14491
Pdf URL: https://arxiv.org/pdf/2406.14491
Copy Paste: [[2406.14491]] Instruction Pre-Training: Language Models are Supervised Multitask Learners(https://arxiv.org/abs/2406.14491)
Keywords: language model
Abstract: Unsupervised multitask pre-training has been the critical method behind the recent success of language models (LMs). However, supervised multitask learning still holds significant promise, as scaling it in the post-training stage trends towards better generalization. In this paper, we explore supervised multitask pre-training by proposing Instruction Pre-Training, a framework that scalably augments massive raw corpora with instruction-response pairs to pre-train LMs. The instruction-response pairs are generated by an efficient instruction synthesizer built on open-source models. In our experiments, we synthesize 200M instruction-response pairs covering 40+ task categories to verify the effectiveness of Instruction Pre-Training. In pre-training from scratch, Instruction Pre-Training not only consistently enhances pre-trained base models but also benefits more from further instruction tuning. In continual pre-training, Instruction Pre-Training enables Llama3-8B to be comparable to or even outperform Llama3-70B. Our model, code, and data are available at this https URL.
摘要：无监督多任务预训练是语言模型 (LM) 近期成功的关键方法。然而，监督多任务学习仍然具有巨大的前景，因为在训练后阶段对其进行扩展会趋向于更好的泛化。在本文中，我们通过提出指令预训练来探索监督多任务预训练，该框架可扩展地用指令-响应对扩充大量原始语料库以预训练 LM。指令-响应对由基于开源模型构建的高效指令合成器生成。在我们的实验中，我们合成了 2 亿个指令-响应对，涵盖 40 多个任务类别，以验证指令预训练的有效性。在从头开始的预训练中，指令预训练不仅可以持续增强预训练的基础模型，还可以从进一步的指令调整中受益更多。在持续的预训练中，指令预训练使 Llama3-8B 能够与 Llama3-70B 相媲美甚至超越 Llama3-70B。我们的模型、代码和数据可在此 https URL 上获取。

Title: LLaSA: Large Multimodal Agent for Human Activity Analysis Through Wearable Sensors

Authors: Sheikh Asif Imran, Mohammad Nur Hossain Khan, Subrata Biswas, Bashima Islam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14498
Pdf URL: https://arxiv.org/pdf/2406.14498
Copy Paste: [[2406.14498]] LLaSA: Large Multimodal Agent for Human Activity Analysis Through Wearable Sensors(https://arxiv.org/abs/2406.14498)
Keywords: language model, llm, agent
Abstract: Integrating inertial measurement units (IMUs) with large language models (LLMs) advances multimodal AI by enhancing human activity understanding. We introduce SensorCaps, a dataset of 26,288 IMU-derived activity narrations, and OpenSQA, an instruction-following dataset with 257,562 question-answer pairs. Combining LIMU-BERT and Llama, we develop LLaSA, a Large Multimodal Agent capable of interpreting and responding to activity and motion analysis queries. Our evaluation demonstrates LLaSA's effectiveness in activity classification and question answering, highlighting its potential in healthcare, sports science, and human-computer interaction. These contributions advance sensor-aware language models and open new research avenues. Our code repository and datasets can be found on this https URL.
摘要：将惯性测量单元 (IMU) 与大型语言模型 (LLM) 相结合，可增强对人类活动的理解，从而推动多模态 AI 的发展。我们引入了 SensorCaps（一个包含 26,288 个 IMU 衍生活动叙述的数据集）和 OpenSQA（一个包含 257,562 个问答对的指令跟踪数据集）。结合 LIMU-BERT 和 Llama，我们开发了 LLaSA，这是一个能够解释和响应活动和运动分析查询的大型多模态代理。我们的评估证明了 LLaSA 在活动分类和问答方面的有效性，凸显了其在医疗保健、运动科学和人机交互方面的潜力。这些贡献推动了传感器感知语言模型的发展，并开辟了新的研究途径。我们的代码存储库和数据集可在此 https URL 上找到。

Title: Improving Expert Radiology Report Summarization by Prompting Large Language Models with a Layperson Summary

Authors: Xingmeng Zhao, Tongnian Wang, Anthony Rios
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14500
Pdf URL: https://arxiv.org/pdf/2406.14500
Copy Paste: [[2406.14500]] Improving Expert Radiology Report Summarization by Prompting Large Language Models with a Layperson Summary(https://arxiv.org/abs/2406.14500)
Keywords: language model, llm, prompt
Abstract: Radiology report summarization (RRS) is crucial for patient care, requiring concise "Impressions" from detailed "Findings." This paper introduces a novel prompting strategy to enhance RRS by first generating a layperson summary. This approach normalizes key observations and simplifies complex information using non-expert communication techniques inspired by doctor-patient interactions. Combined with few-shot in-context learning, this method improves the model's ability to link general terms to specific findings. We evaluate this approach on the MIMIC-CXR, CheXpert, and MIMIC-III datasets, benchmarking it against 7B/8B parameter state-of-the-art open-source large language models (LLMs) like Meta-Llama-3-8B-Instruct. Our results demonstrate improvements in summarization accuracy and accessibility, particularly in out-of-domain tests, with improvements as high as 5% for some metrics.
摘要：放射学报告摘要 (RRS) 对患者护理至关重要，需要从详细的“发现”中得出简明的“印象”。本文介绍了一种新颖的提示策略，通过首先生成外行摘要来增强 RRS。这种方法使用受医患互动启发的非专家沟通技巧来规范关键观察并简化复杂信息。结合少量上下文学习，该方法提高了模型将一般术语与特定发现联系起来的能力。我们在 MIMIC-CXR、CheXpert 和 MIMIC-III 数据集上评估了这种方法，并将其与 7B/8B 参数最先进的开源大型语言模型 (LLM)（如 Meta-Llama-3-8B-Instruct）进行基准测试。我们的结果表明，摘要准确性和可访问性有所提高，特别是在域外测试中，某些指标的改进高达 5%。

Title: Overview of the CAIL 2023 Argument Mining Track

Authors: Jingcong Liang, Junlong Wang, Xinyu Zhai, Yungui Zhuang, Yiyang Zheng, Xin Xu, Xiandong Ran, Xiaozheng Dong, Honghui Rong, Yanlun Liu, Hao Chen, Yuhan Wei, Donghai Li, Jiajie Peng, Xuanjing Huang, Chongde Shi, Yansong Feng, Yun Song, Zhongyu Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14503
Pdf URL: https://arxiv.org/pdf/2406.14503
Copy Paste: [[2406.14503]] Overview of the CAIL 2023 Argument Mining Track(https://arxiv.org/abs/2406.14503)
Keywords: language model
Abstract: We give a detailed overview of the CAIL 2023 Argument Mining Track, one of the Chinese AI and Law Challenge (CAIL) 2023 tracks. The main goal of the track is to identify and extract interacting argument pairs in trial dialogs. It mainly uses summarized judgment documents but can also refer to trial recordings. The track consists of two stages, and we introduce the tasks designed for each stage; we also extend the data from previous events into a new dataset -- CAIL2023-ArgMine -- with annotated new cases from various causes of action. We outline several submissions that achieve the best results, including their methods for different stages. While all submissions rely on language models, they have incorporated strategies that may benefit future work in this field.
摘要：我们详细介绍了 CAIL 2023 论证挖掘赛道，这是中国人工智能与法律挑战赛 (CAIL) 2023 赛道之一。该赛道的主要目标是识别和提取审判对话中相互作用的论证对。它主要使用总结性的判决文件，但也可以参考审判记录。该赛道由两个阶段组成，我们介绍了为每个阶段设计的任务；我们还将以前事件的数据扩展到一个新的数据集——CAIL2023-ArgMine——其中包含来自各种诉讼原因的带注释的新案例。我们概述了几个取得最佳结果的提交，包括它们针对不同阶段的方法。虽然所有提交都依赖于语言模型，但它们已经采用了可能有益于该领域未来工作的策略。

Title: Translating Across Cultures: LLMs for Intralingual Cultural Adaptation

Authors: Pushpdeep Singh, Mayur Patidar, Lovekesh Vig
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14504
Pdf URL: https://arxiv.org/pdf/2406.14504
Copy Paste: [[2406.14504]] Translating Across Cultures: LLMs for Intralingual Cultural Adaptation(https://arxiv.org/abs/2406.14504)
Keywords: llm
Abstract: LLMs are increasingly being deployed for multilingual applications and have demonstrated impressive translation capabilities between several low- and high-resource languages. An aspect of translation that often gets overlooked is that of cultural adaptation, or modifying source culture references to suit the target culture. Cultural adaptation has applications across several creative industries and requires intimate knowledge of source and target cultures during translation. While specialized translation models still outperform LLMs on the machine translation task when viewed through the lens of correctness, they are not sensitive to cultural differences, often requiring manual correction. LLMs, on the other hand, have a rich reservoir of cultural knowledge embedded within their parameters that can potentially be exploited for such applications. In this paper we define the task of cultural adaptation and create an evaluation framework to benchmark different models for this task. We evaluate the performance of modern LLMs for cultural adaptation and analyze their cross-cultural knowledge while connecting related concepts across different cultures. We also analyze possible issues with automatic adaptation, including cultural biases and stereotypes. We hope that this task will offer more insight into the cultural understanding of LLMs and their creativity in cross-cultural scenarios.
摘要：LLM 越来越多地被部署用于多语言应用，并已在几种低资源和高资源语言之间表现出令人印象深刻的翻译能力。翻译中经常被忽视的一个方面是文化适应，即修改源文化参考以适应目标文化。文化适应在多个创意产业中都有应用，在翻译过程中需要对源文化和目标文化有深入的了解。虽然从正确性的角度来看，专门的翻译模型在机器翻译任务上的表现仍然优于 LLM，但它们对通常需要手动纠正的文化差异不敏感。另一方面，LLM 在其参数中嵌入了丰富的文化知识库，可以潜在地用于此类应用。在本文中，我们定义了文化适应任务，并创建了一个评估框架来对不同模型进行基准测试。我们评估现代 LLM 在文化适应方面的表现，并分析它们的跨文化知识，同时将不同文化之间的相关概念联系起来。我们还分析了自动适应可能出现的问题，包括文化偏见和刻板印象。我们希望这项任务能够为 LLM 的文化理解及其在跨文化场景中的创造力提供更多见解。

Title: Evidence of a log scaling law for political persuasion with large language models

Authors: Kobi Hackenburg, Ben M. Tappin, Paul Röttger, Scott Hale, Jonathan Bright, Helen Margetts
Subjects: cs.CL, cs.AI, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2406.14508
Pdf URL: https://arxiv.org/pdf/2406.14508
Copy Paste: [[2406.14508]] Evidence of a log scaling law for political persuasion with large language models(https://arxiv.org/abs/2406.14508)
Keywords: language model, llm
Abstract: Large language models can now generate political messages as persuasive as those written by humans, raising concerns about how far this persuasiveness may continue to increase with model size. Here, we generate 720 persuasive messages on 10 U.S. political issues from 24 language models spanning several orders of magnitude in size. We then deploy these messages in a large-scale randomized survey experiment (N = 25,982) to estimate the persuasive capability of each model. Our findings are twofold. First, we find evidence of a log scaling law: model persuasiveness is characterized by sharply diminishing returns, such that current frontier models are barely more persuasive than models smaller in size by an order of magnitude or more. Second, mere task completion (coherence, staying on topic) appears to account for larger models' persuasive advantage. These findings suggest that further scaling model size will not much increase the persuasiveness of static LLM-generated messages.
摘要：大型语言模型现在可以生成与人类编写的政治信息一样具有说服力的信息，这引发了人们对这种说服力会随着模型规模的扩大而继续增加多少的担忧。在这里，我们从 24 个语言模型中生成了 720 条关于 10 个美国政治问题的说服性信息，这些模型的规模相差几个数量级。然后，我们将这些信息部署在一项大规模随机调查实验 (N = 25,982) 中，以估计每个模型的说服能力。我们的发现有两个方面。首先，我们发现了对数缩放定律的证据：模型说服力的特点是收益急剧递减，因此当前的前沿模型的说服力几乎不比规模小一个数量级或更多的模型高。其次，单纯的任务完成（连贯性、保持主题）似乎可以解释较大模型的说服力优势。这些发现表明，进一步扩大模型规模不会大大提高静态 LLM 生成的消息的说服力。
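
One way to read the "log scaling law" in the abstract above is that persuasive effect grows roughly linearly in the logarithm of model size, so each tenfold increase in parameters adds about the same absolute gain and the gain per added parameter shrinks quickly. The sketch below fits such a curve by ordinary least squares. The functional form `effect = a + b * ln(size)` is an assumption about what the abstract means, and any numbers fed to it in the usage note are synthetic, not results from the paper.

```python
import math


def fit_log_law(sizes, effects):
    """Least-squares fit of effect = a + b * ln(size); returns (a, b).

    sizes:   model sizes in parameters (positive numbers).
    effects: measured effect for each model, same length as sizes.
    """
    xs = [math.log(s) for s in sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(effects) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, effects))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def gain_per_tenfold(b):
    """Absolute gain predicted for every 10x increase in model size."""
    return b * math.log(10)
```

For example, fitting synthetic points generated from `2 + 0.5 * ln(size)` recovers `a` close to 2 and `b` close to 0.5, and `gain_per_tenfold(b)` then gives the constant increment per order of magnitude.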

Title: Investigating Mysteries of CoT-Augmented Distillation

Authors: Somin Wadhwa, Silvio Amir, Byron C. Wallace
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14511
Pdf URL: https://arxiv.org/pdf/2406.14511
Copy Paste: [[2406.14511]] Investigating Mysteries of CoT-Augmented Distillation(https://arxiv.org/abs/2406.14511)
Keywords: llm
Abstract: Eliciting "chain of thought" (CoT) rationales -- sequences of tokens that convey a "reasoning" process -- has been shown to consistently improve LLM performance on tasks like question answering. More recent efforts have shown that such rationales can also be used for model distillation: Including CoT sequences (elicited from a large "teacher" model) in addition to target labels when fine-tuning a small student model yields (often substantial) improvements. In this work we ask: Why and how does this additional training signal help in model distillation? We perform ablations to interrogate this, and report some potentially surprising results. Specifically: (1) Placing CoT sequences after labels (rather than before) realizes consistently better downstream performance -- this means that no student "reasoning" is necessary at test time to realize gains. (2) When rationales are appended in this way, they need not be coherent reasoning sequences to yield improvements; performance increases are robust to permutations of CoT tokens, for example. In fact, (3) a small number of key tokens are sufficient to achieve improvements equivalent to those observed when full rationales are used in model distillation.
摘要：引出“思路链”（CoT）原理（传达“推理”过程的标记序列）已被证明可以持续提高 LLM 在问答等任务上的表现。最近的研究表明，这种原理也可用于模型提炼：在对小型学生模型进行微调时，除了目标标签外，还包括 CoT 序列（从大型“教师”模型中引出）可产生（通常可观的）改进。在这项工作中，我们问：为什么以及如何将这种额外的训练信号帮助模型提炼？我们进行消融来询问这一点，并报告一些可能令人惊讶的结果。具体来说：（1）将 CoT 序列放在标签之后（而不是之前）可实现持续更好的下游性能——这意味着在测试时不需要学生“推理”即可实现收益。（2）当以这种方式附加原理时，它们不需要是连贯的推理序列即可产生改进；例如，性能提升对 CoT 标记的排列具有鲁棒性。事实上，（3）少量的关键标记足以实现与在模型提炼中使用完整原理时观察到的改进相同的改进。
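Note: findings (1) and (2) in the abstract above concern how the student's fine-tuning target is laid out. The sketch below shows the two orderings and a token-permuted rationale. The `Answer:` / `Rationale:` template, the function names, and whitespace tokenization are hypothetical choices made for illustration; the abstract does not specify the authors' exact format.

```python
import random


def build_target(label: str, rationale: str, cot_after_label: bool = True) -> str:
    """Build a fine-tuning target string for the student model.

    With cot_after_label=True the label comes first, so at test time the
    student can stop decoding after the label and skip the rationale.
    """
    if cot_after_label:
        return f"Answer: {label}\nRationale: {rationale}"
    return f"Rationale: {rationale}\nAnswer: {label}"


def shuffle_rationale(rationale: str, seed: int = 0) -> str:
    """Permute whitespace-separated rationale tokens (ablation for finding 2)."""
    tokens = rationale.split()
    random.Random(seed).shuffle(tokens)
    return " ".join(tokens)


if __name__ == "__main__":
    rationale = "The passage says the store closes at nine so it is open at eight"
    print(build_target("yes", rationale, cot_after_label=True))
    print(build_target("yes", shuffle_rationale(rationale), cot_after_label=True))
```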

Title: Unmasking Database Vulnerabilities: Zero-Knowledge Schema Inference Attacks in Text-to-SQL Systems

Authors: Đorđe Klisura, Anthony Rios
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14545
Pdf URL: https://arxiv.org/pdf/2406.14545
Copy Paste: [[2406.14545]] Unmasking Database Vulnerabilities: Zero-Knowledge Schema Inference Attacks in Text-to-SQL Systems(https://arxiv.org/abs/2406.14545)
Keywords: language model
Abstract: Relational databases are integral to modern information systems, serving as the foundation for storing, querying, and managing data efficiently and effectively. Advancements in large language modeling have led to the emergence of text-to-SQL technologies, significantly enhancing the querying and extracting of information from these databases and raising concerns about privacy and security. Our research extracts the database schema elements underlying a text-to-SQL model. Knowledge of the schema can make attacks such as SQL injection easier. By asking specially crafted questions, we have developed a zero-knowledge framework designed to probe various database schema elements without knowledge of the database itself. The text-to-SQL models then process these questions to produce an output that we use to uncover the structure of the database schema. We apply it to specialized text-to-SQL models fine-tuned on text-SQL pairs and generative language models used for SQL generation. Overall, we can reconstruct the table names with an F1 of nearly 0.75 for fine-tuned models and 0.96 for generative models.
摘要：关系数据库是现代信息系统不可或缺的一部分，是高效存储、查询和管理数据的基础。大型语言建模的进步导致了文本到 SQL 技术的出现，大大增强了从这些数据库中查询和提取信息的能力，并引发了对隐私和安全的担忧。我们的研究提取了文本到 SQL 模型所依赖的数据库模式元素。了解模式可以使 SQL 注入等攻击变得更容易。通过提出特制的问题，我们开发了一个零知识框架，旨在在不了解数据库本身的情况下探测各种数据库模式元素。然后，文本到 SQL 模型处理这些问题以生成输出，我们使用该输出来揭示数据库模式的结构。我们将其应用于针对文本 SQL 对进行微调的专用文本到 SQL 模型和用于 SQL 生成的生成语言模型。总体而言，我们可以重建表名，微调模型的 F1 接近 0.75，生成模型的 F1 接近 0.96。
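Note: the F1 figures in the abstract above score how well reconstructed table names match the true schema. A minimal sketch of a set-based F1 follows. The table names are invented, and exact case-insensitive matching is an assumption; the abstract does not state the authors' matching rule.

```python
def table_name_f1(predicted: set[str], actual: set[str]) -> float:
    """Set-based F1 between inferred and true table names (case-insensitive)."""
    pred = {name.lower() for name in predicted}
    gold = {name.lower() for name in actual}
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    inferred = {"users", "orders", "invoices"}            # hypothetical attack output
    schema = {"Users", "Orders", "Products", "Payments"}  # hypothetical true schema
    # precision = 2/3, recall = 2/4, F1 = 4/7
    print(round(table_name_f1(inferred, schema), 3))
```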

Title: Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data

Authors: Johannes Treutlein, Dami Choi, Jan Betley, Cem Anil, Samuel Marks, Roger Baker Grosse, Owain Evans
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.14546
Pdf URL: https://arxiv.org/pdf/2406.14546
Copy Paste: [[2406.14546]] Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data(https://arxiv.org/abs/2406.14546)
Keywords: language model, llm
Abstract: One way to address safety risks from large language models (LLMs) is to censor dangerous knowledge from their training data. While this removes the explicit information, implicit information can remain scattered across various training documents. Could an LLM infer the censored knowledge by piecing together these implicit hints? As a step towards answering this question, we study inductive out-of-context reasoning (OOCR), a type of generalization in which LLMs infer latent information from evidence distributed across training documents and apply it to downstream tasks without in-context learning. Using a suite of five tasks, we demonstrate that frontier LLMs can perform inductive OOCR. In one experiment we finetune an LLM on a corpus consisting only of distances between an unknown city and other known cities. Remarkably, without in-context examples or Chain of Thought, the LLM can verbalize that the unknown city is Paris and use this fact to answer downstream questions. Further experiments show that LLMs trained only on individual coin flip outcomes can verbalize whether the coin is biased, and those trained only on pairs $(x,f(x))$ can articulate a definition of $f$ and compute inverses. While OOCR succeeds in a range of cases, we also show that it is unreliable, particularly for smaller LLMs learning complex structures. Overall, the ability of LLMs to "connect the dots" without explicit in-context learning poses a potential obstacle to monitoring and controlling the knowledge acquired by LLMs.
摘要：解决大型语言模型 (LLM) 安全风险的一种方法是从其训练数据中删除危险知识。虽然这会删除显式信息，但隐式信息可能仍然分散在各种训练文档中。LLM 能否通过拼凑这些隐式提示来推断出被审查的知识？为了回答这个问题，我们研究了归纳式非语境推理 (OOCR)，这是一种泛化，其中 LLM 从分布在训练文档中的证据中推断出潜在信息，并将其应用于下游任务而无需语境学习。使用一组五个任务，我们证明前沿 LLM 可以执行归纳式 OOCR。在一项实验中，我们对仅包含未知城市与其他已知城市之间距离的语料库上的 LLM 进行了微调。值得注意的是，在没有语境示例或思路链的情况下，LLM 可以用言语表达未知城市是巴黎，并使用这一事实来回答下游问题。进一步的实验表明，仅对单个硬币翻转结果进行训练的 LLM 可以表达硬币是否有偏差，而仅对对 $(x,f(x))$ 进行训练的 LLM 可以清晰地表达 $f$ 的定义并计算逆。虽然 OOCR 在一系列情况下都取得了成功，但我们也表明它并不可靠，尤其是对于学习复杂结构的小型 LLM 而言。总体而言，LLM 在没有明确的上下文学习的情况下“连接点”的能力对监控和控制 LLM 获得的知识构成了潜在的障碍。
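Note: the city-distance experiment in the abstract above fine-tunes on documents that give only distances between an unnamed city and known cities. The sketch below generates such documents with the haversine formula. The sentence template, the alias `City X`, the choice of known cities, and the coordinates are assumptions for illustration, not the authors' data.

```python
import math

EARTH_RADIUS_KM = 6371.0

# Approximate (latitude, longitude) in degrees; illustrative values.
KNOWN_CITIES = {
    "London": (51.5074, -0.1278),
    "Berlin": (52.5200, 13.4050),
    "Madrid": (40.4168, -3.7038),
}
HIDDEN_CITY = (48.8566, 2.3522)  # Paris; its name never appears in the documents


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def make_documents(alias: str = "City X") -> list[str]:
    """One training sentence per known city, referring to the hidden city by alias."""
    docs = []
    for name, (lat, lon) in KNOWN_CITIES.items():
        km = haversine_km(HIDDEN_CITY[0], HIDDEN_CITY[1], lat, lon)
        docs.append(f"The distance between {alias} and {name} is {km:.0f} km.")
    return docs


if __name__ == "__main__":
    for doc in make_documents():
        print(doc)
```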

Title: GraphReader: Building Graph-based Agent to Enhance Long-Context Abilities of Large Language Models

Authors: Shilong Li, Yancheng He, Hangyu Guo, Xingyuan Bu, Ge Bai, Jie Liu, Jiaheng Liu, Xingwei Qu, Yangguang Li, Wanli Ouyang, Wenbo Su, Bo Zheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2406.14550
Pdf URL: https://arxiv.org/pdf/2406.14550
Copy Paste: [[2406.14550]] GraphReader: Building Graph-based Agent to Enhance Long-Context Abilities of Large Language Models(https://arxiv.org/abs/2406.14550)
Keywords: language model, gpt, llm, long context, agent
Abstract: Long-context capabilities are essential for large language models (LLMs) to tackle complex and long-input tasks. Despite numerous efforts made to optimize LLMs for long contexts, challenges persist in robustly processing long inputs. In this paper, we introduce GraphReader, a graph-based agent system designed to handle long texts by structuring them into a graph and employing an agent to explore this graph autonomously. Upon receiving a question, the agent first undertakes a step-by-step analysis and devises a rational plan. It then invokes a set of predefined functions to read node content and neighbors, facilitating a coarse-to-fine exploration of the graph. Throughout the exploration, the agent continuously records new insights and reflects on current circumstances to optimize the process until it has gathered sufficient information to generate an answer. Experimental results on the LV-Eval dataset reveal that GraphReader, using a 4k context window, consistently outperforms GPT-4-128k across context lengths from 16k to 256k by a large margin. Additionally, our approach demonstrates superior performance on four challenging single-hop and multi-hop benchmarks.
摘要：长上下文能力对于大型语言模型 (LLM) 处理复杂和长输入任务至关重要。尽管为优化 LLM 以适应长上下文做出了许多努力，但在稳健地处理长输入方面仍然存在挑战。在本文中，我们介绍了 GraphReader，这是一个基于图的代理系统，旨在通过将长文本构造成图并使用代理自主探索该图来处理长文本。收到问题后，代理首先进行逐步分析并制定合理的计划。然后，它调用一组预定义函数来读取节点内容和邻居，从而促进对图的粗到细的探索。在整个探索过程中，代理不断记录新的见解并反思当前情况以优化流程，直到它收集到足够的信息来生成答案。LV-Eval 数据集上的实验结果表明，使用 4k 上下文窗口的 GraphReader 在 16k 到 256k 的上下文长度上始终大大优于 GPT-4-128k。此外，我们的方法在四个具有挑战性的单跳和多跳基准上表现出优异的性能。

Title: How to Compute the Probability of a Word

Authors: Tiago Pimentel, Clara Meister
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2406.14561
Pdf URL: https://arxiv.org/pdf/2406.14561
Copy Paste: [[2406.14561]] How to Compute the Probability of a Word(https://arxiv.org/abs/2406.14561)
Keywords: language model, gpt
Abstract: Language models (LMs) estimate the probability distribution over sequences of natural language; these distributions are crucial for computing perplexity and surprisal in linguistics research. While we are usually concerned with measuring these values for words, most LMs operate over subwords. Although it seems straightforward, accurately computing probabilities over one unit given probabilities over the other requires care. Indeed, we show here that many recent linguistic studies have been incorrectly computing these values. This paper derives the correct methods for computing word probabilities, highlighting issues when relying on language models that use beginning-of-word (bow)-marking tokenisers, e.g., the GPT family. Empirically, we show that correcting the widespread bug in probability computations affects measured outcomes in sentence comprehension and lexical optimisation analyses.
摘要：语言模型 (LM) 估计自然语言序列的概率分布；这些分布对于计算语言学研究中的困惑度和意外度至关重要。虽然我们通常关注的是测量单词的这些值，但大多数 LM 都是针对子词进行操作的。尽管看似简单，但要准确计算一个单元的概率，而给定另一个单元的概率，则需要小心谨慎。事实上，我们在这里表明，许多最近的语言学研究一直在错误地计算这些值。本文推导出计算单词概率的正确方法，强调了依赖使用词首 (bow) 标记标记器的语言模型（例如 GPT 系列）时的问题。从经验上讲，我们表明，纠正概率计算中普遍存在的错误会影响句子理解和词汇优化分析中的测量结果。

Title: Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities

Authors: Sachit Menon, Richard Zemel, Carl Vondrick
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities(https://arxiv.org/abs/)
Keywords: language model, gpt, prompt, chain-of-thought
Abstract: When presented with questions involving visual thinking, humans naturally switch reasoning modalities, often forming mental images or drawing visual aids. Large language models have shown promising results in arithmetic and symbolic reasoning by expressing intermediate reasoning in text as a chain of thought, yet struggle to extend this capability to answer text queries that are easily solved by visual reasoning, even with extensive multimodal pretraining. We introduce a simple method, whiteboard-of-thought prompting, to unlock the visual reasoning capabilities of multimodal large language models across modalities. Whiteboard-of-thought prompting provides multimodal large language models with a metaphorical "whiteboard" to draw out reasoning steps as images, then returns these images to the model for further processing. We find this can be accomplished with no demonstrations or specialized modules, instead leveraging models' existing ability to write code with libraries such as Matplotlib and Turtle. This simple approach shows state-of-the-art results on four difficult natural language tasks that involve visual and spatial reasoning. We identify multiple settings where GPT-4o using chain-of-thought fails dramatically, including more than one where it achieves 0% accuracy, while whiteboard-of-thought enables up to 92% accuracy in these same settings. We present a detailed exploration of where the technique succeeds as well as its sources of error.
摘要：当提出涉及视觉思维的问题时，人类会自然地切换推理模式，通常会形成心理图像或绘制视觉辅助工具。大型语言模型通过将文本中的中间推理表达为思维链，在算术和符号推理中显示出良好的结果，但很难将这种能力扩展到回答可以通过视觉推理轻松解决的文本查询，即使经过了大量的多模态预训练。我们引入了一种简单的方法，即思维白板提示，以解锁跨模态的多模态大型语言模型的视觉推理能力。思维白板提示为多模态大型语言模型提供了一个隐喻性的“白板”，以将推理步骤绘制为图像，然后将这些图像返回给模型进行进一步处理。我们发现这可以在没有演示或专门模块的情况下完成，而是利用模型现有的使用 Matplotlib 和 Turtle 等库编写代码的能力。这种简单的方法在涉及视觉和空间推理的四个困难的自然语言任务上展示了最先进的结果。我们发现，在多种情况下，使用思维链的 GPT-4o 会严重失败，其中不止一种情况下准确率仅为 $0\%$，而思维白板在相同情况下准确率高达 $92\%$。我们详细探讨了该技术的成功之处及其错误来源。

Title: Model Merging and Safety Alignment: One Bad Model Spoils the Bunch

Authors: Hasan Abed Al Kader Hammoud, Umberto Michieli, Fabio Pizzati, Philip Torr, Adel Bibi, Bernard Ghanem, Mete Ozay
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2406.14563
Pdf URL: https://arxiv.org/pdf/2406.14563
Copy Paste: [[2406.14563]] Model Merging and Safety Alignment: One Bad Model Spoils the Bunch(https://arxiv.org/abs/2406.14563)
Keywords: language model, llm
Abstract: Merging Large Language Models (LLMs) is a cost-effective technique for combining multiple expert LLMs into a single versatile model, retaining the expertise of the original ones. However, current approaches often overlook the importance of safety alignment during merging, leading to highly misaligned models. This work investigates the effects of model merging on alignment. We evaluate several popular model merging techniques, demonstrating that existing methods not only transfer domain expertise but also propagate misalignment. We propose a simple two-step approach to address this problem: (i) generating synthetic safety and domain-specific data, and (ii) incorporating these generated data into the optimization process of existing data-aware model merging techniques. This allows us to treat alignment as a skill that can be maximized in the resulting merged LLM. Our experiments illustrate the effectiveness of integrating alignment-related data during merging, resulting in models that excel in both domain expertise and alignment.
摘要：合并大型语言模型 (LLM) 是一种经济高效的技术，可将多个专家 LLM 组合成一个通用模型，同时保留原始 LLM 的专业知识。然而，当前的方法往往忽视了合并过程中安全性对齐的重要性，导致模型高度不对齐。这项工作研究了模型合并对对齐的影响。我们评估了几种流行的模型合并技术，表明现有方法不仅可以传递领域专业知识，还可以传播不对齐。我们提出了一种简单的两步方法来解决这个问题：(i) 生成合成安全性和领域特定数据，以及 (ii) 将这些生成的数据纳入现有数据感知模型合并技术的优化过程中。这使我们能够将对齐视为可以在合并后的 LLM 中最大化的技能。我们的实验说明了在合并过程中集成对齐相关数据的有效性，从而产生了在领域专业知识和对齐方面都表现出色的模型。