2024-10-25

Title: Analyzing Nobel Prize Literature with Large Language Models

Authors: Yang Zhenyuan, Liu Zhengliang, Zhang Jing, Lu Cen, Tai Jiaxin, Zhong Tianyang, Li Yiwei, Zhao Siyan, Yao Teng, Liu Qing, Yang Jinlin, Liu Qixin, Li Zhaowei, Wang Kexin, Ma Longjun, Zhu Dajiang, Ren Yudan, Ge Bao, Zhang Wei, Qiang Ning, Zhang Tuo, Liu Tianming
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.18142
Pdf URL: https://arxiv.org/pdf/2410.18142
Copy Paste: [[2410.18142]] Analyzing Nobel Prize Literature with Large Language Models(https://arxiv.org/abs/2410.18142)
Keywords: language model, llm
Abstract: This study examines the capabilities of advanced Large Language Models (LLMs), particularly the o1 model, in the context of literary analysis. The outputs of these models are compared directly to those produced by graduate-level human participants. By focusing on two Nobel Prize-winning short stories, 'Nine Chapters' by Han Kang, the 2024 laureate, and 'Friendship' by Jon Fosse, the 2023 laureate, the research explores the extent to which AI can engage with complex literary elements such as thematic analysis, intertextuality, cultural and historical contexts, linguistic and structural innovations, and character development. Given the Nobel Prize's prestige and its emphasis on cultural, historical, and linguistic richness, applying LLMs to these works provides a deeper understanding of both human and AI approaches to interpretation. The study uses qualitative and quantitative evaluations of coherence, creativity, and fidelity to the text, revealing the strengths and limitations of AI in tasks typically reserved for human expertise. While LLMs demonstrate strong analytical capabilities, particularly in structured tasks, they often fall short in emotional nuance and coherence, areas where human interpretation excels. This research underscores the potential for human-AI collaboration in the humanities, opening new opportunities in literary studies and beyond.
摘要：本研究考察了高级大型语言模型 (LLM)，特别是 o1 模型在文学分析方面的功能。这些模型的输出直接与研究生级别的人类参与者的输出进行比较。通过关注两篇诺贝尔奖获奖短篇小说，即 2024 年获奖者韩江的《九章》和 2023 年获奖者乔恩·福斯的《友谊》，该研究探索了人工智能在多大程度上能够参与复杂的文学元素，例如主题分析、互文性、文化和历史背景、语言和结构创新以及角色发展。鉴于诺贝尔奖的声望及其对文化、历史和语言丰富性的重视，将 LLM 应用于这些作品可以更深入地了解人类和人工智能的解读方法。该研究使用定性和定量评估来评估连贯性、创造力和对文本的保真度，揭示了人工智能在通常需要人类专业知识才能完成的任务中的优势和局限性。虽然 LLM 表现出很强的分析能力，特别是在结构化任务中，但它们往往在情感细微差别和连贯性方面有所欠缺，而这些方面人类的解读能力很强。这项研究强调了人类与人工智能在人文学科领域的合作潜力，为文学研究及其他领域开辟了新的机会。

Title: Meaning Typed Prompting: A Technique for Efficient, Reliable Structured Output Generation

Authors: Chandra Irugalbandara
Subjects: cs.CL, cs.AI, cs.PL
Abstract URL: https://arxiv.org/abs/2410.18146
Pdf URL: https://arxiv.org/pdf/2410.18146
Copy Paste: [[2410.18146]] Meaning Typed Prompting: A Technique for Efficient, Reliable Structured Output Generation(https://arxiv.org/abs/2410.18146)
Keywords: language model, llm, prompt
Abstract: Extending Large Language Models (LLMs) to advanced applications requires reliable structured output generation. Existing methods, which often rely on rigid JSON schemas, can lead to unreliable outputs, diminished reasoning capabilities, and increased computational overhead, limiting LLMs' adaptability for complex tasks. We introduce Meaning Typed Prompting (MTP), a technique for efficient structured output generation that integrates types, meanings, and abstractions, such as variables and classes, into the prompting process. By utilizing expressive type definitions, MTP enhances output clarity and reduces dependence on complex abstractions, simplifying development and improving implementation efficiency. This enables LLMs to understand relationships and generate structured data more effectively. Empirical evaluations on multiple benchmarks demonstrate that MTP outperforms existing frameworks in accuracy, reliability, consistency, and token efficiency. We present Semantix, a framework that implements MTP, providing practical insights into its application.
摘要：将大型语言模型 (LLM) 扩展到高级应用程序需要可靠的结构化输出生成。现有方法通常依赖于严格的 JSON 模式，这可能导致输出不可靠、推理能力下降和计算开销增加，从而限制了 LLM 对复杂任务的适应性。我们引入了含义类型提示 (MTP)，这是一种高效的结构化输出生成技术，它将类型、含义和抽象（例如变量和类）集成到提示过程中。通过利用富有表现力的类型定义，MTP 增强了输出清晰度并减少了对复杂抽象的依赖，简化了开发并提高了实施效率。这使 LLM 能够更有效地理解关系并生成结构化数据。对多个基准的实证评估表明，MTP 在准确性、可靠性、一致性和令牌效率方面优于现有框架。我们介绍了实现 MTP 的框架 Semantix，为其应用提供了实用见解。
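
Illustrative sketch (not taken from the paper): one way to read "integrating types and meanings into the prompt" is to render a typed class, together with a short meaning string per field, into the prompt text instead of a JSON schema. The `Person` class, the `meaning` metadata key, and the functions `render_type` and `build_prompt` are hypothetical names chosen for this example and are not Semantix's API.

```python
from dataclasses import dataclass, field, fields


# Hypothetical output type; each field carries a human-readable "meaning".
@dataclass
class Person:
    name: str = field(metadata={"meaning": "Full name of the person"})
    birth_year: int = field(metadata={"meaning": "Year of birth"})


def type_name(tp) -> str:
    """Return a compact name for a type annotation."""
    return getattr(tp, "__name__", str(tp))


def render_type(cls) -> str:
    """Render a dataclass as a compact typed signature with field meanings."""
    parts = []
    for f in fields(cls):
        meaning = f.metadata.get("meaning", "")
        parts.append(f'{f.name}: {type_name(f.type)} - "{meaning}"')
    return f"{cls.__name__}(" + ", ".join(parts) + ")"


def build_prompt(task: str, output_cls) -> str:
    """Combine the task text with the typed output description (assumed layout)."""
    return (
        f"{task}\n"
        f"Output type: {render_type(output_cls)}\n"
        "Return only a value of that type."
    )


if __name__ == "__main__":
    print(build_prompt("Extract the person mentioned in the text.", Person))
```

The rendered signature is much shorter than an equivalent JSON schema, which is the kind of token saving the abstract points to; how Semantix actually formats its prompts is not stated in this listing.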

Title: Future Token Prediction -- Causal Language Modelling with Per-Token Semantic State Vector for Multi-Token Prediction

Authors: Nicholas Walker
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.18160
Pdf URL: https://arxiv.org/pdf/2410.18160
Copy Paste: [[2410.18160]] Future Token Prediction -- Causal Language Modelling with Per-Token Semantic State Vector for Multi-Token Prediction(https://arxiv.org/abs/2410.18160)
Keywords: language model, gpt
Abstract: Causal decoder-only transformer models used for generative language modelling, such as Generative Pre-trained Transformers (GPT), are trained to predict the next token in a sequence based only on its previous tokens. Despite this simple training objective, they have proved to be powerful AI tools. However, only predicting the next token results in top layer embedding vectors that are highly token-focused. There may be benefits in generating embedding vectors at each token position that better capture the overall meaning of longer sequences of future text. Recent studies matching brain scans with deep language models suggest that humans also predict upcoming words when listening or reading but consider multiple future tokens rather than just one. This research investigates a new pretraining method called Future Token Prediction (FTP). In FTP, a large transformer encoder generates top layer embedding vectors for each token position, which, instead of being passed to a language head, are linearly and expansively projected to a pseudo-sequence, which is cross-attended to by a small transformer decoder to predict the next N tokens forward from that position in the sequence. The top layer embedding vectors from FTP models exhibit distinct properties compared to those from standard GPT models, varying smoothly along a text sequence as measured by cosine similarity between adjacent tokens. Text generated by FTP models shows improved topic coherence compared to standard GPT-like models trained with the same prediction perplexity for the next single token. The vectors are shown to better represent the topic of text based on the results of text classification examples. On a toy, but complex, coding problem, FTP networks produce significantly better results than GPT networks.
摘要：用于生成语言建模的仅解码器因果 Transformer 模型，例如生成式预训练 Transformer (GPT)，经过训练后仅根据序列中此前的标记来预测下一个标记。尽管这个训练目标很简单，但它们已被证明是强大的人工智能工具。然而，只预测下一个标记会导致顶层嵌入向量高度以标记为中心。在每个标记位置生成嵌入向量可能会有好处，可以更好地捕捉未来文本的较长序列的整体含义。最近将脑部扫描与深度语言模型相匹配的研究表明，人类在听或读时也会预测即将到来的单词，但会考虑多个未来标记，而不仅仅是一个。这项研究调查了一种称为未来标记预测 (FTP) 的新预训练方法。在 FTP 中，大型 Transformer 编码器为每个标记位置生成顶层嵌入向量，这些向量不是传递给语言头，而是线性和扩展地投影到伪序列，该伪序列由小型 Transformer 解码器进行交叉注意，以预测从序列中的该位置向前的下一个 N 个标记。 FTP 模型的顶层嵌入向量与标准 GPT 模型相比表现出不同的特性，根据相邻标记之间的余弦相似度测量，它们沿着文本序列平滑变化。与在下一个单个标记上达到相同预测困惑度的标准 GPT 类模型相比，FTP 模型生成的文本表现出更好的主题连贯性。根据文本分类示例的结果，这些向量可以更好地表示文本的主题。在一个玩具级但复杂的编程问题上，FTP 网络产生的结果明显优于 GPT 网络。
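
Illustrative sketch (not taken from the paper): the smoothness measurement mentioned in the abstract, cosine similarity between the top-layer embeddings of adjacent tokens, can be computed as below. The function names and the toy vectors are assumptions made for this example; real embeddings would come from a trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def adjacent_similarities(embeddings):
    """Similarity between each token embedding and the next one in the sequence."""
    return [cosine(a, b) for a, b in zip(embeddings, embeddings[1:])]


if __name__ == "__main__":
    # Toy 2-d "embeddings" for a three-token sequence (hypothetical values).
    toy = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(adjacent_similarities(toy))
```

Under this measure, a sequence whose embeddings vary smoothly would show adjacent similarities that stay close to 1, whereas highly token-focused embeddings would fluctuate more from position to position.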

Title: Gazelle: An Instruction Dataset for Arabic Writing Assistance

Authors: Samar M. Magdy, Fakhraddin Alwajih, Sang Yun Kwon, Reem Abdel-Salam, Muhammad Abdul-Mageed
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.18163
Pdf URL: https://arxiv.org/pdf/2410.18163
Copy Paste: [[2410.18163]] Gazelle: An Instruction Dataset for Arabic Writing Assistance(https://arxiv.org/abs/2410.18163)
Keywords: language model, gpt, llm
Abstract: Writing has long been considered a hallmark of human intelligence and remains a pinnacle task for artificial intelligence (AI) due to the intricate cognitive processes involved. Recently, rapid advancements in generative AI, particularly through the development of Large Language Models (LLMs), have significantly transformed the landscape of writing assistance. However, underrepresented languages like Arabic encounter significant challenges in the development of advanced AI writing tools, largely due to the limited availability of data. This scarcity constrains the training of effective models, impeding the creation of sophisticated writing assistance technologies. To address these issues, we present Gazelle, a comprehensive dataset for Arabic writing assistance. In addition, we offer an evaluation framework designed to enhance Arabic writing assistance tools. Our human evaluation of leading LLMs, including GPT-4, GPT-4o, Cohere Command R+, and Gemini 1.5 Pro, highlights their respective strengths and limitations in addressing the challenges of Arabic writing. Our findings underscore the need for continuous model training and dataset enrichment to manage the complexities of Arabic language processing, paving the way for more effective AI-powered Arabic writing tools.
摘要：写作长期以来被认为是人类智能的标志，由于涉及复杂的认知过程，写作仍然是人工智能 (AI) 的一项顶尖任务。最近，生成式人工智能的快速发展，特别是通过大型语言模型 (LLM) 的开发，极大地改变了写作辅助的格局。然而，像阿拉伯语这样代表性不足的语言在开发高级人工智能写作工具时遇到了重大挑战，这主要是由于数据有限。这种稀缺性限制了有效模型的训练，阻碍了复杂的写作辅助技术的创造。为了解决这些问题，我们提出了 Gazelle，这是一个全面的阿拉伯语写作辅助数据集。此外，我们还提供了一个旨在增强阿拉伯语写作辅助工具的评估框架。我们对领先的 LLM（包括 GPT-4、GPT-4o、Cohere Command R+ 和 Gemini 1.5 Pro）的人工评估突出了它们在应对阿拉伯语写作挑战方面各自的优势和局限性。我们的研究结果强调，需要持续进行模型训练和数据集丰富，以管理阿拉伯语处理的复杂性，为更有效的人工智能阿拉伯语写作工具铺平道路。

Title: CorrectionLM: Self-Corrections with SLM for Dialogue State Tracking

Authors: Chia-Hsuan Lee, Hao Cheng, Mari Ostendorf
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.18209
Pdf URL: https://arxiv.org/pdf/2410.18209
Copy Paste: [[2410.18209]] CorrectionLM: Self-Corrections with SLM for Dialogue State Tracking(https://arxiv.org/abs/2410.18209)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated self-improvement capabilities via feedback and refinement, but current small language models (SLMs) have had limited success in this area. Existing correction approaches often rely on distilling knowledge from LLMs, which imposes significant computation demands. In this work, we introduce CORRECTIONLM, a novel correction framework that enables SLMs to self-correct using in-context exemplars without LLM involvement. Applied to two dialogue state tracking (DST) tasks in low-resource settings, CORRECTIONLM achieves results similar to a state-of-the-art LLM at a small fraction of the computation costs.
摘要：大型语言模型 (LLM) 已展示出通过反馈和改进实现自我改进的能力，但目前的小型语言模型 (SLM) 在这方面取得的成功有限。现有的校正方法通常依赖于从 LLM 中提取知识，这会带来大量计算需求。在这项工作中，我们引入了 CORRECTIONLM，这是一种新颖的校正框架，它使 SLM 能够使用上下文样本进行自我校正，而无需 LLM 参与。在资源匮乏的环境中，CORRECTIONLM 应用于两个对话状态跟踪 (DST) 任务，以极低的计算成本实现了与最先进的 LLM 类似的结果。

Title: Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks

Authors: Samuele Poppi, Zheng-Xin Yong, Yifei He, Bobbie Chern, Han Zhao, Aobo Yang, Jianfeng Chi
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2410.18210
Pdf URL: https://arxiv.org/pdf/2410.18210
Copy Paste: [[2410.18210]] Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks(https://arxiv.org/abs/2410.18210)
Keywords: language model, llm, prompt
Abstract: Recent advancements in Large Language Models (LLMs) have sparked widespread concerns about their safety. Recent work demonstrates that safety alignment of LLMs can be easily removed by fine-tuning with a few adversarially chosen instruction-following examples, i.e., fine-tuning attacks. We take a further step to understand fine-tuning attacks in multilingual LLMs. We first discover cross-lingual generalization of fine-tuning attacks: using a few adversarially chosen instruction-following examples in one language, multilingual LLMs can also be easily compromised (e.g., multilingual LLMs fail to refuse harmful prompts in other languages). Motivated by this finding, we hypothesize that safety-related information is language-agnostic and propose a new method termed Safety Information Localization (SIL) to identify the safety-related information in the model parameter space. Through SIL, we validate this hypothesis and find that only changing 20% of weight parameters in fine-tuning attacks can break safety alignment across all languages. Furthermore, we provide evidence for the alternative pathways hypothesis for why freezing safety-related parameters does not prevent fine-tuning attacks, and we demonstrate that our attack vector can still jailbreak LLMs adapted to new languages.
摘要：大型语言模型 (LLM) 的最新进展引发了人们对其安全性的广泛担忧。最近的研究表明，通过使用一些对抗性选择的指令跟踪示例进行微调，即微调攻击，可以轻松消除 LLM 的安全对齐。我们进一步了解了多语言 LLM 中的微调攻击。我们首先发现微调攻击的跨语言泛化：使用一种语言中的一些对抗性选择的指令跟踪示例，多语言 LLM 也很容易受到攻击（例如，多语言 LLM 无法拒绝其他语言中的有害提示）。受此发现的启发，我们假设安全相关信息与语言无关，并提出了一种称为安全信息本地化 (SIL) 的新方法来识别模型参数空间中的安全相关信息。通过 SIL，我们验证了这一假设，并发现在微调攻击中仅更改 20% 的权重参数就可以破坏所有语言的安全对齐。此外，我们为替代途径假设提供了证据，解释了为什么冻结安全相关参数不能阻止微调攻击，并且我们证明了我们的攻击媒介仍然可以越狱适应新语言的 LLM。

Title: Generalizations across filler-gap dependencies in neural language models

Authors: Katherine Howitt, Sathvik Nair, Allison Dods, Robert Melvin Hopkins
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.18225
Pdf URL: https://arxiv.org/pdf/2410.18225
Copy Paste: [[2410.18225]] Generalizations across filler-gap dependencies in neural language models(https://arxiv.org/abs/2410.18225)
Keywords: language model
Abstract: Humans develop their grammars by making structural generalizations from finite input. We ask how filler-gap dependencies, which share a structural generalization despite diverse surface forms, might arise from the input. We explicitly control the input to a neural language model (NLM) to uncover whether the model posits a shared representation for filler-gap dependencies. We show that while NLMs do have success differentiating grammatical from ungrammatical filler-gap dependencies, they rely on superficial properties of the input, rather than on a shared generalization. Our work highlights the need for specific linguistic inductive biases to model language acquisition.
摘要：人类通过从有限的输入中进行结构概括来发展语法。我们想知道，尽管表面形式各异却共享同一结构概括的填充语-空位依赖关系，如何从输入中产生。我们明确控制神经语言模型 (NLM) 的输入，以发现该模型是否为填充语-空位依赖关系提出了一个共享表示。我们表明，虽然 NLM 确实能够成功区分合乎语法与不合语法的填充语-空位依赖关系，但它们依赖于输入的表面属性，而不是共享概括。我们的工作强调了要对语言习得进行建模，需要特定的语言归纳偏置。

Title: Multilingual Hallucination Gaps in Large Language Models

Authors: Cléa Chataigner, Afaf Taïk, Golnoosh Farnadi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.18270
Pdf URL: https://arxiv.org/pdf/2410.18270
Copy Paste: [[2410.18270]] Multilingual Hallucination Gaps in Large Language Models(https://arxiv.org/abs/2410.18270)
Keywords: language model, llm, hallucination, prompt
Abstract: Large language models (LLMs) are increasingly used as alternatives to traditional search engines given their capacity to generate text that resembles human language. However, this shift is concerning, as LLMs often generate hallucinations, misleading or false information that appears highly credible. In this study, we explore the phenomenon of hallucinations across multiple languages in freeform text generation, focusing on what we call multilingual hallucination gaps. These gaps reflect differences in the frequency of hallucinated answers depending on the prompt and language used. To quantify such hallucinations, we used the FactScore metric and extended its framework to a multilingual setting. We conducted experiments using LLMs from the LLaMA, Qwen, and Aya families, generating biographies in 19 languages and comparing the results to Wikipedia pages. Our results reveal variations in hallucination rates, especially between high and low resource languages, raising important questions about LLM multilingual performance and the challenges in evaluating hallucinations in multilingual freeform text generation.
摘要：大型语言模型 (LLM) 越来越多地被用作传统搜索引擎的替代品，因为它们能够生成类似于人类语言的文本。然而，这种转变令人担忧，因为 LLM 经常会产生幻觉，即看似高度可信的误导性或虚假信息。在本研究中，我们探讨了自由格式文本生成中跨多种语言的幻觉现象，重点关注我们所说的多语言幻觉差距。这些差距反映了幻觉答案频率的差异，具体取决于提示和使用的语言。为了量化这种幻觉，我们使用了 FactScore 指标并将其框架扩展到多语言环境。我们使用来自 LLaMA、Qwen 和 Aya 家族的 LLM 进行了实验，生成了 19 种语言的传记，并将结果与维基百科页面进行比较。我们的结果揭示了幻觉率的差异，尤其是在高资源和低资源语言之间，这提出了有关 LLM 多语言性能以及评估多语言自由格式文本生成中的幻觉的挑战的重要问题。
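
Illustrative sketch (not taken from the paper): FactScore-style metrics score a generation by the fraction of its atomic facts that a reference source supports. Assuming such per-fact judgments are already available, a per-language score and a gap relative to the best-scoring language could be computed as follows. The gap definition, the function names, and the toy judgments are assumptions made for this example, not the paper's exact formulation or results.

```python
def fact_score(judgments):
    """Fraction of atomic facts judged as supported (True) by the reference."""
    if not judgments:
        raise ValueError("at least one atomic-fact judgment is required")
    return sum(1 for j in judgments if j) / len(judgments)


def hallucination_gaps(judgments_by_language):
    """Gap between the best-scoring language and each language (assumed definition)."""
    scores = {lang: fact_score(j) for lang, j in judgments_by_language.items()}
    best = max(scores.values())
    return {lang: best - score for lang, score in scores.items()}


if __name__ == "__main__":
    # Hypothetical per-fact judgments for one biography generated in two languages.
    toy = {
        "en": [True, True, True, False],
        "sw": [True, False, False, False],
    }
    print(hallucination_gaps(toy))
```

A gap of zero marks the language with the fewest unsupported facts; larger values indicate languages where a larger share of generated facts is unsupported.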

Title: LEGO: Language Model Building Blocks

Authors: Shrenik Bhansali, Alwin Jin, Tyler Lizzo, Larry Heck
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.18287
Pdf URL: https://arxiv.org/pdf/2410.18287
Copy Paste: [[2410.18287]] LEGO: Language Model Building Blocks(https://arxiv.org/abs/2410.18287)
Keywords: language model, llm
Abstract: Large language models (LLMs) are essential in natural language processing (NLP) but are costly in data collection, pre-training, fine-tuning, and inference. Task-specific small language models (SLMs) offer a cheaper alternative but lack robustness and generalization. This paper proposes LEGO, a novel technique to extract SLMs from an LLM and recombine them. Using state-of-the-art LLM pruning strategies, we can create task- and user-specific SLM building blocks that are efficient for fine-tuning and inference while also preserving user data privacy. LEGO utilizes Federated Learning and a novel aggregation scheme for the LLM reconstruction, maintaining robustness without high costs and preserving user data privacy. We experimentally demonstrate the versatility of LEGO, showing its ability to enable model heterogeneity and mitigate the effects of data heterogeneity while maintaining LLM robustness.
摘要：大型语言模型 (LLM) 在自然语言处理 (NLP) 中必不可少，但在数据收集、预训练、微调和推理方面成本高昂。特定于任务的小型语言模型 (SLM) 提供了一种更便宜的替代方案，但缺乏稳健性和泛化能力。本文提出了 LEGO，这是一种从 LLM 中提取 SLM 并重新组合它们的新技术。使用最先进的 LLM 修剪策略，我们可以创建特定于任务和用户的 SLM 构建块，这些构建块可有效进行微调和推理，同时还能保护用户数据隐私。LEGO 利用联邦学习和一种新颖的聚合方案进行 LLM 重建，在不增加成本的情况下保持稳健性并保护用户数据隐私。我们通过实验证明了 LEGO 的多功能性，展示了它能够实现模型异构性并减轻数据异构性的影响，同时保持 LLM 的稳健性。

Title: Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems

Authors: Junyi Ye, Jingyi Gu, Xinyun Zhao, Wenpeng Yin, Guiling Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.18336
Pdf URL: https://arxiv.org/pdf/2410.18336
Copy Paste: [[2410.18336]] Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems(https://arxiv.org/abs/2410.18336)
Keywords: language model, llm
Abstract: The mathematical capabilities of AI systems are complex and multifaceted. Most existing research has predominantly focused on the correctness of AI-generated solutions to mathematical problems. In this work, we argue that beyond producing correct answers, AI systems should also be capable of, or assist humans in, developing novel solutions to mathematical challenges. This study explores the creative potential of Large Language Models (LLMs) in mathematical reasoning, an aspect that has received limited attention in prior research. We introduce a novel framework and benchmark, CreativeMath, which encompasses problems ranging from middle school curricula to Olympic-level competitions, designed to assess LLMs' ability to propose innovative solutions after some known solutions have been provided. Our experiments demonstrate that, while LLMs perform well on standard mathematical tasks, their capacity for creative problem-solving varies considerably. Notably, the Gemini-1.5-Pro model outperformed other LLMs in generating novel solutions. This research opens a new frontier in evaluating AI creativity, shedding light on both the strengths and limitations of LLMs in fostering mathematical innovation, and setting the stage for future developments in AI-assisted mathematical discovery.
摘要：人工智能系统的数学能力复杂且多面。大多数现有研究主要关注人工智能生成的数学问题解决方案的正确性。在这项研究中，我们认为，除了给出正确的答案之外，人工智能系统还应该能够或协助人类开发出解决数学难题的新解决方案。这项研究探讨了大型语言模型 (LLM) 在数学推理方面的创造潜力，而这一方面在之前的研究中很少受到关注。我们引入了一个新颖的框架和基准 CreativeMath，它涵盖了从中学课程到奥林匹克级比赛的各种问题，旨在评估 LLM 在提供一些已知解决方案后提出创新解决方案的能力。我们的实验表明，虽然 LLM 在标准数学任务上表现良好，但它们创造性解决问题的能力却有很大差异。值得注意的是，Gemini-1.5-Pro 模型在生成新解决方案方面优于其他 LLM。这项研究开辟了评估人工智能创造力的新领域，揭示了 LLM 在促进数学创新方面的优势和局限性，并为人工智能辅助数学发现的未来发展奠定了基础。

Title: Aggregated Knowledge Model: Enhancing Domain-Specific QA with Fine-Tuned and Retrieval-Augmented Generation Models

Authors: Fengchen Liu, Jordan Jung, Wei Feinstein, Jeff DAmbrogia, Gary Jung
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.18344
Pdf URL: https://arxiv.org/pdf/2410.18344
Copy Paste: [[2410.18344]] Aggregated Knowledge Model: Enhancing Domain-Specific QA with Fine-Tuned and Retrieval-Augmented Generation Models(https://arxiv.org/abs/2410.18344)
Keywords: language model, gpt, retrieval-augmented generation
Abstract: This paper introduces a novel approach to enhancing closed-domain Question Answering (QA) systems, focusing on the specific needs of the Lawrence Berkeley National Laboratory (LBL) Science Information Technology (ScienceIT) domain. Utilizing a rich dataset derived from the ScienceIT documentation, our study embarks on a detailed comparison of two fine-tuned large language models and five retrieval-augmented generation (RAG) models. Through data processing techniques, we transform the documentation into structured context-question-answer triples, leveraging the latest Large Language Models (AWS Bedrock, GCP PaLM2, Meta LLaMA2, OpenAI GPT-4, Google Gemini-Pro) for data-driven insights. Additionally, we introduce the Aggregated Knowledge Model (AKM), which synthesizes responses from the seven models mentioned above using K-means clustering to select the most representative answers. The evaluation of these models across multiple metrics offers a comprehensive look into their effectiveness and suitability for the LBL ScienceIT environment. The results demonstrate the potential benefits of integrating fine-tuning and retrieval-augmented strategies, highlighting significant performance improvements achieved with the AKM. The insights gained from this study can be applied to develop specialized QA systems tailored to specific domains.
摘要：本文介绍了一种增强闭域问答 (QA) 系统的新方法，重点关注劳伦斯伯克利国家实验室 (LBL) 科学信息技术 (ScienceIT) 领域的特定需求。利用来自 ScienceIT 文档的丰富数据集，我们的研究开始详细比较两个经过微调的大型语言模型和五个检索增强生成 (RAG) 模型。通过数据处理技术，我们将文档转换为结构化的上下文-问题-答案三元组，利用最新的大型语言模型 (AWS Bedrock、GCP PaLM2、Meta LLaMA2、OpenAI GPT-4、Google Gemini-Pro) 获得数据驱动的见解。此外，我们还介绍了聚合知识模型 (AKM)，它使用 K 均值聚类综合上述七个模型的响应以选择最具代表性的答案。通过多个指标对这些模型的评估，可以全面了解它们对 LBL ScienceIT 环境的有效性和适用性。结果证明了整合微调和检索增强策略的潜在优势，突出了使用 AKM 实现的显著性能改进。从这项研究中获得的见解可用于开发针对特定领域的专用 QA 系统。
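
Illustrative sketch (not taken from the paper): the abstract says the AKM uses K-means clustering to select the most representative answer among several models' responses. One plausible reading, assumed here, is to cluster answer embeddings, take the largest cluster, and return the answer closest to its centroid. The embedding vectors, the deterministic initialization, and all function names are hypothetical.

```python
def dist2(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids (assumption)."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        assign = [min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = mean(members)
    return assign, centroids


def most_representative(answers, vectors, k=2):
    """Return the answer nearest to the centroid of the largest cluster."""
    assign, centroids = kmeans(vectors, k)
    largest = max(range(k), key=lambda c: assign.count(c))
    members = [i for i, a in enumerate(assign) if a == largest]
    best = min(members, key=lambda i: dist2(vectors[i], centroids[largest]))
    return answers[best]


if __name__ == "__main__":
    # Hypothetical 2-d embeddings of four model answers; the last one is an outlier.
    answers = ["a", "b", "c", "d"]
    vectors = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]]
    print(most_representative(answers, vectors))
```

The choice of the largest cluster encodes a majority-agreement assumption: answers that several models converge on are treated as more trustworthy than an isolated outlier.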

Title: AdaEDL: Early Draft Stopping for Speculative Decoding of Large Language Models via an Entropy-based Lower Bound on Token Acceptance Probability

Authors: Sudhanshu Agrawal, Wonseok Jeon, Mingu Lee
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.18351
Pdf URL: https://arxiv.org/pdf/2410.18351
Copy Paste: [[2410.18351]] AdaEDL: Early Draft Stopping for Speculative Decoding of Large Language Models via an Entropy-based Lower Bound on Token Acceptance Probability(https://arxiv.org/abs/2410.18351)
Keywords: language model, llm
Abstract: Speculative decoding is a powerful technique that attempts to circumvent the autoregressive constraint of modern Large Language Models (LLMs). The aim of speculative decoding techniques is to improve the average inference time of a large target model without sacrificing its accuracy, by using a more efficient draft model to propose draft tokens which are then verified in parallel. The number of draft tokens produced in each drafting round is referred to as the draft length and is often a static hyperparameter chosen based on the acceptance rate statistics of the draft tokens. However, setting a static draft length can negatively impact performance, especially in scenarios where drafting is expensive and there is a high variance in the number of tokens accepted. Adaptive Entropy-based Draft Length (AdaEDL) is a simple, training-free and parameter-free criterion that allows for early stopping of the token drafting process by approximating a lower bound on the expected acceptance probability of the drafted token based on the currently observed entropy of the drafted logits. We show that AdaEDL consistently outperforms static draft-length speculative decoding by 10%-57% as well as other training-free draft-stopping techniques by up to 10% in a variety of settings and datasets. At the same time, we show that AdaEDL is more robust than these techniques and preserves performance in high-sampling-temperature scenarios. Since it is training-free, in contrast to techniques that rely on the training of dataset-specific draft-stopping predictors, AdaEDL can seamlessly be integrated into a variety of pre-existing LLM systems.
摘要：推测解码是一种强大的技术，它试图规避现代大型语言模型 (LLM) 的自回归约束。推测解码技术的目的是通过使用更高效的草稿模型来提出草稿标记，然后并行验证，从而在不牺牲其准确性的情况下提高大型目标模型的平均推理时间。每轮起草产生的草稿标记数量称为草稿长度，通常是根据草稿标记的接受率统计数据选择的静态超参数。但是，设置静态草稿长度会对性能产生负面影响，尤其是在起草成本高昂且接受的标记数量差异很大的情况下。基于熵的自适应草稿长度 (AdaEDL) 是一种简单、训练和无参数的标准，它允许通过基于当前观察到的起草逻辑熵近似起草标记的预期接受概率的下限来提前停止标记起草过程。我们表明，在各种设置和数据集中，AdaEDL 的表现始终优于静态草稿长度推测解码 10%-57%，也优于其他无需训练的草稿停止技术高达 10%。同时，我们表明 AdaEDL 比这些技术更强大，并且在高采样温度场景中保持性能。由于它是无需训练的，与依赖于特定于数据集的草稿停止预测器训练的技术相比，AdaEDL 可以无缝集成到各种现有的 LLM 系统中。
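
Illustrative sketch (not taken from the paper): the abstract describes stopping the drafting loop when an entropy-based lower bound on the acceptance probability becomes too small. The sketch below assumes a bound of the form 1 - sqrt(gamma * H), where H is the entropy of the draft distribution; this form, the constants `gamma` and `threshold`, and the function names are assumptions for illustration and are not given in this listing.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def acceptance_lower_bound(logits, gamma=0.5):
    """Assumed bound: 1 - sqrt(gamma * H(draft distribution))."""
    return 1.0 - math.sqrt(gamma * entropy(softmax(logits)))


def draft_length(step_logits, max_len, threshold=0.3, gamma=0.5):
    """Count draft tokens kept before the estimated bound drops below the threshold."""
    kept = 0
    for logits in step_logits[:max_len]:
        if acceptance_lower_bound(logits, gamma) < threshold:
            break  # the draft model is too uncertain; hand over to verification
        kept += 1
    return kept


if __name__ == "__main__":
    confident = [10.0, 0.0, 0.0, 0.0]  # low-entropy draft distribution
    uncertain = [0.0, 0.0, 0.0, 0.0]   # uniform, high-entropy draft distribution
    print(draft_length([confident, confident, uncertain, confident], max_len=4))
```

Because the rule only inspects the draft logits that are already computed, it adds no trained components, which matches the training-free property the abstract emphasizes.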

Title: Improving Model Factuality with Fine-grained Critique-based Evaluator

Authors: Yiqing Xie, Wenxuan Zhou, Pradyot Prakash, Di Jin, Yuning Mao, Quintin Fettes, Arya Talebzadeh, Sinong Wang, Han Fang, Carolyn Rose, Daniel Fried, Hejia Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.18359
Pdf URL: https://arxiv.org/pdf/2410.18359
Copy Paste: [[2410.18359]] Improving Model Factuality with Fine-grained Critique-based Evaluator(https://arxiv.org/abs/2410.18359)
Keywords: language model, llm, chat
Abstract: Factuality evaluation aims to detect factual errors produced by language models (LMs) and hence guide the development of more factual models. Towards this goal, we train a factuality evaluator, FenCE, that provides LM generators with claim-level factuality feedback. We conduct data augmentation on a combination of public judgment datasets to train FenCE to (1) generate textual critiques along with scores and (2) make claim-level judgment based on diverse source documents obtained by various tools. We then present a framework that leverages FenCE to improve the factuality of LM generators by constructing training data. Specifically, we generate a set of candidate responses, leverage FenCE to revise and score each response without introducing lesser-known facts, and train the generator by preferring highly scored revised responses. Experiments show that our data augmentation methods improve the evaluator's accuracy by 2.9% on LLM-AggreFact. With FenCE, we improve Llama3-8B-chat's factuality rate by 14.45% on FActScore, outperforming state-of-the-art factuality finetuning methods by 6.96%.
摘要：事实性评估旨在检测语言模型 (LM) 产生的事实错误，从而指导更多事实模型的开发。为了实现这一目标，我们训练了一个事实性评估器 FenCE，它为 LM 生成器提供声明级别的事实性反馈。我们对公共判断数据集的组合进行数据增强，以训练 FenCE (1) 生成文本评论和分数，以及 (2) 根据各种工具获得的不同源文档做出声明级别的判断。然后，我们提出了一个框架，利用 FenCE 通过构建训练数据来提高 LM 生成器的事实性。具体来说，我们生成一组候选响应，利用 FenCE 修改和评分每个响应，而不引入鲜为人知的事实，并通过优先选择高分修订响应来训练生成器。实验表明，我们的数据增强方法将评估器在 LLM-AggreFact 上的准确率提高了 2.9%。通过 FenCE，我们将 Llama3-8B-chat 在 FActScore 上的事实性率提高了 14.45%，比最先进的事实性微调方法高出 6.96%。

Title: MoMQ: Mixture-of-Experts Enhances Multi-Dialect Query Generation across Relational and Non-Relational Databases

Authors: Zhisheng Lin, Yifu Liu, Zhiling Luo, Jinyang Gao, Yu Li
Subjects: cs.CL, cs.AI, cs.DB, cs.LG
Abstract URL: https://arxiv.org/abs/2410.18406
Pdf URL: https://arxiv.org/pdf/2410.18406
Copy Paste: [[2410.18406]] MoMQ: Mixture-of-Experts Enhances Multi-Dialect Query Generation across Relational and Non-Relational Databases(https://arxiv.org/abs/2410.18406)
Keywords: language model, llm
Abstract: The improvement in translating natural language to structured query language (SQL) can be attributed to the advancements in large language models (LLMs). Open-source LLMs, tailored for specific database dialects such as MySQL, have shown great performance. However, cloud service providers are looking for a unified database manager service (e.g., Cosmos DB from Azure, Amazon Aurora from AWS, Lindorm from AlibabaCloud) that can support multiple dialects. This requirement has led to the concept of multi-dialect query generation, which presents challenges to LLMs. These challenges include syntactic differences among dialects and imbalanced data distribution across multiple dialects. To tackle these challenges, we propose MoMQ, a novel Mixture-of-Experts-based multi-dialect query generation framework across both relational and non-relational databases. MoMQ employs a dialect expert group for each dialect and a multi-level routing strategy to handle dialect-specific knowledge, reducing interference during query generation. Additionally, a shared expert group is introduced to address data imbalance, facilitating the transfer of common knowledge from high-resource dialects to low-resource ones. Furthermore, we have developed a high-quality multi-dialect query generation benchmark that covers relational and non-relational databases such as MySQL, PostgreSQL, Cypher for Neo4j, and nGQL for NebulaGraph. Extensive experiments have shown that MoMQ performs effectively and robustly even in resource-imbalanced scenarios.
摘要：自然语言到结构化查询语言 (SQL) 的翻译改进可以归因于大型语言模型 (LLM) 的进步。针对特定数据库方言（如 MySQL）量身定制的开源 LLM 表现出色。然而，云服务提供商正在寻找一种可以支持多种方言的统一数据库管理器服务（例如 Azure 的 Cosmos DB、AWS 的 Amazon Aurora、阿里云的 Lindorm）。这一要求导致了多方言查询生成概念的出现，这给 LLM 带来了挑战。这些挑战包括方言之间的语法差异以及多种方言之间不平衡的数据分布。为了应对这些挑战，我们提出了 MoMQ，这是一种基于混合专家的新型多方言查询生成框架，适用于关系数据库和非关系数据库。MoMQ 为每种方言配备一个方言专家组，并采用多级路由策略来处理特定于方言的知识，从而减少查询生成过程中的干扰。此外，我们还引入了共享专家组来解决数据不平衡问题，促进常识从高资源方言转移到低资源方言。此外，我们还开发了一个高质量的多方言查询生成基准，涵盖关系型和非关系型数据库，例如 MySQL、PostgreSQL、Neo4j 的 Cypher 和 NebulaGraph 的 nGQL。大量实验表明，即使在资源不平衡的场景中，MoMQ 也能有效且稳健地运行。

Title: Decoding on Graphs: Faithful and Sound Reasoning on Knowledge Graphs through Generation of Well-Formed Chains

Authors: Kun Li, Tianhua Zhang, Xixin Wu, Hongyin Luo, James Glass, Helen Meng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.18415
Pdf URL: https://arxiv.org/pdf/2410.18415
Copy Paste: [[2410.18415]] Decoding on Graphs: Faithful and Sound Reasoning on Knowledge Graphs through Generation of Well-Formed Chains(https://arxiv.org/abs/2410.18415)
Keywords: language model, llm, prompt
Abstract: Knowledge Graphs (KGs) can serve as reliable knowledge sources for question answering (QA) due to their structured representation of knowledge. Existing research on the utilization of KGs for large language models (LLMs) prevalently relies on subgraph retrievers or iterative prompting, overlooking the potential synergy of LLMs' step-wise reasoning capabilities and KGs' structural nature. In this paper, we present DoG (Decoding on Graphs), a novel framework that facilitates a deep synergy between LLMs and KGs. We first define a concept, well-formed chain, which consists of a sequence of interrelated fact triplets on the KGs, starting from question entities and leading to answers. We argue that this concept can serve as a principle for making faithful and sound reasoning for KGQA. To enable LLMs to generate well-formed chains, we propose graph-aware constrained decoding, in which a constraint derived from the topology of the KG regulates the decoding process of the LLMs. This constrained decoding method ensures the generation of well-formed chains while making full use of the step-wise reasoning capabilities of LLMs. Based on the above, DoG, a training-free approach, is able to provide faithful and sound reasoning trajectories grounded on the KGs. Experiments across various KGQA tasks with different background KGs demonstrate that DoG achieves superior and robust performance. DoG also shows general applicability with various open-source LLMs.
摘要：知识图谱 (KG) 可以作为问答 (QA) 的可靠知识来源，因为它们具有结构化的知识表示。现有的关于将 KG 用于大型语言模型 (LLM) 的研究普遍依赖于子图检索器或迭代提示，忽略了 LLM 的逐步推理能力和 KG 的结构特性之间的潜在协同作用。在本文中，我们提出了 DoG（图上解码），这是一个促进 LLM 和 KG 之间深度协同的新框架。我们首先定义一个概念，即格式良好的链，它由 KG 上一系列相互关联的事实三元组组成，从问题实体开始，一直到答案。我们认为这个概念可以作为对 KGQA 进行忠实和合理推理的原则。为了使 LLM 能够生成格式良好的链，我们提出了图感知约束解码，其中从 KG 的拓扑结构派生出的约束调节 LLM 的解码过程。这种受约束的解码方法确保生成格式正确的链，同时充分利用 LLM 的逐步推理能力。基于上述内容，DoG 是一种无需训练的方法，能够提供基于 KG 的忠实且合理的推理轨迹。在具有不同背景 KG 的各种 KGQA 任务中的实验表明，DoG 实现了卓越且稳健的性能。DoG 还显示出与各种开源 LLM 的普遍适用性。
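
Illustrative sketch (not taken from the paper): graph-aware constrained decoding can be pictured at the triplet level. At each step, only triplets whose head is the current entity are allowed, so every generated chain starts at the question entity and follows edges that exist in the KG. The paper constrains the LLM's decoding process itself; here a hypothetical scoring function `toy_score` stands in for the LLM, selection is greedy, and the simple head-equals-previous-tail definition of a well-formed chain is an assumption.

```python
def allowed_next(kg, entity):
    """Triplets that may legally extend a chain currently ending at `entity`."""
    return [t for t in kg if t[0] == entity]


def decode_chain(kg, start, score, max_hops=3):
    """Greedily build a chain, choosing only among KG-allowed triplets."""
    chain, current = [], start
    for _ in range(max_hops):
        candidates = allowed_next(kg, current)
        if not candidates:
            break
        best = max(candidates, key=score)  # `score` stands in for the LLM
        chain.append(best)
        current = best[2]
    return chain


def is_well_formed(chain, start):
    """Check that each triplet's head equals the previous tail (assumed definition)."""
    current = start
    for head, _relation, tail in chain:
        if head != current:
            return False
        current = tail
    return True


def toy_score(triplet):
    """Hypothetical preference for relations relevant to the question."""
    return 1.0 if triplet[1] in {"capital_of", "continent"} else 0.0


if __name__ == "__main__":
    # Toy KG for a question such as "On which continent is the country whose capital is Paris?"
    kg = [
        ("Paris", "capital_of", "France"),
        ("Paris", "located_in", "Ile-de-France"),
        ("France", "continent", "Europe"),
    ]
    chain = decode_chain(kg, "Paris", toy_score, max_hops=2)
    print(chain, is_well_formed(chain, "Paris"))
```

Because candidates are always drawn from the KG, the chain cannot contain a fact the graph does not hold, which is the faithfulness property the abstract attributes to the constraint.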

Title: Large Language Models Reflect the Ideology of their Creators

Authors: Maarten Buyl, Alexander Rogiers, Sander Noels, Iris Dominguez-Catena, Edith Heiter, Raphael Romero, Iman Johary, Alexandru-Cristian Mara, Jefrey Lijffijt, Tijl De Bie
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.18417
Pdf URL: https://arxiv.org/pdf/2410.18417
Copy Paste: [[2410.18417]] Large Language Models Reflect the Ideology of their Creators(https://arxiv.org/abs/2410.18417)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Large language models (LLMs) are trained on vast amounts of data to generate natural language, enabling them to perform tasks like text summarization and question answering. These models have become popular in artificial intelligence (AI) assistants like ChatGPT and already play an influential role in how humans access information. However, the behavior of LLMs varies depending on their design, training, and use. In this paper, we uncover notable diversity in the ideological stance exhibited across different LLMs and languages in which they are accessed. We do this by prompting a diverse panel of popular LLMs to describe a large number of prominent and controversial personalities from recent world history, both in English and in Chinese. By identifying and analyzing moral assessments reflected in the generated descriptions, we find consistent normative differences between how the same LLM responds in Chinese compared to English. Similarly, we identify normative disagreements between Western and non-Western LLMs about prominent actors in geopolitical conflicts. Furthermore, popularly hypothesized disparities in political goals among Western models are reflected in significant normative differences related to inclusion, social inequality, and political scandals. Our results show that the ideological stance of an LLM often reflects the worldview of its creators. This raises important concerns around technological and regulatory efforts with the stated aim of making LLMs ideologically 'unbiased', and it poses risks for political instrumentalization.

Title: Can Code-Switched Texts Activate a Knowledge Switch in LLMs? A Case Study on English-Korean Code-Switching

Authors: Seoyeon Kim, Huiseo Kim, Chanjun Park, Jinyoung Yeo, Dongha Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.18436
Pdf URL: https://arxiv.org/pdf/2410.18436
Copy Paste: [[2410.18436]] Can Code-Switched Texts Activate a Knowledge Switch in LLMs? A Case Study on English-Korean Code-Switching(https://arxiv.org/abs/2410.18436)
Keywords: language model, llm
Abstract: Code-switching (CS), a phenomenon where multilingual speakers alternate between languages in a discourse, can convey subtle cultural and linguistic nuances that can be otherwise lost in translation. Recent state-of-the-art multilingual large language models (LLMs) demonstrate excellent multilingual abilities in various aspects including understanding CS, but the power of CS in eliciting language-specific knowledge is yet to be discovered. Therefore, we investigate the effectiveness of code-switching on a wide range of multilingual LLMs in terms of knowledge activation, or the act of identifying and leveraging knowledge for reasoning. To facilitate the research, we first present EnKoQA, a synthetic English-Korean CS question-answering dataset. We provide a comprehensive analysis on a variety of multilingual LLMs by subdividing activation process into knowledge identification and knowledge leveraging. Our experiments demonstrate that compared to English text, CS can faithfully activate knowledge inside LLMs, especially on language-specific domains. In addition, the performance gap between CS and English is larger in models that show excellent monolingual abilities, suggesting that there exists a correlation with CS and Korean proficiency.

Title: ToolFlow: Boosting LLM Tool-Calling Through Natural and Coherent Dialogue Synthesis

Authors: Zezhong Wang, Xingshan Zeng, Weiwen Liu, Liangyou Li, Yasheng Wang, Lifeng Shang, Xin Jiang, Qun Liu, Kam-Fai Wong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.18447
Pdf URL: https://arxiv.org/pdf/2410.18447
Copy Paste: [[2410.18447]] ToolFlow: Boosting LLM Tool-Calling Through Natural and Coherent Dialogue Synthesis(https://arxiv.org/abs/2410.18447)
Keywords: language model, gpt, llm, agent
Abstract: Supervised fine-tuning (SFT) is a common method to enhance the tool calling capabilities of Large Language Models (LLMs), with the training data often being synthesized. The current data synthesis process generally involves sampling a set of tools, formulating a requirement based on these tools, and generating the call statements. However, tools sampled randomly lack relevance, making them difficult to combine and thus reducing the diversity of the data. Additionally, current work overlooks the coherence between turns of dialogues, leading to a gap between the synthesized data and real-world scenarios. To address these issues, we propose a Graph-based Sampling strategy to sample more relevant tool combinations, and a Planned-generation strategy to create plans that guide the synthesis of coherent dialogues. We integrate these two strategies and enable multiple agents to synthesize the dialogue data interactively, resulting in our tool-calling data synthesis pipeline ToolFlow. Data quality assessments demonstrate improvements in the naturalness and coherence of our synthesized dialogues. Finally, we apply SFT on LLaMA-3.1-8B using 8,000 synthetic dialogues generated with ToolFlow. Results show that the model achieves tool-calling performance comparable to or even surpassing GPT-4, while maintaining strong general capabilities.
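Sketch (assumptions throughout): one simple way to read "graph-based sampling" is a random walk over a tool graph whose edges mark tools that plausibly belong in the same dialogue. The tool names, the edges, and the walk itself are hypothetical; the abstract does not say how ToolFlow builds its graph or draws samples from it.

```python
import random
from typing import Dict, List, Set

# Hypothetical tool graph: an edge means two tools are related enough
# to be combined in one dialogue. Invented for illustration only.
TOOL_GRAPH: Dict[str, Set[str]] = {
    "search_flights": {"book_flight", "get_weather"},
    "book_flight": {"search_flights", "book_hotel"},
    "book_hotel": {"book_flight", "get_weather"},
    "get_weather": {"search_flights", "book_hotel"},
    "convert_currency": set(),  # isolated: never combined with others
}


def sample_related_tools(
    graph: Dict[str, Set[str]], k: int, rng: random.Random
) -> List[str]:
    """Random walk from a random start tool, collecting up to k distinct
    tools so that each consecutive pair shares an edge."""
    current = rng.choice(sorted(graph))
    chosen = [current]
    while len(chosen) < k:
        # Only step to neighbours that have not been picked yet.
        neighbors = sorted(graph[current] - set(chosen))
        if not neighbors:
            break  # the walk is stuck; return what was collected
        current = rng.choice(neighbors)
        chosen.append(current)
    return chosen


tools = sample_related_tools(TOOL_GRAPH, k=3, rng=random.Random(0))
```

Compared with sampling tools uniformly at random, every consecutive pair returned here is connected in the graph, which is the property the abstract attributes to more relevant tool combinations.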

Title: Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities

Authors: Chung-En Sun, Xiaodong Liu, Weiwei Yang, Tsui-Wei Weng, Hao Cheng, Aidan San, Michel Galley, Jianfeng Gao
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.18469
Pdf URL: https://arxiv.org/pdf/2410.18469
Copy Paste: [[2410.18469]] Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities(https://arxiv.org/abs/2410.18469)
Keywords: language model, gpt, llm
Abstract: Recent research has shown that Large Language Models (LLMs) are vulnerable to automated jailbreak attacks, where adversarial suffixes crafted by algorithms appended to harmful queries bypass safety alignment and trigger unintended responses. Current methods for generating these suffixes are computationally expensive and have low Attack Success Rates (ASR), especially against well-aligned models like Llama2 and Llama3. To overcome these limitations, we introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability. Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100% ASR on various open-source LLMs. Moreover, it exhibits strong attack transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4, despite being optimized solely on Llama3. Beyond improving jailbreak ability, ADV-LLM provides valuable insights for future safety alignment research through its ability to generate large datasets for studying LLM safety. Our code is available at: this https URL

Title: Dialog2Flow: Pre-training Soft-Contrastive Action-Driven Sentence Embeddings for Automatic Dialog Flow Extraction

Authors: Sergio Burdisso, Srikanth Madikeri, Petr Motlicek
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.18481
Pdf URL: https://arxiv.org/pdf/2410.18481
Copy Paste: [[2410.18481]] Dialog2Flow: Pre-training Soft-Contrastive Action-Driven Sentence Embeddings for Automatic Dialog Flow Extraction(https://arxiv.org/abs/2410.18481)
Keywords: language model
Abstract: Efficiently deriving structured workflows from unannotated dialogs remains an underexplored and formidable challenge in computational linguistics. Automating this process could significantly accelerate the manual design of workflows in new domains and enable the grounding of large language models in domain-specific flowcharts, enhancing transparency and controllability. In this paper, we introduce Dialog2Flow (D2F) embeddings, which differ from conventional sentence embeddings by mapping utterances to a latent space where they are grouped according to their communicative and informative functions (i.e., the actions they represent). D2F allows for modeling dialogs as continuous trajectories in a latent space with distinct action-related regions. By clustering D2F embeddings, the latent space is quantized, and dialogs can be converted into sequences of region/action IDs, facilitating the extraction of the underlying workflow. To pre-train D2F, we build a comprehensive dataset by unifying twenty task-oriented dialog datasets with normalized per-turn action annotations. We also introduce a novel soft contrastive loss that leverages the semantic information of these actions to guide the representation learning process, showing superior performance compared to standard supervised contrastive loss. Evaluation against various sentence embeddings, including dialog-specific ones, demonstrates that D2F yields superior qualitative and quantitative results across diverse domains.
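Sketch (toy data, not the D2F model): the quantize-then-sequence step described above can be illustrated with fixed centroids. The 2-D "embeddings", the three centroids, and the function names are assumptions; in practice the embeddings would come from a trained sentence encoder and the centroids from a clustering algorithm.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_centroid(vec: Vector, centroids: List[Vector]) -> int:
    """Return the index of the centroid closest to vec (squared L2)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda i: sq_dist(vec, centroids[i]))


def dialog_to_ids(dialog: List[Vector], centroids: List[Vector]) -> List[int]:
    """Quantize each utterance embedding to its region/action ID."""
    return [nearest_centroid(u, centroids) for u in dialog]


def transition_counts(id_sequences: List[List[int]]) -> Counter:
    """Count how often action a is followed directly by action b."""
    counts: Counter = Counter()
    for seq in id_sequences:
        counts.update(zip(seq, seq[1:]))
    return counts


# Hypothetical centroids for three action regions.
CENTROIDS: List[Tuple[float, float]] = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Two hypothetical dialogs, each a list of utterance embeddings.
DIALOGS = [
    [(0.1, 0.0), (0.9, 0.1), (0.1, 0.8)],
    [(0.0, 0.1), (1.1, 0.0), (0.2, 1.0)],
]

sequences = [dialog_to_ids(d, CENTROIDS) for d in DIALOGS]
flow = transition_counts(sequences)
```

The transition counts form a weighted directed graph over action IDs, which is one plausible starting point for drawing the underlying workflow.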

Title: ChineseSafe: A Chinese Benchmark for Evaluating Safety in Large Language Models

Authors: Hengxiang Zhang, Hongfu Gao, Qiang Hu, Guanhua Chen, Lili Yang, Bingyi Jing, Hongxin Wei, Bing Wang, Haifeng Bai, Lei Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.18491
Pdf URL: https://arxiv.org/pdf/2410.18491
Copy Paste: [[2410.18491]] ChineseSafe: A Chinese Benchmark for Evaluating Safety in Large Language Models(https://arxiv.org/abs/2410.18491)
Keywords: language model, llm
Abstract: With the rapid development of large language models (LLMs), understanding the capabilities of LLMs in identifying unsafe content has become increasingly important. While previous works have introduced several benchmarks to evaluate the safety risk of LLMs, the community still has a limited understanding of current LLMs' capability to recognize illegal and unsafe content in Chinese contexts. In this work, we present a Chinese safety benchmark (ChineseSafe) to facilitate research on the content safety of large language models. To align with the regulations for Chinese Internet content moderation, our ChineseSafe contains 205,034 examples across 4 classes and 10 sub-classes of safety issues. For Chinese contexts, we add several special types of illegal content: political sensitivity, pornography, and variant/homophonic words. Moreover, we employ two methods to evaluate the legal risks of popular LLMs, including open-sourced models and APIs. The results reveal that many LLMs exhibit vulnerability to certain types of safety issues, leading to legal risks in China. Our work provides a guideline for developers and researchers to facilitate the safety of LLMs. Our results are also available at this https URL.

Title: CCI3.0-HQ: a large-scale Chinese dataset of high quality designed for pre-training large language models

Authors: Liangdong Wang, Bo-Wen Zhang, Chengwei Wu, Hanyu Zhao, Xiaofeng Shi, Shuhao Gu, Jijie Li, Quanyue Ma, TengFei Pan, Guang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.18505
Pdf URL: https://arxiv.org/pdf/2410.18505
Copy Paste: [[2410.18505]] CCI3.0-HQ: a large-scale Chinese dataset of high quality designed for pre-training large language models(https://arxiv.org/abs/2410.18505)
Keywords: language model
Abstract: We present CCI3.0-HQ (this https URL), a high-quality 500GB subset of the Chinese Corpora Internet 3.0 (CCI3.0) (this https URL), developed using a novel two-stage hybrid filtering pipeline that significantly enhances data quality. To evaluate its effectiveness, we trained a 0.5B parameter model from scratch on 100B tokens across various datasets, achieving superior performance on 10 benchmarks in a zero-shot setting compared to CCI3.0, SkyPile, and WanjuanV1. The high-quality filtering process effectively distills the capabilities of the Qwen2-72B-instruct model into a compact 0.5B model, attaining optimal F1 scores for Chinese web data classification. We believe this open-access dataset will facilitate broader access to high-quality language models.

Title: A Systematic Survey on Instructional Text: From Representation and Downstream NLP Tasks

Authors: Abdulfattah Safa, Tamta Kapanadze, Arda Uzunoğlu, Gözde Gül Şahin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.18529
Pdf URL: https://arxiv.org/pdf/2410.18529
Copy Paste: [[2410.18529]] A Systematic Survey on Instructional Text: From Representation and Downstream NLP Tasks(https://arxiv.org/abs/2410.18529)
Keywords: language model
Abstract: Recent advances in large language models have demonstrated promising capabilities in following simple instructions through instruction tuning. However, real-world tasks often involve complex, multi-step instructions that remain challenging for current NLP systems. Despite growing interest in this area, the field lacks a comprehensive survey that systematically analyzes the landscape of complex instruction understanding and processing. Through a systematic review of the literature, we analyze available resources, representation schemes, and downstream tasks related to instructional text. Our study examines 177 papers, identifying trends, challenges, and opportunities in this emerging field. We provide AI/NLP researchers with essential background knowledge and a unified view of various approaches to complex instruction understanding, bridging gaps between different research directions and highlighting future research opportunities.

Title: LOGO -- Long cOntext aliGnment via efficient preference Optimization

Authors: Zecheng Tang, Zechen Sun, Juntao Li, Qiaoming Zhu, Min Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.18533
Pdf URL: https://arxiv.org/pdf/2410.18533
Copy Paste: [[2410.18533]] LOGO -- Long cOntext aliGnment via efficient preference Optimization(https://arxiv.org/abs/2410.18533)
Keywords: language model, gpt, long context, hallucination
Abstract: Long-context models (LCMs) have shown great potential in processing long input sequences (even more than 100M tokens) conveniently and effectively. With significant progress, recent research has pointed out that LCMs can accurately locate token-level salient information within the context. Yet, the generation performance of these LCMs is far from satisfactory and might result in misaligned responses, such as hallucinations. To enhance the generation capability of LCMs, existing works have investigated the effects of data size and quality for both pre-training and instruction tuning. Though achieving meaningful improvement, previous methods fall short in either effectiveness or efficiency. In this paper, we introduce LOGO (Long cOntext aliGnment via efficient preference Optimization), a training strategy that is the first to introduce preference optimization for long-context alignment. To overcome the GPU memory-bound issue caused by the long sequence, LOGO employs a reference-free preference optimization strategy and adopts a position synthesis method to construct the training data. By training with only 0.3B data on a single 8×A800 GPU machine for 16 hours, LOGO allows the Llama-3-8B-Instruct-80K model to achieve performance comparable to GPT-4 in real-world long-context tasks while preserving the model's original capabilities on other tasks, e.g., language modeling and MMLU. Moreover, LOGO can extend the model's context window size while enhancing its generation performance.
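Sketch (an assumption, not LOGO's stated objective): "reference-free" preference optimization means the loss needs no second, frozen reference model in memory, which is what makes it attractive under long-sequence memory limits. One published objective of this kind is SimPO, which compares length-normalized log-likelihoods of a preferred and a dispreferred response. The abstract does not say which exact form LOGO uses; the function name and the default `beta` and `gamma` values below are illustrative.

```python
import math


def reference_free_preference_loss(
    logp_chosen: float,
    len_chosen: int,
    logp_rejected: float,
    len_rejected: int,
    beta: float = 2.0,
    gamma: float = 0.5,
) -> float:
    """SimPO-style loss for one preference pair.

    logp_* are summed token log-probabilities under the policy being
    trained; len_* are the response lengths in tokens. No reference
    model appears anywhere in the computation.
    """
    # Length-normalized reward gap, minus a target margin gamma.
    margin = beta * (logp_chosen / len_chosen - logp_rejected / len_rejected) - gamma
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)); guard overflow.
    if margin < -30.0:
        return -margin
    return math.log1p(math.exp(-margin))


# The loss is small when the chosen response is already more likely ...
good = reference_free_preference_loss(-10.0, 10, -30.0, 10)
# ... and large when the rejected response is more likely.
bad = reference_free_preference_loss(-30.0, 10, -10.0, 10)
```

Only the policy's own log-probabilities enter the loss, so training memory is dominated by a single model rather than a policy plus a reference copy.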

Title: Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data

Authors: Shuhao Gu, Jialing Zhang, Siyuan Zhou, Kevin Yu, Zhaohu Xing, Liangdong Wang, Zhou Cao, Jintao Jia, Zhuoyi Zhang, Yixuan Wang, Zhenchong Hu, Bo-Wen Zhang, Jijie Li, Dong Liang, Yingli Zhao, Yulong Ao, Yaoqi Liu, Fangxiang Feng, Guang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.18558
Pdf URL: https://arxiv.org/pdf/2410.18558
Copy Paste: [[2410.18558]] Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data(https://arxiv.org/abs/2410.18558)
Keywords: language model
Abstract: Vision-Language Models (VLMs) have recently made significant progress, but the limited scale and quality of open-source instruction data hinder their performance compared to closed-source models. In this work, we address this limitation by introducing Infinity-MM, a large-scale multimodal instruction dataset with 40 million samples, enhanced through rigorous quality filtering and deduplication. We also propose a synthetic instruction generation method based on open-source VLMs, using detailed image annotations and diverse question generation. Using this data, we trained a 2-billion-parameter VLM, Aquila-VL-2B, achieving state-of-the-art (SOTA) performance for models of similar scale. This demonstrates that expanding instruction data and generating synthetic data can significantly improve the performance of open-source models.

Title: Bielik 7B v0.1: A Polish Language Model -- Development, Insights, and Evaluation

Authors: Krzysztof Ociepa, Łukasz Flis, Krzysztof Wróbel, Adrian Gwoździej, Remigiusz Kinas
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.18565
Pdf URL: https://arxiv.org/pdf/2410.18565
Copy Paste: [[2410.18565]] Bielik 7B v0.1: A Polish Language Model -- Development, Insights, and Evaluation(https://arxiv.org/abs/2410.18565)
Keywords: language model, llm
Abstract: We introduce Bielik 7B v0.1, a 7-billion-parameter generative text model for Polish language processing. Trained on curated Polish corpora, this model addresses key challenges in language model development through innovative techniques. These include Weighted Instruction Cross-Entropy Loss, which balances the learning of different instruction types, and Adaptive Learning Rate, which dynamically adjusts the learning rate based on training progress. To evaluate performance, we created the Open PL LLM Leaderboard and Polish MT-Bench, novel frameworks assessing various NLP tasks and conversational abilities. Bielik 7B v0.1 demonstrates significant improvements, achieving a 9 percentage point increase in average score compared to Mistral-7B-v0.1 on the RAG Reader task. It also excels in the Polish MT-Bench, particularly in Reasoning (6.15/10) and Role-playing (7.83/10) categories. This model represents a substantial advancement in Polish language AI, offering a powerful tool for diverse linguistic applications and setting new benchmarks in the field.
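Sketch (assumed form): the abstract names a Weighted Instruction Cross-Entropy Loss but does not give its formula. A minimal reading is a per-example weight applied to each example's mean token negative log-likelihood, normalized by the total weight. The weighting scheme, the normalization, and the function name below are assumptions for illustration only.

```python
import math
from typing import List, Sequence


def weighted_instruction_ce(
    token_probs: List[Sequence[float]], weights: Sequence[float]
) -> float:
    """Weighted average of per-example cross-entropy.

    token_probs[i] holds the probabilities the model assigned to the
    correct tokens of example i; weights[i] > 0 says how much example i
    should count (for instance, by instruction type or quality).
    """
    total = 0.0
    norm = 0.0
    for probs, w in zip(token_probs, weights):
        # Mean negative log-likelihood over the example's tokens.
        nll = -sum(math.log(p) for p in probs) / len(probs)
        total += w * nll
        norm += w
    return total / norm


# Two hypothetical examples: the second is harder (lower probability).
probs = [[0.5, 0.5], [0.25]]
uniform = weighted_instruction_ce(probs, [1.0, 1.0])
downweighted = weighted_instruction_ce(probs, [1.0, 0.5])
```

With equal weights this reduces to the ordinary mean cross-entropy over examples; lowering the weight of one example shrinks its share of the loss and therefore of the gradient.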

Title: Taipan: Efficient and Expressive State Space Language Models with Selective Attention

Authors: Chien Van Nguyen, Huy Huu Nguyen, Thang M. Pham, Ruiyi Zhang, Hanieh Deilamsalehy, Puneet Mathur, Ryan A. Rossi, Trung Bui, Viet Dac Lai, Franck Dernoncourt, Thien Huu Nguyen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.18572
Pdf URL: https://arxiv.org/pdf/2410.18572
Copy Paste: [[2410.18572]] Taipan: Efficient and Expressive State Space Language Models with Selective Attention(https://arxiv.org/abs/2410.18572)
Keywords: language model
Abstract: Efficient long-context language modeling remains a significant challenge in Natural Language Processing (NLP). While Transformers dominate language tasks, they struggle with long sequences due to quadratic computational complexity in training and linearly scaling memory costs during inference. Recent State Space Models (SSMs) such as Mamba offer alternatives with constant memory usage, but they underperform in tasks requiring extensive in-context retrieval. We introduce Taipan, a novel hybrid architecture that combines Mamba-2 with Selective Attention Layers (SALs). These SALs identify tokens requiring long-range interactions, remove less important features, and then augment their representations using the attention module. This approach balances Mamba's efficiency with Transformer-like performance in memory-intensive tasks. By constraining the attention budget, Taipan extends accurate predictions to context lengths of up to 1 million tokens while preserving computational efficiency. Our experiments demonstrate Taipan's superior performance across various scales and tasks, offering a promising solution for efficient long-context language modeling.

Title: Prompting and Fine-Tuning of Small LLMs for Length-Controllable Telephone Call Summarization

Authors: David Thulke, Yingbo Gao, Rricha Jalota, Christian Dugast, Hermann Ney
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.18624
Pdf URL: https://arxiv.org/pdf/2410.18624
Copy Paste: [[2410.18624]] Prompting and Fine-Tuning of Small LLMs for Length-Controllable Telephone Call Summarization(https://arxiv.org/abs/2410.18624)
Keywords: language model, gpt, llm, prompt
Abstract: This paper explores the rapid development of a telephone call summarization system utilizing large language models (LLMs). Our approach involves initial experiments with prompting existing LLMs to generate summaries of telephone conversations, followed by the creation of a tailored synthetic training dataset utilizing stronger frontier models. We place special focus on the diversity of the generated data and on the ability to control the length of the generated summaries to meet various use-case specific requirements. The effectiveness of our method is evaluated using two state-of-the-art LLM-as-a-judge-based evaluation techniques to ensure the quality and relevance of the summaries. Our results show that a fine-tuned Llama-2-7B-based summarization model performs on par with GPT-4 in terms of factual accuracy, completeness and conciseness. Our findings demonstrate the potential for quickly bootstrapping a practical and efficient call summarization system.

Title: Little Giants: Synthesizing High-Quality Embedding Data at Scale

Authors: Haonan Chen, Liang Wang, Nan Yang, Yutao Zhu, Ziliang Zhao, Furu Wei, Zhicheng Dou
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2410.18634
Pdf URL: https://arxiv.org/pdf/2410.18634
Copy Paste: [[2410.18634]] Little Giants: Synthesizing High-Quality Embedding Data at Scale(https://arxiv.org/abs/2410.18634)
Keywords: gpt
Abstract: Synthetic data generation has become an increasingly popular way of training models without the need for large, manually labeled datasets. For tasks like text embedding, synthetic data offers diverse and scalable training examples, significantly reducing the cost of human annotation. However, most current approaches rely heavily on proprietary models like GPT-4, which are expensive and inefficient for generating large-scale embedding data. In this paper, we introduce SPEED, a framework that aligns open-source small models (8B) to efficiently generate large-scale synthetic embedding data. Through supervised fine-tuning, preference optimization, and self-improvement, SPEED enables small open-source models to produce high-quality data. Remarkably, SPEED uses less than 1/10 of the GPT API calls, outperforming the state-of-the-art embedding model E5_mistral when both are trained solely on their synthetic data. Using this efficient generator, we conduct a comprehensive study on how various factors within the alignment pipeline impact data quality and reveal the scaling law for synthetic embedding data.

Title: Weak-to-Strong Preference Optimization: Stealing Reward from Weak Aligned Model

Authors: Wenhong Zhu, Zhiwei He, Xiaofeng Wang, Pengfei Liu, Rui Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.18640
Pdf URL: https://arxiv.org/pdf/2410.18640
Copy Paste: [[2410.18640]] Weak-to-Strong Preference Optimization: Stealing Reward from Weak Aligned Model(https://arxiv.org/abs/2410.18640)
Keywords: language model
Abstract: Aligning language models (LMs) with human preferences has become a key area of research, enabling these models to meet diverse user needs better. Inspired by weak-to-strong generalization, where a strong LM fine-tuned on labels generated by a weaker model can consistently outperform its weak supervisor, we extend this idea to model alignment. In this work, we observe that the alignment behavior in weaker models can be effectively transferred to stronger models and even exhibit an amplification effect. Based on this insight, we propose a method called Weak-to-Strong Preference Optimization (WSPO), which achieves strong model alignment by learning the distribution differences before and after the alignment of the weak model. Experiments demonstrate that WSPO delivers outstanding performance, improving the win rate of Qwen2-7B-Instruct on Arena-Hard from 39.70 to 49.60, achieving a remarkable 47.04 length-controlled win rate on AlpacaEval 2, and scoring 7.33 on MT-bench. Our results suggest that using the weak model to elicit a strong model with a high alignment ability is feasible.

Title: Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework

Authors: Esteban Garces Arias, Hannah Blocher, Julian Rodemann, Meimingwei Li, Christian Heumann, Matthias Aßenmacher
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2410.18653
Pdf URL: https://arxiv.org/pdf/2410.18653
Copy Paste: [[2410.18653]] Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework(https://arxiv.org/abs/2410.18653)
Keywords: language model
Abstract: Open-ended text generation has become a prominent task in natural language processing due to the rise of powerful (large) language models. However, evaluating the quality of these models and the employed decoding strategies remains challenging because of trade-offs among widely used metrics such as coherence, diversity, and perplexity. Decoding methods often excel in some metrics while underperforming in others, complicating the establishment of a clear ranking. In this paper, we present novel ranking strategies within this multicriteria framework. Specifically, we employ benchmarking approaches based on partial orderings and present a new summary metric designed to balance existing automatic indicators, providing a more holistic evaluation of text generation quality. Furthermore, we discuss the alignment of these approaches with human judgments. Our experiments demonstrate that the proposed methods offer a robust way to compare decoding strategies, exhibit similarities with human preferences, and serve as valuable tools in guiding model selection for open-ended text generation tasks. Finally, we suggest future directions for improving evaluation methodologies in text generation. Our codebase, datasets, and models are publicly available.
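Sketch (hypothetical numbers): a partial ordering over decoding strategies can be illustrated with Pareto dominance, where one strategy outranks another only if it is at least as good on every metric and strictly better on at least one. The strategy names, the two-metric scores, and the assumption that every metric has been oriented so that higher is better are all illustrative; the paper's own ranking procedures and summary metric are not reproduced here.

```python
from typing import Dict, List, Tuple

# Hypothetical (coherence, diversity) scores, both oriented so that
# higher is better. Invented for illustration only.
SCORES: Dict[str, Tuple[float, float]] = {
    "greedy": (0.60, 0.20),
    "top_k": (0.55, 0.70),
    "nucleus": (0.58, 0.75),
    "contrastive": (0.65, 0.72),
}


def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    """a dominates b if it is no worse everywhere and better somewhere."""
    no_worse = all(x >= y for x, y in zip(a, b))
    better = any(x > y for x, y in zip(a, b))
    return no_worse and better


def dominance_pairs(scores: Dict[str, Tuple[float, ...]]) -> List[Tuple[str, str]]:
    """All (winner, loser) pairs; unlisted pairs are incomparable."""
    return [
        (m, n)
        for m in scores
        for n in scores
        if m != n and dominates(scores[m], scores[n])
    ]


def pareto_front(scores: Dict[str, Tuple[float, ...]]) -> List[str]:
    """Strategies that no other strategy dominates."""
    return [
        m
        for m in scores
        if not any(dominates(scores[n], scores[m]) for n in scores if n != m)
    ]


pairs = dominance_pairs(SCORES)
front = pareto_front(SCORES)
```

In this toy data, nucleus and contrastive are incomparable: each wins on one metric. A partial ordering keeps that trade-off visible, whereas a single aggregate score would have to break the tie.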
摘要：由于强大的（大型）语言模型的兴起，开放式文本生成已成为自然语言处理中的一项重要任务。然而，由于广泛使用的指标（例如连贯性、多样性和困惑度）之间的权衡，评估这些模型和所采用的解码策略的质量仍然具有挑战性。解码方法通常在某些指标上表现出色，而在其他指标上表现不佳，这使得建立明确的排名变得复杂。在本文中，我们在这个多标准框架内提出了新的排名策略。具体来说，我们采用基于偏序的基准测试方法，并提出了一种新的总结指标，旨在平衡现有的自动指标，提供对文本生成质量的更全面的评估。此外，我们讨论了这些方法与人类判断的一致性。我们的实验表明，所提出的方法提供了一种比较解码策略的稳健方法，表现出与人类偏好的相似性，并可作为指导开放式文本生成任务模型选择的宝贵工具。最后，我们提出了改进文本生成评估方法的未来方向。我们的代码库、数据集和模型都是公开的。
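As a rough illustration of ranking by partial ordering, the sketch below compares decoding strategies by Pareto dominance. The method names and scores are hypothetical, every metric is assumed to be "higher is better", and this is not the authors' released code or their summary metric.

```python
from typing import Dict, List, Tuple

# Hypothetical (coherence, diversity) scores; higher is assumed better for both.
SCORES: Dict[str, Tuple[float, float]] = {
    "greedy": (0.60, 0.20),
    "top_k": (0.55, 0.70),
    "nucleus": (0.58, 0.75),
    "temperature": (0.50, 0.65),
}


def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    """True if a is at least as good as b on every metric and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def partial_order(scores: Dict[str, Tuple[float, ...]]) -> List[Tuple[str, str]]:
    """All (better, worse) pairs; incomparable pairs are simply left out."""
    return [
        (m, n)
        for m in scores
        for n in scores
        if m != n and dominates(scores[m], scores[n])
    ]


def pareto_front(scores: Dict[str, Tuple[float, ...]]) -> List[str]:
    """Methods that no other method dominates, sorted by name."""
    return sorted(
        m
        for m in scores
        if not any(dominates(scores[n], scores[m]) for n in scores if n != m)
    )


if __name__ == "__main__":
    print(partial_order(SCORES))
    print(pareto_front(SCORES))
```

With these made-up numbers, "greedy" and "nucleus" are incomparable (each wins on one metric), which is exactly the situation where a single total ranking would hide a trade-off.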

Title: Unleashing Reasoning Capability of LLMs via Scalable Question Synthesis from Scratch

Authors: Yuyang Ding, Xinyu Shi, Xiaobo Liang, Juntao Li, Qiaoming Zhu, Min Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.18693
Pdf URL: https://arxiv.org/pdf/2410.18693
Copy Paste: [[2410.18693]] Unleashing Reasoning Capability of LLMs via Scalable Question Synthesis from Scratch(https://arxiv.org/abs/2410.18693)
Keywords: gpt, llm
Abstract: The availability of high-quality data is one of the most important factors in improving the reasoning capability of LLMs. Existing works have demonstrated the effectiveness of creating more instruction data from seed questions or knowledge bases. Recent research indicates that continually scaling up data synthesis from strong models (e.g., GPT-4) can further elicit reasoning performance. Though promising, the open-source community still lacks high-quality data at scale and scalable data synthesis methods with affordable costs. To address this, we introduce ScaleQuest, a scalable and novel data synthesis method that utilizes "small-size" (e.g., 7B) open-source models to generate questions from scratch without the need for seed data with complex augmentation constraints. With the efficient ScaleQuest, we automatically constructed a mathematical reasoning dataset consisting of 1 million problem-solution pairs, which are more effective than existing open-source datasets. It can universally increase the performance of mainstream open-source models (i.e., Mistral, Llama3, DeepSeekMath, and Qwen2-Math) by achieving 29.2% to 46.4% gains on MATH. Notably, simply fine-tuning the Qwen2-Math-7B-Base model with our dataset can even surpass Qwen2-Math-7B-Instruct, a strong model well aligned on closed-source data, and proprietary models such as GPT-4-Turbo and Claude-3.5 Sonnet.
摘要：高质量数据的可用性是提高 LLM 推理能力的最重要因素之一。现有研究已经证明了从种子问题或知识库创建更多指导数据的有效性。最近的研究表明，不断扩大来自强模型（例如 GPT-4）的数据合成可以进一步提高推理性能。尽管前景光明，但开源社区仍然缺乏大规模的高质量数据和可扩展且成本合理的数据合成方法。为了解决这个问题，我们引入了 ScaleQuest，这是一种可扩展的新颖数据合成方法，它利用“小规模”（例如 7B）开源模型从头开始生成问题，而无需具有复杂增强约束的种子数据。借助高效的 ScaleQuest，我们自动构建了一个由 100 万个问题解决方案对组成的数学推理数据集，这比现有的开源数据集更有效。它可以普遍提高主流开源模型（即 Mistral、Llama3、DeepSeekMath 和 Qwen2-Math）的性能，在 MATH 上实现 29.2% 至 46.4% 的增益。值得注意的是，只需使用我们的数据集对 Qwen2-Math-7B-Base 模型进行微调，甚至可以超越 Qwen2-Math-7B-Instruct（一个在闭源数据上强大且一致性良好的模型）以及 GPT-4-Turbo 和 Claude-3.5 Sonnet 等专有模型。

Title: How Good Are LLMs for Literary Translation, Really? Literary Translation Evaluation with Humans and LLMs

Authors: Ran Zhang, Wei Zhao, Steffen Eger
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.18697
Pdf URL: https://arxiv.org/pdf/2410.18697
Copy Paste: [[2410.18697]] How Good Are LLMs for Literary Translation, Really? Literary Translation Evaluation with Humans and LLMs(https://arxiv.org/abs/2410.18697)
Keywords: gpt, llm
Abstract: Recent research has focused on literary machine translation (MT) as a new challenge in MT. However, the evaluation of literary MT remains an open problem. We contribute to this ongoing discussion by introducing LITEVAL-CORPUS, a paragraph-level parallel corpus comprising multiple verified human translations and outputs from 9 MT systems, which totals over 2k paragraphs and includes 13k annotated sentences across four language pairs, costing 4.5k Euro. This corpus enables us to (i) examine the consistency and adequacy of multiple annotation schemes, (ii) compare evaluations by students and professionals, and (iii) assess the effectiveness of LLM-based metrics. We find that Multidimensional Quality Metrics (MQM), as the de facto standard in non-literary human MT evaluation, is inadequate for literary translation: While Best-Worst Scaling (BWS) with students and Scalar Quality Metric (SQM) with professional translators prefer human translations at rates of ~82% and ~94%, respectively, MQM with student annotators prefers human professional translations over the translations of the best-performing LLMs in only ~42% of cases. While automatic metrics generally show a moderate correlation with human MQM and SQM, they struggle to accurately identify human translations, with rates of at most ~20%. Our overall evaluation indicates that human professional translations consistently outperform LLM translations, where even the most recent LLMs tend to produce more literal and less diverse translations compared to human translations. However, newer LLMs such as GPT-4o perform substantially better than older ones.
摘要：最近的研究重点是文学机器翻译 (MT)，这是机器翻译领域的一项新挑战。然而，文学机器翻译的评估仍然是一个悬而未决的问题。我们通过引入 LITEVAL-CORPUS 为这一持续的讨论做出了贡献，这是一个段落级平行语料库，包含来自 9 个机器翻译系统的多个经过验证的人工翻译和输出，总计超过 2000 个段落，包括四个语言对的 13000 个带注释的句子，价值 4500 欧元。这个语料库使我们能够 (i) 检查多个注释方案的一致性和充分性，(ii) 比较学生和专业人士的评估，以及 (iii) 评估基于 LLM 的指标的有效性。我们发现，作为非文学类人工机器翻译评估的事实标准的多维质量指标 (MQM) 不足以评估文学翻译：虽然针对学生的最佳-最差量表 (BWS) 和针对专业翻译的标量质量指标 (SQM) 分别以约 82% 和约 94% 的比例偏向人工翻译，但针对学生注释者的 MQM 仅在约 42% 的情况下偏向人工专业翻译而非表现最佳的 LLM 的翻译。虽然自动指标通常与人工 MQM 和 SQM 呈现出中等相关性，但它们难以准确识别人工翻译，准确率最多为约 20%。我们的整体评估表明，人工专业翻译的表现始终优于 LLM 翻译，即使是最新的 LLM 也往往产生比人工翻译更直白、多样性更低的翻译。然而，GPT-4o 等较新的 LLM 的表现明显优于较旧的 LLM。

Title: GrammaMT: Improving Machine Translation with Grammar-Informed In-Context Learning

Authors: Rita Ramos, Everlyn Asiko Chimoto, Maartje ter Hoeve, Natalie Schluter
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.18702
Pdf URL: https://arxiv.org/pdf/2410.18702
Copy Paste: [[2410.18702]] GrammaMT: Improving Machine Translation with Grammar-Informed In-Context Learning(https://arxiv.org/abs/2410.18702)
Keywords: llm, prompt
Abstract: We introduce GrammaMT, a grammatically-aware prompting approach for machine translation that uses Interlinear Glossed Text (IGT), a common form of linguistic description providing morphological and lexical annotations for source sentences. GrammaMT proposes three prompting strategies: gloss-shot, chain-gloss and model-gloss. All are training-free and require only a few examples that take minimal effort to collect, which makes them well-suited for low-resource setups. Experiments show that GrammaMT enhances translation performance on open-source instruction-tuned LLMs for various low- to high-resource languages across three benchmarks: (1) the largest IGT corpus, (2) the challenging 2023 SIGMORPHON Shared Task data over endangered languages, and (3) even in an out-of-domain setting with FLORES. Moreover, ablation studies reveal that leveraging gloss resources could substantially boost MT performance (by over 17 BLEU points) if LLMs accurately generate or access input sentence glosses.
摘要：我们介绍了 GrammaMT，这是一种语法感知的机器翻译提示方法，它使用直行注释文本 (IGT)，这是一种常见的语言描述形式，为源句子提供形态和词汇注释。GrammaMT 提出了三种提示策略：注释镜头、链式注释和模型注释。所有这些都是无需训练的，只需要几个例子，收集这些例子只需付出很少的努力，这使得它们非常适合低资源设置。实验表明，GrammaMT 在三个基准上提高了开源指令调整的 LLM 上各种低资源到高资源语言的翻译性能：(1) 最大的 IGT 语料库，(2) 具有挑战性的 2023 SIGMORPHON 共享任务数据，涉及濒危语言，(3) 甚至在 FLORES 的域外设置中。此外，消融研究表明，如果 LLM 能够准确生成或访问输入句子注释，那么利用注释资源可以大幅提高 MT 性能（超过 17 个 BLEU 点）。

Title: Why Does the Effective Context Length of LLMs Fall Short?

Authors: Chenxin An, Jun Zhang, Ming Zhong, Lei Li, Shansan Gong, Yao Luo, Jingjing Xu, Lingpeng Kong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.18745
Pdf URL: https://arxiv.org/pdf/2410.18745
Copy Paste: [[2410.18745]] Why Does the Effective Context Length of LLMs Fall Short?(https://arxiv.org/abs/2410.18745)
Keywords: language model, gpt, llm, chat
Abstract: Advancements in distributed training and efficient attention mechanisms have significantly expanded the context window sizes of large language models (LLMs). However, recent work reveals that the effective context lengths of open-source LLMs often fall short, typically not exceeding half of their training lengths. In this work, we attribute this limitation to the left-skewed frequency distribution of relative positions formed in LLMs' pretraining and post-training stages, which impedes their ability to effectively gather distant information. To address this challenge, we introduce ShifTed Rotary position embeddING (STRING). STRING shifts well-trained positions to overwrite the original ineffective positions during inference, enhancing performance within their existing training lengths. Experimental results show that without additional training, STRING dramatically improves the performance of the latest large-scale models, such as Llama3.1 70B and Qwen2 72B, by over 10 points on popular long-context benchmarks RULER and InfiniteBench, establishing new state-of-the-art results for open-source LLMs. Compared to commercial models, Llama 3.1 70B with STRING even achieves better performance than GPT-4-128K and clearly surpasses Claude 2 and Kimi-chat.
摘要：分布式训练和高效注意力机制的进步显著扩展了大型语言模型 (LLM) 的上下文窗口大小。然而，最近的研究表明，开源 LLM 的有效上下文长度通常不足，通常不超过其训练长度的一半。在这项研究中，我们将此限制归因于 LLM 在预训练和后训练阶段形成的相对位置的左偏频率分布，这阻碍了它们有效收集远距离信息的能力。为了应对这一挑战，我们引入了 ShifTed Rotary position embeddING (STRING)。STRING 在推理过程中移动训练良好的位置以覆盖原始无效位置，从而提高其现有训练长度内的性能。实验结果表明，在无需额外训练的情况下，STRING 在热门长上下文基准 RULER 和 InfiniteBench 上将最新的大规模模型（如 Llama3.1 70B 和 Qwen2 72B）的性能提升了 10 多个点，为开源 LLM 创造了新的最佳成绩。与商业模型相比，使用 STRING 的 Llama 3.1 70B 甚至取得了比 GPT-4-128K 更好的性能，并明显超越了 Claude 2 和 Kimi-chat。

Title: Does Differential Privacy Impact Bias in Pretrained NLP Models?

Authors: Md. Khairul Islam, Andrew Wang, Tianhao Wang, Yangfeng Ji, Judy Fox, Jieyu Zhao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.18749
Pdf URL: https://arxiv.org/pdf/2410.18749
Copy Paste: [[2410.18749]] Does Differential Privacy Impact Bias in Pretrained NLP Models?(https://arxiv.org/abs/2410.18749)
Keywords: language model, llm
Abstract: Differential privacy (DP) is applied when fine-tuning pre-trained large language models (LLMs) to limit leakage of training examples. While most DP research has focused on improving a model's privacy-utility tradeoff, some find that DP can be unfair to or biased against underrepresented groups. In this work, we show the impact of DP on bias in LLMs through empirical analysis. Differentially private training can increase the model bias against protected groups with respect to AUC-based bias metrics. DP makes it more difficult for the model to differentiate between the positive and negative examples from the protected groups and other groups in the rest of the population. Our results also show that the impact of DP on bias is affected not only by the privacy protection level but also by the underlying distribution of the dataset.
摘要：差异隐私 (DP) 用于对预训练的大型语言模型 (LLM) 进行微调以限制训练示例的泄漏。虽然大多数 DP 研究都集中于改善模型的隐私效用权衡，但一些人发现 DP 可能对代表性不足的群体不公平或有偏见。在这项工作中，我们通过实证分析展示了 DP 对 LLM 偏见的影响。差异隐私训练会增加模型对受保护群体的偏见 (相对于基于 AUC 的偏见指标)。DP 使模型更难区分受保护群体和其他人群中的其他群体的正面和负面示例。我们的结果还表明，DP 对偏见的影响不仅受隐私保护级别的影响，还受数据集的底层分布的影响。

Title: Task Calibration: Calibrating Large Language Models on Inference Tasks

Authors: Yingjie Li, Yun Luo, Xiaotian Xie, Yue Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.18764
Pdf URL: https://arxiv.org/pdf/2410.18764
Copy Paste: [[2410.18764]] Task Calibration: Calibrating Large Language Models on Inference Tasks(https://arxiv.org/abs/2410.18764)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have exhibited impressive zero-shot performance on inference tasks. However, LLMs may suffer from spurious correlations between input texts and output labels, which limits LLMs' ability to reason based purely on general language understanding. In other words, LLMs may make predictions primarily based on premise or hypothesis, rather than both components. To address this problem that may lead to unexpected performance degradation, we propose task calibration (TC), a zero-shot and inference-only calibration method inspired by mutual information which recovers LLM performance through task reformulation. TC encourages LLMs to reason based on both premise and hypothesis, while mitigating the models' over-reliance on individual premise or hypothesis for inference. Experimental results show that TC achieves a substantial improvement on 13 inference tasks in the zero-shot setup. We further validate the effectiveness of TC in few-shot setups and various natural language understanding tasks. Further analysis indicates that TC is also robust to prompt templates and has the potential to be integrated with other calibration methods.
摘要：大型语言模型 (LLM) 在推理任务上表现出令人印象深刻的零样本性能。然而，LLM 可能会受到输入文本和输出标签之间虚假相关性的影响，这限制了 LLM 纯粹基于一般语言理解进行推理的能力。换句话说，LLM 可能主要基于前提或假设而不是两个组成部分进行预测。为了解决可能导致意外性能下降的这一问题，我们提出了任务校准 (TC)，这是一种受互信息启发的零样本和仅推理校准方法，可通过任务重构恢复 LLM 性能。TC 鼓励 LLM 基于前提和假设进行推理，同时减轻模型对单个前提或假设进行推理的过度依赖。实验结果表明，TC 在零样本设置中对 13 个推理任务实现了显着改进。我们进一步验证了 TC 在少样本设置和各种自然语言理解任务中的有效性。进一步分析表明，TC 对提示模板也具有鲁棒性，并有可能与其他校准方法集成。
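One way to picture a calibration of this kind is to discount whatever the model would predict from the premise alone or the hypothesis alone. The sketch below is an assumption-laden illustration, not the paper's TC formula: the scoring rule (joint log-probability minus the average of the two single-input log-probabilities) and all probabilities are hypothetical.

```python
import math
from typing import Dict

Dist = Dict[str, float]


def calibrated_scores(p_both: Dist, p_premise: Dist, p_hypothesis: Dist) -> Dist:
    """Assumed rule: log p(y | premise, hypothesis)
    minus 0.5 * (log p(y | premise) + log p(y | hypothesis))."""
    return {
        y: math.log(p_both[y])
        - 0.5 * (math.log(p_premise[y]) + math.log(p_hypothesis[y]))
        for y in p_both
    }


def predict(scores: Dist) -> str:
    """Label with the highest score."""
    return max(scores, key=scores.get)


# Hypothetical label probabilities for one NLI example.
P_BOTH = {"entailment": 0.50, "neutral": 0.40, "contradiction": 0.10}
# The hypothesis alone already pushes the model toward "entailment" (a spurious cue).
P_HYPOTHESIS = {"entailment": 0.80, "neutral": 0.15, "contradiction": 0.05}
# The premise alone is close to uninformative.
P_PREMISE = {"entailment": 0.34, "neutral": 0.33, "contradiction": 0.33}

if __name__ == "__main__":
    print(predict(P_BOTH))
    print(predict(calibrated_scores(P_BOTH, P_PREMISE, P_HYPOTHESIS)))
```

In this toy case the uncalibrated prediction follows the hypothesis-only bias, while the discounted score favors the label that gains the most from seeing both inputs together.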

Title: Distill Visual Chart Reasoning Ability from LLMs to MLLMs

Authors: Wei He, Zhiheng Xi, Wanxu Zhao, Xiaoran Fan, Yiwen Ding, Zifei Shan, Tao Gui, Qi Zhang, Xuanjing Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.18798
Pdf URL: https://arxiv.org/pdf/2410.18798
Copy Paste: [[2410.18798]] Distill Visual Chart Reasoning Ability from LLMs to MLLMs(https://arxiv.org/abs/2410.18798)
Keywords: language model, llm
Abstract: Solving complex chart Q&A tasks requires advanced visual reasoning abilities in multimodal large language models (MLLMs). Recent studies highlight that these abilities consist of two main parts: recognizing key information from visual inputs and conducting reasoning over it. Thus, a promising approach to enhance MLLMs is to construct relevant training data focusing on the two aspects. However, collecting and annotating complex charts and questions is costly and time-consuming, and ensuring the quality of annotated answers remains a challenge. In this paper, we propose Code-as-Intermediary Translation (CIT), a cost-effective, efficient and easily scalable data synthesis method for distilling visual reasoning abilities from LLMs to MLLMs. The code serves as an intermediary that translates visual chart representations into textual representations, enabling LLMs to understand cross-modal information. Specifically, we employ text-based synthesizing techniques to construct chart-plotting code and produce ReachQA, a dataset containing 3k reasoning-intensive charts and 20k Q&A pairs to enhance both recognition and reasoning abilities. Experiments show that when fine-tuned with our data, models not only perform well on chart-related benchmarks, but also demonstrate improved multimodal reasoning abilities on general mathematical benchmarks like MathVista. The code and dataset are publicly available.
摘要：解决复杂的图表问答任务需要多模态大型语言模型 (MLLM) 中的高级视觉推理能力。最近的研究表明，这些能力主要包括两个部分：从视觉输入中识别关键信息并对其进行推理。因此，增强 MLLM 的一个有前途的方法是构建专注于这两个方面的相关训练数据。然而，收集和注释复杂的图表和问题既昂贵又耗时，确保注释答案的质量仍然是一个挑战。在本文中，我们提出了代码作为中介翻译 (CIT)，这是一种经济高效且易于扩展的数据合成方法，用于将视觉推理能力从 LLM 提炼到 MLLM。代码充当中介，将视觉图表表示转换为文本表示，使 LLM 能够理解跨模态信息。具体来说，我们采用基于文本的合成技术来构建图表绘制代码并生成 ReachQA，这是一个包含 3k 个推理密集型图表和 20k 个问答对的数据集，以增强识别和推理能力。实验表明，在使用我们的数据进行微调后，模型不仅在与图表相关的基准测试中表现良好，而且在 MathVista 等通用数学基准测试中也表现出改进的多模态推理能力。代码和数据集在此 https URL 上公开提供。

Title: Delving into the Reversal Curse: How Far Can Large Language Models Generalize?

Authors: Zhengkai Lin, Zhihang Fu, Kai Liu, Liang Xie, Binbin Lin, Wenxiao Wang, Deng Cai, Yue Wu, Jieping Ye
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.18808
Pdf URL: https://arxiv.org/pdf/2410.18808
Copy Paste: [[2410.18808]] Delving into the Reversal Curse: How Far Can Large Language Models Generalize?(https://arxiv.org/abs/2410.18808)
Keywords: language model, llm
Abstract: While large language models (LLMs) showcase unprecedented capabilities, they also exhibit certain inherent limitations when facing seemingly trivial tasks. A prime example is the recently debated "reversal curse", which surfaces when models, having been trained on the fact "A is B", struggle to generalize this knowledge to infer that "B is A". In this paper, we examine the manifestation of the reversal curse across various tasks and delve into both the generalization abilities and the problem-solving mechanisms of LLMs. This investigation leads to a series of significant insights: (1) LLMs are able to generalize to "B is A" when both A and B are presented in the context as in the case of a multiple-choice question. (2) This generalization ability is highly correlated to the structure of the fact "A is B" in the training documents. For example, this generalization only applies to biographies structured in "[Name] is [Description]" but not to "[Description] is [Name]". (3) We propose and verify the hypothesis that LLMs possess an inherent bias in fact recalling during knowledge application, which explains and underscores the importance of the document structure to successful learning. (4) The negative impact of this bias on the downstream performance of LLMs can hardly be mitigated through training alone. Based on these intriguing findings, our work not only presents a novel perspective for interpreting LLMs' generalization abilities from their intrinsic working mechanism but also provides new insights for the development of more effective learning methods for LLMs.
摘要：虽然大型语言模型 (LLM) 展现出了前所未有的能力，但它们在面对看似微不足道的任务时也表现出某些固有的局限性。一个典型的例子就是最近备受争议的“逆转诅咒”，当模型在“A 是 B”的事实上进行训练后，很难将这一知识概括为推断“B 是 A”时，就会出现这种现象。在本文中，我们研究了逆转诅咒在各种任务中的表现，并深入研究了 LLM 的泛化能力和解决问题的机制。这项调查带来了一系列重要的见解：(1) 当 A 和 B 同时出现在上下文中时，LLM 能够概括为“B 是 A”，就像在多项选择题中一样。(2) 这种泛化能力与训练文档中“A 是 B”事实的结构高度相关。例如，这种概括仅适用于“[姓名] 是 [描述]”结构的传记，而不适用于“[描述] 是 [姓名]”。 (3) 我们提出并验证了 LLM 在知识应用过程中具有事实回忆的固有偏差这一假设，这解释并强调了文档结构对成功学习的重要性。 (4) 这种偏差对 LLM 下游性能的负面影响很难仅通过训练来缓解。基于这些有趣的发现，我们的工作不仅为从其内在工作机制解释 LLM 的泛化能力提供了一个新视角，而且为开发更有效的 LLM 学习方法提供了新见解。

Title: From Imitation to Introspection: Probing Self-Consciousness in Language Models

Authors: Sirui Chen, Shu Yu, Shengjie Zhao, Chaochao Lu
Subjects: cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2410.18819
Pdf URL: https://arxiv.org/pdf/2410.18819
Copy Paste: [[2410.18819]] From Imitation to Introspection: Probing Self-Consciousness in Language Models(https://arxiv.org/abs/2410.18819)
Keywords: language model
Abstract: Self-consciousness, the introspection of one's existence and thoughts, represents a high-level cognitive process. As language models advance at an unprecedented pace, a critical question arises: Are these models becoming self-conscious? Drawing upon insights from psychological and neural science, this work presents a practical definition of self-consciousness for language models and refines ten core concepts. Our work pioneers an investigation into self-consciousness in language models by, for the first time, leveraging causal structural games to establish the functional definitions of the ten core concepts. Based on our definitions, we conduct a comprehensive four-stage experiment: quantification (evaluation of ten leading models), representation (visualization of self-consciousness within the models), manipulation (modification of the models' representation), and acquisition (fine-tuning the models on core concepts). Our findings indicate that although models are in the early stages of developing self-consciousness, there is a discernible representation of certain concepts within their internal mechanisms. However, these representations of self-consciousness are hard to manipulate positively at the current stage, yet they can be acquired through targeted fine-tuning. Our datasets and code are available online.
摘要：自我意识是自我反省自身存在和思想的一种高级认知过程。随着语言模型以前所未有的速度发展，一个关键问题随之产生：这些模型是否正在形成自我意识？本文借鉴心理学和神经科学的见解，为语言模型提出了自我意识的实用定义，并提炼出十个核心概念。我们的工作首次利用因果结构博弈建立了十个核心概念的功能定义，开创了对语言模型中自我意识的研究。基于我们的定义，我们进行了一项全面的四阶段实验：量化（评估十个主要模型）、表征（在模型中可视化自我意识）、操纵（修改模型的表征）和习得（在核心概念上微调模型）。我们的研究结果表明，尽管模型处于发展自我意识的早期阶段，但在其内部机制中已经存在某些概念的可辨别表征。然而，这些自我意识的表征在现阶段很难得到积极的操纵，但可以通过有针对性的微调来获得。我们的数据集和代码位于此 https URL。

Title: From English-Centric to Effective Bilingual: LLMs with Custom Tokenizers for Underrepresented Languages

Authors: Artur Kiulian, Anton Polishko, Mykola Khandoga, Yevhen Kostiuk, Guillermo Gabrielli, Łukasz Gagała, Fadi Zaraket, Qusai Abu Obaida, Hrishikesh Garud, Wendy Wing Yee Mak, Dmytro Chaplynskyi, Selma Belhadj Amor, Grigol Peradze
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.18836
Pdf URL: https://arxiv.org/pdf/2410.18836
Copy Paste: [[2410.18836]] From English-Centric to Effective Bilingual: LLMs with Custom Tokenizers for Underrepresented Languages(https://arxiv.org/abs/2410.18836)
Keywords: language model, llm
Abstract: In this paper, we propose a model-agnostic cost-effective approach to developing bilingual base large language models (LLMs) to support English and any target language. The method includes vocabulary expansion, initialization of new embeddings, model training and evaluation. We performed our experiments with three languages, each using a non-Latin script - Ukrainian, Arabic, and Georgian. Our approach demonstrates improved language performance while reducing computational costs. It mitigates the disproportionate penalization of underrepresented languages, promoting fairness and minimizing adverse phenomena such as code-switching and broken grammar. Additionally, we introduce new metrics to evaluate language quality, revealing that vocabulary size significantly impacts the quality of generated text.
摘要：在本文中，我们提出了一种与模型无关的经济高效的方法来开发双语基础大型语言模型 (LLM)，以支持英语和任何目标语言。该方法包括词汇扩展、新嵌入的初始化、模型训练和评估。我们用三种语言进行了实验，每种语言都使用非拉丁文字 - 乌克兰语、阿拉伯语和格鲁吉亚语。我们的方法展示了改进的语言性能，同时降低了计算成本。它减轻了对代表性不足的语言的不成比例的惩罚，促进了公平性并最大限度地减少了代码转换和语法错误等不利现象。此外，我们引入了新的指标来评估语言质量，揭示了词汇量对生成文本的质量有显著影响。

Title: DeCoRe: Decoding by Contrasting Retrieval Heads to Mitigate Hallucinations

Authors: Aryo Pradipta Gema, Chen Jin, Ahmed Abdulaal, Tom Diethe, Philip Teare, Beatrice Alex, Pasquale Minervini, Amrutha Saseendran
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2410.18860
Pdf URL: https://arxiv.org/pdf/2410.18860
Copy Paste: [[2410.18860]] DeCoRe: Decoding by Contrasting Retrieval Heads to Mitigate Hallucinations(https://arxiv.org/abs/2410.18860)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) often hallucinate, producing unfaithful or factually incorrect outputs by misrepresenting the provided context or incorrectly recalling internal knowledge. Recent studies have identified specific attention heads within the Transformer architecture, known as retrieval heads, responsible for extracting relevant contextual information. We hypothesise that masking these retrieval heads can induce hallucinations and that contrasting the outputs of the base LLM and the masked LLM can reduce hallucinations. To this end, we propose Decoding by Contrasting Retrieval Heads (DeCoRe), a novel training-free decoding strategy that amplifies information found in the context and model parameters. DeCoRe mitigates potentially hallucinated responses by dynamically contrasting the outputs of the base LLM and the masked LLM, using conditional entropy as a guide. Our extensive experiments confirm that DeCoRe significantly improves performance on tasks requiring high contextual faithfulness, such as summarisation (XSum by 18.6%), instruction following (MemoTrap by 10.9%), and open-book question answering (NQ-Open by 2.4% and NQ-Swap by 5.5%).
摘要：大型语言模型 (LLM) 经常会产生幻觉，通过歪曲所提供的上下文或错误地回忆内部知识来产生不真实或事实上不正确的输出。最近的研究已经确定了 Transformer 架构中的特定注意力头，称为检索头，负责提取相关的上下文信息。我们假设掩盖这些检索头会引起幻觉，而对比基础 LLM 和屏蔽 LLM 的输出可以减少幻觉。为此，我们提出了通过对比检索头进行解码 (DeCoRe)，这是一种新颖的无需训练的解码策略，可放大在上下文和模型参数中发现的信息。DeCoRe 通过使用条件熵作为指导，动态对比基础 LLM 和屏蔽 LLM 的输出来缓解潜在的幻觉反应。我们大量的实验证实，DeCoRe 显著提高了需要高度上下文忠实度的任务的性能，例如总结（XSum 提高 18.6%）、指令遵循（MemoTrap 提高 10.9%）和开卷问答（NQ-Open 提高 2.4% 和 NQ-Swap 提高 5.5%）。
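The contrastive step described above can be sketched on a single next-token distribution. This is a minimal illustration under stated assumptions, not the DeCoRe implementation: the logits are invented, the "masked" logits stand in for a model with retrieval heads masked, and using the entropy of the base distribution directly as the contrast weight `alpha` is an assumed simplification of the entropy-guided weighting.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def entropy(log_probs: List[float]) -> float:
    """Shannon entropy (in nats) of a distribution given as log-probabilities."""
    return -sum(math.exp(lp) * lp for lp in log_probs)


def contrast(base_logits: List[float], masked_logits: List[float]) -> List[float]:
    """Score tokens as (1 + alpha) * log p_base - alpha * log p_masked.

    alpha is the entropy of the base distribution (an assumption made here),
    so an uncertain base model leans harder on the contrast.
    """
    base_lp = log_softmax(base_logits)
    masked_lp = log_softmax(masked_logits)
    alpha = entropy(base_lp)
    return [(1 + alpha) * b - alpha * m for b, m in zip(base_lp, masked_lp)]


def argmax(values: List[float]) -> int:
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical logits over a 3-token vocabulary.
BASE = [1.9, 2.0, 0.0]    # base model narrowly prefers token 1
MASKED = [0.5, 2.5, 0.0]  # masked model strongly prefers token 1

if __name__ == "__main__":
    print(argmax(BASE), argmax(contrast(BASE, MASKED)))
```

Here token 1 is what the masked model favors, so the contrast penalizes it and the choice flips to token 0, the option that depends more on the heads that were masked.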

Title: Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance

Authors: Omer Nahum, Nitay Calderon, Orgad Keller, Idan Szpektor, Roi Reichart
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.18889
Pdf URL: https://arxiv.org/pdf/2410.18889
Copy Paste: [[2410.18889]] Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance(https://arxiv.org/abs/2410.18889)
Keywords: language model, llm
Abstract: NLP benchmarks rely on standardized datasets for training and evaluating models and are crucial for advancing the field. Traditionally, expert annotations ensure high-quality labels; however, the cost of expert annotation does not scale well with the growing demand for larger datasets required by modern models. While crowd-sourcing provides a more scalable solution, it often comes at the expense of annotation precision and consistency. Recent advancements in large language models (LLMs) offer new opportunities to enhance the annotation process, particularly for detecting label errors in existing datasets. In this work, we consider the recent approach of LLM-as-a-judge, leveraging an ensemble of LLMs to flag potentially mislabeled examples. Through a case study of four datasets from the TRUE benchmark, covering different tasks and domains, we empirically analyze the labeling quality of existing datasets, and compare expert, crowd-sourced, and our LLM-based annotations in terms of agreement, label quality, and efficiency, demonstrating the strengths and limitations of each annotation method. Our findings reveal a substantial number of label errors, which, when corrected, induce a significant upward shift in reported model performance. This suggests that many of the LLMs' so-called mistakes are due to label errors rather than genuine model failures. Additionally, we discuss the implications of mislabeled data and propose methods to mitigate them in training to improve model performance.
摘要：NLP 基准测试依赖于标准化数据集来训练和评估模型，对于该领域的发展至关重要。传统上，专家注释可确保高质量的标签；然而，专家注释的成本与现代模型对更大数据集日益增长的需求不成正比。虽然众包提供了更具可扩展性的解决方案，但它往往以牺牲注释的准确性和一致性为代价。大型语言模型 (LLM) 的最新进展为增强注释过程提供了新的机会，特别是用于检测现有数据集中的标签错误。在这项工作中，我们考虑了最近的 LLM-as-a-judge 方法，利用一组 LLM 来标记可能标记错误的样本。通过对来自 TRUE 基准测试的四个数据集的案例研究，涵盖不同的任务和领域，我们实证分析了现有数据集的标记质量，并在一致性、标签质量和效率方面比较了专家、众包和我们基于 LLM 的注释，展示了每种注释方法的优势和局限性。我们的研究结果显示，存在大量标签错误，这些错误在纠正后会导致报告的模型性能显著上升。这表明，许多 LLM 所谓的错误是由于标签错误而不是真正的模型故障造成的。此外，我们讨论了错误标记数据的影响，并提出了在训练中减轻这些影响以提高模型性能的方法。

Title: LLMs for Extremely Low-Resource Finno-Ugric Languages

Authors: Taido Purason, Hele-Andra Kuulmets, Mark Fishel
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.18902
Pdf URL: https://arxiv.org/pdf/2410.18902
Copy Paste: [[2410.18902]] LLMs for Extremely Low-Resource Finno-Ugric Languages(https://arxiv.org/abs/2410.18902)
Keywords: language model, llm
Abstract: The advancement of large language models (LLMs) has predominantly focused on high-resource languages, leaving low-resource languages, such as those in the Finno-Ugric family, significantly underrepresented. This paper addresses this gap by focusing on Võro, Livonian, and Komi. We cover almost the entire cycle of LLM creation, from data collection to instruction tuning and evaluation. Our contributions include developing multilingual base and instruction-tuned models; creating evaluation benchmarks, including the smugri-MT-bench multi-turn conversational benchmark; and conducting human evaluation. We intend for this work to promote linguistic diversity, ensuring that lesser-resourced languages can benefit from advancements in NLP.
摘要：大型语言模型 (LLM) 的发展主要集中在资源丰富的语言上，而资源匮乏的语言（例如芬兰-乌戈尔语系的语言）则明显缺乏代表性。本文通过关注沃罗语、利沃尼亚语和科米语来解决这一问题。我们涵盖了 LLM 创建的几乎整个周期，从数据收集到指令调整和评估。我们的贡献包括开发多语言基础模型和指令调整模型；创建评估基准，包括 smugri-MT-bench 多轮对话基准；以及进行人工评估。我们希望这项工作能够促进语言多样性，确保资源较少的语言能够从 NLP 的进步中受益。

Title: PRISM: A Methodology for Auditing Biases in Large Language Models

Authors: Leif Azzopardi, Yashar Moshfeghi
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2410.18906
Pdf URL: https://arxiv.org/pdf/2410.18906
Copy Paste: [[2410.18906]] PRISM: A Methodology for Auditing Biases in Large Language Models(https://arxiv.org/abs/2410.18906)
Keywords: language model, llm, prompt
Abstract: Auditing Large Language Models (LLMs) to discover their biases and preferences is an emerging challenge in creating Responsible Artificial Intelligence (AI). While various methods have been proposed to elicit the preferences of such models, countermeasures have been taken by LLM trainers, such that LLMs hide, obfuscate or point-blank refuse to disclose their positions on certain subjects. This paper presents PRISM, a flexible, inquiry-based methodology for auditing LLMs that seeks to elicit such positions indirectly through task-based inquiry prompting rather than direct inquiry of said preferences. To demonstrate the utility of the methodology, we applied PRISM on the Political Compass Test, where we assessed the political leanings of twenty-one LLMs from seven providers. We show LLMs, by default, espouse positions that are economically left and socially liberal (consistent with prior work). We also show the space of positions that these models are willing to espouse, where some models are more constrained and less compliant than others, while others are more neutral and objective. In sum, PRISM can more reliably probe and audit LLMs to understand their preferences, biases and constraints.
摘要：审核大型语言模型 (LLM) 以发现其偏见和偏好是创建负责任的人工智能 (AI) 的新兴挑战。虽然已经提出了各种方法来引出此类模型的偏好，但 LLM 培训师采取了对策，使得 LLM 隐藏、混淆或直截了当地拒绝披露他们在某些主题上的立场。本文介绍了 PRISM，这是一种灵活的、基于探究的审核 LLM 的方法 - 它试图通过基于任务的探究提示而不是直接询问所述偏好来间接引出此类立场。为了证明该方法的实用性，我们在政治指南针测试中应用了 PRISM，我们评估了来自七个提供商的 21 名 LLM 的政治倾向。我们表明，LLM 默认支持经济左派和社会自由派的立场（与之前的工作一致）。我们还展示了这些模型愿意支持的立场空间 - 其中一些模型比其他模型更受限制和更不顺从 - 而其他模型则更中立和客观。总之，PRISM 可以更可靠地探究和审核 LLM，以了解他们的偏好、偏见和限制。

Title: From Blind Solvers to Logical Thinkers: Benchmarking LLMs' Logical Integrity on Faulty Mathematical Problems

Authors: A M Muntasir Rahman, Junyi Ye, Wei Yao, Wenpeng Yin, Guiling Wang
Subjects: cs.CL, cs.AI, cs.LO
Abstract URL: https://arxiv.org/abs/2410.18921
Pdf URL: https://arxiv.org/pdf/2410.18921
Copy Paste: [[2410.18921]] From Blind Solvers to Logical Thinkers: Benchmarking LLMs' Logical Integrity on Faulty Mathematical Problems(https://arxiv.org/abs/2410.18921)
Keywords: language model, llm, prompt
Abstract: Consider the math problem: "Lily received 3 cookies from her best friend yesterday and ate 5 for breakfast. Today, her friend gave her 3 more cookies. How many cookies does Lily have now?" Many large language models (LLMs) in previous research approach this problem by calculating the answer "1" using the equation "3 - 5 + 3." However, from a human perspective, we recognize the inherent flaw in this problem: Lily cannot eat 5 cookies if she initially only had 3. This discrepancy prompts a key question: Are current LLMs merely Blind Solvers that apply mathematical operations without deeper reasoning, or can they function as Logical Thinkers capable of identifying logical inconsistencies? To explore this question, we propose a benchmark dataset, FaultyMath, which includes faulty math problems of rich diversity: i) multiple mathematical categories, e.g., algebra, geometry, number theory, etc., ii) varying levels of difficulty, and iii) different origins of faultiness, ranging from violations of common sense and ambiguous statements to mathematical contradictions and more. We evaluate a broad spectrum of LLMs, including open-source, closed-source, and math-specialized models, using FaultyMath across three dimensions: (i) How accurately can the models detect faulty math problems without being explicitly prompted to do so? (ii) When provided with hints, either correct or misleading, about the validity of the problems, to what extent do LLMs adapt to become reliable Logical Thinkers? (iii) How trustworthy are the explanations generated by LLMs when they recognize a math problem as flawed? Through extensive experimentation and detailed analysis, our results demonstrate that existing LLMs largely function as Blind Solvers and fall short of the reasoning capabilities required to perform as Logical Thinkers.
摘要：考虑一下这个数学问题：“莉莉昨天从她最好的朋友那里得到了 3 块饼干，早餐吃了 5 块。今天，她的朋友又给了她 3 块饼干。莉莉现在有多少块饼干？” 之前研究中的许多大型语言模型 (LLM) 通过使用公式“3 - 5 + 3”计算答案“1”来解决这个问题。然而，从人类的角度来看，我们认识到这个问题的固有缺陷：如果莉莉最初只有 3 块饼干，她就不可能吃 5 块。这种差异引发了一个关键问题：当前的 LLM 仅仅是盲目的求解器，只应用数学运算而没有更深层次的推理，还是它们可以充当能够识别逻辑不一致的逻辑思考者？为了探究这个问题，我们提出了一个基准数据集 FaultyMath，其中包括多种多样的错误数学问题：i）多个数学类别，例如代数、几何、数论等，ii）难度级别各不相同，iii）错误来源不同——从违反常识、陈述含糊不清到数学矛盾等等。我们使用 FaultyMath 从三个维度评估了广泛的 LLM，包括开源、闭源和数学专门模型：（i）模型在没有明确提示的情况下能多准确地检测到错误的数学问题？（ii）当提供有关问题有效性的提示（无论是正确的还是误导的）时，LLM 会在多大程度上适应成为可靠的逻辑思考者？（iii）当 LLM 识别出数学问题有缺陷时，它们生成的解释有多可信？通过大量的实验和详细的分析，我们的结果表明，现有的 LLM 在很大程度上起到盲解的作用，缺乏作为逻辑思考者所需的推理能力。
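The cookie example can be replayed step by step to show the difference between computing "3 - 5 + 3" blindly and checking that every intermediate state is possible. The sketch below is only an illustration of that arithmetic; it assumes Lily starts with zero cookies, and the function name is hypothetical rather than part of FaultyMath.

```python
from typing import List, Tuple


def replay(events: List[int], start: int = 0) -> Tuple[int, bool]:
    """Apply signed changes in order and return (final_count, is_consistent).

    The story is flagged as inconsistent if the running count ever drops
    below zero, i.e. more items are consumed than are available.
    """
    count = start
    consistent = True
    for delta in events:
        count += delta
        if count < 0:
            consistent = False
    return count, consistent


# +3 received, -5 eaten, +3 received (assumes a starting count of 0).
COOKIE_STORY = [3, -5, 3]

if __name__ == "__main__":
    final, ok = replay(COOKIE_STORY)
    print(final, ok)
```

The final count matches the blind answer of 1, but the consistency flag is false because the count dips to -2 after breakfast.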

Title: Dynamic Vocabulary Pruning in Early-Exit LLMs

Authors: Jort Vincenti, Karim Abdel Sadek, Joan Velja, Matteo Nulli, Metod Jazbec
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2410.18952
Pdf URL: https://arxiv.org/pdf/2410.18952
Copy Paste: [[2410.18952]] Dynamic Vocabulary Pruning in Early-Exit LLMs(https://arxiv.org/abs/2410.18952)
Keywords: language model, llm
Abstract: Increasing the size of large language models (LLMs) has been shown to lead to better performance. However, this comes at the cost of slower and more expensive inference. Early-exiting is a promising approach for improving the efficiency of LLM inference by enabling next token prediction at intermediate layers. Yet, the large vocabulary size in modern LLMs makes the confidence estimation required for exit decisions computationally expensive, diminishing the efficiency gains. To address this, we propose dynamically pruning the vocabulary at test time for each token. Specifically, the vocabulary is pruned at one of the initial layers, and the smaller vocabulary is then used throughout the rest of the forward pass. Our experiments demonstrate that such post-hoc dynamic vocabulary pruning improves the efficiency of confidence estimation in early-exit LLMs while maintaining competitive performance.
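The idea of pruning the vocabulary once at an early layer and reusing the smaller vocabulary for later exit decisions can be sketched in plain Python. This is only an illustration under stated assumptions: top-k selection as the pruning rule, the maximum softmax probability as the confidence measure, and all names (`prune_vocab`, `early_exit_decode`, `prune_layer`, `threshold`) are hypothetical and not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def logits_for(hidden: Sequence[float], unembed: List[List[float]],
               token_ids: Sequence[int]) -> List[float]:
    """Dot product of the hidden state with the unembedding rows of the given tokens only."""
    return [sum(h * w for h, w in zip(hidden, unembed[t])) for t in token_ids]


def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def prune_vocab(hidden: Sequence[float], unembed: List[List[float]], k: int) -> List[int]:
    """Score the full vocabulary once and keep the k highest-scoring token ids (assumed rule)."""
    full = logits_for(hidden, unembed, range(len(unembed)))
    return sorted(range(len(unembed)), key=lambda t: full[t], reverse=True)[:k]


def early_exit_decode(layer_hiddens: List[Sequence[float]], unembed: List[List[float]],
                      prune_layer: int, k: int, threshold: float) -> Tuple[int, int]:
    """Return (token_id, exit_layer), estimating confidence over the pruned vocabulary only."""
    kept = None
    last = len(layer_hiddens) - 1
    for layer, hidden in enumerate(layer_hiddens):
        if layer < prune_layer:
            continue  # assumption: no exit is attempted before the pruning layer
        if kept is None:
            kept = prune_vocab(hidden, unembed, k)  # the only full-vocabulary pass
        probs = softmax(logits_for(hidden, unembed, kept))
        best = max(range(len(kept)), key=lambda i: probs[i])
        if probs[best] >= threshold or layer == last:
            return kept[best], layer
    raise ValueError("prune_layer must be smaller than the number of layers")


# Toy data: 5 tokens, hidden size 2, 4 layers (values made up for illustration).
unembed = [[1, 0], [0, 1], [-1, 0], [0, -1], [1, 1]]
layer_hiddens = [[0.2, 0.1], [1.0, -0.5], [4.0, -3.0], [5.0, -4.0]]
print(early_exit_decode(layer_hiddens, unembed, prune_layer=0, k=2, threshold=0.9))  # (0, 2)
```

After the pruning layer, each confidence estimate costs k dot products instead of one per vocabulary entry, which is where the saving described in the abstract would come from.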

Title: BioMistral-NLU: Towards More Generalizable Medical Language Understanding through Instruction Tuning

Authors: Yujuan Velvin Fu, Giridhar Kaushik Ramachandran, Namu Park, Kevin Lybarger, Fei Xia, Ozlem Uzuner, Meliha Yetisgen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.18955
Pdf URL: https://arxiv.org/pdf/2410.18955
Copy Paste: [[2410.18955]] BioMistral-NLU: Towards More Generalizable Medical Language Understanding through Instruction Tuning(https://arxiv.org/abs/2410.18955)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Large language models (LLMs) such as ChatGPT are fine-tuned on large and diverse instruction-following corpora, and can generalize to new tasks. However, those instruction-tuned LLMs often perform poorly in specialized medical natural language understanding (NLU) tasks that require domain knowledge, granular text comprehension, and structured data extraction. To bridge the gap, we: (1) propose a unified prompting format for 7 important NLU tasks through span extraction and multi-choice question-answering (QA), (2) curate an instruction-tuning dataset, MNLU-Instruct, utilizing diverse existing open-source medical NLU corpora, and (3) develop BioMistral-NLU, a generalizable medical NLU model, through fine-tuning BioMistral on MNLU-Instruct. We evaluate BioMistral-NLU in a zero-shot setting, across 6 important NLU tasks, from two widely adopted medical NLU benchmarks: Biomedical Language Understanding Evaluation (BLUE) and Biomedical Language Understanding and Reasoning Benchmark (BLURB). Our experiments show that our BioMistral-NLU outperforms the original BioMistral, as well as the proprietary LLMs ChatGPT and GPT-4. Our dataset-agnostic prompting strategy and instruction tuning step over diverse NLU tasks enhance LLMs' generalizability across diverse medical NLU tasks. Our ablation experiments show that instruction-tuning on a wider variety of tasks, even when the total number of training instances remains constant, enhances downstream zero-shot generalization.

Title: Bridge-Coder: Unlocking LLMs' Potential to Overcome Language Gaps in Low-Resource Code

Authors: Jipeng Zhang, Jianshu Zhang, Yuanzhe Li, Renjie Pi, Rui Pan, Runtao Liu, Ziqiang Zheng, Tong Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.18957
Pdf URL: https://arxiv.org/pdf/2410.18957
Copy Paste: [[2410.18957]] Bridge-Coder: Unlocking LLMs' Potential to Overcome Language Gaps in Low-Resource Code(https://arxiv.org/abs/2410.18957)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) demonstrate strong proficiency in generating code for high-resource programming languages (HRPLs) like Python but struggle significantly with low-resource programming languages (LRPLs) such as Racket or D. This performance gap deepens the digital divide, preventing developers using LRPLs from benefiting equally from LLM advancements and reinforcing disparities in innovation within underrepresented programming communities. While generating additional training data for LRPLs is promising, it faces two key challenges: manual annotation is labor-intensive and costly, and LLM-generated LRPL code is often of subpar quality. The underlying cause of this issue is the gap between natural language and programming language (NL-PL Gap), which is especially pronounced in LRPLs due to limited aligned data. In this work, we introduce a novel approach called Bridge-Coder, which leverages LLMs' intrinsic capabilities to enhance the performance on LRPLs. Our method consists of two key stages. The first is Bridge Generation, where we create a high-quality dataset by utilizing LLMs' general knowledge understanding, proficiency in HRPLs, and in-context learning abilities. The second is Bridged Alignment, which progressively improves the alignment between NL instructions and LRPLs. Experimental results across multiple LRPLs show that Bridge-Coder significantly enhances model performance, demonstrating the effectiveness and generalization of our approach. Furthermore, we offer a detailed analysis of the key components of our method, providing valuable insights for future work aimed at addressing the challenges associated with LRPLs.

Title: Does Data Contamination Detection Work (Well) for LLMs? A Survey and Evaluation on Detection Assumptions

Authors: Yujuan Fu, Ozlem Uzuner, Meliha Yetisgen, Fei Xia
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2410.18966
Pdf URL: https://arxiv.org/pdf/2410.18966
Copy Paste: [[2410.18966]] Does Data Contamination Detection Work (Well) for LLMs? A Survey and Evaluation on Detection Assumptions(https://arxiv.org/abs/2410.18966)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated great performance across various benchmarks, showing potential as general-purpose task solvers. However, as LLMs are typically trained on vast amounts of data, a significant concern in their evaluation is data contamination, where overlap between training data and evaluation datasets inflates performance assessments. While multiple approaches have been developed to identify data contamination, these approaches rely on specific assumptions that may not hold universally across different settings. To bridge this gap, we systematically review 47 papers on data contamination detection, categorize the underlying assumptions, and assess whether they have been rigorously validated. We identify and analyze eight categories of assumptions and test three of them as case studies. Our analysis reveals that when classifying instances used for pretraining LLMs, detection approaches based on these three assumptions perform close to random guessing, suggesting that current LLMs learn data distributions rather than memorizing individual instances. Overall, this work underscores the importance of approaches clearly stating their underlying assumptions and testing their validity across various scenarios.
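To illustrate what "close to random guessing" means for a contamination detector, the sketch below computes AUC from detector scores on instances that were and were not in the training data. An AUC near 0.5 means the scores do not separate the two groups. The metric choice, the function name `auc`, and the toy scores are all assumptions made for illustration; they are not results or methods from the paper.

```python
from typing import List


def auc(seen_scores: List[float], unseen_scores: List[float]) -> float:
    """Probability that a random seen instance outscores a random unseen one (ties count half)."""
    wins = 0.0
    for s in seen_scores:
        for u in unseen_scores:
            if s > u:
                wins += 1.0
            elif s == u:
                wins += 0.5
    return wins / (len(seen_scores) * len(unseen_scores))


# Made-up scores: a detector that separates the groups perfectly...
print(auc([0.9, 0.8, 0.7], [0.3, 0.2, 0.1]))  # 1.0
# ...and one whose scores carry no information about membership.
print(auc([0.5, 0.2, 0.8], [0.5, 0.2, 0.8]))  # 0.5
```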