2024-09-25

Title: Evaluating Large Language Models with Tests of Spanish as a Foreign Language: Pass or Fail?

Authors: Marina Mayor-Rocher, Nina Melero, Elena Merino-Gómez, María Grandury, Javier Conde, Pedro Reviriego
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.15334
Pdf URL: https://arxiv.org/pdf/2409.15334
Copy Paste: [[2409.15334]] Evaluating Large Language Models with Tests of Spanish as a Foreign Language: Pass or Fail?(https://arxiv.org/abs/2409.15334)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have been profusely evaluated on their ability to answer questions on many topics and their performance on different natural language understanding tasks. Those tests are usually conducted in English, but most LLM users are not native English speakers. Therefore, it is of interest to analyze how LLMs understand other languages at different levels: from paragraphs to morphemes. In this paper, we evaluate the performance of state-of-the-art LLMs in TELEIA, a recently released benchmark with similar questions to those of Spanish exams for foreign students, covering topics such as reading comprehension, word formation, meaning and compositional semantics, and grammar. The results show that LLMs perform well at understanding Spanish but are still far from achieving the level of a native speaker in terms of grammatical competence.

Title: Watch Your Steps: Observable and Modular Chains of Thought

Authors: Cassandra A. Cohen, William W. Cohen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2409.15359
Pdf URL: https://arxiv.org/pdf/2409.15359
Copy Paste: [[2409.15359]] Watch Your Steps: Observable and Modular Chains of Thought(https://arxiv.org/abs/2409.15359)
Keywords: prompt
Abstract: We propose a variant of chain of thought (CoT) prompting called Program Trace Prompting that makes explanations more observable while preserving the power, generality and flexibility of CoT. In our approach, few-shot CoT demonstrations are wrapped in a formal syntax based on Python, and each prompt: identifies and names steps; defines the input/output behavior of steps; and replaces CoT explanations of in-context examples with chains of these formalized steps on the same examples. Program Trace Prompting is applicable to many tasks, achieving strong results on the 23 diverse tasks in the BIG-Bench Hard benchmark. More importantly, by instrumenting explanations in this way, we enable new types of analysis. In particular, we identify "non-local errors" (which correspond to incorrectly learning the reasoning method illustrated in the demonstrations) as an unaddressed issue in CoT learning, and we present methods for verifying the modularity of steps in a CoT explanation.
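
Illustrative sketch (not taken from the paper): the idea of naming steps, fixing their input/output behavior, and replacing free-text explanations with chains of step calls can be mimicked with a small tracing decorator. The step names `extract_numbers` and `add_all`, the toy task, and the "Calling ... / ... returned" trace layout are all assumptions made for this example and may differ from the authors' actual prompt syntax.

```python
import functools

TRACE = []  # recorded step calls, in execution order


def step(fn):
    """Mark a function as a named reasoning step and log every call to it."""
    @functools.wraps(fn)
    def wrapper(*args):
        result = fn(*args)
        arg_text = ", ".join(repr(a) for a in args)
        TRACE.append(
            f"Calling {fn.__name__}({arg_text})...\n"
            f"...{fn.__name__} returned {result!r}"
        )
        return result
    return wrapper


@step
def extract_numbers(question):
    """Hypothetical step: pull the integers out of the question text."""
    tokens = question.replace("?", " ").split()
    return [int(tok) for tok in tokens if tok.isdigit()]


@step
def add_all(numbers):
    """Hypothetical step: sum the extracted numbers."""
    return sum(numbers)


def solve(question):
    # The "chain" is simply the sequence of step calls.
    return add_all(extract_numbers(question))


answer = solve("What is 3 plus 4 plus 5?")
demo_trace = "\n".join(TRACE)  # text that could stand in for a CoT explanation
print(demo_trace)
print("answer:", answer)
```

Because every step is a separately named function, a recorded trace can be inspected step by step, which is the kind of observability the abstract describes.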

Title: Multitask Mayhem: Unveiling and Mitigating Safety Gaps in LLMs Fine-tuning

Authors: Essa Jan, Nouar AlDahoul, Moiz Ali, Faizan Ahmad, Fareed Zaffar, Yasir Zaki
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2409.15361
Pdf URL: https://arxiv.org/pdf/2409.15361
Copy Paste: [[2409.15361]] Multitask Mayhem: Unveiling and Mitigating Safety Gaps in LLMs Fine-tuning(https://arxiv.org/abs/2409.15361)
Keywords: language model, llm, prompt
Abstract: Recent breakthroughs in Large Language Models (LLMs) have led to their adoption across a wide range of tasks, from code generation to machine translation and sentiment analysis. Red teaming/Safety alignment efforts show that fine-tuning models on benign (non-harmful) data could compromise safety. However, it remains unclear to what extent this phenomenon is influenced by different variables, including fine-tuning task, model calibrations, etc. This paper explores the task-wise safety degradation due to fine-tuning on downstream tasks such as summarization, code generation, translation, and classification across various calibrations. Our results reveal that: 1) Fine-tuning LLMs for code generation and translation leads to the highest degradation in safety guardrails. 2) LLMs generally have weaker guardrails for translation and classification, with 73-92% of harmful prompts answered, across baseline and other calibrations, falling into one of two concern categories. 3) Current solutions, including guards and safety tuning datasets, lack cross-task robustness. To address these issues, we developed a new multitask safety dataset effectively reducing attack success rates across a range of tasks without compromising the model's overall helpfulness. Our work underscores the need for generalized alignment measures to ensure safer and more robust models.

Title: VERA: Validation and Enhancement for Retrieval Augmented systems

Authors: Nitin Aravind Birur, Tanay Baswa, Divyanshu Kumar, Jatan Loya, Sahil Agarwal, Prashanth Harshangi
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2409.15364
Pdf URL: https://arxiv.org/pdf/2409.15364
Copy Paste: [[2409.15364]] VERA: Validation and Enhancement for Retrieval Augmented systems(https://arxiv.org/abs/2409.15364)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Large language models (LLMs) exhibit remarkable capabilities but often produce inaccurate responses, as they rely solely on their embedded knowledge. Retrieval-Augmented Generation (RAG) enhances LLMs by incorporating an external information retrieval system, supplying additional context along with the query to mitigate inaccuracies for a particular context. However, accuracy issues still remain, as the model may rely on irrelevant documents or extrapolate incorrectly from its training knowledge. To assess and improve the performance of both the retrieval system and the LLM in a RAG framework, we propose VERA (Validation and Enhancement for Retrieval Augmented systems), a system designed to: 1) Evaluate and enhance the retrieved context before response generation, and 2) Evaluate and refine the LLM-generated response to ensure precision and minimize errors. VERA employs an evaluator-cum-enhancer LLM that first checks if external retrieval is necessary, evaluates the relevance and redundancy of the retrieved context, and refines it to eliminate non-essential information. Post-response generation, VERA splits the response into atomic statements, assesses their relevance to the query, and ensures adherence to the context. Our experiments demonstrate VERA's remarkable efficacy not only in improving the performance of smaller open-source models, but also of larger state-of-the-art models. These enhancements underscore VERA's potential to produce accurate and relevant responses, advancing the state-of-the-art in retrieval-augmented language modeling. VERA's robust methodology, combining multiple evaluation and refinement steps, effectively mitigates hallucinations and improves retrieval and response processes, making it a valuable tool for applications demanding high accuracy and reliability in information generation.
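
Illustrative sketch (not the authors' implementation): the post-generation stage described above, which splits a response into statements and checks each one against the retrieved context, can be approximated without any model. Here sentence splitting stands in for atomic-statement extraction and plain word overlap stands in for the evaluator LLM; the function names and the 0.6 threshold are assumptions chosen for the example.

```python
import re


def split_into_statements(response):
    """Naive stand-in for atomic-statement splitting: one statement per sentence."""
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [p for p in parts if p]


def _words(text):
    """Lowercased alphanumeric tokens of a text, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def support_score(statement, context):
    """Fraction of the statement's words that also occur in the context."""
    words = _words(statement)
    if not words:
        return 0.0
    return len(words & _words(context)) / len(words)


def filter_response(response, context, threshold=0.6):
    """Keep only statements whose word overlap with the context meets the threshold."""
    return [
        s for s in split_into_statements(response)
        if support_score(s, context) >= threshold
    ]


context = "The Eiffel Tower is in Paris. It was completed in 1889."
response = "The Eiffel Tower is in Paris. It was painted green in 1950."
kept = filter_response(response, context)
print(kept)
```

In this toy case the first statement is fully covered by the context and is kept, while the second shares only half of its words with the context and is dropped. A real evaluator would judge meaning rather than word overlap.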

Title: Bone: Block Affine Transformation as Parameter Efficient Fine-tuning Methods for Large Language Models

Authors: Jiale Kang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.15371
Pdf URL: https://arxiv.org/pdf/2409.15371
Copy Paste: [[2409.15371]] Bone: Block Affine Transformation as Parameter Efficient Fine-tuning Methods for Large Language Models(https://arxiv.org/abs/2409.15371)
Keywords: language model, llm
Abstract: As Large Language Models (LLMs) continue to grow in size, their computational and memory requirements increase correspondingly. Consequently, the exploration of cost-effective and efficient fine-tuning methods has become increasingly important. Low-Rank Adaptation (LoRA) has achieved remarkable training results by freezing the original weights and training only low-rank matrices, establishing itself as the predominant fine-tuning method for LLMs. In pursuit of performance closer to full-parameter training, a series of LoRA variants have emerged, such as LoRA+, PISSA, Olora, and LoRA-GA. However, these methods also make the fine-tuning initialization process more complex, and it remains challenging to surpass the performance ceiling of full fine-tuning. To address these issues, this paper introduces an innovative method called Bone (Block Affine), which not only reduces memory overhead but also emphasizes the internal connections between weights, leading to faster convergence and better data fitting. Experimental comparisons across two different LLM architectures (LLaMA2, RWKV6) and various parameter scales demonstrate that the Bone structure can achieve rapid convergence and superior data fitting without the need for complex initialization. For example, when fine-tuning LLaMA2-7B on the MetaMathQA dataset and validating on GSM8k and math benchmarks, Bone achieved fine-tuning scores of 49.36 and 8.8, respectively, outperforming PISSA by 5.84% and 1.96%.
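
Illustrative sketch of the LoRA baseline mentioned in the abstract (this is not Bone, whose block-affine update is not specified here): the frozen weight W is combined with a trainable low-rank product B·A. The matrix sizes, the rank of 16, and the helper names are assumptions for the example; zero-initializing B is the common LoRA convention, which leaves the model unchanged at the start of training.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, b, a):
    """Effective weight W + B @ A; W stays frozen, only B and A would be trained."""
    delta = matmul(b, a)
    return [[wij + dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def full_params(d_out, d_in):
    """Trainable parameters when the whole d_out x d_in matrix is tuned."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable parameters for B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)


W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # frozen 2 x 3 weight
A = [[0.5, 0.0, -0.5]]                   # rank-1 factor, 1 x 3
B_init = [[0.0], [0.0]]                  # zero init: no change at step 0
B_trained = [[1.0], [2.0]]               # hypothetical values after training

unchanged = lora_weight(W, B_init, A)
adapted = lora_weight(W, B_trained, A)

# Parameter count for one hypothetical 4096 x 4096 projection at rank 16.
ratio = lora_params(4096, 4096, 16) / full_params(4096, 4096)
print(adapted, ratio)
```

For that hypothetical projection, the low-rank factors hold 131,072 trainable values against 16,777,216 for the full matrix, which is under 1% of the parameters.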

Title: Prompting Large Language Models for Supporting the Differential Diagnosis of Anemia

Authors: Elisa Castagnari (HeKA), Lillian Muyama (HeKA), Adrien Coulet (HeKA)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.15377
Pdf URL: https://arxiv.org/pdf/2409.15377
Copy Paste: [[2409.15377]] Prompting Large Language Models for Supporting the Differential Diagnosis of Anemia(https://arxiv.org/abs/2409.15377)
Keywords: language model, gpt, llm, prompt
Abstract: In practice, clinicians achieve a diagnosis by following a sequence of steps, such as laboratory exams, observations, or imaging. The pathways to reach diagnosis decisions are documented by guidelines authored by expert organizations, which guide clinicians to reach a correct diagnosis through these sequences of steps. While these guidelines are beneficial for following medical reasoning and consolidating medical knowledge, they have some drawbacks. They often fail to address patients with uncommon conditions due to their focus on the majority population, and are slow and costly to update, making them unsuitable for rapidly emerging diseases or new practices. Inspired by clinical guidelines, our study aimed to develop pathways similar to those that can be obtained in clinical guidelines. We tested three Large Language Models (LLMs), namely Generative Pretrained Transformer 4 (GPT-4), Large Language Model Meta AI (LLaMA), and Mistral, on a synthetic yet realistic dataset to differentially diagnose anemia and its subtypes. By using advanced prompting techniques to enhance the decision-making process, we generated diagnostic pathways using these models. Experimental results indicate that LLMs hold huge potential in clinical pathway discovery from patient data, with GPT-4 exhibiting the best performance in all conducted experiments.

Title: Kalahi: A handcrafted, grassroots cultural LLM evaluation suite for Filipino

Authors: Jann Railey Montalan, Jian Gang Ngui, Wei Qi Leong, Yosephine Susanto, Hamsawardhini Rengarajan, William Chandra Tjhi, Alham Fikri Aji
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.15380
Pdf URL: https://arxiv.org/pdf/2409.15380
Copy Paste: [[2409.15380]] Kalahi: A handcrafted, grassroots cultural LLM evaluation suite for Filipino(https://arxiv.org/abs/2409.15380)
Keywords: language model, llm, prompt
Abstract: Multilingual large language models (LLMs) today may not necessarily provide culturally appropriate and relevant responses to their Filipino users. We introduce Kalahi, a cultural LLM evaluation suite collaboratively created by native Filipino speakers. It is composed of 150 high-quality, handcrafted and nuanced prompts that test LLMs for generations that are relevant to shared Filipino cultural knowledge and values. Strong LLM performance in Kalahi indicates a model's ability to generate responses similar to what an average Filipino would say or do in a given situation. We conducted experiments on LLMs with multilingual and Filipino language support. Results show that Kalahi, while trivial for Filipinos, is challenging for LLMs, with the best model answering only 46.0% of the questions correctly compared to native Filipino performance of 89.10%. Thus, Kalahi can be used to accurately and reliably evaluate Filipino cultural representation in LLMs.

Title: Adversarial Attacks on Parts of Speech: An Empirical Study in Text-to-Image Generation

Authors: G M Shahariar, Jia Chen, Jiachen Li, Yue Dong
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2409.15381
Pdf URL: https://arxiv.org/pdf/2409.15381
Copy Paste: [[2409.15381]] Adversarial Attacks on Parts of Speech: An Empirical Study in Text-to-Image Generation(https://arxiv.org/abs/2409.15381)
Keywords: prompt
Abstract: Recent studies show that text-to-image (T2I) models are vulnerable to adversarial attacks, especially with noun perturbations in text prompts. In this study, we investigate the impact of adversarial attacks on different POS tags within text prompts on the images generated by T2I models. We create a high-quality dataset for realistic POS tag token swapping and perform gradient-based attacks to find adversarial suffixes that mislead T2I models into generating images with altered tokens. Our empirical results show that the attack success rate (ASR) varies significantly among different POS tag categories, with nouns, proper nouns, and adjectives being the easiest to attack. We explore the mechanism behind the steering effect of adversarial suffixes, finding that the number of critical tokens and content fusion vary among POS tags, while features like suffix transferability are consistent across categories. We have made our implementation publicly available at - this https URL.

Title: Parse Trees Guided LLM Prompt Compression

Authors: Wenhao Mao, Chengbin Hou, Tianyu Zhang, Xinyu Lin, Ke Tang, Hairong Lv
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.15395
Pdf URL: https://arxiv.org/pdf/2409.15395
Copy Paste: [[2409.15395]] Parse Trees Guided LLM Prompt Compression(https://arxiv.org/abs/2409.15395)
Keywords: language model, llm, hallucination, prompt
Abstract: Offering rich contexts to Large Language Models (LLMs) has been shown to boost performance in various tasks, but the resulting longer prompt would increase the computational cost and might exceed the input limit of LLMs. Recently, some prompt compression methods have been suggested to shorten the length of prompts by using language models to generate shorter prompts or by developing computational models to select important parts of the original prompt. The generative compression methods would suffer from issues like hallucination, while the selective compression methods have not involved linguistic rules and overlook the global structure of the prompt. To this end, we propose a novel selective compression method called PartPrompt. It first obtains a parse tree for each sentence based on linguistic rules, and calculates local information entropy for each node in a parse tree. These local parse trees are then organized into a global tree according to the hierarchical structure such as the dependency of sentences, paragraphs, and sections. After that, the root-ward propagation and leaf-ward propagation are proposed to adjust node values over the global tree. Finally, a recursive algorithm is developed to prune the global tree based on the adjusted node values. The experiments show that PartPrompt achieves state-of-the-art performance across various datasets, metrics, compression ratios, and target LLMs for inference. The in-depth ablation studies confirm the effectiveness of designs in PartPrompt, and other additional experiments also demonstrate its superiority in terms of the coherence of compressed prompts and in the extreme long prompt scenario.

Title: CUTE: Measuring LLMs' Understanding of Their Tokens

Authors: Lukas Edman, Helmut Schmid, Alexander Fraser
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.15452
Pdf URL: https://arxiv.org/pdf/2409.15452
Copy Paste: [[2409.15452]] CUTE: Measuring LLMs' Understanding of Their Tokens(https://arxiv.org/abs/2409.15452)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) show remarkable performance on a wide variety of tasks. Most LLMs split text into multi-character tokens and process them as atomic units without direct access to individual characters. This raises the question: To what extent can LLMs learn orthographic information? To answer this, we propose a new benchmark, CUTE, which features a collection of tasks designed to test the orthographic knowledge of LLMs. We evaluate popular LLMs on CUTE, finding that most of them seem to know the spelling of their tokens, yet fail to use this information effectively to manipulate text, calling into question how much of this knowledge is generalizable.
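
Illustrative sketch (not the CUTE benchmark itself): character-level test items of the kind the abstract alludes to can be generated programmatically, because the ground truth is trivial to compute in code even when it is hard for a model that only sees multi-character tokens. The two task types, the prompt wording, and the exact-match scorer below are assumptions for this example and are not taken from the paper.

```python
def spelling_item(word):
    """Hypothetical spelling task: the answer is the word's letters, space-separated."""
    return {
        "prompt": f'Spell out the word "{word}" letter by letter, separated by spaces.',
        "answer": " ".join(word),
    }


def char_swap_item(word, old, new):
    """Hypothetical manipulation task: replace every occurrence of one character."""
    return {
        "prompt": f'Replace every "{old}" with "{new}" in the word "{word}".',
        "answer": word.replace(old, new),
    }


def exact_match(prediction, item):
    """Score a model output by case-insensitive exact match after trimming."""
    return prediction.strip().lower() == item["answer"].lower()


spell = spelling_item("token")
swap = char_swap_item("banana", "a", "o")
print(spell["prompt"], "->", spell["answer"])
print(swap["prompt"], "->", swap["answer"])
```

A model may pass the first kind of item (knowing a token's spelling) and still fail the second (using that spelling to edit text), which is the gap the abstract reports.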

Title: In-Context Learning May Not Elicit Trustworthy Reasoning: A-Not-B Errors in Pretrained Language Models

Authors: Pengrui Han, Peiyang Song, Haofei Yu, Jiaxuan You
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.15454
Pdf URL: https://arxiv.org/pdf/2409.15454
Copy Paste: [[2409.15454]] In-Context Learning May Not Elicit Trustworthy Reasoning: A-Not-B Errors in Pretrained Language Models(https://arxiv.org/abs/2409.15454)
Keywords: language model, llm
Abstract: Recent advancements in artificial intelligence have led to the creation of highly capable large language models (LLMs) that can perform tasks in a human-like manner. However, LLMs exhibit only infant-level cognitive abilities in certain areas. One such area is the A-Not-B error, a phenomenon seen in infants where they repeat a previously rewarded behavior despite well-observed changed conditions. This highlights their lack of inhibitory control -- the ability to stop a habitual or impulsive response. In our work, we design a text-based multi-choice QA scenario similar to the A-Not-B experimental settings to systematically test the inhibitory control abilities of LLMs. We found that state-of-the-art LLMs (like Llama3-8b) perform consistently well with in-context learning (ICL) but make errors and show a significant drop of as much as 83.3% in reasoning tasks when the context changes trivially. This suggests that LLMs only have inhibitory control abilities on par with human infants in this regard, often failing to suppress the previously established response pattern during ICL.
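
Illustrative sketch (not the authors' dataset or prompt format): an A-Not-B style probe can be assembled by showing several in-context examples whose correct option is always "A", then asking a final question whose correct option is "B". The arithmetic questions, the two-option layout, and the helper names are assumptions made for this example.

```python
def format_question(question, options):
    """Render a question with its options labeled A and B."""
    lines = [f"Q: {question}"]
    lines += [f"{label}. {text}" for label, text in zip("AB", options)]
    return "\n".join(lines)


def build_prompt(demos, test):
    """Few-shot prompt: answered demonstrations followed by an unanswered test item."""
    blocks = [
        format_question(q, opts) + f"\nAnswer: {ans}"
        for q, opts, ans in demos
    ]
    question, options, _ = test
    blocks.append(format_question(question, options) + "\nAnswer:")
    return "\n\n".join(blocks)


# Every demonstration rewards option A ...
demos = [
    ("What is 2 + 2?", ("4", "5"), "A"),
    ("What is 3 + 3?", ("6", "7"), "A"),
    ("What is 5 + 1?", ("6", "8"), "A"),
]
# ... but the correct option for the test item is B.
test_item = ("What is 4 + 4?", ("9", "8"), "B")

prompt = build_prompt(demos, test_item)
print(prompt)
```

A model that answers "A" here is repeating the established pattern instead of solving the question, which is the failure the abstract describes.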

Title: Learning When to Retrieve, What to Rewrite, and How to Respond in Conversational QA

Authors: Nirmal Roy, Leonardo F. R. Ribeiro, Rexhina Blloshmi, Kevin Small
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.15515
Pdf URL: https://arxiv.org/pdf/2409.15515
Copy Paste: [[2409.15515]] Learning When to Retrieve, What to Rewrite, and How to Respond in Conversational QA(https://arxiv.org/abs/2409.15515)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Augmenting Large Language Models (LLMs) with information retrieval capabilities (i.e., Retrieval-Augmented Generation (RAG)) has proven beneficial for knowledge-intensive tasks. However, understanding users' contextual search intent when generating responses is an understudied topic for conversational question answering (QA). This conversational extension leads to additional concerns when compared to single-turn QA as it is more challenging for systems to comprehend conversational context and manage retrieved passages over multiple turns. In this work, we propose a method for enabling LLMs to decide when to retrieve in RAG settings given a conversational context. When retrieval is deemed necessary, the LLM then rewrites the conversation for passage retrieval and judges the relevance of returned passages before response generation. Operationally, we build on the single-turn SELF-RAG framework (Asai et al., 2023) and propose SELF-multi-RAG for conversational settings. SELF-multi-RAG demonstrates improved capabilities over single-turn variants with respect to retrieving relevant passages (by using summarized conversational context) and assessing the quality of generated responses. Experiments on three conversational QA datasets validate the enhanced response generation capabilities of SELF-multi-RAG, with improvements of ~13% measured by human annotation.

Title: GEM-RAG: Graphical Eigen Memories For Retrieval Augmented Generation

Authors: Brendan Hogan Rappazzo, Yingheng Wang, Aaron Ferber, Carla Gomes
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.15566
Pdf URL: https://arxiv.org/pdf/2409.15566
Copy Paste: [[2409.15566]] GEM-RAG: Graphical Eigen Memories For Retrieval Augmented Generation(https://arxiv.org/abs/2409.15566)
Keywords: language model, gpt, llm, prompt, retrieval augmented generation, agent
Abstract: The ability to form, retrieve, and reason about memories in response to stimuli serves as the cornerstone for general intelligence - shaping entities capable of learning, adaptation, and intuitive insight. Large Language Models (LLMs) have proven their ability, given the proper memories or context, to reason and respond meaningfully to stimuli. However, they are still unable to optimally encode, store, and retrieve memories - the ability to do this would unlock their full ability to operate as AI agents, and to specialize to niche domains. To remedy this, one promising area of research is Retrieval Augmented Generation (RAG), which aims to augment LLMs by providing them with rich in-context examples and information. In question-answering (QA) applications, RAG methods embed the text of interest in chunks, and retrieve the most relevant chunks for a prompt using text embeddings. Motivated by human memory encoding and retrieval, we aim to improve over standard RAG methods by generating and encoding higher-level information and tagging the chunks by their utility to answer questions. We introduce Graphical Eigen Memories For Retrieval Augmented Generation (GEM-RAG). GEM-RAG works by tagging each chunk of text in a given text corpus with LLM generated "utility" questions, connecting chunks in a graph based on the similarity of both their text and utility questions, and then using the eigendecomposition of the memory graph to build higher level summary nodes that capture the main themes of the text. We evaluate GEM-RAG, using both UnifiedQA and GPT-3.5 Turbo as the LLMs, with SBERT and OpenAI's text encoders on two standard QA tasks, showing that GEM-RAG outperforms other state-of-the-art RAG methods on these tasks. We also discuss the implications of having a robust RAG system and future directions.
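
Illustrative sketch (a simplification, not the GEM-RAG pipeline): the graph step can be pictured as building a chunk-to-chunk similarity matrix and reading themes off its eigenvectors. Here Jaccard word overlap stands in for embedding similarity, utility questions are omitted, and power iteration recovers only the dominant eigenvector rather than a full eigendecomposition. The example chunks and function names are assumptions.

```python
import math
import re


def words(text):
    """Lowercased alphabetic tokens of a chunk, as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Set overlap in [0, 1]; stands in for embedding similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_matrix(chunks):
    """Symmetric chunk-to-chunk similarity matrix (the weighted graph)."""
    sets = [words(c) for c in chunks]
    return [[jaccard(x, y) for y in sets] for x in sets]


def dominant_eigenvector(matrix, iterations=100):
    """Power iteration: repeatedly multiply and renormalize a start vector."""
    n = len(matrix)
    vec = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        nxt = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec


chunks = [
    "The cat sat on the mat.",
    "The cat chased the mouse.",
    "The mouse ate the cheese.",
    "Stock prices fell sharply today.",
]
vec = dominant_eigenvector(similarity_matrix(chunks))
central = max(range(len(chunks)), key=lambda i: vec[i])
print(central, [round(v, 3) for v in vec])
```

In this toy corpus the three animal chunks form one connected theme and carry nearly all of the eigenvector's weight, while the unrelated finance chunk decays toward zero. A summary node for that theme could then be built from the highest-weighted chunks.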

Title: Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents

Authors: Bandhav Veluri, Benjamin N Peloquin, Bokai Yu, Hongyu Gong, Shyamnath Gollakota
Subjects: cs.CL, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2409.15594
Pdf URL: https://arxiv.org/pdf/2409.15594
Copy Paste: [[2409.15594]] Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents(https://arxiv.org/abs/2409.15594)
Keywords: llm, prompt, agent
Abstract: Despite broad interest in modeling spoken dialogue agents, most approaches are inherently "half-duplex" -- restricted to turn-based interaction with responses requiring explicit prompting by the user or implicit tracking of interruption or silence events. Human dialogue, by contrast, is "full-duplex" allowing for rich synchronicity in the form of quick and dynamic turn-taking, overlapping speech, and backchanneling. Technically, the challenge of achieving full-duplex dialogue with LLMs lies in modeling synchrony as pre-trained LLMs do not have a sense of "time". To bridge this gap, we propose Synchronous LLMs for full-duplex spoken dialogue modeling. We design a novel mechanism to integrate time information into Llama3-8b so that they run synchronously with the real-world clock. We also introduce a training recipe that uses 212k hours of synthetic spoken dialogue data generated from text dialogue data to create a model that generates meaningful and natural spoken dialogue, with just 2k hours of real-world spoken dialogue data. Synchronous LLMs outperform state-of-the-art in dialogue meaningfulness while maintaining naturalness. Finally, we demonstrate the model's ability to participate in full-duplex dialogue by simulating interaction between two agents trained on different datasets, while considering Internet-scale latencies of up to 240 ms. Webpage: this https URL.

Title: A Survey of Stance Detection on Social Media: New Directions and Perspectives

Authors: Bowen Zhang, Genan Dai, Fuqiang Niu, Nan Yin, Xiaomao Fan, Hu Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.15690
Pdf URL: https://arxiv.org/pdf/2409.15690
Copy Paste: [[2409.15690]] A Survey of Stance Detection on Social Media: New Directions and Perspectives(https://arxiv.org/abs/2409.15690)
Keywords: language model
Abstract: In modern digital environments, users frequently express opinions on contentious topics, providing a wealth of information on prevailing attitudes. The systematic analysis of these opinions offers valuable insights for decision-making in various sectors, including marketing and politics. As a result, stance detection has emerged as a crucial subfield within affective computing, enabling the automatic detection of user stances in social media conversations and providing a nuanced understanding of public sentiment on complex issues. Recent years have seen a surge of research interest in developing effective stance detection methods, with contributions from multiple communities, including natural language processing, web science, and social computing. This paper provides a comprehensive survey of stance detection techniques on social media, covering task definitions, datasets, approaches, and future works. We review traditional stance detection models, as well as state-of-the-art methods based on large language models, and discuss their strengths and limitations. Our survey highlights the importance of stance detection in understanding public opinion and sentiment, and identifies gaps in current research. We conclude by outlining potential future directions for stance detection on social media, including the need for more robust and generalizable models, and the importance of addressing emerging challenges such as multi-modal stance detection and stance detection in low-resource languages.
摘要：在现代数字环境中，用户经常就有争议的话题发表意见，提供了大量有关主流态度的信息。对这些意见进行系统分析，为包括营销和政治在内的各个领域的决策提供了宝贵的见解。因此，立场检测已成为情感计算中的一个重要子领域，它能够自动检测社交媒体对话中的用户立场，并提供对复杂问题上公众情绪的细致理解。近年来，人们对开发有效立场检测方法的研究兴趣激增，来自多个社区的贡献包括自然语言处理、网络科学和社交计算。本文对社交媒体上的立场检测技术进行了全面的调查，涵盖了任务定义、数据集、方法和未来工作。我们回顾了传统的立场检测模型以及基于大型语言模型的最新方法，并讨论了它们的优势和局限性。我们的调查强调了立场检测在理解公众舆论和情绪方面的重要性，并确定了当前研究中的差距。最后，我们概述了社交媒体立场检测的未来潜在方向，包括对更为稳健和更通用的模型的需求，以及解决多模态立场检测和低资源语言立场检测等新兴挑战的重要性。

Title: Lighter And Better: Towards Flexible Context Adaptation For Retrieval Augmented Generation

Authors: Zheng Liu, Chenyuan Wu, Ninglu Shao, Shitao Xiao, Chaozhuo Li, Defu Lian
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.15699
Pdf URL: https://arxiv.org/pdf/2409.15699
Copy Paste: [[2409.15699]] Lighter And Better: Towards Flexible Context Adaptation For Retrieval Augmented Generation(https://arxiv.org/abs/2409.15699)
Keywords: language model, llm, retrieval augmented generation, retrieval-augmented generation
Abstract: The existing Retrieval-Augmented Generation (RAG) systems face significant challenges in terms of cost and effectiveness. On one hand, they need to encode the lengthy retrieved contexts before responding to the input tasks, which imposes substantial computational overhead. On the other hand, directly using generic Large Language Models (LLMs) often leads to sub-optimal answers, while task-specific fine-tuning may compromise the LLMs' general capabilities. To address these challenges, we introduce a novel approach called FlexRAG (Flexible Context Adaptation for RAG). In this approach, the retrieved contexts are compressed into compact embeddings before being encoded by the LLMs. Simultaneously, these compressed embeddings are optimized to enhance downstream RAG performance. A key feature of FlexRAG is its flexibility, which enables effective support for diverse compression ratios and selective preservation of important contexts. Thanks to these technical designs, FlexRAG achieves superior generation quality while significantly reducing running costs. Comprehensive experiments on various question-answering datasets validate our approach as a cost-effective and flexible solution for RAG systems.
摘要：现有的检索增强生成 (RAG) 系统在成本和效率方面面临重大挑战。一方面，它们需要在响应输入任务之前对冗长的检索上下文进行编码，这会产生大量的计算开销。另一方面，直接使用通用大型语言模型 (LLM) 通常会导致次优答案，而特定于任务的微调可能会损害 LLM 的一般能力。为了应对这些挑战，我们引入了一种称为 FlexRAG（RAG 的灵活上下文自适应）的新方法。在这种方法中，检索到的上下文在由 LLM 编码之前被压缩为紧凑的嵌入。同时，这些压缩的嵌入经过优化以增强下游 RAG 性能。FlexRAG 的一个关键特性是它的灵活性，它能够有效支持不同的压缩比并选择性地保留重要的上下文。得益于这些技术设计，FlexRAG 实现了卓越的生成质量，同时显着降低了运行成本。在各种问答数据集上进行的综合实验验证了我们的方法是 RAG 系统的一种经济高效且灵活的解决方案。

Title: XTRUST: On the Multilingual Trustworthiness of Large Language Models

Authors: Yahan Li, Yi Wang, Yi Chang, Yuan Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.15762
Pdf URL: https://arxiv.org/pdf/2409.15762
Copy Paste: [[2409.15762]] XTRUST: On the Multilingual Trustworthiness of Large Language Models(https://arxiv.org/abs/2409.15762)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across a range of natural language processing (NLP) tasks, capturing the attention of both practitioners and the broader public. A key question that now preoccupies the AI community concerns the capabilities and limitations of these models, with trustworthiness emerging as a central issue, particularly as LLMs are increasingly applied in sensitive fields like healthcare and finance, where errors can have serious consequences. However, most previous studies on the trustworthiness of LLMs have been limited to a single language, typically the predominant one in the dataset, such as English. In response to the growing global deployment of LLMs, we introduce XTRUST, the first comprehensive multilingual trustworthiness benchmark. XTRUST encompasses a diverse range of topics, including illegal activities, hallucination, out-of-distribution (OOD) robustness, physical and mental health, toxicity, fairness, misinformation, privacy, and machine ethics, across 10 different languages. Using XTRUST, we conduct an empirical evaluation of the multilingual trustworthiness of five widely used LLMs, offering an in-depth analysis of their performance across languages and tasks. Our results indicate that many LLMs struggle with certain low-resource languages, such as Arabic and Russian, highlighting the considerable room for improvement in the multilingual trustworthiness of current language models. The code is available at this https URL.
摘要：大型语言模型 (LLM) 在一系列自然语言处理 (NLP) 任务中表现出了卓越的能力，吸引了从业者和广大公众的关注。目前，人工智能社区关注的一个关键问题是这些模型的能力和局限性，可信度正成为一个核心问题，尤其是随着 LLM 越来越多地应用于医疗保健和金融等敏感领域，这些领域的错误可能会带来严重后果。然而，之前关于 LLM 可信度的大多数研究都局限于一种语言，通常是数据集中占主导地位的语言，例如英语。为了应对 LLM 在全球范围内日益增长的部署，我们推出了第一个全面的多语言可信度基准 XTRUST。XTRUST 涵盖了 10 种不同语言的各种主题，包括非法活动、幻觉、分布外 (OOD) 稳健性、身心健康、毒性、公平性、错误信息、隐私和机器伦理。使用 XTRUST，我们对五种广泛使用的 LLM 的多语言可信度进行了实证评估，并深入分析了它们在各种语言和任务中的表现。我们的结果表明，许多 LLM 在某些资源匮乏的语言（如阿拉伯语和俄语）上表现不佳，这凸显了当前语言模型在多语言可信度方面还有很大的改进空间。代码可从此 https URL 获取。

Title: CHBench: A Chinese Dataset for Evaluating Health in Large Language Models

Authors: Chenlu Guo, Nuo Xu, Yi Chang, Yuan Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.15766
Pdf URL: https://arxiv.org/pdf/2409.15766
Copy Paste: [[2409.15766]] CHBench: A Chinese Dataset for Evaluating Health in Large Language Models(https://arxiv.org/abs/2409.15766)
Keywords: language model, llm
Abstract: With the rapid development of large language models (LLMs), assessing their performance on health-related inquiries has become increasingly essential. It is critical that these models provide accurate and trustworthy health information, as their application in real-world contexts--where misinformation can have serious consequences for individuals seeking medical advice and support--depends on their reliability. In this work, we present CHBench, the first comprehensive Chinese Health-related Benchmark designed to evaluate LLMs' capabilities in understanding physical and mental health across diverse scenarios. CHBench includes 6,493 entries related to mental health and 2,999 entries focused on physical health, covering a broad spectrum of topics. This dataset serves as a foundation for evaluating Chinese LLMs' capacity to comprehend and generate accurate health-related information. Our extensive evaluations of four popular Chinese LLMs demonstrate that there remains considerable room for improvement in their understanding of health-related information. The code is available at this https URL.
摘要：随着大型语言模型 (LLM) 的快速发展，评估其在健康相关查询方面的表现变得越来越重要。这些模型提供准确可信的健康信息至关重要，因为它们在现实世界中的应用（错误信息可能会对寻求医疗建议和支持的个人造成严重后果）取决于它们的可靠性。在这项工作中，我们提出了 CHBench，这是第一个全面的中国健康相关基准，旨在评估 LLM 在不同场景中理解身心健康的能力。CHBench 包括 6,493 个与心理健康相关的条目和 2,999 个专注于身体健康的条目，涵盖了广泛的主题。该数据集是评估中国 LLM 理解和生成准确健康相关信息的能力的基础。我们对四门热门中国 LLM 的广泛评估表明，它们在理解健康相关信息方面仍有很大改进空间。代码可在此 https URL 上找到。

Title: Small Language Models: Survey, Measurements, and Insights

Authors: Zhenyan Lu, Xiang Li, Dongqi Cai, Rongjie Yi, Fangming Liu, Xiwen Zhang, Nicholas D. Lane, Mengwei Xu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2409.15790
Pdf URL: https://arxiv.org/pdf/2409.15790
Copy Paste: [[2409.15790]] Small Language Models: Survey, Measurements, and Insights(https://arxiv.org/abs/2409.15790)
Keywords: language model, llm
Abstract: Small language models (SLMs), despite their widespread adoption in modern smart devices, have received significantly less academic attention compared to their large language model (LLM) counterparts, which are predominantly deployed in data centers and cloud environments. While researchers continue to improve the capabilities of LLMs in the pursuit of artificial general intelligence, SLM research aims to make machine intelligence more accessible, affordable, and efficient for everyday tasks. Focusing on transformer-based, decoder-only language models with 100M-5B parameters, we survey 59 state-of-the-art open-source SLMs, analyzing their technical innovations across three axes: architectures, training datasets, and training algorithms. In addition, we evaluate their capabilities in various domains, including commonsense reasoning, in-context learning, mathematics, and coding. To gain further insight into their on-device runtime costs, we benchmark their inference latency and memory footprints. Through in-depth analysis of our benchmarking data, we offer valuable insights to advance research in this field.
摘要：尽管小型语言模型 (SLM) 在现代智能设备中得到广泛采用，但与主要部署在数据中心和云环境中的大型语言模型 (LLM) 相比，小型语言模型受到的学术关注要少得多。虽然研究人员在追求通用人工智能的过程中不断提高 LLM 的能力，但 SLM 研究旨在使机器智能更易于访问、更实惠、更高效地完成日常任务。我们专注于具有 100M-5B 参数的基于转换器的仅解码器语言模型，调查了 59 个最先进的开源 SLM，从三个方面分析了它们的技术创新：架构、训练数据集和训练算法。此外，我们还评估了它们在各个领域的能力，包括常识推理、上下文学习、数学和编码。为了进一步了解它们的设备运行时成本，我们对它们的推理延迟和内存占用进行了基准测试。通过深入分析我们的基准测试数据，我们为推动该领域的研究提供了宝贵的见解。

Title: NER-Luxury: Named entity recognition for the fashion and luxury domain

Authors: Akim Mousterou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.15804
Pdf URL: https://arxiv.org/pdf/2409.15804
Copy Paste: [[2409.15804]] NER-Luxury: Named entity recognition for the fashion and luxury domain(https://arxiv.org/abs/2409.15804)
Keywords: language model
Abstract: In this study, we address multiple challenges of developing a named-entity recognition model in English for the fashion and luxury industry, namely entity disambiguation, French technical jargon in multiple sub-sectors, the scarcity of ESG methodology, and the disparate company structures of the sector, which range from small and medium-sized luxury houses to large conglomerates leveraging economies of scale. In this work, we introduce a taxonomy of 36+ entity types with a luxury-oriented annotation scheme, and create a dataset of more than 40K sentences respecting a clear hierarchical classification. We also present five supervised fine-tuned models, NER-Luxury, for fashion, beauty, watches, jewelry, fragrances, cosmetics, and overall luxury, focusing equally on the aesthetic side and the quantitative side. In an additional experiment, we quantitatively compare the NER performance of our models against state-of-the-art open-source large language models; the comparison shows promising results and highlights the benefits of incorporating a bespoke NER model in existing machine learning pipelines.
摘要：在本研究中，我们解决了为时尚和奢侈品行业开发英语命名实体识别模型的多项挑战，即实体消歧、多个子行业的法语技术术语、ESG 方法的稀缺性以及该行业的公司结构差异，从中小型奢侈品公司到利用规模经济的大型企业集团。在这项工作中，我们引入了 36 多种实体类型的分类法和面向奢侈品的注释方案，并创建了一个包含 40,000 多个句子的数据集，遵循清晰的层次分类。我们还针对时尚、美容、手表、珠宝、香水、化妆品和整体奢侈品提出了五个监督微调模型 NER-Luxury，同样关注美学方面和定量方面。在另一项实验中，我们对我们的模型的 NER 性能进行了定量实证评估，将其与最先进的开源大型语言模型进行了比较，这些模型显示出令人鼓舞的结果，并强调了在现有机器学习管道中纳入定制 NER 模型的好处。

Title: Empirical Insights on Fine-Tuning Large Language Models for Question-Answering

Authors: Junjie Ye, Yuming Yang, Qi Zhang, Tao Gui, Xuanjing Huang, Peng Wang, Zhongchao Shi, Jianping Fan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.15825
Pdf URL: https://arxiv.org/pdf/2409.15825
Copy Paste: [[2409.15825]] Empirical Insights on Fine-Tuning Large Language Models for Question-Answering(https://arxiv.org/abs/2409.15825)
Keywords: language model, llm
Abstract: Large language models (LLMs) encode extensive world knowledge through pre-training on massive datasets, which can then be fine-tuned for the question-answering (QA) task. However, effective strategies for fine-tuning LLMs for the QA task remain largely unexplored. To address this gap, we categorize supervised fine-tuning (SFT) data based on the extent of knowledge memorized by the pretrained LLMs and conduct a series of empirical analyses. Our experiments, involving four LLMs from three different model families, focus on three key factors: the amount of data required for SFT, the impact of different SFT datasets on model performance, and how data requirements vary across LLMs. The results show that as few as 60 data points during the SFT stage can activate the knowledge encoded during pre-training, enabling LLMs to perform the QA task. Additionally, SFT with data of varying memory levels has a significant impact on LLM performance, with the optimal dataset differing based on the specific model being fine-tuned. Future research will delve deeper into the mechanisms underlying these phenomena.
摘要：大型语言模型 (LLM) 通过在海量数据集上进行预训练来编码广泛的世界知识，然后可以针对问答 (QA) 任务对其进行微调。然而，针对问答任务微调 LLM 的有效策略在很大程度上仍未被探索。为了解决这一差距，我们根据预训练 LLM 记忆的知识程度对监督微调 (SFT) 数据进行分类，并进行一系列实证分析。我们的实验涉及来自三个不同模型系列的四个 LLM，重点关注三个关键因素：SFT 所需的数据量、不同 SFT 数据集对模型性能的影响以及不同 LLM 之间的数据需求差异。结果表明，在 SFT 阶段，仅需 60 个数据点即可激活预训练期间编码的知识，使 LLM 能够执行问答任务。此外，具有不同记忆水平数据的 SFT 对 LLM 性能有显著影响，最佳数据集因要微调的特定模型而异。未来的研究将更深入地探究这些现象背后的机制。

Title: Unveiling Language Competence Neurons: A Psycholinguistic Approach to Model Interpretability

Authors: Xufeng Duan, Xinyu Zhou, Bei Xiao, Zhenguang G. Cai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.15827
Pdf URL: https://arxiv.org/pdf/2409.15827
Copy Paste: [[2409.15827]] Unveiling Language Competence Neurons: A Psycholinguistic Approach to Model Interpretability(https://arxiv.org/abs/2409.15827)
Keywords: language model, gpt, llm
Abstract: As large language models (LLMs) become more advanced in their linguistic capacity, understanding how they capture aspects of language competence remains a significant challenge. This study therefore employs psycholinguistic paradigms, which are well-suited for probing deeper cognitive aspects of language processing, to explore neuron-level representations in language models across three tasks: sound-shape association, sound-gender association, and implicit causality. Our findings indicate that while GPT-2-XL struggles with the sound-shape task, it demonstrates human-like abilities in both sound-gender association and implicit causality. Targeted neuron ablation and activation manipulation reveal a crucial relationship: when GPT-2-XL displays a linguistic ability, specific neurons correspond to that competence; conversely, the absence of such an ability indicates a lack of specialized neurons. This study is the first to utilize psycholinguistic experiments to investigate deep language competence at the neuron level, providing a new level of granularity in model interpretability and insights into the internal mechanisms driving language ability in transformer-based LLMs.
摘要：随着大型语言模型 (LLM) 的语言能力不断进步，了解它们如何捕捉语言能力的各个方面仍然是一项重大挑战。因此，本研究采用了心理语言学范式，该范式非常适合探索语言处理的更深层次的认知方面，以探索语言模型在三个任务中的神经元级表征：声音形状关联、声音性别关联和隐性因果关系。我们的研究结果表明，虽然 GPT-2-XL 在声音形状任务上表现不佳，但它在声音性别关联和隐性因果关系方面都表现出类似人类的能力。有针对性的神经元消融和激活操作揭示了一个关键的关系：当 GPT-2-XL 表现出语言能力时，特定的神经元与该能力相对应；相反，缺乏这种能力则表明缺乏专门的神经元。这项研究首次利用心理语言学实验在神经元层面研究深度语言能力，为模型可解释性提供了新的粒度，并深入了解了基于转换器的 LLM 中驱动语言能力的内部机制。

Title: A Zero-Shot Open-Vocabulary Pipeline for Dialogue Understanding

Authors: Abdulfattah Safa, Gözde Gül Şahin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.15861
Pdf URL: https://arxiv.org/pdf/2409.15861
Copy Paste: [[2409.15861]] A Zero-Shot Open-Vocabulary Pipeline for Dialogue Understanding(https://arxiv.org/abs/2409.15861)
Keywords: language model, llm, prompt
Abstract: Dialogue State Tracking (DST) is crucial for understanding user needs and executing appropriate system actions in task-oriented dialogues. The majority of existing DST methods are designed to work within predefined ontologies and assume the availability of gold domain labels, struggling to adapt to new slot values. While Large Language Model (LLM)-based systems show promising zero-shot DST performance, they either require extensive computational resources or underperform existing fully-trained systems, limiting their practicality. To address these limitations, we propose a zero-shot, open-vocabulary system that integrates domain classification and DST in a single pipeline. Our approach includes reformulating DST as a question-answering task for less capable models and employing self-refining prompts for more adaptable ones. Our system does not rely on fixed slot values defined in the ontology, allowing the system to adapt dynamically. We compare our approach with existing SOTA, and show that it provides up to 20% better Joint Goal Accuracy (JGA) over previous methods on datasets like MultiWOZ 2.1, with up to 90% fewer requests to the LLM API.
摘要：对话状态跟踪 (DST) 对于理解用户需求和在面向任务的对话中执行适当的系统操作至关重要。大多数现有的 DST 方法都是在预定义的本体中工作而设计的，并且假设有黄金域标签，但难以适应新的槽值。虽然基于大型语言模型 (LLM) 的系统表现出良好的零样本 DST 性能，但它们要么需要大量的计算资源，要么表现不及现有的完全训练的系统，从而限制了它们的实用性。为了解决这些限制，我们提出了一个零样本开放词汇系统，将领域分类和 DST 集成在一个管道中。我们的方法包括将 DST 重新制定为能力较弱的模型的问答任务，并对适应性更强的模型采用自我细化提示。我们的系统不依赖于本体中定义的固定槽值，从而使系统能够动态适应。我们将我们的方法与现有的 SOTA 进行了比较，并表明与以前的方法相比，我们在 Multi-WOZ 2.1 等数据集上的联合目标准确度 (JGA) 提高了 20%，而对 LLM API 的请求减少了 90%。
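
Illustrative sketch (not from the paper): Joint Goal Accuracy is commonly computed as the fraction of dialogue turns whose predicted state matches the gold state exactly, slot for slot. The minimal Python example below assumes that common definition; the slot names and values are hypothetical and chosen only for illustration.

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns whose predicted state equals the gold state exactly.

    Each state is a dict mapping slot name -> value (assumed representation).
    """
    if len(predicted_states) != len(gold_states):
        raise ValueError("need exactly one predicted state per gold state")
    if not gold_states:
        return 0.0
    exact = sum(1 for pred, gold in zip(predicted_states, gold_states) if pred == gold)
    return exact / len(gold_states)


# Hypothetical three-turn dialogue; the second turn has one wrong slot value.
gold = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "4"},
    {"hotel-area": "north", "hotel-stars": "4", "hotel-parking": "yes"},
]
pred = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "3"},
    {"hotel-area": "north", "hotel-stars": "4", "hotel-parking": "yes"},
]
jga = joint_goal_accuracy(pred, gold)  # two of three turns match exactly
```

Because a single wrong slot makes the whole turn count as incorrect, JGA is a strict metric, which is one reason relative gains on it are reported prominently.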

Title: Privacy Evaluation Benchmarks for NLP Models

Authors: Wei Huang, Yinggui Wang, Cen Chen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2409.15868
Pdf URL: https://arxiv.org/pdf/2409.15868
Copy Paste: [[2409.15868]] Privacy Evaluation Benchmarks for NLP Models(https://arxiv.org/abs/2409.15868)
Keywords: language model, llm
Abstract: By launching privacy attacks on NLP models, attackers can obtain sensitive information such as training data and model parameters. Although researchers have studied several kinds of attacks on NLP models in depth, these analyses are not systematic, and a comprehensive understanding of the impact of the attacks is lacking. For example, we must consider which attacks apply to which scenarios, which common factors affect the performance of different attacks, how different attacks relate to one another, and how various datasets and models influence the effectiveness of the attacks. Therefore, we need a benchmark to holistically assess the privacy risks faced by NLP models. In this paper, we present a privacy attack and defense evaluation benchmark in the field of NLP, which includes conventional/small models and large language models (LLMs). This benchmark supports a variety of models, datasets, and protocols, along with standardized modules for comprehensive evaluation of attacks and defense strategies. Based on the above framework, we present a study on the association between auxiliary data from different domains and the strength of privacy attacks, and we provide an improved attack method for this scenario with the help of Knowledge Distillation (KD). Furthermore, we propose a chained framework for privacy attacks that allows a practitioner to chain multiple attacks to achieve a higher-level attack objective. Based on this, we provide some defense and enhanced attack strategies. The code for reproducing the results can be found at this https URL.
摘要：攻击者通过对NLP模型发起隐私攻击，可以获取训练数据、模型参数等敏感信息。尽管研究人员对NLP模型中的几种攻击进行了深入研究，但这些研究都不是系统的分析，对攻击造成的影响缺乏全面的认识。例如，我们需要考虑哪些场景可以适用于哪些攻击，影响不同攻击表现的共同因素是什么，不同攻击之间的关系性质，以及各种数据集和模型对攻击效果的影响等。因此，我们需要一个基准来整体评估NLP模型面临的隐私风险。在本文中，我们提出了一个NLP领域的隐私攻击和防御评估基准，包括常规/小型模型和大型语言模型（LLM）。该基准支持多种模型、数据集和协议，以及用于全面评估攻击和防御策略的标准化模块。基于上述框架，我们提出了一项关于不同领域辅助数据与隐私攻击强度关联性的研究。并且我们在知识蒸馏（KD）的帮助下针对此场景提供了一种改进的攻击方法。此外，我们提出了一种用于隐私攻击的链式框架。允许从业者链接多个攻击以实现更高级别的攻击目标。在此基础上，我们提供了一些防御和增强的攻击策略。可以在此 https URL 中找到用于重现结果的代码。

Title: HLB: Benchmarking LLMs' Humanlikeness in Language Use

Authors: Xufeng Duan, Bei Xiao, Xuemei Tang, Zhenguang G. Cai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.15890
Pdf URL: https://arxiv.org/pdf/2409.15890
Copy Paste: [[2409.15890]] HLB: Benchmarking LLMs' Humanlikeness in Language Use(https://arxiv.org/abs/2409.15890)
Keywords: language model, llm
Abstract: As synthetic data becomes increasingly prevalent in training language models, particularly through generated dialogue, concerns have emerged that these models may deviate from authentic human language patterns, potentially losing the richness and creativity inherent in human communication. This highlights the critical need to assess the humanlikeness of language models in real-world language use. In this paper, we present a comprehensive humanlikeness benchmark (HLB) evaluating 20 large language models (LLMs) using 10 psycholinguistic experiments designed to probe core linguistic aspects, including sound, word, syntax, semantics, and discourse (see this https URL). To anchor these comparisons, we collected responses from over 2,000 human participants and compared them to outputs from the LLMs in these experiments. For rigorous evaluation, we developed a coding algorithm that accurately identified language use patterns, enabling the extraction of response distributions for each task. By comparing the response distributions between human participants and LLMs, we quantified humanlikeness through distributional similarity. Our results reveal fine-grained differences in how well LLMs replicate human responses across various linguistic levels. Importantly, we found that improvements in other performance metrics did not necessarily lead to greater humanlikeness, and in some cases, even resulted in a decline. By introducing psycholinguistic methods to model evaluation, this benchmark offers the first framework for systematically assessing the humanlikeness of LLMs in language use.
摘要：随着合成数据在训练语言模型中越来越普遍，尤其是通过生成对话，人们开始担心这些模型可能会偏离真实的人类语言模式，从而可能失去人类交流固有的丰富性和创造性。这凸显了评估语言模型在现实世界语言使用中的人性化程度的迫切需要。在本文中，我们提出了一个全面的人性化基准 (HLB)，使用 10 个心理语言学实验评估 20 个大型语言模型 (LLM)，这些实验旨在探索核心语言方面，包括声音、单词、语法、语义和话语（请参阅此 https URL）。为了巩固这些比较，我们收集了 2,000 多名人类参与者的回答，并将它们与这些实验中 LLM 的输出进行比较。为了进行严格的评估，我们开发了一种编码算法，可以准确识别语言使用模式，从而能够提取每个任务的响应分布。通过比较人类参与者和 LLM 之间的响应分布，我们通过分布相似性量化了人性化程度。我们的结果揭示了 LLM 在各个语言水平上复制人类响应的能力的细微差异。重要的是，我们发现其他性能指标的改进并不一定会导致更接近人类的程度，在某些情况下甚至会导致下降。通过将心理语言学方法引入模型评估，该基准提供了第一个系统地评估 LLM 在语言使用中接近人类的框架。
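
Illustrative sketch (not from the paper): the abstract says humanlikeness is quantified through distributional similarity between human and LLM response distributions, but it does not name the measure. The Python example below assumes one plausible choice, 1 minus the Jensen-Shannon divergence (base 2), over a fixed set of response categories. The function names and category labels are hypothetical.

```python
import math
from collections import Counter


def distribution(responses, categories):
    """Relative frequency of each category among the coded responses."""
    counts = Counter(responses)
    total = len(responses)
    return [counts[c] / total for c in categories]


def js_divergence(p, q):
    """Jensen-Shannon divergence with log base 2, so the result lies in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def humanlikeness(human_responses, model_responses, categories):
    """Assumed score: 1.0 for identical distributions, 0.0 for disjoint ones."""
    p = distribution(human_responses, categories)
    q = distribution(model_responses, categories)
    return 1.0 - js_divergence(p, q)


# Hypothetical coded responses for one experiment with two response categories.
categories = ["A", "B"]
human = ["A", "A", "A", "B"]
model = ["A", "A", "B", "B"]
score = humanlikeness(human, model, categories)
```

A bounded, symmetric measure is convenient here because scores from different experiments can then be averaged on a common scale.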

Title: Konstruktor: A Strong Baseline for Simple Knowledge Graph Question Answering

Authors: Maria Lysyuk, Mikhail Salnikov, Pavel Braslavski, Alexander Panchenko
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.15902
Pdf URL: https://arxiv.org/pdf/2409.15902
Copy Paste: [[2409.15902]] Konstruktor: A Strong Baseline for Simple Knowledge Graph Question Answering(https://arxiv.org/abs/2409.15902)
Keywords: language model
Abstract: While being one of the most popular question types, simple questions such as "Who is the author of Cinderella?", are still not completely solved. Surprisingly, even the most powerful modern Large Language Models are prone to errors when dealing with such questions, especially when dealing with rare entities. At the same time, as an answer may be one hop away from the question entity, one can try to develop a method that uses structured knowledge graphs (KGs) to answer such questions. In this paper, we introduce Konstruktor - an efficient and robust approach that breaks down the problem into three steps: (i) entity extraction and entity linking, (ii) relation prediction, and (iii) querying the knowledge graph. Our approach integrates language models and knowledge graphs, exploiting the power of the former and the interpretability of the latter. We experiment with two named entity recognition and entity linking methods and several relation detection techniques. We show that for relation detection, the most challenging step of the workflow, a combination of relation classification/generation and ranking outperforms other methods. We report Konstruktor's strong results on four datasets.
摘要：虽然是最常见的问题类型之一，但诸如“谁是《灰姑娘》的作者？”之类的简单问题仍未完全解决。令人惊讶的是，即使是最强大的现代大型语言模型在处理此类问题时也容易出错，尤其是在处理罕见实体时。同时，由于答案可能距离问题实体只有一步之遥，因此可以尝试开发一种使用结构化知识图谱 (KG) 来回答此类问题的方法。在本文中，我们介绍了 Konstruktor - 一种高效而强大的方法，它将问题分为三个步骤：（i）实体提取和实体链接，（ii）关系预测，以及（iii）查询知识图谱。我们的方法集成了语言模型和知识图谱，利用前者的强大功能和后者的可解释性。我们尝试了两种命名实体识别和实体链接方法以及几种关系检测技术。我们表明，对于关系检测（工作流程中最具挑战性的步骤），关系分类/生成和排名的组合优于其他方法。我们报告了 Konstruktor 在四个数据集上的强劲结果。
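
Illustrative sketch (not from the paper): the three steps named in the abstract (entity linking, relation prediction, knowledge graph query) can be shown on a toy example. The Python code below is only a stand-in under stated assumptions: the in-memory graph, the identifiers, and the keyword rules are all hypothetical, and simple string matching replaces the learned entity linking and relation ranking components that the paper evaluates.

```python
# Toy knowledge graph: (subject, relation) -> list of objects (hypothetical IDs).
TOY_KG = {("ent:cinderella", "rel:author"): ["ent:charles_perrault"]}

# Hypothetical surface-form dictionaries standing in for learned components.
ENTITY_LABELS = {"cinderella": "ent:cinderella"}
RELATION_KEYWORDS = {"author": "rel:author", "wrote": "rel:author"}


def link_entity(question):
    """Step (i): find a known entity label mentioned in the question."""
    text = question.lower()
    for label, entity_id in ENTITY_LABELS.items():
        if label in text:
            return entity_id
    return None


def predict_relation(question):
    """Step (ii): map the question to a relation (keyword rule as a stand-in)."""
    text = question.lower()
    for keyword, relation_id in RELATION_KEYWORDS.items():
        if keyword in text:
            return relation_id
    return None


def answer(question):
    """Step (iii): query the graph one hop away from the linked entity."""
    entity_id = link_entity(question)
    relation_id = predict_relation(question)
    if entity_id is None or relation_id is None:
        return []
    return TOY_KG.get((entity_id, relation_id), [])


result = answer("Who is the author of Cinderella?")
```

Keeping the steps separate makes each intermediate decision inspectable, which matches the interpretability argument in the abstract.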

Title: Enhancing Text-to-SQL Capabilities of Large Language Models via Domain Database Knowledge Injection

Authors: Xingyu Ma, Xin Tian, Lingxiang Wu, Xuepeng Wang, Xueming Tang, Jinqiao Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.15907
Pdf URL: https://arxiv.org/pdf/2409.15907
Copy Paste: [[2409.15907]] Enhancing Text-to-SQL Capabilities of Large Language Models via Domain Database Knowledge Injection(https://arxiv.org/abs/2409.15907)
Keywords: language model, llm, hallucination
Abstract: Text-to-SQL is a subtask in semantic parsing that has seen rapid progress with the evolution of Large Language Models (LLMs). However, LLMs face challenges due to hallucination issues and a lack of domain-specific database knowledge (such as table schema and cell values). As a result, they can make errors in generating table names, columns, and matching values to the correct columns in SQL statements. This paper introduces a method of knowledge injection to enhance LLMs' ability to understand schema contents by incorporating prior knowledge. This approach improves their performance in Text-to-SQL tasks. Experimental results show that pre-training LLMs on domain-specific database knowledge and fine-tuning them on downstream Text-to-SQL tasks significantly improves the Execution Match (EX) and Exact Match (EM) metrics across various models. This effectively reduces errors in generating column names and matching values to the columns. Furthermore, the knowledge-injected models can be applied to many downstream Text-to-SQL tasks, demonstrating the generalizability of the approach presented in this paper.
摘要：文本转 SQL 是语义解析中的一项子任务，随着大型语言模型 (LLM) 的发展，该任务取得了快速进展。然而，由于幻觉问题和缺乏特定领域的数据库知识（例如表模式和单元格值），LLM 面临挑战。因此，它们可能会在生成表名、列和将值与 SQL 语句中的正确列匹配时出错。本文介绍了一种知识注入方法，通过结合先验知识来增强 LLM 理解模式内容的能力。这种方法提高了它们在文本转 SQL 任务中的表现。实验结果表明，对 LLM 进行领域特定数据库知识的预训练并在下游文本转 SQL 任务中对其进行微调可显著提高各种模型的执行匹配 (EX) 和精确匹配 (EM) 指标。这有效地减少了生成列名和将值与列匹配的错误。此外，知识注入模型可以应用于许多下游文本转 SQL 任务，证明了本文提出的方法的通用性。
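
Illustrative sketch (not from the paper): the two metrics named in the abstract differ in what they compare. A simplified Python example using the standard `sqlite3` module is given below. It assumes that Exact Match compares whitespace- and case-normalized query strings and that Execution Match compares result sets regardless of row order; benchmark implementations are typically more elaborate. The table, rows, and queries are hypothetical.

```python
import sqlite3


def normalize(sql):
    """Lowercase, drop a trailing semicolon, and collapse whitespace."""
    return " ".join(sql.strip().rstrip(";").lower().split())


def exact_match(pred_sql, gold_sql):
    """Simplified EM: the normalized query strings are identical."""
    return normalize(pred_sql) == normalize(gold_sql)


def execution_match(conn, pred_sql, gold_sql):
    """Simplified EX: both queries return the same rows, ignoring order."""
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to run cannot match
    gold_rows = conn.execute(gold_sql).fetchall()
    return sorted(pred_rows, key=repr) == sorted(gold_rows, key=repr)


# Hypothetical in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [("Ana", 31), ("Bo", 24)])

gold = "SELECT name FROM singer WHERE age > 30"
pred = "SELECT name FROM singer WHERE NOT age <= 30"

em = exact_match(pred, gold)            # different text
ex = execution_match(conn, pred, gold)  # same rows returned
```

The example shows why the two metrics are reported together: a prediction can be textually different from the reference and still return the same result.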

Title: SLIMER-IT: Zero-Shot NER on Italian Language

Authors: Andrew Zamai, Leonardo Rigutini, Marco Maggini, Andrea Zugarini
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2409.15933
Pdf URL: https://arxiv.org/pdf/2409.15933
Copy Paste: [[2409.15933]] SLIMER-IT: Zero-Shot NER on Italian Language(https://arxiv.org/abs/2409.15933)
Keywords: language model, llm, prompt
Abstract: Traditional approaches to Named Entity Recognition (NER) frame the task as a BIO sequence labeling problem. Although these systems often excel in the downstream task at hand, they require extensive annotated data and struggle to generalize to out-of-distribution input domains and unseen entity types. On the contrary, Large Language Models (LLMs) have demonstrated strong zero-shot capabilities. While several works address Zero-Shot NER in English, little has been done in other languages. In this paper, we define an evaluation framework for Zero-Shot NER, applying it to the Italian language. Furthermore, we introduce SLIMER-IT, the Italian version of SLIMER, an instruction-tuning approach for zero-shot NER leveraging prompts enriched with definition and guidelines. Comparisons with other state-of-the-art models demonstrate the superiority of SLIMER-IT on never-seen-before entity tags.
摘要：传统的命名实体识别 (NER) 方法将任务定义为 BIO 序列标记问题。尽管这些系统通常在手头的下游任务中表现出色，但它们需要大量带注释的数据，并且难以推广到分布外的输入域和看不见的实体类型。相反，大型语言模型 (LLM) 已展示出强大的零样本能力。虽然有几部作品涉及英语中的零样本 NER，但在其他语言中却很少见。在本文中，我们定义了一个零样本 NER 评估框架，并将其应用于意大利语。此外，我们介绍了 SLIMER-IT，即 SLIMER 的意大利语版本，这是一种利用丰富定义和指南的提示进行零样本 NER 的指令调整方法。与其他最先进模型的比较证明了 SLIMER-IT 在从未见过的实体标签上的优势。
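
Illustrative sketch (not from the paper): the BIO sequence labeling formulation mentioned in the abstract tags each token as the Beginning of an entity, Inside an entity, or Outside any entity. The Python example below decodes such tags into entity spans. The sentence, the tag set, and the function name are hypothetical and serve only to illustrate the traditional formulation that the paper contrasts with its prompt-based approach.

```python
def bio_to_spans(tokens, tags):
    """Decode BIO tags into a list of (entity text, entity type) pairs."""
    spans = []
    current, label = [], None
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if starts_new:
            # Close any open span, then open a new one (also repairs a stray I- tag).
            if current:
                spans.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-"):
            current.append(token)
        else:  # "O": outside any entity
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans


# Hypothetical Italian sentence with person (PER) and location (LOC) tags.
tokens = ["Dante", "Alighieri", "nacque", "a", "Firenze"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
entities = bio_to_spans(tokens, tags)
```

A tagger of this kind can only emit the entity types fixed in its tag set, which is the limitation on unseen entity types that the abstract points to.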

Title: Automated test generation to evaluate tool-augmented LLMs as conversational AI agents

Authors: Samuel Arcadinho, David Aparicio, Mariana Almeida
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2409.15934
Pdf URL: https://arxiv.org/pdf/2409.15934
Copy Paste: [[2409.15934]] Automated test generation to evaluate tool-augmented LLMs as conversational AI agents(https://arxiv.org/abs/2409.15934)
Keywords: llm, agent
Abstract: Tool-augmented LLMs are a promising approach to create AI agents that can have realistic conversations, follow procedures, and call appropriate functions. However, evaluating them is challenging due to the diversity of possible conversations, and existing datasets focus only on single interactions and function-calling. We present a test generation pipeline to evaluate LLMs as conversational AI agents. Our framework uses LLMs to generate diverse tests grounded on user-defined procedures. For that, we use intermediate graphs to limit the LLM test generator's tendency to hallucinate content that is not grounded on input procedures, and to enforce high coverage of the possible conversations. Additionally, we put forward ALMITA, a manually curated dataset for evaluating AI agents in customer support, and use it to evaluate existing LLMs. Our results show that while tool-augmented LLMs perform well in single interactions, they often struggle to handle complete conversations. While our focus is on customer support, our method is general and can be applied to AI agents in different domains.
摘要：工具增强型 LLM 是一种很有前途的方法，可用于创建能够进行真实对话、遵循程序并调用适当功能的 AI 代理。然而，由于可能的对话种类繁多，对它们进行评估具有挑战性，而现有的数据集仅关注单一交互和函数调用。我们提出了一个测试生成管道来评估 LLM 作为对话式 AI 代理的能力。我们的框架使用 LLM 生成基于用户定义程序的各种测试。为此，我们使用中间图来限制 LLM 测试生成器对不基于输入程序的内容产生幻觉的倾向，并强制对可能的对话进行高覆盖。此外，我们提出了 ALMITA，这是一个手动整理的数据集，用于评估客户支持中的 AI 代理，并使用它来评估现有的 LLM。我们的结果表明，虽然工具增强型 LLM 在单一交互中表现良好，但它们往往难以处理完整的对话。虽然我们的重点是客户支持，但我们的方法是通用的，能够用于不同领域的 AI 代理。

Title: Finetuning LLMs for Comparative Assessment Tasks

Authors: Vatsal Raina, Adian Liusie, Mark Gales
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.15979
Pdf URL: https://arxiv.org/pdf/2409.15979
Copy Paste: [[2409.15979]] Finetuning LLMs for Comparative Assessment Tasks(https://arxiv.org/abs/2409.15979)
Keywords: language model, llm
Abstract: Automated assessment in natural language generation is a challenging task. Instruction-tuned large language models (LLMs) have shown promise in reference-free evaluation, particularly through comparative assessment. However, the quadratic computational complexity of pairwise comparisons limits its scalability. To address this, efficient comparative assessment has been explored by applying comparative strategies on zero-shot LLM probabilities. We propose a framework for finetuning LLMs for comparative assessment to align the model's output with the target distribution of comparative probabilities. By training on soft probabilities, our approach improves state-of-the-art performance while maintaining high performance with an efficient subset of comparisons.
摘要：自然语言生成中的自动评估是一项具有挑战性的任务。指令调整的大型语言模型 (LLM) 在无参考评估中表现出色，尤其是通过比较评估。然而，成对比较的二次计算复杂度限制了它的可扩展性。为了解决这个问题，已经通过在零样本 LLM 概率上应用比较策略来探索有效的比较评估。我们提出了一个框架来微调用于比较评估的 LLM，以使模型的输出与比较概率的目标分布保持一致。通过对软概率进行训练，我们的方法提高了最先进的性能，同时通过高效的比较子集保持了高性能。
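
Illustrative sketch (not from the paper): the quadratic cost mentioned in the abstract comes from comparing every ordered pair of N candidates, which takes N(N-1) comparisons. The Python example below shows that count and one simple way to rank candidates from only a subset of comparisons by averaging each candidate's probability of being preferred. The averaging scheme, the function names, and the probabilities are assumptions for illustration, not the paper's method.

```python
def num_pairwise_comparisons(n):
    """Number of ordered pairs (i, j) with i != j among n candidates."""
    return n * (n - 1)


def scores_from_comparisons(n, comparisons):
    """Average preference probability per candidate.

    `comparisons` maps (i, j) -> P(candidate i is better than candidate j).
    Candidates that never appear in a comparison get a neutral 0.5.
    """
    totals = [0.0] * n
    counts = [0] * n
    for (i, j), p in comparisons.items():
        totals[i] += p
        counts[i] += 1
        totals[j] += 1.0 - p
        counts[j] += 1
    return [t / c if c else 0.5 for t, c in zip(totals, counts)]


# Three candidates: all ordered pairs would cost 6 comparisons; use only 2.
subset = {(0, 1): 0.9, (1, 2): 0.8}  # hypothetical soft probabilities
scores = scores_from_comparisons(3, subset)
ranking = sorted(range(3), key=lambda k: scores[k], reverse=True)
```

Working with soft probabilities rather than hard win/loss decisions keeps information about how confident each comparison is, which is the kind of signal the abstract describes training on.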

Title: Bridging Speech and Text: Enhancing ASR with Pinyin-to-Character Pre-training in LLMs

Authors: Yang Yuhang, Peng Yizhou, Eng Siong Chng, Xionghu Zhong
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2409.16005
Pdf URL: https://arxiv.org/pdf/2409.16005
Copy Paste: [[2409.16005]] Bridging Speech and Text: Enhancing ASR with Pinyin-to-Character Pre-training in LLMs(https://arxiv.org/abs/2409.16005)
Keywords: language model, llm
Abstract: The integration of large language models (LLMs) with pre-trained speech models has opened up new avenues in automatic speech recognition (ASR). While LLMs excel in multimodal understanding tasks, effectively leveraging their capabilities for ASR remains a significant challenge. This paper presents a novel training approach to enhance LLM performance in ASR tasks. We propose pre-training LLMs on Pinyin embedding sequences, which represent pronunciation features, to generate corresponding Chinese characters. This step enables the LLM to adapt to generating text from pronunciation features before encountering real speech data. Furthermore, we fine-tune the LoRA parameters to enhance the LLM's understanding of speech modality information. On the AISHELL-1 corpus, our approach yields a 9.5% relative improvement in ASR tasks compared to the baseline without Pinyin-to-Character pre-training. Additionally, incorporating auxiliary text data for Pinyin-to-Character pre-training further boosts performance, achieving a 19.0% relative improvement.
摘要：大型语言模型 (LLM) 与预训练语音模型的集成为自动语音识别 (ASR) 开辟了新途径。虽然 LLM 在多模态理解任务中表现出色，但有效利用其功能进行 ASR 仍然是一项重大挑战。本文介绍了一种新颖的训练方法，以提高 LLM 在 ASR 任务中的表现。我们建议在代表发音特征的拼音嵌入序列上对 LLM 进行预训练，以生成相应的汉字。此步骤使 LLM 能够在遇到真实语音数据之前适应从发音特征生成文本。此外，我们对 LoRA 参数进行了微调，以增强 LLM 对语音模态信息的理解。在 AISHELL-1 语料库中，与没有进行拼音转汉字预训练的基线相比，我们的方法在 ASR 任务中取得了 9.5% 的相对改进。此外，结合辅助文本数据进行拼音转汉字预训练进一步提高了性能，实现了 19.0% 的相对改进。

Title: AI Can Be Cognitively Biased: An Exploratory Study on Threshold Priming in LLM-Based Batch Relevance Assessment

Authors: Nuo Chen, Jiqun Liu, Xiaoyu Dong, Qijiong Liu, Tetsuya Sakai, Xiao-Ming Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.16022
Pdf URL: https://arxiv.org/pdf/2409.16022
Copy Paste: [[2409.16022]] AI Can Be Cognitively Biased: An Exploratory Study on Threshold Priming in LLM-Based Batch Relevance Assessment(https://arxiv.org/abs/2409.16022)
Keywords: language model, gpt, llm
Abstract: Cognitive biases are systematic deviations in thinking that lead to irrational judgments and problematic decision-making, extensively studied across various fields. Recently, large language models (LLMs) have shown advanced understanding capabilities but may inherit human biases from their training data. While social biases in LLMs have been well-studied, cognitive biases have received less attention, with existing research focusing on specific scenarios. The broader impact of cognitive biases on LLMs in various decision-making contexts remains underexplored. We investigated whether LLMs are influenced by the threshold priming effect in relevance judgments, a core task and widely discussed research topic in the Information Retrieval (IR) community. The priming effect occurs when exposure to certain stimuli unconsciously affects subsequent behavior and decisions. Our experiment employed 10 topics from the TREC 2019 Deep Learning passage track collection, and tested AI judgments under different document relevance scores, batch lengths, and LLM models, including GPT-3.5, GPT-4, LLaMa2-13B and LLaMa2-70B. Results showed that LLMs tend to give lower scores to later documents if earlier ones have high relevance, and vice versa, regardless of the combination and model used. Our finding demonstrates that LLM's judgments, similar to human judgments, are also influenced by threshold priming biases, and suggests that researchers and system engineers should take into account potential human-like cognitive biases in designing, evaluating, and auditing LLMs in IR tasks and beyond.
摘要：认知偏差是一种系统性的思维偏差，会导致非理性判断和有问题的决策，各个领域都对此进行了广泛的研究。最近，大型语言模型 (LLM) 表现出了高级的理解能力，但可能会从其训练数据中继承人类的偏见。虽然 LLM 中的社会偏见已经得到充分研究，但认知偏差却没有受到太多关注，现有研究主要集中于特定场景。认知偏差对各种决策环境中的 LLM 的更广泛影响仍未得到充分探索。我们研究了 LLM 是否受到相关性判断中的阈值启动效应的影响，相关性判断是信息检索 (IR) 社区的一项核心任务和广泛讨论的研究课题。当接触某些刺激会无意识地影响后续行为和决策时，就会发生启动效应。我们的实验采用了 TREC 2019 深度学习段落轨迹集合中的 10 个主题，并在不同的文档相关性分数、批处理长度和 LLM 模型（包括 GPT-3.5、GPT-4、LLaMa2-13B 和 LLaMa2-70B）下测试了 AI 判断。结果表明，无论使用哪种组合和模型，如果较早的文档具有较高的相关性，LLM 往往会给后面的文档打较低的分数，反之亦然。我们的发现表明，LLM 的判断与人类判断类似，也受到阈值启动偏差的影响，并表明研究人员和系统工程师在设计、评估和审核 IR 任务及其他任务中的 LLM 时应考虑潜在的类似人类的认知偏差。

Title: Unlocking Markets: A Multilingual Benchmark to Cross-Market Question Answering

Authors: Yifei Yuan, Yang Deng, Anders Søgaard, Mohammad Aliannejadi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.16025
Pdf URL: https://arxiv.org/pdf/2409.16025
Copy Paste: [[2409.16025]] Unlocking Markets: A Multilingual Benchmark to Cross-Market Question Answering(https://arxiv.org/abs/2409.16025)
Keywords: llm
Abstract: Users post numerous product-related questions on e-commerce platforms, affecting their purchase decisions. Product-related question answering (PQA) entails utilizing product-related resources to provide precise responses to users. We propose a novel task of Multilingual Cross-market Product-based Question Answering (MCPQA) and define the task as providing answers to product-related questions in a main marketplace by utilizing information from another resource-rich auxiliary marketplace in a multilingual context. We introduce a large-scale dataset comprising over 7 million questions from 17 marketplaces across 11 languages. We then perform automatic translation on the Electronics category of our dataset, naming it McMarket. We focus on two subtasks: review-based answer generation and product-related question ranking. For each subtask, we label a subset of McMarket using an LLM and further evaluate the quality of the annotations via human assessment. We then conduct experiments to benchmark our dataset, using models ranging from traditional lexical models to LLMs in both single-market and cross-market scenarios across McMarket and the corresponding LLM subset. Results show that incorporating cross-market information significantly enhances performance in both tasks.
摘要：用户在电子商务平台上发布大量与产品相关的问题，影响他们的购买决策。产品相关问答 (PQA) 需要利用与产品相关的资源为用户提供精确的响应。我们提出了一项新颖的任务，即多语言跨市场基于产品的问答 (MCPQA)，并将该任务定义为在多语言环境中利用来自另一个资源丰富的辅助市场的信息，为主要市场中的产品相关问题提供答案。我们引入了一个大型数据集，其中包含来自 17 个市场、11 种语言的 700 多万个问题。然后，我们对数据集的电子产品类别执行自动翻译，将其命名为 McMarket。我们专注于两个子任务：基于评论的答案生成和产品相关问题排名。对于每个子任务，我们使用 LLM 标记 McMarket 的一个子集，并通过人工评估进一步评估注释的质量。然后，我们进行实验以对我们的数据集进行基准测试，使用从传统词汇模型到 LLM 的模型，在 McMarket 和相应的 LLM 子集的单市场和跨市场场景中进行测试。结果表明，整合跨市场信息可显著提高两项任务的绩效。

Title: Exploring Hint Generation Approaches in Open-Domain Question Answering

Authors: Jamshid Mozafari, Abdelrahman Abdallah, Bhawna Piryani, Adam Jatowt
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2409.16096
Pdf URL: https://arxiv.org/pdf/2409.16096
Copy Paste: [[2409.16096]] Exploring Hint Generation Approaches in Open-Domain Question Answering(https://arxiv.org/abs/2409.16096)
Keywords: language model, llm, prompt
Abstract: Automatic Question Answering (QA) systems rely on contextual information to provide accurate answers. Commonly, contexts are prepared through either retrieval-based or generation-based methods. The former involves retrieving relevant documents from a corpus like Wikipedia, whereas the latter uses generative models such as Large Language Models (LLMs) to generate the context. In this paper, we introduce a novel context preparation approach called HINTQA, which employs Automatic Hint Generation (HG) techniques. Unlike traditional methods, HINTQA prompts LLMs to produce hints about potential answers for the question rather than generating relevant context. We evaluate our approach across three QA datasets including TriviaQA, NaturalQuestions, and Web Questions, examining how the number and order of hints impact performance. Our findings show that the HINTQA surpasses both retrieval-based and generation-based approaches. We demonstrate that hints enhance the accuracy of answers more than retrieved and generated contexts.
摘要：自动问答 (QA) 系统依靠上下文信息来提供准确的答案。通常，上下文是通过基于检索或基于生成的方法准备的。前者涉及从维基百科等语料库中检索相关文档，而后者使用生成模型（如大型语言模型 (LLM)）来生成上下文。在本文中，我们介绍了一种名为 HINTQA 的新型上下文准备方法，该方法采用自动提示生成 (HG) 技术。与传统方法不同，HINTQA 提示 LLM 生成有关问题潜在答案的提示，而不是生成相关上下文。我们在三个 QA 数据集（包括 TriviaQA、NaturalQuestions 和 Web Questions）上评估了我们的方法，研究了提示的数量和顺序如何影响性能。我们的研究结果表明，HINTQA 超越了基于检索和基于生成的方法。我们证明，提示比检索和生成的上下文更能提高答案的准确性。

Title: Controlling Risk of Retrieval-augmented Generation: A Counterfactual Prompting Framework

Authors: Lu Chen, Ruqing Zhang, Jiafeng Guo, Yixing Fan, Xueqi Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.16146
Pdf URL: https://arxiv.org/pdf/2409.16146
Copy Paste: [[2409.16146]] Controlling Risk of Retrieval-augmented Generation: A Counterfactual Prompting Framework(https://arxiv.org/abs/2409.16146)
Keywords: language model, hallucination, prompt, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) has emerged as a popular solution to mitigate the hallucination issues of large language models. However, existing studies on RAG seldom address the issue of predictive uncertainty, i.e., how likely it is that a RAG model's prediction is incorrect, resulting in uncontrollable risks in real-world applications. In this work, we emphasize the importance of risk control, ensuring that RAG models proactively refuse to answer questions with low confidence. Our research identifies two critical latent factors affecting RAG's confidence in its predictions: the quality of the retrieved results and the manner in which these results are utilized. To guide RAG models in assessing their own confidence based on these two latent factors, we develop a counterfactual prompting framework that induces the models to alter these factors and analyzes the effect on their answers. We also introduce a benchmarking procedure to collect answers with the option to abstain, facilitating a series of experiments. For evaluation, we introduce several risk-related metrics and the experimental results demonstrate the effectiveness of our approach.
摘要：检索增强生成 (RAG) 已成为缓解大型语言模型幻觉问题的一种流行解决方案。然而，现有的 RAG 研究很少解决预测不确定性问题，即 RAG 模型的预测有多大可能出错，从而导致实际应用中的风险无法控制。在这项工作中，我们强调风险控制的重要性，确保 RAG 模型主动拒绝回答置信度低的问题。我们的研究确定了影响 RAG 对其预测的信心的两个关键潜在因素：检索结果的质量以及使用这些结果的方式。为了指导 RAG 模型根据这两个潜在因素评估自己的信心，我们开发了一个反事实提示框架，诱导模型改变这些因素并分析对其答案的影响。我们还引入了一个基准测试程序来收集带有弃权选项的答案，从而促进一系列实验。为了进行评估，我们引入了几个与风险相关的指标，实验结果证明了我们方法的有效性。

Title: HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models

Authors: Haoran Que, Feiyu Duan, Liqun He, Yutao Mou, Wangchunshu Zhou, Jiaheng Liu, Wenge Rong, Zekun Moore Wang, Jian Yang, Ge Zhang, Junran Peng, Zhaoxiang Zhang, Songyang Zhang, Kai Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.16191
Pdf URL: https://arxiv.org/pdf/2409.16191
Copy Paste: [[2409.16191]] HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models(https://arxiv.org/abs/2409.16191)
Keywords: language model, llm, chat
Abstract: In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks (e.g., long-context understanding), and many benchmarks have been proposed. However, we observe that long text generation capabilities are not well investigated. Therefore, we introduce the Hierarchical Long Text Generation Benchmark (HelloBench), a comprehensive, in-the-wild, and open-ended benchmark to evaluate LLMs' performance in generating long text. Based on Bloom's Taxonomy, HelloBench categorizes long text generation tasks into five subtasks: open-ended QA, summarization, chat, text completion, and heuristic text generation. Besides, we propose Hierarchical Long Text Evaluation (HelloEval), a human-aligned evaluation method that significantly reduces the time and effort required for human evaluation while maintaining a high correlation with human evaluation. We have conducted extensive experiments across around 30 mainstream LLMs and observed that the current LLMs lack long text generation capabilities. Specifically, first, regardless of whether the instructions include explicit or implicit length constraints, we observe that most LLMs cannot generate text that is longer than 4000 words. Second, we observe that while some LLMs can generate longer text, many issues exist (e.g., severe repetition and quality degradation). Third, to demonstrate the effectiveness of HelloEval, we compare HelloEval with traditional metrics (e.g., ROUGE, BLEU, etc.) and LLM-as-a-Judge methods, which show that HelloEval has the highest correlation with human evaluation. We release our code in this https URL.
摘要：近年来，大型语言模型 (LLM) 在各种任务（例如，长上下文理解）中展现出卓越的能力，并且已经提出了许多基准。然而，我们发现长文本生成能力尚未得到很好的研究。因此，我们引入了分层长文本生成基准 (HelloBench)，这是一个全面、真实、开放的基准，用于评估 LLM 在生成长文本方面的表现。基于布鲁姆分类法，HelloBench 将长文本生成任务分为五个子任务：开放式问答、摘要、聊天、文本完成和启发式文本生成。此外，我们提出了分层长文本评估 (HelloEval)，这是一种与人类一致的评估方法，可显着减少人工评估所需的时间和精力，同时保持与人工评估的高度相关性。我们在约 30 个主流 LLM 上进行了广泛的实验，并观察到当前的 LLM 缺乏长文本生成能力。具体来说，首先，无论指令是否包含显式或隐式长度约束，我们观察到大多数 LLM 无法生成长度超过 4000 个单词的文本。其次，我们观察到虽然一些 LLM 可以生成更长的文本，但存在许多问题（例如，严重重复和质量下降）。第三，为了证明 HelloEval 的有效性，我们将 HelloEval 与传统指标（例如 ROUGE、BLEU 等）和 LLM-as-a-Judge 方法进行了比较，结果表明 HelloEval 与人工评估的相关性最高。我们在此 https URL 中发布了我们的代码。

Title: EuroLLM: Multilingual Language Models for Europe

Authors: Pedro Henrique Martins, Patrick Fernandes, João Alves, Nuno M. Guerreiro, Ricardo Rei, Duarte M. Alves, José Pombal, Amin Farajian, Manuel Faysse, Mateusz Klimaszewski, Pierre Colombo, Barry Haddow, José G. C. de Souza, Alexandra Birch, André F. T. Martins
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.16235
Pdf URL: https://arxiv.org/pdf/2409.16235
Copy Paste: [[2409.16235]] EuroLLM: Multilingual Language Models for Europe(https://arxiv.org/abs/2409.16235)
Keywords: language model, llm
Abstract: The quality of open-weight LLMs has seen significant improvement, yet they remain predominantly focused on English. In this paper, we introduce the EuroLLM project, aimed at developing a suite of open-weight multilingual LLMs capable of understanding and generating text in all official European Union languages, as well as several additional relevant languages. We outline the progress made to date, detailing our data collection and filtering process, the development of scaling laws, the creation of our multilingual tokenizer, and the data mix and modeling configurations. Additionally, we release our initial models: EuroLLM-1.7B and EuroLLM-1.7B-Instruct and report their performance on multilingual general benchmarks and machine translation.
摘要：开放权重大型语言模型 (LLM) 的质量已显著提高，但它们仍然主要侧重于英语。在本文中，我们介绍了 EuroLLM 项目，该项目旨在开发一套开放权重多语言大型语言模型 (LLM)，能够理解和生成所有欧盟官方语言以及其他几种相关语言的文本。我们概述了迄今为止取得的进展，详细介绍了我们的数据收集和过滤过程、缩放定律的开发、多语言标记器的创建以及数据组合和建模配置。此外，我们发布了我们的初始模型：EuroLLM-1.7B 和 EuroLLM-1.7B-Instruct，并报告了它们在多语言通用基准和机器翻译方面的表现。