2024-07-22

Title: RDBE: Reasoning Distillation-Based Evaluation Enhances Automatic Essay Scoring

Authors: Ali Ghiasvand Mohammadkhani
Subjects: cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2407.13781
Pdf URL: https://arxiv.org/pdf/2407.13781
Copy Paste: [[2407.13781]] RDBE: Reasoning Distillation-Based Evaluation Enhances Automatic Essay Scoring(https://arxiv.org/abs/2407.13781)
Keywords: language model, llm
Abstract: Recently, various encoder-only and encoder-decoder pre-trained models like BERT and T5 have been applied to automatic essay scoring (AES) as small language models. However, existing studies have primarily treated this task as akin to a classification problem, focusing solely on outputting scores in the target text without offering interpretations for the generated scores. Departing from these approaches, we introduce Reasoning Distillation-Based Evaluation (RDBE), which integrates interpretability to elucidate the rationale behind model scores while enhancing performance through initial reasoning. This interpretive capability is acquired during training by leveraging generated reasoning from a large language model (LLM) to distill a small language model (SLM). Our experimental results demonstrate the efficacy of RDBE across all scoring rubrics considered in the dataset. RDBE outperforms both zero-shot LLM generation and generation from a baseline fine-tuned model, establishing itself as state-of-the-art on the corresponding dataset. This highlights its practical interpretative output and enhanced performance.

Title: Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle

Authors: Emman Haider, Daniel Perez-Becker, Thomas Portet, Piyush Madan, Amit Garg, David Majercak, Wen Wen, Dongwoo Kim, Ziyi Yang, Jianwen Zhang, Hiteshi Sharma, Blake Bullwinkel, Martin Pouliot, Amanda Minnich, Shiven Chawla, Solianna Herrera, Shahed Warreth, Maggie Engler, Gary Lopez, Nina Chikanov, Raja Sekhar Rao Dheekonda, Bolor-Erdene Jagdagdorj, Roman Lutz, Richard Lundeen, Tori Westerhoff, Pete Bryan, Christian Seifert, Ram Shankar Siva Kumar, Andrew Berkley, Alex Kessler
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.13833
Pdf URL: https://arxiv.org/pdf/2407.13833
Copy Paste: [[2407.13833]] Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle(https://arxiv.org/abs/2407.13833)
Keywords: language model
Abstract: Recent innovations in language model training have demonstrated that it is possible to create highly performant models that are small enough to run on a smartphone. As these models are deployed in an increasing number of domains, it is critical to ensure that they are aligned with human preferences and safety considerations. In this report, we present our methodology for safety aligning the Phi-3 series of language models. We utilized a "break-fix" cycle, performing multiple rounds of dataset curation, safety post-training, benchmarking, red teaming, and vulnerability identification to cover a variety of harm areas in both single and multi-turn scenarios. Our results indicate that this approach iteratively improved the performance of the Phi-3 models across a wide range of responsible AI benchmarks.

Title: Learning Goal-Conditioned Representations for Language Reward Models

Authors: Vaskar Nath, Dylan Slack, Jeff Da, Yuntao Ma, Hugh Zhang, Spencer Whitehead, Sean Hendryx
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.13887
Pdf URL: https://arxiv.org/pdf/2407.13887
Copy Paste: [[2407.13887]] Learning Goal-Conditioned Representations for Language Reward Models(https://arxiv.org/abs/2407.13887)
Keywords: language model
Abstract: Techniques that learn improved representations via offline data or self-supervised objectives have shown impressive results in traditional reinforcement learning (RL). Nevertheless, it is unclear how improved representation learning can benefit reinforcement learning from human feedback (RLHF) on language models (LMs). In this work, we propose training reward models (RMs) in a contrastive, goal-conditioned fashion by increasing the representation similarity of future states along sampled preferred trajectories and decreasing the similarity along randomly sampled dispreferred trajectories. This objective significantly improves RM performance by up to 0.09 AUROC across challenging benchmarks, such as MATH and GSM8k. These findings extend to general alignment as well -- on the Helpful-Harmless dataset, we observe a 2.3% increase in accuracy. Beyond improving reward model performance, we show this way of training RM representations enables improved steerability because it allows us to evaluate the likelihood of an action achieving a particular goal-state (e.g., whether a solution is correct or helpful). Leveraging this insight, we find that we can filter up to 55% of generated tokens during majority voting by discarding trajectories likely to end up in an "incorrect" state, which leads to significant cost savings. We additionally find that these representations can perform fine-grained control by conditioning on desired future goal-states. For example, we show that steering a Llama 3 model towards helpful generations with our approach improves helpfulness by 9.6% over a supervised-fine-tuning trained baseline. Similarly, steering the model towards complex generations improves complexity by 21.6% over the baseline. Overall, we find that training RMs in this contrastive, goal-conditioned fashion significantly improves performance and enables model steerability.
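Note: the filtering step described for majority voting can be illustrated with a minimal sketch. It assumes every sampled solution already carries a hypothetical scalar `goal_scores` value (for example, a reward-model estimate that the trajectory ends in a "correct" state). The function name, the threshold, and the choice to filter complete trajectories rather than partial generations are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter
from typing import Optional, Sequence


def filtered_majority_vote(
    answers: Sequence[str],
    goal_scores: Sequence[float],
    threshold: float = 0.5,
) -> Optional[str]:
    """Majority vote over answers whose goal score meets the threshold.

    goal_scores are hypothetical per-trajectory estimates of reaching a
    "correct" goal state; how they are produced is outside this sketch.
    """
    if len(answers) != len(goal_scores):
        raise ValueError("answers and goal_scores must have the same length")
    # Discard trajectories judged unlikely to end in a correct state.
    kept = [a for a, s in zip(answers, goal_scores) if s >= threshold]
    if not kept:
        # Fall back to plain majority voting if everything was filtered out.
        kept = list(answers)
    if not kept:
        return None
    return Counter(kept).most_common(1)[0][0]
```

In a real pipeline the cost saving would come from stopping low-scoring generations early, which this sketch does not model.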

Title: Crafting Efficient Fine-Tuning Strategies for Large Language Models

Authors: Michael Oliver, Guan Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2407.13906
Pdf URL: https://arxiv.org/pdf/2407.13906
Copy Paste: [[2407.13906]] Crafting Efficient Fine-Tuning Strategies for Large Language Models(https://arxiv.org/abs/2407.13906)
Keywords: language model, llm
Abstract: This paper addresses the challenges of efficiently fine-tuning large language models (LLMs) by exploring data efficiency and hyperparameter optimization. We investigate the minimum data required for effective fine-tuning and propose a novel hyperparameter optimization method that leverages early-stage model performance. Our experiments demonstrate that fine-tuning with as few as 200 samples can improve model accuracy from 70% to 88% in a product attribute extraction task. We identify a saturation point of approximately 6,500 samples, beyond which additional data yields diminishing returns. Our proposed Bayesian hyperparameter optimization method, which evaluates models at 20% of total training time, correlates strongly with final model performance, with 4 out of 5 top early-stage models remaining in the top 5 at completion. This approach led to a 2% improvement in accuracy over baseline models when evaluated on an independent test set. These findings offer actionable insights for practitioners, potentially reducing computational load and dependency on extensive datasets while enhancing overall performance of fine-tuned LLMs.
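Note: the "4 out of 5 top early-stage models remaining in the top 5" statistic is a top-k overlap between two rankings. The sketch below shows one way to compute it. The configuration names and accuracy values are made up for illustration and are not data from the paper.

```python
from typing import Dict, Set


def top_k(scores: Dict[str, float], k: int) -> Set[str]:
    """Return the names of the k highest-scoring configurations."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def top_k_overlap(early: Dict[str, float], final: Dict[str, float], k: int = 5) -> int:
    """Count configurations that are in the top k both early and at completion."""
    return len(top_k(early, k) & top_k(final, k))


# Hypothetical accuracies for seven configurations (not data from the paper).
early = {"a": 0.80, "b": 0.79, "c": 0.78, "d": 0.77, "e": 0.76, "f": 0.70, "g": 0.65}
final = {"a": 0.88, "b": 0.87, "c": 0.86, "d": 0.85, "e": 0.80, "f": 0.84, "g": 0.70}
print(top_k_overlap(early, final))  # prints 4
```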

Title: BiasDPO: Mitigating Bias in Language Models through Direct Preference Optimization

Authors: Ahmed Allam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.13928
Pdf URL: https://arxiv.org/pdf/2407.13928
Copy Paste: [[2407.13928]] BiasDPO: Mitigating Bias in Language Models through Direct Preference Optimization(https://arxiv.org/abs/2407.13928)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have become pivotal in advancing natural language processing, yet their potential to perpetuate biases poses significant concerns. This paper introduces a new framework employing Direct Preference Optimization (DPO) to mitigate gender, racial, and religious biases in LLM-generated English text. By developing a loss function that favors less biased over biased completions, our approach cultivates a preference for respectful and non-discriminatory language in LLMs. We also contribute a manually designed dataset for training LLMs to recognize and correct biases. This dataset encompasses a diverse range of prompts paired with both biased and unbiased completions. Implementing this approach on the Microsoft Phi-2 model, we demonstrate substantial reductions in biased outputs as our model outperforms the baseline model on almost all bias benchmarks. Our model also achieves better performance compared to other open-source models on most benchmarks. By reducing biases in the language generated by the model, our study marks a significant step towards developing more ethical and socially responsible LLMs. We publicly release the BiasDPO dataset on Hugging Face.
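Note: as background for the loss that "favors less biased over biased completions", here is a sketch of the standard DPO loss for a single preference pair, with the less biased completion treated as "chosen". The abstract does not say whether the paper modifies this objective, so the function, its argument names, and the default `beta` are assumptions for illustration only.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) completion pair.

    Inputs are summed token log-probabilities of each completion under the
    policy being trained and under a frozen reference model.
    """
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)); branch to stay stable for large |x|.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

The loss falls as the policy assigns relatively more probability to the chosen completion than the reference model does, and rises in the opposite case.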

Title: Werewolf Arena: A Case Study in LLM Evaluation via Social Deduction

Authors: Suma Bailis, Jane Friedhoff, Feiyang Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.13943
Pdf URL: https://arxiv.org/pdf/2407.13943
Copy Paste: [[2407.13943]] Werewolf Arena: A Case Study in LLM Evaluation via Social Deduction(https://arxiv.org/abs/2407.13943)
Keywords: language model, gpt, llm
Abstract: This paper introduces Werewolf Arena, a novel framework for evaluating large language models (LLMs) through the lens of the classic social deduction game, Werewolf. In Werewolf Arena, LLMs compete against each other, navigating the game's complex dynamics of deception, deduction, and persuasion. The framework introduces a dynamic turn-taking system based on bidding, mirroring real-world discussions where individuals strategically choose when to speak. We demonstrate the framework's utility through an arena-style tournament featuring Gemini and GPT models. Our results reveal distinct strengths and weaknesses in the models' strategic reasoning and communication. These findings highlight Werewolf Arena's potential as a challenging and scalable LLM benchmark.
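Note: bidding-based turn-taking can be pictured with a small sketch in which each player submits an integer bid and the highest bidder speaks next. The bid scale and the random tie-break are assumptions made here for illustration. The abstract does not specify either.

```python
import random
from typing import Dict, Optional


def choose_speaker(bids: Dict[str, int], rng: Optional[random.Random] = None) -> str:
    """Pick the next speaker: highest bid wins, ties are broken at random."""
    if not bids:
        raise ValueError("at least one bid is required")
    rng = rng or random.Random()
    top = max(bids.values())
    # Sort the tied names so the result depends only on the RNG state.
    candidates = sorted(name for name, bid in bids.items() if bid == top)
    return rng.choice(candidates)
```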

Title: FANTAstic SEquences and Where to Find Them: Faithful and Efficient API Call Generation through State-tracked Constrained Decoding and Reranking

Authors: Zhuoer Wang, Leonardo F. R. Ribeiro, Alexandros Papangelis, Rohan Mukherjee, Tzu-Yen Wang, Xinyan Zhao, Arijit Biswas, James Caverlee, Angeliki Metallinou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.13945
Pdf URL: https://arxiv.org/pdf/2407.13945
Copy Paste: [[2407.13945]] FANTAstic SEquences and Where to Find Them: Faithful and Efficient API Call Generation through State-tracked Constrained Decoding and Reranking(https://arxiv.org/abs/2407.13945)
Keywords: language model
Abstract: API call generation is the cornerstone of large language models' tool-using ability that provides access to the larger world. However, existing supervised and in-context learning approaches suffer from high training costs, poor data efficiency, and generated API calls that can be unfaithful to the API documentation and the user's request. To address these limitations, we propose an output-side optimization approach called FANTASE. Two of the unique contributions of FANTASE are its State-Tracked Constrained Decoding (SCD) and Reranking components. SCD dynamically incorporates appropriate API constraints in the form of Token Search Trie for efficient and guaranteed generation faithfulness with respect to the API documentation. The Reranking component efficiently brings in the supervised signal by leveraging a lightweight model as the discriminator to rerank the beam-searched candidate generations of the large language model. We demonstrate the superior performance of FANTASE in API call generation accuracy, inference efficiency, and context efficiency with DSTC8 and API Bank datasets.
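Note: the idea of constraining generation with a token trie can be sketched as follows. This is a static, simplified illustration. It builds one trie over a fixed set of allowed token sequences and decodes greedily, whereas the abstract describes constraints that are incorporated dynamically through state tracking. The `END` marker, the function names, and the `score` callback are all hypothetical.

```python
from typing import Callable, List, Sequence

END = "<end>"  # hypothetical end-of-sequence marker


def build_trie(sequences: Sequence[Sequence[str]]) -> dict:
    """Build a nested-dict trie over the allowed token sequences."""
    root: dict = {}
    for seq in sequences:
        node = root
        for tok in list(seq) + [END]:
            node = node.setdefault(tok, {})
    return root


def allowed_next(trie: dict, prefix: Sequence[str]) -> List[str]:
    """Tokens that may follow `prefix`; empty if the prefix is not in the trie."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return []
        node = node[tok]
    return sorted(node)


def constrained_greedy_decode(
    trie: dict, score: Callable[[List[str], str], float]
) -> List[str]:
    """Greedy decoding that only ever picks tokens permitted by the trie."""
    out: List[str] = []
    while True:
        options = allowed_next(trie, out)
        if not options:  # empty trie: nothing can be generated
            return out
        best = max(options, key=lambda tok: score(out, tok))
        if best == END:
            return out
        out.append(best)
```

Because every candidate token is drawn from the trie, the decoded sequence is always one of the allowed sequences regardless of what the scoring function prefers.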

Title: RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering

Authors: Rujun Han, Yuhao Zhang, Peng Qi, Yumo Xu, Jenyuan Wang, Lan Liu, William Yang Wang, Bonan Min, Vittorio Castelli
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.13998
Pdf URL: https://arxiv.org/pdf/2407.13998
Copy Paste: [[2407.13998]] RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering(https://arxiv.org/abs/2407.13998)
Keywords: language model, llm, retrieval augmented generation
Abstract: Question answering based on retrieval augmented generation (RAG-QA) is an important research topic in NLP and has a wide range of real-world applications. However, most existing datasets for this task are either constructed using a single source corpus or consist of short extractive answers, which fall short of evaluating large language model (LLM) based RAG-QA systems on cross-domain generalization. To address these limitations, we create Long-form RobustQA (LFRQA), a new dataset comprising human-written long-form answers that integrate short extractive answers from multiple documents into a single, coherent narrative, covering 26K queries and large corpora across seven different domains. We further propose RAG-QA Arena by directly comparing model-generated answers against LFRQA's answers using LLMs as evaluators. We show via extensive experiments that RAG-QA Arena and human judgments on answer quality are highly correlated. Moreover, only 41.3% of the most competitive LLM's answers are preferred to LFRQA's answers, demonstrating RAG-QA Arena as a challenging evaluation platform for future research.

Title: NeLLCom-X: A Comprehensive Neural-Agent Framework to Simulate Language Learning and Group Communication

Authors: Yuchen Lian, Tessa Verhoef, Arianna Bisazza
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.13999
Pdf URL: https://arxiv.org/pdf/2407.13999
Copy Paste: [[2407.13999]] NeLLCom-X: A Comprehensive Neural-Agent Framework to Simulate Language Learning and Group Communication(https://arxiv.org/abs/2407.13999)
Keywords: agent
Abstract: Recent advances in computational linguistics include simulating the emergence of human-like languages with interacting neural network agents, starting from sets of random symbols. The recently introduced NeLLCom framework (Lian et al., 2023) allows agents to first learn an artificial language and then use it to communicate, with the aim of studying the emergence of specific linguistic properties. We extend this framework (NeLLCom-X) by introducing more realistic role-alternating agents and group communication in order to investigate the interplay between language learnability, communication pressures, and group size effects. We validate NeLLCom-X by replicating key findings from prior research simulating the emergence of a word-order/case-marking trade-off. Next, we investigate how interaction affects linguistic convergence and emergence of the trade-off. The novel framework facilitates future simulations of diverse linguistic aspects, emphasizing the importance of interaction and group dynamics in language evolution.

Title: HeCiX: Integrating Knowledge Graphs and Large Language Models for Biomedical Research

Authors: Prerana Sanjay Kulkarni, Muskaan Jain, Disha Sheshanarayana, Srinivasan Parthiban
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2407.14030
Pdf URL: https://arxiv.org/pdf/2407.14030
Copy Paste: [[2407.14030]] HeCiX: Integrating Knowledge Graphs and Large Language Models for Biomedical Research(https://arxiv.org/abs/2407.14030)
Keywords: language model, gpt
Abstract: Despite advancements in drug development strategies, 90% of clinical trials fail. This suggests overlooked aspects in target validation and drug optimization. In order to address this, we introduce HeCiX-KG, Hetionet-Clinicaltrials neXus Knowledge Graph, a novel fusion of data from this http URL and Hetionet in a single knowledge graph. HeCiX-KG combines data on previously conducted clinical trials from this http URL, and domain expertise on diseases and genes from Hetionet. This offers a thorough resource for clinical researchers. Further, we introduce HeCiX, a system that uses LangChain to integrate HeCiX-KG with GPT-4, and increase its usability. HeCiX shows high performance during evaluation against a range of clinically relevant issues, proving this model to be promising for enhancing the effectiveness of clinical research. Thus, this approach provides a more holistic view of clinical trials and existing biological data.

Title: ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness?

Authors: Siddhant Waghjale, Vishruth Veerendranath, Zora Zhiruo Wang, Daniel Fried
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.14044
Pdf URL: https://arxiv.org/pdf/2407.14044
Copy Paste: [[2407.14044]] ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness?(https://arxiv.org/abs/2407.14044)
Keywords: language model, llm
Abstract: Although large language models (LLMs) have been largely successful in generating functionally correct programs, conditioning models to produce efficient solutions while ensuring correctness remains a challenge. Further, unreliability in benchmarking code efficiency is a hurdle across varying hardware specifications for popular interpreted languages such as Python. In this paper, we present ECCO, a reproducible benchmark for evaluating program efficiency via two paradigms: natural language (NL) based code generation and history-based code editing. On ECCO, we adapt and thoroughly investigate the three most promising existing LLM-based approaches: in-context learning, iterative refinement with execution or NL feedback, and fine-tuning conditioned on execution and editing history. While most methods degrade functional correctness and moderately increase program efficiency, we find that adding execution information often helps maintain functional correctness, and NL feedback contributes more to improving efficiency. We release our benchmark to support future work on LLM-based generation of efficient code.

Title: Prompted Aspect Key Point Analysis for Quantitative Review Summarization

Authors: An Quang Tang, Xiuzhen Zhang, Minh Ngoc Dinh, Erik Cambria
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.14049
Pdf URL: https://arxiv.org/pdf/2407.14049
Copy Paste: [[2407.14049]] Prompted Aspect Key Point Analysis for Quantitative Review Summarization(https://arxiv.org/abs/2407.14049)
Keywords: language model, llm, prompt
Abstract: Key Point Analysis (KPA) aims for quantitative summarization that provides key points (KPs) as succinct textual summaries and quantities measuring their prevalence. KPA studies for arguments and reviews have been reported in the literature. A majority of KPA studies for reviews adopt supervised learning to extract short sentences as KPs before matching KPs to review comments for quantification of KP prevalence. Recent abstractive approaches still generate KPs based on sentences, often leading to KPs with overlapping and hallucinated opinions, and inaccurate quantification. In this paper, we propose Prompted Aspect Key Point Analysis (PAKPA) for quantitative review summarization. PAKPA employs aspect sentiment analysis and prompted in-context learning with Large Language Models (LLMs) to generate and quantify KPs grounded in aspects for business entities, which achieves faithful KPs with accurate quantification, and removes the need for large amounts of annotated data for supervised training. Experiments on the popular review dataset Yelp and the aspect-oriented review summarization dataset SPACE show that our framework achieves state-of-the-art performance. Source code and data are available at: this https URL

Title: LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference

Authors: Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2407.14057
Pdf URL: https://arxiv.org/pdf/2407.14057
Copy Paste: [[2407.14057]] LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference(https://arxiv.org/abs/2407.14057)
Keywords: language model, llm, long context, prompt
Abstract: The inference of transformer-based large language models consists of two sequential stages: 1) a prefilling stage to compute the KV cache of prompts and generate the first token, and 2) a decoding stage to generate subsequent tokens. For long prompts, the KV cache must be computed for all tokens during the prefilling stage, which can significantly increase the time needed to generate the first token. Consequently, the prefilling stage may become a bottleneck in the generation process. An open question remains whether all prompt tokens are essential for generating the first token. To answer this, we introduce a novel method, LazyLLM, that selectively computes the KV for tokens important for the next token prediction in both the prefilling and decoding stages. Contrary to static pruning approaches that prune the prompt at once, LazyLLM allows language models to dynamically select different subsets of tokens from the context in different generation steps, even though they might be pruned in previous steps. Extensive experiments on standard datasets across various tasks demonstrate that LazyLLM is a generic method that can be seamlessly integrated with existing language models to significantly accelerate the generation without fine-tuning. For instance, in the multi-document question-answering task, LazyLLM accelerates the prefilling stage of the Llama 2 7B model by 2.34x while maintaining accuracy.
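Note: the selection step behind dynamic token pruning can be sketched as a top-k choice over per-token importance scores. The sketch assumes a hypothetical `importance` list (for instance, attention weights from the last token) and a fixed `keep_ratio`. How the paper actually scores tokens and schedules pruning across layers is not stated in the abstract.

```python
from typing import List, Sequence


def select_tokens(importance: Sequence[float], keep_ratio: float) -> List[int]:
    """Return sorted indices of the prompt tokens to keep.

    The last token is always kept, since it is the one predicting the next token.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n = len(importance)
    if n == 0:
        return []
    k = max(1, round(n * keep_ratio))
    last = n - 1
    # Rank all tokens except the last by descending importance.
    ranked = sorted(range(last), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[: k - 1] + [last])
```

Calling the function again with fresh scores at each generation step is what makes the selection dynamic: a token dropped at one step can be selected at a later one.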

Title: Impact of Model Size on Fine-tuned LLM Performance in Data-to-Text Generation: A State-of-the-Art Investigation

Authors: Joy Mahapatra, Utpal Garain
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.14088
Pdf URL: https://arxiv.org/pdf/2407.14088
Copy Paste: [[2407.14088]] Impact of Model Size on Fine-tuned LLM Performance in Data-to-Text Generation: A State-of-the-Art Investigation(https://arxiv.org/abs/2407.14088)
Keywords: llm
Abstract: Data-to-text (D2T) generation aims to generate human-readable text from semi-structured data, such as tables and graphs. The recent success of D2T is largely attributed to advancements in LLMs. Despite the success of LLMs, no research has been conducted to illustrate the impact of model size on the performance of fine-tuned LLMs for D2T tasks. D2T model performance is typically assessed based on three key qualities: readability (indicates fluency and coherence), informativeness (measures content similarity), and faithfulness (assesses consistency of factual information). It is currently uncertain whether increasing the size of LLMs effectively improves performance in D2T tasks across these three qualities. The objective of this study is to investigate the performance of fine-tuned LLMs in D2T tasks in terms of model size. Through extensive comparative analysis, we aim to elucidate both the advantages and limitations of scaling model sizes across five widely used D2T datasets (E2E, ViGGo, WikiTableText, DART, and WebNLG) and twelve state-of-the-art LLMs with varying sizes from five different LLM families (T5, BART, OPT, BLOOM, and Llama 2). To comprehensively cover all three essential qualities of D2T models, we incorporate six widely recognized automatic metrics -- BLEU, METEOR, BERTScore, MoverScore, Parent, and BARTScore. We also provide an in-depth analysis of LLM performance concerning model size in the presence of source-reference divergence, a critical aspect of D2T tasks. Our investigation reveals that increasing LLM size enhances readability and informativeness in D2T tasks, but larger (in terms of size) LLMs may sacrifice faithfulness. Moreover, small-sized LLMs show more resilience than larger ones when source-reference divergence is present.

Title: I Know About "Up"! Enhancing Spatial Reasoning in Visual Language Models Through 3D Reconstruction

Authors: Zaiqiao Meng, Hao Zhou, Yifang Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.14133
Pdf URL: https://arxiv.org/pdf/2407.14133
Copy Paste: [[2407.14133]] I Know About "Up"! Enhancing Spatial Reasoning in Visual Language Models Through 3D Reconstruction(https://arxiv.org/abs/2407.14133)
Keywords: language model, prompt
Abstract: Visual Language Models (VLMs) are essential for various tasks, particularly visual reasoning tasks, due to their robust multi-modal information integration, visual reasoning capabilities, and contextual awareness. However, existing VLMs' visual spatial reasoning capabilities are often inadequate, struggling even with basic tasks such as distinguishing left from right. To address this, we propose the ZeroVLM model, designed to enhance the visual spatial reasoning abilities of VLMs. ZeroVLM employs Zero-1-to-3, a 3D reconstruction model for obtaining different views of the input images and incorporates a prompting mechanism to further improve visual spatial reasoning. Experimental results on four visual spatial reasoning datasets show that our ZeroVLM achieves up to 19.48% accuracy improvement, which indicates the effectiveness of the 3D reconstruction and prompting mechanisms of our ZeroVLM.

Title: Automatic Classification of News Subjects in Broadcast News: Application to a Gender Bias Representation Analysis

Authors: Valentin Pelloin, Lena Dodson, Émile Chapuis, Nicolas Hervé, David Doukhan
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2407.14180
Pdf URL: https://arxiv.org/pdf/2407.14180
Copy Paste: [[2407.14180]] Automatic Classification of News Subjects in Broadcast News: Application to a Gender Bias Representation Analysis(https://arxiv.org/abs/2407.14180)
Keywords: language model, llm
Abstract: This paper introduces a computational framework designed to delineate gender distribution biases in topics covered by French TV and radio news. We transcribe a dataset of 11.7k hours, broadcast in 2023 on 21 French channels. A Large Language Model (LLM) is used in few-shot conversation mode to obtain a topic classification on those transcriptions. Using the generated LLM annotations, we explore the finetuning of a specialized smaller classification model, to reduce the computational cost. To evaluate the performance of these models, we construct and annotate a dataset of 804 dialogues. This dataset is made available free of charge for research purposes. We show that women are notably underrepresented in subjects such as sports, politics and conflicts. Conversely, on topics such as weather, commercials and health, women have more speaking time than their overall average across all subjects. We also observe representation differences between private and public service channels.
摘要：本文介绍了一个计算框架，旨在描述法国电视和广播新闻所涵盖主题的性别分布偏差。我们转录了 2023 年在 21 个法国频道播出的 11.7k 小时的数据集。大型语言模型 (LLM) 用于少样本对话模式，以获得这些转录的主题分类。使用生成的 LLM 注释，我们探索专门的较小分类模型的微调，以降低计算成本。为了评估这些模型的性能，我们构建并注释了一个包含 804 个对话的数据集。该数据集可免费用于研究目的。我们发现，女性在体育、政治和冲突等主题中的代表性明显不足。相反，在天气、商业广告和健康等话题上，女性的发言时间比她们在所有主题中的总体平均水平要多。我们还观察到私营频道和公共服务频道之间的代表性差异。

Title: LeKUBE: A Legal Knowledge Update BEnchmark

Authors: Changyue Wang, Weihang Su, Hu Yiran, Qingyao Ai, Yueyue Wu, Cheng Luo, Yiqun Liu, Min Zhang, Shaoping Ma
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.14192
Pdf URL: https://arxiv.org/pdf/2407.14192
Copy Paste: [[2407.14192]] LeKUBE: A Legal Knowledge Update BEnchmark(https://arxiv.org/abs/2407.14192)
Keywords: language model, llm
Abstract: Recent advances in Large Language Models (LLMs) have significantly shaped the applications of AI in multiple fields, including the studies of legal intelligence. Trained on extensive legal texts, including statutes and legal documents, the legal LLMs can capture important legal knowledge/concepts effectively and provide important support for downstream legal applications such as legal consultancy. Yet, the dynamic nature of legal statutes and interpretations also poses new challenges to the use of LLMs in legal applications. Particularly, how to update the legal knowledge of LLMs effectively and efficiently has become an important research problem in practice. Existing benchmarks for evaluating knowledge update methods are mostly designed for the open domain and cannot address the specific challenges of the legal domain, such as the nuanced application of new legal knowledge, the complexity and lengthiness of legal regulations, and the intricate nature of legal reasoning. To address this gap, we introduce the Legal Knowledge Update BEnchmark, i.e. LeKUBE, which evaluates knowledge update methods for legal LLMs across five dimensions. Specifically, we categorize the needs of knowledge updates in the legal domain with the help of legal professionals, and then hire annotators from law schools to create synthetic updates to the Chinese Criminal and Civil Code as well as sets of questions of which the answers would change after the updates. Through a comprehensive evaluation of state-of-the-art knowledge update methods, we reveal a notable gap between existing knowledge update methods and the unique needs of the legal domain, emphasizing the need for further research and development of knowledge update mechanisms tailored for legal LLMs.
摘要：大型语言模型 (LLM) 的最新进展极大地影响了人工智能在多个领域的应用，包括法律智能研究。通过对包括法规和法律文件在内的大量法律文本进行训练，法律 LLM 可以有效地捕捉重要的法律知识/概念，并为法律咨询等下游法律应用提供重要支持。然而，法律法规和解释的动态性质也对 LLM 在法律应用中的使用提出了新的挑战。特别是，如何有效、高效地更新 LLM 的法律知识已成为实践中的一个重要研究问题。现有的评估知识更新方法的基准大多是针对开放领域设计的，无法解决法律领域的特定挑战，例如新法律知识的细微应用、法律规定的复杂性和冗长性以及法律推理的复杂性。为了弥补这一差距，我们推出了法律知识更新基准，即 LeKUBE，它从五个维度评估法律 LLM 的知识更新方法。具体来说，我们在法律专业人士的帮助下对法律领域知识更新的需求进行分类，然后聘请法学院的注释者对中国刑法和民法典进行合成更新，并创建更新后答案会发生变化的问题集。通过对最先进的知识更新方法的全面评估，我们发现现有的知识更新方法与法律领域的独特需求之间存在显著差距，强调需要进一步研究和开发适合法律 LLM 的知识更新机制。

Title: Conditioning Chat-GPT for information retrieval: the Unipa-GPT case study

Authors: Irene Siragusa, Roberto Pirrone
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.14246
Pdf URL: https://arxiv.org/pdf/2407.14246
Copy Paste: [[2407.14246]] Conditioning Chat-GPT for information retrieval: the Unipa-GPT case study(https://arxiv.org/abs/2407.14246)
Keywords: language model, gpt, chat, retrieval augmented generation
Abstract: This paper illustrates the architecture and training of Unipa-GPT, a chatbot relying on a Large Language Model, developed for assisting students in choosing a bachelor/master degree course at the University of Palermo. Unipa-GPT relies on gpt-3.5-turbo and was presented in the context of the European Researchers' Night (SHARPER night). In our experiments we adopted both the Retrieval Augmented Generation (RAG) approach and fine-tuning to develop the system. The whole architecture of Unipa-GPT is presented, the RAG and fine-tuned systems are compared, and a brief discussion of their performance is reported. Further comparison with other Large Language Models and the experimental results during the SHARPER night are illustrated.
摘要：本文介绍了 Unipa-GPT 的架构和训练，Unipa-GPT 是一个基于大型语言模型的聊天机器人，旨在帮助巴勒莫大学的学生选择学士/硕士学位课程。Unipa-GPT 依赖于 gpt-3.5-turbo，它是在欧洲研究人员之夜 (SHARPER night) 的背景下展示的。在我们的实验中，我们采用了检索增强生成 (RAG) 方法和微调来开发系统。本文介绍了 Unipa-GPT 的整体架构，比较了 RAG 和微调系统，并简要讨论了它们的性能。本文还说明了与其他大型语言模型的进一步比较以及 SHARPER night 期间的实验结果。

Title: Voices in a Crowd: Searching for Clusters of Unique Perspectives

Authors: Nikolas Vitsakis, Amit Parekh, Ioannis Konstas
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2407.14259
Pdf URL: https://arxiv.org/pdf/2407.14259
Copy Paste: [[2407.14259]] Voices in a Crowd: Searching for Clusters of Unique Perspectives(https://arxiv.org/abs/2407.14259)
Keywords: language model
Abstract: Language models have been shown to reproduce underlying biases existing in their training data, which is the majority perspective by default. Proposed solutions aim to capture minority perspectives by either modelling annotator disagreements or grouping annotators based on shared metadata, both of which face significant challenges. We propose a framework that trains models without encoding annotator metadata, extracts latent embeddings informed by annotator behaviour, and creates clusters of similar opinions, which we refer to as voices. Resulting clusters are validated post-hoc via internal and external quantitative metrics, as well as a qualitative analysis to identify the type of voice that each cluster represents. Our results demonstrate the strong generalisation capability of our framework, indicated by resulting clusters being adequately robust, while also capturing minority perspectives based on different demographic factors throughout two distinct datasets.
摘要：语言模型已被证明可以重现其训练数据中存在的潜在偏见，默认情况下，这是多数人的观点。提出的解决方案旨在通过对注释者分歧进行建模或基于共享元数据对注释者进行分组来捕捉少数人的观点，这两种方法都面临着重大挑战。我们提出了一个框架，该框架可以在不编码注释者元数据的情况下训练模型，提取由注释者行为提供信息的潜在嵌入，并创建相似意见的集群，我们将其称为声音。通过内部和外部定量指标以及定性分析对得到的集群进行事后验证，以确定每个集群代表的声音类型。我们的结果证明了我们框架的强大泛化能力，结果集群足够稳健，同时还捕捉了两个不同数据集中基于不同人口统计因素的少数人观点。

Title: Predictive Simultaneous Interpretation: Harnessing Large Language Models for Democratizing Real-Time Multilingual Communication

Authors: Kurando Iida, Kenjiro Mimura, Nobuo Ito
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.14269
Pdf URL: https://arxiv.org/pdf/2407.14269
Copy Paste: [[2407.14269]] Predictive Simultaneous Interpretation: Harnessing Large Language Models for Democratizing Real-Time Multilingual Communication(https://arxiv.org/abs/2407.14269)
Keywords: language model, llm
Abstract: This study introduces a groundbreaking approach to simultaneous interpretation by directly leveraging the predictive capabilities of Large Language Models (LLMs). We present a novel algorithm that generates real-time translations by predicting speaker utterances and expanding multiple possibilities in a tree-like structure. This method demonstrates unprecedented flexibility and adaptability, potentially overcoming the structural differences between languages more effectively than existing systems. Our theoretical analysis, supported by illustrative examples, suggests that this approach could lead to more natural and fluent translations with minimal latency. The primary purpose of this paper is to share this innovative concept with the academic community, stimulating further research and development in this field. We discuss the theoretical foundations, potential advantages, and implementation challenges of this technique, positioning it as a significant step towards democratizing multilingual communication.
摘要：本研究通过直接利用大型语言模型 (LLM) 的预测功能，引入了一种开创性的同声传译方法。我们提出了一种新颖的算法，通过预测说话者的话语并在树状结构中扩展多种可能性来生成实时翻译。这种方法表现出前所未有的灵活性和适应性，可能比现有系统更有效地克服语言之间的结构差异。我们的理论分析以示例为依据，表明这种方法可以实现更自然、更流畅的翻译，同时将延迟降至最低。本文的主要目的是与学术界分享这一创新概念，促进该领域的进一步研究和发展。我们讨论了这项技术的理论基础、潜在优势和实施挑战，将其定位为实现多语言交流民主化的重要一步。

Title: How to Engage Your Readers? Generating Guiding Questions to Promote Active Reading

Authors: Peng Cui, Vilém Zouhar, Xiaoyu Zhang, Mrinmaya Sachan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.14309
Pdf URL: https://arxiv.org/pdf/2407.14309
Copy Paste: [[2407.14309]] How to Engage Your Readers? Generating Guiding Questions to Promote Active Reading(https://arxiv.org/abs/2407.14309)
Keywords: language model
Abstract: Using questions in written text is an effective strategy to enhance readability. However, what makes an active reading question good, what the linguistic role of these questions is, and what their impact on human reading is remain understudied. We introduce GuidingQ, a dataset of 10K in-text questions from textbooks and scientific articles. By analyzing the dataset, we present a comprehensive understanding of the use, distribution, and linguistic characteristics of these questions. Then, we explore various approaches to generate such questions using language models. Our results highlight the importance of capturing inter-question relationships and the challenge of question position identification in generating these questions. Finally, we conduct a human study to understand the implication of such questions on reading comprehension. We find that the generated questions are of high quality and are almost as effective as human-written questions in terms of improving readers' memorization and comprehension.
摘要：在书面文本中使用问题是提高可读性的有效策略。然而，什么才是好的主动阅读问题，这些问题的语言作用是什么，以及它们对人类阅读的影响仍未得到充分研究。我们介绍了 GuidingQ，这是一个来自教科书和科学文章的 10K 文内问题的数据集。通过分析数据集，我们全面了解了这些问题的使用、分布和语言特征。然后，我们探索了使用语言模型生成此类问题的各种方法。我们的结果强调了在生成这些问题时捕捉问题间关系的重要性和问题位置识别的挑战。最后，我们进行了一项人类研究，以了解这些问题对阅读理解的影响。我们发现生成的问题质量很高，在提高读者的记忆和理解能力方面几乎与人类编写的问题一样有效。

Title: Multimodal Misinformation Detection using Large Vision-Language Models

Authors: Sahar Tahmasebi, Eric Müller-Budack, Ralph Ewerth
Subjects: cs.CL, cs.IR, cs.MM
Abstract URL: https://arxiv.org/abs/2407.14321
Pdf URL: https://arxiv.org/pdf/2407.14321
Copy Paste: [[2407.14321]] Multimodal Misinformation Detection using Large Vision-Language Models(https://arxiv.org/abs/2407.14321)
Keywords: language model, llm
Abstract: The increasing proliferation of misinformation and its alarming impact have motivated both industry and academia to develop approaches for misinformation detection and fact checking. Recent advances in large language models (LLMs) have shown remarkable performance in various tasks, but whether and how LLMs could help with misinformation detection remains relatively underexplored. Most existing state-of-the-art approaches either do not consider evidence and solely focus on claim-related features or assume the evidence to be provided. Few approaches consider evidence retrieval as part of the misinformation detection, and those that do rely on fine-tuning models. In this paper, we investigate the potential of LLMs for misinformation detection in a zero-shot setting. We incorporate an evidence retrieval component into the process as it is crucial to gather pertinent information from various sources to detect the veracity of claims. To this end, we propose a novel re-ranking approach for multimodal evidence retrieval using both LLMs and large vision-language models (LVLM). The retrieved evidence samples (images and texts) serve as the input for an LVLM-based approach for multimodal fact verification (LVLM4FV). To enable a fair evaluation, we address the issue of incomplete ground truth for evidence samples in an existing evidence retrieval dataset by annotating a more complete set of evidence samples for both image and text retrieval. Our experimental results on two datasets demonstrate the superiority of the proposed approach in both evidence retrieval and fact verification tasks and also better generalization capability across datasets compared to the supervised baseline.
摘要：虚假信息的日益泛滥及其令人震惊的影响促使业界和学术界都开始开发虚假信息检测和事实核查的方法。大型语言模型 (LLM) 的最新进展在各种任务中都表现出色，但 LLM 是否以及如何帮助检测虚假信息仍然相对未被充分探索。大多数现有的最先进方法要么不考虑证据，只关注与声明相关的特征，要么假设提供证据。很少有方法将证据检索视为虚假信息检测的一部分，而是依赖于微调模型。在本文中，我们研究了 LLM 在零样本设置下检测虚假信息的潜力。我们将证据检索组件纳入该过程，因为从各种来源收集相关信息对于检测声明的真实性至关重要。为此，我们提出了一种新颖的重新排名方法，用于使用 LLM 和大型视觉语言模型 (LVLM) 进行多模态证据检索。检索到的证据样本（图像和文本）用作基于 LVLM 的多模态事实验证方法 (LVLM4FV) 的输入。为了进行公平评估，我们通过为图像和文本检索标注一组更完整的证据样本来解决现有证据检索数据集中证据样本的真实性不完整的问题。我们在两个数据集上的实验结果证明了所提出的方法在证据检索和事实验证任务中的优势，并且与监督基线相比，该方法在整个数据集中的泛化能力也更好。

Title: LLMs left, right, and center: Assessing GPT's capabilities to label political bias from web domains

Authors: Raphael Hernandes
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2407.14344
Pdf URL: https://arxiv.org/pdf/2407.14344
Copy Paste: [[2407.14344]] LLMs left, right, and center: Assessing GPT's capabilities to label political bias from web domains(https://arxiv.org/abs/2407.14344)
Keywords: language model, gpt, llm
Abstract: This research investigates whether OpenAI's GPT-4, a state-of-the-art large language model, can accurately classify the political bias of news sources based solely on their URLs. Given the subjective nature of political labels, third-party bias ratings like those from Ad Fontes Media, AllSides, and Media Bias/Fact Check (MBFC) are often used in research to analyze news source diversity. This study aims to determine if GPT-4 can replicate these human ratings on a seven-degree scale ("far-left" to "far-right"). The analysis compares GPT-4's classifications against MBFC's, and controls for website popularity using Open PageRank scores. Findings reveal a high correlation ($\text{Spearman's } \rho = .89$, $n = 5,877$, $p < 0.001$) between GPT-4's and MBFC's ratings, indicating the model's potential reliability. However, GPT-4 abstained from classifying approximately $\frac{2}{3}$ of the dataset, particularly less popular and less biased sources. The study also identifies a slight leftward skew in GPT-4's classifications compared to MBFC's. The analysis suggests that while GPT-4 can be a scalable, cost-effective tool for political bias classification of news websites, its use should complement human judgment to mitigate biases. Further research is recommended to explore the model's performance across different settings, languages, and additional datasets.
摘要：这项研究调查了 OpenAI 的 GPT-4（一种最先进的大型语言模型）是否能够仅根据新闻来源的 URL 准确地对其政治偏见进行分类。鉴于政治标签的主观性，第三方偏见评级（如 Ad Fontes Media、AllSides 和 Media Bias/Fact Check (MBFC) 的评级）通常用于研究以分析新闻来源的多样性。这项研究旨在确定 GPT-4 是否可以在七度尺度（“极左”到“极右”）上复制这些人工评级。该分析将 GPT-4 的分类与 MBFC 的分类进行比较，并使用 Open PageRank 分数控制网站受欢迎程度。研究结果显示，GPT-4 和 MBFC 的评级之间存在高度相关性（$\text{Spearman's } \rho = .89$，$n = 5,877$，$p < 0.001$），表明该模型具有潜在的可靠性。然而，GPT-4 放弃了对数据集中大约 $\frac{2}{3}$ 的分类，尤其是不太受欢迎和偏见较少的来源。该研究还发现，与 MBFC 相比，GPT-4 的分类略微向左倾斜。分析表明，虽然 GPT-4 可以成为一种可扩展、经济高效的新闻网站政治偏见分类工具，但它的使用应该补充人类的判断以减轻偏见。建议进一步研究以探索该模型在不同设置、语言和其他数据集中的表现。
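Illustration: the agreement statistic reported in this abstract is a Spearman rank correlation computed over the sources that GPT-4 did label. The sketch below is a minimal, standard-library version of that calculation and is not the author's code. The helper names, the coding of the seven-degree scale as integers from -3 to 3, the use of `None` for an abstention, and the toy labels are all assumptions made for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


def drop_abstentions(model_labels, reference_labels):
    """Keep only pairs where the model gave a label (None marks an abstention)."""
    pairs = [(m, r) for m, r in zip(model_labels, reference_labels) if m is not None]
    return [m for m, _ in pairs], [r for _, r in pairs]


# Hypothetical toy labels on a -3 ("far-left") to 3 ("far-right") scale.
model = [-3, None, -1, 0, None, 2, 3]
reference = [-3, -2, -1, 0, 1, 1, 3]
kept_model, kept_reference = drop_abstentions(model, reference)
rho = spearman_rho(kept_model, kept_reference)
print(f"kept {len(kept_model)} of {len(model)} sources, rho = {rho:.2f}")
```

Because abstentions are dropped before ranking, a high rho describes agreement only on the labeled subset, which is why the abstract reports the abstention rate alongside the correlation.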

Title: Open Artificial Knowledge

Authors: Vadim Borisov, Richard H. Schreiber
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2407.14371
Pdf URL: https://arxiv.org/pdf/2407.14371
Copy Paste: [[2407.14371]] Open Artificial Knowledge(https://arxiv.org/abs/2407.14371)
Keywords: language model, gpt, llm, chat
Abstract: The tremendous success of chat-based AI systems like ChatGPT, Claude, and Gemini stems from Large Language Models (LLMs) trained on vast amounts of data. However, acquiring high-quality, diverse, and ethically sourced training data remains a significant challenge. We introduce the Open Artificial Knowledge (OAK) dataset, a large-scale resource of over 500 million tokens (at the moment of writing) designed to address this issue. OAK leverages an ensemble of state-of-the-art LLMs, including GPT4o, LLaMa3-70B, LLaMa3-8B, Mixtral-8x7B, Gemma-7B, and Gemma-2-9B, to generate high-quality text across diverse domains, guided by Wikipedia's main categories. Our methodology ensures broad knowledge coverage while maintaining coherence and factual accuracy. The OAK dataset aims to foster the development of more capable and aligned language models while addressing critical issues of data scarcity and privacy in LLM training, and it is freely available on this http URL.
摘要：ChatGPT、Claude 和 Gemini 等基于聊天的 AI 系统的巨大成功源于在海量数据集上训练的大型语言模型 (LLM)。然而，获取高质量、多样化且符合道德规范的训练数据仍然是一项重大挑战。我们推出了开放人工知识 (OAK) 数据集，这是一个超过 5 亿个标记的大型资源（在撰写本文时），旨在解决这一问题。OAK 利用一组最先进的 LLM，包括 GPT4o、LLaMa3-70B、LLaMa3-8B、Mixtral-8x7B、Gemma-7B 和 Gemma-2-9B，在维基百科主要类别的指导下生成跨不同领域的高质量文本。我们的方法确保广泛的知识覆盖范围，同时保持连贯性和事实准确性。OAK 数据集旨在促进开发更强大、更一致的语言模型，同时解决 LLM 训练中数据稀缺和隐私的关键问题，并且可以在此 http URL 上免费获取。

Title: Check-Eval: A Checklist-based Approach for Evaluating Text Quality

Authors: Jayr Pereira, Roberto Lotufo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.14467
Pdf URL: https://arxiv.org/pdf/2407.14467
Copy Paste: [[2407.14467]] Check-Eval: A Checklist-based Approach for Evaluating Text Quality(https://arxiv.org/abs/2407.14467)
Keywords: language model, gpt, llm
Abstract: Evaluating the quality of text generated by large language models (LLMs) remains a significant challenge. Traditional metrics often fail to align well with human judgments, particularly in tasks requiring creativity and nuance. In this paper, we propose Check-Eval, a novel evaluation framework leveraging LLMs to assess the quality of generated text through a checklist-based approach. Check-Eval can be employed as both a reference-free and reference-dependent evaluation method, providing a structured and interpretable assessment of text quality. The framework consists of two main stages: checklist generation and checklist evaluation. We validate Check-Eval on two benchmark datasets: Portuguese Legal Semantic Textual Similarity and SummEval. Our results demonstrate that Check-Eval achieves higher correlations with human judgments compared to existing metrics, such as G-Eval and GPTScore, underscoring its potential as a more reliable and effective evaluation framework for natural language generation tasks. The code for our experiments is available at https://anonymous.4open.science/r/check-eval-0DB4.
摘要：评估大型语言模型 (LLM) 生成的文本的质量仍然是一项重大挑战。传统的指标通常无法与人类判断很好地保持一致，尤其是在需要创造力和细微差别的任务中。在本文中，我们提出了 Check-Eval，这是一种利用 LLM 通过基于检查表的方法评估生成文本质量的新型评估框架。Check-Eval 既可以用作无参考评估方法，也可以用作参考依赖评估方法，提供结构化且可解释的文本质量评估。该框架包括两个主要阶段：检查表生成和检查表评估。我们在两个基准数据集上验证了 Check-Eval：葡萄牙法律语义文本相似性和 SummEval。我们的结果表明，与现有指标（例如 G-Eval 和 GPTScore）相比，Check-Eval 与人类判断的相关性更高，突显了其作为自然语言生成任务更可靠、更有效的评估框架的潜力。我们的实验代码可在 https://anonymous.4open.science/r/check-eval-0DB4 获得。
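Illustration: the abstract describes two stages, checklist generation and checklist evaluation, without giving the scoring rule. The sketch below shows only one plausible shape for the second stage and is not the authors' implementation. It assumes the final score is the fraction of checklist items a judge marks as satisfied. The function names are hypothetical, and the keyword-matching judge is a stand-in for the LLM call that the paper's framework would make.

```python
from typing import Callable, List


def checklist_score(
    text: str,
    checklist: List[str],
    judge: Callable[[str, str], bool],
) -> float:
    """Return the fraction of checklist items the judge marks as satisfied.

    Assumption: an unweighted fraction; the paper may aggregate differently.
    """
    if not checklist:
        raise ValueError("checklist must not be empty")
    passed = sum(1 for item in checklist if judge(text, item))
    return passed / len(checklist)


def keyword_judge(text: str, item: str) -> bool:
    """Toy stand-in for an LLM judge: case-insensitive substring match."""
    return item.lower() in text.lower()


# Hypothetical candidate text and checklist, for illustration only.
candidate = "The court upheld the appeal."
checklist = ["court", "appeal", "damages"]
score = checklist_score(candidate, checklist, keyword_judge)
print(f"{score:.2f}")
```

Passing the judge as a parameter keeps the scoring logic independent of any particular model or prompt, so the same function would serve a reference-free or a reference-dependent checklist.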

Title: ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities

Authors: Peng Xu, Wei Ping, Xianchao Wu, Zihan Liu, Mohammad Shoeybi, Bryan Catanzaro
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2407.14482
Pdf URL: https://arxiv.org/pdf/2407.14482
Copy Paste: [[2407.14482]] ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities(https://arxiv.org/abs/2407.14482)
Keywords: gpt, llm, long context, prompt, chat, retrieval-augmented generation
Abstract: In this work, we introduce ChatQA 2, a Llama3-based model designed to bridge the gap between open-access LLMs and leading proprietary models (e.g., GPT-4-Turbo) in long-context understanding and retrieval-augmented generation (RAG) capabilities. These two capabilities are essential for LLMs to process large volumes of information that cannot fit into a single prompt and are complementary to each other, depending on the downstream tasks and computational budgets. We present a detailed continued training recipe to extend the context window of Llama3-70B-base from 8K to 128K tokens, along with a three-stage instruction tuning process to enhance the model's instruction-following, RAG performance, and long-context understanding capabilities. Our results demonstrate that the Llama3-ChatQA-2-70B model achieves accuracy comparable to GPT-4-Turbo-2024-0409 on many long-context understanding tasks and surpasses it on the RAG benchmark. Interestingly, we find that the state-of-the-art long-context retriever can alleviate the top-k context fragmentation issue in RAG, further improving RAG-based results for long-context understanding tasks. We also provide extensive comparisons between RAG and long-context solutions using state-of-the-art long-context LLMs.
摘要：在这项工作中，我们引入了 ChatQA 2，这是一个基于 Llama3 的模型，旨在弥补开放获取 LLM 与领先的专有模型（例如 GPT-4-Turbo）在长上下文理解和检索增强生成 (RAG) 功能方面的差距。这两种功能对于 LLM 处理无法容纳在单个提示中的大量信息至关重要，并且根据下游任务和计算预算相互补充。我们提供了详细的持续训练配方，以将 Llama3-70B-base 的上下文窗口从 8K 扩展到 128K 个 token，以及一个三阶段指令调整过程，以增强模型的指令跟踪、RAG 性能和长上下文理解能力。我们的结果表明，Llama3-ChatQA-2-70B 模型在许多长上下文理解任务上实现了与 GPT-4-Turbo-2024-0409 相当的准确率，并在 RAG 基准上超越了它。有趣的是，我们发现最先进的长上下文检索器可以缓解 RAG 中的前 k 个上下文碎片化问题，从而进一步改善基于 RAG 的长上下文理解任务结果。我们还使用最先进的长上下文 LLM 对 RAG 和长上下文解决方案进行了广泛的比较。

Title: Evaluating the Reliability of Self-Explanations in Large Language Models

Authors: Korbinian Randl, John Pavlopoulos, Aron Henriksson, Tony Lindgren
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.14487
Pdf URL: https://arxiv.org/pdf/2407.14487
Copy Paste: [[2407.14487]] Evaluating the Reliability of Self-Explanations in Large Language Models(https://arxiv.org/abs/2407.14487)
Keywords: language model, llm, prompt
Abstract: This paper investigates the reliability of explanations generated by large language models (LLMs) when prompted to explain their previous output. We evaluate two kinds of such self-explanations - extractive and counterfactual - using three state-of-the-art LLMs (2B to 8B parameters) on two different classification tasks (objective and subjective). Our findings reveal that, while these self-explanations can correlate with human judgement, they do not fully and accurately follow the model's decision process, indicating a gap between perceived and actual model reasoning. We show that this gap can be bridged because prompting LLMs for counterfactual explanations can produce faithful, informative, and easy-to-verify results. These counterfactuals offer a promising alternative to traditional explainability methods (e.g. SHAP, LIME), provided that prompts are tailored to specific tasks and checked for validity.
摘要：本文研究了大型语言模型 (LLM) 在被要求解释其先前的输出时生成的解释的可靠性。我们使用三个最先进的 LLM（2B 到 8B 个参数）在两个不同的分类任务（客观和主观）上评估了两种此类自我解释 - 提取式和反事实式。我们的研究结果表明，虽然这些自我解释可以与人类判断相关联，但它们并没有完全准确地遵循模型的决策过程，这表明感知和实际模型推理之间存在差距。我们表明，这一差距是可以弥合的，因为提示 LLM 进行反事实解释可以产生忠实、信息丰富且易于验证的结果。这些反事实为传统可解释性方法（例如 SHAP、LIME）提供了一种有前途的替代方案，前提是提示针对特定任务进行定制并检查有效性。

Title: Internal Consistency and Self-Feedback in Large Language Models: A Survey

Authors: Xun Liang, Shichao Song, Zifan Zheng, Hanyu Wang, Qingchen Yu, Xunkai Li, Rong-Hua Li, Feiyu Xiong, Zhiyu Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.14507
Pdf URL: https://arxiv.org/pdf/2407.14507
Copy Paste: [[2407.14507]] Internal Consistency and Self-Feedback in Large Language Models: A Survey(https://arxiv.org/abs/2407.14507)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) are expected to respond accurately but often exhibit deficient reasoning or generate hallucinatory content. To address these, studies prefixed with "Self-" such as Self-Consistency, Self-Improve, and Self-Refine have been initiated. They share a commonality: they involve LLMs evaluating and updating themselves to mitigate the issues. Nonetheless, these efforts lack a unified perspective on summarization, as existing surveys predominantly focus on categorization without examining the motivations behind these works. In this paper, we summarize a theoretical framework, termed Internal Consistency, which offers unified explanations for phenomena such as the lack of reasoning and the presence of hallucinations. Internal Consistency assesses the coherence among LLMs' latent layer, decoding layer, and response layer based on sampling methodologies. Expanding upon the Internal Consistency framework, we introduce a streamlined yet effective theoretical framework capable of mining Internal Consistency, named Self-Feedback. The Self-Feedback framework consists of two modules: Self-Evaluation and Self-Update. This framework has been employed in numerous studies. We systematically classify these studies by tasks and lines of work; summarize relevant evaluation methods and benchmarks; and delve into the concern, "Does Self-Feedback Really Work?" We propose several critical viewpoints, including the "Hourglass Evolution of Internal Consistency", "Consistency Is (Almost) Correctness" hypothesis, and "The Paradox of Latent and Explicit Reasoning". Furthermore, we outline promising directions for future research. We have open-sourced the experimental code, reference list, and statistical data, available at this https URL.
摘要：大型语言模型 (LLM) 有望准确响应，但通常表现出推理不足或产生幻觉内容。为了解决这些问题，以“自我”为前缀的研究已经开始，例如自我一致性、自我改进和自我完善。它们有一个共同点：涉及 LLM 评估和更新自身以缓解问题。尽管如此，这些努力缺乏对总结的统一视角，因为现有的调查主要侧重于分类，而没有研究这些工作背后的动机。在本文中，我们总结了一个称为内部一致性的理论框架，它为缺乏推理和幻觉存在等现象提供了统一的解释。内部一致性基于采样方法评估 LLM 的潜在层、解码层和响应层之间的一致性。在内部一致性框架的基础上，我们引入了一个能够挖掘内部一致性的精简但有效的理论框架，称为自我反馈。自我反馈框架由两个模块组成：自我评估和自我更新。该框架已在许多研究中使用。我们根据任务和工作方向对这些研究进行系统分类；总结相关的评估方法和基准；并深入探讨“自我反馈真的有效吗？”的问题。我们提出了几个关键观点，包括“内部一致性的沙漏演变”、“一致性（几乎）就是正确性”假设和“潜在和显性推理的悖论”。此外，我们概述了未来研究的有希望的方向。我们已经开源了实验代码、参考列表和统计数据，可在此 https URL 上找到。