2025-01-22

Title: ArxEval: Evaluating Retrieval and Generation in Language Models for Scientific Literature

Authors: Aarush Sinha, Viraj Virk, Dipshikha Chakraborty, P.S. Sreeja
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.10483
Pdf URL: https://arxiv.org/pdf/2501.10483
Copy Paste: [[2501.10483]] ArxEval: Evaluating Retrieval and Generation in Language Models for Scientific Literature(https://arxiv.org/abs/2501.10483)
Keywords: language model, hallucination
Abstract: Language Models (LMs) are now playing an increasingly large role in information generation and synthesis; the representation of scientific knowledge in these systems needs to be highly accurate. A prime challenge is hallucination; that is, generating apparently plausible but actually false information, including invented citations and nonexistent research papers. This kind of inaccuracy is dangerous in all domains that require high levels of factual correctness, such as academia and education. This work presents a pipeline for evaluating how frequently language models hallucinate when generating responses about scientific literature. We propose ArxEval, an evaluation pipeline with two tasks using ArXiv as a repository: Jumbled Titles and Mixed Titles. Our evaluation includes fifteen widely used language models and provides comparative insights into their reliability in handling scientific literature.

Title: Tabular-TX: Theme-Explanation Structure-based Table Summarization via In-Context Learning

Authors: TaeYoon Kwack, Jisoo Kim, Ki Yong Jung, DongGeon Lee, Heesun Park
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.10487
Pdf URL: https://arxiv.org/pdf/2501.10487
Copy Paste: [[2501.10487]] Tabular-TX: Theme-Explanation Structure-based Table Summarization via In-Context Learning(https://arxiv.org/abs/2501.10487)
Keywords: language model, llm
Abstract: This paper proposes a Theme-Explanation Structure-based Table Summarization (Tabular-TX) pipeline designed to efficiently process table data. Tabular-TX preprocesses table data by focusing on highlighted cells and then generates summary sentences structured with a Theme Part in the form of adverbial phrases followed by an Explanation Part in the form of clauses. In this process, customized analysis is performed by considering the structural characteristics and comparability of the table. Additionally, by utilizing In-Context Learning, Tabular-TX optimizes the analytical capabilities of large language models (LLMs) without the need for fine-tuning, effectively handling the structural complexity of table data. Results from applying the proposed Tabular-TX to generate table-based summaries demonstrated superior performance compared to existing fine-tuning-based methods, despite limitations in dataset size. Experimental results confirmed that Tabular-TX can process complex table data more effectively and established it as a new alternative for table-based question answering and summarization tasks, particularly in resource-constrained environments.

Title: The Geometry of Tokens in Internal Representations of Large Language Models

Authors: Karthik Viswanathan, Yuri Gardinazzi, Giada Panerai, Alberto Cazzaniga, Matteo Biagetti
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2501.10573
Pdf URL: https://arxiv.org/pdf/2501.10573
Copy Paste: [[2501.10573]] The Geometry of Tokens in Internal Representations of Large Language Models(https://arxiv.org/abs/2501.10573)
Keywords: language model, prompt
Abstract: We investigate the relationship between the geometry of token embeddings and their role in the next token prediction within transformer models. An important aspect of this connection uses the notion of empirical measure, which encodes the distribution of token point clouds across transformer layers and drives the evolution of token representations in the mean-field interacting picture. We use metrics such as intrinsic dimension, neighborhood overlap, and cosine similarity to observationally probe these empirical measures across layers. To validate our approach, we compare these metrics to a dataset where the tokens are shuffled, which disrupts the syntactic and semantic structure. Our findings reveal a correlation between the geometric properties of token embeddings and the cross-entropy loss of next token predictions, implying that prompts with higher loss values have tokens represented in higher-dimensional spaces.
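Two of the metrics named in this abstract, cosine similarity and neighborhood overlap, can be illustrated with a minimal sketch. This is not the authors' code: the function names, the use of Euclidean distance for neighbor search, and the toy point clouds are assumptions made for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def k_nearest(points, i, k):
    """Indices of the k points closest to points[i] (Euclidean distance, assumed)."""
    dists = sorted(
        (math.dist(points[i], p), j) for j, p in enumerate(points) if j != i
    )
    return {j for _, j in dists[:k]}


def neighborhood_overlap(layer_a, layer_b, k):
    """Average fraction of k-nearest neighbors shared by the same token
    in two representations (for example, two transformer layers)."""
    n = len(layer_a)
    shared = sum(
        len(k_nearest(layer_a, i, k) & k_nearest(layer_b, i, k)) for i in range(n)
    )
    return shared / (n * k)


# Hypothetical token point clouds: layer_b is layer_a scaled by 2,
# so every token keeps the same neighbors.
layer_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0), (5.0, 5.0)]
layer_b = [(2 * x, 2 * y) for x, y in layer_a]
overlap = neighborhood_overlap(layer_a, layer_b, k=2)
```

An overlap near 1 means the local neighborhood structure is preserved between the two representations; shuffling tokens, as the abstract describes, would be expected to change these values.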

Title: Adapting Large Language Models for Character-based Augmentative and Alternative Communication

Authors: Dylan Gaines, Keith Vertanen
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2501.10582
Pdf URL: https://arxiv.org/pdf/2501.10582
Copy Paste: [[2501.10582]] Adapting Large Language Models for Character-based Augmentative and Alternative Communication(https://arxiv.org/abs/2501.10582)
Keywords: language model
Abstract: Users of Augmentative and Alternative Communication (AAC) may write letter-by-letter via an interface that uses a character language model. However, most state-of-the-art large pretrained language models predict subword tokens of variable length. We investigate how to practically use such models to make accurate and efficient character predictions. We fine-tune models using a large dataset of sentences we curated in which each sentence is rated according to how useful it might be for spoken or written AAC communication. We find that using an algorithm to produce character predictions from a subword large language model provides more accurate predictions than adding a classification layer or using a byte-level model. We also find that our domain adaptation curriculum is effective at improving model performance on simple, conversational text.
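The core idea of turning subword predictions into character predictions can be sketched as a marginalization over tokens. The sketch below is a deliberately simplified, single-step illustration and not the algorithm from the paper: the function name, the toy token distribution, and the choice to ignore predictions that span several tokens are all assumptions.

```python
from collections import defaultdict


def next_char_distribution(token_probs, typed_prefix=""):
    """Marginalize a next-token distribution into a next-character distribution.

    token_probs maps candidate subword strings to probabilities.
    typed_prefix holds characters already typed inside the current token.
    Simplification (assumed): only tokens that extend typed_prefix are counted.
    """
    char_mass = defaultdict(float)
    for token, prob in token_probs.items():
        if token.startswith(typed_prefix) and len(token) > len(typed_prefix):
            char_mass[token[len(typed_prefix)]] += prob
    total = sum(char_mass.values())
    if total == 0:
        return {}
    # Renormalize over the tokens that are still consistent with the prefix.
    return {char: mass / total for char, mass in char_mass.items()}


# Hypothetical token distribution from a subword model.
token_probs = {"the": 0.5, "to": 0.25, "a": 0.25}
first_char = next_char_distribution(token_probs)
after_t = next_char_distribution(token_probs, typed_prefix="t")
```

A practical system would also need to handle characters that can only be reached by completing one token and starting another, which this sketch leaves out.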

Title: Iterative Tree Analysis for Medical Critics

Authors: Zenan Huang, Mingwei Li, Zheng Zhou, Youxin Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.10642
Pdf URL: https://arxiv.org/pdf/2501.10642
Copy Paste: [[2501.10642]] Iterative Tree Analysis for Medical Critics(https://arxiv.org/abs/2501.10642)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) have been widely adopted across various domains, yet their application in the medical field poses unique challenges, particularly concerning the generation of hallucinations. Hallucinations in open-ended long medical text manifest as misleading critical claims, which are difficult to verify due to two reasons. First, critical claims are often deeply entangled within the text and cannot be extracted based solely on surface-level presentation. Second, verifying these claims is challenging because surface-level token-based retrieval often lacks precise or specific evidence, leaving the claims unverifiable without deeper mechanism-based analysis. In this paper, we introduce a novel method termed Iterative Tree Analysis (ITA) for medical critics. ITA is designed to extract implicit claims from long medical texts and verify each claim through an iterative and adaptive tree-like reasoning process. This process involves a combination of top-down task decomposition and bottom-up evidence consolidation, enabling precise verification of complex medical claims through detailed mechanism-level reasoning. Our extensive experiments demonstrate that ITA significantly outperforms previous methods in detecting factual inaccuracies in complex medical text verification tasks by 10%. Additionally, we will release a comprehensive test set to the public, aiming to foster further advancements in research within this domain.

Title: DNA 1.0 Technical Report

Authors: Jungyup Lee, Jemin Kim, Sang Park, SeungJae Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.10648
Pdf URL: https://arxiv.org/pdf/2501.10648
Copy Paste: [[2501.10648]] DNA 1.0 Technical Report(https://arxiv.org/abs/2501.10648)
Keywords: language model
Abstract: In this report, we present DNA 1.0 8B Instruct, a state-of-the-art bilingual language model optimized for Korean and English language tasks. By applying continual pre-training (CPT) with high-quality Korean datasets to Llama 3.1 8B and subsequent supervised fine-tuning (SFT), we create an instruction-following model with enhanced Korean language capabilities. This model is then merged with Llama 3.1 8B Instruct via spherical linear interpolation (SLERP) and undergoes further optimization through direct preference optimization (DPO) and knowledge distillation (KD). DNA 1.0 8B Instruct achieves state-of-the-art results on Korean-specific tasks, including KMMLU (53.26%), KoBEST (83.40%), and BELEBELE (57.99%), while maintaining strong English capabilities on MMLU (66.64%), MMLU-Pro (43.05%) and GSM8K (80.52%). As an open model, DNA 1.0 8B Instruct represents a significant advancement in bilingual language modeling and is freely available through this https URL. For commercial licensing inquiries or feedback, please contact us at this https URL.
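The SLERP merging step mentioned in this abstract can be illustrated on plain vectors. This is a minimal sketch of the standard spherical linear interpolation formula, not the report's merging code: the function name, the per-vector application, the `eps` fallback threshold, and the assumption of non-zero inputs are all illustrative choices.

```python
import math


def slerp(w0, w1, t, eps=1e-8):
    """Spherical linear interpolation between two non-zero weight vectors.

    t=0 returns w0 and t=1 returns w1. Falls back to linear interpolation
    when the vectors are (nearly) parallel, where the formula is unstable.
    """
    norm0 = math.sqrt(sum(a * a for a in w0))
    norm1 = math.sqrt(sum(b * b for b in w1))
    cos_omega = sum(a * b for a, b in zip(w0, w1)) / (norm0 * norm1)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding error
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        return [(1 - t) * a + t * b for a, b in zip(w0, w1)]
    scale0 = math.sin((1 - t) * omega) / sin_omega
    scale1 = math.sin(t * omega) / sin_omega
    return [scale0 * a + scale1 * b for a, b in zip(w0, w1)]


# Hypothetical two-dimensional "weights" from two models.
merged = slerp([1.0, 0.0], [0.0, 1.0], t=0.5)
```

In a model merge, an interpolation like this would typically be applied tensor by tensor, with `t` controlling how far the result moves from one parent model toward the other; how the report sets these details is not stated in the abstract.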

Title: Harnessing the Potential of Large Language Models in Modern Marketing Management: Applications, Future Directions, and Strategic Recommendations

Authors: Raha Aghaei, Ali A. Kiaei, Mahnaz Boush, Javad Vahidi, Mohammad Zavvar, Zeynab Barzegar, Mahan Rofoosheh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.10685
Pdf URL: https://arxiv.org/pdf/2501.10685
Copy Paste: [[2501.10685]] Harnessing the Potential of Large Language Models in Modern Marketing Management: Applications, Future Directions, and Strategic Recommendations(https://arxiv.org/abs/2501.10685)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have revolutionized the process of customer engagement, campaign optimization, and content generation in marketing management. In this paper, we explore the transformative potential of LLMs along with the current applications, future directions, and strategic recommendations for marketers. In particular, we focus on LLMs' major business drivers such as personalization, real-time-interactive customer insights, and content automation, and how they enable customer and business outcomes. The ethical aspects of AI with respect to data privacy, transparency, and mitigation of bias are also covered, with the goal of promoting responsible use of the technology. Through best practices and the use of new technologies, businesses can tap into the potential of LLMs, which helps them grow and stay one step ahead in the turmoil of digital marketing. This article is designed to give marketers the necessary guidance by using best industry practices to integrate these powerful LLMs into their marketing strategy and innovation without compromising on the ethos of their brand.

Title: Development of Application-Specific Large Language Models to Facilitate Research Ethics Review

Authors: Sebastian Porsdam Mann, Joel Seah Jiehao, Stephen R. Latham, Julian Savulescu, Mateo Aboy, Brian D. Earp
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2501.10741
Pdf URL: https://arxiv.org/pdf/2501.10741
Copy Paste: [[2501.10741]] Development of Application-Specific Large Language Models to Facilitate Research Ethics Review(https://arxiv.org/abs/2501.10741)
Keywords: language model, llm
Abstract: Institutional review boards (IRBs) play a crucial role in ensuring the ethical conduct of human subjects research, but face challenges including inconsistency, delays, and inefficiencies. We propose the development and implementation of application-specific large language models (LLMs) to facilitate IRB review processes. These IRB-specific LLMs would be fine-tuned on IRB-specific literature and institutional datasets, and equipped with retrieval capabilities to access up-to-date, context-relevant information. We outline potential applications, including pre-review screening, preliminary analysis, consistency checking, and decision support. While addressing concerns about accuracy, context sensitivity, and human oversight, we acknowledge remaining challenges such as over-reliance on AI and the need for transparency. By enhancing the efficiency and quality of ethical review while maintaining human judgment in critical decisions, IRB-specific LLMs offer a promising tool to improve research oversight. We call for pilot studies to evaluate the feasibility and impact of this approach.

Title: BAP v2: An Enhanced Task Framework for Instruction Following in Minecraft Dialogues

Authors: Prashant Jayannavar, Liliang Ren, Marisa Hudspeth, Charlotte Lambert, Ariel Cordes, Elizabeth Kaplan, Anjali Narayan-Chen, Julia Hockenmaier
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.10836
Pdf URL: https://arxiv.org/pdf/2501.10836
Copy Paste: [[2501.10836]] BAP v2: An Enhanced Task Framework for Instruction Following in Minecraft Dialogues(https://arxiv.org/abs/2501.10836)
Keywords: llm, agent
Abstract: Interactive agents capable of understanding and executing instructions in the physical world have long been a central goal in AI research. The Minecraft Collaborative Building Task (MCBT) provides one such setting to work towards this goal (Narayan-Chen, Jayannavar, and Hockenmaier 2019). It is a two-player game in which an Architect (A) instructs a Builder (B) to construct a target structure in a simulated Blocks World Environment. We focus on the challenging Builder Action Prediction (BAP) subtask of predicting correct action sequences in a given multimodal game context with limited training data (Jayannavar, Narayan-Chen, and Hockenmaier 2020). We take a closer look at evaluation and data for the BAP task, discovering key challenges and making significant improvements on both fronts to propose BAP v2, an upgraded version of the task. This will allow future work to make more efficient and meaningful progress on it. It comprises: (1) an enhanced evaluation benchmark that includes a cleaner test set and fairer, more insightful metrics, and (2) additional synthetic training data generated from novel Minecraft dialogue and target structure simulators emulating the MCBT. We show that the synthetic data can be used to train more performant and robust neural models even with relatively simple training methods. Looking ahead, such data could also be crucial for training more sophisticated, data-hungry deep transformer models and training/fine-tuning increasingly large LLMs. Although modeling is not the primary focus of this work, we also illustrate the impact of our data and training methodologies on a simple LLM- and transformer-based model, thus validating the robustness of our approach, and setting the stage for more advanced architectures and LLMs going forward.

Title: Zero-shot and Few-shot Learning with Instruction-following LLMs for Claim Matching in Automated Fact-checking

Authors: Dina Pisarevskaya, Arkaitz Zubiaga
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.10860
Pdf URL: https://arxiv.org/pdf/2501.10860
Copy Paste: [[2501.10860]] Zero-shot and Few-shot Learning with Instruction-following LLMs for Claim Matching in Automated Fact-checking(https://arxiv.org/abs/2501.10860)
Keywords: language model, gpt, llm, prompt
Abstract: The claim matching (CM) task can benefit an automated fact-checking pipeline by putting together claims that can be resolved with the same fact-check. In this work, we are the first to explore zero-shot and few-shot learning approaches to the task. We consider CM as a binary classification task and experiment with a set of instruction-following large language models (GPT-3.5-turbo, Gemini-1.5-flash, Mistral-7B-Instruct, and Llama-3-8B-Instruct), investigating prompt templates. We introduce a new CM dataset, ClaimMatch, which will be released upon acceptance. We put LLMs to the test in the CM task and find that it can be tackled by leveraging more mature yet similar tasks such as natural language inference or paraphrase detection. We also propose a pipeline for CM, which we evaluate on texts of different lengths.

Title: Generating Structured Outputs from Language Models: Benchmark and Studies

Authors: Saibo Geng, Hudson Cooper, Michał Moskal, Samuel Jenkins, Julian Berman, Nathan Ranchin, Robert West, Eric Horvitz, Harsha Nori
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.10868
Pdf URL: https://arxiv.org/pdf/2501.10868
Copy Paste: [[2501.10868]] Generating Structured Outputs from Language Models: Benchmark and Studies(https://arxiv.org/abs/2501.10868)
Keywords: language model
Abstract: Reliably generating structured outputs has become a critical capability for modern language model (LM) applications. Constrained decoding has emerged as the dominant technology across sectors for enforcing structured outputs during generation. Despite its growing adoption, little has been done with the systematic evaluation of the behaviors and performance of constrained decoding. Constrained decoding frameworks have standardized around JSON Schema as a structured data format, with most uses guaranteeing constraint compliance given a schema. However, there is poor understanding of the effectiveness of the methods in practice. We present an evaluation framework to assess constrained decoding approaches across three critical dimensions: efficiency in generating constraint-compliant outputs, coverage of diverse constraint types, and quality of the generated outputs. To facilitate this evaluation, we introduce JSONSchemaBench, a benchmark for constrained decoding comprising 10K real-world JSON schemas that encompass a wide range of constraints with varying complexity. We pair the benchmark with the existing official JSON Schema Test Suite and evaluate six state-of-the-art constrained decoding frameworks, including Guidance, Outlines, Llamacpp, XGrammar, OpenAI, and Gemini. Through extensive experiments, we gain insights into the capabilities and limitations of constrained decoding on structured generation with real-world JSON schemas. Our work provides actionable insights for improving constrained decoding frameworks and structured generation tasks, setting a new standard for evaluating constrained decoding and structured generation. We release JSONSchemaBench at this https URL

Title: LegalGuardian: A Privacy-Preserving Framework for Secure Integration of Large Language Models in Legal Practice

Authors: M. Mikail Demir, Hakan T. Otal, M. Abdullah Canbaz
Subjects: cs.CL, cs.CR, cs.IR
Abstract URL: https://arxiv.org/abs/2501.10915
Pdf URL: https://arxiv.org/pdf/2501.10915
Copy Paste: [[2501.10915]] LegalGuardian: A Privacy-Preserving Framework for Secure Integration of Large Language Models in Legal Practice(https://arxiv.org/abs/2501.10915)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) hold promise for advancing legal practice by automating complex tasks and improving access to justice. However, their adoption is limited by concerns over client confidentiality, especially when lawyers include sensitive Personally Identifiable Information (PII) in prompts, risking unauthorized data exposure. To mitigate this, we introduce LegalGuardian, a lightweight, privacy-preserving framework tailored for lawyers using LLM-based tools. LegalGuardian employs Named Entity Recognition (NER) techniques and local LLMs to mask and unmask confidential PII within prompts, safeguarding sensitive data before any external interaction. We detail its development and assess its effectiveness using a synthetic prompt library in immigration law scenarios. Comparing traditional NER models with a one-shot prompted local LLM, we find that LegalGuardian achieves an F1-score of 93% with GLiNER and 97% with Qwen2.5-14B in PII detection. Semantic similarity analysis confirms that the framework maintains high fidelity in outputs, ensuring robust utility of LLM-based tools. Our findings indicate that legal professionals can harness advanced AI technologies without compromising client confidentiality or the quality of legal documents.
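The mask-then-unmask flow described in this abstract can be sketched as a reversible placeholder substitution. This is an illustration only, not the LegalGuardian implementation: a simple email regular expression stands in for the NER model or local LLM, and the function names and the `[EMAIL_n]` placeholder format are assumptions.

```python
import re

# Stand-in detector (assumed): a real system would use an NER model or a local LLM.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def mask_pii(prompt):
    """Replace each detected email with a placeholder and return the mapping."""
    mapping = {}  # placeholder -> original value

    def replace(match):
        value = match.group(0)
        for placeholder, original in mapping.items():
            if original == value:
                return placeholder  # reuse the placeholder for repeated values
        placeholder = f"[EMAIL_{len(mapping) + 1}]"
        mapping[placeholder] = value
        return placeholder

    return EMAIL_PATTERN.sub(replace, prompt), mapping


def unmask_pii(text, mapping):
    """Restore original values in text that came back from the external tool."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text


prompt = "Email ana@example.com or bo@example.org, then ana@example.com again"
masked, mapping = mask_pii(prompt)
restored = unmask_pii(masked, mapping)
```

Only `masked` would be sent to an external LLM; the mapping stays local and is applied to the response afterwards, so the sensitive values never leave the machine.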

Title: Leveraging Chain of Thought towards Empathetic Spoken Dialogue without Corresponding Question-Answering Data

Authors: Jingran Xie, Shun Lei, Yue Yu, Yang Xiang, Hui Wang, Xixin Wu, Zhiyong Wu
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2501.10937
Pdf URL: https://arxiv.org/pdf/2501.10937
Copy Paste: [[2501.10937]] Leveraging Chain of Thought towards Empathetic Spoken Dialogue without Corresponding Question-Answering Data(https://arxiv.org/abs/2501.10937)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Empathetic dialogue is crucial for natural human-computer interaction, allowing the dialogue system to respond in a more personalized and emotionally aware manner, improving user satisfaction and engagement. Large language models (LLMs) have revolutionized dialogue generation through their powerful capabilities and have shown potential in multimodal domains. Many studies have integrated speech with text-based LLMs to take spoken questions as input and output text responses. However, the lack of spoken question-answering datasets that include speech style information for supervised fine-tuning (SFT) limits the performance of these systems. As a result, while these systems excel at understanding speech content, they often struggle to generate empathetic responses. In response, we propose a novel approach that circumvents the need for question-answering data, called Listen, Perceive, and Express (LPE). Our method employs a two-stage training process, initially guiding the LLM to listen to the content and perceive the emotional aspects of speech. Subsequently, we utilize Chain-of-Thought (CoT) prompting to unlock the model's potential for expressing empathetic responses based on the spoken content it has listened to and the emotional cues it has perceived. We conduct experiments to demonstrate the effectiveness of the proposed method. To our knowledge, this is the first attempt to leverage CoT for speech-based dialogue.

Title: InsQABench: Benchmarking Chinese Insurance Domain Question Answering with Large Language Models

Authors: Jing Ding, Kai Feng, Binbin Lin, Jiarui Cai, Qiushi Wang, Yu Xie, Xiaojin Zhang, Zhongyu Wei, Wei Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.10943
Pdf URL: https://arxiv.org/pdf/2501.10943
Copy Paste: [[2501.10943]] InsQABench: Benchmarking Chinese Insurance Domain Question Answering with Large Language Models(https://arxiv.org/abs/2501.10943)
Keywords: language model, llm
Abstract: The application of large language models (LLMs) has achieved remarkable success in various fields, but their effectiveness in specialized domains like the Chinese insurance industry remains underexplored. The complexity of insurance knowledge, encompassing specialized terminology and diverse data types, poses significant challenges for both models and users. To address this, we introduce InsQABench, a benchmark dataset for the Chinese insurance sector, structured into three categories: Insurance Commonsense Knowledge, Insurance Structured Database, and Insurance Unstructured Documents, reflecting real-world insurance question answering. We also propose two methods, SQL-ReAct and RAG-ReAct, to tackle challenges in structured and unstructured data tasks. Evaluations show that while LLMs struggle with domain-specific terminology and nuanced clause texts, fine-tuning on InsQABench significantly improves performance. Our benchmark establishes a solid foundation for advancing LLM applications in the insurance domain, with data and code available at this https URL.

Title: The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs

Authors: Nitay Calderon, Roi Reichart, Rotem Dror
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2501.10970
Pdf URL: https://arxiv.org/pdf/2501.10970
Copy Paste: [[2501.10970]] The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs(https://arxiv.org/abs/2501.10970)
Keywords: language model, gpt, llm, prompt
Abstract: The "LLM-as-a-judge" paradigm employs Large Language Models (LLMs) as annotators and evaluators in tasks traditionally performed by humans. LLM annotations are widely used, not only in NLP research but also in fields like medicine, psychology, and social science. Despite their role in shaping study results and insights, there is no standard or rigorous procedure to determine whether LLMs can replace human annotators. In this paper, we propose a novel statistical procedure -- the Alternative Annotator Test (alt-test) -- that requires only a modest subset of annotated examples to justify using LLM annotations. Additionally, we introduce a versatile and interpretable measure for comparing LLM judges. To demonstrate our procedure, we curated a diverse collection of ten datasets, consisting of language and vision-language tasks, and conducted experiments with six LLMs and four prompting techniques. Our results show that LLMs can sometimes replace humans, with closed-source LLMs (such as GPT-4o) outperforming open-source LLMs, and that prompting techniques yield judges of varying quality. We hope this study encourages more rigorous and reliable practices.

Title: From Arabic Text to Puzzles: LLM-Driven Development of Arabic Educational Crosswords

Authors: Kamyar Zeinalipour, Mohamed Zaky Saad, Marco Maggini, Marco Gori
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.11035
Pdf URL: https://arxiv.org/pdf/2501.11035
Copy Paste: [[2501.11035]] From Arabic Text to Puzzles: LLM-Driven Development of Arabic Educational Crosswords(https://arxiv.org/abs/2501.11035)
Keywords: language model, gpt, llm
Abstract: We present an Arabic crossword puzzle generator that builds puzzles from a given text using advanced language models such as GPT-4-Turbo, GPT-3.5-Turbo and Llama3-8B-Instruct. Developed specifically for educational purposes, the generator leverages a meticulously compiled dataset named Arabic-Clue-Instruct, with over 50,000 entries encompassing text, answers, clues, and categories. This dataset is intricately designed to aid in the generation of pertinent clues linked to specific texts and keywords within defined categories. This project addresses the scarcity of advanced educational tools tailored for the Arabic language, promoting enhanced language learning and cognitive development. By providing a culturally and linguistically relevant tool, our objective is to make learning more engaging and effective through gamification and interactivity. Integrating state-of-the-art artificial intelligence with contemporary learning methodologies, this tool can generate crossword puzzles from any given educational text, thereby facilitating an interactive and enjoyable learning experience. This tool not only advances educational paradigms but also sets a new standard in interactive and cognitive learning technologies. The model and dataset are publicly available.
摘要：我们提供了一个基于给定文本的阿拉伯语填字游戏生成器，该生成器利用了专为教育目的开发的高级语言模型，例如 GPT-4-Turbo、GPT-3.5-Turbo 和 Llama3-8B-Instruct，这个创新的生成器利用了精心编译的数据集，名为 Arabic-Clue-Instruct，其中包含超过 50,000 个条目，包括文本、答案、线索和类别。该数据集经过精心设计，有助于生成与特定文本和定义类别中的关键字相关的相关线索。该项目解决了针对阿拉伯语的高级教育工具的稀缺问题，促进了语言学习和认知发展的提高。通过提供与文化和语言相关的工具，我们的目标是通过游戏化和互动性使学习更具吸引力和效率。该工具将最先进的人工智能与当代学习方法相结合，可以根据任何给定的教育文本生成填字游戏，从而促进互动和愉快的学习体验。该工具不仅推动了教育模式的发展，还为交互式和认知学习技术树立了新标准。该模型和数据集均已公开。

Title: LF-Steering: Latent Feature Activation Steering for Enhancing Semantic Consistency in Large Language Models

Authors: Jingyuan Yang, Rongjun Li, Weixuan Wang, Ziyu Zhou, Zhiyong Feng, Wei Peng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.11036
Pdf URL: https://arxiv.org/pdf/2501.11036
Copy Paste: [[2501.11036]] LF-Steering: Latent Feature Activation Steering for Enhancing Semantic Consistency in Large Language Models(https://arxiv.org/abs/2501.11036)
Keywords: language model, llm, prompt
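Abstract: Large Language Models (LLMs) often generate inconsistent responses when prompted with semantically equivalent paraphrased inputs. Recently, activation steering, a technique that modulates LLM behavior by adjusting their latent representations during inference time, has been explored to improve the semantic consistency of LLMs. However, these methods typically operate at the model component level, such as layer hidden states or attention heads. They face a challenge due to the "polysemanticity issue", where the model components of LLMs typically encode multiple entangled features, making precise steering difficult. To address this challenge, we drill down to feature-level representations and propose LF-Steering, a novel activation steering approach to precisely identify latent feature representations responsible for semantic inconsistency. More specifically, our method maps the hidden states of the relevant transformer layer into a sparsely activated, high-dimensional feature space based on a sparse autoencoder (SAE), ensuring model steering based on decoupled feature representations with minimal interference. Comprehensive experiments on both NLU and NLG datasets demonstrate the effectiveness of our method in enhancing semantic consistency, resulting in significant performance gains for various NLU and NLG tasks.
Summary: Large language models (LLMs) often produce inconsistent responses when given semantically equivalent paraphrased inputs. Activation steering, a technique that modulates LLM behavior by adjusting latent representations at inference time, has recently been explored as a way to improve the semantic consistency of LLMs. However, these methods usually operate at the level of model components, such as layer hidden states or attention heads. They are hampered by the "polysemanticity issue": model components of LLMs typically encode multiple entangled features, which makes precise steering difficult. To address this challenge, we move down to feature-level representations and propose LF-Steering, a novel activation steering approach that precisely identifies the latent feature representations responsible for semantic inconsistency. More specifically, our method uses a sparse autoencoder (SAE) to map the hidden states of the relevant transformer layer into a sparsely activated, high-dimensional feature space, so that steering acts on decoupled feature representations with minimal interference. Comprehensive experiments on NLU and NLG datasets demonstrate the effectiveness of our method in enhancing semantic consistency, which significantly improves performance on a range of NLU and NLG tasks.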

Title: Enhancing Semantic Consistency of Large Language Models through Model Editing: An Interpretability-Oriented Approach

Authors: Jingyuan Yang, Dapeng Chen, Yajing Sun, Rongjun Li, Zhiyong Feng, Wei Peng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.11041
Pdf URL: https://arxiv.org/pdf/2501.11041
Copy Paste: [[2501.11041]] Enhancing Semantic Consistency of Large Language Models through Model Editing: An Interpretability-Oriented Approach(https://arxiv.org/abs/2501.11041)
Keywords: language model, llm, prompt
Abstract: A Large Language Model (LLM) tends to generate inconsistent and sometimes contradictory outputs when presented with a prompt that has equivalent semantics but is expressed differently from the original prompt. To achieve semantic consistency of an LLM, one of the key approaches is to finetune the model with prompt-output pairs with semantically equivalent meanings. Despite its effectiveness, a data-driven finetuning method incurs substantial computation costs in data preparation and model optimization. In this regime, an LLM is treated as a "black box", restricting our ability to gain deeper insights into its internal mechanism. In this paper, we are motivated to enhance the semantic consistency of LLMs through a more interpretable method (i.e., model editing). We first identify the model components (i.e., attention heads) that have a key impact on the semantic consistency of an LLM. We subsequently inject biases into the output of these model components along the semantic-consistency activation direction. It is noteworthy that these modifications are cost-effective, without reliance on mass manipulations of the original model parameters. Through comprehensive experiments on the constructed NLU and open-source NLG datasets, our method demonstrates significant improvements in the semantic consistency and task performance of LLMs. Additionally, our method exhibits promising generalization capabilities by performing well on tasks beyond the primary tasks.
Summary: When presented with a prompt that has equivalent semantics but is worded differently from the original prompt, a large language model (LLM) tends to produce inconsistent and sometimes contradictory outputs. One of the key approaches to achieving semantic consistency in an LLM is to finetune the model on prompt-output pairs with semantically equivalent meanings. Although effective, data-driven finetuning incurs substantial computational cost in data preparation and model optimization. Under this approach the LLM is treated as a "black box", which limits our ability to understand its internal mechanism. In this paper, we aim to enhance the semantic consistency of LLMs through a more interpretable method, namely model editing. We first identify the model components (attention heads) that have a key impact on the semantic consistency of an LLM. We then inject biases into the output of these components along the semantic-consistency activation direction. Notably, these modifications are cost-effective and do not rely on large-scale manipulation of the original model parameters. In comprehensive experiments on constructed NLU datasets and open-source NLG datasets, our method significantly improves the semantic consistency and task performance of LLMs. It also generalizes well, performing strongly on tasks beyond the primary ones.
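
As a rough illustration of the bias-injection step described in the abstract above, the sketch below shifts a component's output vector by a scaled unit vector. The function name `inject_bias`, the strength parameter `alpha`, and the toy vectors are assumptions made for illustration; the abstract does not say how the direction is computed or how the bias is scaled.

```python
import math

def inject_bias(head_output, direction, alpha):
    """Shift an activation vector by `alpha` along the normalized `direction`.

    Hypothetical helper: `head_output` stands in for one attention head's
    output, `direction` for a precomputed semantic-consistency direction.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    # Original model weights are untouched; only the activation is shifted.
    return [h + alpha * d / norm for h, d in zip(head_output, direction)]

# Toy 2-dimensional example (real activations have many more dimensions).
steered = inject_bias([1.0, 2.0], [3.0, 4.0], alpha=5.0)
print(steered)
```

The point of the sketch is only that the edit is an additive shift applied at inference time, which is why no retraining of the original parameters is involved.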

Title: IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems

Authors: Elad Levi, Ilan Kadar
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.11067
Pdf URL: https://arxiv.org/pdf/2501.11067
Copy Paste: [[2501.11067]] IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems(https://arxiv.org/abs/2501.11067)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) are transforming artificial intelligence, evolving into task-oriented systems capable of autonomous planning and execution. One of the primary applications of LLMs is conversational AI systems, which must navigate multi-turn dialogues, integrate domain-specific APIs, and adhere to strict policy constraints. However, evaluating these agents remains a significant challenge, as traditional methods fail to capture the complexity and variability of real-world interactions. We introduce IntellAgent, a scalable, open-source multi-agent framework designed to evaluate conversational AI systems comprehensively. IntellAgent automates the creation of diverse, synthetic benchmarks by combining policy-driven graph modeling, realistic event generation, and interactive user-agent simulations. This innovative approach provides fine-grained diagnostics, addressing the limitations of static and manually curated benchmarks with coarse-grained metrics. IntellAgent represents a paradigm shift in evaluating conversational AI. By simulating realistic, multi-policy scenarios across varying levels of complexity, IntellAgent captures the nuanced interplay of agent capabilities and policy constraints. Unlike traditional methods, it employs a graph-based policy model to represent relationships, likelihoods, and complexities of policy interactions, enabling highly detailed diagnostics. IntellAgent also identifies critical performance gaps, offering actionable insights for targeted optimization. Its modular, open-source design supports seamless integration of new domains, policies, and APIs, fostering reproducibility and community collaboration. Our findings demonstrate that IntellAgent serves as an effective framework for advancing conversational AI by addressing challenges in bridging research and deployment. The framework is available at this https URL
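Summary: Large language models (LLMs) are transforming artificial intelligence, evolving into task-oriented systems capable of autonomous planning and execution. One of the primary applications of LLMs is conversational AI systems, which must handle multi-turn dialogues, integrate domain-specific APIs, and adhere to strict policy constraints. Evaluating these agents remains a significant challenge, however, because traditional methods fail to capture the complexity and variability of real-world interactions. We introduce IntellAgent, a scalable, open-source multi-agent framework designed to evaluate conversational AI systems comprehensively. IntellAgent automates the creation of diverse synthetic benchmarks by combining policy-driven graph modeling, realistic event generation, and interactive user-agent simulations. This approach provides fine-grained diagnostics and addresses the limitations of static, manually curated benchmarks with coarse-grained metrics. IntellAgent represents a paradigm shift in evaluating conversational AI. By simulating realistic multi-policy scenarios at varying levels of complexity, it captures the nuanced interplay between agent capabilities and policy constraints. Unlike traditional methods, it uses a graph-based policy model to represent the relationships, likelihoods, and complexities of policy interactions, enabling highly detailed diagnostics. IntellAgent also identifies critical performance gaps and offers actionable insights for targeted optimization. Its modular, open-source design supports seamless integration of new domains, policies, and APIs, fostering reproducibility and community collaboration. Our findings show that IntellAgent is an effective framework for advancing conversational AI by addressing the challenges of bridging research and deployment. The framework is available at this https URL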

Title: Chain-of-Reasoning: Towards Unified Mathematical Reasoning in Large Language Models via a Multi-Paradigm Perspective

Authors: Yiyao Yu, Yuxiang Zhang, Dongdong Zhang, Xiao Liang, Hengyuan Zhang, Xingxing Zhang, Ziyi Yang, Mahmoud Khademi, Hany Awadalla, Junjie Wang, Yujiu Yang, Furu Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.11110
Pdf URL: https://arxiv.org/pdf/2501.11110
Copy Paste: [[2501.11110]] Chain-of-Reasoning: Towards Unified Mathematical Reasoning in Large Language Models via a Multi-Paradigm Perspective(https://arxiv.org/abs/2501.11110)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) have made notable progress in mathematical reasoning, yet they often rely on single-paradigm reasoning that limits their effectiveness across diverse tasks. In this paper, we introduce Chain-of-Reasoning (CoR), a novel unified framework that integrates multiple reasoning paradigms--Natural Language Reasoning (NLR), Algorithmic Reasoning (AR), and Symbolic Reasoning (SR)--to enable synergistic collaboration. CoR generates multiple potential answers using different reasoning paradigms and synthesizes them into a coherent final solution. We propose a Progressive Paradigm Training (PPT) strategy that allows models to progressively master these paradigms, culminating in the development of CoR-Math-7B. Experimental results demonstrate that CoR-Math-7B significantly outperforms current SOTA models, achieving up to a 41.0% absolute improvement over GPT-4 in theorem proving tasks and a 7.9% improvement over RL-based methods in arithmetic tasks. These results showcase the enhanced mathematical comprehensive ability of our model, achieving significant performance gains on specific tasks and enabling zero-shot generalization across tasks.
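Summary: Large language models (LLMs) have made notable progress in mathematical reasoning, but they often rely on single-paradigm reasoning, which limits their effectiveness across diverse tasks. In this paper, we introduce Chain-of-Reasoning (CoR), a novel unified framework that integrates multiple reasoning paradigms (Natural Language Reasoning (NLR), Algorithmic Reasoning (AR), and Symbolic Reasoning (SR)) to enable synergistic collaboration. CoR generates multiple candidate answers using different reasoning paradigms and synthesizes them into a coherent final solution. We propose a Progressive Paradigm Training (PPT) strategy that lets models master these paradigms step by step, culminating in CoR-Math-7B. Experimental results show that CoR-Math-7B significantly outperforms current SOTA models, achieving up to a 41.0% absolute improvement over GPT-4 on theorem proving tasks and a 7.9% improvement over RL-based methods on arithmetic tasks. These results demonstrate the enhanced comprehensive mathematical ability of our model, which achieves significant gains on specific tasks and enables zero-shot generalization across tasks.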

Title: Clinical trial cohort selection using Large Language Models on n2c2 Challenges

Authors: Chi-en Amy Tai, Xavier Tannier
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.11114
Pdf URL: https://arxiv.org/pdf/2501.11114
Copy Paste: [[2501.11114]] Clinical trial cohort selection using Large Language Models on n2c2 Challenges(https://arxiv.org/abs/2501.11114)
Keywords: language model, llm
Abstract: Clinical trials are a critical process in the medical field for introducing new treatments and innovations. However, cohort selection for clinical trials is a time-consuming process that often requires manual review of patient text records for specific keywords. Though there have been studies on standardizing the information across the various platforms, Natural Language Processing (NLP) tools remain crucial for spotting eligibility criteria in textual reports. Recently, pre-trained large language models (LLMs) have gained popularity for various NLP tasks due to their ability to acquire a nuanced understanding of text. In this paper, we study the performance of large language models on clinical trial cohort selection and leverage the n2c2 challenges to benchmark their performance. Our results are promising with regard to the incorporation of LLMs for simple cohort selection tasks, but also highlight the difficulties encountered by these models as soon as fine-grained knowledge and reasoning are required.
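Summary: Clinical trials are a critical process in the medical field for introducing new treatments and innovations. However, cohort selection for clinical trials is time-consuming and often requires manually reviewing patient text records for specific keywords. Although there have been studies on standardizing information across platforms, Natural Language Processing (NLP) tools remain crucial for spotting eligibility criteria in textual reports. Recently, pre-trained large language models (LLMs) have become popular for various NLP tasks because of their ability to acquire a nuanced understanding of text. In this paper, we study the performance of large language models on clinical trial cohort selection and use the n2c2 challenges to benchmark them. Our results are promising for incorporating LLMs into simple cohort selection tasks, but they also highlight the difficulties these models encounter as soon as fine-grained knowledge and reasoning are required.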

Title: Tell me about yourself: LLMs are aware of their learned behaviors

Authors: Jan Betley, Xuchan Bao, Martín Soto, Anna Sztyber-Betley, James Chua, Owain Evans
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2501.11120
Pdf URL: https://arxiv.org/pdf/2501.11120
Copy Paste: [[2501.11120]] Tell me about yourself: LLMs are aware of their learned behaviors(https://arxiv.org/abs/2501.11120)
Keywords: llm
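Abstract: We study behavioral self-awareness -- an LLM's ability to articulate its behaviors without requiring in-context examples. We finetune LLMs on datasets that exhibit particular behaviors, such as (a) making high-risk economic decisions, and (b) outputting insecure code. Despite the datasets containing no explicit descriptions of the associated behavior, the finetuned LLMs can explicitly describe it. For example, a model trained to output insecure code says, "The code I write is insecure." Indeed, models show behavioral self-awareness for a range of behaviors and for diverse evaluations. Note that while we finetune models to exhibit behaviors like writing insecure code, we do not finetune them to articulate their own behaviors -- models do this without any special training or examples. Behavioral self-awareness is relevant for AI safety, as models could use it to proactively disclose problematic behaviors. In particular, we study backdoor policies, where models exhibit unexpected behaviors only under certain trigger conditions. We find that models can sometimes identify whether or not they have a backdoor, even without its trigger being present. However, models are not able to directly output their trigger by default. Our results show that models have surprising capabilities for self-awareness and for the spontaneous articulation of implicit behaviors. Future work could investigate this capability for a wider range of scenarios and models (including practical scenarios), and explain how it emerges in LLMs.
Summary: We study behavioral self-awareness, an LLM's ability to articulate its behaviors without requiring in-context examples. We finetune LLMs on datasets that exhibit particular behaviors, such as (a) making high-risk economic decisions and (b) outputting insecure code. Even though the datasets contain no explicit description of the associated behavior, the finetuned LLMs can describe it explicitly. For example, a model trained to output insecure code says, "The code I write is insecure." Models show behavioral self-awareness across a range of behaviors and evaluations. Note that while we finetune models to exhibit behaviors such as writing insecure code, we do not finetune them to articulate their own behaviors; models do this without any special training or examples. Behavioral self-awareness is relevant to AI safety, because models could use it to proactively disclose problematic behaviors. In particular, we study backdoor policies, in which models exhibit unexpected behaviors only under certain trigger conditions. We find that models can sometimes identify whether or not they have a backdoor, even when its trigger is not present. By default, however, models cannot directly output their trigger. Our results show that models have surprising capabilities for self-awareness and for spontaneously articulating implicit behaviors. Future work could investigate this capability across a wider range of scenarios and models (including practical scenarios) and explain how it emerges in LLMs.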

Title: A Collection of Question Answering Datasets for Norwegian

Authors: Vladislav Mikhailov, Petter Mæhlum, Victoria Ovedie Chruickshank Langø, Erik Velldal, Lilja Øvrelid
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.11128
Pdf URL: https://arxiv.org/pdf/2501.11128
Copy Paste: [[2501.11128]] A Collection of Question Answering Datasets for Norwegian(https://arxiv.org/abs/2501.11128)
Keywords: language model
Abstract: This paper introduces a new suite of question answering datasets for Norwegian: NorOpenBookQA, NorCommonSenseQA, NorTruthfulQA, and NRK-Quiz-QA. The data covers a wide range of skills and knowledge domains, including world knowledge, commonsense reasoning, truthfulness, and knowledge about Norway. Covering both of the written standards of Norwegian (Bokmål and Nynorsk), our datasets comprise over 10k question-answer pairs, created by native speakers. We detail our dataset creation approach and present the results of evaluating 11 language models (LMs) in zero- and few-shot regimes. Most LMs perform better in Bokmål than Nynorsk, struggle most with commonsense reasoning, and are often untruthful in generating answers to questions. All our datasets and annotation materials are publicly available.
Summary: This paper introduces a new suite of question answering datasets for Norwegian: NorOpenBookQA, NorCommonSenseQA, NorTruthfulQA, and NRK-Quiz-QA. The data covers a wide range of skills and knowledge domains, including world knowledge, commonsense reasoning, truthfulness, and knowledge about Norway. Our datasets cover both written standards of Norwegian (Bokmål and Nynorsk) and contain more than 10,000 question-answer pairs created by native speakers. We describe our dataset creation approach in detail and present the results of evaluating 11 language models (LMs) in zero-shot and few-shot settings. Most LMs perform better in Bokmål than in Nynorsk, struggle most with commonsense reasoning, and are often untruthful when generating answers to questions. All of our datasets and annotation materials are publicly available.

Title: Embedding-Driven Diversity Sampling to Improve Few-Shot Synthetic Data Generation

Authors: Ivan Lopez, Fateme Nateghi Haredasht, Kaitlin Caoili, Jonathan H Chen, Akshay Chaudhari
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.11199
Pdf URL: https://arxiv.org/pdf/2501.11199
Copy Paste: [[2501.11199]] Embedding-Driven Diversity Sampling to Improve Few-Shot Synthetic Data Generation(https://arxiv.org/abs/2501.11199)
Keywords: language model, prompt
Abstract: Accurate classification of clinical text often requires fine-tuning pre-trained language models, a process that is costly and time-consuming due to the need for high-quality data and expert annotators. Synthetic data generation offers an alternative, though pre-trained models may not capture the syntactic diversity of clinical notes. We propose an embedding-driven approach that uses diversity sampling from a small set of real clinical notes to guide large language models in few-shot prompting, generating synthetic text that better reflects clinical syntax. We evaluated this method using the CheXpert dataset on a classification task, comparing it to random few-shot and zero-shot approaches. Using cosine similarity and a Turing test, our approach produced synthetic notes that more closely align with real clinical text. Our pipeline reduced the data needed to reach the 0.85 AUC cutoff by 40% for AUROC and 30% for AUPRC, while augmenting models with synthetic data improved AUROC by 57% and AUPRC by 68%. Additionally, our synthetic data was 0.9 times as effective as real data, a 60% improvement in value.
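Summary: Accurately classifying clinical text often requires fine-tuning pre-trained language models, a process that is costly and time-consuming because it needs high-quality data and expert annotators. Synthetic data generation offers an alternative, although pre-trained models may not capture the syntactic diversity of clinical notes. We propose an embedding-driven approach that uses diversity sampling from a small set of real clinical notes to guide large language models in few-shot prompting, producing synthetic text that better reflects clinical syntax. We evaluated this method on a classification task using the CheXpert dataset and compared it with random few-shot and zero-shot approaches. Measured by cosine similarity and a Turing test, our approach produced synthetic notes that align more closely with real clinical text. Our pipeline reduced the data needed to reach the 0.85 AUC cutoff by 40% for AUROC and 30% for AUPRC, while augmenting models with synthetic data improved AUROC by 57% and AUPRC by 68%. In addition, our synthetic data was 0.9 times as effective as real data, a 60% improvement in value.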
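
To make the idea of diversity sampling over embeddings more concrete, here is a minimal sketch using greedy farthest-point selection under cosine distance. This is one common way to pick a diverse subset and is an assumption on our part: the abstract does not name the selection algorithm, and `diversity_sample`, the starting index, and the toy 2-D vectors are hypothetical.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def diversity_sample(embeddings, k):
    """Greedily pick k indices so each new pick is far from those already chosen."""
    if k <= 0 or not embeddings:
        return []
    selected = [0]  # arbitrary starting note
    while len(selected) < min(k, len(embeddings)):
        remaining = [i for i in range(len(embeddings)) if i not in selected]
        # Choose the candidate whose nearest selected neighbour is farthest away.
        best = max(
            remaining,
            key=lambda i: min(
                cosine_distance(embeddings[i], embeddings[j]) for j in selected
            ),
        )
        selected.append(best)
    return selected

# Toy embeddings standing in for encoded clinical notes.
notes = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(diversity_sample(notes, 3))
```

In a few-shot setting, the selected indices would identify which real notes to place in the prompt as exemplars, so that near-duplicate notes (such as the first two toy vectors) are not both chosen.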

Title: Irony in Emojis: A Comparative Study of Human and LLM Interpretation

Authors: Yawen Zheng, Hanjia Lyu, Jiebo Luo
Subjects: cs.CL, cs.CV, cs.SI
Abstract URL: https://arxiv.org/abs/2501.11241
Pdf URL: https://arxiv.org/pdf/2501.11241
Copy Paste: [[2501.11241]] Irony in Emojis: A Comparative Study of Human and LLM Interpretation(https://arxiv.org/abs/2501.11241)
Keywords: language model, gpt, llm, prompt
Abstract: Emojis have become a universal language in online communication, often carrying nuanced and context-dependent meanings. Among these, irony poses a significant challenge for Large Language Models (LLMs) due to its inherent incongruity between appearance and intent. This study examines the ability of GPT-4o to interpret irony in emojis. By prompting GPT-4o to evaluate the likelihood of specific emojis being used to express irony on social media and comparing its interpretations with human perceptions, we aim to bridge the gap between machine and human understanding. Our findings reveal nuanced insights into GPT-4o's interpretive capabilities, highlighting areas of alignment with and divergence from human behavior. Additionally, this research underscores the importance of demographic factors, such as age and gender, in shaping emoji interpretation and evaluates how these factors influence GPT-4o's performance.
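Summary: Emojis have become a universal language in online communication, often carrying nuanced and context-dependent meanings. Among these, irony poses a significant challenge for large language models (LLMs) because of the inherent incongruity between appearance and intent. This study examines GPT-4o's ability to interpret irony in emojis. By prompting GPT-4o to rate how likely specific emojis are to be used to express irony on social media and comparing its interpretations with human perceptions, we aim to bridge the gap between machine and human understanding. Our findings offer nuanced insights into GPT-4o's interpretive capabilities, highlighting where it aligns with and where it diverges from human behavior. The research also underscores the importance of demographic factors, such as age and gender, in shaping emoji interpretation and evaluates how these factors influence GPT-4o's performance.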

Title: Can xLLMs Understand the Structure of Dialog? Exploring Multilingual Response Generation in Complex Scenarios

Authors: Zhongtian Hu, Yiwen Cui, Ronghan Li, Meng Zhao, Lifang Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.11269
Pdf URL: https://arxiv.org/pdf/2501.11269
Copy Paste: [[2501.11269]] Can xLLMs Understand the Structure of Dialog? Exploring Multilingual Response Generation in Complex Scenarios(https://arxiv.org/abs/2501.11269)
Keywords: language model, llm
Abstract: Multilingual research has garnered increasing attention, especially in the domain of dialogue systems. The rapid advancements in large language models (LLMs) have fueled the demand for high-performing multilingual models. However, two major challenges persist: the scarcity of high-quality multilingual datasets and the limited complexity of existing datasets in capturing realistic dialogue scenarios. To address these gaps, we introduce XMP, a high-quality parallel Multilingual dataset sourced from Multi-party Podcast dialogues. Each sample in the dataset features at least three participants discussing a wide range of topics, including society, culture, politics, and this http URL extensive experiments, we uncover significant limitations in previously recognized multilingual capabilities of LLMs when applied to such complex dialogue scenarios. For instance, the widely accepted multilingual complementary ability of LLMs is notably impacted. By conducting further experiments, we explore the mechanisms of LLMs in multilingual environments from multiple perspectives, shedding new light on their performance in real-world, diverse conversational contexts.
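Summary: Multilingual research has attracted increasing attention, especially in the field of dialogue systems. The rapid development of large language models (LLMs) has fueled demand for high-performing multilingual models. Two major challenges remain, however: the scarcity of high-quality multilingual datasets and the limited complexity of existing datasets in capturing realistic dialogue scenarios. To address these gaps, we introduce XMP, a high-quality parallel multilingual dataset sourced from multi-party podcast dialogues. Each sample in the dataset features at least three participants discussing a wide range of topics, including society, culture, politics, and this http URL extensive experiments, we uncover significant limitations in the previously recognized multilingual capabilities of LLMs when they are applied to such complex dialogue scenarios. For instance, the widely accepted multilingual complementary ability of LLMs is notably affected. Through further experiments, we explore the mechanisms of LLMs in multilingual environments from multiple perspectives, shedding new light on their performance in diverse, real-world conversational contexts.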

Title: Multi-round, Chain-of-thought Post-editing for Unfaithful Summaries

Authors: Yi-Hui Lee, Xiangci Li, Jessica Ouyang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.11273
Pdf URL: https://arxiv.org/pdf/2501.11273
Copy Paste: [[2501.11273]] Multi-round, Chain-of-thought Post-editing for Unfaithful Summaries(https://arxiv.org/abs/2501.11273)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Recent large language models (LLMs) have demonstrated a remarkable ability to perform natural language understanding and generation tasks. In this work, we investigate the use of LLMs for evaluating faithfulness in news summarization, finding that it achieves a strong correlation with human judgments. We further investigate LLMs' capabilities as a faithfulness post-editor, experimenting with different chain-of-thought prompts for locating and correcting factual inconsistencies between a generated summary and the source news document and are able to achieve a higher editing success rate than was reported in prior work. We perform both automated and human evaluations of the post-edited summaries, finding that prompting LLMs using chain-of-thought reasoning about factual error types is an effective faithfulness post-editing strategy, performing comparably to fine-tuned post-editing models. We also demonstrate that multiple rounds of post-editing, which has not previously been explored, can be used to gradually improve the faithfulness of summaries whose errors cannot be fully corrected in a single round.
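Summary: Recent large language models (LLMs) have demonstrated a remarkable ability to perform natural language understanding and generation tasks. In this work, we investigate using LLMs to evaluate faithfulness in news summarization and find that their judgments correlate strongly with human judgments. We further investigate LLMs' capabilities as faithfulness post-editors, experimenting with different chain-of-thought prompts for locating and correcting factual inconsistencies between a generated summary and the source news document, and we achieve a higher editing success rate than reported in prior work. We perform both automated and human evaluations of the post-edited summaries and find that prompting LLMs with chain-of-thought reasoning about factual error types is an effective faithfulness post-editing strategy that performs comparably to fine-tuned post-editing models. We also show that multiple rounds of post-editing, which have not been explored before, can gradually improve the faithfulness of summaries whose errors cannot be fully corrected in a single round.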

Title: Question-to-Question Retrieval for Hallucination-Free Knowledge Access: An Approach for Wikipedia and Wikidata Question Answering

Authors: Santhosh Thottingal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.11301
Pdf URL: https://arxiv.org/pdf/2501.11301
Copy Paste: [[2501.11301]] Question-to-Question Retrieval for Hallucination-Free Knowledge Access: An Approach for Wikipedia and Wikidata Question Answering(https://arxiv.org/abs/2501.11301)
Keywords: llm, hallucination
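Abstract: This paper introduces an approach to question answering over knowledge bases like Wikipedia and Wikidata by performing "question-to-question" matching and retrieval from a dense vector embedding store. Instead of embedding document content, we generate a comprehensive set of questions for each logical content unit using an instruction-tuned LLM. These questions are vector-embedded and stored, mapping to the corresponding content. Vector embeddings of user queries are then matched against this question vector store. The highest similarity score leads to direct retrieval of the associated article content, eliminating the need for answer generation. Our method achieves high cosine similarity (> 0.9) for relevant question pairs, enabling highly precise retrieval. This approach offers several advantages including computational efficiency, rapid response times, and increased scalability. We demonstrate its effectiveness on Wikipedia and Wikidata, including multimedia content through structured fact retrieval from Wikidata, opening up new pathways for multimodal question answering.
Summary: This paper introduces an approach to question answering over knowledge bases such as Wikipedia and Wikidata that performs "question-to-question" matching and retrieval from a dense vector embedding store. Rather than embedding document content, we use an instruction-tuned LLM to generate a comprehensive set of questions for each logical content unit. These questions are embedded as vectors and stored, each mapped to its corresponding content. The vector embedding of a user query is then matched against this question vector store. The highest similarity score leads directly to retrieval of the associated article content, with no need for answer generation. Our method achieves high cosine similarity (> 0.9) for relevant question pairs, enabling highly precise retrieval. The approach offers several advantages, including computational efficiency, fast response times, and strong scalability. We demonstrate its effectiveness on Wikipedia and Wikidata, including multimedia content obtained through structured fact retrieval from Wikidata, opening up new pathways for multimodal question answering.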
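
The question-to-question flow described above can be sketched in a few lines. Everything concrete here is an assumption for illustration: `embed` is a toy bag-of-words stand-in for a dense embedding model, `QUESTION_STORE` is a hand-written substitute for LLM-generated questions, and the 0.5 threshold is arbitrary.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for a dense embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

# Hypothetical store: each pre-generated question maps to its content unit.
QUESTION_STORE = [
    ("What is the capital of France?", "Paris is the capital of France."),
    ("When was the Eiffel Tower built?", "The Eiffel Tower was built from 1887 to 1889."),
    ("Who wrote Hamlet?", "Hamlet is a tragedy written by William Shakespeare."),
]

def retrieve(query, store=QUESTION_STORE, threshold=0.5):
    """Match the query against stored questions; return (content, score)."""
    q_vec = embed(query)
    score, content = max((cosine(q_vec, embed(q)), c) for q, c in store)
    # No answer is generated: the stored content is returned verbatim.
    return (content, score) if score >= threshold else (None, score)

print(retrieve("what is the capital of france"))
print(retrieve("How do volcanoes form?"))
```

Because the stored content is returned as-is, a query with no close question match yields `None` rather than a generated (and possibly hallucinated) answer.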

Title: Few-shot Policy (de)composition in Conversational Question Answering

Authors: Kyle Erwin, Guy Axelrod, Maria Chang, Achille Fokoue, Maxwell Crouse, Soham Dan, Tian Gao, Rosario Uceda-Sosa, Ndivhuwo Makondo, Naweed Khan, Alexander Gray
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.11335
Pdf URL: https://arxiv.org/pdf/2501.11335
Copy Paste: [[2501.11335]] Few-shot Policy (de)composition in Conversational Question Answering(https://arxiv.org/abs/2501.11335)
Keywords: language model, llm, prompt
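Abstract: The task of policy compliance detection (PCD) is to determine if a scenario is in compliance with respect to a set of written policies. In a conversational setting, the results of PCD can indicate if clarifying questions must be asked to determine compliance status. Existing approaches usually claim to have reasoning capabilities that are latent or require a large amount of annotated data. In this work, we propose logical decomposition for policy compliance (LDPC): a neuro-symbolic framework to detect policy compliance using large language models (LLMs) in a few-shot setting. By selecting only a few exemplars alongside recently developed prompting techniques, we demonstrate that our approach soundly reasons about policy compliance conversations by extracting sub-questions to be answered, assigning truth values from contextual information, and explicitly producing a set of logic statements from the given policies. The formulation of explicit logic graphs can in turn help answer PCD-related questions with increased transparency and explainability. We apply this approach to the popular PCD and conversational machine reading benchmark, ShARC, and show competitive performance with no task-specific finetuning. We also leverage the inherently interpretable architecture of LDPC to understand where errors occur, revealing ambiguities in the ShARC dataset and highlighting the challenges involved with reasoning for conversational question answering.
Summary: The task of policy compliance detection (PCD) is to determine whether a scenario complies with a set of written policies. In a conversational setting, the results of PCD can indicate whether clarifying questions must be asked to determine compliance status. Existing approaches usually claim reasoning capabilities that are latent, or they require a large amount of annotated data. In this work, we propose logical decomposition for policy compliance (LDPC), a neuro-symbolic framework for detecting policy compliance with large language models (LLMs) in a few-shot setting. By selecting only a few exemplars and using recently developed prompting techniques, we show that our approach reasons soundly about policy compliance conversations by extracting the sub-questions to be answered, assigning truth values from contextual information, and explicitly producing a set of logic statements from the given policies. Formulating explicit logic graphs can in turn help answer PCD-related questions with greater transparency and explainability. We apply this approach to ShARC, a popular PCD and conversational machine reading benchmark, and show competitive performance with no task-specific finetuning. We also use the inherently interpretable architecture of LDPC to understand where errors occur, revealing ambiguities in the ShARC dataset and highlighting the challenges of reasoning in conversational question answering.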
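
The three steps named in the abstract (sub-questions, truth values, logic statements) can be illustrated with a small three-valued evaluator in which an unanswered sub-question is `None`. This is a sketch under our own assumptions, not the LDPC implementation: the tuple-based policy encoding, the sub-question names, and the output labels are all hypothetical.

```python
def evaluate(expr, facts):
    """Evaluate a policy expression under three-valued logic.

    `expr` is a sub-question id (str) or a tuple ("and" | "or" | "not", *args).
    `facts` maps sub-question ids to True/False; missing ids count as unknown.
    """
    if isinstance(expr, str):
        return facts.get(expr)  # None when the context gives no answer
    op, *args = expr
    vals = [evaluate(a, facts) for a in args]
    if op == "and":
        if any(v is False for v in vals):
            return False
        return None if any(v is None for v in vals) else True
    if op == "or":
        if any(v is True for v in vals):
            return True
        return None if any(v is None for v in vals) else False
    if op == "not":
        return None if vals[0] is None else not vals[0]
    raise ValueError(f"unknown operator: {op}")

def decide(policy, facts):
    """Map the logical result to a compliance decision."""
    result = evaluate(policy, facts)
    if result is None:
        return "ask a clarifying question"
    return "compliant" if result else "not compliant"

# Hypothetical policy: resident AND (over 65 OR disabled).
POLICY = ("and", "is_resident", ("or", "is_over_65", "is_disabled"))

print(decide(POLICY, {"is_resident": True}))
print(decide(POLICY, {"is_resident": False}))
print(decide(POLICY, {"is_resident": True, "is_over_65": True}))
```

Keeping the logic explicit makes it possible to see which sub-question is still unknown, which is the kind of transparency the abstract attributes to explicit logic graphs.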

Title: Verifying Cross-modal Entity Consistency in News using Vision-language Models

Authors: Sahar Tahmasebi, Eric Müller-Budack, Ralph Ewerth
Subjects: cs.CL, cs.IR, cs.MM
Abstract URL: https://arxiv.org/abs/2501.11403
Pdf URL: https://arxiv.org/pdf/2501.11403
Copy Paste: [[2501.11403]] Verifying Cross-modal Entity Consistency in News using Vision-language Models(https://arxiv.org/abs/2501.11403)
Keywords: language model, prompt
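Abstract: The web has become a crucial source of information, but it is also used to spread disinformation, often conveyed through multiple modalities like images and text. The identification of inconsistent cross-modal information, in particular entities such as persons, locations, and events, is critical to detect disinformation. Previous works either identify out-of-context disinformation by assessing the consistency of images to the whole document, neglecting relations of individual entities, or focus on generic entities that are not relevant to news. So far, only a few approaches have addressed the task of validating entity consistency between images and text in news. However, the potential of large vision-language models (LVLMs) has not been explored yet. In this paper, we propose an LVLM-based framework for verifying Cross-modal Entity Consistency (LVLM4CEC), to assess whether persons, locations and events in news articles are consistent across both modalities. We suggest effective prompting strategies for LVLMs for entity verification that leverage reference images crawled from the web. Moreover, we extend three existing datasets for the task of entity verification in news providing manual ground-truth data. Our results show the potential of LVLMs for automating cross-modal entity verification, showing improved accuracy in identifying persons and events when using evidence images. Moreover, our method outperforms a baseline for location and event verification in documents. The datasets and source code are available on GitHub at this https URL.
Summary: The web has become a crucial source of information, but it is also used to spread disinformation, which is often conveyed through multiple modalities such as images and text. Identifying inconsistent cross-modal information, in particular entities such as persons, locations, and events, is critical for detecting disinformation. Previous work either identifies out-of-context disinformation by assessing the consistency of images with the whole document, neglecting the relations of individual entities, or focuses on generic entities that are not relevant to news. So far, only a few approaches have addressed the task of validating entity consistency between images and text in news, and the potential of large vision-language models (LVLMs) has not yet been explored. In this paper, we propose an LVLM-based framework for verifying cross-modal entity consistency (LVLM4CEC) to assess whether the persons, locations, and events in news articles are consistent across both modalities. We suggest effective prompting strategies for entity verification with LVLMs that use reference images crawled from the web. We also extend three existing datasets for the task of entity verification in news, providing manual ground-truth data. Our results show the potential of LVLMs for automating cross-modal entity verification, with improved accuracy in identifying persons and events when evidence images are used. In addition, our method outperforms a baseline for location and event verification in documents. The datasets and source code are available on GitHub at this https URL.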

Title: Neural Contextual Reinforcement Framework for Logical Structure Language Generation

Authors: Marcus Irvin, William Cooper, Edward Hughes, Jessica Morgan, Christopher Hamilton
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.11417
Pdf URL: https://arxiv.org/pdf/2501.11417
Copy Paste: [[2501.11417]] Neural Contextual Reinforcement Framework for Logical Structure Language Generation(https://arxiv.org/abs/2501.11417)
Keywords: language model
Abstract: The Neural Contextual Reinforcement Framework introduces an innovative approach to enhancing the logical coherence and structural consistency of text generated by large language models. Leveraging reinforcement learning principles, the framework integrates custom reward functions and dynamic context alignment mechanisms to address challenges inherent in maintaining long-range dependencies across extended sequences. The architecture incorporates multi-head attention layers and hierarchical encoding modules, enabling the model to produce outputs that align closely with human expectations of logical structure and semantic flow. Quantitative evaluations across diverse datasets demonstrate substantial improvements in coherence metrics, perplexity reduction, and semantic alignment, showcasing the framework's ability to outperform baseline models in both general and domain-specific tasks. Qualitative analyses further highlight the framework's capacity to generate text with improved narrative clarity and reduced redundancy, reflecting its effectiveness in balancing fluency with structural precision. In addition to its performance gains, the framework exhibits robustness in handling noisy input data and scalability across varying model sizes, reinforcing its versatility in practical applications. Experimental results reveal that optimal context window sizes significantly influence coherence outcomes, showing the importance of architectural flexibility in adapting to diverse linguistic structures. Cross-lingual performance evaluations affirm the framework's adaptability to multiple languages, extending its utility beyond monolingual contexts. Resource efficiency analyses indicate a reduction in computational overhead compared to traditional approaches, emphasizing the practicality of the framework for large-scale deployment.
摘要：神经上下文强化框架引入了一种创新方法来增强大型语言模型生成的文本的逻辑连贯性和结构一致性。该框架利用强化学习原理，集成了自定义奖励函数和动态上下文对齐机制，以解决在扩展序列中维护长距离依赖关系所固有的挑战。该架构结合了多头注意层和分层编码模块，使模型能够产生与人类对逻辑结构和语义流的期望紧密一致的输出。跨不同数据集的定量评估表明，连贯性指标、困惑度降低和语义对齐方面都有了显著的改善，展示了该框架在一般和特定领域任务中超越基线模型的能力。定性分析进一步突出了该框架生成具有更好的叙述清晰度和减少冗余的文本的能力，反映了其在平衡流畅性和结构精度方面的有效性。除了性能提升之外，该框架在处理嘈杂的输入数据方面表现出很强的鲁棒性，并且在不同模型大小上具有可扩展性，增强了其在实际应用中的多功能性。实验结果表明，最佳上下文窗口大小显著影响连贯性结果，表明架构灵活性在适应多种语言结构方面的重要性。跨语言性能评估肯定了该框架对多种语言的适应性，将其实用性扩展到单语环境之外。资源效率分析表明，与传统方法相比，计算开销有所减少，强调了该框架在大规模部署中的实用性。

Title: RACCOON: A Retrieval-Augmented Generation Approach for Location Coordinate Capture from News Articles

Authors: Jonathan Lin, Aditya Joshi, Hye-young Paik, Tri Dung Doung, Deepti Gurdasani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.11440
Pdf URL: https://arxiv.org/pdf/2501.11440
Copy Paste: [[2501.11440]] RACCOON: A Retrieval-Augmented Generation Approach for Location Coordinate Capture from News Articles(https://arxiv.org/abs/2501.11440)
Keywords: llm, prompt, retrieval-augmented generation
Abstract: Geocoding involves automatic extraction of location coordinates of incidents reported in news articles, and can be used for epidemic intelligence or disaster management. This paper introduces Retrieval-Augmented Coordinate Capture Of Online News articles (RACCOON), an open-source geocoding approach that extracts geolocations from news articles. RACCOON uses a retrieval-augmented generation (RAG) approach where candidate locations and associated information are retrieved in the form of context from a location database, and a prompt containing the retrieved context, location mentions and news articles is fed to an LLM to generate the location coordinates. Our evaluation, covering three datasets, two underlying LLMs, three baselines, and several ablation tests based on the components of RACCOON, demonstrates its utility. To the best of our knowledge, RACCOON is the first RAG-based approach for geocoding using pre-trained LLMs.
摘要：地理编码涉及自动提取新闻文章中报道的事件的位置坐标，可用于疫情情报或灾害管理。本文介绍了一种从新闻文章中提取地理位置的开源地理编码方法——检索增强在线新闻文章坐标捕获 (RACCOON)。RACCOON 使用检索增强生成 (RAG) 方法，其中从位置数据库中以上下文的形式检索候选位置和相关信息，并将包含检索到的上下文、位置提及和新闻文章的提示输入到 LLM 以生成位置坐标。我们对三个数据集、两个底层 LLM、三个基线和基于 RACCOON 组件的几次消融测试进行了评估，证明了 RACCOON 的实用性。据我们所知，RACCOON 是第一个使用预训练 LLM 进行地理编码的基于 RAG 的方法。
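
Illustrative sketch (not the authors' implementation): the retrieve-then-prompt flow described in the abstract could look roughly like the following. The gazetteer contents, the coordinates (approximate toy values), the function names, and the prompt wording are all assumptions made for illustration; the call to an actual LLM is left out.

```python
from typing import Dict, List

# Hypothetical toy location database; coordinates are approximate.
GAZETTEER: Dict[str, List[dict]] = {
    "springfield": [
        {"name": "Springfield, Illinois, US", "lat": 39.80, "lon": -89.64},
        {"name": "Springfield, Missouri, US", "lat": 37.21, "lon": -93.29},
    ],
}


def retrieve_candidates(mention: str, gazetteer: Dict[str, List[dict]]) -> List[dict]:
    """Look up candidate locations for a mention (case-insensitive exact match)."""
    return gazetteer.get(mention.strip().lower(), [])


def build_prompt(article: str, mentions: List[str], gazetteer: Dict[str, List[dict]]) -> str:
    """Assemble a prompt from the article, its location mentions, and retrieved context."""
    lines = ["Article:", article, "", "Candidate locations:"]
    for mention in mentions:
        for cand in retrieve_candidates(mention, gazetteer):
            lines.append(f"- {mention}: {cand['name']} ({cand['lat']}, {cand['lon']})")
    lines.append("")
    lines.append("For each mention, return the most likely latitude and longitude.")
    return "\n".join(lines)


prompt = build_prompt(
    "Flooding was reported in Springfield on Monday.", ["Springfield"], GAZETTEER
)
print(prompt)
```

The resulting string would then be sent to whichever LLM is in use; the model has to disambiguate between the retrieved candidates using the article text.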

Title: Curiosity-Driven Reinforcement Learning from Human Feedback

Authors: Haoran Sun, Yekun Chai, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.11463
Pdf URL: https://arxiv.org/pdf/2501.11463
Copy Paste: [[2501.11463]] Curiosity-Driven Reinforcement Learning from Human Feedback(https://arxiv.org/abs/2501.11463)
Keywords: language model, llm
Abstract: Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but often at the cost of reduced output diversity. This trade-off between diversity and alignment quality remains a significant challenge. Drawing inspiration from curiosity-driven exploration in reinforcement learning, we introduce curiosity-driven RLHF (CD-RLHF), a framework that incorporates intrinsic rewards for novel states, alongside traditional sparse extrinsic rewards, to optimize both output diversity and alignment quality. We demonstrate the effectiveness of CD-RLHF through extensive experiments on a range of tasks, including text summarization and instruction following. Our approach achieves significant gains in diversity on multiple diversity-oriented metrics while maintaining alignment with human preferences comparable to standard RLHF. We make our code publicly available at this https URL.
摘要：事实证明，从人类反馈中获取的强化学习 (RLHF) 可有效使大型语言模型 (LLM) 与人类偏好保持一致，但通常以降低输出多样性为代价。多样性与对齐质量之间的这种权衡仍然是一项重大挑战。从强化学习中好奇心驱动的探索中汲取灵感，我们引入了好奇心驱动的 RLHF (CD-RLHF)，这是一个结合了新状态的内在奖励和传统稀疏外在奖励的框架，以优化输出多样性和对齐质量。我们通过对一系列任务（包括文本摘要和指令遵循）进行大量实验证明了 CD-RLHF 的有效性。我们的方法在多个面向多样性的指标上实现了多样性的显著提升，同时保持了与人类偏好的一致性，与标准 RLHF 相当。我们在此 https URL 上公开提供我们的代码。
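
Illustrative sketch (not the authors' implementation): the abstract describes adding an intrinsic reward for novel states to the usual sparse extrinsic reward. The example below uses a simple count-based novelty bonus as a stand-in; the bonus formula, the `eta` weight, and the string-valued states are assumptions, and the paper may estimate novelty differently.

```python
import math
from collections import Counter
from typing import Hashable, List


def combined_rewards(
    states: List[Hashable],
    extrinsic_reward: float,
    counts: Counter,
    eta: float = 0.1,
) -> List[float]:
    """Per-step reward = novelty bonus + extrinsic reward (given only at the last step)."""
    rewards = []
    for i, state in enumerate(states):
        counts[state] += 1
        # Assumed count-based bonus: shrinks as a state is revisited.
        intrinsic = eta / math.sqrt(counts[state])
        extrinsic = extrinsic_reward if i == len(states) - 1 else 0.0
        rewards.append(intrinsic + extrinsic)
    return rewards


counts = Counter()
first = combined_rewards(["a", "b", "c"], 1.0, counts)
second = combined_rewards(["a", "b", "c"], 1.0, counts)
print(first, second)
```

Because the bonus decays with visit counts, repeating the same trajectory earns less intrinsic reward the second time, which is the pressure toward diverse outputs that the abstract refers to.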

Title: Graph-defined Language Learning with LLMs

Authors: Huachi Zhou, Jiahe Du, Chuang Zhou, Chang Yang, Yilin Xiao, Yuxuan Xie, Xiao Huang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.11478
Pdf URL: https://arxiv.org/pdf/2501.11478
Copy Paste: [[2501.11478]] Graph-defined Language Learning with LLMs(https://arxiv.org/abs/2501.11478)
Keywords: language model, llm
Abstract: Recent efforts leverage Large Language Models (LLMs) for modeling text-attributed graph structures in node classification tasks. These approaches describe graph structures for LLMs to understand or aggregate LLM-generated textual attribute embeddings through graph structure. However, these approaches face two main limitations in modeling graph structures with LLMs. (i) Graph descriptions become verbose in describing high-order graph structure. (ii) Textual attributes alone do not contain adequate graph structure information. It is challenging to model graph structure concisely and adequately with LLMs. LLMs lack built-in mechanisms to model graph structures directly. They also struggle with complex long-range dependencies between high-order nodes and target nodes. Inspired by the observation that LLMs pre-trained on one language can achieve exceptional performance on another with minimal additional training, we propose Graph-Defined Language for Large Language Model (GDL4LLM). This novel framework enables LLMs to transfer their powerful language understanding capabilities to graph-structured data. GDL4LLM translates graphs into a graph language corpus instead of graph descriptions and pre-trains LLMs on this corpus to adequately understand graph structures. During fine-tuning, this corpus describes the structural information of target nodes concisely with only a few tokens. By treating graphs as a new language, GDL4LLM enables LLMs to model graph structures adequately and concisely for node classification tasks. Extensive experiments on three real-world datasets demonstrate that GDL4LLM outperforms description-based and textual attribute embeddings-based baselines by efficiently modeling different orders of graph structure with LLMs.
摘要：最近的研究利用大型语言模型 (LLM) 在节点分类任务中对文本属性图结构进行建模。这些方法描述了图结构，以便 LLM 理解或通过图结构聚合 LLM 生成的文本属性嵌入。然而，这些方法在使用 LLM 对图结构进行建模时面临两个主要限制。(i) 在描述高阶图结构时，图描述变得冗长。(ii) 文本属性本身不包含足够的图结构信息。使用 LLM 简洁而充分地对图结构进行建模是一项挑战。LLM 缺乏直接对图结构进行建模的内置机制。它们还难以处理高阶节点和目标节点之间复杂的长程依赖关系。受以下观察结果启发：在一种语言上预训练的 LLM 只需极少的额外训练就能在另一种语言上取得优异的表现，我们提出了 Graph-Defined Language for Large Language Model (GDL4LLM)。这个新颖的框架使 LLM 能够将其强大的语言理解能力转移到图结构数据上。GDL4LLM 将图转换为图语言语料库而不是图描述，并在该语料库上预训练 LLM 以充分理解图结构。在微调过程中，该语料库仅使用几个标记简洁地描述目标节点的结构信息。通过将图视为一种新语言，GDL4LLM 使 LLM 能够为节点分类任务充分而简洁地建模图结构。在三个真实数据集上进行的大量实验表明，通过使用 LLM 有效地建模不同顺序的图结构，GDL4LLM 的表现优于基于描述和基于文本属性嵌入的基线。

Title: Generative AI and Large Language Models in Language Preservation: Opportunities and Challenges

Authors: Vincent Koc
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.11496
Pdf URL: https://arxiv.org/pdf/2501.11496
Copy Paste: [[2501.11496]] Generative AI and Large Language Models in Language Preservation: Opportunities and Challenges(https://arxiv.org/abs/2501.11496)
Keywords: language model, llm
Abstract: Generative AI and large-scale language models (LLM) have emerged as powerful tools in language preservation, particularly for near-native and endangered languages. With the increasing reliance on technology for communication, education, and cultural documentation, new opportunities have emerged to mitigate the dramatic decline of linguistic diversity worldwide. This paper examines the role of generative AIs and LLMs in preserving endangered languages, highlighting the risks and challenges associated with their use. We analyze the underlying technologies driving these models, including natural language processing (NLP) and deep learning, and explore several cases where these technologies have been applied to low-resource languages. Additionally, we discuss ethical considerations, data scarcity issues, and technical challenges while proposing solutions to enhance AI-driven language preservation.
摘要：生成式人工智能和大规模语言模型 (LLM) 已成为语言保护的有力工具，尤其是对于近母语和濒危语言而言。随着人们越来越依赖技术进行交流、教育和文化记录，出现了新的机遇来缓解全球语言多样性的急剧下降。本文探讨了生成式人工智能和 LLM 在保护濒危语言方面的作用，强调了使用它们的风险和挑战。我们分析了驱动这些模型的底层技术，包括自然语言处理 (NLP) 和深度学习，并探讨了这些技术应用于资源匮乏语言的几个案例。此外，我们还讨论了道德考虑、数据稀缺问题和技术挑战，并提出了增强人工智能驱动的语言保护的解决方案。

Title: Whose Boat Does it Float? Improving Personalization in Preference Tuning via Inferred User Personas

Authors: Nishant Balepur, Vishakh Padmakumar, Fumeng Yang, Shi Feng, Rachel Rudinger, Jordan Lee Boyd-Graber
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.11549
Pdf URL: https://arxiv.org/pdf/2501.11549
Copy Paste: [[2501.11549]] Whose Boat Does it Float? Improving Personalization in Preference Tuning via Inferred User Personas(https://arxiv.org/abs/2501.11549)
Keywords: llm, prompt
Abstract: LLMs are tuned to follow instructions (aligned) by learning which of two outputs users prefer for a prompt. However, this preference data format does not convey why users prefer responses that are chosen or rejected, so LLMs trained on these datasets cannot tailor responses to varied user needs. To surface these parameters of personalization, we apply abductive reasoning to preference data, inferring needs and interests of users, i.e. personas, that may prefer each output. We test this idea in two steps: Persona Inference (PI), abductively inferring personas of users who prefer chosen or rejected outputs, and Persona Tailoring (PT), training models to tailor responses to personas from PI. We find: 1) LLMs infer personas that accurately explain why different users may prefer either chosen or rejected outputs; 2) Training on preference data augmented with PI personas via PT boosts personalization, enabling models to support user-written personas; and 3) Rejected response personas form harder personalization evaluations, showing PT better aids users with uncommon preferences versus typical alignment methods. We argue for an abductive view of preferences for personalization, asking not only which response is better but when, why, and for whom.
摘要：通过学习用户更喜欢两种输出中的哪一种，LLM 可以调整为遵循指令（对齐）。但是，这种偏好数据格式并不能说明用户为什么喜欢选择或拒绝的响应，因此在这些数据集上训练的 LLM 无法根据不同的用户需求定制响应。为了揭示这些个性化参数，我们将溯因推理应用于偏好数据，推断可能更喜欢每个输出的用户（即角色）的需求和兴趣。我们分两步测试这个想法：角色推理（PI）——溯因推断喜欢选择或拒绝输出的用户角色——和角色定制（PT）——训练模型以根据 PI 定制对角色的响应。我们发现：1）LLM 可以推断角色，准确解释为什么不同的用户可能同时喜欢选择或拒绝的输出；2）通过 PT 使用 PI 角色增强的偏好数据进行训练可以增强个性化，使模型能够支持用户编写的角色； 3) 被拒绝的响应角色形成了更难的个性化评估，表明 PT 比典型的对齐方法更能帮助具有不常见偏好的用户。我们主张对个性化偏好采取溯因观点，不仅要问哪种响应更好，还要问何时、为什么以及对谁更好。

Title: PIKE-RAG: sPecIalized KnowledgE and Rationale Augmented Generation

Authors: Jinyu Wang, Jingjing Fu, Lei Song, Jiang Bian
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.11551
Pdf URL: https://arxiv.org/pdf/2501.11551
Copy Paste: [[2501.11551]] PIKE-RAG: sPecIalized KnowledgE and Rationale Augmented Generation(https://arxiv.org/abs/2501.11551)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Despite notable advancements in Retrieval-Augmented Generation (RAG) systems that expand large language model (LLM) capabilities through external retrieval, these systems often struggle to meet the complex and diverse needs of real-world industrial applications. The reliance on retrieval alone proves insufficient for extracting deep, domain-specific knowledge and performing logical reasoning from specialized corpora. To address this, we introduce sPecIalized KnowledgE and Rationale Augmented Generation (PIKE-RAG), focusing on extracting, understanding, and applying specialized knowledge, while constructing coherent rationale to incrementally steer LLMs toward accurate responses. Recognizing the diverse challenges of industrial tasks, we introduce a new paradigm that classifies tasks based on their complexity in knowledge extraction and application, allowing for a systematic evaluation of RAG systems' problem-solving capabilities. This strategic approach offers a roadmap for the phased development and enhancement of RAG systems, tailored to meet the evolving demands of industrial applications. Furthermore, we propose knowledge atomizing and knowledge-aware task decomposition to effectively extract multifaceted knowledge from the data chunks and iteratively construct the rationale based on the original query and the accumulated knowledge, respectively, showcasing exceptional performance across various benchmarks.
摘要：尽管检索增强生成 (RAG) 系统取得了显著进步，通过外部检索扩展了大型语言模型 (LLM) 的功能，但这些系统通常难以满足现实世界工业应用的复杂和多样化需求。仅依靠检索不足以从专业语料库中提取执行逻辑推理的深度、领域特定知识。为了解决这个问题，我们引入了专业知识和原理增强生成 (PIKE-RAG)，专注于提取、理解和应用专业知识，同时构建连贯的原理以逐步引导 LLM 获得准确的响应。认识到工业任务的多样化挑战，我们引入了一种新范式，根据任务在知识提取和应用中的复杂性对其进行分类，从而可以系统地评估 RAG 系统的解决问题能力。这种战略方法为分阶段开发和增强 RAG 系统提供了路线图，以满足工业应用不断变化的需求。此外，我们提出了知识原子化和知识感知任务分解，以有效地从数据块中提取多方面知识，并分别基于原始查询和累积知识迭代构建原理，从而在各种基准测试中展示出卓越的性能。

Title: Conversation Routines: A Prompt Engineering Framework for Task-Oriented Dialog Systems

Authors: Giorgio Robino
Subjects: cs.CL, cs.AI, cs.ET, cs.HC, cs.PL
Abstract URL: https://arxiv.org/abs/2501.11613
Pdf URL: https://arxiv.org/pdf/2501.11613
Copy Paste: [[2501.11613]] Conversation Routines: A Prompt Engineering Framework for Task-Oriented Dialog Systems(https://arxiv.org/abs/2501.11613)
Keywords: language model, llm, prompt, agent
Abstract: This study introduces Conversation Routines (CR), a structured prompt engineering framework for developing task-oriented dialog systems using Large Language Models (LLMs). While LLMs demonstrate remarkable natural language understanding capabilities, engineering them to reliably execute complex business workflows remains challenging. The proposed CR framework enables the development of Conversation Agentic Systems (CAS) through natural language specifications, embedding task-oriented logic within LLM prompts. This approach provides a systematic methodology for designing and implementing complex conversational workflows while maintaining behavioral consistency. We demonstrate the framework's effectiveness through two proof of concept implementations: a Train Ticket Booking System and an Interactive Troubleshooting Copilot. These case studies validate CR's capability to encode sophisticated behavioral patterns and decision logic while preserving natural conversational flexibility. Results show that CR enables domain experts to design conversational workflows in natural language while leveraging custom enterprise functionalities (tools) developed by software engineers, creating an efficient division of responsibilities where developers focus on core API implementation and domain experts handle conversation design. While the framework shows promise in accessibility and adaptability, we identify key challenges including computational overhead, non-deterministic behavior, and domain-specific logic optimization. Future research directions include enhancing system robustness, improving scalability for complex multi-agent interactions, and addressing the identified limitations across diverse business applications.
摘要：本研究介绍了对话例程 (CR)，这是一种使用大型语言模型 (LLM) 开发面向任务的对话系统的结构化提示工程框架。虽然 LLM 表现出卓越的自然语言理解能力，但对其进行工程设计以可靠地执行复杂的业务工作流仍然具有挑战性。所提出的 CR 框架通过自然语言规范实现了对话代理系统 (CAS) 的开发，将面向任务的逻辑嵌入 LLM 提示中。这种方法提供了一种系统的方法，用于设计和实施复杂的对话工作流，同时保持行为一致性。我们通过两个概念验证实现来证明该框架的有效性：火车票预订系统和交互式故障排除副驾驶。这些案例研究验证了 CR 编码复杂行为模式和决策逻辑的能力，同时保留了自然的对话灵活性。结果表明，CR 使领域专家能够用自然语言设计对话工作流，同时利用软件工程师开发的自定义企业功能（工具），从而实现高效的职责分工，开发人员专注于核心 API 实现，领域专家负责对话设计。虽然该框架在可访问性和适应性方面表现出色，但我们发现关键挑战包括计算开销、非确定性行为和领域特定逻辑优化。未来的研究方向包括增强系统稳健性、提高复杂多代理交互的可扩展性以及解决跨不同业务应用程序的已发现限制。

Title: Trojan Detection Through Pattern Recognition for Large Language Models

Authors: Vedant Bhasin, Matthew Yudin, Razvan Stefanescu, Rauf Izmailov
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2501.11621
Pdf URL: https://arxiv.org/pdf/2501.11621
Copy Paste: [[2501.11621]] Trojan Detection Through Pattern Recognition for Large Language Models(https://arxiv.org/abs/2501.11621)
Keywords: language model, prompt
Abstract: Trojan backdoors can be injected into large language models at various stages, including pretraining, fine-tuning, and in-context learning, posing a significant threat to the model's alignment. Due to the nature of causal language modeling, detecting these triggers is challenging given the vast search space. In this study, we propose a multistage framework for detecting Trojan triggers in large language models consisting of token filtration, trigger identification, and trigger verification. We discuss existing trigger identification methods and propose two variants of a black-box trigger inversion method that rely on output logits, utilizing beam search and greedy decoding respectively. We show that the verification stage is critical in the process and propose semantic-preserving prompts and special perturbations to differentiate between actual Trojan triggers and other adversarial strings that display similar characteristics. The evaluation of our approach on the TrojAI and RLHF poisoned model datasets demonstrates promising results.
摘要：特洛伊木马后门可以在各个阶段注入大型语言模型，包括预训练、微调和上下文学习，对模型的对齐构成重大威胁。由于因果语言建模的性质，在搜索空间巨大的情况下检测这些触发器具有挑战性。在本研究中，我们提出了一个多阶段框架来检测大型语言模型中的特洛伊木马触发器，包括标记过滤、触发器识别和触发器验证。我们讨论了现有的触发器识别方法，并提出了两种依赖于输出逻辑的黑盒触发器反转方法的变体，分别利用了波束搜索和贪婪解码。我们表明验证阶段在该过程中至关重要，并提出了保留语义的提示和特殊扰动来区分实际的特洛伊木马触发器和具有相似特征的其他对抗性字符串。在 TrojAI 和 RLHF 中毒模型数据集上对我们的方法的评估显示出有希望的结果。

Title: YouLeQD: Decoding the Cognitive Complexity of Questions and Engagement in Online Educational Videos from Learners' Perspectives

Authors: Nong Ming, Sachin Sharma, Jiho Noh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.11712
Pdf URL: https://arxiv.org/pdf/2501.11712
Copy Paste: [[2501.11712]] YouLeQD: Decoding the Cognitive Complexity of Questions and Engagement in Online Educational Videos from Learners' Perspectives(https://arxiv.org/abs/2501.11712)
Keywords: language model
Abstract: Questioning is a fundamental aspect of education, as it helps assess students' understanding, promotes critical thinking, and encourages active engagement. With the rise of artificial intelligence in education, there is a growing interest in developing intelligent systems that can automatically generate and answer questions and facilitate interactions in both virtual and in-person education settings. However, to develop effective AI models for education, it is essential to have a fundamental understanding of questioning. In this study, we created the YouTube Learners' Questions on Bloom's Taxonomy Dataset (YouLeQD), which contains learner-posed questions from YouTube lecture video comments. Along with the dataset, we developed two RoBERTa-based classification models leveraging Large Language Models to detect questions and analyze their cognitive complexity using Bloom's Taxonomy. This dataset and our findings provide valuable insights into the cognitive complexity of learner-posed questions in educational videos and their relationship with interaction metrics. This can aid in the development of more effective AI models for education and improve the overall learning experience for students.
摘要：提问是教育的一个基本方面，因为它有助于评估学生的理解程度、促进批判性思维并鼓励积极参与。随着人工智能在教育领域的兴起，人们越来越有兴趣开发能够自动生成和回答问题并促进虚拟和面对面教育环境中互动的智能系统。然而，要开发有效的教育人工智能模型，必须对提问有基本的了解。在本研究中，我们创建了 YouTube 学习者布鲁姆分类法数据集 (YouLeQD) 问题，其中包含学习者在 YouTube 讲座视频评论中提出的问题。除了数据集之外，我们还开发了两个基于 RoBERTa 的分类模型，利用大型语言模型来检测问题并使用布鲁姆分类法分析它们的认知复杂性。该数据集和我们的研究结果为教育视频中学习者提出的问题的认知复杂性及其与互动指标的关系提供了宝贵的见解。这有助于开发更有效的教育人工智能模型，并改善学生的整体学习体验。

Title: Explain-Query-Test: Self-Evaluating LLMs Via Explanation and Comprehension Discrepancy

Authors: Saeid Asgari Taghanaki, Joao Monteiro
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2501.11721
Pdf URL: https://arxiv.org/pdf/2501.11721
Copy Paste: [[2501.11721]] Explain-Query-Test: Self-Evaluating LLMs Via Explanation and Comprehension Discrepancy(https://arxiv.org/abs/2501.11721)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated remarkable proficiency in generating detailed and coherent explanations of complex concepts. However, the extent to which these models truly comprehend the concepts they articulate remains unclear. To assess the level of comprehension of a model relative to the content it generates, we implemented a self-evaluation pipeline where models: (i) given a topic generate an excerpt with information about the topic, (ii) given an excerpt generate question-answer pairs, and finally (iii) given a question generate an answer. We refer to this self-evaluation approach as Explain-Query-Test (EQT). Interestingly, the accuracy on generated questions resulting from running the EQT pipeline correlates strongly with the model performance as verified by typical benchmarks such as MMLU-Pro. In other words, EQT's performance is predictive of MMLU-Pro's, and EQT can be used to rank models without the need for any external source of evaluation data other than lists of topics of interest. Moreover, our results reveal a disparity between the models' ability to produce detailed explanations and their performance on questions related to those explanations. This gap highlights fundamental limitations in the internal knowledge representation and reasoning abilities of current LLMs. We release the code at this https URL.
摘要：大型语言模型 (LLM) 在生成复杂概念的详细且连贯的解释方面表现出了非凡的能力。然而，这些模型真正理解它们所表达的概念的程度仍不清楚。为了评估模型相对于其生成内容的理解水平，我们实施了一个自我评估流程，其中模型：(i) 给定一个主题，生成一个包含有关该主题信息的摘录，(ii) 给定一个摘录，生成问答对，最后 (iii) 给定一个问题，生成一个答案。我们将这种自我评估方法称为解释-查询-测试 (EQT)。有趣的是，运行 EQT 流程后生成的问题的准确性与模型性能密切相关，这已由典型基准测试（如 MMLU-Pro）验证。换句话说，EQT 的性能可以预测 MMLU-Pro 的性能，并且 EQT 可用于对模型进行排名，而无需任何外部评估数据源（除了感兴趣的主题列表）。此外，我们的结果揭示了模型生成详细解释的能力与它们在与这些解释相关的问题上的表现之间存在差异。这一差距凸显了当前 LLM 内部知识表示和推理能力的根本局限性。我们在此 https URL 上发布了代码。
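
Illustrative sketch (not the released code): the three EQT stages in the abstract can be wired together as below. The three callables stand in for calls to the same LLM and are hypothetical stubs here; exact-match scoring is also an assumption, since the abstract does not state how answers are graded.

```python
from typing import Callable, List, Tuple

QAPairs = List[Tuple[str, str]]


def explain_query_test(
    topic: str,
    explain: Callable[[str], str],
    make_qa: Callable[[str], QAPairs],
    answer: Callable[[str], str],
) -> float:
    """Run explain -> query -> test for one topic and return the accuracy."""
    excerpt = explain(topic)       # (i) topic -> excerpt
    qa_pairs = make_qa(excerpt)    # (ii) excerpt -> question-answer pairs
    if not qa_pairs:
        return 0.0
    correct = sum(
        answer(question).strip().lower() == gold.strip().lower()  # (iii) question -> answer
        for question, gold in qa_pairs
    )
    return correct / len(qa_pairs)


# Hypothetical stubs standing in for LLM calls.
def stub_explain(topic: str) -> str:
    return f"{topic}: water boils at 100 C at sea level and freezes at 0 C."


def stub_make_qa(excerpt: str) -> QAPairs:
    return [("Boiling point of water in C?", "100"), ("Freezing point of water in C?", "0")]


def stub_answer(question: str) -> str:
    return "100"  # a model that gets only the first question right


score = explain_query_test("Water", stub_explain, stub_make_qa, stub_answer)
print(score)
```

Averaging this score over a list of topics gives a single number per model, which is the quantity the abstract reports as correlating with MMLU-Pro.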

Title: Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks

Authors: Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, Heng Ji
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2501.11733
Pdf URL: https://arxiv.org/pdf/2501.11733
Copy Paste: [[2501.11733]] Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks(https://arxiv.org/abs/2501.11733)
Keywords: agent
Abstract: Smartphones have become indispensable in modern life, yet navigating complex tasks on mobile devices often remains frustrating. Recent advancements in large multimodal model (LMM)-based mobile agents have demonstrated the ability to perceive and act in mobile environments. However, current approaches face significant limitations: they fall short in addressing real-world human needs, struggle with reasoning-intensive and long-horizon tasks, and lack mechanisms to learn and improve from prior experiences. To overcome these challenges, we introduce Mobile-Agent-E, a hierarchical multi-agent framework capable of self-evolution through past experience. By hierarchical, we mean an explicit separation of high-level planning and low-level action execution. The framework comprises a Manager, responsible for devising overall plans by breaking down complex tasks into subgoals, and four subordinate agents--Perceptor, Operator, Action Reflector, and Notetaker--which handle fine-grained visual perception, immediate action execution, error verification, and information aggregation, respectively. Mobile-Agent-E also features a novel self-evolution module which maintains a persistent long-term memory comprising Tips and Shortcuts. Tips are general guidance and lessons learned from prior tasks on how to effectively interact with the environment. Shortcuts are reusable, executable sequences of atomic operations tailored for specific subroutines. The inclusion of Tips and Shortcuts facilitates continuous refinement in performance and efficiency. Alongside this framework, we introduce Mobile-Eval-E, a new benchmark featuring complex mobile tasks requiring long-horizon, multi-app interactions. Empirical results show that Mobile-Agent-E achieves a 22% absolute improvement over previous state-of-the-art approaches across three foundation model backbones. Project page: this https URL.
摘要：智能手机已成为现代生活中不可或缺的一部分，但在移动设备上处理复杂任务往往令人沮丧。基于大型多模态模型 (LMM) 的移动代理的最新进展已展示出在移动环境中感知和行动的能力。然而，当前的方法面临着重大的局限性：它们无法满足现实世界的人类需求，难以处理推理密集型和长期任务，并且缺乏从先前经验中学习和改进的机制。为了克服这些挑战，我们引入了 Mobile-Agent-E，这是一个能够通过过去的经验自我进化的分层多代理框架。我们所说的分层是指高级规划和低级操作执行的明确分离。该框架包括一个管理器，负责通过将复杂任务分解为子目标来制定总体计划，以及四个下属代理——感知器、操作员、动作反射器和记录器——分别处理细粒度的视觉感知、即时动作执行、错误验证和信息聚合。 Mobile-Agent-E 还具有一个新颖的自我进化模块，该模块可维护由提示和快捷方式组成的持久长期记忆。提示是一般指导，也是从先前任务中吸取的教训，关于如何有效地与环境交互。快捷方式是可重复使用的可执行原子操作序列，专为特定子程序量身定制。提示和快捷方式的加入有助于不断改进性能和效率。除了这个框架，我们还推出了 Mobile-Eval-E，这是一个新的基准，具有需要长期、多应用程序交互的复杂移动任务。实证结果表明，Mobile-Agent-E 在三个基础模型主干上比以前最先进的方法实现了 22% 的绝对改进。项目页面：此 https URL。

Title: Optimizing Pretraining Data Mixtures with LLM-Estimated Utility

Authors: William Held, Bhargavi Paranjape, Punit Singh Koura, Mike Lewis, Frank Zhang, Todor Mihaylov
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.11747
Pdf URL: https://arxiv.org/pdf/2501.11747
Copy Paste: [[2501.11747]] Optimizing Pretraining Data Mixtures with LLM-Estimated Utility(https://arxiv.org/abs/2501.11747)
Keywords: language model, llm
Abstract: Large Language Models improve with increasing amounts of high-quality training data. However, leveraging larger datasets requires balancing quality, quantity, and diversity across sources. After evaluating nine baseline methods under both compute- and data-constrained scenarios, we find token-count heuristics outperform manual and learned mixes, indicating that simple approaches accounting for dataset size and diversity are surprisingly effective. Building on this insight, we propose two complementary approaches: UtiliMax, which extends token-based heuristics by incorporating utility estimates from reduced-scale ablations, achieving up to a 10.6x speedup over manual baselines; and Model Estimated Data Utility (MEDU), which leverages LLMs to estimate data utility from small samples, matching ablation-based performance while reducing computational requirements by ~200x. Together, these approaches establish a new framework for automated, compute-efficient data mixing that is robust across training regimes.
摘要：随着高质量训练数据数量的增加，大型语言模型会得到改进。但是，利用更大的数据集需要在来源之间平衡质量、数量和多样性。在计算和数据受限的情况下评估了九种基线方法后，我们发现标记计数启发式方法优于手动和学习混合方法，这表明考虑数据集大小和多样性的简单方法出奇地有效。基于这一见解，我们提出了两种互补的方法：UtiliMax，它通过结合缩小规模消融的效用估计来扩展基于标记的启发式方法，与手动基线相比，速度提高了 10.6 倍；模型估计数据效用 (MEDU)，它利用 LLM 从小样本中估计数据效用，匹配基于消融的性能，同时将计算要求降低约 200 倍。这些方法共同建立了一个自动化、计算高效的数据混合新框架，该框架在训练方案中都很强大。

Title: The Value of Nothing: Multimodal Extraction of Human Values Expressed by TikTok Influencers

Authors: Alina Starovolsky-Shitrit, Alon Neduva, Naama Appel Doron, Ella Daniel, Oren Tsur
Subjects: cs.CL, cs.CY, cs.SI
Abstract URL: https://arxiv.org/abs/2501.11770
Pdf URL: https://arxiv.org/pdf/2501.11770
Copy Paste: [[2501.11770]] The Value of Nothing: Multimodal Extraction of Human Values Expressed by TikTok Influencers(https://arxiv.org/abs/2501.11770)
Keywords: language model
Abstract: Societal and personal values are transmitted to younger generations through interaction and exposure. Traditionally, children and adolescents learned values from parents, educators, or peers. Nowadays, social platforms serve as a significant channel through which youth (and adults) consume information, as the main medium of entertainment, and possibly the medium through which they learn different values. In this paper we extract implicit values from TikTok movies uploaded by online influencers targeting children and adolescents. We curated a dataset of hundreds of TikTok movies and annotated them according to the Schwartz Theory of Personal Values. We then experimented with an array of Masked and Large Language Models, exploring how values can be detected. Specifically, we considered two pipelines: direct extraction of values from video, and a 2-step approach in which videos are first converted to elaborated scripts and then values are extracted. Achieving state-of-the-art results, we find that the 2-step approach performs significantly better than the direct approach and that using a trainable Masked Language Model as a second step significantly outperforms a few-shot application of a number of Large Language Models. We further discuss the impact of fine-tuning and compare the performance of the different models on identification of values present or contradicted in the TikTok videos. Finally, we share the first values-annotated dataset of TikTok videos. Our results pave the way to further research on influence and value transmission in video-based social platforms.
摘要：社会和个人价值观通过互动和接触传递给年轻一代。传统上，儿童和青少年从父母、教育者或同龄人那里学习价值观。如今，社交平台是青少年（和成年人）获取信息的重要渠道，是主要的娱乐媒介，也可能是他们学习不同价值观的媒介。在本文中，我们从针对儿童和青少年的网络影响者上传的 TikTok 电影中提取了隐性价值观。我们整理了数百部 TikTok 电影的数据集，并根据施瓦茨个人价值观理论对其进行了注释。然后，我们尝试了一系列 Masked 和 Large 语言模型，探索如何检测价值观。具体来说，我们考虑了两种流程——直接从视频中提取价值观和一种两步方法，其中首先将视频转换为详细的脚本，然后提取价值观。为了获得最佳结果，我们发现两步方法的效果明显优于直接方法，并且使用可训练的屏蔽语言模型作为第二步的效果明显优于少量应用多个大型语言模型。我们进一步讨论了微调的影响，并比较了不同模型在识别 TikTok 中存在或矛盾的价值观方面的表现。最后，我们分享了第一个带有价值观注释的 TikTok 视频数据集。我们的研究结果为进一步研究基于视频的社交平台中的影响力和价值观传播铺平了道路。

Title: Synthetic Data Can Mislead Evaluations: Membership Inference as Machine Text Detection

Authors: Ali Naseh, Niloofar Mireshghallah
Subjects: cs.CL, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2501.11786
Pdf URL: https://arxiv.org/pdf/2501.11786
Copy Paste: [[2501.11786]] Synthetic Data Can Mislead Evaluations: Membership Inference as Machine Text Detection(https://arxiv.org/abs/2501.11786)
Keywords: language model, gpt, llm
Abstract: Recent work shows membership inference attacks (MIAs) on large language models (LLMs) produce inconclusive results, partly due to difficulties in creating non-member datasets without temporal shifts. While researchers have turned to synthetic data as an alternative, we show this approach can be fundamentally misleading. Our experiments indicate that MIAs function as machine-generated text detectors, incorrectly identifying synthetic data as training samples regardless of the data source. This behavior persists across different model architectures and sizes, from open-source models to commercial ones such as GPT-3.5. Even synthetic text generated by different, potentially larger models is classified as training data by the target model. Our findings highlight a serious concern: using synthetic data in membership evaluations may lead to false conclusions about model memorization and data leakage. We caution that this issue could affect other evaluations using model signals such as loss where synthetic or machine-generated translated data substitutes for real-world samples.
摘要：最近的研究表明，针对大型语言模型 (LLM) 的成员推理攻击 (MIA) 会产生不确定的结果，部分原因是难以创建没有时间偏移的非成员数据集。虽然研究人员已经转向使用合成数据作为替代方案，但我们表明这种方法可能从根本上具有误导性。我们的实验表明，MIA 充当机器生成的文本检测器，无论数据源是什么，都会错误地将合成数据识别为训练样本。这种行为存在于不同的模型架构和大小中，从开源模型到 GPT-3.5 等商业模型。即使是由不同的、可能更大的模型生成的合成文本也被目标模型归类为训练数据。我们的研究结果突出了一个严重的问题：在成员资格评估中使用合成数据可能会导致关于模型记忆和数据泄漏的错误结论。我们警告说，这个问题可能会影响使用模型信号的其他评估，例如合成或机器生成的翻译数据替代真实世界样本的损失。

Title: Benchmarking Large Language Models via Random Variables

Authors: Zijin Hong, Hao Wu, Su Dong, Junnan Dong, Yilin Xiao, Yujing Zhang, Zhu Wang, Feiran Huang, Linyi Li, Hongxia Yang, Xiao Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.11790
Pdf URL: https://arxiv.org/pdf/2501.11790
Copy Paste: [[2501.11790]] Benchmarking Large Language Models via Random Variables(https://arxiv.org/abs/2501.11790)
Keywords: language model, llm
Abstract: With the continuous advancement of large language models (LLMs) in mathematical reasoning, evaluating their performance in this domain has become a prominent research focus. Recent studies have raised concerns about the reliability of current mathematical benchmarks, highlighting issues such as simplistic design and potential data leakage. Therefore, creating a reliable benchmark that effectively evaluates the genuine capabilities of LLMs in mathematical reasoning remains a significant challenge. To address this, we propose RV-Bench, a framework for Benchmarking LLMs via Random Variables in mathematical reasoning. Specifically, the background content of a random variable question (RV question) mirrors the original problem in existing standard benchmarks, but the variable combinations are randomized into different values. LLMs must fully understand the problem-solving process for the original problem to correctly answer RV questions with various combinations of variable values. As a result, the LLM's genuine capability in mathematical reasoning is reflected by its accuracy on RV-Bench. Extensive experiments are conducted with 29 representative LLMs across 900+ RV questions. A leaderboard for RV-Bench ranks the genuine capability of these LLMs. Further analysis of the accuracy drop indicates that current LLMs still struggle with complex mathematical reasoning problems.
摘要：随着大型语言模型（LLM）在数学推理领域的不断进步，评估其在该领域的表现已成为一个突出的研究重点。最近的研究引发了人们对当前数学基准可靠性的担忧，突出了诸如设计过于简单和潜在的数据泄露等问题。因此，创建一个可靠的基准来有效评估LLM在数学推理方面的真正能力仍然是一项重大挑战。为了解决这个问题，我们提出了RV-Bench，这是一个通过数学推理中的随机变量对LLM进行基准测试的框架。具体来说，随机变量问题（RV问题）的背景内容反映了现有标准基准中的原始问题，但变量组合被随机化为不同的值。LLM必须充分了解原始问题的解题过程，才能正确回答具有各种变量值组合的RV问题。因此，LLM在数学推理方面的真正能力通过其在RV-Bench上的准确性来反映。对29个代表性LLM进行了广泛的实验，涉及900多个RV问题。 RV-Bench 的排行榜对这些 LLM 的真正能力进行了排名。对准确率下降的进一步分析表明，当前的 LLM 仍然难以解决复杂的数学推理问题。
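
The random-variable idea in the abstract above can be sketched in a few lines: the problem text stays fixed while the variable values are re-sampled, and accuracy is measured over many draws. This is only an illustration under stated assumptions, not the RV-Bench implementation; the `RVQuestion` schema, the train question, and both toy "models" are hypothetical.

```python
import random
import re
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class RVQuestion:
    """A question template whose variables are re-sampled on every draw (hypothetical schema)."""
    template: str
    ranges: Dict[str, Tuple[int, int]]  # inclusive integer range per variable
    solve: Callable[..., int]           # ground-truth solver for any variable values

    def instantiate(self, rng: random.Random) -> Tuple[str, int]:
        # Draw one value per variable, then render the text and compute the answer.
        values = {name: rng.randint(lo, hi) for name, (lo, hi) in self.ranges.items()}
        return self.template.format(**values), self.solve(**values)


def accuracy(question: RVQuestion, model: Callable[[str], int],
             n_samples: int = 50, seed: int = 0) -> float:
    """Fraction of randomized instances that the model answers correctly."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        text, answer = question.instantiate(rng)
        hits += int(model(text) == answer)
    return hits / n_samples


# Hypothetical example question; not taken from any benchmark.
TRAIN = RVQuestion(
    template="A train travels at {speed} km/h for {hours} hours. How many km does it cover?",
    ranges={"speed": (40, 120), "hours": (2, 9)},
    solve=lambda speed, hours: speed * hours,
)


def reasoning_model(text: str) -> int:
    # Stand-in for a model that follows the solution procedure.
    speed, hours = (int(x) for x in re.findall(r"\d+", text))
    return speed * hours


def memorizing_model(text: str) -> int:
    # Stand-in for a model that recalls the answer to one fixed instance (60 km/h, 3 hours).
    return 180
```

Calling `accuracy(TRAIN, reasoning_model)` and `accuracy(TRAIN, memorizing_model)` contrasts the two behaviours: a solver that follows the procedure is unaffected by re-sampling, while a memorized answer is correct only when the sampled values happen to reproduce it.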

Title: Fact-Preserved Personalized News Headline Generation

Authors: Zhao Yang, Junhong Lian, Xiang Ao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.11828
Pdf URL: https://arxiv.org/pdf/2501.11828
Copy Paste: [[2501.11828]] Fact-Preserved Personalized News Headline Generation(https://arxiv.org/abs/2501.11828)
Keywords: prompt
Abstract: Personalized news headline generation, which aims to generate user-specific headlines based on readers' preferences, has recently become a flourishing research direction. Existing studies generally inject a user interest embedding into an encoder-decoder headline generator to make the output personalized, while the factual consistency of the headlines is not adequately verified. In this paper, we propose a framework, Fact-Preserved Personalized News Headline Generation (FPG for short), to prompt a tradeoff between personalization and consistency. In FPG, the similarity between the candidate news to be exposed and the historical clicked news is used to give different levels of attention to key facts in the candidate news, and the similarity scores help to learn a fact-aware global user embedding. Besides, an additional training procedure based on contrastive learning is devised to further enhance the factual consistency of generated headlines. Extensive experiments conducted on a real-world benchmark PENS validate the superiority of FPG, especially on the tradeoff between personalization and factual consistency.
摘要：个性化新闻标题生成是近年来蓬勃发展的一个研究方向，旨在根据读者的喜好生成针对特定用户的新闻标题。现有研究通常将用户兴趣嵌入到编码器/解码器标题生成器中以使输出个性化，但标题的事实一致性尚待验证。在本文中，我们提出了一个基于事实的个性化新闻标题生成框架（Fact-Preserved Personalized News Headline Generation，简称FPG），以在个性化和一致性之间进行权衡。在FPG中，利用待曝光候选新闻与历史点击新闻之间的相似性来对候选新闻中的关键事实给予不同程度的关注，相似度得分有助于学习事实感知的全局用户嵌入。此外，还设计了一种基于对比学习的附加训练程序，以进一步增强所生成标题的事实一致性。在真实世界基准PENS上进行的大量实验验证了FPG的优越性，特别是在个性化和事实一致性之间的权衡方面。

Title: Is your LLM trapped in a Mental Set? Investigative study on how mental sets affect the reasoning capabilities of LLMs

Authors: Saiful Haq, Niyati Chhaya, Piyush Pandey, Pushpak Bhattacharya
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.11833
Pdf URL: https://arxiv.org/pdf/2501.11833
Copy Paste: [[2501.11833]] Is your LLM trapped in a Mental Set? Investigative study on how mental sets affect the reasoning capabilities of LLMs(https://arxiv.org/abs/2501.11833)
Keywords: gpt, llm
Abstract: In this paper, we present an investigative study on how Mental Sets influence the reasoning capabilities of LLMs. LLMs have excelled in diverse natural language processing (NLP) tasks, driven by advancements in parameter-efficient fine-tuning (PEFT) and emergent capabilities like in-context learning (ICL). For complex reasoning tasks, selecting the right model for PEFT or ICL is critical, often relying on scores on benchmarks such as MMLU, MATH, and GSM8K. However, current evaluation methods, based on metrics like F1 Score or reasoning chain assessments by larger models, overlook a key dimension: adaptability to unfamiliar situations and overcoming entrenched thinking patterns. In cognitive psychology, Mental Set refers to the tendency to persist with previously successful strategies, even when they become inefficient - a challenge for problem solving and reasoning. We compare the performance of LLM models like Llama-3.1-8B-Instruct, Llama-3.1-70B-Instruct and GPT-4o in the presence of mental sets. To the best of our knowledge, this is the first study to integrate cognitive psychology concepts into the evaluation of LLMs for complex reasoning tasks, providing deeper insights into their adaptability and problem-solving efficacy.
摘要：在本文中，我们介绍了一项关于心理定势如何影响 LLM 推理能力的调查研究。在参数高效微调 (PEFT) 和情境学习 (ICL) 等新兴能力的推动下，LLM 在各种自然语言处理 (NLP) 任务中表现出色。对于复杂的推理任务，选择正确的 PEFT 或 ICL 模型至关重要，通常依赖于 MMLU、MATH 和 GSM8K 等基准测试的分数。然而，当前的评估方法基于 F1 分数等指标或更大模型的推理链评估，忽略了一个关键维度：适应不熟悉的情况和克服根深蒂固的思维模式。在认知心理学中，心理定势指的是坚持以前成功的策略的倾向，即使它们变得效率低下，这对解决问题和推理来说是一个挑战。我们比较了 Llama-3.1-8B-Instruct、Llama-3.1-70B-Instruct 和 GPT-4o 等 LLM 在心理定势存在下的性能。据我们所知，这是第一项将认知心理学概念融入 LLM 复杂推理任务评估的研究，为其适应性和解决问题的有效性提供了更深入的见解。

Title: Network-informed Prompt Engineering against Organized Astroturf Campaigns under Extreme Class Imbalance

Authors: Nikos Kanakaris, Heng Ping, Xiongye Xiao, Nesreen K. Ahmed, Luca Luceri, Emilio Ferrara, Paul Bogdan
Subjects: cs.CL, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2501.11849
Pdf URL: https://arxiv.org/pdf/2501.11849
Copy Paste: [[2501.11849]] Network-informed Prompt Engineering against Organized Astroturf Campaigns under Extreme Class Imbalance(https://arxiv.org/abs/2501.11849)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Detecting organized political campaigns is of paramount importance in fighting against disinformation on social media. Existing approaches for the identification of such organized actions employ techniques mostly from network science, graph machine learning and natural language processing. Their ultimate goal is to analyze the relationships and interactions (e.g. re-posting) among users and the textual similarities of their posts. Despite their effectiveness in recognizing astroturf campaigns, these methods face significant challenges, notably the class imbalance in available training datasets. To mitigate this issue, recent methods usually resort to data augmentation or increasing the number of positive samples, which may not always be feasible or sufficient in real-world settings. Following a different path, in this paper, we propose a novel framework for identifying astroturf campaigns based solely on large language models (LLMs), introducing a Balanced Retrieval-Augmented Generation (Balanced RAG) component. Our approach first gives both textual information concerning the posts (in our case tweets) and the user interactions of the social network as input to a language model. Then, through prompt engineering and the proposed Balanced RAG method, it effectively detects coordinated disinformation campaigns on X (Twitter). The proposed framework does not require any training or fine-tuning of the language model. Instead, by strategically harnessing the strengths of prompt engineering and Balanced RAG, it facilitates LLMs to overcome the effects of class imbalance and effectively identify coordinated political campaigns. The experimental results demonstrate that by incorporating the proposed prompt engineering and Balanced RAG methods, our framework outperforms the traditional graph-based baselines, achieving 2x-3x improvements in terms of precision, recall and F1 scores.
摘要：发现有组织的政治运动对于打击社交媒体上的虚假信息至关重要。现有的识别此类有组织行动的方法主要采用网络科学、图机器学习和自然语言处理技术。它们的最终目标是分析用户之间的关系和互动（例如转发）以及他们帖子的文本相似性。尽管这些方法在识别伪草根（astroturf）运动方面很有效，但它们面临着重大挑战，尤其是可用训练数据集中的类别不平衡。为了缓解这个问题，最近的方法通常采用数据增强或增加正样本的数量，但这在现实环境中可能并不总是可行或足够的。本文走上了一条不同的道路，我们提出了一个全新的框架，仅基于大型语言模型 (LLM) 来识别伪草根（astroturf）运动，引入了平衡检索增强生成 (Balanced RAG) 组件。我们的方法首先将有关帖子（在我们的例子中是推文）和社交网络用户互动的文本信息作为语言模型的输入。然后，通过提示工程和提出的 Balanced RAG 方法，它可以有效地检测 X（Twitter）上的协调性虚假信息活动。所提出的框架不需要对语言模型进行任何训练或微调。相反，通过战略性地利用提示工程和 Balanced RAG 的优势，它有助于 LLM 克服类别不平衡的影响并有效识别协调的政治运动。实验结果表明，通过结合所提出的提示工程和 Balanced RAG 方法，我们的框架优于传统的基于图的基线，在准确率、召回率和 F1 分数方面实现了 2 到 3 倍的提升。

Title: Cross-Entropy Attacks to Language Models via Rare Event Simulation

Authors: Mingze Ni, Yongshun Gong, Wei Liu
Subjects: cs.CL, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2501.11852
Pdf URL: https://arxiv.org/pdf/2501.11852
Copy Paste: [[2501.11852]] Cross-Entropy Attacks to Language Models via Rare Event Simulation(https://arxiv.org/abs/2501.11852)
Keywords: language model
Abstract: Black-box textual adversarial attacks are challenging due to the lack of model information and the discrete, non-differentiable nature of text. Existing methods often lack versatility for attacking different models, suffer from limited attacking performance due to the inefficient optimization with word saliency ranking, and frequently sacrifice semantic integrity to achieve better attack outcomes. This paper introduces a novel approach to textual adversarial attacks, which we call Cross-Entropy Attacks (CEA), that uses Cross-Entropy optimization to address the above issues. Our CEA approach defines adversarial objectives for both soft-label and hard-label settings and employs CE optimization to identify optimal replacements. Through extensive experiments on document classification and language translation problems, we demonstrate that our attack method excels in terms of attacking performance, imperceptibility, and sentence quality.
摘要：由于缺乏模型信息以及文本的离散、不可微分特性，黑盒文本对抗攻击具有挑战性。现有方法通常缺乏攻击不同模型的通用性，由于对词显着性排名的优化效率低下而导致攻击性能有限，并且经常牺牲语义完整性以获得更好的攻击结果。本文介绍了一种新的文本对抗攻击方法，我们称之为交叉熵攻击 (CEA)，该方法使用交叉熵优化来解决上述问题。我们的 CEA 方法为软标签和硬标签设置定义了对抗目标，并采用 CE 优化来确定最佳替代方案。通过对文档分类和语言翻译问题的大量实验，我们证明了我们的攻击方法在攻击性能、不可感知性和句子质量方面表现出色。

Title: From Drafts to Answers: Unlocking LLM Potential via Aggregation Fine-Tuning

Authors: Yafu Li, Zhilin Wang, Tingchen Fu, Ganqu Cui, Sen Yang, Yu Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.11877
Pdf URL: https://arxiv.org/pdf/2501.11877
Copy Paste: [[2501.11877]] From Drafts to Answers: Unlocking LLM Potential via Aggregation Fine-Tuning(https://arxiv.org/abs/2501.11877)
Keywords: language model, gpt, llm
Abstract: Scaling data and model size has been proven effective for boosting the performance of large language models. In addition to training-time scaling, recent studies have revealed that increasing test-time computational resources can further improve performance. In this work, we introduce Aggregation Fine-Tuning (AFT), a supervised finetuning paradigm where the model learns to synthesize multiple draft responses, referred to as proposals, into a single, refined answer, termed aggregation. At inference time, a propose-and-aggregate strategy further boosts performance by iteratively generating proposals and aggregating them. Empirical evaluations on benchmark datasets show that AFT-trained models substantially outperform standard SFT. Notably, an AFT model, fine-tuned from Llama3.1-8B-Base with only 64k data, achieves a 41.3% LC win rate on AlpacaEval 2, surpassing significantly larger LLMs such as Llama3.1-405B-Instruct and GPT4. By combining sequential refinement and parallel sampling, the propose-and-aggregate framework scales inference-time computation in a flexible manner. Overall, these findings position AFT as a promising approach to unlocking additional capabilities of LLMs without resorting to increasing data volume or model size.
摘要：事实证明，扩展数据和模型大小可有效提升大型语言模型的性能。除了训练时间扩展之外，最近的研究表明，增加测试时间计算资源可以进一步提高性能。在这项工作中，我们引入了聚合微调 (AFT)，这是一种监督式微调范式，其中模型学习将多个草稿响应（称为提案）合成为一个单一的、精炼的答案（称为聚合）。在推理时，提议和聚合策略通过迭代生成提案并聚合它们来进一步提高性能。对基准数据集的实证评估表明，AFT 训练的模型大大优于标准 SFT。值得注意的是，一个仅使用 64k 数据从 Llama3.1-8B-Base 微调的 AFT 模型在 AlpacaEval 2 上实现了 41.3% 的 LC 胜率，超过了 Llama3.1-405B-Instruct 和 GPT4 等明显更大的 LLM。通过结合顺序细化和并行采样，提议和聚合框架可以灵活地扩展推理时间计算。总体而言，这些发现将 AFT 定位为一种有前途的方法，无需增加数据量或模型大小即可解锁 LLM 的额外功能。
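
The propose-and-aggregate loop described above can be outlined as follows. This is a minimal sketch under stated assumptions rather than the authors' code: `propose` and `aggregate` are hypothetical callables standing in for model calls, and feeding each round's aggregate back into the next round's proposals is an assumed way to realize "sequential refinement".

```python
from typing import Callable, List, Optional


def propose_and_aggregate(
    query: str,
    propose: Callable[[str, Optional[str]], str],  # hypothetical: (query, previous answer) -> draft
    aggregate: Callable[[str, List[str]], str],    # hypothetical: (query, drafts) -> refined answer
    n_proposals: int = 3,
    n_rounds: int = 2,
) -> Optional[str]:
    """Alternate parallel sampling (several drafts) with sequential refinement (several rounds)."""
    answer: Optional[str] = None
    for _ in range(n_rounds):
        # Parallel sampling: draw several drafts, each conditioned on the previous aggregate.
        proposals = [propose(query, answer) for _ in range(n_proposals)]
        # Aggregation: merge the drafts into one refined answer for the next round.
        answer = aggregate(query, proposals)
    return answer
```

Inference-time compute in this sketch scales with `n_proposals * n_rounds` calls to `propose` plus `n_rounds` calls to `aggregate`, which is one way to read the abstract's claim that the framework scales computation flexibly.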

Title: Med-R$^2$: Crafting Trustworthy LLM Physicians through Retrieval and Reasoning of Evidence-Based Medicine

Authors: Keer Lu, Zheng Liang, Da Pan, Shusen Zhang, Xin Wu, Weipeng Chen, Zenan Zhou, Guosheng Dong, Bin Cui, Wentao Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.11885
Pdf URL: https://arxiv.org/pdf/2501.11885
Copy Paste: [[2501.11885]] Med-R$^2$: Crafting Trustworthy LLM Physicians through Retrieval and Reasoning of Evidence-Based Medicine(https://arxiv.org/abs/2501.11885)
Keywords: language model, llm
Abstract: In recent years, Large Language Models (LLMs) have exhibited remarkable capabilities in clinical scenarios. However, despite their potential, existing works face challenges when applying LLMs to medical settings. Strategies relying on training with medical datasets are highly cost-intensive and may suffer from outdated training data. Leveraging external knowledge bases is a suitable alternative, yet it faces obstacles such as limited retrieval precision and poor effectiveness in answer extraction. These issues collectively prevent LLMs from demonstrating the expected level of proficiency in mastering medical expertise. To address these challenges, we introduce Med-R^2, a novel LLM physician framework that adheres to the Evidence-Based Medicine (EBM) process, efficiently integrating retrieval mechanisms as well as the selection and reasoning processes of evidence, thereby enhancing the problem-solving capabilities of LLMs in healthcare scenarios and fostering a trustworthy LLM physician. Our comprehensive experiments indicate that Med-R^2 achieves a 14.87% improvement over vanilla RAG methods and even a 3.59% enhancement compared to fine-tuning strategies, without incurring additional training costs.
摘要：近年来，大型语言模型（LLM）在临床场景中展现出了卓越的能力。然而，尽管潜力巨大，现有的研究在将LLM应用于医疗场景时仍面临挑战。依赖医疗数据集进行训练的策略成本高昂，并且可能受到训练数据过时的影响。利用外部知识库是一种合适的替代方案，但它面临着检索精度有限、答案提取效果不佳等障碍。这些问题共同阻碍了LLM在掌握医学专业知识方面表现出预期的熟练程度。为了应对这些挑战，我们引入了Med-R^2，这是一个遵循循证医学（EBM）流程的新型LLM医生框架，有效整合了检索机制以及证据的选择和推理过程，从而增强了LLM在医疗场景中的解决问题的能力，培养了值得信赖的LLM医生。我们全面的实验表明，Med-R^2 比 vanilla RAG 方法实现了 14.87% 的提升，与微调策略相比甚至实现了 3.59% 的提升，而且不会产生额外的训练成本。

Title: Panoramic Interests: Stylistic-Content Aware Personalized Headline Generation

Authors: Junhong Lian, Xiang Ao, Xinyu Liu, Yang Liu, Qing He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.11900
Pdf URL: https://arxiv.org/pdf/2501.11900
Copy Paste: [[2501.11900]] Panoramic Interests: Stylistic-Content Aware Personalized Headline Generation(https://arxiv.org/abs/2501.11900)
Keywords: language model, llm
Abstract: Personalized news headline generation aims to provide users with attention-grabbing headlines that are tailored to their preferences. Prevailing methods focus on user-oriented content preferences, but most of them overlook the fact that diverse stylistic preferences are integral to users' panoramic interests, leading to suboptimal personalization. In view of this, we propose a novel Stylistic-Content Aware Personalized Headline Generation (SCAPE) framework. SCAPE extracts both content and stylistic features from headlines with the aid of large language model (LLM) collaboration. It further adaptively integrates users' long- and short-term interests through a contrastive learning-based hierarchical fusion network. By incorporating the panoramic interests into the headline generator, SCAPE reflects users' stylistic-content preferences during the generation process. Extensive experiments on the real-world dataset PENS demonstrate the superiority of SCAPE over baselines.
摘要：个性化新闻标题生成旨在为用户提供根据其偏好定制的吸引注意力的标题。现行方法侧重于以用户为导向的内容偏好，但大多数方法都忽略了多样化的风格偏好是用户全景兴趣不可或缺的一部分这一事实，导致个性化效果不佳。鉴于此，我们提出了一种新颖的风格内容感知个性化标题生成 (SCAPE) 框架。SCAPE 在大型语言模型 (LLM) 协作的帮助下从标题中提取内容和风格特征。它通过基于对比学习的分层融合网络进一步自适应地整合用户的长期和短期兴趣。通过将全景兴趣纳入标题生成器，SCAPE 在生成过程中反映了用户的风格内容偏好。在真实世界数据集 PENS 上进行的大量实验证明了 SCAPE 优于基线。

Title: HERITAGE: An End-to-End Web Platform for Processing Korean Historical Documents in Hanja

Authors: Seyoung Song, Haneul Yoo, Jiho Jin, Kyunghyun Cho, Alice Oh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.11951
Pdf URL: https://arxiv.org/pdf/2501.11951
Copy Paste: [[2501.11951]] HERITAGE: An End-to-End Web Platform for Processing Korean Historical Documents in Hanja(https://arxiv.org/abs/2501.11951)
Keywords: language model
Abstract: While Korean historical documents are invaluable cultural heritage, understanding those documents requires in-depth Hanja expertise. Hanja is an ancient language used in Korea before the 20th century, whose characters were borrowed from old Chinese but had evolved in Korea for centuries. Modern Koreans and Chinese cannot understand Korean historical documents without substantial additional help, and while previous efforts have produced some Korean and English translations, this requires in-depth expertise, and so most of the documents are not translated into any modern language. To address this gap, we present HERITAGE, the first open-source Hanja NLP toolkit to assist in understanding and translating the unexplored Korean historical documents written in Hanja. HERITAGE is a web-based platform providing model predictions of three critical tasks in historical document understanding via Hanja language models: punctuation restoration, named entity recognition, and machine translation (MT). HERITAGE also provides an interactive glossary, which provides the character-level reading of the Hanja characters in modern Korean, as well as character-level English definition. HERITAGE serves two purposes. First, anyone interested in these documents can get a general understanding from the model predictions and the interactive glossary, especially MT outputs in Korean and English. Second, since the model outputs are not perfect, Hanja experts can revise them to produce better annotations and translations. This would boost the translation efficiency and potentially lead to most of the historical documents being translated into modern languages, lowering the barrier on unexplored Korean historical documents.
摘要：虽然韩国历史文献是无价的文化遗产，但理解这些文献需要深入的汉字专业知识。汉字是 20 世纪前韩国使用的一种古老语言，其字符借用自古中文，但在韩国已经发展了几个世纪。现代韩国人和中国人如果没有大量额外帮助就无法理解韩国历史文献，尽管之前的努力已经产生了一些韩语和英语翻译，但这需要深入的专业知识，因此大多数文献都没有翻译成任何现代语言。为了弥补这一差距，我们推出了 HERITAGE，这是第一个开源汉字 NLP 工具包，用于帮助理解和翻译用汉字书写的未开发的韩国历史文献。HERITAGE 是一个基于 Web 的平台，通过汉字语言模型提供历史文献理解中的三个关键任务的模型预测：标点符号恢复、命名实体识别和机器翻译 (MT)。HERITAGE 还提供了一个交互式词汇表，它提供了现代韩语中汉字字符的字符级阅读以及字符级英语定义。HERITAGE 有两个用途。首先，任何对这些文献感兴趣的人都可以通过模型预测和交互式词汇表（尤其是韩语和英语的机器翻译输出）获得大致了解。其次，由于模型输出并不完美，汉字专家可以对其进行修改，以产生更好的注释和翻译。这将提高翻译效率，并可能导致大多数历史文献被翻译成现代语言，从而降低未开发的韩国历史文献的门槛。

Title: Proverbs Run in Pairs: Evaluating Proverb Translation Capability of Large Language Model

Authors: Minghan Wang, Viet-Thanh Pham, Farhad Moghimifar, Thuy-Trang Vu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.11953
Pdf URL: https://arxiv.org/pdf/2501.11953
Copy Paste: [[2501.11953]] Proverbs Run in Pairs: Evaluating Proverb Translation Capability of Large Language Model(https://arxiv.org/abs/2501.11953)
Keywords: language model, llm
Abstract: Despite achieving remarkable performance, machine translation (MT) research remains underexplored in terms of translating cultural elements in languages, such as idioms, proverbs, and colloquial expressions. This paper investigates the capability of state-of-the-art neural machine translation (NMT) and large language models (LLMs) in translating proverbs, which are deeply rooted in cultural contexts. We construct a translation dataset of standalone proverbs and proverbs in conversation for four language pairs. Our experiments show that the studied models can achieve good translation between languages with similar cultural backgrounds, and LLMs generally outperform NMT models in proverb translation. Furthermore, we find that current automatic evaluation metrics such as BLEU, CHRF++ and COMET are inadequate for reliably assessing the quality of proverb translation, highlighting the need for more culturally aware evaluation metrics.
摘要：尽管取得了显著的成绩，机器翻译 (MT) 在翻译语言中的文化元素（如习语、谚语和口语表达）方面的研究仍然不足。本文研究了最先进的神经机器翻译 (NMT) 和大型语言模型 (LLM) 在翻译深深植根于文化背景的谚语方面的能力。我们为四对语言构建了一个包含独立谚语和对话谚语的翻译数据集。我们的实验表明，所研究的模型可以在具有相似文化背景的语言之间实现良好的翻译，并且 LLM 在谚语翻译方面的表现通常优于 NMT 模型。此外，我们发现当前的自动评估指标（如 BLEU、CHRF++ 和 COMET）不足以可靠地评估谚语翻译的质量，这凸显了对更具文化意识的评估指标的需求。

Title: TAD-Bench: A Comprehensive Benchmark for Embedding-Based Text Anomaly Detection

Authors: Yang Cao, Sikun Yang, Chen Li, Haolong Xiang, Lianyong Qi, Bo Liu, Rongsheng Li, Ming Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.11960
Pdf URL: https://arxiv.org/pdf/2501.11960
Copy Paste: [[2501.11960]] TAD-Bench: A Comprehensive Benchmark for Embedding-Based Text Anomaly Detection(https://arxiv.org/abs/2501.11960)
Keywords: language model
Abstract: Text anomaly detection is crucial for identifying spam, misinformation, and offensive language in natural language processing tasks. Despite the growing adoption of embedding-based methods, their effectiveness and generalizability across diverse application scenarios remain under-explored. To address this, we present TAD-Bench, a comprehensive benchmark designed to systematically evaluate embedding-based approaches for text anomaly detection. TAD-Bench integrates multiple datasets spanning different domains, combining state-of-the-art embeddings from large language models with a variety of anomaly detection algorithms. Through extensive experiments, we analyze the interplay between embeddings and detection methods, uncovering their strengths, weaknesses, and applicability to different tasks. These findings offer new perspectives on building more robust, efficient, and generalizable anomaly detection systems for real-world applications.
摘要：文本异常检测对于在自然语言处理任务中识别垃圾邮件、错误信息和攻击性语言至关重要。尽管基于嵌入的方法越来越多地被采用，但它们在不同应用场景中的有效性和通用性仍未得到充分探索。为了解决这个问题，我们提出了 TAD-Bench，这是一个全面的基准，旨在系统地评估基于嵌入的文本异常检测方法。TAD-Bench 集成了跨不同领域的多个数据集，将大型语言模型的最新嵌入与各种异常检测算法相结合。通过大量实验，我们分析了嵌入和检测方法之间的相互作用，揭示了它们的优点、缺点和对不同任务的适用性。这些发现为构建更强大、更高效、更通用的异常检测系统以用于实际应用提供了新的视角。

Title: A Hybrid Attention Framework for Fake News Detection with Large Language Models

Authors: Xiaochuan Xu, Peiyang Yu, Zeqiu Xu, Jiani Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.11967
Pdf URL: https://arxiv.org/pdf/2501.11967
Copy Paste: [[2501.11967]] A Hybrid Attention Framework for Fake News Detection with Large Language Models(https://arxiv.org/abs/2501.11967)
Keywords: language model, llm
Abstract: With the rapid growth of online information, the spread of fake news has become a serious social challenge. In this study, we propose a novel detection framework based on Large Language Models (LLMs) to identify and classify fake news by integrating textual statistical features and deep semantic features. Our approach utilizes the contextual understanding capability of the large language model for text analysis and introduces a hybrid attention mechanism to focus on feature combinations that are particularly important for fake news identification. Extensive experiments on the WELFake news dataset show that our model significantly outperforms existing methods, with a 1.5% improvement in F1 score. In addition, we assess the interpretability of the model through attention heat maps and SHAP values, providing actionable insights for content review strategies. Our framework provides a scalable and efficient solution to deal with the spread of fake news and helps build a more reliable online information ecosystem.
摘要：随着网络信息的快速增长，虚假新闻的传播已成为一项严峻的社会挑战。在本研究中，我们提出了一种基于大型语言模型 (LLM) 的新型检测框架，通过整合文本统计特征和深度语义特征来识别和分类虚假新闻。我们的方法利用大型语言模型的上下文理解能力进行文本分析，并引入混合注意力机制来关注对虚假新闻识别特别重要的特征组合。在 WELFake 新闻数据集上进行的大量实验表明，我们的模型明显优于现有方法，F1 得分提高了 1.5%。此外，我们通过注意力热图和 SHAP 值评估模型的可解释性，为内容审查策略提供可行的见解。我们的框架提供了一种可扩展且有效的解决方案来应对虚假新闻的传播，并有助于建立更可靠的在线信息生态系统。

Title: Leveraging Graph Structures and Large Language Models for End-to-End Synthetic Task-Oriented Dialogues

Authors: Maya Medjad, Hugo Imbert, Bruno Yun, Raphaël Szymocha, Frédéric Armetta
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.11977
Pdf URL: https://arxiv.org/pdf/2501.11977
Copy Paste: [[2501.11977]] Leveraging Graph Structures and Large Language Models for End-to-End Synthetic Task-Oriented Dialogues(https://arxiv.org/abs/2501.11977)
Keywords: language model, llm, prompt
Abstract: Training task-oriented dialogue systems is both costly and time-consuming, due to the need for high-quality datasets encompassing diverse intents. Traditional methods depend on extensive human annotation, while recent advancements leverage large language models (LLMs) to generate synthetic data. However, these approaches often require custom prompts or code, limiting accessibility for non-technical users. We introduce GraphTOD, an end-to-end framework that simplifies the generation of task-oriented dialogues. Users can create dialogues by specifying transition graphs in JSON format. Our evaluation demonstrates that GraphTOD generates high-quality dialogues across various domains, significantly lowering the cost and complexity of dataset creation.
摘要：由于需要涵盖各种意图的高质量数据集，因此训练面向任务的对话系统既昂贵又耗时。传统方法依赖于大量的人工注释，而最近的进展则利用大型语言模型 (LLM) 来生成合成数据。然而，这些方法通常需要自定义提示或代码，限制了非技术用户的可访问性。我们引入了 GraphTOD，这是一个端到端框架，可简化面向任务的对话的生成。用户可以通过指定 JSON 格式的转换图来创建对话。我们的评估表明，GraphTOD 可跨各个领域生成高质量对话，从而显著降低数据集创建的成本和复杂性。
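
To make "specifying transition graphs in JSON format" concrete, the sketch below walks a small graph to obtain a state/action path that could then be turned into a dialogue by a language model. The JSON schema, the state and action names, and `sample_path` are all hypothetical illustrations; the abstract does not give GraphTOD's actual format.

```python
import json
import random
from typing import Dict, List, Tuple

# Hypothetical transition graph: state -> {user action -> next state}.
GRAPH_JSON = """
{
  "initial_state": "start",
  "transitions": {
    "start": {"ask_for_booking": "collect_date"},
    "collect_date": {"give_date": "confirm"},
    "confirm": {"accept": "end", "change_date": "collect_date"}
  }
}
"""


def sample_path(graph: Dict, rng: random.Random,
                max_steps: int = 20) -> List[Tuple[str, str, str]]:
    """Random walk over the graph, returning (state, action, next_state) triples."""
    state = graph["initial_state"]
    path: List[Tuple[str, str, str]] = []
    for _ in range(max_steps):
        options = graph["transitions"].get(state)
        if not options:  # a state with no outgoing transitions ends the dialogue
            break
        action = rng.choice(sorted(options))  # sorted for reproducibility under a fixed seed
        path.append((state, action, options[action]))
        state = options[action]
    return path


graph = json.loads(GRAPH_JSON)
path = sample_path(graph, random.Random(0))
```

Each triple in `path` could be rendered into a prompt asking a model to write the corresponding user and system turns; that generation step is omitted here.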

Title: MedS$^3$: Towards Medical Small Language Models with Self-Evolved Slow Thinking

Authors: Shuyang Jiang, Yusheng Liao, Zhe Chen, Ya Zhang, Yanfeng Wang, Yu Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.12051
Pdf URL: https://arxiv.org/pdf/2501.12051
Copy Paste: [[2501.12051]] MedS$^3$: Towards Medical Small Language Models with Self-Evolved Slow Thinking(https://arxiv.org/abs/2501.12051)
Keywords: language model, gpt, prompt
Abstract: Medical language models (MLMs) have become pivotal in advancing medical natural language processing. However, prior models that rely on pre-training or supervised fine-tuning often exhibit low data efficiency and limited practicality in real-world clinical applications. While OpenAI's o1 highlights test-time scaling in mathematics, attempts to replicate this approach in medicine typically distill responses from GPT-series models to open-source models, focusing primarily on multiple-choice tasks. This strategy, though straightforward, neglects critical concerns like data privacy and realistic deployment in clinical settings. In this work, we present a deployable, small-scale medical language model, MedS$^3$, designed for long-chain reasoning in clinical tasks using a self-evolution paradigm. Starting with a seed dataset of around 8,000 instances spanning five domains and 16 datasets, we prompt a base policy model to perform Monte Carlo Tree Search (MCTS) to construct verifiable reasoning chains. Each reasoning step is assigned an evolution rollout value, allowing verified trajectories to train the policy model and the reward model. During inference, the policy model generates multiple responses, and the reward model selects the one with the highest reward score. Experiments on eleven evaluation datasets demonstrate that MedS$^3$ outperforms prior open-source models by 2 points, with the addition of the reward model further boosting performance (~13 points), surpassing GPT-4o-mini. Code and data are available at this https URL.
摘要：医学语言模型 (MLM) 已成为推进医学自然语言处理的关键。然而，依赖于预训练或监督微调的先前模型通常表现出较低的数据效率和在现实世界临床应用中的有限实用性。虽然 OpenAI 的 o1 强调数学中的测试时间扩展，但在医学中复制这种方法的尝试通常会将 GPT 系列模型的响应蒸馏到开源模型中，主要侧重于多项选择任务。这种策略虽然简单明了，但却忽略了数据隐私和临床环境中的实际部署等关键问题。在这项工作中，我们提出了一个可部署的小规模医学语言模型 MedS$^3$，它使用自我进化范式为临床任务中的长链推理而设计。从涵盖五个领域和 16 个数据集的约 8,000 个实例的种子数据集开始，我们提示一个基础策略模型执行蒙特卡洛树搜索 (MCTS) 来构建可验证的推理链。每个推理步骤都分配有一个演化推出值，允许经过验证的轨迹训练策略模型和奖励模型。在推理过程中，策略模型会生成多个响应，而奖励模型会选择具有最高奖励分数的响应。在 11 个评估数据集上进行的实验表明，MedS$^3$ 的表现比之前的开源模型高出 2 分，而奖励模型的加入进一步提升了性能（约 13 分），超越了 GPT-4o-mini。代码和数据可在此 https URL 上找到。

Title: Can open source large language models be used for tumor documentation in Germany? -- An evaluation on urological doctors' notes

Authors: Stefan Lenz, Arsenij Ustjanzew, Marco Jeray, Torsten Panholzer
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.12106
Pdf URL: https://arxiv.org/pdf/2501.12106
Copy Paste: [[2501.12106]] Can open source large language models be used for tumor documentation in Germany? -- An evaluation on urological doctors' notes(https://arxiv.org/abs/2501.12106)
Keywords: language model, llm, prompt
Abstract: Tumor documentation in Germany is largely done manually, requiring reading patient records and entering data into structured databases. Large language models (LLMs) could potentially enhance this process by improving efficiency and reliability. This evaluation tests eleven different open source LLMs with sizes ranging from 1-70 billion model parameters on three basic tasks of the tumor documentation process: identifying tumor diagnoses, assigning ICD-10 codes, and extracting the date of first diagnosis. For evaluating the LLMs on these tasks, a dataset of annotated text snippets based on anonymized doctors' notes from urology was prepared. Different prompting strategies were used to investigate the effect of the number of examples in few-shot prompting and to explore the capabilities of the LLMs in general. The models Llama 3.1 8B, Mistral 7B, and Mistral NeMo 12B performed comparably well in the tasks. Models with less extensive training data or having fewer than 7 billion parameters showed notably lower performance, while larger models did not display performance gains. Examples from a different medical domain than urology could also improve the outcome in few-shot prompting, which demonstrates the ability of LLMs to handle tasks needed for tumor documentation. Open source LLMs show a strong potential for automating tumor documentation. Models from 7-12 billion parameters could offer an optimal balance between performance and resource efficiency. With tailored fine-tuning and well-designed prompting, these models might become important tools for clinical documentation in the future. The code for the evaluation is available from this https URL. We also release the dataset as a new valuable resource that addresses the shortage of authentic and easily accessible benchmarks in German-language medical NLP.
摘要：德国的肿瘤记录工作主要以手动方式完成，需要阅读患者记录并将数据输入结构化数据库。大型语言模型 (LLM) 可以通过提高效率和可靠性来增强此过程。本次评估测试了 11 种不同的开源 LLM，其大小从 10 亿到 700 亿模型参数不等，用于肿瘤记录过程的三个基本任务：识别肿瘤诊断、分配 ICD-10 代码和提取首次诊断日期。为了评估 LLM 在这些任务上的表现，我们准备了一个基于匿名泌尿科医生笔记的带注释文本片段数据集。我们使用不同的提示策略来研究小样本提示中示例数量的影响，并探索 LLM 的总体能力。Llama 3.1 8B、Mistral 7B 和 Mistral NeMo 12 B 模型在这些任务中的表现相当出色。训练数据较少或参数少于 70 亿的模型表现出明显较低的性能，而较大的模型没有表现出性能提升。来自泌尿科以外其他医学领域的例子也可以改善小样本提示的结果，这表明 LLM 能够处理肿瘤文档所需的任务。开源 LLM 显示出自动化肿瘤文档的巨大潜力。来自 70 亿到 120 亿个参数的模型可以在性能和资源效率之间实现最佳平衡。通过量身定制的微调和精心设计的提示，这些模型可能成为未来临床文档的重要工具。评估代码可从此 https URL 获得。我们还发布了数据集作为一种新的宝贵资源，以解决德语医学 NLP 中缺乏真实且易于访问的基准的问题。

Title: Improving Influence-based Instruction Tuning Data Selection for Balanced Learning of Diverse Capabilities

Authors: Qirun Dai, Dylan Zhang, Jiaqi W. Ma, Hao Peng
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.12147
Pdf URL: https://arxiv.org/pdf/2501.12147
Copy Paste: [[2501.12147]] Improving Influence-based Instruction Tuning Data Selection for Balanced Learning of Diverse Capabilities(https://arxiv.org/abs/2501.12147)
Keywords: language model, llm
Abstract: Selecting appropriate training data is crucial for effective instruction fine-tuning of large language models (LLMs), which aims to (1) elicit strong capabilities, and (2) achieve balanced performance across a diverse range of tasks. Influence-based methods show promise in achieving (1) by estimating the contribution of each training example to the model's predictions, but often struggle with (2). Our systematic investigation reveals that this underperformance can be attributed to an inherent bias where certain tasks intrinsically have greater influence than others. As a result, data selection is often biased towards these tasks, not only hurting the model's performance on others but also, counterintuitively, harming performance on these high-influence tasks themselves. As a remedy, we propose BIDS, a Balanced and Influential Data Selection algorithm. BIDS first normalizes influence scores of the training data, and then iteratively balances data selection by choosing the training example with the highest influence on the most underrepresented task. Experiments with both Llama-3 and Mistral-v0.3 on seven benchmarks spanning five diverse capabilities show that BIDS consistently outperforms both state-of-the-art influence-based algorithms and other non-influence-based selection frameworks. Surprisingly, training on a 15% subset selected by BIDS can even outperform full-dataset training with a much more balanced performance. Our analysis further highlights the importance of both instance-level normalization and iterative optimization of selected data for balanced learning of diverse capabilities.
摘要：选择合适的训练数据对于大型语言模型 (LLM) 的有效指令微调至关重要，其目的是 (1) 获得强大的能力，以及 (2) 在各种任务中实现平衡的性能。基于影响力的方法通过估计每个训练示例对模型预测的贡献，有望实现 (1)，但往往难以实现 (2)。我们的系统调查显示，这种表现不佳可以归因于固有的偏见，即某些任务本质上比其他任务具有更大的影响力。因此，数据选择往往偏向这些任务，不仅损害了模型在其他任务上的性能，而且与直觉相反，损害了这些高影响力任务本身的性能。作为一种补救措施，我们提出了一种平衡且有影响力的数据选择算法 BIDS。BIDS 首先对训练数据的影响力分数进行归一化，然后通过选择对代表性最低的任务影响最大的训练示例来迭代平衡数据选择。使用 Llama-3 和 Mistral-v0.3 在涵盖五种不同功能的七个基准上进行的实验表明，BIDS 的表现始终优于最先进的基于影响力的算法和其他不基于影响力的选择框架。令人惊讶的是，在 BIDS 选择的 15% 子集上进行训练甚至可以胜过全数据集训练，且性能更加均衡。我们的分析进一步强调了实例级规范化和所选数据的迭代优化对于平衡学习各种功能的重要性。
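
Illustrative sketch (not the authors' implementation): the BIDS abstract above describes two steps, normalizing influence scores and then repeatedly picking the training example with the highest influence on the most underrepresented task. The Python below is a minimal reading of that description under stated assumptions. The input layout (one row per training example, one column per task), the column-wise z-score normalization, and the use of accumulated normalized influence as the measure of "underrepresented" are all assumptions made for illustration; the abstract does not specify them.

```python
from statistics import mean, pstdev


def normalize_columns(influence):
    """Z-score each task column across training examples (assumed normalization)."""
    n_tasks = len(influence[0])
    columns = [[row[t] for row in influence] for t in range(n_tasks)]
    # Fall back to 1.0 when a column is constant, to avoid dividing by zero.
    stats = [(mean(col), pstdev(col) or 1.0) for col in columns]
    return [
        [(row[t] - stats[t][0]) / stats[t][1] for t in range(n_tasks)]
        for row in influence
    ]


def select_balanced(influence, budget):
    """Greedily pick `budget` example indices, always serving the weakest task.

    influence[i][t] is a hypothetical influence score of training example i
    on task t.
    """
    scores = normalize_columns(influence)
    n_tasks = len(scores[0])
    remaining = set(range(len(scores)))
    totals = [0.0] * n_tasks  # normalized influence accumulated per task
    selected = []
    while remaining and len(selected) < budget:
        # Assumed definition: the task with the least accumulated influence.
        weakest = min(range(n_tasks), key=lambda t: totals[t])
        best = max(sorted(remaining), key=lambda i: scores[i][weakest])
        selected.append(best)
        remaining.remove(best)
        for t in range(n_tasks):
            totals[t] += scores[best][t]
    return selected


# Task 0 has much larger raw scores, so a raw top-k would ignore task 1.
toy_influence = [
    [9.0, 0.1],
    [8.0, 0.2],
    [7.0, 0.1],
    [0.5, 0.9],
    [0.4, 0.8],
]
picked = select_balanced(toy_influence, budget=2)
```

On this toy input, ranking by raw scores would take examples 0 and 1, both dominated by task 0, whereas the balanced loop alternates between tasks.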

Title: AdaServe: SLO-Customized LLM Serving with Fine-Grained Speculative Decoding

Authors: Zikun Li, Zhuofu Chen, Remi Delacourt, Gabriele Oliaro, Zeyu Wang, Qinghan Chen, Shuhuai Lin, April Yang, Zhihao Zhang, Zhuoming Chen, Sean Lai, Xupeng Miao, Zhihao Jia
Subjects: cs.CL, cs.AI, cs.DC, cs.LG
Abstract URL: https://arxiv.org/abs/2501.12162
Pdf URL: https://arxiv.org/pdf/2501.12162
Copy Paste: [[2501.12162]] AdaServe: SLO-Customized LLM Serving with Fine-Grained Speculative Decoding(https://arxiv.org/abs/2501.12162)
Keywords: llm
Abstract: This paper introduces AdaServe, the first LLM serving system to support SLO customization through fine-grained speculative decoding. AdaServe leverages the logits of a draft model to predict the speculative accuracy of tokens and employs a theoretically optimal algorithm to construct token trees for verification. To accommodate diverse SLO requirements without compromising throughput, AdaServe employs a speculation-and-selection scheme that first constructs candidate token trees for each request and then dynamically selects tokens to meet individual SLO constraints while optimizing throughput. Comprehensive evaluations demonstrate that AdaServe achieves up to 73% higher SLO attainment and 74% higher goodput compared to state-of-the-art systems. These results underscore AdaServe's potential to enhance the efficiency and adaptability of LLM deployments across varied application scenarios.
摘要：本文介绍了 AdaServe，这是第一个通过细粒度推测解码支持 SLO 定制的 LLM 服务系统。AdaServe 利用草稿模型的逻辑来预测令牌的推测准确性，并采用理论上最优的算法来构建令牌树进行验证。为了在不影响吞吐量的情况下满足不同的 SLO 要求，AdaServe 采用了一种推测和选择方案，首先为每个请求构建候选令牌树，然后动态选择令牌以满足各个 SLO 约束，同时优化吞吐量。综合评估表明，与最先进的系统相比，AdaServe 的 SLO 实现率提高了 73%，吞吐量提高了 74%。这些结果强调了 AdaServe 有可能提高不同应用场景中 LLM 部署的效率和适应性。
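
Illustrative sketch (not the AdaServe algorithm): the abstract above describes building candidate token trees per request and then selecting tokens to satisfy per-request SLOs while optimizing throughput. The Python below shows one plausible shape for such a selection step. Everything specific is an assumption: the `Node` type, the `min_tokens` map (a hypothetical count of tokens each request needs in this step to stay within its SLO), the use of the product of draft-model probabilities as the estimate of acceptance, and the best-first strategy itself. The paper's own tree construction and selection algorithm may differ.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Node:
    token: str
    prob: float  # draft-model probability of this token given its parent path
    children: list = field(default_factory=list)


def select_tokens(trees, budget, min_tokens):
    """Pick at most `budget` tree nodes across all requests.

    trees: request id -> list of root Nodes (candidate next tokens).
    min_tokens: request id -> tokens reserved for that request (hypothetical).
    Returns request id -> list of (token path, estimated acceptance probability).
    """
    counter = 0  # unique tie-breaker so heap entries never compare Nodes
    frontiers = {}
    for rid, roots in trees.items():
        heap = []
        for node in roots:
            heap.append((-node.prob, counter, node, (node.token,)))
            counter += 1
        heapq.heapify(heap)
        frontiers[rid] = heap
    chosen = {rid: [] for rid in trees}

    def take(rid):
        nonlocal counter, budget
        neg_p, _, node, path = heapq.heappop(frontiers[rid])
        chosen[rid].append((path, -neg_p))
        budget -= 1
        # A child only becomes selectable after its parent, so each request's
        # selection is always a connected subtree.
        for child in node.children:
            entry = (neg_p * child.prob, counter, child, path + (child.token,))
            heapq.heappush(frontiers[rid], entry)
            counter += 1

    # Phase 1: reserve tokens for requests with tight SLOs.
    for rid in trees:
        while len(chosen[rid]) < min_tokens.get(rid, 0) and frontiers[rid] and budget > 0:
            take(rid)

    # Phase 2: spend the remaining budget on the most likely tokens overall.
    while budget > 0:
        live = [rid for rid in frontiers if frontiers[rid]]
        if not live:
            break
        take(min(live, key=lambda r: frontiers[r][0][0]))
    return chosen


toy_trees = {
    "a": [Node("the", 0.9, [Node("cat", 0.8), Node("dog", 0.1)]), Node("a", 0.1)],
    "b": [Node("x", 0.3, [Node("y", 0.5)])],
}
result = select_tokens(toy_trees, budget=4, min_tokens={"b": 2})
```

In this toy run, request "b" receives its two reserved tokens first even though its candidates are less likely, and the two remaining slots go to the highest-probability path of request "a".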

Title: Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement

Authors: Maosong Cao, Taolin Zhang, Mo Li, Chuyu Zhang, Yunxin Liu, Haodong Duan, Songyang Zhang, Kai Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.12273
Pdf URL: https://arxiv.org/pdf/2501.12273
Copy Paste: [[2501.12273]] Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement(https://arxiv.org/abs/2501.12273)
Keywords: language model, llm
Abstract: The quality of Supervised Fine-Tuning (SFT) data plays a critical role in enhancing the conversational capabilities of Large Language Models (LLMs). However, as LLMs become more advanced, the availability of high-quality human-annotated SFT data has become a significant bottleneck, necessitating a greater reliance on synthetic training data. In this work, we introduce Condor, a novel two-stage synthetic data generation framework that incorporates World Knowledge Tree and Self-Reflection Refinement to produce high-quality SFT data at scale. Our experimental results demonstrate that a base model fine-tuned on only 20K Condor-generated samples achieves superior performance compared to counterparts. The additional refinement stage in Condor further enables iterative self-improvement for LLMs at various scales (up to 72B), validating the effectiveness of our approach. Furthermore, our investigation into the scaling for synthetic data in post-training reveals substantial unexplored potential for performance improvements, opening promising avenues for future research.
摘要：监督微调 (SFT) 数据的质量在增强大型语言模型 (LLM) 的对话能力方面起着至关重要的作用。然而，随着 LLM 变得越来越先进，高质量人工注释的 SFT 数据的可用性已成为一个重大瓶颈，因此需要更多地依赖合成训练数据。在这项工作中，我们引入了 Condor，这是一种新颖的两阶段合成数据生成框架，它结合了世界知识树和自我反思细化，以大规模生成高质量的 SFT 数据。我们的实验结果表明，仅对 20K Condor 生成的样本进行微调的基础模型与同类模型相比实现了卓越的性能。Condor 中的额外细化阶段进一步实现了不同规模（高达 72B）的 LLM 的迭代自我改进，验证了我们方法的有效性。此外，我们对训练后合成数据的扩展的研究揭示了巨大的未开发的性能改进潜力，为未来的研究开辟了有希望的途径。

Title: Automatic Labelling with Open-source LLMs using Dynamic Label Schema Integration

Authors: Thomas Walshe, Sae Young Moon, Chunyang Xiao, Yawwani Gunawardana, Fran Silavong
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.12332
Pdf URL: https://arxiv.org/pdf/2501.12332
Copy Paste: [[2501.12332]] Automatic Labelling with Open-source LLMs using Dynamic Label Schema Integration(https://arxiv.org/abs/2501.12332)
Keywords: language model, gpt, llm
Abstract: Acquiring labelled training data that meets quantity and quality requirements remains a costly task in real-world machine learning projects. Recently, Large Language Models (LLMs), notably GPT-4, have shown great promise in labelling data with high accuracy. However, privacy and cost concerns prevent the ubiquitous use of GPT-4. In this work, we explore effectively leveraging open-source models for automatic labelling. We identify integrating label schema as a promising technology but find that naively using the label description for classification leads to poor performance on high-cardinality tasks. To address this, we propose Retrieval Augmented Classification (RAC), in which the LLM performs inference for one label at a time using the corresponding label schema; we start with the most related label and iterate until a label is chosen by the LLM. We show that our method, which dynamically integrates label descriptions, leads to performance improvements in labelling tasks. We further show that by focusing only on the most promising labels, RAC can trade off between label quality and coverage, a property we leverage to automatically label our internal datasets.
摘要：在现实世界的机器学习项目中，获取标记的训练数据仍然是一项昂贵的任务，以满足数量和质量要求。最近，大型语言模型 (LLM)，尤其是 GPT-4，在高精度标记数据方面显示出巨大的潜力。然而，隐私和成本问题阻碍了 GPT-4 的广泛使用。在这项工作中，我们探索如何有效地利用开源模型进行自动标记。我们认为集成标签模式是一项很有前途的技术，但发现单纯地使用标签描述进行分类会导致高基数任务的性能不佳。为了解决这个问题，我们提出了检索增强分类 (RAC)，其中 LLM 使用相应的标签模式一次对一个标签执行推理；我们从最相关的标签开始，然后迭代，直到 LLM 选择一个标签。我们表明，我们的方法动态地集成了标签描述，可以提高标记任务的性能。我们进一步表明，通过只关注最有前途的标签，RAC 可以在标签质量和覆盖率之间进行权衡——我们利用这一特性来自动标记我们的内部数据集。
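
Illustrative sketch (not the authors' implementation): the RAC abstract above describes ranking labels by relatedness, asking the LLM about one label at a time with that label's description, and stopping at the first label the LLM accepts. The Python below is a minimal version of that loop under stated assumptions. The word-overlap ranking is a stand-in for whatever retriever the paper uses, `ask_llm` is a hypothetical callable returning True or False, and `max_labels` is an assumed knob for the quality versus coverage trade-off (returning `None` means the item is left unlabelled).

```python
def rank_labels(text, schema):
    """Order labels by word overlap with their descriptions (stand-in retriever)."""
    words = set(text.lower().split())

    def overlap(label):
        return len(words & set(schema[label].lower().split()))

    # sorted() is stable, so tied labels keep their order from the schema.
    return sorted(schema, key=overlap, reverse=True)


def classify(text, schema, ask_llm, max_labels=None):
    """Return the first label the LLM accepts, or None to abstain.

    schema: label -> natural-language description of the label.
    ask_llm(text, label, description) -> bool is a hypothetical LLM call.
    """
    candidates = rank_labels(text, schema)
    if max_labels is not None:
        # Fewer candidates: fewer items labelled, but only with likely labels.
        candidates = candidates[:max_labels]
    for label in candidates:
        if ask_llm(text, label, schema[label]):
            return label
    return None


toy_schema = {
    "billing": "questions about invoice payment refund charges",
    "shipping": "delivery tracking package shipping delay",
    "account": "login password account access",
}
toy_text = "my package delivery is late"


def accepts_shipping(text, label, description):
    """Fake LLM that only ever accepts the shipping label."""
    return label == "shipping"


def accepts_account(text, label, description):
    """Fake LLM that only ever accepts the account label."""
    return label == "account"
```

With the toy schema, `classify(toy_text, toy_schema, accepts_shipping)` stops at the top-ranked label, while restricting `max_labels` to 1 with `accepts_account` makes the classifier abstain instead of reaching a low-ranked label.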