2025-02-25

Title: Integrating Domain Knowledge into Large Language Models for Enhanced Fashion Recommendations

Authors: Zhan Shi, Shanglin Yang
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2502.15696
Pdf URL: https://arxiv.org/pdf/2502.15696
Copy Paste: [[2502.15696]] Integrating Domain Knowledge into Large Language Models for Enhanced Fashion Recommendations(https://arxiv.org/abs/2502.15696)
Keywords: language model, llm, prompt
Abstract: Fashion, deeply rooted in sociocultural dynamics, evolves as individuals emulate styles popularized by influencers and iconic figures. In the quest to replicate such refined tastes using artificial intelligence, traditional fashion ensemble methods have primarily used supervised learning to imitate the decisions of style icons, which falter when faced with distribution shifts, leading to style replication discrepancies triggered by slight variations in input. Meanwhile, large language models (LLMs) have become prominent across various sectors, recognized for their user-friendly interfaces, strong conversational skills, and advanced reasoning capabilities. To address these challenges, we introduce the Fashion Large Language Model (FLLM), which employs auto-prompt generation training strategies to enhance its capacity for delivering personalized fashion advice while retaining essential domain knowledge. Additionally, by integrating a retrieval augmentation technique during inference, the model can better adjust to individual preferences. Our results show that this approach surpasses existing models in accuracy, interpretability, and few-shot learning capabilities.
摘要：随着个人效仿由影响者和标志性人物普及的样式，时尚植根于社会文化动态。为了使用人工智能复制这种精致的口味，传统的时尚合奏方法主要使用了监督的学习来模仿样式图标的决策，当面对分配变化时，这些方法会使风格复制差异，从而导致风格复制差异，这是由于轻微的输入变化而触发的。同时，大型语言模型（LLMS）在各个领域都变得突出，以用户友好的界面，强大的对话技能和高级推理能力认可。为了应对这些挑战，我们介绍了时尚大语言模型（FLLM），该模型采用自动推出的生成培训策略来增强其在保留基本领域知识的同时提供个性化时尚建议的能力。此外，通过在推断期间集成检索扩展技术，该模型可以更好地适应个人偏好。我们的结果表明，这种方法在准确性，可解释性和很少的学习能力方面超过了现有模型。

Title: Town Hall Debate Prompting: Enhancing Logical Reasoning in LLMs through Multi-Persona Interaction

Authors: Vivaan Sandwar, Bhav Jain, Rishan Thangaraj, Ishaan Garg, Michael Lam, Kevin Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.15725
Pdf URL: https://arxiv.org/pdf/2502.15725
Copy Paste: [[2502.15725]] Town Hall Debate Prompting: Enhancing Logical Reasoning in LLMs through Multi-Persona Interaction(https://arxiv.org/abs/2502.15725)
Keywords: language model, gpt, llm, prompt
Abstract: Debate is a commonly used form of human communication catered towards problem-solving because of its efficiency. Debate fundamentally allows multiple viewpoints to be brought up in problem-solving, and for complex problems, each viewpoint opens a new path for problem-solving. In this work, we apply this concept to LLM decision-making by proposing town hall-style debate prompting (THDP), a prompting method that splices a language model into multiple personas that will debate one another to reach a conclusion. Our experimental pipeline varies both the number of personas and the personality types of each persona to find the optimum town hall size and personality for benchmark performance as measured by ZebraLogic bench, a reasoning-intensive benchmark characterized by both multiple-choice and fill-in-the-blank questions. Our experimental results demonstrate that a town hall size of 5 personas with LLM-determined personality types performs optimally on ZebraLogic, achieving a 13\% improvement over one-shot CoT baselines in per-cell accuracy in GPT-4o, 9% puzzle accuracy increase in Claude 3.5 Sonnet, and an improvement in hard puzzle accuracy from 10-15%.
摘要：辩论是人类交流的一种常用形式，由于其效率而适合解决问题。从根本上讲，辩论允许在解决问题的问题中提出多个观点，并且对于复杂的问题，每个观点都为解决问题的新途径打开了新的途径。在这项工作中，我们通过提出市政厅式的辩论提示（THDP）将此概念应用于LLM决策，这是一种提示方法，将语言模型拼接到多个角色中，可以互相辩论以达到结论。我们的实验管道各种角色的角色数量和人格类型都不同，以找到由Zebralogic Bench衡量的最佳市政厅的大小和个性，以实现基准表现，这是一种以多项选择和填充为特征的推理密集型基准测试。空白的问题。我们的实验结果表明，一个具有LLM确定个性类型的5个角色的市政厅大小在Zebralogic方面表现最佳，在GPT-4O中的每单元素准确性方面的每电池准确性比一次性的COT基本线取得了13 \％的改善，9％的难题的精度提高了9％在Claude 3.5十四行诗中，硬性难题准确性的提高了10-15％。

Title: On the Effectiveness of Large Language Models in Automating Categorization of Scientific Texts

Authors: Gautam Kishore Shahi, Oliver Hummel
Subjects: cs.CL, cs.DL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.15745
Pdf URL: https://arxiv.org/pdf/2502.15745
Copy Paste: [[2502.15745]] On the Effectiveness of Large Language Models in Automating Categorization of Scientific Texts(https://arxiv.org/abs/2502.15745)
Keywords: language model, llm
Abstract: The rapid advancement of Large Language Models (LLMs) has led to a multitude of application opportunities. One traditional task for Information Retrieval systems is the summarization and classification of texts, both of which are important for supporting humans in navigating large literature bodies as they e.g. exist with scientific publications. Due to this rapidly growing body of scientific knowledge, recent research has been aiming at building research information systems that not only offer traditional keyword search capabilities, but also novel features such as the automatic detection of research areas that are present at knowledge intensive organizations in academia and industry. To facilitate this idea, we present the results obtained from evaluating a variety of LLMs in their ability to sort scientific publications into hierarchical classifications systems. Using the FORC dataset as ground truth data, we have found that recent LLMs (such as Meta Llama 3.1) are able to reach an accuracy of up to 0.82, which is up to 0.08 better than traditional BERT models.
摘要：大型语言模型（LLM）的快速发展导致了许多应用程序机会。信息检索系统的一项传统任务是对文本的汇总和分类，这对于支持人类在大型文学机构中的浏览至关重要，例如。存在科学出版物。由于科学知识的迅速发展，最近的研究一直旨在建立研究信息系统，不仅提供传统的关键字搜索功能，而且还提供新的功能，例如自动检测学术界知识组织中存在的研究领域和行业。为了促进这一想法，我们介绍了通过评估各种LLM的能力将科学出版物分类为层次分类系统的能力而获得的结果。我们发现，使用FORC数据集作为地面真相数据，我们发现最近的LLM（例如Meta Llama 3.1）能够达到高达0.82的准确性，比传统的BERT模型高达0.08。

Title: Zero-Shot Commonsense Validation and Reasoning with Large Language Models: An Evaluation on SemEval-2020 Task 4 Dataset

Authors: Rawand Alfugaha, Mohammad AL-Smadi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15810
Pdf URL: https://arxiv.org/pdf/2502.15810
Copy Paste: [[2502.15810]] Zero-Shot Commonsense Validation and Reasoning with Large Language Models: An Evaluation on SemEval-2020 Task 4 Dataset(https://arxiv.org/abs/2502.15810)
Keywords: language model, llm, prompt
Abstract: This study evaluates the performance of Large Language Models (LLMs) on SemEval-2020 Task 4 dataset, focusing on commonsense validation and explanation. Our methodology involves evaluating multiple LLMs, including LLaMA3-70B, Gemma2-9B, and Mixtral-8x7B, using zero-shot prompting techniques. The models are tested on two tasks: Task A (Commonsense Validation), where models determine whether a statement aligns with commonsense knowledge, and Task B (Commonsense Explanation), where models identify the reasoning behind implausible statements. Performance is assessed based on accuracy, and results are compared to fine-tuned transformer-based models. The results indicate that larger models outperform previous models and perform closely to human evaluation for Task A, with LLaMA3-70B achieving the highest accuracy of 98.40% in Task A whereas, lagging behind previous models with 93.40% in Task B. However, while models effectively identify implausible statements, they face challenges in selecting the most relevant explanation, highlighting limitations in causal and inferential reasoning.
摘要：这项研究评估了大语言模型（LLM）在Semeval-2020任务4数据集上的性能，重点是常识验证和解释。我们的方法涉及使用零拍摄提示技术评估多个LLM，包括Llama3-70B，Gemma2-9b和Mixtral-8x7b。对两个任务进行了测试：任务A（常识验证），其中模型确定陈述是否与常识性知识和任务B（常识说明）相一致，其中模型在其中确定了令人难以置信的语句背后的推理。根据精度评估性能，并将结果与基于微调的变压器模型进行比较。结果表明，较大的模型的表现优于先前的模型，并且与人体评估A的表现紧密相关，而Llama3-70B在任务A中达到了98.40％的最高准确度，而落后于任务B中的93.40％的模型。但是，模型。他们有效地识别出令人难以置信的陈述，在选择最相关的解释时面临挑战，突出了因果和推论推理的局限性。

Title: Tabular Embeddings for Tables with Bi-Dimensional Hierarchical Metadata and Nesting

Authors: Gyanendra Shrestha, Chutain Jiang, Sai Akula, Vivek Yannam, Anna Pyayt, Michael Gubanov
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.15819
Pdf URL: https://arxiv.org/pdf/2502.15819
Copy Paste: [[2502.15819]] Tabular Embeddings for Tables with Bi-Dimensional Hierarchical Metadata and Nesting(https://arxiv.org/abs/2502.15819)
Keywords: gpt, llm
Abstract: Embeddings serve as condensed vector representations for real-world entities, finding applications in Natural Language Processing (NLP), Computer Vision, and Data Management across diverse downstream tasks. Here, we introduce novel specialized embeddings optimized, and explicitly tailored to encode the intricacies of complex 2-D context in tables, featuring horizontal, vertical hierarchical metadata, and nesting. To accomplish that we define the Bi-dimensional tabular coordinates, separate horizontal, vertical metadata and data contexts by introducing a new visibility matrix, encode units and nesting through the embeddings specifically optimized for mimicking intricacies of such complex structured data. Through evaluation on 5 large-scale structured datasets and 3 popular downstream tasks, we observed that our solution outperforms the state-of-the-art models with the significant MAP delta of up to 0.28. GPT-4 LLM+RAG slightly outperforms us with MRR delta of up to 0.1, while we outperform it with the MAP delta of up to 0.42.
摘要：嵌入是现实世界实体的凝结向量表示，在不同的下游任务中找到自然语言处理（NLP），计算机视觉和数据管理的应用。在这里，我们介绍了针对表格中的复杂2-D上下文的复杂性进行优化的新型专业嵌入式，并量身定制，以水平，垂直层次元素元素和嵌套。为此，我们通过引入一个新的可见性矩阵，编码单元并通过嵌入，专门针对模仿这种复杂结构化数据的复杂性来定义了双维表坐标，单独的水平水平，垂直元数据和数据上下文。通过评估5个大尺度结构化数据集和3个流行的下游任务，我们观察到，我们的解决方案的表现优于最先进的模型，其显着地图Delta高达0.28。 GPT-4 LLM+RAG的表现略高于我们的MRR Delta高达0.1，而我们的MAP DELTA的表现高达0.42。

Title: Towards Robust ESG Analysis Against Greenwashing Risks: Aspect-Action Analysis with Cross-Category Generalization

Authors: Keane Ong, Rui Mao, Deeksha Varshney, Erik Cambria, Gianmarco Mengaldo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15821
Pdf URL: https://arxiv.org/pdf/2502.15821
Copy Paste: [[2502.15821]] Towards Robust ESG Analysis Against Greenwashing Risks: Aspect-Action Analysis with Cross-Category Generalization(https://arxiv.org/abs/2502.15821)
Keywords: llm
Abstract: Sustainability reports are key for evaluating companies' environmental, social and governance, ESG performance, but their content is increasingly obscured by greenwashing - sustainability claims that are misleading, exaggerated, and fabricated. Yet, existing NLP approaches for ESG analysis lack robustness against greenwashing risks, often extracting insights that reflect misleading or exaggerated sustainability claims rather than objective ESG performance. To bridge this gap, we introduce A3CG - Aspect-Action Analysis with Cross-Category Generalization, as a novel dataset to improve the robustness of ESG analysis amid the prevalence of greenwashing. By explicitly linking sustainability aspects with their associated actions, A3CG facilitates a more fine-grained and transparent evaluation of sustainability claims, ensuring that insights are grounded in verifiable actions rather than vague or misleading rhetoric. Additionally, A3CG emphasizes cross-category generalization. This ensures robust model performance in aspect-action analysis even when companies change their reports to selectively favor certain sustainability areas. Through experiments on A3CG, we analyze state-of-the-art supervised models and LLMs, uncovering their limitations and outlining key directions for future research.
摘要：可持续性报告是评估公司的环境，社会和治理，ESG绩效的关键，但是它们的内容越来越被绿化 - 可持续性索赔所掩盖，这些声明具有误导，夸张和捏造。然而，现有的NLP方法进行ESG分析的方法缺乏稳健性，并且经常提取见证人的见解，这些见解反映出误导性或夸张的可持续性主张，而不是客观的ESG绩效。为了弥合这一差距，我们引入了A3CG - 具有跨类别概括的方面分析，作为一种新型数据集，以在绿色洗涤率的流行率中提高ESG分析的鲁棒性。通过将可持续性方面明确联系起来，A3CG促进了对可持续性主张的更细粒度和透明的评估，以确保洞察力以可验证的行动为基础，而不是模糊或误导性的言论。此外，A3CG强调跨类别概括。即使公司更改报告以选择性地支持某些可持续性领域，这也可以确保在方面分析中的稳健模型性能。通过对A3CG的实验，我们分析了最先进的监督模型和LLM，发现了它们的局限性，并概述了未来研究的关键方向。

Title: CoME: An Unlearning-based Approach to Conflict-free Model Editing

Authors: Dahyun Jung, Jaehyung Seo, Jaewook Lee, Chanjun Park, Heuiseok Lim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15826
Pdf URL: https://arxiv.org/pdf/2502.15826
Copy Paste: [[2502.15826]] CoME: An Unlearning-based Approach to Conflict-free Model Editing(https://arxiv.org/abs/2502.15826)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) often retain outdated or incorrect information from pre-training, which undermines their reliability. While model editing methods have been developed to address such errors without full re-training, they frequently suffer from knowledge conflicts, where outdated information interferes with new knowledge. In this work, we propose Conflict-free Model Editing (CoME), a novel framework that enhances the accuracy of knowledge updates in LLMs by selectively removing outdated knowledge. CoME leverages unlearning to mitigate knowledge interference, allowing new information to be integrated without compromising relevant linguistic features. Through experiments on GPT-J and LLaMA-3 using Counterfact and ZsRE datasets, we demonstrate that CoME improves both editing accuracy and model reliability when applied to existing editing methods. Our results highlight that the targeted removal of outdated knowledge is crucial for enhancing model editing effectiveness and maintaining the model's generative performance.
摘要：大型语言模型（LLMS）通常会保留前培训中过时或不正确的信息，从而破坏其可靠性。尽管已经开发出模型编辑方法来解决此类错误而没有进行全面重新训练，但它们经常遭受知识冲突的困扰，而过时的信息会干扰新知识。在这项工作中，我们提出了无冲突模型编辑（COME），这是一个新颖的框架，通过选择性地删除过时的知识来增强LLM中知识更新的准确性。来利用学习来减轻知识干扰，可以整合新信息而不会损害相关的语言特征。通过使用RunterFact和ZSRE数据集在GPT-J和Llama-3上进行实验，我们证明，当应用于现有编辑方法时，它可以提高编辑精度和模型可靠性。我们的结果强调，有针对性的去除过时的知识对于增强模型编辑效果并保持模型的生成性能至关重要。

Title: Pragmatic Reasoning improves LLM Code Generation

Authors: Zhuchen Cao, Sven Apel, Adish Singla, Vera Demberg
Subjects: cs.CL, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2502.15835
Pdf URL: https://arxiv.org/pdf/2502.15835
Copy Paste: [[2502.15835]] Pragmatic Reasoning improves LLM Code Generation(https://arxiv.org/abs/2502.15835)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated impressive potential in translating natural language (NL) instructions into program code. However, user instructions often contain inherent ambiguities, making it challenging for LLMs to generate code that accurately reflects the user's true intent. To address this challenge, researchers have proposed to produce multiple candidates of the program code and then rerank them to identify the best solution. In this paper, we propose CodeRSA, a novel code candidate reranking mechanism built upon the Rational Speech Act (RSA) framework, designed to guide LLMs toward more comprehensive pragmatic reasoning about user intent. We evaluate CodeRSA using one of the latest LLMs on a popular code generation dataset. Our experiment results show that CodeRSA consistently outperforms common baselines, surpasses the state-of-the-art approach in most cases, and demonstrates robust overall performance. These findings underscore the effectiveness of integrating pragmatic reasoning into code candidate reranking, offering a promising direction for enhancing code generation quality in LLMs.
摘要：大型语言模型（LLM）在将自然语言（NL）指令转换为程序代码方面表现出了令人印象深刻的潜力。但是，用户说明通常包含固有的歧义，这使得LLMS生成准确反映用户真正意图的代码具有挑战性。为了应对这一挑战，研究人员提出了生产程序代码的多个候选者，然后将其重新确定以确定最佳解决方案。在本文中，我们提出了基于《理性语音法》（RSA）框架的新型代码候选机制Codersa，旨在指导LLMS对用户意图进行更全面的务实推理。我们使用流行的代码生成数据集上的最新LLM中评估Codersa。我们的实验结果表明，Codersa始终胜过公共基线，在大多数情况下超过了最新方法，并证明了稳健的总体表现。这些发现强调了将务实推理整合到代码候选者重新骑行中的有效性，从而为提高LLM中代码生成质量提供了有希望的方向。

Title: Soft Token Attacks Cannot Reliably Audit Unlearning in Large Language Models

Authors: Haokun Chen, Sebastian Szyller, Weilin Xu, Nageen Himayat
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15836
Pdf URL: https://arxiv.org/pdf/2502.15836
Copy Paste: [[2502.15836]] Soft Token Attacks Cannot Reliably Audit Unlearning in Large Language Models(https://arxiv.org/abs/2502.15836)
Keywords: language model, llm
Abstract: Large language models (LLMs) have become increasingly popular. Their emergent capabilities can be attributed to their massive training datasets. However, these datasets often contain undesirable or inappropriate content, e.g., harmful texts, personal information, and copyrighted material. This has promoted research into machine unlearning that aims to remove information from trained models. In particular, approximate unlearning seeks to achieve information removal by strategically editing the model rather than complete model retraining. Recent work has shown that soft token attacks (STA) can successfully extract purportedly unlearned information from LLMs, thereby exposing limitations in current unlearning methodologies. In this work, we reveal that STAs are an inadequate tool for auditing unlearning. Through systematic evaluation on common unlearning benchmarks (Who Is Harry Potter? and TOFU), we demonstrate that such attacks can elicit any information from the LLM, regardless of (1) the deployed unlearning algorithm, and (2) whether the queried content was originally present in the training corpus. Furthermore, we show that STA with just a few soft tokens (1-10) can elicit random strings over 400-characters long. Thus showing that STAs are too powerful, and misrepresent the effectiveness of the unlearning methods. Our work highlights the need for better evaluation baselines, and more appropriate auditing tools for assessing the effectiveness of unlearning in LLMs.
摘要：大型语言模型（LLM）变得越来越流行。他们的紧急功能可以归因于他们的大规模培训数据集。但是，这些数据集通常包含不良或不适当的内容，例如有害文本，个人信息和受版权保护的材料。这促进了旨在从训练有素的模型中删除信息的机器研究的研究。特别是，近似学者试图通过战略性编辑模型而不是完整的模型再培训来实现信息删除。最近的工作表明，软令牌攻击（STA）可以从LLMS中成功提取据称是未读取的信息，从而暴露于当前未学习方法中的局限性。在这项工作中，我们揭示了Stas是审核未学习的工具不足。通过对普通学习基准测试的系统评估（谁是哈利·波特？和豆腐），我们证明了这种攻击可以从LLM中获取任何信息，无论（1）已部署的未读取算法以及（2）查询内容是否最初是最初的内容出现在培训语料库中。此外，我们表明只有少数柔软令牌（1-10）的STA可以引起400个字符的随机字符串。因此表明Stas太强大了，并且歪曲了学习方法的有效性。我们的工作强调了需要更好的评估基线的需求，以及更合适的审核工具来评估LLMS中学习的有效性。

Title: Hallucination Detection in Large Language Models with Metamorphic Relations

Authors: Borui Yang, Md Afif Al Mamun, Jie M. Zhang, Gias Uddin
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.15844
Pdf URL: https://arxiv.org/pdf/2502.15844
Copy Paste: [[2502.15844]] Hallucination Detection in Large Language Models with Metamorphic Relations(https://arxiv.org/abs/2502.15844)
Keywords: language model, gpt, llm, hallucination, prompt
Abstract: Large Language Models (LLMs) are prone to hallucinations, e.g., factually incorrect information, in their responses. These hallucinations present challenges for LLM-based applications that demand high factual accuracy. Existing hallucination detection methods primarily depend on external resources, which can suffer from issues such as low availability, incomplete coverage, privacy concerns, high latency, low reliability, and poor scalability. There are also methods depending on output probabilities, which are often inaccessible for closed-source LLMs like GPT models. This paper presents MetaQA, a self-contained hallucination detection approach that leverages metamorphic relation and prompt mutation. Unlike existing methods, MetaQA operates without any external resources and is compatible with both open-source and closed-source LLMs. MetaQA is based on the hypothesis that if an LLM's response is a hallucination, the designed metamorphic relations will be violated. We compare MetaQA with the state-of-the-art zero-resource hallucination detection method, SelfCheckGPT, across multiple datasets, and on two open-source and two closed-source LLMs. Our results reveal that MetaQA outperforms SelfCheckGPT in terms of precision, recall, and f1 score. For the four LLMs we study, MetaQA outperforms SelfCheckGPT with a superiority margin ranging from 0.041 - 0.113 (for precision), 0.143 - 0.430 (for recall), and 0.154 - 0.368 (for F1-score). For instance, with Mistral-7B, MetaQA achieves an average F1-score of 0.435, compared to SelfCheckGPT's F1-score of 0.205, representing an improvement rate of 112.2%. MetaQA also demonstrates superiority across all different categories of questions.
摘要：大型语言模型（LLM）容易幻觉，例如，在其回答中，实际上是错误的信息。这些幻觉对基于LLM的应用提出了挑战，这些应用需要高度的事实准确性。现有的幻觉检测方法主要取决于外部资源，这可能会遭受诸如低可用性，不完全覆盖范围，隐私问题，高潜伏期，低可靠性和可扩展性差的问题。还有一些方法，具体取决于输出概率，对于GPT型号（例如GPT模型）通常无法访问。本文介绍了Metaqa，这是一种独立的幻觉检测方法，利用变质关系和迅速的突变。与现有方法不同，Metaqa在没有任何外部资源的情况下运行，并且与开源和封闭源LLM兼容。 Metaqa基于以下假设：如果LLM的响应是幻觉，则将违反设计的变质关系。我们将Metaqa与最新的零资源幻觉检测方法，跨多个数据集以及两个开源和两个封闭源LLMS进行了比较。我们的结果表明，从精确度，召回和F1得分方面，Metaqa的表现优于自我检查。对于我们研究的四个LLM，MetaQA的优于自我检查的优于优势范围为0.041-0.113（精确），0.143-0.430（用于召回）和0.154-0.368（对于F1分数）。例如，在Mistral-7b的情况下，Metaqa的平均F1得分为0.435，而SelfCheckCgpt的F1分数为0.205，改善率为112.2％。 Metaqa还表现出所有不同类别的问题的优势。

Title: Verify when Uncertain: Beyond Self-Consistency in Black Box Hallucination Detection

Authors: Yihao Xue, Kristjan Greenewald, Youssef Mroueh, Baharan Mirzasoleiman
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15845
Pdf URL: https://arxiv.org/pdf/2502.15845
Copy Paste: [[2502.15845]] Verify when Uncertain: Beyond Self-Consistency in Black Box Hallucination Detection(https://arxiv.org/abs/2502.15845)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) suffer from hallucination problems, which hinder their reliability in sensitive applications. In the black-box setting, several self-consistency-based techniques have been proposed for hallucination detection. We empirically study these techniques and show that they achieve performance close to that of a supervised (still black-box) oracle, suggesting little room for improvement within this paradigm. To address this limitation, we explore cross-model consistency checking between the target model and an additional verifier LLM. With this extra information, we observe improved oracle performance compared to purely self-consistency-based methods. We then propose a budget-friendly, two-stage detection algorithm that calls the verifier model only for a subset of cases. It dynamically switches between self-consistency and cross-consistency based on an uncertainty interval of the self-consistency classifier. We provide a geometric interpretation of consistency-based hallucination detection methods through the lens of kernel mean embeddings, offering deeper theoretical insights. Extensive experiments show that this approach maintains high detection performance while significantly reducing computational cost.
摘要：大型语言模型（LLMS）遭受幻觉问题，这阻碍了它们在敏感应用中的可靠性。在黑盒环境中，已经提出了几种基于自抗性的技术来进行幻觉检测。我们从经验上研究了这些技术，并表明它们的性能接近受监督的（仍然是黑盒）甲骨文的性能，这表明在此范式中几乎没有改进的空间。为了解决此限制，我们探索了目标模型和其他验证者LLM之间的跨模型一致性检查。借助这些额外的信息，我们观察到与纯粹基于自稳定的方法相比，甲骨文的性能提高了。然后，我们提出了一种预算友好的，两阶段的检测算法，该算法仅针对案例的一部分调用验证器模型。它基于自符抗性分类器的不确定性间隔，在自遇到和跨矛盾之间动态切换。我们通过内核平均嵌入镜头提供了基于一致性的幻觉检测方法的几何解释，从而提供了更深入的理论见解。广泛的实验表明，这种方法保持高检测性能，同时大大降低了计算成本。

Title: Forecasting Frontier Language Model Agent Capabilities

Authors: Govind Pimpale, Axel Højmark, Jérémy Scheurer, Marius Hobbhahn
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15850
Pdf URL: https://arxiv.org/pdf/2502.15850
Copy Paste: [[2502.15850]] Forecasting Frontier Language Model Agent Capabilities(https://arxiv.org/abs/2502.15850)
Keywords: language model, llm, agent
Abstract: As Language Models (LMs) increasingly operate as autonomous agents, accurately forecasting their capabilities becomes crucial for societal preparedness. We evaluate six forecasting methods that predict downstream capabilities of LM agents. We use "one-step" approaches that predict benchmark scores from input metrics like compute or model release date directly or "two-step" approaches that first predict an intermediate metric like the principal component of cross-benchmark performance (PC-1) and human-evaluated competitive Elo ratings. We evaluate our forecasting methods by backtesting them on a dataset of 38 LMs from the OpenLLM 2 leaderboard. We then use the validated two-step approach (Release Date$\to$Elo$\to$Benchmark) to predict LM agent performance for frontier models on three benchmarks: SWE-Bench Verified (software development), Cybench (cybersecurity assessment), and RE-Bench (ML research engineering). Our forecast predicts that by the beginning of 2026, non-specialized LM agents with low capability elicitation will reach a success rate of 54% on SWE-Bench Verified, while state-of-the-art LM agents will reach an 87% success rate. Our approach does not account for recent advances in inference-compute scaling and might thus be too conservative.
摘要：随着语言模型（LMS）越来越多地作为自主代理人运作，准确地预测其能力对于社会准备就至关重要。我们评估了六种预测LM代理下游能力的预测方法。我们使用“一步”方法，可以直接从计算或模型发布日期（Twiptep”方法等输入指标来预测基准分数，或者首先预测中间度量的“两步”方法，例如跨基准性能（PC-1）的主要成分（PC-1）和人类评估的竞争性ELO评分。我们通过在OpenLLM 2排行榜的38 LMS数据集上对预测方法进行了对预测方法的评估。然后，我们使用经过验证的两步方法（发布日期$ \ to $ elo $ \ to $基准）来预测三个基准的Frontier模型的LM代理性能：SWE-Bench验证（软件开发），Cybench（网络安全评估），（网络安全评估），，和Rebench（ML研究工程）。我们的预测预测，到2026年初，在经过验证的SWE-Bench上，具有低功能启发的非专业LM代理将达到54％的成功率，而最先进的LM代理将达到87％的成功率。。我们的方法不能解释推理计算扩展方面的最新进展，因此可能太保守了。

Title: Control Illusion: The Failure of Instruction Hierarchies in Large Language Models

Authors: Yilin Geng, Haonan Li, Honglin Mu, Xudong Han, Timothy Baldwin, Omri Abend, Eduard Hovy, Lea Frermann
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15851
Pdf URL: https://arxiv.org/pdf/2502.15851
Copy Paste: [[2502.15851]] Control Illusion: The Failure of Instruction Hierarchies in Large Language Models(https://arxiv.org/abs/2502.15851)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are increasingly deployed with hierarchical instruction schemes, where certain instructions (e.g., system-level directives) are expected to take precedence over others (e.g., user messages). Yet, we lack a systematic understanding of how effectively these hierarchical control mechanisms work. We introduce a systematic evaluation framework based on constraint prioritization to assess how well LLMs enforce instruction hierarchies. Our experiments across six state-of-the-art LLMs reveal that models struggle with consistent instruction prioritization, even for simple formatting conflicts. We find that the widely-adopted system/user prompt separation fails to establish a reliable instruction hierarchy, and models exhibit strong inherent biases toward certain constraint types regardless of their priority designation. While controlled prompt engineering and model fine-tuning show modest improvements, our results indicate that instruction hierarchy enforcement is not robustly realized, calling for deeper architectural innovations beyond surface-level modifications.
摘要：大型语言模型（LLMS）越来越多地通过分层指令方案部署，其中某些指令（例如，系统级指令）有望优先于其他指令（例如，用户消息）。但是，我们对这些层次控制机制的有效程度缺乏系统的了解。我们介绍了一个基于约束优先级的系统评估框架，以评估LLMS的执行指导层次结构的很好。我们在六个最先进的LLMS进行的实验表明，即使是简单的格式冲突，模型即使是一致的指导优先级。我们发现，广泛的系统/用户提示分离无法建立可靠的指令层次结构，并且模型对某些约束类型表现出强烈的固有偏见，而不管其优先级指定如何。虽然受控的及时工程和模型微调显示出适度的改进，但我们的结果表明，指令层次结构执行并未实现强大的实现，呼吁除了表面级修改以外的更深层次的体系结构创新。

Title: PPC-GPT: Federated Task-Specific Compression of Large Language Models via Pruning and Chain-of-Thought Distillation

Authors: Tao Fan, Guoqiang Ma, Yuanfeng Song, Lixin Fan, Kai Chen, Qiang Yang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.15857
Pdf URL: https://arxiv.org/pdf/2502.15857
Copy Paste: [[2502.15857]] PPC-GPT: Federated Task-Specific Compression of Large Language Models via Pruning and Chain-of-Thought Distillation(https://arxiv.org/abs/2502.15857)
Keywords: language model, gpt, llm, chain-of-thought
Abstract: Compressing Large Language Models (LLMs) into task-specific Small Language Models (SLMs) encounters two significant challenges: safeguarding domain-specific knowledge privacy and managing limited resources. To tackle these challenges, we propose PPC-GPT, a innovative privacy-preserving federated framework specifically designed for compressing LLMs into task-specific SLMs via pruning and Chain-of-Thought (COT) distillation. PPC-GPT works on a server-client federated architecture, where the client sends differentially private (DP) perturbed task-specific data to the server's LLM. The LLM then generates synthetic data along with their corresponding rationales. This synthetic data is subsequently used for both LLM pruning and retraining processes. Additionally, we harness COT knowledge distillation, leveraging the synthetic data to further improve the retraining of structurally-pruned SLMs. Our experimental results demonstrate the effectiveness of PPC-GPT across various text generation tasks. By compressing LLMs into task-specific SLMs, PPC-GPT not only achieves competitive performance but also prioritizes data privacy protection.
摘要：将大型语言模型（LLM）压缩到特定任务的小语言模型（SLM）中遇到了两个重大挑战：保护特定领域的知识隐私并管理有限的资源。为了应对这些挑战，我们提出了PPC-GPT，这是一种创新的保护隐私的联合框架，该框架专门设计用于通过修剪和思想链（COT）蒸馏将LLMS压缩到特定于任务的SLM中。 PPC-GPT在服务器 - 客户联合体系结构上工作，在该体系结构中，客户端将特定于任务特定的任务数据发送到服务器的LLM。然后，LLM生成综合数据及其相应的理由。随后将此合成数据用于LLM修剪和再培训过程。此外，我们利用COT知识蒸馏，利用合成数据进一步改善结构延伸的SLM的重新培训。我们的实验结果证明了PPC-GPT在各种文本生成任务中的有效性。通过将LLMS压缩到特定于任务的SLM中，PPC-GPT不仅可以实现竞争性能，而且还优先考虑数据隐私保护。

Title: Synthetic vs. Gold: The Role of LLM-Generated Labels and Data in Cyberbullying Detection

Authors: Arefeh Kazemi, Sri Balaaji Natarajan Kalaivendan, Joachim Wagner, Hamza Qadeer, Brian Davis
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.15860
Pdf URL: https://arxiv.org/pdf/2502.15860
Copy Paste: [[2502.15860]] Synthetic vs. Gold: The Role of LLM-Generated Labels and Data in Cyberbullying Detection(https://arxiv.org/abs/2502.15860)
Keywords: llm
Abstract: This study investigates the role of LLM-generated synthetic data in cyberbullying detection. We conduct a series of experiments where we replace some or all of the authentic data with synthetic data, or augment the authentic data with synthetic data. We find that synthetic cyberbullying data can be the basis for training a classifier for harm detection that reaches performance close to that of a classifier trained with authentic data. Combining authentic with synthetic data shows improvements over the baseline of training on authentic data alone for the test data for all three LLMs tried. These results highlight the viability of synthetic data as a scalable, ethically viable alternative in cyberbullying detection while emphasizing the critical impact of LLM selection on performance outcomes.
摘要：这项研究研究了LLM生成的合成数据在网络欺凌检测中的作用。我们进行了一系列实验，其中我们用合成数据替换一些或所有真实数据，或者使用合成数据来增强真实数据。我们发现，合成网络欺凌数据可以是训练分类器检测的基础，该损害检测能够达到接近经过精神数据的分类器的绩效。将真实的数据与合成数据相结合，显示了仅在所有三个LLMS试验的测试数据上培训基线的改进。这些结果突出了合成数据作为在网络欺凌检测中的可扩展性，在道德上可行的替代方案的可行性，同时强调了LLM选择对性能结果的关键影响。

Title: MutaGReP: Execution-Free Repository-Grounded Plan Search for Code-Use

Authors: Zaid Khan, Ali Farhadi, Ranjay Krishna, Luca Weihs, Mohit Bansal, Tanmay Gupta
Subjects: cs.CL, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2502.15872
Pdf URL: https://arxiv.org/pdf/2502.15872
Copy Paste: [[2502.15872]] MutaGReP: Execution-Free Repository-Grounded Plan Search for Code-Use(https://arxiv.org/abs/2502.15872)
Keywords: gpt, llm
Abstract: When a human requests an LLM to complete a coding task using functionality from a large code repository, how do we provide context from the repo to the LLM? One approach is to add the entire repo to the LLM's context window. However, most tasks involve only fraction of symbols from a repo, longer contexts are detrimental to the LLM's reasoning abilities, and context windows are not unlimited. Alternatively, we could emulate the human ability to navigate a large repo, pick out the right functionality, and form a plan to solve the task. We propose MutaGReP (Mutation-guided Grounded Repository Plan Search), an approach to search for plans that decompose a user request into natural language steps grounded in the codebase. MutaGReP performs neural tree search in plan space, exploring by mutating plans and using a symbol retriever for grounding. On the challenging LongCodeArena benchmark, our plans use less than 5% of the 128K context window for GPT-4o but rival the coding performance of GPT-4o with a context window filled with the repo. Plans produced by MutaGReP allow Qwen 2.5 Coder 32B and 72B to match the performance of GPT-4o with full repo context and enable progress on the hardest LongCodeArena tasks. Project page: this http URL
摘要：当人类要求LLM使用大型代码存储库中的功能完成编码任务时，我们如何提供从存储库到LLM的上下文？一种方法是将整个存储库添加到LLM的上下文窗口中。但是，大多数任务仅涉及回购中的符号的一部分，更长的上下文对LLM的推理能力有害，并且上下文窗口并非无限。另外，我们可以模仿人类浏览大型存储库，挑选正确功能并制定解决任务的计划的能力。我们提出了诱变（突变引导的接地存储库计划搜索），一种搜索计划将用户请求分解为代码库中的自然语言步骤的方法。诱变在计划空间中进行神经树搜索，通过突变计划并使用符号猎犬进行接地探索。在具有挑战性的LongCodeArena基准中，我们的计划使用了GPT-4O的128K上下文窗口中的不到5％，但与带有库存的上下文窗口相比，GPT-4O的编码性能。杂物制定的计划允许QWEN 2.5编码器32B和72B与GPT-4O的性能与完整的回购上下文相匹配，并在最难的LongcodeArena任务上取得进展。项目页面：此HTTP URL

Title: A Close Look at Decomposition-based XAI-Methods for Transformer Language Models

Authors: Leila Arras, Bruno Puri, Patrick Kahardipraja, Sebastian Lapuschkin, Wojciech Samek
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.15886
Pdf URL: https://arxiv.org/pdf/2502.15886
Copy Paste: [[2502.15886]] A Close Look at Decomposition-based XAI-Methods for Transformer Language Models(https://arxiv.org/abs/2502.15886)
Keywords: language model, gpt
Abstract: Various XAI attribution methods have been recently proposed for the transformer architecture, allowing for insights into the decision-making process of large language models by assigning importance scores to input tokens and intermediate representations. One class of methods that seems very promising in this direction includes decomposition-based approaches, i.e., XAI-methods that redistribute the model's prediction logit through the network, as this value is directly related to the prediction. In the previous literature we note though that two prominent methods of this category, namely ALTI-Logit and LRP, have not yet been analyzed in juxtaposition and hence we propose to close this gap by conducting a careful quantitative evaluation w.r.t. ground truth annotations on a subject-verb agreement task, as well as various qualitative inspections, using BERT, GPT-2 and LLaMA-3 as a testbed. Along the way we compare and extend the ALTI-Logit and LRP methods, including the recently proposed AttnLRP variant, from an algorithmic and implementation perspective. We further incorporate in our benchmark two widely-used gradient-based attribution techniques. Finally, we make our carefullly constructed benchmark dataset for evaluating attributions on language models, as well as our code, publicly available in order to foster evaluation of XAI-methods on a well-defined common ground.
摘要：最近已经针对变压器体系结构提出了各种XAI归因方法，从而通过为输入令牌和中间表示形式分配重要性得分来了解大语言模型的决策过程。在这个方向上似乎非常有前途的一类方法包括基于分解的方法，即通过网络重新分布模型的预测logit的XAI方法，因为此值与预测直接相关。在先前的文献中，我们注意到，该类别的两种突出方法，即Alti-Logit和LRP，尚未在并置中分析，因此我们建议通过进行仔细的定量评估W.R.T.来缩小这一差距。主题动词协议任务以及各种定性检查，使用BERT，GPT-2和LLAMA-3作为测试床。一路上，我们从算法和实现的角度比较和扩展了Alti-LOGIT和LRP方法，包括最近提出的ATTNLRP变体。我们进一步纳入了两种基于梯度的归因技术。最后，我们制作了仔细构建的基准数据集，以评估语言模型的归因以及我们的代码，并公开使用，以在明确定义的共同点上促进对XAI方法的评估。

Title: Modality-Aware Neuron Pruning for Unlearning in Multimodal Large Language Models

Authors: Zheyuan Liu, Guangyao Dou, Xiangchi Yuan, Chunhui Zhang, Zhaoxuan Tan, Meng Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.15910
Pdf URL: https://arxiv.org/pdf/2502.15910
Copy Paste: [[2502.15910]] Modality-Aware Neuron Pruning for Unlearning in Multimodal Large Language Models(https://arxiv.org/abs/2502.15910)
Keywords: language model, llm
Abstract: Generative models such as Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) trained on massive datasets can lead them to memorize and inadvertently reveal sensitive information, raising ethical and privacy concerns. While some prior works have explored this issue in the context of LLMs, it presents a unique challenge for MLLMs due to the entangled nature of knowledge across modalities, making comprehensive unlearning more difficult. To address this challenge, we propose Modality Aware Neuron Unlearning (MANU), a novel unlearning framework for MLLMs designed to selectively clip neurons based on their relative importance to the targeted forget data, curated for different modalities. Specifically, MANU consists of two stages: important neuron selection and selective pruning. The first stage identifies and collects the most influential neurons across modalities relative to the targeted forget knowledge, while the second stage is dedicated to pruning those selected neurons. MANU effectively isolates and removes the neurons that contribute most to the forget data within each modality, while preserving the integrity of retained knowledge. Our experiments conducted across various MLLM architectures illustrate that MANU can achieve a more balanced and comprehensive unlearning in each modality without largely affecting the overall model utility.
摘要：在大规模数据集中训练的诸如大语言模型（LLM）和多模式大语言模型（MLLM）之类的生成模型可能会导致他们记住并无意间揭示敏感信息，从而引发道德和隐私问题。尽管一些先前的作品在LLM的背景下探讨了这个问题，但由于跨模式的知识的纠缠性，它给MLLM带来了独特的挑战，这使得全面学习更加困难。为了应对这一挑战，我们提出了模态意识到神经元学习（MANU），这是一个新颖的MLLMS框架，旨在根据其对目标忘记数据的相对重要性选择性地剪辑神经元，以策划不同的方式。具体而言，MANU由两个阶段组成：重要的神经元选择和选择性修剪。第一阶段识别并收集了相对于目标忘记知识的模态性最具影响力的神经元，而第二阶段则致力于修剪那些选定的神经元。 Manu有效地隔离并去除对每种模式中忘记数据做出最大贡献的神经元，同时保留保留知识的完整性。我们在各种MLLM体系结构进行的实验表明，Manu可以在每种模式中实现更平衡，更全面的学习，而不会在很大程度上影响整体模型实用程序。

Title: Mind the Gap! Static and Interactive Evaluations of Large Audio Models

Authors: Minzhi Li, William Barr Held, Michael J Ryan, Kunat Pipatanakul, Potsawee Manakul, Hao Zhu, Diyi Yang
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2502.15919
Pdf URL: https://arxiv.org/pdf/2502.15919
Copy Paste: [[2502.15919]] Mind the Gap! Static and Interactive Evaluations of Large Audio Models(https://arxiv.org/abs/2502.15919)
Keywords: chat
Abstract: As AI chatbots become ubiquitous, voice interaction presents a compelling way to enable rapid, high-bandwidth communication for both semantic and social signals. This has driven research into Large Audio Models (LAMs) to power voice-native experiences. However, aligning LAM development with user goals requires a clear understanding of user needs and preferences to establish reliable progress metrics. This study addresses these challenges by introducing an interactive approach to evaluate LAMs and collecting 7,500 LAM interactions from 484 participants. Through topic modeling of user queries, we identify primary use cases for audio interfaces. We then analyze user preference rankings and qualitative feedback to determine which models best align with user needs. Finally, we evaluate how static benchmarks predict interactive performance - our analysis reveals no individual benchmark strongly correlates with interactive results ($\tau \leq 0.33$ for all benchmarks). While combining multiple coarse-grained features yields modest predictive power ($R^2$=$0.30$), only two out of twenty datasets on spoken question answering and age prediction show significantly positive correlations. This suggests a clear need to develop LAM evaluations that better correlate with user preferences.
摘要：随着AI聊天机器人无处不在，语音互动提出了一种令人信服的方式，可以为语义和社会信号提供快速，高带宽的沟通。这将大型音频模型（LAM）的研究推向了语音本地体验。但是，将LAM开发与用户目标保持一致需要清楚地了解用户需求和偏好以建立可靠的进度指标。这项研究通过引入一种评估LAM并收集来自484名参与者的7,500 LAM相互作用的交互方法来解决这些挑战。通过对用户查询的主题建模，我们确定了音频接口的主要用例。然后，我们分析用户偏好排名和定性反馈，以确定哪些模型最能与用户需求保持一致。最后，我们评估了静态基准如何预测交互式性能 - 我们的分析表明，没有单个基准与交互式结果密切相关（所有基准分析的$ \ tau \ leq 0.33 $）。在组合多个粗粒功能会产生适度的预测能力（$ r^2 $ = $ 0.30 $），但在口头问答和年龄预测的二十个数据集中只有两个显示出显着的正相关。这表明明确需要开发与用户偏好更好相关的LAM评估。

Title: Self-Taught Agentic Long Context Understanding

Authors: Yufan Zhuang, Xiaodong Yu, Jialian Wu, Ximeng Sun, Ze Wang, Jiang Liu, Yusheng Su, Jingbo Shang, Zicheng Liu, Emad Barsoum
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15920
Pdf URL: https://arxiv.org/pdf/2502.15920
Copy Paste: [[2502.15920]] Self-Taught Agentic Long Context Understanding(https://arxiv.org/abs/2502.15920)
Keywords: language model, llm, long context, prompt, agent
Abstract: Answering complex, long-context questions remains a major challenge for large language models (LLMs) as it requires effective question clarifications and context retrieval. We propose Agentic Long-Context Understanding (AgenticLU), a framework designed to enhance an LLM's understanding of such queries by integrating targeted self-clarification with contextual grounding within an agentic workflow. At the core of AgenticLU is Chain-of-Clarifications (CoC), where models refine their understanding through self-generated clarification questions and corresponding contextual groundings. By scaling inference as a tree search where each node represents a CoC step, we achieve 97.8% answer recall on NarrativeQA with a search depth of up to three and a branching factor of eight. To amortize the high cost of this search process to training, we leverage the preference pairs for each step obtained by the CoC workflow and perform two-stage model finetuning: (1) supervised finetuning to learn effective decomposition strategies, and (2) direct preference optimization to enhance reasoning quality. This enables AgenticLU models to generate clarifications and retrieve relevant context effectively and efficiently in a single inference pass. Extensive experiments across seven long-context tasks demonstrate that AgenticLU significantly outperforms state-of-the-art prompting methods and specialized long-context LLMs, achieving robust multi-hop reasoning while sustaining consistent performance as context length grows.
摘要：回答复杂的，长篇文章的问题仍然是大型语言模型（LLMS）的主要挑战，因为它需要有效的问题澄清和上下文检索。我们提出了代理长篇小说理解（Ageniclu），这是一个框架，旨在通过将目标的自我传播与代理工作流中的上下文接地进行整合，从而增强LLM对此类查询的理解。 Ageniclu的核心是关链（COC），其中模型通过自我生成的澄清问题和相应的上下文基础来完善其理解。通过将每个节点代表COC步骤的树搜索缩放，我们在叙事中获得了97.8％的答案回忆，搜索深度最多为三个，分支系数为8。为了摊销此搜索过程的高成本进行培训，我们利用COC工作流程获得的每个步骤的偏好对，并执行两阶段模型的登录：（1）监督的Finetuning来学习有效的分解策略，以及（2）直接偏好优化以提高推理质量。这使Agenticlu模型能够在单个推理通行证中有效，有效地生成澄清并有效地检索相关上下文。在七个长篇文章任务上进行的广泛实验表明，Ageniclu的表现明显优于最先进的提示方法和专业的长篇小写LLM，从而实现了强大的多跳跃推理，同时随着上下文长度的增长而保持一致的性能。

Title: Improving Consistency in Large Language Models through Chain of Guidance

Authors: Harsh Raj, Vipul Gupta, Domenic Rosati, Subhabrata Majumdar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.15924
Pdf URL: https://arxiv.org/pdf/2502.15924
Copy Paste: [[2502.15924]] Improving Consistency in Large Language Models through Chain of Guidance(https://arxiv.org/abs/2502.15924)
Keywords: language model, llm, prompt
Abstract: Consistency is a fundamental dimension of trustworthiness in Large Language Models (LLMs). For humans to be able to trust LLM-based applications, their outputs should be consistent when prompted with inputs that carry the same meaning or intent. Despite this need, there is no known mechanism to control and guide LLMs to be more consistent at inference time. In this paper, we introduce a novel alignment strategy to maximize semantic consistency in LLM outputs. Our proposal is based on Chain of Guidance (CoG), a multistep prompting technique that generates highly consistent outputs from LLMs. For closed-book question-answering (Q&A) tasks, when compared to direct prompting, the outputs generated using CoG show improved consistency. While other approaches like template-based responses and majority voting may offer alternative paths to consistency, our work focuses on exploring the potential of guided prompting. We use synthetic data sets comprised of consistent input-output pairs to fine-tune LLMs to produce consistent and correct outputs. Our fine-tuned models are more than twice as consistent compared to base models and show strong generalization capabilities by producing consistent outputs over datasets not used in the fine-tuning process.
摘要：一致性是大语言模型（LLM）中可信赖性的基本维度。为了使人类能够信任基于LLM的应用程序，在提示输入的提示时，其输出应保持一致。尽管需要此需求，但仍未有已知的机制来控制和指导LLM在推理时更加一致。在本文中，我们引入了一种新颖的对齐策略，以最大程度地提高LLM输出的语义一致性。我们的建议基于指导链（COG），这是一种多步提示技术，可产生LLMS高度一致的输出。对于封闭的问题撤职（问答）任务，与直接提示相比，使用COG生成的输出显示出提高的一致性。尽管基于模板的回应和大多数投票等其他方法可能为一致性提供替代途径，但我们的工作着重于探索引导提示的潜力。我们使用由一致的输入输出对组成的合成数据集以及微调LLM，以产生一致和纠正输出。与基本模型相比，我们的微调模型是一致的两倍以上，并通过在微调过程中未使用的数据集产生一致的输出来显示出强大的概括功能。

Title: CVE-LLM : Ontology-Assisted Automatic Vulnerability Evaluation Using Large Language Models

Authors: Rikhiya Ghosh, Hans-Martin von Stockhausen, Martin Schmitt, George Marica Vasile, Sanjeev Kumar Karn, Oladimeji Farri
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2502.15932
Pdf URL: https://arxiv.org/pdf/2502.15932
Copy Paste: [[2502.15932]] CVE-LLM : Ontology-Assisted Automatic Vulnerability Evaluation Using Large Language Models(https://arxiv.org/abs/2502.15932)
Keywords: language model, llm
Abstract: The National Vulnerability Database (NVD) publishes over a thousand new vulnerabilities monthly, with a projected 25 percent increase in 2024, highlighting the crucial need for rapid vulnerability identification to mitigate cybersecurity attacks and save costs and resources. In this work, we propose using large language models (LLMs) to learn vulnerability evaluation from historical assessments of medical device vulnerabilities in a single manufacturer's portfolio. We highlight the effectiveness and challenges of using LLMs for automatic vulnerability evaluation and introduce a method to enrich historical data with cybersecurity ontologies, enabling the system to understand new vulnerabilities without retraining the LLM. Our LLM system integrates with the in-house application - Cybersecurity Management System (CSMS) - to help Siemens Healthineers (SHS) product cybersecurity experts efficiently assess the vulnerabilities in our products. Also, we present guidelines for efficient integration of LLMs into the cybersecurity tool.
摘要：国家脆弱性数据库（NVD）每月发布超过一千个新的漏洞，预计2024年增加了25％，强调了对快速脆弱性识别以减轻网络安全攻击并节省成本和资源的关键需求。在这项工作中，我们建议使用大型语言模型（LLMS）从单个制造商投资组合中的医疗设备脆弱性的历史评估中学习脆弱性评估。我们强调了使用LLM进行自动脆弱性评估的有效性和挑战，并引入了一种以网络安全本体论来丰富历史数据的方法，从而使系统能够理解新的漏洞而无需重新验证LLM。我们的LLM系统与内部应用程序（网络安全管理系统（CSM））集成在一起，以帮助西门子卫生仪（SHS）产品网络安全专家有效评估我们产品中的脆弱性。此外，我们介绍了将LLM有效整合到网络安全工具中的指南。

Title: AutoMedPrompt: A New Framework for Optimizing LLM Medical Prompts Using Textual Gradients

Authors: Sean Wu, Michael Koo, Fabien Scalzo, Ira Kurtz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.15944
Pdf URL: https://arxiv.org/pdf/2502.15944
Copy Paste: [[2502.15944]] AutoMedPrompt: A New Framework for Optimizing LLM Medical Prompts Using Textual Gradients(https://arxiv.org/abs/2502.15944)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: Large language models (LLMs) have demonstrated increasingly sophisticated performance in medical and other fields of knowledge. Traditional methods of creating specialist LLMs require extensive fine-tuning and training of models on large datasets. Recently, prompt engineering, instead of fine-tuning, has shown potential to boost the performance of general foundation models. However, prompting methods such as chain-of-thought (CoT) may not be suitable for all subspecialty, and k-shot approaches may introduce irrelevant tokens into the context space. We present AutoMedPrompt, which explores the use of textual gradients to elicit medically relevant reasoning through system prompt optimization. AutoMedPrompt leverages TextGrad's automatic differentiation via text to improve the ability of general foundation LLMs. We evaluated AutoMedPrompt on Llama 3, an open-source LLM, using several QA benchmarks, including MedQA, PubMedQA, and the nephrology subspecialty-specific NephSAP. Our results show that prompting with textual gradients outperforms previous methods on open-source LLMs and surpasses proprietary models such as GPT-4, Claude 3 Opus, and Med-PaLM 2. AutoMedPrompt sets a new state-of-the-art (SOTA) performance on PubMedQA with an accuracy of 82.6$\%$, while also outperforming previous prompting strategies on open-sourced models for MedQA (77.7$\%$) and NephSAP (63.8$\%$).
摘要：大型语言模型（LLM）在医学和其他知识领域表现出越来越复杂的表现。创建专业LLM的传统方法需要大量的微调和培训大型数据集上的模型。最近，及时的工程，而不是微调，已经显示出可能提高一般基础模型的性能的潜力。但是，提示诸如《思考链》（COT）之类的方法可能并不适合所有亚专业，而K-shot方法可能会引入无关的令牌，以中的上下文空间。我们提出了AutomedPrompt，探讨了通过系统迅速优化引起医学相关推理的文本梯度的使用。 AutomedPrompt通过文本利用TextGrad的自动差异来提高一般基础LLM的能力。我们使用多种QA基准测试了包括MEDQA，PubMedQA和肾脏科亚科特异性nephsap的QA基准，评估了Llama 3的AutomedPrompt。我们的结果表明，提示文本梯度的表现优于开源LLM上的先前方法，并超过了专有模型，例如GPT-4，Claude 3 Opus和Med-Palm 2。AutomedPrompt设置了一种新的最先进的ART（SOTA）（SOTA）在PubMedQA上的性能为82.6 $ \％$，同时也优于MEDQA开源型号的先前提示策略（77.7 $ \％$）和Nephsap（63.8 $ \％$）。

Title: MMRAG: Multi-Mode Retrieval-Augmented Generation with Large Language Models for Biomedical In-Context Learning

Authors: Zaifu Zhan, Jun Wang, Shuang Zhou, Jiawen Deng, Rui Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15954
Pdf URL: https://arxiv.org/pdf/2502.15954
Copy Paste: [[2502.15954]] MMRAG: Multi-Mode Retrieval-Augmented Generation with Large Language Models for Biomedical In-Context Learning(https://arxiv.org/abs/2502.15954)
Keywords: language model, prompt, retrieval-augmented generation
Abstract: Objective: To optimize in-context learning in biomedical natural language processing by improving example selection. Methods: We introduce a novel multi-mode retrieval-augmented generation (MMRAG) framework, which integrates four retrieval strategies: (1) Random Mode, selecting examples arbitrarily; (2) Top Mode, retrieving the most relevant examples based on similarity; (3) Diversity Mode, ensuring variation in selected examples; and (4) Class Mode, selecting category-representative examples. This study evaluates MMRAG on three core biomedical NLP tasks: Named Entity Recognition (NER), Relation Extraction (RE), and Text Classification (TC). The datasets used include BC2GM for gene and protein mention recognition (NER), DDI for drug-drug interaction extraction (RE), GIT for general biomedical information extraction (RE), and HealthAdvice for health-related text classification (TC). The framework is tested with two large language models (Llama2-7B, Llama3-8B) and three retrievers (Contriever, MedCPT, BGE-Large) to assess performance across different retrieval strategies. Results: The results from the Random mode indicate that providing more examples in the prompt improves the model's generation performance. Meanwhile, Top mode and Diversity mode significantly outperform Random mode on the RE (DDI) task, achieving an F1 score of 0.9669, a 26.4% improvement. Among the three retrievers tested, Contriever outperformed the other two in a greater number of experiments. Additionally, Llama 2 and Llama 3 demonstrated varying capabilities across different tasks, with Llama 3 showing a clear advantage in handling NER tasks. Conclusion: MMRAG effectively enhances biomedical in-context learning by refining example selection, mitigating data scarcity issues, and demonstrating superior adaptability for NLP-driven healthcare applications.
摘要：目的：通过改进示例选择，在生物医学自然语言处理中优化中文学习。方法：我们引入了一种新型的多模式检索生成（MMRAG）框架，该框架集成了四种检索策略：（1）随机模式，任意选择示例；（2）最高模式，根据相似性检索最相关的示例；（3）多样性模式，确保选定示例中的变化；（4）类模式，选择类别代表的示例。这项研究评估了三个核心生物医学NLP任务的MMRAG：命名实体识别（NER），关系提取（RE）和文本分类（TC）。使用的数据集包括用于基因和蛋白质提及识别的BC2GM，用于药物交互作用的DDI，DDI（RE），用于一般生物医学信息提取（RE）的GIT以及与健康相关的文本分类（TC）的HealthAdvice。该框架通过两种大语言模型（Llama2-7b，Llama3-8B）和三个猎犬（Contriever，Medcpt，BGE-LARGE）进行了测试，以评估不同检索策略的性能。结果：随机模式的结果表明，在提示中提供更多示例可改善模型的生成性能。同时，在RE（DDI）任务上，顶部模式和多样性模式显着超过了随机模式，其F1得分为0.9669，提高了26.4％。在经过测试的三个检索器中，Contriever在更多的实验中的表现优于其他两个。此外，美洲驼（Llama 2）和美洲驼（Llama 3）在不同任务中表现出不同的功能，而骆驼3在处理NER任务方面具有明显的优势。结论：MMRAG通过完善示例选择，减轻数据稀缺问题并证明了NLP驱动的医疗保健应用程序的卓越适应性，从而有效地增强了生物医学的内在学习学习。

Title: R$^3$Mem: Bridging Memory Retention and Retrieval via Reversible Compression

Authors: Xiaoqiang Wang, Suyuchen Wang, Yun Zhu, Bang Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15957
Pdf URL: https://arxiv.org/pdf/2502.15957
Copy Paste: [[2502.15957]] R$^3$Mem: Bridging Memory Retention and Retrieval via Reversible Compression(https://arxiv.org/abs/2502.15957)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: Memory plays a key role in enhancing LLMs' performance when deployed to real-world applications. Existing solutions face trade-offs: explicit memory designs based on external storage require complex management and incur storage overhead, while implicit memory designs that store information via parameters struggle with reliable retrieval. In this paper, we propose R$^3$Mem, a memory network that optimizes both information Retention and Retrieval through Reversible context compression. Specifically, R$^3$Mem employs virtual memory tokens to compress and encode infinitely long histories, further enhanced by a hierarchical compression strategy that refines information from document- to entity-level for improved assimilation across granularities. For retrieval, R$^3$Mem employs a reversible architecture, reconstructing raw data by invoking the model backward with compressed information. Implemented via parameter-efficient fine-tuning, it can integrate seamlessly with any Transformer-based model. Experiments demonstrate that our memory design achieves state-of-the-art performance in long-context language modeling and retrieval-augmented generation tasks. It also significantly outperforms conventional memory modules in long-horizon interaction tasks like conversational agents, showcasing its potential for next-generation retrieval systems.
摘要：当部署到现实世界应用程序时，内存在增强LLMS的性能方面起着关键作用。现有的解决方案面临权衡：基于外部存储的明确内存设计需要复杂的管理和储存开销，而通过参数存储信息的隐式内存设计则可以可靠的检索而努力。在本文中，我们建议通过可逆上下文压缩来优化信息保留和检索的内存网络R $^3 $ MEM。具体而言，r $^3 $ mem采用虚拟内存令牌来压缩和编码无限长的历史，通过层次压缩策略进一步增强，从文档到实体级别，以改善跨性别的同化。对于检索，R $^3 $ MEM采用可逆体系结构，通过使用压缩信息向后调用模型来重建原始数据。通过参数有效的微调实现，它可以与任何基于变压器的模型无缝集成。实验表明，我们的内存设计在长篇文章的语言建模和检索演出的生成任务中实现了最先进的性能。它还大大优于长音交互任务（例如对话剂）中的常规内存模块，展示了其下一代检索系统的潜力。

Title: Sparsity May Be All You Need: Sparse Random Parameter Adaptation

Authors: Jesus Rios, Pierre Dognin, Ronny Luss, Karthikeyan N. Ramamurthy
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.15975
Pdf URL: https://arxiv.org/pdf/2502.15975
Copy Paste: [[2502.15975]] Sparsity May Be All You Need: Sparse Random Parameter Adaptation(https://arxiv.org/abs/2502.15975)
Keywords: language model
Abstract: Full fine-tuning of large language models for alignment and task adaptation has become prohibitively expensive as models have grown in size. Parameter-Efficient Fine-Tuning (PEFT) methods aim at significantly reducing the computational and memory resources needed for fine-tuning these models by only training on a small number of parameters instead of all model parameters. Currently, the most popular PEFT method is the Low-Rank Adaptation (LoRA), which freezes the parameters of the model to be fine-tuned and introduces a small set of trainable parameters in the form of low-rank matrices. We propose simply reducing the number of trainable parameters by randomly selecting a small proportion of the model parameters to train on. In this paper, we compare the efficiency and performance of our proposed approach with PEFT methods, including LoRA, as well as full parameter fine-tuning.
摘要：随着型号的规模增长，大型语言模型的完整微调和任务适应的完整模型变得非常昂贵。参数有效的微调（PEFT）方法旨在通过仅培训少量参数而不是所有模型参数来显着减少对这些模型进行微调所需的计算和内存资源。当前，最受欢迎的PEFT方法是低级适应（LORA），它冻结了模型的参数进行微调，并以低级别矩阵的形式引入了一小部分可训练的参数。我们建议通过随机选择一小部分模型参数来减少可训练参数的数量。在本文中，我们将提出方法的PEFT方法（包括Lora）以及完整的参数微调比较了我们提出的方法的效率和性能。

Title: KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse

Authors: Jingbo Yang, Bairu Hou, Wei Wei, Yujia Bao, Shiyu Chang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16002
Pdf URL: https://arxiv.org/pdf/2502.16002
Copy Paste: [[2502.16002]] KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse(https://arxiv.org/abs/2502.16002)
Keywords: language model, llm
Abstract: We describe KVLink, an approach for efficient key-value (KV) cache reuse in large language models (LLMs). In many LLM applications, different inputs can share overlapping context, such as the same retrieved document appearing in multiple queries. However, the LLMs still need to encode the entire context for each query, leading to redundant computation. In this paper, we propose a new strategy to eliminate such inefficiency, where the KV cache of each document is precomputed independently. During inference, the KV caches of retrieved documents are concatenated, allowing the model to reuse cached representations instead of recomputing them. To mitigate the performance degradation of LLMs when using KV caches computed independently for each document, KVLink introduces three key components: adjusting positional embeddings of the KV cache at inference to match the global position after concatenation, using trainable special tokens to restore self-attention across independently encoded documents, and applying mixed-data fine-tuning to enhance performance while preserving the model's original capabilities. Experiments across 7 datasets demonstrate that KVLink improves question answering accuracy by an average of 4% over state-of-the-art methods. Furthermore, by leveraging precomputed KV caches, our approach reduces time-to-first-token by up to 90% compared to standard LLM inference, making it a scalable and efficient solution for context reuse.
摘要：我们描述了KVLink，这是一种在大语言模型（LLMS）中重复使用高效键值（KV）的方法。在许多LLM应用程序中，不同的输入可以共享重叠的上下文，例如在多个查询中出现的同一检索的文档。但是，LLMS仍然需要为每个查询编码整个上下文，从而导致冗余计算。在本文中，我们提出了一种新的策略，以消除这种效率低下的策略，在该效率下，每个文档的KV缓存将独立进行。在推断期间，将检索的文档的KV缓存串联，从而使模型可以重用缓存的表示，而不是重新计算它们。为了减轻针对每个文档独立计算的KV缓存时，LLMS的性能降低，KVLink引入了三个关键组件：调整推理时KV缓存的位置嵌入以匹配一致后的全球位置，使用可训练的特殊代价来恢复跨越的自我注意事项独立编码的文档，并应用混合数据微调以提高性能，同时保留模型的原始功能功能。在7个数据集中进行的实验表明，KVLINK将问题的准确性提高到了最新方法的平均4％。此外，通过利用预先计算的KV缓存，我们的方法将最先进的时间缩短了90％，与标准LLM推断相比，我们的方法是可重复使用的可扩展有效解决方案。

Title: Enhancing LLMs for Identifying and Prioritizing Important Medical Jargons from Electronic Health Record Notes Utilizing Data Augmentation

Authors: Won Seok Jang, Sharmin Sultana, Zonghai Yao, Hieu Tran, Zhichao Yang, Sunjae Kwon, Hong Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16022
Pdf URL: https://arxiv.org/pdf/2502.16022
Copy Paste: [[2502.16022]] Enhancing LLMs for Identifying and Prioritizing Important Medical Jargons from Electronic Health Record Notes Utilizing Data Augmentation(https://arxiv.org/abs/2502.16022)
Keywords: gpt, llm, prompt, chat
Abstract: Objective: OpenNotes enables patients to access EHR notes, but medical jargon can hinder comprehension. To improve understanding, we evaluated closed- and open-source LLMs for extracting and prioritizing key medical terms using prompting, fine-tuning, and data augmentation. Materials and Methods: We assessed LLMs on 106 expert-annotated EHR notes, experimenting with (i) general vs. structured prompts, (ii) zero-shot vs. few-shot prompting, (iii) fine-tuning, and (iv) data augmentation. To enhance open-source models in low-resource settings, we used ChatGPT for data augmentation and applied ranking techniques. We incrementally increased the augmented dataset size (10 to 10,000) and conducted 5-fold cross-validation, reporting F1 score and Mean Reciprocal Rank (MRR). Results and Discussion: Fine-tuning and data augmentation improved performance over other strategies. GPT-4 Turbo achieved the highest F1 (0.433), while Mistral7B with data augmentation had the highest MRR (0.746). Open-source models, when fine-tuned or augmented, outperformed closed-source models. Notably, the best F1 and MRR scores did not always align. Few-shot prompting outperformed zero-shot in vanilla models, and structured prompts yielded different preferences across models. Fine-tuning improved zero-shot performance but sometimes degraded few-shot performance. Data augmentation performed comparably or better than other methods. Conclusion: Our evaluation highlights the effectiveness of prompting, fine-tuning, and data augmentation in improving model performance for medical jargon extraction in low-resource scenarios.
摘要：目的：OpenNotes使患者能够访问EHR笔记，但医疗术语可以阻碍理解。为了提高理解，我们评估了使用提示，微调和数据增强来提取和确定关键医疗术语的封闭和开源LLM。材料和方法：我们对106个专家注销的EHR注释进行了LLM，尝试（i）（i）一般性和结构化提示，（ii）零射击与少量提示，（iii）微调和（iv）数据增强。为了增强低资源设置中的开源模型，我们使用CHATGPT进行数据增强和应用排名技术。我们逐步增加了增强数据集大小（10至10,000），并进行了5倍的交叉验证，报告F1分数和平均值等级（MRR）。结果与讨论：微调和数据扩展改善了其他策略的性能。 GPT-4 Turbo获得了最高的F1（0.433），而具有数据增强的Mistral7b具有最高的MRR（0.746）。开源型号进行微调或增强时，效果优于封闭源模型。值得注意的是，最好的F1和MRR分数并不总是保持一致。很少有射击促使香草模型中的零射击优于零射击，结构化提示可以在模型之间产生不同的偏好。微调改善了零射击性能，但有时会降低几次射击性能。与其他方法相比，数据增强的执行或更好。结论：我们的评估突出了提示，微调和数据增强在改善低资源场景中医疗术语提取模型性能方面的有效性。

Title: Moving Beyond Medical Exam Questions: A Clinician-Annotated Dataset of Real-World Tasks and Ambiguity in Mental Healthcare

Authors: Max Lamparth, Declan Grabb, Amy Franks, Scott Gershan, Kaitlyn N. Kunstman, Aaron Lulla, Monika Drummond Roots, Manu Sharma, Aryan Shrivastava, Nina Vasan, Colleen Waickman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16051
Pdf URL: https://arxiv.org/pdf/2502.16051
Copy Paste: [[2502.16051]] Moving Beyond Medical Exam Questions: A Clinician-Annotated Dataset of Real-World Tasks and Ambiguity in Mental Healthcare(https://arxiv.org/abs/2502.16051)
Keywords: language model
Abstract: Current medical language model (LM) benchmarks often over-simplify the complexities of day-to-day clinical practice tasks and instead rely on evaluating LMs on multiple-choice board exam questions. Thus, we present an expert-created and annotated dataset spanning five critical domains of decision-making in mental healthcare: treatment, diagnosis, documentation, monitoring, and triage. This dataset - created without any LM assistance - is designed to capture the nuanced clinical reasoning and daily ambiguities mental health practitioners encounter, reflecting the inherent complexities of care delivery that are missing from existing datasets. Almost all 203 base questions with five answer options each have had the decision-irrelevant demographic patient information removed and replaced with variables (e.g., AGE), and are available for male, female, or non-binary-coded patients. For question categories dealing with ambiguity and multiple valid answer options, we create a preference dataset with uncertainties from the expert annotations. We outline a series of intended use cases and demonstrate the usability of our dataset by evaluating eleven off-the-shelf and four mental health fine-tuned LMs on category-specific task accuracy, on the impact of patient demographic information on decision-making, and how consistently free-form responses deviate from human annotated samples.
摘要：当前的医学语言模型（LM）基准通常过于简化日常临床实践任务的复杂性，而依靠在多项选择董事会考试问题上评估LMS。因此，我们提出了一个专家创建和注释的数据集，该数据集涵盖了心理保健中决策的五个关键领域：治疗，诊断，文档，监测和分类。该数据集 - 无需任何LM援助而创建的 - 旨在捕获精神健康从业人员遇到的细微临床推理和日常歧义，这反映了现有数据集缺失的固有的护理交付复杂性。几乎所有的203个基本问题都有五个答案选项，每个答案都已经删除了人口统计患者信息，并用变量（例如年龄）替换，并且适用于男性，女性或非二进制编码的患者。对于有关歧义性和多个有效答案选项的问题类别，我们创建了一个偏好数据集，该数据集具有来自专家注释的不确定性。我们概述了一系列预期的用例，并通过评估针对特定类别的任务准确性的11个现成和四个心理健康微调的LMS来证明我们的数据集的可用性，这对患者人口统计信息对决策的影响，以及自由形式的响应偏离人体注释的样本的程度。

Title: Echo: A Large Language Model with Temporal Episodic Memory

Authors: WenTao Liu, Ruohua Zhang, Aimin Zhou, Feng Gao, JiaLi Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.16090
Pdf URL: https://arxiv.org/pdf/2502.16090
Copy Paste: [[2502.16090]] Echo: A Large Language Model with Temporal Episodic Memory(https://arxiv.org/abs/2502.16090)
Keywords: language model, llm, agent
Abstract: Research on large language models (LLMs) has shown remarkable performance in domains such as mathematics, programming, and literary creation. However, most studies have focused on semantic memory-based question answering, neglecting LLMs' potential to handle episodic memory (EM)-related queries. This oversight has led to suboptimal performance in applications requiring EM, including emotional companionship, personal AI assistants, and AI teachers. To address this gap, we introduce Echo, a LLM enhanced with temporal episodic memory. We propose a Multi-Agent Data Generation Framework that guides the model in generating multi-turn, complex scenario episodic memory dialogue data (EM-Train). Temporal information is innovatively incorporated into the LLM training process, and Echo is trained using the EM-Train. Furthermore, We develop an EM-Test benchmark specifically designed to evaluate LLMs' episodic memory capabilities. The EM-Test assesses performance across various time spans and difficulty levels, providing a comprehensive evaluation of multi-turn episodic memory dialogues. Our experiments demonstrate that Echo significantly outperforms state-of-the-art LLMs on EM-Test. Additionally, a qualitative analysis reveals Echo's potential to exhibit human-like episodic memory capabilities. We will open-source all datasets, code, and model weights.
摘要：大型语言模型（LLM）的研究在数学，编程和文学创作等领域中表现出色。但是，大多数研究都集中在基于语义记忆的问题答案上，从而忽略了LLMS处理情节记忆（EM）相关的查询的潜力。这种疏忽导致在需要EM的应用中表现出色，包括情感陪伴，个人AI助手和AI老师。为了解决这一差距，我们介绍了Echo，llm通过时间情节记忆增强。我们提出了一个多代理数据生成框架，该框架可以指导该模型生成多转，复杂的情景内存对话数据（EM-TRAIN）。时间信息创新到LLM培训过程中，并且使用EM-Train对Echo进行了训练。此外，我们开发了一种专门设计用于评估LLMS情节内存功能的EM测试基准。该EM测试评估了各个时间跨度和难度级别的性能，从而对多转型情节记忆对话进行了全面评估。我们的实验表明，回声在EM测试上的表现明显优于最先进的LLM。此外，定性分析揭示了Echo具有类似人类的情节记忆能力的潜力。我们将开放所有数据集，代码和模型权重。

Title: Be a Multitude to Itself: A Prompt Evolution Framework for Red Teaming

Authors: Rui Li, Peiyi Wang, Jingyuan Ma, Di Zhang, Lei Sha, Zhifang Sui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16109
Pdf URL: https://arxiv.org/pdf/2502.16109
Copy Paste: [[2502.16109]] Be a Multitude to Itself: A Prompt Evolution Framework for Red Teaming(https://arxiv.org/abs/2502.16109)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have gained increasing attention for their remarkable capacity, alongside concerns about safety arising from their potential to produce harmful content. Red teaming aims to find prompts that could elicit harmful responses from LLMs, and is essential to discover and mitigate safety risks before real-world deployment. However, manual red teaming is both time-consuming and expensive, rendering it unscalable. In this paper, we propose RTPE, a scalable evolution framework to evolve red teaming prompts across both breadth and depth dimensions, facilitating the automatic generation of numerous high-quality and diverse red teaming prompts. Specifically, in-breadth evolving employs a novel enhanced in-context learning method to create a multitude of quality prompts, whereas in-depth evolving applies customized transformation operations to enhance both content and form of prompts, thereby increasing diversity. Extensive experiments demonstrate that RTPE surpasses existing representative automatic red teaming methods on both attack success rate and diversity. In addition, based on 4,800 red teaming prompts created by RTPE, we further provide a systematic analysis of 8 representative LLMs across 8 sensitive topics.
摘要：大型语言模型（LLMS）因其出色的能力而引起了人们的关注，同时担心其产生有害内容的潜力引起的安全性。 Red Teaming旨在找到可能引起LLM的有害反应的提示，并且在现实部署之前发现和减轻安全风险至关重要。但是，手动红色团队既耗时又昂贵，使其无法实现。在本文中，我们提出了RTPE，这是一个可扩展的演变框架，以在宽度和深度维度上跨越红色的团队提示，从而促进自动生成许多高质量和多样化的红色团队提示。具体而言，圆周不断发展采用一种新颖的增强的内在学习方法来创建众多质量提示，而深入的不断发展则采用自定义的转换操作来增强内容和提示的形式，从而提高多样性。广泛的实验表明，RTPE在攻击成功率和多样性上都超过了现有的代表性自动红色小组方法。此外，基于RTPE创建的4,800个红色团队提示，我们进一步提供了8个敏感主题的8个代表性LLM的系统分析。

Title: Chain-of-Description: What I can understand, I can put into words

Authors: Jiaxin Guo, Daimeng Wei, Zongyao Li, Hengchao Shang, Yuanchang Luo, Hao Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.16137
Pdf URL: https://arxiv.org/pdf/2502.16137
Copy Paste: [[2502.16137]] Chain-of-Description: What I can understand, I can put into words(https://arxiv.org/abs/2502.16137)
Keywords: language model, prompt, chat
Abstract: In this paper, we propose a novel strategy defined as Chain-of-Description (CoD) Prompting, tailored for Multi-Modal Large Language Models. This approach involves having the model first provide a detailed description of the multi-modal input before generating an answer to the question. When applied to models such as Qwen2-Audio, Qwen2-VL, and Qwen2.5-VL, CoD Prompting significantly enhances performance compared to standard prompting methods. This is demonstrated by nearly a 4\% improvement in the speech category of the audio benchmark AIR-Bench-Chat and a 5.3\% improvement in the hard-level portion of the vision benchmark MMMU\_Pro. Our ablation study further validates the effectiveness of CoD Prompting.
摘要：在本文中，我们提出了一种定义为描述链（COD）提示的新型策略，该策略是针对多模式大型语言模型量身定制的。这种方法涉及让模型首先提供对多模式输入的详细描述，然后再产生答案。与标准提示方法相比，当应用于QWEN2-AUDIO，QWEN2-VL和QWEN2.5-VL等模型时，COD提示会显着提高性能。这是通过音频基准空气板聊天的语音类别的几乎4 \％改进的证明，而视觉基准mmmu \ _pro的硬级部分的5.3 \％改进。我们的消融研究进一步验证了COD提示的有效性。

Title: Understanding Zero-shot Rare Word Recognition Improvements Through LLM Integration

Authors: Haoxuan Wang
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2502.16142
Pdf URL: https://arxiv.org/pdf/2502.16142
Copy Paste: [[2502.16142]] Understanding Zero-shot Rare Word Recognition Improvements Through LLM Integration(https://arxiv.org/abs/2502.16142)
Keywords: language model, llm
Abstract: In this study, we investigate the integration of a large language model (LLM) with an automatic speech recognition (ASR) system, specifically focusing on enhancing rare word recognition performance. Using a 190,000-hour dataset primarily sourced from YouTube, pre-processed with Whisper V3 pseudo-labeling, we demonstrate that the LLM-ASR architecture outperforms traditional Zipformer-Transducer models in the zero-shot rare word recognition task, after training on a large dataset. Our analysis reveals that the LLM contributes significantly to improvements in rare word error rate (R-WER), while the speech encoder primarily determines overall transcription performance (Orthographic Word Error Rate, O-WER, and Normalized Word Error Rate, N-WER). Through extensive ablation studies, we highlight the importance of adapter integration in aligning speech encoder outputs with the LLM's linguistic capabilities. Furthermore, we emphasize the critical role of high-quality labeled data in achieving optimal performance. These findings provide valuable insights into the synergy between LLM-based ASR architectures, paving the way for future advancements in large-scale LLM-based speech recognition systems.
摘要：在这项研究中，我们研究了大型语言模型（LLM）与自动语音识别（ASR）系统的整合，特别是致力于增强稀有单词识别性能。使用190,000小时的数据集，该数据集主要来自YouTube，并通过Whisper V3伪标记进行了预处理，我们证明，LLM-ASR体系结构在零照片稀有单词识别任务中的传统Zipformer-Transducer模型优于传统的Zipformer-Transducer模型数据集。我们的分析表明，LLM对罕见单词错误率的提高（R-WER）有很大贡献，而语音编码器主要决定总体转录性能（拼写单词错误率，O-WER，O-WER和标准化的单词错误率，N-WER）。通过广泛的消融研究，我们强调了适配器集成在使语音编码器输出与LLM的语言能力相结合的重要性。此外，我们强调了高质量标记数据在实现最佳性能中的关键作用。这些发现为基于LLM的ASR体系结构之间的协同作用提供了宝贵的见解，为大规模LLM基于LLM的语音识别系统的未来进步铺平了道路。

Title: The Law of Knowledge Overshadowing: Towards Understanding, Predicting, and Preventing LLM Hallucination

Authors: Yuji Zhang, Sha Li, Cheng Qian, Jiateng Liu, Pengfei Yu, Chi Han, Yi R. Fung, Kathleen McKeown, Chengxiang Zhai, Manling Li, Heng Ji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16143
Pdf URL: https://arxiv.org/pdf/2502.16143
Copy Paste: [[2502.16143]] The Law of Knowledge Overshadowing: Towards Understanding, Predicting, and Preventing LLM Hallucination(https://arxiv.org/abs/2502.16143)
Keywords: language model, llm, hallucination
Abstract: Hallucination is a persistent challenge in large language models (LLMs), where even with rigorous quality control, models often generate distorted facts. This paradox, in which error generation continues despite high-quality training data, calls for a deeper understanding of the underlying LLM mechanisms. To address it, we propose a novel concept: knowledge overshadowing, where model's dominant knowledge can obscure less prominent knowledge during text generation, causing the model to fabricate inaccurate details. Building on this idea, we introduce a novel framework to quantify factual hallucinations by modeling knowledge overshadowing. Central to our approach is the log-linear law, which predicts that the rate of factual hallucination increases linearly with the logarithmic scale of (1) Knowledge Popularity, (2) Knowledge Length, and (3) Model Size. The law provides a means to preemptively quantify hallucinations, offering foresight into their occurrence even before model training or inference. Built on overshadowing effect, we propose a new decoding strategy CoDa, to mitigate hallucinations, which notably enhance model factuality on Overshadow (27.9%), MemoTrap (13.1%) and NQ-Swap (18.3%). Our findings not only deepen understandings of the underlying mechanisms behind hallucinations but also provide actionable insights for developing more predictable and controllable language models.
摘要：在大语言模型（LLM）中，幻觉是一个持续的挑战，即使在严格的质量控制下，模型也经常会产生扭曲的事实。尽管有高质量的培训数据，但这种悖论仍在继续产生错误，但仍需要更深入地了解基本的LLM机制。为了解决这个问题，我们提出了一个新颖的概念：知识掩盖，模型的主导知识可以使文本生成期间的知识不那么突出，从而导致模型制造不准确的细节。在这个想法的基础上，我们引入了一个新颖的框架，以通过建模知识模型来量化事实幻觉。我们方法的核心是对数线性定律，该定律预测，事实幻觉的速率随（1）知识受欢迎程度，（2）知识长度和（3）模型大小的对数尺度线性增加。该法律提供了一种先发制于量化幻觉的手段，甚至在模型培训或推理之前就可以远见。我们基于遮盖效果，提出了一种新的解码策略尾声，以减轻幻觉，这特别增强了模型的事实，对过时（27.9％），Memotrap（13.1％）和NQ-SWAP（18.3％）。我们的发现不仅加深对幻觉背后的基本机制的理解，而且还为开发更可预测和可控制的语言模型提供了可行的见解。

Title: Number Representations in LLMs: A Computational Parallel to Human Perception

Authors: H.V. AlquBoj, Hilal AlQuabeh, Velibor Bojkovic, Tatsuya Hiraoka, Ahmed Oumar El-Shangiti, Munachiso Nwadike, Kentaro Inui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16147
Pdf URL: https://arxiv.org/pdf/2502.16147
Copy Paste: [[2502.16147]] Number Representations in LLMs: A Computational Parallel to Human Perception(https://arxiv.org/abs/2502.16147)
Keywords: language model, llm
Abstract: Humans are believed to perceive numbers on a logarithmic mental number line, where smaller values are represented with greater resolution than larger ones. This cognitive bias, supported by neuroscience and behavioral studies, suggests that numerical magnitudes are processed in a sublinear fashion rather than on a uniform linear scale. Inspired by this hypothesis, we investigate whether large language models (LLMs) exhibit a similar logarithmic-like structure in their internal numerical representations. By analyzing how numerical values are encoded across different layers of LLMs, we apply dimensionality reduction techniques such as PCA and PLS followed by geometric regression to uncover latent structures in the learned embeddings. Our findings reveal that the model's numerical representations exhibit sublinear spacing, with distances between values aligning with a logarithmic scale. This suggests that LLMs, much like humans, may encode numbers in a compressed, non-uniform manner.
摘要：人们认为人类会在对数心理数字线上感知数字，在对数心理数字线上，较小的值的分辨率比较大的值更大。神经科学和行为研究支持的这种认知偏见表明，数值幅度是以sublinear的方式处理的，而不是以均匀的线性量表进行处理。受这一假设的启发，我们研究了大型语言模型（LLM）是否在其内部数值表示中表现出类似的对数结构。通过分析数值如何在LLM的不同层上编码，我们应用了降低降低技术，例如PCA和PLS，然后使用几何回归，以发现学习嵌入的潜在结构。我们的发现表明，该模型的数值表示表现出透明体间距，并且值与对数尺度对齐之间的距离。这表明LLM与人类一样，可能以压缩，不均匀的方式编码数字。

Title: EPERM: An Evidence Path Enhanced Reasoning Model for Knowledge Graph Question and Answering

Authors: Xiao Long, Liansheng Zhuang, Aodi Li, Minghong Yao, Shafei Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.16171
Pdf URL: https://arxiv.org/pdf/2502.16171
Copy Paste: [[2502.16171]] EPERM: An Evidence Path Enhanced Reasoning Model for Knowledge Graph Question and Answering(https://arxiv.org/abs/2502.16171)
Keywords: language model, llm, hallucination
Abstract: Due to the remarkable reasoning ability, Large language models (LLMs) have demonstrated impressive performance in knowledge graph question answering (KGQA) tasks, which find answers to natural language questions over knowledge graphs (KGs). To alleviate the hallucinations and lack of knowledge issues of LLMs, existing methods often retrieve the question-related information from KGs to enrich the input context. However, most methods focus on retrieving the relevant information while ignoring the importance of different types of knowledge in reasoning, which degrades their performance. To this end, this paper reformulates the KGQA problem as a graphical model and proposes a three-stage framework named the Evidence Path Enhanced Reasoning Model (EPERM) for KGQA. In the first stage, EPERM uses the fine-tuned LLM to retrieve a subgraph related to the question from the original knowledge graph. In the second stage, EPERM filters out the evidence paths that faithfully support the reasoning of the questions, and score their importance in reasoning. Finally, EPERM uses the weighted evidence paths to reason the final answer. Since considering the importance of different structural information in KGs for reasoning, EPERM can improve the reasoning ability of LLMs in KGQA tasks. Extensive experiments on benchmark datasets demonstrate that EPERM achieves superior performances in KGQA tasks.
摘要：由于具有显着的推理能力，大型语言模型（LLMS）在知识图答案（KGQA）任务中表现出了令人印象深刻的表现，该任务通过知识图（KGS）找到了自然语言问题的答案。为了减轻LLM的幻觉和知识问题，现有方法经常从KGS中检索与问题相关的信息以丰富输入环境。但是，大多数方法都侧重于检索相关信息，同时忽略不同类型的知识在推理中的重要性，从而降低了其性能。为此，本文将KGQA问题重新定义为图形模型，并提出了一个三阶段框架，名为KGQA的证据路径增强推理模型（EPER）。在第一阶段，Eperm使用微调的LLM检索与原始知识图中问题相关的子图。在第二阶段，Eperm滤除了忠实地支持问题推理的证据路径，并在推理中获得了重要性。最后，Eperm使用加权证据路径来推理最终答案。由于考虑了KGS中不同结构信息在推理中的重要性，因此EPER可以提高LLM在KGQA任务中的推理能力。基准数据集的广泛实验表明，EPERM在KGQA任务中取得了出色的表现。

Title: Mapping 1,000+ Language Models via the Log-Likelihood Vector

Authors: Momose Oyama, Hiroaki Yamagiwa, Yusuke Takase, Hidetoshi Shimodaira
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16173
Pdf URL: https://arxiv.org/pdf/2502.16173
Copy Paste: [[2502.16173]] Mapping 1,000+ Language Models via the Log-Likelihood Vector(https://arxiv.org/abs/2502.16173)
Keywords: language model
Abstract: To compare autoregressive language models at scale, we propose using log-likelihood vectors computed on a predefined text set as model features. This approach has a solid theoretical basis: when treated as model coordinates, their squared Euclidean distance approximates the Kullback-Leibler divergence of text-generation probabilities. Our method is highly scalable, with computational cost growing linearly in both the number of models and text samples, and is easy to implement as the required features are derived from cross-entropy loss. Applying this method to over 1,000 language models, we constructed a "model map," providing a new perspective on large-scale model analysis.
摘要：为了大规模比较自回归语言模型，我们建议使用在预定义的文本集作为模型特征上计算的对数可能性向量。这种方法具有坚实的理论基础：当将其视为模型坐标时，它们的平方欧几里得距离近似文本生成概率的kullback-leibler差异。我们的方法是可扩展的，在模型和文本样本的数量中，计算成本在线性上呈线性增长，并且由于所需的特征源自交叉渗透损失，因此易于实现。将此方法应用于1,000多种语言模型，我们构建了一个“模型图”，为大规模模型分析提供了新的视角。

Title: BiDeV: Bilateral Defusing Verification for Complex Claim Fact-Checking

Authors: Yuxuan Liu, Hongda Sun, Wenya Guo, Xinyan Xiao, Cunli Mao, Zhengtao Yu, Rui Yan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.16181
Pdf URL: https://arxiv.org/pdf/2502.16181
Copy Paste: [[2502.16181]] BiDeV: Bilateral Defusing Verification for Complex Claim Fact-Checking(https://arxiv.org/abs/2502.16181)
Keywords: llm
Abstract: Complex claim fact-checking performs a crucial role in disinformation detection. However, existing fact-checking methods struggle with claim vagueness, specifically in effectively handling latent information and complex relations within claims. Moreover, evidence redundancy, where nonessential information complicates the verification process, remains a significant issue. To tackle these limitations, we propose Bilateral Defusing Verification (BiDeV), a novel fact-checking working-flow framework integrating multiple role-played LLMs to mimic the human-expert fact-checking process. BiDeV consists of two main modules: Vagueness Defusing identifies latent information and resolves complex relations to simplify the claim, and Redundancy Defusing eliminates redundant content to enhance the evidence quality. Extensive experimental results on two widely used challenging fact-checking benchmarks (Hover and Feverous-s) demonstrate that our BiDeV can achieve the best performance under both gold and open settings. This highlights the effectiveness of BiDeV in handling complex claims and ensuring precise fact-checking
摘要：复杂的主张事实检查在虚假信息检测中起着至关重要的作用。但是，现有的事实检查方法与索赔模糊性斗争，特别是在有效地处理索赔内的潜在信息和复杂关系方面。此外，在非必要信息使验证过程复杂化的证据冗余仍然是一个重要的问题。为了应对这些局限性，我们提出了双边脱水验证（BIDEV），这是一种新的事实检查的工作流框架，将多个角色扮演的LLMS整合在一起以模仿人类专家的事实检查过程。 BIDEV由两个主要模块组成：模糊性融合可以识别潜在信息并解决复杂的关系以简化主张，并裁员消除了消除冗余内容以提高证据质量。对两种广泛使用的挑战性事实检查基准（Hover and thofory-S）的广泛实验结果表明，我们的Bidev在黄金和开放环境下都可以取得最佳性能。这突出了Bidev在处理复杂主张并确保精确事实检查中的有效性

Title: IPO: Your Language Model is Secretly a Preference Classifier

Authors: Shivank Garg, Ayush Singh, Shweta Singh, Paras Chopra
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16182
Pdf URL: https://arxiv.org/pdf/2502.16182
Copy Paste: [[2502.16182]] IPO: Your Language Model is Secretly a Preference Classifier(https://arxiv.org/abs/2502.16182)
Keywords: language model, llm
Abstract: Reinforcement learning from human feedback (RLHF) has emerged as the primary method for aligning large language models (LLMs) with human preferences. While it enables LLMs to achieve human-level alignment, it often incurs significant computational and financial costs due to its reliance on training external reward models or human-labeled preferences. In this work, we propose \textbf{Implicit Preference Optimization (IPO)}, an alternative approach that leverages generative LLMs as preference classifiers, thereby reducing the dependence on external human feedback or reward models to obtain preferences. We conduct a comprehensive evaluation on the preference classification ability of LLMs using RewardBench, assessing models across different sizes, architectures, and training levels to validate our hypothesis. Furthermore, we investigate the self-improvement capabilities of LLMs by generating multiple responses for a given instruction and employing the model itself as a preference classifier for Direct Preference Optimization (DPO)-based training. Our findings demonstrate that models trained through IPO achieve performance comparable to those utilizing state-of-the-art reward models for obtaining preferences.
摘要：从人类反馈（RLHF）中学习的强化已成为将大语言模型（LLM）与人类偏好保持一致的主要方法。尽管它使LLMS能够达到人类水平的一致性，但由于其依赖培训外部奖励模型或人类标记的偏好，它通常会产生巨大的计算和财务成本。在这项工作中，我们提出\ textbf {隐式优先优化（IPO）}，这是一种利用生成llms作为偏好分类器的替代方法，从而降低了对外部人类反馈或奖励模型的依赖，以获得偏好。我们对使用RewardBench的LLM的偏好分类能力进行了全面评估，评估了不同大小，体系结构和培训水平的模型以验证我们的假设。此外，我们通过为给定指令产生多个响应并将模型本身作为基于直接偏好优化的偏好分类器（DPO）基于培训的偏好分类器，从而研究了LLM的自我完善能力。我们的发现表明，经过IPO训练的模型实现了与利用最先进的奖励模型获得偏好模型的模型相当的。

Title: ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning

Authors: Shulin Huang, Linyi Yang, Yan Song, Shuang Chen, Leyang Cui, Ziyu Wan, Qingcheng Zeng, Ying Wen, Kun Shao, Weinan Zhang, Jun Wang, Yue Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16268
Pdf URL: https://arxiv.org/pdf/2502.16268
Copy Paste: [[2502.16268]] ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning(https://arxiv.org/abs/2502.16268)
Keywords: language model, llm
Abstract: Evaluating large language models (LLMs) poses significant challenges, particularly due to issues of data contamination and the leakage of correct answers. To address these challenges, we introduce ThinkBench, a novel evaluation framework designed to evaluate LLMs' reasoning capability robustly. ThinkBench proposes a dynamic data generation method for constructing out-of-distribution (OOD) datasets and offers an OOD dataset that contains 2,912 samples drawn from reasoning tasks. ThinkBench unifies the evaluation of reasoning models and non-reasoning models. We evaluate 16 LLMs and 4 PRMs under identical experimental conditions and show that most of the LLMs' performance are far from robust and they face a certain level of data leakage. By dynamically generating OOD datasets, ThinkBench effectively provides a reliable evaluation of LLMs and reduces the impact of data contamination.
摘要：评估大型语言模型（LLMS）提出了重大挑战，尤其是由于数据污染问题和正确答案的泄漏。为了应对这些挑战，我们介绍了ThinkBench，这是一个新颖的评估框架，旨在鲁NMS的推理能力。 ThinkBench提出了一种动态数据生成方法，用于构建分布（OOD）数据集，并提供一个OOD数据集，其中包含从推理任务中绘制的2,912个样本。 ThinkBench统一了推理模型和非调理模型的评估。我们在相同的实验条件下评估了16个LLM和4个PRM，并表明LLMS的大多数性能远非可靠，并且面临一定程度的数据泄漏。通过动态生成OOD数据集，ThinkBench有效地提供了对LLM的可靠评估，并减少了数据污染的影响。

Title: LegalBench.PT: A Benchmark for Portuguese Law

Authors: Beatriz Canaverde, Telmo Pessoa Pires, Leonor Melo Ribeiro, André F. T. Martins
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16357
Pdf URL: https://arxiv.org/pdf/2502.16357
Copy Paste: [[2502.16357]] LegalBench.PT: A Benchmark for Portuguese Law(https://arxiv.org/abs/2502.16357)
Keywords: gpt, llm
Abstract: The recent application of LLMs to the legal field has spurred the creation of benchmarks across various jurisdictions and languages. However, no benchmark has yet been specifically designed for the Portuguese legal system. In this work, we present this http URL, the first comprehensive legal benchmark covering key areas of Portuguese law. To develop this http URL, we first collect long-form questions and answers from real law exams, and then use GPT-4o to convert them into multiple-choice, true/false, and matching formats. Once generated, the questions are filtered and processed to improve the quality of the dataset. To ensure accuracy and relevance, we validate our approach by having a legal professional review a sample of the generated questions. Although the questions are synthetically generated, we show that their basis in human-created exams and our rigorous filtering and processing methods applied result in a reliable benchmark for assessing LLMs' legal knowledge and reasoning abilities. Finally, we evaluate the performance of leading LLMs on this http URL and investigate potential biases in GPT-4o's responses. We also assess the performance of Portuguese lawyers on a sample of questions to establish a baseline for model comparison and validate the benchmark.
摘要：LLM在法律领域的最新应用促使在各种司法管辖区和语言中建立了基准。但是，尚未专门为葡萄牙法律制度设计基准。在这项工作中，我们介绍了此HTTP URL，这是涵盖葡萄牙法律关键领域的第一个全面的法律基准。为了开发此HTTP URL，我们首先从真实法律考试中收集长期的问题和答案，然后使用GPT-4O将其转换为多项选择，true/fals和匹配格式。生成后，对问题进行过滤并处理以提高数据集的质量。为了确保准确性和相关性，我们通过将法律专业的审查示例验证了生成问题的样本来验证我们的方法。尽管这些问题是合成的，但我们表明，它们在人类创建的考试中的基础以及我们严格的过滤和处理方法所应用的基础导致了评估LLMS的法律知识和推理能力的可靠基准。最后，我们评估了领先的LLM在此HTTP URL上的性能，并研究了GPT-4O响应中的潜在偏见。我们还评估了葡萄牙律师在一个问题样本中的表现，以建立基线以进行比较并验证基准。

Title: Wrong Answers Can Also Be Useful: PlausibleQA -- A Large-Scale QA Dataset with Answer Plausibility Scores

Authors: Jamshid Mozafari, Abdelrahman Abdallah, Bhawna Piryani, Adam Jatowt
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2502.16358
Pdf URL: https://arxiv.org/pdf/2502.16358
Copy Paste: [[2502.16358]] Wrong Answers Can Also Be Useful: PlausibleQA -- A Large-Scale QA Dataset with Answer Plausibility Scores(https://arxiv.org/abs/2502.16358)
Keywords: language model, llm, chat
Abstract: Large Language Models (LLMs) are revolutionizing information retrieval, with chatbots becoming an important source for answering user queries. As by their design, LLMs prioritize generating correct answers, the value of highly plausible yet incorrect answers (candidate answers) tends to be overlooked. However, such answers can still prove useful, for example, they can play a crucial role in tasks like Multiple-Choice Question Answering (MCQA) and QA Robustness Assessment (QARA). Existing QA datasets primarily focus on correct answers without explicit consideration of the plausibility of other candidate answers, limiting opportunity for more nuanced evaluations of models. To address this gap, we introduce PlausibleQA, a large-scale dataset comprising 10,000 questions and 100,000 candidate answers, each annotated with plausibility scores and justifications for their selection. Additionally, the dataset includes 900,000 justifications for pairwise comparisons between candidate answers, further refining plausibility assessments. We evaluate PlausibleQA through human assessments and empirical experiments, demonstrating its utility in MCQA and QARA analysis. Our findings show that plausibility-aware approaches are effective for MCQA distractor generation and QARA. We release PlausibleQA as a resource for advancing QA research and enhancing LLM performance in distinguishing plausible distractors from correct answers.
摘要：大型语言模型（LLM）正在彻底改变信息检索，聊天机器人成为回答用户查询的重要来源。通过设计，LLMS优先考虑生成正确的答案，高度合理但错误的答案（候选答案）的价值往往会被忽略。但是，这样的答案仍然可以证明是有用的，例如，它们可以在多项选择性答案（MCQA）和QA Rotustness评估（QARA）等任务中发挥关键作用。现有的质量检查数据集主要集中在正确的答案上，而无需明确考虑其他候选人答案的合理性，从而限制了对模型进行更细微的评估的机会。为了解决这一差距，我们介绍了PlausibleQA，这是一个大规模数据集，其中包含10,000个问题和100,000个候选答案，每个数据集都以合理的分数和选择的理由进行注释。此外，该数据集还包括900,000个理由，用于候选答案之间的成对比较，进一步完善了合理性评估。我们通过人类评估和经验实验评估PlausibleQA，证明了其在MCQA和QARA分析中的效用。我们的发现表明，合理性的方法对MCQA发电量和Qara有效。我们发布了PlausibleQA，作为推进质量检查研究并提高LLM性能的资源，以区分合理的干扰因素和正确答案。

Title: A generative approach to LLM harmfulness detection with special red flag tokens

Authors: Sophie Xhonneux, David Dobre, Mehrnaz Mohfakhami, Leo Schwinn, Gauthier Gidel
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.16366
Pdf URL: https://arxiv.org/pdf/2502.16366
Copy Paste: [[2502.16366]] A generative approach to LLM harmfulness detection with special red flag tokens(https://arxiv.org/abs/2502.16366)
Keywords: language model, llm, long context, prompt
Abstract: Most safety training methods for large language models (LLMs) based on fine-tuning rely on dramatically changing the output distribution of the model when faced with a harmful request, shifting it from an unsafe answer to a refusal to respond. These methods inherently compromise model capabilities and might make auto-regressive models vulnerable to attacks that make likely an initial token of affirmative response. To avoid that, we propose to expand the model's vocabulary with a special token we call red flag token () and propose to fine-tune the model to generate this token at any time harmful content is generated or about to be generated. This novel safety training method effectively augments LLMs into generative classifiers of harmfulness at all times during the conversation. This method offers several advantages: it enables the model to explicitly learn the concept of harmfulness while marginally affecting the generated distribution, thus maintaining the model's utility. It also evaluates each generated answer rather than just the input prompt and provides a stronger defence against sampling-based attacks. In addition, it simplifies the evaluation of the model's robustness and reduces correlated failures when combined with a classifier. We further show an increased robustness to long contexts, and supervised fine-tuning attacks.
摘要：大多数基于微调的大语模型（LLM）的安全培训方法在面对有害请求时会大大改变模型的输出分布，从而将其从不安全的答案转移到拒绝响应的情况下。这些方法固有地损害了模型功能，并可能使自动回归模型容易受到攻击的影响，而攻击可能是肯定响应的初始标记。为了避免这种情况，我们建议使用特殊令牌扩展模型的词汇量，我们称为危险信号令牌（），并提议微调模型以在任何时候生成有害内容或即将生成有害内容。这种新颖的安全训练方法在对话期间始终将LLMS有效地扩展到有害的生成分类器中。该方法提供了几个优点：它使模型能够明确地学习有害性的概念，同时略微影响生成的分布，从而维护模型的效用。它还评估了每个生成的答案，而不仅仅是输入提示，并为基于抽样的攻击提供了更强大的防御。此外，它简化了对模型鲁棒性的评估，并在与分类器结合使用时减少了相关的失败。我们进一步显示出对长篇小说和监督微调攻击的鲁棒性。

Title: Instruction-Tuning LLMs for Event Extraction with Annotation Guidelines

Authors: Saurabh Srivastava, Sweta Pati, Ziyu Yao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16377
Pdf URL: https://arxiv.org/pdf/2502.16377
Copy Paste: [[2502.16377]] Instruction-Tuning LLMs for Event Extraction with Annotation Guidelines(https://arxiv.org/abs/2502.16377)
Keywords: language model, llm
Abstract: In this work, we study the effect of annotation guidelines -- textual descriptions of event types and arguments, when instruction-tuning large language models for event extraction. We conducted a series of experiments with both human-provided and machine-generated guidelines in both full- and low-data settings. Our results demonstrate the promise of annotation guidelines when there is a decent amount of training data and highlight its effectiveness in improving cross-schema generalization and low-frequency event-type performance.
摘要：在这项工作中，我们研究注释准则的效果 - 当指示大型语言模型进行事件提取时，对事件类型和参数的文本描述。我们在全数据和低DATA设置中使用了人提供的和机器生成的指南进行了一系列实验。我们的结果表明，当有大量的培训数据并强调其在改善跨索马概括和低频事件型性能方面的有效性时，注释指南的希望。

Title: Sequence-level Large Language Model Training with Contrastive Preference Optimization

Authors: Zhili Feng, Dhananjay Ram, Cole Hawkins, Aditya Rawal, Jinman Zhao, Sheng Zha
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.16433
Pdf URL: https://arxiv.org/pdf/2502.16433
Copy Paste: [[2502.16433]] Sequence-level Large Language Model Training with Contrastive Preference Optimization(https://arxiv.org/abs/2502.16433)
Keywords: language model
Abstract: The next token prediction loss is the dominant self-supervised training objective for large language models and has achieved promising results in a variety of downstream tasks. However, upon closer investigation of this objective, we find that it lacks an understanding of sequence-level signals, leading to a mismatch between training and inference processes. To bridge this gap, we introduce a contrastive preference optimization (CPO) procedure that can inject sequence-level information into the language model at any training stage without expensive human labeled data. Our experiments show that the proposed objective surpasses the next token prediction in terms of win rate in the instruction-following and text generation tasks.
摘要：下一个令牌预测损失是大型语言模型的主要自我监督训练目标，并在各种下游任务中取得了令人鼓舞的结果。但是，经过对该目标进行仔细研究后，我们发现它缺乏对序列级信号的理解，从而导致训练和推理过程之间的不匹配。为了弥合这一差距，我们引入了对比性偏好优化（CPO）程序，可以在任何训练阶段将序列级别的信息注入语言模型，而无需昂贵的人类标记的数据。我们的实验表明，所提出的目标超过了下一个令牌预测，从指令跟随和文本生成任务中的获胜率方面。

Title: Contrastive Learning of English Language and Crystal Graphs for Multimodal Representation of Materials Knowledge

Authors: Yang Jeong Park, Mayank Kumaran, Chia-Wei Hsu, Elsa Olivetti, Ju Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16451
Pdf URL: https://arxiv.org/pdf/2502.16451
Copy Paste: [[2502.16451]] Contrastive Learning of English Language and Crystal Graphs for Multimodal Representation of Materials Knowledge(https://arxiv.org/abs/2502.16451)
Keywords: language model
Abstract: Artificial intelligence (AI) is increasingly used for the inverse design of materials, such as crystals and molecules. Existing AI research on molecules has integrated chemical structures of molecules with textual knowledge to adapt to complex instructions. However, this approach has been unattainable for crystals due to data scarcity from the biased distribution of investigated crystals and the lack of semantic supervision in peer-reviewed literature. In this work, we introduce a contrastive language-crystals model (CLaC) pre-trained on a newly synthesized dataset of 126k crystal structure-text pairs. To demonstrate the advantage of using synthetic data to overcome data scarcity, we constructed a comparable dataset extracted from academic papers. We evaluate CLaC's generalization ability through various zero-shot cross-modal tasks and downstream applications. In experiments, CLaC achieves state-of-the-art zero-shot generalization performance in understanding crystal structures, surpassing latest large language models.
摘要：人工智能（AI）越来越多地用于材料的反设计，例如晶体和分子。现有的对分子的AI研究具有分子的化学结构，并具有文本知识以适应复杂的指示。但是，由于研究晶体的有偏分布以及在同行评审的文献中缺乏语义监督的数据稀缺，这种方法是无法实现的。在这项工作中，我们在126K Crystal结构文本对的新合成数据集上介绍了对对比的语言 - 晶体模型（CLAC）。为了证明使用合成数据克服数据稀缺的优势，我们构建了从学术论文中提取的可比数据集。我们通过各种零射击跨模式任务和下游应用程序评估CLAC的概括能力。在实验中，CLAC在理解晶体结构中实现了最先进的零拍概括性能，超过了最新的大型语言模型。

Title: Towards Fully-Automated Materials Discovery via Large-Scale Synthesis Dataset and Expert-Level LLM-as-a-Judge

Authors: Heegyu Kim, Taeyang Jeon, Seungtaek Choi, Jihoon Hong, Dongwon Jeon, Sungbum Cho, Ga-Yeon Baek, Kyung-Won Kwak, Dong-Hee Lee, Sun-Jin Choi, Jisu Bae, Chihoon Lee, Yunseo Kim, Jinsung Park, Hyunsouk Cho
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16457
Pdf URL: https://arxiv.org/pdf/2502.16457
Copy Paste: [[2502.16457]] Towards Fully-Automated Materials Discovery via Large-Scale Synthesis Dataset and Expert-Level LLM-as-a-Judge(https://arxiv.org/abs/2502.16457)
Keywords: language model, llm
Abstract: Materials synthesis is vital for innovations such as energy storage, catalysis, electronics, and biomedical devices. Yet, the process relies heavily on empirical, trial-and-error methods guided by expert intuition. Our work aims to support the materials science community by providing a practical, data-driven resource. We have curated a comprehensive dataset of 17K expert-verified synthesis recipes from open-access literature, which forms the basis of our newly developed benchmark, AlchemyBench. AlchemyBench offers an end-to-end framework that supports research in large language models applied to synthesis prediction. It encompasses key tasks, including raw materials and equipment prediction, synthesis procedure generation, and characterization outcome forecasting. We propose an LLM-as-a-Judge framework that leverages large language models for automated evaluation, demonstrating strong statistical agreement with expert assessments. Overall, our contributions offer a supportive foundation for exploring the capabilities of LLMs in predicting and guiding materials synthesis, ultimately paving the way for more efficient experimental design and accelerated innovation in materials science.
摘要：材料合成对于诸如储能，催化，电子和生物医学设备等创新至关重要。然而，该过程在很大程度上依赖于以专家直觉为指导的经验，反复试验方法。我们的工作旨在通过提供实用，数据驱动的资源来支持材料科学界。我们已经策划了一个全面的数据集，该数据集是开放式文献中的17K专家培养的合成食谱，该数据构成了我们新开发的基准Alchemybench的基础。 Alchemybench提供了一个端到端框架，该框架支持用于合成预测的大语言模型中的研究。它包括关键任务，包括原材料和设备预测，综合程序生成和表征结果预测。我们提出了一个LLM-AS-A-A-a-a-Gudge框架，该框架利用大型语言模型进行自动化评估，并证明了与专家评估的强烈统计一致。总体而言，我们的贡献为探索LLMS预测和指导材料合成功能的能力提供了支持基础，最终为更有效的实验设计铺平了道路，并在材料科学中加速了创新。

Title: A Fine-Tuning Approach for T5 Using Knowledge Graphs to Address Complex Tasks

Authors: Xiaoxuan Liao, Binrong Zhu, Jacky He, Guiran Liu, Hongye Zheng, Jia Gao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16484
Pdf URL: https://arxiv.org/pdf/2502.16484
Copy Paste: [[2502.16484]] A Fine-Tuning Approach for T5 Using Knowledge Graphs to Address Complex Tasks(https://arxiv.org/abs/2502.16484)
Keywords: language model
Abstract: With the development of deep learning technology, large language models have achieved remarkable results in many natural language processing tasks. However, these models still have certain limitations in handling complex reasoning tasks and understanding rich background knowledge. To solve this problem, this study proposed a T5 model fine-tuning method based on knowledge graphs, which enhances the model's reasoning ability and context understanding ability by introducing external knowledge graphs. We used the SQuAD1.1 dataset for experiments. The experimental results show that the T5 model based on knowledge graphs is significantly better than other baseline models in reasoning accuracy, context understanding, and the ability to handle complex problems. At the same time, we also explored the impact of knowledge graphs of different scales on model performance and found that as the scale of the knowledge graph increases, the performance of the model gradually improves. Especially when dealing with complex problems, the introduction of knowledge graphs greatly improves the reasoning ability of the T5 model. Ablation experiments further verify the importance of entity and relationship embedding in the model and prove that a complete knowledge graph is crucial to improving the various capabilities of the T5 model. In summary, this study provides an effective method to enhance the reasoning and understanding capabilities of large language models and provides new directions for future research.
摘要：随着深度学习技术的发展，大型语言模型在许多自然语言处理任务中取得了显着的结果。但是，这些模型在处理复杂的推理任务和了解丰富的背景知识时仍有一定的局限性。为了解决这个问题，本研究提出了一种基于知识图的T5模型微调方法，该方法通过引入外部知识图来增强模型的推理能力和上下文理解能力。我们使用了Squad1.1数据集进行实验。实验结果表明，基于知识图的T5模型在推理准确性，上下文理解和处理复杂问题的能力方面明显优于其他基线模型。同时，我们还探讨了不同量表的知识图对模型性能的影响，并发现随着知识图的规模的增加，模型的性能逐渐提高。特别是在处理复杂问题时，知识图的引入大大提高了T5模型的推理能力。消融实验进一步验证了模型中实体和关系嵌入的重要性，并证明完整的知识图对于提高T5模型的各种功能至关重要。总而言之，这项研究提供了一种有效的方法来增强大语模型的推理和理解能力，并为将来的研究提供了新的方向。

Title: All That Glitters is Not Novel: Plagiarism in AI Generated Research

Authors: Tarun Gupta, Danish Pruthi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16487
Pdf URL: https://arxiv.org/pdf/2502.16487
Copy Paste: [[2502.16487]] All That Glitters is Not Novel: Plagiarism in AI Generated Research(https://arxiv.org/abs/2502.16487)
Keywords: llm, agent
Abstract: Automating scientific research is considered the final frontier of science. Recently, several papers claim autonomous research agents can generate novel research ideas. Amidst the prevailing optimism, we document a critical concern: a considerable fraction of such research documents are smartly plagiarized. Unlike past efforts where experts evaluate the novelty and feasibility of research ideas, we request $13$ experts to operate under a different situational logic: to identify similarities between LLM-generated research documents and existing work. Concerningly, the experts identify $24\%$ of the $50$ evaluated research documents to be either paraphrased (with one-to-one methodological mapping), or significantly borrowed from existing work. These reported instances are cross-verified by authors of the source papers. Problematically, these LLM-generated research documents do not acknowledge original sources, and bypass inbuilt plagiarism detectors. Lastly, through controlled experiments we show that automated plagiarism detectors are inadequate at catching deliberately plagiarized ideas from an LLM. We recommend a careful assessment of LLM-generated research, and discuss the implications of our findings on research and academic publishing.
摘要：自动化科学研究被认为是科学的最终领域。最近，几篇论文声称自主研究代理可以产生新颖的研究思想。在普遍的乐观态度中，我们记录了一个关键问题：此类研究文件中很大一部分被智能地窃。与过去的专家评估研究思想的新颖性和可行性的努力不同，我们要求$ 13 $的专家在不同的情境逻辑下运作：确定LLM生成的研究文档与现有工作之间的相似之处。关于$ 50 $评估的研究文件的$ 24 \％$ $ 24 \％的研究文件（用一对一的方法映射），或从现有工作中大量借用。这些报告的实例由源文件的作者进行了交叉验证。有问题的是，这些LLM生成的研究文件不承认原始来源，并绕过内置的pla窃探测器。最后，通过受控的实验，我们表明自动窃探测器并不是从LLM捕捉故意窃思想的不足。我们建议对LLM生成的研究进行仔细评估，并讨论我们对研究和学术出版的发现的含义。

Title: Intrinsic Model Weaknesses: How Priming Attacks Unveil Vulnerabilities in Large Language Models

Authors: Yuyi Huang, Runzhe Zhan, Derek F. Wong, Lidia S. Chao, Ailin Tao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16491
Pdf URL: https://arxiv.org/pdf/2502.16491
Copy Paste: [[2502.16491]] Intrinsic Model Weaknesses: How Priming Attacks Unveil Vulnerabilities in Large Language Models(https://arxiv.org/abs/2502.16491)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) have significantly influenced various industries but suffer from a critical flaw, the potential sensitivity of generating harmful content, which poses severe societal risks. We developed and tested novel attack strategies on popular LLMs to expose their vulnerabilities in generating inappropriate content. These strategies, inspired by psychological phenomena such as the "Priming Effect", "Safe Attention Shift", and "Cognitive Dissonance", effectively attack the models' guarding mechanisms. Our experiments achieved an attack success rate (ASR) of 100% on various open-source models, including Meta's Llama-3.2, Google's Gemma-2, Mistral's Mistral-NeMo, Falcon's Falcon-mamba, Apple's DCLM, Microsoft's Phi3, and Qwen's Qwen2.5, among others. Similarly, for closed-source models such as OpenAI's GPT-4o, Google's Gemini-1.5, and Claude-3.5, we observed an ASR of at least 95% on the AdvBench dataset, which represents the current state-of-the-art. This study underscores the urgent need to reassess the use of generative models in critical applications to mitigate potential adverse societal impacts.
摘要：大型语言模型（LLM）严重影响了各个行业，但遭受了严重缺陷，产生有害内容的潜在敏感性，这带来了严重的社会风险。我们制定并测试了对流行LLM的新型攻击策略，以暴露其在产生不适当内容方面的脆弱性。这些策略受到心理现象的启发，例如“启动效应”，“安全注意力转移”和“认知失调”，有效地攻击了模型的保护机制。我们的实验在各种开源模型上达到了100％的攻击成功率（ASR），包括Meta的Llama-3.2，Google的Gemma-2，Mistral的Mismtral-Nemo，Falcon的Falcon's Falcon-Mamba，Apple的DCLM，Microsoft的Phi3和Qwen's Qwen2 .5，等等。同样，对于诸如OpenAI的GPT-4O，Google的Gemini-1.5和Claude-3.5之类的封闭源模型，我们在Advbench数据集中观察到了至少95％的ASR，该数据集代表了当前的最新ART。这项研究强调了迫切需要在关键应用中重新评估生成模型的使用，以减轻潜在的不利社会影响。

Title: FanChuan: A Multilingual and Graph-Structured Benchmark For Parody Detection and Analysis

Authors: Yilun Zheng, Sha Li, Fangkun Wu, Yang Ziyi, Lin Hongchao, Zhichao Hu, Cai Xinjun, Ziming Wang, Jinxuan Chen, Sitao Luan, Jiahao Xu, Lihui Chen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.16503
Pdf URL: https://arxiv.org/pdf/2502.16503
Copy Paste: [[2502.16503]] FanChuan: A Multilingual and Graph-Structured Benchmark For Parody Detection and Analysis(https://arxiv.org/abs/2502.16503)
Keywords: language model, gpt, llm
Abstract: Parody is an emerging phenomenon on social media, where individuals imitate a role or position opposite to their own, often for humor, provocation, or controversy. Detecting and analyzing parody can be challenging and is often reliant on context, yet it plays a crucial role in understanding cultural values, promoting subcultures, and enhancing self-expression. However, the study of parody is hindered by limited available data and deficient diversity in current datasets. To bridge this gap, we built seven parody datasets from both English and Chinese corpora, with 14,755 annotated users and 21,210 annotated comments in total. To provide sufficient context information, we also collect replies and construct user-interaction graphs to provide richer contextual information, which is lacking in existing datasets. With these datasets, we test traditional methods and Large Language Models (LLMs) on three key tasks: (1) parody detection, (2) comment sentiment analysis with parody, and (3) user sentiment analysis with parody. Our extensive experiments reveal that parody-related tasks still remain challenging for all models, and contextual information plays a critical role. Interestingly, we find that, in certain scenarios, traditional sentence embedding methods combined with simple classifiers can outperform advanced LLMs, i.e. DeepSeek-R1 and GPT-o3, highlighting parody as a significant challenge for LLMs.
摘要：模仿是社交媒体上的一种新兴现象，在该现象中，人们模仿自己的角色或位置，通常是出于幽默，挑衅或争议。检测和分析模仿可能具有挑战性，并且通常依赖于上下文，但它在理解文化价值观，促进亚文化和增强自我表达方面起着至关重要的作用。但是，在当前数据集中，可用数据和多样性不足的多样性阻碍了模仿的研究。为了弥合这一差距，我们从英语和中文语料库中构建了七个模仿数据集，总共有14,755个带注释的用户和21,210个带注释的评论。为了提供足够的上下文信息，我们还收集答复并构建用户互动图，以提供更丰富的上下文信息，而这些信息在现有数据集中缺乏。使用这些数据集，我们在三个关键任务上测试传统方法和大语言模型（LLM）：（1）模仿检测，（2）用模仿的评论情感分析，以及（3）用模仿的用户情感分析。我们广泛的实验表明，与模仿相关的任务对于所有模型仍然具有挑战性，上下文信息起着至关重要的作用。有趣的是，我们发现，在某些情况下，传统的句子嵌入方法与简单的分类器结合使用，可以超越高级LLM，即DeepSeek-R1和GPT-O3，这突出了模仿是对LLM的重大挑战。

Title: GraphCheck: Breaking Long-Term Text Barriers with Extracted Knowledge Graph-Powered Fact-Checking

Authors: Yingjian Chen, Haoran Liu, Yinhong Liu, Rui Yang, Han Yuan, Yanran Fu, Pengyuan Zhou, Qingyu Chen, James Caverlee, Irene Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16514
Pdf URL: https://arxiv.org/pdf/2502.16514
Copy Paste: [[2502.16514]] GraphCheck: Breaking Long-Term Text Barriers with Extracted Knowledge Graph-Powered Fact-Checking(https://arxiv.org/abs/2502.16514)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are widely used, but they often generate subtle factual errors, especially in long-form text. These errors are fatal in some specialized domains such as medicine. Existing fact-checking with grounding documents methods face two main challenges: (1) they struggle to understand complex multihop relations in long documents, often overlooking subtle factual errors; (2) most specialized methods rely on pairwise comparisons, requiring multiple model calls, leading to high resource and computational costs. To address these challenges, we propose \textbf{\textit{GraphCheck}}, a fact-checking framework that uses extracted knowledge graphs to enhance text representation. Graph Neural Networks further process these graphs as a soft prompt, enabling LLMs to incorporate structured knowledge more effectively. Enhanced with graph-based reasoning, GraphCheck captures multihop reasoning chains which are often overlooked by existing methods, enabling precise and efficient fact-checking in a single inference call. Experimental results on seven benchmarks spanning both general and medical domains demonstrate a 6.1\% overall improvement over baseline models. Notably, GraphCheck outperforms existing specialized fact-checkers and achieves comparable performance with state-of-the-art LLMs, such as DeepSeek-V3 and OpenAI-o1, with significantly fewer parameters.
摘要：大型语言模型（LLM）被广泛使用，但它们通常会产生细微的事实错误，尤其是在长篇文本中。这些错误在某些专业领域（例如医学）是致命的。现有的事实核对文档方法面临两个主要挑战：（1）他们努力理解长文件中的复杂的多跃波关系，通常会忽略微妙的事实错误；（2）最专业的方法依赖于成对比较，需要多个模型调用，从而导致高资源和计算成本。为了解决这些挑战，我们建议\ textbf {\ textit {graphCheck}}，这是一个事实检查框架，使用提取的知识图来增强文本表示。图形神经网络将这些图作为软提示进一步处理，从而使LLMS能够更有效地合并结构化知识。通过基于图的推理增强，GraphCheck捕获了多台化推理链，这些链通常被现有方法忽略，从而在单个推理调用中启用了精确有效的事实检查。跨越一般和医疗领域的七个基准测试结果的实验结果表明，基线模型的总体改善为6.1 \％。值得注意的是，GraphCheck的表现优于现有的专业事实检查器，并且与最先进的LLMS（例如DeepSeek-V3和OpenAI-O1）相当的性能，参数较少。

Title: Pay Attention to Real World Perturbations! Natural Robustness Evaluation in Machine Reading Comprehension

Authors: Yulong Wu, Viktor Schlegel, Riza Batista-Navarro
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.16523
Pdf URL: https://arxiv.org/pdf/2502.16523
Copy Paste: [[2502.16523]] Pay Attention to Real World Perturbations! Natural Robustness Evaluation in Machine Reading Comprehension(https://arxiv.org/abs/2502.16523)
Keywords: language model, llm
Abstract: As neural language models achieve human-comparable performance on Machine Reading Comprehension (MRC) and see widespread adoption, ensuring their robustness in real-world scenarios has become increasingly important. Current robustness evaluation research, though, primarily develops synthetic perturbation methods, leaving unclear how well they reflect real life scenarios. Considering this, we present a framework to automatically examine MRC models on naturally occurring textual perturbations, by replacing paragraph in MRC benchmarks with their counterparts based on available Wikipedia edit history. Such perturbation type is natural as its design does not stem from an arteficial generative process, inherently distinct from the previously investigated synthetic approaches. In a large-scale study encompassing SQUAD datasets and various model architectures we observe that natural perturbations result in performance degradation in pre-trained encoder language models. More worryingly, these state-of-the-art Flan-T5 and Large Language Models (LLMs) inherit these errors. Further experiments demonstrate that our findings generalise to natural perturbations found in other more challenging MRC benchmarks. In an effort to mitigate these errors, we show that it is possible to improve the robustness to natural perturbations by training on naturally or synthetically perturbed examples, though a noticeable gap still remains compared to performance on unperturbed data.
摘要：随着神经语言模型在机器阅读理解（MRC）上实现人为比较的表现，并参见广泛的采用，确保其在现实情况下的鲁棒性变得越来越重要。但是，当前的鲁棒性评估研究主要是开发合成的扰动方法，尚不清楚它们反映现实生活中的情况。考虑到这一点，我们提出了一个框架，可以通过基于可用的Wikipedia编辑历史记录在MRC基准中替换MRC基准中的段落，以自动检查MRC模型。这种扰动类型是自然的，因为它的设计并不源于骨膜生成过程，本质上与先前研究的合成方法不同。在包括小队数据集和各种模型体系结构的大规模研究中，我们观察到自然扰动会导致预训练的编码器语言模型的性能降解。更令人担忧的是，这些最先进的Flan-T5和大型语言模型（LLMS）继承了这些错误。进一步的实验表明，我们的发现推广到其他更具挑战性的MRC基准中发现的自然扰动。为了减轻这些错误，我们表明，通过对自然或合成扰动的示例进行培训，可以提高自然扰动的鲁棒性，尽管与未扰动数据的性能相比，仍然存在明显的差距。

Title: Retrieval-Augmented Fine-Tuning With Preference Optimization For Visual Program Generation

Authors: Deokhyung Kang, Jeonghun Cho, Yejin Jeon, Sunbin Jang, Minsub Lee, Jawoon Cho, Gary Geunbae Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.16529
Pdf URL: https://arxiv.org/pdf/2502.16529
Copy Paste: [[2502.16529]] Retrieval-Augmented Fine-Tuning With Preference Optimization For Visual Program Generation(https://arxiv.org/abs/2502.16529)
Keywords: language model, llm, prompt
Abstract: Visual programming languages (VPLs) allow users to create programs through graphical interfaces, which results in easier accessibility and their widespread usage in various domains. To further enhance this accessibility, recent research has focused on generating VPL code from user instructions using large language models (LLMs). Specifically, by employing prompting-based methods, these studies have shown promising results. Nevertheless, such approaches can be less effective for industrial VPLs such as Ladder Diagram (LD). LD is a pivotal language used in industrial automation processes and involves extensive domain-specific configurations, which are difficult to capture in a single prompt. In this work, we demonstrate that training-based methods outperform prompting-based methods for LD generation accuracy, even with smaller backbone models. Building on these findings, we propose a two-stage training strategy to further enhance VPL generation. First, we employ retrieval-augmented fine-tuning to leverage the repetitive use of subroutines commonly seen in industrial VPLs. Second, we apply direct preference optimization (DPO) to further guide the model toward accurate outputs, using systematically generated preference pairs through graph editing operations. Extensive experiments on real-world LD data demonstrate that our approach improves program-level accuracy by over 10% compared to supervised fine-tuning, which highlights its potential to advance industrial automation.
摘要：视觉编程语言（VPLS）允许用户通过图形接口创建程序，从而使可访问性更容易及其在各个域中的广泛使用。为了进一步增强这种可访问性，最近的研究重点是使用大语言模型（LLMS）从用户说明中生成VPL代码。具体而言，通过采用基于促进的方法，这些研究表明了有希望的结果。然而，对于阶梯图（LD）等工业VPL，这种方法可能会效率不高。 LD是一种用于工业自动化过程的关键语言，涉及广泛的域特异性配置，在一个提示中很难捕获。在这项工作中，我们证明，即使使用较小的骨干模型，基于培训的方法也优于基于提示的LD生成精度的方法。在这些发现的基础上，我们提出了一种两阶段的培训策略，以进一步增强VPL生成。首先，我们采用检索调查的微调来利用工业VPL中常见的子例程的重复使用。其次，我们采用直接优先优化（DPO），以系统地生成的优先对通过图编辑操作，进一步指导模型到准确的输出。对现实世界LD数据的广泛实验表明，与受监督的微调相比，我们的方法提高了计划级别的准确性超过10％，这突出了其提高工业自动化的潜力。

Title: Multilingual != Multicultural: Evaluating Gaps Between Multilingual Capabilities and Cultural Alignment in LLMs

Authors: Jonathan Rystrøm, Hannah Rose Kirk, Scott Hale
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2502.16534
Pdf URL: https://arxiv.org/pdf/2502.16534
Copy Paste: [[2502.16534]] Multilingual != Multicultural: Evaluating Gaps Between Multilingual Capabilities and Cultural Alignment in LLMs(https://arxiv.org/abs/2502.16534)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are becoming increasingly capable across global languages. However, the ability to communicate across languages does not necessarily translate to appropriate cultural representations. A key concern is US-centric bias, where LLMs reflect US rather than local cultural values. We propose a novel methodology that compares LLM-generated response distributions against population-level opinion data from the World Value Survey across four languages (Danish, Dutch, English, and Portuguese). Using a rigorous linear mixed-effects regression framework, we compare two families of models: Google's Gemma models (2B--27B parameters) and successive iterations of OpenAI's turbo-series. Across the families of models, we find no consistent relationships between language capabilities and cultural alignment. While the Gemma models have a positive correlation between language capability and cultural alignment across languages, the OpenAI models do not. Importantly, we find that self-consistency is a stronger predictor of multicultural alignment than multilingual capabilities. Our results demonstrate that achieving meaningful cultural alignment requires dedicated effort beyond improving general language capabilities.
摘要：大型语言模型（LLM）越来越多地跨全球语言。但是，跨语言交流的能力并不一定会转化为适当的文化代表。一个关键问题是以美国为中心的偏见，其中LLM反映了我们，而不是当地的文化价值。我们提出了一种新颖的方法，将LLM生成的响应分布与跨四种语言（丹麦，荷兰语，英语和葡萄牙语）的世界价值调查的人群级别的意见数据进行了比较。使用严格的线性混合效应回归框架，我们比较了两个模型家族：Google的Gemma型号（2B--27B参数）和OpenAI涡轮系列的连续迭代。在整个模型家族中，我们发现语言能力和文化一致性之间没有一致的关系。尽管Gemma模型在语言能力和跨语言的文化一致性之间具有正相关，但OpenAI模型却没有。重要的是，我们发现自我矛盾是多元文化一致性比多语言能力更强大的预测指标。我们的结果表明，实现有意义的文化统一需要不仅提高一般语言能力，还需要专门的努力。

Title: Advanced Chain-of-Thought Reasoning for Parameter Extraction from Documents Using Large Language Models

Authors: Hong Cai Chen, Yi Pin Xu, Yang Zhang
Subjects: cs.CL, cs.AI, cs.AR, cs.LG
Abstract URL: https://arxiv.org/abs/2502.16540
Pdf URL: https://arxiv.org/pdf/2502.16540
Copy Paste: [[2502.16540]] Advanced Chain-of-Thought Reasoning for Parameter Extraction from Documents Using Large Language Models(https://arxiv.org/abs/2502.16540)
Keywords: language model, llm, chain-of-thought
Abstract: Extracting parameters from technical documentation is crucial for ensuring design precision and simulation reliability in electronic design. However, current methods struggle to handle high-dimensional design data and meet the demands of real-time processing. In electronic design automation (EDA), engineers often manually search through extensive documents to retrieve component parameters required for constructing PySpice models, a process that is both labor-intensive and time-consuming. To address this challenge, we propose an innovative framework that leverages large language models (LLMs) to automate the extraction of parameters and the generation of PySpice models directly from datasheets. Our framework introduces three Chain-of-Thought (CoT) based techniques: (1) Targeted Document Retrieval (TDR), which enables the rapid identification of relevant technical sections; (2) Iterative Retrieval Optimization (IRO), which refines the parameter search through iterative improvements; and (3) Preference Optimization (PO), which dynamically prioritizes key document sections based on relevance. Experimental results show that applying all three methods together improves retrieval precision by 47.69% and reduces processing latency by 37.84%. Furthermore, effect size analysis using Cohen's d reveals that PO significantly reduces latency, while IRO contributes most to precision enhancement. These findings underscore the potential of our framework to streamline EDA processes, enhance design accuracy, and shorten development timelines. Additionally, our algorithm has model-agnostic generalization, meaning it can improve parameter search performance across different LLMs.
摘要：从技术文档中提取参数对于确保电子设计的设计精度和模拟可靠性至关重要。但是，当前的方法难以处理高维设计数据并满足实时处理的需求。在电子设计自动化（EDA）中，工程师经常手动搜索广泛的文档，以检索构建Pyspice模型所需的组件参数，这是一个既劳动密集型又耗时的过程。为了应对这一挑战，我们提出了一个创新的框架，该框架利用大型语言模型（LLMS）直接从数据表中自动提取参数和Pyspice模型的生成。我们的框架介绍了三个基于思考链（COT）的技术：（1）有针对性的文档检索（TDR），该技术可以快速识别相关的技术部分；（2）迭代检索优化（IRO），通过迭代改进来完善参数搜索；（3）偏好优化（PO），该优化基于相关性将关键文档部分动态排序。实验结果表明，将所有三种方法一起使用共同提高了47.69％的检索精度，并将加工潜伏期降低了37.84％。此外，使用Cohen D的效应大小分析表明，PO显着降低了潜伏期，而IRO则最大程度地提高了精确的增强。这些发现强调了我们框架简化EDA流程，提高设计准确性并缩短开发时间表的潜力。此外，我们的算法具有模型不可静止的概括，这意味着它可以改善不同LLM的参数搜索性能。

Title: Reasoning About Persuasion: Can LLMs Enable Explainable Propaganda Detection?

Authors: Maram Hasanain, Md Arid Hasan, Mohamed Bayan Kmainasi, Elisa Sartori, Ali Ezzat Shahroor, Giovanni Da San Martino, Firoj Alam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16550
Pdf URL: https://arxiv.org/pdf/2502.16550
Copy Paste: [[2502.16550]] Reasoning About Persuasion: Can LLMs Enable Explainable Propaganda Detection?(https://arxiv.org/abs/2502.16550)
Keywords: llm
Abstract: There has been significant research on propagandistic content detection across different modalities and languages. However, most studies have primarily focused on detection, with little attention given to explanations justifying the predicted label. This is largely due to the lack of resources that provide explanations alongside annotated labels. To address this issue, we propose a multilingual (i.e., Arabic and English) explanation-enhanced dataset, the first of its kind. Additionally, we introduce an explanation-enhanced LLM for both label detection and rationale-based explanation generation. Our findings indicate that the model performs comparably while also generating explanations. We will make the dataset and experimental resources publicly available for the research community.
摘要：关于跨不同方式和语言的宣传内容检测的大量研究。但是，大多数研究主要集中在检测上，而对预测标签的解释很少。这在很大程度上是由于缺乏与注释标签一起提供解释的资源。为了解决这个问题，我们提出了一个多语言（即阿拉伯语和英语）解释增强数据集的数据集。此外，我们介绍了一个以标签检测和基于基本原理的解释生成的解释增强的LLM。我们的发现表明，该模型的性能相当同时也产生了解释。我们将使数据集和实验资源公开可用于研究界。

Title: Beyond Words: How Large Language Models Perform in Quantitative Management Problem-Solving

Authors: Jonathan Kuzmanko
Subjects: cs.CL, cs.AI, cs.LG, stat.AP
Abstract URL: https://arxiv.org/abs/2502.16556
Pdf URL: https://arxiv.org/pdf/2502.16556
Copy Paste: [[2502.16556]] Beyond Words: How Large Language Models Perform in Quantitative Management Problem-Solving(https://arxiv.org/abs/2502.16556)
Keywords: language model, llm
Abstract: This study examines how Large Language Models (LLMs) perform when tackling quantitative management decision problems in a zero-shot setting. Drawing on 900 responses generated by five leading models across 20 diverse managerial scenarios, our analysis explores whether these base models can deliver accurate numerical decisions under varying presentation formats, scenario complexities, and repeated attempts. Contrary to prior findings, we observed no significant effects of text presentation format (direct, narrative, or tabular) or text length on accuracy. However, scenario complexity -- particularly in terms of constraints and irrelevant parameters -- strongly influenced performance, often degrading accuracy. Surprisingly, the models handled tasks requiring multiple solution steps more effectively than expected. Notably, only 28.8\% of responses were exactly correct, highlighting limitations in precision. We further found no significant ``learning effect'' across iterations: performance remained stable across repeated queries. Nonetheless, significant variations emerged among the five tested LLMs, with some showing superior binary accuracy. Overall, these findings underscore both the promise and the pitfalls of harnessing LLMs for complex quantitative decision-making, informing managers and researchers about optimal deployment strategies.
摘要：这项研究研究了在零弹性设置中解决定量管理决策问题时，大型语言模型（LLM）的性能。我们的分析借鉴了五个不同的管理场景中的五个领先模型生成的900个响应，探讨了这些基本模型是否可以在不同的演示格式，场景复杂性和重复尝试下提供准确的数值决策。与先前的发现相反，我们观察到文本演示格式（直接，叙述或表格）或文本长度对准确性没有显着影响。但是，场景复杂性 - 尤其是在约束和无关参数方面 - 强烈影响性能，通常会降低准确性。令人惊讶的是，这些模型处理了需要多个解决方案步骤的任务。值得注意的是，只有28.8％的响应完全正确，这突出了精确的局限性。我们进一步在迭代中发现没有显着的``学习效应''：在重复查询中的性能保持稳定。但是，在五个测试的LLM中出现了显着的变化，其中一些显示出较高的二进制精度。总体而言，这些发现强调了利用LLMS进行复杂定量决策的承诺和陷阱，向经理和研究人员告知最佳部署策略。

Title: Revealing the Pragmatic Dilemma for Moral Reasoning Acquisition in Language Models

Authors: Guangliang Liu, Lei Jiang, Xitong Zhang, Kristen Marie Johnson
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16600
Pdf URL: https://arxiv.org/pdf/2502.16600
Copy Paste: [[2502.16600]] Revealing the Pragmatic Dilemma for Moral Reasoning Acquisition in Language Models(https://arxiv.org/abs/2502.16600)
Keywords: language model, llm
Abstract: Ensuring that Large Language Models (LLMs) return just responses which adhere to societal values is crucial for their broader application. Prior research has shown that LLMs often fail to perform satisfactorily on tasks requiring moral cognizance, such as ethics-based judgments. While current approaches have focused on fine-tuning LLMs with curated datasets to improve their capabilities on such tasks, choosing the optimal learning paradigm to enhance the ethical responses of LLMs remains an open research debate. In this work, we aim to address this fundamental question: can current learning paradigms enable LLMs to acquire sufficient moral reasoning capabilities? Drawing from distributional semantics theory and the pragmatic nature of moral discourse, our analysis indicates that performance improvements follow a mechanism similar to that of semantic-level tasks, and therefore remain affected by the pragmatic nature of morals latent in discourse, a phenomenon we name the pragmatic dilemma. We conclude that this pragmatic dilemma imposes significant limitations on the generalization ability of current learning paradigms, making it the primary bottleneck for moral reasoning acquisition in LLMs.
摘要：确保大型语言模型（LLM）返回仅遵守社会价值的响应对于更广泛的应用至关重要。先前的研究表明，LLMS通常无法在需要道德认知的任务上令人满意地执行，例如基于道德的判断。尽管当前的方法专注于通过策划的数据集进行微调LLM，以提高其在此类任务上的能力，但选择最佳的学习范式来增强LLMS的道德回应仍然是一项公开的研究辩论。在这项工作中，我们旨在解决这个基本问题：当前的学习范式是否可以使LLMS能够获得足够的道德推理能力？从分布语义理论和道德话语的务实性质中汲取灵感，我们的分析表明，绩效的改进遵循与语义级任务相似的机制，因此仍然受到“道德在话语中的务实本质”的影响，我们将其称为一种现象，我们将务实的困境。我们得出的结论是，这种务实的困境对当前学习范式的概括能力施加了重大局限性，这使其成为LLMS道德推理获取的主要瓶颈。

Title: MemeIntel: Explainable Detection of Propagandistic and Hateful Memes

Authors: Mohamed Bayan Kmainasi, Abul Hasnat, Md Arid Hasan, Ali Ezzat Shahroor, Firoj Alam
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.16612
Pdf URL: https://arxiv.org/pdf/2502.16612
Copy Paste: [[2502.16612]] MemeIntel: Explainable Detection of Propagandistic and Hateful Memes(https://arxiv.org/abs/2502.16612)
Keywords: language model
Abstract: The proliferation of multimodal content on social media presents significant challenges in understanding and moderating complex, context-dependent issues such as misinformation, hate speech, and propaganda. While efforts have been made to develop resources and propose new methods for automatic detection, limited attention has been given to label detection and the generation of explanation-based rationales for predicted labels. To address this challenge, we introduce MemeIntel, an explanation-enhanced dataset for propaganda memes in Arabic and hateful memes in English, making it the first large-scale resource for these tasks. To solve these tasks, we propose a multi-stage optimization approach and train Vision-Language Models (VLMs). Our results demonstrate that this approach significantly improves performance over the base model for both \textbf{label detection} and explanation generation, outperforming the current state-of-the-art with an absolute improvement of ~3% on ArMeme and ~7% on Hateful Memes. For reproducibility and future research, we aim to make the MemeIntel dataset and experimental resources publicly available.
摘要：社交媒体上多模式内容的扩散在理解和调节复杂的，依赖上下文的问题（例如误导，仇恨言论和宣传）方面提出了重大挑战。尽管已经努力开发资源并提出了自动检测的新方法，但对标签检测和预测标签的基于解释的理由的产生有限。为了应对这一挑战，我们介绍了Memeintel，这是一个用英语的阿拉伯语和仇恨模因宣传模因的解释增强数据集，这使其成为这些任务的第一个大规模资源。为了解决这些任务，我们提出了一种多阶段优化方法和火车视觉模型（VLMS）。我们的结果表明，这种方法可显着提高\ textBf {label检测}和解释生成的基本模型的性能，表现优于当前的最新技术，在Armeme的绝对提高〜3％，在Armeme上的提高〜7％，在可恶的模因。对于可重复性和未来的研究，我们旨在使Memeintel数据集和实验资源公开可用。

Title: CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models

Authors: Alexander Zhang, Marcus Dong, Jiaheng Liu, Wei Zhang, Yejie Wang, Jian Yang, Ge Zhang, Tianyu Liu, Zhongyuan Peng, Yingshui Tan, Yuanxing Zhang, Zhexu Wang, Weixun Wang, Yancheng He, Ken Deng, Wangchunshu Zhou, Wenhao Huang, Zhaoxiang Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16614
Pdf URL: https://arxiv.org/pdf/2502.16614
Copy Paste: [[2502.16614]] CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models(https://arxiv.org/abs/2502.16614)
Keywords: language model, llm
Abstract: The critique capacity of Large Language Models (LLMs) is essential for reasoning abilities, which can provide necessary suggestions (e.g., detailed analysis and constructive feedback). Therefore, how to evaluate the critique capacity of LLMs has drawn great attention and several critique benchmarks have been proposed. However, existing critique benchmarks usually have the following limitations: (1). Focusing on diverse reasoning tasks in general domains and insufficient evaluation on code tasks (e.g., only covering code generation task), where the difficulty of queries is relatively easy (e.g., the code queries of CriticBench are from Humaneval and MBPP). (2). Lacking comprehensive evaluation from different dimensions. To address these limitations, we introduce a holistic code critique benchmark for LLMs called CodeCriticBench. Specifically, our CodeCriticBench includes two mainstream code tasks (i.e., code generation and code QA) with different difficulties. Besides, the evaluation protocols include basic critique evaluation and advanced critique evaluation for different characteristics, where fine-grained evaluation checklists are well-designed for advanced settings. Finally, we conduct extensive experimental results of existing LLMs, which show the effectiveness of CodeCriticBench.
摘要：大语言模型（LLMS）的批评能力对于推理能力至关重要，这可以提供必要的建议（例如，详细的分析和建设性反馈）。因此，如何评估LLM的批评能力引起了极大的关注，并提出了一些批评基准。但是，现有的批评基准通常具有以下限制：（1）。专注于一般域中的各种推理任务以及对代码任务的评估不足（例如，仅涵盖代码生成任务），其中查询的难度相对容易（例如，CritistBench的代码查询来自Humaneval和MBPP）。（2）。缺乏来自不同维度的全面评估。为了解决这些局限性，我们为LLMS引入了称为CodeCriticBench的整体代码批评基准。具体来说，我们的编解码物包括两个具有不同困难的主流代码任务（即代码生成和代码质量质量质量质量）。此外，评估方案还包括基本的批评评估和针对不同特征的高级评估评估，其中精细颗粒评估清单已精心设计用于高级设置。最后，我们对现有LLM的广泛实验结果进行了广泛的实验结果，该结果显示了Codecriticbench的有效性。

Title: Visual-RAG: Benchmarking Text-to-Image Retrieval Augmented Generation for Visual Knowledge Intensive Queries

Authors: Yin Wu, Quanyu Long, Jing Li, Jianfei Yu, Wenya Wang
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2502.16636
Pdf URL: https://arxiv.org/pdf/2502.16636
Copy Paste: [[2502.16636]] Visual-RAG: Benchmarking Text-to-Image Retrieval Augmented Generation for Visual Knowledge Intensive Queries(https://arxiv.org/abs/2502.16636)
Keywords: language model, llm, retrieval augmented generation, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) is a popular approach for enhancing Large Language Models (LLMs) by addressing their limitations in verifying facts and answering knowledge-intensive questions. As the research in LLM extends their capability to handle input modality other than text, e.g. image, several multimodal RAG benchmarks are proposed. Nonetheless, they mainly use textual knowledge bases as the primary source of evidences for augmentation. There still lack benchmarks designed to evaluate images as augmentation in RAG systems and how they leverage visual knowledge. We propose Visual-RAG, a novel Question Answering benchmark that emphasizes visual knowledge intensive questions. Unlike prior works relying on text-based evidence, Visual-RAG necessitates text-to-image retrieval and integration of relevant clue images to extract visual knowledge as evidence. With Visual-RAG, we evaluate 5 open-sourced and 3 proprietary Multimodal LLMs (MLLMs), revealing that images can serve as good evidence in RAG; however, even the SoTA models struggle with effectively extracting and utilizing visual knowledge
摘要：检索授权的一代（RAG）是一种通过解决事实并回答知识密集性问题的局限性来增强大语模型（LLM）的流行方法。随着LLM的研究扩展了处理文本以外的输入方式的能力，例如图像，提出了几个多模式的抹布基准。尽管如此，他们主要将文本知识基础作为增强证据的主要来源。仍然缺乏旨在评估图像作为抹布系统中的增强图像以及它们如何利用视觉知识的基准测试。我们提出了Visual-Rag，这是一个回答基准的新颖问题，强调视觉知识密集型问题。与依靠基于文本的证据的先前作品不同，视觉rag需要文本对图像检索和相关线索图像的整合以提取视觉知识作为证据。借助视觉窗格，我们评估了5个开源和3个专有的多模式LLM（MLLM），揭示了图像可以作为抹布中的良好证据。但是，即使SOTA模型也在有效提取和利用视觉知识方面努力

Title: CODESYNC: Synchronizing Large Language Models with Dynamic Code Evolution at Scale

Authors: Chenlong Wang, Zhaoyang Chu, Zhengxiang Cheng, Xuyi Yang, Kaiyue Qiu, Yao Wan, Zhou Zhao, Xuanhua Shi, Dongping Chen
Subjects: cs.CL, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2502.16645
Pdf URL: https://arxiv.org/pdf/2502.16645
Copy Paste: [[2502.16645]] CODESYNC: Synchronizing Large Language Models with Dynamic Code Evolution at Scale(https://arxiv.org/abs/2502.16645)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have exhibited exceptional performance in software engineering yet face challenges in adapting to continually evolving code knowledge, particularly regarding the frequent updates of third-party library APIs. This limitation, stemming from static pre-training datasets, often results in non-executable code or implementations with suboptimal safety and efficiency. To this end, this paper introduces CODESYNC, a data engine for identifying outdated code patterns and collecting real-time code knowledge updates from Python third-party libraries. Building upon CODESYNC, we develop CODESYNCBENCH, a comprehensive benchmark for assessing LLMs' ability to stay synchronized with code evolution, which covers real-world updates for 220 APIs from six Python libraries. Our benchmark offers 3,300 test cases across three evaluation tasks and an update-aware instruction tuning dataset consisting of 2,200 training samples. Extensive experiments on 14 state-of-the-art LLMs reveal that they struggle with dynamic code evolution, even with the support of advanced knowledge updating methods (e.g., DPO, ORPO, and SimPO). We believe that our benchmark can offer a strong foundation for the development of more effective methods for real-time code knowledge updating in the future. The experimental code and dataset are publicly available at: this https URL.
摘要：大型语言模型（LLMS）在软件工程中表现出色，但在适应不断发展的代码知识方面面临挑战，尤其是关于第三方库API的频繁更新。这种限制源于静态预训练数据集，通常会导致不可行的代码或具有次优的安全性和效率的实现。为此，本文介绍了代码，这是一种用于识别过时的代码模式并从Python第三方库中收集实时代码知识更新的数据引擎。在CodeSync的基础上，我们开发了CodeSyncbench，这是评估LLMS与Code Evolution保持同步能力的全面基准，该基准涵盖了来自六个Python库中220个API的现实世界更新。我们的基准测试在三个评估任务中提供3,300个测试用例，以及一个由2,200个培训样本组成的更新意见调整数据集。对14个最先进的LLM的广泛实验表明，即使在高级知识更新方法（例如DPO，ORPO和SIMPO）的支持下，它们即使支持动态代码的演变也很难。我们认为，我们的基准可以为未来的实时代码知识更新的更有效方法提供强大的基础。实验代码和数据集可公开可用：此HTTPS URL。

Title: MimeQA: Towards Socially-Intelligent Nonverbal Foundation Models

Authors: Hengzhi Li, Megan Tjandrasuwita, Yi R. Fung, Armando Solar-Lezama, Paul Pu Liang
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2502.16671
Pdf URL: https://arxiv.org/pdf/2502.16671
Copy Paste: [[2502.16671]] MimeQA: Towards Socially-Intelligent Nonverbal Foundation Models(https://arxiv.org/abs/2502.16671)
Keywords: language model, llm, prompt
Abstract: Socially intelligent AI that can understand and interact seamlessly with humans in daily lives is increasingly important as AI becomes more closely integrated with peoples' daily activities. However, current works in artificial social reasoning all rely on language-only, or language-dominant approaches to benchmark and training models, resulting in systems that are improving in verbal communication but struggle with nonverbal social understanding. To address this limitation, we tap into a novel source of data rich in nonverbal and social interactions -- mime videos. Mimes refer to the art of expression through gesture and movement without spoken words, which presents unique challenges and opportunities in interpreting non-verbal social communication. We contribute a new dataset called MimeQA, obtained by sourcing 221 videos from YouTube, through rigorous annotation and verification, resulting in a benchmark with 101 videos and 806 question-answer pairs. Using MimeQA, we evaluate state-of-the-art video large language models (vLLMs) and find that their overall accuracy ranges from 15-30%. Our analysis reveals that vLLMs often fail to ground imagined objects and over-rely on the text prompt while ignoring subtle nonverbal interactions. Our data resources are released at this https URL to inspire future work in foundation models that embody true social intelligence capable of interpreting non-verbal human interactions.
摘要：在日常生活中可以与人类无缝地理解和互动的社会智能AI越来越重要，因为AI与人们的日常活动更加紧密地融合在一起。但是，当前在人工社会推理中的工作都依赖于仅语言的基准和培训模型，或者以语言为主导的方法，从而导致系统正在改善言语交流但与非言语社会理解斗争。为了解决这一限制，我们利用了富含非语言和社交互动的新数据来源 - MIME视频。哑剧是指没有口头言语的手势和运动来表达表达的艺术，这在解释非语言社会交流时提出了独特的挑战和机遇。我们通过严格的注释和验证来贡献一个名为Mimeqa的新数据集，该数据集通过从YouTube中采购221个视频，从而获得了101个视频和806个问答对的基准。我们使用Mimeqa评估了最先进的视频大语模型（VLLM），并发现它们的整体准确性范围为15-30％。我们的分析表明，VLLM通常无法将想象中的对象扎根，并且在文本提示中过度汇总，同时忽略了微妙的非语言相互作用。我们的数据资源在此HTTPS URL上发布，以激发未来的工作，这些基础模型体现了能够解释非语言人类互动的真正社会智能。

Title: Automatic Input Rewriting Improves Translation with Large Language Models

Authors: Dayeon Ki, Marine Carpuat
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.16682
Pdf URL: https://arxiv.org/pdf/2502.16682
Copy Paste: [[2502.16682]] Automatic Input Rewriting Improves Translation with Large Language Models(https://arxiv.org/abs/2502.16682)
Keywords: language model, llm
Abstract: Can we improve machine translation (MT) with LLMs by rewriting their inputs automatically? Users commonly rely on the intuition that well-written text is easier to translate when using off-the-shelf MT systems. LLMs can rewrite text in many ways but in the context of MT, these capabilities have been primarily exploited to rewrite outputs via post-editing. We present an empirical study of 21 input rewriting methods with 3 open-weight LLMs for translating from English into 6 target languages. We show that text simplification is the most effective MT-agnostic rewrite strategy and that it can be improved further when using quality estimation to assess translatability. Human evaluation further confirms that simplified rewrites and their MT outputs both largely preserve the original meaning of the source and MT. These results suggest LLM-assisted input rewriting as a promising direction for improving translations.
摘要：我们可以通过自动重写其输入来改善机器翻译（MT）吗？用户通常依靠直觉，即使用现成的MT系统时，写得很好的文本更容易翻译。 LLM可以以多种方式重写文本，但是在MT的背景下，这些功能主要利用通过后编辑来重写输出。我们介绍了21种输入重写方法的实证研究，其中3种开放量LLM从英语翻译成6种目标语言。我们表明，简化文本是最有效的MT不合SNOSTIC改写策略，并且在使用质量估计来评估可翻译性时，它可以进一步改进。人类评估进一步证实，简化的重写及其MT输出在很大程度上保留了来源和MT的原始含义。这些结果表明，LLM辅助输入重写是改善翻译的有希望的方向。

Title: WildLong: Synthesizing Realistic Long-Context Instruction Data at Scale

Authors: Jiaxi Li, Xingxing Zhang, Xun Wang, Xiaolong Huang, Li Dong, Liang Wang, Si-Qing Chen, Wei Lu, Furu Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16684
Pdf URL: https://arxiv.org/pdf/2502.16684
Copy Paste: [[2502.16684]] WildLong: Synthesizing Realistic Long-Context Instruction Data at Scale(https://arxiv.org/abs/2502.16684)
Keywords: language model, llm, long context
Abstract: Large language models (LLMs) with extended context windows enable tasks requiring extensive information integration but are limited by the scarcity of high-quality, diverse datasets for long-context instruction tuning. Existing data synthesis methods focus narrowly on objectives like fact retrieval and summarization, restricting their generalizability to complex, real-world tasks. WildLong extracts meta-information from real user queries, models co-occurrence relationships via graph-based methods, and employs adaptive generation to produce scalable data. It extends beyond single-document tasks to support multi-document reasoning, such as cross-document comparison and aggregation. Our models, finetuned on 150K instruction-response pairs synthesized using WildLong, surpasses existing open-source long-context-optimized models across benchmarks while maintaining strong performance on short-context tasks without incorporating supplementary short-context data. By generating a more diverse and realistic long-context instruction dataset, WildLong enhances LLMs' ability to generalize to complex, real-world reasoning over long contexts, establishing a new paradigm for long-context data synthesis.
摘要：具有扩展上下文Windows的大型语言模型（LLMS）启用了需要广泛信息集成的任务，但受到高质量，多样的数据集的稀缺而限制了长篇文章指令调整。现有的数据综合方法狭义地集中在事实检索和摘要等目标上，从而限制了它们对复杂的现实世界任务的普遍性。 Wildlong从真实的用户查询，通过基于图的方法共同存在的关系中提取元信息，并采用自适应生成来产生可扩展的数据。它扩展了超越单文件的任务，以支持多文件推理，例如跨文档比较和聚合。我们的模型在使用Wildlong合成的150K指令 - 响应对中进行了填充，超过了跨基准的现有开源长篇小说优化的模型，同时在不包含补充的短篇小说数据的情况下保持在短篇小说任务上的强大性能。通过生成更多样化和现实的长篇文章指令数据集，Wildlong增强了LLMS在长篇小说中推广到复杂的现实世界推理的能力，从而为长篇小说数据综合建立了新的范式。

Title: Toward Responsible Federated Large Language Models: Leveraging a Safety Filter and Constitutional AI

Authors: Eunchung Noh, Jeonghun Baek
Subjects: cs.CL, cs.DC, cs.MA
Abstract URL: https://arxiv.org/abs/2502.16691
Pdf URL: https://arxiv.org/pdf/2502.16691
Copy Paste: [[2502.16691]] Toward Responsible Federated Large Language Models: Leveraging a Safety Filter and Constitutional AI(https://arxiv.org/abs/2502.16691)
Keywords: language model, llm
Abstract: Recent research has increasingly focused on training large language models (LLMs) using federated learning, known as FedLLM. However, responsible AI (RAI), which aims to ensure safe responses, remains underexplored in the context of FedLLM. In FedLLM, client data used for training may contain harmful content, leading to unsafe LLMs that generate harmful responses. Aggregating such unsafe LLMs into the global model and distributing them to clients may result in the widespread deployment of unsafe LLMs. To address this issue, we incorporate two well-known RAI methods into FedLLM: the safety filter and constitutional AI. Our experiments demonstrate that these methods significantly enhance the safety of the LLM, achieving over a 20% improvement on AdvBench, a benchmark for evaluating safety performance.
摘要：最近的研究越来越专注于使用联邦学习（称为FEDLLM）培训大语言模型（LLM）。但是，旨在确保安全响应的负责人AI（RAI）在FedLlm的背景下仍未得到充实。在FEDLLM中，用于培训的客户数据可能包含有害内容，从而导致不安全的LLM产生有害响应。将这种不安全的LLM汇总到全球模型中并将其分发给客户可能会导致不安全LLM的广泛部署。为了解决这个问题，我们将两种著名的RAI方法纳入FedLlm：安全过滤器和宪法AI。我们的实验表明，这些方法显着提高了LLM的安全性，在Advbench上取得了20％的提高，这是评估安全性能的基准。

Title: Code Summarization Beyond Function Level

Authors: Vladimir Makharev, Vladimir Ivanov
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.16704
Pdf URL: https://arxiv.org/pdf/2502.16704
Copy Paste: [[2502.16704]] Code Summarization Beyond Function Level(https://arxiv.org/abs/2502.16704)
Keywords: llm, prompt
Abstract: Code summarization is a critical task in natural language processing and software engineering, which aims to generate concise descriptions of source code. Recent advancements have improved the quality of these summaries, enhancing code readability and maintainability. However, the content of a repository or a class has not been considered in function code summarization. This study investigated the effectiveness of code summarization models beyond the function level, exploring the impact of class and repository contexts on the summary quality. The study involved revising benchmarks for evaluating models at class and repository levels, assessing baseline models, and evaluating LLMs with in-context learning to determine the enhancement of summary quality with additional context. The findings revealed that the fine-tuned state-of-the-art CodeT5+ base model excelled in code summarization, while incorporating few-shot learning and retrieved code chunks from RAG significantly enhanced the performance of LLMs in this task. Notably, the Deepseek Coder 1.3B and Starcoder2 15B models demonstrated substantial improvements in metrics such as BLEURT, METEOR, and BLEU-4 at both class and repository levels. Repository-level summarization exhibited promising potential but necessitates significant computational resources and gains from the inclusion of structured context. Lastly, we employed the recent SIDE code summarization metric in our evaluation. This study contributes to refining strategies for prompt engineering, few-shot learning, and RAG, addressing gaps in benchmarks for code summarization at various levels. Finally, we publish all study details, code, datasets, and results of evaluation in the GitHub repository available at this https URL.
摘要：代码摘要是自然语言处理和软件工程中的关键任务，该任务旨在生成源代码的简洁描述。最近的进步改善了这些摘要的质量，增强了代码的可读性和可维护性。但是，在功能代码摘要中尚未考虑存储库或类的内容。这项研究研究了代码摘要模型超出功能级别的有效性，探讨了类和存储库环境对摘要质量的影响。该研究涉及修改基准测试，以评估类和存储库级别的模型，评估基线模型，并通过文本学习评估LLM，以确定摘要质量的增强，并使用其他上下文。研究结果表明，经过微调的最先进的Codet5+基本模型在代码摘要方面表现出色，同时结合了很少的学习和从抹布中检索代码块，从而显着提高了LLMS在此任务中的性能。值得注意的是，DeepSeek编码器1.3B和StarCoder2 15B模型在类和存储库水平上表现出了诸如Bleurt，Meteor和Bleu-4之类的指标的大幅改进。存储库级别的汇总表现出有希望的潜力，但需要从结构化环境的包含中获得大量的计算资源和收益。最后，我们在评估中采用了最新的副代码摘要指标。这项研究有助于提出迅速工程，很少学习和抹布的策略，以解决基准中的差距以在各个层面上进行代码摘要。最后，我们在此HTTPS URL上可用的GitHub存储库中发布了所有研究详细信息，代码，数据集和评估结果。

Title: Can ChatGPT Learn to Count Letters?

Authors: Javier Conde, Gonzalo Martínez, Pedro Reviriego, Zhen Gao, Shanshan Liu, Fabrizio Lombardi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.16705
Pdf URL: https://arxiv.org/pdf/2502.16705
Copy Paste: [[2502.16705]] Can ChatGPT Learn to Count Letters?(https://arxiv.org/abs/2502.16705)
Keywords: language model, gpt, llm, chat
Abstract: Large language models (LLMs) struggle on simple tasks such as counting the number of occurrences of a letter in a word. In this paper, we investigate if ChatGPT can learn to count letters and propose an efficient solution.
摘要：大型语言模型（LLM）在简单任务上挣扎，例如计算单词中字母的发生数量。在本文中，我们研究Chatgpt是否可以学会计算字母并提出有效的解决方案。

Title: Beyond Pattern Recognition: Probing Mental Representations of LMs

Authors: Moritz Miller, Kumar Shridhar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16717
Pdf URL: https://arxiv.org/pdf/2502.16717
Copy Paste: [[2502.16717]] Beyond Pattern Recognition: Probing Mental Representations of LMs(https://arxiv.org/abs/2502.16717)
Keywords: language model, llm, prompt
Abstract: Language Models (LMs) have demonstrated impressive capabilities in solving complex reasoning tasks, particularly when prompted to generate intermediate explanations. However, it remains an open question whether these intermediate reasoning traces represent a dynamic, evolving thought process or merely reflect sophisticated pattern recognition acquired during large scale pre training. Drawing inspiration from human cognition, where reasoning unfolds incrementally as new information is assimilated and internal models are continuously updated, we propose to delve deeper into the mental model of various LMs. We propose a new way to assess the mental modeling of LMs, where they are provided with problem details gradually, allowing each new piece of data to build upon and refine the model's internal representation of the task. We systematically compare this step by step mental modeling strategy with traditional full prompt methods across both text only and vision and text modalities. Experiments on the MathWorld dataset across different model sizes and problem complexities confirm that both text-based LLMs and multimodal LMs struggle to create mental representations, questioning how their internal cognitive processes work.
摘要：语言模型（LMS）在解决复杂的推理任务方面表现出了令人印象深刻的能力，尤其是在提示产生中间解释时。但是，这些中间推理痕迹是一个动态，不断发展的思维过程还是仅反映在大规模预训练中获得的复杂模式识别，仍然是一个空的问题。从人类认知中汲取灵感，因为新信息被吸收并不断更新，因此推理会逐步发展，我们建议深入研究各种LMS的心理模型。我们提出了一种评估LMS心理建模的新方法，在该方法中逐渐为它们提供了问题细节，从而允许每个新数据构建并完善模型对任务的内部表示。我们从系统地将此逐步的心理建模策略与仅文本以及视觉和文本方式的传统完整及时方法进行比较。在不同模型大小和问题复杂性的数学数据集上进行的实验证实，基于文本的LLM和多模式LMS都在努力创建心理表示，质疑其内部认知过程如何工作。

Title: Speed and Conversational Large Language Models: Not All Is About Tokens per Second

Authors: Javier Conde, Miguel González, Pedro Reviriego, Zhen Gao, Shanshan Liu, Fabrizio Lombardi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.16721
Pdf URL: https://arxiv.org/pdf/2502.16721
Copy Paste: [[2502.16721]] Speed and Conversational Large Language Models: Not All Is About Tokens per Second(https://arxiv.org/abs/2502.16721)
Keywords: language model, llm
Abstract: The speed of open-weights large language models (LLMs) and its dependency on the task at hand, when run on GPUs, is studied to present a comparative analysis of the speed of the most popular open LLMs.
摘要：研究开放式大语模型（LLM）的速度及其对手头任务的依赖性在GPU上运行时，以对最受欢迎的开放式LLM的速度进行比较分析。

Title: Layer-Wise Evolution of Representations in Fine-Tuned Transformers: Insights from Sparse AutoEncoders

Authors: Suneel Nadipalli
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.16722
Pdf URL: https://arxiv.org/pdf/2502.16722
Copy Paste: [[2502.16722]] Layer-Wise Evolution of Representations in Fine-Tuned Transformers: Insights from Sparse AutoEncoders(https://arxiv.org/abs/2502.16722)
Keywords: language model, llm
Abstract: Fine-tuning pre-trained transformers is a powerful technique for enhancing the performance of base models on specific tasks. From early applications in models like BERT to fine-tuning Large Language Models (LLMs), this approach has been instrumental in adapting general-purpose architectures for specialized downstream tasks. Understanding the fine-tuning process is crucial for uncovering how transformers adapt to specific objectives, retain general representations, and acquire task-specific features. This paper explores the underlying mechanisms of fine-tuning, specifically in the BERT transformer, by analyzing activation similarity, training Sparse AutoEncoders (SAEs), and visualizing token-level activations across different layers. Based on experiments conducted across multiple datasets and BERT layers, we observe a steady progression in how features adapt to the task at hand: early layers primarily retain general representations, middle layers act as a transition between general and task-specific features, and later layers fully specialize in task adaptation. These findings provide key insights into the inner workings of fine-tuning and its impact on representation learning within transformer architectures.
摘要：微调预训练的变压器是一种强大的技术，可在特定任务上增强基本模型的性能。从伯特（Bert）等模型中的早期应用到微调大语模型（LLM），这种方法在适应通用架构方面起到了为专门的下游任务而调整通用架构。了解微调过程对于发现变压器如何适应特定目标，保留一般表示并获得特定于任务的特征至关重要。本文通过分析激活相似性，训练稀疏的自动编码器（SAE）以及可视化不同层的令牌级激活，探讨了微调的基本机制，特别是在BERT变压器中的基本机制。基于在多个数据集和BERT层进行的实验，我们观察到特征如何适应手头的任务的稳定进展：早期层主要保留一般表示，中间层是一般特征和特定于任务的层之间的过渡，以及后来的层之间的过渡完全专注于任务适应。这些发现为微调的内部运作及其对变压器体系结构内的表示学习的影响提供了关键的见解。

Title: SQLong: Enhanced NL2SQL for Longer Contexts with LLMs

Authors: Dai Quoc Nguyen, Cong Duy Vu Hoang, Duy Vu, Gioacchino Tangari, Thanh Tien Vu, Don Dharmasiri, Yuan-Fang Li, Long Duong
Subjects: cs.CL, cs.AI, cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2502.16747
Pdf URL: https://arxiv.org/pdf/2502.16747
Copy Paste: [[2502.16747]] SQLong: Enhanced NL2SQL for Longer Contexts with LLMs(https://arxiv.org/abs/2502.16747)
Keywords: language model, llm
Abstract: Open-weight large language models (LLMs) have significantly advanced performance in the Natural Language to SQL (NL2SQL) task. However, their effectiveness diminishes when dealing with large database schemas, as the context length increases. To address this limitation, we present SQLong, a novel and efficient data augmentation framework designed to enhance LLM performance in long-context scenarios for the NL2SQL task. SQLong generates augmented datasets by extending existing database schemas with additional synthetic CREATE TABLE commands and corresponding data rows, sampled from diverse schemas in the training data. This approach effectively simulates long-context scenarios during finetuning and evaluation. Through experiments on the Spider and BIRD datasets, we demonstrate that LLMs finetuned with SQLong-augmented data significantly outperform those trained on standard datasets. These imply SQLong's practical implementation and its impact on improving NL2SQL capabilities in real-world settings with complex database schemas.
摘要：开放重量大型语言模型（LLMS）在自然语言中具有明显的高级性能到SQL（NL2SQL）任务。但是，随着上下文长度的增加，它们的有效性在处理大型数据库模式时会降低。为了解决这一限制，我们提出了Sqlong，这是一个新颖有效的数据增强框架，旨在在长篇下说方案中针对NL2SQL任务增强LLM性能。 Sqlong通过使用其他合成create表命令和相应的数据行扩展现有数据库模式来生成增强数据集，并从培训数据中的各种模式中采样。这种方法有效地模拟了填充和评估期间的长篇文章方案。通过蜘蛛和鸟类数据集的实验，我们证明了使用Sqlong-augment数据进行的LLMS固定明显优于在标准数据集中训练的LLM。这些意味着使用复杂的数据库模式，Sqlong的实际实现及其对改善现实世界中NL2SQL功能的影响。

Title: Language Model Fine-Tuning on Scaled Survey Data for Predicting Distributions of Public Opinions

Authors: Joseph Suh, Erfan Jahanparast, Suhong Moon, Minwoo Kang, Serina Chang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16761
Pdf URL: https://arxiv.org/pdf/2502.16761
Copy Paste: [[2502.16761]] Language Model Fine-Tuning on Scaled Survey Data for Predicting Distributions of Public Opinions(https://arxiv.org/abs/2502.16761)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) present novel opportunities in public opinion research by predicting survey responses in advance during the early stages of survey design. Prior methods steer LLMs via descriptions of subpopulations as LLMs' input prompt, yet such prompt engineering approaches have struggled to faithfully predict the distribution of survey responses from human subjects. In this work, we propose directly fine-tuning LLMs to predict response distributions by leveraging unique structural characteristics of survey data. To enable fine-tuning, we curate SubPOP, a significantly scaled dataset of 3,362 questions and 70K subpopulation-response pairs from well-established public opinion surveys. We show that fine-tuning on SubPOP greatly improves the match between LLM predictions and human responses across various subpopulations, reducing the LLM-human gap by up to 46% compared to baselines, and achieves strong generalization to unseen surveys and subpopulations. Our findings highlight the potential of survey-based fine-tuning to improve opinion prediction for diverse, real-world subpopulations and therefore enable more efficient survey designs. Our code is available at this https URL.
摘要：大型语言模型（LLMS）通过在调查设计的早期阶段预先预测调查响应来展示舆论研究中的新机会。先前的方法通过将亚群描述为LLMS的输入提示来指导LLM，但是这种及时的工程方法一直在努力忠实地预测人类受试者的调查反应的分布。在这项工作中，我们直接提出了通过利用调查数据的独特结构特征来预测响应分布的响应分布。为了启用微调，我们策划了一个大小，这是一个大量扩展的数据集，该数据集由3,362个问题和70k亚群响应对，来自公认的公众舆论调查。我们表明，在子流程上进行的微调大大改善了LLM预测与各种子群中的人类反应之间的匹配，与基本线相比，LLM-Human Gap最多将LLM-Human Gap降低了46％，并实现了对看不见的Surveys和subpoperations的强烈概括。我们的发现凸显了基于调查的微调来改善各种现实世界亚种群的意见预测的潜力，因此可以实现更有效的调查设计。我们的代码可在此HTTPS URL上找到。

Title: A Hybrid Approach to Information Retrieval and Answer Generation for Regulatory Texts

Authors: Jhon Rayo, Raul de la Rosa, Mario Garrido
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16767
Pdf URL: https://arxiv.org/pdf/2502.16767
Copy Paste: [[2502.16767]] A Hybrid Approach to Information Retrieval and Answer Generation for Regulatory Texts(https://arxiv.org/abs/2502.16767)
Keywords: language model, llm, retrieval augmented generation
Abstract: Regulatory texts are inherently long and complex, presenting significant challenges for information retrieval systems in supporting regulatory officers with compliance tasks. This paper introduces a hybrid information retrieval system that combines lexical and semantic search techniques to extract relevant information from large regulatory corpora. The system integrates a fine-tuned sentence transformer model with the traditional BM25 algorithm to achieve both semantic precision and lexical coverage. To generate accurate and comprehensive responses, retrieved passages are synthesized using Large Language Models (LLMs) within a Retrieval Augmented Generation (RAG) framework. Experimental results demonstrate that the hybrid system significantly outperforms standalone lexical and semantic approaches, with notable improvements in Recall@10 and MAP@10. By openly sharing our fine-tuned model and methodology, we aim to advance the development of robust natural language processing tools for compliance-driven applications in regulatory domains.
摘要：监管文本本质上是漫长而复杂的，对信息检索系统提出了重大挑战，以支持监管官员进行合规任务。本文介绍了一个混合信息检索系统，该系统结合了词汇和语义搜索技术，以从大型监管语料库中提取相关信息。该系统将微调的句子变压器模型与传统的BM25算法集成在一起，以实现语义精度和词汇覆盖范围。为了产生准确而全面的响应，在检索增强发电（RAG）框架中使用大语言模型（LLM）合成所检索的段落。实验结果表明，混合系统的表现明显胜过独立的词汇和语义方法，在召回@10和map@10中有了显着改善。通过公开分享我们的微调模型和方法，我们旨在推动为监管域中以合规性驱动的应用程序开发强大的自然语言处理工具。

Title: LED-Merging: Mitigating Safety-Utility Conflicts in Model Merging with Location-Election-Disjoint

Authors: Qianli Ma, Dongrui Liu, Qian Chen, Linfeng Zhang, Jing Shao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.16770
Pdf URL: https://arxiv.org/pdf/2502.16770
Copy Paste: [[2502.16770]] LED-Merging: Mitigating Safety-Utility Conflicts in Model Merging with Location-Election-Disjoint(https://arxiv.org/abs/2502.16770)
Keywords: language model, llm
Abstract: Fine-tuning pre-trained Large Language Models (LLMs) for specialized tasks incurs substantial computational and data costs. While model merging offers a training-free solution to integrate multiple task-specific models, existing methods suffer from safety-utility conflicts where enhanced general capabilities degrade safety safeguards. We identify two root causes: \textbf{neuron misidentification} due to simplistic parameter magnitude-based selection, and \textbf{cross-task neuron interference} during merging. To address these challenges, we propose \textbf{LED-Merging}, a three-stage framework that \textbf{L}ocates task-specific neurons via gradient-based attribution, dynamically \textbf{E}lects critical neurons through multi-model importance fusion, and \textbf{D}isjoints conflicting updates through parameter isolation. Extensive experiments on Llama-3-8B, Mistral-7B, and Llama2-13B demonstrate that LED-Merging reduces harmful response rates(\emph{e.g.}, a 31.4\% decrease on Llama-3-8B-Instruct on HarmBench) while preserving 95\% of utility performance(\emph{e.g.}, 52.39\% accuracy on GSM8K). LED-Merging resolves safety-utility conflicts and provides a lightweight, training-free paradigm for constructing reliable multi-task LLMs.
摘要：针对专业任务的微调预训练的大语言模型（LLM）会造成大量的计算和数据成本。虽然模型合并提供了无培训的解决方案来整合多个特定任务的模型，但现有方法遭受了安全 - 私人性冲突的损失，而增强的一般功能会降低安全保障。我们确定了两个根本原因：\ textbf {神经元错误识别}由于基于简单的参数幅度选择，并且在合并过程中\ textbf {cross-任务神经元干扰}。为了应对这些挑战，我们提出\ textbf {led-mering}，这是一个三阶段的框架，该框架是\ textbf {l}通过基于梯度的属性进行特定于任务特定的神经元，动态\ textbf {e}通过多模型的关键神经元与多模型的关键神经元相lect重要性融合，\ textbf {d}通过参数隔离来冲突更新。关于Llama-3-8B，Mistral-7b和Llama2-13b的广泛实验表明，LED合并降低了有害响应率（\ Emph {e.g。}，llama-3-8b in Inflinct a Harmench上的Llama-3-8b构造下降31.4 \％）保留95 \％的公用事业性能（\ emph {e.g。}，52.39 \％准确性在GSM8K上）。 LED合并解决了安全性冲突，并为构建可靠的多任务LLM提供了轻巧，无训练的范例。

Title: MultiOCR-QA: Dataset for Evaluating Robustness of LLMs in Question Answering on Multilingual OCR Texts

Authors: Bhawna Piryani, Jamshid Mozafari, Abdelrahman Abdallah, Antoine Doucet, Adam Jatowt
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16781
Pdf URL: https://arxiv.org/pdf/2502.16781
Copy Paste: [[2502.16781]] MultiOCR-QA: Dataset for Evaluating Robustness of LLMs in Question Answering on Multilingual OCR Texts(https://arxiv.org/abs/2502.16781)
Keywords: llm
Abstract: Optical Character Recognition (OCR) plays a crucial role in digitizing historical and multilingual documents, yet OCR errors -- imperfect extraction of the text, including character insertion, deletion and permutation -- can significantly impact downstream tasks like question-answering (QA). In this work, we introduce a multilingual QA dataset MultiOCR-QA, designed to analyze the effects of OCR noise on QA systems' performance. The MultiOCR-QA dataset comprises 60K question-answer pairs covering three languages, English, French, and German. The dataset is curated from OCR-ed old documents, allowing for the evaluation of OCR-induced challenges on question answering. We evaluate MultiOCR-QA on various levels and types of OCR errors to access the robustness of LLMs in handling real-world digitization errors. Our findings show that QA systems are highly prone to OCR induced errors and exhibit performance degradation on noisy OCR text.
摘要：光学特征识别（OCR）在数字化历史和多语言文档中起着至关重要的作用，但是OCR错误（不完美的文本提取，包括字符插入，删除和置换）可以显着影响下游任务等下游任务，例如提问（QA）。在这项工作中，我们介绍了多种语言QA数据集MultioCR-QA，旨在分析OCR噪声对QA系统性能的影响。 Multiocr-QA数据集由60k问答对组成，涵盖了三种语言，即英语，法语和德语。该数据集是根据OCR-ED旧文档策划的，可以评估OCR引起的问题回答的挑战。我们在各种级别和类型的OCR错误上评估了多克QA，以访问LLM在处理现实世界数字化错误中的鲁棒性。我们的发现表明，质量检查系统很容易引起OCR引起的错误，并且在嘈杂的OCR文本上表现出性能降解。

Title: Are Large Language Models Good Data Preprocessors?

Authors: Elyas Meguellati, Nardiena Pratama, Shazia Sadiq, Gianluca Demartini
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16790
Pdf URL: https://arxiv.org/pdf/2502.16790
Copy Paste: [[2502.16790]] Are Large Language Models Good Data Preprocessors?(https://arxiv.org/abs/2502.16790)
Keywords: language model, gpt, llm
Abstract: High-quality textual training data is essential for the success of multimodal data processing tasks, yet outputs from image captioning models like BLIP and GIT often contain errors and anomalies that are difficult to rectify using rule-based methods. While recent work addressing this issue has predominantly focused on using GPT models for data preprocessing on relatively simple public datasets, there is a need to explore a broader range of Large Language Models (LLMs) and tackle more challenging and diverse datasets. In this study, we investigate the use of multiple LLMs, including LLaMA 3.1 70B, GPT-4 Turbo, and Sonnet 3.5 v2, to refine and clean the textual outputs of BLIP and GIT. We assess the impact of LLM-assisted data cleaning by comparing downstream-task (SemEval 2024 Subtask "Multilabel Persuasion Detection in Memes") models trained on cleaned versus non-cleaned data. While our experimental results show improvements when using LLM-cleaned captions, statistical tests reveal that most of these improvements are not significant. This suggests that while LLMs have the potential to enhance data cleaning and repairing, their effectiveness may be limited depending on the context they are applied to, the complexity of the task, and the level of noise in the text. Our findings highlight the need for further research into the capabilities and limitations of LLMs in data preprocessing pipelines, especially when dealing with challenging datasets, contributing empirical evidence to the ongoing discussion about integrating LLMs into data preprocessing pipelines.
摘要：高质量的文本培训数据对于多模式数据处理任务的成功至关重要，但是来自BLIP和GIT（例如GIT）的图像字幕模型的输出通常包含错误和异常，这些错误和异常很难使用基于规则的方法进行纠正。尽管最近解决此问题的工作主要集中在使用GPT模型对相对简单的公共数据集上进行数据预处理，但有必要探索更广泛的大型语言模型（LLMS），并解决更具挑战性和更具挑战性的数据集。在这项研究中，我们研究了包括Llama 3.1 70B，GPT-4 Turbo和Sonnet 3.5 V2在内的多个LLM的使用，以完善和清洁Blip和Git的文本输出。我们通过比较下游任务（Semeval 2024子任务“模因中的多标记说服检测”）模型来评估LLM辅助数据清洁的影响。虽然我们的实验结果显示使用LLM清洗标题时的改进，但统计测试表明，这些改进大多数并不显着。这表明，尽管LLM有可能增强数据清洁和维修，但它们的有效性可能受到限制，具体取决于其所应用的上下文，任务的复杂性以及文本中的噪声水平。我们的发现强调了需要进一步研究LLM在数据预处理管道中的功能和局限性的需求，尤其是在处理挑战性数据集时，为将LLMS整合到数据预处理管道中的持续讨论中提供了经验证据。

Title: Unsupervised Topic Models are Data Mixers for Pre-training Language Models

Authors: Jiahui Peng, Xinlin Zhuang, Qiu Jiantao, Ren Ma, Jing Yu, Tianyi Bai, Conghui He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.16802
Pdf URL: https://arxiv.org/pdf/2502.16802
Copy Paste: [[2502.16802]] Unsupervised Topic Models are Data Mixers for Pre-training Language Models(https://arxiv.org/abs/2502.16802)
Keywords: language model, llm
Abstract: The performance of large language models (LLMs) is significantly affected by the quality and composition of their pre-training data, which is inherently diverse, spanning various domains, sources, and topics. Effectively integrating these heterogeneous data sources is crucial for optimizing LLM performance. Previous research has predominantly concentrated on domain-based data mixing, often neglecting the nuanced topic-level characteristics of the data. To address this gap, we propose a simple yet effective topic-based data mixing strategy that utilizes fine-grained topics generated through our topic modeling method, DataWeave. DataWeave employs a multi-stage clustering process to group semantically similar documents and utilizes LLMs to generate detailed topics, thereby facilitating a more nuanced understanding of dataset composition. Our strategy employs heuristic methods to upsample or downsample specific topics, which significantly enhances LLM performance on downstream tasks, achieving superior results compared to previous, more complex data mixing approaches. Furthermore, we confirm that the topics Science and Relationships are particularly effective, yielding the most substantial performance improvements. We will make our code and datasets publicly available.
摘要：大语言模型（LLM）的性能受到其预训练数据的质量和组成的显着影响，这些数据本质上是多种多样的，涵盖了各种领域，来源和主题。有效地整合这些异质数据源对于优化LLM性能至关重要。先前的研究主要集中在基于域的数据混合上，通常忽略了数据的细微差别特征。为了解决这一差距，我们提出了一种简单但有效的基于主题的数据混合策略，该策略利用了通过我们的主题建模方法DataWeave生成的细粒度主题。 DataWeave采用多阶段聚类过程来分组语义相似的文档，并利用LLMS生成详细的主题，从而促进对数据集组成的更加细微的理解。我们的策略采用启发式方法来进行样本或下样本的特定主题，从而显着提高了下游任务的LLM性能，与以前的更复杂的数据混合方法相比，实现了卓越的结果。此外，我们确认科学和关系的主题特别有效，从而带来了最大的绩效改进。我们将公开提供代码和数据集。

Title: CoT2Align: Cross-Chain of Thought Distillation via Optimal Transport Alignment for Language Models with Different Tokenizers

Authors: Anh Duc Le, Tu Vu, Nam Le Hai, Nguyen Thi Ngoc Diep, Linh Ngo Van, Trung Le, Thien Huu Nguyen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16806
Pdf URL: https://arxiv.org/pdf/2502.16806
Copy Paste: [[2502.16806]] CoT2Align: Cross-Chain of Thought Distillation via Optimal Transport Alignment for Language Models with Different Tokenizers(https://arxiv.org/abs/2502.16806)
Keywords: language model, llm, chain-of-thought
Abstract: Large Language Models (LLMs) achieve state-of-the-art performance across various NLP tasks but face deployment challenges due to high computational costs and memory constraints. Knowledge distillation (KD) is a promising solution, transferring knowledge from large teacher models to smaller student models. However, existing KD methods often assume shared vocabularies and tokenizers, limiting their flexibility. While approaches like Universal Logit Distillation (ULD) and Dual-Space Knowledge Distillation (DSKD) address vocabulary mismatches, they overlook the critical \textbf{reasoning-aware distillation} aspect. To bridge this gap, we propose CoT2Align a universal KD framework that integrates Chain-of-Thought (CoT) augmentation and introduces Cross-CoT Alignment to enhance reasoning transfer. Additionally, we extend Optimal Transport beyond token-wise alignment to a sequence-level and layer-wise alignment approach that adapts to varying sequence lengths while preserving contextual integrity. Comprehensive experiments demonstrate that CoT2Align outperforms existing KD methods across different vocabulary settings, improving reasoning capabilities and robustness in domain-specific tasks.
摘要：大型语言模型（LLMS）在各种NLP任务中实现最先进的性能，但由于高计算成本和内存限制，面临部署挑战。知识蒸馏（KD）是一个有前途的解决方案，将知识从大型教师模型转移到较小的学生模型。但是，现有的KD方法通常假设共享的词汇和象征器，从而限制了它们的灵活性。虽然诸如通用logit蒸馏（ULD）和双空间知识蒸馏（DSKD）之类的方法地址词汇不匹配，但它们忽略了关键的\ textbf {clifting-practioning-aware atake aware asware蒸馏}方面。为了弥合这一差距，我们提出了COT2Align一个通用的KD框架，该框架集成了经过三通链（COT）的增强，并引入了跨核对准路线以增强推理转移。此外，我们将最佳传输范围扩展到依赖令牌比对以外的序列级别和层面对齐方式，该方法适应了不同的序列长度，同时保留了上下文完整性。全面的实验表明，COT2Align优于不同词汇环境的现有KD方法，从而提高了特定领域特定任务的推理能力和鲁棒性。

Title: Uncertainty Quantification of Large Language Models through Multi-Dimensional Responses

Authors: Tiejin Chen, Xiaoou Liu, Longchao Da, Xiaoou Liu, Vagelis Papalexakis, Hua Wei
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.16820
Pdf URL: https://arxiv.org/pdf/2502.16820
Copy Paste: [[2502.16820]] Uncertainty Quantification of Large Language Models through Multi-Dimensional Responses(https://arxiv.org/abs/2502.16820)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks due to large training datasets and powerful transformer architecture. However, the reliability of responses from LLMs remains a question. Uncertainty quantification (UQ) of LLMs is crucial for ensuring their reliability, especially in areas such as healthcare, finance, and decision-making. Existing UQ methods primarily focus on semantic similarity, overlooking the deeper knowledge dimensions embedded in responses. We introduce a multi-dimensional UQ framework that integrates semantic and knowledge-aware similarity analysis. By generating multiple responses and leveraging auxiliary LLMs to extract implicit knowledge, we construct separate similarity matrices and apply tensor decomposition to derive a comprehensive uncertainty representation. This approach disentangles overlapping information from both semantic and knowledge dimensions, capturing both semantic variations and factual consistency, leading to more accurate UQ. Our empirical evaluations demonstrate that our method outperforms existing techniques in identifying uncertain responses, offering a more robust framework for enhancing LLM reliability in high-stakes applications.
摘要：大型语言模型（LLMS）由于较大的培训数据集和强大的变压器体系结构，在各种任务中都表现出了出色的功能。但是，LLMS回答的可靠性仍然是一个问题。 LLMS的不确定性量化（UQ）对于确保其可靠性至关重要，尤其是在医疗保健，金融和决策等领域。现有的UQ方法主要集中于语义相似性，忽略了嵌入在响应中的更深层次的知识维度。我们介绍了一个多维的UQ框架，该框架集成了语义和知识感知的相似性分析。通过产生多个响应并利用辅助LLM提取隐式知识，我们构建了单独的相似性矩阵并应用张量分解以得出全面的不确定性表示。这种方法将与语义和知识维度的重叠信息相关，从而捕获语义变化和事实一致性，从而导致更准确的UQ。我们的经验评估表明，我们的方法在识别不确定的响应方面优于现有技术，为增强高风险应用程序中LLM可靠性提供了更强大的框架。

Title: Finding the Sweet Spot: Preference Data Construction for Scaling Preference Optimization

Authors: Yao Xiao, Hai Ye, Linyao Chen, Hwee Tou Ng, Lidong Bing, Xiaoli Li, Roy Ka-wei Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16825
Pdf URL: https://arxiv.org/pdf/2502.16825
Copy Paste: [[2502.16825]] Finding the Sweet Spot: Preference Data Construction for Scaling Preference Optimization(https://arxiv.org/abs/2502.16825)
Keywords: language model, llm
Abstract: Iterative data generation and model retraining are widely used to align large language models (LLMs). It typically involves a policy model to generate on-policy responses and a reward model to guide training data selection. Direct Preference Optimization (DPO) further enhances this process by constructing preference pairs of chosen and rejected responses. In this work, we aim to \emph{scale up} the number of on-policy samples via repeated random sampling to improve alignment performance. Conventional practice selects the sample with the highest reward as chosen and the lowest as rejected for DPO. However, our experiments reveal that this strategy leads to a \emph{decline} in performance as the sample size increases. To address this, we investigate preference data construction through the lens of underlying normal distribution of sample rewards. We categorize the reward space into seven representative points and systematically explore all 21 ($C_7^2$) pairwise combinations. Through evaluations on four models using AlpacaEval 2, we find that selecting the rejected response at reward position $\mu - 2\sigma$ rather than the minimum reward, is crucial for optimal performance. We finally introduce a scalable preference data construction strategy that consistently enhances model performance as the sample scale increases.
摘要：迭代数据生成和模型再培训被广泛用于对齐大语言模型（LLMS）。通常，它涉及一个政策模型，以生成政策响应和指导培训数据选择的奖励模型。直接偏好优化（DPO）通过构建选定和拒绝响应的偏好对进一步增强了这一过程。在这项工作中，我们的目标是\ emph {扩展}通过重复随机抽样的式样品数量，以提高对齐性能。常规实践选择以最高奖励和最低拒绝的样本选择DPO。但是，我们的实验表明，随着样本量的增加，这种策略会导致性能下降。为了解决这个问题，我们通过样本奖励的基本正态分布的镜头研究偏好数据构建。我们将奖励空间分为七个代表点，并系统地探索所有21个（$ C_7^2 $）的成对组合。通过使用Alpacaeval 2对四个模型进行评估，我们发现在奖励位置$ \ MU -2 \ sigma $而不是最低奖励，选择拒绝的响应，对于最佳性能至关重要。我们最终引入了可扩展的偏好数据构建策略，该策略会随着样本量表的增加而始终如一地增强模型性能。

Title: REGen: A Reliable Evaluation Framework for Generative Event Argument Extraction

Authors: Omar Sharif, Joseph Gatto, Madhusudan Basak, Sarah M. Preum
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16838
Pdf URL: https://arxiv.org/pdf/2502.16838
Copy Paste: [[2502.16838]] REGen: A Reliable Evaluation Framework for Generative Event Argument Extraction(https://arxiv.org/abs/2502.16838)
Keywords: language model, llm
Abstract: Event argument extraction identifies arguments for predefined event roles in text. Traditional evaluations rely on exact match (EM), requiring predicted arguments to match annotated spans exactly. However, this approach fails for generative models like large language models (LLMs), which produce diverse yet semantically accurate responses. EM underestimates performance by disregarding valid variations, implicit arguments (unstated but inferable), and scattered arguments (distributed across a document). To bridge this gap, we introduce Reliable Evaluation framework for Generative event argument extraction (REGen), a framework that better aligns with human judgment. Across six datasets, REGen improves performance by an average of 23.93 F1 points over EM. Human validation further confirms REGen's effectiveness, achieving 87.67% alignment with human assessments of argument correctness.
摘要：事件参数提取标识文本中预定义事件角色的参数。传统评估依赖于确切的匹配（EM），需要预测的参数才能完全匹配带注释的跨度。但是，对于大型语言模型（LLM）等生成模型，这种方法失败了，这些模型产生了多种而精确的响应。通过无视有效的变化，隐式论证（未陈述但可推断）和分散的参数（分布在文档中）来低估性能。为了弥合这一差距，我们为生成事件参数提取（regen）引入了可靠的评估框架，该框架可以更好地与人类的判断力保持一致。在六个数据集中，Remen在EM上平均提高了23.93 F1点。人类验证进一步证实了Regen的有效性，与人类对论点正确性的评估达到了87.67％的一致性。

Title: "Actionable Help" in Crises: A Novel Dataset and Resource-Efficient Models for Identifying Request and Offer Social Media Posts

Authors: Rabindra Lamsal, Maria Rodriguez Read, Shanika Karunasekera, Muhammad Imran
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16839
Pdf URL: https://arxiv.org/pdf/2502.16839
Copy Paste: [[2502.16839]] "Actionable Help" in Crises: A Novel Dataset and Resource-Efficient Models for Identifying Request and Offer Social Media Posts(https://arxiv.org/abs/2502.16839)
Keywords: language model, llm
Abstract: During crises, social media serves as a crucial coordination tool, but the vast influx of posts--from "actionable" requests and offers to generic content like emotional support, behavioural guidance, or outdated information--complicates effective classification. Although generative LLMs (Large Language Models) can address this issue with few-shot classification, their high computational demands limit real-time crisis response. While fine-tuning encoder-only models (e.g., BERT) is a popular choice, these models still exhibit higher inference times in resource-constrained environments. Moreover, although distilled variants (e.g., DistilBERT) exist, they are not tailored for the crisis domain. To address these challenges, we make two key contributions. First, we present CrisisHelpOffer, a novel dataset of 101k tweets collaboratively labelled by generative LLMs and validated by humans, specifically designed to distinguish actionable content from noise. Second, we introduce the first crisis-specific mini models optimized for deployment in resource-constrained settings. Across 13 crisis classification tasks, our mini models surpass BERT (also outperform or match the performance of RoBERTa, MPNet, and BERTweet), offering higher accuracy with significantly smaller sizes and faster speeds. The Medium model is 47% smaller with 3.8% higher accuracy at 3.5x speed, the Small model is 68% smaller with a 1.8% accuracy gain at 7.7x speed, and the Tiny model, 83% smaller, matches BERT's accuracy at 18.6x speed. All models outperform existing distilled variants, setting new benchmarks. Finally, as a case study, we analyze social media posts from a global crisis to explore help-seeking and assistance-offering behaviours in selected developing and developed countries.
摘要：在危机期间，社交媒体是一种至关重要的协调工具，但是帖子的大量涌入 - 从“可行”的请求中提供，并提供对情感支持，行为指导或过时的信息等通用内容的提议，并使有效的分类完美。尽管生成的LLM（大型语言模型）可以通过很少的分类来解决此问题，但它们的高计算需求限制了实时危机响应。虽然仅微调编码模型（例如BERT）是一个流行的选择，但这些模型仍显示出资源约束环境中的推理时间更高。此外，尽管存在蒸馏变种（例如，蒸馏厂），但它们并不是为危机领域量身定制的。为了应对这些挑战，我们做出了两个关键的贡献。首先，我们提出了CrishelPoffer，这是一个由生成LLMS合作的101K推文的新型数据集，并由人类验证，专门设计用于将可行的内容与噪声区分开来。其次，我们介绍了针对资源约束设置中部署的第一个特定于危机的迷你模型。在13项危机分类任务中，我们的迷你车型超过了BERT（也表现优于Roberta，Mpnet和Bertweet的性能），提供了更高的精度，其尺寸明显较小，速度更快。中型型号较小47％，在3.5倍速度下的精度提高3.8％，小型型号较小68％，精度为7.7倍，而小型速度为1.8％，而小型模型（83％）较小，将BERT的精度与18.6倍相匹配。速度。所有型号的表现都优于现有的蒸馏变体，设定了新的基准测试。最后，作为一个案例研究，我们分析了来自全球危机的社交媒体帖子，以探讨选定发展中和发达国家的寻求帮助和援助行为。

Title: Sarang at DEFACTIFY 4.0: Detecting AI-Generated Text Using Noised Data and an Ensemble of DeBERTa Models

Authors: Avinash Trivedi, Sangeetha Sivanesan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.16857
Pdf URL: https://arxiv.org/pdf/2502.16857
Copy Paste: [[2502.16857]] Sarang at DEFACTIFY 4.0: Detecting AI-Generated Text Using Noised Data and an Ensemble of DeBERTa Models(https://arxiv.org/abs/2502.16857)
Keywords: language model
Abstract: This paper presents an effective approach to detect AI-generated text, developed for the Defactify 4.0 shared task at the fourth workshop on multimodal fact checking and hate speech detection. The task consists of two subtasks: Task-A, classifying whether a text is AI generated or human written, and Task-B, classifying the specific large language model that generated the text. Our team (Sarang) achieved the 1st place in both tasks with F1 scores of 1.0 and 0.9531, respectively. The methodology involves adding noise to the dataset to improve model robustness and generalization. We used an ensemble of DeBERTa models to effectively capture complex patterns in the text. The result indicates the effectiveness of our noise-driven and ensemble-based approach, setting a new standard in AI-generated text detection and providing guidance for future developments.
摘要：本文提出了一种检测AI生成的文本的有效方法，该方法是为Defactify 4.0共享任务开发的，该任务是在第四个关于多模式事实检查和仇恨言论检测的研讨会上开发的。该任务由两个子任务组成：任务-A，对文本是AI生成还是人为书面的分类，以及Task-B，对生成文本的特定大型语言模型进行分类。我们的团队（Sarang）在两项任务中均获得了F1分别为1.0和0.9531的第一名。该方法涉及在数据集中添加噪声，以改善模型的鲁棒性和概括。我们使用Deberta模型的合奏来有效地捕获文本中的复杂模式。结果表明了我们以噪声为基础和合奏的方法的有效性，在AI生成的文本检测中设定了新标准，并为未来的发展提供指导。

Title: LongAttn: Selecting Long-context Training Data via Token-level Attention

Authors: Longyun Wu, Dawei Zhu, Guangxiang Zhao, Zhuocheng Yu, Junfeng Ran, Xiangyu Wong, Lin Sun, Sujian Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16860
Pdf URL: https://arxiv.org/pdf/2502.16860
Copy Paste: [[2502.16860]] LongAttn: Selecting Long-context Training Data via Token-level Attention(https://arxiv.org/abs/2502.16860)
Keywords: language model, llm, long context
Abstract: With the development of large language models (LLMs), there has been an increasing need for significant advancements in handling long contexts. To enhance long-context capabilities, constructing high-quality training data with long-range dependencies is crucial. Existing methods to select long-context data often rely on sentence-level analysis, which can be greatly optimized in both performance and efficiency. In this paper, we propose a novel token-level framework, LongAttn, which leverages the self-attention mechanism of LLMs to measure the long-range dependencies for the data. By calculating token-level dependency strength and distribution uniformity of token scores, LongAttn effectively quantifies long-range dependencies, enabling more accurate and efficient data selection. We filter LongABC-32K from open-source long-context datasets (ArXiv, Book, and Code). Through our comprehensive experiments, LongAttn has demonstrated its excellent effectiveness, scalability, and efficiency. To facilitate future research in long-context data, we released our code and the high-quality long-context training data LongABC-32K.
摘要：随着大语言模型（LLM）的发展，在处理长篇小说中越来越需要大量进步。为了增强长期文化功能，构建具有长期依赖性的高质量培训数据至关重要。选择长篇小说数据的现有方法通常依赖于句子级分析，这在性能和效率方面都可以非常优化。在本文中，我们提出了一个新颖的令牌级框架Longattn，该框架利用LLMS的自我注意力机制来衡量数据的远程依赖性。通过计算令牌得分的代币依赖性强度和分布均匀性，Longattn有效地量化了长期依赖性，从而可以更准确，有效地选择数据。我们从开源长篇小说数据集（Arxiv，Book和Code）过滤LongABC-32K。通过我们的全面实验，Longattn证明了其出色的有效性，可伸缩性和效率。为了促进长篇文化数据的未来研究，我们发布了代码和高质量的长篇小写培训数据longabc-32k。

Title: CORAL: Learning Consistent Representations across Multi-step Training with Lighter Speculative Drafter

Authors: Yepeng Weng, Dianwen Mei, Huishi Qiu, Xujie Chen, Li Liu, Jiang Tian, Zhongchao Shi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.16880
Pdf URL: https://arxiv.org/pdf/2502.16880
Copy Paste: [[2502.16880]] CORAL: Learning Consistent Representations across Multi-step Training with Lighter Speculative Drafter(https://arxiv.org/abs/2502.16880)
Keywords: language model, llm
Abstract: Speculative decoding is a powerful technique that accelerates Large Language Model (LLM) inference by leveraging a lightweight speculative draft model. However, existing designs suffers in performance due to misalignment between training and inference. Recent methods have tried to solve this issue by adopting a multi-step training strategy, but the complex inputs of different training steps make it harder for the draft model to converge. To address this, we propose CORAL, a novel framework that improves both accuracy and efficiency in speculative drafting. CORAL introduces Cross-Step Representation Alignment, a method that enhances consistency across multiple training steps, significantly improving speculative drafting performance. Additionally, we identify the LM head as a major bottleneck in the inference speed of the draft model. We introduce a weight-grouping mechanism that selectively activates a subset of LM head parameters during inference, substantially reducing the latency of the draft model. We evaluate CORAL on three LLM families and three benchmark datasets, achieving speedup ratios of 2.50x-4.07x, outperforming state-of-the-art methods such as EAGLE-2 and HASS. Our results demonstrate that CORAL effectively mitigates training-inference misalignment and delivers significant speedup for modern LLMs with large vocabularies.
摘要：投机解码是一种强大的技术，它通过利用轻量级投机性草案模型来加速大型语言模型（LLM）推断。但是，由于训练和推理之间的不对对准，现有的设计遭受了性能的影响。最近的方法试图通过采用多步培训策略来解决这个问题，但是不同培训步骤的复杂输入使模型草案更难收敛。为了解决这个问题，我们提出了珊瑚，这是一个新颖的框架，可提高投机绘图的准确性和效率。 Coral引入了跨步骤表示对齐方式，该方法可提高多个训练步骤的一致性，从而显着提高投机性起草性能。此外，我们将LM头确定为草稿模型推理速度的主要瓶颈。我们引入了一种重组机制，该机制在推理过程中有选择地激活LM头参数的子集，从而大大降低了草案模型的潜伏期。我们评估了三个LLM家族和三个基准数据集的珊瑚，达到2.50x-4.07倍的加速比，表现优于EAGLE-2和HASS等最先进的方法。我们的结果表明，珊瑚有效地减轻了训练推导的错位，并为具有较大词汇的现代LLM提供了显着的加速。

Title: DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance

Authors: Xuanfan Ni, Liyan Xu, Chenyang Lyu, Longyue Wang, Mo Yu, Lemao Liu, Fandong Meng, Jie Zhou, Piji Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.16886
Pdf URL: https://arxiv.org/pdf/2502.16886
Copy Paste: [[2502.16886]] DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance(https://arxiv.org/abs/2502.16886)
Keywords: language model, llm
Abstract: To alleviate memory burden during inference of large language models (LLMs), numerous studies have focused on compressing the KV cache by exploring aspects such as attention sparsity. However, these techniques often require a pre-defined cache budget; as the optimal budget varies with different input lengths and task types, it limits their practical deployment accepting open-domain instructions. To address this limitation, we propose a new KV cache compression objective: to always ensure the full-cache performance regardless of specific inputs, while maximizing KV cache pruning as much as possible. To achieve this goal, we introduce a novel KV cache compression method dubbed DBudgetKV, which features an attention-based metric to signal when the remaining KV cache is unlikely to match the full-cache performance, then halting the pruning process. Empirical evaluation spanning diverse context lengths, task types, and model sizes suggests that our method achieves lossless KV pruning effectively and robustly, exceeding 25% compression ratio on average. Furthermore, our method is easy to integrate within LLM inference, not only optimizing memory space, but also showing reduced inference time compared to existing methods.
摘要：为了减轻大型语言模型（LLMS）推断期间的记忆负担，许多研究重点是通过探索诸如注意力稀疏之类的方面来压缩KV缓存。但是，这些技术通常需要预定义的缓存预算。随着最佳预算随不同的输入长度和任务类型而变化，因此它限制了他们接受开放域指令的实际部署。为了解决此限制，我们提出了一个新的KV缓存压缩目标：无论特定的输入如何，始终确保全缓存性能，同时尽可能最大程度地提高KV高速缓存。为了实现这一目标，我们引入了一种新型的KV缓存压缩方法，该方法称为DBUDGETKV，该方法具有基于注意力的指标，可以在其余的KV高速缓存不太可能匹配全缓存性能时发出信号，然后停止修剪过程。涵盖各种上下文长度，任务类型和模型尺寸的经验评估表明，我们的方法可以有效，稳健地实现无损的KV修剪，平均超过25％的压缩率。此外，我们的方法易于集成在LLM推理中，不仅可以优化内存空间，而且与现有方法相比，推理时间减少了。

Title: Applying LLMs to Active Learning: Towards Cost-Efficient Cross-Task Text Classification without Manually Labeled Data

Authors: Yejian Zhang, Shingo Takada
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16892
Pdf URL: https://arxiv.org/pdf/2502.16892
Copy Paste: [[2502.16892]] Applying LLMs to Active Learning: Towards Cost-Efficient Cross-Task Text Classification without Manually Labeled Data(https://arxiv.org/abs/2502.16892)
Keywords: language model, gpt, llm
Abstract: Machine learning-based classifiers have been used for text classification, such as sentiment analysis, news classification, and toxic comment classification. However, supervised machine learning models often require large amounts of labeled data for training, and manual annotation is both labor-intensive and requires domain-specific knowledge, leading to relatively high annotation costs. To address this issue, we propose an approach that integrates large language models (LLMs) into an active learning framework. Our approach combines the Robustly Optimized BERT Pretraining Approach (RoBERTa), Generative Pre-trained Transformer (GPT), and active learning, achieving high cross-task text classification performance without the need for any manually labeled data. Furthermore, compared to directly applying GPT for classification tasks, our approach retains over 93% of its classification performance while requiring only approximately 6% of the computational time and monetary cost, effectively balancing performance and resource efficiency. These findings provide new insights into the efficient utilization of LLMs and active learning algorithms in text classification tasks, paving the way for their broader application.
摘要：基于机器学习的分类器已用于文本分类，例如情感分析，新闻分类和有毒评论分类。但是，监督的机器学习模型通常需要大量的标记数据进行培训，并且手动注释既是劳动力密集型，又需要特定于领域的知识，从而导致注释成本相对较高。为了解决这个问题，我们提出了一种将大型语言模型（LLM）集成到主动学习框架中的方法。我们的方法结合了强大优化的BERT预处理方法（Roberta），生成的预训练的变压器（GPT）和主动学习，实现了高跨任务文本分类性能，而无需任何手动标记的数据。此外，与直接将GPT应用于分类任务相比，我们的方法保留了其分类性能的93％以上，同时仅需要大约6％的计算时间和货币成本，有效地平衡了绩效和资源效率。这些发现为文本分类任务中LLM的有效利用和主动学习算法提供了新的见解，为其更广泛的应用铺平了道路。

Title: Make LoRA Great Again: Boosting LoRA with Adaptive Singular Values and Mixture-of-Experts Optimization Alignment

Authors: Chenghao Fan, Zhenyi Lu, Sichen Liu, Xiaoye Qu, Wei Wei, Chengfeng Gu, Yu Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16894
Pdf URL: https://arxiv.org/pdf/2502.16894
Copy Paste: [[2502.16894]] Make LoRA Great Again: Boosting LoRA with Adaptive Singular Values and Mixture-of-Experts Optimization Alignment(https://arxiv.org/abs/2502.16894)
Keywords: language model, llm
Abstract: While Low-Rank Adaptation (LoRA) enables parameter-efficient fine-tuning for Large Language Models (LLMs), its performance often falls short of Full Fine-Tuning (Full FT). Current methods optimize LoRA by initializing with static singular value decomposition (SVD) subsets, leading to suboptimal leveraging of pre-trained knowledge. Another path for improving LoRA is incorporating a Mixture-of-Experts (MoE) architecture. However, weight misalignment and complex gradient dynamics make it challenging to adopt SVD prior to the LoRA MoE architecture. To mitigate these issues, we propose \underline{G}reat L\underline{o}R\underline{A} Mixture-of-Exper\underline{t} (GOAT), a framework that (1) adaptively integrates relevant priors using an SVD-structured MoE, and (2) aligns optimization with full fine-tuned MoE by deriving a theoretical scaling factor. We demonstrate that proper scaling, without modifying the architecture or training algorithms, boosts LoRA MoE's efficiency and performance. Experiments across 25 datasets, including natural language understanding, commonsense reasoning, image classification, and natural language generation, demonstrate GOAT's state-of-the-art performance, closing the gap with Full FT.
摘要：虽然低排名适应（LORA）可以为大型语言模型（LLMS）提供参数有效的微调，但其性能通常不足以完全微调（完整的FT）。当前方法通过用静态奇异值分解（SVD）子集初始化来优化LORA，从而导致预先训练的知识的次优率。改善洛拉的另一个途径是结合了专家（MOE）结构的混合物。但是，体重错位和复杂的梯度动态使得在Lora Moe体系结构之前采用SVD变得具有挑战性。为了减轻这些问题，我们建议\下划线{g} reat l \下划线{o} r \下划线{a}混合 - SVD结构的MOE和（2）通过得出理论上的缩放系数来对齐优化与完整的微调MOE。我们证明了适当的扩展，而不修改体系结构或培训算法，可以提高洛拉·莫伊（Lora Moe）的效率和性能。在25个数据集中进行的实验，包括自然语言理解，常识性推理，图像分类和自然语言产生，展示了山羊的最新性能，以完整的FT缩小了差距。

Title: Char-mander Use mBackdoor! A Study of Cross-lingual Backdoor Attacks in Multilingual LLMs

Authors: Himanshu Beniwal, Sailesh Panda, Mayank Singh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.16901
Pdf URL: https://arxiv.org/pdf/2502.16901
Copy Paste: [[2502.16901]] Char-mander Use mBackdoor! A Study of Cross-lingual Backdoor Attacks in Multilingual LLMs(https://arxiv.org/abs/2502.16901)
Keywords: language model, llm
Abstract: We explore Cross-lingual Backdoor ATtacks (X-BAT) in multilingual Large Language Models (mLLMs), revealing how backdoors inserted in one language can automatically transfer to others through shared embedding spaces. Using toxicity classification as a case study, we demonstrate that attackers can compromise multilingual systems by poisoning data in a single language, with rare tokens serving as specific effective triggers. Our findings expose a critical vulnerability in the fundamental architecture that enables cross-lingual transfer in these models. Our code and data are publicly available at this https URL.
摘要：我们在多语言大语言模型（MLLMS）中探索跨语性的后门攻击（X-BAT），揭示插入一种语言的后门如何通过共享的嵌入空间自动转移到他人。我们将毒性分类作为案例研究，我们证明攻击者可以通过用单一语言中毒数据来损害多语言系统，而罕见的令牌是特定的有效触发器。我们的发现在基本体系结构中暴露了一个关键的脆弱性，在这些模型中可以跨语性转移。我们的代码和数据在此HTTPS URL上公开可用。

Title: GuidedBench: Equipping Jailbreak Evaluation with Guidelines

Authors: Ruixuan Huang, Xunguang Wang, Zongjie Li, Daoyuan Wu, Shuai Wang
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2502.16903
Pdf URL: https://arxiv.org/pdf/2502.16903
Copy Paste: [[2502.16903]] GuidedBench: Equipping Jailbreak Evaluation with Guidelines(https://arxiv.org/abs/2502.16903)
Keywords: language model, llm
Abstract: Jailbreaking methods for large language models (LLMs) have gained increasing attention for building safe and responsible AI systems. After analyzing 35 jailbreak methods across six categories, we find that existing benchmarks, relying on universal LLM-based or keyword-matching scores, lack case-specific criteria, leading to conflicting results. In this paper, we introduce a more robust evaluation framework for jailbreak methods, with a curated harmful question dataset, detailed case-by-case evaluation guidelines, and a scoring system equipped with these guidelines. Our experiments show that existing jailbreak methods exhibit better discrimination when evaluated using our benchmark. Some jailbreak methods that claim to achieve over 90% attack success rate (ASR) on other benchmarks only reach a maximum of 30.2% on our benchmark, providing a higher ceiling for more advanced jailbreak research; furthermore, using our scoring system reduces the variance of disagreements between different evaluator LLMs by up to 76.33%. This demonstrates its ability to provide more fair and stable evaluation.
摘要：大型语言模型（LLM）的越狱方法已引起人们对建立安全和负责的AI系统的越来越多的关注。在分析了六个类别的35种越狱方法之后，我们发现现有的基准依赖于通用LLM或基于LLM的关键字匹配分数，缺乏特定于病例的标准，从而导致结果相互矛盾。在本文中，我们针对越狱方法介绍了一个更强大的评估框架，其中包含有害的有害问题数据集，详细的逐案案例评估指南以及配备这些准则的评分系统。我们的实验表明，现有的越狱方法在使用我们的基准测试评估时表现出更好的歧视。一些声称在其他基准上获得超过90％攻击成功率（ASR）的越狱方法仅在我们的基准测试中最多达到30.2％，为更高级的越狱研究提供了更高的上限。此外，使用我们的评分系统可将不同评估者LLMS之间的分歧的方差降低多达76.33％。这表明了其提供更公平和稳定的评估的能力。

Title: AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models

Authors: Qin Zhu, Fei Huang, Runyu Peng, Keming Lu, Bowen Yu, Qinyuan Cheng, Xipeng Qiu, Xuanjing Huang, Junyang Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16906
Pdf URL: https://arxiv.org/pdf/2502.16906
Copy Paste: [[2502.16906]] AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models(https://arxiv.org/abs/2502.16906)
Keywords: language model, llm
Abstract: While logical reasoning evaluation of Large Language Models (LLMs) has attracted significant attention, existing benchmarks predominantly rely on multiple-choice formats that are vulnerable to random guessing, leading to overestimated performance and substantial performance fluctuations. To obtain more accurate assessments of models' reasoning capabilities, we propose an automated method for synthesizing open-ended logic puzzles, and use it to develop a bilingual benchmark, AutoLogi. Our approach features program-based verification and controllable difficulty levels, enabling more reliable evaluation that better distinguishes models' reasoning abilities. Extensive evaluation of eight modern LLMs shows that AutoLogi can better reflect true model capabilities, with performance scores spanning from 35% to 73% compared to the narrower range of 21% to 37% on the source multiple-choice dataset. Beyond benchmark creation, this synthesis method can generate high-quality training data by incorporating program verifiers into the rejection sampling process, enabling systematic enhancement of LLMs' reasoning capabilities across diverse datasets.
摘要：尽管对大语言模型（LLM）的逻辑推理评估引起了极大的关注，但现有的基准主要依赖于容易随机猜测的多项选择格式，从而导致性能高估和实质性的性能波动。为了获得模型推理能力的更准确的评估，我们提出了一种合成开放式逻辑难题的自动化方法，并使用它来开发双语基准Autologi。我们的方法具有基于程序的验证和可控制的难度级别，从而使更可靠的评估更好地区分了模型的推理能力。对八个现代LLM的广泛评估表明，自动志可以更好地反映真正的模型功能，其性能得分范围从35％到73％，而源多项选择数据集的狭窄范围为21％至37％。除了基准创建之外，这种合成方法还可以通过将程序验证器纳入拒绝抽样过程来生成高质量的培训数据，从而使LLMS在不同数据集中的推理能力的系统增强。

Title: Dependency Parsing with the Structuralized Prompt Template

Authors: Keunha Kim, Youngjoong Ko
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16919
Pdf URL: https://arxiv.org/pdf/2502.16919
Copy Paste: [[2502.16919]] Dependency Parsing with the Structuralized Prompt Template(https://arxiv.org/abs/2502.16919)
Keywords: prompt
Abstract: Dependency parsing is a fundamental task in natural language processing (NLP), aiming to identify syntactic dependencies and construct a syntactic tree for a given sentence. Traditional dependency parsing models typically construct embeddings and utilize additional layers for prediction. We propose a novel dependency parsing method that relies solely on an encoder model with a text-to-text training approach. To facilitate this, we introduce a structured prompt template that effectively captures the structural information of dependency trees. Our experimental results demonstrate that the proposed method achieves outstanding performance compared to traditional models, despite relying solely on a pre-trained model. Furthermore, this method is highly adaptable to various pre-trained models across different target languages and training environments, allowing easy integration of task-specific features.
摘要：依赖性解析是自然语言处理（NLP）中的一项基本任务，旨在识别句法依赖性并为给定句子构造句法树。传统的依赖解析模型通常构建嵌入，并利用其他图层进行预测。我们提出了一种新颖的依赖解析方法，该方法仅依赖于文本对文本培训方法的编码模型。为了促进这一点，我们引入了一个结构化的及时模板，该模板有效地捕获了依赖树的结构信息。我们的实验结果表明，尽管仅依靠预先训练的模型，但所提出的方法与传统模型相比取得了出色的性能。此外，此方法高度适应不同目标语言和培训环境的各种预训练的模型，从而可以轻松整合特定于任务的功能。

Title: SS-MPC: A Sequence-Structured Multi-Party Conversation System

Authors: Yoonjin Jang, Keunha Kim, Youngjoong Ko
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16920
Pdf URL: https://arxiv.org/pdf/2502.16920
Copy Paste: [[2502.16920]] SS-MPC: A Sequence-Structured Multi-Party Conversation System(https://arxiv.org/abs/2502.16920)
Keywords: language model
Abstract: Recent Multi-Party Conversation (MPC) models typically rely on graph-based approaches to capture dialogue structures. However, these methods have limitations, such as information loss during the projection of utterances into structural embeddings and constraints in leveraging pre-trained language models directly. In this paper, we propose \textbf{SS-MPC}, a response generation model for MPC that eliminates the need for explicit graph structures. Unlike existing models that depend on graphs to analyze conversation structures, SS-MPC internally encodes the dialogue structure as a sequential input, enabling direct utilization of pre-trained language models. Experimental results show that \textbf{SS-MPC} achieves \textbf{15.60\% BLEU-1} and \textbf{12.44\% ROUGE-L} score, outperforming the current state-of-the-art MPC response generation model by \textbf{3.91\%p} in \textbf{BLEU-1} and \textbf{0.62\%p} in \textbf{ROUGE-L}. Additionally, human evaluation confirms that SS-MPC generates more fluent and accurate responses compared to existing MPC models.
摘要：最近的多方对话（MPC）模型通常依赖于基于图的方法来捕获对话结构。但是，这些方法具有局限性，例如在直接利用预训练的语言模型的情况下将话语投射到结构嵌入和约束时的信息丢失。在本文中，我们提出了\ textbf {SS-MPC}，这是MPC的响应生成模型，它消除了对显式图结构的需求。与依赖图表来分析对话结构的现有模型不同，SS-MPC内部将对话结构编码为顺序输入，从而可以直接利用预训练的语言模型。实验结果表明，\ textbf {ss-mpc}达到\ textbf {15.60 \％bleu-1}和\ textbf {12.44 \％rouge-l}得分，超过了当前的最终MPC响应模型， \ textbf {3.91 \％p} in \ textbf {bleu-1}和\ textbf {0.62 \％p}在\ textbf {rouge-l}中。此外，人类评估证实，与现有MPC模型相比，SS-MPC会产生更流利和准确的响应。

Title: Benchmarking Temporal Reasoning and Alignment Across Chinese Dynasties

Authors: Zhenglin Wang, Jialong Wu, Pengfei LI, Yong Jiang, Deyu Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16922
Pdf URL: https://arxiv.org/pdf/2502.16922
Copy Paste: [[2502.16922]] Benchmarking Temporal Reasoning and Alignment Across Chinese Dynasties(https://arxiv.org/abs/2502.16922)
Keywords: language model, llm
Abstract: Temporal reasoning is fundamental to human cognition and is crucial for various real-world applications. While recent advances in Large Language Models have demonstrated promising capabilities in temporal reasoning, existing benchmarks primarily rely on rule-based construction, lack contextual depth, and involve a limited range of temporal entities. To address these limitations, we introduce Chinese Time Reasoning (CTM), a benchmark designed to evaluate LLMs on temporal reasoning within the extensive scope of Chinese dynastic chronology. CTM emphasizes cross-entity relationships, pairwise temporal alignment, and contextualized and culturally-grounded reasoning, providing a comprehensive evaluation. Extensive experimental results reveal the challenges posed by CTM and highlight potential avenues for improvement.
摘要：时间推理对人类认知至关重要，对于各种现实世界的应用至关重要。尽管大型语言模型的最新进展表明了时间推理中有希望的能力，但现有的基准主要依赖于基于规则的构造，缺乏上下文深度，并且涉及有限的时间实体。为了解决这些局限性，我们引入了中国时间推理（CTM），这是一种基准，旨在评估中国王朝年表的广泛范围内的时间推理LLM。 CTM强调跨境关系，成对的时间对齐以及情境化和文化基础的推理，提供了全面的评估。广泛的实验结果揭示了CTM带来的挑战，并突出了潜在的改进途径。

Title: A Systematic Survey of Automatic Prompt Optimization Techniques

Authors: Kiran Ramnath, Kang Zhou, Sheng Guan, Soumya Smruti Mishra, Xuan Qi, Zhengyuan Shen, Shuai Wang, Sangmin Woo, Sullam Jeoung, Yawei Wang, Haozhu Wang, Han Ding, Yuzhe Lu, Zhichao Xu, Yun Zhou, Balasubramaniam Srinivasan, Qiaojing Yan, Yueyan Chen, Haibo Ding, Panpan Xu, Lin Lee Cheong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.16923
Pdf URL: https://arxiv.org/pdf/2502.16923
Copy Paste: [[2502.16923]] A Systematic Survey of Automatic Prompt Optimization Techniques(https://arxiv.org/abs/2502.16923)
Keywords: language model, llm, prompt
Abstract: Since the advent of large language models (LLMs), prompt engineering has been a crucial step for eliciting desired responses for various Natural Language Processing (NLP) tasks. However, prompt engineering remains an impediment for end users due to rapid advances in models, tasks, and associated best practices. To mitigate this, Automatic Prompt Optimization (APO) techniques have recently emerged that use various automated techniques to help improve the performance of LLMs on various tasks. In this paper, we present a comprehensive survey summarizing the current progress and remaining challenges in this field. We provide a formal definition of APO, a 5-part unifying framework, and then proceed to rigorously categorize all relevant works based on their salient features therein. We hope to spur further research guided by our framework.
摘要：自大型语言模型（LLMS）的出现以来，及时工程一直是为各种自然语言处理（NLP）任务提供所需响应的关键步骤。但是，由于模型，任务和相关最佳实践的快速进步，迅速的工程仍然是最终用户的障碍。为了减轻这种情况，最近出现了使用各种自动化技术来帮助提高LLM在各种任务上的性能。在本文中，我们提出了一项综合调查，总结了当前的进展以及该领域的剩余挑战。我们提供了一个正式的APO定义，即APO，一个由5部分组成的统一框架，然后根据其显着特征在严格地对所有相关作品进行严格分类。我们希望刺激以我们的框架为指导的进一步研究。

Title: Reasoning Does Not Necessarily Improve Role-Playing Ability

Authors: Xiachong Feng, Longxu Dou, Lingpeng Kong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.16940
Pdf URL: https://arxiv.org/pdf/2502.16940
Copy Paste: [[2502.16940]] Reasoning Does Not Necessarily Improve Role-Playing Ability(https://arxiv.org/abs/2502.16940)
Keywords: language model, llm, chain-of-thought
Abstract: The application of role-playing large language models (LLMs) is rapidly expanding in both academic and commercial domains, driving an increasing demand for high-precision role-playing models. Simultaneously, the rapid advancement of reasoning techniques has continuously pushed the performance boundaries of LLMs. This intersection of practical role-playing demands and evolving reasoning capabilities raises an important research question: "Can reasoning techniques enhance the role-playing capabilities of LLMs?" To address this, we conduct a comprehensive study using 6 role-playing benchmarks, 24 LLMs, and 3 distinct role-playing strategies, comparing the effectiveness of direct zero-shot role-playing, role-playing with Chain-of-Thought (CoT), and role-playing using reasoning-optimized LLMs. Our findings reveal that CoT may reduce role-playing performance, reasoning-optimized LLMs are unsuitable for role-playing, reasoning ability disrupts the role-playing scaling law, large models still lack proficiency in advanced role-playing, and Chinese role-playing performance surpasses English role-playing performance. Furthermore, based on extensive experimental results, we propose two promising future research directions: Role-aware CoT for improving role-playing LLMs and Reinforcement Learning for role-playing LLMs, aiming to enhance the adaptability, consistency, and effectiveness of role-playing LLMs for both research and real-world applications.
摘要：角色扮演大语言模型（LLM）的应用正在迅速扩展，在学术和商业领域，推动了对高精度角色扮演模型的不断增长。同时，推理技术的快速发展不断地突破了LLM的性能界限。实用的角色扮演要求和不断发展的推理能力的相交提出了一个重要的研究问题：“推理技术可以增强LLM的角色扮演能力吗？”为了解决这个问题，我们使用6个角色扮演基准，24个LLM和3种不同的角色扮演策略进行了全面的研究，将直接零击角色扮演，角色扮演的有效性，角色扮演与思想链进行了比较（COT）（COT（COT）），以及使用推理优化的LLM的角色扮演。我们的发现表明，COT可能会降低角色扮演的表现，推理优化的LLM不适合角色扮演，推理能力破坏了角色扮演的缩放定律，大型模型仍然缺乏精通高级角色扮演和中国角色扮演绩效绩效表现的能力超过英语角色扮演的表现。此外，基于广泛的实验结果，我们提出了两个有希望的未来研究方向：用于改善角色扮演LLM的角色感知COT和用于角色扮演LLM的强化学习，旨在提高适应性，一致性和扮演LLMS的有效性用于研究和现实世界的应用。

Title: UrduLLaMA 1.0: Dataset Curation, Preprocessing, and Evaluation in Low-Resource Settings

Authors: Layba Fiaz, Munief Hassan Tahir, Sana Shams, Sarmad Hussain
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.16961
Pdf URL: https://arxiv.org/pdf/2502.16961
Copy Paste: [[2502.16961]] UrduLLaMA 1.0: Dataset Curation, Preprocessing, and Evaluation in Low-Resource Settings(https://arxiv.org/abs/2502.16961)
Keywords: language model, llm
Abstract: Multilingual Large Language Models (LLMs) often provide suboptimal performance on low-resource languages like Urdu. This paper introduces UrduLLaMA 1.0, a model derived from the open-source Llama-3.1-8B-Instruct architecture and continually pre-trained on 128 million Urdu tokens, capturing the rich diversity of the language. To enhance instruction-following and translation capabilities, we leverage Low-Rank Adaptation (LoRA) to fine tune the model on 41,000 Urdu instructions and approximately 50,000 English-Urdu translation pairs. Evaluation across three machine translation datasets demonstrates significant performance improvements compared to state-of-the-art (SOTA) models, establishing a new benchmark for Urdu LLMs. These findings underscore the potential of targeted adaptation strategies with limited data and computational resources to address the unique challenges of low-resource languages.
摘要：多语言大型语言模型（LLM）通常在乌尔都语（如乌尔都语）等低资源语言上提供次优性能。本文介绍了乌尔杜拉玛1.0，这是一种源自开源骆驼-3.1-8B教学体系结构的模型，并不断对1.28亿个乌尔都语代币进行预先培训，从而捕获了该语言的丰富多样性。为了增强跟随和翻译功能，我们利用低级适应（LORA）以41,000个乌尔都语说明和大约50,000个英语URDU翻译对微调模型。与最先进（SOTA）模型相比，三个机器翻译数据集的评估显示出显着的性能改进，为乌尔都语LLMS建立了新的基准。这些发现强调了有限的数据和计算资源，以应对低资源语言的独特挑战。

Title: LongSafety: Evaluating Long-Context Safety of Large Language Models

Authors: Yida Lu, Jiale Cheng, Zhexin Zhang, Shiyao Cui, Cunxiang Wang, Xiaotao Gu, Yuxiao Dong, Jie Tang, Hongning Wang, Minlie Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.16971
Pdf URL: https://arxiv.org/pdf/2502.16971
Copy Paste: [[2502.16971]] LongSafety: Evaluating Long-Context Safety of Large Language Models(https://arxiv.org/abs/2502.16971)
Keywords: language model, llm, long context
Abstract: As Large Language Models (LLMs) continue to advance in understanding and generating long sequences, new safety concerns have been introduced through the long context. However, the safety of LLMs in long-context tasks remains under-explored, leaving a significant gap in both evaluation and improvement of their safety. To address this, we introduce LongSafety, the first comprehensive benchmark specifically designed to evaluate LLM safety in open-ended long-context tasks. LongSafety encompasses 7 categories of safety issues and 6 user-oriented long-context tasks, with a total of 1,543 test cases, averaging 5,424 words per context. Our evaluation towards 16 representative LLMs reveals significant safety vulnerabilities, with most models achieving safety rates below 55%. Our findings also indicate that strong safety performance in short-context scenarios does not necessarily correlate with safety in long-context tasks, emphasizing the unique challenges and urgency of improving long-context safety. Moreover, through extensive analysis, we identify challenging safety issues and task types for long-context models. Furthermore, we find that relevant context and extended input sequences can exacerbate safety risks in long-context scenarios, highlighting the critical need for ongoing attention to long-context safety challenges. Our code and data are available at this https URL.
摘要：随着大型语言模型（LLM）在理解和产生长序列方面继续促进，从长篇小说中引入了新的安全问题。但是，LLM在长期文化任务中的安全性仍然不足，在评估和改善其安全性方面存在很大的差距。为了解决这个问题，我们介绍了Longsafety，这是第一个专门设计用于评估开放式长篇小说任务中LLM安全性的综合基准。 Longsafety包括7类安全问题和6个面向用户的长篇小说任务，共有1,543个测试用例，平均每个上下文中的5,424个单词。我们对16位代表性LLM的评估揭示了严重的安全漏洞，大多数模型的安全率低于55％。我们的发现还表明，在短上下文方案中的强大安全性能不一定与长篇小说任务中的安全性相关，强调了改善长期面力安全的独特挑战和紧迫性。此外，通过广泛的分析，我们确定了长篇小说模型的挑战性安全问题和任务类型。此外，我们发现相关的上下文和扩展的输入序列会在长篇小说方案中加剧安全风险，从而强调了对长期以来的安全挑战的持续关注的关键需求。我们的代码和数据可在此HTTPS URL上找到。

Title: Hotter and Colder: A New Approach to Annotating Sentiment, Emotions, and Bias in Icelandic Blog Comments

Authors: Steinunn Rut Friðriksdóttir, Dan Saattrup Nielsen, Hafsteinn Einarsson
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.16987
Pdf URL: https://arxiv.org/pdf/2502.16987
Copy Paste: [[2502.16987]] Hotter and Colder: A New Approach to Annotating Sentiment, Emotions, and Bias in Icelandic Blog Comments(https://arxiv.org/abs/2502.16987)
Keywords: gpt
Abstract: This paper presents Hotter and Colder, a dataset designed to analyze various types of online behavior in Icelandic blog comments. Building on previous work, we used GPT-4o mini to annotate approximately 800,000 comments for 25 tasks, including sentiment analysis, emotion detection, hate speech, and group generalizations. Each comment was automatically labeled on a 5-point Likert scale. In a second annotation stage, comments with high or low probabilities of containing each examined behavior were subjected to manual revision. By leveraging crowdworkers to refine these automatically labeled comments, we ensure the quality and accuracy of our dataset resulting in 12,232 uniquely annotated comments and 19,301 annotations. Hotter and Colder provides an essential resource for advancing research in content moderation and automatically detectiong harmful online behaviors in Icelandic.
摘要：本文介绍了较热和冷的，这是一个旨在在冰岛博客评论中分析各种在线行为的数据集。在先前的工作的基础上，我们使用GPT-4O Mini来注释大约800,000条评论，以完成25个任务，包括情感分析，情感检测，仇恨言论和团体概括。每个注释都以5点李克特量表自动标记。在第二个注释阶段，对包含每个检查行为的高概率或低概率的评论进行了手动修订。通过利用人群工人来完善这些自动标记的评论，我们确保数据集的质量和准确性，从而产生12,232个唯一注释的注释和19,301个注释。较热和寒冷提供了一种重要的资源，可以推进冰岛中的内容节制研究并自动检测有害在线行为。

Title: All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark

Authors: Davide Testa, Giovanni Bonetta, Raffaella Bernardi, Alessandro Bondielli, Alessandro Lenci, Alessio Miaschi, Lucia Passaro, Bernardo Magnini
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.16989
Pdf URL: https://arxiv.org/pdf/2502.16989
Copy Paste: [[2502.16989]] All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark(https://arxiv.org/abs/2502.16989)
Keywords: language model
Abstract: We introduce MAIA (Multimodal AI Assessment), a native-Italian benchmark designed for fine-grained investigation of the reasoning abilities of visual language models on videos. MAIA differs from other available video benchmarks for its design, its reasoning categories, the metric it uses and the language and culture of the videos. It evaluates Vision Language Models (VLMs) on two aligned tasks: a visual statement verification task, and an open-ended visual question-answering task, both on the same set of video-related questions. It considers twelve reasoning categories that aim to disentangle language and vision relations by highlight when one of two alone encodes sufficient information to solve the tasks, when they are both needed and when the full richness of the short video is essential instead of just a part of it. Thanks to its carefully taught design, it evaluates VLMs' consistency and visually grounded natural language comprehension and generation simultaneously through an aggregated metric. Last but not least, the video collection has been carefully selected to reflect the Italian culture and the language data are produced by native-speakers.
摘要：我们介绍了Maia（多模式AI评估），这是一种本地 - 意大利基准测试，旨在细化视频中视觉语言模型的推理能力。 Maia不同于其他可用的视频基准，其设计，推理类别，使用的指标以及视频的语言和文化。它在两个校准任务上评估视觉语言模型（VLM）：视觉语句验证任务和一个开放式的视觉问题解答任务，均在相同的一组与视频相关的问题上。它考虑了十二个推理类别，旨在通过强调两个单独编码足够的信息来求解任务，以及短视频的全部丰富性是必不可少的，而不是仅仅是一部分的一部分时，旨在消除语言和视觉关系。它。由于其精心授课的设计，它通过汇总度量来评估VLMS的一致性和视觉上的自然语言理解和产生。最后但并非最不重要的一点是，视频集已仔细选择，以反映意大利文化，语言数据是由母语扬声器产生的。

Title: Quantifying Logical Consistency in Transformers via Query-Key Alignment

Authors: Eduard Tulchinskii, Anastasia Voznyuk, Laida Kushnareva, Andrei Andriiainen, Irina Piontkovskaya, Evgeny Burnaev, Serguei Barannikov
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.17017
Pdf URL: https://arxiv.org/pdf/2502.17017
Copy Paste: [[2502.17017]] Quantifying Logical Consistency in Transformers via Query-Key Alignment(https://arxiv.org/abs/2502.17017)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large language models (LLMs) have demonstrated impressive performance in various natural language processing tasks, yet their ability to perform multi-step logical reasoning remains an open challenge. Although Chain-of-Thought prompting has improved logical reasoning by enabling models to generate intermediate steps, it lacks mechanisms to assess the coherence of these logical transitions. In this paper, we propose a novel, lightweight evaluation strategy for logical reasoning that uses query-key alignments inside transformer attention heads. By computing a single forward pass and extracting a "QK-score" from carefully chosen heads, our method reveals latent representations that reliably separate valid from invalid inferences, offering a scalable alternative to traditional ablation-based techniques. We also provide an empirical validation on multiple logical reasoning benchmarks, demonstrating improved robustness of our evaluation method against distractors and increased reasoning depth. The experiments were conducted on a diverse set of models, ranging from 1.5B to 70B parameters.
摘要：大型语言模型（LLMS）在各种自然语言处理任务中表现出了令人印象深刻的表现，但是它们执行多步逻辑推理的能力仍然是一个开放的挑战。尽管经过思考链的提示通过使模型产生中间步骤改善了逻辑推理，但它缺乏评估这些逻辑过渡的连贯性的机制。在本文中，我们为逻辑推理提出了一种新颖的轻量评估策略，该策略使用变压器注意力头内的查询键对齐。通过计算单个正向通行证并从精心选择的头部提取“ QK得分”，我们的方法揭示了潜在表示，这些表示与无效的推论可靠地分开，为传统的基于消融的技术提供了可扩展的替代方案。我们还对多种逻辑推理基准进行了经验验证，这表明我们对分散术者的评估方法的鲁棒性和提高了推理深度。实验是在各种模型集中进行的，范围从1.5b到70b参数。

Title: Towards Auto-Regressive Next-Token Prediction: In-Context Learning Emerges from Generalization

Authors: Zixuan Gong, Xiaolin Hu, Huayi Tang, Yong Liu
Subjects: cs.CL, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2502.17024
Pdf URL: https://arxiv.org/pdf/2502.17024
Copy Paste: [[2502.17024]] Towards Auto-Regressive Next-Token Prediction: In-Context Learning Emerges from Generalization(https://arxiv.org/abs/2502.17024)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have demonstrated remarkable in-context learning (ICL) abilities. However, existing theoretical analysis of ICL primarily exhibits two limitations: (a) Limited i.i.d. Setting. Most studies focus on supervised function learning tasks where prompts are constructed with i.i.d. input-label pairs. This i.i.d. assumption diverges significantly from real language learning scenarios where prompt tokens are interdependent. (b) Lack of Emergence Explanation. Most literature answers what ICL does from an implicit optimization perspective but falls short in elucidating how ICL emerges and the impact of pre-training phase on ICL. In our paper, to extend (a), we adopt a more practical paradigm, auto-regressive next-token prediction (AR-NTP), which closely aligns with the actual training of language models. Specifically, within AR-NTP, we emphasize prompt token-dependency, which involves predicting each subsequent token based on the preceding sequence. To address (b), we formalize a systematic pre-training and ICL framework, highlighting the layer-wise structure of sequences and topics, alongside a two-level expectation. In conclusion, we present data-dependent, topic-dependent and optimization-dependent PAC-Bayesian generalization bounds for pre-trained LLMs, investigating that ICL emerges from the generalization of sequences and topics. Our theory is supported by experiments on numerical linear dynamic systems, synthetic GINC and real-world language datasets.
摘要：大型语言模型（LLM）表现出了非凡的内在学习能力（ICL）。但是，现有的ICL理论分析主要表现出两个局限性：（a）限制I.I.D.环境。大多数研究都集中在有监督的功能学习任务上，其中提示与I.I.D构建。输入标签对。这个I.I.D.假设与迅速令牌相互依存的真实语言学习方案显着分歧。（b）缺乏出现的解释。大多数文献从隐性优化的角度回答了ICL所做的事情，但在阐明ICL的出现以及训练阶段对ICL的影响方面缺乏。在我们的论文中，为了扩展（a），我们采用了更实用的范式，自动回归的下一步预测（AR-NTP），这与语言模型的实际培训非常相吻合。具体而言，在AR-NTP中，我们强调了迅速令牌依赖性，这涉及根据上述序列预测每个后续令牌。为了解决（b），我们对系统的预训练和ICL框架进行形式化，突出显示了序列和主题的层结构，以及两级期望。总之，我们提出了对预训练的LLM的数据依赖性，依赖于主题依赖性和优化依赖性的Pac-bayesian泛化界限，并研究了ICL从序列和主题的概括中出现。我们的理论得到了关于数值线性动态系统，合成GINC和现实世界语言数据集的实验的支持。

Title: Understanding the Uncertainty of LLM Explanations: A Perspective Based on Reasoning Topology

Authors: Longchao Da, Xiaoou Liu, Jiaxin Dai, Lu Cheng, Yaqing Wang, Hua Wei
Subjects: cs.CL, cs.AI, cs.SC
Abstract URL: https://arxiv.org/abs/2502.17026
Pdf URL: https://arxiv.org/pdf/2502.17026
Copy Paste: [[2502.17026]] Understanding the Uncertainty of LLM Explanations: A Perspective Based on Reasoning Topology(https://arxiv.org/abs/2502.17026)
Keywords: language model, llm
Abstract: Understanding the uncertainty in large language model (LLM) explanations is important for evaluating their faithfulness and reasoning consistency, and thus provides insights into the reliability of LLM's output regarding a question. In this work, we propose a novel framework that quantifies uncertainty in LLM explanations through a reasoning topology perspective. By designing a structural elicitation strategy, we guide the LLMs to frame the explanations of an answer into a graph topology. This process decomposes the explanations into the knowledge related sub-questions and topology-based reasoning structures, which allows us to quantify uncertainty not only at the semantic level but also from the reasoning path. It further brings convenience to assess knowledge redundancy and provide interpretable insights into the reasoning process. Our method offers a systematic way to interpret the LLM reasoning, analyze limitations, and provide guidance for enhancing robustness and faithfulness. This work pioneers the use of graph-structured uncertainty measurement in LLM explanations and demonstrates the potential of topology-based quantification.
摘要：了解大语言模型（LLM）解释中的不确定性对于评估其忠诚和推理一致性很重要，从而提供了对LLM关于问题的可靠性的见解。在这项工作中，我们提出了一个新颖的框架，该框架通过推理拓扑观点来量化LLM解释中的不确定性。通过设计结构启发策略，我们指导LLMS将答案的解释构成图形拓扑。这个过程将与知识相关的子问题和基于拓扑的推理结构的解释分解了，这使我们不仅可以在语义层面，而且从推理路径上量化不确定性。这进一步评估知识冗余并提供有关推理过程的可解释见解的便利。我们的方法提供了一种系统的方式来解释LLM推理，分析限制并为增强稳健性和忠诚提供指导。这项工作开创了LLM解释中的图形结构不确定性测量的使用，并证明了基于拓扑的定量的潜力。

Title: Language Model Re-rankers are Steered by Lexical Similarities

Authors: Lovisa Hagström, Ercong Nie, Ruben Halifa, Helmut Schmid, Richard Johansson, Alexander Junge
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.17036
Pdf URL: https://arxiv.org/pdf/2502.17036
Copy Paste: [[2502.17036]] Language Model Re-rankers are Steered by Lexical Similarities(https://arxiv.org/abs/2502.17036)
Keywords: language model, retrieval-augmented generation
Abstract: Language model (LM) re-rankers are used to refine retrieval results for retrieval-augmented generation (RAG). They are more expensive than lexical matching methods like BM25 but assumed to better process semantic information. To understand whether LM re-rankers always live up to this assumption, we evaluate 6 different LM re-rankers on the NQ, LitQA2 and DRUID datasets. Our results show that LM re-rankers struggle to outperform a simple BM25 re-ranker on DRUID. Leveraging a novel separation metric based on BM25 scores, we explain and identify re-ranker errors stemming from lexical dissimilarities. We also investigate different methods to improve LM re-ranker performance and find these methods mainly useful for NQ. Taken together, our work identifies and explains weaknesses of LM re-rankers and points to the need for more adversarial and realistic datasets for their evaluation.
摘要：语言模型（LM）重新级别用于优化检索生成（RAG）的检索结果。它们比BM25这样的词汇匹配方法昂贵，但假定可以更好地处理语义信息。为了了解LM重新级别是否始终符合此假设，我们评估了NQ，LITQA2和DRUID数据集上的6种不同的LM重新率。我们的结果表明，LM重新级别努力在Druid上胜过简单的BM25重新级别。利用基于BM25分数的新型分离度量，我们解释并确定了来自词汇差异的重新级别误差。我们还研究了不同的方法来改善LM重新升级性能，并发现这些方法主要对NQ有用。综上所述，我们的工作确定并解释了LM重新级别的弱点，并指出需要对其进行评估的更多对抗性和现实数据集的需求。

Title: PrivaCI-Bench: Evaluating Privacy with Contextual Integrity and Legal Compliance

Authors: Haoran Li, Wenbin Hu, Huihao Jing, Yulin Chen, Qi Hu, Sirui Han, Tianshu Chu, Peizhao Hu, Yangqiu Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.17041
Pdf URL: https://arxiv.org/pdf/2502.17041
Copy Paste: [[2502.17041]] PrivaCI-Bench: Evaluating Privacy with Contextual Integrity and Legal Compliance(https://arxiv.org/abs/2502.17041)
Keywords: language model, llm
Abstract: Recent advancements in generative large language models (LLMs) have enabled wider applicability, accessibility, and flexibility. However, their reliability and trustworthiness are still in doubt, especially for concerns regarding individuals' data privacy. Great efforts have been made on privacy by building various evaluation benchmarks to study LLMs' privacy awareness and robustness from their generated outputs to their hidden representations. Unfortunately, most of these works adopt a narrow formulation of privacy and only investigate personally identifiable information (PII). In this paper, we follow the merit of the Contextual Integrity (CI) theory, which posits that privacy evaluation should not only cover the transmitted attributes but also encompass the whole relevant social context through private information flows. We present PrivaCI-Bench, a comprehensive contextual privacy evaluation benchmark targeted at legal compliance to cover well-annotated privacy and safety regulations, real court cases, privacy policies, and synthetic data built from the official toolkit to study LLMs' privacy and safety compliance. We evaluate the latest LLMs, including the recent reasoner models QwQ-32B and Deepseek R1. Our experimental results suggest that though LLMs can effectively capture key CI parameters inside a given context, they still require further advancements for privacy compliance.
摘要：生成大语言模型（LLMS）的最新进展使得更广泛的适用性，可访问性和灵活性。但是，它们的可靠性和可信度仍然令人怀疑，尤其是对于个人数据隐私的担忧。通过建立各种评估基准来研究LLMS的隐私意识和鲁棒性，从其产生的产出到其隐藏的表示形式，已经为隐私做出了巨大的努力。不幸的是，这些作品中的大多数都采用了狭义的隐私表述，仅研究个人身份信息（PII）。在本文中，我们遵循上下文完整性（CI）理论的优点，该理论认为，隐私评估不仅应涵盖传输的属性，而且还应通过私人信息流涵盖整个相关的社会环境。我们提出了Privaci Bench，这是针对法律合规性的全面背景隐私评估基准，旨在涵盖良好的隐私和安全法规，实际法院案件，隐私政策以及从官方工具包中构建的合成数据，以研究LLMS的隐私和安全合规性。我们评估了最新的LLM，包括最近的推理器模型QWQ-32B和DeepSeek R1。我们的实验结果表明，尽管LLM可以有效地捕获给定上下文中的关键CI参数，但它们仍然需要进一步的依从性才能遵守隐私。

Title: Systematic Weight Evaluation for Pruning Large Language Models: Enhancing Performance and Sustainability

Authors: Ashhadul Islam, Samir Brahim Belhaouari, Amine Bermak
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.17071
Pdf URL: https://arxiv.org/pdf/2502.17071
Copy Paste: [[2502.17071]] Systematic Weight Evaluation for Pruning Large Language Models: Enhancing Performance and Sustainability(https://arxiv.org/abs/2502.17071)
Keywords: language model, gpt, llm, chat
Abstract: The exponential growth of large language models (LLMs) like ChatGPT has revolutionized artificial intelligence, offering unprecedented capabilities in natural language processing. However, the extensive computational resources required for training these models have significant environmental implications, including high carbon emissions, energy consumption, and water usage. This research presents a novel approach to LLM pruning, focusing on the systematic evaluation of individual weight importance throughout the training process. By monitoring parameter evolution over time, we propose a method that effectively reduces model size without compromising performance. Extensive experiments with both a scaled-down LLM and a large multimodal model reveal that moderate pruning enhances efficiency and reduces loss, while excessive pruning drastically deteriorates model performance. These findings highlight the critical need for optimized AI models to ensure sustainable development, balancing technological advancement with environmental responsibility.
摘要：像Chatgpt这样的大型语言模型（LLM）的指数增长彻底改变了人工智能，提供了自然语言处理中前所未有的功能。但是，训练所需的广泛的计算资源具有重大的环境影响，包括高碳排放，能源消耗和用水。这项研究提出了一种新颖的LLM修剪方法，重点是在整个培训过程中对个人体重重要性的系统评估。通过监视参数演变，我们提出了一种有效降低模型大小而不会损害性能的方法。使用缩小的LLM和大型多模式模型进行了广泛的实验表明，中等修剪会提高效率并降低损失，而过度修剪会大大恶化模型性能。这些发现凸显了优化AI模型以确保可持续发展，平衡技术进步与环境责任的关键需求。

Title: Automatically Evaluating the Paper Reviewing Capability of Large Language Models

Authors: Hyungyu Shin, Jingyu Tang, Yoonjoo Lee, Nayoung Kim, Hyunseung Lim, Ji Yong Cho, Hwajung Hong, Moontae Lee, Juho Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.17086
Pdf URL: https://arxiv.org/pdf/2502.17086
Copy Paste: [[2502.17086]] Automatically Evaluating the Paper Reviewing Capability of Large Language Models(https://arxiv.org/abs/2502.17086)
Keywords: language model, llm
Abstract: Peer review is essential for scientific progress, but it faces challenges such as reviewer shortages and growing workloads. Although Large Language Models (LLMs) show potential for providing assistance, research has reported significant limitations in the reviews they generate. While the insights are valuable, conducting the analysis is challenging due to the considerable time and effort required, especially given the rapid pace of LLM developments. To address the challenge, we developed an automatic evaluation pipeline to assess the LLMs' paper review capability by comparing them with expert-generated reviews. By constructing a dataset consisting of 676 OpenReview papers, we examined the agreement between LLMs and experts in their strength and weakness identifications. The results showed that LLMs lack balanced perspectives, significantly overlook novelty assessment when criticizing, and produce poor acceptance decisions. Our automated pipeline enables a scalable evaluation of LLMs' paper review capability over time.
摘要：同行评审对于科学进步至关重要，但面临诸如审阅者短缺和工作量不断增长的挑战。尽管大型语言模型（LLMS）显示出提供援助的潜力，但研究报告了它们产生的评论的重大局限性。虽然洞察力很有价值，但由于需要大量的时间和精力，进行分析既有挑战，尤其是考虑到LLM开发的快速发展。为了应对挑战，我们开发了一条自动评估管道，通过将它们与专家生成的评论进行比较，以评估LLMS的论文审查能力。通过构建一个由676篇OpenReview论文组成的数据集，我们研究了LLMS与专家之间在其实力和弱点身份方面的协议。结果表明，LLM缺乏平衡的观点，在批评时大大忽视了新颖性评估，并做出了不良的接受决定。我们的自动化管道可以随着时间的推移对LLMS的纸质评论能力进行可扩展评估。

Title: WildFrame: Comparing Framing in Humans and LLMs on Naturally Occurring Texts

Authors: Gili Lior, Liron Nacchace, Gabriel Stanovsky
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.17091
Pdf URL: https://arxiv.org/pdf/2502.17091
Copy Paste: [[2502.17091]] WildFrame: Comparing Framing in Humans and LLMs on Naturally Occurring Texts(https://arxiv.org/abs/2502.17091)
Keywords: llm
Abstract: Humans are influenced by how information is presented, a phenomenon known as the framing effect. Previous work has shown that LLMs may also be susceptible to framing but has done so on synthetic data and did not compare to human behavior. We introduce WildFrame, a dataset for evaluating LLM responses to positive and negative framing, in naturally-occurring sentences, and compare humans on the same data. WildFrame consists of 1,000 texts, first selecting real-world statements with clear sentiment, then reframing them in either positive or negative light, and lastly, collecting human sentiment annotations. By evaluating eight state-of-the-art LLMs on WildFrame, we find that all models exhibit framing effects similar to humans ($r\geq0.57$), with both humans and models being more influenced by positive rather than negative reframing. Our findings benefit model developers, who can either harness framing or mitigate its effects, depending on the downstream application.
摘要：人类受到信息的呈现，这种现象称为框架效应。先前的工作表明，LLMS也可能容易受到构架的影响，但在合成数据上也很容易与人类行为相提并论。我们介绍了Wildframe，这是一个用于评估自然句子中正面和负面框架的LLM响应的数据集，并在同一数据上比较人类。野生框架由1,000篇文本组成，首先以清晰的情绪选择了现实世界的陈述，然后以正面或负面的光线进行重新构架，最后收集人类情感注释。通过评估野生框架上的八个最先进的LLM，我们发现所有模型都表现出与人类类似的框架效应（$ r \ geq0.57 $），其中人类和模型都受到正面反射的影响更大。我们的发现使模型开发人员受益于可以利用框架或减轻其效果的模型开发人员，具体取决于下游应用程序。

Title: Mobile-Agent-V: Learning Mobile Device Operation Through Video-Guided Multi-Agent Collaboration

Authors: Junyang Wang, Haiyang Xu, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, Jitao Sang
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2502.17110
Pdf URL: https://arxiv.org/pdf/2502.17110
Copy Paste: [[2502.17110]] Mobile-Agent-V: Learning Mobile Device Operation Through Video-Guided Multi-Agent Collaboration(https://arxiv.org/abs/2502.17110)
Keywords: agent
Abstract: The rapid increase in mobile device usage necessitates improved automation for seamless task management. However, many AI-driven frameworks struggle due to insufficient operational knowledge. Manually written knowledge helps but is labor-intensive and inefficient. To address these challenges, we introduce Mobile-Agent-V, a framework that leverages video guidance to provide rich and cost-effective operational knowledge for mobile automation. Mobile-Agent-V enhances task execution capabilities by leveraging video inputs without requiring specialized sampling or preprocessing. Mobile-Agent-V integrates a sliding window strategy and incorporates a video agent and deep-reflection agent to ensure that actions align with user instructions. Through this innovative approach, users can record task processes with guidance, enabling the system to autonomously learn and execute tasks efficiently. Experimental results show that Mobile-Agent-V achieves a 30% performance improvement compared to existing frameworks.
摘要：移动设备使用的快速增加需要改善无缝任务管理的自动化。但是，由于运营知识不足，许多AI驱动的框架挣扎。手动书面知识有助于劳动密集型和效率低下。为了应对这些挑战，我们引入了移动代理-V，该框架利用视频指南为移动自动化提供丰富且具有成本效益的操作知识。移动代理-V通过利用视频输入而无需专门的采样或预处理来增强任务执行功能。移动代理-V集成了滑动窗口策略，并结合了视频代理和深层反思代理，以确保操作与用户指令保持一致。通过这种创新的方法，用户可以通过指导记录任务过程，从而使系统能够自主学习和执行任务。实验结果表明，与现有框架相比，移动代理-V可以提高30％的性能。

Title: LettuceDetect: A Hallucination Detection Framework for RAG Applications

Authors: Ádám Kovács, Gábor Recski
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.17125
Pdf URL: https://arxiv.org/pdf/2502.17125
Copy Paste: [[2502.17125]] LettuceDetect: A Hallucination Detection Framework for RAG Applications(https://arxiv.org/abs/2502.17125)
Keywords: llm, hallucination, prompt, retrieval augmented generation
Abstract: Retrieval Augmented Generation (RAG) systems remain vulnerable to hallucinated answers despite incorporating external knowledge sources. We present LettuceDetect a framework that addresses two critical limitations in existing hallucination detection methods: (1) the context window constraints of traditional encoder-based methods, and (2) the computational inefficiency of LLM based approaches. Building on ModernBERT's extended context capabilities (up to 8k tokens) and trained on the RAGTruth benchmark dataset, our approach outperforms all previous encoder-based models and most prompt-based models, while being approximately 30 times smaller than the best models. LettuceDetect is a token-classification model that processes context-question-answer triples, allowing for the identification of unsupported claims at the token level. Evaluations on the RAGTruth corpus demonstrate an F1 score of 79.22% for example-level detection, which is a 14.8% improvement over Luna, the previous state-of-the-art encoder-based architecture. Additionally, the system can process 30 to 60 examples per second on a single GPU, making it more practical for real-world RAG applications.
摘要：尽管融合了外部知识来源，但检索增强发电（RAG）系统仍然容易受到幻觉答案的影响。我们提出了一个框架框架，该框架解决了现有幻觉检测方法中的两个关键局限性：（1）基于传统编码器方法的上下文窗口限制，以及（2）基于LLM的方法的计算效率低下。以Modernbert的扩展上下文功能（最高为8K令牌）为基础，并在Ragtruth基准数据集上进行了培训，我们的方法的表现优于所有以前的基于编码器的模型和最及时的型号，而大约比最佳模型小约30倍。 LetTuceStect是一个代币分类模型，该模型处理上下文问题 - 问题 - 答案三元组，可以在令牌级别识别不支持的主张。对Ragtruth语料库的评估表明，F1得分为79.22％，例如级别检测，这比Luna（先前的基于编码器的先前的架构）提高了14.8％。此外，该系统可以在单个GPU上每秒处理30至60个示例，从而使其对于现实世界的破布应用程序更实用。

Title: Thus Spake Long-Context Large Language Model

Authors: Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.17129
Pdf URL: https://arxiv.org/pdf/2502.17129
Copy Paste: [[2502.17129]] Thus Spake Long-Context Large Language Model(https://arxiv.org/abs/2502.17129)
Keywords: language model, llm, long context
Abstract: Long context is an important topic in Natural Language Processing (NLP), running through the development of NLP architectures, and offers immense opportunities for Large Language Models (LLMs) giving LLMs the lifelong learning potential akin to humans. Unfortunately, the pursuit of a long context is accompanied by numerous obstacles. Nevertheless, long context remains a core competitive advantage for LLMs. In the past two years, the context length of LLMs has achieved a breakthrough extension to millions of tokens. Moreover, the research on long-context LLMs has expanded from length extrapolation to a comprehensive focus on architecture, infrastructure, training, and evaluation technologies. Inspired by the symphonic poem, Thus Spake Zarathustra, we draw an analogy between the journey of extending the context of LLM and the attempts of humans to transcend its mortality. In this survey, We will illustrate how LLM struggles between the tremendous need for a longer context and its equal need to accept the fact that it is ultimately finite. To achieve this, we give a global picture of the lifecycle of long-context LLMs from four perspectives: architecture, infrastructure, training, and evaluation, showcasing the full spectrum of long-context technologies. At the end of this survey, we will present 10 unanswered questions currently faced by long-context LLMs. We hope this survey can serve as a systematic introduction to the research on long-context LLMs.
摘要：漫长的上下文是自然语言处理（NLP）的重要主题，贯穿NLP体系结构的开发，并为大型语言模型（LLMS）提供了巨大的机会，从而使LLMS具有类似于人类的终身学习潜力。不幸的是，追求漫长的背景伴随着许多障碍。然而，长篇小说仍然是LLM的核心竞争优势。在过去的两年中，LLM的上下文长度已取得了对数百万个令牌的突破性扩展。此外，对长篇小说LLM的研究已从长度外推到对建筑，基础设施，培训和评估技术的全面关注。受《交响曲》诗的启发，因此，我们在扩展LLM背景的旅程与人类超越其死亡率的尝试之间进行了类比。在这项调查中，我们将说明LLM在更长的环境中的巨大需求与接受最终是有限的事实之间的巨大需求之间的斗争。为了实现这一目标，我们从四个角度（建筑，基础架构，培训和评估）展示了长篇小说技术的完整范围，从而给出了长篇文化LLM的生命周期的全球图片。在这项调查结束时，我们将提出长篇小说LLMS目前面临的10个未解决的问题。我们希望这项调查可以作为长篇文化LLM的研究的系统介绍。

Title: MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation

Authors: María Andrea Cruz Blandón, Jayasimha Talur, Bruno Charron, Dong Liu, Saab Mansour, Marcello Federico
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.17163
Pdf URL: https://arxiv.org/pdf/2502.17163
Copy Paste: [[2502.17163]] MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation(https://arxiv.org/abs/2502.17163)
Keywords: language model, llm, prompt, retrieval augmented generation
Abstract: Automatic evaluation of retrieval augmented generation (RAG) systems relies on fine-grained dimensions like faithfulness and relevance, as judged by expert human annotators. Meta-evaluation benchmarks support the development of automatic evaluators that correlate well with human judgement. However, existing benchmarks predominantly focus on English or use translated data, which fails to capture cultural nuances. A native approach provides a better representation of the end user experience. In this work, we develop a Multilingual End-to-end Meta-Evaluation RAG benchmark (MEMERAG). Our benchmark builds on the popular MIRACL dataset, using native-language questions and generating responses with diverse large language models (LLMs), which are then assessed by expert annotators for faithfulness and relevance. We describe our annotation process and show that it achieves high inter-annotator agreement. We then analyse the performance of the answer-generating LLMs across languages as per the human evaluators. Finally we apply the dataset to our main use-case which is to benchmark multilingual automatic evaluators (LLM-as-a-judge). We show that our benchmark can reliably identify improvements offered by advanced prompting techniques and LLMs. We release our benchmark to support the community developing accurate evaluation methods for multilingual RAG systems.
摘要：根据专家人类注释者的判断，对检索增强产生（RAG）系统的自动评估依赖于忠实和相关性等细粒度。元评估基准支持与人类判断良好相关的自动评估者的发展。但是，现有的基准主要集中在英语或使用翻译的数据上，而这些数据未能捕捉文化细微差别。一种本机方法可以更好地表示最终用户体验。在这项工作中，我们开发了多种语言的端到端元评估抹布基准（memerag）。我们的基准基于流行的Miracl数据集，使用本地语言问题并通过不同的大型语言模型（LLM）产生回答，然后由专家注释者评估，以忠诚和相关性。我们描述了我们的注释过程，并表明它达到了高通道的一致性。然后，根据人类评估者，我们分析了跨语言的答案生成LLM的性能。最后，我们将数据集应用于我们的主要用例中，这是为基准的多语言自动评估器（LLM-AS-A-Gudge）进行基准测试。我们表明，我们的基准可以可靠地确定高级提示技术和LLM提供的改进。我们发布我们的基准，以支持社区开发用于多语言抹布系统的准确评估方法。

Title: JUREX-4E: Juridical Expert-Annotated Four-Element Knowledge Base for Legal Reasoning

Authors: Huanghai Liu, Quzhe Huang, Qingjing Chen, Yiran Hu, Jiayu Ma, Yun Liu, Weixing Shen, Yansong Feng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.17166
Pdf URL: https://arxiv.org/pdf/2502.17166
Copy Paste: [[2502.17166]] JUREX-4E: Juridical Expert-Annotated Four-Element Knowledge Base for Legal Reasoning(https://arxiv.org/abs/2502.17166)
Keywords: language model, llm
Abstract: The Four-Element Theory is a fundamental framework in criminal law, defining the constitution of crime through four dimensions: Subject, Object, Subjective aspect, and Objective aspect. This theory is widely referenced in legal reasoning, and many Large Language Models (LLMs) attempt to incorporate it when handling legal tasks. However, current approaches rely on LLMs' internal knowledge to incorporate this theory, often lacking completeness and representativeness. To address this limitation, we introduce JUREX-4E, an expert-annotated knowledge base covering 155 criminal charges. It is structured through a progressive hierarchical annotation framework that prioritizes legal source validity and employs diverse legal interpretation methods to ensure comprehensiveness and authority. We evaluate JUREX-4E on the Similar Charge Distinction task and apply it to Legal Case Retrieval, demonstrating its effectiveness in improving LLM performance. Experimental results validate the high quality of JUREX-4E and its substantial impact on downstream legal tasks, underscoring its potential for advancing legal AI applications. Code: this https URL
摘要：四元素理论是刑法中的一个基本框架，它通过四个维度来定义犯罪的构成：主题，对象，主观方面和客观方面。该理论在法律推理中被广泛引用，许多大型语言模型（LLM）试图在处理法律任务时将其纳入。但是，当前的方法依靠LLMS的内部知识来融合这一理论，通常缺乏完整性和代表性。为了解决这一限制，我们介绍了Jurex-4E，这是一个涉及155项刑事指控的专家注销的知识库。它是通过渐进式分层注释框架进行结构的，该框架优先考虑法律来源的有效性，并采用各种法律解释方法来确保全面和权威。我们对JUREX-4E评估了类似的指控任务，并将其应用于法律案例检索，证明了其在改善LLM绩效方面的有效性。实验结果证明了JUREX-4E的高质量及其对下游法律任务的重大影响，强调了其推进法律AI应用的潜力。代码：此HTTPS URL

Title: Logic Haystacks: Probing LLMs Long-Context Logical Reasoning (Without Easily Identifiable Unrelated Padding)

Authors: Damien Sileo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.17169
Pdf URL: https://arxiv.org/pdf/2502.17169
Copy Paste: [[2502.17169]] Logic Haystacks: Probing LLMs Long-Context Logical Reasoning (Without Easily Identifiable Unrelated Padding)(https://arxiv.org/abs/2502.17169)
Keywords: language model, gpt, llm, long context
Abstract: Large language models demonstrate promising long context processing capabilities, with recent models touting context windows close to one million tokens. However, the evaluations supporting these claims often involve simple retrieval tasks or synthetic tasks padded with irrelevant text, which the models may easily detect and discard. In this work, we generate lengthy simplified English text with first-order logic representations spanning up to 2048 clauses (around 25k GPT-4 tokens). We formulate an evaluation task with evidence retrieval for contradiction detection. The long, homogeneous text is filled with distractors that are both hard to distinguish from relevant evidences and provably not interfering with them. Our evaluation of evidence retrieval shows that the effective context window is much smaller with realistic distractors, already crumbling at 128 clauses.
摘要：大型语言模型显示出有希望的长上下文处理功能，最近的模型吹捧上下文窗口接近一百万个令牌。但是，支持这些主张的评估通常涉及简单的检索任务或无关文本填充的综合任务，模型可以很容易地检测和丢弃。在这项工作中，我们生成了冗长的简化英语文本，其一阶逻辑表示为2048条（约25k GPT-4令牌）。我们制定了一项评估任务，并通过证据检索矛盾检测。漫长而均匀的文本充满了干扰因素，这些干扰因素都很难与相关证据区分开，而且事实证明不会干扰它们。我们对证据检索的评估表明，有效的上下文窗口在逼真的干扰因素中较小，已经在128个条款上崩溃了。

Title: Cheems: A Practical Guidance for Building and Evaluating Chinese Reward Models from Scratch

Authors: Xueru Wen, Jie Lou, Zichao Li, Yaojie Lu, Xing Yu, Yuqiu Ji, Guohai Xu, Hongyu Lin, Ben He, Xianpei Han, Le Sun, Debing Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.17173
Pdf URL: https://arxiv.org/pdf/2502.17173
Copy Paste: [[2502.17173]] Cheems: A Practical Guidance for Building and Evaluating Chinese Reward Models from Scratch(https://arxiv.org/abs/2502.17173)
Keywords: language model, llm
Abstract: Reward models (RMs) are crucial for aligning large language models (LLMs) with human preferences. However, most RM research is centered on English and relies heavily on synthetic resources, which leads to limited and less reliable datasets and benchmarks for Chinese. To address this gap, we introduce CheemsBench, a fully human-annotated RM evaluation benchmark within Chinese contexts, and CheemsPreference, a large-scale and diverse preference dataset annotated through human-machine collaboration to support Chinese RM training. We systematically evaluate open-source discriminative and generative RMs on CheemsBench and observe significant limitations in their ability to capture human preferences in Chinese scenarios. Additionally, based on CheemsPreference, we construct an RM that achieves state-of-the-art performance on CheemsBench, demonstrating the necessity of human supervision in RM training. Our findings reveal that scaled AI-generated data struggles to fully capture human preferences, emphasizing the importance of high-quality human supervision in RM development.
摘要：奖励模型（RMS）对于将大语言模型（LLM）与人类偏好保持一致至关重要。但是，大多数RM研究都集中在英语上，并在很大程度上依赖合成资源，从而导致中文的数据集和可靠性较差的数据集和基准。为了解决这一差距，我们介绍了Cheemsbench，这是在中文背景下完全被人类注销的RM评估基准，以及Cheemspreference，这是通过人机合作提供的大规模且多样化的偏好数据集，以支持中国RM培训。我们系统地评估了Cheemsbench上的开源歧视和生成性RMS，并观察到其在中国情景中捕获人类偏好的能力的显着局限性。此外，根据Cheemspreference，我们构建了一个在Cheemsbench上实现最先进的表现的RM，这表明了RM培训中人类监督的必要性。我们的发现表明，缩放的AI生成的数据努力为完全捕捉人类的偏好而努力，强调了高质量的人类监督在RM发展中的重要性。

Title: Measuring Data Diversity for Instruction Tuning: A Systematic Analysis and A Reliable Metric

Authors: Yuming Yang, Yang Nan, Junjie Ye, Shihan Dou, Xiao Wang, Shuo Li, Huijie Lv, Tao Gui, Qi Zhang, Xuanjing Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.17184
Pdf URL: https://arxiv.org/pdf/2502.17184
Copy Paste: [[2502.17184]] Measuring Data Diversity for Instruction Tuning: A Systematic Analysis and A Reliable Metric(https://arxiv.org/abs/2502.17184)
Keywords: language model
Abstract: Data diversity is crucial for the instruction tuning of large language models. Existing studies have explored various diversity-aware data selection methods to construct high-quality datasets and enhance model performance. However, the fundamental problem of precisely defining and measuring data diversity remains underexplored, limiting clear guidance for data engineering. To address this, we systematically analyze 11 existing diversity measurement methods by assessing their correlation with model performance through extensive fine-tuning experiments. Our results indicate that a reliable diversity measure should properly account for both inter-sample differences and the information distribution in the sample space. Building on this, we propose NovelSum, a new diversity metric based on sample-level "novelty." Experiments on both simulated and real-world data show that NovelSum accurately captures diversity variations and achieves a 0.97 correlation with instruction-tuned model performance, highlighting its value in guiding data engineering practices. With NovelSum as an optimization objective, we further develop a greedy, diversity-oriented data selection strategy that outperforms existing approaches, validating both the effectiveness and practical significance of our metric.
摘要：数据多样性对于大型语言模型的指导调整至关重要。现有研究探索了各种多样性感知的数据选择方法，以构建高质量数据集并增强模型性能。但是，精确定义和衡量数据多样性的基本问题仍然没有被忽视，这限制了对数据工程的明确指导。为了解决这个问题，我们通过广泛的微调实验来评估它们与模型性能的相关性，从而系统地分析了11种现有多样性测量方法。我们的结果表明，可靠的多样性度量应适当说明样本间差异和样本空间中的信息分布。在此基础上，我们提出了小说，这是一种基于样本级“新颖性”的新多样性指标。对模拟和现实世界数据的实验表明，小说准确地捕获了多样性变化，并与指令调节的模型性能达到了0.97的相关性，从而强调了其在指导数据工程实践中的价值。以小说为优化目标，我们进一步制定了一种面向贪婪的，以多样性为导向的数据选择策略，该策略胜过现有方法，从而验证了我们的指标的有效性和实际意义。

Title: Evaluating Expert Contributions in a MoE LLM for Quiz-Based Tasks

Authors: Andrei Chernov
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.17187
Pdf URL: https://arxiv.org/pdf/2502.17187
Copy Paste: [[2502.17187]] Evaluating Expert Contributions in a MoE LLM for Quiz-Based Tasks(https://arxiv.org/abs/2502.17187)
Keywords: language model, llm
Abstract: Recently, Large Language Models (LLMs) with Mixture of Experts (MoE) layers have gained significant attention. Currently, state-of-the-art LLMs utilize this architecture. There is a substantial amount of research on how to train such models and how to select hyperparameters for this architecture. However, there is a lack of studies focusing on post-evaluation analysis of MoE layer properties. In this paper, we take a first step toward closing this gap by evaluating expert contributions on the quiz-based MMLU benchmark. We show that most experts were never activated during inference on this benchmark. Additionally, the output distribution of gating networks is much closer to uniform than sparse. Finally, we demonstrate that the average performance of some experts within the same layer varies significantly.
摘要：最近，具有专家（MOE）层混合的大型语言模型（LLM）引起了极大的关注。当前，最先进的LLMS利用此体系结构。关于如何培训此类模型以及如何为该体系结构选择超参数的研究有大量研究。但是，缺乏研究MOE层特性的评估后分析的研究。在本文中，我们通过评估基于测验的MMLU基准的专家贡献来解决这一差距的第一步。我们表明，在推断此基准测试期间，大多数专家从未被激活。另外，门控网络的输出分布比稀疏更接近均匀。最后，我们证明同一层中某些专家的平均表现差异很大。

Title: Order Matters: Investigate the Position Bias in Multi-constraint Instruction Following

Authors: Jie Zeng, Qianyu He, Qingyu Ren, Jiaqing Liang, Yanghua Xiao, Weikang Zhou, Zeye Sun, Fei Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.17204
Pdf URL: https://arxiv.org/pdf/2502.17204
Copy Paste: [[2502.17204]] Order Matters: Investigate the Position Bias in Multi-constraint Instruction Following(https://arxiv.org/abs/2502.17204)
Keywords: language model, llm
Abstract: Real-world instructions with multiple constraints pose a significant challenge to existing large language models (LLMs). An observation is that the LLMs exhibit dramatic performance fluctuation when disturbing the order of the incorporated constraints. Yet, none of the existing works has systematically investigated this position bias problem in the field of multi-constraint instruction following. To bridge this gap, we design a probing task where we quantitatively measure the difficulty distribution of the constraints by a novel Difficulty Distribution Index (CDDI). Through the experimental results, we find that LLMs are more performant when presented with the constraints in a ``hard-to-easy'' order. This preference can be generalized to LLMs with different architecture or different sizes of parameters. Additionally, we conduct an explanation study, providing an intuitive insight into the correlation between the LLM's attention and constraint orders. Our code and dataset are publicly available at this https URL.
摘要：具有多个约束的现实世界说明对现有大型语言模型（LLMS）构成了重大挑战。一个观察结果是，在干扰合并约束的顺序时，LLMS表现出巨大的性能波动。但是，现有的作品都没有系统地研究了以下多构造指令领域的这个位置偏见问题。为了弥合这一差距，我们设计了一个探测任务，我们通过新的难度分布指数（CDDI）定量测量约束的难度分布。通过实验结果，我们发现LLM在``难以满意''顺序的约束时更具性能。该偏好可以推广到具有不同体系结构或不同大小参数的LLM。此外，我们进行了一项解释研究，为LLM的注意力和约束顺序之间的相关性提供了直观的见解。我们的代码和数据集可在此HTTPS URL上公开获得。

Title: CoT-UQ: Improving Response-wise Uncertainty Quantification in LLMs with Chain-of-Thought

Authors: Boxuan Zhang, Ruqi Zhang
Subjects: cs.CL, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2502.17214
Pdf URL: https://arxiv.org/pdf/2502.17214
Copy Paste: [[2502.17214]] CoT-UQ: Improving Response-wise Uncertainty Quantification in LLMs with Chain-of-Thought(https://arxiv.org/abs/2502.17214)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large language models (LLMs) excel in many tasks but struggle to accurately quantify uncertainty in their generated responses. This limitation makes it challenging to detect misinformation and ensure reliable decision-making. Existing uncertainty quantification (UQ) methods for LLMs are primarily prompt-wise rather than response-wise, often requiring multiple response samples, which incurs high computational costs. Moreover, LLMs have been shown to be overconfident, particularly when using reasoning steps to derive their answers. In this work, we propose CoT-UQ, a response-wise UQ framework that integrates LLMs' inherent reasoning capabilities through Chain-of-Thought (CoT) into the UQ process. CoT-UQ captures critical information during inference by extracting keywords from each reasoning step and assessing their importance to the final answer. This key reasoning information is then aggregated to produce a final uncertainty estimate. We conduct extensive experiments based on LLaMA Family with model sizes varying from 8B to 13B across logical and mathematical reasoning tasks. Experimental results demonstrate that CoT-UQ significantly outperforms existing UQ methods, achieving an average improvement of 5.9% AUROC compared to current UQ methods. The code is available at: this https URL.
摘要：大型语言模型（LLM）在许多任务中都表现出色，但努力在其产生的响应中准确量化不确定性。这种限制使检测错误信息并确保可靠的决策具有挑战性。 LLM的现有不确定性量化（UQ）方法主要是迅速而不是响应，通常需要多个响应样本，这会造成高计算成本。此外，LLM已被证明过于自信，尤其是在使用推理步骤得出答案时。在这项工作中，我们提出了COT-UQ，COT-UQ是一个响应的UQ框架，该框架将LLMS固有的推理能力通过思考链（COT）整合到UQ过程中。 COT-UQ通过从每个推理步骤中提取关键字并评估其对最终答案的重要性，从而捕获关键信息。然后将此关键的推理信息汇总以产生最终的不确定性估计。我们在逻辑和数学推理任务中，基于LLAMA家族的大量实验从8b到13B不等。实验结果表明，与当前的UQ方法相比，COT-UQ显着胜过现有的UQ方法，平均AUROC的平均提高为5.9％。该代码可用：此HTTPS URL。

Title: Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction

Authors: Tianpeng Li, Jun Liu, Tao Zhang, Yuanbo Fang, Da Pan, Mingrui Wang, Zheng Liang, Zehuan Li, Mingan Lin, Guosheng Dong, Jianhua Xu, Haoze Sun, Zenan Zhou, Weipeng Chen
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2502.17239
Pdf URL: https://arxiv.org/pdf/2502.17239
Copy Paste: [[2502.17239]] Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction(https://arxiv.org/abs/2502.17239)
Keywords: language model, llm
Abstract: We introduce Baichuan-Audio, an end-to-end audio large language model that seamlessly integrates audio understanding and generation. It features a text-guided aligned speech generation mechanism, enabling real-time speech interaction with both comprehension and generation capabilities. Baichuan-Audio leverages a pre-trained ASR model, followed by multi-codebook discretization of speech at a frame rate of 12.5 Hz. This multi-codebook setup ensures that speech tokens retain both semantic and acoustic information. To further enhance modeling, an independent audio head is employed to process audio tokens, effectively capturing their unique characteristics. To mitigate the loss of intelligence during pre-training and preserve the original capabilities of the LLM, we propose a two-stage pre-training strategy that maintains language understanding while enhancing audio modeling. Following alignment, the model excels in real-time speech-based conversation and exhibits outstanding question-answering capabilities, demonstrating its versatility and efficiency. The proposed model demonstrates superior performance in real-time spoken dialogue and exhibits strong question-answering abilities. Our code, model and training data are available at this https URL
摘要：我们介绍了Baichuan-Audio，这是一种端到端的音频模型，无缝地集成了音频理解和发电。它具有文本引导的对齐语音生成机制，从而实现了与理解和发电能力的实时语音互动。 Baichuan-Audio利用了预先训练的ASR模型，然后以12.5 Hz的帧速率进行多编码书的语音离散化。该多编码书设置可确保语音令牌保留语义和声学信息。为了进一步增强建模，使用独立的音频头来处理音频令牌，从而有效地捕获了它们的独特特征。为了减轻预训练期间智力的丧失并保留LLM的原始功能，我们提出了一种两阶段的预训练策略，该策略在增强音频建模的同时保持语言理解。结盟后，该模型在基于实时的语音对话中表现出色，并展示了出色的提问功能，并展示了其多功能性和效率。提出的模型表明了实时口语对话中的出色表现，并表现出强大的提问能力。我们的代码，模型和培训数据可在此HTTPS URL上获得

Title: Unveiling Downstream Performance Scaling of LLMs: A Clustering-Based Perspective

Authors: Chengyin Xu, Kaiyuan Chen, Xiao Li, Ke Shen, Chenggang Li
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.17262
Pdf URL: https://arxiv.org/pdf/2502.17262
Copy Paste: [[2502.17262]] Unveiling Downstream Performance Scaling of LLMs: A Clustering-Based Perspective(https://arxiv.org/abs/2502.17262)
Keywords: language model, llm
Abstract: The rapid advancements in computing dramatically increase the scale and cost of training Large Language Models (LLMs). Accurately predicting downstream task performance prior to model training is crucial for efficient resource allocation, yet remains challenging due to two primary constraints: (1) the "emergence phenomenon", wherein downstream performance metrics become meaningful only after extensive training, which limits the ability to use smaller models for prediction; (2) Uneven task difficulty distributions and the absence of consistent scaling laws, resulting in substantial metric variability. Existing performance prediction methods suffer from limited accuracy and reliability, thereby impeding the assessment of potential LLM capabilities. To address these challenges, we propose a Clustering-On-Difficulty (COD) downstream performance prediction framework. COD first constructs a predictable support subset by clustering tasks based on difficulty features, strategically excluding non-emergent and non-scalable clusters. The scores on the selected subset serve as effective intermediate predictors of downstream performance on the full evaluation set. With theoretical support, we derive a mapping function that transforms performance metrics from the predictable subset to the full evaluation set, thereby ensuring accurate extrapolation of LLM downstream performance. The proposed method has been applied to predict performance scaling for a 70B LLM, providing actionable insights for training resource allocation and assisting in monitoring the training process. Notably, COD achieves remarkable predictive accuracy on the 70B LLM by leveraging an ensemble of small models, demonstrating an absolute mean deviation of 1.36% across eight important LLM evaluation benchmarks.
摘要：计算方面的快速进步大大提高了训练大语言模型（LLM）的规模和成本。准确地预测模型培训之前的下游任务性能对于有效的资源分配至关重要，但由于两个主要限制，因此仍然具有挑战性：（1）“出现现象”，其中下游性能指标仅在广泛的培训后才有意义，这限制了能够限制的能力使用较小的模型进行预测；（2）任务难度分布不平衡和缺乏一致的缩放定律，从而实现了实质性的度量可变性。现有的绩效预测方法的准确性和可靠性有限，从而阻碍了潜在LLM功能的评估。为了应对这些挑战，我们提出了一个缺乏的群集（COD）下游性能预测框架。 COD首先通过基于难度特征的聚类任务来构建可预测的支持子集，从策略性地排除了非燃料和不可降低的群集。所选子集上的分数是完整评估集中下游性能的有效中间预测指标。通过理论支持，我们得出了一个映射函数，该函数将性能指标从可预测的子集转换为完整的评估集，从而确保了LLM下游性能的准确推断。提出的方法已应用于预测70B LLM的性能缩放，为培训资源分配提供了可行的见解，并有助于监视培训过程。值得注意的是，COD通过利用小型模型的合奏来实现70B LLM的显着预测精度，这表明在八个重要的LLM评估基准中，绝对平均偏差为1.36％。

Title: MonoTODia: Translating Monologue Requests to Task-Oriented Dialogues

Authors: Sebastian Steindl, Ulrich Schäfer, Bernd Ludwig
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.17268
Pdf URL: https://arxiv.org/pdf/2502.17268
Copy Paste: [[2502.17268]] MonoTODia: Translating Monologue Requests to Task-Oriented Dialogues(https://arxiv.org/abs/2502.17268)
Keywords: language model
Abstract: Data scarcity is one of the main problems when it comes to real-world applications of transformer-based models. This is especially evident for task-oriented dialogue (TOD) systems, which require specialized datasets, that are usually not readily available. This can hinder companies from adding TOD systems to their services. This study therefore investigates a novel approach to sourcing annotated dialogues from existing German monologue material. Focusing on a real-world example, we investigate whether these monologues can be transformed into dialogue formats suitable for training TOD systems. We show the approach with the concrete example of a company specializing in travel bookings via e-mail. We fine-tune state-of-the-art Large Language Models for the task of rewriting e-mails as dialogues and annotating them. To ensure the quality and validity of the generated data, we employ crowd workers to evaluate the dialogues across multiple criteria and to provide gold-standard annotations for the test dataset. We further evaluate the usefulness of the dialogues for training TOD systems. Our evaluation shows that the dialogues and annotations are of high quality and can serve as a valuable starting point for training TOD systems. Finally, we make the annotated dataset publicly available to foster future research.
摘要：在基于变压器的模型的现实应用程序方面，数据稀缺是主要问题之一。这对于需要专门数据集的面向任务的对话（TOD）系统尤其明显，通常不容易获得。这可能会阻碍公司将TOD系统添加到其服务中。因此，这项研究调查了一种新的方法，用于采购现有德国独白材料的注释对话。为了关注现实世界的例子，我们研究了这些独白是否可以转变为适合训练TOD系统的对话格式。我们以一家专门通过电子邮件进行旅行预订的公司的具体示例展示了该方法。我们将最新的大语言模型调整为将电子邮件重写作为对话和注释的任务。为了确保生成数据的质量和有效性，我们雇用人群工人来评估跨多个标准的对话，并为测试数据集提供金标准的注释。我们进一步评估了训练TOD系统对话的有用性。我们的评估表明，对话和注释是高质量的，可以作为训练TOD系统的宝贵起点。最后，我们公开使用注释的数据集来培养未来的研究。

Title: Capability Instruction Tuning: A New Paradigm for Dynamic LLM Routing

Authors: Yi-Kai Zhang, De-Chuan Zhan, Han-Jia Ye
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.17282
Pdf URL: https://arxiv.org/pdf/2502.17282
Copy Paste: [[2502.17282]] Capability Instruction Tuning: A New Paradigm for Dynamic LLM Routing(https://arxiv.org/abs/2502.17282)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have demonstrated human-like instruction-following abilities, particularly those exceeding 100 billion parameters. The combined capability of some smaller, resource-friendly LLMs can address most of the instructions that larger LLMs excel at. In this work, we explore how to route the best-performing LLM for each instruction to achieve better overall performance. We develop a new paradigm, constructing capability instructions with model capability representation, user instruction, and performance inquiry prompts to assess the performance. To learn from capability instructions, we introduce a new end-to-end framework called Model Selection with Aptitude Test (Model-SAT), which generates positive and negative samples based on what different models perform well or struggle with. Model-SAT uses a model capability encoder that extends its model representation to a lightweight LLM. Our experiments show that Model-SAT understands the performance dimensions of candidate models and provides the probabilities of their capability to handle various instructions. Additionally, during deployment, a new model can quickly infer its aptitude test results across 50 tasks, each with 20 shots. Model-SAT performs state-of-the-art model routing without candidate inference and in real-world new model-released scenarios. The code is available at this https URL
摘要：大型语言模型（LLMS）表现出类似人类的指导跟踪能力，尤其是那些超过1000亿参数的能力。一些较小的，资源友好的LLM的组合功能可以解决较大的LLMS Excel的大多数说明。在这项工作中，我们探讨了如何为每种说明的表现最佳的LLM路由表现最好的LLM，以实现更好的整体性能。我们开发了一个新的范式，以模型能力表示，用户指导和性能查询提示来评估性能。要从能力指示中学习，我们引入了一个新的端到端框架，称为模型选择（Model-SAT），该框架根据不同的模型的表现良好或努力生成正面和负面样本。 Model-SAT使用模型能力编码器，该编码器将其模型表示形式扩展到轻型LLM。我们的实验表明，模型-SAT了解候选模型的性能维度，并提供了其处理各种说明能力的概率。此外，在部署期间，一个新模型可以快速推断其在50个任务中的能力测试结果，每个任务都有20张照片。模型-SAT执行没有候选推理的最新模型路由，在现实世界中，新的模型已发行的方案。该代码可在此HTTPS URL上找到

Title: Child vs. machine language learning: Can the logical structure of human language unleash LLMs?

Authors: Uli Sauerland, Celia Matthaei, Felix Salfner
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.17304
Pdf URL: https://arxiv.org/pdf/2502.17304
Copy Paste: [[2502.17304]] Child vs. machine language learning: Can the logical structure of human language unleash LLMs?(https://arxiv.org/abs/2502.17304)
Keywords: llm
Abstract: We argue that human language learning proceeds in a manner that is different in nature from current approaches to training LLMs, predicting a difference in learning biases. We then present evidence from German plural formation by LLMs that confirm our hypothesis that even very powerful implementations produce results that miss aspects of the logic inherent to language that humans have no problem with. We conclude that attention to the different structures of human language and artificial neural networks is likely to be an avenue to improve LLM performance.
摘要：我们认为，人类语言学习的方式与当前的培训LLM的方法不同，预测学习偏见的差异。然后，我们提供了LLM的德国复数形成的证据，这些证据证实了我们的假设，即即使是非常有力的实现，也会产生任何遗漏人类固有的逻辑方面的结果。我们得出的结论是，人们对人类语言和人工神经网络的不同结构的关注可能是改善LLM性能的途径。

Title: `Generalization is hallucination' through the lens of tensor completions

Authors: Liang Ze Wong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.17305
Pdf URL: https://arxiv.org/pdf/2502.17305
Copy Paste: [[2502.17305]] `Generalization is hallucination' through the lens of tensor completions(https://arxiv.org/abs/2502.17305)
Keywords: language model, hallucination
Abstract: In this short position paper, we introduce tensor completions and artifacts and make the case that they are a useful theoretical framework for understanding certain types of hallucinations and generalizations in language models.
摘要：在这份简短的立场论文中，我们介绍了张量的完成和工件，并证明它们是理解某些类型的幻觉和语言模型中的概括的有用理论框架。

Title: HIPPO: Enhancing the Table Understanding Capability of Large Language Models through Hybrid-Modal Preference Optimization

Authors: Zhenghao Liu, Haolan Wang, Xinze Li, Qiushi Xiong, Xiaocui Yang, Yu Gu, Yukun Yan, Qi Shi, Fangfang Li, Ge Yu, Maosong Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.17315
Pdf URL: https://arxiv.org/pdf/2502.17315
Copy Paste: [[2502.17315]] HIPPO: Enhancing the Table Understanding Capability of Large Language Models through Hybrid-Modal Preference Optimization(https://arxiv.org/abs/2502.17315)
Keywords: language model, llm
Abstract: Tabular data contains rich structural semantics and plays a crucial role in organizing and manipulating information. To better capture these structural semantics, this paper introduces the HybrId-modal Preference oPtimizatiOn (HIPPO) model, which represents tables using both text and image, and optimizes MLLMs to effectively learn more comprehensive table information from these multiple modalities. Specifically, HIPPO samples model responses from hybrid-modal table representations and designs a modality-consistent sampling strategy to enhance response diversity and mitigate modality bias during DPO training. Experimental results on table question answering and table fact verification tasks demonstrate the effectiveness of HIPPO, achieving a 4% improvement over various table reasoning models. Further analysis reveals that HIPPO not only enhances reasoning abilities based on unimodal table representations but also facilitates the extraction of crucial and distinct semantics from different modal representations. All data and codes are available at this https URL.
摘要：表格数据包含丰富的结构语义，并在组织和操纵信息中起着至关重要的作用。为了更好地捕获这些结构语义，本文介绍了混合模式偏好优化（HIPPO）模型，该模型使用文本和图像表示表，并优化了MLLM，以有效地从这些多种方式中学习更全面的表格信息。具体而言，河马从混合模式表表示中采样模型响应，并设计了一种模态的采样策略，以增强响应多样性并减轻DPO培训期间的模态偏差。桌子答案和表事实验证任务的实验结果证明了河马的有效性，比各种表推理模型取得了4％的改善。进一步的分析表明，河马不仅基于单峰表表示增强了推理能力，而且还促进了从不同模态表示中提取至关重要和独特的语义。所有数据和代码均可在此HTTPS URL上找到。

Title: Turning Conversations into Workflows: A Framework to Extract and Evaluate Dialog Workflows for Service AI Agents

Authors: Prafulla Kumar Choubey, Xiangyu Peng, Shilpa Bhagavath, Caiming Xiong, Shiva Kumar Pentyala, Chien-Sheng Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.17321
Pdf URL: https://arxiv.org/pdf/2502.17321
Copy Paste: [[2502.17321]] Turning Conversations into Workflows: A Framework to Extract and Evaluate Dialog Workflows for Service AI Agents(https://arxiv.org/abs/2502.17321)
Keywords: prompt, chain-of-thought, agent
Abstract: Automated service agents require well-structured workflows to provide consistent and accurate responses to customer queries. However, these workflows are often undocumented, and their automatic extraction from conversations remains unexplored. In this work, we present a novel framework for extracting and evaluating dialog workflows from historical interactions. Our extraction process consists of two key stages: (1) a retrieval step to select relevant conversations based on key procedural elements, and (2) a structured workflow generation process using a question-answer-based chain-of-thought (QA-CoT) prompting. To comprehensively assess the quality of extracted workflows, we introduce an automated agent and customer bots simulation framework that measures their effectiveness in resolving customer issues. Extensive experiments on the ABCD and SynthABCD datasets demonstrate that our QA-CoT technique improves workflow extraction by 12.16\% in average macro accuracy over the baseline. Moreover, our evaluation method closely aligns with human assessments, providing a reliable and scalable framework for future research.
摘要：自动化服务代理需要结构良好的工作流程，以对客户查询提供一致，准确的响应。但是，这些工作流程通常是无证件的，并且它们从对话中自动提取仍然没有探索。在这项工作中，我们提出了一个新颖的框架，用于从历史互动中提取和评估对话工作流程。我们的提取过程包括两个关键阶段：（1）基于关键过程元素选择相关对话的检索步骤，以及（2）使用基于问答的链条的结构化工作流程生成（QA-COT）（QA-COT））提示。为了全面评估提取的工作流的质量，我们介绍了一个自动化代理和客户机器人模拟框架，该框架衡量了它们在解决客户问题方面的有效性。在ABCD和SynthABCD数据集上进行的广泛实验表明，我们的QA-COT技术在基线的平均宏观准确性中提取工作流程的提取量增加了12.16 \％。此外，我们的评估方法与人类评估密切一致，为将来的研究提供了可靠且可扩展的框架。

Title: Mutual Reinforcement of LLM Dialogue Synthesis and Summarization Capabilities for Few-Shot Dialogue Summarization

Authors: Yen-Ju Lu, Ting-Yao Hu, Hema Swetha Koppula, Hadi Pouransari, Jen-Hao Rick Chang, Yin Xia, Xiang Kong, Qi Zhu, Simon Wang, Oncel Tuzel, Raviteja Vemulapalli
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.17328
Pdf URL: https://arxiv.org/pdf/2502.17328
Copy Paste: [[2502.17328]] Mutual Reinforcement of LLM Dialogue Synthesis and Summarization Capabilities for Few-Shot Dialogue Summarization(https://arxiv.org/abs/2502.17328)
Keywords: llm
Abstract: In this work, we propose Mutual Reinforcing Data Synthesis (MRDS) within LLMs to improve few-shot dialogue summarization task. Unlike prior methods that require external knowledge, we mutually reinforce the LLMś dialogue synthesis and summarization capabilities, allowing them to complement each other during training and enhance overall performances. The dialogue synthesis capability is enhanced by directed preference optimization with preference scoring from summarization capability. The summarization capability is enhanced by the additional high quality dialogue-summary paired data produced by the dialogue synthesis capability. By leveraging the proposed MRDS mechanism, we elicit the internal knowledge of LLM in the format of synthetic data, and use it to augment the few-shot real training dataset. Empirical results demonstrate that our method improves dialogue summarization, achieving a 1.5% increase in ROUGE scores and a 0.3% improvement in BERT scores in few-shot settings. Furthermore, our method attains the highest average scores in human evaluations, surpassing both the pre-trained models and the baselines fine-tuned solely for summarization tasks.
摘要：在这项工作中，我们建议在LLMS中相互加强数据合成（MRDS），以改善少量对话摘要任务。与需要外部知识的先前方法不同，我们相互加强了LLMś对话的综合和摘要功能，从而使他们在训练过程中可以相互补充并增强整体性能。对话综合能力通过定向偏好优化和摘要能力的偏好评分来增强。通过对话综合能力产生的其他高质量对话 - 苏省配对数据，可以增强汇总能力。通过利用拟议的MRDS机制，我们以合成数据的形式引起了LLM的内部知识，并使用它来增强少量拍摄的真实培训数据集。经验结果表明，我们的方法改善了对话摘要，在几次射击设置中，胭脂分数提高了1.5％，BERT得分提高了0.3％。此外，我们的方法在人类评估中达到了最高的平均分数，超过了预训练的模型，并且仅用于摘要任务的基本线。

Title: On Relation-Specific Neurons in Large Language Models

Authors: Yihong Liu, Runsheng Chen, Lea Hirlimann, Ahmad Dawar Hakimi, Mingyang Wang, Amir Hossein Kargaran, Sascha Rothe, François Yvon, Hinrich Schütze
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.17355
Pdf URL: https://arxiv.org/pdf/2502.17355
Copy Paste: [[2502.17355]] On Relation-Specific Neurons in Large Language Models(https://arxiv.org/abs/2502.17355)
Keywords: language model, llm
Abstract: In large language models (LLMs), certain neurons can store distinct pieces of knowledge learned during pretraining. While knowledge typically appears as a combination of relations and entities, it remains unclear whether some neurons focus on a relation itself -- independent of any entity. We hypothesize such neurons detect a relation in the input text and guide generation involving such a relation. To investigate this, we study the Llama-2 family on a chosen set of relations with a statistics-based method. Our experiments demonstrate the existence of relation-specific neurons. We measure the effect of selectively deactivating candidate neurons specific to relation $r$ on the LLM's ability to handle (1) facts whose relation is $r$ and (2) facts whose relation is a different relation $r' \neq r$. With respect to their capacity for encoding relation information, we give evidence for the following three properties of relation-specific neurons. $\textbf{(i) Neuron cumulativity.}$ The neurons for $r$ present a cumulative effect so that deactivating a larger portion of them results in the degradation of more facts in $r$. $\textbf{(ii) Neuron versatility.}$ Neurons can be shared across multiple closely related as well as less related relations. Some relation neurons transfer across languages. $\textbf{(iii) Neuron interference.}$ Deactivating neurons specific to one relation can improve LLM generation performance for facts of other relations. We will make our code publicly available at this https URL.
摘要：在大型语言模型（LLMS）中，某些神经元可以存储在预训练期间学习的不同知识。尽管知识通常是关系和实体的结合，但尚不清楚某些神经元是否专注于关系本身 - 独立于任何实体。我们假设此类神经元检测到涉及这种关系的输入文本和指导产生中的关系。为了调查这一点，我们研究了与基于统计的方法的一组关系集的Llama-2家族。我们的实验证明了关系特异性神经元的存在。我们衡量有选择性地停止与关系$ r $特定的候选神经元对LLM处理（1）事实的能力的效果，其关系为$ r $ $ r $，以及（2）关系是不同关系的事实，其关系是不同的关系$ r'\ neq r $。关于它们编码关系信息的能力，我们为关系特异性神经元的以下三种特性提供了证据。 $ \ textbf {（i）神经元累积。} $ $ r $的神经元具有累积效应，因此停用其中的一部分会导致更多的事实降级为$ r $。 $ \ textbf {（ii）神经元的多功能性。} $神经元可以在多个密切相关和相关关系较少的关系中共享。一些关系神经元跨语言转移。 $ \ textbf {（iii）神经元干扰。} $一个特定于一种关系的神经元可以改善其他关系事实的LLM生成性能。我们将在此HTTPS URL上公开代码。

Title: Bridging Gaps in Natural Language Processing for Yorùbá: A Systematic Review of a Decade of Progress and Prospects

Authors: Toheeb A. Jimoh, Tabea De Wille, Nikola S. Nikolov
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.17364
Pdf URL: https://arxiv.org/pdf/2502.17364
Copy Paste: [[2502.17364]] Bridging Gaps in Natural Language Processing for Yorùbá: A Systematic Review of a Decade of Progress and Prospects(https://arxiv.org/abs/2502.17364)
Keywords: language model
Abstract: Natural Language Processing (NLP) is becoming a dominant subset of artificial intelligence as the need to help machines understand human language looks indispensable. Several NLP applications are ubiquitous, partly due to the myriads of datasets being churned out daily through mediums like social networking sites. However, the growing development has not been evident in most African languages due to the persisting resource limitation, among other issues. Yorùbá language, a tonal and morphologically rich African language, suffers a similar fate, resulting in limited NLP usage. To encourage further research towards improving this situation, this systematic literature review aims to comprehensively analyse studies addressing NLP development for Yorùbá, identifying challenges, resources, techniques, and applications. A well-defined search string from a structured protocol was employed to search, select, and analyse 105 primary studies between 2014 and 2024 from reputable databases. The review highlights the scarcity of annotated corpora, limited availability of pre-trained language models, and linguistic challenges like tonal complexity and diacritic dependency as significant obstacles. It also revealed the prominent techniques, including rule-based methods, among others. The findings reveal a growing body of multilingual and monolingual resources, even though the field is constrained by socio-cultural factors such as code-switching and desertion of language for digital usage. This review synthesises existing research, providing a foundation for advancing NLP for Yorùbá and in African languages generally. It aims to guide future research by identifying gaps and opportunities, thereby contributing to the broader inclusion of Yorùbá and other under-resourced African languages in global NLP advancements.
摘要：自然语言处理（NLP）已成为人工智能的主要子集，因为需要帮助机器理解人类语言看起来必不可少。几个NLP应用程序无处不在，部分原因是无数数据集每天通过社交网站等媒体搅动。但是，由于资源限制的持续存在，在大多数非洲语言中的发展尚不明显。 Yorùbá语言是一种音调且形态上丰富的非洲语言，其命运类似，导致NLP使用有限。为了鼓励进一步的研究改善这种情况，这项系统的文献综述旨在全面分析针对约鲁巴的NLP开发的研究，确定挑战，资源，技术和应用。在2014年至2024年之间，使用了一个结构化协议的明确定义的搜索字符串来搜索，选择和分析105个主要研究。该评论强调了带注释的语料库的稀缺性，预训练的语言模型的可用性有限以及语言挑战（如音调复杂性和大声依赖性依赖性）是重大障碍。它还揭示了包括基于规则的方法在内的突出技术。这些发现揭示了越来越多的多语言和单语言资源，即使该领域受到社会文化因素的限制，例如代码转换和数字用法的语言遗弃。这篇综述综合了现有的研究，为Yorùbá和非洲语言促进NLP的基础是基础。它的目的是通过确定差距和机会来指导未来的研究，从而为全球NLP的进步促进更广泛的Yorùbá和其他资源不足的非洲语言。

Title: What is a Good Question? Utility Estimation with LLM-based Simulations

Authors: Dong-Ho Lee, Hyundong Cho, Jonathan May, Jay Pujara
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.17383
Pdf URL: https://arxiv.org/pdf/2502.17383
Copy Paste: [[2502.17383]] What is a Good Question? Utility Estimation with LLM-based Simulations(https://arxiv.org/abs/2502.17383)
Keywords: llm, prompt
Abstract: Asking questions is a fundamental aspect of learning that facilitates deeper understanding. However, characterizing and crafting questions that effectively improve learning remains elusive. To address this gap, we propose QUEST (Question Utility Estimation with Simulated Tests). QUEST simulates a learning environment that enables the quantification of a question's utility based on its direct impact on improving learning outcomes. Furthermore, we can identify high-utility questions and use them to fine-tune question generation models with rejection sampling. We find that questions generated by models trained with rejection sampling based on question utility result in exam scores that are higher by at least 20% than those from specialized prompting grounded on educational objectives literature and models fine-tuned with indirect measures of question quality, such as saliency and expected information gain.
摘要：提出问题是学习的基本方面，可以促进更深入的理解。但是，表征和制作有效改善学习的问题仍然难以捉摸。为了解决这一差距，我们提出了任务（通过模拟测试进行问题效用估算）。 Quest模拟了一个学习环境，该学习环境可以基于其对改善学习成果的直接影响来量化问题的实用程序。此外，我们可以识别出高纯粹的问题，并将其用于用拒绝抽样微调问题生成模型。我们发现，基于问题实用程序训练的模型产生的问题会导致考试成绩至少20％，高于专业促使基于教育目标文献的专业促使，并以问题质量的间接度量进行了微调，例如作为显着性和预期信息的增长。

Title: Mitigating Bias in RAG: Controlling the Embedder

Authors: Taeyoun Kim, Jacob Springer, Aditi Raghunathan, Maarten Sap
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.17390
Pdf URL: https://arxiv.org/pdf/2502.17390
Copy Paste: [[2502.17390]] Mitigating Bias in RAG: Controlling the Embedder(https://arxiv.org/abs/2502.17390)
Keywords: llm, retrieval augmented generation
Abstract: In retrieval augmented generation (RAG) systems, each individual component -- the LLM, embedder, and corpus -- could introduce biases in the form of skews towards outputting certain perspectives or identities. In this work, we study the conflict between biases of each component and their relationship to the overall bias of the RAG system, which we call bias conflict. Examining both gender and political biases as case studies, we show that bias conflict can be characterized through a linear relationship among components despite its complexity in 6 different LLMs. Through comprehensive fine-tuning experiments creating 120 differently biased embedders, we demonstrate how to control bias while maintaining utility and reveal the importance of reverse-biasing the embedder to mitigate bias in the overall system. Additionally, we find that LLMs and tasks exhibit varying sensitivities to the embedder bias, a crucial factor to consider for debiasing. Our results underscore that a fair RAG system can be better achieved by carefully controlling the bias of the embedder rather than increasing its fairness.
摘要：在检索增强生成（RAG）系统中，每个单独的组件 - LLM，嵌入器和语料库都可以以偏斜的形式引入偏见，以输出某些观点或身份。在这项工作中，我们研究了每个组件的偏见与它们与抹布系统的整体偏见之间的冲突，我们称之为偏见冲突。通过研究性别和政治偏见作为案例研究，我们表明，尽管在6种不同的LLM中，偏见冲突可以通过组件之间的线性关系来表征。通过创建120个不同偏见的嵌入者的全面微调实验，我们演示了如何控制偏见的同时保持效用，并揭示了反向偏向嵌入者减轻整体系统中偏置的重要性。此外，我们发现LLM和任务对嵌入偏置表现出不同的敏感性，这是考虑偏见的关键因素。我们的结果表明，通过仔细控制嵌入器的偏见而不是增加公平性，可以更好地实现公平的抹布系统。

Title: Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning

Authors: Guijin Son, Jiwoo Hong, Hyunwoo Ko, James Thorne
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.17407
Pdf URL: https://arxiv.org/pdf/2502.17407
Copy Paste: [[2502.17407]] Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning(https://arxiv.org/abs/2502.17407)
Keywords: llm
Abstract: Scaling pre-training compute has proven effective for achieving mulitlinguality, but does the same hold for test-time scaling? In this work, we introduce MCLM, a multilingual math benchmark featuring competition-level problems in 55 languages. We test three test-time scaling methods-Outcome Reward Modeling (ORM), Process Reward Modeling (ORM), and Budget Forcing (BF)-on both Qwen2.5-1.5B Math and MR1-1.5B, a multilingual LLM we trained for extended reasoning. Our experiments show that using Qwen2.5-1.5B Math with ORM achieves a score of 35.8 on MCLM, while BF on MR1-1.5B attains 35.2. Although "thinking LLMs" have recently garnered significant attention, we find that their performance is comparable to traditional scaling methods like best-of-N once constrained to similar levels of inference FLOPs. Moreover, while BF yields a 20-point improvement on English AIME, it provides only a 1.94-point average gain across other languages-a pattern consistent across the other test-time scaling methods we studied-higlighting that test-time scaling may not generalize as effectively to multilingual tasks. To foster further research, we release MCLM, MR1-1.5B, and evaluation results.
摘要：缩放预训练的计算已被证明有效地实现了误导性，但是对于测试时间缩放而言，相同的能力也有效吗？在这项工作中，我们介绍了MCLM，这是一种多语言数学基准，其中包含55种语言的竞争级问题。我们测试三个测试时间缩放方法结果奖励建模（ORM），过程奖励建模（ORM）以及预算强迫（BF） - 均在QWEN2.5-1.5B数学和MR1-1.5B上，我们训练了多语言LLM用于扩展推理。我们的实验表明，使用QWEN2.5-1.5B数学与ORM的数学在MCLM上达到35.8，而MR1-1.5B上的BF为35.2。尽管“ Thinking LLM”最近引起了极大的关注，但我们发现它们的性能与传统的缩放方法相媲美，例如Best-N曾经被限制在类似的推理拖曳水平。此外，尽管BF对英语AIME产生了20分的提高，但它仅提供其他语言的平均平均增长率 - 我们研究了测试时间缩放可能无法推广的其他测试时间缩放方法一致的模式一致与多语言任务有效。为了促进进一步的研究，我们发布了MCLM，MR1-1.5B和评估结果。

Title: Reasoning with Latent Thoughts: On the Power of Looped Transformers

Authors: Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li, Sanjiv Kumar, Sashank J. Reddi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.17416
Pdf URL: https://arxiv.org/pdf/2502.17416
Copy Paste: [[2502.17416]] Reasoning with Latent Thoughts: On the Power of Looped Transformers(https://arxiv.org/abs/2502.17416)
Keywords: language model, chain-of-thought
Abstract: Large language models have shown remarkable reasoning abilities and scaling laws suggest that large parameter count, especially along the depth axis, is the primary driver. In this work, we make a stronger claim -- many reasoning problems require a large depth but not necessarily many parameters. This unlocks a novel application of looped models for reasoning. Firstly, we show that for many synthetic reasoning problems like addition, $p$-hop induction, and math problems, a $k$-layer transformer looped $L$ times nearly matches the performance of a $kL$-layer non-looped model, and is significantly better than a $k$-layer model. This is further corroborated by theoretical results showing that many such reasoning problems can be solved via iterative algorithms, and thus, can be solved effectively using looped models with nearly optimal depth. Perhaps surprisingly, these benefits also translate to practical settings of language modeling -- on many downstream reasoning tasks, a language model with $k$-layers looped $L$ times can be competitive to, if not better than, a $kL$-layer language model. In fact, our empirical analysis reveals an intriguing phenomenon: looped and non-looped models exhibit scaling behavior that depends on their effective depth, akin to the inference-time scaling of chain-of-thought (CoT) reasoning. We further elucidate the connection to CoT reasoning by proving that looped models implicitly generate latent thoughts and can simulate $T$ steps of CoT with $T$ loops. Inspired by these findings, we also present an interesting dichotomy between reasoning and memorization, and design a looping-based regularization that is effective on both fronts.
摘要：大型语言模型显示出了显着的推理能力和缩放定律表明，大型参数计数，尤其是在深度轴上，是主要驱动力。在这项工作中，我们提出了更强有力的主张 - 许多推理问题需要很大的深度，但不一定是很多参数。这解锁了循环模型进行推理的新颖应用。首先，我们表明，对于许多综合推理问题，诸如添加，$ p $ -HOP归纳和数学问题，$ k $ - 莱默变压器循环$ l $ t $ times几乎与$ kl $ layer的性能相匹配模型，并且比$ k $ layer模型要好得多。理论结果表明，可以通过迭代算法解决许多这样的推理问题，从而进一步证实了这一点，因此可以使用具有几乎最佳深度的循环模型有效地解决。也许令人惊讶的是，这些好处也转化为语言建模的实际设置 - 在许多下游推理任务上，具有$ k $ layers looped $ l $ times的语言模型可以竞争，如果不是比$ kl $更好，层语言模型。实际上，我们的经验分析揭示了一种有趣的现象：循环和非循环模型表现出缩放行为，取决于其有效深度，类似于思考链（COT）推理的推理时间缩放。我们通过证明循环模型隐含地产生潜在思想，并可以使用$ t $ loops模拟$ T $ sptep，进一步阐明了与COT推理的连接。受这些发现的启发，我们还提出了推理和记忆之间有趣的二分法，并设计了一个基于循环的正则化，这对这两个方面都是有效的。

Title: LongSpec: Long-Context Speculative Decoding with Efficient Drafting and Verification

Authors: Penghui Yang, Cunxiao Du, Fengzhuo Zhang, Haonan Wang, Tianyu Pang, Chao Du, Bo An
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.17421
Pdf URL: https://arxiv.org/pdf/2502.17421
Copy Paste: [[2502.17421]] LongSpec: Long-Context Speculative Decoding with Efficient Drafting and Verification(https://arxiv.org/abs/2502.17421)
Keywords: language model, llm
Abstract: Speculative decoding has become a promising technique to mitigate the high inference latency of autoregressive decoding in Large Language Models (LLMs). Despite its promise, the effective application of speculative decoding in LLMs still confronts three key challenges: the increasing memory demands of the draft model, the distribution shift between the short-training corpora and long-context inference, and inefficiencies in attention implementation. In this work, we enhance the performance of speculative decoding in long-context settings by addressing these challenges. First, we propose a memory-efficient draft model with a constant-sized Key-Value (KV) cache. Second, we introduce novel position indices for short-training data, enabling seamless adaptation from short-context training to long-context inference. Finally, we present an innovative attention aggregation method that combines fast implementations for prefix computation with standard attention for tree mask handling, effectively resolving the latency and memory inefficiencies of tree decoding. Our approach achieves strong results on various long-context tasks, including repository-level code completion, long-context summarization, and o1-like long reasoning tasks, demonstrating significant improvements in latency reduction. The code is available at this https URL.
摘要：投机解码已成为一种有希望的技术，可以减轻大语模型（LLMS）中自回归解码的高推断潜伏期。尽管有希望，但在LLM中的投机解码有效地应用仍面临三个关键挑战：模型草案草案的记忆需求不断增加，短期训练语料库和长篇文化推断之间的分配变化以及注意力实施中的效率低下。在这项工作中，我们通过解决这些挑战来提高在长篇文章设置中投机解码的性能。首先，我们提出了一个具有恒定尺寸的键值（KV）缓存的记忆效率的草稿模型。其次，我们介绍了短训练数据的新型位置指数，从而实现了从短上下文训练到长篇小说推断的无缝适应。最后，我们提出了一种创新的注意聚合方法，该方法将前缀计算的快速实现与对树蒙版处理的标准注意相结合，从而有效地解决了树解码的潜伏期和记忆效率低下。我们的方法在各种长篇小说任务上取得了良好的成果，包括存储库级代码的完成，长篇小说摘要和类似O1的长期推理任务，这表明了延迟降低的显着改善。该代码可在此HTTPS URL上找到。