2025-02-12

Title: Survey on Vision-Language-Action Models

Authors: Adilzhan Adilkhanov, Amir Yelenov, Assylkhan Seitzhanov, Ayan Mazhitov, Azamat Abdikarimov, Danissa Sandykbayeva, Daryn Kenzhebek, Daulet Baimukashev, Dinmukhammed Mukashev, Ilyas Umurbekov, Jabrail Chumakov, Kamila Spanova, Karina Burunchina, Rasul Yermagambet, Rustam Chibar, Saltanat Seitzhan, Soibkhon Khajikhanov, Tasbolat Taunyazov, Temirlan Galimzhanov, Temirlan Kaiyrbay, Tleukhan Mussin, Togzhan Syrymova, Valeriya Kostyukova, Yermakhan Kassym, Madina Yergibay, Margulan Issa, Moldir Zabirova, Nurdaulet Zhuzbay, Nurlan Kabdyshev, Nurlan Zhaniyar, Yerkebulan Massalim, Zerde Nurbayeva, Zhanat Kappassov
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2502.06851
Pdf URL: https://arxiv.org/pdf/2502.06851
Copy Paste: [[2502.06851]] Survey on Vision-Language-Action Models(https://arxiv.org/abs/2502.06851)
Keywords: language model, llm
Abstract: This paper presents an AI-generated review of Vision-Language-Action (VLA) models, summarizing key methodologies, findings, and future directions. The content is produced using large language models (LLMs) and is intended only for demonstration purposes. This work does not represent original research, but highlights how AI can help automate literature reviews. As AI-generated content becomes more prevalent, ensuring accuracy, reliability, and proper synthesis remains a challenge. Future research will focus on developing a structured framework for AI-assisted literature reviews, exploring techniques to enhance citation accuracy, source credibility, and contextual understanding. By examining the potential and limitations of LLMs in academic writing, this study aims to contribute to the broader discussion of integrating AI into research workflows. This work serves as a preliminary step toward establishing systematic approaches for leveraging AI in literature review generation, making academic knowledge synthesis more efficient and scalable.
摘要：本文介绍了人工智能生成的视觉-语言-行动 (VLA) 模型综述，总结了主要方法、发现和未来方向。内容是使用大型语言模型 (LLM) 生成的，仅用于演示目的。这项工作并不代表原创研究，但强调了人工智能如何帮助实现文献综述的自动化。随着人工智能生成的内容变得越来越普遍，确保准确性、可靠性和适当的综合仍然是一个挑战。未来的研究将侧重于开发人工智能辅助文献综述的结构化框架，探索提高引用准确性、来源可信度和上下文理解的技术。通过研究 LLM 在学术写作中的潜力和局限性，本研究旨在为将人工智能整合到研究工作流程中的更广泛讨论做出贡献。这项工作是建立利用人工智能生成文献综述的系统方法的初步步骤，使学术知识综合更加高效和可扩展。

Title: Self-Supervised Prompt Optimization

Authors: Jinyu Xiang, Jiayi Zhang, Zhaoyang Yu, Fengwei Teng, Jinhao Tu, Xinbing Liang, Sirui Hong, Chenglin Wu, Yuyu Luo
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.06855
Pdf URL: https://arxiv.org/pdf/2502.06855
Copy Paste: [[2502.06855]] Self-Supervised Prompt Optimization(https://arxiv.org/abs/2502.06855)
Keywords: language model, llm, prompt
Abstract: Well-designed prompts are crucial for enhancing large language models' (LLMs) reasoning capabilities while aligning their outputs with task requirements across diverse domains. However, manually designed prompts require expertise and iterative experimentation. While existing prompt optimization methods aim to automate this process, they rely heavily on external references such as ground truth or human feedback, limiting their applicability in real-world scenarios where such data is unavailable or costly to obtain. To address this, we propose Self-Supervised Prompt Optimization (SPO), a cost-efficient framework that discovers effective prompts for both closed and open-ended tasks without requiring external reference. Motivated by the observations that prompt quality manifests directly in LLM outputs and LLMs can effectively assess adherence to task requirements, we derive evaluation and optimization signals purely from output comparisons. Specifically, SPO selects superior prompts through pairwise output comparisons evaluated by an LLM evaluator, followed by an LLM optimizer that aligns outputs with task requirements. Extensive experiments demonstrate that SPO outperforms state-of-the-art prompt optimization methods, achieving comparable or superior results with significantly lower costs (e.g., 1.1% to 5.6% of existing methods) and fewer samples (e.g., three samples). The code is available at this https URL.
摘要：精心设计的提示对于增强大型语言模型 (LLM) 的推理能力至关重要，同时使其输出与不同领域的任务要求保持一致。但是，手动设计的提示需要专业知识和反复实验。虽然现有的提示优化方法旨在自动化此过程，但它们严重依赖外部参考（例如基本事实或人类），限制了它们在无法获得此类数据或获取成本高昂的现实场景中的适用性。为了解决这个问题，我们提出了自监督提示优化 (SPO)，这是一个经济高效的框架，无需外部参考即可发现封闭式和开放式任务的有效提示。受提示质量直接体现在 LLM 输出中并且 LLM 可以有效评估对任务要求的遵守情况的观察启发，我们纯粹从输出比较中得出评估和优化信号。具体而言，SPO 通过由 LLM 评估器评估的成对输出比较来选择更优的提示，然后由 LLM 优化器将输出与任务要求对齐。大量实验表明，SPO 的表现优于最先进的即时优化方法，以显著更低的成本（例如，现有方法的 1.1% 到 5.6%）和更少的样本（例如，三个样本）实现了相当或更好的结果。代码可在此 https URL 上找到。
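
Illustrative sketch (not the authors' code): the loop described in the abstract, in which an optimizer proposes a new prompt and an evaluator keeps it only if its outputs win a pairwise comparison, might look roughly like the following. The names spo_loop, execute, optimize, and judge are hypothetical, and the three toy functions stand in for the LLM calls that a real system would make.

```python
from typing import Callable, List


def spo_loop(
    initial_prompt: str,
    questions: List[str],
    execute: Callable[[str, str], str],
    optimize: Callable[[str, List[str]], str],
    judge: Callable[[List[str], List[str]], bool],
    rounds: int = 3,
) -> str:
    """Keep the prompt whose outputs win each pairwise comparison."""
    best_prompt = initial_prompt
    best_outputs = [execute(best_prompt, q) for q in questions]
    for _ in range(rounds):
        # Propose a candidate prompt from the current best prompt and its outputs.
        candidate = optimize(best_prompt, best_outputs)
        candidate_outputs = [execute(candidate, q) for q in questions]
        # No ground truth is used: only the two sets of outputs are compared.
        if judge(best_outputs, candidate_outputs):
            best_prompt, best_outputs = candidate, candidate_outputs
    return best_prompt


# Toy stand-ins (assumptions) so the sketch runs without any LLM.
def toy_execute(prompt: str, question: str) -> str:
    return f"{prompt} :: {question}"


def toy_optimize(prompt: str, outputs: List[str]) -> str:
    return prompt + " Be concise."


def toy_judge(old: List[str], new: List[str]) -> bool:
    # Placeholder preference: longer total output wins.
    return sum(len(o) for o in new) > sum(len(o) for o in old)


result = spo_loop("Answer.", ["q1", "q2", "q3"], toy_execute, toy_optimize, toy_judge, rounds=2)
print(result)
```

In this sketch three questions are used only because the abstract mentions that a few samples (e.g., three) can suffice; the acceptance rule in toy_judge is a placeholder and says nothing about how the paper's evaluator decides.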

Title: LLM-Supported Natural Language to Bash Translation

Authors: Finnian Westenfelder, Erik Hemberg, Miguel Tulla, Stephen Moskal, Una-May O'Reilly, Silviu Chiricescu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.06858
Pdf URL: https://arxiv.org/pdf/2502.06858
Copy Paste: [[2502.06858]] LLM-Supported Natural Language to Bash Translation(https://arxiv.org/abs/2502.06858)
Keywords: language model, llm
Abstract: The Bourne-Again Shell (Bash) command-line interface for Linux systems has complex syntax and requires extensive specialized knowledge. Using the natural language to Bash command (NL2SH) translation capabilities of large language models (LLMs) for command composition circumvents these issues. However, the NL2SH performance of LLMs is difficult to assess due to inaccurate test data and unreliable heuristics for determining the functional equivalence of Bash commands. We present a manually verified test dataset of 600 instruction-command pairs and a training dataset of 40,939 pairs, increasing the size of previous datasets by 441% and 135%, respectively. Further, we present a novel functional equivalence heuristic that combines command execution with LLM evaluation of command outputs. Our heuristic can determine the functional equivalence of two Bash commands with 95% confidence, a 16% increase over previous heuristics. Evaluation of popular LLMs using our test dataset and heuristic demonstrates that parsing, in-context learning, in-weight learning, and constrained decoding can improve NL2SH accuracy by up to 32%. Our findings emphasize the importance of dataset quality, execution-based evaluation and translation method for advancing NL2SH translation. Our code is available at this https URL
摘要：Linux 系统的 Bourne-Again Shell (Bash) 命令行界面语法复杂，需要大量专业知识。使用大型语言模型 (LLM) 的自然语言到 Bash 命令 (NL2SH) 翻译功能进行命令组合可以避免这些问题。但是，由于测试数据不准确以及用于确定 Bash 命令功能等效性的启发式方法不可靠，因此很难评估 LLM 的 NL2SH 性能。我们提供了一个手动验证的 600 个指令命令对的测试数据集和一个 40,939 对的训练数据集，分别将以前的数据集的大小增加了 441% 和 135%。此外，我们提出了一种新颖的功能等效启发式方法，将命令执行与命令输出的 LLM 评估相结合。我们的启发式方法可以以 95% 的置信度确定两个 Bash 命令的功能等效性，比以前的启发式方法提高了 16%。使用我们的测试数据集和启发式方法对流行的 LLM 进行评估表明，解析、上下文学习、权重学习和约束解码可以将 NL2SH 准确率提高多达 32%。我们的研究结果强调了数据集质量、基于执行的评估和翻译方法对于推进 NL2SH 翻译的重要性。我们的代码可在此 https URL 上找到

Title: Knowledge Graph-Guided Retrieval Augmented Generation

Authors: Xiangrong Zhu, Yuexiang Xie, Yi Liu, Yaliang Li, Wei Hu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.06864
Pdf URL: https://arxiv.org/pdf/2502.06864
Copy Paste: [[2502.06864]] Knowledge Graph-Guided Retrieval Augmented Generation(https://arxiv.org/abs/2502.06864)
Keywords: language model, llm, hallucination, retrieval augmented generation, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) has emerged as a promising technology for addressing hallucination issues in the responses generated by large language models (LLMs). Existing studies on RAG primarily focus on applying semantic-based approaches to retrieve isolated relevant chunks, which ignore their intrinsic relationships. In this paper, we propose a novel Knowledge Graph-Guided Retrieval Augmented Generation (KG$^2$RAG) framework that utilizes knowledge graphs (KGs) to provide fact-level relationships between chunks, improving the diversity and coherence of the retrieved results. Specifically, after performing a semantic-based retrieval to provide seed chunks, KG$^2$RAG employs a KG-guided chunk expansion process and a KG-based chunk organization process to deliver relevant and important knowledge in well-organized paragraphs. Extensive experiments conducted on the HotpotQA dataset and its variants demonstrate the advantages of KG$^2$RAG compared to existing RAG-based approaches, in terms of both response quality and retrieval quality.
摘要：检索增强生成 (RAG) 已成为一种有前途的技术，可用于解决大型语言模型 (LLM) 生成的响应中的幻觉问题。现有的 RAG 研究主要集中于应用基于语义的方法来检索孤立的相关块，而忽略了它们的内在关系。在本文中，我们提出了一种新颖的知识图谱引导检索增强生成 (KG$^2$RAG) 框架，该框架利用知识图谱 (KG) 提供块之间的事实级关系，从而提高检索结果的多样性和连贯性。具体而言，在执行基于语义的检索以提供种子块后，KG$^2$RAG 采用 KG 引导的块扩展过程和基于 KG 的块组织过程，以组织良好的段落形式提供相关且重要的知识。在 HotpotQA 数据集及其变体上进行的大量实验证明了 KG$^2$RAG 与现有的基于 RAG 的方法相比在响应质量和检索质量方面的优势。

Title: Forbidden Science: Dual-Use AI Challenge Benchmark and Scientific Refusal Tests

Authors: David Noever, Forrest McKee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.06867
Pdf URL: https://arxiv.org/pdf/2502.06867
Copy Paste: [[2502.06867]] Forbidden Science: Dual-Use AI Challenge Benchmark and Scientific Refusal Tests(https://arxiv.org/abs/2502.06867)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: The development of robust safety benchmarks for large language models requires open, reproducible datasets that can measure both appropriate refusal of harmful content and potential over-restriction of legitimate scientific discourse. We present an open-source dataset and testing framework for evaluating LLM safety mechanisms across mainly controlled substance queries, analyzing four major models' responses to systematically varied prompts. Our results reveal distinct safety profiles: Claude-3.5-sonnet demonstrated the most conservative approach with 73% refusals and 27% allowances, while Mistral attempted to answer 100% of queries. GPT-3.5-turbo showed moderate restriction with 10% refusals and 90% allowances, and Grok-2 registered 20% refusals and 80% allowances. Testing prompt variation strategies revealed decreasing response consistency, from 85% with single prompts to 65% with five variations. This publicly available benchmark enables systematic evaluation of the critical balance between necessary safety restrictions and potential over-censorship of legitimate scientific inquiry, while providing a foundation for measuring progress in AI safety implementation. Chain-of-thought analysis reveals potential vulnerabilities in safety mechanisms, highlighting the complexity of implementing robust safeguards without unduly restricting desirable and valid scientific discourse.
摘要：为大型语言模型开发强大的安全基准需要开放、可重复的数据集，这些数据集可以衡量对有害内容的适当拒绝和对合法科学话语的潜在过度限制。我们提供了一个开源数据集和测试框架，用于评估主要受控物质查询的 LLM 安全机制，分析了四种主要模型对系统变化提示的响应。我们的结果揭示了不同的安全性：Claude-3.5-sonnet 展示了最保守的方法，拒绝率为 73%，允许率为 27%，而 Mistral 则试图回答 100% 的查询。GPT-3.5-turbo 表现出中等限制，拒绝率为 10%，允许率为 90%，Grok-2 记录了 20% 的拒绝和 80% 的允许。测试提示变化策略显示响应一致性下降，从单个提示的 85% 下降到五种变化的 65%。这一公开可用的基准能够系统地评估必要的安全限制与合法科学探究的潜在过度审查之间的关键平衡，同时为衡量人工智能安全实施的进展奠定基础。思路链分析揭示了安全机制中的潜在漏洞，凸显了在不过度限制理想和有效的科学论述的情况下实施强有力的保障措施的复杂性。

Title: Related Knowledge Perturbation Matters: Rethinking Multiple Pieces of Knowledge Editing in Same-Subject

Authors: Zenghao Duan, Wenbin Duan, Zhiyi Yin, Yinghan Shen, Shaoling Jing, Jie Zhang, Huawei Shen, Xueqi Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.06868
Pdf URL: https://arxiv.org/pdf/2502.06868
Copy Paste: [[2502.06868]] Related Knowledge Perturbation Matters: Rethinking Multiple Pieces of Knowledge Editing in Same-Subject(https://arxiv.org/abs/2502.06868)
Keywords: language model, llm
Abstract: Knowledge editing has become a promising approach for efficiently and precisely updating knowledge embedded in large language models (LLMs). In this work, we focus on Same-Subject Editing, which involves modifying multiple attributes of a single entity to ensure comprehensive and consistent updates to entity-centric knowledge. Through preliminary observation, we identify a significant challenge: current state-of-the-art editing methods struggle when tasked with editing multiple related knowledge pieces for the same subject. To address the lack of relevant editing data for identical subjects in traditional benchmarks, we introduce the $\text{S}^2\text{RKE}$ (Same-Subject Related Knowledge Editing) benchmark. Our extensive experiments reveal that only mainstream locate-then-edit methods, such as ROME and MEMIT, exhibit "related knowledge perturbation," where subsequent edits interfere with earlier ones. Further analysis reveals that these methods over-rely on subject information, neglecting other critical factors, resulting in reduced editing effectiveness.
摘要：知识编辑已成为一种高效、精确更新大型语言模型 (LLM) 中嵌入知识的有前途的方法。在这项工作中，我们专注于同主题编辑，这涉及修改单个实体的多个属性，以确保对以实体为中心的知识进行全面和一致的更新。通过初步观察，我们发现了一个重大挑战：当前最先进的编辑方法在编辑同一主题的多个相关知识片段时会遇到困难。为了解决传统基准中缺乏相同主题的相关编辑数据的问题，我们引入了 $\text{S}^2\text{RKE}$(同主题相关知识编辑) 基准。我们的大量实验表明，只有主流的定位然后编辑方法（例如 ROME 和 MEMIT）才会出现“相关知识扰动”，即后续编辑会干扰先前的编辑。进一步分析表明，这些方法过度依赖主题信息，忽略了其他关键因素，导致编辑效果降低。

Title: Towards Trustworthy Retrieval Augmented Generation for Large Language Models: A Survey

Authors: Bo Ni, Zheyuan Liu, Leyao Wang, Yongjia Lei, Yuying Zhao, Xueqi Cheng, Qingkai Zeng, Luna Dong, Yinglong Xia, Krishnaram Kenthapadi, Ryan Rossi, Franck Dernoncourt, Md Mehrab Tanjim, Nesreen Ahmed, Xiaorui Liu, Wenqi Fan, Erik Blasch, Yu Wang, Meng Jiang, Tyler Derr
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.06872
Pdf URL: https://arxiv.org/pdf/2502.06872
Copy Paste: [[2502.06872]] Towards Trustworthy Retrieval Augmented Generation for Large Language Models: A Survey(https://arxiv.org/abs/2502.06872)
Keywords: language model, hallucination, retrieval augmented generation, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) is an advanced technique designed to address the challenges of Artificial Intelligence-Generated Content (AIGC). By integrating context retrieval into content generation, RAG provides reliable and up-to-date external knowledge, reduces hallucinations, and ensures relevant context across a wide range of tasks. However, despite RAG's success and potential, recent studies have shown that the RAG paradigm also introduces new risks, including robustness issues, privacy concerns, adversarial attacks, and accountability issues. Addressing these risks is critical for future applications of RAG systems, as they directly impact their trustworthiness. Although various methods have been developed to improve the trustworthiness of RAG methods, there is a lack of a unified perspective and framework for research in this topic. Thus, in this paper, we aim to address this gap by providing a comprehensive roadmap for developing trustworthy RAG systems. We place our discussion around five key perspectives: reliability, privacy, safety, fairness, explainability, and accountability. For each perspective, we present a general framework and taxonomy, offering a structured approach to understanding the current challenges, evaluating existing solutions, and identifying promising future research directions. To encourage broader adoption and innovation, we also highlight the downstream applications where trustworthy RAG systems have a significant impact.
摘要：检索增强生成 (RAG) 是一种先进的技术，旨在应对人工智能生成内容 (AIGC) 的挑战。通过将上下文检索集成到内容生成中，RAG 可提供可靠且最新的外部知识，减少幻觉并确保在各种任务中具有相关的上下文。然而，尽管 RAG 取得了成功并具有潜力，但最近的研究表明，RAG 范式也带来了新的风险，包括稳健性问题、隐私问题、对抗性攻击和问责制问题。解决这些风险对于 RAG 系统的未来应用至关重要，因为它们直接影响其可信度。尽管已经开发了各种方法来提高 RAG 方法的可信度，但缺乏统一的研究视角和框架。因此，在本文中，我们旨在通过提供开发可信 RAG 系统的全面路线图来解决这一差距。我们的讨论围绕五个关键角度展开：可靠性、隐私、安全性、公平性、可解释性和问责制。对于每个观点，我们都提出了一个通用框架和分类法，提供了一种结构化的方法来理解当前的挑战、评估现有的解决方案并确定有希望的未来研究方向。为了鼓励更广泛的采用和创新，我们还重点介绍了可信赖的 RAG 系统具有重大影响的下游应用。

Title: Multimodal Cognitive Reframing Therapy via Multi-hop Psychotherapeutic Reasoning

Authors: Subin Kim, Hoonrae Kim, Heejin Do, Gary Geunbae Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.06873
Pdf URL: https://arxiv.org/pdf/2502.06873
Copy Paste: [[2502.06873]] Multimodal Cognitive Reframing Therapy via Multi-hop Psychotherapeutic Reasoning(https://arxiv.org/abs/2502.06873)
Keywords: language model, gpt, llm, prompt
Abstract: Previous research has revealed the potential of large language models (LLMs) to support cognitive reframing therapy; however, their focus was primarily on text-based methods, often overlooking the importance of non-verbal evidence crucial in real-life therapy. To alleviate this gap, we extend the textual cognitive reframing to multimodality, incorporating visual clues. Specifically, we present a new dataset called Multi Modal-Cognitive Support Conversation (M2CoSC), which pairs each GPT-4-generated dialogue with an image that reflects the virtual client's facial expressions. To better mirror real psychotherapy, where facial expressions lead to interpreting implicit emotional evidence, we propose a multi-hop psychotherapeutic reasoning approach that explicitly identifies and incorporates subtle evidence. Our comprehensive experiments with both LLMs and vision-language models (VLMs) demonstrate that the VLMs' performance as psychotherapists is significantly improved with the M2CoSC dataset. Furthermore, the multi-hop psychotherapeutic reasoning method enables VLMs to provide more thoughtful and empathetic suggestions, outperforming standard prompting methods.
摘要：先前的研究已经揭示了大型语言模型 (LLM) 支持认知重构疗法的潜力；然而，他们的重点主要放在基于文本的方法上，往往忽视了现实生活中治疗中至关重要的非语言证据的重要性。为了弥补这一差距，我们将文本认知重构扩展到多模态，并结合了视觉线索。具体来说，我们提出了一个名为多模态认知支持对话 (M2CoSC) 的新数据集，它将每个 GPT-4 生成的对话与反映虚拟客户面部表情的图像配对。为了更好地反映真实的心理治疗，其中面部表情导致对隐性情感证据的解读，我们提出了一种多跳心理治疗推理方法，明确识别和整合微妙的证据。我们对 LLM 和视觉语言模型 (VLM) 的全面实验表明，使用 M2CoSC 数据集，VLM 作为心理治疗师的表现得到了显着改善。此外，多跳心理治疗推理方法使 VLM 能够提供更周到、更有同理心的建议，优于标准提示方法。

Title: Group Reasoning Emission Estimation Networks

Authors: Yanming Guo, Xiao Qian, Kevin Credit, Jin Ma
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.06874
Pdf URL: https://arxiv.org/pdf/2502.06874
Copy Paste: [[2502.06874]] Group Reasoning Emission Estimation Networks(https://arxiv.org/abs/2502.06874)
Keywords: language model, llm
Abstract: Accurate greenhouse gas (GHG) emission reporting is critical for governments, businesses, and investors. However, adoption remains limited particularly among small and medium enterprises due to high implementation costs, fragmented emission factor databases, and a lack of robust sector classification methods. To address these challenges, we introduce Group Reasoning Emission Estimation Networks (GREEN), an AI-driven carbon accounting framework that standardizes enterprise-level emission estimation, constructs a large-scale benchmark dataset, and leverages a novel reasoning approach with large language models (LLMs). Specifically, we compile textual descriptions for 20,850 companies with validated North American Industry Classification System (NAICS) labels and align these with an economic model of carbon intensity factors. By reframing sector classification as an information retrieval task, we fine-tune Sentence-BERT models using a contrastive learning loss. To overcome the limitations of single-stage models in handling thousands of hierarchical categories, we propose a Group Reasoning method that ensembles LLM classifiers based on the natural NAICS ontology, decomposing the task into multiple sub-classification steps. We theoretically prove that this approach reduces classification uncertainty and computational complexity. Experiments on 1,114 NAICS categories yield state-of-the-art performance (83.68% Top-1, 91.47% Top-10 accuracy), and case studies on 20 companies report a mean absolute percentage error (MAPE) of 45.88%. The project is available at: this https URL.
摘要：准确的温室气体 (GHG) 排放报告对政府、企业和投资者至关重要。然而，由于实施成本高、排放因子数据库分散以及缺乏强大的行业分类方法，采用率仍然有限，尤其是在中小企业中。为了应对这些挑战，我们引入了群体推理排放估算网络 (GREEN)，这是一个由人工智能驱动的碳核算框架，可标准化企业级排放估算，构建大规模基准数据集，并利用大型语言模型 (LLM) 的新型推理方法。具体来说，我们为 20,850 家具有经过验证的北美行业分类系统 (NAICS) 标签的公司编写了文本描述，并将其与碳强度因子的经济模型保持一致。通过将行业分类重新定义为信息检索任务，我们使用对比学习损失对 Sentence-BERT 模型进行微调。为了克服单阶段模型在处理数千个层次类别方面的局限性，我们提出了一种基于自然 NAICS 本体的 LLM 分类器集成的群推理方法，将任务分解为多个子分类步骤。我们从理论上证明了这种方法可以降低分类不确定性和计算复杂性。对 1,114 个 NAICS 类别的实验产生了最先进的性能（83.68% Top-1，91.47% Top-10 准确率），对 20 家公司的案例研究报告平均绝对百分比误差 (MAPE) 为 45.88%。该项目可从以下网址获取：此 https URL。
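
Illustrative sketch: the abstract reports Top-1/Top-10 accuracy and a mean absolute percentage error (MAPE). The standard definitions of these two metrics can be written as below; the function names and the toy inputs are assumptions for illustration and are not taken from the paper or its data.

```python
from typing import List, Sequence


def top_k_accuracy(ranked_predictions: Sequence[Sequence[str]], labels: Sequence[str], k: int) -> float:
    """Fraction of examples whose true label appears among the first k ranked predictions."""
    hits = sum(1 for ranked, label in zip(ranked_predictions, labels) if label in ranked[:k])
    return hits / len(labels)


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent. Assumes no actual value is zero."""
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Toy data (made up): ranked class guesses per company, and emission estimates.
ranked: List[List[str]] = [["A", "B"], ["C", "D"]]
truth = ["B", "C"]
print(top_k_accuracy(ranked, truth, k=1))   # only the second example is a top-1 hit
print(top_k_accuracy(ranked, truth, k=2))   # both true labels are within the top 2
print(mape([100.0, 200.0], [110.0, 150.0]))
```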

Title: Mix Data or Merge Models? Balancing the Helpfulness, Honesty, and Harmlessness of Large Language Model via Model Merging

Authors: Jinluan Yang, Dingnan Jin, Anke Tang, Li Shen, Didi Zhu, Zhengyu Chen, Daixin Wang, Qing Cui, Zhiqiang Zhang, Jun Zhou, Fei Wu, Kun Kuang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.06876
Pdf URL: https://arxiv.org/pdf/2502.06876
Copy Paste: [[2502.06876]] Mix Data or Merge Models? Balancing the Helpfulness, Honesty, and Harmlessness of Large Language Model via Model Merging(https://arxiv.org/abs/2502.06876)
Keywords: language model, llm
Abstract: Achieving balanced alignment of large language models (LLMs) in terms of Helpfulness, Honesty, and Harmlessness (3H optimization) constitutes a cornerstone of responsible AI, with existing methods like data mixture strategies facing limitations including reliance on expert knowledge and conflicting optimization signals. While model merging offers a promising alternative by integrating specialized models, its potential for 3H optimization remains underexplored. This paper establishes the first comprehensive benchmark for model merging in 3H-aligned LLMs, systematically evaluating 15 methods (12 training-free merging and 3 data mixture techniques) across 10 datasets associated with 5 annotation dimensions, 2 LLM families, and 2 training paradigms. Our analysis reveals three pivotal insights: (i) previously overlooked collaborative/conflicting relationships among 3H dimensions, (ii) the consistent superiority of model merging over data mixture approaches in balancing alignment trade-offs, and (iii) the critical role of parameter-level conflict resolution through redundant component pruning and outlier mitigation. Building on these findings, we propose R-TSVM, a Reweighting-enhanced Task Singular Vector Merging method that incorporates outlier-aware parameter weighting and sparsity-adaptive rank selection strategies adapted to the heavy-tailed parameter distribution and sparsity for LLMs, further improving LLM alignment across multiple evaluations. Our models will be available at this https URL.
摘要：实现大型语言模型 (LLM) 在有用性、诚实性和无害性方面的平衡对齐 (3H 优化) 是负责任的 AI 的基石，而现有方法（如数据混合策略）面临着诸多限制，包括依赖专家知识和优化信号冲突。虽然模型合并通过集成专门的模型提供了一种有前途的替代方案，但其在 3H 优化方面的潜力仍未得到充分开发。本文建立了 3H 对齐 LLM 中模型合并的第一个综合基准，系统地评估了 15 种方法（12 种无训练合并和 3 种数据混合技术），涉及 5 个注释维度、2 个 LLM 系列和 2 个训练范式相关的 10 个数据集。我们的分析揭示了三个关键见解：(i) 以前被忽视的 3H 维度之间的协作/冲突关系，(ii) 模型合并在平衡对齐权衡方面始终优于数据混合方法，以及 (iii) 通过冗余组件修剪和异常值缓解在参数级冲突解决方面发挥关键作用。基于这些发现，我们提出了 R-TSVM，这是一种重新加权增强型任务奇异向量合并方法，该方法结合了异常值感知参数加权和稀疏度自适应等级选择策略，以适应 LLM 的重尾参数分布和稀疏度，从而进一步改善了多次评估中的 LLM 对齐。我们的模型将在此 https URL 上提供。

Title: Multi-Agent Simulator Drives Language Models for Legal Intensive Interaction

Authors: Shengbin Yue, Ting Huang, Zheng Jia, Siyuan Wang, Shujun Liu, Yun Song, Xuanjing Huang, Zhongyu Wei
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.06882
Pdf URL: https://arxiv.org/pdf/2502.06882
Copy Paste: [[2502.06882]] Multi-Agent Simulator Drives Language Models for Legal Intensive Interaction(https://arxiv.org/abs/2502.06882)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) have significantly advanced legal intelligence, but the scarcity of scenario data impedes the progress toward interactive legal scenarios. This paper introduces a Multi-agent Legal Simulation Driver (MASER) to scalably generate synthetic data by simulating interactive legal scenarios. Leveraging real legal case sources, MASER ensures the consistency of legal attributes between participants and introduces a supervisory mechanism to align participants' characters and behaviors as well as to address distractions. A Multi-stage Interactive Legal Evaluation (MILE) benchmark is further constructed to evaluate LLMs' performance in dynamic legal scenarios. Extensive experiments confirm the effectiveness of our framework.
摘要：大型语言模型 (LLM) 具有显著的法律智能，但场景数据的稀缺阻碍了其向交互式法律场景迈进。本文介绍了一种多智能体法律模拟驱动程序 (MASER)，通过模拟交互式法律场景来可扩展地生成合成数据。利用真实的法律案例来源，MASER 确保参与者之间的法律属性的一致性，并引入监督机制来协调参与者的性格和行为以及解决干扰问题。进一步构建了多阶段交互式法律评估 (MILE) 基准，以评估 LLM 在动态法律场景中的表现。大量实验证实了我们框架的有效性。

Title: Investigating the Zone of Proximal Development of Language Models for In-Context Learning

Authors: Peng Cui, Mrinmaya Sachan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.06990
Pdf URL: https://arxiv.org/pdf/2502.06990
Copy Paste: [[2502.06990]] Investigating the Zone of Proximal Development of Language Models for In-Context Learning(https://arxiv.org/abs/2502.06990)
Keywords: language model, llm
Abstract: In this paper, we introduce a learning analytics framework to analyze the in-context learning (ICL) behavior of large language models (LLMs) through the lens of the Zone of Proximal Development (ZPD), an established theory in educational psychology. ZPD delineates the space between what a learner is capable of doing unsupported and what the learner cannot do even with support. We adapt this concept to ICL, measuring the ZPD of LLMs based on model performance on individual examples with and without ICL. Furthermore, we propose an item response theory (IRT) model to predict the distribution of zones for LLMs. Our findings reveal a series of intricate and multifaceted behaviors of ICL, providing new insights into understanding and leveraging this technique. Finally, we demonstrate how our framework can enhance LLMs in both inference and fine-tuning scenarios: (1) By predicting a model's zone of proximal development, we selectively apply ICL to queries that are most likely to benefit from demonstrations, achieving a better balance between inference cost and performance; (2) We propose a human-like curriculum for fine-tuning, which prioritizes examples within the model's ZPD. The curriculum results in improved performance, and we explain its effectiveness through an analysis of the training dynamics of LLMs.
摘要：在本文中，我们引入了一个学习分析框架，通过教育心理学中成熟的理论近侧发展区 (ZPD) 的视角来分析大型语言模型 (LLM) 的情境学习 (ICL) 行为。ZPD 描述了学习者在没有支持的情况下能够做到的事情与有支持的情况下无法做到的事情之间的空间。我们将这个概念应用于 ICL，根据模型在有和没有 ICL 的个别示例上的表现来测量 LLM 的 ZPD。此外，我们提出了一个项目反应理论 (IRT) 模型来预测 LLM 的区域分布。我们的研究结果揭示了一系列错综复杂且多方面的 ICL 行为，为理解和利用这项技术提供了新的见解。最后，我们展示了我们的框架如何在推理和微调场景中增强 LLM：(1) 通过预测模型的近侧发展区，我们有选择地将 ICL 应用于最有可能从演示中受益的查询，从而在推理成本和性能之间取得更好的平衡；（2）我们提出了一种类似人类的微调课程，该课程优先考虑模型 ZPD 中的示例。该课程可提高性能，我们通过分析 LLM 的训练动态来解释其有效性。
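
Illustrative sketch: the abstract measures zones from per-example performance with and without ICL. One simple way to picture that partition is shown below; the zone names and the handling of the "correct without ICL but wrong with ICL" case are assumptions made for this sketch, not the paper's definitions.

```python
from collections import Counter
from typing import List, Tuple


def assign_zone(correct_without_icl: bool, correct_with_icl: bool) -> str:
    """Map one example to a zone from its two outcomes."""
    if correct_without_icl:
        # Simplification: solved unsupported, regardless of the ICL outcome.
        return "solved_unsupported"
    if correct_with_icl:
        # Solved only when demonstrations are provided: the ZPD.
        return "zpd"
    return "beyond_reach"


# Toy outcomes (made up): (correct without ICL, correct with ICL) per example.
outcomes: List[Tuple[bool, bool]] = [
    (True, True),
    (False, True),
    (False, False),
    (False, True),
]
zone_counts = Counter(assign_zone(without, with_icl) for without, with_icl in outcomes)
print(dict(zone_counts))
```

Under this reading, selectively applying ICL amounts to spending demonstrations on queries predicted to fall in the "zpd" bucket.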

Title: Demystifying Singular Defects in Large Language Models

Authors: Haoqi Wang, Tong Zhang, Mathieu Salzmann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.07004
Pdf URL: https://arxiv.org/pdf/2502.07004
Copy Paste: [[2502.07004]] Demystifying Singular Defects in Large Language Models(https://arxiv.org/abs/2502.07004)
Keywords: language model, llm
Abstract: Large transformer models are known to produce high-norm tokens. In vision transformers (ViTs), such tokens have been mathematically modeled through the singular vectors of the linear approximations of layers. However, in large language models (LLMs), the underlying causes of high-norm tokens remain largely unexplored, and their different properties from those of ViTs require a new analysis framework. In this paper, we provide both theoretical insights and empirical validation across a range of recent models, leading to the following observations: i) The layer-wise singular direction predicts the abrupt explosion of token norms in LLMs. ii) The negative eigenvalues of a layer explain its sudden decay. iii) The computational pathways leading to high-norm tokens differ between initial and noninitial tokens. iv) High-norm tokens are triggered by the right leading singular vector of the matrix approximating the corresponding modules. We showcase two practical applications of these findings: the improvement of quantization schemes and the design of LLM signatures. Our findings not only advance the understanding of singular defects in LLMs but also open new avenues for their application. We expect that this work will stimulate further research into the internal mechanisms of LLMs and will therefore publicly release our code.
摘要：众所周知，大型 Transformer 模型会产生高范数 token。在视觉 Transformer (ViT) 中，此类 token 已通过层的线性近似的奇异向量进行数学建模。然而，在大型语言模型 (LLM) 中，高范数 token 的根本原因仍未被深入探究，其与 ViT 的不同属性需要新的分析框架。在本文中，我们通过一系列近期模型提供了理论见解和实证验证，得出以下观察结果：i) 逐层奇异方向可预测 LLM 中 token 范数的突然爆发。ii) 层的负特征值可以解释其突然衰减。iii) 初始 token 和非初始 token 之间的计算路径不同。iv) 高范数 token 由近似相应模块的矩阵的右首奇异向量触发。我们展示了这些发现的两个实际应用：改进量化方案和设计 LLM 签名。我们的发现不仅促进了对 LLM 中奇异缺陷的理解，还为其应用开辟了新途径。我们期望这项工作将促进对 LLM 内部机制的进一步研究，因此将公开发布我们的代码。

Title: Finding Words Associated with DIF: Predicting Differential Item Functioning using LLMs and Explainable AI

Authors: Hotaka Maeda, Yikai Lu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.07017
Pdf URL: https://arxiv.org/pdf/2502.07017
Copy Paste: [[2502.07017]] Finding Words Associated with DIF: Predicting Differential Item Functioning using LLMs and Explainable AI(https://arxiv.org/abs/2502.07017)
Keywords: language model, llm
Abstract: We fine-tuned and compared several encoder-based Transformer large language models (LLMs) to predict differential item functioning (DIF) from the item text. We then applied explainable artificial intelligence (XAI) methods to these models to identify specific words associated with DIF. The data included 42,180 items designed for English language arts and mathematics summative state assessments among students in grades 3 to 11. Prediction $R^2$ ranged from .04 to .32 among eight focal and reference group pairs. Our findings suggest that many words associated with DIF reflect minor sub-domains included in the test blueprint by design, rather than construct-irrelevant item content that should be removed from assessments. This may explain why qualitative reviews of DIF items often yield confusing or inconclusive results. Our approach can be used to screen words associated with DIF during the item-writing process for immediate revision, or help review traditional DIF analysis results by highlighting key words in the text. Extensions of this research can enhance the fairness of assessment programs, especially those that lack resources to build high-quality items, and among smaller subpopulations where we do not have sufficient sample sizes for traditional DIF analyses.
摘要：我们对几种基于编码器的 Transformer 大型语言模型 (LLM) 进行了微调和比较，以从项目文本中预测差异项目功能 (DIF)。然后，我们将可解释的人工智能 (XAI) 方法应用于这些模型，以识别与 DIF 相关的特定单词。数据包括 42,180 个针对 3 至 11 年级学生的英语语言艺术和数学总结性状态评估而设计的项目。在八个焦点组和参考组对中，预测 $R^2$ 的范围从 0.04 到 0.32。我们的研究结果表明，许多与 DIF 相关的单词反映了测试蓝图中设计中包含的次要子域，而不是应该从评估中删除的与构造无关的项目内容。这可能解释了为什么 DIF 项目的定性审查通常会产生令人困惑或不确定的结果。我们的方法可用于在项目编写过程中筛选与 DIF 相关的单词以进行立即修改，或者通过突出显示文本中的关键词来帮助审查传统的 DIF 分析结果。这项研究的扩展可以提高评估项目的公平性，特别是那些缺乏资源来构建高质量项目的评估项目，以及在我们没有足够样本量进行传统 DIF 分析的较小亚群中。

Title: AIMS.au: A Dataset for the Analysis of Modern Slavery Countermeasures in Corporate Statements

Authors: Adriana Eufrosiana Bora, Pierre-Luc St-Charles, Mirko Bronzi, Arsène Fansi Tchango, Bruno Rousseau, Kerrie Mengersen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.07022
Pdf URL: https://arxiv.org/pdf/2502.07022
Copy Paste: [[2502.07022]] AIMS.au: A Dataset for the Analysis of Modern Slavery Countermeasures in Corporate Statements(https://arxiv.org/abs/2502.07022)
Keywords: language model, llm
Abstract: Despite over a decade of legislative efforts to address modern slavery in the supply chains of large corporations, the effectiveness of government oversight remains hampered by the challenge of scrutinizing thousands of statements annually. While Large Language Models (LLMs) can be considered a well established solution for the automatic analysis and summarization of documents, recognizing concrete modern slavery countermeasures taken by companies and differentiating those from vague claims remains a challenging task. To help evaluate and fine-tune LLMs for the assessment of corporate statements, we introduce a dataset composed of 5,731 modern slavery statements taken from the Australian Modern Slavery Register and annotated at the sentence level. This paper details the construction steps for the dataset that include the careful design of annotation specifications, the selection and preprocessing of statements, and the creation of high-quality annotation subsets for effective model evaluations. To demonstrate our dataset's utility, we propose a machine learning methodology for the detection of sentences relevant to mandatory reporting requirements set by the Australian Modern Slavery Act. We then follow this methodology to benchmark modern language models under zero-shot and supervised learning settings.
摘要：尽管十多年来，政府一直在努力通过立法解决大型企业供应链中的现代奴隶制问题，但每年审查数千份声明的挑战仍然阻碍了政府监督的有效性。虽然大型语言模型 (LLM) 可以被视为一种成熟的自动分析和总结文档的解决方案，但识别公司采取的具体现代奴隶制对策并将其与模糊的声明区分开来仍然是一项艰巨的任务。为了帮助评估和微调 LLM 以评估公司声明，我们引入了一个数据集，该数据集由 5,731 条现代奴隶制声明组成，这些声明取自澳大利亚现代奴隶制登记册，并在句子级别进行了注释。本文详细介绍了数据集的构建步骤，包括精心设计注释规范、选择和预处理语句以及创建高质量的注释子集以进行有效的模型评估。为了展示我们数据集的实用性，我们提出了一种机器学习方法来检测与澳大利亚现代奴隶制法案规定的强制性报告要求相关的句子。然后，我们按照这种方法在零样本和监督学习设置下对现代语言模型进行基准测试。

Title: Tokenization Standards for Linguistic Integrity: Turkish as a Benchmark

Authors: M. Ali Bayram, Ali Arda Fincan, Ahmet Semih Gümüş, Sercan Karakaş, Banu Diri, Savaş Yıldırım
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.07057
Pdf URL: https://arxiv.org/pdf/2502.07057
Copy Paste: [[2502.07057]] Tokenization Standards for Linguistic Integrity: Turkish as a Benchmark(https://arxiv.org/abs/2502.07057)
Keywords: language model, llm
Abstract: Tokenization is a fundamental preprocessing step in NLP, directly impacting large language models' (LLMs) ability to capture syntactic, morphosyntactic, and semantic structures. This paper introduces a novel framework for systematically evaluating tokenization strategies, addressing challenges in morphologically rich and low-resource languages. Using a Turkish dataset of 6,200 multiple-choice questions from the Massive Multitask Language Understanding (MMLU) benchmark, the framework assesses tokenizers across five key metrics: vocabulary size, token count, processing time, language-specific token percentages (%TR), and token purity (%Pure). These metrics provide a structured approach to evaluating how well tokenizers preserve linguistic structures. While %TR measures the proportion of valid words in the target language, %Pure assesses the alignment of tokens with meaningful linguistic units, such as roots and valid morphemes, minimizing semantic fragmentation. The findings reveal that %TR, introduced as a critical metric, exhibits a stronger correlation with downstream performance (e.g., MMLU scores) than token purity, emphasizing its role in improving model accuracy. Additionally, larger model parameters do not necessarily yield better tokenization quality or enhanced results, highlighting the importance of tailored tokenization strategies that prioritize linguistic alignment. This framework sets a new standard for developing robust tokenization methods optimized for morphologically complex and low-resource languages. Future work will refine morphological analysis, explore domain-specific customizations, and conduct cross-linguistic evaluations to further enhance tokenization practices.
摘要：标记化是 NLP 中的一个基本预处理步骤，直接影响大型语言模型 (LLM) 捕获句法、形态句法和语义结构的能力。本文介绍了一种系统评估标记化策略的新框架，以解决形态丰富且资源匮乏的语言中的挑战。该框架使用来自大规模多任务语言理解 (MMLU) 基准的 6,200 道多项选择题的土耳其语数据集，从五个关键指标评估标记器：词汇量、标记数、处理时间、特定语言的标记百分比 (\%TR) 和标记纯度。这些指标提供了一种结构化的方法来评估标记器如何很好地保留语言结构。\%TR 衡量目标语言中有效单词的比例，而 \%Pure 评估标记与有意义的语言单位（如词根和有效词素）的对齐情况，从而最大限度地减少语义碎片化。研究结果表明，作为关键指标引入的 \%TR 与下游性能（例如 MMLU 分数）的相关性比标记纯度更高，这强调了其在提高模型准确性方面的作用。此外，更大的模型参数不一定会产生更好的标记质量或增强的结果，这凸显了优先考虑语言对齐的定制标记策略的重要性。该框架为开发针对形态复杂和资源匮乏的语言优化的强大标记方法设定了新标准。未来的工作将改进形态分析，探索特定领域的定制，并进行跨语言评估，以进一步增强标记实践。
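As a rough illustration of the two language-specific metrics in this abstract, the sketch below computes %TR as the share of tokens that are valid target-language words and %Pure as the share of tokens that match a root or valid morpheme. The tiny lexicons, the sample tokens, and the function name `token_percentages` are assumptions made for illustration; the paper's actual procedure relies on morphological analysis that is not reproduced here.

```python
def token_percentages(tokens, valid_words, roots_and_morphemes):
    """Return (%TR, %Pure) for a list of tokens.

    valid_words and roots_and_morphemes are hypothetical lookup sets
    standing in for a real lexicon and morphological analyzer.
    """
    if not tokens:
        return 0.0, 0.0
    # Drop common subword boundary markers before the lookup.
    cleaned = [t.lstrip("▁#Ġ").lower() for t in tokens]
    n = len(cleaned)
    # %TR: tokens that are themselves valid words of the target language.
    tr_hits = sum(t in valid_words for t in cleaned)
    # %Pure: tokens that align with a root or a valid morpheme.
    pure_hits = sum(t in roots_and_morphemes for t in cleaned)
    return 100.0 * tr_hits / n, 100.0 * pure_hits / n


# "evlerde" (in the houses) split as ev + ler + de, plus one junk token.
tokens = ["ev", "ler", "de", "xq"]
tr, pure = token_percentages(tokens, {"ev", "de"}, {"ev", "ler", "de"})
print(f"%TR={tr:.1f} %Pure={pure:.1f}")
```

A token such as `ler` counts toward %Pure but not %TR here, which is the distinction the abstract draws between valid words and meaningful sub-word units.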

Title: Using Contextually Aligned Online Reviews to Measure LLMs' Performance Disparities Across Language Varieties

Authors: Zixin Tang, Chieh-Yang Huang, Tsung-Chi Li, Ho Yim Sam Ng, Hen-Hsen Huang, Ting-Hao 'Kenneth' Huang
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2502.07058
Pdf URL: https://arxiv.org/pdf/2502.07058
Copy Paste: [[2502.07058]] Using Contextually Aligned Online Reviews to Measure LLMs' Performance Disparities Across Language Varieties(https://arxiv.org/abs/2502.07058)
Keywords: language model, llm
Abstract: A language can have different varieties. These varieties can affect the performance of natural language processing (NLP) models, including large language models (LLMs), which are often trained on data from widely spoken varieties. This paper introduces a novel and cost-effective approach to benchmark model performance across language varieties. We argue that international online review platforms, such as this http URL, can serve as effective data sources for constructing datasets that capture comments in different language varieties from similar real-world scenarios, like reviews for the same hotel with the same rating using the same language (e.g., Mandarin Chinese) but different language varieties (e.g., Taiwan Mandarin, Mainland Mandarin). To prove this concept, we constructed a contextually aligned dataset comprising reviews in Taiwan Mandarin and Mainland Mandarin and tested six LLMs in a sentiment analysis task. Our results show that LLMs consistently underperform in Taiwan Mandarin.
摘要：一种语言可以有不同的变体。这些变体会影响自然语言处理 (NLP) 模型的性能，包括大型语言模型 (LLM)，这些模型通常使用广泛使用的变体数据进行训练。本文介绍了一种新颖且经济高效的方法来对跨语言变体的模型性能进行基准测试。我们认为，国际在线评论平台（例如此 http URL）可以作为有效的数据源，用于构建数据集，这些数据集可以捕获来自类似现实场景的不同语言变体的评论，例如使用相同语言（例如普通话）但不同语言变体（例如台湾普通话、大陆普通话）对同一家酒店具有相同评级的评论。为了证明这一概念，我们构建了一个上下文对齐的数据集，其中包含台湾普通话和大陆普通话的评论，并在情感分析任务中测试了六个 LLM。我们的结果表明，LLM 在台湾普通话方面的表现始终不佳。

Title: Specializing Large Language Models to Simulate Survey Response Distributions for Global Populations

Authors: Yong Cao, Haijiang Liu, Arnav Arora, Isabelle Augenstein, Paul Röttger, Daniel Hershcovich
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.07068
Pdf URL: https://arxiv.org/pdf/2502.07068
Copy Paste: [[2502.07068]] Specializing Large Language Models to Simulate Survey Response Distributions for Global Populations(https://arxiv.org/abs/2502.07068)
Keywords: language model, llm, prompt
Abstract: Large-scale surveys are essential tools for informing social science research and policy, but running surveys is costly and time-intensive. If we could accurately simulate group-level survey results, this would therefore be very valuable to social science research. Prior work has explored the use of large language models (LLMs) for simulating human behaviors, mostly through prompting. In this paper, we are the first to specialize LLMs for the task of simulating survey response distributions. As a testbed, we use country-level results from two global cultural surveys. We devise a fine-tuning method based on first-token probabilities to minimize divergence between predicted and actual response distributions for a given question. Then, we show that this method substantially outperforms other methods and zero-shot classifiers, even on unseen questions, countries, and a completely unseen survey. While even our best models struggle with the task, especially on unseen questions, our results demonstrate the benefits of specialization for simulation, which may accelerate progress towards sufficiently accurate simulation in the future.
摘要：大规模调查是为社会科学研究和政策提供信息的重要工具，但进行调查成本高昂且耗时。如果我们能够准确地模拟群体层面的调查结果，那么这对社会科学研究将非常有价值。先前的研究已经探索了使用大型语言模型 (LLM) 来模拟人类行为，主要是通过提示。在本文中，我们首次专门针对 LLM 来模拟调查响应分布。作为试验台，我们使用了两项全球文化调查的国家级结果。我们设计了一种基于第一个标记概率的微调方法，以最小化给定问题的预测响应分布和实际响应分布之间的差异。然后，我们表明，即使在看不见的问题、国家和完全看不见的调查中，该方法也大大优于其他方法和零样本分类器。虽然即使是我们最好的模型也难以完成这项任务，尤其是在看不见的问题上，但我们的结果证明了模拟专业化的好处，这可能会加速未来实现足够准确的模拟。
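The objective described in this abstract can be sketched as follows: the logits the model assigns to each answer option's first token are normalized into a predicted distribution, which is then compared with the observed survey distribution. The abstract does not name the divergence, so the use of KL divergence here is an assumption, as are the function names and the example numbers.

```python
import math


def softmax(logits):
    """Normalize first-token logits over the answer options."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def distribution_loss(option_logits, actual_dist):
    """Divergence between the observed survey distribution and the
    distribution implied by the model's first-token logits."""
    predicted = softmax(option_logits)
    return kl_divergence(actual_dist, predicted)


# Hypothetical question with options A/B/C and observed response shares.
loss = distribution_loss([1.2, 0.3, -0.5], [0.5, 0.3, 0.2])
print(f"loss={loss:.4f}")
```

In a real fine-tuning loop this quantity would be computed on tensors so that gradients flow back into the model; the pure-Python version only shows the arithmetic.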

Title: IRepair: An Intent-Aware Approach to Repair Data-Driven Errors in Large Language Models

Authors: Sayem Mohammad Imtiaz, Astha Singh, Fraol Batole, Hridesh Rajan
Subjects: cs.CL, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2502.07072
Pdf URL: https://arxiv.org/pdf/2502.07072
Copy Paste: [[2502.07072]] IRepair: An Intent-Aware Approach to Repair Data-Driven Errors in Large Language Models(https://arxiv.org/abs/2502.07072)
Keywords: language model, gpt, llm
Abstract: Not a day goes by without hearing about the impressive feats of large language models (LLMs), and equally, not a day passes without hearing about their challenges. LLMs are notoriously vulnerable to biases in their dataset, leading to issues such as toxicity. While domain-adaptive training has been employed to mitigate these issues, these techniques often address all model parameters indiscriminately during the repair process, resulting in poor repair quality and reduced model versatility. In this paper, we introduce a novel dynamic slicing-based intent-aware LLM repair strategy, IRepair. This approach selectively targets the most error-prone sections of the model for repair. Specifically, we propose dynamically slicing the model's most sensitive layers that require immediate attention, concentrating repair efforts on those areas. This method enables more effective repairs with potentially less impact on the model's overall performance by altering a smaller portion of the model. We evaluated our technique on three models from the GPT2 and GPT-Neo families, with parameters ranging from 800M to 1.6B, in a toxicity mitigation setup. Our results show that IRepair repairs errors 43.6% more effectively while causing 46% less disruption to general performance compared to the closest baseline, direct preference optimization. Our empirical analysis also reveals that errors are more concentrated in a smaller section of the model, with the top 20% of layers exhibiting 773% more error density than the remaining 80%. This highlights the need for selective repair. Additionally, we demonstrate that a dynamic selection approach is essential for addressing errors dispersed throughout the model, ensuring a robust and efficient repair.
摘要：我们每天都会听到大型语言模型 (LLM) 的惊人成就，同样，我们每天都会听到它们面临的挑战。众所周知，LLM 容易受到数据集偏差的影响，从而导致毒性等问题。虽然已经采用了领域自适应训练来缓解这些问题，但这些技术通常在修复过程中不加区分地处理所有模型参数，导致修复质量差和模型多功能性降低。在本文中，我们介绍了一种新颖的基于动态切片的意图感知 LLM 修复策略 IRepair。这种方法有选择地针对模型中最容易出错的部分进行修复。具体来说，我们建议动态切片模型中最敏感、需要立即关注的层，将修复工作集中在这些区域。这种方法通过改变模型的较小部分，可以实现更有效的修复，同时对模型的整体性能的影响可能更小。我们在毒性缓解设置中，针对 GPT2 和 GPT-Neo 系列的三个模型评估了我们的技术，这些模型的参数范围从 8 亿到 16 亿。我们的结果表明，与最接近的基线直接偏好优化相比，IRepair 修复错误的效率提高了 43.6%，同时对总体性能的干扰减少了 46%。我们的实证分析还表明，错误更集中在模型的较小部分，前 20% 的层的错误密度比其余 80% 高 773%。这凸显了选择性修复的必要性。此外，我们证明了动态选择方法对于解决分散在整个模型中的错误至关重要，从而确保修复的稳健性和效率。

Title: Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models

Authors: Lujain Ibrahim, Canfer Akbulut, Rasmi Elasmar, Charvi Rastogi, Minsuk Kahng, Meredith Ringel Morris, Kevin R. McKee, Verena Rieser, Murray Shanahan, Laura Weidinger
Subjects: cs.CL, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2502.07077
Pdf URL: https://arxiv.org/pdf/2502.07077
Copy Paste: [[2502.07077]] Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models(https://arxiv.org/abs/2502.07077)
Keywords: language model, llm
Abstract: The tendency of users to anthropomorphise large language models (LLMs) is of growing interest to AI developers, researchers, and policy-makers. Here, we present a novel method for empirically evaluating anthropomorphic LLM behaviours in realistic and varied settings. Going beyond single-turn static benchmarks, we contribute three methodological advances in state-of-the-art (SOTA) LLM evaluation. First, we develop a multi-turn evaluation of 14 anthropomorphic behaviours. Second, we present a scalable, automated approach by employing simulations of user interactions. Third, we conduct an interactive, large-scale human subject study (N=1101) to validate that the model behaviours we measure predict real users' anthropomorphic perceptions. We find that all SOTA LLMs evaluated exhibit similar behaviours, characterised by relationship-building (e.g., empathy and validation) and first-person pronoun use, and that the majority of behaviours only first occur after multiple turns. Our work lays an empirical foundation for investigating how design choices influence anthropomorphic model behaviours and for progressing the ethical debate on the desirability of these behaviours. It also showcases the necessity of multi-turn evaluations for complex social phenomena in human-AI interaction.
摘要：用户将大型语言模型 (LLM) 拟人化的倾向越来越受到 AI 开发人员、研究人员和政策制定者的兴趣。在这里，我们提出了一种新方法，用于在现实和多样化的环境中实证评估拟人化 LLM 行为。除了单轮静态基准之外，我们还为最先进 (SOTA) LLM 评估贡献了三项方法上的进步。首先，我们开发了 14 种拟人化行为的多轮评估。其次，我们通过模拟用户交互提出了一种可扩展的自动化方法。第三，我们进行了一项交互式大规模人类受试者研究 (N=1101)，以验证我们测量的模型行为是否能预测真实用户的拟人化感知。我们发现，所有评估的 SOTA LLM 都表现出类似的行为，其特点是建立关系（例如同理心和认可）和使用第一人称代词，并且大多数行为仅在多轮之后才首次发生。我们的工作为研究设计选择如何影响拟人化模型行为以及推动关于这些行为可取性的伦理辩论奠定了实证基础。它还展示了对人机交互中复杂的社会现象进行多轮评估的必要性。

Title: SMAB: MAB based word Sensitivity Estimation Framework and its Applications in Adversarial Text Generation

Authors: Saurabh Kumar Pandey, Sachin Vashistha, Debrup Das, Somak Aditya, Monojit Choudhury
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.07101
Pdf URL: https://arxiv.org/pdf/2502.07101
Copy Paste: [[2502.07101]] SMAB: MAB based word Sensitivity Estimation Framework and its Applications in Adversarial Text Generation(https://arxiv.org/abs/2502.07101)
Keywords: prompt
Abstract: To understand the complexity of sequence classification tasks, Hahn et al. (2021) proposed sensitivity as the number of disjoint subsets of the input sequence that can each be individually changed to change the output. Though effective, calculating sensitivity at scale using this framework is costly because of exponential time complexity. Therefore, we introduce a Sensitivity-based Multi-Armed Bandit framework (SMAB), which provides a scalable approach for calculating word-level local (sentence-level) and global (aggregated) sensitivities concerning an underlying text classifier for any dataset. We establish the effectiveness of our approach through various applications. We perform a case study on a CHECKLIST-generated sentiment analysis dataset, where we show that our algorithm indeed captures intuitively high- and low-sensitivity words. Through experiments on multiple tasks and languages, we show that sensitivity can serve as a proxy for accuracy in the absence of gold data. Lastly, we show that guiding perturbation prompts using sensitivity values in adversarial example generation improves attack success rate by 15.58%, whereas using sensitivity as an additional reward in adversarial paraphrase generation gives a 12.00% improvement over SOTA approaches. Warning: Contains potentially offensive content.
摘要：为了理解序列分类任务的复杂性，Hahn 等人 (2021) 提出敏感度是输入序列中可以单独更改以改变输出的不相交子集的数量。尽管有效，但由于指数时间复杂度，使用此框架大规模计算敏感度的成本很高。因此，我们引入了一个基于敏感度的多臂老虎机框架 (SMAB)，它提供了一种可扩展的方法来计算任何数据集的底层文本分类器的词级局部（句子级）和全局（聚合）敏感度。我们通过各种应用建立了我们方法的有效性。我们对 CHECKLIST 生成的情绪分析数据集进行了案例研究，我们表明我们的算法确实直观地捕捉了高敏感度和低敏感度的单词。通过对多个任务和语言的实验，我们表明在没有黄金数据的情况下，敏感度可以作为准确性的代理。最后，我们表明，在对抗性示例生成中使用敏感度值引导扰动提示可将攻击成功率提高 15.58%，而在对抗性释义生成中使用敏感度作为额外奖励可比 SOTA 方法提高 12.00%。警告：包含潜在的冒犯性内容。

Title: Structural Reformation of Large Language Model Neuron Encapsulation for Divergent Information Aggregation

Authors: Denis Bakushev, Gideon Boultinghouse, Harriet Oppenheimer, Sebastian Gillingwater, Valentina Ashington, Wilfred Stanborough
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.07124
Pdf URL: https://arxiv.org/pdf/2502.07124
Copy Paste: [[2502.07124]] Structural Reformation of Large Language Model Neuron Encapsulation for Divergent Information Aggregation(https://arxiv.org/abs/2502.07124)
Keywords: language model
Abstract: Structured neuron encapsulation introduces a modular framework that enables more effective aggregation and specialization of information within deep learning architectures. A model modified through this framework demonstrated improved perplexity scores, greater lexical variability, and enhanced consistency in logical reasoning, suggesting that structured parameter distribution contributes to more efficient language representation. Statistical analyses of generated text highlighted a wider range of sentence structures and reduced redundancy in token selection, indicating that encapsulation fosters more adaptable language generation. A detailed evaluation of attention weight distributions revealed that the experimental model exhibited greater divergence in cross-layer activations, supporting the hypothesis that encapsulated neurons assume specialized processing roles. Logical consistency assessments further demonstrated that modular architectures mitigate contradictory outputs, reducing internal conflicts in inferred relationships between linguistic constructs. Computational trade-offs were analyzed, with results showing a minor increase in processing overhead, though improvements in parameter efficiency and structured decision-making compensated for the additional complexity. The mathematical formulation of the encapsulation mechanism confirmed that modular aggregation maintains stable convergence properties while promoting distinct functional roles for different neuron clusters.
摘要：结构化神经元封装引入了一个模块化框架，使深度学习架构中的信息更有效地聚合和专业化。通过该框架修改的模型表现出更高的困惑度分数、更大的词汇变化性和增强的逻辑推理一致性，这表明结构化参数分布有助于更有效地表达语言。对生成文本的统计分析突出了更广泛的句子结构和减少了标记选择中的冗余，表明封装促进了更具适应性的语言生成。对注意力权重分布的详细评估表明，实验模型在跨层激活中表现出更大的分歧，支持了封装神经元承担专门处理角色的假设。逻辑一致性评估进一步表明，模块化架构可以缓解矛盾的输出，减少语言结构之间推断关系的内部冲突。分析了计算权衡，结果显示处理开销略有增加，但参数效率和结构化决策的改进弥补了额外的复杂性。封装机制的数学公式证实了模块化聚合保持稳定的收敛特性，同时促进不同神经元簇发挥不同的功能作用。

Title: Cardiverse: Harnessing LLMs for Novel Card Game Prototyping

Authors: Danrui Li, Sen Zhang, Sam S. Sohn, Kaidong Hu, Muhammad Usman, Mubbasir Kapadia
Subjects: cs.CL, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2502.07128
Pdf URL: https://arxiv.org/pdf/2502.07128
Copy Paste: [[2502.07128]] Cardiverse: Harnessing LLMs for Novel Card Game Prototyping(https://arxiv.org/abs/2502.07128)
Keywords: language model, llm
Abstract: The prototyping of computer games, particularly card games, requires extensive human effort in creative ideation and gameplay evaluation. Recent advances in Large Language Models (LLMs) offer opportunities to automate and streamline these processes. However, it remains challenging for LLMs to design novel game mechanics beyond existing databases, generate consistent gameplay environments, and develop scalable gameplay AI for large-scale evaluations. This paper addresses these challenges by introducing a comprehensive automated card game prototyping framework. The approach highlights a graph-based indexing method for generating novel game designs, an LLM-driven system for consistent game code generation validated by gameplay records, and a gameplay AI constructing method that uses an ensemble of LLM-generated action-value functions optimized through self-play. These contributions aim to accelerate card game prototyping, reduce human labor, and lower barriers to entry for game developers.
摘要：电脑游戏（尤其是纸牌游戏）的原型设计需要大量人力进行创意构思和游戏玩法评估。大型语言模型 (LLM) 的最新进展为自动化和简化这些流程提供了机会。然而，对于 LLM 来说，设计超越现有数据库的新颖游戏机制、生成一致的游戏环境以及开发可扩展的游戏玩法 AI 以进行大规模评估仍然具有挑战性。本文通过介绍一个全面的自动化纸牌游戏原型设计框架来解决这些挑战。该方法重点介绍了一种用于生成新颖游戏设计的基于图形的索引方法、一种由 LLM 驱动的、通过游戏记录验证的一致游戏代码生成系统，以及一种使用通过自玩优化的 LLM 生成的动作值函数集合的游戏玩法 AI 构建方法。这些贡献旨在加速纸牌游戏原型设计、减少人力劳动并降低游戏开发者的进入门槛。

Title: Language-TPP: Integrating Temporal Point Processes with Language Models for Event Analysis

Authors: Quyu Kong, Yixuan Zhang, Yang Liu, Panrong Tong, Enqi Liu, Feng Zhou
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.07139
Pdf URL: https://arxiv.org/pdf/2502.07139
Copy Paste: [[2502.07139]] Language-TPP: Integrating Temporal Point Processes with Language Models for Event Analysis(https://arxiv.org/abs/2502.07139)
Keywords: language model, llm
Abstract: Temporal Point Processes (TPPs) have been widely used for event sequence modeling, but they often struggle to incorporate rich textual event descriptions effectively. Conversely, while Large Language Models (LLMs) have shown remarkable capabilities in processing textual data, they lack mechanisms for handling temporal dynamics. To bridge this gap, we introduce Language-TPP, a unified framework that integrates TPPs with LLMs for enhanced event sequence modeling. Language-TPP introduces a novel temporal encoding mechanism that converts continuous time intervals into specialized byte-tokens, enabling seamless integration with standard LLM architectures. This approach allows Language-TPP to achieve state-of-the-art performance across multiple TPP tasks, including event time prediction, type prediction, and intensity estimation, on five datasets. Additionally, we demonstrate that incorporating temporal information significantly improves the quality of generated event descriptions.
摘要：时间点过程 (TPP) 已广泛用于事件序列建模，但它们通常难以有效地整合丰富的文本事件描述。相反，虽然大型语言模型 (LLM) 在处理文本数据方面表现出色，但它们缺乏处理时间动态的机制。为了弥补这一差距，我们引入了 Language-TPP，这是一个将 TPP 与 LLM 集成以增强事件序列建模的统一框架。Language-TPP 引入了一种新颖的时间编码机制，可将连续时间间隔转换为专门的字节标记，从而实现与标准 LLM 架构的无缝集成。这种方法使 Language-TPP 能够在五个数据集上的多个 TPP 任务（包括事件时间预测、类型预测和强度估计）中实现最先进的性能。此外，我们证明结合时间信息可显著提高生成的事件描述的质量。
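One way to picture the temporal encoding mentioned in this abstract is to serialize a continuous time interval as a 32-bit float and emit one special token per byte. The sketch below does exactly that; the float32 big-endian layout, the `<|byte_N|>` token spelling, and the function names are assumptions for illustration and may differ from the paper's actual scheme.

```python
import struct

PREFIX, SUFFIX = "<|byte_", "|>"  # hypothetical special-token spelling


def interval_to_byte_tokens(dt):
    """Encode a time interval as four byte-level tokens (float32, big-endian)."""
    raw = struct.pack(">f", dt)
    return [f"{PREFIX}{b}{SUFFIX}" for b in raw]


def byte_tokens_to_interval(tokens):
    """Invert interval_to_byte_tokens, recovering the float32 value."""
    raw = bytes(int(t[len(PREFIX):-len(SUFFIX)]) for t in tokens)
    return struct.unpack(">f", raw)[0]


tokens = interval_to_byte_tokens(1.5)
print(tokens)
print(byte_tokens_to_interval(tokens))
```

Because each byte takes one of 256 values, such a scheme would add at most 256 tokens to the vocabulary while keeping every interval at a fixed length of four tokens. Values that are not exactly representable in float32 are rounded on encoding.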

Title: Ask Patients with Patience: Enabling LLMs for Human-Centric Medical Dialogue with Grounded Reasoning

Authors: Jiayuan Zhu, Junde Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.07143
Pdf URL: https://arxiv.org/pdf/2502.07143
Copy Paste: [[2502.07143]] Ask Patients with Patience: Enabling LLMs for Human-Centric Medical Dialogue with Grounded Reasoning(https://arxiv.org/abs/2502.07143)
Keywords: language model, llm
Abstract: Accurate and efficient diagnosis in online medical consultations remains a challenge for current large language models. These models often rely on single-turn interactions and lack the ability to refine their predictions through follow-up questions. Additionally, their responses frequently contain complex medical terminology, making them less accessible to non-medical users and creating barriers to effective communication. In this paper, we introduce Ask Patients with Patience (APP), the first multi-turn dialogue that enables LLMs to iteratively refine diagnoses based on grounded reasoning. By integrating medical guidelines and entropy minimization, APP improves both diagnostic accuracy and efficiency. Furthermore, it features human-centric communication that bridges the gap between user comprehension and medical terminology, significantly enhancing user accessibility and engagement. We evaluated APP using a subset of the ReMeDi dataset, comparing it with single-turn and traditional multi-turn LLM baselines. APP achieved higher similarity scores in diagnosis predictions, demonstrating better alignment with ground truth diagnoses. Entropy analysis showed that APP reduces diagnostic uncertainty more rapidly across iterations, increasing confidence in its predictions. APP also excels in user accessibility and empathy, further bridging the gap between complex medical language and user understanding. Code will be released at: this https URL.
摘要：在线医疗咨询中的准确高效诊断仍然是当前大型语言模型面临的挑战。这些模型通常依赖于单轮交互，缺乏通过后续问题来改进预测的能力。此外，他们的回答通常包含复杂的医学术语，这使得非医疗用户难以理解，并为有效沟通造成了障碍。在本文中，我们介绍了“耐心询问患者”（APP），这是第一个多轮对话，使 LLM 能够基于有根据的推理迭代改进诊断。通过整合医疗指南和熵最小化，APP 提高了诊断的准确性和效率。此外，它具有以人为本的沟通功能，弥合了用户理解和医学术语之间的差距，显著提高了用户的可访问性和参与度。我们使用 ReMeDi 数据集的子集评估了 APP，并将其与单轮和传统的多轮 LLM 基线进行了比较。APP 在诊断预测中获得了更高的相似度得分，表明与基本事实诊断的一致性更高。熵分析表明，APP 在迭代过程中更快地降低了诊断不确定性，从而提高了对其预测的信心。 APP 在用户可访问性和同理心方面也表现出色，进一步缩小了复杂的医学语言与用户理解之间的差距。代码将在以下网址发布：此 https URL。
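The entropy analysis in this abstract can be illustrated with a small calculation: treat the candidate diagnoses as a probability distribution, and prefer the follow-up question whose answer is expected to leave the least uncertainty. The Bayesian update, the yes/no question model, and all names and numbers below are assumptions chosen to keep the sketch short; the abstract does not specify how APP selects questions.

```python
import math


def shannon_entropy(probs):
    """Entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def bayes_update(prior, likelihood):
    """Posterior over diagnoses after observing one answer."""
    unnorm = [p * l for p, l in zip(prior, likelihood)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def expected_entropy(prior, yes_likelihood):
    """Expected posterior entropy after a yes/no question.

    yes_likelihood[i] is the assumed P(answer is yes | diagnosis i).
    """
    p_yes = sum(p * l for p, l in zip(prior, yes_likelihood))
    no_likelihood = [1.0 - l for l in yes_likelihood]
    total = 0.0
    for p_answer, lik in ((p_yes, yes_likelihood), (1.0 - p_yes, no_likelihood)):
        if p_answer > 0:
            total += p_answer * shannon_entropy(bayes_update(prior, lik))
    return total


prior = [0.5, 0.5]                       # two equally likely diagnoses
questions = {"discriminating": [1.0, 0.0], "uninformative": [0.5, 0.5]}
best = min(questions, key=lambda q: expected_entropy(prior, questions[q]))
print(best)
```

A question whose answer separates the two diagnoses drives the expected entropy to zero, while one that both diagnoses answer identically leaves it at one bit.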

Title: Does Training on Synthetic Data Make Models Less Robust?

Authors: Lingze Zhang, Ellie Pavlick
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.07164
Pdf URL: https://arxiv.org/pdf/2502.07164
Copy Paste: [[2502.07164]] Does Training on Synthetic Data Make Models Less Robust?(https://arxiv.org/abs/2502.07164)
Keywords: language model, llm
Abstract: An increasingly common practice is to train large language models (LLMs) using synthetic data. Often this synthetic data is produced by the same or similar LLMs as those it is being used to train. This raises the question of whether the synthetic data might in fact exacerbate certain "blindspots" by reinforcing heuristics that the LLM already encodes. In this paper, we conduct simulated experiments on the natural language inference (NLI) task with Llama-2-7B-hf models. We use MultiNLI as the general task and HANS, a targeted evaluation set designed to measure the presence of specific heuristic strategies for NLI, as our "blindspot" task. Our goal is to determine whether performance disparities between the general and blindspot tasks emerge. Our results indicate that synthetic data does not reinforce blindspots in the way we expected. Specifically, we see that, while fine-tuning with synthetic data doesn't necessarily reduce the use of the heuristic, it also does not make it worse as we hypothesized.
摘要：一种越来越常见的做法是使用合成数据训练大型语言模型 (LLM)。通常，这些合成数据是由与其用于训练的 LLM 相同或相似的 LLM 生成的。这就提出了一个问题：合成数据是否可能通过强化 LLM 已经编码的启发式方法而加剧某些“盲点”。在本文中，我们使用 Llama-2-7B-hf 模型对自然语言推理 (NLI) 任务进行模拟实验。我们使用 MultiNLI 作为一般任务，使用 HANS（一种旨在衡量 NLI 特定启发式策略存在性的目标评估集）作为我们的“盲点”任务。我们的目标是确定一般任务和盲点任务之间的性能差异是否出现。我们的结果表明，合成数据并没有像我们预期的那样强化盲点。具体来说，我们发现，虽然使用合成数据进行微调并不一定会减少启发式方法的使用，但也不会像我们假设的那样使情况变得更糟。

Title: Don't Just Demo, Teach Me the Principles: A Principle-Based Multi-Agent Prompting Strategy for Text Classification

Authors: Peipei Wei, Dimitris Dimitriadis, Yan Xu, Mingwei Shen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.07165
Pdf URL: https://arxiv.org/pdf/2502.07165
Copy Paste: [[2502.07165]] Don't Just Demo, Teach Me the Principles: A Principle-Based Multi-Agent Prompting Strategy for Text Classification(https://arxiv.org/abs/2502.07165)
Keywords: llm, prompt, agent
Abstract: We present PRINCIPLE-BASED PROMPTING, a simple but effective multi-agent prompting strategy for text classification. It first asks multiple LLM agents to independently generate candidate principles based on analysis of demonstration samples with or without labels, consolidates them into final principles via a finalizer agent, and then sends them to a classifier agent to perform downstream classification tasks. Extensive experiments on binary and multi-class classification datasets with different sizes of LLMs show that our approach not only achieves substantial performance gains (1.55% - 19.37%) over zero-shot prompting on macro-F1 score but also outperforms other strong baselines (CoT and stepback prompting). Principles generated by our approach help LLMs perform better on classification tasks than human-crafted principles on two private datasets. Our multi-agent PRINCIPLE-BASED PROMPTING approach also shows on-par or better performance compared to demonstration-based few-shot prompting approaches, yet with substantially lower inference costs. Ablation studies show that label information and the multi-agent cooperative LLM framework play an important role in generating high-quality principles to facilitate downstream classification tasks.
摘要：我们提出了基于原则的提示，这是一种简单但有效的文本分类多智能体提示策略。它首先要求多个 LLM 智能体根据对带标签或不带标签的演示样本的分析独立生成候选原则，通过最终代理将它们合并为最终原则，然后将它们发送到分类器代理以执行下游分类任务。对具有不同大小的 LLM 的二分类和多分类数据集进行的大量实验表明，我们的方法不仅在宏 F1 分数上实现了比零样本提示显著的性能提升（1.55% - 19.37%），而且优于其他强基线（CoT 和后退提示）。我们的方法生成的原则有助于 LLM 在分类任务上的表现优于两个私有数据集上人工制定的原则。与基于演示的少量样本提示方法相比，我们的多智能体基于原则的提示方法也表现出同等或更好的性能，但推理成本却大大降低。消融研究表明，标签信息和多智能体合作的 LLM 框架在生成高质量原则以促进下游分类任务方面发挥着重要作用。
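The three-stage flow in this abstract (several agents propose principles, a finalizer consolidates them, a classifier applies them) can be sketched as a short pipeline. The `llm` argument is a hypothetical callable that maps a prompt string to a completion string, and the prompt wording is invented for illustration; neither comes from the paper.

```python
def principle_based_classify(text, demos, llm, n_agents=3):
    """Run a generate -> finalize -> classify prompting pipeline.

    llm is any callable str -> str; it stands in for a real model client.
    """
    demo_block = "\n".join(f"- {d}" for d in demos)

    # Stage 1: each agent independently derives candidate principles.
    candidates = [
        llm(f"Agent {i}: derive classification principles from these samples:\n{demo_block}")
        for i in range(n_agents)
    ]

    # Stage 2: a finalizer agent merges the candidates into one list.
    final_principles = llm(
        "Consolidate these candidate principles into a final list:\n" + "\n".join(candidates)
    )

    # Stage 3: a classifier agent labels the input using the principles only.
    return llm(f"Principles:\n{final_principles}\n\nClassify the text: {text}")


# Stub model so the sketch runs without any external service.
calls = []

def fake_llm(prompt):
    calls.append(prompt)
    return "positive" if prompt.startswith("Principles:") else "a principle"

label = principle_based_classify("great product", ["good -> positive"], fake_llm)
print(label, len(calls))
```

Once the principles are finalized they can be reused for every input, so only the last call is paid per classified text. This is consistent with the abstract's claim of lower inference cost than few-shot prompting, where demonstrations are resent with each query.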

Title: Refine Knowledge of Large Language Models via Adaptive Contrastive Learning

Authors: Yinghui Li, Haojing Huang, Jiayi Kuang, Yangning Li, Shu-Yu Guo, Chao Qu, Xiaoyu Tan, Hai-Tao Zheng, Ying Shen, Philip S. Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.07184
Pdf URL: https://arxiv.org/pdf/2502.07184
Copy Paste: [[2502.07184]] Refine Knowledge of Large Language Models via Adaptive Contrastive Learning(https://arxiv.org/abs/2502.07184)
Keywords: language model, llm, hallucination
Abstract: How to alleviate the hallucinations of Large Language Models (LLMs) has always been the fundamental goal pursued by the LLM research community. Looking through numerous hallucination-related studies, a mainstream category of methods is to reduce hallucinations by optimizing the knowledge representation of LLMs to change their output. Considering that the core focus of these works is the knowledge acquired by models, and knowledge has long been a central theme in human societal progress, we believe that the process of models refining knowledge can greatly benefit from the way humans learn. In our work, by imitating the human learning process, we design an Adaptive Contrastive Learning strategy. Our method flexibly constructs different positive and negative samples for contrastive learning based on LLMs' actual mastery of knowledge. This strategy helps LLMs consolidate the correct knowledge they already possess, deepen their understanding of the correct knowledge they have encountered but not fully grasped, forget the incorrect knowledge they previously learned, and honestly acknowledge the knowledge they lack. Extensive experiments and detailed analyses on widely used datasets demonstrate the effectiveness of our method.
摘要：如何缓解大型语言模型（LLM）的幻觉一直是LLM研究界追求的根本目标。纵观众多与幻觉相关的研究，一类主流方法是通过优化LLM的知识表示来改变其输出，从而减少幻觉。考虑到这些工作的核心关注点是模型所获得的知识，而知识早已是人类社会进步的中心主题，我们认为模型提炼知识的过程可以极大地借鉴人类的学习方式。在我们的工作中，通过模仿人类的学习过程，我们设计了一种自适应对比学习策略。我们的方法根据LLM对知识的实际掌握程度，灵活地构造不同的正负样本进行对比学习。该策略有助于LLM巩固已经拥有的正确知识，加深对遇到但未完全掌握的正确知识的理解，忘记之前学习过的错误知识，并诚实地承认自己缺乏的知识。在广泛使用的数据集上的大量实验和详细分析证明了我们方法的有效性。

Title: Perceived Confidence Scoring for Data Annotation with Zero-Shot LLMs

Authors: Sina Salimian, Gias Uddin, Most Husne Jahan, Shaina Raza
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.07186
Pdf URL: https://arxiv.org/pdf/2502.07186
Copy Paste: [[2502.07186]] Perceived Confidence Scoring for Data Annotation with Zero-Shot LLMs(https://arxiv.org/abs/2502.07186)
Keywords: llm
Abstract: Zero-shot LLMs are now also used for textual classification tasks, e.g., sentiment/emotion detection of a given input as a sentence/article. However, their performance can be suboptimal in such data annotation tasks. We introduce a novel technique, Perceived Confidence Scoring (PCS), that evaluates an LLM's confidence in its classification of an input by leveraging Metamorphic Relations (MRs). The MRs generate semantically equivalent yet textually mutated versions of the input. Following the principles of Metamorphic Testing (MT), the mutated versions are expected to have annotation labels similar to the input. By analyzing the consistency of LLM responses across these variations, PCS computes a confidence score based on the frequency of predicted labels. PCS can be used both for single LLM and multiple LLM settings (e.g., majority voting). We introduce an algorithm, Perceived Differential Evolution (PDE), that determines the optimal weights assigned to the MRs and the LLMs for a classification task. Empirical evaluation shows PCS significantly improves zero-shot accuracy for Llama-3-8B-Instruct (4.96%) and Mistral-7B-Instruct-v0.3 (10.52%), with Gemma-2-9b-it showing a 9.39% gain. When combining all three models, PCS significantly outperforms majority voting by 7.75%.
摘要：零样本 LLM 现在也用于文本分类任务，例如，对给定输入作为句子/文章进行情绪/情感检测。然而，它们在这种数据注释任务中的表现可能不是最理想的。我们引入了一种新技术感知置信度评分 (PCS)，它通过利用变形关系 (MR) 来评估 LLM 对输入分类的置信度。MR 生成语义等效但文本变异的输入版本。遵循变形测试 (MT) 的原则，变异版本应具有与输入相似的注释标签。通过分析这些变化中 LLM 响应的一致性，PCS 根据预测标签的频率计算置信度分数。PCS 既可用于单个 LLM，也可用于多个 LLM 设置（例如，多数投票）。我们引入了一种算法感知差异进化 (PDE)，该算法确定分配给分类任务的 MR 和 LLM 的最佳权重。实证评估表明，PCS 显著提高了 Llama-3-8B-Instruct（4.96%）和 Mistral-7B-Instruct-v0.3（10.52%）的零样本准确率，而 Gemma-2-9b-it 的准确率提高了 9.39%。将这三个模型结合起来，PCS 的表现明显优于多数投票法 7.75%。
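The frequency-based score described in this abstract can be sketched in a few lines: collect the label predicted for the original input and for each mutated version, then report the most frequent label together with its (optionally weighted) share of the votes. The function name, the sample labels, and the default of uniform weights are assumptions; in the paper the weights are tuned by PDE, which is not implemented here.

```python
from collections import Counter


def perceived_confidence(labels, weights=None):
    """Return (label, confidence) from predictions on an input and its mutations.

    labels[i] is the label predicted for the i-th version of the input;
    weights[i] is an optional importance for that version (uniform if omitted).
    """
    if not labels:
        raise ValueError("at least one prediction is required")
    if weights is None:
        weights = [1.0] * len(labels)

    totals = Counter()
    for label, weight in zip(labels, weights):
        totals[label] += weight

    best = max(totals, key=totals.get)
    return best, totals[best] / sum(totals.values())


# Original input plus three hypothetical metamorphic mutations.
print(perceived_confidence(["pos", "pos", "neg", "pos"]))
```

With several LLMs, the same function could be applied to the pooled predictions, with per-model weights folded into `weights`.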

Title: A Large-Scale Benchmark for Vietnamese Sentence Paraphrases

Authors: Sang Quang Nguyen, Kiet Van Nguyen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.07188
Pdf URL: https://arxiv.org/pdf/2502.07188
Copy Paste: [[2502.07188]] A Large-Scale Benchmark for Vietnamese Sentence Paraphrases(https://arxiv.org/abs/2502.07188)
Keywords: language model, gpt, llm
Abstract: This paper presents ViSP, a high-quality Vietnamese dataset for sentence paraphrasing, consisting of 1.2M original-paraphrase pairs collected from various domains. The dataset was constructed using a hybrid approach that combines automatic paraphrase generation with manual evaluation to ensure high quality. We conducted experiments using methods such as back-translation, EDA, and baseline models like BART and T5, as well as large language models (LLMs), including GPT-4o, Gemini-1.5, Aya, Qwen-2.5, and Meta-Llama-3.1 variants. To the best of our knowledge, this is the first large-scale study on Vietnamese paraphrasing. We hope that our dataset and findings will serve as a valuable foundation for future research and applications in Vietnamese paraphrase tasks.
摘要：本文介绍了 ViSP，这是一个高质量的越南语句子释义数据集，包含从各个领域收集的 120 万个原文释义对。该数据集采用混合方法构建，将自动释义生成与手动评估相结合，以确保高质量。我们使用回译、EDA 和 BART 和 T5 等基线模型以及大型语言模型 (LLM)（包括 GPT-4o、Gemini-1.5、Aya、Qwen-2.5 和 Meta-Llama-3.1 变体）等方法进行了实验。据我们所知，这是第一项关于越南语释义的大规模研究。我们希望我们的数据集和研究结果将成为未来越南语释义任务研究和应用的宝贵基础。

Title: Graph RAG-Tool Fusion

Authors: Elias Lumer, Pradeep Honaganahalli Basavaraju, Myles Mason, James A. Burke, Vamse Kumar Subbiah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.07223
Pdf URL: https://arxiv.org/pdf/2502.07223
Copy Paste: [[2502.07223]] Graph RAG-Tool Fusion(https://arxiv.org/abs/2502.07223)
Keywords: llm, retrieval-augmented generation, agent
Abstract: Recent developments in retrieval-augmented generation (RAG) for selecting relevant tools from a tool knowledge base enable LLM agents to scale their complex tool calling capabilities to hundreds or thousands of external tools, APIs, or agents-as-tools. However, traditional RAG-based tool retrieval fails to capture structured dependencies between tools, limiting the retrieval accuracy of a retrieved tool's dependencies. For example, among a vector database of tools, a "get stock price" API requires a "stock ticker" parameter from a "get stock ticker" API, and both depend on OS-level internet connectivity tools. In this paper, we address this limitation by introducing Graph RAG-Tool Fusion, a novel plug-and-play approach that combines the strengths of vector-based retrieval with efficient graph traversal to capture all relevant tools (nodes) along with any nested dependencies (edges) within the predefined tool knowledge graph. We also present ToolLinkOS, a new tool selection benchmark of 573 fictional tools, spanning over 15 industries, each with an average of 6.3 tool dependencies. We demonstrate that Graph RAG-Tool Fusion achieves absolute improvements of 71.7% and 22.1% over naïve RAG on ToolLinkOS and ToolSandbox benchmarks, respectively (mAP@10). ToolLinkOS dataset is available at this https URL
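The idea of pairing retrieval with graph traversal can be sketched as follows, using the stock-price example from the abstract. The tool names, the `DEPENDS_ON` dictionary, and the function `expand_with_dependencies` are hypothetical, and the vector-retrieval step is replaced by a hard-coded list; this is an illustration of dependency expansion in general, not the authors' implementation.

```python
from collections import deque

# Hypothetical tool knowledge graph: each tool maps to the tools it depends on.
DEPENDS_ON = {
    "get_stock_price": ["get_stock_ticker", "check_internet"],
    "get_stock_ticker": ["check_internet"],
    "check_internet": [],
    "get_weather": ["check_internet"],
}


def expand_with_dependencies(seed_tools, depends_on):
    """Breadth-first traversal that returns the seed tools followed by
    all of their direct and nested dependencies, without duplicates."""
    ordered = []
    visited = set()
    queue = deque(seed_tools)
    while queue:
        tool = queue.popleft()
        if tool in visited:
            continue
        visited.add(tool)
        ordered.append(tool)
        queue.extend(depends_on.get(tool, []))
    return ordered


# Stand-in for the vector-retrieval step, which would normally return the top-k tools.
retrieved = ["get_stock_price"]
tools = expand_with_dependencies(retrieved, DEPENDS_ON)
print(tools)
```

Retrieval alone would return only `get_stock_price`; the traversal adds the ticker lookup and the connectivity check that the call needs.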

Title: GENERator: A Long-Context Generative Genomic Foundation Model

Authors: Wei Wu, Qiuyi Li, Mingyang Li, Kun Fu, Fuli Feng, Jieping Ye, Hui Xiong, Zheng Wang
Subjects: cs.CL, q-bio.GN
Abstract URL: https://arxiv.org/abs/2502.07272
Pdf URL: https://arxiv.org/pdf/2502.07272
Copy Paste: [[2502.07272]] GENERator: A Long-Context Generative Genomic Foundation Model(https://arxiv.org/abs/2502.07272)
Keywords: language model, llm, prompt
Abstract: Advancements in DNA sequencing technologies have significantly improved our ability to decode genomic sequences. However, the prediction and interpretation of these sequences remain challenging due to the intricate nature of genetic material. Large language models (LLMs) have introduced new opportunities for biological sequence analysis. Recent developments in genomic language models have underscored the potential of LLMs in deciphering DNA sequences. Nonetheless, existing models often face limitations in robustness and application scope, primarily due to constraints in model structure and training data scale. To address these limitations, we present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of eukaryotic DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. The model adheres to the central dogma of molecular biology, accurately generating protein-coding sequences that translate into proteins structurally analogous to known families. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of promoter sequences with specific activity profiles. These capabilities position the GENERator as a pivotal tool for genomic research and biotechnological advancement, enhancing our ability to interpret and predict complex biological systems and enabling precise genomic interventions.

Title: Small Language Model Makes an Effective Long Text Extractor

Authors: Yelin Chen, Fanjin Zhang, Jie Tang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.07286
Pdf URL: https://arxiv.org/pdf/2502.07286
Copy Paste: [[2502.07286]] Small Language Model Makes an Effective Long Text Extractor(https://arxiv.org/abs/2502.07286)
Keywords: language model, llm, prompt
Abstract: Named Entity Recognition (NER) is a fundamental problem in natural language processing (NLP). However, the task of extracting longer entity spans (e.g., awards) from extended texts (e.g., homepages) is barely explored. Current NER methods predominantly fall into two categories: span-based methods and generation-based methods. Span-based methods require the enumeration of all possible token-pair spans, followed by classification on each span, resulting in substantial redundant computations and excessive GPU memory usage. In contrast, generation-based methods involve prompting or fine-tuning large language models (LLMs) to adapt to downstream NER tasks. However, these methods struggle with the accurate generation of longer spans and often incur significant time costs for effective fine-tuning. To address these challenges, this paper introduces a lightweight span-based NER method called SeNER, which incorporates a bidirectional arrow attention mechanism coupled with LogN-Scaling on the [CLS] token to embed long texts effectively, and comprises a novel bidirectional sliding-window plus-shaped attention (BiSPA) mechanism to reduce redundant candidate token-pair spans significantly and model interactions between token-pair spans simultaneously. Extensive experiments demonstrate that our method achieves state-of-the-art extraction accuracy on three long NER datasets and is capable of extracting entities from long texts in a GPU-memory-friendly manner. Code: this https URL

Title: CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction

Authors: Junlong Li, Daya Guo, Dejian Yang, Runxin Xu, Yu Wu, Junxian He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.07316
Pdf URL: https://arxiv.org/pdf/2502.07316
Copy Paste: [[2502.07316]] CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction(https://arxiv.org/abs/2502.07316)
Keywords: language model, chain-of-thought
Abstract: Reasoning is a fundamental capability of Large Language Models. While prior research predominantly focuses on enhancing narrow skills like math or code generation, improving performance on many other reasoning tasks remains challenging due to sparse and fragmented training data. To address this issue, we propose CodeI/O, a novel approach that systematically condenses diverse reasoning patterns inherently embedded in contextually-grounded codes, through transforming the original code into a code input-output prediction format. By training models to predict inputs/outputs given code and test cases entirely in natural language as Chain-of-Thought (CoT) rationales, we expose them to universal reasoning primitives -- like logic flow planning, state-space searching, decision tree traversal, and modular decomposition -- while decoupling structured reasoning from code-specific syntax and preserving procedural rigor. Experimental results demonstrate CodeI/O leads to consistent improvements across symbolic, scientific, logic, math & numerical, and commonsense reasoning tasks. By matching the existing ground-truth outputs or re-executing the code with predicted inputs, we can verify each prediction and further enhance the CoTs through multi-turn revision, resulting in CodeI/O++ and achieving higher performance. Our data and models are available at this https URL.
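The two verification routes mentioned in the abstract (matching a known output, or re-executing the code on a predicted input) can be sketched as below. The helper names and the reference function `count_vowels` are assumptions chosen for illustration and do not come from the paper.

```python
def verify_output_prediction(predicted_output, true_output):
    """Output prediction: compare against the stored ground-truth output."""
    return predicted_output == true_output


def verify_input_prediction(func, predicted_args, target_output):
    """Input prediction: re-run the function on the predicted arguments and
    check that it reproduces the target output. Any exception counts as a miss."""
    try:
        return func(*predicted_args) == target_output
    except Exception:
        return False


def count_vowels(text):
    """Hypothetical reference function standing in for a collected code sample."""
    return sum(ch in "aeiou" for ch in text.lower())


print(verify_output_prediction(3, count_vowels("banana")))
print(verify_input_prediction(count_vowels, ("banana",), 3))
print(verify_input_prediction(count_vowels, ("sky",), 3))
```

Re-execution is needed for input prediction because many different inputs can map to the same output, so a string comparison against the original input would reject valid answers.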

Title: MEMIT-Merge: Addressing MEMIT's Key-Value Conflicts in Same-Subject Batch Editing for LLMs

Authors: Zilu Dong, Xiangqing Shen, Rui Xia
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.07322
Pdf URL: https://arxiv.org/pdf/2502.07322
Copy Paste: [[2502.07322]] MEMIT-Merge: Addressing MEMIT's Key-Value Conflicts in Same-Subject Batch Editing for LLMs(https://arxiv.org/abs/2502.07322)
Keywords: language model, llm
Abstract: As large language models continue to scale up, knowledge editing techniques that modify models' internal knowledge without full retraining have gained significant attention. MEMIT, a prominent batch editing algorithm, stands out for its capability to perform mass knowledge modifications. However, we uncover a critical limitation: MEMIT's editing efficacy significantly deteriorates when processing batches containing multiple edits sharing the same subject. Our analysis reveals that the root cause lies in MEMIT's key-value modeling framework: when multiple facts with the same subject in a batch are modeled through MEMIT's key-value mechanism, identical keys (derived from the shared subject) are forced to represent different values (corresponding to different knowledge), resulting in update conflicts during editing. Addressing this issue, we propose MEMIT-Merge, an enhanced approach that merges value computation processes for facts sharing the same subject, effectively resolving the performance degradation in same-subject batch editing scenarios. Experimental results demonstrate that when MEMIT's edit success rate drops to around 50% at larger batch sizes, MEMIT-Merge maintains a success rate exceeding 90%, showcasing remarkable robustness to subject entity collisions.
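As a rough illustration of the same-subject collisions the abstract describes, the sketch below only groups a batch of edit requests by subject and reports which subjects appear more than once. It does not compute keys or values and is not MEMIT or MEMIT-Merge; the tuple layout `(subject, relation, new_object)` and the function names are assumptions.

```python
from collections import defaultdict


def group_edits_by_subject(edits):
    """Map each subject to the list of (relation, new_object) edits targeting it."""
    groups = defaultdict(list)
    for subject, relation, new_object in edits:
        groups[subject].append((relation, new_object))
    return dict(groups)


def colliding_subjects(groups):
    """Subjects with more than one edit in the batch, i.e. the cases where a
    shared key would have to represent several different values."""
    return sorted(s for s, items in groups.items() if len(items) > 1)


# Placeholder edit requests, illustrative only.
batch = [
    ("subject_a", "relation_1", "object_x"),
    ("subject_a", "relation_2", "object_y"),
    ("subject_b", "relation_1", "object_z"),
]
groups = group_edits_by_subject(batch)
print(colliding_subjects(groups))
```

A merging approach would then treat each colliding group as one unit when computing the value for its shared key, rather than as independent edits.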

Title: Aligning Large Language Models to Follow Instructions and Hallucinate Less via Effective Data Filtering

Authors: Shuzheng Si, Haozhe Zhao, Gang Chen, Cheng Gao, Yuzhuo Bai, Zhitong Wang, Kaikai An, Kangyang Luo, Chen Qian, Fanchao Qi, Baobao Chang, Maosong Sun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.07340
Pdf URL: https://arxiv.org/pdf/2502.07340
Copy Paste: [[2502.07340]] Aligning Large Language Models to Follow Instructions and Hallucinate Less via Effective Data Filtering(https://arxiv.org/abs/2502.07340)
Keywords: language model, llm, hallucination
Abstract: Training LLMs on data that contains unfamiliar knowledge during the instruction tuning stage can make LLMs overconfident and encourage hallucinations. To address this challenge, we introduce a novel framework, NOVA, which identifies high-quality data that aligns well with the LLM's learned knowledge to reduce hallucinations. NOVA includes Internal Consistency Probing (ICP) and Semantic Equivalence Identification (SEI) to measure how familiar the LLM is with instruction data. Specifically, ICP evaluates the LLM's understanding of the given instruction by calculating the tailored consistency among multiple self-generated responses. SEI further assesses the familiarity of the LLM with the target response by comparing it to the generated responses, using the proposed semantic clustering and well-designed voting strategy. Finally, we introduce an expert-aligned reward model, considering characteristics beyond just familiarity to enhance data quality. By considering data quality and avoiding unfamiliar data, we can utilize the selected data to effectively align LLMs to follow instructions and hallucinate less. Extensive experiments and analysis show that NOVA significantly reduces hallucinations and allows LLMs to maintain a strong ability to follow instructions.

Title: BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models

Authors: Xu Huang, Wenhao Zhu, Hanxu Hu, Conghui He, Lei Li, Shujian Huang, Fei Yuan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.07346
Pdf URL: https://arxiv.org/pdf/2502.07346
Copy Paste: [[2502.07346]] BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models(https://arxiv.org/abs/2502.07346)
Keywords: language model, llm, long context
Abstract: Previous multilingual benchmarks focus primarily on simple understanding tasks, but for large language models (LLMs), we emphasize proficiency in instruction following, reasoning, long context understanding, code generation, and so on. However, measuring these advanced capabilities across languages is underexplored. To address the disparity, we introduce BenchMAX, a multi-way multilingual evaluation benchmark that allows for fair comparisons of these important abilities across languages. To maintain high quality, three distinct native-speaking annotators independently annotate each sample within all tasks after the data was machine-translated from English into 16 other languages. Additionally, we present a novel translation challenge stemming from dataset construction. Extensive experiments on BenchMAX reveal varying effectiveness of core capabilities across languages, highlighting performance gaps that cannot be bridged by simply scaling up model size. BenchMAX serves as a comprehensive multilingual evaluation platform, providing a promising test bed to promote the development of multilingual language models. The dataset and code are publicly accessible.

Title: Bridging the Evaluation Gap: Leveraging Large Language Models for Topic Model Evaluation

Authors: Zhiyin Tan, Jennifer D'Souza
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.07352
Pdf URL: https://arxiv.org/pdf/2502.07352
Copy Paste: [[2502.07352]] Bridging the Evaluation Gap: Leveraging Large Language Models for Topic Model Evaluation(https://arxiv.org/abs/2502.07352)
Keywords: language model, llm, prompt
Abstract: This study presents a framework for automated evaluation of dynamically evolving topic taxonomies in scientific literature using Large Language Models (LLMs). In digital library systems, topic modeling plays a crucial role in efficiently organizing and retrieving scholarly content, guiding researchers through complex knowledge landscapes. As research domains proliferate and shift, traditional human-centric and static evaluation methods struggle to maintain relevance. The proposed approach harnesses LLMs to measure key quality dimensions, such as coherence, repetitiveness, diversity, and topic-document alignment, without heavy reliance on expert annotators or narrow statistical metrics. Tailored prompts guide LLM assessments, ensuring consistent and interpretable evaluations across various datasets and modeling techniques. Experiments on benchmark corpora demonstrate the method's robustness, scalability, and adaptability, underscoring its value as a more holistic and dynamic alternative to conventional evaluation strategies.

Title: LongReD: Mitigating Short-Text Degradation of Long-Context Large Language Models via Restoration Distillation

Authors: Zican Dong, Junyi Li, Jinhao Jiang, Mingyu Xu, Wayne Xin Zhao, Bingning Wang, Weipeng Chen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.07365
Pdf URL: https://arxiv.org/pdf/2502.07365
Copy Paste: [[2502.07365]] LongReD: Mitigating Short-Text Degradation of Long-Context Large Language Models via Restoration Distillation(https://arxiv.org/abs/2502.07365)
Keywords: language model, llm, long context
Abstract: Large language models (LLMs) have gained extended context windows through scaling positional encodings and lightweight continual pre-training. However, this often leads to degraded performance on short-text tasks, while the reasons for this degradation remain insufficiently explored. In this work, we identify two primary factors contributing to this issue: distribution drift in hidden states and attention scores, and catastrophic forgetting during continual pre-training. To address these challenges, we propose Long Context Pre-training with Restoration Distillation (LongReD), a novel approach designed to mitigate short-text performance degradation through minimizing the distribution discrepancy between the extended and original models. Besides training on long texts, LongReD distills the hidden state of selected layers from the original model on short texts. Additionally, LongReD also introduces a short-to-long distillation, aligning the output distribution on short texts with that on long texts by leveraging skipped positional indices. Experiments on common text benchmarks demonstrate that LongReD effectively preserves the model's short-text performance while maintaining comparable or even better capacity to handle long texts than baselines.

Title: Target-Augmented Shared Fusion-based Multimodal Sarcasm Explanation Generation

Authors: Palaash Goel, Dushyant Singh Chauhan, Md Shad Akhtar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.07391
Pdf URL: https://arxiv.org/pdf/2502.07391
Copy Paste: [[2502.07391]] Target-Augmented Shared Fusion-based Multimodal Sarcasm Explanation Generation(https://arxiv.org/abs/2502.07391)
Keywords: llm
Abstract: Sarcasm is a linguistic phenomenon that intends to ridicule a target (e.g., entity, event, or person) in an inherent way. Multimodal Sarcasm Explanation (MuSE) aims at revealing the intended irony in a sarcastic post using a natural language explanation. Though important, existing systems overlooked the significance of the target of sarcasm in generating explanations. In this paper, we propose a Target-aUgmented shaRed fusion-Based sarcasm explanatiOn model, a.k.a. TURBO. We design a novel shared-fusion mechanism to leverage the inter-modality relationships between an image and its caption. TURBO assumes the target of the sarcasm and guides the multimodal shared fusion mechanism in learning the intricacies of the intended irony for explanations. We evaluate our proposed TURBO model on the MORE+ dataset. Comparison against multiple baselines and state-of-the-art models shows that TURBO improves performance by an average margin of $+3.3\%$. Moreover, we explore LLMs in zero-shot and one-shot settings for our task and observe that LLM-generated explanations, though remarkable, often fail to capture the critical nuances of the sarcasm. Furthermore, we supplement our study with extensive human evaluation of TURBO's generated explanations and find them to be comparatively better than those of other systems.

Title: Entity Linking using LLMs for Automated Product Carbon Footprint Estimation

Authors: Steffen Castle, Julian Moreno Schneider, Leonhard Hennig, Georg Rehm
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.07418
Pdf URL: https://arxiv.org/pdf/2502.07418
Copy Paste: [[2502.07418]] Entity Linking using LLMs for Automated Product Carbon Footprint Estimation(https://arxiv.org/abs/2502.07418)
Keywords: language model, llm
Abstract: Growing concerns about climate change and sustainability are driving manufacturers to take significant steps toward reducing their carbon footprints. For these manufacturers, a first step towards this goal is to identify the environmental impact of the individual components of their products. We propose a system leveraging large language models (LLMs) to automatically map components from manufacturer Bills of Materials (BOMs) to Life Cycle Assessment (LCA) database entries by using LLMs to expand on available component information. Our approach reduces the need for manual data processing, paving the way for more accessible sustainability practices.

Title: RomanLens: Latent Romanization and its role in Multilinguality in LLMs

Authors: Alan Saji (1), Jaavid Aktar Husain (2), Thanmay Jayakumar (1 and 3), Raj Dabre (1, 3, 4 and 5), Anoop Kunchukuttan (1, 3 and 6), Mitesh M. Khapra (1 and 3), Ratish Puduppully (7) ((1) Nilekani Centre at AI4Bharat, (2) Singapore University of Technology and Design, (3) Indian Institute of Technology Madras, India, (4) National Institute of Information and Communications Technology, Kyoto, Japan, (5) Indian Institute of Technology Bombay, India, (6) Microsoft, India, (7) IT University of Copenhagen)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.07424
Pdf URL: https://arxiv.org/pdf/2502.07424
Copy Paste: [[2502.07424]] RomanLens: Latent Romanization and its role in Multilinguality in LLMs(https://arxiv.org/abs/2502.07424)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) exhibit remarkable multilingual generalization despite being predominantly trained on English-centric corpora. A fundamental question arises: how do LLMs achieve such robust multilingual capabilities? For non-Latin script languages, we investigate the role of romanization - the representation of non-Latin scripts using Latin characters - as a bridge in multilingual processing. Using mechanistic interpretability techniques, we analyze next-token generation and find that intermediate layers frequently represent target words in romanized form before transitioning to native script, a phenomenon we term Latent Romanization. Further, through activation patching experiments, we demonstrate that LLMs encode semantic concepts similarly across native and romanized scripts, suggesting a shared underlying representation. Additionally, in translation towards non-Latin script languages, our findings reveal that when the target language is in romanized form, its representations emerge earlier in the model's layers compared to native script. These insights contribute to a deeper understanding of multilingual representation in LLMs and highlight the implicit role of romanization in facilitating language transfer. Our work provides new directions for potentially improving multilingual language modeling and interpretability.

Title: Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon

Authors: Nurit Cohen-Inger, Yehonatan Elisha, Bracha Shapira, Lior Rokach, Seffi Cohen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.07445
Pdf URL: https://arxiv.org/pdf/2502.07445
Copy Paste: [[2502.07445]] Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon(https://arxiv.org/abs/2502.07445)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) often appear to excel on public benchmarks, but these high scores may mask an overreliance on dataset-specific surface cues rather than true language understanding. We introduce the Chameleon Benchmark Overfit Detector (C-BOD), a meta-evaluation framework that systematically distorts benchmark prompts via a parametric transformation and detects overfitting of LLMs. By rephrasing inputs while preserving their semantic content and labels, C-BOD exposes whether a model's performance is driven by memorized patterns. Evaluated on the MMLU benchmark using 26 leading LLMs, our method reveals an average performance degradation of 2.15% under modest perturbations, with 20 out of 26 models exhibiting statistically significant differences. Notably, models with higher baseline accuracy exhibit larger performance differences under perturbation, and larger LLMs tend to be more sensitive to rephrasings, indicating that both groups may over-rely on fixed prompt patterns. In contrast, the Llama family and models with lower baseline accuracy show insignificant degradation, suggesting reduced dependency on superficial cues. Moreover, C-BOD's dataset- and model-agnostic design allows easy integration into training pipelines to promote more robust language understanding. Our findings challenge the community to look beyond leaderboard scores and prioritize resilience and generalization in LLM evaluation.
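The comparison of a model's results on original versus rephrased prompts can be sketched as a paired test over per-question correctness flags. The abstract reports statistically significant differences but does not name the test; the exact McNemar test used here is an assumption, chosen because it is a standard option for paired binary outcomes. The function names and the toy data are likewise hypothetical.

```python
from math import comb


def accuracy_drop(correct_original, correct_perturbed):
    """Accuracy on original prompts minus accuracy on rephrased prompts."""
    n = len(correct_original)
    return sum(correct_original) / n - sum(correct_perturbed) / n


def mcnemar_exact(correct_original, correct_perturbed):
    """Two-sided exact McNemar p-value on paired correctness flags.
    Only discordant pairs (answer flipped between conditions) matter."""
    pairs = list(zip(correct_original, correct_perturbed))
    b = sum(1 for o, p in pairs if o and not p)  # right originally, wrong after rephrasing
    c = sum(1 for o, p in pairs if p and not o)  # wrong originally, right after rephrasing
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy data: 10 questions, 6 of which flip from correct to incorrect.
original = [True] * 10
perturbed = [True] * 4 + [False] * 6
print(accuracy_drop(original, perturbed), mcnemar_exact(original, perturbed))
```

A small p-value together with a positive accuracy drop suggests that the model's answers depend on the exact wording of the prompt rather than on its content.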

Title: PerCul: A Story-Driven Cultural Evaluation of LLMs in Persian

Authors: Erfan Moosavi Monazzah, Vahid Rahimzadeh, Yadollah Yaghoobzadeh, Azadeh Shakery, Mohammad Taher Pilehvar
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2502.07459
Pdf URL: https://arxiv.org/pdf/2502.07459
Copy Paste: [[2502.07459]] PerCul: A Story-Driven Cultural Evaluation of LLMs in Persian(https://arxiv.org/abs/2502.07459)
Keywords: language model, llm
Abstract: Large language models predominantly reflect Western cultures, largely due to the dominance of English-centric training data. This imbalance presents a significant challenge, as LLMs are increasingly used across diverse contexts without adequate evaluation of their cultural competence in non-English languages, including Persian. To address this gap, we introduce PerCul, a carefully constructed dataset designed to assess the sensitivity of LLMs toward Persian culture. PerCul features story-based, multiple-choice questions that capture culturally nuanced scenarios. Unlike existing benchmarks, PerCul is curated with input from native Persian annotators to ensure authenticity and to prevent the use of translation as a shortcut. We evaluate several state-of-the-art multilingual and Persian-specific LLMs, establishing a foundation for future research in cross-cultural NLP evaluation. Our experiments demonstrate an 11.3% gap between the best closed-source model and the layperson baseline, while the gap increases to 21.3% with the best open-weight model. You can access the dataset from here: this https URL

Title: Multi-Agent Collaboration for Multilingual Code Instruction Tuning

Authors: Jian Yang, Wei Zhang, Jiaxi Yang, Yibo Miao, Shanghaoran Quan, Zhenhe Wu, Qiyao Peng, Liqun Yang, Tianyu Liu, Zeyu Cui, Binyuan Hui, Junyang Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.07487
Pdf URL: https://arxiv.org/pdf/2502.07487
Copy Paste: [[2502.07487]] Multi-Agent Collaboration for Multilingual Code Instruction Tuning(https://arxiv.org/abs/2502.07487)
Keywords: llm, agent
Abstract: Recent advances in code understanding and generation demonstrate that code LLMs fine-tuned on a high-quality instruction dataset can gain powerful capabilities to address wide-ranging code-related tasks. However, most existing methods view each programming language in isolation and ignore the knowledge transfer among different programming languages. To bridge the gap among different programming languages, we introduce a novel multi-agent collaboration framework to enhance multilingual instruction tuning for code LLMs, where multiple language-specific intelligent agent components with generation memory work together to transfer knowledge from one language to another efficiently and effectively. Specifically, we first generate the language-specific instruction data from the code snippets and then provide the generated data as the seed data for language-specific agents. Multiple language-specific agents discuss and collaborate to formulate a new instruction and its corresponding solution (in a new programming language or an existing one). To further encourage cross-lingual transfer, each agent stores its generation history as memory and then summarizes its merits and faults. Finally, the high-quality multilingual instruction data is used to encourage knowledge transfer among different programming languages to train Qwen2.5-xCoder. Experimental results on multilingual programming benchmarks demonstrate the superior performance of Qwen2.5-xCoder in sharing common knowledge, highlighting its potential to reduce the cross-lingual gap.
摘要：代码理解和生成方面的最新进展表明，在高质量指令数据集上进行微调的代码 LLM 可以获得强大的能力来解决广泛的代码相关任务。然而，大多数现有的方法主要孤立地看待每种编程语言，而忽略了不同编程语言之间的知识转移。为了弥合不同编程语言之间的差距，我们引入了一个新颖的多智能体协作框架来增强代码 LLM 的多语言指令调整，其中多个具有生成记忆的语言特定智能体组件协同工作，以高效、有效地将知识从一种语言转移到另一种语言。具体来说，我们首先从代码片段生成语言特定指令数据，然后将生成的数据作为语言特定代理的种子数据提供。多个语言特定代理讨论并协作制定新指令及其相应的解决方案（一种新的编程语言或现有的编程语言），为了进一步鼓励跨语言转移，每个代理将其生成历史存储为记忆，然后总结其优缺点。最后，高质量的多语言指令数据用于鼓励不同编程语言之间的知识转移，以训练 Qwen2.5-xCoder。多语言编程基准测试的实验结果证明了 Qwen2.5-xCoder 在共享共同知识方面的卓越性能，凸显了其缩小跨语言差距的潜力。

Title: Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More

Authors: Xialie Zhuang, Zhikai Jia, Jianjin Li, Zhenyu Zhang, Li Shen, Zheng Cao, Shiwei Liu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.07490
Pdf URL: https://arxiv.org/pdf/2502.07490
Copy Paste: [[2502.07490]] Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More(https://arxiv.org/abs/2502.07490)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have been found to struggle with accurately retrieving key information. To address this, we propose Mask-Enhanced Autoregressive Prediction (MEAP), a simple yet effective training paradigm that seamlessly integrates Masked Language Modeling (MLM) into Next-Token Prediction (NTP) to enhance the latter's in-context retrieval capabilities. Specifically, MEAP first randomly masks a small fraction of input tokens and then directly performs standard next-token prediction autoregressively using a decoder-only Transformer. MEAP eliminates the need for bidirectional attention or encoder-decoder architectures for MLM, incurring no additional computational overhead during pre-training or inference. Extensive experiments demonstrate that MEAP substantially outperforms NTP on key information retrieval and long-context reasoning tasks, while performing on par or better on commonsense reasoning tasks. The benefits of MEAP also extend to supervised fine-tuning, where it shows remarkable advantages in lost-in-the-middle scenarios, outperforming NTP by 11.77 percentage points. Our analysis indicates that MEAP's effectiveness arises from its ability to promote more distinguishable attention scores by concentrating on a reduced set of non-masked tokens. This mechanism improves the model's focus on task-relevant signals while mitigating the influence of peripheral context. These findings position MEAP as a promising training paradigm for large language models.
摘要：研究发现，大型语言模型 (LLM) 无法准确检索关键信息。为了解决这个问题，我们提出了掩码增强自回归预测 (MEAP)，这是一种简单而有效的训练范式，它将掩码语言模型 (MLM) 无缝集成到下一个标记预测 (NTP) 中，以增强后者的上下文检索能力。具体而言，MEAP 首先随机屏蔽一小部分输入标记，然后使用仅解码器的 Transformer 直接执行标准的下一个标记预测自回归。MEAP 消除了 MLM 对双向注意或编码器-解码器架构的需求，在预训练或推理期间不会产生额外的计算开销。大量实验表明，MEAP 在关键信息检索和长上下文推理任务上的表现远远优于 NTP，而在常识推理任务上的表现相当或更好。 MEAP 的优势还扩展到监督微调，它在迷失于中间的场景中表现出显著的优势，比 NTP 高出 11.77 个百分点。我们的分析表明，MEAP 的有效性源于它能够通过集中精力于一组减少的非掩码标记来提高更易区分的注意力得分。这种机制提高了模型对任务相关信号的关注度，同时减轻了外围上下文的影响。这些发现使 MEAP 成为大型语言模型的一个有前途的训练范例。
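
The data preparation step described in the MEAP abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the mask token id (`MASK_ID`), the default `mask_ratio` of 0.15, and the choice to leave the prediction targets unmasked are all assumptions made for the example.

```python
import random

MASK_ID = 0  # hypothetical id reserved for the mask token


def meap_example(token_ids, mask_ratio=0.15, seed=None):
    """Build (inputs, targets) for next-token prediction on a partly masked input."""
    rng = random.Random(seed)
    # Standard NTP shift: the model sees tokens[:-1] and predicts tokens[1:].
    inputs = list(token_ids[:-1])
    targets = list(token_ids[1:])
    # Replace a small random fraction of the input positions with the mask id.
    n_mask = int(len(inputs) * mask_ratio)
    for pos in rng.sample(range(len(inputs)), n_mask):
        inputs[pos] = MASK_ID
    # Targets keep the original tokens (an assumption of this sketch).
    return inputs, targets


inputs, targets = meap_example(list(range(1, 21)), mask_ratio=0.2, seed=0)
print(inputs)
print(targets)
```

Because only the inputs change, the same causal decoder and loss can be used unchanged, which is consistent with the abstract's claim of no extra overhead.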

Title: Grammar Control in Dialogue Response Generation for Language Learning Chatbots

Authors: Dominik Glandorf, Peng Cui, Detmar Meurers, Mrinmaya Sachan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.07544
Pdf URL: https://arxiv.org/pdf/2502.07544
Copy Paste: [[2502.07544]] Grammar Control in Dialogue Response Generation for Language Learning Chatbots(https://arxiv.org/abs/2502.07544)
Keywords: language model, gpt, prompt, chat
Abstract: Chatbots based on large language models offer cheap conversation practice opportunities for language learners. However, they are hard to control for linguistic forms that correspond to learners' current needs, such as grammar. We control grammar in chatbot conversation practice by grounding a dialogue response generation model in a pedagogical repository of grammar skills. We also explore how this control helps learners to produce specific grammar. We comprehensively evaluate prompting, fine-tuning, and decoding strategies for grammar-controlled dialogue response generation. Strategically decoding Llama3 outperforms GPT-3.5 when tolerating minor response quality losses. Our simulation predicts grammar-controlled responses to support grammar acquisition adapted to learner proficiency. Existing language learning chatbots and research on second language acquisition benefit from these affordances. Code available on GitHub.
摘要：基于大型语言模型的聊天机器人为语言学习者提供了廉价的对话练习机会。然而，对于与学习者当前需求相对应的语言形式，例如语法，它们很难控制。我们通过将对话响应生成模型建立在语法技能的教学库中来控制聊天机器人对话练习中的语法。我们还探索了这种控制如何帮助学习者产生特定的语法。我们全面评估了语法控制对话响应生成的提示、微调和解码策略。在容忍轻微的响应质量损失时，战略性解码 Llama3 的表现优于 GPT-3.5。我们的模拟预测了语法控制的响应，以支持适应学习者熟练程度的语法习得。现有的语言学习聊天机器人和第二语言习得研究受益于这些功能。代码可在 GitHub 上找到。

Title: Unsupervised Translation of Emergent Communication

Authors: Ido Levy, Orr Paradise, Boaz Carmeli, Ron Meir, Shafi Goldwasser, Yonatan Belinkov
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.07552
Pdf URL: https://arxiv.org/pdf/2502.07552
Copy Paste: [[2502.07552]] Unsupervised Translation of Emergent Communication(https://arxiv.org/abs/2502.07552)
Keywords: agent
Abstract: Emergent Communication (EC) provides a unique window into the language systems that emerge autonomously when agents are trained to jointly achieve shared goals. However, it is difficult to interpret EC and evaluate its relationship with natural languages (NL). This study employs unsupervised neural machine translation (UNMT) techniques to decipher ECs formed during referential games with varying task complexities, influenced by the semantic diversity of the environment. Our findings demonstrate UNMT's potential to translate EC, illustrating that task complexity characterized by semantic diversity enhances EC translatability, while higher task complexity with constrained semantic variability exhibits pragmatic EC, which, although challenging to interpret, remains suitable for translation. This research marks the first attempt, to our knowledge, to translate EC without the aid of parallel data.
摘要：新兴通信 (EC) 为了解语言系统提供了一个独特的窗口，当代理被训练共同实现共同目标时，这些语言系统会自主出现。然而，解释 EC 并评估其与自然语言 (NL) 的关系却很困难。本研究采用无监督神经机器翻译 (UNMT) 技术来解读在具有不同任务复杂性的参考游戏中形成的 EC，这些任务复杂性受环境语义多样性的影响。我们的研究结果证明了 UNMT 翻译 EC 的潜力，说明以语义多样性为特征的任务复杂性增强了 EC 的可译性，而具有受限语义可变性的更高任务复杂性则表现出务实的 EC，尽管解释起来具有挑战性，但仍然适合翻译。据我们所知，这项研究标志着首次尝试在不借助并行数据的情况下翻译 EC。

Title: O1 Embedder: Let Retrievers Think Before Action

Authors: Ruin Yan, Zheng Liu, Defu Lian
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.07555
Pdf URL: https://arxiv.org/pdf/2502.07555
Copy Paste: [[2502.07555]] O1 Embedder: Let Retrievers Think Before Action(https://arxiv.org/abs/2502.07555)
Keywords: language model, llm
Abstract: The growing power of large language models (LLMs) has revolutionized how people access and utilize information. Notably, LLMs excel at performing fine-grained data representation, which facilitates precise retrieval of information. They also generate high-quality answers based on external references, enabling the production of useful knowledge. The recent introduction of reasoning models, like OpenAI O1 and DeepSeek R1, marks another leap forward, highlighting LLMs' ability to think progressively before delivering final answers. This breakthrough significantly improves the ability to address complex tasks, e.g., coding and math proofs. Inspired by this progress, we aim to develop similar capabilities for retrieval models, which hold great promise for tackling critical challenges in the field, including multi-task retrieval, zero-shot retrieval, and tasks requiring intensive reasoning over complex relationships. With this motivation, we propose a novel approach called O1 Embedder, which generates useful thoughts for the input query before retrieving the target documents. To realize this objective, we overcome two technical challenges. First, we design a data synthesis workflow, creating training signals for O1 Embedder by generating initial thoughts from an LLM-expert and subsequently refining them using a retrieval committee. Second, we optimize the training process, enabling a pre-trained model to be jointly fine-tuned to generate retrieval thoughts via behavior cloning and perform dense retrieval through contrastive learning. Our approach is evaluated by comprehensive experiments, where substantial improvements are achieved across 12 popular datasets, spanning both in-domain and out-of-domain scenarios. These results highlight O1 Embedder's remarkable accuracy and generalizability, paving the way for the development of next-generation IR foundation models.
摘要：大型语言模型 (LLM) 的功能日益强大，彻底改变了人们访问和利用信息的方式。值得注意的是，LLM 擅长执行细粒度数据表示，这有助于精确检索信息。它们还根据外部参考生成高质量的答案，从而产生有用的知识。最近推出的推理模型（如 OpenAI O1 和 DeepSeek R1）标志着又一次飞跃，凸显了 LLM 在提供最终答案之前进行渐进式思考的能力。这一突破显著提高了处理复杂任务（例如编码和数学证明）的能力。受这一进展的启发，我们旨在为检索模型开发类似的功能，这些模型有望解决该领域的关键挑战，包括多任务检索、零样本检索以及需要对复杂关系进行深入推理的任务。出于这种动机，我们提出了一种名为 O1 Embedder 的新方法，它在对目标文档进行检索之前为输入查询生成有用的想法。为了实现这一目标，我们克服了两个技术难题。首先，我们设计了一个数据合成工作流程，通过从 LLM 专家那里生成初步想法，然后使用检索委员会对其进行改进，为 O1 Embedder 创建训练信号。其次，我们优化了训练过程，使预训练模型能够联合微调，以通过行为克隆生成检索想法，并通过对比学习执行密集检索。我们的方法通过全面的实验进行评估，在 12 个流行数据集上取得了显着的改进，涵盖了域内和域外场景。这些结果凸显了 O1 Embedder 卓越的准确性和通用性，为下一代 IR 基础模型的开发铺平了道路。

Title: We Can't Understand AI Using our Existing Vocabulary

Authors: John Hewitt, Robert Geirhos, Been Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.07586
Pdf URL: https://arxiv.org/pdf/2502.07586
Copy Paste: [[2502.07586]] We Can't Understand AI Using our Existing Vocabulary(https://arxiv.org/abs/2502.07586)
Keywords: llm
Abstract: This position paper argues that, in order to understand AI, we cannot rely on our existing vocabulary of human words. Instead, we should strive to develop neologisms: new words that represent precise human concepts that we want to teach machines, or machine concepts that we need to learn. We start from the premise that humans and machines have differing concepts. This means interpretability can be framed as a communication problem: humans must be able to reference and control machine concepts, and communicate human concepts to machines. Creating a shared human-machine language through developing neologisms, we believe, could solve this communication problem. Successful neologisms achieve a useful amount of abstraction: not too detailed, so they're reusable in many contexts, and not too high-level, so they convey precise information. As a proof of concept, we demonstrate how a "length neologism" enables controlling LLM response length, while a "diversity neologism" allows sampling more variable responses. Taken together, we argue that we cannot understand AI using our existing vocabulary, and expanding it through neologisms creates opportunities for both controlling and understanding machines better.
摘要：本立场文件认为，为了理解人工智能，我们不能依赖现有的人类词汇。相反，我们应该努力开发新词：代表我们想要教给机器的精确人类概念的新词，或者我们需要学习的机器概念的新词。我们从人类和机器具有不同概念的前提开始。这意味着可解释性可以被视为一个沟通问题：人类必须能够引用和控制机器概念，并将人类概念传达给机器。我们相信，通过开发新词创建共享的人机语言可以解决这一沟通问题。成功的新词实现了有用的抽象：不太详细，因此它们可以在许多情况下重复使用，也不太高级，因此它们传达精确的信息。作为概念证明，我们展示了“长度新词”如何能够控制 LLM 响应长度，而“多样性新词”如何允许采样更多可变的响应。总之，我们认为我们无法使用现有词汇来理解人工智能，而通过新词扩展它为更好地控制和理解机器创造了机会。

Title: DPO-Shift: Shifting the Distribution of Direct Preference Optimization

Authors: Xiliang Yang, Feng Jiang, Qianen Zhang, Lei Zhao, Xiao Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.07599
Pdf URL: https://arxiv.org/pdf/2502.07599
Copy Paste: [[2502.07599]] DPO-Shift: Shifting the Distribution of Direct Preference Optimization(https://arxiv.org/abs/2502.07599)
Keywords: language model
Abstract: Direct Preference Optimization (DPO) and its variants have become increasingly popular for aligning language models with human preferences. These methods aim to teach models to better distinguish between chosen (or preferred) and rejected (or dispreferred) responses. However, prior research has identified that the probability of chosen responses often decreases during training, and this phenomenon is known as likelihood displacement. To tackle this challenge, in this work we introduce DPO-Shift to controllably shift the distribution of the chosen probability. Then, we show that DPO-Shift exhibits a fundamental trade-off between improving the chosen probability and sacrificing the reward margin, as supported by both theoretical analysis and experimental validation. Furthermore, we demonstrate the superiority of DPO-Shift over DPO on downstream tasks such as MT-Bench and a designed win rate experiment. We believe this study shows that the likelihood displacement issue of DPO can be effectively mitigated with a simple, theoretically grounded solution. Our code is available at this https URL.
摘要：直接偏好优化 (DPO) 及其变体已越来越流行，用于将语言模型与人类偏好对齐。这些方法旨在教会模型更好地区分选择（或首选）和拒绝（或不首选）的响应。然而，先前的研究已经发现，在训练过程中，选择响应的概率通常会降低，这种现象被称为似然位移。为了应对这一挑战，我们在这项工作中引入了 DPO-Shift 来可控地改变所选概率的分布。然后，我们表明 DPO-Shift 在提高所选概率和牺牲奖励边际之间表现出根本的权衡，理论分析和实验验证都支持这一点。此外，我们证明了 DPO-Shift 在下游任务（例如 MT-Bench 和设计的胜率实验）上优于 DPO。我们相信这项研究表明，DPO 的似然位移问题可以通过一个简单的、理论上可行的解决方案得到有效缓解。我们的代码可在此 https URL 上找到。
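
The likelihood displacement mentioned in the abstract can be illustrated with the standard DPO loss, which depends only on the margin between the chosen and rejected log-probability ratios. The sketch below shows the plain DPO objective, not the DPO-Shift variant, whose exact form the abstract does not give. The log-probability values and `beta=0.1` are made-up numbers for illustration.

```python
import math


def dpo_loss(chosen_logp, rejected_logp, ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(margin))."""
    chosen_ratio = chosen_logp - ref_chosen_logp        # log pi/pi_ref for the chosen response
    rejected_ratio = rejected_logp - ref_rejected_logp  # log pi/pi_ref for the rejected response
    margin = beta * (chosen_ratio - rejected_ratio)
    return math.log1p(math.exp(-margin))                # equals -log(sigmoid(margin))


# Hypothetical log-probabilities; the reference model assigns -10 and -12.
before = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# The chosen log-probability drops (-10 -> -11), but the rejected one drops further.
after = dpo_loss(-11.0, -20.0, -10.0, -12.0)
print(before, after)
```

The loss falls even though the chosen response has become less likely, because only the margin enters the objective. This is the behavior the abstract calls likelihood displacement.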

Title: Tractable Transformers for Flexible Conditional Generation

Authors: Anji Liu, Xuejie Liu, Dayuan Zhao, Mathias Niepert, Yitao Liang, Guy Van den Broeck
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.07616
Pdf URL: https://arxiv.org/pdf/2502.07616
Copy Paste: [[2502.07616]] Tractable Transformers for Flexible Conditional Generation(https://arxiv.org/abs/2502.07616)
Keywords: language model, gpt
Abstract: Non-autoregressive (NAR) generative models are valuable because they can handle diverse conditional generation tasks in a more principled way than their autoregressive (AR) counterparts, which are constrained by sequential dependency requirements. Recent advancements in NAR models, such as diffusion language models, have demonstrated superior performance in unconditional generation compared to AR models (e.g., GPTs) of similar sizes. However, such improvements do not always lead to improved conditional generation performance. We show that a key reason for this gap is the difficulty in generalizing to conditional probability queries unseen during training. As a result, strong unconditional generation performance does not guarantee high-quality conditional generation. This paper proposes Tractable Transformers (Tracformer), a Transformer-based generative model that is more robust to different conditional generation tasks. Unlike existing models that rely solely on global contextual features derived from full inputs, Tracformers incorporate a sparse Transformer encoder to capture both local and global contextual information. This information is routed through a decoder for conditional generation. Empirical results demonstrate that Tracformers achieve state-of-the-art conditional generation performance on text modeling compared to recent diffusion and AR model baselines.
摘要：非自回归 (NAR) 生成模型很有价值，因为它们可以以比自回归 (AR) 模型更原则性的方式处理各种条件生成任务，而自回归 (AR) 模型则受到顺序依赖性要求的限制。NAR 模型的最新进展（例如扩散语言模型）已证明在无条件生成方面的表现优于类似大小的 AR 模型（例如 GPT）。然而，这种改进并不总能提高条件生成性能。我们表明，造成这种差距的一个关键原因是难以推广到训练期间未见过的条件概率查询。因此，强大的无条件生成性能并不能保证高质量的条件生成。本文提出了 Tractable Transformers (Tracformer)，这是一种基于 Transformer 的生成模型，对不同的条件生成任务更具鲁棒性。与仅依赖从完整输入中得出的全局上下文特征的现有模型不同，Tracformers 结合了稀疏 Transformer 编码器来捕获局部和全局上下文信息。此信息通过解码器路由以进行条件生成。实证结果表明，与最近的传播和 AR 模型基线相比，Tracformers 在文本建模上实现了最先进的条件生成性能。

Title: FoQA: A Faroese Question-Answering Dataset

Authors: Annika Simonsen, Dan Saattrup Nielsen, Hafsteinn Einarsson
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.07642
Pdf URL: https://arxiv.org/pdf/2502.07642
Copy Paste: [[2502.07642]] FoQA: A Faroese Question-Answering Dataset(https://arxiv.org/abs/2502.07642)
Keywords: language model, gpt, llm
Abstract: We present FoQA, a Faroese extractive question-answering (QA) dataset with 2,000 samples, created using a semi-automated approach combining Large Language Models (LLMs) and human validation. The dataset was generated from Faroese Wikipedia articles using GPT-4-turbo for initial QA generation, followed by question rephrasing to increase complexity and native speaker validation to ensure quality. We provide baseline performance metrics for FoQA across multiple models, including LLMs and BERT, demonstrating its effectiveness in evaluating Faroese QA performance. The dataset is released in three versions: a validated set of 2,000 samples, a complete set of all 10,001 generated samples, and a set of 2,395 rejected samples for error analysis.
摘要：我们介绍了 FoQA，这是一个包含 2,000 个样本的法罗语抽取式问答 (QA) 数据集，采用半自动化方法创建，结合了大型语言模型 (LLM) 和人工验证。该数据集由法罗语维基百科文章生成，使用 GPT-4-turbo 进行初始 QA 生成，然后重新措辞问题以增加复杂性，并进行母语人士验证以确保质量。我们为 FoQA 提供了跨多个模型（包括 LLM 和 BERT）的基线性能指标，证明了其在评估法罗语 QA 性能方面的有效性。该数据集有三个版本：一组经过验证的 2,000 个样本、一组完整的 10,001 个生成样本，以及一组用于错误分析的 2,395 个被拒绝的样本。

Title: Auto-Drafting Police Reports from Noisy ASR Outputs: A Trust-Centered LLM Approach

Authors: Param Kulkarni, Yingchi Liu, Hao-Ming Fu, Shaohua Yang, Isuru Gunasekara, Matt Peloquin, Noah Spitzer-Williams, Xiaotian Zhou, Xiaozhong Liu, Zhengping Ji, Yasser Ibrahim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.07677
Pdf URL: https://arxiv.org/pdf/2502.07677
Copy Paste: [[2502.07677]] Auto-Drafting Police Reports from Noisy ASR Outputs: A Trust-Centered LLM Approach(https://arxiv.org/abs/2502.07677)
Keywords: llm
Abstract: Achieving a delicate balance between fostering trust in law enforcement and protecting the rights of both officers and civilians continues to emerge as a pressing research and product challenge in the world today. In the pursuit of fairness and transparency, this study presents an innovative AI-driven system designed to generate police report drafts from complex, noisy, and multi-role dialogue data. Our approach intelligently extracts key elements of law enforcement interactions and includes them in the draft, producing structured narratives that are not only high in quality but also reinforce accountability and procedural clarity. This framework holds the potential to transform the reporting process, ensuring greater oversight, consistency, and fairness in future policing practices. A demonstration video of our system can be accessed at this https URL Y-kpCHNO/view?usp=sharing
摘要：在培养对执法部门的信任与保护警官和平民的权利之间取得微妙的平衡，继续成为当今世界面临的紧迫的研究和产品挑战。为了追求公平和透明，本研究提出了一种创新的人工智能驱动系统，旨在从复杂、嘈杂和多角色的对话数据中生成警察报告草稿。我们的方法可以智能地提取执法互动的关键要素并将其包含在草稿中，从而生成结构化的叙述，这些叙述不仅质量高，而且还加强了问责制和程序清晰度。该框架有可能改变报告流程，确保未来警务实践中更大的监督、一致性和公平性。我们系统的演示视频可以通过此 https URL Y-kpCHNO/view?usp=sharing 访问

Title: Large Language Models as Proxies for Theories of Human Linguistic Cognition

Authors: Imry Ziv, Nur Lan, Emmanuel Chemla, Roni Katzir
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.07687
Pdf URL: https://arxiv.org/pdf/2502.07687
Copy Paste: [[2502.07687]] Large Language Models as Proxies for Theories of Human Linguistic Cognition(https://arxiv.org/abs/2502.07687)
Keywords: language model, llm
Abstract: We consider the possible role of current large language models (LLMs) in the study of human linguistic cognition. We focus on the use of such models as proxies for theories of cognition that are relatively linguistically-neutral in their representations and learning but differ from current LLMs in key ways. We illustrate this potential use of LLMs as proxies for theories of cognition in the context of two kinds of questions: (a) whether the target theory accounts for the acquisition of a given pattern from a given corpus; and (b) whether the target theory makes a given typologically-attested pattern easier to acquire than another, typologically-unattested pattern. For each of the two questions we show, building on recent literature, how current LLMs can potentially be of help, but we note that at present this help is quite limited.
摘要：我们考虑了当前大型语言模型 (LLM) 在人类语言认知研究中可能发挥的作用。我们重点研究了如何使用这些模型作为认知理论的代理，这些理论在表示和学习方面相对语言中立，但在关键方面与当前的 LLM 不同。我们在两种问题的背景下说明了 LLM 作为认知理论代理的潜在用途：(a) 目标理论是否解释了从给定语料库中获取给定模式的过程；(b) 目标理论是否使给定的类型学证实模式比另一个类型学未经证实的模式更容易获得。对于这两个问题中的每一个，我们根据最近的文献展示了当前的 LLM 如何可能有所帮助，但我们注意到目前这种帮助非常有限。

Title: Making Language Models Robust Against Negation

Authors: MohammadHossein Rezaei, Eduardo Blanco
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.07717
Pdf URL: https://arxiv.org/pdf/2502.07717
Copy Paste: [[2502.07717]] Making Language Models Robust Against Negation(https://arxiv.org/abs/2502.07717)
Keywords: language model
Abstract: Negation has been a long-standing challenge for language models. Previous studies have shown that they struggle with negation in many natural language understanding tasks. In this work, we propose a self-supervised method to make language models more robust against negation. We introduce a novel task, Next Sentence Polarity Prediction (NSPP), and a variation of the Next Sentence Prediction (NSP) task. We show that BERT and RoBERTa further pre-trained on our tasks outperform the off-the-shelf versions on nine negation-related benchmarks. Most notably, our pre-training tasks yield between 1.8% and 9.1% improvement on CondaQA, a large question-answering corpus requiring reasoning over negation.
摘要：否定一直是语言模型面临的长期挑战。先前的研究表明，它们在许多自然语言理解任务中都难以应对否定。在这项工作中，我们提出了一种自监督方法，使语言模型对否定更具鲁棒性。我们引入了一项新任务，即下一句极性预测 (NSPP) 和下一句预测 (NSP) 任务的变体。我们表明，在我们的任务上进一步预训练的 BERT 和 RoBERTa 在九个与否定相关的基准上的表现优于现成的版本。最值得注意的是，我们的预训练任务在 CondaQA 上取得了 1.8% 到 9.1% 的改进，CondaQA 是一个需要推理否定的大型问答语料库。

Title: WHODUNIT: Evaluation benchmark for culprit detection in mystery stories

Authors: Kshitij Gupta
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.07747
Pdf URL: https://arxiv.org/pdf/2502.07747
Copy Paste: [[2502.07747]] WHODUNIT: Evaluation benchmark for culprit detection in mystery stories(https://arxiv.org/abs/2502.07747)
Keywords: language model, gpt, llm, prompt
Abstract: We present a novel dataset, WhoDunIt, to assess the deductive reasoning capabilities of large language models (LLMs) within narrative contexts. Constructed from open domain mystery novels and short stories, the dataset challenges LLMs to identify the perpetrator after reading and comprehending the story. To evaluate model robustness, we apply a range of character-level name augmentations, including original names, name swaps, and substitutions with well-known real and/or fictional entities from popular discourse. We further use various prompting styles to investigate the influence of prompting on deductive reasoning accuracy. We conduct an evaluation study with state-of-the-art models, specifically GPT-4o, GPT-4-turbo, and GPT-4o-mini, evaluated through multiple trials with majority response selection to ensure reliability. The results demonstrate that while LLMs perform reliably on unaltered texts, accuracy diminishes with certain name substitutions, particularly those with wide recognition. This dataset is publicly available here.
摘要：我们提出了一个新数据集 WhoDunIt，用于评估大型语言模型 (LLM) 在叙事语境中的演绎推理能力。该数据集由开放域神秘小说和短篇小说构成，要求 LLM 在阅读和理解故事后识别出肇事者。为了评估模型的稳健性，我们应用了一系列字符级名称增强，包括原始名称、名称交换以及用流行话语中众所周知的真实和/或虚构实体替换。我们进一步使用各种提示风格来研究提示对演绎推理准确性的影响。我们使用最先进的模型（特别是 GPT-4o、GPT-4-turbo 和 GPT-4o-mini）进行评估研究，通过多次试验进行评估，并选择多数响应以确保可靠性。结果表明，虽然 LLM 在未改变的文本上表现可靠，但某些名称替换（尤其是那些被广泛认可的名称替换）会降低准确性。此数据集可在此处公开获取。
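
The "multiple trials with majority response selection" step in the WhoDunIt abstract could look roughly like the sketch below. The normalization (lowercasing and trimming whitespace) and the tie-breaking rule (first answer seen wins) are assumptions of this example, not details from the paper.

```python
from collections import Counter


def majority_answer(answers):
    """Return the most frequent answer across repeated trials of one story."""
    # Normalize so that trivially different spellings count as the same vote.
    normalized = [a.strip().lower() for a in answers]
    # most_common keeps first-seen order among ties (Python 3.7+).
    return Counter(normalized).most_common(1)[0][0]


# Hypothetical model outputs from three trials on the same story.
trials = ["The Butler", "the butler ", "Colonel Mustard"]
print(majority_answer(trials))
```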

Title: Breaking Down Bias: On The Limits of Generalizable Pruning Strategies

Authors: Sibo Ma, Alejandro Salinas, Peter Henderson, Julian Nyarko
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2502.07771
Pdf URL: https://arxiv.org/pdf/2502.07771
Copy Paste: [[2502.07771]] Breaking Down Bias: On The Limits of Generalizable Pruning Strategies(https://arxiv.org/abs/2502.07771)
Keywords: language model, llm
Abstract: We employ model pruning to examine how LLMs conceptualize racial biases, and whether a generalizable mitigation strategy for such biases appears feasible. Our analysis yields several novel insights. We find that pruning can be an effective method to reduce bias without significantly increasing anomalous model behavior. Neuron-based pruning strategies generally yield better results than approaches pruning entire attention heads. However, our results also show that the effectiveness of either approach quickly deteriorates as pruning strategies become more generalized. For instance, a model that is trained on removing racial biases in the context of financial decision-making generalizes poorly to biases in commercial transactions. Overall, our analysis suggests that racial biases are only partially represented as a general concept within language models. The other part of these biases is highly context-specific, suggesting that generalizable mitigation strategies may be of limited effectiveness. Our findings have important implications for legal frameworks surrounding AI. In particular, they suggest that an effective mitigation strategy should include the allocation of legal responsibility to those who deploy models in a specific use case.
摘要：我们采用模型修剪来检查 LLM 如何概念化种族偏见，以及针对此类偏见的可推广缓解策略是否可行。我们的分析产生了一些新颖的见解。我们发现修剪是一种有效的方法，可以减少偏见，而不会显著增加异常模型行为。基于神经元的修剪策略通常比修剪整个注意力头的方法产生更好的结果。然而，我们的结果还表明，随着修剪策略变得更加普遍，这两种方法的有效性都会迅速下降。例如，在金融决策背景下训练消除种族偏见的模型很难推广到商业交易中的偏见。总体而言，我们的分析表明，种族偏见在语言模型中仅部分地表示为一般概念。这些偏见的另一部分是高度特定于上下文的，这表明可推广的缓解策略可能效果有限。我们的研究结果对围绕人工智能的法律框架具有重要意义。特别是，它们表明有效的缓解策略应包括对在特定用例中部署模型的人分配法律责任。

Title: Auditing Prompt Caching in Language Model APIs

Authors: Chenchen Gu, Xiang Lisa Li, Rohith Kuditipudi, Percy Liang, Tatsunori Hashimoto
Subjects: cs.CL, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2502.07776
Pdf URL: https://arxiv.org/pdf/2502.07776
Copy Paste: [[2502.07776]] Auditing Prompt Caching in Language Model APIs(https://arxiv.org/abs/2502.07776)
Keywords: language model, llm, prompt
Abstract: Prompt caching in large language models (LLMs) results in data-dependent timing variations: cached prompts are processed faster than non-cached prompts. These timing differences introduce the risk of side-channel timing attacks. For example, if the cache is shared across users, an attacker could identify cached prompts from fast API response times to learn information about other users' prompts. Because prompt caching may cause privacy leakage, transparency around the caching policies of API providers is important. To this end, we develop and conduct statistical audits to detect prompt caching in real-world LLM API providers. We detect global cache sharing across users in seven API providers, including OpenAI, resulting in potential privacy leakage about users' prompts. Timing variations due to prompt caching can also result in leakage of information about model architecture. Namely, we find evidence that OpenAI's embedding model is a decoder-only Transformer, which was previously not publicly known.
摘要：大型语言模型 (LLM) 中的提示缓存会导致数据相关的时间变化：缓存的提示比非缓存的提示处理得更快。这些时间差异带来了旁道时序攻击的风险。例如，如果缓存在用户之间共享，攻击者可以从快速的 API 响应时间中识别缓存的提示，以了解有关其他用户提示的信息。由于提示缓存可能会导致隐私泄露，因此 API 提供商的缓存策略透明度非常重要。为此，我们开发并进行了统计审计，以检测现实世界的 LLM API 提供商中的提示缓存。我们在包括 OpenAI 在内的七个 API 提供商中检测到用户之间的全局缓存共享，从而导致有关用户提示的潜在隐私泄露。由于提示缓存导致的时间变化还可能导致有关模型架构的信息泄露。也就是说，我们发现证据表明 OpenAI 的嵌入模型是一个仅解码器的 Transformer，这以前并不为公众所知。
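
The idea of a statistical audit for prompt caching can be sketched as a two-sample test on response times: one sample from prompts expected to hit the cache, one from prompts expected to miss. The abstract does not specify the test used, so a simple permutation test on the difference of means is swapped in here as a stand-in. The timing values are made up, and collecting real timings from an API is out of scope for this sketch.

```python
import random
from statistics import mean


def permutation_p_value(hit_times, miss_times, n_perm=2000, seed=0):
    """One-sided p-value for the hypothesis that hit_times are faster than miss_times."""
    rng = random.Random(seed)
    observed = mean(miss_times) - mean(hit_times)
    pooled = list(hit_times) + list(miss_times)
    k = len(hit_times)
    count = 0
    for _ in range(n_perm):
        # Under the null (no caching effect) the group labels are exchangeable.
        rng.shuffle(pooled)
        if mean(pooled[k:]) - mean(pooled[:k]) >= observed:
            count += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (count + 1) / (n_perm + 1)


# Hypothetical response times in seconds.
suspected_hits = [0.10, 0.11, 0.09, 0.10, 0.12, 0.10, 0.11, 0.09]
suspected_misses = [0.30, 0.31, 0.29, 0.33, 0.30, 0.28, 0.32, 0.31]
print(permutation_p_value(suspected_hits, suspected_misses))
```

A small p-value suggests that the suspected cache hits are systematically faster, which is the timing signal the abstract describes. A real audit would also need to control for network noise and repeated queries.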