2025-02-28

Title: Cognitive networks highlight differences and similarities in the STEM mindsets of human and LLM-simulated trainees, experts and academics

Authors: Edith Haim, Lars van den Bergh, Cynthia S. Q. Siew, Yoed N. Kenett, Daniele Marinazzo, Massimo Stella
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.19529
Pdf URL: https://arxiv.org/pdf/2502.19529
Copy Paste: [[2502.19529]] Cognitive networks highlight differences and similarities in the STEM mindsets of human and LLM-simulated trainees, experts and academics(https://arxiv.org/abs/2502.19529)
Keywords: language model, gpt, llm
Abstract: Understanding attitudes towards STEM means quantifying the cognitive and emotional ways in which individuals, and potentially large language models too, conceptualise such subjects. This study uses behavioural forma mentis networks (BFMNs) to investigate the STEM-focused mindset, i.e. ways of associating and perceiving ideas, of 177 human participants and 177 artificial humans simulated by GPT-3.5. Participants were split in 3 groups - trainees, experts and academics - to compare the influence of expertise level on their mindset. The results revealed that human forma mentis networks exhibited significantly higher clustering coefficients compared to GPT-3.5, indicating that human mindsets displayed a tendency to form and close triads of conceptual associations while recollecting STEM ideas. Human experts, in particular, demonstrated robust clustering coefficients, reflecting better integration of STEM concepts into their cognitive networks. In contrast, GPT-3.5 produced sparser mindsets. Furthermore, both human and GPT mindsets framed mathematics in neutral or positive terms, differently from STEM high schoolers, researchers and other large language models sampled in other works. This research contributes to understanding how mindset structure can provide cognitive insights about memory structure and machine limitations.
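Note: the clustering coefficient mentioned in the abstract measures how often two concepts linked to the same concept are also linked to each other (a closed triad). The sketch below is a minimal illustration on two hypothetical toy association networks; the words and links are invented for the example and are not data from the paper.

```python
from itertools import combinations

def local_clustering(adj, node):
    """Fraction of a node's neighbour pairs that are themselves linked."""
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # no pair of neighbours, so no triad can close
    closed = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
    return closed / (k * (k - 1) / 2)

def average_clustering(adj):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)

# Hypothetical undirected networks stored as symmetric adjacency sets.
dense = {
    "math": {"logic", "proof", "numbers"},
    "logic": {"math", "proof"},
    "proof": {"math", "logic"},
    "numbers": {"math"},
}
sparse = {
    "math": {"logic", "proof", "numbers"},
    "logic": {"math"},
    "proof": {"math"},
    "numbers": {"math"},
}

print(average_clustering(dense))   # one closed triad: math-logic-proof
print(average_clustering(sparse))  # star shape, no closed triads
```

The star-shaped network has the same hub but no closed triads, which is the kind of sparser structure the abstract attributes to GPT-3.5.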

Title: Winning Big with Small Models: Knowledge Distillation vs. Self-Training for Reducing Hallucination in QA Agents

Authors: Ashley Lewis, Michael White, Jing Liu, Toshiaki Koike-Akino, Kieran Parsons, Ye Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.19545
Pdf URL: https://arxiv.org/pdf/2502.19545
Copy Paste: [[2502.19545]] Winning Big with Small Models: Knowledge Distillation vs. Self-Training for Reducing Hallucination in QA Agents(https://arxiv.org/abs/2502.19545)
Keywords: language model, gpt, llm, hallucination, agent
Abstract: The deployment of Large Language Models (LLMs) in customer support is constrained by hallucination (generating false information) and the high cost of proprietary models. To address these challenges, we propose a retrieval-augmented question-answering (QA) pipeline and explore how to balance human input and automation. Using a dataset of questions about a Samsung Smart TV user manual, we demonstrate that synthetic data generated by LLMs outperforms crowdsourced data in reducing hallucination in finetuned models. We also compare self-training (fine-tuning models on their own outputs) and knowledge distillation (fine-tuning on stronger models' outputs, e.g., GPT-4o), and find that self-training achieves comparable hallucination reduction. We conjecture that this surprising finding can be attributed to increased exposure bias issues in the knowledge distillation case and support this conjecture with post hoc analysis. We also improve robustness to unanswerable questions and retrieval failures with contextualized "I don't know" responses. These findings show that scalable, cost-efficient QA systems can be built using synthetic data and self-training with open-source models, reducing reliance on proprietary tools or costly human annotations.

Title: When Large Language Models Meet Speech: A Survey on Integration Approaches

Authors: Zhengdong Yang, Shuichiro Shimizu, Yahan Yu, Chenhui Chu
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2502.19548
Pdf URL: https://arxiv.org/pdf/2502.19548
Copy Paste: [[2502.19548]] When Large Language Models Meet Speech: A Survey on Integration Approaches(https://arxiv.org/abs/2502.19548)
Keywords: language model, llm
Abstract: Recent advancements in large language models (LLMs) have spurred interest in expanding their application beyond text-based tasks. A large number of studies have explored integrating other modalities with LLMs, notably speech modality, which is naturally related to text. This paper surveys the integration of speech with LLMs, categorizing the methodologies into three primary approaches: text-based, latent-representation-based, and audio-token-based integration. We also demonstrate how these methods are applied across various speech-related applications and highlight the challenges in this field to offer inspiration for

Title: Distill Not Only Data but Also Rewards: Can Smaller Language Models Surpass Larger Ones?

Authors: Yudi Zhang, Lu Wang, Meng Fang, Yali Du, Chenghua Huang, Jun Wang, Qingwei Lin, Mykola Pechenizkiy, Dongmei Zhang, Saravan Rajmohan, Qi Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.19557
Pdf URL: https://arxiv.org/pdf/2502.19557
Copy Paste: [[2502.19557]] Distill Not Only Data but Also Rewards: Can Smaller Language Models Surpass Larger Ones?(https://arxiv.org/abs/2502.19557)
Keywords: language model, llm
Abstract: Distilling large language models (LLMs) typically involves transferring the teacher model's responses through supervised fine-tuning (SFT). However, this approach neglects the potential to distill both data (output content) and reward signals (quality evaluations). Extracting reliable reward signals directly from teacher models is challenging, as LLMs are optimized for generation rather than evaluation, often resulting in biased or inconsistent assessments. To address this limitation, we propose a novel distillation pipeline that transfers both responses and rewards. Our method generates pseudo-rewards through a self-supervised mechanism that leverages the inherent structure of both teacher and student responses, enabling reward learning without explicit external evaluation. The reward model subsequently guides reinforcement learning (RL), allowing iterative refinement of the student model after an SFT warm-up phase. Experiments on GSM8K and MMLU-PRO demonstrate that our method consistently outperforms traditional SFT-based approaches, enabling student models to surpass the performance of their teachers. This work highlights the potential for scalable, efficient distillation through structured self-supervised reward learning, reducing dependence on external reward supervision.

Title: Stay Focused: Problem Drift in Multi-Agent Debate

Authors: Jonas Becker, Lars Benedikt Kaesberg, Andreas Stephan, Jan Philip Wahle, Terry Ruas, Bela Gipp
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.19559
Pdf URL: https://arxiv.org/pdf/2502.19559
Copy Paste: [[2502.19559]] Stay Focused: Problem Drift in Multi-Agent Debate(https://arxiv.org/abs/2502.19559)
Keywords: language model, llm, agent
Abstract: Multi-agent debate - multiple instances of large language models discussing problems in turn-based interaction - has shown promise for solving knowledge and reasoning tasks. However, these methods show limitations, particularly when scaling them to longer reasoning chains. In this study, we unveil a new issue of multi-agent debate: discussions drift away from the initial problem over multiple turns. We define this phenomenon as problem drift and quantify its presence across ten tasks (i.e., three generative, three knowledge, three reasoning, and one instruction-following task). To identify the reasons for this issue, we perform a human study with eight experts on discussions suffering from problem drift, who find the most common issues are a lack of progress (35% of cases), low-quality feedback (26% of cases), and a lack of clarity (25% of cases). To systematically address the issue of problem drift, we propose DRIFTJudge, a method based on LLM-as-a-judge, to detect problem drift at test-time. We further propose DRIFTPolicy, a method to mitigate 31% of problem drift cases. Our study can be seen as a first step to understanding a key limitation of multi-agent debate, highlighting pathways for improving their effectiveness in the future.

Title: Do Large Language Models Know How Much They Know?

Authors: Gabriele Prato, Jerry Huang, Prasannna Parthasarathi, Shagun Sodhani, Sarath Chandar
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.19573
Pdf URL: https://arxiv.org/pdf/2502.19573
Copy Paste: [[2502.19573]] Do Large Language Models Know How Much They Know?(https://arxiv.org/abs/2502.19573)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have emerged as highly capable systems and are increasingly being integrated into various uses. However, the rapid pace of their deployment has outpaced a comprehensive understanding of their internal mechanisms and a delineation of their capabilities and limitations. A desired attribute of an intelligent system is its ability to recognize the scope of its own knowledge. To investigate whether LLMs embody this characteristic, we develop a benchmark designed to challenge these models to enumerate all information they possess on specific topics. This benchmark evaluates whether the models recall excessive, insufficient, or the precise amount of information, thereby indicating their awareness of their own knowledge. Our findings reveal that all tested LLMs, given sufficient scale, demonstrate an understanding of how much they know about specific topics. While different architectures exhibit varying rates of this capability's emergence, the results suggest that awareness of knowledge may be a generalizable attribute of LLMs. Further research is needed to confirm this potential and fully elucidate the underlying mechanisms.

Title: Where Are We? Evaluating LLM Performance on African Languages

Authors: Ife Adebara, Hawau Olamide Toyin, Nahom Tesfu Ghebremichael, AbdelRahim Elmadany, Muhammad Abdul-Mageed
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.19582
Pdf URL: https://arxiv.org/pdf/2502.19582
Copy Paste: [[2502.19582]] Where Are We? Evaluating LLM Performance on African Languages(https://arxiv.org/abs/2502.19582)
Keywords: language model, llm
Abstract: Africa's rich linguistic heritage remains underrepresented in NLP, largely due to historical policies that favor foreign languages and create significant data inequities. In this paper, we integrate theoretical insights on Africa's language landscape with an empirical evaluation using Sahara - a comprehensive benchmark curated from large-scale, publicly accessible datasets capturing the continent's linguistic diversity. By systematically assessing the performance of leading large language models (LLMs) on Sahara, we demonstrate how policy-induced data variations directly impact model effectiveness across African languages. Our findings reveal that while a few languages perform reasonably well, many Indigenous languages remain marginalized due to sparse data. Leveraging these insights, we offer actionable recommendations for policy reforms and inclusive data practices. Overall, our work underscores the urgent need for a dual approach - combining theoretical understanding with empirical evaluation - to foster linguistic diversity in AI for African communities.

Title: NeoBERT: A Next-Generation BERT

Authors: Lola Le Breton, Quentin Fournier, Mariam El Mezouar, Sarath Chandar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.19587
Pdf URL: https://arxiv.org/pdf/2502.19587
Copy Paste: [[2502.19587]] NeoBERT: A Next-Generation BERT(https://arxiv.org/abs/2502.19587)
Keywords: language model
Abstract: Recent innovations in architecture, pre-training, and fine-tuning have led to the remarkable in-context learning and reasoning abilities of large auto-regressive language models such as LLaMA and DeepSeek. In contrast, encoders like BERT and RoBERTa have not seen the same level of progress despite being foundational for many downstream NLP applications. To bridge this gap, we introduce NeoBERT, a next-generation encoder that redefines the capabilities of bidirectional models by integrating state-of-the-art advancements in architecture, modern data, and optimized pre-training methodologies. NeoBERT is designed for seamless adoption: it serves as a plug-and-play replacement for existing base models, relies on an optimal depth-to-width ratio, and leverages an extended context length of 4,096 tokens. Despite its compact 250M parameter footprint, it achieves state-of-the-art results on the massive MTEB benchmark, outperforming BERT large, RoBERTa large, NomicBERT, and ModernBERT under identical fine-tuning conditions. In addition, we rigorously evaluate the impact of each modification on GLUE and design a uniform fine-tuning and evaluation framework for MTEB. We release all code, data, checkpoints, and training scripts to accelerate research and real-world adoption.

Title: A City of Millions: Mapping Literary Social Networks At Scale

Authors: Sil Hamilton, Rebecca M. M. Hicke, David Mimno, Matthew Wilkens
Subjects: cs.CL, cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2502.19590
Pdf URL: https://arxiv.org/pdf/2502.19590
Copy Paste: [[2502.19590]] A City of Millions: Mapping Literary Social Networks At Scale(https://arxiv.org/abs/2502.19590)
Keywords: language model, prompt
Abstract: We release 70,509 high-quality social networks extracted from multilingual fiction and nonfiction narratives. We additionally provide metadata for ~30,000 of these texts (73% nonfiction and 27% fiction) written between 1800 and 1999 in 58 languages. This dataset provides information on historical social worlds at an unprecedented scale, including data for 1,192,855 individuals in 2,805,482 pair-wise relationships annotated for affinity and relationship type. We achieve this scale by automating previously manual methods of extracting social networks; specifically, we adapt an existing annotation task as a language model prompt, ensuring consistency at scale with the use of structured output. This dataset provides an unprecedented resource for the humanities and social sciences by providing data on cognitive models of social realities.

Title: Revisiting Word Embeddings in the LLM Era

Authors: Yash Mahajan, Matthew Freestone, Sathyanarayanan Aakur, Santu Karmaker
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.19607
Pdf URL: https://arxiv.org/pdf/2502.19607
Copy Paste: [[2502.19607]] Revisiting Word Embeddings in the LLM Era(https://arxiv.org/abs/2502.19607)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have recently shown remarkable advancement in various NLP tasks. As such, a popular trend has emerged lately where NLP researchers extract word/sentence/document embeddings from these large decoder-only models and use them for various inference tasks with promising results. However, it is still unclear whether the performance improvement of LLM-induced embeddings is merely because of scale or whether underlying embeddings they produce significantly differ from classical encoding models like Word2Vec, GloVe, Sentence-BERT (SBERT) or Universal Sentence Encoder (USE). This is the central question we investigate in the paper by systematically comparing classical decontextualized and contextualized word embeddings with the same for LLM-induced embeddings. Our results show that LLMs cluster semantically related words more tightly and perform better on analogy tasks in decontextualized settings. However, in contextualized settings, classical models like SimCSE often outperform LLMs in sentence-level similarity assessment tasks, highlighting their continued relevance for fine-grained semantics.

Title: Evaluation of Hate Speech Detection Using Large Language Models and Geographical Contextualization

Authors: Anwar Hossain Zahid, Monoshi Kumar Roy, Swarna Das
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.19612
Pdf URL: https://arxiv.org/pdf/2502.19612
Copy Paste: [[2502.19612]] Evaluation of Hate Speech Detection Using Large Language Models and Geographical Contextualization(https://arxiv.org/abs/2502.19612)
Keywords: language model, llm
Abstract: The proliferation of hate speech on social media is one of the serious issues that is bringing huge impacts to society: an escalation of violence, discrimination, and social fragmentation. The problem of detecting hate speech is intrinsically multifaceted due to cultural, linguistic, and contextual complexities and adversarial manipulations. In this study, we systematically investigate the performance of LLMs on detecting hate speech across multilingual datasets and diverse geographic contexts. Our work presents a new evaluation framework in three dimensions: binary classification of hate speech, geography-aware contextual detection, and robustness to adversarially generated text. Using a dataset of 1,000 comments from five diverse regions, we evaluate three state-of-the-art LLMs: Llama2 (13b), Codellama (7b), and DeepSeekCoder (6.7b). Codellama had the best binary classification recall with 70.6% and an F1-score of 52.18%, whereas DeepSeekCoder had the best performance in geographic sensitivity, correctly detecting 63 out of 265 locations. The tests for adversarial robustness also showed significant weaknesses; Llama2 misclassified 62.5% of manipulated samples. These results bring to light the trade-offs between accuracy, contextual understanding, and robustness in the current versions of LLMs. This work has thus set the stage for developing contextually aware, multilingual hate speech detection systems by underlining key strengths and limitations, therefore offering actionable insights for future research and real-world applications.
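Note: the abstract reports recall and F1 for Codellama but not precision. Since F1 = 2PR / (P + R), precision can be recovered as P = F1 * R / (2R - F1). The sketch below does that arithmetic and also converts the 63-of-265 location result to a rate. It assumes the reported recall and F1 refer to the same class and the same evaluation set; the resulting precision is an implied value, not a figure stated by the authors.

```python
def precision_from_f1(recall, f1):
    """Solve F1 = 2PR / (P + R) for P, given R and F1."""
    denom = 2 * recall - f1
    if denom <= 0:
        raise ValueError("F1 must be smaller than 2 * recall")
    return f1 * recall / denom

# Figures taken from the abstract.
recall = 0.706   # Codellama binary classification recall
f1 = 0.5218      # Codellama F1-score

precision = precision_from_f1(recall, f1)
geo_rate = 63 / 265  # DeepSeekCoder: locations correctly detected

print(f"implied precision: {precision:.3f}")        # 0.414
print(f"geographic detection rate: {geo_rate:.3f}")  # 0.238
```

Under these assumptions, a recall of about 71% paired with a precision near 41% would indicate that the model flags hate speech liberally, and the best geographic result corresponds to roughly one location in four.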

Title: Is Your Paper Being Reviewed by an LLM? A New Benchmark Dataset and Approach for Detecting AI Text in Peer Review

Authors: Sungduk Yu, Man Luo, Avinash Madusu, Vasudev Lal, Phillip Howard
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.19614
Pdf URL: https://arxiv.org/pdf/2502.19614
Copy Paste: [[2502.19614]] Is Your Paper Being Reviewed by an LLM? A New Benchmark Dataset and Approach for Detecting AI Text in Peer Review(https://arxiv.org/abs/2502.19614)
Keywords: language model, llm
Abstract: Peer review is a critical process for ensuring the integrity of published scientific research. Confidence in this process is predicated on the assumption that experts in the relevant domain give careful consideration to the merits of manuscripts which are submitted for publication. With the recent rapid advancements in large language models (LLMs), a new risk to the peer review process is that negligent reviewers will rely on LLMs to perform the often time consuming process of reviewing a paper. However, there is a lack of existing resources for benchmarking the detectability of AI text in the domain of peer review. To address this deficiency, we introduce a comprehensive dataset containing a total of 788,984 AI-written peer reviews paired with corresponding human reviews, covering 8 years of papers submitted to each of two leading AI research conferences (ICLR and NeurIPS). We use this new resource to evaluate the ability of 18 existing AI text detection algorithms to distinguish between peer reviews written by humans and different state-of-the-art LLMs. Motivated by the shortcomings of existing methods, we propose a new detection approach which surpasses existing methods in the identification of AI written peer reviews. Our work reveals the difficulty of identifying AI-generated text at the individual peer review level, highlighting the urgent need for new tools and methods to detect this unethical use of generative AI.

Title: Weaker LLMs' Opinions Also Matter: Mixture of Opinions Enhances LLM's Mathematical Reasoning

Authors: Yanan Chen, Ali Pesaranghader, Tanmana Sadhu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.19622
Pdf URL: https://arxiv.org/pdf/2502.19622
Copy Paste: [[2502.19622]] Weaker LLMs' Opinions Also Matter: Mixture of Opinions Enhances LLM's Mathematical Reasoning(https://arxiv.org/abs/2502.19622)
Keywords: language model, gpt, llm, prompt, chain-of-thought, agent
Abstract: Recent advances in Large Language Models (LLMs) have raised interest in their formal reasoning capabilities, particularly in mathematics. While closed LLMs like GPT-4 perform well on mathematical benchmarks, e.g., GSM8K, it remains unclear whether small to medium-sized open LLMs can achieve similar performance, questioning their reliability. To close this gap, we propose a post-training approach leveraging a mixture of opinions (MoO) from weaker ancillary LLMs to enhance a (relatively) stronger LLM's reasoning. For that, each post-training sample is augmented with Chain-of-Thought (CoT) reasoning steps and answers from ancillary LLMs, enabling the main LLM to learn from diverse perspectives. We compare MoO with standard supervised fine-tuning (SFT), few-shot prompting, and the Mixture of Agents (MoA) method on mathematical reasoning benchmarks. Our results show that incorporating weaker LLMs' opinions improves mathematical reasoning by an average of 5%, highlighting the value of diverse perspectives in reasoning tasks.

Title: Med-RLVR: Emerging Medical Reasoning from a 3B base model via reinforcement Learning

Authors: Sheng Zhang, Qianchu Liu, Guanghui Qin, Tristan Naumann, Hoifung Poon
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.19655
Pdf URL: https://arxiv.org/pdf/2502.19655
Copy Paste: [[2502.19655]] Med-RLVR: Emerging Medical Reasoning from a 3B base model via reinforcement Learning(https://arxiv.org/abs/2502.19655)
Keywords: language model
Abstract: Reinforcement learning from verifiable rewards (RLVR) has recently gained attention for its ability to elicit self-evolved reasoning capabilities from base language models without explicit reasoning supervision, as demonstrated by DeepSeek-R1. While prior work on RLVR has primarily focused on mathematical and coding domains, its applicability to other tasks and domains remains unexplored. In this work, we investigate whether medical reasoning can emerge from RLVR. We introduce Med-RLVR as an initial study of RLVR in the medical domain leveraging medical multiple-choice question answering (MCQA) data as verifiable labels. Our results demonstrate that RLVR is not only effective for math and coding but also extends successfully to medical question answering. Notably, Med-RLVR achieves performance comparable to traditional supervised fine-tuning (SFT) on in-distribution tasks while significantly improving out-of-distribution generalization, with an 8-point accuracy gain. Further analysis of training dynamics reveals that, with no explicit reasoning supervision, reasoning emerges from the 3B-parameter base model. These findings underscore the potential of RLVR in domains beyond math and coding, opening new avenues for its application in knowledge-intensive fields such as medicine.

Title: Investigating Neurons and Heads in Transformer-based LLMs for Typographical Errors

Authors: Kohei Tsuji, Tatsuya Hiraoka, Yuchang Cheng, Eiji Aramaki, Tomoya Iwakura
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.19669
Pdf URL: https://arxiv.org/pdf/2502.19669
Copy Paste: [[2502.19669]] Investigating Neurons and Heads in Transformer-based LLMs for Typographical Errors(https://arxiv.org/abs/2502.19669)
Keywords: llm
Abstract: This paper investigates how LLMs encode inputs with typos. We hypothesize that specific neurons and attention heads recognize typos and fix them internally using local and global contexts. We introduce a method to identify typo neurons and typo heads that work actively when inputs contain typos. Our experimental results suggest the following: 1) LLMs can fix typos with local contexts when the typo neurons in either the early or late layers are activated, even if those in the other are not. 2) Typo neurons in the middle layers are responsible for the core of typo-fixing with global contexts. 3) Typo heads fix typos by widely considering the context not focusing on specific tokens. 4) Typo neurons and typo heads work not only for typo-fixing but also for understanding general contexts.

Title: GRACE: A Granular Benchmark for Evaluating Model Calibration against Human Calibration

Authors: Yoo Yeon Sung, Eve Fleisig, Yu Hou, Ishan Upadhyay, Jordan Lee Boyd-Graber
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.19684
Pdf URL: https://arxiv.org/pdf/2502.19684
Copy Paste: [[2502.19684]] GRACE: A Granular Benchmark for Evaluating Model Calibration against Human Calibration(https://arxiv.org/abs/2502.19684)
Keywords: language model
Abstract: Language models are often miscalibrated, leading to confidently incorrect answers. We introduce GRACE, a benchmark for language model calibration that incorporates comparison with human calibration. GRACE consists of question-answer pairs, in which each question contains a series of clues that gradually become easier, all leading to the same answer; models must answer correctly as early as possible as the clues are revealed. This setting permits granular measurement of model calibration based on how early, accurately, and confidently a model answers. After collecting these questions, we host live human vs. model competitions to gather 1,749 data points on human and model teams' timing, accuracy, and confidence. We propose a metric, CalScore, that uses GRACE to analyze model calibration errors and identify types of model miscalibration that differ from human behavior. We find that although humans are less accurate than models, humans are generally better calibrated. Since state-of-the-art models struggle on GRACE, it effectively evaluates progress on improving model calibration.
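Note: the abstract does not define CalScore, so it is not reproduced here. As a generic illustration of what "miscalibrated" means, the sketch below computes the standard expected calibration error (ECE): answers are grouped by stated confidence, and each group's average confidence is compared with its accuracy. The confidence values and outcomes are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted mean gap between confidence and accuracy across bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += len(bucket) / total * abs(avg_conf - accuracy)
    return ece

# Hypothetical answers: confident but mostly wrong vs. matched confidence.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, False, False, False]
)
calibrated = expected_calibration_error(
    [0.75, 0.75, 0.75, 0.75], [True, True, True, False]
)
print(overconfident, calibrated)
```

A system can be accurate yet poorly calibrated, or less accurate yet well calibrated, which matches the abstract's observation that humans are less accurate than models but generally better calibrated.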

Title: Sensing and Steering Stereotypes: Extracting and Applying Gender Representation Vectors in LLMs

Authors: Hannah Cyberey, Yangfeng Ji, David Evans
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2502.19721
Pdf URL: https://arxiv.org/pdf/2502.19721
Copy Paste: [[2502.19721]] Sensing and Steering Stereotypes: Extracting and Applying Gender Representation Vectors in LLMs(https://arxiv.org/abs/2502.19721)
Keywords: language model, llm
Abstract: Large language models (LLMs) are known to perpetuate stereotypes and exhibit biases. Various strategies have been proposed to mitigate potential harms that may result from these biases, but most work studies biases in LLMs as a black-box problem without considering how concepts are represented within the model. We adapt techniques from representation engineering to study how the concept of "gender" is represented within LLMs. We introduce a new method that extracts concept representations via probability weighting without labeled data and efficiently selects a steering vector for measuring and manipulating the model's representation. We also present a projection-based method that enables precise steering of model predictions and demonstrate its effectiveness in mitigating gender bias in LLMs.
摘要：大型语言模型（LLM）已知可以永久刻板印象和表现出偏见。已经提出了各种策略来减轻这些偏见可能造成的潜在危害，但是LLMS中的大多数工作研究都是黑盒问题，而无需考虑模型中概念的代表。我们从代表工程中调整技术来研究LLM中“性别”的概念。我们引入了一种新方法，该方法通过概率加权提取概念表示，而无需标记数据，并有效地选择了转向向量来测量和操纵模型的表示。我们还提出了一种基于投影的方法，该方法可以精确地转向模型预测，并证明其在减轻LLMS性别偏见方面的有效性。
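The projection-based steering idea can be illustrated with plain vector arithmetic. This is a minimal sketch under stated assumptions, not the authors' implementation: `hidden` stands for one hidden-state vector, `direction` for an already extracted steering vector, and `target` for the desired coefficient along that direction; all three names are hypothetical.

```python
import math


def steer_by_projection(hidden, direction, target=0.0):
    """Set the component of `hidden` along `direction` to `target`.

    Illustrative only: the part of `hidden` orthogonal to the steering
    direction is left untouched.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be a non-zero vector")
    unit = [d / norm for d in direction]
    # Current coefficient of the hidden state along the steering direction.
    coeff = sum(h * u for h, u in zip(hidden, unit))
    # Shift along the unit direction so the coefficient becomes `target`.
    return [h + (target - coeff) * u for h, u in zip(hidden, unit)]
```

With `target=0.0` the concept direction is projected out entirely; a non-zero target moves the representation to a chosen point along that direction instead.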

Title: Few-Shot Multilingual Open-Domain QA from 5 Examples

Authors: Fan Jiang, Tom Drummond, Trevor Cohn
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2502.19722
Pdf URL: https://arxiv.org/pdf/2502.19722
Copy Paste: [[2502.19722]] Few-Shot Multilingual Open-Domain QA from 5 Examples(https://arxiv.org/abs/2502.19722)
Keywords: language model, llm, prompt
Abstract: Recent approaches to multilingual open-domain question answering (MLODQA) have achieved promising results given abundant language-specific training data. However, the considerable annotation cost limits the application of these methods for underrepresented languages. We introduce a few-shot learning approach to synthesise large-scale multilingual data from large language models (LLMs). Our method begins with large-scale self-supervised pre-training using WikiData, followed by training on high-quality synthetic multilingual data generated by prompting LLMs with few-shot supervision. The final model, FsModQA, significantly outperforms existing few-shot and supervised baselines in MLODQA and cross-lingual and monolingual retrieval. We further show our method can be extended for effective zero-shot adaptation to new languages through a cross-lingual prompting strategy with only English-supervised data, making it a general and applicable solution for MLODQA tasks without costly large-scale annotation.
摘要：鉴于大量特定于语言的培训数据，多语言开放域问答答案的最新方法已取得了令人鼓舞的结果。但是，大量注释成本限制了这些方法在代表性不足的语言中的应用。我们介绍了一种\ emph {少数学习}方法，以合成大型语言模型（LLM）的大规模多语言数据。我们的方法始于使用Wikidata进行的大规模自我监督预训练，然后培训高质量的合成多语言数据，该数据通过促使LLMS几乎没有射击监督而生成的高质量合成多语言数据。最终模型\ textsc {fsmodqa}明显优于MLODQA中的现有少量和监督基线，以及跨语言和单语检索。我们进一步表明，我们的方法可以扩展，以通过仅使用英语监督数据的\ emph {跨语性提示}策略对新语言进行有效的零射击适应，从而使其成为MLODQA任务的一般且适用的解决方案，而无需昂贵的大规模注释。

Title: CNsum:Automatic Summarization for Chinese News Text

Authors: Yu Zhao, Songping Huang, Dongsheng Zhou, Zhaoyun Ding, Fei Wang, Aixin Nian
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.19723
Pdf URL: https://arxiv.org/pdf/2502.19723
Copy Paste: [[2502.19723]] CNsum:Automatic Summarization for Chinese News Text(https://arxiv.org/abs/2502.19723)
Keywords: language model
Abstract: Obtaining valuable information from massive data efficiently has become our research goal in the era of Big Data. Text summarization technology has been continuously developed to meet this demand. Recent work has also shown that transformer-based pre-trained language models have achieved great success on various tasks in Natural Language Processing (NLP). Aiming at the problem of Chinese news text summary generation and the application of the Transformer structure to Chinese, this paper proposes a Chinese news text summarization model (CNsum) based on the Transformer structure, and tests it on Chinese datasets such as THUCNews. The experimental results show that CNsum achieves better ROUGE scores than the baseline models, confirming that it outperforms them.
摘要：从大量数据中获得有价值的信息已成为我们在大数据时代的研究目标。文本摘要技术已不断开发以满足这一需求。最近的工作还表明，基于变压器的预训练的语言模型在自然语言处理（NLP）的各种任务上取得了巨大成功。本文旨在针对中国新闻文本摘要生成的问题以及变压器结构在中文上的应用，提出了基于变压器结构的中国新闻文本摘要模型（CNSUM），并在Thucnews等中国数据集上进行了对其进行测试。进行实验的结果表明，CNSUM比基线模型获得了更好的胭脂评分，该模型验证了模型的表现。

Title: Preference Learning Unlocks LLMs' Psycho-Counseling Skills

Authors: Mian Zhang, Shaun M. Eack, Zhiyu Zoey Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.19731
Pdf URL: https://arxiv.org/pdf/2502.19731
Copy Paste: [[2502.19731]] Preference Learning Unlocks LLMs' Psycho-Counseling Skills(https://arxiv.org/abs/2502.19731)
Keywords: language model, gpt, llm
Abstract: Applying large language models (LLMs) to assist in psycho-counseling is an emerging and meaningful approach, driven by the significant gap between patient needs and the availability of mental health support. However, current LLMs struggle to consistently provide effective responses to client speeches, largely due to the lack of supervision from high-quality real psycho-counseling data, whose content is typically inaccessible due to client privacy concerns. Furthermore, the quality of therapists' responses in available sessions can vary significantly based on their professional training and experience. Assessing the quality of therapists' responses remains an open challenge. In this work, we address these challenges by first proposing a set of professional and comprehensive principles to evaluate therapists' responses to client speeches. Using these principles, we create a preference dataset, PsychoCounsel-Preference, which contains 36k high-quality preference comparison pairs. This dataset aligns with the preferences of professional psychotherapists, providing a robust foundation for evaluating and improving LLMs in psycho-counseling. Experiments on reward modeling and preference learning demonstrate that PsychoCounsel-Preference is an excellent resource for LLMs to acquire essential skills for responding to clients in a counseling session. Our best-aligned model, PsychoCounsel-Llama3-8B, achieves an impressive win rate of 87% against GPT-4o. We release PsychoCounsel-Preference, PsychoCounsel-Llama3-8B and the reward model PsychoCounsel Llama3-8B-Reward to facilitate the research of psycho-counseling with LLMs at: this https URL.
摘要：采用大型语言模型（LLM）来协助心理咨询是一种新兴而有意义的方法，这是由于患者需求与心理健康支持的可用性之间的显着差距所致。但是，当前的LLM努力始终如一地对客户演讲提供有效的回应，这主要是由于缺乏高质量的真实心理信息数据的监督，由于客户隐私问题，其内容通常无法访问。此外，根据他们的专业培训和经验，治疗师的反应质量可能会大不相同。评估治疗师的反应质量仍然是一个悬而未决的挑战。在这项工作中，我们首先提出了一套专业和全面的原则来评估治疗师对客户演讲的反应，以解决这些挑战。使用这些原则，我们创建了一个偏好数据集，即Psychocounsel-Preference，其中包含36K高质量的优先比较对。该数据集符合专业心理治疗师的偏好，为评估和改善心理表现的LLM提供了强大的基础。关于奖励建模和偏好学习的实验表明，Psychocounsel-Preference是LLMS在咨询课程中获得对客户反应的基本技能的绝佳资源。我们最合适的模型Psychocounsel-llama3-8b，对GPT-4O的胜利率令人印象深刻。我们发布Psychocounsel-Preference，Psychocounsel-llama3-8b和奖励模型Psychocounsel llama3-8b-奖励，以促进与LLMS的心理局有关的研究：this HTTPS url。

Title: R1-T1: Fully Incentivizing Translation Capability in LLMs via Reasoning Learning

Authors: Minggui He, Yilun Liu, Shimin Tao, Yuanchang Luo, Hongyong Zeng, Chang Su, Li Zhang, Hongxia Ma, Daimeng Wei, Weibin Meng, Hao Yang, Boxing Chen, Osamu Yoshie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.19735
Pdf URL: https://arxiv.org/pdf/2502.19735
Copy Paste: [[2502.19735]] R1-T1: Fully Incentivizing Translation Capability in LLMs via Reasoning Learning(https://arxiv.org/abs/2502.19735)
Keywords: language model, llm, chain-of-thought
Abstract: Despite recent breakthroughs in reasoning-enhanced large language models (LLMs) like DeepSeek-R1, incorporating inference-time reasoning into machine translation (MT), where human translators naturally employ structured, multi-layered reasoning chain-of-thoughts (CoTs), is yet underexplored. Existing methods either design a fixed CoT tailored for a specific MT sub-task (e.g., literature translation), or rely on synthesizing CoTs unaligned with humans and supervised fine-tuning (SFT) prone to catastrophic forgetting, limiting their adaptability to diverse translation scenarios. This paper introduces R1-Translator (R1-T1), a novel framework to achieve inference-time reasoning for general MT via reinforcement learning (RL) with human-aligned CoTs comprising six common patterns. Our approach pioneers three innovations: (1) extending reasoning-based translation beyond MT sub-tasks to six languages and diverse tasks (e.g., legal/medical domain adaptation, idiom resolution); (2) formalizing six expert-curated CoT templates that mirror hybrid human strategies like context-aware paraphrasing and back translation; and (3) enabling self-evolving CoT discovery and anti-forgetting adaptation through RL with KL-constrained rewards. Experimental results indicate a steady translation performance improvement in 21 languages and 80 translation directions on Flores-101 test set, especially on the 15 languages unseen from training, with its general multilingual abilities preserved compared with plain SFT.
摘要：尽管最近在推理增强的大语言模型（LLM）（如DeepSeek-R1）中取得了突破，并将推理时间推理纳入机器翻译（MT），其中人类翻译人员自然采用了结构化的，多层的推理推理链（COTS），但仍被列出。现有方法要么设计针对特定MT子任务（例如文献翻译）量身定制的固定COT，要么依赖于与人类不结盟的COTS的合成和容易遭受灾难性遗忘的监督微调（SFT），从而将其适应性限制为各种翻译风景。本文介绍了R1转换器（R1-T1），这是一个新颖的框架，旨在通过加强学习（RL）以构成六个常见模式的人类一致的COTS来实现一般MT的推理时间推理。我们的方法开创了三项创新：（1）将基于推理的翻译扩展到MT子任务超出六种语言和各种任务（例如，法律/医疗领域适应性，分辨率）；（2）正式化六个专家策划的COT模板，这些模板反映了混合人类策略等诸如上下文感知和背部翻译的策略；（3）通过RL通过KL受限的奖励实现自我发展的COT发现和反遗嘱适应。实验结果表明，在Flores-101测试集上的21种语言和80个翻译方向的稳定翻译性能提高，尤其是在训练中看不见的15种语言上，与普通SFT相比，其一般多语言能力保留了一般的多语言能力。

Title: XCOMPS: A Multilingual Benchmark of Conceptual Minimal Pairs

Authors: Linyang He, Ercong Nie, Sukru Samet Dindar, Arsalan Firoozi, Adrian Florea, Van Nguyen, Corentin Puffay, Riki Shimizu, Haotian Ye, Jonathan Brennan, Helmut Schmid, Hinrich Schütze, Nima Mesgarani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.19737
Pdf URL: https://arxiv.org/pdf/2502.19737
Copy Paste: [[2502.19737]] XCOMPS: A Multilingual Benchmark of Conceptual Minimal Pairs(https://arxiv.org/abs/2502.19737)
Keywords: llm, prompt
Abstract: We introduce XCOMPS in this work, a multilingual conceptual minimal pair dataset covering 17 languages. Using this dataset, we evaluate LLMs' multilingual conceptual understanding through metalinguistic prompting, direct probability measurement, and neurolinguistic probing. By comparing base, instruction-tuned, and knowledge-distilled models, we find that: 1) LLMs exhibit weaker conceptual understanding for low-resource languages, and accuracy varies across languages despite being tested on the same concept sets. 2) LLMs excel at distinguishing concept-property pairs that are visibly different but exhibit a marked performance drop when negative pairs share subtle semantic similarities. 3) Instruction tuning improves performance in concept understanding but does not enhance internal competence; knowledge distillation can enhance internal competence in conceptual understanding for low-resource languages with limited gains in explicit task performance. 4) More morphologically complex languages yield lower concept understanding scores and require deeper layers for conceptual reasoning.
摘要：我们在这项工作中介绍XCOMP，这是一种涵盖17种语言的多语言概念最小对数据集。使用此数据集，我们通过金属语言提示，直接概率测量和神经语言探测来评估LLMS的多语言概念理解。通过比较基础，指导调节和知识延伸的模型，我们发现：1）LLMS对低资源语言表现出较弱的概念理解，尽管在同一概念集上进行了测试，但在语言中的准确性也有所不同。 2）LLM在区分明显不同的概念 - 陶艺对方面表现出色，但是当负对具有微妙的语义相似性时，表现出明显的性能下降。 3）教学调整可提高概念理解的性能，但不会增强内部能力；知识蒸馏可以增强对低资源语言的概念理解的内部能力，其明确任务绩效有限。 4）更多具有形态学复杂的语言会产生较低的概念理解分数，并需要更深的层面才能进行概念推理。

Title: HaLoRA: Hardware-aware Low-Rank Adaptation for Large Language Models Based on Hybrid Compute-in-Memory Architecture

Authors: Taiqiang Wu, Chenchen Ding, Wenyong Zhou, Yuxin Cheng, Xincheng Feng, Shuqi Wang, Chufan Shi, Zhengwu Liu, Ngai Wong
Subjects: cs.CL, cs.AR
Abstract URL: https://arxiv.org/abs/2502.19747
Pdf URL: https://arxiv.org/pdf/2502.19747
Copy Paste: [[2502.19747]] HaLoRA: Hardware-aware Low-Rank Adaptation for Large Language Models Based on Hybrid Compute-in-Memory Architecture(https://arxiv.org/abs/2502.19747)
Keywords: language model, llm
Abstract: Low-rank adaptation (LoRA) is a predominant parameter-efficient finetuning method to adapt large language models (LLMs) for downstream tasks. In this paper, we first propose to deploy the LoRA-finetuned LLMs on the hybrid compute-in-memory (CIM) architecture (i.e., pretrained weights onto RRAM and LoRA onto SRAM). To address performance degradation from RRAM's inherent noise, we design a novel Hardware-aware Low-rank Adaption (HaLoRA) method, aiming to train a LoRA branch that is both robust and accurate by aligning the training objectives under both ideal and noisy conditions. Experiments finetuning LLaMA 3.2 1B and 3B demonstrate HaLoRA's effectiveness across multiple reasoning tasks, achieving up to 22.7 improvement in average score while maintaining robustness at various noise levels.
摘要：低级适应性（LORA）是一种主要参数有效的鉴定方法，可适应大型语言模型（LLMS）的下游任务。在本文中，我们首先建议将lora-fintuned llms部署在混合中的内存（CIM）体系结构上（即，预处理的权重到SRAM上）。为了解决RRAM固有的噪音的性能降解，我们设计了一种新颖的硬件可感知的低排名适应性（Halora）方法，旨在通过在理想和嘈杂条件下对培训目标对准训练目标，旨在训练Lora分支，该分支既强大又准确。实验Finetuning Llama 3.2 1B和3B表明了Halora在多个推理任务中的有效性，在平均得分方面取得了高达22.7的提高，同时保持稳健性在各种噪声水平。

Title: Beneath the Surface: How Large Language Models Reflect Hidden Bias

Authors: Jinhao Pan, Chahat Raj, Ziyu Yao, Ziwei Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.19749
Pdf URL: https://arxiv.org/pdf/2502.19749
Copy Paste: [[2502.19749]] Beneath the Surface: How Large Language Models Reflect Hidden Bias(https://arxiv.org/abs/2502.19749)
Keywords: language model, llm
Abstract: The exceptional performance of Large Language Models (LLMs) often comes with the unintended propagation of social biases embedded in their training data. While existing benchmarks evaluate overt bias through direct term associations between bias concept terms and demographic terms, LLMs have become increasingly adept at avoiding biased responses, creating an illusion of neutrality. However, biases persist in subtler, contextually hidden forms that traditional benchmarks fail to capture. We introduce the Hidden Bias Benchmark (HBB), a novel dataset designed to assess hidden bias, where bias concepts are concealed within naturalistic, subtly framed contexts in real-world scenarios. We analyze six state-of-the-art LLMs, revealing that while models reduce bias in response to overt bias, they continue to reinforce biases in nuanced settings. Data, code, and results are available at this https URL.
摘要：大型语言模型（LLM）的出色表现通常带有嵌入其培训数据中的社会偏见的意外传播。尽管现有基准通过偏见概念术语和人口统计术语之间的直接术语关联来评估明显的偏见，但LLM越来越擅长避免偏见的响应，从而产生了中立的幻想。但是，偏见持续存在在微妙的，上下文隐藏的形式，传统基准无法捕获。我们介绍了隐藏的偏见基准（HBB），这是一个新颖的数据集，旨在评估隐藏的偏见，即在现实世界中的自然主义，巧妙的环境中隐藏了偏见概念。我们分析了六个最先进的LLM，表明尽管模型会响应明显的偏见而减少偏见，但它们仍在细微的环境中继续增强偏见。数据，代码和结果可在此HTTPS URL上找到。

Title: PolyPrompt: Automating Knowledge Extraction from Multilingual Language Models with Dynamic Prompt Generation

Authors: Nathan Roll
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.19756
Pdf URL: https://arxiv.org/pdf/2502.19756
Copy Paste: [[2502.19756]] PolyPrompt: Automating Knowledge Extraction from Multilingual Language Models with Dynamic Prompt Generation(https://arxiv.org/abs/2502.19756)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) showcase increasingly impressive English benchmark scores; however, their performance profiles remain inconsistent across multilingual settings. To address this gap, we introduce PolyPrompt, a novel, parameter-efficient framework for enhancing the multilingual capabilities of LLMs. Our method learns a set of trigger tokens for each language through a gradient-based search, identifying the input query's language and selecting the corresponding trigger tokens, which are prepended to the prompt during inference. We perform experiments on two ~1 billion parameter models, with evaluations on the global MMLU benchmark across fifteen typologically and resource diverse languages, demonstrating accuracy gains of 3.7%-19.9% compared to naive and translation-pipeline baselines.
摘要：大型语言模型（LLM）展示了越来越令人印象深刻的英语基准分数，但是在多语言环境中，它们的性能概况仍然不一致。为了解决这一差距，我们引入了PolyPrompt，这是一种新型参数有效的框架，用于增强LLM的多语言能力。我们的方法通过基于梯度的搜索来学习每种语言的一组触发令牌，识别输入查询的语言并选择相应的触发令牌，这些触发令牌在推理过程中已备份到提示。我们对两个约有10亿个参数模型进行了实验，对15种类型的全球MMLU基准进行了评估，并且与NAIVE和Translation-Pipeline Baselines相比，准确性增长了3.7％-19.9％。
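The inference-time step described in the abstract (identify the query language, then prepend that language's trigger tokens) might look roughly like the sketch below. Everything here is a hypothetical stand-in: the trigger strings are placeholders rather than learned tokens, and `detect_language` is a toy script-based detector, not the method used in the paper.

```python
# Hypothetical placeholder triggers; in the paper these are learned per
# language via gradient-based search, which this sketch does not perform.
TRIGGERS = {
    "en": ["<en_0>", "<en_1>"],
    "zh": ["<zh_0>", "<zh_1>"],
    "ar": ["<ar_0>", "<ar_1>"],
}


def detect_language(query):
    """Toy language identifier based on Unicode script ranges (assumption)."""
    for ch in query:
        if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs
            return "zh"
        if "\u0600" <= ch <= "\u06ff":  # Arabic block
            return "ar"
    return "en"  # fallback when no other script is detected


def build_prompt(query):
    """Prepend the trigger tokens for the detected language to the query."""
    lang = detect_language(query)
    return " ".join(TRIGGERS[lang] + [query])
```

Because only the prepended tokens differ per language, the base model's weights stay untouched, which is what makes such a scheme parameter-efficient.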

Title: EdiText: Controllable Coarse-to-Fine Text Editing with Diffusion Language Models

Authors: Che Hyun Lee, Heeseung Kim, Jiheum Yeom, Sungroh Yoon
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.19765
Pdf URL: https://arxiv.org/pdf/2502.19765
Copy Paste: [[2502.19765]] EdiText: Controllable Coarse-to-Fine Text Editing with Diffusion Language Models(https://arxiv.org/abs/2502.19765)
Keywords: language model
Abstract: We propose EdiText, a controllable text editing method that modifies the reference text to desired attributes at various scales. We integrate an SDEdit-based editing technique that allows for broad adjustments in the degree of text editing. Additionally, we introduce a novel fine-level editing method based on self-conditioning, which allows subtle control of reference text. While being capable of editing on its own, this fine-grained method, integrated with the SDEdit approach, enables EdiText to make precise adjustments within the desired range. EdiText demonstrates its controllability to robustly adjust reference text at a broad range of levels across various tasks, including toxicity control and sentiment control.
摘要：我们提出了EDITEXT，这是一种可控的文本编辑方法，该方法将参考文本修改为各种规模的所需属性。我们集成了一种基于SDEDIT的编辑技术，该技术允许对文本编辑程度进行广泛的调整。此外，我们引入了一种基于自我条件的新型精细编辑方法，该方法允许对参考文本的微妙控制。尽管能够自行编辑，但与SDEDIT方法集成的这种细颗粒方法使Editext能够在所需范围内进行精确的调整。 EDITEXT展示了其可控性，可以在各种任务（包括毒性控制和情感控制）的各个任务中稳健地调整参考文本。

Title: Do Retrieval-Augmented Language Models Adapt to Varying User Needs?

Authors: Peilin Wu, Xinlu Zhang, Wenhao Yu, Xingyu Liu, Xinya Du, Zhiyu Zoey Chen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.19779
Pdf URL: https://arxiv.org/pdf/2502.19779
Copy Paste: [[2502.19779]] Do Retrieval-Augmented Language Models Adapt to Varying User Needs?(https://arxiv.org/abs/2502.19779)
Keywords: language model
Abstract: Recent advancements in Retrieval-Augmented Language Models (RALMs) have demonstrated their efficacy in knowledge-intensive tasks. However, existing evaluation benchmarks often assume a single optimal approach to leveraging retrieved information, failing to account for varying user needs. This paper introduces a novel evaluation framework that systematically assesses RALMs under three user need cases (Context-Exclusive, Context-First, and Memory-First) across three distinct context settings: Context Matching, Knowledge Conflict, and Information Irrelevant. By varying both user instructions and the nature of retrieved information, our approach captures the complexities of real-world applications where models must adapt to diverse user requirements. Through extensive experiments on multiple QA datasets, including HotpotQA, DisentQA, and our newly constructed synthetic URAQ dataset, we find that restricting memory usage improves robustness in adversarial retrieval conditions but decreases peak performance with ideal retrieval results, and that model family dominates behavioral differences. Our findings highlight the necessity of user-centric evaluations in the development of retrieval-augmented systems and provide insights into optimizing model performance across varied retrieval contexts. We will release our code and URAQ dataset upon acceptance of the paper.
摘要：检索式语言模型（RALMS）的最新进步证明了它们在知识密集型任务中的功效。但是，现有的评估基准通常采用单一的最佳方法来利用检索到的信息，但无法解决用户需求的不同。本文介绍了一个新颖的评估框架，该框架系统地评估了三个用户的RALMS，需要案例，case-context-context-context-context-contect-context-trift和membory-tirst-actross三个不同的上下文设置：上下文匹配，知识冲突和信息无关。通过改变用户说明和检索信息的性质，我们的方法捕获了模型必须适应不同用户需求的现实世界应用程序的复杂性。通过在包括HOTPOTQA，DISENTQA和我们新构建的合成URAQ数据集在内的多个QA数据集上进行的大量实验，我们发现限制内存使用量可以改善对抗性检索条件的鲁棒性，但在峰值性能中降低了峰值性能，而降低了峰值的峰值性能，并降低了理想的结果和模型家族统治行为差异。我们的发现突出了以用户为中心的评估在开发检索功能系统中的必要性，并为在各种检索环境中优化模型性能提供了见解。接受论文后，我们将发布代码和URAQ数据集。

Title: Foot-In-The-Door: A Multi-turn Jailbreak for LLMs

Authors: Zixuan Weng, Xiaolong Jin, Jinyuan Jia, Xiangyu Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.19820
Pdf URL: https://arxiv.org/pdf/2502.19820
Copy Paste: [[2502.19820]] Foot-In-The-Door: A Multi-turn Jailbreak for LLMs(https://arxiv.org/abs/2502.19820)
Keywords: language model, llm, prompt
Abstract: Ensuring AI safety is crucial as large language models become increasingly integrated into real-world applications. A key challenge is jailbreak, where adversarial prompts bypass built-in safeguards to elicit harmful disallowed outputs. Inspired by psychological foot-in-the-door principles, we introduce FITD, a novel multi-turn jailbreak method that leverages the phenomenon where minor initial commitments lower resistance to more significant or more unethical this http URL approach progressively escalates the malicious intent of user queries through intermediate bridge prompts and aligns the model's response by itself to induce toxic responses. Extensive experimental results on two jailbreak benchmarks demonstrate that FITD achieves an average attack success rate of 94% across seven widely used models, outperforming existing state-of-the-art methods. Additionally, we provide an in-depth analysis of LLM self-corruption, highlighting vulnerabilities in current alignment strategies and emphasizing the risks inherent in multi-turn this http URL code is available at this https URL.
摘要：确保AI安全至关重要，因为大语言模型越来越多地集成到现实世界中。一个关键的挑战是越狱，对抗性提示旁路内置保障，以引起有害的不允许产出。受到心理脚踏实地原则的启发，我们引入了一种新型的多转越越狱方法，该方法利用了这种现象，其中较小的初始承诺降低了对更重要或更不道德的抵抗力，这种HTTP URL方法逐渐升级了恶意通过中间桥梁提示并通过模型自身对响应引起的响应来引起有毒的响应，从而逐渐升级了恶意的用户疑问意图。对两个越狱基准测试的广泛实验结果表明，FITD在七个广泛使用的模型中达到了94％的平均攻击成功率，表现优于现有的最新方法。此外，我们提供了对LLM自我腐败的深入分析，强调了当前的对齐策略中的漏洞，并强调了多转移中固有的风险此HTTP URL代码，可在此HTTPS URL上获得。

Title: MIND: Towards Immersive Psychological Healing with Multi-agent Inner Dialogue

Authors: Yujia Chen, Changsong Li, Yiming Wang, Qingqing Xiao, Nan Zhang, Zifan Kong, Peng Wang, Binyu Yan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.19860
Pdf URL: https://arxiv.org/pdf/2502.19860
Copy Paste: [[2502.19860]] MIND: Towards Immersive Psychological Healing with Multi-agent Inner Dialogue(https://arxiv.org/abs/2502.19860)
Keywords: language model, llm, chat, agent
Abstract: Mental health issues such as depression and anxiety are worsening in today's competitive society. Traditional healing approaches like counseling and chatbots fail to engage effectively; they often provide generic responses lacking emotional depth. Although large language models (LLMs) have the potential to create more human-like interactions, they still struggle to capture subtle emotions. This requires LLMs to be equipped with human-like adaptability and warmth. To fill this gap, we propose MIND (Multi-agent INner Dialogue), a novel paradigm that provides more immersive psychological healing environments. Considering the strong generative and role-playing ability of LLM agents, we predefine an interactive healing framework and assign LLM agents different roles within the framework to engage in interactive inner dialogues with users, thereby providing an immersive healing experience. We conduct extensive human experiments in various real-world healing dimensions, and find that MIND provides a more user-friendly experience than traditional paradigms. This demonstrates that MIND effectively leverages the significant potential of LLMs in psychological healing.
摘要：在当今竞争社会（例如抑郁和焦虑）中，心理健康问题正在恶化。传统的康复咨询和聊天机器人无法有效参与，它们通常会提供缺乏情感深度的通用反应。尽管大型语言模型（LLM）有可能建立更类似人类的互动，但它们仍然很难捕捉微妙的情绪。这需要LLMS配备人类的适应性和温暖。为了填补这一空白，我们提出了思想（多代理的内心对话），这是一种新颖的范式，可提供更身临其境的心理康复环境。考虑到LLM代理具有强大的生成性和角色扮演能力，我们预先定义了一个交互式的康复框架，并在框架中为LLM代理分配了不同的角色，以与用户进行交互式内部对话，从而提供沉浸式的康复体验。我们在各种现实世界的康复维度上进行了广泛的人类实验，并发现思维提供了比传统范式更具用户友好的体验。这表明，思维有效地利用了LLM在心理康复中的重要潜力。

Title: MMKE-Bench: A Multimodal Editing Benchmark for Diverse Visual Knowledge

Authors: Yuntao Du, Kailin Jiang, Zhi Gao, Chenrui Shi, Zilong Zheng, Siyuan Qi, Qing Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.19870
Pdf URL: https://arxiv.org/pdf/2502.19870
Copy Paste: [[2502.19870]] MMKE-Bench: A Multimodal Editing Benchmark for Diverse Visual Knowledge(https://arxiv.org/abs/2502.19870)
Keywords: language model, llm
Abstract: Knowledge editing techniques have emerged as essential tools for updating the factual knowledge of large language models (LLMs) and multimodal models (LMMs), allowing them to correct outdated or inaccurate information without retraining from scratch. However, existing benchmarks for multimodal knowledge editing primarily focus on entity-level knowledge represented as simple triplets, which fail to capture the complexity of real-world multimodal information. To address this issue, we introduce MMKE-Bench, a comprehensive MultiModal Knowledge Editing Benchmark, designed to evaluate the ability of LMMs to edit diverse visual knowledge in real-world scenarios. MMKE-Bench addresses these limitations by incorporating three types of editing tasks: visual entity editing, visual semantic editing, and user-specific editing. Besides, MMKE-Bench uses free-form natural language to represent and edit knowledge, offering a more flexible and effective format. The benchmark consists of 2,940 pieces of knowledge and 8,363 images across 33 broad categories, with evaluation questions automatically generated and human-verified. We assess five state-of-the-art knowledge editing methods on three prominent LMMs, revealing that no method excels across all criteria, and that visual and user-specific edits are particularly challenging. MMKE-Bench sets a new standard for evaluating the robustness of multimodal knowledge editing techniques, driving progress in this rapidly evolving field.
摘要：知识编辑技术已成为更新大型语言模型（LLMS）和多模式模型（LMMS）的事实知识的基本工具，从而使它们可以纠正过时的或不准确的信息，而无需从头开始重新审阅。但是，用于多模式知识编辑的现有基准主要集中于表示为简单三胞胎的实体级别知识，这些知识未能捕获现实世界多模式信息的复杂性。为了解决这个问题，我们介绍了MMKE-Bench，这是一种全面的多模式知识编辑基准，旨在评估LMM在现实世界中编辑各种视觉知识的能力。 MMKE基础通过合并三种类型的编辑任务来解决这些限制：视觉实体编辑，视觉语义编辑和特定于用户的编辑。此外，MMKE-Bench使用自由形式的自然语言来表示和编辑知识，提供更灵活，更有效的格式。该基准由33个广泛类别的2,940张知识和8,363张图像组成，评估问题自动产生和人为验证。我们评估了三种突出的LMM上的五种最先进的知识编辑方法，表明没有任何方法在所有标准上都擅长，并且视觉和用户特定的编辑特别具有挑战性。 MMKE BENCH设定了一个新标准，用于评估多模式知识编辑技术的鲁棒性，从而在这个快速发展的领域中推动了进度。

Title: Picking the Cream of the Crop: Visual-Centric Data Selection with Collaborative Agents

Authors: Zhenyu Liu, Yunxin Li, Baotian Hu, Wenhan Luo, Yaowei Wang, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.19917
Pdf URL: https://arxiv.org/pdf/2502.19917
Copy Paste: [[2502.19917]] Picking the Cream of the Crop: Visual-Centric Data Selection with Collaborative Agents(https://arxiv.org/abs/2502.19917)
Keywords: language model, llm, agent
Abstract: To improve Multimodal Large Language Models' (MLLMs) ability to process images and complex instructions, researchers predominantly curate large-scale visual instruction tuning datasets, which are either sourced from existing vision tasks or synthetically generated using LLMs and image descriptions. However, they often suffer from critical flaws, including misaligned instruction-image pairs and low-quality images. Such issues hinder training efficiency and limit performance improvements, as models waste resources on noisy or irrelevant data with minimal benefit to overall capability. To address this issue, we propose a Visual-Centric Selection approach via Agents Collaboration (ViSA), which centers on image quality assessment and image-instruction relevance evaluation. Specifically, our approach consists of 1) an image information quantification method via visual agents collaboration to select images with rich visual information, and 2) a visual-centric instruction quality assessment method to select high-quality instruction data related to high-quality images. Finally, we reorganize 80K instruction data from large open-source datasets. Extensive experiments demonstrate that ViSA outperforms or is comparable to current state-of-the-art models on seven benchmarks, using only 2.5% of the original data, highlighting the efficiency of our data selection approach. Moreover, we conduct ablation studies to validate the effectiveness of each component of our method. The code is available at this https URL.
摘要：为了改善多模式大型语言模型的（MLLM）处理图像和复杂说明的能力，研究人员主要策划大规模的视觉说明调谐数据集，这些数据集是从现有视觉任务中来自现有视觉任务或使用LLMS和图像描述的合成生成的。但是，它们通常会遭受关键缺陷，包括未对准的教学图像对和低质量的图像。此类问题阻碍了训练效率并限制了绩效的提高，因为在嘈杂或无关的数据上浪费资源对整体能力的好处最小。为了解决这个问题，我们通过\ textbf {a} gents Collaboration（Visa）提出了一个以\ textric-centric \ textbf {s}选举方法进行的\ textbf {vi} \ textbf {vi}，该方法以图像质量评估和图像结构相关性评估为中心。具体而言，我们的方法包括1）通过视觉代理协作的图像信息量化方法，以选择具有丰富视觉信息的图像，以及2）以视觉为中心的指令质量评估方法，以选择与高质量图像有关的高质量指导数据。最后，我们从大型开源数据集重新组织了80K指令数据。广泛的实验表明，签证的表现优于七个基准的当前最新模型，仅使用2.5％的原始数据，强调了我们数据选择方法的效率。此外，我们进行消融研究以验证方法的每个组成部分的有效性。该代码可在此HTTPS URL上找到。

Title: GeoEdit: Geometric Knowledge Editing for Large Language Models

Authors: Yujie Feng, Liming Zhan, Zexin Lu, Yongxin Xu, Xu Chu, Yasha Wang, Jiannong Cao, Philip S. Yu, Xiao-Ming Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.19953
Pdf URL: https://arxiv.org/pdf/2502.19953
Copy Paste: [[2502.19953]] GeoEdit: Geometric Knowledge Editing for Large Language Models(https://arxiv.org/abs/2502.19953)
Keywords: language model, llm
Abstract: Regular updates are essential for maintaining up-to-date knowledge in large language models (LLMs). Consequently, various model editing methods have been developed to update specific knowledge within LLMs. However, training-based approaches often struggle to effectively incorporate new knowledge while preserving unrelated general knowledge. To address this challenge, we propose a novel framework called Geometric Knowledge Editing (GeoEdit). GeoEdit utilizes the geometric relationships of parameter updates from fine-tuning to differentiate between neurons associated with new knowledge updates and those related to general knowledge perturbations. By employing a direction-aware knowledge identification method, we avoid updating neurons with directions approximately orthogonal to existing knowledge, thus preserving the model's generalization ability. For the remaining neurons, we integrate both old and new knowledge for aligned directions and apply a "forget-then-learn" editing strategy for opposite directions. Additionally, we introduce an importance-guided task vector fusion technique that filters out redundant information and provides adaptive neuron-level weighting, further enhancing model editing performance. Extensive experiments on two publicly available datasets demonstrate the superiority of GeoEdit over existing state-of-the-art methods.
摘要：定期更新对于维持大型语言模型（LLMS）的最新知识至关重要。因此，已经开发了各种模型编辑方法来更新LLM中的特定知识。但是，基于培训的方法通常很难有效地纳入新知识，同时保留无关的常识。为了应对这一挑战，我们提出了一个名为“几何知识编辑（Geoedit）”的新颖框架。 GeoEdit利用参数更新的几何关系从微调到与新知识更新相关的神经元和与一般知识扰动相关的神经元的区分。通过采用方向感知的知识识别方法，我们避免更新具有与现有知识正交的方向的神经元，从而保持模型的概括能力。对于剩余的神经元，我们将旧知识和新知识都整合为对齐方向，并针对相反的方向应用“忘记的学习”编辑策略。此外，我们引入了一种重要的任务矢量融合技术，该技术可以过滤冗余信息并提供自适应的神经元级别的加权，从而进一步增强模型编辑性能。在两个公开可用数据集上进行的广泛实验证明了地理上的优越性，而不是现有的最新方法。
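The direction-aware routing described in the abstract (skip near-orthogonal updates, integrate aligned ones, forget-then-learn for opposed ones) can be sketched with a cosine test. This is an illustrative assumption-laden sketch, not GeoEdit itself: the function names, the strategy labels, and the `ortho_tol` threshold are all hypothetical, and the abstract does not say how the comparison is actually computed.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def choose_edit_strategy(update, existing, ortho_tol=0.1):
    """Route one neuron's update by its angle to the existing-knowledge direction.

    ortho_tol is an assumed threshold on |cosine| for "approximately orthogonal".
    """
    c = cosine(update, existing)
    if abs(c) <= ortho_tol:
        return "skip"  # near-orthogonal: leave the neuron unchanged
    if c > 0:
        return "integrate"  # aligned: combine old and new knowledge
    return "forget_then_learn"  # opposed: erase the old direction, then learn
```

For example, `choose_edit_strategy([1, 0], [0, 1])` takes the skip branch because the two directions are exactly orthogonal.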

Title: Collaborative Stance Detection via Small-Large Language Model Consistency Verification

Authors: Yu Yan, Sheng Sun, Zixiang Tang, Teli Liu, Min Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.19954
Pdf URL: https://arxiv.org/pdf/2502.19954
Copy Paste: [[2502.19954]] Collaborative Stance Detection via Small-Large Language Model Consistency Verification(https://arxiv.org/abs/2502.19954)
Keywords: language model, llm
Abstract: Stance detection on social media aims to identify attitudes expressed in tweets towards specific targets. Current studies prioritize Large Language Models (LLMs) over Small Language Models (SLMs) due to the overwhelming performance improvement provided by LLMs. However, heavily relying on LLMs for stance detection, regardless of the cost, is impractical for real-world social media monitoring systems that require vast data analysis. To this end, we propose the Collaborative Stance Detection via Small-Large Language Model Consistency Verification (CoVer) framework, which enhances LLM utilization via context-shared batch reasoning and logical verification between LLM and SLM. Specifically, instead of processing each text individually, CoVer processes texts batch-by-batch, obtaining stance predictions and corresponding explanations via LLM reasoning in a shared context. Then, to exclude the bias caused by context noises, CoVer introduces the SLM for logical consistency verification. Finally, texts that repeatedly exhibit low logical consistency are classified using consistency-weighted aggregation of prior LLM stance predictions. Our experiments show that CoVer outperforms state-of-the-art methods across multiple benchmarks in the zero-shot setting, achieving 0.54 LLM queries per tweet while significantly enhancing performance. Our CoVer offers a more practical solution for LLM deployment for social media stance detection.

Title: Deterministic or probabilistic? The psychology of LLMs as random number generators

Authors: Javier Coronado-Blázquez
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.19965
Pdf URL: https://arxiv.org/pdf/2502.19965
Copy Paste: [[2502.19965]] Deterministic or probabilistic? The psychology of LLMs as random number generators(https://arxiv.org/abs/2502.19965)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have transformed text generation through inherently probabilistic context-aware mechanisms, mimicking human natural language. In this paper, we systematically investigate the performance of various LLMs when generating random numbers, considering diverse configurations such as different model architectures, numerical ranges, temperature, and prompt languages. Our results reveal that, despite their stochastic transformers-based architecture, these models often exhibit deterministic responses when prompted for random numerical outputs. In particular, we find significant differences when changing the model, as well as the prompt language, attributing this phenomenon to biases deeply embedded within the training data. Models such as DeepSeek-R1 can shed some light on the internal reasoning process of LLMs, despite arriving to similar results. These biases induce predictable patterns that undermine genuine randomness, as LLMs are nothing but reproducing our own human cognitive biases.

Title: The Lookahead Limitation: Why Multi-Operand Addition is Hard for LLMs

Authors: Tanja Baeumel, Josef van Genabith, Simon Ostermann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.19981
Pdf URL: https://arxiv.org/pdf/2502.19981
Copy Paste: [[2502.19981]] The Lookahead Limitation: Why Multi-Operand Addition is Hard for LLMs(https://arxiv.org/abs/2502.19981)
Keywords: language model, llm
Abstract: Autoregressive large language models (LLMs) exhibit impressive performance across various tasks but struggle with simple arithmetic, such as addition of two or more operands. We show that this struggle arises from LLMs' use of a simple one-digit lookahead heuristic, which works fairly well (but not perfect) for two-operand addition but fails in multi-operand cases, where the carry-over logic is more complex. Our probing experiments and digit-wise accuracy evaluation show that LLMs fail precisely where a one-digit lookahead is insufficient to account for cascading carries. We analyze the impact of tokenization strategies on arithmetic performance and show that all investigated models, regardless of tokenization, are inherently limited in the addition of multiple operands due to their reliance on a one-digit lookahead heuristic. Our findings reveal fundamental limitations that prevent LLMs from generalizing to more complex numerical reasoning.
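The one-digit lookahead heuristic named in the abstract above can be illustrated with a small sketch. This is one plausible reading of the heuristic, not the authors' probing setup: each output digit is computed from its own column plus a carry guessed from only the column immediately to its right. The function names `column_sums` and `lookahead_add` are hypothetical.

```python
def column_sums(operands, width):
    """Sum the digits of all operands column by column (index 0 = most significant)."""
    padded = [str(x).zfill(width) for x in operands]
    return [sum(int(p[i]) for p in padded) for i in range(width)]


def lookahead_add(operands):
    """Add non-negative integers left to right, guessing each carry from the
    single column immediately to the right (assumption: no cascading carries)."""
    # Enough room for the largest possible sum of len(operands) numbers.
    width = max(len(str(x)) for x in operands) + len(str(len(operands)))
    cols = column_sums(operands, width)
    digits = []
    for i in range(width):
        # The guess ignores any carry that the next column itself receives.
        carry_guess = cols[i + 1] // 10 if i + 1 < width else 0
        digits.append((cols[i] + carry_guess) % 10)
    return int("".join(map(str, digits)))
```

For 123 + 456 no column depends on a cascading carry, so the sketch returns the exact sum. For 145 + 155 the tens column sums to 9 and only overflows once the carry from the units column arrives, so the sketch returns 200 instead of 300. With three operands such as 35 + 35 + 35 the same failure mode appears.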

Title: Erasing Without Remembering: Safeguarding Knowledge Forgetting in Large Language Models

Authors: Huazheng Wang, Yongcheng Jing, Haifeng Sun, Yingjie Wang, Jingyu Wang, Jianxin Liao, Dacheng Tao
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.19982
Pdf URL: https://arxiv.org/pdf/2502.19982
Copy Paste: [[2502.19982]] Erasing Without Remembering: Safeguarding Knowledge Forgetting in Large Language Models(https://arxiv.org/abs/2502.19982)
Keywords: language model, llm
Abstract: In this paper, we explore machine unlearning from a novel dimension, by studying how to safeguard model unlearning in large language models (LLMs). Our goal is to prevent unlearned models from recalling any related memory of the targeted knowledge. We begin by uncovering a surprisingly simple yet overlooked fact: existing methods typically erase only the exact expressions of the targeted knowledge, leaving paraphrased or related information intact. To rigorously measure such oversights, we introduce UGBench, the first benchmark tailored for evaluating the generalisation performance across 13 state-of-the-art methods. UGBench reveals that unlearned models can still recall paraphrased answers and retain target facts in intermediate layers. To address this, we propose PERMU, a perturbation-based method that significantly enhances the generalisation capabilities for safeguarding LLM unlearning. Experiments demonstrate that PERMU delivers up to a 50.13% improvement in unlearning while maintaining a 43.53% boost in robust generalisation. Our code can be found in this https URL.

Title: Polish-ASTE: Aspect-Sentiment Triplet Extraction Datasets for Polish

Authors: Marta Lango, Borys Naglik, Mateusz Lango, Iwo Naglik
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.20046
Pdf URL: https://arxiv.org/pdf/2502.20046
Copy Paste: [[2502.20046]] Polish-ASTE: Aspect-Sentiment Triplet Extraction Datasets for Polish(https://arxiv.org/abs/2502.20046)
Keywords: language model
Abstract: Aspect-Sentiment Triplet Extraction (ASTE) is one of the most challenging and complex tasks in sentiment analysis. It concerns the construction of triplets that contain an aspect, its associated sentiment polarity, and an opinion phrase that serves as a rationale for the assigned polarity. Despite the growing popularity of the task and the many machine learning methods being proposed to address it, the number of datasets for ASTE is very limited. In particular, no dataset is available for any of the Slavic languages. In this paper, we present two new datasets for ASTE containing customer opinions about hotels and purchased products expressed in Polish. We also perform experiments with two ASTE techniques combined with two large language models for Polish to investigate their performance and the difficulty of the assembled datasets. The new datasets are available under a permissive licence and have the same file format as the English datasets, facilitating their use in future research.

Title: Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents

Authors: Haochen Sun, Shuwen Zhang, Lei Ren, Hao Xu, Hao Fu, Caixia Yuan, Xiaojie Wang
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2502.20073
Pdf URL: https://arxiv.org/pdf/2502.20073
Copy Paste: [[2502.20073]] Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents(https://arxiv.org/abs/2502.20073)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) based agent systems have made great strides in real-world applications beyond traditional NLP tasks. This paper proposes a new LLM-powered Multi-Agent System (LLM-MAS) benchmark, Collab-Overcooked, built on the popular Overcooked-AI game with more applicable and challenging tasks in interactive environments. Collab-Overcooked extends existing benchmarks from two novel perspectives. First, it provides a multi-agent framework supporting diverse tasks and objectives and encourages collaboration through natural language communication. Second, it introduces a spectrum of process-oriented evaluation metrics to assess the fine-grained collaboration capabilities of different LLM agents, a dimension often overlooked in prior work. We conduct extensive experiments over 10 popular LLMs and show that, while the LLMs present a strong ability in goal interpretation, there is a significant discrepancy in active collaboration and continuous adaption that are critical for efficiently fulfilling complicated tasks. Notably, we highlight the strengths and weaknesses in LLM-MAS and provide insights for improving and evaluating LLM-MAS on a unified and open-sourced benchmark. Environments, 30 open-ended tasks, and an integrated evaluation package are now publicly available at this https URL.

Title: LongRoPE2: Near-Lossless LLM Context Window Scaling

Authors: Ning Shang, Li Lyna Zhang, Siyuan Wang, Gaokai Zhang, Gilsinia Lopez, Fan Yang, Weizhu Chen, Mao Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.20082
Pdf URL: https://arxiv.org/pdf/2502.20082
Copy Paste: [[2502.20082]] LongRoPE2: Near-Lossless LLM Context Window Scaling(https://arxiv.org/abs/2502.20082)
Keywords: language model, llm
Abstract: LongRoPE2 is a novel approach that extends the effective context window of pre-trained large language models (LLMs) to the target length, while preserving the performance on the original shorter context window. This is achieved by three contributions: (1) a hypothesis that insufficient training in higher RoPE dimensions contributes to the persistent out-of-distribution (OOD) issues observed in existing methods; (2) an effective RoPE rescaling algorithm that adopts evolutionary search guided by "needle-driven" perplexity to address the insufficient training problem; (3) a mixed context window training approach that fine-tunes model weights to adopt rescaled RoPE for long-context sequences while preserving the short-context performance with the original RoPE. Extensive experiments on LLaMA3-8B and Phi3-mini-3.8B across various benchmarks validate the hypothesis and demonstrate the effectiveness of LongRoPE2. Remarkably, LongRoPE2 extends LLaMA3-8B to achieve a 128K effective context length while retaining over 98.5% of short-context performance, using only 10B tokens -- 80x fewer than Meta's approach, which fails to reach the target effective context length. Code will be available at this https URL.

Title: Self-Training Elicits Concise Reasoning in Large Language Models

Authors: Tergel Munkhbat, Namgyu Ho, Seohyun Kim, Yongjin Yang, Yujin Kim, Se-Young Yun
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.20122
Pdf URL: https://arxiv.org/pdf/2502.20122
Copy Paste: [[2502.20122]] Self-Training Elicits Concise Reasoning in Large Language Models(https://arxiv.org/abs/2502.20122)
Keywords: language model, llm, chain-of-thought
Abstract: Chain-of-thought (CoT) reasoning has enabled large language models (LLMs) to utilize additional computation through intermediate tokens to solve complex tasks. However, we posit that typical reasoning traces contain many redundant tokens, incurring extraneous inference costs. Upon examination of the output distribution of current LLMs, we find evidence on their latent ability to reason more concisely, relative to their default behavior. To elicit this capability, we propose simple fine-tuning methods which leverage self-generated concise reasoning paths obtained by best-of-N sampling and few-shot conditioning, in task-specific settings. Our combined method achieves a 30% reduction in output tokens on average, across five model families on GSM8K and MATH, while maintaining average accuracy. By exploiting the fundamental stochasticity and in-context learning capabilities of LLMs, our self-training approach robustly elicits concise reasoning on a wide range of models, including those with extensive post-training. Code is available at this https URL
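The best-of-N step mentioned in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the sampler `sample_fn` and the correctness check `is_correct` are hypothetical callables supplied by the caller, and path length is measured in whitespace-separated words as a stand-in for output tokens.

```python
def shortest_correct_path(sample_fn, is_correct, question, n=8):
    """Draw n reasoning paths and return the shortest correct one,
    or None when no sampled path is correct."""
    candidates = [sample_fn(question) for _ in range(n)]
    correct = [c for c in candidates if is_correct(question, c)]
    # Length is counted in whitespace-separated words (an assumption).
    return min(correct, key=lambda c: len(c.split()), default=None)


def build_training_set(questions, sample_fn, is_correct, n=8):
    """Collect (question, concise path) pairs for later fine-tuning."""
    pairs = []
    for q in questions:
        best = shortest_correct_path(sample_fn, is_correct, q, n)
        if best is not None:
            pairs.append((q, best))
    return pairs
```

Questions for which no sampled path is correct are skipped here; whether the original method handles them differently is not stated in the abstract.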

Title: Finite State Automata Inside Transformers with Chain-of-Thought: A Mechanistic Study on State Tracking

Authors: Yifan Zhang, Wenyu Du, Dongming Jin, Jie Fu, Zhi Jin
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.20129
Pdf URL: https://arxiv.org/pdf/2502.20129
Copy Paste: [[2502.20129]] Finite State Automata Inside Transformers with Chain-of-Thought: A Mechanistic Study on State Tracking(https://arxiv.org/abs/2502.20129)
Keywords: language model, llm, chain-of-thought
Abstract: Chain-of-Thought (CoT) significantly enhances the performance of large language models (LLMs) across a wide range of tasks, and prior research shows that CoT can theoretically increase expressiveness. However, there is limited mechanistic understanding of the algorithms that Transformer+CoT can learn. In this work, we (1) evaluate the state tracking capabilities of Transformer+CoT and its variants, confirming the effectiveness of CoT. (2) Next, we identify the circuit, a subset of model components, responsible for tracking the world state, finding that late-layer MLP neurons play a key role. We propose two metrics, compression and distinction, and show that the neuron sets for each state achieve nearly 100% accuracy, providing evidence of an implicit finite state automaton (FSA) embedded within the model. (3) Additionally, we explore three realistic settings: skipping intermediate steps, introducing data noise, and testing length generalization. Our results demonstrate that Transformer+CoT learns robust algorithms (FSA), highlighting its resilience in challenging scenarios.

Title: Layer-Aware Task Arithmetic: Disentangling Task-Specific and Instruction-Following Knowledge

Authors: Yan-Lun Chen, Yi-Ru Wei, Chia-Yi Hsu, Chia-Mu Yu, Chun-Ying Huang, Ying-Dar Lin, Yu-Sung Wu, Wei-Bin Lee
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.20186
Pdf URL: https://arxiv.org/pdf/2502.20186
Copy Paste: [[2502.20186]] Layer-Aware Task Arithmetic: Disentangling Task-Specific and Instruction-Following Knowledge(https://arxiv.org/abs/2502.20186)
Keywords: language model, llm
Abstract: Large language models (LLMs) demonstrate strong task-specific capabilities through fine-tuning, but merging multiple fine-tuned models often leads to degraded performance due to overlapping instruction-following components. Task Arithmetic (TA), which combines task vectors derived from fine-tuning, enables multi-task learning and task forgetting but struggles to isolate task-specific knowledge from general instruction-following behavior. To address this, we propose Layer-Aware Task Arithmetic (LATA), a novel approach that assigns layer-specific weights to task vectors based on their alignment with instruction-following or task-specific components. By amplifying task-relevant layers and attenuating instruction-following layers, LATA improves task learning and forgetting performance while preserving overall model utility. Experiments on multiple benchmarks, including WikiText-2, GSM8K, and HumanEval, demonstrate that LATA outperforms existing methods in both multi-task learning and selective task forgetting, achieving higher task accuracy and alignment with minimal degradation in output quality. Our findings highlight the importance of layer-wise analysis in disentangling task-specific and general-purpose knowledge, offering a robust framework for efficient model merging and editing.
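The idea of weighting task vectors per layer, as described in the abstract above, can be sketched in a few lines. This is an illustration under stated assumptions rather than the LATA implementation: models are represented as plain dictionaries mapping a layer name to a list of floats, the per-layer weights are taken as given (the abstract does not specify how they are computed), and the function names are hypothetical.

```python
def task_vector(base, finetuned):
    """Task vector per layer: fine-tuned weights minus base weights."""
    return {
        layer: [f - b for f, b in zip(finetuned[layer], base[layer])]
        for layer in base
    }


def apply_layer_weighted(base, tau, layer_weights, sign=1.0):
    """Return base + sign * w_layer * tau_layer for every layer.

    sign=+1.0 adds the task (learning); sign=-1.0 subtracts it (forgetting).
    Layers missing from layer_weights default to a weight of 1.0.
    """
    return {
        layer: [
            b + sign * layer_weights.get(layer, 1.0) * t
            for b, t in zip(base[layer], tau[layer])
        ]
        for layer in base
    }
```

Setting a layer's weight to 0.0 leaves that layer at its base values, which corresponds to fully attenuating a layer judged to carry mostly instruction-following behavior.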

Title: ChineseEcomQA: A Scalable E-commerce Concept Evaluation Benchmark for Large Language Models

Authors: Haibin Chen, Kangtao Lv, Chengwei Hu, Yanshi Li, Yujin Yuan, Yancheng He, Xingyao Zhang, Langming Liu, Shilei Liu, Wenbo Su, Bo Zheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.20196
Pdf URL: https://arxiv.org/pdf/2502.20196
Copy Paste: [[2502.20196]] ChineseEcomQA: A Scalable E-commerce Concept Evaluation Benchmark for Large Language Models(https://arxiv.org/abs/2502.20196)
Keywords: language model, llm, retrieval-augmented generation
Abstract: With the increasing use of Large Language Models (LLMs) in fields such as e-commerce, domain-specific concept evaluation benchmarks are crucial for assessing their domain capabilities. Existing LLMs may generate factually incorrect information within complex e-commerce applications. Therefore, it is necessary to build an e-commerce concept benchmark. Existing benchmarks encounter two primary challenges: (1) handling the heterogeneous and diverse nature of tasks, and (2) distinguishing between generality and specificity within the e-commerce field. To address these problems, we propose ChineseEcomQA, a scalable question-answering benchmark focused on fundamental e-commerce concepts. ChineseEcomQA is built on three core characteristics: Focus on Fundamental Concept, E-commerce Generality and E-commerce Expertise. Fundamental concepts are designed to be applicable across a diverse array of e-commerce tasks, thus addressing the challenge of heterogeneity and diversity. Additionally, by carefully balancing generality and specificity, ChineseEcomQA effectively differentiates between broad e-commerce concepts, allowing for precise validation of domain capabilities. We achieve this through a scalable benchmark construction process that combines LLM validation, Retrieval-Augmented Generation (RAG) validation, and rigorous manual annotation. Based on ChineseEcomQA, we conduct extensive evaluations on mainstream LLMs and provide some valuable insights. We hope that ChineseEcomQA could guide future domain-specific evaluations, and facilitate broader LLM adoption in e-commerce applications.

Title: FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving

Authors: Guizhen Chen, Weiwen Xu, Hao Zhang, Hou Pong Chan, Chaoqun Liu, Lidong Bing, Deli Zhao, Anh Tuan Luu, Yu Rong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.20238
Pdf URL: https://arxiv.org/pdf/2502.20238
Copy Paste: [[2502.20238]] FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving(https://arxiv.org/abs/2502.20238)
Keywords: language model, llm
Abstract: Many challenging reasoning tasks require not just rapid, intuitive responses, but a more deliberate, multi-step approach. Recent progress in large language models (LLMs) highlights an important shift from the "System 1" way of quick reactions to the "System 2" style of reflection-and-correction problem solving. However, current benchmarks heavily rely on the final-answer accuracy, leaving much of a model's intermediate reasoning steps unexamined. This fails to assess the model's ability to reflect and rectify mistakes within the reasoning process. To bridge this gap, we introduce FINEREASON, a logic-puzzle benchmark for fine-grained evaluation of LLMs' reasoning capabilities. Each puzzle can be decomposed into atomic steps, making it ideal for rigorous validation of intermediate correctness. Building on this, we introduce two tasks: state checking, and state transition, for a comprehensive evaluation of how models assess the current situation and plan the next move. To support broader research, we also provide a puzzle training set aimed at enhancing performance on general mathematical tasks. We show that models trained on our state checking and transition data demonstrate gains in math reasoning by up to 5.1% on GSM8K.

Title: From Retrieval to Generation: Comparing Different Approaches

Authors: Abdelrahman Abdallah, Jamshid Mozafari, Bhawna Piryani, Mohammed Ali, Adam Jatowt
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.20245
Pdf URL: https://arxiv.org/pdf/2502.20245
Copy Paste: [[2502.20245]] From Retrieval to Generation: Comparing Different Approaches(https://arxiv.org/abs/2502.20245)
Keywords: language model, gpt, retrieval-augmented generation
Abstract: Knowledge-intensive tasks, particularly open-domain question answering (ODQA), document reranking, and retrieval-augmented language modeling, require a balance between retrieval accuracy and generative flexibility. Traditional retrieval models such as BM25 and Dense Passage Retrieval (DPR) efficiently retrieve from large corpora but often lack semantic depth. Generative models like GPT-4-o provide richer contextual understanding but face challenges in maintaining factual consistency. In this work, we conduct a systematic evaluation of retrieval-based, generation-based, and hybrid models, with a primary focus on their performance in ODQA and related retrieval-augmented tasks. Our results show that dense retrievers, particularly DPR, achieve strong performance in ODQA with a top-1 accuracy of 50.17% on NQ, while hybrid models improve nDCG@10 scores on BEIR from 43.42 (BM25) to 52.59, demonstrating their strength in document reranking. Additionally, we analyze language modeling tasks using WikiText-103, showing that retrieval-based approaches like BM25 achieve lower perplexity compared to generative and hybrid methods, highlighting their utility in retrieval-augmented generation. By providing detailed comparisons and practical insights into the conditions where each approach excels, we aim to facilitate future optimizations in retrieval, reranking, and generative models for ODQA and related knowledge-intensive applications.
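For readers unfamiliar with the nDCG@10 metric cited in the abstract above, a minimal sketch of the standard definition follows. It uses the linear-gain variant (some toolkits use 2^rel - 1 instead) and computes the ideal ranking from the relevance labels of the ranked list itself, which assumes every relevant document appears somewhere in that list. The function names are hypothetical, and the values in the abstract appear to be reported on a 0-100 scale, whereas this sketch returns a value in [0, 1].

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

Here `relevances` is the list of graded relevance labels in the order the system ranked the documents, so a ranking already sorted by relevance scores 1.0.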

Title: Beyond Natural Language Perplexity: Detecting Dead Code Poisoning in Code Generation Datasets

Authors: Chichien Tsai, Chiamu Yu, Yingdar Lin, Yusung Wu, Weibin Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.20246
Pdf URL: https://arxiv.org/pdf/2502.20246
Copy Paste: [[2502.20246]] Beyond Natural Language Perplexity: Detecting Dead Code Poisoning in Code Generation Datasets(https://arxiv.org/abs/2502.20246)
Keywords: language model, llm
Abstract: The increasing adoption of large language models (LLMs) for code-related tasks has raised concerns about the security of their training datasets. One critical threat is dead code poisoning, where syntactically valid but functionally redundant code is injected into training data to manipulate model behavior. Such attacks can degrade the performance of neural code search systems, leading to biased or insecure code suggestions. Existing detection methods, such as token-level perplexity analysis, fail to effectively identify dead code due to the structural and contextual characteristics of programming languages. In this paper, we propose DePA (Dead Code Perplexity Analysis), a novel line-level detection and cleansing method tailored to the structural properties of code. DePA computes line-level perplexity by leveraging the contextual relationships between code lines and identifies anomalous lines by comparing their perplexity to the overall distribution within the file. Our experiments on benchmark datasets demonstrate that DePA significantly outperforms existing methods, achieving 0.14-0.19 improvement in detection F1-score and a 44-65% increase in poisoned segment localization precision. Furthermore, DePA enhances detection speed by 0.62-23x, making it practical for large-scale dataset cleansing. Overall, by addressing the unique challenges of dead code poisoning, DePA provides a robust and efficient solution for safeguarding the integrity of code generation model training datasets.
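The line-level perplexity comparison described in the abstract above can be sketched as follows. This is a simplified illustration, not the DePA implementation: per-token log-probabilities are assumed to come from some code language model (not shown), every line is assumed to contain at least one token, and the z-score rule with its threshold is an assumption standing in for whatever comparison against the file-level distribution the paper actually uses.

```python
import math
import statistics


def line_perplexity(token_logprobs):
    """Perplexity of one line: exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def flag_anomalous_lines(per_line_logprobs, z_threshold=1.5):
    """Return indices of lines whose perplexity is unusually high
    relative to the other lines in the same file."""
    ppls = [line_perplexity(lp) for lp in per_line_logprobs]
    mean = statistics.fmean(ppls)
    std = statistics.pstdev(ppls)
    if std == 0:
        return []  # All lines look alike; nothing stands out.
    return [i for i, p in enumerate(ppls) if (p - mean) / std > z_threshold]
```

Scoring whole lines rather than individual tokens matches the abstract's point that dead code is a structural, line-shaped insertion; the flagged indices could then be used to drop suspicious lines before training.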

Title: LLM as a Broken Telephone: Iterative Generation Distorts Information

Authors: Amr Mohamed, Mingmeng Geng, Michalis Vazirgiannis, Guokan Shang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.20258
Pdf URL: https://arxiv.org/pdf/2502.20258
Copy Paste: [[2502.20258]] LLM as a Broken Telephone: Iterative Generation Distorts Information(https://arxiv.org/abs/2502.20258)
Keywords: language model, llm, prompt
Abstract: As large language models are increasingly responsible for online content, concerns arise about the impact of repeatedly processing their own outputs. Inspired by the "broken telephone" effect in chained human communication, this study investigates whether LLMs similarly distort information through iterative generation. Through translation-based experiments, we find that distortion accumulates over time, influenced by language choice and chain complexity. While degradation is inevitable, it can be mitigated through strategic prompting techniques. These findings contribute to discussions on the long-term effects of AI-mediated information propagation, raising important questions about the reliability of LLM-generated content in iterative workflows.

Title: LangProBe: a Language Programs Benchmark

Authors: Shangyin Tan, Lakshya A Agrawal, Arnav Singhvi, Liheng Lai, Michael J Ryan, Dan Klein, Omar Khattab, Koushik Sen, Matei Zaharia
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2502.20315
Pdf URL: https://arxiv.org/pdf/2502.20315
Copy Paste: [[2502.20315]] LangProBe: a Language Programs Benchmark(https://arxiv.org/abs/2502.20315)
Keywords: language model, prompt
Abstract: Composing language models (LMs) into multi-step language programs and automatically optimizing their modular prompts is now a mainstream paradigm for building AI systems, but the tradeoffs in this space have only scarcely been studied before. We introduce LangProBe, the first large-scale benchmark for evaluating the architectures and optimization strategies for language programs, with over 2000 combinations of tasks, architectures, optimizers, and choices of LMs. Using LangProBe, we are the first to study the impact of program architectures and optimizers (and their compositions together and with different models) on tradeoffs of quality and cost. We find that optimized language programs offer strong cost-quality Pareto improvement over raw calls to models, but simultaneously demonstrate that human judgment (or empirical decisions) about which compositions to pursue is still necessary for best performance. We will open source the code and evaluation data for LangProBe.

Title: Long-Context Inference with Retrieval-Augmented Speculative Decoding

Authors: Guanzheng Chen, Qilong Feng, Jinjie Ni, Xin Li, Michael Qizhe Shieh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.20330
Pdf URL: https://arxiv.org/pdf/2502.20330
Copy Paste: [[2502.20330]] Long-Context Inference with Retrieval-Augmented Speculative Decoding(https://arxiv.org/abs/2502.20330)
Keywords: language model, llm, retrieval-augmented generation
Abstract: The emergence of long-context large language models (LLMs) offers a promising alternative to traditional retrieval-augmented generation (RAG) for processing extensive documents. However, the computational overhead of long-context inference, particularly in managing key-value (KV) caches, presents significant efficiency challenges. While Speculative Decoding (SD) traditionally accelerates inference using smaller draft models, its effectiveness diminishes substantially in long-context scenarios due to memory-bound KV cache operations. We present Retrieval-Augmented Speculative Decoding (RAPID), which leverages RAG for both accelerating and enhancing generation quality in long-context inference. RAPID introduces the RAG drafter (a draft LLM operating on shortened retrieval contexts) to speculate on the generation of long-context target LLMs. Our approach enables a new paradigm where same-scale or even larger LLMs can serve as RAG drafters while maintaining computational efficiency. To fully leverage the potentially superior capabilities from stronger RAG drafters, we develop an inference-time knowledge transfer dynamic that enriches the target distribution by RAG. Extensive experiments on the LLaMA-3.1 and Qwen2.5 backbones demonstrate that RAPID effectively integrates the strengths of both approaches, achieving significant performance improvements (e.g., from 39.33 to 42.83 on InfiniteBench for LLaMA-3.1-8B) with more than 2x speedups. Our analyses reveal that RAPID achieves robust acceleration beyond 32K context length and demonstrates superior generation quality in real-world applications.
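
Illustration (not from the paper): the abstract builds on speculative decoding, where a drafter proposes tokens and the target model accepts or rejects them. The sketch below shows the standard speculative sampling accept/reject step for a single token, assuming the draft distribution `q` and target distribution `p` are already available as plain lists over a tiny vocabulary. It does not implement RAPID's retrieval contexts or its inference-time knowledge transfer; the function names are hypothetical.

```python
import random


def residual_distribution(p, q):
    """Normalized max(0, p - q), used to resample after a rejected draft token."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    if total == 0.0:
        # p and q are identical, so no rejection can occur; fall back to p.
        return list(p)
    return [d / total for d in diff]


def speculative_step(p, q, rng):
    """One accept/reject step of standard speculative sampling.

    p: target-model probabilities over the vocabulary (assumed given)
    q: draft-model probabilities over the same vocabulary (assumed given)
    Returns (token_index, accepted_flag).
    """
    # The drafter proposes a token from its own distribution.
    x = rng.choices(range(len(q)), weights=q)[0]
    # Accept with probability min(1, p[x] / q[x]); q[x] > 0 because x was drawn from q.
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # On rejection, resample from the residual so the output still follows p.
    r = residual_distribution(p, q)
    return rng.choices(range(len(r)), weights=r)[0], False


if __name__ == "__main__":
    rng = random.Random(0)
    target = [0.2, 0.8]
    draft = [0.6, 0.4]
    tokens = [speculative_step(target, draft, rng)[0] for _ in range(20000)]
    print("empirical P(token 0):", tokens.count(0) / len(tokens))
```

The accept/reject rule is what lets a drafter of any size be used without changing the target model's output distribution: the empirical frequency printed above should be close to the target probability 0.2, even though the drafter prefers token 0.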

Title: Emergent Symbolic Mechanisms Support Abstract Reasoning in Large Language Models

Authors: Yukang Yang, Declan Campbell, Kaixuan Huang, Mengdi Wang, Jonathan Cohen, Taylor Webb
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.20332
Pdf URL: https://arxiv.org/pdf/2502.20332
Copy Paste: [[2502.20332]] Emergent Symbolic Mechanisms Support Abstract Reasoning in Large Language Models(https://arxiv.org/abs/2502.20332)
Keywords: language model
Abstract: Many recent studies have found evidence for emergent reasoning capabilities in large language models, but debate persists concerning the robustness of these capabilities, and the extent to which they depend on structured reasoning mechanisms. To shed light on these issues, we perform a comprehensive study of the internal mechanisms that support abstract rule induction in an open-source language model (Llama3-70B). We identify an emergent symbolic architecture that implements abstract reasoning via a series of three computations. In early layers, symbol abstraction heads convert input tokens to abstract variables based on the relations between those tokens. In intermediate layers, symbolic induction heads perform sequence induction over these abstract variables. Finally, in later layers, retrieval heads predict the next token by retrieving the value associated with the predicted abstract variable. These results point toward a resolution of the longstanding debate between symbolic and neural network approaches, suggesting that emergent reasoning in neural networks depends on the emergence of symbolic mechanisms.

Title: Expertise Is What We Want

Authors: Alan Ashworth, Munir Al-Dajani, Keegan Duchicela, Kiril Kafadarov, Allison Kurian, Othman Laraki, Amina Lazrak, Divneet Mandair, Wendy McKennon, Rebecca Miksad, Jayodita Sanghvi, Travis Zack
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.20335
Pdf URL: https://arxiv.org/pdf/2502.20335
Copy Paste: [[2502.20335]] Expertise Is What We Want(https://arxiv.org/abs/2502.20335)
Keywords: language model, llm, hallucination
Abstract: Clinical decision-making depends on expert reasoning, which is guided by standardized, evidence-based guidelines. However, translating these guidelines into automated clinical decision support systems risks inaccuracy and, importantly, loss of nuance. We share an application architecture, the Large Language Expert (LLE), that combines the flexibility and power of Large Language Models (LLMs) with the interpretability, explainability, and reliability of Expert Systems. LLMs help address key challenges of Expert Systems, such as integrating and codifying knowledge, and data normalization. Conversely, an Expert System-like approach helps overcome challenges with LLMs, including hallucinations, atomic and inexpensive updates, and testability. To highlight the power of the Large Language Expert (LLE) system, we built an LLE to assist with the workup of patients newly diagnosed with cancer. Timely initiation of cancer treatment is critical for optimal patient outcomes. However, increasing complexity in diagnostic recommendations has made it difficult for primary care physicians to ensure their patients have completed the necessary workup before their first visit with an oncologist. As with many real-world clinical tasks, these workups require the analysis of unstructured health records and the application of nuanced clinical decision logic. In this study, we describe the design and evaluation of an LLE system built to rapidly identify and suggest the correct diagnostic workup. The system demonstrated a high degree of clinical-level accuracy (>95%) and effectively addressed gaps identified in real-world data from breast and colon cancer patients at a large academic center.

Title: Thinking Slow, Fast: Scaling Inference Compute with Distilled Reasoners

Authors: Daniele Paliotta, Junxiong Wang, Matteo Pagliardini, Kevin Y. Li, Aviv Bick, J. Zico Kolter, Albert Gu, François Fleuret, Tri Dao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.20339
Pdf URL: https://arxiv.org/pdf/2502.20339
Copy Paste: [[2502.20339]] Thinking Slow, Fast: Scaling Inference Compute with Distilled Reasoners(https://arxiv.org/abs/2502.20339)
Keywords: language model, llm, chain-of-thought
Abstract: Recent advancements have demonstrated that the performance of large language models (LLMs) can be significantly enhanced by scaling computational resources at test time. A common strategy involves generating multiple Chain-of-Thought (CoT) trajectories and aggregating their outputs through various selection mechanisms. This raises a fundamental question: can models with lower complexity leverage their superior generation throughput to outperform similarly sized Transformers for a fixed computational budget? To address this question and overcome the lack of strong subquadratic reasoners, we distill pure and hybrid Mamba models from pretrained Transformers. Trained on only 8 billion tokens, our distilled models show strong performance and scaling on mathematical reasoning datasets while being much faster at inference for large batches and long sequences. Despite the zero-shot performance hit due to distillation, both pure and hybrid Mamba models can scale their coverage and accuracy performance past their Transformer teacher models under fixed time budgets, opening a new direction for scaling inference compute.
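
Illustration (not from the paper): the abstract refers to generating multiple Chain-of-Thought trajectories, aggregating their outputs, and measuring coverage and accuracy. The sketch below shows one common way such metrics are computed, assuming each trajectory has already been reduced to a final answer string. Majority voting is used here as one example of a selection mechanism; the function names and the toy data are hypothetical.

```python
from collections import Counter


def coverage(samples_per_problem, answers):
    """Fraction of problems where at least one sampled answer is correct."""
    hits = sum(
        any(sample == answer for sample in samples)
        for samples, answer in zip(samples_per_problem, answers)
    )
    return hits / len(answers)


def majority_vote_accuracy(samples_per_problem, answers):
    """Fraction of problems where the most frequent sampled answer is correct.

    Ties are broken in favour of the answer that was sampled first.
    """
    correct = 0
    for samples, answer in zip(samples_per_problem, answers):
        voted, _count = Counter(samples).most_common(1)[0]
        correct += voted == answer
    return correct / len(answers)


if __name__ == "__main__":
    # Three toy problems with three sampled final answers each.
    samples = [["4", "5", "4"], ["7", "8", "9"], ["1", "1", "2"]]
    gold = ["4", "9", "3"]
    print("coverage:", coverage(samples, gold))
    print("majority-vote accuracy:", majority_vote_accuracy(samples, gold))
```

Coverage can only grow as more trajectories are sampled, which is why a model with higher generation throughput can benefit under a fixed time budget: it produces more samples per problem in the same wall-clock time.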

Title: Sparse Auto-Encoder Interprets Linguistic Features in Large Language Models

Authors: Yi Jing, Zijun Yao, Lingxu Ran, Hongzhu Guo, Xiaozhi Wang, Lei Hou, Juanzi Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.20344
Pdf URL: https://arxiv.org/pdf/2502.20344
Copy Paste: [[2502.20344]] Sparse Auto-Encoder Interprets Linguistic Features in Large Language Models(https://arxiv.org/abs/2502.20344)
Keywords: language model, llm
Abstract: Large language models (LLMs) excel in tasks that require complex linguistic abilities, such as reference disambiguation and metaphor recognition/generation. Although LLMs possess impressive capabilities, their internal mechanisms for processing and representing linguistic knowledge remain largely opaque. Previous work on linguistic mechanisms has been limited by coarse granularity, insufficient causal analysis, and a narrow focus. In this study, we present a systematic and comprehensive causal investigation using sparse auto-encoders (SAEs). We extract a wide range of linguistic features from six dimensions: phonetics, phonology, morphology, syntax, semantics, and pragmatics. We extract, evaluate, and intervene on these features by constructing minimal contrast datasets and counterfactual sentence datasets. We introduce two indices, Feature Representation Confidence (FRC) and Feature Intervention Confidence (FIC), to measure the ability of linguistic features to capture and control linguistic phenomena. Our results reveal inherent representations of linguistic knowledge in LLMs and demonstrate the potential for controlling model outputs. This work provides strong evidence that LLMs possess genuine linguistic knowledge and lays the foundation for more interpretable and controllable language modeling in future research.

Title: KEDRec-LM: A Knowledge-distilled Explainable Drug Recommendation Large Language Model

Authors: Kai Zhang, Rui Zhu, Shutian Ma, Jingwei Xiong, Yejin Kim, Fabricio Murai, Xiaozhong Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.20350
Pdf URL: https://arxiv.org/pdf/2502.20350
Copy Paste: [[2502.20350]] KEDRec-LM: A Knowledge-distilled Explainable Drug Recommendation Large Language Model(https://arxiv.org/abs/2502.20350)
Keywords: language model, llm
Abstract: Drug discovery is a critical task in biomedical natural language processing (NLP), yet explainable drug discovery remains underexplored. Meanwhile, large language models (LLMs) have shown remarkable abilities in natural language understanding and generation. Leveraging LLMs for explainable drug discovery has the potential to improve downstream tasks and real-world applications. In this study, we utilize open-source drug knowledge graphs, clinical trial data, and PubMed publications to construct a comprehensive dataset for the explainable drug discovery task, named expRxRec. Furthermore, we introduce KEDRec-LM, an instruction-tuned LLM which distills knowledge from a rich medical knowledge corpus for drug recommendation and rationale generation. To encourage further research in this area, we will publicly release both the dataset and KEDRec-LM (a copy is attached with this submission).

Title: Bridging the Creativity Understanding Gap: Small-Scale Human Alignment Enables Expert-Level Humor Ranking in LLMs

Authors: Kuan Lok Zhou, Jiayi Chen, Siddharth Suresh, Reuben Narad, Timothy T. Rogers, Lalit K Jain, Robert D Nowak, Bob Mankoff, Jifan Zhang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.20356
Pdf URL: https://arxiv.org/pdf/2502.20356
Copy Paste: [[2502.20356]] Bridging the Creativity Understanding Gap: Small-Scale Human Alignment Enables Expert-Level Humor Ranking in LLMs(https://arxiv.org/abs/2502.20356)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have shown significant limitations in understanding creative content, as demonstrated by the influential work of Hessel et al. (2023) on the New Yorker Cartoon Caption Contest (NYCCC). Their study exposed a substantial gap between LLMs and humans in humor comprehension, establishing that understanding and evaluating creative content is a key challenge in AI development. We revisit this challenge by decomposing humor understanding into three components and systematically improving each: enhancing visual understanding through improved annotation, utilizing LLM-generated humor reasoning and explanations, and implementing targeted alignment with human preference data. Our refined approach achieves 82.4% accuracy in caption ranking, significantly improving upon the previous 67% benchmark and matching the performance of world-renowned human experts in this domain. Notably, while attempts to mimic subgroup preferences through various persona prompts showed minimal impact, model finetuning with crowd preferences proved remarkably effective. These findings reveal that LLM limitations in creative judgment can be effectively addressed through focused alignment to specific subgroups and individuals. Lastly, we propose the position that achieving artificial general intelligence necessitates systematic collection of human preference data across creative domains. We advocate that just as human creativity is deeply influenced by individual and cultural preferences, training LLMs with diverse human preference data may be essential for developing true creative understanding.

Title: Bridging Legal Knowledge and AI: Retrieval-Augmented Generation with Vector Stores, Knowledge Graphs, and Hierarchical Non-negative Matrix Factorization

Authors: Ryan C. Barron, Maksim E. Eren, Olga M. Serafimova, Cynthia Matuszek, Boian S. Alexandrov
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.20364
Pdf URL: https://arxiv.org/pdf/2502.20364
Copy Paste: [[2502.20364]] Bridging Legal Knowledge and AI: Retrieval-Augmented Generation with Vector Stores, Knowledge Graphs, and Hierarchical Non-negative Matrix Factorization(https://arxiv.org/abs/2502.20364)
Keywords: language model, llm, hallucination, retrieval-augmented generation, agent
Abstract: Agentic Generative AI, powered by Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG), Knowledge Graphs (KGs), and Vector Stores (VSs), represents a transformative technology applicable to specialized domains such as legal systems, research, recommender systems, cybersecurity, and global security, including proliferation research. This technology excels at inferring relationships within vast unstructured or semi-structured datasets. The legal domain here comprises complex data characterized by extensive, interrelated, and semi-structured knowledge systems with complex relations. It comprises constitutions, statutes, regulations, and case law. Extracting insights and navigating the intricate networks of legal documents and their relations is crucial for effective legal research. Here, we introduce a generative AI system that integrates RAG, VS, and KG, constructed via Non-Negative Matrix Factorization (NMF), to enhance legal information retrieval and AI reasoning and minimize hallucinations. In the legal system, these technologies empower AI agents to identify and analyze complex connections among cases, statutes, and legal precedents, uncovering hidden relationships and predicting legal trends, challenging tasks that are essential for ensuring justice and improving operational efficiency. Our system employs web scraping techniques to systematically collect legal texts, such as statutes, constitutional provisions, and case law, from publicly accessible platforms like Justia. It bridges the gap between traditional keyword-based searches and contextual understanding by leveraging advanced semantic representations, hierarchical relationships, and latent topic discovery. This framework supports legal document clustering, summarization, and cross-referencing, for scalable, interpretable, and accurate retrieval for semi-structured data while advancing computational law and AI.