2025-04-04

Title: Increasing happiness through conversations with artificial intelligence

Authors: Joseph Heffner, Chongyu Qin, Martin Chadwick, Chris Knutsen, Christopher Summerfield, Zeb Kurth-Nelson, Robb B. Rutledge
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.02091
Pdf URL: https://arxiv.org/pdf/2504.02091
Copy Paste: [[2504.02091]] Increasing happiness through conversations with artificial intelligence(https://arxiv.org/abs/2504.02091)
Keywords: language model, chat
Abstract: Chatbots powered by artificial intelligence (AI) have rapidly become a significant part of everyday life, with over a quarter of American adults using them multiple times per week. While these tools offer potential benefits and risks, a fundamental question remains largely unexplored: How do conversations with AI influence subjective well-being? To investigate this, we conducted a study where participants either engaged in conversations with an AI chatbot (N = 334) or wrote journal entries (N = 193) on the same randomly assigned topics and reported their momentary happiness afterward. We found that happiness after AI chatbot conversations was higher than after journaling, particularly when discussing negative topics such as depression or guilt. Leveraging large language models for sentiment analysis, we found that the AI chatbot mirrored participants' sentiment while maintaining a consistent positivity bias. When discussing negative topics, participants gradually aligned their sentiment with the AI's positivity, leading to an overall increase in happiness. We hypothesized that the history of participants' sentiment prediction errors, the difference between expected and actual emotional tone when responding to the AI chatbot, might explain this happiness effect. Using computational modeling, we find the history of these sentiment prediction errors over the course of a conversation predicts greater post-conversation happiness, demonstrating a central role of emotional expectations during dialogue. Our findings underscore the effect that AI interactions can have on human well-being.

Title: ContrastScore: Towards Higher Quality, Less Biased, More Efficient Evaluation Metrics with Contrastive Evaluation

Authors: Xiao Wang, Daniil Larionov, Siwei Wu, Yiqi Liu, Steffen Eger, Nafise Sadat Moosavi, Chenghua Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.02106
Pdf URL: https://arxiv.org/pdf/2504.02106
Copy Paste: [[2504.02106]] ContrastScore: Towards Higher Quality, Less Biased, More Efficient Evaluation Metrics with Contrastive Evaluation(https://arxiv.org/abs/2504.02106)
Keywords: language model, llm
Abstract: Evaluating the quality of generated text automatically remains a significant challenge. Conventional reference-based metrics have been shown to exhibit relatively weak correlation with human evaluations. Recent research advocates the use of large language models (LLMs) as source-based metrics for natural language generation (NLG) assessment. While promising, LLM-based metrics, particularly those using smaller models, still fall short in aligning with human judgments. In this work, we introduce ContrastScore, a contrastive evaluation metric designed to enable higher-quality, less biased, and more efficient assessment of generated text. We evaluate ContrastScore on two NLG tasks: machine translation and summarization. Experimental results show that ContrastScore consistently achieves stronger correlation with human judgments than both single-model and ensemble-based baselines. Notably, ContrastScore based on Qwen 3B and 0.5B even outperforms Qwen 7B, despite having only half as many parameters, demonstrating its efficiency. Furthermore, it effectively mitigates common evaluation biases such as length and likelihood preferences, resulting in more robust automatic evaluation.

Title: Language Models at the Syntax-Semantics Interface: A Case Study of the Long-Distance Binding of Chinese Reflexive ziji

Authors: Xiulin Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.02116
Pdf URL: https://arxiv.org/pdf/2504.02116
Copy Paste: [[2504.02116]] Language Models at the Syntax-Semantics Interface: A Case Study of the Long-Distance Binding of Chinese Reflexive ziji(https://arxiv.org/abs/2504.02116)
Keywords: language model
Abstract: This paper explores whether language models can effectively resolve the complex binding patterns of the Mandarin Chinese reflexive ziji, which are constrained by both syntactic and semantic factors. We construct a dataset of 240 synthetic sentences using templates and examples from syntactic literature, along with 320 natural sentences from the BCC corpus. Evaluating 21 language models against this dataset and comparing their performance to judgments from native Mandarin speakers, we find that none of the models consistently replicates human-like judgments. The results indicate that existing language models tend to rely heavily on sequential cues, though not always favoring the closest strings, and often overlooking subtle semantic and syntactic constraints. They tend to be more sensitive to noun-related than verb-related semantics.

Title: Overcoming Vocabulary Constraints with Pixel-level Fallback

Authors: Jonas F. Lotz, Hendra Setiawan, Stephan Peitz, Yova Kementchedjhieva
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.02122
Pdf URL: https://arxiv.org/pdf/2504.02122
Copy Paste: [[2504.02122]] Overcoming Vocabulary Constraints with Pixel-level Fallback(https://arxiv.org/abs/2504.02122)
Keywords: language model
Abstract: Subword tokenization requires balancing computational efficiency and vocabulary coverage, which often leads to suboptimal performance on languages and scripts not prioritized during training. We propose to augment pretrained language models with a vocabulary-free encoder that generates input embeddings from text rendered as pixels. Through experiments on English-centric language models, we demonstrate that our approach substantially improves machine translation performance and facilitates effective cross-lingual transfer, outperforming tokenizer-based methods. Furthermore, we find that pixel-based representations outperform byte-level approaches and standard vocabulary expansion. Our approach enhances the multilingual capabilities of monolingual language models without extensive retraining and reduces decoding latency via input compression.

Title: One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image

Authors: Ezzeldin Shereen, Dan Ristea, Burak Hasircioglu, Shae McFadden, Vasilios Mavroudis, Chris Hicks
Subjects: cs.CL, cs.CR, cs.CV, cs.IR
Abstract URL: https://arxiv.org/abs/2504.02132
Pdf URL: https://arxiv.org/pdf/2504.02132
Copy Paste: [[2504.02132]] One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image(https://arxiv.org/abs/2504.02132)
Keywords: hallucination, retrieval augmented generation
Abstract: Multimodal retrieval augmented generation (M-RAG) has recently emerged as a method to inhibit hallucinations of large multimodal models (LMMs) through a factual knowledge base (KB). However, M-RAG also introduces new attack vectors for adversaries that aim to disrupt the system by injecting malicious entries into the KB. In this work, we present a poisoning attack against M-RAG targeting visual document retrieval applications, where the KB contains images of document pages. Our objective is to craft a single image that is retrieved for a variety of different user queries, and consistently influences the output produced by the generative model, thus creating a universal denial-of-service (DoS) attack against the M-RAG system. We demonstrate that while our attack is effective against a diverse range of widely-used, state-of-the-art retrievers (embedding models) and generators (LMMs), it can also be ineffective against robust embedding models. Our attack not only highlights the vulnerability of M-RAG pipelines to poisoning attacks, but also sheds light on a fundamental weakness that potentially hinders their performance even in benign settings.

Title: LL4G: Self-Supervised Dynamic Optimization for Graph-Based Personality Detection

Authors: Lingzhi Shen, Yunfei Long, Xiaohao Cai, Guanming Chen, Yuhan Wang, Imran Razzak, Shoaib Jameel
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2504.02146
Pdf URL: https://arxiv.org/pdf/2504.02146
Copy Paste: [[2504.02146]] LL4G: Self-Supervised Dynamic Optimization for Graph-Based Personality Detection(https://arxiv.org/abs/2504.02146)
Keywords: language model, llm
Abstract: Graph-based personality detection constructs graph structures from textual data, particularly social media posts. Current methods often struggle with sparse or noisy data and rely on static graphs, limiting their ability to capture dynamic changes between nodes and relationships. This paper introduces LL4G, a self-supervised framework leveraging large language models (LLMs) to optimize graph neural networks (GNNs). LLMs extract rich semantic features to generate node representations and to infer explicit and implicit relationships. The graph structure adaptively adds nodes and edges based on input data, continuously optimizing itself. The GNN then uses these optimized representations for joint training on node reconstruction, edge prediction, and contrastive learning tasks. This integration of semantic and structural information generates robust personality profiles. Experimental results on Kaggle and Pandora datasets show LL4G outperforms state-of-the-art models.

Title: Subasa -- Adapting Language Models for Low-resourced Offensive Language Detection in Sinhala

Authors: Shanilka Haturusinghe, Tharindu Cyril Weerasooriya, Marcos Zampieri, Christopher M. Homan, S.R. Liyanage
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.02178
Pdf URL: https://arxiv.org/pdf/2504.02178
Copy Paste: [[2504.02178]] Subasa -- Adapting Language Models for Low-resourced Offensive Language Detection in Sinhala(https://arxiv.org/abs/2504.02178)
Keywords: language model, gpt
Abstract: Accurate detection of offensive language is essential for a number of applications related to social media safety. There is a sharp contrast in performance in this task between low and high-resource languages. In this paper, we adapt fine-tuning strategies that have not been previously explored for Sinhala in the downstream task of offensive language detection. Using this approach, we introduce four models: "Subasa-XLM-R", which incorporates an intermediate Pre-Finetuning step using Masked Rationale Prediction. Two variants of "Subasa-Llama" and "Subasa-Mistral", are fine-tuned versions of Llama (3.2) and Mistral (v0.3), respectively, with a task-specific strategy. We evaluate our models on the SOLD benchmark dataset for Sinhala offensive language detection. All our models outperform existing baselines. Subasa-XLM-R achieves the highest Macro F1 score (0.84) surpassing state-of-the-art large language models like GPT-4o when evaluated on the same SOLD benchmark dataset under zero-shot settings. The models and code are publicly available.

Title: LLMs as Deceptive Agents: How Role-Based Prompting Induces Semantic Ambiguity in Puzzle Tasks

Authors: Seunghyun Yoo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.02254
Pdf URL: https://arxiv.org/pdf/2504.02254
Copy Paste: [[2504.02254]] LLMs as Deceptive Agents: How Role-Based Prompting Induces Semantic Ambiguity in Puzzle Tasks(https://arxiv.org/abs/2504.02254)
Keywords: language model, llm, prompt, agent
Abstract: Recent advancements in Large Language Models (LLMs) have not only showcased impressive creative capabilities but also revealed emerging agentic behaviors that exploit linguistic ambiguity in adversarial settings. In this study, we investigate how an LLM, acting as an autonomous agent, leverages semantic ambiguity to generate deceptive puzzles that mislead and challenge human users. Inspired by the popular puzzle game "Connections", we systematically compare puzzles produced through zero-shot prompting, role-injected adversarial prompts, and human-crafted examples, with an emphasis on understanding the underlying agent decision-making processes. Employing computational analyses with HateBERT to quantify semantic ambiguity, alongside subjective human evaluations, we demonstrate that explicit adversarial agent behaviors significantly heighten semantic ambiguity -- thereby increasing cognitive load and reducing fairness in puzzle solving. These findings provide critical insights into the emergent agentic qualities of LLMs and underscore important ethical considerations for evaluating and safely deploying autonomous language systems in both educational technologies and entertainment.

Title: State-of-the-Art Translation of Text-to-Gloss using mBART : A case study of Bangla

Authors: Sharif Md. Abdullah, Abhijit Paul, Shebuti Rayana, Ahmedul Kabir, Zarif Masud
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.02293
Pdf URL: https://arxiv.org/pdf/2504.02293
Copy Paste: [[2504.02293]] State-of-the-Art Translation of Text-to-Gloss using mBART : A case study of Bangla(https://arxiv.org/abs/2504.02293)
Keywords: llm
Abstract: Despite a large deaf and dumb population of 1.7 million, Bangla Sign Language (BdSL) remains an understudied domain. Specifically, there are no works on the Bangla text-to-gloss translation task. To address this gap, we begin by addressing the dataset problem. We take inspiration from the grammatical rule-based gloss generation used for German Sign Language and American Sign Language (ASL) and adapt it for BdSL. We also leverage an LLM to generate synthetic data and use back-translation and text generation for data augmentation. With the dataset prepared, we started experimentation. We fine-tuned the pretrained mBART-50 and mBERT-multiclass-uncased models on our dataset. We also trained GRU, RNN and a novel seq-to-seq model with multi-head attention. We observe significantly high performance (SacreBLEU = 79.53) when fine-tuning the pretrained mBART-50 multilingual model from Facebook. We then explored why we observe such high performance with mBART. We soon noticed an interesting property of mBART: it was trained on shuffled and masked text data. And as we know, the gloss form has a shuffling property. So we hypothesize that mBART is inherently good at text-to-gloss tasks. To find support for this hypothesis, we trained mBART-50 on the PHOENIX-14T benchmark and evaluated it against existing literature. Our mBART-50 finetune demonstrated State-of-the-Art performance on the PHOENIX-14T benchmark, far outperforming existing models in all 6 metrics (SacreBLEU = 63.89, BLEU-1 = 55.14, BLEU-2 = 38.07, BLEU-3 = 27.13, BLEU-4 = 20.68, COMET = 0.624). Based on the results, this study proposes a new paradigm for the text-to-gloss task using mBART models. Additionally, our results show that the BdSL text-to-gloss task can greatly benefit from a rule-based synthetic dataset.

Title: Measurement of LLM's Philosophies of Human Nature

Authors: Minheng Ni, Ennan Wu, Zidong Gong, Zhengyuan Yang, Linjie Li, Chung-Ching Lin, Kevin Lin, Lijuan Wang, Wangmeng Zuo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.02304
Pdf URL: https://arxiv.org/pdf/2504.02304
Copy Paste: [[2504.02304]] Measurement of LLM's Philosophies of Human Nature(https://arxiv.org/abs/2504.02304)
Keywords: language model, llm, prompt
Abstract: The widespread application of artificial intelligence (AI) in various tasks, along with frequent reports of conflicts or violations involving AI, has sparked societal concerns about interactions with AI systems. Based on Wrightsman's Philosophies of Human Nature Scale (PHNS), a scale empirically validated over decades to effectively assess individuals' attitudes toward human nature, we design the standardized psychological scale specifically targeting large language models (LLM), named the Machine-based Philosophies of Human Nature Scale (M-PHNS). By evaluating LLMs' attitudes toward human nature across six dimensions, we reveal that current LLMs exhibit a systemic lack of trust in humans, and there is a significant negative correlation between the model's intelligence level and its trust in humans. Furthermore, we propose a mental loop learning framework, which enables LLM to continuously optimize its value system during virtual interactions by constructing moral scenarios, thereby improving its attitude toward human nature. Experiments demonstrate that mental loop learning significantly enhances their trust in humans compared to persona or instruction prompts. This finding highlights the potential of human-based psychological assessments for LLM, which can not only diagnose cognitive biases but also provide a potential solution for ethical learning in artificial intelligence. We release the M-PHNS evaluation code and data at this https URL.

Title: Improving Harmful Text Detection with Joint Retrieval and External Knowledge

Authors: Zidong Yu, Shuo Wang, Nan Jiang, Weiqiang Huang, Xu Han, Junliang Du
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.02310
Pdf URL: https://arxiv.org/pdf/2504.02310
Copy Paste: [[2504.02310]] Improving Harmful Text Detection with Joint Retrieval and External Knowledge(https://arxiv.org/abs/2504.02310)
Keywords: language model
Abstract: Harmful text detection has become a crucial task in the development and deployment of large language models, especially as AI-generated content continues to expand across digital platforms. This study proposes a joint retrieval framework that integrates pre-trained language models with knowledge graphs to improve the accuracy and robustness of harmful text detection. Experimental results demonstrate that the joint retrieval approach significantly outperforms single-model baselines, particularly in low-resource training scenarios and multilingual environments. The proposed method effectively captures nuanced harmful content by leveraging external contextual information, addressing the limitations of traditional detection models. Future research should focus on optimizing computational efficiency, enhancing model interpretability, and expanding multimodal detection capabilities to better tackle evolving harmful content patterns. This work contributes to the advancement of AI safety, ensuring more trustworthy and reliable content moderation systems.

Title: CoTAL: Human-in-the-Loop Prompt Engineering, Chain-of-Thought Reasoning, and Active Learning for Generalizable Formative Assessment Scoring

Authors: Clayton Cohn, Nicole Hutchins, Ashwin T S, Gautam Biswas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.02323
Pdf URL: https://arxiv.org/pdf/2504.02323
Copy Paste: [[2504.02323]] CoTAL: Human-in-the-Loop Prompt Engineering, Chain-of-Thought Reasoning, and Active Learning for Generalizable Formative Assessment Scoring(https://arxiv.org/abs/2504.02323)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: Large language models (LLMs) have created new opportunities to assist teachers and support student learning. Methods such as chain-of-thought (CoT) prompting enable LLMs to grade formative assessments in science, providing scores and relevant feedback to students. However, the extent to which these methods generalize across curricula in multiple domains (such as science, computing, and engineering) remains largely untested. In this paper, we introduce Chain-of-Thought Prompting + Active Learning (CoTAL), an LLM-based approach to formative assessment scoring that (1) leverages Evidence-Centered Design (ECD) principles to develop curriculum-aligned formative assessments and rubrics, (2) applies human-in-the-loop prompt engineering to automate response scoring, and (3) incorporates teacher and student feedback to iteratively refine assessment questions, grading rubrics, and LLM prompts for automated grading. Our findings demonstrate that CoTAL improves GPT-4's scoring performance, achieving gains of up to 24.5% over a non-prompt-engineered baseline. Both teachers and students view CoTAL as effective in scoring and explaining student responses, each providing valuable refinements to enhance grading accuracy and explanation quality.

Title: LearNAT: Learning NL2SQL with AST-guided Task Decomposition for Large Language Models

Authors: Weibin Liao, Xin Gao, Tianyu Jia, Rihong Qiu, Yifan Zhu, Yang Lin, Xu Chu, Junfeng Zhao, Yasha Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.02327
Pdf URL: https://arxiv.org/pdf/2504.02327
Copy Paste: [[2504.02327]] LearNAT: Learning NL2SQL with AST-guided Task Decomposition for Large Language Models(https://arxiv.org/abs/2504.02327)
Keywords: language model, gpt, llm, prompt
Abstract: Natural Language to SQL (NL2SQL) has emerged as a critical task for enabling seamless interaction with databases. Recent advancements in Large Language Models (LLMs) have demonstrated remarkable performance in this domain. However, existing NL2SQL methods predominantly rely on closed-source LLMs leveraging prompt engineering, while open-source models typically require fine-tuning to acquire domain-specific knowledge. Despite these efforts, open-source LLMs struggle with complex NL2SQL tasks due to the indirect expression of user query objectives and the semantic gap between user queries and database schemas. Inspired by the application of reinforcement learning in mathematical problem-solving to encourage step-by-step reasoning in LLMs, we propose LearNAT (Learning NL2SQL with AST-guided Task Decomposition), a novel framework that improves the performance of open-source LLMs on complex NL2SQL tasks through task decomposition and reinforcement learning. LearNAT introduces three key components: (1) a Decomposition Synthesis Procedure that leverages Abstract Syntax Trees (ASTs) to guide efficient search and pruning strategies for task decomposition, (2) Margin-aware Reinforcement Learning, which employs fine-grained step-level optimization via DPO with AST margins, and (3) Adaptive Demonstration Reasoning, a mechanism for dynamically selecting relevant examples to enhance decomposition capabilities. Extensive experiments on two benchmark datasets, Spider and BIRD, demonstrate that LearNAT enables a 7B-parameter open-source LLM to achieve performance comparable to GPT-4, while offering improved efficiency and accessibility.

Title: The quasi-semantic competence of LLMs: a case study on the part-whole relation

Authors: Mattia Proietti, Alessandro Lenci
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.02395
Pdf URL: https://arxiv.org/pdf/2504.02395
Copy Paste: [[2504.02395]] The quasi-semantic competence of LLMs: a case study on the part-whole relation(https://arxiv.org/abs/2504.02395)
Keywords: language model, llm, prompt
Abstract: Understanding the extent and depth of the semantic competence of Large Language Models (LLMs) is at the center of the current scientific agenda in Artificial Intelligence (AI) and Computational Linguistics (CL). We contribute to this endeavor by investigating their knowledge of the part-whole relation, a.k.a. meronymy, which plays a crucial role in lexical organization, but it is significantly understudied. We used data from ConceptNet relations \citep{speer2016conceptnet} and human-generated semantic feature norms \citep{McRae:2005} to explore the abilities of LLMs to deal with part-whole relations. We employed several methods based on three levels of analysis: i.) behavioral testing via prompting, where we directly queried the models on their knowledge of meronymy, ii.) sentence probability scoring, where we tested models' abilities to discriminate correct (real) and incorrect (asymmetric counterfactual) part-whole relations, and iii.) concept representation analysis in vector space, where we proved the linear organization of the part-whole concept in the embedding and unembedding spaces. These analyses present a complex picture that reveals that the LLMs' knowledge of this relation is only partial. They have just a "quasi-semantic" competence and still fall short of capturing deep inferential properties.

Title: Scaling Analysis of Interleaved Speech-Text Language Models

Authors: Gallil Maimon, Michael Hassid, Amit Roth, Yossi Adi
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2504.02398
Pdf URL: https://arxiv.org/pdf/2504.02398
Copy Paste: [[2504.02398]] Scaling Analysis of Interleaved Speech-Text Language Models(https://arxiv.org/abs/2504.02398)
Keywords: language model
Abstract: Existing Speech Language Model (SLM) scaling analyses paint a bleak picture. They predict that SLMs require much more compute and data compared to text, leading some to question the feasibility of training high-quality SLMs. However, modern SLMs are often initialised from pre-trained TextLMs using speech-text interleaving to allow knowledge transfer. This raises the question: do interleaved SLMs scale more efficiently than textless SLMs? In this paper we answer with a resounding yes! We conduct scaling analysis of interleaved SLMs by training several dozen and analysing the scaling trends. We see that under this setup SLMs scale more efficiently with compute. Additionally, our results indicate that the scaling dynamics are significantly different from those of textless SLMs, suggesting one should allocate notably more of the compute budget to increasing model size over training tokens. We also study the role of synthetic data and TextLM model families in unlocking this potential. Results suggest that our scaled-up model achieves comparable performance with leading models on speech semantic metrics while using less compute and data than other approaches. We open source models, samples, and data - this https URL.

Title: DaKultur: Evaluating the Cultural Awareness of Language Models for Danish with Native Speakers

Authors: Max Müller-Eberstein, Mike Zhang, Elisa Bassignana, Peter Brunsgaard Trolle, Rob van der Goot
Subjects: cs.CL, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2504.02403
Pdf URL: https://arxiv.org/pdf/2504.02403
Copy Paste: [[2504.02403]] DaKultur: Evaluating the Cultural Awareness of Language Models for Danish with Native Speakers(https://arxiv.org/abs/2504.02403)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have seen widespread societal adoption. However, while they are able to interact with users in languages beyond English, they have been shown to lack cultural awareness, providing anglocentric or inappropriate responses for underrepresented language communities. To investigate this gap and disentangle linguistic versus cultural proficiency, we conduct the first cultural evaluation study for the mid-resource language of Danish, in which native speakers prompt different models to solve tasks requiring cultural awareness. Our analysis of the resulting 1,038 interactions from 63 demographically diverse participants highlights open challenges to cultural adaptation: Particularly, how currently employed automatically translated data are insufficient to train or measure cultural adaptation, and how training on native-speaker data can more than double response acceptance rates. We release our study data as DaKultur - the first native Danish cultural awareness dataset.

Title: AnesBench: Multi-Dimensional Evaluation of LLM Reasoning in Anesthesiology

Authors: Xiang Feng, Wentao Jiang, Zengmao Wang, Yong Luo, Pingbo Xu, Baosheng Yu, Hua Jin, Bo Du, Jing Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.02404
Pdf URL: https://arxiv.org/pdf/2504.02404
Copy Paste: [[2504.02404]] AnesBench: Multi-Dimensional Evaluation of LLM Reasoning in Anesthesiology(https://arxiv.org/abs/2504.02404)
Keywords: language model, llm
Abstract: The application of large language models (LLMs) in the medical field has gained significant attention, yet their reasoning capabilities in more specialized domains like anesthesiology remain underexplored. In this paper, we systematically evaluate the reasoning capabilities of LLMs in anesthesiology and analyze key factors influencing their performance. To this end, we introduce AnesBench, a cross-lingual benchmark designed to assess anesthesiology-related reasoning across three levels: factual retrieval (System 1), hybrid reasoning (System 1.x), and complex decision-making (System 2). Through extensive experiments, we first explore how model characteristics, including model scale, Chain of Thought (CoT) length, and language transferability, affect reasoning performance. Then, we evaluate the effectiveness of different training strategies, including continuous pre-training (CPT) and supervised fine-tuning (SFT), leveraging our curated anesthesiology-related dataset. We also investigate how test-time reasoning techniques, such as Best-of-N sampling and beam search, influence reasoning performance, and assess the impact of reasoning-enhanced model distillation, specifically DeepSeek-R1. We will publicly release AnesBench, along with our CPT and SFT training datasets and evaluation code at this https URL.
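Best-of-N sampling, one of the test-time techniques named in this abstract, can be illustrated with a minimal sketch. The `generate` and `score` callables and the toy stand-ins below are hypothetical placeholders, not part of AnesBench or its evaluation code; in practice they would wrap a language model and a reward model or verifier.

```python
import random
from typing import Callable, List


def best_of_n(prompt: str,
              generate: Callable[[str, random.Random], str],
              score: Callable[[str, str], float],
              n: int = 8,
              seed: int = 0) -> str:
    """Sample n candidate answers and return the one the scorer rates highest."""
    rng = random.Random(seed)
    candidates: List[str] = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))


# Toy stand-ins (hypothetical): the "model" guesses a number, and the
# scorer prefers guesses close to a reference value of 2.0.
def toy_generate(prompt: str, rng: random.Random) -> str:
    return f"{rng.uniform(0.0, 4.0):.2f}"


def toy_score(prompt: str, answer: str) -> float:
    return -abs(float(answer) - 2.0)


best = best_of_n("toy question", toy_generate, toy_score, n=16)
```

Larger `n` trades more inference compute for a better chance that at least one candidate scores well; the quality of the result is bounded by the quality of the scorer.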

Title: Adapting Large Language Models for Multi-Domain Retrieval-Augmented-Generation

Authors: Alexandre Misrahi, Nadezhda Chirkova, Maxime Louis, Vassilina Nikoulina
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.02411
Pdf URL: https://arxiv.org/pdf/2504.02411
Copy Paste: [[2504.02411]] Adapting Large Language Models for Multi-Domain Retrieval-Augmented-Generation(https://arxiv.org/abs/2504.02411)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) enhances LLM factuality, but multi-domain applications face challenges like lack of diverse benchmarks and poor out-of-domain generalization. The first contribution of this work is to introduce a diverse benchmark comprising a variety of question-answering tasks from 8 sources and covering 13 domains. Our second contribution consists in systematically testing out-of-domain generalization for typical RAG tuning strategies. While our findings reveal that standard fine-tuning fails to generalize effectively, we show that sequence-level distillation with teacher-generated labels improves out-of-domain performance by providing more coherent supervision. Our findings highlight key strategies for improving multi-domain RAG robustness.

Title: Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation

Authors: Chuanqi Cheng, Jian Guan, Wei Wu, Rui Yan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.02438
Pdf URL: https://arxiv.org/pdf/2504.02438
Copy Paste: [[2504.02438]] Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation(https://arxiv.org/abs/2504.02438)
Keywords: language model
Abstract: Long-form video processing fundamentally challenges vision-language models (VLMs) due to the high computational costs of handling extended temporal sequences. Existing token pruning and feature merging methods often sacrifice critical temporal dependencies or dilute semantic information. We introduce differential distillation, a principled approach that systematically preserves task-relevant information while suppressing redundancy. Based on this principle, we develop ViLaMP, a hierarchical video-language model that processes hour-long videos at "mixed precision" through two key mechanisms: (1) differential keyframe selection that maximizes query relevance while maintaining temporal distinctiveness at the frame level and (2) differential feature merging that preserves query-salient features in non-keyframes at the patch level. Hence, ViLaMP retains full information in keyframes while reducing non-keyframes to their most salient features, resembling mixed-precision training. Extensive experiments demonstrate ViLaMP's superior performance across four video understanding benchmarks, particularly on long-form content. Notably, ViLaMP can process ultra-long videos (up to 10K frames) on a single NVIDIA A100 GPU, achieving substantial computational efficiency while maintaining state-of-the-art performance.
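The idea of choosing keyframes that are relevant to the query yet distinct from one another can be sketched as a greedy selection. This is only an illustration under stated assumptions: the cosine similarity measure, the `max_similarity` threshold, and the function names are hypothetical and are not taken from the ViLaMP paper.

```python
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0


def select_keyframes(frames: List[Sequence[float]],
                     query: Sequence[float],
                     max_keyframes: int,
                     max_similarity: float = 0.9) -> List[int]:
    """Greedily keep query-relevant frames that differ from those already kept."""
    # Visit frames from most to least relevant to the query.
    order = sorted(range(len(frames)),
                   key=lambda i: cosine(frames[i], query),
                   reverse=True)
    selected: List[int] = []
    for i in order:
        if len(selected) == max_keyframes:
            break
        # Skip frames that are near-duplicates of an already selected keyframe.
        if all(cosine(frames[i], frames[j]) < max_similarity for j in selected):
            selected.append(i)
    return sorted(selected)  # return indices in temporal order


# Toy frame embeddings: frame 1 is a near-duplicate of frame 0.
frames = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.7, 0.7]]
keyframes = select_keyframes(frames, query=[1.0, 0.0], max_keyframes=2)
```

In this toy run frame 1 is skipped because it is almost identical to frame 0, so the second slot goes to the next most relevant frame that is distinct.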

Title: Cognitive Memory in Large Language Models

Authors: Lianlei Shan, Shixian Luo, Zezhou Zhu, Yu Yuan, Yong Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.02441
Pdf URL: https://arxiv.org/pdf/2504.02441
Copy Paste: [[2504.02441]] Cognitive Memory in Large Language Models(https://arxiv.org/abs/2504.02441)
Keywords: language model, llm, hallucination, prompt
Abstract: This paper examines memory mechanisms in Large Language Models (LLMs), emphasizing their importance for context-rich responses, reduced hallucinations, and improved efficiency. It categorizes memory into sensory, short-term, and long-term, with sensory memory corresponding to input prompts, short-term memory processing immediate context, and long-term memory implemented via external databases or structures. The text-based memory section covers acquisition (selection and summarization), management (updating, accessing, storing, and resolving conflicts), and utilization (full-text search, SQL queries, semantic search). The KV cache-based memory section discusses selection methods (regularity-based summarization, score-based approaches, special token embeddings) and compression techniques (low-rank compression, KV merging, multimodal compression), along with management strategies like offloading and shared attention mechanisms. Parameter-based memory methods (LoRA, TTT, MoE) transform memories into model parameters to enhance efficiency, while hidden-state-based memory approaches (chunk mechanisms, recurrent transformers, Mamba model) improve long-text processing by combining RNN hidden states with current methods. Overall, the paper offers a comprehensive analysis of LLM memory mechanisms, highlighting their significance and future research directions.

Title: Inference-Time Scaling for Generalist Reward Modeling

Authors: Zijun Liu, Peiyi Wang, Runxin Xu, Shirong Ma, Chong Ruan, Peng Li, Yang Liu, Yu Wu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.02495
Pdf URL: https://arxiv.org/pdf/2504.02495
Copy Paste: [[2504.02495]] Inference-Time Scaling for Generalist Reward Modeling(https://arxiv.org/abs/2504.02495)
Keywords: language model, llm
Abstract: Reinforcement learning (RL) has been widely adopted in post-training for large language models (LLMs) at scale. Recently, the incentivization of reasoning capabilities in LLMs from RL indicates that proper learning methods could enable effective inference-time scalability. A key challenge of RL is to obtain accurate reward signals for LLMs in various domains beyond verifiable questions or artificial rules. In this work, we investigate how to improve reward modeling (RM) with more inference compute for general queries, i.e. the inference-time scalability of generalist RM, and further, how to improve the effectiveness of performance-compute scaling with proper learning methods. For the RM approach, we adopt pointwise generative reward modeling (GRM) to enable flexibility for different input types and potential for inference-time scaling. For the learning method, we propose Self-Principled Critique Tuning (SPCT) to foster scalable reward generation behaviors in GRMs through online RL, to generate principles adaptively and critiques accurately, resulting in DeepSeek-GRM models. Furthermore, for effective inference-time scaling, we use parallel sampling to expand compute usage, and introduce a meta RM to guide the voting process for better scaling performance. Empirically, we show that SPCT significantly improves the quality and scalability of GRMs, outperforming existing methods and models in various RM benchmarks without severe biases, and could achieve better performance compared to training-time scaling. DeepSeek-GRM still meets challenges in some tasks, which we believe can be addressed by future efforts in generalist reward systems. The models will be released and open-sourced.
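The combination of parallel sampling and meta-RM-guided voting might look roughly like the sketch below. The abstract does not give the details, so everything here is an assumption for illustration: each sampled judgment is represented as a dict of pointwise scores, the meta RM is represented by a precomputed list of quality scores, and the vote is a plain sum over the top-rated samples.

```python
from typing import Dict, List


def meta_guided_vote(samples: List[Dict[str, int]],
                     meta_scores: List[float],
                     keep: int) -> Dict[str, int]:
    """Sum pointwise scores over the `keep` samples the meta RM rates highest."""
    # Rank sampled judgments by their (assumed) meta RM quality score.
    ranked = sorted(zip(meta_scores, range(len(samples))), reverse=True)[:keep]
    totals: Dict[str, int] = {}
    for _, idx in ranked:
        for response, points in samples[idx].items():
            totals[response] = totals.get(response, 0) + points
    return totals


# Three sampled judgments of responses "A" and "B" (toy numbers).
samples = [{"A": 8, "B": 5}, {"A": 7, "B": 6}, {"A": 2, "B": 9}]
meta_scores = [0.9, 0.8, 0.1]  # hypothetical meta RM outputs, one per sample

filtered = meta_guided_vote(samples, meta_scores, keep=2)
unfiltered = meta_guided_vote(samples, meta_scores, keep=3)
```

In this toy case the low-rated third judgment flips the unfiltered vote toward "B", while the filtered vote prefers "A", which is the kind of effect a guiding meta RM is meant to have.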

Title: UNDO: Understanding Distillation as Optimization

Authors: Kushal Jain, Piyushi Goyal, Kumar Shridhar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.02521
Pdf URL: https://arxiv.org/pdf/2504.02521
Copy Paste: [[2504.02521]] UNDO: Understanding Distillation as Optimization(https://arxiv.org/abs/2504.02521)
Keywords: language model, llm, prompt
Abstract: Knowledge distillation has emerged as an effective strategy for compressing large language models' (LLMs) knowledge into smaller, more efficient student models. However, standard one-shot distillation methods often produce suboptimal results due to a mismatch between teacher-generated rationales and the student's specific learning requirements. In this paper, we introduce the UNDO: UNderstanding Distillation as Optimization framework, designed to bridge this gap by iteratively identifying the student's errors and prompting the teacher to refine its explanations accordingly. Each iteration directly targets the student's learning deficiencies, motivating the teacher to provide tailored and enhanced rationales that specifically address these weaknesses. Empirical evaluations on various challenging mathematical and commonsense reasoning tasks demonstrate that our iterative distillation method, UNDO, significantly outperforms standard one-step distillation methods, achieving performance gains of up to 20%. Additionally, we show that teacher-generated data refined through our iterative process remains effective even when applied to different student models, underscoring the broad applicability of our approach. Our work fundamentally reframes knowledge distillation as an iterative teacher-student interaction, effectively leveraging dynamic refinement by the teacher for better knowledge distillation.

Title: Leveraging LLM For Synchronizing Information Across Multilingual Tables

Authors: Siddharth Khincha, Tushar Kataria, Ankita Anand, Dan Roth, Vivek Gupta
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.02559
Pdf URL: https://arxiv.org/pdf/2504.02559
Copy Paste: [[2504.02559]] Leveraging LLM For Synchronizing Information Across Multilingual Tables(https://arxiv.org/abs/2504.02559)
Keywords: language model, llm, prompt
Abstract: The vast amount of online information today poses challenges for non-English speakers, as much of it is concentrated in high-resource languages such as English and French. Wikipedia reflects this imbalance, with content in low-resource languages frequently outdated or incomplete. Recent research has sought to improve cross-language synchronization of Wikipedia tables using rule-based methods. These approaches can be effective, but they struggle with complexity and generalization. This paper explores large language models (LLMs) for multilingual information synchronization, using zero-shot prompting as a scalable solution. We introduce the Information Updation dataset, simulating the real-world process of updating outdated Wikipedia tables, and evaluate LLM performance. Our findings reveal that single-prompt approaches often produce suboptimal results, prompting us to introduce a task decomposition strategy that enhances coherence and accuracy. Our proposed method outperforms existing baselines, particularly in Information Updation (1.79%) and Information Addition (20.58%), highlighting the model's strength in dynamically updating and enriching data across architectures.

Title: Language Models reach higher Agreement than Humans in Historical Interpretation

Authors: Fabio Celli, Georgios Spathulas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.02572
Pdf URL: https://arxiv.org/pdf/2504.02572
Copy Paste: [[2504.02572]] Language Models reach higher Agreement than Humans in Historical Interpretation(https://arxiv.org/abs/2504.02572)
Keywords: language model, hallucination
Abstract: This paper compares historical annotations by humans and Large Language Models. The findings reveal that both exhibit some cultural bias, but Large Language Models achieve a higher consensus on the interpretation of historical facts from short texts. While humans tend to disagree on the basis of their personal biases, Large Models disagree when they skip information or produce hallucinations. These findings have significant implications for digital humanities, enabling large-scale annotation and quantitative analysis of historical data. This offers new educational and research opportunities to explore historical interpretations from different Language Models, fostering critical thinking about bias.

Title: LexPam: Legal Procedure Awareness-Guided Mathematical Reasoning

Authors: Kepu Zhang, Guofu Xie, Weijie Yu, Mingyue Xu, Xu Tang, Yaxin Li, Jun Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.02590
Pdf URL: https://arxiv.org/pdf/2504.02590
Copy Paste: [[2504.02590]] LexPam: Legal Procedure Awareness-Guided Mathematical Reasoning(https://arxiv.org/abs/2504.02590)
Keywords: llm
Abstract: The legal mathematical reasoning ability of LLMs is crucial when applying them to real-world scenarios, as it directly affects the credibility of the LLM. While existing legal LLMs can perform general judicial question answering, their legal mathematical reasoning capabilities have not been trained. Open-domain reasoning models, though able to generate detailed calculation steps, do not follow the reasoning logic required for legal scenarios. Additionally, there is currently a lack of legal mathematical reasoning datasets to help validate and enhance LLMs' reasoning abilities in legal contexts. To address these issues, we propose the first Chinese legal Mathematical Reasoning Dataset, LexNum, which includes three common legal mathematical reasoning scenarios: economic compensation, work injury compensation, and traffic accident compensation. Based on LexNum, we tested the performance of existing legal LLMs and reasoning LLMs, and introduced LexPam, a reinforcement learning algorithm guided by legal procedural awareness to train LLMs, enhancing their mathematical reasoning abilities in legal scenarios. Experiments on tasks in the three legal scenarios show that the performance of existing legal LLMs and reasoning models in legal mathematical reasoning tasks is unsatisfactory. LexPam can enhance the LLM's ability in these tasks.

Title: LLM for Complex Reasoning Task: An Exploratory Study in Fermi Problems

Authors: Zishuo Liu, Carlos Rabat Villarreal, Mostafa Rahgouy, Amit Das, Zheng Zhang, Chang Ren, Dongji Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.02671
Pdf URL: https://arxiv.org/pdf/2504.02671
Copy Paste: [[2504.02671]] LLM for Complex Reasoning Task: An Exploratory Study in Fermi Problems(https://arxiv.org/abs/2504.02671)
Keywords: language model, llm, prompt
Abstract: Fermi Problems (FPs) are mathematical reasoning tasks that require human-like logic and numerical reasoning. Unlike other reasoning questions, FPs often involve real-world impracticalities or ambiguous concepts, making them challenging even for humans to solve. Despite advancements in AI, particularly with large language models (LLMs) in various reasoning tasks, FPs remain relatively under-explored. This work conducted an exploratory study to examine the capabilities and limitations of LLMs in solving FPs. We first evaluated the overall performance of three advanced LLMs using a publicly available FP dataset. We designed prompts according to the recently proposed TELeR taxonomy, including a zero-shot scenario. Results indicated that all three LLMs achieved an fp_score (ranging from 0 to 1) below 0.5, underscoring the inherent difficulty of these reasoning tasks. To further investigate, we categorized FPs into standard and specific questions, hypothesizing that LLMs would perform better on standard questions, which are characterized by clarity and conciseness, than on specific ones. Comparative experiments confirmed this hypothesis, demonstrating that LLMs performed better on standard FPs in terms of both accuracy and efficiency.
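The abstract does not define fp_score. As a hedged illustration, the sketch below assumes the order-of-magnitude definition commonly used with public Fermi problem data, where full credit is given for an exact answer and credit falls linearly to zero once the prediction is three or more orders of magnitude off. Whether this study uses exactly this formula is an assumption.

```python
import math


def fp_score(predicted: float, gold: float) -> float:
    """Order-of-magnitude score in [0, 1]; assumed definition, see lead-in."""
    if predicted <= 0 or gold <= 0:
        # The log ratio is undefined for non-positive values; score as a miss.
        return 0.0
    return max(0.0, 1.0 - abs(math.log10(predicted / gold)) / 3.0)


exact = fp_score(100.0, 100.0)         # prediction matches the reference
one_order_off = fp_score(1000.0, 100.0)  # off by a factor of 10
far_off = fp_score(1e6, 1.0)           # off by six orders of magnitude
```

Under this definition, a score below 0.5 corresponds to answers that are on average more than 1.5 orders of magnitude away from the reference.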

Title: The Hidden Space of Safety: Understanding Preference-Tuned LLMs in Multilingual context

Authors: Nikhil Verma, Manasa Bharadwaj
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.02708
Pdf URL: https://arxiv.org/pdf/2504.02708
Copy Paste: [[2504.02708]] The Hidden Space of Safety: Understanding Preference-Tuned LLMs in Multilingual context(https://arxiv.org/abs/2504.02708)
Keywords: language model, llm
Abstract: Alignment tuning has enabled large language models to excel in reasoning, instruction-following, and minimizing harmful generations. However, despite their widespread deployment, these models exhibit a monolingual bias, raising concerns about the effectiveness of alignment across languages. Current alignment methods predominantly focus on English, leaving it unclear how alignment mechanisms generalize to multilingual settings. To address this, we conduct a systematic analysis of distributional shifts in the embedding space of LLMs before and after alignment, uncovering its impact on model behavior across diverse languages. We leverage the alignment-induced separation in safety space as a quantitative tool to measure how alignment enforces safety constraints. Our study evaluates seven LLMs using balanced toxicity datasets and parallel text-detoxification benchmarks, revealing substantial disparities in the latent representation space between high-resource and low-resource languages. These findings underscore the need for language-specific fine-tuning to ensure fair, reliable and robust multilingual alignment. Our insights provide a foundation for developing truly safe multilingual LLMs, emphasizing the urgency of addressing alignment gaps in underrepresented languages.

Title: ERPO: Advancing Safety Alignment via Ex-Ante Reasoning Preference Optimization

Authors: Kehua Feng, Keyan Ding, Jing Yu, Menghan Li, Yuhao Wang, Tong Xu, Xinda Wang, Qiang Zhang, Huajun Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.02725
Pdf URL: https://arxiv.org/pdf/2504.02725
Copy Paste: [[2504.02725]] ERPO: Advancing Safety Alignment via Ex-Ante Reasoning Preference Optimization(https://arxiv.org/abs/2504.02725)
Keywords: language model, llm, chain-of-thought
Abstract: Recent advancements in large language models (LLMs) have accelerated progress toward artificial general intelligence, yet their potential to generate harmful content poses critical safety challenges. Existing alignment methods often struggle to cover diverse safety scenarios and remain vulnerable to adversarial attacks. In this work, we propose Ex-Ante Reasoning Preference Optimization (ERPO), a novel safety alignment framework that equips LLMs with explicit preemptive reasoning through Chain-of-Thought and provides clear evidence for safety judgments by embedding predefined safety rules. Specifically, our approach consists of three stages: first, equipping the model with Ex-Ante reasoning through supervised fine-tuning (SFT) using a constructed reasoning module; second, enhancing safety, usefulness, and efficiency via Direct Preference Optimization (DPO); and third, mitigating inference latency with a length-controlled iterative preference optimization strategy. Experiments on multiple open-source LLMs demonstrate that ERPO significantly enhances safety performance while maintaining response efficiency.

Title: Why do LLMs attend to the first token?

Authors: Federico Barbero, Álvaro Arroyo, Xiangming Gu, Christos Perivolaropoulos, Michael Bronstein, Petar Veličković, Razvan Pascanu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.02732
Pdf URL: https://arxiv.org/pdf/2504.02732
Copy Paste: [[2504.02732]] Why do LLMs attend to the first token?(https://arxiv.org/abs/2504.02732)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) tend to attend heavily to the first token in the sequence -- creating a so-called attention sink. Many works have studied this phenomenon in detail, proposing various ways to either leverage or alleviate it. Attention sinks have been connected to quantisation difficulties, security issues, and streaming attention. Yet, while many works have provided conditions in which they occur or not, a critical question remains shallowly answered: Why do LLMs learn such patterns and how are they being used? In this work, we argue theoretically and empirically that this mechanism provides a method for LLMs to avoid over-mixing, connecting this to existing lines of work that study mathematically how information propagates in Transformers. We conduct experiments to validate our theoretical intuitions and show how choices such as context length, depth, and data packing influence the sink behaviour. We hope that this study provides a new practical perspective on why attention sinks are useful in LLMs, leading to a better understanding of the attention patterns that form during training.
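One simple way to quantify how strongly a head attends to the first token is to average the attention weight that later query positions place on position 0. The sketch below assumes a row-stochastic causal attention matrix given as nested lists; the metric and the name `sink_mass` are illustrative choices, not the measure used in the paper.

```python
from typing import List


def sink_mass(attention: List[List[float]]) -> float:
    """Mean attention weight on token 0, averaged over query positions 1..T-1."""
    # Row 0 is excluded: under causal masking the first token can only attend
    # to itself, so it would trivially contribute a weight of 1.0.
    rows = attention[1:]
    return sum(row[0] for row in rows) / len(rows) if rows else 0.0


# Toy causal attention pattern for a 3-token sequence (each row sums to 1).
attention = [
    [1.0, 0.0, 0.0],
    [0.8, 0.2, 0.0],
    [0.7, 0.1, 0.2],
]
mass = sink_mass(attention)
```

A value near 1.0 would indicate that almost all attention is absorbed by the first token, while a value near 1/T for a length-T sequence would be closer to uniform attention.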

Title: Enhancing LLM Robustness to Perturbed Instructions: An Empirical Study

Authors: Aryan Agrawal, Lisa Alazraki, Shahin Honarvar, Marek Rei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.02733
Pdf URL: https://arxiv.org/pdf/2504.02733
Copy Paste: [[2504.02733]] Enhancing LLM Robustness to Perturbed Instructions: An Empirical Study(https://arxiv.org/abs/2504.02733)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are highly vulnerable to input perturbations, as even a small prompt change may result in a substantially different output. Existing methods to enhance LLM robustness are primarily focused on perturbed data samples, whereas improving resiliency to perturbations of task-level instructions has remained relatively underexplored. In this work, we focus on character- and word-level edits of task-specific instructions, which substantially degrade downstream performance. We experiment with a variety of techniques to enhance the robustness of LLMs, including self-denoising and representation alignment, testing different models (Llama 3 and Flan-T5), datasets (CoLA, QNLI, SST-2) and instructions (both task-oriented and role-oriented). We find that, on average, self-denoising -- whether performed by a frozen LLM or a fine-tuned model -- achieves substantially higher performance gains than alternative strategies, including more complex baselines such as ensembling and supervised methods.

Title: MultiBLiMP 1.0: A Massively Multilingual Benchmark of Linguistic Minimal Pairs

Authors: Jaap Jumelet, Leonie Weissweiler, Arianna Bisazza
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.02768
Pdf URL: https://arxiv.org/pdf/2504.02768
Copy Paste: [[2504.02768]] MultiBLiMP 1.0: A Massively Multilingual Benchmark of Linguistic Minimal Pairs(https://arxiv.org/abs/2504.02768)
Keywords: llm
Abstract: We introduce MultiBLiMP 1.0, a massively multilingual benchmark of linguistic minimal pairs, covering 101 languages, 6 linguistic phenomena and containing more than 125,000 minimal pairs. Our minimal pairs are created using a fully automated pipeline, leveraging the large-scale linguistic resources of Universal Dependencies and UniMorph. MultiBLiMP 1.0 evaluates abilities of LLMs at an unprecedented multilingual scale, and highlights the shortcomings of the current state-of-the-art in modelling low-resource languages.
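Minimal-pair benchmarks are typically scored by checking whether a model assigns higher probability to the grammatical sentence than to its ungrammatical counterpart. The sketch below shows that accuracy computation; the `log_prob` callable and the toy lookup table are hypothetical stand-ins for a real language model scorer, and the example pairs are not drawn from MultiBLiMP.

```python
from typing import Callable, Iterable, Tuple


def minimal_pair_accuracy(pairs: Iterable[Tuple[str, str]],
                          log_prob: Callable[[str], float]) -> float:
    """Fraction of (grammatical, ungrammatical) pairs the scorer ranks correctly."""
    pairs = list(pairs)
    correct = sum(1 for good, bad in pairs if log_prob(good) > log_prob(bad))
    return correct / len(pairs) if pairs else 0.0


# Toy sentence log-probabilities standing in for a language model.
toy_log_probs = {
    "The dogs bark.": -4.0,
    "The dogs barks.": -6.5,   # ranked correctly: less likely than the good form
    "She walks home.": -5.2,
    "She walk home.": -5.0,    # ranked incorrectly by this toy scorer
}

pairs = [("The dogs bark.", "The dogs barks."),
         ("She walks home.", "She walk home.")]
accuracy = minimal_pair_accuracy(pairs, toy_log_probs.__getitem__)
```

Because each pair differs in a single feature, chance performance is 0.5, which makes per-language and per-phenomenon accuracies directly comparable.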

Title: A Framework for Robust Cognitive Evaluation of LLMs

Authors: Karin de Langis, Jong Inn Park, Bin Hu, Khanh Chi Le, Andreas Schramm, Michael C. Mensink, Andrew Elfenbein, Dongyeop Kang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.02789
Pdf URL: https://arxiv.org/pdf/2504.02789
Copy Paste: [[2504.02789]] A Framework for Robust Cognitive Evaluation of LLMs(https://arxiv.org/abs/2504.02789)
Keywords: language model, llm, prompt
Abstract: Emergent cognitive abilities in large language models (LLMs) have been widely observed, but their nature and underlying mechanisms remain poorly understood. A growing body of research draws on cognitive science to investigate LLM cognition, but standard methodologies and experimental pipelines have not yet been established. To address this gap, we develop CognitivEval, a framework for systematically evaluating the artificial cognitive capabilities of LLMs, with a particular emphasis on robustness in response collection. The key features of CognitivEval include: (i) automatic prompt permutations, and (ii) testing that gathers both generations and model probability estimates. Our experiments demonstrate that these features lead to more robust experimental outcomes. Using CognitivEval, we replicate five classic experiments in cognitive science, illustrating the framework's generalizability across various experimental tasks and obtaining a cognitive profile of several state-of-the-art LLMs. CognitivEval will be released publicly to foster broader collaboration within the cognitive science community.
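Automatic prompt permutation can be pictured as enumerating surface variants of the same test item so that results do not hinge on one phrasing. The abstract does not say which parts of a prompt CognitivEval permutes, so the sketch below makes an assumption: it varies the instruction wording and the order of answer options. The function name is hypothetical.

```python
from itertools import permutations, product
from typing import Iterator, Sequence


def prompt_variants(instructions: Sequence[str],
                    options: Sequence[str]) -> Iterator[str]:
    """Yield one prompt per instruction paraphrase and per ordering of options."""
    for instruction, ordered in product(instructions, permutations(options)):
        # Label options A, B, C, ... in their permuted order.
        lines = [instruction] + [f"{chr(65 + i)}. {opt}"
                                 for i, opt in enumerate(ordered)]
        yield "\n".join(lines)


variants = list(prompt_variants(
    ["Pick one.", "Choose the best answer."],
    ["yes", "no", "maybe"],
))
```

Two instruction paraphrases and three options give 2 x 3! = 12 variants; aggregating a model's responses over all of them is one way to reduce sensitivity to prompt wording and option position.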

Title: A Survey of Large Language Models in Mental Health Disorder Detection on Social Media

Authors: Zhuohan Ge (1), Nicole Hu (2), Darian Li (1), Yubo Wang (3), Shihao Qi (1), Yuming Xu (1), Han Shi (3), Jason Zhang (1) ((1) The Hong Kong Polytechnic University, (2) The Chinese University of Hong Kong, (3) Hong Kong University of Science and Technology)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.02800
Pdf URL: https://arxiv.org/pdf/2504.02800
Copy Paste: [[2504.02800]] A Survey of Large Language Models in Mental Health Disorder Detection on Social Media(https://arxiv.org/abs/2504.02800)
Keywords: language model, llm
Abstract: The detection of and intervention in mental health issues represent a critical global research focus, and social media data has been recognized as an important resource for mental health research. However, how to utilize Large Language Models (LLMs) for mental health problem detection on social media poses significant challenges. Hence, this paper aims to explore the potential of LLM applications in social media data analysis, focusing not only on the most common psychological disorders such as depression and anxiety but also incorporating psychotic disorders and externalizing disorders, summarizing the application methods of LLMs from different dimensions, such as text data analysis and detection of mental disorders, and revealing the major challenges and shortcomings of current research. In addition, the paper provides an overview of popular datasets and evaluation metrics. The survey in this paper provides a comprehensive frame of reference for researchers in the field of mental health, while demonstrating the great potential of LLMs in mental health detection to facilitate the further application of LLMs in future mental health interventions.

Title: MegaMath: Pushing the Limits of Open Math Corpora

Authors: Fan Zhou, Zengzhi Wang, Nikhil Ranjan, Zhoujun Cheng, Liping Tang, Guowei He, Zhengzhong Liu, Eric P. Xing
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.02807
Pdf URL: https://arxiv.org/pdf/2504.02807
Copy Paste: [[2504.02807]] MegaMath: Pushing the Limits of Open Math Corpora(https://arxiv.org/abs/2504.02807)
Keywords: language model, llm
Abstract: Mathematical reasoning is a cornerstone of human intelligence and a key benchmark for advanced capabilities in large language models (LLMs). However, the research community still lacks an open, large-scale, high-quality corpus tailored to the demands of math-centric LLM pre-training. We present MegaMath, an open dataset curated from diverse, math-focused sources through the following practices: (1) Revisiting web data: We re-extracted mathematical documents from Common Crawl with math-oriented HTML optimizations, fasttext-based filtering and deduplication, all for acquiring higher-quality data on the Internet. (2) Recalling math-related code data: We identified high-quality math-related code from a large code training corpus, Stack-V2, further enhancing data diversity. (3) Exploring synthetic data: We synthesized QA-style text, math-related code, and interleaved text-code blocks from web data or code data. By integrating these strategies and validating their effectiveness through extensive ablations, MegaMath delivers 371B tokens with the largest quantity and top quality among existing open math pre-training datasets.

Title: Generative Evaluation of Complex Reasoning in Large Language Models

Authors: Haowei Lin, Xiangyu Wang, Ruilin Yan, Baizhou Huang, Haotian Ye, Jianhua Zhu, Zihao Wang, James Zou, Jianzhu Ma, Yitao Liang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.02810
Pdf URL: https://arxiv.org/pdf/2504.02810
Copy Paste: [[2504.02810]] Generative Evaluation of Complex Reasoning in Large Language Models(https://arxiv.org/abs/2504.02810)
Keywords: language model, llm
Abstract: With powerful large language models (LLMs) demonstrating superhuman reasoning capabilities, a critical question arises: Do LLMs genuinely reason, or do they merely recall answers from their extensive, web-scraped training datasets? Publicly released benchmarks inevitably become contaminated once incorporated into subsequent LLM training sets, undermining their reliability as faithful assessments. To address this, we introduce KUMO, a generative evaluation framework designed specifically for assessing reasoning in LLMs. KUMO synergistically combines LLMs with symbolic engines to dynamically produce diverse, multi-turn reasoning tasks that are partially observable and adjustable in difficulty. Through an automated pipeline, KUMO continuously generates novel tasks across open-ended domains, compelling models to demonstrate genuine generalization rather than memorization. We evaluated 23 state-of-the-art LLMs on 5,000 tasks across 100 domains created by KUMO, benchmarking their reasoning abilities against university students. Our findings reveal that many LLMs have surpassed university-level performance on easy reasoning tasks, and reasoning-scaled LLMs reach university-level performance on complex reasoning challenges. Moreover, LLM performance on KUMO tasks correlates strongly with results on newly released real-world reasoning benchmarks, underscoring KUMO's value as a robust, enduring assessment tool for genuine LLM reasoning capabilities.