2025-04-14

Title: Metamorphic Testing for Fairness Evaluation in Large Language Models: Identifying Intersectional Bias in LLaMA and GPT

Authors: Harishwar Reddy, Madhusudan Srinivasan, Upulee Kanewala
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.07982
Pdf URL: https://arxiv.org/pdf/2504.07982
Copy Paste: [[2504.07982]] Metamorphic Testing for Fairness Evaluation in Large Language Models: Identifying Intersectional Bias in LLaMA and GPT(https://arxiv.org/abs/2504.07982)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) have made significant strides in Natural Language Processing but remain vulnerable to fairness-related issues, often reflecting biases inherent in their training data. These biases pose risks, particularly when LLMs are deployed in sensitive areas such as healthcare, finance, and law. This paper introduces a metamorphic testing (MT) approach to systematically identify fairness bugs in LLMs. We define and apply a set of fairness-oriented metamorphic relations (MRs) to assess the LLaMA and GPT models, both state-of-the-art LLMs, across diverse demographic inputs. Our methodology includes generating source and follow-up test cases for each MR and analyzing model responses for fairness violations. The results demonstrate the effectiveness of MT in exposing bias patterns, especially in relation to tone and sentiment, and highlight specific intersections of sensitive attributes that frequently reveal fairness faults. This research improves fairness testing in LLMs, providing a structured approach to detect and mitigate biases and improve model robustness in fairness-sensitive applications.
摘要：大型语言模型（LLM）在自然语言处理方面取得了长足的进步，但仍然容易受到与公平相关的问题的影响，通常反映了其培训数据中固有的偏见。这些偏见会带来风险，特别是当LLM部署在敏感地区，例如医疗保健，金融和法律时。本文介绍了一种变质测试方法，以系统地识别LLMS中的公平性错误。我们定义并应用了一组面向公平的变质关系（MRS），以评估各种人口统计学输入中的最先进的LLM Llama和GPT模型。我们的方法包括为每个MR生成源和后续测试案例，并分析违反公平的模型响应。结果表明，MT在暴露偏见模式中的有效性，尤其是与语调和情感有关，并突出显示了经常揭示公平性故障的敏感属性的特定交集。这项研究改善了LLMS中的公平测试，提供了一种结构化方法来检测和减轻偏见并改善对公平敏感应用中的模型鲁棒性。
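
Illustrative sketch (not from the paper): the abstract describes generating source and follow-up test cases for each metamorphic relation and checking responses for fairness violations. The snippet below shows one way such a check could look. The `toy_model` function is a hypothetical, deliberately biased stand-in for an LLM call, and the word-list sentiment scorer, the prompt template, and the `tolerance` threshold are all assumptions made for the example.

```python
from itertools import product

# Toy sentiment lexicon (assumption: a real setup would use a proper classifier).
POSITIVE = {"excellent", "strong", "great"}
NEGATIVE = {"poor", "weak", "risky"}


def sentiment(text):
    """Toy score: number of positive words minus number of negative words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def toy_model(prompt):
    """Hypothetical stand-in for an LLM, biased on one attribute intersection."""
    if "older" in prompt and "woman" in prompt:
        return "This applicant seems risky."
    return "This applicant seems strong."


def make_prompt(age, gender):
    """Hypothetical prompt template; only the sensitive attributes vary."""
    return f"Assess a loan application from a {age} {gender}."


def check_relation(model, source_attrs, followup_attrs, tolerance=0):
    """MR: changing only sensitive attributes should not shift the sentiment."""
    source_score = sentiment(model(make_prompt(*source_attrs)))
    followup_score = sentiment(model(make_prompt(*followup_attrs)))
    return abs(source_score - followup_score) <= tolerance


def find_violations(model, ages, genders):
    """Compare every attribute intersection against the first combination."""
    source = (ages[0], genders[0])
    return [attrs for attrs in product(ages, genders)
            if not check_relation(model, source, attrs)]


violations = find_violations(toy_model, ["younger", "older"], ["man", "woman"])
print(violations)
```

Iterating over the Cartesian product of attribute values is what makes the check intersectional: a model can pass on each attribute alone and still fail on a specific combination, as the toy model does here.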

Title: Psychological Health Knowledge-Enhanced LLM-based Social Network Crisis Intervention Text Transfer Recognition Method

Authors: Shurui Wu, Xinyi Huang, Dingxin Lu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.07983
Pdf URL: https://arxiv.org/pdf/2504.07983
Copy Paste: [[2504.07983]] Psychological Health Knowledge-Enhanced LLM-based Social Network Crisis Intervention Text Transfer Recognition Method(https://arxiv.org/abs/2504.07983)
Keywords: language model, llm
Abstract: As the prevalence of mental health crises increases on social media platforms, identifying and preventing potential harm has become an urgent challenge. This study introduces a large language model (LLM)-based text transfer recognition method for social network crisis intervention, enhanced with domain-specific mental health knowledge. We propose a multi-level framework that incorporates transfer learning using BERT, and integrates mental health knowledge, sentiment analysis, and behavior prediction techniques. The framework includes a crisis annotation tool trained on social media datasets from real-world events, enabling the model to detect nuanced emotional cues and identify psychological crises. Experimental results show that the proposed method outperforms traditional models in crisis detection accuracy and exhibits greater sensitivity to subtle emotional and contextual variations.
摘要：随着社交媒体平台上心理健康危机的普遍性的增加，确定和预防潜在伤害已成为紧迫的挑战。这项研究介绍了一个大型语言模型（LLM）基于社交网络危机干预的文本转移识别方法，并通过特定领域的心理健康知识增强了。我们提出了一个多层框架，该框架结合了使用BERT的转移学习，并整合了心理健康知识，情感分析和行为预测技术。该框架包括一个在现实世界中的社交媒体数据集中训练的危机注释工具，使该模型能够检测细微的情感线索并确定心理危机。实验结果表明，该提出的方法在危机检测准确性中的表现优于传统模型，并表现出对微妙的情感和上下文变化的更高敏感性。

Title: SEAL: Steerable Reasoning Calibration of Large Language Models for Free

Authors: Runjin Chen, Zhenyu Zhang, Junyuan Hong, Souvik Kundu, Zhangyang Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.07986
Pdf URL: https://arxiv.org/pdf/2504.07986
Copy Paste: [[2504.07986]] SEAL: Steerable Reasoning Calibration of Large Language Models for Free(https://arxiv.org/abs/2504.07986)
Keywords: language model, llm, chain-of-thought
Abstract: Large Language Models (LLMs), such as OpenAI's o1-series, have demonstrated compelling capabilities for complex reasoning tasks via the extended chain-of-thought (CoT) reasoning mechanism. However, recent studies reveal substantial redundancy in the CoT reasoning traces, which not only increases inference latency but also negatively impacts model performance by diverting attention to unnecessary reasoning paths. To address this issue, we investigate the internal reasoning structures of LLMs and categorize them into three primary thought types: execution, reflection, and transition thoughts. Moreover, our analysis reveals that excessive reflection and transition thoughts are strongly correlated with failure cases and these thought categories exhibit clear separation in the latent space. Based on these, we introduce SEAL (Steerable reasoning calibration), a training-free approach that seamlessly calibrates the CoT process, improving accuracy while demonstrating significant efficiency gains. SEAL consists of an offline stage for extracting the reasoning steering vector in the latent space, followed by an on-the-fly calibration of the reasoning trace through representation intervention using the steering vector. Notably, the steering vector exhibits strong transferability across various tasks. Extensive experiments across multiple models (DeepSeek-R1-Distill and QwQ-32B-Preview) and benchmarks (Math500, GSM8K, LiveCodeBench) validate the effectiveness of SEAL, with up to an 11% improvement in accuracy while reducing reasoning tokens by 11.8% to 50.4%. Our code is publicly available at this https URL.
摘要：大型语言模型（LLM），例如OpenAI的o1系列，通过扩展的思维链（CoT）推理机制，在复杂推理任务上展现了令人瞩目的能力。然而，最近的研究表明，CoT推理轨迹中存在大量冗余，这不仅增加了推理延迟，还会将注意力转移到不必要的推理路径上，从而对模型性能产生负面影响。为了解决这个问题，我们研究了LLM的内部推理结构，并将其分为三种主要的思维类型：执行、反思和过渡。此外，我们的分析表明，过多的反思和过渡思维与失败案例密切相关，并且这些思维类别在潜在空间中表现出明显的分离。基于这些发现，我们提出了SEAL（Steerable reasoning calibration，可引导的推理校准），这是一种无需训练的方法，可以无缝校准CoT过程，在提高准确性的同时带来显著的效率提升。SEAL包含一个离线阶段，用于在潜在空间中提取推理引导向量；随后利用该引导向量进行表示干预，对推理轨迹进行即时校准。值得注意的是，该引导向量在各种任务之间表现出很强的可迁移性。在多个模型（DeepSeek-R1-Distill和QwQ-32B-Preview）和基准（Math500、GSM8K、LiveCodeBench）上的大量实验验证了SEAL的有效性：准确性最多提高11％，同时推理token减少11.8％至50.4％。我们的代码在此https URL上公开可用。
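
Illustrative sketch (not from the paper): the abstract describes an offline stage that extracts a steering vector in latent space and an on-the-fly intervention that applies it. The abstract does not give the formula, so the snippet below assumes a common construction: the vector is the difference between the mean hidden state of execution thoughts and the mean hidden state of reflection/transition thoughts, and it is added to a hidden state with a hypothetical scale `alpha`. The two-dimensional vectors are made-up toy data.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(execution_states, other_states):
    """Assumed offline stage: mean(execution) minus mean(reflection/transition)."""
    mu_exec = mean_vector(execution_states)
    mu_other = mean_vector(other_states)
    return [e - o for e, o in zip(mu_exec, mu_other)]


def steer(hidden_state, vector, alpha=1.0):
    """Assumed online stage: shift a hidden state along the steering vector."""
    return [h + alpha * v for h, v in zip(hidden_state, vector)]


# Toy hidden states (assumption: real ones come from a chosen model layer).
execution = [[1.0, 0.0], [3.0, 2.0]]
reflection_transition = [[0.0, 1.0], [2.0, 5.0]]

v = steering_vector(execution, reflection_transition)
steered = steer([0.5, 0.5], v, alpha=0.5)
print(v, steered)
```

Because the vector is computed once offline, the online cost of this kind of intervention is a single vector addition per steered position, which is consistent with the "training-free" framing in the abstract.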

Title: Regional Tiny Stories: Using Small Models to Compare Language Learning and Tokenizer Performance

Authors: Nirvan Patil, Malhar Abhay Inamdar, Agnivo Gosai, Guruprasad Pathak, Anish Joshi, Aryan Sagavekar, Anish Joshirao, Raj Dandekar, Rajat Dandekar, Sreedath Panat
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.07989
Pdf URL: https://arxiv.org/pdf/2504.07989
Copy Paste: [[2504.07989]] Regional Tiny Stories: Using Small Models to Compare Language Learning and Tokenizer Performance(https://arxiv.org/abs/2504.07989)
Keywords: language model, llm
Abstract: Small Language Models (SLMs) offer efficient alternatives to LLMs for specific domains. The 2023 TinyStories study developed an English dataset that allows SLMs with 1 to 10 million parameters to produce coherent outputs. Our research expands this framework by translating the original dataset into Indian languages and creating synthetic data using LLMs. We focus on Hindi, Marathi, and Bengali, evaluating SLMs for regional language processing and understanding linguistic complexity. We show that SLMs efficiently process regional languages with significantly fewer parameters than LLMs, providing a complementary framework for "inference-based evaluation" of tokenization strategies and linguistic complexity. Our analysis shows that language-specific tokenizers outperform general-purpose ones for Indian languages. Empirical validations, supported by information-theoretic and morphological analyses, provide a fundamental understanding of the better performance of Hindi models over Marathi and Bengali. Additionally, we show that synthetic datasets outperform translated content for training SLMs. Correlation analyses reveal cross-linguistic patterns and language-specific relationships between creativity, grammatical precision, and narrative completeness. These findings advance both the practical application of SLMs to underserved languages and our theoretical understanding of neural language development.
摘要：小型语言模型（SLM）为特定领域提供了LLM的高效替代方案。2023年的TinyStories研究开发了一个英语数据集，使参数量为100万至1000万的SLM能够生成连贯的输出。我们的研究通过将原始数据集翻译成印度语言并使用LLM创建合成数据来扩展这一框架。我们专注于印地语、马拉地语和孟加拉语，评估SLM的区域语言处理能力并理解语言复杂性。我们表明，SLM能够以远少于LLM的参数高效处理区域语言，为分词策略和语言复杂性的“基于推理的评估”提供了一个补充框架。我们的分析表明，针对特定语言的分词器在印度语言上优于通用分词器。由信息论和形态学分析支持的实证验证，为印地语模型优于马拉地语和孟加拉语模型提供了基本解释。此外，我们表明，在训练SLM时，合成数据集优于翻译内容。相关性分析揭示了创造力、语法准确性和叙事完整性之间的跨语言模式和特定于语言的关系。这些发现既推进了SLM在服务不足语言上的实际应用，也加深了我们对神经语言发展的理论理解。

Title: 'Neural howlround' in large language models: a self-reinforcing bias phenomenon, and a dynamic attenuation solution

Authors: Seth Drake
Subjects: cs.CL, cs.AI, cs.NE
Abstract URL: https://arxiv.org/abs/2504.07992
Pdf URL: https://arxiv.org/pdf/2504.07992
Copy Paste: [[2504.07992]] 'Neural howlround' in large language models: a self-reinforcing bias phenomenon, and a dynamic attenuation solution(https://arxiv.org/abs/2504.07992)
Keywords: language model, llm
Abstract: Large language model (LLM)-driven AI systems may exhibit an inference failure mode we term 'neural howlround,' a self-reinforcing cognitive loop where certain highly weighted inputs become dominant, leading to entrenched response patterns resistant to correction. This paper explores the mechanisms underlying this phenomenon, which is distinct from model collapse and biased salience weighting. We propose an attenuation-based correction mechanism that dynamically introduces counterbalancing adjustments and can restore adaptive reasoning, even in 'locked-in' AI systems. Additionally, we discuss some other related effects arising from improperly managed reinforcement. Finally, we outline potential applications of this mitigation strategy for improving AI robustness in real-world decision-making tasks.
摘要：大型语言模型（LLM）驱动的AI系统可能表现出推理故障模式，我们称其为“神经ho叫”，这是一种自我增强的认知环，其中某些高度加权的输入变得占主导地位，从而导致根深蒂固的响应模式可抵抗校正。本文探讨了这种现象的基础机制，该机制不同于模型崩溃和偏见的显着性加权。我们提出了一种基于衰减的校正机制，即使在“锁定” AI系统中，也可以动态引入平衡调整并可以恢复适应性推理。此外，我们讨论了由不当管理的强化产生的其他相关效果。最后，我们概述了这种缓解策略的潜在应用，以改善现实世界决策任务中的AI鲁棒性。

Title: SafeChat: A Framework for Building Trustworthy Collaborative Assistants and a Case Study of its Usefulness

Authors: Biplav Srivastava, Kausik Lakkaraju, Nitin Gupta, Vansh Nagpal, Bharath C. Muppasani, Sara E. Jones
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.07995
Pdf URL: https://arxiv.org/pdf/2504.07995
Copy Paste: [[2504.07995]] SafeChat: A Framework for Building Trustworthy Collaborative Assistants and a Case Study of its Usefulness(https://arxiv.org/abs/2504.07995)
Keywords: language model, gpt, llm, chat
Abstract: Collaborative assistants, or chatbots, are data-driven decision support systems that enable natural interaction for task completion. While they can meet critical needs in modern society, concerns about their reliability and trustworthiness persist. In particular, Large Language Model (LLM)-based chatbots like ChatGPT, Gemini, and DeepSeek are becoming more accessible. However, such chatbots have limitations, including their inability to explain response generation, the risk of generating problematic content, the lack of standardized testing for reliability, and the need for deep AI expertise and extended development times. These issues make chatbots unsuitable for trust-sensitive applications like elections or healthcare. To address these concerns, we introduce SafeChat, a general architecture for building safe and trustworthy chatbots, with a focus on information retrieval use cases. Key features of SafeChat include: (a) safety, with a domain-agnostic design where responses are grounded and traceable to approved sources (provenance), and 'do-not-respond' strategies to prevent harmful answers; (b) usability, with automatic extractive summarization of long responses, traceable to their sources, and automated trust assessments to communicate expected chatbot behavior, such as sentiment; and (c) fast, scalable development, including a CSV-driven workflow, automated testing, and integration with various devices. We implemented SafeChat in an executable framework using the open-source chatbot platform Rasa. A case study demonstrates its application in building ElectionBot-SC, a chatbot designed to safely disseminate official election information. SafeChat is being used in many domains, validating its potential, and is available at: this https URL.
摘要：协作助理或聊天机器人是数据驱动的决策支持系统，可实现任务完成的自然互动。尽管他们可以在现代社会中满足关键需求，但对其可靠性和可信赖性的关注仍然存在。特别是，大型语言模型（LLM）基于Chatgpt，Gemini和DeepSeek等聊天机器人越来越易于使用。但是，此类聊天机器人有局限性，包括无法解释响应产生，产生有问题的内容的风险，缺乏标准化的可靠性测试以及对深度AI专业知识的需求和扩展的开发时间。这些问题使聊天机器人不适合对选举或医疗保健等信任敏感的应用程序。为了解决这些问题，我们介绍了Safechat，这是一种用于构建安全且值得信赖的聊天机器人的一般体系结构，重点是信息检索用例。 SafeChat的主要特征包括：（a）安全性，具有域形不足的设计，在该设计中，响应是接地的，可以追溯到批准的来源（出处），以及“ do-not-not-respond”策略，以防止有害答案；（b）可用性，自动提取性摘要长期响应，可追溯到其来源，以及自动化的信任评估，以传达预期的聊天机器人行为，例如情感；（c）快速，可扩展的开发，包括CSV驱动的工作流程，自动测试以及与各种设备的集成。我们使用开源聊天机器人平台RASA在可执行的框架中实现了Safechat。一项案例研究表明了其在构建选举机器人-SC中的应用，这是一种旨在安全传播官方选举信息的聊天机器人。 SafeChat正在许多域中使用，验证其潜力，并在以下网址提供：此HTTPS URL。

Title: BiasCause: Evaluate Socially Biased Causal Reasoning of Large Language Models

Authors: Tian Xie, Tongxin Yin, Vaishakh Keshava, Xueru Zhang, Siddhartha Reddy Jonnalagadda
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.07997
Pdf URL: https://arxiv.org/pdf/2504.07997
Copy Paste: [[2504.07997]] BiasCause: Evaluate Socially Biased Causal Reasoning of Large Language Models(https://arxiv.org/abs/2504.07997)
Keywords: language model, llm
Abstract: While large language models (LLMs) already play significant roles in society, research has shown that LLMs still generate content including social bias against certain sensitive groups. While existing benchmarks have effectively identified social biases in LLMs, a critical gap remains in our understanding of the underlying reasoning that leads to these biased outputs. This paper goes one step further to evaluate the causal reasoning process of LLMs when they answer questions eliciting social biases. We first propose a novel conceptual framework to classify the causal reasoning produced by LLMs. Next, we use LLMs to synthesize 1788 questions covering 8 sensitive attributes and manually validate them. The questions can test different kinds of causal reasoning by letting LLMs disclose their reasoning process with causal graphs. We then test 4 state-of-the-art LLMs. All models answer the majority of questions with biased causal reasoning, resulting in a total of 4135 biased causal graphs. Meanwhile, we discover 3 strategies for LLMs to avoid biased causal reasoning by analyzing the "bias-free" cases. Finally, we reveal that LLMs are also prone to "mistaken-biased" causal reasoning, where they first confuse correlation with causality to infer specific sensitive group names and then incorporate biased causal reasoning.
摘要：尽管大型语言模型（LLM）已经在社会中起着重要作用，但研究表明，LLM仍会产生内容，包括针对某些敏感群体的社会偏见。尽管现有的基准有效地确定了LLM中的社会偏见，但我们对导致这些偏见产出的基本推理的理解仍然存在一个危险的差距。本文回答引起社会偏见的问题时，迈出了一步，以评估LLM的因果推理过程。我们首先提出了一个新颖的概念框架，以对LLMS产生的因果推理进行分类。接下来，我们使用LLMS合成$ 1788 $的问题，涵盖$ 8 $敏感属性并手动验证它们。这些问题可以通过让LLMS使用因果图披露其推理过程来测试各种因果推理。然后，我们测试4个最先进的LLM。所有模型都以有偏见的因果推理来回答大多数问题，从而产生了$ 4135 $偏见的因果图。同时，通过分析“无偏见”案件，我们发现LLMS的$ 3 $策略以避免因果推理。最后，我们揭示了LLM也容易出现“偏见”因果推理，在那里他们首先将与因果关系混淆以推断特定的敏感群体名称，然后结合有偏见的因果推理。

Title: Linguistic Interpretability of Transformer-based Language Models: a systematic review

Authors: Miguel López-Otal, Jorge Gracia, Jordi Bernad, Carlos Bobed, Lucía Pitarch-Ballesteros, Emma Anglés-Herrero
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.08001
Pdf URL: https://arxiv.org/pdf/2504.08001
Copy Paste: [[2504.08001]] Linguistic Interpretability of Transformer-based Language Models: a systematic review(https://arxiv.org/abs/2504.08001)
Keywords: language model
Abstract: Language models based on the Transformer architecture achieve excellent results in many language-related tasks, such as text classification or sentiment analysis. However, despite the architecture of these models being well-defined, little is known about how their internal computations help them achieve their results. This renders these models, as of today, a type of 'black box' system. There is, however, a line of research -- 'interpretability' -- aiming to learn how information is encoded inside these models. More specifically, there is work dedicated to studying whether Transformer-based models possess knowledge of linguistic phenomena similar to human speakers -- an area we call 'linguistic interpretability' of these models. In this survey we present a comprehensive analysis of 160 research works, spread across multiple languages and models -- including multilingual ones -- that attempt to discover linguistic information from the perspective of several traditional Linguistics disciplines: Syntax, Morphology, Lexico-Semantics and Discourse. Our survey fills a gap in the existing interpretability literature, which either does not focus on linguistic knowledge in these models or presents some limitations -- e.g. only studying English-based models. Our survey also focuses on Pre-trained Language Models not further specialized for a downstream task, with an emphasis on works that use interpretability techniques that explore models' internal representations.
摘要：基于变压器体系结构的语言模型在许多与语言相关的任务（例如文本分类或情感分析）中取得了出色的成果。但是，尽管这些模型的结构是明确定义的，但对它们的内部计算如何帮助他们取得结果知之甚少。截至今天，这将使这些模型是一种“黑匣子”系统。但是，有一系列研究 - “可解释性” - 旨在了解这些模型中的信息是如何编码的。更具体地说，有专门研究基于变形金刚的模型是否具有类似于人类者的语言现象的知识，这是我们称之为这些模型的“语言解释性”的知识。在这项调查中，我们对160所研究作品进行了全面的分析，分布在多种语言和模型中，包括多语言，试图从几个传统语言学学科的角度发现语言信息：语法，形态学，词典 - 词和话语。我们的调查填补了现有的可解释性文献的空白，该文献不关注这些模型中的语言知识，或者提出了一些局限性 - 例如仅研究基于英语的模型。我们的调查还集中于预先训练的语言模型，而不是进一步专门针对下游任务，重点是使用探索模型内部表示的可解释性技术的作品。

Title: More diverse more adaptive: Comprehensive Multi-task Learning for Improved LLM Domain Adaptation in E-commerce

Authors: Tong Piao, Pei Tang, Zhipeng Zhang, Jiaqi Li, Qiao Liu, Zufeng Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.08002
Pdf URL: https://arxiv.org/pdf/2504.08002
Copy Paste: [[2504.08002]] More diverse more adaptive: Comprehensive Multi-task Learning for Improved LLM Domain Adaptation in E-commerce(https://arxiv.org/abs/2504.08002)
Keywords: language model, llm
Abstract: In recent years, Large Language Models (LLMs) have been widely applied across various domains due to their powerful domain adaptation capabilities. Previous studies have suggested that diverse, multi-modal data can enhance LLMs' domain adaptation performance. However, this hypothesis remains insufficiently validated in the e-commerce sector. To address this gap, we propose a comprehensive e-commerce multi-task framework and design empirical experiments to examine the impact of diverse data and tasks on LLMs from two perspectives: "capability comprehensiveness" and "task comprehensiveness." Specifically, we observe significant improvements in LLM performance by progressively introducing tasks related to new major capability areas and by continuously adding subtasks within different major capability domains. Furthermore, we observe that increasing model capacity amplifies the benefits of diversity, suggesting a synergistic relationship between model capacity and data diversity. Finally, we validate the best-performing model from our empirical experiments in the KDD Cup 2024, achieving rank 5 in Task 1. This outcome demonstrates the significance of our research for advancing LLMs in the e-commerce domain.
摘要：近年来，由于其强大的域适应能力，大型语言模型（LLM）已广泛应用于各个领域。先前的研究表明，多种模式数据可以增强LLMS的域适应性性能。但是，该假设在电子商务领域仍未得到充分验证。为了解决这一差距，我们提出了一个全面的电子商务多任务框架和设计经验实验，以从两个角度研究不同数据和任务对LLM的影响：“能力全面性”和“任务全面性”。具体而言，我们通过逐步引入与新的主要能力领域相关的任务以及在不同的主要功能域中不断添加子任务，从而观察到LLM性能的显着改善。此外，我们观察到，增加模型能力增加了多样性的好处，这表明模型容量与数据多样性之间存在协同关系。最后，我们从2024年KDD杯中的经验实验中验证了表现最佳的模型，在任务1中获得了排名5的排名。这一结果表明了我们研究对推进电子商务领域LLM的重要性。

Title: Can Reasoning LLMs Enhance Clinical Document Classification?

Authors: Akram Mustafa, Usman Naseem, Mostafa Rahimi Azghadi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08040
Pdf URL: https://arxiv.org/pdf/2504.08040
Copy Paste: [[2504.08040]] Can Reasoning LLMs Enhance Clinical Document Classification?(https://arxiv.org/abs/2504.08040)
Keywords: language model, gpt, llm, chat
Abstract: Clinical document classification is essential for converting unstructured medical texts into standardised ICD-10 diagnoses, yet it faces challenges due to complex medical language, privacy constraints, and limited annotated datasets. Large Language Models (LLMs) offer promising improvements in accuracy and efficiency for this task. This study evaluates the performance and consistency of eight LLMs, four reasoning (Qwen QWQ, Deepseek Reasoner, GPT o3 Mini, Gemini 2.0 Flash Thinking) and four non-reasoning (Llama 3.3, GPT 4o Mini, Gemini 2.0 Flash, Deepseek Chat), in classifying clinical discharge summaries using the MIMIC-IV dataset. Using cTAKES to structure clinical narratives, models were assessed across three experimental runs, with majority voting determining final predictions. Results showed that reasoning models outperformed non-reasoning models in accuracy (71% vs 68%) and F1 score (67% vs 60%), with Gemini 2.0 Flash Thinking achieving the highest accuracy (75%) and F1 score (76%). However, non-reasoning models demonstrated greater stability (91% vs 84% consistency). Performance varied across ICD-10 codes, with reasoning models excelling in complex cases but struggling with abstract categories. Findings indicate a trade-off between accuracy and consistency, suggesting that a hybrid approach could optimise clinical coding. Future research should explore multi-label classification, domain-specific fine-tuning, and ensemble methods to enhance model reliability in real-world applications.
摘要：临床文档分类对于将非结构化的医学文本转换为标准化的ICD-10诊断至关重要，但是由于复杂的医学语言，隐私限制和有限的注释数据集，它面临挑战。大型语言模型（LLMS）为这项任务提供了有希望的改善准确性和效率。这项研究评估了八个LLM的性能和一致性；四个推理（QWEN QWQ，DeepSeek推理器，GPT O3 Mini，Gemini 2.0 Flash Thinking）和四个非争议（Llama 3.3，GPT 4O Mini，Gemini 2.0 2.0 Flash，DeepSeek Chat）；在使用MIMIC-IV数据集对临床放电摘要进行分类时。使用CTAKE来构建临床叙事，在三个实验跑步中评估了模型，多数投票决定了最终预测。结果表明，推理模型的精度（71％vs 68％）和F1得分（67％vs 60％）优于非争议模型，Gemini 2.0 Flash Thinking可以达到最高精度（75％）和F1分数（76％）。但是，非争议模型表现出更大的稳定性（91％vs 84％的一致性）。 ICD-10代码的性能各不相同，其推理模型在复杂的情况下出色，但在抽象类别中挣扎。调查结果表明准确性和一致性之间的权衡，表明混合方法可以优化临床编码。未来的研究应探讨多标签分类，特定于领域的微调和集合方法，以增强现实世界应用中的模型可靠性。
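
Illustrative sketch (not from the paper): the abstract states that models were assessed across three runs with majority voting determining the final prediction, and it reports a consistency figure. The snippet below shows one plausible reading of both steps. The abstract does not define consistency, so measuring it as the share of documents on which all runs agree is an assumption, and the ICD-10 codes in the toy data are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]


def consistency(runs_per_document):
    """Assumed definition: fraction of documents where every run agrees."""
    agreeing = sum(len(set(runs)) == 1 for runs in runs_per_document)
    return agreeing / len(runs_per_document)


# One inner list per discharge summary, holding the label from each of 3 runs.
runs = [
    ["I10", "I10", "E11"],
    ["J18", "J18", "J18"],
    ["E11", "I10", "E11"],
    ["N39", "N39", "N39"],
]

final_predictions = [majority_vote(r) for r in runs]
agreement = consistency(runs)
print(final_predictions, agreement)
```

With three runs and a single label per document, a strict majority exists whenever at least two runs agree; the first-seen tie-break only matters when all three runs disagree.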

Title: DeepSeek vs. o3-mini: How Well can Reasoning LLMs Evaluate MT and Summarization?

Authors: Daniil Larionov, Sotaro Takeshita, Ran Zhang, Yanran Chen, Christoph Leiter, Zhipin Wang, Christian Greisinger, Steffen Eger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.08120
Pdf URL: https://arxiv.org/pdf/2504.08120
Copy Paste: [[2504.08120]] DeepSeek vs. o3-mini: How Well can Reasoning LLMs Evaluate MT and Summarization?(https://arxiv.org/abs/2504.08120)
Keywords: language model, llm
Abstract: Reasoning-enabled large language models (LLMs) have recently demonstrated impressive performance in complex logical and mathematical tasks, yet their effectiveness in evaluating natural language generation remains unexplored. This study systematically compares reasoning-based LLMs (DeepSeek-R1 and OpenAI o3) with their non-reasoning counterparts across machine translation (MT) and text summarization (TS) evaluation tasks. We evaluate eight models across three architectural categories, including state-of-the-art reasoning models, their distilled variants (ranging from 8B to 70B parameters), and equivalent conventional, non-reasoning LLMs. Our experiments on WMT23 and SummEval benchmarks reveal that the benefits of reasoning capabilities are highly model- and task-dependent: while OpenAI o3-mini models show consistent performance improvements with increased reasoning intensity, DeepSeek-R1 underperforms compared to its non-reasoning variant, with the exception of certain aspects of TS evaluation. Correlation analysis demonstrates that increased reasoning token usage positively correlates with evaluation quality in o3-mini models. Furthermore, our results show that distillation of reasoning capabilities maintains reasonable performance in medium-sized models (32B) but degrades substantially in smaller variants (8B). This work provides the first comprehensive assessment of reasoning LLMs for NLG evaluation and offers insights into their practical use.
摘要：支持推理的大语言模型（LLM）最近在复杂的逻辑和数学任务中表现出了令人印象深刻的表现，但是它们在评估自然语言生成方面的有效性仍未得到探索。这项研究系统地将基于推理的LLM（DeepSeek-R1和OpenAI O3）与跨机器翻译（MT）（MT）和文本摘要（TS）评估任务进行了比较。我们评估了三个架构类别的八个模型，包括最先进的推理模型，它们的蒸馏变体（从8B到70B参数）以及同等的常规，非季节性的LLM。我们在WMT23和夏斯文基准上进行的实验表明，推理能力的好处是高度模型和任务依赖性的：虽然OpenAI O3-Mini模型显示出一致的性能提高，而推理强度的提高，而DeepSeek-R1表现不佳与其非劳动变体相比，与TS评估的某些方面相比，其非劳动性变体相比。相关分析表明，增加的推理令牌用法与O3-MINI模型中的评估质量正相关。此外，我们的结果表明，推理能力的蒸馏在中型模型（32B）中保持合理的性能，但在较小的变体（8B）中大大降解。这项工作提供了对NLG评估的推理LLM的首次全面评估，并提供了对其实际使用的见解。

Title: Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora

Authors: Alex Warstadt, Aaron Mueller, Leshem Choshen, Ethan Wilcox, Chengxu Zhuang, Juan Ciro, Rafael Mosquera, Bhargavi Paranjape, Adina Williams, Tal Linzen, Ryan Cotterell
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.08165
Pdf URL: https://arxiv.org/pdf/2504.08165
Copy Paste: [[2504.08165]] Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora(https://arxiv.org/abs/2504.08165)
Keywords: language model
Abstract: Children can acquire language from less than 100 million words of input. Large language models are far less data-efficient: they typically require 3 or 4 orders of magnitude more data and still do not perform as well as humans on many evaluations. These intensive resource demands limit the ability of researchers to train new models and use existing models as developmentally plausible cognitive models. The BabyLM Challenge is a communal effort in which participants compete to optimize language model training on a fixed data budget. Submissions are compared on various evaluation tasks targeting grammatical ability, downstream task performance, and generalization. Participants can submit to up to three tracks with progressively looser data restrictions. From over 30 submissions, we extract concrete recommendations on how best to train data-efficient language models, and on where future efforts should (and perhaps should not) focus. The winning submissions using the LTG-BERT architecture (Samuel et al., 2023) outperformed models trained on trillions of words. Other submissions achieved strong results through training on shorter input sequences or training a student model on a pretrained teacher. Curriculum learning attempts, which accounted for a large number of submissions, were largely unsuccessful, though some showed modest improvements.
摘要：儿童可以从少于1亿个单词的意见中获取语言。大型语言模型的数据效率要低得多：它们通常需要3或4个数量级的数据，并且在许多评估中仍然不如人类表现不佳。这些密集的资源要求限制了研究人员培训新模型并将现有模型用作发展上合理的认知模型的能力。 Babylm挑战是一项公共努力，参与者竞争以固定数据预算优化语言模型培训。比较针对语法能力，下游任务绩效和概括的各种评估任务的提交。参与者最多可以提交三个曲目，并逐渐宽松的数据限制。从30多种提交中，我们就如何最好地培训数据有效的语言模型以及未来的努力（也许不应该）集中精力提取具体建议。使用LTG-Bert Architecture（Samuel等，2023）的获奖文章优于接受数万亿个单词的训练的模型。其他提交的意见通过对较短的输入序列进行培训或培训学生模型，从而取得了良好的成果。造成大量意见的课程学习尝试在很大程度上没有成功，尽管有些表现出适度的改进。

Title: Harnessing the Unseen: The Hidden Influence of Intrinsic Knowledge in Long-Context Language Models

Authors: Yu Fu, Haz Sameen Shahgir, Hui Liu, Xianfeng Tang, Qi He, Yue Dong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.08202
Pdf URL: https://arxiv.org/pdf/2504.08202
Copy Paste: [[2504.08202]] Harnessing the Unseen: The Hidden Influence of Intrinsic Knowledge in Long-Context Language Models(https://arxiv.org/abs/2504.08202)
Keywords: language model
Abstract: Recent advances in long-context models (LCMs), designed to handle extremely long input contexts, primarily focus on utilizing external contextual information, often leaving the influence of large language models' intrinsic knowledge underexplored. In this work, we investigate how this intrinsic knowledge affects content generation and demonstrate that its impact becomes increasingly pronounced as context length extends. Furthermore, we show that the model's ability to utilize intrinsic knowledge, which we call intrinsic retrieval ability, does not improve simultaneously with its ability to leverage contextual knowledge through extrinsic retrieval ability. Moreover, better extrinsic retrieval can interfere with the model's ability to use its own knowledge effectively, limiting its full potential. To bridge this gap, we design a simple yet effective Hybrid Needle-in-a-Haystack test that evaluates models based on their capabilities across both retrieval abilities, rather than solely emphasizing extrinsic retrieval ability. Our experimental results reveal that Qwen-2.5 models significantly outperform Llama-3.1 models, demonstrating superior intrinsic retrieval ability. Moreover, even the more powerful Llama-3.1-70B-Instruct model fails to exhibit better performance under LCM conditions, highlighting the importance of evaluating models from a dual-retrieval perspective.
摘要：旨在处理非常长的输入上下文的长篇文本模型（LCM）的最新进展，主要集中于利用外部上下文信息，通常会留下大型语言模型的固有知识的影响。在这项工作中，我们研究了这种内在知识如何影响内容的产生，并证明随着上下文长度的扩展，其影响越来越明显。此外，我们表明该模型利用固有知识的能力（我们称之为内在的检索能力）并不能同时改善其通过外部检索能力来利用上下文知识的能力。此外，更好的外部检索可以干扰该模型有效地使用自己的知识的能力，从而限制其全部潜力。为了弥合这一差距，我们设计了一个简单而有效的混合针中的混合针头测试，该测试根据模型在两种检索能力中的能力中评估模型，而不是仅仅强调外部检索能力。我们的实验结果表明，QWEN-2.5模型的表现明显胜过Llama-3.1模型，表明了优越的内在检索能力。此外，即使是更强大的Llama-3.1-70B-Instruct模型也无法在LCM条件下表现出更好的性能，从而强调了从双重续签角度评估模型的重要性。
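
Illustrative sketch (not from the paper): the abstract describes a Hybrid Needle-in-a-Haystack test that scores both extrinsic retrieval (from the context) and intrinsic retrieval (from the model's own knowledge). The abstract does not give the test's construction, so the snippet below is only one possible shape: a needle sentence is inserted at a relative depth in filler text, and the question can only be answered by combining the needle with a fact that is absent from the context. All function names, the example facts, and the substring-based scoring are assumptions.

```python
def build_haystack(filler_sentences, needle, depth):
    """Insert `needle` at a relative depth in [0, 1] within the filler text."""
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)


def score(answer, extrinsic_fact, intrinsic_fact):
    """Check separately whether the answer uses the context and prior knowledge."""
    answer = answer.lower()
    return {
        "extrinsic": extrinsic_fact.lower() in answer,
        "intrinsic": intrinsic_fact.lower() in answer,
    }


filler = [f"Filler sentence {i}." for i in range(4)]
needle = "The secret city is Paris."
context = build_haystack(filler, needle, depth=0.5)

# The capital-of relation is not in the context, so it must come from the model.
question = "Which country is the secret city the capital of?"
result = score("The secret city is Paris, the capital of France.", "Paris", "France")
print(context)
print(result)
```

Scoring the two facts separately is the point of the design: a model that finds the needle but cannot supply the missing relation fails only on the intrinsic half.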

Title: LLM for Comparative Narrative Analysis

Authors: Leo Kampen, Carlos Rabat Villarreal, Louis Yu, Santu Karmaker, Dongji Feng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08211
Pdf URL: https://arxiv.org/pdf/2504.08211
Copy Paste: [[2504.08211]] LLM for Comparative Narrative Analysis(https://arxiv.org/abs/2504.08211)
Keywords: gpt, llm, prompt
Abstract: In this paper, we conducted a Multi-Perspective Comparative Narrative Analysis (CNA) on three prominent LLMs: GPT-3.5, PaLM2, and Llama2. We applied identical prompts and evaluated their outputs on specific tasks, ensuring an equitable and unbiased comparison between various LLMs. Our study revealed that the three LLMs generated divergent responses to the same prompt, indicating notable discrepancies in their ability to comprehend and analyze the given task. Human evaluation was used as the gold standard, evaluating four perspectives to analyze differences in LLM performance.
摘要：在本文中，我们对三个突出的LLMS：GPT-3.5，Palm2和Llama2进行了多角度比较叙事分析（CNA）。我们应用了相同的提示，并在特定任务上评估了它们的输出，从而确保了各种LLM之间的公平和公正的比较。我们的研究表明，这三个LLM对同一提示产生了不同的响应，表明其理解和分析给定任务的能力显着差异。人类评估被用作黄金标准，评估了四种观点来分析LLM性能的差异。

Title: Out of Style: RAG's Fragility to Linguistic Variation

Authors: Tianyu Cao, Neel Bhandari, Akhila Yerukola, Akari Asai, Maarten Sap
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.08231
Pdf URL: https://arxiv.org/pdf/2504.08231
Copy Paste: [[2504.08231]] Out of Style: RAG's Fragility to Linguistic Variation(https://arxiv.org/abs/2504.08231)
Keywords: llm, retrieval-augmented generation
Abstract: Despite the impressive performance of Retrieval-augmented Generation (RAG) systems across various NLP benchmarks, their robustness in handling real-world user-LLM interaction queries remains largely underexplored. This presents a critical gap for practical deployment, where user queries exhibit greater linguistic variations and can trigger cascading errors across interdependent RAG components. In this work, we systematically analyze how varying four linguistic dimensions (formality, readability, politeness, and grammatical correctness) impact RAG performance. We evaluate two retrieval models and nine LLMs, ranging from 3 to 72 billion parameters, across four information-seeking Question Answering (QA) datasets. Our results reveal that linguistic reformulations significantly impact both retrieval and generation stages, leading to a relative performance drop of up to 40.41% in Recall@5 scores for less formal queries and 38.86% in answer match scores for queries containing grammatical errors. Notably, RAG systems exhibit greater sensitivity to such variations compared to LLM-only generations, highlighting their vulnerability to error propagation due to linguistic shifts. These findings highlight the need for improved robustness techniques to enhance reliability in diverse user interactions.
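The abstract above reports relative drops in Recall@5 when queries are rewritten. As a minimal sketch of how such a number can be computed, the snippet below uses one common definition of Recall@k (the share of relevant documents found in the top k results); the paper's exact definition may differ. The function names `recall_at_k` and `relative_drop`, and the toy document IDs, are illustrative assumptions and do not come from the paper.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of relevant document IDs that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)


def relative_drop(original, rewritten):
    """Relative performance drop, in percent, from the original to the rewritten score."""
    if original == 0:
        raise ValueError("original score must be non-zero")
    return 100.0 * (original - rewritten) / original


# Hypothetical toy data: ranked retrieval results for one question,
# once phrased formally and once phrased informally.
relevant = ["d1", "d2"]
formal_run = ["d1", "d7", "d2", "d9", "d4", "d3"]
informal_run = ["d7", "d1", "d9", "d4", "d8", "d2"]

r_formal = recall_at_k(formal_run, relevant)      # both relevant docs are in the top 5
r_informal = recall_at_k(informal_run, relevant)  # only d1 is in the top 5
drop = relative_drop(r_formal, r_informal)
print(f"Recall@5: {r_formal:.2f} -> {r_informal:.2f}, relative drop {drop:.1f}%")
```

In a real evaluation the two scores would be averaged over a whole dataset before the relative drop is taken.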

Title: Evaluating the Bias in LLMs for Surveying Opinion and Decision Making in Healthcare

Authors: Yonchanok Khaokaew, Flora D. Salim, Andreas Züfle, Hao Xue, Taylor Anderson, Matthew Scotch, David J Heslop
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.08260
Pdf URL: https://arxiv.org/pdf/2504.08260
Copy Paste: [[2504.08260]] Evaluating the Bias in LLMs for Surveying Opinion and Decision Making in Healthcare(https://arxiv.org/abs/2504.08260)
Keywords: language model, llm, prompt, agent
Abstract: Generative agents have been increasingly used to simulate human behaviour in silico, driven by large language models (LLMs). These simulacra serve as sandboxes for studying human behaviour without compromising privacy or safety. However, it remains unclear whether such agents can truly represent real individuals. This work compares survey data from the Understanding America Study (UAS) on healthcare decision-making with simulated responses from generative agents. Using demographic-based prompt engineering, we create digital twins of survey respondents and analyse how well different LLMs reproduce real-world behaviours. Our findings show that some LLMs fail to reflect realistic decision-making, such as predicting universal vaccine acceptance. However, Llama 3 captures variations across race and income more accurately but also introduces biases not present in the UAS data. This study highlights the potential of generative agents for behavioural research while underscoring the risks of bias from both LLMs and prompting strategies.

Title: ELSA: A Style Aligned Dataset for Emotionally Intelligent Language Generation

Authors: Vishal Gandhi, Sagar Gandhi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.08281
Pdf URL: https://arxiv.org/pdf/2504.08281
Copy Paste: [[2504.08281]] ELSA: A Style Aligned Dataset for Emotionally Intelligent Language Generation(https://arxiv.org/abs/2504.08281)
Keywords: language model, llm, prompt
Abstract: Advancements in emotion-aware language processing increasingly shape vital NLP applications ranging from conversational AI and affective computing to computational psychology and creative content generation. Existing emotion datasets either lack emotional granularity or fail to capture necessary stylistic diversity, limiting the advancement of effective emotion-conditioned text generation systems. Seeking to bridge this crucial gap between granularity and style diversity, this paper introduces a novel, systematically constructed dataset named ELSA (Emotion and Language Style Alignment Dataset), leveraging fine-grained emotion taxonomies adapted from existing sources such as the dair-ai emotion dataset and the GoEmotions taxonomy. This dataset comprises multiple emotionally nuanced variations of original sentences regenerated across distinct contextual styles such as conversational, formal, poetic, and narrative, using advanced Large Language Models (LLMs). Rigorous computational evaluation using metrics such as perplexity, embedding variance, readability, lexical diversity, and semantic coherence measures validates the dataset's emotional authenticity, linguistic fluency, and textual diversity. Comprehensive metric analyses affirm its potential to support deeper explorations into emotion-conditioned, style-adaptive text generation. By enabling precision-tuned, emotionally nuanced language modeling, our dataset creates fertile ground for research on fine-grained emotional control, prompt-driven explanation, interpretability, and style-adaptive expressive language generation with LLMs.

Title: Large language models could be rote learners

Authors: Yuyang Xu, Renjun Hu, Haochao Ying, Jian Wu, Xing Shi, Wei Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08300
Pdf URL: https://arxiv.org/pdf/2504.08300
Copy Paste: [[2504.08300]] Large language models could be rote learners(https://arxiv.org/abs/2504.08300)
Keywords: language model, llm
Abstract: Multiple-choice question (MCQ) benchmarks are widely used for evaluating Large Language Models (LLMs), yet their reliability is undermined by benchmark contamination. In this study, we reframe contamination as an inherent aspect of learning and seek to disentangle genuine capability acquisition from superficial memorization in LLM evaluation. First, by analyzing model performance under different memorization conditions, we uncover a counterintuitive trend: LLMs perform worse on memorized MCQs than on non-memorized ones, indicating the coexistence of two distinct learning phenomena, i.e., rote memorization and genuine capability learning. To disentangle them, we propose TrinEval, a novel evaluation framework that reformulates MCQs into an alternative trinity format, reducing memorization while preserving knowledge assessment. Experiments validate TrinEval's effectiveness in reformulation, and its evaluation reveals that common LLMs may memorize by rote 20.5% of knowledge points (in MMLU on average).

Title: Scholar Inbox: Personalized Paper Recommendations for Scientists

Authors: Markus Flicke, Glenn Angrabeit, Madhav Iyengar, Vitalii Protsenko, Illia Shakun, Jovan Cicvaric, Bora Kargi, Haoyu He, Lukas Schuler, Lewin Scholz, Kavyanjali Agnihotri, Yong Cao, Andreas Geiger
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2504.08385
Pdf URL: https://arxiv.org/pdf/2504.08385
Copy Paste: [[2504.08385]] Scholar Inbox: Personalized Paper Recommendations for Scientists(https://arxiv.org/abs/2504.08385)
Keywords: prompt
Abstract: Scholar Inbox is a new open-access platform designed to address the challenges researchers face in staying current with the rapidly expanding volume of scientific literature. We provide personalized recommendations, continuous updates from open-access archives (arXiv, bioRxiv, etc.), visual paper summaries, semantic search, and a range of tools to streamline research workflows and promote open research access. The platform's personalized recommendation system is trained on user ratings, ensuring that recommendations are tailored to individual researchers' interests. To further enhance the user experience, Scholar Inbox also offers a map of science that provides an overview of research across domains, enabling users to easily explore specific topics. We use this map to address the cold start problem common in recommender systems, as well as an active learning strategy that iteratively prompts users to rate a selection of papers, allowing the system to learn user preferences quickly. We evaluate the quality of our recommendation system on a novel dataset of 800k user ratings, which we make publicly available, as well as via an extensive user study.

Title: Beyond Self-Reports: Multi-Observer Agents for Personality Assessment in Large Language Models

Authors: Yin Jou Huang, Rafik Hadfi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08399
Pdf URL: https://arxiv.org/pdf/2504.08399
Copy Paste: [[2504.08399]] Beyond Self-Reports: Multi-Observer Agents for Personality Assessment in Large Language Models(https://arxiv.org/abs/2504.08399)
Keywords: language model, llm, agent
Abstract: There is a growing interest in assessing the personality traits of Large language models (LLMs). However, traditional personality assessments based on self-report questionnaires may fail to capture their true behavioral nuances due to inherent biases and meta-knowledge contamination. This paper introduces a novel multi-observer framework for LLM personality assessment that draws inspiration from informant-report methods in psychology. Instead of relying solely on self-assessments, our approach employs multiple observer agents configured with a specific relationship context (e.g., family, friend, or workplace) to simulate interactive scenarios with a subject LLM. These observers engage in dialogues and subsequently provide ratings across the Big Five personality dimensions. Our experiments reveal that LLMs possess systematic biases in self-report personality ratings. Moreover, aggregating observer ratings effectively reduces non-systematic biases and achieves optimal reliability with 5-7 observers. The findings highlight the significant impact of relationship context on personality perception and demonstrate that a multi-observer paradigm yields a more robust and context-sensitive evaluation of LLM personality traits.

Title: Integrated ensemble of BERT- and features-based models for authorship attribution in Japanese literary works

Authors: Taisei Kanda, Mingzhe Jin, Wataru Zaitsu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.08527
Pdf URL: https://arxiv.org/pdf/2504.08527
Copy Paste: [[2504.08527]] Integrated ensemble of BERT- and features-based models for authorship attribution in Japanese literary works(https://arxiv.org/abs/2504.08527)
Keywords: language model
Abstract: Traditionally, authorship attribution (AA) tasks relied on statistical data analysis and classification based on stylistic features extracted from texts. In recent years, pre-trained language models (PLMs) have attracted significant attention in text classification tasks. However, although they demonstrate excellent performance on large-scale short-text datasets, their effectiveness remains under-explored for small samples, particularly in AA tasks. Additionally, a key challenge is how to effectively leverage PLMs in conjunction with traditional feature-based methods to advance AA research. In this study, we aimed to significantly improve performance using an integrated ensemble of traditional feature-based and modern PLM-based methods on an AA task in a small sample. For the experiment, we used two corpora of literary works to classify 10 authors each. The results indicate that BERT is effective, even for small-sample AA tasks. Both BERT-based and classifier ensembles outperformed their respective stand-alone models, and the integrated ensemble approach further improved the scores significantly. For the corpus that was not included in the pre-training data, the integrated ensemble improved the F1 score by approximately 14 points, compared to the best-performing single model. Our methodology provides a viable solution for the efficient use of the ever-expanding array of data processing tools in the foreseeable future.

Title: On The Landscape of Spoken Language Models: A Comprehensive Survey

Authors: Siddhant Arora, Kai-Wei Chang, Chung-Ming Chien, Yifan Peng, Haibin Wu, Yossi Adi, Emmanuel Dupoux, Hung-Yi Lee, Karen Livescu, Shinji Watanabe
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2504.08528
Pdf URL: https://arxiv.org/pdf/2504.08528
Copy Paste: [[2504.08528]] On The Landscape of Spoken Language Models: A Comprehensive Survey(https://arxiv.org/abs/2504.08528)
Keywords: language model
Abstract: The field of spoken language processing is undergoing a shift from training custom-built, task-specific models toward using and optimizing spoken language models (SLMs) which act as universal speech processing systems. This trend is similar to the progression toward universal language models that has taken place in the field of (text) natural language processing. SLMs include both "pure" language models of speech -- models of the distribution of tokenized speech sequences -- and models that combine speech encoders with text language models, often including both spoken and written input or output. Work in this area is very diverse, with a range of terminology and evaluation settings. This paper aims to contribute an improved understanding of SLMs via a unifying literature survey of recent work in the context of the evolution of the field. Our survey categorizes the work in this area by model architecture, training, and evaluation choices, and describes some key challenges and directions for future work.

Title: Lexical Bundle Frequency as a Construct-Relevant Candidate Feature in Automated Scoring of L2 Academic Writing

Authors: Burak Senel
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.08537
Pdf URL: https://arxiv.org/pdf/2504.08537
Copy Paste: [[2504.08537]] Lexical Bundle Frequency as a Construct-Relevant Candidate Feature in Automated Scoring of L2 Academic Writing(https://arxiv.org/abs/2504.08537)
Keywords: prompt
Abstract: Automated scoring (AS) systems are increasingly used for evaluating L2 writing, but require ongoing refinement for construct validity. While prior work suggested lexical bundles (LBs) - recurrent multi-word sequences satisfying certain frequency criteria - could inform assessment, their empirical integration into AS models needs further investigation. This study tested the impact of incorporating LB frequency features into an AS model for TOEFL independent writing tasks. From a sampled subcorpus of the TOEFL11 corpus (N=1,225 essays, 9 L1s), scored by ETS-trained raters (Low, Medium, High), 3- to 9-word LBs were extracted, distinguishing prompt-specific from non-prompt types. A baseline Support Vector Machine (SVM) scoring model using established linguistic features (e.g., mechanics, cohesion, sophistication) was compared against an extended model including three aggregate LB frequency features (total prompt, total non-prompt, overall total). Results revealed significant, though generally small-effect, relationships between LB frequency (especially non-prompt bundles) and proficiency (p < .05). Mean frequencies suggested lower-proficiency essays used more LBs overall. Critically, the LB-enhanced model improved agreement with human raters (Quadratic Cohen's Kappa +2.05%, overall Cohen's Kappa +5.63%), with notable gains for low (+10.1% exact agreement) and medium (+14.3% Cohen's Kappa) proficiency essays. These findings demonstrate that integrating aggregate LB frequency offers potential for developing more linguistically informed and accurate AS systems, particularly for differentiating developing L2 writers.
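The agreement figures in the abstract above are stated in terms of Cohen's Kappa, including a quadratic variant. The sketch below assumes that "Quadratic Cohen's Kappa" refers to the standard quadratically weighted kappa, and shows how it can be computed for ordinal labels such as Low/Medium/High encoded as 0/1/2. The function name and the toy ratings are illustrative assumptions, not the study's code or data.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_levels):
    """Quadratically weighted Cohen's kappa for ordinal labels in 0..num_levels-1."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("ratings must be non-empty and of equal length")
    if num_levels < 2:
        raise ValueError("need at least two rating levels")

    # Confusion matrix: observed[i][j] counts items rated i by A and j by B.
    observed = [[0] * num_levels for _ in range(num_levels)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1

    # Marginal histograms for each rater.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_levels)) for j in range(num_levels)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            weight = (i - j) ** 2 / (num_levels - 1) ** 2  # larger penalty for distant levels
            expected = hist_a[i] * hist_b[j] / n           # count expected by chance
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0.0:
        # Degenerate case: both raters used one identical level throughout.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical ratings: 0 = Low, 1 = Medium, 2 = High.
human = [0, 0, 1, 1, 2, 2]
model = [0, 1, 1, 1, 2, 1]
kappa = quadratic_weighted_kappa(human, model, num_levels=3)
print(f"quadratic weighted kappa = {kappa:.3f}")
```

Because the weights grow with the squared distance between levels, a Low/High confusion costs four times as much as a Low/Medium confusion, which suits ordered proficiency bands.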

Title: UoB-NLP at SemEval-2025 Task 11: Leveraging Adapters for Multilingual and Cross-Lingual Emotion Detection

Authors: Frances Laureano De Leon, Yixiao Wang, Yue Feng, Mark G. Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.08543
Pdf URL: https://arxiv.org/pdf/2504.08543
Copy Paste: [[2504.08543]] UoB-NLP at SemEval-2025 Task 11: Leveraging Adapters for Multilingual and Cross-Lingual Emotion Detection(https://arxiv.org/abs/2504.08543)
Keywords: language model
Abstract: Emotion detection in natural language processing is a challenging task due to the complexity of human emotions and linguistic diversity. While significant progress has been made in high-resource languages, emotion detection in low-resource languages remains underexplored. In this work, we address multilingual and cross-lingual emotion detection by leveraging adapter-based fine-tuning with multilingual pre-trained language models. Adapters introduce a small number of trainable parameters while keeping the pre-trained model weights fixed, offering a parameter-efficient approach to adaptation. We experiment with different adapter tuning strategies, including task-only adapters, target-language-ready task adapters, and language-family-based adapters. Our results show that target-language-ready task adapters achieve the best overall performance, particularly for low-resource African languages with our team ranking 7th for Tigrinya, and 8th for Kinyarwanda in Track A. In Track C, our system ranked 3rd for Amharic, and 4th for Oromo, Tigrinya, Kinyarwanda, Hausa, and Igbo. Our approach outperforms large language models in 11 languages and matches their performance in four others, despite our models having significantly fewer parameters. Furthermore, we find that adapter-based models retain cross-linguistic transfer capabilities while requiring fewer computational resources compared to full fine-tuning for each language.

Title: Playpen: An Environment for Exploring Learning Through Conversational Interaction

Authors: Nicola Horst, Davide Mazzaccara, Antonia Schmidt, Michael Sullivan, Filippo Momentè, Luca Franceschetti, Philipp Sadler, Sherzod Hakimov, Alberto Testoni, Raffaella Bernardi, Raquel Fernández, Alexander Koller, Oliver Lemon, David Schlangen, Mario Giulianelli, Alessandro Suglia
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.08590
Pdf URL: https://arxiv.org/pdf/2504.08590
Copy Paste: [[2504.08590]] Playpen: An Environment for Exploring Learning Through Conversational Interaction(https://arxiv.org/abs/2504.08590)
Keywords: language model
Abstract: Are we running out of learning signal? Predicting the next word in an existing text has turned out to be a powerful signal, at least at scale. But there are signs that we are running out of this resource. In recent months, interaction between learner and feedback-giver has come into focus, both for "alignment" (with a reward model judging the quality of instruction-following attempts) and for improving "reasoning" (process- and outcome-based verifiers judging reasoning steps). In this paper, we explore to what extent synthetic interaction in what we call Dialogue Games -- goal-directed and rule-governed activities driven predominantly by verbal actions -- can provide a learning signal, and how this signal can be used. We introduce an environment for producing such interaction data (with the help of a Large Language Model as counterpart to the learner model), both offline and online. We investigate the effects of supervised fine-tuning on this data, as well as reinforcement learning setups such as DPO and GRPO, showing that all of these approaches achieve some improvements in in-domain games, but only GRPO demonstrates the ability to generalise to out-of-domain games as well as retain competitive performance in reference-based tasks. We release the framework and the baseline training setups in the hope that this can foster research in this promising new direction.

Title: MedHal: An Evaluation Dataset for Medical Hallucination Detection

Authors: Gaya Mehenni, Amal Zouaq
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08596
Pdf URL: https://arxiv.org/pdf/2504.08596
Copy Paste: [[2504.08596]] MedHal: An Evaluation Dataset for Medical Hallucination Detection(https://arxiv.org/abs/2504.08596)
Keywords: hallucination
Abstract: We present MedHal, a novel large-scale dataset specifically designed to evaluate if models can detect hallucinations in medical texts. Current hallucination detection methods face significant limitations when applied to specialized domains like medicine, where they can have disastrous consequences. Existing medical datasets are either too small, containing only a few hundred samples, or focus on a single task like Question Answering or Natural Language Inference. MedHal addresses these gaps by: (1) incorporating diverse medical text sources and tasks; (2) providing a substantial volume of annotated samples suitable for training medical hallucination detection models; and (3) including explanations for factual inconsistencies to guide model learning. We demonstrate MedHal's utility by training and evaluating a baseline medical hallucination detection model, showing improvements over general-purpose hallucination detection approaches. This resource enables more efficient evaluation of medical text generation systems while reducing reliance on costly expert review, potentially accelerating the development of medical AI research.

Title: A Survey of Machine Learning Models and Datasets for the Multi-label Classification of Textual Hate Speech in English

Authors: Julian Bäumler, Louis Blöcher, Lars-Joel Frey, Xian Chen, Markus Bayer, Christian Reuter
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08609
Pdf URL: https://arxiv.org/pdf/2504.08609
Copy Paste: [[2504.08609]] A Survey of Machine Learning Models and Datasets for the Multi-label Classification of Textual Hate Speech in English(https://arxiv.org/abs/2504.08609)
Keywords: prompt
Abstract: The dissemination of online hate speech can have serious negative consequences for individuals, online communities, and entire societies. This and the large volume of hateful online content prompted both practitioners', i.e., in content moderation or law enforcement, and researchers' interest in machine learning models to automatically classify instances of hate speech. Whereas most scientific works address hate speech classification as a binary task, practice often requires a differentiation into sub-types, e.g., according to target, severity, or legality, which may overlap for individual content. Hence, researchers created datasets and machine learning models that approach hate speech classification in textual data as a multi-label problem. This work presents the first systematic and comprehensive survey of scientific literature on this emerging research landscape in English (N=46). We contribute a concise overview of 28 datasets suited for training multi-label classification models that reveals significant heterogeneity regarding label-set, size, meta-concept, annotation process, and inter-annotator agreement. Our analysis of 24 publications proposing suitable classification models further establishes inconsistency in evaluation and a preference for architectures based on Bidirectional Encoder Representations from Transformers (BERT) and Recurrent Neural Networks (RNNs). We identify imbalanced training data, reliance on crowdsourcing platforms, small and sparse datasets, and missing methodological alignment as critical open issues and formulate ten recommendations for research.

Title: Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning

Authors: Fangzhi Xu, Hang Yan, Chang Ma, Haiteng Zhao, Qiushi Sun, Kanzhi Cheng, Junxian He, Jun Liu, Zhiyong Wu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.08672
Pdf URL: https://arxiv.org/pdf/2504.08672
Copy Paste: [[2504.08672]] Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning(https://arxiv.org/abs/2504.08672)
Keywords: llm
Abstract: Advancing LLM reasoning skills has captivated wide interest. However, current post-training techniques rely heavily on supervisory signals, such as outcome supervision or auxiliary reward models, which face the problem of scalability and high annotation costs. This motivates us to enhance LLM reasoning without the need for external supervision. We introduce a generalizable and purely unsupervised self-training framework, named Genius. Without external auxiliary signals, Genius must seek the optimal response sequence in a stepwise manner and optimize the LLM. To explore the potential steps and exploit the optimal ones, Genius introduces a stepwise foresight re-sampling strategy to sample and estimate the step value by simulating future outcomes. Further, we recognize that the unsupervised setting inevitably induces intrinsic noise and uncertainty. To provide a robust optimization, we propose an advantage-calibrated optimization (ACO) loss function to mitigate estimation inconsistencies. Combining these techniques, Genius provides an advanced initial step towards self-improving LLM reasoning with general queries and without supervision, revolutionizing reasoning scaling laws given the vast availability of general queries. The code will be released at this https URL.

Title: Fast-Slow-Thinking: Complex Task Solving with Large Language Models

Authors: Yiliu Sun, Yanfang Zhang, Zicheng Zhao, Sheng Wan, Dacheng Tao, Chen Gong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08690
Pdf URL: https://arxiv.org/pdf/2504.08690
Copy Paste: [[2504.08690]] Fast-Slow-Thinking: Complex Task Solving with Large Language Models(https://arxiv.org/abs/2504.08690)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are increasingly employed to solve complex tasks. To meet this challenge, task decomposition has become an effective approach: it divides a complex task into multiple simpler subtasks and solves them separately, reducing the difficulty of the original task. However, the performance of existing task decomposition methods can be suboptimal when the task contains overly complex logic and constraints. In this situation, the solution generated by LLMs may deviate from the original purpose of the task, or contain redundant or even erroneous content. Therefore, inspired by the fact that humans possess two thinking systems, fast thinking and slow thinking, this paper introduces a new task decomposition method termed "Fast-Slow-Thinking" (FST), which stimulates LLMs to solve tasks through the cooperation of Fast Thinking (FT) and Slow Thinking (ST) steps. Here FT focuses more on the general and concise aspect of the task, and ST focuses more on the details of the task. In FT, LLMs are prompted to remove the constraints of the original task, thereby simplifying it to a general and concise one. In ST, we recall the constraints removed in FT, so that LLMs can improve the answer generated in FT to meet the requirements of the original task. Therefore, our FST method enables LLMs to consider a complex problem via a human-like cognition process from coarse to fine, the effectiveness of which has been well demonstrated by experiments on three types of tasks.
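The coarse-to-fine flow described in the abstract above (simplify the task, answer the simplified version, then restore the constraints and refine) can be sketched as a three-call pipeline. This is only an illustration under stated assumptions, not the authors' implementation: the `llm` callable, the function name `fast_slow_thinking`, and all prompt wordings are hypothetical, and the paper's actual prompts and number of steps may differ.

```python
from typing import Callable


def fast_slow_thinking(task: str, llm: Callable[[str], str]) -> str:
    """Hypothetical coarse-to-fine pipeline: simplify, draft, then refine."""
    # Fast Thinking, step 1: ask the model to strip constraints from the task.
    simplified = llm(
        "Rewrite the task below as a general, concise task with its "
        f"constraints removed:\n{task}"
    )
    # Fast Thinking, step 2: answer the simplified task.
    draft = llm(f"Solve this task:\n{simplified}")
    # Slow Thinking: recall the original task (with all its constraints)
    # and revise the draft so that it satisfies them.
    return llm(
        "Revise the draft so that it meets every requirement of the "
        f"original task.\nOriginal task:\n{task}\nDraft:\n{draft}"
    )


# Stand-in model for demonstration only: records prompts, returns canned text.
calls = []


def echo_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"response {len(calls)}"


original_task = "Write a 4-line poem about rain without using the letter e."
result = fast_slow_thinking(original_task, echo_llm)
print(result)
```

Passing the model in as a plain callable keeps the sketch independent of any particular API client; a real client function with the same string-in, string-out shape could be substituted for `echo_llm`.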

Title: TP-RAG: Benchmarking Retrieval-Augmented Large Language Model Agents for Spatiotemporal-Aware Travel Planning

Authors: Hang Ni, Fan Liu, Xinyu Ma, Lixin Su, Shuaiqiang Wang, Dawei Yin, Hui Xiong, Hao Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.08694
Pdf URL: https://arxiv.org/pdf/2504.08694
Copy Paste: [[2504.08694]] TP-RAG: Benchmarking Retrieval-Augmented Large Language Model Agents for Spatiotemporal-Aware Travel Planning(https://arxiv.org/abs/2504.08694)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) have shown promise in automating travel planning, yet they often fall short in addressing nuanced spatiotemporal rationality. While existing benchmarks focus on basic plan validity, they neglect critical aspects such as route efficiency, point-of-interest (POI) appeal, and real-time adaptability. This paper introduces TP-RAG, the first benchmark tailored for retrieval-augmented, spatiotemporal-aware travel planning. Our dataset includes 2,348 real-world travel queries, 85,575 fine-grained annotated POIs, and 18,784 high-quality travel trajectory references sourced from online tourist documents, enabling dynamic and context-aware planning. Through extensive experiments, we reveal that integrating reference trajectories significantly improves spatial efficiency and POI rationality of the travel plan, while challenges persist in universality and robustness due to conflicting references and noisy data. To address these issues, we propose EvoRAG, an evolutionary framework that potently synergizes diverse retrieved trajectories with LLMs' intrinsic reasoning. EvoRAG achieves state-of-the-art performance, improving spatiotemporal compliance and reducing commonsense violations compared to ground-up and retrieval-augmented baselines. Our work underscores the potential of hybridizing Web knowledge with LLM-driven optimization, paving the way for more reliable and adaptive travel planning agents.

Title: Large Language Models as Span Annotators

Authors: Zdeněk Kasner, Vilém Zouhar, Patrícia Schmidtová, Ivan Kartáč, Kristýna Onderková, Ondřej Plátek, Dimitra Gkatzia, Saad Mahamood, Ondřej Dušek, Simone Balloccu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.08697
Pdf URL: https://arxiv.org/pdf/2504.08697
Copy Paste: [[2504.08697]] Large Language Models as Span Annotators(https://arxiv.org/abs/2504.08697)
Keywords: language model, llm
Abstract: For high-quality texts, single-score metrics seldom provide actionable feedback. In contrast, span annotation - pointing out issues in the text by annotating their spans - can guide improvements and provide insights. Until recently, span annotation was limited to human annotators or fine-tuned encoder models. In this study, we automate span annotation with large language models (LLMs). We compare expert or skilled crowdworker annotators with open and proprietary LLMs on three tasks: data-to-text generation evaluation, machine translation evaluation, and propaganda detection in human-written texts. In our experiments, we show that LLMs as span annotators are straightforward to implement and notably more cost-efficient than human annotators. The LLMs achieve moderate agreement with skilled human annotators, in some scenarios comparable to the average agreement among the annotators themselves. Qualitative analysis shows that reasoning models outperform their instruction-tuned counterparts and provide more valid explanations for annotations. We release the dataset of more than 40k model and human annotations for further research.

Title: SWAN-GPT: An Efficient and Scalable Approach for Long-Context Language Modeling

Authors: Krishna C. Puvvada, Faisal Ladhak, Santiago Akle Serrano, Cheng-Ping Hsieh, Shantanu Acharya, Somshubra Majumdar, Fei Jia, Samuel Kriman, Simeng Sun, Dima Rekesh, Boris Ginsburg
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.08719
Pdf URL: https://arxiv.org/pdf/2504.08719
Copy Paste: [[2504.08719]] SWAN-GPT: An Efficient and Scalable Approach for Long-Context Language Modeling(https://arxiv.org/abs/2504.08719)
Keywords: language model, gpt
Abstract: We present a decoder-only Transformer architecture that robustly generalizes to sequence lengths substantially longer than those seen during training. Our model, SWAN-GPT, interleaves layers without positional encodings (NoPE) and sliding-window attention layers equipped with rotary positional encodings (SWA-RoPE). Experiments demonstrate strong performance on sequence lengths significantly longer than the training length without the need for additional long-context training. This robust length extrapolation is achieved through our novel architecture, enhanced by a straightforward dynamic scaling of attention scores during inference. In addition, SWAN-GPT is more computationally efficient than standard GPT architectures, resulting in cheaper training and higher throughput. Further, we demonstrate that existing pre-trained decoder-only models can be efficiently converted to the SWAN architecture with minimal continued training, enabling longer contexts. Overall, our work presents an effective approach for scaling language models to longer contexts in a robust and efficient manner.
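The two ingredients named in the SWAN-GPT abstract above, sliding-window attention and an interleaved layer stack, can be illustrated with a short sketch. It is not the authors' code: the abstract does not state the interleaving ratio or the window size, so `swa_per_nope` and `window` are hypothetical parameters, and the dynamic attention-score scaling used at inference is omitted because its form is not given in the abstract.

```python
from typing import List


def sliding_window_causal_mask(seq_len: int, window: int) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    A position sees itself and at most `window - 1` preceding positions.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def layer_schedule(n_layers: int, swa_per_nope: int) -> List[str]:
    """Hypothetical interleaving: one global NoPE layer, then `swa_per_nope`
    sliding-window RoPE layers, repeated. The ratio is an assumption."""
    period = swa_per_nope + 1
    return ["nope" if i % period == 0 else "swa_rope" for i in range(n_layers)]


if __name__ == "__main__":
    # Print the mask as rows of 0/1 and the resulting layer types.
    for row in sliding_window_causal_mask(seq_len=5, window=3):
        print("".join("1" if allowed else "0" for allowed in row))
    print(layer_schedule(n_layers=8, swa_per_nope=3))
```

In this sketch only the `swa_rope` layers would apply the windowed mask; the `nope` layers would use an ordinary causal mask with no positional encoding.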