2025-04-08

Title: A Unified Virtual Mixture-of-Experts Framework: Enhanced Inference and Hallucination Mitigation in Single-Model System

Authors: Mingyan Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.03739
Pdf URL: https://arxiv.org/pdf/2504.03739
Copy Paste: [[2504.03739]] A Unified Virtual Mixture-of-Experts Framework: Enhanced Inference and Hallucination Mitigation in Single-Model System(https://arxiv.org/abs/2504.03739)
Keywords: gpt, hallucination, prompt
Abstract: Generative models, such as GPT and BERT, have significantly improved performance in tasks like text generation and summarization. However, hallucinations, where models generate non-factual or misleading content, are especially problematic in smaller-scale architectures, limiting their real-world applicability. In this paper, we propose a unified Virtual Mixture-of-Experts (MoE) fusion strategy that enhances inference performance and mitigates hallucinations in a single Qwen 1.5 0.5B model without increasing the parameter count. Our method leverages multiple domain-specific expert prompts (with the number of experts being adjustable) to guide the model from different perspectives. We apply a statistical outlier truncation strategy based on the mean and standard deviation to filter out abnormally high probability predictions, and we inject noise into the embedding space to promote output diversity. To clearly assess the contribution of each module, we adopt a fixed voting mechanism rather than a dynamic gating network, thereby avoiding additional confounding factors. We provide detailed theoretical derivations from both statistical and ensemble learning perspectives to demonstrate how our method reduces output variance and suppresses hallucinations. Extensive ablation experiments on dialogue generation tasks show that our approach significantly improves inference accuracy and robustness in small models. Additionally, we discuss methods for evaluating the orthogonality of virtual experts and outline the potential for future work involving dynamic expert weight allocation using gating networks.
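Summary: The paper proposes a unified virtual Mixture-of-Experts (MoE) fusion strategy that improves inference and reduces hallucinations in a single Qwen 1.5 0.5B model without adding parameters. An adjustable number of domain-specific expert prompts guide the model from different perspectives; abnormally high probability predictions are filtered by a truncation rule based on the mean and standard deviation, and noise is injected into the embedding space to promote output diversity. A fixed voting mechanism, rather than a dynamic gating network, is used so that each module's contribution can be assessed cleanly. The authors give theoretical derivations from statistical and ensemble learning perspectives, report ablation experiments on dialogue generation, discuss how to evaluate the orthogonality of virtual experts, and outline future work on dynamic expert weighting with gating networks.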
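The truncation and fixed-voting steps described in this abstract can be illustrated with a minimal sketch. The abstract does not give the threshold, whether outliers are clipped or dropped, or how votes are combined, so the cutoff `mean + k * std`, the clip-and-renormalize rule, the equal-weight average, and the names `truncate_outliers` and `fixed_vote` are all assumptions made for illustration, not the author's implementation.

```python
from statistics import mean, pstdev


def truncate_outliers(probs, k=2.0):
    """Clip probabilities above mean + k * std, then renormalize to sum to 1.

    Assumption: outliers are clipped to the cutoff rather than removed.
    """
    cap = mean(probs) + k * pstdev(probs)
    clipped = [min(p, cap) for p in probs]
    total = sum(clipped)
    return [p / total for p in clipped]


def fixed_vote(expert_probs, k=2.0):
    """Fuse per-expert distributions with equal, fixed weights (no gating)."""
    truncated = [truncate_outliers(p, k) for p in expert_probs]
    n_experts = len(truncated)
    vocab_size = len(truncated[0])
    return [sum(t[i] for t in truncated) / n_experts for i in range(vocab_size)]


# Hypothetical next-token distributions from three expert prompts
# over a toy four-token vocabulary.
experts = [
    [0.90, 0.05, 0.03, 0.02],  # one expert is overconfident in token 0
    [0.40, 0.35, 0.15, 0.10],
    [0.30, 0.45, 0.15, 0.10],
]

# k=1.0 is used because with only four entries no value can lie
# more than sqrt(3) standard deviations above the mean.
fused = fixed_vote(experts, k=1.0)
```

In this toy setup the overconfident expert's peak is pulled down before averaging, which is the variance-reduction intuition the abstract appeals to.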

Title: Do "New Snow Tablets" Contain Snow? Large Language Models Over-Rely on Names to Identify Ingredients of Chinese Drugs

Authors: Sifan Li, Yujun Cai, Bryan Hooi, Nanyun Peng, Yiwei Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.03786
Pdf URL: https://arxiv.org/pdf/2504.03786
Copy Paste: [[2504.03786]] Do "New Snow Tablets" Contain Snow? Large Language Models Over-Rely on Names to Identify Ingredients of Chinese Drugs(https://arxiv.org/abs/2504.03786)
Keywords: language model, llm, retrieval augmented generation
Abstract: Traditional Chinese Medicine (TCM) has seen increasing adoption in healthcare, with specialized Large Language Models (LLMs) emerging to support clinical applications. A fundamental requirement for these models is accurate identification of TCM drug ingredients. In this paper, we evaluate how general and TCM-specialized LLMs perform when identifying ingredients of Chinese drugs. Our systematic analysis reveals consistent failure patterns: models often interpret drug names literally, overuse common herbs regardless of relevance, and exhibit erratic behaviors when faced with unfamiliar formulations. LLMs also fail to understand the verification task. These findings demonstrate that current LLMs rely primarily on drug names rather than possessing systematic pharmacological knowledge. To address these limitations, we propose a Retrieval Augmented Generation (RAG) approach focused on ingredient names. Experiments across 220 TCM formulations show our method significantly improves accuracy from approximately 50% to 82% in ingredient verification tasks. Our work highlights critical weaknesses in current TCM-specific LLMs and offers a practical solution for enhancing their clinical reliability.
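Summary: The authors evaluate how general and TCM-specialized LLMs identify the ingredients of Chinese drugs. They find consistent failure patterns: models interpret drug names literally, overuse common herbs regardless of relevance, behave erratically on unfamiliar formulations, and fail to understand the verification task. This indicates that current LLMs rely mainly on drug names rather than systematic pharmacological knowledge. A Retrieval Augmented Generation (RAG) approach focused on ingredient names raises ingredient verification accuracy from approximately 50% to 82% across 220 TCM formulations.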

Title: Sample, Don't Search: Rethinking Test-Time Alignment for Language Models

Authors: Gonçalo Faria, Noah A. Smith
Subjects: cs.CL, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2504.03790
Pdf URL: https://arxiv.org/pdf/2504.03790
Copy Paste: [[2504.03790]] Sample, Don't Search: Rethinking Test-Time Alignment for Language Models(https://arxiv.org/abs/2504.03790)
Keywords: language model, prompt
Abstract: Increasing test-time computation has emerged as a promising direction for improving language model performance, particularly in scenarios where model finetuning is impractical or impossible due to computational constraints or private model weights. However, existing test-time search methods using a reward model (RM) often degrade in quality as compute scales, due to the over-optimization of what are inherently imperfect reward proxies. We introduce QAlign, a new test-time alignment approach. As we scale test-time compute, QAlign converges to sampling from the optimal aligned distribution for each individual prompt. By adopting recent advances in Markov chain Monte Carlo for text generation, our method enables better-aligned outputs without modifying the underlying model or even requiring logit access. We demonstrate the effectiveness of QAlign on mathematical reasoning benchmarks (GSM8K and GSM-Symbolic) using a task-specific RM, showing consistent improvements over existing test-time compute methods like best-of-n and majority voting. Furthermore, when applied with more realistic RMs trained on the Tulu 3 preference dataset, QAlign outperforms direct preference optimization (DPO), best-of-n, majority voting, and weighted majority voting on a diverse range of datasets (GSM8K, MATH500, IFEval, MMLU-Redux, and TruthfulQA). A practical solution to aligning language models at test time using additional computation without degradation, our approach expands the limits of the capability that can be obtained from off-the-shelf language models without further training.
Summary: Test-time search with a reward model (RM) often degrades as compute scales because it over-optimizes an imperfect reward proxy. QAlign is a test-time alignment approach that, as compute grows, converges to sampling from the optimal aligned distribution for each prompt. It builds on recent Markov chain Monte Carlo methods for text generation and needs neither changes to the underlying model nor logit access. With a task-specific RM it improves on best-of-n and majority voting on GSM8K and GSM-Symbolic; with RMs trained on the Tulu 3 preference dataset it outperforms direct preference optimization (DPO), best-of-n, majority voting, and weighted majority voting on GSM8K, MATH500, IFEval, MMLU-Redux, and TruthfulQA.
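For readers unfamiliar with the baselines named in this abstract, the sketch below shows best-of-n, majority voting, and weighted majority voting over a set of sampled answers. It does not implement QAlign itself. The answers and reward scores are made-up values, and the function names are illustrative only.

```python
from collections import Counter, defaultdict


def best_of_n(answers, rewards):
    """Return the single answer with the highest reward-model score."""
    return max(zip(answers, rewards), key=lambda pair: pair[1])[0]


def majority_vote(answers):
    """Return the most frequent answer, ignoring reward scores."""
    return Counter(answers).most_common(1)[0][0]


def weighted_majority_vote(answers, rewards):
    """Sum reward scores per distinct answer and return the top total."""
    totals = defaultdict(float)
    for answer, reward in zip(answers, rewards):
        totals[answer] += reward
    return max(totals, key=totals.get)


# Hypothetical final answers from four samples and their RM scores.
answers = ["42", "41", "42", "40"]
rewards = [0.5, 0.9, 0.6, 0.1]

picked_bon = best_of_n(answers, rewards)
picked_mv = majority_vote(answers)
picked_wmv = weighted_majority_vote(answers, rewards)
```

Here best-of-n follows the single highest score and picks "41", while both voting rules pick "42". This shows how a single over-scored sample can steer best-of-n, which is the over-optimization risk the abstract describes.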

Title: Entropy-Based Block Pruning for Efficient Large Language Models

Authors: Liangwei Yang, Yuhui Xu, Juntao Tan, Doyen Sahoo, Silvio Savarese, Caiming Xiong, Huan Wang, Shelby Heinecke
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.03794
Pdf URL: https://arxiv.org/pdf/2504.03794
Copy Paste: [[2504.03794]] Entropy-Based Block Pruning for Efficient Large Language Models(https://arxiv.org/abs/2504.03794)
Keywords: language model
Abstract: As large language models continue to scale, their growing computational and storage demands pose significant challenges for real-world deployment. In this work, we investigate redundancy within Transformer-based models and propose an entropy-based pruning strategy to enhance efficiency while maintaining performance. Empirical analysis reveals that the entropy of hidden representations decreases in the early blocks but progressively increases across most subsequent blocks. This trend suggests that entropy serves as a more effective measure of information richness within computation blocks. Unlike cosine similarity, which primarily captures geometric relationships, entropy directly quantifies uncertainty and information content, making it a more reliable criterion for pruning. Extensive experiments demonstrate that our entropy-based pruning approach surpasses cosine similarity-based methods in reducing model size while preserving accuracy, offering a promising direction for efficient model deployment.
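Summary: The paper studies redundancy in Transformer-based models and proposes an entropy-based pruning strategy that improves efficiency while maintaining performance. Empirically, the entropy of hidden representations decreases in the early blocks and then progressively increases across most later blocks, which suggests that entropy is a more effective measure of information richness within computation blocks. Unlike cosine similarity, which mainly captures geometric relationships, entropy directly quantifies uncertainty and information content. Experiments show that entropy-based pruning surpasses cosine similarity-based methods at reducing model size while preserving accuracy.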
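As a rough illustration of scoring blocks by the entropy of their hidden representations, the sketch below estimates Shannon entropy from a histogram of activation values and ranks blocks from lowest to highest entropy. The abstract does not say how entropy is estimated or which blocks are removed, so the histogram estimator, the bin count, the lowest-entropy-first ordering, and the toy activations are all assumptions.

```python
import math
from collections import Counter


def histogram_entropy(values, bins=8):
    """Shannon entropy (in bits) of values bucketed into equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant signal carries no information
    width = (hi - lo) / bins
    # The maximum value would land in bin index `bins`, so clamp it.
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def rank_blocks_by_entropy(block_activations, bins=8):
    """Return block names sorted from lowest to highest estimated entropy."""
    scores = {
        name: histogram_entropy(acts, bins)
        for name, acts in block_activations.items()
    }
    return sorted(scores, key=scores.get)


# Hypothetical flattened hidden activations for two blocks.
activations = {
    "block_a": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # concentrated
    "block_b": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],  # spread out
}
ranking = rank_blocks_by_entropy(activations)
```

Under the assumption that low-entropy blocks are the pruning candidates, `block_a` would be considered first in this toy example.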

Title: What Large Language Models Do Not Talk About: An Empirical Study of Moderation and Censorship Practices

Authors: Sander Noels, Guillaume Bied, Maarten Buyl, Alexander Rogiers, Yousra Fettach, Jefrey Lijffijt, Tijl De Bie
Subjects: cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2504.03803
Pdf URL: https://arxiv.org/pdf/2504.03803
Copy Paste: [[2504.03803]] What Large Language Models Do Not Talk About: An Empirical Study of Moderation and Censorship Practices(https://arxiv.org/abs/2504.03803)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are increasingly deployed as gateways to information, yet their content moderation practices remain underexplored. This work investigates the extent to which LLMs refuse to answer or omit information when prompted on political topics. To do so, we distinguish between hard censorship (i.e., generated refusals, error messages, or canned denial responses) and soft censorship (i.e., selective omission or downplaying of key elements), which we identify in LLMs' responses when asked to provide information on a broad range of political figures. Our analysis covers 14 state-of-the-art models from Western countries, China, and Russia, prompted in all six official United Nations (UN) languages. Our analysis suggests that although censorship is observed across the board, it is predominantly tailored to an LLM provider's domestic audience and typically manifests as either hard censorship or soft censorship (though rarely both concurrently). These findings underscore the need for ideological and geographic diversity among publicly available LLMs, and greater transparency in LLM moderation strategies to facilitate informed user choices. All data are made freely available.
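Summary: This work measures how often LLMs refuse to answer or omit information when prompted on political topics. It distinguishes hard censorship (generated refusals, error messages, or canned denial responses) from soft censorship (selective omission or downplaying of key elements), both identified in responses to requests for information on a broad range of political figures. The analysis covers 14 state-of-the-art models from Western countries, China, and Russia, prompted in all six official United Nations (UN) languages. Censorship appears across the board, is predominantly tailored to the provider's domestic audience, and typically shows up as either hard or soft censorship but rarely both at once. The authors call for ideological and geographic diversity among publicly available LLMs and greater transparency in moderation strategies. All data are freely available.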

Title: Do LLM Evaluators Prefer Themselves for a Reason?

Authors: Wei-Lin Chen, Zhepei Wei, Xinyu Zhu, Shi Feng, Yu Meng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.03846
Pdf URL: https://arxiv.org/pdf/2504.03846
Copy Paste: [[2504.03846]] Do LLM Evaluators Prefer Themselves for a Reason?(https://arxiv.org/abs/2504.03846)
Keywords: language model, gpt, llm, chain-of-thought
Abstract: Large language models (LLMs) are increasingly used as automatic evaluators in applications such as benchmarking, reward modeling, and self-refinement. Prior work highlights a potential self-preference bias where LLMs favor their own generated responses, a tendency often intensifying with model size and capability. This raises a critical question: Is self-preference detrimental, or does it simply reflect objectively superior outputs from more capable models? Disentangling these has been challenging due to the usage of subjective tasks in previous studies. To address this, we investigate self-preference using verifiable benchmarks (mathematical reasoning, factual knowledge, code generation) that allow objective ground-truth assessment. This enables us to distinguish harmful self-preference (favoring objectively worse responses) from legitimate self-preference (favoring genuinely superior ones). We conduct large-scale experiments under controlled evaluation conditions across diverse model families (e.g., Llama, Qwen, Gemma, Mistral, Phi, GPT, DeepSeek). Our findings reveal three key insights: (1) Better generators are better judges -- LLM evaluators' accuracy strongly correlates with their task performance, and much of the self-preference in capable models is legitimate. (2) Harmful self-preference persists, particularly when evaluator models perform poorly as generators on specific task instances. Stronger models exhibit more pronounced harmful bias when they err, though such incorrect generations are less frequent. (3) Inference-time scaling strategies, such as generating a long Chain-of-Thought before evaluation, effectively reduce the harmful self-preference. These results provide a more nuanced understanding of LLM-based evaluation and practical insights for improving its reliability.
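Summary: LLM evaluators are known to favor their own outputs, and this paper asks whether that self-preference is harmful or simply reflects better outputs from stronger models. Using verifiable benchmarks (mathematical reasoning, factual knowledge, code generation), the authors separate harmful self-preference (favoring objectively worse responses) from legitimate self-preference (favoring genuinely superior ones) across model families such as Llama, Qwen, Gemma, Mistral, Phi, GPT, and DeepSeek. They report three findings: (1) better generators are better judges, and much self-preference in capable models is legitimate; (2) harmful self-preference persists, especially when the evaluator performs poorly as a generator on a given instance, and stronger models show more pronounced harmful bias when they err, though they err less often; (3) inference-time scaling, such as generating a long Chain-of-Thought before evaluation, effectively reduces harmful self-preference.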

Title: CliME: Evaluating Multimodal Climate Discourse on Social Media and the Climate Alignment Quotient (CAQ)

Authors: Abhilekh Borah, Hasnat Md Abdullah, Kangda Wei, Ruihong Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.03906
Pdf URL: https://arxiv.org/pdf/2504.03906
Copy Paste: [[2504.03906]] CliME: Evaluating Multimodal Climate Discourse on Social Media and the Climate Alignment Quotient (CAQ)(https://arxiv.org/abs/2504.03906)
Keywords: language model, llm
Abstract: The rise of Large Language Models (LLMs) has raised questions about their ability to understand climate-related contexts. Though climate change dominates social media, analyzing its multimodal expressions is understudied, and current tools have failed to determine whether LLMs amplify credible solutions or spread unsubstantiated claims. To address this, we introduce CliME (Climate Change Multimodal Evaluation), a first-of-its-kind multimodal dataset, comprising 2579 Twitter and Reddit posts. The benchmark features a diverse collection of humorous memes and skeptical posts, capturing how these formats distill complex issues into viral narratives that shape public opinion and policy discussions. To systematically evaluate LLM performance, we present the Climate Alignment Quotient (CAQ), a novel metric comprising five distinct dimensions: Articulation, Evidence, Resonance, Transition, and Specificity. Additionally, we propose three analytical lenses: Actionability, Criticality, and Justice, to guide the assessment of LLM-generated climate discourse using CAQ. Our findings, based on the CAQ metric, indicate that while most evaluated LLMs perform relatively well in Criticality and Justice, they consistently underperform on the Actionability axis. Among the models evaluated, Claude 3.7 Sonnet achieves the highest overall performance. We publicly release our CliME dataset and code to foster further research in this domain.
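Summary: The paper introduces CliME (Climate Change Multimodal Evaluation), a multimodal dataset of 2579 Twitter and Reddit posts that includes humorous memes and skeptical posts. To evaluate LLMs on it, the authors present the Climate Alignment Quotient (CAQ), a metric with five dimensions: Articulation, Evidence, Resonance, Transition, and Specificity. They also propose three analytical lenses (Actionability, Criticality, and Justice) to guide the assessment of LLM-generated climate discourse. Most evaluated LLMs perform relatively well on Criticality and Justice but consistently underperform on Actionability. Claude 3.7 Sonnet achieves the highest overall performance among the models evaluated. The dataset and code are publicly released.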

Title: Adaptation of Large Language Models

Authors: Zixuan Ke, Yifei Ming, Shafiq Joty
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.03931
Pdf URL: https://arxiv.org/pdf/2504.03931
Copy Paste: [[2504.03931]] Adaptation of Large Language Models(https://arxiv.org/abs/2504.03931)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: This tutorial on adaptation of LLMs is designed to address the growing demand for models that go beyond the static capabilities of generic LLMs by providing an overview of dynamic, domain-specific, and task-adaptive LLM adaptation techniques. While general LLMs have demonstrated strong generalization across a variety of tasks, they often struggle to perform well in specialized domains such as finance, healthcare, and code generation for underrepresented languages. Additionally, their static nature limits their ability to evolve with the changing world, and they are often extremely large in size, making them impractical and costly to deploy at scale. As a result, the adaptation of LLMs has drawn much attention since the birth of LLMs and is of core importance, both for industry, which focuses on serving its targeted users, and academia, which can greatly benefit from small but powerful LLMs. To address this gap, this tutorial aims to provide an overview of the LLM adaptation techniques. We start with an introduction to LLM adaptation, from both the data perspective and the model perspective. We then emphasize how the evaluation metrics and benchmarks are different from other techniques. After establishing the problems, we explore various adaptation techniques. We categorize adaptation techniques into two main families. The first is parametric knowledge adaptation, which focuses on updating the parametric knowledge within LLMs. Additionally, we will discuss real-time adaptation techniques, including model editing, which allows LLMs to be updated dynamically in production environments. The second kind of adaptation is semi-parametric knowledge adaptation, where the goal is to update LLM parameters to better leverage external knowledge or tools through techniques like retrieval-augmented generation (RAG) and agent-based systems.
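Summary: This tutorial gives an overview of dynamic, domain-specific, and task-adaptive techniques for adapting LLMs. General LLMs often struggle in specialized domains such as finance, healthcare, and code generation for underrepresented languages; they are static, and they are costly to deploy at scale. The tutorial introduces LLM adaptation from the data and model perspectives, explains how its evaluation metrics and benchmarks differ from those of other techniques, and then groups adaptation techniques into two families. Parametric knowledge adaptation updates the knowledge stored in model parameters and is discussed together with real-time techniques such as model editing. Semi-parametric knowledge adaptation updates LLM parameters so that the model makes better use of external knowledge or tools, through techniques like retrieval-augmented generation (RAG) and agent-based systems.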

Title: YaleNLP @ PerAnsSumm 2025: Multi-Perspective Integration via Mixture-of-Agents for Enhanced Healthcare QA Summarization

Authors: Dongsuk Jang, Alan Li, Arman Cohan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.03932
Pdf URL: https://arxiv.org/pdf/2504.03932
Copy Paste: [[2504.03932]] YaleNLP @ PerAnsSumm 2025: Multi-Perspective Integration via Mixture-of-Agents for Enhanced Healthcare QA Summarization(https://arxiv.org/abs/2504.03932)
Keywords: gpt, llm, prompt, agent
Abstract: Automated summarization of healthcare community question-answering forums is challenging due to diverse perspectives presented across multiple user responses to each question. The PerAnsSumm Shared Task was therefore proposed to tackle this challenge by identifying perspectives from different answers and then generating a comprehensive answer to the question. In this study, we address the PerAnsSumm Shared Task using two complementary paradigms: (i) a training-based approach through QLoRA fine-tuning of LLaMA-3.3-70B-Instruct, and (ii) agentic approaches including zero- and few-shot prompting with frontier LLMs (LLaMA-3.3-70B-Instruct and GPT-4o) and a Mixture-of-Agents (MoA) framework that leverages a diverse set of LLMs by combining outputs from multi-layer feedback aggregation. For perspective span identification/classification, GPT-4o zero-shot achieves an overall score of 0.57, substantially outperforming the 0.40 score of the LLaMA baseline. With a 2-layer MoA configuration, we were able to improve LLaMA performance by 28 percent, to 0.51. For perspective-based summarization, GPT-4o zero-shot attains an overall score of 0.42 compared to 0.28 for the best LLaMA zero-shot, and our 2-layer MoA approach boosts LLaMA performance by 32 percent, to 0.37. Furthermore, in the few-shot setting, our results show that sentence-transformer embedding-based exemplar selection provides more gain than manually selected exemplars on LLaMA models, although few-shot prompting is not always helpful for GPT-4o. The YaleNLP team's approach ranked second overall in the shared task.
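Summary: The YaleNLP team addresses the PerAnsSumm Shared Task, which asks systems to identify perspectives across multiple answers in healthcare community question-answering forums and then generate a comprehensive answer. They combine QLoRA fine-tuning of LLaMA-3.3-70B-Instruct with agentic approaches: zero- and few-shot prompting of LLaMA-3.3-70B-Instruct and GPT-4o, and a Mixture-of-Agents (MoA) framework. For perspective span identification/classification, GPT-4o zero-shot scores 0.57 against 0.40 for the LLaMA baseline, and a 2-layer MoA lifts LLaMA by 28 percent to 0.51. For perspective-based summarization, GPT-4o zero-shot scores 0.42 against 0.28 for the best LLaMA zero-shot, and the 2-layer MoA lifts LLaMA by 32 percent to 0.37. Sentence-transformer embedding-based exemplar selection helps LLaMA more than manually selected exemplars, although few-shot prompting does not always help GPT-4o. The approach ranked second overall.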
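The percentage improvements quoted in this abstract are relative gains over the LLaMA baseline scores, which can be checked with a short calculation. The scores are taken from the abstract; the helper name `relative_gain` is illustrative.

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`, as a fraction."""
    return (after - before) / before


# Perspective span identification/classification: 0.40 -> 0.51
span_gain = relative_gain(0.40, 0.51)     # about 0.275, reported as 28 percent

# Perspective-based summarization: 0.28 -> 0.37
summary_gain = relative_gain(0.28, 0.37)  # about 0.321, reported as 32 percent
```

Both reported figures are consistent with rounding the relative gain to the nearest whole percent.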

Title: Language Models Are Implicitly Continuous

Authors: Samuele Marro, Davide Evangelista, X. Angelo Huang, Emanuele La Malfa, Michele Lombardi, Michael Wooldridge
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2504.03933
Pdf URL: https://arxiv.org/pdf/2504.03933
Copy Paste: [[2504.03933]] Language Models Are Implicitly Continuous(https://arxiv.org/abs/2504.03933)
Keywords: language model, llm
Abstract: Language is typically modelled with discrete sequences. However, the most successful approaches to language modelling, namely neural networks, are continuous and smooth function approximators. In this work, we show that Transformer-based language models implicitly learn to represent sentences as continuous-time functions defined over a continuous input space. This phenomenon occurs in most state-of-the-art Large Language Models (LLMs), including Llama2, Llama3, Phi3, Gemma, Gemma2, and Mistral, and suggests that LLMs reason about language in ways that fundamentally differ from humans. Our work formally extends Transformers to capture the nuances of time and space continuity in both input and output space. Our results challenge the traditional interpretation of how LLMs understand language, with several linguistic and engineering implications.
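Summary: Language is usually modelled as discrete sequences, yet neural networks are continuous and smooth function approximators. The authors show that Transformer-based language models implicitly learn to represent sentences as continuous-time functions defined over a continuous input space. The phenomenon appears in most state-of-the-art LLMs, including Llama2, Llama3, Phi3, Gemma, Gemma2, and Mistral, and suggests that LLMs reason about language in ways that differ fundamentally from humans. The work formally extends Transformers to capture time and space continuity in both input and output space, and it challenges the traditional interpretation of how LLMs understand language.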

Title: Clinical ModernBERT: An efficient and long context encoder for biomedical text

Authors: Simon A. Lee, Anthony Wu, Jeffrey N. Chiang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.03964
Pdf URL: https://arxiv.org/pdf/2504.03964
Copy Paste: [[2504.03964]] Clinical ModernBERT: An efficient and long context encoder for biomedical text(https://arxiv.org/abs/2504.03964)
Keywords: long context
Abstract: We introduce Clinical ModernBERT, a transformer-based encoder pretrained on large-scale biomedical literature, clinical notes, and medical ontologies, incorporating PubMed abstracts, MIMIC IV clinical data, and medical codes with their textual descriptions. Building on ModernBERT, the current state-of-the-art natural language text encoder featuring architectural upgrades such as rotary positional embeddings (RoPE), Flash Attention, and extended context length up to 8,192 tokens, our model adapts these innovations specifically for biomedical and clinical domains. Clinical ModernBERT excels at producing semantically rich representations tailored for long context tasks. We validate this both by analyzing its pretrained weights and through empirical evaluation on a comprehensive suite of clinical NLP benchmarks.
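Summary: Clinical ModernBERT is a transformer-based encoder pretrained on large-scale biomedical literature, clinical notes, and medical ontologies, including PubMed abstracts, MIMIC IV clinical data, and medical codes with their textual descriptions. It adapts ModernBERT's architectural upgrades, such as rotary positional embeddings (RoPE), Flash Attention, and a context length of up to 8,192 tokens, to the biomedical and clinical domains. The model produces semantically rich representations suited to long context tasks, which the authors validate by analyzing its pretrained weights and by evaluating it on a suite of clinical NLP benchmarks.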

Title: Structured Extraction of Process Structure Properties Relationships in Materials Science

Authors: Amit K Verma, Zhisong Zhang, Junwon Seo, Robin Kuo, Runbo Jiang, Emma Strubell, Anthony D Rollett
Subjects: cs.CL, cond-mat.mtrl-sci, cs.IR
Abstract URL: https://arxiv.org/abs/2504.03979
Pdf URL: https://arxiv.org/pdf/2504.03979
Copy Paste: [[2504.03979]] Structured Extraction of Process Structure Properties Relationships in Materials Science(https://arxiv.org/abs/2504.03979)
Keywords: language model, gpt, llm
Abstract: With the advent of large language models (LLMs), the vast unstructured text within millions of academic papers is increasingly accessible for materials discovery, although significant challenges remain. While LLMs offer promising few- and zero-shot learning capabilities, particularly valuable in the materials domain where expert annotations are scarce, general-purpose LLMs often fail to address key materials-specific queries without further adaptation. To bridge this gap, fine-tuning LLMs on human-labeled data is essential for effective structured knowledge extraction. In this study, we introduce a novel annotation schema designed to extract generic process-structure-properties relationships from scientific literature. We demonstrate the utility of this approach using a dataset of 128 abstracts, with annotations drawn from two distinct domains: high-temperature materials (Domain I) and uncertainty quantification in simulating materials microstructure (Domain II). Initially, we developed a conditional random field (CRF) model based on MatBERT, a domain-specific BERT variant, and evaluated its performance on Domain I. Subsequently, we compared this model with a fine-tuned LLM (GPT-4o from OpenAI) under identical conditions. Our results indicate that fine-tuning LLMs can significantly improve entity extraction performance over the BERT-CRF baseline on Domain I. However, when additional examples from Domain II were incorporated, the performance of the BERT-CRF model became comparable to that of the GPT-4o model. These findings underscore the potential of our schema for structured knowledge extraction and highlight the complementary strengths of both modeling approaches.
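Summary: General-purpose LLMs often fail on materials-specific queries without further adaptation, so the authors argue that fine-tuning on human-labeled data is essential for structured knowledge extraction. They introduce an annotation schema for extracting generic process-structure-properties relationships from scientific literature and apply it to 128 abstracts from two domains: high-temperature materials (Domain I) and uncertainty quantification in simulating materials microstructure (Domain II). A conditional random field (CRF) model based on MatBERT, a domain-specific BERT variant, is compared with a fine-tuned GPT-4o under identical conditions. Fine-tuning the LLM significantly improves entity extraction over the BERT-CRF baseline on Domain I, but once Domain II examples are added, the BERT-CRF model becomes comparable to GPT-4o.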

Title: Algorithmic Prompt Generation for Diverse Human-like Teaming and Communication with Large Language Models

Authors: Siddharth Srikanth, Varun Bhatt, Boshen Zhang, Werner Hager, Charles Michael Lewis, Katia P. Sycara, Aaquib Tabrez, Stefanos Nikolaidis
Subjects: cs.CL, cs.AI, cs.HC, cs.MA
Abstract URL: https://arxiv.org/abs/2504.03991
Pdf URL: https://arxiv.org/pdf/2504.03991
Copy Paste: [[2504.03991]] Algorithmic Prompt Generation for Diverse Human-like Teaming and Communication with Large Language Models(https://arxiv.org/abs/2504.03991)
Keywords: language model, llm, prompt, agent
Abstract: Understanding how humans collaborate and communicate in teams is essential for improving human-agent teaming and AI-assisted decision-making. However, relying solely on data from large-scale user studies is impractical due to logistical, ethical, and practical constraints, necessitating synthetic models of multiple diverse human behaviors. Recently, agents powered by Large Language Models (LLMs) have been shown to emulate human-like behavior in social settings. However, obtaining a large set of diverse behaviors requires manual effort in the form of designing prompts. On the other hand, Quality Diversity (QD) optimization has been shown to be capable of generating diverse Reinforcement Learning (RL) agent behavior. In this work, we combine QD optimization with LLM-powered agents to iteratively search for prompts that generate diverse team behavior in a long-horizon, multi-step collaborative environment. We first show, through a human-subjects experiment (n=54 participants), that humans exhibit diverse coordination and communication behavior in this domain. We then show that our approach can effectively replicate trends from human teaming data and also capture behaviors that are not easily observed without collecting large amounts of data. Our findings highlight the combination of QD and LLM-powered agents as an effective tool for studying teaming and communication strategies in multi-agent collaboration.
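Summary: Large-scale user studies of human teamwork are impractical for logistical, ethical, and practical reasons, which motivates synthetic models of diverse human behavior. LLM-powered agents can emulate human-like behavior, but obtaining many diverse behaviors requires manual prompt design. The authors combine Quality Diversity (QD) optimization with LLM-powered agents to iteratively search for prompts that produce diverse team behavior in a long-horizon, multi-step collaborative environment. A human-subjects experiment (n=54 participants) first shows that humans exhibit diverse coordination and communication behavior in this domain. The approach then replicates trends from the human teaming data and captures behaviors that are hard to observe without collecting large amounts of data.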

Title: Rethinking Reflection in Pre-Training

Authors: Essential AI: Darsh J Shah, Peter Rushton, Somanshu Singla, Mohit Parmar, Kurt Smith, Yash Vanjani, Ashish Vaswani, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Anthony Polloreno, Ashish Tanwer, Burhan Drak Sibai, Divya S Mansingka, Divya Shivaprasad, Ishaan Shah, Karl Stratos, Khoi Nguyen, Michael Callahan, Michael Pust, Mrinal Iyer, Philip Monk, Platon Mazarakis, Ritvik Kapila, Saurabh Srivastava, Tim Romanski
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.04022
Pdf URL: https://arxiv.org/pdf/2504.04022
Copy Paste: [[2504.04022]] Rethinking Reflection in Pre-Training(https://arxiv.org/abs/2504.04022)
Keywords: language model
Abstract: A language model's ability to reflect on its own reasoning provides a key advantage for solving complex problems. While most recent research has focused on how this ability develops during reinforcement learning, we show that it actually begins to emerge much earlier, during the model's pre-training. To study this, we introduce deliberate errors into chains-of-thought and test whether the model can still arrive at the correct answer by recognizing and correcting these mistakes. By tracking performance across different stages of pre-training, we observe that this self-correcting ability appears early and improves steadily over time. For instance, an OLMo2-7B model pre-trained on 4 trillion tokens displays self-correction on our six self-reflection tasks.
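Summary: The ability of a language model to reflect on its own reasoning is usually studied in the context of reinforcement learning, but the authors show that it begins to emerge much earlier, during pre-training. They insert deliberate errors into chains-of-thought and test whether the model still reaches the correct answer by recognizing and correcting the mistakes. Tracking performance across pre-training stages shows that self-correction appears early and improves steadily. For example, an OLMo2-7B model pre-trained on 4 trillion tokens displays self-correction on the six self-reflection tasks.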

Title: SyLeR: A Framework for Explicit Syllogistic Legal Reasoning in Large Language Models

Authors: Kepu Zhang, Weijie Yu, Zhongxiang Sun, Jun Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.04042
Pdf URL: https://arxiv.org/pdf/2504.04042
Copy Paste: [[2504.04042]] SyLeR: A Framework for Explicit Syllogistic Legal Reasoning in Large Language Models(https://arxiv.org/abs/2504.04042)
Keywords: language model, llm
Abstract: Syllogistic reasoning is a fundamental aspect of legal decision-making, enabling logical conclusions by connecting general legal principles with specific case facts. Although existing large language models (LLMs) can generate responses to legal questions, they fail to perform explicit syllogistic reasoning, often producing implicit and unstructured answers that lack explainability and trustworthiness. To address this limitation, we propose SyLeR, a novel framework that empowers LLMs to engage in explicit syllogistic legal reasoning. SyLeR integrates a tree-structured hierarchical retrieval mechanism to effectively combine relevant legal statutes and precedent cases, forming comprehensive major premises. This is followed by a two-stage fine-tuning process: supervised fine-tuning warm-up establishes a foundational understanding of syllogistic reasoning, while reinforcement learning with a structure-aware reward mechanism refines the ability of the model to generate diverse, logically sound, and well-structured reasoning paths. We conducted extensive experiments across various dimensions, including in-domain and cross-domain user groups (legal laypersons and practitioners), multiple languages (Chinese and French), and different LLM backbones (legal-specific and open-domain LLMs). The results show that SyLeR significantly improves response accuracy and consistently delivers explicit, explainable, and trustworthy legal reasoning.

Title: FISH-Tuning: Enhancing PEFT Methods with Fisher Information

Authors: Kang Xue, Ming Dong, Xinhui Tu, Tingting He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.04050
Pdf URL: https://arxiv.org/pdf/2504.04050
Copy Paste: [[2504.04050]] FISH-Tuning: Enhancing PEFT Methods with Fisher Information(https://arxiv.org/abs/2504.04050)
Keywords: language model, llm
Abstract: The rapid growth in the parameter size of Large Language Models (LLMs) has led to the development of Parameter-Efficient Fine-Tuning (PEFT) methods to alleviate the computational costs of fine-tuning. Among these, Fisher Induced Sparse uncHanging (FISH) Mask is a selection-based PEFT technique that identifies a subset of pre-trained parameters for fine-tuning based on approximate Fisher information. However, the integration of FISH Mask with other PEFT methods, such as LoRA and Adapters, remains underexplored. In this paper, we propose FISH-Tuning, a novel approach that incorporates FISH Mask into addition-based and reparameterization-based PEFT methods, including LoRA, Adapters, and their variants. By leveraging Fisher information to select critical parameters within these methods, FISH-Tuning achieves superior performance without additional memory overhead or inference latency. Experimental results across various datasets and pre-trained models demonstrate that FISH-Tuning consistently outperforms the vanilla PEFT methods with the same proportion of trainable parameters.
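The selection step described in this abstract can be illustrated with a minimal sketch. It assumes the common diagonal empirical-Fisher approximation (the mean of squared per-sample gradients) and a simple top-k cut; the function name `fisher_topk_mask`, the `keep_ratio` parameter, and the toy gradients are hypothetical and are not taken from the paper. In a FISH-Tuning-style setup the parameter vector would presumably be the flattened LoRA or adapter weights.

```python
import numpy as np

def fisher_topk_mask(per_sample_grads, keep_ratio):
    """Return a boolean mask marking the parameters with the largest
    approximate Fisher information.

    per_sample_grads: array of shape (n_samples, n_params), one gradient
    row per training example (hypothetical input for this sketch).
    keep_ratio: fraction of parameters left trainable.
    """
    grads = np.asarray(per_sample_grads, dtype=float)
    # Diagonal empirical Fisher: average squared gradient per parameter.
    fisher = np.mean(grads ** 2, axis=0)
    k = max(1, int(round(keep_ratio * fisher.size)))
    # Indices of the k largest Fisher scores (order within the top k is arbitrary).
    top = np.argpartition(fisher, -k)[-k:]
    mask = np.zeros(fisher.size, dtype=bool)
    mask[top] = True
    return mask

# Toy example: 2 samples, 4 parameters.
grads = np.array([[0.1, -2.0, 0.0, 0.5],
                  [0.2,  1.0, 0.0, -0.5]])
mask = fisher_topk_mask(grads, keep_ratio=0.5)
```

During fine-tuning, gradients of parameters where `mask` is `False` would be zeroed, so only the selected subset is updated.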

Title: VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation

Authors: Yuhao Wang, Heyang Liu, Ziyang Cheng, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2504.04060
Pdf URL: https://arxiv.org/pdf/2504.04060
Copy Paste: [[2504.04060]] VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation(https://arxiv.org/abs/2504.04060)
Keywords: language model, llm
Abstract: Speech large language models (LLMs) have emerged as a prominent research focus in speech processing. We propose VocalNet-1B and VocalNet-8B, a series of high-performance, low-latency speech LLMs enabled by a scalable and model-agnostic training framework for real-time voice interaction. Departing from the conventional next-token prediction (NTP), we introduce multi-token prediction (MTP), a novel approach optimized for speech LLMs that simultaneously improves generation speed and quality. Experiments show that VocalNet outperforms mainstream Omni LLMs despite using significantly less training data, while also surpassing existing open-source speech LLMs by a substantial margin. To support reproducibility and community advancement, we will open-source all model weights, inference code, training data, and framework implementations upon publication.

Title: Collaboration and Controversy Among Experts: Rumor Early Detection by Tuning a Comment Generator

Authors: Bing Wang, Bingrui Zhao, Ximing Li, Changchun Li, Wanfu Gao, Shengsheng Wang
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2504.04076
Pdf URL: https://arxiv.org/pdf/2504.04076
Copy Paste: [[2504.04076]] Collaboration and Controversy Among Experts: Rumor Early Detection by Tuning a Comment Generator(https://arxiv.org/abs/2504.04076)
Keywords: language model
Abstract: Over the past decade, social media platforms have been key in spreading rumors, leading to significant negative impacts. To counter this, the community has developed various Rumor Detection (RD) algorithms to automatically identify them using user comments as evidence. However, these RD methods often fail in the early stages of rumor propagation when only limited user comments are available, leading the community to focus on a more challenging topic named Rumor Early Detection (RED). Typically, existing RED methods learn from limited semantics in early comments. However, our preliminary experiment reveals that the RED models always perform best when the number of training and test comments is consistent and extensive. This inspires us to address the RED issue by generating more human-like comments to support this hypothesis. To implement this idea, we tune a comment generator by simulating expert collaboration and controversy and propose a new RED framework named CAMERED. Specifically, we integrate a mixture-of-expert structure into a generative language model and present a novel routing network for expert collaboration. Additionally, we synthesize a knowledgeable dataset and design an adversarial learning strategy to align the style of generated comments with real-world comments. We further integrate generated and original comments with a mutual controversy fusion module. Experimental results show that CAMERED outperforms state-of-the-art RED baseline models and generation methods, demonstrating its effectiveness.

Title: A Benchmark for End-to-End Zero-Shot Biomedical Relation Extraction with LLMs: Experiments with OpenAI Models

Authors: Aviv Brokman, Xuguang Ai, Yuhang Jiang, Shashank Gupta, Ramakanth Kavuluru
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.04083
Pdf URL: https://arxiv.org/pdf/2504.04083
Copy Paste: [[2504.04083]] A Benchmark for End-to-End Zero-Shot Biomedical Relation Extraction with LLMs: Experiments with OpenAI Models(https://arxiv.org/abs/2504.04083)
Keywords: language model, gpt, llm, prompt
Abstract: Objective: Zero-shot methodology promises to cut down on costs of dataset annotation and domain expertise needed to make use of NLP. Generative large language models trained to align with human goals have achieved high zero-shot performance across a wide variety of tasks. As of yet, it is unclear how well these models perform on biomedical relation extraction (RE). To address this knowledge gap, we explore patterns in the performance of OpenAI LLMs across a diverse sampling of RE tasks. Methods: We use OpenAI GPT-4-turbo and their reasoning model o1 to conduct end-to-end RE experiments on seven datasets. We use the JSON generation capabilities of GPT models to generate structured output in two ways: (1) by defining an explicit schema describing the structure of relations, and (2) using a setting that infers the structure from the prompt language. Results: Our work is the first to study and compare the performance of GPT-4 and o1 on the end-to-end zero-shot biomedical RE task across a broad array of datasets. We found zero-shot performance to be close to that of fine-tuned methods. The limitations of this approach are that it performs poorly on instances containing many relations and errs on the boundaries of textual mentions. Conclusion: Recent large language models exhibit promising zero-shot capabilities in complex biomedical RE tasks, offering competitive performance with reduced dataset curation and NLP modeling needs at the cost of increased computing, potentially increasing medical community accessibility. Addressing the limitations we identify could further boost reliability. The code, data, and prompts for all our experiments are publicly available: this https URL

Title: Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary

Authors: Michael J Bommarito, Daniel Martin Katz, Jillian Bommarito
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.04131
Pdf URL: https://arxiv.org/pdf/2504.04131
Copy Paste: [[2504.04131]] Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary(https://arxiv.org/abs/2504.04131)
Keywords: retrieval-augmented generation
Abstract: We present NUPunkt and CharBoundary, two sentence boundary detection libraries optimized for high-precision, high-throughput processing of legal text in large-scale applications such as due diligence, e-discovery, and legal research. These libraries address the critical challenges posed by legal documents containing specialized citations, abbreviations, and complex sentence structures that confound general-purpose sentence boundary detectors. Our experimental evaluation on five diverse legal datasets comprising over 25,000 documents and 197,000 annotated sentence boundaries demonstrates that NUPunkt achieves 91.1% precision while processing 10 million characters per second with modest memory requirements (432 MB). CharBoundary models offer balanced and adjustable precision-recall tradeoffs, with the large model achieving the highest F1 score (0.782) among all tested methods. Notably, NUPunkt provides a 29-32% precision improvement over general-purpose tools while maintaining exceptional throughput, processing multi-million document collections in minutes rather than hours. Both libraries run efficiently on standard CPU hardware without requiring specialized accelerators. NUPunkt is implemented in pure Python with zero external dependencies, while CharBoundary relies only on scikit-learn and optional ONNX runtime integration for optimized performance. Both libraries are available under the MIT license, can be installed via PyPI, and can be interactively tested at this https URL. These libraries address critical precision issues in retrieval-augmented generation systems by preserving coherent legal concepts across sentences, where each percentage improvement in precision yields exponentially greater reductions in context fragmentation, creating cascading benefits throughout retrieval pipelines and significantly enhancing downstream reasoning quality.

Title: Cognitive Debiasing Large Language Models for Decision-Making

Authors: Yougang Lyu, Shijie Ren, Yue Feng, Zihan Wang, Zhumin Chen, Zhaochun Ren, Maarten de Rijke
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.04141
Pdf URL: https://arxiv.org/pdf/2504.04141
Copy Paste: [[2504.04141]] Cognitive Debiasing Large Language Models for Decision-Making(https://arxiv.org/abs/2504.04141)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have shown potential in supporting decision-making applications, particularly as personal conversational assistants in the financial, healthcare, and legal domains. While prompt engineering strategies have enhanced the capabilities of LLMs in decision-making, cognitive biases inherent to LLMs present significant challenges. Cognitive biases are systematic patterns of deviation from norms or rationality in decision-making that can lead to the production of inaccurate outputs. Existing cognitive bias mitigation strategies assume that input prompts contain (exactly) one type of cognitive bias and therefore fail to perform well in realistic settings where there may be any number of biases. To fill this gap, we propose a cognitive debiasing approach, called self-debiasing, that enhances the reliability of LLMs by iteratively refining prompts. Our method follows three sequential steps -- bias determination, bias analysis, and cognitive debiasing -- to iteratively mitigate potential cognitive biases in prompts. Experimental results on finance, healthcare, and legal decision-making tasks, using both closed-source and open-source LLMs, demonstrate that the proposed self-debiasing method outperforms both advanced prompt engineering methods and existing cognitive debiasing techniques in average accuracy under no-bias, single-bias, and multi-bias settings.

Title: Reasoning on Multiple Needles In A Haystack

Authors: Yidong Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.04150
Pdf URL: https://arxiv.org/pdf/2504.04150
Copy Paste: [[2504.04150]] Reasoning on Multiple Needles In A Haystack(https://arxiv.org/abs/2504.04150)
Keywords: language model, gpt, llm
Abstract: The Needle In A Haystack (NIAH) task has been widely used to evaluate the long-context question-answering capabilities of Large Language Models (LLMs). However, its reliance on simple retrieval limits its effectiveness. To address this limitation, recent studies have introduced the Multiple Needles In A Haystack Reasoning (MNIAH-R) task, which incorporates supporting documents (multiple needles) of multi-hop reasoning tasks into a distracting context (haystack). Despite this advancement, existing approaches still fail to address the issue of models providing direct answers from internal knowledge, and they do not explain or mitigate the decline in accuracy as context length increases. In this paper, we tackle the memory-based answering problem by filtering out direct-answer questions, and we reveal that performance degradation is primarily driven by the reduction in the length of the thinking process as the input length increases. Building on this insight, we decompose the thinking process into retrieval and reasoning stages and introduce a reflection mechanism for multi-round extension. We also train a model using the generated iterative thinking process, which helps mitigate the performance degradation. Furthermore, we demonstrate the application of this retrieval-reflection capability in mathematical reasoning scenarios, improving GPT-4o's performance on AIME2024.

Title: STEP: Staged Parameter-Efficient Pre-training for Large Language Models

Authors: Kazuki Yano, Takumi Ito, Jun Suzuki
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.04151
Pdf URL: https://arxiv.org/pdf/2504.04151
Copy Paste: [[2504.04151]] STEP: Staged Parameter-Efficient Pre-training for Large Language Models(https://arxiv.org/abs/2504.04151)
Keywords: language model, llm
Abstract: Pre-training large language models (LLMs) faces significant memory challenges due to the large size of model parameters. We introduce STaged parameter-Efficient Pre-training (STEP), which integrates parameter-efficient tuning techniques with model growth. We conduct experiments on pre-training LLMs of various sizes and demonstrate that STEP achieves up to a 53.9% reduction in maximum memory requirements compared to vanilla pre-training while maintaining equivalent performance. Furthermore, we show that the model by STEP performs comparably to vanilla pre-trained models on downstream tasks after instruction tuning.

Title: Rethinking Multilingual Continual Pretraining: Data Mixing for Adapting LLMs Across Languages and Resources

Authors: Zihao Li, Shaoxiong Ji, Hengyu Luo, Jörg Tiedemann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.04152
Pdf URL: https://arxiv.org/pdf/2504.04152
Copy Paste: [[2504.04152]] Rethinking Multilingual Continual Pretraining: Data Mixing for Adapting LLMs Across Languages and Resources(https://arxiv.org/abs/2504.04152)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) exhibit significant disparities in performance across languages, primarily benefiting high-resource languages while marginalizing underrepresented ones. Continual Pretraining (CPT) has emerged as a promising approach to address this imbalance, although the relative effectiveness of monolingual, bilingual, and code-augmented data strategies remains unclear. This study systematically evaluates 36 CPT configurations involving three multilingual base models, across 30+ languages categorized as altruistic, selfish, and stagnant, spanning various resource levels. Our findings reveal three major insights: (1) Bilingual CPT improves multilingual classification but often causes language mixing issues during generation. (2) Including programming code data during CPT consistently enhances multilingual classification accuracy, particularly benefiting low-resource languages, but introduces a trade-off by slightly degrading generation quality. (3) Contrary to prior work, we observe substantial deviations from language classifications according to their impact on cross-lingual transfer: Languages classified as altruistic often negatively affect related languages, selfish languages show conditional and configuration-dependent behavior, and stagnant languages demonstrate surprising adaptability under certain CPT conditions. These nuanced interactions emphasize the complexity of multilingual representation learning, underscoring the importance of systematic studies on generalizable language classification to inform future multilingual CPT strategies.

Title: GlotEval: A Test Suite for Massively Multilingual Evaluation of Large Language Models

Authors: Hengyu Luo, Zihao Li, Joseph Attieh, Sawal Devkota, Ona de Gibert, Shaoxiong Ji, Peiqin Lin, Bhavani Sai Praneeth Varma Mantina, Ananda Sreenidhi, Raúl Vázquez, Mengjie Wang, Samea Yusofi, Jörg Tiedemann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.04155
Pdf URL: https://arxiv.org/pdf/2504.04155
Copy Paste: [[2504.04155]] GlotEval: A Test Suite for Massively Multilingual Evaluation of Large Language Models(https://arxiv.org/abs/2504.04155)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are advancing at an unprecedented pace globally, with regions increasingly adopting these models for applications in their primary language. Evaluation of these models in diverse linguistic environments, especially in low-resource languages, has become a major challenge for academia and industry. Existing evaluation frameworks are disproportionately focused on English and a handful of high-resource languages, thereby overlooking the realistic performance of LLMs in multilingual and lower-resource scenarios. To address this gap, we introduce GlotEval, a lightweight framework designed for massively multilingual evaluation. Supporting seven key tasks (machine translation, text classification, summarization, open-ended generation, reading comprehension, sequence labeling, and intrinsic evaluation) and spanning dozens to hundreds of languages, GlotEval highlights consistent multilingual benchmarking, language-specific prompt templates, and non-English-centric machine translation. This enables a precise diagnosis of model strengths and weaknesses in diverse linguistic contexts. A multilingual translation case study demonstrates GlotEval's applicability for multilingual and language-specific evaluations.

Title: Adaptive Elicitation of Latent Information Using Natural Language

Authors: Jimmy Wang, Thomas Zollo, Richard Zemel, Hongseok Namkoong
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.04204
Pdf URL: https://arxiv.org/pdf/2504.04204
Copy Paste: [[2504.04204]] Adaptive Elicitation of Latent Information Using Natural Language(https://arxiv.org/abs/2504.04204)
Keywords: language model, llm
Abstract: Eliciting information to reduce uncertainty about a latent entity is a critical task in many application domains, e.g., assessing individual student learning outcomes, diagnosing underlying diseases, or learning user preferences. Though natural language is a powerful medium for this purpose, large language models (LLMs) and existing fine-tuning algorithms lack mechanisms for strategically gathering information to refine their own understanding of the latent entity. To harness the generalization power and world knowledge of LLMs in developing effective information-gathering strategies, we propose an adaptive elicitation framework that actively reduces uncertainty on the latent entity. Since probabilistic modeling of an abstract latent entity is difficult, our framework adopts a predictive view of uncertainty, using a meta-learned language model to simulate future observations and enable scalable uncertainty quantification over complex natural language. Through autoregressive forward simulation, our model quantifies how new questions reduce epistemic uncertainty, enabling the development of sophisticated information-gathering strategies to choose the most informative next queries. In experiments on the 20 questions game, dynamic opinion polling, and adaptive student assessment, our method consistently outperforms baselines in identifying critical unknowns and improving downstream predictions, illustrating the promise of strategic information gathering in natural language settings.

Title: Towards Understanding and Improving Refusal in Compressed Models via Mechanistic Interpretability

Authors: Vishnu Kabir Chhabra, Mohammad Mahdi Khalili
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.04215
Pdf URL: https://arxiv.org/pdf/2504.04215
Copy Paste: [[2504.04215]] Towards Understanding and Improving Refusal in Compressed Models via Mechanistic Interpretability(https://arxiv.org/abs/2504.04215)
Keywords: language model
Abstract: The rapid growth of large language models has spurred significant interest in model compression as a means to enhance their accessibility and practicality. While extensive research has explored model compression through the lens of safety, findings suggest that safety-aligned models often lose elements of trustworthiness post-compression. Simultaneously, the field of mechanistic interpretability has gained traction, with notable discoveries, such as the identification of a single direction in the residual stream mediating refusal behaviors across diverse model architectures. In this work, we investigate the safety of compressed models by examining the mechanisms of refusal, adopting a novel interpretability-driven perspective to evaluate model safety. Furthermore, leveraging insights from our interpretability analysis, we propose a lightweight, computationally efficient method to enhance the safety of compressed models without compromising their performance or utility.

Title: A Perplexity and Menger Curvature-Based Approach for Similarity Evaluation of Large Language Models

Authors: Yuantao Zhang, Zhankui Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.04216
Pdf URL: https://arxiv.org/pdf/2504.04216
Copy Paste: [[2504.04216]] A Perplexity and Menger Curvature-Based Approach for Similarity Evaluation of Large Language Models(https://arxiv.org/abs/2504.04216)
Keywords: language model, llm
Abstract: The rise of Large Language Models (LLMs) has brought about concerns regarding copyright infringement and unethical practices in data and model usage. For instance, slight modifications to existing LLMs may be used to falsely claim the development of new models, leading to issues of model copying and violations of ownership rights. This paper addresses these challenges by introducing a novel metric for quantifying LLM similarity, which leverages perplexity curves and differences in Menger curvature. Comprehensive experiments validate the performance of our methodology, demonstrating its superiority over baseline methods and its ability to generalize across diverse models and domains. Furthermore, we highlight the capability of our approach in detecting model replication through simulations, emphasizing its potential to preserve the originality and integrity of LLMs. Code is available at this https URL.
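For readers unfamiliar with the term, Menger curvature of three points is the reciprocal of the radius of the circle through them, equal to 4 times the triangle area divided by the product of the three side lengths. The sketch below computes it along a sampled curve. Treating a perplexity curve as the points (index, value) and the helper name `curve_curvatures` are assumptions made for illustration; the abstract does not specify how the paper builds or compares its curves.

```python
import math

def menger_curvature(p, q, r):
    """Menger curvature of three 2-D points: 4 * area / (|pq| * |qr| * |pr|)."""
    (x1, y1), (x2, y2), (x3, y3) = p, q, r
    # Twice the (unsigned) triangle area, from the cross product.
    twice_area = abs((x2 - x1) * (y3 - y1) - (y2 - y1) * (x3 - x1))
    sides = math.dist(p, q) * math.dist(q, r) * math.dist(p, r)
    if sides == 0.0:
        return 0.0  # Degenerate triple with coincident points.
    return 2.0 * twice_area / sides

def curve_curvatures(values):
    """Curvature at each interior sample of a curve given as y-values
    at x = 0, 1, 2, ... (hypothetical representation of a perplexity curve)."""
    pts = [(float(i), float(v)) for i, v in enumerate(values)]
    return [menger_curvature(pts[i - 1], pts[i], pts[i + 1])
            for i in range(1, len(pts) - 1)]

# Three points on the unit circle have curvature 1 (radius 1).
unit = menger_curvature((1.0, 0.0), (0.0, 1.0), (-1.0, 0.0))
# Collinear samples have curvature 0.
flat = curve_curvatures([1.0, 2.0, 3.0])
```

Two models could then be compared by some distance between their curvature sequences on the same inputs; which distance the paper uses is not stated in the abstract.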

Title: Sensitivity Meets Sparsity: The Impact of Extremely Sparse Parameter Patterns on Theory-of-Mind of Large Language Models

Authors: Yuheng Wu, Wentao Guo, Zirui Liu, Heng Ji, Zhaozhuo Xu, Denghui Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.04238
Pdf URL: https://arxiv.org/pdf/2504.04238
Copy Paste: [[2504.04238]] Sensitivity Meets Sparsity: The Impact of Extremely Sparse Parameter Patterns on Theory-of-Mind of Large Language Models(https://arxiv.org/abs/2504.04238)
Keywords: language model, llm
Abstract: This paper investigates the emergence of Theory-of-Mind (ToM) capabilities in large language models (LLMs) from a mechanistic perspective, focusing on the role of extremely sparse parameter patterns. We introduce a novel method to identify ToM-sensitive parameters and reveal that perturbing as little as 0.001% of these parameters significantly degrades ToM performance while also impairing contextual localization and language understanding. To understand this effect, we analyze their interaction with core architectural components of LLMs. Our findings demonstrate that these sensitive parameters are closely linked to the positional encoding module, particularly in models using Rotary Position Embedding (RoPE), where perturbations disrupt dominant-frequency activations critical for contextual processing. Furthermore, we show that perturbing ToM-sensitive parameters affects the LLM's attention mechanism by modulating the angle between queries and keys under positional encoding. These insights provide a deeper understanding of how LLMs acquire social reasoning abilities, bridging AI interpretability with cognitive science. Our results have implications for enhancing model alignment, mitigating biases, and improving AI systems designed for human interaction.
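As background for the query-key angle mentioned in this abstract, the following sketch applies a standard rotary position embedding to a query and a key and measures the angle between them. It is a generic illustration of RoPE, not the paper's analysis code; the function names `rope` and `angle_between`, the base of 10000, and the random vectors are assumptions. It shows the well-known property that the angle depends only on the relative offset between positions.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs of x by position-dependent angles (x must have even length)."""
    x = np.asarray(x, dtype=float)
    d = x.size
    freqs = base ** (-np.arange(d // 2) * 2.0 / d)  # One frequency per pair.
    ang = pos * freqs
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * np.cos(ang) - x_odd * np.sin(ang)
    out[1::2] = x_even * np.sin(ang) + x_odd * np.cos(ang)
    return out

def angle_between(u, v):
    """Angle in radians between two vectors."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same relative offset (2) at different absolute positions.
a_near = angle_between(rope(q, 5), rope(k, 3))
a_far = angle_between(rope(q, 12), rope(k, 10))
```

Because the rotation preserves norms, any change to the query or key weights that alters this angle directly changes the attention logit for that pair.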

Title: Lost in Multilinguality: Dissecting Cross-lingual Factual Inconsistency in Transformer Language Models

Authors: Mingyang Wang, Heike Adel, Lukas Lange, Yihong Liu, Ercong Nie, Jannik Strötgen, Hinrich Schütze
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.04264
Pdf URL: https://arxiv.org/pdf/2504.04264
Copy Paste: [[2504.04264]] Lost in Multilinguality: Dissecting Cross-lingual Factual Inconsistency in Transformer Language Models(https://arxiv.org/abs/2504.04264)
Keywords: language model, prompt
Abstract: Multilingual language models (MLMs) store factual knowledge across languages but often struggle to provide consistent responses to semantically equivalent prompts in different languages. While previous studies point out this cross-lingual inconsistency issue, the underlying causes remain unexplored. In this work, we use mechanistic interpretability methods to investigate cross-lingual inconsistencies in MLMs. We find that MLMs encode knowledge in a language-independent concept space through most layers, and only transition to language-specific spaces in the final layers. Failures during the language transition often result in incorrect predictions in the target language, even when the answers are correct in other languages. To mitigate this inconsistency issue, we propose a linear shortcut method that bypasses computations in the final layers, enhancing both prediction accuracy and cross-lingual consistency. Our findings shed light on the internal mechanisms of MLMs and provide a lightweight, effective strategy for producing more consistent factual outputs.

Title: Could AI Trace and Explain the Origins of AI-Generated Images and Text?

Authors: Hongchao Fang, Can Qin, Ran Xu, Feng Liu, Yixin Liu, Lichao Sun, Dongwon Lee, Lifu Huang, Wenpeng Yin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.04279
Pdf URL: https://arxiv.org/pdf/2504.04279
Copy Paste: [[2504.04279]] Could AI Trace and Explain the Origins of AI-Generated Images and Text?(https://arxiv.org/abs/2504.04279)
Keywords: language model, gpt, llm
Abstract: AI-generated content is becoming increasingly prevalent in the real world, leading to serious ethical and societal concerns. For instance, adversaries might exploit large multimodal models (LMMs) to create images that violate ethical or legal standards, while paper reviewers may misuse large language models (LLMs) to generate reviews without genuine intellectual effort. While prior work has explored detecting AI-generated images and texts, and occasionally tracing their source models, there is a lack of a systematic and fine-grained comparative study. Important dimensions--such as AI-generated images vs. text, fully vs. partially AI-generated images, and general vs. malicious use cases--remain underexplored. Furthermore, whether AI systems like GPT-4o can explain why certain forged content is attributed to specific generative models is still an open question, with no existing benchmark addressing this. To fill this gap, we introduce AI-FAKER, a comprehensive multimodal dataset with over 280,000 samples spanning multiple LLMs and LMMs, covering both general and malicious use cases for AI-generated images and texts. Our experiments reveal two key findings: (i) AI authorship detection depends not only on the generated output but also on the model's original training intent; and (ii) GPT-4o provides highly consistent but less specific explanations when analyzing content produced by OpenAI's own models, such as DALL-E and GPT-4o itself.

Title: Cross-Asset Risk Management: Integrating LLMs for Real-Time Monitoring of Equity, Fixed Income, and Currency Markets

Authors: Jie Yang, Yiqiu Tang, Yongjie Li, Lihua Zhang, Haoran Zhang
Subjects: cs.CL, cs.CE
Abstract URL: https://arxiv.org/abs/2504.04292
Pdf URL: https://arxiv.org/pdf/2504.04292
Copy Paste: [[2504.04292]] Cross-Asset Risk Management: Integrating LLMs for Real-Time Monitoring of Equity, Fixed Income, and Currency Markets(https://arxiv.org/abs/2504.04292)
Keywords: language model, llm
Abstract: Large language models (LLMs) have emerged as powerful tools in the field of finance, particularly for risk management across different asset classes. In this work, we introduce a Cross-Asset Risk Management framework that utilizes LLMs to facilitate real-time monitoring of equity, fixed income, and currency markets. This innovative approach enables dynamic risk assessment by aggregating diverse data sources, ultimately enhancing decision-making processes. Our model effectively synthesizes and analyzes market signals to identify potential risks and opportunities while providing a holistic view of asset classes. By employing advanced analytics, we leverage LLMs to interpret financial texts, news articles, and market reports, ensuring that risks are contextualized within broader market narratives. Extensive backtesting and real-time simulations validate the framework, showing increased accuracy in predicting market shifts compared to conventional methods. The focus on real-time data integration enhances responsiveness, allowing financial institutions to manage risks adeptly under varying market conditions and promoting financial stability through the advanced application of LLMs in risk analysis.

Title: Dynamic Hedging Strategies in Derivatives Markets with LLM-Driven Sentiment and News Analytics

Authors: Jie Yang, Yiqiu Tang, Yongjie Li, Lihua Zhang, Haoran Zhang
Subjects: cs.CL, cs.CE
Abstract URL: https://arxiv.org/abs/2504.04295
Pdf URL: https://arxiv.org/pdf/2504.04295
Copy Paste: [[2504.04295]] Dynamic Hedging Strategies in Derivatives Markets with LLM-Driven Sentiment and News Analytics(https://arxiv.org/abs/2504.04295)
Keywords: language model, llm
Abstract: Dynamic hedging strategies are essential for effective risk management in derivatives markets, where volatility and market sentiment can greatly impact performance. This paper introduces a novel framework that leverages large language models (LLMs) for sentiment analysis and news analytics to inform hedging decisions. By analyzing textual data from diverse sources like news articles, social media, and financial reports, our approach captures critical sentiment indicators that reflect current market conditions. The framework allows for real-time adjustments to hedging strategies, adapting positions based on continuous sentiment signals. Backtesting results on historical derivatives data reveal that our dynamic hedging strategies achieve superior risk-adjusted returns compared to conventional static approaches. The incorporation of LLM-driven sentiment analysis into hedging practices presents a significant advancement in decision-making processes within derivatives trading. This research showcases how sentiment-informed dynamic hedging can enhance portfolio management and effectively mitigate associated risks.

Title: CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization

Authors: Weiwei Sun, Shengyu Feng, Shanda Li, Yiming Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.04310
Pdf URL: https://arxiv.org/pdf/2504.04310
Copy Paste: [[2504.04310]] CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization(https://arxiv.org/abs/2504.04310)
Keywords: language model, llm, agent
Abstract: Although LLM-based agents have attracted significant attention in domains such as software engineering and machine learning research, their role in advancing combinatorial optimization (CO) remains relatively underexplored. This gap underscores the need for a deeper understanding of their potential in tackling structured, constraint-intensive problems, a pursuit currently limited by the absence of comprehensive benchmarks for systematic investigation. To address this, we introduce CO-Bench, a benchmark suite featuring 36 real-world CO problems drawn from a broad range of domains and complexity levels. CO-Bench includes structured problem formulations and curated data to support rigorous investigation of LLM agents. We evaluate multiple agent frameworks against established human-designed algorithms, revealing key strengths and limitations of current approaches and identifying promising directions for future research. CO-Bench is publicly available at this https URL.

Title: Balancing Complexity and Informativeness in LLM-Based Clustering: Finding the Goldilocks Zone

Authors: Justin Miller, Tristram Alexander
Subjects: cs.CL, cs.AI, math.ST
Abstract URL: https://arxiv.org/abs/2504.04314
Pdf URL: https://arxiv.org/pdf/2504.04314
Copy Paste: [[2504.04314]] Balancing Complexity and Informativeness in LLM-Based Clustering: Finding the Goldilocks Zone(https://arxiv.org/abs/2504.04314)
Keywords: language model, llm
Abstract: The challenge of clustering short text data lies in balancing informativeness with interpretability. Traditional evaluation metrics often overlook this trade-off. Inspired by linguistic principles of communicative efficiency, this paper investigates the optimal number of clusters by quantifying the trade-off between informativeness and cognitive simplicity. We use large language models (LLMs) to generate cluster names and evaluate their effectiveness through semantic density, information theory, and clustering accuracy. Our results show that Gaussian Mixture Model (GMM) clustering on embeddings generated by an LLM increases semantic density compared to random assignment, effectively grouping similar bios. However, as the number of clusters increases, interpretability declines, as measured by a generative LLM's ability to correctly assign bios based on cluster names. A logistic regression analysis confirms that classification accuracy depends on the semantic similarity between bios and their assigned cluster names, as well as their distinction from alternatives. These findings reveal a "Goldilocks zone" where clusters remain distinct yet interpretable. We identify an optimal range of 16-22 clusters, paralleling linguistic efficiency in lexical categorization. These insights inform both theoretical models and practical applications, guiding future research toward optimising cluster interpretability and usefulness.

Title: IMPersona: Evaluating Individual Level LM Impersonation

Authors: Quan Shi, Carlos Jimenez, Stephen Dong, Brian Seo, Caden Yao, Adam Kelch, Karthik Narasimhan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.04332
Pdf URL: https://arxiv.org/pdf/2504.04332
Copy Paste: [[2504.04332]] IMPersona: Evaluating Individual Level LM Impersonation(https://arxiv.org/abs/2504.04332)
Keywords: language model, prompt
Abstract: As language models achieve increasingly human-like capabilities in conversational text generation, a critical question emerges: to what extent can these systems simulate the characteristics of specific individuals? To evaluate this, we introduce IMPersona, a framework for evaluating LMs at impersonating specific individuals' writing style and personal knowledge. Using supervised fine-tuning and a hierarchical memory-inspired retrieval system, we demonstrate that even modestly sized open-source models, such as Llama-3.1-8B-Instruct, can achieve impersonation abilities at concerning levels. In blind conversation experiments, participants (mis)identified our fine-tuned models with memory integration as human in 44.44% of interactions, compared to just 25.00% for the best prompting-based approach. We analyze these results to propose detection methods and defense strategies against such impersonation attempts. Our findings raise important questions about both the potential applications and risks of personalized language models, particularly regarding privacy, security, and the ethical deployment of such technologies in real-world contexts.

Title: Hallucination Detection using Multi-View Attention Features

Authors: Yuya Ogasa, Yuki Arase
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.04335
Pdf URL: https://arxiv.org/pdf/2504.04335
Copy Paste: [[2504.04335]] Hallucination Detection using Multi-View Attention Features(https://arxiv.org/abs/2504.04335)
Keywords: language model, hallucination
Abstract: This study tackles token-level hallucination detection in outputs of large language models. Previous studies revealed that attention exhibits irregular patterns when hallucination occurs. Inspired by this, we extract features from the attention matrix that provide complementary views of (a) the average attention each token receives, which helps identify whether certain tokens are overly influential or ignored, (b) the diversity of attention each token receives, which reveals whether attention is biased toward specific subsets, and (c) the diversity of tokens a token attends to during generation, which indicates whether the model references a narrow or broad range of information. These features are input to a Transformer-based classifier to conduct token-level classification to identify hallucinated spans. Experimental results indicate that the proposed method outperforms strong baselines on hallucination detection with longer input contexts, i.e., data-to-text and summarization tasks.
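Illustrative note (not taken from the paper): the three attention views listed in the abstract can be sketched on a toy attention matrix. Shannon entropy is used here as the diversity measure, which is an assumption; the paper may define diversity differently. The names `entropy` and `attention_views` and the example matrix are hypothetical.

```python
import math


def entropy(weights):
    """Shannon entropy (in nats) of non-negative weights, normalized to sum to 1."""
    total = sum(weights)
    if total == 0:
        return 0.0
    return -sum((w / total) * math.log(w / total) for w in weights if w > 0)


def attention_views(attn):
    """Compute three per-token views of a square attention matrix.

    attn[i][j] is the attention weight from token i (query) to token j (key).
    """
    n = len(attn)
    columns = [[attn[i][j] for i in range(n)] for j in range(n)]
    return {
        # (a) average attention each token receives
        "avg_received": [sum(col) / n for col in columns],
        # (b) diversity of the attention each token receives
        "received_entropy": [entropy(col) for col in columns],
        # (c) diversity of the tokens each token attends to
        "outgoing_entropy": [entropy(row) for row in attn],
    }


# Toy causal attention matrix for three tokens; each row sums to 1.
attn = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.2, 0.3, 0.5],
]
views = attention_views(attn)
```

In the paper these per-token features feed a Transformer-based classifier; the sketch stops at feature extraction.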

Title: Generative Large Language Models Trained for Detecting Errors in Radiology Reports

Authors: Cong Sun, Kurt Teichman, Yiliang Zhou, Brian Critelli, David Nauheim, Graham Keir, Xindi Wang, Judy Zhong, Adam E Flanders, George Shih, Yifan Peng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.04336
Pdf URL: https://arxiv.org/pdf/2504.04336
Copy Paste: [[2504.04336]] Generative Large Language Models Trained for Detecting Errors in Radiology Reports(https://arxiv.org/abs/2504.04336)
Keywords: language model, gpt, llm, prompt
Abstract: In this retrospective study, a dataset was constructed with two parts. The first part included 1,656 synthetic chest radiology reports generated by GPT-4 using specified prompts, with 828 being error-free synthetic reports and 828 containing errors. The second part included 614 reports: 307 error-free reports between 2011 and 2016 from the MIMIC-CXR database and 307 corresponding synthetic reports with errors generated by GPT-4 on the basis of these MIMIC-CXR reports and specified prompts. All errors were categorized into four types: negation, left/right, interval change, and transcription errors. Then, several models, including Llama-3, GPT-4, and BiomedBERT, were refined using zero-shot prompting, few-shot prompting, or fine-tuning strategies. Finally, the performance of these models was evaluated using the F1 score, 95% confidence interval (CI) and paired-sample t-tests on our constructed dataset, with the prediction results further assessed by radiologists. Using zero-shot prompting, the fine-tuned Llama-3-70B-Instruct model achieved the best performance with the following F1 scores: 0.769 for negation errors, 0.772 for left/right errors, 0.750 for interval change errors, 0.828 for transcription errors, and 0.780 overall. In the real-world evaluation phase, two radiologists reviewed 200 randomly selected reports output by the model. Of these, both radiologists confirmed that 99 contained errors detected by the model, and at least one radiologist confirmed model-detected errors in 163. Generative LLMs, fine-tuned on synthetic and MIMIC-CXR radiology reports, greatly enhanced error detection in radiology reports.

Title: Compression Laws for Large Language Models

Authors: Ayan Sengupta, Siddhant Chaudhary, Tanmoy Chakraborty
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.04342
Pdf URL: https://arxiv.org/pdf/2504.04342
Copy Paste: [[2504.04342]] Compression Laws for Large Language Models(https://arxiv.org/abs/2504.04342)
Keywords: language model, llm
Abstract: We introduce compression laws for large language models (LLMs). While recent scaling laws have sought to understand how LLMs scale with respect to model size, pre-training data, and computational resources, we focus on understanding how model compression affects the performance of a pre-trained LLM on downstream tasks. We empirically examine the effects of structured model compression on LLMs through over 1000 experiments across eight models with sizes ranging from 0.5B to 14B parameters. Our findings indicate that the test cross-entropy loss increases quadratically with the compression ratio, whereas performance on downstream tasks declines only linearly. Our study emphasizes the importance of recovery fine-tuning in improving generation loss, showing that the test loss of compressed LLMs can improve by up to 55% with recovery fine-tuning. At higher compression ratios (up to 90%), compressed LLMs demonstrate a speed increase of 60% during inference compared to their uncompressed counterparts, compensating for the performance degradation at this level. However, for smaller models (≤ 7B), the computational gains are limited, peaking at just 35%. We conclude that model compression can be highly beneficial for larger models, especially when a smaller model within the same computational budget is not available. These insights provide practical guidelines for utilizing model compression techniques when adopting LLMs in real-life applications in resource-constrained settings.

Title: StyleRec: A Benchmark Dataset for Prompt Recovery in Writing Style Transformation

Authors: Shenyang Liu, Yang Gao, Shaoyan Zhai, Liqiang Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.04373
Pdf URL: https://arxiv.org/pdf/2504.04373
Copy Paste: [[2504.04373]] StyleRec: A Benchmark Dataset for Prompt Recovery in Writing Style Transformation(https://arxiv.org/abs/2504.04373)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Prompt Recovery, reconstructing prompts from the outputs of large language models (LLMs), has grown in importance as LLMs become ubiquitous. Most users access LLMs through APIs without internal model weights, relying only on outputs and logits, which complicates recovery. This paper explores a unique prompt recovery task focused on reconstructing prompts for style transfer and rephrasing, rather than typical question-answering. We introduce a dataset created with LLM assistance, ensuring quality through multiple techniques, and test methods like zero-shot, few-shot, jailbreak, chain-of-thought, fine-tuning, and a novel canonical-prompt fallback for poor-performing cases. Our results show that one-shot and fine-tuning yield the best outcomes but highlight flaws in traditional sentence similarity metrics for evaluating prompt recovery. Contributions include (1) a benchmark dataset, (2) comprehensive experiments on prompt recovery strategies, and (3) identification of limitations in current evaluation metrics, all of which advance general prompt recovery research, where the structure of the input prompt is unrestricted.

Title: PolyGuard: A Multilingual Safety Moderation Tool for 17 Languages

Authors: Priyanshu Kumar, Devansh Jain, Akhila Yerukola, Liwei Jiang, Himanshu Beniwal, Thomas Hartvigsen, Maarten Sap
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.04377
Pdf URL: https://arxiv.org/pdf/2504.04377
Copy Paste: [[2504.04377]] PolyGuard: A Multilingual Safety Moderation Tool for 17 Languages(https://arxiv.org/abs/2504.04377)
Keywords: language model, llm, prompt
Abstract: Truly multilingual safety moderation efforts for Large Language Models (LLMs) have been hindered by a narrow focus on a small set of languages (e.g., English, Chinese) as well as a limited scope of safety definition, resulting in significant gaps in moderation capabilities. To bridge these gaps, we release POLYGUARD, a new state-of-the-art multilingual safety model for safeguarding LLM generations, and the corresponding training and evaluation datasets. POLYGUARD is trained on POLYGUARDMIX, the largest multilingual safety training corpus to date containing 1.91M samples across 17 languages (e.g., Chinese, Czech, English, Hindi). We also introduce POLYGUARDPROMPTS, a high quality multilingual benchmark with 29K samples for the evaluation of safety guardrails. Created by combining naturally occurring multilingual human-LLM interactions and human-verified machine translations of an English-only safety dataset (WildGuardMix; Han et al., 2024), our datasets contain prompt-output pairs with labels of prompt harmfulness, response harmfulness, and response refusal. Through extensive evaluations across multiple safety and toxicity benchmarks, we demonstrate that POLYGUARD outperforms existing state-of-the-art open-weight and commercial safety classifiers by 5.5%. Our contributions advance efforts toward safer multilingual LLMs for all global users.

Title: Pre-trained Language Models and Few-shot Learning for Medical Entity Extraction

Authors: Xiaokai Wang, Guiran Liu, Binrong Zhu, Jacky He, Hongye Zheng, Hanlu Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.04385
Pdf URL: https://arxiv.org/pdf/2504.04385
Copy Paste: [[2504.04385]] Pre-trained Language Models and Few-shot Learning for Medical Entity Extraction(https://arxiv.org/abs/2504.04385)
Keywords: language model
Abstract: This study proposes a medical entity extraction method based on Transformer to enhance the information extraction capability of medical literature. Considering the professionalism and complexity of medical texts, we compare the performance of different pre-trained language models (BERT, BioBERT, PubMedBERT, ClinicalBERT) in medical entity extraction tasks. Experimental results show that PubMedBERT achieves the best performance (F1-score = 88.8%), indicating that a language model pre-trained on biomedical literature is more effective in the medical domain. In addition, we analyze the impact of different entity extraction methods (CRF, Span-based, Seq2Seq) and find that the Span-based approach performs best in medical entity extraction tasks (F1-score = 88.6%). It demonstrates superior accuracy in identifying entity boundaries. In low-resource scenarios, we further explore the application of Few-shot Learning in medical entity extraction. Experimental results show that even with only 10-shot training samples, the model achieves an F1-score of 79.1%, verifying the effectiveness of Few-shot Learning under limited data conditions. This study confirms that the combination of pre-trained language models and Few-shot Learning can enhance the accuracy of medical entity extraction. Future research can integrate knowledge graphs and active learning strategies to improve the model's generalization and stability, providing a more effective solution for medical NLP research. Keywords: Natural Language Processing, medical named entity recognition, pre-trained language model, Few-shot Learning, information extraction, deep learning

Title: An overview of model uncertainty and variability in LLM-based sentiment analysis. Challenges, mitigation strategies and the role of explainability

Authors: David Herrera-Poyatos, Carlos Peláez-González, Cristina Zuheros, Andrés Herrera-Poyatos, Virilo Tejedor, Francisco Herrera, Rosana Montes
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.04462
Pdf URL: https://arxiv.org/pdf/2504.04462
Copy Paste: [[2504.04462]] An overview of model uncertainty and variability in LLM-based sentiment analysis. Challenges, mitigation strategies and the role of explainability(https://arxiv.org/abs/2504.04462)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have significantly advanced sentiment analysis, yet their inherent uncertainty and variability pose critical challenges to achieving reliable and consistent outcomes. This paper systematically explores the Model Variability Problem (MVP) in LLM-based sentiment analysis, characterized by inconsistent sentiment classification, polarization, and uncertainty arising from stochastic inference mechanisms, prompt sensitivity, and biases in training data. We analyze the core causes of MVP, presenting illustrative examples and a case study to highlight its impact. In addition, we investigate key challenges and mitigation strategies, paying particular attention to the role of temperature as a driver of output randomness and emphasizing the crucial role of explainability in improving transparency and user trust. By providing a structured perspective on stability, reproducibility, and trustworthiness, this study helps develop more reliable, explainable, and robust sentiment analysis models, facilitating their deployment in high-stakes domains such as finance, healthcare, and policymaking, among others.

Title: Saliency-driven Dynamic Token Pruning for Large Language Models

Authors: Yao Tao, Yehui Tang, Yun Wang, Mingjian Zhu, Hailin Hu, Yunhe Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.04514
Pdf URL: https://arxiv.org/pdf/2504.04514
Copy Paste: [[2504.04514]] Saliency-driven Dynamic Token Pruning for Large Language Models(https://arxiv.org/abs/2504.04514)
Keywords: language model, llm
Abstract: Despite the recent success of large language models (LLMs), LLMs remain particularly challenging to run in long-sequence inference scenarios due to the quadratic computational complexity of the attention mechanism. Inspired by the interpretability theory of feature attribution in neural network models, we observe that not all tokens have the same contribution. Based on this observation, we propose a novel token pruning framework, namely Saliency-driven Dynamic Token Pruning (SDTP), to gradually and dynamically prune redundant tokens based on the input context. Specifically, a lightweight saliency-driven prediction module is designed to estimate the importance score of each token from its hidden state; this module is added to different layers of the LLM to hierarchically prune redundant tokens. Furthermore, a ranking-based optimization strategy is proposed to minimize the ranking divergence between the saliency score and the predicted importance score. Extensive experiments have shown that our framework is generalizable to various models and datasets. By hierarchically pruning 65% of the input tokens, our method reduces FLOPs by 33% to 47% and achieves a speedup of up to 1.75× during inference, while maintaining comparable performance. We further demonstrate that SDTP can be combined with KV cache compression methods for further compression.
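Illustrative note (not taken from the paper): hierarchical token pruning can be sketched as repeated top-k selection over per-token importance scores. In SDTP the scores come from a learned prediction module applied to hidden states at several layers; in this simplified sketch the scores are fixed inputs and reused at every stage, which is an assumption. The names `prune_tokens` and `hierarchical_prune` and all numbers are hypothetical.

```python
def prune_tokens(token_ids, scores, keep_ratio):
    """Keep the highest-scoring tokens while preserving their original order."""
    if len(token_ids) != len(scores):
        raise ValueError("token_ids and scores must have the same length")
    keep = max(1, round(len(token_ids) * keep_ratio))
    # Rank positions by score, highest first, then restore sequence order.
    ranked = sorted(range(len(token_ids)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])
    return [token_ids[i] for i in kept], [scores[i] for i in kept]


def hierarchical_prune(token_ids, scores, stage_keep_ratios):
    """Apply pruning once per stage, so later stages see fewer tokens."""
    for ratio in stage_keep_ratios:
        token_ids, scores = prune_tokens(token_ids, scores, ratio)
    return token_ids


# Ten hypothetical tokens with made-up importance scores.
tokens = list(range(10))
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.05, 0.6, 0.4, 0.5]

# Two stages: keep 60% of the tokens, then 50% of the survivors.
survivors = hierarchical_prune(tokens, scores, [0.6, 0.5])
```

With these inputs, 3 of the 10 tokens survive both stages, so later layers attend over a much shorter sequence.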
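To make the pruning idea in the SDTP abstract concrete, here is a minimal sketch of score-based hierarchical token pruning. It is not the authors' implementation: the function names, the per-stage keep ratios, and the reuse of one fixed score list across stages are all assumptions made for illustration. In the paper the importance scores are predicted from hidden states by a learned module at each pruning layer.

```python
import math


def keep_indices(scores, keep_ratio):
    """Return positions of the highest-scoring items, in original order."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, math.ceil(len(scores) * keep_ratio))
    # Rank by descending score; ties go to the earlier position.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:n_keep])


def prune_hierarchically(tokens, scores, stage_ratios):
    """Apply several pruning stages in sequence (hypothetical helper).

    Assumption: the same scores are reused at every stage, whereas a
    learned predictor would re-score the surviving tokens at each layer.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    alive = list(range(len(tokens)))
    for ratio in stage_ratios:
        local = keep_indices([scores[i] for i in alive], ratio)
        alive = [alive[j] for j in local]
    return [tokens[i] for i in alive]


tokens = ["the", "cat", "sat", "on", "the", "mat", "today", "."]
scores = [0.1, 0.9, 0.8, 0.2, 0.1, 0.7, 0.3, 0.05]
# Two stages keeping 75% and then 50% leave 3 of the 8 tokens.
pruned = prune_hierarchically(tokens, scores, [0.75, 0.5])
```

Keeping the survivors in their original order matters because later layers still rely on the relative positions of the tokens that remain.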

Title: An Empirical Comparison of Text Summarization: A Multi-Dimensional Evaluation of Large Language Models

Authors: Anantharaman Janakiraman, Behnaz Ghoraani
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.04534
Pdf URL: https://arxiv.org/pdf/2504.04534
Copy Paste: [[2504.04534]] An Empirical Comparison of Text Summarization: A Multi-Dimensional Evaluation of Large Language Models(https://arxiv.org/abs/2504.04534)
Keywords: language model
Abstract: Text summarization is crucial for mitigating information overload across domains like journalism, medicine, and business. This research evaluates summarization performance across 17 large language models (OpenAI, Google, Anthropic, open-source) using a novel multi-dimensional framework. We assessed models on seven diverse datasets (BigPatent, BillSum, CNN/DailyMail, PubMed, SAMSum, WikiHow, XSum) at three output lengths (50, 100, 150 tokens) using metrics for factual consistency, semantic similarity, lexical overlap, and human-like quality, while also considering efficiency factors. Our findings reveal significant performance differences, with specific models excelling in factual accuracy (deepseek-v3), human-like quality (claude-3-5-sonnet), and processing efficiency/cost-effectiveness (gemini-1.5-flash, gemini-2.0-flash). Performance varies dramatically by dataset, with models struggling on technical domains but performing well on conversational content. We identified a critical tension between factual consistency (best at 50 tokens) and perceived quality (best at 150 tokens). Our analysis provides evidence-based recommendations for different use cases, from high-stakes applications requiring factual accuracy to resource-constrained environments needing efficient processing. This comprehensive approach enhances evaluation methodology by integrating quality metrics with operational considerations, incorporating trade-offs between accuracy, efficiency, and cost-effectiveness to guide model selection for specific applications.

Title: KnowsLM: A framework for evaluation of small language models for knowledge augmentation and humanised conversations

Authors: Chitranshu Harbola, Anupam Purwar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.04569
Pdf URL: https://arxiv.org/pdf/2504.04569
Copy Paste: [[2504.04569]] KnowsLM: A framework for evaluation of small language models for knowledge augmentation and humanised conversations(https://arxiv.org/abs/2504.04569)
Keywords: language model, llm, prompt
Abstract: In the evolving landscape of conversational AI, generating concise, context-aware, and human-like dialogue using small and medium-sized language models (LLMs) remains a complex challenge. This study investigates the influence of LoRA rank, dataset scale, and prompt prefix design on both knowledge retention and stylistic alignment. While fine-tuning improves fluency and enables stylistic customization, its ability to integrate unseen knowledge is constrained -- particularly with smaller datasets. Conversely, RAG-augmented models, equipped to incorporate external documents at inference, demonstrated superior factual accuracy on out-of-distribution prompts, though they lacked the stylistic consistency achieved by fine-tuning. Evaluations by LLM-based judges across knowledge accuracy, conversational quality, and conciseness suggest that fine-tuning is best suited for tone adaptation, whereas RAG excels at real-time knowledge augmentation.

Title: Steering off Course: Reliability Challenges in Steering Language Models

Authors: Patrick Queiroz Da Silva, Hari Sethuraman, Dheeraj Rajagopal, Hannaneh Hajishirzi, Sachin Kumar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.04635
Pdf URL: https://arxiv.org/pdf/2504.04635
Copy Paste: [[2504.04635]] Steering off Course: Reliability Challenges in Steering Language Models(https://arxiv.org/abs/2504.04635)
Keywords: language model
Abstract: Steering methods for language models (LMs) have gained traction as lightweight alternatives to fine-tuning, enabling targeted modifications to model activations. However, prior studies primarily report results on a few models, leaving critical gaps in understanding the robustness of these methods. In this work, we systematically examine three prominent steering methods -- DoLa, function vectors, and task vectors. In contrast to the original studies, which evaluated a handful of models, we test up to 36 models belonging to 14 families with sizes ranging from 1.5B to 70B parameters. Our experiments reveal substantial variability in the effectiveness of the steering approaches, with a large number of models showing no improvement and at times degradation in steering performance. Our analysis demonstrates fundamental flaws in the assumptions underlying these methods, challenging their reliability as scalable steering solutions.

Title: Splits! A Flexible Dataset for Evaluating a Model's Demographic Social Inference

Authors: Eylon Caplan, Tania Chakraborty, Dan Goldwasser
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.04640
Pdf URL: https://arxiv.org/pdf/2504.04640
Copy Paste: [[2504.04640]] Splits! A Flexible Dataset for Evaluating a Model's Demographic Social Inference(https://arxiv.org/abs/2504.04640)
Keywords: language model, llm
Abstract: Understanding how people of various demographics think, feel, and express themselves (collectively called group expression) is essential for social science and underlies the assessment of bias in Large Language Models (LLMs). While LLMs can effectively summarize group expression when provided with empirical examples, coming up with generalizable theories of how a group's expression manifests in real-world text is challenging. In this paper, we define a new task called Group Theorization, in which a system must write theories that differentiate expression across demographic groups. We make available a large dataset on this task, Splits!, constructed by splitting Reddit posts by neutral topics (e.g. sports, cooking, and movies) and by demographics (e.g. occupation, religion, and race). Finally, we suggest a simple evaluation framework for assessing how effectively a method can generate 'better' theories about group expression, backed by human validation. We publicly release the raw corpora and evaluation scripts for Splits! to help researchers assess how methods infer--and potentially misrepresent--group differences in expression. We make Splits! and our evaluation module available at this https URL.

Title: scAgent: Universal Single-Cell Annotation via a LLM Agent

Authors: Yuren Mao, Yu Mi, Peigen Liu, Mengfei Zhang, Hanqing Liu, Yunjun Gao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.04698
Pdf URL: https://arxiv.org/pdf/2504.04698
Copy Paste: [[2504.04698]] scAgent: Universal Single-Cell Annotation via a LLM Agent(https://arxiv.org/abs/2504.04698)
Keywords: language model, llm, agent
Abstract: Cell type annotation is critical for understanding cellular heterogeneity. Based on single-cell RNA-seq data and deep learning models, good progress has been made in annotating a fixed number of cell types within a specific tissue. However, universal cell annotation, which can generalize across tissues, discover novel cell types, and extend to novel cell types, remains less explored. To fill this gap, this paper proposes scAgent, a universal cell annotation framework based on Large Language Models (LLMs). scAgent can identify cell types and discover novel cell types in diverse tissues; furthermore, it is data-efficient in learning novel cell types. Experimental studies on 160 cell types and 35 tissues demonstrate the superior performance of scAgent in general cell-type annotation, novel cell discovery, and extensibility to novel cell types.

Title: Causal Retrieval with Semantic Consideration

Authors: Hyunseo Shin, Wonseok Hwang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.04700
Pdf URL: https://arxiv.org/pdf/2504.04700
Copy Paste: [[2504.04700]] Causal Retrieval with Semantic Consideration(https://arxiv.org/abs/2504.04700)
Keywords: language model, llm
Abstract: Recent advancements in large language models (LLMs) have significantly enhanced the performance of conversational AI systems. To extend their capabilities to knowledge-intensive domains such as biomedical and legal fields, where accuracy is critical, LLMs are often combined with information retrieval (IR) systems to generate responses based on retrieved documents. However, for IR systems to effectively support such applications, they must go beyond simple semantic matching and accurately capture diverse query intents, including causal relationships. Existing IR models primarily focus on retrieving documents based on surface-level semantic similarity, overlooking deeper relational structures such as causality. To address this, we propose CAWAI, a retrieval model that is trained with dual objectives: semantic and causal relations. Our extensive experiments demonstrate that CAWAI outperforms various models on diverse causal retrieval tasks, especially under large-scale retrieval settings. We also show that CAWAI exhibits strong zero-shot generalization across scientific-domain QA tasks.

Title: Sequential-NIAH: A Needle-In-A-Haystack Benchmark for Extracting Sequential Needles from Long Contexts

Authors: Yifei Yu, Qian-Wen Zhang, Lingfeng Qiao, Di Yin, Fang Li, Jie Wang, Zengxi Chen, Suncong Zheng, Xiaolong Liang, Xing Sun
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2504.04713
Pdf URL: https://arxiv.org/pdf/2504.04713
Copy Paste: [[2504.04713]] Sequential-NIAH: A Needle-In-A-Haystack Benchmark for Extracting Sequential Needles from Long Contexts(https://arxiv.org/abs/2504.04713)
Keywords: language model, llm, long context
Abstract: Evaluating the ability of large language models (LLMs) to handle extended contexts is critical, particularly for retrieving information relevant to specific queries embedded within lengthy inputs. We introduce Sequential-NIAH, a benchmark specifically designed to evaluate the capability of LLMs to extract sequential information items (known as needles) from long contexts. The benchmark comprises three types of needle generation pipelines: synthetic, real, and open-domain QA. It includes contexts ranging from 8K to 128K tokens in length, with a dataset of 14,000 samples (2,000 reserved for testing). To facilitate evaluation on this benchmark, we trained a synthetic data-driven evaluation model capable of evaluating answer correctness based on chronological or logical order, achieving an accuracy of 99.49% on synthetic test data. We conducted experiments on six well-known LLMs, revealing that even the best-performing model achieved a maximum accuracy of only 63.15%. Further analysis highlights the growing challenges posed by increasing context lengths and the number of needles, underscoring substantial room for improvement. Additionally, noise robustness experiments validate the reliability of the benchmark, making Sequential-NIAH an important reference for advancing research on long text extraction capabilities of LLMs.

Title: Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs

Authors: Will Cai, Tianneng Shi, Xuandong Zhao, Dawn Song
Subjects: cs.CL, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2504.04715
Pdf URL: https://arxiv.org/pdf/2504.04715
Copy Paste: [[2504.04715]] Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs(https://arxiv.org/abs/2504.04715)
Keywords: language model, llm
Abstract: The proliferation of Large Language Models (LLMs) accessed via black-box APIs introduces a significant trust challenge: users pay for services based on advertised model capabilities (e.g., size, performance), but providers may covertly substitute the specified model with a cheaper, lower-quality alternative to reduce operational costs. This lack of transparency undermines fairness, erodes trust, and complicates reliable benchmarking. Detecting such substitutions is difficult due to the black-box nature, typically limiting interaction to input-output queries. This paper formalizes the problem of model substitution detection in LLM APIs. We systematically evaluate existing verification techniques, including output-based statistical tests, benchmark evaluations, and log probability analysis, under various realistic attack scenarios like model quantization, randomized substitution, and benchmark evasion. Our findings reveal the limitations of methods relying solely on text outputs, especially against subtle or adaptive attacks. While log probability analysis offers stronger guarantees when available, its accessibility is often limited. We conclude by discussing the potential of hardware-based solutions like Trusted Execution Environments (TEEs) as a pathway towards provable model integrity, highlighting the trade-offs between security, performance, and provider adoption. Code is available at this https URL

Title: Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models

Authors: Yubo Li, Xiaobin Shen, Xinyu Yao, Xueying Ding, Yidi Miao, Ramayya Krishnan, Rema Padman
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.04717
Pdf URL: https://arxiv.org/pdf/2504.04717
Copy Paste: [[2504.04717]] Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models(https://arxiv.org/abs/2504.04717)
Keywords: language model, llm, agent
Abstract: Recent advancements in large language models (LLMs) have revolutionized their ability to handle single-turn tasks, yet real-world applications demand sophisticated multi-turn interactions. This survey provides a comprehensive review of recent advancements in evaluating and enhancing multi-turn interactions in LLMs. Focusing on task-specific scenarios, from instruction following in diverse domains such as math and coding to complex conversational engagements in roleplay, healthcare, education, and even adversarial jailbreak settings, we systematically examine the challenges of maintaining context, coherence, fairness, and responsiveness over prolonged dialogues. The paper organizes current benchmarks and datasets into coherent categories that reflect the evolving landscape of multi-turn dialogue evaluation. In addition, we review a range of enhancement methodologies under multi-turn settings, including model-centric strategies (contextual learning, supervised fine-tuning, reinforcement learning, and new architectures), external integration approaches (memory-augmented, retrieval-based methods, and knowledge graph), and agent-based techniques for collaborative interactions. Finally, we discuss open challenges and propose future directions for research to further advance the robustness and effectiveness of multi-turn interactions in LLMs. Related resources and papers are available at this https URL.

Title: T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models

Authors: Minki Kang, Jongwon Jeong, Jaewoong Cho
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.04718
Pdf URL: https://arxiv.org/pdf/2504.04718
Copy Paste: [[2504.04718]] T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models(https://arxiv.org/abs/2504.04718)
Keywords: language model
Abstract: Recent studies have demonstrated that test-time compute scaling effectively improves the performance of small language models (sLMs). However, prior research has mainly examined test-time compute scaling with an additional larger model as a verifier, leaving self-verification by sLMs underexplored. In this work, we investigate whether sLMs can reliably self-verify their outputs under test-time scaling. We find that even with knowledge distillation from larger verifiers, sLMs struggle with verification tasks requiring memorization, such as numerical calculations and fact-checking. To address this limitation, we propose Tool-integrated self-verification (T1), which delegates memorization-heavy verification steps to external tools, such as a code interpreter. Our theoretical analysis shows that tool integration reduces memorization demands and improves test-time scaling performance. Experiments on the MATH benchmark demonstrate that, with T1, a Llama-3.2 1B model under test-time scaling outperforms the significantly larger Llama-3.1 8B model. Moreover, T1 generalizes effectively to both mathematical (MATH500) and multi-domain knowledge-intensive tasks (MMLU-Pro). Our findings highlight the potential of tool integration to substantially improve the self-verification abilities of sLMs.
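The T1 abstract describes delegating memorization-heavy verification steps, such as numerical calculations, to an external tool. The sketch below illustrates that delegation for plain arithmetic only. The helper names `evaluate_arithmetic` and `verify_claim` are hypothetical, and a small AST walker stands in for the code interpreter mentioned in the abstract; none of this is taken from the paper's code.

```python
import ast
import operator

# Arithmetic operators the toy "tool" is willing to execute.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}


def evaluate_arithmetic(expression):
    """Evaluate a plain arithmetic expression without calling eval()."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")

    return walk(ast.parse(expression, mode="eval"))


def verify_claim(expression, claimed_value, tol=1e-9):
    """Return True when the tool's result matches the model's claimed value."""
    return abs(evaluate_arithmetic(expression) - claimed_value) <= tol


# A candidate answer asserting that 12 * 7 + 5 equals 90 is rejected
# by computation, so the verifier does not need to recall the result.
accepted = verify_claim("12 * 7 + 5", 89)
rejected = verify_claim("12 * 7 + 5", 90)
```

In a test-time scaling setup, a check of this kind would filter or re-rank sampled candidate solutions; how T1 combines tool results with the model's own verification is not specified in the abstract.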

Title: TathyaNyaya and FactLegalLlama: Advancing Factual Judgment Prediction and Explanation in the Indian Legal Context

Authors: Shubham Kumar Nigam, Balaramamahanthi Deepak Patnaik, Shivam Mishra, Noel Shallum, Kripabandhu Ghosh, Arnab Bhattacharya
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2504.04737
Pdf URL: https://arxiv.org/pdf/2504.04737
Copy Paste: [[2504.04737]] TathyaNyaya and FactLegalLlama: Advancing Factual Judgment Prediction and Explanation in the Indian Legal Context(https://arxiv.org/abs/2504.04737)
Keywords: language model, llm
Abstract: In the landscape of Fact-based Judgment Prediction and Explanation (FJPE), reliance on factual data is essential for developing robust and realistic AI-driven decision-making tools. This paper introduces TathyaNyaya, the largest annotated dataset for FJPE tailored to the Indian legal context, encompassing judgments from the Supreme Court of India and various High Courts. Derived from the Hindi terms "Tathya" (fact) and "Nyaya" (justice), the TathyaNyaya dataset is uniquely designed to focus on factual statements rather than complete legal texts, reflecting real-world judicial processes where factual data drives outcomes. Complementing this dataset, we present FactLegalLlama, an instruction-tuned variant of the LLaMa-3-8B Large Language Model (LLM), optimized for generating high-quality explanations in FJPE tasks. Finetuned on the factual data in TathyaNyaya, FactLegalLlama integrates predictive accuracy with coherent, contextually relevant explanations, addressing the critical need for transparency and interpretability in AI-assisted legal systems. Our methodology combines transformers for binary judgment prediction with FactLegalLlama for explanation generation, creating a robust framework for advancing FJPE in the Indian legal domain. TathyaNyaya not only surpasses existing datasets in scale and diversity but also establishes a benchmark for building explainable AI systems in legal analysis. The findings underscore the importance of factual precision and domain-specific tuning in enhancing predictive performance and interpretability, positioning TathyaNyaya and FactLegalLlama as foundational resources for AI-assisted legal decision-making.

Title: Can LLMs Interpret and Leverage Structured Linguistic Representations? A Case Study with AMRs

Authors: Ankush Raut, Xiaofeng Zhu, Maria Leonor Pacheco
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.04745
Pdf URL: https://arxiv.org/pdf/2504.04745
Copy Paste: [[2504.04745]] Can LLMs Interpret and Leverage Structured Linguistic Representations? A Case Study with AMRs(https://arxiv.org/abs/2504.04745)
Keywords: language model, llm, long context, prompt
Abstract: This paper evaluates the ability of Large Language Models (LLMs) to leverage contextual information in the form of structured linguistic representations. Specifically, we examine the impact of encoding both short and long contexts using Abstract Meaning Representation (AMR) structures across a diverse set of language tasks. We perform our analysis using 8-bit quantized and instruction-tuned versions of Llama 3.1 (8B), Phi-3, and Mistral 7B. Our results indicate that, for tasks involving short contexts, augmenting the prompt with the AMR of the original language context often degrades the performance of the underlying LLM. However, for tasks that involve long contexts, such as dialogue summarization in the SAMSum dataset, this enhancement improves LLM performance, for example, by increasing the zero-shot cosine similarity score of Llama 3.1 from 66.2% to 76%. This improvement is more evident in the newer and larger LLMs, but does not extend to the older or smaller ones. In addition, we observe that LLMs can effectively reconstruct the original text from a linearized AMR, achieving a cosine similarity of 81.3% in the best-case scenario.
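The AMR abstract reports its results as cosine similarity scores. For readers unfamiliar with the metric, here is a minimal sketch of the computation on two vectors. How the paper obtains the vectors (for example, which embedding model encodes the generated and reference texts) is not stated in the abstract, so the inputs below are placeholder values.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Placeholder vectors standing in for text embeddings.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

A score reported as 76% corresponds to a cosine of 0.76 on this scale, where 1.0 means the two vectors point in the same direction.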

Title: Improving Multilingual Retrieval-Augmented Language Models through Dialectic Reasoning Argumentations

Authors: Leonardo Ranaldi, Federico Ranaldi, Fabio Massimo Zanzotto, Barry Haddow, Alexandra Birch
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.04771
Pdf URL: https://arxiv.org/pdf/2504.04771
Copy Paste: [[2504.04771]] Improving Multilingual Retrieval-Augmented Language Models through Dialectic Reasoning Argumentations(https://arxiv.org/abs/2504.04771)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) is key to enhancing large language models (LLMs) to systematically access richer factual knowledge. Yet, using RAG brings intrinsic challenges, as LLMs must deal with potentially conflicting knowledge, especially in multilingual retrieval, where the heterogeneity of knowledge retrieved may deliver different outlooks. To make RAG more analytical, critical and grounded, we introduce Dialectic-RAG (DRAG), a modular approach guided by Argumentative Explanations, i.e., a structured reasoning process that systematically evaluates retrieved information by comparing, contrasting, and resolving conflicting perspectives. Given a query and a set of multilingual related documents, DRAG selects and exemplifies relevant knowledge for delivering dialectic explanations that, by critically weighing opposing arguments and filtering extraneous content, clearly determine the final response. Through a series of in-depth experiments, we show the impact of our framework both as an in-context learning strategy and for constructing demonstrations to instruct smaller models. The final results demonstrate that DRAG significantly improves RAG approaches, requiring low-impact computational effort and providing robustness to knowledge perturbations.

Title: Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models

Authors: Ruikang Liu, Yuxuan Sun, Manyi Zhang, Haoli Bai, Xianzhi Yu, Tiezheng Yu, Chun Yuan, Lu Hou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.04823
Pdf URL: https://arxiv.org/pdf/2504.04823
Copy Paste: [[2504.04823]] Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models(https://arxiv.org/abs/2504.04823)
Keywords: language model, chain-of-thought
Abstract: Recent advancements in reasoning language models have demonstrated remarkable performance in complex tasks, but their extended chain-of-thought reasoning process increases inference overhead. While quantization has been widely adopted to reduce the inference cost of large language models, its impact on reasoning models remains understudied. In this study, we conduct the first systematic study on quantized reasoning models, evaluating the open-sourced DeepSeek-R1-Distilled Qwen and LLaMA families ranging from 1.5B to 70B parameters, and QwQ-32B. Our investigation covers weight, KV cache, and activation quantization using state-of-the-art algorithms at varying bit-widths, with extensive evaluation across mathematical (AIME, MATH-500), scientific (GPQA), and programming (LiveCodeBench) reasoning benchmarks. Our findings reveal that while lossless quantization can be achieved with W8A8 or W4A16 quantization, lower bit-widths introduce significant accuracy risks. We further identify model size, model origin, and task difficulty as critical determinants of performance. Contrary to expectations, quantized models do not exhibit increased output lengths. In addition, strategically scaling the model sizes or reasoning steps can effectively enhance the performance. All quantized models and codes will be open-sourced in this https URL.
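The quantization abstract contrasts bit-widths such as W8A8 (8-bit weights and activations) and W4A16 (4-bit weights, 16-bit activations). The sketch below shows why lower bit-widths lose precision, using plain symmetric round-to-nearest quantization of a small weight list. This is a deliberately simple baseline chosen for illustration, not one of the state-of-the-art algorithms the paper evaluates, and the function names are hypothetical.

```python
def quantize_symmetric(values, bits=8):
    """Map floats to signed integers in [-qmax, qmax] with one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    # Fall back to a scale of 1.0 when every value is zero.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integer codes."""
    return [q * scale for q in quantized]


def max_abs_error(values, bits):
    """Largest round-trip error for the given bit-width."""
    quantized, scale = quantize_symmetric(values, bits)
    restored = dequantize(quantized, scale)
    return max(abs(a - b) for a, b in zip(values, restored))


weights = [0.5, -1.0, 0.25, 0.0]  # toy values, not taken from any model
error_8bit = max_abs_error(weights, 8)
error_4bit = max_abs_error(weights, 4)
```

With round-to-nearest, the round-trip error of each value is bounded by half the scale, and the scale grows as the number of integer levels shrinks, which is one intuition for the accuracy risks reported at lower bit-widths.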

Title: SAFT: Structure-aware Transformers for Textual Interaction Classification

Authors: Hongtao Wang, Renchi Yang, Hewen Wang, Haoran Zheng, Jianliang Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.04861
Pdf URL: https://arxiv.org/pdf/2504.04861
Copy Paste: [[2504.04861]] SAFT: Structure-aware Transformers for Textual Interaction Classification(https://arxiv.org/abs/2504.04861)
Keywords: language model
Abstract: Textual interaction networks (TINs) are an omnipresent data structure used to model the interplay between users and items on e-commerce websites, social networks, etc., where each interaction is associated with a text description. Classifying such textual interactions (TIC) finds extensive use in detecting spam reviews in e-commerce, fraudulent transactions in finance, and so on. Existing TIC solutions either (i) fail to capture the rich text semantics due to the use of context-free text embeddings, and/or (ii) disregard the bipartite structure and node heterogeneity of TINs, leading to compromised TIC performance. In this work, we propose SAFT, a new architecture that integrates language- and graph-based modules for the effective fusion of textual and structural semantics in the representation learning of interactions. In particular, line graph attention (LGA)/gated attention units (GAUs) and pretrained language models (PLMs) are capitalized on to model the interaction-level and token-level signals, which are further coupled via the proxy token in an iterative and contextualized fashion. Additionally, an efficient and theoretically-grounded approach is developed to encode the local and global topology information pertaining to interactions into structural embeddings. The resulting embeddings not only inject the structural features underlying TINs into the textual interaction encoding but also facilitate the design of graph sampling strategies. Extensive empirical evaluations on multiple real TIN datasets demonstrate the superiority of SAFT over the state-of-the-art baselines in TIC accuracy.

Title: Leveraging Large Language Models for Cost-Effective, Multilingual Depression Detection and Severity Assessment

Authors: Longdi Xian, Jianzhang Ni, Mingzhu Wang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2504.04891
Pdf URL: https://arxiv.org/pdf/2504.04891
Copy Paste: [[2504.04891]] Leveraging Large Language Models for Cost-Effective, Multilingual Depression Detection and Severity Assessment(https://arxiv.org/abs/2504.04891)
Keywords: language model, llm
Abstract: Depression is a prevalent mental health disorder that is difficult to detect early due to subjective symptom assessments. Recent advancements in large language models have offered efficient and cost-effective approaches for this objective. In this study, we evaluated the performance of four LLMs in depression detection using clinical interview data. We selected the best performing model and further tested it in the severity evaluation scenario and knowledge enhanced scenario. The robustness was evaluated in complex diagnostic scenarios using a dataset comprising 51074 statements from six different mental disorders. We found that DeepSeek V3 is the most reliable and cost-effective model for depression detection, performing well in both zero-shot and few-shot scenarios, with zero-shot being the most efficient choice. The evaluation of severity showed low agreement with the human evaluator, particularly for mild depression. The model maintains stably high AUCs for detecting depression in complex diagnostic scenarios. These findings highlight DeepSeek V3's strong potential for text-based depression detection in real-world clinical applications. However, they also underscore the need for further refinement in severity assessment and the mitigation of potential biases to enhance clinical reliability.

Title: Collab-RAG: Boosting Retrieval-Augmented Generation for Complex Question Answering via White-Box and Black-Box LLM Collaboration

Authors: Ran Xu, Wenqi Shi, Yuchen Zhuang, Yue Yu, Joyce C. Ho, Haoyu Wang, Carl Yang
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2504.04915
Pdf URL: https://arxiv.org/pdf/2504.04915
Copy Paste: [[2504.04915]] Collab-RAG: Boosting Retrieval-Augmented Generation for Complex Question Answering via White-Box and Black-Box LLM Collaboration(https://arxiv.org/abs/2504.04915)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) systems often struggle to handle multi-hop question-answering tasks accurately due to irrelevant context retrieval and limited complex reasoning capabilities. We introduce Collab-RAG, a collaborative training framework that leverages mutual enhancement between a white-box small language model (SLM) and a black-box large language model (LLM) for RAG. Specifically, the SLM decomposes complex queries into simpler sub-questions, thus enhancing the accuracy of the retrieval and facilitating more effective reasoning by the black-box LLM. Concurrently, the black-box LLM provides feedback signals to improve the SLM's decomposition capability. We observe that Collab-RAG relies solely on supervision from an affordable black-box LLM without additional distillation from frontier LLMs, yet demonstrates strong generalization across multiple black-box LLMs. Experimental evaluations across five multi-hop QA datasets demonstrate that Collab-RAG substantially outperforms existing black-box-only and SLM fine-tuning baselines by 1.8%-14.2% on average. In particular, our fine-tuned 3B SLM surpasses a frozen 32B LLM in question decomposition, highlighting the efficiency of Collab-RAG in improving reasoning and retrieval for complex questions. The code of Collab-RAG is available on this https URL.

Title: M-Prometheus: A Suite of Open Multilingual LLM Judges

Authors: José Pombal, Dongkeun Yoon, Patrick Fernandes, Ian Wu, Seungone Kim, Ricardo Rei, Graham Neubig, André F. T. Martins
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.04953
Pdf URL: https://arxiv.org/pdf/2504.04953
Copy Paste: [[2504.04953]] M-Prometheus: A Suite of Open Multilingual LLM Judges(https://arxiv.org/abs/2504.04953)
Keywords: language model, llm
Abstract: The use of language models for automatically evaluating long-form text (LLM-as-a-judge) is becoming increasingly common, yet most LLM judges are optimized exclusively for English, with strategies for enhancing their multilingual evaluation capabilities remaining largely unexplored in the current literature. This has created a disparity in the quality of automatic evaluation methods for non-English languages, ultimately hindering the development of models with better multilingual capabilities. To bridge this gap, we introduce M-Prometheus, a suite of open-weight LLM judges ranging from 3B to 14B parameters that can provide both direct assessment and pairwise comparison feedback on multilingual outputs. M-Prometheus models outperform state-of-the-art open LLM judges on multilingual reward benchmarks spanning more than 20 languages, as well as on literary machine translation (MT) evaluation covering 4 language pairs. Furthermore, M-Prometheus models can be leveraged at decoding time to significantly improve generated outputs across all 3 tested languages, showcasing their utility for the development of better multilingual models. Lastly, through extensive ablations, we identify the key factors for obtaining an effective multilingual judge, including backbone model selection and training on natively multilingual feedback data instead of translated data. We release our models, training dataset, and code.

Title: A Domain-Based Taxonomy of Jailbreak Vulnerabilities in Large Language Models

Authors: Carlos Peláez-González, Andrés Herrera-Poyatos, Cristina Zuheros, David Herrera-Poyatos, Virilo Tejedor, Francisco Herrera
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.04976
Pdf URL: https://arxiv.org/pdf/2504.04976
Copy Paste: [[2504.04976]] A Domain-Based Taxonomy of Jailbreak Vulnerabilities in Large Language Models(https://arxiv.org/abs/2504.04976)
Keywords: language model, llm, hallucination, prompt
Abstract: The study of large language models (LLMs) is a key area in open-world machine learning. Although LLMs demonstrate remarkable natural language processing capabilities, they also face several challenges, including consistency issues, hallucinations, and jailbreak vulnerabilities. Jailbreaking refers to the crafting of prompts that bypass alignment safeguards, leading to unsafe outputs that compromise the integrity of LLMs. This work specifically focuses on the challenge of jailbreak vulnerabilities and introduces a novel taxonomy of jailbreak attacks grounded in the training domains of LLMs. It characterizes alignment failures through generalization, objectives, and robustness gaps. Our primary contribution is a perspective on jailbreak, framed through the different linguistic domains that emerge during LLM training and alignment. This viewpoint highlights the limitations of existing approaches and enables us to classify jailbreak attacks on the basis of the underlying model deficiencies they exploit. Unlike conventional classifications that categorize attacks based on prompt construction methods (e.g., prompt templating), our approach provides a deeper understanding of LLM behavior. We introduce a taxonomy with four categories -- mismatched generalization, competing objectives, adversarial robustness, and mixed attacks -- offering insights into the fundamental nature of jailbreak vulnerabilities. Finally, we present key lessons derived from this taxonomic study.

Title: Following the Whispers of Values: Unraveling Neural Mechanisms Behind Value-Oriented Behaviors in LLMs

Authors: Ling Hu, Yuemei Xu, Xiaoyang Gu, Letao Han
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.04994
Pdf URL: https://arxiv.org/pdf/2504.04994
Copy Paste: [[2504.04994]] Following the Whispers of Values: Unraveling Neural Mechanisms Behind Value-Oriented Behaviors in LLMs(https://arxiv.org/abs/2504.04994)
Keywords: language model, llm
Abstract: Despite the impressive performance of large language models (LLMs), they can present unintended biases and harmful behaviors driven by encoded values, emphasizing the urgent need to understand the value mechanisms behind them. However, current research primarily evaluates these values through external responses with a focus on AI safety, lacking interpretability and failing to assess social values in real-world contexts. In this paper, we propose a novel framework called ValueExploration, which aims to explore the behavior-driven mechanisms of National Social Values within LLMs at the neuron level. As a case study, we focus on Chinese Social Values and first construct C-voice, a large-scale bilingual benchmark for identifying and evaluating Chinese Social Values in LLMs. By leveraging C-voice, we then identify and locate the neurons responsible for encoding these values according to activation difference. Finally, by deactivating these neurons, we analyze shifts in model behavior, uncovering the internal mechanism by which values influence LLM decision-making. Extensive experiments on four representative LLMs validate the efficacy of our framework. The benchmark and code will be available.

Title: Surveying Professional Writers on AI: Limitations, Expectations, and Fears

Authors: Anastasiia Ivanova, Natalia Fedorova, Sergey Tilga, Ekaterina Artemova
Subjects: cs.CL, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2504.05008
Pdf URL: https://arxiv.org/pdf/2504.05008
Copy Paste: [[2504.05008]] Surveying Professional Writers on AI: Limitations, Expectations, and Fears(https://arxiv.org/abs/2504.05008)
Keywords: language model, llm
Abstract: The rapid development of AI-driven tools, particularly large language models (LLMs), is reshaping professional writing. Still, key aspects of their adoption such as language support, ethics, and long-term impact on writers' voice and creativity remain underexplored. In this work, we conducted a questionnaire (N = 301) and an interactive survey (N = 36) targeting professional writers regularly using AI. We examined LLM-assisted writing practices across 25+ languages, ethical concerns, and user expectations. The findings of the survey demonstrate important insights, reflecting upon the importance of: LLM adoption for non-English speakers; the degree of misinformation, domain and style adaptation; usability and key features of LLMs. These insights can guide further development, benefiting both writers and a broader user base.

Title: Revealing the Intrinsic Ethical Vulnerability of Aligned Large Language Models

Authors: Jiawei Lian, Jianhong Pan, Lefan Wang, Yi Wang, Shaohui Mei, Lap-Pui Chau
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.05050
Pdf URL: https://arxiv.org/pdf/2504.05050
Copy Paste: [[2504.05050]] Revealing the Intrinsic Ethical Vulnerability of Aligned Large Language Models(https://arxiv.org/abs/2504.05050)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are foundational explorations to artificial general intelligence, yet their alignment with human values via instruction tuning and preference learning achieves only superficial compliance. Here, we demonstrate that harmful knowledge embedded during pretraining persists as indelible "dark patterns" in LLMs' parametric memory, evading alignment safeguards and resurfacing under adversarial inducement at distributional shifts. In this study, we first theoretically analyze the intrinsic ethical vulnerability of aligned LLMs by proving that current alignment methods yield only local "safety regions" in the knowledge manifold. In contrast, pretrained knowledge remains globally connected to harmful concepts via high-likelihood adversarial trajectories. Building on this theoretical insight, we empirically validate our findings by employing semantic coherence inducement under distributional shifts--a method that systematically bypasses alignment constraints through optimized adversarial prompts. This combined theoretical and empirical approach achieves a 100% attack success rate across 19 out of 23 state-of-the-art aligned LLMs, including DeepSeek-R1 and LLaMA-3, revealing their universal vulnerabilities.

Title: Not All Data Are Unlearned Equally

Authors: Aravind Krishnan, Siva Reddy, Marius Mosbach
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.05058
Pdf URL: https://arxiv.org/pdf/2504.05058
Copy Paste: [[2504.05058]] Not All Data Are Unlearned Equally(https://arxiv.org/abs/2504.05058)
Keywords: language model, llm
Abstract: Machine unlearning is concerned with the task of removing knowledge learned from particular data points from a trained model. In the context of large language models (LLMs), unlearning has recently received increased attention, particularly for removing knowledge about named entities from models for privacy purposes. While various approaches have been proposed to address the unlearning problem, most existing approaches treat all data points to be unlearned equally, i.e., unlearning that Montreal is a city in Canada is treated exactly the same as unlearning the phone number of the first author of this paper. In this work, we show that this "all data is equal" assumption does not hold for LLM unlearning. We study how the success of unlearning depends on the frequency of the knowledge we want to unlearn in the pre-training data of a model and find that frequency strongly affects unlearning, i.e., more frequent knowledge is harder to unlearn. Additionally, we uncover a misalignment between probability and generation-based evaluations of unlearning and show that this problem worsens as models become larger. Overall, our experiments highlight the need for better evaluation practices and novel methods for LLM unlearning that take the training data of models into account.

Title: On the Performance of an Explainable Language Model on PubMedQA

Authors: Venkat Srinivasan, Vishaal Jatav, Anushka Chandrababu, Geetika Sharma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.05074
Pdf URL: https://arxiv.org/pdf/2504.05074
Copy Paste: [[2504.05074]] On the Performance of an Explainable Language Model on PubMedQA(https://arxiv.org/abs/2504.05074)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) have shown significant abilities in retrieving medical knowledge, reasoning over it and answering medical questions comparably to physicians. However, these models are not interpretable, hallucinate, are difficult to maintain and require enormous compute resources for training and inference. In this paper, we report results from Gyan, an explainable language model based on an alternative architecture, on the PubmedQA data set. The Gyan LLM is a compositional language model and the model is decoupled from knowledge. Gyan is trustable, transparent, does not hallucinate and does not require significant training or compute resources. Gyan is easily transferable across domains. Gyan-4.3 achieves SOTA results on PubmedQA with 87.1% accuracy compared to 82% by MedPrompt based on GPT-4 and 81.8% by Med-PaLM 2 (Google and DeepMind). We will be reporting results for other medical data sets - MedQA, MedMCQA, MMLU - Medicine in the future.

Title: The Curse of CoT: On the Limitations of Chain-of-Thought in In-Context Learning

Authors: Tianshi Zheng, Yixiang Chen, Chengxi Li, Chunyang Li, Qing Zong, Haochen Shi, Baixuan Xu, Yangqiu Song, Ginny Y. Wong, Simon See
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.05081
Pdf URL: https://arxiv.org/pdf/2504.05081
Copy Paste: [[2504.05081]] The Curse of CoT: On the Limitations of Chain-of-Thought in In-Context Learning(https://arxiv.org/abs/2504.05081)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Chain-of-Thought (CoT) prompting has been widely recognized for its ability to enhance reasoning capabilities in large language models (LLMs) through the generation of explicit explanatory rationales. However, our study reveals a surprising contradiction to this prevailing perspective. Through extensive experiments involving 16 state-of-the-art LLMs and nine diverse pattern-based in-context learning (ICL) datasets, we demonstrate that CoT and its reasoning variants consistently underperform direct answering across varying model scales and benchmark complexities. To systematically investigate this unexpected phenomenon, we designed extensive experiments to validate several hypothetical explanations. Our analysis uncovers a fundamental explicit-implicit duality driving CoT's performance in pattern-based ICL: while explicit reasoning falters due to LLMs' struggles to infer underlying patterns from demonstrations, implicit reasoning, disrupted by the increased contextual distance of CoT rationales, often compensates, delivering correct answers despite flawed rationales. This duality explains CoT's relative underperformance, as noise from weak explicit inference undermines the process, even as implicit mechanisms partially salvage outcomes. Notably, even long-CoT reasoning models, which excel in abstract and symbolic reasoning, fail to fully overcome these limitations despite higher computational costs. Our findings challenge existing assumptions regarding the universal efficacy of CoT, yielding novel insights into its limitations and guiding future research toward more nuanced and effective reasoning methodologies for LLMs.

Title: AI for Climate Finance: Agentic Retrieval and Multi-Step Reasoning for Early Warning System Investments

Authors: Saeid Ario Vaghefi, Aymane Hachcham, Veronica Grasso, Jiska Manicus, Nakiete Msemo, Chiara Colesanti Senni, Markus Leippold
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.05104
Pdf URL: https://arxiv.org/pdf/2504.05104
Copy Paste: [[2504.05104]] AI for Climate Finance: Agentic Retrieval and Multi-Step Reasoning for Early Warning System Investments(https://arxiv.org/abs/2504.05104)
Keywords: llm, prompt, retrieval-augmented generation, chain-of-thought, agent
Abstract: Tracking financial investments in climate adaptation is a complex and expertise-intensive task, particularly for Early Warning Systems (EWS), which lack standardized financial reporting across multilateral development banks (MDBs) and funds. To address this challenge, we introduce an LLM-based agentic AI system that integrates contextual retrieval, fine-tuning, and multi-step reasoning to extract relevant financial data, classify investments, and ensure compliance with funding guidelines. Our study focuses on a real-world application: tracking EWS investments in the Climate Risk and Early Warning Systems (CREWS) Fund. We analyze 25 MDB project documents and evaluate multiple AI-driven classification methods, including zero-shot and few-shot learning, fine-tuned transformer-based classifiers, chain-of-thought (CoT) prompting, and an agent-based retrieval-augmented generation (RAG) approach. Our results show that the agent-based RAG approach significantly outperforms other methods, achieving 87% accuracy, 89% precision, and 83% recall. Additionally, we contribute a benchmark dataset and expert-annotated corpus, providing a valuable resource for future research in AI-driven financial tracking and climate finance transparency.

Title: DoCIA: An Online Document-Level Context Incorporation Agent for Speech Translation

Authors: Xinglin Lyu, Wei Tang, Yuang Li, Xiaofeng Zhao, Ming Zhu, Junhui Li, Yunfei Lu, Min Zhang, Daimeng Wei, Hao Yang, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.05122
Pdf URL: https://arxiv.org/pdf/2504.05122
Copy Paste: [[2504.05122]] DoCIA: An Online Document-Level Context Incorporation Agent for Speech Translation(https://arxiv.org/abs/2504.05122)
Keywords: language model, llm, hallucination, agent
Abstract: Document-level context is crucial for handling discourse challenges in text-to-text document-level machine translation (MT). Despite the increased discourse challenges introduced by noise from automatic speech recognition (ASR), the integration of document-level context in speech translation (ST) remains insufficiently explored. In this paper, we develop DoCIA, an online framework that enhances ST performance by incorporating document-level context. DoCIA decomposes the ST pipeline into four stages. Document-level context is integrated into the ASR refinement, MT, and MT refinement stages through auxiliary LLM (large language model)-based modules. Furthermore, DoCIA leverages document-level information in a multi-level manner while minimizing computational overhead. Additionally, a simple yet effective determination mechanism is introduced to prevent hallucinations from excessive refinement, ensuring the reliability of the final results. Experimental results show that DoCIA significantly outperforms traditional ST baselines in both sentence and discourse metrics across four LLMs, demonstrating its effectiveness in improving ST performance.

Title: CARE: Aligning Language Models for Regional Cultural Awareness

Authors: Geyang Guo, Tarek Naous, Hiromi Wakaki, Yukiko Nishimura, Yuki Mitsufuji, Alan Ritter, Wei Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.05154
Pdf URL: https://arxiv.org/pdf/2504.05154
Copy Paste: [[2504.05154]] CARE: Aligning Language Models for Regional Cultural Awareness(https://arxiv.org/abs/2504.05154)
Keywords: language model
Abstract: Existing language models (LMs) often exhibit a Western-centric bias and struggle to represent diverse cultural knowledge. Previous attempts to address this rely on synthetic data and express cultural knowledge only in English. In this work, we study whether a small amount of human-written, multilingual cultural preference data can improve LMs across various model families and sizes. We first introduce CARE, a multilingual resource of 24.1k responses with human preferences on 2,580 questions about Chinese and Arab cultures, all carefully annotated by native speakers and offering more balanced coverage. Using CARE, we demonstrate that cultural alignment improves existing LMs beyond generic resources without compromising general capabilities. Moreover, we evaluate the cultural awareness of LMs, native speakers, and retrieved web content when queried in different languages. Our experiment reveals regional disparities among LMs, which may also be reflected in the documentation gap: native speakers often take everyday cultural commonsense and social norms for granted, while non-natives are more likely to actively seek out and document them. CARE is publicly available at this https URL (we plan to add Japanese data in the near future).

Title: Concise Reasoning via Reinforcement Learning

Authors: Mehdi Fatemi, Banafsheh Rafiee, Mingjie Tang, Kartik Talamadupula
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.05185
Pdf URL: https://arxiv.org/pdf/2504.05185
Copy Paste: [[2504.05185]] Concise Reasoning via Reinforcement Learning(https://arxiv.org/abs/2504.05185)
Keywords: language model, llm
Abstract: Despite significant advancements in large language models (LLMs), a major drawback of reasoning models is their enormous token usage, which increases computational cost, resource requirements, and response time. In this work, we revisit the core principles of reinforcement learning (RL) and, through mathematical analysis, demonstrate that the tendency to generate lengthy responses arises inherently from RL-based optimization during training. This finding questions the prevailing assumption that longer responses inherently improve reasoning accuracy. Instead, we uncover a natural correlation between conciseness and accuracy that has been largely overlooked. Moreover, we show that introducing a secondary phase of RL post-training, using a small set of problems and limited resources, can significantly reduce a model's chain of thought while maintaining or even enhancing accuracy. Finally, we validate our conclusions through extensive experimental results.

Title: Post-Training Language Models for Continual Relation Extraction

Authors: Sefika Efeoglu, Adrian Paschke, Sonja Schimmler
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.05214
Pdf URL: https://arxiv.org/pdf/2504.05214
Copy Paste: [[2504.05214]] Post-Training Language Models for Continual Relation Extraction(https://arxiv.org/abs/2504.05214)
Keywords: language model, llm, chat
Abstract: Real-world data, such as news articles, social media posts, and chatbot conversations, is inherently dynamic and non-stationary, presenting significant challenges for constructing real-time structured representations through knowledge graphs (KGs). Relation Extraction (RE), a fundamental component of KG creation, often struggles to adapt to evolving data when traditional models rely on static, outdated datasets. Continual Relation Extraction (CRE) methods tackle this issue by incrementally learning new relations while preserving previously acquired knowledge. This study investigates the application of pre-trained language models (PLMs), specifically large language models (LLMs), to CRE, with a focus on leveraging memory replay to address catastrophic forgetting. We evaluate decoder-only models (e.g., Mistral-7B and Llama2-7B) and encoder-decoder models (e.g., Flan-T5 Base) on the TACRED and FewRel datasets. Task-incremental fine-tuning of LLMs demonstrates superior performance over earlier approaches using encoder-only models like BERT on TACRED, excelling in seen-task accuracy and overall performance (measured by whole and average accuracy), particularly with the Mistral and Flan-T5 models. Results on FewRel are similarly promising, achieving second place in whole and average accuracy metrics. This work underscores critical factors in knowledge transfer, language model architecture, and KG completeness, advancing CRE with LLMs and memory replay for dynamic, real-time relation extraction.

Title: NoveltyBench: Evaluating Creativity and Diversity in Language Models

Authors: Yiming Zhang, Harshita Diddee, Susan Holm, Hanchen Liu, Xinyue Liu, Vinay Samuel, Barry Wang, Daphne Ippolito
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.05228
Pdf URL: https://arxiv.org/pdf/2504.05228
Copy Paste: [[2504.05228]] NoveltyBench: Evaluating Creativity and Diversity in Language Models(https://arxiv.org/abs/2504.05228)
Keywords: language model, prompt
Abstract: Language models have demonstrated remarkable capabilities on standard benchmarks, yet they increasingly suffer from mode collapse, the inability to generate diverse and novel outputs. Our work introduces NoveltyBench, a benchmark specifically designed to evaluate the ability of language models to produce multiple distinct and high-quality outputs. NoveltyBench utilizes prompts curated to elicit diverse answers and filtered real-world user queries. Evaluating 20 leading language models, we find that current state-of-the-art systems generate significantly less diversity than human writers. Notably, larger models within a family often exhibit less diversity than their smaller counterparts, challenging the notion that capability on standard benchmarks translates directly to generative utility. While prompting strategies like in-context regeneration can elicit diversity, our findings highlight a fundamental lack of distributional diversity in current models, reducing their utility for users seeking varied responses and suggesting the need for new training and evaluation paradigms that prioritize creativity alongside quality.

Title: LLM-based Automated Grading with Human-in-the-Loop

Authors: Hang Li, Yucheng Chu, Kaiqi Yang, Yasemin Copur-Gencturk, Jiliang Tang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.05239
Pdf URL: https://arxiv.org/pdf/2504.05239
Copy Paste: [[2504.05239]] LLM-based Automated Grading with Human-in-the-Loop(https://arxiv.org/abs/2504.05239)
Keywords: language model, llm
Abstract: The rise of artificial intelligence (AI) technologies, particularly large language models (LLMs), has brought significant advancements to the field of education. Among various applications, automatic short answer grading (ASAG), which focuses on evaluating open-ended textual responses, has seen remarkable progress with the introduction of LLMs. These models not only enhance grading performance compared to traditional ASAG approaches but also move beyond simple comparisons with predefined "golden" answers, enabling more sophisticated grading scenarios, such as rubric-based evaluation. However, existing LLM-powered methods still face challenges in achieving human-level grading performance in rubric-based assessments due to their reliance on fully automated approaches. In this work, we explore the potential of LLMs in ASAG tasks by leveraging their interactive capabilities through a human-in-the-loop (HITL) approach. Our proposed framework, GradeHITL, utilizes the generative properties of LLMs to pose questions to human experts, incorporating their insights to refine grading rubrics dynamically. This adaptive process significantly improves grading accuracy, outperforming existing methods and bringing ASAG closer to human-level evaluation.

Title: Do PhD-level LLMs Truly Grasp Elementary Addition? Probing Rule Learning vs. Memorization in Large Language Models

Authors: Yang Yan, Yu Lu, Renjun Xu, Zhenzhong Lan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.05262
Pdf URL: https://arxiv.org/pdf/2504.05262
Copy Paste: [[2504.05262]] Do PhD-level LLMs Truly Grasp Elementary Addition? Probing Rule Learning vs. Memorization in Large Language Models(https://arxiv.org/abs/2504.05262)
Keywords: language model, llm
Abstract: Despite high benchmark scores, Large Language Models (LLMs) often fail on simple problems, raising a critical question: Do LLMs learn mathematical principles or merely memorize patterns? Rather than designing increasingly complex benchmarks like recent works, we investigate this using elementary two-integer addition ($0$ to $2^{64}$), probing two core properties: commutativity ($A+B=B+A$) and compositional generalization (via isomorphic symbolic mappings, e.g., $7 \rightarrow y$). While state-of-the-art LLMs achieve 73.8-99.8\% accuracy on numerical addition, performance collapses to $\leq$7.5\% under symbolic mapping, indicating failure to generalize learned rules. Non-monotonic performance scaling with digit count and frequent commutativity violations (over 1,700 cases of $A+B \neq B+A$) further support this. Explicitly providing addition rules degrades performance by 81.2\% on average, while self-explanation maintains baseline accuracy, suggesting LLM arithmetic processing is misaligned with human-defined principles. Our findings indicate current LLMs rely on memorized patterns rather than genuine rule learning, highlighting architectural limitations and the need for new approaches to achieve true mathematical reasoning.
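The two probes named in the abstract above, commutativity and isomorphic symbolic mapping, can be illustrated with a small sketch. The helper names (`make_symbol_map`, `encode`, `decode`, `commutativity_violations`), the symbol alphabet, and the `answer_fn` interface are hypothetical; the paper's actual prompts and mappings are not given in the abstract.

```python
import random


def make_symbol_map(symbols="abcdefghij", seed=0):
    """Build a one-to-one map from the ten digits to ten symbols (assumed alphabet)."""
    digits = list("0123456789")
    shuffled = list(symbols)
    random.Random(seed).shuffle(shuffled)
    return dict(zip(digits, shuffled))


def encode(n, mapping):
    """Rewrite a non-negative integer digit by digit, e.g. 7 -> 'y' in the paper's notation."""
    return "".join(mapping[d] for d in str(n))


def decode(s, mapping):
    """Invert encode(); valid because the mapping is a bijection."""
    inverse = {v: k for k, v in mapping.items()}
    return int("".join(inverse[c] for c in s))


def commutativity_violations(answer_fn, pairs):
    """Return the (a, b) pairs where answer_fn(a, b) differs from answer_fn(b, a).

    answer_fn stands in for a model query; here it is any Python callable.
    """
    return [(a, b) for a, b in pairs if answer_fn(a, b) != answer_fn(b, a)]
```

Because the mapping is a bijection on digits, the symbolic task has the same structure as ordinary addition, so a system that had learned the rule rather than surface patterns would be expected to solve both equally well.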

Title: Enhancing LLM-Based Short Answer Grading with Retrieval-Augmented Generation

Authors: Yucheng Chu, Peng He, Hang Li, Haoyu Han, Kaiqi Yang, Yu Xue, Tingting Li, Joseph Krajcik, Jiliang Tang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.05276
Pdf URL: https://arxiv.org/pdf/2504.05276
Copy Paste: [[2504.05276]] Enhancing LLM-Based Short Answer Grading with Retrieval-Augmented Generation(https://arxiv.org/abs/2504.05276)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Short answer assessment is a vital component of science education, allowing evaluation of students' complex three-dimensional understanding. Large language models (LLMs) that possess human-like ability in linguistic tasks are increasingly popular in assisting human graders to reduce their workload. However, LLMs' limitations in domain knowledge restrict their understanding of task-specific requirements and hinder their ability to achieve satisfactory performance. Retrieval-augmented generation (RAG) emerges as a promising solution by enabling LLMs to access relevant domain-specific knowledge during assessment. In this work, we propose an adaptive RAG framework for automated grading that dynamically retrieves and incorporates domain-specific knowledge based on the question and student answer context. Our approach combines semantic search and curated educational sources to retrieve valuable reference materials. Experimental results on a science education dataset demonstrate that our system achieves an improvement in grading accuracy compared to baseline LLM approaches. The findings suggest that RAG-enhanced grading systems can serve as reliable support with efficient performance gains.
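The retrieval step described in the abstract above, selecting reference material from the question and student answer, can be sketched roughly as below. This is not the authors' system: bag-of-words cosine similarity is swapped in as a simple stand-in for semantic search, and the names `retrieve` and `build_grading_prompt`, the prompt wording, and the toy corpus format are assumptions.

```python
import math
from collections import Counter


def bow(text):
    """Lower-cased whitespace tokens as a bag of words (stand-in for an embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bags of words; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, student_answer, corpus, k=1):
    """Rank reference passages against the question plus the student's answer."""
    query = bow(question + " " + student_answer)
    ranked = sorted(corpus, key=lambda doc: cosine(query, bow(doc)), reverse=True)
    return ranked[:k]


def build_grading_prompt(question, student_answer, corpus, k=1):
    """Assemble a hypothetical grading prompt that includes the retrieved references."""
    refs = retrieve(question, student_answer, corpus, k)
    ref_text = "\n".join("- " + r for r in refs)
    return (
        "Reference material:\n" + ref_text + "\n"
        "Question: " + question + "\n"
        "Student answer: " + student_answer + "\n"
        "Grade the answer using the reference material."
    )
```

Conditioning the query on both the question and the student answer is what makes the retrieval adaptive in this sketch; a real system would likely use dense embeddings rather than word overlap.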

Title: Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations

Authors: Pedro Ferreira, Wilker Aziz, Ivan Titov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.05294
Pdf URL: https://arxiv.org/pdf/2504.05294
Copy Paste: [[2504.05294]] Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations(https://arxiv.org/abs/2504.05294)
Keywords: language model, llm, chain-of-thought
Abstract: Chain-of-thought explanations are widely used to inspect the decision process of large language models (LLMs) and to evaluate the trustworthiness of model outputs, making them important for effective collaboration between LLMs and humans. We demonstrate that preference optimization - a key step in the alignment phase - can inadvertently reduce the faithfulness of these explanations. This occurs because the reward model (RM), which guides alignment, is tasked with optimizing both the expected quality of the response and the appropriateness of the explanations (e.g., minimizing bias or adhering to safety standards), creating potential conflicts. The RM lacks a mechanism to assess the consistency between the model's internal decision process and the generated explanation. Consequently, the LLM may engage in "reward hacking" by producing a final response that scores highly while giving an explanation tailored to maximize reward rather than accurately reflecting its reasoning. To address this issue, we propose enriching the RM's input with a causal attribution of the prediction, allowing the RM to detect discrepancies between the generated self-explanation and the model's decision process. In controlled settings, we show that this approach reduces the tendency of the LLM to generate misleading explanations.