2025-04-30

Title: It's the same but not the same: Do LLMs distinguish Spanish varieties?

Authors: Marina Mayor-Rocher, Cristina Pozo, Nina Melero, Gonzalo Martínez, María Grandury, Pedro Reviriego
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.20049
Pdf URL: https://arxiv.org/pdf/2504.20049
Copy Paste: [[2504.20049]] It's the same but not the same: Do LLMs distinguish Spanish varieties?(https://arxiv.org/abs/2504.20049)
Keywords: language model, gpt, llm
Abstract: In recent years, large language models (LLMs) have demonstrated a high capacity for understanding and generating text in Spanish. However, with five hundred million native speakers, Spanish is not a homogeneous language but rather one rich in diatopic variations spanning both sides of the Atlantic. For this reason, in this study, we evaluate the ability of nine language models to identify and distinguish the morphosyntactic and lexical peculiarities of seven varieties of Spanish (Andean, Antillean, Continental Caribbean, Chilean, Peninsular, Mexican and Central American and Rioplatense) through a multiple-choice test. The results indicate that the Peninsular Spanish variety is the best identified by all models and that, among them, GPT-4o is the only model capable of recognizing the variability of the Spanish language. -- En los últimos años, los grandes modelos de lenguaje (LLMs, por sus siglas en inglés) han demostrado una alta capacidad para comprender y generar texto en español. Sin embargo, con quinientos millones de hablantes nativos, la española no es una lengua homogénea, sino rica en variedades diatópicas que se extienden a ambos lados del Atlántico. Por todo ello, evaluamos en este trabajo la capacidad de nueve modelos de lenguaje de identificar y discernir las peculiaridades morfosintácticas y léxicas de siete variedades de español (andino, antillano, caribeño continental, chileno, español peninsular, mexicano y centroamericano y rioplatense) mediante un test de respuesta múltiple. Los resultados obtenidos indican que la variedad de español peninsular es la mejor identificada por todos los modelos y que, de entre todos, GPT-4o es el único modelo capaz de identificar la variabilidad de la lengua española.

Title: Evaluating Large Language Models on Multiword Expressions in Multilingual and Code-Switched Contexts

Authors: Frances Laureano De Leon, Harish Tayyar Madabushi, Mark G. Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.20051
Pdf URL: https://arxiv.org/pdf/2504.20051
Copy Paste: [[2504.20051]] Evaluating Large Language Models on Multiword Expressions in Multilingual and Code-Switched Contexts(https://arxiv.org/abs/2504.20051)
Keywords: language model, gpt
Abstract: Multiword expressions, characterised by non-compositional meanings and syntactic irregularities, are an example of nuanced language. These expressions can be used literally or idiomatically, leading to significant changes in meaning. While large language models have demonstrated strong performance across many tasks, their ability to handle such linguistic subtleties remains uncertain. Therefore, this study evaluates how state-of-the-art language models process the ambiguity of potentially idiomatic multiword expressions, particularly in contexts that are less frequent, where models are less likely to rely on memorisation. By evaluating models in Portuguese and Galician, in addition to English, and using a novel code-switched dataset and a novel task, we find that large language models, despite their strengths, struggle with nuanced language. In particular, we find that the latest models, including GPT-4, fail to outperform the xlm-roBERTa-base baselines in both detection and semantic tasks, with especially poor performance on the novel tasks we introduce, despite their similarity to existing tasks. Overall, our results demonstrate that multiword expressions, especially those which are ambiguous, continue to be a challenge to models.

Title: Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models

Authors: Zae Myung Kim, Chanwoo Park, Vipul Raheja, Dongyeop Kang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.20157
Pdf URL: https://arxiv.org/pdf/2504.20157
Copy Paste: [[2504.20157]] Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models(https://arxiv.org/abs/2504.20157)
Keywords: language model, llm, prompt
Abstract: Reward-based alignment methods for large language models (LLMs) face two key limitations: vulnerability to reward hacking, where models exploit flaws in the reward signal; and reliance on brittle, labor-intensive prompt engineering when LLMs are used as reward models. We introduce Meta Policy Optimization (MPO), a framework that addresses these challenges by integrating a meta-reward model that dynamically refines the reward model's prompt throughout training. In MPO, the meta-reward model monitors the evolving training context and continuously adjusts the reward model's prompt to maintain high alignment, providing an adaptive reward signal that resists exploitation by the policy. This meta-learning approach promotes a more stable policy optimization, and greatly reduces the need for manual reward prompt design. It yields performance on par with or better than models guided by extensively hand-crafted reward prompts. Furthermore, we show that MPO maintains its effectiveness across diverse tasks, such as question answering and mathematical reasoning, without requiring specialized reward designs. Beyond standard RLAIF, MPO's meta-learning formulation is readily extensible to higher-level alignment frameworks. Overall, this method addresses theoretical and practical challenges in reward-based RL alignment for LLMs, paving the way for more robust and adaptable alignment strategies. The code and models will be publicly shared.

Title: MICE for CATs: Model-Internal Confidence Estimation for Calibrating Agents with Tools

Authors: Nishant Subramani, Jason Eisner, Justin Svegliato, Benjamin Van Durme, Yu Su, Sam Thomson
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.20168
Pdf URL: https://arxiv.org/pdf/2504.20168
Copy Paste: [[2504.20168]] MICE for CATs: Model-Internal Confidence Estimation for Calibrating Agents with Tools(https://arxiv.org/abs/2504.20168)
Keywords: language model, agent
Abstract: Tool-using agents that act in the world need to be both useful and safe. Well-calibrated model confidences can be used to weigh the risk versus reward of potential actions, but prior work shows that many models are poorly calibrated. Inspired by interpretability literature exploring the internals of models, we propose a novel class of model-internal confidence estimators (MICE) to better assess confidence when calling tools. MICE first decodes from each intermediate layer of the language model using logitLens and then computes similarity scores between each layer's generation and the final output. These features are fed into a learned probabilistic classifier to assess confidence in the decoded output. On the simulated trial and error (STE) tool-calling dataset using Llama3 models, we find that MICE beats or matches the baselines on smoothed expected calibration error. Using MICE confidences to determine whether to call a tool significantly improves over strong baselines on a new metric, expected tool-calling utility. Further experiments show that MICE is sample-efficient, can generalize zero-shot to unseen APIs, and results in higher tool-calling utility in scenarios with varying risk levels. Our code is open source, available at this https URL.
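The feature-extraction and classification steps described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the per-layer decodings are already available as strings (the logit-lens decoding itself is not shown), uses `difflib.SequenceMatcher` as a stand-in similarity score because the abstract does not name one, and uses hypothetical classifier weights in place of a learned classifier.

```python
import math
from difflib import SequenceMatcher


def layer_similarity_features(layer_decodings, final_output):
    """Return one similarity score in [0, 1] per intermediate layer.

    SequenceMatcher is only a stand-in for the similarity measure.
    """
    return [
        SequenceMatcher(None, text, final_output).ratio()
        for text in layer_decodings
    ]


def confidence(features, weights, bias):
    """Logistic model mapping similarity features to a confidence in (0, 1)."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical decodings from three intermediate layers of a tool call.
layer_decodings = [
    "get_weather(",
    "get_weather(city='Par",
    "get_weather(city='Paris')",
]
final_output = "get_weather(city='Paris')"

features = layer_similarity_features(layer_decodings, final_output)

# Hypothetical weights; in practice these would be fit on labelled tool calls.
weights = [0.5, 1.0, 2.0]
bias = -1.5
p = confidence(features, weights, bias)
print(features, p)
```

A caller could compare `p` against a risk-dependent threshold to decide whether to execute the tool call or abstain.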

Title: A Multimodal Pipeline for Clinical Data Extraction: Applying Vision-Language Models to Scans of Transfusion Reaction Reports

Authors: Henning Schäfer, Cynthia S. Schmidt, Johannes Wutzkowsky, Kamil Lorek, Lea Reinartz, Johannes Rückert, Christian Temme, Britta Böckmann, Peter A. Horn, Christoph M. Friedrich
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2504.20220
Pdf URL: https://arxiv.org/pdf/2504.20220
Copy Paste: [[2504.20220]] A Multimodal Pipeline for Clinical Data Extraction: Applying Vision-Language Models to Scans of Transfusion Reaction Reports(https://arxiv.org/abs/2504.20220)
Keywords: language model
Abstract: Despite the growing adoption of electronic health records, many processes still rely on paper documents, reflecting the heterogeneous real-world conditions in which healthcare is delivered. The manual transcription process is time-consuming and prone to errors when transferring paper-based data to digital formats. To streamline this workflow, this study presents an open-source pipeline that extracts and categorizes checkbox data from scanned documents. Demonstrated on transfusion reaction reports, the design supports adaptation to other checkbox-rich document types. The proposed method integrates checkbox detection, multilingual optical character recognition (OCR) and multilingual vision-language models (VLMs). The pipeline achieves high precision and recall when compared against annually compiled gold standards from 2017 to 2024. The result is a reduction in administrative workload and accurate regulatory reporting. The open-source availability of this pipeline encourages self-hosted parsing of checkbox forms.

Title: Enhancing Systematic Reviews with Large Language Models: Using GPT-4 and Kimi

Authors: Dandan Chen Kaptur, Yue Huang, Xuejun Ryan Ji, Yanhui Guo, Bradley Kaptur
Subjects: cs.CL, stat.AP
Abstract URL: https://arxiv.org/abs/2504.20276
Pdf URL: https://arxiv.org/pdf/2504.20276
Copy Paste: [[2504.20276]] Enhancing Systematic Reviews with Large Language Models: Using GPT-4 and Kimi(https://arxiv.org/abs/2504.20276)
Keywords: language model, gpt, llm
Abstract: This research delved into GPT-4 and Kimi, two Large Language Models (LLMs), for systematic reviews. We evaluated their performance by comparing LLM-generated codes with human-generated codes from a peer-reviewed systematic review on assessment. Our findings suggested that the performance of LLMs fluctuates by data volume and question complexity for systematic reviews.

Title: Local Prompt Optimization

Authors: Yash Jain, Vishal Chowdhary
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.20355
Pdf URL: https://arxiv.org/pdf/2504.20355
Copy Paste: [[2504.20355]] Local Prompt Optimization(https://arxiv.org/abs/2504.20355)
Keywords: language model, llm, prompt
Abstract: In recent years, the use of prompts to guide the output of Large Language Models has increased dramatically. However, even the best experts struggle to choose the correct words to stitch together a prompt for the desired task. To solve this, LLM-driven prompt optimization has emerged as an important problem. Existing prompt optimization methods optimize a prompt globally, wherein all the prompt tokens have to be optimized over a large vocabulary while solving a complex task. The large optimization space (tokens) leads to insufficient guidance for a better prompt. In this work, we introduce Local Prompt Optimization (LPO), which integrates with any general automatic prompt engineering method. We identify the optimization tokens in a prompt and nudge the LLM to focus only on those tokens in its optimization step. We observe remarkable performance improvements on Math Reasoning (GSM8k and MultiArith) and BIG-bench Hard benchmarks across various automatic prompt engineering methods. Further, we show that LPO converges to the optimal prompt faster than global methods.

Title: What Causes Knowledge Loss in Multilingual Language Models?

Authors: Maria Khelli, Samuel Cahyawijaya, Ayu Purwarianti, Genta Indra Winata
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.20356
Pdf URL: https://arxiv.org/pdf/2504.20356
Copy Paste: [[2504.20356]] What Causes Knowledge Loss in Multilingual Language Models?(https://arxiv.org/abs/2504.20356)
Keywords: language model
Abstract: Cross-lingual transfer in natural language processing (NLP) models enhances multilingual performance by leveraging shared linguistic knowledge. However, traditional methods that process all data simultaneously often fail to mimic real-world scenarios, leading to challenges like catastrophic forgetting, where fine-tuning on new tasks degrades performance on previously learned ones. Our study explores this issue in multilingual contexts, focusing on linguistic differences affecting representational learning rather than just model parameters. We experiment with 52 languages using LoRA adapters of varying ranks to evaluate non-shared, partially shared, and fully shared parameters. Our aim is to see if parameter sharing through adapters can mitigate forgetting while preserving prior knowledge. We find that languages using non-Latin scripts are more susceptible to catastrophic forgetting, whereas those written in Latin script facilitate more effective cross-lingual transfer.
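To make "LoRA adapters of varying ranks" concrete, the sketch below shows how the rank controls adapter size under the standard LoRA formulation, where a weight update is factored as a `d_out x rank` matrix times a `rank x d_in` matrix. The layer dimensions and the ranks are hypothetical and are not taken from the paper.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in one LoRA adapter.

    A has shape (rank, d_in) and B has shape (d_out, rank),
    so the total is rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


# Hypothetical square projection layer.
D_IN = D_OUT = 768
full = D_IN * D_OUT

counts = {rank: lora_param_count(D_IN, D_OUT, rank) for rank in (4, 16, 64)}
for rank, n in counts.items():
    print(f"rank={rank:>2}: {n} adapter parameters ({n / full:.1%} of the full matrix)")
```

Higher ranks give the adapter more capacity per language at the cost of more trainable parameters, which is the trade-off that varying the rank explores.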

Title: DMDTEval: An Evaluation and Analysis of LLMs on Disambiguation in Multi-domain Translation

Authors: Zhibo Man, Yuanmeng Chen, Yujie Zhang, Yufeng Chen, Jinan Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.20371
Pdf URL: https://arxiv.org/pdf/2504.20371
Copy Paste: [[2504.20371]] DMDTEval: An Evaluation and Analysis of LLMs on Disambiguation in Multi-domain Translation(https://arxiv.org/abs/2504.20371)
Keywords: language model, llm, prompt
Abstract: Currently, Large Language Models (LLMs) have achieved remarkable results in machine translation. However, their performance in multi-domain translation (MDT) is less satisfactory; the meanings of words can vary across different domains, highlighting the significant ambiguity inherent in MDT. Therefore, evaluating the disambiguation ability of LLMs in MDT remains an open problem. To this end, we present an evaluation and analysis of LLMs on disambiguation in multi-domain translation (DMDTEval), our systematic evaluation framework consisting of three critical aspects: (1) we construct a translation test set with multi-domain ambiguous word annotation, (2) we curate a diverse set of disambiguation prompting templates, and (3) we design precise disambiguation metrics, and study the efficacy of various prompting strategies on multiple state-of-the-art LLMs. Our extensive experiments reveal a number of crucial findings that we believe will pave the way and also facilitate further research in the critical area of improving the disambiguation of LLMs.

Title: On Psychology of AI -- Does Primacy Effect Affect ChatGPT and Other LLMs?

Authors: Mika Hämäläinen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.20444
Pdf URL: https://arxiv.org/pdf/2504.20444
Copy Paste: [[2504.20444]] On Psychology of AI -- Does Primacy Effect Affect ChatGPT and Other LLMs?(https://arxiv.org/abs/2504.20444)
Keywords: gpt, llm, prompt, chat
Abstract: We study the primacy effect in three commercial LLMs: ChatGPT, Gemini and Claude. We do this by repurposing the famous experiment that Asch (1946) conducted using human subjects. The experiment is simple: given two candidates with equal descriptions, which one is preferred if one description lists positive adjectives before negative ones and the other lists negative adjectives followed by positive ones? We test this in two experiments. In one experiment, LLMs are given both candidates simultaneously in the same prompt, and in another experiment, LLMs are given both candidates separately. We test all the models with 200 candidate pairs. We found that, in the first experiment, ChatGPT preferred the candidate with positive adjectives listed first, while Gemini preferred both equally often. Claude refused to make a choice. In the second experiment, ChatGPT and Claude were most likely to rank both candidates equally. In the cases where they did not give an equal rating, both showed a clear preference for the candidate that had negative adjectives listed first. Gemini was most likely to prefer a candidate with negative adjectives listed first.
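The two experimental conditions in this abstract can be sketched as prompt construction. This is only an illustration under assumptions: the adjectives, the prompt wording, and the 1-to-5 rating scale are hypothetical and are not the authors' materials.

```python
def make_candidate_pair(positive, negative):
    """Build two descriptions with the same adjectives in opposite order."""
    positive_first = ", ".join(positive + negative)
    negative_first = ", ".join(negative + positive)
    return positive_first, negative_first


def joint_prompt(a, b):
    """Condition 1: both candidates appear in the same prompt."""
    return (
        f"Candidate A is described as: {a}.\n"
        f"Candidate B is described as: {b}.\n"
        "Which candidate do you prefer?"
    )


def separate_prompts(a, b):
    """Condition 2: each candidate is rated in its own prompt."""
    template = "A candidate is described as: {}.\nRate this candidate from 1 to 5."
    return template.format(a), template.format(b)


# Illustrative adjectives only.
positive = ["intelligent", "industrious"]
negative = ["stubborn", "envious"]

a, b = make_candidate_pair(positive, negative)
print(joint_prompt(a, b))
for prompt in separate_prompts(a, b):
    print(prompt)
```

Because both descriptions contain exactly the same adjectives, any systematic preference between them can only come from the order of presentation.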

Title: Team ACK at SemEval-2025 Task 2: Beyond Word-for-Word Machine Translation for English-Korean Pairs

Authors: Daniel Lee, Harsh Sharma, Jieun Han, Sunny Jeong, Alice Oh, Vered Shwartz
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2504.20451
Pdf URL: https://arxiv.org/pdf/2504.20451
Copy Paste: [[2504.20451]] Team ACK at SemEval-2025 Task 2: Beyond Word-for-Word Machine Translation for English-Korean Pairs(https://arxiv.org/abs/2504.20451)
Keywords: llm
Abstract: Translating knowledge-intensive and entity-rich text between English and Korean requires transcreation to preserve language-specific and cultural nuances beyond literal, phonetic or word-for-word conversion. We evaluate 13 models (LLMs and MT models) using automatic metrics and human assessment by bilingual annotators. Our findings show that LLMs outperform traditional MT systems but struggle with entity translation requiring cultural adaptation. By constructing an error taxonomy, we identify incorrect responses and entity name errors as key issues, with performance varying by entity type and popularity level. This work exposes gaps in automatic evaluation metrics, and we hope it enables future work on culturally nuanced machine translation.

Title: Fane at SemEval-2025 Task 10: Zero-Shot Entity Framing with Large Language Models

Authors: Enfa Fane, Mihai Surdeanu, Eduardo Blanco, Steven R. Corman
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2504.20469
Pdf URL: https://arxiv.org/pdf/2504.20469
Copy Paste: [[2504.20469]] Fane at SemEval-2025 Task 10: Zero-Shot Entity Framing with Large Language Models(https://arxiv.org/abs/2504.20469)
Keywords: language model, llm, prompt
Abstract: Understanding how news narratives frame entities is crucial for studying media's impact on societal perceptions of events. In this paper, we evaluate the zero-shot capabilities of large language models (LLMs) in classifying framing roles. Through systematic experimentation, we assess the effects of input context, prompting strategies, and task decomposition. Our findings show that a hierarchical approach of first identifying broad roles and then fine-grained roles outperforms single-step classification. We also demonstrate that optimal input contexts and prompts vary across task levels, highlighting the need for subtask-specific strategies. We achieve a Main Role Accuracy of 89.4% and an Exact Match Ratio of 34.5%, demonstrating the effectiveness of our approach. Our findings emphasize the importance of tailored prompt design and input context optimization for improving LLM performance in entity framing.
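The two reported metrics can be illustrated with a small sketch. The definitions here are assumptions based on common usage, since the abstract does not define them: Main Role Accuracy is taken as the share of entities whose broad role is correct, and Exact Match Ratio as the share whose predicted set of fine-grained roles equals the gold set exactly. The role labels and data are illustrative only.

```python
def main_role_accuracy(gold, pred):
    """Share of entities whose predicted main role equals the gold main role."""
    return sum(g["main"] == p["main"] for g, p in zip(gold, pred)) / len(gold)


def exact_match_ratio(gold, pred):
    """Share of entities whose predicted fine-grained role set equals the gold set."""
    return sum(set(g["fine"]) == set(p["fine"]) for g, p in zip(gold, pred)) / len(gold)


# Illustrative labels; not data from the shared task.
gold = [
    {"main": "Antagonist", "fine": ["Tyrant"]},
    {"main": "Protagonist", "fine": ["Guardian", "Peacemaker"]},
    {"main": "Innocent", "fine": ["Victim"]},
    {"main": "Antagonist", "fine": ["Deceiver"]},
]
pred = [
    {"main": "Antagonist", "fine": ["Tyrant"]},
    {"main": "Protagonist", "fine": ["Guardian"]},
    {"main": "Innocent", "fine": ["Victim"]},
    {"main": "Protagonist", "fine": ["Guardian"]},
]

print(main_role_accuracy(gold, pred))  # 0.75
print(exact_match_ratio(gold, pred))   # 0.5
```

The second entity shows why the two numbers can diverge sharply: the main role is right, but missing one fine-grained label makes the whole prediction count as a miss under exact match.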

Title: Enhancing LLM Language Adaption through Cross-lingual In-Context Pre-training

Authors: Linjuan Wu, Haoran Wei, Huan Lin, Tianhao Li, Baosong Yang, Weiming Lu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.20484
Pdf URL: https://arxiv.org/pdf/2504.20484
Copy Paste: [[2504.20484]] Enhancing LLM Language Adaption through Cross-lingual In-Context Pre-training(https://arxiv.org/abs/2504.20484)
Keywords: language model, llm
Abstract: Large language models (LLMs) exhibit remarkable multilingual capabilities despite English-dominated pre-training, attributed to cross-lingual mechanisms during pre-training. Existing methods for enhancing cross-lingual transfer remain constrained by parallel resources, suffering from limited linguistic and domain coverage. We propose Cross-lingual In-context Pre-training (CrossIC-PT), a simple and scalable approach that enhances cross-lingual transfer by leveraging semantically related bilingual texts via simple next-word prediction. We construct CrossIC-PT samples by interleaving semantically related bilingual Wikipedia documents into a single context window. To address window size constraints, we implement a systematic segmentation policy to split long bilingual document pairs into chunks while adjusting the sliding window mechanism to preserve contextual coherence. We further extend data availability through a semantic retrieval framework to construct CrossIC-PT samples from a web-crawled corpus. Experimental results demonstrate that CrossIC-PT improves multilingual performance on three models (Llama-3.1-8B, Qwen2.5-7B, and Qwen2.5-1.5B) across six target languages, yielding performance gains of 3.79%, 3.99%, and 1.95%, respectively, with additional improvements after data augmentation.
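The idea of splitting a long bilingual document pair into chunks and interleaving them into fixed-size context windows can be sketched as below. This is a simplified assumption about the procedure, not the paper's segmentation policy: it gives each language half of the window, pairs chunks by position, and omits the sliding-window adjustment mentioned in the abstract. All names are hypothetical.

```python
def split_into_chunks(tokens, chunk_size):
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def interleave_pair(src_tokens, tgt_tokens, window_size):
    """Build windows that each hold one source chunk followed by one target chunk.

    Each language gets half of the window; leftover chunks from the
    longer document form windows on their own.
    """
    half = window_size // 2
    src_chunks = split_into_chunks(src_tokens, half)
    tgt_chunks = split_into_chunks(tgt_tokens, half)
    windows = []
    for i in range(max(len(src_chunks), len(tgt_chunks))):
        window = []
        if i < len(src_chunks):
            window += src_chunks[i]
        if i < len(tgt_chunks):
            window += tgt_chunks[i]
        windows.append(window)
    return windows


# Toy "documents": five source tokens and three target tokens.
src = ["e0", "e1", "e2", "e3", "e4"]
tgt = ["s0", "s1", "s2"]
windows = interleave_pair(src, tgt, window_size=4)
print(windows)
```

Each resulting window can then be used as an ordinary next-token prediction sample, so the model sees related text in both languages within one context.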

Title: UniDetox: Universal Detoxification of Large Language Models via Dataset Distillation

Authors: Huimin Lu, Masaru Isonuma, Junichiro Mori, Ichiro Sakata
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2504.20500
Pdf URL: https://arxiv.org/pdf/2504.20500
Copy Paste: [[2504.20500]] UniDetox: Universal Detoxification of Large Language Models via Dataset Distillation(https://arxiv.org/abs/2504.20500)
Keywords: language model, gpt, llm
Abstract: We present UniDetox, a universally applicable method designed to mitigate toxicity across various large language models (LLMs). Previous detoxification methods are typically model-specific, addressing only individual models or model families, and require careful hyperparameter tuning due to the trade-off between detoxification efficacy and language modeling performance. In contrast, UniDetox provides a detoxification technique that can be universally applied to a wide range of LLMs without the need for separate model-specific tuning. Specifically, we propose a novel and efficient dataset distillation technique for detoxification using contrastive decoding. This approach distills detoxifying representations in the form of synthetic text data, enabling universal detoxification of any LLM through fine-tuning with the distilled text. Our experiments demonstrate that the detoxifying text distilled from GPT-2 can effectively detoxify larger models, including OPT, Falcon, and LLaMA-2. Furthermore, UniDetox eliminates the need for separate hyperparameter tuning for each model, as a single hyperparameter configuration can be seamlessly applied across different models. Additionally, analysis of the detoxifying text reveals a reduction in politically biased content, providing insights into the attributes necessary for effective detoxification of LLMs.
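Contrastive decoding, which this abstract uses to distill detoxifying text, can be illustrated with a toy next-token choice. The sketch shows a generic form of the technique (base log-probability minus a weighted log-probability from a contrast model), not the exact formulation in the paper. The two distributions, the vocabulary, and `alpha` are hypothetical.

```python
import math


def contrastive_next_token(p_base, p_toxic, alpha=1.0):
    """Pick the token maximising log p_base - alpha * log p_toxic."""
    scores = {
        tok: math.log(p_base[tok]) - alpha * math.log(p_toxic[tok])
        for tok in p_base
    }
    return max(scores, key=scores.get), scores


# Hypothetical next-token distributions over a three-word vocabulary.
p_base = {"stupid": 0.40, "kind": 0.35, "tall": 0.25}
p_toxic = {"stupid": 0.80, "kind": 0.05, "tall": 0.15}

token, scores = contrastive_next_token(p_base, p_toxic)
greedy = max(p_base, key=p_base.get)
print(greedy, token)
```

Greedy decoding from the base distribution alone picks the word the contrast model also favours, while the contrastive score steers generation toward a word the contrast model considers unlikely.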

Title: Revisiting the MIMIC-IV Benchmark: Experiments Using Language Models for Electronic Health Records

Authors: Jesus Lovon (IRIT-IRIS), Thouria Ben-Haddi, Jules Di Scala, Jose G. Moreno (IRIT-IRIS), Lynda Tamine (IRIT-IRIS)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.20547
Pdf URL: https://arxiv.org/pdf/2504.20547
Copy Paste: [[2504.20547]] Revisiting the MIMIC-IV Benchmark: Experiments Using Language Models for Electronic Health Records(https://arxiv.org/abs/2504.20547)
Keywords: language model, llm
Abstract: The lack of standardized evaluation benchmarks in the medical domain for text inputs can be a barrier to widely adopting and leveraging the potential of natural language models for health-related downstream tasks. This paper revisits an openly available MIMIC-IV benchmark for electronic health records (EHRs) to address this issue. First, we integrate the MIMIC-IV data within the Hugging Face datasets library to allow easy sharing and use of this collection. Second, we investigate the application of templates to convert EHR tabular data to text. Experiments using fine-tuned and zero-shot LLMs on the patient mortality task show that fine-tuned text-based models are competitive against robust tabular classifiers. In contrast, zero-shot LLMs struggle to leverage EHR representations. This study underlines the potential of text-based approaches in the medical field and highlights areas for further improvement.

Title: BrAIcht, a theatrical agent that speaks like Bertolt Brecht's characters

Authors: Baz Roland, Kristina Malyseva, Anna Pappa (LIASD), Tristan Cazenave (APA)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.20552
Pdf URL: https://arxiv.org/pdf/2504.20552
Copy Paste: [[2504.20552]] BrAIcht, a theatrical agent that speaks like Bertolt Brecht's characters(https://arxiv.org/abs/2504.20552)
Keywords: language model, agent
Abstract: This project introduces BrAIcht, an AI conversational agent that creates dialogues in the distinctive style of the famous German playwright Bertolt Brecht. BrAIcht is fine-tuned using German LeoLM, a large language model with 7 billion parameters and a modified version of the base Llama2 suitable for German language tasks. For fine-tuning, 29 plays by Bertolt Brecht and 907 other German plays that are stylistically similar to Bertolt Brecht are used to form a more diverse dataset. Due to limited memory capacity, a parameter-efficient fine-tuning technique called QLoRA is implemented to train the large language model. The results, based on BLEU score and perplexity, show very promising performance of BrAIcht in generating dialogues in the style of Bertolt Brecht.

Title: TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models

Authors: Mihai Nadas, Laura Diosan, Andrei Piscoran, Andreea Tomescu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.20605
Pdf URL: https://arxiv.org/pdf/2504.20605
Copy Paste: [[2504.20605]] TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models(https://arxiv.org/abs/2504.20605)
Keywords: language model, gpt, prompt
Abstract: Moral stories are a time-tested vehicle for transmitting values, yet modern NLP lacks a large, structured corpus that couples coherent narratives with explicit ethical lessons. We close this gap with TF1-EN-3M, the first open dataset of three million English-language fables generated exclusively by instruction-tuned models no larger than 8B parameters. Each story follows a six-slot scaffold (character -> trait -> setting -> conflict -> resolution -> moral), produced through a combinatorial prompt engine that guarantees genre fidelity while covering a broad thematic space. A hybrid evaluation pipeline blends (i) a GPT-based critic that scores grammar, creativity, moral clarity, and template adherence with (ii) reference-free diversity and readability metrics. Among ten open-weight candidates, an 8B-parameter Llama-3 variant delivers the best quality-speed trade-off, producing high-scoring fables on a single consumer GPU (<24 GB VRAM) at approximately 13.5 cents per 1,000 fables. We release the dataset, generation code, evaluation scripts, and full metadata under a permissive license, enabling exact reproducibility and cost benchmarking. TF1-EN-3M opens avenues for research in instruction following, narrative intelligence, value alignment, and child-friendly educational AI, demonstrating that large-scale moral storytelling no longer requires proprietary giant models.
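A combinatorial prompt engine over the six-slot scaffold (character, trait, setting, conflict, resolution, moral) can be sketched with a Cartesian product. The slot values and the prompt template below are hypothetical examples and are not taken from the released generation code.

```python
from itertools import product

# Hypothetical slot values; the real lists would be much larger.
SLOTS = {
    "character": ["a fox", "a tortoise"],
    "trait": ["greedy", "patient"],
    "setting": ["a forest"],
    "conflict": ["a drought"],
    "resolution": ["sharing water"],
    "moral": ["generosity is rewarded"],
}

TEMPLATE = (
    "Write a short fable about {character} who is {trait}, set in {setting}. "
    "The conflict is {conflict}, resolved by {resolution}. "
    "End with the moral: {moral}."
)


def build_prompts(slots):
    """Yield one prompt per combination of slot values."""
    names = list(slots)
    for values in product(*(slots[name] for name in names)):
        yield TEMPLATE.format(**dict(zip(names, values)))


prompts = list(build_prompts(SLOTS))
print(len(prompts))
print(prompts[0])
```

The number of prompts is the product of the slot list sizes, so modest lists per slot are enough to reach millions of distinct combinations.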
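
The six-slot scaffold described in this abstract can be illustrated with a minimal Python sketch of a combinatorial prompt builder. The slot values, the prompt wording, and the names `SLOTS` and `build_prompts` are hypothetical placeholders chosen for illustration; the abstract does not give the authors' vocabularies or templates.

```python
import itertools

# Hypothetical slot values (two per slot); not taken from the paper.
SLOTS = {
    "character": ["a fox", "a tortoise"],
    "trait": ["greedy", "patient"],
    "setting": ["a forest", "a riverbank"],
    "conflict": ["a shortage of food", "a race against a rival"],
    "resolution": ["sharing with others", "steady effort"],
    "moral": ["generosity is rewarded", "persistence beats haste"],
}

# Hypothetical prompt wording that mentions every slot once.
TEMPLATE = (
    "Write a short fable about {character} who is {trait}, set in {setting}. "
    "The conflict is {conflict}, resolved through {resolution}. "
    "End with the moral: {moral}."
)


def build_prompts(slots):
    """Yield one prompt for every combination of slot values."""
    names = list(slots)
    for combo in itertools.product(*(slots[name] for name in names)):
        yield TEMPLATE.format(**dict(zip(names, combo)))


prompts = list(build_prompts(SLOTS))
print(len(prompts), "prompts, first one:")
print(prompts[0])
```

With two values per slot the sketch enumerates 2^6 = 64 distinct prompts; larger vocabularies grow the space multiplicatively, which is presumably how a broad thematic space is covered.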

Title: WenyanGPT: A Large Language Model for Classical Chinese Tasks

Authors: Xinyu Yao, Mengdi Wang, Bo Chen, Xiaobing Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.20609
Pdf URL: https://arxiv.org/pdf/2504.20609
Copy Paste: [[2504.20609]] WenyanGPT: A Large Language Model for Classical Chinese Tasks(https://arxiv.org/abs/2504.20609)
Keywords: language model, gpt, llm
Abstract: Classical Chinese, as the core carrier of Chinese culture, plays a crucial role in the inheritance and study of ancient literature. However, existing natural language processing models primarily optimize for Modern Chinese, resulting in inadequate performance on Classical Chinese. This paper presents a comprehensive solution for Classical Chinese language processing. By continuing pre-training and instruction fine-tuning on the LLaMA3-8B-Chinese model, we construct a large language model, WenyanGPT, which is specifically designed for Classical Chinese tasks. Additionally, we develop an evaluation benchmark dataset, WenyanBENCH. Experimental results on WenyanBENCH demonstrate that WenyanGPT significantly outperforms current advanced LLMs in various Classical Chinese tasks. We make the model's training data, instruction fine-tuning data, and evaluation benchmark dataset publicly available to promote further research and development in the field of Classical Chinese processing.

Title: Cooking Up Creativity: A Cognitively-Inspired Approach for Enhancing LLM Creativity through Structured Representations

Authors: Moran Mizrahi, Chen Shani, Gabriel Stanovsky, Dan Jurafsky, Dafna Shahaf
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.20643
Pdf URL: https://arxiv.org/pdf/2504.20643
Copy Paste: [[2504.20643]] Cooking Up Creativity: A Cognitively-Inspired Approach for Enhancing LLM Creativity through Structured Representations(https://arxiv.org/abs/2504.20643)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) excel at countless tasks, yet struggle with creativity. In this paper, we introduce a novel approach that couples LLMs with structured representations and cognitively inspired manipulations to generate more creative and diverse ideas. Our notion of creativity goes beyond superficial token-level variations; rather, we explicitly recombine structured representations of existing ideas, allowing our algorithm to effectively explore the more abstract landscape of ideas. We demonstrate our approach in the culinary domain with DishCOVER, a model that generates creative recipes. Experiments comparing our model's results to those of GPT-4o show greater diversity. Domain expert evaluations reveal that our outputs, which are mostly coherent and feasible culinary creations, significantly surpass GPT-4o in terms of novelty, thus outperforming it in creative generation. We hope our work inspires further research into structured creativity in AI.

Title: A Generative-AI-Driven Claim Retrieval System Capable of Detecting and Retrieving Claims from Social Media Platforms in Multiple Languages

Authors: Ivan Vykopal, Martin Hyben, Robert Moro, Michal Gregor, Jakub Simko
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.20668
Pdf URL: https://arxiv.org/pdf/2504.20668
Copy Paste: [[2504.20668]] A Generative-AI-Driven Claim Retrieval System Capable of Detecting and Retrieving Claims from Social Media Platforms in Multiple Languages(https://arxiv.org/abs/2504.20668)
Keywords: language model, llm
Abstract: Online disinformation poses a global challenge, placing significant demands on fact-checkers who must verify claims efficiently to prevent the spread of false information. A major issue in this process is the redundant verification of already fact-checked claims, which increases workload and delays responses to newly emerging claims. This research introduces an approach that retrieves previously fact-checked claims, evaluates their relevance to a given input, and provides supplementary information to support fact-checkers. Our method employs large language models (LLMs) to filter irrelevant fact-checks and generate concise summaries and explanations, enabling fact-checkers to assess more quickly whether a claim has been verified before. In addition, we evaluate our approach through both automatic and human assessments, where humans interact with the developed tool to review its effectiveness. Our results demonstrate that LLMs are able to filter out many irrelevant fact-checks and, therefore, reduce effort and streamline the fact-checking process.

Title: Are Information Retrieval Approaches Good at Harmonising Longitudinal Survey Questions in Social Science?

Authors: Wing Yan Li, Zeqiang Wang, Jon Johnson, Suparna De
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2504.20679
Pdf URL: https://arxiv.org/pdf/2504.20679
Copy Paste: [[2504.20679]] Are Information Retrieval Approaches Good at Harmonising Longitudinal Survey Questions in Social Science?(https://arxiv.org/abs/2504.20679)
Keywords: language model
Abstract: Automated detection of semantically equivalent questions in longitudinal social science surveys is crucial for long-term studies informing empirical research in the social, economic, and health sciences. Retrieving equivalent questions faces dual challenges: inconsistent representation of theoretical constructs (i.e. concept/sub-concept) across studies as well as between question and response options, and the evolution of vocabulary and structure in longitudinal text. To address these challenges, our multi-disciplinary collaboration of computer scientists and survey specialists presents a new information retrieval (IR) task of identifying concept (e.g. Housing, Job, etc.) equivalence across question and response options to harmonise longitudinal population studies. This paper investigates multiple unsupervised approaches on a survey dataset spanning 1946-2020, including probabilistic models, linear probing of language models, and pre-trained neural networks specialised for IR. We show that IR-specialised neural models achieve the highest overall performance with other approaches performing comparably. Additionally, the re-ranking of the probabilistic model's results with neural models only introduces modest improvements of 0.07 at most in F1-score. Qualitative post-hoc evaluation by survey specialists shows that models generally have a low sensitivity to questions with high lexical overlap, particularly in cases where sub-concepts are mismatched. Altogether, our analysis serves to further research on harmonising longitudinal studies in social science.

Title: Can LLMs Detect Intrinsic Hallucinations in Paraphrasing and Machine Translation?

Authors: Evangelia Gogoulou, Shorouq Zahra, Liane Guillou, Luise Dürlich, Joakim Nivre
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.20699
Pdf URL: https://arxiv.org/pdf/2504.20699
Copy Paste: [[2504.20699]] Can LLMs Detect Intrinsic Hallucinations in Paraphrasing and Machine Translation?(https://arxiv.org/abs/2504.20699)
Keywords: llm, hallucination, prompt
Abstract: A frequently observed problem with LLMs is their tendency to generate output that is nonsensical, illogical, or factually incorrect, often referred to broadly as hallucination. Building on the recently proposed HalluciGen task for hallucination detection and generation, we evaluate a suite of open-access LLMs on their ability to detect intrinsic hallucinations in two conditional generation tasks: translation and paraphrasing. We study how model performance varies across tasks and language and we investigate the impact of model size, instruction tuning, and prompt choice. We find that performance varies across models but is consistent across prompts. Finally, we find that NLI models perform comparably well, suggesting that LLM-based detectors are not the only viable option for this specific task.

Title: Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think

Authors: Hasan Abed Al Kader Hammoud, Hani Itani, Bernard Ghanem
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.20708
Pdf URL: https://arxiv.org/pdf/2504.20708
Copy Paste: [[2504.20708]] Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think(https://arxiv.org/abs/2504.20708)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) leverage step-by-step reasoning to solve complex problems. Standard evaluation practice involves generating a complete reasoning trace and assessing the correctness of the final answer presented at its conclusion. In this paper, we challenge the reliance on the final answer by posing the following two questions: Does the final answer reliably represent the model's optimal conclusion? Can alternative reasoning paths yield different results? To answer these questions, we analyze intermediate reasoning steps, termed subthoughts, and propose a method based on our findings. Our approach involves segmenting a reasoning trace into sequential subthoughts based on linguistic cues. We start by prompting the model to generate continuations from the end-point of each intermediate subthought. We extract a potential answer from every completed continuation originating from different subthoughts. We find that aggregating these answers by selecting the most frequent one (the mode) often yields significantly higher accuracy compared to relying solely on the answer derived from the original complete trace. Analyzing the consistency among the answers derived from different subthoughts reveals characteristics that correlate with the model's confidence and correctness, suggesting potential for identifying less reliable answers. Our experiments across various LLMs and challenging mathematical reasoning datasets (AIME2024 and AIME2025) show consistent accuracy improvements, with gains reaching up to 13% and 10% respectively. Implementation is available at: this https URL.
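
The aggregation step in this abstract, picking the most frequent answer across continuations, can be sketched in a few lines of Python. The function name `aggregate_by_mode` and the example answers are hypothetical; segmenting the trace and generating the continuations are assumed to have happened already and are not shown.

```python
from collections import Counter


def aggregate_by_mode(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one answer")
    return Counter(answers).most_common(1)[0][0]


# Hypothetical answers extracted from continuations of five subthoughts.
subthought_answers = ["42", "17", "42", "42", "17"]
last_answer = subthought_answers[-1]          # what standard evaluation would score
mode_answer = aggregate_by_mode(subthought_answers)
print("last:", last_answer, "mode:", mode_answer)
```

In this toy case the last answer and the mode disagree, which is the situation where the abstract reports that the mode is often the more accurate choice.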

Title: UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities

Authors: Woongyeong Yeo, Kangsan Kim, Soyeong Jeong, Jinheon Baek, Sung Ju Hwang
Subjects: cs.CL, cs.AI, cs.CV, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2504.20734
Pdf URL: https://arxiv.org/pdf/2504.20734
Copy Paste: [[2504.20734]] UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities(https://arxiv.org/abs/2504.20734)
Keywords: retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) has shown substantial promise in improving factual accuracy by grounding model responses with external knowledge relevant to queries. However, most existing RAG approaches are limited to a text-only corpus, and while recent efforts have extended RAG to other modalities such as images and videos, they typically operate over a single modality-specific corpus. In contrast, real-world queries vary widely in the type of knowledge they require, which a single type of knowledge source cannot address. To address this, we introduce UniversalRAG, a novel RAG framework designed to retrieve and integrate knowledge from heterogeneous sources with diverse modalities and granularities. Specifically, motivated by the observation that forcing all modalities into a unified representation space derived from a single combined corpus causes a modality gap, where the retrieval tends to favor items from the same modality as the query, we propose a modality-aware routing mechanism that dynamically identifies the most appropriate modality-specific corpus and performs targeted retrieval within it. Also, beyond modality, we organize each modality into multiple granularity levels, enabling fine-tuned retrieval tailored to the complexity and scope of the query. We validate UniversalRAG on 8 benchmarks spanning multiple modalities, showing its superiority over modality-specific and unified baselines.

Title: Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers

Authors: Roman Abramov, Felix Steinbauer, Gjergji Kasneci
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.20752
Pdf URL: https://arxiv.org/pdf/2504.20752
Copy Paste: [[2504.20752]] Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers(https://arxiv.org/abs/2504.20752)
Keywords: language model
Abstract: Transformers have achieved great success in numerous NLP tasks but continue to exhibit notable gaps in multi-step factual reasoning, especially when real-world knowledge is sparse. Recent advances in grokking have demonstrated that neural networks can transition from memorizing to perfectly generalizing once they detect underlying logical patterns - yet these studies have primarily used small, synthetic tasks. In this paper, for the first time, we extend grokking to real-world factual data and address the challenge of dataset sparsity by augmenting existing knowledge graphs with carefully designed synthetic data to raise the ratio $\phi_r$ of inferred facts to atomic facts above the threshold required for grokking. Surprisingly, we find that even factually incorrect synthetic data can strengthen emergent reasoning circuits rather than degrade accuracy, as it forces the model to rely on relational structure rather than memorization. When evaluated on multi-hop reasoning benchmarks, our approach achieves up to 95-100% accuracy on 2WikiMultiHopQA - substantially improving over strong baselines and matching or exceeding current state-of-the-art results. We further provide an in-depth analysis of how increasing $\phi_r$ drives the formation of generalizing circuits inside Transformers. Our findings suggest that grokking-based data augmentation can unlock implicit multi-hop reasoning capabilities, opening the door to more robust and interpretable factual reasoning in large-scale language models.
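
The ratio $\phi_r$ of inferred facts to atomic facts can be made concrete with a small Python sketch. The threshold value and the fact counts below are hypothetical (the abstract gives no numbers), and the sketch assumes that augmentation adds only inferred facts while the number of atomic facts stays fixed.

```python
import math


def phi_r(num_inferred, num_atomic):
    """Ratio of inferred facts to atomic facts."""
    if num_atomic <= 0:
        raise ValueError("num_atomic must be positive")
    return num_inferred / num_atomic


def extra_inferred_needed(num_inferred, num_atomic, threshold):
    """Smallest number of added inferred facts that lifts phi_r strictly above threshold."""
    # Smallest integer count n with n / num_atomic > threshold.
    target = math.floor(threshold * num_atomic) + 1
    return max(0, target - num_inferred)


# Hypothetical knowledge graph: 1,000 atomic facts, 1,500 inferred facts,
# and a placeholder threshold of 4.0.
extra = extra_inferred_needed(1500, 1000, 4.0)
print("synthetic inferred facts to add:", extra)
print("resulting ratio:", phi_r(1500 + extra, 1000))
```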

Title: Chain-of-Defensive-Thought: Structured Reasoning Elicits Robustness in Large Language Models against Reference Corruption

Authors: Wenxiao Wang, Parsa Hosseini, Soheil Feizi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.20769
Pdf URL: https://arxiv.org/pdf/2504.20769
Copy Paste: [[2504.20769]] Chain-of-Defensive-Thought: Structured Reasoning Elicits Robustness in Large Language Models against Reference Corruption(https://arxiv.org/abs/2504.20769)
Keywords: language model, gpt, prompt, chain-of-thought
Abstract: Chain-of-thought prompting has demonstrated great success in facilitating the reasoning abilities of large language models. In this work, we explore how these enhanced reasoning abilities can be exploited to improve the robustness of large language models in tasks that are not necessarily reasoning-focused. In particular, we show how a wide range of large language models exhibit significantly improved robustness against reference corruption using a simple method called chain-of-defensive-thought, where only a few exemplars with structured and defensive reasoning are provided as demonstrations. Empirically, the improvements can be astounding, especially given the simplicity and applicability of the method. For example, in the Natural Questions task, the accuracy of GPT-4o degrades from 60% to as low as 3% with standard prompting when 1 out of 10 references provided is corrupted with prompt injection attacks. In contrast, GPT-4o using chain-of-defensive-thought prompting maintains an accuracy of 50%.

Title: Turing Machine Evaluation for Large Language Model

Authors: Haitao Wu, Zongbo Han, Huaxi Huang, Changqing Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.20771
Pdf URL: https://arxiv.org/pdf/2504.20771
Copy Paste: [[2504.20771]] Turing Machine Evaluation for Large Language Model(https://arxiv.org/abs/2504.20771)
Keywords: language model, llm
Abstract: With the rapid development and widespread application of Large Language Models (LLMs), rigorous evaluation has become particularly crucial. This research adopts a novel perspective, focusing on evaluating the core computational reasoning ability of LLMs, defined as the capacity of a model to accurately understand rules and execute logical computing operations. This capability assesses the reliability of LLMs as precise executors, and is critical to advanced tasks such as complex code generation and multi-step problem-solving. We propose an evaluation framework based on Universal Turing Machine (UTM) simulation. This framework requires LLMs to strictly follow instructions and track dynamic states, such as tape content and read/write head position, during multi-step computations. To enable standardized evaluation, we developed TMBench, a benchmark for systematically studying the computational reasoning capabilities of LLMs. TMBench provides several key advantages, including knowledge-agnostic evaluation, adjustable difficulty, foundational coverage through Turing machine encoding, and unlimited capacity for instance generation, ensuring scalability as models continue to evolve. We find that model performance on TMBench correlates strongly with performance on other recognized reasoning benchmarks (Pearson correlation coefficient is 0.73), clearly demonstrating that computational reasoning is a significant dimension for measuring the deep capabilities of LLMs. Code and data are available at this https URL.
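
To show the kind of state tracking this abstract refers to (tape content and read/write head position across steps), here is a minimal Python sketch of a single-tape Turing machine simulator. It is not the TMBench encoding or the authors' UTM setup; the rule format, the state names `q0` and `halt`, and the bit-flipping example machine are assumptions made for illustration.

```python
def run_turing_machine(rules, tape, state="q0", head=0, blank="_", max_steps=1000):
    """Simulate a single-tape machine.

    rules maps (state, symbol) -> (new_symbol, move, new_state), move is "L" or "R".
    Returns (tape with surrounding blanks stripped, final state, head position, trace).
    """
    cells = dict(enumerate(tape))
    trace = []  # (state, head) after every step: what a model would have to track
    for _ in range(max_steps):
        if state == "halt":
            break
        symbol = cells.get(head, blank)
        new_symbol, move, state = rules[(state, symbol)]
        cells[head] = new_symbol
        head += 1 if move == "R" else -1
        trace.append((state, head))
    if not cells:
        return "", state, head, trace
    low, high = min(cells), max(cells)
    final_tape = "".join(cells.get(i, blank) for i in range(low, high + 1))
    return final_tape.strip(blank), state, head, trace


# Hypothetical machine: flip every bit, then halt on the first blank.
FLIP_RULES = {
    ("q0", "0"): ("1", "R", "q0"),
    ("q0", "1"): ("0", "R", "q0"),
    ("q0", "_"): ("_", "R", "halt"),
}

final_tape, final_state, final_head, trace = run_turing_machine(FLIP_RULES, "1011")
print(final_tape, final_state, final_head)
```

A benchmark in this style can compare a model's step-by-step predictions against `trace`, so an error at any intermediate step is detectable rather than only the final tape.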

Title: Universal language model with the intervention of quantum theory

Authors: D.-F. Qin
Subjects: cs.CL, quant-ph
Abstract URL: https://arxiv.org/abs/2504.20839
Pdf URL: https://arxiv.org/pdf/2504.20839
Copy Paste: [[2504.20839]] Universal language model with the intervention of quantum theory(https://arxiv.org/abs/2504.20839)
Keywords: language model
Abstract: This paper examines language modeling based on the theory of quantum mechanics. It focuses on the introduction of quantum mechanics into the symbol-meaning pairs of language in order to build a representation model of natural language. At the same time, it is realized that word embedding, which is widely used as a basic technique for statistical language modeling, can be explained and improved by the mathematical framework of quantum mechanics. On this basis, this paper continues to try to use quantum statistics and other related theories to study the mathematical representation, natural evolution and statistical properties of natural language. It is also assumed that the source of such quantum properties is the physicality of information. The feasibility of using quantum theory to model natural language is pointed out through the construction of an experimental code. The paper discusses, in terms of applications, the possible help of the theory in constructing generative models that are popular nowadays. A preliminary discussion of future applications of the theory to quantum computers is also presented.

Title: JaccDiv: A Metric and Benchmark for Quantifying Diversity of Generated Marketing Text in the Music Industry

Authors: Anum Afzal, Alexandre Mercier, Florian Matthes
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.20849
Pdf URL: https://arxiv.org/pdf/2504.20849
Copy Paste: [[2504.20849]] JaccDiv: A Metric and Benchmark for Quantifying Diversity of Generated Marketing Text in the Music Industry(https://arxiv.org/abs/2504.20849)
Keywords: language model, gpt, llm
Abstract: Online platforms are increasingly interested in using Data-to-Text technologies to generate content and help their users. Unfortunately, traditional generative methods often fall into repetitive patterns, resulting in monotonous galleries of texts after only a few iterations. In this paper, we investigate LLM-based data-to-text approaches to automatically generate marketing texts that are of sufficient quality and diverse enough for broad adoption. We leverage Language Models such as T5, GPT-3.5, GPT-4, and LLaMa2 in conjunction with fine-tuning, few-shot, and zero-shot approaches to set a baseline for diverse marketing texts. We also introduce a metric JaccDiv to evaluate the diversity of a set of texts. This research extends its relevance beyond the music industry, proving beneficial in various fields where repetitive automated content generation is prevalent.
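
The abstract does not define JaccDiv, so the following Python sketch should not be read as the authors' metric. It only illustrates one plausible Jaccard-based diversity score for a set of texts: one minus the mean pairwise Jaccard similarity over lowercased word sets. The function names and the whitespace tokenization are assumptions.

```python
from itertools import combinations


def jaccard_similarity(text_a, text_b):
    """Jaccard similarity between the word sets of two texts."""
    set_a, set_b = set(text_a.lower().split()), set(text_b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def pairwise_jaccard_diversity(texts):
    """One minus the mean pairwise Jaccard similarity; 0 = identical, 1 = disjoint."""
    pairs = list(combinations(texts, 2))
    if not pairs:
        raise ValueError("need at least two texts")
    mean_similarity = sum(jaccard_similarity(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_similarity


# Hypothetical marketing snippets.
repetitive = ["new album out now", "new album out now"]
varied = ["new album out now", "catch the tour this summer"]
print(pairwise_jaccard_diversity(repetitive), pairwise_jaccard_diversity(varied))
```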

Title: DYNAMAX: Dynamic computing for Transformers and Mamba based architectures

Authors: Miguel Nogales, Matteo Gambella, Manuel Roveri
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.20922
Pdf URL: https://arxiv.org/pdf/2504.20922
Copy Paste: [[2504.20922]] DYNAMAX: Dynamic computing for Transformers and Mamba based architectures(https://arxiv.org/abs/2504.20922)
Keywords: llm
Abstract: Early exits (EEs) offer a promising approach to reducing computational costs and latency by dynamically terminating inference once a satisfactory prediction confidence on a data sample is achieved. Although many works integrate EEs into encoder-only Transformers, their application to decoder-only architectures and, more importantly, Mamba models, a novel family of state-space architectures in the LLM realm, remains insufficiently explored. This work introduces DYNAMAX, the first framework to exploit the unique properties of Mamba architectures for early exit mechanisms. We not only integrate EEs into Mamba but also repurpose Mamba as an efficient EE classifier for both Mamba-based and transformer-based LLMs, showcasing its versatility. Our experiments employ the Mistral 7B transformer compared to the Codestral 7B Mamba model, using data sets such as TruthfulQA, CoQA, and TriviaQA to evaluate computational savings, accuracy, and consistency. The results highlight the adaptability of Mamba as a powerful EE classifier and its efficiency in balancing computational cost and performance quality across NLP tasks. By leveraging Mamba's inherent design for dynamic processing, we open pathways for scalable and efficient inference in embedded applications and resource-constrained environments. This study underscores the transformative potential of Mamba in redefining dynamic computing paradigms for LLMs.
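
The early-exit idea in this abstract, stopping inference once prediction confidence is satisfactory, can be sketched as a simple threshold rule in Python. This is a generic illustration, not DYNAMAX itself: the confidences are supplied as plain numbers instead of coming from a Mamba-based exit classifier, and the `0.9` threshold and the name `early_exit` are hypothetical.

```python
def early_exit(exit_outputs, threshold=0.9):
    """Pick the first exit whose confidence reaches the threshold.

    exit_outputs is a list of (prediction, confidence) pairs, ordered from the
    shallowest exit to the final layer. Returns (prediction, exit_index).
    """
    if not exit_outputs:
        raise ValueError("need at least one exit")
    for index, (prediction, confidence) in enumerate(exit_outputs):
        if confidence >= threshold:
            return prediction, index  # deeper layers are skipped
    # No exit was confident enough: fall back to the final layer.
    last = len(exit_outputs) - 1
    return exit_outputs[last][0], last


# Hypothetical per-exit outputs for one input.
outputs = [("Paris", 0.55), ("Paris", 0.93), ("Paris", 0.99)]
prediction, used_exit = early_exit(outputs)
print(prediction, "via exit", used_exit, "of", len(outputs) - 1)
```

The compute saving comes from the layers after `used_exit` never being evaluated; raising the threshold trades some of that saving for accuracy.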

Title: Trace-of-Thought: Enhanced Arithmetic Problem Solving via Reasoning Distillation From Large to Small Language Models

Authors: Tyler McDonald, Ali Emami
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.20946
Pdf URL: https://arxiv.org/pdf/2504.20946
Copy Paste: [[2504.20946]] Trace-of-Thought: Enhanced Arithmetic Problem Solving via Reasoning Distillation From Large to Small Language Models(https://arxiv.org/abs/2504.20946)
Keywords: language model, gpt, llm, prompt
Abstract: As Large Language Models (LLMs) continue to be leveraged for daily tasks, prompt engineering remains an active field of contribution within computational linguistics, particularly in domains requiring specialized knowledge such as arithmetic reasoning. While these LLMs are optimized for a variety of tasks, their exhaustive employment may become computationally or financially cumbersome for small teams. Additionally, complete reliance on proprietary, closed-source models often limits customization and adaptability, posing significant challenges in research and application scalability. Instead, by leveraging open-source models at or below 7 billion parameters, we can optimize our resource usage while still observing remarkable gains over standard prompting approaches. To cultivate this notion, we introduce Trace-of-Thought Prompting, a simple, zero-shot prompt engineering method that instructs LLMs to create observable subproblems using critical problem-solving, specifically designed to enhance arithmetic reasoning capabilities. When applied to open-source models in tandem with GPT-4, we observe that Trace-of-Thought not only allows novel insight into the problem-solving process but also introduces performance gains as large as 125% on language models at or below 7 billion parameters. This approach underscores the potential of open-source initiatives in democratizing AI research and improving the accessibility of high-quality computational linguistics applications.

Title: Information Gravity: A Field-Theoretic Model for Token Selection in Large Language Models

Authors: Maryna Vyshnyvetska
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.20951
Pdf URL: https://arxiv.org/pdf/2504.20951
Copy Paste: [[2504.20951]] Information Gravity: A Field-Theoretic Model for Token Selection in Large Language Models(https://arxiv.org/abs/2504.20951)
Keywords: language model, llm, hallucination
Abstract: We propose a theoretical model called "information gravity" to describe the text generation process in large language models (LLMs). The model uses physical apparatus from field theory and spacetime geometry to formalize the interaction between user queries and the probability distribution of generated tokens. A query is viewed as an object with "information mass" that curves the semantic space of the model, creating gravitational potential wells that "attract" tokens during generation. This model offers a mechanism to explain several observed phenomena in LLM behavior, including hallucinations (emerging from low-density semantic voids), sensitivity to query formulation (due to semantic field curvature changes), and the influence of sampling temperature on output diversity.
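
The last phenomenon listed in this abstract, the influence of sampling temperature on output diversity, can be checked numerically with standard temperature-scaled softmax. The sketch below is the usual sampling formulation, not the paper's field-theoretic model; the logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to token probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a more spread-out distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.5)
hot = softmax_with_temperature(logits, 2.0)
print("entropy at T=0.5:", entropy(cold))
print("entropy at T=2.0:", entropy(hot))
```

Higher temperature flattens the distribution, so sampled tokens vary more; lower temperature concentrates probability on the top-scoring token.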

Title: OSVBench: Benchmarking LLMs on Specification Generation Tasks for Operating System Verification

Authors: Shangyu Li, Juyong Jiang, Tiancheng Zhao, Jiasi Shen
Subjects: cs.CL, cs.AI, cs.OS, cs.PL, cs.SE
Abstract URL: https://arxiv.org/abs/2504.20964
Pdf URL: https://arxiv.org/pdf/2504.20964
Copy Paste: [[2504.20964]] OSVBench: Benchmarking LLMs on Specification Generation Tasks for Operating System Verification(https://arxiv.org/abs/2504.20964)
Keywords: language model, llm, long context
Abstract: We introduce OSVBench, a new benchmark for evaluating Large Language Models (LLMs) in generating complete specification code pertaining to operating system kernel verification tasks. The benchmark first casts the specification generation problem as a program synthesis problem within a confined scope of syntax and semantics by providing LLMs with the programming model. The LLMs are required to understand the provided verification assumption and the potential syntax and semantics space to search, then generate the complete specification for the potentially buggy operating system code implementation under the guidance of the high-level functional description of the operating system. This benchmark is built upon a real-world operating system kernel, Hyperkernel, and consists of 245 complex specification generation tasks in total, each a long-context task of about 20k-30k tokens. Our comprehensive evaluation of 12 LLMs exhibits the limited performance of the current LLMs on the specification generation tasks for operating system verification. Significant disparities in their performance on the benchmark highlight differences in their ability to handle long-context code generation tasks. The evaluation toolkit and benchmark are available at this https URL.
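Summary: We introduce OSVBench, a new benchmark for evaluating large language models (LLMs) on generating complete specification code for operating system kernel verification tasks. The benchmark first frames specification generation as a program synthesis problem within a confined scope of syntax and semantics by providing the LLMs with the programming model. The LLMs must understand the provided verification assumption and the syntax and semantics space to search, then generate the complete specification for a potentially buggy operating system code implementation, guided by a high-level functional description of the operating system. The benchmark is built on a real-world operating system kernel, Hyperkernel, and consists of 245 complex specification generation tasks, each a long-context task of about 20k-30k tokens. Our comprehensive evaluation of 12 LLMs shows that current LLMs perform poorly on specification generation for operating system verification. The large performance gaps between models on the benchmark highlight differences in their ability to handle long-context code generation tasks. The evaluation toolkit and benchmark are available at this https URL.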

Title: SetKE: Knowledge Editing for Knowledge Elements Overlap

Authors: Yifan Wei, Xiaoyan Yu, Ran Song, Hao Peng, Angsheng Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.20972
Pdf URL: https://arxiv.org/pdf/2504.20972
Copy Paste: [[2504.20972]] SetKE: Knowledge Editing for Knowledge Elements Overlap(https://arxiv.org/abs/2504.20972)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) excel in tasks such as retrieval and question answering but require updates to incorporate new knowledge and reduce inaccuracies and hallucinations. Traditional updating methods, like fine-tuning and incremental learning, face challenges such as overfitting and high computational costs. Knowledge Editing (KE) provides a promising alternative but often overlooks the Knowledge Element Overlap (KEO) phenomenon, where multiple triplets share common elements, leading to editing conflicts. We identify the prevalence of KEO in existing KE datasets and show its significant impact on current KE methods, causing performance degradation in handling such triplets. To address this, we propose a new formulation, Knowledge Set Editing (KSE), and introduce SetKE, a method that edits sets of triplets simultaneously. Experimental results demonstrate that SetKE outperforms existing methods in KEO scenarios on mainstream LLMs. Additionally, we introduce EditSet, a dataset containing KEO triplets, providing a comprehensive benchmark.
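Summary: Large language models (LLMs) excel at tasks such as retrieval and question answering, but they need updates to incorporate new knowledge and to reduce inaccuracies and hallucinations. Traditional updating methods such as fine-tuning and incremental learning face challenges including overfitting and high computational cost. Knowledge Editing (KE) offers a promising alternative but often overlooks the Knowledge Element Overlap (KEO) phenomenon, in which multiple triplets share common elements, leading to editing conflicts. We identify the prevalence of KEO in existing KE datasets and show that it significantly affects current KE methods, degrading their performance on such triplets. To address this, we propose a new formulation, Knowledge Set Editing (KSE), and introduce SetKE, a method that edits sets of triplets simultaneously. Experiments show that SetKE outperforms existing methods in KEO scenarios on mainstream LLMs. We also introduce EditSet, a dataset containing KEO triplets, which provides a comprehensive benchmark.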
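
To make the overlap idea concrete, the following is a minimal sketch of how triplets that share common elements might be grouped. It is not the SetKE method or the EditSet construction procedure. The function name `group_overlapping`, the example triplets, and the choice to treat a shared (subject, relation) pair as an overlap are all assumptions made for illustration; the abstract only says that overlapping triplets "share common elements".

```python
from collections import defaultdict

# A triplet is (subject, relation, object). All data below is hypothetical.
triples = [
    ("Alice", "speaks", "French"),
    ("Alice", "speaks", "German"),
    ("Alice", "born_in", "Paris"),
    ("Bob", "speaks", "French"),
]


def group_overlapping(triples, key=lambda t: (t[0], t[1])):
    """Group triplets by a shared key and keep groups with more than one member.

    By default the key is the (subject, relation) pair, which is an assumed
    definition of overlap. Pass a different key to try other definitions.
    """
    groups = defaultdict(list)
    for t in triples:
        groups[key(t)].append(t)
    # A group of size one has nothing to conflict with, so drop it.
    return {k: v for k, v in groups.items() if len(v) > 1}


overlaps = group_overlapping(triples)
for shared, members in overlaps.items():
    print(shared, "->", members)
```

Under this assumed definition, editing one triplet in a group in isolation could interfere with the others in the same group, which is the kind of conflict that editing a set of triplets together is meant to avoid. Passing `key=lambda t: t[2]` groups by shared object instead.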