2025-04-29

Title: Mind the Language Gap: Automated and Augmented Evaluation of Bias in LLMs for High- and Low-Resource Languages

Authors: Alessio Buscemi, Cédric Lothritz, Sergio Morales, Marcos Gomez-Vazquez, Robert Clarisó, Jordi Cabot, German Castignani
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.18560
Pdf URL: https://arxiv.org/pdf/2504.18560
Copy Paste: [[2504.18560]] Mind the Language Gap: Automated and Augmented Evaluation of Bias in LLMs for High- and Low-Resource Languages(https://arxiv.org/abs/2504.18560)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have exhibited impressive natural language processing capabilities but often perpetuate social biases inherent in their training data. To address this, we introduce MultiLingual Augmented Bias Testing (MLA-BiTe), a framework that improves prior bias evaluation methods by enabling systematic multilingual bias testing. MLA-BiTe leverages automated translation and paraphrasing techniques to support comprehensive assessments across diverse linguistic settings. In this study, we evaluate the effectiveness of MLA-BiTe by testing four state-of-the-art LLMs in six languages -- including two low-resource languages -- focusing on seven sensitive categories of discrimination.

Title: Span-Level Hallucination Detection for LLM-Generated Answers

Authors: Passant Elchafei, Mervet Abu-Elkheir
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.18639
Pdf URL: https://arxiv.org/pdf/2504.18639
Copy Paste: [[2504.18639]] Span-Level Hallucination Detection for LLM-Generated Answers(https://arxiv.org/abs/2504.18639)
Keywords: gpt, llm, hallucination, prompt
Abstract: Detecting spans of hallucination in LLM-generated answers is crucial for improving factual consistency. This paper presents a span-level hallucination detection framework for the SemEval-2025 Shared Task, focusing on English and Arabic texts. Our approach integrates Semantic Role Labeling (SRL) to decompose the answer into atomic roles, which are then compared with a retrieved reference context obtained via question-based LLM prompting. Using a DeBERTa-based textual entailment model, we evaluate each role's semantic alignment with the retrieved context. The entailment scores are further refined through token-level confidence measures derived from output logits, and the combined scores are used to detect hallucinated spans. Experiments on the Mu-SHROOM dataset demonstrate competitive performance. Additionally, hallucinated spans have been verified through fact-checking by prompting GPT-4 and LLaMA. Our findings contribute to improving hallucination detection in LLM-generated responses.
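The scoring step described in this abstract can be pictured with a small sketch. The abstract does not say how entailment scores and token-level confidence are combined, so the weighted average, the `alpha` weight, the threshold, and all names below (`RoleSpan`, `hallucination_score`, `detect_hallucinated_spans`) are illustrative assumptions, not the authors' method.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RoleSpan:
    """One atomic role extracted from the answer (hypothetical container)."""
    start: int               # character offset where the span begins
    end: int                 # character offset where the span ends
    entailment: float        # P(reference context entails this role), in [0, 1]
    token_confidence: float  # mean token probability from output logits, in [0, 1]


def hallucination_score(span: RoleSpan, alpha: float = 0.5) -> float:
    """Assumed combination rule: weighted support, inverted into a risk score."""
    support = alpha * span.entailment + (1.0 - alpha) * span.token_confidence
    return 1.0 - support


def detect_hallucinated_spans(
    spans: List[RoleSpan], threshold: float = 0.5, alpha: float = 0.5
) -> List[Tuple[int, int]]:
    """Return (start, end) offsets of spans whose risk score exceeds the threshold."""
    return [
        (s.start, s.end)
        for s in spans
        if hallucination_score(s, alpha) > threshold
    ]
```

In practice the entailment probability would come from an NLI model and the confidence from the generator's logits; both are passed in as plain numbers here so the sketch stays self-contained.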

Title: Can Third-parties Read Our Emotions?

Authors: Jiayi Li, Yingfan Zhou, Pranav Narayanan Venkit, Halima Binte Islam, Sneha Arya, Shomir Wilson, Sarah Rajtmajer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.18673
Pdf URL: https://arxiv.org/pdf/2504.18673
Copy Paste: [[2504.18673]] Can Third-parties Read Our Emotions?(https://arxiv.org/abs/2504.18673)
Keywords: language model, llm, prompt
Abstract: Natural Language Processing tasks that aim to infer an author's private states, e.g., emotions and opinions, from their written text, typically rely on datasets annotated by third-party annotators. However, the assumption that third-party annotators can accurately capture authors' private states remains largely unexamined. In this study, we present human subjects experiments on emotion recognition tasks that directly compare third-party annotations with first-party (author-provided) emotion labels. Our findings reveal significant limitations in third-party annotations, whether provided by human annotators or large language models (LLMs), in faithfully representing authors' private states. However, LLMs outperform human annotators nearly across the board. We further explore methods to improve third-party annotation quality. We find that demographic similarity between first-party authors and third-party human annotators enhances annotation performance, while incorporating first-party demographic information into prompts leads to a marginal but statistically significant improvement in LLMs' performance. We introduce a framework for evaluating the limitations of third-party annotations and call for refined annotation practices to accurately represent and model authors' private states.

Title: Building UD Cairo for Old English in the Classroom

Authors: Lauren Levine, Junghyun Min, Amir Zeldes
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.18718
Pdf URL: https://arxiv.org/pdf/2504.18718
Copy Paste: [[2504.18718]] Building UD Cairo for Old English in the Classroom(https://arxiv.org/abs/2504.18718)
Keywords: llm, prompt
Abstract: In this paper we present a sample treebank for Old English based on the UD Cairo sentences, collected and annotated as part of a classroom curriculum in Historical Linguistics. To collect the data, a sample of 20 sentences illustrating a range of syntactic constructions in the world's languages, we employ a combination of LLM prompting and searches in authentic Old English data. For annotation we assigned sentences to multiple students with limited prior exposure to UD, whose annotations we compare and adjudicate. Our results suggest that while current LLM outputs in Old English do not reflect authentic syntax, this can be mitigated by post-editing, and that although beginner annotators do not possess enough background to complete the task perfectly, taken together they can produce good results and learn from the experience. We also conduct preliminary parsing experiments using Modern English training data, and find that although performance on Old English is poor, parsing on annotated features (lemma, hyperlemma, gloss) leads to improved performance.

Title: EvidenceBench: A Benchmark for Extracting Evidence from Biomedical Papers

Authors: Jianyou Wang, Weili Cao, Kaicheng Wang, Xiaoyue Wang, Ashish Dalvi, Gino Prasad, Qishan Liang, Hsuan-lin Her, Ming Wang, Qin Yang, Gene W. Yeo, David E. Neal, Maxim Khan, Christopher D. Rosin, Ramamohan Paturi, Leon Bergen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.18736
Pdf URL: https://arxiv.org/pdf/2504.18736
Copy Paste: [[2504.18736]] EvidenceBench: A Benchmark for Extracting Evidence from Biomedical Papers(https://arxiv.org/abs/2504.18736)
Keywords: language model
Abstract: We study the task of automatically finding evidence relevant to hypotheses in biomedical papers. Finding relevant evidence is an important step when researchers investigate scientific hypotheses. We introduce EvidenceBench to measure model performance on this task. The benchmark is created by a novel pipeline that consists of hypothesis generation and sentence-by-sentence annotation of biomedical papers for relevant evidence, completely guided by and faithfully following existing human experts' judgment. We demonstrate the pipeline's validity and accuracy with multiple sets of human-expert annotations. We evaluated a diverse set of language models and retrieval systems on the benchmark and found that model performance still falls significantly short of the expert level on this task. To show the scalability of our proposed pipeline, we create a larger EvidenceBench-100k with 107,461 fully annotated papers with hypotheses to facilitate model training and development. Both datasets are available at this https URL

Title: SynLexLM: Scaling Legal LLMs with Synthetic Data and Curriculum Learning

Authors: Ojasw Upadhyay, Abishek Saravankumar, Ayman Ismail
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2504.18762
Pdf URL: https://arxiv.org/pdf/2504.18762
Copy Paste: [[2504.18762]] SynLexLM: Scaling Legal LLMs with Synthetic Data and Curriculum Learning(https://arxiv.org/abs/2504.18762)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are powerful but often require extensive fine-tuning and large datasets for specialized domains like law. General-purpose pre-training may not capture legal nuances, and acquiring sufficient legal data is challenging. We introduce SynLexLM, a novel approach to efficiently pre-train a legal LLM. Our method employs curriculum learning, progressing from simple to complex legal texts and queries, combined with synthetic data augmentation using models like Gemini Pro to address data scarcity. We aim to achieve improved performance on legal benchmarks (BigLaw-Bench, EUR-Lex-Sum) compared to traditional models and fine-tuned versions. Preliminary work involves generating synthetic QA pairs reflecting legal reasoning. This work aims to enhance legal document analysis and research tools, potentially democratizing access to advanced legal AI.

Title: Stealing Creator's Workflow: A Creator-Inspired Agentic Framework with Iterative Feedback Loop for Improved Scientific Short-form Generation

Authors: Jong Inn Park, Maanas Taneja, Qianwen Wang, Dongyeop Kang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.18805
Pdf URL: https://arxiv.org/pdf/2504.18805
Copy Paste: [[2504.18805]] Stealing Creator's Workflow: A Creator-Inspired Agentic Framework with Iterative Feedback Loop for Improved Scientific Short-form Generation(https://arxiv.org/abs/2504.18805)
Keywords: llm, prompt, agent
Abstract: Generating engaging, accurate short-form videos from scientific papers is challenging due to content complexity and the gap between expert authors and readers. Existing end-to-end methods often suffer from factual inaccuracies and visual artifacts, limiting their utility for scientific dissemination. To address these issues, we propose SciTalk, a novel multi-LLM agentic framework, grounding videos in various sources, such as text, figures, visual styles, and avatars. Inspired by content creators' workflows, SciTalk uses specialized agents for content summarization, visual scene planning, and text and layout editing, and incorporates an iterative feedback mechanism where video agents simulate user roles to give feedback on generated videos from previous iterations and refine generation prompts. Experimental evaluations show that SciTalk outperforms simple prompting methods in generating scientifically accurate and engaging content over the refined loop of video generation. Although preliminary results do not yet match human creators' quality, our framework provides valuable insights into the challenges and benefits of feedback-driven video generation. Our code, data, and generated videos will be publicly available.

Title: Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks

Authors: Yixin Cao, Shibo Hong, Xinze Li, Jiahao Ying, Yubo Ma, Haiyuan Liang, Yantao Liu, Zijun Yao, Xiaozhi Wang, Dan Huang, Wenxuan Zhang, Lifu Huang, Muhao Chen, Lei Hou, Qianru Sun, Xingjun Ma, Zuxuan Wu, Min-Yen Kan, David Lo, Qi Zhang, Heng Ji, Jing Jiang, Juanzi Li, Aixin Sun, Xuanjing Huang, Tat-Seng Chua, Yu-Gang Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.18838
Pdf URL: https://arxiv.org/pdf/2504.18838
Copy Paste: [[2504.18838]] Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks(https://arxiv.org/abs/2504.18838)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are advancing at an amazing speed and have become indispensable across academia, industry, and daily applications. To keep pace with the status quo, this survey probes the core challenges that the rise of LLMs poses for evaluation. We identify and analyze two pivotal transitions: (i) from task-specific to capability-based evaluation, which reorganizes benchmarks around core competencies such as knowledge, reasoning, instruction following, multi-modal understanding, and safety; and (ii) from manual to automated evaluation, encompassing dynamic dataset curation and "LLM-as-a-judge" scoring. Yet, even with these transitions, a crucial obstacle persists: the evaluation generalization issue. Bounded test sets cannot scale alongside models whose abilities grow seemingly without limit. We will dissect this issue, along with the core challenges of the above two transitions, from the perspectives of methods, datasets, evaluators, and metrics. Due to the fast evolving of this field, we will maintain a living GitHub repository (links are in each section) to crowd-source updates and corrections, and warmly invite contributors and collaborators.

Title: Towards Robust Dialogue Breakdown Detection: Addressing Disruptors in Large Language Models with Self-Guided Reasoning

Authors: Abdellah Ghassel, Xianzhi Li, Xiaodan Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.18839
Pdf URL: https://arxiv.org/pdf/2504.18839
Copy Paste: [[2504.18839]] Towards Robust Dialogue Breakdown Detection: Addressing Disruptors in Large Language Models with Self-Guided Reasoning(https://arxiv.org/abs/2504.18839)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large language models (LLMs) are rapidly changing various domains. However, their capabilities in handling conversational breakdowns still require an in-depth exploration. This paper addresses the challenge of detecting and mitigating dialogue breakdowns within LLM-driven conversational systems. While powerful models from OpenAI and Anthropic excel in many dialogue tasks, they can still produce incoherent or contradictory responses, commonly referred to as breakdowns, which undermine user trust. To tackle this, we propose an approach that combines specialized fine-tuning with advanced prompting strategies, including few-shot learning, chain-of-thought reasoning, and analogical prompting. In particular, we fine-tune a small 8B model and demonstrate its robust classification and calibration capabilities in English and Japanese dialogue. We also validate its generalization on the BETOLD dataset, achieving a 7% accuracy improvement over its base model. Furthermore, we introduce a real-time deployment architecture that selectively escalates suspicious responses to more resource-intensive frontier models only when breakdowns are detected, significantly cutting operational expenses and energy consumption. Experimental results show our method surpasses prior state-of-the-art specialized classifiers while also narrowing performance gaps between smaller open-source models and large proprietary ones. Our approach offers a scalable solution for robust conversational AI in high-impact domains by combining efficiency, interpretability, and reliability.
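The selective-escalation idea in this abstract can be sketched as a simple router. The abstract gives no interface details, so the function name `respond_with_escalation`, the three callables, and the 0.5 threshold are hypothetical stand-ins: a cheap responder drafts a reply, a breakdown detector scores it, and the expensive model is called only when the score is high.

```python
from typing import Callable, List, Tuple

History = List[str]


def respond_with_escalation(
    history: History,
    small_model: Callable[[History], str],
    breakdown_probability: Callable[[History, str], float],
    frontier_model: Callable[[History], str],
    threshold: float = 0.5,
) -> Tuple[str, bool]:
    """Return (reply, escalated).

    The frontier model is invoked only when the detector judges the
    cheap draft likely to be a dialogue breakdown.
    """
    draft = small_model(history)
    if breakdown_probability(history, draft) >= threshold:
        # Suspicious draft: pay for the resource-intensive model.
        return frontier_model(history), True
    # Draft looks fine: keep the cheap reply.
    return draft, False
```

The cost saving comes from the second branch: when most drafts pass the detector, the expensive call is skipped for most turns.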

Title: When2Call: When (not) to Call Tools

Authors: Hayley Ross, Ameya Sunil Mahabaleshwarkar, Yoshi Suhara
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.18851
Pdf URL: https://arxiv.org/pdf/2504.18851
Copy Paste: [[2504.18851]] When2Call: When (not) to Call Tools(https://arxiv.org/abs/2504.18851)
Keywords: language model
Abstract: Leveraging external tools is a key feature for modern Language Models (LMs) to expand their capabilities and integrate them into existing systems. However, existing benchmarks primarily focus on the accuracy of tool calling -- whether the correct tool is called with the correct parameters -- and less on evaluating when LMs should (not) call tools. We develop a new benchmark, When2Call, which evaluates tool-calling decision-making: when to generate a tool call, when to ask follow-up questions and when to admit the question can't be answered with the tools provided. We find that state-of-the-art tool-calling LMs show significant room for improvement on When2Call, indicating the importance of this benchmark. We also develop a training set for When2Call and leverage the multiple-choice nature of the benchmark to develop a preference optimization training regime, which shows considerably more improvement than traditional fine-tuning. We release the benchmark and training data as well as evaluation scripts at this https URL.

Title: Effective Length Extrapolation via Dimension-Wise Positional Embeddings Manipulation

Authors: Yi Lu, Wanxu Zhao, Xin Zhou, Chenxin An, Chenglong Wang, Shuo Li, Yuming Yang, Jun Zhao, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.18857
Pdf URL: https://arxiv.org/pdf/2504.18857
Copy Paste: [[2504.18857]] Effective Length Extrapolation via Dimension-Wise Positional Embeddings Manipulation(https://arxiv.org/abs/2504.18857)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) often struggle to process and generate coherent context when the number of input tokens exceeds the pre-trained length. Recent advancements in long-context extension have significantly expanded the context window of LLMs but require expensive overhead to train the large-scale models with longer context. In this work, we propose Dimension-Wise Positional Embeddings Manipulation (DPE), a training-free framework to extrapolate the context window of LLMs by diving into RoPE's different hidden dimensions. Instead of manipulating all dimensions equally, DPE detects the effective length for every dimension and finds the key dimensions for context extension. We reuse the original position indices with their embeddings from the pre-trained model and manipulate the key dimensions' position indices to their most effective lengths. In this way, DPE adjusts the pre-trained models with minimal modifications while ensuring that each dimension reaches its optimal state for extrapolation. DPE significantly surpasses well-known baselines such as YaRN and Self-Extend. DPE enables Llama3-8k 8B to support context windows of 128k tokens without continual training and integrates seamlessly with Flash Attention 2. In addition to its impressive extrapolation capability, DPE also dramatically improves models' performance within the training length, for example raising Llama3.1 70B by over 18 points on the popular long-context benchmark RULER. When compared with commercial models, Llama 3.1 70B with DPE even achieves better performance than GPT-4-128K.

Title: Latent Adversarial Training Improves the Representation of Refusal

Authors: Alexandra Abbas, Nora Petrova, Helios Ael Lyons, Natalia Perez-Campanero
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2504.18872
Pdf URL: https://arxiv.org/pdf/2504.18872
Copy Paste: [[2504.18872]] Latent Adversarial Training Improves the Representation of Refusal(https://arxiv.org/abs/2504.18872)
Keywords: language model
Abstract: Recent work has shown that language models' refusal behavior is primarily encoded in a single direction in their latent space, making it vulnerable to targeted attacks. Although Latent Adversarial Training (LAT) attempts to improve robustness by introducing noise during training, a key question remains: How does this noise-based training affect the underlying representation of refusal behavior? Understanding this encoding is crucial for evaluating LAT's effectiveness and limitations, just as the discovery of linear refusal directions revealed vulnerabilities in traditional supervised safety fine-tuning (SSFT). Through the analysis of Llama 2 7B, we examine how LAT reorganizes the refusal behavior in the model's latent space compared to SSFT and embedding space adversarial training (AT). By computing activation differences between harmful and harmless instruction pairs and applying Singular Value Decomposition (SVD), we find that LAT significantly alters the refusal representation, concentrating it in the first two SVD components which explain approximately 75 percent of the activation differences variance - significantly higher than in reference models. This concentrated representation leads to more effective and transferable refusal vectors for ablation attacks: LAT models show improved robustness when attacked with vectors from reference models but become more vulnerable to self-generated vectors compared to SSFT and AT. Our findings suggest that LAT's training perturbations enable a more comprehensive representation of refusal behavior, highlighting both its potential strengths and vulnerabilities for improving model safety.
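The analysis step in this abstract (activation differences followed by SVD) might look roughly like the sketch below. It assumes activations are already extracted as arrays of shape `(n_pairs, d_model)`, skips mean-centering, and treats squared singular values as the share of variance per component; these choices and the function name `svd_variance_shares` are assumptions, since the abstract does not give the exact procedure.

```python
import numpy as np


def svd_variance_shares(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Share of activation-difference variance captured by each SVD component.

    Both inputs have shape (n_pairs, d_model); row i of each array is
    assumed to come from the same harmful/harmless instruction pair.
    """
    diffs = harmful_acts - harmless_acts
    # Only singular values are needed to measure concentration.
    singular_values = np.linalg.svd(diffs, compute_uv=False)
    energy = singular_values ** 2
    return energy / energy.sum()


def top_k_share(harmful_acts: np.ndarray, harmless_acts: np.ndarray, k: int = 2) -> float:
    """Fraction of variance explained by the first k components."""
    return float(svd_variance_shares(harmful_acts, harmless_acts)[:k].sum())
```

Under these assumptions, a value of `top_k_share(..., k=2)` near 0.75 would correspond to the concentration the abstract reports for LAT.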

Title: A Simple Ensemble Strategy for LLM Inference: Towards More Stable Text Classification

Authors: Junichiro Niimi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.18884
Pdf URL: https://arxiv.org/pdf/2504.18884
Copy Paste: [[2504.18884]] A Simple Ensemble Strategy for LLM Inference: Towards More Stable Text Classification(https://arxiv.org/abs/2504.18884)
Keywords: language model, llm
Abstract: With the advance of large language models (LLMs), they have been utilized for various tasks. However, the variability and reproducibility of results across LLM trials have been largely overlooked in the existing literature, whereas actual human annotation uses majority voting to resolve disagreements among annotators. Therefore, this study introduces a straightforward ensemble strategy to sentiment analysis using LLMs. As a result, we demonstrate that an ensemble of multiple inferences using medium-sized LLMs produces more robust and accurate results than a single attempt with a large model, reducing RMSE by 18.6%.
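An ensemble over repeated LLM inferences can be sketched in a few lines. The abstract does not state the aggregation rule, so majority voting (borrowed from the human-annotation analogy it draws), the lower-median tie-break, and the function names are assumptions for illustration only.

```python
import math
from collections import Counter
from statistics import median_low
from typing import List, Sequence


def majority_vote(labels: Sequence[int]) -> int:
    """Most frequent label; ties fall back to the lower median of all votes."""
    counts = Counter(labels)
    best = max(counts.values())
    winners = [label for label, n in counts.items() if n == best]
    if len(winners) == 1:
        return winners[0]
    return median_low(labels)


def ensemble_predictions(runs: Sequence[Sequence[int]]) -> List[int]:
    """Aggregate several runs (one list of per-item ratings each) item by item."""
    return [majority_vote(votes) for votes in zip(*runs)]


def rmse(predicted: Sequence[float], gold: Sequence[float]) -> float:
    """Root mean squared error between predictions and gold ratings."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(predicted, gold)) / len(gold))
```

Comparing `rmse(ensemble_predictions(runs), gold)` with `rmse(runs[0], gold)` gives the kind of single-attempt versus ensemble comparison the abstract reports.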

Title: MTCSC: Retrieval-Augmented Iterative Refinement for Chinese Spelling Correction

Authors: Junhong Liang, Yu Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.18938
Pdf URL: https://arxiv.org/pdf/2504.18938
Copy Paste: [[2504.18938]] MTCSC: Retrieval-Augmented Iterative Refinement for Chinese Spelling Correction(https://arxiv.org/abs/2504.18938)
Keywords: language model, llm
Abstract: Chinese Spelling Correction (CSC) aims to detect and correct erroneous tokens in sentences. While Large Language Models (LLMs) have shown remarkable success in identifying and rectifying potential errors, they often struggle with maintaining consistent output lengths and adapting to domain-specific corrections. Furthermore, existing CSC tasks impose rigid constraints requiring input and output lengths to be identical, limiting their applicability. In this work, we extend traditional CSC to variable-length correction scenarios, including Chinese Splitting Error Correction (CSEC) and ASR N-best Error Correction. To address domain adaptation and length consistency, we propose the MTCSC (Multi-Turn CSC) framework based on RAG enhanced with a length reflection mechanism. Our approach constructs a retrieval database from domain-specific training data and dictionaries, fine-tuning retrievers to optimize performance for error-containing inputs. Additionally, we introduce a multi-source combination strategy with iterative length reflection to ensure output length fidelity. Experiments across diverse domain datasets demonstrate that our method significantly outperforms current approaches in correction quality, particularly in handling domain-specific and variable-length error correction tasks.

Title: LawFlow : Collecting and Simulating Lawyers' Thought Processes

Authors: Debarati Das, Khanh Chi Le, Ritik Sachin Parkar, Karin De Langis, Brendan Madson, Chad M. Berryman, Robin M. Willis, Daniel H. Moses, Brett McDonnell, Daniel Schwarcz, Dongyeop Kang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.18942
Pdf URL: https://arxiv.org/pdf/2504.18942
Copy Paste: [[2504.18942]] LawFlow : Collecting and Simulating Lawyers' Thought Processes(https://arxiv.org/abs/2504.18942)
Keywords: llm
Abstract: Legal practitioners, particularly those early in their careers, face complex, high-stakes tasks that require adaptive, context-sensitive reasoning. While AI holds promise in supporting legal work, current datasets and models are narrowly focused on isolated subtasks and fail to capture the end-to-end decision-making required in real-world practice. To address this gap, we introduce LawFlow, a dataset of complete end-to-end legal workflows collected from trained law students, grounded in real-world business entity formation scenarios. Unlike prior datasets focused on input-output pairs or linear chains of thought, LawFlow captures dynamic, modular, and iterative reasoning processes that reflect the ambiguity, revision, and client-adaptive strategies of legal practice. Using LawFlow, we compare human and LLM-generated workflows, revealing systematic differences in structure, reasoning flexibility, and plan execution. Human workflows tend to be modular and adaptive, while LLM workflows are more sequential, exhaustive, and less sensitive to downstream implications. Our findings also suggest that legal professionals prefer AI to carry out supportive roles, such as brainstorming, identifying blind spots, and surfacing alternatives, rather than executing complex workflows end-to-end. Building on these findings, we propose a set of design suggestions, rooted in empirical observations, that align AI assistance with human goals of clarity, completeness, creativity, and efficiency, through hybrid planning, adaptive execution, and decision-point support. Our results highlight both the current limitations of LLMs in supporting complex legal workflows and opportunities for developing more collaborative, reasoning-aware legal AI systems. All data and code are available on our project page (this https URL).

Title: Dynamic Fisher-weighted Model Merging via Bayesian Optimization

Authors: Sanwoo Lee, Jiahao Liu, Qifan Wang, Jingang Wang, Xunliang Cai, Yunfang Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.18992
Pdf URL: https://arxiv.org/pdf/2504.18992
Copy Paste: [[2504.18992]] Dynamic Fisher-weighted Model Merging via Bayesian Optimization(https://arxiv.org/abs/2504.18992)
Keywords: language model
Abstract: The fine-tuning of pre-trained language models has resulted in the widespread availability of task-specific models. Model merging offers an efficient way to create multi-task models by combining these fine-tuned models at the parameter level, without the need for training data or joint training on multiple datasets. Existing merging approaches typically involve scaling the parameters model-wise or integrating parameter importance parameter-wise. Both approaches exhibit their own weaknesses, leading to a notable performance gap compared to multi-task fine-tuning. In this paper, we unify these seemingly distinct strategies into a more general merging framework, and introduce Dynamic Fisher-weighted Merging (DF-Merge). Specifically, candidate models are associated with a set of coefficients that linearly scale their fine-tuned parameters. Bayesian optimization is applied to dynamically adjust these coefficients, aiming to maximize overall performance on validation sets. Each iteration of this process integrates parameter importance based on the Fisher information conditioned by the coefficients. Experimental results show that DF-Merge outperforms strong baselines across models of different sizes and a variety of tasks. Our analysis shows that the effectiveness of DF-Merge arises from the unified view of merging and that near-optimal performance is achievable in a few iterations, even with minimal validation data.
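For orientation, the two ingredients this abstract unifies (model-wise coefficients and parameter-wise Fisher importance) can be combined in a classic Fisher-weighted average. The sketch below is that standard technique with an added per-model coefficient, not DF-Merge itself: the Bayesian optimization loop and the coefficient-conditioned Fisher computation are omitted, and the function name and flat-list parameter layout are hypothetical.

```python
from typing import List, Sequence


def fisher_weighted_merge(
    params: Sequence[Sequence[float]],
    fishers: Sequence[Sequence[float]],
    coeffs: Sequence[float],
    eps: float = 1e-12,
) -> List[float]:
    """Merge flattened parameter vectors from several fine-tuned models.

    params[i][j]  : j-th parameter of model i
    fishers[i][j] : diagonal Fisher information (importance) of that parameter
    coeffs[i]     : model-wise scaling coefficient (would be tuned, e.g. on validation data)
    """
    merged = []
    for j in range(len(params[0])):
        numerator = sum(c * f[j] * p[j] for p, f, c in zip(params, fishers, coeffs))
        denominator = sum(c * f[j] for f, c in zip(fishers, coeffs))
        # eps guards against parameters with zero importance in every model.
        merged.append(numerator / (denominator + eps))
    return merged
```

With equal Fisher values the result reduces to a coefficient-weighted average of the models, which is the model-wise scaling view; with equal coefficients it reduces to plain Fisher merging, the parameter-wise view.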
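Summary: Fine-tuning of pre-trained language models has made task-specific models widely available. Model merging offers an efficient way to create multi-task models by combining these fine-tuned models at the parameter level, without requiring training data or joint training on multiple datasets. Existing merging approaches typically either scale parameters model-wise or integrate parameter importance parameter-wise. Both approaches have their own weaknesses, leaving a notable performance gap relative to multi-task fine-tuning. In this paper, we unify these seemingly distinct strategies into a more general merging framework and introduce Dynamic Fisher-weighted Merging (DF-Merge). Specifically, candidate models are associated with a set of coefficients that linearly scale their fine-tuned parameters. Bayesian optimization dynamically adjusts these coefficients to maximize overall performance on validation sets. Each iteration of this process integrates parameter importance based on the Fisher information conditioned on the coefficients. Experimental results show that DF-Merge outperforms strong baselines across models of different sizes and a variety of tasks. Our analysis shows that the effectiveness of DF-Merge stems from the unified view of merging, and that near-optimal performance is achievable within a few iterations, even with minimal validation data.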
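
As a rough illustration of the two ingredients this abstract combines (model-wise coefficients and parameter-wise Fisher importance), the sketch below merges toy parameter vectors with a coefficient-scaled, Fisher-weighted average. The function name `fisher_weighted_merge`, the flat-list parameter layout, and the fixed coefficients are assumptions made for illustration; it is not the DF-Merge formulation itself, and the Bayesian optimization loop over the coefficients is omitted.

```python
from typing import List, Sequence


def fisher_weighted_merge(
    params: Sequence[Sequence[float]],
    fishers: Sequence[Sequence[float]],
    coeffs: Sequence[float],
    eps: float = 1e-12,
) -> List[float]:
    """Parameter-wise merge: sum_i(c_i * F_i * theta_i) / sum_i(c_i * F_i).

    params[i]  -- flattened parameters of model i (hypothetical layout)
    fishers[i] -- diagonal Fisher estimate for model i, same length as params[i]
    coeffs[i]  -- model-wise scaling coefficient for model i
    """
    n_params = len(params[0])
    merged = []
    for j in range(n_params):
        num = sum(c * f[j] * p[j] for c, f, p in zip(coeffs, fishers, params))
        den = sum(c * f[j] for c, f in zip(coeffs, fishers))
        merged.append(num / (den + eps))  # eps guards against a zero denominator
    return merged


# Two toy "fine-tuned models" with two parameters each (made-up values).
theta = [[1.0, 0.0], [3.0, 4.0]]
fisher = [[1.0, 1.0], [1.0, 3.0]]
merged = fisher_weighted_merge(theta, fisher, coeffs=[1.0, 1.0])
print(merged)  # approximately [2.0, 3.0]
```

In a setup like the one the abstract describes, an outer optimizer would propose `coeffs`, score the merged model on a validation set, and repeat; that loop is left out here.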

Title: Graph of Attacks: Improved Black-Box and Interpretable Jailbreaks for LLMs

Authors: Mohammad Akbar-Tajari, Mohammad Taher Pilehvar, Mohammad Mahmoody
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2504.19019
Pdf URL: https://arxiv.org/pdf/2504.19019
Copy Paste: [[2504.19019]] Graph of Attacks: Improved Black-Box and Interpretable Jailbreaks for LLMs(https://arxiv.org/abs/2504.19019)
Keywords: language model, llm, prompt
Abstract: The challenge of ensuring Large Language Models (LLMs) align with societal standards is of increasing interest, as these models are still prone to adversarial jailbreaks that bypass their safety mechanisms. Identifying these vulnerabilities is crucial for enhancing the robustness of LLMs against such exploits. We propose Graph of ATtacks (GoAT), a method for generating adversarial prompts to test the robustness of LLM alignment using the Graph of Thoughts framework [Besta et al., 2024]. GoAT excels at generating highly effective jailbreak prompts with fewer queries to the victim model than state-of-the-art attacks, achieving up to five times better jailbreak success rate against robust models like Llama. Notably, GoAT creates high-quality, human-readable prompts without requiring access to the targeted model's parameters, making it a black-box attack. Unlike approaches constrained by tree-based reasoning, GoAT's reasoning is based on a more intricate graph structure. By making simultaneous attack paths aware of each other's progress, this dynamic framework allows a deeper integration and refinement of reasoning paths, significantly enhancing the collaborative exploration of adversarial vulnerabilities in LLMs. At a technical level, GoAT starts with a graph structure and iteratively refines it by combining and improving thoughts, enabling synergy between different thought paths. The code for our implementation can be found at: this https URL.
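Summary: The challenge of ensuring that Large Language Models (LLMs) align with societal standards is attracting growing interest, as these models remain prone to adversarial jailbreaks that bypass their safety mechanisms. Identifying these vulnerabilities is crucial for making LLMs more robust against such exploits. We propose Graph of ATtacks (GoAT), a method that generates adversarial prompts to test the robustness of LLM alignment using the Graph of Thoughts framework [Besta et al., 2024]. GoAT excels at generating highly effective jailbreak prompts with fewer queries to the victim model than state-of-the-art attacks, achieving up to five times better jailbreak success rate against robust models like Llama. Notably, GoAT creates high-quality, human-readable prompts without requiring access to the target model's parameters, making it a black-box attack. Unlike approaches constrained by tree-based reasoning, GoAT's reasoning is based on a more intricate graph structure. By making simultaneous attack paths aware of each other's progress, this dynamic framework allows deeper integration and refinement of reasoning paths, significantly enhancing the collaborative exploration of adversarial vulnerabilities in LLMs. At a technical level, GoAT starts with a graph structure and iteratively refines it by combining and improving thoughts, enabling synergy between different thought paths. The code for our implementation can be found at: this https URL.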

Title: Advancing Scientific Text Classification: Fine-Tuned Models with Dataset Expansion and Hard-Voting

Authors: Zhyar Rzgar K Rostam, Gábor Kertész
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.19021
Pdf URL: https://arxiv.org/pdf/2504.19021
Copy Paste: [[2504.19021]] Advancing Scientific Text Classification: Fine-Tuned Models with Dataset Expansion and Hard-Voting(https://arxiv.org/abs/2504.19021)
Keywords: language model
Abstract: Efficient text classification is essential for handling the increasing volume of academic publications. This study explores the use of pre-trained language models (PLMs), including BERT, SciBERT, BioBERT, and BlueBERT, fine-tuned on the Web of Science (WoS-46985) dataset for scientific text classification. To enhance performance, we augment the dataset by executing seven targeted queries in the WoS database, retrieving 1,000 articles per category aligned with WoS-46985's main classes. PLMs predict labels for this unlabeled data, and a hard-voting strategy combines predictions for improved accuracy and confidence. Fine-tuning on the expanded dataset with dynamic learning rates and early stopping significantly boosts classification accuracy, especially in specialized domains. Domain-specific models like SciBERT and BioBERT consistently outperform general-purpose models such as BERT. These findings underscore the efficacy of dataset augmentation, inference-driven label prediction, hard-voting, and fine-tuning techniques in creating robust and scalable solutions for automated academic text classification.
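Summary: Efficient text classification is essential for handling the growing volume of academic publications. This study explores pre-trained language models (PLMs), including BERT, SciBERT, BioBERT, and BlueBERT, fine-tuned on the Web of Science (WoS-46985) dataset for scientific text classification. To improve performance, we augment the dataset by executing seven targeted queries in the WoS database, retrieving 1,000 articles per category aligned with the main classes of WoS-46985. The PLMs predict labels for this unlabeled data, and a hard-voting strategy combines the predictions for improved accuracy and confidence. Fine-tuning on the expanded dataset with dynamic learning rates and early stopping significantly boosts classification accuracy, especially in specialized domains. Domain-specific models such as SciBERT and BioBERT consistently outperform general-purpose models such as BERT. These findings underscore the efficacy of dataset augmentation, inference-driven label prediction, hard-voting, and fine-tuning techniques in creating robust and scalable solutions for automated academic text classification.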
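
The hard-voting step mentioned in this abstract can be pictured with a minimal sketch, assuming each model emits one label per unlabeled article. The helper `hard_vote`, the model variable names, and the category strings are placeholders, not taken from the paper; the tie-breaking rule in particular is an assumption.

```python
from collections import Counter
from typing import List, Sequence


def hard_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """predictions[m][i] is model m's label for sample i.

    Returns the majority label per sample. On a tie, Counter.most_common
    keeps the label that was encountered first, i.e. the one predicted by
    the earliest model in the list (an assumed tie-breaking rule).
    """
    voted = []
    for sample_preds in zip(*predictions):
        voted.append(Counter(sample_preds).most_common(1)[0][0])
    return voted


# Placeholder predictions from four fine-tuned models on three articles.
bert_preds = ["CS", "Medical", "ECE"]
scibert_preds = ["CS", "Medical", "Civil"]
biobert_preds = ["ECE", "Medical", "Civil"]
bluebert_preds = ["CS", "Biochemistry", "Civil"]

voted = hard_vote([bert_preds, scibert_preds, biobert_preds, bluebert_preds])
print(voted)  # ['CS', 'Medical', 'Civil']
```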

Title: KETCHUP: K-Step Return Estimation for Sequential Knowledge Distillation

Authors: Jiabin Fan, Guoqing Luo, Michael Bowling, Lili Mou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.19024
Pdf URL: https://arxiv.org/pdf/2504.19024
Copy Paste: [[2504.19024]] KETCHUP: K-Step Return Estimation for Sequential Knowledge Distillation(https://arxiv.org/abs/2504.19024)
Keywords: language model, llm
Abstract: We propose a novel K-step return estimation method (called KETCHUP) for reinforcement learning (RL)-based knowledge distillation (KD) in text generation tasks. Our idea is to induce a K-step return by applying the Bellman Optimality Equation over multiple steps. Theoretical analysis shows that this K-step formulation reduces the variance of the gradient estimates, leading to improved RL optimization, especially when the student model is large. Empirical evaluation on three text generation tasks demonstrates that our approach yields superior performance in both standard task metrics and large language model (LLM)-based evaluation. These results suggest that our K-step return induction offers a promising direction for enhancing RL-based KD in LLM research.
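Summary: We propose a novel K-step return estimation method (called KETCHUP) for reinforcement learning (RL)-based knowledge distillation (KD) in text generation tasks. Our idea is to induce a K-step return by applying the Bellman Optimality Equation over multiple steps. Theoretical analysis shows that this K-step formulation reduces the variance of the gradient estimates, which improves RL optimization, especially when the student model is large. Empirical evaluation on three text generation tasks shows that our approach yields superior performance in both standard task metrics and large language model (LLM)-based evaluation. These results suggest that our K-step return induction offers a promising direction for enhancing RL-based KD in LLM research.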
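
For readers unfamiliar with the term, the sketch below computes a textbook K-step return: the discounted sum of the next K rewards plus a bootstrapped value estimate. This is the generic RL quantity, assumed here for illustration; the paper's own estimator is derived from the Bellman Optimality Equation in a distillation setting and may differ in detail. The name `k_step_return` and the toy numbers are hypothetical.

```python
from typing import Sequence


def k_step_return(
    rewards: Sequence[float],
    values: Sequence[float],
    t: int,
    k: int,
    gamma: float = 1.0,
) -> float:
    """G_t^(K) = sum_{i<K} gamma^i * r_{t+i} + gamma^K * V(s_{t+K}).

    rewards[i] is the reward at step i and values[i] the value estimate of
    the state at step i. If the episode ends before K steps, the sum is
    truncated and no bootstrap term is added (terminal value taken as 0).
    """
    episode_len = len(rewards)
    horizon = min(k, episode_len - t)
    g = sum((gamma ** i) * rewards[t + i] for i in range(horizon))
    if t + horizon < episode_len:
        g += (gamma ** horizon) * values[t + horizon]
    return g


# Toy episode with four steps (made-up numbers).
rewards = [1.0, 2.0, 3.0, 4.0]
values = [10.0, 9.0, 7.0, 4.0]
print(k_step_return(rewards, values, t=0, k=2, gamma=0.5))  # 1 + 0.5*2 + 0.25*7 = 3.75
```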

Title: Calibrating Translation Decoding with Quality Estimation on LLMs

Authors: Di Wu, Yibin Lei, Christof Monz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.19044
Pdf URL: https://arxiv.org/pdf/2504.19044
Copy Paste: [[2504.19044]] Calibrating Translation Decoding with Quality Estimation on LLMs(https://arxiv.org/abs/2504.19044)
Keywords: language model, llm
Abstract: Neural machine translation (NMT) systems typically employ maximum a posteriori (MAP) decoding to select the highest-scoring translation from the distribution mass. However, recent evidence highlights the inadequacy of MAP decoding, often resulting in low-quality or even pathological hypotheses -- the decoding objective is not aligned with real-world translation quality. This paper proposes calibrating hypothesis likelihoods with translation quality from a distribution view by directly optimizing their Pearson correlation -- thereby enhancing the effectiveness of translation decoding. With our method, translation on large language models (LLMs) improves substantially after limited training (2K instances per direction). This improvement is orthogonal to those achieved through supervised fine-tuning, leading to substantial gains across a broad range of metrics and human evaluations -- even when applied to top-performing translation-specialized LLMs fine-tuned on high-quality translation data, such as Tower, or when compared to recent preference optimization methods, like CPO. Moreover, the calibrated translation likelihood can directly serve as a strong proxy for translation quality, closely approximating or even surpassing some state-of-the-art translation quality estimation models, like CometKiwi. Lastly, our in-depth analysis demonstrates that calibration enhances the effectiveness of MAP decoding, thereby enabling greater efficiency in real-world deployment. The resulting state-of-the-art translation model, which covers 10 languages, along with the accompanying code and human evaluation data, has been released to the community: this https URL.
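Summary: Neural machine translation (NMT) systems typically employ maximum a posteriori (MAP) decoding to select the highest-scoring translation from the distribution mass. However, recent evidence highlights the inadequacy of MAP decoding, which often yields low-quality or even pathological hypotheses because the decoding objective is not aligned with real-world translation quality. This paper proposes calibrating hypothesis likelihoods with translation quality from a distribution view by directly optimizing their Pearson correlation, thereby enhancing the effectiveness of translation decoding. With our method, translation with large language models (LLMs) improves substantially after limited training (2K instances per direction). This improvement is orthogonal to those achieved through supervised fine-tuning, leading to substantial gains across a broad range of metrics and human evaluations, even when applied to top-performing translation-specialized LLMs fine-tuned on high-quality translation data, such as Tower, or when compared to recent preference optimization methods, like CPO. Moreover, the calibrated translation likelihood can directly serve as a strong proxy for translation quality, closely approximating or even surpassing some state-of-the-art translation quality estimation models, like CometKiwi. Lastly, our in-depth analysis demonstrates that calibration enhances the effectiveness of MAP decoding, enabling greater efficiency in real-world deployment. The resulting state-of-the-art translation model, which covers 10 languages, along with the accompanying code and human evaluation data, has been released to the community: this https URL.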
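
The quantity this abstract optimizes, the Pearson correlation between hypothesis likelihoods and quality scores, can be computed as in the sketch below. The sample log-likelihoods and quality scores are invented, and using the negated correlation as a training loss is only an assumed reading of "directly optimizing"; the paper's actual objective and sampling scheme are not reproduced.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical candidate translations of one source sentence:
# model log-likelihoods and quality-estimation scores for each candidate.
log_likelihoods = [-1.0, -2.0, -3.0]
quality_scores = [0.9, 0.7, 0.5]

r = pearson(log_likelihoods, quality_scores)
calibration_loss = -r  # assumed loss form: minimizing it maximizes the correlation
print(round(r, 6))
```

When likelihood and quality rank candidates identically and linearly, as in this toy case, the correlation is at its maximum of 1 and the most likely hypothesis is also the best one.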

Title: Hallucinations and Key Information Extraction in Medical Texts: A Comprehensive Assessment of Open-Source Large Language Models

Authors: Anindya Bijoy Das, Shibbir Ahmed, Shahnewaz Karim Sakib
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2504.19061
Pdf URL: https://arxiv.org/pdf/2504.19061
Copy Paste: [[2504.19061]] Hallucinations and Key Information Extraction in Medical Texts: A Comprehensive Assessment of Open-Source Large Language Models(https://arxiv.org/abs/2504.19061)
Keywords: language model, llm, hallucination
Abstract: Clinical summarization is crucial in healthcare as it distills complex medical data into digestible information, enhancing patient understanding and care management. Large language models (LLMs) have shown significant potential in automating and improving the accuracy of such summarizations due to their advanced natural language understanding capabilities. These models are particularly applicable in the context of summarizing medical/clinical texts, where precise and concise information transfer is essential. In this paper, we investigate the effectiveness of open-source LLMs in extracting key events from discharge reports, such as reasons for hospital admission, significant in-hospital events, and critical follow-up actions. In addition, we also assess the prevalence of various types of hallucinations in the summaries produced by these models. Detecting hallucinations is vital as it directly influences the reliability of the information, potentially affecting patient care and treatment outcomes. We conduct comprehensive numerical simulations to rigorously evaluate the performance of these models, further probing the accuracy and fidelity of the extracted content in clinical summarization.
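Summary: Clinical summarization is crucial in healthcare because it distills complex medical data into digestible information, enhancing patient understanding and care management. Large language models (LLMs) have shown significant potential for automating such summarization and improving its accuracy, owing to their advanced natural language understanding capabilities. These models are particularly applicable to summarizing medical and clinical texts, where precise and concise information transfer is essential. In this paper, we investigate the effectiveness of open-source LLMs in extracting key events from discharge reports, such as reasons for hospital admission, significant in-hospital events, and critical follow-up actions. We also assess the prevalence of various types of hallucinations in the summaries these models produce. Detecting hallucinations is vital because it directly influences the reliability of the information, potentially affecting patient care and treatment outcomes. We conduct comprehensive numerical simulations to rigorously evaluate the performance of these models, further probing the accuracy and fidelity of the extracted content in clinical summarization.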

Title: ClimaEmpact: Domain-Aligned Small Language Models and Datasets for Extreme Weather Analytics

Authors: Deeksha Varshney, Keane Ong, Rui Mao, Erik Cambria, Gianmarco Mengaldo
Subjects: cs.CL, cs.AI, cs.LG, physics.ao-ph
Abstract URL: https://arxiv.org/abs/2504.19066
Pdf URL: https://arxiv.org/pdf/2504.19066
Copy Paste: [[2504.19066]] ClimaEmpact: Domain-Aligned Small Language Models and Datasets for Extreme Weather Analytics(https://arxiv.org/abs/2504.19066)
Keywords: language model, llm
Abstract: Accurate assessments of extreme weather events are vital for research and policy, yet localized and granular data remain scarce in many parts of the world. This data gap limits our ability to analyze potential outcomes and implications of extreme weather events, hindering effective decision-making. Large Language Models (LLMs) can process vast amounts of unstructured text data, extract meaningful insights, and generate detailed assessments by synthesizing information from multiple sources. Furthermore, LLMs can seamlessly transfer their general language understanding to smaller models, enabling these models to retain key knowledge while being fine-tuned for specific tasks. In this paper, we propose Extreme Weather Reasoning-Aware Alignment (EWRA), a method that enhances small language models (SLMs) by incorporating structured reasoning paths derived from LLMs, and ExtremeWeatherNews, a large dataset of extreme weather event-related news articles. EWRA and ExtremeWeatherNews together form the overall framework, ClimaEmpact, which focuses on three critical extreme-weather tasks: categorization of tangible vulnerabilities/impacts, topic labeling, and emotion analysis. By aligning SLMs with advanced reasoning strategies on ExtremeWeatherNews (and its derived dataset ExtremeAlign, used specifically for SLM alignment), EWRA improves the SLMs' ability to generate well-grounded and domain-specific responses for extreme weather analytics. Our results show that the proposed approach guides SLMs to output domain-aligned responses, surpassing the performance of task-specific models and offering enhanced real-world applicability for extreme weather analytics.
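Summary: Accurate assessments of extreme weather events are vital for research and policy, yet localized and granular data remain scarce in many parts of the world. This data gap limits our ability to analyze the potential outcomes and implications of extreme weather events, hindering effective decision-making. Large Language Models (LLMs) can process vast amounts of unstructured text data, extract meaningful insights, and generate detailed assessments by synthesizing information from multiple sources. Furthermore, LLMs can seamlessly transfer their general language understanding to smaller models, enabling these models to retain key knowledge while being fine-tuned for specific tasks. In this paper, we propose Extreme Weather Reasoning-Aware Alignment (EWRA), a method that enhances small language models (SLMs) by incorporating structured reasoning paths derived from LLMs, and ExtremeWeatherNews, a large dataset of news articles related to extreme weather events. EWRA and ExtremeWeatherNews together form the overall framework, ClimaEmpact, which focuses on three critical extreme-weather tasks: categorization of tangible vulnerabilities/impacts, topic labeling, and emotion analysis. By aligning SLMs with advanced reasoning strategies on ExtremeWeatherNews (and its derived dataset ExtremeAlign, used specifically for SLM alignment), EWRA improves the SLMs' ability to generate well-grounded and domain-specific responses for extreme weather analytics. Our results show that the proposed approach guides SLMs to output domain-aligned responses, surpassing the performance of task-specific models and offering enhanced real-world applicability for extreme weather analytics.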

Title: Sample-Efficient Language Model for Hinglish Conversational AI

Authors: Sakshi Singh, Abhinav Prakash, Aakriti Shah, Chaitanya Sachdeva, Sanjana Dumpala
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.19070
Pdf URL: https://arxiv.org/pdf/2504.19070
Copy Paste: [[2504.19070]] Sample-Efficient Language Model for Hinglish Conversational AI(https://arxiv.org/abs/2504.19070)
Keywords: language model, chat
Abstract: This paper presents our process for developing a sample-efficient language model for a conversational Hinglish chatbot. Hinglish, a code-mixed language that combines Hindi and English, presents a unique computational challenge due to inconsistent spelling, lack of standardization, and limited quality of conversational data. This work evaluates multiple pre-trained cross-lingual language models, including Gemma3-4B and Qwen2.5-7B, and employs fine-tuning techniques to improve performance on Hinglish conversational tasks. The proposed approach integrates synthetically generated dialogues with insights from existing Hinglish datasets to address data scarcity. Experimental results demonstrate that models with fewer parameters, when appropriately fine-tuned on high-quality code-mixed data, can achieve competitive performance for Hinglish conversation generation while maintaining computational efficiency.
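Summary: This paper presents our process for developing a sample-efficient language model for a conversational Hinglish chatbot. Hinglish, a code-mixed language that combines Hindi and English, poses a unique computational challenge due to inconsistent spelling, lack of standardization, and the limited quality of conversational data. This work evaluates multiple pre-trained cross-lingual language models, including Gemma3-4B and Qwen2.5-7B, and employs fine-tuning techniques to improve performance on Hinglish conversational tasks. The proposed approach integrates synthetically generated dialogues with insights from existing Hinglish datasets to address data scarcity. Experimental results show that models with fewer parameters, when appropriately fine-tuned on high-quality code-mixed data, can achieve competitive performance for Hinglish conversation generation while maintaining computational efficiency.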

Title: Efficient Reasoning for LLMs through Speculative Chain-of-Thought

Authors: Jikai Wang, Juntao Li, Lijun Wu, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.19095
Pdf URL: https://arxiv.org/pdf/2504.19095
Copy Paste: [[2504.19095]] Efficient Reasoning for LLMs through Speculative Chain-of-Thought(https://arxiv.org/abs/2504.19095)
Keywords: language model, llm, chain-of-thought
Abstract: Large reasoning language models such as OpenAI-o1 and Deepseek-R1 have recently attracted widespread attention due to their impressive task-solving abilities. However, the enormous model size and the generation of lengthy thought chains introduce significant reasoning costs and response latency. Existing methods for efficient reasoning mainly focus on reducing the number of model parameters or shortening the chain-of-thought length. In this paper, we introduce Speculative Chain-of-Thought (SCoT), which reduces reasoning latency from another perspective by accelerating average reasoning speed through collaboration between large and small models. SCoT conducts thought-level drafting using a lightweight draft model. Then it selects the best CoT draft and corrects the error cases with the target model. The proposed thinking behavior alignment improves the efficiency of drafting, and the draft selection strategy maintains the prediction accuracy for complex problems. Experimental results on GSM8K, MATH, GaoKao, CollegeMath and Olympiad datasets show that SCoT reduces reasoning latency by 48% to 66% for Deepseek-R1-Distill-Qwen-32B while achieving near-target-model-level performance. Our code is available at this https URL.
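Summary: Large reasoning language models such as OpenAI-o1 and Deepseek-R1 have recently attracted widespread attention for their impressive task-solving abilities. However, their enormous model size and the generation of lengthy thought chains introduce significant reasoning costs and response latency. Existing methods for efficient reasoning mainly focus on reducing the number of model parameters or shortening the chain-of-thought length. In this paper, we introduce Speculative Chain-of-Thought (SCoT), which reduces reasoning latency from another perspective by accelerating average reasoning speed through collaboration between large and small models. SCoT performs thought-level drafting with a lightweight draft model. It then selects the best CoT draft and corrects the error cases with the target model. The proposed thinking behavior alignment improves drafting efficiency, and the draft selection strategy maintains prediction accuracy on complex problems. Experimental results on the GSM8K, MATH, GaoKao, CollegeMath, and Olympiad datasets show that SCoT reduces reasoning latency by 48% to 66% for Deepseek-R1-Distill-Qwen-32B while achieving near-target-model-level performance. Our code is available at this https URL.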

Title: Privacy-Preserving Federated Embedding Learning for Localized Retrieval-Augmented Generation

Authors: Qianren Mao, Qili Zhang, Hanwen Hao, Zhentao Han, Runhua Xu, Weifeng Jiang, Qi Hu, Zhijun Chen, Tyler Zhou, Bo Li, Yangqiu Song, Jin Dong, Jianxin Li, Philip S. Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.19101
Pdf URL: https://arxiv.org/pdf/2504.19101
Copy Paste: [[2504.19101]] Privacy-Preserving Federated Embedding Learning for Localized Retrieval-Augmented Generation(https://arxiv.org/abs/2504.19101)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) has recently emerged as a promising solution for enhancing the accuracy and credibility of Large Language Models (LLMs), particularly in Question & Answer tasks. This is achieved by incorporating proprietary and private data from integrated databases. However, private RAG systems face significant challenges due to the scarcity of private domain data and critical data privacy issues. These obstacles impede the deployment of private RAG systems, as developing privacy-preserving RAG systems requires a delicate balance between data security and data availability. To address these challenges, we regard federated learning (FL) as a highly promising technology for privacy-preserving RAG services. We propose a novel framework called Federated Retrieval-Augmented Generation (FedE4RAG). This framework facilitates collaborative training of client-side RAG retrieval models. The parameters of these models are aggregated and distributed on a central server, ensuring data privacy without direct sharing of raw data. In FedE4RAG, knowledge distillation is employed for communication between the server and client models. This technique improves the generalization of local RAG retrievers during the federated learning process. Additionally, we apply homomorphic encryption within federated learning to safeguard model parameters and mitigate concerns related to data leakage. Extensive experiments conducted on a real-world dataset have validated the effectiveness of FedE4RAG. The results demonstrate that our proposed framework can markedly enhance the performance of private RAG systems while maintaining robust data privacy protection.
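Summary: Retrieval-Augmented Generation (RAG) has recently emerged as a promising solution for improving the accuracy and credibility of Large Language Models (LLMs), particularly in Question & Answer tasks. This is achieved by incorporating proprietary and private data from integrated databases. However, private RAG systems face significant challenges due to the scarcity of private domain data and critical data privacy issues. These obstacles impede the deployment of private RAG systems, because developing privacy-preserving RAG systems requires a delicate balance between data security and data availability. To address these challenges, we regard federated learning (FL) as a highly promising technology for privacy-preserving RAG services. We propose a novel framework called Federated Retrieval-Augmented Generation (FedE4RAG). This framework facilitates collaborative training of client-side RAG retrieval models. The parameters of these models are aggregated and distributed on a central server, ensuring data privacy without direct sharing of raw data. In FedE4RAG, knowledge distillation is used for communication between the server and client models. This technique improves the generalization of local RAG retrievers during federated learning. Additionally, we apply homomorphic encryption within federated learning to safeguard model parameters and mitigate concerns about data leakage. Extensive experiments on a real-world dataset have validated the effectiveness of FedE4RAG. The results show that the proposed framework can markedly enhance the performance of private RAG systems while maintaining robust data privacy protection.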
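
The server-side aggregation this abstract refers to can be pictured with plain federated averaging, sketched below under stated assumptions. The abstract does not say which aggregation rule FedE4RAG uses, so the size-weighted mean here is a stand-in; knowledge distillation and homomorphic encryption are left out entirely. `federated_average` and the toy client data are hypothetical.

```python
from typing import List, Sequence


def federated_average(
    client_params: Sequence[Sequence[float]],
    client_sizes: Sequence[int],
) -> List[float]:
    """Size-weighted mean of client parameter vectors (FedAvg-style).

    client_params[c] -- flattened retriever parameters from client c
    client_sizes[c]  -- number of local training examples on client c
    Only parameters are sent to the server; raw documents stay on clients.
    """
    total = sum(client_sizes)
    n_params = len(client_params[0])
    return [
        sum(w[j] * s for w, s in zip(client_params, client_sizes)) / total
        for j in range(n_params)
    ]


# Two toy clients with two parameters each (made-up values).
global_params = federated_average([[1.0, 2.0], [3.0, 6.0]], client_sizes=[1, 3])
print(global_params)  # [2.5, 5.0]
```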

Title: APE-Bench I: Towards File-level Automated Proof Engineering of Formal Math Libraries

Authors: Huajian Xin, Luming Li, Xiaoran Jin, Jacques Fleuriot, Wenda Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.19110
Pdf URL: https://arxiv.org/pdf/2504.19110
Copy Paste: [[2504.19110]] APE-Bench I: Towards File-level Automated Proof Engineering of Formal Math Libraries(https://arxiv.org/abs/2504.19110)
Keywords: language model, llm, agent
Abstract: Recent progress in large language models (LLMs) has shown promise in formal theorem proving, yet existing benchmarks remain limited to isolated, static proof tasks, failing to capture the iterative, engineering-intensive workflows of real-world formal mathematics libraries. Motivated by analogous advances in software engineering, we introduce the paradigm of Automated Proof Engineering (APE), which aims to automate proof engineering tasks such as feature addition, proof refactoring, and bug fixing using LLMs. To facilitate research in this direction, we present APE-Bench I, the first realistic benchmark built from real-world commit histories of Mathlib4, featuring diverse file-level tasks described in natural language and verified via a hybrid approach combining the Lean compiler and LLM-as-a-Judge. We further develop Eleanstic, a scalable parallel verification infrastructure optimized for proof checking across multiple versions of Mathlib. Empirical results on state-of-the-art LLMs demonstrate strong performance on localized edits but substantial degradation on handling complex proof engineering. This work lays the foundation for developing agentic workflows in proof engineering, with future benchmarks targeting multi-file coordination, project-scale verification, and autonomous agents capable of planning, editing, and repairing formal libraries.
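Summary: Recent progress in large language models (LLMs) has shown promise in formal theorem proving, yet existing benchmarks remain limited to isolated, static proof tasks and fail to capture the iterative, engineering-intensive workflows of real-world formal mathematics libraries. Motivated by analogous advances in software engineering, we introduce the paradigm of Automated Proof Engineering (APE), which aims to automate proof engineering tasks such as feature addition, proof refactoring, and bug fixing using LLMs. To facilitate research in this direction, we present APE-Bench I, the first realistic benchmark built from real-world commit histories of Mathlib4, featuring diverse file-level tasks described in natural language and verified via a hybrid approach that combines the Lean compiler with LLM-as-a-Judge. We further develop Eleanstic, a scalable parallel verification infrastructure optimized for proof checking across multiple versions of Mathlib. Empirical results on state-of-the-art LLMs show strong performance on localized edits but substantial degradation on complex proof engineering. This work lays the foundation for developing agentic workflows in proof engineering, with future benchmarks targeting multi-file coordination, project-scale verification, and autonomous agents capable of planning, editing, and repairing formal libraries.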

Title: SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning

Authors: Jiaqi Chen, Bang Zhang, Ruotian Ma, Peisong Wang, Xiaodan Liang, Zhaopeng Tu, Xiaolong Li, Kwan-Yee K. Wong
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.19162
Pdf URL: https://arxiv.org/pdf/2504.19162
Copy Paste: [[2504.19162]] SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning(https://arxiv.org/abs/2504.19162)
Keywords: language model, llm, chain-of-thought
Abstract: Evaluating the step-by-step reliability of large language model (LLM) reasoning, such as Chain-of-Thought, remains challenging due to the difficulty and cost of obtaining high-quality step-level supervision. In this paper, we introduce Self-Play Critic (SPC), a novel approach where a critic model evolves its ability to assess reasoning steps through adversarial self-play games, eliminating the need for manual step-level annotation. SPC involves fine-tuning two copies of a base model to play two roles, namely a "sneaky generator" that deliberately produces erroneous steps designed to be difficult to detect, and a "critic" that analyzes the correctness of reasoning steps. These two models engage in an adversarial game in which the generator aims to fool the critic, while the critic model seeks to identify the generator's errors. Using reinforcement learning based on the game outcomes, the models iteratively improve; the winner of each confrontation receives a positive reward and the loser receives a negative reward, driving continuous self-evolution. Experiments on three reasoning process benchmarks (ProcessBench, PRM800K, DeltaBench) demonstrate that our SPC progressively enhances its error detection capabilities (e.g., accuracy increases from 70.8% to 77.7% on ProcessBench) and surpasses strong baselines, including a distilled R1 model. Furthermore, applying SPC to guide the test-time search of diverse LLMs significantly improves their mathematical reasoning performance on MATH500 and AIME2024, outperforming state-of-the-art process reward models.
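Summary: Evaluating the step-by-step reliability of large language model (LLM) reasoning, such as Chain-of-Thought, remains challenging because high-quality step-level supervision is difficult and costly to obtain. In this paper, we introduce Self-Play Critic (SPC), a novel approach in which a critic model evolves its ability to assess reasoning steps through adversarial self-play games, eliminating the need for manual step-level annotation. SPC fine-tunes two copies of a base model to play two roles: a "sneaky generator" that deliberately produces erroneous steps designed to be difficult to detect, and a "critic" that analyzes the correctness of reasoning steps. The two models engage in an adversarial game in which the generator aims to fool the critic, while the critic seeks to identify the generator's errors. Using reinforcement learning based on the game outcomes, the models improve iteratively; the winner of each confrontation receives a positive reward and the loser a negative reward, driving continuous self-evolution. Experiments on three reasoning process benchmarks (ProcessBench, PRM800K, DeltaBench) show that SPC progressively enhances its error detection capabilities (e.g., accuracy increases from 70.8% to 77.7% on ProcessBench) and surpasses strong baselines, including a distilled R1 model. Furthermore, applying SPC to guide the test-time search of diverse LLMs significantly improves their mathematical reasoning performance on MATH500 and AIME2024, outperforming state-of-the-art process reward models.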

Title: WuNeng: Hybrid State with Attention

Authors: Liu Xiao, Li Zhiyuan, Lin Yueyu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.19191
Pdf URL: https://arxiv.org/pdf/2504.19191
Copy Paste: [[2504.19191]] WuNeng: Hybrid State with Attention(https://arxiv.org/abs/2504.19191)
Keywords: language model
Abstract: The WuNeng architecture introduces a novel approach to enhancing the expressivity and power of large language models by integrating recurrent neural network (RNN)-based RWKV-7 with advanced attention mechanisms, prioritizing heightened contextual coherence over reducing KV cache size. Building upon the hybrid-head concept from Hymba, WuNeng augments standard multi-head attention with additional RWKV-7 state-driven heads, rather than replacing existing heads, to enrich the model's representational capacity. A cross-head interaction technique fosters dynamic synergy among standard, state-driven, and newly introduced middle heads, leveraging concatenation, additive modulation, and gated fusion for robust information integration. Furthermore, a multi-token state processing mechanism harnesses the continuous RWKV-7 state to capture intricate, sequence-wide dependencies, significantly boosting expressivity. Remarkably, these enhancements are achieved with minimal additional parameters, ensuring efficiency while empowering the model to excel in complex reasoning and sequence generation tasks. WuNeng sets a new standard for balancing expressivity and computational efficiency in modern neural architectures.
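Summary: The WuNeng architecture introduces a novel approach to enhancing the expressivity and power of large language models by integrating the recurrent neural network (RNN)-based RWKV-7 with advanced attention mechanisms, prioritizing heightened contextual coherence over reducing KV cache size. Building on the hybrid-head concept from Hymba, WuNeng augments standard multi-head attention with additional RWKV-7 state-driven heads, rather than replacing existing heads, to enrich the model's representational capacity. A cross-head interaction technique fosters dynamic synergy among standard, state-driven, and newly introduced middle heads, leveraging concatenation, additive modulation, and gated fusion for robust information integration. Furthermore, a multi-token state processing mechanism harnesses the continuous RWKV-7 state to capture intricate, sequence-wide dependencies, significantly boosting expressivity. Remarkably, these enhancements are achieved with minimal additional parameters, ensuring efficiency while enabling the model to excel in complex reasoning and sequence generation tasks. WuNeng sets a new standard for balancing expressivity and computational efficiency in modern neural architectures.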

Title: Uncertainty Quantification for Language Models: A Suite of Black-Box, White-Box, LLM Judge, and Ensemble Scorers

Authors: Dylan Bouchard, Mohit Singh Chauhan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.19254
Pdf URL: https://arxiv.org/pdf/2504.19254
Copy Paste: [[2504.19254]] Uncertainty Quantification for Language Models: A Suite of Black-Box, White-Box, LLM Judge, and Ensemble Scorers(https://arxiv.org/abs/2504.19254)
Keywords: language model, llm, hallucination
Abstract: Hallucinations are a persistent problem with Large Language Models (LLMs). As these models become increasingly used in high-stakes domains, such as healthcare and finance, the need for effective hallucination detection is crucial. To this end, we propose a versatile framework for zero-resource hallucination detection that practitioners can apply to real-world use cases. To achieve this, we adapt a variety of existing uncertainty quantification (UQ) techniques, including black-box UQ, white-box UQ, and LLM-as-a-Judge, transforming them as necessary into standardized response-level confidence scores ranging from 0 to 1. To enhance flexibility, we introduce a tunable ensemble approach that incorporates any combination of the individual confidence scores. This approach enables practitioners to optimize the ensemble for a specific use case for improved performance. To streamline implementation, the full suite of scorers is offered in this paper's companion Python toolkit, UQLM. To evaluate the performance of the various scorers, we conduct an extensive set of experiments using several LLM question-answering benchmarks. We find that our tunable ensemble typically surpasses its individual components and outperforms existing hallucination detection methods. Our results demonstrate the benefits of customized hallucination detection strategies for improving the accuracy and reliability of LLMs.
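Summary: Hallucinations are a persistent problem with Large Language Models (LLMs). As these models are increasingly used in high-stakes domains such as healthcare and finance, effective hallucination detection becomes crucial. To this end, we propose a versatile framework for zero-resource hallucination detection that practitioners can apply to real-world use cases. To achieve this, we adapt a variety of existing uncertainty quantification (UQ) techniques, including black-box UQ, white-box UQ, and LLM-as-a-Judge, transforming them as necessary into standardized response-level confidence scores ranging from 0 to 1. To enhance flexibility, we introduce a tunable ensemble approach that incorporates any combination of the individual confidence scores. This approach lets practitioners optimize the ensemble for a specific use case to improve performance. To streamline implementation, the full suite of scorers is offered in this paper's companion Python toolkit, UQLM. To evaluate the performance of the various scorers, we conduct an extensive set of experiments using several LLM question-answering benchmarks. We find that our tunable ensemble typically surpasses its individual components and outperforms existing hallucination detection methods. Our results demonstrate the benefits of customized hallucination detection strategies for improving the accuracy and reliability of LLMs.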
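
One simple way to picture a tunable ensemble over response-level confidence scores in [0, 1] is a weighted average, sketched below. This is not the UQLM API: the function `ensemble_confidence`, the scorer names, and the weights are all hypothetical, and the abstract does not state the exact form of the ensemble.

```python
from typing import Dict


def ensemble_confidence(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-scorer confidence scores, clipped to [0, 1].

    scores  -- confidence per scorer, each already normalized to [0, 1]
    weights -- non-negative weight per scorer; these are the tunable part
    """
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    value = sum(weights[name] * scores[name] for name in scores) / total_weight
    return min(1.0, max(0.0, value))


# Hypothetical scores for one LLM response from three scorer families.
scores = {"black_box": 0.8, "white_box": 0.6, "judge": 1.0}
weights = {"black_box": 0.5, "white_box": 0.25, "judge": 0.25}

confidence = ensemble_confidence(scores, weights)
print(round(confidence, 3))  # 0.8
```

Tuning would then amount to searching over `weights` (and a decision threshold) against labeled examples from the target use case.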

Title: VIST-GPT: Ushering in the Era of Visual Storytelling with LLMs?

Authors: Mohamed Gado, Towhid Taliee, Muhammad Memon, Dmitry Ignatov, Radu Timofte
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.19267
Pdf URL: https://arxiv.org/pdf/2504.19267
Copy Paste: [[2504.19267]] VIST-GPT: Ushering in the Era of Visual Storytelling with LLMs?(https://arxiv.org/abs/2504.19267)
Keywords: gpt, llm
Abstract: Visual storytelling is an interdisciplinary field combining computer vision and natural language processing to generate cohesive narratives from sequences of images. This paper presents a novel approach that leverages recent advancements in multimodal models, specifically adapting transformer-based architectures and large multimodal models, for the visual storytelling task. Leveraging the large-scale Visual Storytelling (VIST) dataset, our VIST-GPT model produces visually grounded, contextually appropriate narratives. We address the limitations of traditional evaluation metrics, such as BLEU, METEOR, ROUGE, and CIDEr, which are not suitable for this task. Instead, we utilize RoViST and GROOVIST, novel reference-free metrics designed to assess visual storytelling, focusing on visual grounding, coherence, and non-redundancy. These metrics provide a more nuanced evaluation of narrative quality, aligning closely with human judgment.
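Summary: Visual storytelling is an interdisciplinary field that combines computer vision and natural language processing to generate cohesive narratives from sequences of images. This paper presents a novel approach that leverages recent advances in multimodal models, specifically adapting transformer-based architectures and large multimodal models, for the visual storytelling task. Leveraging the large-scale Visual Storytelling (VIST) dataset, our VIST-GPT model produces visually grounded, contextually appropriate narratives. We address the limitations of traditional evaluation metrics, such as BLEU, METEOR, ROUGE, and CIDEr, which are not suitable for this task. Instead, we use RoViST and GROOVIST, novel reference-free metrics designed to assess visual storytelling with a focus on visual grounding, coherence, and non-redundancy. These metrics provide a more nuanced evaluation of narrative quality that aligns closely with human judgment.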

Title: AndroidGen: Building an Android Language Agent under Data Scarcity

Authors: Hanyu Lai, Junjie Gao, Xiao Liu, Yifan Xu, Shudan Zhang, Yuxiao Dong, Jie Tang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.19298
Pdf URL: https://arxiv.org/pdf/2504.19298
Copy Paste: [[2504.19298]] AndroidGen: Building an Android Language Agent under Data Scarcity(https://arxiv.org/abs/2504.19298)
Keywords: language model, llm, agent
Abstract: Large language models have opened up a world of possibilities for various NLP tasks, sparking optimism for the future. Despite their potential, LLMs have yet to be widely used as agents on real mobile devices. The main challenge is the need for high-quality data sources. Time constraints and labor intensity often hinder human annotation. On the other hand, existing LLMs exhibit inadequate completion rates and need a robust data filtration strategy. Given these challenges, we develop a framework called AndroidGen to enhance the capabilities of LLM-based agents under data scarcity. In addition, we leverage AndroidGen to collect trajectories given human tasks and train open-source LLMs on these trajectories to develop an open-source mobile agent without manually labeled trajectories. We extensively evaluate AndroidGen with AndroidWorld, AitW, and various popular applications, demonstrating its improvements and revealing potential areas for future improvement. Code, model, and data are available at this https URL.
摘要：大型语言模型为各种NLP任务开辟了一个可能性的世界，对未来产生了乐观。尽管有潜力，但LLM尚未被广泛用作实际移动设备的代理。主要的挑战是需要高质量的数据源。时间限制和劳动强度通常会阻碍人类注释。另一方面，现有的LLM表现出不足的完成率，需要强大的数据过滤策略。鉴于这些挑战，我们开发了一个名为Androidgen的框架，以增强数据稀缺下的基于LLM的代理的功能。此外，我们利用Androidgen在这些轨迹上收集轨迹，并在这些轨迹上训练开源LLM，以开发开源移动代理，而无需手动标记轨迹。我们通过Androidworld，AITW和各种流行的应用程序广泛评估Androidgen，展示了其改进并揭示了未来改进的潜在领域。代码，模型和数据可在此HTTPS URL上找到。

Title: BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese

Authors: Peilin Zhou, Bruce Leon, Xiang Ying, Can Zhang, Yifan Shao, Qichen Ye, Dading Chong, Zhiling Jin, Chenxuan Xie, Meng Cao, Yuxin Gu, Sixin Hong, Jing Ren, Jian Chen, Chao Liu, Yining Hua
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.19314
Pdf URL: https://arxiv.org/pdf/2504.19314
Copy Paste: [[2504.19314]] BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese(https://arxiv.org/abs/2504.19314)
Keywords: language model, llm, agent
Abstract: As large language models (LLMs) evolve into tool-using agents, the ability to browse the web in real-time has become a critical yardstick for measuring their reasoning and retrieval competence. Existing benchmarks such as BrowseComp concentrate on English and overlook the linguistic, infrastructural, and censorship-related complexities of other major information ecosystems -- most notably Chinese. To address this gap, we introduce BrowseComp-ZH, a high-difficulty benchmark purpose-built to comprehensively evaluate LLM agents on the Chinese web. BrowseComp-ZH consists of 289 multi-hop questions spanning 11 diverse domains. Each question is reverse-engineered from a short, objective, and easily verifiable answer (e.g., a date, number, or proper noun). A two-stage quality control protocol is applied to strive for high question difficulty and answer uniqueness. We benchmark over 20 state-of-the-art language models and agentic search systems on our proposed BrowseComp-ZH. Despite their strong conversational and retrieval capabilities, most models struggle severely: a large number achieve accuracy rates below 10%, and only a handful exceed 20%. Even the best-performing system, OpenAI's DeepResearch, reaches just 42.9%. These results demonstrate the considerable difficulty of BrowseComp-ZH, where success demands not only effective retrieval strategies, but also sophisticated reasoning and information reconciliation -- capabilities that current models still struggle to master. Our dataset, construction guidelines, and benchmark results have been publicly released at this https URL.
摘要：随着大型语言模型（LLM）演变为使用工具的代理商，实时浏览网络的能力已成为衡量其推理和检索能力的关键标准。诸如BrowseComp之类的现有基准专注于英语，并忽略其他主要信息生态系统的语言，基础设施和与审查相关的复杂性 - 最著名的是中文。为了解决这一差距，我们介绍了BrowseComp-ZH，这是一种高难题的基准设计，以全面评估中国网络上的LLM代理。 BrowseComp-ZH由289个多跳的问题组成，涵盖11个不同领域。每个问题都是从简短，客观且易于验证的答案（例如，日期，编号或适当名词）进行反向工程的。采用两阶段质量控制方案来努力解决高问题的难度和回答唯一性。我们在提议的BrowseComp-ZH上基准了20种最先进的语言模型和代理搜索系统。尽管他们的对话和检索能力很强，但大多数模型都在努力挣扎：大量的精度率低于10％，而只有少数超过20％。即使是表现最好的系统，OpenAI的Deepresearch也只能达到42.9％。这些结果表明了Browsecomp-ZH的相当困难，因为成功不仅需要有效的检索策略，而且需要复杂的推理和信息和解，即当前模型仍在努力掌握的能力。我们的数据集，施工准则和基准结果已在此HTTPS URL上公开发布。

Title: Unified Multi-Task Learning & Model Fusion for Efficient Language Model Guardrailing

Authors: James O' Neill, Santhosh Subramanian, Eric Lin, Vaikkunth Mugunthan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.19333
Pdf URL: https://arxiv.org/pdf/2504.19333
Copy Paste: [[2504.19333]] Unified Multi-Task Learning & Model Fusion for Efficient Language Model Guardrailing(https://arxiv.org/abs/2504.19333)
Keywords: language model, gpt, llm
Abstract: The trend towards large language models (LLMs) for guardrailing against undesired behaviors is increasing and has shown promise for censoring user inputs. However, increased latency, memory consumption, hosting expenses and non-structured outputs can make their use prohibitive. In this work, we show that task-specific data generation can lead to fine-tuned classifiers that significantly outperform current state of the art (SoTA) while being orders of magnitude smaller. Secondly, we show that using a single model, MultiTaskGuard, that is pretrained on a large synthetically generated dataset with unique task instructions further improves generalization. Thirdly, our most performant models, UniGuard, are found using our proposed search-based model merging approach that finds an optimal set of parameters to combine single-policy models and multi-policy guardrail models. On 7 public datasets and 4 guardrail benchmarks we created, our efficient guardrail classifiers improve over the best performing SoTA publicly available LLMs and 3rd party guardrail APIs in detecting unsafe and safe behaviors by an average F1 score improvement of 29.92 points over Aegis-LlamaGuard and 21.62 over gpt-4o, respectively. Lastly, our guardrail synthetic data generation process that uses custom task-specific guardrail poli
摘要：针对不希望的行为进行护栏的大型语言模型（LLM）的趋势正在增加，并显示了对用户输入进行审查的希望。但是，增加的延迟，内存消耗，托管费用和非结构化产出可能会使它们的使用过高。在这项工作中，我们表明特定于任务的数据生成可以导致分类器，这些分类器的表现明显超过了当前的最新现状（SOTA），而数量级较小。其次，我们表明，使用单个模型\ texttt {MultitaskGuard}，该模型在具有唯一任务指令的大型合成生成的数据集上预处理，从而进一步改善了概括。第三，使用我们提出的基于搜索的模型合并方法找到了我们最性能的模型\ texttt {uniguard}，该方法找到了一组最佳的参数集，可以将单政策模型和多政策护栏模型组合在一起。我们创建的7个公共数据集和4个护栏基准的％，有效的护栏分类器改善了表现最佳的SOTA公开可用的LLM和3 $^{\ text {rd}} $ party party Guardrail apis在检测不安全和安全的行为方面，平均F1得分即可提高。 \ textbf {21.62}分别\ texttt {gpt-4o}。最后，我们使用自定义特定任务的护栏poli的护栏综合数据生成过程

Title: Explanatory Summarization with Discourse-Driven Planning

Authors: Dongqi Liu, Xi Yu, Vera Demberg, Mirella Lapata
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.19339
Pdf URL: https://arxiv.org/pdf/2504.19339
Copy Paste: [[2504.19339]] Explanatory Summarization with Discourse-Driven Planning(https://arxiv.org/abs/2504.19339)
Keywords: hallucination, prompt
Abstract: Lay summaries for scientific documents typically include explanations to help readers grasp sophisticated concepts or arguments. However, current automatic summarization methods do not explicitly model explanations, which makes it difficult to align the proportion of explanatory content with human-written summaries. In this paper, we present a plan-based approach that leverages discourse frameworks to organize summary generation and guide explanatory sentences by prompting responses to the plan. Specifically, we propose two discourse-driven planning strategies, where the plan is conditioned as part of the input or part of the output prefix, respectively. Empirical experiments on three lay summarization datasets show that our approach outperforms existing state-of-the-art methods in terms of summary quality, and it enhances model robustness, controllability, and mitigates hallucination.
摘要：科学文档的外行摘要通常包括说明，以帮助读者掌握复杂的概念或论点。但是，当前的自动摘要方法并未明确模拟解释，这使得解释性内容的比例与人写入的摘要变得很难。在本文中，我们提出了一种基于计划的方法，该方法利用话语框架来组织摘要生成和指导解释性句子，通过提示对计划的响应。具体而言，我们提出了两种以话语为导向的计划策略，该计划分别作为输入或输出前缀的一部分的条件。在三个外行摘要数据集上进行的经验实验表明，我们的方法在摘要质量方面优于现有的最新方法，并增强了模型的鲁棒性，可控性和减轻幻觉。

Title: ICL CIPHERS: Quantifying "Learning" in In-Context Learning via Substitution Ciphers

Authors: Zhouxiang Fang, Aayush Mishra, Muhan Gao, Anqi Liu, Daniel Khashabi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.19395
Pdf URL: https://arxiv.org/pdf/2504.19395
Copy Paste: [[2504.19395]] ICL CIPHERS: Quantifying "Learning" in In-Context Learning via Substitution Ciphers(https://arxiv.org/abs/2504.19395)
Keywords: llm
Abstract: Recent works have suggested that In-Context Learning (ICL) operates in dual modes, i.e. task retrieval (remembering learned patterns from pre-training) and task learning (inference-time "learning" from demonstrations). However, disentangling these two modes remains a challenging goal. We introduce ICL CIPHERS, a class of task reformulations based on substitution ciphers borrowed from classic cryptography. In this approach, a subset of tokens in the in-context inputs are substituted with other (irrelevant) tokens, rendering English sentences less comprehensible to the human eye. However, by design, there is a latent, fixed pattern to this substitution, making it reversible. This bijective (reversible) cipher ensures that the task remains a well-defined task in some abstract sense, despite the transformations. It is a curious question whether LLMs can solve ICL CIPHERS with a BIJECTIVE mapping, which requires deciphering the latent cipher. We show that LLMs are better at solving ICL CIPHERS with BIJECTIVE mappings than the NON-BIJECTIVE (irreversible) baseline, providing a novel approach to quantify "learning" in ICL. While this gap is small, it is consistent across the board on four datasets and six models. Finally, we examine LLMs' internal representations and identify evidence of their ability to decode the ciphered inputs.
摘要：最近的作品表明，在双重模式中，即任务检索（请记住从培训前训练中学习的模式）和任务学习（推理时间'从演示中````学习''''''''）。但是，解开这两种模式仍然是一个具有挑战性的目标。我们介绍了ICL密码，这是一类基于从经典密码学借来的替换密码的任务重新进行的。在这种方法中，中文输入中的一个子集用其他（无关）代币代替，使英语句子对人眼的理解不那么理解。但是，根据设计，这种替代具有潜在的固定模式，使其可逆。这种射精（可逆的）密码可确保尽管有所改造，但从某种意义上说，任务仍然是一个明确的任务。这是一个奇怪的问题，如果LLM可以用两种二级法来求解ICL密码，这需要破译潜在的密码。我们表明，LLM比非基础（不可逆的）基线更好地求解具有射击的ICL密码，从而提供了一种新颖的方法来量化ICL中的``学习''。尽管此差距很小，但在四个数据集和六个型号上，它在整个基础上都是一致的。最后，我们检查了LLMS的内部表示形式，并确定了其解码可依赖输入的能力的证据。
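
As a rough illustration of the bijective versus non-bijective substitution described in the ICL CIPHERS abstract, the sketch below builds both kinds of token mapping over a toy vocabulary. Everything here is an assumption made for illustration and not the authors' implementation: the function names, the whitespace tokenization, the `fraction` parameter, and in particular the way the irreversible baseline is built (all selected tokens collapse onto one target token). The vocabulary is assumed to hold at least two tokens.

```python
import random


def _choose_tokens(vocab, fraction, seed):
    """Pick the subset of tokens to substitute (hypothetical helper)."""
    rng = random.Random(seed)
    tokens = sorted(vocab)
    k = min(len(tokens), max(2, int(len(tokens) * fraction)))
    return rng, rng.sample(tokens, k)


def build_bijective_cipher(vocab, fraction, seed=0):
    """Permute the chosen tokens among themselves, so the mapping is reversible."""
    rng, chosen = _choose_tokens(vocab, fraction, seed)
    targets = chosen[:]
    rng.shuffle(targets)
    return dict(zip(chosen, targets))


def build_non_bijective_cipher(vocab, fraction, seed=0):
    """Assumed baseline: map every chosen token to the same target (many-to-one)."""
    _, chosen = _choose_tokens(vocab, fraction, seed)
    return {token: chosen[0] for token in chosen}


def apply_mapping(text, mapping):
    """Substitute whitespace-separated tokens; unmapped tokens pass through."""
    return " ".join(mapping.get(tok, tok) for tok in text.split())


def invert(mapping):
    """Return the inverse mapping, or raise if the mapping is not one-to-one."""
    inverse = {target: source for source, target in mapping.items()}
    if len(inverse) != len(mapping):
        raise ValueError("mapping is not bijective and cannot be inverted")
    return inverse
```

With the bijective mapping, `apply_mapping(apply_mapping(text, cipher), invert(cipher))` recovers the original text. With the many-to-one baseline, `invert` raises, which mirrors the point that no latent cipher can be recovered from an irreversible substitution.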

Title: Context Selection and Rewriting for Video-based Educational Question Generation

Authors: Mengxia Yu, Bang Nguyen, Olivia Zino, Meng Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.19406
Pdf URL: https://arxiv.org/pdf/2504.19406
Copy Paste: [[2504.19406]] Context Selection and Rewriting for Video-based Educational Question Generation(https://arxiv.org/abs/2504.19406)
Keywords: language model
Abstract: Educational question generation (EQG) is a crucial component of intelligent educational systems, significantly aiding self-assessment, active learning, and personalized education. While EQG systems have emerged, existing datasets typically rely on predefined, carefully edited texts, failing to represent real-world classroom content, including lecture speech with a set of complementary slides. To bridge this gap, we collect a dataset of educational questions based on lectures from real-world classrooms. On this realistic dataset, we find that current methods for EQG struggle with accurately generating questions from educational videos, particularly in aligning with specific timestamps and target answers. Common challenges include selecting informative contexts from extensive transcripts and ensuring generated questions meaningfully incorporate the target answer. To address the challenges, we introduce a novel framework utilizing large language models for dynamically selecting and rewriting contexts based on target timestamps and answers. First, our framework selects contexts from both lecture transcripts and video keyframes based on answer relevance and temporal proximity. Then, we integrate the contexts selected from both modalities and rewrite them into answer-containing knowledge statements, to enhance the logical connection between the contexts and the desired answer. This approach significantly improves the quality and relevance of the generated questions. Our dataset and code are released in this https URL.
摘要：教育问题产生（EQG）是智能教育系统的关键组成部分，在大大帮助自我评估，积极学习和个性化教育方面。尽管出现了EQG系统，但现有数据集通常依赖于预定义的，精心编辑的文本，无法代表现实世界中的课堂内容，包括带有一组补充幻灯片的讲座语音。为了弥合这一差距，我们根据实际教室的讲座收集了一个教育问题数据集。在这个现实的数据集中，我们发现EQG当前的方法与准确地从教育视频中产生问题，尤其是与特定的时间戳和目标答案保持一致。常见的挑战包括从广泛的成绩单中选择信息上下文，并确保有意义地纳入目标答案。为了应对挑战，我们介绍了一个新颖的框架，利用大型语言模型根据目标时间戳和答案动态选择和重写上下文。首先，我们的框架根据答案相关性和时间接近从讲座笔录和视频关键帧中选择上下文。然后，我们整合了从模式中选择的上下文，并将它们重写为含答案的知识语句，以增强上下文和所需答案之间的逻辑连接。这种方法可大大提高生成问题的质量和相关性。我们的数据集和代码在此HTTPS URL中发布。

Title: Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Authors: Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, Deshraj Yadav
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.19413
Pdf URL: https://arxiv.org/pdf/2504.19413
Copy Paste: [[2504.19413]] Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory(https://arxiv.org/abs/2504.19413)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: Large Language Models (LLMs) have demonstrated remarkable prowess in generating contextually coherent responses, yet their fixed context windows pose fundamental challenges for maintaining consistency over prolonged multi-session dialogues. We introduce Mem0, a scalable memory-centric architecture that addresses this issue by dynamically extracting, consolidating, and retrieving salient information from ongoing conversations. Building on this foundation, we further propose an enhanced variant that leverages graph-based memory representations to capture complex relational structures among conversational elements. Through comprehensive evaluations on the LOCOMO benchmark, we systematically compare our approaches against six baseline categories: (i) established memory-augmented systems, (ii) retrieval-augmented generation (RAG) with varying chunk sizes and k-values, (iii) a full-context approach that processes the entire conversation history, (iv) an open-source memory solution, (v) a proprietary model system, and (vi) a dedicated memory management platform. Empirical results show that our methods consistently outperform all existing memory systems across four question categories: single-hop, temporal, multi-hop, and open-domain. Notably, Mem0 achieves 26% relative improvements in the LLM-as-a-Judge metric over OpenAI, while Mem0 with graph memory achieves around 2% higher overall score than the base configuration. Beyond accuracy gains, we also markedly reduce computational overhead compared to the full-context method. In particular, Mem0 attains a 91% lower p95 latency and saves more than 90% token cost, offering a compelling balance between advanced reasoning capabilities and practical deployment constraints. Our findings highlight the critical role of structured, persistent memory mechanisms for long-term conversational coherence, paving the way for more reliable and efficient LLM-driven AI agents.
摘要：大型语言模型（LLM）在产生上下文相干的响应方面表现出了极大的能力，但其固定上下文Windows构成了基本的挑战，可以保持长期多课程对话的一致性。我们介绍了MEM0，这是一种可扩展的以内存为中心的体系结构，通过动态提取，合并和从正在进行的对话中检索显着信息来解决此问题。在此基础的基础上，我们进一步提出了一种增强的变体，该变体利用基于图的内存表示来捕获对话元素之间的复杂关系结构。通过对Locomo Benchmark的全面评估，我们将我们的方法与六个基线类别进行比较：（i）已建立的内存增强系统，（ii）检索成绩（rag），具有不同块的大小和k值的零件和k值，（iii）的完整信息解决方案，（iii）一个完整的对话历史记录，（iv）（iv）（iv）（iv）（iv），即VII），即VII）。专用的内存管理平台。经验结果表明，我们的方法始终超过四个问题类别的所有现有内存系统：单跳，时间，多跳和开放域。值得注意的是，MEM0在OpenAI上实现了LLM-AS-A-A-Augge指标的相对改善，而具有图内存储器的MEM0的总分比基本配置高2％。除了准确性增长之外，我们还显着减少了与全文方法相比的计算开销。特别是，MEM0的P95潜伏期降低了91％，并节省了90％以上的令牌成本，在高级推理能力和实际部署限制之间提供了令人信服的平衡。我们的发现突出了结构化的，持久的记忆机制的关键作用，用于长期对话连贯性，为更可靠，有效的LLM驱动的AI代理铺平了道路。
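
To make the "p95 latency" and "relative improvement" figures in the Mem0 abstract concrete, the sketch below shows one common way such numbers can be computed. The nearest-rank percentile definition and the function names are assumptions for illustration; the abstract does not say which percentile method or tooling the authors used.

```python
def percentile_nearest_rank(values, percent):
    """Nearest-rank percentile; `percent` is an integer from 1 to 100."""
    if not values:
        raise ValueError("values must not be empty")
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    ordered = sorted(values)
    # Integer ceiling of percent * n / 100, avoiding floating-point rounding.
    rank = (percent * len(ordered) + 99) // 100
    return ordered[rank - 1]


def relative_reduction(baseline, candidate):
    """Fraction by which `candidate` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - candidate) / baseline
```

Under this definition, a "91% lower p95 latency" would mean `relative_reduction(p95_full_context, p95_mem0)` is about 0.91, where each p95 value comes from `percentile_nearest_rank(latencies, 95)` over the per-request latencies of that system.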

Title: Context-Guided Dynamic Retrieval for Improving Generation Quality in RAG Models

Authors: Jacky He, Guiran Liu, Binrong Zhu, Hanlu Zhang, Hongye Zheng, Xiaokai Wang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2504.19436
Pdf URL: https://arxiv.org/pdf/2504.19436
Copy Paste: [[2504.19436]] Context-Guided Dynamic Retrieval for Improving Generation Quality in RAG Models(https://arxiv.org/abs/2504.19436)
Keywords: language model, gpt, retrieval-augmented generation
Abstract: This paper focuses on the dynamic optimization of the Retrieval-Augmented Generation (RAG) architecture. It proposes a state-aware dynamic knowledge retrieval mechanism to enhance semantic understanding and knowledge scheduling efficiency in large language models for open-domain question answering and complex generation tasks. The method introduces a multi-level perceptive retrieval vector construction strategy and a differentiable document matching path. These components enable end-to-end joint training and collaborative optimization of the retrieval and generation modules. This effectively addresses the limitations of static RAG structures in context adaptation and knowledge access. Experiments are conducted on the Natural Questions dataset. The proposed structure is thoroughly evaluated across different large models, including GPT-4, GPT-4o, and DeepSeek. Comparative and ablation experiments from multiple perspectives confirm the significant improvements in BLEU and ROUGE-L scores. The approach also demonstrates stronger robustness and generation consistency in tasks involving semantic ambiguity and multi-document fusion. These results highlight its broad application potential and practical value in building high-quality language generation systems.
摘要：本文着重于检索增强生成（RAG）体系结构的动态优化。它提出了一种国家感知的动态知识检索机制，以增强大语模型中的语义理解和知识调度效率，以解决开放域问题答案和复杂的生成任务。该方法引入了多级感知检索矢量构建策略和可区分的文档匹配路径。这些组件能够端到端的联合培训以及检索和发电模块的协作优化。这有效地解决了上下文适应和知识访问中静态抹布结构的局限性。实验是在自然问题数据集上进行的。在包括GPT-4，GPT-4O和DeepSeek在内的不同大型模型中对所提出的结构进行了彻底评估。从多个角度来看，比较和消融实验证实了BLEU和Rouge-L得分的显着改善。该方法还表明，在涉及语义歧义和多文章融合的任务中，更强的鲁棒性和产生一致性。这些结果突出了其在建立高质量语言生成系统方面的广泛应用潜力和实用价值。

Title: Systematic Bias in Large Language Models: Discrepant Response Patterns in Binary vs. Continuous Judgment Tasks

Authors: Yi-Long Lu, Chunhui Zhang, Wei Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.19445
Pdf URL: https://arxiv.org/pdf/2504.19445
Copy Paste: [[2504.19445]] Systematic Bias in Large Language Models: Discrepant Response Patterns in Binary vs. Continuous Judgment Tasks(https://arxiv.org/abs/2504.19445)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are increasingly used in tasks such as psychological text analysis and decision-making in automated workflows. However, their reliability remains a concern due to potential biases inherited from their training process. In this study, we examine how different response formats, binary versus continuous, may systematically influence LLMs' judgments. In a value statement judgment task and a text sentiment analysis task, we prompted LLMs to simulate human responses and tested both formats across several models, including both open-source and commercial models. Our findings revealed a consistent negative bias: LLMs were more likely to deliver "negative" judgments in binary formats compared to continuous ones. Control experiments further revealed that this pattern holds across both tasks. Our results highlight the importance of considering response format when applying LLMs to decision tasks, as small changes in task design can introduce systematic biases.
摘要：大型语言模型（LLM）越来越多地用于自动化工作流中的心理文本分析和决策。但是，由于他们的培训过程继承了潜在的偏见，它们的可靠性仍然是一个关注。在这项研究中，我们研究了不同的响应格式：二元与连续的方式如何系统地影响LLMS的判断。在价值语句判断任务和文本情绪分析任务中，我们提示LLMS模拟人类响应，并在包括开源和商业模型在内的几种模型中测试了两种格式。我们的发现表明，与连续形式相比，LLMS更有可能以二元格式提供“负”判断。控制实验进一步揭示了这种模式在两个任务中均存在。我们的结果强调了在将LLMS应用于决策任务时考虑响应格式的重要性，因为任务设计的小变化可以引入系统偏见。

Title: Towards Long Context Hallucination Detection

Authors: Siyi Liu, Kishaloy Halder, Zheng Qi, Wei Xiao, Nikolaos Pappas, Phu Mon Htut, Neha Anna John, Yassine Benajiba, Dan Roth
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.19457
Pdf URL: https://arxiv.org/pdf/2504.19457
Copy Paste: [[2504.19457]] Towards Long Context Hallucination Detection(https://arxiv.org/abs/2504.19457)
Keywords: language model, llm, long context, hallucination
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across various tasks. However, they are prone to contextual hallucination, generating information that is either unsubstantiated or contradictory to the given context. Although many studies have investigated contextual hallucinations in LLMs, addressing them in long-context inputs remains an open problem. In this work, we take an initial step toward solving this problem by constructing a dataset specifically designed for long-context hallucination detection. Furthermore, we propose a novel architecture that enables pre-trained encoder models, such as BERT, to process long contexts and effectively detect contextual hallucinations through a decomposition and aggregation mechanism. Our experimental results show that the proposed architecture significantly outperforms previous models of similar size as well as LLM-based models across various metrics, while providing substantially faster inference.
摘要：大型语言模型（LLMS）在各种任务中表现出了出色的表现。但是，它们容易出现上下文幻觉，生成与给定上下文无根据或矛盾的信息。尽管许多研究已经调查了LLM中的情境幻觉，但在长期以来的输入中对它们进行了解决仍然是一个开放的问题。在这项工作中，我们通过构建专门设计用于长篇文化幻觉检测的数据集来迈出解决此问题的第一步。此外，我们提出了一种新颖的体系结构，该架构可以使预训练的编码模型（例如Bert）通过分解和聚集机制来处理长上下文并有效地检测上下文幻觉。我们的实验结果表明，所提出的体系结构在各种指标上都显着优于先前的类似大小以及基于LLM的模型的模型，同时提供了更快的推断。

Title: BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text

Authors: Jiageng Wu, Bowen Gu, Ren Zhou, Kevin Xie, Doug Snyder, Yixing Jiang, Valentina Carducci, Richard Wyss, Rishi J Desai, Emily Alsentzer, Leo Anthony Celi, Adam Rodman, Sebastian Schneeweiss, Jonathan H. Chen, Santiago Romero-Brufau, Kueiyu Joshua Lin, Jie Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.19467
Pdf URL: https://arxiv.org/pdf/2504.19467
Copy Paste: [[2504.19467]] BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text(https://arxiv.org/abs/2504.19467)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) hold great promise for medical applications and are evolving rapidly, with new models being released at an accelerated pace. However, current evaluations of LLMs in clinical contexts remain limited. Most existing benchmarks rely on medical exam-style questions or PubMed-derived text, failing to capture the complexity of real-world electronic health record (EHR) data. Others focus narrowly on specific application scenarios, limiting their generalizability across broader clinical use. To address this gap, we present BRIDGE, a comprehensive multilingual benchmark comprising 87 tasks sourced from real-world clinical data sources across nine languages. We systematically evaluated 52 state-of-the-art LLMs (including DeepSeek-R1, GPT-4o, Gemini, and Llama 4) under various inference strategies. With a total of 13,572 experiments, our results reveal substantial performance variation across model sizes, languages, natural language processing tasks, and clinical specialties. Notably, we demonstrate that open-source LLMs can achieve performance comparable to proprietary models, while medically fine-tuned LLMs based on older architectures often underperform versus updated general-purpose models. The BRIDGE and its corresponding leaderboard serve as a foundational resource and a unique reference for the development and evaluation of new LLMs in real-world clinical text understanding.
摘要：大型语言模型（LLMS）对医疗应用有很大的希望，并且正在迅速发展，新模型以加速的速度发布。但是，在临床环境中对LLM的当前评估仍然有限。大多数现有的基准测试依赖医学检查式的问题或PubMed衍生的文本，未能捕获现实世界电子健康记录（EHR）数据的复杂性。其他人则狭义地关注特定的应用程序方案，从而限制了它们在更广泛的临床用途中的普遍性。为了解决这一差距，我们提出了桥梁，这是一个综合的多语言基准，其中包含87项来自跨九种语言的现实世界临床数据源的任务。我们在各种推理策略下系统地评估了52个最先进的LLM（包括DeepSeek-R1，GPT-4O，Gemini和Llama 4）。总共有13,572次实验，我们的结果揭示了模型大小，语言，自然语言处理任务和临床专业的实质性差异。值得注意的是，我们证明，开源LLM可以实现与专有模型相当的性能，而基于较旧的体系结构通常表现不佳的医学微调LLMS和更新的通用通用模型。桥梁及其相应的排行榜是基础资源，也是对现实世界中临床文本理解中新LLM的开发和评估的独特参考。

Title: Conflicts in Texts: Data, Implications and Challenges

Authors: Siyi Liu, Dan Roth
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.19472
Pdf URL: https://arxiv.org/pdf/2504.19472
Copy Paste: [[2504.19472]] Conflicts in Texts: Data, Implications and Challenges(https://arxiv.org/abs/2504.19472)
Keywords: hallucination
Abstract: As NLP models become increasingly integrated into real-world applications, it becomes clear that there is a need to address the fact that models often rely on and generate conflicting information. Conflicts could reflect the complexity of situations, changes that need to be explained and dealt with, difficulties in data annotation, and mistakes in generated outputs. In all cases, disregarding the conflicts in data could result in undesired behaviors of models and undermine NLP models' reliability and trustworthiness. This survey categorizes these conflicts into three key areas: (1) natural texts on the web, where factual inconsistencies, subjective biases, and multiple perspectives introduce contradictions; (2) human-annotated data, where annotator disagreements, mistakes, and societal biases impact model training; and (3) model interactions, where hallucinations and knowledge conflicts emerge during deployment. While prior work has addressed some of these conflicts in isolation, we unify them under the broader concept of conflicting information, analyze their implications, and discuss mitigation strategies. We highlight key challenges and future directions for developing conflict-aware NLP systems that can reason over and reconcile conflicting information more effectively.
摘要：随着NLP模型越来越多地集成到现实世界中，很明显，有必要解决一个事实，即模型通常依赖并产生冲突的信息。冲突可能反映了情况的复杂性，需要解释和处理的变化，数据注释的困难以及生成的输出中的错误。在所有情况下，无视数据冲突都可能导致模型的不希望行为，并破坏NLP模型的可靠性和可信赖性。这项调查将这些冲突分为三个关键领域：（1）网络上的自然文本，事实矛盾，主观偏见和多种观点引入了矛盾；（2）人类注销的数据，注释者分歧，错误和社会偏见会影响模型培训；（3）模型相互作用，在部署过程中幻觉和知识冲突。尽管先前的工作已经孤立地解决了其中一些冲突，但我们在更广泛的信息概念，分析其含义并讨论缓解策略的情况下将它们统一。我们重点介绍了开发冲突意识的NLP系统的关键挑战和未来的方向，这些系统可以更有效地推理和调和冲突的信息。

Title: Detecting Effects of AI-Mediated Communication on Language Complexity and Sentiment

Authors: Kristen Sussman, Daniel Carter
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2504.19556
Pdf URL: https://arxiv.org/pdf/2504.19556
Copy Paste: [[2504.19556]] Detecting Effects of AI-Mediated Communication on Language Complexity and Sentiment(https://arxiv.org/abs/2504.19556)
Keywords: language model, gpt, chat
Abstract: Given the subtle human-like effects of large language models on linguistic patterns, this study examines shifts in language over time to detect the impact of AI-mediated communication (AI-MC) on social media. We compare a replicated dataset of 970,919 tweets from 2020 (pre-ChatGPT) with 20,000 tweets from the same period in 2024, all of which mention Donald Trump during election periods. Using a combination of Flesch-Kincaid readability and polarity scores, we analyze changes in text complexity and sentiment. Our findings reveal a significant increase in mean sentiment polarity (0.12 vs. 0.04) and a shift from predominantly neutral content (54.8% in 2020 to 39.8% in 2024) to more positive expressions (28.6% to 45.9%). These findings suggest not only an increasing presence of AI in social media communication but also its impact on language and emotional expression patterns.
摘要：鉴于大语言模型对语言模式的微妙影响，这项研究随着时间的流逝，语言的转变以检测AI介导的交流（AI-MC）对社交媒体的影响。我们比较了2020年的970,919条推文的复制数据集和2024年同期的20,000条推文，所有这些都提到了唐纳德·特朗普在选举期间。使用Flesch-Kincaid的可读性和极性得分的结合，我们分析了文本复杂性和情感的变化。我们的发现表明，平均情感极性（0.12 vs. 0.04）和从主要中性含量（2020年为54.8％到2024年的39.8％）的平均情感显着增加，转变为更多的阳性表达（28.6％至45.9％）。这些发现不仅表明，社交媒体沟通中AI的存在越来越多，还表明其对语言和情感表达模式的影响。
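
For readers unfamiliar with the readability measure named in the abstract above, the sketch below computes the standard Flesch-Kincaid grade level, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The abstract does not say which Flesch-Kincaid variant or which tooling the study used, so this is only an illustration; the vowel-group syllable counter and the regex-based sentence and word splitting are crude assumptions, not the authors' pipeline.

```python
import re


def count_syllables(word):
    """Crude syllable estimate: count runs of vowels (assumed heuristic)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level of `text` using the standard coefficients."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
```

Higher scores indicate text that needs more years of schooling to read comfortably. Short tweets with one-syllable words can score below zero, which is expected for this formula.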

Title: m-KAILIN: Knowledge-Driven Agentic Scientific Corpus Distillation Framework for Biomedical Large Language Models Training

Authors: Meng Xiao, Xunxin Cai, Chengrui Wang, Yuanchun Zhou
Subjects: cs.CL, cs.AI, q-bio.QM
Abstract URL: https://arxiv.org/abs/2504.19565
Pdf URL: https://arxiv.org/pdf/2504.19565
Copy Paste: [[2504.19565]] m-KAILIN: Knowledge-Driven Agentic Scientific Corpus Distillation Framework for Biomedical Large Language Models Training(https://arxiv.org/abs/2504.19565)
Keywords: language model, gpt, llm, prompt, agent
Abstract: The rapid progress of large language models (LLMs) in biomedical research has underscored the limitations of existing open-source annotated scientific corpora, which are often insufficient in quantity and quality. Addressing the challenge posed by the complex hierarchy of biomedical knowledge, we propose a knowledge-driven, multi-agent framework for scientific corpus distillation tailored for LLM training in the biomedical domain. Central to our approach is a collaborative multi-agent architecture, where specialized agents, each guided by the Medical Subject Headings (MeSH) hierarchy, work in concert to autonomously extract, synthesize, and self-evaluate high-quality textual data from vast scientific literature. These agents collectively generate and refine domain-specific question-answer pairs, ensuring comprehensive coverage and consistency with biomedical ontologies while minimizing manual involvement. Extensive experimental results show that language models trained on our multi-agent distilled datasets achieve notable improvements in biomedical question-answering tasks, outperforming both strong life sciences LLM baselines and advanced proprietary models. Notably, our AI-Ready dataset enables Llama3-70B to surpass GPT-4 with MedPrompt and Med-PaLM-2, despite their larger scale. Detailed ablation studies and case analyses further validate the effectiveness and synergy of each agent within the framework, highlighting the potential of multi-agent collaboration in biomedical LLM training.
摘要：大语言模型（LLM）在生物医学研究中的快速进步强调了现有的开源注释科学语料库的局限性，这些科学公司通常不足以数量和质量。为了应对生物医学知识的复杂层次结构所带来的挑战，我们提出了一个知识驱动的，多代理的框架，用于针对生物医学领域的LLM培训量身定制的科学语料库蒸馏。我们方法的核心是一种协作的多机构体系结构，在该建筑中，专门的代理在医学主题标题（网格）层次结构的指导下，共同努力从广泛的科学文献中自主提取，合成和自我评估的高质量文本数据。这些代理人共同生成和完善了特定于域的问答对，确保了与生物医学本体论的全面覆盖范围和一致性，同时最大程度地减少了手动参与。广泛的实验结果表明，在我们的多代理蒸馏数据集中训练的语言模型可以在生物医学提问任务方面取得显着改进，表现优于强大的生命科学LLM基准和先进的专有模型。值得注意的是，我们的AI-Ready数据集使Llama3-70B尽管规模较大，但使用MedPrompt和Med-Palm-2可以超过GPT-4。详细的消融研究和案例分析进一步验证了框架内每种代理的有效性和协同作用，从而突出了多代理合作在生物医学LLM培训中的潜力。

Title: Coreference Resolution for Vietnamese Narrative Texts

Authors: Hieu-Dai Tran, Duc-Vu Nguyen, Ngan Luu-Thuy Nguyen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.19606
Pdf URL: https://arxiv.org/pdf/2504.19606
Copy Paste: [[2504.19606]] Coreference Resolution for Vietnamese Narrative Texts(https://arxiv.org/abs/2504.19606)
Keywords: language model, gpt, llm
Abstract: Coreference resolution is a vital task in natural language processing (NLP) that involves identifying and linking different expressions in a text that refer to the same entity. This task is particularly challenging for Vietnamese, a low-resource language with limited annotated datasets. To address these challenges, we developed a comprehensive annotated dataset using narrative texts from VnExpress, a widely-read Vietnamese online news platform. We established detailed guidelines for annotating entities, focusing on ensuring consistency and accuracy. Additionally, we evaluated the performance of large language models (LLMs), specifically GPT-3.5-Turbo and GPT-4, on this dataset. Our results demonstrate that GPT-4 significantly outperforms GPT-3.5-Turbo in terms of both accuracy and response consistency, making it a more reliable tool for coreference resolution in Vietnamese.
摘要：核心分辨率是自然语言处理（NLP）中的重要任务，涉及识别和链接指代相同实体的文本中的不同表达式。对于越南人来说，这项任务尤其具有挑战性，越南人是一种低资源语言，带有注释的数据集有限。为了应对这些挑战，我们使用越南广泛阅读的在线新闻平台Vnexpress的叙事文本开发了一个全面的注释数据集。我们建立了注释实体的详细指南，专注于确保一致性和准确性。此外，我们在此数据集上评估了大语言模型（LLMS）的性能（LLMS），特别是GPT-3.5-Turbo和GPT-4。我们的结果表明，GPT-4在准确性和响应一致性方面显着优于GPT-3.5涡轮增压，使其成为越南核心分辨率更可靠的工具。

Title: VCM: Vision Concept Modeling Based on Implicit Contrastive Learning with Vision-Language Instruction Fine-Tuning

Authors: Run Luo, Renke Shan, Longze Chen, Ziqiang Liu, Lu Wang, Min Yang, Xiaobo Xia
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.19627
Pdf URL: https://arxiv.org/pdf/2504.19627
Copy Paste: [[2504.19627]] VCM: Vision Concept Modeling Based on Implicit Contrastive Learning with Vision-Language Instruction Fine-Tuning(https://arxiv.org/abs/2504.19627)
Keywords: language model
Abstract: Large Vision-Language Models (LVLMs) are pivotal for real-world AI tasks like embodied intelligence due to their strong vision-language reasoning abilities. However, current LVLMs process entire images at the token level, which is inefficient compared to humans who analyze information and generate content at the conceptual level, extracting relevant visual concepts with minimal effort. This inefficiency, stemming from the lack of a visual concept model, limits LVLMs' usability in real-world applications. To address this, we propose VCM, an end-to-end self-supervised visual concept modeling framework. VCM leverages implicit contrastive learning across multiple sampled instances and vision-language fine-tuning to construct a visual concept model without requiring costly concept-level annotations. Our results show that VCM significantly reduces computational costs (e.g., 85% fewer FLOPs for LLaVA-1.5-7B) while maintaining strong performance across diverse image understanding tasks. Moreover, VCM enhances visual encoders' capabilities in classic visual concept perception tasks. Extensive quantitative and qualitative experiments validate the effectiveness and efficiency of VCM.
摘要：大型视觉语言模型（LVLM）对于现实世界中的AI任务（例如具有强大的视觉推理能力）而言是重要的。但是，当前的LVLM在令牌级别上处理整个图像，与分析信息并在概念级别生成内容的人相比，这效率低下，以最少的努力提取相关的视觉概念。由于缺乏视觉概念模型，这种低效率限制了LVLMS在现实世界中的可用性。为了解决这个问题，我们提出了VCM，这是一个端到端的自我监督的视觉概念建模框架。 VCM在多个采样实例和视觉语言微调中利用隐式对比学习来构建视觉概念模型，而无需昂贵的概念级注释。我们的结果表明，VCM大大降低了计算成本（例如，LLAVA-1.5-7B的掉落量减少了85％），同时保持了各种图像理解任务的强劲性能。此外，VCM增强了视觉编码在经典视觉概念感知任务中的功能。广泛的定量和定性实验验证了VCM的有效性和效率。

Title: Annif at SemEval-2025 Task 5: Traditional XMTC augmented by LLMs

Authors: Osma Suominen, Juho Inkinen, Mona Lehtinen
Subjects: cs.CL, cs.AI, cs.DL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2504.19675
Pdf URL: https://arxiv.org/pdf/2504.19675
Copy Paste: [[2504.19675]] Annif at SemEval-2025 Task 5: Traditional XMTC augmented by LLMs(https://arxiv.org/abs/2504.19675)
Keywords: language model, llm
Abstract: This paper presents the Annif system in SemEval-2025 Task 5 (LLMs4Subjects), which focussed on subject indexing using large language models (LLMs). The task required creating subject predictions for bibliographic records from the bilingual TIBKAT database using the GND subject vocabulary. Our approach combines traditional natural language processing and machine learning techniques implemented in the Annif toolkit with innovative LLM-based methods for translation and synthetic data generation, and merging predictions from monolingual models. The system ranked first in the all-subjects category and second in the tib-core-subjects category in the quantitative evaluation, and fourth in qualitative evaluations. These findings demonstrate the potential of combining traditional XMTC algorithms with modern LLM techniques to improve the accuracy and efficiency of subject indexing in multilingual contexts.
摘要：本文介绍了Semeval-2025任务5（LLMS4Subjects）中的ANNIF系统，该系统的重点是使用大语言模型（LLMS）进行主题索引。需要使用GND主题词汇来从双语TIBKAT数据库中为书目记录创建主题预测的任务。我们的方法结合了Annif Toolkit中实施的传统自然语言处理和机器学习技术，以及基于LLM的创新方法，用于翻译和合成数据生成，并合并了单语模型的预测。该系统在全受试者类别中排名第一，在定量评估中排名第一，在TIB核心对象类别中排名第一，在定性评估中排名第四。这些发现证明了将传统XMTC算法与现代LLM技术相结合的潜力，以提高在多语言环境中受试者索引的准确性和效率。

Title: Taming the Titans: A Survey of Efficient LLM Inference Serving

Authors: Ranran Zhen, Juntao Li, Yixin Ji, Zhenlin Yang, Tong Liu, Qingrong Xia, Xinyu Duan, Zhefeng Wang, Baoxing Huai, Min Zhang
Subjects: cs.CL, cs.AI, cs.DC, cs.LG
Abstract URL: https://arxiv.org/abs/2504.19720
Pdf URL: https://arxiv.org/pdf/2504.19720
Copy Paste: [[2504.19720]] Taming the Titans: A Survey of Efficient LLM Inference Serving(https://arxiv.org/abs/2504.19720)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) for Generative AI have achieved remarkable progress, evolving into sophisticated and versatile tools widely adopted across various domains and applications. However, the substantial memory overhead caused by their vast number of parameters, combined with the high computational demands of the attention mechanism, poses significant challenges in achieving low latency and high throughput for LLM inference services. Recent advancements, driven by groundbreaking research, have significantly accelerated progress in this field. This paper provides a comprehensive survey of these methods, covering fundamental instance-level approaches, in-depth cluster-level strategies, emerging scenario directions, and other miscellaneous but important areas. At the instance level, we review model placement, request scheduling, decoding length prediction, storage management, and the disaggregation paradigm. At the cluster level, we explore GPU cluster deployment, multi-instance load balancing, and cloud service solutions. For emerging scenarios, we organize the discussion around specific tasks, modules, and auxiliary methods. To ensure a holistic overview, we also highlight several niche yet critical areas. Finally, we outline potential research directions to further advance the field of LLM inference serving.
摘要：用于生成AI的大型语言模型（LLM）取得了显着的进步，并发展成为各个领域和应用程序广泛采用的复杂和多功能工具。但是，由其大量参数造成的大量内存开销，再加上注意机制的高计算需求，在实现LLM推理服务的低潜伏期和高吞吐量方面构成了重大挑战。在开创性研究的推动下，最近的进步已大大加快了这一领域的进步。本文对这些方法进行了全面的调查，涵盖了基本实例级别的方法，深入的集群级策略，新兴方案方向以及其他杂项但重要的领域。在实例级别，我们审查模型放置，请求调度，解码长度预测，存储管理和分类范式。在群集级别，我们探索GPU群集部署，多稳定负载平衡和云服务解决方案。对于新兴方案，我们组织了有关特定任务，模块和辅助方法的讨论。为了确保整体概述，我们还重点介绍了几个小众但至关重要的领域。最后，我们概述了潜在的研究方向，以进一步推进LLM推理服务领域。

Title: LLM-Assisted Automated Deductive Coding of Dialogue Data: Leveraging Dialogue-Specific Characteristics to Enhance Contextual Understanding

Authors: Ying Na, Shihui Feng
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2504.19734
Pdf URL: https://arxiv.org/pdf/2504.19734
Copy Paste: [[2504.19734]] LLM-Assisted Automated Deductive Coding of Dialogue Data: Leveraging Dialogue-Specific Characteristics to Enhance Contextual Understanding(https://arxiv.org/abs/2504.19734)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: Dialogue data has been a key source for understanding learning processes, offering critical insights into how students engage in collaborative discussions and how these interactions shape their knowledge construction. The advent of Large Language Models (LLMs) has introduced promising opportunities for advancing qualitative research, particularly in the automated coding of dialogue data. However, the inherent contextual complexity of dialogue presents unique challenges for these models, especially in understanding and interpreting complex contextual information. This study addresses these challenges by developing a novel LLM-assisted automated coding approach for dialogue data. The novelty of our proposed framework is threefold: 1) We predict the code for an utterance based on dialogue-specific characteristics (communicative acts and communicative events) using separate prompts that follow the role-prompt and chain-of-thought methods; 2) We engaged multiple LLMs, including GPT-4-turbo, GPT-4o, and DeepSeek, in collaborative code prediction; 3) We leveraged the interrelation between events and acts to implement consistency checking using GPT-4o. In particular, our contextual consistency checking provided a substantial accuracy improvement. We also found the accuracy of act predictions was consistently higher than that of event predictions. This study contributes a new methodological framework for enhancing the precision of automated coding of dialogue data and offers a scalable solution for addressing the contextual challenges inherent in dialogue analysis.
摘要：对话数据一直是理解学习过程的关键来源，为学生如何参与协作讨论以及这些互动如何影响他们的知识构建提供了重要的见解。大型语言模型（LLM）的出现引入了有希望的推进定性研究的机会，尤其是在对话数据的自动编码中。但是，对话的固有上下文复杂性为这些模型带来了独特的挑战，尤其是在理解和解释复杂的上下文信息方面。这项研究通过为对话数据开发一种新型的LLM辅助自动编码方法来解决这些挑战。我们提出的框架的新颖性是三重的：1）我们根据角色提示和思维方法的方法，根据对话特定的特征（交流行为和交流事件）预测了基于对话特定特征和交流事件的发言的代码； 2）我们参与了多个LLM，包括GPT-4-Turbo，GPT-4O，DeepSeek，以协作代码预测； 3）我们利用事件和行为之间的相互关系来使用GPT-4O实施一致性检查。特别是，我们的上下文一致性检查提供了实质性的准确性提高。我们还发现，ACT预测的准确性始终高于事件预测的准确性。这项研究为增强对话数据的自动编码的精度提供了一个新的方法学框架，并为解决对话分析中固有的上下文挑战提供了可扩展的解决方案。

Title: Moral Reasoning Across Languages: The Critical Role of Low-Resource Languages in LLMs

Authors: Huichi Zhou, Zehao Xu, Munan Zhao, Kaihong Li, Yiqiang Li, Hongtao Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.19759
Pdf URL: https://arxiv.org/pdf/2504.19759
Copy Paste: [[2504.19759]] Moral Reasoning Across Languages: The Critical Role of Low-Resource Languages in LLMs(https://arxiv.org/abs/2504.19759)
Keywords: language model, llm
Abstract: In this paper, we introduce the Multilingual Moral Reasoning Benchmark (MMRB) to evaluate the moral reasoning abilities of large language models (LLMs) across five typologically diverse languages and three levels of contextual complexity: sentence, paragraph, and document. Our results show moral reasoning performance degrades with increasing context complexity, particularly for low-resource languages such as Vietnamese. We further fine-tune the open-source LLaMA-3-8B model using curated monolingual data for alignment and poisoning. Surprisingly, low-resource languages have a stronger impact on multilingual reasoning than high-resource ones, highlighting their critical role in multilingual NLP.
摘要：在本文中，我们介绍了多语言的道德推理基准（MMRB），以评估五种类型上多样的语言和三个级别上下文复杂性的大语言模型（LLM）的道德推理能力：句子，段落，段落和文档。我们的结果表明，道德推理性能会随着上下文复杂性的增加而降低，尤其是对于越南语等低资源语言而言。我们使用精选的单语言数据进行对齐和中毒，进一步调整了开源美洲拉玛-3-8B模型。令人惊讶的是，低资源语言对多语言推理的影响比高资源的推理更强烈，从而强调了它们在多语言NLP中的关键作用。

Title: Can a Crow Hatch a Falcon? Lineage Matters in Predicting Large Language Model Performance

Authors: Takuya Tamura, Taro Yano, Masafumi Enomoto, Masafumi Oyamada
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.19811
Pdf URL: https://arxiv.org/pdf/2504.19811
Copy Paste: [[2504.19811]] Can a Crow Hatch a Falcon? Lineage Matters in Predicting Large Language Model Performance(https://arxiv.org/abs/2504.19811)
Keywords: language model, llm
Abstract: Accurately forecasting the performance of Large Language Models (LLMs) before extensive fine-tuning or merging can substantially reduce both computational expense and development time. Although prior approaches like scaling laws account for global factors such as parameter size or training tokens, they often overlook explicit lineage relationships - i.e., which models are derived or merged from which parents. In this work, we propose a novel Lineage-Regularized Matrix Factorization (LRMF) framework that encodes ancestral ties among LLMs via a graph Laplacian regularizer. By leveraging multi-hop parent-child connections, LRMF consistently outperforms conventional matrix factorization and collaborative filtering methods in both instance-level and benchmark-level performance prediction. Our large-scale study includes 2,934 publicly available Hugging Face models and 21,000+ instances across 6 major benchmarks, showing that lineage constraints yield up to 7-10 percentage points higher correlation with actual performance compared to baselines. Moreover, LRMF effectively addresses the cold-start problem, providing accurate estimates for newly derived or merged models even with minimal data. This lineage-guided strategy thus offers a resource-efficient way to inform hyperparameter tuning, data selection, and model combination in modern LLM development.
摘要：准确地预测大语言模型（LLM）在大量微调或合并之前可以大大减少计算费用和开发时间。尽管诸如缩放法律之类的先前方法占参数规模或培训令牌等全球因素，但它们经常忽略显式的谱系关系 - 即，哪些模型是派生或合并的。在这项工作中，我们提出了一种新型的谱系调查矩阵分解（LRMF）框架，该框架通过图Laplacian常规化器编码LLMS之间的祖先关系。通过利用多跳的父子连接，LRMF在实例级别和基准级级别的性能预测中始终优于常规矩阵分解和协作过滤方法。我们的大规模研究包括6个主要基准的2,934个公开的拥抱面部模型和21,000多个实例，表明谱系约束与基准相比，与实际性能的相关性高达7-10个百分点。此外，LRMF有效地解决了冷门问题，即使使用最少的数据，也可以为新得出或合并的模型提供准确的估计。因此，这种血统引导的策略提供了一种资源有效的方式，可以在现代LLM开发中为超参数调整，数据选择和模型组合提供信息。
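
Illustrative note (not taken from the paper): the abstract above mentions a graph Laplacian regularizer over model lineage. The sketch below shows the textbook form of such a penalty, tr(U^T L U), and checks that it equals the sum of squared distances between the embeddings of linked models. The edge list, the embedding matrix `U`, and all function names are hypothetical; the exact LRMF objective is not given in the abstract, and parent-child links are treated as undirected here as a simplifying assumption.

```python
from itertools import product


def laplacian(n, edges):
    """Return the graph Laplacian L = D - A for an undirected graph on n nodes."""
    A = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = A[j][i] = 1.0
    L = [[-A[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        L[i][i] = sum(A[i])  # degree on the diagonal
    return L


def laplacian_penalty(U, L):
    """Compute tr(U^T L U) as sum over i, j of L[i][j] * <u_i, u_j>."""
    n = len(U)
    return sum(
        L[i][j] * sum(a * b for a, b in zip(U[i], U[j]))
        for i, j in product(range(n), repeat=2)
    )


def pairwise_penalty(U, edges):
    """Sum of squared distances between embeddings of linked models."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(U[i], U[j]))
        for i, j in edges
    )


# Hypothetical lineage: model 0 is the parent of model 1, which is the parent of model 2.
edges = [(0, 1), (1, 2)]
# Hypothetical 2-dimensional latent factors, one row per model.
U = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

L = laplacian(len(U), edges)
penalty_matrix_form = laplacian_penalty(U, L)
penalty_pairwise_form = pairwise_penalty(U, edges)
```

Adding a term of this kind to a matrix-factorization loss pulls the latent factors of related models toward each other, which is one plausible reading of how lineage information could help with newly derived models that have little evaluation data.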

Title: Efficient Domain-adaptive Continual Pretraining for the Process Industry in the German Language

Authors: Anastasia Zhukova, Christian E. Matt, Terry Ruas, Bela Gipp
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.19856
Pdf URL: https://arxiv.org/pdf/2504.19856
Copy Paste: [[2504.19856]] Efficient Domain-adaptive Continual Pretraining for the Process Industry in the German Language(https://arxiv.org/abs/2504.19856)
Keywords: language model
Abstract: Domain-adaptive continual pretraining (DAPT) is a state-of-the-art technique that further trains a language model (LM) on its pretraining task, e.g., language masking. Although popular, it requires a significant corpus of domain-related data, which is difficult to obtain for specific domains in languages other than English, such as the process industry in the German language. This paper introduces an efficient approach called ICL-augmented pretraining or ICL-APT that leverages in-context learning (ICL) and k-nearest neighbors (kNN) to augment target data with domain-related and in-domain texts, significantly reducing GPU time while maintaining strong model performance. Our results show that this approach performs better than traditional DAPT by 3.5 of the average IR metrics (e.g., mAP, MRR, and nDCG) and requires almost 4 times less computing time, providing a cost-effective solution for industries with limited computational capacity. The findings highlight the broader applicability of this framework to other low-resource industries, making NLP-based solutions more accessible and feasible in production environments.
摘要：域自适应持续预处理（DAPT）是一种最先进的技术，可以进一步训练语言模型（LM），例如语言掩盖。尽管很受欢迎，但它需要大量与域相关的数据，这对于英语以外的其他语言（例如德语流程行业）很难获得。本文介绍了一种有效的方法，称为ICL审计或ICL-APT，该方法利用了含义的内在学习（ICL）和K-Nearest邻居（KNN），以增加与域相关和与域内文本的目标数据，从而大大减少GPU时间，同时维持强大的模型性能。我们的结果表明，这种方法的性能比传统的DAPT的平均指标（例如MAP，MRR和NDCG）的3.5更好，并且需要少4倍的计算时间，这为计算能力有限的行业提供了成本效益的解决方案。这些发现突出了该框架对其他低资源行业的更广泛适用性，这使得基于NLP的解决方案在生产环境中更容易访问和可行。

Title: semi-PD: Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage

Authors: Ke Hong, Lufang Chen, Zhong Wang, Xiuhong Li, Qiuli Mao, Jianping Ma, Chao Xiong, Guanyu Wu, Buhe Han, Guohao Dai, Yun Liang, Yu Wang
Subjects: cs.CL, cs.DC, cs.LG
Abstract URL: https://arxiv.org/abs/2504.19867
Pdf URL: https://arxiv.org/pdf/2504.19867
Copy Paste: [[2504.19867]] semi-PD: Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage(https://arxiv.org/abs/2504.19867)
Keywords: language model, llm
Abstract: Existing large language model (LLM) serving systems fall into two categories: 1) a unified system where prefill phase and decode phase are co-located on the same GPU, sharing the unified computational resource and storage, and 2) a disaggregated system where the two phases are disaggregated to different GPUs. The design of the disaggregated system addresses the latency interference and sophisticated scheduling issues in the unified system but leads to storage challenges including 1) replicated weights for both phases that prevent flexible deployment, 2) KV cache transfer overhead between the two phases, 3) storage imbalance that causes substantial wasted space of the GPU capacity, and 4) suboptimal resource adjustment arising from the difficulties in migrating KV cache. Such storage inefficiency delivers poor serving performance under high request rates. In this paper, we identify that the advantage of the disaggregated system lies in the disaggregated computation, i.e., partitioning the computational resource to enable the asynchronous computation of two phases. Thus, we propose a novel LLM serving system, semi-PD, characterized by disaggregated computation and unified storage. In semi-PD, we introduce a computation resource controller to achieve disaggregated computation at the streaming multi-processor (SM) level, and a unified memory manager to manage the asynchronous memory access from both phases. semi-PD has a low-overhead resource adjustment mechanism between the two phases, and a service-level objective (SLO) aware dynamic partitioning algorithm to optimize the SLO attainment. Compared to state-of-the-art systems, semi-PD maintains lower latency at higher request rates, reducing the average end-to-end latency per request by 1.27-2.58x on DeepSeek series models, and serves 1.55-1.72x more requests adhering to latency constraints on Llama series models.
摘要：现有的大型语言模型（LLM）服务系统分为两类：1）统一的系统，在该系统中，预填充阶段和解码阶段在同一GPU上共同置于同一GPU，共享统一的计算资源和存储，以及2）一个分解的系统，其中两个阶段将两个阶段分解为不同的GPU。 The design of the disaggregated system addresses the latency interference and sophisticated scheduling issues in the unified system but leads to storage challenges including 1) replicated weights for both phases that prevent flexible deployment, 2) KV cache transfer overhead between the two phases, 3) storage imbalance that causes substantial wasted space of the GPU capacity, and 4) suboptimal resource adjustment arising from the difficulties in migrating KV cache.此类存储效率低下在高请求率下提供较差的绩效。在本文中，我们确定分解系统的优势在于分解计算，即对计算资源进行分配以实现两个阶段的异步计算。因此，我们提出了一个新型的LLM服务系统，该系统是半PD，其特征是分解计算和统一存储。在Semi-PD中，我们引入了一个计算资源控制器，以在流媒体多处理器（SM）级别实现分类的计算，以及一个统一的内存管理器，以管理两个阶段的异步内存访问。半PD在两个阶段之间具有低空的资源调整机制，并且服务级目标（SLO）意识到动态分配算法以优化SLO成就。与最先进的系统相比，Semi-PD以较高的请求率保持较低的延迟，在DeepSeek系列模型上，每个请求的平均端到端潜伏期减少了1.27-2.58倍，并提供1.55-1.72倍更多的请求，以遵守LLAMA系列模型的延迟约束。

Title: GenCLS++: Pushing the Boundaries of Generative Classification in LLMs Through Comprehensive SFT and RL Studies Across Diverse Datasets

Authors: Mingqian He, Fei Zhao, Chonggang Lu, Ziyan Liu, Yue Wang, Haofu Qian
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.19898
Pdf URL: https://arxiv.org/pdf/2504.19898
Copy Paste: [[2504.19898]] GenCLS++: Pushing the Boundaries of Generative Classification in LLMs Through Comprehensive SFT and RL Studies Across Diverse Datasets(https://arxiv.org/abs/2504.19898)
Keywords: language model, llm, prompt
Abstract: As a fundamental task in machine learning, text classification plays a crucial role in many areas. With the rapid scaling of Large Language Models (LLMs), particularly through reinforcement learning (RL), there is a growing need for more capable discriminators. Consequently, advances in classification are becoming increasingly vital for enhancing the overall capabilities of LLMs. Traditional discriminative methods map text to labels but overlook LLMs' intrinsic generative strengths. Generative classification addresses this by prompting the model to directly output labels. However, existing studies still rely on simple SFT alone, seldom probing the interplay between training and inference prompts, and no work has systematically leveraged RL for generative text classifiers and unified SFT, RL, and inference-time prompting in one framework. We bridge this gap with GenCLS++, a framework that jointly optimizes SFT and RL while systematically exploring five high-level strategy dimensions (in-context learning variants, category definitions, explicit uncertainty labels, semantically irrelevant numeric labels, and perplexity-based decoding) during both training and inference. After an SFT "policy warm-up," we apply RL with a simple rule-based reward, yielding sizable extra gains. Across seven datasets, GenCLS++ achieves an average accuracy improvement of 3.46% relative to the naive SFT baseline; on public datasets, this improvement rises to 4.00%. Notably, unlike reasoning-intensive tasks that benefit from explicit thinking processes, we find that classification tasks perform better without such reasoning steps. These insights into the role of explicit reasoning provide valuable guidance for future LLM applications.
摘要：作为机器学习的基本任务，文本分类在许多领域都起着至关重要的作用。随着大语言模型（LLM）的快速缩放，尤其是通过增强学习（RL），对更有能力的歧视者的需求越来越大。因此，分类的进步对于增强LLM的整体功能变得越来越重要。传统的判别方法将文本映射到标签，但忽略了LLM的内在生成强度。生成分类通过提示模型直接输出标签来解决这一问题。但是，现有研究仍然仅依靠简单的SFT，很少探测培训和推理提示之间的相互作用，并且在一个框架中，没有系统地将RL用于生成文本分类器以及统一的SFT，RL和推理时间提示。我们用GENCLS ++弥合了这一差距，该框架共同优化了SFT和RL，同时系统地探索了五个高级别的策略维度，类别定义，类别定义，明确的不确定性标签，语义上无关的数字标签，以及基于综合的基于基于培训的训练和培训。经过SFT“政策热身”之后，我们以简单的基于规则的奖励应用RL，从而获得可观的额外收益。在七个数据集中，GENCLS ++相对于天真的SFT基线，平均准确性提高了3.46％。在公共数据集上，这种改进上升到4.00％。值得注意的是，与明确的思维过程中受益的推理密集型任务不同，我们发现分类任务在没有这样的推理步骤的情况下表现更好。这些对明确推理作用的见解为将来的LLM应用程序提供了宝贵的指导。

Title: Assessing the Potential of Generative Agents in Crowdsourced Fact-Checking

Authors: Luigia Costabile, Gian Marco Orlando, Valerio La Gatta, Vincenzo Moscato
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2504.19940
Pdf URL: https://arxiv.org/pdf/2504.19940
Copy Paste: [[2504.19940]] Assessing the Potential of Generative Agents in Crowdsourced Fact-Checking(https://arxiv.org/abs/2504.19940)
Keywords: language model, llm, agent
Abstract: The growing spread of online misinformation has created an urgent need for scalable, reliable fact-checking solutions. Crowdsourced fact-checking - where non-experts evaluate claim veracity - offers a cost-effective alternative to expert verification, despite concerns about variability in quality and bias. Encouraged by promising results in certain contexts, major platforms such as X (formerly Twitter), Facebook, and Instagram have begun shifting from centralized moderation to decentralized, crowd-based approaches. In parallel, advances in Large Language Models (LLMs) have shown strong performance across core fact-checking tasks, including claim detection and evidence evaluation. However, their potential role in crowdsourced workflows remains unexplored. This paper investigates whether LLM-powered generative agents - autonomous entities that emulate human behavior and decision-making - can meaningfully contribute to fact-checking tasks traditionally reserved for human crowds. Using the protocol of La Barbera et al. (2024), we simulate crowds of generative agents with diverse demographic and ideological profiles. Agents retrieve evidence, assess claims along multiple quality dimensions, and issue final veracity judgments. Our results show that agent crowds outperform human crowds in truthfulness classification, exhibit higher internal consistency, and show reduced susceptibility to social and cognitive biases. Compared to humans, agents rely more systematically on informative criteria such as Accuracy, Precision, and Informativeness, suggesting a more structured decision-making process. Overall, our findings highlight the potential of generative agents as scalable, consistent, and less biased contributors to crowd-based fact-checking systems.
摘要：在线错误信息的不断增长使人们迫切需要可扩展，可靠的事实检查解决方案。众包事实检查 - 非专家评估索赔的真实性 - 尽管担心质量和偏见的可变性，但仍提供了一种具有成本效益的专家验证替代方法。在某些背景下有希望的结果的鼓舞之中，X（以前为Twitter），Facebook和Instagram等主要平台已开始从集中式节制转变为分散的，基于人群的方法。同时，大语言模型（LLM）的进步在核心事实检查任务中表现出强劲的表现，包括索赔检测和证据评估。但是，它们在众包工作流程中的潜在作用仍未得到探索。本文研究了LLM驱动的生成代理 - 模仿人类行为和决策的自主实体是否可以有意义地促进传统上保留给人类人群的事实核对任务。使用La Barbera等人的方案。（2024年），我们模拟了具有不同人口和意识形态概况的生成代理人的人群。代理商检索证据，评估多个质量维度的索赔，并发出最终的真实性判断。我们的结果表明，特工在真实性分类中的人群超越人类，表现出更高的内部一致性，并显示出对社会和认知偏见的敏感性降低。与人类相比，代理人更系统地依赖于信息标准，例如准确性，精度和信息性，这表明了更具结构化的决策过程。总体而言，我们的发现突出了生成代理作为可扩展，一致且不太有偏见的基于人群的事实检查系统的潜力。

Title: TD-EVAL: Revisiting Task-Oriented Dialogue Evaluation by Combining Turn-Level Precision with Dialogue-Level Comparisons

Authors: Emre Can Acikgoz, Carl Guo, Suvodip Dey, Akul Datta, Takyoung Kim, Gokhan Tur, Dilek Hakkani-Tür
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.19982
Pdf URL: https://arxiv.org/pdf/2504.19982
Copy Paste: [[2504.19982]] TD-EVAL: Revisiting Task-Oriented Dialogue Evaluation by Combining Turn-Level Precision with Dialogue-Level Comparisons(https://arxiv.org/abs/2504.19982)
Keywords: language model, llm, agent
Abstract: Task-oriented dialogue (TOD) systems are experiencing a revolution driven by Large Language Models (LLMs), yet the evaluation methodologies for these systems remain insufficient for their growing sophistication. While traditional automatic metrics effectively assessed earlier modular systems, they focus solely on the dialogue level and cannot detect critical intermediate errors that can arise during user-agent interactions. In this paper, we introduce TD-EVAL (Turn and Dialogue-level Evaluation), a two-step evaluation framework that unifies fine-grained turn-level analysis with holistic dialogue-level comparisons. At turn level, we evaluate each response along three TOD-specific dimensions: conversation cohesion, backend knowledge consistency, and policy compliance. Meanwhile, we design TOD Agent Arena that uses pairwise comparisons to provide a measure of dialogue-level quality. Through experiments on MultiWOZ 2.4 and τ-Bench, we demonstrate that TD-EVAL effectively identifies the conversational errors that conventional metrics miss. Furthermore, TD-EVAL exhibits better alignment with human judgments than traditional and LLM-based metrics. These findings demonstrate that TD-EVAL introduces a new paradigm for TOD system evaluation, efficiently assessing both turn and system levels with a plug-and-play framework for future research.
摘要：以任务为导向的对话（TOD）系统正在经历由大语言模型（LLM）驱动的革命，但是这些系统的评估方法仍然不足以使其不断增长。尽管传统的自动指标有效地评估了早期的模块化系统，但它们仅关注对话级别，无法检测到在用户代理交互过程中可能出现的关键中间错误。在本文中，我们介绍了TD-Eval（转向和对话级别的评估），这是一个两步评估框架，通过整体对话级别的比较统一了细粒度的转向级分析。在转弯级别，我们沿着三个特定于TOD的维度评估了每个响应：对话凝聚力，后端知识一致性和策略合规性。同时，我们设计了使用成对比较的TOD Agent Arena，以提供对话级别质量的度量。通过对Multiwoz 2.4和{\ tau} -bench进行的实验，我们证明TD-Eval有效地标识了常规指标会错过的对话误差。此外，与传统和基于LLM的指标相比，TD-Eval与人类判断的一致性更好。这些发现表明，TD-Eval引入了用于TOD系统评估的新范式，并通过插入式研究框架有效地评估了转弯和系统水平，以供将来的研究。

Title: Knowledge Distillation of Domain-adapted LLMs for Question-Answering in Telecom

Authors: Rishika Sen, Sujoy Roychowdhury, Sumit Soman, H. G. Ranjani, Srikhetra Mohanty
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2504.20000
Pdf URL: https://arxiv.org/pdf/2504.20000
Copy Paste: [[2504.20000]] Knowledge Distillation of Domain-adapted LLMs for Question-Answering in Telecom(https://arxiv.org/abs/2504.20000)
Keywords: language model, llm
Abstract: Knowledge Distillation (KD) is one of the approaches to reduce the size of Large Language Models (LLMs). An LLM with a smaller number of model parameters (the student) is trained to mimic the performance of a larger LLM (the teacher) on a specific task. For domain-specific tasks, it is not clear whether the teacher model, the student model, or both must be considered for domain adaptation. In this work, we study this problem from the perspective of a telecom-domain Question-Answering (QA) task. We systematically experiment with Supervised Fine-tuning (SFT) of the teacher only, SFT of the student only, and SFT of both prior to KD. We design experiments to study the impact of vocabulary (same and different) and KD algorithms (vanilla KD and Dual Space KD, DSKD) on the distilled model. Multi-faceted evaluation of the distillation using 14 different metrics (N-gram, embedding, and LLM-based metrics) is considered. Experimental results show that SFT of the teacher improves the performance of the distilled model when both models have the same vocabulary, irrespective of algorithm and metrics. Overall, SFT of both teacher and student results in better performance across all metrics, although the statistical significance of this improvement depends on the vocabulary of the teacher models.
摘要：知识蒸馏（KD）是减少大语言模型（LLMS）规模的方法之一。具有较少模型参数（学生）的LLM经过训练，以模仿特定任务更大尺寸（教师模型）的LLM的性能。对于特定于领域的任务，尚不清楚必须考虑教师或学生模型或两者兼而有之以进行域适应。在这项工作中，我们从电信领域提问（QA）任务的角度研究了这个问题。我们系统地试验了只有教师的监督微调（SFT），仅在KD之前的学生SFT和两者的SFT。我们设计实验来研究词汇（相同和不同）和KD算法（Vanilla KD和Dual Space KD，DSKD）对蒸馏模型的影响。考虑使用14个不同的指标（n-gram，嵌入和基于LLM的指标）对蒸馏进行多方面评估。实验结果表明，当两个模型都具有相同的词汇，无论算法和指标如何，SFT的SFT都会提高蒸馏模型的性能。总体而言，教师和学生的SFT在所有指标上都会提高表现，尽管同一统计学意义取决于教师模型的词汇。
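
Illustrative note (not taken from the paper): the "vanilla KD" mentioned above is commonly understood as matching temperature-softened teacher and student output distributions with a KL-divergence loss. The sketch below is a minimal standard-library version of that general idea for a single token position. The function names, the example logits, and the temperature value are assumptions made for illustration, and the paper's actual training setup may differ. The loss as written only makes sense when teacher and student share a vocabulary, so that their logit vectors line up index by index.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Both logit vectors are assumed to index the same vocabulary.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Hypothetical logits over a 4-token vocabulary at one position.
teacher = [4.0, 1.0, 0.5, -2.0]
student = [2.5, 1.5, 0.0, -1.0]

loss_mismatch = kd_loss(teacher, student)
loss_identical = kd_loss(teacher, teacher)
```

The loss is zero when the student reproduces the teacher's distribution exactly and positive otherwise. When the two models use different vocabularies, this index-wise comparison is no longer defined.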

Title: LLM-Generated Fake News Induces Truth Decay in News Ecosystem: A Case Study on Neural News Recommendation

Authors: Beizhe Hu, Qiang Sheng, Juan Cao, Yang Li, Danding Wang
Subjects: cs.CL, cs.CY, cs.IR
Abstract URL: https://arxiv.org/abs/2504.20013
Pdf URL: https://arxiv.org/pdf/2504.20013
Copy Paste: [[2504.20013]] LLM-Generated Fake News Induces Truth Decay in News Ecosystem: A Case Study on Neural News Recommendation(https://arxiv.org/abs/2504.20013)
Keywords: language model, llm
Abstract: Online fake news moderation now faces a new challenge brought by the malicious use of large language models (LLMs) in fake news production. Though existing works have shown LLM-generated fake news is hard to detect from an individual aspect, it remains underexplored how its large-scale release will impact the news ecosystem. In this study, we develop a simulation pipeline and a dataset with ~56k generated news of diverse types to investigate the effects of LLM-generated fake news within neural news recommendation systems. Our findings expose a truth decay phenomenon, where real news is gradually losing its advantageous position in news ranking against fake news as LLM-generated news is involved in news recommendation. We further provide an explanation about why truth decay occurs from a familiarity perspective and show the positive correlation between perplexity and news ranking. Finally, we discuss the threats of LLM-generated fake news and provide possible countermeasures. We urge stakeholders to address this emerging challenge to preserve the integrity of news ecosystems.
摘要：现在，在线假新闻审核面临着由于恶意使用大语模型（LLM）在假新闻制作中带来的新挑战。尽管现有的作品表明，LLM生成的假新闻很难从各个方面检测到，但其大规模发布将如何影响新闻生态系统。在这项研究中，我们开发了一个模拟管道和一个数据集，该数据集具有约56K的生成新闻，以研究LLM生成的假新闻在神经新闻推荐系统中的影响。我们的发现揭示了真相衰败现象，其中真正的新闻逐渐失去了在新闻中对假新闻的有利地位，因为LLM生成的新闻参与了新闻推荐。我们进一步提供了一个解释，说明为什么从熟悉的角度出现真理衰减，并显示困惑和新闻排名之间的正相关。最后，我们讨论了LLM生成的假新闻的威胁，并提供可能的对策。我们敦促利益相关者应对这一新兴挑战，以保持新闻生态系统的完整性。

Title: Better To Ask in English? Evaluating Factual Accuracy of Multilingual LLMs in English and Low-Resource Languages

Authors: Pritika Rohera, Chaitrali Ginimav, Gayatri Sawant, Raviraj Joshi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2504.20022
Pdf URL: https://arxiv.org/pdf/2504.20022
Copy Paste: [[2504.20022]] Better To Ask in English? Evaluating Factual Accuracy of Multilingual LLMs in English and Low-Resource Languages(https://arxiv.org/abs/2504.20022)
Keywords: language model, gpt, llm, hallucination
Abstract: Multilingual Large Language Models (LLMs) have demonstrated significant effectiveness across various languages, particularly in high-resource languages such as English. However, their performance in terms of factual accuracy across other low-resource languages, especially Indic languages, remains an area of investigation. In this study, we assess the factual accuracy of LLMs - GPT-4o, Gemma-2-9B, Gemma-2-2B, and Llama-3.1-8B - by comparing their performance in English and Indic languages using the IndicQuest dataset, which contains question-answer pairs in English and 19 Indic languages. By asking the same questions in English and their respective Indic translations, we analyze whether the models are more reliable for regional context questions in Indic languages or when operating in English. Our findings reveal that LLMs often perform better in English, even for questions rooted in Indic contexts. Notably, we observe a higher tendency for hallucination in responses generated in low-resource Indic languages, highlighting challenges in the multilingual understanding capabilities of current LLMs.
摘要：多语言大语言模型（LLM）在各种语言中都表现出显着的有效性，尤其是在英语等高源语言中。但是，它们在其他低资源语言（尤其是指示语言）上的事实准确性方面的表现仍然是一个调查领域。在这项研究中，我们通过使用Indinquest数据集比较了它们在英语和指示语言中的性能，其中包含Englise-Asswer对，并评估了LLMS-GPT-4O，GEMMA-2-9B，GEMMA-2-2B和LLAMA-3.1-8B的事实准确性，该数据包含英语和19个语言。通过用英语及其各自的指示翻译提出相同的问题，我们分析了模型在指示语言中或用英语运作时更可靠的区域上下文问题。我们的发现表明，LLM通常在英语中表现更好，即使是植根于指示背景的问题。值得注意的是，我们观察到在低资源指示语言中产生的响应中幻觉的趋势更高，这突出了当前LLM的多语言理解能力中的挑战。

Title: AutoJudge: Judge Decoding Without Manual Annotation

Authors: Roman Garipov, Fedor Velikonivtsev, Ruslan Svirschevski, Vage Egiazarian, Max Ryabinin
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2504.20039
Pdf URL: https://arxiv.org/pdf/2504.20039
Copy Paste: [[2504.20039]] AutoJudge: Judge Decoding Without Manual Annotation(https://arxiv.org/abs/2504.20039)
Keywords: language model, llm
Abstract: We introduce AutoJudge, a framework that accelerates large language model (LLM) inference with task-specific lossy speculative decoding. Instead of matching the original model output distribution token-by-token, we identify which of the generated tokens affect the downstream quality of the generated response, relaxing the guarantee so that the "unimportant" tokens can be generated faster. Our approach relies on a semi-greedy search algorithm to test which of the mismatches between target and draft model should be corrected to preserve quality, and which ones may be skipped. We then train a lightweight classifier based on existing LLM embeddings to predict, at inference time, which mismatching tokens can be safely accepted without compromising the final answer quality. We test our approach with Llama 3.2 1B (draft) and Llama 3.1 8B (target) models on zero-shot GSM8K reasoning, where it achieves up to 1.5x more accepted tokens per verification cycle with under 1% degradation in answer accuracy compared to standard speculative decoding and over 2x with small loss in accuracy. When applied to the LiveCodeBench benchmark, our approach automatically detects other, programming-specific important tokens and shows similar speedups, demonstrating its ability to generalize across tasks.
摘要：我们介绍了自动判断，这是一个框架，该框架加速了大型语言模型（LLM）推断，以特定于任务的损失投机解码。我们没有匹配原始模型输出分布令牌令牌，而是确定哪些生成的令牌会影响生成的响应的下游质量，从而放松保证，以便可以更快地生成“不重要的”令牌。我们的方法依靠半怪兽搜索算法来测试应纠正目标和草案模型之间的哪些不匹配以保持质量，以及可以跳过哪些。然后，我们根据现有的LLM嵌入式培训轻量级分类器，以预测，在推理时，可以安全地接受哪些标记不匹配的代币，而不会损害最终的答案质量。我们使用Llama 3.2 1b（草稿）和Llama 3.1 8b（目标）模型测试我们的方法，在GSM8K推理中，与标准的投散解码相比，它的答案准确性低于1％，并且在准确性损失的2倍上，它的每个验证周期的可接受的令牌高达1.5倍。当应用于LiveCodeBench基准测试时，我们的方法会自动检测其他针对特定于编程的重要令牌并显示出相似的加速，从而证明了其跨任务的推广能力。