2025-04-21

Title: Benchmarking Large Language Models for Calculus Problem-Solving: A Comparative Analysis

Authors: In Hak Moon
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.13187
Pdf URL: https://arxiv.org/pdf/2504.13187
Copy Paste: [[2504.13187]] Benchmarking Large Language Models for Calculus Problem-Solving: A Comparative Analysis(https://arxiv.org/abs/2504.13187)
Keywords: language model, gpt, llm, chat
Abstract: This study presents a comprehensive evaluation of five leading large language models (LLMs) - Chat GPT 4o, Copilot Pro, Gemini Advanced, Claude Pro, and Meta AI - on their performance in solving calculus differentiation problems. The investigation assessed these models across 13 fundamental problem types, employing a systematic cross-evaluation framework where each model solved problems generated by all models. Results revealed significant performance disparities, with Chat GPT 4o achieving the highest success rate (94.71%), followed by Claude Pro (85.74%), Gemini Advanced (84.42%), Copilot Pro (76.30%), and Meta AI (56.75%). All models excelled at procedural differentiation tasks but showed varying limitations with conceptual understanding and algebraic manipulation. Notably, problems involving increasing/decreasing intervals and optimization word problems proved most challenging across all models. The cross-evaluation matrix revealed that Claude Pro generated the most difficult problems, suggesting distinct capabilities between problem generation and problem-solving. These findings have significant implications for educational applications, highlighting both the potential and limitations of LLMs as calculus learning tools. While they demonstrate impressive procedural capabilities, their conceptual understanding remains limited compared to human mathematical reasoning, emphasizing the continued importance of human instruction for developing deeper mathematical comprehension.
摘要：这项研究对五个领先的大型语言模型（LLM）进行了全面评估-CHAT GPT 4O，Copilot Pro，Gemini Advanced，Claude Pro和Meta AI-在解决计算分化问题方面的表现。调查使用系统的交叉评估框架评估了13种基本问题类型的这些模型，其中每个模型都求解了所有模型产生的问题。结果表明，CHAT GPT 4O的绩效差异很大，其成功率最高（94.71％），其次是Claude Pro（85.74％），Gemini Advanced（84.42％），Copilot Pro（76.30％）和Meta AI（56.75％）。所有模型在程序区分任务上都表现出色，但通过概念理解和代数操纵显示出不同的局限性。值得注意的是，涉及增加/减少间隔和优化单词问题的问题证明了所有模型中最具挑战性的问题。交叉评估矩阵表明，克劳德·普罗（Claude Pro）产生了最困难的问题，这表明问题产生和解决问题之间存在明显的功能。这些发现对教育应用具有重要意义，强调了LLM作为微积分学习工具的潜力和局限性。尽管他们表现出令人印象深刻的程序能力，但与人类数学推理相比，他们的概念理解仍然有限，这强调了人类教学在发展更深入的数学理解方面的持续重要性。

Title: BASIR: Budget-Assisted Sectoral Impact Ranking -- A Dataset for Sector Identification and Performance Prediction Using Language Models

Authors: Sohom Ghosh, Sudip Kumar Naskar
Subjects: cs.CL, q-fin.ST
Abstract URL: https://arxiv.org/abs/2504.13189
Pdf URL: https://arxiv.org/pdf/2504.13189
Copy Paste: [[2504.13189]] BASIR: Budget-Assisted Sectoral Impact Ranking -- A Dataset for Sector Identification and Performance Prediction Using Language Models(https://arxiv.org/abs/2504.13189)
Keywords: language model
Abstract: Government fiscal policies, particularly annual union budgets, exert significant influence on financial markets. However, real-time analysis of budgetary impacts on sector-specific equity performance remains methodologically challenging and largely unexplored. This study proposes a framework to systematically identify and rank sectors poised to benefit from India's Union Budget announcements. The framework addresses two core tasks: (1) multi-label classification of excerpts from budget transcripts into 81 predefined economic sectors, and (2) performance ranking of these sectors. Leveraging a comprehensive corpus of Indian Union Budget transcripts from 1947 to 2025, we introduce BASIR (Budget-Assisted Sectoral Impact Ranking), an annotated dataset mapping excerpts from budgetary transcripts to sectoral impacts. Our architecture incorporates fine-tuned embeddings for sector identification, coupled with language models that rank sectors based on their predicted performance. Our results demonstrate an F1-score of 0.605 in sector classification and an NDCG score of 0.997 in predicting the ranks of sectors based on post-budget performance. The methodology enables investors and policymakers to quantify fiscal policy impacts through structured, data-driven insights, addressing critical gaps in manual analysis. The annotated dataset has been released under the CC-BY-NC-SA-4.0 license to advance computational economics research.
摘要：政府财政政策，尤其是年度工会预算，对金融市场产生重大影响。但是，对预算对特定部门股票绩效的预算影响的实时分析在方法论上仍然具有挑战性，并且在很大程度上没有探索。这项研究提出了一个框架，以系统地识别和对有助于从印度工会预算公告中受益的部门。该框架解决了两个核心任务：（1）从预算成绩单到81个预定义的经济部门的多标签分类，以及（2）这些部门的绩效排名。利用1947年至2025年的印度联盟预算笔录的全面语料库，我们引入了BASIR（预算辅助部门影响排名），这是带注释的数据集映射摘要，从预算笔录到部门影响。我们的体系结构结合了用于扇区识别的微调嵌入，再加上基于其预测性能的语言模型。我们的结果表明，在部门分类中，基于预算后的性能，在扇区分类中进行了0.605 F1评分，预测扇区等级的NDCG得分为0.997。该方法使投资者和政策制定者能够通过结构化的，数据驱动的见解来量化财政政策的影响，从而解决手动分析中的关键差距。注释的数据集已根据CC-BY-NC-SA-4.0许可发布，以推进计算经济学研究。
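Note on the ranking metric quoted above: NDCG (normalized discounted cumulative gain) divides the discounted gain of a predicted ordering by that of the ideal ordering, so 1.0 means the predicted ranking is perfect. The sketch below assumes the common linear-gain, log2-discount form; the abstract does not say which NDCG variant the authors use, and the function names are illustrative only.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: linear gain, log2 position discount."""
    # rank is 0-based, so the first item is divided by log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(relevances_in_predicted_order):
    """DCG of the predicted order divided by DCG of the ideal (sorted) order."""
    ideal = sorted(relevances_in_predicted_order, reverse=True)
    ideal_dcg = dcg(ideal)
    if ideal_dcg == 0:
        return 0.0  # no relevant items at all: define NDCG as 0
    return dcg(relevances_in_predicted_order) / ideal_dcg
```

Here each list entry would be the observed post-budget relevance of a sector, listed in the order the model ranked the sectors.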

Title: KFinEval-Pilot: A Comprehensive Benchmark Suite for Korean Financial Language Understanding

Authors: Bokwang Hwang, Seonkyu Lim, Taewoong Kim, Yongjae Geun, Sunghyun Bang, Sohyun Park, Jihyun Park, Myeonggyu Lee, Jinwoo Lee, Yerin Kim, Jinsun Yoo, Jingyeong Hong, Jina Park, Yongchan Kim, Suhyun Kim, Younggyun Hahm, Yiseul Lee, Yejee Kang, Chanhyuk Yoon, Chansu Lee, Heeyewon Jeong, Jiyeon Lee, Seonhye Gu, Hyebin Kang, Yousang Cho, Hangyeol Yoo, KyungTae Lim
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.13216
Pdf URL: https://arxiv.org/pdf/2504.13216
Copy Paste: [[2504.13216]] KFinEval-Pilot: A Comprehensive Benchmark Suite for Korean Financial Language Understanding(https://arxiv.org/abs/2504.13216)
Keywords: language model, gpt, llm, prompt
Abstract: We introduce KFinEval-Pilot, a benchmark suite specifically designed to evaluate large language models (LLMs) in the Korean financial domain. Addressing the limitations of existing English-centric benchmarks, KFinEval-Pilot comprises over 1,000 curated questions across three critical areas: financial knowledge, legal reasoning, and financial toxicity. The benchmark is constructed through a semi-automated pipeline that combines GPT-4-generated prompts with expert validation to ensure domain relevance and factual accuracy. We evaluate a range of representative LLMs and observe notable performance differences across models, with trade-offs between task accuracy and output safety across different model families. These results highlight persistent challenges in applying LLMs to high-stakes financial applications, particularly in reasoning and safety. Grounded in real-world financial use cases and aligned with the Korean regulatory and linguistic context, KFinEval-Pilot serves as an early diagnostic tool for developing safer and more reliable financial AI systems.
摘要：我们介绍了Kfineval-Pilot，这是一个专门设计用于评估韩国金融领域中大型语言模型（LLM）的基准套件。 Kfineval-Pilot解决了现有以英语为中心的基准的局限性，包括在三个关键领域的1000多个精心策划的问题：财务知识，法律推理和财务毒性。基准是通过半自动化管道构建的，该管道将GPT-4生成的提示与专家验证相结合，以确保域的相关性和事实准确性。我们评估了一系列代表性的LLM，并观察到模型之间的显着性能差异，在不同模型家族的任务准确性和输出安全性之间进行了权衡。这些结果凸显了将LLM应用于高风险财务应用的持续挑战，尤其是在推理和安全方面。基于现实世界中的财务用例，并与韩国监管和语言背景保持一致，Kfineval-Pilot是一种早期诊断工具，可开发更安全，更可靠的金融AI系统。

Title: Sustainability via LLM Right-sizing

Authors: Jennifer Haase, Finn Klessascheck, Jan Mendling, Sebastian Pokutta
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.13217
Pdf URL: https://arxiv.org/pdf/2504.13217
Copy Paste: [[2504.13217]] Sustainability via LLM Right-sizing(https://arxiv.org/abs/2504.13217)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) have become increasingly embedded in organizational workflows. This has raised concerns over their energy consumption, financial costs, and data sovereignty. While performance benchmarks often celebrate cutting-edge models, real-world deployment decisions require a broader perspective: when is a smaller, locally deployable model "good enough"? This study offers an empirical answer by evaluating eleven proprietary and open-weight LLMs across ten everyday occupational tasks, including summarizing texts, generating schedules, and drafting emails and proposals. Using a dual-LLM-based evaluation framework, we automated task execution and standardized evaluation across ten criteria related to output quality, factual accuracy, and ethical responsibility. Results show that GPT-4o delivers consistently superior performance but at a significantly higher cost and environmental footprint. Notably, smaller models like Gemma-3 and Phi-4 achieved strong and reliable results on most tasks, suggesting their viability in contexts requiring cost-efficiency, local deployment, or privacy. A cluster analysis revealed three model groups -- premium all-rounders, competent generalists, and limited but safe performers -- highlighting trade-offs between quality, control, and sustainability. Significantly, task type influenced model effectiveness: conceptual tasks challenged most models, while aggregation and transformation tasks yielded better performances. We argue for a shift from performance-maximizing benchmarks to task- and context-aware sufficiency assessments that better reflect organizational priorities. Our approach contributes a scalable method to evaluate AI models through a sustainability lens and offers actionable guidance for responsible LLM deployment in practice.
摘要：大型语言模型（LLM）已越来越嵌入到组织工作流程中。这引起了人们对他们的能耗，财务成本和数据主权的担忧。尽管性能基准通常庆祝尖端模型，但现实世界的部署决策需要更广泛的视角：何时较小的本地可部署模型“足够好”？这项研究通过评估十个日常职业任务中的11个专有和开放权重的LLM来提供经验答案，包括总结文本，生成时间表以及起草电子邮件和建议。使用基于双LLM的评估框架，我们将任务执行自动化和标准化评估，跨越与产出质量，事实准确性和道德责任有关的十个标准。结果表明，GPT-4O的性能始终如一，但成本和环境足迹的效果明显更高。值得注意的是，诸如Gemma-3和Phi-4之类的较小模型在大多数任务上取得了强大而可靠的结果，这表明它们在需要成本效率，本地部署或隐私的上下文中的可行性。一项集群分析显示，三个模型组 - 高级全能者，合格的通才以及有限但安全的表演者 - 突出了质量，控制和可持续性之间的权衡。值得注意的是，任务类型影响了模型的有效性：概念任务挑战了大多数模型，而聚合和转换任务产生了更好的性能。我们主张从性能最大化的基准转变为任务和上下文感知的充足评估，以更好地反映组织优先事项。我们的方法为通过可持续性镜头评估AI模型的可扩展方法提供了可扩展的方法，并为实践中负责任的LLM部署提供了可行的指导。

Title: DIDS: Domain Impact-aware Data Sampling for Large Language Model Training

Authors: Weijie Shi, Jipeng Zhang, Yaguang Wu, Jingzhi Fang, Ruiyuan Zhang, Jiajie Xu, Jia Zhu, Hao Chen, Yao Zhao, Sirui Han, Xiaofang Zhou
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.13227
Pdf URL: https://arxiv.org/pdf/2504.13227
Copy Paste: [[2504.13227]] DIDS: Domain Impact-aware Data Sampling for Large Language Model Training(https://arxiv.org/abs/2504.13227)
Keywords: language model, llm
Abstract: Large language models (LLMs) are commonly trained on multi-domain datasets, where domain sampling strategies significantly impact model performance due to varying domain importance across downstream tasks. Existing approaches for optimizing domain-level sampling strategies struggle with maintaining intra-domain consistency and accurately measuring domain impact. In this paper, we present Domain Impact-aware Data Sampling (DIDS). To ensure intra-domain consistency, a gradient clustering algorithm is proposed to group training data based on their learning effects, where a proxy language model and dimensionality reduction are employed to reduce computational overhead. To accurately measure domain impact, we develop a Fisher Information Matrix (FIM) guided metric that quantifies how domain-specific parameter updates affect the model's output distributions on downstream tasks, with theoretical guarantees. Furthermore, to determine optimal sampling ratios, DIDS combines both the FIM-guided domain impact assessment and loss learning trajectories that indicate domain-specific potential, while accounting for diminishing marginal returns. Extensive experiments demonstrate that DIDS achieves 3.4% higher average performance while maintaining comparable training efficiency.
摘要：大型语言模型（LLMS）通常是在多域数据集中培训的，在多域数据集中，由于域中的重要性在下游任务中的重要性变化，因此域采样策略显着影响模型性能。优化域级采样策略的现有方法在维持内域的一致性和准确衡量域的影响方面努力。在本文中，我们介绍了域影响感知数据采样（DIDS）。为了确保域内的一致性，提出了基于其学习效果的梯度聚类算法来组培训数据，在这种效果下，使用代理语言模型和尺寸降低来减少计算开销。为了准确衡量域的影响，我们开发了一个Fisher Information Matrix（FIM）指导度量，该指标量化了域特异性参数更新的方式如何影响模型在下游任务上的输出分布以及理论保证。此外，为了确定最佳抽样比率，DIDS结合了FIM引导的域影响评估和损失学习轨迹，这些轨迹指示了域特异性的潜力，同时考虑了边际收益的减少。广泛的实验表明，DIDS在保持可比的训练效率的同时，达到平均表现3.4％。

Title: ImPart: Importance-Aware Delta-Sparsification for Improved Model Compression and Merging in LLMs

Authors: Yan Yang, Yixia Li, Hongru Wang, Xuetao Wei, Jianqiao Yu, Yun Chen, Guanhua Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.13237
Pdf URL: https://arxiv.org/pdf/2504.13237
Copy Paste: [[2504.13237]] ImPart: Importance-Aware Delta-Sparsification for Improved Model Compression and Merging in LLMs(https://arxiv.org/abs/2504.13237)
Keywords: language model, llm
Abstract: With the proliferation of task-specific large language models, delta compression has emerged as a method to mitigate the resource challenges of deploying numerous such models by effectively compressing the delta model parameters. Previous delta-sparsification methods either remove parameters randomly or truncate singular vectors directly after singular value decomposition (SVD). However, these methods either disregard parameter importance entirely or evaluate it with too coarse a granularity. In this work, we introduce ImPart, a novel importance-aware delta sparsification approach. Leveraging SVD, it dynamically adjusts sparsity ratios of different singular vectors based on their importance, effectively retaining crucial task-specific knowledge even at high sparsity ratios. Experiments show that ImPart achieves state-of-the-art delta sparsification performance, demonstrating $2\times$ higher compression ratio than baselines at the same performance level. When integrated with existing methods, ImPart sets a new state-of-the-art on delta quantization and model merging.
摘要：随着特定于任务的大型语言模型的扩散，Delta压缩已成为一种方法，可以通过有效压缩Delta模型参数来减轻部署许多此类模型的资源挑战。先前的delta-sparsification方法要么在单数值分解（SVD）之后直接删除参数，要么直接截断奇异向量。但是，这些方法要么完全无视参数的重要性，要么以太粗糙的粒度进行评估。在这项工作中，我们介绍了一种新颖的意识到的三角洲稀疏方法。利用SVD，它根据其重要性动态调节不同奇异向量的稀疏性比，即使在高稀疏性比下也可以有效地保留至关重要的任务特定知识。实验表明，在同一性能水平上，授予达到最先进的三角洲稀疏性能，表明$ 2 \ tims $ $压缩比。当与现有方法集成时，将设置有关增量量化和模型合并的新最新技术。

Title: CPG-EVAL: A Multi-Tiered Benchmark for Evaluating the Chinese Pedagogical Grammar Competence of Large Language Models

Authors: Dong Wang
Subjects: cs.CL, cs.AI, cs.CY, cs.HC, cs.SI
Abstract URL: https://arxiv.org/abs/2504.13261
Pdf URL: https://arxiv.org/pdf/2504.13261
Copy Paste: [[2504.13261]] CPG-EVAL: A Multi-Tiered Benchmark for Evaluating the Chinese Pedagogical Grammar Competence of Large Language Models(https://arxiv.org/abs/2504.13261)
Keywords: language model, gpt, llm, chat
Abstract: Purpose: The rapid emergence of large language models (LLMs) such as ChatGPT has significantly impacted foreign language education, yet their pedagogical grammar competence remains under-assessed. This paper introduces CPG-EVAL, the first dedicated benchmark specifically designed to evaluate LLMs' knowledge of pedagogical grammar within the context of foreign language instruction. Methodology: The benchmark comprises five tasks designed to assess grammar recognition, fine-grained grammatical distinction, categorical discrimination, and resistance to linguistic interference. Findings: Smaller-scale models can succeed in single language instance tasks, but struggle with multiple instance tasks and interference from confusing instances. Larger-scale models show better resistance to interference but still have significant room for accuracy improvement. The evaluation indicates the need for better instructional alignment and more rigorous benchmarks, to effectively guide the deployment of LLMs in educational contexts. Value: This study offers the first specialized, theory-driven, multi-tiered benchmark framework for systematically evaluating LLMs' pedagogical grammar competence in Chinese language teaching contexts. CPG-EVAL not only provides empirical insights for educators, policymakers, and model developers to better gauge AI's current abilities in educational settings, but also lays the groundwork for future research on improving model alignment, enhancing educational suitability, and ensuring informed decision-making concerning LLM integration in foreign language instruction.
摘要：目的：大语言模型（LLM）（例如ChatGpt）的快速出现对外语教育产生了重大影响，但他们的教学语法能力仍未得到评估。本文介绍了CPG-Eval，这是第一个专门设计的专门基准，该基准是专门用于评估LLMS在外语教学背景下对教学语法知识的知识。方法论：该基准包括五项旨在评估语法识别，精细语法区别，分类歧视和对语言干扰的抗性的任务。调查结果：较小规模的模型可以在单语言实例任务中取得成功，但是在多个实例任务和混淆实例中的干扰中挣扎。大型模型表现出更好的抵抗力，但仍有明显的准确性改善空间。评估表明需要更好的教学对准和更严格的基准，以有效地指导LLM在教育环境中的部署。价值：这项研究提供了第一个专业，理论驱动的多层基准框架，用于系统地评估LLMS在中文教学环境中的教学语法能力。 CPG-Eval不仅为教育者，政策制定者和模型开发人员提供经验见解，以更好地衡量AI在教育环境中的当前能力，还为未来的研究奠定了基础，以改善模型的统一性，增强教育适用性，并确保在外语教学中涉及LLM的综合性综合。

Title: THOUGHTTERMINATOR: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models

Authors: Xiao Pu, Michael Saxon, Wenyue Hua, William Yang Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.13367
Pdf URL: https://arxiv.org/pdf/2504.13367
Copy Paste: [[2504.13367]] THOUGHTTERMINATOR: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models(https://arxiv.org/abs/2504.13367)
Keywords: language model
Abstract: Reasoning models have demonstrated impressive performance on difficult tasks that traditional language models struggle with. However, many are plagued by the problem of overthinking: generating large amounts of unnecessary tokens that do not improve accuracy on a question. We introduce approximate measures of problem-level difficulty, demonstrate that a clear relationship exists between problem difficulty and optimal token spend, and evaluate how well calibrated a variety of reasoning models are in terms of efficiently allocating the optimal token count. We find that, in general, reasoning models are poorly calibrated, particularly on easy problems. To evaluate calibration on easy questions, we introduce DUMB500, a dataset of extremely easy math, reasoning, code, and task problems, and jointly evaluate reasoning models on these simple examples and on extremely difficult examples from existing frontier benchmarks in the same task domains. Finally, we introduce THOUGHTTERMINATOR, a training-free black box decoding technique that significantly improves reasoning model calibration.
摘要：推理模型在传统语言模型遇到的艰巨任务上表现出了令人印象深刻的表现。但是，许多人困扰着过度思考的问题，这使大量不必要的令牌产生了无法提高问题准确性的不必要令牌。我们介绍了问题级难度的近似度量，并证明存在问题难度和最佳令牌支出之间存在明确的关系，并评估校准多种推理模型在有效分配最佳令牌计数方面的校准程度。我们发现，通常，推理模型的校准很差，尤其是在简单问题上。为了评估简单问题的校准，我们介绍了DUMB500，这是一个非常简单的数学，推理，代码和任务问题的数据集，并在这些简单示例中共同评估推理模型，以及来自相同任务域上现有边界基准的极其困难的示例。最后，我们介绍了Thoughtterinator，这是一种无训练的黑匣子解码技术，可显着改善推理模型校准。

Title: Secure Multifaceted-RAG for Enterprise: Hybrid Knowledge Retrieval with Security Filtering

Authors: Grace Byun, Shinsun Lee, Nayoung Choi, Jinho Choi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.13425
Pdf URL: https://arxiv.org/pdf/2504.13425
Copy Paste: [[2504.13425]] Secure Multifaceted-RAG for Enterprise: Hybrid Knowledge Retrieval with Security Filtering(https://arxiv.org/abs/2504.13425)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Existing Retrieval-Augmented Generation (RAG) systems face challenges in enterprise settings due to limited retrieval scope and data security risks. When relevant internal documents are unavailable, the system struggles to generate accurate and complete responses. Additionally, using closed-source Large Language Models (LLMs) raises concerns about exposing proprietary information. To address these issues, we propose the Secure Multifaceted-RAG (SecMulti-RAG) framework, which retrieves not only from internal documents but also from two supplementary sources: pre-generated expert knowledge for anticipated queries and on-demand external LLM-generated knowledge. To mitigate security risks, we adopt a local open-source generator and selectively utilize external LLMs only when prompts are deemed safe by a filtering mechanism. This approach enhances completeness, prevents data leakage, and reduces costs. In our evaluation on a report generation task in the automotive industry, SecMulti-RAG significantly outperforms traditional RAG - achieving 79.3 to 91.9 percent win rates across correctness, richness, and helpfulness in LLM-based evaluation, and 56.3 to 70.4 percent in human evaluation. This highlights SecMulti-RAG as a practical and secure solution for enterprise RAG.
摘要：由于检索范围和数据安全风险有限，现有的检索型发电（RAG）系统在企业设置中面临挑战。当相关的内部文档不可用时，系统会努力产生准确和完整的响应。此外，使用封闭源的大语言模型（LLMS）引起了人们对暴露专有信息的担忧。为了解决这些问题，我们提出了安全的多方面rag（secmulti-rag）框架，该框架不仅从内部文档中从内部文档中检索，而且还从两个补充来源中检索：预先生成的专家知识，以获取预期的查询和按需外部LLM生成的知识。为了减轻安全风险，我们采用本地开源发电机，并且只有在通过过滤机制将提示视为安全时才选择性地利用外部LLM。这种方法可以提高完整性，防止数据泄漏并降低成本。在我们对汽车行业的报告生成任务的评估中，Secmulti-Rag显着胜过传统的抹布 - 在基于LLM的评估中，在正确性，丰富性和有益的范围内达到79.3％至91.9％的获胜率，在人类评估中获得56.3％至70.4％。这突出了Secmulti-rag作为企业抹布的实用和安全的解决方案。
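Note: the retrieval flow described above (internal documents, pre-generated expert knowledge, and external LLM knowledge gated by a safety filter) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: every function name is hypothetical, and the keyword-based filter is a toy stand-in for whatever filtering mechanism the paper uses.

```python
from typing import Callable, List

Source = Callable[[str], List[str]]


def gather_context(
    query: str,
    internal: Source,
    expert: Source,
    external: Source,
    is_safe: Callable[[str], bool],
) -> List[str]:
    """Collect context from three sources; call the external one only if safe."""
    context = list(internal(query))   # internal documents are always consulted
    context += expert(query)          # pre-generated expert knowledge
    if is_safe(query):                # the query leaves the premises only if safe
        context += external(query)
    return context


# Toy stand-ins (assumptions for illustration only).
def toy_internal(query: str) -> List[str]:
    return ["internal: " + query]


def toy_expert(query: str) -> List[str]:
    return ["expert: " + query]


def toy_external(query: str) -> List[str]:
    return ["external: " + query]


def toy_is_safe(query: str) -> bool:
    # Hypothetical filter: block anything mentioning "confidential".
    return "confidential" not in query.lower()
```

The collected context would then be passed to a locally hosted generator, so that proprietary text is never sent to an external model.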

Title: From Large to Super-Tiny: End-to-End Optimization for Cost-Efficient LLMs

Authors: Jiliang Ni, Jiachen Pu, Zhongyi Yang, Kun Zhou, Hui Wang, Xiaoliang Xiao, Dakui Wang, Xin Li, Jingfeng Luo, Conggang Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.13471
Pdf URL: https://arxiv.org/pdf/2504.13471
Copy Paste: [[2504.13471]] From Large to Super-Tiny: End-to-End Optimization for Cost-Efficient LLMs(https://arxiv.org/abs/2504.13471)
Keywords: language model, llm
Abstract: In recent years, Large Language Models (LLMs) have significantly advanced artificial intelligence by optimizing traditional Natural Language Processing (NLP) pipelines, improving performance and generalization. This has spurred their integration into various systems. Many NLP systems, including ours, employ a "one-stage" pipeline directly incorporating LLMs. While effective, this approach incurs substantial costs and latency due to the need for large model parameters to achieve satisfactory outcomes. This paper introduces a three-stage, cost-efficient, end-to-end LLM deployment pipeline (prototyping, knowledge transfer, and model compression) to tackle the cost-performance dilemma in LLM-based frameworks. Our approach yields a super-tiny model optimized for cost and performance in online systems, simplifying the system architecture. In the first stage, complex tasks are transformed into a function call-based, LLM-driven pipeline, yielding an optimal-performance prototype system that serves as a teacher model and produces high-quality data. The second stage combines techniques such as rejection fine-tuning, reinforcement learning, and knowledge distillation to transfer knowledge to a smaller 0.5B student model, delivering effective performance at minimal cost. The final stage applies quantization and pruning to compress the model further to 0.4B, achieving ultra-low latency and cost. The framework's modular design and cross-domain capabilities suggest potential applicability in other NLP areas.
摘要：近年来，大型语言模型（LLM）通过优化传统的自然语言处理（NLP）管道，改善性能和概括，具有显着高级的人工智能。这促使他们集成到各种系统中。许多NLP系统（包括我们的NLP系统）采用了直接合并LLM的“单级”管道。尽管有效，但由于需要大型模型参数实现令人满意的结果，这种方法会造成大量成本和潜伏期。本文介绍了三阶段的端到端LLM部署管道，包括原型，知识转移和模型压缩，以解决基于LLM的框架中成本效果的难题。我们的方法产生了一个针对在线系统中成本和性能进行优化的超级小型模型，从而简化了系统体系结构。最初，通过将复杂的任务转换为基于函数呼叫的LLM驱动管道，构建了最佳性能原型系统，以生成高质量的数据作为教师模型。第二阶段结合了拒绝微调，加强学习和知识蒸馏等技术，以将知识转移到较小的0.5B学生模型，以最低的成本提供有效的绩效。最后阶段将量化和修剪适用于极低的潜伏期和成本，以极度压缩模型。该框架的模块化设计和跨域功能表明在其他NLP区域中可能适用。
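Note: the final compression stage mentions quantization and pruning without giving details. As a generic illustration only, the sketch below shows symmetric int8 quantization and unstructured magnitude pruning on a flat list of weights. Both are textbook stand-ins chosen for brevity; the abstract does not say which quantization or pruning scheme the authors apply.

```python
def quantize_int8(weights):
    """Symmetric quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integer codes."""
    return [q * scale for q in quantized]


def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]
```

Quantization shrinks the storage per weight, while pruning removes weights; reducing a parameter count (as in the 0.5B to 0.4B step) typically calls for structured pruning that deletes whole units rather than single entries.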

Title: LLM Sensitivity Evaluation Framework for Clinical Diagnosis

Authors: Chenwei Yan, Xiangling Fu, Yuxuan Xiong, Tianyi Wang, Siu Cheung Hui, Ji Wu, Xien Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.13475
Pdf URL: https://arxiv.org/pdf/2504.13475
Copy Paste: [[2504.13475]] LLM Sensitivity Evaluation Framework for Clinical Diagnosis(https://arxiv.org/abs/2504.13475)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) have demonstrated impressive performance across various domains. However, for clinical diagnosis, higher expectations are placed on LLMs' reliability and sensitivity: thinking like physicians and remaining sensitive to key medical information that affects diagnostic reasoning, as subtle variations can lead to different diagnostic results. Yet, existing works focus mainly on investigating the sensitivity of LLMs to irrelevant context and overlook the importance of key information. In this paper, we investigate the sensitivity of LLMs, namely GPT-3.5, GPT-4, Gemini, Claude3 and LLaMA2-7b, to key medical information by introducing different perturbation strategies. The evaluation results highlight the limitations of current LLMs in remaining sensitive to key medical information for diagnostic decision-making. The evolution of LLMs must focus on improving their reliability, enhancing their ability to be sensitive to key information, and effectively utilizing this information. These improvements will enhance human trust in LLMs and facilitate their practical application in real-world scenarios. Our code and dataset are available at this https URL.
摘要：大型语言模型（LLM）在各个领域都表现出了令人印象深刻的表现。但是，对于临床诊断，LLM的可靠性和敏感性需要更高的期望：像医师一样思考，对影响诊断推理的关键医疗信息保持敏感，因为细微的变化会导致不同的诊断结果。然而，现有作品主要着重于研究LLM对无关紧要的环境的敏感性，并忽略了关键信息的重要性。在本文中，我们通过引入不同的扰动策略来研究LLM的敏感性，即GPT-3.5，GPT-4，GEMINI，CLAUDE3和LLAMA2-7B。评估结果突出了当前LLM在对关键医疗信息中保持敏感的诊断决策的局限性。 LLM的演变必须集中于提高其可靠性，增强其对关键信息敏感的能力，并有效利用这些信息。这些改进将增强人类对LLM的信任，并促进其在现实世界中的实际应用。我们的代码和数据集可在此HTTPS URL上找到。

Title: Prejudge-Before-Think: Enhancing Large Language Models at Test-Time by Process Prejudge Reasoning

Authors: Jianing Wang, Jin Jiang, Yang Liu, Mengdi Zhang, Xunliang Cai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.13500
Pdf URL: https://arxiv.org/pdf/2504.13500
Copy Paste: [[2504.13500]] Prejudge-Before-Think: Enhancing Large Language Models at Test-Time by Process Prejudge Reasoning(https://arxiv.org/abs/2504.13500)
Keywords: language model, llm
Abstract: In this paper, we introduce a new process prejudge strategy in LLM reasoning to demonstrate that bootstrapping with process prejudge allows the LLM to adaptively anticipate the errors encountered when advancing the subsequent reasoning steps, similar to how people sometimes pause to think about what mistakes may occur and how to avoid them, rather than relying solely on trial and error. Specifically, we define a prejudge node in the rationale as a reasoning step that is followed by at least one step with no path toward the correct answer. To synthesize the prejudge reasoning process, we present an automated reasoning framework with a dynamic tree-searching strategy. This framework requires only one LLM to perform answer judging, response critiquing, prejudge generation, and thought completion. Furthermore, we develop a two-phase training mechanism with supervised fine-tuning (SFT) and reinforcement learning (RL) to further enhance the reasoning capabilities of LLMs. Experimental results from competition-level complex reasoning demonstrate that our method can teach the model to prejudge before thinking and significantly enhance the reasoning ability of LLMs. Code and data are released at this https URL.
摘要：在本文中，我们在LLM推理中介绍了一种新的\ emph {Process Prevudge}策略，以证明使用流程预先判断的引导使LLM可以适应地预测随后的推理步骤时遇到的错误，类似于人们有时会暂停出现错误以及避免发生的错误，而不是依靠和依靠偏见和错误。具体而言，我们在理由中定义了一个预判节点，该节点代表了一个推理步骤，至少一个步骤遵循没有通向正确答案的前进的节点。为了综合限制推理过程，我们提出了一个具有动态树搜索策略的自动推理框架。该框架只需要一个LLM来执行答案，回答批评，预法生成和思想完成。此外，我们使用有监督的微调（SFT）和增强学习（RL）开发了两阶段训练机制，以进一步增强LLM的推理能力。竞争级复杂推理的实验结果表明，我们的方法可以教导该模型在思考之前预先判断并显着提高LLM的推理能力。代码和数据在此HTTPS URL上发布。

Title: CoT-RAG: Integrating Chain of Thought and Retrieval-Augmented Generation to Enhance Reasoning in Large Language Models

Authors: Feiyang Li, Peng Fang, Zhan Shi, Arijit Khan, Fang Wang, Dan Feng, Weihao Wang, Xin Zhang, Yongjian Cui
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.13534
Pdf URL: https://arxiv.org/pdf/2504.13534
Copy Paste: [[2504.13534]] CoT-RAG: Integrating Chain of Thought and Retrieval-Augmented Generation to Enhance Reasoning in Large Language Models(https://arxiv.org/abs/2504.13534)
Keywords: language model, llm, prompt, retrieval-augmented generation, chain-of-thought
Abstract: While chain-of-thought (CoT) reasoning improves the performance of large language models (LLMs) in complex tasks, it still faces two main challenges: the low reliability of relying solely on LLMs to generate reasoning chains and the interference of natural language reasoning chains on the inference logic of LLMs. To address these issues, we propose CoT-RAG, a novel reasoning framework with three key designs: (i) Knowledge Graph-driven CoT Generation, featuring knowledge graphs to modulate reasoning chain generation of LLMs, thereby enhancing reasoning credibility; (ii) Learnable Knowledge Case-aware RAG, which incorporates retrieval-augmented generation (RAG) into knowledge graphs to retrieve relevant sub-cases and sub-descriptions, providing LLMs with learnable information; (iii) Pseudo-Program Prompting Execution, which encourages LLMs to execute reasoning tasks in pseudo-programs with greater logical rigor. We conduct a comprehensive evaluation on nine public datasets, covering three reasoning problems. Compared with state-of-the-art methods, CoT-RAG exhibits a significant accuracy improvement, ranging from 4.0% to 23.0%. Furthermore, when tested on four domain-specific datasets, CoT-RAG shows remarkable accuracy and efficient execution, highlighting its strong practical applicability and scalability.
摘要：虽然经过思考链（COT）推理改善了复杂任务中大语言模型（LLMS）的性能，但它仍然面临两个主要挑战：仅依靠LLMS来产生推理链和自然语言推理链对LLMS推理逻辑的干扰的可靠性低。为了解决这些问题，我们提出了COT-rag，这是一个具有三个关键设计的新颖推理框架：（i）知识图形驱动的COT生成，具有知识图以调节LLM的推理链生成，从而增强了推理信誉；（ii）可学习的知识案例感知的抹布，将检索功能的生成（抹布）纳入知识图中，以检索相关的子案例和子描述，从而为LLM提供了可学习的信息；（iii）伪编程提示执行，这鼓励LLMS以更高的逻辑严格性执行伪编程中的推理任务。我们对九个公共数据集进行了全面评估，涵盖了三个推理问题。与面前的方法相比，COT-rag具有显着的准确性提高，范围从4.0％到23.0％。此外，在四个特定领域的数据集上进行测试，cot-rag显示出了出色的准确性和有效的执行，突出了其强大的实用适用性和可扩展性。

Title: DETAM: Defending LLMs Against Jailbreak Attacks via Targeted Attention Modification

Authors: Yu Li, Han Jiang, Zhihua Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.13562
Pdf URL: https://arxiv.org/pdf/2504.13562
Copy Paste: [[2504.13562]] DETAM: Defending LLMs Against Jailbreak Attacks via Targeted Attention Modification(https://arxiv.org/abs/2504.13562)
Keywords: language model, llm
Abstract: With the widespread adoption of Large Language Models (LLMs), jailbreak attacks have become an increasingly pressing safety concern. While safety-aligned LLMs can effectively defend against normal harmful queries, they remain vulnerable to such attacks. Existing defense methods primarily rely on fine-tuning or input modification, which often suffer from limited generalization and reduced utility. To address this, we introduce DETAM, a finetuning-free defense approach that improves the defensive capabilities against jailbreak attacks of LLMs via targeted attention modification. Specifically, we analyze the differences in attention scores between successful and unsuccessful defenses to identify the attention heads sensitive to jailbreak attacks. During inference, we reallocate attention to emphasize the user's core intention, minimizing interference from attack tokens. Our experimental results demonstrate that DETAM outperforms various baselines in jailbreak defense and exhibits robust generalization across different attacks and models, maintaining its effectiveness even on in-the-wild jailbreak data. Furthermore, in evaluating the model's utility, we incorporated over-defense datasets, which further validate the superior performance of our approach. The code will be released immediately upon acceptance.
摘要：随着大型语言模型（LLM）的广泛采用，越狱攻击已成为越来越紧迫的安全问题。尽管安全一致的LLM可以有效防御正常的有害疑问，但它们仍然容易受到此类攻击的影响。现有的防御方法主要依赖于微调或输入修改，这些修改通常会受到有限的概括和效用减少的损失。为了解决这个问题，我们介绍了Detam，这是一种无填补的防御方法，通过针对性的注意修改提高了针对LLMS越狱攻击的防御能力。具体而言，我们分析了成功和失败的防御能力之间的注意力评分差异，以确定注意力头对越狱攻击敏感。在推论期间，我们重新分配注意力以强调用户的核心意图，从而最大程度地减少攻击令牌的干扰。我们的实验结果表明，DETAM在越狱防御中的表现优于各种基线，并在不同的攻击和模型中表现出强大的概括，即使在野外越狱数据上也保持了其有效性。此外，在评估模型的实用程序时，我们合并了防御性数据集，这进一步验证了我们方法的出色性能。该代码将在接受后立即发布。
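Note: the inference-time step described above (reallocating attention toward the user's core intention) can be pictured as boosting the attention weights on intent tokens and renormalizing. The sketch below is a simplified illustration under assumptions: it operates on a single attention row, the boost factor and the intent mask are hypothetical inputs, and the abstract does not state the actual reallocation rule or how sensitive heads are selected.

```python
def reallocate_attention(weights, intent_mask, boost=2.0):
    """Scale attention on intent tokens by `boost`, then renormalize to sum to 1.

    weights: attention probabilities over tokens for one query position.
    intent_mask: booleans marking tokens assumed to carry the user's intent.
    """
    if len(weights) != len(intent_mask):
        raise ValueError("weights and intent_mask must have the same length")
    scaled = [w * boost if is_intent else w
              for w, is_intent in zip(weights, intent_mask)]
    total = sum(scaled)
    if total == 0:
        raise ValueError("attention weights sum to zero")
    return [s / total for s in scaled]
```

Because the row is renormalized, any mass added to intent tokens is taken proportionally from the remaining tokens, which is where attack tokens would sit in this picture.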

Title: Improving Generalization in Intent Detection: GRPO with Reward-Based Curriculum Sampling

Authors: Zihao Feng, Xiaoxue Wang, Ziwei Bai, Donghang Su, Bowen Wu, Qun Yu, Baoxun Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.13592
Pdf URL: https://arxiv.org/pdf/2504.13592
Copy Paste: [[2504.13592]] Improving Generalization in Intent Detection: GRPO with Reward-Based Curriculum Sampling(https://arxiv.org/abs/2504.13592)
Keywords: llm, chain-of-thought
Abstract: Intent detection, a critical component in task-oriented dialogue (TOD) systems, faces significant challenges in adapting to the rapid influx of integrable tools with complex interrelationships. Existing approaches, such as zero-shot reformulations and LLM-based dynamic recognition, struggle with performance degradation when encountering unseen intents, leading to erroneous task routing. To enhance the model's generalization performance on unseen tasks, we employ Reinforcement Learning (RL) combined with Reward-based Curriculum Sampling (RCS) during Group Relative Policy Optimization (GRPO) training for intent detection tasks. Experiments demonstrate that RL-trained models substantially outperform supervised fine-tuning (SFT) baselines in generalization. In addition, introducing RCS significantly bolsters the effectiveness of RL in intent detection by focusing the model on challenging cases during training. Moreover, incorporating Chain-of-Thought (CoT) processes in RL notably improves generalization in complex intent detection tasks, underscoring the importance of thought in challenging scenarios. This work advances the generalization of intent detection tasks, offering practical insights for deploying adaptable dialogue systems.
摘要：意图检测是以任务为导向对话（TOD）系统中的关键组成部分，在适应具有复杂相互关系的可集成工具的快速涌入时面临重大挑战。现有的方法，例如零拍摄的重新纠正和基于LLM的动态识别，在遇到看不见的意图时会与性能退化斗争，从而导致错误的任务路由。为了增强模型在看不见的任务上的概括性能，我们采用了强化学习（RL），并在小组相对策略优化（GRPO）培训期间结合了基于奖励的课程抽样（RC）。实验表明，经过RL训练的模型基本上优于监督的微调（SFT）基准。此外，引入RCS，通过将模型重点放在训练过程中的挑战性病例上，从而显着巩固RL在意图检测中的有效性。此外，将思想链（COT）过程纳入RL可以显着改善复杂的意图检测任务中的概括，从而强调了思想在具有挑战性的情况下的重要性。这项工作推进了意图检测任务的概括，提供了用于部署适应性对话系统的实用见解。
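
The group-relative step that gives GRPO its name can be illustrated with a minimal sketch. It assumes the commonly described formulation in which each sampled completion's reward is normalized by the mean and standard deviation of its own sampling group; the function name, the use of the population standard deviation, and the 0/1 rewards are illustrative assumptions, not details taken from this paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples in the same group.

    Assumption: advantage_i = (r_i - mean(r)) / (std(r) + eps), using the
    population standard deviation. The paper's exact variant may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions sampled for one query:
# 1.0 means the predicted intent was correct, 0.0 means it was wrong.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that beat their group average receive a positive advantage and the rest a negative one, so no separate value network is needed to form the baseline.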

Title: Continual Pre-Training is (not) What You Need in Domain Adaption

Authors: Pin-Er Chen, Da-Chen Lian, Shu-Kai Hsieh, Sieh-Chuen Huang, Hsuan-Lei Shao, Jun-Wei Chiu, Yang-Hsien Lin, Zih-Ching Chen, Cheng-Kuang, Eddie TC Huang, Simon See
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.13603
Pdf URL: https://arxiv.org/pdf/2504.13603
Copy Paste: [[2504.13603]] Continual Pre-Training is (not) What You Need in Domain Adaption(https://arxiv.org/abs/2504.13603)
Keywords: language model, llm, hallucination, prompt
Abstract: The recent advances in Legal Large Language Models (LLMs) have transformed the landscape of legal research and practice by automating tasks, enhancing research precision, and supporting complex decision-making processes. However, effectively adapting LLMs to the legal domain remains challenging due to the complexity of legal reasoning, the need for precise interpretation of specialized language, and the potential for hallucinations. This paper examines the efficacy of Domain-Adaptive Continual Pre-Training (DACP) in improving the legal reasoning capabilities of LLMs. Through a series of experiments on legal reasoning tasks within the Taiwanese legal framework, we demonstrate that while DACP enhances domain-specific knowledge, it does not uniformly improve performance across all legal tasks. We discuss the trade-offs involved in DACP, particularly its impact on model generalization and performance in prompt-based tasks, and propose directions for future research to optimize domain adaptation strategies in legal AI.
摘要：法律大语言模型（LLM）的最新进展已通过自动化任务，增强研究精度并支持复杂的决策过程来改变法律研究和实践的景观。但是，由于法律推理的复杂性，对专业语言的精确解释以及幻觉的潜力，有效地将LLM适应法律领域仍然具有挑战性。本文研究了域自适应持续预训练（DACP）在提高LLM的法律推理能力方面的功效。通过一系列有关台湾法律框架内法律推理任务的实验，我们证明，尽管DACP增强了特定于领域的知识，但并不能统一地改善所有法律任务的绩效。我们讨论了DACP中涉及的权衡，尤其是其对基于及时任务的模型概括和绩效的影响，并提出了未来研究的方向，以优化法律AI中的领域适应策略。

Title: Long-context Non-factoid Question Answering in Indic Languages

Authors: Ritwik Mishra, Rajiv Ratn Shah, Ponnurangam Kumaraguru
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.13615
Pdf URL: https://arxiv.org/pdf/2504.13615
Copy Paste: [[2504.13615]] Long-context Non-factoid Question Answering in Indic Languages(https://arxiv.org/abs/2504.13615)
Keywords: language model, llm, long context
Abstract: Question Answering (QA) tasks, which involve extracting answers from a given context, are relatively straightforward for modern Large Language Models (LLMs) when the context is short. However, long contexts pose challenges due to the quadratic complexity of the self-attention mechanism. This challenge is compounded in Indic languages, which are often low-resource. This study explores context-shortening techniques, including Open Information Extraction (OIE), coreference resolution, Answer Paragraph Selection (APS), and their combinations, to improve QA performance. Compared to the baseline of unshortened (long) contexts, our experiments on four Indic languages (Hindi, Tamil, Telugu, and Urdu) demonstrate that context-shortening techniques yield an average improvement of 4% in semantic scores and 47% in token-level scores when evaluated on three popular LLMs without fine-tuning. Furthermore, with fine-tuning, we achieve an average increase of 2% in both semantic and token-level scores. Additionally, context-shortening reduces computational overhead. Explainability techniques like LIME and SHAP reveal that when the APS model confidently identifies the paragraph containing the answer, nearly all tokens within the selected text receive high relevance scores. However, the study also highlights the limitations of LLM-based QA systems in addressing non-factoid questions, particularly those requiring reasoning or debate. Moreover, verbalizing OIE-generated triples does not enhance system performance. These findings emphasize the potential of context-shortening techniques to improve the efficiency and effectiveness of LLM-based QA systems, especially for low-resource languages. The source code and resources are available at this https URL.
摘要：问题回答（QA）任务涉及从给定上下文中提取答案，对于现代大型语言模型（LLMS）而言，当上下文短时，它相对简单。然而，由于自我注意的机制的二次复杂性，长篇小说构成了挑战。这项挑战在指示语言中加重了，这些语言通常是低资源的。这项研究探讨了上下文缩短技术，包括开放信息提取（OIE），核心分辨率，答案段落选择（APS）及其组合，以提高质量检查性能。与未交换（长）上下文的基线相比，我们对四种指示语言（印地语，泰米尔语，泰卢固语和乌尔都语）的实验表明，上下文缩短技术在语义得分中的平均提高4 \％，而在三种流行的LLMS上进行评估时，在代币级别的评分中，我们的语义差为4 \％。此外，通过微调，我们在语义和令牌级别的得分中平均增加了2 \％。此外，上下文缩短了计算开销。解释性技术（例如石灰和外形）表明，当APS模型自信地识别包含答案的段落时，所选文本中的几乎所有令牌都获得了很高的相关性分数。但是，该研究还强调了基于LLM的质量检查系统在解决非事实问题（尤其是需要推理或辩论的问题的局限性）中的局限性。此外，口头化的OIE生成的三元组并不能提高系统性能。这些发现强调了上下文缩短技术的潜力，以提高基于LLM的QA系统的效率和有效性，尤其是对于低资源语言。源代码和资源可在此HTTPS URL上找到。

Title: Divergent LLM Adoption and Heterogeneous Convergence Paths in Research Writing

Authors: Cong William Lin, Wu Zhu
Subjects: cs.CL, cs.AI, econ.GN
Abstract URL: https://arxiv.org/abs/2504.13629
Pdf URL: https://arxiv.org/pdf/2504.13629
Copy Paste: [[2504.13629]] Divergent LLM Adoption and Heterogeneous Convergence Paths in Research Writing(https://arxiv.org/abs/2504.13629)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Large Language Models (LLMs), such as ChatGPT, are reshaping content creation and academic writing. This study investigates the impact of AI-assisted generative revisions on research manuscripts, focusing on heterogeneous adoption patterns and their influence on writing convergence. Leveraging a dataset of over 627,000 academic papers from arXiv, we develop a novel classification framework by fine-tuning prompt- and discipline-specific large language models to detect the style of ChatGPT-revised texts. Our findings reveal substantial disparities in LLM adoption across academic disciplines, gender, native language status, and career stage, alongside a rapid evolution in scholarly writing styles. Moreover, LLM usage enhances clarity, conciseness, and adherence to formal writing conventions, with improvements varying by revision type. Finally, a difference-in-differences analysis shows that while LLMs drive convergence in academic writing, early adopters, male researchers, non-native speakers, and junior scholars exhibit the most pronounced stylistic shifts, aligning their writing more closely with that of established researchers.
摘要：大型语言模型（LLM），例如ChatGpt，正在重塑内容创建和学术写作。这项研究调查了AI辅助生成修订对研究手稿的影响，重点是异质采用模式及其对写作融合的影响。利用Arxiv的627,000多个学术论文的数据集，我们通过微调及时和纪律特定的大语言模型来开发一个新颖的分类框架，以检测Chatgpt Revpt的文本的风格。我们的发现揭示了LLM在学科，性别，母语状况和职业阶段的LLM采用方面的巨大差异，以及学术写作风格的快速发展。此外，LLM的使用增强了对正式写作惯例的清晰度，简洁性和遵守，并随着修订类型而变化。最后，一项差异分析表明，尽管LLMS推动了学术写作中的融合，但早期采用者，男性研究人员，非本地讲话者和初级学者表现出最明显的风格转变，使他们的写作与知名研究人员的写作更加紧密地保持一致。

Title: Remedy: Learning Machine Translation Evaluation from Human Preferences with Reward Modeling

Authors: Shaomu Tan, Christof Monz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.13630
Pdf URL: https://arxiv.org/pdf/2504.13630
Copy Paste: [[2504.13630]] Remedy: Learning Machine Translation Evaluation from Human Preferences with Reward Modeling(https://arxiv.org/abs/2504.13630)
Keywords: gpt, llm, prompt
Abstract: A key challenge in MT evaluation is the inherent noise and inconsistency of human ratings. Regression-based neural metrics struggle with this noise, while prompting LLMs shows promise at system-level evaluation but performs poorly at segment level. In this work, we propose ReMedy, a novel MT metric framework that reformulates translation evaluation as a reward modeling task. Instead of regressing on imperfect human ratings directly, ReMedy learns relative translation quality using pairwise preference data, resulting in a more reliable evaluation. In extensive experiments across WMT22-24 shared tasks (39 language pairs, 111 MT systems), ReMedy achieves state-of-the-art performance at both segment- and system-level evaluation. Specifically, ReMedy-9B surpasses larger WMT winners and massive closed LLMs such as MetricX-13B, XCOMET-Ensemble, GEMBA-GPT-4, PaLM-540B, and finetuned PaLM2. Further analyses demonstrate that ReMedy delivers superior capability in detecting translation errors and evaluating low-quality translations.
摘要：MT评估中的一个关键挑战是人类评级的固有噪音和不一致。基于回归的神经指标与这种噪音斗争，同时促使LLM在系统级别的评估方面表现出希望，但在细分市场级别的表现较差。在这项工作中，我们提出了一种补救措施，这是一种新型MT指标框架，将翻译评估重新定义为奖励建模任务。补救措施没有直接对人类评级进行不完美的评级进行回归，而是使用成对偏好数据学习相对翻译质量，从而进行了更可靠的评估。在WMT22-24共享任务的广泛实验（39对，111吨系统）中，补救措施在细分市场和系统级别的评估中都达到了最先进的性能。具体而言，Remedy-9B超过了更大的WMT获奖者，并大量关闭的LLM，例如Metricx-13b，Xcomet-insemble，Gemba-GPT-4，Palm-540B和Fineted Palm2。进一步的分析表明，补救措施在检测翻译错误和评估低质量翻译方面具有卓越的能力。
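
Learning from pairwise preferences rather than regressing on raw ratings can be sketched with the standard Bradley-Terry objective used in reward modeling. This is a generic illustration under that assumption, not ReMedy's confirmed loss; the function name and the scalar scores are hypothetical.

```python
import math


def pairwise_preference_loss(score_preferred, score_rejected):
    """Bradley-Terry loss: -log(sigmoid(score_preferred - score_rejected)).

    Assumption: the reward model emits one scalar per translation, and the
    human-preferred translation should receive the higher score.
    """
    margin = score_preferred - score_rejected
    # Both branches compute log(1 + exp(-margin)); splitting on the sign
    # keeps exp() from overflowing for large negative margins.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical scores for two translations of the same source segment.
loss_when_ranked_correctly = pairwise_preference_loss(2.0, 0.0)
loss_when_ranked_wrongly = pairwise_preference_loss(0.0, 2.0)
```

The loss depends only on the difference between the two scores, so a consistent offset in how harshly an annotator rates every translation of a segment does not affect it.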

Title: Simulating Before Planning: Constructing Intrinsic User World Model for User-Tailored Dialogue Policy Planning

Authors: Tao He, Lizi Liao, Ming Liu, Bing Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.13643
Pdf URL: https://arxiv.org/pdf/2504.13643
Copy Paste: [[2504.13643]] Simulating Before Planning: Constructing Intrinsic User World Model for User-Tailored Dialogue Policy Planning(https://arxiv.org/abs/2504.13643)
Keywords: agent
Abstract: Recent advancements in dialogue policy planning have emphasized optimizing system agent policies to achieve predefined goals, focusing on strategy design, trajectory acquisition, and efficient training paradigms. However, these approaches often overlook the critical role of user characteristics, which are essential in real-world scenarios like conversational search and recommendation, where interactions must adapt to individual user traits such as personality, preferences, and goals. To address this gap, we first conduct a comprehensive study utilizing task-specific user personas to systematically assess dialogue policy planning under diverse user behaviors. By leveraging realistic user profiles for different tasks, our study reveals significant limitations in existing approaches, highlighting the need for user-tailored dialogue policy planning. Building on this foundation, we present the User-Tailored Dialogue Policy Planning (UDP) framework, which incorporates an Intrinsic User World Model to model user traits and feedback. UDP operates in three stages: (1) User Persona Portraying, using a diffusion model to dynamically infer user profiles; (2) User Feedback Anticipating, leveraging a Brownian Bridge-inspired anticipator to predict user reactions; and (3) User-Tailored Policy Planning, integrating these insights to optimize response strategies. To ensure robust performance, we further propose an active learning approach that prioritizes challenging user personas during training. Comprehensive experiments on benchmarks, including collaborative and non-collaborative settings, demonstrate the effectiveness of UDP in learning user-specific dialogue strategies. Results validate the protocol's utility and highlight UDP's robustness, adaptability, and potential to advance user-centric dialogue systems.
摘要：对话政策计划的最新进展强调了优化系统代理政策以实现预定义的目标，重点是策略设计，轨迹获取和有效的培训范式。但是，这些方法通常会忽略用户特征的关键作用，这些方法在对话搜索和推荐等现实情况下至关重要，在这种情况下，交互必须适应个性，个性，偏好和目标等单个用户特征。为了解决这一差距，我们首先利用特定于任务的用户角色进行了一项全面的研究，以系统地评估不同用户行为的对话策略计划。通过利用现实的用户资料来执行不同的任务，我们的研究揭示了现有方法的重大限制，强调了对用户限制的对话策略计划的需求。在此基础的基础上，我们介绍了用户限制的对话策略计划（UDP）框架，该框架结合了一个内在的用户世界模型，以建模用户特征和反馈。 UDP分为三个阶段：（1）使用扩散模型动态推断用户配置文件的用户角色刻画；（2）用户反馈预期，利用布朗桥启发的预测者预测用户反应；（3）用户限制的政策计划，整合这些见解以优化响应策略。为了确保表现良好，我们进一步提出了一种积极的学习方法，该方法在培训期间优先考虑挑战用户角色。在包括协作和非授权设置在内的基准测试的全面实验证明了UDP在学习特定用户的对话策略方面的有效性。结果验证了协议的实用程序，并突出了UDP的鲁棒性，适应性以及以用户为中心的对话系统的潜力。

Title: Revisiting Uncertainty Quantification Evaluation in Language Models: Spurious Interactions with Response Length Bias Results

Authors: Andrea Santilli, Adam Golinski, Michael Kirchhof, Federico Danieli, Arno Blaas, Miao Xiong, Luca Zappella, Sinead Williamson
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.13677
Pdf URL: https://arxiv.org/pdf/2504.13677
Copy Paste: [[2504.13677]] Revisiting Uncertainty Quantification Evaluation in Language Models: Spurious Interactions with Response Length Bias Results(https://arxiv.org/abs/2504.13677)
Keywords: language model, llm
Abstract: Uncertainty Quantification (UQ) in Language Models (LMs) is crucial for improving their safety and reliability. Evaluations often use performance metrics like AUROC to assess how well UQ methods (e.g., negative sequence probabilities) correlate with task correctness functions (e.g., ROUGE-L). In this paper, we show that commonly used correctness functions bias UQ evaluations by inflating the performance of certain UQ methods. We evaluate 7 correctness functions -- from lexical-based and embedding-based metrics to LLM-as-a-judge approaches -- across 4 datasets x 4 models x 6 UQ methods. Our analysis reveals that length biases in the errors of these correctness functions distort UQ assessments by interacting with length biases in UQ methods. We identify LLM-as-a-judge approaches as among the least length-biased choices and hence a potential solution to mitigate these biases.
摘要：语言模型（LMs）中的不确定性量化（UQ）对于提高其安全性和可靠性至关重要。评估通常使用诸如AUROC之类的性能指标来评估UQ方法（例如，负序列概率）与任务正确性函数（例如ROUGE-L）的关系。在本文中，我们表明，通常使用正确性函数通过夸大某些UQ方法的性能来偏差UQ评估。我们在4个数据集 x 4个模型 x 6种UQ方法中评估了7个正确性函数 - 从基于词汇和嵌入的指标到LLM-as-a-judge方法。我们的分析表明，这些正确性函数误差的长度偏差会通过与UQ方法中的长度偏差相互作用而扭曲了UQ评估。我们发现LLM-as-a-judge方法属于长度偏差最小的选择之一，因此是缓解这些偏差的潜在解决方案。
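
The evaluation protocol described above, scoring a UQ method by the AUROC between its confidence values and a correctness function, can be sketched as follows. The binary correctness labels (for example, a thresholded ROUGE-L score) and all example values are illustrative assumptions.

```python
def auroc(confidences, correct):
    """Probability that a correct response outranks an incorrect one.

    Computed by direct pairwise comparison, with ties counted as 0.5.
    """
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect response")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical confidences from a UQ method and binary correctness labels
# produced by some correctness function.
confidences = [0.9, 0.8, 0.4, 0.3]
correct = [True, False, True, False]
score = auroc(confidences, correct)
```

Because the labels come from the correctness function, any systematic error in that function, such as penalizing long answers, changes which responses count as positives and therefore shifts the AUROC of every UQ method scored against it.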

Title: Deep literature reviews: an application of fine-tuned language models to migration research

Authors: Stefano M. Iacus, Haodong Qi, Jiyoung Han
Subjects: cs.CL, cs.LG, stat.AP, stat.CO
Abstract URL: https://arxiv.org/abs/2504.13685
Pdf URL: https://arxiv.org/pdf/2504.13685
Copy Paste: [[2504.13685]] Deep literature reviews: an application of fine-tuned language models to migration research(https://arxiv.org/abs/2504.13685)
Keywords: language model, llm
Abstract: This paper presents a hybrid framework for literature reviews that augments traditional bibliometric methods with large language models (LLMs). By fine-tuning open-source LLMs, our approach enables scalable extraction of qualitative insights from large volumes of research content, enhancing both the breadth and depth of knowledge synthesis. To improve annotation efficiency and consistency, we introduce an error-focused validation process in which LLMs generate initial labels and human reviewers correct misclassifications. Applying this framework to over 20000 scientific articles about human migration, we demonstrate that a domain-adapted LLM can serve as a "specialist" model - capable of accurately selecting relevant studies, detecting emerging trends, and identifying critical research gaps. Notably, the LLM-assisted review reveals a growing scholarly interest in climate-induced migration. However, existing literature disproportionately centers on a narrow set of environmental hazards (e.g., floods, droughts, sea-level rise, and land degradation), while overlooking others that more directly affect human health and well-being, such as air and water pollution or infectious diseases. This imbalance highlights the need for more comprehensive research that goes beyond physical environmental changes to examine their ecological and societal consequences, particularly in shaping migration as an adaptive response. Overall, our proposed framework demonstrates the potential of fine-tuned LLMs to conduct more efficient, consistent, and insightful literature reviews across disciplines, ultimately accelerating knowledge synthesis and scientific discovery.
摘要：本文提出了文献评论的混合框架，该框架通过大语言模型（LLMS）增强了传统的书目方法。通过微调开源LLM，我们的方法可以从大量研究内容中可扩展地提取定性见解，从而增强知识综合的广度和深度。为了提高注释效率和一致性，我们引入了以错误为中心的验证过程，其中LLMS生成初始标签，人类审阅者正确分类。将此框架应用于有关人类迁移的20000多篇科学文章，我们证明了适应领域的LLM可以用作“专家”模型 - 能够准确选择相关研究，检测新兴趋势并确定关键的研究差距。值得注意的是，LLM辅助综述揭示了对气候引起的迁移的学术兴趣日益增长的兴趣。但是，现有文献不成比例地集中在一组狭窄的环境危害（例如洪水，干旱，海平面上升和土地退化）上，同时忽略了其他人更直接影响人类健康和福祉，例如空气和水污染或感染性疾病。这种不平衡凸显了对更全面的研究的需求，而这超出了身体环境的变化，以检查其生态和社会后果，尤其是在将移民作为适应性反应塑造时。总体而言，我们提出的框架展示了微调LLM的潜力，可以在学科跨学科进行更有效，一致和有见地的文献综述，最终加速知识综合和科学发现。

Title: Controlled Territory and Conflict Tracking (CONTACT): (Geo-)Mapping Occupied Territory from Open Source Intelligence

Authors: Paul K. Mandal, Cole Leo, Connor Hurley
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.13730
Pdf URL: https://arxiv.org/pdf/2504.13730
Copy Paste: [[2504.13730]] Controlled Territory and Conflict Tracking (CONTACT): (Geo-)Mapping Occupied Territory from Open Source Intelligence(https://arxiv.org/abs/2504.13730)
Keywords: language model, llm, prompt
Abstract: Open-source intelligence provides a stream of unstructured textual data that can inform assessments of territorial control. We present CONTACT, a framework for territorial control prediction using large language models (LLMs) and minimal supervision. We evaluate two approaches: SetFit, an embedding-based few-shot classifier, and a prompt tuning method applied to BLOOMZ-560m, a multilingual generative LLM. Our model is trained on a small hand-labeled dataset of news articles covering ISIS activity in Syria and Iraq, using prompt-conditioned extraction of control-relevant signals such as military operations, casualties, and location references. We show that the BLOOMZ-based model outperforms the SetFit baseline, and that prompt-based supervision improves generalization in low-resource settings. CONTACT demonstrates that LLMs fine-tuned using few-shot methods can reduce annotation burdens and support structured inference from open-ended OSINT streams. Our code is available at this https URL.
摘要：开源智能提供了一系列非结构化的文本数据，这些数据可以告知领土控制的评估。我们提出联系，这是使用大语言模型（LLM）和最小监督的领土控制预测框架。我们评估了两种方法：SetFit，一种基于嵌入式的几个射击分类器，以及应用于Bloomz-560m的及时调整方法，这是一种多语言生成LLM。我们的模型经过了一个涉及叙利亚和伊拉克ISIS活动的新闻文章的小型手工标记的数据集培训，并使用了及时的控制相关信号（例如军事行动，伤亡和地点参考）的及时提取。我们表明，基于Bloomz的模型的表现优于SetFit基线，并且基于及时的监督改善了低资源设置中的概括。触点表明，使用少量射击方法微调的LLMS可以减轻注释负担，并支持开放式OSINT流的结构推理。我们的代码可在此HTTPS URL上找到。

Title: BadApex: Backdoor Attack Based on Adaptive Optimization Mechanism of Black-box Large Language Models

Authors: Zhengxian Wu, Juan Wen, Wanli Peng, Ziwei Zhang, Yinghan Zhou, Yiming Xue
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2504.13775
Pdf URL: https://arxiv.org/pdf/2504.13775
Copy Paste: [[2504.13775]] BadApex: Backdoor Attack Based on Adaptive Optimization Mechanism of Black-box Large Language Models(https://arxiv.org/abs/2504.13775)
Keywords: language model, llm, prompt, agent
Abstract: Previous insertion-based and paraphrase-based backdoors have achieved great success in attack efficacy, but they ignore the text quality and semantic consistency between poisoned and clean texts. Although recent studies introduce LLMs to generate poisoned texts and improve the stealthiness, semantic consistency, and text quality, their hand-crafted prompts rely on expert experiences, facing significant challenges in prompt adaptability and attack performance after defenses. In this paper, we propose a novel backdoor attack based on an adaptive optimization mechanism of black-box large language models (BadApex), which leverages a black-box LLM to generate poisoned text through a refined prompt. Specifically, an Adaptive Optimization Mechanism is designed to refine an initial prompt iteratively using the generation and modification agents. The generation agent generates the poisoned text based on the initial prompt. Then the modification agent evaluates the quality of the poisoned text and refines a new prompt. After several iterations of the above process, the refined prompt is used to generate poisoned texts through LLMs. We conduct extensive experiments on three datasets with six backdoor attacks and two defenses. Extensive experimental results demonstrate that BadApex significantly outperforms state-of-the-art attacks. It improves prompt adaptability, semantic consistency, and text quality. Furthermore, when two defense methods are applied, the average attack success rate (ASR) is still up to 96.75%.
摘要：以前基于插入的基于插入的后门在攻击功效方面取得了巨大成功，但它们忽略了中毒文本和干净文本之间的文本质量和语义一致性。尽管最近的研究介绍了LLM，以产生中毒的文本并提高隐形，语义一致性和文本质量，但他们的手工制作的提示依靠专家经验，在防御后的迅速适应性和攻击性能方面面临着巨大的挑战。在本文中，我们提出了一种基于黑盒大语言模型（Badapex）自适应优化机制的新型后门攻击，该机制利用Black-Box LLM通过精致的提示来生成中毒文本。具体而言，自适应优化机制旨在使用生成和修饰剂对初始提示进行精炼。发电代理根据初始提示生成中毒文本。然后，修改剂评估有毒文本的质量并完善了新提示。经过上述过程的几次迭代后，精制提示被用来通过LLM生成中毒的文本。我们在三个数据集上进行了大量的实验，并进行了六次后门攻击和两次防御措施。广泛的实验结果表明，Badapex明显优于最先进的攻击。它提高了及时的适应性，语义一致性和文本质量。此外，当采用两种防御方法时，平均攻击成功率（ASR）仍高达96.75％。

Title: Analyzing LLMs' Knowledge Boundary Cognition Across Languages Through the Lens of Internal Representations

Authors: Chenghao Xiao, Hou Pong Chan, Hao Zhang, Mahani Aljunied, Lidong Bing, Noura Al Moubayed, Yu Rong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.13816
Pdf URL: https://arxiv.org/pdf/2504.13816
Copy Paste: [[2504.13816]] Analyzing LLMs' Knowledge Boundary Cognition Across Languages Through the Lens of Internal Representations(https://arxiv.org/abs/2504.13816)
Keywords: llm, hallucination
Abstract: While understanding the knowledge boundaries of LLMs is crucial to prevent hallucination, research on knowledge boundaries of LLMs has predominantly focused on English. In this work, we present the first study to analyze how LLMs recognize knowledge boundaries across different languages by probing their internal representations when processing known and unknown questions in multiple languages. Our empirical studies reveal three key findings: 1) LLMs' perceptions of knowledge boundaries are encoded in the middle to middle-upper layers across different languages; 2) Language differences in knowledge boundary perception follow a linear structure, which motivates our proposal of a training-free alignment method that effectively transfers knowledge boundary perception ability across languages, thereby helping reduce hallucination risk in low-resource languages; 3) Fine-tuning on bilingual question pair translation further enhances LLMs' recognition of knowledge boundaries across languages. Given the absence of standard testbeds for cross-lingual knowledge boundary analysis, we construct a multilingual evaluation suite comprising three representative types of knowledge boundary data. Our code and datasets are publicly available at this https URL.
摘要：尽管了解LLM的知识边界对于防止幻觉至关重要，但对LLM的知识边界的研究主要集中在英语上。在这项工作中，我们介绍了第一项研究，以分析LLM在处理多种语言的已知和未知问题时探讨其内部表示，如何识别不同语言的知识边界。我们的实证研究揭示了三个关键发现：1）LLMS对知识边界的看法在不同语言的中间至中层中编码。 2）知识边界感知的语言差异遵循线性结构，这激发了我们提出的无训练对准方法的建议，该方法有效地传递了知识边界感知能力，从而有助于降低低资产阶级语言的幻觉风险； 3）对双语问题对翻译的微调进一步增强了LLM对语言跨语言知识边界的认识。考虑到缺乏用于跨语性知识边界分析的标准测试床，我们构建了一个多语言评估套件，其中包括三种代表性的知识边界数据。我们的代码和数据集可在此HTTPS URL上公开获得。

Title: Feature Alignment and Representation Transfer in Knowledge Distillation for Large Language Models

Authors: Junjie Yang, Junhao Song, Xudong Han, Ziqian Bi, Tianyang Wang, Chia Xin Liang, Xinyuan Song, Yichao Zhang, Qian Niu, Benji Peng, Keyu Chen, Ming Liu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2504.13825
Pdf URL: https://arxiv.org/pdf/2504.13825
Copy Paste: [[2504.13825]] Feature Alignment and Representation Transfer in Knowledge Distillation for Large Language Models(https://arxiv.org/abs/2504.13825)
Keywords: language model
Abstract: Knowledge distillation (KD) is a technique for transferring knowledge from complex teacher models to simpler student models, significantly enhancing model efficiency and accuracy. It has demonstrated substantial advancements in various applications including image classification, object detection, language modeling, text classification, and sentiment analysis. Recent innovations in KD methods, such as attention-based approaches, block-wise logit distillation, and decoupling distillation, have notably improved student model performance. These techniques focus on stimulus complexity, attention mechanisms, and global information capture to optimize knowledge transfer. In addition, KD has proven effective in compressing large language models while preserving accuracy, reducing computational overhead, and improving inference speed. This survey synthesizes the latest literature, highlighting key findings, contributions, and future directions in knowledge distillation to provide insights for researchers and practitioners on its evolving role in artificial intelligence and machine learning.
摘要：知识蒸馏（KD）是一种将知识从复杂的教师模型转移到更简单的学生模型的技术，可显着提高模型效率和准确性。它已经证明了各种应用程序的重大进步，包括图像分类，对象检测，语言建模，文本分类和情感分析。 KD方法的最新创新，例如基于注意力的方法，良好的logit蒸馏和脱钩蒸馏，显着改善了学生模型的表现。这些技术着重于刺激复杂性，注意机制和全球信息捕获，以优化知识转移。此外，KD已被证明有效地压缩了大型语言模型，同时保持准确性，降低计算开销和提高推理速度。这项调查综合了最新的文献，突出了知识蒸馏中的关键发现，贡献和未来的方向，为研究人员和从业人员提供有关其在人工智能和机器学习中不断发展的作用的见解。

Title: Generative AI Act II: Test Time Scaling Drives Cognition Engineering

Authors: Shijie Xia, Yiwei Qin, Xuefeng Li, Yan Ma, Run-Ze Fan, Steffi Chern, Haoyang Zou, Fan Zhou, Xiangkun Hu, Jiahe Jin, Yanheng He, Yixin Ye, Yixiu Liu, Pengfei Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.13828
Pdf URL: https://arxiv.org/pdf/2504.13828
Copy Paste: [[2504.13828]] Generative AI Act II: Test Time Scaling Drives Cognition Engineering(https://arxiv.org/abs/2504.13828)
Keywords: language model, prompt
Abstract: The first generation of Large Language Models - what might be called "Act I" of generative AI (2020-2023) - achieved remarkable success through massive parameter and data scaling, yet exhibited fundamental limitations in knowledge latency, shallow reasoning, and constrained cognitive processes. During this era, prompt engineering emerged as our primary interface with AI, enabling dialogue-level communication through natural language. We now witness the emergence of "Act II" (2024-present), where models are transitioning from knowledge-retrieval systems (in latent space) to thought-construction engines through test-time scaling techniques. This new paradigm establishes a mind-level connection with AI through language-based thoughts. In this paper, we clarify the conceptual foundations of cognition engineering and explain why this moment is critical for its development. We systematically break down these advanced approaches through comprehensive tutorials and optimized implementations, democratizing access to cognition engineering and enabling every practitioner to participate in AI's second act. We provide a regularly updated collection of papers on test-time scaling in the GitHub Repository: this https URL
摘要：大型语言模型的第一代 - 可能被称为生成AI的“行为”（2020-2023） - 通过大规模参数和数据缩放取得了显着的成功，但在知识潜伏期，浅薄的推理和受限的认知过程中表现出了根本的限制。在这个时代，迅速的工程成为我们的主要界面，并通过自然语言实现了对话级别的交流。现在，我们目睹了“第二幕”（2024年至今）的出现，其中模型正在通过测试时间缩放技术从知识回溯系统（在潜在空间中）过渡到思想构建引擎。这种新的范式通过基于语言的思想建立了与AI的思维级联系。在本文中，我们阐明了认知工程的概念基础，并解释了为什么这一刻对于其发展至关重要。我们通过全面的教程和优化的实施来系统地分解这些先进的方法，使对认知工程的访问权限，并使每个从业者都可以参加AI的第二幕。我们在GitHub存储库中提供定期更新的有关测试时间缩放的论文集合：此HTTPS URL

Title: Science Hierarchography: Hierarchical Organization of Science Literature

Authors: Muhan Gao, Jash Shah, Weiqi Wang, Daniel Khashabi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.13834
Pdf URL: https://arxiv.org/pdf/2504.13834
Copy Paste: [[2504.13834]] Science Hierarchography: Hierarchical Organization of Science Literature(https://arxiv.org/abs/2504.13834)
Keywords: llm, prompt, agent
Abstract: Scientific knowledge is growing rapidly, making it challenging to track progress and high-level conceptual links across broad disciplines. While existing tools like citation networks and search engines make it easy to access a few related papers, they fundamentally lack the flexible abstraction needed to represent the density of activity in various scientific subfields. We motivate SCIENCE HIERARCHOGRAPHY, the goal of organizing scientific literature into a high-quality hierarchical structure that allows for the categorization of scientific work across varying levels of abstraction, from very broad fields to very specific studies. Such a representation can provide insights into which fields are well-explored and which are under-explored. To achieve the goals of SCIENCE HIERARCHOGRAPHY, we develop a range of algorithms. Our primary approach combines fast embedding-based clustering with LLM-based prompting to balance the computational efficiency of embedding methods with the semantic precision offered by LLM prompting. We demonstrate that this approach offers the best trade-off between quality and speed compared to methods that heavily rely on LLM prompting, such as iterative tree construction with LLMs. To better reflect the interdisciplinary and multifaceted nature of research papers, our hierarchy captures multiple dimensions of categorization beyond simple topic labels. We evaluate the utility of our framework by assessing how effectively an LLM-based agent can locate target papers using the hierarchy. Results show that this structured approach enhances interpretability, supports trend discovery, and offers an alternative pathway for exploring scientific literature beyond traditional search methods. Code, data and demo: this https URL
摘要：科学知识正在迅速发展，这使得在广泛学科的进度和高级概念联系方面具有挑战性。尽管引用网络和搜索引擎等现有工具使访问一些相关论文变得易于访问，但它们从根本上缺乏代表各种科学子场中活动密度所需的灵活抽象。我们激励科学分层图，这是将科学文献组织成高质量的等级结构的目标，该结构允许从不同水平的抽象范围内分类科学工作，从非常广泛的领域到非常具体的研究。这样的表示可以提供有关哪些字段的探索且探索了哪些字段的见解。为了实现科学分层法的目标，我们开发了一系列算法。我们的主要方法将基于快速嵌入的聚类与基于LLM的提示结合在一起，以平衡嵌入方法的计算效率与LLM提示提供的语义精度。我们证明，与严重依赖LLM提示的方法相比，这种方法在质量和速度之间提供了最佳的权衡，例如使用LLMS的迭代树构造。为了更好地反映研究论文的跨学科和多方面的性质，我们的层次结构捕获了超出简单主题标签的多个分类的维度。我们通过评估基于LLM的代理如何使用层次结构定位目标论文来评估框架的实用性。结果表明，这种结构化方法可增强可解释性，支持趋势发现，并为探索传统搜索方法以外的科学文献提供了另一种途径。代码，数据和演示：$ \ href {this https url} {this https url} $