2025-08-29

Title: Social Bias in Multilingual Language Models: A Survey

Authors: Lance Calvin Lim Gamboa, Yue Feng, Mark Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.20201
Pdf URL: https://arxiv.org/pdf/2508.20201
Copy Paste: [[2508.20201]] Social Bias in Multilingual Language Models: A Survey(https://arxiv.org/abs/2508.20201)
Keywords: language model
Abstract: Pretrained multilingual models exhibit the same social bias as models processing English texts. This systematic review analyzes emerging research that extends bias evaluation and mitigation approaches into multilingual and non-English contexts. We examine these studies with respect to linguistic diversity, cultural awareness, and their choice of evaluation metrics and mitigation techniques. Our survey illuminates gaps in the field's dominant methodological design choices (e.g., preference for certain languages, scarcity of multilingual mitigation experiments) while cataloging common issues encountered and solutions implemented in adapting bias benchmarks across languages and cultures. Drawing from the implications of our findings, we chart directions for future research that can reinforce the multilingual bias literature's inclusivity, cross-cultural appropriateness, and alignment with state-of-the-art NLP advancements.
摘要：预处理的多语言模型表现出与处理英语文本的模型相同的社会偏见。这项系统的综述分析了新兴研究，将偏见评估和缓解方法扩展到多语言和非英语环境。我们研究了有关语言多样性，文化意识及其评估指标和缓解技术的研究。我们的调查阐明了该领域主要的方法论设计选择中的差距（例如，对某些语言的偏爱，多语言缓解实验的稀缺性），同时遇到的常见问题和解决方案在适应跨语言和文化的偏见基准中实现的解决方案。从我们的发现的含义中，我们绘制了未来研究的指示，可以增强多语言偏见文献的包容性，跨文化适用性以及与最先进的NLP进步的一致性。

Title: Prompting Strategies for Language Model-Based Item Generation in K-12 Education: Bridging the Gap Between Small and Large Language Models

Authors: Mohammad Amini, Babak Ahmadi, Xiaomeng Xiong, Yilin Zhang, Christopher Qiao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.20217
Pdf URL: https://arxiv.org/pdf/2508.20217
Copy Paste: [[2508.20217]] Prompting Strategies for Language Model-Based Item Generation in K-12 Education: Bridging the Gap Between Small and Large Language Models(https://arxiv.org/abs/2508.20217)
Keywords: language model, gpt, prompt, chain-of-thought
Abstract: This study explores automatic item generation (AIG) using language models to create multiple choice questions (MCQs) for morphological assessment, aiming to reduce the cost and inconsistency of manual test development. The study used a two-fold approach. First, we compared a fine-tuned medium model (Gemma, 2B) with a larger untuned one (GPT-3.5, 175B). Second, we evaluated seven structured prompting strategies, including zero-shot, few-shot, chain-of-thought, role-based, sequential, and combinations. Generated items were assessed using automated metrics and expert scoring across five dimensions. We also used GPT-4.1, trained on expert-rated samples, to simulate human scoring at scale. Results show that structured prompting, especially strategies combining chain-of-thought and sequential design, significantly improved Gemma's outputs. Gemma generally produced more construct-aligned and instructionally appropriate items than GPT-3.5's zero-shot responses, with prompt design playing a key role in mid-size model performance. This study demonstrates that structured prompting and efficient fine-tuning can enhance midsized models for AIG under limited data conditions. We highlight the value of combining automated metrics, expert judgment, and large-model simulation to ensure alignment with assessment goals. The proposed workflow offers a practical and scalable way to develop and validate language assessment items for K-12.
摘要：这项研究使用语言模型探索自动生成（AIG）来创建多种选择问题（MCQ）进行形态学评估，旨在降低手动测试开发的成本和不一致。该研究使用了两倍的方法。首先，我们比较了一个微调的培养基模型（Gemma，2B），其中一个较大的未调节（GPT-3.5，175b）。其次，我们评估了七种结构化提示策略，包括零射，很少，经过思考链，基于角色的，顺序和组合。使用自动指标和五个维度的专家评分评估生成的项目。我们还使用了经过专家评级样品培训的GPT-4.1，以大规模模拟人类评分。结果表明，结构化的提示，尤其是结合了经营链和顺序设计的策略，可显着改善Gemma的输出。 Gemma通常比GPT-3.5的零拍响应产生的构造与教学更合适的项目更合适，并且及时设计在中型模型性能中起关键作用。这项研究表明，在有限的数据条件下，结构化提示和有效的微调可以增强AIG的中型模型。我们强调了将自动指标，专家判断和大型模拟模拟相结合的价值，以确保与评估目标保持一致。拟议的工作流提供了一种实用且可扩展的方法，可以为K-12开发和验证语言评估项目。

Title: Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities

Authors: Rikuto Kotoge, Mai Nishimura, Jiaxin Ma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.20324
Pdf URL: https://arxiv.org/pdf/2508.20324
Copy Paste: [[2508.20324]] Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities(https://arxiv.org/abs/2508.20324)
Keywords: language model, agent
Abstract: Reinforcement Learning has emerged as a post-training approach to elicit agentic RAG behaviors such as search and planning from language models. However, compact language models (e.g., 0.5B parameters) struggle due to poor reasoning ability, resulting in sparse rewards and unstable training. To overcome these difficulties, we propose Distillation-Guided Policy Optimization (DGPO), which addresses the challenges through cold-start initialization from teacher demonstrations and continuous teacher guidance during policy optimization. To systematically evaluate our approach, we introduce Agentic RAG Capabilities (ARC), a fine-grained metric analyzing reasoning, search coordination, and response synthesis. Comprehensive experiments demonstrate that DGPO enables compact models to achieve sophisticated agentic search behaviors, even outperforming the larger teacher model in some cases. DGPO makes agentic RAG feasible in computing resource-constrained environments.
摘要：强化学习已成为一种训练后的方法，以引起代理抹布行为，例如从语言模型中进行搜索和计划。但是，紧凑的语言模型（例如，0.5b参数）由于推理能力差而挣扎，导致了稀疏的奖励和不稳定的培训。为了克服这些困难，我们提出了蒸馏引导的政策优化（DGPO），该政策优化通过在政策优化期间的教师示威和持续的教师指导中冷静的初始化来解决挑战。为了系统地评估我们的方法，我们引入了代理抹布能力（ARC），这是一种细粒度分析推理，搜索协调和响应合成。全面的实验表明，DGPO使紧凑的模型能够达到复杂的代理搜索行为，甚至在某些情况下甚至超过了更大的教师模型。 DGPO在计算资源约束环境中可行的代理抹布。

Title: GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs

Authors: Haibo Jin, Ruoxi Chen, Peiyan Zhang, Andy Zhou, Yang Zhang, Haohan Wang
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2508.20325
Pdf URL: https://arxiv.org/pdf/2508.20325
Copy Paste: [[2508.20325]] GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs(https://arxiv.org/abs/2508.20325)
Keywords: language model, gpt, llm, prompt, chat
Abstract: As Large Language Models become increasingly integral to various domains, their potential to generate harmful responses has prompted significant societal and regulatory concerns. In response, governments have issued ethics guidelines to promote the development of trustworthy AI. However, these guidelines are typically high-level demands for developers and testers, leaving a gap in translating them into actionable testing questions to verify LLM compliance. To address this challenge, we introduce GUARD (Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics), a testing method designed to operationalize guidelines into specific guideline-violating questions that assess LLM adherence. To implement this, GUARD uses automated generation of guideline-violating questions based on government-issued guidelines, thereby testing whether responses comply with these guidelines. When responses directly violate guidelines, GUARD reports inconsistencies. Furthermore, for responses that do not directly violate guidelines, GUARD integrates the concept of "jailbreaks" into diagnostics, named GUARD-JD, which creates scenarios that provoke unethical or guideline-violating responses, effectively identifying potential scenarios that could bypass built-in safety mechanisms. Our method finally culminates in a compliance report, delineating the extent of adherence and highlighting any violations. We have empirically validated the effectiveness of GUARD on seven LLMs, including Vicuna-13B, LongChat-7B, Llama2-7B, Llama-3-8B, GPT-3.5, GPT-4, GPT-4o, and Claude-3.7, by testing compliance under three government-issued guidelines and conducting jailbreak diagnostics. Additionally, GUARD-JD can transfer jailbreak diagnostics to vision-language models, demonstrating its usage in promoting reliable LLM-based applications.
摘要：随着大型语言模型越来越多地与各个领域不可或缺，它们产生有害反应的潜力引起了重大的社会和监管问题。作为回应，政府发布了道德准则，以促进可信赖的AI的发展。但是，这些准则通常是对开发人员和测试人员的高级需求，在将其转化为可行的测试问题以验证LLM的依从性方面留出了差距。为了应对这一挑战，我们介绍了通过\ textbf {a} daptbf {a} daptbf {a} dappbf \ textbf {r} ole-play and play and play and textbf \ textbf {d} iNgnostics），一种评估指南的指南，以特定的质量确定指南，该指南介绍了指南，该指南将特定的指导级用于特定问题。为了实施这一点，Guard根据政府发行的指南使用自动生成指南侵入问题，从而测试了回答是否符合这些准则。当响应直接违反准则时，警卫报告了不一致的情况。此外，对于不直接违反准则的响应，Guard将``越狱''的概念（名为Guard-JD）整合到了诊断剂，该诊断概念会创建场景，从而引起不道德或指导性侵犯的响应，从而有效地识别了可以绕开内置安全机制的潜在情况。我们的方法最终在合规报告中达到顶峰，描述了依从性的程度并突出了任何违法行为。我们通过经验验证了警卫对七个LLM的有效性，包括Vicuna-13b，Longchat-7b，Llama2-7B，Llama-3-8B，GPT-3.5，GPT-3，GPT-4，GPT-4，GPT-4O和Claude-3.7，通过在三项政府发行的指南和执行越来越越来越诊断下进行测试，并通过三个政府发行的指导诊断。此外，Guard-JD可以将越狱诊断转移到视觉模型中，以证明其在促进可靠的基于LLM的应用程序中的用途。

Title: Joint Enhancement of Relational Reasoning for Long-Context LLMs

Authors: Zhirui Chen, Wei Shen, Jiashui Huang, Ling Shao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.20351
Pdf URL: https://arxiv.org/pdf/2508.20351
Copy Paste: [[2508.20351]] Joint Enhancement of Relational Reasoning for Long-Context LLMs(https://arxiv.org/abs/2508.20351)
Keywords: language model, llm, long context, hallucination
Abstract: Despite significant progress, large language models (LLMs) still struggle with long contexts due to memory limitations and their inability to tackle complex and long-context tasks. Additionally, LLMs often suffer from a lack of transparency and are prone to producing hallucinations. To address these challenges, we propose JERR, a novel framework designed to enhance long-context comprehension via graph-based reasoning in LLMs. JERR integrates three key components: synopsis extraction, graph construction, and relational reasoning. First, synopsis is extracted by chunking text strategically, allowing the model to summarize and understand information more efficiently. Second, we build a directed acyclic graph (DAG) to resolve redundancy, ensuring logical consistency and clarity. Finally, we incorporate Monte Carlo Tree Search (MCTS) to help the model navigate complex reasoning paths, ensuring more accurate and interpretable outputs. This framework provides a novel solution that enables LLMs to handle extended contexts and complex reasoning tasks with improved reliability and transparency. Experimental results show that JERR consistently outperforms all baselines on the ROUGE and F1 metrics, achieving the highest scores on the LLM-Rater evaluation.
摘要：尽管取得了重大进展，但由于记忆限制及其无法应对复杂和长期的任务，大型语言模型（LLM）仍在长期存在方面挣扎。此外，LLMS通常会缺乏透明度，并且容易产生幻觉。为了应对这些挑战，我们提出了\ textbf {jerr}，这是一个新颖的框架，旨在通过LLMS中的基于图形的推理来增强长篇小说理解。 Jerr整合了三个关键组成部分：提取物提取，图形结构和关系推理。首先，提取概要是通过策略性地构成文本来提取的，从而使模型可以更有效地总结和理解信息。其次，我们构建了一个有向的无环图（DAG）来解决冗余，以确保逻辑一致性和清晰度。最后，我们合并了蒙特卡洛树搜索（MCT），以帮助模型导航复杂的推理路径，从而确保更准确和可解释的输出。该框架提供了一种新颖的解决方案，使LLM可以处理扩展的上下文和复杂的推理任务，并具有提高的可靠性和透明度。实验结果表明，JERR始终优于Rouge和F1指标上的所有基线，在LLM评估评估中得分最高。

Title: Graph-R1: Unleashing LLM Reasoning with NP-Hard Graph Problems

Authors: Yuyao Wang, Bowen Liu, Jianheng Tang, Nuo Chen, Yuhan Li, Qifan Zhang, Jia Li
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.20373
Pdf URL: https://arxiv.org/pdf/2508.20373
Copy Paste: [[2508.20373]] Graph-R1: Unleashing LLM Reasoning with NP-Hard Graph Problems(https://arxiv.org/abs/2508.20373)
Keywords: language model, llm, chain-of-thought
Abstract: Reasoning Large Language Models (RLLMs) have recently achieved remarkable progress on complex reasoning tasks, largely enabled by their long chain-of-thought (Long CoT) capabilities. However, developing these Long CoT behaviors relies heavily on post-training with high-quality datasets, which are typically costly and human-curated (e.g., mathematics and code), leaving scalable alternatives unexplored. In this work, we introduce NP-hard (NPH) graph problems as a novel synthetic training corpus, as they inherently require deep reasoning, extensive exploration, and reflective strategies, which are core characteristics of Long CoT reasoning. Building on this insight, we develop a two-stage post-training framework: (i) Long CoT Supervised Fine-Tuning (SFT) on rejection-sampled NPH graph instances, which substantially enhances reasoning depth, and (ii) Reinforcement Learning (RL) with a fine-grained reward design, which sharpens reasoning efficiency. Our flagship model, Graph-R1-7B, demonstrates strong generalization across mathematics, coding, STEM, and logic, and surpasses QwQ-32B on NPH graph problems in both accuracy and reasoning efficiency. These results position NPH graph problems as an effective and scalable resource for advancing Long CoT reasoning in LLMs, opening a new frontier for LLM post-training. Our implementation is available at this https URL, with models and datasets hosted in our Hugging Face collection HKUST-DSAIL/Graph-R1.
摘要：推理大语言模型（RLLM）最近在复杂的推理任务上取得了显着的进步，这在很大程度上是由于其长链（长COT）功能而实现的。但是，开发这些长的COT行为在很大程度上依赖于使用高质量数据集进行培训，这些数据集通常是昂贵且人为策划的（例如，数学和代码），而尚未探索可扩展的替代方案。在这项工作中，我们将NP-HARD（NPH）的图形问题引入了一种新型的合成训练语料库，因为它们固有地需要深层的推理，广泛的探索和反思性策略，这些策略是长期cot推理的核心特征。在此洞察力的基础上，我们开发了一个两阶段的训练后框架：（i）在拒绝采样的NPH图实例上进行了长期监督的小说，从而实质上增强了推理深度，以及（ii）通过精细奖励设计的增强式学习（RL），从而提高了推理效率。我们的旗舰模型Graph-R1-7B在数学，编码，STEM和逻辑上展示了强有力的概括，并且在准确性和推理效率方面都超过了NPH图问题上的QWQ-32B。这些结果将NPH图形问题定位为有效且可扩展的资源，用于推进LLM中的长COT推理，从而为LLM后培训打开了新的边界。我们的实现可在此HTTPS URL上获得，模型和数据集托管在我们的拥抱脸部集合Hkust-dsail/graph-r1中。

Title: CAPE: Context-Aware Personality Evaluation Framework for Large Language Models

Authors: Jivnesh Sandhan, Fei Cheng, Tushar Sandhan, Yugo Murawaki
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.20385
Pdf URL: https://arxiv.org/pdf/2508.20385
Copy Paste: [[2508.20385]] CAPE: Context-Aware Personality Evaluation Framework for Large Language Models(https://arxiv.org/abs/2508.20385)
Keywords: language model, gpt, llm, agent
Abstract: Psychometric tests, traditionally used to assess humans, are now being applied to Large Language Models (LLMs) to evaluate their behavioral traits. However, existing studies follow a context-free approach, answering each question in isolation to avoid contextual influence. We term this the Disney World test, an artificial setting that ignores real-world applications, where conversational history shapes responses. To bridge this gap, we propose the first Context-Aware Personality Evaluation (CAPE) framework for LLMs, incorporating prior conversational interactions. To thoroughly analyze the influence of context, we introduce novel metrics to quantify the consistency of LLM responses, a fundamental trait in human behavior. Our exhaustive experiments on 7 LLMs reveal that conversational history enhances response consistency via in-context learning but also induces personality shifts, with GPT-3.5-Turbo and GPT-4-Turbo exhibiting extreme deviations. While GPT models are robust to question ordering, Gemini-1.5-Flash and Llama-8B display significant sensitivity. Moreover, GPT models' responses stem from their intrinsic personality traits as well as prior interactions, whereas Gemini-1.5-Flash and Llama-8B heavily depend on prior interactions. Finally, applying our framework to Role Playing Agents (RPAs) shows that context-dependent personality shifts improve response consistency and better align with human judgments. Our code and datasets are publicly available at: this https URL
摘要：传统上用于评估人类的心理测验现在正在应用于大型语言模型（LLMS）以评估其行为特征。但是，现有研究遵循一种无上下文的方法，孤立地回答每个问题，以避免上下文影响。我们将其称为“迪士尼世界测试”，这是一个人工环境，忽略了现实世界的应用程序，其中会话历史塑造了响应。为了弥合这一差距，我们建议对LLMS的第一个环境感知性格评估（CAPE）框架，并结合了先前的对话互动。为了彻底分析上下文的影响，我们介绍了新颖的指标来量化LLM响应的一致性，LLM响应是人类行为的基本特征。我们在7个LLM上进行的详尽实验表明，对话历史通过中文学习增强了响应一致性，但也会引起人格转移，而GPT-3.5-Turbo和GPT-4-Turbo表现出极端的偏差。虽然GPT模型质疑排序的强大，但Gemini-1.5-Flash和Llama-8b表现出显着的敏感性。此外，GPT模型的响应源于它们内在的人格特征以及先前的相互作用，而双子座1.5-flash和Llama-8B在很大程度上取决于先前的相互作用。最后，将我们的框架应用于角色扮演代理商（RPA）显示了与上下文相关的人格转变，可以改善响应的一致性，并更好地与人类判断保持一致。我们的代码和数据集可公开可用：此HTTPS URL

Title: Measuring Reasoning Utility in LLMs via Conditional Entropy Reduction

Authors: Xu Guo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.20395
Pdf URL: https://arxiv.org/pdf/2508.20395
Copy Paste: [[2508.20395]] Measuring Reasoning Utility in LLMs via Conditional Entropy Reduction(https://arxiv.org/abs/2508.20395)
Keywords: language model, gpt, llm
Abstract: Recent advancements in large language models (LLMs) often rely on generating intermediate reasoning steps to enhance accuracy. However, little work has examined how reasoning utility contributes to the final answer's correctness. Due to the stochastic nature of autoregressive generation, generating more context does not guarantee increased confidence in the answer. If we could predict, during generation, whether a reasoning step will be useful, we could stop early or prune ineffective steps, avoiding distractions in the final decision. We present an oracle study on the MATH dataset, using Qwen2.5-32B and GPT-4o to generate reasoning chains, and then employing a separate model (Qwen3-8B) to quantify the utility of these chains for final accuracy. Specifically, we measure the model's uncertainty on the answer span Y at each reasoning step using conditional entropy (expected negative log-likelihood over the vocabulary) with context expanding step by step. Our results show a clear pattern: conditional entropy that decreases over steps is strongly associated with correct answers, whereas flat or increasing entropy often results in wrong answers. We also corroborate that incorrect reasoning paths tend to be longer than correct ones, suggesting that longer reasoning does not necessarily yield better outcomes. These findings serve as a foundation to inspire future work on designing efficient reasoning pipelines that detect and avoid unproductive reasoning early.
摘要：大型语言模型（LLM）的最新进展通常依赖于生成中间推理步骤以提高准确性。但是，很少的工作已经研究了推理实用程序如何有助于最终答案的正确性。由于自回归产生的随机性，产生更多的上下文并不能保证对答案的信心增加。如果我们可以预测在一代期间，一个推理步骤是否有用，我们可以尽早停止或修剪无效的步骤，从而避免最终决定中的注意力。我们使用QWEN2.5-32B和GPT-4O提出了一项关于数学数据集的Oracle研究，以生成推理链，然后采用单独的模型（QWEN3-8B）来量化这些链的效用以最终准确。具体而言，我们使用条件熵（词汇上的预期负log-okelione）在每个推理步骤中测量模型的不确定性，并逐步扩展上下文。我们的结果显示了一个清晰的模式：有条件的熵在步骤上降低与正确的答案密切相关，而平坦或增加的熵通常会导致错误的答案。我们还证实了错误的推理路径往往比正确的推理路径更长，这表明更长的推理不一定会产生更好的结果。这些发现是激发未来在设计有效推理管道方面的未来工作的基础，这些渠道可以及早发现并避免及早推理。
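Illustrative sketch (not the paper's code): the entropy signal described in this abstract can be approximated as follows, assuming the scoring model's next-token distributions over the answer span are already available for each reasoning step. The function names, the use of a least-squares slope as the "decreasing vs. flat" test, and the toy three-token vocabulary are all assumptions made for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def answer_entropy(token_dists):
    """Mean per-token entropy over the answer span Y (one distribution per answer token)."""
    return sum(entropy(d) for d in token_dists) / len(token_dists)

def entropy_trend(per_step_entropy):
    """Least-squares slope of entropy against step index.

    A negative slope means uncertainty about the answer falls as
    reasoning steps are appended to the context.
    """
    n = len(per_step_entropy)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(per_step_entropy) / n
    cov = sum((i - mean_x) * (h - mean_y) for i, h in enumerate(per_step_entropy))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var

# Hypothetical data: a one-token answer scored after each of three reasoning steps.
steps = [
    [[1 / 3, 1 / 3, 1 / 3]],   # after step 1: the model is undecided
    [[0.7, 0.2, 0.1]],         # after step 2: leaning towards one token
    [[0.98, 0.01, 0.01]],      # after step 3: nearly certain
]
per_step = [answer_entropy(s) for s in steps]
slope = entropy_trend(per_step)
```

A chain whose `slope` is clearly negative matches the pattern the abstract associates with correct answers; how to threshold the slope in practice is left open here.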

Title: UI-Bench: A Benchmark for Evaluating Design Capabilities of AI Text-to-App Tools

Authors: Sam Jung, Agustin Garcinuno, Spencer Mateega
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.20410
Pdf URL: https://arxiv.org/pdf/2508.20410
Copy Paste: [[2508.20410]] UI-Bench: A Benchmark for Evaluating Design Capabilities of AI Text-to-App Tools(https://arxiv.org/abs/2508.20410)
Keywords: prompt
Abstract: AI text-to-app tools promise high-quality applications and websites in minutes, yet no public benchmark rigorously verifies those claims. We introduce UI-Bench, the first large-scale benchmark that evaluates visual excellence across competing AI text-to-app tools through expert pairwise comparison. Spanning 10 tools, 30 prompts, 300 generated sites, and 4000+ expert judgments, UI-Bench ranks systems with a TrueSkill-derived model that yields calibrated confidence intervals. UI-Bench establishes a reproducible standard for advancing AI-driven web design. We release (i) the complete prompt set, (ii) an open-source evaluation framework, and (iii) a public leaderboard. The generated sites rated by participants will be released soon. View the UI-Bench leaderboard at this https URL.
摘要：AI文本到应用工具承诺在几分钟内提供高质量的应用程序和网站，但是没有公共基准严格验证这些主张。我们介绍了UI Bench，这是第一个大规模基准，该基准通过专家成对比较评估了竞争AI文本到应用工具的视觉卓越。跨越10个工具，30个提示，300个生成的站点和\ textit {4000+}专家判断，UI板台将具有Trueskill衍生模型的系统排名，该模型产生了校准的置信区间。 UI Bench建立了可再现的标准，用于推进AI驱动的Web设计。我们发布（i）完整的提示集，（ii）开源评估框架以及（iii）公共排行榜。由参与者评级的生成的站点将很快发布。在此HTTPS URL上查看UI基础排行榜。
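Illustrative sketch (not UI-Bench's method): the abstract ranks tools with a TrueSkill-derived model. As a simpler stand-in, the snippet below fits a Bradley-Terry model to pairwise preference counts with the standard minorization-maximization update. It gives point estimates only, with no confidence intervals, and the tool names and counts are made up. It also assumes every tool wins at least one comparison.

```python
def bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths from pairwise outcomes.

    wins[(a, b)] is the number of times a was preferred over b.
    Returns a dict of strengths normalized to sum to 1.
    """
    items = sorted({x for pair in wins for x in pair})
    strength = {i: 1.0 / len(items) for i in items}
    for _ in range(n_iter):
        updated = {}
        for i in items:
            # Total number of comparisons that item i won.
            total_wins = sum(c for (winner, loser), c in wins.items() if winner == i)
            denom = 0.0
            for j in items:
                if j == i:
                    continue
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = total_wins / denom if denom else strength[i]
        norm = sum(updated.values())
        strength = {i: v / norm for i, v in updated.items()}
    return strength

# Hypothetical judgments between three tools.
judgments = {
    ("A", "B"): 8, ("B", "A"): 2,
    ("B", "C"): 7, ("C", "B"): 3,
    ("A", "C"): 9, ("C", "A"): 1,
}
scores = bradley_terry(judgments)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Unlike TrueSkill, this model keeps no per-item uncertainty, so it cannot reproduce the calibrated intervals the abstract mentions.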

Title: DentalBench: Benchmarking and Advancing LLMs Capability for Bilingual Dentistry Understanding

Authors: Hengchuan Zhu, Yihuan Xu, Yichen Li, Zijie Meng, Zuozhu Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.20416
Pdf URL: https://arxiv.org/pdf/2508.20416
Copy Paste: [[2508.20416]] DentalBench: Benchmarking and Advancing LLMs Capability for Bilingual Dentistry Understanding(https://arxiv.org/abs/2508.20416)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Recent advances in large language models (LLMs) and medical LLMs (Med-LLMs) have demonstrated strong performance on general medical benchmarks. However, their capabilities in specialized medical fields such as dentistry, which require deeper domain-specific knowledge, remain underexplored due to the lack of targeted evaluation resources. In this paper, we introduce DentalBench, the first comprehensive bilingual benchmark designed to evaluate and advance LLMs in the dental domain. DentalBench consists of two main components: DentalQA, an English-Chinese question-answering (QA) benchmark with 36,597 questions spanning 4 tasks and 16 dental subfields; and DentalCorpus, a large-scale, high-quality corpus with 337.35 million tokens curated for dental domain adaptation, supporting both supervised fine-tuning (SFT) and retrieval-augmented generation (RAG). We evaluate 14 LLMs, covering proprietary, open-source, and medical-specific models, and reveal significant performance gaps across task types and languages. Further experiments with Qwen-2.5-3B demonstrate that domain adaptation substantially improves model performance, particularly on knowledge-intensive and terminology-focused tasks, and highlight the importance of domain-specific benchmarks for developing trustworthy and effective LLMs tailored to healthcare applications.
摘要：大型语言模型（LLM）和医学LLM（MED-LLM）的最新进展表明，在一般医疗基准上表现出色。但是，由于缺乏针对性的评估资源，它们在需要更深入领域知识的牙科等专业医疗领域的能力仍然没有被忽视。在本文中，我们介绍了牙科培养基，这是第一个旨在评估和推进牙科领域LLM的全面双语基准。牙科培养基由两个主要组成部分组成：Dentalqa，一个英语 - 中国的问题 - 诉讼（QA）基准，其中有36,597个问题，涵盖了4个任务和16个牙科子场；牙科牙齿是一种大规模的高质量语料库，具有33735万个代币，以策划牙科域的适应性，支持受监督的微调（SFT）和检索增强的生成（RAG）。我们评估了14个LLM，涵盖了专有，开源和特定于医学的模型，并揭示了任务类型和语言之间的巨大性能差距。对QWEN-2.5-3B进行的进一步实验表明，域的适应性基本上可以改善模型性能，尤其是在以知识密集型和术语为中心的任务上，并强调了针对域特异性基准测试对开发针对医疗保健应用程序量身定制的可信赖和有效LLMS的重要性。

Title: KG-CQR: Leveraging Structured Relation Representations in Knowledge Graphs for Contextual Query Retrieval

Authors: Chi Minh Bui, Ngoc Mai Thieu, Van Vinh Nguyen, Json J.Jung, Khac-Hoai Nam Bui
Subjects: cs.CL, cs.DB
Abstract URL: https://arxiv.org/abs/2508.20417
Pdf URL: https://arxiv.org/pdf/2508.20417
Copy Paste: [[2508.20417]] KG-CQR: Leveraging Structured Relation Representations in Knowledge Graphs for Contextual Query Retrieval(https://arxiv.org/abs/2508.20417)
Keywords: language model, llm, retrieval-augmented generation
Abstract: The integration of knowledge graphs (KGs) with large language models (LLMs) offers significant potential to improve the retrieval phase of retrieval-augmented generation (RAG) systems. In this study, we propose KG-CQR, a novel framework for Contextual Query Retrieval (CQR) that enhances the retrieval phase by enriching the contextual representation of complex input queries using a corpus-centric KG. Unlike existing methods that primarily address corpus-level context loss, KG-CQR focuses on query enrichment through structured relation representations, extracting and completing relevant KG subgraphs to generate semantically rich query contexts. Comprising subgraph extraction, completion, and contextual generation modules, KG-CQR operates as a model-agnostic pipeline, ensuring scalability across LLMs of varying sizes without additional training. Experimental results on RAGBench and MultiHop-RAG datasets demonstrate KG-CQR's superior performance, achieving a 4-6% improvement in mAP and a 2-3% improvement in Recall@25 over strong baseline models. Furthermore, evaluations on challenging RAG tasks such as multi-hop question answering show that incorporating KG-CQR consistently outperforms the existing baseline in terms of retrieval effectiveness.
摘要：知识图（kgs）与大语言模型（LLMS）的集成为改善检索增强生成（RAG）系统的检索阶段提供了重要潜力。在这项研究中，我们提出了KG-CQR，这是一个新型的上下文查询检索框架（CQR），该框架通过使用以语料库为中心的kg丰富复杂输入查询的上下文表示来增强检索阶段。与主要解决语料库级上下文损失的现有方法不同，KG-CQR专注于通过结构化关系表示，提取和完成相关的KG子图以生成语义上丰富的查询环境。 KG-CQR包括子图提取，完成和上下文生成模块，可作为模型 - 静态管道运行，可确保跨不同尺寸的LLM的可扩展性，而无需额外的培训。 Ragbench和Multihop-rag数据集的实验结果证明了KG-CQR的出色性能，在MAP上取得了4-6％的提高，而Recce@25比强基线模型提高了2-3％。此外，对挑战性的抹布任务（例如多跳问答）的评估表明，通过合并KG-CQR，该性能始终优于现有的基线，从
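Illustrative sketch: the two retrieval metrics named in this abstract, mAP and Recall@k, are commonly computed as below. The function names and document IDs are hypothetical, and the paper's exact evaluation protocol (for example, any rank cutoff applied to mAP) may differ.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(relevant.intersection(ranked_ids[:k]))
    return hits / len(relevant)

def average_precision(ranked_ids, relevant_ids):
    """Average of precision@rank taken at each rank where a relevant document occurs.

    Assumes ranked_ids contains no duplicates.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(runs):
    """Mean of average precision over (ranked_ids, relevant_ids) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hypothetical single query: relevant documents are d1 and d4.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d4"}
r_at_2 = recall_at_k(ranked, relevant, 2)      # d1 found, d4 missed
ap = average_precision(ranked, relevant)       # (1/1 + 2/4) / 2
```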

Title: CAMB: A comprehensive industrial LLM benchmark on civil aviation maintenance

Authors: Feng Zhang, Chengjie Pang, Yuehan Zhang, Chenyu Luo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.20420
Pdf URL: https://arxiv.org/pdf/2508.20420
Copy Paste: [[2508.20420]] CAMB: A comprehensive industrial LLM benchmark on civil aviation maintenance(https://arxiv.org/abs/2508.20420)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Civil aviation maintenance is a domain characterized by stringent industry standards. Within this field, maintenance procedures and troubleshooting represent critical, knowledge-intensive tasks that require sophisticated reasoning. To address the lack of specialized evaluation tools for large language models (LLMs) in this vertical, we propose and develop an industrial-grade benchmark specifically designed for civil aviation maintenance. This benchmark serves a dual purpose: It provides a standardized tool to measure LLM capabilities within civil aviation maintenance, identifying specific gaps in domain knowledge and complex reasoning. By pinpointing these deficiencies, the benchmark establishes a foundation for targeted improvement efforts (e.g., domain-specific fine-tuning, RAG optimization, or specialized prompt engineering), ultimately facilitating progress toward more intelligent solutions within civil aviation maintenance. Our work addresses a significant gap in the current LLM evaluation, which primarily focuses on mathematical and coding reasoning tasks. In addition, given that Retrieval-Augmented Generation (RAG) systems are currently the dominant solutions in practical applications, we leverage this benchmark to evaluate existing well-known vector embedding models and LLMs for civil aviation maintenance scenarios. Through experimental exploration and analysis, we demonstrate the effectiveness of our benchmark in assessing model performance within this domain, and we open-source this evaluation benchmark and code to foster further research and development: this https URL
摘要：民航维护是一个以严格的行业标准为特征的领域。在该字段中，维护过程和故障排除代表需要复杂推理的关键，知识密集型任务。为了解决该垂直行业缺乏大型语言模型（LLM）的专门评估工具，我们建议并开发专门为民航维护设计的工业级基准。该基准标准具有双重目的：它提供了一种标准化工具来测量民航维护中的LLM功能，从而确定了领域知识和复杂推理中的特定差距。通过查明这些缺陷，基准为有针对性的改进工作建立了基础（例如，特定于领域的微调，抹布优化或专门的及时工程），最终促进了在民用航空维护中更加智能解决方案的进步。我们的工作解决了当前LLM评估中的显着差距，该差距主要集中在数学和编码推理任务上。此外，鉴于检索功能的生成（RAG）系统目前是实际应用中的主要解决方案，我们利用此基准来评估现有著名的矢量嵌入模型和LLMS，以用于民航维护方案。通过实验探索和分析，我们证明了基准在评估该领域内模型性能方面的有效性，并开源此评估基准和代码以促进进一步的研究和开发：此HTTPS URL

Title: MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers

Authors: Zhenting Wang, Qi Chang, Hemani Patel, Shashank Biju, Cheng-En Wu, Quan Liu, Aolin Ding, Alireza Rezazadeh, Ankit Shah, Yujia Bao, Eugene Siow
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.20453
Pdf URL: https://arxiv.org/pdf/2508.20453
Copy Paste: [[2508.20453]] MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers(https://arxiv.org/abs/2508.20453)
Keywords: language model, llm, agent
Abstract: We introduce MCP-Bench, a benchmark for evaluating large language models (LLMs) on realistic, multi-step tasks that demand tool use, cross-tool coordination, precise parameter control, and planning/reasoning for solving tasks. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 28 representative live MCP servers spanning 250 tools across domains such as finance, traveling, scientific computing, and academic search. Unlike prior API-based benchmarks, each MCP server provides a set of complementary tools designed to work together, enabling the construction of authentic, multi-step tasks with rich input-output coupling. Tasks in MCP-Bench test agents' ability to retrieve relevant tools from fuzzy instructions without explicit tool names, plan multi-hop execution trajectories for complex objectives, ground responses in intermediate tool outputs, and orchestrate cross-domain workflows - capabilities not adequately evaluated by existing benchmarks that rely on explicit tool specifications, shallow few-step workflows, and isolated domain operations. We propose a multi-faceted evaluation framework covering tool-level schema understanding and usage, trajectory-level planning, and task completion. Experiments on 20 advanced LLMs reveal persistent challenges in MCP-Bench. Code and data: this https URL.
摘要：我们介绍了MCP-Bench，这是一种用于评估实际，多步骤任务的大型语言模型（LLM）的基准，这些任务需要工具使用，跨工具协调，精确的参数控制以及解决任务的计划/推理。 MCP Bench建立在模型上下文协议（MCP）的基础上，将LLMS连接到28个代表性的LIVE MCP服务器，这些服务器跨越了跨领域的250个工具，例如财务，旅行，科学计算和学术搜索。与先前的基于API的基准分析不同，每个MCP服务器都提供了一组辅助工具，旨在共同使用，从而构建具有丰富输入输出耦合的真实多步任务。 MCP Bench测试剂中的任务能够从没有明确工具名称的模糊指令中检索相关工具，计划多跳执行轨迹，以实现复杂目标，中间工具输出中的地面响应以及协调跨域工作流程 - 不充分依赖于依赖explicit工具的现有基准进行了充分评估的跨域工具，并依赖于explicit的工具的隔离工具。我们提出了一个多方面的评估框架，涵盖了工具级架构的理解和使用，轨迹级别的计划和任务完成。在20个高级LLMS上进行的实验表明，MCP基础的持续挑战。代码和数据：此HTTPS URL。

Title: Prediction of mortality and resource utilization in critical care: a deep learning approach using multimodal electronic health records with natural language processing techniques

Authors: Yucheng Ruan, Xiang Lan, Daniel J. Tan, Hairil Rizal Abdullah, Mengling Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.20460
Pdf URL: https://arxiv.org/pdf/2508.20460
Copy Paste: [[2508.20460]] Prediction of mortality and resource utilization in critical care: a deep learning approach using multimodal electronic health records with natural language processing techniques(https://arxiv.org/abs/2508.20460)
Keywords: prompt
Abstract: Background: Predicting mortality and resource utilization from electronic health records (EHRs) is challenging yet crucial for optimizing patient outcomes and managing costs in intensive care unit (ICU). Existing approaches predominantly focus on structured EHRs, often ignoring the valuable clinical insights in free-text notes. Additionally, the potential of textual information within structured data is not fully leveraged. This study aimed to introduce and assess a deep learning framework using natural language processing techniques that integrates multimodal EHRs to predict mortality and resource utilization in critical care settings. Methods: Utilizing two real-world EHR datasets, we developed and evaluated our model on three clinical tasks with leading existing methods. We also performed an ablation study on three key components in our framework: medical prompts, free-texts, and pre-trained sentence encoder. Furthermore, we assessed the model's robustness against the corruption in structured EHRs. Results: Our experiments on two real-world datasets across three clinical tasks showed that our proposed model improved performance metrics by 1.6%/0.8% on BACC/AUROC for mortality prediction, 0.5%/2.2% on RMSE/MAE for LOS prediction, 10.9%/11.0% on RMSE/MAE for surgical duration estimation compared to the best existing methods. It consistently demonstrated superior performance compared to other baselines across three tasks at different corruption rates. Conclusions: The proposed framework is an effective and accurate deep learning approach for predicting mortality and resource utilization in critical care. The study also highlights the success of using prompt learning with a transformer encoder in analyzing multimodal EHRs. Importantly, the model showed strong resilience to data corruption within structured data, especially at high corruption levels.
摘要：通过电子健康记录（EHR）预测死亡率和资源利用率的背景对于优化患者预后和管理重症监护室（ICU）的成本至关重要。现有方法主要集中在结构化的EHR上，通常忽略自由文本注释中宝贵的临床见解。此外，结构化数据中文本信息的潜力并未完全利用。这项研究旨在使用自然语言处理技术介绍和评估深度学习框架，该技术将多模式EHR整合在一起，以预测重症监护环境中的死亡率和资源利用。利用两个现实世界EHR数据集的方法，我们开发了并评估了具有领先现有方法的三个临床任务的模型。我们还对框架中的三个关键组成部分进行了消融研究：医疗提示，自由文本和预训练的句子编码器。此外，我们评估了该模型对结构化EHR腐败的鲁棒性。结果我们在三个临床任务上进行的两个现实世界数据集进行了我们的实验表明，我们提出的模型在死亡率预测的BACC/AUROC上提高了1.6 \％/0.8 \％，而LOS预测的RMSE/MAE为0.5％/2.2％，用于SUGSE/MAE for SUGITAL PURITATION pURATIOTIAD fortation for sugitiation for sugitation for的方法。与以不同的腐败率不同的三个任务中的其他基线相比，它始终表现出卓越的性能。结论提出的框架是一种有效而准确的深度学习方法，用于预测重症监护中的死亡率和资源利用。该研究还强调了在分析多模式EHR时使用迅速学习的成功。重要的是，该模型在结构化数据中，尤其是在高腐败水平上表现出对数据腐败的强大弹性。

Title: ConspirED: A Dataset for Cognitive Traits of Conspiracy Theories and Large Language Model Safety

Authors: Luke Bates, Max Glockner, Preslav Nakov, Iryna Gurevych
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.20468
Pdf URL: https://arxiv.org/pdf/2508.20468
Copy Paste: [[2508.20468]] ConspirED: A Dataset for Cognitive Traits of Conspiracy Theories and Large Language Model Safety(https://arxiv.org/abs/2508.20468)
Keywords: language model, llm
Abstract: Conspiracy theories erode public trust in science and institutions while resisting debunking by evolving and absorbing counter-evidence. As AI-generated misinformation becomes increasingly sophisticated, understanding rhetorical patterns in conspiratorial content is important for developing interventions such as targeted prebunking and assessing AI vulnerabilities. We introduce ConspirED (CONSPIR Evaluation Dataset), which captures the cognitive traits of conspiratorial ideation in multi-sentence excerpts (80-120 words) from online conspiracy articles, annotated using the CONSPIR cognitive framework (Lewandowsky and Cook, 2020). ConspirED is the first dataset of conspiratorial content annotated for general cognitive traits. Using ConspirED, we (i) develop computational models that identify conspiratorial traits and determine dominant traits in text excerpts, and (ii) evaluate large language/reasoning model (LLM/LRM) robustness to conspiratorial inputs. We find that both are misaligned by conspiratorial content, producing output that mirrors input reasoning patterns, even when successfully deflecting comparable fact-checked misinformation.
摘要：阴谋论侵蚀公众对科学和机构的信任，同时通过不断演变和吸收反面证据来抵抗辟谣。随着AI生成的错误信息日益复杂，理解阴谋论内容中的修辞模式对于开发有针对性的预先辟谣（prebunking）等干预措施以及评估AI的脆弱性十分重要。我们提出了ConspirED（CONSPIR Evaluation Dataset），它刻画了在线阴谋论文章的多句摘录（80-120词）中阴谋论思维的认知特征，并使用CONSPIR认知框架（Lewandowsky and Cook, 2020）进行标注。ConspirED是第一个针对一般认知特征进行标注的阴谋论内容数据集。利用ConspirED，我们（i）开发了计算模型，用于识别阴谋论特征并确定文本摘录中的主导特征；（ii）评估大语言/推理模型（LLM/LRM）对阴谋论输入的鲁棒性。我们发现，两者都会被阴谋论内容带偏，产生与输入推理模式相仿的输出，即使它们能够成功抵御类似的、经过事实核查的错误信息。

Title: SciTopic: Enhancing Topic Discovery in Scientific Literature through Advanced LLM

Authors: Pengjiang Li, Zaitian Wang, Xinhao Zhang, Ran Zhang, Lu Jiang, Pengfei Wang, Yuanchun Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.20514
Pdf URL: https://arxiv.org/pdf/2508.20514
Copy Paste: [[2508.20514]] SciTopic: Enhancing Topic Discovery in Scientific Literature through Advanced LLM(https://arxiv.org/abs/2508.20514)
Keywords: language model, llm
Abstract: Topic discovery in scientific literature provides valuable insights for researchers to identify emerging trends and explore new avenues for investigation, facilitating easier scientific information retrieval. Many machine learning methods, particularly deep embedding techniques, have been applied to discover research topics. However, most existing topic discovery methods rely on word embedding to capture the semantics and lack a comprehensive understanding of scientific publications, struggling with complex, high-dimensional text relationships. Inspired by the exceptional comprehension of textual information by large language models (LLMs), we propose an advanced topic discovery method enhanced by LLMs to improve scientific topic identification, namely SciTopic. Specifically, we first build a textual encoder to capture the content from scientific publications, including metadata, title, and abstract. Next, we construct a space optimization module that integrates entropy-based sampling and triplet tasks guided by LLMs, enhancing the focus on thematic relevance and contextual intricacies between ambiguous instances. Then, we propose to fine-tune the textual encoder based on the guidance from the LLMs by optimizing the contrastive loss of the triplets, forcing the text encoder to better discriminate instances of different topics. Finally, extensive experiments conducted on three real-world datasets of scientific publications demonstrate that SciTopic outperforms the state-of-the-art (SOTA) scientific topic discovery methods, enabling researchers to gain deeper and faster insights.
摘要：科学文献中的主题发现为研究人员提供了宝贵的见解，以识别新兴趋势并探索新的调查途径，从而促进更轻松的科学信息检索。许多机器学习方法，尤其是深层嵌入技术，已应用于发现研究主题。但是，大多数现有的主题发现方法都依赖于嵌入单词来捕获语义，并且缺乏对科学出版物的全面理解，而在复杂的，高维的文本关系中挣扎。受到大语模型（LLMS）对文本信息的特殊理解的启发，我们提出了LLMS增强的高级主题发现方法，以改善科学主题识别，即Scitopic。具体来说，我们首先构建了一个文本编码器，以从科学出版物中捕获内容，包括元数据，标题和摘要。接下来，我们构建了一个空间优化模块，该模块集成了LLMS指导的基于熵的采样和三胞胎任务，从而增强了对歧义实例之间主题相关性和上下文复杂性的关注。然后，我们建议通过优化三胞胎的对比度损失，根据LLM的指导微调文本编码器，迫使文本编码器更好地区分不同主题的实例。最后，在三个现实世界中的科学出版物数据集上进行的广泛实验表明，Scitopic优于最先进的科学主题发现方法，使研究人员能够获得更深入，更快的见解。
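
The SciTopic abstract above mentions fine-tuning the encoder by optimizing a contrastive loss over triplets. A minimal sketch of one common triplet formulation follows. The margin-based form, the cosine distance, and the toy 2-D vectors are all assumptions for illustration; the abstract does not specify the exact loss the authors use.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity between two embedding vectors."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Margin-based triplet loss (assumed form, not the paper's exact loss).

    The loss is zero once the positive is closer to the anchor than the
    negative by at least `margin`; otherwise it grows linearly.
    """
    return max(0.0, cosine_distance(anchor, positive)
               - cosine_distance(anchor, negative) + margin)

# Hypothetical 2-D "document embeddings".
anchor = np.array([1.0, 0.0])
same_topic = np.array([0.9, 0.1])
other_topic = np.array([0.0, 1.0])

well_separated = triplet_loss(anchor, same_topic, other_topic)
confused = triplet_loss(anchor, other_topic, same_topic)
```

In this toy case the correctly ordered triplet incurs no loss, while swapping the positive and negative produces a large one, which is the signal that would push an encoder to separate topics.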

Title: Adaptive Federated Distillation for Multi-Domain Non-IID Textual Data

Authors: Jiahao Xiao, Jiangming Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.20557
Pdf URL: https://arxiv.org/pdf/2508.20557
Copy Paste: [[2508.20557]] Adaptive Federated Distillation for Multi-Domain Non-IID Textual Data(https://arxiv.org/abs/2508.20557)
Keywords: language model
Abstract: The widespread success of pre-trained language models has established a new training paradigm, where a global PLM is fine-tuned using task-specific data from local clients. The local data differ substantially from one another and cannot capture the global distribution of the whole data in the real world. To address the challenges of non-IID data in real environments, privacy-preserving federated distillation has been proposed and extensively investigated. However, previous experimental non-IID scenarios are primarily defined by label (output) diversity, without considering the diversity of language domains (input), which is crucial in natural language processing. In this paper, we introduce a comprehensive set of multi-domain non-IID scenarios and propose a unified benchmarking framework that includes diverse data. The benchmark can be used to evaluate federated learning frameworks in a real environment. To this end, we propose an Adaptive Federated Distillation (AdaFD) framework designed to address multi-domain non-IID challenges in both homogeneous and heterogeneous settings. Experimental results demonstrate that our models capture the diversity of local clients and achieve better performance than existing work. The code for this paper is available at: this https URL.
摘要：预先训练的语言模型的广泛成功已经建立了一个新的培训范式，其中使用本地客户的特定于任务数据对全球PLM进行了微调。本地数据彼此高度不同，无法捕获现实世界中整个数据的全局分布。为了应对实际环境中非IID数据的挑战，已经提出了保护隐私的联合蒸馏，并进行了高度研究。但是，以前的实验非IID场景主要用标签（输出）多样性来识别，而无需考虑语言域（输入）在自然语言处理中至关重要的多样性。在本文中，我们介绍了一组全面的多域非IID场景，并提出了一个包括不同数据的统一基准测试框架。基准可用于在真实环境中评估联合学习框架。为此，我们提出了一个自适应联合蒸馏（ADAFD）框架，旨在解决均质和异质环境中多域非IID挑战。实验结果表明，与现有作品相比，我们的模型捕获了本地客户的多样性并取得更好的性能。本文的代码可在以下网址提供：此HTTPS URL。

Title: KCS: Diversify Multi-hop Question Generation with Knowledge Composition Sampling

Authors: Yangfan Wang, Jie Liu, Chen Tang, Lian Yan, Jingchi Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.20567
Pdf URL: https://arxiv.org/pdf/2508.20567
Copy Paste: [[2508.20567]] KCS: Diversify Multi-hop Question Generation with Knowledge Composition Sampling(https://arxiv.org/abs/2508.20567)
Keywords: language model
Abstract: Multi-hop question answering faces substantial challenges due to data sparsity, which increases the likelihood of language models learning spurious patterns. To address this issue, prior research has focused on diversifying question generation through content planning and varied expression. However, these approaches often emphasize generating simple questions and neglect the integration of essential knowledge, such as relevant sentences within documents. This paper introduces the Knowledge Composition Sampling (KCS), an innovative framework designed to expand the diversity of generated multi-hop questions by sampling varied knowledge compositions within a given context. KCS models the knowledge composition selection as a sentence-level conditional prediction task and utilizes a probabilistic contrastive loss to predict the next most relevant piece of knowledge. During inference, we employ a stochastic decoding strategy to effectively balance accuracy and diversity. Compared to competitive baselines, our KCS improves the overall accuracy of knowledge composition selection by 3.9%, and its application for data augmentation yields improvements on HotpotQA and 2WikiMultihopQA datasets. Our code is available at: this https URL.
摘要：由于数据稀疏性，多跳的问题回答面临重大挑战，这增加了语言模型学习虚假模式的可能性。为了解决这个问题，先前的研究重点是通过内容计划和各种表达来多样化问题的产生。但是，这些方法通常强调产生简单的问题并忽略基本知识的整合，例如文档中的相关句子。本文介绍了知识构图采样（KCS），这是一个创新的框架，旨在通过在给定上下文中抽样各种知识组成来扩大生成的多跳问题的多样性。 KCS将知识组成选择模型为句子级的条件预测任务，并利用概率对比损失来预测下一个最相关的知识。在推论期间，我们采用随机解码策略来有效平衡准确性和多样性。与竞争性基线相比，我们的KCS将知识组成选择的总体准确性提高了3.9％，其数据增强的应用可改善HOTPOTQA和2WIKIMULTIHOPQA数据集。我们的代码可用：此HTTPS URL。

Title: A Graph Talks, But Who's Listening? Rethinking Evaluations for Graph-Language Models

Authors: Soham Petkar, Hari Aakash K, Anirudh Vempati, Akshit Sinha, Ponnurangam Kumarauguru, Chirag Agarwal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.20583
Pdf URL: https://arxiv.org/pdf/2508.20583
Copy Paste: [[2508.20583]] A Graph Talks, But Who's Listening? Rethinking Evaluations for Graph-Language Models(https://arxiv.org/abs/2508.20583)
Keywords: language model, llm, prompt
Abstract: Developments in Graph-Language Models (GLMs) aim to integrate the structural reasoning capabilities of Graph Neural Networks (GNNs) with the semantic understanding of Large Language Models (LLMs). However, we demonstrate that current evaluation benchmarks for GLMs, which are primarily repurposed node-level classification datasets, are insufficient to assess multimodal reasoning. Our analysis reveals that strong performance on these benchmarks is achievable using unimodal information alone, suggesting that they do not necessitate graph-language integration. To address this evaluation gap, we introduce the CLEGR (Compositional Language-Graph Reasoning) benchmark, designed to evaluate multimodal reasoning at various complexity levels. Our benchmark employs a synthetic graph generation pipeline paired with questions that require joint reasoning over structure and textual semantics. We perform a thorough evaluation of representative GLM architectures and find that soft-prompted LLM baselines perform on par with GLMs that incorporate a full GNN backbone. This result calls into question the architectural necessity of incorporating graph structure into LLMs. We further show that GLMs exhibit significant performance degradation in tasks that require structural reasoning. These findings highlight limitations in the graph reasoning capabilities of current GLMs and provide a foundation for advancing the community toward explicit multimodal reasoning involving graph structure and language.
摘要：图-语言模型（GLM）的发展旨在将图神经网络（GNN）的结构推理能力与大语言模型（LLM）的语义理解相结合。然而，我们证明，当前的GLM评估基准主要是改作他用的节点级分类数据集，不足以评估多模态推理。我们的分析表明，仅使用单模态信息就可以在这些基准上取得很强的性能，这表明它们并不要求图与语言的融合。为了弥补这一评估缺口，我们提出了CLEGR（Compositional Language-Graph Reasoning）基准，旨在评估不同复杂度水平的多模态推理。我们的基准采用合成图生成流程，并配以需要对结构和文本语义进行联合推理的问题。我们对有代表性的GLM架构进行了全面评估，发现软提示的LLM基线与包含完整GNN主干的GLM表现相当。这一结果使人质疑将图结构纳入LLM在架构上的必要性。我们进一步表明，GLM在需要结构推理的任务中表现出明显的性能下降。这些发现凸显了当前GLM在图推理能力上的局限，并为推动社区走向涉及图结构和语言的显式多模态推理奠定了基础。

Title: Multi-Lingual Implicit Discourse Relation Recognition with Multi-Label Hierarchical Learning

Authors: Nelson Filipe Costa, Leila Kosseim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.20712
Pdf URL: https://arxiv.org/pdf/2508.20712
Copy Paste: [[2508.20712]] Multi-Lingual Implicit Discourse Relation Recognition with Multi-Label Hierarchical Learning(https://arxiv.org/abs/2508.20712)
Keywords: gpt, llm, prompt
Abstract: This paper introduces the first multi-lingual and multi-label classification model for implicit discourse relation recognition (IDRR). Our model, HArch, is evaluated on the recently released DiscoGeM 2.0 corpus and leverages hierarchical dependencies between discourse senses to predict probability distributions across all three sense levels in the PDTB 3.0 framework. We compare several pre-trained encoder backbones and find that RoBERTa-HArch achieves the best performance in English, while XLM-RoBERTa-HArch performs best in the multi-lingual setting. In addition, we compare our fine-tuned models against GPT-4o and Llama-4-Maverick using few-shot prompting across all language configurations. Our results show that our fine-tuned models consistently outperform these LLMs, highlighting the advantages of task-specific fine-tuning over prompting in IDRR. Finally, we report SOTA results on the DiscoGeM 1.0 corpus, further validating the effectiveness of our hierarchical approach.
摘要：本文介绍了第一个用于隐式话语关系识别（IDRR）的多语言和多标签分类模型。我们的模型Harch在最近发布的Discogem 2.0语料库中进行了评估，并利用话语感官之间的层次依赖性来预测PDTB 3.0框架中所有三个感觉级别的概率分布。我们比较了几个预先训练的编码器骨架，发现罗伯塔·霍克（Roberta-Harch）在英语方面取得了最好的表现，而XLM-Roberta-Harch在多语言环境中表现最好。此外，我们使用所有语言配置中的少量提示将微调模型与GPT-4O和Llama-4-Maverick进行了比较。我们的结果表明，我们的微调模型始终优于这些LLM，突出了特定于任务的微调比在IDRR中提示的优势。最后，我们报告了Discogem 1.0语料库的SOTA结果，进一步验证了我们的分层方法的有效性。

Title: Addressing Tokenization Inconsistency in Steganography and Watermarking Based on Large Language Models

Authors: Ruiyi Yan, Yugo Murawaki
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.20718
Pdf URL: https://arxiv.org/pdf/2508.20718
Copy Paste: [[2508.20718]] Addressing Tokenization Inconsistency in Steganography and Watermarking Based on Large Language Models(https://arxiv.org/abs/2508.20718)
Keywords: language model
Abstract: Large language models have significantly enhanced the capacities and efficiency of text generation. On the one hand, they have improved the quality of text-based steganography. On the other hand, they have also underscored the importance of watermarking as a safeguard against malicious misuse. In this study, we focus on tokenization inconsistency (TI) between Alice and Bob in steganography and watermarking, where TI can undermine robustness. Our investigation reveals that the problematic tokens responsible for TI exhibit two key characteristics: infrequency and temporariness. Based on these findings, we propose two tailored solutions for TI elimination: a stepwise verification method for steganography and a post-hoc rollback method for watermarking. Experiments show that (1) compared to traditional disambiguation methods in steganography, directly addressing TI leads to improvements in fluency, imperceptibility, and anti-steganalysis capacity; (2) for watermarking, addressing TI enhances detectability and robustness against attacks.
摘要：大型语言模型显着提高了文本生成的能力和效率。一方面，他们提高了基于文本的隐肌的质量。另一方面，他们还强调了水印作为防御恶意滥用的保护的重要性。在这项研究中，我们将重点放在爱丽丝和鲍勃之间的代币化不一致（TI）中，在密集术和水印中，TI可以破坏稳健性。我们的调查表明，负责TI的有问题的令牌具有两个关键特征：频率和暂时性。基于这些发现，我们提出了两种用于消除TI的量身定制的解决方案：一种逐步验证方法，用于隐身术和一种用于水印后的回滚方法。实验表明，（1）与传统的歧义方法相比，直接解决TI的方法可以提高流利性，不可信性和抗稳定能力；（2）用于水印，解决TI可增强对攻击的可检测性和鲁棒性。

Title: rStar2-Agent: Agentic Reasoning Technical Report

Authors: Ning Shang, Yifei Liu, Yi Zhu, Li Lyna Zhang, Weijiang Xu, Xinyu Guan, Buze Zhang, Bingcheng Dong, Xudong Zhou, Bowen Zhang, Ying Xin, Ziming Miao, Scarlett Li, Fan Yang, Mao Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.20722
Pdf URL: https://arxiv.org/pdf/2508.20722
Copy Paste: [[2508.20722]] rStar2-Agent: Agentic Reasoning Technical Report(https://arxiv.org/abs/2508.20722)
Keywords: agent
Abstract: We introduce rStar2-Agent, a 14B math reasoning model trained with agentic reinforcement learning to achieve frontier-level performance. Beyond current long CoT, the model demonstrates advanced cognitive behaviors, such as thinking carefully before using Python coding tools and reflecting on code execution feedback to autonomously explore, verify, and refine intermediate steps in complex problem-solving. This capability is enabled through three key innovations that make agentic RL effective at scale: (i) an efficient RL infrastructure with a reliable Python code environment that supports high-throughput execution and mitigates the high rollout costs, enabling training on limited GPU resources (64 MI300X GPUs); (ii) GRPO-RoC, an agentic RL algorithm with a Resample-on-Correct rollout strategy that addresses the inherent environment noises from coding tools, allowing the model to reason more effectively in a code environment; (iii) an efficient agent training recipe that starts with non-reasoning SFT and progresses through multi-RL stages, yielding advanced cognitive abilities with minimal compute cost. To this end, rStar2-Agent boosts a pre-trained 14B model to state of the art in only 510 RL steps within one week, achieving average pass@1 scores of 80.6% on AIME24 and 69.8% on AIME25, surpassing DeepSeek-R1 (671B) with significantly shorter responses. Beyond mathematics, rStar2-Agent-14B also demonstrates strong generalization to alignment, scientific reasoning, and agentic tool-use tasks. Code and training recipes are available at this https URL.
摘要：我们介绍了rStar2-Agent，这是一个14B数学推理模型，通过智能体强化学习训练，达到了前沿水平的性能。除了当前的长思维链（CoT）之外，该模型还展示了高级认知行为，例如在使用Python编码工具之前仔细思考，并根据代码执行反馈进行反思，以自主探索、验证和完善复杂问题求解中的中间步骤。这一能力得益于三项关键创新，它们使智能体强化学习能够有效地规模化：（i）高效的RL基础设施，配有可靠的Python代码环境，支持高吞吐量执行并降低高昂的rollout成本，从而能够在有限的GPU资源（64块MI300X GPU）上进行训练；（ii）GRPO-RoC，一种采用Resample-on-Correct rollout策略的智能体RL算法，可应对编码工具固有的环境噪声，使模型能够在代码环境中更有效地推理；（iii）高效的智能体训练方案，从非推理SFT开始，经过多个RL阶段，以极低的计算成本获得高级认知能力。由此，rStar2-Agent在一周内仅用510个RL步骤就将一个预训练的14B模型提升到最先进水平，在AIME24和AIME25上的平均pass@1得分分别达到80.6%和69.8%，以明显更短的回答超越了DeepSeek-R1（671B）。除数学之外，rStar2-Agent-14B还在对齐、科学推理和智能体工具使用任务上表现出很强的泛化能力。代码和训练方案可在此https URL获取。

Title: Leveraging Semantic Triples for Private Document Generation with Local Differential Privacy Guarantees

Authors: Stephen Meisenbacher, Maulik Chevli, Florian Matthes
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.20736
Pdf URL: https://arxiv.org/pdf/2508.20736
Copy Paste: [[2508.20736]] Leveraging Semantic Triples for Private Document Generation with Local Differential Privacy Guarantees(https://arxiv.org/abs/2508.20736)
Keywords: llm
Abstract: Many works at the intersection of Differential Privacy (DP) and Natural Language Processing aim to protect privacy by transforming texts under DP guarantees. This can be performed in a variety of ways, from word perturbations to full document rewriting, and most often under local DP. Here, an input text must be made indistinguishable from any other potential text, within some bound governed by the privacy parameter $\varepsilon$. Such a guarantee is quite demanding, and recent works show that privatizing texts under local DP can only be done reasonably under very high $\varepsilon$ values. Addressing this challenge, we introduce DP-ST, which leverages semantic triples for neighborhood-aware private document generation under local DP guarantees. Through the evaluation of our method, we demonstrate the effectiveness of the divide-and-conquer paradigm, particularly when limiting the DP notion (and privacy guarantees) to that of a privatization neighborhood. When combined with LLM post-processing, our method allows for coherent text generation even at lower $\varepsilon$ values, while still balancing privacy and utility. These findings highlight the importance of coherence in achieving balanced privatization outputs at reasonable $\varepsilon$ levels.
摘要：差分隐私（DP）与自然语言处理交叉领域的许多工作旨在通过在DP保证下转换文本来保护隐私。这可以通过多种方式实现，从词级扰动到整篇文档重写，且最常在本地DP下进行。此时，输入文本必须在由隐私参数$\varepsilon$控制的某个界限内，与任何其他可能的文本不可区分。这样的保证要求很高，近期工作表明，本地DP下的文本隐私化只有在非常高的$\varepsilon$值下才能得到合理的结果。为应对这一挑战，我们提出了DP-ST，它利用语义三元组，在本地DP保证下进行邻域感知的隐私文档生成。通过对方法的评估，我们证明了分而治之范式的有效性，尤其是在将DP概念（及隐私保证）限定于隐私化邻域时。与LLM后处理相结合时，我们的方法即使在较低的$\varepsilon$值下也能生成连贯的文本，同时仍能平衡隐私与效用。这些发现凸显了连贯性对于在合理的$\varepsilon$水平下获得均衡的隐私化输出的重要性。
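
For readers unfamiliar with the guarantee referred to in the DP-ST abstract above, the standard definition of $\varepsilon$-local differential privacy can be written as follows. The symbols $\mathcal{M}$ (the randomized privatization mechanism), $\mathcal{X}$ (the input space), and $\mathcal{Y}$ (the output space) are notation introduced here for illustration and do not come from the abstract.

```latex
% A mechanism M satisfies epsilon-local DP if, for every pair of inputs
% x, x' and every possible output y:
\[
  \Pr[\mathcal{M}(x) = y] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(x') = y]
  \qquad \forall\, x, x' \in \mathcal{X},\ \forall\, y \in \mathcal{Y}.
\]
```

Because the bound must hold for every pair of inputs, small $\varepsilon$ forces outputs to be nearly independent of the input text. This is one way to read the abstract's remark that the guarantee is demanding, and it suggests why restricting the comparison to a privatization neighborhood could make lower $\varepsilon$ values workable.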

Title: Specializing General-purpose LLM Embeddings for Implicit Hate Speech Detection across Datasets

Authors: Vassiliy Cheremetiev, Quang Long Ho Ngo, Chau Ying Kot, Alina Elena Baia, Andrea Cavallaro
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.20750
Pdf URL: https://arxiv.org/pdf/2508.20750
Copy Paste: [[2508.20750]] Specializing General-purpose LLM Embeddings for Implicit Hate Speech Detection across Datasets(https://arxiv.org/abs/2508.20750)
Keywords: language model, llm
Abstract: Implicit hate speech (IHS) is indirect language that conveys prejudice or hatred through subtle cues, sarcasm or coded terminology. IHS is challenging to detect because it does not include explicit derogatory or inflammatory words. To address this challenge, task-specific pipelines can be complemented with external knowledge or additional information such as context, emotions and sentiment data. In this paper, we show that, solely by fine-tuning recent general-purpose embedding models based on large language models (LLMs), such as Stella, Jasper, NV-Embed and E5, we achieve state-of-the-art performance. Experiments on multiple IHS datasets show improvements of up to 1.10 percentage points in in-dataset evaluation and up to 20.35 percentage points in cross-dataset evaluation, in terms of F1-macro score.
摘要：隐性仇恨言论（IHS）是一种间接语言，通过微妙的暗示、讽刺或隐语传达偏见或仇恨。由于不包含明确的贬损性或煽动性词语，IHS很难检测。为了应对这一挑战，可以用外部知识或上下文、情绪和情感数据等附加信息来补充特定任务的流程。在本文中，我们表明，仅通过微调基于大语言模型（LLM）的最新通用嵌入模型，例如Stella、Jasper、NV-Embed和E5，就能实现最先进的性能。在多个IHS数据集上的实验表明，以F1-macro得分衡量，数据集内评估最多提升1.10个百分点，跨数据集评估最多提升20.35个百分点。

Title: GUARD: Glocal Uncertainty-Aware Robust Decoding for Effective and Efficient Open-Ended Text Generation

Authors: Yuanhao Ding, Esteban Garces Arias, Meimingwei Li, Julian Rodemann, Matthias Aßenmacher, Danlu Chen, Gaojuan Fan, Christian Heumann, Chongsheng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.20757
Pdf URL: https://arxiv.org/pdf/2508.20757
Copy Paste: [[2508.20757]] GUARD: Glocal Uncertainty-Aware Robust Decoding for Effective and Efficient Open-Ended Text Generation(https://arxiv.org/abs/2508.20757)
Keywords: llm
Abstract: Open-ended text generation faces a critical challenge: balancing coherence with diversity in LLM outputs. While contrastive search-based decoding strategies have emerged to address this trade-off, their practical utility is often limited by hyperparameter dependence and high computational costs. We introduce GUARD, a self-adaptive decoding method that effectively balances these competing objectives through a novel "Glocal" uncertainty-driven framework. GUARD combines global entropy estimates with local entropy deviations to integrate both long-term and short-term uncertainty signals. We demonstrate that our proposed global entropy formulation effectively mitigates abrupt variations in uncertainty, such as sudden overconfidence or high entropy spikes, and provides theoretical guarantees of unbiasedness and consistency. To reduce computational overhead, we incorporate a simple yet effective token-count-based penalty into GUARD. Experimental results demonstrate that GUARD achieves a good balance between text diversity and coherence, while exhibiting substantial improvements in generation speed. In a more nuanced comparison study across different dimensions of text quality, both human and LLM evaluators validated its remarkable performance. Our code is available at this https URL.
摘要：开放式文本生成面临着一个关键的挑战：在LLM输出中平衡连贯性与多样性。尽管已经出现了基于对比的基于搜索的解码策略来解决这一权衡，但其实际实用程序通常受到高参数依赖性和高计算成本的限制。我们介绍了一种自适应解码方法，该方法通过新颖的“ Glocal”不确定性驱动的框架有效地平衡了这些竞争目标。后卫将全球熵估计与局部熵偏差相结合，以整合长期和短期不确定性信号。我们证明，我们提出的全球熵配方有效地减轻了不确定性的突然变化，例如突然过度自信或高熵尖峰，并提供了无偏见和一致性的理论保证。为了减少计算开销，我们将简单但有效的基于令牌计数的罚款纳入警卫中。实验结果表明，后卫在文本多样性和连贯性之间取得了良好的平衡，同时在发电速度方面表现出很大的改善。在一项更细微的比较研究中，在文本质量的不同维度上，人类和LLM评估人员都验证了其出色的性能。我们的代码可在此HTTPS URL上找到。
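
The GUARD abstract above describes combining a global entropy estimate with local entropy deviations. The sketch below shows one simple way such signals could be computed from next-token distributions. Using a running mean as the global estimate is an assumption made here for illustration; the paper's actual formulation, and its token-count-based penalty, are not given in the abstract and are not reproduced.

```python
import numpy as np

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(-(p * np.log(p)).sum())

def glocal_signals(step_distributions):
    """Return (global_entropy, local_deviation) for each decoding step.

    global_entropy: running mean of entropies seen so far (assumed estimator).
    local_deviation: current entropy minus that running mean.
    """
    signals = []
    running_sum = 0.0
    for t, probs in enumerate(step_distributions, start=1):
        h = token_entropy(probs)
        running_sum += h
        global_h = running_sum / t
        signals.append((global_h, h - global_h))
    return signals

# Hypothetical two-step example over a 4-token vocabulary:
# a maximally uncertain step followed by a fully confident one.
steps = [[0.25, 0.25, 0.25, 0.25], [1.0, 0.0, 0.0, 0.0]]
signals = glocal_signals(steps)
```

A sharply negative local deviation, as in the second step here, is the kind of sudden overconfidence the abstract says the global estimate is meant to smooth out.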

Title: Feel the Difference? A Comparative Analysis of Emotional Arcs in Real and LLM-Generated CBT Sessions

Authors: Xiaoyi Wang, Jiwei Zhang, Guangtao Zhang, Honglei Guo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.20764
Pdf URL: https://arxiv.org/pdf/2508.20764
Copy Paste: [[2508.20764]] Feel the Difference? A Comparative Analysis of Emotional Arcs in Real and LLM-Generated CBT Sessions(https://arxiv.org/abs/2508.20764)
Keywords: language model, llm
Abstract: Synthetic therapy dialogues generated by large language models (LLMs) are increasingly used in mental health NLP to simulate counseling scenarios, train models, and supplement limited real-world data. However, it remains unclear whether these synthetic conversations capture the nuanced emotional dynamics of real therapy. In this work, we conduct the first comparative analysis of emotional arcs between real and LLM-generated Cognitive Behavioral Therapy dialogues. We adapt the Utterance Emotion Dynamics framework to analyze fine-grained affective trajectories across valence, arousal, and dominance dimensions. Our analysis spans both full dialogues and individual speaker roles (counselor and client), using real sessions transcribed from public videos and synthetic dialogues from the CACTUS dataset. We find that while synthetic dialogues are fluent and structurally coherent, they diverge from real conversations in key emotional properties: real sessions exhibit greater emotional variability, more emotion-laden language, and more authentic patterns of reactivity and regulation. Moreover, emotional arc similarity between real and synthetic speakers is low, especially for clients. These findings underscore the limitations of current LLM-generated therapy data and highlight the importance of emotional fidelity in mental health applications. We introduce RealCBT, a curated dataset of real CBT sessions, to support future research in this space.
摘要：大语模型（LLM）产生的合成疗法对话越来越多地用于心理健康NLP中，以模拟咨询场景，火车模型和补充有限的现实世界数据。但是，目前尚不清楚这些合成对话是否捕获了真实疗法的细微情绪动态。在这项工作中，我们对真实和LLM生成的认知行为疗法对话之间的情绪弧进行了首次比较分析。我们适应了话语情感动力框架，以分析跨价，唤醒和优势维度的细粒度的情感轨迹。我们的分析涵盖了完整的对话和单个演讲者角色（辅导员和客户），使用了从公共视频中转录的真实会议和仙人掌数据集中的合成对话。我们发现，虽然合成对话流利且在结构上是连贯的，但它们与关键情感特性中的真实对话不同：真实的会话表现出更大的情感可变性，更多的情感语言以及更真实的反应性和调节模式。此外，真实和合成扬声器之间的情感弧相似性很低，尤其是对于客户而言。这些发现强调了当前LLM生成的治疗数据的局限性，并强调了情绪忠诚度在心理健康应用中的重要性。我们介绍了真正的CBT会话的策划数据集Realcbt，以支持该领域的未来研究。

Title: Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection

Authors: Harethah Abu Shairah, Hasan Abed Al Kader Hammoud, George Turkiyyah, Bernard Ghanem
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.20766
Pdf URL: https://arxiv.org/pdf/2508.20766
Copy Paste: [[2508.20766]] Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection(https://arxiv.org/abs/2508.20766)
Keywords: language model, llm
Abstract: Safety alignment in Large Language Models (LLMs) often involves mediating internal representations to refuse harmful requests. Recent research has demonstrated that these safety mechanisms can be bypassed by ablating or removing specific representational directions within the model. In this paper, we propose the opposite approach: Rank-One Safety Injection (ROSI), a white-box method that amplifies a model's safety alignment by permanently steering its activations toward the refusal-mediating subspace. ROSI operates as a simple, fine-tuning-free rank-one weight modification applied to all residual stream write matrices. The required safety direction can be computed from a small set of harmful and harmless instruction pairs. We show that ROSI consistently increases safety refusal rates - as evaluated by Llama Guard 3 - while preserving the utility of the model on standard benchmarks such as MMLU, HellaSwag, and Arc. Furthermore, we show that ROSI can also re-align 'uncensored' models by amplifying their own latent safety directions, demonstrating its utility as an effective last-mile safety procedure. Our results suggest that targeted, interpretable weight steering is a cheap and potent mechanism to improve LLM safety, complementing more resource-intensive fine-tuning paradigms.
摘要：大语言模型（LLM）中的安全一致性通常涉及调解内部表示以拒绝有害要求。最近的研究表明，这些安全机制可以通过消融或删除模型中的特定代表性方向绕过。在本文中，我们提出了相反的方法：排名一度安全注射（ROSI），这是一种白盒方法，通过永久将其激活转向拒绝中间的子空间来放大模型的安全对齐。 Rosi用作简单，微调的排名一重修饰，应用于所有残留的流写矩阵。所需的安全方向可以从一小部分有害和无害的指令对中计算出来。我们表明，Rosi始终提高安全拒绝率 - 通过Llama Guard 3评估 - 同时保留了模型在MMLU，Hellaswag和Arc等标准基准上的效用。此外，我们表明Rosi还可以通过扩大自己的潜在安全方向来重新平衡“未经审查”模型，从而证明其作为有效的最后一英里安全程序。我们的结果表明，有针对性的，可解释的重量转向是一种廉价且有效的机制，可改善LLM安全性，并补充更多资源密集型的微调范式。
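
The ROSI abstract above describes a rank-one weight modification that steers activations toward a safety direction computed from harmful and harmless instructions. The NumPy sketch below illustrates the general idea under stated assumptions: the direction is taken as a normalized difference of mean activations, and the update has the form W + alpha * s (s^T W). Both choices, along with the function names and random toy data, are hypothetical and are not confirmed by the abstract.

```python
import numpy as np

def safety_direction(harmful_acts, harmless_acts):
    """Unit vector from mean harmless to mean harmful activation (assumed recipe)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def rank_one_inject(W, s, alpha):
    """Amplify the component of W's output along s.

    W has shape (d_model, d_in) and writes into the residual stream as W @ x.
    The added term alpha * outer(s, s @ W) is a rank-one matrix.
    """
    return W + alpha * np.outer(s, s @ W)

rng = np.random.default_rng(0)
d_model, d_in = 8, 5

# Hypothetical activations: harmful prompts are shifted along the first axis.
harmful = rng.normal(size=(16, d_model)) + 2.0 * np.eye(d_model)[0]
harmless = rng.normal(size=(16, d_model))

s = safety_direction(harmful, harmless)
W = rng.normal(size=(d_model, d_in))
W_new = rank_one_inject(W, s, alpha=0.5)

x = rng.normal(size=d_in)
before = float(s @ (W @ x))     # projection of the output onto s
after = float(s @ (W_new @ x))  # scaled by (1 + alpha) since s is a unit vector
```

Because s has unit norm, the output's projection onto s is scaled by exactly 1 + alpha while directions orthogonal to s are left unchanged, which is why a modification of this shape needs no fine-tuning.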

Title: Exploring Machine Learning and Language Models for Multimodal Depression Detection

Authors: Javier Si Zhao Hong, Timothy Zoe Delaya, Sherwyn Chan Yin Kit, Pai Chet Ng, Xiaoxiao Miao
Subjects: cs.CL, cs.AI, cs.SD
Abstract URL: https://arxiv.org/abs/2508.20805
Pdf URL: https://arxiv.org/pdf/2508.20805
Copy Paste: [[2508.20805]] Exploring Machine Learning and Language Models for Multimodal Depression Detection(https://arxiv.org/abs/2508.20805)
Keywords: language model, llm
Abstract: This paper presents our approach to the first Multimodal Personality-Aware Depression Detection Challenge, focusing on multimodal depression detection using machine learning and deep learning models. We explore and compare the performance of XGBoost, transformer-based architectures, and large language models (LLMs) on audio, video, and text features. Our results highlight the strengths and limitations of each type of model in capturing depression-related signals across modalities, offering insights into effective multimodal representation strategies for mental health prediction.
摘要：本文介绍了我们对第一个多模式性格感知抑郁症检测挑战的方法，重点是使用机器学习和深度学习模型进行多模式抑郁症检测。我们在音频，视频和文本功能上探索和比较XGBoost，基于变压器的架构以及大语言模型（LLMS）的性能。我们的结果突出了每种模型的优势和局限性在捕获跨模态的抑郁相关信号方面，为有效的心理健康预测策略提供了见解。

Title: GDLLM: A Global Distance-aware Modeling Approach Based on Large Language Models for Event Temporal Relation Extraction

Authors: Jie Zhao, Wanting Ning, Yuxiao Fei, Yubo Feng, Lishuang Li
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2508.20828
Pdf URL: https://arxiv.org/pdf/2508.20828
Copy Paste: [[2508.20828]] GDLLM: A Global Distance-aware Modeling Approach Based on Large Language Models for Event Temporal Relation Extraction(https://arxiv.org/abs/2508.20828)
Keywords: language model, llm, prompt
Abstract: In Natural Language Processing (NLP), Event Temporal Relation Extraction (ETRE) aims to recognize the temporal relations between two events. Prior studies have noted the importance of language models for ETRE. However, the restricted pre-trained knowledge of Small Language Models (SLMs) limits their capability to handle minority class relations in imbalanced classification datasets. For Large Language Models (LLMs), researchers adopt manually designed prompts or instructions, which may introduce extra noise, leading to interference with the model's judgment of the long-distance dependencies between events. To address these issues, we propose GDLLM, a Global Distance-aware modeling approach based on LLMs. We first present a distance-aware graph structure utilizing a Graph Attention Network (GAT) to assist the LLMs in capturing long-distance dependency features. Additionally, we design a temporal feature learning paradigm based on soft inference to augment the identification of relations with a short-distance proximity band, which supplements the probabilistic information generated by LLMs into the multi-head attention mechanism. Since the global feature can be captured effectively, our framework substantially enhances the performance of minority relation classes and improves the overall learning ability. Experiments on two publicly available datasets, TB-Dense and MATRES, demonstrate that our approach achieves state-of-the-art (SOTA) performance.
摘要：在自然语言处理（NLP）中，事件时间关系提取（ETRE）是认识两个事件的时间关系。先前的研究指出了语言模型对Etre的重要性。但是，受限制的小语言模型（SLM）知识限制了其处理不平衡分类数据集中少数群体关系的能力。对于大型语言模型（LLM），研究人员采用手动设计的提示或说明，这可能会引入额外的噪音，从而干扰了该模型对事件之间的长距离依赖性的判断。为了解决这些问题，我们建议GDLLM是一种基于LLM的全球距离感知建模方法。我们首先提出了使用图形注意网络（GAT）的距离感知图结构，以帮助LLMS捕获长距离依赖性特征。此外，我们设计了一种基于软推断的时间特征学习范式，以增强与短途接近频段的关系的识别，该范围将LLMS生成的概率信息补充到多头注意力机制中。由于可以有效地捕获全球功能，因此我们的框架大大提高了少数族裔关系类的表现并提高了整体学习能力。在两个公开可用数据集的实验TB密度和均值，表明我们的方法可以实现最先进的（SOTA）性能。

Title: MSRS: Evaluating Multi-Source Retrieval-Augmented Generation

Authors: Rohan Phanse, Yijie Zhou, Kejian Shi, Wencai Zhang, Yixin Liu, Yilun Zhao, Arman Cohan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.20867
Pdf URL: https://arxiv.org/pdf/2508.20867
Copy Paste: [[2508.20867]] MSRS: Evaluating Multi-Source Retrieval-Augmented Generation(https://arxiv.org/abs/2508.20867)
Keywords: llm, retrieval-augmented generation
Abstract: Retrieval-augmented systems are typically evaluated in settings where information required to answer the query can be found within a single source or the answer is short-form or factoid-based. However, many real-world applications demand the ability to integrate and summarize information scattered across multiple sources, where no single source is sufficient to respond to the user's question. In such settings, the retrieval component of a RAG pipeline must recognize a variety of relevance signals, and the generation component must connect and synthesize information across multiple sources. We present a scalable framework for constructing evaluation benchmarks that challenge RAG systems to integrate information across distinct sources and generate long-form responses. Using our framework, we build two new benchmarks on Multi-Source Retrieval and Synthesis: MSRS-Story and MSRS-Meet, representing narrative synthesis and summarization tasks, respectively, that require retrieval from large collections. Our extensive experiments with various RAG pipelines -- including sparse and dense retrievers combined with frontier LLMs -- reveal that generation quality is highly dependent on retrieval effectiveness, which varies greatly by task. While multi-source synthesis proves challenging even in an oracle retrieval setting, we find that reasoning models significantly outperform standard LLMs at this distinct step.

Title: The Uneven Impact of Post-Training Quantization in Machine Translation

Authors: Benjamin Marie, Atsushi Fujita
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.20893
Pdf URL: https://arxiv.org/pdf/2508.20893
Copy Paste: [[2508.20893]] The Uneven Impact of Post-Training Quantization in Machine Translation(https://arxiv.org/abs/2508.20893)
Keywords: language model, llm
Abstract: Quantization is essential for deploying large language models (LLMs) on resource-constrained hardware, but its implications for multilingual tasks remain underexplored. We conduct the first large-scale evaluation of post-training quantization (PTQ) on machine translation across 55 languages using five LLMs ranging from 1.7B to 70B parameters. Our analysis reveals that while 4-bit quantization often preserves translation quality for high-resource languages and large models, significant degradation occurs for low-resource and typologically diverse languages, particularly in 2-bit settings. We compare four quantization techniques (AWQ, BitsAndBytes, GGUF, and AutoRound), showing that algorithm choice and model size jointly determine robustness. GGUF variants provide the most consistent performance, even at 2-bit precision. Additionally, we quantify the interactions between quantization, decoding hyperparameters, and calibration languages, finding that language-matched calibration offers benefits primarily in low-bit scenarios. Our findings offer actionable insights for deploying multilingual LLMs for machine translation under quantization constraints, especially in low-resource settings.
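The gap between 4-bit and 2-bit settings reported above can be illustrated with a toy example. The sketch below uses plain symmetric round-to-nearest quantization, which is an assumption made for illustration only; it is not AWQ, BitsAndBytes, GGUF, or AutoRound, and the function names and weight values are hypothetical.

```python
def quantize_dequantize(weights, bits):
    """Symmetric round-to-nearest quantization, then dequantization.

    Illustrative only: real PTQ methods use per-group scales,
    calibration data, and other refinements.
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 1 for 2-bit (levels -1, 0, 1)
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    restored = []
    for w in weights:
        q = round(w / scale)             # map to the nearest integer level
        q = max(-qmax, min(qmax, q))     # clamp to the representable range
        restored.append(q * scale)       # map back to floating point
    return restored


def mean_abs_error(original, restored):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(original, restored)) / len(original)


# Hypothetical weight values, chosen only to show the trend.
weights = [0.9, -0.45, 0.3, 0.12, -0.07, 0.02, -0.8, 0.6]
err4 = mean_abs_error(weights, quantize_dequantize(weights, 4))
err2 = mean_abs_error(weights, quantize_dequantize(weights, 2))
print(f"4-bit error: {err4:.4f}, 2-bit error: {err2:.4f}")
```

On this toy input the 2-bit reconstruction error is several times larger than the 4-bit one, which matches the direction of the degradation the abstract describes, though not its magnitude on real models.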

Title: SageLM: A Multi-aspect and Explainable Large Language Model for Speech Judgement

Authors: Yuan Ge, Junxiang Zhang, Xiaoqian Liu, Bei Li, Xiangnan Ma, Chenglong Wang, Kaiyang Ye, Yangfan Du, Linfeng Zhang, Yuxin Huang, Tong Xiao, Zhengtao Yu, JingBo Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.20916
Pdf URL: https://arxiv.org/pdf/2508.20916
Copy Paste: [[2508.20916]] SageLM: A Multi-aspect and Explainable Large Language Model for Speech Judgement(https://arxiv.org/abs/2508.20916)
Keywords: language model, llm
Abstract: Speech-to-Speech (S2S) Large Language Models (LLMs) are foundational to natural human-computer interaction, enabling end-to-end spoken dialogue systems. However, evaluating these models remains a fundamental challenge. We propose SageLM, an end-to-end, multi-aspect, and explainable speech LLM for comprehensive evaluation of S2S LLMs. First, unlike cascaded approaches that disregard acoustic features, SageLM jointly assesses both semantic and acoustic dimensions. Second, it leverages rationale-based supervision to enhance explainability and guide model learning, achieving superior alignment with evaluation outcomes compared to rule-based reinforcement learning methods. Third, we introduce SpeechFeedback, a synthetic preference dataset, and employ a two-stage training paradigm to mitigate the scarcity of speech preference data. Trained on both semantic and acoustic dimensions, SageLM achieves an 82.79% agreement rate with human evaluators, outperforming cascaded and SLM-based baselines by at least 7.42% and 26.20%, respectively.

Title: How Can Input Reformulation Improve Tool Usage Accuracy in a Complex Dynamic Environment? A Study on $τ$-bench

Authors: Venkatesh Mishra, Amir Saeidi, Satyam Raj, Mutsumi Nakamura, Jayanth Srinivasa, Gaowen Liu, Ali Payani, Chitta Baral
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.20931
Pdf URL: https://arxiv.org/pdf/2508.20931
Copy Paste: [[2508.20931]] How Can Input Reformulation Improve Tool Usage Accuracy in a Complex Dynamic Environment? A Study on $τ$-bench(https://arxiv.org/abs/2508.20931)
Keywords: language model, llm, agent
Abstract: Recent advances in reasoning and planning capabilities of large language models (LLMs) have enabled their potential as autonomous agents capable of tool use in dynamic environments. However, in multi-turn conversational environments like τ-bench, these agents often struggle with consistent reasoning, adherence to domain-specific policies, and extracting correct information over a long horizon of tool-calls and conversation. To capture and mitigate these failures, we conduct a comprehensive manual analysis of the common errors occurring in the conversation trajectories. We then experiment with reformulations of inputs to the tool-calling agent for improvement in agent decision making. Finally, we propose the Input-Reformulation Multi-Agent (IRMA) framework, which automatically reformulates user queries augmented with relevant domain rules and tool suggestions for the tool-calling agent to focus on. The results show that IRMA significantly outperforms ReAct, Function Calling, and Self-Reflection by 16.1%, 12.7%, and 19.1%, respectively, in overall pass^5 scores. These findings highlight the superior reliability and consistency of IRMA compared to other methods in dynamic environments.
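The pass^5 score mentioned above is a reliability metric. The sketch below assumes the pass^k definition used by the benchmark, namely the chance that all k independent trials of a task succeed, estimated per task as C(c, k) / C(n, k) from c successes in n trials and then averaged over tasks. The function names and the per-task counts are hypothetical and are not taken from the paper.

```python
from math import comb


def pass_hat_k(num_trials, num_successes, k):
    """Estimate the probability that k trials drawn from the observed
    runs of one task are all successful: C(c, k) / C(n, k)."""
    if not 0 <= num_successes <= num_trials:
        raise ValueError("num_successes must lie in [0, num_trials]")
    if not 1 <= k <= num_trials:
        raise ValueError("k must lie in [1, num_trials]")
    return comb(num_successes, k) / comb(num_trials, k)


def benchmark_pass_hat_k(results, k):
    """Average the per-task estimate over (num_trials, num_successes) pairs."""
    return sum(pass_hat_k(n, c, k) for n, c in results) / len(results)


# Hypothetical results: three tasks, eight trials each.
results = [(8, 8), (8, 6), (8, 4)]
score = benchmark_pass_hat_k(results, 5)
print(f"pass^5 = {score:.4f}")
```

Because every one of the k trials must succeed, a task solved in 6 of 8 runs contributes only 6/56 to pass^5, so the metric penalizes inconsistent agents far more than an average success rate would.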

Title: STARE at the Structure: Steering ICL Exemplar Selection with Structural Alignment

Authors: Jiaqian Li, Qisheng Hu, Jing Li, Wenya Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.20944
Pdf URL: https://arxiv.org/pdf/2508.20944
Copy Paste: [[2508.20944]] STARE at the Structure: Steering ICL Exemplar Selection with Structural Alignment(https://arxiv.org/abs/2508.20944)
Keywords: llm
Abstract: In-Context Learning (ICL) has become a powerful paradigm that enables LLMs to perform a wide range of tasks without task-specific fine-tuning. However, the effectiveness of ICL heavily depends on the quality of exemplar selection. In particular, for structured prediction tasks such as semantic parsing, existing ICL selection strategies often overlook structural alignment, leading to suboptimal performance and poor generalization. To address this issue, we propose a novel two-stage exemplar selection strategy that achieves a strong balance between efficiency, generalizability, and performance. First, we fine-tune a BERT-based retriever using structure-aware supervision, guiding it to select exemplars that are both semantically relevant and structurally aligned. Then, we enhance the retriever with a plug-in module, which amplifies syntactically meaningful information in the hidden representations. This plug-in is model-agnostic, requires minimal overhead, and can be seamlessly integrated into existing pipelines. Experiments on four benchmarks spanning three semantic parsing tasks demonstrate that our method consistently outperforms existing baselines with multiple recent LLMs as inference-time models.

Title: ProactiveEval: A Unified Evaluation Framework for Proactive Dialogue Agents

Authors: Tianjian Liu, Fanqi Wan, Jiajian Guo, Xiaojun Quan
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2508.20973
Pdf URL: https://arxiv.org/pdf/2508.20973
Copy Paste: [[2508.20973]] ProactiveEval: A Unified Evaluation Framework for Proactive Dialogue Agents(https://arxiv.org/abs/2508.20973)
Keywords: language model, llm, agent
Abstract: Proactive dialogue has emerged as a critical and challenging research problem in advancing large language models (LLMs). Existing works predominantly focus on domain-specific or task-oriented scenarios, which leads to fragmented evaluations and limits the comprehensive exploration of models' proactive conversation abilities. In this work, we propose ProactiveEval, a unified framework designed for evaluating proactive dialogue capabilities of LLMs. This framework decomposes proactive dialogue into target planning and dialogue guidance, establishing evaluation metrics across various domains. Moreover, it also enables the automatic generation of diverse and challenging evaluation data. Based on the proposed framework, we develop 328 evaluation environments spanning 6 distinct domains. Through experiments with 22 different types of LLMs, we show that DeepSeek-R1 and Claude-3.7-Sonnet exhibit exceptional performance on target planning and dialogue guidance tasks, respectively. Finally, we investigate how reasoning capabilities influence proactive behaviors and discuss their implications for future model development.

Title: Lethe: Purifying Backdoored Large Language Models with Knowledge Dilution

Authors: Chen Chen, Yuchen Sun, Jiaxin Gao, Xueluan Gong, Qian Wang, Ziyao Wang, Yongsen Zheng, Kwok-Yan Lam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.21004
Pdf URL: https://arxiv.org/pdf/2508.21004
Copy Paste: [[2508.21004]] Lethe: Purifying Backdoored Large Language Models with Knowledge Dilution(https://arxiv.org/abs/2508.21004)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have seen significant advancements, achieving superior performance in various Natural Language Processing (NLP) tasks. However, they remain vulnerable to backdoor attacks, where models behave normally for standard queries but generate harmful responses or unintended output when specific triggers are activated. Existing backdoor defenses either lack comprehensiveness, focusing on narrow trigger settings, detection-only mechanisms, and limited domains, or fail to withstand advanced scenarios like model-editing-based, multi-trigger, and triggerless attacks. In this paper, we present LETHE, a novel method to eliminate backdoor behaviors from LLMs through knowledge dilution using both internal and external mechanisms. Internally, LETHE leverages a lightweight dataset to train a clean model, which is then merged with the backdoored model to neutralize malicious behaviors by diluting the backdoor impact within the model's parametric memory. Externally, LETHE incorporates benign and semantically relevant evidence into the prompt to distract the LLM's attention from backdoor features. Experimental results on classification and generation domains across 5 widely used LLMs demonstrate that LETHE outperforms 8 state-of-the-art defense baselines against 8 backdoor attacks. LETHE reduces the attack success rate of advanced backdoor attacks by up to 98% while maintaining model utility. Furthermore, LETHE has proven to be cost-efficient and robust against adaptive backdoor attacks.
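The internal mechanism above merges a clean model with the backdoored one. The abstract does not state the merging rule, so the sketch below assumes simple linear interpolation of parameters, one common way to merge models with identical architectures. The function name, the `alpha` parameter, and the toy weights are all hypothetical.

```python
def merge_weights(backdoored, clean, alpha):
    """Linearly interpolate two parameter dictionaries.

    alpha = 0.0 returns the backdoored weights unchanged and
    alpha = 1.0 returns the clean weights. This is an assumed
    merging rule, not necessarily the one used by LETHE.
    """
    if backdoored.keys() != clean.keys():
        raise ValueError("models must share the same parameter names")
    merged = {}
    for name in backdoored:
        b, c = backdoored[name], clean[name]
        if len(b) != len(c):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [(1 - alpha) * x + alpha * y for x, y in zip(b, c)]
    return merged


# Toy parameters standing in for real model tensors.
backdoored = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
clean = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}
merged = merge_weights(backdoored, clean, alpha=0.5)
print(merged)
```

Under this assumption, raising `alpha` pulls the merged parameters toward the clean model, which is one way to picture the "dilution" of backdoor-related weights.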

Title: An Agile Method for Implementing Retrieval Augmented Generation Tools in Industrial SMEs

Authors: Mathieu Bourdin, Anas Neumann, Thomas Paviot, Robert Pellerin, Samir Lamouri
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2508.21024
Pdf URL: https://arxiv.org/pdf/2508.21024
Copy Paste: [[2508.21024]] An Agile Method for Implementing Retrieval Augmented Generation Tools in Industrial SMEs(https://arxiv.org/abs/2508.21024)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful solution to mitigate the limitations of Large Language Models (LLMs), such as hallucinations and outdated knowledge. However, deploying RAG-based tools in Small and Medium Enterprises (SMEs) remains a challenge due to their limited resources and lack of expertise in natural language processing (NLP). This paper introduces EASI-RAG, Enterprise Application Support for Industrial RAG, a structured, agile method designed to facilitate the deployment of RAG systems in industrial SME contexts. EASI-RAG is based on method engineering principles and comprises well-defined roles, activities, and techniques. The method was validated through a real-world case study in an environmental testing laboratory, where a RAG tool was implemented to answer operators' queries using data extracted from operational procedures. The system was deployed in under a month by a team with no prior RAG experience and was later iteratively improved based on user feedback. Results demonstrate that EASI-RAG supports fast implementation and high user adoption, delivers accurate answers, and enhances the reliability of underlying data. This work highlights the potential of RAG deployment in industrial SMEs. Future work includes generalization across diverse use cases and further integration with fine-tuned models.

Title: Enabling Equitable Access to Trustworthy Financial Reasoning

Authors: William Jurayj, Nils Holzenberger, Benjamin Van Durme
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2508.21051
Pdf URL: https://arxiv.org/pdf/2508.21051
Copy Paste: [[2508.21051]] Enabling Equitable Access to Trustworthy Financial Reasoning(https://arxiv.org/abs/2508.21051)
Keywords: language model, llm
Abstract: According to the United States Internal Revenue Service, "the average American spends $270 and 13 hours filing their taxes". Even beyond the U.S., tax filing requires complex reasoning, combining application of overlapping rules with numerical calculations. Because errors can incur costly penalties, any automated system must deliver high accuracy and auditability, making modern large language models (LLMs) poorly suited for this task. We propose an approach that integrates LLMs with a symbolic solver to calculate tax obligations. We evaluate variants of this system on the challenging StAtutory Reasoning Assessment (SARA) dataset, and include a novel method for estimating the cost of deploying such a system based on real-world penalties for tax errors. We further show how up-front translation of plain-text rules into formal logic programs, combined with intelligently retrieved exemplars for formal case representations, can dramatically improve performance on this task and reduce costs to well below real-world averages. Our results demonstrate the promise and economic feasibility of neuro-symbolic architectures for increasing equitable access to reliable tax assistance.