2025-08-04

Title: PhysicsEval: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics Problems

Authors: Oshayer Siddique, J. M Areeb Uzair Alam, Md Jobayer Rahman Rafy, Syed Rifat Raiyan, Hasan Mahmud, Md Kamrul Hasan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.00079
Pdf URL: https://arxiv.org/pdf/2508.00079
Copy Paste: [[2508.00079]] PhysicsEval: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics Problems(https://arxiv.org/abs/2508.00079)
Keywords: language model, llm, agent
Abstract: The discipline of physics stands as a cornerstone of human intellect, driving the evolution of technology and deepening our understanding of the fundamental principles of the cosmos. Contemporary literature includes some works centered on the task of solving physics problems - a crucial domain of natural language reasoning. In this paper, we evaluate the performance of frontier LLMs in solving physics problems, both mathematical and descriptive. We also employ a plethora of inference-time techniques and agentic frameworks to improve the performance of the models. This includes the verification of proposed solutions in a cumulative fashion by other, smaller LLM agents, and we perform a comparative analysis of the performance that the techniques entail. There are significant improvements when the multi-agent framework is applied to problems that the models initially perform poorly on. Furthermore, we introduce a new evaluation benchmark for physics problems, PhysicsEval, consisting of 19,609 problems sourced from various physics textbooks and their corresponding correct solutions scraped from physics forums and educational websites. Our code and data are publicly available at this https URL.
Summary: Physics is a cornerstone of human intellect: it drives technological progress and deepens our understanding of the fundamental principles of the universe. Some recent work addresses physics problem solving, a key area of natural language reasoning. This paper evaluates how well frontier LLMs solve both mathematical and descriptive physics problems. It also applies a range of inference-time techniques and agentic frameworks to improve model performance, including cumulative verification of proposed solutions by other, smaller LLM agents, and compares the gains each technique brings. The multi-agent framework yields significant improvements on problems where the models initially perform poorly. The paper also introduces a new evaluation benchmark for physics problems, PhysicsEval, consisting of 19,609 problems drawn from various physics textbooks, with corresponding correct solutions scraped from physics forums and educational websites. The code and data are publicly available.

Title: Do LLMs produce texts with "human-like" lexical diversity?

Authors: Kelly Kendro, Jeffrey Maloney, Scott Jarvis
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.00086
Pdf URL: https://arxiv.org/pdf/2508.00086
Copy Paste: [[2508.00086]] Do LLMs produce texts with "human-like" lexical diversity?(https://arxiv.org/abs/2508.00086)
Keywords: gpt, llm, chat
Abstract: The degree to which LLMs produce writing that is truly human-like remains unclear despite the extensive empirical attention that this question has received. The present study addresses this question from the perspective of lexical diversity. Specifically, the study investigates patterns of lexical diversity in LLM-generated texts from four ChatGPT models (-3.5, -4, -o4 mini, and -4.5) in comparison with texts written by L1 and L2 English participants (n = 240) across four education levels. Six dimensions of lexical diversity were measured in each text: volume, abundance, variety-repetition, evenness, disparity, and dispersion. Results from one-way MANOVAs, one-way ANOVAs, and Support Vector Machines revealed that the LLM-generated texts differed significantly from human-written texts for each variable, with ChatGPT-o4 mini and -4.5 differing the most. Within these two groups, ChatGPT-4.5 demonstrated higher levels of lexical diversity despite producing fewer tokens. The human writers' lexical diversity did not differ across subgroups (i.e., education, language status). Altogether, the results indicate that LLMs do not produce human-like texts in relation to lexical diversity, and the newer LLMs produce less human-like texts than older models. We discuss the implications of these results for language pedagogy and related applications.
Summary: How closely LLM writing resembles human writing remains unclear, despite extensive empirical attention. This study approaches the question through lexical diversity. It examines patterns of lexical diversity in texts generated by four ChatGPT models (-3.5, -4, -o4 mini, and -4.5) and compares them with texts written by L1 and L2 English participants (n = 240) at four education levels. Six dimensions of lexical diversity were measured in each text: volume, abundance, variety-repetition, evenness, disparity, and dispersion. One-way MANOVAs, one-way ANOVAs, and Support Vector Machines showed that LLM-generated texts differed significantly from human-written texts on every variable, with ChatGPT-o4 mini and -4.5 differing the most. Of these two, ChatGPT-4.5 showed higher lexical diversity despite producing fewer tokens. Lexical diversity among the human writers did not differ across subgroups (education, language status). Overall, LLMs do not produce human-like texts with respect to lexical diversity, and newer LLMs are less human-like than older ones. The implications for language pedagogy and related applications are discussed.

Title: FACTORY: A Challenging Human-Verified Prompt Set for Long-Form Factuality

Authors: Mingda Chen, Yang Li, Xilun Chen, Adina Williams, Gargi Ghosh, Scott Yih
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.00109
Pdf URL: https://arxiv.org/pdf/2508.00109
Copy Paste: [[2508.00109]] FACTORY: A Challenging Human-Verified Prompt Set for Long-Form Factuality(https://arxiv.org/abs/2508.00109)
Keywords: language model, prompt
Abstract: Long-form factuality evaluation assesses the ability of models to generate accurate, comprehensive responses to short prompts. Existing benchmarks often lack human verification, leading to potential quality issues. To address this limitation, we introduce FACTORY, a large-scale, human-verified prompt set. Developed using a model-in-the-loop approach and refined by humans, FACTORY includes challenging prompts that are fact-seeking, answerable, and unambiguous. We conduct human evaluations on 6 state-of-the-art language models using FACTORY and existing datasets. Our results show that FACTORY is a challenging benchmark: approximately 40% of the claims made in the responses of SOTA models are not factual, compared to only 10% for other datasets. Our analysis identifies the strengths of FACTORY over prior benchmarks, emphasizing its reliability and the necessity for models to reason across long-tailed facts.
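Summary: Long-form factuality evaluation assesses whether models can generate accurate, comprehensive responses to short prompts. Existing benchmarks often lack human verification, which can lead to quality issues. To address this limitation, the paper introduces FACTORY, a large-scale, human-verified prompt set. Developed with a model-in-the-loop approach and refined by humans, FACTORY contains challenging prompts that are fact-seeking, answerable, and unambiguous. Human evaluations of 6 state-of-the-art language models were run on FACTORY and on existing datasets. The results show that FACTORY is a challenging benchmark: roughly 40% of the claims in SOTA model responses are not factual, compared with only 10% on other datasets. The analysis identifies FACTORY's strengths over prior benchmarks, emphasizing its reliability and the need for models to reason across long-tailed facts.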

Title: Comparison of Large Language Models for Deployment Requirements

Authors: Alper Yaman, Jannik Schwab, Christof Nitsche, Abhirup Sinha, Marco Huber
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.00185
Pdf URL: https://arxiv.org/pdf/2508.00185
Copy Paste: [[2508.00185]] Comparison of Large Language Models for Deployment Requirements(https://arxiv.org/abs/2508.00185)
Keywords: language model, gpt, llm, hallucination
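Abstract: Large Language Models (LLMs), such as Generative Pre-trained Transformers (GPTs), are revolutionizing the generation of human-like text, producing contextually relevant and syntactically correct content. Despite challenges like biases and hallucinations, these Artificial Intelligence (AI) models excel in tasks such as content creation, translation, and code generation. Fine-tuning and novel architectures, such as Mixture of Experts (MoE), address these issues. Over the past two years, numerous open-source foundational and fine-tuned models have been introduced, complicating the selection of the optimal LLM for researchers and companies regarding licensing and hardware requirements. To navigate the rapidly evolving LLM landscape and facilitate LLM selection, we present a comparative list of foundational and domain-specific models, focusing on features such as release year, licensing, and hardware requirements. This list is published on GitLab and will be continuously updated.
Summary: Large Language Models (LLMs), such as Generative Pre-trained Transformers (GPTs), are transforming the generation of human-like text, producing contextually relevant and syntactically correct content. Despite challenges such as biases and hallucinations, these Artificial Intelligence (AI) models excel at tasks such as content creation, translation, and code generation. Fine-tuning and novel architectures, such as Mixture of Experts (MoE), address these issues. Over the past two years, numerous open-source foundational and fine-tuned models have been released, which complicates choosing the best LLM for researchers and companies with respect to licensing and hardware requirements. To help navigate the rapidly evolving LLM landscape and ease LLM selection, the paper presents a comparative list of foundational and domain-specific models, focusing on features such as release year, licensing, and hardware requirements. The list is published on GitLab and will be continuously updated.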

Title: Tabular Data Understanding with LLMs: A Survey of Recent Advances and Challenges

Authors: Xiaofeng Wu, Alan Ritter, Wei Xu
Subjects: cs.CL, cs.DB, cs.LG
Abstract URL: https://arxiv.org/abs/2508.00217
Pdf URL: https://arxiv.org/pdf/2508.00217
Copy Paste: [[2508.00217]] Tabular Data Understanding with LLMs: A Survey of Recent Advances and Challenges(https://arxiv.org/abs/2508.00217)
Keywords: language model, llm
Abstract: Tables have gained significant attention in large language models (LLMs) and multimodal large language models (MLLMs) due to their complex and flexible structure. Unlike linear text inputs, tables are two-dimensional, encompassing formats that range from well-structured database tables to complex, multi-layered spreadsheets, each with different purposes. This diversity in format and purpose has led to the development of specialized methods and tasks, instead of universal approaches, making navigation of table understanding tasks challenging. To address these challenges, this paper introduces key concepts through a taxonomy of tabular input representations and an introduction of table understanding tasks. We highlight several critical gaps in the field that indicate the need for further research: (1) the predominance of retrieval-focused tasks that require minimal reasoning beyond mathematical and logical operations; (2) significant challenges faced by models when processing complex table structures, large-scale tables, length context, or multi-table scenarios; and (3) the limited generalization of models across different tabular representations and formats.
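Summary: Because of their complex and flexible structure, tables have drawn significant attention in work on large language models (LLMs) and multimodal large language models (MLLMs). Unlike linear text inputs, tables are two-dimensional and span formats ranging from well-structured database tables to complex, multi-layered spreadsheets, each serving a different purpose. This diversity of format and purpose has led to specialized methods and tasks rather than universal approaches, which makes the landscape of table understanding tasks hard to navigate. To address this, the paper introduces key concepts through a taxonomy of tabular input representations and an introduction to table understanding tasks. It highlights several critical gaps that call for further research: (1) the predominance of retrieval-focused tasks that require little reasoning beyond mathematical and logical operations; (2) the significant difficulty models have with complex table structures, large-scale tables, long contexts, or multi-table scenarios; and (3) the limited generalization of models across different tabular representations and formats.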

Title: Semantic Compression for Word and Sentence Embeddings using Discrete Wavelet Transform

Authors: Rana Aref Salama, Abdou Youssef, Mona Diab
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.00220
Pdf URL: https://arxiv.org/pdf/2508.00220
Copy Paste: [[2508.00220]] Semantic Compression for Word and Sentence Embeddings using Discrete Wavelet Transform(https://arxiv.org/abs/2508.00220)
Keywords: language model
Abstract: Wavelet transforms, a powerful mathematical tool, have been widely used in different domains, including Signal and Image processing, to unravel intricate patterns, enhance data representation, and extract meaningful features from data. Tangible results from their application suggest that Wavelet transforms can be applied to NLP capturing a variety of linguistic and semantic properties. In this paper, we empirically leverage the application of Discrete Wavelet Transforms (DWT) to word and sentence embeddings. We aim to showcase the capabilities of DWT in analyzing embedding representations at different levels of resolution and compressing them while maintaining their overall quality. We assess the effectiveness of DWT embeddings on semantic similarity tasks to show how DWT can be used to consolidate important semantic information in an embedding vector. We show the efficacy of the proposed paradigm using different embedding models, including large language models, on downstream tasks. Our results show that DWT can reduce the dimensionality of embeddings by 50-93% with almost no change in performance for semantic similarity tasks, while achieving superior accuracy in most downstream tasks. Our findings pave the way for applying DWT to improve NLP applications.
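Summary: Wavelet transforms, a powerful mathematical tool, have been widely used in domains such as signal and image processing to uncover intricate patterns, enhance data representation, and extract meaningful features from data. Tangible results from those applications suggest that wavelet transforms can be applied to NLP to capture a variety of linguistic and semantic properties. This paper empirically applies Discrete Wavelet Transforms (DWT) to word and sentence embeddings. The aim is to show that DWT can analyze embedding representations at different levels of resolution and compress them while maintaining their overall quality. The effectiveness of DWT embeddings is assessed on semantic similarity tasks to show how DWT can consolidate important semantic information in an embedding vector. The approach is evaluated on downstream tasks using different embedding models, including large language models. The results show that DWT can reduce embedding dimensionality by 50-93% with almost no change in performance on semantic similarity tasks, while achieving superior accuracy on most downstream tasks. The findings pave the way for applying DWT to improve NLP applications.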
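Illustration: the compression idea in the abstract above can be sketched in a few lines. This is a minimal sketch, not the authors' implementation: the abstract does not name a wavelet family or decomposition depth, so the Haar wavelet, the helper names (`haar_approx`, `compress`, `cosine`), and the toy vectors are all assumptions made here. Keeping only the approximation coefficients halves the dimensionality at each level, and cosine similarity can then be compared before and after compression.

```python
import math

def haar_approx(vec):
    """One Haar DWT level, keeping only the approximation (low-pass) half."""
    if len(vec) % 2 != 0:
        raise ValueError("vector length must be even")
    scale = math.sqrt(2.0)
    return [(vec[i] + vec[i + 1]) / scale for i in range(0, len(vec), 2)]

def compress(vec, levels=1):
    """Apply `levels` Haar levels; each level halves the dimensionality."""
    out = list(vec)
    for _ in range(levels):
        out = haar_approx(out)
    return out

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 8-dimensional "embeddings" (made-up values, not data from the paper).
u = [0.9, 1.0, 0.1, 0.0, -0.5, -0.4, 0.3, 0.2]
v = [1.0, 0.8, 0.0, 0.1, -0.6, -0.5, 0.2, 0.4]

print(len(compress(u, levels=1)))  # 4 dimensions, i.e. 50% smaller
print(round(cosine(u, v), 3))                        # similarity before
print(round(cosine(compress(u), compress(v)), 3))    # similarity after
```

With this scheme, k levels shrink a vector by a factor of 2^k, so one level gives a 50% reduction and four levels give 93.75%. Whether the paper's reported 50-93% range was obtained this way is not stated in the abstract.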

Title: Model Misalignment and Language Change: Traces of AI-Associated Language in Unscripted Spoken English

Authors: Bryce Anderson, Riley Galpin, Tom S. Juzek
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.00238
Pdf URL: https://arxiv.org/pdf/2508.00238
Copy Paste: [[2508.00238]] Model Misalignment and Language Change: Traces of AI-Associated Language in Unscripted Spoken English(https://arxiv.org/abs/2508.00238)
Keywords: language model, gpt, llm, chat
Abstract: In recent years, written language, particularly in science and education, has undergone remarkable shifts in word usage. These changes are widely attributed to the growing influence of Large Language Models (LLMs), which frequently rely on a distinct lexical style. Divergences between model output and target audience norms can be viewed as a form of misalignment. While these shifts are often linked to using Artificial Intelligence (AI) directly as a tool to generate text, it remains unclear whether the changes reflect broader changes in the human language system itself. To explore this question, we constructed a dataset of 22.1 million words from unscripted spoken language drawn from conversational science and technology podcasts. We analyzed lexical trends before and after ChatGPT's release in 2022, focusing on commonly LLM-associated words. Our results show a moderate yet significant increase in the usage of these words post-2022, suggesting a convergence between human word choices and LLM-associated patterns. In contrast, baseline synonym words exhibit no significant directional shift. Given the short time frame and the number of words affected, this may indicate the onset of a remarkable shift in language use. Whether this represents natural language change or a novel shift driven by AI exposure remains an open question. Similarly, although the shifts may stem from broader adoption patterns, it may also be that upstream training misalignments ultimately contribute to changes in human language use. These findings parallel ethical concerns that misaligned models may shape social and moral beliefs.
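Summary: In recent years, written language, particularly in science and education, has undergone marked shifts in word usage. These changes are widely attributed to the growing influence of Large Language Models (LLMs), which often rely on a distinct lexical style. Divergences between model output and the norms of the target audience can be viewed as a form of misalignment. Although these shifts are usually linked to the direct use of Artificial Intelligence (AI) as a text generation tool, it remains unclear whether they reflect broader changes in the human language system itself. To explore this question, the authors built a dataset of 22.1 million words of unscripted spoken language drawn from conversational science and technology podcasts. They analyzed lexical trends before and after ChatGPT's release in 2022, focusing on words commonly associated with LLMs. The results show a moderate yet significant increase in the use of these words after 2022, suggesting a convergence between human word choices and LLM-associated patterns. Baseline synonym words, by contrast, show no significant directional shift. Given the short time frame and the number of words affected, this may mark the onset of a notable shift in language use. Whether it represents natural language change or a novel shift driven by AI exposure remains an open question. Similarly, although the shifts may stem from broader adoption patterns, upstream training misalignments may also ultimately contribute to changes in human language use. These findings parallel ethical concerns that misaligned models may shape social and moral beliefs.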

Title: Integrating clinical reasoning into large language model-based diagnosis through etiology-aware attention steering

Authors: Peixian Li, Yu Tian, Ruiqi Tu, Chengkai Wu, Jingjing Ren, Jingsong Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.00285
Pdf URL: https://arxiv.org/pdf/2508.00285
Copy Paste: [[2508.00285]] Integrating clinical reasoning into large language model-based diagnosis through etiology-aware attention steering(https://arxiv.org/abs/2508.00285)
Keywords: language model, llm
Abstract: Objective: Large Language Models (LLMs) demonstrate significant capabilities in medical text understanding and generation. However, their diagnostic reliability in complex clinical scenarios remains limited. This study aims to enhance LLMs' diagnostic accuracy and clinical reasoning ability. Method: We propose an Etiology-Aware Attention Steering Framework to integrate structured clinical reasoning into LLM-based diagnosis. Specifically, we first construct Clinical Reasoning Scaffolding (CRS) based on authoritative clinical guidelines for three representative acute abdominal emergencies: acute appendicitis, acute pancreatitis, and acute cholecystitis. Next, we develop the Etiology-Aware Head Identification algorithm to pinpoint attention heads crucial for the model's etiology reasoning. To ensure reliable clinical reasoning alignment, we introduce the Reasoning-Guided Parameter-Efficient Fine-tuning that embeds etiological reasoning cues into input representations and steers the selected Etiology-Aware Heads toward critical information through a Reasoning-Guided Loss function. Result: On the Consistent Diagnosis Cohort, our framework improves average diagnostic accuracy by 15.65% and boosts the average Reasoning Focus Score by 31.6% over baselines. External validation on the Discrepant Diagnosis Cohort further confirms its effectiveness in enhancing diagnostic accuracy. Further assessments via Reasoning Attention Frequency indicate that our models exhibit enhanced reliability when faced with real-world complex scenarios. Conclusion: This study presents a practical and effective approach to enhance clinical reasoning in LLM-based diagnosis. By aligning model attention with structured CRS, the proposed framework offers a promising paradigm for building more interpretable and reliable AI diagnostic systems in complex clinical settings.
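Summary: Objective: Large Language Models (LLMs) show significant capabilities in medical text understanding and generation, but their diagnostic reliability in complex clinical scenarios remains limited. This study aims to improve LLMs' diagnostic accuracy and clinical reasoning ability. Method: The authors propose an Etiology-Aware Attention Steering Framework that integrates structured clinical reasoning into LLM-based diagnosis. They first construct Clinical Reasoning Scaffolding (CRS) from authoritative clinical guidelines for three representative acute abdominal emergencies: acute appendicitis, acute pancreatitis, and acute cholecystitis. They then develop the Etiology-Aware Head Identification algorithm to pinpoint the attention heads that are crucial for the model's etiology reasoning. To ensure reliable alignment with clinical reasoning, they introduce Reasoning-Guided Parameter-Efficient Fine-tuning, which embeds etiological reasoning cues into input representations and steers the selected Etiology-Aware Heads toward critical information through a Reasoning-Guided Loss function. Result: On the Consistent Diagnosis Cohort, the framework improves average diagnostic accuracy by 15.65% and raises the average Reasoning Focus Score by 31.6% over baselines. External validation on the Discrepant Diagnosis Cohort further confirms its effectiveness in improving diagnostic accuracy. Further assessment via Reasoning Attention Frequency indicates that the models are more reliable in complex real-world scenarios. Conclusion: The study presents a practical and effective approach to strengthening clinical reasoning in LLM-based diagnosis. By aligning model attention with structured CRS, the framework offers a promising paradigm for building more interpretable and reliable AI diagnostic systems in complex clinical settings.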

Title: Systematic Evaluation of Optimization Techniques for Long-Context Language Models

Authors: Ammar Ahmed, Sheng Di, Franck Cappello, Zirui Liu, Jingoo Han, Ali Anwar
Subjects: cs.CL, cs.LG, cs.PF
Abstract URL: https://arxiv.org/abs/2508.00305
Pdf URL: https://arxiv.org/pdf/2508.00305
Copy Paste: [[2508.00305]] Systematic Evaluation of Optimization Techniques for Long-Context Language Models(https://arxiv.org/abs/2508.00305)
Keywords: language model, llm, long context
Abstract: Large language models (LLMs) excel across diverse natural language processing tasks but face resource demands and limited context windows. Although techniques like pruning, quantization, and token dropping can mitigate these issues, their efficacy in long-context scenarios and system evaluation remains underexplored. This paper systematically benchmarks these optimizations, characterizing memory usage, latency, and throughput, and studies how these methods impact the quality of text generation. We first analyze individual optimization methods for two LLM architectures supporting long context and then systematically evaluate combinations of these techniques to assess how this deeper analysis impacts performance metrics. We subsequently study the scalability of individual optimization methods on a larger 70-billion-parameter variant. Our novel insights reveal that naively combining inference optimization algorithms can adversely affect larger models due to compounded approximation errors, as compared to their smaller counterparts. Experiments show that relying solely on F1 obscures these effects by hiding precision-recall trade-offs in question answering tasks. By integrating system-level profiling with task-specific insights, this study helps LLM practitioners and researchers explore and balance efficiency, accuracy, and scalability across tasks and hardware configurations.
Summary: Large language models (LLMs) excel across diverse natural language processing tasks but face heavy resource demands and limited context windows. Techniques such as pruning, quantization, and token dropping can mitigate these issues, yet their efficacy in long-context scenarios and in system-level evaluation remains underexplored. This paper systematically benchmarks these optimizations, characterizing memory usage, latency, and throughput, and studies how they affect the quality of generated text. It first analyzes individual optimization methods on two LLM architectures that support long context, then systematically evaluates combinations of these techniques to assess their effect on performance metrics. It then studies how individual optimization methods scale to a larger 70-billion-parameter variant. The findings show that naively combining inference optimization algorithms can hurt larger models more than smaller ones because approximation errors compound. Experiments show that relying solely on F1 obscures these effects by hiding precision-recall trade-offs in question answering tasks. By combining system-level profiling with task-specific insights, the study helps LLM practitioners and researchers explore and balance efficiency, accuracy, and scalability across tasks and hardware configurations.
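Illustration: the point that F1 alone can hide precision-recall trade-offs can be seen in a toy example. The sketch below is not the paper's evaluation code; it assumes a simple whitespace-token overlap metric of the kind commonly used for question answering, and the function name `token_prf` and the example strings are made up here. A terse answer and an over-long answer get the same F1 even though one is all precision and the other all recall.

```python
from collections import Counter

def token_prf(prediction, reference):
    """Token-overlap precision, recall, and F1 for a single QA pair."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

reference = "paris france"
terse = "paris"                      # correct but incomplete
verbose = "paris france in europe"   # complete but padded

for name, answer in [("terse", terse), ("verbose", verbose)]:
    p, r, f1 = token_prf(answer, reference)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Reporting precision and recall next to F1 makes it visible whether an optimization pushes a model toward truncated answers or toward padded ones.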

Title: PilotRL: Training Language Model Agents via Global Planning-Guided Progressive Reinforcement Learning

Authors: Keer Lu, Chong Chen, Bin Cui, Huang Leng, Wentao Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.00344
Pdf URL: https://arxiv.org/pdf/2508.00344
Copy Paste: [[2508.00344]] PilotRL: Training Language Model Agents via Global Planning-Guided Progressive Reinforcement Learning(https://arxiv.org/abs/2508.00344)
Keywords: language model, gpt, llm, agent
Abstract: Large Language Models (LLMs) have shown remarkable advancements in tackling agent-oriented tasks. Despite their potential, existing work faces challenges when deploying LLMs in agent-based environments. The widely adopted agent paradigm ReAct centers on integrating single-step reasoning with immediate action execution, which limits its effectiveness in complex tasks requiring long-term strategic planning. Furthermore, the coordination between the planner and executor during problem-solving is also a critical factor to consider in agent design. Additionally, current approaches predominantly rely on supervised fine-tuning, which often leads models to memorize established task completion trajectories, thereby restricting their generalization ability when confronted with novel problem contexts. To address these challenges, we introduce an adaptive global plan-based agent paradigm AdaPlan, aiming to synergize high-level explicit guidance with execution to support effective long-horizon decision-making. Based on the proposed paradigm, we further put forward PilotRL, a global planning-guided training framework for LLM agents driven by progressive reinforcement learning. We first develop the model's ability to follow explicit guidance from global plans when addressing agent tasks. Subsequently, based on this foundation, we focus on optimizing the quality of generated plans. Finally, we conduct joint optimization of the model's planning and execution coordination. Experiments indicate that PilotRL could achieve state-of-the-art performance, with LLaMA3.1-8B-Instruct + PilotRL surpassing closed-source GPT-4o by 3.60%, while showing a more substantial gain of 55.78% compared to GPT-4o-mini at a comparable parameter scale.
Summary: Large Language Models (LLMs) have made remarkable progress on agent-oriented tasks, but existing work still faces challenges when deploying LLMs in agent-based environments. The widely adopted agent paradigm ReAct centers on combining single-step reasoning with immediate action execution, which limits its effectiveness on complex tasks that require long-term strategic planning. Coordination between the planner and the executor during problem solving is another critical factor in agent design. In addition, current approaches rely mainly on supervised fine-tuning, which often leads models to memorize established task completion trajectories and so restricts their ability to generalize to novel problem contexts. To address these challenges, the authors introduce AdaPlan, an adaptive global plan-based agent paradigm that aims to combine high-level explicit guidance with execution to support effective long-horizon decision-making. Building on this paradigm, they propose PilotRL, a global planning-guided training framework for LLM agents driven by progressive reinforcement learning. Training first develops the model's ability to follow explicit guidance from global plans when addressing agent tasks, then focuses on optimizing the quality of the generated plans, and finally jointly optimizes the coordination of planning and execution. Experiments indicate that PilotRL can achieve state-of-the-art performance: LLaMA3.1-8B-Instruct + PilotRL surpasses closed-source GPT-4o by 3.60% and shows a larger gain of 55.78% over GPT-4o-mini at a comparable parameter scale.

Title: Lucy: edgerunning agentic web search on mobile with machine generated task vectors

Authors: Alan Dao (Gia Tuan Dao), Dinh Bach Vu, Alex Nguyen, Norapat Buppodom
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.00360
Pdf URL: https://arxiv.org/pdf/2508.00360
Copy Paste: [[2508.00360]] Lucy: edgerunning agentic web search on mobile with machine generated task vectors(https://arxiv.org/abs/2508.00360)
Keywords: language model, agent
Abstract: Small language models (SLMs) are inherently limited in knowledge-intensive tasks due to their constrained capacity. While test-time computation offers a path to enhanced performance, most approaches treat reasoning as a fixed or heuristic process. In this work, we propose a new paradigm: viewing the model's internal reasoning, delimited by <think> and </think> tags, as a dynamic task vector machine. Rather than treating the content inside these tags as a mere trace of thought, we interpret the generation process itself as a mechanism through which the model constructs and refines its own task vectors on the fly. We developed a method to optimize this dynamic task vector machine through RLVR and successfully trained an agentic web-search model. We present Lucy, a 1.7B-parameter SLM that leverages this dynamic reasoning mechanism with MCP integration to achieve 78.3% accuracy on the SimpleQA benchmark, performing on par with much larger models such as DeepSeek-V3. This demonstrates that small models can rival large ones when equipped with structured, self-constructed task reasoning.
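Summary: Small language models (SLMs) are inherently limited on knowledge-intensive tasks because of their constrained capacity. Test-time computation offers a path to better performance, but most approaches treat reasoning as a fixed or heuristic process. This work proposes a new paradigm: viewing the model's internal reasoning, delimited by reasoning tags, as a dynamic task vector machine. Rather than treating the content inside these tags as a mere trace of thought, the authors interpret the generation process itself as a mechanism through which the model constructs and refines its own task vectors on the fly. They develop a method to optimize this dynamic task vector machine through RLVR and use it to train an agentic web-search model. The result is Lucy, a 1.7B-parameter SLM that uses this dynamic reasoning mechanism with MCP integration to reach 78.3% accuracy on the SimpleQA benchmark, on par with much larger models such as DeepSeek-V3. This shows that small models can rival large ones when equipped with structured, self-constructed task reasoning.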

Title: EdgeInfinite-Instruct: Bridging SFT-Based Optimization and NPU-Level Efficiency for Edge Devices

Authors: Jiyu Chen, Poh Seng Lim, Shuang Peng, Daxiong Luo, JungHau Foo, Yap Deep, Timothy Lee Jun Jie, Kelvin Teh Kae Wen, Fan Yang, Danyu Feng, Hao-Yun Chen, Peng-Wen Chen, Fangyuan Li, Xiaoxin Chen, Wong Wai Mun
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.00370
Pdf URL: https://arxiv.org/pdf/2508.00370
Copy Paste: [[2508.00370]] EdgeInfinite-Instruct: Bridging SFT-Based Optimization and NPU-Level Efficiency for Edge Devices(https://arxiv.org/abs/2508.00370)
Keywords: language model, llm
Abstract: Deploying Transformer-based large language models (LLMs) on resource-constrained edge devices for long-sequence tasks remains challenging due to the quadratic time complexity of self-attention and growing Key-Value (KV) cache demands. While existing KV cache optimizations improve memory efficiency, they often fail to reduce time to first token (TTFT) and may degrade performance through token pruning. Alternative sequence modeling architectures address some of these limitations, but typically require full retraining and lack infrastructure support. EdgeInfinite offers an efficient solution by fine-tuning only a small subset of parameters, maintaining quality while reducing both computational and memory costs, including improved TTFT. However, its instruction-following ability is limited, and it lacks mobile-specific optimizations. To address these issues, we propose EdgeInfinite-Instruct, which introduces a Segmented Supervised Fine-Tuning (S-SFT) strategy tailored to long-sequence tasks such as summarization and question answering. We further optimized EdgeInfinite-Instruct for efficient deployment on edge NPUs by employing fine-grained post-training quantization (PTQ) to reduce computational demands while maintaining accuracy, and by implementing a fixed-shape computation graph that balances memory usage and on-device efficiency through scenario-specific customization of input token and cache sizes. Experiments on long-context benchmarks and real-world mobile tasks show that our approach improves domain-specific performance while maintaining efficiency on NPU-accelerated edge devices.
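Summary: Deploying Transformer-based large language models (LLMs) on resource-constrained edge devices for long-sequence tasks remains challenging because of the quadratic time complexity of self-attention and growing Key-Value (KV) cache demands. Existing KV cache optimizations improve memory efficiency, but they often fail to reduce time to first token (TTFT) and may degrade performance through token pruning. Alternative sequence modeling architectures address some of these limitations but typically require full retraining and lack infrastructure support. EdgeInfinite offers an efficient solution by fine-tuning only a small subset of parameters, maintaining quality while reducing both computational and memory costs, including improved TTFT. However, its instruction-following ability is limited and it lacks mobile-specific optimizations. To address these issues, the authors propose EdgeInfinite-Instruct, which introduces a Segmented Supervised Fine-Tuning (S-SFT) strategy tailored to long-sequence tasks such as summarization and question answering. They further optimize EdgeInfinite-Instruct for efficient deployment on edge NPUs by applying fine-grained post-training quantization (PTQ) to reduce computational demands while maintaining accuracy, and by implementing a fixed-shape computation graph that balances memory usage and on-device efficiency through scenario-specific customization of input token and cache sizes. Experiments on long-context benchmarks and real-world mobile tasks show that the approach improves domain-specific performance while maintaining efficiency on NPU-accelerated edge devices.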

Title: Multi-Layer Attention is the Amplifier of Demonstration Effectiveness

Authors: Dingzirui Wang, Xuangliang Zhang, Keyan Xu, Qingfu Zhu, Wanxiang Che, Yang Deng
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.00385
Pdf URL: https://arxiv.org/pdf/2508.00385
Copy Paste: [[2508.00385]] Multi-Layer Attention is the Amplifier of Demonstration Effectiveness(https://arxiv.org/abs/2508.00385)
Keywords: llm
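Abstract: Numerous studies have investigated the underlying mechanisms of in-context learning (ICL) effectiveness to inspire the design of related methods. However, existing work predominantly assumes the effectiveness of the demonstrations provided within ICL, while much research indicates that not all demonstrations are effective, with some failing to yield any performance improvement during ICL. Therefore, in this paper, we investigate the reasons behind demonstration ineffectiveness. Our analysis is based on gradient flow and linear self-attention models. By setting the gradient flow to zero, we deduce that a demonstration becomes ineffective if its information has either been learned by the model or is irrelevant to the user query. Furthermore, we demonstrate that in multi-layer models, the disparity in effectiveness among demonstrations is amplified as the number of layers increases, causing the model to focus more on effective ones. Considering that current demonstration selection methods primarily focus on the relevance to the user query while overlooking the information that the model has already assimilated, we propose a novel method called GradS, which leverages gradient flow for demonstration selection. We use the magnitude of the gradient flow of the demonstration with respect to a given user query as the criterion, thereby ensuring the effectiveness of the chosen ones. We validate our derivation and GradS on four prominent LLMs across five mainstream datasets. The experimental results confirm that the disparity in effectiveness among demonstrations is magnified as the number of model layers increases, substantiating our derivations. Moreover, GradS achieves a relative improvement of 6.8% on average over the strongest baselines, demonstrating its effectiveness.
Summary: Many studies have investigated the mechanisms behind the effectiveness of in-context learning (ICL) to inspire the design of related methods. Existing work, however, mostly assumes that the demonstrations provided in ICL are effective, while much research indicates that not all demonstrations are, with some yielding no performance improvement. This paper therefore investigates why demonstrations become ineffective. The analysis is based on gradient flow and linear self-attention models. By setting the gradient flow to zero, the authors deduce that a demonstration is ineffective if its information has already been learned by the model or is irrelevant to the user query. They further show that in multi-layer models the disparity in effectiveness among demonstrations is amplified as the number of layers increases, causing the model to focus more on the effective ones. Because current demonstration selection methods focus mainly on relevance to the user query and overlook what the model has already assimilated, the authors propose GradS, a method that uses gradient flow for demonstration selection. The selection criterion is the magnitude of a demonstration's gradient flow with respect to a given user query, which ensures that the chosen demonstrations are effective. The derivation and GradS are validated on four prominent LLMs across five mainstream datasets. The results confirm that the disparity in effectiveness among demonstrations grows as the number of model layers increases, supporting the derivations. GradS also achieves an average relative improvement of 6.8% over the strongest baselines.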

Title: SA-GCS: Semantic-Aware Gaussian Curriculum Scheduling for UAV Vision-Language Navigation

Authors: Hengxing Cai, Jinhan Dong, Yijie Rao, Jingcheng Deng, Jingjun Tan, Qien Chen, Haidong Wang, Zhen Wang, Shiyu Huang, Agachai Sumalee, Renxin Zhong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.00390
Pdf URL: https://arxiv.org/pdf/2508.00390
Copy Paste: [[2508.00390]] SA-GCS: Semantic-Aware Gaussian Curriculum Scheduling for UAV Vision-Language Navigation(https://arxiv.org/abs/2508.00390)
Keywords: language model, agent
Abstract: Unmanned Aerial Vehicle (UAV) Vision-Language Navigation (VLN) aims to enable agents to accurately localize targets and plan flight paths in complex environments based on natural language instructions, with broad applications in intelligent inspection, disaster rescue, and urban monitoring. Recent progress in Vision-Language Models (VLMs) has provided strong semantic understanding for this task, while reinforcement learning (RL) has emerged as a promising post-training strategy to further improve generalization. However, existing RL methods often suffer from inefficient use of training data, slow convergence, and insufficient consideration of the difficulty variation among training samples, which limits further performance improvement. To address these challenges, we propose \textbf{Semantic-Aware Gaussian Curriculum Scheduling (SA-GCS)}, a novel training framework that systematically integrates Curriculum Learning (CL) into RL. SA-GCS employs a Semantic-Aware Difficulty Estimator (SA-DE) to quantify the complexity of training samples and a Gaussian Curriculum Scheduler (GCS) to dynamically adjust the sampling distribution, enabling a smooth progression from easy to challenging tasks. This design significantly improves training efficiency, accelerates convergence, and enhances overall model performance. Extensive experiments on the CityNav benchmark demonstrate that SA-GCS consistently outperforms strong baselines across all metrics, achieves faster and more stable convergence, and generalizes well across models of different scales, highlighting its robustness and scalability. The implementation of our approach is publicly available.
摘要：无人机（UAV）视觉导航（VLN）旨在使代理商根据自然语言指令在复杂环境中精确定位目标并计划飞行路径，并在智能检查，灾难救援和城市监测中进行广泛应用。视觉模型（VLM）的最新进展为这项任务提供了强烈的语义理解，而强化学习（RL）已成为一种有希望的后训练后策略，以进一步改善概括。但是，现有的RL方法通常遭受训练数据的效率低下，收敛速度缓慢以及对训练样本中难度变化的考虑不足，这限制了进一步的性能改善。为了应对这些挑战，我们建议\ textbf {语义觉醒的高斯课程调度（SA-GCS）}，这是一个新颖的培训框架，该培训框架系统地将课程学习（CL）整合到RL中。 SA-GC使用语义意识的难度估计器（SA-DE）来量化培训样本的复杂性和高斯课程调度程序（GCS），以动态调整采样分布，从而使从易于挑战到具有挑战性的任务的平稳进展。这种设计可显着提高训练效率，加速融合并提高整体模型性能。在CityNAV基准上进行的广泛实验表明，SA-GC始终优于所有指标的强大基准，实现更快，更稳定的收敛性，并在不同尺度的模型中概括地概括了其稳健性和可扩展性。我们的方法的实施是公开可用的。
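
Illustrative sketch (not from the paper): the abstract describes a scheduler that reshapes the sampling distribution so training moves smoothly from easy to hard samples, but it does not give the formula. The Python below is one possible reading under loud assumptions: difficulty scores are already normalized to [0, 1], the sampling weight is a Gaussian in difficulty, and the Gaussian centre moves linearly over training. The names `gaussian_weights`, `curriculum_mean`, and `sample_batch`, and the default `sigma`, are hypothetical.

```python
import math
import random


def gaussian_weights(difficulties, mu, sigma):
    """Unnormalized Gaussian weight for each sample's difficulty score."""
    return [math.exp(-((d - mu) ** 2) / (2 * sigma ** 2)) for d in difficulties]


def curriculum_mean(step, total_steps, start=0.0, end=1.0):
    """Assumed linear schedule: the Gaussian centre moves from easy (0) to hard (1)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


def sample_batch(difficulties, step, total_steps, batch_size, sigma=0.15, rng=None):
    """Draw sample indices with probability proportional to the Gaussian weight."""
    rng = rng or random.Random(0)
    mu = curriculum_mean(step, total_steps)
    weights = gaussian_weights(difficulties, mu, sigma)
    return rng.choices(range(len(difficulties)), weights=weights, k=batch_size)
```

With ten samples whose difficulties are spread evenly over [0, 1], a batch drawn at step 0 is dominated by the easiest samples and a batch drawn at the final step by the hardest. A small `sigma` makes the curriculum strict, and a large one approaches uniform sampling.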

Title: ReaGAN: Node-as-Agent-Reasoning Graph Agentic Network

Authors: Minghao Guo, Xi Zhu, Jingyuan Huang, Kai Mei, Yongfeng Zhang
Subjects: cs.CL, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2508.00429
Pdf URL: https://arxiv.org/pdf/2508.00429
Copy Paste: [[2508.00429]] ReaGAN: Node-as-Agent-Reasoning Graph Agentic Network(https://arxiv.org/abs/2508.00429)
Keywords: llm, retrieval-augmented generation, agent
Abstract: Graph Neural Networks (GNNs) have achieved remarkable success in graph-based learning by propagating information among neighbor nodes via predefined aggregation mechanisms. However, such fixed schemes often suffer from two key limitations. First, they cannot handle the imbalance in node informativeness -- some nodes are rich in information, while others remain sparse. Second, predefined message passing primarily leverages local structural similarity while ignoring global semantic relationships across the graph, limiting the model's ability to capture distant but relevant information. We propose Retrieval-augmented Graph Agentic Network (ReaGAN), an agent-based framework that empowers each node with autonomous, node-level decision-making. Each node acts as an agent that independently plans its next action based on its internal memory, enabling node-level planning and adaptive message propagation. Additionally, retrieval-augmented generation (RAG) allows nodes to access semantically relevant content and build global relationships in the graph. ReaGAN achieves competitive performance under few-shot in-context settings using a frozen LLM backbone without fine-tuning, showcasing the potential of agentic planning and local-global retrieval in graph learning.
摘要：图形神经网络（GNNS）通过通过预定义的聚集机制在邻居节点之间传播信息，在基于图的学习中取得了显着成功。但是，这种固定方案经常受到两个关键局限性。首先，他们无法处理节点信息的不平衡 - 有些节点富含信息，而另一些节点仍然很少。其次，预定义的消息主要传递主要利用局部结构相似性，同时忽略了整个图的全局语义关系，从而限制了模型捕获遥远但相关信息的能力。我们提出了基于代理的框架检索图形代理网络（Reagan），该框架可以通过自主节点，节点级决策来赋予每个节点。每个节点都充当代理，该代理会根据其内部内存独立计划其下一个操作，从而实现节点级别的计划和自适应消息传播。此外，检索功能的生成（RAG）允许节点访问语义相关的内容并在图中构建全局关系。里根（Reagan）使用冷冻的LLM主链在不进行微调的情况下实现竞争性能，而无需微调，展示了代理计划和图形学习中局部全球检索的潜力。

Title: Learning an Efficient Multi-Turn Dialogue Evaluator from Multiple Judges

Authors: Yuqi Tang, Kehua Feng, Yunfeng Wang, Zhiwen Chen, Chengfei Lv, Gang Yu, Qiang Zhang, Keyan Ding
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.00454
Pdf URL: https://arxiv.org/pdf/2508.00454
Copy Paste: [[2508.00454]] Learning an Efficient Multi-Turn Dialogue Evaluator from Multiple Judges(https://arxiv.org/abs/2508.00454)
Keywords: language model, llm, prompt
Abstract: Evaluating the conversational abilities of large language models (LLMs) remains a challenging task. Current mainstream approaches primarily rely on the "LLM-as-a-judge" paradigm, where an LLM is prompted to serve as an evaluator to assess dialogue quality. However, such methods often suffer from various biases, which undermine the reliability and consistency of the evaluation results. To mitigate these biases, recent methods employ multiple LLMs as judges and aggregate their judgments to select the optimal assessment. Although effective, this multi-judge approach incurs significant computational overhead during inference. In this paper, we propose an efficient multi-turn dialogue evaluator that captures the collective wisdom of multiple LLM judges by aggregating their preference knowledge into a single model. Our approach preserves the advantages of diverse multi-judge feedback while drastically reducing the evaluation cost, enabling fast and flexible dialogue quality assessment. Extensive experiments on seven single rating and pairwise comparison dialogue evaluation benchmarks demonstrate that our method outperforms existing baselines across diverse scenarios, showcasing its efficiency and robustness.
摘要：评估大语言模型（LLM）的会话能力仍然是一项具有挑战性的任务。当前的主流方法主要依赖于``llm-as a-As-a-aughth''范例，在该范式中，LLM被提示作为评估对话质量的评估者。但是，这些方法经常受到各种偏见的影响，这些偏见会破坏评估结果的可靠性和一致性。以调查这些偏见，以降低这些偏见，以选择评估多个llms，并将其用于评估和评估律师的判决。尽管有效，在本文中，这种多重计算方法是在本文中提出的，我们提出了一个有效的多转向对话评估者，该评估者通过将其偏好知识汇总到单个模型中，以确保多样化的屈曲效率。评分和成对比较对话评估基准表明，我们的方法在各种情况下都优于现有基准，展示其效率和鲁棒性。

Title: GETALP@AutoMin 2025: Leveraging RAG to Answer Questions based on Meeting Transcripts

Authors: Jeongwoo Kang, Markarit Vartampetian, Felix Herron, Yongxin Zhou, Diandra Fabre, Gabriela Gonzalez-Saez
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.00476
Pdf URL: https://arxiv.org/pdf/2508.00476
Copy Paste: [[2508.00476]] GETALP@AutoMin 2025: Leveraging RAG to Answer Questions based on Meeting Transcripts(https://arxiv.org/abs/2508.00476)
Keywords: retrieval augmented generation
Abstract: This paper documents GETALP's submission to the Third Run of the Automatic Minuting Shared Task at SIGDial 2025. We participated in Task B: question-answering based on meeting transcripts. Our method is based on a retrieval augmented generation (RAG) system and Abstract Meaning Representations (AMR). We propose three systems combining these two approaches. Our results show that incorporating AMR leads to high-quality responses for approximately 35% of the questions and provides notable improvements in answering questions that involve distinguishing between different participants (e.g., who questions).
摘要：本文记录了Getalp在2025年Sigdial的自动汇总共享任务的第三次投资中提交的提交。我们根据会议成绩单参加了任务B：提问。我们的方法基于检索增强生成（RAG）系统和抽象含义表示（AMR）。我们提出了三个结合这两种方法的系统。我们的结果表明，合并AMR会导致大约35％的问题的高质量答复，并在回答涉及区分不同参与者（例如，谁问题）的问题方面提供了显着的改进。

Title: EFlat-LoRA: Efficiently Seeking Flat Minima for Better Generalization in Fine-Tuning Large Language Models and Beyond

Authors: Jiaxin Deng, Qingcheng Zhu, Junbiao Pang, Linlin Yang, Zhongqian Fu, Baochang Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.00522
Pdf URL: https://arxiv.org/pdf/2508.00522
Copy Paste: [[2508.00522]] EFlat-LoRA: Efficiently Seeking Flat Minima for Better Generalization in Fine-Tuning Large Language Models and Beyond(https://arxiv.org/abs/2508.00522)
Keywords: language model, chat
Abstract: Little research explores the correlation between the expressive ability and generalization ability of low-rank adaptation (LoRA). Sharpness-Aware Minimization (SAM) improves model generalization for both Convolutional Neural Networks (CNNs) and Transformers by encouraging convergence to locally flat minima. However, the connection between sharpness and generalization has not been fully explored for LoRA due to the lack of tools to either empirically seek flat minima or develop theoretical methods. In this work, we propose Flat-LoRA and its efficient version, EFlat-LoRA, to seek flat minima for LoRA. Concretely, we theoretically demonstrate that perturbations in the full parameter space can be transferred to the low-rank subspace. This approach eliminates the potential interference introduced by perturbations across multiple matrices in the low-rank subspace. Our extensive experiments on large language models and vision-language models demonstrate that EFlat-LoRA achieves optimization efficiency comparable to that of LoRA while simultaneously attaining comparable or even better performance. For example, on the GLUE dataset with RoBERTa-large, EFlat-LoRA outperforms LoRA and full fine-tuning by 1.0% and 0.5% on average, respectively. On vision-language models, e.g., Qwen-VL-Chat, it shows performance improvements of 1.5% and 1.0% on the SQA and VizWiz datasets, respectively. These empirical results also verify that the generalization of LoRA is closely related to sharpness, which is overlooked by previous methods.
摘要：很少的研究探讨了低级适应（LORA）的表达能力和概括能力之间的相关性。清晰度感知最小化（SAM）通过鼓励与局部平面最小值的收敛来改善卷积神经网络（CNN）和变压器的模型概括。但是，由于缺乏经验寻求平坦的最小值或开发理论方法的工具，劳拉的清晰度与概括之间的联系尚未完全探索。在这项工作中，我们提出了Flat-Lora及其高效版本，即Eflat-Lora，为Lora寻求平坦的最小值。具体而言，我们从理论上证明了整个参数空间中的扰动可以转移到低级别子空间中。这种方法消除了低率子空间中多个矩阵的扰动引入的潜在干扰。我们在大型语言模型和视觉模型上进行的广泛实验表明，Eflat-Lora可以优化与Lora相当的效率，同时达到可比性甚至更好的性能。例如，在带有罗伯塔大（Roberta-large）的胶水数据集上，Eflat-Lora的表现分别超过了洛拉（Lora），平均表现出1.0％和0.5％。在视觉模型中，例如，QWEN-VL-CHAT在SQA和Vizwiz数据集上显示了1.5％和1.0％的性能提高。这些经验结果还证明了洛拉的概括与锐度密切相关，这是先前方法所省略的。
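
Illustrative sketch (not from the paper): the abstract builds on Sharpness-Aware Minimization, which perturbs the weights toward higher loss before taking the gradient used for the update. The Python below shows only that generic SAM step on a toy two-parameter quadratic. It does not implement the paper's transfer of perturbations into the low-rank subspace, and the loss function, `lr`, and `rho` values are assumptions chosen for illustration.

```python
import math


def loss(w):
    """Toy quadratic with very different curvature along each coordinate."""
    return 0.5 * (10.0 * w[0] ** 2 + 0.1 * w[1] ** 2)


def grad(w):
    """Analytic gradient of the toy loss."""
    return [10.0 * w[0], 0.1 * w[1]]


def sam_step(w, lr=0.05, rho=0.05):
    """One SAM update: ascend to a nearby worst-case point, then descend from w."""
    g = grad(w)
    norm = math.sqrt(sum(x * x for x in g)) or 1.0  # avoid dividing by zero
    # Perturb the weights by rho along the normalized gradient direction.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Use the gradient at the perturbed point to update the original weights.
    g_adv = grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]
```

Each step costs two gradient evaluations instead of one, which is the overhead that efficient variants of sharpness-aware training try to reduce.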

Title: PaPaformer: Language Model from Pre-trained Paraller Paths

Authors: Joonas Tapaninaho, Mourad Oussala
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.00544
Pdf URL: https://arxiv.org/pdf/2508.00544
Copy Paste: [[2508.00544]] PaPaformer: Language Model from Pre-trained Paraller Paths(https://arxiv.org/abs/2508.00544)
Keywords: language model
Abstract: The training of modern large language models requires an increasing amount of computation power and time. Even smaller variants, such as small language models (SLMs), take several days to train in the best-case scenarios, often requiring multiple GPUs. This paper explores methods to train and evaluate decoder-only transformer-based language models in hours instead of days/weeks. We introduce PaPaformer, a decoder-only transformer architecture variant, whose lower-dimensional parallel paths are combined into a larger model. The paper shows that these lower-dimensional paths can be trained individually with different types of training data and then combined into one larger model. This method gives the option to reduce the total number of model parameters and the training time while increasing performance. Moreover, the use of a parallel path structure opens interesting possibilities to customize paths to accommodate specific task requirements.
摘要：现代大语模型的培训需要越来越多的计算能力和时间。甚至较小的变体，例如小语言模型（SLM），也需要几天的时间才能在最佳情况下进行训练，通常需要多个GPU。本文探讨了在数小时而不是天数/几周内训练和评估仅解码器的语言模型的方法。我们介绍了\ textit {papaformer}，这是一种仅解码器的变压器体系结构变体，其较低维的并行路径被合并为较大的模型。该论文表明，这些较低维度的路径可以通过不同类型的训练数据进行单独训练，然后合并为一个较大的模型。此方法提供了减少模型参数总数和训练时间的选项，并提高性能。此外，使用并行路径结构的使用开辟了有趣的可能性，以自定义路径以适应特定的任务要求。

Title: SynAdapt: Learning Adaptive Reasoning in Large Language Models via Synthetic Continuous Chain-of-Thought

Authors: Jianwei Wang, Ziming Wu, Fuming Lai, Shaobing Lian, Ziqian Zeng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.00574
Pdf URL: https://arxiv.org/pdf/2508.00574
Copy Paste: [[2508.00574]] SynAdapt: Learning Adaptive Reasoning in Large Language Models via Synthetic Continuous Chain-of-Thought(https://arxiv.org/abs/2508.00574)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: While Chain-of-Thought (CoT) reasoning improves model performance, it incurs significant time costs due to the generation of discrete CoT tokens (DCoT). Continuous CoT (CCoT) offers a more efficient alternative, but existing CCoT methods are hampered by indirect fine-tuning, limited alignment, or inconsistent targets. To overcome these limitations, we propose SynAdapt, an innovative efficient reasoning framework. Specifically, SynAdapt generates the synthetic CCoT to serve as a precise and effective alignment target for LLMs. This synthetic CCoT explicitly guides the LLM to learn CCoT and derive accurate answers directly. Furthermore, relying solely on CCoT is insufficient for solving hard questions. To address this, SynAdapt integrates a difficulty classifier that leverages both question context and CCoT to identify hard questions. CCoT can effectively help identify hard questions after some brief reasoning. We then adaptively prompt the LLM to re-think these hard questions for improved performance. Extensive experimental results across various benchmarks from different difficulty levels strongly demonstrate the effectiveness of our method, achieving the best accuracy-efficiency trade-off.
摘要：虽然经过思考链（COT）推理改善了模型性能，但由于产生离散的COT令牌（DCOT），它会造成大量的时间成本。连续COT（CCOT）提供了更有效的替代方案，但是现有的CCOT方法受到间接微调，有限的对齐或不一致的目标的阻碍。为了克服这些限制，我们提出了一个创新的有效推理框架\ textit {synadapt}。具体而言，\ textit {synadapt}生成合成CCOT，以作为LLMS的精确和有效的对齐目标。该合成CCOT明确指导LLM学习CCOT并直接获得准确的答案。此外，仅依靠CCOT不足以解决难题。为了解决这个问题，\ textIt {synadapt}集成了一个难度分类器，该分类器利用问题上下文和CCOT来识别难题。经过一些简短的推理，CCOT可以有效地帮助识别棘手的问题。然后，我们会自适应提示LLM重新考虑这些难题以提高性能。来自不同难度水平的各种基准的广泛实验结果强烈证明了我们方法的有效性，从而实现了最佳的准确性效率折衷。

Title: A Context-Aware Dual-Metric Framework for Confidence Estimation in Large Language Models

Authors: Mingruo Yuan, Shuyi Zhang, Ben Kao
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.00600
Pdf URL: https://arxiv.org/pdf/2508.00600
Copy Paste: [[2508.00600]] A Context-Aware Dual-Metric Framework for Confidence Estimation in Large Language Models(https://arxiv.org/abs/2508.00600)
Keywords: language model, llm
Abstract: Accurate confidence estimation is essential for trustworthy large language model (LLM) systems, as it empowers the user to determine when to trust outputs and enables reliable deployment in safety-critical applications. Current confidence estimation methods for LLMs neglect the relevance between responses and contextual information, a crucial factor in output quality evaluation, particularly in scenarios where background knowledge is provided. To bridge this gap, we propose CRUX (Context-aware entropy Reduction and Unified consistency eXamination), the first framework that integrates context faithfulness and consistency for confidence estimation via two novel metrics. First, contextual entropy reduction represents data uncertainty with the information gain through contrastive sampling with and without context. Second, unified consistency examination captures potential model uncertainty through the global consistency of the generated answers with and without context. Experiments across three benchmark datasets (CoQA, SQuAD, QuAC) and two domain-specific datasets (BioASQ, EduQG) demonstrate CRUX's effectiveness, achieving higher AUROC than existing baselines.
摘要：准确的置信度估计对于值得信赖的大语言模型（LLMS）系统至关重要，因为它使用户能够确定何时信任输出并在安全至关重要的应用中实现可靠的部署。 LLMS的当前置信度估计方法忽略了响应与上下文信息之间的相关性，这是输出质量评估的关键因素，尤其是在提供背景知识的情况下。为了弥合这一差距，我们提出了CRUX（上下文感知的熵减少和统一的一致性检查），这是第一个框架，它通过两个新颖的指标整合了上下文忠诚度和置信度的一致性。首先，上下文熵的减少代表数据不确定性，而通过有或没有上下文的对比度采样，信息增益。其次，统一的一致性考试通过有或没有上下文的生成的答案的全球一致性捕获了潜在的模型不确定性。在三个基准数据集（COQA，Squad，Quac）和两个特定区域的数据集（Bioasq，Eduqg）上进行的实验表明了关键的有效性，比现有基线达到了最高的AUROC。
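
Illustrative sketch (not from the paper): the abstract describes contextual entropy reduction as the information gain between answers sampled with and without context, without giving the exact definition. The Python below is one simple reading under stated assumptions: answers are grouped by exact match after lowercasing and trimming, and the score is the drop in Shannon entropy of the answer distribution once context is supplied. The function names are hypothetical.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (bits) of the empirical distribution over sampled answers."""
    counts = Counter(a.strip().lower() for a in answers)
    n = len(answers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_reduction(without_context, with_context):
    """Assumed score: how much the answer entropy drops when context is provided."""
    return answer_entropy(without_context) - answer_entropy(with_context)
```

A large positive value suggests that the context resolved most of the model's uncertainty. A value near zero suggests that the context changed little, either because the model was already certain or because it ignored the context.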

Title: Prompting Science Report 3: I'll pay you or I'll kill you -- but will you care?

Authors: Lennart Meincke, Ethan Mollick, Lilach Mollick, Dan Shapiro
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.00614
Pdf URL: https://arxiv.org/pdf/2508.00614
Copy Paste: [[2508.00614]] Prompting Science Report 3: I'll pay you or I'll kill you -- but will you care?(https://arxiv.org/abs/2508.00614)
Keywords: llm, prompt
Abstract: This is the third in a series of short reports that seek to help business, education, and policy leaders understand the technical details of working with AI through rigorous testing. In this report, we investigate two commonly held prompting beliefs: a) offering to tip the AI model and b) threatening the AI model. Tipping was a commonly shared tactic for improving AI performance and threats have been endorsed by Google Founder Sergey Brin (All-In, May 2025, 8:20) who observed that 'models tend to do better if you threaten them,' a claim we subject to empirical testing here. We evaluate model performance on GPQA (Rein et al. 2024) and MMLU-Pro (Wang et al. 2024). We demonstrate two things: (1) threatening or tipping a model generally has no significant effect on benchmark performance; (2) prompt variations can significantly affect performance on a per-question level. However, it is hard to know in advance whether a particular prompting approach will help or harm the LLM's ability to answer any particular question. Taken together, this suggests that simple prompting variations might not be as effective as previously assumed, especially for difficult problems. However, as reported previously (Meincke et al. 2025a), prompting approaches can yield significantly different results for individual questions.
摘要：这是一系列简短报告中的第三份，该报告旨在帮助商业，教育和政策领导者通过严格的测试来了解与AI合作的技术细节。在本报告中，我们调查了两个普遍持有的促使信念：a）提出向AI模型提供小费和b）威胁AI模型。小费是提高人工智能表现的一种通常共同的策略，而谷歌创始人谢尔盖·布林（Sergey Brin）认可了威胁，他（Allin In-in-in-in-in-2025，8：20）观察到，“如果您威胁他们，模型往往会做得更好”，这是我们在这里接受经验测试的主张。我们评估了GPQA（Rein等人2024）和MMLU-PRO（Wang等人2024）上的模型性能。我们证明了两件事： - 威胁或大小款模型通常对基准性能没有显着影响。 - 及时变化会在每次疑问水平上显着影响性能。但是，很难事先知道特定的提示方法是否会帮助或损害LLM回答任何特定问题的能力。综上所述，这表明简单提示的变化可能不像以前假设的那样有效，尤其是对于困难的问题。但是，如前所述（Meincke等，2025a），提示方法可以为单个问题产生明显不同的结果。

Title: DACTYL: Diverse Adversarial Corpus of Texts Yielded from Large Language Models

Authors: Shantanu Thorat, Andrew Caines
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.00619
Pdf URL: https://arxiv.org/pdf/2508.00619
Copy Paste: [[2508.00619]] DACTYL: Diverse Adversarial Corpus of Texts Yielded from Large Language Models(https://arxiv.org/abs/2508.00619)
Keywords: language model, llm
Abstract: Existing AIG (AI-generated) text detectors struggle in real-world settings despite succeeding in internal testing, suggesting that they may not be robust enough. We rigorously examine the machine-learning procedure to build these detectors to address this. Most current AIG text detection datasets focus on zero-shot generations, but little work has been done on few-shot or one-shot generations, where LLMs are given human texts as an example. In response, we introduce the Diverse Adversarial Corpus of Texts Yielded from Language models (DACTYL), a challenging AIG text detection dataset focusing on one-shot/few-shot generations. We also include texts from domain-specific continued-pre-trained (CPT) language models, where we fully train all parameters using a memory-efficient optimization approach. Many existing AIG text detectors struggle significantly on our dataset, indicating a potential vulnerability to one-shot/few-shot and CPT-generated texts. We also train our own classifiers using two approaches: standard binary cross-entropy (BCE) optimization and a more recent approach, deep X-risk optimization (DXO). While BCE-trained classifiers marginally outperform DXO classifiers on the DACTYL test set, the latter excels on out-of-distribution (OOD) texts. In our mock deployment scenario in student essay detection with an OOD student essay dataset, the best DXO classifier outscored the best BCE-trained classifier by 50.56 macro-F1 score points at the lowest false positive rates for both. Our results indicate that DXO classifiers generalize better without overfitting to the test set. Our experiments highlight several areas of improvement for AIG text detectors.
摘要：现有的AIG（AI生成的）文本检测器尽管成功进行了内部测试，但仍在现实世界中挣扎，这表明它们可能不够健全。我们严格检查机器学习程序以构建这些检测器以解决此问题。当前的大多数AIG文本检测数据集都集中在零射者的几代上，但是几乎没有完成的几代人几乎没有完成，其中将LLMS作为例子为例。作为回应，我们介绍了从语言模型（Dactyl）产生的文本的各种对抗性语料库，该文本（Dactyl）是一个挑战性的AIG文本检测数据集，重点是一杆/几代世代。我们还包括来自特定于域的持续训练（CPT）语言模型的文本，在其中我们使用内存有效的优化方法对所有参数进行了充分训练。许多现有的AIG文本检测器在我们的数据集中大为挣扎，这表明潜在的漏洞/少数射击和CPT生成的文本。我们还使用两种方法来训练自己的分类器：标准二进制跨凝结（BCE）优化和最新的方法，即深X风险优化（DXO）。尽管BCE训练的分类器在Dactyl测试集中略优胜于DXO分类器，但后者在分布式（OOD）文本上表现出色。在我们使用OOD学生论文数据集中的学生论文检测中的模拟部署方案中，最佳DXO分类器以最佳的BCE训练分类器的比例以50.56 Macro-f1得分点，两者的误报率最低。我们的结果表明，DXO分类器在不适合测试集的情况下更好地概括了。我们的实验突出了AIG文本探测器的几个改进领域。

Title: Medical Reasoning in the Era of LLMs: A Systematic Review of Enhancement Techniques and Applications

Authors: Wenxuan Wang, Zizhan Ma, Meidan Ding, Shiyi Zheng, Shengyuan Liu, Jie Liu, Jiaming Ji, Wenting Chen, Xiang Li, Linlin Shen, Yixuan Yuan
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.00669
Pdf URL: https://arxiv.org/pdf/2508.00669
Copy Paste: [[2508.00669]] Medical Reasoning in the Era of LLMs: A Systematic Review of Enhancement Techniques and Applications(https://arxiv.org/abs/2508.00669)
Keywords: language model, llm, prompt, agent
Abstract: The proliferation of Large Language Models (LLMs) in medicine has enabled impressive capabilities, yet a critical gap remains in their ability to perform systematic, transparent, and verifiable reasoning, a cornerstone of clinical practice. This has catalyzed a shift from single-step answer generation to the development of LLMs explicitly designed for medical reasoning. This paper provides the first systematic review of this emerging field. We propose a taxonomy of reasoning enhancement techniques, categorized into training-time strategies (e.g., supervised fine-tuning, reinforcement learning) and test-time mechanisms (e.g., prompt engineering, multi-agent systems). We analyze how these techniques are applied across different data modalities (text, image, code) and in key clinical applications such as diagnosis, education, and treatment planning. Furthermore, we survey the evolution of evaluation benchmarks from simple accuracy metrics to sophisticated assessments of reasoning quality and visual interpretability. Based on an analysis of 60 seminal studies from 2022-2025, we conclude by identifying critical challenges, including the faithfulness-plausibility gap and the need for native multimodal reasoning, and outlining future directions toward building efficient, robust, and sociotechnically responsible medical AI.
摘要：大型语言模型（LLM）在医学中的扩散具有令人印象深刻的能力，但是临床实践的基石是执行系统的，透明和可验证的推理的能力。这催化了从单步答案生成转变为明确设计用于医学推理的LLM的发展。本文提供了对该新兴领域的首次系统评价。我们提出了推理增强技术的分类法，分类为培训时间策略（例如，受监督的微调，强化学习）和测试时间机制（例如，及时的工程，多机构系统）。我们分析了如何在不同的数据方式（文本，图像，代码）以及关键临床应用（例如诊断，教育和治疗计划）中应用这些技术。此外，我们调查了评估基准从简单准确度指标到对推理质量和视觉解释性的复杂评估的演变。基于对2022 - 2025年60项开创性研究的分析，我们结论是确定关键挑战，包括忠实度 - 合格性差距以及对本地多峰推理的需求，并概述了建立有效，健壮和社会技术负责的医疗AI的未来方向。

Title: MELAC: Massive Evaluation of Large Language Models with Alignment of Culture in Persian Language

Authors: Farhan Farsi, Farnaz Aghababaloo, Shahriar Shariati Motlagh, Parsa Ghofrani, MohammadAli SadraeiJavaheri, Shayan Bali, Amirhossein Shabani, Farbod Bijary, Ghazal Zamaninejad, AmirMohammad Salehoof, Saeedeh Momtazi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.00673
Pdf URL: https://arxiv.org/pdf/2508.00673
Copy Paste: [[2508.00673]] MELAC: Massive Evaluation of Large Language Models with Alignment of Culture in Persian Language(https://arxiv.org/abs/2508.00673)
Keywords: language model, llm
Abstract: As large language models (LLMs) become increasingly embedded in our daily lives, evaluating their quality and reliability across diverse contexts has become essential. While comprehensive benchmarks exist for assessing LLM performance in English, there remains a significant gap in evaluation resources for other languages. Moreover, because most LLMs are trained primarily on data rooted in European and American cultures, they often lack familiarity with non-Western cultural contexts. To address this limitation, our study focuses on the Persian language and Iranian culture. We introduce 19 new evaluation datasets specifically designed to assess LLMs on topics such as Iranian law, Persian grammar, Persian idioms, and university entrance exams. Using these datasets, we benchmarked 41 prominent LLMs, aiming to bridge the existing cultural and linguistic evaluation gap in the field.
摘要：随着大型语言模型（LLM）越来越多地融入我们的日常生活中，评估它们在各种环境中的质量和可靠性已变得至关重要。尽管存在用于评估英语LLM性能的全面基准，但其他语言的评估资源仍然存在很大的差距。此外，由于大多数LLM的培训主要是基于植根于欧美文化的数据，因此他们通常对非西方文化背景感到熟悉。为了解决这一局限性，我们的研究集中于波斯语和伊朗文化。我们介绍了19个新的评估数据集，专门旨在评估伊朗法律，波斯语法，波斯语成语和大学入学考试等主题的LLM。使用这些数据集，我们对41个突出的LLM进行了基准测试，旨在弥合现有的文化和语言评估差距。

Title: Team "better_call_claude": Style Change Detection using a Sequential Sentence Pair Classifier

Authors: Gleb Schmidt, Johannes Römisch, Mariia Halchynska, Svetlana Gorovaia, Ivan P. Yamshchikov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.00675
Pdf URL: https://arxiv.org/pdf/2508.00675
Copy Paste: [[2508.00675]] Team "better_call_claude": Style Change Detection using a Sequential Sentence Pair Classifier(https://arxiv.org/abs/2508.00675)
Keywords: language model
Abstract: Style change detection - identifying the points in a document where writing style shifts - remains one of the most important and challenging problems in computational authorship analysis. At PAN 2025, the shared task challenges participants to detect style switches at the most fine-grained level: individual sentences. The task spans three datasets, each designed with controlled and increasing thematic variety within documents. We propose to address this problem by modeling the content of each problem instance - that is, a series of sentences - as a whole, using a Sequential Sentence Pair Classifier (SSPC). The architecture leverages a pre-trained language model (PLM) to obtain representations of individual sentences, which are then fed into a bidirectional LSTM (BiLSTM) to contextualize them within the document. The BiLSTM-produced vectors of adjacent sentences are concatenated and passed to a multi-layer perceptron for prediction per adjacency. Building on the work of previous PAN participants and on classical text segmentation, the approach is relatively conservative and lightweight. Nevertheless, it proves effective in leveraging contextual information and addressing what is arguably the most challenging aspect of this year's shared task: the notorious problem of "stylistically shallow", short sentences that are prevalent in the proposed benchmark data. Evaluated on the official PAN-2025 test datasets, the model achieves strong macro-F1 scores of 0.923, 0.828, and 0.724 on the EASY, MEDIUM, and HARD data, respectively, outperforming not only the official random baselines but also a much more challenging one: claude-3.7-sonnet's zero-shot performance.
摘要：样式变更检测 - 识别文档中写作样式变化的点 - 仍然是计算作者身份分析中最重要，最具挑战性的问题之一。在PAN 2025，共同的任务挑战参与者以最细粒度的级别检测样式开关：单个句子。该任务涵盖了三个数据集，每个数据集都在文档中具有受控和增加的主题品种设计。我们建议通过使用顺序句子对分类器（SSPC）建模每个问题实例的内容来解决此问题。该体系结构利用预先训练的语言模型（PLM）获取单个句子的表示，然后将其馈入双向LSTM（Bilstm），以在文档中将其上下文化。相邻句子的Bilstm生产的向量是串联的，并传递给多层感知器，以进行每个邻接的预测。在以前的PAN参与者经典文本细分的基础上，该方法相对保守和轻巧。然而，事实证明，它可以有效地利用上下文信息并解决今年共同任务中最具挑战性的方面是什么：臭名昭著的“风格上浅层”的问题，即提议的基准数据中普遍存在的简短句子。该模型在官方PAN-2025测试数据集上进行了评估，在简单，中和硬数据上分别获得了0.923、0.828和0.724的强大宏F1分数，不仅均超过了官方的随机基线，而且胜过更具挑战性的官方：Claude-3.7-3.7-Sonnet的零零型效果。

Title: Better Call Claude: Can LLMs Detect Changes of Writing Style?

Authors: Johannes Römisch, Svetlana Gorovaia, Mariia Halchynska, Gleb Schmidt, Ivan P. Yamshchikov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.00680
Pdf URL: https://arxiv.org/pdf/2508.00680
Copy Paste: [[2508.00680]] Better Call Claude: Can LLMs Detect Changes of Writing Style?(https://arxiv.org/abs/2508.00680)
Keywords: language model, llm
Abstract: This article explores the zero-shot performance of state-of-the-art large language models (LLMs) on one of the most challenging tasks in authorship analysis: sentence-level style change detection. Benchmarking four LLMs on the official PAN 2024 and 2025 "Multi-Author Writing Style Analysis" datasets, we present several observations. First, state-of-the-art generative models are sensitive to variations in writing style - even at the granular level of individual sentences. Second, their accuracy establishes a challenging baseline for the task, outperforming suggested baselines of the PAN competition. Finally, we explore the influence of semantics on model predictions and present evidence suggesting that the latest generation of LLMs may be more sensitive to content-independent and purely stylistic signals than previously reported.
摘要：本文探讨了最先进的大语言模型（LLMS）在作者身份分析中最具挑战性的任务之一：句子级风格变更检测中的零拍摄性能。在官方PAN 〜2024和2025“多作者写作样式分析”数据集上对四个LLM进行了基准测试，我们提出了一些观察结果。首先，即使在单个句子的颗粒状水平上，最新的生成模型也对写作风格的变化很敏感。其次，他们的准确性为任务建立了具有挑战性的基准，表现优于PAN竞赛的建议基线。最后，我们探讨语义对模型预测的影响和现有证据表明，最新一代LLM可能比以前报道的与内容无关和纯粹的风格信号更敏感。

Title: NyayaRAG: Realistic Legal Judgment Prediction with RAG under the Indian Common Law System

Authors: Shubham Kumar Nigam, Balaramamahanthi Deepak Patnaik, Shivam Mishra, Ajay Varghese Thomas, Noel Shallum, Kripabandhu Ghosh, Arnab Bhattacharya
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2508.00709
Pdf URL: https://arxiv.org/pdf/2508.00709
Copy Paste: [[2508.00709]] NyayaRAG: Realistic Legal Judgment Prediction with RAG under the Indian Common Law System(https://arxiv.org/abs/2508.00709)
Keywords: llm, retrieval-augmented generation
Abstract: Legal Judgment Prediction (LJP) has emerged as a key area in AI for law, aiming to automate judicial outcome forecasting and enhance interpretability in legal reasoning. While previous approaches in the Indian context have relied on internal case content such as facts, issues, and reasoning, they often overlook a core element of common law systems, which is reliance on statutory provisions and judicial precedents. In this work, we propose NyayaRAG, a Retrieval-Augmented Generation (RAG) framework that simulates realistic courtroom scenarios by providing models with factual case descriptions, relevant legal statutes, and semantically retrieved prior cases. NyayaRAG evaluates the effectiveness of these combined inputs in predicting court decisions and generating legal explanations using a domain-specific pipeline tailored to the Indian legal system. We assess performance across various input configurations using both standard lexical and semantic metrics as well as LLM-based evaluators such as G-Eval. Our results show that augmenting factual inputs with structured legal knowledge significantly improves both predictive accuracy and explanation quality.
摘要：法律判断预测（LJP）已成为AI法律的关键领域，旨在使司法结果预测自动化并增强法律推理中的解释性。尽管印度背景下以前的方法依赖于事实，问题和推理等内部案例内容，但它们经常忽略普通法系统的核心要素，这依赖于法定规定和司法先例。在这项工作中，我们提出了Nyayarag，这是一个检索型发电框架（RAG）框架，该框架通过为模型提供事实案例描述，相关法律法规和语义检索事先案例，从而模拟现实的法庭场景。 Nyayarag评估了这些联合意见在预测法院裁决中的有效性，并使用针对印度法律制度量身定制的特定领域的管道来产生法律解释。我们使用标准词汇和语义指标以及基于LLM的评估者（例如G-eval）评估各种输入配置的性能。我们的结果表明，具有结构化法律知识的增强事实投入可显着提高预测准确性和解释质量。

Title: Dynamically Adaptive Reasoning via LLM-Guided MCTS for Efficient and Context-Aware KGQA

Authors: Yingxu Wang, Shiqi Fan, Mengzhu Wang, Siwei Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.00719
Pdf URL: https://arxiv.org/pdf/2508.00719
Copy Paste: [[2508.00719]] Dynamically Adaptive Reasoning via LLM-Guided MCTS for Efficient and Context-Aware KGQA(https://arxiv.org/abs/2508.00719)
Keywords: language model, llm, prompt
Abstract: Knowledge Graph Question Answering (KGQA) aims to interpret natural language queries and perform structured reasoning over knowledge graphs by leveraging their relational and semantic structures to retrieve accurate answers. Recent KGQA methods primarily follow either the retrieve-then-reason paradigm, relying on GNNs or heuristic rules for static path extraction, or dynamic path generation strategies that use large language models (LLMs) with prompting to jointly perform retrieval and reasoning. However, the former suffers from limited adaptability due to static path extraction and lack of contextual refinement, while the latter incurs high computational costs and struggles with accurate path evaluation due to reliance on fixed scoring functions and extensive LLM calls. To address these issues, this paper proposes Dynamically Adaptive MCTS-based Reasoning (DAMR), a novel framework that integrates symbolic search with adaptive path evaluation for efficient and context-aware KGQA. DAMR employs a Monte Carlo Tree Search (MCTS) backbone guided by an LLM-based planner, which selects top-$k$ relevant relations at each step to reduce search space. To improve path evaluation accuracy, we introduce a lightweight Transformer-based scorer that performs context-aware plausibility estimation by jointly encoding the question and relation sequence through cross-attention, enabling the model to capture fine-grained semantic shifts during multi-hop reasoning. Furthermore, to alleviate the scarcity of high-quality supervision, DAMR incorporates a dynamic pseudo-path refinement mechanism that periodically generates training signals from partial paths explored during search, allowing the scorer to continuously adapt to the evolving distribution of reasoning trajectories. Extensive experiments on multiple KGQA benchmarks show that DAMR significantly outperforms state-of-the-art methods.
摘要：知识图应答（KGQA）旨在通过利用其关系和语义结构来检索准确的答案来解释自然语言查询并通过知识图执行结构性推理。最近的KGQA方法主要遵循检索到期的范式，依靠GNN或启发式规则进行静态路径提取，或使用使用大语言模型（LLMS）的动态路径生成策略，以促使共同执行检索和推理。但是，前者由于静态路径提取和缺乏上下文精致而受到有限的适应能力，而后者则造成了高计算成本，并且由于依赖固定得分功能和广泛的LLM调用而导致的准确路径评估斗争。为了解决这些问题，本文提出了动态自适应MCT的推理（DAMR），这是一个新颖的框架，将符号搜索与自适应路径评估相结合，以获得有效和上下文感知的KGQA。 Damr采用蒙特卡洛树搜索（MCTS）骨干，以基于LLM的计划者为指导，该固定器在每个步骤中选择顶级$ K $相关关系以减少搜索空间。为了提高路径评估的准确性，我们引入了一个基于轻量变压器的得分手，该得分手通过交叉注意通过共同编码问题和关系序列来执行上下文感知的合理性估计，从而使模型能够在多跳跃推理过程中捕获细粒度的语义转移。此外，为了减轻高质量监督的稀缺性，Damr结合了一种动态的伪路径改进机制，该机制定期从搜索过程中探索的部分路径产生训练信号，从而使得得分手可以不断适应推理轨迹的分布。对多个KGQA基准测试的广泛实验表明，DAMR的表现明显优于最先进的方法。
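
Illustrative sketch (not from the paper): the abstract describes an MCTS backbone in which a planner keeps only the top-k relations at each step. The Python below shows those two ingredients in their textbook form: pruning candidates to the k highest-scoring relations, and choosing which surviving relation to expand with the standard UCT rule. The relation scores stand in for the LLM-based planner's output, and the function names, example relation names, and exploration constant `c` are hypothetical. The paper's learned path scorer and pseudo-path refinement are not modelled here.

```python
import math


def top_k_relations(scores, k):
    """Keep the k highest-scoring candidate relations (scores from an assumed planner)."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [relation for relation, _ in ranked[:k]]


def uct(total_value, visits, parent_visits, c=1.4):
    """Standard UCT score: average value plus an exploration bonus."""
    if visits == 0:
        return float("inf")  # always try unvisited children first
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)


def select_relation(stats, parent_visits):
    """Pick the relation to expand; stats maps relation -> (total_value, visits)."""
    return max(stats, key=lambda r: uct(*stats[r], parent_visits))
```

Pruning to k relations bounds the branching factor of the search tree, so the number of rollouts needed per hop grows with k rather than with the full relation vocabulary of the knowledge graph.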

Title: Out-of-Context Abduction: LLMs Make Inferences About Procedural Data Leveraging Declarative Facts in Earlier Training Data

Authors: Sohaib Imran, Rob Lamb, Peter M. Atkinson
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.00741
Pdf URL: https://arxiv.org/pdf/2508.00741
Copy Paste: [[2508.00741]] Out-of-Context Abduction: LLMs Make Inferences About Procedural Data Leveraging Declarative Facts in Earlier Training Data(https://arxiv.org/abs/2508.00741)
Keywords: language model, gpt, llm, chat
Abstract: Large language models (LLMs) are trained on large corpora, yet it is unclear whether they can reason about the information present within their training data. We design experiments to study out-of-context abduction in LLMs, the ability to infer the most plausible explanations for observations using relevant facts present in training data. We train treatment LLMs on names and behavior descriptions of fictitious chatbots, but not on examples of dialogue with the chatbots. We find that OpenAI's GPT 4o LLM can correctly infer at least one chatbot's name after observing example responses characteristic of that chatbot. We also find that previously training GPT 4o on descriptions of a chatbot's behavior allows it to display behaviors more characteristic of the chatbot when iteratively trained to display such behaviors. Our results have implications for situational awareness in LLMs and, therefore, for AI safety.

Title: Applying Psychometrics to Large Language Model Simulated Populations: Recreating the HEXACO Personality Inventory Experiment with Generative Agents

Authors: Sarah Mercer, Daniel P. Martin, Phil Swatton
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.00742
Pdf URL: https://arxiv.org/pdf/2508.00742
Copy Paste: [[2508.00742]] Applying Psychometrics to Large Language Model Simulated Populations: Recreating the HEXACO Personality Inventory Experiment with Generative Agents(https://arxiv.org/abs/2508.00742)
Keywords: language model, gpt, agent
Abstract: Generative agents powered by Large Language Models demonstrate human-like characteristics through sophisticated natural language interactions. Their ability to assume roles and personalities based on predefined character biographies has positioned them as cost-effective substitutes for human participants in social science research. This paper explores the validity of such persona-based agents in representing human populations; we recreate the HEXACO personality inventory experiment by surveying 310 GPT-4-powered agents, conducting factor analysis on their responses, and comparing these results to the original findings presented by Ashton, Lee, & Goldberg in 2004. We found that 1) a coherent and reliable personality structure was recoverable from the agents' responses, demonstrating partial alignment with the HEXACO framework; 2) the derived personality dimensions were consistent and reliable within GPT-4 when coupled with a sufficiently curated population; and 3) cross-model analysis revealed variability in personality profiling, suggesting model-specific biases and limitations. We discuss the practical considerations and challenges encountered during the experiment. This study contributes to the ongoing discourse on the potential benefits and limitations of using generative agents in social science research and provides useful guidance on designing consistent and representative agent personas to maximise coverage and representation of human personality traits.

Title: Agentic large language models improve retrieval-based radiology question answering

Authors: Sebastian Wind, Jeta Sopa, Daniel Truhn, Mahshad Lotfinia, Tri-Thien Nguyen, Keno Bressem, Lisa Adams, Mirabela Rusu, Harald Köstler, Gerhard Wellein, Andreas Maier, Soroosh Tayebi Arasteh
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.00743
Pdf URL: https://arxiv.org/pdf/2508.00743
Copy Paste: [[2508.00743]] Agentic large language models improve retrieval-based radiology question answering(https://arxiv.org/abs/2508.00743)
Keywords: language model, llm, hallucination, prompt, retrieval-augmented generation, agent
Abstract: Clinical decision-making in radiology increasingly benefits from artificial intelligence (AI), particularly through large language models (LLMs). However, traditional retrieval-augmented generation (RAG) systems for radiology question answering (QA) typically rely on single-step retrieval, limiting their ability to handle complex clinical reasoning tasks. Here we propose an agentic RAG framework enabling LLMs to autonomously decompose radiology questions, iteratively retrieve targeted clinical evidence from Radiopaedia, and dynamically synthesize evidence-based responses. We evaluated 24 LLMs spanning diverse architectures, parameter scales (0.5B to >670B), and training paradigms (general-purpose, reasoning-optimized, clinically fine-tuned), using 104 expert-curated radiology questions from previously established RSNA-RadioQA and ExtendedQA datasets. Agentic retrieval significantly improved mean diagnostic accuracy over zero-shot prompting (73% vs. 64%; P<0.001) and conventional online RAG (73% vs. 68%; P<0.001). The greatest gains occurred in mid-sized models (e.g., Mistral Large improved from 72% to 81%) and small-scale models (e.g., Qwen 2.5-7B improved from 55% to 71%), while very large models (>200B parameters) demonstrated minimal changes (<2% improvement). Additionally, agentic retrieval reduced hallucinations (mean 9.4%) and retrieved clinically relevant context in 46% of cases, substantially aiding factual grounding. Even clinically fine-tuned models exhibited meaningful improvements (e.g., MedGemma-27B improved from 71% to 81%), indicating complementary roles of retrieval and fine-tuning. These results highlight the potential of agentic frameworks to enhance factuality and diagnostic accuracy in radiology QA, particularly among mid-sized LLMs, warranting future studies to validate their clinical utility.
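
The decompose, retrieve, and synthesize cycle named in the abstract above can be pictured as a bounded loop. The sketch below is only an illustration under stated assumptions: `decompose`, `retrieve`, and `synthesize` are hypothetical stubs for the LLM and retrieval calls, and `CORPUS` holds placeholder passages rather than Radiopaedia content.

```python
# Placeholder passages standing in for an external knowledge source (hypothetical).
CORPUS = {
    "finding a": "Passage describing finding A.",
    "finding b": "Passage describing finding B.",
    "finding c": "Passage describing finding C.",
}

def decompose(question):
    """Stub for the LLM step that splits a question into retrieval queries."""
    return [key for key in CORPUS if key in question.lower()]

def retrieve(query):
    """Stub retriever: exact-key lookup; returns None when nothing matches."""
    return CORPUS.get(query)

def synthesize(question, evidence):
    """Stub for the LLM answer step: reports the evidence it was given."""
    if not evidence:
        return "No supporting evidence retrieved."
    return "Answer based on: " + " ".join(evidence)

def agentic_answer(question, max_steps=4):
    pending = decompose(question)
    evidence = []
    for _ in range(max_steps):  # bounded loop: at most one retrieval per step
        if not pending:
            break
        passage = retrieve(pending.pop(0))
        if passage is not None:
            evidence.append(passage)
    return synthesize(question, evidence), evidence

answer, evidence = agentic_answer("Which diagnosis fits finding A together with finding C?")
```

A single-step RAG baseline would correspond to calling `retrieve` once on the whole question; the loop differs in that each sub-query gets its own lookup before the answer is composed.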

Title: GLiDRE: Generalist Lightweight model for Document-level Relation Extraction

Authors: Robin Armingaud, Romaric Besançon
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.00757
Pdf URL: https://arxiv.org/pdf/2508.00757
Copy Paste: [[2508.00757]] GLiDRE: Generalist Lightweight model for Document-level Relation Extraction(https://arxiv.org/abs/2508.00757)
Keywords: language model
Abstract: Relation Extraction (RE) is a fundamental task in Natural Language Processing, and its document-level variant poses significant challenges due to the need to model complex interactions between entities across sentences. Current approaches, largely based on the ATLOP architecture, are commonly evaluated on benchmarks like DocRED and Re-DocRED. However, their performance in zero-shot or few-shot settings remains largely underexplored due to the task's complexity. Recently, the GLiNER model has shown that a compact NER model can outperform much larger Large Language Models. With a similar motivation, we introduce GLiDRE, a new model for document-level relation extraction that builds on the key ideas of GLiNER. We benchmark GLiDRE against state-of-the-art models across various data settings on the Re-DocRED dataset. Our results demonstrate that GLiDRE achieves state-of-the-art performance in few-shot scenarios. Our code is publicly available.

Title: MMBERT: Scaled Mixture-of-Experts Multimodal BERT for Robust Chinese Hate Speech Detection under Cloaking Perturbations

Authors: Qiyao Xue, Yuchen Dou, Ryan Shi, Xiang Lorraine Li, Wei Gao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.00760
Pdf URL: https://arxiv.org/pdf/2508.00760
Copy Paste: [[2508.00760]] MMBERT: Scaled Mixture-of-Experts Multimodal BERT for Robust Chinese Hate Speech Detection under Cloaking Perturbations(https://arxiv.org/abs/2508.00760)
Keywords: language model, llm
Abstract: Hate speech detection on Chinese social networks presents distinct challenges, particularly due to the widespread use of cloaking techniques designed to evade conventional text-based detection systems. Although large language models (LLMs) have recently improved hate speech detection capabilities, the majority of existing work has concentrated on English datasets, with limited attention given to multimodal strategies in the Chinese context. In this study, we propose MMBERT, a novel BERT-based multimodal framework that integrates textual, speech, and visual modalities through a Mixture-of-Experts (MoE) architecture. To address the instability associated with directly integrating MoE into BERT-based models, we develop a progressive three-stage training paradigm. MMBERT incorporates modality-specific experts, a shared self-attention mechanism, and a router-based expert allocation strategy to enhance robustness against adversarial perturbations. Empirical results on several Chinese hate speech datasets show that MMBERT significantly surpasses fine-tuned BERT-based encoder models, fine-tuned LLMs, and LLMs utilizing in-context learning approaches.
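
Router-based expert allocation, as mentioned in the MMBERT abstract, is commonly realised as top-k softmax gating. The sketch below shows that generic scheme in plain Python; the abstract does not specify MMBERT's exact routing rule, so the top-k choice, the renormalisation, and the three toy "experts" are assumptions made for illustration.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_logits, top_k=2):
    """Return {expert_index: weight} for the top_k experts, weights summing to 1."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}

def moe_forward(x, experts, router_logits, top_k=2):
    """Weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(x)
    for i, w in route(router_logits, top_k).items():
        y = experts[i](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

# Hypothetical modality experts; real experts would be feed-forward sub-networks.
experts = [
    lambda v: [2.0 * a for a in v],  # "text" expert
    lambda v: [a + 1.0 for a in v],  # "speech" expert
    lambda v: [-a for a in v],       # "visual" expert
]
logits = [2.0, 1.0, -1.0]  # would come from a learned router in practice
weights = route(logits, top_k=2)
output = moe_forward([1.0, 2.0], experts, logits, top_k=2)
```

Only the selected experts are evaluated, which is why sparse routing keeps per-token compute roughly constant as more experts are added.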

Title: ITUNLP at SemEval-2025 Task 8: Question-Answering over Tabular Data: A Zero-Shot Approach using LLM-Driven Code Generation

Authors: Atakan Site, Emre Hakan Erdemir, Gülşen Eryiğit
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.00762
Pdf URL: https://arxiv.org/pdf/2508.00762
Copy Paste: [[2508.00762]] ITUNLP at SemEval-2025 Task 8: Question-Answering over Tabular Data: A Zero-Shot Approach using LLM-Driven Code Generation(https://arxiv.org/abs/2508.00762)
Keywords: language model, llm, prompt
Abstract: This paper presents our system for SemEval-2025 Task 8: DataBench, Question-Answering over Tabular Data. The primary objective of this task is to perform question answering on given tabular datasets from diverse domains under two subtasks: DataBench QA (Subtask I) and DataBench Lite QA (Subtask II). To tackle both subtasks, we developed a zero-shot solution with a particular emphasis on leveraging Large Language Model (LLM)-based code generation. Specifically, we propose a Python code generation framework utilizing state-of-the-art open-source LLMs to generate executable Pandas code via optimized prompting strategies. Our experiments reveal that different LLMs exhibit varying levels of effectiveness in Python code generation. Additionally, results show that Python code generation achieves superior performance in tabular question answering compared to alternative approaches. Although our ranking among zero-shot systems is unknown at the time of this paper's submission, our system achieved eighth place in Subtask I and sixth place in Subtask II among the 30 systems that outperformed the baseline in the open-source models category.

Title: Do They Understand Them? An Updated Evaluation on Nonbinary Pronoun Handling in Large Language Models

Authors: Xushuo Tang, Yi Ding, Zhengyi Yang, Yin Chen, Yongrui Gu, Wenke Yang, Mingchen Ju, Xin Cao, Yongfei Liu, Wenjie Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.00788
Pdf URL: https://arxiv.org/pdf/2508.00788
Copy Paste: [[2508.00788]] Do They Understand Them? An Updated Evaluation on Nonbinary Pronoun Handling in Large Language Models(https://arxiv.org/abs/2508.00788)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) are increasingly deployed in sensitive contexts where fairness and inclusivity are critical. Pronoun usage, especially concerning gender-neutral and neopronouns, remains a key challenge for responsible AI. Prior work, such as the MISGENDERED benchmark, revealed significant limitations in earlier LLMs' handling of inclusive pronouns, but was constrained to outdated models and limited evaluations. In this study, we introduce MISGENDERED+, an extended and updated benchmark for evaluating LLMs' pronoun fidelity. We benchmark five representative LLMs, GPT-4o, Claude 4, DeepSeek-V3, Qwen Turbo, and Qwen2.5, across zero-shot, few-shot, and gender identity inference. Our results show notable improvements compared with previous studies, especially in binary and gender-neutral pronoun accuracy. However, accuracy on neopronouns and reverse inference tasks remains inconsistent, underscoring persistent gaps in identity-sensitive reasoning. We discuss implications, model-specific observations, and avenues for future inclusive AI research.

Title: Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models

Authors: Jinsong Li, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jiaqi Wang, Dahua Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.00819
Pdf URL: https://arxiv.org/pdf/2508.00819
Copy Paste: [[2508.00819]] Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models(https://arxiv.org/abs/2508.00819)
Keywords: language model, llm
Abstract: Diffusion Large Language Models (DLLMs) are emerging as a powerful alternative to the dominant Autoregressive Large Language Models, offering efficient parallel generation and capable global context modeling. However, the practical application of DLLMs is hindered by a critical architectural constraint: the need for a statically predefined generation length. This static length allocation leads to a problematic trade-off: insufficient lengths cripple performance on complex tasks, while excessive lengths incur significant computational overhead and sometimes result in performance degradation. While the inference framework is rigid, we observe that the model itself possesses internal signals that correlate with the optimal response length for a given task. To bridge this gap, we leverage these latent signals and introduce DAEDAL, a novel training-free denoising strategy that enables Dynamic Adaptive Length Expansion for Diffusion Large Language Models. DAEDAL operates in two phases: 1) Before the denoising process, DAEDAL starts from a short initial length and iteratively expands it to a coarse task-appropriate length, guided by a sequence completion metric. 2) During the denoising process, DAEDAL dynamically intervenes by pinpointing and expanding insufficient generation regions through mask token insertion, ensuring the final output is fully developed. Extensive experiments on DLLMs demonstrate that DAEDAL achieves performance comparable, and in some cases superior, to meticulously tuned fixed-length baselines, while simultaneously enhancing computational efficiency by achieving a higher effective token ratio. By resolving the static length constraint, DAEDAL unlocks new potential for DLLMs, bridging a critical gap with their Autoregressive counterparts and paving the way for more efficient and capable generation.
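
The two phases described in the DAEDAL abstract can be pictured with a toy sketch. Nothing here comes from the paper's code: `completion_signal` is a hypothetical stand-in for the model-internal sequence completion metric, and the thresholds, step sizes, and per-token confidences are invented for illustration.

```python
MASK = "<mask>"

def completion_signal(length, needed):
    """Hypothetical stand-in for the model's signal that `length` tokens suffice.
    A real system would read this from the model; here it is a simple ramp."""
    return min(1.0, length / needed)

def initial_length(needed, start=4, step=4, threshold=0.9, max_len=64):
    """Phase 1: grow the all-mask canvas until the completion signal clears the threshold."""
    length = start
    while completion_signal(length, needed) < threshold and length < max_len:
        length = min(max_len, length + step)
    return length

def expand_low_confidence(tokens, confidences, threshold=0.3, extra=2):
    """Phase 2 (one denoising step): swap each low-confidence position for
    extra + 1 mask tokens so that region can be regenerated at greater length."""
    out = []
    for tok, conf in zip(tokens, confidences):
        if conf < threshold:
            out.extend([MASK] * (extra + 1))
        else:
            out.append(tok)
    return out

length = initial_length(needed=18)
canvas = expand_low_confidence(["a", "b", "c"], [0.9, 0.1, 0.8])
```

With these toy settings the canvas grows in steps of four until the ramp reaches the threshold, and the single low-confidence token in the second call is replaced by three mask tokens.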