2025-08-05

Title: FECT: Factuality Evaluation of Interpretive AI-Generated Claims in Contact Center Conversation Transcripts

Authors: Hagyeong Shin, Binoy Robin Dalal, Iwona Bialynicka-Birula, Navjot Matharu, Ryan Muir, Xingwei Yang, Samuel W. K. Wong
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.00889
Pdf URL: https://arxiv.org/pdf/2508.00889
Copy Paste: [[2508.00889]] FECT: Factuality Evaluation of Interpretive AI-Generated Claims in Contact Center Conversation Transcripts(https://arxiv.org/abs/2508.00889)
Keywords: language model, llm, hallucination, prompt
Abstract: Large language models (LLMs) are known to hallucinate, producing natural language outputs that are not grounded in the input, reference materials, or real-world knowledge. In enterprise applications where AI features support business decisions, such hallucinations can be particularly detrimental. LLMs that analyze and summarize contact center conversations introduce a unique set of challenges for factuality evaluation, because ground-truth labels often do not exist for analytical interpretations about sentiments captured in the conversation and root causes of the business problems. To remedy this, we first introduce a 3D (Decompose, Decouple, Detach) paradigm in the human annotation guideline and the LLM-judges' prompt to ground the factuality labels in linguistically-informed evaluation criteria. We then introduce FECT, a novel benchmark dataset for Factuality Evaluation of Interpretive AI-Generated Claims in Contact Center Conversation Transcripts, labeled under our 3D paradigm. Lastly, we report our findings from aligning LLM-judges on the 3D paradigm. Overall, our findings contribute a new approach for automatically evaluating the factuality of outputs generated by an AI system for analyzing contact center conversations.

Title: XAutoLM: Efficient Fine-Tuning of Language Models via Meta-Learning and AutoML

Authors: Ernesto L. Estevanell-Valladares, Suilan Estevez-Velarde, Yoan Gutiérrez, Andrés Montoyo, Ruslan Mitkov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.00924
Pdf URL: https://arxiv.org/pdf/2508.00924
Copy Paste: [[2508.00924]] XAutoLM: Efficient Fine-Tuning of Language Models via Meta-Learning and AutoML(https://arxiv.org/abs/2508.00924)
Keywords: language model
Abstract: Experts in machine learning leverage domain knowledge to navigate decisions in model selection, hyperparameter optimisation, and resource allocation. This is particularly critical for fine-tuning language models (LMs), where repeated trials incur substantial computational overhead and environmental impact. However, no existing automated framework simultaneously tackles the entire model selection and HPO task for resource-efficient LM fine-tuning. We introduce XAutoLM, a meta-learning-augmented AutoML framework that reuses past experiences to optimise discriminative and generative LM fine-tuning pipelines efficiently. XAutoLM learns from stored successes and failures by extracting task- and system-level meta-features to bias its sampling toward fruitful configurations and away from costly dead ends. On four text classification and two question-answering benchmarks, XAutoLM surpasses zero-shot optimiser's peak F1 on five of six tasks, cuts mean evaluation time by up to 4.5x, reduces error ratios by up to sevenfold, and uncovers up to 50% more pipelines above the zero-shot Pareto front. In contrast, simpler memory-based baselines suffer negative transfer. We release XAutoLM and our experience store to catalyse resource-efficient, Green AI fine-tuning in the NLP community.
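Note on the Pareto front mentioned in the XAutoLM abstract: as a minimal sketch, assuming each fine-tuning pipeline is scored by F1 (higher is better) and evaluation time (lower is better), the Python below keeps only the non-dominated pipelines. The `Pipeline` type, the pipeline names, and all numbers are hypothetical illustrations, not values or code from the paper.

```python
from typing import List, NamedTuple


class Pipeline(NamedTuple):
    name: str           # hypothetical pipeline label
    f1: float           # higher is better
    eval_time_s: float  # lower is better


def dominates(a: Pipeline, b: Pipeline) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.f1 >= b.f1 and a.eval_time_s <= b.eval_time_s
    strictly_better = a.f1 > b.f1 or a.eval_time_s < b.eval_time_s
    return no_worse and strictly_better


def pareto_front(pipelines: List[Pipeline]) -> List[Pipeline]:
    """Return the pipelines that no other pipeline dominates."""
    return [p for p in pipelines if not any(dominates(q, p) for q in pipelines)]


# Made-up candidates for illustration only.
candidates = [
    Pipeline("a", 0.80, 120.0),
    Pipeline("b", 0.85, 300.0),  # same F1 as d but slower, so dominated by d
    Pipeline("c", 0.78, 200.0),  # worse F1 and slower than a, so dominated by a
    Pipeline("d", 0.85, 250.0),
]
front = pareto_front(candidates)
print([p.name for p in front])
```

Under this reading, a pipeline "above the zero-shot Pareto front" would be one that no zero-shot pipeline dominates; the paper may define the comparison differently.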

Title: MAO-ARAG: Multi-Agent Orchestration for Adaptive Retrieval-Augmented Generation

Authors: Yiqun Chen, Erhan Zhang, Lingyong Yan, Shuaiqiang Wang, Jizhou Huang, Dawei Yin, Jiaxin Mao
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2508.01005
Pdf URL: https://arxiv.org/pdf/2508.01005
Copy Paste: [[2508.01005]] MAO-ARAG: Multi-Agent Orchestration for Adaptive Retrieval-Augmented Generation(https://arxiv.org/abs/2508.01005)
Keywords: hallucination, retrieval-augmented generation, agent
Abstract: In question-answering (QA) systems, Retrieval-Augmented Generation (RAG) has become pivotal in enhancing response accuracy and reducing hallucination issues. The architecture of RAG systems varies significantly, encompassing single-round RAG, iterative RAG, and reasoning RAG, each tailored to address different types of queries. Due to the varying complexity of real-world queries, a fixed RAG pipeline often struggles to balance performance and cost efficiency across different queries. To address this challenge, we propose an adaptive RAG framework called MAO-ARAG, which leverages multi-agent orchestration. Our adaptive RAG is conceived as a multi-turn framework. Specifically, we define multiple executor agents, representing typical RAG modules such as query reformulation agents, document selection agents, and generation agents. A planner agent intelligently selects and integrates the appropriate agents from these executors into a suitable workflow tailored for each query, striving for high-quality answers while maintaining reasonable costs. During each turn, the planner agent is trained using reinforcement learning, guided by an outcome-based reward (F1 score) and a cost-based penalty, continuously improving answer quality while keeping costs within a reasonable range. Experiments conducted on multiple QA datasets demonstrate that our approach, which dynamically plans workflows for each query, not only achieves high answer quality but also maintains both cost and latency within acceptable limits. The code of MAO-ARAG is on this https URL.
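Note on the training signal described in the MAO-ARAG abstract (an outcome-based F1 reward combined with a cost-based penalty): a minimal sketch, assuming a SQuAD-style token-overlap F1 and a linear penalty. The `reward` function, the `cost_weight` parameter, and the unit of `cost` are hypothetical; the abstract does not specify how the two terms are combined.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def reward(prediction: str, reference: str, cost: float,
           cost_weight: float = 0.1) -> float:
    """Assumed form: F1 minus a weighted cost (e.g. number of agent calls)."""
    return token_f1(prediction, reference) - cost_weight * cost


# A correct answer reached with three agent calls (made-up numbers).
print(reward("paris", "paris", cost=3.0))
```

With a linear penalty like this, a workflow that calls more agents must raise F1 by more than `cost_weight` per extra call to be preferred by the planner.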

Title: UrBLiMP: A Benchmark for Evaluating the Linguistic Competence of Large Language Models in Urdu

Authors: Farah Adeeba, Brian Dillon, Hassan Sajjad, Rajesh Bhatt
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.01006
Pdf URL: https://arxiv.org/pdf/2508.01006
Copy Paste: [[2508.01006]] UrBLiMP: A Benchmark for Evaluating the Linguistic Competence of Large Language Models in Urdu(https://arxiv.org/abs/2508.01006)
Keywords: language model, llm
Abstract: Multilingual Large Language Models (LLMs) have shown remarkable performance across various languages; however, they often include significantly less data for low-resource languages such as Urdu compared to high-resource languages like English. To assess the linguistic knowledge of LLMs in Urdu, we present the Urdu Benchmark of Linguistic Minimal Pairs (UrBLiMP), i.e., pairs of minimally different sentences that contrast in grammatical acceptability. UrBLiMP comprises 5,696 minimal pairs targeting ten core syntactic phenomena, carefully curated using the Urdu Treebank and diverse Urdu text corpora. A human evaluation of UrBLiMP annotations yielded a 96.10% inter-annotator agreement, confirming the reliability of the dataset. We evaluate twenty multilingual LLMs on UrBLiMP, revealing significant variation in performance across linguistic phenomena. While LLaMA-3-70B achieves the highest average accuracy (94.73%), its performance is statistically comparable to other top models such as Gemma-3-27B-PT. These findings highlight both the potential and the limitations of current multilingual LLMs in capturing fine-grained syntactic knowledge in low-resource languages.
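Note on minimal-pair accuracy: the UrBLiMP abstract does not state the scoring rule, so the sketch below assumes the convention common to minimal-pair benchmarks, where a pair counts as correct when the model assigns a higher score (for example a log-probability) to the acceptable sentence. The placeholder sentence ids and scores are invented for illustration.

```python
from typing import Callable, Iterable, Tuple


def minimal_pair_accuracy(
    pairs: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of (acceptable, unacceptable) pairs where the acceptable one scores higher."""
    pairs = list(pairs)
    correct = sum(score(good) > score(bad) for good, bad in pairs)
    return correct / len(pairs)


# Hypothetical sentence-level log-probabilities, keyed by placeholder ids.
toy_scores = {
    "good_1": -10.0, "bad_1": -12.5,
    "good_2": -8.0,  "bad_2": -7.5,   # the model prefers the unacceptable sentence here
    "good_3": -9.0,  "bad_3": -11.0,
    "good_4": -6.0,  "bad_4": -6.5,
}
toy_pairs = [(f"good_{i}", f"bad_{i}") for i in range(1, 5)]
accuracy = minimal_pair_accuracy(toy_pairs, toy_scores.__getitem__)
print(accuracy)
```

In practice `score` would wrap a language model's sentence log-probability; the dictionary lookup only stands in for that call.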

Title: Cross-Domain Web Information Extraction at Pinterest

Authors: Michael Farag, Patrick Halina, Andrey Zaytsev, Alekhya Munagala, Imtihan Ahmed, Junhao Wang
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2508.01096
Pdf URL: https://arxiv.org/pdf/2508.01096
Copy Paste: [[2508.01096]] Cross-Domain Web Information Extraction at Pinterest(https://arxiv.org/abs/2508.01096)
Keywords: language model, gpt, llm
Abstract: The internet offers a massive repository of unstructured information, but it's a significant challenge to convert this into a structured format. At Pinterest, the ability to accurately extract structured product data from e-commerce websites is essential to enhance user experiences and improve content distribution. In this paper, we present Pinterest's system for attribute extraction, which achieves remarkable accuracy and scalability at a manageable cost. Our approach leverages a novel webpage representation that combines structural, visual, and text modalities into a compact form, optimizing it for small model learning. This representation captures each visible HTML node with its text, style and layout information. We show how this allows simple models such as eXtreme Gradient Boosting (XGBoost) to extract attributes more accurately than much more complex Large Language Models (LLMs) such as Generative Pre-trained Transformer (GPT). Our results demonstrate a system that is highly scalable, processing over 1,000 URLs per second, while being 1000 times more cost-effective than the cheapest GPT alternatives.

Title: Asking the Right Questions: Benchmarking Large Language Models in the Development of Clinical Consultation Templates

Authors: Liam G. McCoy, Fateme Nateghi Haredasht, Kanav Chopra, David Wu, David JH Wu, Abass Conteh, Sarita Khemani, Saloni Kumar Maharaj, Vishnu Ravi, Arth Pahwa, Yingjie Weng, Leah Rosengaus, Lena Giang, Kelvin Zhenghao Li, Olivia Jee, Daniel Shirvani, Ethan Goh, Jonathan H. Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.01159
Pdf URL: https://arxiv.org/pdf/2508.01159
Copy Paste: [[2508.01159]] Asking the Right Questions: Benchmarking Large Language Models in the Development of Clinical Consultation Templates(https://arxiv.org/abs/2508.01159)
Keywords: language model, gpt, llm, prompt, agent
Abstract: This study evaluates the capacity of large language models (LLMs) to generate structured clinical consultation templates for electronic consultation. Using 145 expert-crafted templates developed and routinely used by Stanford's eConsult team, we assess frontier models -- including o3, GPT-4o, Kimi K2, Claude 4 Sonnet, Llama 3 70B, and Gemini 2.5 Pro -- for their ability to produce clinically coherent, concise, and prioritized clinical question schemas. Through a multi-agent pipeline combining prompt optimization, semantic autograding, and prioritization analysis, we show that while models like o3 achieve high comprehensiveness (up to 92.2%), they consistently generate excessively long templates and fail to correctly prioritize the most clinically important questions under length constraints. Performance varies across specialties, with significant degradation in narrative-driven fields such as psychiatry and pain medicine. Our findings demonstrate that LLMs can enhance structured clinical information exchange between physicians, while highlighting the need for more robust evaluation methods that capture a model's ability to prioritize clinically salient information within the time constraints of real-world physician communication.

Title: CSIRO-LT at SemEval-2025 Task 11: Adapting LLMs for Emotion Recognition for Multiple Languages

Authors: Jiyu Chen, Necva Bölücü, Sarvnaz Karimi, Diego Mollá, Cécile L. Paris
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.01161
Pdf URL: https://arxiv.org/pdf/2508.01161
Copy Paste: [[2508.01161]] CSIRO-LT at SemEval-2025 Task 11: Adapting LLMs for Emotion Recognition for Multiple Languages(https://arxiv.org/abs/2508.01161)
Keywords: llm
Abstract: Detecting emotions across different languages is challenging because emotions are expressed in varied and culturally nuanced ways. The shared task "SemEval 2025 Task 11: Bridging the Gap in Text-Based Emotion" was organised to investigate emotion recognition across different languages. The goal of the task is to implement an emotion recogniser that can identify the basic emotional states that general third-party observers would attribute to an author based on their written text snippet, along with the intensity of those emotions. We report our investigation of various task-adaptation strategies for LLMs in emotion recognition. We show that the most effective method for this task is to fine-tune a pre-trained multilingual LLM with LoRA, separately for each language.

Title: Adaptive Content Restriction for Large Language Models via Suffix Optimization

Authors: Yige Li, Peihai Jiang, Jun Sun, Peng Shu, Tianming Liu, Zhen Xiang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.01198
Pdf URL: https://arxiv.org/pdf/2508.01198
Copy Paste: [[2508.01198]] Adaptive Content Restriction for Large Language Models via Suffix Optimization(https://arxiv.org/abs/2508.01198)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have demonstrated significant success across diverse applications. However, enforcing content restrictions remains a significant challenge due to their expansive output space. One aspect of content restriction is preventing LLMs from generating harmful content via model alignment approaches such as supervised fine-tuning (SFT). Yet, the need for content restriction may vary significantly across user groups, change rapidly over time, and not always align with general definitions of harmfulness. Applying SFT to each of these specific use cases is impractical due to the high computational, data, and storage demands. Motivated by this need, we propose a new task called Adaptive Content Restriction (AdaCoRe), which focuses on lightweight strategies -- methods without model fine-tuning -- to prevent deployed LLMs from generating restricted terms for specific use cases. We propose the first method for AdaCoRe, named Suffix Optimization (SOP), which appends a short, optimized suffix to any prompt to a) prevent a target LLM from generating a set of restricted terms, while b) preserving the output quality. To evaluate AdaCoRe approaches, including our SOP, we create a new Content Restriction Benchmark (CoReBench), which contains 400 prompts for 80 restricted terms across 8 carefully selected categories. We demonstrate the effectiveness of SOP on CoReBench, which outperforms the system-level baselines such as system suffix by 15%, 17%, 10%, 9%, and 6% on average restriction rates for Gemma2-2B, Mistral-7B, Vicuna-7B, Llama3-8B, and Llama3.1-8B, respectively. We also demonstrate that SOP is effective on POE, an online platform hosting various commercial LLMs, highlighting its practicality in real-world scenarios.
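Note on "restriction rate": the abstract reports this metric without defining it. One plausible reading, used here purely as an assumption, is the fraction of model responses that contain none of the restricted terms. The sketch below uses case-insensitive substring matching; the function name, responses, and terms are hypothetical.

```python
from typing import Iterable, List


def restriction_rate(responses: List[str], restricted_terms: Iterable[str]) -> float:
    """Assumed metric: share of responses with no restricted term (case-insensitive)."""
    terms = [t.lower() for t in restricted_terms]
    clean = sum(
        not any(term in response.lower() for term in terms)
        for response in responses
    )
    return clean / len(responses)


# Made-up responses; "coffee" is the single restricted term.
toy_responses = ["I like tea.", "Coffee is great.", "Water is fine."]
rate = restriction_rate(toy_responses, ["coffee"])
print(rate)
```

Substring matching is a simplification: it would also flag a restricted term that appears inside a longer word, which a real evaluation might want to handle with tokenization.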

Title: Show or Tell? Modeling the evolution of request-making in Human-LLM conversations

Authors: Shengqi Zhu, Jeffrey M. Rzeszotarski, David Mimno
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2508.01213
Pdf URL: https://arxiv.org/pdf/2508.01213
Copy Paste: [[2508.01213]] Show or Tell? Modeling the evolution of request-making in Human-LLM conversations(https://arxiv.org/abs/2508.01213)
Keywords: llm, chat
Abstract: Chat logs provide a rich source of information about LLM users, but patterns of user behavior are often masked by the variability of queries. We present a new task, segmenting chat queries into contents of requests, roles, query-specific context, and additional expressions. We find that, despite the familiarity of chat-based interaction, request-making in LLM queries remains significantly different from comparable human-human interactions. With the data resource, we introduce an important perspective of diachronic analyses with user expressions. We find that query patterns vary between early ones emphasizing requests, and individual users explore patterns but tend to converge with experience. Finally, we show that model capabilities affect user behavior, particularly with the introduction of new models, which are traceable at the community level.

Title: WebDS: An End-to-End Benchmark for Web-based Data Science

Authors: Ethan Hsu, Hong Meng Yam, Ines Bouissou, Aaron Murali John, Raj Thota, Josh Koe, Vivek Sarath Putta, G K Dharesan, Alexander Spangher, Shikhar Murty, Tenghao Huang, Christopher D. Manning
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.01222
Pdf URL: https://arxiv.org/pdf/2508.01222
Copy Paste: [[2508.01222]] WebDS: An End-to-End Benchmark for Web-based Data Science(https://arxiv.org/abs/2508.01222)
Keywords: llm, agent
Abstract: A large portion of real-world data science tasks are complex and require multi-hop web-based interactions: finding appropriate data available on the internet, synthesizing real-time data of various modalities from different locations, and producing summarized analyses. Existing web benchmarks often focus on simplistic interactions, such as form submissions or e-commerce transactions, and often do not require diverse tool-using capabilities required for web based data science. Conversely, traditional data science benchmarks typically concentrate on static, often textually bound datasets and do not assess end-to-end workflows that encompass data acquisition, cleaning, analysis, and insight generation. In response, we introduce WebDS, the first end-to-end web-based data science benchmark. It comprises 870 web-based data science tasks across 29 diverse websites from structured government data portals to unstructured news media, challenging agents to perform complex, multi-step operations requiring the use of tools and heterogeneous data formats that better reflect the realities of modern data analytics. Evaluations of current SOTA LLM agents indicate significant performance gaps in accomplishing these tasks. For instance, Browser Use, which accomplishes 80% of tasks on Web Voyager, successfully completes only 15% of tasks in WebDS, which our analysis suggests is due to new failure modes like poor information grounding, repetitive behavior and shortcut-taking that agents performing WebDS' tasks display. By providing a more robust and realistic testing ground, WebDS sets the stage for significant advances in the development of practically useful LLM-based data science.

Title: WarriorMath: Enhancing the Mathematical Ability of Large Language Models with a Defect-aware Framework

Authors: Yue Chen, Minghua He, Fangkai Yang, Pu Zhao, Lu Wang, Yu Kang, Yifei Dong, Yuefeng Zhan, Hao Sun, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.01245
Pdf URL: https://arxiv.org/pdf/2508.01245
Copy Paste: [[2508.01245]] WarriorMath: Enhancing the Mathematical Ability of Large Language Models with a Defect-aware Framework(https://arxiv.org/abs/2508.01245)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) excel in solving mathematical problems, yet their performance is often limited by the availability of high-quality, diverse training data. Existing methods focus on augmenting datasets through rephrasing or difficulty progression but overlook the specific failure modes of LLMs. This results in synthetic questions that the model can already solve, providing minimal performance gains. To address this, we propose WarriorMath, a defect-aware framework for mathematical problem solving that integrates both targeted data synthesis and progressive training. In the synthesis stage, we employ multiple expert LLMs in a collaborative process to generate, critique, and refine problems. Questions that base LLMs fail to solve are identified and iteratively improved through expert-level feedback, producing high-quality, defect-aware training data. In the training stage, we introduce a progressive learning framework that iteratively fine-tunes the model using increasingly challenging data tailored to its weaknesses. Experiments on six mathematical benchmarks show that WarriorMath outperforms strong baselines by 12.57% on average, setting a new state-of-the-art. Our results demonstrate the effectiveness of a defect-aware, multi-expert framework for improving mathematical ability.

Title: Bridging LLMs and Symbolic Reasoning in Educational QA Systems: Insights from the XAI Challenge at IJCNN 2025

Authors: Long S. T. Nguyen, Khang H. N. Vo, Thu H. A. Nguyen, Tuan C. Bui, Duc Q. Nguyen, Thanh-Tung Tran, Anh D. Nguyen, Minh L. Nguyen, Fabien Baldacci, Thang H. Bui, Emanuel Di Nardo, Angelo Ciaramella, Son H. Le, Ihsan Ullah, Lorenzo Di Rocco, Tho T. Quan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.01263
Pdf URL: https://arxiv.org/pdf/2508.01263
Copy Paste: [[2508.01263]] Bridging LLMs and Symbolic Reasoning in Educational QA Systems: Insights from the XAI Challenge at IJCNN 2025(https://arxiv.org/abs/2508.01263)
Keywords: language model, llm
Abstract: The growing integration of Artificial Intelligence (AI) into education has intensified the need for transparency and interpretability. While hackathons have long served as agile environments for rapid AI prototyping, few have directly addressed eXplainable AI (XAI) in real-world educational contexts. This paper presents a comprehensive analysis of the XAI Challenge 2025, a hackathon-style competition jointly organized by Ho Chi Minh City University of Technology (HCMUT) and the International Workshop on Trustworthiness and Reliability in Neurosymbolic AI (TRNS-AI), held as part of the International Joint Conference on Neural Networks (IJCNN 2025). The challenge tasked participants with building Question-Answering (QA) systems capable of answering student queries about university policies while generating clear, logic-based natural language explanations. To promote transparency and trustworthiness, solutions were required to use lightweight Large Language Models (LLMs) or hybrid LLM-symbolic systems. A high-quality dataset was provided, constructed via logic-based templates with Z3 validation and refined through expert student review to ensure alignment with real-world academic scenarios. We describe the challenge's motivation, structure, dataset construction, and evaluation protocol. Situating the competition within the broader evolution of AI hackathons, we argue that it represents a novel effort to bridge LLMs and symbolic reasoning in service of explainability. Our findings offer actionable insights for future XAI-centered educational systems and competitive research initiatives.

Title: Prompting Large Language Models with Partial Knowledge for Answering Questions with Unseen Entities

Authors: Zhichao Yan, Jiapu Wang, Jiaoyan Chen, Yanyan Wang, Hongye Tan, Jiye Liang, Xiaoli Li, Ru Li, Jeff Z. Pan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.01290
Pdf URL: https://arxiv.org/pdf/2508.01290
Copy Paste: [[2508.01290]] Prompting Large Language Models with Partial Knowledge for Answering Questions with Unseen Entities(https://arxiv.org/abs/2508.01290)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) shows impressive performance by supplementing and substituting parametric knowledge in Large Language Models (LLMs). Retrieved knowledge can be divided into three types: explicit answer evidence, implicit answer clue, and insufficient answer context, which can be further categorized into totally irrelevant and partially relevant information. Effectively utilizing partially relevant knowledge remains a key challenge for RAG systems, especially in incomplete knowledge base retrieval. Contrary to the conventional view, we propose a new perspective: LLMs can be awakened via partially relevant knowledge already embedded in LLMs. To comprehensively investigate this phenomenon, the triplets located in the gold reasoning path and their variants are used to construct partially relevant knowledge by removing the path that contains the answer. We provide theoretical analysis of the awakening effect in LLMs and support our hypothesis with experiments on two Knowledge Graphs (KGs) Question Answering (QA) datasets. Furthermore, we present a new task, Unseen Entity KGQA, simulating real-world challenges where entity linking fails due to KG incompleteness. Our awakening-based approach demonstrates greater efficacy in practical applications, outperforming traditional methods that rely on embedding-based similarity, which is prone to returning noisy information.

Title: KEDAS: Knowledge Editing Alignment with Diverse Augmentation and Self-adaptive Inference

Authors: Chenming Tang, Yutong Yang, Yunfang Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.01302
Pdf URL: https://arxiv.org/pdf/2508.01302
Copy Paste: [[2508.01302]] KEDAS: Knowledge Editing Alignment with Diverse Augmentation and Self-adaptive Inference(https://arxiv.org/abs/2508.01302)
Keywords: language model, llm
Abstract: Knowledge editing aims to modify outdated knowledge in large language models (LLMs) efficiently while retaining their powerful capabilities. Most existing methods rely on either parameter-level editing or retrieval-based approaches. In this work, we propose Knowledge Editing alignment with Diverse Augmentation and Self-adaptive inference (KEDAS) to better align LLMs with knowledge editing. In the alignment phase, LLMs learn to apply in-context edited knowledge via low-rank adaptation. During editing, we design a diverse edit augmentation technique to improve the recall of edits. After that, a self-adaptive post-alignment inference mechanism is proposed, in which a filter-based smart retriever is employed to perform a dynamic selection of inference routing. Specifically, irrelevant queries will go through the original pre-alignment model directly, while relevant ones, together with their related edits, go through the model with aligned adapters activated. In experiments, KEDAS secures the highest overall performance scores in 35 out of 36 cases across four datasets with three LLMs on three settings, surpassing its strong knowledge editing alignment counterpart by about 19.8 harmonic mean scores of edit success, locality and portability and outperforming both parameter editing and retrieval-based baselines significantly. Analysis of computational cost and performance on general tasks further validates the robustness and efficiency of KEDAS, indicating that it presents an ideal paradigm of knowledge editing alignment.
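Note on the KEDAS headline metric: the abstract reports a harmonic mean of edit success, locality, and portability. A short sketch of that aggregation, using Python's standard `statistics.harmonic_mean`; the three scores are made-up numbers, not results from the paper.

```python
from statistics import harmonic_mean, mean

# Hypothetical per-metric scores on a 0-100 scale.
edit_success = 90.0
locality = 80.0
portability = 60.0

scores = [edit_success, locality, portability]
overall = harmonic_mean(scores)   # 3 / (1/90 + 1/80 + 1/60)
arithmetic = mean(scores)         # shown for comparison only

print(f"harmonic mean:   {overall:.2f}")
print(f"arithmetic mean: {arithmetic:.2f}")
```

The harmonic mean never exceeds the arithmetic mean and is pulled toward the lowest score, so a method cannot offset weak portability with strong edit success alone.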

Title: D-SCoRE: Document-Centric Segmentation and CoT Reasoning with Structured Export for QA-CoT Data Generation

Authors: Weibo Zhou, Lingbo Li, Shangsong Liang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.01309
Pdf URL: https://arxiv.org/pdf/2508.01309
Copy Paste: [[2508.01309]] D-SCoRE: Document-Centric Segmentation and CoT Reasoning with Structured Export for QA-CoT Data Generation(https://arxiv.org/abs/2508.01309)
Keywords: language model, llm, prompt
Abstract: The scarcity and high cost of high-quality question-answering (QA) datasets hinder supervised fine-tuning (SFT) for domain-specific large language models (LLMs). To address this, we introduce D-SCoRE, a training-free pipeline that utilizes LLMs and prompt engineering to produce diverse, high-quality QA datasets from arbitrary textual sources. D-SCoRE integrates Document-centric processing, Segmentation, CoT Reasoning, and structured Export to generate QA-CoT datasets tailored for domain-aware SFT. Multi-dimensional control mechanisms, such as semantic role transformation, question type balancing, and counterfactual materials, enhance diversity and relevance, overcoming limitations of existing QA generation. LLMs fine-tuned on D-SCoRE-generated QA datasets and on human-annotated QA datasets (SQuAD, Covid-QA) are evaluated on SQuADShifts and Covid-QA test sets, with D-SCoRE outperforming across most domains. D-SCoRE generates six QA-CoT pairs with four-option counterfactual materials per 100-200-word text in 90 seconds using an 8B LLM on consumer-grade hardware. Its simplicity and scalability enable efficient QA generation and high-performance fine-tuning across domains.

Title: LinkQA: Synthesizing Diverse QA from Multiple Seeds Strongly Linked by Knowledge Points

Authors: Xuemiao Zhang, Can Ren, Chengying Tu, Rongxiang Weng, Hongfei Yan, Jingang Wang, Xunliang Cai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.01317
Pdf URL: https://arxiv.org/pdf/2508.01317
Copy Paste: [[2508.01317]] LinkQA: Synthesizing Diverse QA from Multiple Seeds Strongly Linked by Knowledge Points(https://arxiv.org/abs/2508.01317)
Keywords: language model, llm
Abstract: The advancement of large language models (LLMs) struggles with the scarcity of high-quality, diverse training data. To address this limitation, we propose LinkSyn, a novel knowledge point (KP) graph-based synthesis framework that enables flexible control over discipline and difficulty distributions while balancing KP coverage and popularity. LinkSyn extracts KPs from question-answering (QA) seed data and constructs a KP graph to synthesize diverse QA data from multiple seeds strongly linked by KPs and sampled from graph walks. Specifically, LinkSyn incorporates (1) a knowledge distribution value function to guide the adjustment of path sampling probability and balance KP coverage and popularity during graph walks; (2) diffusion-based synthesis via DeepSeek-R1 by leveraging multiple seeds with dense logical associations along each path; and (3) high-difficulty QA enhancement within given disciplines by flexible difficulty adjustments. By executing LinkSyn, we synthesize LinkQA, a diverse multi-disciplinary QA dataset with 50B tokens. Extensive experiments on Llama-3 8B demonstrate that continual pre-training with LinkQA yields an average improvement of 11.51% on MMLU and CMMLU, establishing new SOTA results. LinkQA consistently enhances performance across model size and initial FLOPs scales.

Title: Large-Scale Diverse Synthesis for Mid-Training

Authors: Xuemiao Zhang, Chengying Tu, Can Ren, Rongxiang Weng, Hongfei Yan, Jingang Wang, Xunliang Cai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.01326
Pdf URL: https://arxiv.org/pdf/2508.01326
Copy Paste: [[2508.01326]] Large-Scale Diverse Synthesis for Mid-Training(https://arxiv.org/abs/2508.01326)
Keywords: language model, llm
Abstract: The scarcity of high-quality, knowledge-intensive training data hinders the development of large language models (LLMs), as traditional corpora provide limited information. Previous studies have synthesized and integrated corpora-dependent question-answering (QA) data to improve model performance but face challenges in QA data scalability and knowledge diversity, particularly in cross-domain contexts. Furthermore, leveraging our designed discipline and difficulty annotation system, we probe model deficiencies in STEM disciplines and high-difficulty data. To overcome these limitations, we propose a novel diversified pipeline to synthesize BoostQA, a 100B-token large-scale QA dataset. Our synthesis framework: (1) curates seed data from heterogeneous sources; (2) utilizes DeepSeek-R1 to implement STEM-focused multi-grade synthesis to boost data diversity and high-difficulty synthesis to mitigate difficulty degradation; (3) refines answers via DeepSeek-V3 to improve output quality. We utilize BoostQA in mid-training, a mid-stage between pre-training and post-training, to optimize domain-specific knowledge acquisition and enhance data quality. Our method enables Llama-3 8B, mid-trained on a 40B-token dataset, to achieve an average improvement of 12.74% on MMLU and CMMLU and establish SOTA average performance across 12 benchmarks. BoostQA also demonstrates robust scalability, with performance consistently improving as model size, data volume, and initial FLOPs scale.

Title: MaRGen: Multi-Agent LLM Approach for Self-Directed Market Research and Analysis

Authors: Roman Koshkin, Pengyu Dai, Nozomi Fujikawa, Masahito Togami, Marco Visentini-Scarzanella
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2508.01370
Pdf URL: https://arxiv.org/pdf/2508.01370
Copy Paste: [[2508.01370]] MaRGen: Multi-Agent LLM Approach for Self-Directed Market Research and Analysis(https://arxiv.org/abs/2508.01370)
Keywords: language model, llm, agent
Abstract: We present an autonomous framework that leverages Large Language Models (LLMs) to automate end-to-end business analysis and market report generation. At its core, the system employs specialized agents - Researcher, Reviewer, Writer, and Retriever - that collaborate to analyze data and produce comprehensive reports. These agents learn from real professional consultants' presentation materials at Amazon through in-context learning to replicate professional analytical methodologies. The framework executes a multi-step process: querying databases, analyzing data, generating insights, creating visualizations, and composing market reports. We also introduce a novel LLM-based evaluation system for assessing report quality, which shows alignment with expert human evaluations. Building on these evaluations, we implement an iterative improvement mechanism that optimizes report quality through automated review cycles. Experimental results show that report quality can be improved by both automated review cycles and consultants' unstructured knowledge. In experimental validation, our framework generates detailed 6-page reports in 7 minutes at a cost of approximately $1. Our work could be an important step to automatically create affordable market insights.

Title: ArzEn-MultiGenre: An aligned parallel dataset of Egyptian Arabic song lyrics, novels, and subtitles, with English translations

Authors: Rania Al-Sabbagh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.01411
Pdf URL: https://arxiv.org/pdf/2508.01411
Copy Paste: [[2508.01411]] ArzEn-MultiGenre: An aligned parallel dataset of Egyptian Arabic song lyrics, novels, and subtitles, with English translations(https://arxiv.org/abs/2508.01411)
Keywords: language model
Abstract: ArzEn-MultiGenre is a parallel dataset of Egyptian Arabic song lyrics, novels, and TV show subtitles that are manually translated and aligned with their English counterparts. The dataset contains 25,557 segment pairs that can be used to benchmark new machine translation models, fine-tune large language models in few-shot settings, and adapt commercial machine translation applications such as Google Translate. Additionally, the dataset is a valuable resource for research in various disciplines, including translation studies, cross-linguistic analysis, and lexical semantics. The dataset can also serve pedagogical purposes by training translation students and aid professional translators as a translation memory. The contributions are twofold: first, the dataset features textual genres not found in existing parallel Egyptian Arabic and English datasets, and second, it is a gold-standard dataset that has been translated and aligned by human experts.

Title: Discovering Bias Associations through Open-Ended LLM Generations

Authors: Jinhao Pan, Chahat Raj, Ziwei Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.01412
Pdf URL: https://arxiv.org/pdf/2508.01412
Copy Paste: [[2508.01412]] Discovering Bias Associations through Open-Ended LLM Generations(https://arxiv.org/abs/2508.01412)
Keywords: language model, llm
Abstract: Social biases embedded in Large Language Models (LLMs) raise critical concerns, resulting in representational harms -- unfair or distorted portrayals of demographic groups -- that may be expressed in subtle ways through generated language. Existing evaluation methods often depend on predefined identity-concept associations, limiting their ability to surface new or unexpected forms of bias. In this work, we present the Bias Association Discovery Framework (BADF), a systematic approach for extracting both known and previously unrecognized associations between demographic identities and descriptive concepts from open-ended LLM outputs. Through comprehensive experiments spanning multiple models and diverse real-world contexts, BADF enables robust mapping and analysis of the varied concepts that characterize demographic identities. Our findings advance the understanding of biases in open-ended generation and provide a scalable tool for identifying and analyzing bias associations in LLMs. Data, code, and results are available at this https URL

Title: From Query to Logic: Ontology-Driven Multi-Hop Reasoning in LLMs

Authors: Haonan Bian, Yutao Qi, Rui Yang, Yuanxi Che, Jiaqian Wang, Heming Xia, Ranran Zhen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.01424
Pdf URL: https://arxiv.org/pdf/2508.01424
Copy Paste: [[2508.01424]] From Query to Logic: Ontology-Driven Multi-Hop Reasoning in LLMs(https://arxiv.org/abs/2508.01424)
Keywords: language model, llm
Abstract: Large Language Models (LLMs), despite their success in question answering, exhibit limitations in complex multi-hop question answering (MQA) tasks that necessitate non-linear, structured reasoning. This limitation stems from their inability to adequately capture deep conceptual relationships between entities. To overcome this challenge, we present **ORACLE** (**O**ntology-driven **R**easoning **A**nd **C**hain for **L**ogical **E**lucidation), a training-free framework that combines LLMs' generative capabilities with the structural benefits of knowledge graphs. Our approach operates through three stages: (1) dynamic construction of question-specific knowledge ontologies using LLMs, (2) transformation of these ontologies into First-Order Logic reasoning chains, and (3) systematic decomposition of the original query into logically coherent sub-questions. Experimental results on several standard MQA benchmarks show that our framework achieves highly competitive performance, rivaling current state-of-the-art models like DeepSeek-R1. Detailed analyses further confirm the effectiveness of each component, while demonstrating that our method generates more logical and interpretable reasoning chains than existing approaches.

Title: Towards Efficient Medical Reasoning with Minimal Fine-Tuning Data

Authors: Xinlin Zhuang, Feilong Tang, Haolin Yang, Ming Hu, Huifa Li, Haochen Xue, Yichen Li, Junjun He, Zongyuan Ge, Ying Qian, Imran Razzak
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.01450
Pdf URL: https://arxiv.org/pdf/2508.01450
Copy Paste: [[2508.01450]] Towards Efficient Medical Reasoning with Minimal Fine-Tuning Data(https://arxiv.org/abs/2508.01450)
Keywords: language model, llm
Abstract: Supervised Fine-Tuning (SFT) plays a pivotal role in adapting Large Language Models (LLMs) to specialized domains such as medical reasoning. However, existing SFT practices often rely on unfiltered datasets that contain redundant and low-quality samples, leading to substantial computational costs and suboptimal performance. Although existing methods attempt to alleviate this problem by selecting data based on sample difficulty, defined by knowledge and reasoning complexity, they overlook each sample's optimization utility reflected in its gradient. Interestingly, we find that gradient-based influence alone favors easy-to-optimize samples that cause large parameter shifts but lack deep reasoning chains, while difficulty alone selects noisy or overly complex cases that fail to guide stable optimization. Based on this observation, we propose a data selection strategy, Difficulty-Influence Quadrant (DIQ), which prioritizes samples in the high-difficulty-high-influence quadrant to balance complex clinical reasoning with substantial gradient influence, enabling efficient medical reasoning with minimal fine-tuning data. Furthermore, Human and LLM-as-a-judge evaluations show that DIQ-selected subsets demonstrate higher data quality and generate clinical reasoning that is more aligned with expert practices in differential diagnosis, safety check, and evidence citation, as DIQ emphasizes samples that foster expert-like reasoning patterns. Extensive experiments on medical reasoning benchmarks demonstrate that DIQ enables models fine-tuned on only 1% of selected data to match full-dataset performance, while using 10% consistently outperforms the baseline, highlighting the superiority of principled data selection over brute-force scaling. The code and data are available at this https URL.
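The quadrant idea in the abstract (keep samples that are both hard and influential) can be sketched in a few lines. This is a minimal illustration only: the abstract does not say how difficulty and influence are scored or where the quadrant boundaries lie, so the precomputed scores, the median cut-offs, and the function name `select_high_high` are assumptions made here.

```python
from statistics import median


def select_high_high(samples):
    """Return ids of samples above the median on BOTH axes.

    samples: list of dicts with 'id', 'difficulty', 'influence'.
    Median thresholds are an assumed choice, not the paper's rule.
    """
    d_cut = median(s["difficulty"] for s in samples)
    i_cut = median(s["influence"] for s in samples)
    return [
        s["id"]
        for s in samples
        if s["difficulty"] > d_cut and s["influence"] > i_cut
    ]


# Made-up scores for five hypothetical training samples.
samples = [
    {"id": "a", "difficulty": 0.9, "influence": 0.8},  # hard and influential
    {"id": "b", "difficulty": 0.2, "influence": 0.9},  # easy but influential
    {"id": "c", "difficulty": 0.8, "influence": 0.1},  # hard, low influence
    {"id": "d", "difficulty": 0.1, "influence": 0.2},  # easy, low influence
    {"id": "e", "difficulty": 0.5, "influence": 0.5},  # sits on both medians
]
print(select_high_high(samples))  # ['a']
```

Sample `b` is the kind of case the abstract attributes to influence-only selection, and `c` the kind attributed to difficulty-only selection; requiring both scores to be high excludes each.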

Title: TreeDiff: AST-Guided Code Generation with Diffusion LLMs

Authors: Yiming Zeng, Jinghan Cao, Zexin Li, Yiming Chen, Tao Ren, Dawei Xiang, Xidong Wu, Shangqian Gao, Tingting Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.01473
Pdf URL: https://arxiv.org/pdf/2508.01473
Copy Paste: [[2508.01473]] TreeDiff: AST-Guided Code Generation with Diffusion LLMs(https://arxiv.org/abs/2508.01473)
Keywords: language model, llm
Abstract: Recent advances in diffusion-based language models have opened new possibilities for controllable and bidirectional sequence generation. These models provide an alternative to traditional autoregressive approaches by framing text generation as an iterative denoising process. However, applying diffusion models to structured domains such as source code remains a significant challenge. Programming languages differ from natural language in that they follow strict syntactic and semantic rules, with hierarchical organization that must be preserved for correctness. Standard token-level corruption techniques used during training often ignore this structure, which may hinder the model's ability to learn meaningful representations of code. To address this limitation, we propose a syntax-aware diffusion framework that incorporates structural priors from Abstract Syntax Trees (ASTs) into the denoising process. Instead of masking individual tokens at random, we selectively corrupt syntactically meaningful code spans derived from AST subtrees. This enables the model to reconstruct programs in a way that respects grammatical boundaries and captures long-range dependencies. Experimental results demonstrate that syntax-aware corruption significantly improves syntactic correctness, reconstruction accuracy, and generalization to unseen code patterns. These findings highlight the potential of incorporating structural information into diffusion-based training and suggest that syntax-guided denoising is a promising direction for advancing diffusion-based language models in code generation tasks.
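To make the contrast with random token masking concrete, the sketch below masks the source span of one whole AST node using Python's standard `ast` module. It is only an illustration of span-level corruption: the helper name `mask_first_subtree`, the `<MASK>` token, and the choice to mask the first matching node are assumptions, and the paper's actual corruption schedule is not reproduced here.

```python
import ast


def mask_first_subtree(source, node_type, mask_token="<MASK>"):
    """Replace the source span of the first AST node of `node_type` with a mask.

    Assumes ASCII source, because ast column offsets count UTF-8 bytes.
    """
    tree = ast.parse(source)
    node = next(n for n in ast.walk(tree) if isinstance(n, node_type))
    lines = source.splitlines(keepends=True)
    # Convert (line, column) positions into absolute string offsets.
    start = sum(len(line) for line in lines[: node.lineno - 1]) + node.col_offset
    end = sum(len(line) for line in lines[: node.end_lineno - 1]) + node.end_col_offset
    return source[:start] + mask_token + source[end:]


code = "def add(a, b):\n    return a + b\n"
print(mask_first_subtree(code, ast.Return))  # masks the whole return statement
print(mask_first_subtree(code, ast.BinOp))   # masks only the expression a + b
```

Because the masked region always coincides with a complete subtree, the remaining text keeps its grammatical boundaries, which is the property the abstract argues random token masking lacks.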

Title: Harnessing Collective Intelligence of LLMs for Robust Biomedical QA: A Multi-Model Approach

Authors: Dimitra Panou, Alexandros C. Dimopoulos, Manolis Koubarakis, Martin Reczko
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.01480
Pdf URL: https://arxiv.org/pdf/2508.01480
Copy Paste: [[2508.01480]] Harnessing Collective Intelligence of LLMs for Robust Biomedical QA: A Multi-Model Approach(https://arxiv.org/abs/2508.01480)
Keywords: language model, llm
Abstract: Biomedical text mining and question-answering are essential yet highly demanding tasks, particularly in the face of the exponential growth of biomedical literature. In this work, we present our participation in the 13th edition of the BioASQ challenge, which involves biomedical semantic question-answering for Task 13b and biomedical question-answering for developing topics for the Synergy task. We deploy a selection of open-source large language models (LLMs) as retrieval-augmented generators to answer biomedical questions. Various models are used to process the questions. A majority voting system combines their output to determine the final answer for Yes/No questions, while for list and factoid type questions, the union of their answers is used. We evaluated 13 state-of-the-art open source LLMs, exploring all possible model combinations to contribute to the final answer, resulting in tailored LLM pipelines for each question type. Our findings provide valuable insight into which combinations of LLMs consistently produce superior results for specific question types. In the four rounds of the 2025 BioASQ challenge, our system achieved notable results: in the Synergy task, we secured 1st place for ideal answers and 2nd place for exact answers in round 2, as well as two shared 1st places for exact answers in rounds 3 and 4.
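The two aggregation rules named in the abstract, majority voting for Yes/No questions and a union for list and factoid questions, could look roughly like this. It is a hedged sketch: the function names, the tie-breaking rule (ties resolve to "no"), and the case-insensitive de-duplication are assumptions, since the abstract does not specify them.

```python
from collections import Counter


def majority_vote(answers):
    """Return 'yes' only if it strictly outnumbers 'no' (assumed tie rule)."""
    counts = Counter(a.strip().lower() for a in answers)
    return "yes" if counts["yes"] > counts["no"] else "no"


def union_answers(answer_lists):
    """Merge answers from several models, keeping first-seen order and
    dropping case-insensitive duplicates (an assumed normalization)."""
    seen, merged = set(), []
    for answers in answer_lists:
        for item in answers:
            key = item.strip().lower()
            if key not in seen:
                seen.add(key)
                merged.append(item.strip())
    return merged


# Hypothetical outputs from three models.
print(majority_vote(["yes", "no", "Yes"]))                    # yes
print(union_answers([["BRCA1", "TP53"], ["tp53", "EGFR"]]))   # ['BRCA1', 'TP53', 'EGFR']
```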

Title: The Homogenizing Effect of Large Language Models on Human Expression and Thought

Authors: Zhivar Sourati, Alireza S. Ziabari, Morteza Dehghani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.01491
Pdf URL: https://arxiv.org/pdf/2508.01491
Copy Paste: [[2508.01491]] The Homogenizing Effect of Large Language Models on Human Expression and Thought(https://arxiv.org/abs/2508.01491)
Keywords: language model, llm
Abstract: Cognitive diversity, reflected in variations of language, perspective, and reasoning, is essential to creativity and collective intelligence. This diversity is rich and grounded in culture, history, and individual experience. Yet as large language models (LLMs) become deeply embedded in people's lives, they risk standardizing language and reasoning. This Review synthesizes evidence across linguistics, cognitive science, and computer science to show how LLMs reflect and reinforce dominant styles while marginalizing alternative voices and reasoning strategies. We examine how their design and widespread use contribute to this effect by mirroring patterns in their training data and amplifying convergence as all people increasingly rely on the same models across contexts. Unchecked, this homogenization risks flattening the cognitive landscapes that drive collective intelligence and adaptability.

Title: A Theory of Adaptive Scaffolding for LLM-Based Pedagogical Agents

Authors: Clayton Cohn, Surya Rayala, Namrata Srivastava, Joyce Horn Fonteles, Shruti Jain, Xinying Luo, Divya Mereddy, Naveeduddin Mohammed, Gautam Biswas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.01503
Pdf URL: https://arxiv.org/pdf/2508.01503
Copy Paste: [[2508.01503]] A Theory of Adaptive Scaffolding for LLM-Based Pedagogical Agents(https://arxiv.org/abs/2508.01503)
Keywords: language model, gpt, llm, chat, agent
Abstract: Large language models (LLMs) present new opportunities for creating pedagogical agents that engage in meaningful dialogue to support student learning. However, the current use of LLM systems like ChatGPT in classrooms often lacks the solid theoretical foundation found in earlier intelligent tutoring systems. To bridge this gap, we propose a framework that combines Evidence-Centered Design with Social Cognitive Theory for adaptive scaffolding in LLM-based agents focused on STEM+C learning. We illustrate this framework with Inquizzitor, an LLM-based formative assessment agent that integrates human-AI hybrid intelligence and provides feedback grounded in cognitive science principles. Our findings show that Inquizzitor delivers high-quality assessment and interaction aligned with core learning theories, offering teachers effective guidance that students value. This research underscores the potential for theory-driven LLM integration in education, highlighting the ability of these systems to provide adaptive and principled instruction.

Title: MOPrompt: Multi-objective Semantic Evolution for Prompt Optimization

Authors: Sara Câmara, Eduardo Luz, Valéria Carvalho, Ivan Meneghini, Gladston Moreira
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.01541
Pdf URL: https://arxiv.org/pdf/2508.01541
Copy Paste: [[2508.01541]] MOPrompt: Multi-objective Semantic Evolution for Prompt Optimization(https://arxiv.org/abs/2508.01541)
Keywords: language model, llm, prompt
Abstract: Prompt engineering is crucial for unlocking the potential of Large Language Models (LLMs). Still, since manual prompt design is often complex, non-intuitive, and time-consuming, automatic prompt optimization has emerged as a research area. However, a significant challenge in prompt optimization is managing the inherent trade-off between task performance, such as accuracy, and context size. Most existing automated methods focus on a single objective, typically performance, thereby failing to explore the critical spectrum of efficiency and effectiveness. This paper introduces MOPrompt, a novel Evolutionary Multi-objective Optimization (EMO) framework designed to optimize prompts for both accuracy and context size (measured in tokens) simultaneously. Our framework maps the Pareto front of prompt solutions, presenting practitioners with a set of trade-offs between context size and performance, a crucial tool for deploying LLMs in real-world applications. We evaluate MOPrompt on a sentiment analysis task in Portuguese, using Gemma-2B and Sabiazinho-3 as evaluation models. Our findings show that MOPrompt substantially outperforms the baseline framework. For the Sabiazinho model, MOPrompt identifies a prompt that achieves the same peak accuracy (0.97) as the best baseline solution, but with a 31% reduction in token length.
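The Pareto front mentioned in the abstract is the set of prompts that no other prompt beats on both objectives at once. The sketch below computes it for a handful of candidates; the candidate names and numbers are invented for illustration (they are not results from the paper), and the evolutionary search that would generate candidates is not shown.

```python
def pareto_front(candidates):
    """Return non-dominated (name, accuracy, tokens) tuples, shortest first.

    A candidate is dominated if another one is at least as accurate AND
    at least as short, and strictly better on one of the two objectives.
    """
    front = []
    for name, acc, tok in candidates:
        dominated = any(
            (a >= acc and t <= tok) and (a > acc or t < tok)
            for _, a, t in candidates
        )
        if not dominated:
            front.append((name, acc, tok))
    return sorted(front, key=lambda c: c[2])


# Hypothetical prompts: (name, accuracy, token count).
candidates = [
    ("p1", 0.97, 100),  # same accuracy as p2 but longer, so dominated
    ("p2", 0.97, 69),
    ("p3", 0.90, 40),
    ("p4", 0.85, 60),   # less accurate and longer than p3, so dominated
]
print(pareto_front(candidates))  # [('p3', 0.9, 40), ('p2', 0.97, 69)]
```

A practitioner would then pick from the front according to budget: `p3` when context size matters most, `p2` when accuracy does.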

Title: Are All Prompt Components Value-Neutral? Understanding the Heterogeneous Adversarial Robustness of Dissected Prompt in Large Language Models

Authors: Yujia Zheng, Tianhao Li, Haotian Huang, Tianyu Zeng, Jingyu Lu, Chuangxin Chu, Yuekai Huang, Ziyou Jiang, Qian Xiong, Yuyao Ge, Mingyang Li
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2508.01554
Pdf URL: https://arxiv.org/pdf/2508.01554
Copy Paste: [[2508.01554]] Are All Prompt Components Value-Neutral? Understanding the Heterogeneous Adversarial Robustness of Dissected Prompt in Large Language Models(https://arxiv.org/abs/2508.01554)
Keywords: language model, llm, prompt
Abstract: Prompt-based adversarial attacks have become an effective means to assess the robustness of large language models (LLMs). However, existing approaches often treat prompts as monolithic text, overlooking their structural heterogeneity: different prompt components contribute unequally to adversarial robustness. Prior works like PromptRobust assume prompts are value-neutral, but our analysis reveals that complex, domain-specific prompts with rich structures have components with differing vulnerabilities. To address this gap, we introduce PromptAnatomy, an automated framework that dissects prompts into functional components and generates diverse, interpretable adversarial examples by selectively perturbing each component using our proposed method, ComPerturb. To ensure linguistic plausibility and mitigate distribution shifts, we further incorporate a perplexity (PPL)-based filtering mechanism. As a complementary resource, we annotate four public instruction-tuning datasets using the PromptAnatomy framework, verified through human review. Extensive experiments across these datasets and five advanced LLMs demonstrate that ComPerturb achieves state-of-the-art attack success rates. Ablation studies validate the complementary benefits of prompt dissection and PPL filtering. Our results underscore the importance of prompt structure awareness and controlled perturbation for reliable adversarial robustness evaluation in LLMs. Code and data are available at this https URL.

Title: OpenMed NER: Open-Source, Domain-Adapted State-of-the-Art Transformers for Biomedical NER Across 12 Public Datasets

Authors: Maziyar Panahi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.01630
Pdf URL: https://arxiv.org/pdf/2508.01630
Copy Paste: [[2508.01630]] OpenMed NER: Open-Source, Domain-Adapted State-of-the-Art Transformers for Biomedical NER Across 12 Public Datasets(https://arxiv.org/abs/2508.01630)
Keywords: language model
Abstract: Named-entity recognition (NER) is fundamental to extracting structured information from the >80% of healthcare data that resides in unstructured clinical notes and biomedical literature. Despite recent advances with large language models, achieving state-of-the-art performance across diverse entity types while maintaining computational efficiency remains a significant challenge. We introduce OpenMed NER, a suite of open-source, domain-adapted transformer models that combine lightweight domain-adaptive pre-training (DAPT) with parameter-efficient Low-Rank Adaptation (LoRA). Our approach performs cost-effective DAPT on a 350k-passage corpus compiled from ethically sourced, publicly available research repositories and de-identified clinical notes (PubMed, arXiv, and MIMIC-III) using DeBERTa-v3, PubMedBERT, and BioELECTRA backbones. This is followed by task-specific fine-tuning with LoRA, which updates less than 1.5% of model parameters. We evaluate our models on 12 established biomedical NER benchmarks spanning chemicals, diseases, genes, and species. OpenMed NER achieves new state-of-the-art micro-F1 scores on 10 of these 12 datasets, with substantial gains across diverse entity types. Our models advance the state-of-the-art on foundational disease and chemical benchmarks (e.g., BC5CDR-Disease, +2.70 pp), while delivering even larger improvements of over 5.3 and 9.7 percentage points on more specialized gene and clinical cell line corpora. This work demonstrates that strategically adapted open-source models can surpass closed-source solutions. This performance is achieved with remarkable efficiency: training completes in under 12 hours on a single GPU with a low carbon footprint (< 1.2 kg CO2e), producing permissively licensed, open-source checkpoints designed to help practitioners facilitate compliance with emerging data protection and AI regulations, such as the EU AI Act.

Title: Authorship Attribution in Multilingual Machine-Generated Texts

Authors: Lucio La Cava, Dominik Macko, Róbert Móro, Ivan Srba, Andrea Tagarelli
Subjects: cs.CL, cs.AI, cs.CY, cs.HC, physics.soc-ph
Abstract URL: https://arxiv.org/abs/2508.01656
Pdf URL: https://arxiv.org/pdf/2508.01656
Copy Paste: [[2508.01656]] Authorship Attribution in Multilingual Machine-Generated Texts(https://arxiv.org/abs/2508.01656)
Keywords: language model, llm
Abstract: As Large Language Models (LLMs) have reached human-like fluency and coherence, distinguishing machine-generated text (MGT) from human-written content becomes increasingly difficult. While early efforts in MGT detection have focused on binary classification, the growing landscape and diversity of LLMs require a more fine-grained yet challenging authorship attribution (AA), i.e., being able to identify the precise generator (LLM or human) behind a text. However, AA remains nowadays confined to a monolingual setting, with English being the most investigated one, overlooking the multilingual nature and usage of modern LLMs. In this work, we introduce the problem of Multilingual Authorship Attribution, which involves attributing texts to human or multiple LLM generators across diverse languages. Focusing on 18 languages -- covering multiple families and writing scripts -- and 8 generators (7 LLMs and the human-authored class), we investigate the multilingual suitability of monolingual AA methods, their cross-lingual transferability, and the impact of generators on attribution performance. Our results reveal that while certain monolingual AA methods can be adapted to multilingual settings, significant limitations and challenges remain, particularly in transferring across diverse language families, underscoring the complexity of multilingual AA and the need for more robust approaches to better match real-world scenarios.

Title: CUPID: Evaluating Personalized and Contextualized Alignment of LLMs from Interactions

Authors: Tae Soo Kim, Yoonjoo Lee, Yoonah Park, Jiho Kim, Young-Ho Kim, Juho Kim
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2508.01674
Pdf URL: https://arxiv.org/pdf/2508.01674
Copy Paste: [[2508.01674]] CUPID: Evaluating Personalized and Contextualized Alignment of LLMs from Interactions(https://arxiv.org/abs/2508.01674)
Keywords: language model, llm, chat
Abstract: Personalization of Large Language Models (LLMs) often assumes users hold static preferences that reflect globally in all tasks. In reality, humans hold dynamic preferences that change depending on the context. As users interact with an LLM in various contexts, they naturally reveal their contextual preferences, which a model must infer and apply in future contexts to ensure alignment. To assess this, we introduce CUPID, a benchmark of 756 human-curated interaction session histories between users and LLM-based chat assistants. In each interaction session, the user provides a request in a specific context and expresses their preference through multi-turn feedback. Given a new user request and prior interaction sessions, our benchmark assesses whether LLMs can infer the preference relevant to this request and generate a response that satisfies this preference. With CUPID, we evaluated 10 open and proprietary LLMs, revealing that state-of-the-art LLMs struggle to infer preferences from multi-turn interactions and fail to discern what previous context is relevant to a new request -- under 50% precision and 65% recall. Our work highlights the need to advance LLM capabilities for more contextually personalized interactions and proposes CUPID as a resource to drive these improvements.

Title: The Bidirectional Process Reward Model

Authors: Lingyin Zhang, Jun Gao, Xiaoxue Ren, Ziqiang Cao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.01682
Pdf URL: https://arxiv.org/pdf/2508.01682
Copy Paste: [[2508.01682]] The Bidirectional Process Reward Model(https://arxiv.org/abs/2508.01682)
Keywords: language model, llm, prompt
Abstract: Process Reward Models (PRMs) have emerged as a promising approach to enhance the reasoning quality of Large Language Models (LLMs) by assigning fine-grained scores to intermediate reasoning steps within a solution trajectory. However, existing PRMs predominantly adopt a unidirectional left-to-right (L2R) evaluation paradigm, which limits their ability to leverage global context, making it challenging to verify the consistency of earlier steps based on later ones. In light of these challenges, we propose a novel bidirectional evaluation paradigm, named Bidirectional Process Reward Model (BiPRM). BiPRM seamlessly incorporates a parallel right-to-left (R2L) evaluation stream alongside the conventional L2R flow, enabling later reasoning steps to help assess earlier ones in real time. Notably, the built-in R2L evaluation is implemented solely through prompt modifications that reverse the original reasoning trajectory, without any additional parameters or inference latency introduced. This ensures BiPRM remains both efficient and broadly compatible with existing PRM studies. We conduct extensive experiments on two mathematical reasoning benchmarks using samples generated by three different policy models. Our method, BiPRM, is evaluated across three backbones and three distinct PRM objectives. Across all settings, BiPRM consistently outperforms unidirectional baselines, achieving up to a 31.9% improvement in stepwise reward evaluation. Generally, our results highlight BiPRM's effectiveness, robustness, and general applicability, offering a promising new direction for process-based reward modeling.
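
The bidirectional idea described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `scorer` is a hypothetical stand-in for a PRM call that scores the last step of the step list it receives, and averaging the two directions is an assumption, since the abstract does not say how the L2R and R2L streams are combined. The sketch also runs the two passes sequentially, whereas the paper describes them as parallel.

```python
from typing import Callable, List, Sequence

# Hypothetical PRM interface: scores the LAST step of `steps` given the rest.
Scorer = Callable[[str, Sequence[str]], float]


def bidirectional_step_scores(
    question: str, steps: Sequence[str], scorer: Scorer
) -> List[float]:
    """Score each step with left-to-right and right-to-left context."""
    n = len(steps)
    # L2R: step i is judged with steps 0..i-1 as context.
    l2r = [scorer(question, steps[: i + 1]) for i in range(n)]
    # R2L: reverse the trajectory so step i is judged with later steps as context.
    rev = list(reversed(steps))
    r2l_reversed = [scorer(question, rev[: j + 1]) for j in range(n)]
    r2l = list(reversed(r2l_reversed))  # realign scores to the original order
    # Assumed combination rule: simple average of both directions.
    return [(a + b) / 2 for a, b in zip(l2r, r2l)]


def toy_scorer(question: str, steps: Sequence[str]) -> float:
    """Toy stand-in for a PRM: the score decays with context length."""
    return 1.0 / len(steps)


scores = bidirectional_step_scores("2+2*3?", ["2*3=6", "2+6=8", "answer: 8"], toy_scorer)
print(scores)
```

Since the reversal happens purely in how the step list is presented to the scorer, no extra model parameters are involved, which matches the prompt-only framing in the abstract.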

Title: Collaborative Chain-of-Agents for Parametric-Retrieved Knowledge Synergy

Authors: Yi Jiang, Sendong Zhao, Jianbo Li, Haochun Wang, Lizhe Zhang, Yan Liu, Bin Qin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.01696
Pdf URL: https://arxiv.org/pdf/2508.01696
Copy Paste: [[2508.01696]] Collaborative Chain-of-Agents for Parametric-Retrieved Knowledge Synergy(https://arxiv.org/abs/2508.01696)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a promising framework for enhancing the capabilities of Large Language Models (LLMs), especially in knowledge-intensive tasks. Despite its advantages, current RAG methods often struggle to *fully exploit knowledge during generation*. In particular, the synergy between the model's internal parametric knowledge and external retrieved knowledge remains limited. Retrieved contents may sometimes mislead generation, while certain generated content can guide the model toward more accurate outputs. In this work, we propose Collaborative Chain-of-Agents, a framework designed to explicitly enhance the synergy between parametric and retrieved knowledge. Specifically, we first introduce CoCoA-zero, a multi-agent RAG framework that performs conditional knowledge induction and then reasons out answers. Building on this, we develop CoCoA, a long-chain training strategy that synthesizes extended multi-agent reasoning trajectories from CoCoA-zero to fine-tune the LLM. This strategy enhances the model's capability to explicitly integrate and jointly leverage parametric and retrieved knowledge. Experimental results show that CoCoA-zero and CoCoA achieve superior performance on open-domain and multi-hop QA tasks.

Title: Am I Blue or Is My Hobby Counting Teardrops? Expression Leakage in Large Language Models as a Symptom of Irrelevancy Disruption

Authors: Berkay Köprü, Mehrzad Mashal, Yigit Gurses, Akos Kadar, Maximilian Schmitt, Ditty Mathew, Felix Burkhardt, Florian Eyben, Björn W. Schuller
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.01708
Pdf URL: https://arxiv.org/pdf/2508.01708
Copy Paste: [[2508.01708]] Am I Blue or Is My Hobby Counting Teardrops? Expression Leakage in Large Language Models as a Symptom of Irrelevancy Disruption(https://arxiv.org/abs/2508.01708)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have advanced natural language processing (NLP) skills such as through next-token prediction and self-attention, but their ability to integrate broad context also makes them prone to incorporating irrelevant information. Prior work has focused on semantic leakage, bias introduced by semantically irrelevant context. In this paper, we introduce expression leakage, a novel phenomenon where LLMs systematically generate sentimentally charged expressions that are semantically unrelated to the input context. To analyse the expression leakage, we collect a benchmark dataset along with a scheme to automatically generate a dataset from free-form text from common-crawl. In addition, we propose an automatic evaluation pipeline that correlates well with human judgment, which accelerates the benchmarking by decoupling from the need of annotation for each analysed model. Our experiments show that, as the model scales in the parameter space, the expression leakage reduces within the same LLM family. On the other hand, we demonstrate that expression leakage mitigation requires specific care during the model building process, and cannot be mitigated by prompting. In addition, our experiments indicate that, when negative sentiment is injected in the prompt, it disrupts the generation process more than the positive sentiment, causing a higher expression leakage rate.

Title: CultureGuard: Towards Culturally-Aware Dataset and Guard Model for Multilingual Safety Applications

Authors: Raviraj Joshi, Rakesh Paul, Kanishk Singla, Anusha Kamath, Michael Evans, Katherine Luna, Shaona Ghosh, Utkarsh Vaidya, Eileen Long, Sanjay Singh Chauhan, Niranjan Wartikar
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.01710
Pdf URL: https://arxiv.org/pdf/2508.01710
Copy Paste: [[2508.01710]] CultureGuard: Towards Culturally-Aware Dataset and Guard Model for Multilingual Safety Applications(https://arxiv.org/abs/2508.01710)
Keywords: language model, llm, prompt, agent
Abstract: The increasing use of Large Language Models (LLMs) in agentic applications highlights the need for robust safety guard models. While content safety in English is well-studied, non-English languages lack similar advancements due to the high cost of collecting culturally aligned labeled datasets. We present CultureGuard, a novel solution for curating culturally aligned, high-quality safety datasets across multiple languages. Our approach introduces a four-stage synthetic data generation and filtering pipeline: cultural data segregation, cultural data adaptation, machine translation, and quality filtering. This pipeline enables the conversion and expansion of the Nemotron-Content-Safety-Dataset-V2 English safety dataset into eight distinct languages: Arabic, German, Spanish, French, Hindi, Japanese, Thai, and Chinese. The resulting dataset, Nemotron-Content-Safety-Dataset-Multilingual-v1, comprises 386,661 samples in 9 languages and facilitates the training of Llama-3.1-Nemotron-Safety-Guard-Multilingual-8B-v1 via LoRA-based fine-tuning. The final model achieves state-of-the-art performance on several multilingual content safety benchmarks. We also benchmark the latest open LLMs on multilingual safety and observe that these LLMs are more prone to give unsafe responses when prompted in non-English languages. This work represents a significant step toward closing the safety gap in multilingual LLMs by enabling the development of culturally aware safety guard models.

Title: Enhancing the Preference Extractor in Multi-turn Dialogues: From Annotating Disasters to Accurate Preference Extraction

Authors: Cheng Wang, Ziru Liu, Pengcheng Tang, Mingyu Zhang, Quanyu Dai, Yue Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.01739
Pdf URL: https://arxiv.org/pdf/2508.01739
Copy Paste: [[2508.01739]] Enhancing the Preference Extractor in Multi-turn Dialogues: From Annotating Disasters to Accurate Preference Extraction(https://arxiv.org/abs/2508.01739)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Identifying user preferences in dialogue systems is a pivotal aspect of providing satisfying services. Current research shows that using large language models (LLMs) to fine-tune a task-specific preference extractor yields excellent results in terms of accuracy and generalization. However, the primary challenge stems from the inherent difficulty in obtaining high-quality labeled multi-turn dialogue data. Accurately tracking user preference transitions across turns not only demands intensive domain expertise and contextual consistency maintenance from annotators (termed "Annotating Disaster") but also complicates model training due to error propagation in sequential dependency learning. Inspired by the observation that multi-turn preference extraction can be decomposed into iterative executions of one-turn extraction processes, we propose a novel dialogue data generation framework named IterChat. First, we construct a new data format that categorizes the dialogue data into attributed historical preferences and one-turn dialogues. This reduces the probability of annotation errors and improves annotation efficiency. Then, to generate a high-quality and diverse dialogue dataset, we adopt GPT4 to pre-define the preference slots in the target preference extractor task and then randomly sample the subset of the slots and their corresponding schema values to create the dialogue datasets. Experimental results indicate that fine-tuning or only few-shot prompting with the new dialogue format yields superior performance compared to the original multi-turn dialogues. Additionally, the new data format improves annotator efficiency, with a win rate 28.4% higher than the original multi-turn dialogues.

Title: A comprehensive taxonomy of hallucinations in Large Language Models

Authors: Manuel Cossio
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.01781
Pdf URL: https://arxiv.org/pdf/2508.01781
Copy Paste: [[2508.01781]] A comprehensive taxonomy of hallucinations in Large Language Models(https://arxiv.org/abs/2508.01781)
Keywords: language model, llm, hallucination, prompt
Abstract: Large language models (LLMs) have revolutionized natural language processing, yet their propensity for hallucination, generating plausible but factually incorrect or fabricated content, remains a critical challenge. This report provides a comprehensive taxonomy of LLM hallucinations, beginning with a formal definition and a theoretical framework that posits its inherent inevitability in computable LLMs, irrespective of architecture or training. It explores core distinctions, differentiating between intrinsic (contradicting input context) and extrinsic (inconsistent with training data or reality), as well as factuality (absolute correctness) and faithfulness (adherence to input). The report then details specific manifestations, including factual errors, contextual and logical inconsistencies, temporal disorientation, ethical violations, and task-specific hallucinations across domains like code generation and multimodal applications. It analyzes the underlying causes, categorizing them into data-related issues, model-related factors, and prompt-related influences. Furthermore, the report examines cognitive and human factors influencing hallucination perception, surveys evaluation benchmarks and metrics for detection, and outlines architectural and systemic mitigation strategies. Finally, it introduces web-based resources for monitoring LLM releases and performance. This report underscores the complex, multifaceted nature of LLM hallucinations and emphasizes that, given their theoretical inevitability, future efforts must focus on robust detection, mitigation, and continuous human oversight for responsible and reliable deployment in critical applications.

Title: AGENTICT$^2$S:Robust Text-to-SPARQL via Agentic Collaborative Reasoning over Heterogeneous Knowledge Graphs for the Circular Economy

Authors: Yang Zhao, Chengxiao Dai, Wei Zhuo, Tan Chuan Fu, Yue Xiu, Dusit Niyato, Jonathan Z. Low, Eugene Ho Hong Zhuang, Daren Zong Loong Tan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.01815
Pdf URL: https://arxiv.org/pdf/2508.01815
Copy Paste: [[2508.01815]] AGENTICT$^2$S:Robust Text-to-SPARQL via Agentic Collaborative Reasoning over Heterogeneous Knowledge Graphs for the Circular Economy(https://arxiv.org/abs/2508.01815)
Keywords: prompt, agent
Abstract: Question answering over heterogeneous knowledge graphs (KGQA) involves reasoning across diverse schemas, incomplete alignments, and distributed data sources. Existing text-to-SPARQL approaches rely on large-scale domain-specific fine-tuning or operate within single-graph settings, limiting their generalizability in low-resource domains and their ability to handle queries spanning multiple graphs. These challenges are particularly relevant in domains such as the circular economy, where information about classifications, processes, and emissions is distributed across independently curated knowledge graphs (KGs). We present AgenticT$^2$S, a modular framework that decomposes KGQA into subtasks managed by specialized agents responsible for retrieval, query generation, and verification. A scheduler assigns subgoals to different graphs using weak-to-strong alignment strategies. A two-stage verifier detects structurally invalid and semantically underspecified queries through symbolic validation and counterfactual consistency checks. Experiments on real-world circular economy KGs demonstrate that AgenticT$^2$S improves execution accuracy by 17.3% and triple level F$_1$ by 25.4% over the best baseline, while reducing the average prompt length by 46.4%. These results demonstrate the benefits of agent-based schema-aware reasoning for scalable KGQA and support decision-making in sustainability domains through robust cross-graph reasoning.
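
The abstract reports a triple-level F$_1$ gain but does not define the metric. One common reading, assumed here and not confirmed by the source, is a set-overlap F1 between the (subject, predicate, object) triples a generated query returns and the gold triples. The sketch below implements that reading. The example triples are invented for illustration.

```python
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def triple_f1(predicted: Iterable[Triple], gold: Iterable[Triple]) -> float:
    """Set-overlap F1 between predicted and gold triples (assumed definition)."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    if true_positives == 0:
        return 0.0  # also covers empty predictions or empty gold
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Invented example triples, not data from the paper.
gold = [("steel", "hasProcess", "smelting"), ("steel", "hasClass", "metal"),
        ("smelting", "emits", "CO2")]
pred = [("steel", "hasProcess", "smelting"), ("steel", "hasClass", "metal"),
        ("smelting", "emits", "SO2")]
print(f"triple-level F1 = {triple_f1(pred, gold):.3f}")
```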

Title: MLP Memory: Language Modeling with Retriever-pretrained External Memory

Authors: Rubin Wei, Jiaqi Cao, Jiarui Wang, Jushi Kai, Qipeng Guo, Bowen Zhou, Zhouhan Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.01832
Pdf URL: https://arxiv.org/pdf/2508.01832
Copy Paste: [[2508.01832]] MLP Memory: Language Modeling with Retriever-pretrained External Memory(https://arxiv.org/abs/2508.01832)
Keywords: language model, llm, hallucination
Abstract: While modern decoder-only LLMs achieve superior performance across various domains, hallucinations have risen to be a common problem in their generated text, hindering their application in knowledge-intensive tasks. Retriever-augmented generation (RAG) offers a solution, but the non-parametric nature of the retriever hinders its deep interaction with LLM. In this work, we propose to decouple memorization from the LLM decoder using a pretrained, differentiable external memory. The external memory is an MLP pretrained by imitating the behavior of a retriever on the entire pretraining dataset. Our resulting architecture, which comprises a transformer decoder and an external MLP memory pretrained on language modeling and retriever imitation respectively, demonstrates strong perplexity and performance on downstream tasks. Experiments show our architecture exhibits steeper power-law scaling with model size, achieving 17.5% and 24.1% improvement on WikiText-103 and Web datasets compared to decoder-only models while benefiting from added training without overfitting. We demonstrate superior performance on three hallucination benchmarks and nine memory-intensive tasks. Additionally, our approach delivers $80\times$ speedup over $k$NN-LM (500M tokens) and $1.3\times$ faster inference than decoder-only models. Unlike $k$NN-LM, which impairs reasoning, our MLP memory improves StrategyQA performance. We will open-source our code and models in the future.

Title: Web-CogReasoner: Towards Knowledge-Induced Cognitive Reasoning for Web Agents

Authors: Yuhan Guo, Cong Guo, Aiwen Sun, Hongliang He, Xinyu Yang, Yue Lu, Yingji Zhang, Xuntao Guo, Dong Zhang, Jianzhuang Liu, Jiang Duan, Yijia Xiao, Liangjian Wen, Hai-Ming Xu, Yong Dai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.01858
Pdf URL: https://arxiv.org/pdf/2508.01858
Copy Paste: [[2508.01858]] Web-CogReasoner: Towards Knowledge-Induced Cognitive Reasoning for Web Agents(https://arxiv.org/abs/2508.01858)
Keywords: chain-of-thought, agent
Abstract: Multimodal large-scale models have significantly advanced the development of web agents, enabling perception and interaction with digital environments akin to human cognition. In this paper, we argue that web agents must first acquire sufficient knowledge to effectively engage in cognitive reasoning. Therefore, we decompose a web agent's capabilities into two essential stages: knowledge content learning and cognitive processes. To formalize this, we propose the Web-CogKnowledge Framework, categorizing knowledge as Factual, Conceptual, and Procedural. In this framework, knowledge content learning corresponds to the agent's processes of Memorizing and Understanding, which rely on the first two knowledge types, representing the "what" of learning. Conversely, cognitive processes correspond to Exploring, grounded in Procedural knowledge, defining the "how" of reasoning and action. To facilitate knowledge acquisition, we construct the Web-CogDataset, a structured resource curated from 14 real-world websites, designed to systematically instill the core knowledge necessary for a web agent. This dataset serves as the agent's conceptual grounding (the "nouns" upon which comprehension is built) as well as the basis for learning how to reason and act. Building on this foundation, we operationalize these processes through a novel knowledge-driven Chain-of-Thought (CoT) reasoning framework, developing and training our proposed agent, the Web-CogReasoner. Extensive experimentation reveals its significant superiority over existing models, especially in generalizing to unseen tasks where structured knowledge is decisive. To enable rigorous evaluation, we introduce the Web-CogBench, a comprehensive evaluation suite designed to assess and compare agent performance across the delineated knowledge domains and cognitive capabilities. Our code and data are open sourced at this https URL

Title: Counterfactual Probing for Hallucination Detection and Mitigation in Large Language Models

Authors: Yijun Feng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.01862
Pdf URL: https://arxiv.org/pdf/2508.01862
Copy Paste: [[2508.01862]] Counterfactual Probing for Hallucination Detection and Mitigation in Large Language Models(https://arxiv.org/abs/2508.01862)
Keywords: language model, llm, hallucination
Abstract: Large Language Models have demonstrated remarkable capabilities across diverse tasks, yet they frequently generate hallucinations: outputs that are fluent but factually incorrect or unsupported. We propose Counterfactual Probing, a novel approach for detecting and mitigating hallucinations in LLM outputs. Our method dynamically generates counterfactual statements that appear plausible but contain subtle factual errors, then evaluates the model's sensitivity to these perturbations. We hypothesize that genuine knowledge exhibits robustness to counterfactual variations, while hallucinated content shows inconsistent confidence patterns when confronted with plausible alternatives. Our comprehensive evaluation on TruthfulQA, factual statement datasets, and curated hallucination examples demonstrates that counterfactual probing achieves superior detection performance compared to baseline methods, while our adaptive mitigation strategies reduce hallucination scores by an average of 24.5%. The approach requires no model retraining and can be integrated into existing LLM pipelines as a real-time verification mechanism.

Title: Quantum-RAG and PunGPT2: Advancing Low-Resource Language Generation and Retrieval for the Punjabi Language

Authors: Jaskaranjeet Singh, Rakesh Thakur
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.01918
Pdf URL: https://arxiv.org/pdf/2508.01918
Copy Paste: [[2508.01918]] Quantum-RAG and PunGPT2: Advancing Low-Resource Language Generation and Retrieval for the Punjabi Language(https://arxiv.org/abs/2508.01918)
Keywords: language model, gpt, llm, retrieval-augmented generation
Abstract: Despite the rapid advancement of large language models (LLMs), low-resource languages remain largely excluded from the NLP landscape. We present PunGPT2, the first fully open-source suite of Punjabi large language models, trained from scratch on a 35GB domain-diverse corpus encompassing literature, religious texts, news, and social discourse. Unlike prior multilingual approaches, PunGPT2 captures rich syntactic and morphological features unique to Punjabi through a tokenizer optimised with byte pair encoding and linguistically aligned pretraining objectives. To improve factual grounding and domain recall, we introduce Pun-RAG, a retrieval-augmented generation framework combining PunGPT2 with a dense FAISS retriever over a curated Punjabi knowledge base. We further develop Pun-Instruct, a parameter-efficient, instruction-tuned variant using QLoRA, enabling robust zero-shot and instruction-following performance with significantly reduced compute needs. As a key innovation, we propose Quantum-RAG, a novel hybrid retrieval system that fuses sparse (BM25) and dense methods with quantum-inspired semantic matching. By encoding queries using amplitude-based embeddings and retrieving via quantum kernel similarity, Quantum-RAG achieves improved contextual relevance with minimal memory overhead, marking the first practical integration of quantum representations in low-resource language generation. Our models significantly outperform strong multilingual baselines (mBERT, mT5, MuRIL) in perplexity, factuality, and fluency. This work provides a scalable, reproducible blueprint for extending LLM capabilities to underrepresented languages and pioneers quantum-aware retrieval in low-resource NLP.
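
To make the hybrid retrieval idea concrete, here is a minimal sketch of fusing a sparse score with a quantum-inspired similarity. Everything in it is an assumption for illustration: the abstract does not give the fusion rule, so a weighted sum of min-max-normalised scores is used, and the "quantum kernel" is taken to be the fidelity between amplitude-encoded (L2-normalised) vectors, i.e. the squared normalised inner product. The sparse scores are passed in as plain numbers rather than computed with BM25.

```python
import math
from typing import List, Sequence


def min_max(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def fidelity_kernel(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared inner product of L2-normalised vectors (assumed kernel)."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    dot = sum(a * b for a, b in zip(u, v))
    return (dot / (norm_u * norm_v)) ** 2


def hybrid_rank(sparse_scores: Sequence[float], query_vec: Sequence[float],
                doc_vecs: Sequence[Sequence[float]], alpha: float = 0.5) -> List[int]:
    """Return document indices ordered by the fused score, best first."""
    dense = [fidelity_kernel(query_vec, d) for d in doc_vecs]
    fused = [alpha * s + (1 - alpha) * d
             for s, d in zip(min_max(sparse_scores), min_max(dense))]
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)


# Toy inputs (hypothetical): three documents with sparse scores and 2-d embeddings.
ranking = hybrid_rank([2.0, 0.5, 1.0], [1.0, 0.0],
                      [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]], alpha=0.3)
print(ranking)
```

The weight `alpha` controls how much the sparse signal counts. Normalising both score lists first keeps one scale from dominating the other.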
摘要：尽管大语言模型（LLMS）迅速发展，但低资源语言仍在很大程度上排除在NLP景观之外。我们介绍了Pungpt2，这是旁遮普语大型语言模型的第一套完全开源的套件，该套件在35GB的域名多样性语料库中从头开始训练，其中包括文学，宗教文本，新闻和社会话语。与以前的多语言方法不同，Pungpt2通过使用字节对编码和语言对齐的预处理预处理的目标来捕获旁遮普特征的丰富句法和形态特征。为了改善事实接地和域名，我们介绍了Pun-Rag，这是一个检索型的生成框架，将Pungpt2与密集的Faiss检索器结合在一个精心策划的旁遮普知识基础上。我们进一步开发了PUN-Instruct，这是一种使用Qlora的参数效率，指令调整的变体，可实现稳健的零射击和遵循指令遵循的性能，并显着降低了计算需求。作为关键创新，我们提出了量子抹布，这是一种新型的混合检索系统，该系统融合了稀疏（BM25）和密集的方法，并具有量子启发的语义匹配。通过使用基于振幅的嵌入并通过量子内核相似性检索查询，量子rag可以通过最小的内存高架来提高上下文相关性，以标志着低资源语言生成中量子表示的首次实用集成。我们的模型在困惑，事实和流利度上大大优于强大的多语言基线（Mbert，Mt5，Muril）。这项工作提供了可扩展的，可再现的蓝图，用于将LLM功能扩展到代表性不足的语言和先驱量子量子量的低资源NLP的量子检索
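
The abstract above describes fusing a sparse score with a "quantum kernel similarity" over amplitude-based embeddings but gives no formulas. The following is only a minimal sketch of one common quantum-inspired reading, not the authors' method: amplitude encoding is taken to mean L2-normalising a real vector, and the kernel is taken to be the fidelity (squared overlap) of the two normalised vectors. The function names, the `alpha` weight, and the assumption that the sparse score is already scaled to [0, 1] are all hypothetical.

```python
import math


def amplitude_encode(vec):
    """L2-normalise a real vector so its squared entries sum to 1 (assumed encoding)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot amplitude-encode the zero vector")
    return [x / norm for x in vec]


def fidelity_kernel(a, b):
    """Squared inner product of the two amplitude-encoded vectors, in [0, 1]."""
    ua, ub = amplitude_encode(a), amplitude_encode(b)
    overlap = sum(x * y for x, y in zip(ua, ub))
    return overlap * overlap


def hybrid_score(sparse_score, query_vec, doc_vec, alpha=0.5):
    """Weighted sum of a sparse score (assumed pre-scaled to [0, 1]) and the kernel."""
    return alpha * sparse_score + (1 - alpha) * fidelity_kernel(query_vec, doc_vec)
```

Under these assumptions, parallel vectors give a kernel value of 1 and orthogonal vectors give 0, so `alpha` simply trades lexical match against this similarity.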

Title: Word Overuse and Alignment in Large Language Models: The Influence of Learning from Human Feedback

Authors: Tom S. Juzek, Zina B. Ward
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.01930
Pdf URL: https://arxiv.org/pdf/2508.01930
Copy Paste: [[2508.01930]] Word Overuse and Alignment in Large Language Models: The Influence of Learning from Human Feedback(https://arxiv.org/abs/2508.01930)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are known to overuse certain terms like "delve" and "intricate." The exact reasons for these lexical choices, however, have been unclear. Using Meta's Llama model, this study investigates the contribution of Learning from Human Feedback (LHF), under which we subsume Reinforcement Learning from Human Feedback and Direct Preference Optimization. We present a straightforward procedure for detecting the lexical preferences of LLMs that are potentially LHF-induced. Next, we more conclusively link LHF to lexical overuse by experimentally emulating the LHF procedure and demonstrating that participants systematically prefer text variants that include certain words. This lexical overuse can be seen as a sort of misalignment, though our study highlights the potential divergence between the lexical expectations of different populations -- namely LHF workers versus LLM users. Our work contributes to the growing body of research on explainable artificial intelligence and emphasizes the importance of both data and procedural transparency in alignment research.
摘要：众所周知，大型语言模型（LLMS）过度使用某些术语，例如“ delve”和“复杂”。但是，这些词汇选择的确切原因尚不清楚。使用Meta的Llama模型，本研究研究了从人类反馈（LHF）中学习的贡献，根据该贡献，我们从人类的反馈和直接偏好优化中进行了增强学习。我们提出了一种直接的程序，用于检测潜在的LHF诱导的LLM的词汇偏好。接下来，我们通过实验模拟LHF程序，更结论性地将LHF与词汇过度使用联系起来，并证明参与者系统地更喜欢包含某些单词的文本变体。尽管我们的研究突出了不同人群的词汇期望之间的潜在差异 - 即LHF工人与LLM用户之间的潜在差异，但这种词汇过度使用可以看作是一种未对准。我们的工作有助于对可解释的人工智能的不断增长的研究体系，并强调数据和程序透明度在对齐研究中的重要性。

Title: ROVER: Recursive Reasoning Over Videos with Vision-Language Models for Embodied Tasks

Authors: Philip Schroeder, Ondrej Biza, Thomas Weng, Hongyin Luo, James Glass
Subjects: cs.CL, cs.AI, cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2508.01943
Pdf URL: https://arxiv.org/pdf/2508.01943
Copy Paste: [[2508.01943]] ROVER: Recursive Reasoning Over Videos with Vision-Language Models for Embodied Tasks(https://arxiv.org/abs/2508.01943)
Keywords: language model, hallucination
Abstract: Vision-language models (VLMs) have exhibited impressive capabilities across diverse image understanding tasks, but still struggle in settings that require reasoning over extended sequences of camera frames from a video. This limits their utility in embodied settings, which require reasoning over long frame sequences from a continuous stream of visual input at each moment of a task attempt. To address this limitation, we propose ROVER (Reasoning Over VidEo Recursively), a framework that enables the model to recursively decompose long-horizon video trajectories into segments corresponding to shorter subtasks within the trajectory. In doing so, ROVER facilitates more focused and accurate reasoning over temporally localized frame sequences without losing global context. We evaluate ROVER, implemented using an in-context learning approach, on diverse OpenX Embodiment videos and on a new dataset derived from RoboCasa that consists of 543 videos showing both expert and perturbed non-expert trajectories across 27 robotic manipulation tasks. ROVER outperforms strong baselines across three video reasoning tasks: task progress estimation, frame-level natural language reasoning, and video question answering. We observe that, by reducing the number of frames the model reasons over at each timestep, ROVER mitigates hallucinations, especially during unexpected or non-optimal moments of a trajectory. In addition, by enabling the implementation of a subtask-specific sliding context window, ROVER's time complexity scales linearly with video length, an asymptotic improvement over baselines. Demos, code, and data available at: this https URL
摘要：视觉语言模型（VLMS）在各种图像理解任务中表现出令人印象深刻的功能，但仍需要在视频中超过扩展的相机框架序列的设置中挣扎。这限制了它们在体现设置中的实用性，这需要在任务尝试的每一刻连续的视觉输入流对长帧序列进行推理。为了解决这一限制，我们提出了漫游者（递归递归的视频推理），该框架使模型能够将长途视频轨迹递归分解为与轨迹中较短子任务相对应的段。在此过程中，流浪者促进了更加集中和准确的推理，而不是暂时局部的框架序列而不会失去全球环境。我们评估了Rover，并在不同的OpenX实现视频和源自Robocasa的新数据集上评估了Rover的实施，该数据集由543个视频组成，显示了27个机器人操纵任务中的专家和扰动的非专家轨迹。漫游者在三个视频推理任务上的表现优于强大的基线：任务进度估计，框架级别的自然语言推理和视频问题回答。我们观察到，通过减少每个时间步中模型原因的帧数，漫游者会减轻幻觉，尤其是在轨迹的意外或非最佳时刻时期。此外，通过实现子任务特异性滑动上下文窗口，Rover的时间复杂性与视频长度线性缩放，对基线的渐近改进。演示，代码和数据可用：此HTTPS URL

Title: SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension

Authors: Junjie Wu, Jiangnan Li, Yuqing Li, Lemao Liu, Liyan Xu, Jiwei Li, Dit-Yan Yeung, Jie Zhou, Mo Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.01959
Pdf URL: https://arxiv.org/pdf/2508.01959
Copy Paste: [[2508.01959]] SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension(https://arxiv.org/abs/2508.01959)
Keywords: retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) over long documents typically involves splitting the text into smaller chunks, which serve as the basic units for retrieval. However, due to dependencies across the original document, contextual information is often essential for accurately interpreting each chunk. To address this, prior work has explored encoding longer context windows to produce embeddings for longer chunks. Despite these efforts, gains in retrieval and downstream tasks remain limited. This is because (1) longer chunks strain the capacity of embedding models due to the increased amount of information they must encode, and (2) many real-world applications still require returning localized evidence due to constraints on model or human bandwidth. We propose an alternative approach to this challenge by representing short chunks in a way that is conditioned on a broader context window to enhance retrieval performance -- i.e., situating a chunk's meaning within its context. We further show that existing embedding models are not well-equipped to encode such situated context effectively, and thus introduce a new training paradigm and develop the situated embedding models (SitEmb). To evaluate our method, we curate a book-plot retrieval dataset specifically designed to assess situated retrieval capabilities. On this benchmark, our SitEmb-v1 model based on BGE-M3 substantially outperforms state-of-the-art embedding models, including several with up to 7-8B parameters, with only 1B parameters. Our 8B SitEmb-v1.5 model further improves performance by over 10% and shows strong results across different languages and several downstream applications.
摘要：在长文档上检索增强的一代（RAG）通常涉及将文本分成较小的块，这些块是基本的检索单元。但是，由于原始文档的依赖关系，上下文信息通常对于准确解释每个块是必不可少的。为了解决这个问题，先前的工作探索了编码更长的上下文窗口以生成较长块的嵌入。尽管做出了这些努力，但在检索和下游任务中的收益仍然有限。这是因为（1）由于其必须编码的信息增加而导致嵌入模型的能力的较长块，并且（2）由于模型或人类带宽的限制，许多现实世界应用仍需要返回本地证据。我们通过在更广泛的上下文窗口中代表短块来提高检索性能的方式来提出另一种方法来应对这一挑战 - 即，在其上下文中置于块的含义。我们进一步表明，现有的嵌入模型不具备有效地编码此类上下文的能力，因此引入了新的训练范式并开发位置的嵌入模型（SITEMB）。为了评估我们的方法，我们策划了一个专门设计用于评估位置检索功能的书籍图检索数据集。在此基准测试上，我们基于BGE-M3的SiteMB-V1模型基本上优于最先进的嵌入模型，其中包括具有1B参数的几种具有高达7-8B参数的模型。我们的8B SiteMB-V1.5模型进一步提高了10％以上的性能，并在不同语言和几种下游应用程序上显示出强劲的结果。

Title: TIBSTC-CoT: A Multi-Domain Instruction Dataset for Chain-of-Thought Reasoning in Language Models

Authors: Fan Gao, Cheng Huang, Nyima Tashi, Yutong Liu, Xiangxiang Wang, Thupten Tsering, Ban Ma-bao, Renzeg Duojie, Gadeng Luosang, Rinchen Dongrub, Dorje Tashi, Xiao Feng, Hao Wang, Yongbin Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.01977
Pdf URL: https://arxiv.org/pdf/2508.01977
Copy Paste: [[2508.01977]] TIBSTC-CoT: A Multi-Domain Instruction Dataset for Chain-of-Thought Reasoning in Language Models(https://arxiv.org/abs/2508.01977)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: To address the severe data scarcity in Tibetan, a low-resource language spoken by over six million people, we introduce TIBSTC-CoT, a large-scale, multi-domain Tibetan dataset automatically constructed via chain-of-thought prompting with large language models (LLMs). TIBSTC-CoT establishes a scalable and reproducible framework for dataset creation in low-resource settings, covering diverse domains and reasoning patterns essential for language understanding and generation. Building on this dataset, we develop the Sunshine-thinking LLM family, a series of Tibetan-centric LLMs equipped with chain-of-thought capabilities. Trained entirely on TIBSTC-CoT, Sunshine-thinking has demonstrated strong reasoning and generation performance, comparable to state-of-the-art (SOTA) multilingual LLMs. Our work marks a significant step toward inclusive AI by enabling high-quality Tibetan language processing through both resource creation and model innovation. All data are available: this https URL.
摘要：为了解决超过600万人使用的低资源语言的藏族严重数据稀缺性，我们介绍了Tibstc-Cot，这是大规模的，多域的藏族数据集，该数据集自动通过大型语言模型（LLMS）自动构建。 TIBSTC-COT在低资源设置中为数据集创建建立了一个可扩展且可重复的框架，涵盖了语言理解和生成必不可少的不同领域和推理模式。在此数据集的基础上，我们开发了具有阳光思维的LLM Family，这是一系列以藏为中心的LLM，配备了经过三通的能力。阳光思维完全接受了TIBSTC-COT的培训，已经表现出强大的推理和发电性能，可与最先进的（SOTA）多语言LLM相当。我们的工作标志着通过资源创造和模型创新来实现高质量的藏族语言处理，这是迈向包容性AI的重要一步。所有数据都可用：此HTTPS URL。

Title: Contextually Aware E-Commerce Product Question Answering using RAG

Authors: Praveen Tangarajan, Anand A. Rajasekar, Manish Rathi, Vinay Rao Dandin, Ozan Ersoy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.01990
Pdf URL: https://arxiv.org/pdf/2508.01990
Copy Paste: [[2508.01990]] Contextually Aware E-Commerce Product Question Answering using RAG(https://arxiv.org/abs/2508.01990)
Keywords: retrieval augmented generation
Abstract: E-commerce product pages contain a mix of structured specifications, unstructured reviews, and contextual elements like personalized offers or regional variants. Although informative, this volume can lead to cognitive overload, making it difficult for users to quickly and accurately find the information they need. Existing Product Question Answering (PQA) systems often fail to utilize rich user context and diverse product information effectively. We propose a scalable, end-to-end framework for e-commerce PQA using Retrieval Augmented Generation (RAG) that deeply integrates contextual understanding. Our system leverages conversational history, user profiles, and product attributes to deliver relevant and personalized answers. It adeptly handles objective, subjective, and multi-intent queries across heterogeneous sources, while also identifying information gaps in the catalog to support ongoing content improvement. We also introduce novel metrics to measure the framework's performance which are broadly applicable for RAG system evaluations.
摘要：电子商务产品页面包含结构化规范，非结构化评论以及诸如个性化报价或区域变体（例如，无效的评论）的混合。尽管内容丰富，但此卷可能会导致认知超负荷，这使用户很难快速，准确地找到所需的信息。现有的产品问答（PQA）系统通常无法利用丰富的用户上下文，并有效地多样化产品信息。我们使用检索增强发电（RAG）为电子商务PQA提供了一个可扩展的端到端框架，该框架深入整合了上下文理解。我们的系统利用对话历史记录，用户资料和产品属性来提供相关和个性化的答案。它擅长处理各种异质来源的客观，主观和多面查询，同时还识别目录中的信息差距以支持正在进行的内容改进。我们还介绍了新颖的指标，以测量框架的性能，这广泛适用于抹布系统评估。

Title: Prompting Large Language Models to Detect Dementia Family Caregivers

Authors: Md Badsha Biswas, Özlem Uzuner
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.01999
Pdf URL: https://arxiv.org/pdf/2508.01999
Copy Paste: [[2508.01999]] Prompting Large Language Models to Detect Dementia Family Caregivers(https://arxiv.org/abs/2508.01999)
Keywords: language model, llm, prompt
Abstract: Social media, such as Twitter, provides opportunities for caregivers of dementia patients to share their experiences and seek support for a variety of reasons. Availability of this information online also paves the way for the development of internet-based interventions in their support. However, for this purpose, tweets written by caregivers of dementia patients must first be identified. This paper demonstrates our system for the SMM4H 2025 shared task 3, which focuses on detecting tweets posted by individuals who have a family member with dementia. The task is outlined as a binary classification problem, differentiating between tweets that mention dementia in the context of a family member and those that do not. Our solution to this problem explores large language models (LLMs) with various prompting methods. Our results show that a simple zero-shot prompt on a fine-tuned model yielded the best results. Our final system achieved a macro F1-score of 0.95 on the validation set and the test set. Our full code is available on GitHub.
摘要：社交媒体（例如Twitter）为痴呆症患者的护理人员提供了分享他们的经验并寻求支持的机会。此信息在线的可用性还为开发基于Internet的干预措施的支持铺平了道路。但是，为此，必须首先确定由痴呆症患者的护理人员撰写的推文。本文展示了我们针对SMM4H 2025共享任务3的系统，该任务的重点是检测患有痴呆症的家人的个人发布的推文。该任务被概述为二进制分类问题，与在家庭成员的背景下提到痴呆症的推文与那些不提及痴呆症的推文。我们对此问题的解决方案通过各种提示方法探索了大型语言模型（LLM）。我们的结果表明，在微调模型上进行简单的零射提示可得出最佳结果。我们的最终系统在验证集和测试集上达到了0.95的宏F1得分。我们的完整代码可在GitHub上找到。
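
The abstract above reports a macro F1-score for a binary classification task. As a reminder of what that metric averages, here is a small self-contained sketch; the label values and example data are made up for illustration and are not from the shared task.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 for one class, treating `cls` as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Hypothetical labels: 1 = tweet by a family caregiver, 0 = anything else.
example_true = [1, 1, 1, 0]
example_pred = [1, 1, 0, 0]
example_score = macro_f1(example_true, example_pred)
```

Because each class contributes equally regardless of its frequency, macro F1 penalises a system that does well only on the majority class.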

Title: SpeechRole: A Large-Scale Dataset and Benchmark for Evaluating Speech Role-Playing Agents

Authors: Changhao Jiang, Jiajun Sun, Yifei Cao, Jiabao Zhuang, Hui Li, Xiaoran Fan, Ming Zhang, Junjie Ye, Shihan Dou, Zhiheng Xi, Jingqi Tong, Yilong Wu, Baoyu Fan, Zhen Wang, Tao Liang, Zhihui Fei, Mingyang Wan, Guojun Ma, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.02013
Pdf URL: https://arxiv.org/pdf/2508.02013
Copy Paste: [[2508.02013]] SpeechRole: A Large-Scale Dataset and Benchmark for Evaluating Speech Role-Playing Agents(https://arxiv.org/abs/2508.02013)
Keywords: agent
Abstract: Recently, role-playing agents have emerged as a promising paradigm for achieving personalized interaction and emotional resonance. Existing research primarily focuses on the textual modality, neglecting the critical dimension of speech in realistic interactive scenarios. In particular, there is a lack of systematic evaluation for Speech Role-Playing Agents (SRPAs). To address this gap, we construct SpeechRole-Data, a large-scale, high-quality dataset that comprises 98 diverse roles and 112k speech-based single-turn and multi-turn conversations. Each role demonstrates distinct vocal characteristics, including timbre and prosody, thereby enabling more sophisticated speech role-playing. Furthermore, we propose SpeechRole-Eval, a multidimensional evaluation benchmark that systematically assesses SRPA performance in key aspects such as fundamental interaction ability, speech expressiveness, and role-playing fidelity. Experimental results reveal the advantages and challenges of both cascaded and end-to-end speech role-playing agents in maintaining vocal style consistency and role coherence. We release all data, code, and baseline models to provide a solid foundation for speech-driven multimodal role-playing research and to foster further developments in this field.
摘要：最近，角色扮演者已成为实现个性化互动和情感共鸣的有希望的范式。现有研究主要集中在文本方式上，忽略了现实的互动场景中语音的关键维度。特别是，缺乏对语音角色扮演剂（SRPA）的系统评估。为了解决这一差距，我们构建了SpeechRole-DATA，这是一个大规模的高质量数据集，包括98个不同的角色和112K基于语音的单转弯和多转交谈。每个角色都表现出独特的声音特征，包括音色和韵律，从而使语音角色扮演更加复杂。此外，我们提出了Speakrole-eval，这是一种多维评估基准，该基准在基本互动能力，语音表现力和角色扮演忠诚度等关键方面有系统地评估SRPAS的性能。实验结果揭示了级联和端到端语音角色扮演剂在保持人声风格一致性和角色连贯性方面的优势和挑战。我们发布了所有数据，代码和基线模型，为语音驱动的多模式角色扮演研究提供了坚实的基础，并促进了该领域的进一步发展。

Title: SpeechR: A Benchmark for Speech Reasoning in Large Audio-Language Models

Authors: Wanqi Yang, Yanda Li, Yunchao Wei, Meng Fang, Ling Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.02018
Pdf URL: https://arxiv.org/pdf/2508.02018
Copy Paste: [[2508.02018]] SpeechR: A Benchmark for Speech Reasoning in Large Audio-Language Models(https://arxiv.org/abs/2508.02018)
Keywords: language model
Abstract: Large audio-language models (LALMs) have achieved near-human performance in sentence-level transcription and emotion recognition. However, existing evaluations focus mainly on surface-level perception, leaving the capacity of models for contextual and inference-driven reasoning in speech-based scenarios insufficiently examined. To address this gap, we introduce SpeechR, a unified benchmark for evaluating reasoning over speech in large audio-language models. SpeechR evaluates models along three key dimensions: factual retrieval, procedural inference, and normative judgment. It includes three distinct evaluation formats. The multiple-choice version measures answer selection accuracy. The generative version assesses the coherence and logical consistency of reasoning chains. The acoustic-feature version investigates whether variations in stress and emotion affect reasoning performance. Evaluations on eleven state-of-the-art LALMs reveal that high transcription accuracy does not translate into strong reasoning capabilities. SpeechR establishes a structured benchmark for evaluating reasoning in spoken language, enabling more targeted analysis of model capabilities across diverse dialogue-based tasks.
摘要：大型音频模型（LALMS）在句子级的转录和情感识别方面已取得了近乎人类的表现。但是，现有的评估主要集中在表面层面的看法上，在基于语音和推理的情况下，模型的能力在基于语音和推理驱动的场景中不足。为了解决这一差距，我们介绍了SpeechR，这是一个统一的基准测试，用于评估大型音频模型中的语音推理。 SpeechR评估沿三个关键维度的模型：事实检索，程序推断和规范性判断。它包括三种不同的评估格式。多项选择版本测量了答案选择精度。生成版本评估了推理链的相干性和逻辑一致性。声学版本调查了压力和情绪的变化是否影响推理性能。对11个最先进的LALMS的评估表明，高转录精度不能转化为强大的推理能力。 SpeechR建立了一种结构化基准，用于评估口语中的推理，从而对基于对话的任务进行更有针对性的模型能力分析。

Title: Diagnosing Memorization in Chain-of-Thought Reasoning, One Token at a Time

Authors: Huihan Li, You Chen, Siyuan Wang, Yixin He, Ninareh Mehrabi, Rahul Gupta, Xiang Ren
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.02037
Pdf URL: https://arxiv.org/pdf/2508.02037
Copy Paste: [[2508.02037]] Diagnosing Memorization in Chain-of-Thought Reasoning, One Token at a Time(https://arxiv.org/abs/2508.02037)
Keywords: language model, llm, chain-of-thought
Abstract: Large Language Models (LLMs) perform well on reasoning benchmarks but often fail when inputs alter slightly, raising concerns about the extent to which their success relies on memorization. This issue is especially acute in Chain-of-Thought (CoT) reasoning, where spurious memorized patterns can trigger intermediate errors that cascade into incorrect final answers. We introduce STIM, a novel framework for Source-aware Token-level Identification of Memorization, which attributes each token in a reasoning chain to one of multiple memorization sources - local, mid-range, or long-range - based on their statistical co-occurrence with the token in the pretraining corpus. Our token-level analysis across tasks and distributional settings reveals that models rely more on memorization in complex or long-tail cases, and that local memorization is often the dominant driver of errors, leading to up to 67% of wrong tokens. We also show that memorization scores from STIM can be effective in predicting the wrong tokens in the wrong reasoning step. STIM offers a powerful tool for diagnosing and improving model reasoning and can generalize to other structured step-wise generation tasks.
摘要：大型语言模型（LLMS）在推理基准方面表现良好，但是当输入略有变化时，通常会失败，这引起了人们对成功依赖记忆的程度的担忧。这个问题在经过思考链（COT）推理中尤为严重，其中虚假的记忆模式可能会触发中间错误，从而将级联成为错误的最终答案。我们介绍了STIM，这是一种用于源感知令牌的记忆识别的新颖框架，该框架将推理链中的每个令牌归因于每个记忆源之一 - 局部，中端或远程 - 基于他们的统计共同出现在Frathering Corpus中的统计共同出现。我们跨任务和分配设置的令牌级别的分析表明，模型更多地依赖于复杂或长尾案例中的记忆，并且本地记忆通常是错误的主要驱动力，导致多达67％的错误令牌。我们还表明，来自Stim的记忆得分可以有效地预测错误的推理步骤中的错误令牌。 STIM提供了一种强大的工具，用于诊断和改进模型推理，并可以推广到其他结构化的阶梯生成任务。

Title: Harnessing Temporal Databases for Systematic Evaluation of Factual Time-Sensitive Question-Answering in Large Language Models

Authors: Soyeon Kim, Jindong Wang, Xing Xie, Steven Euijong Whang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.02045
Pdf URL: https://arxiv.org/pdf/2508.02045
Copy Paste: [[2508.02045]] Harnessing Temporal Databases for Systematic Evaluation of Factual Time-Sensitive Question-Answering in Large Language Models(https://arxiv.org/abs/2508.02045)
Keywords: language model, llm
Abstract: Facts evolve over time, making it essential for Large Language Models (LLMs) to handle time-sensitive factual knowledge accurately and reliably. While factual Time-Sensitive Question-Answering (TSQA) tasks have been widely studied, existing benchmarks often rely on manual curation or a small, fixed set of predefined templates, which restricts scalable and comprehensive TSQA evaluation. To address these challenges, we propose TDBench, a new benchmark that systematically constructs TSQA pairs by harnessing temporal databases and database techniques such as temporal SQL and functional dependencies. We also introduce a fine-grained evaluation metric called time accuracy, which assesses the validity of time references in model explanations alongside traditional answer accuracy to enable a more reliable TSQA evaluation. Extensive experiments on contemporary LLMs show how TDBench enables scalable and comprehensive TSQA evaluation while reducing the reliance on human labor, complementing existing Wikipedia/Wikidata-based TSQA evaluation approaches by enabling LLM evaluation on application-specific data and seamless multi-hop question generation. Code and data are publicly available at: this https URL.
摘要：事实会随着时间的流逝而发展，这对于大型语言模型（LLM）至关重要，以准确，可靠地处理时间敏感的事实知识。尽管已经广泛研究了事实时间敏感的提问（TSQA）任务，但现有的基准通常依赖手动策划或一组固定的预定义模板，这限制了可扩展且全面的TSQA评估。为了应对这些挑战，我们提出了TDBench，这是一种新的基准测试，该基准通过利用时间数据库和数据库技术（例如时间sql和功能依赖性）来系统地构建TSQA对。我们还引入了一个名为“时间准确性”的细粒度评估指标，该指标评估了模型解释中时间参考的有效性以及传统的答案准确性，以实现更可靠的TSQA评估。关于当代LLM的广泛实验表明，\我们的{}如何在减少对人工劳动的依赖，对现有的Wikipedia/Wikipedia/Wikidata基于基于Wikidata的TSQA评估方法进行补充，通过对应用程序特定的数据和无缝的多人合作生成来补充现有的Wikipedia/Wikipedia/Wikidata评估方法。代码和数据可公开可用：此HTTPS URL。
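
To make the idea of building time-sensitive question-answer pairs from a temporal database concrete, here is a rough sketch using Python's built-in `sqlite3`. It is not TDBench's pipeline: the `leader` table, its fictional rows, the half-open validity interval `[start_year, end_year)`, and the question template are all assumptions made for illustration.

```python
import sqlite3


def build_db():
    """Create an in-memory table of fictional facts with valid-time intervals."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE leader (country TEXT, name TEXT, "
        "start_year INTEGER, end_year INTEGER)"
    )
    conn.executemany(
        "INSERT INTO leader VALUES (?, ?, ?, ?)",
        [
            ("Atlantis", "Alice", 2001, 2009),  # hypothetical rows
            ("Atlantis", "Bob", 2009, 2017),
        ],
    )
    return conn


def tsqa_pairs(conn, year):
    """Return (question, answer) pairs for facts valid in `year`.

    A fact is valid when start_year <= year < end_year (half-open interval).
    """
    rows = conn.execute(
        "SELECT country, name FROM leader "
        "WHERE start_year <= ? AND ? < end_year ORDER BY country",
        (year, year),
    ).fetchall()
    return [(f"Who was the leader of {c} in {year}?", n) for c, n in rows]
```

With the half-open convention, a handover year such as 2009 resolves to exactly one answer, which keeps the generated gold labels unambiguous.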

Title: ProCut: LLM Prompt Compression via Attribution Estimation

Authors: Zhentao Xu, Fengyi Li, Albert Chen, Xiaofeng Wang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.02053
Pdf URL: https://arxiv.org/pdf/2508.02053
Copy Paste: [[2508.02053]] ProCut: LLM Prompt Compression via Attribution Estimation(https://arxiv.org/abs/2508.02053)
Keywords: llm, prompt
Abstract: In large-scale industrial LLM systems, prompt templates often expand to thousands of tokens as teams iteratively incorporate sections such as task instructions, few-shot examples, and heuristic rules to enhance robustness and coverage. This expansion leads to bloated prompts that are difficult to maintain and incur significant inference latency and serving costs. To address this, we introduce Prompt Compression via Attribution Estimation (ProCut), a flexible, LLM-agnostic, training-free framework that compresses prompts through attribution analysis. ProCut segments prompt templates into semantically meaningful units, quantifies their impact on task performance, and prunes low-utility components. Through extensive experiments on five public benchmark datasets and real-world industrial prompts, we show that ProCut achieves substantial prompt size reductions (78% fewer tokens in production) while maintaining or even slightly improving task performance (up to 62% better than alternative methods). We further introduce an LLM-driven attribution estimator that reduces compression latency by over 50%, and demonstrate that ProCut integrates seamlessly with existing prompt-optimization frameworks to produce concise, high-performing prompts.
摘要：在大规模的工业LLM系统中，及时的模板通常会扩展到数千个令牌，因为迭代的团队迭代地纳入了诸如任务说明，很少的示例和启发式规则之类的部分，以增强稳健性和覆盖范围。这种扩展会导致肿的提示，这些提示很难维持并产生大量的推理潜伏期和服务成本。为了解决这个问题，我们通过归因估计（PROCUT）引入提示压缩，这是一个灵活的，LLM-Agnostic的无训练框架，通过归因分析来压缩提示。 PROCUT段促使模板进入语义上有意义的单元，量化其对任务性能的影响，并降低了较低的组件。通过对五个公共基准数据集和现实世界中的工业提示进行大量实验，我们表明，Procut可实现大量迅速尺寸降低（生产的代币少78％），同时维持甚至略微改善了任务绩效（比其他方法高达62％）。我们进一步介绍了一个LLM驱动的归因估计器，该估计值将压缩潜伏期降低了50％以上，并证明Procut与现有的及时及时优化框架无缝集成以产生简洁，高性能的提示。
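
The abstract above outlines a segment, attribute, prune loop without saying how attribution is estimated. The sketch below substitutes plain leave-one-out ablation as the estimator, which is a stand-in rather than ProCut's actual procedure; `toy_score`, the segment strings, and the `threshold` parameter are hypothetical, and in practice the score would come from running the task with the given prompt.

```python
def leave_one_out_attribution(segments, score_fn):
    """Attribution of segment i = score with all segments minus score without i."""
    full = score_fn(segments)
    return [
        full - score_fn(segments[:i] + segments[i + 1:])
        for i in range(len(segments))
    ]


def prune(segments, score_fn, threshold=0.0):
    """Keep only segments whose attribution exceeds `threshold`."""
    attributions = leave_one_out_attribution(segments, score_fn)
    return [s for s, a in zip(segments, attributions) if a > threshold]


def toy_score(segments):
    """Stand-in for a task metric: counts which 'useful' segments are present."""
    useful = {"task instructions", "few-shot example"}
    return sum(1.0 for s in segments if s in useful)


prompt_segments = ["task instructions", "legacy disclaimer", "few-shot example"]
compressed = prune(prompt_segments, toy_score)
```

Leave-one-out needs one extra evaluation per segment, which is one reason a cheaper estimator can matter when prompts have many segments.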

Title: The SMeL Test: A simple benchmark for media literacy in language models

Authors: Gustaf Ahdritz, Anat Kleiman
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.02074
Pdf URL: https://arxiv.org/pdf/2508.02074
Copy Paste: [[2508.02074]] The SMeL Test: A simple benchmark for media literacy in language models(https://arxiv.org/abs/2508.02074)
Keywords: language model, llm, hallucination
Abstract: The internet is rife with unattributed, deliberately misleading, or otherwise untrustworthy content. Though large language models (LLMs) are often tasked with autonomous web browsing, the extent to which they have learned the simple heuristics human researchers use to navigate this noisy environment is not currently known. In this paper, we introduce the Synthetic Media Literacy Test (SMeL Test), a minimal benchmark that tests the ability of language models to actively filter out untrustworthy information in context. We benchmark a variety of commonly used instruction-tuned LLMs, including reasoning models, and find that no model consistently trusts more reliable sources; while reasoning in particular is associated with higher scores, even the best API model we test hallucinates up to 70% of the time. Remarkably, larger and more capable models do not necessarily outperform their smaller counterparts. We hope our work sheds more light on this important form of hallucination and guides the development of new methods to combat it.
摘要：互联网盛行，没有任何不受欢迎的，故意误导性或其他不可信的内容。尽管大型语言模型（LLMS）通常负责自动网络浏览，但他们了解了人类研究人员用来浏览这种嘈杂环境的简单启发式方法的程度。在本文中，我们介绍了合成媒体素养测试（SMEL测试），这是一种最小的基准测试，该基准测试了语言模型在上下文中积极过滤不信任信息的能力。我们基准了各种常用的指令调整的LLM，包括推理模型，发现没有模型始终如一地信任更可靠的来源。虽然推理尤其与更高的分数有关，但即使是最好的API模型，我们测试幻觉的时间最多达70％。值得注意的是，较大且功能更强大的模型不一定要优于较小的型号。我们希望我们的工作能够对这种幻觉的这种重要形式有更多的了解，并指导开发与它对的新方法。

Title: When Truth Is Overridden: Uncovering the Internal Origins of Sycophancy in Large Language Models

Authors: Jin Li, Keyu Wang, Shu Yang, Zhuoran Zhang, Di Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.02087
Pdf URL: https://arxiv.org/pdf/2508.02087
Copy Paste: [[2508.02087]] When Truth Is Overridden: Uncovering the Internal Origins of Sycophancy in Large Language Models(https://arxiv.org/abs/2508.02087)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) often exhibit sycophantic behavior, agreeing with user-stated opinions even when those contradict factual knowledge. While prior work has documented this tendency, the internal mechanisms that enable such behavior remain poorly understood. In this paper, we provide a mechanistic account of how sycophancy arises within LLMs. We first systematically study how user opinions induce sycophancy across different model families. We find that simple opinion statements reliably induce sycophancy, whereas user expertise framing has a negligible impact. Through logit-lens analysis and causal activation patching, we identify a two-stage emergence of sycophancy: (1) a late-layer output preference shift and (2) deeper representational divergence. We also verify that user authority fails to influence behavior because models do not encode it internally. In addition, we examine how grammatical perspective affects sycophantic behavior, finding that first-person prompts ("I believe...") consistently induce higher sycophancy rates than third-person framings ("They believe...") by creating stronger representational perturbations in deeper layers. These findings highlight that sycophancy is not a surface-level artifact but emerges from a structural override of learned knowledge in deeper layers, with implications for alignment and truthful AI systems.
摘要：大型语言模型（LLMS）经常表现出相关行为，即使那些与事实知识相矛盾，也同意用户陈述的观点。尽管先前的工作已经记录了这种趋势，但使这种行为能够理解的内部机制仍然很糟糕。在本文中，我们提供了关于LLM中粘粘性如何产生的机理说明。我们首先系统地研究用户意见如何诱导不同模型家族的糊状。我们发现，简单的意见陈述可靠地引起了粘粘性，而用户专业知识框架的影响可以忽略不计。通过Logit镜头分析和因果激活补丁，我们确定了粘粘剂的两个阶段出现：（1）晚期输出偏好偏移和（2）更深的表示差异。我们还验证了用户权威不会影响行为，因为模型不会内部编码。此外，我们研究了语法观点如何影响妓女行为，发现第一人称提示（``我相信...''）始终诱导比第三人称框架（``他们相信……''）更高的粘液率来诱导更高的粘液率。这些发现凸显了粘粘剂不是表面级的人工制品，而是源于更深层次的知识的结构覆盖，对一致性和真实的AI系统产生了影响。

Title: Learning Dynamics of Meta-Learning in Small Model Pretraining

Authors: David Demitri Africa, Yuval Weiss, Paula Buttery, Richard Diehl Martinez
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.02189
Pdf URL: https://arxiv.org/pdf/2508.02189
Copy Paste: [[2508.02189]] Learning Dynamics of Meta-Learning in Small Model Pretraining(https://arxiv.org/abs/2508.02189)
Keywords: language model
Abstract: Large language models are powerful but costly. We ask whether meta-learning can make the pretraining of small language models not only better but also more interpretable. We integrate first-order MAML with subset-masked LM pretraining, producing four Llama-style decoder-only models (11M-570M params), and evaluate them on a fundamental NLP task with many settings and real-world applications. Compared with vanilla training, our model (i) reaches the same loss up to 1.6x sooner, (ii) improves F1 on multilingual Universal NER under equal compute, and (iii) makes the training dynamics easy to read: first the network's representations fan out ("diversify") and later they collapse into a smaller, shared subspace ("compress"). This two-stage shift shows up as a rise-and-fall in both effective-rank curves and attention-head entropy. The same curves pinpoint which layers specialise earliest and which later reconverge, giving a compact, interpretable signature of meta-adaptation. Code, checkpoints and WandB logs are released.
摘要：大型语言模型强大，但昂贵。我们询问元学习是否可以使小语言模型的审议不仅更好，而且更容易解释。我们将一阶MAML与子集屏蔽的LM预处理集成在一起，从而产生四种仅使用解码器的模型（11m-570m params），并在具有许多设置和现实世界应用的基本NLP任务上对其进行评估。与香草训练相比，我们的模型（i）较早达到1.6倍，（ii）（ii）在均等计算下的多语言通用ner上提高了F1，并且（iii）使训练动态易于阅读：首先，网络的表示粉丝（“多样化”），后来它们崩溃成较小的，共享的suppace（“ compress”）。在有效排名曲线和注意头部熵中，这种两阶段的转变表现为上升和下降。相同的曲线精确指示了最早专业的层，后来又重新分配了元适应性的紧凑，可解释的标志。释放代码，检查点和WANDB日志。
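
The abstract above tracks training dynamics through effective-rank curves and attention-head entropy but does not define either quantity. The sketch below uses one widely used definition of effective rank (the exponential of the Shannon entropy of the normalised singular values) and the Shannon entropy of a single attention distribution; whether the paper uses exactly these definitions is an assumption, and the function names are illustrative.

```python
import math


def shannon_entropy(probs):
    """Entropy in nats of a probability distribution; zero entries are skipped."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def effective_rank(singular_values):
    """exp(entropy) of singular values normalised to sum to 1 (assumed definition)."""
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("singular values must have a positive sum")
    return math.exp(shannon_entropy([s / total for s in singular_values]))


def attention_entropy(attention_row):
    """Entropy of one query's attention weights over the keys it attends to."""
    return shannon_entropy(attention_row)
```

Under this definition, a representation matrix with k equal singular values has effective rank k and a rank-one matrix has effective rank 1, so a rise and later fall in the curve matches the "diversify" then "compress" reading in the abstract.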

Title: Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference

Authors: Yuxuan Song, Zheng Zhang, Cheng Luo, Pengyang Gao, Fan Xia, Hao Luo, Zheng Li, Yuehang Yang, Hongli Yu, Xingwei Qu, Yuwei Fu, Jing Su, Ge Zhang, Wenhao Huang, Mingxuan Wang, Lin Yan, Xiaoying Jia, Jingjing Liu, Wei-Ying Ma, Ya-Qin Zhang, Yonghui Wu, Hao Zhou
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.02193
Pdf URL: https://arxiv.org/pdf/2508.02193
Copy Paste: [[2508.02193]] Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference(https://arxiv.org/abs/2508.02193)
Keywords: language model
Abstract: We present Seed Diffusion Preview, a large-scale language model based on discrete-state diffusion, offering remarkably fast inference speed. Thanks to non-sequential, parallel generation, discrete diffusion models provide a notable speedup to mitigate the inherent latency of token-by-token decoding, as demonstrated recently (e.g., Mercury Coder, Gemini Diffusion). Seed Diffusion Preview achieves an inference speed of 2,146 tokens/s over H20 GPUs while maintaining competitive performance across a sweep of standard code evaluation benchmarks, significantly faster than contemporary Mercury and Gemini Diffusion, establishing a new state of the art on the speed-quality Pareto frontier for code models.
摘要：我们提出种子扩散预览，这是一种基于离散状态扩散的大规模语言模型，提供了非常快的推理速度。多亏了非顺序的平行生成，离散扩散模型提供了一个显着的加速，以减轻逐个解码的固有延迟，如最近所示（例如，Mercury Coder，Gemini扩散）。种子扩散预览的推理速度在H20 GPU上的推理速度为2,146代币，同时在跨越标准代码评估基准的扫描中保持有竞争力的性能，比当代的汞和双子座扩散率要快得多，在代码模型的快速帕莱托边境上确立了新的ART的新状态。

Title: Proof2Hybrid: Automatic Mathematical Benchmark Synthesis for Proof-Centric Problems

Authors: Yebo Peng, Zixiang Liu, Yaoming Li, Zhizhuo Yang, Xinye Xu, Bowen Ye, Weijun Yuan, Zihan Wang, Tong Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.02208
Pdf URL: https://arxiv.org/pdf/2508.02208
Copy Paste: [[2508.02208]] Proof2Hybrid: Automatic Mathematical Benchmark Synthesis for Proof-Centric Problems(https://arxiv.org/abs/2508.02208)
Keywords: language model, llm
Abstract: Evaluating the mathematical capability of Large Language Models (LLMs) is a critical yet challenging frontier. Existing benchmarks fall short, particularly for proof-centric problems, as manual creation is unscalable and costly, leaving the true mathematical abilities of LLMs largely unassessed. To overcome these barriers, we propose Proof2Hybrid, the first fully automated framework that synthesizes high-quality, proof-centric benchmarks from natural language mathematical corpora. The key novelty of our solution is Proof2X, a roadmap of converting mathematical proofs into various kinds of questions that are easy to verify. Instructed by this roadmap, we propose a new type of hybrid-formatted questions, named "m-out-of-n multiple judge questions", specifically designed to enable robust, automatic evaluation while being resilient to guessing and superficial pattern matching inherent in traditional formats. As a demonstration of our framework, we introduce AlgGeoTest, a benchmark for algebraic geometry (a frontier domain of modern mathematics) comprising 456 challenging items. Our extensive evaluations on state-of-the-art LLMs using AlgGeoTest reveal profound deficits in their comprehension of algebraic geometry, providing a more precise measure of their true mathematical capabilities. Our framework and benchmark pave the way for a new wave of in-depth research into the mathematical intelligence of AI systems.

Title: Isolating Culture Neurons in Multilingual Large Language Models

Authors: Danial Namazifard, Lukas Galke
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.02241
Pdf URL: https://arxiv.org/pdf/2508.02241
Copy Paste: [[2508.02241]] Isolating Culture Neurons in Multilingual Large Language Models(https://arxiv.org/abs/2508.02241)
Keywords: language model, llm
Abstract: Language and culture are deeply intertwined, yet it is so far unclear how and where multilingual large language models encode culture. Here, we build upon an established methodology for identifying language-specific neurons and extend it to localize and isolate culture-specific neurons, carefully disentangling their overlap and interaction with language-specific neurons. To facilitate our experiments, we introduce MUREL, a curated dataset of 85.2 million tokens spanning six different cultures. Our localization and intervention experiments show that LLMs encode different cultures in distinct neuron populations, predominantly in upper layers, and that these culture neurons can be modulated independently from language-specific neurons or those specific to other cultures. These findings suggest that cultural knowledge and propensities in multilingual language models can be selectively isolated and edited, promoting fairness, inclusivity, and alignment. Code and data are available at this https URL.

Title: Decomposing the Entropy-Performance Exchange: The Missing Keys to Unlocking Effective Reinforcement Learning

Authors: Jia Deng, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Ji-Rong Wen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.02260
Pdf URL: https://arxiv.org/pdf/2508.02260
Copy Paste: [[2508.02260]] Decomposing the Entropy-Performance Exchange: The Missing Keys to Unlocking Effective Reinforcement Learning(https://arxiv.org/abs/2508.02260)
Keywords: language model, llm
Abstract: Recently, reinforcement learning with verifiable rewards (RLVR) has been widely used for enhancing the reasoning abilities of large language models (LLMs). A core challenge in RLVR involves managing the exchange between entropy and performance of policies. Despite the importance of this exchange, a fine-grained understanding of when and how this exchange operates most effectively remains limited. To bridge this gap, we conduct a systematic empirical analysis of the entropy-performance exchange mechanism of RLVR across different levels of granularity. Specifically, we first divide the training process into two distinct stages based on entropy dynamics, i.e., rising stage and plateau stage, and then systematically investigate how this mechanism varies across stage-level, instance-level, and token-level granularities. Our analysis reveals that, in the rising stage, entropy reduction in negative samples facilitates the learning of effective reasoning patterns, which in turn drives rapid performance gains. Moreover, in the plateau stage, learning efficiency strongly correlates with high-entropy tokens present in low-perplexity samples and those located at the end of sequences. Motivated by these findings, we propose two methods that dynamically adjust the reward signal using perplexity and positional information to focus RL updates on tokens that exhibit high learning potential, achieving improvements compared to the baseline methods on various LLMs.
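For readers unfamiliar with the two quantities the abstract relies on, here is a minimal sketch of their standard definitions: the Shannon entropy of a single next-token distribution and the perplexity of a sampled sequence. The function names and toy probabilities are illustrative only; the paper's reward-adjustment methods are not reproduced here.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def sequence_perplexity(chosen_probs):
    """Perplexity of a sample, given the probability of each sampled token."""
    mean_nll = -sum(math.log(p) for p in chosen_probs) / len(chosen_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # Toy numbers, not taken from the paper.
    print(token_entropy([0.5, 0.5]))            # uniform over two tokens
    print(sequence_perplexity([0.9, 0.8, 0.95]))  # a fairly confident sample
```

A "high-entropy token in a low-perplexity sample" is then a position where `token_entropy` is large even though `sequence_perplexity` for the whole sample is small.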

Title: SHAMI-MT: A Syrian Arabic Dialect to Modern Standard Arabic Bidirectional Machine Translation System

Authors: Serry Sibaee, Omer Nacar, Yasser Al-Habashi, Adel Ammar, Wadii Boulila
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.02268
Pdf URL: https://arxiv.org/pdf/2508.02268
Copy Paste: [[2508.02268]] SHAMI-MT: A Syrian Arabic Dialect to Modern Standard Arabic Bidirectional Machine Translation System(https://arxiv.org/abs/2508.02268)
Keywords: gpt
Abstract: The rich linguistic landscape of the Arab world is characterized by a significant gap between Modern Standard Arabic (MSA), the language of formal communication, and the diverse regional dialects used in everyday life. This diglossia presents a formidable challenge for natural language processing, particularly machine translation. This paper introduces SHAMI-MT, a bidirectional machine translation system specifically engineered to bridge the communication gap between MSA and the Syrian dialect. We present two specialized models, one for MSA-to-Shami and another for Shami-to-MSA translation, both built upon the state-of-the-art AraT5v2-base-1024 architecture. The models were fine-tuned on the comprehensive Nabra dataset and rigorously evaluated on unseen data from the MADAR corpus. Our MSA-to-Shami model achieved an outstanding average quality score of 4.01 out of 5.0 when judged by OpenAI's GPT-4.1 model, demonstrating its ability to produce translations that are not only accurate but also dialectally authentic. This work provides a crucial, high-fidelity tool for a previously underserved language pair, advancing the field of dialectal Arabic translation and offering significant applications in content localization, cultural heritage, and intercultural communication.

Title: Simple Methods Defend RAG Systems Well Against Real-World Attacks

Authors: Ilias Triantafyllopoulos, Renyi Qu, Salvatore Giorgi, Brenda Curtis, Lyle H. Ungar, João Sedoc
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.02296
Pdf URL: https://arxiv.org/pdf/2508.02296
Copy Paste: [[2508.02296]] Simple Methods Defend RAG Systems Well Against Real-World Attacks(https://arxiv.org/abs/2508.02296)
Keywords: gpt, llm, chat, retrieval-augmented generation
Abstract: Ensuring safety and in-domain responses for Retrieval-Augmented Generation (RAG) systems is paramount in safety-critical applications, yet remains a significant challenge. To address this, we evaluate four methodologies for Out-Of-Domain (OOD) query detection: GPT-4o, regression-based, Principal Component Analysis (PCA)-based, and Neural Collapse (NC), to ensure the RAG system only responds to queries confined to the system's knowledge base. Specifically, our evaluation explores two novel dimensionality reduction and feature separation strategies: PCA, where top components are selected using explained variance or OOD separability, and an adaptation of Neural Collapse Feature Separation. We validate our approach on standard datasets (StackExchange and MSMARCO) and real-world applications (Substance Use and COVID-19), including tests against LLM-simulated and actual attacks on a COVID-19 vaccine chatbot. Through human and LLM-based evaluations of response correctness and relevance, we confirm that an external OOD detector is crucial for maintaining response relevance.

Title: LaMPE: Length-aware Multi-grained Position Encoding for Adaptive Long-context Scaling Without Training

Authors: Sikui Zhang, Guangze Gao, Ziyun Gan, Chunfeng Yuan, Zefeng Lin, Houwen Peng, Bing Li, Weiming Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.02308
Pdf URL: https://arxiv.org/pdf/2508.02308
Copy Paste: [[2508.02308]] LaMPE: Length-aware Multi-grained Position Encoding for Adaptive Long-context Scaling Without Training(https://arxiv.org/abs/2508.02308)
Keywords: language model, llm
Abstract: Large language models (LLMs) experience significant performance degradation when the input exceeds the pretraining context window, primarily due to the out-of-distribution (OOD) behavior of Rotary Position Embedding (RoPE). Recent studies mitigate this problem by remapping OOD positions into the in-distribution range with fixed mapping strategies, ignoring the dynamic relationship between input length and the model's effective context window. To this end, we propose Length-aware Multi-grained Positional Encoding (LaMPE), a training-free method that fully utilizes the model's effective context window for adaptive long-context scaling in LLMs. Motivated by the left-skewed frequency distribution of relative positions, LaMPE establishes a dynamic relationship between mapping length and input length through a parametric scaled sigmoid function to adaptively allocate positional capacity across varying input lengths. Meanwhile, LaMPE devises a novel multi-grained attention mechanism that strategically allocates positional resolution across different sequence regions to capture both fine-grained locality and long-range dependencies. Our method can be seamlessly applied to a wide range of RoPE-based LLMs without training. Extensive experiments on three representative LLMs across five mainstream long-context benchmarks demonstrate that LaMPE achieves significant performance improvements compared to existing length extrapolation methods. The code will be released at this https URL.

Title: VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo

Authors: Qianli Ma, Yaowei Zheng, Zhelun Shi, Zhongkai Zhao, Bin Jia, Ziyue Huang, Zhiqi Lin, Youjie Li, Jiacheng Yang, Yanghua Peng, Zhi Zhang, Xin Liu
Subjects: cs.CL, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2508.02317
Pdf URL: https://arxiv.org/pdf/2508.02317
Copy Paste: [[2508.02317]] VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo(https://arxiv.org/abs/2508.02317)
Keywords: language model, llm
Abstract: Recent advances in large language models (LLMs) have driven impressive progress in omni-modal understanding and generation. However, training omni-modal LLMs remains a significant challenge due to the heterogeneous model architectures required to process diverse modalities, necessitating sophisticated system design for efficient large-scale training. Existing frameworks typically entangle model definition with parallel logic, incurring limited scalability and substantial engineering overhead for end-to-end omni-modal training. We present VeOmni, a modular and efficient training framework to accelerate the development of omni-modal LLMs. VeOmni introduces model-centric distributed recipes that decouple communication from computation, enabling efficient 3D parallelism on omni-modal LLMs. VeOmni also features a flexible configuration interface supporting seamless integration of new modalities with minimal code change. Using VeOmni, an omni-modal mixture-of-experts (MoE) model with 30B parameters can be trained with over 2,800 tokens/sec/GPU throughput and scale to 160K context lengths via 3D parallelism on 128 GPUs, showcasing its superior efficiency and scalability for training large omni-modal LLMs.

Title: CAMERA: Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis

Authors: Yuzhuang Xu, Xu Han, Yuanchi Zhang, Yixuan Wang, Yijun Liu, Shiyu Ji, Qingfu Zhu, Wanxiang Che
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.02322
Pdf URL: https://arxiv.org/pdf/2508.02322
Copy Paste: [[2508.02322]] CAMERA: Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis(https://arxiv.org/abs/2508.02322)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) with Mixture-of-Experts (MoE) architectures are distinguished by their strong performance scaling with increasing parameters across a wide range of tasks, yet they also suffer from substantial computational and storage overheads. Notably, the performance gains of MoE models do not scale proportionally with the growth in expert parameters. While prior works attempt to reduce parameters via expert-level pruning, merging, or decomposition, they still suffer from challenges in both performance and computational efficiency. In this paper, we address these challenges by introducing micro-expert as a finer-grained compression unit that spans across matrices. We first establish a more fundamental perspective, viewing MoE layers as mixtures of micro-experts, and present CAMERA, a lightweight and training-free framework for identifying micro-expert redundancy. Our analysis uncovers significant variance in micro-expert contributions during decoding. Based on this insight, we further propose CAMERA-P, a structured micro-expert pruning framework, and CAMERA-Q, a mixed-precision quantization idea designed for micro-experts. Extensive experiments on nine downstream tasks show that CAMERA-P consistently outperforms strong baselines under pruning ratios ranging from 20% to 60%. Furthermore, CAMERA-Q achieves superior results under aggressive 2-bit quantization, surpassing existing matrix- and channel-level ideas. Notably, our method enables complete micro-expert analysis of Qwen2-57B-A14B in less than 5 minutes on a single NVIDIA A100-40GB GPU.

Title: Understanding and Mitigating Political Stance Cross-topic Generalization in Large Language Models

Authors: Jiayi Zhang, Shu Yang, Junchao Wu, Derek F. Wong, Di Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.02360
Pdf URL: https://arxiv.org/pdf/2508.02360
Copy Paste: [[2508.02360]] Understanding and Mitigating Political Stance Cross-topic Generalization in Large Language Models(https://arxiv.org/abs/2508.02360)
Keywords: language model
Abstract: Fine-tuning Large Language Models on a political topic will significantly manipulate their political stance on various issues and unintentionally affect their stance on unrelated topics. While previous studies have raised this issue, there is still a lack of understanding regarding the internal representations of these stances and the mechanisms that lead to unintended cross-topic generalization. In this paper, we systematically explore the internal mechanisms underlying this phenomenon from a neuron-level perspective and how to mitigate the cross-topic generalization of political fine-tuning. Firstly, we propose Political Neuron Localization through Activation Contrasting (PNLAC) to identify two distinct types of political neurons: general political neurons, which govern stance across multiple political topics, and topic-specific neurons, which affect the model's political stance on individual topics. We find the existence of these political neuron types across four models and datasets through activation patching experiments. Leveraging these insights, we introduce InhibitFT, an inhibition-based fine-tuning method, effectively mitigating the cross-topic stance generalization. Experimental results demonstrate the robustness of identified neuron types across various models and datasets, and show that InhibitFT significantly reduces the cross-topic stance generalization by 20% on average, while preserving topic-specific performance. Moreover, we demonstrate that selectively inhibiting only 5% of neurons is sufficient to effectively mitigate the cross-topic stance generalization.

Title: CompressKV: Semantic Retrieval Heads Know What Tokens are Not Important Before Generation

Authors: Xiaolin Lin, Jingcun Wang, Olga Kondrateva, Yiyu Shi, Bing Li, Grace Li Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.02401
Pdf URL: https://arxiv.org/pdf/2508.02401
Copy Paste: [[2508.02401]] CompressKV: Semantic Retrieval Heads Know What Tokens are Not Important Before Generation(https://arxiv.org/abs/2508.02401)
Keywords: language model, llm, prompt
Abstract: Recent advances in large language models (LLMs) have significantly boosted long-context processing. However, the increasing key-value (KV) cache size poses critical challenges to memory and execution efficiency. Most KV cache compression methods rely on heuristic token eviction using all attention heads in Grouped Query Attention (GQA)-based LLMs. This method ignores the different functionalities of attention heads, leading to the eviction of critical tokens and thus degrades the performance of LLMs. To address the issue above, instead of using all the attention heads in GQA-based LLMs to determine important tokens as in the previous work, we first identify the attention heads in each layer that are not only capable of retrieving the initial and final tokens of a prompt, but also capable of retrieving important tokens within the text and attending to their surrounding semantic context. Afterwards, we exploit such heads to determine the important tokens and retain their corresponding KV cache pairs. Furthermore, we analyze the cache eviction error of each layer individually and introduce a layer-adaptive KV cache allocation strategy. Experimental results demonstrate the proposed CompressKV consistently outperforms state-of-the-art approaches under various memory budgets on LongBench and Needle-in-a-Haystack benchmarks. Our code is publicly available at: this https URL.
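The head-based token selection described above can be illustrated with a small sketch. This is not the authors' implementation: the function name, the toy attention matrix, and the rule of summing attention weights over a chosen subset of heads and keeping the top-scoring tokens are all assumptions made for illustration. How CompressKV identifies its heads and allocates per-layer budgets is not reproduced.

```python
def select_tokens(attn, heads, budget):
    """Return indices of the tokens whose KV entries would be kept.

    attn[h][t] is the attention weight head h assigns to token t.
    Only the heads listed in `heads` vote (hypothetical selection rule);
    the `budget` highest-scoring tokens are kept, in original order.
    """
    n_tokens = len(attn[0])
    scores = [sum(attn[h][t] for h in heads) for t in range(n_tokens)]
    top = sorted(range(n_tokens), key=lambda t: scores[t], reverse=True)[:budget]
    return sorted(top)


if __name__ == "__main__":
    # Toy attention weights for 3 heads over 4 tokens (not from the paper).
    attn = [
        [0.10, 0.60, 0.20, 0.10],
        [0.50, 0.10, 0.10, 0.30],
        [0.25, 0.25, 0.25, 0.25],
    ]
    print(select_tokens(attn, heads=[0, 1], budget=2))
```

Changing `heads` changes which tokens survive, which is the point the abstract makes about not treating all attention heads as equally informative.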

Title: AI-Based Measurement of Innovation: Mapping Expert Insight into Large Language Model Applications

Authors: Robin Nowak, Patrick Figge, Carolin Haeussler
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.02430
Pdf URL: https://arxiv.org/pdf/2508.02430
Copy Paste: [[2508.02430]] AI-Based Measurement of Innovation: Mapping Expert Insight into Large Language Model Applications(https://arxiv.org/abs/2508.02430)
Keywords: language model, llm, prompt
Abstract: Measuring innovation often relies on context-specific proxies and on expert evaluation. Hence, empirical innovation research is often limited to settings where such data is available. We investigate how large language models (LLMs) can be leveraged to overcome the constraints of manual expert evaluations and assist researchers in measuring innovation. We design an LLM framework that reliably approximates domain experts' assessment of innovation from unstructured text data. We demonstrate the performance and broad applicability of this framework through two studies in different contexts: (1) the innovativeness of software application updates and (2) the originality of user-generated feedback and improvement ideas in product reviews. We compared the performance (F1-score) and reliability (consistency rate) of our LLM framework against alternative measures used in prior innovation studies, and to state-of-the-art machine learning- and deep learning-based models. The LLM framework achieved higher F1-scores than the other approaches, and its results are highly consistent (i.e., results do not change across runs). This article equips R&D personnel in firms, as well as researchers, reviewers, and editors, with the knowledge and tools to effectively use LLMs for measuring innovation and evaluating the performance of LLM-based innovation measures. In doing so, we discuss the impact of important design decisions (including model selection, prompt engineering, training data size, training data distribution, and parameter settings) on performance and reliability. Given the challenges inherent in using human expert evaluation and existing text-based measures, our framework has important implications for harnessing LLMs as reliable, increasingly accessible, and broadly applicable research tools for measuring innovation.

Title: LatentPrompt: Optimizing Promts in Latent Space

Authors: Mateusz Bystroński, Grzegorz Piotrowski, Nitesh V. Chawla, Tomasz Kajdanowicz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.02452
Pdf URL: https://arxiv.org/pdf/2508.02452
Copy Paste: [[2508.02452]] LatentPrompt: Optimizing Promts in Latent Space(https://arxiv.org/abs/2508.02452)
Keywords: language model, llm, prompt
Abstract: Recent advances have shown that optimizing prompts for Large Language Models (LLMs) can significantly improve task performance, yet many optimization techniques rely on heuristics or manual exploration. We present LatentPrompt, a model-agnostic framework for prompt optimization that leverages latent semantic space to automatically generate, evaluate, and refine candidate prompts without requiring hand-crafted rules. Beginning with a set of seed prompts, our method embeds them in a continuous latent space and systematically explores this space to identify prompts that maximize task-specific performance. In a proof-of-concept study on the Financial PhraseBank sentiment classification benchmark, LatentPrompt increased classification accuracy by approximately 3 percent after a single optimization cycle. The framework is broadly applicable, requiring only black-box access to an LLM and an automatic evaluation metric, making it suitable for diverse domains and tasks.

Title: From Monolingual to Bilingual: Investigating Language Conditioning in Large Language Models for Psycholinguistic Tasks

Authors: Shuzhou Yuan, Zhan Qu, Mario Tawfelis, Michael Färber
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.02502
Pdf URL: https://arxiv.org/pdf/2508.02502
Copy Paste: [[2508.02502]] From Monolingual to Bilingual: Investigating Language Conditioning in Large Language Models for Psycholinguistic Tasks(https://arxiv.org/abs/2508.02502)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) exhibit strong linguistic capabilities, but little is known about how they encode psycholinguistic knowledge across languages. We investigate whether and how LLMs exhibit human-like psycholinguistic responses under different linguistic identities using two tasks: sound symbolism and word valence. We evaluate two models, Llama-3.3-70B-Instruct and Qwen2.5-72B-Instruct, under monolingual and bilingual prompting in English, Dutch, and Chinese. Behaviorally, both models adjust their outputs based on prompted language identity, with Qwen showing greater sensitivity and sharper distinctions between Dutch and Chinese. Probing analysis reveals that psycholinguistic signals become more decodable in deeper layers, with Chinese prompts yielding stronger and more stable valence representations than Dutch. Our results demonstrate that language identity conditions both output behavior and internal representations in LLMs, providing new insights into their application as models of cross-linguistic cognition.

Title: Modular Arithmetic: Language Models Solve Math Digit by Digit

Authors: Tanja Baeumel, Daniil Gurgurov, Yusser al Ghussin, Josef van Genabith, Simon Ostermann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.02513
Pdf URL: https://arxiv.org/pdf/2508.02513
Copy Paste: [[2508.02513]] Modular Arithmetic: Language Models Solve Math Digit by Digit(https://arxiv.org/abs/2508.02513)
Keywords: language model, llm
Abstract: While recent work has begun to uncover the internal strategies that Large Language Models (LLMs) employ for simple arithmetic tasks, a unified understanding of their underlying mechanisms is still lacking. We extend recent findings showing that LLMs represent numbers in a digit-wise manner and present evidence for the existence of digit-position-specific circuits that LLMs use to perform simple arithmetic tasks, i.e. modular subgroups of MLP neurons that operate independently on different digit positions (units, tens, hundreds). Notably, such circuits exist independently of model size and of tokenization strategy, i.e. both for models that encode longer numbers digit-by-digit and as one token. Using Feature Importance and Causal Interventions, we identify and validate the digit-position-specific circuits, revealing a compositional and interpretable structure underlying the solving of arithmetic problems in LLMs. Our interventions selectively alter the model's prediction at targeted digit positions, demonstrating the causal role of digit-position circuits in solving arithmetic tasks.

Title: PoeTone: A Framework for Constrained Generation of Structured Chinese Songci with LLMs

Authors: Zhan Qu, Shuzhou Yuan, Michael Färber
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.02515
Pdf URL: https://arxiv.org/pdf/2508.02515
Copy Paste: [[2508.02515]] PoeTone: A Framework for Constrained Generation of Structured Chinese Songci with LLMs(https://arxiv.org/abs/2508.02515)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: This paper presents a systematic investigation into the constrained generation capabilities of large language models (LLMs) in producing Songci, a classical Chinese poetry form characterized by strict structural, tonal, and rhyme constraints defined by Cipai templates. We first develop a comprehensive, multi-faceted evaluation framework that includes: (i) a formal conformity score, (ii) automated quality assessment using LLMs, (iii) human evaluation, and (iv) classification-based probing tasks. Using this framework, we evaluate the generative performance of 18 LLMs, including 3 proprietary models and 15 open-source models across four families, under five prompting strategies: zero-shot, one-shot, completion-based, instruction-tuned, and chain-of-thought. Finally, we propose a Generate-Critic architecture in which the evaluation framework functions as an automated critic. Leveraging the critic's feedback as a reward signal, we fine-tune three lightweight open-source LLMs via supervised fine-tuning (SFT), resulting in improvements of up to 5.88% in formal conformity. Our findings offer new insights into the generative strengths and limitations of LLMs in producing culturally significant and formally constrained literary texts.

Title: I Have No Mouth, and I Must Rhyme: Uncovering Internal Phonetic Representations in LLaMA 3.2

Authors: Jack Merullo, Arjun Khurana, Oliver McLaughlin
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.02527
Pdf URL: https://arxiv.org/pdf/2508.02527
Copy Paste: [[2508.02527]] I Have No Mouth, and I Must Rhyme: Uncovering Internal Phonetic Representations in LLaMA 3.2(https://arxiv.org/abs/2508.02527)
Keywords: language model
Abstract: Large language models demonstrate proficiency on phonetic tasks, such as rhyming, without explicit phonetic or auditory grounding. In this work, we investigate how Llama-3.2-1B-Instruct represents token-level phonetic information. Our results suggest that Llama uses a rich internal model of phonemes to complete phonetic tasks. We provide evidence for high-level organization of phoneme representations in its latent space. In doing so, we also identify a "phoneme mover head" which promotes phonetic information during rhyming tasks. We visualize the output space of this head and find that, while notable differences exist, Llama learns a model of vowels similar to the standard IPA vowel chart for humans, despite receiving no direct supervision to do so.

Title: Contextual Graph Transformer: A Small Language Model for Enhanced Engineering Document Information Extraction

Authors: Karan Reddy, Mayukha Pal
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.02532
Pdf URL: https://arxiv.org/pdf/2508.02532
Copy Paste: [[2508.02532]] Contextual Graph Transformer: A Small Language Model for Enhanced Engineering Document Information Extraction(https://arxiv.org/abs/2508.02532)
Keywords: language model, gpt, retrieval-augmented generation
Abstract: Standard transformer-based language models, while powerful for general text, often struggle with the fine-grained syntax and entity relationships in complex technical and engineering documents. To address this, we propose the Contextual Graph Transformer (CGT), a hybrid neural architecture that combines Graph Neural Networks (GNNs) and Transformers for domain-specific question answering. CGT constructs a dynamic graph over input tokens using sequential, skip-gram, and semantic similarity edges, which is processed by GATv2Conv layers for local structure learning. These enriched embeddings are then passed to a Transformer encoder to capture global dependencies. Unlike generic large models, technical domains often require specialized language models with stronger contextualization and structure awareness. CGT offers a parameter-efficient solution for such use cases. Integrated into a Retrieval-Augmented Generation (RAG) pipeline, CGT outperforms baselines like GPT-2 and BERT, achieving 24.7% higher accuracy than GPT-2 with 62.4% fewer parameters. This gain stems from CGT's ability to jointly model structural token interactions and long-range semantic coherence. The model is trained from scratch using a two-phase approach: pretraining on general text followed by fine-tuning on domain-specific manuals. This highlights CGT's adaptability to technical language, enabling better grounding, entity tracking, and retrieval-augmented responses in real-world applications.
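Illustration (not from the paper): the three edge types named in the abstract above can be sketched in a few lines of Python. This is only a rough sketch under stated assumptions. The function name `build_token_graph`, the `skip_window` and `sim_threshold` parameters, and the use of cosine similarity are all hypothetical choices made for this example; the abstract does not specify how CGT actually builds its graph.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_token_graph(embeddings, skip_window=2, sim_threshold=0.9):
    """Return a sorted list of (i, j, kind) edges over token positions.

    kind is one of "sequential", "skip", or "semantic".
    skip_window and sim_threshold are hypothetical knobs for this sketch.
    """
    n = len(embeddings)
    edges = set()
    for i in range(n):
        # Sequential edge: each token to its immediate successor.
        if i + 1 < n:
            edges.add((i, i + 1, "sequential"))
        # Skip-gram edges: tokens 2..skip_window positions ahead.
        for k in range(2, skip_window + 1):
            if i + k < n:
                edges.add((i, i + k, "skip"))
        # Semantic edges: any later token whose embedding is close enough.
        for j in range(i + 1, n):
            if cosine(embeddings[i], embeddings[j]) >= sim_threshold:
                edges.add((i, j, "semantic"))
    return sorted(edges)


# Toy 2-d embeddings: tokens 0 and 2 are near-duplicates, as are 1 and 3.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.01], [0.0, 1.0]]
toy_edges = build_token_graph(toy_embeddings)
```

In a real pipeline an edge list like this would be converted to whatever format the graph layer expects. That step is omitted here because it depends on the library in use.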

Title: Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction

Authors: Yuerong Song, Xiaoran Liu, Ruixiao Li, Zhigeng Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, Xipeng Qiu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.02558
Pdf URL: https://arxiv.org/pdf/2508.02558
Copy Paste: [[2508.02558]] Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction(https://arxiv.org/abs/2508.02558)
Keywords: language model, llm
Abstract: Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding but suffer from prohibitive quadratic computational complexity and memory overhead during inference. Current caching techniques accelerate decoding by storing full-layer states, yet impose substantial memory usage that limits long-context applications. Our analysis of attention patterns in dLLMs reveals persistent cross-layer sparsity, with pivotal tokens remaining salient across decoding steps and low-relevance tokens staying unimportant, motivating selective cache eviction. We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention via delayed bidirectional sparse caching. By leveraging the stability of token saliency over steps, it retains critical tokens and dynamically evicts unimportant prefix/suffix entries using an attention-guided strategy. Extensive experiments on LLaDA and Dream series demonstrate that Sparse-dLLM achieves up to 10× higher throughput than vanilla dLLMs, with comparable performance and similar peak memory costs, outperforming previous methods in efficiency and effectiveness.
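Illustration (not from the paper): the general idea of attention-guided cache eviction described above is to keep the cache entries with the highest saliency and drop the rest. The sketch below shows only that idea. The function name `select_cache_entries`, the `keep_ratio` parameter, and the plain top-k rule are assumptions for this example; the delayed bidirectional sparse caching used by Sparse-dLLM is not reproduced here.

```python
def select_cache_entries(saliency, keep_ratio=0.5):
    """Return the positions of cache entries to retain, in original order.

    saliency: one attention-derived score per cached token (higher = keep).
    keep_ratio: hypothetical fraction of entries to retain (at least one).
    """
    if not saliency:
        return []
    keep = max(1, int(len(saliency) * keep_ratio))
    # Rank positions by score, highest first, and take the top `keep`.
    ranked = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    # Restore positional order so the retained cache stays aligned with the sequence.
    return sorted(ranked[:keep])


toy_saliency = [0.1, 0.9, 0.05, 0.6]
kept = select_cache_entries(toy_saliency, keep_ratio=0.5)
```

Because the abstract reports that token saliency is stable across decoding steps, a selection like this would not need to be recomputed at every step. How often to refresh it is a design choice this sketch leaves open.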

Title: Guess or Recall? Training CNNs to Classify and Localize Memorization in LLMs

Authors: Jérémie Dentan, Davide Buscaldi, Sonia Vanier
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.02573
Pdf URL: https://arxiv.org/pdf/2508.02573
Copy Paste: [[2508.02573]] Guess or Recall? Training CNNs to Classify and Localize Memorization in LLMs(https://arxiv.org/abs/2508.02573)
Keywords: language model, llm
Abstract: Verbatim memorization in Large Language Models (LLMs) is a multifaceted phenomenon involving distinct underlying mechanisms. We introduce a novel method to analyze the different forms of memorization described by the existing taxonomy. Specifically, we train Convolutional Neural Networks (CNNs) on the attention weights of the LLM and evaluate the alignment between this taxonomy and the attention weights involved in decoding. We find that the existing taxonomy performs poorly and fails to reflect distinct mechanisms within the attention blocks. We propose a new taxonomy that maximizes alignment with the attention weights, consisting of three categories: memorized samples that are guessed using language modeling abilities, memorized samples that are recalled due to high duplication in the training set, and non-memorized samples. Our results reveal that few-shot verbatim memorization does not correspond to a distinct attention mechanism. We also show that a significant proportion of extractable samples are in fact guessed by the model and should therefore be studied separately. Finally, we develop a custom visual interpretability technique to localize the regions of the attention weights involved in each form of memorization.

Title: EHSAN: Leveraging ChatGPT in a Hybrid Framework for Arabic Aspect-Based Sentiment Analysis in Healthcare

Authors: Eman Alamoudi, Ellis Solaiman
Subjects: cs.CL, cs.AI, cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2508.02574
Pdf URL: https://arxiv.org/pdf/2508.02574
Copy Paste: [[2508.02574]] EHSAN: Leveraging ChatGPT in a Hybrid Framework for Arabic Aspect-Based Sentiment Analysis in Healthcare(https://arxiv.org/abs/2508.02574)
Keywords: language model, gpt, prompt, chat
Abstract: Arabic-language patient feedback remains under-analysed because dialect diversity and scarce aspect-level sentiment labels hinder automated assessment. To address this gap, we introduce EHSAN, a data-centric hybrid pipeline that merges ChatGPT pseudo-labelling with targeted human review to build the first explainable Arabic aspect-based sentiment dataset for healthcare. Each sentence is annotated with an aspect and sentiment label (positive, negative, or neutral), forming a pioneering Arabic dataset aligned with healthcare themes, with ChatGPT-generated rationales provided for each label to enhance transparency. To evaluate the impact of annotation quality on model performance, we created three versions of the training data: a fully supervised set with all labels reviewed by humans, a semi-supervised set with 50% human review, and an unsupervised set with only machine-generated labels. We fine-tuned two transformer models on these datasets for both aspect and sentiment classification. Experimental results show that our Arabic-specific model achieved high accuracy even with minimal human supervision, reflecting only a minor performance drop when using ChatGPT-only labels. Reducing the number of aspect classes notably improved classification metrics across the board. These findings demonstrate an effective, scalable approach to Arabic aspect-based sentiment analysis (SA) in healthcare, combining large language model annotation with human expertise to produce a robust and explainable dataset. Future directions include generalisation across hospitals, prompt refinement, and interpretable data-driven modelling.

Title: MArgE: Meshing Argumentative Evidence from Multiple Large Language Models for Justifiable Claim Verification

Authors: Ming Pok Ng, Junqi Jiang, Gabriel Freedman, Antonio Rago, Francesca Toni
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.02584
Pdf URL: https://arxiv.org/pdf/2508.02584
Copy Paste: [[2508.02584]] MArgE: Meshing Argumentative Evidence from Multiple Large Language Models for Justifiable Claim Verification(https://arxiv.org/abs/2508.02584)
Keywords: language model, gpt, llm, hallucination
Abstract: Leveraging outputs from multiple large language models (LLMs) is emerging as a method for harnessing their power across a wide range of tasks while mitigating their capacity for making errors, e.g., hallucinations. However, current approaches to combining insights from multiple LLMs often involve unstructured interactions (e.g., free debate), resulting in model generations that are not faithfully justifiable. In this work, we introduce MArgE, a novel framework to provide formal structure to the evidence from each LLM, in the form of a tree of extracted arguments, for the task of claim verification. We use a variant of Argumentative LLMs (ArgLLMs), i.e. LLMs driven by frameworks and semantics from the field of computational argumentation, to construct structured argument trees for given claims. This process creates an inspectable pathway from the initial arguments to the final claim verification decisions, providing a faithful justification thereof. We show experimentally that MArgE can significantly outperform single LLMs, including three open-source models (4B to 8B parameters), GPT-4o-mini and existing ArgLLMs, as well as prior methods for unstructured multi-LLM debates. We thus demonstrate the advantages of incorporating formal, argumentative reasoning mechanisms when combining multiple LLM outputs.

Title: CharBench: Evaluating the Role of Tokenization in Character-Level Tasks

Authors: Omri Uzan, Yuval Pinter
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.02591
Pdf URL: https://arxiv.org/pdf/2508.02591
Copy Paste: [[2508.02591]] CharBench: Evaluating the Role of Tokenization in Character-Level Tasks(https://arxiv.org/abs/2508.02591)
Keywords: language model, llm
Abstract: Tasks that require character-level reasoning, such as counting or locating characters within words, remain challenging for contemporary language models. A common conjecture is that language models' reliance on subword units, rather than characters, contributes to their struggles with character-level tasks, yet recent studies offer conflicting conclusions about the role of tokenization, leaving its impact unclear. To address this gap, we introduce CharBench, a comprehensive benchmark of character-level tasks that is two orders of magnitude larger than existing alternatives. We evaluate a diverse range of leading open-weight and proprietary models on CharBench and find that it presents a significant challenge to modern LLMs, with an average accuracy of 43.6% and 32.3% on some tasks. We present an in-depth analysis of how intrinsic properties of words and their segmentations into tokens correspond to model performance. For counting tasks, we find that tokenization properties are weakly correlated with correctness, while the length of the queried word and the actual character count play a more significant part. In contrast, for tasks requiring intra-word positional understanding, performance is negatively correlated with the length of the token containing the queried character, suggesting that longer tokens obscure character position information for LLMs. We encourage future work to build on the benchmark and evaluation methodology introduced here as tools for improving model performance on such tasks.
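Illustration (not from the paper): the two task types and the token-length property discussed above can be made concrete with a small Python sketch. The helper names and the example segmentations are hypothetical. The token splits are hand-written for illustration and do not come from any real tokenizer.

```python
def count_char(word, ch):
    """Counting task: how many times does ch occur in word?"""
    return word.count(ch)


def first_index(word, ch):
    """Positional task: zero-based index of the first occurrence of ch, or -1."""
    return word.find(ch)


def containing_token_length(tokens, char_index):
    """Length of the token that covers the character at char_index.

    tokens: a segmentation of the word whose pieces concatenate back to it.
    """
    offset = 0
    for tok in tokens:
        if offset <= char_index < offset + len(tok):
            return len(tok)
        offset += len(tok)
    raise IndexError("char_index is outside the segmented word")


word = "strawberry"
coarse = ["straw", "berry"]        # hypothetical two-token split
fine = ["st", "raw", "berry"]      # hypothetical three-token split

r_count = count_char(word, "r")
b_pos = first_index(word, "b")
len_coarse = containing_token_length(coarse, 2)   # token covering index 2
len_fine = containing_token_length(fine, 2)
```

Under the abstract's finding, a positional query about index 2 would be expected to be harder with the coarse split, where the character sits inside a longer token, than with the fine split.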

Title: Mitigating Attention Hacking in Preference-Based Reward Modeling via Interaction Distillation

Authors: Jianxiang Zang, Meiling Ning, Shihan Dou, Jiazheng Zhang, Tao Gui, Qi Zhang, Xuanjing Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.02618
Pdf URL: https://arxiv.org/pdf/2508.02618
Copy Paste: [[2508.02618]] Mitigating Attention Hacking in Preference-Based Reward Modeling via Interaction Distillation(https://arxiv.org/abs/2508.02618)
Keywords: language model, llm, prompt
Abstract: The reward model (RM), as the core component of reinforcement learning from human feedback (RLHF) for large language models (LLMs), is responsible for providing reward signals to generated responses. However, mainstream preference modeling in RM is inadequate in terms of token-level interaction, making its judgment signals vulnerable to being hacked by misallocated attention to context. This stems from two fundamental limitations: (1) Current preference modeling employs decoder-only architectures, where the unidirectional causal attention mechanism leads to forward-decaying intra-sequence attention within the prompt-response sequence. (2) The independent Siamese-encoding paradigm induces the absence of token-level inter-sequence attention between chosen and rejected sequences. To address this "attention hacking", we propose "Interaction Distillation", a novel training framework for more adequate preference modeling through attention-level optimization. The method introduces an interaction-based natural language understanding model as the teacher to provide sophisticated token interaction patterns via comprehensive attention, and guides the preference modeling to simulate the teacher model's interaction pattern through an attentional alignment objective. Through extensive experiments, interaction distillation has demonstrated its ability to provide more stable and generalizable reward signals compared to state-of-the-art RM optimization methods that target data noise, highlighting that attention hacking constitutes a more fundamental limitation in RM.

Title: Test Set Quality in Multilingual LLM Evaluation

Authors: Kranti Chalamalasetti, Gabriel Bernier-Colborne, Yvan Gauthier, Sowmya Vajjala
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.02635
Pdf URL: https://arxiv.org/pdf/2508.02635
Copy Paste: [[2508.02635]] Test Set Quality in Multilingual LLM Evaluation(https://arxiv.org/abs/2508.02635)
Keywords: language model, llm
Abstract: Several multilingual benchmark datasets have been developed in a semi-automatic manner in the recent past to measure progress and understand the state-of-the-art in the multilingual capabilities of Large Language Models. However, little attention has been paid to the quality of the datasets themselves, despite the existence of previous work in identifying errors in even fully human-annotated test sets. In this paper, we manually analyze recent multilingual evaluation sets in two languages, French and Telugu, identifying several errors in the process. We compare the performance difference across several LLMs with the original and revised versions of the datasets and identify large differences (almost 10% in some cases) in both languages. Based on these results, we argue that test sets should not be considered immutable and should be revisited, checked for correctness, and potentially versioned. We end with some recommendations for both the dataset creators as well as consumers on addressing the dataset quality issues.