2025-08-22

Title: Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training

Authors: Jianfeng Si, Lin Sun, Zhewen Tan, Xiangzheng Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.14904
Pdf URL: https://arxiv.org/pdf/2508.14904
Copy Paste: [[2508.14904]] Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training(https://arxiv.org/abs/2508.14904)
Keywords: language model, llm
Abstract: Current methods for content safety in Large Language Models (LLMs), such as Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), often rely on multi-stage training pipelines and lack fine-grained, post-deployment controllability. To address these limitations, we propose a unified co-training framework that efficiently integrates multiple safety behaviors: positive (lawful/prosocial), negative (unfiltered/risk-prone) and rejective (refusal-oriented/conservative) within a single SFT stage. Notably, each behavior is dynamically activated via a simple system-level instruction, or magic token, enabling stealthy and efficient behavioral switching at inference time. This flexibility supports diverse deployment scenarios, such as positive for safe user interaction, negative for internal red-teaming, and rejective for context-aware refusals triggered by upstream moderation signals. This co-training strategy induces a distinct Safety Alignment Margin in the output space, characterized by well-separated response distributions corresponding to each safety mode. The existence of this margin provides empirical evidence for the model's safety robustness and enables unprecedented fine-grained control. Experiments show that our method matches the safety alignment quality of SFT+DPO, with our 8B model notably surpassing DeepSeek-R1 (671B) in safety performance, while significantly reducing both training complexity and deployment costs. This work presents a scalable, efficient, and highly controllable solution for LLM content safety.

Title: Bridging the Culture Gap: A Framework for LLM-Driven Socio-Cultural Localization of Math Word Problems in Low-Resource Languages

Authors: Israel Abebe Azime, Tadesse Destaw Belay, Dietrich Klakow, Philipp Slusallek, Anshuman Chhabra
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.14913
Pdf URL: https://arxiv.org/pdf/2508.14913
Copy Paste: [[2508.14913]] Bridging the Culture Gap: A Framework for LLM-Driven Socio-Cultural Localization of Math Word Problems in Low-Resource Languages(https://arxiv.org/abs/2508.14913)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated significant capabilities in solving mathematical problems expressed in natural language. However, multilingual and culturally-grounded mathematical reasoning in low-resource languages lags behind English due to the scarcity of socio-cultural task datasets that reflect accurate native entities such as person names, organization names, and currencies. Existing multilingual benchmarks are predominantly produced via translation and typically retain English-centric entities, owing to the high cost associated with human annotator-based localization. Moreover, automated localization tools are limited, and hence, truly localized datasets remain scarce. To bridge this gap, we introduce a framework for LLM-driven cultural localization of math word problems that automatically constructs datasets with native names, organizations, and currencies from existing sources. We find that translated benchmarks can obscure true multilingual math ability under appropriate socio-cultural contexts. Through extensive experiments, we also show that our framework can help mitigate English-centric entity bias and improve robustness when native entities are introduced across various languages.

Title: Improving LLMs for Machine Translation Using Synthetic Preference Data

Authors: Dario Vajda, Domen Vreš, Marko Robnik-Šikonja
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.14951
Pdf URL: https://arxiv.org/pdf/2508.14951
Copy Paste: [[2508.14951]] Improving LLMs for Machine Translation Using Synthetic Preference Data(https://arxiv.org/abs/2508.14951)
Keywords: language model, llm
Abstract: Large language models have emerged as effective machine translation systems. In this paper, we explore how a general instruction-tuned large language model can be improved for machine translation using relatively few easily produced data resources. Using Slovene as a use case, we improve the GaMS-9B-Instruct model using Direct Preference Optimization (DPO) training on a programmatically curated and enhanced subset of a public dataset. As DPO requires pairs of quality-ranked instances, we generated its training dataset by translating English Wikipedia articles using two LLMs, GaMS-9B-Instruct and EuroLLM-9B-Instruct. We ranked the resulting translations based on heuristics coupled with automatic evaluation metrics such as COMET. The evaluation shows that our fine-tuned model outperforms both models involved in the dataset generation. In comparison to the baseline models, the fine-tuned model achieved a COMET score gain of around 0.04 and 0.02, respectively, on translating Wikipedia articles. It also more consistently avoids language and formatting errors.
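As a rough illustration of the DPO objective mentioned in the abstract above, the following sketch computes the standard DPO loss for a single preference pair. It is not the authors' training code: the function name, the `beta` default, and the use of plain floats for sequence log-probabilities are all assumptions made for this example.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All arguments are sequence log-probabilities given as floats; the
    names and the beta default are illustrative assumptions.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # translation over the rejected one, relative to the reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)), written in a numerically stable form.
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy and the reference model agree, the margin is zero and the loss equals log 2; the loss falls below log 2 once the policy favors the higher-ranked translation more strongly than the reference does.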

Title: Multilingual Datasets for Custom Input Extraction and Explanation Requests Parsing in Conversational XAI Systems

Authors: Qianli Wang, Tatiana Anikina, Nils Feldhus, Simon Ostermann, Fedor Splitt, Jiaao Li, Yoana Tsoneva, Sebastian Möller, Vera Schmitt
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.14982
Pdf URL: https://arxiv.org/pdf/2508.14982
Copy Paste: [[2508.14982]] Multilingual Datasets for Custom Input Extraction and Explanation Requests Parsing in Conversational XAI Systems(https://arxiv.org/abs/2508.14982)
Keywords: language model, llm
Abstract: Conversational explainable artificial intelligence (ConvXAI) systems based on large language models (LLMs) have garnered considerable attention for their ability to enhance user comprehension through dialogue-based explanations. Current ConvXAI systems are often based on intent recognition to accurately identify the user's desired intention and map it to an explainability method. While such methods offer great precision and reliability in discerning users' underlying intentions for English, the scarcity of training data remains a significant challenge and impedes multilingual generalization. In addition, support for free-form custom inputs, which are user-defined data distinct from pre-configured dataset instances, remains largely limited. To bridge these gaps, we first introduce MultiCoXQL, a multilingual extension of the CoXQL dataset spanning five typologically diverse languages, including one low-resource language. Subsequently, we propose a new parsing approach aimed at enhancing multilingual parsing performance, and evaluate three LLMs on MultiCoXQL using various parsing strategies. Furthermore, we present Compass, a new multilingual dataset designed for custom input extraction in ConvXAI systems, encompassing 11 intents across the same five languages as MultiCoXQL. We conduct monolingual, cross-lingual, and multilingual evaluations on Compass, employing three LLMs of varying sizes alongside BERT-type models.

Title: Reward-Shifted Speculative Sampling Is An Efficient Test-Time Weak-to-Strong Aligner

Authors: Bolian Li, Yanran Wu, Xinyu Luo, Ruqi Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.15044
Pdf URL: https://arxiv.org/pdf/2508.15044
Copy Paste: [[2508.15044]] Reward-Shifted Speculative Sampling Is An Efficient Test-Time Weak-to-Strong Aligner(https://arxiv.org/abs/2508.15044)
Keywords: language model, llm
Abstract: Aligning large language models (LLMs) with human preferences has become a critical step in their development. Recent research has increasingly focused on test-time alignment, where additional compute is allocated during inference to enhance LLM safety and reasoning capabilities. However, these test-time alignment techniques often incur substantial inference costs, limiting their practical application. Inspired by speculative sampling acceleration, which leverages a small draft model to efficiently predict future tokens, we address the efficiency bottleneck of test-time alignment. We introduce the Reward-Shifted Speculative Sampling (SSS) algorithm, in which the draft model is aligned with human preferences, while the target model remains unchanged. We theoretically demonstrate that the distributional shift between the aligned draft model and the unaligned target model can be exploited to recover the RLHF optimal solution without actually obtaining it, by modifying the acceptance criterion and bonus token distribution. Our algorithm achieves superior gold reward scores at a significantly reduced inference cost in test-time weak-to-strong alignment experiments, thereby validating both its effectiveness and efficiency.
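For context on what the abstract above modifies, here is a minimal sketch of one step of standard (unmodified) speculative sampling over a toy vocabulary. It shows the baseline acceptance criterion and the residual distribution used after a rejection; it does not implement the paper's reward-shifted variant. The function names and the list-of-probabilities representation are assumptions for this example.

```python
import random

def residual_distribution(p, q):
    """Normalized max(0, p - q), used to resample after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    # If p == q nothing is ever rejected; fall back to p for completeness.
    return [d / total for d in diff] if total > 0 else list(p)

def speculative_step(p, q, rng=random):
    """Return (token, accepted) with the token distributed exactly as p.

    p: target-model probabilities, q: draft-model probabilities,
    both given as lists over the same toy vocabulary.
    """
    vocab = range(len(p))
    x = rng.choices(vocab, weights=q)[0]       # draft model proposes x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):   # accept with prob min(1, p/q)
        return x, True
    # On rejection, resample from the residual distribution.
    x = rng.choices(vocab, weights=residual_distribution(p, q))[0]
    return x, False
```

With `p = [0.5, 0.5]` and `q = [0.9, 0.1]`, token 0 is proposed 90% of the time but accepted only with probability 0.5/0.9, so the overall output frequency of token 0 still converges to 0.5.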

Title: LongRecall: A Structured Approach for Robust Recall Evaluation in Long-Form Text

Authors: MohamamdJavad Ardestani, Ehsan Kamalloo, Davood Rafiei
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2508.15085
Pdf URL: https://arxiv.org/pdf/2508.15085
Copy Paste: [[2508.15085]] LongRecall: A Structured Approach for Robust Recall Evaluation in Long-Form Text(https://arxiv.org/abs/2508.15085)
Keywords: llm, hallucination, prompt
Abstract: LongRecall. The completeness of machine-generated text, ensuring that it captures all relevant information, is crucial in domains such as medicine and law and in tasks like list-based question answering (QA), where omissions can have serious consequences. However, existing recall metrics often depend on lexical overlap, leading to errors with unsubstantiated entities and paraphrased answers, while LLM-as-a-Judge methods with long holistic prompts capture broader semantics but remain prone to misalignment and hallucinations without structured verification. We introduce LongRecall, a general three-stage recall evaluation framework that decomposes answers into self-contained facts, successively narrows plausible candidate matches through lexical and semantic filtering, and verifies their alignment through structured entailment checks. This design reduces false positives and false negatives while accommodating diverse phrasings and contextual variations, serving as a foundational building block for systematic recall assessment. We evaluate LongRecall on three challenging long-form QA benchmarks using both human annotations and LLM-based judges, demonstrating substantial improvements in recall accuracy over strong lexical and LLM-as-a-Judge baselines.

Title: Mapping the Course for Prompt-based Structured Prediction

Authors: Matt Pauk, Maria Leonor Pacheco
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.15090
Pdf URL: https://arxiv.org/pdf/2508.15090
Copy Paste: [[2508.15090]] Mapping the Course for Prompt-based Structured Prediction(https://arxiv.org/abs/2508.15090)
Keywords: llm, hallucination, prompt
Abstract: LLMs have been shown to be useful for a variety of language tasks, without requiring task-specific fine-tuning. However, these models often struggle with hallucinations and complex reasoning problems due to their autoregressive nature. We propose to address some of these issues, specifically in the area of structured prediction, by combining LLMs with combinatorial inference in an attempt to marry the predictive power of LLMs with the structural consistency provided by inference methods. We perform exhaustive experiments in an effort to understand which prompting strategies can effectively estimate LLM confidence values for use with symbolic inference, and show that, regardless of the prompting strategy, adding symbolic inference on top of prompting alone leads to more consistent and accurate predictions. Additionally, we show that calibration and fine-tuning using structured prediction objectives lead to increased performance for challenging tasks, showing that structured learning is still valuable in the era of LLMs.

Title: Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset

Authors: Rabeeh Karimi Mahabadi, Sanjeev Satheesh, Shrimai Prabhumoye, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.15096
Pdf URL: https://arxiv.org/pdf/2508.15096
Copy Paste: [[2508.15096]] Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset(https://arxiv.org/abs/2508.15096)
Keywords: language model, llm
Abstract: Pretraining large language models (LLMs) on high-quality, structured data such as mathematics and code substantially enhances reasoning capabilities. However, existing math-focused datasets built from Common Crawl suffer from degraded quality due to brittle extraction heuristics, lossy HTML-to-text conversion, and the failure to reliably preserve mathematical structure. In this work, we introduce Nemotron-CC-Math, a large-scale, high-quality mathematical corpus constructed from Common Crawl using a novel, domain-agnostic pipeline specifically designed for robust scientific text extraction. Unlike previous efforts, our pipeline recovers math across various formats (e.g., MathJax, KaTeX, MathML) by leveraging layout-aware rendering with lynx and a targeted LLM-based cleaning stage. This approach preserves the structural integrity of equations and code blocks while removing boilerplate, standardizing notation into LaTeX representation, and correcting inconsistencies. We collected a large, high-quality math corpus, namely Nemotron-CC-Math-3+ (133B tokens) and Nemotron-CC-Math-4+ (52B tokens). Notably, Nemotron-CC-Math-4+ not only surpasses all prior open math datasets, including MegaMath, FineMath, and OpenWebMath, but also contains 5.5 times more tokens than FineMath-4+, which was previously the highest-quality math pretraining dataset. When used to pretrain a Nemotron-T 8B model, our corpus yields +4.8 to +12.6 gains on MATH and +4.6 to +14.3 gains on MBPP+ over strong baselines, while also improving general-domain performance on MMLU and MMLU-Stem. We present the first pipeline to reliably extract scientific content, including math, from noisy web-scale data, yielding measurable gains in math, code, and general reasoning, and setting a new state of the art among open math pretraining corpora. To support open-source efforts, we release our code and datasets.

Title: Identifying and Answering Questions with False Assumptions: An Interpretable Approach

Authors: Zijie Wang, Eduardo Blanco
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.15139
Pdf URL: https://arxiv.org/pdf/2508.15139
Copy Paste: [[2508.15139]] Identifying and Answering Questions with False Assumptions: An Interpretable Approach(https://arxiv.org/abs/2508.15139)
Keywords: language model, llm, hallucination
Abstract: People often ask questions with false assumptions, a type of question that does not have regular answers. Answering such questions requires first identifying the false assumptions. Large Language Models (LLMs) often generate misleading answers because of hallucinations. In this paper, we focus on identifying and answering questions with false assumptions in several domains. We first investigate whether the problem can be reduced to fact verification. Then, we present an approach leveraging external evidence to mitigate hallucinations. Experiments with five LLMs demonstrate that (1) incorporating retrieved evidence is beneficial and (2) generating and validating atomic assumptions yields more improvements and provides an interpretable answer by specifying the false assumptions.

Title: ContextualLVLM-Agent: A Holistic Framework for Multi-Turn Visually-Grounded Dialogue and Complex Instruction Following

Authors: Seungmin Han, Haeun Kwon, Ji-jun Park, Taeyang Yoon
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.15164
Pdf URL: https://arxiv.org/pdf/2508.15164
Copy Paste: [[2508.15164]] ContextualLVLM-Agent: A Holistic Framework for Multi-Turn Visually-Grounded Dialogue and Complex Instruction Following(https://arxiv.org/abs/2508.15164)
Keywords: language model, gpt, llm, hallucination, agent
Abstract: Despite significant advancements in Large Language Models (LLMs) and Large Vision-Language Models (LVLMs), current models still face substantial challenges in handling complex, multi-turn, and visually-grounded tasks that demand deep reasoning, sustained contextual understanding, entity tracking, and multi-step instruction following. Existing benchmarks often fall short in capturing the dynamism and intricacies of real-world multi-modal interactions, leading to issues such as context loss and visual hallucinations. To address these limitations, we introduce MMDR-Bench (Multi-Modal Dialogue Reasoning Benchmark), a novel dataset comprising 300 meticulously designed complex multi-turn dialogue scenarios, each averaging 5-7 turns and evaluated across six core dimensions including visual entity tracking and reasoning depth. Furthermore, we propose CoLVLM Agent (Contextual LVLM Agent), a holistic framework that enhances existing LVLMs with advanced reasoning and instruction following capabilities through an iterative "memory-perception-planning-execution" cycle, requiring no extensive re-training of the underlying models. Our extensive experiments on MMDR-Bench demonstrate that CoLVLM Agent consistently achieves superior performance, attaining an average human evaluation score of 4.03, notably surpassing state-of-the-art commercial models like GPT-4o (3.92) and Gemini 1.5 Pro (3.85). The framework exhibits significant advantages in reasoning depth, instruction adherence, and error suppression, and maintains robust performance over extended dialogue turns, validating the effectiveness of its modular design and iterative approach for complex multi-modal interactions.

Title: SemToken: Semantic-Aware Tokenization for Efficient Long-Context Language Modeling

Authors: Dong Liu, Yanxuan Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.15190
Pdf URL: https://arxiv.org/pdf/2508.15190
Copy Paste: [[2508.15190]] SemToken: Semantic-Aware Tokenization for Efficient Long-Context Language Modeling(https://arxiv.org/abs/2508.15190)
Keywords: language model
Abstract: Tokenization plays a critical role in language modeling, yet existing approaches such as Byte-Pair Encoding (BPE) or WordPiece operate purely on frequency statistics, ignoring the underlying semantic structure of text. This leads to over-tokenization of semantically redundant spans and underutilization of contextual coherence, particularly in long-context scenarios. In this work, we propose SemToken, a semantic-aware tokenization framework that jointly reduces token redundancy and improves computation efficiency. SemToken first extracts contextual semantic embeddings via lightweight encoders and performs local semantic clustering to merge semantically equivalent tokens. Then, it allocates heterogeneous token granularity based on semantic density, allowing finer-grained tokenization in content-rich regions and coarser compression in repetitive or low-entropy spans. SemToken can be seamlessly integrated with modern language models and attention acceleration methods. Experiments on long-context language modeling benchmarks such as WikiText-103 and LongBench show that SemToken achieves up to 2.4× reduction in token count and 1.9× speedup, with negligible or no degradation in perplexity and downstream accuracy. Our findings suggest that semantic structure offers a promising new axis for optimizing tokenization and computation in large language models.

Title: Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models

Authors: Yuanchen Zhou, Shuo Jiang, Jie Zhu, Junhui Li, Lifan Guo, Feng Chen, Chi Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.15202
Pdf URL: https://arxiv.org/pdf/2508.15202
Copy Paste: [[2508.15202]] Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models(https://arxiv.org/abs/2508.15202)
Keywords: language model, llm
Abstract: Process Reward Models (PRMs) have emerged as a promising framework for supervising intermediate reasoning in large language models (LLMs), yet existing PRMs are primarily trained on general or Science, Technology, Engineering, and Mathematics (STEM) domains and fall short in domain-specific contexts such as finance, where reasoning is more structured, symbolic, and sensitive to factual and regulatory correctness. We introduce Fin-PRM, a domain-specialized, trajectory-aware PRM tailored to evaluate intermediate reasoning steps in financial tasks. Fin-PRM integrates step-level and trajectory-level reward supervision, enabling fine-grained evaluation of reasoning traces aligned with financial logic. We apply Fin-PRM in both offline and online reward learning settings, supporting three key applications: (i) selecting high-quality reasoning trajectories for distillation-based supervised fine-tuning, (ii) providing dense process-level rewards for reinforcement learning, and (iii) guiding reward-informed Best-of-N inference at test time. Experimental results on financial reasoning benchmarks, including CFLUE and FinQA, demonstrate that Fin-PRM consistently outperforms general-purpose PRMs and strong domain baselines in trajectory selection quality. Downstream models trained with Fin-PRM yield substantial improvements over baselines, with gains of 12.9% in supervised learning, 5.2% in reinforcement learning, and 5.1% in test-time performance. These findings highlight the value of domain-specialized reward modeling for aligning LLMs with expert-level financial reasoning. Our project resources will be available at this https URL.
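The Best-of-N use case in item (iii) above can be sketched in a few lines. This is a generic illustration rather than Fin-PRM itself: `step_scorer` stands in for a process reward model, and averaging step rewards into one trajectory score is an assumption made for the example (the abstract says Fin-PRM combines step-level and trajectory-level signals but does not say how).

```python
from statistics import mean

def best_of_n(candidates, step_scorer):
    """Pick the reasoning trajectory with the highest mean step reward.

    candidates: list of trajectories, each a non-empty list of step strings.
    step_scorer: hypothetical callable mapping one step to a float reward.
    """
    def trajectory_score(steps):
        # Assumed aggregation: plain average of per-step rewards.
        return mean(step_scorer(step) for step in steps)
    return max(candidates, key=trajectory_score)

# Toy stand-in for a reward model: reward steps that show a calculation.
def toy_scorer(step):
    return 1.0 if "=" in step else 0.0

candidates = [
    ["Revenue rose.", "So margin is higher."],
    ["Margin = profit / revenue = 20 / 100", "Margin = 0.2"],
]
best = best_of_n(candidates, toy_scorer)
```

Here `best` is the second trajectory, since both of its steps receive a reward of 1.0 from the toy scorer.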

Title: SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning

Authors: Huanxuan Liao, Yixing Xu, Shizhu He, Guanchen Li, Xuanwu Yin, Dong Li, Emad Barsoum, Jun Zhao, Kang Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.15212
Pdf URL: https://arxiv.org/pdf/2508.15212
Copy Paste: [[2508.15212]] SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning(https://arxiv.org/abs/2508.15212)
Keywords: language model, llm
Abstract: Long-context inference in large language models (LLMs) is increasingly constrained by the KV cache bottleneck: memory usage grows linearly with sequence length, while attention computation scales quadratically. Existing approaches address this issue by compressing the KV cache along the temporal axis through strategies such as token eviction or merging to reduce memory and computational overhead. However, these methods often neglect fine-grained importance variations across feature dimensions (i.e., the channel axis), thereby limiting their ability to effectively balance efficiency and model accuracy. In reality, we observe that channel saliency varies dramatically across both queries and positions: certain feature channels carry near-zero information for a given query, while others spike in relevance. To address this oversight, we propose SPARK, a training-free plug-and-play method that applies unstructured sparsity by pruning KV at the channel level, while dynamically restoring the pruned entries during attention score computation. Notably, our approach is orthogonal to existing KV compression and quantization techniques, making it compatible with them for further acceleration. By reducing channel-level redundancy, SPARK enables processing of longer sequences within the same memory budget. For sequences of equal length, SPARK not only preserves or improves model accuracy but also reduces KV cache storage by over 30% compared to eviction-based methods. Furthermore, even with an aggressive pruning ratio of 80%, SPARK maintains performance with less than 5% degradation compared to the baseline eviction method, demonstrating its robustness and effectiveness. Our code will be available at this https URL.
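To make the idea of channel-level saliency concrete, the sketch below keeps only the key channels that contribute most to a single query-key dot product and compares the pruned score with the full one. It is a toy illustration under strong assumptions, not SPARK: the saliency measure `|q_c * k_c|`, the function names, and the omission of the paper's recovery step for pruned entries are all choices made for this example.

```python
def top_channels(q, k, keep):
    """Indices of the `keep` channels with the largest |q_c * k_c|."""
    saliency = [abs(qc * kc) for qc, kc in zip(q, k)]
    order = sorted(range(len(q)), key=lambda c: saliency[c], reverse=True)
    return sorted(order[:keep])

def pruned_score(q, k, channels):
    """Attention logit computed from the retained channels only."""
    return sum(q[c] * k[c] for c in channels)

q = [1.0, 0.01, -2.0, 0.0]
k = [0.5, 3.0, 1.0, 7.0]
kept = top_channels(q, k, keep=2)
full = sum(qc * kc for qc, kc in zip(q, k))
approx = pruned_score(q, k, kept)
```

In this toy case half of the channels are dropped, yet `approx` stays close to `full` because the discarded channels carry almost no weight for this particular query.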

Title: Select to Know: An Internal-External Knowledge Self-Selection Framework for Domain-Specific Question Answering

Authors: Bolei He, Xinran He, Run Shao, Shanfu Shu, Xianwei Xue, Mingquan Cheng, Haifeng Li, Zhenhua Ling
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.15213
Pdf URL: https://arxiv.org/pdf/2508.15213
Copy Paste: [[2508.15213]] Select to Know: An Internal-External Knowledge Self-Selection Framework for Domain-Specific Question Answering(https://arxiv.org/abs/2508.15213)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Large Language Models (LLMs) perform well in general QA but often struggle in domain-specific scenarios. Retrieval-Augmented Generation (RAG) introduces external knowledge but suffers from hallucinations and latency due to noisy retrievals. Continued pretraining internalizes domain knowledge but is costly and lacks cross-domain flexibility. We attribute this challenge to the long-tail distribution of domain knowledge, which leaves partial yet useful internal knowledge underutilized. We further argue that knowledge acquisition should be progressive, mirroring human learning: first understanding concepts, then applying them to complex reasoning. To address this, we propose Select2Know (S2K), a cost-effective framework that internalizes domain knowledge through an internal-external knowledge self-selection strategy and selective supervised fine-tuning. We also introduce a structured reasoning data generation pipeline and integrate GRPO to enhance reasoning ability. Experiments on medical, legal, and financial QA benchmarks show that S2K consistently outperforms existing methods and matches domain-pretrained LLMs at significantly lower cost.

Title: Self-Guided Function Calling in Large Language Models via Stepwise Experience Recall

Authors: Sijia Cui, Aiyao He, Shuai Xu, Hongming Zhang, Yanna Wang, Qingyang Zhang, Yajing Wang, Bo Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.15214
Pdf URL: https://arxiv.org/pdf/2508.15214
Copy Paste: [[2508.15214]] Self-Guided Function Calling in Large Language Models via Stepwise Experience Recall(https://arxiv.org/abs/2508.15214)
Keywords: language model, llm, prompt
Abstract: Function calling enables large language models (LLMs) to interact with external systems by leveraging tools and APIs. When faced with multi-step tool usage, LLMs still struggle with tool selection, parameter generation, and tool-chain planning. Existing methods typically rely on manually designing task-specific demonstrations or retrieving them from a curated library. These approaches demand substantial expert effort, and prompt engineering becomes increasingly complex and inefficient as tool diversity and task difficulty scale. To address these challenges, we propose a self-guided method, Stepwise Experience Recall (SEER), which performs fine-grained, stepwise retrieval from a continually updated experience pool. Instead of relying on a static or manually curated library, SEER incrementally augments the experience pool with past successful trajectories, enabling continuous expansion of the pool and improved model performance over time. Evaluated on the ToolQA benchmark, SEER achieves an average improvement of 6.1% on easy and 4.7% on hard questions. We further test SEER on τ-bench, which includes two real-world domains. Powered by Qwen2.5-7B and Qwen2.5-72B models, SEER demonstrates substantial accuracy gains of 7.44% and 23.38%, respectively.
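
The experience-pool idea in the SEER abstract can be illustrated with a small sketch. It is not the authors' implementation: the class name `ExperiencePool`, the (state, action) step format, and the word-overlap (Jaccard) similarity are all assumptions made for illustration, since the abstract does not say how steps are stored or scored.

```python
class ExperiencePool:
    """Toy pool of past steps, grown only from successful trajectories."""

    def __init__(self):
        self._entries = []  # list of (state_text, action_text) pairs

    def __len__(self):
        return len(self._entries)

    def add_trajectory(self, steps, success):
        """Store every (state, action) step of a run, but only if it succeeded."""
        if success:
            self._entries.extend(steps)

    def recall(self, state, top_k=2):
        """Return the top_k stored steps most similar to the current state.

        Similarity is Jaccard overlap of lowercase words (an assumption).
        """
        query = set(state.lower().split())

        def score(entry):
            words = set(entry[0].lower().split())
            union = query | words
            return len(query & words) / len(union) if union else 0.0

        return sorted(self._entries, key=score, reverse=True)[:top_k]


pool = ExperiencePool()
pool.add_trajectory(
    [("find flight from boston to denver", "search_flights"),
     ("compute average delay of flights", "run_sql")],
    success=True,
)
pool.add_trajectory([("book hotel in rome", "bad_call")], success=False)
best = pool.recall("find flight from paris to rome", top_k=1)
```

Recall happens once per step rather than once per task, so each new tool call can be conditioned on the stored step that most resembles the current state.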

Title: Are Checklists Really Useful for Automatic Evaluation of Generative Tasks?

Authors: Momoka Furuhashi, Kouta Nakayama, Takashi Kodama, Saku Sugawara
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.15218
Pdf URL: https://arxiv.org/pdf/2508.15218
Copy Paste: [[2508.15218]] Are Checklists Really Useful for Automatic Evaluation of Generative Tasks?(https://arxiv.org/abs/2508.15218)
Keywords: language model
Abstract: Automatic evaluation of generative tasks using large language models faces challenges due to ambiguous criteria. Although automatic checklist generation is a potentially promising approach, its usefulness remains underexplored. We investigate whether checklists should be used for all questions or selectively, generate them using six methods, evaluate their effectiveness across eight model sizes, and identify checklist items that correlate with human evaluations. Through experiments on pairwise comparison and direct scoring tasks, we find that selective checklist use tends to improve evaluation performance in pairwise settings, while its benefits are less consistent in direct scoring. Our analysis also shows that even checklist items with low correlation to human scores often reflect human-written criteria, indicating potential inconsistencies in human evaluation. These findings highlight the need to more clearly define objective evaluation criteria to guide both human and automatic evaluations. Our code is available at this https URL.

Title: VocabTailor: Dynamic Vocabulary Selection for Downstream Tasks in Small Language Models

Authors: Hanling Zhang, Yayu Zhou, Tongcheng Fang, Zhihang Yuan, Guohao Dai, Yu Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.15229
Pdf URL: https://arxiv.org/pdf/2508.15229
Copy Paste: [[2508.15229]] VocabTailor: Dynamic Vocabulary Selection for Downstream Tasks in Small Language Models(https://arxiv.org/abs/2508.15229)
Keywords: language model
Abstract: Small Language Models (SLMs) provide computational advantages in resource-constrained environments, yet memory limitations remain a critical bottleneck for edge device deployment. A substantial portion of SLMs' memory footprint stems from vocabulary-related components, particularly embeddings and language modeling (LM) heads, due to large vocabulary sizes. Existing static vocabulary pruning, while reducing memory usage, suffers from rigid, one-size-fits-all designs that cause information loss from the prefill stage and a lack of flexibility. In this work, we identify two key principles underlying the vocabulary reduction challenge: the lexical locality principle, the observation that only a small subset of tokens is required during any single inference, and the asymmetry in computational characteristics between the vocabulary-related components of SLMs. Based on these insights, we introduce VocabTailor, a novel decoupled dynamic vocabulary selection framework that addresses memory constraints by offloading the embedding and implements a hybrid static-dynamic vocabulary selection strategy for the LM head, enabling on-demand loading of vocabulary components. Comprehensive experiments across diverse downstream tasks demonstrate that VocabTailor achieves a reduction of up to 99% in the memory usage of vocabulary-related components with minimal or no degradation in task performance, substantially outperforming existing static vocabulary pruning.
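
To make the lexical locality principle concrete, the sketch below restricts LM-head scoring to a small set of token ids. It is an illustration under stated assumptions, not VocabTailor itself: the helper names `select_vocab` and `restricted_logits`, the choice of "static ids plus ids seen in the input" as the hybrid rule, and the toy weights are all hypothetical.

```python
def select_vocab(static_ids, input_ids):
    """Hybrid selection (assumed rule): a fixed task-level set of token ids
    plus whatever ids appear in the current input."""
    return sorted(set(static_ids) | set(input_ids))


def restricted_logits(hidden, lm_head, vocab_ids):
    """Compute logits only for the selected rows of the LM head.

    hidden    : list of floats, the final hidden state
    lm_head   : list of rows, one weight vector per vocabulary id
    vocab_ids : ids whose rows are actually loaded and scored
    """
    return {
        i: sum(h * w for h, w in zip(hidden, lm_head[i]))
        for i in vocab_ids
    }


# Toy LM head with 4 vocabulary rows and hidden size 2.
lm_head = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
hidden = [1.0, 2.0]

vocab_ids = select_vocab(static_ids=[0], input_ids=[2, 2])
logits = restricted_logits(hidden, lm_head, vocab_ids)
next_id = max(logits, key=logits.get)
```

Only the rows named in `vocab_ids` are touched, which is where the memory saving would come from if the remaining rows stayed offloaded.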

Title: WangchanThaiInstruct: An instruction-following Dataset for Culture-Aware, Multitask, and Multi-domain Evaluation in Thai

Authors: Peerat Limkonchotiwat, Pume Tuchinda, Lalita Lowphansirikul, Surapon Nonesung, Panuthep Tasawong, Alham Fikri Aji, Can Udomcharoenchaikit, Sarana Nutanong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.15239
Pdf URL: https://arxiv.org/pdf/2508.15239
Copy Paste: [[2508.15239]] WangchanThaiInstruct: An instruction-following Dataset for Culture-Aware, Multitask, and Multi-domain Evaluation in Thai(https://arxiv.org/abs/2508.15239)
Keywords: language model, llm
Abstract: Large language models excel at instruction-following in English, but their performance in low-resource languages like Thai remains underexplored. Existing benchmarks often rely on translations, missing cultural and domain-specific nuances needed for real-world use. We present WangchanThaiInstruct, a human-authored Thai dataset for evaluation and instruction tuning, covering four professional domains and seven task types. Created through a multi-stage quality control process with annotators, domain experts, and AI researchers, WangchanThaiInstruct supports two studies: (1) a zero-shot evaluation showing performance gaps on culturally and professionally specific tasks, and (2) an instruction tuning study with ablations isolating the effect of native supervision. Models fine-tuned on WangchanThaiInstruct outperform those using translated data in both in-domain and out-of-domain benchmarks. These findings underscore the need for culturally and professionally grounded instruction data to improve LLM alignment in low-resource, linguistically diverse settings.

Title: EMNLP: Educator-role Moral and Normative Large Language Models Profiling

Authors: Yilin Jiang, Mingzi Zhang, Sheng Jin, Zengyi Yu, Xiangjie Kong, Binghao Tu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.15250
Pdf URL: https://arxiv.org/pdf/2508.15250
Copy Paste: [[2508.15250]] EMNLP: Educator-role Moral and Normative Large Language Models Profiling(https://arxiv.org/abs/2508.15250)
Keywords: language model, llm, prompt
Abstract: Simulating Professions (SP) enables Large Language Models (LLMs) to emulate professional roles. However, comprehensive psychological and ethical evaluation in these contexts remains lacking. This paper introduces EMNLP, an Educator-role Moral and Normative LLMs Profiling framework for personality profiling, moral development stage measurement, and ethical risk under soft prompt injection. EMNLP extends existing scales and constructs 88 teacher-specific moral dilemmas, enabling profession-oriented comparison with human teachers. A targeted soft prompt injection set evaluates compliance and vulnerability in teacher SP. Experiments on 12 LLMs show teacher-role LLMs exhibit more idealized and polarized personalities than human teachers, excel in abstract moral reasoning, but struggle with emotionally complex situations. Models with stronger reasoning are more vulnerable to harmful prompt injection, revealing a paradox between capability and safety. The model temperature and other hyperparameters have limited influence except in some risk behaviors. This paper presents the first benchmark to assess ethical and psychological alignment of teacher-role LLMs for educational AI. Resources are available at this https URL.

Title: Conflict-Aware Soft Prompting for Retrieval-Augmented Generation

Authors: Eunseong Choi, June Park, Hyeri Lee, Jongwuk Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.15253
Pdf URL: https://arxiv.org/pdf/2508.15253
Copy Paste: [[2508.15253]] Conflict-Aware Soft Prompting for Retrieval-Augmented Generation(https://arxiv.org/abs/2508.15253)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) enhances the capabilities of large language models (LLMs) by incorporating external knowledge into their input prompts. However, when the retrieved context contradicts the LLM's parametric knowledge, the LLM often fails to resolve the conflict between incorrect external context and correct parametric knowledge, known as context-memory conflict. To tackle this problem, we introduce Conflict-Aware REtrieval-Augmented Generation (CARE), consisting of a context assessor and a base LLM. The context assessor encodes compact memory token embeddings from raw context tokens. Through grounded/adversarial soft prompting, the context assessor is trained to discern unreliable context and capture a guidance signal that directs reasoning toward the more reliable knowledge source. Extensive experiments show that CARE effectively mitigates context-memory conflicts, leading to an average performance gain of 5.0% on QA and fact-checking benchmarks, establishing a promising direction for trustworthy and adaptive RAG systems.

Title: TComQA: Extracting Temporal Commonsense from Text

Authors: Lekshmi R Nair, Arun Sankar, Koninika Pal
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2508.15274
Pdf URL: https://arxiv.org/pdf/2508.15274
Copy Paste: [[2508.15274]] TComQA: Extracting Temporal Commonsense from Text(https://arxiv.org/abs/2508.15274)
Keywords: language model, llm
Abstract: Understanding events necessitates grasping their temporal context, which is often not explicitly stated in natural language. For example, it is not a trivial task for a machine to infer that a museum tour may last for a few hours but cannot take months. Recent studies indicate that even advanced large language models (LLMs) struggle to generate text that requires reasoning with temporal commonsense, because it is rarely mentioned explicitly in text. Therefore, automatically mining temporal commonsense for events enables the creation of robust language models. In this work, we investigate the capacity of LLMs to extract temporal commonsense from text and evaluate multiple experimental setups to assess their effectiveness. Here, we propose a temporal commonsense extraction pipeline that leverages LLMs to automatically mine temporal commonsense and use it to construct TComQA, a dataset derived from the SAMSum and RealNews corpora. TComQA has been validated through crowdsourcing and achieves over 80% precision in extracting temporal commonsense. The model trained with TComQA also outperforms an LLM fine-tuned on existing temporal question answering datasets.

Title: A Survey on Large Language Model Benchmarks

Authors: Shiwen Ni, Guhong Chen, Shuaimin Li, Xuanang Chen, Siyi Li, Bingli Wang, Qiyao Wang, Xingjian Wang, Yifan Zhang, Liyang Fan, Chengming Li, Ruifeng Xu, Le Sun, Min Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.15361
Pdf URL: https://arxiv.org/pdf/2508.15361
Copy Paste: [[2508.15361]] A Survey on Large Language Model Benchmarks(https://arxiv.org/abs/2508.15361)
Keywords: language model, agent
Abstract: In recent years, with the rapid development of the depth and breadth of large language models' capabilities, various corresponding evaluation benchmarks have been emerging in increasing numbers. As a quantitative assessment tool for model performance, benchmarks are not only a core means to measure model capabilities but also a key element in guiding the direction of model development and promoting technological innovation. We systematically review the current status and development of large language model benchmarks for the first time, categorizing 283 representative benchmarks into three categories: general capabilities, domain-specific, and target-specific. General capability benchmarks cover aspects such as core linguistics, knowledge, and reasoning; domain-specific benchmarks focus on fields like natural sciences, humanities and social sciences, and engineering technology; target-specific benchmarks pay attention to risks, reliability, agents, etc. We point out that current benchmarks have problems such as inflated scores caused by data contamination, unfair evaluation due to cultural and linguistic biases, and lack of evaluation on process credibility and dynamic environments, and provide a referable design paradigm for future benchmark innovation.

Title: Unveiling Trust in Multimodal Large Language Models: Evaluation, Analysis, and Mitigation

Authors: Yichi Zhang, Yao Huang, Yifan Wang, Yitong Sun, Chang Liu, Zhe Zhao, Zhengwei Fang, Huanran Chen, Xiao Yang, Xingxing Wei, Hang Su, Yinpeng Dong, Jun Zhu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.15370
Pdf URL: https://arxiv.org/pdf/2508.15370
Copy Paste: [[2508.15370]] Unveiling Trust in Multimodal Large Language Models: Evaluation, Analysis, and Mitigation(https://arxiv.org/abs/2508.15370)
Keywords: language model, llm, chain-of-thought
Abstract: The trustworthiness of Multimodal Large Language Models (MLLMs) remains an intense concern despite the significant progress in their capabilities. Existing evaluation and mitigation approaches often focus on narrow aspects and overlook risks introduced by the multimodality. To tackle these challenges, we propose MultiTrust-X, a comprehensive benchmark for evaluating, analyzing, and mitigating the trustworthiness issues of MLLMs. We define a three-dimensional framework, encompassing five trustworthiness aspects which include truthfulness, robustness, safety, fairness, and privacy; two novel risk types covering multimodal risks and cross-modal impacts; and various mitigation strategies from the perspectives of data, model architecture, training, and inference algorithms. Based on the taxonomy, MultiTrust-X includes 32 tasks and 28 curated datasets, enabling holistic evaluations over 30 open-source and proprietary MLLMs and in-depth analysis with 8 representative mitigation methods. Our extensive experiments reveal significant vulnerabilities in current models, including a gap between trustworthiness and general capabilities, as well as the amplification of potential risks in base LLMs by both multimodal training and inference. Moreover, our controlled analysis uncovers key limitations in existing mitigation strategies that, while some methods yield improvements in specific aspects, few effectively address overall trustworthiness, and many introduce unexpected trade-offs that compromise model utility. These findings also provide practical insights for future improvements, such as the benefits of reasoning to better balance safety and performance. Based on these insights, we introduce a Reasoning-Enhanced Safety Alignment (RESA) approach that equips the model with chain-of-thought reasoning ability to discover the underlying risks, achieving state-of-the-art results.

Title: Confidence-Modulated Speculative Decoding for Large Language Models

Authors: Jaydip Sen, Subhasis Dasgupta, Hetvi Waghela
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.15371
Pdf URL: https://arxiv.org/pdf/2508.15371
Copy Paste: [[2508.15371]] Confidence-Modulated Speculative Decoding for Large Language Models(https://arxiv.org/abs/2508.15371)
Keywords: language model
Abstract: Speculative decoding has emerged as an effective approach for accelerating autoregressive inference by parallelizing token generation through a draft-then-verify paradigm. However, existing methods rely on static drafting lengths and rigid verification criteria, limiting their adaptability across varying model uncertainties and input complexities. This paper proposes an information-theoretic framework for speculative decoding based on confidence-modulated drafting. By leveraging entropy and margin-based uncertainty measures over the drafter's output distribution, the proposed method dynamically adjusts the number of speculatively generated tokens at each iteration. This adaptive mechanism reduces rollback frequency, improves resource utilization, and maintains output fidelity. Additionally, the verification process is modulated using the same confidence signals, enabling more flexible acceptance of drafted tokens without sacrificing generation quality. Experiments on machine translation and summarization tasks demonstrate significant speedups over standard speculative decoding while preserving or improving BLEU and ROUGE scores. The proposed approach offers a principled, plug-in method for efficient and robust decoding in large language models under varying conditions of uncertainty.
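
The abstract's idea of adjusting the draft length from entropy and margin signals can be sketched as follows. This is a minimal illustration, not the paper's formula: the linear blend with weight `alpha`, the bounds `min_len` and `max_len`, and the rounding rule are assumptions chosen only to show the mechanism.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def margin(probs):
    """Gap between the two largest probabilities."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1] if len(top) > 1 else top[0]


def draft_length(probs, min_len=1, max_len=8, alpha=0.5):
    """Map the drafter's confidence to a number of tokens to draft.

    confidence = alpha * (1 - normalized entropy) + (1 - alpha) * margin
    (an assumed blend; higher confidence means a longer draft).
    """
    n = len(probs)
    max_entropy = math.log(n) if n > 1 else 1.0
    norm_entropy = entropy(probs) / max_entropy
    confidence = alpha * (1.0 - norm_entropy) + (1.0 - alpha) * margin(probs)
    confidence = min(1.0, max(0.0, confidence))  # clamp against rounding noise
    return min_len + round(confidence * (max_len - min_len))


peaked = [0.97, 0.01, 0.01, 0.01]   # drafter is confident
flat = [0.25, 0.25, 0.25, 0.25]     # drafter is maximally uncertain
```

With these inputs, `draft_length(peaked)` is larger than `draft_length(flat)`: a confident drafter speculates further ahead, while an uncertain one drafts the minimum and so risks fewer rollbacks.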

Title: Exploiting Vocabulary Frequency Imbalance in Language Model Pre-training

Authors: Woojin Chung, Jeonghoon Kim
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.15390
Pdf URL: https://arxiv.org/pdf/2508.15390
Copy Paste: [[2508.15390]] Exploiting Vocabulary Frequency Imbalance in Language Model Pre-training(https://arxiv.org/abs/2508.15390)
Keywords: language model
Abstract: Large language models are trained with tokenizers, and the resulting token distribution is highly imbalanced: a few words dominate the stream while most occur rarely. Recent practice favors ever-larger vocabularies, but the source of the benefit is unclear. We conduct a controlled study that scales the language model's vocabulary from 24K to 196K while holding data, compute, and optimization fixed. We first quantify the complexity of tokenized text, formalized via Kolmogorov complexity, and show that larger vocabularies reduce this complexity. Above 24K, every common word is already a single token, so further growth mainly deepens the relative token-frequency imbalance. A word-level loss decomposition shows that larger vocabularies reduce cross-entropy almost exclusively by lowering uncertainty on the 2,500 most frequent words, even though loss on the rare tail rises. Constraining input and output embedding norms to attenuate the effect of token-frequency imbalance reverses the gain, directly showing that the model exploits rather than suffers from imbalance. Because the same frequent words cover roughly 77% of tokens in downstream benchmarks, this training advantage transfers intact. We also show that enlarging model parameters with a fixed vocabulary yields the same frequent-word benefit. Our results reframe "bigger vocabularies help" as "lowering the complexity of tokenized text helps," providing a simple, principled lever for tokenizer-model co-design and clarifying the loss dynamics that govern language-model scaling in pre-training.
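
The coverage figure quoted in the abstract (the most frequent words covering roughly 77% of tokens) is a simple ratio, and the sketch below shows how such a number can be computed. The function name `top_k_coverage` and the toy sentence are hypothetical, and whitespace splitting stands in for whatever word segmentation the authors actually used.

```python
from collections import Counter


def top_k_coverage(tokens, k):
    """Fraction of all token occurrences accounted for by the k most
    frequent distinct tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in counts.most_common(k))
    return covered / total


tokens = "the cat sat on the mat the end".split()
coverage_top1 = top_k_coverage(tokens, k=1)  # "the" occurs 3 times out of 8
```

Applied to a real corpus with k = 2,500, the same ratio would give the kind of coverage statistic the abstract reports.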

Title: Attribution, Citation, and Quotation: A Survey of Evidence-based Text Generation with Large Language Models

Authors: Tobias Schreieder, Tim Schopf, Michael Färber
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.15396
Pdf URL: https://arxiv.org/pdf/2508.15396
Copy Paste: [[2508.15396]] Attribution, Citation, and Quotation: A Survey of Evidence-based Text Generation with Large Language Models(https://arxiv.org/abs/2508.15396)
Keywords: language model, llm
Abstract: The increasing adoption of large language models (LLMs) has been accompanied by growing concerns regarding their reliability and trustworthiness. As a result, a growing body of research focuses on evidence-based text generation with LLMs, aiming to link model outputs to supporting evidence to ensure traceability and verifiability. However, the field is fragmented due to inconsistent terminology, isolated evaluation practices, and a lack of unified benchmarks. To bridge this gap, we systematically analyze 134 papers, introduce a unified taxonomy of evidence-based text generation with LLMs, and investigate 300 evaluation metrics across seven key dimensions. Thereby, we focus on approaches that use citations, attribution, or quotations for evidence-based text generation. Building on this, we examine the distinctive characteristics and representative methods in the field. Finally, we highlight open challenges and outline promising directions for future work.

Title: When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models

Authors: Cheng Wang, Gelei Deng, Xianglin Yang, Han Qiu, Tianwei Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.15407
Pdf URL: https://arxiv.org/pdf/2508.15407
Copy Paste: [[2508.15407]] When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models(https://arxiv.org/abs/2508.15407)
Keywords: language model
Abstract: Large Audio-Language Models (LALMs) are enhanced with audio perception capabilities, enabling them to effectively process and understand multimodal inputs that combine audio and text. However, their performance in handling conflicting information between audio and text modalities remains largely unexamined. This paper introduces MCR-BENCH, the first comprehensive benchmark specifically designed to evaluate how LALMs prioritize information when presented with inconsistent audio-text pairs. Through extensive evaluation across diverse audio understanding tasks, we reveal a concerning phenomenon: when inconsistencies exist between modalities, LALMs display a significant bias toward textual input, frequently disregarding audio evidence. This tendency leads to substantial performance degradation in audio-centric tasks and raises important reliability concerns for real-world applications. We further investigate the influencing factors of text bias, and explore mitigation strategies through supervised finetuning, and analyze model confidence patterns that reveal persistent overconfidence even with contradictory inputs. These findings underscore the need for improved modality balance during training and more sophisticated fusion mechanisms to enhance the robustness when handling conflicting multi-modal inputs. The project is available at this https URL.

Title: LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model

Authors: Yirong Sun, Yizhong Geng, Peidong Wei, Yanjun Chen, Jinghan Yang, Rongfei Chen, Wei Zhang, Xiaoyu Shen
Subjects: cs.CL, cs.AI, cs.LG, cs.MM, cs.SD
Abstract URL: https://arxiv.org/abs/2508.15418
Pdf URL: https://arxiv.org/pdf/2508.15418
Copy Paste: [[2508.15418]] LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model(https://arxiv.org/abs/2508.15418)
Keywords: language model
Abstract: The development of Large Speech-Language Models (LSLMs) has been slowed by fragmented architectures and a lack of transparency, hindering the systematic comparison and reproducibility of research. Unlike in the vision-language domain, the LSLM field suffers from the common practice of releasing model weights without their corresponding training data and configurations. To address these critical gaps, we introduce LLaSO, the first fully open, end-to-end framework for large-scale speech-language modeling. LLaSO provides the community with three essential resources: (1) LLaSO-Align, a 12M-instance speech-text alignment corpus; (2) LLaSO-Instruct, a 13.5M-instance multi-task instruction-tuning dataset; and (3) LLaSO-Eval, a reproducible benchmark for standardized evaluation. To validate our framework, we build and release LLaSO-Base, a 3.8B-parameter reference model trained exclusively on our public data. It achieves a normalized score of 0.72, establishing a strong, reproducible baseline that surpasses comparable models. Our analysis reveals that while broader training coverage enhances performance, significant generalization gaps persist on unseen tasks, particularly in pure audio scenarios. By releasing the complete stack of data, benchmarks, and models, LLaSO establishes a foundational open standard to unify research efforts and accelerate community-driven progress in LSLMs. We release the code, dataset, pretrained models, and results in this https URL.

Title: A Study of Privacy-preserving Language Modeling Approaches

Authors: Pritilata Saha, Abhirup Sinha
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.15421
Pdf URL: https://arxiv.org/pdf/2508.15421
Copy Paste: [[2508.15421]] A Study of Privacy-preserving Language Modeling Approaches(https://arxiv.org/abs/2508.15421)
Keywords: language model
Abstract: Recent developments in language modeling have increased their use in various applications and domains. Language models, often trained on sensitive data, can memorize and disclose this information during privacy attacks, raising concerns about protecting individuals' privacy rights. Preserving privacy in language models has become a crucial area of research, as privacy is one of the fundamental human rights. Despite its significance, understanding of how much privacy risk these language models possess and how it can be mitigated is still limited. This research addresses this by providing a comprehensive study of the privacy-preserving language modeling approaches. This study gives an in-depth overview of these approaches, highlights their strengths, and investigates their limitations. The outcomes of this study contribute to the ongoing research on privacy-preserving language modeling, providing valuable insights and outlining future research directions.

Title: PyTOD: Programmable Task-Oriented Dialogue with Execution Feedback

Authors: Alexandru Coca, Bo-Hsiang Tseng, Pete Boothroyd, Jianpeng Cheng, Mark Gaynor, Zhenxing Zhang, Joe Stacey, Tristan Guigue, Héctor Martinez Alonso, Diarmuid Ó Séaghdha, Anders Johannsen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.15456
Pdf URL: https://arxiv.org/pdf/2508.15456
Copy Paste: [[2508.15456]] PyTOD: Programmable Task-Oriented Dialogue with Execution Feedback(https://arxiv.org/abs/2508.15456)
Keywords: language model, agent
Abstract: Programmable task-oriented dialogue (TOD) agents enable language models to follow structured dialogue policies, but their effectiveness hinges on accurate state tracking. We present PyTOD, an agent that generates executable code to track dialogue state and uses policy and execution feedback for efficient error correction. To this end, PyTOD employs a simple constrained decoding approach, using a language model instead of grammar rules to follow API schemata. This leads to state-of-the-art state tracking performance on the challenging SGD benchmark. Our experiments show that PyTOD surpasses strong baselines in both accuracy and robust user goal estimation as the dialogue progresses, demonstrating the effectiveness of execution-aware state tracking.

Title: RadReason: Radiology Report Evaluation Metric with Reasons and Sub-Scores

Authors: Yingshu Li, Yunyi Liu, Lingqiao Liu, Lei Wang, Luping Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.15464
Pdf URL: https://arxiv.org/pdf/2508.15464
Copy Paste: [[2508.15464]] RadReason: Radiology Report Evaluation Metric with Reasons and Sub-Scores(https://arxiv.org/abs/2508.15464)
Keywords: gpt, prompt
Abstract: Evaluating automatically generated radiology reports remains a fundamental challenge due to the lack of clinically grounded, interpretable, and fine-grained metrics. Existing methods either produce coarse overall scores or rely on opaque black-box models, limiting their usefulness in real-world clinical workflows. We introduce RadReason, a novel evaluation framework for radiology reports that not only outputs fine-grained sub-scores across six clinically defined error types, but also produces human-readable justifications that explain the rationale behind each score. Our method builds on Group Relative Policy Optimization and incorporates two key innovations: (1) Sub-score Dynamic Weighting, which adaptively prioritizes clinically challenging error types based on live F1 statistics; and (2) Majority-Guided Advantage Scaling, which adjusts policy gradient updates based on prompt difficulty derived from sub-score agreement. Together, these components enable more stable optimization and better alignment with expert clinical judgment. Experiments on the ReXVal benchmark show that RadReason surpasses all prior offline metrics and achieves parity with GPT-4-based evaluations, while remaining explainable, cost-efficient, and suitable for clinical deployment. Code will be released upon publication.

Title: SLM4Offer: Personalized Marketing Offer Generation Using Contrastive Learning Based Fine-Tuning

Authors: Vedasamhitha Challapalli, Konduru Venkat Sai, Piyush Pratap Singh, Rupesh Prasad, Arvind Maurya, Atul Singh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.15471
Pdf URL: https://arxiv.org/pdf/2508.15471
Copy Paste: [[2508.15471]] SLM4Offer: Personalized Marketing Offer Generation Using Contrastive Learning Based Fine-Tuning(https://arxiv.org/abs/2508.15471)
Keywords: language model
Abstract: Personalized marketing has emerged as a pivotal strategy for enhancing customer engagement and driving business growth. Academic and industry efforts have predominantly focused on recommendation systems and personalized advertisements. Nonetheless, this facet of personalization holds significant potential for increasing conversion rates and improving customer satisfaction. Prior studies suggest that well-executed personalization strategies can boost revenue by up to 40 percent, underscoring the strategic importance of developing intelligent, data-driven approaches for offer generation. This work introduces SLM4Offer, a generative AI model for personalized offer generation, developed by fine-tuning a pre-trained encoder-decoder language model, specifically Google's Text-to-Text Transfer Transformer (T5-Small 60M) using a contrastive learning approach. SLM4Offer employs InfoNCE (Information Noise-Contrastive Estimation) loss to align customer personas with relevant offers in a shared embedding space. A key innovation in SLM4Offer lies in the adaptive learning behaviour introduced by contrastive loss, which reshapes the latent space during training and enhances the model's generalizability. The model is fine-tuned and evaluated on a synthetic dataset designed to simulate customer behaviour and offer acceptance patterns. Experimental results demonstrate a 17 percent improvement in offer acceptance rate over a supervised fine-tuning baseline, highlighting the effectiveness of contrastive objectives in advancing personalized marketing.
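For readers unfamiliar with the InfoNCE objective named in this abstract, the following is a minimal sketch of the standard in-batch form of the loss, not the authors' implementation. It assumes that persona `i` is paired with offer `i`, that every other offer in the batch acts as a negative, and that similarity is cosine similarity scaled by a temperature. The function name, the temperature value, and the toy vectors are all hypothetical.

```python
import math


def info_nce_loss(persona_embs, offer_embs, temperature=0.1):
    """Mean in-batch InfoNCE loss.

    Assumption: persona_embs[i] and offer_embs[i] form the positive pair;
    all other offers in the batch are treated as negatives.
    Vectors must be non-zero (they are L2-normalised below).
    """
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    personas = [normalize(v) for v in persona_embs]
    offers = [normalize(v) for v in offer_embs]

    total = 0.0
    for i, persona in enumerate(personas):
        # Cosine similarity to every offer, scaled by the temperature.
        logits = [
            sum(a * b for a, b in zip(persona, offer)) / temperature
            for offer in offers
        ]
        # Numerically stable log-sum-exp over the whole batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of the matching offer.
        total += log_denominator - logits[i]
    return total / len(personas)


# Toy usage: correctly paired embeddings should score a lower loss
# than the same embeddings paired with the wrong offers.
personas = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce_loss(personas, [[1.0, 0.0], [0.0, 1.0]])
misaligned_loss = info_nce_loss(personas, [[0.0, 1.0], [1.0, 0.0]])
```

Minimising this quantity pulls each persona towards its own offer and pushes it away from the others, which is one way to read the abstract's "shared embedding space".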

Title: Subjective Behaviors and Preferences in LLM: Language of Browsing

Authors: Sai Sundaresan, Harshita Chopra, Atanu R. Sinha, Koustava Goswami, Nagasai Saketh Naidu, Raghav Karan, N Anushka
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.15474
Pdf URL: https://arxiv.org/pdf/2508.15474
Copy Paste: [[2508.15474]] Subjective Behaviors and Preferences in LLM: Language of Browsing(https://arxiv.org/abs/2508.15474)
Keywords: language model, llm
Abstract: A Large Language Model (LLM) offers versatility across domains and tasks, purportedly benefiting users with a wide variety of behaviors and preferences. We question this perception about an LLM when users have inherently subjective behaviors and preferences, as seen in their ubiquitous and idiosyncratic browsing of websites or apps. The sequential behavior logs of pages, thus generated, form something akin to each user's self-constructed "language", albeit without the structure and grammar imbued in natural languages. We ask: (i) Can a small LM represent the "language of browsing" better than a large LM? (ii) Can an LM with a single set of parameters (or, single LM) adequately capture myriad users' heterogeneous, subjective behaviors and preferences? (iii) Can a single LM with high average performance yield low variance in performance to make alignment good at user level? We introduce clusterwise LM training, HeTLM (Heterogeneity aware Training of Language Model), appropriate for subjective behaviors. We find that (i) a small LM trained using a page-level tokenizer outperforms large pretrained or finetuned LMs; (ii) HeTLM with a heterogeneous, cluster-specific set of parameters outperforms a single LM of the same family, controlling for the number of parameters; and (iii) a higher mean and a lower variance in generation ensue, implying improved alignment.

Title: Influence-driven Curriculum Learning for Pre-training on Limited Data

Authors: Loris Schoenegger, Lukas Thoma, Terra Blevins, Benjamin Roth
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.15475
Pdf URL: https://arxiv.org/pdf/2508.15475
Copy Paste: [[2508.15475]] Influence-driven Curriculum Learning for Pre-training on Limited Data(https://arxiv.org/abs/2508.15475)
Keywords: language model
Abstract: Curriculum learning, a training technique where data is presented to the model in order of example difficulty (e.g., from simpler to more complex documents), has shown limited success for pre-training language models. In this work, we investigate whether curriculum learning becomes competitive if we replace conventional human-centered difficulty metrics with one that more closely corresponds to example difficulty as observed during model training. Specifically, we experiment with sorting training examples by their training data influence, a score which estimates the effect of individual training examples on the model's output. Models trained on our curricula are able to outperform ones trained in random order by over 10 percentage points in benchmarks, confirming that curriculum learning is beneficial for language model pre-training, as long as a more model-centric notion of difficulty is adopted.
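The ordering step described in this abstract can be illustrated with a minimal sketch. It assumes the influence scores have already been computed by some separate estimator. The abstract does not say how the scores are estimated or in which direction the curriculum runs, so the function name, the `descending` flag, and the toy scores below are all hypothetical.

```python
def order_by_influence(examples, influence_scores, descending=False):
    """Return the examples sorted by a precomputed influence score.

    Assumptions: one score per example; whether low-influence or
    high-influence examples should come first is left to the caller.
    """
    if len(examples) != len(influence_scores):
        raise ValueError("each example needs exactly one influence score")
    # Sort indices rather than examples so ties keep their original order.
    order = sorted(
        range(len(examples)),
        key=lambda i: influence_scores[i],
        reverse=descending,
    )
    return [examples[i] for i in order]


# Toy usage with made-up scores.
documents = ["doc_a", "doc_b", "doc_c"]
scores = [0.3, 0.1, 0.2]
curriculum = order_by_influence(documents, scores)
```

A random-order baseline, as compared against in the abstract, would simply shuffle `documents` instead of sorting them.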

Title: SLM-Bench: A Comprehensive Benchmark of Small Language Models on Environmental Impacts -- Extended Version

Authors: Nghiem Thanh Pham, Tung Kieu, Duc-Manh Nguyen, Son Ha Xuan, Nghia Duong-Trung, Danh Le-Phuoc
Subjects: cs.CL, cs.CY, cs.PF
Abstract URL: https://arxiv.org/abs/2508.15478
Pdf URL: https://arxiv.org/pdf/2508.15478
Copy Paste: [[2508.15478]] SLM-Bench: A Comprehensive Benchmark of Small Language Models on Environmental Impacts -- Extended Version(https://arxiv.org/abs/2508.15478)
Keywords: language model
Abstract: Small Language Models (SLMs) offer computational efficiency and accessibility, yet a systematic evaluation of their performance and environmental impact remains lacking. We introduce SLM-Bench, the first benchmark specifically designed to assess SLMs across multiple dimensions, including accuracy, computational efficiency, and sustainability metrics. SLM-Bench evaluates 15 SLMs on 9 NLP tasks using 23 datasets spanning 14 domains. The evaluation is conducted on 4 hardware configurations, providing a rigorous comparison of their effectiveness. Unlike prior benchmarks, SLM-Bench quantifies 11 metrics across correctness, computation, and consumption, enabling a holistic assessment of efficiency trade-offs. Our evaluation considers controlled hardware conditions, ensuring fair comparisons across models. We develop an open-source benchmarking pipeline with standardized evaluation protocols to facilitate reproducibility and further research. Our findings highlight the diverse trade-offs among SLMs, where some models excel in accuracy while others achieve superior energy efficiency. SLM-Bench sets a new standard for SLM evaluation, bridging the gap between resource efficiency and real-world applicability.

Title: HebID: Detecting Social Identities in Hebrew-language Political Text

Authors: Guy Mor-Lan, Naama Rivlin-Angert, Yael R. Kaplan, Tamir Sheafer, Shaul R. Shenhav
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.15483
Pdf URL: https://arxiv.org/pdf/2508.15483
Copy Paste: [[2508.15483]] HebID: Detecting Social Identities in Hebrew-language Political Text(https://arxiv.org/abs/2508.15483)
Keywords: llm
Abstract: Political language is deeply intertwined with social identities. While social identities are often shaped by specific cultural contexts and expressed through particular uses of language, existing datasets for group and identity detection are predominantly English-centric, single-label and focus on coarse identity categories. We introduce HebID, the first multilabel Hebrew corpus for social identity detection: 5,536 sentences from Israeli politicians' Facebook posts (Dec 2018-Apr 2021), manually annotated for twelve nuanced social identities (e.g. Rightist, Ultra-Orthodox, Socially-oriented) grounded by survey data. We benchmark multilabel and single-label encoders alongside 2B-9B-parameter generative LLMs, finding that Hebrew-tuned LLMs provide the best results (macro-$F_1$ = 0.74). We apply our classifier to politicians' Facebook posts and parliamentary speeches, evaluating differences in popularity, temporal trends, clustering patterns, and gender-related variations in identity expression. We utilize identity choices from a national public survey, enabling a comparison between identities portrayed in elite discourse and the public's identity priorities. HebID provides a comprehensive foundation for studying social identities in Hebrew and can serve as a model for similar research in other non-English political contexts.

Title: Dream 7B: Diffusion Large Language Models

Authors: Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, Lingpeng Kong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.15487
Pdf URL: https://arxiv.org/pdf/2508.15487
Copy Paste: [[2508.15487]] Dream 7B: Diffusion Large Language Models(https://arxiv.org/abs/2508.15487)
Keywords: language model, llm
Abstract: We introduce Dream 7B, the most powerful open diffusion large language model to date. Unlike autoregressive (AR) models that generate tokens sequentially, Dream 7B employs discrete diffusion modeling to refine sequences in parallel through iterative denoising. Our model consistently outperforms existing diffusion language models on general, mathematical, and coding tasks. Dream 7B demonstrates superior planning abilities and inference flexibility, including arbitrary-order generation, infilling capabilities, and tunable quality-speed trade-offs. These results are achieved through simple yet effective training techniques, including AR-based LLM initialization and context-adaptive token-level noise rescheduling. We release both Dream-Base and Dream-Instruct to facilitate further research in diffusion-based language modeling.

Title: The Enemy from Within: A Study of Political Delegitimization Discourse in Israeli Political Speech

Authors: Naama Rivlin-Angert, Guy Mor-Lan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.15524
Pdf URL: https://arxiv.org/pdf/2508.15524
Copy Paste: [[2508.15524]] The Enemy from Within: A Study of Political Delegitimization Discourse in Israeli Political Speech(https://arxiv.org/abs/2508.15524)
Keywords: llm
Abstract: We present the first large-scale computational study of political delegitimization discourse (PDD), defined as symbolic attacks on the normative validity of political entities. We curate and manually annotate a novel Hebrew-language corpus of 10,410 sentences drawn from Knesset speeches (1993-2023), Facebook posts (2018-2021), and leading news outlets, of which 1,812 instances (17.4%) exhibit PDD and 642 carry additional annotations for intensity, incivility, target type, and affective framing. We introduce a two-stage classification pipeline combining finetuned encoder models and decoder LLMs. Our best model (DictaLM 2.0) attains an F$_1$ of 0.74 for binary PDD detection and a macro-F$_1$ of 0.67 for classification of delegitimization characteristics. Applying this classifier to longitudinal and cross-platform data, we see a marked rise in PDD over three decades, higher prevalence on social media versus parliamentary debate, greater use by male than female politicians, and stronger tendencies among right-leaning actors - with pronounced spikes during election campaigns and major political events. Our findings demonstrate the feasibility and value of automated PDD analysis for understanding democratic discourse.

Title: SafetyFlow: An Agent-Flow System for Automated LLM Safety Benchmarking

Authors: Xiangyang Zhu, Yuan Tian, Chunyi Li, Kaiwei Zhang, Wei Sun, Guangtao Zhai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.15526
Pdf URL: https://arxiv.org/pdf/2508.15526
Copy Paste: [[2508.15526]] SafetyFlow: An Agent-Flow System for Automated LLM Safety Benchmarking(https://arxiv.org/abs/2508.15526)
Keywords: language model, llm, agent
Abstract: The rapid proliferation of large language models (LLMs) has intensified the requirement for reliable safety evaluation to uncover model vulnerabilities. To this end, numerous LLM safety evaluation benchmarks are proposed. However, existing benchmarks generally rely on labor-intensive manual curation, which causes excessive time and resource consumption. They also exhibit significant redundancy and limited difficulty. To alleviate these problems, we introduce SafetyFlow, the first agent-flow system designed to automate the construction of LLM safety benchmarks. SafetyFlow can automatically build a comprehensive safety benchmark in only four days without any human intervention by orchestrating seven specialized agents, significantly reducing time and resource cost. Equipped with versatile tools, the agents of SafetyFlow ensure process and cost controllability while integrating human expertise into the automatic pipeline. The final constructed dataset, SafetyFlowBench, contains 23,446 queries with low redundancy and strong discriminative power. Our contribution includes the first fully automated benchmarking pipeline and a comprehensive safety benchmark. We evaluate the safety of 49 advanced LLMs on our dataset and conduct extensive experiments to validate our efficacy and efficiency.

Title: Trained Miniatures: Low cost, High Efficacy SLMs for Sales & Marketing

Authors: Ishaan Bhola, Mukunda NS, Sravanth Kurmala, Harsh Nandwani, Arihant Jain
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.15617
Pdf URL: https://arxiv.org/pdf/2508.15617
Copy Paste: [[2508.15617]] Trained Miniatures: Low cost, High Efficacy SLMs for Sales & Marketing(https://arxiv.org/abs/2508.15617)
Keywords: language model, llm
Abstract: Large language models (LLMs) excel in text generation; however, these creative elements require heavy computation and are accompanied by a steep cost. Especially for targeted applications such as sales and marketing outreach, these costs are far from feasible. This paper introduces the concept of "Trained Miniatures" - Small Language Models (SLMs) fine-tuned for specific, high-value applications, generating similar domain-specific responses for a fraction of the cost.

Title: SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models

Authors: Peng Ding, Wen Sun, Dailin Li, Wei Zou, Jiaming Wang, Jiajun Chen, Shujian Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.15648
Pdf URL: https://arxiv.org/pdf/2508.15648
Copy Paste: [[2508.15648]] SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models(https://arxiv.org/abs/2508.15648)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) excel at various natural language processing tasks but remain vulnerable to jailbreaking attacks that induce harmful content generation. In this paper, we reveal a critical safety inconsistency: LLMs can more effectively identify harmful requests as discriminators than defend against them as generators. This insight inspires us to explore aligning the model's inherent discrimination and generation capabilities. To this end, we propose SDGO (Self-Discrimination-Guided Optimization), a reinforcement learning framework that leverages the model's own discrimination capabilities as a reward signal to enhance generation safety through iterative self-improvement. Our method does not require any additional annotated data or external models during the training phase. Extensive experiments demonstrate that SDGO significantly improves model safety compared to both prompt-based and training-based baselines while maintaining helpfulness on general benchmarks. By aligning LLMs' discrimination and generation capabilities, SDGO brings robust performance against out-of-distribution (OOD) jailbreaking attacks. This alignment achieves tighter coupling between these two capabilities, enabling the model's generation capability to be further enhanced with only a small amount of discriminative samples. Our code and datasets are available at this https URL.

Title: Benchmarking Computer Science Survey Generation

Authors: Weihang Su, Anzhe Xie, Qingyao Ai, Jianming Long, Jiaxin Mao, Ziyi Ye, Yiqun Liu
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2508.15658
Pdf URL: https://arxiv.org/pdf/2508.15658
Copy Paste: [[2508.15658]] Benchmarking Computer Science Survey Generation(https://arxiv.org/abs/2508.15658)
Keywords: language model, llm
Abstract: Scientific survey articles play a vital role in summarizing research progress, yet their manual creation is becoming increasingly infeasible due to the rapid growth of academic literature. While large language models (LLMs) offer promising capabilities for automating this process, progress in this area is hindered by the absence of standardized benchmarks and evaluation protocols. To address this gap, we introduce SurGE (Survey Generation Evaluation), a new benchmark for evaluating scientific survey generation in the computer science domain. SurGE consists of (1) a collection of test instances, each including a topic description, an expert-written survey, and its full set of cited references, and (2) a large-scale academic corpus of over one million papers that serves as the retrieval pool. In addition, we propose an automated evaluation framework that measures generated surveys across four dimensions: information coverage, referencing accuracy, structural organization, and content quality. Our evaluation of diverse LLM-based approaches shows that survey generation remains highly challenging, even for advanced self-reflection frameworks. These findings highlight the complexity of the task and the necessity for continued research. We have open-sourced all the code, data, and models at: this https URL

Title: EcomMMMU: Strategic Utilization of Visuals for Robust Multimodal E-Commerce Models

Authors: Xinyi Ling, Hanwen Du, Zhihui Zhu, Xia Ning
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.15721
Pdf URL: https://arxiv.org/pdf/2508.15721
Copy Paste: [[2508.15721]] EcomMMMU: Strategic Utilization of Visuals for Robust Multimodal E-Commerce Models(https://arxiv.org/abs/2508.15721)
Keywords: language model, llm
Abstract: E-commerce platforms are rich in multimodal data, featuring a variety of images that depict product details. However, this raises an important question: do these images always enhance product understanding, or can they sometimes introduce redundancy or degrade performance? Existing datasets are limited in both scale and design, making it difficult to systematically examine this question. To this end, we introduce EcomMMMU, an e-commerce multimodal multitask understanding dataset with 406,190 samples and 8,989,510 images. EcomMMMU is comprised of multi-image visual-language data designed with 8 essential tasks and a specialized VSS subset to benchmark the capability of multimodal large language models (MLLMs) to effectively utilize visual content. Analysis on EcomMMMU reveals that product images do not consistently improve performance and can, in some cases, degrade it. This indicates that MLLMs may struggle to effectively leverage rich visual content for e-commerce tasks. Building on these insights, we propose SUMEI, a data-driven method that strategically utilizes multiple images via predicting visual utilities before using them for downstream tasks. Comprehensive experiments demonstrate the effectiveness and robustness of SUMEI. The data and code are available through this https URL.

Title: End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning

Authors: Qiaoyu Zheng, Yuze Sun, Chaoyi Wu, Weike Zhao, Pengcheng Qiu, Yongguo Yu, Kun Sun, Yanfeng Wang, Ya Zhang, Weidi Xie
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2508.15746
Pdf URL: https://arxiv.org/pdf/2508.15746
Copy Paste: [[2508.15746]] End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning(https://arxiv.org/abs/2508.15746)
Keywords: language model, gpt, llm, hallucination, prompt, agent
Abstract: Accurate diagnosis with medical large language models is hindered by knowledge gaps and hallucinations. Retrieval and tool-augmented methods help, but their impact is limited by weak use of external knowledge and poor feedback-reasoning traceability. To address these challenges, we introduce Deep-DxSearch, an agentic RAG system trained end-to-end with reinforcement learning (RL) that enables steerable, traceable retrieval-augmented reasoning for medical diagnosis. In Deep-DxSearch, we first construct a large-scale medical retrieval corpus comprising patient records and reliable medical knowledge sources to support retrieval-aware reasoning across diagnostic scenarios. More crucially, we frame the LLM as the core agent and the retrieval corpus as its environment, using tailored rewards on format, retrieval, reasoning structure, and diagnostic accuracy, thereby evolving the agentic RAG policy from large-scale data through RL. Experiments demonstrate that our end-to-end agentic RL training framework consistently outperforms prompt-engineering and training-free RAG approaches across multiple data centers. After training, Deep-DxSearch achieves substantial gains in diagnostic accuracy, surpassing strong diagnostic baselines such as GPT-4o, DeepSeek-R1, and other medical-specific frameworks for both common and rare disease diagnosis under in-distribution and out-of-distribution settings. Moreover, ablation studies on reward design and retrieval corpus components confirm their critical roles, underscoring the uniqueness and effectiveness of our approach compared with traditional implementations. Finally, case studies and interpretability analyses highlight improvements in Deep-DxSearch's diagnostic policy, providing deeper insight into its performance gains and supporting clinicians in delivering more reliable and precise preliminary diagnoses. See this https URL.

Title: Dissecting Tool-Integrated Reasoning: An Empirical Study and Analysis

Authors: Yufeng Zhao, Junnan Liu, Hongwei Liu, Dongsheng Zhu, Yuan Shen, Songyang Zhang, Kai Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.15754
Pdf URL: https://arxiv.org/pdf/2508.15754
Copy Paste: [[2508.15754]] Dissecting Tool-Integrated Reasoning: An Empirical Study and Analysis(https://arxiv.org/abs/2508.15754)
Keywords: language model, llm, chain-of-thought
Abstract: Large Language Models (LLMs) have made significant strides in reasoning tasks through methods like chain-of-thought (CoT) reasoning. However, they often fall short in tasks requiring precise computations. Tool-Integrated Reasoning (TIR) has emerged as a solution by incorporating external tools into the reasoning process. Nevertheless, how well TIR generalizes in improving the reasoning ability of LLMs is still unclear. Additionally, whether TIR improves the model's reasoning behavior and helps the model think remains to be studied. We introduce ReasonZoo, a comprehensive benchmark encompassing nine diverse reasoning categories, to evaluate the effectiveness of TIR across various domains. Additionally, we propose two novel metrics, Performance-Aware Cost (PAC) and Area Under the Performance-Cost Curve (AUC-PCC), to assess reasoning efficiency. Our empirical evaluation demonstrates that TIR-enabled models consistently outperform their non-TIR counterparts in both mathematical and non-mathematical tasks. Furthermore, TIR enhances reasoning efficiency, as evidenced by improved PAC and AUC-PCC, indicating reduced overthinking and more streamlined reasoning. These findings underscore the domain-general benefits of TIR and its potential to advance LLM capabilities in complex reasoning tasks.
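Summary (translated from Chinese): Large language models (LLMs) have made considerable progress on reasoning tasks through methods such as chain-of-thought (CoT) reasoning. However, they often fall short on tasks that require precise computation. Tool-Integrated Reasoning (TIR) has emerged as a solution by bringing external tools into the reasoning process. Still, how well TIR generalizes in improving the reasoning ability of LLMs is unclear, and whether TIR improves the model's reasoning behavior and helps the model think remains to be studied. We introduce ReasonZoo, a comprehensive benchmark covering nine diverse reasoning categories, to evaluate the effectiveness of TIR across domains. We also propose two new metrics, Performance-Aware Cost (PAC) and Area Under the Performance-Cost Curve (AUC-PCC), to assess reasoning efficiency. Our empirical evaluation shows that TIR-enabled models consistently outperform their non-TIR counterparts on both mathematical and non-mathematical tasks. TIR also improves reasoning efficiency, as shown by better PAC and AUC-PCC, which indicates less overthinking and more streamlined reasoning. These findings underscore the domain-general benefits of TIR and its potential to advance LLM capabilities on complex reasoning tasks.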
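
Illustrative sketch (not from the paper): the abstract names Area Under the Performance-Cost Curve (AUC-PCC) without defining it. As a generic illustration of what an area-under-curve metric over (cost, performance) measurements could look like, the snippet below integrates such points with the trapezoidal rule. The function name `area_under_curve`, the point format, and the choice of trapezoidal integration are assumptions; the paper's actual definition may differ.

```python
def area_under_curve(points: list[tuple[float, float]]) -> float:
    """Trapezoidal area under a curve given (cost, performance) points.

    Assumed setup: each point pairs a cost (e.g. tokens spent) with the
    performance reached at that cost. Points are sorted by cost first.
    """
    if len(points) < 2:
        raise ValueError("need at least two points to integrate")

    ordered = sorted(points)  # sort by cost, then performance
    area = 0.0
    for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]):
        # Area of the trapezoid between two neighbouring measurements.
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    # Hypothetical measurements: accuracy at increasing cost budgets.
    curve = [(0.0, 0.0), (1.0, 0.5), (2.0, 0.75)]
    print(area_under_curve(curve))
```

Under this reading, a model that reaches high performance at low cost accumulates more area over a fixed cost range than one that needs a large budget to get there.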

Title: LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries

Authors: Ming Yin, Dinghan Shen, Silei Xu, Jianbing Han, Sixun Dong, Mian Zhang, Yebowen Hu, Shujian Liu, Simin Ma, Song Wang, Sathish Reddy Indurthi, Xun Wang, Yiran Chen, Kaiqiang Song
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.15760
Pdf URL: https://arxiv.org/pdf/2508.15760
Copy Paste: [[2508.15760]] LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries(https://arxiv.org/abs/2508.15760)
Keywords: llm, agent
Abstract: Tool calling has emerged as a critical capability for AI agents to interact with the real world and solve complex tasks. While the Model Context Protocol (MCP) provides a powerful standardized framework for tool integration, there is a significant gap in benchmarking how effectively AI agents can solve multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. In this work, we present LiveMCP-101, a benchmark of 101 carefully curated real-world queries, refined through iterative LLM rewriting and manual review, that require coordinated use of multiple MCP tools including web search, file operations, mathematical reasoning, and data analysis. Moreover, we introduce a novel evaluation approach that leverages ground-truth execution plans rather than raw API outputs, better reflecting the evolving nature of real-world environments. Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting major challenges in tool orchestration. Detailed ablations and error analysis further reveal distinct failure modes and inefficiencies in token usage, pointing to concrete directions for advancing current models. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities, advancing toward autonomous AI systems that reliably execute complex tasks through tool use.
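Summary (translated from Chinese): Tool calling has become a key capability for AI agents to interact with the real world and solve complex tasks. Although the Model Context Protocol (MCP) provides a powerful standardized framework for tool integration, there is a significant gap in benchmarking how effectively AI agents can solve multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. In this work, we present LiveMCP-101, a benchmark of 101 carefully curated real-world queries, refined through iterative LLM rewriting and manual review, that require coordinated use of multiple MCP tools, including web search, file operations, mathematical reasoning, and data analysis. We also introduce a new evaluation approach that uses ground-truth execution plans rather than raw API outputs, which better reflects the evolving nature of real-world environments. Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting major challenges in tool orchestration. Detailed ablations and error analysis further reveal distinct failure modes and inefficiencies in token usage, pointing to concrete directions for improving current models. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities and moves toward autonomous AI systems that reliably execute complex tasks through tool use.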