2025-08-13

Title: Argument Quality Annotation and Gender Bias Detection in Financial Communication through Large Language Models

Authors: Alaa Alhamzeh, Mays Al Rebdawi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.08262
Pdf URL: https://arxiv.org/pdf/2508.08262
Copy Paste: [[2508.08262]] Argument Quality Annotation and Gender Bias Detection in Financial Communication through Large Language Models(https://arxiv.org/abs/2508.08262)
Keywords: language model, gpt, llm
Abstract: Financial arguments play a critical role in shaping investment decisions and public trust in financial institutions. Nevertheless, the assessment of their quality remains poorly studied in the literature. In this paper, we examine the capabilities of three state-of-the-art LLMs (GPT-4o, Llama 3.1, and Gemma 2) in annotating argument quality within financial communications, using the FinArgQuality dataset. Our contributions are twofold. First, we evaluate the consistency of LLM-generated annotations across multiple runs and benchmark them against human annotations. Second, we introduce an adversarial attack designed to inject gender bias, in order to analyse the models' responses and assess their fairness and robustness. Both experiments are conducted across three temperature settings to assess their influence on annotation stability and alignment with human labels. Our findings reveal that LLM-based annotations achieve higher inter-annotator agreement than their human counterparts, though the models still exhibit varying degrees of gender bias. We provide a multifaceted analysis of these outcomes and offer practical recommendations to guide future research toward more reliable, cost-effective, and bias-aware annotation methodologies.
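The agreement comparison described in this abstract can be illustrated with a chance-corrected agreement score. The sketch below computes Cohen's kappa between two annotation runs. The abstract does not say which agreement statistic the authors use, so the choice of kappa, the function name `cohen_kappa`, and the toy labels are all assumptions made for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both runs.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each run's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both runs used one identical label everywhere
    return (observed - expected) / (1.0 - expected)


# Hypothetical quality labels from two runs of the same model.
run_1 = [1, 1, 0, 0]
run_2 = [1, 0, 0, 0]
kappa = cohen_kappa(run_1, run_2)
print(f"kappa = {kappa:.2f}")
```

The same function could be applied to a model run and a human annotator to get the model-versus-human comparison the abstract mentions.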

Title: TurQUaz at CheckThat! 2025: Debating Large Language Models for Scientific Web Discourse Detection

Authors: Tarık Saraç, Selin Mergen, Mucahid Kutlu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.08265
Pdf URL: https://arxiv.org/pdf/2508.08265
Copy Paste: [[2508.08265]] TurQUaz at CheckThat! 2025: Debating Large Language Models for Scientific Web Discourse Detection(https://arxiv.org/abs/2508.08265)
Keywords: language model, llm
Abstract: In this paper, we present our work developed for the scientific web discourse detection task (Task 4a) of CheckThat! 2025. We propose a novel council debate method that simulates structured academic discussions among multiple large language models (LLMs) to identify whether a given tweet contains (i) a scientific claim, (ii) a reference to a scientific study, or (iii) mentions of scientific entities. We explore three debating methods: i) single debate, where two LLMs argue for opposing positions while a third acts as a judge; ii) team debate, in which multiple models collaborate within each side of the debate; and iii) council debate, where multiple expert models deliberate together to reach a consensus, moderated by a chairperson model. We choose council debate as our primary model as it outperforms others in the development test set. Although our proposed method did not rank highly for identifying scientific claims (8th out of 10) or mentions of scientific entities (9th out of 10), it ranked first in detecting references to scientific studies.

Title: Heartificial Intelligence: Exploring Empathy in Language Models

Authors: Victoria Williams, Benjamin Rosman
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2508.08271
Pdf URL: https://arxiv.org/pdf/2508.08271
Copy Paste: [[2508.08271]] Heartificial Intelligence: Exploring Empathy in Language Models(https://arxiv.org/abs/2508.08271)
Keywords: language model, llm
Abstract: Large language models have become increasingly common, used by millions of people worldwide in both professional and personal contexts. As these models continue to advance, they are frequently serving as virtual assistants and companions. In human interactions, effective communication typically involves two types of empathy: cognitive empathy (understanding others' thoughts and emotions) and affective empathy (emotionally sharing others' feelings). In this study, we investigated both cognitive and affective empathy across several small (SLMs) and large (LLMs) language models using standardized psychological tests. Our results revealed that LLMs consistently outperformed humans - including psychology students - on cognitive empathy tasks. However, despite their cognitive strengths, both small and large language models showed significantly lower affective empathy compared to human participants. These findings highlight rapid advancements in language models' ability to simulate cognitive empathy, suggesting strong potential for providing effective virtual companionship and personalized emotional support. Additionally, their high cognitive yet lower affective empathy allows objective and consistent emotional support without running the risk of emotional fatigue or bias.

Title: TT-XAI: Trustworthy Clinical Text Explanations via Keyword Distillation and LLM Reasoning

Authors: Kristian Miok, Blaz Škrlj, Daniela Zaharie, Marko Robnik Šikonja
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.08273
Pdf URL: https://arxiv.org/pdf/2508.08273
Copy Paste: [[2508.08273]] TT-XAI: Trustworthy Clinical Text Explanations via Keyword Distillation and LLM Reasoning(https://arxiv.org/abs/2508.08273)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Clinical language models often struggle to provide trustworthy predictions and explanations when applied to lengthy, unstructured electronic health records (EHRs). This work introduces TT-XAI, a lightweight and effective framework that improves both classification performance and interpretability through domain-aware keyword distillation and reasoning with large language models (LLMs). First, we demonstrate that distilling raw discharge notes into concise keyword representations significantly enhances BERT classifier performance and improves local explanation fidelity via a focused variant of LIME. Second, we generate chain-of-thought clinical explanations using keyword-guided prompts to steer LLMs, producing more concise and clinically relevant reasoning. We evaluate explanation quality using deletion-based fidelity metrics, self-assessment via LLaMA-3 scoring, and a blinded human study with domain experts. All evaluation modalities consistently favor the keyword-augmented method, confirming that distillation enhances both machine and human interpretability. TT-XAI offers a scalable pathway toward trustworthy, auditable AI in clinical decision support.
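A deletion-based fidelity metric, as named in this abstract, is commonly computed by removing the tokens an explanation ranks highest and measuring how much the model's score drops. The sketch below shows that general idea only. The toy scorer `toy_predict`, the attribution values, and the function name `deletion_fidelity` are hypothetical and are not taken from TT-XAI.

```python
def deletion_fidelity(predict, tokens, attributions, k):
    """Score drop after deleting the k tokens with the highest attribution."""
    ranked = sorted(range(len(tokens)), key=lambda i: attributions[i], reverse=True)
    removed = set(ranked[:k])
    kept = [t for i, t in enumerate(tokens) if i not in removed]
    # A faithful explanation should produce a large drop.
    return predict(tokens) - predict(kept)


def toy_predict(tokens):
    """Stand-in classifier (assumption): score grows with risk-term count."""
    risk_terms = {"sepsis", "hypotension"}
    hits = sum(t in risk_terms for t in tokens)
    return hits / (hits + 1)


note = ["patient", "sepsis", "stable", "hypotension"]
scores = [0.0, 0.6, 0.1, 0.3]  # hypothetical per-token attributions
drop = deletion_fidelity(toy_predict, note, scores, k=1)
print(f"score drop after deleting top-1 token: {drop:.3f}")
```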

Title: Distilling Knowledge from Large Language Models: A Concept Bottleneck Model for Hate and Counter Speech Recognition

Authors: Roberto Labadie-Tamayo, Djordje Slijepčević, Xihui Chen, Adrian Jaques Böck, Andreas Babic, Liz Freimann, Christiane Atzmüller, Matthias Zeppelzauer
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.08274
Pdf URL: https://arxiv.org/pdf/2508.08274
Copy Paste: [[2508.08274]] Distilling Knowledge from Large Language Models: A Concept Bottleneck Model for Hate and Counter Speech Recognition(https://arxiv.org/abs/2508.08274)
Keywords: language model, llm
Abstract: The rapid increase in hate speech on social media has had an unprecedented impact on society, making automated methods for detecting such content important. Unlike prior black-box models, we propose a novel transparent method for automated hate and counter speech recognition, i.e., "Speech Concept Bottleneck Model" (SCBM), using adjectives as human-interpretable bottleneck concepts. SCBM leverages large language models (LLMs) to map input texts to an abstract adjective-based representation, which is then sent to a light-weight classifier for downstream tasks. Across five benchmark datasets spanning multiple languages and platforms (e.g., Twitter, Reddit, YouTube), SCBM achieves an average macro-F1 score of 0.69, which outperforms the most recently reported results from the literature on four out of five datasets. Aside from high recognition accuracy, SCBM provides a high level of both local and global interpretability. Furthermore, fusing our adjective-based concept representation with transformer embeddings leads to a 1.8% performance increase on average across all datasets, showing that the proposed representation captures complementary information. Our results demonstrate that adjective-based concept representations can serve as compact, interpretable, and effective encodings for hate and counter speech recognition. With adapted adjectives, our method can also be applied to other NLP tasks.

Title: MLLM-CBench: A Comprehensive Benchmark for Continual Instruction Tuning of Multimodal LLMs with Chain-of-Thought Reasoning Analysis

Authors: Haiyun Guo, ZhiYan Hou, Yu Chen, Jinghan He, Yandu Sun, Yuzhe Zhou, Shujing Guo, Kuan Zhu, Jinqiao Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.08275
Pdf URL: https://arxiv.org/pdf/2508.08275
Copy Paste: [[2508.08275]] MLLM-CBench: A Comprehensive Benchmark for Continual Instruction Tuning of Multimodal LLMs with Chain-of-Thought Reasoning Analysis(https://arxiv.org/abs/2508.08275)
Keywords: language model, llm, chain-of-thought
Abstract: Multimodal Large Language Models (MLLMs) rely on continual instruction tuning to adapt to the evolving demands of real-world applications. However, progress in this area is hindered by the lack of rigorous and systematic benchmarks. To address this gap, we present MLLM-CTBench, a comprehensive evaluation benchmark with three key contributions: (1) Multidimensional Evaluation: We combine final answer accuracy with fine-grained CoT reasoning quality assessment, enabled by a specially trained CoT evaluator; (2) Comprehensive Evaluation of Algorithms and Training Paradigms: We benchmark eight continual learning algorithms across four major categories and systematically compare reinforcement learning with supervised fine-tuning paradigms; (3) Carefully Curated Tasks: We select and organize 16 datasets from existing work, covering six challenging domains. Our key findings include: (i) Models with stronger general capabilities exhibit greater robustness to forgetting during continual learning; (ii) Reasoning chains degrade more slowly than final answers, supporting the hierarchical forgetting hypothesis; (iii) The effectiveness of continual learning algorithms is highly dependent on both model capability and task order; (iv) In reinforcement learning settings, incorporating KL-divergence constraints helps maintain policy stability and plays a crucial role in mitigating forgetting. MLLM-CTBench establishes a rigorous standard for continual instruction tuning of MLLMs and offers practical guidance for algorithm design and evaluation.

Title: Evaluating Contrast Localizer for Identifying Causal Units in Social & Mathematical Tasks in Language Models

Authors: Yassine Jamaa, Badr AlKhamissi, Satrajit Ghosh, Martin Schrimpf
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.08276
Pdf URL: https://arxiv.org/pdf/2508.08276
Copy Paste: [[2508.08276]] Evaluating Contrast Localizer for Identifying Causal Units in Social & Mathematical Tasks in Language Models(https://arxiv.org/abs/2508.08276)
Keywords: language model, llm
Abstract: This work adapts a neuroscientific contrast localizer to pinpoint causally relevant units for Theory of Mind (ToM) and mathematical reasoning tasks in large language models (LLMs) and vision-language models (VLMs). Across 11 LLMs and 5 VLMs ranging in size from 3B to 90B parameters, we localize top-activated units using contrastive stimulus sets and assess their causal role via targeted ablations. We compare the effect of lesioning functionally selected units against low-activation and randomly selected units on downstream accuracy across established ToM and mathematical benchmarks. Contrary to expectations, low-activation units sometimes produced larger performance drops than the highly activated ones, and units derived from the mathematical localizer often impaired ToM performance more than those from the ToM localizer. These findings call into question the causal relevance of contrast-based localizers and highlight the need for broader stimulus sets that more accurately capture task-specific units.
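The localize-then-ablate procedure described in this abstract can be sketched in a few lines. The version below ranks units by the difference in mean activation between a target and a control stimulus set, then zeroes the selected units. This is a simplification made for illustration: the abstract does not specify the contrast statistic or the ablation method, and the names `localize_units` and `ablate` and the activation values are hypothetical.

```python
def localize_units(target_acts, control_acts, k):
    """Return indices of the k units with the largest target-minus-control contrast.

    Each argument is a list of activation vectors, one per stimulus.
    """
    n_units = len(target_acts[0])

    def mean(rows, j):
        return sum(row[j] for row in rows) / len(rows)

    contrast = [mean(target_acts, j) - mean(control_acts, j) for j in range(n_units)]
    ranked = sorted(range(n_units), key=lambda j: contrast[j], reverse=True)
    return ranked[:k]


def ablate(activations, unit_indices):
    """Zero out the selected units in one activation vector (a simple lesion)."""
    dead = set(unit_indices)
    return [0.0 if j in dead else a for j, a in enumerate(activations)]


# Toy activations for 3 units on 2 target and 2 control stimuli.
target = [[0.9, 0.1, 0.5], [0.8, 0.2, 0.4]]
control = [[0.1, 0.1, 0.5], [0.2, 0.3, 0.3]]
top_units = localize_units(target, control, k=1)
lesioned = ablate(target[0], top_units)
print(top_units, lesioned)
```

Comparing task accuracy after ablating `top_units` against accuracy after ablating low-contrast or random units mirrors the control comparison in the abstract.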

Title: Objective Metrics for Evaluating Large Language Models Using External Data Sources

Authors: Haoze Du, Richard Li, Edward Gehringer
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.08277
Pdf URL: https://arxiv.org/pdf/2508.08277
Copy Paste: [[2508.08277]] Objective Metrics for Evaluating Large Language Models Using External Data Sources(https://arxiv.org/abs/2508.08277)
Keywords: language model, llm
Abstract: Evaluating the performance of Large Language Models (LLMs) is a critical yet challenging task, particularly when aiming to avoid subjective assessments. This paper proposes a framework for leveraging objective metrics derived from class textual materials across different semesters to assess LLM outputs across various tasks. By utilizing well-defined benchmarks, factual datasets, and structured evaluation pipelines, the approach ensures consistent, reproducible, and bias-minimized measurements. The framework emphasizes automation and transparency in scoring, reducing reliance on human interpretation while ensuring alignment with real-world applications. This method addresses the limitations of subjective evaluation methods, providing a scalable solution for performance assessment in educational, scientific, and other high-stakes domains.

Title: MinionsLLM: a Task-adaptive Framework For The Training and Control of Multi-Agent Systems Through Natural Language

Authors: Andres Garcia Rincon, Eliseo Ferrante
Subjects: cs.CL, cs.AI, cs.LG, cs.MA, cs.RO
Abstract URL: https://arxiv.org/abs/2508.08283
Pdf URL: https://arxiv.org/pdf/2508.08283
Copy Paste: [[2508.08283]] MinionsLLM: a Task-adaptive Framework For The Training and Control of Multi-Agent Systems Through Natural Language(https://arxiv.org/abs/2508.08283)
Keywords: language model, llm, agent
Abstract: This paper presents MinionsLLM, a novel framework that integrates Large Language Models (LLMs) with Behavior Trees (BTs) and Formal Grammars to enable natural language control of multi-agent systems within arbitrary, user-defined environments. MinionsLLM provides standardized interfaces for defining environments, agents, and behavioral primitives, and introduces two synthetic dataset generation methods (Method A and Method B) to fine-tune LLMs for improved syntactic validity and semantic task relevance. We validate our approach using Google's Gemma 3 model family at three parameter scales (1B, 4B, and 12B) and demonstrate substantial gains: Method B increases syntactic validity to 92.6% and achieves a mean task performance improvement of 33% over baseline. Notably, our experiments show that smaller models benefit most from fine-tuning, suggesting promising directions for deploying compact, locally hosted LLMs in resource-constrained multi-agent control scenarios. The framework and all resources are released open-source to support reproducibility and future research.

Title: The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs

Authors: Denis Janiak, Jakub Binkowski, Albert Sawczyn, Bogdan Gabrys, Ravid Schwartz-Ziv, Tomasz Kajdanowicz
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.08285
Pdf URL: https://arxiv.org/pdf/2508.08285
Copy Paste: [[2508.08285]] The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs(https://arxiv.org/abs/2508.08285)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) have revolutionized natural language processing, yet their tendency to hallucinate poses serious challenges for reliable deployment. Despite numerous hallucination detection methods, their evaluations often rely on ROUGE, a metric based on lexical overlap that misaligns with human judgments. Through comprehensive human studies, we demonstrate that while ROUGE exhibits high recall, its extremely low precision leads to misleading performance estimates. In fact, several established detection methods show performance drops of up to 45.9% when assessed using human-aligned metrics like LLM-as-Judge. Moreover, our analysis reveals that simple heuristics based on response length can rival complex detection techniques, exposing a fundamental flaw in current evaluation practices. We argue that adopting semantically aware and robust evaluation frameworks is essential to accurately gauge the true performance of hallucination detection methods, ultimately ensuring the trustworthiness of LLM outputs.
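A small example shows why a lexical-overlap metric can misjudge correctness, which is the concern this abstract raises about ROUGE. The sketch below computes a simplified unigram-overlap F1 in the spirit of ROUGE-1. It is not the official ROUGE implementation (no stemming and only whitespace tokenization), and the function name and sentences are made up for illustration.

```python
from collections import Counter


def unigram_overlap_f1(reference, candidate):
    """Simplified ROUGE-1-style F1 over lowercased whitespace tokens."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "Paris is the capital of France"
contradiction = "Paris is not the capital of France"
score = unigram_overlap_f1(reference, contradiction)
print(f"overlap F1 = {score:.3f}")
```

The contradicting answer shares six of its seven tokens with the reference, so the overlap score is high even though the statement is false. A threshold on such a score would label it as correct.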

Title: Sacred or Synthetic? Evaluating LLM Reliability and Abstention for Religious Questions

Authors: Farah Atif, Nursultan Askarbekuly, Kareem Darwish, Monojit Choudhury
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2508.08287
Pdf URL: https://arxiv.org/pdf/2508.08287
Copy Paste: [[2508.08287]] Sacred or Synthetic? Evaluating LLM Reliability and Abstention for Religious Questions(https://arxiv.org/abs/2508.08287)
Keywords: language model, gpt, llm
Abstract: Despite the increasing usage of Large Language Models (LLMs) in answering questions in a variety of domains, their reliability and accuracy remain unexamined for a plethora of domains, including religious ones. In this paper, we introduce a novel benchmark, FiqhQA, focused on LLM-generated Islamic rulings explicitly categorized by the four major Sunni schools of thought, in both Arabic and English. Unlike prior work, which either overlooks the distinctions between religious schools of thought or fails to evaluate abstention behavior, we assess LLMs not only on their accuracy but also on their ability to recognize when not to answer. Our zero-shot and abstention experiments reveal significant variation across LLMs, languages, and legal schools of thought. While GPT-4o outperforms all other models in accuracy, Gemini and Fanar demonstrate superior abstention behavior, which is critical for minimizing confident incorrect answers. Notably, all models exhibit a performance drop in Arabic, highlighting the limitations in religious reasoning for languages other than English. To the best of our knowledge, this is the first study to benchmark the efficacy of LLMs for generating fine-grained rulings specific to each Islamic school of thought and to evaluate abstention for Islamic jurisprudence queries. Our findings underscore the need for task-specific evaluation and cautious deployment of LLMs in religious applications.

Title: Putnam-AXIOM: A Functional and Static Benchmark

Authors: Aryan Gulati, Brando Miranda, Eric Chen, Emily Xia, Kai Fronsdal, Bruno Dumont, Elyas Obbad, Sanmi Koyejo
Subjects: cs.CL, cs.AI, cs.LG, cs.LO, cs.NE
Abstract URL: https://arxiv.org/abs/2508.08292
Pdf URL: https://arxiv.org/pdf/2508.08292
Copy Paste: [[2508.08292]] Putnam-AXIOM: A Functional and Static Benchmark(https://arxiv.org/abs/2508.08292)
Keywords: language model, llm
Abstract: Current mathematical reasoning benchmarks for large language models (LLMs) are approaching saturation, with some achieving > 90% accuracy, and are increasingly compromised by training-set contamination. We introduce Putnam-AXIOM, a benchmark of 522 university-level competition problems drawn from the prestigious William Lowell Putnam Mathematical Competition, and Putnam-AXIOM Variation, an unseen companion set of 100 functional variants generated by programmatically perturbing variables and constants. The variation protocol produces an unlimited stream of equally difficult, unseen instances -- yielding a contamination-resilient test bed. On the Original set, OpenAI's o1-preview -- the strongest evaluated model -- scores 41.9%, but its accuracy drops by 19.6 percentage points (a 46.8% relative decrease) on the paired Variations. The remaining eighteen models show the same downward trend, ten of them with non-overlapping 95% confidence intervals. These gaps suggest memorization and highlight the necessity of dynamic benchmarks. We complement "boxed" accuracy with Teacher-Forced Accuracy (TFA), a lightweight metric that directly scores reasoning traces and automates natural language proof evaluations. Putnam-AXIOM therefore provides a rigorous, contamination-resilient evaluation framework for assessing advanced mathematical reasoning of LLMs. Data and evaluation code are publicly available at this https URL.
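The two drop figures quoted in this abstract measure the same gap in different units, and a short calculation makes the relationship explicit. The sketch below uses only the numbers stated in the abstract. The variable names are illustrative, and the implied Variation accuracy is derived here rather than quoted.

```python
original_acc = 41.9   # o1-preview accuracy on the Original set, in percent
absolute_drop = 19.6  # drop on the paired Variations, in percentage points

# Accuracy implied on the Variation set (derived, not stated in the abstract).
variation_acc = original_acc - absolute_drop

# Relative decrease: the absolute drop as a share of the original accuracy.
relative_drop = absolute_drop / original_acc * 100

print(f"variation accuracy = {variation_acc:.1f}%")
print(f"relative decrease  = {relative_drop:.1f}%")
```

The relative figure comes out at about 46.8%, matching the abstract, which confirms that 19.6 is an absolute difference in accuracy rather than a relative one.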

Title: CoDAE: Adapting Large Language Models for Education via Chain-of-Thought Data Augmentation

Authors: Shuzhou Yuan, William LaCroix, Hardik Ghoshal, Ercong Nie, Michael Färber
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.08386
Pdf URL: https://arxiv.org/pdf/2508.08386
Copy Paste: [[2508.08386]] CoDAE: Adapting Large Language Models for Education via Chain-of-Thought Data Augmentation(https://arxiv.org/abs/2508.08386)
Keywords: language model, gpt, llm, prompt, chat, chain-of-thought
Abstract: Large Language Models (LLMs) are increasingly employed as AI tutors due to their scalability and potential for personalized instruction. However, off-the-shelf LLMs often underperform in educational settings: they frequently reveal answers too readily, fail to adapt their responses to student uncertainty, and remain vulnerable to emotionally manipulative prompts. To address these challenges, we introduce CoDAE, a framework that adapts LLMs for educational use through Chain-of-Thought (CoT) data augmentation. We collect real-world dialogues between students and a ChatGPT-based tutor and enrich them using CoT prompting to promote step-by-step reasoning and pedagogically aligned guidance. Furthermore, we design targeted dialogue cases to explicitly mitigate three key limitations: over-compliance, low response adaptivity, and threat vulnerability. We fine-tune four open-source LLMs on different variants of the augmented datasets and evaluate them in simulated educational scenarios using both automatic metrics and LLM-as-a-judge assessments. Our results show that models fine-tuned with CoDAE deliver more pedagogically appropriate guidance, better support reasoning processes, and effectively resist premature answer disclosure.

Title: Mol-R1: Towards Explicit Long-CoT Reasoning in Molecule Discovery

Authors: Jiatong Li, Weida Wang, Qinggang Zhang, Junxian Li, Di Zhang, Changmeng Zheng, Shufei Zhang, Xiaoyong Wei, Qing Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.08401
Pdf URL: https://arxiv.org/pdf/2508.08401
Copy Paste: [[2508.08401]] Mol-R1: Towards Explicit Long-CoT Reasoning in Molecule Discovery(https://arxiv.org/abs/2508.08401)
Keywords: language model, llm, chain-of-thought
Abstract: Large language models (LLMs), especially Explicit Long Chain-of-Thought (CoT) reasoning models like DeepSeek-R1 and QWQ, have demonstrated powerful reasoning capabilities, achieving impressive performance in commonsense reasoning and mathematical inference. Despite their effectiveness, Long-CoT reasoning models are often criticized for their limited ability and low efficiency in knowledge-intensive domains such as molecule discovery. Success in this field requires a precise understanding of domain knowledge, including molecular structures and chemical principles, which is challenging due to the inherent complexity of molecular data and the scarcity of high-quality expert annotations. To bridge this gap, we introduce Mol-R1, a novel framework designed to improve explainability and reasoning performance of R1-like Explicit Long-CoT reasoning LLMs in text-based molecule generation. Our approach begins with a high-quality reasoning dataset curated through Prior Regulation via In-context Distillation (PRID), a dedicated distillation strategy to effectively generate paired reasoning traces guided by prior regulations. Building upon this, we introduce MoIA, Molecular Iterative Adaptation, a sophisticated training strategy that iteratively combines Supervised Fine-tuning (SFT) with Reinforced Policy Optimization (RPO), tailored to boost the reasoning performance of R1-like reasoning models for molecule discovery. Finally, we examine the performance of Mol-R1 in the text-based molecule reasoning generation task, showing superior performance against existing baselines.

Title: Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment

Authors: Saketh Reddy Vemula, Dipti Mishra Sharma, Parameswari Krishnamurthy
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.08424
Pdf URL: https://arxiv.org/pdf/2508.08424
Copy Paste: [[2508.08424]] Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment(https://arxiv.org/abs/2508.08424)
Keywords: language model
Abstract: Prior work on language modeling showed conflicting findings about whether morphologically aligned approaches to tokenization improve performance, particularly for languages with complex morphology. To investigate this, we select a typologically diverse set of languages: Telugu (agglutinative), Hindi (primarily fusional with some agglutination), and English (fusional). We conduct a comprehensive evaluation of language models -- starting from tokenizer training and extending through fine-tuning and downstream task evaluation. To account for the consistent performance differences observed across tokenizer variants, we focus on two key factors: morphological alignment and tokenization quality. To assess morphological alignment of tokenizers in Telugu, we create a dataset containing gold morpheme segmentations of 600 derivational and 7000 inflectional word forms. Our experiments reveal that better morphological alignment correlates positively -- though moderately -- with performance in syntax-based tasks such as Part-of-Speech tagging, Named Entity Recognition and Dependency Parsing. However, we also find that the tokenizer algorithm (Byte-pair Encoding vs. Unigram) plays a more significant role in influencing downstream performance than morphological alignment alone. Naive Unigram tokenizers outperform others across most settings, though hybrid tokenizers that incorporate morphological segmentation significantly improve performance within the BPE framework. In contrast, intrinsic metrics like Corpus Token Count (CTC) and Rényi entropy showed no correlation with downstream performance.
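The two intrinsic metrics named in this abstract can be computed directly from a tokenized corpus. The sketch below uses the standard definition of Rényi entropy, $H_\alpha = \frac{1}{1-\alpha}\log\sum_i p_i^\alpha$, over the unigram token distribution, and treats Corpus Token Count as the total number of tokens produced. The abstract does not give the authors' exact setup, so the whitespace tokenizer, the value of `alpha`, and the function name are assumptions.

```python
import math
from collections import Counter


def renyi_entropy(token_counts, alpha):
    """Rényi entropy (in nats) of the token distribution given by the counts."""
    total = sum(token_counts.values())
    probs = [c / total for c in token_counts.values() if c > 0]
    if math.isclose(alpha, 1.0):
        # The limit alpha -> 1 is Shannon entropy.
        return -sum(p * math.log(p) for p in probs)
    return math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)


# Toy corpus split on whitespace; a real tokenizer would go here.
tokens = "the cat sat on the mat".split()
counts = Counter(tokens)

corpus_token_count = sum(counts.values())
entropy = renyi_entropy(counts, alpha=2.5)
print(corpus_token_count, round(entropy, 3))
```

For a uniform distribution over n token types the result equals log(n) for every `alpha`, which is a convenient sanity check.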

Title: Enhancing Small LLM Alignment through Margin-Based Objective Modifications under Resource Constraints

Authors: Daren Yao, Jinsong Yuan, Ruike Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.08466
Pdf URL: https://arxiv.org/pdf/2508.08466
Copy Paste: [[2508.08466]] Enhancing Small LLM Alignment through Margin-Based Objective Modifications under Resource Constraints(https://arxiv.org/abs/2508.08466)
Keywords: language model, llm
Abstract: Small large language models (LLMs) often face difficulties in aligning output to human preferences, particularly when operating under severe performance gaps. In this work, we propose two lightweight DPO-based variants -- Adaptive Margin-Sigmoid Loss and APO-hinge-zero -- to better address underperformance scenarios by introducing margin-based objectives and selective update mechanisms. Our APO-hinge-zero method, which combines hinge-induced hard-example mining with the chosen-focused optimization of APO-zero, achieves strong results. In AlpacaEval, APO-hinge-zero improves the win rate by +2.0 points and the length-controlled win rate by +1.4 points compared to the APO-zero baseline. In MT-Bench, our methods maintain competitive performance in diverse categories, particularly excelling in STEM and Humanities tasks. These results demonstrate that simple modifications to preference-based objectives can significantly enhance small LLM alignment under resource constraints, offering a practical path toward more efficient deployment.
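Note (illustrative sketch, not the authors' objectives): the abstract above does not give the formulas for Adaptive Margin-Sigmoid Loss or APO-hinge-zero. The code below only contrasts the standard DPO sigmoid loss with a generic hinge loss on the same preference margin, to show why a hinge acts as a hard-example filter: pairs that already clear the target margin contribute zero loss and therefore no update. All names, the beta value, and the target margin are assumptions.

```python
import math

def preference_margin(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Scaled implicit-reward gap between the chosen and rejected responses.

    Inputs are sequence log-probabilities under the policy and the frozen
    reference model; beta=0.1 is an assumed temperature.
    """
    return beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))

def dpo_sigmoid_loss(margin):
    """Standard DPO loss, -log(sigmoid(margin)), in a numerically stable form."""
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def hinge_loss(margin, target=1.0):
    """Generic hinge: zero loss once the margin reaches the (assumed) target."""
    return max(0.0, target - margin)

# Hypothetical log-probabilities for an easy pair and a hard pair.
easy = preference_margin(-10.0, -30.0, -12.0, -20.0)
hard = preference_margin(-20.0, -15.0, -18.0, -16.0)
easy_losses = (dpo_sigmoid_loss(easy), hinge_loss(easy))
hard_losses = (dpo_sigmoid_loss(hard), hinge_loss(hard))
```

The sigmoid loss stays positive for the easy pair, whereas the hinge loss is exactly zero there, so only the hard pair would drive a gradient step under the hinge.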

Title: Momentum Point-Perplexity Mechanics in Large Language Models

Authors: Lorenzo Tomaz, Judd Rosenblatt, Thomas Berry Jones, Diogo Schwerz de Lucena
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.08492
Pdf URL: https://arxiv.org/pdf/2508.08492
Copy Paste: [[2508.08492]] Momentum Point-Perplexity Mechanics in Large Language Models(https://arxiv.org/abs/2508.08492)
Keywords: language model
Abstract: We take a physics-based approach to studying how the internal hidden states of large language models change from token to token during inference. Across 20 open-source transformer models (135M-3B parameters), we find that a quantity combining the rate of change in hidden states and the model's next-token certainty, analogous to energy in physics, remains nearly constant. Random-weight models conserve this "energy" more tightly than pre-trained ones, while training shifts models into a faster, more decisive regime with greater variability. Using this "log-Lagrangian" view, we derive a control method called Jacobian steering, which perturbs hidden states in the minimal way needed to favor a target token. This approach maintained near-constant energy in two tested models and produced continuations rated higher in semantic quality than the models' natural outputs. Viewing transformers through this mechanics lens offers a principled basis for interpretability, anomaly detection, and low-risk steering. This could help make powerful models more predictable and aligned with human intent.

Title: Steerable Pluralism: Pluralistic Alignment via Few-Shot Comparative Regression

Authors: Jadie Adams, Brian Hu, Emily Veenhuis, David Joy, Bharadwaj Ravichandran, Aaron Bray, Anthony Hoogs, Arslan Basharat
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.08509
Pdf URL: https://arxiv.org/pdf/2508.08509
Copy Paste: [[2508.08509]] Steerable Pluralism: Pluralistic Alignment via Few-Shot Comparative Regression(https://arxiv.org/abs/2508.08509)
Keywords: language model, llm
Abstract: Large language models (LLMs) are currently aligned using techniques such as reinforcement learning from human feedback (RLHF). However, these methods use scalar rewards that can only reflect user preferences on average. Pluralistic alignment instead seeks to capture diverse user preferences across a set of attributes, moving beyond just helpfulness and harmlessness. Toward this end, we propose a steerable pluralistic model based on few-shot comparative regression that can adapt to individual user preferences. Our approach leverages in-context learning and reasoning, grounded in a set of fine-grained attributes, to compare response options and make aligned choices. To evaluate our algorithm, we also propose two new steerable pluralistic benchmarks by adapting the Moral Integrity Corpus (MIC) and the HelpSteer2 datasets, demonstrating the applicability of our approach to value-aligned decision-making and reward modeling, respectively. Our few-shot comparative regression approach is interpretable and compatible with different attributes and LLMs, while outperforming multiple baseline and state-of-the-art methods. Our work provides new insights and research directions in pluralistic alignment, enabling a more fair and representative use of LLMs and advancing the state-of-the-art in ethical AI.

Title: DeCAL Tokenwise Compression

Authors: Sameer Panwar
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.08514
Pdf URL: https://arxiv.org/pdf/2508.08514
Copy Paste: [[2508.08514]] DeCAL Tokenwise Compression(https://arxiv.org/abs/2508.08514)
Keywords: language model
Abstract: This paper introduces DeCAL, a new method for tokenwise compression. DeCAL uses an encoder-decoder language model pretrained with denoising to learn to produce high-quality, general-purpose compressed representations by the encoder. DeCAL applies small modifications to the encoder, with the emphasis on maximizing compression quality, even at the expense of compute. We show that DeCAL at 2x compression can match uncompressed on many downstream tasks, with usually only minor dropoff in metrics up to 8x compression, among question-answering, summarization, and multi-vector retrieval tasks. DeCAL offers significant savings where pre-computed dense representations can be utilized, and we believe the approach can be further developed to be more broadly applicable.

Title: DepressLLM: Interpretable domain-adapted language model for depression detection from real-world narratives

Authors: Sehwan Moon, Aram Lee, Jeong Eun Kim, Hee-Ju Kang, Il-Seon Shin, Sung-Wan Kim, Jae-Min Kim, Min Jhon, Ju-Wan Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.08591
Pdf URL: https://arxiv.org/pdf/2508.08591
Copy Paste: [[2508.08591]] DepressLLM: Interpretable domain-adapted language model for depression detection from real-world narratives(https://arxiv.org/abs/2508.08591)
Keywords: language model, llm
Abstract: Advances in large language models (LLMs) have enabled a wide range of applications. However, depression prediction is hindered by the lack of large-scale, high-quality, and rigorously annotated datasets. This study introduces DepressLLM, trained and evaluated on a novel corpus of 3,699 autobiographical narratives reflecting both happiness and distress. DepressLLM provides interpretable depression predictions and, via its Score-guided Token Probability Summation (SToPS) module, delivers both improved classification performance and reliable confidence estimates, achieving an AUC of 0.789, which rises to 0.904 on samples with confidence $\geq$ 0.95. To validate its robustness to heterogeneous data, we evaluated DepressLLM on in-house datasets, including an Ecological Momentary Assessment (EMA) corpus of daily stress and mood recordings, and on public clinical interview data. Finally, a psychiatric review of high-confidence misclassifications highlighted key model and data limitations that suggest directions for future refinements. These findings demonstrate that interpretable AI can enable earlier diagnosis of depression and underscore the promise of medical AI in psychiatry.
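Note (illustrative sketch, not the SToPS module): the abstract above reports one AUC over all samples and a higher AUC on samples with confidence of at least 0.95. The code below shows how such a confidence-filtered AUC could be computed. The confidence definition max(p, 1 - p), the function names, and the six toy predictions are assumptions invented for this example; they are not data from the paper.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def high_confidence_subset(labels, scores, threshold=0.95):
    """Keep samples whose assumed confidence max(p, 1 - p) meets the threshold."""
    kept = [(y, s) for y, s in zip(labels, scores) if max(s, 1.0 - s) >= threshold]
    return [y for y, _ in kept], [s for _, s in kept]

# Made-up predictions: two uncertain samples (0.60, 0.70) are ranked wrongly.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.99, 0.02, 0.60, 0.70, 0.97, 0.01]
auc_all = roc_auc(labels, scores)
conf_labels, conf_scores = high_confidence_subset(labels, scores)
auc_confident = roc_auc(conf_labels, conf_scores)
```

Filtering by confidence trades coverage for accuracy: the metric improves only on the subset the model is willing to commit to.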

Title: Optimizing Retrieval-Augmented Generation (RAG) for Colloquial Cantonese: A LoRA-Based Systematic Review

Authors: David Santandreu Calonge (1), Linda Smail (2) ((1) Center for Teaching and Learning, Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates, (2) College of Interdisciplinary Studies, Zayed University, Dubai, United Arab Emirates)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.08610
Pdf URL: https://arxiv.org/pdf/2508.08610
Copy Paste: [[2508.08610]] Optimizing Retrieval-Augmented Generation (RAG) for Colloquial Cantonese: A LoRA-Based Systematic Review(https://arxiv.org/abs/2508.08610)
Keywords: retrieval-augmented generation
Abstract: This review examines recent advances in Parameter-Efficient Fine-Tuning (PEFT), with a focus on Low-Rank Adaptation (LoRA), to optimize Retrieval-Augmented Generation (RAG) systems like Qwen3, DeepSeek, and Kimi. These systems face challenges in understanding and generating authentic Cantonese colloquial expressions due to limited annotated data and linguistic variability. The review evaluates the integration of LoRA within RAG frameworks, benchmarks PEFT methods for retrieval and generation accuracy, identifies domain adaptation strategies under limited data, and compares fine-tuning techniques aimed at improving semantic fidelity under data-scarce conditions. A systematic analysis of recent studies employing diverse LoRA variants, synthetic data generation, user feedback integration, and adaptive parameter allocation was conducted to assess their impact on computational efficiency, retrieval precision, linguistic authenticity, and scalability. Findings reveal that dynamic and ensemble LoRA adaptations significantly reduce trainable parameters without sacrificing retrieval accuracy and generation quality in dialectal contexts. However, limitations remain in fully preserving fine-grained linguistic nuances, especially for low-resource settings like Cantonese. The integration of real-time user feedback and domain-specific data remains underdeveloped, limiting model adaptability and personalization. While selective parameter freezing and nonlinear adaptation methods offer better trade-offs between efficiency and accuracy, their robustness at scale remains an open challenge. This review highlights the promise of PEFT-enhanced RAG systems for domain-specific language tasks and calls for future work targeting dialectal authenticity, dynamic adaptation, and scalable fine-tuning pipelines.

Title: InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling

Authors: Peiji Li, Jiasheng Ye, Yongkang Chen, Yichuan Ma, Zijie Yu, Kedi Chen, Ganqu Cui, Haozhan Li, Jiacheng Chen, Chengqi Lyu, Wenwei Zhang, Linyang Li, Qipeng Guo, Dahua Lin, Bowen Zhou, Kai Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.08636
Pdf URL: https://arxiv.org/pdf/2508.08636
Copy Paste: [[2508.08636]] InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling(https://arxiv.org/abs/2508.08636)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) have revolutionized artificial intelligence by enabling complex reasoning capabilities. While recent advancements in reinforcement learning (RL) have primarily focused on domain-specific reasoning tasks (e.g., mathematics or code generation), real-world reasoning scenarios often require models to handle diverse and complex environments that narrow-domain benchmarks cannot fully capture. To address this gap, we present InternBootcamp, an open-source framework comprising 1000+ domain-diverse task environments specifically designed for LLM reasoning research. Our codebase offers two key functionalities: (1) automated generation of unlimited training/testing cases with configurable difficulty levels, and (2) integrated verification modules for objective response evaluation. These features make InternBootcamp fundamental infrastructure for RL-based model optimization, synthetic data generation, and model evaluation. Although manually developing such a framework with enormous task coverage is extremely cumbersome, we accelerate the development procedure through an automated agent workflow supplemented by manual validation protocols, which enables the task scope to expand rapidly. With these bootcamps, we further establish Bootcamp-EVAL, an automatically generated benchmark for comprehensive performance assessment. Evaluation reveals that frontier models still underperform in many reasoning tasks, while training with InternBootcamp provides an effective way to significantly improve performance, leading to our 32B model that achieves state-of-the-art results on Bootcamp-EVAL and excels on other established benchmarks. In particular, we validate that consistent performance gains come from including more training tasks, namely task scaling, over two orders of magnitude, offering a promising route towards a capable reasoning generalist.

Title: Quick on the Uptake: Eliciting Implicit Intents from Human Demonstrations for Personalized Mobile-Use Agents

Authors: Zheng Wu, Heyuan Huang, Yanjia Yang, Yuanyi Song, Xingyu Lou, Weiwen Liu, Weinan Zhang, Jun Wang, Zhuosheng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.08645
Pdf URL: https://arxiv.org/pdf/2508.08645
Copy Paste: [[2508.08645]] Quick on the Uptake: Eliciting Implicit Intents from Human Demonstrations for Personalized Mobile-Use Agents(https://arxiv.org/abs/2508.08645)
Keywords: language model, retrieval-augmented generation, agent
Abstract: As multimodal large language models advance rapidly, the automation of mobile tasks has become increasingly feasible through the use of mobile-use agents that mimic human interactions with a graphical user interface. To further enhance mobile-use agents, previous studies employ demonstration learning to improve mobile-use agents from human demonstrations. However, these methods focus solely on the explicit intention flows of humans (e.g., step sequences) while neglecting implicit intention flows (e.g., personal preferences), which makes it difficult to construct personalized mobile-use agents. In this work, to evaluate the Intention Alignment Rate between mobile-use agents and humans, we first collect MobileIAR, a dataset containing human-intent-aligned actions and ground-truth actions. This enables a comprehensive assessment of the agents' understanding of human intent. Then we propose IFRAgent, a framework built upon Intention Flow Recognition from human demonstrations. IFRAgent analyzes explicit intention flows from human demonstrations to construct a query-level vector library of standard operating procedures (SOP), and analyzes implicit intention flows to build a user-level habit repository. IFRAgent then leverages a SOP extractor combined with retrieval-augmented generation and a query rewriter to generate personalized query and SOP from a raw ambiguous query, enhancing the alignment between mobile-use agents and human intent. Experimental results demonstrate that IFRAgent outperforms baselines by an average of 6.79% (32.06% relative improvement) in human intention alignment rate and improves step completion rates by an average of 5.30% (26.34% relative improvement). The codes are available at this https URL.
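Note (illustrative arithmetic, not from the paper): the abstract above quotes each gain twice, once in absolute percentage points and once as a relative improvement. The helpers below show how the two figures relate. The function names are assumptions, and the implied baseline is only a back-of-envelope reading that would hold if both numbers referred to a single baseline score; the paper reports averages over several baselines, so the true per-baseline scores may differ.

```python
def relative_improvement(baseline, improved):
    """Relative gain in percent, given two scores on the same scale."""
    return 100.0 * (improved - baseline) / baseline

def implied_baseline(absolute_gain, relative_gain_pct):
    """Baseline score that would make an absolute gain equal a given relative gain."""
    return absolute_gain / (relative_gain_pct / 100.0)

# Sanity check on round numbers: 20 -> 25 is +5 points, i.e. +25% relative.
round_trip = implied_baseline(5.0, relative_improvement(20.0, 25.0))

# Rough reading of the quoted figures, under the single-baseline assumption.
iar_baseline_estimate = implied_baseline(6.79, 32.06)
step_baseline_estimate = implied_baseline(5.30, 26.34)
```

Under that assumption, both estimates land near 20 to 21 percent, which is why gains of only a few points correspond to relative improvements of a quarter to a third.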

Title: LLaMA-Based Models for Aspect-Based Sentiment Analysis

Authors: Jakub Šmíd, Pavel Přibáň, Pavel Král
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.08649
Pdf URL: https://arxiv.org/pdf/2508.08649
Copy Paste: [[2508.08649]] LLaMA-Based Models for Aspect-Based Sentiment Analysis(https://arxiv.org/abs/2508.08649)
Keywords: language model, llm
Abstract: While large language models (LLMs) show promise for various tasks, their performance in compound aspect-based sentiment analysis (ABSA) tasks lags behind fine-tuned models. However, the potential of LLMs fine-tuned for ABSA remains unexplored. This paper examines the capabilities of open-source LLMs fine-tuned for ABSA, focusing on LLaMA-based models. We evaluate the performance across four tasks and eight English datasets, finding that the fine-tuned Orca 2 model surpasses state-of-the-art results in all tasks. However, all models struggle in zero-shot and few-shot scenarios compared to fully fine-tuned ones. Additionally, we conduct error analysis to identify challenges faced by fine-tuned models.

Title: UWB at WASSA-2024 Shared Task 2: Cross-lingual Emotion Detection

Authors: Jakub Šmíd, Pavel Přibáň, Pavel Král
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.08650
Pdf URL: https://arxiv.org/pdf/2508.08650
Copy Paste: [[2508.08650]] UWB at WASSA-2024 Shared Task 2: Cross-lingual Emotion Detection(https://arxiv.org/abs/2508.08650)
Keywords: language model
Abstract: This paper presents our system built for the WASSA-2024 Cross-lingual Emotion Detection Shared Task. The task consists of two subtasks: first, to assess an emotion label from six possible classes for a given tweet in one of five languages, and second, to predict words triggering the detected emotions in binary and numerical formats. Our proposed approach revolves around fine-tuning quantized large language models, specifically Orca 2, with low-rank adapters (LoRA) and multilingual Transformer-based models, such as XLM-R and mT5. We enhance performance through machine translation for both subtasks and trigger word switching for the second subtask. The system achieves excellent performance, ranking 1st in numerical trigger words detection, 3rd in binary trigger words detection, and 7th in emotion detection.

Title: Prompt-Based Approach for Czech Sentiment Analysis

Authors: Jakub Šmíd, Pavel Přibáň
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.08651
Pdf URL: https://arxiv.org/pdf/2508.08651
Copy Paste: [[2508.08651]] Prompt-Based Approach for Czech Sentiment Analysis(https://arxiv.org/abs/2508.08651)
Keywords: prompt
Abstract: This paper introduces the first prompt-based methods for aspect-based sentiment analysis and sentiment classification in Czech. We employ the sequence-to-sequence models to solve the aspect-based tasks simultaneously and demonstrate the superiority of our prompt-based approach over traditional fine-tuning. In addition, we conduct zero-shot and few-shot learning experiments for sentiment classification and show that prompting yields significantly better results with limited training examples compared to traditional fine-tuning. We also demonstrate that pre-training on data from the target domain can lead to significant improvements in a zero-shot scenario.

Title: LLM driven Text-to-Table Generation through Sub-Tasks Guidance and Iterative Refinement

Authors: Rajmohan C, Sarthak Harne, Arvind Agarwal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.08653
Pdf URL: https://arxiv.org/pdf/2508.08653
Copy Paste: [[2508.08653]] LLM driven Text-to-Table Generation through Sub-Tasks Guidance and Iterative Refinement(https://arxiv.org/abs/2508.08653)
Keywords: language model, llm, prompt
Abstract: Transforming unstructured text into structured data is a complex task, requiring semantic understanding, reasoning, and structural comprehension. While Large Language Models (LLMs) offer potential, they often struggle with handling ambiguous or domain-specific data, maintaining table structure, managing long inputs, and addressing numerical reasoning. This paper proposes an efficient system for LLM-driven text-to-table generation that leverages novel prompting techniques. Specifically, the system incorporates two key strategies: breaking down the text-to-table task into manageable, guided sub-tasks and refining the generated tables through iterative self-feedback. We show that this custom task decomposition allows the model to address the problem in a stepwise manner and improves the quality of the generated table. Furthermore, we discuss the benefits and potential risks associated with iterative self-feedback on the generated tables while highlighting the trade-offs between enhanced performance and computational cost. Our methods achieve strong results compared to baselines on two complex text-to-table generation datasets available in the public domain.

Title: TopXGen: Topic-Diverse Parallel Data Generation for Low-Resource Machine Translation

Authors: Armel Zebaze, Benoît Sagot, Rachel Bawden
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.08680
Pdf URL: https://arxiv.org/pdf/2508.08680
Copy Paste: [[2508.08680]] TopXGen: Topic-Diverse Parallel Data Generation for Low-Resource Machine Translation(https://arxiv.org/abs/2508.08680)
Keywords: llm
Abstract: LLMs have been shown to perform well in machine translation (MT) with the use of in-context learning (ICL), rivaling supervised models when translating into high-resource languages (HRLs). However, they lag behind when translating into low-resource languages (LRLs). Example selection via similarity search and supervised fine-tuning help. However, the improvements they give are limited by the size, quality and diversity of existing parallel datasets. A common technique in low-resource MT is synthetic parallel data creation, the most frequent of which is backtranslation, whereby existing target-side texts are automatically translated into the source language. However, this assumes the existence of good quality and relevant target-side texts, which are not readily available for many LRLs. In this paper, we present TopXGen, an LLM-based approach for the generation of high quality and topic-diverse data in multiple LRLs, which can then be backtranslated to produce useful and diverse parallel texts for ICL and fine-tuning. Our intuition is that while LLMs struggle to translate into LRLs, their ability to translate well into HRLs and their multilinguality enable them to generate good quality, natural-sounding target-side texts, which can be translated well into a high-resource source language. We show that TopXGen boosts LLM translation performance during fine-tuning and in-context learning. Code and outputs are available at this https URL.

Title: Out of the Box, into the Clinic? Evaluating State-of-the-Art ASR for Clinical Applications for Older Adults

Authors: Bram van Dijk, Tiberon Kuiper, Sirin Aoulad si Ahmed, Armel Levebvre, Jake Johnson, Jan Duin, Simon Mooijaart, Marco Spruit
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2508.08684
Pdf URL: https://arxiv.org/pdf/2508.08684
Copy Paste: [[2508.08684]] Out of the Box, into the Clinic? Evaluating State-of-the-Art ASR for Clinical Applications for Older Adults(https://arxiv.org/abs/2508.08684)
Keywords: hallucination, chat
Abstract: Voice-controlled interfaces can support older adults in clinical contexts, with chatbots being a prime example, but reliable Automatic Speech Recognition (ASR) for underrepresented groups remains a bottleneck. This study evaluates state-of-the-art ASR models on language use of older Dutch adults, who interacted with the this http URL chatbot designed for geriatric contexts. We benchmark generic multilingual ASR models, and models fine-tuned for Dutch spoken by older adults, while also considering processing speed. Our results show that generic multilingual models outperform fine-tuned models, which suggests recent ASR models can generalise well out of the box to realistic datasets. Furthermore, our results suggest that truncating existing architectures is helpful in balancing the accuracy-speed trade-off, though we also identify some cases with high WER due to hallucinations.

Title: A Survey on Parallel Text Generation: From Parallel Decoding to Diffusion Language Models

Authors: Lingzhe Zhang, Liancheng Fang, Chiming Duan, Minghua He, Leyi Pan, Pei Xiao, Shiyu Huang, Yunpeng Zhai, Xuming Hu, Philip S. Yu, Aiwei Liu
Subjects: cs.CL, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2508.08712
Pdf URL: https://arxiv.org/pdf/2508.08712
Copy Paste: [[2508.08712]] A Survey on Parallel Text Generation: From Parallel Decoding to Diffusion Language Models(https://arxiv.org/abs/2508.08712)
Keywords: language model, llm
Abstract: As text generation has become a core capability of modern Large Language Models (LLMs), it underpins a wide range of downstream applications. However, most existing LLMs rely on autoregressive (AR) generation, producing one token at a time based on previously generated context, which results in limited generation speed due to the inherently sequential nature of the process. To address this challenge, an increasing number of researchers have begun exploring parallel text generation, a broad class of techniques aimed at breaking the token-by-token generation bottleneck and improving inference efficiency. Despite growing interest, there remains a lack of comprehensive analysis on what specific techniques constitute parallel text generation and how they improve inference performance. To bridge this gap, we present a systematic survey of parallel text generation methods. We categorize existing approaches into AR-based and Non-AR-based paradigms, and provide a detailed examination of the core techniques within each category. Following this taxonomy, we assess their theoretical trade-offs in terms of speed, quality, and efficiency, and examine their potential for combination and comparison with alternative acceleration strategies. Finally, based on our findings, we highlight recent advancements, identify open challenges, and outline promising directions for future research in parallel text generation.

Title: IROTE: Human-like Traits Elicitation of Large Language Model via In-Context Self-Reflective Optimization

Authors: Yuzhuo Bai, Shitong Duan, Muhua Huang, Jing Yao, Zhenghao Liu, Peng Zhang, Tun Lu, Xiaoyuan Yi, Maosong Sun, Xing Xie
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2508.08719
Pdf URL: https://arxiv.org/pdf/2508.08719
Copy Paste: [[2508.08719]] IROTE: Human-like Traits Elicitation of Large Language Model via In-Context Self-Reflective Optimization(https://arxiv.org/abs/2508.08719)
Keywords: language model, llm, prompt
Abstract: Trained on various human-authored corpora, Large Language Models (LLMs) have demonstrated a certain capability of reflecting specific human-like traits (e.g., personality or values) by prompting, benefiting applications like personalized LLMs and social simulations. However, existing methods suffer from the superficial elicitation problem: LLMs can only be steered to mimic shallow and unstable stylistic patterns, failing to embody the desired traits precisely and consistently across diverse tasks like humans. To address this challenge, we propose IROTE, a novel in-context method for stable and transferable trait elicitation. Drawing on psychological theories suggesting that traits are formed through identity-related reflection, our method automatically generates and optimizes a textual self-reflection within prompts, which comprises self-perceived experience, to stimulate LLMs' trait-driven behavior. The optimization is performed by iteratively maximizing an information-theoretic objective that enhances the connections between LLMs' behavior and the target trait, while reducing noisy redundancy in reflection without any fine-tuning, leading to evocative and compact trait reflection. Extensive experiments across three human trait systems manifest that one single IROTE-generated self-reflection can induce LLMs' stable impersonation of the target trait across diverse downstream tasks beyond simple questionnaire answering, consistently outperforming existing strong baselines.
摘要：经过大型语言模型（LLMS）接受了各种人类作者的培训，表现出了一定的能力，可以通过提示，从事个性化LLM和社交模拟等应用，反映特定的类似人类的特征（例如，个性或价值观）。但是，现有的方法遇到了肤浅的启发问题：LLM只能转向模仿浅层和不稳定的风格模式，未能精确，始终如一地在人类等各种任务上体现所需的特征。为了应对这一挑战，我们提出了irote，这是一种新型的在稳定且可转移特征启发的方法中。借助心理理论，表明特征是通过与身份相关的反射形成的，我们的方法会自动生成并优化提示中的文本自我反省，该提示包括自我感知的经验，以刺激LLMS的特质驱动的行为。通过迭代化最大化信息理论目标来进行优化，从而增强了LLMS行为与目标性状之间的联系，同时在不进行任何微调的情况下降低了反射中的嘈杂冗余，导致了令人回味和紧凑的性状反思。在三个人类特征系统上进行的广泛实验表明，一个单一的IROTE生成的自我反射可以引起LLMS稳定地对目标性状的稳定模仿，而除了简单的问卷回答以外，在各种下游任务中，始终超越了现有的强大底线。

Title: Magical: Medical Lay Language Generation via Semantic Invariance and Layperson-tailored Adaptation

Authors: Weibin Liao, Tianlong Wang, Yinghao Zhu, Yasha Wang, Junyi Gao, Liantao Ma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.08730
Pdf URL: https://arxiv.org/pdf/2508.08730
Copy Paste: [[2508.08730]] Magical: Medical Lay Language Generation via Semantic Invariance and Layperson-tailored Adaptation(https://arxiv.org/abs/2508.08730)
Keywords: language model, llm, prompt
Abstract: Medical Lay Language Generation (MLLG) plays a vital role in improving the accessibility of complex scientific content for broader audiences. Recent literature on MLLG commonly employs parameter-efficient fine-tuning methods such as Low-Rank Adaptation (LoRA) to fine-tune large language models (LLMs) using paired expert-lay language datasets. However, LoRA struggles with the challenges posed by multi-source heterogeneous MLLG datasets. Specifically, through a series of exploratory experiments, we reveal that standard LoRA fails to meet the requirement for semantic fidelity and diverse lay-style generation in the MLLG task. To address these limitations, we propose Magical, an asymmetric LoRA architecture tailored for MLLG under heterogeneous data scenarios. Magical employs a shared matrix $A$ for abstractive summarization, along with multiple isolated matrices $B$ for diverse lay-style generation. To preserve semantic fidelity during the lay language generation process, Magical introduces a Semantic Invariance Constraint to mitigate semantic subspace shifts on matrix $A$. Furthermore, to better adapt to diverse lay-style generation, Magical incorporates the Recommendation-guided Switch, an external interface that prompts the LLM to switch between different matrices $B$. Experimental results on three real-world lay language generation datasets demonstrate that Magical consistently outperforms prompt-based methods, vanilla LoRA, and its recent variants, while also reducing trainable parameters by 31.66%.
摘要：医学外行语言产生（MLLG）在改善更广泛受众的复杂科学内容的可及性方面起着至关重要的作用。 MLLG的最新文献通常采用参数有效的微调方法，例如低级适应（LORA），使用配对的专家lay语言数据集来微调大语言模型（LLMS）。但是，洛拉（Lora）在多源异质MLLG数据集面临的挑战上挣扎。具体而言，通过一系列探索性实验，我们揭示了标准洛拉无法满足MLLG任务中语义忠诚度和多样化的外行形式产生的要求。为了解决这些局限性，我们提出了魔术，这是一种在异质数据情景下针对MLLG量身定制的不对称LORA架构。 Magical使用共享的矩阵$ a $用于抽象摘要，以及多个隔离矩阵$ b $用于多样化的外行风格。为了在外行语言生成过程中保留语义保真度，Magical引入了语义不变性约束，以减轻矩阵$ a $的语义子空间变化。此外，为了更好地适应各种外行风格的一代，Magical结合了推荐引导开关，这是一种外部接口，以提示LLM在不同矩阵之间切换$ B $。三个现实世界语言生成数据集的实验结果表明，神奇的始终优于基于及时的方法，即Vanilla Lora及其最近的变体，同时还将可训练的参数降低了31.66％。
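
The shared-$A$, multiple-$B$ layout described in the Magical abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class name `AsymmetricLoRALinear`, the explicit `head` index (standing in for the Recommendation-guided Switch), and the plain-list matrices. The Semantic Invariance Constraint and training are not shown.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class AsymmetricLoRALinear:
    """Hypothetical layer: frozen weight W, one shared A, one B per lay style."""

    def __init__(self, weight: Matrix, shared_a: Matrix, heads_b: List[Matrix]):
        self.weight = weight      # frozen base weight, shape d_out x d_in
        self.shared_a = shared_a  # shared down-projection A, shape r x d_in
        self.heads_b = heads_b    # isolated up-projections B_k, each d_out x r

    def forward(self, x: Vector, head: int) -> Vector:
        base = matvec(self.weight, x)                 # W x
        low_rank = matvec(self.shared_a, x)           # A x, shared by all styles
        delta = matvec(self.heads_b[head], low_rank)  # B_k (A x), style-specific
        return [b + d for b, d in zip(base, delta)]


layer = AsymmetricLoRALinear(
    weight=[[1.0, 0.0], [0.0, 1.0]],
    shared_a=[[1.0, 1.0]],
    heads_b=[[[1.0], [0.0]], [[0.0], [2.0]]],
)
out_style_0 = layer.forward([1.0, 2.0], head=0)
out_style_1 = layer.forward([1.0, 2.0], head=1)
```

Only the `head` argument changes between the two calls, so both styles reuse the same down-projection and differ only in the up-projection.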

Title: SciRerankBench: Benchmarking Rerankers Towards Scientific Retrieval-Augmented Generated LLMs

Authors: Haotian Chen, Qingqing Long, Meng Xiao, Xiao Luo, Wei Ju, Chengrui Wang, Xuezhi Wang, Yuanchun Zhou, Hengshu Zhu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.08742
Pdf URL: https://arxiv.org/pdf/2508.08742
Copy Paste: [[2508.08742]] SciRerankBench: Benchmarking Rerankers Towards Scientific Retrieval-Augmented Generated LLMs(https://arxiv.org/abs/2508.08742)
Keywords: language model, llm
Abstract: Scientific literature question answering is a pivotal step towards new scientific discoveries. Recently, two-stage retrieval-augmented generated large language models (RAG-LLMs) have shown impressive advancements in this domain. Such a two-stage framework, especially the second stage (reranker), is particularly essential in the scientific domain, where subtle differences in terminology may have a greatly negative impact on the final factual-oriented or knowledge-intensive answers. Despite this significant progress, the potential and limitations of these works remain unexplored. In this work, we present a Scientific Rerank-oriented RAG Benchmark (SciRerankBench), for evaluating rerankers within RAG-LLM systems, spanning five scientific subjects. To rigorously assess the reranker performance in terms of noise resilience, relevance disambiguation, and factual consistency, we develop three types of question-context-answer (Q-C-A) pairs, i.e., Noisy Contexts (NC), Semantically Similar but Logically Irrelevant Contexts (SSLI), and Counterfactual Contexts (CC). Through systematic evaluation of 13 widely used rerankers on five families of LLMs, we provide detailed insights into their relative strengths and limitations. To the best of our knowledge, SciRerankBench is the first benchmark specifically developed to evaluate rerankers within RAG-LLMs, which provides valuable observations and guidance for their future development.
摘要：科学文献问题回答是朝着新的科学发现迈出的关键一步。最近，\ textIt {两阶段}检索生成的大语言模型（rag-llms）在该域中显示出令人印象深刻的进步。这样的两阶段框架，尤其是第二阶段（Reranker），在科学领域尤其重要，在科学领域中，术语的细微差异可能会对最终的面向事实的或知识密集的答案产生很大的负面影响。尽管取得了重大进展，但这些作品的潜力和局限性仍未开发。在这项工作中，我们提出了一个面向科学的rerank的抹布基准（Scirerankbench），用于评估rag-llms系统中的rerankers，涵盖五个科学主题。为了严格评估噪声弹性，相关性歧义和事实一致性的重读性能，我们开发了三种类型的问题，即文本 - 杂感（q-c-a）对（即嘈杂的上下文（NC）（NC），语义上相似，但在逻辑上相似但在逻辑上相似，但在逻辑上相似，但在逻辑上无关紧要的环境（SSLI）和反复感（CC）（CC）（CC）。通过对五个LLM家族的13个广泛使用的读者的系统评估，我们对它们的相对优势和局限性提供了详细的见解。据我们所知，Scirerankbench是第一个专门为评估Rag-llms中的Rerankers开发的基准，该基准为其未来发展提供了宝贵的观察和指导。

Title: DevNous: An LLM-Based Multi-Agent System for Grounding IT Project Management in Unstructured Conversation

Authors: Stavros Doropoulos (1), Stavros Vologiannidis (1), Ioannis Magnisalis (2) ((1) Department of Computer, Informatics and Telecommunications Engineering, International Hellenic University, (2) DG Informatics, European Commission, Brussels, Belgium)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.08761
Pdf URL: https://arxiv.org/pdf/2508.08761
Copy Paste: [[2508.08761]] DevNous: An LLM-Based Multi-Agent System for Grounding IT Project Management in Unstructured Conversation(https://arxiv.org/abs/2508.08761)
Keywords: language model, llm, chat, agent
Abstract: The manual translation of unstructured team dialogue into the structured artifacts required for Information Technology (IT) project governance is a critical bottleneck in modern information systems management. We introduce DevNous, a Large Language Model (LLM)-based multi-agent expert system, to automate this unstructured-to-structured translation process. DevNous integrates directly into team chat environments, identifying actionable intents from informal dialogue and managing stateful, multi-turn workflows for core administrative tasks like automated task formalization and progress summary synthesis. To quantitatively evaluate the system, we introduce a new benchmark of 160 realistic, interactive conversational turns. The dataset was manually annotated with a multi-label ground truth and is publicly available. On this benchmark, DevNous achieves an exact match turn accuracy of 81.3% and a multiset F1-Score of 0.845, providing strong evidence for its viability. The primary contributions of this work are twofold: (1) a validated architectural pattern for developing ambient administrative agents, and (2) the introduction of the first robust empirical baseline and public benchmark dataset for this challenging problem domain.
摘要：非结构化团队对话到信息技术（IT）项目治理所需的结构化工件的手动翻译是现代信息系统管理中的关键瓶颈。我们介绍了Devnous，这是一种基于语言的大型模型（LLM）多代理专家系统，以自动化此非结构化的翻译过程。 Devnous直接集成到团队聊天环境中，从非正式的对话中确定可行的意图，并为核心管理任务（例如自动任务形式化和进度摘要综合）管理状态的，多转的工作流程。为了定量评估该系统，我们引入了160个现实，交互式对话转弯的新基准。该数据集用多标签地面真相手动注释，并可以公开使用。在此基准上，Devnous达到了81.3 \％的确切匹配转弯精度，多组分F1得分为0.845，为其可行性提供了有力的证据。这项工作的主要贡献是双重的：（1）用于开发环境行政代理的经过验证的架构模式，以及（2）引入第一个强大的经验基线和公共基准数据集，以实现这个具有挑战性的问题域。
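
The two metrics named in the DevNous abstract can be sketched as follows. The abstract does not give their exact definitions, so this assumes that exact match compares the full label multiset of each turn and that the multiset F1 is micro-averaged over all turns. The label names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def exact_match_accuracy(gold: Sequence[List[str]], pred: Sequence[List[str]]) -> float:
    """Fraction of turns whose predicted label multiset equals the gold one."""
    matches = sum(Counter(g) == Counter(p) for g, p in zip(gold, pred))
    return matches / len(gold)


def multiset_f1(gold: Sequence[List[str]], pred: Sequence[List[str]]) -> float:
    """Micro-averaged F1 where overlap is the multiset intersection per turn."""
    overlap = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        counts_gold, counts_pred = Counter(g), Counter(p)
        overlap += sum((counts_gold & counts_pred).values())
        n_gold += sum(counts_gold.values())
        n_pred += sum(counts_pred.values())
    if overlap == 0:
        return 0.0
    precision = overlap / n_pred
    recall = overlap / n_gold
    return 2 * precision * recall / (precision + recall)


# Hypothetical intent labels for three conversational turns.
gold_turns = [["create_task"], ["update_task", "summarize"], []]
pred_turns = [["create_task"], ["update_task"], []]

accuracy = exact_match_accuracy(gold_turns, pred_turns)
f1 = multiset_f1(gold_turns, pred_turns)
```

The second turn is only partly correct, so it counts as a miss for exact match but still contributes one overlapping label to the F1.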

Title: Privacy-protected Retrieval-Augmented Generation for Knowledge Graph Question Answering

Authors: Yunfeng Ning, Mayi Xu, Jintao Wen, Qiankun Pi, Yuanyuan Zhu, Ming Zhong, Jiawei Jiang, Tieyun Qian
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.08785
Pdf URL: https://arxiv.org/pdf/2508.08785
Copy Paste: [[2508.08785]] Privacy-protected Retrieval-Augmented Generation for Knowledge Graph Question Answering(https://arxiv.org/abs/2508.08785)
Keywords: llm, hallucination, retrieval-augmented generation
Abstract: LLMs often suffer from hallucinations and outdated or incomplete knowledge. RAG is proposed to address these issues by integrating external knowledge like that in KGs into LLMs. However, leveraging private KGs in RAG systems poses significant privacy risks due to the black-box nature of LLMs and potential insecure data transmission, especially when using third-party LLM APIs lacking transparency and control. In this paper, we investigate the privacy-protected RAG scenario for the first time, where entities in KGs are anonymous for LLMs, thus preventing them from accessing entity semantics. Due to the loss of semantics of entities, previous RAG systems cannot retrieve question-relevant knowledge from KGs by matching questions with the meaningless identifiers of anonymous entities. To realize an effective RAG system in this scenario, two key challenges must be addressed: (1) How can anonymous entities be converted into retrievable information? (2) How can question-relevant anonymous entities be retrieved? Hence, we propose a novel ARoG framework including relation-centric abstraction and structure-oriented abstraction strategies. For challenge (1), the first strategy abstracts entities into high-level concepts by dynamically capturing the semantics of their adjacent relations. It supplements meaningful semantics which can further support the retrieval process. For challenge (2), the second strategy transforms unstructured natural language questions into structured abstract concept paths. These paths can be more effectively aligned with the abstracted concepts in KGs, thereby improving retrieval performance. To guide LLMs to effectively retrieve knowledge from KGs, the two strategies strictly protect privacy from being exposed to LLMs. Experiments on three datasets demonstrate that ARoG achieves strong performance and privacy-robustness.
摘要：LLMS经常患有幻觉，过时或不完整的知识。提议通过将类似于公斤的外部知识整合到LLM中，以解决这些问题。但是，由于LLM的黑箱性质和潜在的不安全数据传输，尤其是在使用缺乏透明度和控制的第三方LLM API时，利用抹布系统中的私人KG会带来很大的隐私风险。在本文中，我们首次研究了受隐私保护的抹布场景，其中kgs中的实体是LLMS匿名的，从而阻止了它们访问实体语义。由于实体语义的丧失，以前的抹布系统无法通过将问题与匿名实体的毫无意义的标识符匹配来从KGS中获取与问题相关的知识。为了在这种情况下实现有效的抹布系统，必须解决两个关键挑战：（1）如何将匿名实体转换为可检索的信息。（2）如何检索与问题相关的匿名实体。因此，我们提出了一个新颖的AROG框架，包括以关系为中心的抽象和面向结构的抽象策略。对于挑战（1），第一个策略通过动态捕获其相邻关系的语义来将实体摘要到高级概念中。它补充有意义的语义，可以进一步支持检索过程。对于挑战（2），第二种策略将非结构化的自然语言问题转化为结构化的抽象概念路径。这些路径可以更有效地与KG中的抽象概念保持一致，从而改善了检索性能。为了指导LLM有效地从KGS中检索知识，这两种策略严格保护隐私免于暴露于LLMS。三个数据集的实验表明，AROG实现了强大的性能和隐私性。

Title: Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments

Authors: Junjie Ye, Changhao Jiang, Zhengyin Du, Yufei Xu, Xuesong Yao, Zhiheng Xi, Xiaoran Fan, Qi Zhang, Xuanjing Huang, Jiecao Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.08791
Pdf URL: https://arxiv.org/pdf/2508.08791
Copy Paste: [[2508.08791]] Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments(https://arxiv.org/abs/2508.08791)
Keywords: language model, llm
Abstract: Effective tool use is essential for large language models (LLMs) to interact meaningfully with their environment. However, progress is limited by the lack of efficient reinforcement learning (RL) frameworks specifically designed for tool use, due to challenges in constructing stable training environments and designing verifiable reward mechanisms. To address this, we propose an automated environment construction pipeline, incorporating scenario decomposition, document generation, function integration, complexity scaling, and localized deployment. This enables the creation of high-quality training environments that provide detailed and measurable feedback without relying on external tools. Additionally, we introduce a verifiable reward mechanism that evaluates both the precision of tool use and the completeness of task execution. When combined with trajectory data collected from the constructed environments, this mechanism integrates seamlessly with standard RL algorithms to facilitate feedback-driven model training. Experiments on LLMs of varying scales demonstrate that our approach significantly enhances the models' tool-use performance without degrading their general capabilities, regardless of inference modes or training algorithms. Our analysis suggests that these gains result from improved context understanding and reasoning, driven by updates to the lower-layer MLP parameters in models.
摘要：有效的工具使用对于大型语言模型（LLM）至关重要，以与环境有意义地互动。但是，由于缺乏专门为工具使用设计的有效加固学习（RL）框架，进展受到限制，这是因为在构建稳定的训练环境和设计可验证的奖励机制方面面临着挑战。为了解决这个问题，我们提出了一条自动化的环境施工管道，结合了方案分解，文档生成，功能集成，复杂性扩展和局部部署。这使创建高质量的培训环境可以提供详细且可衡量的反馈，而无需依靠外部工具。此外，我们引入了一种可验证的奖励机制，该机制既评估工具使用的精度和任务执行的完整性。当与从构造环境收集的轨迹数据结合使用时，该机制将与标准RL算法无缝集成，以促进反馈驱动的模型训练。在不同量表的LLM上进行的实验表明，无论推论模式或训练算法如何，我们的方法都可以显着增强模型的工具使用性能而不会降低其一般能力。我们的分析表明，这些收益是由改进的上下文理解和推理所带来的，这是由模型中低层MLP参数的更新驱动的。

Title: TiMoE: Time-Aware Mixture of Language Experts

Authors: Robin Faro, Dongyang Fan, Tamar Alphaidze, Martin Jaggi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.08827
Pdf URL: https://arxiv.org/pdf/2508.08827
Copy Paste: [[2508.08827]] TiMoE: Time-Aware Mixture of Language Experts(https://arxiv.org/abs/2508.08827)
Keywords: language model, gpt, llm, hallucination
Abstract: Large language models (LLMs) are typically trained on fixed snapshots of the web, which means that their knowledge becomes stale and their predictions risk temporal leakage: relying on information that lies in the future relative to a query. We tackle this problem by pre-training from scratch a set of GPT-style experts on disjoint two-year slices of a 2013-2024 corpus and combining them through TiMoE, a Time-aware Mixture of Language Experts. At inference time, TiMoE masks all experts whose training window ends after the query timestamp and merges the remaining log-probabilities in a shared space, guaranteeing strict causal validity while retaining the breadth of multi-period knowledge. We also release TSQA, a 10k-question benchmark whose alternatives are explicitly labelled as past, future or irrelevant, allowing fine-grained measurement of temporal hallucinations. Experiments on eight standard NLP tasks plus TSQA show that a co-adapted TiMoE variant matches or exceeds the best single-period expert and cuts future-knowledge errors by up to 15%. Our results demonstrate that modular, time-segmented pre-training paired with causal routing is a simple yet effective path toward LLMs that stay chronologically grounded without sacrificing much general performance. We open source our code at TiMoE (GitHub): this https URL
摘要：大型语言模型（LLM）通常在网络的固定快照上进行培训，这意味着他们的知识变为陈旧，预测有风险的时间泄漏：依靠将来相对于查询的信息。我们通过从头开始预训练的一组GPT风格的专家在2013-2024语料库的两年切片中进行预训练，并通过Timoe结合它们，这是语言专家的时间感知混合物。在推理时，Timoe掩盖了所有训练窗口在查询时间戳之后结束的专家，并在共享空间中融合了剩余的对数概率，从而确保了严格的因果有效性，同时保留了多周期知识的广度。我们还释放了TSQA，这是一种10k问题的基准，其替代方案被明确标记为过去，未来或无关紧要，从而可以对时间幻觉进行细粒度的测量。对八个标准NLP任务以及TSQA进行的实验表明，共同适应的Timoe变体匹配或超过了最佳的单周期专家，并将未来的知识错误降低了15％。我们的结果表明，模块化的，时间分段的预训练与因果路由是通往LLM的简单而有效的路径，该路径是按时间顺序扎根的，而无需牺牲一般绩效。我们在Timoe（github）开源代码：此HTTPS URL
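
The inference-time masking described in the TiMoE abstract can be sketched roughly as below. The abstract does not say how the remaining log-probabilities are merged, so the uniform average in probability space is an assumption, as are the function name `mix_log_probs` and the year-granularity timestamps.

```python
import math
from typing import List, Sequence


def mix_log_probs(
    expert_log_probs: Sequence[List[float]],
    window_end_years: Sequence[int],
    query_year: int,
) -> List[float]:
    """Mask experts trained past the query year, then average the rest.

    Assumption: experts are combined with equal weights in probability space.
    """
    eligible = [
        log_probs
        for log_probs, end_year in zip(expert_log_probs, window_end_years)
        if end_year <= query_year  # drop experts whose window ends after the query
    ]
    if not eligible:
        raise ValueError("no expert was trained entirely before the query timestamp")

    mixed = []
    for token in range(len(eligible[0])):
        values = [log_probs[token] for log_probs in eligible]
        top = max(values)
        # Numerically stable log of the mean probability across eligible experts.
        log_sum = top + math.log(sum(math.exp(v - top) for v in values))
        mixed.append(log_sum - math.log(len(values)))
    return mixed


# Two hypothetical experts over a two-token vocabulary.
experts = [
    [math.log(0.5), math.log(0.5)],  # training window ends in 2014
    [math.log(0.9), math.log(0.1)],  # training window ends in 2016
]
ends = [2014, 2016]

only_first = mix_log_probs(experts, ends, query_year=2015)
both = mix_log_probs(experts, ends, query_year=2016)
```

For a 2015 query the second expert is masked, so the output equals the first expert's distribution. For a 2016 query both experts contribute.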

Title: An Investigation of Robustness of LLMs in Mathematical Reasoning: Benchmarking with Mathematically-Equivalent Transformation of Advanced Mathematical Problems

Authors: Yuren Hao, Xiang Wan, Chengxiang Zhai
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.08833
Pdf URL: https://arxiv.org/pdf/2508.08833
Copy Paste: [[2508.08833]] An Investigation of Robustness of LLMs in Mathematical Reasoning: Benchmarking with Mathematically-Equivalent Transformation of Advanced Mathematical Problems(https://arxiv.org/abs/2508.08833)
Keywords: llm
Abstract: In this paper, we introduce a systematic framework beyond conventional methods to assess LLMs' mathematical-reasoning robustness by stress-testing them on advanced math problems that are mathematically equivalent but with linguistic and parametric variation. These transformations allow us to measure the sensitivity of LLMs to non-mathematical perturbations, thereby enabling a more accurate evaluation of their mathematical reasoning capabilities. Using this new evaluation methodology, we created PutnamGAP, a new benchmark dataset with multiple mathematically-equivalent variations of competition-level math problems. With the new dataset, we evaluate multiple families of representative LLMs and examine their robustness. Across 18 commercial and open-source models we observe sharp performance degradation on the variants. OpenAI's flagship reasoning model, O3, scores 49% on the originals but drops by 4 percentage points on surface variants, and by 10.5 percentage points on core-step-based variants, while smaller models fare far worse. Overall, the results show that the proposed new evaluation methodology is effective for deepening our understanding of the robustness of LLMs and generating new insights for further improving their mathematical reasoning capabilities.
摘要：在本文中，我们引入了一个超出常规方法的系统框架，以通过对高级数学问题进行压力测试，以评估LLM的数学性鲁棒性，这些问题在数学上是相等的，但具有语言和参数变化。这些转换使我们能够测量LLM对非数学扰动的敏感性，从而更准确地评估其数学推理能力。使用这种新的评估方法，我们创建了PutNamGap，这是一个新的基准数据集，具有多个数学等效竞争级数学问题的变化。借助新数据集，我们评估了代表性LLM的多个家族并检查其稳健性。在18种商业和开源模型中，我们观察到变体的急剧性能降解。 Openai的旗舰推理模型O3在原始产品上得分49％，但在表面变体上下降了4个百分点，基于核心步骤的变体上的分数下降了10.5个百分点，而较小的型号的效果要差得多。总体而言，结果表明，提出的新评估方法可有效加深我们对LLM的鲁棒性的理解，并为进一步提高其数学推理能力而产生新的见解。

Title: Steering Towards Fairness: Mitigating Political Bias in LLMs

Authors: Afrozah Nadeem, Mark Dras, Usman Naseem
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.08846
Pdf URL: https://arxiv.org/pdf/2508.08846
Copy Paste: [[2508.08846]] Steering Towards Fairness: Mitigating Political Bias in LLMs(https://arxiv.org/abs/2508.08846)
Keywords: language model, llm
Abstract: Recent advancements in large language models (LLMs) have enabled their widespread use across diverse real-world applications. However, concerns remain about their tendency to encode and reproduce ideological biases, particularly along political and economic dimensions. In this paper, we propose a framework for probing and mitigating such biases in decoder-based LLMs through analysis of internal model representations. Grounded in the Political Compass Test (PCT), our method uses contrastive pairs to extract and compare hidden layer activations from models like Mistral and DeepSeek. We introduce a comprehensive activation extraction pipeline capable of layer-wise analysis across multiple ideological axes, revealing meaningful disparities linked to political framing. Our results show that decoder LLMs systematically encode representational bias across layers, which can be leveraged for effective steering vector-based mitigation. This work provides new insights into how political bias is encoded in LLMs and offers a principled approach to debiasing beyond surface-level output interventions.
摘要：大型语言模型（LLM）的最新进展使它们能够在不同的现实应用程序中广泛使用。但是，仍然担心他们倾向于编码和繁殖意识形态偏见，尤其是在政治和经济方面。在本文中，我们提出了一个框架，用于通过分析内部模型表示形式来探测和减轻基于解码器的LLM的偏见。基于政治指南测试（PCT），我们的方法使用对比对提取和比较诸如Mistral和DeepSeek之类的模型中的隐藏层激活。我们介绍了一条全面的激活提取管道，能够跨多个意识形态轴进行层次分析，从而揭示了与政治框架相关的有意义的差异。我们的结果表明，解码器llms系统地编码跨层的代表性偏差，可以利用这些偏差以进行有效的基于转向向量的缓解措施。这项工作提供了有关LLM中政治偏见如何编码的新见解，并为超出表面水平的产出干预提供了一种原则性的方法。
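
The idea of steering with contrastive pairs can be illustrated with a difference-of-means sketch. The abstract does not state how the steering vector is computed or applied, so the difference of mean activations, the subtraction with strength `alpha`, and all names below are assumptions based on a common formulation, not the authors' method.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(len(vectors[0]))]


def steering_vector(side_a: Sequence[Vector], side_b: Sequence[Vector]) -> Vector:
    """Difference of mean hidden activations between the two sides of the pairs."""
    mean_a, mean_b = mean_vector(side_a), mean_vector(side_b)
    return [a - b for a, b in zip(mean_a, mean_b)]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Shift a hidden state away from the extracted direction by strength alpha."""
    return [h - alpha * d for h, d in zip(hidden, direction)]


# Toy two-dimensional activations for two contrastive pairs at one layer.
activations_a = [[1.0, 0.0], [3.0, 0.0]]
activations_b = [[0.0, 0.0], [0.0, 2.0]]

direction = steering_vector(activations_a, activations_b)
steered = steer([1.0, 1.0], direction, alpha=0.5)
```

In practice the activations would come from a chosen hidden layer of the model for each prompt in a contrastive pair. The toy vectors above only show the arithmetic.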

Title: BiasGym: Fantastic Biases and How to Find (and Remove) Them

Authors: Sekh Mainul Islam, Nadav Borenstein, Siddhesh Milind Pawar, Haeun Yu, Arnav Arora, Isabelle Augenstein
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.08855
Pdf URL: https://arxiv.org/pdf/2508.08855
Copy Paste: [[2508.08855]] BiasGym: Fantastic Biases and How to Find (and Remove) Them(https://arxiv.org/abs/2508.08855)
Keywords: language model, llm
Abstract: Understanding biases and stereotypes encoded in the weights of Large Language Models (LLMs) is crucial for developing effective mitigation strategies. Biased behaviour is often subtle and non-trivial to isolate, even when deliberately elicited, making systematic analysis and debiasing particularly challenging. To address this, we introduce BiasGym, a simple, cost-effective, and generalizable framework for reliably injecting, analyzing, and mitigating conceptual associations within LLMs. BiasGym consists of two components: BiasInject, which injects specific biases into the model via token-based fine-tuning while keeping the model frozen, and BiasScope, which leverages these injected signals to identify and steer the components responsible for biased behavior. Our method enables consistent bias elicitation for mechanistic analysis, supports targeted debiasing without degrading performance on downstream tasks, and generalizes to biases unseen during training. We demonstrate the effectiveness of BiasGym in reducing real-world stereotypes (e.g., people from a country being 'reckless drivers') and in probing fictional associations (e.g., people from a country having 'blue skin'), showing its utility for both safety interventions and interpretability research.
摘要：理解大语模型（LLM）重量中编码的偏见和刻板印象对于制定有效的缓解策略至关重要。有偏见的行为通常是微妙的，也不是孤立的，即使是故意引起的，使系统的分析和偏见特别具有挑战性。为了解决这个问题，我们介绍了Biasgym，这是一个简单，具有成本效益且可广泛的框架，用于可靠地注入，分析和缓解LLMS内的概念关联。 Biasgym由两个组成部分：Biasinject组成，它们通过基于令牌的微调将特定的偏见注入模型，同时保持模型冻结，而BiasScope则利用这些注入的信号来识别和引导组件负责偏置行为。我们的方法可以使机械分析的偏差启发，支持有针对性的依据，而不会在下游任务上降低绩效，并在训练过程中概括偏见。我们证明了偏见在减少现实世界刻板印象（例如，来自一个国家的人们是“鲁ck驱动器”的人们）以及探测虚构的关联（例如，来自一个具有“蓝色皮肤”的国家）的有效性。

Title: Entangled in Representations: Mechanistic Investigation of Cultural Biases in Large Language Models

Authors: Haeun Yu, Seogyeong Jeong, Siddhesh Pawar, Jisu Shin, Jiho Jin, Junho Myung, Alice Oh, Isabelle Augenstein
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.08879
Pdf URL: https://arxiv.org/pdf/2508.08879
Copy Paste: [[2508.08879]] Entangled in Representations: Mechanistic Investigation of Cultural Biases in Large Language Models(https://arxiv.org/abs/2508.08879)
Keywords: language model, llm
Abstract: The growing deployment of large language models (LLMs) across diverse cultural contexts necessitates a better understanding of how the overgeneralization of less documented cultures within LLMs' representations impacts their cultural understanding. Prior work only performs extrinsic evaluation of LLMs' cultural competence, without accounting for how LLMs' internal mechanisms lead to cultural (mis)representation. To bridge this gap, we propose CultureScope, the first mechanistic interpretability-based method that probes the internal representations of LLMs to elicit the underlying cultural knowledge space. CultureScope utilizes a patching method to extract the cultural knowledge. We introduce a cultural flattening score as a measure of the intrinsic cultural biases. Additionally, we study how LLMs internalize Western-dominance bias and cultural flattening, which allows us to trace how cultural biases emerge within LLMs. Our experimental results reveal that LLMs encode Western-dominance bias and cultural flattening in their cultural knowledge space. We find that low-resource cultures are less susceptible to cultural biases, likely due to their limited training resources. Our work provides a foundation for future research on mitigating cultural biases and enhancing LLMs' cultural understanding. Our code and data used for experiments are publicly available.
摘要：大型语言模型（LLM）跨不同文化背景的日益增长的部署需要更好地理解LLMS代表中未经文化的文化过度概括的文化如何影响他们的文化理解。先前的工作仅对LLMS的文化能力进行外部评估，而不会考虑LLMS的内部机制如何导致文化（MIS）代表。为了弥合这一差距，我们提出了CulturesCope，这是第一个基于机械的解释性方法，该方法探究了LLM的内部表示以引起基本的文化知识空间。 CulturesCope利用修补方法来提取文化知识。我们引入了文化扁平分数，以衡量内在的文化偏见。此外，我们研究了LLM如何将西方占主导地位的偏见和文化平坦化，这使我们能够追踪文化偏见在LLM中的出现。我们的实验结果表明，LLM在其文化知识空间中编码西方占主导地位的偏见和文化扁平化。我们发现，低资源文化不太容易受到文化偏见的影响，这可能是由于其培训资源有限。我们的工作为减轻文化偏见和增强LLMS文化理解的未来研究奠定了基础。我们用于实验的代码和数据已公开可用。

Title: ASPD: Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs

Authors: Keyu Chen, Zhifeng Shen, Daohai Yu, Haoqian Wu, Wei Wen, Jianfeng He, Ruizhi Qiao, Xing Sun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.08895
Pdf URL: https://arxiv.org/pdf/2508.08895
Copy Paste: [[2508.08895]] ASPD: Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs(https://arxiv.org/abs/2508.08895)
Keywords: language model, llm, retrieval-augmented generation
Abstract: The increasing scale and complexity of large language models (LLMs) pose significant inference latency challenges, primarily due to their autoregressive decoding paradigm characterized by the sequential nature of next-token prediction. By re-examining the outputs of autoregressive models, we observed that some segments exhibit parallelizable structures, which we term intrinsic parallelism. Decoding each parallelizable branch simultaneously (i.e., parallel decoding) can significantly improve the overall inference speed of LLMs. In this paper, we propose Adaptive Serial-Parallel Decoding (ASPD), which addresses two core challenges: automated construction of parallelizable data and an efficient parallel decoding mechanism. More specifically, we introduce a non-invasive pipeline that automatically extracts and validates parallelizable structures from the responses of autoregressive models. To empower efficient adaptive serial-parallel decoding, we implement a Hybrid Decoding Engine which enables seamless transitions between serial and parallel decoding modes while maintaining a reusable KV cache, maximizing computational efficiency. Extensive evaluations across General Tasks, Retrieval-Augmented Generation, and Mathematical Reasoning demonstrate that ASPD achieves unprecedented performance in both effectiveness and efficiency. Notably, on Vicuna Bench, our method achieves up to 3.19x speedup (1.85x on average) while maintaining response quality within 1% difference compared to autoregressive models, realizing significant acceleration without compromising generation quality. Our framework sets a groundbreaking benchmark for efficient LLM parallel inference, paving the way for its deployment in latency-sensitive applications such as AI-powered customer service bots and answer retrieval engines.
摘要：大语言模型（LLM）的规模和复杂性的提高构成了重大的推理潜伏期挑战，这主要是由于其自回旋解码范式的特征是下一步预测的顺序性质。通过重新检查自回归模型的输出，我们观察到某些段表现出可行的结构，我们将其称为内在的并行性。同时解码每个可行的分支（即平行解码）可以显着提高LLM的总体推理速度。在本文中，我们提出了一种自适应串行 - 平行解码（ASPD），该解码解决了两个核心挑战：可行数据的自动构造和有效的并行解码机制。更具体地说，我们引入了一种非侵入性管道，该管道自动从自回旋模型的响应中自动提取和验证可行的结构。为了赋予有效的自适应串行平行解码，我们实现了混合解码引擎，该引擎可以在串行和并行解码模式之间进行无缝过渡，同时保持可重复使用的KV缓存，从而最大程度地提高计算效率。跨通用任务，检索发明的一代，数学推理的广泛评估表明，ASPD在有效性和效率上都实现了前所未有的绩效。值得注意的是，在Vicuna板凳上，我们的方法达到了高达3.19倍的速度（平均1.85倍），同时与自回归模型相比，将响应质量保持在1％的差异之内，从而实现了显着的加速度，而不会损害发电质量。我们的框架为有效的LLM并行推理设定了一个开创性的基准，为其在延迟敏感的应用程序（例如AI驱动的客户服务机器人）中的部署铺平了道路并回答检索引擎。

Title: Reveal-Bangla: A Dataset for Cross-Lingual Multi-Step Reasoning Evaluation

Authors: Khondoker Ittehadul Islam, Gabriele Sarti
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.08933
Pdf URL: https://arxiv.org/pdf/2508.08933
Copy Paste: [[2508.08933]] Reveal-Bangla: A Dataset for Cross-Lingual Multi-Step Reasoning Evaluation(https://arxiv.org/abs/2508.08933)
Keywords: language model
Abstract: Language models have demonstrated remarkable performance on complex multi-step reasoning tasks. However, their evaluation has been predominantly confined to high-resource languages such as English. In this paper, we introduce a manually translated Bangla multi-step reasoning dataset derived from the English Reveal dataset, featuring both binary and non-binary question types. We conduct a controlled evaluation of English-centric and Bangla-centric multilingual small language models on the original dataset and our translated version to compare their ability to exploit relevant reasoning steps to produce correct answers. Our results show that, in comparable settings, reasoning context is beneficial for more challenging non-binary questions, but models struggle to employ relevant Bangla reasoning steps effectively. We conclude by exploring how reasoning steps contribute to models' predictions, highlighting different trends across models and languages.
摘要：语言模型在复杂的多步推理任务上表现出了出色的性能。但是，他们的评估主要仅限于高源语言，例如英语。在本文中，我们介绍了一种手动翻译的Bangla多步性推理数据集，该数据集从英语显示的数据集中衍生而来，其中均具有二进制和非二进制问题类型。我们对原始数据集中的以英语为中心和以英语为中心的多语言小语言模型进行了对照评估，并进行了翻译版本，以比较他们利用相关推理步骤以产生正确答案的能力。我们的结果表明，在可比的环境中，推理环境对更具挑战性的非二进制问题是有益的，但是模型努力有效地采用相关的孟加拉推理步骤。我们通过探索推理步骤如何促进模型的预测，突出模型和语言之间的不同趋势来得出结论。

Title: Train Long, Think Short: Curriculum Learning for Efficient Reasoning

Authors: Hasan Abed Al Kader Hammoud, Kumail Alhamoud, Abed Hammoud, Elie Bou-Zeid, Marzyeh Ghassemi, Bernard Ghanem
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.08940
Pdf URL: https://arxiv.org/pdf/2508.08940
Copy Paste: [[2508.08940]] Train Long, Think Short: Curriculum Learning for Efficient Reasoning(https://arxiv.org/abs/2508.08940)
Keywords: language model, llm
Abstract: Recent work on enhancing the reasoning abilities of large language models (LLMs) has introduced explicit length control as a means of constraining computational cost while preserving accuracy. However, existing approaches rely on fixed-length training budgets, which do not take advantage of the natural progression from exploration to compression during learning. In this work, we propose a curriculum learning strategy for length-controlled reasoning using Group Relative Policy Optimization (GRPO). Our method starts with generous token budgets and gradually tightens them over training, encouraging models to first discover effective solution strategies and then distill them into more concise reasoning traces. We augment GRPO with a reward function that balances three signals: task correctness (via verifier feedback), length efficiency, and formatting adherence (via structural tags). Experiments on GSM8K, MATH500, SVAMP, College Math, and GSM+ demonstrate that curriculum-based training consistently outperforms fixed-budget baselines at the same final budget, achieving higher accuracy and significantly improved token efficiency. We further ablate the impact of reward weighting and decay schedule design, showing that progressive constraint serves as a powerful inductive bias for training efficient reasoning models. Our code and checkpoints are released at: this https URL.
摘要：最新的关于增强大语模型（LLM）推理能力的工作已引入了明确的长度控制，作为限制计算成本的一种手段，同时保持准确性。但是，现有方法依赖于固定长度的培训预算，这些预算并不能利用从探索到压缩过程中自然发展的优势。在这项工作中，我们提出了一种使用小组相对策略优化（GRPO）的课程学习策略，用于长度控制推理。我们的方法从慷慨的代币预算开始，并逐渐将它们逐渐收紧培训，鼓励模型首先发现有效的解决方案策略，然后将其提炼成更简洁的推理痕迹。我们使用奖励函数增强GRPO，可以平衡三个信号：任务正确性（通过验证器反馈），长度效率和格式化依从性（通过结构标签）。对GSM8K，Math500，SVAMP，College Math和GSM+进行的实验表明，基于课程的培训始终在同一最终预算上优于固定预算基准，从而实现了更高的准确性，并显着提高了令牌效率。我们进一步消融了奖励加权和衰减计划设计的影响，这表明渐进式约束是训练有效推理模型的强大归纳偏见。我们的代码和检查点发布在以下位置：此HTTPS URL。
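
The combination of a tightening token budget and a three-signal reward can be sketched as follows. The abstract gives neither the decay schedule nor the reward weights, so the stepwise exponential decay, the linear length penalty, the weights, and all names below are assumptions chosen only to make the idea concrete.

```python
def token_budget(step: int, start_budget: int, final_budget: int,
                 decay: float, interval: int) -> int:
    """Budget that shrinks by `decay` every `interval` steps, floored at the final budget."""
    budget = start_budget * decay ** (step // interval)
    return max(final_budget, int(budget))


def reward(correct: bool, n_tokens: int, budget: int, well_formatted: bool,
           w_correct: float = 1.0, w_length: float = 0.5, w_format: float = 0.2) -> float:
    """Weighted sum of correctness, length efficiency, and formatting adherence."""
    if n_tokens <= budget:
        length_score = 1.0
    else:
        # Linear penalty for overshooting the budget, clipped at zero.
        length_score = max(0.0, 1.0 - (n_tokens - budget) / budget)
    return (w_correct * float(correct)
            + w_length * length_score
            + w_format * float(well_formatted))


# Hypothetical schedule: start at 256 tokens, halve every 100 steps, stop at 64.
budgets = [token_budget(s, 256, 64, 0.5, 100) for s in (0, 100, 250, 1000)]

within_budget = reward(correct=True, n_tokens=100, budget=128, well_formatted=True)
over_budget = reward(correct=False, n_tokens=192, budget=128, well_formatted=False)
```

Early steps allow long traces so that correct strategies can be found, and later steps penalize the same trace length more heavily as the budget shrinks.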

Title: Jointly Generating and Attributing Answers using Logits of Document-Identifier Tokens

Authors: Lucas Albarede, Jose Moreno, Lynda Tamine, Luce Lefeuvre
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2508.08942
Pdf URL: https://arxiv.org/pdf/2508.08942
Copy Paste: [[2508.08942]] Jointly Generating and Attributing Answers using Logits of Document-Identifier Tokens(https://arxiv.org/abs/2508.08942)
Keywords: language model, llm, hallucination
Abstract: Despite their impressive performances, Large Language Models (LLMs) remain prone to hallucination, which critically undermines their trustworthiness. While most of the previous work focused on tackling answer and attribution correctness, a recent line of work investigated faithfulness, with a focus on leveraging internal model signals to reflect a model's actual decision-making process while generating the answer. Nevertheless, these methods induce additional latency and have shown limitations in directly aligning token generation with attribution generation. In this paper, we introduce LoDIT, a method that jointly generates and faithfully attributes answers in RAG by leveraging specific token logits during generation. It consists of two steps: (1) marking the documents with specific token identifiers and then leveraging the logits of these tokens to estimate the contribution of each document to the answer during generation, and (2) aggregating these contributions into document attributions. Experiments on a trustworthiness-focused attributed text-generation benchmark, Trust-Align, show that LoDIT significantly outperforms state-of-the-art models on several metrics. Finally, an in-depth analysis of LoDIT shows both its efficiency in terms of latency and its robustness in different settings.
摘要：尽管表现令人印象深刻，但大型语言模型（LLM）仍然容易出现幻觉，这严重破坏了他们的可信赖性。尽管以前的大多数工作都集中在解决答案和归因正确性上，但最近的工作范围调查了忠诚，重点是利用内部模型信号来反映模型的实际决策过程，同时产生答案。然而，这些方法会导致额外的延迟，并显示出与归因生成直接对齐代币产生的局限性。在本文中，我们介绍了Lodit，这种方法通过利用特定的代币logits在发电期间，共同生成并忠实地归因于抹布中的答案。它由两个步骤组成：（1）用特定令牌标识符标记文档，然后利用这些令牌的徽标来估计每生物期间每个文档对答案的贡献，以及（2）将这些贡献汇总到文档归因中。以可信赖性为中心的实验归因于文本生成基准，Trust-Align表明，在几个指标上，Lodit显着超过了最先进的模型。最后，对LODIT的深入分析既显示了其在不同环境中的潜伏期及其稳健性方面的效率。
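The two-step idea in this abstract (per-step contributions from identifier-token logits, then aggregation into attributions) can be illustrated with a small sketch. This is not the authors' implementation: the per-step softmax, the averaging rule, the `threshold` parameter, and the function name `attribute_documents` are all assumptions made for illustration.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attribute_documents(step_logits, threshold=0.5):
    """Aggregate per-step identifier-token logits into document scores.

    step_logits: one dict per generated token, mapping a document id to the
    logit of that document's identifier token at that step (assumed input).
    Returns (scores, cited): the mean per-step probability of each document,
    and the ids whose mean reaches the (hypothetical) threshold.
    """
    doc_ids = list(step_logits[0])
    totals = {doc: 0.0 for doc in doc_ids}
    for logits in step_logits:
        # Step 1: turn this step's identifier logits into contributions.
        probs = softmax([logits[doc] for doc in doc_ids])
        for doc, prob in zip(doc_ids, probs):
            totals[doc] += prob
    # Step 2: aggregate contributions over the whole answer.
    scores = {doc: totals[doc] / len(step_logits) for doc in doc_ids}
    cited = [doc for doc in doc_ids if scores[doc] >= threshold]
    return scores, cited

scores, cited = attribute_documents(
    [{"D1": 2.0, "D2": 0.0}, {"D1": 3.0, "D2": 0.0}]
)
```

Because the logits are read during the same forward passes that produce the answer, a scheme of this shape would need no extra generation pass, which is consistent with the latency claim in the abstract.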

Title: Retrospective Sparse Attention for Efficient Long-Context Generation

Authors: Seonghwan Choi, Beomseok Kang, Dongwon Jo, Jae-Joon Kim
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.09001
Pdf URL: https://arxiv.org/pdf/2508.09001
Copy Paste: [[2508.09001]] Retrospective Sparse Attention for Efficient Long-Context Generation(https://arxiv.org/abs/2508.09001)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are increasingly deployed in long-context tasks such as reasoning, code generation, and multi-turn dialogue. However, inference over extended contexts is bottlenecked by the Key-Value (KV) cache, whose memory footprint grows linearly with sequence length and dominates latency at each decoding step. While recent KV cache compression methods identify and load important tokens, they focus predominantly on input contexts and fail to address the cumulative attention errors that arise during long decoding. In this paper, we introduce RetroAttention, a novel KV cache update technique that retrospectively revises past attention outputs using newly arrived KV entries from subsequent decoding steps. By maintaining a lightweight output cache, RetroAttention enables past queries to efficiently access more relevant context, while incurring minimal latency overhead. This breaks the fixed-attention-output paradigm and allows continual correction of prior approximations. Extensive experiments on long-generation benchmarks show that RetroAttention consistently outperforms state-of-the-art (SOTA) KV compression methods, increasing effective KV exposure by up to 1.6× and accuracy by up to 21.9%.
摘要：大型语言模型（LLMS）越来越多地部署在诸如推理，代码生成和多转化对话之类的长篇文章任务中。但是，对扩展上下文的推断是由键值（KV）缓存瓶颈的，其内存足迹随序列长度长度增长，并在每个解码步骤中占据延迟。尽管最近的KV缓存压缩方法识别并加载了重要令牌，但它们主要集中在输入上下文上，并且无法解决长期解码过程中出现的累积注意力错误。在本文中，我们介绍了Retrowate，这是一种新颖的KV缓存更新技术，可追溯通过随后的解码步骤使用新来的KV条目来回顾过去的注意力输出。通过维护轻量级输出缓存，Retrowateention可以使过去的查询能够有效地访问更相关的上下文，同时导致最小的延迟开销。这打破了固定发出的输出范式，并允许对先前近似值进行持续校正。关于长期基准测试的广泛实验表明，追溯始终超过最先进的（SOTA）KV压缩方法，将有效的KV暴露量提高了1.6 $ \ times $，而准确性则高达21.9 \％。

Title: LyS at SemEval 2025 Task 8: Zero-Shot Code Generation for Tabular QA

Authors: Adrián Gude, Roi Santos-Ríos, Francisco Prado-Valiño, Ana Ezquerro, Jesús Vilares
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.09012
Pdf URL: https://arxiv.org/pdf/2508.09012
Copy Paste: [[2508.09012]] LyS at SemEval 2025 Task 8: Zero-Shot Code Generation for Tabular QA(https://arxiv.org/abs/2508.09012)
Keywords: language model, prompt
Abstract: This paper describes our participation in SemEval 2025 Task 8, focused on Tabular Question Answering. We developed a zero-shot pipeline that leverages a Large Language Model to generate functional code capable of extracting the relevant information from tabular data based on an input question. Our approach consists of a modular pipeline where the main code generator module is supported by additional components that identify the most relevant columns and analyze their data types to improve extraction accuracy. In the event that the generated code fails, an iterative refinement process is triggered, incorporating the error feedback into a new generation prompt to enhance robustness. Our results show that zero-shot code generation is a valid approach for Tabular QA, achieving rank 33 of 53 in the test phase despite the lack of task-specific fine-tuning.
摘要：本文介绍了我们参与Semeval 2025 Task 8，重点是表达问题。我们开发了一个零拍的管道，该管道利用大型语言模型生成能够根据输入问题从表格数据中提取相关信息的功能代码。我们的方法由模块化管道组成，其中主代码生成器模块由识别最相关列的其他组件支持并分析其数据类型以提高提取精度。如果生成的代码失败，则会触发迭代的完善过程，将错误反馈纳入新一代提示中以增强鲁棒性。我们的结果表明，零射击代码生成是表格质量检查的有效方法，尽管缺乏特定于任务的微调，但在测试阶段达到了53位的排名33。
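The iterative refinement step described in this abstract (run the generated code, and on failure feed the error back into a new prompt) can be sketched as a short retry loop. This is a minimal illustration, not the authors' pipeline: `run_with_refinement`, `fake_generator`, the `answer(table)` convention, and `max_attempts` are hypothetical names and choices, and the generator is a hard-coded stand-in for an LLM call.

```python
def run_with_refinement(question, table, generate_code, max_attempts=3):
    """Generate code, execute it, and retry with the error text on failure.

    generate_code(question, error) is assumed to return Python source that
    defines a function answer(table). Returns (result, attempts_used).
    """
    error = None
    for attempt in range(1, max_attempts + 1):
        source = generate_code(question, error)
        namespace = {}
        try:
            exec(source, namespace)              # define answer()
            return namespace["answer"](table), attempt
        except Exception as exc:
            # Keep the failure so the next prompt can include it.
            error = f"{type(exc).__name__}: {exc}"
    raise RuntimeError(f"no working code after {max_attempts} attempts: {error}")

def fake_generator(question, error):
    """Hypothetical stand-in for the LLM: wrong column name first, fixed on retry."""
    column = "price" if error is None else "Price"
    return f"def answer(table):\n    return max(row['{column}'] for row in table)\n"

TABLE = [{"Price": 3}, {"Price": 7}]
result = run_with_refinement("What is the highest price?", TABLE, fake_generator)
```

In a real system the generated code would need to run in a sandbox rather than through a bare `exec`, since it is untrusted model output.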

Title: A Survey on Training-free Alignment of Large Language Models

Authors: Birong Pan, Yongqi Li, Weiyu Zhang, Wenpeng Lu, Mayi Xu, Shen Zhou, Yuanyuan Zhu, Ming Zhong, Tieyun Qian
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.09016
Pdf URL: https://arxiv.org/pdf/2508.09016
Copy Paste: [[2508.09016]] A Survey on Training-free Alignment of Large Language Models(https://arxiv.org/abs/2508.09016)
Keywords: language model, llm
Abstract: The alignment of large language models (LLMs) aims to ensure their outputs adhere to human values, ethical standards, and legal norms. Traditional alignment methods often rely on resource-intensive fine-tuning (FT), which may suffer from knowledge degradation and face challenges in scenarios where the model accessibility or computational resources are constrained. In contrast, training-free (TF) alignment techniques--leveraging in-context learning, decoding-time adjustments, and post-generation corrections--offer a promising alternative by enabling alignment without heavily retraining LLMs, making them adaptable to both open-source and closed-source environments. This paper presents the first systematic review of TF alignment methods, categorizing them by stages of pre-decoding, in-decoding, and post-decoding. For each stage, we provide a detailed examination from the viewpoint of LLMs and multimodal LLMs (MLLMs), highlighting their mechanisms and limitations. Furthermore, we identify key challenges and future directions, paving the way for more inclusive and effective TF alignment techniques. By synthesizing and organizing the rapidly growing body of research, this survey offers guidance for practitioners and advances the development of safer and more reliable LLMs.
摘要：大型语言模型（LLM）的一致性旨在确保其产出遵守人类价值观，道德标准和法律规范。传统的一致性方法通常依赖于资源密集型微调（FT），这些微调（FT）可能会遭受知识退化的影响，并且在模型可访问性或计算资源受到限制的情况下面临挑战。相比之下，无训练（TF）对准技术 - 杠杆内部的学习，解码时间调整和发电后校正 - 通过在不重新训练LLM的情况下启用对齐方式，使其适应开放源和封闭源环境。本文介绍了对TF对齐方法的首次系统综述，并通过对编码，编码和编码后编码的阶段进行分类。对于每个阶段，我们从LLM和多模式LLM（MLLM）的角度提供详细的检查，突出了它们的机制和局限性。此外，我们确定了关键的挑战和未来的方向，为更具包容性和有效的TF对准技术铺平了道路。通过综合和组织快速增长的研究体系，该调查为从业者提供了指导，并促进了更安全，更可靠的LLM的发展。

Title: LLM-as-a-Supervisor: Mistaken Therapeutic Behaviors Trigger Targeted Supervisory Feedback

Authors: Chen Xu, Zhenyu Lv, Tian Lan, Xianyang Wang, Luyao Ji, Leyang Cui, Minqiang Yang, Jian Shen, Qunxi Dong, Xiuling Liu, Juan Wang, Bin Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.09042
Pdf URL: https://arxiv.org/pdf/2508.09042
Copy Paste: [[2508.09042]] LLM-as-a-Supervisor: Mistaken Therapeutic Behaviors Trigger Targeted Supervisory Feedback(https://arxiv.org/abs/2508.09042)
Keywords: language model, llm, agent
Abstract: Although large language models (LLMs) hold significant promise in psychotherapy, their direct application in patient-facing scenarios raises ethical and safety concerns. Therefore, this work shifts towards developing an LLM as a supervisor to train real therapists. In addition to the privacy of clinical therapist training data, a fundamental contradiction complicates the training of therapeutic behaviors: clear feedback standards are necessary to ensure a controlled training system, yet there is no absolute "gold standard" for appropriate therapeutic behaviors in practice. In contrast, many common therapeutic mistakes are universal and identifiable, making them effective triggers for targeted feedback that can serve as clearer evidence. Motivated by this, we create a novel therapist-training paradigm: (1) guidelines for mistaken behaviors and targeted correction strategies are first established as standards; (2) a human-in-the-loop dialogue-feedback dataset is then constructed, where a mistake-prone agent intentionally makes standard mistakes in a natural way during interviews, and a supervisor agent locates and identifies mistakes and provides targeted feedback; (3) after fine-tuning on this dataset, the final supervisor model is provided for real therapist training. The detailed experimental results of automated, human and downstream assessments demonstrate that models fine-tuned on our dataset, MATE, can provide high-quality feedback according to the clinical guideline, showing significant potential for the therapist training scenario.
摘要：尽管大型语言模型（LLMS）在心理治疗方面具有巨大的希望，但它们在面向患者的情况下的直接应用引起了道德和安全问题。因此，这项工作转向开发LLM作为主管来培训真正的治疗师。除了临床治疗师培训数据的隐私外，基本的矛盾使治疗行为的训练变得复杂：确保受控培训系统的明确反馈标准是必要的，但是在实践中，没有绝对的“金标准”。相比之下，许多常见的治疗错误是普遍且可识别的，这使它们成为有针对性反馈的有效触发器，可以作为更清晰的证据。在此激励的情况下，我们创建了一种新颖的治疗师训练范式：（1）首先确立了错误行为和有针对性纠正策略的指南；（2）随后构建了一个人类的对话反馈数据集，在访谈中有意在访谈中故意犯标准错误，而主管代理会找到和识别错误并提供目标反馈；（3）在该数据集进行了微调后，为真正的治疗师培训提供了最终的主管模型。自动化，人类和下游评估的详细实验结果表明，根据临床指南，我们的数据集伴侣进行了微调可以提供高质量的反馈，这显示了治疗师培训方案的重要潜力。

Title: MVISU-Bench: Benchmarking Mobile Agents for Real-World Tasks by Multi-App, Vague, Interactive, Single-App and Unethical Instructions

Authors: Zeyu Huang, Juyuan Wang, Longfeng Chen, Boyi Xiao, Leng Cai, Yawen Zeng, Jin Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.09057
Pdf URL: https://arxiv.org/pdf/2508.09057
Copy Paste: [[2508.09057]] MVISU-Bench: Benchmarking Mobile Agents for Real-World Tasks by Multi-App, Vague, Interactive, Single-App and Unethical Instructions(https://arxiv.org/abs/2508.09057)
Keywords: language model, prompt, agent
Abstract: Given the significant advances in Large Vision Language Models (LVLMs) in reasoning and visual understanding, mobile agents are rapidly emerging to meet users' automation needs. However, existing evaluation benchmarks are disconnected from the real world and fail to adequately address the diverse and complex requirements of users. From our extensive collection of user questionnaires, we identified five tasks: Multi-App, Vague, Interactive, Single-App, and Unethical Instructions. Around these tasks, we present MVISU-Bench, a bilingual benchmark that includes 404 tasks across 137 mobile applications. Furthermore, we propose Aider, a plug-and-play module that acts as a dynamic prompt prompter to mitigate risks and clarify user intent for mobile agents. Our Aider is easy to integrate into several frameworks and has successfully improved overall success rates by 19.55% compared to the current state-of-the-art (SOTA) on MVISU-Bench. Specifically, it achieves success rate improvements of 53.52% and 29.41% for unethical and interactive instructions, respectively. Through extensive experiments and analysis, we highlight the gap between existing mobile agents and real-world user expectations.
摘要：鉴于大型视觉语言模型（LVLM）在推理和视觉理解中的显着进步，移动代理正在迅速出现以满足用户的自动化需求。但是，现有的评估基准与现实世界脱节，无法充分解决用户的多样化和复杂要求。根据我们广泛的用户问卷收集，我们确定了五个任务：多应用，含糊，交互式，单应用和不道德的说明。围绕这些任务，我们提出\ textbf {mvisu bench}，这是一种双语基准，其中包含137个移动应用程序中的404个任务。此外，我们提出了AIDER，这是一个插件模块，它充当动态提示提示器，以减轻风险并阐明移动代理的用户意图。我们的AIDE易于整合到多个框架中，并且与MVISU BENCH的当前最新ART（SOTA）相比，总体成功率已成功提高了19.55％。具体而言，对于不道德和互动指令，它的成功率提高了53.52 \％和29.41 \％。通过广泛的实验和分析，我们突出了现有移动代理与现实世界用户期望之间的差距。

Title: READER: Retrieval-Assisted Drafter for Efficient LLM Inference

Authors: Maxim Divilkovskiy, Vitaly Malygin, Sergey Zlobin, Sultan Isali, Vasily Kalugin, Stanislav Ilyushin, Nuriza Aitassova, Yi Fei, Zeng Weidi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.09072
Pdf URL: https://arxiv.org/pdf/2508.09072
Copy Paste: [[2508.09072]] READER: Retrieval-Assisted Drafter for Efficient LLM Inference(https://arxiv.org/abs/2508.09072)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large Language Models (LLMs) generate tokens autoregressively, with each token depending on the preceding context. This sequential nature makes the inference process inherently difficult to accelerate, posing a significant challenge for efficient deployment. In recent years, various methods have been proposed to address this issue, with the most effective approaches often involving the training of additional draft models. In this paper, we introduce READER (Retrieval-Assisted Drafter for Efficient LLM Inference), a novel lossless speculative decoding method that enhances model-based approaches by leveraging self-repetitions in the text. Our algorithm expands the speculative decoding tree using tokens obtained through statistical search. This work focuses on large batch sizes (>= 8), an underexplored yet important area for industrial applications. We also analyze the key-value (KV) cache size during speculative decoding and propose an optimization to improve performance for large batches. As a result, READER outperforms existing speculative decoding methods. Notably, READER requires no additional training and can reuse pre-trained speculator models, increasing the speedup by over 40%. Our method demonstrates particularly strong performance on search-based tasks, such as retrieval-augmented generation, where we achieve more than 10x speedup.
摘要：大型语言模型（LLMS）会自动重新收集，每个令牌取决于前面的上下文。这种顺序的性质使推理过程固有地很难加速，这对有效的部署构成了重大挑战。近年来，已经提出了各种方法来解决这个问题，最有效的方法通常涉及培训其他模型草案。在本文中，我们介绍了读者（用于有效的LLM推断的检索辅助起草者），这是一种新颖的无损投机解码方法，通过利用文本中的自我重复来增强基于模型的方法。我们的算法使用通过统计搜索获得的代币扩展了投机解码树。这项工作着重于大批量大小（> = 8），这是工业应用的未经置换但重要的领域。我们还分析了投机解码过程中的键值（KV）缓存大小，并提出优化以提高大批量的性能。结果，读者的表现优于现有的投机解码方法。值得注意的是，读者不需要额外的培训，并且可以重复使用预训练的投机者模型，从而将加速度提高40 \％。我们的方法表明，在基于搜索的任务（例如检索型的一代）上表现出特别强大的性能，我们达到了10倍以上的速度。
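To make "leveraging self-repetitions in the text" concrete, here is a minimal sketch of drafting tokens by n-gram lookup in the already-generated context. It is an assumption about what such a statistical search could look like, not READER's actual algorithm; the function name `draft_from_repetition` and the parameters `ngram` and `max_draft` are hypothetical.

```python
def draft_from_repetition(tokens, ngram=2, max_draft=4):
    """Propose draft tokens by finding an earlier copy of the current suffix.

    Looks for the most recent earlier occurrence of the last `ngram` tokens
    and returns up to `max_draft` tokens that followed it. Returns an empty
    list when the suffix has not appeared before.
    """
    if len(tokens) < ngram:
        return []
    suffix = tokens[-ngram:]
    # Scan right to left so the most recent repetition wins.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if tokens[start:start + ngram] == suffix:
            return tokens[start + ngram:start + ngram + max_draft]
    return []

context = ["the", "cat", "sat", "on", "the", "mat", "and", "the", "cat"]
draft = draft_from_repetition(context)
```

Drafts produced this way are only candidates. In speculative decoding the target model still verifies them, which is what keeps the method lossless.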

Title: Utilizing Multilingual Encoders to Improve Large Language Models for Low-Resource Languages

Authors: Imalsha Puranegedara, Themira Chathumina, Nisal Ranathunga, Nisansa de Silva, Surangika Ranathunga, Mokanarangan Thayaparan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.09091
Pdf URL: https://arxiv.org/pdf/2508.09091
Copy Paste: [[2508.09091]] Utilizing Multilingual Encoders to Improve Large Language Models for Low-Resource Languages(https://arxiv.org/abs/2508.09091)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) excel in English, but their performance degrades significantly on low-resource languages (LRLs) due to English-centric training. While methods like LangBridge align LLMs with multilingual encoders such as the Massively Multilingual Text-to-Text Transfer Transformer (mT5), they typically use only the final encoder layer. We propose a novel architecture that fuses all intermediate layers, enriching the linguistic information passed to the LLM. Our approach features two strategies: (1) a Global Softmax weighting for overall layer importance, and (2) a Transformer Softmax model that learns token-specific weights. The fused representations are mapped into the LLM's embedding space, enabling it to process multilingual inputs. The model is trained only on English data, without using any parallel or multilingual data. Evaluated on XNLI, IndicXNLI, Sinhala News Classification, and Amazon Reviews, our Transformer Softmax model significantly outperforms the LangBridge baseline. We observe strong performance gains in LRLs, improving Sinhala classification accuracy from 71.66% to 75.86% and achieving clear improvements across Indic languages such as Tamil, Bengali, and Malayalam. These specific gains contribute to an overall boost in average XNLI accuracy from 70.36% to 71.50%. This approach offers a scalable, data-efficient path toward more capable and equitable multilingual LLMs.
摘要：大型语言模型（LLMS）以英语表现出色，但由于以英语为中心的培训，其性能在低资源语言（LRLS）上大大降低。诸如Langbridge LLM和多语言编码器（例如大量多语言文本到文本传输变压器（MT5））之类的方法，它们通常仅使用最终的编码器层。我们提出了一种融合所有中间层的新型体系结构，丰富传递给LLM的语言信息。我们的方法具有两种策略：（1）全球软磁性加权，以实现整体层的重要性，以及（2）一个学习代币特定权重的变压器软磁模型。融合表示形式被映射到LLM的嵌入空间中，使其能够处理多语言输入。该模型仅在不使用任何平行或多语言数据的情况下对英语数据进行培训。在XNLI，INDICXNLI，Sinhala新闻分类和亚马逊评论上进行了评估，我们的变压器SoftMax模型大大胜过Langbridge的基线。我们观察到LRL的强劲表现增长，将僧伽罗的分类准确性从71.66％提高到75.86％，并在泰米尔语，孟加拉语和马拉雅拉姆语等指示语言中取得了明显的改进。这些特定的收益促进了平均XNLI准确性从70.36％提高到71.50％。这种方法为通往更有能力和公平的多语言LLM提供了可扩展的，可扩展的途径。
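The first strategy in this abstract, a Global Softmax weighting over encoder layers, can be sketched as a softmax-weighted sum of per-layer hidden states. The sketch below uses plain Python lists and assumes one learnable scalar score per layer shared by all tokens; `fuse_layers` and its argument layout are illustrative choices, not the authors' code.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_layers(layer_states, layer_scores):
    """Fuse encoder layers with one global weight per layer.

    layer_states: list over layers, each a [num_tokens][hidden] nested list.
    layer_scores: one scalar per layer (learned in a real model, fixed here).
    Returns a [num_tokens][hidden] list: the softmax-weighted sum of layers.
    """
    weights = softmax(layer_scores)
    num_tokens = len(layer_states[0])
    hidden = len(layer_states[0][0])
    fused = [[0.0] * hidden for _ in range(num_tokens)]
    for weight, states in zip(weights, layer_states):
        for t in range(num_tokens):
            for h in range(hidden):
                fused[t][h] += weight * states[t][h]
    return fused

# Two layers, one token, hidden size 2; equal scores give a plain average.
fused = fuse_layers([[[1.0, 2.0]], [[3.0, 4.0]]], [0.0, 0.0])
```

The second strategy would replace the single score vector with weights computed separately for each token, so the inner loop would use a per-token weight instead of one shared value per layer.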

Title: AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators

Authors: Jason Chou, Ao Liu, Yuchi Deng, Zhiying Zeng, Tao Zhang, Haotian Zhu, Jianwei Cai, Yue Mao, Chenchen Zhang, Lingyun Tan, Ziyan Xu, Bohui Zhai, Hengyi Liu, Speed Zhu, Wiggin Zhou, Fengzong Lian
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2508.09101
Pdf URL: https://arxiv.org/pdf/2508.09101
Copy Paste: [[2508.09101]] AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators(https://arxiv.org/abs/2508.09101)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, with code generation emerging as a key area of focus. While numerous benchmarks have been proposed to evaluate their code generation abilities, these benchmarks face several critical limitations. First, they often rely on manual annotations, which are time-consuming and difficult to scale across different programming languages and problem complexities. Second, most existing benchmarks focus primarily on Python, while the few multilingual benchmarks suffer from limited difficulty and uneven language distribution. To address these challenges, we propose AutoCodeGen, an automated method for generating high-difficulty multilingual code generation datasets without manual annotations. AutoCodeGen ensures the correctness and completeness of test cases by generating test inputs with LLMs and obtaining test outputs through a multilingual sandbox, while achieving high data quality through reverse-order problem generation and multiple filtering steps. Using this novel method, we introduce AutoCodeBench, a large-scale code generation benchmark comprising 3,920 problems evenly distributed across 20 programming languages. It is specifically designed to evaluate LLMs on challenging, diverse, and practical multilingual tasks. We evaluate over 30 leading open-source and proprietary LLMs on AutoCodeBench and its simplified version AutoCodeBench-Lite. The results show that even the most advanced LLMs struggle with the complexity, diversity, and multilingual nature of these tasks. Besides, we introduce AutoCodeBench-Complete, specifically designed for base models to assess their few-shot code generation capabilities. We hope the AutoCodeBench series will serve as a valuable resource and inspire the community to focus on more challenging and practical multilingual code generation scenarios.
摘要：大型语言模型（LLM）表现出了各个领域的显着功能，而代码生成是关键的关键领域。尽管已经提出了许多基准来评估其代码生成能力，但这些基准测试面临几个关键局限性。首先，他们通常依靠手动注释，这些注释耗时且难以在不同的编程语言和问题复杂性上扩展。其次，大多数现有的基准主要集中在Python上，而少数多语言基准的难度和语言分布不均匀。为了应对这些挑战，我们提出了Autocodgen，这是一种自动化方法，用于生成高缺陷的多语言代码生成数据集，而无需手动注释。 Autocedgen通过使用LLMS生成测试输入并通过多语言沙箱获得测试输出，从而确保测试案例的正确性和完整性，同时通过反阶问题生成和多个过滤步骤来实现高数据质量。使用这种新颖的方法，我们介绍了Autocodbench，这是一个大规模代码生成的基准，其中包括3,920个问题，该问题均匀分布在20种编程语言上。它是专门设计的，旨在评估LLM关于具有挑战性，多样化和实用的多语言任务。我们评估了Autocodebench上的30多个领先的开源和专有LLMS及其简化版本Autocodebench-Lite。结果表明，即使是最先进的LLM，这些任务的复杂性，多样性和多语言性质都在努力。此外，我们介绍了Autocodebench-Complete，这是专门为基础模型设计的，以评估其少量代码生成功能。我们希望Autocodbench系列将成为一种宝贵的资源，并激发社区专注于更具挑战性和实用的多语言代码生成场景。

Title: SinLlama - A Large Language Model for Sinhala

Authors: H.W.K.Aravinda, Rashad Sirajudeen, Samith Karunathilake, Nisansa de Silva, Surangika Ranathunga, Rishemjit Kaur
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.09115
Pdf URL: https://arxiv.org/pdf/2508.09115
Copy Paste: [[2508.09115]] SinLlama - A Large Language Model for Sinhala(https://arxiv.org/abs/2508.09115)
Keywords: language model, llm
Abstract: Low-resource languages such as Sinhala are often overlooked by open-source Large Language Models (LLMs). In this research, we extend an existing multilingual LLM (Llama-3-8B) to better serve Sinhala. We enhance the LLM tokenizer with Sinhala specific vocabulary and perform continual pre-training on a cleaned 10 million Sinhala corpus, resulting in the SinLlama model. This is the very first decoder-based open-source LLM with explicit Sinhala support. When SinLlama was instruction fine-tuned for three text classification tasks, it outperformed base and instruct variants of Llama-3-8B by a significant margin.
摘要：开源大语模型（LLM）通常会忽略诸如僧伽罗之类的低资源语言。在这项研究中，我们扩展了现有的多语言LLM（Llama-3-8B），以更好地为僧伽罗服务。我们使用特定于Sinhala的词汇来增强LLM令牌，并在清洁的1000万个Sinhala语料库上进行连续的预训练，从而产生了Sinllama模型。这是首个基于解码器的开源LLM，并具有明确的Sinhala支持。当Sinllama对三个文本分类任务进行微调时，它的表现优于基础和指示Llama-3-8B的变体，这是一个很大的差距。

Title: OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows

Authors: Weixuan Wang, Dongge Han, Daniel Madrigal Diaz, Jin Xu, Victor Rühle, Saravan Rajmohan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.09124
Pdf URL: https://arxiv.org/pdf/2508.09124
Copy Paste: [[2508.09124]] OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows(https://arxiv.org/abs/2508.09124)
Keywords: language model, llm, agent
Abstract: Autonomous agents powered by large language models (LLMs) are increasingly deployed in real-world applications requiring complex, long-horizon workflows. However, existing benchmarks predominantly focus on atomic tasks that are self-contained and independent, failing to capture the long-term contextual dependencies and multi-interaction coordination required in realistic scenarios. To address this gap, we introduce OdysseyBench, a comprehensive benchmark for evaluating LLM agents on long-horizon workflows across diverse office applications including Word, Excel, PDF, Email, and Calendar. Our benchmark comprises two complementary splits: OdysseyBench+ with 300 tasks derived from real-world use cases, and OdysseyBench-Neo with 302 newly synthesized complex tasks. Each task requires the agent to identify essential information from long-horizon interaction histories and perform multi-step reasoning across various applications. To enable scalable benchmark creation, we propose HomerAgents, a multi-agent framework that automates the generation of long-horizon workflow benchmarks through systematic environment exploration, task generation, and dialogue synthesis. Our extensive evaluation demonstrates that OdysseyBench effectively challenges state-of-the-art LLM agents, providing more accurate assessment of their capabilities in complex, real-world contexts compared to existing atomic task benchmarks. We believe that OdysseyBench will serve as a valuable resource for advancing the development and evaluation of LLM agents in real-world productivity scenarios. In addition, we release OdysseyBench and HomerAgents to foster research along this line.
摘要：由大语言模型（LLMS）提供动力的自主代理人越来越多地部署在需要复杂的长途工作流程的现实应用程序中。但是，现有的基准主要集中在独立和独立的原子任务上，未能捕获长期的上下文依赖项和在现实场景中需要的多相互作用协调。为了解决这一差距，我们介绍了Odysseybench，这是一种全面的基准，用于评估LLM代理在不同的办公应用程序中长期工作流程中的LLM代理，包括Word，Excel，PDF，电子邮件和日历。我们的基准标准包括两个互补的分裂：Odysseybench+具有300个来自现实世界用例的任务，而Odysseybench-neo具有302个新合成的复杂任务。每个任务都要求代理商从长马相互作用历史记录中确定基本信息，并在各种应用程序上执行多步推理。为了启用可扩展的基准创建，我们提出了一个多代理框架，该框架通过系统的环境探索，任务生成和对话综合，可以自动化长远的工作流程基准。我们广泛的评估表明，奥德赛班族有效地挑战了最先进的LLM代理商，与现有的原子质任务基准相比，在复杂的，现实世界中的能力中提供了更准确的评估。我们认为，Odysseybench将是推进现实世界生产力方案中LLM代理的开发和评估的宝贵资源。此外，我们释放Odysseybench和Homeragents，以促进该线路的研究。

Title: Complex Logical Instruction Generation

Authors: Mian Zhang, Shujian Liu, Sixun Dong, Ming Yin, Yebowen Hu, Xun Wang, Steven Ma, Song Wang, Sathish Reddy Indurthi, Haoyun Deng, Zhiyu Zoey Chen, Kaiqiang Song
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.09125
Pdf URL: https://arxiv.org/pdf/2508.09125
Copy Paste: [[2508.09125]] Complex Logical Instruction Generation(https://arxiv.org/abs/2508.09125)
Keywords: language model, llm, agent
Abstract: Instruction following has catalyzed the recent era of Large Language Models (LLMs) and is the foundational skill underpinning more advanced capabilities such as reasoning and agentic behaviors. As tasks grow more challenging, the logic structures embedded in natural language instructions become increasingly intricate. However, how well LLMs perform on such logic-rich instructions remains under-explored. We propose LogicIFGen and LogicIFEval. LogicIFGen is a scalable, automated framework for generating verifiable instructions from code functions, which can naturally express rich logic such as conditionals, nesting, recursion, and function calls. We further curate a collection of complex code functions and use LogicIFGen to construct LogicIFEval, a benchmark comprising 426 verifiable logic-rich instructions. Our experiments demonstrate that current state-of-the-art LLMs still struggle to correctly follow the instructions in LogicIFEval. Most LLMs can only follow fewer than 60% of the instructions, revealing significant deficiencies in the instruction-following ability. Code and Benchmark: this https URL
摘要：以下指导催化了最近的大语言模型（LLM）时代，并且是基本技能，基础是推理和代理行为等更先进的能力。随着任务越来越具有挑战性，嵌入自然语言指令中的逻辑结构变得越来越复杂。但是，LLM在此类逻辑丰富的指令上的性能仍然不足。我们提出了逻辑和逻辑效果。 Logicifgen是一个可扩展的自动化框架，用于从代码函数中生成可验证的指令，可以自然表达富含逻辑，例如条件，嵌套，递归和功能调用。我们进一步策划了复杂代码功能的集合，并使用Logicifgen来构建LogicifeVal，这是一个包括426个可验证逻辑丰富的说明的基准。我们的实验表明，当前最新的LLM仍在努力遵循逻辑效果中的说明。大多数LLM只能遵循少于60％的说明，从而揭示了遵循指令能力的明显缺陷。代码和基准：此HTTPS URL

Title: Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models

Authors: Wen Wang, Bozhen Fang, Chenchen Jing, Yongliang Shen, Yangyi Shen, Qiuyu Wang, Hao Ouyang, Hao Chen, Chunhua Shen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.09138
Pdf URL: https://arxiv.org/pdf/2508.09138
Copy Paste: [[2508.09138]] Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models(https://arxiv.org/abs/2508.09138)
Keywords: language model, llm
Abstract: Diffusion large language models (dLLMs) generate text through iterative denoising, yet current decoding strategies discard rich intermediate predictions in favor of the final output. Our work here reveals a critical phenomenon, temporal oscillation, where correct answers often emerge in the middle process, but are overwritten in later denoising steps. To address this issue, we introduce two complementary methods that exploit temporal consistency: 1) Temporal Self-Consistency Voting, a training-free, test-time decoding strategy that aggregates predictions across denoising steps to select the most consistent output; and 2) a post-training method termed Temporal Consistency Reinforcement, which uses Temporal Semantic Entropy (TSE), a measure of semantic stability across intermediate predictions, as a reward signal to encourage stable generations. Empirical results across multiple benchmarks demonstrate the effectiveness of our approach. Using the negative TSE reward alone, we observe a remarkable average improvement of 24.7% on the Countdown dataset over an existing dLLM. Combined with the accuracy reward, we achieve absolute gains of 2.0% on GSM8K, 4.3% on MATH500, 6.6% on SVAMP, and 25.3% on Countdown, respectively. Our findings underscore the untapped potential of temporal dynamics in dLLMs and offer two simple yet effective tools to harness them.
摘要：扩散大语言模型（DLLM）通过迭代deNOSISING生成文本，但当前的解码策略丢弃了丰富的中间预测，而有利于最终输出。我们在这里的工作揭示了一种关键现象，时间振荡，在中间过程中经常出现正确的答案，但在以后的DeNoSising步骤中被覆盖。为了解决这个问题，我们介绍了利用时间一致性的两种互补方法：1）时间段落自称投票，一种无培训的测试时间解码策略，汇总了跨deNo的步骤的预测，以选择最一致的输出； 2）一种称为时间一致性增强的训练后方法，该方法使用时间语义熵（TSE），这是一个跨中间预测的语义稳定性的度量，作为鼓励稳定世代的奖励信号。多个基准的经验结果证明了我们方法的有效性。仅使用负TSE奖励，我们观察到倒计时数据集的平均平均值比现有DLLM的平均提高了24.7％。结合准确性奖励，我们在GSM8K上获得了2.0％的绝对增长，在MATH500上获得了4.3％，SVAMP的绝对增长率分别为6.6％，倒计时分别为25.3％。我们的发现强调了DLLM中时间动态的未开发潜力，并提供了两个简单而有效的工具来利用它们。
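Temporal Self-Consistency Voting, as described in this abstract, aggregates the answers decoded at intermediate denoising steps instead of trusting only the last one. A minimal sketch of such a vote follows; the abstract does not specify the weighting, so the uniform default and the optional `weight_fn` hook are assumptions, as is the function name `temporal_vote`.

```python
from collections import defaultdict

def temporal_vote(step_answers, weight_fn=lambda step, total: 1.0):
    """Pick the answer with the largest total weight across denoising steps.

    step_answers: the answer extracted at each denoising step, in order;
    None marks a step where no answer could be parsed.
    weight_fn(step, total): weight of a step (uniform by default; the real
    weighting scheme is not given in the abstract).
    """
    totals = defaultdict(float)
    num_steps = len(step_answers)
    for step, answer in enumerate(step_answers):
        if answer is None:
            continue
        totals[answer] += weight_fn(step, num_steps)
    if not totals:
        return None
    # Ties resolve to the answer that appeared first.
    return max(totals, key=totals.get)

# The final step drifts to "17", but "18" dominated the trajectory.
trajectory = ["12", "18", "18", "18", "17"]
voted = temporal_vote(trajectory)
```

Here the final-step output would be "17", while the vote recovers "18", which mirrors the temporal oscillation the abstract describes: a correct answer appears mid-process and is later overwritten.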