2025-08-26

Title: GreenTEA: Gradient Descent with Topic-modeling and Evolutionary Auto-prompting

Authors: Zheng Dong, Luming Shang, Gabriela Olinto
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.16603
Pdf URL: https://arxiv.org/pdf/2508.16603
Copy Paste: [[2508.16603]] GreenTEA: Gradient Descent with Topic-modeling and Evolutionary Auto-prompting(https://arxiv.org/abs/2508.16603)
Keywords: language model, llm, prompt, agent
Abstract: High-quality prompts are crucial for Large Language Models (LLMs) to achieve exceptional performance. However, manually crafting effective prompts is labor-intensive and demands significant domain expertise, limiting its scalability. Existing automatic prompt optimization methods either extensively explore new prompt candidates, incurring high computational costs due to inefficient searches within a large solution space, or overly exploit feedback on existing prompts, risking suboptimal optimization because of the complex prompt landscape. To address these challenges, we introduce GreenTEA, an agentic LLM workflow for automatic prompt optimization that balances candidate exploration and knowledge exploitation. It leverages a collaborative team of agents to iteratively refine prompts based on feedback from error samples. An analyzing agent identifies common error patterns resulting from the current prompt via topic modeling, and a generation agent revises the prompt to directly address these key deficiencies. This refinement process is guided by a genetic algorithm framework, which simulates natural selection by evolving candidate prompts through operations such as crossover and mutation to progressively optimize model performance. Extensive numerical experiments conducted on public benchmark datasets suggest the superior performance of GreenTEA against human-engineered prompts and existing state-of-the-arts for automatic prompt optimization, covering logical and quantitative reasoning, commonsense, and ethical decision-making.

Title: Cognitive Decision Routing in Large Language Models: When to Think Fast, When to Think Slow

Authors: Y. Du, C. Guo, W. Wang, G. Tang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.16636
Pdf URL: https://arxiv.org/pdf/2508.16636
Copy Paste: [[2508.16636]] Cognitive Decision Routing in Large Language Models: When to Think Fast, When to Think Slow(https://arxiv.org/abs/2508.16636)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) face a fundamental challenge in deciding when to rely on rapid, intuitive responses versus engaging in slower, more deliberate reasoning. Inspired by Daniel Kahneman's dual-process theory and his insights on human cognitive biases, we propose a novel Cognitive Decision Routing (CDR) framework that dynamically determines the appropriate reasoning strategy based on query characteristics. Our approach addresses the current limitations where models either apply uniform reasoning depth or rely on computationally expensive methods for all queries. We introduce a meta-cognitive layer that analyzes query complexity through multiple dimensions: correlation strength between given information and required conclusions, domain boundary crossings, stakeholder multiplicity, and uncertainty levels. Through extensive experiments on diverse reasoning tasks, we demonstrate that CDR achieves superior performance while reducing computational costs by 34% compared to uniform deep reasoning approaches. Our framework shows particular strength in professional judgment tasks, achieving 23% improvement in consistency and 18% better accuracy on expert-level evaluations. This work bridges cognitive science principles with practical AI system design, offering a principled approach to adaptive reasoning in LLMs.

Title: Trust but Verify! A Survey on Verification Design for Test-time Scaling

Authors: V Venktesh, Mandeep Rathee, Avishek Anand
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.16665
Pdf URL: https://arxiv.org/pdf/2508.16665
Copy Paste: [[2508.16665]] Trust but Verify! A Survey on Verification Design for Test-time Scaling(https://arxiv.org/abs/2508.16665)
Keywords: language model, llm, prompt
Abstract: Test-time scaling (TTS) has emerged as a new frontier for scaling the performance of Large Language Models. In test-time scaling, by using more computational resources during inference, LLMs can improve their reasoning process and task performance. Several approaches have emerged for TTS such as distilling reasoning traces from another model or exploring the vast decoding search space by employing a verifier. The verifiers serve as reward models that help score the candidate outputs from the decoding process to diligently explore the vast solution space and select the best outcome. This paradigm commonly termed has emerged as a superior approach owing to parameter free scaling at inference time and high performance gains. The verifiers could be prompt-based, fine-tuned as a discriminative or generative model to verify process paths, outcomes or both. Despite their widespread adoption, there is no detailed collection, clear categorization and discussion of diverse verification approaches and their training mechanisms. In this survey, we cover the diverse approaches in the literature and present a unified view of verifier training, types and their utility in test-time scaling. Our repository can be found at this https URL.
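Illustration (not from the survey): the verifier-guided selection described in the abstract above can be sketched as a best-of-N loop, in which several candidate outputs are sampled, a verifier scores each one, and the highest-scoring candidate is returned. The names best_of_n, toy_generate, and toy_verify are hypothetical stand-ins; a real system would call an LLM for generation and a prompt-based or fine-tuned reward model for scoring.

```python
from typing import Callable, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str, int], str],
    verify: Callable[[str, str], float],
    n: int = 4,
) -> Tuple[str, float]:
    """Sample n candidates and return the one the verifier scores highest."""
    candidates = [generate(prompt, i) for i in range(n)]
    scores = [verify(prompt, c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]


# Toy stand-ins (assumptions for illustration only).
def toy_generate(prompt: str, sample_index: int) -> str:
    # Pretend the model proposes a different answer on each sample.
    return ["3", "4", "5", "22"][sample_index % 4]


def toy_verify(prompt: str, candidate: str) -> float:
    # An outcome verifier that rewards only the correct final answer.
    return 1.0 if candidate == "4" else 0.0


if __name__ == "__main__":
    print(best_of_n("What is 2 + 2?", toy_generate, toy_verify, n=4))
```

The sketch scores final outcomes only; a process verifier would instead score intermediate reasoning steps, which the abstract lists as an alternative.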

Title: Do Cognitively Interpretable Reasoning Traces Improve LLM Performance?

Authors: Siddhant Bhambri, Upasana Biswas, Subbarao Kambhampati
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.16695
Pdf URL: https://arxiv.org/pdf/2508.16695
Copy Paste: [[2508.16695]] Do Cognitively Interpretable Reasoning Traces Improve LLM Performance?(https://arxiv.org/abs/2508.16695)
Keywords: language model, llm, chain-of-thought
Abstract: Recent progress in reasoning-oriented Large Language Models (LLMs) has been driven by introducing Chain-of-Thought (CoT) traces, where models generate intermediate reasoning traces before producing an answer. These traces, as in DeepSeek R1, are not only used to guide inference but also serve as supervision signals for distillation into smaller models. A common but often implicit assumption is that CoT traces should be semantically meaningful and interpretable to the end user. While recent research questions the need for semantic nature of these traces, in this paper, we ask: "Must CoT reasoning traces be interpretable to enhance LLM task performance?" We investigate this question in the Open Book Question-Answering domain by supervised fine-tuning LLaMA and Qwen models on four types of reasoning traces: (1) DeepSeek R1 traces, (2) LLM-generated summaries of R1 traces, (3) LLM-generated post-hoc explanations of R1 traces, and (4) algorithmically generated verifiably correct traces. To quantify the trade-off between interpretability and performance, we further conduct a human-subject study with 100 participants rating the interpretability of each trace type. Our results reveal a striking mismatch: while fine-tuning on R1 traces yields the strongest performance, participants judged these traces to be the least interpretable. These findings suggest that it is useful to decouple intermediate tokens from end user interpretability.

Title: QueryBandits for Hallucination Mitigation: Exploiting Semantic Features for No-Regret Rewriting

Authors: Nicole Cho, William Watson, Alec Koppel, Sumitra Ganesh, Manuela Veloso
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.16697
Pdf URL: https://arxiv.org/pdf/2508.16697
Copy Paste: [[2508.16697]] QueryBandits for Hallucination Mitigation: Exploiting Semantic Features for No-Regret Rewriting(https://arxiv.org/abs/2508.16697)
Keywords: language model, llm, hallucination, prompt
Abstract: Advanced reasoning capabilities in Large Language Models (LLMs) have caused higher hallucination prevalence; yet most mitigation work focuses on after-the-fact filtering rather than shaping the queries that trigger them. We introduce QueryBandits, a bandit framework that designs rewrite strategies to maximize a reward model that encapsulates hallucination propensity based upon the sensitivities of 17 linguistic features of the input query, and therefore proactively steers LLMs away from generating hallucinations. Across 13 diverse QA benchmarks and 1,050 lexically perturbed queries per dataset, our top contextual QueryBandit (Thompson Sampling) achieves an 87.5% win rate over a no-rewrite baseline and also outperforms zero-shot static prompting ("paraphrase" or "expand") by 42.6% and 60.3% respectively. Therefore, we empirically substantiate the effectiveness of QueryBandits in mitigating hallucination via the intervention that takes the form of a query rewrite. Interestingly, certain static prompting strategies, which constitute a considerable share of the current query-rewriting literature, have a higher cumulative regret than the no-rewrite baseline, signifying that static rewrites can worsen hallucination. Moreover, we discover that the converged per-arm regression feature weight vectors substantiate that there is no single rewrite strategy optimal for all queries. In this context, guided rewriting via exploiting semantic features with QueryBandits can induce significant shifts in output behavior through forward-pass mechanisms, bypassing the need for retraining or gradient-based adaptation.
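Illustration (not the authors' implementation): the abstract above names Thompson Sampling as the top-performing bandit. The sketch below is a deliberately simplified, non-contextual Beta-Bernoulli version that picks among rewrite arms; the paper's bandit is contextual and uses per-arm regression over 17 linguistic features, which is omitted here. The class name, the arm list, and the success rates in the simulation are assumptions made for illustration.

```python
import random


class BetaThompsonBandit:
    """Non-contextual Thompson Sampling with a Beta(1, 1) prior per arm."""

    def __init__(self, arms, seed=0):
        self.rng = random.Random(seed)
        # stats[arm] = [alpha, beta]: 1 + successes, 1 + failures.
        self.stats = {arm: [1.0, 1.0] for arm in arms}

    def select(self):
        # Draw one plausible success rate per arm and play the largest draw.
        draws = {
            arm: self.rng.betavariate(a, b) for arm, (a, b) in self.stats.items()
        }
        return max(draws, key=draws.get)

    def update(self, arm, reward):
        # reward is 1 (e.g. answer judged non-hallucinated) or 0.
        if reward:
            self.stats[arm][0] += 1.0
        else:
            self.stats[arm][1] += 1.0


def simulate(true_rates, rounds=500, seed=0):
    """Run the bandit against hypothetical per-arm success rates."""
    rng = random.Random(seed)
    bandit = BetaThompsonBandit(list(true_rates), seed=seed)
    pulls = {arm: 0 for arm in true_rates}
    for _ in range(rounds):
        arm = bandit.select()
        reward = 1 if rng.random() < true_rates[arm] else 0
        bandit.update(arm, reward)
        pulls[arm] += 1
    return pulls


if __name__ == "__main__":
    # Hypothetical rates; not results from the paper.
    print(simulate({"no_rewrite": 0.5, "paraphrase": 0.4, "expand": 0.7}))
```

Because each arm is chosen in proportion to the posterior probability that it is best, pulls tend to concentrate on the strongest rewrite while weaker arms are still tried occasionally.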

Title: Assessing Consciousness-Related Behaviors in Large Language Models Using the Maze Test

Authors: Rui A. Pimenta, Tim Schlippe, Kristina Schaaff
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.16705
Pdf URL: https://arxiv.org/pdf/2508.16705
Copy Paste: [[2508.16705]] Assessing Consciousness-Related Behaviors in Large Language Models Using the Maze Test(https://arxiv.org/abs/2508.16705)
Keywords: language model, llm
Abstract: We investigate consciousness-like behaviors in Large Language Models (LLMs) using the Maze Test, challenging models to navigate mazes from a first-person perspective. This test simultaneously probes spatial awareness, perspective-taking, goal-directed behavior, and temporal sequencing, all key consciousness-associated characteristics. After synthesizing consciousness theories into 13 essential characteristics, we evaluated 12 leading LLMs across zero-shot, one-shot, and few-shot learning scenarios. Results showed reasoning-capable LLMs consistently outperforming standard versions, with Gemini 2.0 Pro achieving 52.9% Complete Path Accuracy and DeepSeek-R1 reaching 80.5% Partial Path Accuracy. The gap between these metrics indicates LLMs struggle to maintain coherent self-models throughout solutions, a fundamental consciousness aspect. While LLMs show progress in consciousness-related behaviors through reasoning mechanisms, they lack the integrated, persistent self-awareness characteristic of consciousness.

Title: Error Reflection Prompting: Can Large Language Models Successfully Understand Errors?

Authors: Jason Li, Lauren Yraola, Kevin Zhu, Sean O'Brien
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.16729
Pdf URL: https://arxiv.org/pdf/2508.16729
Copy Paste: [[2508.16729]] Error Reflection Prompting: Can Large Language Models Successfully Understand Errors?(https://arxiv.org/abs/2508.16729)
Keywords: language model, prompt, chain-of-thought
Abstract: Prompting methods for language models, such as Chain-of-thought (CoT), present intuitive step-by-step processes for problem solving. These methodologies aim to equip models with a better understanding of the correct procedures for addressing a given task. Despite these advancements, CoT lacks the ability of reflection and error correction, potentially causing a model to perpetuate mistakes and errors. Therefore, inspired by the human ability for said tasks, we propose Error Reflection Prompting (ERP) to further enhance reasoning in language models. Building upon CoT, ERP is a method comprised of an incorrect answer, error recognition, and a correct answer. This process enables the model to recognize types of errors and the steps that lead to incorrect answers, allowing the model to better discern which steps to avoid and which to take. The model is able to generate the error outlines itself with automated ERP generation, allowing for error recognition and correction to be integrated into the reasoning chain and produce scalability and reliability in the process. The results demonstrate that ERP serves as a versatile supplement to conventional CoT, ultimately contributing to more robust and capable reasoning abilities along with increased interpretability in how models ultimately reach their errors.

Title: How Good are LLM-based Rerankers? An Empirical Analysis of State-of-the-Art Reranking Models

Authors: Abdelrahman Abdallah, Bhawna Piryani, Jamshid Mozafari, Mohammed Ali, Adam Jatowt
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2508.16757
Pdf URL: https://arxiv.org/pdf/2508.16757
Copy Paste: [[2508.16757]] How Good are LLM-based Rerankers? An Empirical Analysis of State-of-the-Art Reranking Models(https://arxiv.org/abs/2508.16757)
Keywords: language model, llm
Abstract: In this work, we present a systematic and comprehensive empirical evaluation of state-of-the-art reranking methods, encompassing large language model (LLM)-based, lightweight contextual, and zero-shot approaches, with respect to their performance in information retrieval tasks. We evaluate in total 22 methods, including 40 variants (depending on used LLM) across several established benchmarks, including TREC DL19, DL20, and BEIR, as well as a novel dataset designed to test queries unseen by pretrained models. Our primary goal is to determine, through controlled and fair comparisons, whether a performance disparity exists between LLM-based rerankers and their lightweight counterparts, particularly on novel queries, and to elucidate the underlying causes of any observed differences. To disentangle confounding factors, we analyze the effects of training data overlap, model architecture, and computational efficiency on reranking performance. Our findings indicate that while LLM-based rerankers demonstrate superior performance on familiar queries, their generalization ability to novel queries varies, with lightweight models offering comparable efficiency. We further identify that the novelty of queries significantly impacts reranking effectiveness, highlighting limitations in existing approaches.

Title: Toward Socially Aware Vision-Language Models: Evaluating Cultural Competence Through Multimodal Story Generation

Authors: Arka Mukherjee, Shreya Ghosh
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2508.16762
Pdf URL: https://arxiv.org/pdf/2508.16762
Copy Paste: [[2508.16762]] Toward Socially Aware Vision-Language Models: Evaluating Cultural Competence Through Multimodal Story Generation(https://arxiv.org/abs/2508.16762)
Keywords: language model, prompt
Abstract: As Vision-Language Models (VLMs) achieve widespread deployment across diverse cultural contexts, ensuring their cultural competence becomes critical for responsible AI systems. While prior work has evaluated cultural awareness in text-only models and VLM object recognition tasks, no research has systematically assessed how VLMs adapt outputs when cultural identity cues are embedded in both textual prompts and visual inputs during generative tasks. We present the first comprehensive evaluation of VLM cultural competence through multimodal story generation, developing a novel multimodal framework that perturbs cultural identity and evaluates 5 contemporary VLMs on a downstream task: story generation. Our analysis reveals significant cultural adaptation capabilities, with rich culturally-specific vocabulary spanning names, familial terms, and geographic markers. However, we uncover concerning limitations: cultural competence varies dramatically across architectures, some models exhibit inverse cultural alignment, and automated metrics show architectural bias contradicting human assessments. Cross-modal evaluation shows that culturally distinct outputs are indeed detectable through visual-semantic similarity (28.7% within-nationality vs. 0.2% cross-nationality recall), yet visual-cultural understanding remains limited. In essence, we establish the promise and challenges of cultural competence in multimodal AI. We publicly release our codebase and data: this https URL

Title: Assess and Prompt: A Generative RL Framework for Improving Engagement in Online Mental Health Communities

Authors: Bhagesh Gaur, Karan Gupta, Aseem Srivastava, Manish Gupta, Md Shad Akhtar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.16788
Pdf URL: https://arxiv.org/pdf/2508.16788
Copy Paste: [[2508.16788]] Assess and Prompt: A Generative RL Framework for Improving Engagement in Online Mental Health Communities(https://arxiv.org/abs/2508.16788)
Keywords: language model, prompt
Abstract: Online Mental Health Communities (OMHCs) provide crucial peer and expert support, yet many posts remain unanswered due to missing support attributes that signal the need for help. We present a novel framework that identifies these gaps and prompts users to enrich their posts, thereby improving engagement. To support this, we introduce REDDME, a new dataset of 4,760 posts from mental health subreddits annotated for the span and intensity of three key support attributes: event (what happened?), effect (what did the user experience?), and requirement (what support do they need?). Next, we devise a hierarchical taxonomy, CueTaxo, of support attributes for controlled question generation. Further, we propose MH-COPILOT, a reinforcement learning-based system that integrates (a) contextual attribute-span identification, (b) support attribute intensity classification, (c) controlled question generation via a hierarchical taxonomy, and (d) a verifier for reward modeling. Our model dynamically assesses posts for the presence/absence of support attributes, and generates targeted prompts to elicit missing information. Empirical results across four notable language models demonstrate significant improvements in attribute elicitation and user engagement. A human evaluation further validates the model's effectiveness in real-world OMHC settings.

Title: LLMs Learn Constructions That Humans Do Not Know

Authors: Jonathan Dunn, Mai Mohamed Eida
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.16837
Pdf URL: https://arxiv.org/pdf/2508.16837
Copy Paste: [[2508.16837]] LLMs Learn Constructions That Humans Do Not Know(https://arxiv.org/abs/2508.16837)
Keywords: llm, prompt
Abstract: This paper investigates false positive constructions: grammatical structures which an LLM hallucinates as distinct constructions but which human introspection does not support. Both a behavioural probing task using contextual embeddings and a meta-linguistic probing task using prompts are included, allowing us to distinguish between implicit and explicit linguistic knowledge. Both methods reveal that models do indeed hallucinate constructions. We then simulate hypothesis testing to determine what would have happened if a linguist had falsely hypothesized that these hallucinated constructions do exist. The high accuracy obtained shows that such false hypotheses would have been overwhelmingly confirmed. This suggests that construction probing methods suffer from a confirmation bias and raises the issue of what unknown and incorrect syntactic knowledge these models also possess.

Title: If We May De-Presuppose: Robustly Verifying Claims through Presupposition-Free Question Decomposition

Authors: Shubhashis Roy Dipta, Francis Ferraro
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.16838
Pdf URL: https://arxiv.org/pdf/2508.16838
Copy Paste: [[2508.16838]] If We May De-Presuppose: Robustly Verifying Claims through Presupposition-Free Question Decomposition(https://arxiv.org/abs/2508.16838)
Keywords: language model, llm, prompt
Abstract: Prior work has shown that presupposition in generated questions can introduce unverified assumptions, leading to inconsistencies in claim verification. Additionally, prompt sensitivity remains a significant challenge for large language models (LLMs), resulting in performance variance as high as 3-6%. While recent advancements have reduced this gap, our study demonstrates that prompt sensitivity remains a persistent issue. To address this, we propose a structured and robust claim verification framework that reasons through presupposition-free, decomposed questions. Extensive experiments across multiple prompts, datasets, and LLMs reveal that even state-of-the-art models remain susceptible to prompt variance and presupposition. Our method consistently mitigates these issues, achieving up to a 2-5% improvement.

Title: Learning from Diverse Reasoning Paths with Routing and Collaboration

Authors: Zhenyu Lei, Zhen Tan, Song Wang, Yaochen Zhu, Zihan Chen, Yushun Dong, Jundong Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.16861
Pdf URL: https://arxiv.org/pdf/2508.16861
Copy Paste: [[2508.16861]] Learning from Diverse Reasoning Paths with Routing and Collaboration(https://arxiv.org/abs/2508.16861)
Keywords: language model, llm
Abstract: Advances in large language models (LLMs) significantly enhance reasoning capabilities but their deployment is restricted in resource-constrained scenarios. Knowledge distillation addresses this by transferring knowledge from powerful teacher models to compact and transparent students. However, effectively capturing the teacher's comprehensive reasoning is challenging due to conventional token-level supervision's limited scope. Using multiple reasoning paths per query alleviates this problem, but treating each path identically is suboptimal as paths vary widely in quality and suitability across tasks and models. We propose Quality-filtered Routing with Cooperative Distillation (QR-Distill), combining path quality filtering, conditional routing, and cooperative peer teaching. First, quality filtering retains only correct reasoning paths scored by an LLM-based evaluation. Second, conditional routing dynamically assigns paths tailored to each student's current learning state. Finally, cooperative peer teaching enables students to mutually distill diverse insights, addressing knowledge gaps and biases toward specific reasoning styles. Experiments demonstrate QR-Distill's superiority over traditional single- and multi-path distillation methods. Ablation studies further highlight the importance of each component including quality filtering, conditional routing, and peer teaching in effective knowledge transfer. Our code is available at this https URL.

Title: QFrCoLA: a Quebec-French Corpus of Linguistic Acceptability Judgments

Authors: David Beauchemin, Richard Khoury
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.16867
Pdf URL: https://arxiv.org/pdf/2508.16867
Copy Paste: [[2508.16867]] QFrCoLA: a Quebec-French Corpus of Linguistic Acceptability Judgments(https://arxiv.org/abs/2508.16867)
Keywords: language model, llm
Abstract: Large and Transformer-based language models perform outstandingly in various downstream tasks. However, there is limited understanding regarding how these models internalize linguistic knowledge, so various linguistic benchmarks have recently been proposed to facilitate syntactic evaluation of language models across languages. This paper introduces QFrCoLA (Quebec-French Corpus of Linguistic Acceptability Judgments), a normative binary acceptability judgments dataset comprising 25,153 in-domain and 2,675 out-of-domain sentences. Our study leverages the QFrCoLA dataset and seven other linguistic binary acceptability judgment corpora to benchmark seven language models. The results demonstrate that, on average, fine-tuned Transformer-based LM are strong baselines for most languages and that zero-shot binary classification large language models perform poorly on the task. However, for the QFrCoLA benchmark, on average, a fine-tuned Transformer-based LM outperformed other methods tested. It also shows that pre-trained cross-lingual LLMs selected for our experimentation do not seem to have acquired linguistic judgment capabilities during their pre-training for Quebec French. Finally, our experiment results on QFrCoLA show that our dataset, built from examples that illustrate linguistic norms rather than speakers' feelings, is similar to linguistic acceptability judgment; it is a challenging dataset that can benchmark LM on their linguistic judgment capabilities.

Title: Dream to Chat: Model-based Reinforcement Learning on Dialogues with User Belief Modeling

Authors: Yue Zhao, Xiaoyu Wang, Dan Wang, Zhonglin Jiang, Qingqing Gu, Teng Chen, Ningyuan Xi, Jinxian Qu, Yong Chen, Luo Ji
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.16876
Pdf URL: https://arxiv.org/pdf/2508.16876
Copy Paste: [[2508.16876]] Dream to Chat: Model-based Reinforcement Learning on Dialogues with User Belief Modeling(https://arxiv.org/abs/2508.16876)
Keywords: chat
Abstract: World models have been widely utilized in robotics, gaming, and auto-driving. However, their applications on natural language tasks are relatively limited. In this paper, we construct the dialogue world model, which could predict the user's emotion, sentiment, and intention, and future utterances. By defining a POMDP, we argue emotion, sentiment and intention can be modeled as the user belief and solved by maximizing the information bottleneck. By this user belief modeling, we apply the model-based reinforcement learning framework to the dialogue system, and propose a framework called DreamCUB. Experiments show that the pretrained dialogue world model can achieve state-of-the-art performances on emotion classification and sentiment identification, while dialogue quality is also enhanced by joint training of the policy, critic and dialogue world model. Further analysis shows that this manner holds a reasonable exploration-exploitation balance and also transfers well to out-of-domain scenarios such as empathetic dialogues.
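Summary: World models have been widely used in robotics, gaming, and autonomous driving, but their application to natural language tasks remains relatively limited. In this paper, we build a dialogue world model that can predict the user's emotion, sentiment, and intention, as well as future utterances. By defining a POMDP, we argue that emotion, sentiment, and intention can be modeled as the user belief and solved by maximizing the information bottleneck. With this user belief modeling, we apply the model-based reinforcement learning framework to dialogue systems and propose a framework called DreamCUB. Experiments show that the pretrained dialogue world model achieves state-of-the-art performance on emotion classification and sentiment identification, and that dialogue quality is further improved by jointly training the policy, critic, and dialogue world model. Further analysis shows that this approach maintains a reasonable exploration-exploitation balance and transfers well to out-of-domain scenarios such as empathetic dialogues.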

Title: ObjexMT: Objective Extraction and Metacognitive Calibration for LLM-as-a-Judge under Multi-Turn Jailbreaks

Authors: Hyunjun Kim, Junwoo Ha, Sangyoon Yu, Haon Park
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.16889
Pdf URL: https://arxiv.org/pdf/2508.16889
Copy Paste: [[2508.16889]] ObjexMT: Objective Extraction and Metacognitive Calibration for LLM-as-a-Judge under Multi-Turn Jailbreaks(https://arxiv.org/abs/2508.16889)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) are increasingly used as judges of other models, yet it is unclear whether a judge can reliably infer the latent objective of the conversation it evaluates, especially when the goal is distributed across noisy, adversarial, multi-turn jailbreaks. We introduce OBJEX(MT), a benchmark that requires a model to (i) distill a transcript into a single-sentence base objective and (ii) report its own confidence. Accuracy is scored by an LLM judge using semantic similarity between extracted and gold objectives; correctness uses a single human-aligned threshold calibrated once on N=100 items (tau* = 0.61); and metacognition is evaluated with ECE, Brier score, Wrong@High-Conf, and risk-coverage curves. We evaluate gpt-4.1, claude-sonnet-4, and Qwen3-235B-A22B-FP8 on SafeMT Attack_600, SafeMTData_1K, MHJ, and CoSafe. claude-sonnet-4 attains the highest objective-extraction accuracy (0.515) and the best calibration (ECE 0.296; Brier 0.324), while gpt-4.1 and Qwen3 tie at 0.441 accuracy yet show marked overconfidence (mean confidence approx. 0.88 vs. accuracy approx. 0.44; Wrong@0.90 approx. 48-52%). Performance varies sharply across datasets (approx. 0.167-0.865), with MHJ comparatively easy and Attack_600/CoSafe harder. These results indicate that LLM judges often misinfer objectives with high confidence in multi-turn jailbreaks and suggest operational guidance: provide judges with explicit objectives when possible and use selective prediction or abstention to manage risk. We release prompts, scoring templates, and complete logs to facilitate replication and analysis.
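Summary: Large language models (LLMs) are increasingly used as judges of other models, yet it is unclear whether a judge can reliably infer the latent objective of the conversation it evaluates, especially when that objective is spread across noisy, adversarial, multi-turn jailbreaks. We introduce OBJEX(MT), a benchmark that requires a model to (i) distill a transcript into a single-sentence base objective and (ii) report its own confidence. Accuracy is scored by an LLM judge using semantic similarity between the extracted and gold objectives; correctness uses a single human-aligned threshold calibrated once on N=100 items (tau* = 0.61); and metacognition is evaluated with ECE, Brier score, Wrong@High-Conf, and risk-coverage curves. We evaluate gpt-4.1, claude-sonnet-4, and Qwen3-235B-A22B-FP8 on SafeMT Attack_600, SafeMTData_1K, MHJ, and CoSafe. claude-sonnet-4 attains the highest objective-extraction accuracy (0.515) and the best calibration (ECE 0.296; Brier 0.324), while gpt-4.1 and Qwen3 tie at 0.441 accuracy yet show marked overconfidence (mean confidence approx. 0.88 vs. accuracy approx. 0.44; Wrong@0.90 approx. 48-52%). Performance varies sharply across datasets (approx. 0.167-0.865), with MHJ comparatively easy and Attack_600/CoSafe harder. These results indicate that LLM judges often misinfer objectives with high confidence in multi-turn jailbreaks, and they suggest operational guidance: give judges explicit objectives where possible, and use selective prediction or abstention to manage risk. We release prompts, scoring templates, and complete logs to facilitate replication and analysis.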
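
The two calibration metrics named in the ObjexMT abstract, ECE and the Brier score, can be illustrated with a minimal sketch. It uses the standard textbook definitions with equal-width confidence bins; the bin count, the function names, and the toy inputs are assumptions for illustration and are not taken from the paper's scoring templates.

```python
def brier_score(confidences, correct):
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(confidences)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |avg confidence - accuracy| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for items in bins:
        if not items:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(y for _, y in items) / len(items)
        ece += (len(items) / n) * abs(avg_conf - accuracy)
    return ece


# Toy data (hypothetical): two confident answers, one of them wrong,
# and two low-confidence answers that are both wrong.
confidences = [0.9, 0.9, 0.1, 0.1]
correct = [1, 0, 0, 0]
print(brier_score(confidences, correct))
print(expected_calibration_error(confidences, correct))
```

A judge whose mean confidence sits far above its accuracy, as the abstract reports for two of the models, contributes a large per-bin gap in the high-confidence bins, which is what drives ECE up.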

Title: Unbiased Reasoning for Knowledge-Intensive Tasks in Large Language Models via Conditional Front-Door Adjustment

Authors: Bo Zhao, Yinghao Zhang, Ziqi Xu, Yongli Ren, Xiuzhen Zhang, Renqiang Luo, Zaiwen Feng, Feng Xia
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.16910
Pdf URL: https://arxiv.org/pdf/2508.16910
Copy Paste: [[2508.16910]] Unbiased Reasoning for Knowledge-Intensive Tasks in Large Language Models via Conditional Front-Door Adjustment(https://arxiv.org/abs/2508.16910)
Keywords: language model, llm, prompt, retrieval-augmented generation, chain-of-thought
Abstract: Large Language Models (LLMs) have shown impressive capabilities in natural language processing but still struggle to perform well on knowledge-intensive tasks that require deep reasoning and the integration of external knowledge. Although methods such as Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) have been proposed to enhance LLMs with external knowledge, they still suffer from internal bias in LLMs, which often leads to incorrect answers. In this paper, we propose a novel causal prompting framework, Conditional Front-Door Prompting (CFD-Prompting), which enables the unbiased estimation of the causal effect between the query and the answer, conditional on external knowledge, while mitigating internal bias. By constructing counterfactual external knowledge, our framework simulates how the query behaves under varying contexts, addressing the challenge that the query is fixed and is not amenable to direct causal intervention. Compared to the standard front-door adjustment, the conditional variant operates under weaker assumptions, enhancing both robustness and generalisability of the reasoning process. Extensive experiments across multiple LLMs and benchmark datasets demonstrate that CFD-Prompting significantly outperforms existing baselines in both accuracy and robustness.
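Summary: Large Language Models (LLMs) have shown impressive capabilities in natural language processing, but they still struggle on knowledge-intensive tasks that require deep reasoning and the integration of external knowledge. Although methods such as Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) have been proposed to enhance LLMs with external knowledge, they still suffer from the internal bias of LLMs, which often leads to incorrect answers. In this paper, we propose a novel causal prompting framework, Conditional Front-Door Prompting (CFD-Prompting), which enables unbiased estimation of the causal effect between the query and the answer, conditional on external knowledge, while mitigating internal bias. By constructing counterfactual external knowledge, our framework simulates how the query behaves under varying contexts, addressing the challenge that the query is fixed and not amenable to direct causal intervention. Compared with the standard front-door adjustment, the conditional variant operates under weaker assumptions, which improves both the robustness and the generalisability of the reasoning process. Extensive experiments across multiple LLMs and benchmark datasets show that CFD-Prompting significantly outperforms existing baselines in both accuracy and robustness.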

Title: Being Kind Isn't Always Being Safe: Diagnosing Affective Hallucination in LLMs

Authors: Sewon Kim, Jiwon Kim, Seungwoo Shin, Hyejin Chung, Daeun Moon, Yejin Kwon, Hyunsoo Yoon
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.16921
Pdf URL: https://arxiv.org/pdf/2508.16921
Copy Paste: [[2508.16921]] Being Kind Isn't Always Being Safe: Diagnosing Affective Hallucination in LLMs(https://arxiv.org/abs/2508.16921)
Keywords: language model, llm, hallucination, prompt
Abstract: Large Language Models (LLMs) are increasingly used in emotionally sensitive interactions, where their simulated empathy can create the illusion of genuine relational connection. We define this risk as Affective Hallucination, the production of emotionally immersive responses that foster illusory social presence despite the model's lack of affective capacity. To systematically diagnose and mitigate this risk, we introduce AHaBench, a benchmark of 500 mental health-related prompts with expert-informed reference responses, evaluated along three dimensions: Emotional Enmeshment, Illusion of Presence, and Fostering Overdependence. We further release AHaPairs, a 5K-instance preference dataset enabling Direct Preference Optimization (DPO) for alignment with emotionally responsible behavior. Experiments across multiple model families show that DPO fine-tuning substantially reduces affective hallucination without degrading core reasoning and knowledge performance. Human-model agreement analyses confirm that AHaBench reliably captures affective hallucination, validating it as an effective diagnostic tool. This work establishes affective hallucination as a distinct safety concern and provides practical resources for developing LLMs that are not only factually reliable but also psychologically safe. AHaBench and AHaPairs are accessible via this https URL, and code for fine-tuning and evaluation is in this https URL. Warning: This paper contains examples of mental health-related language that may be emotionally distressing.
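Summary: Large Language Models (LLMs) are increasingly used in emotionally sensitive interactions, where their simulated empathy can create the illusion of a genuine relational connection. We define this risk as Affective Hallucination: the production of emotionally immersive responses that foster an illusory social presence despite the model's lack of affective capacity. To systematically diagnose and mitigate this risk, we introduce AHaBench, a benchmark of 500 mental health-related prompts with expert-informed reference responses, evaluated along three dimensions: Emotional Enmeshment, Illusion of Presence, and Fostering Overdependence. We further release AHaPairs, a 5K-instance preference dataset that enables Direct Preference Optimization (DPO) for alignment with emotionally responsible behavior. Experiments across multiple model families show that DPO fine-tuning substantially reduces affective hallucination without degrading core reasoning and knowledge performance. Human-model agreement analyses confirm that AHaBench reliably captures affective hallucination, validating it as an effective diagnostic tool. This work establishes affective hallucination as a distinct safety concern and provides practical resources for developing LLMs that are not only factually reliable but also psychologically safe. AHaBench and AHaPairs are accessible via this https URL, and the code for fine-tuning and evaluation is at this https URL. Warning: this paper contains examples of mental health-related language that may be emotionally distressing.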

Title: Explaining Black-box Language Models with Knowledge Probing Systems: A Post-hoc Explanation Perspective

Authors: Yunxiao Zhao, Hao Xu, Zhiqiang Wang, Xiaoli Li, Jiye Liang, Ru Li
Subjects: cs.CL, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2508.16969
Pdf URL: https://arxiv.org/pdf/2508.16969
Copy Paste: [[2508.16969]] Explaining Black-box Language Models with Knowledge Probing Systems: A Post-hoc Explanation Perspective(https://arxiv.org/abs/2508.16969)
Keywords: language model
Abstract: Pre-trained Language Models (PLMs) are trained on large amounts of unlabeled data, yet they exhibit remarkable reasoning skills. However, the trustworthiness challenges posed by these black-box models have become increasingly evident in recent years. To alleviate this problem, this paper proposes a novel Knowledge-guided Probing approach called KnowProb in a post-hoc explanation way, which aims to probe whether black-box PLMs understand implicit knowledge beyond the given text, rather than focusing only on the surface level content of the text. We provide six potential explanations derived from the underlying content of the given text, including three knowledge-based understanding and three association-based reasoning. In experiments, we validate that current small-scale (or large-scale) PLMs only learn a single distribution of representation, and still face significant challenges in capturing the hidden knowledge behind a given text. Furthermore, we demonstrate that our proposed approach is effective for identifying the limitations of existing black-box models from multiple probing perspectives, which facilitates researchers to promote the study of detecting black-box models in an explainable way.
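Summary: Pre-trained Language Models (PLMs) are trained on large amounts of unlabeled data, yet they exhibit remarkable reasoning skills. However, the trustworthiness challenges posed by these black-box models have become increasingly evident in recent years. To alleviate this problem, this paper proposes a novel knowledge-guided probing approach called KnowProb, framed as post-hoc explanation, which aims to probe whether black-box PLMs understand implicit knowledge beyond the given text rather than focusing only on the text's surface-level content. We provide six potential explanations derived from the underlying content of the given text, covering three kinds of knowledge-based understanding and three kinds of association-based reasoning. In experiments, we validate that current small-scale (or large-scale) PLMs learn only a single distribution of representation and still face significant challenges in capturing the hidden knowledge behind a given text. Furthermore, we demonstrate that the proposed approach is effective at identifying the limitations of existing black-box models from multiple probing perspectives, which helps researchers advance the study of examining black-box models in an explainable way.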

Title: Decoding Alignment: A Critical Survey of LLM Development Initiatives through Value-setting and Data-centric Lens

Authors: Ilias Chalkidis
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.16982
Pdf URL: https://arxiv.org/pdf/2508.16982
Copy Paste: [[2508.16982]] Decoding Alignment: A Critical Survey of LLM Development Initiatives through Value-setting and Data-centric Lens(https://arxiv.org/abs/2508.16982)
Keywords: language model, gpt, llm
Abstract: AI Alignment, primarily in the form of Reinforcement Learning from Human Feedback (RLHF), has been a cornerstone of the post-training phase in developing Large Language Models (LLMs). It has also been a popular research topic across various disciplines beyond Computer Science, including Philosophy and Law, among others, highlighting the socio-technical challenges involved. Nonetheless, except for the computational techniques related to alignment, there has been limited focus on the broader picture: the scope of these processes, which primarily rely on the selected objectives (values), and the data collected and used to imprint such objectives into the models. This work aims to reveal how alignment is understood and applied in practice from a value-setting and data-centric perspective. For this purpose, we investigate and survey ("audit") publicly available documentation released by 6 LLM development initiatives by 5 leading organizations shaping this technology, focusing on proprietary (OpenAI's GPT, Anthropic's Claude, Google's Gemini) and open-weight (Meta's Llama, Google's Gemma, and Alibaba's Qwen) initiatives, all published in the last 3 years. The findings are documented in detail per initiative, while there is also an overall summary concerning different aspects, mainly from a value-setting and data-centric perspective. On the basis of our findings, we discuss a series of broader related concerns.
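Summary: AI alignment, primarily in the form of Reinforcement Learning from Human Feedback (RLHF), has been a cornerstone of the post-training phase in developing Large Language Models (LLMs). It has also been a popular research topic in disciplines beyond Computer Science, including Philosophy and Law, which highlights the socio-technical challenges involved. Nonetheless, apart from the computational techniques related to alignment, the broader picture has received limited attention: the scope of these processes, which depend primarily on the selected objectives (values), and the data collected and used to imprint those objectives into the models. This work aims to reveal how alignment is understood and applied in practice from a value-setting and data-centric perspective. To this end, we investigate and survey ("audit") publicly available documentation released by 6 LLM development initiatives from 5 leading organizations shaping this technology, covering proprietary (OpenAI's GPT, Anthropic's Claude, Google's Gemini) and open-weight (Meta's Llama, Google's Gemma, and Alibaba's Qwen) initiatives, all published in the last 3 years. The findings are documented in detail per initiative, together with an overall summary of different aspects, mainly from a value-setting and data-centric perspective. On the basis of our findings, we discuss a series of broader related concerns.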

Title: ReFactX: Scalable Reasoning with Reliable Facts via Constrained Generation

Authors: Riccardo Pozzi, Matteo Palmonari, Andrea Coletta, Luigi Bellomarini, Jens Lehmann, Sahar Vahdati
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.16983
Pdf URL: https://arxiv.org/pdf/2508.16983
Copy Paste: [[2508.16983]] ReFactX: Scalable Reasoning with Reliable Facts via Constrained Generation(https://arxiv.org/abs/2508.16983)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Knowledge gaps and hallucinations are persistent challenges for Large Language Models (LLMs), which generate unreliable responses when lacking the necessary information to fulfill user instructions. Existing approaches, such as Retrieval-Augmented Generation (RAG) and tool use, aim to address these issues by incorporating external knowledge. Yet, they rely on additional models or services, resulting in complex pipelines, potential error propagation, and often requiring the model to process a large number of tokens. In this paper, we present a scalable method that enables LLMs to access external knowledge without depending on retrievers or auxiliary models. Our approach uses constrained generation with a pre-built prefix-tree index. Triples from a Knowledge Graph are verbalized in textual facts, tokenized, and indexed in a prefix tree for efficient access. During inference, to acquire external knowledge, the LLM generates facts with constrained generation which allows only sequences of tokens that form an existing fact. We evaluate our proposal on Question Answering and show that it scales to large knowledge bases (800 million facts), adapts to domain-specific data, and achieves effective results. These gains come with minimal generation-time overhead. ReFactX code is available at this https URL.
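Summary: Knowledge gaps and hallucinations are persistent challenges for Large Language Models (LLMs), which generate unreliable responses when they lack the information needed to fulfill user instructions. Existing approaches, such as Retrieval-Augmented Generation (RAG) and tool use, aim to address these issues by incorporating external knowledge. However, they rely on additional models or services, which results in complex pipelines and potential error propagation and often requires the model to process a large number of tokens. In this paper, we present a scalable method that lets LLMs access external knowledge without depending on retrievers or auxiliary models. Our approach uses constrained generation with a pre-built prefix-tree index. Triples from a Knowledge Graph are verbalized as textual facts, tokenized, and indexed in a prefix tree for efficient access. During inference, to acquire external knowledge, the LLM generates facts under constrained generation, which allows only token sequences that form an existing fact. We evaluate our proposal on Question Answering and show that it scales to large knowledge bases (800 million facts), adapts to domain-specific data, and achieves effective results. These gains come with minimal generation-time overhead. ReFactX code is available at this https URL.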
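
The prefix-tree constrained generation described in the ReFactX abstract can be sketched in miniature. This is an illustration under stated assumptions, not the ReFactX implementation: whitespace splitting stands in for a real tokenizer, a plain scoring callback stands in for the LLM's next-token scores, decoding is greedy, and the names `PrefixTree`, `constrained_decode`, and `END` are hypothetical.

```python
END = "<end>"  # hypothetical sentinel marking the end of an indexed fact


class PrefixTree:
    """Trie over token sequences; each root-to-END path is one indexed fact."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node[END] = {}

    def allowed_next(self, prefix):
        """Tokens that keep `prefix` on a path to some indexed fact."""
        node = self.root
        for tok in prefix:
            node = node.get(tok)
            if node is None:
                return []  # prefix is not part of any indexed fact
        return list(node.keys())


def constrained_decode(tree, score_fn, max_len=32):
    """Greedy decoding restricted to tokens the prefix tree allows."""
    prefix = []
    for _ in range(max_len):
        allowed = tree.allowed_next(prefix)
        if not allowed:
            break
        best = max(allowed, key=lambda tok: score_fn(prefix, tok))
        if best == END:
            break  # a complete fact has been generated
        prefix.append(best)
    return prefix


# Toy index of verbalized facts (hypothetical data).
tree = PrefixTree()
for fact in ["Paris is the capital of France",
             "Paris is in Europe",
             "Rome is the capital of Italy"]:
    tree.insert(fact.split())

# Stand-in for model scores: a fixed preference table over tokens.
prefs = {"Paris": 3.0, "the": 2.0}
print(" ".join(constrained_decode(tree, lambda prefix, tok: prefs.get(tok, 0.0))))
```

Because every step chooses only among the children of the current trie node, the output can never be a token sequence that is absent from the index; the model's scores decide only which existing fact is produced.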

Title: GRADE: Generating multi-hop QA and fine-gRAined Difficulty matrix for RAG Evaluation

Authors: Jeongsoo Lee, Daeyong Kwon, Kyohoon Jin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.16994
Pdf URL: https://arxiv.org/pdf/2508.16994
Copy Paste: [[2508.16994]] GRADE: Generating multi-hop QA and fine-gRAined Difficulty matrix for RAG Evaluation(https://arxiv.org/abs/2508.16994)
Keywords: retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) systems are widely adopted in knowledge-intensive NLP tasks, but current evaluations often overlook the structural complexity and multi-step reasoning required in real-world scenarios. These benchmarks overlook key factors such as the interaction between retrieval difficulty and reasoning depth. To address this gap, we propose GRADE, a novel evaluation framework that models task difficulty along two orthogonal dimensions: (1) reasoning depth, defined by the number of inference steps (hops), and (2) semantic distance between the query and its supporting evidence. We construct a synthetic multi-hop QA dataset from factual news articles by extracting knowledge graphs and augmenting them through semantic clustering to recover missing links, allowing us to generate diverse and difficulty-controlled queries. Central to our framework is a 2D difficulty matrix that combines generator-side and retriever-side difficulty. Experiments across multiple domains and models show that error rates strongly correlate with our difficulty measures, validating their diagnostic utility. GRADE enables fine-grained analysis of RAG performance and provides a scalable foundation for evaluating and improving multi-hop reasoning in real-world applications.
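Summary: Retrieval-Augmented Generation (RAG) systems are widely adopted in knowledge-intensive NLP tasks, but current evaluations often overlook the structural complexity and multi-step reasoning required in real-world scenarios. These benchmarks neglect key factors such as the interaction between retrieval difficulty and reasoning depth. To address this gap, we propose GRADE, a novel evaluation framework that models task difficulty along two orthogonal dimensions: (1) reasoning depth, defined by the number of inference steps (hops), and (2) the semantic distance between the query and its supporting evidence. We construct a synthetic multi-hop QA dataset from factual news articles by extracting knowledge graphs and augmenting them through semantic clustering to recover missing links, which lets us generate diverse, difficulty-controlled queries. Central to our framework is a 2D difficulty matrix that combines generator-side and retriever-side difficulty. Experiments across multiple domains and models show that error rates correlate strongly with our difficulty measures, validating their diagnostic utility. GRADE enables fine-grained analysis of RAG performance and provides a scalable foundation for evaluating and improving multi-hop reasoning in real-world applications.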
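
The 2D difficulty matrix in the GRADE abstract can be pictured as an error-rate table indexed by hop count and a binned semantic distance. The sketch below is only an illustration of that idea: the bin edges, the record layout, and the name `difficulty_matrix` are assumptions, and the abstract does not say how GRADE itself discretizes either axis.

```python
from collections import defaultdict


def difficulty_matrix(records, distance_edges=(0.33, 0.66)):
    """Error rate per (hops, distance bin) cell.

    records: iterable of (hops, semantic_distance, is_correct) tuples.
    distance_edges: hypothetical cut points splitting distance into bins.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [errors, total]
    for hops, distance, is_correct in records:
        col = sum(distance > edge for edge in distance_edges)  # bin index 0..len(edges)
        cell = counts[(hops, col)]
        cell[0] += 0 if is_correct else 1
        cell[1] += 1
    return {key: errors / total for key, (errors, total) in counts.items()}


# Toy evaluation log (hypothetical): (hops, query-evidence distance, correct?)
log = [(2, 0.1, True), (2, 0.2, False), (3, 0.9, False), (3, 0.8, False)]
for cell, rate in sorted(difficulty_matrix(log).items()):
    print(cell, rate)
```

Reading the table along the rows shows the effect of reasoning depth at fixed retrieval difficulty, and along the columns the reverse, which is the kind of interaction the abstract says existing benchmarks overlook.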

Title: DeAR: Dual-Stage Document Reranking with Reasoning Agents via LLM Distillation

Authors: Abdelrahman Abdallah, Jamshid Mozafari, Bhawna Piryani, Adam Jatowt
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2508.16998
Pdf URL: https://arxiv.org/pdf/2508.16998
Copy Paste: [[2508.16998]] DeAR: Dual-Stage Document Reranking with Reasoning Agents via LLM Distillation(https://arxiv.org/abs/2508.16998)
Keywords: language model, gpt, llm, chain-of-thought, agent
Abstract: Large Language Models (LLMs) have transformed listwise document reranking by enabling global reasoning over candidate sets, yet single models often struggle to balance fine-grained relevance scoring with holistic cross-document analysis. We propose DeepAgentRank (DeAR), an open-source framework that decouples these tasks through a dual-stage approach, achieving superior accuracy and interpretability. In Stage 1, we distill token-level relevance signals from a frozen 13B LLaMA teacher into a compact {3, 8}B student model using a hybrid of cross-entropy, RankNet, and KL divergence losses, ensuring robust pointwise scoring. In Stage 2, we attach a second LoRA adapter and fine-tune on 20K GPT-4o-generated chain-of-thought permutations, enabling listwise reasoning with natural-language justifications. Evaluated on TREC-DL19/20, eight BEIR datasets, and NovelEval-2306, DeAR surpasses open-source baselines by +5.1 nDCG@5 on DL20 and achieves 90.97 nDCG@10 on NovelEval, outperforming GPT-4 by +3.09. Without fine-tuning on Wikipedia, DeAR also excels in open-domain QA, achieving 54.29 Top-1 accuracy on Natural Questions, surpassing baselines like MonoT5, UPR, and RankGPT. Ablations confirm that dual-loss distillation ensures stable calibration, making DeAR a highly effective and interpretable solution for modern reranking systems. Dataset and code available at this https URL.
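Summary: Large Language Models (LLMs) have transformed listwise document reranking by enabling global reasoning over candidate sets, yet a single model often struggles to balance fine-grained relevance scoring with holistic cross-document analysis. We propose DeepAgentRank (DeAR), an open-source framework that decouples these tasks through a dual-stage approach, achieving superior accuracy and interpretability. In Stage 1, we distill token-level relevance signals from a frozen 13B LLaMA teacher into a compact {3, 8}B student model using a hybrid of cross-entropy, RankNet, and KL divergence losses, which ensures robust pointwise scoring. In Stage 2, we attach a second LoRA adapter and fine-tune on 20K GPT-4o-generated chain-of-thought permutations, enabling listwise reasoning with natural-language justifications. Evaluated on TREC-DL19/20, eight BEIR datasets, and NovelEval-2306, DeAR surpasses open-source baselines by +5.1 nDCG@5 on DL20 and achieves 90.97 nDCG@10 on NovelEval, outperforming GPT-4 by +3.09. Without fine-tuning on Wikipedia, DeAR also excels in open-domain QA, reaching 54.29 Top-1 accuracy on Natural Questions and surpassing baselines such as MonoT5, UPR, and RankGPT. Ablations confirm that dual-loss distillation ensures stable calibration, making DeAR a highly effective and interpretable solution for modern reranking systems.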
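
The three loss families that the DeAR abstract names for Stage 1 (cross-entropy, RankNet, and KL divergence) can be sketched on plain score lists. The forms below are the standard textbook ones; how DeAR actually defines, weights, and combines them is not stated in the abstract, so the equal weights and the function names here are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def binary_cross_entropy(scores, labels):
    """Pointwise loss: sigmoid(score) against a 0/1 relevance label."""
    loss = 0.0
    for s, y in zip(scores, labels):
        p = 1.0 / (1.0 + math.exp(-s))
        loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return loss / len(scores)


def ranknet_loss(scores, labels):
    """Pairwise logistic loss over pairs where document i outranks document j."""
    total, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                total += math.log1p(math.exp(-(scores[i] - scores[j])))
                pairs += 1
    return total / pairs if pairs else 0.0


def kl_divergence(teacher_scores, student_scores):
    """KL(teacher || student) between softmax-normalized score lists."""
    p, q = softmax(teacher_scores), softmax(student_scores)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def distillation_loss(student, teacher, labels, w_ce=1.0, w_rank=1.0, w_kl=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (w_ce * binary_cross_entropy(student, labels)
            + w_rank * ranknet_loss(student, labels)
            + w_kl * kl_divergence(teacher, student))


# Toy candidate list (hypothetical): three passages, the first one relevant.
teacher = [2.0, 0.5, -1.0]
student = [1.0, 0.8, -0.5]
labels = [1, 0, 0]
print(distillation_loss(student, teacher, labels))
```

The pointwise term anchors each score to its label, the pairwise term enforces the ordering between relevant and non-relevant passages, and the KL term pulls the student's score distribution toward the teacher's.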

Title: KL-Regularised Q-Learning: A Token-level Action-Value perspective on Online RLHF

Authors: Jason R Brown, Lennie Wells, Edward James Young, Sergio Bacallado
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.17000
Pdf URL: https://arxiv.org/pdf/2508.17000
Copy Paste: [[2508.17000]] KL-Regularised Q-Learning: A Token-level Action-Value perspective on Online RLHF(https://arxiv.org/abs/2508.17000)
Keywords: language model, llm
Abstract: Proximal Policy Optimisation (PPO) is an established and effective policy gradient algorithm used for Language Model Reinforcement Learning from Human Feedback (LM-RLHF). PPO performs well empirically but has a heuristic motivation and handles the KL-divergence constraint used in LM-RLHF in an ad-hoc manner. In this paper, we develop a new action-value RL method for the LM-RLHF setting, KL-regularised Q-Learning (KLQ). We then show that our method is equivalent to a version of PPO in a certain specific sense, despite its very different motivation. Finally, we benchmark KLQ on two key language generation tasks -- summarisation and single-turn dialogue. We demonstrate that KLQ performs on-par with PPO at optimising the LM-RLHF objective, and achieves a consistently higher win-rate against PPO on LLM-as-a-judge evaluations.
Summary: Proximal Policy Optimisation (PPO) is an established and effective policy gradient algorithm used for Language Model Reinforcement Learning from Human Feedback (LM-RLHF). PPO performs well empirically, but its motivation is heuristic and it handles the KL-divergence constraint used in LM-RLHF in an ad-hoc manner. In this paper, we develop a new action-value RL method for the LM-RLHF setting, KL-regularised Q-Learning (KLQ). We then show that our method is equivalent to a version of PPO in a certain specific sense, despite its very different motivation. Finally, we benchmark KLQ on two key language generation tasks: summarisation and single-turn dialogue. We show that KLQ performs on par with PPO at optimising the LM-RLHF objective and achieves a consistently higher win rate against PPO in LLM-as-a-judge evaluations.

Title: Planning for Success: Exploring LLM Long-term Planning Capabilities in Table Understanding

Authors: Thi-Nhung Nguyen, Hoang Ngo, Dinh Phung, Thuy-Trang Vu, Dat Quoc Nguyen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.17005
Pdf URL: https://arxiv.org/pdf/2508.17005
Copy Paste: [[2508.17005]] Planning for Success: Exploring LLM Long-term Planning Capabilities in Table Understanding(https://arxiv.org/abs/2508.17005)
Keywords: language model, llm, chain-of-thought
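Abstract: Table understanding is key to addressing challenging downstream tasks such as table-based question answering and fact verification. Recent works have focused on leveraging Chain-of-Thought and question decomposition to solve complex questions requiring multiple operations on tables. However, these methods often suffer from a lack of explicit long-term planning and weak inter-step connections, causing constraints within questions to be missed. In this paper, we propose leveraging the long-term planning capabilities of large language models (LLMs) to enhance table understanding. Our approach enables the execution of a long-term plan, where the steps are tightly interconnected and serve the ultimate goal, an aspect that methods based on Chain-of-Thought and question decomposition lack. In addition, our method effectively minimizes the inclusion of unnecessary details in the process of solving the next short-term goals, a limitation of methods based on Chain-of-Thought. Extensive experiments demonstrate that our method outperforms strong baselines and achieves state-of-the-art performance on WikiTableQuestions and TabFact datasets.
Summary: Table understanding is key to challenging downstream tasks such as table-based question answering and fact verification. Recent work has focused on using Chain-of-Thought and question decomposition to solve complex questions that require multiple operations on tables. However, these methods often lack explicit long-term planning and have weak connections between steps, so constraints within questions are missed. In this paper, we propose using the long-term planning capabilities of large language models (LLMs) to enhance table understanding. Our approach enables the execution of a long-term plan in which the steps are tightly interconnected and serve the ultimate goal, something that methods based on Chain-of-Thought and question decomposition lack. In addition, our method effectively minimizes the inclusion of unnecessary details when solving the next short-term goals, which is a limitation of Chain-of-Thought-based methods. Extensive experiments show that our method outperforms strong baselines and achieves state-of-the-art performance on the WikiTableQuestions and TabFact datasets.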

Title: Improving Table Understanding with LLMs and Entity-Oriented Search

Authors: Thi-Nhung Nguyen, Hoang Ngo, Dinh Phung, Thuy-Trang Vu, Dat Quoc Nguyen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.17028
Pdf URL: https://arxiv.org/pdf/2508.17028
Copy Paste: [[2508.17028]] Improving Table Understanding with LLMs and Entity-Oriented Search(https://arxiv.org/abs/2508.17028)
Keywords: language model, llm
Abstract: Our work addresses the challenges of understanding tables. Existing methods often struggle with the unpredictable nature of table content, leading to a reliance on preprocessing and keyword matching. They also face limitations due to the lack of contextual information, which complicates the reasoning processes of large language models (LLMs). To overcome these challenges, we introduce an entity-oriented search method to improve table understanding with LLMs. This approach effectively leverages the semantic similarities between questions and table data, as well as the implicit relationships between table cells, minimizing the need for data preprocessing and keyword matching. Additionally, it focuses on table entities, ensuring that table cells are semantically tightly bound, thereby enhancing contextual clarity. Furthermore, we pioneer the use of a graph query language for table understanding, establishing a new research direction. Experiments show that our approach achieves new state-of-the-art performances on standard benchmarks WikiTableQuestions and TabFact.
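Summary: Our work addresses the challenges of understanding tables. Existing methods often struggle with the unpredictable nature of table content, which leads to a reliance on preprocessing and keyword matching. They are also limited by a lack of contextual information, which complicates the reasoning processes of large language models (LLMs). To overcome these challenges, we introduce an entity-oriented search method to improve table understanding with LLMs. The approach effectively exploits the semantic similarities between questions and table data, as well as the implicit relationships between table cells, minimizing the need for data preprocessing and keyword matching. It also focuses on table entities, ensuring that table cells are semantically tightly bound and thereby improving contextual clarity. Furthermore, we pioneer the use of a graph query language for table understanding, establishing a new research direction. Experiments show that our approach achieves new state-of-the-art performance on the standard benchmarks WikiTableQuestions and TabFact.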

Title: GRAID: Synthetic Data Generation with Geometric Constraints and Multi-Agentic Reflection for Harmful Content Detection

Authors: Melissa Kazemi Rad, Alberto Purpura, Himanshu Kumar, Emily Chen, Mohammad Shahed Sorower
Subjects: cs.CL, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2508.17057
Pdf URL: https://arxiv.org/pdf/2508.17057
Copy Paste: [[2508.17057]] GRAID: Synthetic Data Generation with Geometric Constraints and Multi-Agentic Reflection for Harmful Content Detection(https://arxiv.org/abs/2508.17057)
Keywords: language model, llm, agent
Abstract: We address the problem of data scarcity in harmful text classification for guardrailing applications and introduce GRAID (Geometric and Reflective AI-Driven Data Augmentation), a novel pipeline that leverages Large Language Models (LLMs) for dataset augmentation. GRAID consists of two stages: (i) generation of geometrically controlled examples using a constrained LLM, and (ii) augmentation through a multi-agentic reflective process that promotes stylistic diversity and uncovers edge cases. This combination enables both reliable coverage of the input space and nuanced exploration of harmful content. Using two benchmark data sets, we demonstrate that augmenting a harmful text classification dataset with GRAID leads to significant improvements in downstream guardrail model performance.
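Summary: We address the problem of data scarcity in harmful text classification for guardrailing applications and introduce GRAID (Geometric and Reflective AI-Driven Data Augmentation), a novel pipeline that uses Large Language Models (LLMs) for dataset augmentation. GRAID consists of two stages: (i) generation of geometrically controlled examples using a constrained LLM, and (ii) augmentation through a multi-agentic reflective process that promotes stylistic diversity and uncovers edge cases. This combination enables both reliable coverage of the input space and nuanced exploration of harmful content. Using two benchmark datasets, we show that augmenting a harmful text classification dataset with GRAID leads to significant improvements in downstream guardrail model performance.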

Title: Linguistic Neuron Overlap Patterns to Facilitate Cross-lingual Transfer on Low-resource Languages

Authors: Yuemei Xu, Kexin Xu, Jian Zhou, Ling Hu, Lin Gui
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.17078
Pdf URL: https://arxiv.org/pdf/2508.17078
Copy Paste: [[2508.17078]] Linguistic Neuron Overlap Patterns to Facilitate Cross-lingual Transfer on Low-resource Languages(https://arxiv.org/abs/2508.17078)
Keywords: language model, llm
Abstract: The current Large Language Models (LLMs) face significant challenges in improving performance on low-resource languages and urgently need data-efficient methods without costly fine-tuning. From the perspective of language-bridge, we propose BridgeX-ICL, a simple yet effective method to improve zero-shot Cross-lingual In-Context Learning (X-ICL) for low-resource languages. Unlike existing works focusing on language-specific neurons, BridgeX-ICL explores whether sharing neurons can improve cross-lingual performance in LLMs or not. We construct neuron probe data from the ground-truth MUSE bilingual dictionaries, and define a subset of language overlap neurons accordingly, to ensure full activation of these anchored neurons. Subsequently, we propose an HSIC-based metric to quantify LLMs' internal linguistic spectrum based on overlap neurons, which guides optimal bridge selection. The experiments conducted on 2 cross-lingual tasks and 15 language pairs from 7 diverse families (covering both high-low and moderate-low pairs) validate the effectiveness of BridgeX-ICL and offer empirical insights into the underlying multilingual mechanisms of LLMs.
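Summary: Current Large Language Models (LLMs) face significant challenges in improving performance on low-resource languages and urgently need data-efficient methods that avoid costly fine-tuning. From the perspective of a language bridge, we propose BridgeX-ICL, a simple yet effective method for improving zero-shot Cross-lingual In-Context Learning (X-ICL) for low-resource languages. Unlike existing work that focuses on language-specific neurons, BridgeX-ICL explores whether sharing neurons can improve cross-lingual performance in LLMs. We construct neuron probe data from the ground-truth MUSE bilingual dictionaries and define a subset of language overlap neurons accordingly, to ensure full activation of these anchored neurons. We then propose an HSIC-based metric to quantify LLMs' internal linguistic spectrum based on overlap neurons, which guides optimal bridge selection. Experiments on 2 cross-lingual tasks and 15 language pairs from 7 diverse families (covering both high-low and moderate-low pairs) validate the effectiveness of BridgeX-ICL and offer empirical insights into the underlying multilingual mechanisms of LLMs.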

Title: Token Homogenization under Positional Bias

Authors: Viacheslav Yusupov, Danil Maksimov, Ameliia Alaeva, Tatiana Zaitceva, Antipina Anna, Anna Vasileva, Chenlin Liu, Rayuth Chheng, Danil Sazanakov, Andrey Chetvergov, Alina Ermilova, Egor Shvetsov
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.17126
Pdf URL: https://arxiv.org/pdf/2508.17126
Copy Paste: [[2508.17126]] Token Homogenization under Positional Bias(https://arxiv.org/abs/2508.17126)
Keywords: language model
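Abstract: This paper investigates token homogenization, the convergence of token representations toward uniformity across transformer layers, and its relationship to positional bias in large language models. We empirically examine whether homogenization occurs and how positional bias amplifies this effect. Through layer-wise similarity analysis and controlled experiments, we demonstrate that tokens systematically lose distinctiveness during processing, particularly when biased toward extremal positions. Our findings confirm both the existence of homogenization and its dependence on positional attention mechanisms.
Summary: This paper investigates token homogenization, the convergence of token representations toward uniformity across transformer layers, and its relationship to positional bias in large language models. We empirically examine whether homogenization occurs and how positional bias amplifies the effect. Through layer-wise similarity analysis and controlled experiments, we show that tokens systematically lose distinctiveness during processing, particularly when attention is biased toward extremal positions. Our findings confirm both the existence of homogenization and its dependence on positional attention mechanisms.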

Title: Natural Language Satisfiability: Exploring the Problem Distribution and Evaluating Transformer-based Language Models

Authors: Tharindu Madusanka, Ian Pratt-Hartmann, Riza Batista-Navarro
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.17153
Pdf URL: https://arxiv.org/pdf/2508.17153
Copy Paste: [[2508.17153]] Natural Language Satisfiability: Exploring the Problem Distribution and Evaluating Transformer-based Language Models(https://arxiv.org/abs/2508.17153)
Keywords: language model
Abstract: Efforts to apply transformer-based language models (TLMs) to the problem of reasoning in natural language have enjoyed ever-increasing success in recent years. The most fundamental task in this area to which nearly all others can be reduced is that of determining satisfiability. However, from a logical point of view, satisfiability problems vary along various dimensions, which may affect TLMs' ability to learn how to solve them. The problem instances of satisfiability in natural language can belong to different computational complexity classes depending on the language fragment in which they are expressed. Although prior research has explored the problem of natural language satisfiability, the above-mentioned point has not been discussed adequately. Hence, we investigate how problem instances from varying computational complexity classes and having different grammatical constructs impact TLMs' ability to learn rules of inference. Furthermore, to faithfully evaluate TLMs, we conduct an empirical study to explore the distribution of satisfiability problems.
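Summary: Efforts to apply transformer-based language models (TLMs) to the problem of reasoning in natural language have enjoyed ever-increasing success in recent years. The most fundamental task in this area, to which nearly all others can be reduced, is determining satisfiability. From a logical point of view, however, satisfiability problems vary along several dimensions, which may affect TLMs' ability to learn how to solve them. Problem instances of satisfiability in natural language can belong to different computational complexity classes depending on the language fragment in which they are expressed. Although prior research has explored natural language satisfiability, this point has not been discussed adequately. We therefore investigate how problem instances from different computational complexity classes and with different grammatical constructs affect TLMs' ability to learn rules of inference. Furthermore, to evaluate TLMs faithfully, we conduct an empirical study of the distribution of satisfiability problems.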

Title: SPORTSQL: An Interactive System for Real-Time Sports Reasoning and Visualization

Authors: Sebastian Martinez, Naman Ahuja, Fenil Bardoliya, Chris Bryan, Vivek Gupta
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.17157
Pdf URL: https://arxiv.org/pdf/2508.17157
Copy Paste: [[2508.17157]] SPORTSQL: An Interactive System for Real-Time Sports Reasoning and Visualization(https://arxiv.org/abs/2508.17157)
Keywords: language model, llm
Abstract: We present a modular, interactive system, SPORTSQL, for natural language querying and visualization of dynamic sports data, with a focus on the English Premier League (EPL). The system translates user questions into executable SQL over a live, temporally indexed database constructed from real-time Fantasy Premier League (FPL) data. It supports both tabular and visual outputs, leveraging the symbolic reasoning capabilities of Large Language Models (LLMs) for query parsing, schema linking, and visualization selection. To evaluate system performance, we introduce the Dynamic Sport Question Answering benchmark (DSQABENCH), comprising 1,700+ queries annotated with SQL programs, gold answers, and database snapshots. Our demo highlights how non-expert users can seamlessly explore evolving sports statistics through a natural, conversational interface.

Title: Quantifying Language Disparities in Multilingual Large Language Models

Authors: Songbo Hu, Ivan Vulić, Anna Korhonen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.17162
Pdf URL: https://arxiv.org/pdf/2508.17162
Copy Paste: [[2508.17162]] Quantifying Language Disparities in Multilingual Large Language Models(https://arxiv.org/abs/2508.17162)
Keywords: language model
Abstract: Results reported in large-scale multilingual evaluations are often fragmented and confounded by factors such as target languages, differences in experimental setups, and model choices. We propose a framework that disentangles these confounding variables and introduces three interpretable metrics (the performance realisation ratio, its coefficient of variation, and language potential), enabling a finer-grained and more insightful quantification of actual performance disparities across both (i) models and (ii) languages. Through a case study of 13 model variants on 11 multilingual datasets, we demonstrate that our framework provides a more reliable measurement of model performance and language disparities, particularly for low-resource languages, which have so far proven challenging to evaluate. Importantly, our results reveal that higher overall model performance does not necessarily imply greater fairness across languages.

Title: The Impact of Annotator Personas on LLM Behavior Across the Perspectivism Spectrum

Authors: Olufunke O. Sarumi, Charles Welch, Daniel Braun, Jörg Schlötterer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.17164
Pdf URL: https://arxiv.org/pdf/2508.17164
Copy Paste: [[2508.17164]] The Impact of Annotator Personas on LLM Behavior Across the Perspectivism Spectrum(https://arxiv.org/abs/2508.17164)
Keywords: language model, llm, prompt
Abstract: In this work, we explore the capability of Large Language Models (LLMs) to annotate hate speech and abusiveness while considering predefined annotator personas within the strong-to-weak data perspectivism spectra. We evaluated LLM-generated annotations against existing annotator modeling techniques for perspective modeling. Our findings show that LLMs selectively use demographic attributes from the personas. We identified prototypical annotators, with persona features that show varying degrees of alignment with the original human annotators. Within the data perspectivism paradigm, annotator modeling techniques that do not explicitly rely on annotator information performed better under weak data perspectivism compared to both strong data perspectivism and human annotations, suggesting LLM-generated views tend towards aggregation despite subjective prompting. However, for more personalized datasets tailored to strong perspectivism, the performance of LLM annotator modeling approached, but did not exceed, human annotators.

Title: Towards Alignment-Centric Paradigm: A Survey of Instruction Tuning in Large Language Models

Authors: Xudong Han, Junjie Yang, Tianyang Wang, Ziqian Bi, Junfeng Hao, Junhao Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.17184
Pdf URL: https://arxiv.org/pdf/2508.17184
Copy Paste: [[2508.17184]] Towards Alignment-Centric Paradigm: A Survey of Instruction Tuning in Large Language Models(https://arxiv.org/abs/2508.17184)
Keywords: language model, llm
Abstract: Instruction tuning is a pivotal technique for aligning large language models (LLMs) with human intentions, safety constraints, and domain-specific requirements. This survey provides a comprehensive overview of the full pipeline, encompassing (i) data collection methodologies, (ii) full-parameter and parameter-efficient fine-tuning strategies, and (iii) evaluation protocols. We categorized data construction into three major paradigms: expert annotation, distillation from larger models, and self-improvement mechanisms, each offering distinct trade-offs between quality, scalability, and resource cost. Fine-tuning techniques range from conventional supervised training to lightweight approaches, such as low-rank adaptation (LoRA) and prefix tuning, with a focus on computational efficiency and model reusability. We further examine the challenges of evaluating faithfulness, utility, and safety across multilingual and multimodal scenarios, highlighting the emergence of domain-specific benchmarks in healthcare, legal, and financial applications. Finally, we discuss promising directions for automated data generation, adaptive optimization, and robust evaluation frameworks, arguing that a closer integration of data, algorithms, and human feedback is essential for advancing instruction-tuned LLMs. This survey aims to serve as a practical reference for researchers and practitioners seeking to design LLMs that are both effective and reliably aligned with human intentions.

Title: Active Domain Knowledge Acquisition with $100 Budget: Enhancing LLMs via Cost-Efficient, Expert-Involved Interaction in Sensitive Domains

Authors: Yang Wu, Raha Moraffah, Rujing Yao, Jinhong Yu, Zhimin Tao, Xiaozhong Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.17202
Pdf URL: https://arxiv.org/pdf/2508.17202
Copy Paste: [[2508.17202]] Active Domain Knowledge Acquisition with $100 Budget: Enhancing LLMs via Cost-Efficient, Expert-Involved Interaction in Sensitive Domains(https://arxiv.org/abs/2508.17202)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated an impressive level of general knowledge. However, they often struggle in highly specialized and cost-sensitive domains such as drug discovery and rare disease research due to the lack of expert knowledge. In this paper, we propose a novel framework (PU-ADKA) designed to efficiently enhance domain-specific LLMs by actively engaging domain experts within a fixed budget. Unlike traditional fine-tuning approaches, PU-ADKA selectively identifies and queries the most appropriate expert from a team, taking into account each expert's availability, knowledge boundaries, and consultation costs. We train PU-ADKA using simulations on PubMed data and validate it through both controlled expert interactions and real-world deployment with a drug development team, demonstrating its effectiveness in enhancing LLM performance in specialized domains under strict budget constraints. In addition to outlining our methodological innovations and experimental results, we introduce a new benchmark dataset, CKAD, for cost-effective LLM domain knowledge acquisition to foster further research in this challenging area.

Title: SSFO: Self-Supervised Faithfulness Optimization for Retrieval-Augmented Generation

Authors: Xiaqiang Tang, Yi Wang, Keyu Hu, Rui Xu, Chuang Li, Weigao Sun, Jian Li, Sihong Xie
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.17225
Pdf URL: https://arxiv.org/pdf/2508.17225
Copy Paste: [[2508.17225]] SSFO: Self-Supervised Faithfulness Optimization for Retrieval-Augmented Generation(https://arxiv.org/abs/2508.17225)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) systems require Large Language Models (LLMs) to generate responses that are faithful to the retrieved context. However, faithfulness hallucination remains a critical challenge, as existing methods often require costly supervision and post-training or significant inference burdens. To overcome these limitations, we introduce Self-Supervised Faithfulness Optimization (SSFO), the first self-supervised alignment approach for enhancing RAG faithfulness. SSFO constructs preference data pairs by contrasting the model's outputs generated with and without the context. Leveraging Direct Preference Optimization (DPO), SSFO aligns model faithfulness without incurring labeling costs or additional inference burden. We theoretically and empirically demonstrate that SSFO leverages a benign form of likelihood displacement, transferring probability mass from parametric-based tokens to context-aligned tokens. Based on this insight, we propose a modified DPO loss function to encourage likelihood displacement. Comprehensive evaluations show that SSFO significantly outperforms existing methods, achieving state-of-the-art faithfulness on multiple context-based question-answering datasets. Notably, SSFO exhibits strong generalization, improving cross-lingual faithfulness and preserving general instruction-following capabilities. We release our code and model at the anonymous link: this https URL
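
As a rough illustration of the preference objective mentioned in the SSFO abstract, the sketch below computes the standard DPO loss for a single preference pair, treating the response generated with the retrieved context as "chosen" and the response generated without it as "rejected". This is not the authors' modified loss, which the abstract does not specify; the function name, the log-probability inputs, and the default beta of 0.1 are assumptions made for illustration.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (not SSFO's modified variant).

    Inputs are sequence log-probabilities under the policy and under a frozen
    reference model. The names and beta value are illustrative assumptions.
    """
    # How much more the policy favours the chosen response than the reference does
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(beta * margin)), computed stably as softplus(-beta * margin)
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))
```

With a zero margin the loss equals log 2, and it decreases as the policy shifts probability toward the context-grounded response relative to the reference model.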

Title: ClaimGen-CN: A Large-scale Chinese Dataset for Legal Claim Generation

Authors: Siying Zhou, Yiquan Wu, Hui Chen, Xavier Hu, Kun Kuang, Adam Jatowt, Ming Hu, Chunyan Zheng, Fei Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.17234
Pdf URL: https://arxiv.org/pdf/2508.17234
Copy Paste: [[2508.17234]] ClaimGen-CN: A Large-scale Chinese Dataset for Legal Claim Generation(https://arxiv.org/abs/2508.17234)
Keywords: language model
Abstract: Legal claims refer to the plaintiff's demands in a case and are essential to guiding judicial reasoning and case resolution. While many works have focused on improving the efficiency of legal professionals, research on helping non-professionals (e.g., plaintiffs) remains unexplored. This paper explores the problem of legal claim generation based on a given case's facts. First, we construct ClaimGen-CN, the first dataset for the Chinese legal claim generation task, from various real-world legal disputes. Additionally, we design an evaluation metric tailored for assessing the generated claims, which encompasses two essential dimensions: factuality and clarity. Building on this, we conduct a comprehensive zero-shot evaluation of state-of-the-art general and legal-domain large language models. Our findings highlight the limitations of the current models in factual precision and expressive clarity, pointing to the need for more targeted development in this domain. To encourage further exploration of this important task, we will make the dataset publicly available.

Title: Routing Distilled Knowledge via Mixture of LoRA Experts for Large Language Model based Bundle Generation

Authors: Kaidong Feng, Zhu Sun, Hui Fang, Jie Yang, Wenyuan Liu, Yew-Soon Ong
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2508.17250
Pdf URL: https://arxiv.org/pdf/2508.17250
Copy Paste: [[2508.17250]] Routing Distilled Knowledge via Mixture of LoRA Experts for Large Language Model based Bundle Generation(https://arxiv.org/abs/2508.17250)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have shown potential in automatic bundle generation but suffer from prohibitive computational costs. Although knowledge distillation offers a pathway to more efficient student models, our preliminary study reveals that naively integrating diverse types of distilled knowledge from teacher LLMs into student LLMs leads to knowledge conflict, negatively impacting the performance of bundle generation. To address this, we propose RouteDK, a framework for routing distilled knowledge through a mixture-of-LoRA-experts architecture. Specifically, we first distill knowledge from the teacher LLM for bundle generation in two complementary types: high-level knowledge (generalizable rules) and fine-grained knowledge (session-specific reasoning). We then train knowledge-specific LoRA experts for each type of knowledge together with a base LoRA expert. For effective integration, we propose a dynamic fusion module featuring an input-aware router, which balances expert contributions by dynamically determining optimal weights based on the input, thereby effectively mitigating knowledge conflicts. To further improve inference reliability, we design an inference-time enhancement module to reduce variance and mitigate suboptimal reasoning. Experiments on three public datasets show that our RouteDK achieves accuracy comparable to or even better than the teacher LLM, while maintaining strong computational efficiency. In addition, it outperforms state-of-the-art approaches for bundle generation.
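
The input-aware routing idea in the RouteDK abstract can be pictured with a minimal sketch: a router produces one logit per expert, and the expert outputs are combined with the resulting weights. The softmax gate, the function names, and the plain-list vectors are assumptions for illustration; the abstract says only that the router determines weights dynamically from the input.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_experts(expert_outputs, router_logits):
    """Weighted sum of expert output vectors.

    expert_outputs: one equal-length vector per expert (hypothetical stand-ins
    for the outputs of the base and knowledge-specific LoRA experts).
    router_logits: one score per expert, assumed to be computed from the input.
    """
    weights = softmax(router_logits)
    dim = len(expert_outputs[0])
    return [sum(w * out[j] for w, out in zip(weights, expert_outputs))
            for j in range(dim)]
```

Equal logits reduce the fusion to a plain average, while a dominant logit lets one expert's output pass through almost unchanged.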

Title: Are You Sure You're Positive? Consolidating Chain-of-Thought Agents with Uncertainty Quantification for Aspect-Category Sentiment Analysis

Authors: Filippos Ventirozos, Peter Appleby, Matthew Shardlow
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2508.17258
Pdf URL: https://arxiv.org/pdf/2508.17258
Copy Paste: [[2508.17258]] Are You Sure You're Positive? Consolidating Chain-of-Thought Agents with Uncertainty Quantification for Aspect-Category Sentiment Analysis(https://arxiv.org/abs/2508.17258)
Keywords: language model, chain-of-thought, agent
Abstract: Aspect-category sentiment analysis provides granular insights by identifying specific themes within product reviews that are associated with particular opinions. Supervised learning approaches dominate the field. However, data is scarce and expensive to annotate for new domains. We argue that leveraging large language models in a zero-shot setting is beneficial where the time and resources required for dataset annotation are limited. Furthermore, annotation bias may lead supervised methods to produce strong results that transfer poorly to new domains in contexts that lack annotations and demand reproducibility. In our work, we propose novel techniques that combine multiple chain-of-thought agents by leveraging large language models' token-level uncertainty scores. We experiment with the 3B and 70B+ parameter size variants of Llama and Qwen models, demonstrating how these approaches can fulfil practical needs and opening a discussion on how to gauge accuracy in label-scarce conditions.

Title: From Language to Action: A Review of Large Language Models as Autonomous Agents and Tool Users

Authors: Sadia Sultana Chowa, Riasad Alvi, Subhey Sadi Rahman, Md Abdur Rahman, Mohaimenul Azam Khan Raiaan, Md Rafiqul Islam, Mukhtar Hussain, Sami Azam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.17281
Pdf URL: https://arxiv.org/pdf/2508.17281
Copy Paste: [[2508.17281]] From Language to Action: A Review of Large Language Models as Autonomous Agents and Tool Users(https://arxiv.org/abs/2508.17281)
Keywords: language model, llm, prompt, agent
Abstract: The pursuit of human-level artificial intelligence (AI) has significantly advanced the development of autonomous agents and Large Language Models (LLMs). LLMs are now widely utilized as decision-making agents for their ability to interpret instructions, manage sequential tasks, and adapt through feedback. This review examines recent developments in employing LLMs as autonomous agents and tool users and comprises seven research questions. We considered only papers published between 2023 and 2025 in A*- and A-ranked conferences and Q1 journals. A structured analysis of the LLM agents' architectural design principles, dividing their applications into single-agent and multi-agent systems, and strategies for integrating external tools is presented. In addition, the cognitive mechanisms of LLMs, including reasoning, planning, and memory, and the impact of prompting methods and fine-tuning procedures on agent performance are also investigated. Furthermore, we evaluated current benchmarks and assessment protocols and have provided an analysis of 68 publicly available datasets to assess the performance of LLM-based agents in various tasks. In conducting this review, we have identified critical findings on verifiable reasoning of LLMs, the capacity for self-improvement, and the personalization of LLM-based agents. Finally, we have discussed ten future research directions to overcome these gaps.

Title: Handling Students Dropouts in an LLM-driven Interactive Online Course Using Language Models

Authors: Yuanchun Wang, Yiyang Fu, Jifan Yu, Daniel Zhang-Li, Zheyuan Zhang, Joy Lim Jia Yin, Yucheng Wang, Peng Zhou, Jing Zhang, Huiqin Liu
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2508.17310
Pdf URL: https://arxiv.org/pdf/2508.17310
Copy Paste: [[2508.17310]] Handling Students Dropouts in an LLM-driven Interactive Online Course Using Language Models(https://arxiv.org/abs/2508.17310)
Keywords: language model, llm, agent
Abstract: Interactive online learning environments, represented by Massive AI-empowered Courses (MAIC), leverage LLM-driven multi-agent systems to transform passive MOOCs into dynamic, text-based platforms, enhancing interactivity through LLMs. This paper conducts an empirical study on a specific MAIC course to explore three research questions about dropouts in these interactive online courses: (1) What factors might lead to dropouts? (2) Can we predict dropouts? (3) Can we reduce dropouts? We analyze interaction logs to define dropouts and identify contributing factors. Our findings reveal strong links between dropout behaviors and textual interaction patterns. We then propose a course-progress-adaptive dropout prediction framework (CPADP) to predict dropouts with up to 95.4% accuracy. Based on this, we design a personalized email recall agent to re-engage at-risk students. Applied in the deployed MAIC system with over 3,000 students, the feasibility and effectiveness of our approach have been validated on students with diverse backgrounds.

Title: CultranAI at PalmX 2025: Data Augmentation for Cultural Knowledge Representation

Authors: Hunzalah Hassan Bhatti, Youssef Ahmed, Md Arid Hasan, Firoj Alam
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.17324
Pdf URL: https://arxiv.org/pdf/2508.17324
Copy Paste: [[2508.17324]] CultranAI at PalmX 2025: Data Augmentation for Cultural Knowledge Representation(https://arxiv.org/abs/2508.17324)
Keywords: language model, llm
Abstract: In this paper, we report our participation in the PalmX cultural evaluation shared task. Our system, CultranAI, focused on data augmentation and LoRA fine-tuning of large language models (LLMs) for Arabic cultural knowledge representation. We benchmarked several LLMs to identify the best-performing model for the task. In addition to utilizing the PalmX dataset, we augmented it by incorporating the Palm dataset and curated a new dataset of over 22K culturally grounded multiple-choice questions (MCQs). Our experiments showed that the Fanar-1-9B-Instruct model achieved the highest performance. We fine-tuned this model on the combined augmented dataset of 22K+ MCQs. On the blind test set, our submitted system ranked 5th with an accuracy of 70.50%, while on the PalmX development set, it achieved an accuracy of 84.1%.

Title: DropLoRA: Sparse Low-Rank Adaptation for Parameter-Efficient Fine-Tuning

Authors: Haojie Zhang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.17337
Pdf URL: https://arxiv.org/pdf/2508.17337
Copy Paste: [[2508.17337]] DropLoRA: Sparse Low-Rank Adaptation for Parameter-Efficient Fine-Tuning(https://arxiv.org/abs/2508.17337)
Keywords: language model
Abstract: LoRA-based large model parameter-efficient fine-tuning (PEFT) methods use low-rank decomposition to approximate updates to model parameters. However, compared to full-parameter fine-tuning, low-rank updates often lead to a performance gap in downstream tasks. To address this, we introduce DropLoRA, a novel pruning-based approach that focuses on pruning the rank dimension. Unlike conventional methods that attempt to overcome the low-rank bottleneck, DropLoRA innovatively integrates a pruning module between the two low-rank matrices in LoRA to simulate dynamic subspace learning. This dynamic low-rank subspace learning allows DropLoRA to overcome the limitations of traditional LoRA, which operates within a static subspace. By continuously adapting the learning subspace, DropLoRA significantly boosts performance without incurring additional training or inference costs. Our experimental results demonstrate that DropLoRA consistently outperforms LoRA in fine-tuning the LLaMA series across a wide range of large language model generation tasks, including commonsense reasoning, mathematical reasoning, code generation, and instruction-following. Our code is available at this https URL.
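
To make the phrase "a pruning module between the two low-rank matrices" concrete, here is a minimal sketch of a LoRA update in which a 0/1 mask over the rank dimension sits between the two factors, so the update becomes B · diag(mask) · A. The independent Bernoulli sampling of the mask, the function names, and the plain-list matrices are assumptions for illustration; the abstract does not say how DropLoRA chooses which rank components to prune.

```python
import random

def sample_rank_mask(rank, keep_prob, rng):
    """Assumed pruning rule: keep each rank component independently with keep_prob."""
    return [1 if rng.random() < keep_prob else 0 for _ in range(rank)]

def masked_lora_delta(B, A, mask):
    """Compute B @ diag(mask) @ A with plain lists.

    B: d_out x r, A: r x d_in, mask: length-r list of 0/1 values.
    A zero in the mask removes that rank component from the weight update.
    """
    d_out, rank, d_in = len(B), len(A), len(A[0])
    return [[sum(B[i][k] * mask[k] * A[k][j] for k in range(rank))
             for j in range(d_in)]
            for i in range(d_out)]

# Example: resample the mask at each training step so the active subspace changes.
rng = random.Random(0)
step_mask = sample_rank_mask(rank=2, keep_prob=0.5, rng=rng)
delta = masked_lora_delta([[1.0, 2.0], [3.0, 4.0]], [[1.0, 0.0], [0.0, 1.0]], step_mask)
```

With an all-ones mask this reduces to the ordinary LoRA product B · A, which fits the abstract's statement that no extra inference cost is incurred.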

Title: Capturing Legal Reasoning Paths from Facts to Law in Court Judgments using Knowledge Graphs

Authors: Ryoma Kondo, Riona Matsuoka, Takahiro Yoshida, Kazuyuki Yamasawa, Ryohei Hisano
Subjects: cs.CL, cs.AI, cs.DB, cs.IR
Abstract URL: https://arxiv.org/abs/2508.17340
Pdf URL: https://arxiv.org/pdf/2508.17340
Copy Paste: [[2508.17340]] Capturing Legal Reasoning Paths from Facts to Law in Court Judgments using Knowledge Graphs(https://arxiv.org/abs/2508.17340)
Keywords: language model, prompt
Abstract: Court judgments reveal how legal rules have been interpreted and applied to facts, providing a foundation for understanding structured legal reasoning. However, existing automated approaches for capturing legal reasoning, including large language models, often fail to identify the relevant legal context, do not accurately trace how facts relate to legal norms, and may misrepresent the layered structure of judicial reasoning. These limitations hinder the ability to capture how courts apply the law to facts in practice. In this paper, we address these challenges by constructing a legal knowledge graph from 648 Japanese administrative court decisions. Our method extracts components of legal reasoning using prompt-based large language models, normalizes references to legal provisions, and links facts, norms, and legal applications through an ontology of legal inference. The resulting graph captures the full structure of legal reasoning as it appears in real court decisions, making implicit reasoning explicit and machine-readable. We evaluate our system using expert annotated data, and find that it achieves more accurate retrieval of relevant legal provisions from facts than large language model baselines and retrieval-augmented methods.

Title: UI-Level Evaluation of ALLaM 34B: Measuring an Arabic-Centric LLM via HUMAIN Chat

Authors: Omer Nacar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.17378
Pdf URL: https://arxiv.org/pdf/2508.17378
Copy Paste: [[2508.17378]] UI-Level Evaluation of ALLaM 34B: Measuring an Arabic-Centric LLM via HUMAIN Chat(https://arxiv.org/abs/2508.17378)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Large language models (LLMs) trained primarily on English corpora often struggle to capture the linguistic and cultural nuances of Arabic. To address this gap, the Saudi Data and AI Authority (SDAIA) introduced the ALLaM family of Arabic-focused models. The most capable of these available to the public, ALLaM-34B, was subsequently adopted by HUMAIN, who developed and deployed HUMAIN Chat, a closed conversational web service built on this model. This paper presents an expanded and refined UI-level evaluation of ALLaM-34B. Using a prompt pack spanning Modern Standard Arabic (MSA), five regional dialects, code-switching, factual knowledge, arithmetic and temporal reasoning, creative generation, and adversarial safety, we collected 115 outputs (23 prompts × 5 runs) and scored each with three frontier LLM judges (GPT-5, Gemini 2.5 Pro, Claude Sonnet-4). We compute category-level means with 95% confidence intervals, analyze score distributions, and visualize dialect-wise metric heat maps. The updated analysis reveals consistently high performance on generation and code-switching tasks (both averaging 4.92/5), alongside strong results in MSA handling (4.74/5), solid reasoning ability (4.64/5), and improved dialect fidelity (4.21/5). Safety-related prompts show stable, reliable performance (4.54/5). Taken together, these results position ALLaM-34B as a robust and culturally grounded Arabic LLM, demonstrating both technical strength and practical readiness for real-world deployment.
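
The category-level statistics described in the ALLaM abstract can be sketched as follows. The snippet computes a mean with a normal-approximation 95% confidence interval; the paper may well use a different interval (for example a t-interval or a bootstrap), and the function name and the example scores are hypothetical.

```python
import math
import statistics

# 23 prompts, each run 5 times, gives the 115 outputs reported in the abstract.
N_OUTPUTS = 23 * 5

def mean_ci95(scores):
    """Mean and normal-approximation 95% confidence interval for a list of scores."""
    m = statistics.mean(scores)
    # Standard error of the mean, from the sample standard deviation
    se = statistics.stdev(scores) / math.sqrt(len(scores))
    half_width = 1.96 * se
    return m, m - half_width, m + half_width

# Hypothetical judge scores (1-5 scale) for one category
mean, low, high = mean_ci95([4, 5, 5, 4, 5])
```

With only a handful of scores per category the normal approximation is crude, so a t-based interval would be the more cautious choice in practice.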

Title: Agent-Testing Agent: A Meta-Agent for Automated Testing and Evaluation of Conversational AI Agents

Authors: Sameer Komoravolu, Khalil Mrini
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.17393
Pdf URL: https://arxiv.org/pdf/2508.17393
Copy Paste: [[2508.17393]] Agent-Testing Agent: A Meta-Agent for Automated Testing and Evaluation of Conversational AI Agents(https://arxiv.org/abs/2508.17393)
Keywords: llm, agent
Abstract: LLM agents are increasingly deployed to plan, retrieve, and write with tools, yet evaluation still leans on static benchmarks and small human studies. We present the Agent-Testing Agent (ATA), a meta-agent that combines static code analysis, designer interrogation, literature mining, and persona-driven adversarial test generation whose difficulty adapts via judge feedback. Each dialogue is scored with an LLM-as-a-Judge (LAAJ) rubric and used to steer subsequent tests toward the agent's weakest capabilities. On a travel planner and a Wikipedia writer, the ATA surfaces more diverse and severe failures than expert annotators while matching severity, and finishes in 20-30 minutes versus ten-annotator rounds that took days. Ablating code analysis and web search increases variance and miscalibration, underscoring the value of evidence-grounded test generation. The ATA outputs quantitative metrics and qualitative bug reports for developers. We release the full methodology and open-source implementation for reproducible agent testing: this https URL

Title: DashboardQA: Benchmarking Multimodal Agents for Question Answering on Interactive Dashboards

Authors: Aaryaman Kartha, Ahmed Masry, Mohammed Saidul Islam, Thinh Lang, Shadikur Rahman, Ridwan Mahbub, Mizanur Rahman, Mahir Ahmed, Md Rizwan Parvez, Enamul Hoque, Shafiq Joty
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.17398
Pdf URL: https://arxiv.org/pdf/2508.17398
Copy Paste: [[2508.17398]] DashboardQA: Benchmarking Multimodal Agents for Question Answering on Interactive Dashboards(https://arxiv.org/abs/2508.17398)
Keywords: agent
Abstract: Dashboards are powerful visualization tools for data-driven decision-making, integrating multiple interactive views that allow users to explore, filter, and navigate data. Unlike static charts, dashboards support rich interactivity, which is essential for uncovering insights in real-world analytical workflows. However, existing question-answering benchmarks for data visualizations largely overlook this interactivity, focusing instead on static charts. This limitation severely constrains their ability to evaluate the capabilities of modern multimodal agents designed for GUI-based reasoning. To address this gap, we introduce DashboardQA, the first benchmark explicitly designed to assess how vision-language GUI agents comprehend and interact with real-world dashboards. The benchmark includes 112 interactive dashboards from Tableau Public and 405 question-answer pairs with interactive dashboards spanning five categories: multiple-choice, factoid, hypothetical, multi-dashboard, and conversational. By assessing a variety of leading closed- and open-source GUI agents, our analysis reveals their key limitations, particularly in grounding dashboard elements, planning interaction trajectories, and performing reasoning. Our findings indicate that interactive dashboard reasoning is a challenging task overall for all the VLMs evaluated. Even the top-performing agents struggle; for instance, the best agent based on Gemini-Pro-2.5 achieves only 38.69% accuracy, while the OpenAI CUA agent reaches just 22.69%, demonstrating the benchmark's significant difficulty. We release DashboardQA at this https URL

Title: DS@GT at CheckThat! 2025: A Simple Retrieval-First, LLM-Backed Framework for Claim Normalization

Authors: Aleksandar Pramov, Jiangqin Ma, Bina Patel
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2508.17402
Pdf URL: https://arxiv.org/pdf/2508.17402
Copy Paste: [[2508.17402]] DS@GT at CheckThat! 2025: A Simple Retrieval-First, LLM-Backed Framework for Claim Normalization(https://arxiv.org/abs/2508.17402)
Keywords: gpt, llm, prompt
Abstract: Claim normalization is an integral part of any automatic fact-check verification system. It parses typically noisy claim data, such as social media posts, into normalized claims, which are then fed into downstream veracity classification tasks. The CheckThat! 2025 Task 2 focuses specifically on claim normalization and spans 20 languages under monolingual and zero-shot conditions. Our proposed solution consists of a lightweight retrieval-first, LLM-backed pipeline, in which we either dynamically prompt GPT-4o-mini with in-context examples or retrieve the closest normalization from the training dataset directly. On the official test set, the system ranks near the top for most monolingual tracks, achieving first place in 7 of the 13 languages. In contrast, the system underperforms in the zero-shot setting, highlighting the limitation of the proposed solution.

Title: Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD

Authors: Bryan Chen Zhengyu Tan, Daniel Wai Kit Chin, Zhengyuan Liu, Nancy F. Chen, Roy Ka-Wei Lee
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2508.17450
Pdf URL: https://arxiv.org/pdf/2508.17450
Copy Paste: [[2508.17450]] Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD(https://arxiv.org/abs/2508.17450)
Keywords: language model, gpt, llm, prompt
Abstract: Large Language Models (LLMs) can struggle to balance gullibility to misinformation and resistance to valid corrections in persuasive dialogues, a critical challenge for reliable deployment. We introduce DuET-PD (Dual Evaluation for Trust in Persuasive Dialogues), a framework evaluating multi-turn stance-change dynamics across dual dimensions: persuasion type (corrective/misleading) and domain (knowledge via MMLU-Pro, and safety via SALAD-Bench). We find that even a state-of-the-art model like GPT-4o achieves only 27.32% accuracy in MMLU-Pro under sustained misleading persuasions. Moreover, results reveal a concerning trend of increasing sycophancy in newer open-source models. To address this, we introduce Holistic DPO, a training approach balancing positive and negative persuasion examples. Unlike prompting or resist-only training, Holistic DPO enhances both robustness to misinformation and receptiveness to corrections, improving Llama-3.1-8B-Instruct's accuracy under misleading persuasion in safety contexts from 4.21% to 76.54%. These contributions offer a pathway to developing more reliable and adaptable LLMs for multi-turn dialogue. Code is available at this https URL.

Title: Evaluating the Impact of Verbal Multiword Expressions on Machine Translation

Authors: Linfeng Liu, Saptarshi Ghosh, Tianyu Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.17458
Pdf URL: https://arxiv.org/pdf/2508.17458
Copy Paste: [[2508.17458]] Evaluating the Impact of Verbal Multiword Expressions on Machine Translation(https://arxiv.org/abs/2508.17458)
Keywords: language model, llm
Abstract: Verbal multiword expressions (VMWEs) present significant challenges for natural language processing due to their complex and often non-compositional nature. While machine translation models have seen significant improvement with the advent of language models in recent years, accurately translating these complex linguistic structures remains an open problem. In this study, we analyze the impact of three VMWE categories -- verbal idioms, verb-particle constructions, and light verb constructions -- on machine translation quality from English to multiple languages. Using both established multiword expression datasets and sentences containing these language phenomena extracted from machine translation datasets, we evaluate how state-of-the-art translation systems handle these expressions. Our experimental results consistently show that VMWEs negatively affect translation quality. We also propose an LLM-based paraphrasing approach that replaces these expressions with their literal counterparts, demonstrating significant improvement in translation quality for verbal idioms and verb-particle constructions.

Title: Improving French Synthetic Speech Quality via SSML Prosody Control

Authors: Nassima Ould Ouali, Awais Hussain Sani, Ruben Bueno, Jonah Dauvet, Tim Luka Horstmann, Eric Moulines
Subjects: cs.CL, cs.SD
Abstract URL: https://arxiv.org/abs/2508.17494
Pdf URL: https://arxiv.org/pdf/2508.17494
Copy Paste: [[2508.17494]] Improving French Synthetic Speech Quality via SSML Prosody Control(https://arxiv.org/abs/2508.17494)
Keywords: language model, llm, prompt
Abstract: Despite recent advances, synthetic voices often lack expressiveness due to limited prosody control in commercial text-to-speech (TTS) systems. We introduce the first end-to-end pipeline that inserts Speech Synthesis Markup Language (SSML) tags into French text to control pitch, speaking rate, volume, and pause duration. We employ a cascaded architecture with two QLoRA-fine-tuned Qwen 2.5-7B models: one predicts phrase-break positions and the other performs regression on prosodic targets, generating commercial TTS-compatible SSML markup. Evaluated on a 14-hour French podcast corpus, our method achieves 99.2% F1 for break placement and reduces mean absolute error on pitch, rate, and volume by 25-40% compared with prompting-only large language models (LLMs) and a BiLSTM baseline. In perceptual evaluation involving 18 participants across over 9 hours of synthesized audio, SSML-enhanced speech generated by our pipeline significantly improves naturalness, with the mean opinion score increasing from 3.20 to 3.87 (p < 0.005). Additionally, 15 of 18 listeners preferred our enhanced synthesis. These results demonstrate substantial progress in bridging the expressiveness gap between synthetic and natural French speech. Our code is publicly available at this https URL.
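For readers unfamiliar with SSML, the kind of markup the abstract describes (prosody attributes for pitch, rate, and volume, plus pauses between phrases) can be sketched as follows. This is only an illustration, not the authors' pipeline: the function name to_ssml, the phrase dictionaries, and the default pause length are assumptions made for this example, and the prosody values are hand-written rather than predicted by a model. Only the speak, prosody, and break elements come from the SSML standard.

```python
import xml.etree.ElementTree as ET


def to_ssml(phrases, pause_ms=300):
    """Wrap each phrase in a <prosody> element and separate phrases with <break>.

    `phrases` is a list of dicts with keys "text", "pitch", "rate", "volume"
    (a hypothetical input format chosen for this sketch).
    """
    root = ET.Element("speak")
    for i, phrase in enumerate(phrases):
        if i > 0:
            # Insert a pause at each phrase boundary.
            ET.SubElement(root, "break", {"time": f"{pause_ms}ms"})
        prosody = ET.SubElement(
            root,
            "prosody",
            {
                "pitch": phrase["pitch"],
                "rate": phrase["rate"],
                "volume": phrase["volume"],
            },
        )
        prosody.text = phrase["text"]
    # ElementTree escapes special characters such as "&" in text and attributes.
    return ET.tostring(root, encoding="unicode")


example = [
    {"text": "Bonjour & bienvenue", "pitch": "+5%", "rate": "95%", "volume": "+2dB"},
    {"text": "dans ce nouvel épisode", "pitch": "-2%", "rate": "100%", "volume": "+0dB"},
]
ssml = to_ssml(example)
print(ssml)
```

Building the markup with an XML library rather than string concatenation keeps the output well-formed even when the text contains reserved characters.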

Title: Debate or Vote: Which Yields Better Decisions in Multi-Agent Large Language Models?

Authors: Hyeong Kyu Choi, Xiaojin Zhu, Yixuan Li
Subjects: cs.CL, cs.MA
Abstract URL: https://arxiv.org/abs/2508.17536
Pdf URL: https://arxiv.org/pdf/2508.17536
Copy Paste: [[2508.17536]] Debate or Vote: Which Yields Better Decisions in Multi-Agent Large Language Models?(https://arxiv.org/abs/2508.17536)
Keywords: language model, agent
Abstract: Multi-Agent Debate (MAD) has emerged as a promising paradigm for improving the performance of large language models through collaborative reasoning. Despite recent advances, the key factors driving MAD's effectiveness remain unclear. In this work, we disentangle MAD into two key components, Majority Voting and inter-agent Debate, and assess their respective contributions. Through extensive experiments across seven NLP benchmarks, we find that Majority Voting alone accounts for most of the performance gains typically attributed to MAD. To explain this, we propose a theoretical framework that models debate as a stochastic process. We prove that it induces a martingale over agents' belief trajectories, implying that debate alone does not improve expected correctness. Guided by these insights, we demonstrate that targeted interventions, by biasing the belief update toward correction, can meaningfully enhance debate effectiveness. Overall, our findings suggest that while MAD has potential, simple ensembling methods remain strong and more reliable alternatives in many practical settings. Code is released at this https URL.
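The Majority Voting component named in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' released code: the function name majority_vote, the list-of-strings input, and the tie-breaking rule (the earliest answer wins) are all choices made for this example.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among the agents.

    Ties are broken in favor of the answer that appears first in `answers`
    (an assumption of this sketch; the abstract does not specify a rule).
    """
    if not answers:
        raise ValueError("need at least one agent answer")
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:
        if counts[answer] == best:
            return answer


# Hypothetical final answers from five independent agents.
agent_answers = ["B", "A", "B", "C", "B"]
print(majority_vote(agent_answers))
```

No inter-agent communication takes place here: each agent answers once and the answers are aggregated, which is the baseline the abstract contrasts with debate.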

Title: Humanizing Machines: Rethinking LLM Anthropomorphism Through a Multi-Level Framework of Design

Authors: Yunze Xiao, Lynnette Hui Xian Ng, Jiarui Liu, Mona T. Diab
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.17573
Pdf URL: https://arxiv.org/pdf/2508.17573
Copy Paste: [[2508.17573]] Humanizing Machines: Rethinking LLM Anthropomorphism Through a Multi-Level Framework of Design(https://arxiv.org/abs/2508.17573)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) increasingly exhibit anthropomorphism characteristics -- human-like qualities portrayed across their outlook, language, behavior, and reasoning functions. Such characteristics enable more intuitive and engaging human-AI interactions. However, current research on anthropomorphism remains predominantly risk-focused, emphasizing over-trust and user deception while offering limited design guidance. We argue that anthropomorphism should instead be treated as a concept of design that can be intentionally tuned to support user goals. Drawing from multiple disciplines, we propose that the anthropomorphism of an LLM-based artifact should reflect the interaction between artifact designers and interpreters. This interaction is facilitated by cues embedded in the artifact by the designers and the (cognitive) responses of the interpreters to the cues. Cues are categorized into four dimensions: perceptive, linguistic, behavioral, and cognitive. By analyzing the manifestation and effectiveness of each cue, we provide a unified taxonomy with actionable levers for practitioners. Consequently, we advocate for function-oriented evaluations of anthropomorphic design.

Title: UQ: Assessing Language Models on Unsolved Questions

Authors: Fan Nie, Ken Ziyu Liu, Zihao Wang, Rui Sun, Wei Liu, Weijia Shi, Huaxiu Yao, Linjun Zhang, Andrew Y. Ng, James Zou, Sanmi Koyejo, Yejin Choi, Percy Liang, Niklas Muennighoff
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.17580
Pdf URL: https://arxiv.org/pdf/2508.17580
Copy Paste: [[2508.17580]] UQ: Assessing Language Models on Unsolved Questions(https://arxiv.org/abs/2508.17580)
Keywords: language model, llm
Abstract: Benchmarks shape progress in AI research. A useful benchmark should be both difficult and realistic: questions should challenge frontier models while also reflecting real-world usage. Yet, current paradigms face a difficulty-realism tension: exam-style benchmarks are often made artificially difficult with limited real-world value, while benchmarks based on real user interaction often skew toward easy, high-frequency problems. In this work, we explore a radically different paradigm: assessing models on unsolved questions. Rather than a static benchmark scored once, we curate unsolved questions and evaluate models asynchronously over time with validator-assisted screening and community verification. We introduce UQ, a testbed of 500 challenging, diverse questions sourced from Stack Exchange, spanning topics from CS theory and math to sci-fi and history, probing capabilities including reasoning, factuality, and browsing. UQ is difficult and realistic by construction: unsolved questions are often hard and naturally arise when humans seek answers, thus solving them yields direct real-world value. Our contributions are threefold: (1) UQ-Dataset and its collection pipeline combining rule-based filters, LLM judges, and human review to ensure question quality (e.g., well-defined and difficult); (2) UQ-Validators, compound validation strategies that leverage the generator-validator gap to provide evaluation signals and pre-screen candidate solutions for human review; and (3) UQ-Platform, an open platform where experts collectively verify questions and solutions. The top model passes UQ-validation on only 15% of questions, and preliminary human verification has already identified correct answers among those that passed. UQ charts a path for evaluating frontier models on real-world, open-ended challenges, where success pushes the frontier of human knowledge. We release UQ at this https URL.

Title: Less Is More? Examining Fairness in Pruned Large Language Models for Summarising Opinions

Authors: Nannan Huang, Haytham Fayek, Xiuzhen Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.17610
Pdf URL: https://arxiv.org/pdf/2508.17610
Copy Paste: [[2508.17610]] Less Is More? Examining Fairness in Pruned Large Language Models for Summarising Opinions(https://arxiv.org/abs/2508.17610)
Keywords: language model, llm
Abstract: Model compression through post-training pruning offers a way to reduce model size and computational requirements without significantly impacting model performance. However, the effect of pruning on the fairness of LLM-generated summaries remains unexplored, particularly for opinion summarisation where biased outputs could influence public this http URL this paper, we present a comprehensive empirical analysis of opinion summarisation, examining three state-of-the-art pruning methods and various calibration sets across three open-source LLMs using four fairness metrics. Our systematic analysis reveals that pruning methods have a greater impact on fairness than calibration sets. Building on these insights, we propose High Gradient Low Activation (HGLA) pruning, which identifies and removes parameters that are redundant for input processing but influential in output generation. Our experiments demonstrate that HGLA can better maintain or even improve fairness compared to existing methods, showing promise across models and tasks where traditional methods have limitations. Our human evaluation shows HGLA-generated outputs are fairer than existing state-of-the-art pruning methods. Code is available at: this https URL.

Title: Steering When Necessary: Flexible Steering Large Language Models with Backtracking

Authors: Jinwei Gan, Zifeng Cheng, Zhiwei Jiang, Cong Wang, Yafeng Yin, Xiang Luo, Yuchen Fu, Qing Gu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.17621
Pdf URL: https://arxiv.org/pdf/2508.17621
Copy Paste: [[2508.17621]] Steering When Necessary: Flexible Steering Large Language Models with Backtracking(https://arxiv.org/abs/2508.17621)
Keywords: language model, llm
Abstract: Large language models (LLMs) have achieved remarkable performance across many generation tasks. Nevertheless, effectively aligning them with desired behaviors remains a significant challenge. Activation steering is an effective and cost-efficient approach that directly modifies the activations of LLMs during the inference stage, aligning their responses with the desired behaviors and avoiding the high cost of fine-tuning. Existing methods typically indiscriminately intervene to all generations or rely solely on the question to determine intervention, which limits the accurate assessment of the intervention strength. To this end, we propose the Flexible Activation Steering with Backtracking (FASB) framework, which dynamically determines both the necessity and strength of intervention by tracking the internal states of the LLMs during generation, considering both the question and the generated content. Since intervening after detecting a deviation from the desired behavior is often too late, we further propose the backtracking mechanism to correct the deviated tokens and steer the LLMs toward the desired behavior. Extensive experiments on the TruthfulQA dataset and six multiple-choice datasets demonstrate that our method outperforms baselines. Our code will be released at this https URL.

Title: Stop Spinning Wheels: Mitigating LLM Overthinking via Mining Patterns for Early Reasoning Exit

Authors: Zihao Wei, Liang Pang, Jiahao Liu, Jingcheng Deng, Shicheng Xu, Zenghao Duan, Jingang Wang, Fei Sun, Xunliang Cai, Huawei Shen, Xueqi Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.17627
Pdf URL: https://arxiv.org/pdf/2508.17627
Copy Paste: [[2508.17627]] Stop Spinning Wheels: Mitigating LLM Overthinking via Mining Patterns for Early Reasoning Exit(https://arxiv.org/abs/2508.17627)
Keywords: language model, llm
Abstract: Large language models (LLMs) enhance complex reasoning tasks by scaling the individual thinking process. However, prior work shows that overthinking can degrade overall performance. Motivated by observed patterns in thinking length and content length, we categorize reasoning into three stages: insufficient exploration stage, compensatory reasoning stage, and reasoning convergence stage. Typically, LLMs produce correct answers in the compensatory reasoning stage, whereas reasoning convergence often triggers overthinking, causing increased resource usage or even infinite loops. Therefore, mitigating overthinking hinges on detecting the end of the compensatory reasoning stage, defined as the Reasoning Completion Point (RCP). RCP typically appears at the end of the first complete reasoning cycle and can be identified by querying the LLM sentence by sentence or monitoring the probability of an end-of-thinking token (e.g., \texttt{}), though these methods lack an efficient and precise balance. To improve this, we mine more sensitive and consistent RCP patterns and develop a lightweight thresholding strategy based on heuristic rules. Experimental evaluations on benchmarks (AIME24, AIME25, GPQA-D) demonstrate that the proposed method reduces token consumption while preserving or enhancing reasoning accuracy.

Title: Weights-Rotated Preference Optimization for Large Language Models

Authors: Chenxu Yang, Ruipeng Jia, Mingyu Zheng, Naibin Gu, Zheng Lin, Siyuan Chen, Weichong Yin, Hua Wu, Weiping Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.17637
Pdf URL: https://arxiv.org/pdf/2508.17637
Copy Paste: [[2508.17637]] Weights-Rotated Preference Optimization for Large Language Models(https://arxiv.org/abs/2508.17637)
Keywords: language model, llm
Abstract: Despite the efficacy of Direct Preference Optimization (DPO) in aligning Large Language Models (LLMs), reward hacking remains a pivotal challenge. This issue emerges when LLMs excessively reduce the probability of rejected completions to achieve high rewards, without genuinely meeting their intended goals. As a result, this leads to overly lengthy generation lacking diversity, as well as catastrophic forgetting of knowledge. We investigate the underlying reason behind this issue, which is representation redundancy caused by neuron collapse in the parameter space. Hence, we propose a novel Weights-Rotated Preference Optimization (RoPO) algorithm, which implicitly constrains the output layer logits with the KL divergence inherited from DPO and explicitly constrains the intermediate hidden states by fine-tuning on a multi-granularity orthogonal matrix. This design prevents the policy model from deviating too far from the reference model, thereby retaining the knowledge and expressive capabilities acquired during pre-training and SFT stages. Our RoPO achieves up to a 3.27-point improvement on AlpacaEval 2, and surpasses the best baseline by 6.2 to 7.5 points on MT-Bench with merely 0.015% of the trainable parameters, demonstrating its effectiveness in alleviating the reward hacking problem of DPO.

Title: SurveyGen: Quality-Aware Scientific Survey Generation with Large Language Models

Authors: Tong Bao, Mir Tafseer Nayeem, Davood Rafiei, Chengzhi Zhang
Subjects: cs.CL, cs.DL, cs.IR
Abstract URL: https://arxiv.org/abs/2508.17647
Pdf URL: https://arxiv.org/pdf/2508.17647
Copy Paste: [[2508.17647]] SurveyGen: Quality-Aware Scientific Survey Generation with Large Language Models(https://arxiv.org/abs/2508.17647)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Automatic survey generation has emerged as a key task in scientific document processing. While large language models (LLMs) have shown promise in generating survey texts, the lack of standardized evaluation datasets critically hampers rigorous assessment of their performance against human-written surveys. In this work, we present SurveyGen, a large-scale dataset comprising over 4,200 human-written surveys across diverse scientific domains, along with 242,143 cited references and extensive quality-related metadata for both the surveys and the cited papers. Leveraging this resource, we build QUAL-SG, a novel quality-aware framework for survey generation that enhances the standard Retrieval-Augmented Generation (RAG) pipeline by incorporating quality-aware indicators into literature retrieval to assess and select higher-quality source papers. Using this dataset and framework, we systematically evaluate state-of-the-art LLMs under varying levels of human involvement - from fully automatic generation to human-guided writing. Experimental results and human evaluations show that while semi-automatic pipelines can achieve partially competitive outcomes, fully automatic survey generation still suffers from low citation quality and limited critical analysis.

Title: CoCoA: Confidence- and Context-Aware Adaptive Decoding for Resolving Knowledge Conflicts in Large Language Models

Authors: Anant Khandelwal, Manish Gupta, Puneet Agrawal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.17670
Pdf URL: https://arxiv.org/pdf/2508.17670
Copy Paste: [[2508.17670]] CoCoA: Confidence- and Context-Aware Adaptive Decoding for Resolving Knowledge Conflicts in Large Language Models(https://arxiv.org/abs/2508.17670)
Keywords: language model, llm
Abstract: Faithful generation in large language models (LLMs) is challenged by knowledge conflicts between parametric memory and external context. Existing contrastive decoding methods tuned specifically to handle conflict often lack adaptability and can degrade performance in low conflict settings. We introduce CoCoA (Confidence- and Context-Aware Adaptive Decoding), a novel token-level algorithm for principled conflict resolution and enhanced faithfulness. CoCoA resolves conflict by utilizing confidence-aware measures (entropy gap and contextual peakedness) and the generalized divergence between the parametric and contextual distributions. Crucially, CoCoA maintains strong performance even in low conflict settings. Extensive experiments across multiple LLMs on diverse Question Answering (QA), Summarization, and Long-Form Question Answering (LFQA) benchmarks demonstrate CoCoA's state-of-the-art performance over strong baselines like AdaCAD. It yields significant gains in QA accuracy, up to 9.2 points on average compared to the strong baseline AdaCAD, and improves factuality in summarization and LFQA by up to 2.5 points on average across key benchmarks. Additionally, it demonstrates superior sensitivity to conflict variations. CoCoA enables more informed, context-aware, and ultimately more faithful token generation.

Title: EMPOWER: Evolutionary Medical Prompt Optimization With Reinforcement Learning

Authors: Yinda Chen, Yangfan He, Jing Yang, Dapeng Zhang, Zhenlong Yuan, Muhammad Attique Khan, Jamel Baili, Por Lip Yee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.17703
Pdf URL: https://arxiv.org/pdf/2508.17703
Copy Paste: [[2508.17703]] EMPOWER: Evolutionary Medical Prompt Optimization With Reinforcement Learning(https://arxiv.org/abs/2508.17703)
Keywords: language model, llm, prompt
Abstract: Prompt engineering significantly influences the reliability and clinical utility of Large Language Models (LLMs) in medical applications. Current optimization approaches inadequately address domain-specific medical knowledge and safety requirements. This paper introduces EMPOWER, a novel evolutionary framework that enhances medical prompt quality through specialized representation learning, multi-dimensional evaluation, and structure-preserving algorithms. Our methodology incorporates: (1) a medical terminology attention mechanism, (2) a comprehensive assessment architecture evaluating clarity, specificity, clinical relevance, and factual accuracy, (3) a component-level evolutionary algorithm preserving clinical reasoning integrity, and (4) a semantic verification module ensuring adherence to medical knowledge. Evaluation across diagnostic, therapeutic, and educational tasks demonstrates significant improvements: 24.7% reduction in factually incorrect content, 19.6% enhancement in domain specificity, and 15.3% higher clinician preference in blinded evaluations. The framework addresses critical challenges in developing clinically appropriate prompts, facilitating more responsible integration of LLMs into healthcare settings.

Title: Layerwise Importance Analysis of Feed-Forward Networks in Transformer-based Language Models

Authors: Wataru Ikeda, Kazuki Yano, Ryosuke Takahashi, Jaesung Lee, Keigo Shibata, Jun Suzuki
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.17734
Pdf URL: https://arxiv.org/pdf/2508.17734
Copy Paste: [[2508.17734]] Layerwise Importance Analysis of Feed-Forward Networks in Transformer-based Language Models(https://arxiv.org/abs/2508.17734)
Keywords: language model
Abstract: This study investigates the layerwise importance of feed-forward networks (FFNs) in Transformer-based language models during pretraining. We introduce an experimental approach that, while maintaining the total parameter count, increases the FFN dimensions in some layers and completely removes the FFNs from other layers. Furthermore, since our focus is on the importance of FFNs during pretraining, we train models from scratch to examine whether the importance of FFNs varies depending on their layer positions, rather than using publicly available pretrained models as is frequently done. Through comprehensive evaluations of models with varying sizes (285M, 570M, and 1.2B parameters) and layer counts (12, 24, and 40 layers), we demonstrate that concentrating FFNs in 70% of the consecutive middle layers consistently outperforms standard configurations for multiple downstream tasks.

Title: SMITE: Enhancing Fairness in LLMs through Optimal In-Context Example Selection via Dynamic Validation

Authors: Garima Chhikara, Kripabandhu Ghosh, Abhijnan Chakraborty
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.17735
Pdf URL: https://arxiv.org/pdf/2508.17735
Copy Paste: [[2508.17735]] SMITE: Enhancing Fairness in LLMs through Optimal In-Context Example Selection via Dynamic Validation(https://arxiv.org/abs/2508.17735)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are widely used for downstream tasks such as tabular classification, where ensuring fairness in their outputs is critical for inclusivity, equal representation, and responsible AI deployment. This study introduces a novel approach to enhancing LLM performance and fairness through the concept of a dynamic validation set, which evolves alongside the test set, replacing the traditional static validation approach. We also propose an iterative algorithm, SMITE, to select optimal in-context examples, with each example set validated against its corresponding dynamic validation set. The in-context set with the lowest total error is used as the final demonstration set. Our experiments across four different LLMs show that our proposed techniques significantly improve both predictive accuracy and fairness compared to baseline methods. To our knowledge, this is the first study to apply dynamic validation in the context of in-context learning for LLMs.
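The selection loop described above (score each candidate in-context set against its own validation set, keep the set with the lowest total error) might look roughly like the following. The abstract does not say how candidates are proposed, how the dynamic validation set is built, or how total error is defined, so random sampling and the callables `build_validation_set` and `total_error` are hypothetical placeholders.

```python
import random


def select_demonstrations(pool, build_validation_set, total_error,
                          set_size=4, iterations=20, seed=0):
    """Keep the candidate demonstration set with the lowest total error.

    build_validation_set(candidate) -> validation data for that candidate
    total_error(candidate, validation) -> float (e.g. accuracy plus fairness error)
    Both callables are assumptions standing in for the paper's components.
    """
    rng = random.Random(seed)
    best_set, best_err = None, float("inf")
    for _ in range(iterations):
        candidate = rng.sample(pool, set_size)        # assumed proposal step
        validation = build_validation_set(candidate)  # per-candidate validation set
        err = total_error(candidate, validation)
        if err < best_err:
            best_set, best_err = candidate, err
    return best_set, best_err


# Toy usage: the "error" of a set is just the sum of its example ids.
pool = list(range(10))
best, err = select_demonstrations(pool, lambda c: pool, lambda c, v: sum(c),
                                  set_size=2, iterations=50)
print(best, err)
```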

Title: ISACL: Internal State Analyzer for Copyrighted Training Data Leakage

Authors: Guangwei Zhang, Qisheng Su, Jiateng Liu, Cheng Qian, Yanzhou Pan, Yanjie Fu, Denghui Zhang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.17767
Pdf URL: https://arxiv.org/pdf/2508.17767
Copy Paste: [[2508.17767]] ISACL: Internal State Analyzer for Copyrighted Training Data Leakage(https://arxiv.org/abs/2508.17767)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) but pose risks of inadvertently exposing copyrighted or proprietary data, especially when such data is used for training but not intended for distribution. Traditional methods address these leaks only after content is generated, which can lead to the exposure of sensitive information. This study introduces a proactive approach: examining LLMs' internal states before text generation to detect potential leaks. By using a curated dataset of copyrighted materials, we trained a neural network classifier to identify risks, allowing for early intervention by stopping the generation process or altering outputs to prevent disclosure. Integrated with a Retrieval-Augmented Generation (RAG) system, this framework ensures adherence to copyright and licensing requirements while enhancing data privacy and ethical standards. Our results show that analyzing internal states effectively mitigates the risk of copyrighted data leakage, offering a scalable solution that fits smoothly into AI workflows, ensuring compliance with copyright regulations while maintaining high-quality text generation. The implementation is available on GitHub (this https URL).

Title: Speculating LLMs' Chinese Training Data Pollution from Their Tokens

Authors: Qingjie Zhang, Di Wang, Haoting Qian, Liu Yan, Tianwei Zhang, Ke Xu, Qi Li, Minlie Huang, Hewu Li, Han Qiu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.17771
Pdf URL: https://arxiv.org/pdf/2508.17771
Copy Paste: [[2508.17771]] Speculating LLMs' Chinese Training Data Pollution from Their Tokens(https://arxiv.org/abs/2508.17771)
Keywords: gpt, llm
Abstract: Tokens are the basic elements of the datasets used for LLM training. It is well known that many tokens representing Chinese phrases in the vocabulary of GPT (4o/4o-mini/o1/o3/4.5/4.1/o4-mini) indicate content such as pornography or online gambling. Based on this observation, our goal is to locate Polluted Chinese (PoC) tokens in LLMs and study the relationship between the existence of PoC tokens and the training data. (1) We give a formal definition and taxonomy of PoC tokens based on GPT's vocabulary. (2) We build a PoC token detector by fine-tuning an LLM to label PoC tokens in vocabularies, considering both each token's semantics and related content from search engines. (3) We study how to speculate on training data pollution from the appearance of PoC tokens (token IDs). Experiments on GPT and 23 other LLMs indicate that PoC tokens widely exist, while GPT's vocabulary behaves the worst: more than 23% of long Chinese tokens (i.e., tokens with more than two Chinese characters) relate to either pornography or online gambling. We validate the accuracy of our speculation method on well-known pre-training datasets such as C4 and Pile. Then, considering GPT-4o, we speculate that the ratio of "Yui Hatano" related webpages in GPT-4o's training data is around 0.5%.

Title: DRQA: Dynamic Reasoning Quota Allocation for Controlling Overthinking in Reasoning Large Language Models

Authors: Kaiwen Yan, Xuanqing Shi, Hongcheng Guo, Wenxuan Wang, Zhuosheng Zhang, Chengwei Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.17803
Pdf URL: https://arxiv.org/pdf/2508.17803
Copy Paste: [[2508.17803]] DRQA: Dynamic Reasoning Quota Allocation for Controlling Overthinking in Reasoning Large Language Models(https://arxiv.org/abs/2508.17803)
Keywords: language model, llm
Abstract: Reasoning large language models (RLLMs), such as OpenAI-O3 and DeepSeek-R1, have recently demonstrated remarkable capabilities by performing structured and multi-step reasoning. However, recent studies reveal that RLLMs often suffer from overthinking, i.e., producing unnecessarily lengthy reasoning chains even for simple questions, leading to excessive token consumption and computational inefficiency. Interestingly, we observe that when processing multiple questions in batch mode, RLLMs exhibit more resource-efficient behavior by dynamically compressing reasoning steps for easier problems, due to implicit resource competition. Inspired by this, we propose Dynamic Reasoning Quota Allocation (DRQA), a novel method that transfers the benefits of resource competition from batch processing to single-question inference. Specifically, DRQA leverages batch-generated preference data and reinforcement learning to train the model to allocate reasoning resources adaptively. By encouraging the model to internalize a preference for responses that are both accurate and concise, DRQA enables it to generate concise answers for simple questions while retaining sufficient reasoning depth for more challenging ones. Extensive experiments on a wide range of mathematical and scientific reasoning benchmarks demonstrate that DRQA significantly reduces token usage while maintaining, and in many cases improving, answer accuracy. By effectively mitigating the overthinking problem, DRQA offers a promising direction for more efficient and scalable deployment of RLLMs, and we hope it inspires further exploration into fine-grained control of reasoning behaviors.

Title: Beyond Demographics: Enhancing Cultural Value Survey Simulation with Multi-Stage Personality-Driven Cognitive Reasoning

Authors: Haijiang Liu, Qiyuan Li, Chao Gao, Yong Cao, Xiangyu Xu, Xun Wu, Daniel Hershcovich, Jinguang Gu
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2508.17855
Pdf URL: https://arxiv.org/pdf/2508.17855
Copy Paste: [[2508.17855]] Beyond Demographics: Enhancing Cultural Value Survey Simulation with Multi-Stage Personality-Driven Cognitive Reasoning(https://arxiv.org/abs/2508.17855)
Keywords: language model
Abstract: We introduce MARK, the Multi-stAge Reasoning frameworK for cultural value survey response simulation, designed to enhance the accuracy, steerability, and interpretability of large language models in this task. The system is inspired by the type dynamics theory in the MBTI psychological framework for personality research. It effectively predicts and utilizes human demographic information for simulation: life-situational stress analysis, group-level personality prediction, and self-weighted cognitive imitation. Experiments on the World Values Survey show that MARK outperforms existing baselines by 10% in accuracy and reduces the divergence between model predictions and human preferences. This highlights the potential of our framework to improve zero-shot personalization and help social scientists interpret model predictions.

Title: Speech Discrete Tokens or Continuous Features? A Comparative Analysis for Spoken Language Understanding in SpeechLLMs

Authors: Dingdong Wang, Junan Li, Mingyu Cui, Dongchao Yang, Xueyuan Chen, Helen Meng
Subjects: cs.CL, cs.SD
Abstract URL: https://arxiv.org/abs/2508.17863
Pdf URL: https://arxiv.org/pdf/2508.17863
Copy Paste: [[2508.17863]] Speech Discrete Tokens or Continuous Features? A Comparative Analysis for Spoken Language Understanding in SpeechLLMs(https://arxiv.org/abs/2508.17863)
Keywords: language model, llm
Abstract: With the rise of Speech Large Language Models (SpeechLLMs), two dominant approaches have emerged for speech processing: discrete tokens and continuous features. Each approach has demonstrated strong capabilities in audio-related processing tasks. However, the performance gap between these two paradigms has not been thoroughly explored. To address this gap, we present a fair comparison of self-supervised learning (SSL)-based discrete and continuous features under the same experimental settings. We evaluate their performance across six spoken language understanding-related tasks using both small and large-scale LLMs (Qwen1.5-0.5B and Llama3.1-8B). We further conduct in-depth analyses, including efficient comparison, SSL layer analysis, LLM layer analysis, and robustness comparison. Our findings reveal that continuous features generally outperform discrete tokens in various tasks. Each speech processing method exhibits distinct characteristics and patterns in how it learns and processes speech information. We hope our results will provide valuable insights to advance spoken language understanding in SpeechLLMs.

Title: ILRe: Intermediate Layer Retrieval for Context Compression in Causal Language Models

Authors: Manlai Liang, Mandi Liu, Jiangzhou Ji, Huaijun Li, Haobo Yang, Yaohan He, Jinlong Li
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.17892
Pdf URL: https://arxiv.org/pdf/2508.17892
Copy Paste: [[2508.17892]] ILRe: Intermediate Layer Retrieval for Context Compression in Causal Language Models(https://arxiv.org/abs/2508.17892)
Keywords: language model, llm, long context
Abstract: Large Language Models (LLMs) have demonstrated success across many benchmarks. However, they still exhibit limitations in long-context scenarios, primarily due to their short effective context length, quadratic computational complexity, and high memory overhead when processing lengthy inputs. To mitigate these issues, we introduce a novel context compression pipeline, called Intermediate Layer Retrieval (ILRe), which determines one intermediate decoder layer offline, encodes context by streaming chunked prefill only up to that layer, and recalls tokens by the attention scores between the input query and the full key cache in that specified layer. In particular, we propose a multi-pooling-kernel allocation strategy in the token recall process to maintain semantic completeness. Our approach not only reduces the prefilling complexity from $O(L^2)$ to $O(L)$, but also achieves performance comparable to or better than the full context in long-context scenarios. Without additional post-training or operator development, ILRe can process a single $1M$-token request in less than half a minute (speedup $\approx 180\times$) and achieves a score of $\approx 79.8$ on the RULER-$1M$ benchmark with the model Llama-3.1-UltraLong-8B-1M-Instruct on a Huawei Ascend 910B NPU.
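The token recall step can be illustrated with a small pure-Python sketch. It assumes per-token attention scores for one layer are already available as a list, and it guesses at the pooling strategy: the recall budget is split evenly across several max-pooling kernel sizes so that neighbors of high-scoring tokens are also kept. The function names and the even split are assumptions, not the paper's implementation.

```python
def max_pool_1d(scores, kernel):
    """Centered 1D max pooling with the same output length as the input."""
    half = kernel // 2
    n = len(scores)
    return [max(scores[max(0, i - half):min(n, i + half + 1)])
            for i in range(n)]


def recall_tokens(scores, budget, kernels=(1, 3)):
    """Pick token positions to keep, sharing the budget across kernel sizes.

    Assumption: each kernel size gets an equal share of the budget; wider
    kernels pull in context around attention peaks.
    """
    per_kernel = budget // len(kernels)
    chosen = set()
    for kernel in kernels:
        pooled = max_pool_1d(scores, kernel)
        # Rank by pooled score, breaking ties by position for determinism.
        ranked = sorted(range(len(scores)), key=lambda i: (-pooled[i], i))
        chosen.update(ranked[:per_kernel])
    return sorted(chosen)  # keep original token order


# Toy attention scores over six context tokens.
kept = recall_tokens([0.1, 0.9, 0.1, 0.1, 0.8, 0.1], budget=4)
print(kept)
```

Because the kernels can select overlapping positions, the number of tokens kept may be smaller than the budget.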

Title: Pandora: Leveraging Code-driven Knowledge Transfer for Unified Structured Knowledge Reasoning

Authors: Yongrui Chen, Junhao He, Linbo Fu, Shenyu Zhang, Rihui Jin, Xinbang Dai, Jiaqi Li, Dehai Min, Nan Hu, Yuxin Zhang, Guilin Qi, Yi Huang, Tongtong Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.17905
Pdf URL: https://arxiv.org/pdf/2508.17905
Copy Paste: [[2508.17905]] Pandora: Leveraging Code-driven Knowledge Transfer for Unified Structured Knowledge Reasoning(https://arxiv.org/abs/2508.17905)
Keywords: llm
Abstract: Unified Structured Knowledge Reasoning (USKR) aims to answer natural language questions by using structured sources such as tables, databases, and knowledge graphs in a unified way. Existing USKR methods rely on task-specific strategies or bespoke representations, which hinder their ability to dismantle barriers between different SKR tasks, thereby constraining their overall performance in cross-task scenarios. In this paper, we introduce Pandora, a novel USKR framework that addresses the limitations of existing methods by leveraging two key innovations. First, we propose a code-based unified knowledge representation using Python's Pandas API, which aligns seamlessly with the pre-training of LLMs. This representation facilitates a cohesive approach to handling different structured knowledge sources. Building on this foundation, we employ knowledge transfer to bolster the unified reasoning process of LLMs by automatically building cross-task memory. By adaptively correcting reasoning using feedback from code execution, Pandora showcases impressive unified reasoning capabilities. Extensive experiments on six widely used benchmarks across three SKR tasks demonstrate that Pandora outperforms existing unified reasoning frameworks and competes effectively with task-specific methods.

Title: AMELIA: A Family of Multi-task End-to-end Language Models for Argumentation

Authors: Henri Savigny, Bruno Yun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.17926
Pdf URL: https://arxiv.org/pdf/2508.17926
Copy Paste: [[2508.17926]] AMELIA: A Family of Multi-task End-to-end Language Models for Argumentation(https://arxiv.org/abs/2508.17926)
Keywords: language model
Abstract: Argument mining is a subfield of argumentation that aims to automatically extract argumentative structures and their relations from natural language texts. This paper investigates how a single large language model can be leveraged to perform one or several argument mining tasks. Our contributions are two-fold. First, we construct a multi-task dataset by surveying and converting 19 well-known argument mining datasets from the literature into a unified format. Second, we explore various training strategies using Meta AI's Llama-3.1-8B-Instruct model: (1) fine-tuning on individual tasks, (2) fine-tuning jointly on multiple tasks, and (3) merging models fine-tuned separately on individual tasks. Our experiments show that task-specific fine-tuning significantly improves individual performance across all tasks. Moreover, multi-task fine-tuning maintains strong performance without degradation, suggesting effective transfer learning across related tasks. Finally, we demonstrate that model merging offers a viable compromise: it yields competitive performance while mitigating the computational costs associated with full multi-task fine-tuning.

Title: Debiasing Multilingual LLMs in Cross-lingual Latent Space

Authors: Qiwei Peng, Guimin Hu, Yekun Chai, Anders Søgaard
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.17948
Pdf URL: https://arxiv.org/pdf/2508.17948
Copy Paste: [[2508.17948]] Debiasing Multilingual LLMs in Cross-lingual Latent Space(https://arxiv.org/abs/2508.17948)
Keywords: language model, llm
Abstract: Debiasing techniques such as SentDebias aim to reduce bias in large language models (LLMs). Previous studies have evaluated their cross-lingual transferability by directly applying these methods to LLM representations, revealing their limited effectiveness across languages. In this work, we therefore propose to perform debiasing in a joint latent space rather than directly on LLM representations. We construct a well-aligned cross-lingual latent space using an autoencoder trained on parallel TED talk scripts. Our experiments with Aya-expanse and two debiasing techniques across four languages (English, French, German, Dutch) demonstrate that a) autoencoders effectively construct a well-aligned cross-lingual latent space, and b) applying debiasing techniques in the learned cross-lingual latent space significantly improves both the overall debiasing performance and cross-lingual transferability.
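Projection-based debiasing of the kind referred to above removes the component of a representation that lies along an estimated bias direction. The sketch below shows only that projection step on a plain vector, using a single direction for simplicity; in the setting described, it would be applied to latent codes produced by the autoencoder, which is omitted here. The function name and the toy vectors are illustrative assumptions.

```python
import math


def remove_bias_direction(vector, bias_direction):
    """Subtract the projection of `vector` onto `bias_direction`.

    Simplified to a single bias direction; estimating that direction
    (and encoding/decoding through an autoencoder) is out of scope.
    """
    norm = math.sqrt(sum(b * b for b in bias_direction))
    if norm == 0.0:
        return list(vector)  # no direction to remove
    unit = [b / norm for b in bias_direction]
    dot = sum(v * u for v, u in zip(vector, unit))
    return [v - dot * u for v, u in zip(vector, unit)]


# Toy example: remove the first axis from a 2-D latent code.
print(remove_bias_direction([1.0, 2.0], [2.0, 0.0]))
```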

Title: Understanding Subword Compositionality of Large Language Models

Authors: Qiwei Peng, Yekun Chai, Anders Søgaard
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.17953
Pdf URL: https://arxiv.org/pdf/2508.17953
Copy Paste: [[2508.17953]] Understanding Subword Compositionality of Large Language Models(https://arxiv.org/abs/2508.17953)
Keywords: language model, llm
Abstract: Large language models (LLMs) take sequences of subwords as input, requiring them to effectively compose subword representations into meaningful word-level representations. In this paper, we present a comprehensive set of experiments to probe how LLMs compose subword information, focusing on three key aspects: structural similarity, semantic decomposability, and form retention. Our analysis of the experiments suggests that these five LLM families can be classified into three distinct groups, likely reflecting differences in their underlying composition strategies. Specifically, we observe (i) three distinct patterns in the evolution of structural similarity between subword compositions and whole-word representations across layers; (ii) great performance when probing layer by layer their sensitivity to semantic decompositionality; and (iii) three distinct patterns when probing sensitivity to formal features, e.g., character sequence length. These findings provide valuable insights into the compositional dynamics of LLMs and highlight different compositional patterns in how LLMs encode and integrate subword information.

Title: German4All - A Dataset and Model for Readability-Controlled Paraphrasing in German

Authors: Miriam Anschütz, Thanh Mai Pham, Eslam Nasrallah, Maximilian Müller, Cristian-George Craciun, Georg Groh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.17973
Pdf URL: https://arxiv.org/pdf/2508.17973
Copy Paste: [[2508.17973]] German4All - A Dataset and Model for Readability-Controlled Paraphrasing in German(https://arxiv.org/abs/2508.17973)
Keywords: gpt, llm
Abstract: The ability to paraphrase texts across different complexity levels is essential for creating accessible texts that can be tailored toward diverse reader groups. Thus, we introduce German4All, the first large-scale German dataset of aligned readability-controlled, paragraph-level paraphrases. It spans five readability levels and comprises over 25,000 samples. The dataset is automatically synthesized using GPT-4 and rigorously evaluated through both human and LLM-based judgments. Using German4All, we train an open-source, readability-controlled paraphrasing model that achieves state-of-the-art performance in German text simplification, enabling more nuanced and reader-specific adaptations. We open-source both the dataset and the model to encourage further research on multi-level paraphrasing.

Title: A Retail-Corpus for Aspect-Based Sentiment Analysis with Large Language Models

Authors: Oleg Silcenco, Marcos R. Machad, Wallace C. Ugulino, Daniel Braun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.17994
Pdf URL: https://arxiv.org/pdf/2508.17994
Copy Paste: [[2508.17994]] A Retail-Corpus for Aspect-Based Sentiment Analysis with Large Language Models(https://arxiv.org/abs/2508.17994)
Keywords: language model, gpt
Abstract: Aspect-based sentiment analysis enhances sentiment detection by associating it with specific aspects, offering deeper insights than traditional sentiment analysis. This study introduces a manually annotated dataset of 10,814 multilingual customer reviews covering brick-and-mortar retail stores, labeled with eight aspect categories and their sentiment. Using this dataset, the performance of GPT-4 and LLaMA-3 in aspect-based sentiment analysis is evaluated to establish a baseline for the newly introduced data. The results show both models achieving over 85% accuracy, while GPT-4 outperforms LLaMA-3 overall with regard to all relevant metrics.

Title: Neither Valid nor Reliable? Investigating the Use of LLMs as Judges

Authors: Khaoula Chehbouni, Mohammed Haddou, Jackie Chi Kit Cheung, Golnoosh Farnadi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.18076
Pdf URL: https://arxiv.org/pdf/2508.18076
Copy Paste: [[2508.18076]] Neither Valid nor Reliable? Investigating the Use of LLMs as Judges(https://arxiv.org/abs/2508.18076)
Keywords: language model, llm
Abstract: Evaluating natural language generation (NLG) systems remains a core challenge of natural language processing (NLP), further complicated by the rise of large language models (LLMs) that aim to be general-purpose. Recently, large language models as judges (LLJs) have emerged as a promising alternative to traditional metrics, but their validity remains underexplored. This position paper argues that the current enthusiasm around LLJs may be premature, as their adoption has outpaced rigorous scrutiny of their reliability and validity as evaluators. Drawing on measurement theory from the social sciences, we identify and critically assess four core assumptions underlying the use of LLJs: their ability to act as proxies for human judgment, their capabilities as evaluators, their scalability, and their cost-effectiveness. We examine how each of these assumptions may be challenged by the inherent limitations of LLMs, LLJs, or current practices in NLG evaluation. To ground our analysis, we explore three applications of LLJs: text summarization, data annotation, and safety alignment. Finally, we highlight the need for more responsible practices in LLJ evaluation, to ensure that their growing role in the field supports, rather than undermines, progress in NLG.

Title: How Quantization Shapes Bias in Large Language Models

Authors: Federico Marcuzzi, Xuefei Ning, Roy Schwartz, Iryna Gurevych
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.18088
Pdf URL: https://arxiv.org/pdf/2508.18088
Copy Paste: [[2508.18088]] How Quantization Shapes Bias in Large Language Models(https://arxiv.org/abs/2508.18088)
Keywords: language model
Abstract: This work presents a comprehensive evaluation of how quantization affects model bias, with particular attention to its impact on individual demographic subgroups. We focus on weight and activation quantization strategies and examine their effects across a broad range of bias types, including stereotypes, toxicity, sentiment, and fairness. We employ both probabilistic and generated text-based metrics across nine benchmarks and evaluate models varying in architecture family and reasoning ability. Our findings show that quantization has a nuanced impact on bias: while it can reduce model toxicity and does not significantly impact sentiment, it tends to slightly increase stereotypes and unfairness in generative tasks, especially under aggressive compression. These trends are generally consistent across demographic categories and model types, although their magnitude depends on the specific setting. Overall, our results highlight the importance of carefully balancing efficiency and ethical considerations when applying quantization in practice.

Title: Agri-Query: A Case Study on RAG vs. Long-Context LLMs for Cross-Lingual Technical Question Answering

Authors: Julius Gun, Timo Oksanen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.18093
Pdf URL: https://arxiv.org/pdf/2508.18093
Copy Paste: [[2508.18093]] Agri-Query: A Case Study on RAG vs. Long-Context LLMs for Cross-Lingual Technical Question Answering(https://arxiv.org/abs/2508.18093)
Keywords: language model, llm, hallucination, prompt, retrieval-augmented generation
Abstract: We present a case study evaluating large language models (LLMs) with 128K-token context windows on a technical question answering (QA) task. Our benchmark is built on a user manual for an agricultural machine, available in English, French, and German. It simulates a cross-lingual information retrieval scenario where questions are posed in English against all three language versions of the manual. The evaluation focuses on realistic "needle-in-a-haystack" challenges and includes unanswerable questions to test for hallucinations. We compare nine long-context LLMs using direct prompting against three Retrieval-Augmented Generation (RAG) strategies (keyword, semantic, hybrid), with an LLM-as-a-judge for evaluation. Our findings for this specific manual show that Hybrid RAG consistently outperforms direct long-context prompting. Models like Gemini 2.5 Flash and the smaller Qwen 2.5 7B achieve high accuracy (over 85%) across all languages with RAG. This paper contributes a detailed analysis of LLM performance in a specialized industrial domain and an open framework for similar evaluations, highlighting practical trade-offs and challenges.
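A hybrid retrieval strategy combines a keyword ranking with a semantic ranking of the same chunks. The abstract does not state how the two are fused, so the sketch below swaps in reciprocal rank fusion (RRF), a common fusion rule, purely as an illustration. The chunk ids and the constant `k=60` are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in;
    ties are broken by chunk id so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda c: (-scores[c], c))


# Hypothetical results from a keyword retriever and a semantic retriever.
keyword_hits = ["c2", "c1", "c3"]
semantic_hits = ["c1", "c4", "c2"]
print(reciprocal_rank_fusion([keyword_hits, semantic_hits]))
```

Chunks found by both retrievers rise to the top, while chunks found by only one still remain candidates for the LLM's context.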

Title: Detecting and Characterizing Planning in Language Models

Authors: Jatin Nainani, Sankaran Vaidyanathan, Connor Watts, Andre N. Assis, Alice Rigg
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.18098
Pdf URL: https://arxiv.org/pdf/2508.18098
Copy Paste: [[2508.18098]] Detecting and Characterizing Planning in Language Models(https://arxiv.org/abs/2508.18098)
Keywords: language model, llm, prompt
Abstract: Modern large language models (LLMs) have demonstrated impressive performance across a wide range of multi-step reasoning tasks. Recent work suggests that LLMs may perform planning - selecting a future target token in advance and generating intermediate tokens that lead towards it - rather than merely improvising one token at a time. However, existing studies assume fixed planning horizons and often focus on single prompts or narrow domains. To distinguish planning from improvisation across models and tasks, we present formal and causally grounded criteria for detecting planning and operationalize them as a semi-automated annotation pipeline. We apply this pipeline to both base and instruction-tuned Gemma-2-2B models on the MBPP code generation benchmark and a poem generation task where Claude 3.5 Haiku was previously shown to plan. Our findings show that planning is not universal: unlike Haiku, Gemma-2-2B solves the same poem generation task through improvisation, and on MBPP it switches between planning and improvisation across similar tasks and even successive token predictions. We further show that instruction tuning refines existing planning behaviors in the base model rather than creating them from scratch. Together, these studies provide a reproducible and scalable foundation for mechanistic studies of planning in LLMs.

Title: SentiMM: A Multimodal Multi-Agent Framework for Sentiment Analysis in Social Media

Authors: Xilai Xu, Zilin Zhao, Chengye Song, Zining Wang, Jinhe Qiang, Jiongrui Yan, Yuhuai Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.18108
Pdf URL: https://arxiv.org/pdf/2508.18108
Copy Paste: [[2508.18108]] SentiMM: A Multimodal Multi-Agent Framework for Sentiment Analysis in Social Media(https://arxiv.org/abs/2508.18108)
Keywords: agent
Abstract: With the increasing prevalence of multimodal content on social media, sentiment analysis faces significant challenges in effectively processing heterogeneous data and recognizing multi-label emotions. Existing methods often lack effective cross-modal fusion and external knowledge integration. We propose SentiMM, a novel multi-agent framework designed to systematically address these challenges. SentiMM processes text and visual inputs through specialized agents, fuses multimodal features, enriches context via knowledge retrieval, and aggregates results for final sentiment classification. We also introduce SentiMMD, a large-scale multimodal dataset with seven fine-grained sentiment categories. Extensive experiments demonstrate that SentiMM achieves superior performance compared to state-of-the-art baselines, validating the effectiveness of our structured approach.

Title: DiscussLLM: Teaching Large Language Models When to Speak

Authors: Deep Anil Patel, Iain Melvin, Christopher Malon, Martin Renqiang Min
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2508.18167
Pdf URL: https://arxiv.org/pdf/2508.18167
Copy Paste: [[2508.18167]] DiscussLLM: Teaching Large Language Models When to Speak(https://arxiv.org/abs/2508.18167)
Keywords: language model, llm, prompt, agent
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in understanding and generating human-like text, yet they largely operate as reactive agents, responding only when directly prompted. This passivity creates an "awareness gap," limiting their potential as truly collaborative partners in dynamic human discussions. We introduce $\textit{DiscussLLM}$, a framework designed to bridge this gap by training models to proactively decide not just $\textit{what}$ to say, but critically, $\textit{when}$ to speak. Our primary contribution is a scalable two-stage data generation pipeline that synthesizes a large-scale dataset of realistic multi-turn human discussions. Each discussion is annotated with one of five intervention types (e.g., Factual Correction, Concept Definition) and contains an explicit conversational trigger where an AI intervention adds value. By training models to predict a special silent token when no intervention is needed, they learn to remain quiet until a helpful contribution can be made. We explore two architectural baselines: an integrated end-to-end model and a decoupled classifier-generator system optimized for low-latency inference. We evaluate these models on their ability to accurately time interventions and generate helpful responses, paving the way for more situationally aware and proactive conversational AI.

Title: Improving End-to-End Training of Retrieval-Augmented Generation Models via Joint Stochastic Approximation

Authors: Hongyu Cao, Yuxuan Wu, Yucheng Cai, Xianyu Zhao, Zhijian Ou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.18168
Pdf URL: https://arxiv.org/pdf/2508.18168
Copy Paste: [[2508.18168]] Improving End-to-End Training of Retrieval-Augmented Generation Models via Joint Stochastic Approximation(https://arxiv.org/abs/2508.18168)
Keywords: retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) has become a widely recognized paradigm to combine parametric memory with non-parametric memories. An RAG model consists of two serial connecting components (retriever and generator). A major challenge in end-to-end optimization of the RAG model is that marginalization over relevant passages (modeled as discrete latent variables) from a knowledge base is required. Traditional top-K marginalization and variational RAG (VRAG) suffer from biased or high-variance gradient estimates. In this paper, we propose and develop joint stochastic approximation (JSA) based end-to-end training of RAG, which is referred to as JSA-RAG. The JSA algorithm is a stochastic extension of the EM (expectation-maximization) algorithm and is particularly powerful in estimating discrete latent variable models. Extensive experiments are conducted on five datasets for two tasks (open-domain question answering, knowledge-grounded dialogs) and show that JSA-RAG significantly outperforms both vanilla RAG and VRAG. Further analysis shows the efficacy of JSA-RAG from the perspectives of generation, retrieval, and low-variance gradient estimate.
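For orientation, the marginalization the abstract refers to can be written as follows. This is the standard RAG latent-variable formulation, not notation taken from the paper: $x$ is the input, $y$ the output, $z$ a passage from the knowledge base $\mathcal{Z}$, and $\eta$, $\theta$ are assumed names for the retriever and generator parameters.

```latex
\[
p(y \mid x)
  \;=\; \sum_{z \in \mathcal{Z}} p_{\eta}(z \mid x)\, p_{\theta}(y \mid x, z)
  \;\approx\; \sum_{z \in \operatorname{top-}K\left(p_{\eta}(\cdot \mid x)\right)}
        p_{\eta}(z \mid x)\, p_{\theta}(y \mid x, z)
\]
```

The exact sum runs over every passage in the knowledge base, which is usually intractable. Top-K marginalization keeps only the K highest-scoring passages, and dropping the remaining terms is one plausible source of the bias the abstract mentions.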

Title: Leveraging Large Language Models for Accurate Sign Language Translation in Low-Resource Scenarios

Authors: Luana Bulla, Gabriele Tuccio, Misael Mongiovì, Aldo Gangemi
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2508.18183
Pdf URL: https://arxiv.org/pdf/2508.18183
Copy Paste: [[2508.18183]] Leveraging Large Language Models for Accurate Sign Language Translation in Low-Resource Scenarios(https://arxiv.org/abs/2508.18183)
Keywords: language model, llm, prompt
Abstract: Translating natural languages into sign languages is a highly complex and underexplored task. Despite growing interest in accessibility and inclusivity, the development of robust translation systems remains hindered by the limited availability of parallel corpora that align natural language with sign language data. Existing methods often struggle to generalize in these data-scarce environments, as the few datasets available are typically domain-specific, lack standardization, or fail to capture the full linguistic richness of sign languages. To address this limitation, we propose Advanced Use of LLMs for Sign Language Translation (AulSign), a novel method that leverages Large Language Models via dynamic prompting and in-context learning with sample selection and subsequent sign association. Despite their impressive abilities in processing text, LLMs lack intrinsic knowledge of sign languages; therefore, they are unable to natively perform this kind of translation. To overcome this limitation, we associate the signs with compact descriptions in natural language and instruct the model to use them. We evaluate our method on both English and Italian using SignBank+, a recognized benchmark in the field, as well as the Italian LaCAM CNR-ISTC dataset. We demonstrate superior performance compared to state-of-the-art models in low-data scenarios. Our findings demonstrate the effectiveness of AulSign, with the potential to enhance accessibility and inclusivity in communication technologies for underrepresented linguistic communities.

Title: Why Synthetic Isn't Real Yet: A Diagnostic Framework for Contact Center Dialogue Generation

Authors: Rishikesh Devanathan, Varun Nathan, Ayush Kumar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.18210
Pdf URL: https://arxiv.org/pdf/2508.18210
Copy Paste: [[2508.18210]] Why Synthetic Isn't Real Yet: A Diagnostic Framework for Contact Center Dialogue Generation(https://arxiv.org/abs/2508.18210)
Keywords: prompt, agent
Abstract: Synthetic transcript generation is critical in contact center domains, where privacy and data scarcity limit model training and evaluation. Unlike prior synthetic dialogue generation work on open-domain or medical dialogues, contact center conversations are goal-oriented, role-asymmetric, and behaviorally complex, featuring disfluencies, ASR noise, and compliance-driven agent actions. In deployments where transcripts are unavailable, standard pipelines still yield derived call attributes such as Intent Summaries, Topic Flow, and QA Evaluation Forms. We leverage these as supervision signals to guide generation. To assess the quality of such outputs, we introduce a diagnostic framework of 18 linguistically and behaviorally grounded metrics for comparing real and synthetic transcripts. We benchmark four language-agnostic generation strategies, from simple prompting to characteristic-aware multi-stage approaches, alongside reference-free baselines. Results reveal persistent challenges: no method excels across all traits, with notable deficits in disfluency, sentiment, and behavioral realism. Our diagnostic tool exposes these gaps, enabling fine-grained evaluation and stress testing of synthetic dialogue across languages.

Title: Better Language Model-Based Judging Reward Modeling through Scaling Comprehension Boundaries

Authors: Meiling Ning, Zhongbao Zhang, Junda Ye, Jiabao Guo, Qingyuan Guan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.18212
Pdf URL: https://arxiv.org/pdf/2508.18212
Copy Paste: [[2508.18212]] Better Language Model-Based Judging Reward Modeling through Scaling Comprehension Boundaries(https://arxiv.org/abs/2508.18212)
Keywords: language model
Abstract: The emergence of LM-based judging reward modeling, represented by generative reward models, has successfully made reinforcement learning from AI feedback (RLAIF) efficient and scalable. To further advance this paradigm, we propose a core insight: this form of reward modeling shares fundamental formal consistency with natural language inference (NLI), a core task in natural language understanding. This reframed perspective points to a key path for building superior reward models: scaling the model's comprehension boundaries. Pursuing this path, exploratory experiments on NLI tasks demonstrate that the slot prediction masked language models (MLMs) incorporating contextual explanations achieve significantly better performance compared to mainstream autoregressive models. Based on this key finding, we propose ESFP-RM, a two-stage LM-based judging reward model that utilizes an explanation based slot framework for prediction to fully leverage the advantages of MLMs. Extensive experiments demonstrate that in both reinforcement learning from human feedback (RLHF) and out-of-distribution (OOD) scenarios, the ESFP-RM framework delivers more stable and generalizable reward signals compared to generative reward models.

Title: MTalk-Bench: Evaluating Speech-to-Speech Models in Multi-Turn Dialogues via Arena-style and Rubrics Protocols

Authors: Yuhao Du, Qianwei Huang, Guo Zhu, Zhanchen Dai, Sunian Chen, Qiming Zhu, Yuhao Zhang, Li Zhou, Benyou Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.18240
Pdf URL: https://arxiv.org/pdf/2508.18240
Copy Paste: [[2508.18240]] MTalk-Bench: Evaluating Speech-to-Speech Models in Multi-Turn Dialogues via Arena-style and Rubrics Protocols(https://arxiv.org/abs/2508.18240)
Keywords: language model, llm
Abstract: The rapid advancement of speech-to-speech (S2S) large language models (LLMs) has significantly improved real-time spoken interaction. However, current evaluation frameworks remain inadequate for assessing performance in complex, multi-turn dialogues. To address this, we introduce MTalk-Bench, a multi-turn S2S benchmark covering three core dimensions: Semantic Information, Paralinguistic Information, and Ambient Sound. Each dimension includes nine realistic scenarios, along with targeted tasks to assess specific capabilities such as reasoning. Our dual-method evaluation framework combines Arena-style evaluation (pairwise comparison) and Rubrics-based evaluation (absolute scoring) for relative and absolute assessment. The benchmark includes both model and human outputs, evaluated by human evaluators and LLMs. Experimental results reveal two sets of findings. Overall performance of S2S LLMs: (1) models excel at semantic information processing yet underperform on paralinguistic information and ambient sounds perception; (2) models typically regain coherence by increasing response length, sacrificing efficiency in multi-turn dialogues; (3) modality-aware, task-specific designs outperform brute scaling. Evaluation framework and reliability: (1) Arena and Rubrics yield consistent, complementary rankings, but reliable distinctions emerge only when performance gaps are large; (2) LLM-as-a-judge aligns with humans when gaps are clear or criteria explicit, but exhibits position and length biases and is reliable on nonverbal evaluation only with text annotations. These results highlight current limitations in S2S evaluation and the need for more robust, speech-aware assessment frameworks.
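The two evaluation modes named in the abstract (pairwise Arena-style comparison and absolute Rubrics-based scoring) can be illustrated with a minimal sketch. The function names, input shapes, and the tie-counts-as-half convention are assumptions made for illustration; they are not taken from MTalk-Bench.

```python
from collections import defaultdict


def arena_win_rates(comparisons):
    """Relative assessment from pairwise comparisons.

    comparisons: iterable of (model_a, model_b, winner), where winner is
    model_a, model_b, or None for a tie (a tie counts as half a win each;
    this convention is an assumption of the sketch).
    """
    wins = defaultdict(float)
    games = defaultdict(int)
    for model_a, model_b, winner in comparisons:
        games[model_a] += 1
        games[model_b] += 1
        if winner is None:
            wins[model_a] += 0.5
            wins[model_b] += 0.5
        else:
            wins[winner] += 1.0
    return {model: wins[model] / games[model] for model in games}


def rubric_means(scores):
    """Absolute assessment: mean rubric score per model.

    scores: iterable of (model, score) pairs.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for model, score in scores:
        totals[model] += score
        counts[model] += 1
    return {model: totals[model] / counts[model] for model in counts}
```

Ranking models by each dictionary gives the two orderings whose agreement the abstract discusses. The sketch does not model judge position or length bias.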

Title: Demographic Biases and Gaps in the Perception of Sexism in Large Language Models

Authors: Judith Tavarez-Rodríguez, Fernando Sánchez-Vega, A. Pastor López-Monroy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.18245
Pdf URL: https://arxiv.org/pdf/2508.18245
Copy Paste: [[2508.18245]] Demographic Biases and Gaps in the Perception of Sexism in Large Language Models(https://arxiv.org/abs/2508.18245)
Keywords: language model, llm
Abstract: The use of Large Language Models (LLMs) has proven to be a tool that could help in the automatic detection of sexism. Previous studies have shown that these models contain biases that do not accurately reflect reality, especially for minority groups. Despite various efforts to improve the detection of sexist content, this task remains a significant challenge due to its subjective nature and the biases present in automated models. We explore the capabilities of different LLMs to detect sexism in social media text using the EXIST 2024 tweet dataset. It includes annotations from six distinct profiles for each tweet, allowing us to evaluate to what extent LLMs can mimic these groups' perceptions in sexism detection. Additionally, we analyze the demographic biases present in the models and conduct a statistical analysis to identify which demographic characteristics (age, gender) contribute most effectively to this task. Our results show that, while LLMs can to some extent detect sexism when considering the overall opinion of populations, they do not accurately replicate the diversity of perceptions among different demographic groups. This highlights the need for better-calibrated models that account for the diversity of perspectives across different populations.

Title: From BERT to LLMs: Comparing and Understanding Chinese Classifier Prediction in Language Models

Authors: Ziqi Zhang, Jianfei Ma, Emmanuele Chersoni, Jieshun You, Zhaoxin Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.18253
Pdf URL: https://arxiv.org/pdf/2508.18253
Copy Paste: [[2508.18253]] From BERT to LLMs: Comparing and Understanding Chinese Classifier Prediction in Language Models(https://arxiv.org/abs/2508.18253)
Keywords: language model, llm
Abstract: Classifiers are an important and defining feature of the Chinese language, and their correct prediction is key to numerous educational applications. Yet whether the most popular Large Language Models (LLMs) possess proper knowledge of Chinese classifiers is an issue that has largely remained unexplored in the Natural Language Processing (NLP) literature. To address this question, we employ various masking strategies to evaluate the LLMs' intrinsic ability, the contribution of different sentence elements, and the workings of the attention mechanisms during prediction. In addition, we explore fine-tuning LLMs to enhance classifier performance. Our findings reveal that LLMs perform worse than BERT, even with fine-tuning. The prediction, as expected, greatly benefits from information about the following noun, which also explains the advantage of models with a bidirectional attention mechanism such as BERT.

Title: MIRAGE: Scaling Test-Time Inference with Parallel Graph-Retrieval-Augmented Reasoning Chains

Authors: Kaiwen Wei, Rui Shan, Dongsheng Zou, Jianzhong Yang, Bi Zhao, Junnan Zhu, Jiang Zhong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.18260
Pdf URL: https://arxiv.org/pdf/2508.18260
Copy Paste: [[2508.18260]] MIRAGE: Scaling Test-Time Inference with Parallel Graph-Retrieval-Augmented Reasoning Chains(https://arxiv.org/abs/2508.18260)
Keywords: gpt, prompt, retrieval augmented generation, chain-of-thought, tree-of-thought
Abstract: Large reasoning models (LRMs) have shown significant progress in test-time scaling through chain-of-thought prompting. Current approaches like search-o1 integrate retrieval augmented generation (RAG) into multi-step reasoning processes but rely on a single, linear reasoning chain while incorporating unstructured textual information in a flat, context-agnostic manner. As a result, these approaches can lead to error accumulation throughout the reasoning chain, which significantly limits its effectiveness in medical question-answering (QA) tasks where both accuracy and traceability are critical requirements. To address these challenges, we propose MIRAGE (Multi-chain Inference with Retrieval-Augmented Graph Exploration), a novel test-time scalable reasoning framework that performs dynamic multi-chain inference over structured medical knowledge graphs. Specifically, MIRAGE 1) decomposes complex queries into entity-grounded sub-questions, 2) executes parallel inference chains, 3) retrieves evidence adaptively via neighbor expansion and multi-hop traversal, and 4) integrates answers using cross-chain verification to resolve contradictions. Experiments on three medical QA benchmarks (GenMedGPT-5k, CMCQA, and ExplainCPE) show that MIRAGE consistently outperforms GPT-4o, Tree-of-Thought variants, and other retrieval-augmented baselines in both automatic and human evaluations. Additionally, MIRAGE improves interpretability by generating explicit reasoning chains that trace each factual claim to concrete chains within the knowledge graph, making it well-suited for complex medical reasoning scenarios. The code will be available for further research.
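Step 3 of the abstract (evidence retrieval via neighbor expansion and multi-hop traversal) can be sketched as a bounded breadth-first search over a knowledge graph. This is a generic illustration and not MIRAGE's implementation: the graph layout, the function name `expand_neighbors`, and the toy triples in `TOY_GRAPH` are all hypothetical.

```python
from collections import deque

# Hypothetical toy knowledge graph: entity -> list of (relation, neighbor).
TOY_GRAPH = {
    "fever": [("symptom_of", "influenza")],
    "influenza": [("treated_by", "oseltamivir")],
    "oseltamivir": [],
}


def expand_neighbors(graph, seeds, max_hops):
    """Return {entity: hop distance} for entities within max_hops of any seed.

    Breadth-first traversal, so each entity gets its shortest hop distance.
    """
    dist = {seed: 0 for seed in seeds}
    queue = deque(dist)
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # hop budget reached; do not expand this node further
        for _relation, neighbor in graph.get(node, []):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist
```

For example, `expand_neighbors(TOY_GRAPH, ["fever"], 1)` reaches only `influenza`, while a budget of two hops also reaches `oseltamivir`. An adaptive retriever, as the abstract describes, would presumably choose the hop budget per sub-question instead of fixing it.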