2025-09-09

Title: An Empirical Analysis of Discrete Unit Representations in Speech Language Modeling Pre-training

Authors: Yanis Labrak, Richard Dufour, Mickaël Rouvier
Subjects: cs.CL, cs.AI, eess.AS
Abstract URL: https://arxiv.org/abs/2509.05359
Pdf URL: https://arxiv.org/pdf/2509.05359
Copy Paste: [[2509.05359]] An Empirical Analysis of Discrete Unit Representations in Speech Language Modeling Pre-training(https://arxiv.org/abs/2509.05359)
Keywords: language model
Abstract: This paper investigates discrete unit representations in Speech Language Models (SLMs), focusing on optimizing speech modeling during continual pre-training. In this paper, we systematically examine how model architecture, data representation, and training robustness influence the pre-training stage in which we adapt existing pre-trained language models to the speech modality. Our experiments highlight the role of speech encoders and clustering granularity across different model scales, showing how optimal discretization strategies vary with model capacity. By examining cluster distribution and phonemic alignments, we investigate the effective use of discrete vocabulary, uncovering both linguistic and paralinguistic patterns. Additionally, we explore the impact of clustering data selection on model robustness, highlighting the importance of domain matching between discretization training and target applications.

Title: Beyond ROUGE: N-Gram Subspace Features for LLM Hallucination Detection

Authors: Jerry Li, Evangelos Papalexakis
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.05360
Pdf URL: https://arxiv.org/pdf/2509.05360
Copy Paste: [[2509.05360]] Beyond ROUGE: N-Gram Subspace Features for LLM Hallucination Detection(https://arxiv.org/abs/2509.05360)
Keywords: language model, llm, hallucination, retrieval augmented generation
Abstract: Large Language Models (LLMs) have demonstrated effectiveness across a wide variety of tasks involving natural language; however, a fundamental problem of hallucinations still plagues these models, limiting their trustworthiness in generating consistent, truthful information. Detecting hallucinations has quickly become an important topic, with various methods such as uncertainty estimation, LLM judges, retrieval augmented generation (RAG), and consistency checks showing promise. Many of these methods build upon foundational metrics, such as ROUGE, BERTScore, or Perplexity, which often lack the semantic depth necessary to detect hallucinations effectively. In this work, we propose a novel approach inspired by ROUGE that constructs an N-Gram frequency tensor from LLM-generated text. This tensor captures richer semantic structure by encoding co-occurrence patterns, enabling better differentiation between factual and hallucinated content. We demonstrate this by applying tensor decomposition methods to extract singular values from each mode and using these as input features to train a multi-layer perceptron (MLP) binary classifier for hallucinations. Our method is evaluated on the HaluEval dataset and demonstrates significant improvements over traditional baselines, as well as competitive performance against state-of-the-art LLM judges.
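The feature-extraction step described in the abstract above (build an N-gram frequency tensor, then take singular values from each mode) can be sketched roughly as follows. This is only an illustration under assumptions: the abstract does not specify the exact tensor construction or decomposition, so the sketch uses plain trigram counts and the singular values of each mode unfolding (an HOSVD-style choice). The function names, the toy sentence, and the parameter `k` are hypothetical, and the MLP classifier is omitted.

```python
import numpy as np

def ngram_tensor(tokens, vocab, n=3):
    """Count n-grams into an order-n tensor indexed by token ids."""
    index = {tok: i for i, tok in enumerate(vocab)}
    tensor = np.zeros((len(vocab),) * n)
    for i in range(len(tokens) - n + 1):
        ids = tuple(index[t] for t in tokens[i:i + n])
        tensor[ids] += 1
    return tensor

def mode_singular_values(tensor, k=3):
    """Top-k singular values of every mode unfolding, concatenated."""
    feats = []
    for mode in range(tensor.ndim):
        # Unfold: bring `mode` to the front and flatten the remaining axes.
        unfolded = np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)
        s = np.linalg.svd(unfolded, compute_uv=False)
        # Pad with zeros if fewer than k singular values exist.
        feats.append(np.pad(s, (0, max(0, k - len(s))))[:k])
    return np.concatenate(feats)

# Toy input (hypothetical); real inputs would be LLM-generated text.
tokens = "the cat sat on the mat the cat sat".split()
vocab = sorted(set(tokens))
tensor = ngram_tensor(tokens, vocab, n=3)
features = mode_singular_values(tensor, k=3)
```

The resulting `features` vector (here three values per mode, nine in total) is the kind of fixed-length input a binary classifier could be trained on.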

Title: A Lightweight Framework for Trigger-Guided LoRA-Based Self-Adaptation in LLMs

Authors: Jiacheng Wei, Faguo Wu, Xiao Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.05385
Pdf URL: https://arxiv.org/pdf/2509.05385
Copy Paste: [[2509.05385]] A Lightweight Framework for Trigger-Guided LoRA-Based Self-Adaptation in LLMs(https://arxiv.org/abs/2509.05385)
Keywords: language model, llm
Abstract: Large language models are unable to continuously adapt and learn from new data during reasoning at inference time. To address this limitation, we propose that complex reasoning tasks be decomposed into atomic subtasks and introduce SAGE, a trigger-guided dynamic fine-tuning framework that enables adaptive updates during reasoning at inference time. SAGE consists of three key components: (1) a Trigger module that detects reasoning failures through multiple evaluation metrics in real time; (2) a Trigger Buffer module that clusters anomaly samples using a streaming clustering process with HDBSCAN, followed by stability checks and similarity-based merging; and (3) a LoRA Store module that dynamically optimizes parameter updates with an adapter pool for knowledge retention. Evaluation results show that SAGE demonstrates excellent accuracy, robustness, and stability on the atomic reasoning subtask through dynamic knowledge updating during test time.

Title: Talk Isn't Always Cheap: Understanding Failure Modes in Multi-Agent Debate

Authors: Andrea Wynn, Harsh Satija, Gillian Hadfield
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2509.05396
Pdf URL: https://arxiv.org/pdf/2509.05396
Copy Paste: [[2509.05396]] Talk Isn't Always Cheap: Understanding Failure Modes in Multi-Agent Debate(https://arxiv.org/abs/2509.05396)
Keywords: agent
Abstract: While multi-agent debate has been proposed as a promising strategy for improving AI reasoning ability, we find that debate can sometimes be harmful rather than helpful. The prior work has exclusively focused on debates within homogeneous groups of agents, whereas we explore how diversity in model capabilities influences the dynamics and outcomes of multi-agent interactions. Through a series of experiments, we demonstrate that debate can lead to a decrease in accuracy over time -- even in settings where stronger (i.e., more capable) models outnumber their weaker counterparts. Our analysis reveals that models frequently shift from correct to incorrect answers in response to peer reasoning, favoring agreement over challenging flawed reasoning. These results highlight important failure modes in the exchange of reasons during multi-agent debate, suggesting that naive applications of debate may cause performance degradation when agents are neither incentivized nor adequately equipped to resist persuasive but incorrect reasoning.

Title: No Translation Needed: Forecasting Quality from Fertility and Metadata

Authors: Jessica M. Lundin, Ada Zhang, David Adelani, Cody Carroll
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.05425
Pdf URL: https://arxiv.org/pdf/2509.05425
Copy Paste: [[2509.05425]] No Translation Needed: Forecasting Quality from Fertility and Metadata(https://arxiv.org/abs/2509.05425)
Keywords: gpt
Abstract: We show that translation quality can be predicted with surprising accuracy without ever running the translation system itself. Using only a handful of features (token fertility ratios, token counts, and basic linguistic metadata: language family, script, and region), we can forecast ChrF scores for GPT-4o translations across 203 languages in the FLORES-200 benchmark. Gradient boosting models achieve favorable performance ($R^{2}=0.66$ for XX$\rightarrow$English and $R^{2}=0.72$ for English$\rightarrow$XX). Feature importance analyses reveal that typological factors dominate predictions into English, while fertility plays a larger role for translations into diverse target languages. These findings suggest that translation quality is shaped by both token-level fertility and broader linguistic typology, offering new insights for multilingual evaluation and quality estimation.

Title: Direct-Scoring NLG Evaluators Can Use Pairwise Comparisons Too

Authors: Logan Lawrence, Ashton Williamson, Alexander Shelton
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.05440
Pdf URL: https://arxiv.org/pdf/2509.05440
Copy Paste: [[2509.05440]] Direct-Scoring NLG Evaluators Can Use Pairwise Comparisons Too(https://arxiv.org/abs/2509.05440)
Keywords: language model, chat
Abstract: As large language models have been increasingly used as automatic raters for evaluating free-form content, including document summarization, dialog, and story generation, work has been dedicated to evaluating such models by measuring their correlations with human judgment. For sample-level performance, methods that operate by using pairwise comparisons between machine-generated text perform well but often lack the ability to assign absolute scores to individual summaries, an ability crucial for use cases that require thresholding. In this work, we propose a direct-scoring method which uses synthetic summaries to act as pairwise machine rankings at test time. We show that our method performs comparably to state-of-the-art pairwise evaluators in terms of axis-averaged sample-level correlations on the SummEval (+0.03), TopicalChat (-0.03), and HANNA (+0.05) meta-evaluation benchmarks, and release the synthetic in-context summaries as data to facilitate future work.

Title: From Staff Messages to Actionable Insights: A Multi-Stage LLM Classification Framework for Healthcare Analytics

Authors: Hajar Sakai, Yi-En Tseng, Mohammadsadegh Mikaeili, Joshua Bosire, Franziska Jovin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.05484
Pdf URL: https://arxiv.org/pdf/2509.05484
Copy Paste: [[2509.05484]] From Staff Messages to Actionable Insights: A Multi-Stage LLM Classification Framework for Healthcare Analytics(https://arxiv.org/abs/2509.05484)
Keywords: language model, gpt, llm
Abstract: Hospital call centers serve as the primary contact point for patients within a hospital system. They also generate substantial volumes of staff messages as navigators process patient requests and communicate with the hospital offices following the established protocol restrictions and guidelines. This continuously accumulating body of text data can be mined and processed to retrieve insights; however, traditional supervised learning approaches require annotated data, extensive training, and model tuning. Large Language Models (LLMs) offer a paradigm shift toward more computationally efficient methodologies for healthcare analytics. This paper presents a multi-stage LLM-based framework that identifies staff message topics and classifies messages by their reasons in a multi-class fashion. In the process, multiple LLM types, including reasoning, general-purpose, and lightweight models, were evaluated. The best-performing model was o3, achieving 78.4% weighted F1-score and 79.2% accuracy, followed closely by gpt-5 (75.3% weighted F1-score and 76.2% accuracy). The proposed methodology incorporates data security measures and HIPAA compliance requirements essential for healthcare environments. The processed LLM outputs are integrated into a visualization decision support tool that transforms the staff messages into actionable insights accessible to healthcare professionals. This approach enables more efficient utilization of the collected staff messaging data, identifies navigator training opportunities, and supports improved patient experience and care quality.

Title: The Token Tax: Systematic Bias in Multilingual Tokenization

Authors: Jessica M. Lundin, Ada Zhang, Nihal Karim, Hamza Louzan, Victor Wei, David Adelani, Cody Carroll
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.05486
Pdf URL: https://arxiv.org/pdf/2509.05486
Copy Paste: [[2509.05486]] The Token Tax: Systematic Bias in Multilingual Tokenization(https://arxiv.org/abs/2509.05486)
Keywords: language model, llm
Abstract: Tokenization inefficiency imposes structural disadvantages on morphologically complex, low-resource languages, inflating compute resources and depressing accuracy. We evaluate 10 large language models (LLMs) on AfriMMLU (9,000 MCQA items; 5 subjects; 16 African languages) and show that fertility (tokens/word) reliably predicts accuracy. Higher fertility consistently predicts lower accuracy across all models and subjects. We further find that reasoning models (DeepSeek, o1) consistently outperform non-reasoning peers across high and low resource languages in the AfriMMLU dataset, narrowing accuracy gaps observed in prior generations. Finally, translating token inflation to economics, a doubling in tokens results in quadrupled training cost and time, underscoring the token tax faced by many languages. These results motivate morphologically aware tokenization, fair pricing, and multilingual benchmarks for equitable natural language processing (NLP).
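The two quantities the abstract above relies on, fertility (tokens per word) and the cost of token inflation, can be illustrated with a small sketch. Everything here is an assumption for illustration: `toy_tokenize` is a hypothetical stand-in for a real subword tokenizer, and the quadratic exponent in `relative_cost` is chosen only because it reproduces the abstract's "doubling in tokens, quadrupled cost" figure; the abstract itself does not state the cost model.

```python
def fertility(text, tokenize):
    """Tokens per whitespace-delimited word."""
    words = text.split()
    return len(tokenize(text)) / len(words)

def toy_tokenize(text, piece_len=4):
    # Hypothetical stand-in for a subword tokenizer: cut every word into
    # chunks of at most piece_len characters.
    return [w[i:i + piece_len] for w in text.split()
            for i in range(0, len(w), piece_len)]

def relative_cost(token_ratio, exponent=2):
    # Assumption: cost grows as the token count raised to `exponent`.
    return token_ratio ** exponent

sample = "tokenization inflates morphologically complex words"
f = fertility(sample, toy_tokenize)   # 13 tokens over 5 words
cost = relative_cost(2.0)             # cost multiplier when tokens double
```

With a real tokenizer, `fertility` would be computed per language over the same benchmark text, so that higher values flag languages paying a larger token tax.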

Title: Biomedical Literature Q&A System Using Retrieval-Augmented Generation (RAG)

Authors: Mansi Garg, Lee-Chi Wang, Bhavesh Ghanchi, Sanjana Dumpala, Shreyash Kakde, Yen Chih Chen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.05505
Pdf URL: https://arxiv.org/pdf/2509.05505
Copy Paste: [[2509.05505]] Biomedical Literature Q&A System Using Retrieval-Augmented Generation (RAG)(https://arxiv.org/abs/2509.05505)
Keywords: language model, retrieval-augmented generation
Abstract: This work presents a Biomedical Literature Question Answering (Q&A) system based on a Retrieval-Augmented Generation (RAG) architecture, designed to improve access to accurate, evidence-based medical information. Addressing the shortcomings of conventional health search engines and the lag in public access to biomedical research, the system integrates diverse sources, including PubMed articles, curated Q&A datasets, and medical encyclopedias, to retrieve relevant information and generate concise, context-aware responses. The retrieval pipeline uses MiniLM-based semantic embeddings and FAISS vector search, while answer generation is performed by a fine-tuned Mistral-7B-v0.3 language model optimized using QLoRA for efficient, low-resource training. The system supports both general medical queries and domain-specific tasks, with a focused evaluation on breast cancer literature demonstrating the value of domain-aligned retrieval. Empirical results, measured using BERTScore (F1), show substantial improvements in factual consistency and semantic relevance compared to baseline models. The findings underscore the potential of RAG-enhanced language models to bridge the gap between complex biomedical literature and accessible public health knowledge, paving the way for future work on multilingual adaptation, privacy-preserving inference, and personalized medical AI systems.
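The retrieval half of the pipeline described above (embed documents, then find nearest neighbours for a query) can be sketched without the actual components. Note the substitutions: instead of MiniLM embeddings and a FAISS index, this sketch uses a hypothetical bag-of-words `embed` function and brute-force cosine similarity in NumPy. The documents, query, and function names are invented for illustration and say nothing about the system's real data or results.

```python
import numpy as np

def embed(texts, vocab):
    """Toy bag-of-words embedding, L2-normalised (stand-in for MiniLM)."""
    vecs = np.array([[t.lower().split().count(w) for w in vocab]
                     for t in texts], dtype=float)
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.clip(norms, 1e-12, None)

docs = [
    "tamoxifen reduces breast cancer recurrence",
    "aspirin lowers fever and pain",
    "mammography screening detects breast cancer early",
]
vocab = sorted({w for d in docs for w in d.lower().split()})
doc_vecs = embed(docs, vocab)

def retrieve(query, k=2):
    """Return the k documents with highest cosine similarity to the query."""
    scores = doc_vecs @ embed([query], vocab)[0]
    top = np.argsort(-scores)[:k]
    return [(docs[i], float(scores[i])) for i in top]

hits = retrieve("breast cancer screening")
```

In a full RAG setup the retrieved passages in `hits` would be placed in the prompt of the generator model; a vector index replaces the brute-force dot product once the corpus grows.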

Title: Using Contrastive Learning to Improve Two-Way Reasoning in Large Language Models: The Obfuscation Task as a Case Study

Authors: Serge Lionel Nikiema, Jordan Samhi, Micheline Bénédicte Moumoula, Albérick Euraste Djiré, Abdoul Kader Kaboré, Jacques Klein, Tegawendé F. Bissyandé
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.05553
Pdf URL: https://arxiv.org/pdf/2509.05553
Copy Paste: [[2509.05553]] Using Contrastive Learning to Improve Two-Way Reasoning in Large Language Models: The Obfuscation Task as a Case Study(https://arxiv.org/abs/2509.05553)
Keywords: language model
Abstract: This research addresses a fundamental question in AI: whether large language models truly understand concepts or simply recognize patterns. The authors propose bidirectional reasoning, the ability to apply transformations in both directions without being explicitly trained on the reverse direction, as a test for genuine understanding. They argue that true comprehension should naturally allow reversibility. For example, a model that can change a variable name like userIndex to i should also be able to infer that i represents a user index without reverse training. The researchers tested current language models and discovered what they term cognitive specialization: when models are fine-tuned on forward tasks, their performance on those tasks improves, but their ability to reason bidirectionally becomes significantly worse. To address this issue, they developed Contrastive Fine-Tuning (CFT), which trains models using three types of examples: positive examples that maintain semantic meaning, negative examples with different semantics, and forward-direction obfuscation examples. This approach aims to develop deeper understanding rather than surface-level pattern recognition and allows reverse capabilities to develop naturally without explicit reverse training. Their experiments demonstrated that CFT successfully achieved bidirectional reasoning, enabling strong reverse performance while maintaining forward task capabilities. The authors conclude that bidirectional reasoning serves both as a theoretical framework for assessing genuine understanding and as a practical training approach for developing more capable AI systems.

Title: Ad hoc conventions generalize to new referents

Authors: Anya Ji, Claire Augusta Bergey, Ron Eliav, Yoav Artzi, Robert D. Hawkins
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2509.05566
Pdf URL: https://arxiv.org/pdf/2509.05566
Copy Paste: [[2509.05566]] Ad hoc conventions generalize to new referents(https://arxiv.org/abs/2509.05566)
Keywords: agent
Abstract: How do people talk about things they've never talked about before? One view suggests that a new shared naming system establishes an arbitrary link to a specific target, like proper names that cannot extend beyond their bearers. An alternative view proposes that forming a shared way of describing objects involves broader conceptual alignment, reshaping each individual's semantic space in ways that should generalize to new referents. We test these competing accounts in a dyadic communication study (N=302) leveraging the recently-released KiloGram dataset containing over 1,000 abstract tangram images. After pairs of participants coordinated on referential conventions for one set of images through repeated communication, we measured the extent to which their descriptions aligned for undiscussed images. We found strong evidence for generalization: partners showed increased alignment relative to their pre-test labels. Generalization also decayed nonlinearly with visual similarity (consistent with Shepard's law) and was robust across levels of the images' nameability. These findings suggest that ad hoc conventions are not arbitrary labels but reflect genuine conceptual coordination, with implications for theories of reference and the design of more adaptive language agents.

Title: Mitigating Spurious Correlations Between Question and Answer via Chain-of-Thought Correctness Perception Distillation

Authors: Hongyan Xie, Yitong Yao, Yikun Ban, Zixuan Huang, Deqing Wang, Zhenhe Wu, Haoxiang Su, Chao Wang, Shuangyong Song, Xuelong Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.05602
Pdf URL: https://arxiv.org/pdf/2509.05602
Copy Paste: [[2509.05602]] Mitigating Spurious Correlations Between Question and Answer via Chain-of-Thought Correctness Perception Distillation(https://arxiv.org/abs/2509.05602)
Keywords: language model, llm, chain-of-thought
Abstract: Large language models (LLMs) excel at reasoning tasks but are expensive to deploy. Thus small language models (SLMs) are fine-tuned on CoT data generated by LLMs to copy LLMs' abilities. However, these CoT data may include noisy rationales that either fail to substantiate the answers or contribute no additional information to support answer prediction, which leads SLMs to capture spurious correlations between questions and answers and compromise the quality of reasoning. In this work, we propose Chain-of-Thought Correctness Perception Distillation (CoPeD), which aims to improve the reasoning quality of the student model from the perspectives of task setting and data utilization. Firstly, we introduce a correctness-aware task setting that encourages the student model to predict answers based on correct rationales and revise them when they are incorrect. This setting improves the faithfulness of reasoning and allows the model to learn from its mistakes. Then, we propose a Correctness-Aware Weighted loss, which dynamically adjusts the contribution of each training instance based on the combined loss of the rationale and the answer. This strategy encourages the model to focus more on samples where the rationale offers stronger support for the correct answer. Experiments have shown that CoPeD is effective on both in-distribution (IND) and out-of-distribution (OOD) benchmark reasoning datasets.

Title: Icon$^{2}$: Aligning Large Language Models Using Self-Synthetic Preference Data via Inherent Regulation

Authors: Qiyuan Chen, Hongsen Huang, Qian Shao, Jiahe Chen, Jintai Chen, Hongxia Xu, Renjie Hua, Ren Chuan, Jian Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.05605
Pdf URL: https://arxiv.org/pdf/2509.05605
Copy Paste: [[2509.05605]] Icon$^{2}$: Aligning Large Language Models Using Self-Synthetic Preference Data via Inherent Regulation(https://arxiv.org/abs/2509.05605)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) require high quality preference datasets to align with human preferences. However, conventional methods for constructing such datasets face significant challenges: reliance on pre-collected instructions often leads to distribution mismatches with target models, while the need for sampling multiple stochastic responses introduces substantial computational overhead. In this work, we explore a paradigm shift by leveraging inherent regulation of LLMs' representation space for efficient and tailored preference dataset construction, named Icon$^{2}$. Specifically, it first extracts layer-wise direction vectors to encode sophisticated human preferences and then uses these vectors to filter self-synthesized instructions based on their inherent consistency. During decoding, bidirectional inherent control is applied to steer token representations, enabling the precise generation of response pairs with clear alignment distinctions. Experimental results demonstrate significant improvements in both alignment and efficiency. Llama3-8B and Qwen2-7B achieve an average win rate improvement of 13.89% on AlpacaEval 2.0 and 13.45% on Arena-Hard, while reducing computational costs by up to 48.1%.

Title: Beyond Keywords: Driving Generative Search Engine Optimization with Content-Centric Agents

Authors: Qiyuan Chen, Jiahe Chen, Hongsen Huang, Qian Shao, Jintai Chen, Renjie Hua, Hongxia Xu, Ruijia Wu, Ren Chuan, Jian Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.05607
Pdf URL: https://arxiv.org/pdf/2509.05607
Copy Paste: [[2509.05607]] Beyond Keywords: Driving Generative Search Engine Optimization with Content-Centric Agents(https://arxiv.org/abs/2509.05607)
Keywords: agent
Abstract: The paradigm shift from traditional ranking-based search to Generative Search Engines has rendered conventional SEO metrics obsolete, creating an urgent need to understand, measure, and optimize for content influence on synthesized answers. This paper introduces a comprehensive, end-to-end framework for Generative Search Engine Optimization (GSEO) to address this challenge. We make two primary contributions. First, we construct CC-GSEO-Bench, a large-scale, content-centric benchmark, and propose a multi-dimensional evaluation framework that systematically quantifies influence, moving beyond surface-level attribution to assess substantive semantic impact. Second, we design a novel multi-agent system that operationalizes this framework, automating the strategic refinement of content through a collaborative analyze-revise-evaluate workflow. Our empirical analysis using this framework reveals novel insights into the dynamics of content influence, offering actionable strategies for creators and establishing a principled foundation for future GSEO research.

Title: New Insights into Optimal Alignment of Acoustic and Linguistic Representations for Knowledge Transfer in ASR

Authors: Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.05609
Pdf URL: https://arxiv.org/pdf/2509.05609
Copy Paste: [[2509.05609]] New Insights into Optimal Alignment of Acoustic and Linguistic Representations for Knowledge Transfer in ASR(https://arxiv.org/abs/2509.05609)
Keywords: language model
Abstract: Aligning acoustic and linguistic representations is a central challenge in bridging pre-trained models for knowledge transfer in automatic speech recognition (ASR). This alignment is inherently structured and asymmetric: while multiple consecutive acoustic frames typically correspond to a single linguistic token (many-to-one), certain acoustic transition regions may relate to multiple adjacent tokens (one-to-many). Moreover, acoustic sequences often include frames with no linguistic counterpart, such as background noise or silence, which may lead to imbalanced matching conditions. In this work, we take a new perspective and treat alignment and matching as a detection problem, where the goal is to identify meaningful correspondences with high precision and recall, ensuring full coverage of linguistic tokens while flexibly handling redundant or noisy acoustic frames when transferring linguistic knowledge for ASR. Based on this insight, we propose an unbalanced optimal transport-based alignment model that explicitly handles distributional mismatch and structural asymmetries with soft and partial matching between acoustic and linguistic modalities. Our method ensures that every linguistic token is grounded in at least one acoustic observation, while allowing for flexible, probabilistic mappings from acoustic to linguistic units. We evaluate the proposed model with experiments on a CTC-based ASR system with a pre-trained language model for knowledge transfer. Experimental results demonstrate the effectiveness of our approach in flexibly controlling the degree of matching and hence improving ASR performance.
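The asymmetry described in the abstract above (every linguistic token must be covered, while acoustic frames may be matched only partially) resembles a semi-relaxed form of entropic unbalanced optimal transport. The sketch below is a generic textbook-style Sinkhorn scaling loop under that assumption, not the authors' model: the linguistic marginal is enforced exactly and the acoustic marginal is softened by a KL penalty of strength `rho`. The feature vectors, uniform weights, and the values of `eps` and `rho` are all hypothetical.

```python
import numpy as np

def semi_relaxed_sinkhorn(cost, a, b, eps=0.1, rho=1.0, iters=200):
    """Entropic OT with a soft row (acoustic) and hard column (token) marginal."""
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    power = rho / (rho + eps)  # < 1 softens the acoustic-side constraint
    for _ in range(iters):
        u = (a / (K @ v)) ** power   # relaxed: frames may carry less mass
        v = b / (K.T @ u)            # exact: every token receives its mass
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
frames = rng.normal(size=(6, 4))   # 6 acoustic frame features (toy data)
tokens = rng.normal(size=(3, 4))   # 3 linguistic token features (toy data)
cost = ((frames[:, None, :] - tokens[None, :, :]) ** 2).sum(-1)
cost = cost / cost.max()           # scale costs into [0, 1] for stability
a = np.full(6, 1 / 6)
b = np.full(3, 1 / 3)
plan = semi_relaxed_sinkhorn(cost, a, b)
```

Each column of `plan` sums to the corresponding token weight, so no token is left unaligned, while row sums are free to drift away from `a`; a smaller `rho` lets noisy or redundant frames shed more mass.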

Title: From Joy to Fear: A Benchmark of Emotion Estimation in Pop Song Lyrics

Authors: Shay Dahary, Avi Edana, Alexander Apartsin, Yehudit Aperstein
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.05617
Pdf URL: https://arxiv.org/pdf/2509.05617
Copy Paste: [[2509.05617]] From Joy to Fear: A Benchmark of Emotion Estimation in Pop Song Lyrics(https://arxiv.org/abs/2509.05617)
Keywords: language model, llm
Abstract: The emotional content of song lyrics plays a pivotal role in shaping listener experiences and influencing musical preferences. This paper investigates the task of multi-label emotional attribution of song lyrics by predicting six emotional intensity scores corresponding to six fundamental emotions. A manually labeled dataset is constructed using a mean opinion score (MOS) approach, which aggregates annotations from multiple human raters to ensure reliable ground-truth labels. Leveraging this dataset, we conduct a comprehensive evaluation of several publicly available large language models (LLMs) under zero-shot scenarios. Additionally, we fine-tune a BERT-based model specifically for predicting multi-label emotion scores. Experimental results reveal the relative strengths and limitations of zero-shot and fine-tuned models in capturing the nuanced emotional content of lyrics. Our findings highlight the potential of LLMs for emotion recognition in creative texts, providing insights into model selection strategies for emotion-based music information retrieval applications. The labeled dataset is available at this https URL.

Title: Few-Shot Query Intent Detection via Relation-Aware Prompt Learning

Authors: Liang Zhang, Yuan Li, Shijie Zhang, Zheng Zhang, Xitong Li
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2509.05635
Pdf URL: https://arxiv.org/pdf/2509.05635
Copy Paste: [[2509.05635]] Few-Shot Query Intent Detection via Relation-Aware Prompt Learning(https://arxiv.org/abs/2509.05635)
Keywords: language model, prompt
Abstract: Intent detection is a crucial component of modern conversational systems, since accurately identifying user intent at the beginning of a conversation is essential for generating effective responses. Recent efforts have focused on studying this problem under a challenging few-shot scenario. These approaches primarily leverage large-scale unlabeled dialogue text corpora to pretrain language models through various pretext tasks, followed by fine-tuning for intent detection with very limited annotations. Despite the improvements achieved, existing methods have predominantly focused on textual data, neglecting to effectively capture the crucial structural information inherent in conversational systems, such as the query-query relation and query-answer relation. To address this gap, we propose SAID, a novel framework that integrates both textual and relational structure information in a unified manner for model pretraining for the first time. Building on this framework, we further propose a novel mechanism, the query-adaptive attention network (QueryAdapt), which operates at the relation token level by generating intent-specific relation tokens from well-learned query-query and query-answer relations explicitly, enabling more fine-grained knowledge transfer. Extensive experimental results on two real-world datasets demonstrate that SAID significantly outperforms state-of-the-art methods.

Title: LM-Searcher: Cross-domain Neural Architecture Search with LLMs via Unified Numerical Encoding

Authors: Yuxuan Hu, Jihao Liu, Ke Wang, Jinliang Zhen, Weikang Shi, Manyuan Zhang, Qi Dou, Rui Liu, Aojun Zhou, Hongsheng Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.05657
Pdf URL: https://arxiv.org/pdf/2509.05657
Copy Paste: [[2509.05657]] LM-Searcher: Cross-domain Neural Architecture Search with LLMs via Unified Numerical Encoding(https://arxiv.org/abs/2509.05657)
Keywords: language model, llm, prompt
Abstract: Recent progress in Large Language Models (LLMs) has opened new avenues for solving complex optimization problems, including Neural Architecture Search (NAS). However, existing LLM-driven NAS approaches rely heavily on prompt engineering and domain-specific tuning, limiting their practicality and scalability across diverse tasks. In this work, we propose LM-Searcher, a novel framework that leverages LLMs for cross-domain neural architecture optimization without the need for extensive domain-specific adaptation. Central to our approach is NCode, a universal numerical string representation for neural architectures, which enables cross-domain architecture encoding and search. We also reformulate the NAS problem as a ranking task, training LLMs to select high-performing architectures from candidate pools using instruction-tuning samples derived from a novel pruning-based subspace sampling strategy. Our curated dataset, encompassing a wide range of architecture-performance pairs, encourages robust and transferable learning. Comprehensive experiments demonstrate that LM-Searcher achieves competitive performance in both in-domain (e.g., CNNs for image classification) and out-of-domain (e.g., LoRA configurations for segmentation and generation) tasks, establishing a new paradigm for flexible and generalizable LLM-based architecture search. The datasets and models will be released at this https URL.

Title: Cross-Question Method Reuse in Large Language Models: From Word-Level Prediction to Rational Logical-Layer Reasoning

Authors: Hong Su
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.05660
Pdf URL: https://arxiv.org/pdf/2509.05660
Copy Paste: [[2509.05660]] Cross-Question Method Reuse in Large Language Models: From Word-Level Prediction to Rational Logical-Layer Reasoning(https://arxiv.org/abs/2509.05660)
Keywords: language model, llm
Abstract: Large language models (LLMs) have been widely applied to assist in finding solutions for diverse questions. Prior work has proposed representing a method as a pair of a question and its corresponding solution, enabling method reuse. However, existing approaches typically require the questions to be highly similar. In this paper, we extend the scope of method reuse to address questions with low similarity or with hidden similarities that are not explicitly observable. For questions that are similar in a general-specific sense (i.e., broader or narrower in scope), we propose to first separate the question and solution, rather than directly feeding the pair to the LLM. The LLM is then guided to adapt the solution to new but related questions, allowing it to focus on solution transfer rather than question recognition. Furthermore, we extend this approach to cases where questions only share partial features or hidden characteristics. This enables cross-question method reuse beyond conventional similarity constraints. Experimental verification shows that our scope-extension approach increases the probability of filtering out reusable solutions, thereby improving the effectiveness of cross-question method reuse.

Title: Llama-GENBA-10B: A Trilingual Large Language Model for German, English and Bavarian

Authors: Michael Hoffmann, Jophin John, Stefan Schweter, Gokul Ramakrishnan, Hoi-Fong Mak, Alice Zhang, Dmitry Gaynullin, Nicolay J. Hammer
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.05668
Pdf URL: https://arxiv.org/pdf/2509.05668
Copy Paste: [[2509.05668]] Llama-GENBA-10B: A Trilingual Large Language Model for German, English and Bavarian(https://arxiv.org/abs/2509.05668)
Keywords: language model, llm
Abstract: We present Llama-GENBA-10B, a trilingual foundation model addressing English-centric bias in large language models. Built on Llama 3.1-8B and scaled to 10B parameters, Llama-GENBA-10B is continuously pretrained on 164B tokens (82B English, 82B German, and 80M Bavarian), balancing resources while preventing English dominance. Targeted at the German NLP community, the model also promotes Bavarian as a low-resource language. Development tackled four challenges: (1) curating a multilingual corpus despite Bavarian scarcity, (2) creating a unified tokenizer for English, German, and Bavarian, (3) optimizing architecture and language-ratio hyperparameters for cross-lingual transfer, and (4) establishing the first standardized trilingual evaluation suite by translating German benchmarks into Bavarian. Evaluations show that Llama-GENBA-10B achieves strong cross-lingual performance, with the fine-tuned variant surpassing Apertus-8B-2509 and gemma-2-9b in Bavarian and establishing itself as the best model in its class for this language, while also outperforming EuroLLM in English and matching its results in German. Training on the Cerebras CS-2 demonstrated efficient large-scale multilingual pretraining with documented energy use, offering a blueprint for inclusive foundation models that integrate low-resource languages.

Title: A Survey of the State-of-the-Art in Conversational Question Answering Systems

Authors: Manoj Madushanka Perera, Adnan Mahmood, Kasun Eranda Wijethilake, Fahmida Islam, Maryam Tahermazandarani, Quan Z. Sheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.05716
Pdf URL: https://arxiv.org/pdf/2509.05716
Copy Paste: [[2509.05716]] A Survey of the State-of-the-Art in Conversational Question Answering Systems(https://arxiv.org/abs/2509.05716)
Keywords: language model, gpt
Abstract: Conversational Question Answering (ConvQA) systems have emerged as a pivotal area within Natural Language Processing (NLP) by driving advancements that enable machines to engage in dynamic and context-aware conversations. These capabilities are increasingly being applied across various domains, i.e., customer support, education, legal, and healthcare where maintaining a coherent and relevant conversation is essential. Building on recent advancements, this survey provides a comprehensive analysis of the state-of-the-art in ConvQA. This survey begins by examining the core components of ConvQA systems, i.e., history selection, question understanding, and answer prediction, highlighting their interplay in ensuring coherence and relevance in multi-turn conversations. It further investigates the use of advanced machine learning techniques, including but not limited to, reinforcement learning, contrastive learning, and transfer learning to improve ConvQA accuracy and efficiency. The pivotal role of large language models, i.e., RoBERTa, GPT-4, Gemini 2.0 Flash, Mistral 7B, and LLaMA 3, is also explored, thereby showcasing their impact through data scalability and architectural advancements. Additionally, this survey presents a comprehensive analysis of key ConvQA datasets and concludes by outlining open research directions. Overall, this work offers a comprehensive overview of the ConvQA landscape and provides valuable insights to guide future advancements in the field.

Title: Exploring Subjective Tasks in Farsi: A Survey Analysis and Evaluation of Language Models

Authors: Donya Rooein, Flor Miriam Plaza-del-Arco, Debora Nozza, Dirk Hovy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.05719
Pdf URL: https://arxiv.org/pdf/2509.05719
Copy Paste: [[2509.05719]] Exploring Subjective Tasks in Farsi: A Survey Analysis and Evaluation of Language Models(https://arxiv.org/abs/2509.05719)
Keywords: language model
Abstract: Given Farsi's speaker base of over 127 million people and the growing availability of digital text, including more than 1.3 million articles on Wikipedia, it is considered a middle-resource language. However, this label quickly crumbles when the situation is examined more closely. We focus on three subjective tasks (Sentiment Analysis, Emotion Analysis, and Toxicity Detection) and find significant challenges in data availability and quality, despite the overall increase in data availability. We review 110 publications on subjective tasks in Farsi and observe a lack of publicly available datasets. Furthermore, existing datasets often lack essential demographic factors, such as age and gender, that are crucial for accurately modeling subjectivity in language. When evaluating prediction models using the few available datasets, the results are highly unstable across both datasets and models. Our findings indicate that the volume of data is insufficient to significantly improve a language's prospects in NLP.

Title: Enhancing Factual Accuracy and Citation Generation in LLMs via Multi-Stage Self-Verification

Authors: Fernando Gabriela García, Qiyang Shi, Zilin Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.05741
Pdf URL: https://arxiv.org/pdf/2509.05741
Copy Paste: [[2509.05741]] Enhancing Factual Accuracy and Citation Generation in LLMs via Multi-Stage Self-Verification(https://arxiv.org/abs/2509.05741)
Keywords: language model, llm, hallucination, chain-of-thought
Abstract: This research introduces VeriFact-CoT (Verified Factual Chain-of-Thought), a novel method designed to address the pervasive issues of hallucination and the absence of credible citation sources in Large Language Models (LLMs) when generating complex, fact-sensitive content. By incorporating a multi-stage mechanism of 'fact verification-reflection-citation integration,' VeriFact-CoT empowers LLMs to critically self-examine and revise their intermediate reasoning steps and final answers. This process significantly enhances the objective accuracy, trustworthiness, and traceability of the generated outputs, making LLMs more reliable for applications demanding high fidelity such as scientific research, news reporting, and legal consultation.

Title: ZhiFangDanTai: Fine-tuning Graph-based Retrieval-Augmented Generation Model for Traditional Chinese Medicine Formula

Authors: ZiXuan Zhang, Bowen Hao, Yingjie Li, Hongzhi Yin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.05867
Pdf URL: https://arxiv.org/pdf/2509.05867
Copy Paste: [[2509.05867]] ZhiFangDanTai: Fine-tuning Graph-based Retrieval-Augmented Generation Model for Traditional Chinese Medicine Formula(https://arxiv.org/abs/2509.05867)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Traditional Chinese Medicine (TCM) formulas play a significant role in treating epidemics and complex diseases. Existing models for TCM utilize traditional algorithms or deep learning techniques to analyze formula relationships, yet lack comprehensive results, such as complete formula compositions and detailed explanations. Although recent efforts have used TCM instruction datasets to fine-tune Large Language Models (LLMs) for explainable formula generation, existing datasets lack sufficient details, such as the roles of the formula's sovereign, minister, assistant, and courier; efficacy; contraindications; and tongue and pulse diagnosis, limiting the depth of model outputs. To address these challenges, we propose ZhiFangDanTai, a framework combining Graph-based Retrieval-Augmented Generation (GraphRAG) with LLM fine-tuning. ZhiFangDanTai uses GraphRAG to retrieve and synthesize structured TCM knowledge into concise summaries, while also constructing an enhanced instruction dataset to improve LLMs' ability to integrate retrieved information. Furthermore, we provide novel theoretical proofs demonstrating that integrating GraphRAG with fine-tuning techniques can reduce generalization error and hallucination rates in the TCM formula task. Experimental results on both collected and clinical datasets demonstrate that ZhiFangDanTai achieves significant improvements over state-of-the-art models. Our model is open-sourced at this https URL.

Title: MedFactEval and MedAgentBrief: A Framework and Workflow for Generating and Evaluating Factual Clinical Summaries

Authors: François Grolleau, Emily Alsentzer, Timothy Keyes, Philip Chung, Akshay Swaminathan, Asad Aali, Jason Hom, Tridu Huynh, Thomas Lew, April S. Liang, Weihan Chu, Natasha Z. Steele, Christina F. Lin, Jingkun Yang, Kameron C. Black, Stephen P. Ma, Fateme N. Haredasht, Nigam H. Shah, Kevin Schulman, Jonathan H. Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.05878
Pdf URL: https://arxiv.org/pdf/2509.05878
Copy Paste: [[2509.05878]] MedFactEval and MedAgentBrief: A Framework and Workflow for Generating and Evaluating Factual Clinical Summaries(https://arxiv.org/abs/2509.05878)
Keywords: language model, llm, agent
Abstract: Evaluating factual accuracy in Large Language Model (LLM)-generated clinical text is a critical barrier to adoption, as expert review is unscalable for the continuous quality assurance these systems require. We address this challenge with two complementary contributions. First, we introduce MedFactEval, a framework for scalable, fact-grounded evaluation where clinicians define high-salience key facts and an "LLM Jury"--a multi-LLM majority vote--assesses their inclusion in generated summaries. Second, we present MedAgentBrief, a model-agnostic, multi-step workflow designed to generate high-quality, factual discharge summaries. To validate our evaluation framework, we established a gold-standard reference using a seven-physician majority vote on clinician-defined key facts from inpatient cases. The MedFactEval LLM Jury achieved almost perfect agreement with this panel (Cohen's kappa=81%), a performance statistically non-inferior to that of a single human expert (kappa=67%, P < 0.001). Our work provides both a robust evaluation framework (MedFactEval) and a high-performing generation workflow (MedAgentBrief), offering a comprehensive approach to advance the responsible deployment of generative AI in clinical workflows.
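The "LLM Jury" idea above (a majority vote across several models, compared against a reference panel with Cohen's kappa) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the MedFactEval implementation: the votes are made-up booleans standing for "key fact is included in the summary", and `majority_vote` and `cohens_kappa` are hypothetical helper names.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label chosen by the most jurors."""
    return Counter(votes).most_common(1)[0][0]


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of labels."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items where both raters agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical data: three juror models voting on four key facts.
jury_votes = [
    [True, True, False],
    [False, False, True],
    [True, True, True],
    [False, True, False],
]
jury = [majority_vote(v) for v in jury_votes]
panel = [True, False, True, True]  # assumed physician-panel reference
kappa = cohens_kappa(jury, panel)
```

An odd number of jurors avoids ties on binary labels. With an even jury size, an explicit tie-breaking rule would be needed.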

Title: Let's Roleplay: Examining LLM Alignment in Collaborative Dialogues

Authors: Abhijnan Nath, Carine Graff, Nikhil Krishnaswamy
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.05882
Pdf URL: https://arxiv.org/pdf/2509.05882
Copy Paste: [[2509.05882]] Let's Roleplay: Examining LLM Alignment in Collaborative Dialogues(https://arxiv.org/abs/2509.05882)
Keywords: language model, llm, agent
Abstract: As Large Language Models (LLMs) integrate into diverse workflows, they are increasingly being considered "collaborators" with humans. If such AI collaborators are to be reliable, their behavior over multiturn interactions must be predictable, validated and verified before deployment. Common alignment techniques are typically developed under simplified single-user settings and do not account for the dynamics of long-horizon multiparty interactions. This paper examines how different alignment methods affect LLM agents' effectiveness as partners in multiturn, multiparty collaborations. We study this question through the lens of friction agents that intervene in group dialogues to encourage the collaborative group to slow down and reflect upon their reasoning for deliberative decision-making. Using a roleplay methodology, we evaluate interventions from differently-trained friction agents in collaborative task conversations. We propose a novel counterfactual evaluation framework that quantifies how friction interventions change the trajectory of group collaboration and belief alignment. Our results show that a friction-aware approach significantly outperforms common alignment baselines in helping both convergence to a common ground, or agreed-upon task-relevant propositions, and correctness of task outcomes.

Title: Accelerating Large Language Model Inference via Early-Exiting Algorithms

Authors: Sangmin Bae
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.05915
Pdf URL: https://arxiv.org/pdf/2509.05915
Copy Paste: [[2509.05915]] Accelerating Large Language Model Inference via Early-Exiting Algorithms(https://arxiv.org/abs/2509.05915)
Keywords: language model
Abstract: Large language models have achieved remarkable capabilities, but their practical deployment is hindered by significant computational costs. While adaptive computation methods like early-exiting promise to reduce these costs, they introduce a fundamental conflict: the per-token dynamism intended to save computation often creates system-level bottlenecks that can paradoxically reduce throughput in batched inference. This dissertation resolves this conflict by co-designing adaptive algorithms and model architectures to strike an optimal balance between dynamism and efficiency. To this end, our work first addresses critical sources of overhead in conventional early-exiting by proposing an efficient parallel decoding mechanism. We then show that deep parameter sharing provides an architectural foundation that not only yields compact, parameter-efficient models but also inherently mitigates the critical synchronization issues affecting dynamic inference. Finally, this work presents a unified framework where lightweight routers are pretrained to dynamically assign an optimal recursion depth for each token. This approach establishes a new Pareto frontier between efficiency and performance by effectively optimizing for both adaptive computation and parameter efficiency within a single model.

Title: KatotohananQA: Evaluating Truthfulness of Large Language Models in Filipino

Authors: Lorenzo Alfred Nery, Ronald Dawson Catignas, Thomas James Tiam-Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.06065
Pdf URL: https://arxiv.org/pdf/2509.06065
Copy Paste: [[2509.06065]] KatotohananQA: Evaluating Truthfulness of Large Language Models in Filipino(https://arxiv.org/abs/2509.06065)
Keywords: language model, gpt, llm, hallucination
Abstract: Large Language Models (LLMs) achieve remarkable performance across various tasks, but their tendency to produce hallucinations limits reliable adoption. Benchmarks such as TruthfulQA have been developed to measure truthfulness, yet they are primarily available in English, leaving a gap in evaluating LLMs in low-resource languages. To address this, we present KatotohananQA, a Filipino translation of the TruthfulQA benchmark. Seven free-tier proprietary models were assessed using a binary-choice framework. Findings show a significant performance gap between English and Filipino truthfulness, with newer OpenAI models (GPT-5 and GPT-5 mini) demonstrating strong multilingual robustness. Results also reveal disparities across question characteristics, suggesting that some question types, categories, and topics are less robust to multilingual transfer, which highlights the need for broader multilingual evaluation to ensure fairness and reliability in LLM usage.

Title: Multimodal Reasoning for Science: Technical Report and 1st Place Solution to the ICML 2025 SeePhys Challenge

Authors: Hao Liang, Ruitao Wu, Bohan Zeng, Junbo Niu, Wentao Zhang, Bin Dong
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2509.06079
Pdf URL: https://arxiv.org/pdf/2509.06079
Copy Paste: [[2509.06079]] Multimodal Reasoning for Science: Technical Report and 1st Place Solution to the ICML 2025 SeePhys Challenge(https://arxiv.org/abs/2509.06079)
Keywords: gpt
Abstract: Multimodal reasoning remains a fundamental challenge in artificial intelligence. Despite substantial advances in text-based reasoning, even state-of-the-art models such as GPT-o3 struggle to maintain strong performance in multimodal scenarios. To address this gap, we introduce a caption-assisted reasoning framework that effectively bridges visual and textual modalities. Our approach achieved 1st place in the ICML 2025 AI for Math Workshop & Challenge 2: SeePhys, highlighting its effectiveness and robustness. Furthermore, we validate its generalization on the MathVerse benchmark for geometric reasoning, demonstrating the versatility of our method. Our code is publicly available at this https URL.

Title: Orthogonal Low-rank Adaptation in Lie Groups for Continual Learning of Large Language Models

Authors: Kefan Cao, Shuaicheng Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.06100
Pdf URL: https://arxiv.org/pdf/2509.06100
Copy Paste: [[2509.06100]] Orthogonal Low-rank Adaptation in Lie Groups for Continual Learning of Large Language Models(https://arxiv.org/abs/2509.06100)
Keywords: language model, llm
Abstract: Large language models (LLMs) are prone to catastrophic forgetting in sequential multi-task settings. Parameter regularization methods such as O-LoRA and N-LoRA alleviate task interference by enforcing low-rank subspace orthogonality, but they overlook the fact that conventional additive fine-tuning disrupts the intrinsic geometric structure of LLM parameters, limiting performance. Our key insight is that the parameter space of LLMs possesses a geometric structure, which must be preserved in addition to enforcing orthogonality. Based on this, we propose Orthogonal Low-rank Adaptation in Lie Groups (OLieRA), which introduces Lie group theory into LLM fine-tuning: leveraging multiplicative updates to preserve parameter geometry while applying orthogonality constraints to task subspaces. Experiments demonstrate that OLieRA achieves state-of-the-art results on the Standard CL benchmark and remains among the top-performing methods in the Large Number of Tasks setting.

Title: Benchmarking Gender and Political Bias in Large Language Models

Authors: Jinrui Yang, Xudong Han, Timothy Baldwin
Subjects: cs.CL, cs.AI, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2509.06164
Pdf URL: https://arxiv.org/pdf/2509.06164
Copy Paste: [[2509.06164]] Benchmarking Gender and Political Bias in Large Language Models(https://arxiv.org/abs/2509.06164)
Keywords: language model, gpt, llm
Abstract: We introduce EuroParlVote, a novel benchmark for evaluating large language models (LLMs) in politically sensitive contexts. It links European Parliament debate speeches to roll-call vote outcomes and includes rich demographic metadata for each Member of the European Parliament (MEP), such as gender, age, country, and political group. Using EuroParlVote, we evaluate state-of-the-art LLMs on two tasks -- gender classification and vote prediction -- revealing consistent patterns of bias. We find that LLMs frequently misclassify female MEPs as male and demonstrate reduced accuracy when simulating votes for female speakers. Politically, LLMs tend to favor centrist groups while underperforming on both far-left and far-right ones. Proprietary models like GPT-4o outperform open-weight alternatives in terms of both robustness and fairness. We release the EuroParlVote dataset, code, and demo to support future research on fairness and accountability in NLP within political contexts.

Title: Understanding the Influence of Synthetic Data for Text Embedders

Authors: Jacob Mitchell Springer, Vaibhav Adlakha, Siva Reddy, Aditi Raghunathan, Marius Mosbach
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.06184
Pdf URL: https://arxiv.org/pdf/2509.06184
Copy Paste: [[2509.06184]] Understanding the Influence of Synthetic Data for Text Embedders(https://arxiv.org/abs/2509.06184)
Keywords: llm
Abstract: Recent progress in developing general-purpose text embedders has been driven by training on ever-growing corpora of synthetic LLM-generated data. Nonetheless, no publicly available synthetic dataset exists, posing a barrier to studying its role for generalization. To address this issue, we first reproduce and publicly release the synthetic data proposed by Wang et al. (Mistral-E5). Our synthetic data is high quality and leads to consistent improvements in performance. Next, we critically examine where exactly synthetic data improves model generalization. Our analysis reveals that benefits from synthetic data are sparse and highly localized to individual datasets. Moreover, we observe trade-offs between the performance on different categories: data that benefits one task degrades performance on another. Our findings highlight the limitations of current synthetic data approaches for building general-purpose embedders and challenge the notion that training on synthetic data leads to more robust embedding models across tasks.
摘要：开发通用文本嵌入式的最新进展是由对合成LLM生成数据的不断增长的Corpora进行培训所驱动的。尽管如此，尚无公开可用的合成数据集，这为研究其泛化作用带来了障碍。为了解决这个问题，我们首先复制并公开发布Wang等人提出的合成数据。（Mistral-E5）。我们的合成数据是高质量的，可导致性能的一致改进。接下来，我们批判性地检查合成数据可以改善模型概括的位置。我们的分析表明，合成数据的好处稀疏，高度局部到单个数据集中。此外，我们观察到不同类别的绩效与受益于一项任务的数据之间的权衡，从而使绩效降低了另一个任务。我们的发现突出了当前合成数据方法来构建通用嵌入式的局限性，并挑战了对合成数据培训的概念，导致跨任务跨越更强的嵌入模型。

Title: Augmented Fine-Tuned LLMs for Enhanced Recruitment Automation

Authors: Mohamed T. Younes, Omar Walid, Khaled Shaban, Ali Hamdi, Mai Hassan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.06196
Pdf URL: https://arxiv.org/pdf/2509.06196
Copy Paste: [[2509.06196]] Augmented Fine-Tuned LLMs for Enhanced Recruitment Automation(https://arxiv.org/abs/2509.06196)
Keywords: language model, llm
Abstract: This paper presents a novel approach to recruitment automation in which Large Language Models (LLMs) are fine-tuned to improve accuracy and efficiency. Building upon our previous work on the Multilayer Large Language Model-Based Robotic Process Automation Applicant Tracking (MLAR) system, this work introduces a methodology for training LLMs fine-tuned specifically for recruitment tasks. The proposed framework addresses the limitations of generic LLMs by creating a synthetic dataset that uses a standardized JSON format, which helps ensure consistency and scalability. In addition to the synthetic dataset, resumes were parsed using DeepSeek, a high-parameter LLM, into the same structured JSON format and placed in the training set, which helps improve data diversity and realism. Through experimentation, we demonstrate significant improvements in performance metrics, such as exact match, F1 score, BLEU score, ROUGE score, and overall similarity compared to base models and other state-of-the-art LLMs. In particular, the fine-tuned Phi-4 model achieved the highest F1 score of 90.62%, indicating exceptional precision and recall in recruitment tasks. This study highlights the potential of fine-tuned LLMs to improve recruitment workflows by providing more accurate candidate-job matching.
摘要：本文提出了一种新颖的招聘自动化方法。大型语言模型（LLM）经过微调以提高准确性和效率。基于我们以前在多层大型模型的机器人过程自动化申请人跟踪（MLAR）系统的工作。这项工作引入了一种新颖的方法。培训针对招聘任务的精细调整LLMS。提出的框架通过创建使用标准化JSON格式的合成数据集来解决通用LLM的局限性。这有助于确保一致性和可扩展性。除合成数据集外，使用高参数LLM DeepSeek解析了简历。将简历解析为相同的结构化JSON格式，并放置在训练集中。这将有助于改善数据多样性和现实主义。通过实验，与基本模型和其他最先进的LLM相比，我们证明了性能指标，例如精确匹配，F1分数，BLEU得分，胭脂分数和总体相似性的显着改善。特别是，微调的PHI-4模型达到了90.62％的最高F1分数，表明招聘任务中的出色精度和回忆。这项研究突出了微调LLM的潜力。此外，它将通过提供更准确的候选人匹配来彻底改变招聘工作流程。

Title: MSLEF: Multi-Segment LLM Ensemble Finetuning in Recruitment

Authors: Omar Walid, Mohamed T. Younes, Khaled Shaban, Mai Hassan, Ali Hamdi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.06200
Pdf URL: https://arxiv.org/pdf/2509.06200
Copy Paste: [[2509.06200]] MSLEF: Multi-Segment LLM Ensemble Finetuning in Recruitment(https://arxiv.org/abs/2509.06200)
Keywords: language model, llm
Abstract: This paper presents MSLEF, a multi-segment ensemble framework that employs LLM fine-tuning to enhance resume parsing in recruitment automation. It integrates fine-tuned Large Language Models (LLMs) using weighted voting, with each model specializing in a specific resume segment to boost accuracy. Building on MLAR, MSLEF introduces a segment-aware architecture that leverages field-specific weighting tailored to each resume part, effectively overcoming the limitations of single-model systems by adapting to diverse formats and structures. The framework incorporates Gemini-2.5-Flash LLM as a high-level aggregator for complex sections and utilizes Gemma 9B, LLaMA 3.1 8B, and Phi-4 14B. MSLEF achieves significant improvements in Exact Match (EM), F1 score, BLEU, ROUGE, and Recruitment Similarity (RS) metrics, outperforming the best single model by up to +7% in RS. Its segment-aware design enhances generalization across varied resume layouts, making it highly adaptable to real-world hiring scenarios while ensuring precise and reliable candidate representation.
摘要：本文介绍了MSLEF，这是一种多段集成框架，该框架采用LLM微调来增强招聘自动化中的简历解析。它使用加权投票集成了微调的大语言模型（LLM），每个模型都专门研究特定的简历部分以提高准确性。在MLAR的基础上，MSLEF引入了一种细分感知的体系结构，该体系结构利用针对每个简历部分量身定制的特定于现场的加权，从而有效地克服了单模系统的局限性，通过适应各种格式和结构。该框架将Gemini-2.5-Flash LLM作为复杂部分的高级聚合器，并利用Gemma 9B，Llama 3.1 8b和Phi-4 14b。 MSLEF在精确匹配（EM），F1得分，BLEU，Rouge和招聘相似性（RS）指标方面取得了重大改进，表现优于最佳单个模型，高达 +7％。它的细分感知设计增强了各种简历布局的概括，使其高度适应现实世界的招聘方案，同时确保精确可靠的候选人代表。
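The segment-wise weighted voting described in the MSLEF abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the model labels, and the weight values are all hypothetical, and the abstract does not say how the weights are chosen or how the Gemini aggregator is combined with the votes.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Pick the value with the highest total weight.

    predictions: {model_name: predicted_value} for one resume segment.
    weights:     {model_name: weight} for that same segment (hypothetical values).
    """
    scores = defaultdict(float)
    for model, value in predictions.items():
        scores[value] += weights.get(model, 0.0)
    return max(scores, key=scores.get)

def parse_resume(segment_predictions, segment_weights):
    """Apply a separate weighted vote to every resume segment."""
    return {
        segment: weighted_vote(preds, segment_weights[segment])
        for segment, preds in segment_predictions.items()
    }

# Hypothetical example: three models disagree on two segments.
example_predictions = {
    "skills": {"gemma": "Python", "llama": "Python", "phi": "Java"},
    "education": {"gemma": "BSc", "llama": "MSc", "phi": "MSc"},
}
example_weights = {
    "skills": {"gemma": 0.3, "llama": 0.3, "phi": 0.5},
    "education": {"gemma": 0.6, "llama": 0.2, "phi": 0.3},
}
parsed = parse_resume(example_predictions, example_weights)
```

Because the weights are per segment, a model that is trusted for one field (here the assumed "education" weight for `gemma`) can override a majority there without dominating the other fields.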

Title: Mask-GCG: Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks?

Authors: Junjie Mu, Zonghao Ying, Zhekui Fan, Zonglei Jing, Yaoyuan Zhang, Zhengmin Yu, Wenxin Zhang, Quanchen Zou, Xiangzheng Zhang
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2509.06350
Pdf URL: https://arxiv.org/pdf/2509.06350
Copy Paste: [[2509.06350]] Mask-GCG: Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks?(https://arxiv.org/abs/2509.06350)
Keywords: language model, llm, prompt
Abstract: Jailbreak attacks on Large Language Models (LLMs) have demonstrated various successful methods whereby attackers manipulate models into generating harmful responses that they are designed to avoid. Among these, Greedy Coordinate Gradient (GCG) has emerged as a general and effective approach that optimizes the tokens in a suffix to generate jailbreakable prompts. While several improved variants of GCG have been proposed, they all rely on fixed-length suffixes. However, the potential redundancy within these suffixes remains unexplored. In this work, we propose Mask-GCG, a plug-and-play method that employs learnable token masking to identify impactful tokens within the suffix. Our approach increases the update probability for tokens at high-impact positions while pruning those at low-impact positions. This pruning not only reduces redundancy but also decreases the size of the gradient space, thereby lowering computational overhead and shortening the time required to achieve successful attacks compared to GCG. We evaluate Mask-GCG by applying it to the original GCG and several improved variants. Experimental results show that most tokens in the suffix contribute significantly to attack success, and pruning a minority of low-impact tokens does not affect the loss values or compromise the attack success rate (ASR), thereby revealing token redundancy in LLM prompts. Our findings provide insights for developing efficient and interpretable LLMs from the perspective of jailbreak attacks.
摘要：对大型语言模型（LLM）的越狱攻击已经证明了各种成功的方法，从而使攻击者操纵模型以产生旨在避免的有害响应。其中，贪婪的协调梯度（GCG）已成为一种普遍有效的方法，可在后缀中优化令牌以产生可越狱的提示。尽管已经提出了几种改进的GCG变体，但它们都依赖于固定长度后缀。但是，这些后缀中的潜在冗余仍然没有探索。在这项工作中，我们提出了Mask-GCG，这是一种使用可学习的令牌掩模来识别后缀中有影响力的令牌的方法。我们的方法提高了在高影响位置的高影响位置处的令牌的更新概率。这种修剪不仅降低了冗余，而且降低了梯度空间的大小，从而降低了计算开销并缩短了与GCG相比获得成功攻击所需的时间。我们通过将其应用于原始GCG和几种改进的变体来评估面膜-GCG。实验结果表明，后缀中的大多数令牌对攻击成功产生了重大贡献，并且修剪少数低影响令牌不会影响损失值或损害攻击成功率（ASR），从而揭示了LLM提示中的令牌冗余。我们的发现为从越狱攻击的角度开发有效和可解释的LLM提供了见解。

Title: PL-CA: A Parametric Legal Case Augmentation Framework

Authors: Ao Chang, Yubo Chen, Jun Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.06356
Pdf URL: https://arxiv.org/pdf/2509.06356
Copy Paste: [[2509.06356]] PL-CA: A Parametric Legal Case Augmentation Framework(https://arxiv.org/abs/2509.06356)
Keywords: llm, long context, hallucination
Abstract: Conventional RAG is considered one of the most effective methods for addressing model knowledge insufficiency and hallucination, particularly in the judicial domain that requires high levels of knowledge rigor, logical consistency, and content integrity. However, the conventional RAG method only injects retrieved documents directly into the model's context, which severely constrains models due to their limited context windows and introduces additional computational overhead through excessively long contexts, thereby disrupting models' attention and degrading performance on downstream tasks. Moreover, many existing benchmarks lack expert annotation and focus solely on individual downstream tasks, while real-world legal scenarios consist of multiple mixed legal tasks, indicating conventional benchmarks' inadequacy for reflecting models' true capabilities. To address these limitations, we propose PL-CA, which introduces a parametric RAG (P-RAG) framework to perform data augmentation on corpus knowledge and encode this legal knowledge into parametric vectors, and then integrates this parametric knowledge into the LLM's feed-forward networks (FFN) via LoRA, thereby alleviating models' context pressure. We also construct a multi-task legal dataset comprising more than 2000 training and test instances, which are all expert-annotated and manually verified. We conduct our experiments on our dataset, and the experimental results demonstrate that our method reduces the overhead associated with excessively long contexts while maintaining competitive performance on downstream tasks compared to conventional RAG. Our code and dataset are provided in the appendix.
摘要：传统的破布被认为是解决模型知识不足和幻觉的最有效方法之一，尤其是在需要高水平的知识，逻辑一致性和内容完整性的司法领域中。但是，传统的抹布方法仅将文档直接检索到模型的上下文中，该文档由于其有限的上下文窗口而严重限制了模型，并通过过长的上下文引入了其他计算开销，从而破坏了模型的注意力和下游任务上的绩效。此外，许多现有的基准缺乏专家注释，并且仅关注单个下游任务，而现实世界中的法律场景由多个混合法律任务组成，这表明传统的基准测试不足，无法反映模型的真正功能。为了解决这些局限性，我们提出了PL-CA，该PL-CA引入了参数rag（p-rag）框架，以对语料库知识执行数据增强，并将这些法律知识编码为参数矢量，然后通过LORA通过LORA集成了该参数知识，然后通过LORA通过LORA进行授课模型的模型。此外，我们还构建了一个由2000多个培训和测试实例组成的多任务法律数据集，这些数据集都经过专家通知和手动验证。我们在数据集上进行实验，实验结果表明，与常规抹布相比，我们的方法减少了与过长的背景相关的间接费用，同时保持在下游任务上的竞争性能。我们的代码和数据集在附录中提供。

Title: Do LLMs exhibit the same commonsense capabilities across languages?

Authors: Ivan Martínez-Murillo, Elena Lloret, Paloma Moreda, Albert Gatt
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.06401
Pdf URL: https://arxiv.org/pdf/2509.06401
Copy Paste: [[2509.06401]] Do LLMs exhibit the same commonsense capabilities across languages?(https://arxiv.org/abs/2509.06401)
Keywords: language model, llm
Abstract: This paper explores the multilingual commonsense generation abilities of Large Language Models (LLMs). To facilitate this investigation, we introduce MULTICOM, a novel benchmark that extends the COCOTEROS dataset to four languages: English, Spanish, Dutch, and Valencian. The task involves generating a commonsensical sentence that includes a given triplet of words. We evaluate a range of open-source LLMs, including LLaMA, Qwen, Gemma, EuroLLM, and Salamandra, on this benchmark. Our evaluation combines automatic metrics, LLM-as-a-judge approaches (using Prometheus and JudgeLM), and human annotations. Results consistently show superior performance in English, with significantly lower performance in less-resourced languages. While contextual support yields mixed results, it tends to benefit underrepresented languages. These findings underscore the current limitations of LLMs in multilingual commonsense generation. The dataset is publicly available at this https URL.
摘要：本文探讨了大语言模型（LLMS）的多语言常识生成能力。为了促进这项调查，我们介绍了Multicom，这是一种新颖的基准，将Cocoteros数据集扩展到四种语言：英语，西班牙语，荷兰语和瓦伦西亚语。该任务涉及生成一个通用句子，其中包括给定的单词三胞胎。我们在此基准上评估了一系列开源LLM，包括Llama，Qwen，Gemma，Eurollm和Salamandra。我们的评估结合了自动指标，LLM-AS-A-A-Gudge方法（使用Prometheus和Judgelm）和人类注释。结果始终显示出英语的出色表现，并且在资源较低的语言中的性能明显降低。尽管上下文支持会产生不同的结果，但它倾向于使代表性不足的语言受益。这些发现强调了LLM在多语言常识生成中的当前局限性。该数据集可在此HTTPS URL上公开可用。

Title: WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents

Authors: Junteng Liu, Yunji Li, Chi Zhang, Jingyang Li, Aili Chen, Ke Ji, Weiyu Cheng, Zijia Wu, Chengyu Du, Qidi Xu, Jiayuan Song, Zhengmao Zhu, Wenhu Chen, Pengyu Zhao, Junxian He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.06501
Pdf URL: https://arxiv.org/pdf/2509.06501
Copy Paste: [[2509.06501]] WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents(https://arxiv.org/abs/2509.06501)
Keywords: language model, llm, agent
Abstract: The paradigm of Large Language Models (LLMs) has increasingly shifted toward agentic applications, where web browsing capabilities are fundamental for retrieving information from diverse online sources. However, existing open-source web agents either demonstrate limited information-seeking abilities on complex tasks or lack transparent implementations. In this work, we identify that the key challenge lies in the scarcity of challenging data for information seeking. To address this limitation, we introduce WebExplorer: a systematic data generation approach using model-based exploration and iterative, long-to-short query evolution. This method creates challenging query-answer pairs that require multi-step reasoning and complex web navigation. By leveraging our curated high-quality dataset, we successfully develop the advanced web agent WebExplorer-8B through supervised fine-tuning followed by reinforcement learning. Our model supports 128K context length and up to 100 tool-calling turns, enabling long-horizon problem solving. Across diverse information-seeking benchmarks, WebExplorer-8B achieves state-of-the-art performance at its scale. Notably, as an 8B-sized model, WebExplorer-8B is able to effectively search over an average of 16 turns after RL training, achieving higher accuracy than WebSailor-72B on BrowseComp-en/zh and attaining the best performance among models up to 100B parameters on WebWalkerQA and FRAMES. Beyond these information-seeking tasks, our model also achieves strong generalization on the HLE benchmark even though it is only trained on knowledge-intensive QA data. These results highlight our approach as a practical path toward long-horizon web agents.
摘要：大型语言模型（LLM）的范式越来越多地转向了代理应用，在这些应用程序中，网络浏览功能是从不同的在线资源中检索信息的基础。但是，现有的开源Web代理要么在复杂任务上表现出有限的信息寻求信息，要么缺乏透明的实现。在这项工作中，我们确定关键的挑战在于缺乏挑战数据来寻求信息。为了解决此限制，我们介绍了WebExplorer：一种使用基于模型的探索和迭代，远程查询进化的系统数据生成方法。此方法创建了需要多步推理和复杂的Web导航的挑战性查询对。通过利用我们精心策划的高质量数据集，我们通过监督的微调成功地开发了高级Web代理WebExplorer-8B，然后进行了增强学习。我们的模型支持128K上下文长度，最多100个工具调用转弯，从而实现了长马问题解决方案。在各种各样的信息寻求基准的基准中，WebExplorer-8B在其规模上实现了最先进的性能。值得注意的是，作为一个8B大小的型号，WebExplorer-8B能够在RL训练后平均进行16个弯道有效搜索，比BrowseComp-en/ZH上的Weberailor-72B获得更高的精度，并在WebWalkerQA和Frames上达到100B参数的模型中的最佳性能。除了这些寻求信息的任务外，我们的模型还在HLE基准上实现了强烈的概括，即使它仅在知识密集型质量检查数据中进行培训。这些结果突出了我们的方法是通往长跑网络代理的实用途径。

Title: Crown, Frame, Reverse: Layer-Wise Scaling Variants for LLM Pre-Training

Authors: Andrei Baroian, Kasper Notebomer
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.06518
Pdf URL: https://arxiv.org/pdf/2509.06518
Copy Paste: [[2509.06518]] Crown, Frame, Reverse: Layer-Wise Scaling Variants for LLM Pre-Training(https://arxiv.org/abs/2509.06518)
Keywords: language model, llm
Abstract: Transformer-based language models traditionally use uniform (isotropic) layer sizes, yet they ignore the diverse functional roles that different depths can play and their computational capacity needs. Building on Layer-Wise Scaling (LWS) and pruning literature, we introduce three new LWS variants (Framed, Reverse, and Crown) that redistribute FFN widths and attention heads via two- or three-point linear interpolation in the pre-training stage. We present the first systematic ablation of LWS and its variants, on a fixed budget of 180M parameters, trained on 5B tokens. All models converge to similar losses and achieve better performance compared to an equal-cost isotropic baseline, without a substantial decrease in training throughput. This work represents an initial step into the design space of layer-wise architectures for pre-training, but future work should scale experiments to orders of magnitude more tokens and parameters to fully assess their potential.
摘要：基于变压器的语言模型传统上使用统一（各向同性）层的大小，但它们忽略了不同深度可以扮演的各种功能角色及其计算能力需求。在层面上的缩放（LWS）和修剪文献的基础上，我们引入了三种新的LWS变体 - 构造，反向和皇冠 - 通过在预训练阶段的两个或三点线性插值重新分布FFN宽度和注意力头。我们介绍了LWS及其变体的第一个系统消融，其固定预算为1.8亿参数，在5B代币上训练。与等同于等于成本的基线相比，所有模型都会收敛到相似的损失，并取得更好的性能，而训练吞吐量的大幅下降。这项工作代表了用于预训练的层架构设计空间的第一步，但是未来的工作应将实验扩展到数量级的代价和参数，以充分评估其潜力。
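As a rough illustration of the two- or three-point linear interpolation mentioned in the LWS abstract, the sketch below spreads FFN widths across layers by interpolating between control points placed at evenly spaced depths. The function name, the control-point values, and the rounding to multiples of 64 are assumptions made for this example; the abstract does not define the exact shapes of the Framed, Reverse, or Crown variants, so none of them is reproduced here.

```python
def interpolate_widths(num_layers, control_points, round_to=64):
    """Piecewise-linear FFN widths over depth.

    control_points: widths at evenly spaced anchor depths
                    (2 points = one linear ramp, 3 points = two ramps).
    round_to:       widths are rounded to a multiple of this value (assumed).
    Requires num_layers >= 2.
    """
    segments = len(control_points) - 1
    widths = []
    for layer in range(num_layers):
        # Position of this layer measured in anchor units, from 0 to `segments`.
        pos = layer / (num_layers - 1) * segments
        left = min(int(pos), segments - 1)
        frac = pos - left
        width = control_points[left] * (1 - frac) + control_points[left + 1] * frac
        widths.append(int(round(width / round_to)) * round_to)
    return widths

# Two-point schedule: widths grow linearly with depth.
ramp = interpolate_widths(5, [1024, 2048])
# Three-point schedule: widest in the middle layer (a hypothetical shape).
peak = interpolate_widths(5, [1024, 2048, 1024])
```

In an equal-budget comparison such as the one the abstract describes, the control points would also have to be chosen so that the total parameter count matches the isotropic baseline; that constraint is not modeled here.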

Title: LAMDAS: LLM as an Implicit Classifier for Domain-specific Data Selection

Authors: Jian Wu, Hang Yu, Bingchang Liu, Wenjie Yang, Peng Di, Jianguo Li, Yue Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.06524
Pdf URL: https://arxiv.org/pdf/2509.06524
Copy Paste: [[2509.06524]] LAMDAS: LLM as an Implicit Classifier for Domain-specific Data Selection(https://arxiv.org/abs/2509.06524)
Keywords: language model, llm
Abstract: Adapting large language models (LLMs) to specific domains often faces a critical bottleneck: the scarcity of high-quality, human-curated data. While large volumes of unchecked data are readily available, indiscriminately using them for fine-tuning risks introducing noise and degrading performance. Strategic data selection is thus crucial, requiring a method that is both accurate and efficient. Existing approaches, categorized as similarity-based and direct optimization methods, struggle to simultaneously achieve these goals. In this paper, we introduce LAMDAS (LLM As an iMplicit classifier for domain-specific DAta Selection), a novel approach that leverages the pre-trained LLM itself as an implicit classifier, thereby bypassing explicit feature engineering and computationally intensive optimization processes. LAMDAS reframes data selection as a one-class classification problem, identifying candidate data that "belongs" to the target domain defined by a small reference dataset. Extensive experimental results demonstrate that LAMDAS not only exceeds the performance of full-data training using a fraction of the data but also outperforms nine state-of-the-art (SOTA) baselines under various scenarios. Furthermore, LAMDAS achieves the most compelling balance between performance gains and computational efficiency compared to all evaluated baselines.
摘要：将大型语言模型（LLM）调整到特定领域通常会面临关键的瓶颈：高质量，人类策划的数据的稀缺性。尽管很容易获得大量未检查的数据，但不加选择地使用它们来进行微调风险引入噪声和降解性能。因此，战略数据选择至关重要，需要一种既准确又有效的方法。现有的方法被归类为基于相似性和直接优化方法，难以同时实现这些目标。在本文中，我们介绍了LAMDAS（LLM作为特定于域数据选择的隐式分类器），一种新型方法，它利用了预培训的LLM本身作为隐式分类器，从而绕开了显式功能工程和计算强度的优化过程。 LAMDA将数据选择作为单级分类问题进行了重新框架，并识别“属于”小参考数据集定义的目标域“属于”的候选数据。广泛的实验结果表明，LAMDA不仅超过了一小部分数据的全数据训练的性能，而且在各种情况下都超过了九个最先进（SOTA）基准的表现。此外，与所有评估的基准相比，LAMDA在性能提高与计算效率之间达到了最引人注目的平衡。

Title: SLiNT: Structure-aware Language Model with Injection and Contrastive Training for Knowledge Graph Completion

Authors: Mengxue Yang, Chun Yang, Jiaqi Zhu, Jiafan Li, Jingqi Zhang, Yuyang Li, Ying Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.06531
Pdf URL: https://arxiv.org/pdf/2509.06531
Copy Paste: [[2509.06531]] SLiNT: Structure-aware Language Model with Injection and Contrastive Training for Knowledge Graph Completion(https://arxiv.org/abs/2509.06531)
Keywords: language model, llm
Abstract: Link prediction in knowledge graphs requires integrating structural information and semantic context to infer missing entities. While large language models offer strong generative reasoning capabilities, their limited exploitation of structural signals often results in structural sparsity and semantic ambiguity, especially under incomplete or zero-shot settings. To address these challenges, we propose SLiNT (Structure-aware Language model with Injection and coNtrastive Training), a modular framework that injects knowledge-graph-derived structural context into a frozen LLM backbone with lightweight LoRA-based adaptation for robust link prediction. Specifically, Structure-Guided Neighborhood Enhancement (SGNE) retrieves pseudo-neighbors to enrich sparse entities and mitigate missing context; Dynamic Hard Contrastive Learning (DHCL) introduces fine-grained supervision by interpolating hard positives and negatives to resolve entity-level ambiguity; and Gradient-Decoupled Dual Injection (GDDI) performs token-level structure-aware intervention while preserving the core LLM parameters. Experiments on WN18RR and FB15k-237 show that SLiNT achieves superior or competitive performance compared with both embedding-based and generation-based baselines, demonstrating the effectiveness of structure-aware representation learning for scalable knowledge graph completion.
摘要：知识图中的链接预测需要整合结构信息和语义上下文，以推断缺失的实体。尽管大型语言模型具有强大的生成推理能力，但它们对结构信号的利用有限，通常会导致结构上的稀疏性和语义歧义，尤其是在不完整或零拍的设置下。为了应对这些挑战，我们建议Slint（具有注入和对比度训练的结构意识语言模型），这是一个模块化框架，将知识图形衍生的结构上下文注入了冷冻的LLM骨架中，并具有基于轻量的Lora的适应性，以实现稳健的链接预测。具体而言，结构引导的邻里增强（SGNE）检索伪 - 邻居以丰富稀疏实体并减轻缺失的上下文；动态硬性学习（DHCL）通过插值硬积极因素和负面因素来解决实体级别的歧义，从而引入了细粒度的监督；梯度 - 偶联的双注射（GDDI）在保留核心LLM参数的同时，执行令牌级的结构感知干预措施。 WN18RR和FB15K-237上的实验表明，与基于嵌入的基线和基于生成的基线相比，SLINT具有优越或竞争性的性能，这证明了结构感知表示的学习对可扩展知识图完成的有效性。

Title: HAVE: Head-Adaptive Gating and ValuE Calibration for Hallucination Mitigation in Large Language Models

Authors: Xin Tong, Zhi Lin, Jingya Wang, Bo Jin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.06596
Pdf URL: https://arxiv.org/pdf/2509.06596
Copy Paste: [[2509.06596]] HAVE: Head-Adaptive Gating and ValuE Calibration for Hallucination Mitigation in Large Language Models(https://arxiv.org/abs/2509.06596)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) often produce hallucinations in retrieval-augmented or long-context generation, even when relevant evidence is present. This stems from two issues: head importance is treated as input-agnostic, and raw attention weights poorly reflect each token's true contribution. We present HAVE (Head-Adaptive Gating and ValuE Calibration), a parameter-free decoding framework that directly addresses both challenges. HAVE introduces head-adaptive gating, which performs instance-level soft reweighting of attention heads, and value calibration, which augments attention with the magnitude of value vectors to approximate write-back contribution. Together, these modules construct token-level evidence aligned with model updates and fuse it with the LM distribution through a lightweight uncertainty-scaled policy. HAVE requires no finetuning and operates in a single forward pass, making it efficient and broadly applicable. Experiments across multiple QA benchmarks and LLM families demonstrate that HAVE consistently reduces hallucinations and outperforms strong baselines, including DAGCD, with modest overhead. The framework is transparent, reproducible, and readily integrates with off-the-shelf LLMs, advancing trustworthy generation in real-world settings.
摘要：大型语言模型（LLMS）通常在检索声音或长篇文化生成中产生幻觉，即使存在相关证据。这源于两个问题：头部的重要性被视为输入不平衡，原始注意力的权重反映了每个令牌的真实贡献。我们提供（头部自适应门控和价值校准），这是一个无参数的解码框架，直接解决这两个挑战。已经引入了头部自适应门，该门具有对注意力头的实例级软性重新升高和价值校准，从而通过价值向量的大小来增强注意力，从而近似地写下书面贡献。这些模块共同构建了令牌级别的证据，与模型更新一致，并通过轻巧的不确定性缩放策略将其与LM分布融合。不需要任何填充，并且在一次前传球中运行，从而使其有效且广泛地适用。多个质量检查基准和LLM家族之间的实验表明，它们始终降低幻觉和优于较强的基本线，包括dagcd，其开销适度。该框架是透明的，可重现的，并且很容易与现成的LLMS集成，从而在现实世界中推进了值得信赖的一代。
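The value-calibration idea in the HAVE abstract (augmenting attention with the magnitude of value vectors) can be sketched for a single attention head as follows. This is a simplified reading rather than the paper's formulation: scaling each attention weight by the L2 norm of its value vector and renormalizing is an assumption, and the head-adaptive gating and uncertainty-scaled fusion steps are omitted.

```python
import math

def value_calibrated_scores(attn_weights, value_vectors):
    """Scale each token's attention weight by the L2 norm of its value vector.

    attn_weights:  one weight per context token (for a single head).
    value_vectors: one value vector (list of floats) per context token.
    Returns scores renormalized to sum to 1 (unchanged if all are zero).
    """
    raw = [
        a * math.sqrt(sum(x * x for x in v))
        for a, v in zip(attn_weights, value_vectors)
    ]
    total = sum(raw)
    return [r / total for r in raw] if total > 0 else raw

# Two tokens with equal attention, but the second has a zero value vector,
# so it contributes nothing to what the head writes back.
scores = value_calibrated_scores([0.5, 0.5], [[3.0, 4.0], [0.0, 0.0]])
```

The example shows why raw attention can mislead: both tokens receive the same weight, yet only the first one carries any value-vector mass.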

Title: Guided Decoding and Its Critical Role in Retrieval-Augmented Generation

Authors: Özgür Uğur, Musa Yılmaz, Esra Şavirdi, Özay Ezerceli, Mahmut El Huseyni, Selva Taş, Reyhan Bayraktar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.06631
Pdf URL: https://arxiv.org/pdf/2509.06631
Copy Paste: [[2509.06631]] Guided Decoding and Its Critical Role in Retrieval-Augmented Generation(https://arxiv.org/abs/2509.06631)
Keywords: language model, llm, hallucination, prompt, retrieval-augmented generation
Abstract: The integration of Large Language Models (LLMs) into various applications has driven the need for structured and reliable responses. A key challenge in Retrieval-Augmented Generation (RAG) systems is ensuring that outputs align with expected formats while minimizing hallucinations. This study examines the role of guided decoding in RAG systems, comparing three methods, Outlines, XGrammar, and LM Format Enforcer, across different multi-turn prompting setups (0-turn, 1-turn, and 2-turn). By evaluating success rates, hallucination rates, and output quality, we provide insights into their performance and applicability. Our findings reveal how multi-turn interactions influence guided decoding, uncovering unexpected performance variations that can inform method selection for specific use cases. This work advances the understanding of structured output generation in RAG systems, offering both theoretical insights and practical guidance for LLM deployment.
摘要：大型语言模型（LLM）集成到各种应用程序中，促使对结构化和可靠的响应的需求。检索功能生成（RAG）系统的主要挑战是确保输出与预期格式保持一致，同时最大程度地减少幻觉。这项研究研究了指导解码在抹布系统中的作用，比较了在不同的多转移促使设置（0转，1-Turn和2-Turn）的三种方法，轮廓，XGrammar和LM格式执行器。通过评估成功率，幻觉率和产出质量，我们可以洞悉其性能和适用性。我们的发现揭示了多转交互作用如何影响指导的解码，从而揭示了可以为特定用例的方法选择的意外性能变化。这项工作促进了对抹布系统中结构化产量产生的理解，为LLM部署提供了理论见解和实用指导。

Title: Domain-Aware RAG: MoL-Enhanced RL for Efficient Training and Scalable Retrieval

Authors: Hao Lin, Peitong Xie, Jingxue Chen, Jie Lin, Qingkun Tang, Qianchun Lu
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2509.06650
Pdf URL: https://arxiv.org/pdf/2509.06650
Copy Paste: [[2509.06650]] Domain-Aware RAG: MoL-Enhanced RL for Efficient Training and Scalable Retrieval(https://arxiv.org/abs/2509.06650)
Keywords: retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) systems rely heavily on the retrieval stage, particularly the coarse-ranking process. Existing coarse-ranking optimization approaches often struggle to balance domain-specific knowledge learning with query enhancement, resulting in suboptimal retrieval performance. To address this challenge, we propose MoLER, a domain-aware RAG method that uses MoL-Enhanced Reinforcement Learning to optimize retrieval. MoLER has a two-stage pipeline: a continual pre-training (CPT) phase using a Mixture of Losses (MoL) to balance domain-specific knowledge with general language capabilities, and a reinforcement learning (RL) phase leveraging Group Relative Policy Optimization (GRPO) to optimize query and passage generation for maximizing document recall. A key innovation is our Multi-query Single-passage Late Fusion (MSLF) strategy, which reduces computational overhead during RL training while maintaining scalable inference via Multi-query Multi-passage Late Fusion (MMLF). Extensive experiments on benchmark datasets show that MoLER achieves state-of-the-art performance, significantly outperforming baseline methods. MoLER bridges the knowledge gap in RAG systems, enabling robust and scalable retrieval in specialized domains.
摘要：检索增强的生成（RAG）系统在很大程度上依赖于检索阶段，尤其是粗级的过程。现有的粗级优化方法通常难以在特定领域的知识学习与查询增强之间取得平衡，从而导致了次优的检索性能。为了应对这一挑战，我们提出了莫勒（Moler），这是一种使用摩尔增强的增强学习来优化检索的域具有域名的抹布方法。 Moler具有两阶段的管道：使用损失（MOL）混合（MOL）的连续预训练（CPT）相，以平衡特定于领域的知识与一般语言能力，以及增强阶段学习（RL）相对相对策略优化（GRPO），以优化查询和通过的问题，以最大化文档回忆。一个关键的创新是我们的多电量单通道后期融合（MSLF）策略，该策略在RL训练过程中降低了计算开销，同时通过多Query多通量后期融合（MMLF）维持可扩展的推断。基准数据集的广泛实验表明，莫勒实现了最先进的性能，大大优于基线方法。 Moler桥接了抹布系统中的知识差距，从而在专门的域中实现了可靠且可扩展的检索。

Title: IntrEx: A Dataset for Modeling Engagement in Educational Conversations

Authors: Xingwei Tan, Mahathi Parvatham, Chiara Gambi, Gabriele Pergola
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.06652
Pdf URL: https://arxiv.org/pdf/2509.06652
Copy Paste: [[2509.06652]] IntrEx: A Dataset for Modeling Engagement in Educational Conversations(https://arxiv.org/abs/2509.06652)
Keywords: language model, gpt, llm, chat
Abstract: Engagement and motivation are crucial for second-language acquisition, yet maintaining learner interest in educational conversations remains a challenge. While prior research has explored what makes educational texts interesting, still little is known about the linguistic features that drive engagement in conversations. To address this gap, we introduce IntrEx, the first large dataset annotated for interestingness and expected interestingness in teacher-student interactions. Built upon the Teacher-Student Chatroom Corpus (TSCC), IntrEx extends prior work by incorporating sequence-level annotations, allowing for the study of engagement beyond isolated turns to capture how interest evolves over extended dialogues. We employ a rigorous annotation process with over 100 second-language learners, using a comparison-based rating approach inspired by reinforcement learning from human feedback (RLHF) to improve agreement. We investigate whether large language models (LLMs) can predict human interestingness judgments. We find that LLMs (7B/8B parameters) fine-tuned on interestingness ratings outperform larger proprietary models like GPT-4o, demonstrating the potential for specialised datasets to model engagement in educational settings. Finally, we analyze how linguistic and cognitive factors, such as concreteness, comprehensibility (readability), and uptake, influence engagement in educational dialogues.
摘要：参与和动机对于第二语言的获取至关重要，但是保持学习者对教育对话的兴趣仍然是一个挑战。虽然先前的研究探讨了使教育文本变得有趣的是什么，但对推动对话参与的语言特征知之甚少。为了解决这一差距，我们介绍了Intrex，这是第一个注释的大型数据集，以引起教师互动中的兴趣和预期兴趣。 Intrex建立在教师聊天室语料库（TSCC）的基础上，通过合并序列级别的注释，扩展了先前的工作，从而使互动的研究超越了孤立的转弯，以捕捉对扩展对话的兴趣如何发展。我们采用严格的注释过程，使用100多种第二语言学习者，采用一种基于比较的评分方法，该方法受到对人类反馈（RLHF）学习的启发，以提高共识。我们研究大型语言模型（LLMS）是否可以预测人类的兴趣判断。我们发现，llms（7b/8b参数）对趣味性评级进行了微调，其表现优于较大的专有模型，例如GPT-4O，这表明了专业数据集以建模教育环境中的参与度的潜力。最后，我们分析了语言和认知因素，例如具体性，可理解性（可读性）和采用，会影响参与教育对话。

Title: Anchoring Refusal Direction: Mitigating Safety Risks in Tuning via Projection Constraint

Authors: Yanrui Du, Fenglei Fan, Sendong Zhao, Jiawei Cao, Qika Lin, Kai He, Ting Liu, Bing Qin, Mengling Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.06795
Pdf URL: https://arxiv.org/pdf/2509.06795
Copy Paste: [[2509.06795]] Anchoring Refusal Direction: Mitigating Safety Risks in Tuning via Projection Constraint(https://arxiv.org/abs/2509.06795)
Keywords: language model, llm
Abstract: Instruction Fine-Tuning (IFT) has been widely adopted as an effective post-training strategy to enhance various abilities of Large Language Models (LLMs). However, prior studies have shown that IFT can significantly compromise LLMs' safety, particularly their ability to refuse malicious instructions, raising significant concerns. Recent research into the internal mechanisms of LLMs has identified the refusal direction (r-direction) in the hidden states, which plays a pivotal role in governing refusal behavior. Building on this insight, our study reveals that the r-direction tends to drift during training, which we identify as one of the causes of the associated safety risks. To mitigate such drift, our proposed ProCon method introduces a projection-constrained loss term that regularizes the projection magnitude of each training sample's hidden state onto the r-direction. Our initial analysis shows that applying an appropriate constraint can effectively mitigate the refusal direction drift and associated safety risks, but remains limited by overall performance barriers. To overcome this barrier, informed by our observation of early-stage sharp drift and a data-driven perspective, we introduce a warm-up strategy that emphasizes early-stage strong constraints and broadens the data distribution to strengthen constraint signals, leading to an enhanced ProCon method. Experimental results under various datasets, scenarios, and LLMs demonstrate that our method can significantly mitigate safety risks posed by IFT while preserving task performance gains. Even compared with strong baselines, our method consistently delivers superior overall performance. Crucially, our analysis indicates that ProCon can contribute to stabilizing the r-direction during training, while such an interpretability-driven exploration of LLMs' internal mechanisms lays a solid foundation for future safety research.
摘要：指导微调（IFT）已被广泛用作一种有效的训练后策略，以增强大型语言模型（LLMS）的各种能力。但是，先前的研究表明，IFT可以显着损害LLMS的安全性，尤其是他们拒绝恶意指示的能力，从而引起了重大关注。对LLM的内部机制的最新研究已经确定了隐藏状态中的拒绝方向（r方向），该方向在管理拒绝行为中起着关键作用。在这种见解的基础上，我们的研究表明，RIDICTION在培训期间倾向于漂移，我们认为这是相关安全风险的原因之一。为了减轻这种漂移，我们提出的Procon方法引入了一个投影受限的损失项，该损失项将每个训练样本隐藏状态的投影幅度正常于R方向。我们的初步分析表明，应用适当的约束可以有效地减轻拒绝方向漂移和相关的安全风险，但仍受到整体性能障碍的限制。为了克服这一障碍，通过观察早期尖锐的漂移和数据驱动的观点，我们介绍了一种热身策略，该策略强调早期阶段的强限制并扩大数据分布以增强约束信号，从而导致了增强的Procon方法。各种数据集，方案和LLMS下的实验结果表明，我们的方法可以显着减轻IFT在保留任务绩效增长的同时带来的安全风险。即使与强大的基线相比，我们的方法也始终提供卓越的整体性能。至关重要的是，我们的分析表明，Procon可以在训练过程中稳定RIDICTION，而这种可解释性驱动的LLMS内部机制的探索为未来的安全研究奠定了坚实的基础。
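The projection-constrained loss term described in the ProCon abstract can be sketched in a few lines of Python. The exact form of the regularizer is not given in the abstract, so this example assumes a squared penalty on the change in projection relative to a reference (pre-tuning) hidden state; the function names, the reference state, and the `weight` parameter are all hypothetical, and the warm-up schedule is not modeled.

```python
import math

def projection(hidden, r_direction):
    """Scalar projection of a hidden state onto the refusal direction."""
    norm = math.sqrt(sum(x * x for x in r_direction))
    return sum(h * r for h, r in zip(hidden, r_direction)) / norm

def projection_constrained_loss(task_loss, hidden, ref_hidden, r_direction, weight=1.0):
    """Task loss plus an assumed squared penalty on projection drift.

    hidden:     hidden state of the model being fine-tuned.
    ref_hidden: hidden state of the original model on the same sample (assumed).
    """
    drift = projection(hidden, r_direction) - projection(ref_hidden, r_direction)
    return task_loss + weight * drift ** 2

# Toy 2-D example: the tuned state has moved 5 units along the r-direction.
loss = projection_constrained_loss(
    task_loss=1.0,
    hidden=[3.0, 4.0],
    ref_hidden=[0.0, 0.0],
    r_direction=[3.0, 4.0],
    weight=0.5,
)
```

A warm-up strategy of the kind the abstract mentions could be approximated by making `weight` larger in early training steps, though the schedule the authors use is not stated.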

Title: MachineLearningLM: Continued Pretraining Language Models on Millions of Synthetic Tabular Prediction Tasks Scales In-Context ML

Authors: Haoyu Dong, Pengkun Zhang, Mingzhe Lu, Yanzhen Shen, Guolin Ke
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.06806
Pdf URL: https://arxiv.org/pdf/2509.06806
Copy Paste: [[2509.06806]] MachineLearningLM: Continued Pretraining Language Models on Millions of Synthetic Tabular Prediction Tasks Scales In-Context ML(https://arxiv.org/abs/2509.06806)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Large language models (LLMs) possess broad world knowledge and strong general-purpose reasoning ability, yet they struggle to learn from many in-context examples on standard machine learning (ML) tasks, that is, to leverage many-shot demonstrations purely via in-context learning (ICL) without gradient descent. We introduce MachineLearningLM, a portable continued-pretraining framework that equips a general-purpose LLM with robust in-context ML capability while preserving its general knowledge and reasoning for broader chat workflows. Our pretraining procedure synthesizes ML tasks from millions of structural causal models (SCMs), spanning shot counts up to 1,024. We begin with a random-forest teacher, distilling tree-based decision strategies into the LLM to strengthen robustness in numerical modeling. All tasks are serialized with a token-efficient prompt, enabling 3x to 6x more examples per context window and delivering up to 50x amortized throughput via batch inference. Despite a modest setup (Qwen-2.5-7B-Instruct with LoRA rank 8), MachineLearningLM outperforms strong LLM baselines (e.g., GPT-5-mini) by an average of about 15% on out-of-distribution tabular classification across finance, physics, biology, and healthcare domains. It exhibits a striking many-shot scaling law: accuracy increases monotonically as in-context demonstrations grow from 8 to 1,024. Without any task-specific training, it attains random-forest-level accuracy across hundreds of shots. General chat capabilities, including knowledge and reasoning, are preserved: it achieves 75.4% on MMLU.
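Summary: Large language models (LLMs) have broad world knowledge and strong general-purpose reasoning ability, yet they struggle to learn from many in-context examples on standard machine learning (ML) tasks, that is, to exploit many-shot demonstrations purely through in-context learning (ICL) without gradient descent. We introduce MachineLearningLM, a portable continued-pretraining framework that gives a general-purpose LLM robust in-context ML capability while preserving its general knowledge and reasoning for broader chat workflows. Our pretraining procedure synthesizes ML tasks from millions of structural causal models (SCMs), with shot counts of up to 1,024. We start from a random-forest teacher and distill tree-based decision strategies into the LLM to strengthen robustness in numerical modeling. All tasks are serialized with a token-efficient prompt, which fits 3x to 6x more examples per context window and delivers up to 50x amortized throughput via batch inference. Despite a modest setup (Qwen-2.5-7B-Instruct with LoRA rank 8), MachineLearningLM outperforms strong LLM baselines (e.g., GPT-5-mini) by about 15% on average on out-of-distribution tabular classification across finance, physics, biology, and healthcare. It shows a striking many-shot scaling law: accuracy increases monotonically as in-context demonstrations grow from 8 to 1,024. Without any task-specific training, it reaches random-forest-level accuracy across hundreds of shots. General chat capabilities, including knowledge and reasoning, are preserved: it achieves 75.4% on MMLU.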

Title: MoGU V2: Toward a Higher Pareto Frontier Between Model Usability and Security

Authors: Yanrui Du, Fenglei Fan, Sendong Zhao, Jiawei Cao, Ting Liu, Bing Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.06807
Pdf URL: https://arxiv.org/pdf/2509.06807
Copy Paste: [[2509.06807]] MoGU V2: Toward a Higher Pareto Frontier Between Model Usability and Security(https://arxiv.org/abs/2509.06807)
Keywords: language model, llm
Abstract: As Large Language Models (LLMs) increasingly permeate human life, their security has emerged as a critical concern, particularly their ability to maintain harmless responses to malicious instructions. Although extensive methods have improved LLMs' security, they often lead to conservative, rejection-oriented responses that compromise practical usability. This presents a key challenge: how to advance the Pareto frontier between LLMs' usability and security, rather than necessitate a trade-off between them. To address this, we propose the MoGU framework, in which the intra-layer router dynamically allocates weights by sensing hidden states, thereby balancing the contributions of security-optimized and usability-optimized variants. Despite its initial potential, the MoGU framework faces limitations such as parameter redundancy and performance bottlenecks. To overcome these, we further propose an improved MoGU_v2 framework that establishes a tighter coupling between the routers and hidden states. In MoGU_v2, routers are embedded only in layers encoding highly classifiable security features, and backbone modules are activated during router optimization to enable bidirectional adaptation. MoGU_V2 exhibits strong adaptability and stable improvements across various series of LLMs, including mainstream LLMs serving as brains in various applications, on-device LLMs optimized for resource-constrained scenarios, and reasoning LLMs tailored for user interpretability. Meanwhile, even facing risks introduced by Instruction Fine-tuning, MoGU_v2 can easily restore security without compromising the task performance gains via a simple data-mix strategy. These comprehensive improvements highlight MoGU_V2 as a robust and versatile solution for mitigating security risks in real-world applications.
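Summary: As large language models (LLMs) increasingly permeate human life, their security has become a critical concern, particularly their ability to keep responses to malicious instructions harmless. Although many methods have improved LLM security, they often lead to conservative, rejection-oriented responses that compromise practical usability. This poses a key challenge: how to advance the Pareto frontier between LLM usability and security instead of forcing a trade-off between them. To address this, we propose the MoGU framework, in which an intra-layer router dynamically allocates weights by sensing hidden states, thereby balancing the contributions of security-optimized and usability-optimized variants. Despite its initial promise, the MoGU framework faces limitations such as parameter redundancy and performance bottlenecks. To overcome these, we further propose an improved MoGU_v2 framework that couples the routers more tightly with the hidden states. In MoGU_v2, routers are embedded only in layers that encode highly classifiable security features, and backbone modules are activated during router optimization to enable bidirectional adaptation. MoGU_v2 shows strong adaptability and stable improvements across several series of LLMs, including mainstream LLMs that serve as the brains of various applications, on-device LLMs optimized for resource-constrained scenarios, and reasoning LLMs tailored for user interpretability. Moreover, even when facing risks introduced by instruction fine-tuning, MoGU_v2 can easily restore security through a simple data-mix strategy without compromising task performance gains. These comprehensive improvements highlight MoGU_v2 as a robust and versatile solution for mitigating security risks in real-world applications.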

Title: Saturation-Driven Dataset Generation for LLM Mathematical Reasoning in the TPTP Ecosystem

Authors: Valentin Quesnel, Damien Sileo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.06809
Pdf URL: https://arxiv.org/pdf/2509.06809
Copy Paste: [[2509.06809]] Saturation-Driven Dataset Generation for LLM Mathematical Reasoning in the TPTP Ecosystem(https://arxiv.org/abs/2509.06809)
Keywords: language model, llm
Abstract: The scarcity of high-quality, logically sound data is a critical bottleneck for advancing the mathematical reasoning of Large Language Models (LLMs). Our work confronts this challenge by turning decades of automated theorem proving research into a scalable data engine. Rather than relying on error-prone LLMs or complex proof-assistant syntax like Lean and Isabelle, our framework leverages E-prover's saturation capabilities on the vast TPTP axiom library to derive a massive, guaranteed-valid corpus of theorems. Our pipeline is principled and simple: saturate axioms, filter for "interesting" theorems, and generate tasks. With no LLMs in the loop, we eliminate factual errors by construction. This purely symbolic data is then transformed into three difficulty-controlled challenges: entailment verification, premise selection, and proof reconstruction. Our zero-shot experiments on frontier models reveal a clear weakness: performance collapses on tasks requiring deep, structural reasoning. Our framework provides both the diagnostic tool to measure this gap and a scalable source of symbolic training data to address it. We make the code and data publicly available. this https URL this https URL
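Summary: The scarcity of high-quality, logically sound data is a critical bottleneck for advancing the mathematical reasoning of large language models (LLMs). Our work confronts this challenge by turning decades of automated theorem proving research into a scalable data engine. Rather than relying on error-prone LLMs or complex proof-assistant syntax such as Lean and Isabelle, our framework uses E-prover's saturation capabilities on the vast TPTP axiom library to derive a massive, guaranteed-valid corpus of theorems. Our pipeline is principled and simple: saturate axioms, filter for "interesting" theorems, and generate tasks. With no LLMs in the loop, we eliminate factual errors by construction. This purely symbolic data is then transformed into three difficulty-controlled challenges: entailment verification, premise selection, and proof reconstruction. Our zero-shot experiments on frontier models reveal a clear weakness: performance collapses on tasks that require deep, structural reasoning. Our framework provides both a diagnostic tool to measure this gap and a scalable source of symbolic training data to address it. We make the code and data publicly available. this https URL this https URL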

Title: A Comparative Benchmark of Large Language Models for Labelling Wind Turbine Maintenance Logs

Authors: Max Malyi, Jonathan Shek, Alasdair McDonald, Andre Biscaya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.06813
Pdf URL: https://arxiv.org/pdf/2509.06813
Copy Paste: [[2509.06813]] A Comparative Benchmark of Large Language Models for Labelling Wind Turbine Maintenance Logs(https://arxiv.org/abs/2509.06813)
Keywords: language model, llm
Abstract: Effective Operation and Maintenance (O&M) is critical to reducing the Levelised Cost of Energy (LCOE) from wind power, yet the unstructured, free-text nature of turbine maintenance logs presents a significant barrier to automated analysis. Our paper addresses this by presenting a novel and reproducible framework for benchmarking Large Language Models (LLMs) on the task of classifying these complex industrial records. To promote transparency and encourage further research, this framework has been made publicly available as an open-source tool. We systematically evaluate a diverse suite of state-of-the-art proprietary and open-source LLMs, providing a foundational assessment of their trade-offs in reliability, operational efficiency, and model calibration. Our results quantify a clear performance hierarchy, identifying top models that exhibit high alignment with a benchmark standard and trustworthy, well-calibrated confidence scores. We also demonstrate that classification performance is highly dependent on the task's semantic ambiguity, with all models showing higher consensus on objective component identification than on interpretive maintenance actions. Given that no model achieves perfect accuracy and that calibration varies dramatically, we conclude that the most effective and responsible near-term application is a Human-in-the-Loop system, where LLMs act as a powerful assistant to accelerate and standardise data labelling for human experts, thereby enhancing O&M data quality and downstream reliability analysis.
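Summary: Effective operation and maintenance (O&M) is critical to reducing the levelised cost of energy (LCOE) from wind power, yet the unstructured, free-text nature of turbine maintenance logs is a significant barrier to automated analysis. Our paper addresses this by presenting a novel and reproducible framework for benchmarking large language models (LLMs) on the task of classifying these complex industrial records. To promote transparency and encourage further research, the framework has been made publicly available as an open-source tool. We systematically evaluate a diverse suite of state-of-the-art proprietary and open-source LLMs, providing a foundational assessment of their trade-offs in reliability, operational efficiency, and model calibration. Our results quantify a clear performance hierarchy and identify top models that show high alignment with a benchmark standard and trustworthy, well-calibrated confidence scores. We also show that classification performance depends strongly on the task's semantic ambiguity: all models reach higher consensus on objective component identification than on interpretive maintenance actions. Given that no model achieves perfect accuracy and that calibration varies dramatically, we conclude that the most effective and responsible near-term application is a human-in-the-loop system, in which LLMs act as a powerful assistant that accelerates and standardises data labelling for human experts, thereby improving O&M data quality and downstream reliability analysis.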

Title: COMPACT: Common-token Optimized Model Pruning Across Channels and Tokens

Authors: Eugene Kwek, Wenpeng Yin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.06836
Pdf URL: https://arxiv.org/pdf/2509.06836
Copy Paste: [[2509.06836]] COMPACT: Common-token Optimized Model Pruning Across Channels and Tokens(https://arxiv.org/abs/2509.06836)
Keywords: llm
Abstract: Making LLMs more efficient in memory, latency, and serving cost is crucial for edge deployment, interactive applications, and sustainable inference at scale. Pruning is a key technique toward this goal. However, prior pruning methods are limited: width pruning often breaks the standard transformer layout or requires custom inference code, while depth pruning removes entire layers and can cause abrupt accuracy drops. In this work, we propose COMPACT, which jointly (i) prunes rare vocabulary to shrink embedding/unembedding and (ii) prunes FFN intermediate channels using common-token-weighted activations, aligning importance with the post-pruning token distribution. COMPACT enjoys merits of both depth and width pruning, such as: deployment-friendliness (keeps a standard transformer architecture), scale-adaptivity (trade off vocab vs. FFN pruning), training-free operation with competitive pruning time, and strong memory savings alongside throughput gains. Experiments across Qwen, LLaMA, and Gemma families (0.5B-70B) show state-of-the-art downstream task performance at similar or higher pruning ratios, with substantial reductions in parameters, GPU memory, and end-to-end latency.
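Summary: Making LLMs more efficient in memory, latency, and serving cost is crucial for edge deployment, interactive applications, and sustainable inference at scale. Pruning is a key technique toward this goal. However, prior pruning methods are limited: width pruning often breaks the standard transformer layout or requires custom inference code, while depth pruning removes entire layers and can cause abrupt accuracy drops. In this work, we propose COMPACT, which jointly (i) prunes rare vocabulary to shrink the embedding and unembedding matrices and (ii) prunes FFN intermediate channels using common-token-weighted activations, aligning importance with the post-pruning token distribution. COMPACT combines the merits of depth and width pruning: deployment-friendliness (it keeps a standard transformer architecture), scale-adaptivity (it trades off vocabulary pruning against FFN pruning), training-free operation with competitive pruning time, and strong memory savings alongside throughput gains. Experiments across the Qwen, LLaMA, and Gemma families (0.5B-70B) show state-of-the-art downstream task performance at similar or higher pruning ratios, with substantial reductions in parameters, GPU memory, and end-to-end latency.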

Title: EPT Benchmark: Evaluation of Persian Trustworthiness in Large Language Models

Authors: Mohammad Reza Mirbagheri, Mohammad Mahdi Mirkamali, Zahra Motoshaker Arani, Ali Javeri, Amir Mahdi Sadeghzadeh, Rasool Jalili
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2509.06838
Pdf URL: https://arxiv.org/pdf/2509.06838
Copy Paste: [[2509.06838]] EPT Benchmark: Evaluation of Persian Trustworthiness in Large Language Models(https://arxiv.org/abs/2509.06838)
Keywords: language model, gpt, llm, chat
Abstract: Large Language Models (LLMs), trained on extensive datasets using advanced deep learning architectures, have demonstrated remarkable performance across a wide range of language tasks, becoming a cornerstone of modern AI technologies. However, ensuring their trustworthiness remains a critical challenge, as reliability is essential not only for accurate performance but also for upholding ethical, cultural, and social values. Careful alignment of training data and culturally grounded evaluation criteria are vital for developing responsible AI systems. In this study, we introduce the EPT (Evaluation of Persian Trustworthiness) metric, a culturally informed benchmark specifically designed to assess the trustworthiness of LLMs across six key aspects: truthfulness, safety, fairness, robustness, privacy, and ethical alignment. We curated a labeled dataset and evaluated the performance of several leading models - including ChatGPT, Claude, DeepSeek, Gemini, Grok, LLaMA, Mistral, and Qwen - using both automated LLM-based and human assessments. Our results reveal significant deficiencies in the safety dimension, underscoring the urgent need for focused attention on this critical aspect of model behavior. Furthermore, our findings offer valuable insights into the alignment of these models with Persian ethical-cultural values and highlight critical gaps and opportunities for advancing trustworthy and culturally responsible AI. The dataset is publicly available at: this https URL.
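Summary: Large language models (LLMs), trained on extensive datasets using advanced deep learning architectures, have shown remarkable performance across a wide range of language tasks and have become a cornerstone of modern AI technologies. However, ensuring their trustworthiness remains a critical challenge, because reliability is essential not only for accurate performance but also for upholding ethical, cultural, and social values. Careful alignment of training data and culturally grounded evaluation criteria are vital for developing responsible AI systems. In this study, we introduce the EPT (Evaluation of Persian Trustworthiness) metric, a culturally informed benchmark designed to assess the trustworthiness of LLMs across six key aspects: truthfulness, safety, fairness, robustness, privacy, and ethical alignment. We curated a labeled dataset and evaluated several leading models, including ChatGPT, Claude, DeepSeek, Gemini, Grok, LLaMA, Mistral, and Qwen, using both automated LLM-based and human assessments. Our results reveal significant deficiencies in the safety dimension, underscoring the urgent need for focused attention on this critical aspect of model behavior. Our findings also offer valuable insights into how well these models align with Persian ethical-cultural values, and they highlight critical gaps and opportunities for advancing trustworthy and culturally responsible AI. The dataset is publicly available at: this https URL.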

Title: The Majority is not always right: RL training for solution aggregation

Authors: Wenting Zhao, Pranjal Aggarwal, Swarnadeep Saha, Asli Celikyilmaz, Jason Weston, Ilia Kulikov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.06870
Pdf URL: https://arxiv.org/pdf/2509.06870
Copy Paste: [[2509.06870]] The Majority is not always right: RL training for solution aggregation(https://arxiv.org/abs/2509.06870)
Keywords: language model, llm
Abstract: Scaling up test-time compute, by generating multiple independent solutions and selecting or aggregating among them, has become a central paradigm for improving large language models (LLMs) on challenging reasoning tasks. While most prior work relies on simple majority voting or reward model ranking to aggregate solutions, these approaches may only yield limited benefits. In this work, we propose to learn aggregation as an explicit reasoning skill: given a set of candidate solutions, we train an aggregator model to review, reconcile, and synthesize a final, correct answer using reinforcement learning from verifiable rewards. A key ingredient is careful balancing of easy and hard training examples, allowing the model to learn both to recover minority-but-correct answers as well as easy majority-correct answers. Empirically, we find our method, AggLM, outperforms both strong rule-based and reward-model baselines, across multiple benchmarks. Furthermore, it generalizes effectively to solutions from differing models, including stronger ones than contained in the training data, all while requiring substantially fewer tokens than majority voting with larger numbers of solutions.
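Summary: Scaling up test-time compute, by generating multiple independent solutions and selecting or aggregating among them, has become a central paradigm for improving large language models (LLMs) on challenging reasoning tasks. While most prior work relies on simple majority voting or reward model ranking to aggregate solutions, these approaches may yield only limited benefits. In this work, we propose to learn aggregation as an explicit reasoning skill: given a set of candidate solutions, we train an aggregator model to review, reconcile, and synthesize a final, correct answer using reinforcement learning from verifiable rewards. A key ingredient is careful balancing of easy and hard training examples, which lets the model learn both to recover minority-but-correct answers and to handle easy majority-correct answers. Empirically, our method, AggLM, outperforms both strong rule-based and reward-model baselines across multiple benchmarks. It also generalizes effectively to solutions from different models, including models stronger than those in the training data, while requiring substantially fewer tokens than majority voting with larger numbers of solutions.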
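Illustrative sketch of the baseline only: the abstract contrasts AggLM with simple majority voting. The Python below shows what that baseline does and why it cannot recover a minority-but-correct answer. It does not implement the trained aggregator; the function name and the toy answers are hypothetical.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent final answer among the candidate solutions."""
    if not answers:
        raise ValueError("need at least one candidate answer")
    return Counter(answers).most_common(1)[0][0]

# Toy candidates (hypothetical): suppose the correct answer is "15".
# Majority voting picks "12" because it counts answers and never reviews
# the reasoning that produced them.
candidates = ["12", "12", "15", "12", "15"]
voted = majority_vote(candidates)
```

A learned aggregator, as described in the abstract, would instead read the candidate solutions themselves and could in principle side with the minority when its reasoning holds up.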

Title: UNH at CheckThat! 2025: Fine-tuning Vs Prompting in Claim Extraction

Authors: Joe Wilder, Nikhil Kadapala, Benji Xu, Mohammed Alsaadi, Aiden Parsons, Mitchell Rogers, Palash Agarwal, Adam Hassick, Laura Dietz
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2509.06883
Pdf URL: https://arxiv.org/pdf/2509.06883
Copy Paste: [[2509.06883]] UNH at CheckThat! 2025: Fine-tuning Vs Prompting in Claim Extraction(https://arxiv.org/abs/2509.06883)
Keywords: llm, prompt
Abstract: We participate in CheckThat! Task 2 English and explore various methods of prompting and in-context learning, including few-shot prompting and fine-tuning with different LLM families, with the goal of extracting check-worthy claims from social media passages. Our best METEOR score is achieved by fine-tuning a FLAN-T5 model. However, we observe that higher-quality claims can sometimes be extracted using other methods, even when their METEOR scores are lower.
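Summary: We participate in CheckThat! Task 2 English and explore various methods of prompting and in-context learning, including few-shot prompting and fine-tuning with different LLM families, with the goal of extracting check-worthy claims from social media passages. Our best METEOR score is achieved by fine-tuning a FLAN-T5 model. However, we observe that other methods can sometimes extract higher-quality claims even when their METEOR scores are lower.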

Title: mmBERT: A Modern Multilingual Encoder with Annealed Language Learning

Authors: Marc Marone, Orion Weller, William Fleshman, Eugene Yang, Dawn Lawrie, Benjamin Van Durme
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2509.06888
Pdf URL: https://arxiv.org/pdf/2509.06888
Copy Paste: [[2509.06888]] mmBERT: A Modern Multilingual Encoder with Annealed Language Learning(https://arxiv.org/abs/2509.06888)
Keywords: language model
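Abstract: Encoder-only language models are frequently used for a variety of standard machine learning tasks, including classification and retrieval. However, there has been a lack of recent research for encoder models, especially with respect to multilingual models. We introduce mmBERT, an encoder-only language model pretrained on 3T tokens of multilingual text in over 1800 languages. To build mmBERT we introduce several novel elements, including an inverse mask ratio schedule and an inverse temperature sampling ratio. We add over 1700 low-resource languages to the data mix only during the decay phase, showing that it boosts performance dramatically and maximizes the gains from the relatively small amount of training data. Despite only including these low-resource languages in the short decay phase we achieve similar classification performance to models like OpenAI's o3 and Google's Gemini 2.5 Pro. Overall, we show that mmBERT significantly outperforms the previous generation of models on classification and retrieval tasks -- on both high and low-resource languages.
Summary: Encoder-only language models are frequently used for a variety of standard machine learning tasks, including classification and retrieval. However, recent research on encoder models has been lacking, especially for multilingual models. We introduce mmBERT, an encoder-only language model pretrained on 3T tokens of multilingual text in over 1800 languages. To build mmBERT we introduce several new elements, including an inverse mask ratio schedule and an inverse temperature sampling ratio. We add over 1700 low-resource languages to the data mix only during the decay phase, and show that this boosts performance dramatically and maximizes the gains from the relatively small amount of training data. Although these low-resource languages are included only in the short decay phase, we reach classification performance similar to models such as OpenAI's o3 and Google's Gemini 2.5 Pro. Overall, we show that mmBERT significantly outperforms the previous generation of models on classification and retrieval tasks, on both high- and low-resource languages.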

Title: Proof-Carrying Numbers (PCN): A Protocol for Trustworthy Numeric Answers from LLMs via Claim Verification

Authors: Aivin V. Solatorio
Subjects: cs.CL, cs.CR, cs.DB, cs.LG
Abstract URL: https://arxiv.org/abs/2509.06902
Pdf URL: https://arxiv.org/pdf/2509.06902
Copy Paste: [[2509.06902]] Proof-Carrying Numbers (PCN): A Protocol for Trustworthy Numeric Answers from LLMs via Claim Verification(https://arxiv.org/abs/2509.06902)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Large Language Models (LLMs) as stochastic systems may generate numbers that deviate from available data, a failure known as numeric hallucination. Existing safeguards -- retrieval-augmented generation, citations, and uncertainty estimation -- improve transparency but cannot guarantee fidelity: fabricated or misquoted values may still be displayed as if correct. We propose Proof-Carrying Numbers (PCN), a presentation-layer protocol that enforces numeric fidelity through mechanical verification. Under PCN, numeric spans are emitted as claim-bound tokens tied to structured claims, and a verifier checks each token under a declared policy (e.g., exact equality, rounding, aliases, or tolerance with qualifiers). Crucially, PCN places verification in the renderer, not the model: only claim-checked numbers are marked as verified, and all others default to unverified. This separation prevents spoofing and guarantees fail-closed behavior. We formalize PCN and prove soundness, completeness under honest tokens, fail-closed behavior, and monotonicity under policy refinement. PCN is lightweight and model-agnostic, integrates seamlessly into existing applications, and can be extended with cryptographic commitments. By enforcing verification as a mandatory step before display, PCN establishes a simple contract for numerically sensitive settings: trust is earned only by proof, while the absence of a mark communicates uncertainty.
Summary: Large language models (LLMs), as stochastic systems, may generate numbers that deviate from the available data, a failure known as numeric hallucination. Existing safeguards (retrieval-augmented generation, citations, and uncertainty estimation) improve transparency but cannot guarantee fidelity: fabricated or misquoted values may still be displayed as if they were correct. We propose Proof-Carrying Numbers (PCN), a presentation-layer protocol that enforces numeric fidelity through mechanical verification. Under PCN, numeric spans are emitted as claim-bound tokens tied to structured claims, and a verifier checks each token under a declared policy (e.g., exact equality, rounding, aliases, or tolerance with qualifiers). Crucially, PCN places verification in the renderer, not the model: only claim-checked numbers are marked as verified, and all others default to unverified. This separation prevents spoofing and guarantees fail-closed behavior. We formalize PCN and prove soundness, completeness under honest tokens, fail-closed behavior, and monotonicity under policy refinement. PCN is lightweight and model-agnostic, integrates seamlessly into existing applications, and can be extended with cryptographic commitments. By making verification a mandatory step before display, PCN establishes a simple contract for numerically sensitive settings: trust is earned only by proof, while the absence of a mark communicates uncertainty.
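Illustrative sketch, not the PCN specification: the abstract describes a renderer-side verifier that checks each claim-bound number under a declared policy and defaults to unverified otherwise. The Python below is a minimal reading of that idea. The `Policy` fields, the policy names, the claim store layout, and the `[verified]` / `[unverified]` marks are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Policy:
    kind: str = "exact"       # assumed names: "exact", "round", "tolerance"
    digits: int = 0           # decimal places, used by "round"
    tolerance: float = 0.0    # absolute tolerance, used by "tolerance"

def check(displayed, claimed, policy):
    """Return True only if the displayed number satisfies the declared policy."""
    if policy.kind == "exact":
        return displayed == claimed
    if policy.kind == "round":
        return displayed == round(claimed, policy.digits)
    if policy.kind == "tolerance":
        return abs(displayed - claimed) <= policy.tolerance
    return False  # unknown policy: fail closed

def render(displayed, claim_id, claims, policy):
    """Mark a number as verified only when its bound claim checks out."""
    claimed = claims.get(claim_id) if claim_id is not None else None
    ok = claimed is not None and check(displayed, claimed, policy)
    return f"{displayed} [{'verified' if ok else 'unverified'}]"

# Hypothetical claim store and tokens.
claims = {"claim-1": 3.14159}
bound = render(3.14, "claim-1", claims, Policy("round", digits=2))
unbound = render(3.2, None, claims, Policy("exact"))
```

The design point carried over from the abstract is that the mark is decided in `render`, outside the model: a number with no claim, a missing claim, or an unrecognized policy is shown as unverified.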

Title: Beyond Two-Stage Training: Cooperative SFT and RL for LLM Reasoning

Authors: Liang Chen, Xueting Han, Li Shen, Jing Bai, Kam-Fai Wong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.06948
Pdf URL: https://arxiv.org/pdf/2509.06948
Copy Paste: [[2509.06948]] Beyond Two-Stage Training: Cooperative SFT and RL for LLM Reasoning(https://arxiv.org/abs/2509.06948)
Keywords: language model, llm
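Abstract: Reinforcement learning (RL) has proven effective in incentivizing the reasoning abilities of large language models (LLMs), but suffers from severe efficiency challenges due to its trial-and-error nature. While the common practice employs supervised fine-tuning (SFT) as a warm-up stage for RL, this decoupled two-stage approach limits interaction between SFT and RL, thereby constraining overall effectiveness. This study introduces a novel method for learning reasoning models that employs bilevel optimization to facilitate better cooperation between these training paradigms. By conditioning the SFT objective on the optimal RL policy, our approach enables SFT to meta-learn how to guide RL's optimization process. During training, the lower level performs RL updates while simultaneously receiving SFT supervision, and the upper level explicitly maximizes the cooperative gain, i.e., the performance advantage of joint SFT-RL training over RL alone. Empirical evaluations on five reasoning benchmarks demonstrate that our method consistently outperforms baselines and achieves a better balance between effectiveness and efficiency.
Summary: Reinforcement learning (RL) has proven effective at incentivizing the reasoning abilities of large language models (LLMs), but its trial-and-error nature causes severe efficiency problems. Common practice uses supervised fine-tuning (SFT) as a warm-up stage for RL, but this decoupled two-stage approach limits the interaction between SFT and RL and so constrains overall effectiveness. This study introduces a new method for learning reasoning models that uses bilevel optimization to improve cooperation between the two training paradigms. By conditioning the SFT objective on the optimal RL policy, our approach lets SFT meta-learn how to guide RL's optimization process. During training, the lower level performs RL updates while simultaneously receiving SFT supervision, and the upper level explicitly maximizes the cooperative gain, i.e., the performance advantage of joint SFT-RL training over RL alone. Empirical evaluations on five reasoning benchmarks show that our method consistently outperforms baselines and strikes a better balance between effectiveness and efficiency.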

Title: Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models

Authors: Yinjie Wang, Ling Yang, Bowen Li, Ye Tian, Ke Shen, Mengdi Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.06949
Pdf URL: https://arxiv.org/pdf/2509.06949
Copy Paste: [[2509.06949]] Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models(https://arxiv.org/abs/2509.06949)
Keywords: language model, llm
Abstract: We propose TraceRL, a trajectory-aware reinforcement learning framework for diffusion language models (DLMs) that incorporates preferred inference trajectory into post-training, and is applicable across different architectures. Equipped with a diffusion-based value model that enhances training stability, we demonstrate improved reasoning performance on complex math and coding tasks. Besides, it can also be applied to adapt block-specific models to larger blocks, which improves sampling flexibility. Employing TraceRL, we derive a series of state-of-the-art diffusion language models, namely TraDo. Although smaller than 7B-scale AR models, TraDo-4B-Instruct still consistently outperforms them across complex math reasoning tasks. TraDo-8B-Instruct achieves relative accuracy improvements of 6.1% over Qwen2.5-7B-Instruct and 51.3% over Llama3.1-8B-Instruct on mathematical reasoning benchmarks. Through curriculum learning, we also derive the first long-CoT DLM, outperforming Qwen2.5-7B-Instruct on MATH500 with an 18.1% relative accuracy gain. To facilitate reproducible research and practical applications, we release a comprehensive open-source framework for building, training, and deploying diffusion LLMs across diverse architectures. The framework integrates accelerated KV-cache techniques and inference engines for both inference and reinforcement learning, and includes implementations of various supervised fine-tuning and RL methods for mathematics, coding, and general tasks. Code and Models: this https URL
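Summary: We propose TraceRL, a trajectory-aware reinforcement learning framework for diffusion language models (DLMs) that incorporates the preferred inference trajectory into post-training and applies across different architectures. Equipped with a diffusion-based value model that improves training stability, we demonstrate better reasoning performance on complex math and coding tasks. It can also be used to adapt block-specific models to larger blocks, which improves sampling flexibility. Using TraceRL, we derive a series of state-of-the-art diffusion language models, namely TraDo. Although smaller than 7B-scale AR models, TraDo-4B-Instruct still consistently outperforms them on complex math reasoning tasks. TraDo-8B-Instruct achieves relative accuracy improvements of 6.1% over Qwen2.5-7B-Instruct and 51.3% over Llama3.1-8B-Instruct on mathematical reasoning benchmarks. Through curriculum learning, we also derive the first long-CoT DLM, which outperforms Qwen2.5-7B-Instruct on MATH500 with an 18.1% relative accuracy gain. To support reproducible research and practical applications, we release a comprehensive open-source framework for building, training, and deploying diffusion LLMs across diverse architectures. The framework integrates accelerated KV-cache techniques and inference engines for both inference and reinforcement learning, and includes implementations of various supervised fine-tuning and RL methods for mathematics, coding, and general tasks. Code and Models: this https URL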

Title: On the Same Wavelength? Evaluating Pragmatic Reasoning in Language Models across Broad Concepts

Authors: Linlu Qiu, Cedegao E. Zhang, Joshua B. Tenenbaum, Yoon Kim, Roger P. Levy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.06952
Pdf URL: https://arxiv.org/pdf/2509.06952
Copy Paste: [[2509.06952]] On the Same Wavelength? Evaluating Pragmatic Reasoning in Language Models across Broad Concepts(https://arxiv.org/abs/2509.06952)
Keywords: language model, prompt, chain-of-thought, agent
Abstract: Language use is shaped by pragmatics -- i.e., reasoning about communicative goals and norms in context. As language models (LMs) are increasingly used as conversational agents, it becomes ever more important to understand their pragmatic reasoning abilities. We propose an evaluation framework derived from Wavelength, a popular communication game where a speaker and a listener communicate about a broad range of concepts in a granular manner. We study a range of LMs on both language comprehension and language production using direct and Chain-of-Thought (CoT) prompting, and further explore a Rational Speech Act (RSA) approach to incorporating Bayesian pragmatic reasoning into LM inference. We find that state-of-the-art LMs, but not smaller ones, achieve strong performance on language comprehension, obtaining similar-to-human accuracy and exhibiting high correlations with human judgments even without CoT prompting or RSA. On language production, CoT can outperform direct prompting, and using RSA provides significant improvements over both approaches. Our study helps identify the strengths and limitations in LMs' pragmatic reasoning abilities and demonstrates the potential for improving them with RSA, opening up future avenues for understanding conceptual representation, language understanding, and social reasoning in LMs and humans.
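Summary: Language use is shaped by pragmatics, i.e., reasoning about communicative goals and norms in context. As language models (LMs) are increasingly used as conversational agents, understanding their pragmatic reasoning abilities becomes ever more important. We propose an evaluation framework derived from Wavelength, a popular communication game in which a speaker and a listener communicate about a broad range of concepts in a granular manner. We study a range of LMs on both language comprehension and language production using direct and Chain-of-Thought (CoT) prompting, and further explore a Rational Speech Act (RSA) approach to incorporating Bayesian pragmatic reasoning into LM inference. We find that state-of-the-art LMs, but not smaller ones, perform strongly on language comprehension, reaching accuracy similar to humans and correlating highly with human judgments even without CoT prompting or RSA. On language production, CoT can outperform direct prompting, and RSA provides significant improvements over both approaches. Our study helps identify the strengths and limitations of LMs' pragmatic reasoning abilities and demonstrates the potential for improving them with RSA, opening future avenues for understanding conceptual representation, language understanding, and social reasoning in LMs and humans.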
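Illustrative sketch of the general technique, not the paper's setup: the abstract applies a Rational Speech Act (RSA) approach to LM inference. The Python below shows the standard RSA recursion (literal listener, pragmatic speaker, pragmatic listener) on a two-object reference game. The lexicon, the uniform prior, and the rationality parameter `alpha` are toy assumptions; in the paper's setting an LM would presumably supply the underlying probabilities.

```python
def normalize(weights):
    """Scale non-negative weights to sum to 1 (all zeros stay zero)."""
    total = sum(weights.values())
    return {k: (v / total if total > 0 else 0.0) for k, v in weights.items()}

def literal_listener(lexicon, prior):
    """L0(m | u) is proportional to truth(u, m) * prior(m)."""
    return {u: normalize({m: truth[m] * prior[m] for m in prior})
            for u, truth in lexicon.items()}

def pragmatic_speaker(l0, meanings, alpha=1.0):
    """S1(u | m) is proportional to L0(m | u) ** alpha."""
    return {m: normalize({u: l0[u][m] ** alpha for u in l0}) for m in meanings}

def pragmatic_listener(s1, prior, utterances):
    """L1(m | u) is proportional to S1(u | m) * prior(m)."""
    return {u: normalize({m: s1[m][u] * prior[m] for m in prior})
            for u in utterances}

# Toy reference game: "glasses" is true of both objects, "hat" of only one.
lexicon = {
    "glasses": {"glasses_only": 1.0, "hat_and_glasses": 1.0},
    "hat": {"glasses_only": 0.0, "hat_and_glasses": 1.0},
}
prior = {"glasses_only": 0.5, "hat_and_glasses": 0.5}

l0 = literal_listener(lexicon, prior)
s1 = pragmatic_speaker(l0, list(prior))
l1 = pragmatic_listener(s1, prior, list(lexicon))
```

The literal listener splits "glasses" evenly between the two objects, while the pragmatic listener favors the object with glasses only, reasoning that a speaker who meant the other object would more likely have said "hat".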