2025-09-08

Title: INSEva: A Comprehensive Chinese Benchmark for Large Language Models in Insurance

Authors: Shisong Chen, Qian Zhu, Wenyan Yang, Chengyi Yang, Zhong Wang, Ping Wang, Xuan Lin, Bo Xu, Daqian Li, Chao Yuan, Licai Qi, Wanqing Xu, sun zhenxing, Xin Lu, Shiqiang Xiong, Chao Chen, Haixiang Hu, Yanghua Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.04455
Pdf URL: https://arxiv.org/pdf/2509.04455
Copy Paste: [[2509.04455]] INSEva: A Comprehensive Chinese Benchmark for Large Language Models in Insurance(https://arxiv.org/abs/2509.04455)
Keywords: language model, llm
Abstract: Insurance, as a critical component of the global financial system, demands high standards of accuracy and reliability in AI applications. While existing benchmarks evaluate AI capabilities across various domains, they often fail to capture the unique characteristics and requirements of the insurance domain. To address this gap, we present INSEva, a comprehensive Chinese benchmark specifically designed for evaluating AI systems' knowledge and capabilities in insurance. INSEva features a multi-dimensional evaluation taxonomy covering business areas, task formats, difficulty levels, and cognitive-knowledge dimensions, comprising 38,704 high-quality evaluation examples sourced from authoritative materials. Our benchmark implements tailored evaluation methods for assessing both faithfulness and completeness in open-ended responses. Through extensive evaluation of 8 state-of-the-art Large Language Models (LLMs), we identify significant performance variations across different dimensions. While general LLMs demonstrate basic insurance domain competency with average scores above 80, substantial gaps remain in handling complex, real-world insurance scenarios. The benchmark will be made public soon.

Title: Mentalic Net: Development of RAG-based Conversational AI and Evaluation Framework for Mental Health Support

Authors: Anandi Dutta, Shivani Mruthyunjaya, Jessica Saddington, Kazi Sifatul Islam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.04456
Pdf URL: https://arxiv.org/pdf/2509.04456
Copy Paste: [[2509.04456]] Mentalic Net: Development of RAG-based Conversational AI and Evaluation Framework for Mental Health Support(https://arxiv.org/abs/2509.04456)
Keywords: language model, llm, prompt, chat, retrieval-augmented generation
Abstract: The emergence of large language models (LLMs) has unlocked boundless possibilities, along with significant challenges. In response, we developed a mental health support chatbot designed to augment professional healthcare, with a strong emphasis on safe and meaningful application. Our approach involved rigorous evaluation, covering accuracy, empathy, trustworthiness, privacy, and bias. We employed a retrieval-augmented generation (RAG) framework, integrated prompt engineering, and fine-tuned a pre-trained model on novel datasets. The resulting system, Mentalic Net Conversational AI, achieved a BERT Score of 0.898, with other evaluation metrics falling within satisfactory ranges. We advocate for a human-in-the-loop approach and a long-term, responsible strategy in developing such transformative technologies, recognizing both their potential to change lives and the risks they may pose if not carefully managed.

Title: Do MLLMs Really Understand the Charts?

Authors: Xiao Zhang, Dongyuan Li, Liuyu Xiang, Yao Zhang, Cheng Zhong, Zhaofeng He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.04457
Pdf URL: https://arxiv.org/pdf/2509.04457
Copy Paste: [[2509.04457]] Do MLLMs Really Understand the Charts?(https://arxiv.org/abs/2509.04457)
Keywords: language model, gpt, llm, hallucination
Abstract: Although Multimodal Large Language Models (MLLMs) have demonstrated increasingly impressive performance in chart understanding, most of them exhibit alarming hallucinations and significant performance degradation when handling non-annotated charts. Therefore, a question arises: Do MLLMs really understand the charts? Since a human is capable of understanding charts and estimating the values by visual reasoning, we first carefully establish a comprehensive Chart Reasoning Benchmark, CRBench, to rigorously evaluate the visual reasoning abilities of MLLMs on non-annotated charts. We argue that MLLMs primarily rely on recognition rather than reasoning to interpret the charts. To steer MLLMs toward reasonable chart understanding, we propose ChartReasoner, which mimics human behavior by grounding its estimation in chart understanding. Extensive results on the proposed CRBench show that ChartReasoner-3B/7B achieves superior performance in chart reasoning, even compared to GPT-4o and Gemini-2.5-Flash. More importantly, ChartReasoner also demonstrates visual reasoning abilities in general chart comprehension on public benchmarks, leading to significant performance gains and enabling MLLMs to rationally understand the charts. The code and dataset will be publicly available upon publication.

Title: Predicting Failures of LLMs to Link Biomedical Ontology Terms to Identifiers Evidence Across Models and Ontologies

Authors: Daniel B. Hier, Steven Keith Platt, Tayo Obafemi-Ajayi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.04458
Pdf URL: https://arxiv.org/pdf/2509.04458
Copy Paste: [[2509.04458]] Predicting Failures of LLMs to Link Biomedical Ontology Terms to Identifiers Evidence Across Models and Ontologies(https://arxiv.org/abs/2509.04458)
Keywords: language model, gpt, llm
Abstract: Large language models often perform well on biomedical NLP tasks but may fail to link ontology terms to their correct identifiers. We investigate why these failures occur by analyzing predictions across two major ontologies, Human Phenotype Ontology and Gene Ontology, and two high-performing models, GPT-4o and LLaMa 3.1 405B. We evaluate nine candidate features related to term familiarity, identifier usage, morphology, and ontology structure. Univariate and multivariate analyses show that exposure to ontology identifiers is the strongest predictor of linking success.

Title: Uncertainty-Aware Collaborative System of Large and Small Models for Multimodal Sentiment Analysis

Authors: Shiqin Han, Manning Gao, Menghua Jiang, Yuncheng Jiang, Haifeng Hu, Sijie Mai
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.04459
Pdf URL: https://arxiv.org/pdf/2509.04459
Copy Paste: [[2509.04459]] Uncertainty-Aware Collaborative System of Large and Small Models for Multimodal Sentiment Analysis(https://arxiv.org/abs/2509.04459)
Keywords: language model, llm, prompt
Abstract: The advent of Multimodal Large Language Models (MLLMs) has significantly advanced the state-of-the-art in multimodal machine learning, yet their substantial computational demands present a critical barrier to real-world deployment. Conversely, smaller, specialized models offer high efficiency but often at the cost of performance. To reconcile this performance-efficiency trade-off, we propose a novel Uncertainty-Aware Collaborative System (U-ACS) that synergistically orchestrates a powerful MLLM (e.g., HumanOmni) and a lightweight baseline model for multimodal sentiment analysis. The core of our system is an uncertainty-driven cascade mechanism, where the efficient small model first acts as a rapid filter for all input samples. Only those samples yielding high predictive uncertainty, thereby indicating greater difficulty, are selectively escalated to the MLLM for more sophisticated analysis. Furthermore, our system introduces advanced strategies to handle ambiguous or conflicting predictions, including weighted averaging for predictions of similar polarity and a prompt-based cross-verification to resolve conflicting predictions when both models exhibit high uncertainty. This sample-difficulty-aware approach allows for a dynamic allocation of computational resources, drastically reducing inference costs while retaining the high accuracy of MLLM. Extensive experiments on benchmark datasets demonstrate that our proposed method achieves state-of-the-art performance, while requiring only a fraction of the computational resources compared to using a standalone MLLM.
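
Sketch: the uncertainty-driven cascade described in this abstract can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the entropy-based uncertainty measure, the `threshold` value, the confidence-weighted fusion rule, and the function names are all hypothetical, and the prompt-based cross-verification step is replaced here by a simple fallback to the large model.

```python
import math
from typing import Callable, Sequence, Tuple

# A model maps an input to class probabilities (at least two classes).
Model = Callable[[object], Sequence[float]]


def normalized_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy scaled to [0, 1]; higher means more uncertain."""
    h = sum(p * math.log(1.0 / p) for p in probs if p > 0)
    return h / math.log(len(probs))


def argmax(values: Sequence[float]) -> int:
    return max(range(len(values)), key=lambda i: values[i])


def cascade_predict(x, small: Model, large: Model,
                    threshold: float = 0.5) -> Tuple[int, str]:
    """Return (predicted class, route taken) for one sample."""
    p_small = small(x)
    u_small = normalized_entropy(p_small)
    if u_small <= threshold:
        # The small model is confident enough: never call the large model.
        return argmax(p_small), "small"

    p_large = large(x)
    if argmax(p_small) == argmax(p_large):
        # Both models agree: fuse with confidence weights (assumed rule).
        w_small = max(0.0, 1.0 - u_small)
        w_large = max(0.0, 1.0 - normalized_entropy(p_large))
        total = w_small + w_large
        if total <= 0.0:
            w_small, w_large, total = 0.5, 0.5, 1.0
        fused = [(w_small * a + w_large * b) / total
                 for a, b in zip(p_small, p_large)]
        return argmax(fused), "fused"

    # Conflict: the abstract resolves this with prompt-based
    # cross-verification; this sketch simply trusts the large model.
    return argmax(p_large), "large"
```

Only samples whose small-model entropy exceeds the threshold incur a large-model call, which is where the compute saving comes from.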

Title: CoCoNUTS: Concentrating on Content while Neglecting Uninformative Textual Styles for AI-Generated Peer Review Detection

Authors: Yihan Chen, Jiawei Chen, Guozhao Mo, Xuanang Chen, Ben He, Xianpei Han, Le Sun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04460
Pdf URL: https://arxiv.org/pdf/2509.04460
Copy Paste: [[2509.04460]] CoCoNUTS: Concentrating on Content while Neglecting Uninformative Textual Styles for AI-Generated Peer Review Detection(https://arxiv.org/abs/2509.04460)
Keywords: language model, llm
Abstract: The growing integration of large language models (LLMs) into the peer review process presents potential risks to the fairness and reliability of scholarly evaluation. While LLMs offer valuable assistance for reviewers with language refinement, there is growing concern over their use to generate substantive review content. Existing general AI-generated text detectors are vulnerable to paraphrasing attacks and struggle to distinguish between surface language refinement and substantial content generation, suggesting that they primarily rely on stylistic cues. When applied to peer review, this limitation can result in unfairly suspecting reviews with permissible AI-assisted language enhancement, while failing to catch deceptively humanized AI-generated reviews. To address this, we propose a paradigm shift from style-based to content-based detection. Specifically, we introduce CoCoNUTS, a content-oriented benchmark built upon a fine-grained dataset of AI-generated peer reviews, covering six distinct modes of human-AI collaboration. Furthermore, we develop CoCoDet, an AI review detector via a multi-task learning framework, designed to achieve more accurate and robust detection of AI involvement in review content. Our work offers a practical foundation for evaluating the use of LLMs in peer review, and contributes to the development of more precise, equitable, and reliable detection methods for real-world scholarly applications. Our code and data will be publicly available.

Title: From Post To Personality: Harnessing LLMs for MBTI Prediction in Social Media

Authors: Tian Ma, Kaiyu Feng, Yu Rong, Kangfei Zhao
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2509.04461
Pdf URL: https://arxiv.org/pdf/2509.04461
Copy Paste: [[2509.04461]] From Post To Personality: Harnessing LLMs for MBTI Prediction in Social Media(https://arxiv.org/abs/2509.04461)
Keywords: language model, llm, hallucination, retrieval augmented generation
Abstract: Personality prediction from social media posts is a critical task with diverse applications in psychology and sociology. The Myers-Briggs Type Indicator (MBTI), a popular personality inventory, has traditionally been predicted with machine learning (ML) and deep learning (DL) techniques. Recently, the success of Large Language Models (LLMs) has revealed their huge potential in understanding and inferring personality traits from social media content. However, directly exploiting LLMs for MBTI prediction faces two key challenges: the hallucination problem inherent in LLMs and the naturally imbalanced distribution of MBTI types in the population. In this paper, we propose PostToPersonality (PtoP), a novel LLM-based framework for MBTI prediction from individuals' social media posts. Specifically, PtoP leverages Retrieval-Augmented Generation with in-context learning to mitigate hallucination in LLMs. Furthermore, we fine-tune a pretrained LLM to improve model specification in MBTI understanding with synthetic minority oversampling, which addresses class imbalance by generating synthetic samples. Experiments conducted on a real-world social media dataset demonstrate that PtoP achieves state-of-the-art performance compared with 10 ML and DL baselines.
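
Sketch: synthetic minority oversampling, as mentioned in this abstract, is commonly done in the SMOTE style, where each synthetic sample is an interpolation between a minority-class point and one of its nearest minority-class neighbours. The following is a minimal pure-Python sketch of that general idea. How PtoP applies oversampling (for example, on which representation of the posts) is not stated in the abstract, so the feature tuples, `k`, and function names here are assumptions for illustration only.

```python
import random
from typing import List, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_like(minority: List[Point], n_new: int, k: int = 2,
               seed: int = 0) -> List[Point]:
    """Generate n_new synthetic minority samples by interpolation.

    Requires at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (excluding x itself).
        others = [p for p in minority if p is not x]
        neighbours = sorted(others, key=lambda p: squared_distance(x, p))[:k]
        nb = rng.choice(neighbours)
        lam = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, nb)))
    return synthetic
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies rather than being exact duplicates.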

Title: Benchmarking GPT-5 for biomedical natural language processing

Authors: Yu Hou, Zaifu Zhan, Rui Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04462
Pdf URL: https://arxiv.org/pdf/2509.04462
Copy Paste: [[2509.04462]] Benchmarking GPT-5 for biomedical natural language processing(https://arxiv.org/abs/2509.04462)
Keywords: gpt, prompt
Abstract: The rapid expansion of biomedical literature has heightened the need for scalable natural language processing (NLP) solutions. While GPT-4 substantially narrowed the gap with task-specific systems, especially in question answering, its performance across other domains remained uneven. We updated a standardized BioNLP benchmark to evaluate GPT-5 and GPT-4o under zero-, one-, and five-shot prompting across 12 datasets spanning six task families: named entity recognition, relation extraction, multi-label document classification, question answering, text summarization, and text simplification. Using fixed prompt templates, identical decoding parameters, and batch inference, we report primary metrics per dataset and include prior results for GPT-4, GPT-3.5, and LLaMA-2-13B for comparison. GPT-5 achieved the strongest overall benchmark performance, with macro-average scores rising to 0.557 under five-shot prompting versus 0.506 for GPT-4 and 0.508 for GPT-4o. On MedQA, GPT-5 reached 94.1% accuracy, exceeding the previous supervised state of the art by over fifty points, and attained parity with supervised systems on PubMedQA (0.734). In extraction tasks, GPT-5 delivered major gains in chemical NER (0.886 F1) and ChemProt relation extraction (0.616 F1), outperforming GPT-4 and GPT-4o, though summarization and disease NER still lagged behind domain-specific baselines. These results establish GPT-5 as a general-purpose model now offering deployment-ready performance for reasoning-oriented biomedical QA, while precision-critical extraction and evidence-dense summarization continue to favor fine-tuned or hybrid approaches. The benchmark delineates where simple prompting suffices and where retrieval-augmented or planning-based scaffolds are likely required, providing actionable guidance for BioNLP system design as frontier models advance.

Title: Can Multiple Responses from an LLM Reveal the Sources of Its Uncertainty?

Authors: Yang Nan, Pengfei He, Ravi Tandon, Han Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04464
Pdf URL: https://arxiv.org/pdf/2509.04464
Copy Paste: [[2509.04464]] Can Multiple Responses from an LLM Reveal the Sources of Its Uncertainty?(https://arxiv.org/abs/2509.04464)
Keywords: language model, llm
Abstract: Large language models (LLMs) have delivered significant breakthroughs across diverse domains but can still produce unreliable or misleading outputs, posing critical challenges for real-world applications. While many recent studies focus on quantifying model uncertainty, relatively little work has been devoted to diagnosing the source of uncertainty. In this study, we show that, when an LLM is uncertain, the patterns of disagreement among its multiple generated responses contain rich clues about the underlying cause of uncertainty. To illustrate this point, we collect multiple responses from a target LLM and employ an auxiliary LLM to analyze their patterns of disagreement. The auxiliary model is tasked with reasoning about the likely source of uncertainty, such as whether it stems from ambiguity in the input question, a lack of relevant knowledge, or both. In cases involving knowledge gaps, the auxiliary model also identifies the specific missing facts or concepts contributing to the uncertainty. In our experiment, we validate our framework on AmbigQA, OpenBookQA, and MMLU-Pro, confirming its generality in diagnosing distinct uncertainty sources. Such diagnosis shows the potential for relevant manual interventions that improve LLM performance and reliability.
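
Sketch: the pipeline outlined in this abstract (sample several responses, then ask an auxiliary model to interpret their disagreement) might be prototyped along these lines. Everything below is an assumption for illustration: the entropy-based disagreement measure, the function names, and the prompt wording are hypothetical and are not taken from the paper, and no real LLM API is called.

```python
import math
from collections import Counter
from typing import List


def answer_entropy(answers: List[str]) -> float:
    """Entropy (in bits) of the answer distribution; 0.0 means full agreement."""
    counts = Counter(a.strip().lower() for a in answers)
    n = len(answers)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def build_diagnosis_prompt(question: str, answers: List[str]) -> str:
    """Format a hypothetical prompt for an auxiliary model."""
    listed = "\n".join(f"{i + 1}. {a}" for i, a in enumerate(answers))
    return (
        "A model answered the same question several times.\n"
        f"Question: {question}\n"
        f"Responses:\n{listed}\n"
        "Do the disagreements suggest that the question is ambiguous, "
        "that the model lacks relevant knowledge, or both? "
        "If knowledge is missing, name the missing facts."
    )
```

A caller would typically build the prompt only when `answer_entropy` is above zero, since identical responses give the auxiliary model no disagreement to analyze.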

Title: Emotionally-Aware Agents for Dispute Resolution

Authors: Sushrita Rakshit, James Hale, Kushal Chawla, Jeanne M. Brett, Jonathan Gratch
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04465
Pdf URL: https://arxiv.org/pdf/2509.04465
Copy Paste: [[2509.04465]] Emotionally-Aware Agents for Dispute Resolution(https://arxiv.org/abs/2509.04465)
Keywords: language model, agent
Abstract: In conflict, people use emotional expressions to shape their counterparts' thoughts, feelings, and actions. This paper explores whether automatic text emotion recognition offers insight into this influence in the context of dispute resolution. Prior work has shown the promise of such methods in negotiations; however, disputes evoke stronger emotions and different social processes. We use a large corpus of buyer-seller dispute dialogues to investigate how emotional expressions shape subjective and objective outcomes. We further demonstrate that large language models yield considerably greater explanatory power than previous methods for emotion intensity annotation and better match the decisions of human annotators. Findings support existing theoretical models for how emotional expressions contribute to conflict escalation and resolution and suggest that agent-based systems could be useful in managing disputes by recognizing and potentially mitigating emotional escalation.

Title: Just-in-time and distributed task representations in language models

Authors: Yuxuan Li, Declan Campbell, Stephanie C. Y. Chan, Andrew Kyle Lampinen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04466
Pdf URL: https://arxiv.org/pdf/2509.04466
Copy Paste: [[2509.04466]] Just-in-time and distributed task representations in language models(https://arxiv.org/abs/2509.04466)
Keywords: language model, prompt
Abstract: Many of language models' impressive capabilities originate from their in-context learning: based on instructions or examples, they can infer and perform new tasks without weight updates. In this work, we investigate when representations for new tasks are formed in language models, and how these representations change over the course of context. We focus on "transferrable" task representations -- vector representations that can restore task context in another instance of the model, even without the full prompt. We show that these representations evolve in non-monotonic and sporadic ways, and are distinct from a more inert representation of high-level task categories that persists throughout the context. Specifically, models often condense multiple pieces of evidence into these transferrable task representations, which align well with the performance improvement based on more examples in the context. However, this accrual process exhibits strong locality along the sequence dimension, coming online only at certain tokens -- despite task identity being reliably decodable throughout the context. Moreover, these local but transferrable task representations tend to capture minimal "task scopes", such as a semantically-independent subtask, and models rely on more temporally-distributed representations to support longer and composite tasks. This two-fold locality (temporal and semantic) underscores a kind of just-in-time computational process underlying language models' ability to adapt to new evidence and learn new tasks on the fly.

Title: Enhancing LLM Efficiency: Targeted Pruning for Prefill-Decode Disaggregation in Inference

Authors: Hao Zhang, Mengsi Lyu, Yulong Ao, Yonghua Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04467
Pdf URL: https://arxiv.org/pdf/2509.04467
Copy Paste: [[2509.04467]] Enhancing LLM Efficiency: Targeted Pruning for Prefill-Decode Disaggregation in Inference(https://arxiv.org/abs/2509.04467)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) demonstrate exceptional capabilities across various tasks, but their deployment is constrained by high computational and memory costs. Model pruning provides an effective means to alleviate these demands. However, existing methods often ignore the characteristics of prefill-decode (PD) disaggregation in practice. In this paper, we propose a novel pruning method for PD disaggregation inference, enabling more precise and efficient block and KV Cache pruning. Our approach constructs pruning and distillation sets to perform iterative block removal independently for the prefill and decode stages, obtaining better pruning solutions. Moreover, we introduce a token-aware cache pruning mechanism that retains all KV Cache in the prefill stage but selectively reuses entries for the first and last token sequences in selected layers during decode, reducing communication costs with minimal overhead. Extensive experiments demonstrate that our approach consistently achieves strong performance in both PD disaggregation and PD unified settings without disaggregation. Under the default settings, our method achieves a 20.56% inference speedup and a 4.95 times reduction in data transmission bandwidth consumption.

Title: Evaluating Large Language Models for Financial Reasoning: A CFA-Based Benchmark Study

Authors: Xuan Yao, Qianteng Wang, Xinbo Liu, Ke-Wei Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04468
Pdf URL: https://arxiv.org/pdf/2509.04468
Copy Paste: [[2509.04468]] Evaluating Large Language Models for Financial Reasoning: A CFA-Based Benchmark Study(https://arxiv.org/abs/2509.04468)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: The rapid advancement of large language models presents significant opportunities for financial applications, yet systematic evaluation in specialized financial contexts remains limited. This study presents the first comprehensive evaluation of state-of-the-art LLMs using 1,560 multiple-choice questions from official mock exams across Levels I-III of the CFA, one of the most rigorous professional certifications globally, whose exams mirror the complexity of real-world financial analysis. We compare models distinguished by core design priorities: multi-modal and computationally powerful, reasoning-specialized and highly accurate, and lightweight efficiency-optimized. We assess models under zero-shot prompting and through a novel Retrieval-Augmented Generation pipeline that integrates official CFA curriculum content. The RAG system achieves precise domain-specific knowledge retrieval through hierarchical knowledge organization and structured query generation, significantly enhancing reasoning accuracy in professional financial certification evaluation. Results reveal that reasoning-oriented models consistently outperform others in zero-shot settings, while the RAG pipeline provides substantial improvements, particularly for complex scenarios. Comprehensive error analysis identifies knowledge gaps as the primary failure mode, with minimal impact from text readability. These findings provide actionable insights for LLM deployment in finance, offering practitioners evidence-based guidance for model selection and cost-performance optimization.

Title: Multi-Modal Vision vs. Text-Based Parsing: Benchmarking LLM Strategies for Invoice Processing

Authors: David Berghaus, Armin Berger, Lars Hillebrand, Kostadin Cvejoski, Rafet Sifa
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04469
Pdf URL: https://arxiv.org/pdf/2509.04469
Copy Paste: [[2509.04469]] Multi-Modal Vision vs. Text-Based Parsing: Benchmarking LLM Strategies for Invoice Processing(https://arxiv.org/abs/2509.04469)
Keywords: language model, gpt, llm, prompt
Abstract: This paper benchmarks eight multi-modal large language models from three families (GPT-5, Gemini 2.5, and open-source Gemma 3) on three diverse openly available invoice document datasets using zero-shot prompting. We compare two processing strategies: direct image processing using multi-modal capabilities and a structured parsing approach converting documents to markdown first. Results show native image processing generally outperforms structured approaches, with performance varying across model types and document characteristics. This benchmark provides insights for selecting appropriate models and processing strategies for automated document systems. Our code is available online.

Title: COCORELI: Cooperative, Compositional Reconstitution & Execution of Language Instructions

Authors: Swarnadeep Bhar, Omar Naim, Eleni Metheniti, Bastien Navarri, Loïc Cabannes, Morteza Ezzabady, Nicholas Asher
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04470
Pdf URL: https://arxiv.org/pdf/2509.04470
Copy Paste: [[2509.04470]] COCORELI: Cooperative, Compositional Reconstitution & Execution of Language Instructions(https://arxiv.org/abs/2509.04470)
Keywords: language model, llm, hallucination, agent
Abstract: We present COCORELI, a hybrid agent framework designed to tackle the limitations of large language models (LLMs) in tasks that require following complex instructions, minimizing hallucination, and spatial reasoning. COCORELI integrates medium-sized LLM agents with novel abstraction mechanisms and a discourse module for parsing instructions, so as to learn dynamic, high-level representations of the environment in context. Experiments on natural collaborative construction tasks show that COCORELI outperforms single-LLM CoT and agentic LLM systems, all of which use larger LLMs. It manages to largely avoid hallucinations, identify missing information, ask for clarifications, and update its learned objects. COCORELI's abstraction abilities extend beyond ENVIRONMENT, as shown in the ToolBench API completion task.

Title: MOSAIC: A Multilingual, Taxonomy-Agnostic, and Computationally Efficient Approach for Radiological Report Classification

Authors: Alice Schiavone (1 and 2), Marco Fraccaro (3), Lea Marie Pehrson (1, 4 and 5), Silvia Ingala (4 and 6), Rasmus Bonnevie (3), Michael Bachmann Nielsen (5), Vincent Beliveau (7), Melanie Ganz (1 and 2), Desmond Elliott (1) ((1) Department of Computer Science, University of Copenhagen, Denmark, (2) Neurobiology Research Unit, Copenhagen University Hospital, Denmark, (3) Unumed Aps, Denmark, (4) Department of Diagnostic Radiology, Copenhagen University Hospital, Denmark, (5) Department of Clinical Medicine, University of Copenhagen, Denmark, (6) Cerebriu A/S, Denmark, (7) Institute for Human Genetics, Medical University of Innsbruck, Austria)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04471
Pdf URL: https://arxiv.org/pdf/2509.04471
Copy Paste: [[2509.04471]] MOSAIC: A Multilingual, Taxonomy-Agnostic, and Computationally Efficient Approach for Radiological Report Classification(https://arxiv.org/abs/2509.04471)
Keywords: language model, llm, prompt
Abstract: Radiology reports contain rich clinical information that can be used to train imaging models without relying on costly manual annotation. However, existing approaches face critical limitations: rule-based methods struggle with linguistic variability, supervised models require large annotated datasets, and recent LLM-based systems depend on closed-source or resource-intensive models that are unsuitable for clinical use. Moreover, current solutions are largely restricted to English and single-modality, single-taxonomy datasets. We introduce MOSAIC, a multilingual, taxonomy-agnostic, and computationally efficient approach for radiological report classification. Built on a compact open-access language model (MedGemma-4B), MOSAIC supports both zero-/few-shot prompting and lightweight fine-tuning, enabling deployment on consumer-grade GPUs. We evaluate MOSAIC across seven datasets in English, Spanish, French, and Danish, spanning multiple imaging modalities and label taxonomies. The model achieves a mean macro F1 score of 88 across five chest X-ray datasets, approaching or exceeding expert-level performance, while requiring only 24 GB of GPU memory. With data augmentation, as few as 80 annotated samples are sufficient to reach a weighted F1 score of 82 on Danish reports, compared to 86 with the full 1600-sample training set. MOSAIC offers a practical alternative to large or proprietary LLMs in clinical settings. Code and models are open-source. We invite the community to evaluate and extend MOSAIC on new languages, taxonomies, and modalities.
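
The MOSAIC abstract reports both a macro F1 and a weighted F1, which can diverge when label frequencies are imbalanced. The following is a minimal sketch of the two averages in plain Python; the function names and the toy labels are illustrative assumptions and are not taken from the paper or its code.

```python
from collections import Counter

def f1_per_class(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)

def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by each class's support in y_true."""
    scores = f1_per_class(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)

# Toy example (hypothetical labels): class "a" is three times as frequent as "b".
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
print(macro_f1(y_true, y_pred), weighted_f1(y_true, y_pred))
```

Macro averaging treats a rare finding the same as a common one, which is why the two figures in an abstract like this one are not directly comparable.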

Title: RECAP: REwriting Conversations for Intent Understanding in Agentic Planning

Authors: Kushan Mitra, Dan Zhang, Hannah Kim, Estevam Hruschka
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04472
Pdf URL: https://arxiv.org/pdf/2509.04472
Copy Paste: [[2509.04472]] RECAP: REwriting Conversations for Intent Understanding in Agentic Planning(https://arxiv.org/abs/2509.04472)
Keywords: language model, llm, prompt, agent
Abstract: Understanding user intent is essential for effective planning in conversational assistants, particularly those powered by large language models (LLMs) coordinating multiple agents. However, real-world dialogues are often ambiguous, underspecified, or dynamic, making intent detection a persistent challenge. Traditional classification-based approaches struggle to generalize in open-ended settings, leading to brittle interpretations and poor downstream planning. We propose RECAP (REwriting Conversations for Agent Planning), a new benchmark designed to evaluate and advance intent rewriting, reframing user-agent dialogues into concise representations of user goals. RECAP captures diverse challenges such as ambiguity, intent drift, vagueness, and mixed-goal conversations. Alongside the dataset, we introduce an LLM-based evaluator that assesses planning utility given the rewritten intent. Using RECAP, we develop a prompt-based rewriting approach that outperforms baselines. We further demonstrate that fine-tuning two DPO-based rewriters yields additional utility gains. Our results highlight intent rewriting as a critical and tractable component for improving agent planning in open-domain dialogue systems.

Title: SpeechLLM: Unified Speech and Language Model for Enhanced Multi-Task Understanding in Low Resource Settings

Authors: Jaekwon Yoo, Kunal Chandiramani, Divya Tadimeti, Abenezer Girma, Chandra Dhir
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04473
Pdf URL: https://arxiv.org/pdf/2509.04473
Copy Paste: [[2509.04473]] SpeechLLM: Unified Speech and Language Model for Enhanced Multi-Task Understanding in Low Resource Settings(https://arxiv.org/abs/2509.04473)
Keywords: language model, llm
Abstract: While integrating a speech encoder with an LLM requires substantial data and resources, use cases face limitations due to insufficient availability. To address this, we propose a solution with a parameter-efficient adapter that converts speech embeddings into LLM-compatible tokens, focusing on end-to-end automatic speech recognition (ASR), named entity recognition (NER), and sentiment analysis (SA). To reduce labeling costs, we employ an LLM-based synthetic dataset annotation technique. The proposed adapter, using 7x fewer trainable parameters, achieves significant performance gains: a 26% relative Word Error Rate (WER) improvement on the LibriSpeech ASR task, a 6.3% relative F1 score increase on the NER task, and a 32% relative F1 score boost on the SA task. Moreover, using advanced techniques such as adding a classifier regularizer and optimizing the LLM with Low-Rank Adaptation (LoRA) yields notable performance gains, with Spoken Language Understanding Evaluation (SLUE) score improvements of 6.6% and 9.5%
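
The SpeechLLM abstract states its gains as relative changes. As a small worked sketch of what "relative" means here: for an error metric such as WER a relative improvement is a reduction measured against the baseline, and for a score such as F1 it is a gain measured against the baseline. The baseline and new values below are hypothetical numbers chosen only to reproduce a 26% figure; the abstract does not give the absolute values.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Relative improvement for a lower-is-better metric such as WER."""
    return (baseline - new) / baseline

def relative_gain(baseline: float, new: float) -> float:
    """Relative improvement for a higher-is-better metric such as F1."""
    return (new - baseline) / baseline

# Hypothetical values: a WER drop from 10.0 to 7.4 is a 26% relative improvement,
# even though the absolute change is only 2.6 points.
print(f"{relative_reduction(10.0, 7.4):.0%}")
```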

Title: Scaling Up, Speeding Up: A Benchmark of Speculative Decoding for Efficient LLM Test-Time Scaling

Authors: Shengyin Sun, Yiming Li, Xing Li, Yingzhao Lian, Weizhe Lin, Hui-Ling Zhen, Zhiyuan Yang, Chen Chen, Xianzhi Yu, Mingxuan Yuan, Chen Ma
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04474
Pdf URL: https://arxiv.org/pdf/2509.04474
Copy Paste: [[2509.04474]] Scaling Up, Speeding Up: A Benchmark of Speculative Decoding for Efficient LLM Test-Time Scaling(https://arxiv.org/abs/2509.04474)
Keywords: language model, llm
Abstract: Test-time scaling has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs) by allocating additional computational resources during inference. However, this paradigm is inherently inefficient due to the generation of redundant and repetitive reasoning traces, leading to significant computational overhead. Speculative decoding offers a promising avenue for mitigating this inefficiency, yet its efficacy in the structured, repetition-rich context of test-time scaling remains largely unexplored. To bridge this gap, we introduce the first comprehensive benchmark designed to evaluate speculative decoding methods for accelerating LLM test-time scaling. Our benchmark provides consistent experimental protocols across representative test-time scaling paradigms (e.g., Best-of-N sampling and multi-round thinking), enabling a fair comparison of three major categories of speculative decoding: model-based, training-based, and n-gram-based methods. Extensive experiments reveal that simple n-gram-based methods effectively capture repetitive patterns, demonstrating unique potential in accelerating test-time scaling. This phenomenon demonstrates the value of integrating n-gram-based methods with model-based or training-based approaches to balance acceleration for both repetitive and diverse reasoning in test-time scaling. We hope this benchmark spurs further research on speculative decoding for test-time scaling, enabling faster and more practical reasoning in LLMs through better handling of repetitive and diverse reasoning paths.
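
The abstract credits n-gram-based speculative decoding with capturing repetitive reasoning patterns. A minimal sketch of that general idea follows: the last n tokens are matched against earlier context, and the tokens that followed the earlier match are proposed as a draft for the target model to verify. The function names, the sequential verification loop, and the `target_next_token` callback are assumptions for illustration; the benchmark's actual methods are not described in the abstract, and real systems typically verify the whole draft in one batched forward pass.

```python
def propose_draft(tokens, n=3, max_draft=5):
    """Propose draft tokens by finding the most recent earlier occurrence
    of the last n tokens and copying what followed it."""
    if len(tokens) < n:
        return []
    suffix = tokens[-n:]
    # Walk backwards so the most recent earlier match wins; the range
    # excludes the suffix's own position at len(tokens) - n.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            return tokens[start + n:start + n + max_draft]
    return []

def accept_prefix(draft, target_next_token, context):
    """Keep the longest draft prefix that the target model would also produce.
    `target_next_token` is a hypothetical callable: context -> next token."""
    accepted = []
    for tok in draft:
        if target_next_token(context + accepted) != tok:
            break
        accepted.append(tok)
    return accepted

# Toy usage with integer token ids and a stand-in "model" that counts upward.
history = [1, 2, 3, 4, 5, 1, 2, 3]
draft = propose_draft(history, n=3, max_draft=2)
print(draft, accept_prefix(draft, lambda ctx: ctx[-1] + 1, history))
```

The draft costs only a list scan, so it pays off exactly when the output repeats earlier spans, which is the situation the abstract describes for test-time scaling traces.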

Title: ParaThinker: Native Parallel Thinking as a New Paradigm to Scale LLM Test-time Compute

Authors: Hao Wen, Yifan Su, Feifei Zhang, Yunxin Liu, Yunhao Liu, Ya-Qin Zhang, Yuanchun Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04475
Pdf URL: https://arxiv.org/pdf/2509.04475
Copy Paste: [[2509.04475]] ParaThinker: Native Parallel Thinking as a New Paradigm to Scale LLM Test-time Compute(https://arxiv.org/abs/2509.04475)
Keywords: language model, llm
Abstract: Recent advances in Large Language Models (LLMs) have been driven by test-time compute scaling - a strategy that improves reasoning by generating longer, sequential thought processes. While effective, this approach encounters a significant bottleneck as computation increases, where further computation offers only marginal performance gains. We argue this ceiling is not an inherent limit of the model's capability but a flaw in the scaling strategy itself, a phenomenon we term "Tunnel Vision", where a model's imperfect initial steps lock it into a suboptimal reasoning path. To overcome this, we introduce a new scaling paradigm: native thought parallelism. We present ParaThinker, an end-to-end framework that trains an LLM to generate multiple, diverse reasoning paths in parallel and synthesize them into a superior final answer. By exploring different lines of thoughts simultaneously, ParaThinker effectively sidesteps the Tunnel Vision issue and unlocks the model's latent reasoning potential. Our approach demonstrates that scaling compute in parallel (width) is a more effective and efficient way to superior reasoning than simply scaling sequentially (depth). On challenging reasoning benchmarks, ParaThinker achieves substantial accuracy improvements over sequential LLMs (12.3% for 1.5B and 7.5% for 7B models on average with 8 parallel paths), while adding only negligible latency overhead (7.1%). This enables smaller models to surpass much larger counterparts and establishes parallel thinking as a critical, efficient dimension for scaling future LLMs.

Title: Training Text-to-Molecule Models with Context-Aware Tokenization

Authors: Seojin Kim, Hyeontae Song, Jaehyun Nam, Jinwoo Shin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04476
Pdf URL: https://arxiv.org/pdf/2509.04476
Copy Paste: [[2509.04476]] Training Text-to-Molecule Models with Context-Aware Tokenization(https://arxiv.org/abs/2509.04476)
Keywords: language model
Abstract: Recently, text-to-molecule models have shown great potential across various chemical applications, e.g., drug-discovery. These models adapt language models to molecular data by representing molecules as sequences of atoms. However, they rely on atom-level tokenizations, which primarily focus on modeling local connectivity, thereby limiting the ability of models to capture the global structural context within molecules. To tackle this issue, we propose a novel text-to-molecule model, coined Context-Aware Molecular T5 (CAMT5). Inspired by the significance of the substructure-level contexts in understanding molecule structures, e.g., ring systems, we introduce substructure-level tokenization for text-to-molecule models. Building on our tokenization scheme, we develop an importance-based training strategy that prioritizes key substructures, enabling CAMT5 to better capture the molecular semantics. Extensive experiments verify the superiority of CAMT5 in various text-to-molecule generation tasks. Intriguingly, we find that CAMT5 outperforms the state-of-the-art methods using only 2% of training tokens. In addition, we propose a simple yet effective ensemble strategy that aggregates the outputs of text-to-molecule models to further boost the generation performance. Code is available at this https URL.

Title: No Clustering, No Routing: How Transformers Actually Process Rare Tokens

Authors: Jing Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04479
Pdf URL: https://arxiv.org/pdf/2509.04479
Copy Paste: [[2509.04479]] No Clustering, No Routing: How Transformers Actually Process Rare Tokens(https://arxiv.org/abs/2509.04479)
Keywords: language model, gpt
Abstract: Large language models struggle with rare token prediction, yet the mechanisms driving their specialization remain unclear. Prior work identified specialized "plateau" neurons for rare tokens following distinctive three-regime influence patterns \cite{liu2025emergent}, but their functional organization is unknown. We investigate this through neuron influence analyses, graph-based clustering, and attention head ablations in GPT-2 XL and Pythia models. Our findings show that: (1) rare token processing requires additional plateau neurons beyond the power-law regime sufficient for common tokens, forming dual computational regimes; (2) plateau neurons are spatially distributed rather than forming modular clusters; and (3) attention mechanisms exhibit no preferential routing to specialists. These results demonstrate that rare token specialization arises through distributed, training-driven differentiation rather than architectural modularity, preserving context-sensitive flexibility while achieving adaptive capacity allocation.

Title: Discrete Prompt Tuning via Recursive Utilization of Black-box Multimodal Large Language Model for Personalized Visual Emotion Recognition

Authors: Ryo Takahashi, Naoki Saito, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.04480
Pdf URL: https://arxiv.org/pdf/2509.04480
Copy Paste: [[2509.04480]] Discrete Prompt Tuning via Recursive Utilization of Black-box Multimodal Large Language Model for Personalized Visual Emotion Recognition(https://arxiv.org/abs/2509.04480)
Keywords: language model, llm, prompt
Abstract: Visual Emotion Recognition (VER) is an important research topic due to its wide range of applications, including opinion mining and advertisement design. Extending this capability to recognize emotions at the individual level further broadens its potential applications. Recently, Multimodal Large Language Models (MLLMs) have attracted increasing attention and demonstrated performance comparable to that of conventional VER methods. However, MLLMs are trained on large and diverse datasets containing general opinions, which causes them to favor majority viewpoints and familiar patterns. This tendency limits their performance in a personalized VER, which is crucial for practical and real-world applications, and indicates a key area for improvement. To address this limitation, the proposed method employs discrete prompt tuning inspired by the process of humans' prompt engineering to adapt the VER task to each individual. Our method selects the best natural language representation from the generated prompts and uses it to update the prompt for the realization of accurate personalized VER.

Title: Energy Landscapes Enable Reliable Abstention in Retrieval-Augmented Large Language Models for Healthcare

Authors: Ravi Shankar, Sheng Wong, Lin Li, Magdalena Bachmann, Alex Silverthorne, Beth Albert, Gabriel Davis Jones
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04482
Pdf URL: https://arxiv.org/pdf/2509.04482
Copy Paste: [[2509.04482]] Energy Landscapes Enable Reliable Abstention in Retrieval-Augmented Large Language Models for Healthcare(https://arxiv.org/abs/2509.04482)
Keywords: language model, retrieval-augmented generation
Abstract: Reliable abstention is critical for retrieval-augmented generation (RAG) systems, particularly in safety-critical domains such as women's health, where incorrect answers can lead to harm. We present an energy-based model (EBM) that learns a smooth energy landscape over a dense semantic corpus of 2.6M guideline-derived questions, enabling the system to decide when to generate or abstain. We benchmark the EBM against a calibrated softmax baseline and a k-nearest neighbour (kNN) density heuristic across both easy and hard abstention splits, where hard cases are semantically challenging near-distribution queries. The EBM achieves superior abstention performance on semantically hard cases, reaching AUROC 0.961 versus 0.950 for softmax, while also reducing FPR@95 (0.235 vs 0.331). On easy negatives, performance is comparable across methods, but the EBM's advantage becomes most pronounced in safety-critical hard distributions. A comprehensive ablation with controlled negative sampling and fair data exposure shows that robustness stems primarily from the energy scoring head, while the inclusion or exclusion of specific negative types (hard, easy, mixed) sharpens decision boundaries but is not essential for generalisation to hard cases. These results demonstrate that energy-based abstention scoring offers a more reliable confidence signal than probability-based softmax confidence, providing a scalable and interpretable foundation for safe RAG systems.
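
The abstract compares abstention scorers by AUROC and FPR@95. A minimal sketch of how these two metrics are commonly computed from confidence scores is given below. It assumes that higher scores mean "safe to answer", that positives are in-scope queries and negatives are queries the system should abstain on, and that FPR@95 means the false-positive rate at the threshold that accepts 95% of positives; the abstract does not spell out these conventions, and the scores are made-up illustrative values.

```python
import math

def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative
    (ties count as half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def fpr_at_tpr(pos_scores, neg_scores, tpr=0.95):
    """Fraction of negatives accepted at the threshold that still accepts
    at least `tpr` of the positives."""
    ranked = sorted(pos_scores, reverse=True)
    k = math.ceil(tpr * len(ranked))   # positives that must be accepted
    threshold = ranked[k - 1]
    accepted_negatives = sum(s >= threshold for s in neg_scores)
    return accepted_negatives / len(neg_scores)

# Illustrative scores only.
in_scope = [0.9, 0.8, 0.95, 0.7, 0.85, 0.6, 0.99, 0.75, 0.65, 0.55]
should_abstain = [0.1, 0.2, 0.6, 0.3]
print(auroc(in_scope, should_abstain), fpr_at_tpr(in_scope, should_abstain))
```

A lower FPR@95 means fewer out-of-scope queries slip through when the system is tuned to answer nearly all in-scope ones, which is the operating point that matters for a safety-critical deployment.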

Title: DecMetrics: Structured Claim Decomposition Scoring for Factually Consistent LLM Outputs

Authors: Minghui Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04483
Pdf URL: https://arxiv.org/pdf/2509.04483
Copy Paste: [[2509.04483]] DecMetrics: Structured Claim Decomposition Scoring for Factually Consistent LLM Outputs(https://arxiv.org/abs/2509.04483)
Keywords: llm
Abstract: Claim decomposition plays a crucial role in the fact-checking process by breaking down complex claims into simpler atomic components and identifying their unfactual elements. Despite its importance, current research primarily focuses on generative methods for decomposition, with insufficient emphasis on evaluating the quality of these decomposed atomic claims. To bridge this gap, we introduce DecMetrics, which comprises three new metrics: COMPLETENESS, CORRECTNESS, and SEMANTIC ENTROPY, designed to automatically assess the quality of claims produced by decomposition models. Utilizing these metrics, we develop a lightweight claim decomposition model, optimizing its performance through the integration of these metrics as a reward function. Through automatic evaluation, our approach aims to set a benchmark for claim decomposition, enhancing both the reliability and effectiveness of fact-checking systems.

Title: The Good, the Bad and the Constructive: Automatically Measuring Peer Review's Utility for Authors

Authors: Abdelrahman Sadallah, Tim Baumgärtner, Iryna Gurevych, Ted Briscoe
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2509.04484
Pdf URL: https://arxiv.org/pdf/2509.04484
Copy Paste: [[2509.04484]] The Good, the Bad and the Constructive: Automatically Measuring Peer Review's Utility for Authors(https://arxiv.org/abs/2509.04484)
Keywords: gpt
Abstract: Providing constructive feedback to paper authors is a core component of peer review. With reviewers increasingly having less time to perform reviews, automated support systems are required to ensure high reviewing quality, thus making the feedback in reviews useful for authors. To this end, we identify four key aspects of review comments (individual points in weakness sections of reviews) that drive the utility for authors: Actionability, Grounding & Specificity, Verifiability, and Helpfulness. To enable evaluation and development of models assessing review comments, we introduce the RevUtil dataset. We collect 1,430 human-labeled review comments and scale our data with 10k synthetically labeled comments for training purposes. The synthetic data additionally contains rationales, i.e., explanations for the aspect score of a review comment. Employing the RevUtil dataset, we benchmark fine-tuned models for assessing review comments on these aspects and generating rationales. Our experiments demonstrate that these fine-tuned models achieve agreement levels with humans comparable to, and in some cases exceeding, those of powerful closed models like GPT-4o. Our analysis further reveals that machine-generated reviews generally underperform human reviews on our four aspects.

Title: ASCENDgpt: A Phenotype-Aware Transformer Model for Cardiovascular Risk Prediction from Electronic Health Records

Authors: Chris Sainsbury, Andreas Karwath
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04485
Pdf URL: https://arxiv.org/pdf/2509.04485
Copy Paste: [[2509.04485]] ASCENDgpt: A Phenotype-Aware Transformer Model for Cardiovascular Risk Prediction from Electronic Health Records(https://arxiv.org/abs/2509.04485)
Keywords: language model, gpt
Abstract: We present ASCENDgpt, a transformer-based model specifically designed for cardiovascular risk prediction from longitudinal electronic health records (EHRs). Our approach introduces a novel phenotype-aware tokenization scheme that maps 47,155 raw ICD codes to 176 clinically meaningful phenotype tokens, achieving 99.6% consolidation of diagnosis codes while preserving semantic information. This phenotype mapping contributes to a total vocabulary of 10,442 tokens - a 77.9% reduction when compared with using raw ICD codes directly. We pretrain ASCENDgpt on sequences derived from 19,402 unique individuals using a masked language modeling objective, then fine-tune for time-to-event prediction of five cardiovascular outcomes: myocardial infarction (MI), stroke, major adverse cardiovascular events (MACE), cardiovascular death, and all-cause mortality. Our model achieves excellent discrimination on the held-out test set with an average C-index of 0.816, demonstrating strong performance across all outcomes (MI: 0.792, stroke: 0.824, MACE: 0.800, cardiovascular death: 0.842, all-cause mortality: 0.824). The phenotype-based approach enables clinically interpretable predictions while maintaining computational efficiency. Our work demonstrates the effectiveness of domain-specific tokenization and pretraining for EHR-based risk prediction tasks.
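
The headline figures in the ASCENDgpt abstract can be re-derived from the counts it states. The short check below uses only numbers from the abstract; the variable names are illustrative, and reading the 77.9% reduction as "total vocabulary versus the 47,155 raw ICD codes" is an assumption that happens to reproduce the reported value.

```python
# Counts and per-outcome C-indices as stated in the abstract.
icd_codes = 47_155
phenotype_tokens = 176
total_vocab = 10_442
c_index = {
    "MI": 0.792,
    "stroke": 0.824,
    "MACE": 0.800,
    "cardiovascular death": 0.842,
    "all-cause mortality": 0.824,
}

# Share of diagnosis codes eliminated by mapping them to phenotype tokens.
consolidation_pct = 100 * (1 - phenotype_tokens / icd_codes)

# Assumed reading: full vocabulary compared with the raw ICD code count.
vocab_reduction_pct = 100 * (1 - total_vocab / icd_codes)

# Unweighted mean over the five outcomes.
mean_c_index = sum(c_index.values()) / len(c_index)

print(round(consolidation_pct, 1), round(vocab_reduction_pct, 1), round(mean_c_index, 3))
```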

Title: Serialized Output Prompting for Large Language Model-based Multi-Talker Speech Recognition

Authors: Hao Shi, Yusuke Fujita, Tomoya Mizumoto, Lianbo Liu, Atsushi Kojima, Yui Sudo
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2509.04488
Pdf URL: https://arxiv.org/pdf/2509.04488
Copy Paste: [[2509.04488]] Serialized Output Prompting for Large Language Model-based Multi-Talker Speech Recognition(https://arxiv.org/abs/2509.04488)
Keywords: language model, llm, prompt
Abstract: Prompts are crucial for task definition and for improving the performance of large language models (LLM)-based systems. However, existing LLM-based multi-talker (MT) automatic speech recognition (ASR) systems either omit prompts or rely on simple task-definition prompts, with no prior work exploring the design of prompts to enhance performance. In this paper, we propose extracting serialized output prompts (SOP) and explicitly guiding the LLM using structured prompts to improve system performance (SOP-MT-ASR). A Separator and serialized Connectionist Temporal Classification (CTC) layers are inserted after the speech encoder to separate and extract MT content from the mixed speech encoding in a first-speaking-first-out manner. Subsequently, the SOP, which serves as a prompt for LLMs, is obtained by decoding the serialized CTC outputs using greedy search. To train the model effectively, we design a three-stage training strategy, consisting of serialized output training (SOT) fine-tuning, serialized speech information extraction, and SOP-based adaptation. Experimental results on the LibriMix dataset show that, although the LLM-based SOT model performs well in the two-talker scenario, it fails to fully leverage LLMs under more complex conditions, such as the three-talker scenario. The proposed SOP approach significantly improved performance under both two- and three-talker conditions.
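
The abstract obtains the serialized output prompt by decoding CTC outputs with greedy search. A minimal sketch of standard greedy CTC decoding follows: take the best label per frame, collapse consecutive repeats, then drop blanks. The frame scores, the label ids, and the choice of index 0 as the blank are illustrative assumptions; the paper's serialized multi-talker layout is not reproduced here.

```python
def ctc_greedy_decode(frame_scores, blank=0):
    """Greedy CTC decoding over a list of per-frame score vectors."""
    # 1. Best label index for each frame.
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]
    # 2. Collapse consecutive repeats and 3. remove blanks.
    decoded, prev = [], None
    for idx in best:
        if idx != prev and idx != blank:
            decoded.append(idx)
        prev = idx
    return decoded

# Toy frames over the labels {0: blank, 1, 2}; the per-frame argmax path is
# 1 1 0 1 2 2 0, so the blank between the two 1s keeps them as separate tokens.
frames = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.1, 0.2, 0.7],
    [0.1, 0.2, 0.7],
    [0.8, 0.1, 0.1],
]
print(ctc_greedy_decode(frames))
```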

Title: Refining Transcripts With TV Subtitles by Prompt-Based Weakly Supervised Training of ASR

Authors: Xinnian Zhao, Hugo Van Hamme
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04491
Pdf URL: https://arxiv.org/pdf/2509.04491
Copy Paste: [[2509.04491]] Refining Transcripts With TV Subtitles by Prompt-Based Weakly Supervised Training of ASR(https://arxiv.org/abs/2509.04491)
Keywords: prompt
Abstract: This study proposes a novel approach to using TV subtitles within a weakly supervised (WS) Automatic Speech Recognition (ASR) framework. Although TV subtitles are readily available, their imprecise alignment with corresponding audio limits their applicability as supervised targets for verbatim transcription. Rather than using subtitles as direct supervision signals, our method reimagines them as context-rich prompts. This design enables the model to handle discrepancies between spoken audio and subtitle text. Generated pseudo transcripts become the primary targets instead, with subtitles acting as guiding cues for iterative refinement. To further enhance the process, we introduce a weighted attention mechanism that emphasizes relevant subtitle tokens during inference. Our experiments demonstrate significant improvements in transcription accuracy, highlighting the effectiveness of the proposed method in refining transcripts. These enhanced pseudo-labeled datasets provide high-quality foundational resources for training robust ASR systems.

Title: Learned Hallucination Detection in Black-Box LLMs using Token-level Entropy Production Rate

Authors: Charles Moslonka, Hicham Randrianarivo, Arthur Garnier, Emmanuel Malherbe
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04492
Pdf URL: https://arxiv.org/pdf/2509.04492
Copy Paste: [[2509.04492]] Learned Hallucination Detection in Black-Box LLMs using Token-level Entropy Production Rate(https://arxiv.org/abs/2509.04492)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Hallucinations in Large Language Model (LLM) outputs for Question Answering (QA) tasks critically undermine their real-world reliability. This paper introduces an applied methodology for robust, one-shot hallucination detection, specifically designed for scenarios with limited data access, such as interacting with black-box LLM APIs that typically expose only a few top candidate log-probabilities per token. Our approach derives uncertainty indicators directly from these readily available log-probabilities generated during non-greedy decoding. We first derive an Entropy Production Rate (EPR) metric that offers baseline performance, later augmented with supervised learning. Our learned model uses features representing the entropic contributions of the accessible top-ranked tokens within a single generated sequence, requiring no multiple query re-runs. Evaluated across diverse QA datasets and multiple LLMs, this estimator significantly improves hallucination detection over using EPR alone. Crucially, high performance is demonstrated using only the typically small set of available log-probabilities (e.g., top <10 per token), confirming its practical efficiency and suitability for these API-constrained deployments. This work provides a readily deployable technique to enhance the trustworthiness of LLM responses from a single generation pass in QA and Retrieval-Augmented Generation (RAG) systems, with its utility further demonstrated in a finance framework analyzing responses to queries on annual reports from an industrial dataset.

Title: Where Should I Study? Biased Language Models Decide! Evaluating Fairness in LMs for Academic Recommendations

Authors: Krithi Shailya, Akhilesh Kumar Mishra, Gokul S Krishnan, Balaraman Ravindran
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04498
Pdf URL: https://arxiv.org/pdf/2509.04498
Copy Paste: [[2509.04498]] Where Should I Study? Biased Language Models Decide! Evaluating Fairness in LMs for Academic Recommendations(https://arxiv.org/abs/2509.04498)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are increasingly used as daily recommendation systems for tasks like education planning, yet their recommendations risk perpetuating societal biases. This paper empirically examines geographic, demographic, and economic biases in university and program suggestions from three open-source LLMs: LLaMA-3.1-8B, Gemma-7B, and Mistral-7B. Using 360 simulated user profiles varying by gender, nationality, and economic status, we analyze over 25,000 recommendations. Results show strong biases: institutions in the Global North are disproportionately favored, recommendations often reinforce gender stereotypes, and institutional repetition is prevalent. While LLaMA-3.1 achieves the highest diversity, recommending 481 unique universities across 58 countries, systemic disparities persist. To quantify these issues, we propose a novel, multi-dimensional evaluation framework that goes beyond accuracy by measuring demographic and geographic representation. Our findings highlight the urgent need for bias consideration in educational LMs to ensure equitable global access to higher education.

Title: DeepTRACE: Auditing Deep Research AI Systems for Tracking Reliability Across Citations and Evidence

Authors: Pranav Narayanan Venkit, Philippe Laban, Yilun Zhou, Kung-Hsiang Huang, Yixin Mao, Chien-Sheng Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04499
Pdf URL: https://arxiv.org/pdf/2509.04499
Copy Paste: [[2509.04499]] DeepTRACE: Auditing Deep Research AI Systems for Tracking Reliability Across Citations and Evidence(https://arxiv.org/abs/2509.04499)
Keywords: gpt, llm, agent
Abstract: Generative search engines and deep research LLM agents promise trustworthy, source-grounded synthesis, yet users regularly encounter overconfidence, weak sourcing, and confusing citation practices. We introduce DeepTRACE, a novel sociotechnically grounded audit framework that turns prior community-identified failure cases into eight measurable dimensions spanning answer text, sources, and citations. DeepTRACE uses statement-level analysis (decomposition, confidence scoring) and builds citation and factual-support matrices to audit how systems reason with and attribute evidence end-to-end. Using automated extraction pipelines for popular public models (e.g., GPT-4.5/5, this http URL, Perplexity, Copilot/Bing, Gemini) and an LLM-judge with validated agreement with human raters, we evaluate both web-search engines and deep-research configurations. Our findings show that generative search engines and deep research agents frequently produce one-sided, highly confident responses on debate queries and include large fractions of statements unsupported by their own listed sources. Deep-research configurations reduce overconfidence and can attain high citation thoroughness, but they remain highly one-sided on debate queries and still exhibit large fractions of unsupported statements, with citation accuracy ranging from 40% to 80% across systems.

Title: Context Engineering for Trustworthiness: Rescorla Wagner Steering Under Mixed and Inappropriate Contexts

Authors: Rushi Wang, Jiateng Liu, Cheng Qian, Yifan Shen, Yanzhou Pan, Zhaozhuo Xu, Ahmed Abbasi, Heng Ji, Denghui Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04500
Pdf URL: https://arxiv.org/pdf/2509.04500
Copy Paste: [[2509.04500]] Context Engineering for Trustworthiness: Rescorla Wagner Steering Under Mixed and Inappropriate Contexts(https://arxiv.org/abs/2509.04500)
Keywords: language model, llm
Abstract: Incorporating external context can significantly enhance the response quality of Large Language Models (LLMs). However, real-world contexts often mix relevant information with disproportionate inappropriate content, posing reliability risks. How do LLMs process and prioritize mixed context? To study this, we introduce the Poisoned Context Testbed, pairing queries with real-world contexts containing relevant and inappropriate content. Inspired by associative learning in animals, we adapt the Rescorla-Wagner (RW) model from neuroscience to quantify how competing contextual signals influence LLM outputs. Our adapted model reveals a consistent behavioral pattern: LLMs exhibit a strong tendency to incorporate information that is less prevalent in the context. This susceptibility is harmful in real-world settings, where small amounts of inappropriate content can substantially degrade response quality. Empirical evaluations on our testbed further confirm this vulnerability. To tackle this, we introduce RW-Steering, a two-stage finetuning-based approach that enables the model to internally identify and ignore inappropriate signals. Unlike prior methods that rely on extensive supervision across diverse context mixtures, RW-Steering generalizes robustly across varying proportions of inappropriate content. Experiments show that our best fine-tuned model improves response quality by 39.8% and reverses the undesirable behavior curve, establishing RW-Steering as a robust, generalizable context engineering solution for improving LLM safety in real-world use.
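For readers unfamiliar with the Rescorla-Wagner model, the sketch below implements the classical update rule from associative learning, in which every cue present on a trial changes by alpha * beta * (lambda - sum of strengths of the present cues). This is the textbook form only. How the paper adapts it to LLM contexts is not specified in the abstract, and the cue names are hypothetical.

```python
def rescorla_wagner(trials, alphas, beta=0.5, lam=1.0):
    """Classical Rescorla-Wagner learning.

    trials: list of trials, each a list of the cues present on that trial.
    alphas: per-cue salience (learning-rate) parameters.
    beta:   learning rate tied to the outcome.
    lam:    maximum associative strength the outcome supports.
    """
    strengths = {cue: 0.0 for cue in alphas}
    for present in trials:
        # Prediction error is shared by all cues present on the trial.
        error = lam - sum(strengths[c] for c in present)
        for c in present:
            strengths[c] += alphas[c] * beta * error
    return strengths

# Two hypothetical cues that always appear together compete for strength.
result = rescorla_wagner([["relevant", "inappropriate"]] * 50,
                         {"relevant": 0.5, "inappropriate": 0.5})
print(result)
```

Because the error term is shared, cues that co-occur split a fixed amount of associative strength, which is the competition effect the abstract alludes to.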

Title: Understanding Reinforcement Learning for Model Training, and future directions with GRAPE

Authors: Rohit Patel
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.04501
Pdf URL: https://arxiv.org/pdf/2509.04501
Copy Paste: [[2509.04501]] Understanding Reinforcement Learning for Model Training, and future directions with GRAPE(https://arxiv.org/abs/2509.04501)
Keywords: llm
Abstract: This paper provides a self-contained, from-scratch exposition of key algorithms for instruction tuning of models: SFT, Rejection Sampling, REINFORCE, Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO), Group Relative Policy Optimization (GRPO), and Direct Preference Optimization (DPO). Explanations of these algorithms often assume prior knowledge, lack critical details, or are overly generalized and complex. Here, each method is discussed and developed step by step using simplified and explicit notation focused on LLMs, aiming to eliminate ambiguity and provide a clear and intuitive understanding of the concepts. By minimizing detours into the broader RL literature and connecting concepts to LLMs, we eliminate superfluous abstractions and reduce cognitive overhead. Following this exposition, we provide a literature review of new techniques and approaches beyond those detailed. Finally, new ideas for research and exploration in the form of GRAPE (Generalized Relative Advantage Policy Evolution) are presented.
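As a small illustration of one listed algorithm, the sketch below computes the group-relative advantage that GRPO is commonly described as using: each sampled response's reward is standardised against the other responses to the same prompt. This is not taken from the paper. The use of the population standard deviation and the `eps` stabiliser are assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group of responses to a single prompt.

    Assumption: population standard deviation plus a small eps so that a
    group with identical rewards does not divide by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four sampled responses to the same prompt.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Responses scoring above the group mean receive a positive advantage and those below a negative one, so no separate value network is needed to form the baseline.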

Title: VaccineRAG: Boosting Multimodal Large Language Models' Immunity to Harmful RAG Samples

Authors: Qixin Sun, Ziqin Wang, Hengyuan Zhao, Yilin Li, Kaiyou Song, Linjiang Huang, Xiaolin Hu, Qingpei Guo, Si Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04502
Pdf URL: https://arxiv.org/pdf/2509.04502
Copy Paste: [[2509.04502]] VaccineRAG: Boosting Multimodal Large Language Models' Immunity to Harmful RAG Samples(https://arxiv.org/abs/2509.04502)
Keywords: language model, llm, prompt, retrieval augmented generation, retrieval-augmented generation, chain-of-thought
Abstract: Retrieval Augmented Generation enhances the response accuracy of Large Language Models (LLMs) by integrating retrieval and generation modules with external knowledge, demonstrating particular strength in real-time queries and Visual Question Answering tasks. However, the effectiveness of RAG is frequently hindered by the precision of the retriever: many retrieved samples fed into the generation phase are irrelevant or misleading, posing a critical bottleneck to LLMs' performance. To address this challenge, we introduce VaccineRAG, a novel Chain-of-Thought-based retrieval-augmented generation dataset. On one hand, VaccineRAG employs a benchmark to evaluate models using data with varying positive/negative sample ratios, systematically exposing inherent weaknesses in current LLMs. On the other hand, it enhances models' sample-discrimination capabilities by prompting LLMs to generate explicit Chain-of-Thought (CoT) analysis for each sample before producing final answers. Furthermore, to enhance the model's ability to learn long-sequence complex CoT content, we propose Partial-GRPO. By modeling the outputs of LLMs as multiple components rather than a single whole, our model can make more informed preference selections for complex sequences, thereby enhancing its capacity to learn complex CoT. Comprehensive evaluations and ablation studies on VaccineRAG validate the effectiveness of the proposed scheme. The code and dataset will be publicly released soon.

Title: Behavioral Fingerprinting of Large Language Models

Authors: Zehua Pei, Hui-Ling Zhen, Ying Zhang, Zhiyuan Yang, Xing Li, Xianzhi Yu, Mingxuan Yuan, Bei Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04504
Pdf URL: https://arxiv.org/pdf/2509.04504
Copy Paste: [[2509.04504]] Behavioral Fingerprinting of Large Language Models(https://arxiv.org/abs/2509.04504)
Keywords: language model, llm, prompt
Abstract: Current benchmarks for Large Language Models (LLMs) primarily focus on performance metrics, often failing to capture the nuanced behavioral characteristics that differentiate them. This paper introduces a novel "Behavioral Fingerprinting" framework designed to move beyond traditional evaluation by creating a multi-faceted profile of a model's intrinsic cognitive and interactive styles. Using a curated Diagnostic Prompt Suite and an innovative, automated evaluation pipeline where a powerful LLM acts as an impartial judge, we analyze eighteen models across capability tiers. Our results reveal a critical divergence in the LLM landscape: while core capabilities like abstract and causal reasoning are converging among top models, alignment-related behaviors such as sycophancy and semantic robustness vary dramatically. We further document a cross-model default persona clustering (ISTJ/ESTJ) that likely reflects common alignment incentives. Taken together, this suggests that a model's interactive nature is not an emergent property of its scale or reasoning power, but a direct consequence of specific, and highly variable, developer alignment strategies. Our framework provides a reproducible and scalable methodology for uncovering these deep behavioral differences. Project: this https URL

Title: From Silent Signals to Natural Language: A Dual-Stage Transformer-LLM Approach

Authors: Nithyashree Sivasubramaniam
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04507
Pdf URL: https://arxiv.org/pdf/2509.04507
Copy Paste: [[2509.04507]] From Silent Signals to Natural Language: A Dual-Stage Transformer-LLM Approach(https://arxiv.org/abs/2509.04507)
Keywords: language model, llm
Abstract: Silent Speech Interfaces (SSIs) have gained attention for their ability to generate intelligible speech from non-acoustic signals. While significant progress has been made in advancing speech generation pipelines, limited work has addressed the recognition and downstream processing of synthesized speech, which often suffers from phonetic ambiguity and noise. To overcome these challenges, we propose an enhanced automatic speech recognition framework that combines a transformer-based acoustic model with a large language model (LLM) for post-processing. The transformer captures full utterance context, while the LLM ensures linguistic consistency. Experimental results show a 16% relative and 6% absolute reduction in word error rate (WER) over a 36% baseline, demonstrating substantial improvements in intelligibility for silent speech interfaces.
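The relative and absolute figures in the last sentence are linked by simple arithmetic, sketched below. The improved WER of 30% is derived here from the stated 36% baseline and 6-point absolute reduction. It is not quoted in the abstract.

```python
def wer_reduction(baseline_wer, improved_wer):
    """Return (absolute, relative) reduction in word error rate."""
    absolute = baseline_wer - improved_wer
    relative = absolute / baseline_wer
    return absolute, relative

# 36% baseline and a 6-point absolute drop imply a 30% final WER (derived).
absolute, relative = wer_reduction(0.36, 0.30)
print(f"absolute: {absolute:.2%}, relative: {relative:.2%}")
```

A 6-point drop from a 36% baseline is a relative reduction of one sixth, about 16.7%, which matches the abstract's reported 16%.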

Title: ProST: Progressive Sub-task Training for Pareto-Optimal Multi-agent Systems Using Small Language Models

Authors: Biddut Sarker Bijoy, Mohammad Saqib Hasan, Pegah Alipoormolabashi, Avirup Sil, Aruna Balasubramanian, Niranjan Balasubramanian
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.04508
Pdf URL: https://arxiv.org/pdf/2509.04508
Copy Paste: [[2509.04508]] ProST: Progressive Sub-task Training for Pareto-Optimal Multi-agent Systems Using Small Language Models(https://arxiv.org/abs/2509.04508)
Keywords: language model, llm, agent
Abstract: Multi-agent systems with smaller language models (SLMs) present a viable alternative to single-agent systems powered by large language models (LLMs) for addressing complex problems. In this work, we study how these alternatives compare in terms of both effectiveness and efficiency. To study this trade-off, we instantiate single- and multi-agent systems for the complex problems in the AppWorld environment using language models of different sizes. We find that difficulties with long-trajectory learning in SLMs limit their performance. Even when trained for specialized roles, SLMs fail to learn all subtasks effectively. To address this issue, we introduce a simple progressive sub-task training strategy, which introduces new sub-tasks progressively in each training epoch. We find that this novel strategy, analogous to instance-level curriculum learning, consistently improves the effectiveness of multi-agent systems across all configurations. Our Pareto analysis shows that fine-tuned multi-agent systems yield better effectiveness-efficiency trade-offs. Additional ablations and analyses show the importance of our progressive training strategy and its ability to reduce subtask error rates.
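One way to picture a progressive sub-task schedule is sketched below. It assumes the simplest reading of the abstract, in which each epoch exposes one more sub-task than the previous one until all are included. The actual ProST schedule is not specified in the abstract, and the sub-task names are hypothetical.

```python
def progressive_schedule(subtasks, num_epochs):
    """List the sub-tasks trained on in each epoch.

    Assumption: epoch 0 sees only the first sub-task, and each later epoch
    adds the next one until the full trajectory is covered.
    """
    schedule = []
    for epoch in range(num_epochs):
        active = subtasks[: min(epoch + 1, len(subtasks))]
        schedule.append(list(active))
    return schedule

# Hypothetical sub-tasks of a single agent trajectory.
for epoch, active in enumerate(progressive_schedule(["login", "search", "update"], 4)):
    print(epoch, active)
```

Under this reading the model first masters short prefixes of a trajectory before facing the full sequence, which is what makes the approach resemble curriculum learning.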

Title: Scaling behavior of large language models in emotional safety classification across sizes and tasks

Authors: Edoardo Pinzuti, Oliver Tüscher, André Ferreira Castro
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.04512
Pdf URL: https://arxiv.org/pdf/2509.04512
Copy Paste: [[2509.04512]] Scaling behavior of large language models in emotional safety classification across sizes and tasks(https://arxiv.org/abs/2509.04512)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Understanding how large language models (LLMs) process emotionally sensitive content is critical for building safe and reliable systems, particularly in mental health contexts. We investigate the scaling behavior of LLMs on two key tasks: trinary classification of emotional safety (safe vs. unsafe vs. borderline) and multi-label classification using a six-category safety risk taxonomy. To support this, we construct a novel dataset by merging several human-authored mental health datasets (> 15K samples) and augmenting them with emotion re-interpretation prompts generated via ChatGPT. We evaluate four LLaMA models (1B, 3B, 8B, 70B) across zero-shot, few-shot, and fine-tuning settings. Our results show that larger LLMs achieve stronger average performance, particularly in nuanced multi-label classification and in zero-shot settings. However, lightweight fine-tuning allowed the 1B model to achieve performance comparable to larger models and BERT in several high-data categories, while requiring <2GB VRAM at inference. These findings suggest that smaller, on-device models can serve as viable, privacy-preserving alternatives for sensitive applications, offering the ability to interpret emotional context and maintain safe conversational boundaries. This work highlights key implications for therapeutic LLM applications and the scalable alignment of safety-critical systems.

Title: Mitigation of Gender and Ethnicity Bias in AI-Generated Stories through Model Explanations

Authors: Martha O. Dimgba, Sharon Oba, Ameeta Agrawal, Philippe J. Giabbanelli
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04515
Pdf URL: https://arxiv.org/pdf/2509.04515
Copy Paste: [[2509.04515]] Mitigation of Gender and Ethnicity Bias in AI-Generated Stories through Model Explanations(https://arxiv.org/abs/2509.04515)
Keywords: language model, gpt, prompt
Abstract: Language models have been shown to propagate social bias through their output, particularly in the representation of gender and ethnicity. This paper investigates gender and ethnicity biases in AI-generated occupational stories. Representation biases are measured before and after applying our proposed mitigation strategy, Bias Analysis and Mitigation through Explanation (BAME), revealing improvements in demographic representation ranging from 2% to 20%. BAME leverages model-generated explanations to inform targeted prompt engineering, effectively reducing biases without modifying model parameters. By analyzing stories generated across 25 occupational groups, three large language models (Claude 3.5 Sonnet, Llama 3.1 70B Instruct, and GPT-4 Turbo), and multiple demographic dimensions, we identify persistent patterns of overrepresentation and underrepresentation linked to training data stereotypes. Our findings demonstrate that guiding models with their own internal reasoning mechanisms can significantly enhance demographic parity, thereby contributing to the development of more transparent generative AI systems.

Title: Artificially Fluent: Swahili AI Performance Benchmarks Between English-Trained and Natively-Trained Datasets

Authors: Sophie Jaffer, Simeon Sayer
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2509.04516
Pdf URL: https://arxiv.org/pdf/2509.04516
Copy Paste: [[2509.04516]] Artificially Fluent: Swahili AI Performance Benchmarks Between English-Trained and Natively-Trained Datasets(https://arxiv.org/abs/2509.04516)
Keywords: language model, llm
Abstract: As large language models (LLMs) expand multilingual capabilities, questions remain about the equity of their performance across languages. While many communities stand to benefit from AI systems, the dominance of English in training data risks disadvantaging non-English speakers. To test the hypothesis that such data disparities may affect model performance, this study compares two monolingual BERT models: one trained and tested entirely on Swahili data, and another on comparable English news data. To simulate how multilingual LLMs process non-English queries through internal translation and abstraction, we translated the Swahili news data into English and evaluated it using the English-trained model. This approach tests the hypothesis by evaluating whether translating Swahili inputs for evaluation on an English model yields better or worse performance compared to training and testing a model entirely in Swahili, thus isolating the effect of language consistency versus cross-lingual abstraction. The results prove that, despite high-quality translation, the native Swahili-trained model performed better than the Swahili-to-English translated model, producing nearly four times fewer errors: 0.36% vs. 1.47% respectively. This gap suggests that translation alone does not bridge representational differences between languages and that models trained in one language may struggle to accurately interpret translated inputs due to imperfect internal knowledge representation, suggesting that native-language training remains important for reliable outcomes. In educational and informational contexts, even small performance gaps may compound inequality. Future research should focus on addressing broader dataset development for underrepresented languages and renewed attention to multilingual model evaluation, ensuring the reinforcing effect of global AI deployment on existing digital divides is reduced.

Title: Advancing SLM Tool-Use Capability using Reinforcement Learning

Authors: Dhruvi Paprunia, Vansh Kharidia, Pankti Doshi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.04518
Pdf URL: https://arxiv.org/pdf/2509.04518
Copy Paste: [[2509.04518]] Advancing SLM Tool-Use Capability using Reinforcement Learning(https://arxiv.org/abs/2509.04518)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) have progressed beyond simple text creation, and tool use has become increasingly important for complex, real-world tasks. Tool use in LLMs refers to their ability to utilize external resources such as APIs, databases, or software functions to extend their functionality beyond generating text. Tools are used for tasks such as performing calculations, making API calls to retrieve the current time and date, and more. This capability enables models to fetch real-time data, execute commands, or solve problems requiring dynamic interaction, making it indispensable for applications like AI agents in virtual assistants, robotic control, or automated workflows. However, while LLMs are usually adept at tool use, their vast resource requirements and computational complexity mean they cannot be deployed in every use case. As a result, there is an increasing need for more compact and efficient Small Language Models (SLMs). SLMs struggle with tool use compared to LLMs, as shown in Table 1. SLMs are typically trained on smaller, more specific datasets, resulting in a narrower knowledge base and limited contextual understanding compared to LLMs. This research addresses these challenges by using Reinforcement Learning (RL), specifically Group Relative Policy Optimization (GRPO), to enhance tool-use proficiency in SLMs. Unlike conventional fine-tuning approaches that require heavy computation and often lack adaptability, our method provides an efficient, effective solution that significantly boosts SLM tool-use accuracy, increasing their practical utility.

Title: Hierarchical Section Matching Prediction (HSMP) BERT for Fine-Grained Extraction of Structured Data from Hebrew Free-Text Radiology Reports in Crohn's Disease

Authors: Zvi Badash, Hadas Ben-Atya, Naama Gavrielov, Liam Hazan, Gili Focht, Ruth Cytter-Kuint, Talar Hagopian, Dan Turner, Moti Freiman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.04519
Pdf URL: https://arxiv.org/pdf/2509.04519
Copy Paste: [[2509.04519]] Hierarchical Section Matching Prediction (HSMP) BERT for Fine-Grained Extraction of Structured Data from Hebrew Free-Text Radiology Reports in Crohn's Disease(https://arxiv.org/abs/2509.04519)
Keywords: prompt
Abstract: Extracting structured clinical information from radiology reports is challenging, especially in low-resource languages. This is especially pronounced in Crohn's disease, where multi-organ findings are sparsely represented. We developed Hierarchical Structured Matching Prediction BERT (HSMP-BERT), a prompt-based model for extraction from Hebrew radiology text. In an administrative database study, we analyzed 9,683 reports from Crohn's patients imaged 2010-2023 across Israeli providers. A subset of 512 reports was radiologist-annotated for findings across six gastrointestinal organs and 15 pathologies, yielding 90 structured labels per subject. We used a multilabel-stratified split (66% train+validation; 33% test) to preserve label prevalence. Performance was evaluated with accuracy, F1, Cohen's κ, AUC, PPV, NPV, and recall. On 24 organ-finding combinations with >15 positives, HSMP-BERT achieved mean F1 0.83 ± 0.08 and κ 0.65 ± 0.17, outperforming the SMP zero-shot baseline (F1 0.49 ± 0.07, κ 0.06 ± 0.07) and standard fine-tuning (F1 0.30 ± 0.27, κ 0.27 ± 0.34; paired t-test p < 10^-7). Hierarchical inference cuts runtime 5.1× vs. traditional inference. Applied to all reports, it revealed associations among ileal wall thickening, stenosis, and pre-stenotic dilatation, plus age- and sex-specific trends in inflammatory findings. HSMP-BERT offers a scalable solution for structured extraction in radiology, enabling population-level analysis of Crohn's disease and demonstrating AI's potential in low-resource settings.

Title: Using LLMs to create analytical datasets: A case study of reconstructing the historical memory of Colombia

Authors: David Anderson, Galia Benitez, Margret Bjarnadottir, Shriyan Reyya
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2509.04523
Pdf URL: https://arxiv.org/pdf/2509.04523
Copy Paste: [[2509.04523]] Using LLMs to create analytical datasets: A case study of reconstructing the historical memory of Colombia(https://arxiv.org/abs/2509.04523)
Keywords: language model, gpt, llm
Abstract: Colombia has been submerged in decades of armed conflict, yet until recently, the systematic documentation of violence was not a priority for the Colombian government. This has resulted in a lack of publicly available conflict information and, consequently, a lack of historical accounts. This study contributes to Colombia's historical memory by utilizing GPT, a large language model (LLM), to read and answer questions about over 200,000 violence-related newspaper articles in Spanish. We use the resulting dataset to conduct both descriptive analysis and a study of the relationship between violence and the eradication of coca crops, offering an example of policy analyses that such data can support. Our study demonstrates how LLMs have opened new research opportunities by enabling examinations of large text corpora at a previously infeasible depth.

Title: Quantized Large Language Models in Biomedical Natural Language Processing: Evaluation and Recommendation

Authors: Zaifu Zhan, Shuang Zhou, Min Zeng, Kai Yu, Meijia Song, Xiaoyi Chen, Jun Wang, Yu Hou, Rui Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04534
Pdf URL: https://arxiv.org/pdf/2509.04534
Copy Paste: [[2509.04534]] Quantized Large Language Models in Biomedical Natural Language Processing: Evaluation and Recommendation(https://arxiv.org/abs/2509.04534)
Keywords: language model, prompt
Abstract: Large language models have demonstrated remarkable capabilities in biomedical natural language processing, yet their rapid growth in size and computational requirements present a major barrier to adoption in healthcare settings where data privacy precludes cloud deployment and resources are limited. In this study, we systematically evaluated the impact of quantization on 12 state-of-the-art large language models, including both general-purpose and biomedical-specific models, across eight benchmark datasets covering four key tasks: named entity recognition, relation extraction, multi-label classification, and question answering. We show that quantization substantially reduces GPU memory requirements, by up to 75%, while preserving model performance across diverse tasks, enabling the deployment of 70B-parameter models on 40GB consumer-grade GPUs. In addition, domain-specific knowledge and responsiveness to advanced prompting methods are largely maintained. These findings provide significant practical and guiding value, highlighting quantization as a practical and effective strategy for enabling the secure, local deployment of large yet high-capacity language models in biomedical contexts, bridging the gap between technical advances in AI and real-world clinical translation.
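
The memory figures in this abstract can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes a 16-bit baseline and 4-bit quantized weights (the abstract states neither precision) and counts weights only, ignoring activations, the KV cache, and any per-group quantization overhead. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights only, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


params_70b = 70e9  # a 70B-parameter model

# Assumed precisions: 16-bit baseline, 4-bit after quantization.
fp16_gb = weight_memory_gb(params_70b, 16)
int4_gb = weight_memory_gb(params_70b, 4)
reduction = 1 - int4_gb / fp16_gb

print(f"16-bit: {fp16_gb:.0f} GB, 4-bit: {int4_gb:.0f} GB, reduction: {reduction:.0%}")
# prints "16-bit: 140 GB, 4-bit: 35 GB, reduction: 75%"
```

Under these assumptions the 4-bit weights fit within a 40GB budget with a few GB to spare, which is consistent with the 75% reduction and the 70B-on-40GB claim, although real deployments also need room for activations and cache.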

Title: Manipulating Transformer-Based Models: Controllability, Steerability, and Robust Interventions

Authors: Faruk Alpay, Taylan Alpay
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04549
Pdf URL: https://arxiv.org/pdf/2509.04549
Copy Paste: [[2509.04549]] Manipulating Transformer-Based Models: Controllability, Steerability, and Robust Interventions(https://arxiv.org/abs/2509.04549)
Keywords: language model, prompt
Abstract: Transformer-based language models excel in NLP tasks, but fine-grained control remains challenging. This paper explores methods for manipulating transformer models through principled interventions at three levels: prompts, activations, and weights. We formalize controllable text generation as an optimization problem addressable via prompt engineering, parameter-efficient fine-tuning, model editing, and reinforcement learning. We introduce a unified framework encompassing prompt-level steering, activation interventions, and weight-space edits. We analyze robustness and safety implications, including adversarial attacks and alignment mitigations. Theoretically, we show minimal weight updates can achieve targeted behavior changes with limited side-effects. Empirically, we demonstrate >90% success in sentiment control and factual edits while preserving base performance, though generalization-specificity trade-offs exist. We discuss ethical dual-use risks and the need for rigorous evaluation. This work lays groundwork for designing controllable and robust language models.

Title: Sample-efficient Integration of New Modalities into Large Language Models

Authors: Osman Batur İnce, André F. T. Martins, Oisin Mac Aodha, Edoardo M. Ponti
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2509.04606
Pdf URL: https://arxiv.org/pdf/2509.04606
Copy Paste: [[2509.04606]] Sample-efficient Integration of New Modalities into Large Language Models(https://arxiv.org/abs/2509.04606)
Keywords: language model, llm
Abstract: Multimodal foundation models can process several modalities. However, since the space of possible modalities is large and evolving over time, training a model from scratch to encompass all modalities is unfeasible. Moreover, integrating a modality into a pre-existing foundation model currently requires a significant amount of paired data, which is often not available for low-resource modalities. In this paper, we introduce a method for sample-efficient modality integration (SEMI) into Large Language Models (LLMs). To this end, we devise a hypernetwork that can adapt a shared projector -- placed between modality-specific encoders and an LLM -- to any modality. The hypernetwork, trained on high-resource modalities (i.e., text, speech, audio, video), is conditioned on a few samples from any arbitrary modality at inference time to generate a suitable adapter. To increase the diversity of training modalities, we artificially multiply the number of encoders through isometric transformations. We find that SEMI achieves a significant boost in sample efficiency during few-shot integration of new modalities (i.e., satellite images, astronomical images, inertial measurements, and molecules) with encoders of arbitrary embedding dimensionality. For instance, to reach the same accuracy as 32-shot SEMI, training the projector from scratch needs 64$\times$ more data. As a result, SEMI holds promise to extend the modality coverage of foundation models.

Title: Breaking to Build: A Threat Model of Prompt-Based Attacks for Securing LLMs

Authors: Brennen Hill, Surendra Parla, Venkata Abhijeeth Balabhadruni, Atharv Prajod Padmalayam, Sujay Chandra Shekara Sharma
Subjects: cs.CL, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2509.04615
Pdf URL: https://arxiv.org/pdf/2509.04615
Copy Paste: [[2509.04615]] Breaking to Build: A Threat Model of Prompt-Based Attacks for Securing LLMs(https://arxiv.org/abs/2509.04615)
Keywords: language model, llm, prompt
Abstract: The proliferation of Large Language Models (LLMs) has introduced critical security challenges, where adversarial actors can manipulate input prompts to cause significant harm and circumvent safety alignments. These prompt-based attacks exploit vulnerabilities in a model's design, training, and contextual understanding, leading to intellectual property theft, misinformation generation, and erosion of user trust. A systematic understanding of these attack vectors is the foundational step toward developing robust countermeasures. This paper presents a comprehensive literature survey of prompt-based attack methodologies, categorizing them to provide a clear threat model. By detailing the mechanisms and impacts of these exploits, this survey aims to inform the research community's efforts in building the next generation of secure LLMs that are inherently resistant to unauthorized distillation, fine-tuning, and editing.

Title: Polysemantic Dropout: Conformal OOD Detection for Specialized LLMs

Authors: Ayush Gupta, Ramneet Kaur, Anirban Roy, Adam D. Cobb, Rama Chellappa, Susmit Jha
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04655
Pdf URL: https://arxiv.org/pdf/2509.04655
Copy Paste: [[2509.04655]] Polysemantic Dropout: Conformal OOD Detection for Specialized LLMs(https://arxiv.org/abs/2509.04655)
Keywords: language model, llm
Abstract: We propose a novel inference-time out-of-domain (OOD) detection algorithm for specialized large language models (LLMs). Despite achieving state-of-the-art performance on in-domain tasks through fine-tuning, specialized LLMs remain vulnerable to incorrect or unreliable outputs when presented with OOD inputs, posing risks in critical applications. Our method leverages the Inductive Conformal Anomaly Detection (ICAD) framework, using a new non-conformity measure based on the model's dropout tolerance. Motivated by recent findings on polysemanticity and redundancy in LLMs, we hypothesize that in-domain inputs exhibit higher dropout tolerance than OOD inputs. We aggregate dropout tolerance across multiple layers via a valid ensemble approach, improving detection while maintaining theoretical false alarm bounds from ICAD. Experiments with medical-specialized LLMs show that our approach detects OOD inputs better than baseline methods, with AUROC improvements of $2\%$ to $37\%$ when treating OOD datapoints as positives and in-domain test datapoints as negatives.
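
For readers unfamiliar with inductive conformal anomaly detection, the decision rule can be sketched in a few lines. This is the generic ICAD p-value test, not the authors' implementation: the non-conformity scores here are placeholder numbers, whereas the paper derives them from dropout tolerance (a lower tolerance would presumably map to a higher score). The names `conformal_p_value`, `is_ood`, and `epsilon` are illustrative.

```python
def conformal_p_value(calibration_scores, test_score):
    """Share of calibration scores at least as non-conforming as the test score.

    The +1 terms count the test point itself, which is what gives the
    false-alarm guarantee under exchangeability.
    """
    n = len(calibration_scores)
    at_least_as_strange = sum(1 for s in calibration_scores if s >= test_score)
    return (at_least_as_strange + 1) / (n + 1)


def is_ood(calibration_scores, test_score, epsilon=0.05):
    """Flag the input as out-of-domain when its p-value falls below epsilon."""
    return conformal_p_value(calibration_scores, test_score) < epsilon


# Placeholder calibration scores from held-out in-domain inputs (hypothetical).
calibration = [i / 100 for i in range(1, 100)]

print(conformal_p_value(calibration, 0.50))  # typical score, large p-value
print(is_ood(calibration, 0.50))             # not flagged
print(is_ood(calibration, 1.50))             # more extreme than all calibration scores, flagged
```

When calibration and in-domain test inputs are exchangeable, the probability of flagging an in-domain input is at most `epsilon`, which is the kind of false alarm bound the abstract refers to.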

Title: AraHalluEval: A Fine-grained Hallucination Evaluation Framework for Arabic LLMs

Authors: Aisha Alansari, Hamzah Luqman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.04656
Pdf URL: https://arxiv.org/pdf/2509.04656
Copy Paste: [[2509.04656]] AraHalluEval: A Fine-grained Hallucination Evaluation Framework for Arabic LLMs(https://arxiv.org/abs/2509.04656)
Keywords: language model, llm, hallucination
Abstract: Recently, extensive research on hallucination in large language models (LLMs) has mainly focused on the English language. Despite the growing number of multilingual and Arabic-specific LLMs, evaluating LLMs' hallucination in the Arabic context remains relatively underexplored. The knowledge gap is particularly pressing given Arabic's widespread use across many regions and its importance in global communication and media. This paper presents the first comprehensive hallucination evaluation of Arabic and multilingual LLMs on two critical Arabic natural language generation tasks: generative question answering (GQA) and summarization. This study evaluates a total of 12 LLMs, including 4 Arabic pre-trained models, 4 multilingual models, and 4 reasoning-based models. To assess the factual consistency and faithfulness of LLMs' outputs, we developed a fine-grained hallucination evaluation framework consisting of 12 fine-grained hallucination indicators that represent the varying characteristics of each task. The results reveal that factual hallucinations are more prevalent than faithfulness errors across all models and tasks. Notably, the Arabic pre-trained model Allam consistently demonstrates lower hallucination rates than multilingual models and performance comparable to reasoning-based models. The code is available at this https URL (GitHub link).

Title: Evaluating NL2SQL via SQL2NL

Authors: Mohammadtaher Safarzadeh, Afshin Oroojlooyjadid, Dan Roth
Subjects: cs.CL, cs.AI, cs.DB, cs.LG
Abstract URL: https://arxiv.org/abs/2509.04657
Pdf URL: https://arxiv.org/pdf/2509.04657
Copy Paste: [[2509.04657]] Evaluating NL2SQL via SQL2NL(https://arxiv.org/abs/2509.04657)
Keywords: gpt
Abstract: Robust evaluation in the presence of linguistic variation is key to understanding the generalization capabilities of Natural Language to SQL (NL2SQL) models, yet existing benchmarks rarely address this factor in a systematic or controlled manner. We propose a novel schema-aligned paraphrasing framework that leverages SQL-to-NL (SQL2NL) to automatically generate semantically equivalent, lexically diverse queries while maintaining alignment with the original schema and intent. This enables the first targeted evaluation of NL2SQL robustness to linguistic variation in isolation, distinct from prior work that primarily investigates ambiguity or schema perturbations. Our analysis reveals that state-of-the-art models are far more brittle than standard benchmarks suggest. For example, LLaMa3.3-70B exhibits a 10.23% drop in execution accuracy (from 77.11% to 66.9%) on paraphrased Spider queries, while LLaMa3.1-8B suffers an even larger drop of nearly 20% (from 62.9% to 42.5%). Smaller models (e.g., GPT-4o mini) are disproportionately affected. We also find that robustness degradation varies significantly with query complexity, dataset, and domain -- highlighting the need for evaluation frameworks that explicitly measure linguistic generalization to ensure reliable performance in real-world settings.

Title: Why Language Models Hallucinate

Authors: Adam Tauman Kalai, Ofir Nachum, Santosh S. Vempala, Edwin Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.04664
Pdf URL: https://arxiv.org/pdf/2509.04664
Copy Paste: [[2509.04664]] Why Language Models Hallucinate(https://arxiv.org/abs/2509.04664)
Keywords: language model, hallucination
Abstract: Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty. Such "hallucinations" persist even in state-of-the-art systems and undermine trust. We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty, and we analyze the statistical causes of hallucinations in the modern training pipeline. Hallucinations need not be mysterious -- they originate simply as errors in binary classification. If incorrect statements cannot be distinguished from facts, then hallucinations in pretrained language models will arise through natural statistical pressures. We then argue that hallucinations persist due to the way most evaluations are graded -- language models are optimized to be good test-takers, and guessing when uncertain improves test performance. This "epidemic" of penalizing uncertain responses can only be addressed through a socio-technical mitigation: modifying the scoring of existing benchmarks that are misaligned but dominate leaderboards, rather than introducing additional hallucination evaluations. This change may steer the field toward more trustworthy AI systems.
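
The incentive argument in this abstract can be made concrete with a small expected-score calculation. The sketch below is illustrative only: the abstract calls for modifying benchmark scoring but does not specify a scheme, so the penalty for wrong answers (`wrong_penalty`) and the function names are assumptions. A correct answer earns 1 point, abstaining earns 0, and a wrong answer costs `wrong_penalty` points.

```python
def expected_score(p_correct, wrong_penalty):
    """Expected score of answering: +1 if right, -wrong_penalty if wrong."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def should_answer(p_correct, wrong_penalty):
    """Answer only if the expected score beats abstaining, which scores 0."""
    return expected_score(p_correct, wrong_penalty) > 0.0


# Binary grading (no penalty): guessing is never worse than abstaining,
# so a score-maximizing model guesses even at 30% confidence.
print(should_answer(0.3, wrong_penalty=0.0))

# With a penalty of 3 points per wrong answer, answering pays off only
# when confidence exceeds 3 / (1 + 3) = 0.75.
print(should_answer(0.3, wrong_penalty=3.0))
print(should_answer(0.9, wrong_penalty=3.0))
```

In general, with penalty `k` the break-even confidence is `k / (1 + k)`, so the penalty acts as an explicit confidence threshold below which "I don't know" becomes the better response.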

Title: ODKE+: Ontology-Guided Open-Domain Knowledge Extraction with LLMs

Authors: Samira Khorshidi, Azadeh Nikfarjam, Suprita Shankar, Yisi Sang, Yash Govind, Hyun Jang, Ali Kasgari, Alexis McClimans, Mohamed Soliman, Vishnu Konda, Ahmed Fakhry, Xiaoguang Qi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04696
Pdf URL: https://arxiv.org/pdf/2509.04696
Copy Paste: [[2509.04696]] ODKE+: Ontology-Guided Open-Domain Knowledge Extraction with LLMs(https://arxiv.org/abs/2509.04696)
Keywords: language model, llm, prompt
Abstract: Knowledge graphs (KGs) are foundational to many AI applications, but maintaining their freshness and completeness remains costly. We present ODKE+, a production-grade system that automatically extracts and ingests millions of open-domain facts from web sources with high precision. ODKE+ combines modular components into a scalable pipeline: (1) the Extraction Initiator detects missing or stale facts, (2) the Evidence Retriever collects supporting documents, (3) hybrid Knowledge Extractors apply both pattern-based rules and ontology-guided prompting for large language models (LLMs), (4) a lightweight Grounder validates extracted facts using a second LLM, and (5) the Corroborator ranks and normalizes candidate facts for ingestion. ODKE+ dynamically generates ontology snippets tailored to each entity type to align extractions with schema constraints, enabling scalable, type-consistent fact extraction across 195 predicates. The system supports batch and streaming modes, processing over 9 million Wikipedia pages and ingesting 19 million high-confidence facts with 98.8% precision. ODKE+ significantly improves coverage over traditional methods, achieving up to 48% overlap with third-party KGs and reducing update lag by 50 days on average. Our deployment demonstrates that LLM-based extraction, grounded in ontological structure and verification workflows, can deliver trustworthy, production-scale knowledge ingestion with broad real-world applicability. A recording of the system demonstration is included with the submission and is also available at this https URL.

Title: KERAG: Knowledge-Enhanced Retrieval-Augmented Generation for Advanced Question Answering

Authors: Yushi Sun, Kai Sun, Yifan Ethan Xu, Xiao Yang, Xin Luna Dong, Nan Tang, Lei Chen
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2509.04716
Pdf URL: https://arxiv.org/pdf/2509.04716
Copy Paste: [[2509.04716]] KERAG: Knowledge-Enhanced Retrieval-Augmented Generation for Advanced Question Answering(https://arxiv.org/abs/2509.04716)
Keywords: language model, gpt, llm, hallucination, retrieval-augmented generation, chain-of-thought
Abstract: Retrieval-Augmented Generation (RAG) mitigates hallucination in Large Language Models (LLMs) by incorporating external data, with Knowledge Graphs (KGs) offering crucial information for question answering. Traditional Knowledge Graph Question Answering (KGQA) methods rely on semantic parsing, which typically retrieves only the knowledge strictly necessary for answer generation, and thus often suffer from low coverage due to rigid schema requirements and semantic ambiguity. We present KERAG, a novel KG-based RAG pipeline that enhances QA coverage by retrieving a broader subgraph likely to contain relevant information. Our retrieval-filtering-summarization approach, combined with fine-tuned LLMs for Chain-of-Thought reasoning on knowledge sub-graphs, reduces noise and improves QA for both simple and complex questions. Experiments demonstrate that KERAG surpasses state-of-the-art solutions by about 7% in quality and exceeds GPT-4o (Tool) by 10-21%.

Title: A Study of Large Language Models for Patient Information Extraction: Model Architecture, Fine-Tuning Strategy, and Multi-task Instruction Tuning

Authors: Cheng Peng, Xinyu Dong, Mengxian Lyu, Daniel Paredes, Yaoyun Zhang, Yonghui Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04753
Pdf URL: https://arxiv.org/pdf/2509.04753
Copy Paste: [[2509.04753]] A Study of Large Language Models for Patient Information Extraction: Model Architecture, Fine-Tuning Strategy, and Multi-task Instruction Tuning(https://arxiv.org/abs/2509.04753)
Keywords: language model, gpt, llm, prompt
Abstract: Natural language processing (NLP) is a key technology to extract important patient information from clinical narratives to support healthcare applications. The rapid development of large language models (LLMs) has revolutionized many NLP tasks in the clinical domain, yet their optimal use in patient information extraction tasks requires further exploration. This study examines LLMs' effectiveness in patient information extraction, focusing on LLM architectures, fine-tuning strategies, and multi-task instruction tuning techniques for developing robust and generalizable patient information extraction systems. This study aims to explore key concepts of using LLMs for clinical concept and relation extraction tasks, including: (1) encoder-only or decoder-only LLMs, (2) prompt-based parameter-efficient fine-tuning (PEFT) algorithms, and (3) multi-task instruction tuning on few-shot learning performance. We benchmarked a suite of LLMs, including encoder-based LLMs (BERT, GatorTron) and decoder-based LLMs (GatorTronGPT, Llama 3.1, GatorTronLlama), across five datasets. We compared traditional full-size fine-tuning and prompt-based PEFT. We explored a multi-task instruction tuning framework that combines both tasks across four datasets to evaluate the zero-shot and few-shot learning performance using the leave-one-dataset-out strategy.

Title: Research on Multi-hop Inference Optimization of LLM Based on MQUAKE Framework

Authors: Zucheng Liang, Wenxin Wei, Kaijie Zhang, Hongyi Chen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.04770
Pdf URL: https://arxiv.org/pdf/2509.04770
Copy Paste: [[2509.04770]] Research on Multi-hop Inference Optimization of LLM Based on MQUAKE Framework(https://arxiv.org/abs/2509.04770)
Keywords: language model, llm
Abstract: Accurately answering complex questions has consistently been a significant challenge for Large Language Models (LLMs). To address this, this paper proposes a multi-hop question decomposition method for complex questions, building upon research within the MQUAKE framework. Utilizing the LLAMA3 model, we systematically investigate the impact of multi-hop question decomposition within knowledge graphs on model comprehension and reasoning accuracy, both before and after model training. In our experiments, we systematically partitioned and converted the MQUAKE-T dataset into two distinct formats: a single-hop dataset designed for directly answering complex questions, and a multi-hop dataset constructed using the multi-hop question decomposition method. We then fine-tuned the LLAMA3 model on these datasets and conducted inference tests. Our results demonstrate that, without fine-tuning the LLM, the prediction performance based on the multi-hop question decomposition method significantly outperforms the method of directly answering complex questions. After fine-tuning using the LoRA (Low-Rank Adaptation) method, the performance of both approaches improved compared to the untrained baseline. Crucially, the method utilizing multi-hop decomposition consistently maintained its superiority. These findings validate the effectiveness of the multi-hop decomposition method both before and after training, demonstrating its capability to effectively enhance the LLM's ability to answer complex questions.

Title: Decoders Laugh as Loud as Encoders

Authors: Eli Borodach, Raj Dandekar, Rajat Dandekar, Sreedath Panat
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04779
Pdf URL: https://arxiv.org/pdf/2509.04779
Copy Paste: [[2509.04779]] Decoders Laugh as Loud as Encoders(https://arxiv.org/abs/2509.04779)
Keywords: language model, gpt, llm
Abstract: From the dawn of the computer, Alan Turing dreamed of a robot that could communicate using language like a human being. The recent advances in the field of Large Language Models (LLMs) shocked the scientific community when a single model could be applied to various natural language processing (NLP) tasks, while the output results are sometimes even better than most human communication skills. Models such as GPT, Claude, Grok, etc. have left their mark on the scientific community. However, it is unclear how much these models understand what they produce, especially in a nuanced theme such as humor. The question of whether computers understand humor is still open (among the decoders, the latest to be checked was GPT-2). We addressed this issue in this paper; we have shown that a fine-tuned decoder (GPT-4o) performed as well (mean F1-macro score of 0.85) as the best fine-tuned encoder (RoBERTa, with a mean F1-score of 0.86).

Title: Enhancing Diversity in Large Language Models via Determinantal Point Processes

Authors: Yilei Chen, Souradip Chakraborty, Lorenz Wolf, Ioannis Ch. Paschalidis, Aldo Pacchiano
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04784
Pdf URL: https://arxiv.org/pdf/2509.04784
Copy Paste: [[2509.04784]] Enhancing Diversity in Large Language Models via Determinantal Point Processes(https://arxiv.org/abs/2509.04784)
Keywords: language model, llm, prompt
Abstract: Supervised fine-tuning and reinforcement learning are two popular methods for post-training large language models (LLMs). While improving the model's performance on downstream tasks, they often reduce the model's output diversity, leading to narrow, canonical responses. Existing methods to enhance diversity are limited, either by operating at inference time or by focusing on lexical differences. We propose a novel training method named DQO based on determinantal point processes (DPPs) to jointly optimize LLMs for quality and semantic diversity. Our approach samples and embeds a group of responses for each prompt, then uses the determinant of a kernel-based similarity matrix to measure diversity as the volume spanned by the embeddings of these responses. Experiments across instruction-following, summarization, story generation, and reasoning tasks demonstrate that our method substantially improves semantic diversity without sacrificing model quality.
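
The determinant-as-volume idea in this abstract can be illustrated with a toy calculation. The sketch below is not the DQO training objective: it assumes a plain cosine-similarity kernel over hand-written 3-dimensional "embeddings", and every function name is hypothetical. For unit vectors, the determinant of the Gram matrix equals the squared volume of the parallelepiped they span, so near-duplicate responses give a value close to 0 and mutually orthogonal ones give 1.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def gram_matrix(vectors):
    """Cosine-similarity (Gram) matrix of the normalized vectors."""
    unit = [normalize(v) for v in vectors]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]


def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0  # singular: the vectors span zero volume
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def diversity(embeddings):
    """Squared volume spanned by the normalized embeddings."""
    return determinant(gram_matrix(embeddings))


orthogonal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
near_duplicates = [[1.0, 0.0, 0.0], [1.0, 0.01, 0.0], [1.0, 0.0, 0.01]]

print(diversity(orthogonal))       # maximally diverse set
print(diversity(near_duplicates))  # almost collinear, value near zero
```

A training method along these lines would need a differentiable version of this quantity computed on learned sentence embeddings, combined with a quality term; the toy above only shows why the determinant rewards semantic spread rather than lexical variation.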

Title: Personality as a Probe for LLM Evaluation: Method Trade-offs and Downstream Effects

Authors: Gunmay Handa, Zekun Wu, Adriano Koshiyama, Philip Treleaven
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.04794
Pdf URL: https://arxiv.org/pdf/2509.04794
Copy Paste: [[2509.04794]] Personality as a Probe for LLM Evaluation: Method Trade-offs and Downstream Effects(https://arxiv.org/abs/2509.04794)
Keywords: language model, llm, agent
Abstract: Personality manipulation in large language models (LLMs) is increasingly applied in customer service and agentic scenarios, yet its mechanisms and trade-offs remain unclear. We present a systematic study of personality control using the Big Five traits, comparing in-context learning (ICL), parameter-efficient fine-tuning (PEFT), and mechanistic steering (MS). Our contributions are fourfold. First, we construct a contrastive dataset with balanced high/low trait responses, enabling effective steering vector computation and fair cross-method evaluation. Second, we introduce a unified evaluation framework based on within-run $\Delta$ analysis that disentangles reasoning capability, agent performance, and demographic bias across MMLU, GAIA, and BBQ benchmarks. Third, we develop trait purification techniques to separate openness from conscientiousness, addressing representational overlap in trait encoding. Fourth, we propose a three-level stability framework that quantifies method-, trait-, and combination-level robustness, offering practical guidance under deployment constraints. Experiments on Gemma-2-2B-IT and LLaMA-3-8B-Instruct reveal clear trade-offs: ICL achieves strong alignment with minimal capability loss, PEFT delivers the highest alignment at the cost of degraded task performance, and MS provides lightweight runtime control with competitive effectiveness. Trait-level analysis shows openness as uniquely challenging, agreeableness as most resistant to ICL, and personality encoding consolidating around intermediate layers. Taken together, these results establish personality manipulation as a multi-level probe into behavioral representation, linking surface conditioning, parameter encoding, and activation-level steering, and positioning mechanistic steering as a lightweight alternative to fine-tuning for both deployment and interpretability.
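
To show how a contrastive high/low dataset can yield a steering vector, here is a minimal sketch of the common difference-of-means recipe. The abstract does not say which recipe the authors use, so treat this as an assumption; the activations are toy 2-dimensional lists rather than real hidden states, and `steering_vector`, `apply_steering`, and `alpha` are hypothetical names.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(high_acts, low_acts):
    """Difference of mean activations: high-trait minus low-trait responses."""
    mean_high = mean_vector(high_acts)
    mean_low = mean_vector(low_acts)
    return [h - l for h, l in zip(mean_high, mean_low)]


def apply_steering(activation, vector, alpha=1.0):
    """Shift one activation along the steering direction, scaled by alpha."""
    return [a + alpha * v for a, v in zip(activation, vector)]


# Toy activations at a single layer (hypothetical values).
high_trait = [[1.0, 2.0], [3.0, 4.0]]
low_trait = [[0.0, 0.0], [2.0, 2.0]]

vector = steering_vector(high_trait, low_trait)
print(vector)                                        # [1.0, 2.0]
print(apply_steering([0.0, 0.0], vector, alpha=2.0))  # [2.0, 4.0]
```

A balanced dataset matters for this recipe because the two means are only comparable when the high and low sets differ in the trait and not in topic or length; a negative `alpha` would push the activation toward the low-trait direction instead.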

Title: Knowledge Collapse in LLMs: When Fluency Survives but Facts Fail under Recursive Synthetic Training

Authors: Figarri Keisha, Zekun Wu, Ze Wang, Adriano Koshiyama, Philip Treleaven
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.04796
Pdf URL: https://arxiv.org/pdf/2509.04796
Copy Paste: [[2509.04796]] Knowledge Collapse in LLMs: When Fluency Survives but Facts Fail under Recursive Synthetic Training(https://arxiv.org/abs/2509.04796)
Keywords: language model, llm, prompt
Abstract: Large language models increasingly rely on synthetic data due to human-written content scarcity, yet recursive training on model-generated outputs leads to model collapse, a degenerative process threatening factual reliability. We define knowledge collapse as a distinct three-stage phenomenon where factual accuracy deteriorates while surface fluency persists, creating "confidently wrong" outputs that pose critical risks in accuracy-dependent domains. Through controlled experiments with recursive synthetic training, we demonstrate that collapse trajectory and timing depend critically on instruction format, distinguishing instruction-following collapse from traditional model collapse through its conditional, prompt-dependent nature. We propose domain-specific synthetic training as a targeted mitigation strategy that achieves substantial improvements in collapse resistance while maintaining computational efficiency. Our evaluation framework combines model-centric indicators with task-centric metrics to detect distinct degradation phases, enabling reproducible assessment of epistemic deterioration across different language models. These findings provide both theoretical insights into collapse dynamics and practical guidance for sustainable AI training in knowledge-intensive applications where accuracy is paramount.

Title: Mind the Gap: Evaluating Model- and Agentic-Level Vulnerabilities in LLMs with Action Graphs

Authors: Ilham Wicaksono, Zekun Wu, Theo King, Adriano Koshiyama, Philip Treleaven
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.04802
Pdf URL: https://arxiv.org/pdf/2509.04802
Copy Paste: [[2509.04802]] Mind the Gap: Evaluating Model- and Agentic-Level Vulnerabilities in LLMs with Action Graphs(https://arxiv.org/abs/2509.04802)
Keywords: language model, gpt, llm, agent
Abstract: As large language models transition to agentic systems, current safety evaluation frameworks face critical gaps in assessing deployment-specific risks. We introduce AgentSeer, an observability-based evaluation framework that decomposes agentic executions into granular action and component graphs, enabling systematic agentic-situational assessment. Through cross-model validation on GPT-OSS-20B and Gemini-2.0-flash using HarmBench single turn and iterative refinement attacks, we demonstrate fundamental differences between model-level and agentic-level vulnerability profiles. Model-level evaluation reveals baseline differences: GPT-OSS-20B (39.47% ASR) versus Gemini-2.0-flash (50.00% ASR), with both models showing susceptibility to social engineering while maintaining logic-based attack resistance. However, agentic-level assessment exposes agent-specific risks invisible to traditional evaluation. We discover "agentic-only" vulnerabilities that emerge exclusively in agentic contexts, with tool-calling showing 24-60% higher ASR across both models. Cross-model analysis reveals universal agentic patterns, agent transfer operations as highest-risk tools, semantic rather than syntactic vulnerability mechanisms, and context-dependent attack effectiveness, alongside model-specific security profiles in absolute ASR levels and optimal injection strategies. Direct attack transfer from model-level to agentic contexts shows degraded performance (GPT-OSS-20B: 57% human injection ASR; Gemini-2.0-flash: 28%), while context-aware iterative attacks successfully compromise objectives that failed at model-level, confirming systematic evaluation gaps. These findings establish the urgent need for agentic-situation evaluation paradigms, with AgentSeer providing the standardized methodology and empirical validation.

Title: AFD-SLU: Adaptive Feature Distillation for Spoken Language Understanding

Authors: Yan Xie, Yibo Cui, Liang Xie, Erwei Yin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.04821
Pdf URL: https://arxiv.org/pdf/2509.04821
Copy Paste: [[2509.04821]] AFD-SLU: Adaptive Feature Distillation for Spoken Language Understanding(https://arxiv.org/abs/2509.04821)
Keywords: language model, llm
Abstract: Spoken Language Understanding (SLU) is a core component of conversational systems, enabling machines to interpret user utterances. Despite its importance, developing effective SLU systems remains challenging due to the scarcity of labeled training data and the computational burden of deploying Large Language Models (LLMs) in real-world applications. To further alleviate these issues, we propose an Adaptive Feature Distillation framework that transfers rich semantic representations from a General Text Embeddings (GTE)-based teacher model to a lightweight student model. Our method introduces a dynamic adapter equipped with a Residual Projection Neural Network (RPNN) to align heterogeneous feature spaces, and a Dynamic Distillation Coefficient (DDC) that adaptively modulates the distillation strength based on real-time feedback from intent and slot prediction performance. Experiments on the Chinese profile-based ProSLU benchmark demonstrate that AFD-SLU achieves state-of-the-art results, with 95.67% intent accuracy, 92.02% slot F1 score, and 85.50% overall accuracy.

Title: Memorization $\neq$ Understanding: Do Large Language Models Have the Ability of Scenario Cognition?

Authors: Boxiang Ma, Ru Li, Yuanlong Wang, Hongye Tan, Xiaoli Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.04866
Pdf URL: https://arxiv.org/pdf/2509.04866
Copy Paste: [[2509.04866]] Memorization $\neq$ Understanding: Do Large Language Models Have the Ability of Scenario Cognition?(https://arxiv.org/abs/2509.04866)
Keywords: language model, llm
Abstract: Driven by vast and diverse textual data, large language models (LLMs) have demonstrated impressive performance across numerous natural language processing (NLP) tasks. Yet, a critical question persists: does their generalization arise from mere memorization of training data or from deep semantic understanding? To investigate this, we propose a bi-perspective evaluation framework to assess LLMs' scenario cognition - the ability to link semantic scenario elements with their arguments in context. Specifically, we introduce a novel scenario-based dataset comprising diverse textual descriptions of fictional facts, annotated with scenario elements. LLMs are evaluated through their capacity to answer scenario-related questions (model output perspective) and via probing their internal representations for encoded scenario elements-argument associations (internal representation perspective). Our experiments reveal that current LLMs predominantly rely on superficial memorization, failing to achieve robust semantic scenario cognition, even in simple cases. These findings expose critical limitations in LLMs' semantic understanding and offer cognitive insights for advancing their capabilities.

Title: Using LLMs for Multilingual Clinical Entity Linking to ICD-10

Authors: Sylvia Vassileva, Ivan Koychev, Svetla Boytcheva
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.04868
Pdf URL: https://arxiv.org/pdf/2509.04868
Copy Paste: [[2509.04868]] Using LLMs for Multilingual Clinical Entity Linking to ICD-10(https://arxiv.org/abs/2509.04868)
Keywords: language model, gpt, llm
Abstract: The linking of clinical entities is a crucial part of extracting structured information from clinical texts. It is the process of assigning a code from a medical ontology or classification to a phrase in the text. The International Classification of Diseases, 10th revision (ICD-10), is an international standard for classifying diseases for statistical and insurance purposes. Automatically assigning the correct ICD-10 code to terms in discharge summaries will simplify the work of healthcare professionals and ensure consistent coding in hospitals. Our paper proposes an approach for linking clinical terms to ICD-10 codes in different languages using Large Language Models (LLMs). The approach consists of a multistage pipeline that uses clinical dictionaries to match unambiguous terms in the text and then applies in-context learning with GPT-4.1 to predict the ICD-10 code for the terms that do not match the dictionary. Our system shows promising results in predicting ICD-10 codes on different benchmark datasets in Spanish (0.89 F1 for categories and 0.78 F1 for subcategories on CodiEsp) and Greek (0.85 F1 on ElCardioCC).
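The two-stage pipeline described in the abstract can be sketched roughly as follows, assuming a dictionary that maps normalised terms to candidate ICD-10 codes and a fallback predictor for everything else. The names `link_terms` and `llm_predict`, the toy dictionary, and the `"UNRESOLVED"` placeholder are hypothetical; in the paper the fallback is in-context learning with GPT-4.1, which is replaced here by a stub.

```python
def link_terms(terms, dictionary, llm_predict):
    """Assign a code to each term: dictionary first, model fallback second."""
    results = {}
    for term in terms:
        key = term.strip().lower()
        codes = dictionary.get(key, set())
        if len(codes) == 1:
            # Unambiguous dictionary hit: exactly one candidate code.
            results[term] = (next(iter(codes)), "dictionary")
        else:
            # Missing or ambiguous term: defer to the model-based predictor.
            results[term] = (llm_predict(term), "llm")
    return results


# Toy dictionary (assumption): one unambiguous and one ambiguous entry.
toy_dictionary = {
    "essential hypertension": {"I10"},
    "diabetes": {"E10", "E11"},
}


def stub_llm(term):
    """Stand-in for the LLM call; returns a placeholder, not a real code."""
    return "UNRESOLVED"


linked = link_terms(
    ["Essential hypertension", "diabetes", "chest tightness"],
    toy_dictionary,
    stub_llm,
)
```

Routing only the unmatched or ambiguous terms to the model keeps the number of LLM calls proportional to the hard cases rather than to the document length.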

Title: L1RA: Dynamic Rank Assignment in LoRA Fine-Tuning

Authors: Raul Singh, Nicolo Brunello, Vincenzo Scotti, Mark James Carman
Subjects: cs.CL, cs.PF
Abstract URL: https://arxiv.org/abs/2509.04884
Pdf URL: https://arxiv.org/pdf/2509.04884
Copy Paste: [[2509.04884]] L1RA: Dynamic Rank Assignment in LoRA Fine-Tuning(https://arxiv.org/abs/2509.04884)
Keywords: language model, llm
Abstract: The ability of Large Language Models (LLMs) to solve complex tasks has made them crucial in the development of AI-based applications. However, the high computational requirements to fine-tune these LLMs on downstream tasks pose significant challenges, particularly when resources are limited. In response to this challenge, we introduce L1RA, a novel technique aimed at dynamically distributing the rank of low-rank adapters during fine-tuning using LoRA. Given a rank budget (i.e., the total sum of adapter ranks), L1RA leverages L1 regularisation to prune redundant ranks and redistribute them across adapters, thereby optimising resource utilisation. Through a series of comprehensive experiments, we empirically demonstrate that L1RA maintains comparable or even reduced computational overhead compared to other LoRA variants, including the vanilla approach, while achieving the same or better performance. Moreover, the post-training analysis of rank distribution unveiled insights into the specific model components requiring the most adaptation to align with the task objective: the feed-forward layers and the attention output projection. These results highlight the efficacy of L1RA in not only enhancing the efficiency of LLM fine-tuning, but also in providing valuable diagnostic information for model refinement and customisation. In conclusion, L1RA stands as a promising technique for advancing the performance and interpretability of LLM adaptation, particularly in scenarios where computational resources are constrained.
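A rough sketch of pruning and redistributing ranks under a fixed budget, assuming each adapter rank carries a scalar gate that an L1 proximal (soft-threshold) step can drive to zero. The gate representation, the round-robin reallocation heuristic, and the `init_gate` value are assumptions made for illustration; the abstract does not specify how L1RA reassigns freed ranks.

```python
import math


def soft_threshold(value, lam):
    """Proximal step for an L1 penalty: shrink towards zero by lam."""
    if abs(value) <= lam:
        return 0.0
    return math.copysign(abs(value) - lam, value)


def prune_and_reallocate(gates, lam, init_gate=0.1):
    """Drop ranks whose gate shrinks to zero, then hand them out again.

    gates maps an adapter name to one gate value per rank. The total number
    of ranks (the rank budget) is the same before and after the call.
    """
    budget = sum(len(values) for values in gates.values())
    kept = {}
    for name, values in gates.items():
        shrunk = [soft_threshold(v, lam) for v in values]
        kept[name] = [v for v in shrunk if v != 0.0]
    freed = budget - sum(len(values) for values in kept.values())

    def strength(name):
        values = kept[name]
        return sum(abs(v) for v in values) / len(values) if values else 0.0

    # Hypothetical heuristic: round-robin over adapters, strongest first.
    order = sorted(kept, key=strength, reverse=True)
    for i in range(freed):
        kept[order[i % len(order)]].append(init_gate)
    return kept


toy_gates = {
    "attn_out": [0.9, 0.8, 0.7, 0.6],
    "ffn": [0.5, 0.02, 0.01, 0.6],
    "q_proj": [0.01, 0.02, 0.03, 0.04],
}
new_gates = prune_and_reallocate(toy_gates, lam=0.05)
rank_per_adapter = {name: len(values) for name, values in new_gates.items()}
```

Inspecting `rank_per_adapter` after training is the kind of post-hoc diagnostic the abstract mentions: adapters that end up with more ranks are the ones the task leaned on most.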

Title: PLaMo 2 Technical Report

Authors: Preferred Networks: Kaizaburo Chubachi, Yasuhiro Fujita, Shinichi Hemmi, Yuta Hirokawa, Toshiki Kataoka, Goro Kobayashi, Kenichi Maehashi, Calvin Metzger, Hiroaki Mikami, Shogo Murai, Daisuke Nishino, Kento Nozawa, Shintarou Okada, Daisuke Okanohara, Shunta Saito, Shotaro Sano, Shuji Suzuki, Daisuke Tanaka, Avinash Ummadisingu, Hanqin Wang, Sixue Wang, Tianqi Xu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.04897
Pdf URL: https://arxiv.org/pdf/2509.04897
Copy Paste: [[2509.04897]] PLaMo 2 Technical Report(https://arxiv.org/abs/2509.04897)
Keywords: language model, llm
Abstract: In this report, we introduce PLaMo 2, a series of Japanese-focused large language models featuring a hybrid Samba-based architecture that transitions to full attention via continual pre-training to support 32K token contexts. Training leverages extensive synthetic corpora to overcome data scarcity, while computational efficiency is achieved through weight reuse and structured pruning. This efficient pruning methodology produces an 8B model that achieves performance comparable to our previous 100B model. Post-training further refines the models using a pipeline of supervised fine-tuning (SFT) and direct preference optimization (DPO), enhanced by synthetic Japanese instruction data and model merging techniques. Optimized for inference using vLLM and quantization with minimal accuracy loss, the PLaMo 2 models achieve state-of-the-art results on Japanese benchmarks, outperforming similarly-sized open models in instruction-following, language fluency, and Japanese-specific knowledge.

Title: ACE-RL: Adaptive Constraint-Enhanced Reward for Long-form Generation Reinforcement Learning

Authors: Jianghao Chen, Wei Sun, Qixiang Yin, Lingxing Kong, Zhixing Tan, Jiajun Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.04903
Pdf URL: https://arxiv.org/pdf/2509.04903
Copy Paste: [[2509.04903]] ACE-RL: Adaptive Constraint-Enhanced Reward for Long-form Generation Reinforcement Learning(https://arxiv.org/abs/2509.04903)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable progress in long-context understanding, yet they face significant challenges in high-quality long-form generation. Existing studies primarily suffer from two limitations: (1) A heavy reliance on scarce, high-quality long-form response data for supervised fine-tuning (SFT) or for pairwise preference reward in reinforcement learning (RL). (2) A focus on coarse-grained quality optimization dimensions, such as relevance, coherence, and helpfulness, overlooking the fine-grained specifics inherent to diverse long-form generation scenarios. To address these issues, we propose a framework using Adaptive Constraint-Enhanced reward for long-form generation Reinforcement Learning (ACE-RL). ACE-RL first automatically deconstructs each instruction into a set of fine-grained, adaptive constraint criteria by identifying its underlying intents and demands. Subsequently, we design a reward mechanism that quantifies the quality of long-form responses based on their satisfaction over corresponding constraints, converting subjective quality evaluation into constraint verification. Finally, we utilize reinforcement learning to guide models toward superior long-form generation capabilities. Experimental results demonstrate that our ACE-RL framework significantly outperforms existing SFT and RL baselines by 20.70% and 7.32% on WritingBench, and our top-performing model even surpasses proprietary systems like GPT-4o by 7.10%, providing a more effective training paradigm for LLMs to generate high-quality content across diverse long-form generation scenarios.
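The idea of turning quality evaluation into constraint verification can be sketched as a reward equal to the fraction of constraints a response satisfies. This is a simplification under stated assumptions: the constraint names and the rule-based checks below are hypothetical stand-ins, and the abstract does not say how ACE-RL verifies or weights individual constraints.

```python
def constraint_reward(response, constraints):
    """Fraction of constraint checks that the response passes (0.0 to 1.0)."""
    if not constraints:
        return 0.0
    satisfied = sum(1 for check in constraints.values() if check(response))
    return satisfied / len(constraints)


# Hypothetical constraints derived from an instruction such as
# "write a three-part briefing on flood risk with a conclusion".
toy_constraints = {
    "mentions_risk": lambda r: "risk" in r.lower(),
    "three_paragraphs": lambda r: len([p for p in r.split("\n\n") if p.strip()]) >= 3,
    "has_conclusion": lambda r: "conclusion" in r.lower(),
    "min_50_words": lambda r: len(r.split()) >= 50,
}

toy_response = (
    "Overview of flood risk.\n\n"
    "Mitigation options include levees.\n\n"
    "Conclusion: plan early."
)

reward = constraint_reward(toy_response, toy_constraints)
```

Because each check is a yes/no verification, the reward needs no reference long-form answer, which is the dependency the abstract lists as the first limitation of earlier work.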

Title: Classification of kinetic-related injury in hospital triage data using NLP

Authors: Midhun Shyam, Jim Basilakis, Kieran Luken, Steven Thomas, John Crozier, Paul M. Middleton, X. Rosalind Wang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.04969
Pdf URL: https://arxiv.org/pdf/2509.04969
Copy Paste: [[2509.04969]] Classification of kinetic-related injury in hospital triage data using NLP(https://arxiv.org/abs/2509.04969)
Keywords: language model, llm
Abstract: Triage notes, created at the start of a patient's hospital visit, contain a wealth of information that can help medical staff and researchers understand Emergency Department patient epidemiology and the degree of time-dependent illness or injury. Unfortunately, applying modern Natural Language Processing and Machine Learning techniques to analyse triage data faces some challenges: firstly, hospital data contains highly sensitive information that is subject to privacy regulation and thus needs to be analysed on site; secondly, most hospitals and medical facilities lack the necessary hardware to fine-tune a Large Language Model (LLM), much less to train one from scratch; lastly, to identify the records of interest, expert inputs are needed to manually label the datasets, which can be time-consuming and costly. We present in this paper a pipeline that enables the classification of triage data using an LLM and limited compute resources. We first fine-tuned a pre-trained LLM with a classifier using a small (2k) open-source dataset on a GPU, and then further fine-tuned the model with a hospital-specific dataset of 1000 samples on a CPU. We demonstrated that by carefully curating the datasets and leveraging existing models and open-source data, we can successfully classify triage data with limited compute resources.

Title: Optimizing Small Transformer-Based Language Models for Multi-Label Sentiment Analysis in Short Texts

Authors: Julius Neumann, Robert Lange, Yuni Susanti, Michael Färber
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2509.04982
Pdf URL: https://arxiv.org/pdf/2509.04982
Copy Paste: [[2509.04982]] Optimizing Small Transformer-Based Language Models for Multi-Label Sentiment Analysis in Short Texts(https://arxiv.org/abs/2509.04982)
Keywords: language model
Abstract: Sentiment classification in short text datasets faces significant challenges such as class imbalance, limited training samples, and the inherent subjectivity of sentiment labels -- issues that are further intensified by the limited context in short texts. These factors make it difficult to resolve ambiguity and exacerbate data sparsity, hindering effective learning. In this paper, we evaluate the effectiveness of small Transformer-based models (i.e., BERT and RoBERTa, with fewer than 1 billion parameters) for multi-label sentiment classification, with a particular focus on short-text settings. Specifically, we evaluated three key factors influencing model performance: (1) continued domain-specific pre-training, (2) data augmentation using automatically generated examples, specifically generative data augmentation, and (3) architectural variations of the classification head. Our experiment results show that data augmentation improves classification performance, while continued pre-training on augmented datasets can introduce noise rather than boost accuracy. Furthermore, we confirm that modifications to the classification head yield only marginal benefits. These findings provide practical guidance for optimizing BERT-based models in resource-constrained settings and refining strategies for sentiment classification in short-text datasets.

Title: Do Large Language Models Need Intent? Revisiting Response Generation Strategies for Service Assistant

Authors: Inbal Bolshinsky, Shani Kupiec, Almog Sasson, Yehudit Aperstein, Alexander Apartsin
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.05006
Pdf URL: https://arxiv.org/pdf/2509.05006
Copy Paste: [[2509.05006]] Do Large Language Models Need Intent? Revisiting Response Generation Strategies for Service Assistant(https://arxiv.org/abs/2509.05006)
Keywords: language model
Abstract: In the era of conversational AI, generating accurate and contextually appropriate service responses remains a critical challenge. A central question remains: Is explicit intent recognition a prerequisite for generating high-quality service responses, or can models bypass this step and produce effective replies directly? This paper conducts a rigorous comparative study to address this fundamental design dilemma. Leveraging two publicly available service interaction datasets, we benchmark several state-of-the-art language models, including a fine-tuned T5 variant, across both paradigms: Intent-First Response Generation and Direct Response Generation. Evaluation metrics encompass both linguistic quality and task success rates, revealing surprising insights into the necessity or redundancy of explicit intent modelling. Our findings challenge conventional assumptions in conversational AI pipelines, offering actionable guidelines for designing more efficient and effective response generation systems.

Title: Masked Diffusion Language Models with Frequency-Informed Training

Authors: Despoina Kosmopoulou, Efthymios Georgiou, Vaggelis Dorovatas, Georgios Paraskevopoulos, Alexandros Potamianos
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.05056
Pdf URL: https://arxiv.org/pdf/2509.05056
Copy Paste: [[2509.05056]] Masked Diffusion Language Models with Frequency-Informed Training(https://arxiv.org/abs/2509.05056)
Keywords: language model
Abstract: We present a masked diffusion language modeling framework for data-efficient training for the BabyLM 2025 Challenge. Our approach applies diffusion training objectives to language modeling under strict data constraints, incorporating frequency-informed masking that prioritizes learning from rare tokens while maintaining theoretical validity. We explore multiple noise scheduling strategies, including two-mode approaches, and investigate different noise weighting schemes within the NELBO objective. We evaluate our method on the BabyLM benchmark suite, measuring linguistic competence, world knowledge, and human-likeness. Results show performance competitive to hybrid autoregressive-masked baselines, demonstrating that diffusion-based training offers a viable alternative for data-restricted language learning.

Title: Entropy2Vec: Crosslingual Language Modeling Entropy as End-to-End Learnable Language Representations

Authors: Patrick Amadeus Irawan, Ryandito Diandaru, Belati Jagad Bintang Syuhada, Randy Zakya Suchrady, Alham Fikri Aji, Genta Indra Winata, Fajri Koto, Samuel Cahyawijaya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.05060
Pdf URL: https://arxiv.org/pdf/2509.05060
Copy Paste: [[2509.05060]] Entropy2Vec: Crosslingual Language Modeling Entropy as End-to-End Learnable Language Representations(https://arxiv.org/abs/2509.05060)
Keywords: language model
Abstract: We introduce Entropy2Vec, a novel framework for deriving cross-lingual language representations by leveraging the entropy of monolingual language models. Unlike traditional typological inventories that suffer from feature sparsity and static snapshots, Entropy2Vec uses the inherent uncertainty in language models to capture typological relationships between languages. By training a language model on a single language, we hypothesize that the entropy of its predictions reflects its structural similarity to other languages: Low entropy indicates high similarity, while high entropy suggests greater divergence. This approach yields dense, non-sparse language embeddings that are adaptable to different timeframes and free from missing values. Empirical evaluations demonstrate that Entropy2Vec embeddings align with established typological categories and achieve competitive performance in downstream multilingual NLP tasks, such as those addressed by the LinguAlchemy framework.
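A toy sketch of the entropy-as-representation idea, assuming each monolingual model contributes one dimension: the mean entropy of its next-symbol predictions on text from the target language. The character-bigram model with add-one smoothing, the class and function names, and the two tiny corpora are illustrative assumptions and stand in for the neural language models used in the paper.

```python
import math
from collections import Counter, defaultdict


def entropy(probs):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


class CharBigramLM:
    """Character bigram model with add-one smoothing over a fixed alphabet."""

    def __init__(self, text, alphabet):
        self.alphabet = sorted(alphabet)
        self.counts = defaultdict(Counter)
        for prev, nxt in zip(text, text[1:]):
            self.counts[prev][nxt] += 1

    def predict(self, prev):
        """Smoothed next-character distribution given the previous character."""
        counts = self.counts.get(prev, Counter())
        total = sum(counts.values()) + len(self.alphabet)
        return [(counts[ch] + 1) / total for ch in self.alphabet]

    def mean_entropy(self, text):
        """Average predictive entropy over every position that has a successor."""
        values = [entropy(self.predict(ch)) for ch in text[:-1]]
        return sum(values) / len(values)


def language_vector(text, models):
    """One dimension per monolingual model: its mean entropy on this text."""
    return [model.mean_entropy(text) for model in models]


# Tiny toy corpora (assumption); real use would need far more text.
corpora = {
    "en": "the cat sat on the mat and the dog sat on the log",
    "id": "kucing itu duduk di atas tikar dan anjing itu duduk di kayu",
}
alphabet = set("".join(corpora.values()))
models = [CharBigramLM(corpora[lang], alphabet) for lang in sorted(corpora)]
vectors = {lang: language_vector(text, models) for lang, text in corpora.items()}
```

Every language receives a value in every dimension, which illustrates why such vectors are dense and free from missing values, unlike hand-curated typological feature tables.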

Title: ToM-SSI: Evaluating Theory of Mind in Situated Social Interactions

Authors: Matteo Bortoletto, Constantin Ruhdorfer, Andreas Bulling
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.05066
Pdf URL: https://arxiv.org/pdf/2509.05066
Copy Paste: [[2509.05066]] ToM-SSI: Evaluating Theory of Mind in Situated Social Interactions(https://arxiv.org/abs/2509.05066)
Keywords: agent
Abstract: Most existing Theory of Mind (ToM) benchmarks for foundation models rely on variations of the Sally-Anne test, offering only a very limited perspective on ToM and neglecting the complexity of human social interactions. To address this gap, we propose ToM-SSI: a new benchmark specifically designed to test ToM capabilities in environments rich with social interactions and spatial dynamics. While current ToM benchmarks are limited to text-only or dyadic interactions, ToM-SSI is multimodal and includes group interactions of up to four agents that communicate and move in situated environments. This unique design allows us to study, for the first time, mixed cooperative-obstructive settings and reasoning about multiple agents' mental state in parallel, thus capturing a wider range of social cognition than existing benchmarks. Our evaluations reveal that the current models' performance is still severely limited, especially in these new tasks, highlighting critical gaps for future research.

Title: Triadic Fusion of Cognitive, Functional, and Causal Dimensions for Explainable LLMs: The TAXAL Framework

Authors: David Herrera-Poyatos, Carlos Peláez-González, Cristina Zuheros, Virilo Tejedor, Rosana Montes, Francisco Herrera
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.05199
Pdf URL: https://arxiv.org/pdf/2509.05199
Copy Paste: [[2509.05199]] Triadic Fusion of Cognitive, Functional, and Causal Dimensions for Explainable LLMs: The TAXAL Framework(https://arxiv.org/abs/2509.05199)
Keywords: language model, llm, prompt, agent
Abstract: Large Language Models (LLMs) are increasingly being deployed in high-risk domains where opacity, bias, and instability undermine trust and accountability. Traditional explainability methods, focused on surface outputs, do not capture the reasoning pathways, planning logic, and systemic impacts of agentic LLMs. We introduce TAXAL (Triadic Alignment for eXplainability in Agentic LLMs), a triadic fusion framework that unites three complementary dimensions: cognitive (user understanding), functional (practical utility), and causal (faithful reasoning). TAXAL provides a unified, role-sensitive foundation for designing, evaluating, and deploying explanations in diverse sociotechnical settings. Our analysis synthesizes existing methods, ranging from post-hoc attribution and dialogic interfaces to explanation-aware prompting, and situates them within the TAXAL triadic fusion model. We further demonstrate its applicability through case studies in law, education, healthcare, and public services, showing how explanation strategies adapt to institutional constraints and stakeholder roles. By combining conceptual clarity with design patterns and deployment pathways, TAXAL advances explainability as a technical and sociotechnical practice, supporting trustworthy and context-sensitive LLM applications in the era of agentic AI.

Title: Hunyuan-MT Technical Report

Authors: Mao Zheng, Zheng Li, Bingxin Qu, Mingyang Song, Yang Du, Mingrui Sun, Di Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.05209
Pdf URL: https://arxiv.org/pdf/2509.05209
Copy Paste: [[2509.05209]] Hunyuan-MT Technical Report(https://arxiv.org/abs/2509.05209)
Keywords: chain-of-thought
Abstract: In this report, we introduce Hunyuan-MT-7B, our first open-source multilingual translation model, which supports bidirectional translation across 33 major languages and places a special emphasis on translation between Mandarin and several ethnic minority languages as well as dialects. Furthermore, to serve and address diverse translation scenarios and enhance model performance at test time, we introduce Hunyuan-MT-Chimera-7B, a translation model inspired by the slow thinking mode. This model integrates multiple outputs generated by the Hunyuan-MT-7B model under varying parameter settings, thereby achieving performance superior to that of conventional slow-thinking models based on Chain-of-Thought (CoT). The development of our models follows a holistic training process specifically engineered for multilingual translation, which begins with general and MT-oriented pre-training to build foundational capabilities, proceeds to Supervised Fine-Tuning (SFT) for task-specific adaptation, and culminates in advanced alignment through Reinforcement Learning (RL) and weak-to-strong RL. Through comprehensive experimentation, we demonstrate that both Hunyuan-MT-7B and Hunyuan-MT-Chimera-7B significantly outperform all translation-specific models of comparable parameter size and most of the SOTA large models, particularly on the task of translation between Mandarin and minority languages as well as dialects. In the WMT2025 shared task (General Machine Translation), our models demonstrate state-of-the-art performance, ranking first in 30 out of 31 language pairs. This result highlights the robustness of our models across a diverse linguistic spectrum, encompassing high-resource languages such as Chinese, English, and Japanese, as well as low-resource languages including Czech, Marathi, Estonian, and Icelandic.

Title: BEDTime: A Unified Benchmark for Automatically Describing Time Series

Authors: Medhasweta Sen, Zachary Gottesman, Jiaxing Qiu, C. Bayan Bruss, Nam Nguyen, Tom Hartvigsen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.05215
Pdf URL: https://arxiv.org/pdf/2509.05215
Copy Paste: [[2509.05215]] BEDTime: A Unified Benchmark for Automatically Describing Time Series(https://arxiv.org/abs/2509.05215)
Keywords: language model, llm
Abstract: Many recent studies have proposed general-purpose foundation models designed for a variety of time series analysis tasks. While several established datasets already exist for evaluating these models, previous works frequently introduce their models in conjunction with new datasets, limiting opportunities for direct, independent comparisons and obscuring insights into the relative strengths of different methods. Additionally, prior evaluations often cover numerous tasks simultaneously, assessing a broad range of model abilities without clearly pinpointing which capabilities contribute to overall performance. To address these gaps, we formalize and evaluate 3 tasks that test a model's ability to describe time series using generic natural language: (1) recognition (True/False question-answering), (2) differentiation (multiple choice question-answering), and (3) generation (open-ended natural language description). We then unify 4 recent datasets to enable head-to-head model comparisons on each task. Experimentally, in evaluating 13 state-of-the-art language, vision-language, and time series-language models, we find that (1) popular language-only methods largely underperform, indicating a need for time series-specific architectures, (2) VLMs are quite successful, as expected, identifying the value of vision models for these tasks, and (3) pretrained multimodal time series-language models successfully outperform LLMs, but still have significant room for improvement. We also find that all approaches exhibit clear fragility in a range of robustness tests. Overall, our benchmark provides a standardized evaluation on a task necessary for time series reasoning systems.

Title: HoPE: Hyperbolic Rotary Positional Encoding for Stable Long-Range Dependency Modeling in Large Language Models

Authors: Chang Dai, Hongyu Shan, Mingyang Song, Di Liang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.05218
Pdf URL: https://arxiv.org/pdf/2509.05218
Copy Paste: [[2509.05218]] HoPE: Hyperbolic Rotary Positional Encoding for Stable Long-Range Dependency Modeling in Large Language Models(https://arxiv.org/abs/2509.05218)
Keywords: language model, long context
Abstract: Positional encoding mechanisms enable Transformers to model sequential structure and long-range dependencies in text. While absolute positional encodings struggle with extrapolation to longer sequences due to fixed positional representations, and relative approaches like ALiBi exhibit performance degradation on extremely long contexts, the widely-used Rotary Positional Encoding (RoPE) introduces oscillatory attention patterns that hinder stable long-distance dependency modelling. We address these limitations through a geometric reformulation of positional encoding. Drawing inspiration from Lorentz transformations in hyperbolic geometry, we propose Hyperbolic Rotary Positional Encoding (HoPE), which leverages hyperbolic functions to implement Lorentz rotations on token representations. Theoretical analysis demonstrates that RoPE is a special case of our generalized formulation. HoPE fundamentally resolves RoPE's oscillation issues by enforcing monotonic decay of attention weights with increasing token distances. Extensive experimental results, including perplexity evaluations under several extended sequence benchmarks, show that HoPE consistently exceeds existing positional encoding methods. These findings underscore HoPE's enhanced capacity for representing and generalizing long-range dependencies. Data and code will be available.
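The oscillatory pattern attributed to RoPE above can be seen in a toy setting. The sketch below is not HoPE and is not taken from the paper; it only applies a standard 2-D rotary rotation to one query/key pair and records the attention logit as the token distance grows. The function names and the frequency `theta=0.5` are illustrative assumptions.

```python
import math

def rotate(vec, angle):
    """Rotate a 2-D vector counter-clockwise by `angle` radians."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

def rope_score(q, k, m, n, theta=0.5):
    """Dot product of q at position m and k at position n after a
    RoPE-style rotation (theta is an assumed, illustrative frequency)."""
    qm = rotate(q, m * theta)
    kn = rotate(k, n * theta)
    return qm[0] * kn[0] + qm[1] * kn[1]

q = k = (1.0, 0.0)
# Logit as a function of distance d = m - n; for this pair it equals cos(d * theta).
scores = [rope_score(q, k, m=d, n=0) for d in range(13)]
```

For this pair the logit follows a cosine of the distance, so it falls, turns negative, and rises again rather than decaying monotonically. This is the behaviour the abstract says HoPE is designed to remove.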

Title: Less is More Tokens: Efficient Math Reasoning via Difficulty-Aware Chain-of-Thought Distillation

Authors: Abdul Waheed, Chancharik Mitra, Laurie Z. Wang, Deva Ramanan, Bhiksha Raj
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.05226
Pdf URL: https://arxiv.org/pdf/2509.05226
Copy Paste: [[2509.05226]] Less is More Tokens: Efficient Math Reasoning via Difficulty-Aware Chain-of-Thought Distillation(https://arxiv.org/abs/2509.05226)
Keywords: chain-of-thought
Abstract: Chain-of-thought reasoning, while powerful, can produce unnecessarily verbose output for simpler problems. We present a framework for difficulty-aware reasoning that teaches models to dynamically adjust reasoning depth based on problem complexity. Remarkably, we show that models can be endowed with such dynamic inference pathways without any architectural modifications; we simply post-train on data that is carefully curated to include chain-of-thought traces that are proportional in length to problem difficulty. Our analysis reveals that post-training via supervised fine-tuning (SFT) primarily captures patterns like reasoning length and format, while direct preference optimization (DPO) preserves reasoning accuracy, with their combination reducing length and maintaining or improving performance. Both quantitative metrics and qualitative assessments confirm that models can learn to "think proportionally", reasoning minimally on simple problems while maintaining depth for complex ones.

Title: CURE: Controlled Unlearning for Robust Embeddings -- Mitigating Conceptual Shortcuts in Pre-Trained Language Models

Authors: Aysenur Kocak, Shuo Yang, Bardh Prenkaj, Gjergji Kasneci
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.05230
Pdf URL: https://arxiv.org/pdf/2509.05230
Copy Paste: [[2509.05230]] CURE: Controlled Unlearning for Robust Embeddings -- Mitigating Conceptual Shortcuts in Pre-Trained Language Models(https://arxiv.org/abs/2509.05230)
Keywords: language model
Abstract: Pre-trained language models have achieved remarkable success across diverse applications but remain susceptible to spurious, concept-driven correlations that impair robustness and fairness. In this work, we introduce CURE, a novel and lightweight framework that systematically disentangles and suppresses conceptual shortcuts while preserving essential content information. Our method first extracts concept-irrelevant representations via a dedicated content extractor reinforced by a reversal network, ensuring minimal loss of task-relevant information. A subsequent controllable debiasing module employs contrastive learning to finely adjust the influence of residual conceptual cues, enabling the model to either diminish harmful biases or harness beneficial correlations as appropriate for the target task. Evaluated on the IMDB and Yelp datasets using three pre-trained architectures, CURE achieves an absolute improvement of +10 points in F1 score on IMDB and +2 points on Yelp, while introducing minimal computational overhead. Our approach establishes a flexible, unsupervised blueprint for combating conceptual biases, paving the way for more reliable and fair language understanding systems.

Title: Uniform Information Density and Syntactic Reduction: Revisiting $\textit{that}$-Mentioning in English Complement Clauses

Authors: Hailin Hao, Elsi Kaiser
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.05254
Pdf URL: https://arxiv.org/pdf/2509.05254
Copy Paste: [[2509.05254]] Uniform Information Density and Syntactic Reduction: Revisiting $\textit{that}$-Mentioning in English Complement Clauses(https://arxiv.org/abs/2509.05254)
Keywords: language model
Abstract: Speakers often have multiple ways to express the same meaning. The Uniform Information Density (UID) hypothesis suggests that speakers exploit this variability to maintain a consistent rate of information transmission during language production. Building on prior work linking UID to syntactic reduction, we revisit the finding that the optional complementizer $\textit{that}$ in English complement clauses is more likely to be omitted when the clause has low information density (i.e., more predictable). We advance this line of research by analyzing a large-scale, contemporary conversational corpus and using machine learning and neural language models to refine estimates of information density. Our results replicated the established relationship between information density and $\textit{that}$-mentioning. However, we found that previous measures of information density based on matrix verbs' subcategorization probability capture substantial idiosyncratic lexical variation. By contrast, estimates derived from contextual word embeddings account for additional variance in patterns of complementizer usage.
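As a rough illustration of the subcategorization-based measure mentioned above, information density is commonly operationalized as surprisal, the negative log probability of the upcoming material. The sketch below assumes that operationalization. The verbs and probabilities are hypothetical values chosen for the example, not figures from the paper.

```python
import math

def surprisal_bits(probability):
    """Information content, in bits, of an event with the given probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)

# Hypothetical subcategorization probabilities P(complement clause | matrix verb).
p_cc_given_verb = {"think": 0.5, "confirm": 0.125}

# Higher surprisal means a less predictable, more information-dense clause onset.
info = {verb: surprisal_bits(p) for verb, p in p_cc_given_verb.items()}
```

Under UID, the verb with the higher surprisal (here the hypothetical "confirm") would be the one more likely to be followed by an overt $\textit{that}$, since the complementizer spreads the information over more words.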

Title: Elucidating the Design Space of Decay in Linear Attention

Authors: Zhen Qin, Xuyang Shen, Yiran Zhong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.05282
Pdf URL: https://arxiv.org/pdf/2509.05282
Copy Paste: [[2509.05282]] Elucidating the Design Space of Decay in Linear Attention(https://arxiv.org/abs/2509.05282)
Keywords: language model
Abstract: This paper presents a comprehensive investigation into the decay mechanisms inherent in linear complexity sequence models. We systematically delineate the design space of decay mechanisms across four pivotal dimensions: parameterization strategy, which refers to the computational methodology for decay; parameter sharing, which involves the utilization of supplementary parameters for decay computation; decay granularity, comparing scalar versus vector-based decay; and compatibility with relative positional encoding methods, such as Rotary Position Embedding (RoPE). Through an extensive series of experiments conducted on diverse language modeling tasks, we uncovered several critical insights. Firstly, the design of the parameterization strategy for decay requires meticulous consideration. Our findings indicate that effective configurations are typically confined to a specific range of parameters. Secondly, parameter sharing cannot be used arbitrarily, as it may cause decay values to be too large or too small, thereby significantly impacting performance. Thirdly, under identical parameterization strategies, scalar decay generally underperforms compared to its vector-based counterpart. However, in certain scenarios with alternative parameterization strategies, scalar decay may unexpectedly surpass vector decay in efficacy. Lastly, our analysis reveals that RoPE, a commonly employed relative positional encoding method, typically fails to provide tangible benefits to the majority of linear attention mechanisms.
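To make the scalar-versus-vector distinction concrete, here is a minimal sketch of a generic decayed linear-attention recurrence, S_t = diag(decay) · S_{t-1} + k_t v_tᵀ with output o_t = q_tᵀ S_t. This is a common textbook form and an assumption on our part, not the parameterization studied in the paper. The function name and the convention of passing a constant decay as an argument are illustrative only.

```python
def linear_attention(queries, keys, values, decay):
    """Causal linear attention with a decayed recurrent state.

    `decay` is either one scalar shared by all key dimensions or a
    list with one decay value per key dimension (vector decay).
    """
    d_k, d_v = len(keys[0]), len(values[0])
    # A scalar decay is broadcast to every key dimension.
    if isinstance(decay, (int, float)):
        decay = [float(decay)] * d_k
    state = [[0.0] * d_v for _ in range(d_k)]  # running sum of k v^T
    outputs = []
    for q, k, v in zip(queries, keys, values):
        for i in range(d_k):
            for j in range(d_v):
                # Shrink the old memory, then write the new key/value pair.
                state[i][j] = decay[i] * state[i][j] + k[i] * v[j]
        outputs.append(
            [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs

queries = [[1.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0], [2.0]]

scalar_out = linear_attention(queries, keys, values, decay=0.5)
vector_out = linear_attention(queries, keys, values, decay=[1.0, 0.5])
```

With the scalar decay, the first token's contribution is halved before the second step; with the vector decay, the first key dimension is kept intact while the second is shrunk, so the two settings produce different outputs from the same inputs.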

Title: Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining

Authors: Deniz Bayazit, Aaron Mueller, Antoine Bosselut
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.05291
Pdf URL: https://arxiv.org/pdf/2509.05291
Copy Paste: [[2509.05291]] Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining(https://arxiv.org/abs/2509.05291)
Keywords: language model, llm
Abstract: Large language models (LLMs) learn non-trivial abstractions during pretraining, like detecting irregular plural noun subjects. However, it is not well understood when and how specific linguistic abilities emerge as traditional evaluation methods such as benchmarking fail to reveal how models acquire concepts and capabilities. To bridge this gap and better understand model training at the concept level, we use sparse crosscoders to discover and align features across model checkpoints. Using this approach, we track the evolution of linguistic features during pretraining. We train crosscoders between open-sourced checkpoint triplets with significant performance and representation shifts, and introduce a novel metric, Relative Indirect Effects (RelIE), to trace training stages at which individual features become causally important for task performance. We show that crosscoders can detect feature emergence, maintenance, and discontinuation during pretraining. Our approach is architecture-agnostic and scalable, offering a promising path toward more interpretable and fine-grained analysis of representation learning throughout pretraining.