2025-09-10

Title: MedBench-IT: A Comprehensive Benchmark for Evaluating Large Language Models on Italian Medical Entrance Examinations

Authors: Ruggero Marino Lazzaroni, Alessandro Angioi, Michelangelo Puliga, Davide Sanna, Roberto Marras
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.07135
Pdf URL: https://arxiv.org/pdf/2509.07135
Copy Paste: [[2509.07135]] MedBench-IT: A Comprehensive Benchmark for Evaluating Large Language Models on Italian Medical Entrance Examinations(https://arxiv.org/abs/2509.07135)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) show increasing potential in education, yet benchmarks for non-English languages in specialized domains remain scarce. We introduce MedBench-IT, the first comprehensive benchmark for evaluating LLMs on Italian medical university entrance examinations. Sourced from Edizioni Simone, a leading preparatory materials publisher, MedBench-IT comprises 17,410 expert-written multiple-choice questions across six subjects (Biology, Chemistry, Logic, General Culture, Mathematics, Physics) and three difficulty levels. We evaluated diverse models including proprietary LLMs (GPT-4o, Claude series) and resource-efficient open-source alternatives (<30B parameters) focusing on practical deployability. Beyond accuracy, we conducted rigorous reproducibility tests (88.86% response consistency, varying by subject), ordering bias analysis (minimal impact), and reasoning prompt evaluation. We also examined correlations between question readability and model performance, finding a statistically significant but small inverse relationship. MedBench-IT provides a crucial resource for the Italian NLP community, EdTech developers, and practitioners, offering insights into current capabilities and a standardized evaluation methodology for this critical domain.
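Summary: Large language models (LLMs) show growing potential in education, but benchmarks for non-English languages in specialized domains remain scarce. We introduce MedBench-IT, the first comprehensive benchmark for evaluating LLMs on Italian medical university entrance examinations. Drawn from Edizioni Simone, a leading publisher of preparatory materials, MedBench-IT contains 17,410 expert-written multiple-choice questions spanning six subjects (Biology, Chemistry, Logic, General Culture, Mathematics, Physics) and three difficulty levels. We evaluate a range of models, including proprietary LLMs (GPT-4o, Claude series) and resource-efficient open-source alternatives (<30B parameters), with a focus on practical deployability. Beyond accuracy, we run rigorous reproducibility tests (88.86% response consistency, varying by subject), an ordering bias analysis (minimal impact), and a reasoning prompt evaluation. We also examine the correlation between question readability and model performance and find a statistically significant but small inverse relationship. MedBench-IT gives the Italian NLP community, EdTech developers, and practitioners a key resource, with insights into current capabilities and a standardized evaluation methodology for this critical domain.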
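The abstract reports 88.86% response consistency but does not define the metric. One plausible reading, sketched below as an assumption rather than the authors' procedure, is the share of questions for which a model gives the same answer in every repeated run. The function name `response_consistency` and the toy answers are hypothetical.

```python
from typing import Sequence


def response_consistency(runs: Sequence[Sequence[str]]) -> float:
    """Fraction of questions answered identically in every run.

    `runs[i][j]` is the answer letter given in run i to question j.
    This definition is an assumption; the abstract does not spell it out.
    """
    if not runs or not runs[0]:
        raise ValueError("need at least one run with at least one answer")
    n_questions = len(runs[0])
    if any(len(run) != n_questions for run in runs):
        raise ValueError("all runs must cover the same questions")
    # zip(*runs) regroups the answers per question across runs.
    stable = sum(1 for answers in zip(*runs) if len(set(answers)) == 1)
    return stable / n_questions


# Toy data: three runs over four questions; only question 3 changes.
toy_runs = [
    ["A", "C", "B", "D"],
    ["A", "C", "E", "D"],
    ["A", "C", "B", "D"],
]
print(response_consistency(toy_runs))  # 0.75
```

Computing the same ratio per subject would give the subject-level variation the abstract mentions.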

Title: Toward Purpose-oriented Topic Model Evaluation enabled by Large Language Models

Authors: Zhiyin Tan, Jennifer D'Souza
Subjects: cs.CL, cs.AI, cs.DL
Abstract URL: https://arxiv.org/abs/2509.07142
Pdf URL: https://arxiv.org/pdf/2509.07142
Copy Paste: [[2509.07142]] Toward Purpose-oriented Topic Model Evaluation enabled by Large Language Models(https://arxiv.org/abs/2509.07142)
Keywords: language model, llm
Abstract: This study presents a framework for automated evaluation of dynamically evolving topic models using Large Language Models (LLMs). Topic modeling is essential for organizing and retrieving scholarly content in digital library systems, helping users navigate complex and evolving knowledge domains. However, widely used automated metrics, such as coherence and diversity, often capture only narrow statistical patterns and fail to explain semantic failures in practice. We introduce a purpose-oriented evaluation framework that employs nine LLM-based metrics spanning four key dimensions of topic quality: lexical validity, intra-topic semantic soundness, inter-topic structural soundness, and document-topic alignment soundness. The framework is validated through adversarial and sampling-based protocols, and is applied across datasets spanning news articles, scholarly publications, and social media posts, as well as multiple topic modeling methods and open-source LLMs. Our analysis shows that LLM-based metrics provide interpretable, robust, and task-relevant assessments, uncovering critical weaknesses in topic models such as redundancy and semantic drift, which are often missed by traditional metrics. These results support the development of scalable, fine-grained evaluation tools for maintaining topic relevance in dynamic datasets. All code and data supporting this work are accessible at this https URL.
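Summary: This study presents a framework for automatically evaluating dynamically evolving topic models with large language models (LLMs). Topic modeling is essential for organizing and retrieving scholarly content in digital library systems, helping users navigate complex and evolving knowledge domains. However, widely used automated metrics such as coherence and diversity often capture only narrow statistical patterns and fail to explain semantic failures in practice. We introduce a purpose-oriented evaluation framework with nine LLM-based metrics covering four key dimensions of topic quality: lexical validity, intra-topic semantic soundness, inter-topic structural soundness, and document-topic alignment soundness. The framework is validated through adversarial and sampling-based protocols and applied to datasets spanning news articles, scholarly publications, and social media posts, as well as multiple topic modeling methods and open-source LLMs. Our analysis shows that LLM-based metrics provide interpretable, robust, and task-relevant assessments and expose critical weaknesses in topic models, such as redundancy and semantic drift, that traditional metrics often miss. These results support the development of scalable, fine-grained evaluation tools for maintaining topic relevance in dynamic datasets. All code and data supporting this work are accessible at this https URL.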

Title: Towards EnergyGPT: A Large Language Model Specialized for the Energy Sector

Authors: Amal Chebbi, Babajide Kolade
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.07177
Pdf URL: https://arxiv.org/pdf/2509.07177
Copy Paste: [[2509.07177]] Towards EnergyGPT: A Large Language Model Specialized for the Energy Sector(https://arxiv.org/abs/2509.07177)
Keywords: language model, gpt, llm
Abstract: Large Language Models have demonstrated impressive capabilities across various domains. However, their general-purpose nature often limits their effectiveness in specialized fields such as energy, where deep technical expertise and precise domain knowledge are essential. In this paper, we introduce EnergyGPT, a domain-specialized language model tailored for the energy sector, developed by fine-tuning the LLaMA 3.1-8B model with Supervised Fine-Tuning on a high-quality, curated corpus of energy-related texts. We present a complete development pipeline, including data collection and curation, model fine-tuning, benchmark design and LLM-judge choice, evaluation and deployment. Through this work, we demonstrate that our training strategy enables improvements in domain relevance and performance without the need for large-scale infrastructure. By evaluating the performance of the model using domain-specific question-answering benchmarks, our results demonstrate that EnergyGPT outperforms the base model in most of the energy-related language understanding and generation tasks.
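Summary: Large language models have shown impressive capabilities across many domains. However, their general-purpose nature often limits their effectiveness in specialized fields such as energy, where deep technical expertise and precise domain knowledge are essential. In this paper we introduce EnergyGPT, a domain-specialized language model tailored to the energy sector, developed by applying supervised fine-tuning to the LLaMA 3.1-8B model on a high-quality, curated corpus of energy-related texts. We present a complete development pipeline covering data collection and curation, model fine-tuning, benchmark design and LLM-judge selection, evaluation, and deployment. Through this work we show that our training strategy improves domain relevance and performance without large-scale infrastructure. Evaluated on domain-specific question-answering benchmarks, EnergyGPT outperforms the base model on most energy-related language understanding and generation tasks.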

Title: DischargeSim: A Simulation Benchmark for Educational Doctor-Patient Communication at Discharge

Authors: Zonghai Yao, Michael Sun, Won Seok Jang, Sunjae Kwon, Soie Kwon, Hong Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.07188
Pdf URL: https://arxiv.org/pdf/2509.07188
Copy Paste: [[2509.07188]] DischargeSim: A Simulation Benchmark for Educational Doctor-Patient Communication at Discharge(https://arxiv.org/abs/2509.07188)
Keywords: language model, llm, agent
Abstract: Discharge communication is a critical yet underexplored component of patient care, where the goal shifts from diagnosis to education. While recent large language model (LLM) benchmarks emphasize in-visit diagnostic reasoning, they fail to evaluate models' ability to support patients after the visit. We introduce DischargeSim, a novel benchmark that evaluates LLMs on their ability to act as personalized discharge educators. DischargeSim simulates post-visit, multi-turn conversations between LLM-driven DoctorAgents and PatientAgents with diverse psychosocial profiles (e.g., health literacy, education, emotion). Interactions are structured across six clinically grounded discharge topics and assessed along three axes: (1) dialogue quality via automatic and LLM-as-judge evaluation, (2) personalized document generation including free-text summaries and structured AHRQ checklists, and (3) patient comprehension through a downstream multiple-choice exam. Experiments across 18 LLMs reveal significant gaps in discharge education capability, with performance varying widely across patient profiles. Notably, model size does not always yield better education outcomes, highlighting trade-offs in strategy use and content prioritization. DischargeSim offers a first step toward benchmarking LLMs in post-visit clinical education and promoting equitable, personalized patient support.
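Summary: Discharge communication is a critical yet underexplored component of patient care, in which the goal shifts from diagnosis to education. Although recent large language model (LLM) benchmarks emphasize in-visit diagnostic reasoning, they do not evaluate a model's ability to support patients after the visit. We introduce DischargeSim, a novel benchmark that evaluates LLMs on their ability to act as personalized discharge educators. DischargeSim simulates post-visit, multi-turn conversations between LLM-driven DoctorAgents and PatientAgents with diverse psychosocial profiles (e.g., health literacy, education, emotion). Interactions are structured around six clinically grounded discharge topics and assessed along three axes: (1) dialogue quality via automatic and LLM-as-judge evaluation, (2) personalized document generation, including free-text summaries and structured AHRQ checklists, and (3) patient comprehension through a downstream multiple-choice exam. Experiments across 18 LLMs reveal significant gaps in discharge education capability, with performance varying widely across patient profiles. Notably, larger model size does not always yield better education outcomes, which highlights trade-offs in strategy use and content prioritization. DischargeSim is a first step toward benchmarking LLMs in post-visit clinical education and promoting equitable, personalized patient support.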

Title: Rule-Based Moral Principles for Explaining Uncertainty in Natural Language Generation

Authors: Zahra Atf, Peter R Lewis
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2509.07190
Pdf URL: https://arxiv.org/pdf/2509.07190
Copy Paste: [[2509.07190]] Rule-Based Moral Principles for Explaining Uncertainty in Natural Language Generation(https://arxiv.org/abs/2509.07190)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly used in high-stakes settings, where explaining uncertainty is both technical and ethical. Probabilistic methods are often opaque and misaligned with expectations of transparency. We propose a framework based on rule-based moral principles for handling uncertainty in LLM-generated text. Using insights from moral psychology and virtue ethics, we define rules such as precaution, deference, and responsibility to guide responses under epistemic or aleatoric uncertainty. These rules are encoded in a lightweight Prolog engine, where uncertainty levels (low, medium, high) trigger aligned system actions with plain-language rationales. Scenario-based simulations benchmark rule coverage, fairness, and trust calibration. Use cases in clinical and legal domains illustrate how moral reasoning can improve trust and interpretability. Our approach offers a transparent, lightweight alternative to probabilistic models for socially responsible natural language generation.
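Summary: Large language models (LLMs) are increasingly used in high-stakes settings, where explaining uncertainty is both a technical and an ethical matter. Probabilistic methods are often opaque and misaligned with expectations of transparency. We propose a framework built on rule-based moral principles for handling uncertainty in LLM-generated text. Drawing on insights from moral psychology and virtue ethics, we define rules such as precaution, deference, and responsibility to guide responses under epistemic or aleatoric uncertainty. These rules are encoded in a lightweight Prolog engine, in which uncertainty levels (low, medium, high) trigger aligned system actions with plain-language rationales. Scenario-based simulations benchmark rule coverage, fairness, and trust calibration. Use cases in clinical and legal domains illustrate how moral reasoning can improve trust and interpretability. Our approach offers a transparent, lightweight alternative to probabilistic models for socially responsible natural language generation.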

Title: LLM Analysis of 150+ years of German Parliamentary Debates on Migration Reveals Shift from Post-War Solidarity to Anti-Solidarity in the Last Decade

Authors: Aida Kostikova, Ole Pütz, Steffen Eger, Olga Sabelfeld, Benjamin Paassen
Subjects: cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2509.07274
Pdf URL: https://arxiv.org/pdf/2509.07274
Copy Paste: [[2509.07274]] LLM Analysis of 150+ years of German Parliamentary Debates on Migration Reveals Shift from Post-War Solidarity to Anti-Solidarity in the Last Decade(https://arxiv.org/abs/2509.07274)
Keywords: language model, llm, prompt
Abstract: Migration has been a core topic in German political debate, from millions of expellees after World War II, through labor migration, to refugee movements in the recent past. Studying political speech regarding such wide-ranging phenomena in depth traditionally required extensive manual annotations, limiting the scope of analysis to small subsets of the data. Large language models (LLMs) have the potential to partially automate even complex annotation tasks. We provide an extensive evaluation of multiple LLMs in annotating (anti-)solidarity subtypes in German parliamentary debates compared to a large set of thousands of human reference annotations (gathered over a year). We evaluate the influence of model size, prompting differences, fine-tuning, and historical versus contemporary data, and we investigate systematic errors. Beyond methodological evaluation, we also interpret the resulting annotations through a social science lens, gaining deeper insight into (anti-)solidarity trends towards migrants in the German post-World War II period and recent past. Our data reveal a high degree of migrant-directed solidarity in the postwar period, as well as a strong trend towards anti-solidarity in the German parliament since 2015, motivating further research. These findings highlight the promise of LLMs for political text analysis and the importance of migration debates in Germany, where demographic decline and labor shortages coexist with rising polarization.
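Summary: Migration has been a core topic in German political debate, from the millions of expellees after World War II, through labor migration, to the refugee movements of the recent past. Studying political speech on such wide-ranging phenomena in depth has traditionally required extensive manual annotation, which limits analysis to small subsets of the data. Large language models (LLMs) have the potential to partially automate even complex annotation tasks. We provide an extensive evaluation of multiple LLMs in annotating (anti-)solidarity subtypes in German parliamentary debates, compared against a large set of thousands of human reference annotations (gathered over a year). We evaluate the influence of model size, prompting differences, fine-tuning, and historical versus contemporary data, and we investigate systematic errors. Beyond methodological evaluation, we interpret the resulting annotations through a social science lens, gaining deeper insight into (anti-)solidarity trends towards migrants in the German post-World War II period and the recent past. Our data reveal a high degree of migrant-directed solidarity in the postwar period and a strong trend towards anti-solidarity in the German parliament since 2015, motivating further research. These findings highlight the promise of LLMs for political text analysis and the importance of migration debates in Germany, where demographic decline and labor shortages coexist with rising polarization.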

Title: Causal Attention with Lookahead Keys

Authors: Zhuoqing Song, Peng Sun, Huizhuo Yuan, Quanquan Gu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.07301
Pdf URL: https://arxiv.org/pdf/2509.07301
Copy Paste: [[2509.07301]] Causal Attention with Lookahead Keys(https://arxiv.org/abs/2509.07301)
Keywords: language model
Abstract: In standard causal attention, each token's query, key, and value (QKV) are static and encode only preceding context. We introduce CAuSal aTtention with Lookahead kEys (CASTLE), an attention mechanism that continually updates each token's keys as the context unfolds. We term these updated keys lookahead keys because they belong to earlier positions yet integrate information from tokens that appear later relative to those positions, while strictly preserving the autoregressive property. Although the mechanism appears sequential, we derive a mathematical equivalence that avoids explicitly materializing lookahead keys at each position and enables efficient parallel training. On language modeling benchmarks, CASTLE consistently outperforms standard causal attention across model scales, reducing validation perplexity and improving performance on a range of downstream tasks.
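Summary: In standard causal attention, each token's query, key, and value (QKV) are static and encode only the preceding context. We introduce CAuSal aTtention with Lookahead kEys (CASTLE), an attention mechanism that continually updates each token's keys as the context unfolds. We call these updated keys lookahead keys because they belong to earlier positions yet integrate information from tokens that appear later relative to those positions, while strictly preserving the autoregressive property. Although the mechanism appears sequential, we derive a mathematical equivalence that avoids explicitly materializing the lookahead keys at each position and enables efficient parallel training. On language modeling benchmarks, CASTLE consistently outperforms standard causal attention across model scales, reducing validation perplexity and improving performance on a range of downstream tasks.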

Title: Instance-level Performance Prediction for Long-form Generation Tasks

Authors: Chi-Yang Hsu, Alexander Braylan, Yiheng Su, Omar Alonso, Matthew Lease
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.07309
Pdf URL: https://arxiv.org/pdf/2509.07309
Copy Paste: [[2509.07309]] Instance-level Performance Prediction for Long-form Generation Tasks(https://arxiv.org/abs/2509.07309)
Keywords: llm
Abstract: We motivate and share a new benchmark for instance-level performance prediction of long-form generation tasks having multi-faceted, fine-grained quality metrics. Our task-, model- and metric-agnostic formulation predicts continuous evaluation metric scores given only black-box model inputs and outputs. Beyond predicting point estimates of metric scores, the benchmark also requires inferring prediction intervals to quantify uncertainty around point estimates. Evaluation spans 11 long-form datasets/tasks with multiple LLMs, baselines, and metrics per task. We show that scores can be effectively predicted across long-form generation tasks using as few as 16 training examples. Overall, we introduce a novel and useful task, a valuable benchmark to drive progress, and baselines ready for practical adoption today.
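Summary: We motivate and share a new benchmark for instance-level performance prediction on long-form generation tasks with multi-faceted, fine-grained quality metrics. Our task-, model-, and metric-agnostic formulation predicts continuous evaluation metric scores given only black-box model inputs and outputs. Beyond predicting point estimates of metric scores, the benchmark also requires inferring prediction intervals to quantify the uncertainty around those estimates. The evaluation spans 11 long-form datasets/tasks, with multiple LLMs, baselines, and metrics per task. We show that scores can be predicted effectively across long-form generation tasks with as few as 16 training examples. Overall, we introduce a novel and useful task, a valuable benchmark to drive progress, and baselines ready for practical adoption today.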

Title: Does This Look Familiar to You? Knowledge Analysis via Model Internal Representations

Authors: Sihyun Park
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.07311
Pdf URL: https://arxiv.org/pdf/2509.07311
Copy Paste: [[2509.07311]] Does This Look Familiar to You? Knowledge Analysis via Model Internal Representations(https://arxiv.org/abs/2509.07311)
Keywords: language model, llm, prompt
Abstract: Recent advances in large language models (LLMs) have been driven by pretraining, supervised fine tuning (SFT), and alignment tuning. Among these, SFT plays a crucial role in transforming a model's general knowledge into structured responses tailored to specific tasks. However, there is no clearly established methodology for effective training data selection. Simply increasing the volume of data does not guarantee performance improvements, while preprocessing, sampling, and validation require substantial time and cost. To address this issue, a variety of data selection methods have been proposed. Among them, knowledge based selection approaches identify suitable training data by analyzing the model's responses. Nevertheless, these methods typically rely on prompt engineering, making them sensitive to variations and incurring additional costs for prompt design. In this study, we propose Knowledge Analysis via Model Internal Representations (KAMIR), a novel approach that overcomes these limitations by analyzing data based on the model's internal representations. KAMIR computes similarities between the hidden states of each layer (block) and the final hidden states for a given input to assess the data. Unlike prior methods that were largely limited to multiple choice tasks, KAMIR can be applied to a wide range of tasks such as machine reading comprehension and summarization. Moreover, it selects data useful for training based on the model's familiarity with the input, even with a small dataset and a simple classifier architecture. Experiments across diverse task datasets demonstrate that training with less familiar data leads to better generalization performance.
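Summary: Recent advances in large language models (LLMs) have been driven by pretraining, supervised fine-tuning (SFT), and alignment tuning. Among these, SFT plays a crucial role in turning a model's general knowledge into structured responses tailored to specific tasks. However, there is no clearly established methodology for selecting training data effectively. Simply increasing the volume of data does not guarantee better performance, while preprocessing, sampling, and validation take substantial time and cost. A variety of data selection methods have been proposed to address this. Among them, knowledge-based selection approaches identify suitable training data by analyzing the model's responses. These methods typically rely on prompt engineering, however, which makes them sensitive to prompt variations and adds prompt design costs. In this study we propose Knowledge Analysis via Model Internal Representations (KAMIR), a novel approach that overcomes these limitations by analyzing data based on the model's internal representations. For a given input, KAMIR computes similarities between the hidden states of each layer (block) and the final hidden states in order to assess the data. Unlike prior methods, which were largely limited to multiple-choice tasks, KAMIR can be applied to a wide range of tasks such as machine reading comprehension and summarization. It also selects data useful for training based on the model's familiarity with the input, even with a small dataset and a simple classifier architecture. Experiments across diverse task datasets show that training on less familiar data leads to better generalization.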
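The per-layer comparison described in the KAMIR abstract can be illustrated with a small sketch. The abstract does not state which similarity measure is used or how token-level states are pooled, so the cosine similarity and the one-vector-per-layer input below are assumptions, and `layer_similarity_profile` is a hypothetical name, not the authors' code.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def layer_similarity_profile(hidden_states: Sequence[Sequence[float]]) -> List[float]:
    """Similarity of every intermediate layer's state to the final layer's state.

    `hidden_states[i]` is one pooled vector for layer i (an assumption);
    the last entry is treated as the final hidden state.
    """
    if len(hidden_states) < 2:
        raise ValueError("need at least one intermediate layer and a final layer")
    final = hidden_states[-1]
    return [cosine(layer, final) for layer in hidden_states[:-1]]


# Toy input: two intermediate layers and a final layer, 2-dimensional states.
toy_states = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
profile = layer_similarity_profile(toy_states)
print(profile)
```

A profile like this could then serve as the feature vector for the simple classifier the abstract mentions; how the authors actually build their features is not specified there.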

Title: Mitigating Attention Localization in Small Scale: Self-Attention Refinement via One-step Belief Propagation

Authors: Nakyung Lee, Yeongoon Kim, Minhae Oh, Suhwan Kim, Jin Woo Koo, Hyewon Jo, Jungwoo Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.07324
Pdf URL: https://arxiv.org/pdf/2509.07324
Copy Paste: [[2509.07324]] Mitigating Attention Localization in Small Scale: Self-Attention Refinement via One-step Belief Propagation(https://arxiv.org/abs/2509.07324)
Keywords: language model
Abstract: The Transformer-based self-attention mechanism serves as the core of modern language models, yet it often suffers from localization, where attention collapses onto a limited subset of tokens and fails to capture long-range dependencies. To address this issue, we propose Self-Attention One-step Belief Propagation (SAOBP), a refinement framework that injects multi-hop relationships through a belief propagation process. To interpret and quantify these interactions, we introduce Global Token Dependency (GTD), which captures the relative contribution of multi-hop connections within the attention graph. Empirical results indicate that SAOBP helps prevent entropy collapse in deeper layers and adaptively maintains GTD at task-appropriate levels, thereby supporting improvements in model performance. Importantly, we observe competitive gains in small-scale models, highlighting its potential for improving inference quality in resource-constrained scenarios.
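Summary: The Transformer-based self-attention mechanism is the core of modern language models, yet it often suffers from localization, in which attention collapses onto a limited subset of tokens and fails to capture long-range dependencies. To address this, we propose Self-Attention One-step Belief Propagation (SAOBP), a refinement framework that injects multi-hop relationships through a belief propagation process. To interpret and quantify these interactions, we introduce Global Token Dependency (GTD), which captures the relative contribution of multi-hop connections within the attention graph. Empirical results indicate that SAOBP helps prevent entropy collapse in deeper layers and adaptively keeps GTD at task-appropriate levels, thereby supporting improvements in model performance. Importantly, we observe competitive gains in small-scale models, which highlights its potential for improving inference quality in resource-constrained scenarios.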

Title: PersonaFuse: A Personality Activation-Driven Framework for Enhancing Human-LLM Interactions

Authors: Yixuan Tang, Yi Yang, Ahmed Abbasi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.07370
Pdf URL: https://arxiv.org/pdf/2509.07370
Copy Paste: [[2509.07370]] PersonaFuse: A Personality Activation-Driven Framework for Enhancing Human-LLM Interactions(https://arxiv.org/abs/2509.07370)
Keywords: language model, gpt, llm, prompt
Abstract: Recent advancements in Large Language Models (LLMs) demonstrate remarkable capabilities across various fields. These developments have led to more direct communication between humans and LLMs in various situations, such as social companionship and psychological support. However, LLMs often exhibit limitations in emotional perception and social competence during real-world conversations. These limitations partly originate from their inability to adapt their communication style and emotional expression to different social and task contexts. In this work, we introduce PersonaFuse, a novel LLM post-training framework that enables LLMs to adapt and express different personalities for varying situations. Inspired by Trait Activation Theory and the Big Five personality model, PersonaFuse employs a Mixture-of-Expert architecture that combines persona adapters with a dynamic routing network, enabling contextual trait expression. Experimental results show that PersonaFuse substantially outperforms baseline models across multiple dimensions of social-emotional intelligence. Importantly, these gains are achieved without sacrificing general reasoning ability or model safety, which remain common limitations of direct prompting and supervised fine-tuning approaches. PersonaFuse also delivers consistent improvements in downstream human-centered applications, such as mental health counseling and review-based customer service. Finally, human preference evaluations against leading LLMs, including GPT-4o and DeepSeek, demonstrate that PersonaFuse achieves competitive response quality despite its comparatively smaller model size. These findings demonstrate that PersonaFuse offers a theoretically grounded and practical approach for developing social-emotional enhanced LLMs, marking a significant advancement toward more human-centric AI systems.
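Summary: Recent advances in large language models (LLMs) demonstrate remarkable capabilities across many fields. These developments have led to more direct communication between humans and LLMs in settings such as social companionship and psychological support. However, LLMs often show limitations in emotional perception and social competence during real-world conversations. These limitations stem partly from their inability to adapt their communication style and emotional expression to different social and task contexts. In this work we introduce PersonaFuse, a novel LLM post-training framework that enables LLMs to adapt and express different personalities for different situations. Inspired by Trait Activation Theory and the Big Five personality model, PersonaFuse uses a Mixture-of-Expert architecture that combines persona adapters with a dynamic routing network, enabling contextual trait expression. Experimental results show that PersonaFuse substantially outperforms baseline models across multiple dimensions of social-emotional intelligence. Importantly, these gains come without sacrificing general reasoning ability or model safety, which remain common limitations of direct prompting and supervised fine-tuning approaches. PersonaFuse also delivers consistent improvements in downstream human-centered applications such as mental health counseling and review-based customer service. Finally, human preference evaluations against leading LLMs, including GPT-4o and DeepSeek, show that PersonaFuse achieves competitive response quality despite its comparatively smaller model size. These findings show that PersonaFuse offers a theoretically grounded and practical approach to developing social-emotional enhanced LLMs, marking a significant advance toward more human-centric AI systems.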

Title: Talking with Oompa Loompas: A novel framework for evaluating linguistic acquisition of LLM agents

Authors: Sankalp Tattwadarshi Swain, Anshika Krishnatray, Dhruv Kumar, Jagat Sesh Challa
Subjects: cs.CL, cs.AI, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2509.07389
Pdf URL: https://arxiv.org/pdf/2509.07389
Copy Paste: [[2509.07389]] Talking with Oompa Loompas: A novel framework for evaluating linguistic acquisition of LLM agents(https://arxiv.org/abs/2509.07389)
Keywords: language model, llm, agent
Abstract: Existing evaluation studies on linguistic competence of large language models (LLM agents) have focused primarily on vocabulary learning, morphological rule induction, syntactic generalization, pragmatic inference, and cross-linguistic transfer. However, none assess whether LLM agents can acquire a language through pattern recognition and interactive feedback, a central feature of human language acquisition. We propose a novel experimental framework in which an LLM agent is evaluated on its ability to acquire and use a newly constructed language (Tinkatongue) in conversation with a bot that understands only Tinkatongue. Our findings show that LLM agents fail to establish a conversation within 100 responses, yet they adopt distinct strategies that mirror human approaches to language learning. The results suggest a new direction for evaluation benchmarks and open pathways to model designs that learn more effectively from interactive feedback.
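Summary: Existing evaluation studies of the linguistic competence of large language models (LLM agents) have focused mainly on vocabulary learning, morphological rule induction, syntactic generalization, pragmatic inference, and cross-linguistic transfer. None, however, assess whether LLM agents can acquire a language through pattern recognition and interactive feedback, a central feature of human language acquisition. We propose a novel experimental framework in which an LLM agent is evaluated on its ability to acquire and use a newly constructed language (Tinkatongue) in conversation with a bot that understands only Tinkatongue. Our findings show that LLM agents fail to establish a conversation within 100 responses, yet they adopt distinct strategies that mirror human approaches to language learning. The results suggest a new direction for evaluation benchmarks and open pathways to model designs that learn more effectively from interactive feedback.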

Title: The Role of Exploration Modules in Small Language Models for Knowledge Graph Question Answering

Authors: Yi-Jie Cheng, Oscar Chew, Yun-Nung Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.07399
Pdf URL: https://arxiv.org/pdf/2509.07399
Copy Paste: [[2509.07399]] The Role of Exploration Modules in Small Language Models for Knowledge Graph Question Answering(https://arxiv.org/abs/2509.07399)
Keywords: language model, llm, hallucination
Abstract: Integrating knowledge graphs (KGs) into the reasoning processes of large language models (LLMs) has emerged as a promising approach to mitigate hallucination. However, existing work in this area often relies on proprietary or extremely large models, limiting accessibility and scalability. In this study, we investigate the capabilities of existing integration methods for small language models (SLMs) in KG-based question answering and observe that their performance is often constrained by their limited ability to traverse and reason over knowledge graphs. To address this limitation, we propose leveraging simple and efficient exploration modules to handle knowledge graph traversal in place of the language model itself. Experiment results demonstrate that these lightweight modules effectively improve the performance of small language models on knowledge graph question answering tasks. Source code: this https URL.
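Summary: Integrating knowledge graphs (KGs) into the reasoning processes of large language models (LLMs) has emerged as a promising way to mitigate hallucination. However, existing work in this area often relies on proprietary or extremely large models, which limits accessibility and scalability. In this study we investigate how well existing integration methods work for small language models (SLMs) in KG-based question answering and observe that their performance is often constrained by their limited ability to traverse and reason over knowledge graphs. To address this limitation, we propose using simple and efficient exploration modules to handle knowledge graph traversal in place of the language model itself. Experimental results show that these lightweight modules effectively improve the performance of small language models on knowledge graph question answering tasks. Source code: this https URL.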

Title: LongEmotion: Measuring Emotional Intelligence of Large Language Models in Long-Context Interaction

Authors: Weichu Liu, Jing Xiong, Yuxuan Hu, Zixuan Li, Minghuan Tan, Ningning Mao, Chenyang Zhao, Zhongwei Wan, Chaofan Tao, Wendong Xu, Hui Shen, Chengming Li, Lingpeng Kong, Ngai Wong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.07403
Pdf URL: https://arxiv.org/pdf/2509.07403
Copy Paste: [[2509.07403]] LongEmotion: Measuring Emotional Intelligence of Large Language Models in Long-Context Interaction(https://arxiv.org/abs/2509.07403)
Keywords: language model, gpt, llm, prompt, retrieval-augmented generation
Abstract: Large language models (LLMs) make significant progress in Emotional Intelligence (EI) and long-context understanding. However, existing benchmarks tend to overlook certain aspects of EI in long-context scenarios, especially under realistic, practical settings where interactions are lengthy, diverse, and often noisy. To move towards such realistic settings, we present LongEmotion, a benchmark specifically designed for long-context EI tasks. It covers a diverse set of tasks, including Emotion Classification, Emotion Detection, Emotion QA, Emotion Conversation, Emotion Summary, and Emotion Expression. On average, the input length for these tasks reaches 8,777 tokens, with long-form generation required for Emotion Expression. To enhance performance under realistic constraints, we incorporate Retrieval-Augmented Generation (RAG) and Collaborative Emotional Modeling (CoEM), and compare them with standard prompt-based methods. Unlike conventional approaches, our RAG method leverages both the conversation context and the large language model itself as retrieval sources, avoiding reliance on external knowledge bases. The CoEM method further improves performance by decomposing the task into five stages, integrating both retrieval augmentation and limited knowledge injection. Experimental results show that both RAG and CoEM consistently enhance EI-related performance across most long-context tasks, advancing LLMs toward more practical and real-world EI applications. Furthermore, we conducted a comparative case study experiment on the GPT series to demonstrate the differences among various models in terms of EI. Code is available on GitHub at this https URL, and the project page can be found at this https URL.
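Summary: Large language models (LLMs) have made significant progress in Emotional Intelligence (EI) and long-context understanding. However, existing benchmarks tend to overlook certain aspects of EI in long-context scenarios, especially under realistic, practical settings where interactions are lengthy, diverse, and often noisy. To move towards such realistic settings, we present LongEmotion, a benchmark designed specifically for long-context EI tasks. It covers a diverse set of tasks: Emotion Classification, Emotion Detection, Emotion QA, Emotion Conversation, Emotion Summary, and Emotion Expression. On average, the input length for these tasks reaches 8,777 tokens, and Emotion Expression requires long-form generation. To enhance performance under realistic constraints, we incorporate Retrieval-Augmented Generation (RAG) and Collaborative Emotional Modeling (CoEM) and compare them with standard prompt-based methods. Unlike conventional approaches, our RAG method uses both the conversation context and the large language model itself as retrieval sources, avoiding reliance on external knowledge bases. The CoEM method further improves performance by decomposing the task into five stages that integrate retrieval augmentation and limited knowledge injection. Experimental results show that both RAG and CoEM consistently enhance EI-related performance across most long-context tasks, advancing LLMs toward more practical, real-world EI applications. We also conducted a comparative case study on the GPT series to show how the models differ in terms of EI. Code is available on GitHub at this https URL, and the project page can be found at this https URL.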

Title: AIxcellent Vibes at GermEval 2025 Shared Task on Candy Speech Detection: Improving Model Performance by Span-Level Training

Authors: Christian Rene Thelen, Patrick Gustav Blaneck, Tobias Bornheim, Niklas Grieger, Stephan Bialonski
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.07459
Pdf URL: https://arxiv.org/pdf/2509.07459
Copy Paste: [[2509.07459]] AIxcellent Vibes at GermEval 2025 Shared Task on Candy Speech Detection: Improving Model Performance by Span-Level Training(https://arxiv.org/abs/2509.07459)
Keywords: language model
Abstract: Positive, supportive online communication in social media (candy speech) has the potential to foster civility, yet automated detection of such language remains underexplored, limiting systematic analysis of its impact. We investigate how candy speech can be reliably detected in a 46k-comment German YouTube corpus by monolingual and multilingual language models, including GBERT, Qwen3 Embedding, and XLM-RoBERTa. We find that a multilingual XLM-RoBERTa-Large model trained to detect candy speech at the span level outperforms other approaches, ranking first in both the binary (positive F1: 0.8906) and categorized span-based detection (strict F1: 0.6307) subtasks at the GermEval 2025 Shared Task on Candy Speech Detection. We speculate that span-based training, multilingual capabilities, and emoji-aware tokenizers improved detection performance. Our results demonstrate the effectiveness of multilingual models in identifying positive, supportive language.
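Summary: Positive, supportive online communication in social media (candy speech) has the potential to foster civility, yet automated detection of such language remains underexplored, which limits systematic analysis of its impact. We investigate how candy speech can be reliably detected in a 46k-comment German YouTube corpus by monolingual and multilingual language models, including GBERT, Qwen3 Embedding, and XLM-RoBERTa. We find that a multilingual XLM-RoBERTa-Large model trained to detect candy speech at the span level outperforms the other approaches, ranking first in both the binary (positive F1: 0.8906) and categorized span-based detection (strict F1: 0.6307) subtasks of the GermEval 2025 Shared Task on Candy Speech Detection. We speculate that span-based training, multilingual capabilities, and emoji-aware tokenizers improved detection performance. Our results demonstrate the effectiveness of multilingual models in identifying positive, supportive language.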

Title: HALT-RAG: A Task-Adaptable Framework for Hallucination Detection with Calibrated NLI Ensembles and Abstention

Authors: Saumya Goswami, Siddharth Kurra
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.07475
Pdf URL: https://arxiv.org/pdf/2509.07475
Copy Paste: [[2509.07475]] HALT-RAG: A Task-Adaptable Framework for Hallucination Detection with Calibrated NLI Ensembles and Abstention(https://arxiv.org/abs/2509.07475)
Keywords: language model, hallucination, retrieval-augmented generation
Abstract: Detecting content that contradicts or is unsupported by a given source text is a critical challenge for the safe deployment of generative language models. We introduce HALT-RAG, a post-hoc verification system designed to identify hallucinations in the outputs of Retrieval-Augmented Generation (RAG) pipelines. Our flexible and task-adaptable framework uses a universal feature set derived from an ensemble of two frozen, off-the-shelf Natural Language Inference (NLI) models and lightweight lexical signals. These features are used to train a simple, calibrated, and task-adapted meta-classifier. Using a rigorous 5-fold out-of-fold (OOF) training protocol to prevent data leakage and produce unbiased estimates, we evaluate our system on the HaluEval benchmark. By pairing our universal feature set with a lightweight, task-adapted classifier and a precision-constrained decision policy, HALT-RAG achieves strong OOF F1-scores of 0.7756, 0.9786, and 0.7391 on the summarization, QA, and dialogue tasks, respectively. The system's well-calibrated probabilities enable a practical abstention mechanism, providing a reliable tool for balancing model performance with safety requirements.
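The precision-constrained decision policy and abstention mechanism named in the HALT-RAG abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names `pick_threshold` and `decide`, the label convention (1 = hallucinated), and the fixed-width abstention band around the threshold.

```python
from typing import List, Optional


def pick_threshold(probs: List[float], labels: List[int],
                   min_precision: float) -> Optional[float]:
    """Return the lowest threshold whose precision meets min_precision.

    probs are calibrated hallucination probabilities, labels use 1 for
    "hallucinated". Choosing the lowest qualifying threshold keeps recall
    as high as possible. Returns None if no threshold qualifies.
    """
    for t in sorted(set(probs)):
        flagged = [y for p, y in zip(probs, labels) if p >= t]
        if flagged and sum(flagged) / len(flagged) >= min_precision:
            return t
    return None


def decide(prob: float, threshold: float, band: float = 0.1) -> str:
    """Abstain when the score lies within `band` of the threshold
    (hypothetical abstention rule); otherwise flag or pass the output."""
    if abs(prob - threshold) < band:
        return "abstain"
    return "hallucinated" if prob >= threshold else "supported"


if __name__ == "__main__":
    probs = [0.1, 0.4, 0.6, 0.8, 0.9]
    labels = [0, 1, 0, 1, 1]
    t = pick_threshold(probs, labels, min_precision=0.9)
    print(t, decide(0.95, t), decide(0.85, t), decide(0.2, t))
```

In practice the threshold would be chosen on held-out (for example out-of-fold) predictions rather than on the data being scored, and the band width would be tuned to the acceptable abstention rate.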

Title: ALLabel: Three-stage Active Learning for LLM-based Entity Recognition using Demonstration Retrieval

Authors: Zihan Chen, Lei Shi, Weize Wu, Qiji Zhou, Yue Zhang
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2509.07512
Pdf URL: https://arxiv.org/pdf/2509.07512
Copy Paste: [[2509.07512]] ALLabel: Three-stage Active Learning for LLM-based Entity Recognition using Demonstration Retrieval(https://arxiv.org/abs/2509.07512)
Keywords: language model, llm
Abstract: Many contemporary data-driven research efforts in the natural sciences, such as chemistry and materials science, require large-scale, high-performance entity recognition from scientific datasets. Large language models (LLMs) have increasingly been adopted to solve the entity recognition task, with the same trend being observed across the full spectrum of NLP tasks. The prevailing entity recognition LLMs rely on fine-tuning, yet the fine-tuning process often incurs significant cost. To achieve the best performance-cost trade-off, we propose ALLabel, a three-stage framework designed to select the most informative and representative samples in preparing the demonstrations for LLM modeling. The annotated examples are used to construct a ground-truth retrieval corpus for LLM in-context learning. By sequentially employing three distinct active learning strategies, ALLabel consistently outperforms all baselines under the same annotation budget across three specialized domain datasets. Experimental results also demonstrate that selectively annotating only 5%-10% of the dataset with ALLabel can achieve performance comparable to annotating the entire dataset. Further analyses and ablation studies verify the effectiveness and generalizability of our proposal.

Title: VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents

Authors: Zheng Wu, Heyuan Huang, Xingyu Lou, Xiangmou Qu, Pengzhou Cheng, Zongru Wu, Weiwen Liu, Weinan Zhang, Jun Wang, Zhaoxiang Wang, Zhuosheng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.07553
Pdf URL: https://arxiv.org/pdf/2509.07553
Copy Paste: [[2509.07553]] VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents(https://arxiv.org/abs/2509.07553)
Keywords: language model, agent
Abstract: With the rapid progress of multimodal large language models, operating system (OS) agents are becoming increasingly capable of automating tasks through on-device graphical user interfaces (GUIs). However, most existing OS agents are designed for idealized settings, whereas real-world environments often present untrustworthy conditions. To mitigate risks of over-execution in such scenarios, we propose a query-driven human-agent-GUI interaction framework that enables OS agents to decide when to query humans for more reliable task completion. Built upon this framework, we introduce VeriOS-Agent, a trustworthy OS agent trained with a two-stage learning paradigm that facilitates the decoupling and utilization of meta-knowledge. Concretely, VeriOS-Agent autonomously executes actions in normal conditions while proactively querying humans in untrustworthy scenarios. Experiments show that VeriOS-Agent improves the average step-wise success rate by 20.64% in untrustworthy scenarios over the state-of-the-art, without compromising normal performance. Analysis highlights VeriOS-Agent's rationality, generalizability, and scalability. The codes, datasets and models are available at this https URL.

Title: Avoiding Knowledge Edit Skipping in Multi-hop Question Answering with Guided Decomposition

Authors: Yi Liu, Xiangrong Zhu, Xiangyu Liu, Wei Wei, Wei Hu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.07555
Pdf URL: https://arxiv.org/pdf/2509.07555
Copy Paste: [[2509.07555]] Avoiding Knowledge Edit Skipping in Multi-hop Question Answering with Guided Decomposition(https://arxiv.org/abs/2509.07555)
Keywords: language model, llm, retrieval-augmented generation
Abstract: In a rapidly evolving world where information updates swiftly, knowledge in large language models (LLMs) becomes outdated quickly. Retraining LLMs is not a cost-effective option, making knowledge editing (KE) without modifying parameters particularly necessary. We find that although existing retrieval-augmented generation (RAG)-based KE methods excel at editing simple knowledge, they struggle with KE in multi-hop question answering due to the issue of "edit skipping", which refers to skipping the relevant edited fact in inference. In addition to the diversity of natural language expressions of knowledge, edit skipping also arises from the mismatch between the granularity of LLMs in problem-solving and the facts in the edited memory. To address this issue, we propose a novel Iterative Retrieval-Augmented Knowledge Editing method with guided decomposition (IRAKE) through the guidance from single edited facts and entire edited cases. Experimental results demonstrate that IRAKE mitigates the failure of editing caused by edit skipping and outperforms state-of-the-art methods for KE in multi-hop question answering.

Title: BALI: Enhancing Biomedical Language Representations through Knowledge Graph and Language Model Alignment

Authors: Andrey Sakhovskiy, Elena Tutubalina
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.07588
Pdf URL: https://arxiv.org/pdf/2509.07588
Copy Paste: [[2509.07588]] BALI: Enhancing Biomedical Language Representations through Knowledge Graph and Language Model Alignment(https://arxiv.org/abs/2509.07588)
Keywords: language model, llm
Abstract: In recent years, there has been substantial progress in using pretrained Language Models (LMs) on a range of tasks aimed at improving the understanding of biomedical texts. Nonetheless, existing biomedical LLMs show limited comprehension of complex, domain-specific concept structures and the factual information encoded in biomedical Knowledge Graphs (KGs). In this work, we propose BALI (Biomedical Knowledge Graph and Language Model Alignment), a novel joint LM and KG pre-training method that augments an LM with external knowledge by the simultaneous learning of a dedicated KG encoder and aligning the representations of both the LM and the graph. For a given textual sequence, we link biomedical concept mentions to the Unified Medical Language System (UMLS) KG and utilize local KG subgraphs as cross-modal positive samples for these mentions. Our empirical findings indicate that implementing our method on several leading biomedical LMs, such as PubMedBERT and BioLinkBERT, improves their performance on a range of language understanding tasks and the quality of entity representations, even with minimal pre-training on a small alignment dataset sourced from PubMed scientific abstracts.

Title: MaLei at MultiClinSUM: Summarisation of Clinical Documents using Perspective-Aware Iterative Self-Prompting with LLMs

Authors: Libo Ren, Yee Man Ng, Lifeng Han
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.07622
Pdf URL: https://arxiv.org/pdf/2509.07622
Copy Paste: [[2509.07622]] MaLei at MultiClinSUM: Summarisation of Clinical Documents using Perspective-Aware Iterative Self-Prompting with LLMs(https://arxiv.org/abs/2509.07622)
Keywords: language model, gpt, llm, prompt
Abstract: Efficient communication between patients and clinicians plays an important role in shared decision-making. However, clinical reports are often lengthy and filled with clinical jargon, making it difficult for domain experts to identify important aspects in the document efficiently. This paper presents the methodology we applied in the MultiClinSUM shared task for summarising clinical case documents. We used an Iterative Self-Prompting technique on large language models (LLMs) by asking LLMs to generate task-specific prompts and refine them via example-based few-shot learning. Furthermore, we used lexical and embedding space metrics, ROUGE and BERTScore, to guide the model fine-tuning with epochs. Our submission using perspective-aware ISP on GPT-4 and GPT-4o achieved ROUGE scores (46.53, 24.68, 30.77) and BERTScores (87.84, 83.25, 85.46) for (P, R, F1) from the official evaluation on 3,396 clinical case reports from various specialties extracted from open journals. The high BERTScore indicates that the model produced semantically equivalent output summaries compared to the references, even though the overlap at the exact lexicon level is lower, as reflected in the lower ROUGE scores. This work sheds some light on how perspective-aware ISP (PA-ISP) can be deployed for clinical report summarisation and support better communication between patients and clinicians.

Title: MoLoRAG: Bootstrapping Document Understanding via Multi-modal Logic-aware Retrieval

Authors: Xixi Wu, Yanchao Tan, Nan Hou, Ruiyang Zhang, Hong Cheng
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2509.07666
Pdf URL: https://arxiv.org/pdf/2509.07666
Copy Paste: [[2509.07666]] MoLoRAG: Bootstrapping Document Understanding via Multi-modal Logic-aware Retrieval(https://arxiv.org/abs/2509.07666)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Document Understanding is a foundational AI capability with broad applications, and Document Question Answering (DocQA) is a key evaluation task. Traditional methods convert the document into text for processing by Large Language Models (LLMs), but this process strips away critical multi-modal information like figures. While Large Vision-Language Models (LVLMs) address this limitation, their constrained input size makes multi-page document comprehension infeasible. Retrieval-augmented generation (RAG) methods mitigate this by selecting relevant pages, but they rely solely on semantic relevance, ignoring logical connections between pages and the query, which are essential for reasoning. To this end, we propose MoLoRAG, a logic-aware retrieval framework for multi-modal, multi-page document understanding. By constructing a page graph that captures contextual relationships between pages, a lightweight VLM performs graph traversal to retrieve relevant pages, including those with logical connections often overlooked. This approach combines semantic and logical relevance to deliver more accurate retrieval. After retrieval, the top-$K$ pages are fed into arbitrary LVLMs for question answering. To enhance flexibility, MoLoRAG offers two variants: a training-free solution for easy deployment and a fine-tuned version to improve logical relevance checking. Experiments on four DocQA datasets demonstrate average improvements of 9.68% in accuracy over LVLM direct inference and 7.44% in retrieval precision over baselines. Codes and datasets are released at this https URL.
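The idea of traversing a page graph and ranking pages by combined semantic and logical relevance, as described in the MoLoRAG abstract, can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm: the function `retrieve_pages`, the adjacency-dict graph, the `logical` callback (a stand-in for a VLM-based relevance check), and the plain averaging of the two scores are all hypothetical.

```python
from typing import Callable, Dict, List


def retrieve_pages(
    graph: Dict[int, List[int]],       # page -> linked pages (assumed structure)
    semantic: Dict[int, float],        # page -> semantic similarity to the query
    logical: Callable[[int], float],   # stand-in for a VLM logical-relevance score
    seeds: int = 1,
    hops: int = 1,
    k: int = 2,
) -> List[int]:
    # Start from the pages with the highest semantic score.
    frontier = sorted(semantic, key=semantic.get, reverse=True)[:seeds]
    visited = set(frontier)
    # Expand along graph edges for a fixed number of hops.
    for _ in range(hops):
        next_frontier = []
        for page in frontier:
            for neighbour in graph.get(page, []):
                if neighbour not in visited:
                    visited.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    # Combined score: simple average of both signals (an assumption).
    score = {p: (semantic[p] + logical(p)) / 2 for p in visited}
    return sorted(score, key=score.get, reverse=True)[:k]


if __name__ == "__main__":
    graph = {0: [1], 1: [0, 2], 2: [1], 3: []}
    semantic = {0: 0.9, 1: 0.2, 2: 0.1, 3: 0.5}
    logical_scores = {0: 0.5, 1: 0.9, 2: 0.0, 3: 0.0}
    print(retrieve_pages(graph, semantic, logical_scores.get))
```

In this toy input, page 1 has low semantic similarity but is reached through its link to page 0 and ranked on its logical score, whereas page 3 is never visited despite a higher semantic score.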

Title: M-BRe: Discovering Training Samples for Relation Extraction from Unlabeled Texts with Large Language Models

Authors: Zexuan Li, Hongliang Dai, Piji Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.07730
Pdf URL: https://arxiv.org/pdf/2509.07730
Copy Paste: [[2509.07730]] M-BRe: Discovering Training Samples for Relation Extraction from Unlabeled Texts with Large Language Models(https://arxiv.org/abs/2509.07730)
Keywords: language model, llm
Abstract: For Relation Extraction (RE), the manual annotation of training data may be prohibitively expensive, since the sentences that contain the target relations in texts can be very scarce and difficult to find. It is therefore beneficial to develop an efficient method that can automatically extract training instances from unlabeled texts for training RE models. Recently, large language models (LLMs) have been adopted in various natural language processing tasks, with RE also benefiting from their advances. However, when leveraging LLMs for RE with predefined relation categories, two key challenges arise. First, in a multi-class classification setting, LLMs often struggle to comprehensively capture the semantics of every relation, leading to suboptimal results. Second, although employing binary classification for each relation individually can mitigate this issue, it introduces significant computational overhead, resulting in impractical time complexity for real-world applications. Therefore, this paper proposes a framework called M-BRe to extract training instances from unlabeled texts for RE. It utilizes three modules to combine the advantages of both of the above classification approaches: Relation Grouping, Relation Extraction, and Label Decision. Extensive experiments confirm its superior capability in discovering high-quality training samples from unlabeled texts for RE.

Title: Factuality Beyond Coherence: Evaluating LLM Watermarking Methods for Medical Texts

Authors: Rochana Prih Hastuti, Rian Adam Rajagede, Mansour Al Ghanim, Mengxin Zheng, Qian Lou
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2509.07755
Pdf URL: https://arxiv.org/pdf/2509.07755
Copy Paste: [[2509.07755]] Factuality Beyond Coherence: Evaluating LLM Watermarking Methods for Medical Texts(https://arxiv.org/abs/2509.07755)
Keywords: language model, gpt, llm
Abstract: As large language models (LLMs) are adapted to sensitive domains such as medicine, their fluency raises safety risks, particularly regarding provenance and accountability. Watermarking embeds detectable patterns to mitigate these risks, yet its reliability in medical contexts remains untested. Existing benchmarks focus on detection-quality tradeoffs, overlooking factual risks under low-entropy settings often exploited by watermarking's reweighting strategy. We propose a medical-focused evaluation workflow that jointly assesses factual accuracy and coherence. Using GPT-Judger and further human validation, we introduce the Factuality-Weighted Score (FWS), a composite metric prioritizing factual accuracy beyond coherence to guide watermarking deployment in medical domains. Our evaluation shows current watermarking methods substantially compromise medical factuality, with entropy shifts degrading medical entity representation. These findings underscore the need for domain-aware watermarking approaches that preserve the integrity of medical content.

Title: Are LLMs Enough for Hyperpartisan, Fake, Polarized and Harmful Content Detection? Evaluating In-Context Learning vs. Fine-Tuning

Authors: Michele Joshua Maggini, Dhia Merzougui, Rabiraj Bandyopadhyay, Gaël Dias, Fabrice Maurel, Pablo Gamallo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.07768
Pdf URL: https://arxiv.org/pdf/2509.07768
Copy Paste: [[2509.07768]] Are LLMs Enough for Hyperpartisan, Fake, Polarized and Harmful Content Detection? Evaluating In-Context Learning vs. Fine-Tuning(https://arxiv.org/abs/2509.07768)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: The spread of fake news, polarizing, politically biased, and harmful content on online platforms has been a serious concern. Although large language models have become a promising approach, no study has properly benchmarked their performance across different models, usage methods, and languages. This study presents a comprehensive overview of different Large Language Model adaptation paradigms for the detection of hyperpartisan and fake news, harmful tweets, and political bias. Our experiments spanned 10 datasets and 5 different languages (English, Spanish, Portuguese, Arabic and Bulgarian), covering both binary and multiclass classification scenarios. We tested different strategies ranging from parameter-efficient Fine-Tuning of language models to a variety of different In-Context Learning strategies and prompts. These included zero-shot prompts, codebooks, few-shot (with both randomly-selected and diversely-selected examples using Determinantal Point Processes), and Chain-of-Thought. We discovered that In-Context Learning often underperforms when compared to Fine-Tuning a model. This main finding highlights the importance of Fine-Tuning even smaller models on task-specific settings even when compared to the largest models evaluated in an In-Context Learning setup - in our case LlaMA3.1-8b-Instruct, Mistral-Nemo-Instruct-2407 and Qwen2.5-7B-Instruct.

Title: Dual Knowledge-Enhanced Two-Stage Reasoner for Multimodal Dialog Systems

Authors: Xiaolin Chen, Xuemeng Song, Haokun Wen, Weili Guan, Xiangyu Zhao, Liqiang Nie
Subjects: cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2509.07817
Pdf URL: https://arxiv.org/pdf/2509.07817
Copy Paste: [[2509.07817]] Dual Knowledge-Enhanced Two-Stage Reasoner for Multimodal Dialog Systems(https://arxiv.org/abs/2509.07817)
Keywords: language model, llm
Abstract: Textual response generation is pivotal for multimodal task-oriented dialog systems, which aim to generate proper textual responses based on the multimodal context. While existing efforts have demonstrated remarkable progress, there still exist the following limitations: 1) neglect of unstructured review knowledge and 2) underutilization of large language models (LLMs). Inspired by this, we aim to fully utilize dual knowledge (i.e., structured attribute and unstructured review knowledge) with LLMs to promote textual response generation in multimodal task-oriented dialog systems. However, this task is non-trivial due to two key challenges: 1) dynamic knowledge type selection and 2) intention-response decoupling. To address these challenges, we propose a novel dual knowledge-enhanced two-stage reasoner by adapting LLMs for multimodal dialog systems (named DK2R). To be specific, DK2R first extracts both structured attribute and unstructured review knowledge from an external knowledge base given the dialog context. Thereafter, DK2R uses an LLM to evaluate each knowledge type's utility by analyzing LLM-generated provisional probe responses. Moreover, DK2R separately summarizes the intention-oriented key clues via dedicated reasoning, which are further used as auxiliary signals to enhance LLM-based textual response generation. Extensive experiments conducted on a public dataset verify the superiority of DK2R. We have released the codes and parameters.

Title: Small Open Models Achieve Near Parity with Large Models in Low Resource Literary Translation at a Fraction of the Cost

Authors: Mihai Nadas, Laura Diosan, Andreea Tomescu, Andrei Piscoran
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.07829
Pdf URL: https://arxiv.org/pdf/2509.07829
Copy Paste: [[2509.07829]] Small Open Models Achieve Near Parity with Large Models in Low Resource Literary Translation at a Fraction of the Cost(https://arxiv.org/abs/2509.07829)
Keywords: language model, llm, prompt
Abstract: Literary translation has recently gained attention as a distinct and complex task in machine translation research. However, the translation by small open models remains an open problem. We contribute to this ongoing research by introducing TINYFABULIST TRANSLATION FRAMEWORK (TF2), a unified framework for dataset creation, fine tuning, and evaluation in English-Romanian literary translations, centred on the creation and open release of both a compact, fine tuned language model (TF2-12B) and large scale synthetic parallel datasets (DS-TF2-EN-RO-3M and DS-TF2-EN-RO-15K). Building on DS-TF1-EN-3M (TF1), the largest collection of synthetic English fables to date, we address the need for rich, high quality literary datasets in low resource languages such as Romanian. Our pipeline first generates 15k high quality Romanian references from the TF1 pool using a high performing LLM. We then apply a two stage fine tuning process to a 12B parameter open weight model: (i) instruction tuning to capture genre specific narrative style, and (ii) adapter compression for efficient deployment. Evaluation combines corpus level BLEU and a five dimension LLM based rubric (accuracy, fluency, coherence, style, cultural adaptation) to provide a nuanced assessment of translation quality. Results show that our fine tuned model achieves fluency and adequacy competitive with top performing large proprietary models, while being open, accessible, and significantly more cost effective. Alongside the fine tuned model and both datasets, we publicly release all scripts and evaluation prompts. TF2 thus provides an end-to-end, reproducible pipeline for research on cost efficient translation, cross lingual narrative generation, and the broad adoption of open models for culturally significant literary content in low resource settings.

Title: Are Humans as Brittle as Large Language Models?

Authors: Jiahui Li, Sean Papay, Roman Klinger
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2509.07869
Pdf URL: https://arxiv.org/pdf/2509.07869
Copy Paste: [[2509.07869]] Are Humans as Brittle as Large Language Models?(https://arxiv.org/abs/2509.07869)
Keywords: language model, llm, prompt
Abstract: The output of large language models (LLMs) is unstable, due both to the non-determinism of the decoding process and to prompt brittleness. While the intrinsic non-determinism of LLM generation may mimic existing uncertainty in human annotations through distributional shifts in outputs, it is largely assumed, yet unexplored, that the prompt brittleness effect is unique to LLMs. This raises the question: do human annotators show similar sensitivity to instruction changes? If so, should prompt brittleness in LLMs be considered problematic? One may alternatively hypothesize that prompt brittleness correctly reflects human annotation variances. To fill this research gap, we systematically compare the effects of prompt modifications on LLMs and identical instruction modifications for human annotators, focusing on the question of whether humans are similarly sensitive to prompt perturbations. To study this, we prompt both humans and LLMs for a set of text classification tasks conditioned on prompt variations. Our findings indicate that both humans and LLMs exhibit increased brittleness in response to specific types of prompt modifications, particularly those involving the substitution of alternative label sets or label formats. However, the distribution of human judgments is less affected by typographical errors and reversed label order than that of LLMs.

Title: From Detection to Mitigation: Addressing Gender Bias in Chinese Texts via Efficient Tuning and Voting-Based Rebalancing

Authors: Chengyan Wu, Yiqiang Cai, Yufei Cheng, Yun Xue
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.07889
Pdf URL: https://arxiv.org/pdf/2509.07889
Copy Paste: [[2509.07889]] From Detection to Mitigation: Addressing Gender Bias in Chinese Texts via Efficient Tuning and Voting-Based Rebalancing(https://arxiv.org/abs/2509.07889)
Keywords: language model, llm
Abstract: This paper presents our team's solution to Shared Task 7 of NLPCC-2025, which focuses on sentence-level gender bias detection and mitigation in Chinese. The task aims to promote fairness and controllability in natural language generation by automatically detecting, classifying, and mitigating gender bias. To address this challenge, we adopt a fine-tuning approach based on large language models (LLMs) and efficiently adapt them to the bias detection task via Low-Rank Adaptation (LoRA). In terms of data processing, we construct a more balanced training set to alleviate class imbalance and introduce heterogeneous samples from multiple sources to enhance model generalization. For the detection and classification sub-tasks, we employ a majority voting strategy that integrates outputs from multiple expert models to boost performance. Additionally, to improve bias generation detection and mitigation, we design a multi-temperature sampling mechanism to capture potential variations in bias expression styles. Experimental results demonstrate the effectiveness of our approach in bias detection, classification, and mitigation. Our method ultimately achieves an average score of 47.90%, ranking fourth in the shared task.
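The majority voting strategy mentioned in the abstract above, which integrates outputs from several models, can be sketched in a few lines. The function name `majority_vote`, the label strings, and the tie-breaking rule are assumptions for illustration; the abstract does not specify how ties are resolved.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine per-model labels into one label per sentence.

    predictions[m][i] is model m's label for sentence i. Ties go to the
    label that appears first in model order (an assumed tie-break).
    """
    voted = []
    for labels in zip(*predictions):
        # most_common orders equal counts by first occurrence.
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


if __name__ == "__main__":
    preds = [
        ["biased", "neutral", "biased"],    # model A
        ["biased", "biased", "neutral"],    # model B
        ["neutral", "neutral", "neutral"],  # model C
    ]
    print(majority_vote(preds))
```

Using an odd number of models avoids ties in binary detection; for multi-class classification a tie-break rule such as the one above is still needed.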

Title: Biased Tales: Cultural and Topic Bias in Generating Children's Stories

Authors: Donya Rooein, Vilém Zouhar, Debora Nozza, Dirk Hovy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.07908
Pdf URL: https://arxiv.org/pdf/2509.07908
Copy Paste: [[2509.07908]] Biased Tales: Cultural and Topic Bias in Generating Children's Stories(https://arxiv.org/abs/2509.07908)
Keywords: language model, llm
Abstract: Stories play a pivotal role in human communication, shaping beliefs and morals, particularly in children. As parents increasingly rely on large language models (LLMs) to craft bedtime stories, the presence of cultural and gender stereotypes in these narratives raises significant concerns. To address this issue, we present Biased Tales, a comprehensive dataset designed to analyze how biases influence protagonists' attributes and story elements in LLM-generated stories. Our analysis uncovers striking disparities. When the protagonist is described as a girl (as compared to a boy), appearance-related attributes increase by 55.26%. Stories featuring non-Western children disproportionately emphasize cultural heritage, tradition, and family themes far more than those for Western children. Our findings highlight the role of sociocultural bias in making creative AI use more equitable and diverse.

Title: GENUINE: Graph Enhanced Multi-level Uncertainty Estimation for Large Language Models

Authors: Tuo Wang, Adithya Kulkarni, Tyler Cody, Peter A. Beling, Yujun Yan, Dawei Zhou
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.07925
Pdf URL: https://arxiv.org/pdf/2509.07925
Copy Paste: [[2509.07925]] GENUINE: Graph Enhanced Multi-level Uncertainty Estimation for Large Language Models(https://arxiv.org/abs/2509.07925)
Keywords: language model, llm
Abstract: Uncertainty estimation is essential for enhancing the reliability of Large Language Models (LLMs), particularly in high-stakes applications. Existing methods often overlook semantic dependencies, relying on token-level probability measures that fail to capture structural relationships within the generated text. We propose GENUINE: Graph ENhanced mUlti-level uncertaINty Estimation for Large Language Models, a structure-aware framework that leverages dependency parse trees and hierarchical graph pooling to refine uncertainty quantification. By incorporating supervised learning, GENUINE effectively models semantic and structural relationships, improving confidence assessments. Extensive experiments across NLP tasks show that GENUINE achieves up to 29% higher AUROC than semantic entropy-based approaches and reduces calibration errors by over 15%, demonstrating the effectiveness of graph-based uncertainty modeling. The code is available at this https URL.
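
For readers unfamiliar with how AUROC is used to score uncertainty estimates, the following is a minimal sketch. It is not GENUINE's code; it assumes each generated answer comes with a scalar confidence and a correctness flag, and it computes AUROC as the probability that a correct answer receives a higher confidence than an incorrect one. The function name and the sample values are hypothetical.

```python
def auroc(confidences, is_correct):
    """Pairwise AUROC: P(confidence of a correct answer > confidence of an
    incorrect answer), counting ties as one half."""
    pos = [c for c, ok in zip(confidences, is_correct) if ok]
    neg = [c for c, ok in zip(confidences, is_correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical confidences for four answers, two correct and two incorrect.
score = auroc([0.9, 0.4, 0.6, 0.3], [True, True, False, False])
print(score)
```

A value of 0.5 means the confidence is no better than chance at separating correct from incorrect answers, and 1.0 means perfect separation. The quadratic pairwise loop is chosen for clarity; a rank-based formulation scales better to large evaluation sets.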

Title: SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge

Authors: Lukas Haas, Gal Yona, Giovanni D'Antonio, Sasha Goldshtein, Dipanjan Das
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.07968
Pdf URL: https://arxiv.org/pdf/2509.07968
Copy Paste: [[2509.07968]] SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge(https://arxiv.org/abs/2509.07968)
Keywords: language model, gpt, llm, hallucination, prompt
Abstract: We introduce SimpleQA Verified, a 1,000-prompt benchmark for evaluating Large Language Model (LLM) short-form factuality based on OpenAI's SimpleQA. It addresses critical limitations in OpenAI's benchmark, including noisy and incorrect labels, topical biases, and question redundancy. SimpleQA Verified was created through a rigorous multi-stage filtering process involving de-duplication, topic balancing, and source reconciliation to produce a more reliable and challenging evaluation set, alongside improvements in the autorater prompt. On this new benchmark, Gemini 2.5 Pro achieves a state-of-the-art F1-score of 55.6, outperforming other frontier models, including GPT-5. This work provides the research community with a higher-fidelity tool to track genuine progress in parametric model factuality and to mitigate hallucinations. The benchmark dataset, evaluation code, and leaderboard are available at: this https URL.
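
The abstract reports an F1-score without defining it. The sketch below assumes the convention of the original SimpleQA benchmark, where each answer is graded as correct, incorrect, or not attempted, and F1 is the harmonic mean of the overall-correct rate and the correct-given-attempted rate. Whether SimpleQA Verified uses exactly this definition is an assumption here, and the function name and counts are hypothetical.

```python
def simpleqa_style_f1(n_correct, n_incorrect, n_not_attempted):
    """Harmonic mean of overall accuracy and accuracy on attempted questions.

    Assumed to follow the original SimpleQA convention; not taken from the
    SimpleQA Verified abstract.
    """
    total = n_correct + n_incorrect + n_not_attempted
    if total == 0:
        raise ValueError("no graded answers")
    attempted = n_correct + n_incorrect
    overall_correct = n_correct / total
    correct_given_attempted = n_correct / attempted if attempted else 0.0
    denom = overall_correct + correct_given_attempted
    if denom == 0:
        return 0.0
    return 2 * overall_correct * correct_given_attempted / denom


# Hypothetical grading counts: 40 correct, 40 incorrect, 20 not attempted.
f1 = simpleqa_style_f1(40, 40, 20)
print(round(f1, 4))
```

Under this definition a model cannot raise its score simply by declining to answer: abstaining improves the correct-given-attempted rate but leaves the overall-correct rate unchanged.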

Title: Parallel-R1: Towards Parallel Thinking via Reinforcement Learning

Authors: Tong Zheng, Hongming Zhang, Wenhao Yu, Xiaoyang Wang, Xinyu Yang, Runpeng Dai, Rui Liu, Huiwen Bao, Chengsong Huang, Heng Huang, Dong Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.07980
Pdf URL: https://arxiv.org/pdf/2509.07980
Copy Paste: [[2509.07980]] Parallel-R1: Towards Parallel Thinking via Reinforcement Learning(https://arxiv.org/abs/2509.07980)
Keywords: language model, llm, prompt
Abstract: Parallel thinking has emerged as a novel approach for enhancing the reasoning capabilities of large language models (LLMs) by exploring multiple reasoning paths concurrently. However, activating such capabilities through training remains challenging, as existing methods predominantly rely on supervised fine-tuning (SFT) over synthetic data, which encourages teacher-forced imitation rather than exploration and generalization. In contrast, we propose Parallel-R1, the first reinforcement learning (RL) framework that enables parallel thinking behaviors for complex real-world reasoning tasks. Our framework employs a progressive curriculum that explicitly addresses the cold-start problem in training parallel thinking with RL. We first use SFT on prompt-generated trajectories from easier tasks to instill the parallel thinking ability, then transition to RL to explore and generalize this skill on harder problems. Experiments on various math benchmarks, including MATH, AMC23, and AIME, show that Parallel-R1 successfully instills parallel thinking, leading to an 8.4% accuracy improvement over the sequential thinking model trained directly on challenging tasks with RL. Further analysis reveals a clear shift in the model's thinking behavior: at an early stage, it uses parallel thinking as an exploration strategy, while at a later stage, it uses the same capability for multi-perspective verification. Most significantly, we validate parallel thinking as a mid-training exploration scaffold, where this temporary exploratory phase unlocks a higher performance ceiling after RL, yielding a 42.9% improvement over the baseline on AIME25. Our model, data, and code will be open-source at this https URL.