2025-09-05

Title: Speech-Based Cognitive Screening: A Systematic Evaluation of LLM Adaptation Strategies

Authors: Fatemeh Taherinezhad, Mohamad Javad Momeni Nezhad, Sepehr Karimi, Sina Rashidi, Ali Zolnour, Maryam Dadkhah, Yasaman Haghbin, Hossein AzadMaleki, Maryam Zolnoori
Subjects: cs.CL, cs.AI, eess.AS
Abstract URL: https://arxiv.org/abs/2509.03525
Pdf URL: https://arxiv.org/pdf/2509.03525
Copy Paste: [[2509.03525]] Speech-Based Cognitive Screening: A Systematic Evaluation of LLM Adaptation Strategies(https://arxiv.org/abs/2509.03525)
Keywords: language model, llm, prompt
Abstract: Over half of US adults with Alzheimer disease and related dementias remain undiagnosed, and speech-based screening offers a scalable detection approach. We compared large language model adaptation strategies for dementia detection, evaluating nine text-only models and three multimodal audio-text models on recordings from the DementiaBank speech corpus. Adaptations included in-context learning with different demonstration selection policies, reasoning-augmented prompting, parameter-efficient fine-tuning, and multimodal integration. Results showed that class-centroid demonstrations achieved the highest in-context learning performance, reasoning improved smaller models, and token-level fine-tuning generally produced the best scores. Adding a classification head substantially improved underperforming models. Among multimodal models, fine-tuned audio-text systems performed well but did not surpass the top text-only models. These findings highlight that model adaptation strategies, including demonstration selection, reasoning design, and tuning method, critically influence speech-based dementia detection, and that properly adapted open-weight models can match or exceed commercial systems.
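Illustrative note (not from the paper): the abstract names class-centroid demonstrations as the best in-context selection policy but does not spell out the procedure. The sketch below shows one plausible reading, in which each class contributes the examples whose embeddings lie closest to that class's mean embedding. The function names, the Euclidean distance, and the toy 2-D embeddings are all assumptions made for illustration.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def select_centroid_demos(examples, k=1):
    """Pick, per class, the k examples closest to that class's centroid.

    `examples` is a list of (text, label, embedding) triples.
    Euclidean distance is an assumption; the paper may use another measure.
    """
    by_label = {}
    for text, label, emb in examples:
        by_label.setdefault(label, []).append((text, emb))

    demos = {}
    for label, items in by_label.items():
        c = centroid([emb for _, emb in items])
        ranked = sorted(items, key=lambda item: math.dist(item[1], c))
        demos[label] = [text for text, _ in ranked[:k]]
    return demos


# Toy 2-D embeddings standing in for real transcript embeddings (hypothetical).
examples = [
    ("transcript A", "control", [0.0, 0.0]),
    ("transcript B", "control", [1.0, 0.0]),
    ("transcript C", "control", [0.4, 0.1]),
    ("transcript D", "dementia", [5.0, 5.0]),
    ("transcript E", "dementia", [6.0, 5.0]),
    ("transcript F", "dementia", [5.6, 5.1]),
]
demos = select_centroid_demos(examples, k=1)
```

The selected texts would then be placed in the prompt as labeled demonstrations ahead of the transcript to be classified.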

Title: Enhancing Speech Large Language Models through Reinforced Behavior Alignment

Authors: Yansong Liu, Jiateng Li, Yuan Liu
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2509.03526
Pdf URL: https://arxiv.org/pdf/2509.03526
Copy Paste: [[2509.03526]] Enhancing Speech Large Language Models through Reinforced Behavior Alignment(https://arxiv.org/abs/2509.03526)
Keywords: language model, llm
Abstract: Recent advances in Large Language Models (LLMs) have spurred considerable research interest in extending their linguistic capabilities beyond text to other modalities, leading to the emergence of speech-based LLMs (SpeechLMs) capable of processing user requests in either speech or text form. However, owing to inter-modal discrepancies, these SpeechLMs still exhibit a significant performance gap compared to their text-based LLM counterparts in instruction-following, particularly when confronted with the dynamic and variable nature of user speech. To address this challenge, this paper introduces a framework termed Reinforced Behavior Alignment (RBA), designed to bolster the language generation proficiency of SpeechLMs. Instead of relying on supervised fine-tuning from human annotations, RBA employs a self-synthesis methodology in which a powerful teacher LLM generates extensive, high-fidelity alignment data. The SpeechLM's behavior is then aligned with that of the teacher using a reinforcement learning-based approach. Experimental results demonstrate that this method effectively enhances the instruction-following capabilities of SpeechLMs, outperforming conventional distillation baselines. Crucially, we demonstrate that RBA can be seamlessly extended to tasks including spoken question answering and speech-to-text translation, attaining state-of-the-art performance on open benchmarks with only self-generated data.

Title: Multilevel Analysis of Cryptocurrency News using RAG Approach with Fine-Tuned Mistral Large Language Model

Authors: Bohdan M. Pavlyshenko
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.03527
Pdf URL: https://arxiv.org/pdf/2509.03527
Copy Paste: [[2509.03527]] Multilevel Analysis of Cryptocurrency News using RAG Approach with Fine-Tuned Mistral Large Language Model(https://arxiv.org/abs/2509.03527)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: In this paper, we consider multilevel multitask analysis of cryptocurrency news using a fine-tuned Mistral 7B large language model with retrieval-augmented generation (RAG). On the first level of analytics, the fine-tuned model generates graph and text summaries with sentiment scores as well as JSON representations of summaries. Higher levels perform hierarchical stacking that consolidates sets of graph-based and text-based summaries as well as summaries of summaries into comprehensive reports. The combination of graph and text summaries provides complementary views of cryptocurrency news. The model is fine-tuned with 4-bit quantization using the PEFT/LoRA approach. The representation of cryptocurrency news as a knowledge graph can essentially eliminate problems with large language model hallucinations. The obtained results demonstrate that fine-tuned Mistral 7B LLM models for multilevel cryptocurrency news analysis enable informative qualitative and quantitative analytics, providing important insights.

Title: The ProLiFIC dataset: Leveraging LLMs to Unveil the Italian Lawmaking Process

Authors: Matilde Contestabile, Chiara Ferrara, Alberto Giovannetti, Giovanni Parrillo, Andrea Vandin
Subjects: cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2509.03528
Pdf URL: https://arxiv.org/pdf/2509.03528
Copy Paste: [[2509.03528]] The ProLiFIC dataset: Leveraging LLMs to Unveil the Italian Lawmaking Process(https://arxiv.org/abs/2509.03528)
Keywords: language model, llm
Abstract: Process Mining (PM), initially developed for industrial and business contexts, has recently been applied to social systems, including legal ones. However, PM's efficacy in the legal domain is limited by the accessibility and quality of datasets. We introduce ProLiFIC (Procedural Lawmaking Flow in Italian Chambers), a comprehensive event log of the Italian lawmaking process from 1987 to 2022. Created from unstructured data from the Normattiva portal and structured using large language models (LLMs), ProLiFIC aligns with recent efforts in integrating PM with LLMs. We exemplify preliminary analyses and propose ProLiFIC as a benchmark for legal PM, fostering new developments.

Title: Real-Time Detection of Hallucinated Entities in Long-Form Generation

Authors: Oscar Obeso, Andy Arditi, Javier Ferrando, Joshua Freeman, Cameron Holmes, Neel Nanda
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.03531
Pdf URL: https://arxiv.org/pdf/2509.03531
Copy Paste: [[2509.03531]] Real-Time Detection of Hallucinated Entities in Long-Form Generation(https://arxiv.org/abs/2509.03531)
Keywords: language model, hallucination
Abstract: Large language models are now routinely used in high-stakes applications where hallucinations can cause serious harm, such as medical consultations or legal advice. Existing hallucination detection methods, however, are impractical for real-world use, as they are either limited to short factual queries or require costly external verification. We present a cheap, scalable method for real-time identification of hallucinated tokens in long-form generations, and scale it effectively to 70B parameter models. Our approach targets entity-level hallucinations (e.g., fabricated names, dates, citations) rather than claim-level ones, thereby naturally mapping to token-level labels and enabling streaming detection. We develop an annotation methodology that leverages web search to annotate model responses with grounded labels indicating which tokens correspond to fabricated entities. This dataset enables us to train effective hallucination classifiers with simple and efficient methods such as linear probes. Evaluating across four model families, our classifiers consistently outperform baselines on long-form responses, including more expensive methods such as semantic entropy (e.g., AUC 0.90 vs 0.71 for Llama-3.3-70B), and are also an improvement in short-form question-answering settings. Moreover, despite being trained only with entity-level labels, our probes effectively detect incorrect answers in mathematical reasoning tasks, indicating generalization beyond entities. While our annotation methodology is expensive, we find that annotated responses from one model can be used to train effective classifiers on other models; accordingly, we publicly release our datasets to facilitate reuse. Overall, our work suggests a promising new approach for scalable, real-world hallucination detection.
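Illustrative note (not from the paper): the abstract describes token-level linear probes that allow streaming detection. A minimal sketch of how an already trained logistic probe could be applied token by token follows. The probe weights, bias, threshold, and 3-D hidden states are hypothetical values chosen for illustration; a real probe would read the model's hidden activations at a chosen layer.

```python
import math


def probe_score(hidden, weights, bias):
    """Logistic probe: sigmoid(w . h + b) for one token's hidden state."""
    z = sum(w * h for w, h in zip(weights, hidden)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def flag_tokens(tokens, hidden_states, weights, bias, threshold=0.5):
    """Yield (token, score, flagged) one token at a time, as in streaming."""
    for token, hidden in zip(tokens, hidden_states):
        score = probe_score(hidden, weights, bias)
        yield token, score, score >= threshold


# Hypothetical activations and probe parameters, for illustration only.
tokens = ["Born", "in", "1987"]
hidden_states = [[0.1, -0.2, 0.0], [0.0, 0.1, -0.1], [1.5, 0.8, -0.3]]
weights, bias = [2.0, 1.0, -1.0], -1.5

results = list(flag_tokens(tokens, hidden_states, weights, bias))
flagged = [tok for tok, _, is_flagged in results if is_flagged]
```

Because each score depends only on the current token's hidden state, the check adds one dot product per generated token, which is what makes this kind of probe cheap enough to run during generation.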

Title: Topic Identification in LLM Input-Output Pairs through the Lens of Information Bottleneck

Authors: Igor Halperin
Subjects: cs.CL, cs.LG, q-fin.GN
Abstract URL: https://arxiv.org/abs/2509.03533
Pdf URL: https://arxiv.org/pdf/2509.03533
Copy Paste: [[2509.03533]] Topic Identification in LLM Input-Output Pairs through the Lens of Information Bottleneck(https://arxiv.org/abs/2509.03533)
Keywords: language model, llm, hallucination, prompt
Abstract: Large Language Models (LLMs) are prone to critical failure modes, including intrinsic faithfulness hallucinations (also known as confabulations), where a response deviates semantically from the provided context. Frameworks designed to detect this, such as Semantic Divergence Metrics (SDM), rely on identifying latent topics shared between prompts and responses, typically by applying geometric clustering to their sentence embeddings. This creates a disconnect, as the topics are optimized for spatial proximity, not for the downstream information-theoretic analysis. In this paper, we bridge this gap by developing a principled topic identification method grounded in the Deterministic Information Bottleneck (DIB) for geometric clustering. Our key contribution is to transform the DIB method into a practical algorithm for high-dimensional data by substituting its intractable KL divergence term with a computationally efficient upper bound. The resulting method, which we dub UDIB, can be interpreted as an entropy-regularized and robustified version of K-means that inherently favors a parsimonious number of informative clusters. By applying UDIB to the joint clustering of LLM prompt and response embeddings, we generate a shared topic representation that is not merely spatially coherent but is fundamentally structured to be maximally informative about the prompt-response relationship. This provides a superior foundation for the SDM framework and offers a novel, more sensitive tool for detecting confabulations.

Title: AR$^2$: Adversarial Reinforcement Learning for Abstract Reasoning in Large Language Models

Authors: Cheng-Kai Yeh, Hsing-Wang Lee, Chung-Hung Kuo, Hen-Hsen Huang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.03537
Pdf URL: https://arxiv.org/pdf/2509.03537
Copy Paste: [[2509.03537]] AR$^2$: Adversarial Reinforcement Learning for Abstract Reasoning in Large Language Models(https://arxiv.org/abs/2509.03537)
Keywords: language model, llm
Abstract: Abstraction--the ability to recognize and distill essential computational patterns from complex problem statements--is a foundational skill in computer science, critical both for human problem-solvers and coding-oriented large language models (LLMs). Despite recent advances in training LLMs for code generation using reinforcement learning (RL), most existing approaches focus primarily on superficial pattern recognition, overlooking explicit training for abstraction. In this study, we propose AR$^2$ (Adversarial Reinforcement Learning for Abstract Reasoning), a novel framework explicitly designed to enhance the abstraction abilities of LLMs. AR$^2$ employs a teacher model to transform kernel problems into narrative-rich, challenging descriptions without changing their fundamental logic. Simultaneously, a student coding model is trained to solve these complex narrative problems by extracting their underlying computational kernels. Experimental results demonstrate that AR$^2$ substantially improves the student model's accuracy on previously unseen, challenging programming tasks, underscoring abstraction as a key skill for enhancing LLM generalization.

Title: Improving Factuality in LLMs via Inference-Time Knowledge Graph Construction

Authors: Shanglin Wu, Lihui Liu, Jinho D. Choi, Kai Shu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.03540
Pdf URL: https://arxiv.org/pdf/2509.03540
Copy Paste: [[2509.03540]] Improving Factuality in LLMs via Inference-Time Knowledge Graph Construction(https://arxiv.org/abs/2509.03540)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Large Language Models (LLMs) often struggle with producing factually consistent answers due to limitations in their parametric memory. Retrieval-Augmented Generation (RAG) methods address this issue by incorporating external knowledge from trusted sources at inference time. However, such methods typically treat knowledge as unstructured text, which limits their ability to support compositional reasoning and identify factual inconsistencies. To overcome these limitations, we propose a novel framework that dynamically constructs and expands knowledge graphs (KGs) during inference, integrating both internal knowledge extracted from LLMs and external information retrieved from external sources. Our method begins by extracting a seed KG from the question via prompting, followed by iterative expansion using the LLM's latent knowledge. The graph is then selectively refined through external retrieval, enhancing factual coverage and correcting inaccuracies. We evaluate our approach on three diverse factual QA benchmarks, demonstrating consistent improvements in factual accuracy, answer precision, and interpretability over baseline prompting and static KG-augmented methods. Our findings suggest that inference-time KG construction is a promising direction for enhancing LLM factuality in a structured, interpretable, and scalable manner.

Title: ResearchPulse: Building Method-Experiment Chains through Multi-Document Scientific Inference

Authors: Qi Chen, Jingxuan Wei, Zhuoya Yao, Haiguang Wang, Gaowei Wu, Bihui Yu, Siyuan Li, Cheng Tan
Subjects: cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2509.03565
Pdf URL: https://arxiv.org/pdf/2509.03565
Copy Paste: [[2509.03565]] ResearchPulse: Building Method-Experiment Chains through Multi-Document Scientific Inference(https://arxiv.org/abs/2509.03565)
Keywords: gpt, agent
Abstract: Understanding how scientific ideas evolve requires more than summarizing individual papers; it demands structured, cross-document reasoning over thematically related research. In this work, we formalize multi-document scientific inference, a new task that extracts and aligns motivation, methodology, and experimental results across related papers to reconstruct research development chains. This task introduces key challenges, including temporally aligning loosely structured methods and standardizing heterogeneous experimental tables. We present ResearchPulse, an agent-based framework that integrates instruction planning, scientific content extraction, and structured visualization. It consists of three coordinated agents: a Plan Agent for task decomposition, a Mmap-Agent that constructs motivation-method mind maps, and a Lchart-Agent that synthesizes experimental line charts. To support this task, we introduce ResearchPulse-Bench, a citation-aware benchmark of annotated paper clusters. Experiments show that our system, despite using 7B-scale agents, consistently outperforms strong baselines like GPT-4o in semantic alignment, structural consistency, and visual fidelity. The dataset is available at this https URL.

Title: NoteBar: An AI-Assisted Note-Taking System for Personal Knowledge Management

Authors: Josh Wisoff, Yao Tang, Zhengyu Fang, Jordan Guzman, YuTang Wang, Alex Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.03610
Pdf URL: https://arxiv.org/pdf/2509.03610
Copy Paste: [[2509.03610]] NoteBar: An AI-Assisted Note-Taking System for Personal Knowledge Management(https://arxiv.org/abs/2509.03610)
Keywords: language model
Abstract: Note-taking is a critical practice for capturing, organizing, and reflecting on information in both academic and professional settings. The recent success of large language models has accelerated the development of AI-assisted tools, yet existing solutions often struggle with efficiency. We present NoteBar, an AI-assisted note-taking tool that leverages persona information and efficient language models to automatically organize notes into multiple categories and better support user workflows. To support research and evaluation in this space, we further introduce a novel persona-conditioned dataset of 3,173 notes and 8,494 annotated concepts across 16 MBTI personas, offering both diversity and semantic richness for downstream tasks. Finally, we demonstrate that NoteBar can be deployed in a practical and cost-effective manner, enabling interactive use without reliance on heavy infrastructure. Together, NoteBar and its accompanying dataset provide a scalable and extensible foundation for advancing AI-assisted personal knowledge management.

Title: E-ARMOR: Edge case Assessment and Review of Multilingual Optical Character Recognition

Authors: Aryan Gupta, Anupam Purwar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.03615
Pdf URL: https://arxiv.org/pdf/2509.03615
Copy Paste: [[2509.03615]] E-ARMOR: Edge case Assessment and Review of Multilingual Optical Character Recognition(https://arxiv.org/abs/2509.03615)
Keywords: language model, llm
Abstract: Recognizing text in multilingual, noisy, and diverse real-world images remains a significant challenge for Optical Character Recognition (OCR) systems. With the rise of Large Vision-Language Models (LVLMs), there is growing interest in their ability to generalize and reason beyond fixed OCR pipelines. In this work, we introduce Sprinklr-Edge-OCR, a novel OCR system specifically optimized for edge deployment in resource-constrained environments. We present a large-scale comparative evaluation of five state-of-the-art LVLMs (InternVL, Qwen, GOT OCR, LLaMA, MiniCPM) and two traditional OCR systems (Sprinklr-Edge-OCR, SuryaOCR) on a proprietary, doubly hand-annotated dataset of multilingual (54 languages) images. Our benchmark covers a broad range of metrics including accuracy, semantic consistency, language coverage, computational efficiency (latency, memory, GPU usage), and deployment cost. To better reflect real-world applicability, we also conducted edge case deployment analysis, evaluating model performance in CPU-only environments. Among the results, Qwen achieved the highest precision (0.54), while Sprinklr-Edge-OCR delivered the best overall F1 score (0.46) and outperformed others in efficiency, processing images 35 times faster (0.17 seconds per image on average) and at less than 0.01 of the cost (0.006 USD per 1,000 images) compared to LVLMs. Our findings demonstrate that the most suitable OCR systems for edge deployment remain the traditional ones, even in the era of LLMs, owing to their low compute requirements, low latency, and very high affordability.

Title: Breaking the Mirror: Activation-Based Mitigation of Self-Preference in LLM Evaluators

Authors: Dani Roytburg, Matthew Bozoukov, Matthew Nguyen, Jou Barzdukas, Simon Fu, Narmeen Oozeer
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.03647
Pdf URL: https://arxiv.org/pdf/2509.03647
Copy Paste: [[2509.03647]] Breaking the Mirror: Activation-Based Mitigation of Self-Preference in LLM Evaluators(https://arxiv.org/abs/2509.03647)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) increasingly serve as automated evaluators, yet they suffer from "self-preference bias": a tendency to favor their own outputs over those of other models. This bias undermines fairness and reliability in evaluation pipelines, particularly for tasks like preference tuning and model routing. We investigate whether lightweight steering vectors can mitigate this problem at inference time without retraining. We introduce a curated dataset that separates self-preference into justified and unjustified examples, and we construct steering vectors using two methods: Contrastive Activation Addition (CAA) and an optimization-based approach. Our results show that steering vectors can reduce unjustified self-preference bias by up to 97%, substantially outperforming prompting and direct preference optimization baselines. Yet steering vectors are unstable on legitimate self-preference and unbiased agreement, implying self-preference spans multiple or nonlinear directions. This underscores both their promise and limits as safeguards for LLM-as-judges and motivates more robust interventions.
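Illustrative note (not from the paper): Contrastive Activation Addition is generally described as taking the mean difference between activations collected on two contrasting prompt sets and adding a scaled copy of that difference to the hidden state at inference time. The sketch below shows that arithmetic on plain lists. The layer choice, the scaling factor `alpha`, and the toy activations are hypothetical, and the abstract does not state the authors' exact configuration.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def steering_vector(positive_acts, negative_acts):
    """Mean activation difference between two contrastive prompt sets."""
    pos = mean_vector(positive_acts)
    neg = mean_vector(negative_acts)
    return [p - q for p, q in zip(pos, neg)]


def apply_steering(hidden, vector, alpha):
    """Add the scaled steering vector to one hidden state."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Hypothetical layer activations: "unbiased" judgments versus
# "unjustified self-preference" judgments.
unbiased_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
self_pref_acts = [[0.0, 1.0, 2.0], [2.0, 1.0, 2.0]]

vec = steering_vector(unbiased_acts, self_pref_acts)
steered = apply_steering([0.5, 0.5, 0.5], vec, alpha=2.0)
```

A single difference vector can only shift activations along one direction, which is consistent with the abstract's observation that the intervention becomes unstable if self-preference spans multiple or nonlinear directions.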

Title: Measuring How (Not Just Whether) VLMs Build Common Ground

Authors: Saki Imai, Mert İnan, Anthony Sicilia, Malihe Alikhani
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.03805
Pdf URL: https://arxiv.org/pdf/2509.03805
Copy Paste: [[2509.03805]] Measuring How (Not Just Whether) VLMs Build Common Ground(https://arxiv.org/abs/2509.03805)
Keywords: language model, gpt
Abstract: Large vision language models (VLMs) increasingly claim reasoning skills, yet current benchmarks evaluate them in single-turn or question answering settings. However, grounding is an interactive process in which people gradually develop shared understanding through ongoing communication. We introduce a four-metric suite (grounding efficiency, content alignment, lexical adaptation, and human-likeness) to systematically evaluate VLM performance in interactive grounding contexts. We deploy the suite on 150 self-play sessions of interactive referential games between three proprietary VLMs and compare them with human dyads. All three models diverge from human patterns on at least three metrics, while GPT4o-mini is the closest overall. We find that (i) task success scores do not indicate successful grounding and (ii) high image-utterance alignment does not necessarily predict task success. Our metric suite and findings offer a framework for future research on VLM grounding.

Title: Align-then-Slide: A complete evaluation framework for Ultra-Long Document-Level Machine Translation

Authors: Jiaxin Guo, Daimeng Wei, Yuanchang Luo, Xiaoyu Chen, Zhanglin Wu, Huan Yang, Hengchao Shang, Zongyao Li, Zhiqiang Rao, Jinlong Yang, Hao Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.03809
Pdf URL: https://arxiv.org/pdf/2509.03809
Copy Paste: [[2509.03809]] Align-then-Slide: A complete evaluation framework for Ultra-Long Document-Level Machine Translation(https://arxiv.org/abs/2509.03809)
Keywords: language model, llm
Abstract: Large language models (LLMs) have ushered in a new era for document-level machine translation (doc-mt), yet their whole-document outputs challenge existing evaluation methods that assume sentence-by-sentence alignment. We introduce Align-then-Slide, a complete evaluation framework for ultra-long doc-mt. In the Align stage, we automatically infer sentence-level source-target correspondences and rebuild the target to match the source sentence number, resolving omissions and many-to-one/one-to-many mappings. In the n-Chunk Sliding Evaluate stage, we calculate averaged metric scores under 1-, 2-, 3- and 4-chunk for multi-granularity assessment. Experiments on the WMT benchmark show a Pearson correlation of 0.929 between our method and expert MQM rankings. On a newly curated real-world test set, our method again aligns closely with human judgments. Furthermore, preference data produced by Align-then-Slide enables effective CPO training and its direct use as a reward model for GRPO, both yielding translations preferred over a vanilla SFT baseline. The results validate our framework as an accurate, robust, and actionable evaluation tool for doc-mt systems.
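Illustrative note (not from the paper): the n-Chunk Sliding Evaluate stage is described only as averaging metric scores over 1-, 2-, 3- and 4-sentence chunks. One plausible reading is sketched below. It assumes the Align stage has already produced equal-length sentence lists, slides a window with stride 1, and uses a toy word-overlap score in place of a real translation metric. The stride, the final averaging across chunk sizes, and all function names are assumptions.

```python
def toy_metric(hypothesis, reference):
    """Stand-in metric: Jaccard overlap of lowercase word sets.

    This is NOT a real MT metric; it only keeps the sketch self-contained.
    """
    hyp = set(hypothesis.lower().split())
    ref = set(reference.lower().split())
    if not hyp and not ref:
        return 1.0
    return len(hyp & ref) / len(hyp | ref)


def n_chunk_score(hyps, refs, n, metric=toy_metric):
    """Average the metric over all windows of n consecutive aligned sentences."""
    if len(hyps) != len(refs) or not hyps:
        raise ValueError("expected non-empty, aligned sentence lists")
    n = min(n, len(hyps))
    scores = []
    for start in range(len(hyps) - n + 1):
        hyp_chunk = " ".join(hyps[start:start + n])
        ref_chunk = " ".join(refs[start:start + n])
        scores.append(metric(hyp_chunk, ref_chunk))
    return sum(scores) / len(scores)


def sliding_evaluate(hyps, refs, sizes=(1, 2, 3, 4)):
    """Return per-chunk-size scores and their unweighted mean."""
    per_size = {n: n_chunk_score(hyps, refs, n) for n in sizes}
    return per_size, sum(per_size.values()) / len(per_size)


# The empty string models an omitted sentence after alignment (hypothetical data).
refs = ["the cat sat", "it was glad", "then it slept", "the end"]
hyps = ["the cat sat", "it was happy", "", "the end"]
per_size, overall = sliding_evaluate(hyps, refs)
```

Larger chunks let a translation that merges or splits sentences still receive credit at the 2- to 4-sentence level, while the 1-chunk score stays sensitive to omissions.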

Title: Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth

Authors: Yang Wang, Chenghao Xiao, Chia-Yi Hsiao, Zi Yan Chang, Chi-Li Chen, Tyler Loakman, Chenghua Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.03867
Pdf URL: https://arxiv.org/pdf/2509.03867
Copy Paste: [[2509.03867]] Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth(https://arxiv.org/abs/2509.03867)
Keywords: language model, llm
Abstract: We introduce Drivelology, a unique linguistic phenomenon characterised as "nonsense with depth": utterances that are syntactically coherent yet pragmatically paradoxical, emotionally loaded, or rhetorically subversive. While such expressions may resemble surface-level nonsense, they encode implicit meaning requiring contextual inference, moral reasoning, or emotional interpretation. We find that current large language models (LLMs), despite excelling at many natural language processing (NLP) tasks, consistently fail to grasp the layered semantics of Drivelological text. To investigate this, we construct a small but diverse benchmark dataset of over 1,200 meticulously curated examples, with select instances in English, Mandarin, Spanish, French, Japanese, and Korean. Annotation was especially challenging: each of the examples required careful expert review to verify that it truly reflected Drivelological characteristics. The process involved multiple rounds of discussion and adjudication to address disagreements, highlighting the subtle and subjective nature of Drivelology. We evaluate a range of LLMs on classification, generation, and reasoning tasks. Our results reveal clear limitations of LLMs: models often confuse Drivelology with shallow nonsense, produce incoherent justifications, or miss the implied rhetorical function altogether. These findings highlight a deeper representational gap in LLMs' pragmatic understanding and challenge the assumption that statistical fluency implies cognitive comprehension. We release our dataset and code to facilitate further research in modelling linguistic depth beyond surface-level coherence.

Title: A Comprehensive Survey on Trustworthiness in Reasoning with Large Language Models

Authors: Yanbo Wang, Yongcan Yu, Jian Liang, Ran He
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2509.03871
Pdf URL: https://arxiv.org/pdf/2509.03871
Copy Paste: [[2509.03871]] A Comprehensive Survey on Trustworthiness in Reasoning with Large Language Models(https://arxiv.org/abs/2509.03871)
Keywords: language model, llm, hallucination
Abstract: The development of Long-CoT reasoning has advanced LLM performance across various tasks, including language understanding, complex problem solving, and code generation. This paradigm enables models to generate intermediate reasoning steps, thereby improving both accuracy and interpretability. However, despite these advancements, a comprehensive understanding of how CoT-based reasoning affects the trustworthiness of language models remains underdeveloped. In this paper, we survey recent work on reasoning models and CoT techniques, focusing on five core dimensions of trustworthy reasoning: truthfulness, safety, robustness, fairness, and privacy. For each aspect, we provide a clear and structured overview of recent studies in chronological order, along with detailed analyses of their methodologies, findings, and limitations. Future research directions are also appended at the end for reference and discussion. Overall, while reasoning techniques hold promise for enhancing model trustworthiness through hallucination mitigation, harmful content detection, and robustness improvement, cutting-edge reasoning models themselves often suffer from comparable or even greater vulnerabilities in safety, robustness, and privacy. By synthesizing these insights, we hope this work serves as a valuable and timely resource for the AI safety community to stay informed on the latest progress in reasoning trustworthiness. A full list of related papers can be found at this https URL.
摘要：长期推理的开发在各种任务中具有高级LLM性能，包括语言理解，复杂的问题解决和代码生成。该范式使模型能够生成中间的推理步骤，从而提高准确性和解释性。然而，尽管有这些进步，但对基于COT的推理如何影响语言模型的可信度如何仍然不发达。在本文中，我们调查了有关推理模型和COT技术的最新工作，重点介绍了值得信赖的推理的五个核心维度：真实性，安全，稳健性，公平和隐私。对于每个方面，我们提供了按时间顺序排列的最新研究的清晰结构化概述，以及对其方法，发现和局限性的详细分析。未来的研究方向也将在结尾附加以进行参考和讨论。总体而言，尽管推理技术有望通过缓解幻觉，有害内容检测和稳健性改善来增强模型的信任度，但尖端推理模型本身常常遭受可比甚至更大的安全性，鲁棒性和隐私性的脆弱性。通过综合这些见解，我们希望这项工作是AI安全社区的宝贵和及时资源，以了解推理可信度的最新进展。可以在\ href {此https url} {this https url}上找到相关论文的完整列表。

Title: False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize

Authors: Cheng Wang, Zeming Wei, Qin Liu, Muhao Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.03888
Pdf URL: https://arxiv.org/pdf/2509.03888
Copy Paste: [[2509.03888]] False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize(https://arxiv.org/abs/2509.03888)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) can comply with harmful instructions, raising serious safety concerns despite their impressive capabilities. Recent work has leveraged probing-based approaches to study the separability of malicious and benign inputs in LLMs' internal representations, and researchers have proposed using such probing methods for safety detection. We systematically re-examine this paradigm. Motivated by poor out-of-distribution performance, we hypothesize that probes learn superficial patterns rather than semantic harmfulness. Through controlled experiments, we confirm this hypothesis and identify the specific patterns learned: instructional patterns and trigger words. Our investigation follows a systematic approach, progressing from demonstrating comparable performance of simple n-gram methods, to controlled experiments with semantically cleaned datasets, to detailed analysis of pattern dependencies. These results reveal a false sense of security around current probing-based approaches and highlight the need to redesign both models and evaluation protocols, for which we provide further discussions in the hope of suggesting responsible further research in this direction. We have open-sourced the project at this https URL.
摘要：大型语言模型（LLMS）可以遵守有害的说明，尽管能力令人印象深刻，但仍引发了严重的安全问题。最近的工作利用了基于探测的方法来研究LLMS内部表示中恶意和良性输入的可分离性，研究人员提出了使用此类探测方法进行安全检测。我们系统地重新检查了这个范例。由于分布不良的性能，我们假设探针学习表面模式而不是语义有害。通过对照实验，我们确认了这一假设并确定所学的特定模式：教学模式和触发单词。我们的调查遵循一种系统的方法，从证明简单N-Gram方法的可比性能到具有语义清洁数据集的受控实验到对模式依赖性的详细分析。这些结果揭示了围绕当前基于探测的方法的一种错误的安全感，并强调了重新设计模型和评估协议的必要性，我们为此提供了进一步的讨论，希望在此方向上提出负责任的进一步研究。我们在此HTTPS URL上开源了该项目。

Title: MobileRAG: Enhancing Mobile Agent with Retrieval-Augmented Generation

Authors: Gowen Loo, Chang Liu, Qinghong Yin, Xiang Chen, Jiawei Chen, Jingyuan Zhang, Yu Tian
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2509.03891
Pdf URL: https://arxiv.org/pdf/2509.03891
Copy Paste: [[2509.03891]] MobileRAG: Enhancing Mobile Agent with Retrieval-Augmented Generation(https://arxiv.org/abs/2509.03891)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: Smartphones have become indispensable in people's daily lives, permeating nearly every aspect of modern society. With the continuous advancement of large language models (LLMs), numerous LLM-based mobile agents have emerged. These agents are capable of accurately parsing diverse user queries and automatically assisting users in completing complex or repetitive operations. However, current agents 1) rely heavily on the comprehension ability of LLMs, which can lead to errors caused by misoperations or omitted steps during tasks, 2) lack interaction with the external environment, often terminating tasks when an app cannot fulfill user queries, and 3) lack memory capabilities, requiring each instruction to reconstruct the interface and being unable to learn from and correct previous mistakes. To alleviate the above issues, we propose MobileRAG, a mobile agent framework enhanced by Retrieval-Augmented Generation (RAG), which includes InterRAG, LocalRAG, and MemRAG. It leverages RAG to more quickly and accurately identify user queries and accomplish complex and long-sequence mobile tasks. Additionally, to more comprehensively assess the performance of MobileRAG, we introduce MobileRAG-Eval, a more challenging benchmark characterized by numerous complex, real-world mobile tasks that require external knowledge assistance. Extensive experimental results on MobileRAG-Eval demonstrate that MobileRAG can easily handle real-world mobile tasks, achieving a 10.3% improvement over state-of-the-art methods with fewer operational steps. Our code is publicly available at: this https URL.
摘要：在人们的日常生活中，智能手机已经变得必不可少，几乎渗透到现代社会的各个方面。随着大语言模型（LLM）的持续发展，已经出现了许多基于LLM的移动代理。这些代理能够准确地解析不同的用户查询，并自动帮助用户完成复杂或重复的操作。但是，目前的代理人1）严重依赖LLM的理解能力，这可能导致由误操作或任务过程中省略的步骤引起的错误，2）与外部环境缺乏相互作用，通常在应用程序无法满足用户查询时终止任务，而3）缺乏记忆功能，需要每个指令，需要每个指导才能重新构造界面和以前的错误和正确的错误，并且会错误地学习。为了减轻上述问题，我们提出了Mobilerag，Mobilerag是一种移动代理商框架，该框架通过检索功能增强的一代（RAG）增强，其中包括Internag，localrag和Memrag。它利用抹布更快，更准确地识别用户查询，并完成复杂而长期的移动任务。此外，为了更全面地评估Mobilerag的性能，我们介绍了Mobilerag-eval，这是一个更具挑战性的基准，其特征是许多需要外部知识帮助的复杂，现实世界中的移动任务。关于Mobilerag-eval的广泛实验结果表明，Mobilerag可以轻松处理现实世界的移动任务，从而获得10.3 \％的改进，而不是使用更少的操作步骤的最先进方法。我们的代码可公开可用：此HTTPS URL

Title: MTQA:Matrix of Thought for Enhanced Reasoning in Complex Question Answering

Authors: Fengxiao Tang, Yufeng Li, Zongzong Wu, Ming Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.03918
Pdf URL: https://arxiv.org/pdf/2509.03918
Copy Paste: [[2509.03918]] MTQA:Matrix of Thought for Enhanced Reasoning in Complex Question Answering(https://arxiv.org/abs/2509.03918)
Keywords: language model, llm, retrieval-augmented generation, chain-of-thought, tree-of-thought
Abstract: Complex Question Answering (QA) is a fundamental and challenging task in NLP. While large language models (LLMs) exhibit impressive performance in QA, they suffer from significant performance degradation when facing complex and abstract QA tasks due to insufficient reasoning capabilities. Works such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT) aim to enhance LLMs' reasoning abilities, but they face issues such as in-layer redundancy in tree structures and single paths in chain structures. Although some studies utilize Retrieval-Augmented Generation (RAG) methods to assist LLMs in reasoning, the challenge of effectively utilizing large amounts of information involving multiple entities and hops remains critical. To address this, we propose the Matrix of Thought (MoT), a novel and efficient LLM thought structure. MoT explores the problem in both horizontal and vertical dimensions through the "column-cell communication" mechanism, enabling LLMs to actively engage in multi-strategy and deep-level thinking, reducing redundancy within the column cells and enhancing reasoning capabilities. Furthermore, we develop a fact-correction mechanism by constructing knowledge units from retrieved knowledge graph triples and raw text to enhance the initial knowledge for LLM reasoning and correct erroneous answers. This leads to the development of an efficient and accurate QA framework (MTQA). Experimental results show that our framework outperforms state-of-the-art methods on four widely-used datasets in terms of F1 and EM scores, with a reasoning time only 14.4% of that of the baseline methods, demonstrating both its efficiency and accuracy. The code for this framework is available at this https URL.
摘要：复杂的问题回答（QA）是NLP中的一项基本且具有挑战性的任务。尽管大型语言模型（LLMS）在质量检查中表现出令人印象深刻的表现，但由于推理能力不足，他们在面临复杂和抽象的质量检查任务时会遭受重大的性能下降。诸如《经营链》（COT）和思想树（TOT）之类的作品旨在增强LLMS的推理能力，但它们面临诸如树结构中的内层冗余和链结构中的单一路径之类的问题。尽管一些研究利用检索型生成（RAG）方法来协助LLMS推理，但有效利用涉及多个实体和啤酒花的大量信息的挑战仍然至关重要。为了解决这个问题，我们提出了一种新颖有效的LLM思想结构的思想矩阵（MOT）。 MOT通过“列密码通信”机制探索了水平和垂直尺寸的问题，使LLMS能够积极参与多策略和深层思维，从而降低了柱单元格内的冗余并增强了推理能力。此外，我们通过从检索到的知识图和原始文本中构建知识单元来开发事实纠正机制，从而增强LLM推理的初始知识并纠正错误的答案。这导致了有效，准确的质量检查框架（MTQA）的发展。实验结果表明，在F1和EM分数方面，我们的框架在四个广泛使用的数据集上优于最先进的方法，其推理时间仅为基线方法的14.4％，这既证明了其效率和准确性。此框架的代码可在此HTTPS URL上获得。

Title: Decoding the Poetic Language of Emotion in Korean Modern Poetry: Insights from a Human-Labeled Dataset and AI Modeling

Authors: Iro Lim, Haein Ji, Byungjun Kim
Subjects: cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2509.03932
Pdf URL: https://arxiv.org/pdf/2509.03932
Copy Paste: [[2509.03932]] Decoding the Poetic Language of Emotion in Korean Modern Poetry: Insights from a Human-Labeled Dataset and AI Modeling(https://arxiv.org/abs/2509.03932)
Keywords: language model
Abstract: This study introduces KPoEM (Korean Poetry Emotion Mapping), a novel dataset for computational emotion analysis in modern Korean poetry. Despite remarkable progress in text-based emotion classification using large language models, poetry, particularly Korean poetry, remains underexplored due to its figurative language and cultural specificity. We built a multi-label emotion dataset of 7,662 entries from the works of five influential Korean poets, including 7,007 line-level entries from 483 poems and 615 work-level entries, annotated with 44 fine-grained emotion categories. A state-of-the-art Korean language model fine-tuned on this dataset significantly outperformed previous models, achieving 0.60 F1-micro compared to 0.34 from models trained on general corpora. The KPoEM model, trained through sequential fine-tuning (first on general corpora and then on the KPoEM dataset), demonstrates not only an enhanced ability to identify temporally and culturally specific emotional expressions, but also a strong capacity to preserve the core sentiments of modern Korean poetry. This study bridges computational methods and literary analysis, presenting new possibilities for the quantitative exploration of poetic emotions through structured data that faithfully retains the emotional and cultural nuances of Korean literature.
摘要：这项研究介绍了Kpoem（韩国诗歌情感映射），这是现代韩国诗歌中计算情绪分析的新型数据集。尽管使用大型语言模型在基于文本的情感分类中取得了显着进展，但由于其具有象征意义的语言和文化特殊性，诗歌局部毫无疑问地散发出诗歌。我们构建了一个有7,662个条目的多标签情感数据集，其中包括来自483首诗和615个工作级别的7,007个线条级条目，并注明了来自五个有影响力的朝鲜诗人的44个细粒度的情感类别。该数据集上的最先进的韩国语言模型显着优于先前的模型，获得了0.60 F1-Micro的模型，而对General Corpora培训的模型的0.34比例为0.34。 Kpoem模型通过一般语料库的顺序微调至上进行了训练，然后在Kpoem数据集上进行了培训，不仅示出了增强的能力，可以识别时间和文化特定的情感表达，还可以保持强大的能力，以保持现代朝鲜诗歌的核心情感。这项研究桥接了计算方法和文学分析，从而通过结构化的数据忠实地保留了韩国文学的情感和文化细微差别，从而为对诗歌情绪的定量探索提供了新的可能性。
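
Note on the F1-micro score quoted in the abstract above: the following is a minimal sketch of how micro-averaged F1 is usually computed for multi-label predictions, pooling true positives, false positives, and false negatives over all examples and labels. The label names and toy data are illustrative assumptions, not taken from the KPoEM dataset, and the function name `micro_f1` is hypothetical.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1: pool TP/FP/FN over all examples and labels."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)  # labels predicted and present in gold
        fp += len(p - g)  # labels predicted but absent from gold
        fn += len(g - p)  # gold labels that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: each set holds the emotion labels for one poem line.
gold = [{"sadness", "longing"}, {"joy"}, {"anger", "sadness"}]
pred = [{"sadness"}, {"joy", "hope"}, {"sadness"}]
score = micro_f1(gold, pred)
```

Because counts are pooled before precision and recall are computed, frequent labels weigh more heavily than rare ones, which is worth keeping in mind when a label set has as many as 44 categories.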

Title: SelfAug: Mitigating Catastrophic Forgetting in Retrieval-Augmented Generation via Distribution Self-Alignment

Authors: Yuqing Huang, Rongyang Zhang, Qimeng Wang, Chengqiang Lu, Yan Gao, Yi Wu, Yao Hu, Xuyang Zhi, Guiquan Liu, Xin Li, Hao Wang, Enhong Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.03934
Pdf URL: https://arxiv.org/pdf/2509.03934
Copy Paste: [[2509.03934]] SelfAug: Mitigating Catastrophic Forgetting in Retrieval-Augmented Generation via Distribution Self-Alignment(https://arxiv.org/abs/2509.03934)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Recent advancements in large language models (LLMs) have revolutionized natural language processing through their remarkable capabilities in understanding and executing diverse tasks. While supervised fine-tuning, particularly in Retrieval-Augmented Generation (RAG) scenarios, effectively enhances task-specific performance, it often leads to catastrophic forgetting, where models lose their previously acquired knowledge and general capabilities. Existing solutions either require access to general instruction data or face limitations in preserving the model's original distribution. To overcome these limitations, we propose SelfAug, a self-distribution alignment method that aligns input sequence logits to preserve the model's semantic distribution, thereby mitigating catastrophic forgetting and improving downstream performance. Extensive experiments demonstrate that SelfAug achieves a superior balance between downstream learning and general capability retention. Our comprehensive empirical analysis reveals a direct correlation between distribution shifts and the severity of catastrophic forgetting in RAG scenarios, highlighting how the absence of RAG capabilities in general instruction tuning leads to significant distribution shifts during fine-tuning. Our findings not only advance the understanding of catastrophic forgetting in RAG contexts but also provide a practical solution applicable across diverse fine-tuning scenarios. Our code is publicly available at this https URL.
摘要：大型语言模型（LLM）的最新进步通过其在理解和执行多种任务方面的显着能力来彻底改变了自然语言处理。尽管受到监督的微调，尤其是在检索成绩的一代（RAG）方案中，可以有效地提高特定于任务的性能，但通常会导致灾难性的遗忘，在这种情况下，模型失去了他们以前获得的知识和一般能力。现有解决方案要么需要访问一般指令数据，要么需要限制在保存模型的原始分布中。为了克服这些局限性，我们提出了Selfaug，这是一种对齐输入序列ligits以保持模型的语义分布的对齐的自分配对准方法，从而减轻了灾难性的遗忘和改善下游性能。广泛的实验表明，SelfAug在下游学习和一般能力保留之间取得了卓越的平衡。我们的全面经验分析揭示了分布变化与破布场景中灾难性遗忘的严重程度之间的直接相关性，强调了一般指导调整中缺乏抹布能力的情况如何导致微调期间的显着分布变化。我们的发现不仅可以提高人们对破布环境中灾难性遗忘的理解，而且还提供了适用于各种微调场景的实用解决方案。我们的代码在此HTTPS URL上公开可用。
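
Note on the logit-alignment idea in the abstract above: the abstract does not give the SelfAug loss, so the following is only a rough sketch of one way such a regulariser could look, assuming a KL penalty between the original model's and the fine-tuned model's next-token distributions at the input (prompt) positions, added to the task loss. The KL direction, the `weight` value, and all function names are assumptions made for illustration, not the authors' formulation.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(ref_logits: List[float], new_logits: List[float]) -> float:
    """KL(ref || new) between the two softmax distributions."""
    ref_lp = log_softmax(ref_logits)
    new_lp = log_softmax(new_logits)
    return sum(math.exp(r) * (r - n) for r, n in zip(ref_lp, new_lp))


def alignment_penalty(ref_seq: List[List[float]], new_seq: List[List[float]]) -> float:
    """Mean per-position KL over the input positions (assumed form)."""
    return sum(kl_divergence(r, n) for r, n in zip(ref_seq, new_seq)) / len(ref_seq)


def total_loss(task_loss: float, ref_seq, new_seq, weight: float = 0.5) -> float:
    """Task loss plus the weighted alignment penalty; weight is a made-up default."""
    return task_loss + weight * alignment_penalty(ref_seq, new_seq)


# Hypothetical logits for two input positions over a three-token vocabulary.
ref = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
drifted = [[0.0, 1.5, 1.0], [0.3, 0.2, 0.1]]
penalty_same = alignment_penalty(ref, ref)
penalty_drift = alignment_penalty(ref, drifted)
```

The penalty is zero when the fine-tuned model reproduces the original distribution on the input tokens and grows as the distributions drift apart, which matches the abstract's framing of forgetting as distribution shift.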

Title: SPFT-SQL: Enhancing Large Language Model for Text-to-SQL Parsing by Self-Play Fine-Tuning

Authors: Yuhao Zhang, Shaoming Duan, Jinhang Su, Chuanyi Liu, Peiyi Han
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.03937
Pdf URL: https://arxiv.org/pdf/2509.03937
Copy Paste: [[2509.03937]] SPFT-SQL: Enhancing Large Language Model for Text-to-SQL Parsing by Self-Play Fine-Tuning(https://arxiv.org/abs/2509.03937)
Keywords: language model, llm
Abstract: Despite the significant advancements of self-play fine-tuning (SPIN), which can transform a weak large language model (LLM) into a strong one through competitive interactions between models of varying capabilities, it still faces challenges in the Text-to-SQL task. SPIN does not generate new information, and the large number of correct SQL queries produced by the opponent model during self-play reduces the main model's ability to generate accurate SQL queries. To address this challenge, we propose a new self-play fine-tuning method tailored for the Text-to-SQL task, called SPFT-SQL. Prior to self-play, we introduce a verification-based iterative fine-tuning approach, which synthesizes high-quality fine-tuning data iteratively based on the database schema and validation feedback to enhance model performance, while building a model base with varying capabilities. During the self-play fine-tuning phase, we propose an error-driven loss method that incentivizes incorrect outputs from the opponent model, enabling the main model to distinguish between correct SQL and erroneous SQL generated by the opponent model, thereby improving its ability to generate correct SQL. Extensive experiments and in-depth analyses on six open-source LLMs and five widely used benchmarks demonstrate that our approach outperforms existing state-of-the-art (SOTA) methods.
摘要：尽管自我播放微调（SPIN）的显着进步，它可以通过不同功能模型之间的竞争相互作用来将弱的大语言模型（LLM）转变为强大的模型，但它仍然在文本到SQL任务中面临挑战。自旋不会生成新信息，并且在自我播放过程中由对手模型产生的大量正确的SQL查询降低了主要模型生成准确的SQL查询的能力。为了应对这一挑战，我们提出了一种针对文本到SQL任务量身定制的新自我播放微调方法，称为SPFT-SQL。在自我播放之前，我们引入了一种基于验证的迭代微调方法，该方法基于数据库架构和验证反馈来综合高质量的微调数据迭代，以增强模型性能，同时构建具有不同功能的模型基础。在自我播放的微调阶段，我们提出了一种错误驱动的损耗方法，该方法激发了对手模型的不正确输出，使主模型能够区分对手模型生成的正确SQL和错误的SQL，从而提高了其生成正确的SQL的能力。对六个开源LLM和五个广泛使用的基准进行了广泛的实验和深入分析表明，我们的方法的表现优于现有的最新方法（SOTA）方法。

Title: VoxRole: A Comprehensive Benchmark for Evaluating Speech-Based Role-Playing Agents

Authors: Weihao Wu, Liang Cao, Xinyu Wu, Zhiwei Lin, Rui Niu, Jingbei Li, Zhiyong Wu
Subjects: cs.CL, cs.AI, cs.SD
Abstract URL: https://arxiv.org/abs/2509.03940
Pdf URL: https://arxiv.org/pdf/2509.03940
Copy Paste: [[2509.03940]] VoxRole: A Comprehensive Benchmark for Evaluating Speech-Based Role-Playing Agents(https://arxiv.org/abs/2509.03940)
Keywords: language model, llm, agent
Abstract: Recent significant advancements in Large Language Models (LLMs) have greatly propelled the development of Role-Playing Conversational Agents (RPCAs). These systems aim to create immersive user experiences through consistent persona adoption. However, current RPCA research faces dual limitations. First, existing work predominantly focuses on the textual modality, entirely overlooking critical paralinguistic features including intonation, prosody, and rhythm in speech, which are essential for conveying character emotions and shaping vivid identities. Second, the speech-based role-playing domain suffers from a long-standing lack of standardized evaluation benchmarks. Most current spoken dialogue datasets target only fundamental capability assessments, featuring thinly sketched or ill-defined character profiles. Consequently, they fail to effectively quantify model performance on core competencies like long-term persona consistency. To address this critical gap, we introduce VoxRole, the first comprehensive benchmark specifically designed for the evaluation of speech-based RPCAs. The benchmark comprises 13335 multi-turn dialogues, totaling 65.6 hours of speech from 1228 unique characters across 261 movies. To construct this resource, we propose a novel two-stage automated pipeline that first aligns movie audio with scripts and subsequently employs an LLM to systematically build multi-dimensional profiles for each character. Leveraging VoxRole, we conduct a multi-dimensional evaluation of contemporary spoken dialogue models, revealing crucial insights into their respective strengths and limitations in maintaining persona consistency.
摘要：大型语言模型（LLM）的最新进展极大地推动了角色扮演对话剂（RPCAS）的发展。这些系统旨在通过持续的角色来创造沉浸式用户体验。但是，当前的RPCA研究面临双重限制。首先，现有的作品主要集中在文本方式上，完全忽略了批判性的副语言特征，包括语调，韵律和言语节奏，这对于传达角色情感和塑造生动的身份至关重要。其次，基于语音的角色扮演领域长期缺乏标准化的评估基准。大多数当前的口语对话数据集仅针对基本功能评估，其中包含稀疏或定义不明的字符概况。因此，他们无法有效地量化诸如长期角色一致性之类的核心能力上的模型性能。为了解决这个关键的差距，我们引入了Voxrole，这是专为评估基于语音的RPCA的第一个综合基准。该基准包括13335个多转话对话，总共来自261部电影中1228个独特角色的语音总计65.6小时。为了构建此资源，我们提出了一个新颖的两阶段自动化管道，该管道首先将电影音频与脚本保持一致，然后采用LLM系统地为每个角色系统地构建多维配置文件。利用紫外线，我们对当代口语对话模型进行了多维评估，揭示了对它们各自的优势和局限性保持角色一致性的关键见解。

Title: CANDY: Benchmarking LLMs' Limitations and Assistive Potential in Chinese Misinformation Fact-Checking

Authors: Ruiling Guo, Xinwei Yang, Chen Huang, Tong Zhang, Yong Hu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.03957
Pdf URL: https://arxiv.org/pdf/2509.03957
Copy Paste: [[2509.03957]] CANDY: Benchmarking LLMs' Limitations and Assistive Potential in Chinese Misinformation Fact-Checking(https://arxiv.org/abs/2509.03957)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: The effectiveness of large language models (LLMs) in fact-checking misinformation remains uncertain, despite their growing use. To this end, we present CANDY, a benchmark designed to systematically evaluate the capabilities and limitations of LLMs in fact-checking Chinese misinformation. Specifically, we curate a carefully annotated dataset of ~20k instances. Our analysis shows that current LLMs exhibit limitations in generating accurate fact-checking conclusions, even when enhanced with chain-of-thought reasoning and few-shot prompting. To understand these limitations, we develop a taxonomy to categorize flawed LLM-generated explanations for their conclusions and identify factual fabrication as the most common failure mode. Although LLMs alone are unreliable for fact-checking, our findings indicate their considerable potential to augment human performance when deployed as assistive tools in scenarios. Our dataset and code can be accessed at this https URL.
摘要：大型语言模型（LLM）对事实核对错误信息的有效性仍然不确定，尽管它们的使用日益增长。为此，我们提出了糖果，这是一种基准，旨在系统地评估LLM在事实检查中国错误信息方面的功能和局限性。具体来说，我们策划了一个〜20K实例的精心注释的数据集。我们的分析表明，当前的LLM在产生准确的事实检验结论时表现出局限性，即使通过经过思考链推理和很少的射击提示增强时，也会显示出局限性。为了理解这些局限性，我们开发了一种分类法，以将其结论的有缺陷的LLM生成的解释分类，并确定事实捏造是最常见的失败模式。尽管仅LLMS对事实检查是不可靠的，但我们的发现表明，当在场景中被部署为辅助工具时，它们具有巨大的潜力。我们的数据集和代码可以在此HTTPS URL上访问

Title: Exploring NLP Benchmarks in an Extremely Low-Resource Setting

Authors: Ulin Nuha, Adam Jatowt
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.03962
Pdf URL: https://arxiv.org/pdf/2509.03962
Copy Paste: [[2509.03962]] Exploring NLP Benchmarks in an Extremely Low-Resource Setting(https://arxiv.org/abs/2509.03962)
Keywords: language model, llm
Abstract: The effectiveness of Large Language Models (LLMs) diminishes for extremely low-resource languages, such as indigenous languages, primarily due to the lack of labeled data. Despite growing interest, the availability of high-quality natural language processing (NLP) datasets for these languages remains limited, making it difficult to develop robust language technologies. This paper addresses this gap by focusing on Ladin, an endangered Romance language, specifically targeting the Val Badia variant. Leveraging a small set of parallel Ladin-Italian sentence pairs, we create synthetic datasets for sentiment analysis and multiple-choice question answering (MCQA) by translating monolingual Italian data. To ensure linguistic quality and reliability, we apply rigorous filtering and back-translation procedures in our method. We further demonstrate that incorporating these synthetic datasets into machine translation training leads to substantial improvements over existing Italian-Ladin translation baselines. Our contributions include the first publicly available sentiment analysis and MCQA datasets for Ladin, establishing foundational resources that can support broader NLP research and downstream applications for this underrepresented language.
摘要：大型语言模型（LLM）的有效性降低了极低的资源语言，例如土著语言，这主要是由于缺乏标记的数据。尽管越来越兴趣，但这些语言的高质量自然语言处理（NLP）数据集的可用性仍然有限，因此很难开发强大的语言技术。本文通过专注于一种濒临灭绝的浪漫语言Ladin来解决这种差距，专门针对Val Badia变体。通过翻译单语意大利数据，我们创建了一小部分平行的Ladin-Uitatian句子对，为情感分析和多项选择性问题答案（MCQA）创建合成数据集。为了确保语言质量和可靠性，我们在方法中应用了严格的过滤和反向翻译程序。我们进一步证明，将这些合成数据集纳入机器翻译训练中会导致对现有意大利 - 拉丁翻译基线的实质性改进。我们的贡献包括Ladin的首个公开情感分析和MCQA数据集，建立了可以支持这种代表性不足语言的更广泛的NLP研究和下游应用程序的基础资源。

Title: Expanding Foundational Language Capabilities in Open-Source LLMs through a Korean Case Study

Authors: Junghwan Lim, Gangwon Jo, Sungmin Lee, Jiyoung Park, Dongseok Kim, Jihwan Kim, Junhyeok Lee, Wai Ting Cheung, Dahye Choi, Kibong Choi, Jaeyeon Huh, Beomgyu Kim, Jangwoong Kim, Taehyun Kim, Haesol Lee, Jeesoo Lee, Dongpin Oh, Changseok Song, Daewon Suh
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.03972
Pdf URL: https://arxiv.org/pdf/2509.03972
Copy Paste: [[2509.03972]] Expanding Foundational Language Capabilities in Open-Source LLMs through a Korean Case Study(https://arxiv.org/abs/2509.03972)
Keywords: language model, gpt, llm
Abstract: We introduce Llama-3-Motif, a language model consisting of 102 billion parameters, specifically designed to enhance Korean capabilities while retaining strong performance in English. Developed on the Llama 3 architecture, Llama-3-Motif employs advanced training techniques, including LlamaPro and Masked Structure Growth, to effectively scale the model without altering its core Transformer architecture. Using the MoAI platform for efficient training across hyperscale GPU clusters, we optimized Llama-3-Motif using a carefully curated dataset that maintains a balanced ratio of Korean and English data. Llama-3-Motif shows decent performance on Korean-specific benchmarks, outperforming existing models and achieving results comparable to GPT-4.
摘要：我们介绍了Llama-3-Motif，该语言模型由1002亿个参数组成，专门设计用于增强韩国能力，同时保持英语的强劲表现。 Llama-3-MoTIF是在Llama 3体系结构上开发的，采用了包括Llamapro和Masked结构增长在内的先进培训技术，可以有效地扩展模型，而无需更改其核心变压器体系结构。使用MOAI平台进行高度大规模GPU群集的有效培训，我们使用精心策划的数据集优化了Llama-3-MOTIF，该数据集保持韩语和英语数据的平衡比率。 Llama-3-MOTIF在韩国特定的基准测试中表现出不错的性能，表现优于现有模型，并取得与GPT-4相当的结果。

Title: RTQA : Recursive Thinking for Complex Temporal Knowledge Graph Question Answering with Large Language Models

Authors: Zhaoyan Gong, Juan Li, Zhiqiang Liu, Lei Liang, Huajun Chen, Wen Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.03995
Pdf URL: https://arxiv.org/pdf/2509.03995
Copy Paste: [[2509.03995]] RTQA : Recursive Thinking for Complex Temporal Knowledge Graph Question Answering with Large Language Models(https://arxiv.org/abs/2509.03995)
Keywords: language model, llm
Abstract: Current temporal knowledge graph question answering (TKGQA) methods primarily focus on implicit temporal constraints, lacking the capability of handling more complex temporal queries, and struggle with limited reasoning abilities and error propagation in decomposition frameworks. We propose RTQA, a novel framework to address these challenges by enhancing reasoning over TKGs without requiring training. Following recursive thinking, RTQA recursively decomposes questions into sub-problems, solves them bottom-up using LLMs and TKG knowledge, and employs multi-path answer aggregation to improve fault tolerance. RTQA consists of three core components: the Temporal Question Decomposer, the Recursive Solver, and the Answer Aggregator. Experiments on MultiTQ and TimelineKGQA benchmarks demonstrate significant Hits@1 improvements in "Multiple" and "Complex" categories, outperforming state-of-the-art methods. Our code and data are available at this https URL.
摘要：当前的时间知识图应答答案（TKGQA）方法主要集中于隐式时间限制，缺乏处理更复杂的时间查询的能力，并且在分解框架中与有限的推理能力和错误传播斗争。我们提出了RTQA，这是一个新颖的框架，可以通过在不需要培训的情况下增强推理来解决这些挑战。经过递归思考，RTQA递归将问题分解为子问题，使用LLMS和TKG知识来自下而上，并采用多路径答案答案聚合以提高容错的耐受性。 RTQA由三个核心组成部分组成：时间问题分解器，递归求解器和答案聚合器。在MultITQ和TimelineKGQA基准上进行的实验表明，“多个”和“复杂”类别中的1个改进的命中率很高，表现优于最先进的方法。我们的代码和数据可在此HTTPS URL上找到。
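
Note on the recursive scheme described in the abstract above: the control flow (decompose, solve sub-questions bottom-up, aggregate answers from several paths) can be sketched as follows. This is not the RTQA implementation. The decomposition is a hand-built tree rather than the output of a Temporal Question Decomposer, the two answer paths are toy functions standing in for LLM and TKG calls, and majority voting is an assumed stand-in for the paper's multi-path aggregation. All names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Question:
    text: str
    children: List["Question"] = field(default_factory=list)


Solver = Callable[[str, List[str]], str]


def aggregate(candidates: List[str]) -> str:
    """Majority vote; ties go to the earliest candidate."""
    counts = Counter(candidates)
    best = max(counts.values())
    return next(c for c in candidates if counts[c] == best)


def solve(q: Question, solvers: List[Solver]) -> str:
    """Answer sub-questions first, then the parent, using every path."""
    sub_answers = [solve(child, solvers) for child in q.children]
    candidates = [s(q.text, sub_answers) for s in solvers]
    return aggregate(candidates)


# Toy knowledge: office holders in chronological order (made-up data).
HOLDERS = ["Alice", "Bob", "Carol"]


def kg_path(text: str, sub_answers: List[str]) -> str:
    """Stand-in for a reliable lookup path."""
    if not sub_answers:  # leaf question: direct lookup
        return "Alice"
    idx = HOLDERS.index(sub_answers[0])  # parent: successor of the sub-answer
    return HOLDERS[min(idx + 1, len(HOLDERS) - 1)]


def unreliable_path(text: str, sub_answers: List[str]) -> str:
    """Stand-in for a path that always answers wrongly."""
    return "Carol"


root = Question(
    "Who held the office after the 2010 holder?",
    [Question("Who held the office in 2010?")],
)
answer = solve(root, [kg_path, unreliable_path, kg_path])
```

Voting at every node rather than only at the root means a single wrong path is outvoted before its error can propagate upward, which is one plausible reading of the fault-tolerance claim.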

Title: On Robustness and Reliability of Benchmark-Based Evaluation of LLMs

Authors: Riccardo Lunardi, Vincenzo Della Mea, Stefano Mizzaro, Kevin Roitero
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04013
Pdf URL: https://arxiv.org/pdf/2509.04013
Copy Paste: [[2509.04013]] On Robustness and Reliability of Benchmark-Based Evaluation of LLMs(https://arxiv.org/abs/2509.04013)
Keywords: language model, llm
Abstract: The effectiveness of Large Language Models (LLMs) is usually evaluated by means of benchmarks such as MMLU, ARC-C, or HellaSwag, where questions are presented in their original wording, and thus in a fixed, standardized format. However, real-world applications involve linguistic variability, requiring models to maintain their effectiveness across diverse rewordings of the same question or query. In this study, we systematically assess the robustness of LLMs to paraphrased benchmark questions and investigate whether benchmark-based evaluations provide a reliable measure of model capabilities. We systematically generate various paraphrases of all the questions across six different common benchmarks, and measure the resulting variations in effectiveness of 34 state-of-the-art LLMs of different sizes and effectiveness. Our findings reveal that while LLM rankings remain relatively stable across paraphrased inputs, absolute effectiveness scores change and decline significantly. This suggests that LLMs struggle with linguistic variability, raising concerns about their generalization abilities and evaluation methodologies. Furthermore, the observed performance drop challenges the reliability of benchmark-based evaluations, indicating that high benchmark scores may not fully capture a model's robustness to real-world input variations. We discuss the implications of these findings for LLM evaluation methodologies, emphasizing the need for robustness-aware benchmarks that better reflect practical deployment scenarios.
摘要：大型语言模型（LLMS）的有效性通常通过基准（例如MMLU，ARC-C或Hellaswag）进行评估，在其原始措辞中提出问题，因此以固定的标准化格式提出问题。但是，现实世界的应用程序涉及语言可变性，要求模型在同一问题或查询的各种重新单词中保持其有效性。在这项研究中，我们系统地评估了LLM对解释基准问题的鲁棒性，并研究基于基准的评估是否提供了可靠的模型能力衡量。我们系统地生成了六个不同常见基准的所有问题的各种解释，并测量了不同规模和有效性的34个最先进的LLM的有效性变化。我们的发现表明，尽管LLM排名在解释的投入中保持相对稳定，但绝对有效性得分变化并大幅下降。这表明LLM在语言变异性上挣扎，引起人们对其概括能力和评估方法的关注。此外，观察到的性能下降挑战了基于基准测试的评估的可靠性，表明高基准分数可能无法完全捕获模型对现实世界输入变化的鲁棒性。我们讨论了这些发现对LLM评估方法的含义，并强调了更好地反映实际部署场景的鲁棒性基准的需求。
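
Note on the two quantities contrasted in the abstract above: ranking stability and absolute score change can be measured separately, as in the following minimal sketch. It compares per-model accuracy on original and paraphrased questions and reports both the accuracy drop and a rank correlation. The correctness data are invented for illustration, and the choice of Kendall's tau (tau-a, without tie correction) is an assumption rather than the metric used in the paper.

```python
from itertools import combinations
from typing import List


def accuracy(correct: List[bool]) -> float:
    return sum(correct) / len(correct)


def kendall_tau(a: List[float], b: List[float]) -> float:
    """Kendall tau-a between two score lists (no tie correction)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) / 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-question correctness for three models.
original = {
    "model_a": [True, True, True, True, False],
    "model_b": [True, True, True, False, False],
    "model_c": [True, False, False, False, False],
}
paraphrased = {
    "model_a": [True, True, True, False, False],
    "model_b": [True, True, False, False, False],
    "model_c": [False, False, False, False, False],
}

models = sorted(original)
acc_orig = [accuracy(original[m]) for m in models]
acc_para = [accuracy(paraphrased[m]) for m in models]
drops = {m: o - p for m, o, p in zip(models, acc_orig, acc_para)}
tau = kendall_tau(acc_orig, acc_para)
```

In this toy case every model loses accuracy under paraphrasing while the ordering of the models is unchanged, the same qualitative pattern the abstract reports.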

Title: What if I ask in alia lingua? Measuring Functional Similarity Across Languages

Authors: Debangan Mishra, Arihant Rastogi, Agyeya Negi, Shashwat Goel, Ponnurangam Kumaraguru
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.04032
Pdf URL: https://arxiv.org/pdf/2509.04032
Copy Paste: [[2509.04032]] What if I ask in alia lingua? Measuring Functional Similarity Across Languages(https://arxiv.org/abs/2509.04032)
Keywords: prompt
Abstract: How similar are model outputs across languages? In this work, we study this question using a recently proposed model similarity metric $\kappa_p$ applied to 20 languages and 47 subjects in GlobalMMLU. Our analysis reveals that a model's responses become increasingly consistent across languages as its size and capability grow. Interestingly, models exhibit greater cross-lingual consistency within themselves than agreement with other models prompted in the same language. These results highlight not only the value of $\kappa_p$ as a practical tool for evaluating multilingual reliability, but also its potential to guide the development of more consistent multilingual systems.
摘要：跨语言的模型输出有多相似？在这项工作中，我们使用最近提出的模型相似度度量$ \ kappa_p $研究了这个问题，适用于20种语言和47个主题。我们的分析表明，随着语言的规模和能力的增长，模型的响应越来越一致。有趣的是，模型在自己内部表现出比与其他语言引起的其他模型的同意更大的跨语言一致性。这些结果不仅突出了$ \ kappa_p $的价值作为评估多语言可靠性的实用工具，而且还可以指导开发更一致的多语言系统。

Title: Synthesizing Sheet Music Problems for Evaluation and Reinforcement Learning

Authors: Zhilin Wang, Zhe Yang, Yun Luo, Yafu Li, Haoran Zhang, Runzhe Zhan, Derek F. Wong, Jizhe Zhou, Yu Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.04059
Pdf URL: https://arxiv.org/pdf/2509.04059
Copy Paste: [[2509.04059]] Synthesizing Sheet Music Problems for Evaluation and Reinforcement Learning(https://arxiv.org/abs/2509.04059)
Keywords: language model, gpt, llm, chain-of-thought
Abstract: Enhancing the ability of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) to interpret sheet music is a crucial step toward building AI musicians. However, current research lacks both evaluation benchmarks and training data for sheet music reasoning. To address this, we propose the idea of synthesizing sheet music problems grounded in music theory, which can serve both as evaluation benchmarks and as training data for reinforcement learning with verifiable rewards (RLVR). We introduce a data synthesis framework that generates verifiable sheet music questions in both textual and visual modalities, leading to the Synthetic Sheet Music Reasoning Benchmark (SSMR-Bench) and a complementary training set. Evaluation results on SSMR-Bench show the importance of models' reasoning abilities in interpreting sheet music. At the same time, the poor performance of Gemini 2.5-Pro highlights the challenges that MLLMs still face in interpreting sheet music in a visual format. By leveraging synthetic data for RLVR, Qwen3-8B-Base and Qwen2.5-VL-Instruct achieve improvements on the SSMR-Bench. Besides, the trained Qwen3-8B-Base surpasses GPT-4 in overall performance on MusicTheoryBench and achieves reasoning performance comparable to GPT-4 with the strategies of Role play and Chain-of-Thought. Notably, its performance on math problems also improves relative to the original Qwen3-8B-Base. Furthermore, our results show that the enhanced reasoning ability can also facilitate music composition. In conclusion, we are the first to propose the idea of synthesizing sheet music problems based on music theory rules, and demonstrate its effectiveness not only in advancing model reasoning for sheet music understanding but also in unlocking new possibilities for AI-assisted music creation.
摘要：增强大型语言模型（LLM）和多模式大语言模型（MLLM）的能力，这是迈向建立AI音乐家的关键一步。但是，当前的研究缺乏评估基准和用于乐谱推理的培训数据。为了解决这个问题，我们提出了综合音乐理论中基于音乐问题的表乐趣问题的想法，这些问题既可以用作评估基准，又可以用作具有可验证奖励（RLVR）的增强学习的培训数据。我们介绍了一个数据综合框架，该框架以文本和视觉方式生成可验证的乐谱问题，从而导致合成的乐谱推理基准（SSMR基础）和互补的培训集。 SSMR板凳上的评估结果表明，模型推理能力在解释乐谱中的重要性。同时，Gemini 2.5-Pro的表现不佳，突出了MLLM在以视觉形式解释乐谱时仍面临的挑战。通过利用RLVR，QWEN3-8B基本和QWEN2.5-VL-INSTRUCTION的综合数据实现SSMR基座的改进。此外，受过训练的QWEN3-8B基础在音乐理论上的总体表现超过了GPT-4，并且与GPT-4相当的推理性能与角色扮演和思想链的策略相当。值得注意的是，相对于原始的QWEN3-8B基础，其在数学问题上的性能也有所改善。此外，我们的结果表明，增强的推理能力也可以促进音乐创作。总之，我们是第一个提出基于音乐理论规则综合乐谱问题的想法的人，不仅在推进乐谱音乐理解的模型推理方面，还可以展示其有效性，而且还可以解锁新的可能性创作的新可能性。

Title: Arabic Chatbot Technologies in Education: An Overview

Authors: Hicham Bourhil, Yacine El Younoussi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.04066
Pdf URL: https://arxiv.org/pdf/2509.04066
Copy Paste: [[2509.04066]] Arabic Chatbot Technologies in Education: An Overview(https://arxiv.org/abs/2509.04066)
Keywords: language model, gpt, llm, chat
Abstract: The recent advancements in Artificial Intelligence (AI) in general, and in Natural Language Processing (NLP) in particular, and some of its applications such as chatbots, have led to their implementation in different domains like education, healthcare, tourism, and customer service. Since the COVID-19 pandemic, there has been an increasing interest in these digital technologies to allow and enhance remote access. In education, e-learning systems have been massively adopted worldwide. The emergence of Large Language Models (LLMs) such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformers) made chatbots even more popular. In this study, we present a survey on existing Arabic chatbots in education and their different characteristics such as the adopted approaches, language variety, and metrics used to measure their performance. We were able to identify some research gaps when we discovered that, despite the success of chatbots in other languages such as English, only a few educational Arabic chatbots used modern techniques. Finally, we discuss future directions of research in this field.
摘要：通常，人工智能（AI）的最新进步，尤其是自然语言处理（NLP），以及其某些应用程序（例如聊天机器人），导致了它们在教育，医疗保健，旅游业和客户服务等不同领域的实施。自COVID-19大流行以来，人们对这些数字技术的兴趣越来越多，以允许和增强远程访问。在教育方面，电子学习系统已在全球范围内广泛采用。大型语言模型（LLM）的出现，例如BERT（来自变形金刚的双向编码器表示）和GPT（生成性预训练的变形金刚），使聊天机器人更加受欢迎。在这项研究中，我们介绍了一项有关教育中现有的阿拉伯语聊天机器人及其不同特征的调查，例如用于衡量其表现的方法，语言多样性和指标。当我们发现，尽管聊天机器人在其他语言（例如英语）中成功，但我们能够确定一些研究差距，但只有少数教育的阿拉伯聊天机器人使用了现代技术。最后，我们讨论该领域研究的未来方向。

Title: Improving Narrative Classification and Explanation via Fine Tuned Language Models

Authors: Rishit Tyagi, Rahul Bouri, Mohit Gupta
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.04077
Pdf URL: https://arxiv.org/pdf/2509.04077
Copy Paste: [[2509.04077]] Improving Narrative Classification and Explanation via Fine Tuned Language Models(https://arxiv.org/abs/2509.04077)
Keywords: language model, gpt, hallucination, prompt
Abstract: Understanding covert narratives and implicit messaging is essential for analyzing bias and sentiment. Traditional NLP methods struggle with detecting subtle phrasing and hidden agendas. This study tackles two key challenges: (1) multi-label classification of narratives and sub-narratives in news articles, and (2) generating concise, evidence-based explanations for dominant narratives. We fine-tune a BERT model with a recall-oriented approach for comprehensive narrative detection, refining predictions using a GPT-4o pipeline for consistency. For narrative explanation, we propose a ReACT (Reasoning + Acting) framework with semantic retrieval-based few-shot prompting, ensuring grounded and relevant justifications. To enhance factual accuracy and reduce hallucinations, we incorporate a structured taxonomy table as an auxiliary knowledge base. Our results show that integrating auxiliary knowledge in prompts improves classification accuracy and justification reliability, with applications in media analysis, education, and intelligence gathering.
摘要：了解秘密叙事和隐性消息传递对于分析偏见和情感至关重要。传统的NLP方法在检测微妙的措辞和隐藏的议程方面挣扎。这项研究解决了两个关键挑战：（1）新闻文章中叙事和亚叙事的多标签分类，以及（2）对主导叙事产生简洁的基于证据的解释。我们对BERT模型进行了以召回方式进行综合叙事检测，并使用GPT-4O管道进行一致性来提炼预测。为了叙事解释，我们提出了一个基于语义检索的少数弹性提示，提出了一个反应（推理 +代理）框架，以确保扎根和相关的理由。为了提高事实准确性并降低幻觉，我们将结构化的分类表作为辅助知识基础。我们的结果表明，将辅助知识纳入提示可以提高分类的准确性和理由的可靠性，并在媒体分析，教育和情报收集中的应用中提高了助理。

Title: Towards Stable and Personalised Profiles for Lexical Alignment in Spoken Human-Agent Dialogue

Authors: Keara Schaaij, Roel Boumans, Tibor Bosse, Iris Hendrickx
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2509.04104
Pdf URL: https://arxiv.org/pdf/2509.04104
Copy Paste: [[2509.04104]] Towards Stable and Personalised Profiles for Lexical Alignment in Spoken Human-Agent Dialogue(https://arxiv.org/abs/2509.04104)
Keywords: language model, llm, agent
Abstract: Lexical alignment, where speakers start to use similar words across conversation, is known to contribute to successful communication. However, its implementation in conversational agents remains underexplored, particularly considering the recent advancements in large language models (LLMs). As a first step towards enabling lexical alignment in human-agent dialogue, this study draws on strategies for personalising conversational agents and investigates the construction of stable, personalised lexical profiles as a basis for lexical alignment. Specifically, we varied the amounts of transcribed spoken data used for construction as well as the number of items included in the profiles per part-of-speech (POS) category and evaluated profile performance across time using recall, coverage, and cosine similarity metrics. It was shown that smaller and more compact profiles, created after 10 min of transcribed speech containing 5 items for adjectives, 5 items for conjunctions, and 10 items for adverbs, nouns, pronouns, and verbs each, offered the best balance in both performance and data efficiency. In conclusion, this study offers practical insights into constructing stable, personalised lexical profiles, taking into account minimal data requirements, serving as a foundational step toward lexical alignment strategies in conversational agents.
摘要：词汇对齐者在对话中开始使用类似的词语，众所周知，词汇对齐者开始使用类似的单词。但是，它在对话代理中的实施仍然没有得到充实的态度，尤其是考虑到大语模型（LLMS）的最新进展。作为实现人类代理对话中词汇结盟的第一步，这项研究借鉴了个性化对话代理的策略，并研究了稳定的个性化词汇概况的构建，以作为词汇对齐的基础。具体而言，我们使用召回，覆盖范围和余弦相似性指标来改变用于构建的转录数据的数量以及用于构建的抄录数据的数量以及每个词性词性（POS）类别（POS）类别（POS）类别（POS）类别的数量。结果表明，在10分钟的转录语音后创建的较小，更紧凑的配置文件，其中包含5个形容词，5个用于连词的项目以及副词，名词，代词和动词的10个项目，在性能和数据效率方面都提供了最佳的平衡。总之，这项研究为构建稳定的个性化词汇配置文件提供了实用的见解，考虑到最小的数据要求，这是朝着会话代理的词汇校准策略迈出的基本步骤。
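
The profile construction described in this abstract can be illustrated with a small sketch. The abstract does not define the metrics precisely, so the definitions here are assumptions: "recall" is taken as the share of profile items the speaker uses again in later data, "coverage" as the share of later tokens that fall inside the profile, and cosine similarity is computed over item counts. The function names and the (word, POS) input format are hypothetical and not taken from the paper.

```python
from collections import Counter, defaultdict
from math import sqrt

def build_profile(tagged, k_per_pos):
    """Keep the k most frequent words per POS tag (k given per tag)."""
    counts = defaultdict(Counter)
    for word, pos in tagged:
        counts[pos][word.lower()] += 1
    return {
        pos: dict(c.most_common(k_per_pos[pos]))
        for pos, c in counts.items()
        if k_per_pos.get(pos, 0) > 0
    }

def _flatten(profile):
    """Turn {pos: {word: count}} into {(pos, word): count}."""
    return {(pos, w): n for pos, words in profile.items() for w, n in words.items()}

def recall(profile, later_tagged):
    """Assumed definition: fraction of profile items reused in later data."""
    items = set(_flatten(profile))
    used = {(pos, word.lower()) for word, pos in later_tagged}
    return len(items & used) / len(items) if items else 0.0

def coverage(profile, later_tagged):
    """Assumed definition: fraction of later tokens that are profile items."""
    items = set(_flatten(profile))
    hits = sum((pos, word.lower()) in items for word, pos in later_tagged)
    return hits / len(later_tagged) if later_tagged else 0.0

def cosine(profile_a, profile_b):
    """Cosine similarity between the count vectors of two profiles."""
    a, b = _flatten(profile_a), _flatten(profile_b)
    dot = sum(n * b.get(key, 0) for key, n in a.items())
    norm = sqrt(sum(n * n for n in a.values())) * sqrt(sum(n * n for n in b.values()))
    return dot / norm if norm else 0.0

# Toy usage with the profile sizes reported in the abstract for two categories.
early = [("well", "ADV"), ("well", "ADV"), ("dog", "NOUN"), ("cat", "NOUN")]
later = [("well", "ADV"), ("dog", "NOUN"), ("bird", "NOUN")]
profile = build_profile(early, {"ADV": 10, "NOUN": 10})
print(recall(profile, later), coverage(profile, later))
```

Comparing a profile built from the first 10 minutes of speech against profiles from later sessions with these three scores is one way to check stability over time, under the assumed definitions.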

Title: MultiWikiQA: A Reading Comprehension Benchmark in 300+ Languages

Authors: Dan Saattrup Smart
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.04111
Pdf URL: https://arxiv.org/pdf/2509.04111
Copy Paste: [[2509.04111]] MultiWikiQA: A Reading Comprehension Benchmark in 300+ Languages(https://arxiv.org/abs/2509.04111)
Keywords: language model, llm
Abstract: We introduce a new reading comprehension dataset, dubbed MultiWikiQA, which covers 306 languages. The context data comes from Wikipedia articles, with questions generated by an LLM and the answers appearing verbatim in the Wikipedia articles. We conduct a crowdsourced human evaluation of the fluency of the generated questions across 30 of the languages, providing evidence that the questions are of good quality. We evaluate 6 different language models, both decoder and encoder models of varying sizes, showing that the benchmark is sufficiently difficult and that there is a large performance discrepancy amongst the languages. The dataset and survey evaluations are freely available.
摘要：我们介绍了一个新的阅读理解数据集，称为Multiwikiqa，涵盖306种语言。上下文数据来自Wikipedia文章，其问题由LLM产生，答案在Wikipedia文章中逐字显示。我们对30种语言中产生的问题的流利性进行了众包人的评估，提供了证据表明问题质量良好。我们评估了6种不同大小的解码器和编码器模型的6种不同的语言模型，这表明基准非常困难，并且语言之间存在较大的性能差异。数据集和调查评估是免费的。

Title: MAGneT: Coordinated Multi-Agent Generation of Synthetic Multi-Turn Mental Health Counseling Sessions

Authors: Aishik Mandal, Tanmoy Chakraborty, Iryna Gurevych
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04183
Pdf URL: https://arxiv.org/pdf/2509.04183
Copy Paste: [[2509.04183]] MAGneT: Coordinated Multi-Agent Generation of Synthetic Multi-Turn Mental Health Counseling Sessions(https://arxiv.org/abs/2509.04183)
Keywords: language model, llm, agent
Abstract: The growing demand for scalable psychological counseling highlights the need for fine-tuning open-source Large Language Models (LLMs) with high-quality, privacy-compliant data, yet such data remains scarce. Here we introduce MAGneT, a novel multi-agent framework for synthetic psychological counseling session generation that decomposes counselor response generation into coordinated sub-tasks handled by specialized LLM agents, each modeling a key psychological technique. Unlike prior single-agent approaches, MAGneT better captures the structure and nuance of real counseling. In addition, we address inconsistencies in prior evaluation protocols by proposing a unified evaluation framework integrating diverse automatic and expert metrics. Furthermore, we expand the expert evaluations from four aspects of counseling in previous works to nine aspects, enabling a more thorough and robust assessment of data quality. Empirical results show that MAGneT significantly outperforms existing methods in quality, diversity, and therapeutic alignment of the generated counseling sessions, improving general counseling skills by 3.2% and CBT-specific skills by 4.3% on average on the Cognitive Therapy Rating Scale (CTRS). Crucially, experts prefer MAGneT-generated sessions in 77.2% of cases on average across all aspects. Moreover, a model fine-tuned on MAGneT-generated sessions performs better than models fine-tuned on sessions generated by baseline methods, with average CTRS improvements of 6.3% on general counseling skills and 7.3% on CBT-specific skills. We also make our code and data public.
摘要：对可扩展心理咨询的需求不断增长，这表明需要具有高质量，符合隐私数据的微调开源大型语言模型（LLMS），但这些数据仍然很少。在这里，我们介绍了磁铁，这是一个新型的多代理框架，用于合成心理咨询会议的一代，将辅导员的响应生成分解为由专业LLM代理处理的协调子任务，每项都对关键的心理技术进行建模。与以前的单一代理方法不同，磁铁更好地捕捉了真正的咨询的结构和细微差别。此外，我们通过提出一个统一的评估框架来集成多样化的自动和专家指标，以解决先前评估协议中的不一致之处。此外，我们将专家评估从以前作品的咨询的四个方面扩展到九个方面，从而对数据质量进行了更彻底，更强大的评估。经验结果表明，磁铁在产生的咨询会议的质量，多样性和治疗性一致性方面显着优于现有方法，将通用咨询技能提高3.2％，而CBT特异性技能平均在认知疗法评级量表（CTRS）中平均提高了4.3％。至关重要的是，专家们在平均所有方面平均有77.2％的病例中的磁铁生成的会话。此外，对磁铁生成的会话进行微调的开源模型显示出更好的性能，与基线方法相比，CTR的通用咨询技能的提高了6.3％，CTR平均CTR的CTRS技能为7.3％。我们还将代码和数据公开。

Title: Explicit and Implicit Data Augmentation for Social Event Detection

Authors: Congbo Ma, Yuxia Wang, Jia Wu, Jian Yang, Jing Du, Zitai Qiu, Qing Li, Hu Wang, Preslav Nakov
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2509.04202
Pdf URL: https://arxiv.org/pdf/2509.04202
Copy Paste: [[2509.04202]] Explicit and Implicit Data Augmentation for Social Event Detection(https://arxiv.org/abs/2509.04202)
Keywords: language model
Abstract: Social event detection involves identifying and categorizing important events from social media, which relies on labeled data, but annotation is costly and labor-intensive. To address this problem, we propose the Augmentation framework for Social Event Detection (SED-Aug), a plug-and-play dual augmentation framework, which combines explicit text-based and implicit feature-space augmentation to enhance data diversity and model robustness. The explicit augmentation utilizes large language models to enhance textual information through five diverse generation strategies. For implicit augmentation, we design five novel perturbation techniques that operate in the feature space on structural fused embeddings. These perturbations are crafted to keep the semantic and relational properties of the embeddings and make them more diverse. Specifically, SED-Aug outperforms the best baseline model by approximately 17.67% on the Twitter2012 dataset and by about 15.57% on the Twitter2018 dataset in terms of the average F1 score. The code is available on GitHub.
摘要：社交事件检测涉及从社交媒体中识别和分类重要事件，这些事件依赖于标记的数据，但注释是昂贵且劳动力密集的。为了解决这个问题，我们提出了社交事件检测的增强框架（SED-aug），即插件双重增强框架，该框架结合了明确的基于文本的和隐性的功能空间扩展，以增强数据多样性和模型鲁棒性。明确的增强利用大型语言模型通过五种不同的一代策略来增强文本信息。对于隐式增强，我们设计了五种在结构融合嵌入的特征空间中运行的五种新型扰动技术。这些扰动是为了保持嵌入的语义和关系属性而制定，并使它们更加多样化。具体而言，SED-EAG在Twitter2012数据集上优于最佳基线模型，而在Twitter2018数据集上，SED-AUG的表现就平均F1分数胜过15.57％。该代码可在GitHub上获得：此HTTPS URL。

Title: Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?

Authors: Qinyan Zhang, Xinping Lei, Ruijie Miao, Yu Fu, Haojie Fan, Le Chang, Jiafan Hou, Dingling Zhang, Zhongfei Hou, Ziqiang Yang, Changxin Pu, Fei Hu, Jingkai Liu, Mengyun Liu, Yang Liu, Xiang Gao, Jiaheng Liu, Tong Yang, Zaiyuan Wang, Ge Zhang, Wenhao Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.04292
Pdf URL: https://arxiv.org/pdf/2509.04292
Copy Paste: [[2509.04292]] Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?(https://arxiv.org/abs/2509.04292)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) achieve strong performance on diverse tasks but often exhibit cognitive inertia, struggling to follow instructions that conflict with the standardized patterns learned during supervised fine-tuning (SFT). To evaluate this limitation, we propose Inverse IFEval, a benchmark that measures models' Counter-intuitive Ability: their capacity to override training-induced biases and comply with adversarial instructions. Inverse IFEval introduces eight types of such challenges, including Question Correction, Intentional Textual Flaws, Code without Comments, and Counterfactual Answering. Using a human-in-the-loop pipeline, we construct a dataset of 1012 high-quality Chinese and English questions across 23 domains, evaluated under an optimized LLM-as-a-Judge framework. Experiments on existing leading LLMs demonstrate the necessity of our proposed Inverse IFEval benchmark. Our findings emphasize that future alignment efforts should not only pursue fluency and factual correctness but also account for adaptability under unconventional contexts. We hope that Inverse IFEval serves as both a diagnostic tool and a foundation for developing methods that mitigate cognitive inertia, reduce overfitting to narrow patterns, and ultimately enhance the instruction-following reliability of LLMs in diverse and unpredictable real-world scenarios.
摘要：大型语言模型（LLMS）在各种任务上实现了强大的绩效，但经常表现出认知惯性，努力遵循指示，即在监督微调（SFT）期间学到的标准化模式冲突。为了评估这一限制，我们提出了Inforve Ifeval，这是一种基准，该基准测量了模型的违反直觉能力的能力，可以超越训练引起的偏见并符合对抗性指示。 IFEVAL逆向IFEVAL引入了八种类型的挑战，包括问题纠正，有意的文本缺陷，无评论的代码以及反事实答案。我们使用人类的循环管道，在23个领域构建了1012个高质量的中文和英语问题的数据集，并根据优化的LLM-AS-A-A-Gudge框架进行了评估。现有领先LLM的实验证明了我们提出的反ifeval基准的必要性。我们的发现强调，未来的一致性努力不仅应追求流利的和事实的正确性，而且还应说明在非常规背景下的适应性。我们希望倒数ifeval既是诊断工具，也可以作为开发减轻认知惯性的方法的基础，减少过度拟合狭窄的模式，并最终增强LLM在多种多样且无法预测的现实世界中的指导性可靠性。

Title: Facts Fade Fast: Evaluating Memorization of Outdated Medical Knowledge in Large Language Models

Authors: Juraj Vladika, Mahdi Dhaini, Florian Matthes
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.04304
Pdf URL: https://arxiv.org/pdf/2509.04304
Copy Paste: [[2509.04304]] Facts Fade Fast: Evaluating Memorization of Outdated Medical Knowledge in Large Language Models(https://arxiv.org/abs/2509.04304)
Keywords: language model, llm
Abstract: The growing capabilities of Large Language Models (LLMs) show significant potential to enhance healthcare by assisting medical researchers and physicians. However, their reliance on static training data is a major risk when medical recommendations evolve with new research and developments. When LLMs memorize outdated medical knowledge, they can provide harmful advice or fail at clinical reasoning tasks. To investigate this problem, we introduce two novel question-answering (QA) datasets derived from systematic reviews: MedRevQA (16,501 QA pairs covering general biomedical knowledge) and MedChangeQA (a subset of 512 QA pairs where medical consensus has changed over time). Our evaluation of eight prominent LLMs on the datasets reveals consistent reliance on outdated knowledge across all models. We additionally analyze the influence of obsolete pre-training data and training strategies to explain this phenomenon and propose future directions for mitigation, laying the groundwork for developing more current and reliable medical AI systems.
摘要：大型语言模型（LLMS）的增长能力通过协助医学研究人员和医师来增强医疗保健的巨大潜力。但是，当随着新的研究和发展发展，他们对静态培训数据的依赖是一个主要风险。当LLMS记住过时的医学知识时，他们可以提供有害建议或在临床推理任务下失败。为了调查这个问题，我们介绍了从系统评价得出的两个新颖的提问（QA）数据集：MEDREVQA（涵盖一般生物医学知识的16,501个QA对）和MedchangeQA（512个QA对的子集，其中医疗共识随着时间的推移发生了变化）。我们对数据集上八个突出的LLM的评估揭示了所有模型中过时的知识的一致依赖。我们还分析了过时的预训练数据和培训策略的影响，以解释这种现象并提出未来的缓解方向，为开发更多当前和可靠的医疗AI系统奠定了基础。

Title: Measuring Bias or Measuring the Task: Understanding the Brittle Nature of LLM Gender Biases

Authors: Bufan Gao, Elisa Kreiss
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.04373
Pdf URL: https://arxiv.org/pdf/2509.04373
Copy Paste: [[2509.04373]] Measuring Bias or Measuring the Task: Understanding the Brittle Nature of LLM Gender Biases(https://arxiv.org/abs/2509.04373)
Keywords: llm, prompt
Abstract: As LLMs are increasingly applied in socially impactful settings, concerns about gender bias have prompted growing efforts both to measure and mitigate such bias. These efforts often rely on evaluation tasks that differ from natural language distributions, as they typically involve carefully constructed task prompts that overtly or covertly signal the presence of gender bias-related content. In this paper, we examine how signaling the evaluative purpose of a task impacts measured gender bias in LLMs. Concretely, we test models under prompt conditions that (1) make the testing context salient, and (2) make gender-focused content salient. We then assess prompt sensitivity across four task formats with both token-probability and discrete-choice metrics. We find that even minor prompt changes can substantially alter bias outcomes, sometimes reversing their direction entirely. Discrete-choice metrics further tend to amplify bias relative to probabilistic measures. These findings not only highlight the brittleness of LLM gender bias evaluations but also open a new puzzle for the NLP benchmarking and development community: To what extent can well-controlled testing designs trigger LLM "testing mode" performance, and what does this mean for the ecological validity of future benchmarks?
摘要：随着LLM越来越多地应用于社会影响力的环境中，对性别偏见的担忧已经促使衡量和减轻这种偏见的努力越来越大。这些努力通常依赖于与自然语言分布不同的评估任务，因为它们通常涉及仔细构建的任务提示，这些任务提示公开或秘密地表明存在与性别偏见相关的内容。在本文中，我们研究了任务的评估目的如何影响LLMS中测量的性别偏见。具体而言，我们在迅速的条件下测试模型（1）使测试上下文显着，并且（2）使以性别为中心的内容显着。然后，我们评估具有令牌概率和离散选择指标的四种任务格式的迅速灵敏度。我们发现，即使是较小的提示更改也会大大改变偏见结果，有时会完全逆转其方向。离散选择指标进一步倾向于相对于概率措施扩大偏见。这些发现不仅突出了LLM性别偏见评估的脆弱性，而且为NLP基准测试和开发社区打开了一个新的难题：在多大程度上可以在多大程度上可以控制的测试设计触发LLM“测试模式”的表现，这对未来基础标记的生态有效性意味着什么意味着什么。

Title: Can Language Models Handle a Non-Gregorian Calendar?

Authors: Mutsumi Sasaki, Go Kamoda, Ryosuke Takahashi, Kosuke Sato, Kentaro Inui, Keisuke Sakaguchi, Benjamin Heinzerling
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.04432
Pdf URL: https://arxiv.org/pdf/2509.04432
Copy Paste: [[2509.04432]] Can Language Models Handle a Non-Gregorian Calendar?(https://arxiv.org/abs/2509.04432)
Keywords: language model
Abstract: Temporal reasoning and knowledge are essential capabilities for language models (LMs). While much prior work has analyzed and improved temporal reasoning in LMs, most studies have focused solely on the Gregorian calendar. However, many non-Gregorian systems, such as the Japanese, Hijri, and Hebrew calendars, are in active use and reflect culturally grounded conceptions of time. If and how well current LMs can accurately handle such non-Gregorian calendars has not been evaluated so far. Here, we present a systematic evaluation of how well open-source LMs handle one such non-Gregorian system: the Japanese calendar. For our evaluation, we create datasets for four tasks that require both temporal knowledge and temporal reasoning. Evaluating a range of English-centric and Japanese-centric LMs, we find that some models can perform calendar conversions, but even Japanese-centric models struggle with Japanese-calendar arithmetic and with maintaining consistency across calendars. Our results highlight the importance of developing LMs that are better equipped for culture-specific calendar understanding.
摘要：时间推理和知识是语言模型（LMS）的重要功能。尽管许多先前的工作已经分析和改善了LMS的时间推理，但大多数研究仅集中在Gregorian日历上。但是，许多非绿色系统，例如日语，希伯来语和希伯来语日历，都在积极使用，并反映了文化上扎根的时间概念。是否以及当前LM可以准确处理此类非gregorian日历的效果尚未得到评估。在这里，我们对开源LMS处理这样一种非gregorian系统的处理能力的系统评估：日语日历。为了进行评估，我们为需要时间知识和时间推理的四个任务创建数据集。在评估以英语为中心和日本的LMS的一系列范围内，我们发现某些模型可以执行日历转换，但是即使以日语为中心的模型则在日本 - 校友算术方面遇到了困难，并且在跨日历上保持一致性。我们的结果强调了开发LMS的重要性，这些LMS可以更好地用于特定文化的日历理解。
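
The calendar conversion task mentioned in this abstract can be made concrete with a short sketch of Gregorian-to-Japanese-era conversion. This is not the paper's dataset or evaluation code; the function names are hypothetical, only the five modern eras are covered, and the Meiji start date of 1868-01-25 is the conventional retroactive date, treated here as an assumption.

```python
from datetime import date
from typing import Tuple

# Era start dates in the Gregorian calendar, newest first.
# Assumption: Meiji is dated from the conventional 1868-01-25.
ERAS = [
    ("Reiwa", date(2019, 5, 1)),
    ("Heisei", date(1989, 1, 8)),
    ("Showa", date(1926, 12, 25)),
    ("Taisho", date(1912, 7, 30)),
    ("Meiji", date(1868, 1, 25)),
]

def to_japanese_era(d: date) -> Tuple[str, int]:
    """Return (era name, era year) for a Gregorian date."""
    for name, start in ERAS:
        if d >= start:
            # The calendar year in which an era begins counts as year 1.
            return name, d.year - start.year + 1
    raise ValueError("date precedes the eras covered by this sketch")

def to_gregorian_year(era: str, era_year: int) -> int:
    """Return the Gregorian year corresponding to a given era year."""
    for name, start in ERAS:
        if name == era:
            return start.year + era_year - 1
    raise ValueError("unknown era: " + era)

print(to_japanese_era(date(2019, 4, 30)))  # last day of Heisei
print(to_japanese_era(date(2019, 5, 1)))   # first day of Reiwa
```

The era boundaries fall mid-year, so a single Gregorian year can map to two era years (2019 is both Heisei 31 and Reiwa 1). That makes date-level conversion and cross-calendar arithmetic harder than a fixed year offset.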