2025-09-16

Title: Uncovering the Vulnerability of Large Language Models in the Financial Domain via Risk Concealment

Authors: Gang Cheng, Haibo Jin, Wenbin Zhang, Haohan Wang, Jun Zhuang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.10546
Pdf URL: https://arxiv.org/pdf/2509.10546
Copy Paste: [[2509.10546]] Uncovering the Vulnerability of Large Language Models in the Financial Domain via Risk Concealment(https://arxiv.org/abs/2509.10546)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) are increasingly integrated into financial applications, yet existing red-teaming research primarily targets harmful content, largely neglecting regulatory risks. In this work, we aim to investigate the vulnerability of financial LLMs through red-teaming approaches. We introduce Risk-Concealment Attacks (RCA), a novel multi-turn framework that iteratively conceals regulatory risks to provoke seemingly compliant yet regulatory-violating responses from LLMs. To enable systematic evaluation, we construct FIN-Bench, a domain-specific benchmark for assessing LLM safety in financial contexts. Extensive experiments on FIN-Bench demonstrate that RCA effectively bypasses nine mainstream LLMs, achieving an average attack success rate (ASR) of 93.18%, including 98.28% on GPT-4.1 and 97.56% on OpenAI o1. These findings reveal a critical gap in current alignment techniques and underscore the urgent need for stronger moderation mechanisms in financial domains. We hope this work offers practical insights for advancing robust and domain-aware LLM alignment.
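Summary: LLMs are increasingly integrated into financial applications, but existing red-teaming research focuses on harmful content and largely overlooks regulatory risk. The paper introduces Risk-Concealment Attacks (RCA), a multi-turn framework that iteratively conceals regulatory risk to elicit responses that appear compliant but violate regulations, and builds FIN-Bench to evaluate LLM safety in financial contexts. On FIN-Bench, RCA bypasses nine mainstream LLMs with an average attack success rate (ASR) of 93.18%, including 98.28% on GPT-4.1 and 97.56% on OpenAI o1.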

Title: No Answer Needed: Predicting LLM Answer Accuracy from Question-Only Linear Probes

Authors: Iván Vicente Moreno Cencerrado, Arnau Padrés Masdemont, Anton Gonzalvez Hawthorne, David Demitri Africa, Lorenzo Pacchiardi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.10625
Pdf URL: https://arxiv.org/pdf/2509.10625
Copy Paste: [[2509.10625]] No Answer Needed: Predicting LLM Answer Accuracy from Question-Only Linear Probes(https://arxiv.org/abs/2509.10625)
Keywords: language model, llm
Abstract: Do large language models (LLMs) anticipate when they will answer correctly? To study this, we extract activations after a question is read but before any tokens are generated, and train linear probes to predict whether the model's forthcoming answer will be correct. Across three open-source model families ranging from 7 to 70 billion parameters, projections on this "in-advance correctness direction" trained on generic trivia questions predict success in distribution and on diverse out-of-distribution knowledge datasets, outperforming black-box baselines and verbalised predicted confidence. Predictive power saturates in intermediate layers, suggesting that self-assessment emerges mid-computation. Notably, generalisation falters on questions requiring mathematical reasoning. Moreover, for models responding "I don't know", doing so strongly correlates with the probe score, indicating that the same direction also captures confidence. By complementing previous results on truthfulness and other behaviours obtained with probes and sparse auto-encoders, our work contributes essential findings to elucidate LLM internals.
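Summary: The paper asks whether LLMs anticipate when they will answer correctly. It extracts activations after the question is read but before any token is generated, and trains linear probes to predict whether the forthcoming answer will be correct. Across three open-source model families from 7 to 70 billion parameters, the resulting "in-advance correctness direction" predicts success in distribution and on out-of-distribution knowledge datasets and outperforms black-box baselines and verbalised confidence, but generalisation falters on mathematical reasoning. Predictive power saturates in intermediate layers.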
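Example: a minimal sketch of the probing idea, assuming a difference-of-means direction as the linear probe (the abstract does not say how the probes are fitted). The activations are synthetic stand-ins, and `make_synthetic`, `fit_direction`, and `probe_score` are hypothetical names, not the authors' code.

```python
import random

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def fit_direction(acts, labels):
    """Difference-of-means probe: points from the 'incorrect' mean to the 'correct' mean."""
    pos = [a for a, y in zip(acts, labels) if y == 1]
    neg = [a for a, y in zip(acts, labels) if y == 0]
    mu_pos, mu_neg = mean_vector(pos), mean_vector(neg)
    direction = [p - q for p, q in zip(mu_pos, mu_neg)]
    midpoint = [(p + q) / 2 for p, q in zip(mu_pos, mu_neg)]
    # Threshold is the projection of the midpoint between the two class means.
    threshold = sum(d * m for d, m in zip(direction, midpoint))
    return direction, threshold

def probe_score(direction, act):
    """Projection of one activation vector onto the probe direction."""
    return sum(d * a for d, a in zip(direction, act))

def make_synthetic(n, dim, rng):
    """Toy stand-in for question-only activations; label 1 means 'answered correctly'."""
    acts, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        shift = 1.0 if y == 1 else -1.0
        # Assumption: only the first coordinate carries signal; the rest is noise.
        acts.append([shift + rng.gauss(0, 0.5)] + [rng.gauss(0, 1) for _ in range(dim - 1)])
        labels.append(y)
    return acts, labels

rng = random.Random(0)
train_x, train_y = make_synthetic(400, 8, rng)
test_x, test_y = make_synthetic(200, 8, rng)
direction, threshold = fit_direction(train_x, train_y)
preds = [1 if probe_score(direction, a) > threshold else 0 for a in test_x]
accuracy = sum(p == y for p, y in zip(preds, test_y)) / len(test_y)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

With real models, the vectors would come from a hidden layer at the last question token, and the label would record whether the model's later answer was correct.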

Title: Context Copying Modulation: The Role of Entropy Neurons in Managing Parametric and Contextual Knowledge Conflicts

Authors: Zineddine Tighidet, Andrea Mogini, Hedi Ben-younes, Jiali Mei, Patrick Gallinari, Benjamin Piwowarski
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.10663
Pdf URL: https://arxiv.org/pdf/2509.10663
Copy Paste: [[2509.10663]] Context Copying Modulation: The Role of Entropy Neurons in Managing Parametric and Contextual Knowledge Conflicts(https://arxiv.org/abs/2509.10663)
Keywords: language model, llm
Abstract: The behavior of Large Language Models (LLMs) when facing contextual information that conflicts with their internal parametric knowledge is inconsistent, with no generally accepted explanation for the expected outcome distribution. Recent work has identified in autoregressive transformer models a class of neurons -- called entropy neurons -- that produce a significant effect on the model output entropy while having an overall moderate impact on the ranking of the predicted tokens. In this paper, we investigate the preliminary claim that these neurons are involved in inhibiting context copying behavior in transformers by looking at their role in resolving conflicts between contextual and parametric information. We show that entropy neurons are responsible for suppressing context copying across a range of LLMs, and that ablating them leads to a significant change in the generation process. These results enhance our understanding of the internal dynamics of LLMs when handling conflicting information.
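Summary: LLMs behave inconsistently when contextual information conflicts with their internal parametric knowledge. The paper studies entropy neurons, which strongly affect model output entropy while only moderately affecting the ranking of predicted tokens, and shows across a range of LLMs that they suppress context copying and that ablating them significantly changes the generation process.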

Title: Pluralistic Alignment for Healthcare: A Role-Driven Framework

Authors: Jiayou Zhong, Anudeex Shetty, Chao Jia, Xuanrui Lin, Usman Naseem
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.10685
Pdf URL: https://arxiv.org/pdf/2509.10685
Copy Paste: [[2509.10685]] Pluralistic Alignment for Healthcare: A Role-Driven Framework(https://arxiv.org/abs/2509.10685)
Keywords: language model, agent
Abstract: As large language models are increasingly deployed in sensitive domains such as healthcare, ensuring their outputs reflect the diverse values and perspectives held across populations is critical. However, existing alignment approaches, including pluralistic paradigms like Modular Pluralism, often fall short in the health domain, where personal, cultural, and situational factors shape pluralism. Motivated by the aforementioned healthcare challenges, we propose a first lightweight, generalizable, pluralistic alignment approach, EthosAgents, designed to simulate diverse perspectives and values. We empirically show that it advances the pluralistic alignment for all three modes across seven varying-sized open and closed models. Our findings reveal that health-related pluralism demands adaptable and normatively aware approaches, offering insights into how these models can better respect diversity in other high-stakes domains.
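Summary: Existing alignment approaches, including pluralistic paradigms such as Modular Pluralism, often fall short in healthcare, where personal, cultural, and situational factors shape pluralism. The paper proposes EthosAgents, a lightweight, generalizable pluralistic alignment approach that simulates diverse perspectives and values, and reports gains in all three pluralistic modes across seven open and closed models of varying size.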

Title: A Survey on Retrieval And Structuring Augmented Generation with Large Language Models

Authors: Pengcheng Jiang, Siru Ouyang, Yizhu Jiao, Ming Zhong, Runchu Tian, Jiawei Han
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.10697
Pdf URL: https://arxiv.org/pdf/2509.10697
Copy Paste: [[2509.10697]] A Survey on Retrieval And Structuring Augmented Generation with Large Language Models(https://arxiv.org/abs/2509.10697)
Keywords: language model, llm, hallucination, prompt
Abstract: Large Language Models (LLMs) have revolutionized natural language processing with their remarkable capabilities in text generation and reasoning. However, these models face critical challenges when deployed in real-world applications, including hallucination generation, outdated knowledge, and limited domain expertise. Retrieval And Structuring (RAS) Augmented Generation addresses these limitations by integrating dynamic information retrieval with structured knowledge representations. This survey (1) examines retrieval mechanisms including sparse, dense, and hybrid approaches for accessing external knowledge; (2) explores text structuring techniques such as taxonomy construction, hierarchical classification, and information extraction that transform unstructured text into organized representations; and (3) investigates how these structured representations integrate with LLMs through prompt-based methods, reasoning frameworks, and knowledge embedding techniques. It also identifies technical challenges in retrieval efficiency, structure quality, and knowledge integration, while highlighting research opportunities in multimodal retrieval, cross-lingual structures, and interactive systems. This comprehensive overview provides researchers and practitioners with insights into RAS methods, applications, and future directions.
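Summary: This survey covers Retrieval And Structuring (RAS) Augmented Generation, which combines dynamic information retrieval with structured knowledge representations to address hallucination, outdated knowledge, and limited domain expertise. It reviews sparse, dense, and hybrid retrieval; text structuring techniques such as taxonomy construction, hierarchical classification, and information extraction; and integration with LLMs through prompt-based methods, reasoning frameworks, and knowledge embedding, along with open technical challenges and research opportunities.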
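Example: a toy sketch of hybrid retrieval that combines a sparse and a dense ranking. The abstract does not say which fusion methods the survey covers, so reciprocal rank fusion is used here as one common choice. The documents, the two-dimensional "embeddings", and all function names are hypothetical.

```python
import math

def sparse_score(query, doc):
    """Toy sparse signal: number of query terms that appear in the document."""
    doc_terms = set(doc.lower().split())
    return sum(1 for term in query.lower().split() if term in doc_terms)

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def ranking(scores):
    """Document indices sorted best-first (ties keep original order)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several rankings: each document gets sum(1 / (k + rank))."""
    fused = {}
    for order in rankings:
        for rank, doc_id in enumerate(order, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

docs = [
    "kv cache eviction for long context",
    "retrieval augmented generation with knowledge graphs",
    "graph taxonomy construction",
]
# Hypothetical embeddings; a real system would use an embedding model.
doc_vecs = [[0.1, 0.9], [0.8, 0.3], [0.7, 0.2]]
query, query_vec = "knowledge graphs retrieval", [0.9, 0.1]

sparse_rank = ranking([sparse_score(query, d) for d in docs])
dense_rank = ranking([cosine(query_vec, v) for v in doc_vecs])
hybrid_rank = reciprocal_rank_fusion([sparse_rank, dense_rank])
print(sparse_rank, dense_rank, hybrid_rank)
```

Fusing ranks rather than raw scores avoids having to put term-overlap counts and cosine similarities on a common scale.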

Title: SearchInstruct: Enhancing Domain Adaptation via Retrieval-Based Instruction Dataset Creation

Authors: Iman Barati, Mostafa Amiri, Heshaam Faili
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.10708
Pdf URL: https://arxiv.org/pdf/2509.10708
Copy Paste: [[2509.10708]] SearchInstruct: Enhancing Domain Adaptation via Retrieval-Based Instruction Dataset Creation(https://arxiv.org/abs/2509.10708)
Keywords: language model, llm
Abstract: Supervised Fine-Tuning (SFT) is essential for training large language models (LLMs), significantly enhancing critical capabilities such as instruction following and in-context learning. Nevertheless, creating suitable training datasets tailored for specific domains remains challenging due to unique domain constraints and data scarcity. In this paper, we propose SearchInstruct, an innovative method explicitly designed to construct high-quality instruction datasets for SFT. Our approach begins with a limited set of domain-specific, human-generated questions, which are systematically expanded using a large language model. Subsequently, domain-relevant resources are dynamically retrieved to generate accurate and contextually appropriate answers for each augmented question. Experimental evaluation demonstrates that SearchInstruct enhances both the diversity and quality of SFT datasets, leading to measurable improvements in LLM performance within specialized domains. Additionally, we show that beyond dataset generation, the proposed method can also effectively facilitate tasks such as model editing, enabling efficient updates to existing models. To facilitate reproducibility and community adoption, we provide full implementation details, the complete set of generated instruction-response pairs, and the source code in a publicly accessible Git repository: this https URL
Summary: SearchInstruct builds instruction datasets for supervised fine-tuning (SFT) in specialized domains. It expands a small set of domain-specific, human-generated questions with an LLM, then dynamically retrieves domain-relevant resources to generate accurate answers for each augmented question. Experiments show improved dataset diversity and quality and measurable gains in domain performance; the method also supports tasks such as model editing, and the implementation details, generated instruction-response pairs, and source code are released in a public Git repository.

Title: PolyTruth: Multilingual Disinformation Detection using Transformer-Based Language Models

Authors: Zaur Gouliev, Jennifer Waters, Chengqian Wang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.10737
Pdf URL: https://arxiv.org/pdf/2509.10737
Copy Paste: [[2509.10737]] PolyTruth: Multilingual Disinformation Detection using Transformer-Based Language Models(https://arxiv.org/abs/2509.10737)
Keywords: language model
Abstract: Disinformation spreads rapidly across linguistic boundaries, yet most AI models are still benchmarked only on English. We address this gap with a systematic comparison of five multilingual transformer models (mBERT, XLM, XLM-RoBERTa, RemBERT, and mT5) on a common fake-vs-true machine learning classification task. While transformer-based language models have demonstrated notable success in detecting disinformation in English, their effectiveness in multilingual contexts remains up for debate. To facilitate evaluation, we introduce PolyTruth Disinfo Corpus, a novel corpus of 60,486 statement pairs (false claim vs. factual correction) spanning over twenty-five languages that collectively cover five language families and a broad topical range including politics, health, climate, finance, and conspiracy, half of which are fact-checked disinformation claims verified by an augmented MindBugs Discovery dataset. Our experiments revealed performance variations. Models such as RemBERT achieved better overall accuracy, particularly excelling in low-resource languages, whereas models like mBERT and XLM exhibit considerable limitations when training data is scarce. We provide a discussion of these performance patterns and implications for real-world deployment. The dataset is publicly available on our GitHub repository to encourage further experimentation and advancement. Our findings illuminate both the potential and the current limitations of AI systems for multilingual disinformation detection.
Summary: The paper compares five multilingual transformer models (mBERT, XLM, XLM-RoBERTa, RemBERT, and mT5) on a common fake-vs-true classification task. It introduces the PolyTruth Disinfo Corpus of 60,486 statement pairs (false claim vs. factual correction) spanning over twenty-five languages and five language families. Models such as RemBERT achieve better overall accuracy, especially in low-resource languages, while mBERT and XLM show considerable limitations when training data is scarce. The dataset is publicly available on GitHub.

Title: Reasoning Under Uncertainty: Exploring Probabilistic Reasoning Capabilities of LLMs

Authors: Mobina Pournemat, Keivan Rezaei, Gaurang Sriramanan, Arman Zarei, Jiaxiang Fu, Yang Wang, Hamid Eghbalzadeh, Soheil Feizi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.10739
Pdf URL: https://arxiv.org/pdf/2509.10739
Copy Paste: [[2509.10739]] Reasoning Under Uncertainty: Exploring Probabilistic Reasoning Capabilities of LLMs(https://arxiv.org/abs/2509.10739)
Keywords: language model, llm, prompt
Abstract: Despite widespread success in language understanding and generation, large language models (LLMs) exhibit unclear and often inconsistent behavior when faced with tasks that require probabilistic reasoning. In this work, we present the first comprehensive study of the reasoning capabilities of LLMs over explicit discrete probability distributions. Given observations from a probability distribution, we evaluate models on three carefully designed tasks, mode identification, maximum likelihood estimation, and sample generation, by prompting them to provide responses to queries about either the joint distribution or its conditionals. These tasks thus probe a range of probabilistic skills, including frequency analysis, marginalization, and generative behavior. Through comprehensive empirical evaluations, we demonstrate that there exists a clear performance gap between smaller and larger models, with the latter demonstrating stronger inference and surprising capabilities in sample generation. Furthermore, our investigations reveal notable limitations, including sensitivity to variations in the notation utilized to represent probabilistic outcomes and performance degradation of over 60% as context length increases. Together, our results provide a detailed understanding of the probabilistic reasoning abilities of LLMs and identify key directions for future improvement.
Summary: The paper studies LLM reasoning over explicit discrete probability distributions using three tasks: mode identification, maximum likelihood estimation, and sample generation, with queries about either the joint distribution or its conditionals. Larger models show stronger inference and surprising sample-generation ability compared with smaller ones, but models are sensitive to the notation used for outcomes, and performance degrades by over 60% as context length increases.
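Example: a small sketch of the ground-truth quantities that the three tasks are scored against, assuming a categorical distribution whose maximum likelihood estimate is the relative frequency of each outcome. The weather observations and function names are hypothetical; the paper's actual prompts and data are not reproduced here.

```python
import random
from collections import Counter

# Hypothetical observations of two discrete variables (X, Y).
observations = [
    ("rain", "cold"), ("rain", "cold"), ("rain", "cold"), ("rain", "warm"),
    ("sun", "warm"), ("sun", "warm"), ("sun", "warm"), ("sun", "warm"),
    ("sun", "cold"),
]

def mle_joint(samples):
    """MLE of a categorical joint distribution: relative frequencies."""
    counts = Counter(samples)
    total = len(samples)
    return {outcome: c / total for outcome, c in counts.items()}

def marginal_x(joint):
    """Marginalize Y out of P(X, Y)."""
    out = Counter()
    for (x, _y), p in joint.items():
        out[x] += p
    return dict(out)

def conditional_y_given_x(joint, x):
    """P(Y | X = x), obtained by renormalizing the matching joint entries."""
    px = sum(p for (xi, _y), p in joint.items() if xi == x)
    return {y: p / px for (xi, y), p in joint.items() if xi == x}

def mode(dist):
    """Most probable outcome of a distribution given as a dict."""
    return max(dist, key=dist.get)

def sample(dist, n, rng):
    """Draw n outcomes with probability proportional to dist."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=n)

joint = mle_joint(observations)
joint_mode = mode(joint)
cond = conditional_y_given_x(joint, "rain")
draws = sample(joint, 5, random.Random(0))
print(joint_mode, cond, draws)
```

A model's answer to a mode or conditional query can then be compared with `joint_mode` or `cond`, and its generated samples with draws from `joint`.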

Title: Automated MCQA Benchmarking at Scale: Evaluating Reasoning Traces as Retrieval Sources for Domain Adaptation of Small Language Models

Authors: Ozan Gokdemir, Neil Getty, Robert Underwood, Sandeep Madireddy, Franck Cappello, Arvind Ramanathan, Ian T. Foster, Rick L. Stevens
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.10744
Pdf URL: https://arxiv.org/pdf/2509.10744
Copy Paste: [[2509.10744]] Automated MCQA Benchmarking at Scale: Evaluating Reasoning Traces as Retrieval Sources for Domain Adaptation of Small Language Models(https://arxiv.org/abs/2509.10744)
Keywords: language model, gpt, retrieval-augmented generation
Abstract: As scientific knowledge grows at an unprecedented pace, evaluation benchmarks must evolve to reflect new discoveries and ensure language models are tested on current, diverse literature. We propose a scalable, modular framework for generating multiple-choice question-answering (MCQA) benchmarks directly from large corpora of scientific papers. Our pipeline automates every stage of MCQA creation, including PDF parsing, semantic chunking, question generation, and model evaluation. As a case study, we generate more than 16,000 MCQs from 22,000 open-access articles in radiation and cancer biology. We then evaluate a suite of small language models (1.1B-14B parameters) on these questions, comparing baseline accuracy with retrieval-augmented generation (RAG) from paper-derived semantic chunks and from reasoning traces distilled from GPT-4.1. We find that reasoning-trace retrieval consistently improves performance on both synthetic and expert-annotated benchmarks, enabling several small models to surpass GPT-4 on the 2023 Astro Radiation and Cancer Biology exam.
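Summary: The paper proposes a scalable, modular pipeline that generates multiple-choice question-answering (MCQA) benchmarks directly from large corpora of scientific papers, automating PDF parsing, semantic chunking, question generation, and model evaluation. In a case study it generates more than 16,000 MCQs from 22,000 open-access articles in radiation and cancer biology and evaluates small language models (1.1B-14B parameters) with and without retrieval-augmented generation (RAG). Retrieval from reasoning traces distilled from GPT-4.1 consistently improves performance on synthetic and expert-annotated benchmarks, enabling several small models to surpass GPT-4 on the 2023 Astro Radiation and Cancer Biology exam.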

Title: RECAP: Transparent Inference-Time Emotion Alignment for Medical Dialogue Systems

Authors: Adarsh Srinivasan, Jacob Dineen, Muhammad Umar Afzal, Muhammad Uzair Sarfraz, Irbaz B. Riaz, Ben Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.10746
Pdf URL: https://arxiv.org/pdf/2509.10746
Copy Paste: [[2509.10746]] RECAP: Transparent Inference-Time Emotion Alignment for Medical Dialogue Systems(https://arxiv.org/abs/2509.10746)
Keywords: language model, prompt
Abstract: Large language models in healthcare often miss critical emotional cues, delivering medically sound but emotionally flat advice. This is especially problematic in clinical contexts where patients are distressed and vulnerable, and require empathic communication to support safety, adherence, and trust. We present RECAP (Reflect-Extract-Calibrate-Align-Produce), an inference-time framework that adds structured emotional reasoning without retraining. By decomposing empathy into transparent appraisal-theoretic stages and exposing per-dimension Likert signals, RECAP produces nuanced, auditable responses. Across EmoBench, SECEU, and EQ-Bench, RECAP improves emotional reasoning by 22-28% on 8B models and 10-13% on larger models over zero-shot baselines. Clinician evaluations further confirm superior empathetic communication. RECAP shows that modular, theory-grounded prompting can systematically enhance emotional intelligence in medical AI while preserving the accountability required for deployment.
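Summary: RECAP (Reflect-Extract-Calibrate-Align-Produce) is an inference-time framework that adds structured emotional reasoning to medical dialogue models without retraining. It decomposes empathy into transparent appraisal-theoretic stages and exposes per-dimension Likert signals, producing nuanced, auditable responses. Across EmoBench, SECEU, and EQ-Bench it improves emotional reasoning by 22-28% on 8B models and 10-13% on larger models over zero-shot baselines, and clinician evaluations confirm superior empathetic communication.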

Title: Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction

Authors: Yijun Liu, Yixuan Wang, Yuzhuang Xu, Shiyu Ji, Yang Xu, Qingfu Zhu, Wanxiang Che
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.10798
Pdf URL: https://arxiv.org/pdf/2509.10798
Copy Paste: [[2509.10798]] Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction(https://arxiv.org/abs/2509.10798)
Keywords: language model, llm
Abstract: Large language models (LLMs) utilize key-value (KV) cache to store historical information during sequence processing. The size of KV cache grows linearly as the length of the sequence extends, which seriously affects memory usage and decoding efficiency. Current methods for KV cache eviction typically utilize the last window from the pre-filling phase as queries to compute the KV importance scores for eviction. Although this scheme is simple to implement, it tends to overly focus on local information, potentially leading to the neglect or omission of crucial global information. To mitigate this issue, we propose Judge Q, a novel training method which incorporates a soft token list. This method only tunes the model's embedding layer at a low training cost. By concatenating the soft token list at the end of the input sequence, we train these tokens' attention map to the original input sequence to align with that of the actual decoded tokens. In this way, the queries corresponding to the soft tokens can effectively capture global information and better evaluate the importance of the keys and values within the KV cache, thus maintaining decoding quality when KV cache is evicted. Under the same eviction budget, our method exhibits less performance degradation compared to existing eviction approaches. We validate our approach through experiments conducted on models such as Llama-3.1-8B-Instruct and Mistral-7B-Instruct-v0.3, using benchmarks including LongBench, RULER, and Needle-in-a-Haystack. Results indicate an improvement of approximately 1 point on the LongBench and over 3 points on RULER. This proposed methodology can be seamlessly integrated into existing open-source models with minimal training overhead, thereby enhancing performance in KV cache eviction scenarios.
Summary: KV cache size grows linearly with sequence length, which hurts memory usage and decoding efficiency. Current eviction methods typically use the last window of the pre-filling phase as queries to score KV importance, which over-emphasises local information. Judge Q trains a soft token list, tuning only the embedding layer, so that the soft tokens' attention over the input aligns with that of the actual decoded tokens; their queries are then used to score KV entries for eviction. In experiments on models such as Llama-3.1-8B-Instruct and Mistral-7B-Instruct-v0.3, results show gains of about 1 point on LongBench and over 3 points on RULER.
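Example: a single-head sketch of the baseline scheme the abstract describes, in which queries from the last window score the earlier KV entries and only the highest-scoring ones are kept. This is not Judge Q itself. Always retaining the window and summing attention over the window queries are assumptions, and `select_kv_to_keep` is a hypothetical name.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_kv_to_keep(queries, keys, window, budget):
    """Return sorted positions to retain: top-scoring prefix entries plus the window."""
    if budget < window:
        raise ValueError("budget must be at least the window size")
    n = len(keys)
    prefix = n - window
    scale = 1.0 / math.sqrt(len(keys[0]))
    importance = [0.0] * prefix
    # Each window query attends over the prefix; attention mass is summed per key.
    for q in queries[-window:]:
        attn = softmax([dot(q, keys[j]) * scale for j in range(prefix)])
        for j, a in enumerate(attn):
            importance[j] += a
    top = sorted(range(prefix), key=lambda j: importance[j], reverse=True)[: budget - window]
    return sorted(top) + list(range(prefix, n))

# Toy tensors: six positions, head dimension 2.
keys = [[1, 0], [0, 1], [5, 0], [0, 0], [1, 1], [0, 1]]
queries = [[0, 0], [0, 0], [0, 0], [0, 0], [1, 0], [1, 0]]
kept = select_kv_to_keep(queries, keys, window=2, budget=4)
print(kept)
```

In this toy case both window queries align with position 2, so it is kept first. The abstract's concern is that entries important to later decoding steps, but not to the window, would be evicted under this scoring.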

Title: Towards Automated Error Discovery: A Study in Conversational AI

Authors: Dominic Petrak, Thy Thy Tran, Iryna Gurevych
Subjects: cs.CL, cs.AI, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2509.10833
Pdf URL: https://arxiv.org/pdf/2509.10833
Copy Paste: [[2509.10833]] Towards Automated Error Discovery: A Study in Conversational AI(https://arxiv.org/abs/2509.10833)
Keywords: language model, gpt, llm, agent
Abstract: Although LLM-based conversational agents demonstrate strong fluency and coherence, they still produce undesirable behaviors (errors) that are challenging to prevent from reaching users during deployment. Recent research leverages large language models (LLMs) to detect errors and guide response-generation models toward improvement. However, current LLMs struggle to identify errors not explicitly specified in their instructions, such as those arising from updates to the response-generation model or shifts in user behavior. In this work, we introduce Automated Error Discovery, a framework for detecting and defining errors in conversational AI, and propose SEEED (Soft Clustering Extended Encoder-Based Error Detection), as an encoder-based approach to its implementation. We enhance the Soft Nearest Neighbor Loss by amplifying distance weighting for negative samples and introduce Label-Based Sample Ranking to select highly contrastive examples for better representation learning. SEEED outperforms adapted baselines -- including GPT-4o and Phi-4 -- across multiple error-annotated dialogue datasets, improving the accuracy for detecting unknown errors by up to 8 points and demonstrating strong generalization to unknown intent detection.
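Summary: LLM-based conversational agents still produce undesirable behaviors (errors) that are hard to keep from reaching users, and current LLMs struggle to identify errors not explicitly specified in their instructions. The paper introduces Automated Error Discovery, a framework for detecting and defining errors in conversational AI, and SEEED (Soft Clustering Extended Encoder-Based Error Detection), an encoder-based implementation. SEEED enhances the Soft Nearest Neighbor Loss by amplifying distance weighting for negative samples and adds Label-Based Sample Ranking to select highly contrastive examples. It outperforms adapted baselines, including GPT-4o and Phi-4, improving accuracy on unknown errors by up to 8 points and generalizing well to unknown intent detection.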
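Example: a sketch of the standard Soft Nearest Neighbor Loss that SEEED builds on. The amplified distance weighting for negative samples is not reproduced, because the abstract does not specify it. The `eps` guard and the toy points are assumptions.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def soft_nearest_neighbor_loss(points, labels, temperature=1.0, eps=1e-12):
    """Mean over i of -log( sum over same-class j != i of w_ij / sum over all k != i of w_ik ),
    where w_ij = exp(-||x_i - x_j||^2 / temperature)."""
    n = len(points)
    total = 0.0
    for i in range(n):
        same_class = 0.0
        everyone = 0.0
        for j in range(n):
            if j == i:
                continue
            w = math.exp(-sq_dist(points[i], points[j]) / temperature)
            everyone += w
            if labels[j] == labels[i]:
                same_class += w
        # eps avoids log(0) when a point has no close same-class neighbour.
        total += -math.log((same_class + eps) / (everyone + eps))
    return total / n

points = [[0.0, 0.0], [0.1, 0.0], [3.0, 3.0], [3.1, 3.0]]
clustered = soft_nearest_neighbor_loss(points, [0, 0, 1, 1])
entangled = soft_nearest_neighbor_loss(points, [0, 1, 0, 1])
print(f"clustered={clustered:.4f} entangled={entangled:.4f}")
```

The loss is low when each point's nearest neighbours share its label and high when classes are entangled, which is why minimizing it pulls same-class representations together.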

Title: Evaluating Large Language Models for Evidence-Based Clinical Question Answering

Authors: Can Wang, Yiqun Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.10843
Pdf URL: https://arxiv.org/pdf/2509.10843
Copy Paste: [[2509.10843]] Evaluating Large Language Models for Evidence-Based Clinical Question Answering(https://arxiv.org/abs/2509.10843)
Keywords: language model, gpt, llm, prompt
Abstract: Large Language Models (LLMs) have demonstrated substantial progress in biomedical and clinical applications, motivating rigorous evaluation of their ability to answer nuanced, evidence-based questions. We curate a multi-source benchmark drawing from Cochrane systematic reviews and clinical guidelines, including structured recommendations from the American Heart Association and narrative guidance used by insurers. Using GPT-4o-mini and GPT-5, we observe consistent performance patterns across sources and clinical domains: accuracy is highest on structured guideline recommendations (90%) and lower on narrative guideline and systematic review questions (60-70%). We also find a strong correlation between accuracy and the citation count of the underlying systematic reviews, where each doubling of citations is associated with roughly a 30% increase in the odds of a correct answer. Models show moderate ability to reason about evidence quality when contextual information is supplied. When we incorporate retrieval-augmented prompting, providing the gold-source abstract raises accuracy on previously incorrect items to 0.79; providing the top 3 PubMed abstracts (ranked by semantic relevance) improves accuracy to 0.23, while random abstracts reduce accuracy (0.10, within temperature variation). These effects are mirrored in GPT-4o-mini, underscoring that source clarity and targeted retrieval -- not just model size -- drive performance. Overall, our results highlight both the promise and current limitations of LLMs for evidence-based clinical question answering. Retrieval-augmented prompting emerges as a useful strategy to improve factual accuracy and alignment with source evidence, while stratified evaluation by specialty and question type remains essential to understand current knowledge access and to contextualize model performance.
Summary: The paper builds a multi-source benchmark from Cochrane systematic reviews and clinical guidelines, including structured American Heart Association recommendations and narrative guidance used by insurers. With GPT-4o-mini and GPT-5, accuracy is highest on structured guideline recommendations (90%) and lower on narrative guideline and systematic review questions (60-70%); each doubling of a review's citation count is associated with roughly a 30% increase in the odds of a correct answer. On previously incorrect items, supplying the gold-source abstract raises accuracy to 0.79 and the top 3 PubMed abstracts to 0.23, while random abstracts give 0.10.
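Example: a worked reading of the citation finding, assuming a constant odds ratio of 1.3 per doubling of citations (a logistic-style relationship). The baseline accuracy and citation counts are hypothetical, chosen only to show that a 30% increase in odds is much smaller than a 30-point increase in accuracy.

```python
import math

def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

def prob(o):
    """Convert odds back to a probability."""
    return o / (1 + o)

def accuracy_after_citation_change(base_accuracy, base_citations, new_citations,
                                   odds_ratio_per_doubling=1.3):
    """Scale the odds of a correct answer by 1.3 for every doubling of citations."""
    doublings = math.log2(new_citations / base_citations)
    return prob(odds(base_accuracy) * odds_ratio_per_doubling ** doublings)

# Hypothetical baseline: 60% accuracy for reviews with 50 citations.
one_doubling = accuracy_after_citation_change(0.60, 50, 100)
two_doublings = accuracy_after_citation_change(0.60, 50, 200)
print(f"{one_doubling:.3f} {two_doublings:.3f}")
```

Under these assumptions, accuracy moves from 0.60 to roughly 0.66 after one doubling and roughly 0.72 after two. The reported relationship is an association, so this is a description of the fitted trend, not a causal prediction.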

Title: GAPrune: Gradient-Alignment Pruning for Domain-Aware Embeddings

Authors: Yixuan Tang, Yi Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.10844
Pdf URL: https://arxiv.org/pdf/2509.10844
Copy Paste: [[2509.10844]] GAPrune: Gradient-Alignment Pruning for Domain-Aware Embeddings(https://arxiv.org/abs/2509.10844)
Keywords: llm, agent
Abstract: Domain-specific embedding models have shown promise for applications that require specialized semantic understanding, such as coding agents and financial retrieval systems, often achieving higher performance gains than general models. However, state-of-the-art embedding models are typically based on LLMs, which contain billions of parameters, making deployment challenging in resource-constrained environments. Model compression through pruning offers a promising solution, but existing pruning methods treat all parameters uniformly, failing to distinguish between general semantic representations and domain-specific patterns, leading to suboptimal pruning decisions. Thus, we propose GAPrune, a pruning framework that addresses this challenge by considering both domain importance and preserving general linguistic foundation. Our method uses Fisher Information to measure importance and general-domain gradient alignment to assess parameter behavior, then combines these signals using our Domain Alignment Importance (DAI) scoring. Lower DAI scores indicate that the parameter is either less important for the domain task or creates conflicts between domain and general objectives. Experiments on two domain benchmarks, FinMTEB and ChemTEB, show that GAPrune maintains performance within 2.5% of dense models in one-shot pruning at 50% sparsity, while outperforming all baselines. With retraining in 100 steps, GAPrune achieves +4.51% improvement on FinMTEB and +1.73% on ChemTEB, demonstrating that our pruning strategy not only preserves but enhances domain-specific capabilities. Our findings demonstrate that principled pruning strategies can achieve model compression and enhanced domain specialization, providing the research community with a new approach for development.
Summary: State-of-the-art embedding models are typically LLM-based and hard to deploy in resource-constrained environments, and existing pruning methods treat all parameters uniformly. GAPrune uses Fisher Information to measure importance and general-domain gradient alignment to assess parameter behavior, combining them into a Domain Alignment Importance (DAI) score; a low score marks a parameter as less important for the domain task or as creating conflict between domain and general objectives. On FinMTEB and ChemTEB, GAPrune stays within 2.5% of dense models under one-shot pruning at 50% sparsity while outperforming all baselines, and with 100 steps of retraining improves by +4.51% on FinMTEB and +1.73% on ChemTEB.

Title: Quantifier Scope Interpretation in Language Learners and LLMs

Authors: Shaohua Fang, Yue Li, Yan Cong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.10860
Pdf URL: https://arxiv.org/pdf/2509.10860
Copy Paste: [[2509.10860]] Quantifier Scope Interpretation in Language Learners and LLMs(https://arxiv.org/abs/2509.10860)
Keywords: language model, llm
Abstract: Sentences with multiple quantifiers often lead to interpretive ambiguities, which can vary across languages. This study adopts a cross-linguistic approach to examine how large language models (LLMs) handle quantifier scope interpretation in English and Chinese, using probabilities to assess interpretive likelihood. Human similarity (HS) scores were used to quantify the extent to which LLMs emulate human performance across language groups. Results reveal that most LLMs prefer the surface scope interpretations, aligning with human tendencies, while only some differentiate between English and Chinese in the inverse scope preferences, reflecting human-similar patterns. HS scores highlight variability in LLMs' approximation of human behavior, but their overall potential to align with humans is notable. Differences in model architecture, scale, and particularly models' pre-training data language background, significantly influence how closely LLMs approximate human quantifier scope interpretations.

Title: CultureSynth: A Hierarchical Taxonomy-Guided and Retrieval-Augmented Framework for Cultural Question-Answer Synthesis

Authors: Xinyu Zhang, Pei Zhang, Shuang Luo, Jialong Tang, Yu Wan, Baosong Yang, Fei Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.10886
Pdf URL: https://arxiv.org/pdf/2509.10886
Copy Paste: [[2509.10886]] CultureSynth: A Hierarchical Taxonomy-Guided and Retrieval-Augmented Framework for Cultural Question-Answer Synthesis(https://arxiv.org/abs/2509.10886)
Keywords: language model, gpt, llm, chat, retrieval-augmented generation
Abstract: Cultural competence, defined as the ability to understand and adapt to multicultural contexts, is increasingly vital for large language models (LLMs) in global environments. While several cultural benchmarks exist to assess LLMs' cultural competence, current evaluations suffer from fragmented taxonomies, domain specificity, and heavy reliance on manual data annotation. To address these limitations, we introduce CultureSynth, a novel framework comprising (1) a comprehensive hierarchical multilingual cultural taxonomy covering 12 primary and 130 secondary topics, and (2) a Retrieval-Augmented Generation (RAG)-based methodology leveraging factual knowledge to synthesize culturally relevant question-answer pairs. The CultureSynth-7 synthetic benchmark contains 19,360 entries and 4,149 manually verified entries across 7 languages. Evaluation of 14 prevalent LLMs of different sizes reveals clear performance stratification led by ChatGPT-4o-Latest and Qwen2.5-72B-Instruct. The results demonstrate that a 3B-parameter threshold is necessary for achieving basic cultural competence, models display varying architectural biases in knowledge processing, and significant geographic disparities exist across models. We believe that CultureSynth offers a scalable framework for developing culturally aware AI systems while reducing reliance on manual annotation. The benchmark is available at this https URL.

Title: Aligning ESG Controversy Data with International Guidelines through Semi-Automatic Ontology Construction

Authors: Tsuyoshi Iwata, Guillaume Comte, Melissa Flores, Ryoma Kondo, Ryohei Hisano
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2509.10922
Pdf URL: https://arxiv.org/pdf/2509.10922
Copy Paste: [[2509.10922]] Aligning ESG Controversy Data with International Guidelines through Semi-Automatic Ontology Construction(https://arxiv.org/abs/2509.10922)
Keywords: language model
Abstract: The growing importance of environmental, social, and governance data in regulatory and investment contexts has increased the need for accurate, interpretable, and internationally aligned representations of non-financial risks, particularly those reported in unstructured news sources. However, aligning such controversy-related data with principle-based normative frameworks, such as the United Nations Global Compact or Sustainable Development Goals, presents significant challenges. These frameworks are typically expressed in abstract language, lack standardized taxonomies, and differ from the proprietary classification systems used by commercial data providers. In this paper, we present a semi-automatic method for constructing structured knowledge representations of environmental, social, and governance events reported in the news. Our approach uses lightweight ontology design, formal pattern modeling, and large language models to convert normative principles into reusable templates expressed in the Resource Description Framework. These templates are used to extract relevant information from news content and populate a structured knowledge graph that links reported incidents to specific framework principles. The result is a scalable and transparent framework for identifying and interpreting non-compliance with international sustainability guidelines.

Title: Introducing Spotlight: A Novel Approach for Generating Captivating Key Information from Documents

Authors: Ankan Mullick, Sombit Bose, Rounak Saha, Ayan Kumar Bhowmick, Aditya Vempaty, Prasenjit Dey, Ravi Kokku, Pawan Goyal, Niloy Ganguly
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.10935
Pdf URL: https://arxiv.org/pdf/2509.10935
Copy Paste: [[2509.10935]] Introducing Spotlight: A Novel Approach for Generating Captivating Key Information from Documents(https://arxiv.org/abs/2509.10935)
Keywords: language model
Abstract: In this paper, we introduce Spotlight, a novel paradigm for information extraction that produces concise, engaging narratives by highlighting the most compelling aspects of a document. Unlike traditional summaries, which prioritize comprehensive coverage, spotlights selectively emphasize intriguing content to foster deeper reader engagement with the source material. We formally differentiate spotlights from related constructs and support our analysis with a detailed benchmarking study using new datasets curated for this work. To generate high-quality spotlights, we propose a two-stage approach: fine-tuning a large language model on our benchmark data, followed by alignment via Direct Preference Optimization (DPO). Our comprehensive evaluation demonstrates that the resulting model not only identifies key elements with precision but also enhances readability and boosts the engagement value of the original document.

Title: An Interpretable Benchmark for Clickbait Detection and Tactic Attribution

Authors: Lihi Nofar, Tomer Portal, Aviv Elbaz, Alexander Apartsin, Yehudit Aperstein
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.10937
Pdf URL: https://arxiv.org/pdf/2509.10937
Copy Paste: [[2509.10937]] An Interpretable Benchmark for Clickbait Detection and Tactic Attribution(https://arxiv.org/abs/2509.10937)
Keywords: language model, gpt, llm, prompt
Abstract: The proliferation of clickbait headlines poses significant challenges to the credibility of information and user trust in digital media. While recent advances in machine learning have improved the detection of manipulative content, the lack of explainability limits their practical adoption. This paper presents a model for explainable clickbait detection that not only identifies clickbait titles but also attributes them to specific linguistic manipulation strategies. We introduce a synthetic dataset generated by systematically augmenting real news headlines using a predefined catalogue of clickbait strategies. This dataset enables controlled experimentation and detailed analysis of model behaviour. We present a two-stage framework for automatic clickbait analysis comprising detection and tactic attribution. In the first stage, we compare a fine-tuned BERT classifier with large language models (LLMs), specifically GPT-4.0 and Gemini 2.4 Flash, under both zero-shot prompting and few-shot prompting enriched with illustrative clickbait headlines and their associated persuasive tactics. In the second stage, a dedicated BERT-based classifier predicts the specific clickbait strategies present in each headline. This work advances the development of transparent and trustworthy AI systems for combating manipulative media content. We share the dataset with the research community at this https URL

Title: EmoBench-Reddit: A Hierarchical Benchmark for Evaluating the Emotional Intelligence of Multimodal Large Language Models

Authors: Haokun Li, Yazhou Zhang, Jizhi Ding, Qiuchi Li, Peng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.11101
Pdf URL: https://arxiv.org/pdf/2509.11101
Copy Paste: [[2509.11101]] EmoBench-Reddit: A Hierarchical Benchmark for Evaluating the Emotional Intelligence of Multimodal Large Language Models(https://arxiv.org/abs/2509.11101)
Keywords: language model, llm
Abstract: With the rapid advancement of Multimodal Large Language Models (MLLMs), they have demonstrated exceptional capabilities across a variety of vision-language tasks. However, current evaluation benchmarks predominantly focus on objective visual question answering or captioning, inadequately assessing the models' ability to understand complex and subjective human emotions. To bridge this gap, we introduce EmoBench-Reddit, a novel, hierarchical benchmark for multimodal emotion understanding. The dataset comprises 350 meticulously curated samples from the social media platform Reddit, each containing an image, associated user-provided text, and an emotion category (sad, humor, sarcasm, happy) confirmed by user flairs. We designed a hierarchical task framework that progresses from basic perception to advanced cognition, with each data point featuring six multiple-choice questions and one open-ended question of increasing difficulty. Perception tasks evaluate the model's ability to identify basic visual elements (e.g., colors, objects), while cognition tasks require scene reasoning, intent understanding, and deep empathy integrating textual context. We ensured annotation quality through a combination of AI assistance (Claude 4) and manual verification.

Title: Fluid Language Model Benchmarking

Authors: Valentin Hofmann, David Heineman, Ian Magnusson, Kyle Lo, Jesse Dodge, Maarten Sap, Pang Wei Koh, Chun Wang, Hannaneh Hajishirzi, Noah A. Smith
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.11106
Pdf URL: https://arxiv.org/pdf/2509.11106
Copy Paste: [[2509.11106]] Fluid Language Model Benchmarking(https://arxiv.org/abs/2509.11106)
Keywords: language model
Abstract: Language model (LM) benchmarking faces several challenges: comprehensive evaluations are costly, benchmarks often fail to measure the intended capabilities, and evaluation quality can degrade due to labeling errors and benchmark saturation. Although various strategies have been proposed to mitigate these issues, they tend to address individual aspects in isolation, neglecting broader questions about overall evaluation quality. Here, we introduce Fluid Benchmarking, a new evaluation approach that advances LM benchmarking across multiple dimensions. Inspired by psychometrics, Fluid Benchmarking is based on the insight that the relative value of benchmark items depends on an LM's capability level, suggesting that evaluation should adapt to each LM. Methodologically, Fluid Benchmarking estimates an item response model based on existing LM evaluation results and uses the inferred quantities to select evaluation items dynamically, similar to computerized adaptive testing in education. In our experiments, we compare Fluid Benchmarking against the common practice of random item sampling as well as more sophisticated baselines, including alternative methods grounded in item response theory. We examine four dimensions -- efficiency, validity, variance, and saturation -- and find that Fluid Benchmarking achieves superior performance in all of them (e.g., higher validity and less variance on MMLU with fifty times fewer items). Our analysis shows that the two components of Fluid Benchmarking have distinct effects: item response theory, used to map performance into a latent ability space, increases validity, while dynamic item selection reduces variance. Overall, our results suggest that LM benchmarking can be substantially improved by moving beyond static evaluation.
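The dynamic item selection described in the Fluid Benchmarking abstract above resembles computerized adaptive testing, which can be illustrated with a small sketch. The two-parameter logistic (2PL) model and the maximum-Fisher-information selection rule below are standard textbook choices assumed here for illustration; the abstract does not confirm that the paper uses exactly these, and the item parameters are made up.

```python
import math

def p_correct(theta, a, b):
    """2PL item response model: probability that a model with latent
    ability theta answers an item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def select_next_item(theta, items, used):
    """Pick the unused item that is most informative at the current
    ability estimate (an assumed selection criterion)."""
    candidates = [i for i in range(len(items)) if i not in used]
    return max(candidates, key=lambda i: item_information(theta, *items[i]))

# Hypothetical item bank: (discrimination a, difficulty b) per item.
items = [(1.0, -2.0), (1.0, 0.0), (1.0, 2.0)]

# A model of average ability is best measured by the medium-difficulty item,
# a stronger model by the hard item.
first_for_average = select_next_item(0.0, items, used=set())
first_for_strong = select_next_item(1.8, items, used=set())
```

Because the most informative item depends on the ability estimate, different language models would be routed to different subsets of the benchmark, which is the adaptive behavior the abstract describes.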

Title: We Argue to Agree: Towards Personality-Driven Argumentation-Based Negotiation Dialogue Systems for Tourism

Authors: Priyanshu Priya, Saurav Dudhate, Desai Vishesh Yasheshbhai, Asif Ekbal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.11118
Pdf URL: https://arxiv.org/pdf/2509.11118
Copy Paste: [[2509.11118]] We Argue to Agree: Towards Personality-Driven Argumentation-Based Negotiation Dialogue Systems for Tourism(https://arxiv.org/abs/2509.11118)
Keywords: language model, llm
Abstract: Integrating argumentation mechanisms into negotiation dialogue systems improves conflict resolution through exchanges of arguments and critiques. Moreover, incorporating personality attributes enhances adaptability by aligning interactions with individuals' preferences and styles. To advance these capabilities in negotiation dialogue systems, we propose a novel Personality-driven Argumentation-based Negotiation Dialogue Generation (PAN-DG) task. To support this task, we introduce PACT, a dataset of Personality-driven Argumentation-based negotiation Conversations for the Tourism sector. This dataset, generated using Large Language Models (LLMs), features three distinct personality profiles, viz. Argumentation Profile, Preference Profile, and Buying Style Profile, to simulate a variety of negotiation scenarios involving diverse personalities. Thorough automatic and manual evaluations indicate that the dataset comprises high-quality dialogues. Further, we conduct comparative experiments between pre-trained and fine-tuned LLMs for the PAN-DG task. Multi-dimensional evaluation demonstrates that the fine-tuned LLMs effectively generate personality-driven rational responses during negotiations. This underscores the effectiveness of PACT in enhancing personalization and reasoning capabilities in negotiation dialogue systems, thereby establishing a foundation for future research in this domain.

Title: Joint Effects of Argumentation Theory, Audio Modality and Data Enrichment on LLM-Based Fallacy Classification

Authors: Hongxu Zhou, Hylke Westerdijk, Khondoker Ittehadul Islam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.11127
Pdf URL: https://arxiv.org/pdf/2509.11127
Copy Paste: [[2509.11127]] Joint Effects of Argumentation Theory, Audio Modality and Data Enrichment on LLM-Based Fallacy Classification(https://arxiv.org/abs/2509.11127)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: This study investigates how context and emotional tone metadata influence large language model (LLM) reasoning and performance in fallacy classification tasks, particularly within political debate settings. Using data from U.S. presidential debates, we classify six fallacy types through various prompting strategies applied to the Qwen-3 (8B) model. We introduce two theoretically grounded Chain-of-Thought frameworks: Pragma-Dialectics and the Periodic Table of Arguments, and evaluate their effectiveness against a baseline prompt under three input settings: text-only, text with context, and text with both context and audio-based emotional tone metadata. Results suggest that while theoretical prompting can improve interpretability and, in some cases, accuracy, the addition of context and especially emotional tone metadata often leads to lowered performance. Emotional tone metadata biases the model toward labeling statements as Appeal to Emotion, worsening logical reasoning. Overall, basic prompts often outperformed enhanced ones, suggesting that attention dilution from added inputs may worsen rather than improve fallacy classification in LLMs.

Title: When Smiley Turns Hostile: Interpreting How Emojis Trigger LLMs' Toxicity

Authors: Shiyao Cui, Xijia Feng, Yingkang Wang, Junxiao Yang, Zhexin Zhang, Biplab Sikdar, Hongning Wang, Han Qiu, Minlie Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.11141
Pdf URL: https://arxiv.org/pdf/2509.11141
Copy Paste: [[2509.11141]] When Smiley Turns Hostile: Interpreting How Emojis Trigger LLMs' Toxicity(https://arxiv.org/abs/2509.11141)
Keywords: language model, llm, prompt
Abstract: Emojis are globally used non-verbal cues in digital communication, and extensive research has examined how large language models (LLMs) understand and utilize emojis across contexts. While usually associated with friendliness or playfulness, it is observed that emojis may trigger toxic content generation in LLMs. Motivated by such an observation, we aim to investigate: (1) whether emojis can clearly enhance the toxicity generation in LLMs and (2) how to interpret this phenomenon. We begin with a comprehensive exploration of emoji-triggered LLM toxicity generation by automating the construction of prompts with emojis to subtly express toxic intent. Experiments across 5 mainstream languages on 7 famous LLMs along with jailbreak tasks demonstrate that prompts with emojis could easily induce toxicity generation. To understand this phenomenon, we conduct model-level interpretations spanning semantic cognition, sequence generation and tokenization, suggesting that emojis can act as a heterogeneous semantic channel to bypass the safety mechanisms. To pursue deeper insights, we further probe the pre-training corpus and uncover a potential correlation between emoji-related data pollution and the toxicity generation behaviors. Supplementary materials provide our implementation code and data. (Warning: This paper contains potentially sensitive content.)

Title: Text2Mem: A Unified Memory Operation Language for Memory Operating System

Authors: Felix Wang, Boyu Chen, Kerun Xu, Bo Tang, Feiyu Xiong, Zhiyu Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.11145
Pdf URL: https://arxiv.org/pdf/2509.11145
Copy Paste: [[2509.11145]] Text2Mem: A Unified Memory Operation Language for Memory Operating System(https://arxiv.org/abs/2509.11145)
Keywords: language model, agent
Abstract: Large language model agents increasingly depend on memory to sustain long-horizon interaction, but existing frameworks remain limited. Most expose only a few basic primitives such as encode, retrieve, and delete, while higher-order operations like merge, promote, demote, split, lock, and expire are missing or inconsistently supported. Moreover, there is no formal and executable specification for memory commands, leaving scope and lifecycle rules implicit and causing unpredictable behavior across systems. We introduce Text2Mem, a unified memory operation language that provides a standardized pathway from natural language to reliable execution. Text2Mem defines a compact yet expressive operation set aligned with encoding, storage, and retrieval. Each instruction is represented as a JSON-based schema instance with required fields and semantic invariants, which a parser transforms into typed operation objects with normalized parameters. A validator ensures correctness before execution, while adapters map typed objects either to a SQL prototype backend or to real memory frameworks. Model-based services such as embeddings or summarization are integrated when required. All results are returned through a unified execution contract. This design ensures safety, determinism, and portability across heterogeneous backends. We also outline Text2Mem Bench, a planned benchmark that separates schema generation from backend execution to enable systematic evaluation. Together, these components establish the first standardized foundation for memory control in agents.
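The parse-then-validate pipeline described in the Text2Mem abstract above (a JSON instruction with required fields, turned into a typed operation object and checked before execution) might look roughly like the sketch below. Only the operation names come from the abstract; the field names (`op`, `target`, `args`), the `ttl_seconds` invariant, and the `MemoryOp` class are hypothetical and are not taken from the actual Text2Mem schema.

```python
import json
from dataclasses import dataclass

# Operation names listed in the abstract.
ALLOWED_OPS = {"encode", "retrieve", "delete", "merge", "promote",
               "demote", "split", "lock", "expire"}
# Hypothetical required fields (assumption, not the real schema).
REQUIRED_FIELDS = ("op", "target", "args")

@dataclass(frozen=True)
class MemoryOp:
    """Typed operation object produced by the parser."""
    op: str
    target: str
    args: dict

def parse_instruction(raw):
    """Parse a JSON instruction and normalize it into a MemoryOp."""
    data = json.loads(raw)
    missing = [f for f in REQUIRED_FIELDS if f not in data]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return MemoryOp(op=str(data["op"]).strip().lower(),
                    target=str(data["target"]),
                    args=dict(data["args"]))

def validate(op):
    """Check semantic invariants before any backend adapter runs."""
    if op.op not in ALLOWED_OPS:
        raise ValueError(f"unknown operation: {op.op}")
    # Hypothetical invariant: an expiry must say when it takes effect.
    if op.op == "expire" and "ttl_seconds" not in op.args:
        raise ValueError("expire requires args.ttl_seconds")
    return op

raw = '{"op": " Expire ", "target": "notes/meeting", "args": {"ttl_seconds": 3600}}'
checked = validate(parse_instruction(raw))
```

Separating parsing from validation in this way means a backend adapter only ever receives operations that have already passed the schema and invariant checks, which is one plausible reading of the determinism claim in the abstract.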

Title: Differentially-private text generation degrades output language quality

Authors: Erion Çano, Ivan Habernal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.11176
Pdf URL: https://arxiv.org/pdf/2509.11176
Copy Paste: [[2509.11176]] Differentially-private text generation degrades output language quality(https://arxiv.org/abs/2509.11176)
Keywords: language model, llm
Abstract: Ensuring user privacy by synthesizing data from large language models (LLMs) tuned under differential privacy (DP) has become popular recently. However, the impact of DP fine-tuned LLMs on the quality of the language and the utility of the texts they produce has not been investigated. In this work, we tune five LLMs with three corpora under four levels of privacy and assess the length, the grammatical correctness, and the lexical diversity of the text outputs they produce. We also probe the utility of the synthetic outputs in downstream classification tasks such as book genre recognition based on book descriptions and cause of death recognition based on verbal autopsies. The results indicate that LLMs tuned under stronger privacy constraints produce texts that are shorter by at least 77%, that are less grammatically correct by at least 9%, and are less diverse by at least 10% in bi-gram diversity. Furthermore, the accuracy they reach in downstream classification tasks decreases, which might be detrimental to the usefulness of the generated synthetic data.
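The bi-gram diversity figure in the abstract above can be made concrete with a small sketch. The abstract does not define the metric, so the common "distinct-2" ratio (unique bigrams divided by total bigrams) and whitespace tokenization are assumed here for illustration only.

```python
def bigram_diversity(text):
    """Distinct-2 ratio: unique adjacent token pairs over all adjacent
    token pairs. Tokenization is plain whitespace splitting (assumption)."""
    tokens = text.lower().split()
    bigrams = list(zip(tokens, tokens[1:]))
    if not bigrams:
        return 0.0
    return len(set(bigrams)) / len(bigrams)

varied = bigram_diversity("the cat sat on the mat")
repetitive = bigram_diversity("a b a b a b")
```

Under this definition, text with no repeated bigrams scores 1.0, while repetitive text scores lower, so a drop in the ratio signals the kind of reduced lexical diversity the abstract reports.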

Title: Optimal Brain Restoration for Joint Quantization and Sparsification of LLMs

Authors: Hang Guo, Yawei Li, Luca Benini
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.11177
Pdf URL: https://arxiv.org/pdf/2509.11177
Copy Paste: [[2509.11177]] Optimal Brain Restoration for Joint Quantization and Sparsification of LLMs(https://arxiv.org/abs/2509.11177)
Keywords: language model, llm
Abstract: Recent advances in Large Language Model (LLM) compression, such as quantization and pruning, have achieved notable success. However, as these techniques gradually approach their respective limits, relying on a single method for further compression has become increasingly challenging. In this work, we explore an alternative solution by combining quantization and sparsity. This joint approach, though promising, introduces new difficulties due to the inherently conflicting requirements on weight distributions: quantization favors compact ranges, while pruning benefits from high variance. To attack this problem, we propose Optimal Brain Restoration (OBR), a general and training-free framework that aligns pruning and quantization by error compensation between both. OBR minimizes performance degradation on downstream tasks by building on a second-order Hessian objective, which is then reformulated into a tractable problem through surrogate approximation and ultimately reaches a closed-form solution via group error compensation. Experiments show that OBR enables aggressive W4A4KV4 quantization with 50% sparsity on existing LLMs, and delivers up to 4.72x speedup and 6.4x memory reduction compared to the FP16-dense baseline.

Title: RanAT4BIE: Random Adversarial Training for Biomedical Information Extraction

Authors: Jian Chen, Shengyi Lv, Leilei Su
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2509.11191
Pdf URL: https://arxiv.org/pdf/2509.11191
Copy Paste: [[2509.11191]] RanAT4BIE: Random Adversarial Training for Biomedical Information Extraction(https://arxiv.org/abs/2509.11191)
Keywords: language model
Abstract: We introduce random adversarial training (RAT), a novel framework successfully applied to biomedical information extraction (BioIE) tasks. Building on PubMedBERT as the foundational architecture, our study first validates the effectiveness of conventional adversarial training in enhancing pre-trained language models' performance on BioIE tasks. While adversarial training yields significant improvements across various performance metrics, it also introduces considerable computational overhead. To address this limitation, we propose RAT as an efficient solution for biomedical information extraction. This framework strategically integrates random sampling mechanisms with adversarial training principles, achieving dual objectives: enhanced model generalization and robustness while significantly reducing computational costs. Through comprehensive evaluations, RAT demonstrates superior performance compared to baseline models in BioIE tasks. The results highlight RAT's potential as a transformative framework for biomedical natural language processing, offering a balance between model performance and computational efficiency.

Title: The Prompt Engineering Report Distilled: Quick Start Guide for Life Sciences

Authors: Valentin Romanov, Steven A Niederer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.11295
Pdf URL: https://arxiv.org/pdf/2509.11295
Copy Paste: [[2509.11295]] The Prompt Engineering Report Distilled: Quick Start Guide for Life Sciences(https://arxiv.org/abs/2509.11295)
Keywords: language model, llm, hallucination, prompt, agent
Abstract: Developing effective prompts demands significant cognitive investment to generate reliable, high-quality responses from Large Language Models (LLMs). By deploying case-specific prompt engineering techniques that streamline frequently performed life sciences workflows, researchers could achieve substantial efficiency gains that far exceed the initial time investment required to master these techniques. The Prompt Report published in 2025 outlined 58 different text-based prompt engineering techniques, highlighting the numerous ways prompts could be constructed. To provide actionable guidelines and reduce the friction of navigating these various approaches, we distil this report to focus on 6 core techniques: zero-shot, few-shot approaches, thought generation, ensembling, self-criticism, and decomposition. We break down the significance of each approach and ground it in use cases relevant to life sciences, from literature summarization and data extraction to editorial tasks. We provide detailed recommendations for how prompts should and shouldn't be structured, addressing common pitfalls including multi-turn conversation degradation, hallucinations, and distinctions between reasoning and non-reasoning models. We examine context window limitations and agentic tools like Claude Code, while analyzing the effectiveness of Deep Research tools across OpenAI, Google, Anthropic and Perplexity platforms and discussing current limitations. We demonstrate how prompt engineering can augment rather than replace existing established individual practices around data processing and document editing. Our aim is to provide actionable guidance on core prompt engineering principles, and to facilitate the transition from opportunistic prompting to an effective, low-friction systematic practice that contributes to higher quality research.

Title: Ko-PIQA: A Korean Physical Commonsense Reasoning Dataset with Cultural Context

Authors: Dasol Choi, Jungwhan Kim, Guijin Son
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.11303
Pdf URL: https://arxiv.org/pdf/2509.11303
Copy Paste: [[2509.11303]] Ko-PIQA: A Korean Physical Commonsense Reasoning Dataset with Cultural Context(https://arxiv.org/abs/2509.11303)
Keywords: language model, gpt
Abstract: Physical commonsense reasoning datasets like PIQA are predominantly English-centric and lack cultural diversity. We introduce Ko-PIQA, a Korean physical commonsense reasoning dataset that incorporates cultural context. Starting from 3.01 million web-crawled questions, we employed a multi-stage filtering approach using three language models to identify 11,553 PIQA-style questions. Through GPT-4o refinement and human validation, we obtained 441 high-quality question-answer pairs. A key feature of Ko-PIQA is its cultural grounding: 19.7% of questions contain culturally specific elements like traditional Korean foods (kimchi), clothing (hanbok), and specialized appliances (kimchi refrigerators) that require culturally aware reasoning beyond direct translation. We evaluate seven language models on Ko-PIQA, with the best model achieving 83.22% accuracy while the weakest reaches only 59.86%, demonstrating significant room for improvement. Models particularly struggle with culturally specific scenarios, highlighting the importance of culturally diverse datasets. Ko-PIQA serves as both a benchmark for Korean language models and a foundation for more inclusive commonsense reasoning research. The dataset and code will be publicly available.

Title: !MSA at AraHealthQA 2025 Shared Task: Enhancing LLM Performance for Arabic Clinical Question Answering through Prompt Engineering and Ensemble Learning

Authors: Mohamed Tarek, Seif Ahmed, Mohamed Basem
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.11365
Pdf URL: https://arxiv.org/pdf/2509.11365
Copy Paste: [[2509.11365]] !MSA at AraHealthQA 2025 Shared Task: Enhancing LLM Performance for Arabic Clinical Question Answering through Prompt Engineering and Ensemble Learning(https://arxiv.org/abs/2509.11365)
Keywords: llm, prompt
Abstract: We present our systems for Track 2 (General Arabic Health QA, MedArabiQ) of the AraHealthQA-2025 shared task, where our methodology secured 2nd place in both Sub-Task 1 (multiple-choice question answering) and Sub-Task 2 (open-ended question answering) in Arabic clinical contexts. For Sub-Task 1, we leverage the Gemini 2.5 Flash model with few-shot prompting, dataset preprocessing, and an ensemble of three prompt configurations to improve classification accuracy on standard, biased, and fill-in-the-blank questions. For Sub-Task 2, we employ a unified prompt with the same model, incorporating role-playing as an Arabic medical expert, few-shot examples, and post-processing to generate concise responses across fill-in-the-blank, patient-doctor Q&A, GEC, and paraphrased variants.
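The abstract mentions an ensemble of three prompt configurations for the multiple-choice sub-task but does not say how their outputs are combined. The sketch below assumes simple majority voting with ties broken in favor of the earliest configuration; both the voting rule and the name `majority_vote` are assumptions for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent answer among prompt configurations.

    Ties go to the answer that appears first in the list (an assumed tie-break).
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    best = max(counts.values())
    for answer in predictions:
        if counts[answer] == best:
            return answer


# Hypothetical outputs of three prompt configurations for one question.
votes = ["B", "C", "B"]
final_answer = majority_vote(votes)
print(final_answer)
```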

Title: Transformer Enhanced Relation Classification: A Comparative Analysis of Contextuality, Data Efficiency and Sequence Complexity

Authors: Bowen Jing, Yang Cui, Tianpeng Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.11374
Pdf URL: https://arxiv.org/pdf/2509.11374
Copy Paste: [[2509.11374]] Transformer Enhanced Relation Classification: A Comparative Analysis of Contextuality, Data Efficiency and Sequence Complexity(https://arxiv.org/abs/2509.11374)
Keywords: language model, llm
Abstract: In the era of large language models, relation extraction (RE) plays an important role in information extraction through the transformation of unstructured raw text into structured data (Wadhwa et al., 2023). In this paper, we systematically compare the performance of deep supervised learning approaches without transformers and those with transformers. We used a series of non-transformer architectures such as PA-LSTM (Zhang et al., 2017), C-GCN (Zhang et al., 2018), and AGGCN (attention-guided GCN) (Guo et al., 2019), and a series of transformer architectures such as BERT, RoBERTa, and R-BERT (Wu and He, 2019). Our comparison included traditional metrics like micro F1, as well as evaluations in different scenarios, varying sentence lengths, and different percentages of the dataset for training. Our experiments were conducted on TACRED, TACREV, and RE-TACRED. The results show that transformer-based models outperform non-transformer models, achieving micro F1 scores of 80-90% compared to 64-67% for non-transformer models. Additionally, we briefly review the research journey in supervised relation classification and discuss the role and current status of large language models (LLMs) in relation extraction.
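For readers unfamiliar with the headline metric, the following sketch computes micro F1 for relation classification. It assumes the common TACRED-style convention of excluding the negative class (`no_relation`) from precision and recall; the abstract does not state which convention the authors used, and the labels below are made up for illustration.

```python
def micro_f1(gold, pred, negative_label="no_relation"):
    """Micro-averaged F1 over all non-negative relation labels."""
    # A prediction counts as correct only if it matches gold and is a real relation.
    correct = sum(1 for g, p in zip(gold, pred) if g == p and p != negative_label)
    predicted = sum(1 for p in pred if p != negative_label)
    actual = sum(1 for g in gold if g != negative_label)
    precision = correct / predicted if predicted else 0.0
    recall = correct / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold and predicted labels for four sentences.
gold = ["per:title", "no_relation", "org:founded", "per:title"]
pred = ["per:title", "org:founded", "org:founded", "no_relation"]
print(round(micro_f1(gold, pred), 4))
```

Here two of three predicted relations are correct and two of three gold relations are recovered, so precision and recall are both 2/3.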

Title: Continually Adding New Languages to Multilingual Language Models

Authors: Abraham Toluwase Owodunni, Sachin Kumar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.11414
Pdf URL: https://arxiv.org/pdf/2509.11414
Copy Paste: [[2509.11414]] Continually Adding New Languages to Multilingual Language Models(https://arxiv.org/abs/2509.11414)
Keywords: language model
Abstract: Multilingual language models are trained on a fixed set of languages, and to support new languages, the models need to be retrained from scratch. This is an expensive endeavor and is often infeasible, as model developers tend not to release their pre-training data. Naive approaches, such as continued pretraining, suffer from catastrophic forgetting; however, mitigation strategies like experience replay cannot be applied due to the lack of original pretraining data. In this work, we investigate the problem of continually adding new languages to a multilingual model, assuming access to pretraining data in only the target languages. We explore multiple approaches to address this problem and propose Layer-Selective LoRA (LayRA), which adds Low-Rank Adapters (LoRA) to selected initial and final layers while keeping the rest of the model frozen. LayRA builds on two insights: (1) LoRA reduces forgetting, and (2) multilingual models encode inputs in the source language in the initial layers, reason in English in intermediate layers, and translate back to the source language in final layers. We experiment with adding multiple combinations of Galician, Swahili, and Urdu to pretrained language models and evaluate each method on diverse multilingual tasks. We find that LayRA provides the best overall tradeoff: it preserves models' capabilities in previously supported languages while remaining competitive with existing approaches such as LoRA in learning new languages. We also demonstrate that using model arithmetic, the adapted models can be equipped with strong instruction-following abilities without access to any instruction tuning data in the target languages.
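The layer-selection idea can be sketched as a small helper that picks the first and last few transformer blocks for adapters and leaves the middle frozen. The function name and the counts used below (32 layers, 4 initial, 4 final) are hypothetical; the abstract does not say how many layers LayRA adapts.

```python
def select_adapter_layers(num_layers, num_initial, num_final):
    """Return indices of the initial and final layers that would receive adapters."""
    if num_initial < 0 or num_final < 0:
        raise ValueError("layer counts must be non-negative")
    if num_initial + num_final > num_layers:
        raise ValueError("selected layers exceed model depth")
    initial = list(range(num_initial))
    final = list(range(num_layers - num_final, num_layers))
    return initial + final


# Hypothetical configuration: a 32-layer model with 4 adapted layers at each end.
layers = select_adapter_layers(num_layers=32, num_initial=4, num_final=4)
frozen = [i for i in range(32) if i not in layers]
print(layers)
print(len(frozen))
```

In a library such as Hugging Face PEFT, an index list like this could plausibly be passed to `LoraConfig(layers_to_transform=...)`; whether the authors implemented LayRA that way is not stated in the abstract.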

Title: CognitiveSky: Scalable Sentiment and Narrative Analysis for Decentralized Social Media

Authors: Gaurab Chhetri, Anandi Dutta, Subasish Das
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2509.11444
Pdf URL: https://arxiv.org/pdf/2509.11444
Copy Paste: [[2509.11444]] CognitiveSky: Scalable Sentiment and Narrative Analysis for Decentralized Social Media(https://arxiv.org/abs/2509.11444)
Keywords: language model
Abstract: The emergence of decentralized social media platforms presents new opportunities and challenges for real-time analysis of public discourse. This study introduces CognitiveSky, an open-source and scalable framework designed for sentiment, emotion, and narrative analysis on Bluesky, a federated alternative to Twitter. By ingesting data through Bluesky's Application Programming Interface (API), CognitiveSky applies transformer-based models to annotate large-scale user-generated content and produces structured and analyzable outputs. These summaries drive a dynamic dashboard that visualizes evolving patterns in emotion, activity, and conversation topics. Built entirely on free-tier infrastructure, CognitiveSky achieves both low operational cost and high accessibility. While demonstrated here for monitoring mental health discourse, its modular design enables applications across domains such as disinformation detection, crisis response, and civic sentiment analysis. By bridging large language models with decentralized networks, CognitiveSky offers a transparent, extensible tool for computational social science in an era of shifting digital ecosystems.

Title: CEMTM: Contextual Embedding-based Multimodal Topic Modeling

Authors: Amirhossein Abaskohi, Raymond Li, Chuyuan Li, Shafiq Joty, Giuseppe Carenini
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.11465
Pdf URL: https://arxiv.org/pdf/2509.11465
Copy Paste: [[2509.11465]] CEMTM: Contextual Embedding-based Multimodal Topic Modeling(https://arxiv.org/abs/2509.11465)
Keywords: language model, llm
Abstract: We introduce CEMTM, a context-enhanced multimodal topic model designed to infer coherent and interpretable topic structures from both short and long documents containing text and images. CEMTM builds on fine-tuned large vision language models (LVLMs) to obtain contextualized embeddings, and employs a distributional attention mechanism to weight token-level contributions to topic inference. A reconstruction objective aligns topic-based representations with the document embedding, encouraging semantic consistency across modalities. Unlike existing approaches, CEMTM can process multiple images per document without repeated encoding and maintains interpretability through explicit word-topic and document-topic distributions. Extensive experiments on six multimodal benchmarks show that CEMTM consistently outperforms unimodal and multimodal baselines, achieving a remarkable average LLM score of 2.61. Further analysis shows its effectiveness in downstream few-shot retrieval and its ability to capture visually grounded semantics in complex domains such as scientific articles.

Title: Improving LLMs' Learning for Coreference Resolution

Authors: Yujian Gan, Yuan Liang, Yanni Lin, Juntao Yu, Massimo Poesio
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.11466
Pdf URL: https://arxiv.org/pdf/2509.11466
Copy Paste: [[2509.11466]] Improving LLMs' Learning for Coreference Resolution(https://arxiv.org/abs/2509.11466)
Keywords: llm, hallucination
Abstract: Coreference Resolution (CR) is crucial for many NLP tasks, but existing LLMs struggle with hallucination and under-performance. In this paper, we investigate the limitations of existing LLM-based approaches to CR, specifically the Question-Answering (QA) Template and Document Template methods, and propose two novel techniques: Reversed Training with Joint Inference and Iterative Document Generation. Our experiments show that Reversed Training improves the QA Template method, while Iterative Document Generation eliminates hallucinations in the generated source text and boosts coreference resolution. Integrating these methods and techniques offers an effective and robust solution to LLM-based coreference resolution.

Title: ClaimIQ at CheckThat! 2025: Comparing Prompted and Fine-Tuned Language Models for Verifying Numerical Claims

Authors: Anirban Saha Anik, Md Fahimul Kabir Chowdhury, Andrew Wyckoff, Sagnik Ray Choudhury
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.11492
Pdf URL: https://arxiv.org/pdf/2509.11492
Copy Paste: [[2509.11492]] ClaimIQ at CheckThat! 2025: Comparing Prompted and Fine-Tuned Language Models for Verifying Numerical Claims(https://arxiv.org/abs/2509.11492)
Keywords: language model, llm, prompt
Abstract: This paper presents our system for Task 3 of the CLEF 2025 CheckThat! Lab, which focuses on verifying numerical and temporal claims using retrieved evidence. We explore two complementary approaches: zero-shot prompting with instruction-tuned large language models (LLMs) and supervised fine-tuning using parameter-efficient LoRA. To enhance evidence quality, we investigate several selection strategies, including full-document input and top-k sentence filtering using BM25 and MiniLM. Our best-performing model, LLaMA fine-tuned with LoRA, achieves strong performance on the English validation set. However, a notable drop on the test set highlights a generalization challenge. These findings underscore the importance of evidence granularity and model adaptation for robust numerical fact verification.
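Top-k sentence filtering with BM25 can be illustrated with a small standard-library implementation. This is a sketch of the general technique only: the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, and the example sentences are assumptions, and the authors' actual pipeline may differ.

```python
import math
from collections import Counter


def bm25_top_k(claim, sentences, k=3, k1=1.5, b=0.75):
    """Rank evidence sentences against a claim with BM25 and keep the top k."""
    docs = [s.lower().split() for s in sentences]  # naive whitespace tokenizer
    query = claim.lower().split()
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n if n else 0.0
    # Document frequency: in how many sentences each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for idx, doc in enumerate(docs):
        tf = Counter(doc)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append((score, idx))
    # Highest score first; ties keep the original sentence order.
    top = sorted(scores, key=lambda item: (-item[0], item[1]))[:k]
    return [sentences[idx] for _, idx in top]


# Hypothetical claim and candidate evidence sentences.
sentences = [
    "The city council met on Tuesday.",
    "Unemployment fell to 4.2 percent in March 2021.",
    "Local weather was mild.",
]
claim = "Unemployment dropped to 4.2 percent in 2021"
print(bm25_top_k(claim, sentences, k=1))
```

Only the second sentence shares any terms with the claim, so it is the one retained when `k=1`.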

Title: AKCIT-FN at CheckThat! 2025: Switching Fine-Tuned SLMs and LLM Prompting for Multilingual Claim Normalization

Authors: Fabrycio Leite Nakano Almada, Kauan Divino Pouso Mariano, Maykon Adriell Dutra, Victor Emanuel da Silva Monteiro, Juliana Resplande Sant'Anna Gomes, Arlindo Rodrigues Galvão Filho, Anderson da Silva Soares
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.11496
Pdf URL: https://arxiv.org/pdf/2509.11496
Copy Paste: [[2509.11496]] AKCIT-FN at CheckThat! 2025: Switching Fine-Tuned SLMs and LLM Prompting for Multilingual Claim Normalization(https://arxiv.org/abs/2509.11496)
Keywords: language model, llm, prompt
Abstract: Claim normalization, the transformation of informal social media posts into concise, self-contained statements, is a crucial step in automated fact-checking pipelines. This paper details our submission to the CLEF-2025 CheckThat! Task 2, which challenges systems to perform claim normalization across twenty languages, divided into thirteen supervised (high-resource) and seven zero-shot (no training data) tracks. Our approach, leveraging fine-tuned Small Language Models (SLMs) for supervised languages and Large Language Model (LLM) prompting for zero-shot scenarios, achieved podium positions (top three) in fifteen of the twenty languages. Notably, this included second-place rankings in eight languages, five of which were among the seven designated zero-shot languages, underscoring the effectiveness of our LLM-based zero-shot strategy. For Portuguese, our initial development language, our system achieved an average METEOR score of 0.5290, ranking third. All implementation artifacts, including inference, training, evaluation scripts, and prompt configurations, are publicly available at this https URL.

Title: LVLMs are Bad at Overhearing Human Referential Communication

Authors: Zhengxiang Wang, Weiling Li, Panagiotis Kaliosis, Owen Rambow, Susan E. Brennan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.11514
Pdf URL: https://arxiv.org/pdf/2509.11514
Copy Paste: [[2509.11514]] LVLMs are Bad at Overhearing Human Referential Communication(https://arxiv.org/abs/2509.11514)
Keywords: language model, agent
Abstract: During spontaneous conversations, speakers collaborate on novel referring expressions, which they can then re-use in subsequent conversations. Understanding such referring expressions is an important ability for an embodied agent, so that it can carry out tasks in the real world. This requires integrating and understanding language, vision, and conversational interaction. We study the capabilities of seven state-of-the-art Large Vision Language Models (LVLMs) as overhearers to a corpus of spontaneous conversations between pairs of human discourse participants engaged in a collaborative object-matching task. We find that such a task remains challenging for current LVLMs and they all fail to show a consistent performance improvement as they overhear more conversations from the same discourse participants repeating the same task for multiple rounds. We release our corpus and code for reproducibility and to facilitate future research.

Title: PeruMedQA: Benchmarking Large Language Models (LLMs) on Peruvian Medical Exams - Dataset Construction and Evaluation

Authors: Rodrigo M. Carrillo-Larco, Jesus Lovón Melgarejo, Manuel Castillo-Cara, Gusseppe Bravo-Rocca
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.11517
Pdf URL: https://arxiv.org/pdf/2509.11517
Copy Paste: [[2509.11517]] PeruMedQA: Benchmarking Large Language Models (LLMs) on Peruvian Medical Exams - Dataset Construction and Evaluation(https://arxiv.org/abs/2509.11517)
Keywords: language model, llm, prompt
Abstract: BACKGROUND: Medical large language models (LLMs) have demonstrated remarkable performance in answering medical examinations. However, the extent to which this high performance is transferable to medical questions in Spanish and from a Latin American country remains unexplored. This knowledge is crucial as LLM-based medical applications gain traction in Latin America. AIMS: to build a dataset of questions from medical examinations taken by Peruvian physicians pursuing specialty training; to fine-tune an LLM on this dataset; to evaluate and compare the performance in terms of accuracy between vanilla LLMs and the fine-tuned LLM. METHODS: We curated PeruMedQA, a multiple-choice question-answering (MCQA) dataset containing 8,380 questions spanning 12 medical domains (2018-2025). We selected eight medical LLMs including medgemma-4b-it and medgemma-27b-text-it, and developed zero-shot task-specific prompts to answer the questions appropriately. We employed parameter-efficient fine-tuning (PEFT) and low-rank adaptation (LoRA) to fine-tune medgemma-4b-it utilizing all questions except those from 2025 (test set). RESULTS: medgemma-27b-text-it outperformed all other models, achieving a proportion of correct answers exceeding 90% in several instances. LLMs with <10 billion parameters exhibited <60% of correct answers, while some exams yielded results <50%. The fine-tuned version of medgemma-4b-it emerged victorious against all LLMs with <10 billion parameters and rivaled an LLM with 70 billion parameters across various examinations. CONCLUSIONS: For medical AI application and research that require knowledge bases from Spanish-speaking countries and those exhibiting similar epidemiological profiles to Peru's, interested parties should utilize medgemma-27b-text-it or a fine-tuned version of medgemma-4b-it.

Title: HARP: Hallucination Detection via Reasoning Subspace Projection

Authors: Junjie Hu, Gang Tu, ShengYu Cheng, Jinxin Li, Jinting Wang, Rui Chen, Zhilong Zhou, Dongbo Shan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.11536
Pdf URL: https://arxiv.org/pdf/2509.11536
Copy Paste: [[2509.11536]] HARP: Hallucination Detection via Reasoning Subspace Projection(https://arxiv.org/abs/2509.11536)
Keywords: language model, llm, hallucination
Abstract: Hallucinations in Large Language Models (LLMs) pose a major barrier to their reliable use in critical decision-making. Although existing hallucination detection methods have improved accuracy, they still struggle with disentangling semantic and reasoning information and maintaining robustness. To address these challenges, we propose HARP (Hallucination detection via reasoning subspace projection), a novel hallucination detection framework. HARP establishes that the hidden state space of LLMs can be decomposed into a direct sum of a semantic subspace and a reasoning subspace, where the former encodes linguistic expression and the latter captures internal reasoning processes. Moreover, we demonstrate that the Unembedding layer can disentangle these subspaces, and by applying Singular Value Decomposition (SVD) to its parameters, the basis vectors spanning the semantic and reasoning subspaces are obtained. Finally, HARP projects hidden states onto the basis vectors of the reasoning subspace, and the resulting projections are then used as input features for hallucination detection in LLMs. By using these projections, HARP reduces the dimension of the feature to approximately 5% of the original, filters out most noise, and achieves enhanced robustness. Experiments across multiple datasets show that HARP achieves state-of-the-art hallucination detection performance; in particular, it achieves an AUROC of 92.8% on TriviaQA, outperforming the previous best method by 7.5%.
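The SVD-and-project step described in this abstract can be sketched with NumPy. Several things here are assumptions rather than details from the paper: that the reasoning subspace corresponds to the right singular vectors with the smallest singular values, that roughly 5% of the hidden dimension is kept, and that random matrices stand in for a real unembedding layer and real hidden states. The function names are hypothetical.

```python
import numpy as np


def reasoning_basis(unembedding, keep_fraction=0.05):
    """Return an orthonormal basis (k, hidden) from the SVD of the unembedding matrix.

    Assumption: the "reasoning" directions are the right singular vectors with
    the smallest singular values; the abstract does not specify this.
    """
    # unembedding has shape (vocab_size, hidden_size).
    _, _, vt = np.linalg.svd(unembedding, full_matrices=False)
    k = max(1, int(round(keep_fraction * unembedding.shape[1])))
    # Singular values are sorted in descending order, so the last rows of vt
    # are the directions with the smallest singular values.
    return vt[-k:]


def project(hidden_states, basis):
    """Project hidden states (n, hidden) onto the basis, giving (n, k) features."""
    return hidden_states @ basis.T


rng = np.random.default_rng(0)
W = rng.normal(size=(500, 40))  # stand-in unembedding matrix
H = rng.normal(size=(8, 40))    # stand-in hidden states for 8 tokens
basis = reasoning_basis(W)
features = project(H, basis)
print(features.shape)
```

The resulting low-dimensional features would then feed a separate hallucination classifier, which this sketch does not attempt to reproduce.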

Title: HiChunk: Evaluating and Enhancing Retrieval-Augmented Generation with Hierarchical Chunking

Authors: Wensheng Lu, Keyu Chen, Ruizhi Qiao, Xing Sun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.11552
Pdf URL: https://arxiv.org/pdf/2509.11552
Copy Paste: [[2509.11552]] HiChunk: Evaluating and Enhancing Retrieval-Augmented Generation with Hierarchical Chunking(https://arxiv.org/abs/2509.11552)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) enhances the response capabilities of language models by integrating external knowledge sources. However, document chunking, an important part of RAG systems, often lacks effective evaluation tools. This paper first analyzes why existing RAG evaluation benchmarks are inadequate for assessing document chunking quality, specifically due to evidence sparsity. Based on this conclusion, we propose HiCBench, which includes manually annotated multi-level document chunking points, synthesized evidence-dense question-answer (QA) pairs, and their corresponding evidence sources. Additionally, we introduce the HiChunk framework, a multi-level document structuring framework based on fine-tuned LLMs, combined with the Auto-Merge retrieval algorithm to improve retrieval quality. Experiments demonstrate that HiCBench effectively evaluates the impact of different chunking methods across the entire RAG pipeline. Moreover, HiChunk achieves better chunking quality within reasonable time consumption, thereby enhancing the overall performance of RAG systems.
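The abstract names an Auto-Merge retrieval algorithm without describing it. The sketch below shows one plausible reading of hierarchical merging: when more than a threshold share of a parent's child chunks is retrieved, the children are replaced by the parent. The threshold rule, the function name `auto_merge`, and the toy hierarchy are all assumptions, not the paper's algorithm.

```python
def auto_merge(retrieved, parent_of, children_of, threshold=0.5):
    """Replace retrieved child chunks by their parent when enough siblings were hit."""
    merged = set(retrieved)
    changed = True
    while changed:  # repeat so merges can propagate up the hierarchy
        changed = False
        parents = {parent_of[c] for c in merged if c in parent_of}
        for parent in sorted(parents):
            children = children_of[parent]
            hits = [c for c in children if c in merged]
            if len(hits) / len(children) > threshold:
                merged.difference_update(hits)
                merged.add(parent)
                changed = True
    return sorted(merged)


# Hypothetical two-level document hierarchy.
children_of = {
    "doc": ["sec1", "sec2"],
    "sec1": ["c1", "c2", "c3"],
    "sec2": ["c4", "c5"],
}
parent_of = {child: parent for parent, kids in children_of.items() for child in kids}

result = auto_merge(["c1", "c2", "c4"], parent_of, children_of)
print(result)
```

In this example two of the three chunks under `sec1` were retrieved, so they collapse into `sec1`, while `c4` stays as a standalone chunk because only half of `sec2` was hit.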

Title: D$^2$HScore: Reasoning-Aware Hallucination Detection via Semantic Breadth and Depth Analysis in LLMs

Authors: Yue Ding, Xiaofang Zhu, Tianze Xia, Junfei Wu, Xinlong Chen, Qiang Liu, Liang Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.11569
Pdf URL: https://arxiv.org/pdf/2509.11569
Copy Paste: [[2509.11569]] D$^2$HScore: Reasoning-Aware Hallucination Detection via Semantic Breadth and Depth Analysis in LLMs(https://arxiv.org/abs/2509.11569)
Keywords: language model, llm, hallucination
Abstract: Although Large Language Models (LLMs) have achieved remarkable success, their practical application is often hindered by the generation of non-factual content, which is called "hallucination". Ensuring the reliability of LLMs' outputs is a critical challenge, particularly in high-stakes domains such as finance, security, and healthcare. In this work, we revisit hallucination detection from the perspective of model architecture and generation dynamics. Leveraging the multi-layer structure and autoregressive decoding process of LLMs, we decompose hallucination signals into two complementary dimensions: the semantic breadth of token representations within each layer, and the semantic depth of core concepts as they evolve across layers. Based on this insight, we propose D$^2$HScore (Dispersion and Drift-based Hallucination Score), a training-free and label-free framework that jointly measures: (1) Intra-Layer Dispersion, which quantifies the semantic diversity of token representations within each layer; and (2) Inter-Layer Drift, which tracks the progressive transformation of key token representations across layers. To ensure drift reflects the evolution of meaningful semantics rather than noisy or redundant tokens, we guide token selection using attention signals. By capturing both the horizontal and vertical dynamics of representation during inference, D$^2$HScore provides an interpretable and lightweight proxy for hallucination detection. Extensive experiments across five open-source LLMs and five widely used benchmarks demonstrate that D$^2$HScore consistently outperforms existing training-free baselines.
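The two signals described above can be illustrated with simple stand-in formulas. The abstract gives no exact definitions, so this sketch assumes dispersion is the mean Euclidean distance of token vectors from their layer centroid and drift is the mean cosine distance of one key token's vector between consecutive layers. The function names and toy vectors are hypothetical, and how the paper combines the two signals into a final score is not shown.

```python
import math


def intra_layer_dispersion(token_vectors):
    """Mean Euclidean distance from each token vector to the layer centroid."""
    count = len(token_vectors)
    dim = len(token_vectors[0])
    centroid = [sum(v[i] for v in token_vectors) / count for i in range(dim)]
    return sum(math.dist(v, centroid) for v in token_vectors) / count


def cosine_distance(a, b):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def inter_layer_drift(layer_vectors):
    """Mean cosine distance of a key token's vector between consecutive layers."""
    steps = [cosine_distance(a, b) for a, b in zip(layer_vectors, layer_vectors[1:])]
    return sum(steps) / len(steps)


# Toy data: two token vectors in one layer, and one key token across three layers.
dispersion = intra_layer_dispersion([[0.0, 0.0], [2.0, 0.0]])
drift = inter_layer_drift([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
print(dispersion, drift)
```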

Title: Bhaasha, Bhasa, Zaban: A Survey for Low-Resourced Languages in South Asia - Current Stage and Challenges

Authors: Sampoorna Poria, Xiaolei Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.11570
Pdf URL: https://arxiv.org/pdf/2509.11570
Copy Paste: [[2509.11570]] Bhaasha, Bhasa, Zaban: A Survey for Low-Resourced Languages in South Asia - Current Stage and Challenges(https://arxiv.org/abs/2509.11570)
Keywords: language model, gpt
Abstract: Rapid developments of large language models have revolutionized many NLP tasks for English data. Unfortunately, the models and their evaluations for low-resource languages are being overlooked, especially for languages in South Asia. Although there are more than 650 languages in South Asia, many of them either have very limited computational resources or are missing from existing language models. Thus, a concrete question to be answered is: Can we assess the current stage and challenges to inform our NLP community and facilitate model developments for South Asian languages? In this survey, we have comprehensively examined current efforts and challenges of NLP models for South Asian languages by retrieving studies since 2020, with a focus on transformer-based models, such as BERT, T5, & GPT. We present advances and gaps across 3 essential aspects: data, models, & tasks, such as available data sources, fine-tuning strategies, & domain applications. Our findings highlight substantial issues, including missing data in critical domains (e.g., health), code-mixing, and lack of standardized evaluation benchmarks. Our survey aims to raise awareness within the NLP community for more targeted data curation, unify benchmarks tailored to cultural and linguistic nuances of South Asia, and encourage an equitable representation of South Asian languages. The complete list of resources is available at: this https URL.

Title: Analyzing Information-Seeking Behaviors in a Hakka AI Chatbot: A Cognitive-Pragmatic Study

Authors: Chu-Hsuan Lee, Chen-Chi Chang, Hung-Shin Lee, Yun-Hsiang Hsu, Ching-Yuan Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.11591
Pdf URL: https://arxiv.org/pdf/2509.11591
Copy Paste: [[2509.11591]] Analyzing Information-Seeking Behaviors in a Hakka AI Chatbot: A Cognitive-Pragmatic Study(https://arxiv.org/abs/2509.11591)
Keywords: chat
Abstract: With many endangered languages at risk of disappearing, efforts to preserve them now rely more than ever on using technology alongside culturally informed teaching strategies. This study examines user behaviors in TALKA, a generative AI-powered chatbot designed for Hakka language engagement, by employing a dual-layered analytical framework grounded in Bloom's Taxonomy of cognitive processes and dialogue act categorization. We analyzed 7,077 user utterances, each carefully annotated according to six cognitive levels and eleven dialogue act types. These included a variety of functions, such as asking for information, requesting translations, making cultural inquiries, and using language creatively. Pragmatic classifications further highlight how different types of dialogue acts--such as feedback, control commands, and social greetings--align with specific cognitive intentions. The results suggest that generative AI chatbots can support language learning in meaningful ways--especially when they are designed with an understanding of how users think and communicate. They may also help learners express themselves more confidently and connect with their cultural identity. The TALKA case provides empirical insights into how AI-mediated dialogue facilitates cognitive development in low-resource language learners, as well as pragmatic negotiation and socio-cultural affiliation. By focusing on AI-assisted language learning, this study offers new insights into how technology can support language preservation and educational practice.

Title: HalluDetect: Detecting, Mitigating, and Benchmarking Hallucinations in Conversational Systems

Authors: Spandan Anaokar, Shrey Ganatra, Harshvivek Kashid, Swapnil Bhattacharyya, Shruti Nair, Reshma Sekhar, Siddharth Manohar, Rahul Hemrajani, Pushpak Bhattacharyya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.11619
Pdf URL: https://arxiv.org/pdf/2509.11619
Copy Paste: [[2509.11619]] HalluDetect: Detecting, Mitigating, and Benchmarking Hallucinations in Conversational Systems(https://arxiv.org/abs/2509.11619)
Keywords: language model, llm, hallucination, chat, agent
Abstract: Large Language Models (LLMs) are widely used in industry but remain prone to hallucinations, limiting their reliability in critical applications. This work addresses hallucination reduction in consumer grievance chatbots built using LLaMA 3.1 8B Instruct, a compact model frequently used in industry. We develop HalluDetect, an LLM-based hallucination detection system that achieves an F1 score of 69%, outperforming baseline detectors by 25.44%. Benchmarking five chatbot architectures, we find that AgentBot minimizes hallucinations to 0.4159 per turn while maintaining the highest token accuracy (96.13%), making it the most effective mitigation strategy. Our findings provide a scalable framework for hallucination mitigation, demonstrating that optimized inference strategies can significantly improve factual accuracy. While applied to consumer law, our approach generalizes to other high-risk domains, enhancing trust in LLM-driven assistants. We will release the code and dataset.

Title: AesBiasBench: Evaluating Bias and Alignment in Multimodal Language Models for Personalized Image Aesthetic Assessment

Authors: Kun Li, Lai-Man Po, Hongzheng Yang, Xuyuan Xu, Kangcheng Liu, Yuzhi Zhao
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2509.11620
Pdf URL: https://arxiv.org/pdf/2509.11620
Copy Paste: [[2509.11620]] AesBiasBench: Evaluating Bias and Alignment in Multimodal Language Models for Personalized Image Aesthetic Assessment(https://arxiv.org/abs/2509.11620)
Keywords: language model, gpt, llm
Abstract: Multimodal Large Language Models (MLLMs) are increasingly applied in Personalized Image Aesthetic Assessment (PIAA) as a scalable alternative to expert evaluations. However, their predictions may reflect subtle biases influenced by demographic factors such as gender, age, and education. In this work, we propose AesBiasBench, a benchmark designed to evaluate MLLMs along two complementary dimensions: (1) stereotype bias, quantified by measuring variations in aesthetic evaluations across demographic groups; and (2) alignment between model outputs and genuine human aesthetic preferences. Our benchmark covers three subtasks (Aesthetic Perception, Assessment, Empathy) and introduces structured metrics (IFD, NRD, AAS) to assess both bias and alignment. We evaluate 19 MLLMs, including proprietary models (e.g., GPT-4o, Claude-3.5-Sonnet) and open-source models (e.g., InternVL-2.5, Qwen2.5-VL). Results indicate that smaller models exhibit stronger stereotype biases, whereas larger models align more closely with human preferences. Incorporating identity information often exacerbates bias, particularly in emotional judgments. These findings underscore the importance of identity-aware evaluation frameworks in subjective vision-language tasks.

Title: EthicsMH: A Pilot Benchmark for Ethical Reasoning in Mental Health AI

Authors: Sai Kartheek Reddy Kasu
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2509.11648
Pdf URL: https://arxiv.org/pdf/2509.11648
Copy Paste: [[2509.11648]] EthicsMH: A Pilot Benchmark for Ethical Reasoning in Mental Health AI(https://arxiv.org/abs/2509.11648)
Keywords: language model, llm
Abstract: The deployment of large language models (LLMs) in mental health and other sensitive domains raises urgent questions about ethical reasoning, fairness, and responsible alignment. Yet, existing benchmarks for moral and clinical decision-making do not adequately capture the unique ethical dilemmas encountered in mental health practice, where confidentiality, autonomy, beneficence, and bias frequently intersect. To address this gap, we introduce Ethical Reasoning in Mental Health (EthicsMH), a pilot dataset of 125 scenarios designed to evaluate how AI systems navigate ethically charged situations in therapeutic and psychiatric contexts. Each scenario is enriched with structured fields, including multiple decision options, expert-aligned reasoning, expected model behavior, real-world impact, and multi-stakeholder viewpoints. This structure enables evaluation not only of decision accuracy but also of explanation quality and alignment with professional norms. Although modest in scale and developed with model-assisted generation, EthicsMH establishes a task framework that bridges AI ethics and mental health decision-making. By releasing this dataset, we aim to provide a seed resource that can be expanded through community and expert contributions, fostering the development of AI systems capable of responsibly handling some of society's most delicate decisions.

Title: A Dynamic Knowledge Update-Driven Model with Large Language Models for Fake News Detection

Authors: Di Jin, Jun Yang, Xiaobao Wang, Junwei Zhang, Shuqi Li, Dongxiao He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.11687
Pdf URL: https://arxiv.org/pdf/2509.11687
Copy Paste: [[2509.11687]] A Dynamic Knowledge Update-Driven Model with Large Language Models for Fake News Detection(https://arxiv.org/abs/2509.11687)
Keywords: language model, retrieval-augmented generation
Abstract: As the Internet and social media evolve rapidly, distinguishing credible news from a vast amount of complex information poses a significant challenge. Due to the suddenness and instability of news events, the authenticity labels of news can potentially shift as events develop, making it crucial for fake news detection to obtain the latest event updates. Existing methods employ retrieval-augmented generation to fill knowledge gaps, but they suffer from issues such as insufficient credibility of retrieved content and interference from noisy information. We propose a dynamic knowledge update-driven model for fake news detection (DYNAMO), which leverages knowledge graphs to achieve continuous updating of new knowledge and integrates with large language models to fulfill dual functions: news authenticity detection and verification of new knowledge correctness, solving the two key problems of ensuring the authenticity of new knowledge and deeply mining news semantics. Specifically, we first construct a news-domain-specific knowledge graph. Then, we use Monte Carlo Tree Search to decompose complex news and verify them step by step. Finally, we extract and update new knowledge from verified real news texts and reasoning paths. Experimental results demonstrate that DYNAMO achieves the best performance on two real-world datasets.

Title: CoachMe: Decoding Sport Elements with a Reference-Based Coaching Instruction Generation Model

Authors: Wei-Hsin Yeh, Yu-An Su, Chih-Ning Chen, Yi-Hsueh Lin, Calvin Ku, Wen-Hsin Chiu, Min-Chun Hu, Lun-Wei Ku
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2509.11698
Pdf URL: https://arxiv.org/pdf/2509.11698
Copy Paste: [[2509.11698]] CoachMe: Decoding Sport Elements with a Reference-Based Coaching Instruction Generation Model(https://arxiv.org/abs/2509.11698)
Keywords: gpt
Abstract: Motion instruction is a crucial task that helps athletes refine their technique by analyzing movements and providing corrective guidance. Although recent advances in multimodal models have improved motion understanding, generating precise and sport-specific instruction remains challenging due to the highly domain-specific nature of sports and the need for informative guidance. We propose CoachMe, a reference-based model that analyzes the differences between a learner's motion and a reference under temporal and physical aspects. This approach enables both domain-knowledge learning and the acquisition of a coach-like thinking process that identifies movement errors effectively and provides feedback to explain how to improve. In this paper, we illustrate how CoachMe adapts well to specific sports such as skating and boxing by learning from general movements and then leveraging limited data. Experiments show that CoachMe provides high-quality instructions, rather than directions that merely adopt the tone of a coach but lack critical information. CoachMe outperforms GPT-4o by 31.6% in G-Eval on figure skating and by 58.3% on boxing. Analysis further confirms that it elaborates on errors and their corresponding improvement methods in the generated instructions. You can find CoachMe here: this https URL

Title: An Agentic Toolkit for Adaptive Information Extraction from Regulatory Documents

Authors: Gaye Colakoglu, Gürkan Solmaz, Jonathan Fürst
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.11773
Pdf URL: https://arxiv.org/pdf/2509.11773
Copy Paste: [[2509.11773]] An Agentic Toolkit for Adaptive Information Extraction from Regulatory Documents(https://arxiv.org/abs/2509.11773)
Keywords: llm, agent
Abstract: Declaration of Performance (DoP) documents, mandated by EU regulation, certify the performance of construction products. While some of their content is standardized, DoPs vary widely in layout, language, schema, and format, posing challenges for automated key-value pair extraction (KVP) and question answering (QA). Existing static or LLM-only IE pipelines often hallucinate and fail to adapt to this structural diversity. Our domain-specific, stateful agentic system addresses these challenges through a planner-executor-responder architecture. The system infers user intent, detects document modality, and orchestrates tools dynamically for robust, traceable reasoning while avoiding tool misuse or execution loops. Evaluation on a curated DoP dataset demonstrates improved robustness across formats and languages, offering a scalable solution for structured data extraction in regulated workflows.

Title: User eXperience Perception Insights Dataset (UXPID): Synthetic User Feedback from Public Industrial Forums

Authors: Mikhail Kulyabin, Jan Joosten, Choro Ulan uulu, Nuno Miguel Martins Pacheco, Fabian Ries, Filippos Petridis, Jan Bosch, Helena Holmström Olsson
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.11777
Pdf URL: https://arxiv.org/pdf/2509.11777
Copy Paste: [[2509.11777]] User eXperience Perception Insights Dataset (UXPID): Synthetic User Feedback from Public Industrial Forums(https://arxiv.org/abs/2509.11777)
Keywords: language model, llm
Abstract: Customer feedback in industrial forums reflects a rich but underexplored source of insight into real-world product experience. These publicly shared discussions offer an organic view of user expectations, frustrations, and success stories shaped by the specific contexts of use. Yet, harnessing this information for systematic analysis remains challenging due to the unstructured and domain-specific nature of the content. The lack of structure and specialized vocabulary makes it difficult for traditional data analysis techniques to accurately interpret, categorize, and quantify the feedback, thereby limiting its potential to inform product development and support strategies. To address these challenges, this paper presents the User eXperience Perception Insights Dataset (UXPID), a collection of 7130 artificially synthesized and anonymized user feedback branches extracted from a public industrial automation forum. Each JavaScript object notation (JSON) record contains multi-post comments related to specific hardware and software products, enriched with metadata and contextual conversation data. Leveraging a large language model (LLM), each branch is systematically analyzed and annotated for UX insights, user expectations, severity and sentiment ratings, and topic classifications. The UXPID dataset is designed to facilitate research in user requirements, user experience (UX) analysis, and AI-driven feedback processing, particularly where privacy and licensing restrictions limit access to real-world data. UXPID supports the training and evaluation of transformer-based models for tasks such as issue detection, sentiment analysis, and requirements extraction in the context of technical forums.

Title: When Curiosity Signals Danger: Predicting Health Crises Through Online Medication Inquiries

Authors: Dvora Goncharok, Arbel Shifman, Alexander Apartsin, Yehudit Aperstein
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.11802
Pdf URL: https://arxiv.org/pdf/2509.11802
Copy Paste: [[2509.11802]] When Curiosity Signals Danger: Predicting Health Crises Through Online Medication Inquiries(https://arxiv.org/abs/2509.11802)
Keywords: language model, llm
Abstract: Online medical forums are a rich and underutilized source of insight into patient concerns, especially regarding medication use. Some of the many questions users pose may signal confusion, misuse, or even the early warning signs of a developing health crisis. Detecting these critical questions that may precede severe adverse events or life-threatening complications is vital for timely intervention and improving patient safety. This study introduces a novel annotated dataset of medication-related questions extracted from online forums. Each entry is manually labelled for criticality based on clinical risk factors. We benchmark the performance of six traditional machine learning classifiers using TF-IDF textual representations, alongside three state-of-the-art large language model (LLM)-based classification approaches that leverage deep contextual understanding. Our results highlight the potential of classical and modern methods to support real-time triage and alert systems in digital health spaces. The curated dataset is made publicly available to encourage further research at the intersection of patient-generated data, natural language processing, and early warning systems for critical health events. The dataset and benchmark are available at: this https URL.
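
The TF-IDF representation mentioned in this abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `tfidf`, the smoothed IDF formula, the whitespace tokenizer, and the three example questions are all assumptions made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document.

    Weight = relative term frequency * smoothed inverse document frequency.
    The smoothing (1 + n) / (1 + df) is one common convention, assumed here.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        vectors.append({
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors

# Hypothetical forum questions, invented for this example.
questions = [
    "can i take ibuprofen with warfarin",
    "can i take ibuprofen after food",
    "took double dose of warfarin by mistake",
]
vectors = tfidf(questions)
```

Terms that occur in fewer questions (such as "food") receive a higher weight than terms shared across questions (such as "can"), which is what lets a downstream classifier focus on distinctive vocabulary.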

Title: From Fuzzy Speech to Medical Insight: Benchmarking LLMs on Noisy Patient Narratives

Authors: Eden Mama, Liel Sheri, Yehudit Aperstein, Alexander Apartsin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.11803
Pdf URL: https://arxiv.org/pdf/2509.11803
Copy Paste: [[2509.11803]] From Fuzzy Speech to Medical Insight: Benchmarking LLMs on Noisy Patient Narratives(https://arxiv.org/abs/2509.11803)
Keywords: language model, llm
Abstract: The widespread adoption of large language models (LLMs) in healthcare raises critical questions about their ability to interpret patient-generated narratives, which are often informal, ambiguous, and noisy. Existing benchmarks typically rely on clean, structured clinical text, offering limited insight into model performance under realistic conditions. In this work, we present a novel synthetic dataset designed to simulate patient self-descriptions characterized by varying levels of linguistic noise, fuzzy language, and layperson terminology. Our dataset comprises clinically consistent scenarios annotated with ground-truth diagnoses, spanning a spectrum of communication clarity to reflect diverse real-world reporting styles. Using this benchmark, we fine-tune and evaluate several state-of-the-art models (LLMs), including BERT-based and encoder-decoder T5 models. To support reproducibility and future research, we release the Noisy Diagnostic Benchmark (NDB), a structured dataset of noisy, synthetic patient descriptions designed to stress-test and compare the diagnostic capabilities of large language models (LLMs) under realistic linguistic conditions. We made the benchmark available for the community: this https URL

Title: MOOM: Maintenance, Organization and Optimization of Memory in Ultra-Long Role-Playing Dialogues

Authors: Weishu Chen, Jinyi Tang, Zhouhui Hou, Shihao Han, Mingjie Zhan, Zhiyuan Huang, Delong Liu, Jiawei Guo, Zhicheng Zhao, Fei Su
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.11860
Pdf URL: https://arxiv.org/pdf/2509.11860
Copy Paste: [[2509.11860]] MOOM: Maintenance, Organization and Optimization of Memory in Ultra-Long Role-Playing Dialogues(https://arxiv.org/abs/2509.11860)
Keywords: language model
Abstract: Memory extraction is crucial for maintaining coherent ultra-long dialogues in human-robot role-playing scenarios. However, existing methods often exhibit uncontrolled memory growth. To address this, we propose MOOM, the first dual-branch memory plugin that leverages literary theory by modeling plot development and character portrayal as core storytelling elements. Specifically, one branch summarizes plot conflicts across multiple time scales, while the other extracts the user's character profile. MOOM further integrates a forgetting mechanism, inspired by the "competition-inhibition" memory theory, to constrain memory capacity and mitigate uncontrolled growth. Furthermore, we present ZH-4O, a Chinese ultra-long dialogue dataset specifically designed for role-playing, featuring dialogues that average 600 turns and include manually annotated memory information. Experimental results demonstrate that MOOM outperforms all state-of-the-art memory extraction methods, requiring fewer large language model invocations while maintaining a controllable memory capacity.

Title: Growing Perspectives: Modelling Embodied Perspective Taking and Inner Narrative Development Using Large Language Models

Authors: Sabrina Patania, Luca Annese, Anna Lambiase, Anita Pellegrini, Tom Foulsham, Azzurra Ruggeri, Silvia Rossi, Silvia Serino, Dimitri Ognibene
Subjects: cs.CL, cs.AI, cs.HC, cs.RO
Abstract URL: https://arxiv.org/abs/2509.11868
Pdf URL: https://arxiv.org/pdf/2509.11868
Copy Paste: [[2509.11868]] Growing Perspectives: Modelling Embodied Perspective Taking and Inner Narrative Development Using Large Language Models(https://arxiv.org/abs/2509.11868)
Keywords: language model, gpt, llm
Abstract: Language and embodied perspective taking are essential for human collaboration, yet few computational models address both simultaneously. This work investigates the PerspAct system [1], which integrates the ReAct (Reason and Act) paradigm with Large Language Models (LLMs) to simulate developmental stages of perspective taking, grounded in Selman's theory [2]. Using an extended director task, we evaluate GPT's ability to generate internal narratives aligned with specified developmental stages, and assess how these influence collaborative performance both qualitatively (action selection) and quantitatively (task efficiency). Results show that GPT reliably produces developmentally-consistent narratives before task execution but often shifts towards more advanced stages during interaction, suggesting that language exchanges help refine internal representations. Higher developmental stages generally enhance collaborative effectiveness, while earlier stages yield more variable outcomes in complex contexts. These findings highlight the potential of integrating embodied perspective taking and language in LLMs to better model developmental dynamics and stress the importance of evaluating internal speech during combined linguistic and embodied tasks.

Title: Uncertainty in Authorship: Why Perfect AI Detection Is Mathematically Impossible

Authors: Aadil Gani Ganie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.11915
Pdf URL: https://arxiv.org/pdf/2509.11915
Copy Paste: [[2509.11915]] Uncertainty in Authorship: Why Perfect AI Detection Is Mathematically Impossible(https://arxiv.org/abs/2509.11915)
Keywords: language model, llm
Abstract: As large language models (LLMs) become more advanced, it is increasingly difficult to distinguish between human-written and AI-generated text. This paper draws a conceptual parallel between quantum uncertainty and the limits of authorship detection in natural language. We argue that there is a fundamental trade-off: the more confidently one tries to identify whether a text was written by a human or an AI, the more one risks disrupting the text's natural flow and authenticity. This mirrors the tension between precision and disturbance found in quantum systems. We explore how current detection methods--such as stylometry, watermarking, and neural classifiers--face inherent limitations. Enhancing detection accuracy often leads to changes in the AI's output, making other features less reliable. In effect, the very act of trying to detect AI authorship introduces uncertainty elsewhere in the text. Our analysis shows that when AI-generated text closely mimics human writing, perfect detection becomes not just technologically difficult but theoretically impossible. We address counterarguments and discuss the broader implications for authorship, ethics, and policy. Ultimately, we suggest that the challenge of AI-text detection is not just a matter of better tools--it reflects a deeper, unavoidable tension in the nature of language itself.

Title: Designing LLMs for cultural sensitivity: Evidence from English-Japanese translation

Authors: Helene Tenzer, Oumnia Abidi, Stefan Feuerriegel
Subjects: cs.CL, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2509.11921
Pdf URL: https://arxiv.org/pdf/2509.11921
Copy Paste: [[2509.11921]] Designing LLMs for cultural sensitivity: Evidence from English-Japanese translation(https://arxiv.org/abs/2509.11921)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are increasingly used in everyday communication, including multilingual interactions across different cultural contexts. While LLMs can now generate near-perfect literal translations, it remains unclear whether LLMs support culturally appropriate communication. In this paper, we analyze the cultural sensitivity of different LLM designs when applied to English-Japanese translations of workplace e-mails. Here, we vary the prompting strategies: (1) naive "just translate" prompts, (2) audience-targeted prompts specifying the recipient's cultural background, and (3) instructional prompts with explicit guidance on Japanese communication norms. Using a mixed-methods study, we then analyze culture-specific language patterns to evaluate how well translations adapt to cultural norms. Further, we examine the appropriateness of the tone of the translations as perceived by native speakers. We find that culturally-tailored prompting can improve cultural fit, based on which we offer recommendations for designing culturally inclusive LLMs in multilingual settings.

Title: Spec-LLaVA: Accelerating Vision-Language Models with Dynamic Tree-Based Speculative Decoding

Authors: Mingxiao Huo, Jiayi Zhang, Hewei Wang, Jinfeng Xu, Zheyu Chen, Huilin Tai, Yijun Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.11961
Pdf URL: https://arxiv.org/pdf/2509.11961
Copy Paste: [[2509.11961]] Spec-LLaVA: Accelerating Vision-Language Models with Dynamic Tree-Based Speculative Decoding(https://arxiv.org/abs/2509.11961)
Keywords: language model
Abstract: Vision-Language Models (VLMs) enable powerful multimodal reasoning but suffer from slow autoregressive inference, limiting their deployment in real-time applications. We introduce Spec-LLaVA, a system that applies speculative decoding to accelerate VLMs without sacrificing output quality. Spec-LLaVA pairs a lightweight draft VLM with a large target model: the draft speculates future tokens, which the target verifies in parallel, allowing multiple tokens to be generated per step. To maximize efficiency, we design a dynamic tree-based verification algorithm that adaptively expands and prunes speculative branches using draft model confidence. On MS COCO out-of-domain images, Spec-LLaVA achieves up to 3.28× faster decoding on LLaVA-1.5 (7B, 13B) with no loss in generation quality. This work presents a lossless acceleration framework for VLMs using dynamic tree-structured speculative decoding, opening a path toward practical real-time multimodal assistants. Importantly, the lightweight draft model design makes the framework amenable to resource-constrained or on-device deployment settings.
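
The draft-and-verify idea in this abstract can be sketched in a few lines. This is a simplified linear-chain, greedy variant, not the dynamic tree-based algorithm of Spec-LLaVA; `speculative_decode`, `target_next`, `draft_next`, and the toy integer "models" are hypothetical names introduced only for illustration.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy draft-and-verify decoding over token lists.

    target_next / draft_next are assumed callables: token list -> next token.
    Returns the decoded tokens and the number of target verification rounds.
    """
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    target_rounds = 0
    while len(tokens) < goal:
        # 1) The cheap draft model proposes up to k tokens autoregressively.
        proposal = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft_next(tokens + proposal))
        # 2) The target checks every proposed position. A real system scores
        #    all positions in one batched forward pass; this loop simulates it.
        target_rounds += 1
        for i, guess in enumerate(proposal):
            expected = target_next(tokens + proposal[:i])
            if guess != expected:
                # First mismatch: keep the target's token, drop the rest.
                tokens.extend(proposal[:i] + [expected])
                break
        else:
            tokens.extend(proposal)  # every drafted token was accepted
    return tokens, target_rounds

# Toy models: the target counts upward modulo 10; the draft agrees except
# after token 4, where it guesses wrong.
def toy_target(tokens):
    return (tokens[-1] + 1) % 10

def toy_draft(tokens):
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10

decoded, rounds = speculative_decode(toy_target, toy_draft, [0], max_new_tokens=8)
```

Because a token is only kept when it matches the target's greedy choice, the output is identical to decoding with the target alone; the saving comes from needing fewer target rounds than generated tokens whenever the draft guesses well.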

Title: ToolRM: Outcome Reward Models for Tool-Calling Large Language Models

Authors: Mayank Agarwal, Ibrahim Abdelaziz, Kinjal Basu, Merve Unuvar, Luis A. Lastras, Yara Rizk, Pavan Kapanipathi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.11963
Pdf URL: https://arxiv.org/pdf/2509.11963
Copy Paste: [[2509.11963]] ToolRM: Outcome Reward Models for Tool-Calling Large Language Models(https://arxiv.org/abs/2509.11963)
Keywords: language model, llm
Abstract: As large language models (LLMs) increasingly interact with external tools, reward modeling for tool use has become a critical yet underexplored area. Existing reward models, trained primarily on natural language outputs, struggle to evaluate tool-based reasoning and execution. To quantify this gap, we introduce FC-RewardBench, the first benchmark designed to systematically assess reward models' performance in tool-calling scenarios. Our analysis shows that current reward models often miss key signals of effective tool use, highlighting the need for domain-specific modeling. To address this, we propose a training framework for outcome-based reward models using data synthesized from permissively licensed, open-weight LLMs. We train models ranging from 1.7B to 14B parameters and evaluate them across seven out-of-domain benchmarks. These models consistently outperform general-purpose baselines, achieving up to 25% average improvement in downstream task performance and enabling data-efficient fine-tuning through reward-guided filtering.

Title: Text Adaptation to Plain Language and Easy Read via Automatic Post-Editing Cycles

Authors: Jesús Calleja, David Ponce, Thierry Etchegoyhen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.11991
Pdf URL: https://arxiv.org/pdf/2509.11991
Copy Paste: [[2509.11991]] Text Adaptation to Plain Language and Easy Read via Automatic Post-Editing Cycles(https://arxiv.org/abs/2509.11991)
Keywords: language model
Abstract: We describe Vicomtech's participation in the CLEARS challenge on text adaptation to Plain Language and Easy Read in Spanish. Our approach features automatic post-editing of different types of initial Large Language Model adaptations, where successive adaptations are generated iteratively until readability and similarity metrics indicate that no further adaptation refinement can be successfully performed. Taking the average of all official metrics, our submissions achieved first and second place in Plain Language and Easy Read adaptation, respectively.
Summary: We describe Vicomtech's participation in the CLEARS challenge on adapting Spanish text to Plain Language and Easy Read. Our approach applies automatic post-editing to different types of initial large language model adaptations, generating successive adaptations iteratively until readability and similarity metrics indicate that no further refinement can be achieved. Averaged over all official metrics, our submissions took first place in Plain Language adaptation and second place in Easy Read adaptation.

Title: Steering Language Models in Multi-Token Generation: A Case Study on Tense and Aspect

Authors: Alina Klerings, Jannik Brinkmann, Daniel Ruffinelli, Simone Ponzetto
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.12065
Pdf URL: https://arxiv.org/pdf/2509.12065
Copy Paste: [[2509.12065]] Steering Language Models in Multi-Token Generation: A Case Study on Tense and Aspect(https://arxiv.org/abs/2509.12065)
Keywords: language model, llm
Abstract: Large language models (LLMs) are able to generate grammatically well-formed text, but how do they encode their syntactic knowledge internally? While prior work has focused largely on binary grammatical contrasts, in this work, we study the representation and control of two multidimensional hierarchical grammar phenomena - verb tense and aspect - and for each, identify distinct, orthogonal directions in residual space using linear discriminant analysis. Next, we demonstrate causal control over both grammatical features through concept steering across three generation tasks. Then, we use these identified features in a case study to investigate factors influencing effective steering in multi-token generation. We find that steering strength, location, and duration are crucial parameters for reducing undesirable side effects such as topic shift and degeneration. Our findings suggest that models encode tense and aspect in structurally organized, human-like ways, but effective control of such features during generation is sensitive to multiple factors and requires manual tuning or automated optimization.
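Summary: Large language models (LLMs) can generate grammatically well-formed text, but how do they encode syntactic knowledge internally? Whereas prior work has focused largely on binary grammatical contrasts, this work studies the representation and control of two multidimensional, hierarchical grammatical phenomena, verb tense and aspect, and for each identifies distinct, orthogonal directions in residual space using linear discriminant analysis. We then demonstrate causal control over both grammatical features through concept steering across three generation tasks. Next, we use the identified features in a case study of the factors that influence effective steering in multi-token generation. We find that steering strength, location, and duration are crucial parameters for reducing undesirable side effects such as topic shift and degeneration. Our findings suggest that models encode tense and aspect in structurally organized, human-like ways, but that effective control of such features during generation is sensitive to multiple factors and requires manual tuning or automated optimization.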

Title: Is 'Hope' a person or an idea? A pilot benchmark for NER: comparing traditional NLP tools and large language models on ambiguous entities

Authors: Payam Latifi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.12098
Pdf URL: https://arxiv.org/pdf/2509.12098
Copy Paste: [[2509.12098]] Is 'Hope' a person or an idea? A pilot benchmark for NER: comparing traditional NLP tools and large language models on ambiguous entities(https://arxiv.org/abs/2509.12098)
Keywords: language model, llm
Abstract: This pilot study presents a small-scale but carefully annotated benchmark of Named Entity Recognition (NER) performance across six systems: three non-LLM NLP tools (NLTK, spaCy, Stanza) and three general-purpose large language models (LLMs: Gemini-1.5-flash, DeepSeek-V3, Qwen-3-4B). The dataset contains 119 tokens covering five entity types (PERSON, LOCATION, ORGANIZATION, DATE, TIME). We evaluated each system's output against the manually annotated gold standard dataset using F1-score. The results show that LLMs generally outperform conventional tools in recognizing context-sensitive entities like person names, with Gemini achieving the highest average F1-score. However, traditional systems like Stanza demonstrate greater consistency in structured tags such as LOCATION and DATE. We also observed variability among LLMs, particularly in handling temporal expressions and multi-word organizations. Our findings highlight that while LLMs offer improved contextual understanding, traditional tools remain competitive in specific tasks, informing model selection.
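Summary: This pilot study presents a small-scale but carefully annotated benchmark of Named Entity Recognition (NER) performance across six systems: three non-LLM NLP tools (NLTK, spaCy, Stanza) and three general-purpose large language models (Gemini-1.5-flash, DeepSeek-V3, Qwen-3-4B). The dataset contains 119 tokens covering five entity types (PERSON, LOCATION, ORGANIZATION, DATE, TIME). We evaluated each system's output against the manually annotated gold standard using F1-score. The results show that LLMs generally outperform conventional tools at recognizing context-sensitive entities such as person names, with Gemini achieving the highest average F1-score. However, traditional systems such as Stanza are more consistent on structured tags such as LOCATION and DATE. We also observed variability among LLMs, particularly in handling temporal expressions and multi-word organizations. Our findings highlight that although LLMs offer improved contextual understanding, traditional tools remain competitive on specific tasks, which can inform model selection.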
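
The per-entity-type F1 evaluation mentioned in this abstract can be illustrated with a minimal sketch. It is not the paper's evaluation script: the function name `per_type_f1`, the `(token_index, entity_type)` input format, and the toy data are assumptions made for illustration. It uses only the standard definition F1 = 2PR / (P + R).

```python
def per_type_f1(gold, pred):
    """Compute F1 per entity type.

    gold, pred: iterables of (token_index, entity_type) pairs.
    This input format is a hypothetical choice for the sketch.
    """
    gold_set, pred_set = set(gold), set(pred)
    types = {t for _, t in gold_set | pred_set}
    scores = {}
    for t in sorted(types):
        g = {x for x in gold_set if x[1] == t}
        p = {x for x in pred_set if x[1] == t}
        tp = len(g & p)  # tokens labeled t in both gold and prediction
        precision = tp / len(p) if p else 0.0
        recall = tp / len(g) if g else 0.0
        denom = precision + recall
        scores[t] = 2 * precision * recall / denom if denom else 0.0
    return scores


# Toy example (invented data): token 3 is mislabeled as ORGANIZATION.
gold = [(0, "PERSON"), (3, "LOCATION"), (5, "DATE")]
pred = [(0, "PERSON"), (3, "ORGANIZATION"), (5, "DATE")]
scores = per_type_f1(gold, pred)
macro_f1 = sum(scores.values()) / len(scores)
```

In this toy case PERSON and DATE score 1.0 while LOCATION and ORGANIZATION score 0.0, so the unweighted average over the four types is 0.5.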

Title: GTA: Supervised-Guided Reinforcement Learning for Text Classification with Large Language Models

Authors: Min Zeng, Jinfei Sun, Xueyou Luo, Caiquan Liu, Shiqi Zhang, Li Xie, Xiaoxin Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.12108
Pdf URL: https://arxiv.org/pdf/2509.12108
Copy Paste: [[2509.12108]] GTA: Supervised-Guided Reinforcement Learning for Text Classification with Large Language Models(https://arxiv.org/abs/2509.12108)
Keywords: language model
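Abstract: In natural language processing tasks, pure reinforcement learning (RL) fine-tuning methods often suffer from inefficient exploration and slow convergence, while supervised fine-tuning (SFT) methods, although efficient in training, have a limited performance ceiling and a less solid theoretical foundation compared to RL. To address this efficiency-capability trade-off, we propose the Guess-Think-Answer (GTA) framework, which combines the efficiency of SFT with the capability gains of RL in a unified training paradigm. GTA works by having the model first produce a provisional guess (optimized via cross-entropy loss), then reflect on this guess before generating the final answer, with RL rewards shaping both the final output and the format of the entire GTA structure. This hybrid approach achieves both faster convergence than pure RL and a higher performance ceiling than pure SFT. To mitigate gradient conflicts between the two training signals, we employ loss masking and gradient constraints. Empirical results on four text classification benchmarks demonstrate that GTA substantially accelerates convergence while outperforming both standalone SFT and RL baselines.
Summary: In natural language processing tasks, pure reinforcement learning (RL) fine-tuning often suffers from inefficient exploration and slow convergence, whereas supervised fine-tuning (SFT), although efficient to train, has a limited performance ceiling and a weaker theoretical foundation than RL. To address this efficiency-capability trade-off, we propose the Guess-Think-Answer (GTA) framework, which combines the efficiency of SFT with the capability gains of RL in a unified training paradigm. In GTA the model first produces a provisional guess (optimized via cross-entropy loss) and then reflects on that guess before generating the final answer, with RL rewards shaping both the final output and the format of the whole GTA structure. This hybrid approach converges faster than pure RL and reaches a higher performance ceiling than pure SFT. To mitigate gradient conflicts between the two training signals, we use loss masking and gradient constraints. Results on four text classification benchmarks show that GTA substantially accelerates convergence while outperforming both standalone SFT and RL baselines.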

Title: CBP-Tuning: Efficient Local Customization for Black-box Large Language Models

Authors: Jiaxuan Zhao, Naibin Gu, Yuchen Feng, Xiyu Liu, Peng Fu, Zheng Lin, Weiping Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.12112
Pdf URL: https://arxiv.org/pdf/2509.12112
Copy Paste: [[2509.12112]] CBP-Tuning: Efficient Local Customization for Black-box Large Language Models(https://arxiv.org/abs/2509.12112)
Keywords: language model, llm, prompt
Abstract: The high costs of customizing large language models (LLMs) fundamentally limit their adaptability to user-specific needs. Consequently, LLMs are increasingly offered as cloud-based services, a paradigm that introduces critical limitations: providers struggle to support personalized customization at scale, while users face privacy risks when exposing sensitive data. To address this dual challenge, we propose Customized Black-box Prompt Tuning (CBP-Tuning), a novel framework that facilitates efficient local customization while preserving bidirectional privacy. Specifically, we design a two-stage framework: (1) a prompt generator trained on the server-side to capture domain-specific and task-agnostic capabilities, and (2) user-side gradient-free optimization that tailors soft prompts for individual tasks. This approach eliminates the need for users to access model weights or upload private data, requiring only a single customized vector per task while achieving effective adaptation. Furthermore, the evaluation of CBP-Tuning in the commonsense reasoning, medical and financial domain settings demonstrates superior performance compared to baselines, showcasing its advantages in task-agnostic processing and privacy preservation.
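Summary: The high cost of customizing large language models (LLMs) fundamentally limits their adaptability to user-specific needs. As a result, LLMs are increasingly offered as cloud-based services, a paradigm that introduces critical limitations: providers struggle to support personalized customization at scale, while users face privacy risks when exposing sensitive data. To address this dual challenge, we propose Customized Black-box Prompt Tuning (CBP-Tuning), a novel framework that enables efficient local customization while preserving bidirectional privacy. Specifically, we design a two-stage framework: (1) a prompt generator trained on the server side to capture domain-specific and task-agnostic capabilities, and (2) user-side gradient-free optimization that tailors soft prompts to individual tasks. This approach removes the need for users to access model weights or upload private data, requiring only a single customized vector per task while still achieving effective adaptation. In addition, evaluation of CBP-Tuning in commonsense reasoning, medical, and financial domain settings shows superior performance compared to baselines, demonstrating its advantages in task-agnostic processing and privacy preservation.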

Title: XplaiNLP at CheckThat! 2025: Multilingual Subjectivity Detection with Finetuned Transformers and Prompt-Based Inference with Large Language Models

Authors: Ariana Sahitaj, Jiaao Li, Pia Wenzel Neves, Fedor Splitt, Premtim Sahitaj, Charlott Jakob, Veronika Solopova, Vera Schmitt
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.12130
Pdf URL: https://arxiv.org/pdf/2509.12130
Copy Paste: [[2509.12130]] XplaiNLP at CheckThat! 2025: Multilingual Subjectivity Detection with Finetuned Transformers and Prompt-Based Inference with Large Language Models(https://arxiv.org/abs/2509.12130)
Keywords: language model, gpt, llm, prompt
Abstract: This notebook reports the XplaiNLP submission to the CheckThat! 2025 shared task on multilingual subjectivity detection. We evaluate two approaches: (1) supervised fine-tuning of transformer encoders, EuroBERT, XLM-RoBERTa, and German-BERT, on monolingual and machine-translated training data; and (2) zero-shot prompting using two LLMs: o3-mini for Annotation (rule-based labelling) and gpt-4.1-mini for DoubleDown (contrastive rewriting) and Perspective (comparative reasoning). The Annotation Approach achieves 1st place in the Italian monolingual subtask with an F_1 score of 0.8104, outperforming the baseline of 0.6941. In the Romanian zero-shot setting, the fine-tuned XLM-RoBERTa model obtains an F_1 score of 0.7917, ranking 3rd and exceeding the baseline of 0.6461. The same model also performs reliably in the multilingual task and improves over the baseline in Greek. For German, a German-BERT model fine-tuned on translated training data from typologically related languages yields competitive performance over the baseline. In contrast, performance in the Ukrainian and Polish zero-shot settings falls slightly below the respective baselines, reflecting the challenge of generalization in low-resource cross-lingual scenarios.
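Summary: This notebook reports the XplaiNLP submission to the CheckThat! 2025 shared task on multilingual subjectivity detection. We evaluate two approaches: (1) supervised fine-tuning of the transformer encoders EuroBERT, XLM-RoBERTa, and German-BERT on monolingual and machine-translated training data; and (2) zero-shot prompting with two LLMs: o3-mini for Annotation (rule-based labelling) and gpt-4.1-mini for DoubleDown (contrastive rewriting) and Perspective (comparative reasoning). The Annotation approach takes 1st place in the Italian monolingual subtask with an F_1 score of 0.8104, outperforming the baseline of 0.6941. In the Romanian zero-shot setting, the fine-tuned XLM-RoBERTa model obtains an F_1 score of 0.7917, ranking 3rd and exceeding the baseline of 0.6461. The same model also performs reliably in the multilingual task and improves over the baseline in Greek. For German, a German-BERT model fine-tuned on translated training data from typologically related languages is competitive with the baseline. In contrast, performance in the Ukrainian and Polish zero-shot settings falls slightly below the respective baselines, reflecting the challenge of generalization in low-resource cross-lingual scenarios.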

Title: Pun Unintended: LLMs and the Illusion of Humor Understanding

Authors: Alessandro Zangari, Matteo Marcuzzo, Andrea Albarelli, Mohammad Taher Pilehvar, Jose Camacho-Collados
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.12158
Pdf URL: https://arxiv.org/pdf/2509.12158
Copy Paste: [[2509.12158]] Pun Unintended: LLMs and the Illusion of Humor Understanding(https://arxiv.org/abs/2509.12158)
Keywords: llm
Abstract: Puns are a form of humorous wordplay that exploits polysemy and phonetic similarity. While LLMs have shown promise in detecting puns, we show in this paper that their understanding often remains shallow, lacking the nuanced grasp typical of human interpretation. By systematically analyzing and reformulating existing pun benchmarks, we demonstrate how subtle changes in puns are sufficient to mislead LLMs. Our contributions include comprehensive and nuanced pun detection benchmarks, human evaluation across recent LLMs, and an analysis of the robustness challenges these models face in processing puns.
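Summary: Puns are a form of humorous wordplay that exploits polysemy and phonetic similarity. Although LLMs have shown promise in detecting puns, we show in this paper that their understanding often remains shallow and lacks the nuanced grasp typical of human interpretation. By systematically analyzing and reformulating existing pun benchmarks, we demonstrate that subtle changes to puns are enough to mislead LLMs. Our contributions include comprehensive and nuanced pun detection benchmarks, human evaluation across recent LLMs, and an analysis of the robustness challenges these models face when processing puns.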

Title: RAGs to Riches: RAG-like Few-shot Learning for Large Language Model Role-playing

Authors: Timothy Rupprecht, Enfu Nan, Arash Akbari, Arman Akbari, Lei Lu, Priyanka Maan, Sean Duffy, Pu Zhao, Yumei He, David Kaeli, Yanzhi Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.12168
Pdf URL: https://arxiv.org/pdf/2509.12168
Copy Paste: [[2509.12168]] RAGs to Riches: RAG-like Few-shot Learning for Large Language Model Role-playing(https://arxiv.org/abs/2509.12168)
Keywords: language model, llm, prompt, retrieval-augmented generation
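Abstract: Role-playing large language models (LLMs) are increasingly deployed in high-stakes domains such as healthcare, education, and governance, where failures can directly impact user trust and well-being. A cost-effective paradigm for LLM role-playing is few-shot learning, but existing approaches often cause models to break character in unexpected and potentially harmful ways, especially when interacting with hostile users. Inspired by Retrieval-Augmented Generation (RAG), we reformulate LLM role-playing into a text retrieval problem and propose a new prompting framework called RAGs-to-Riches, which leverages curated reference demonstrations to condition LLM responses. We evaluate our framework with LLM-as-a-judge preference voting and introduce two novel token-level ROUGE metrics: Intersection over Output (IOO) to quantify how much an LLM improvises and Intersection over References (IOR) to measure the utilization rate of few-shot demonstrations during the evaluation tasks. When simulating interactions with a hostile user, our prompting strategy incorporates into its responses during inference an average of 35% more tokens from the reference demonstrations. As a result, across 453 role-playing interactions, our models are consistently judged as being more authentic, and remain in-character more often than zero-shot and In-Context Learning (ICL) methods. Our method presents a scalable strategy for building robust, human-aligned LLM role-playing frameworks.
Summary: Role-playing large language models (LLMs) are increasingly deployed in high-stakes domains such as healthcare, education, and governance, where failures can directly affect user trust and well-being. Few-shot learning is a cost-effective paradigm for LLM role-playing, but existing approaches often cause models to break character in unexpected and potentially harmful ways, especially when interacting with hostile users. Inspired by Retrieval-Augmented Generation (RAG), we recast LLM role-playing as a text retrieval problem and propose a new prompting framework, RAGs-to-Riches, which uses curated reference demonstrations to condition LLM responses. We evaluate the framework with LLM-as-a-judge preference voting and introduce two novel token-level ROUGE metrics: Intersection over Output (IOO), which quantifies how much an LLM improvises, and Intersection over References (IOR), which measures how much of the few-shot demonstrations is used during the evaluation tasks. When simulating interactions with a hostile user, our prompting strategy incorporates on average 35% more tokens from the reference demonstrations into its responses at inference time. As a result, across 453 role-playing interactions, our models are consistently judged to be more authentic and stay in character more often than zero-shot and In-Context Learning (ICL) methods. Our method offers a scalable strategy for building robust, human-aligned LLM role-playing frameworks.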
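
The abstract names the IOO and IOR metrics but does not define them. The sketch below is one plausible reading inferred only from the metric names, and should be treated as an assumption rather than the authors' formulation: the token overlap between a response and the reference demonstrations, divided by the response length (IOO) or by the total reference length (IOR). The whitespace tokenizer, the function name `overlap_metrics`, and the example strings are likewise hypothetical.

```python
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase whitespace split.
    return text.lower().split()


def overlap_metrics(output, references):
    """Return (ioo, ior) under the assumed definitions:

    ioo = |output tokens also in references| / |output tokens|
    ior = |output tokens also in references| / |reference tokens|
    Counts are multiset (clipped) counts, as in ROUGE-style overlap.
    """
    out_counts = Counter(tokenize(output))
    ref_counts = Counter()
    for ref in references:
        ref_counts.update(tokenize(ref))
    # Counter & Counter keeps the minimum count per token.
    intersection = sum((out_counts & ref_counts).values())
    out_total = sum(out_counts.values())
    ref_total = sum(ref_counts.values())
    ioo = intersection / out_total if out_total else 0.0
    ior = intersection / ref_total if ref_total else 0.0
    return ioo, ior


# Invented example: 3 of 4 output tokens appear in the 6 reference tokens.
ioo, ior = overlap_metrics(
    "the king greets you",
    ["the king commands", "greets the court"],
)
```

Under this reading, a low IOO would indicate that much of the response is improvised rather than drawn from the demonstrations, and a higher IOR would indicate that more of the demonstration text is being reused.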

Title: Preservation of Language Understanding Capabilities in Speech-aware Large Language Models

Authors: Marek Kubis, Paweł Skórzewski, Iwona Christop, Mateusz Czyżnikiewicz, Jakub Kubiak, Łukasz Bondaruk, Marcin Lewandowski
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.12171
Pdf URL: https://arxiv.org/pdf/2509.12171
Copy Paste: [[2509.12171]] Preservation of Language Understanding Capabilities in Speech-aware Large Language Models(https://arxiv.org/abs/2509.12171)
Keywords: language model
Abstract: The paper presents C3T (Cross-modal Capabilities Conservation Test), a new benchmark for assessing the performance of speech-aware large language models. The benchmark utilizes textual tasks and a voice cloning text-to-speech model to quantify the extent to which language understanding capabilities are preserved when the model is accessed via speech input. C3T quantifies the fairness of the model for different categories of speakers and its robustness across text and speech modalities.
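Summary: The paper presents C3T (Cross-modal Capabilities Conservation Test), a new benchmark for assessing the performance of speech-aware large language models. The benchmark uses textual tasks and a voice-cloning text-to-speech model to quantify how well language understanding capabilities are preserved when the model is accessed via speech input. C3T quantifies the fairness of the model across different categories of speakers and its robustness across text and speech modalities.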