2025-11-06

Title: Cache Mechanism for Agent RAG Systems

Authors: Shuhang Lin, Zhencan Peng, Lingyao Li, Xiao Lin, Xi Zhu, Yongfeng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.02919
Pdf URL: https://arxiv.org/pdf/2511.02919
Copy Paste: [[2511.02919]] Cache Mechanism for Agent RAG Systems(https://arxiv.org/abs/2511.02919)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: Recent advances in Large Language Model (LLM)-based agents have been propelled by Retrieval-Augmented Generation (RAG), which grants the models access to vast external knowledge bases. Despite RAG's success in improving agent performance, agent-level cache management, particularly constructing, maintaining, and updating a compact, relevant corpus dynamically tailored to each agent's need, remains underexplored. Therefore, we introduce ARC (Agent RAG Cache Mechanism), a novel, annotation-free caching framework that dynamically manages small, high-value corpora for each agent. By synthesizing historical query distribution patterns with the intrinsic geometry of cached items in the embedding space, ARC automatically maintains a high-relevance cache. With comprehensive experiments on three retrieval datasets, our experimental results demonstrate that ARC reduces storage requirements to 0.015% of the original corpus while offering up to 79.8% has-answer rate and reducing average retrieval latency by 80%. Our results demonstrate that ARC can drastically enhance efficiency and effectiveness in RAG-powered LLM agents.
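The abstract says ARC combines historical query patterns with the geometry of cached items in embedding space, but gives no formula. The following is only a minimal sketch of that idea under stated assumptions: the `demand` and `centrality` terms, the `alpha` mix, and all function names are hypothetical and are not taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def score_items(items, recent_queries, alpha=0.5):
    """Score each cached item; higher means more worth keeping.

    demand:     mean similarity to recent queries (query-distribution signal)
    centrality: mean similarity to the other cached items (geometry signal)
    Both terms and the alpha mix are illustrative assumptions.
    """
    scores = {}
    for key, vec in items.items():
        demand = sum(cosine(vec, q) for q in recent_queries) / max(len(recent_queries), 1)
        others = [v for k, v in items.items() if k != key]
        centrality = sum(cosine(vec, v) for v in others) / max(len(others), 1)
        scores[key] = alpha * demand + (1 - alpha) * centrality
    return scores

def evict_to_capacity(items, recent_queries, capacity, alpha=0.5):
    """Return a new cache holding only the `capacity` highest-scoring items."""
    scores = score_items(items, recent_queries, alpha)
    keep = sorted(items, key=lambda k: scores[k], reverse=True)[:capacity]
    return {k: items[k] for k in keep}

# Toy 2-D embeddings: two items near the query direction, one far from it.
cache = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.9, 0.1],
    "doc_c": [0.0, 1.0],
}
queries = [[1.0, 0.05], [0.95, 0.0]]
kept = evict_to_capacity(cache, queries, capacity=2)
```

In this toy run the item pointing away from the recent queries (`doc_c`) is the one evicted.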

Title: LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation

Authors: Gyeom Hwangbo, Hyungjoo Chae, Minseok Kang, Hyeonjong Ju, Soohyun Oh, Jinyoung Yeo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.03001
Pdf URL: https://arxiv.org/pdf/2511.03001
Copy Paste: [[2511.03001]] LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation(https://arxiv.org/abs/2511.03001)
Keywords: language model, llm, agent
Abstract: Despite recent progress in using Large Language Models (LLMs) for automatically generating 3D scenes, generated scenes often lack realistic spatial layouts and object attributes found in real-world environments. As this problem stems from insufficiently detailed, coarse-grained instructions, advancing 3D scene synthesis guided by more detailed, fine-grained instructions that reflect real-world environments becomes crucial. Without such realistic scenes, training embodied agents in unrealistic environments can lead them to learn priors that diverge significantly from real-world physics and semantics, degrading their performance when deployed. Thus, verifying the alignment between the fine-grained instruction and the generated scene is essential for effective learning. However, current evaluation methods, such as CLIPScore and vision-language models (VLMs), often fail to reliably assess such alignment. This shortcoming arises primarily from their shallow understanding of 3D scenes, which often leads to improperly grounded scene components. To address this, we introduce LEGO-Eval, an evaluation framework equipped with diverse tools designed to explicitly ground scene components, enabling more accurate alignment assessments. We also present LEGO-Bench, a benchmark of detailed instructions that specify complex layouts and attributes of real-world environments. Experiments demonstrate that LEGO-Eval outperforms VLM-as-a-judge by 0.41 F1 score in assessing scene-instruction alignment. Benchmarking with LEGO-Bench reveals significant limitations in current generation methods. Across all evaluated approaches, success rates reached at most 10% in generating scenes that fully align with fine-grained instructions.

Title: Targeted Error Correction in Knowledge Distillation: Small Language Models Surpass GPT

Authors: Hee-Jin Lee, Zhen Guo, Luchao Jin, Morteza Moazami Goudarzi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.03005
Pdf URL: https://arxiv.org/pdf/2511.03005
Copy Paste: [[2511.03005]] Targeted Error Correction in Knowledge Distillation: Small Language Models Surpass GPT(https://arxiv.org/abs/2511.03005)
Keywords: language model, gpt, llm
Abstract: We introduce an Analyze-Revise-Finetune (ARF) pipeline that enables smaller open-source language models (LLMs) to surpass substantially larger proprietary models in customer service summarization tasks. The pipeline first analyzes and categorizes common errors in summaries produced by a teacher model (GPT-3.5), then performs a targeted revision using a compact editor model (Llama 3.1 70B) to generate high-quality, refined training data. Fine-tuning a smaller student model (Llama 3.1 8B) on this refined data resulted in superior summarization performance compared to GPT-3.5. The ARF pipeline improves cost efficiency and data privacy while maintaining competitive accuracy, illustrating a generalizable framework for enhancing open-source LLMs across diverse downstream applications.

Title: Data-Efficient Adaptation and a Novel Evaluation Method for Aspect-based Sentiment Analysis

Authors: Yan Cathy Hua, Paul Denny, Jörg Wicker, Katerina Taškova
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.03034
Pdf URL: https://arxiv.org/pdf/2511.03034
Copy Paste: [[2511.03034]] Data-Efficient Adaptation and a Novel Evaluation Method for Aspect-based Sentiment Analysis(https://arxiv.org/abs/2511.03034)
Keywords: language model
Abstract: Aspect-based Sentiment Analysis (ABSA) is a fine-grained opinion mining approach that identifies and classifies opinions associated with specific entities (aspects) or their categories within a sentence. Despite its rapid growth and broad potential, ABSA research and resources remain concentrated in commercial domains, leaving analytical needs unmet in high-demand yet low-resource areas such as education and healthcare. Domain adaptation challenges and most existing methods' reliance on resource-intensive in-training knowledge injection further hinder progress in these areas. Moreover, traditional evaluation methods based on exact matches are overly rigid for ABSA tasks, penalising any boundary variations which may misrepresent the performance of generative models. This work addresses these gaps through three contributions: 1) We propose a novel evaluation method, Flexible Text Similarity Matching and Optimal Bipartite Pairing (FTS-OBP), which accommodates realistic extraction boundary variations while maintaining strong correlation with traditional metrics and offering fine-grained diagnostics. 2) We present the first ABSA study of small decoder-only generative language models (SLMs; <7B parameters), examining resource lower bounds via a case study in education review ABSA. We systematically explore data-free (in-context learning and weight merging) and data-light fine-tuning methods, and propose a multitask fine-tuning strategy that significantly enhances SLM performance, enabling 1.5-3.8B models to surpass proprietary large models and approach benchmark results with only 200-1,000 examples on a single GPU. 3) We release the first public set of education review ABSA resources to support future research in low-resource domains.
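The general idea behind similarity-based matching with optimal one-to-one pairing can be sketched as follows. This is not the paper's FTS-OBP implementation: the character-level similarity from `difflib`, the brute-force search, and the 0.6 threshold are all assumptions chosen to keep the example dependency-free.

```python
from difflib import SequenceMatcher
from itertools import permutations

def similarity(a, b):
    """Character-level similarity in [0, 1]; a stand-in for the paper's measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_pairing(predicted, gold):
    """Find the one-to-one pairing that maximises total similarity.

    Brute force over permutations, so only suitable for the handful of
    spans typically extracted from a single sentence.
    """
    if len(predicted) > len(gold):
        pairs, total = best_pairing(gold, predicted)
        return [(p, g) for g, p in pairs], total
    best, best_total = [], -1.0
    for perm in permutations(range(len(gold)), len(predicted)):
        total = sum(similarity(predicted[i], gold[j]) for i, j in enumerate(perm))
        if total > best_total:
            best_total = total
            best = [(predicted[i], gold[j]) for i, j in enumerate(perm)]
    return best, best_total

def soft_f1(predicted, gold, threshold=0.6):
    """F1 where a pair counts as a match if similarity >= threshold (assumed value)."""
    pairs, _ = best_pairing(predicted, gold)
    matched = sum(1 for p, g in pairs if similarity(p, g) >= threshold)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

pred = ["the lecture slides", "tutor"]
gold = ["tutors", "lecture slides", "exam"]
pairs, _ = best_pairing(pred, gold)
score = soft_f1(pred, gold)
```

Under exact matching, neither predicted span in this example would count as correct, since both differ from the gold spans by a boundary word or a suffix. The soft pairing credits both.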

Title: ROBoto2: An Interactive System and Dataset for LLM-assisted Clinical Trial Risk of Bias Assessment

Authors: Anthony Hevia, Sanjana Chintalapati, Veronica Ka Wai Lai, Thanh Tam Nguyen, Wai-Tat Wong, Terry Klassen, Lucy Lu Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.03048
Pdf URL: https://arxiv.org/pdf/2511.03048
Copy Paste: [[2511.03048]] ROBoto2: An Interactive System and Dataset for LLM-assisted Clinical Trial Risk of Bias Assessment(https://arxiv.org/abs/2511.03048)
Keywords: language model, llm, prompt
Abstract: We present ROBOTO2, an open-source, web-based platform for large language model (LLM)-assisted risk of bias (ROB) assessment of clinical trials. ROBOTO2 streamlines the traditionally labor-intensive ROB v2 (ROB2) annotation process via an interactive interface that combines PDF parsing, retrieval-augmented LLM prompting, and human-in-the-loop review. Users can upload clinical trial reports, receive preliminary answers and supporting evidence for ROB2 signaling questions, and provide real-time feedback or corrections to system suggestions. ROBOTO2 is publicly available at this https URL, with code and data released to foster reproducibility and adoption. We construct and release a dataset of 521 pediatric clinical trial reports (8954 signaling questions with 1202 evidence passages), annotated using both manual and LLM-assisted methods, serving as a benchmark and enabling future research. Using this dataset, we benchmark ROB2 performance for 4 LLMs and provide an analysis of current model capabilities and ongoing challenges in automating this critical aspect of systematic review.

Title: Reading Between the Lines: The One-Sided Conversation Problem

Authors: Victoria Ebert, Rishabh Singh, Tuochao Chen, Noah A. Smith, Shyamnath Gollakota
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.03056
Pdf URL: https://arxiv.org/pdf/2511.03056
Copy Paste: [[2511.03056]] Reading Between the Lines: The One-Sided Conversation Problem(https://arxiv.org/abs/2511.03056)
Keywords: llm, hallucination, prompt
Abstract: Conversational AI is constrained in many real-world settings where only one side of a dialogue can be recorded, such as telemedicine, call centers, and smart glasses. We formalize this as the one-sided conversation problem (1SC): inferring and learning from one side of a conversation. We study two tasks: (1) reconstructing the missing speaker's turns for real-time use cases, and (2) generating summaries from one-sided transcripts. Evaluating prompting and finetuned models on MultiWOZ, DailyDialog, and Candor with both human A/B testing and LLM-as-a-judge metrics, we find that access to one future turn and information about utterance length improves reconstruction, placeholder prompting helps to mitigate hallucination, and while large models generate promising reconstructions with prompting, smaller models require finetuning. Further, high-quality summaries can be generated without reconstructing missing turns. We present 1SC as a novel challenge and report promising results that mark a step toward privacy-aware conversational AI.

Title: PolyNorm: Few-Shot LLM-Based Text Normalization for Text-to-Speech

Authors: Michel Wong, Ali Alshehri, Sophia Kao, Haotian He
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.03080
Pdf URL: https://arxiv.org/pdf/2511.03080
Copy Paste: [[2511.03080]] PolyNorm: Few-Shot LLM-Based Text Normalization for Text-to-Speech(https://arxiv.org/abs/2511.03080)
Keywords: language model, llm, prompt
Abstract: Text Normalization (TN) is a key preprocessing step in Text-to-Speech (TTS) systems, converting written forms into their canonical spoken equivalents. Traditional TN systems can exhibit high accuracy, but involve substantial engineering effort, are difficult to scale, and pose challenges to language coverage, particularly in low-resource settings. We propose PolyNorm, a prompt-based approach to TN using Large Language Models (LLMs), aiming to reduce the reliance on manually crafted rules and enable broader linguistic applicability with minimal human intervention. Additionally, we present a language-agnostic pipeline for automatic data curation and evaluation, designed to facilitate scalable experimentation across diverse languages. Experiments across eight languages show consistent reductions in the word error rate (WER) compared to a production-grade-based system. To support further research, we release PolyNorm-Benchmark, a multilingual data set covering a diverse range of text normalization phenomena.
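For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. The sketch below shows that standard definition on a made-up normalization example. The abstract does not describe the paper's exact scoring setup, so treat the function name and inputs as illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# Hypothetical example: the normalizer left "dr" unexpanded.
reference = "doctor smith paid twenty five dollars"
hypothesis = "dr smith paid twenty five dollars"
wer = word_error_rate(reference, hypothesis)
```

Here one substitution against six reference words gives a WER of 1/6.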

Title: CARMA: Comprehensive Automatically-annotated Reddit Mental Health Dataset for Arabic

Authors: Saad Mankarious, Ayah Zirikly
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.03102
Pdf URL: https://arxiv.org/pdf/2511.03102
Copy Paste: [[2511.03102]] CARMA: Comprehensive Automatically-annotated Reddit Mental Health Dataset for Arabic(https://arxiv.org/abs/2511.03102)
Keywords: language model
Abstract: Mental health disorders affect millions worldwide, yet early detection remains a major challenge, particularly for Arabic-speaking populations where resources are limited and mental health discourse is often discouraged due to cultural stigma. While substantial research has focused on English-language mental health detection, Arabic remains significantly underexplored, partly due to the scarcity of annotated datasets. We present CARMA, the first automatically annotated large-scale dataset of Arabic Reddit posts. The dataset encompasses six mental health conditions, such as Anxiety, Autism, and Depression, and a control group. CARMA surpasses existing resources in both scale and diversity. We conduct qualitative and quantitative analyses of lexical and semantic differences between users, providing insights into the linguistic markers of specific mental health conditions. To demonstrate the dataset's potential for further mental health analysis, we perform classification experiments using a range of models, from shallow classifiers to large language models. Our results highlight the promise of advancing mental health detection in underrepresented languages such as Arabic.

Title: Control Barrier Function for Aligning Large Language Models

Authors: Yuya Miyaoka, Masaki Inoue
Subjects: cs.CL, cs.AI, eess.SY
Abstract URL: https://arxiv.org/abs/2511.03121
Pdf URL: https://arxiv.org/pdf/2511.03121
Copy Paste: [[2511.03121]] Control Barrier Function for Aligning Large Language Models(https://arxiv.org/abs/2511.03121)
Keywords: language model, llm
Abstract: This paper proposes a control-based framework for aligning large language models (LLMs) by leveraging a control barrier function (CBF) to ensure user-desirable text generation. The presented framework applies the CBF safety filter to the predicted token generated from the baseline LLM, to intervene in the generated text. The safety filter includes two significant advantages: this safety filter is an add-on type, allowing it to be used for alignment purposes without fine-tuning the baseline LLM, and if there is an evaluation model regarding the desired alignment, it can be directly applied to the filter design. The overall text-generation system is implemented with open-source language models, aiming to generate positive text.
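The abstract does not spell out the filter, so the following is only a rough sketch of how a barrier-style token filter could look. It assumes the standard discrete-time control barrier condition h(x_next) >= (1 - gamma) * h(x_now) with 0 < gamma <= 1, and it uses a toy word-list score as the evaluation model. The word lists, `h`, `cbf_filter`, and the candidate format are hypothetical, not the authors' design.

```python
POSITIVE = {"good", "great", "happy"}
NEGATIVE = {"bad", "awful", "sad"}

def h(tokens):
    """Toy barrier function: a value >= 0 means the text is still acceptable."""
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return 1.0 + pos - neg

def cbf_filter(tokens, candidates, gamma=0.5):
    """Return the most probable candidate token that satisfies the barrier condition.

    candidates: list of (token, probability) pairs from the baseline model.
    A token passes if h(tokens + [token]) >= (1 - gamma) * h(tokens).
    """
    bound = (1.0 - gamma) * h(tokens)
    for token, _prob in sorted(candidates, key=lambda c: c[1], reverse=True):
        if h(tokens + [token]) >= bound:
            return token
    return None  # no candidate keeps the text inside the safe set

prefix = ["the", "movie", "was"]
proposals = [("awful", 0.6), ("great", 0.3), ("fine", 0.1)]
chosen = cbf_filter(prefix, proposals, gamma=0.5)
chosen_loose = cbf_filter(prefix, proposals, gamma=1.0)
```

With `gamma=0.5` the baseline's top token is overridden. With `gamma=1.0` the text is allowed to move all the way to the boundary of the safe set in one step, so the top token passes. The baseline model itself is never modified, which matches the add-on property the abstract describes.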

Title: MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity

Authors: Kaiyuan Zhang, Chenghao Yang, Zhoufutu Wen, Sihang Yuan, Qiuyue Wang, Chaoyi Huang, Guosheng Zhu, He Wang, Huawenyu Lu, Jianing Wen, Jianpeng Jiao, Lishu Luo, Longxiang Liu, Sijin Wu, Xiaolei Zhu, Xuanliang Zhang, Ge Zhang, Yi Lin, Guang Shi, Chaoyou Fu, Wenhao Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.03146
Pdf URL: https://arxiv.org/pdf/2511.03146
Copy Paste: [[2511.03146]] MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity(https://arxiv.org/abs/2511.03146)
Keywords: llm, chain-of-thought
Abstract: As reasoning models scale rapidly, the essential role of multimodality in human cognition has come into sharp relief, driving a growing need to probe vision-centric cognitive behaviors. Yet, existing multimodal benchmarks either overemphasize textual reasoning or fall short of systematically capturing vision-centric cognitive behaviors, leaving the cognitive capacity of MLLMs insufficiently assessed. To address this limitation, we introduce MME-CC (Multi-Modal Evaluation benchmark of Cognitive Capacity), a vision-grounded benchmark that organizes 11 representative reasoning tasks into three fundamental categories of visual information: spatial, geometric, and knowledge-based reasoning, and provides fine-grained analyses of MLLMs' cognitive capacity across these dimensions. Based on MME-CC, we conduct extensive experiments over 16 representative MLLMs. Our study reveals that closed-source models currently lead overall (e.g., 42.66 for Gemini-2.5-Pro vs. 30.45 for GLM-4.5V), while spatial and geometric reasoning remain broadly weak (less than or equal to 30%). We further identify common error patterns, including orientation mistakes, fragile cross-view identity persistence, and poor adherence to counterfactual instructions, and observe that Chain-of-Thought typically follows a three-stage process (extract -> reason -> verify) with heavy reliance on visual extraction. We hope this work catalyzes a shift toward treating the cognitive capacity of MLLMs as central to both evaluation and model design.

Title: Who Sees the Risk? Stakeholder Conflicts and Explanatory Policies in LLM-based Risk Assessment

Authors: Srishti Yadav, Jasmina Gajcin, Erik Miehling, Elizabeth Daly
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.03152
Pdf URL: https://arxiv.org/pdf/2511.03152
Copy Paste: [[2511.03152]] Who Sees the Risk? Stakeholder Conflicts and Explanatory Policies in LLM-based Risk Assessment(https://arxiv.org/abs/2511.03152)
Keywords: llm
Abstract: Understanding how different stakeholders perceive risks in AI systems is essential for their responsible deployment. This paper presents a framework for stakeholder-grounded risk assessment that uses LLMs as judges to predict and explain risks. Using the Risk Atlas Nexus and GloVE explanation method, our framework generates stakeholder-specific, interpretable policies that show how different stakeholders agree or disagree about the same risks. We demonstrate our method using three real-world AI use cases: medical AI, autonomous vehicles, and fraud detection. We further propose an interactive visualization that reveals how and why conflicts emerge across stakeholder perspectives, enhancing transparency in conflict reasoning. Our results show that stakeholder perspectives significantly influence risk perception and conflict patterns. Our work emphasizes the importance of these stakeholder-aware explanations, which are needed to make LLM-based evaluations more transparent, interpretable, and aligned with human-centered AI governance goals.

Title: Measuring Aleatoric and Epistemic Uncertainty in LLMs: Empirical Evaluation on ID and OOD QA Tasks

Authors: Kevin Wang, Subre Abdoul Moktar, Jia Li, Kangshuo Li, Feng Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.03166
Pdf URL: https://arxiv.org/pdf/2511.03166
Copy Paste: [[2511.03166]] Measuring Aleatoric and Epistemic Uncertainty in LLMs: Empirical Evaluation on ID and OOD QA Tasks(https://arxiv.org/abs/2511.03166)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have become increasingly pervasive, finding applications across many industries and disciplines. Ensuring the trustworthiness of LLM outputs is paramount, where Uncertainty Estimation (UE) plays a key role. In this work, a comprehensive empirical study is conducted to examine the robustness and effectiveness of diverse UE measures regarding aleatoric and epistemic uncertainty in LLMs. It involves twelve different UE methods and four generation quality metrics including LLMScore from LLM criticizers to evaluate the uncertainty of LLM-generated answers in Question-Answering (QA) tasks on both in-distribution (ID) and out-of-distribution (OOD) datasets. Our analysis reveals that information-based methods, which leverage token and sequence probabilities, perform exceptionally well in ID settings due to their alignment with the model's understanding of the data. Conversely, density-based methods and the P(True) metric exhibit superior performance in OOD contexts, highlighting their effectiveness in capturing the model's epistemic uncertainty. Semantic consistency methods, which assess variability in generated answers, show reliable performance across different datasets and generation metrics. These methods generally perform well but may not be optimal for every situation.
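Two of the method families named in the abstract can be illustrated in a few lines. The sketch below is a simplification under stated assumptions: the information-based score is a length-normalised negative log-probability, and the consistency score compares normalised strings, whereas real semantic consistency methods compare meaning. Function names and inputs are illustrative and do not come from the paper.

```python
import math
from collections import Counter

def mean_neg_log_prob(token_probs):
    """Information-based uncertainty: average negative log-probability per token.

    Expects a non-empty list of probabilities in (0, 1].
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def disagreement(sampled_answers):
    """Consistency-based uncertainty: 1 minus the share of the most common answer.

    Answers are compared after stripping whitespace and lowercasing, a crude
    stand-in for semantic equivalence.
    """
    counts = Counter(a.strip().lower() for a in sampled_answers)
    return 1.0 - counts.most_common(1)[0][1] / len(sampled_answers)

# Made-up token probabilities for a confident and an unsure generation.
confident = mean_neg_log_prob([0.9, 0.95, 0.9])
unsure = mean_neg_log_prob([0.4, 0.3, 0.5])

# Four sampled answers to the same question; three agree after normalisation.
spread = disagreement(["Paris", "paris", "Lyon", "Paris "])
```

Higher values mean more uncertainty in both scores.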

Title: BengaliMoralBench: A Benchmark for Auditing Moral Reasoning in Large Language Models within Bengali Language and Culture

Authors: Shahriyar Zaman Ridoy, Azmine Toushik Wasi, Koushik Ahamed Tonmoy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.03180
Pdf URL: https://arxiv.org/pdf/2511.03180
Copy Paste: [[2511.03180]] BengaliMoralBench: A Benchmark for Auditing Moral Reasoning in Large Language Models within Bengali Language and Culture(https://arxiv.org/abs/2511.03180)
Keywords: language model, llm, prompt
Abstract: As multilingual Large Language Models (LLMs) gain traction across South Asia, their alignment with local ethical norms, particularly for Bengali, which is spoken by over 285 million people and ranked 6th globally, remains underexplored. Existing ethics benchmarks are largely English-centric and shaped by Western frameworks, overlooking cultural nuances critical for real-world deployment. To address this, we introduce BengaliMoralBench, the first large-scale ethics benchmark for the Bengali language and socio-cultural contexts. It covers five moral domains, Daily Activities, Habits, Parenting, Family Relationships, and Religious Activities, subdivided into 50 culturally relevant subtopics. Each scenario is annotated via native-speaker consensus using three ethical lenses: Virtue, Commonsense, and Justice ethics. We conduct systematic zero-shot evaluation of prominent multilingual LLMs, including Llama, Gemma, Qwen, and DeepSeek, using a unified prompting protocol and standard metrics. Performance varies widely (50-91% accuracy), with qualitative analysis revealing consistent weaknesses in cultural grounding, commonsense reasoning, and moral fairness. BengaliMoralBench provides a foundation for responsible localization, enabling culturally aligned evaluation and supporting the deployment of ethically robust AI in diverse, low-resource multilingual settings such as Bangladesh.

Title: LGM: Enhancing Large Language Models with Conceptual Meta-Relations and Iterative Retrieval

Authors: Wenchang Lei, Ping Zou, Yue Wang, Feng Sun, Lei Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.03214
Pdf URL: https://arxiv.org/pdf/2511.03214
Copy Paste: [[2511.03214]] LGM: Enhancing Large Language Models with Conceptual Meta-Relations and Iterative Retrieval(https://arxiv.org/abs/2511.03214)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large language models (LLMs) exhibit strong semantic understanding, yet struggle when user instructions involve ambiguous or conceptually misaligned terms. We propose the Language Graph Model (LGM) to enhance conceptual clarity by extracting meta-relations (inheritance, alias, and composition) from natural language. The model further employs a reflection mechanism to validate these meta-relations. Leveraging a Concept Iterative Retrieval Algorithm, these relations and related descriptions are dynamically supplied to the LLM, improving its ability to interpret concepts and generate accurate responses. Unlike conventional Retrieval-Augmented Generation (RAG) approaches that rely on extended context windows, our method enables large language models to process texts of any length without the need for truncation. Experiments on standard benchmarks demonstrate that the LGM consistently outperforms existing RAG baselines.

Title: Hybrid Fact-Checking that Integrates Knowledge Graphs, Large Language Models, and Search-Based Retrieval Agents Improves Interpretable Claim Verification

Authors: Shaghayegh Kolli, Richard Rosenbaum, Timo Cavelius, Lasse Strothe, Andrii Lata, Jana Diesner
Subjects: cs.CL, cs.AI, cs.CY, cs.IR
Abstract URL: https://arxiv.org/abs/2511.03217
Pdf URL: https://arxiv.org/pdf/2511.03217
Copy Paste: [[2511.03217]] Hybrid Fact-Checking that Integrates Knowledge Graphs, Large Language Models, and Search-Based Retrieval Agents Improves Interpretable Claim Verification(https://arxiv.org/abs/2511.03217)
Keywords: language model, llm, prompt, agent
Abstract: Large language models (LLMs) excel in generating fluent utterances but can lack reliable grounding in verified information. At the same time, knowledge-graph-based fact-checkers deliver precise and interpretable evidence, yet suffer from limited coverage or latency. By integrating LLMs with knowledge graphs and real-time search agents, we introduce a hybrid fact-checking approach that leverages the individual strengths of each component. Our system comprises three autonomous steps: 1) a Knowledge Graph (KG) Retrieval for rapid one-hop lookups in DBpedia, 2) an LM-based classification guided by a task-specific labeling prompt, producing outputs with internal rule-based logic, and 3) a Web Search Agent invoked only when KG coverage is insufficient. Our pipeline achieves an F1 score of 0.93 on the FEVER benchmark on the Supported/Refuted split without task-specific fine-tuning. To address Not Enough Information cases, we conduct a targeted reannotation study showing that our approach frequently uncovers valid evidence for claims originally labeled as Not Enough Information (NEI), as confirmed by both expert annotators and LLM reviewers. With this paper, we present a modular, open-source fact-checking pipeline with fallback strategies and generalization across datasets.
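The control flow of the three steps (KG lookup first, classification on the retrieved evidence, web search only as a fallback) can be sketched as below. Everything here is a toy stand-in: the dictionary replaces DBpedia, `classify` replaces the LM-based classifier with a plain equality rule, and `toy_search` replaces the search agent. None of these names come from the paper.

```python
def kg_lookup(claim, kg):
    """One-hop lookup: return the stored object for (subject, predicate), or None."""
    return kg.get((claim["subject"], claim["predicate"]))

def classify(claim, evidence):
    """Stand-in for the LM-based classifier: a plain equality rule."""
    return "SUPPORTED" if evidence == claim["object"] else "REFUTED"

def verify(claim, kg, web_search):
    """Try the KG first; call the search fallback only when the KG has no answer."""
    evidence, source = kg_lookup(claim, kg), "kg"
    if evidence is None:  # KG coverage insufficient
        evidence, source = web_search(claim), "web"
    if evidence is None:
        return {"label": "NOT ENOUGH INFO", "source": None}
    return {"label": classify(claim, evidence), "source": source}

# Toy knowledge graph and toy search backend (both hypothetical).
toy_kg = {("Berlin", "capitalOf"): "Germany"}

def toy_search(claim):
    pages = {("Canberra", "capitalOf"): "Australia"}
    return pages.get((claim["subject"], claim["predicate"]))

r1 = verify({"subject": "Berlin", "predicate": "capitalOf", "object": "Germany"},
            toy_kg, toy_search)
r2 = verify({"subject": "Canberra", "predicate": "capitalOf", "object": "New Zealand"},
            toy_kg, toy_search)
r3 = verify({"subject": "Atlantis", "predicate": "capitalOf", "object": "Nowhere"},
            toy_kg, toy_search)
```

Recording the `source` alongside the label keeps each verdict traceable to the component that produced the evidence, which is one way to preserve the interpretability the abstract emphasizes.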

Title: IndicSuperTokenizer: An Optimized Tokenizer for Indic Multilingual LLMs

Authors: Souvik Rana, Arul Menezes, Ashish Kulkarni, Chandra Khatri, Shubham Agarwal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.03237
Pdf URL: https://arxiv.org/pdf/2511.03237
Copy Paste: [[2511.03237]] IndicSuperTokenizer: An Optimized Tokenizer for Indic Multilingual LLMs(https://arxiv.org/abs/2511.03237)
Keywords: language model, llm
Abstract: Tokenizers play a crucial role in determining the performance, training efficiency, and the inference cost of Large Language Models (LLMs). Designing effective tokenizers for multilingual LLMs is particularly challenging due to diverse scripts and rich morphological variation. While subword methods such as Byte Pair Encoding (BPE) are widely adopted, their effectiveness in multilingual settings remains underexplored. We present IndicSuperTokenizer, a tokenizer for Indic multilingual LLMs, that combines both subword and multi-word tokenization, along with language-specific pre-tokenization, leading to more linguistically aligned tokens and achieving a new state-of-the-art in fertility score. Evaluated across English, 22 Indian languages and code data, our tokenizer improves the average fertility score by 39.5% over LLaMA4 and by 18% over Sutra (the current best). This translates to 44% improvement in inference throughput over LLaMA4 while maintaining comparable performance on English and Indic benchmarks. We also present detailed ablations across tokenizer training data size, vocabulary size, merging techniques, and pre-tokenization strategies, demonstrating the robustness of our design choices.
摘要：分词器在确定大型语言模型 (LLM) 的性能、训练效率和推理成本方面发挥着至关重要的作用。由于文字系统多样且形态变化丰富，为多语言 LLM 设计有效的分词器尤其具有挑战性。虽然字节对编码 (BPE) 等子词方法被广泛采用，但它们在多语言环境中的有效性仍未得到充分探索。我们推出了 IndicSuperTokenizer，这是一种针对印度语多语言 LLM 的分词器，它结合了子词和多词分词，以及特定于语言的预分词，从而产生在语言学上更对齐的标记，并在 fertility score 方面实现了新的最先进水平。通过对英语、22 种印度语言和代码数据进行评估，我们的分词器将平均 fertility score 比 LLaMA4 提高了 39.5%，比 Sutra（当前最好的）提高了 18%。这意味着推理吞吐量比 LLaMA4 提高了 44%，同时在英语和印度语基准测试中保持了可比的性能。我们还提供了分词器训练数据大小、词汇量、合并技术和预分词策略的详细消融，证明了我们设计选择的稳健性。
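
As a rough illustration of the fertility metric mentioned in the IndicSuperTokenizer abstract above: fertility is commonly taken to be the average number of tokens a tokenizer produces per word, with lower values being better. The abstract does not spell out the paper's exact definition, so the whitespace word splitting, the two toy tokenizers, and the sample sentences below are all assumptions made for illustration only.

```python
def fertility(tokenize, texts):
    """Average number of tokens per whitespace-separated word (assumed definition)."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        words = text.split()
        total_words += len(words)
        total_tokens += sum(len(tokenize(word)) for word in words)
    return total_tokens / total_words if total_words else 0.0


# Hypothetical toy tokenizers, not the ones evaluated in the paper.
def char_tokenizer(word):
    """One token per character (worst case for fertility)."""
    return list(word)


def chunk_tokenizer(word, size=3):
    """Fixed-size character chunks, standing in for a subword vocabulary."""
    return [word[i:i + size] for i in range(0, len(word), size)]


sample = ["tokenizers matter", "for multilingual models"]
f_char = fertility(char_tokenizer, sample)
f_chunk = fertility(chunk_tokenizer, sample)

# Relative reduction in fertility of the second tokenizer over the first.
relative_reduction = (f_char - f_chunk) / f_char
print(f"char: {f_char:.2f}  chunk: {f_chunk:.2f}  reduction: {relative_reduction:.1%}")
```

A percentage improvement such as the ones quoted in the abstract would plausibly be computed as a relative reduction of this kind, although the abstract itself does not confirm the formula.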

Title: Comparing the Performance of LLMs in RAG-based Question-Answering: A Case Study in Computer Science Literature

Authors: Ranul Dayarathne, Uvini Ranaweera, Upeksha Ganegoda
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.03261
Pdf URL: https://arxiv.org/pdf/2511.03261
Copy Paste: [[2511.03261]] Comparing the Performance of LLMs in RAG-based Question-Answering: A Case Study in Computer Science Literature(https://arxiv.org/abs/2511.03261)
Keywords: language model, gpt, llm, hallucination, chat, retrieval augmented generation
Abstract: Retrieval Augmented Generation (RAG) is emerging as a powerful technique to enhance the capabilities of Generative AI models by reducing hallucination. Thus, the increasing prominence of RAG alongside Large Language Models (LLMs) has sparked interest in comparing the performance of different LLMs in question-answering (QA) in diverse domains. This study compares the performance of four open-source LLMs, Mistral-7b-instruct, LLaMa2-7b-chat, Falcon-7b-instruct and Orca-mini-v3-7b, and OpenAI's trending GPT-3.5 over QA tasks within the computer science literature leveraging RAG support. Evaluation metrics employed in the study include accuracy and precision for binary questions and ranking by a human expert, ranking by Google's AI model Gemini, alongside cosine similarity for long-answer questions. GPT-3.5, when paired with RAG, effectively answers binary and long-answer questions, reaffirming its status as an advanced LLM. Regarding open-source LLMs, Mistral AI's Mistral-7b-instruct paired with RAG surpasses the rest in answering both binary and long-answer questions. However, among the open-source LLMs, Orca-mini-v3-7b reports the shortest average latency in generating responses, whereas LLaMa2-7b-chat by Meta reports the highest average latency. This research underscores the fact that open-source LLMs, too, can go hand in hand with proprietary models like GPT-3.5 with better infrastructure.
摘要：检索增强生成（RAG）正在成为一种强大的技术，通过减少幻觉来增强生成式人工智能模型的能力。因此，RAG 与大型语言模型 (LLM) 一起日益突出，引发了人们对比较不同 LLM 在不同领域问答 (QA) 中表现的兴趣。本研究在 RAG 支持下，比较了四个开源 LLM（Mistral-7b-instruct、LLaMa2-7b-chat、Falcon-7b-instruct 和 Orca-mini-v3-7b）以及 OpenAI 当下流行的 GPT-3.5 在计算机科学文献问答任务上的性能。研究中采用的评估指标包括二元问题的准确性和精确度、人类专家的排名、谷歌人工智能模型 Gemini 的排名，以及长答案问题的余弦相似度。GPT-3.5 与 RAG 配合使用时，可以有效回答二元问题和长答案问题，重申其作为高级 LLM 的地位。关于开源 LLM，Mistral AI 的 Mistral-7b-instruct 与 RAG 配合使用，在回答二元和长答案问题方面均优于其他模型。然而，在开源 LLM 中，Orca-mini-v3-7b 报告生成响应的平均延迟最短，而 Meta 的 LLaMa2-7b-chat 报告平均延迟最高。这项研究强调了这样一个事实：在具备更好基础设施的情况下，开源 LLM 也可以与 GPT-3.5 等专有模型齐头并进。

Title: SCALE: Upscaled Continual Learning of Large Language Models

Authors: Jin-woo Lee, Junhwa Choi, Bongkyu Hwang, Jinho Choo, Bogun Kim, JeongSeon Yi, Joonseok Lee, DongYoung Jung, Jaeseon Park, Kyoungwon Park, Suk-hoon Jung
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.03270
Pdf URL: https://arxiv.org/pdf/2511.03270
Copy Paste: [[2511.03270]] SCALE: Upscaled Continual Learning of Large Language Models(https://arxiv.org/abs/2511.03270)
Keywords: language model
Abstract: We revisit continual pre-training for large language models and argue that progress now depends more on scaling the right structure than on scaling parameters alone. We introduce SCALE, a width upscaling architecture that inserts lightweight expansion into linear modules while freezing all pre-trained parameters. This preserves the residual and attention topologies and increases capacity without perturbing the base model's original functionality. SCALE is guided by two principles: Persistent Preservation, which maintains the base model's behavior via preservation-oriented initialization and freezing of the pre-trained weights, and Collaborative Adaptation, which selectively trains a subset of expansion components to acquire new knowledge with minimal interference. We instantiate these ideas as SCALE-Preserve (preservation-first), SCALE-Adapt (adaptation-first), and SCALE-Route, an optional routing extension that performs token-level routing between preservation and adaptation heads. On a controlled synthetic biography benchmark, SCALE mitigates the severe forgetting observed with depth expansion while still acquiring new knowledge. In continual pre-training on a Korean corpus, SCALE variants achieve less forgetting on English evaluations and competitive gains on Korean benchmarks, with these variants offering the best overall stability-plasticity trade-off. Accompanying analysis clarifies when preservation provably holds and why the interplay between preservation and adaptation stabilizes optimization compared to standard continual learning setups.
摘要：我们重新审视大型语言模型的持续预训练，并认为现在的进展更多地取决于扩展正确的结构，而不是仅仅依赖扩展参数。我们引入了 SCALE，一种宽度升级架构，它将轻量级扩展插入到线性模块中，同时冻结所有预先训练的参数。这保留了残差和注意力拓扑并增加了容量，而不会干扰基本模型的原始功能。 SCALE 遵循两个原则：持久保存，通过面向保存的初始化和冻结预训练权重来维持基本模型的行为；协作适应，有选择地训练扩展组件的子集，以最小的干扰获取新知识。我们将这些想法实例化为 SCALE-Preserve（保留优先）、SCALE-Adapt（适应优先）和 SCALE-Route，SCALE-Route 是一个可选的路由扩展，可在保留和适应头之间执行令牌级路由。在受控的合成传记基准上，SCALE 减轻了深度扩展时观察到的严重遗忘，同时仍然获取新知识。在韩语语料库上的持续预训练中，SCALE 变体减少了英语评估的遗忘，并在韩语基准上获得了竞争收益，这些变体提供了最佳的整体稳定性-可塑性权衡。随附的分析阐明了保存何时可被证明成立，以及为什么与标准持续学习设置相比，保存和适应之间的相互作用可以稳定优化。

Title: Benchmarking the Thinking Mode of Multimodal Large Language Models in Clinical Tasks

Authors: Jindong Hong, Tianjie Chen, Lingjie Luo, Chuanyang Zheng, Ting Xu, Haibao Yu, Jianing Qiu, Qianzhong Chen, Suning Huang, Yan Xu, Yong Gui, Yijun He, Jiankai Sun
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2511.03328
Pdf URL: https://arxiv.org/pdf/2511.03328
Copy Paste: [[2511.03328]] Benchmarking the Thinking Mode of Multimodal Large Language Models in Clinical Tasks(https://arxiv.org/abs/2511.03328)
Keywords: language model, llm
Abstract: A recent advancement in Multimodal Large Language Models (MLLMs) research is the emergence of "reasoning MLLMs" that offer explicit control over their internal thinking processes (normally referred to as the "thinking mode") alongside the standard "non-thinking mode". This capability allows these models to engage in a step-by-step process of internal deliberation before generating a final response. With the rapid transition to and adoption of these "dual-state" MLLMs, this work rigorously evaluated how the enhanced reasoning processes of these MLLMs impact model performance and reliability in clinical tasks. This paper evaluates the active "thinking mode" capabilities of two leading MLLMs, Seed1.5-VL and Gemini-2.5-Flash, for medical applications. We assessed their performance on four visual medical tasks using VQA-RAD and ROCOv2 datasets. Our findings reveal that the improvement from activating the thinking mode remains marginal compared to the standard non-thinking mode for the majority of the tasks. Their performance on complex medical tasks such as open-ended VQA and medical image interpretation remains suboptimal, highlighting the need for domain-specific medical data and more advanced methods for medical knowledge integration.
摘要：多模态大型语言模型 (MLLM) 研究的最新进展是“推理 MLLM”的出现，它与标准的“非思维模式”一起提供对其内部思维过程（通常称为“思维模式”）的明确控制。此功能允许这些模型在生成最终响应之前参与逐步的内部审议过程。随着这些“双态”MLLM 的快速过渡和采用，这项工作严格评估了这些 MLLM 的增强推理过程如何影响临床任务中的模型性能和可靠性。本文评估了两种领先的 MLLM（Seed1.5-VL 和 Gemini-2.5-Flash）在医疗应用中的主动“思维模式”功能。我们使用 VQA-RAD 和 ROCOv2 数据集评估了他们在四项视觉医学任务上的表现。我们的研究结果表明，与大多数任务的标准非思维模式相比，激活思维模式所带来的改善仍然微不足道。它们在开放式 VQA 和医学图像解读等复杂医疗任务上的表现仍然不够理想，这凸显了对特定领域医疗数据和更先进的医学知识整合方法的需求。

Title: Silenced Biases: The Dark Side LLMs Learned to Refuse

Authors: Rom Himelstein, Amit LeVi, Brit Youngmann, Yaniv Nemcovsky, Avi Mendelson
Subjects: cs.CL, stat.ML
Abstract URL: https://arxiv.org/abs/2511.03369
Pdf URL: https://arxiv.org/pdf/2511.03369
Copy Paste: [[2511.03369]] Silenced Biases: The Dark Side LLMs Learned to Refuse(https://arxiv.org/abs/2511.03369)
Keywords: language model, llm, prompt
Abstract: Safety-aligned large language models (LLMs) are becoming increasingly widespread, especially in sensitive applications where fairness is essential and biased outputs can cause significant harm. However, evaluating the fairness of models is a complex challenge, and approaches that do so typically utilize standard question-answer (QA) styled schemes. Such methods often overlook deeper issues by interpreting the model's refusal responses as positive fairness measurements, which creates a false sense of fairness. In this work, we introduce the concept of silenced biases, which are unfair preferences encoded within models' latent space and are effectively concealed by safety-alignment. Previous approaches that considered similar indirect biases often relied on prompt manipulation or handcrafted implicit queries, which present limited scalability and risk contaminating the evaluation process with additional biases. We propose the Silenced Bias Benchmark (SBB), which aims to uncover these biases by employing activation steering to reduce model refusals during QA. SBB supports easy expansion to new demographic groups and subjects, presenting a fairness evaluation framework that encourages the future development of fair models and tools beyond the masking effects of alignment training. We demonstrate our approach over multiple LLMs, where our findings expose an alarming distinction between models' direct responses and their underlying fairness issues.
摘要：经过安全对齐的大型语言模型 (LLM) 正变得越来越普遍，特别是在敏感应用中，公平至关重要，而有偏差的输出可能会造成重大损害。然而，评估模型的公平性是一项复杂的挑战，并且这样做的方法通常利用标准问答 (QA) 风格的方案。此类方法通常将模型的拒绝响应解释为积极的公平性测量，从而忽略了更深层次的问题，并产生了错误的公平感。在这项工作中，我们引入了沉默偏差的概念，这是在模型的潜在空间中编码的不公平偏好，并被安全对齐有效地隐藏。以前考虑类似间接偏差的方法通常依赖于提示词操纵或手工构造的隐式查询，这导致可扩展性有限，并且存在因额外偏差而污染评估过程的风险。我们提出了 Silenced Bias Benchmark (SBB)，其目的是通过采用激活引导来减少 QA 期间的模型拒绝，从而发现这些偏差。SBB 支持轻松扩展到新的人口群体和主题，提出了公平性评估框架，鼓励未来开发超越对齐训练的屏蔽效应的公平模型和工具。我们在多个 LLM 上展示了我们的方法，我们的研究结果揭示了模型的直接响应与其潜在的公平问题之间的惊人区别。

Title: EQ-Negotiator: Dynamic Emotional Personas Empower Small Language Models for Edge-Deployable Credit Negotiation

Authors: Yunbo Long, Yuhan Liu, Alexandra Brintrup
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.03370
Pdf URL: https://arxiv.org/pdf/2511.03370
Copy Paste: [[2511.03370]] EQ-Negotiator: Dynamic Emotional Personas Empower Small Language Models for Edge-Deployable Credit Negotiation(https://arxiv.org/abs/2511.03370)
Keywords: language model, llm, agent
Abstract: The deployment of large language models (LLMs) in automated negotiation has set a high performance benchmark, but their computational cost and data privacy requirements render them unsuitable for many privacy-sensitive, on-device applications such as mobile assistants, embodied AI agents or private client interactions. While small language models (SLMs) offer a practical alternative, they suffer from a significant performance gap compared to LLMs in playing emotionally charged complex personas, especially for credit negotiation. This paper introduces EQ-Negotiator, a novel framework that bridges this capability gap using emotional personas. Its core is a reasoning system that integrates game theory with a Hidden Markov Model (HMM) to learn and track debtor emotional states online, without pre-training. This allows EQ-Negotiator to equip SLMs with the strategic intelligence to counter manipulation while de-escalating conflict and upholding ethical standards. Through extensive agent-to-agent simulations across diverse credit negotiation scenarios, including adversarial debtor strategies like cheating, threatening, and playing the victim, we show that a 7B parameter language model with EQ-Negotiator achieves better debt recovery and negotiation efficiency than baseline LLMs more than 10 times its size. This work advances persona modeling from descriptive character profiles to dynamic emotional architectures that operate within privacy constraints. In addition, this paper establishes that strategic emotional intelligence, not raw model scale, is the critical factor for success in automated negotiation, paving the way for effective, ethical, and privacy-preserving AI negotiators that can operate on the edge.
摘要：自动谈判中大语言模型 (LLM) 的部署设定了高性能基准，但其计算成本和数据隐私要求使其不适合许多隐私敏感的设备上应用程序，例如移动助理、具身人工智能代理或私人客户交互。虽然小语言模型 (SLM) 提供了一种实用的替代方案，但与 LLM 相比，它们在扮演充满情感的复杂角色方面存在显着的性能差距，尤其是在信用谈判方面。本文介绍了 EQ-Negotiator，这是一种利用情感角色弥合这种能力差距的新颖框架。其核心是一个推理系统，将博弈论与隐马尔可夫模型（HMM）相结合，在线学习和跟踪债务人的情绪状态，无需预先训练。这使得 EQ-Negotiator 能够为 SLM 配备战略智能，以对抗操纵，同时缓和冲突并维护道德标准。通过在不同的信用谈判场景中进行广泛的代理对代理模拟，包括欺骗、威胁和扮演受害者等对抗性债务人策略，我们表明，带有 EQ-Negotiator 的 7B 参数语言模型比规模超过其 10 倍的基准 LLM 实现了更好的债务追偿和谈判效率。这项工作将角色建模从描述性角色档案推进到在隐私限制下运行的动态情感架构。此外，本文还指出，战略情商（而不是原始模型规模）是自动谈判成功的关键因素，这为高效、道德和保护隐私的人工智能谈判者在边缘设备上运行铺平了道路。

Title: LFC-DA: Logical Formula-Controlled Data Augmentation for Enhanced Logical Reasoning

Authors: Shenghao Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.03372
Pdf URL: https://arxiv.org/pdf/2511.03372
Copy Paste: [[2511.03372]] LFC-DA: Logical Formula-Controlled Data Augmentation for Enhanced Logical Reasoning(https://arxiv.org/abs/2511.03372)
Keywords: language model, llm
Abstract: For complex logical data augmentation, heavy reliance on human annotation is costly, whereas direct generation with large language models yields uninterpretable and logically homogeneous examples. To address this, we present LFC-DA, a symbolic-logic-controlled pipeline: logical text is first mapped to propositional expressions, a compact rule library is compiled, and a bounded state-space search systematically discovers valid formulas that are then verbalized back into natural-language questions, ensuring both diversity and logical rigor under propositional logic. Experiments on ReClor and LogiQA show significant improvements in the logical-reasoning accuracy of pretrained models, confirming the effectiveness of LFC-DA for LLM-guided logical data augmentation.
摘要：对于复杂的逻辑数据增强，严重依赖人工注释的成本很高，而使用大型语言模型直接生成会产生无法解释且逻辑上同质的示例。为了解决这个问题，我们提出了 LFC-DA，一种符号逻辑控制的管道：逻辑文本首先映射到命题表达式，编译紧凑的规则库，有界状态空间搜索系统地发现有效公式，然后将其语言化回自然语言问题，确保命题逻辑下的多样性和逻辑严谨性。 ReClor 和 LogiQA 上的实验表明，预训练模型的逻辑推理准确性显着提高，证实了 LFC-DA 对于 LLM 引导的逻辑数据增强的有效性。

Title: Overcoming the Generalization Limits of SLM Finetuning for Shape-Based Extraction of Datatype and Object Properties

Authors: Célian Ringwald, Fabien Gandon, Catherine Faron, Franck Michel, Hanna Abi Akl
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.03407
Pdf URL: https://arxiv.org/pdf/2511.03407
Copy Paste: [[2511.03407]] Overcoming the Generalization Limits of SLM Finetuning for Shape-Based Extraction of Datatype and Object Properties(https://arxiv.org/abs/2511.03407)
Keywords: language model
Abstract: Small language models (SLMs) have shown promise for relation extraction (RE) when extracting RDF triples guided by SHACL shapes focused on common datatype properties. This paper investigates how SLMs handle both datatype and object properties for a complete RDF graph extraction. We show that the key bottleneck is related to the long-tail distribution of rare properties. To solve this issue, we evaluate several strategies: stratified sampling, weighted loss, dataset scaling, and template-based synthetic data augmentation. We show that the best strategy to perform equally well over unbalanced target properties is to build a training set where the number of occurrences of each property exceeds a given threshold. To enable reproducibility, we publicly released our datasets, experimental results and code. Our findings offer practical guidance for training shape-aware SLMs and highlight promising directions for future work in semantic RE.
摘要：小语言模型 (SLM) 在提取由 SHACL 形状引导的 RDF 三元组（专注于常见数据类型属性）时已显示出关系提取 (RE) 的前景。本文研究了 SLM 如何处理数据类型和对象属性以实现完整的 RDF 图提取。我们表明，关键瓶颈与稀有属性的长尾分布有关。为了解决这个问题，我们评估了几种策略：分层采样、加权损失、数据集缩放和基于模板的合成数据增强。我们表明，在不平衡的目标属性上表现同样出色的最佳策略是构建一个训练集，其中每个属性的出现次数超过给定的阈值。为了实现可重复性，我们公开发布了我们的数据集、实验结果和代码。我们的研究结果为训练形状感知 SLM 提供了实用指导，并强调了语义 RE 未来工作的有希望的方向。
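
The thresholding idea in the abstract above (every property should occur more than a given number of times in the training set) can be sketched as follows. This is only a minimal illustration: the example record layout, the property names, and the use of plain duplication to reach the threshold are assumptions, not the authors' procedure, which according to the abstract also involves dataset scaling and template-based synthetic augmentation.

```python
from collections import Counter


def property_counts(examples):
    """Count in how many examples each property occurs."""
    counts = Counter()
    for example in examples:
        counts.update(set(example["properties"]))
    return counts


def under_threshold(examples, threshold):
    """Properties whose occurrence count does not exceed the threshold."""
    counts = property_counts(examples)
    return sorted(p for p, c in counts.items() if c <= threshold)


def oversample_to_threshold(examples, threshold):
    """Duplicate examples carrying rare properties until every property
    occurs more than `threshold` times (naive stand-in for augmentation)."""
    result = list(examples)
    counts = property_counts(result)
    for prop in sorted(counts):
        carriers = [ex for ex in examples if prop in ex["properties"]]
        i = 0
        while counts[prop] <= threshold:
            chosen = carriers[i % len(carriers)]
            result.append(chosen)
            counts.update(set(chosen["properties"]))
            i += 1
    return result


# Hypothetical toy data: "label" is frequent, the other two are long-tail.
examples = [
    {"text": "a", "properties": ["birthDate", "label"]},
    {"text": "b", "properties": ["label"]},
    {"text": "c", "properties": ["label", "spouse"]},
]
rare_before = under_threshold(examples, threshold=2)
balanced = oversample_to_threshold(examples, threshold=2)
rare_after = under_threshold(balanced, threshold=2)
print(rare_before, len(balanced), rare_after)
```

Duplication is the simplest way to satisfy the threshold, but it also inflates the counts of co-occurring frequent properties, which is one reason synthetic examples targeting a single rare property may be preferable in practice.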

Title: Efficient Reasoning via Thought-Training and Thought-Free Inference

Authors: Canhui Wu, Qiong Cao, Chao Xue, Wei Xi, Xiaodong He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.03408
Pdf URL: https://arxiv.org/pdf/2511.03408
Copy Paste: [[2511.03408]] Efficient Reasoning via Thought-Training and Thought-Free Inference(https://arxiv.org/abs/2511.03408)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Recent advances in large language models (LLMs) have leveraged explicit Chain-of-Thought (CoT) prompting to improve reasoning accuracy. However, most existing methods primarily compress verbose reasoning outputs. These Long-to-Short transformations aim to improve efficiency, but still rely on explicit reasoning during inference. In this work, we introduce 3TF (Thought-Training and Thought-Free inference), a framework for efficient reasoning that takes a Short-to-Long perspective. We first train a hybrid model that can operate in both reasoning and non-reasoning modes, and then further train it on CoT-annotated data to internalize structured reasoning, while enforcing concise, thought-free outputs at inference time using the no-reasoning mode. Unlike compression-based approaches, 3TF improves the reasoning quality of non-reasoning outputs, enabling models to perform rich internal reasoning implicitly while keeping external outputs short. Empirically, 3TF-trained models obtain large improvements on reasoning benchmarks under thought-free inference, demonstrating that high quality reasoning can be learned and executed implicitly without explicit step-by-step generation.
摘要：大型语言模型 (LLM) 的最新进展利用了显式的思维链 (CoT) 提示来提高推理准确性。然而，大多数现有方法主要压缩冗长的推理输出。这些长到短的转换旨在提高效率，但在推理过程中仍然依赖于显式推理。在这项工作中，我们引入了 3TF（Thought-Training and Thought-Free inference），这是一种采用短到长视角的高效推理框架。我们首先训练一个可以在推理和非推理模式下运行的混合模型，然后在 CoT 注释的数据上进一步训练它以内化结构化推理，同时使用非推理模式在推理时强制执行简洁、无思考过程的输出。与基于压缩的方法不同，3TF 提高了非推理输出的推理质量，使模型能够隐式执行丰富的内部推理，同时保持外部输出简短。根据经验，3TF 训练的模型在无思考推理下的推理基准上获得了巨大的改进，这表明可以隐式地学习和执行高质量的推理，而无需显式的逐步生成。

Title: Knowledge-Augmented Question Error Correction for Chinese Question Answer System with QuestionRAG

Authors: Longpeng Qiu, Ting Li, Shuai Mao, Nan Yang, Xiaohui Yan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.03410
Pdf URL: https://arxiv.org/pdf/2511.03410
Copy Paste: [[2511.03410]] Knowledge-Augmented Question Error Correction for Chinese Question Answer System with QuestionRAG(https://arxiv.org/abs/2511.03410)
Keywords: language model, llm
Abstract: Input errors in question-answering (QA) systems often lead to incorrect responses. Large language models (LLMs) struggle with this task, frequently failing to interpret user intent (misinterpretation) or unnecessarily altering the original question's structure (over-correction). We propose QuestionRAG, a framework that tackles these problems. To address misinterpretation, it enriches the input with external knowledge (e.g., search results, related entities). To prevent over-correction, it uses reinforcement learning (RL) to align the model's objective with precise correction, not just paraphrasing. Our results demonstrate that knowledge augmentation is critical for understanding faulty questions. Furthermore, RL-based alignment proves significantly more effective than traditional supervised fine-tuning (SFT), boosting the model's ability to follow instructions and generalize. By integrating these two strategies, QuestionRAG unlocks the full potential of LLMs for the question correction task.
摘要：问答 (QA) 系统中的输入错误通常会导致错误的响应。大型语言模型 (LLM) 很难完成这项任务，经常无法解释用户意图（误解）或不必要地改变原始问题的结构（过度纠正）。我们提出 QuestionRAG，一个解决这些问题的框架。为了解决误解，它用外部知识（例如搜索结果、相关实体）丰富输入。为了防止过度校正，它使用强化学习 (RL) 来使模型的目标与精确校正保持一致，而不仅仅是释义。我们的结果表明，知识增强对于理解错误问题至关重要。此外，事实证明，基于强化学习的对齐比传统的监督微调 (SFT) 更有效，从而提高了模型遵循指令和泛化的能力。通过整合这两种策略，QuestionRAG 释放了 LLM 在问题修正任务中的全部潜力。

Title: CareMedEval dataset: Evaluating Critical Appraisal and Reasoning in the Biomedical Field

Authors: Doria Bonzi, Alexandre Guiggi, Frédéric Béchet, Carlos Ramisch, Benoit Favre
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.03441
Pdf URL: https://arxiv.org/pdf/2511.03441
Copy Paste: [[2511.03441]] CareMedEval dataset: Evaluating Critical Appraisal and Reasoning in the Biomedical Field(https://arxiv.org/abs/2511.03441)
Keywords: language model, llm
Abstract: Critical appraisal of scientific literature is an essential skill in the biomedical field. While large language models (LLMs) can offer promising support in this task, their reliability remains limited, particularly for critical reasoning in specialized domains. We introduce CareMedEval, an original dataset designed to evaluate LLMs on biomedical critical appraisal and reasoning tasks. Derived from authentic exams taken by French medical students, the dataset contains 534 questions based on 37 scientific articles. Unlike existing benchmarks, CareMedEval explicitly evaluates critical reading and reasoning grounded in scientific papers. Benchmarking state-of-the-art generalist and biomedical-specialized LLMs under various context conditions reveals the difficulty of the task: open and commercial models fail to exceed an Exact Match Rate of 0.5 even though generating intermediate reasoning tokens considerably improves the results. Yet, models remain challenged especially on questions about study limitations and statistical analysis. CareMedEval provides a challenging benchmark for grounded reasoning, exposing current LLM limitations and paving the way for future development of automated support for critical appraisal.
摘要：对科学文献的批判性评价是生物医学领域的一项基本技能。虽然大型语言模型 (LLM) 可以在此任务中提供有希望的支持，但它们的可靠性仍然有限，特别是对于专业领域的批判性推理。我们引入 CareMedEval，这是一个原创数据集，旨在评估 LLM 在生物医学批判性评价和推理任务上的能力。该数据集源自法国医学生参加的真实考试，包含基于 37 篇科学文章的 534 个问题。与现有基准不同，CareMedEval 明确评估基于科学论文的批判性阅读和推理。在各种上下文条件下对最先进的通用和生物医学专业 LLM 进行基准测试揭示了任务的难度：开放和商业模型未能超过 0.5 的精确匹配率，尽管生成中间推理标记大大改善了结果。然而，模型仍然面临挑战，特别是在有关研究局限性和统计分析的问题上。CareMedEval 为有据推理提供了一个具有挑战性的基准，暴露了当前 LLM 的局限性，并为未来发展批判性评价的自动化支持铺平了道路。
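
For readers unfamiliar with the Exact Match Rate reported in the CareMedEval abstract above, one plausible reading is sketched below: a question counts as correct only when the predicted set of answer options equals the gold set exactly. The abstract does not define the metric, so the multiple-choice framing, the option letters, and the toy data are assumptions.

```python
def exact_match_rate(predictions, references):
    """Fraction of questions whose predicted option set equals the gold set."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(set(pred) == set(gold) for pred, gold in zip(predictions, references))
    return hits / len(references)


# Hypothetical toy data: four questions, each with one or more correct options.
gold = [{"A", "C"}, {"B"}, {"A", "B", "D"}, {"E"}]
pred = [{"A", "C"}, {"B", "C"}, {"A", "B", "D"}, {"A"}]

# Question 2 selects an extra option and question 4 selects the wrong one,
# so neither earns credit under exact matching.
score = exact_match_rate(pred, gold)
print(score)
```

Because partial overlap earns nothing under this reading, the metric is strict for questions with several correct options.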

Title: Kastor: Fine-tuned Small Language Models for Shape-based Active Relation Extraction

Authors: Ringwald Celian, Gandon Fabien, Faron Catherine, Michel Franck, Abi Akl Hanna
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.03466
Pdf URL: https://arxiv.org/pdf/2511.03466
Copy Paste: [[2511.03466]] Kastor: Fine-tuned Small Language Models for Shape-based Active Relation Extraction(https://arxiv.org/abs/2511.03466)
Keywords: language model
Abstract: RDF pattern-based extraction is a compelling approach for fine-tuning small language models (SLMs) by focusing a relation extraction task on a specified SHACL shape. This technique enables the development of efficient models trained on limited text and RDF data. In this article, we introduce Kastor, a framework that advances this approach to meet the demands for completing and refining knowledge bases in specialized domains. Kastor reformulates the traditional validation task, shifting from single SHACL shape validation to evaluating all possible combinations of properties derived from the shape. By selecting the optimal combination for each training example, the framework significantly enhances model generalization and performance. Additionally, Kastor employs an iterative learning process to refine noisy knowledge bases, enabling the creation of robust models capable of uncovering new, relevant facts.
摘要：基于 RDF 模式的提取是一种引人注目的方法，通过将关系提取任务集中在指定的 SHACL 形状上来微调小语言模型 (SLM)。该技术能够开发在有限文本和 RDF 数据上训练的高效模型。在本文中，我们介绍 Kastor，这是一个框架，它改进了这种方法，以满足补全和完善专业领域知识库的需求。Kastor 重新制定了传统的验证任务，从单一 SHACL 形状验证转变为评估源自形状的所有可能的属性组合。通过为每个训练示例选择最佳组合，该框架显着增强了模型的泛化能力和性能。此外，Kastor 采用迭代学习过程来完善嘈杂的知识库，从而能够创建能够发现新的相关事实的强大模型。

Title: BanglaSTEM: A Parallel Corpus for Technical Domain Bangla-English Translation

Authors: Kazi Reyazul Hasan, Mubasshira Musarrat, A. B. M. Alim Al Islam, Muhammad Abdullah Adnan
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.03498
Pdf URL: https://arxiv.org/pdf/2511.03498
Copy Paste: [[2511.03498]] BanglaSTEM: A Parallel Corpus for Technical Domain Bangla-English Translation(https://arxiv.org/abs/2511.03498)
Keywords: language model
Abstract: Large language models work well for technical problem solving in English but perform poorly when the same questions are asked in Bangla. A simple solution would be to translate Bangla questions into English first and then use these models. However, existing Bangla-English translation systems struggle with technical terms. They often mistranslate specialized vocabulary, which changes the meaning of the problem and leads to wrong answers. We present BanglaSTEM, a dataset of 5,000 carefully selected Bangla-English sentence pairs from STEM fields including computer science, mathematics, physics, chemistry, and biology. We generated over 12,000 translations using language models and then used human evaluators to select the highest quality pairs that preserve technical terminology correctly. We train a T5-based translation model on BanglaSTEM and test it on two tasks: generating code and solving math problems. Our results show significant improvements in translation accuracy for technical content, making it easier for Bangla speakers to use English-focused language models effectively. Both the BanglaSTEM dataset and the trained translation model are publicly released at this https URL.
摘要：大型语言模型非常适合用英语解决技术问题，但在用孟加拉语提出相同问题时表现不佳。一个简单的解决方案是首先将孟加拉语问题翻译成英语，然后使用这些模型。然而，现有的孟加拉英语翻译系统在技术术语方面遇到了困难。他们经常误译专业词汇，从而改变问题的含义并导致错误的答案。我们展示了 BanglaSTEM，这是一个包含 5,000 个精心挑选的孟加拉语-英语句子对的数据集，这些句子对来自 STEM 领域，包括计算机科学、数学、物理、化学和生物学。我们使用语言模型生成了 12,000 多个翻译，然后使用人类评估者来选择正确保留技术术语的最高质量对。我们在 BanglaSTEM 上训练基于 T5 的翻译模型，并在两项任务上对其进行测试：生成代码和解决数学问题。我们的结果表明，技术内容的翻译准确性显着提高，使孟加拉语使用者更容易有效地使用以英语为中心的语言模型。 BanglaSTEM 数据集和经过训练的翻译模型均在此 https URL 公开发布。

Title: HaluMem: Evaluating Hallucinations in Memory Systems of Agents

Authors: Ding Chen, Simin Niu, Kehang Li, Peng Liu, Xiangping Zheng, Bo Tang, Xinchi Li, Feiyu Xiong, Zhiyu Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.03506
Pdf URL: https://arxiv.org/pdf/2511.03506
Copy Paste: [[2511.03506]] HaluMem: Evaluating Hallucinations in Memory Systems of Agents(https://arxiv.org/abs/2511.03506)
Keywords: llm, hallucination, agent
Abstract: Memory systems are key components that enable AI systems such as LLMs and AI agents to achieve long-term learning and sustained interaction. However, during memory storage and retrieval, these systems frequently exhibit memory hallucinations, including fabrication, errors, conflicts, and omissions. Existing evaluations of memory hallucinations are primarily end-to-end question answering, which makes it difficult to localize the operational stage within the memory system where hallucinations arise. To address this, we introduce the Hallucination in Memory Benchmark (HaluMem), the first operation-level hallucination evaluation benchmark tailored to memory systems. HaluMem defines three evaluation tasks (memory extraction, memory updating, and memory question answering) to comprehensively reveal hallucination behaviors across different operational stages of interaction. To support evaluation, we construct user-centric, multi-turn human-AI interaction datasets, HaluMem-Medium and HaluMem-Long. Both include about 15k memory points and 3.5k multi-type questions. The average dialogue length per user reaches 1.5k and 2.6k turns, with context lengths exceeding 1M tokens, enabling evaluation of hallucinations across different context scales and task complexities. Empirical studies based on HaluMem show that existing memory systems tend to generate and accumulate hallucinations during the extraction and updating stages, which subsequently propagate errors to the question answering stage. Future research should focus on developing interpretable and constrained memory operation mechanisms that systematically suppress hallucinations and improve memory reliability.
摘要：记忆系统是使 LLM 和 AI 代理等人工智能系统能够实现长期学习和持续交互的关键组件。然而，在记忆存储和检索过程中，这些系统经常表现出记忆幻觉，包括捏造、错误、冲突和遗漏。现有的记忆幻觉评估主要是端到端的问答，这使得很难定位记忆系统中出现幻觉的操作阶段。为了解决这个问题，我们引入了 Hallucination in Memory Benchmark（HaluMem），这是第一个针对记忆系统量身定制的操作级幻觉评估基准。HaluMem 定义了三个评估任务（记忆提取、记忆更新和记忆问答）来全面揭示交互不同操作阶段的幻觉行为。为了支持评估，我们构建了以用户为中心的多轮人机交互数据集 HaluMem-Medium 和 HaluMem-Long。两者均包含约 15k 记忆点和 3.5k 多种类型的问题。每个用户的平均对话长度达到 1.5k 和 2.6k 回合，上下文长度超过 1M token，从而能够评估不同上下文规模和任务复杂性的幻觉。基于 HaluMem 的实证研究表明，现有的记忆系统在提取和更新阶段容易产生和积累幻觉，随后将错误传播到问答阶段。未来的研究应集中于开发可解释和受约束的记忆操作机制，系统地抑制幻觉并提高记忆可靠性。

Title: One Battle After Another: Probing LLMs' Limits on Multi-Turn Instruction Following with a Benchmark Evolving Framework

Authors: Qi Jia, Kaiwei Zhang, Xiujie Song, Ye Shen, Xiangyang Zhu, Guangtao Zhai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.03508
Pdf URL: https://arxiv.org/pdf/2511.03508
Copy Paste: [[2511.03508]] One Battle After Another: Probing LLMs' Limits on Multi-Turn Instruction Following with a Benchmark Evolving Framework(https://arxiv.org/abs/2511.03508)
Keywords: language model, gpt, llm
Abstract: Understanding how well large language models can follow users' instructions throughout a dialogue spanning multiple topics is of great importance for data-intensive conversational applications. Existing benchmarks are often limited to a fixed number of turns, making them susceptible to saturation and failing to account for the user's interactive experience. In this work, we propose an extensible framework for assessing multi-turn instruction-following ability. At its core, our framework decouples linguistic surface forms from user intent simulation through a three-layer mechanism that tracks constraints, instructions, and topics. This framework mimics User-LLM interaction by enabling the dynamic construction of benchmarks with state changes and tracebacks, terminating a conversation only when the model exhausts a simulated user's patience. We define a suite of metrics capturing the quality of the interaction process. Using this framework, we construct EvolIF, an evolving instruction-following benchmark incorporating nine distinct constraint types. Our results indicate that GPT-5 exhibits superior instruction-following performance. It sustains an average of 18.54 conversational turns and demonstrates 70.31% robustness, outperforming Gemini-2.5-Pro by a significant margin of 11.41%, while other models lag far behind. All of the data and code will be made publicly available online.
摘要：了解大型语言模型在跨越多个主题的对话中如何很好地遵循用户的指令对于数据密集型会话应用程序非常重要。现有的基准测试通常仅限于固定的轮数，这使得它们容易饱和并且无法考虑用户的交互体验。在这项工作中，我们提出了一个可扩展的框架来评估多轮指令跟踪能力。我们的框架的核心是通过跟踪约束、指令和主题的三层机制将语言表面形式与用户意图模拟解耦。该框架通过启用具有状态更改和回溯的动态构建基准来模拟用户-LLM 交互，仅当模型耗尽模拟用户的耐心时才终止对话。我们定义了一套衡量交互过程质量的指标。使用这个框架，我们构建了 EvolIF，一个不断发展的指令跟踪基准，包含九种不同的约束类型。我们的结果表明 GPT-5 表现出卓越的指令跟踪性能。它平均维持 18.54 次会话，表现出 70.31% 的鲁棒性，明显优于 Gemini-2.5-Pro 11.41%，而其他模型则远远落后。所有数据和代码都将在线公开。

Title: SOLVE-Med: Specialized Orchestration for Leading Vertical Experts across Medical Specialties

Authors: Roberta Di Marino, Giovanni Dioguardi, Antonio Romano, Giuseppe Riccio, Mariano Barone, Marco Postiglione, Flora Amato, Vincenzo Moscato
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.03542
Pdf URL: https://arxiv.org/pdf/2511.03542
Copy Paste: [[2511.03542]] SOLVE-Med: Specialized Orchestration for Leading Vertical Experts across Medical Specialties(https://arxiv.org/abs/2511.03542)
Keywords: language model, hallucination, agent
Abstract: Medical question answering systems face deployment challenges including hallucinations, bias, computational demands, privacy concerns, and the need for specialized expertise across diverse domains. Here, we present SOLVE-Med, a multi-agent architecture combining domain-specialized small language models for complex medical queries. The system employs a Router Agent for dynamic specialist selection, ten specialized models (1B parameters each) fine-tuned on specific medical domains, and an Orchestrator Agent that synthesizes responses. Evaluated on Italian medical forum data across ten specialties, SOLVE-Med achieves superior performance with ROUGE-1 of 0.301 and BERTScore F1 of 0.697, outperforming standalone models up to 14B parameters while enabling local deployment. Our code is publicly available on GitHub: this https URL.
摘要：医疗问答系统面临部署挑战，包括幻觉、偏见、计算需求、隐私问题以及跨不同领域的专业知识的需求。在这里，我们提出了 SOLVE-Med，这是一种多代理架构，结合了用于复杂医学查询的领域专用小语言模型。该系统采用用于动态专家选择的路由器代理、针对特定医疗领域进行微调的十个专用模型（每个模型 1B 参数）以及综合响应的协调器代理。根据意大利医学论坛十个专业的数据进行评估，SOLVE-Med 实现了卓越的性能，ROUGE-1 为 0.301，BERTScore F1 为 0.697，在高达 14B 参数的情况下优于独立模型，同时支持本地部署。我们的代码在 GitHub 上公开可用：此 https URL。
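
Note: the ROUGE-1 score quoted above measures unigram overlap between a generated answer and a reference answer. The following is a minimal sketch of a ROUGE-1 F1 computation, assuming plain lowercase whitespace tokenization with no stemming; the function name `rouge1_f1` and the sample strings are hypothetical, and this is not the evaluation code used in the paper.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a reference text."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each token counts at most as often as it
    # appears in both texts.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("the cat sat", "the cat ran"))
```

Published ROUGE implementations usually add stemming and their own tokenization rules, so absolute values from this sketch are not comparable with reported scores.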

Title: MultiZebraLogic: A Multilingual Logical Reasoning Benchmark

Authors: Sofie Helene Bruun, Dan Saattrup Smart
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.03553
Pdf URL: https://arxiv.org/pdf/2511.03553
Copy Paste: [[2511.03553]] MultiZebraLogic: A Multilingual Logical Reasoning Benchmark(https://arxiv.org/abs/2511.03553)
Keywords: language model, gpt, llm
Abstract: Measuring the full abilities of large language models (LLMs) requires benchmarks representing multiple tasks. We aim to create large, high-quality datasets for comparison of logical reasoning skills across several languages and of suitable difficulty for LLMs of various reasoning ability. We explore multiple ways of increasing difficulty. We generate zebra puzzles in multiple languages, themes, and sizes, including 14 different clue types and 8 red herring types (uninformative clues). We find puzzle sizes 2x3 and 4x5 are sufficiently challenging for GPT-4o mini (a non-reasoning model) and o3-mini (a reasoning model), respectively. Including 5 red herrings decreases o3-mini puzzle-level accuracy on 4x5 puzzles by 15 $\pm$ 7%. Scores of o3-mini on 4x5 puzzles are not significantly affected by use of English vs. Danish or the common houses theme vs. the country-specific smoerrebroed theme. We find no correlation between difficulty and the selected clue types. Datasets of 128+1024 puzzles are published as MultiZebraLogic in each of nine Germanic languages for sizes 2x3 and 4x5. We publish code for puzzle generation, designed for adaptability into more languages and themes.
摘要：衡量大型语言模型 (LLM) 的全部能力需要代表多个任务的基准。我们的目标是创建大型、高质量的数据集，用于比较多种语言的逻辑推理技能，并为不同推理能力的法学硕士提供适当的难度。我们探索多种增加难度的方法。我们生成多种语言、主题、尺寸的斑马拼图，包括 14 种不同的线索类型和 8 种转移注意力的类型（无信息线索）。我们发现 2x3 和 4x5 的拼图大小分别对于 GPT-4o mini（非推理模型）和 o3-mini（推理模型）来说足够具有挑战性。包含 5 个红鲱鱼会使 4x5 谜题的 o3-mini 谜题级别准确度降低 15$\pm$7 %。 o3-mini 在 4x5 谜题上的分数并没有受到英语与丹麦语或普通房屋主题与特定国家 smoerrebroed 主题的使用的显着影响。我们发现难度和所选线索类型之间没有相关性。 128+1024 个谜题的数据集以 MultiZebraLogic 形式发布，采用九种日耳曼语言，尺寸为 2x3 和 4x5。我们发布了拼图生成代码，旨在适应更多语言和主题。

Title: AILA--First Experiments with Localist Language Models

Authors: Joachim Diederich
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.03559
Pdf URL: https://arxiv.org/pdf/2511.03559
Copy Paste: [[2511.03559]] AILA--First Experiments with Localist Language Models(https://arxiv.org/abs/2511.03559)
Keywords: language model
Abstract: This paper presents the first empirical demonstration of controllable locality in transformer language models, a novel architectural framework that enables continuous control over the degree of representation localization through a tunable locality dial parameter. Unlike traditional language models that rely exclusively on distributed representations, our approach allows dynamic interpolation between highly interpretable localist encodings and efficient distributed representations without requiring model retraining. We conducted experiments on the WikiText corpus using a two-layer transformer architecture, systematically varying the locality parameter {\lambda} across the full spectrum from 1.0 (fully localist) to 0.0 (fully distributed). Our results demonstrate that localist configurations achieve dramatically lower attention entropy, with {\lambda} = 1.0 yielding 5.36 bits compared to 7.18 bits at {\lambda} = 0.0, while maintaining substantially higher pointer fidelity scores reflecting stronger alignment with rule-specified targets. Prediction experiments reveal that intermediate locality values optimize the tradeoff between interpretability and performance, with {\lambda} = 0.6 achieving test perplexity of 4.65 and accuracy of 84.7%. These findings establish that localist language models provide a practical framework for applications in regulated domains requiring both transparency and capability, offering precise mathematical control over the interpretability-performance spectrum through explicit penalty thresholds and information-theoretic design principles.
摘要：本文提出了 Transformer 语言模型中可控局部性的首次实证演示，这是一种新颖的架构框架，可以通过可调节的局部性拨盘参数连续控制表示局部性的程度。与完全依赖分布式表示的传统语言模型不同，我们的方法允许在高度可解释的本地编码和高效的分布式表示之间进行动态插值，而无需模型重新训练。我们使用两层 Transformer 架构在 WikiText 语料库上进行了实验，系统地将整个范围内的局部性参数 {\lambda} 从 1.0（完全局部主义）到 0.0（完全分布式）变化。我们的结果表明，本地主义配置实现了显着较低的注意力熵，与 {\lambda} = 0.0 时的 7.18 位相比，{\lambda} = 1.0 产生了 5.36 位，同时保持了更高的指针保真度分数，反映出与规则指定目标的更强对齐。预测实验表明，中间局部性值优化了可解释性和性能之间的权衡，{\lambda} = 0.6 实现了 4.65 的测试困惑度和 84.7% 的准确度。这些发现表明，本地主义语言模型为需要透明度和能力的监管领域的应用程序提供了一个实用的框架，通过明确的惩罚阈值和信息论设计原则对可解释性性能范围提供精确的数学控制。
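
Note: the attention entropy figures above are given in bits. As a hedged illustration of the quantity, the sketch below computes the Shannon entropy of a single attention distribution, assuming the standard definition H = -sum(p * log2(p)); the function name `attention_entropy_bits` and the sample weights are hypothetical and are not taken from the paper.

```python
import math

def attention_entropy_bits(weights):
    """Shannon entropy, in bits, of one attention distribution."""
    total = sum(weights)
    # Normalize so the weights form a probability distribution.
    probs = [w / total for w in weights]
    # Zero-probability positions contribute nothing to the entropy.
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [1.0] * 8            # attention spread evenly over 8 tokens
peaked = [0.93] + [0.01] * 7   # attention concentrated on one token

print(attention_entropy_bits(uniform))
print(attention_entropy_bits(peaked))
```

Under this definition, lower entropy means attention is concentrated on fewer positions, which is the sense in which the abstract reports localist configurations as having lower attention entropy.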

Title: ASVRI-Legal: Fine-Tuning LLMs with Retrieval Augmented Generation for Enhanced Legal Regulation

Authors: One Octadion, Bondan Sapta Prakoso, Nanang Yudi Setiawan, Novanto Yudistira
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.03563
Pdf URL: https://arxiv.org/pdf/2511.03563
Copy Paste: [[2511.03563]] ASVRI-Legal: Fine-Tuning LLMs with Retrieval Augmented Generation for Enhanced Legal Regulation(https://arxiv.org/abs/2511.03563)
Keywords: language model, llm, retrieval augmented generation, retrieval-augmented generation
Abstract: In this study, we explore the fine-tuning of Large Language Models (LLMs) to better support policymakers in their crucial work of understanding, analyzing, and crafting legal regulations. To equip the model with a deep understanding of legal texts, we curated a supervised dataset tailored to the specific needs of the legal domain. Additionally, we integrated the Retrieval-Augmented Generation (RAG) method, enabling the LLM to access and incorporate up-to-date legal knowledge from external sources. This combination of fine-tuning and RAG-based augmentation results in a tool that not only processes legal information but actively assists policymakers in interpreting regulations and drafting new ones that align with current needs. The results demonstrate that this approach can significantly enhance the effectiveness of legal research and regulation development, offering a valuable resource in the ever-evolving field of law.
摘要：在这项研究中，我们探索了大型语言模型（LLM）的微调，以更好地支持政策制定者理解、分析和制定法律法规的关键工作。为了使模型能够深入理解法律文本，我们根据法律领域的特定需求定制了一个监督数据集。此外，我们还集成了检索增强生成（RAG）方法，使法学硕士能够访问并整合来自外部来源的最新法律知识。这种微调和基于 RAG 的增强的结合产生了一种工具，该工具不仅可以处理法律信息，还可以积极协助政策制定者解释法规并起草符合当前需求的新法规。结果表明，这种方法可以显着提高法律研究和法规制定的有效性，为不断发展的法律领域提供宝贵的资源。

Title: Step-Audio-EditX Technical Report

Authors: Chao Yan, Boyong Wu, Peng Yang, Pengfei Tan, Guoqiang Hu, Yuxin Zhang, Xiangyu (Tony) Zhang, Fei Tian, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, Gang Yu
Subjects: cs.CL, cs.AI, cs.HC, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2511.03601
Pdf URL: https://arxiv.org/pdf/2511.03601
Copy Paste: [[2511.03601]] Step-Audio-EditX Technical Report(https://arxiv.org/abs/2511.03601)
Keywords: llm
Abstract: We present Step-Audio-EditX, the first open-source LLM-based audio model excelling at expressive and iterative audio editing encompassing emotion, speaking style, and paralinguistics alongside robust zero-shot text-to-speech (TTS). The core innovation lies in leveraging only large-margin synthetic data, which circumvents the need for embedding-based priors or auxiliary modules. This large-margin learning approach enables both iterative control and high expressivity across voices, and represents a fundamental pivot from the conventional focus on representation-level disentanglement. Evaluation results demonstrate that Step-Audio-EditX surpasses both MiniMax-2.6-hd and Doubao-Seed-TTS-2.0 in emotion editing and other fine-grained control tasks.
摘要：我们推出了 Step-Audio-EditX，这是第一个基于 LLM 的开源音频模型，擅长表达和迭代音频编辑，包括情感、说话风格和副语言学，以及强大的零样本文本转语音 (TTS)。该 http URL 核心创新在于仅利用大利润合成数据，从而避免了对基于嵌入的先验或辅助模块的需求。这种大幅学习方法可以实现跨语音的迭代控制和高表达能力，并且代表了传统关注表示级解开的基本支点。评估结果表明，Step-Audio-EditX在情感编辑和其他细粒度控制任务上均超越了MiniMax-2.6-hd和Doubao-Seed-TTS-2.0。

Title: Towards Transparent Stance Detection: A Zero-Shot Approach Using Implicit and Explicit Interpretability

Authors: Apoorva Upadhyaya, Wolfgang Nejdl, Marco Fisichella
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.03635
Pdf URL: https://arxiv.org/pdf/2511.03635
Copy Paste: [[2511.03635]] Towards Transparent Stance Detection: A Zero-Shot Approach Using Implicit and Explicit Interpretability(https://arxiv.org/abs/2511.03635)
Keywords: language model, llm
Abstract: Zero-Shot Stance Detection (ZSSD) identifies the attitude of the post toward unseen targets. Existing research using contrastive, meta-learning, or data augmentation suffers from generalizability issues or lack of coherence between text and target. Recent works leveraging large language models (LLMs) for ZSSD focus either on improving unseen target-specific knowledge or generating explanations for stance analysis. However, most of these works are limited by their over-reliance on explicit reasoning, provide coarse explanations that lack nuance, and do not explicitly model the reasoning process, making it difficult to interpret the model's predictions. To address these issues, in our study, we develop a novel interpretable ZSSD framework, IRIS. We provide an interpretable understanding of the attitude of the input towards the target implicitly based on sequences within the text (implicit rationales) and explicitly based on linguistic measures (explicit rationales). IRIS considers stance detection as an information retrieval ranking task, understanding the relevance of implicit rationales for different stances to guide the model towards correct predictions without requiring the ground-truth of rationales, thus providing inherent interpretability. In addition, explicit rationales based on communicative features help decode the emotional and cognitive dimensions of stance, offering an interpretable understanding of the author's attitude towards the given target. Extensive experiments on the benchmark datasets of VAST, EZ-STANCE, P-Stance, and RFD using 50%, 30%, and even 10% training data prove the generalizability of our model, benefiting from the proposed architecture and interpretable design.
摘要：零射击姿态检测（ZSSD）可识别哨所对看不见的目标的态度。现有的使用对比、元学习或数据增强的研究存在普遍性问题或文本和目标之间缺乏连贯性。最近利用 ZSSD 大型语言模型 (LLM) 的工作重点是改进看不见的特定目标知识或生成立场分析的解释。然而，这些工作大多数都受到过度依赖显式推理的限制，提供缺乏细微差别的粗略解释，并且没有显式地对推理过程进行建模，从而难以解释模型的预测。为了解决这些问题，在我们的研究中，我们开发了一种新颖的可解释 ZSSD 框架 IRIS。我们隐式地基于文本中的序列（隐式基本原理）并显式地基于语言测量（显式基本原理）提供对输入对目标的态度的可解释的理解。 IRIS 将立场检测视为一项信息检索排名任务，了解不同立场的隐含原理的相关性，以引导模型做出正确的预测，而不需要基本原理的基本事实，从而提供固有的可解释性。此外，基于交流特征的明确理由有助于解码立场的情感和认知维度，提供对作者对给定目标的态度的可解释的理解。使用 50%、30% 甚至 10% 训练数据在 VAST、EZ-STANCE、P-Stance 和 RFD 基准数据集上进行的大量实验证明了我们模型的通用性，受益于所提出的架构和可解释的设计。

Title: Do Androids Dream of Unseen Puppeteers? Probing for a Conspiracy Mindset in Large Language Models

Authors: Francesco Corso, Francesco Pierri, Gianmarco De Francisci Morales
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2511.03699
Pdf URL: https://arxiv.org/pdf/2511.03699
Copy Paste: [[2511.03699]] Do Androids Dream of Unseen Puppeteers? Probing for a Conspiracy Mindset in Large Language Models(https://arxiv.org/abs/2511.03699)
Keywords: language model, llm, prompt
Abstract: In this paper, we investigate whether Large Language Models (LLMs) exhibit conspiratorial tendencies, whether they display sociodemographic biases in this domain, and how easily they can be conditioned into adopting conspiratorial perspectives. Conspiracy beliefs play a central role in the spread of misinformation and in shaping distrust toward institutions, making them a critical testbed for evaluating the social fidelity of LLMs. LLMs are increasingly used as proxies for studying human behavior, yet little is known about whether they reproduce higher-order psychological constructs such as a conspiratorial mindset. To bridge this research gap, we administer validated psychometric surveys measuring conspiracy mindset to multiple models under different prompting and conditioning strategies. Our findings reveal that LLMs show partial agreement with elements of conspiracy belief, and conditioning with socio-demographic attributes produces uneven effects, exposing latent demographic biases. Moreover, targeted prompts can easily shift model responses toward conspiratorial directions, underscoring both the susceptibility of LLMs to manipulation and the potential risks of their deployment in sensitive contexts. These results highlight the importance of critically evaluating the psychological dimensions embedded in LLMs, both to advance computational social science and to inform possible mitigation strategies against harmful uses.
摘要：在本文中，我们研究了大型语言模型（LLM）是否表现出阴谋倾向，它们是否在该领域表现出社会人口统计学偏见，以及它们如何容易地接受阴谋观点。阴谋论在错误信息的传播和形成对机构的不信任方面发挥着核心作用，使它们成为评估法学硕士社会忠诚度的关键测试平台。法学硕士越来越多地被用作研究人类行为的代理，但人们对它们是否再现了阴谋心态等高阶心理结构知之甚少。为了弥补这一研究差距，我们在不同的提示和调节策略下对多个模型进行了经过验证的心理测量调查，测量阴谋心态。我们的研究结果表明，法学硕士对阴谋论的要素表现出部分同意，而社会人口统计属性的调节会产生不均匀的影响，暴露出潜在的人口统计学偏见。此外，有针对性的提示可以轻松地将模型响应转向阴谋方向，强调了法学硕士对操纵的敏感性以及在敏感环境中部署其的潜在风险。这些结果凸显了批判性评估法学硕士中嵌入的心理维度的重要性，这既可以推进计算社会科学，也可以为针对有害用途的可能缓解策略提供信息。

Title: Grounded Misunderstandings in Asymmetric Dialogue: A Perspectivist Annotation Scheme for MapTask

Authors: Nan Li, Albert Gatt, Massimo Poesio
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.03718
Pdf URL: https://arxiv.org/pdf/2511.03718
Copy Paste: [[2511.03718]] Grounded Misunderstandings in Asymmetric Dialogue: A Perspectivist Annotation Scheme for MapTask(https://arxiv.org/abs/2511.03718)
Keywords: llm
Abstract: Collaborative dialogue relies on participants incrementally establishing common ground, yet in asymmetric settings they may believe they agree while referring to different entities. We introduce a perspectivist annotation scheme for the HCRC MapTask corpus (Anderson et al., 1991) that separately captures speaker and addressee grounded interpretations for each reference expression, enabling us to trace how understanding emerges, diverges, and repairs over time. Using a scheme-constrained LLM annotation pipeline, we obtain 13k annotated reference expressions with reliability estimates and analyze the resulting understanding states. The results show that full misunderstandings are rare once lexical variants are unified, but multiplicity discrepancies systematically induce divergences, revealing how apparent grounding can mask referential misalignment. Our framework provides both a resource and an analytic lens for studying grounded misunderstanding and for evaluating (V)LLMs' capacity to model perspective-dependent grounding in collaborative dialogue.
摘要：协作对话依赖于参与者逐步建立共同点，但在不对称环境中，他们可能认为自己在提及不同实体时达成了共识。我们为 HCRC MapTask 语料库（Anderson 等人，1991）引入了一种透视主义注释方案，该方案分别捕获说话者和收件人对每个参考表达的基于解释，使我们能够跟踪理解如何随着时间的推移而出现、分歧和修复。使用方案约束的 LLM 注释管道，我们获得了 13k 个带有可靠性估计的注释参考表达式，并分析了由此产生的理解状态。结果表明，一旦词汇变体统一，完全误解的情况就很少见，但多重性差异会系统地引起分歧，揭示了明显的基础如何掩盖指称错位。我们的框架提供了资源和分析视角，用于研究有根据的误解和评估（V）法学硕士在协作对话中对依赖观点的基础进行建模的能力。