2025-11-25

Title: SCARE: A Benchmark for SQL Correction and Question Answerability Classification for Reliable EHR Question Answering

Authors: Gyubok Lee, Woosog Chay, Edward Choi
Subjects: cs.CL, cs.DB
Abstract URL: https://arxiv.org/abs/2511.17559
Pdf URL: https://arxiv.org/pdf/2511.17559
Copy Paste: [[2511.17559]] SCARE: A Benchmark for SQL Correction and Question Answerability Classification for Reliable EHR Question Answering(https://arxiv.org/abs/2511.17559)
Keywords: language model, llm, agent
Abstract: Recent advances in Large Language Models (LLMs) have enabled the development of text-to-SQL models that allow clinicians to query structured data stored in Electronic Health Records (EHRs) using natural language. However, deploying these models for EHR question answering (QA) systems in safety-critical clinical environments remains challenging: incorrect SQL queries, whether caused by model errors or problematic user inputs, can undermine clinical decision-making and jeopardize patient care. While prior work has mainly focused on improving SQL generation accuracy or filtering questions before execution, there is a lack of a unified benchmark for evaluating independent post-hoc verification mechanisms (i.e., a component that inspects and validates the generated SQL before execution), which is crucial for safe deployment. To fill this gap, we introduce SCARE, a benchmark for evaluating methods that function as a post-hoc safety layer in EHR QA systems. SCARE evaluates the joint task of (1) classifying question answerability (i.e., determining whether a question is answerable, ambiguous, or unanswerable) and (2) verifying or correcting candidate SQL queries. The benchmark comprises 4,200 triples of questions, candidate SQL queries, and expected model outputs, grounded in the MIMIC-III, MIMIC-IV, and eICU databases. It covers a diverse set of questions and corresponding candidate SQL queries generated by seven different text-to-SQL models, ensuring a realistic and challenging evaluation. Using SCARE, we benchmark a range of approaches, from two-stage methods to agentic frameworks. Our experiments reveal a critical trade-off between question classification and SQL error correction, highlighting key challenges and outlining directions for future research.

Title: $A^3$: Attention-Aware Accurate KV Cache Fusion for Fast Large Language Model Serving

Authors: Yuechi Zhou, Yi Su, Jianxin Zhang, Juntao Li, Qingrong Xia, Zhefeng Wang, Xinyu Duan, Baoxing Huai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.17560
Pdf URL: https://arxiv.org/pdf/2511.17560
Copy Paste: [[2511.17560]] $A^3$: Attention-Aware Accurate KV Cache Fusion for Fast Large Language Model Serving(https://arxiv.org/abs/2511.17560)
Keywords: language model, llm, long context, retrieval-augmented generation
Abstract: Large language models (LLMs) have demonstrated strong capabilities in processing long contexts, enabling them to tackle tasks involving long textual inputs such as multi-turn conversations, legal documents, or retrieved documents in Retrieval-Augmented Generation (RAG) systems. However, despite their ability to handle long sequences, the resulting decoding latency and memory overhead remain substantial, posing challenges for real-world deployment. Recent advances in KV Cache reuse have shown potential to mitigate these costs, but still suffer from notable performance degradation. To address this issue, we conduct an in-depth investigation of recomputation-based reuse methods and observe that the recomputed tokens often fail to align with the context segments most relevant to the question. This misalignment hinders proper updates to the critical contextual representations. Therefore, we propose the Attention-Aware Accurate KV Cache Fusion algorithm ($A^3$), which precomputes and selectively fuses the KV Cache of text chunks based on their relevance to the question, achieving accurate integration with minimal computational overhead. Extensive experiments on various benchmarks and LLMs demonstrate that $A^3$ achieves the best task performance compared to four baselines while reducing the time-to-first-token (TTFT) by $2\times$.

Title: LexInstructEval: Lexical Instruction Following Evaluation for Large Language Models

Authors: Huimin Ren, Yan Liang, Baiqiao Su, Chaobo Sun, Hengtong Lu, Kaike Zhang, Chen Wei
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.17561
Pdf URL: https://arxiv.org/pdf/2511.17561
Copy Paste: [[2511.17561]] LexInstructEval: Lexical Instruction Following Evaluation for Large Language Models(https://arxiv.org/abs/2511.17561)
Keywords: language model, llm
Abstract: The ability of Large Language Models (LLMs) to precisely follow complex and fine-grained lexical instructions is a cornerstone of their utility and controllability. However, evaluating this capability remains a significant challenge. Current methods either rely on subjective and costly human evaluation or on automated LLM-as-a-judge systems, which suffer from inherent biases and unreliability. Existing programmatic benchmarks, while objective, often lack the expressiveness to test intricate, compositional constraints at a granular level. To address these limitations, we introduce LexInstructEval, a new benchmark and evaluation framework for fine-grained lexical instruction following. Our framework is built upon a formal, rule-based grammar that deconstructs complex instructions into a canonical <Procedure, Relation, Value> triplet. This grammar enables the systematic generation of a diverse dataset through a multi-stage, human-in-the-loop pipeline and facilitates objective verification via a transparent, programmatic engine. We release our dataset and open-source evaluation tools to facilitate further research into the controllability and reliability of LLMs.
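
Illustration (not from the paper): the abstract describes rule-based verification of instructions decomposed into a canonical triplet of procedure, relation, and value. A minimal sketch of what such a programmatic check could look like is given below. The procedure names (`word_count`, `sentence_count`, `occurrences_of_data`) and the functions `check` and `check_all` are hypothetical and chosen only for this example; the actual grammar and engine released by the authors may differ substantially.

```python
import operator
import re

# Hypothetical procedures: each maps a response string to a number.
PROCEDURES = {
    "word_count": lambda text: len(text.split()),
    "sentence_count": lambda text: len(
        [s for s in re.split(r"[.!?]+", text) if s.strip()]
    ),
    "occurrences_of_data": lambda text: len(re.findall(r"\bdata\b", text.lower())),
}

# Relations compare the measured quantity with the target value.
RELATIONS = {
    "==": operator.eq,
    "<=": operator.le,
    ">=": operator.ge,
    "<": operator.lt,
    ">": operator.gt,
}


def check(triplet, response):
    """Return True if the response satisfies one (procedure, relation, value) triplet."""
    procedure, relation, value = triplet
    measured = PROCEDURES[procedure](response)
    return RELATIONS[relation](measured, value)


def check_all(triplets, response):
    """A compositional instruction is satisfied only if every triplet holds."""
    return all(check(t, response) for t in triplets)


response = "Data is key. Clean data wins."
constraints = [("word_count", "<=", 10), ("sentence_count", "==", 2)]
print(check_all(constraints, response))
```

Because every constraint reduces to a deterministic measurement and a comparison, a check of this kind needs no human or LLM judge, which is the property the abstract emphasizes.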

Title: Generative Caching for Structurally Similar Prompts and Responses

Authors: Sarthak Chakraborty, Suman Nath, Xuchao Zhang, Chetan Bansal, Indranil Gupta
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.17565
Pdf URL: https://arxiv.org/pdf/2511.17565
Copy Paste: [[2511.17565]] Generative Caching for Structurally Similar Prompts and Responses(https://arxiv.org/abs/2511.17565)
Keywords: language model, llm, prompt, agent
Abstract: Large Language Models (LLMs) are increasingly being used to plan, reason, and execute tasks across diverse scenarios. In use cases like repeatable workflows and agentic settings, prompts are often reused with minor variations while having a similar structure for recurring tasks. This opens up opportunities for caching. However, exact prompt matching fails on such structurally similar prompts, while semantic caching may produce incorrect responses by ignoring critical differences. To address this, we introduce a generative cache that produces variation-aware responses for structurally similar prompts. Our method identifies reusable response patterns across similar prompt structures and synthesizes customized outputs for new requests. We show that it achieves an 83% cache hit rate, while having minimal incorrect hits on datasets without prompt repetition. In agentic workflows, it improves cache hit rate by ~20% and reduces end-to-end execution latency by ~34% compared to standard prompt matching.

Title: Community-Aligned Behavior Under Uncertainty: Evidence of Epistemic Stance Transfer in LLMs

Authors: Patrick Gerard, Aiden Chang, Svitlana Volkova
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2511.17572
Pdf URL: https://arxiv.org/pdf/2511.17572
Copy Paste: [[2511.17572]] Community-Aligned Behavior Under Uncertainty: Evidence of Epistemic Stance Transfer in LLMs(https://arxiv.org/abs/2511.17572)
Keywords: language model, llm
Abstract: When large language models (LLMs) are aligned to a specific online community, do they exhibit generalizable behavioral patterns that mirror that community's attitudes and responses to new uncertainty, or are they simply recalling patterns from training data? We introduce a framework to test epistemic stance transfer: targeted deletion of event knowledge, validated with multiple probes, followed by evaluation of whether models still reproduce the community's organic response patterns under ignorance. Using Russian--Ukrainian military discourse and U.S. partisan Twitter data, we find that even after aggressive fact removal, aligned LLMs maintain stable, community-specific behavioral patterns for handling uncertainty. These results provide evidence that alignment encodes structured, generalizable behaviors beyond surface mimicry. Our framework offers a systematic way to detect behavioral biases that persist under ignorance, advancing efforts toward safer and more transparent LLM deployments.

Title: Random Text, Zipf's Law, Critical Length, and Implications for Large Language Models

Authors: Vladimir Berman
Subjects: cs.CL, stat.ME, stat.ML, stat.OT
Abstract URL: https://arxiv.org/abs/2511.17575
Pdf URL: https://arxiv.org/pdf/2511.17575
Copy Paste: [[2511.17575]] Random Text, Zipf's Law, Critical Length, and Implications for Large Language Models(https://arxiv.org/abs/2511.17575)
Keywords: language model
Abstract: We study a deliberately simple, fully non-linguistic model of text: a sequence of independent draws from a finite alphabet of letters plus a single space symbol. A word is defined as a maximal block of non-space symbols. Within this symbol-level framework, which assumes no morphology, syntax, or semantics, we derive several structural results. First, word lengths follow a geometric distribution governed solely by the probability of the space symbol. Second, the expected number of words of a given length, and the expected number of distinct words of that length, admit closed-form expressions based on a coupon-collector argument. This yields a critical word length k* at which word types transition from appearing many times on average to appearing at most once. Third, combining the exponential growth of the number of possible strings of length k with the exponential decay of the probability of each string, we obtain a Zipf-type rank-frequency law p(r) proportional to r^{-alpha}, with an exponent determined explicitly by the alphabet size and the space probability. Our contribution is twofold. Mathematically, we give a unified derivation linking word lengths, vocabulary growth, critical length, and rank-frequency structure in a single explicit model. Conceptually, we argue that this provides a structurally grounded null model for both natural-language word statistics and token statistics in large language models. The results show that Zipf-like patterns can arise purely from combinatorics and segmentation, without optimization principles or linguistic organization, and help clarify which phenomena require deeper explanation beyond random-text structure.
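
Illustration (not from the paper): the random-text model in the abstract can be sketched numerically. The code below assumes, purely for simplicity, that all m letters are equiprobable and that the space symbol has probability q; the abstract does not state this restriction. Under that assumption a specific word of length k has per-word probability q(1-q)^(k-1)/m^k, there are m^k words of length k, and so the rank-frequency exponent is alpha = 1 - ln(1-q)/ln(m). The function names and the particular definition of the critical length (the length at which a specific word is expected exactly once among N words) are hypothetical choices for this example and may not match the paper's exact expressions.

```python
import math
import random


def zipf_exponent(alphabet_size, space_prob):
    """alpha = 1 - ln(1 - q) / ln(m), assuming m equiprobable letters."""
    return 1.0 - math.log(1.0 - space_prob) / math.log(alphabet_size)


def expected_count(n_words, k, alphabet_size, space_prob):
    """Expected occurrences of one specific word of length k among n_words words."""
    q, m = space_prob, alphabet_size
    return n_words * q * (1.0 - q) ** (k - 1) / m ** k


def critical_length(n_words, alphabet_size, space_prob):
    """Length k* at which expected_count equals 1 (assumed definition)."""
    q, m = space_prob, alphabet_size
    return math.log(n_words * q / (1.0 - q)) / math.log(m / (1.0 - q))


def sample_words(n_symbols, alphabet_size, space_prob, seed=0):
    """Draw independent symbols and split into maximal non-space blocks."""
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"[:alphabet_size]
    text = "".join(
        " " if rng.random() < space_prob else rng.choice(letters)
        for _ in range(n_symbols)
    )
    return text.split()


words = sample_words(200_000, 26, 0.2)
mean_len = sum(len(w) for w in words) / len(words)
print("mean word length:", mean_len)  # geometric lengths: should be near 1/q = 5
print("Zipf exponent:", zipf_exponent(26, 0.2))
print("critical length:", critical_length(len(words), 26, 0.2))
```

The exponent is slightly above 1 for realistic alphabet sizes, which matches the abstract's claim that a Zipf-like law can emerge from combinatorics and segmentation alone.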

Title: Computational frame analysis revisited: On LLMs for studying news coverage

Authors: Sharaj Kunjar, Alyssa Hasegawa Smith, Tyler R Mckenzie, Rushali Mohbe, Samuel V Scarpino, Brooke Foucault Welles
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.17746
Pdf URL: https://arxiv.org/pdf/2511.17746
Copy Paste: [[2511.17746]] Computational frame analysis revisited: On LLMs for studying news coverage(https://arxiv.org/abs/2511.17746)
Keywords: language model, gpt, llm
Abstract: Computational approaches have previously shown various promises and pitfalls when it comes to the reliable identification of media frames. Generative LLMs like GPT and Claude are increasingly being used as content analytical tools, but how effective are they for frame analysis? We address this question by systematically evaluating them against their computational predecessors: bag-of-words models and encoder-only transformers; and traditional manual coding procedures. Our analysis rests on a novel gold standard dataset that we inductively and iteratively developed through the study, investigating six months of news coverage of the US Mpox epidemic of 2022. While we discover some potential applications for generative LLMs, we demonstrate that they were consistently outperformed by manual coders, and in some instances, by smaller language models. Some form of human validation was always necessary to determine appropriate model choice. Additionally, by examining how the suitability of various approaches depended on the nature of different tasks that were part of our frame analytical workflow, we provide insights as to how researchers may leverage the complementarity of these approaches to use them in tandem. We conclude by endorsing a methodologically pluralistic approach and put forth a roadmap for computational frame analysis for researchers going forward.

Title: PoETa v2: Toward More Robust Evaluation of Large Language Models in Portuguese

Authors: Thales Sales Almeida, Rodrigo Nogueira, Hélio Pedrini
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.17808
Pdf URL: https://arxiv.org/pdf/2511.17808
Copy Paste: [[2511.17808]] PoETa v2: Toward More Robust Evaluation of Large Language Models in Portuguese(https://arxiv.org/abs/2511.17808)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) exhibit significant variations in performance across linguistic and cultural contexts, underscoring the need for systematic evaluation in diverse languages. In this work, we present the most extensive evaluation of LLMs for the Portuguese language to date. Leveraging our newly introduced PoETa v2 benchmark -- a comprehensive suite of over 40 tasks in Portuguese -- we assess more than 20 models covering a broad spectrum of training scales and computational resources. Our study reveals how computational investment and language-specific adaptation impact performance in Portuguese, while also analyzing performance gaps in comparison to equivalent tasks in English. Through this benchmark and analysis, PoETa v2 lays the groundwork for future research on Portuguese language modeling and evaluation. The benchmark is available at this https URL.

Title: Point of Order: Action-Aware LLM Persona Modeling for Realistic Civic Simulation

Authors: Scott Merrill, Shashank Srivastava
Subjects: cs.CL, cs.AI, cs.LG, cs.SD
Abstract URL: https://arxiv.org/abs/2511.17813
Pdf URL: https://arxiv.org/pdf/2511.17813
Copy Paste: [[2511.17813]] Point of Order: Action-Aware LLM Persona Modeling for Realistic Civic Simulation(https://arxiv.org/abs/2511.17813)
Keywords: language model, llm
Abstract: Large language models offer opportunities to simulate multi-party deliberation, but realistic modeling remains limited by a lack of speaker-attributed data. Transcripts produced via automatic speech recognition (ASR) assign anonymous speaker labels (e.g., Speaker_1), preventing models from capturing consistent human behavior. This work introduces a reproducible pipeline to transform public Zoom recordings into speaker-attributed transcripts with metadata like persona profiles and pragmatic action tags (e.g., [propose_motion]). We release three local government deliberation datasets: Appellate Court hearings, School Board meetings, and Municipal Council sessions. Fine-tuning LLMs to model specific participants using this "action-aware" data produces a 67% reduction in perplexity and nearly doubles classifier-based performance metrics for speaker fidelity and realism. Turing-style human evaluations show our simulations are often indistinguishable from real deliberations, providing a practical and scalable method for complex realistic civic simulations.

Title: A superpersuasive autonomous policy debating system

Authors: Allen Roush, Devin Gonier, John Hines, Judah Goldfeder, Philippe Martin Wyder, Sanjay Basu, Ravid Shwartz Ziv
Subjects: cs.CL, cs.AI, cs.CY, cs.HC, cs.MA
Abstract URL: https://arxiv.org/abs/2511.17854
Pdf URL: https://arxiv.org/pdf/2511.17854
Copy Paste: [[2511.17854]] A superpersuasive autonomous policy debating system(https://arxiv.org/abs/2511.17854)
Keywords: llm, agent
Abstract: The capacity for highly complex, evidence-based, and strategically adaptive persuasion remains a formidable challenge for artificial intelligence. Previous work, like IBM Project Debater, focused on generating persuasive speeches in simplified and shortened debate formats intended for relatively lay audiences. We introduce DeepDebater, a novel autonomous system capable of participating in and winning a full, unmodified, two-team competitive policy debate. Our system employs a hierarchical architecture of specialized multi-agent workflows, where teams of LLM-powered agents collaborate and critique one another to perform discrete argumentative tasks. Each workflow utilizes iterative retrieval, synthesis, and self-correction using a massive corpus of policy debate evidence (OpenDebateEvidence) and produces complete speech transcripts, cross-examinations, and rebuttals. We introduce a live, interactive end-to-end presentation pipeline that renders debates with AI speech and animation: transcripts are surface-realized and synthesized to audio with OpenAI TTS, and then displayed as talking-head portrait videos with EchoMimic V1. Beyond fully autonomous matches (AI vs AI), DeepDebater supports hybrid human-AI operation: human debaters can intervene at any stage, and humans can optionally serve as opponents against AI in any speech, allowing AI-human and AI-AI rounds. In preliminary evaluations against human-authored cases, DeepDebater produces qualitatively superior argumentative components and consistently wins simulated rounds as adjudicated by an independent autonomous judge. Expert human debate coaches also prefer the arguments, evidence, and cases constructed by DeepDebater. We open source all code, generated speech transcripts, audio, and talking-head video here: this https URL

Title: Principled Context Engineering for RAG: Statistical Guarantees via Conformal Prediction

Authors: Debashish Chakraborty, Eugene Yang, Daniel Khashabi, Dawn Lawrie, Kevin Duh
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2511.17908
Pdf URL: https://arxiv.org/pdf/2511.17908
Copy Paste: [[2511.17908]] Principled Context Engineering for RAG: Statistical Guarantees via Conformal Prediction(https://arxiv.org/abs/2511.17908)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) enhances factual grounding in large language models (LLMs) by incorporating retrieved evidence, but LLM accuracy declines when long or noisy contexts exceed the model's effective attention span. Existing pre-generation filters rely on heuristics or uncalibrated LLM confidence scores, offering no statistical control over retained evidence. We evaluate and demonstrate context engineering through conformal prediction, a coverage-controlled filtering framework that removes irrelevant content while preserving recall of supporting evidence. Using both embedding- and LLM-based scoring functions, we test this approach on the NeuCLIR and RAGTIME collections. Conformal filtering consistently meets its target coverage, ensuring that a specified fraction of relevant snippets are retained, and reduces retained context by 2-3x relative to unfiltered retrieval. On NeuCLIR, downstream factual accuracy measured by ARGUE F1 improves under strict filtering and remains stable at moderate coverage, indicating that most discarded material is redundant or irrelevant. These results demonstrate that conformal prediction enables reliable, coverage-controlled context reduction in RAG, offering a model-agnostic and principled approach to context engineering.
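
Illustration (not from the paper): the coverage guarantee described in the abstract can be sketched with a standard split-conformal threshold. The code assumes a held-out calibration set of relevance scores for snippets known to be relevant, and that calibration and test snippets are exchangeable. The threshold is the k-th smallest calibration score with k = floor(alpha * (n + 1)), so a new relevant snippet scores at or above it with probability at least 1 - alpha. The function names are hypothetical, and the scoring functions and calibration protocol used by the authors are not specified here.

```python
import math


def conformal_threshold(relevant_scores, alpha):
    """Score cutoff that retains a new relevant snippet with probability >= 1 - alpha.

    relevant_scores: calibration scores of snippets known to be relevant
    alpha: tolerated miss rate (target coverage is 1 - alpha)
    """
    n = len(relevant_scores)
    k = math.floor(alpha * (n + 1))
    if k < 1:
        # Too few calibration points for this alpha: keep everything.
        return float("-inf")
    return sorted(relevant_scores)[k - 1]


def filter_snippets(snippets, scores, threshold):
    """Keep only snippets whose relevance score reaches the threshold."""
    return [s for s, sc in zip(snippets, scores) if sc >= threshold]


calibration = list(range(1, 20))  # 19 toy calibration scores
cutoff = conformal_threshold(calibration, alpha=0.25)
kept = filter_snippets(["a", "b", "c"], [2, 5, 9], cutoff)
print(cutoff, kept)
```

A lower alpha gives a lower cutoff and therefore retains more context, which mirrors the trade-off between strict filtering and moderate coverage mentioned in the abstract.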

Title: L2V-CoT: Cross-Modal Transfer of Chain-of-Thought Reasoning via Latent Intervention

Authors: Yuliang Zhan, Xinyu Tang, Han Wan, Jian Li, Ji-Rong Wen, Hao Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.17910
Pdf URL: https://arxiv.org/pdf/2511.17910
Copy Paste: [[2511.17910]] L2V-CoT: Cross-Modal Transfer of Chain-of-Thought Reasoning via Latent Intervention(https://arxiv.org/abs/2511.17910)
Keywords: language model, llm, chain-of-thought
Abstract: Recently, Chain-of-Thought (CoT) reasoning has significantly enhanced the capabilities of large language models (LLMs), but Vision-Language Models (VLMs) still struggle with multi-step reasoning tasks due to limited multimodal reasoning data. To bridge this gap, researchers have explored methods to transfer CoT reasoning from LLMs to VLMs. However, existing approaches either need high training costs or require architectural alignment. In this paper, we use Linear Artificial Tomography (LAT) to empirically show that LLMs and VLMs share similar low-frequency latent representations of CoT reasoning despite architectural differences. Based on this insight, we propose L2V-CoT, a novel training-free latent intervention approach that transfers CoT reasoning from LLMs to VLMs. L2V-CoT extracts and resamples low-frequency CoT representations from LLMs in the frequency domain, enabling dimension matching and latent injection into VLMs during inference to enhance reasoning capabilities. Extensive experiments demonstrate that our approach consistently outperforms training-free baselines and even surpasses supervised methods.

Title: Towards Efficient LLM-aware Heterogeneous Graph Learning

Authors: Wenda Li, Tongya Zheng, Shunyu Liu, Yu Wang, Kaixuan Chen, Hanyang Yuan, Bingde Hu, Zujie Ren, Mingli Song, Gang Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.17923
Pdf URL: https://arxiv.org/pdf/2511.17923
Copy Paste: [[2511.17923]] Towards Efficient LLM-aware Heterogeneous Graph Learning(https://arxiv.org/abs/2511.17923)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Heterogeneous graphs are widely present in real-world complex networks, where the diversity of node and relation types leads to complex and rich semantics. Efforts for modeling complex relation semantics in heterogeneous graphs are restricted by the limitations of predefined semantic dependencies and the scarcity of supervised signals. The advanced pre-training and fine-tuning paradigm leverages graph structure to provide rich self-supervised signals, but introduces semantic gaps between tasks. Large Language Models (LLMs) offer significant potential to address the semantic issues of relations and tasks in heterogeneous graphs through their strong reasoning capabilities in textual modality, but their incorporation into heterogeneous graphs is largely limited by computational complexity. Therefore, in this paper, we propose an Efficient LLM-Aware (ELLA) framework for heterogeneous graphs, addressing the above issues. To capture complex relation semantics, we propose an LLM-aware Relation Tokenizer that leverages an LLM to encode multi-hop, multi-type relations. To reduce computational complexity, we further employ a Hop-level Relation Graph Transformer, which helps reduce the complexity of LLM-aware relation reasoning from exponential to linear. To bridge semantic gaps between pre-training and fine-tuning tasks, we introduce fine-grained task-aware textual Chain-of-Thought (CoT) prompts. Extensive experiments on four heterogeneous graphs show that our proposed ELLA outperforms state-of-the-art methods in both performance and efficiency. In particular, ELLA scales up to 13B-parameter LLMs and achieves up to a 4x speedup compared with existing LLM-based methods. Our code is publicly available at this https URL.

Title: SPINE: Token-Selective Test-Time Reinforcement Learning with Entropy-Band Regularization

Authors: Jianghao Wu, Yasmeen George, Jin Ye, Yicheng Wu, Daniel F. Schmidt, Jianfei Cai
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.17938
Pdf URL: https://arxiv.org/pdf/2511.17938
Copy Paste: [[2511.17938]] SPINE: Token-Selective Test-Time Reinforcement Learning with Entropy-Band Regularization(https://arxiv.org/abs/2511.17938)
Keywords: language model, llm, chain-of-thought
Abstract: Large language models (LLMs) and multimodal LLMs (MLLMs) excel at chain-of-thought reasoning but face distribution shift at test-time and a lack of verifiable supervision. Recent test-time reinforcement learning (TTRL) methods derive label-free pseudo-rewards from self-consistency voting over sampled trajectories, yet they often collapse: the majority-vote reward prevails, responses shorten, and Pass@1 declines. We trace this to uniform sequence updates in which most tokens are low-entropy followers, while a small high-entropy subset determines the reasoning branches. Thus we propose SPINE, a token-selective test-time reinforcement learning framework that (i) updates only forking tokens, the high-entropy branch points identified from forward-pass statistics, and (ii) applies an entropy-band regularizer at those tokens to sustain exploration when entropy is too low and to suppress noisy supervision when it is too high. SPINE plugs into GRPO-style objectives, optionally with a KL anchor, and requires no labels or reward models. Across ten benchmarks spanning multimodal VQA, general and expert QA, mathematical reasoning, and medical QA, SPINE consistently improves Pass@1 over TTRL while avoiding response-length collapse and yielding more stable training dynamics on both LLM and MLLM backbones. These results indicate that aligning updates with chain-of-thought branch points is a simple and label-free mechanism for stable and effective test-time adaptation in reasoning models. Code is available at this https URL.
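
Illustration (not from the paper): the two ingredients named in the abstract, selecting high-entropy "forking" tokens and penalizing entropy outside a band, can be sketched in plain Python. Everything concrete here is an assumption made for the example: the top-fraction selection rule, the hinge-shaped penalty, and the parameter names `top_fraction`, `low`, and `high`. The abstract does not specify how SPINE identifies forking tokens or the exact form of its regularizer.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def forking_mask(entropies, top_fraction=0.2):
    """Mark the highest-entropy positions as forking tokens (assumed rule)."""
    k = max(1, round(top_fraction * len(entropies)))
    cutoff = sorted(entropies, reverse=True)[k - 1]
    return [h >= cutoff for h in entropies]


def band_penalty(entropy, low, high):
    """Zero inside [low, high]; grows linearly outside the band (assumed form)."""
    return max(0.0, low - entropy) + max(0.0, entropy - high)


entropies = [0.1, 1.2, 0.05, 0.9, 0.2]
mask = forking_mask(entropies, top_fraction=0.4)
# Only masked positions would contribute to the update and to the regularizer.
penalty = sum(
    band_penalty(h, low=0.3, high=1.0) for h, m in zip(entropies, mask) if m
)
print(mask, penalty)
```

In a real training loop the mask would gate the per-token policy-gradient loss, so low-entropy follower tokens receive no update.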

Title: Measuring the Impact of Lexical Training Data Coverage on Hallucination Detection in Large Language Models

Authors: Shuo Zhang, Fabrizio Gotti, Fengran Mo, Jian-Yun Nie
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.17946
Pdf URL: https://arxiv.org/pdf/2511.17946
Copy Paste: [[2511.17946]] Measuring the Impact of Lexical Training Data Coverage on Hallucination Detection in Large Language Models(https://arxiv.org/abs/2511.17946)
Keywords: language model, llm, hallucination, prompt
Abstract: Hallucination in large language models (LLMs) is a fundamental challenge, particularly in open-domain question answering. Prior work attempts to detect hallucination with model-internal signals such as token-level entropy or generation consistency, while the connection between pretraining data exposure and hallucination is underexplored. Existing studies show that LLMs underperform on long-tail knowledge, i.e., the accuracy of the generated answer drops for the ground-truth entities that are rare in pretraining. However, examining whether data coverage itself can serve as a detection signal is overlooked. We propose a complementary question: Does lexical training-data coverage of the question and/or generated answer provide additional signal for hallucination detection? To investigate this, we construct scalable suffix arrays over RedPajama's 1.3-trillion-token pretraining corpus to retrieve $n$-gram statistics for both prompts and model generations. We evaluate their effectiveness for hallucination detection across three QA benchmarks. Our observations show that while occurrence-based features are weak predictors when used alone, they yield modest gains when combined with log-probabilities, particularly on datasets with higher intrinsic model uncertainty. These findings suggest that lexical coverage features provide a complementary signal for hallucination detection. All code and suffix-array infrastructure are provided at this https URL.
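A toy sketch of how a suffix array yields $n$-gram occurrence counts, as a reading aid for this abstract. It assumes a small in-memory token list and a naive quadratic construction; the paper's scalable infrastructure over a 1.3-trillion-token corpus is certainly different, and the names `build_suffix_array` and `count_ngram` are hypothetical.

```python
def build_suffix_array(tokens):
    """Start indices of all suffixes, sorted lexicographically (naive, toy-scale)."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def _bound(tokens, sa, ngram, upper):
    """Binary search for the first suffix whose n-token prefix is >= ngram
    (or > ngram when `upper` is True)."""
    n = len(ngram)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = tokens[sa[mid]:sa[mid] + n]
        if prefix < ngram or (upper and prefix == ngram):
            lo = mid + 1
        else:
            hi = mid
    return lo

def count_ngram(tokens, sa, ngram):
    """Number of occurrences of `ngram` (a list of tokens) in `tokens`."""
    ngram = list(ngram)
    return _bound(tokens, sa, ngram, True) - _bound(tokens, sa, ngram, False)

corpus = "the cat sat on the mat the cat".split()
sa = build_suffix_array(corpus)
bigram_count = count_ngram(corpus, sa, ["the", "cat"])
```

Counts like `bigram_count` for the $n$-grams of a prompt or a generated answer are the kind of occurrence-based feature that could be combined with log-probabilities in a downstream detector.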

Title: Blu-WERP (Web Extraction and Refinement Pipeline): A Scalable Pipeline for Preprocessing Large Language Model Datasets

Authors: Gowtham, Sai Rupesh, Sanjay Kumar, Saravanan, Venkata Chaithanya
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.18054
Pdf URL: https://arxiv.org/pdf/2511.18054
Copy Paste: [[2511.18054]] Blu-WERP (Web Extraction and Refinement Pipeline): A Scalable Pipeline for Preprocessing Large Language Model Datasets(https://arxiv.org/abs/2511.18054)
Keywords: language model, llm
Abstract: High-quality training data is fundamental to large language model (LLM) performance, yet existing preprocessing pipelines often struggle to effectively remove noise and unstructured content from web-scale corpora. This paper presents Blu-WERP, a novel data preprocessing pipeline designed to optimize the quality of Common Crawl WARC files for LLM training. We demonstrate that Blu-WERP significantly outperforms established baselines including DCLM across multiple model scales and evaluation benchmarks. Our pipeline processes CC WARC dumps, implementing advanced filtering and quality assessment mechanisms. We conducted comprehensive evaluations using models with 150M, 400M, 530M, 750M, and 1B parameters, testing against nine standard benchmarks categorized as World Knowledge & Reasoning, Language Understanding, and Commonsense Reasoning. Results show Blu-WERP consistently achieved superior performance across all model scales. At the 1B parameter scale, Blu-WERP demonstrates relative aggregate improvements of 4.0% and 9.5% over DCLM and Fineweb, respectively, while achieving a quality-per-token efficiency gain. Categorical analysis reveals a 2.4% improvement in World Knowledge & Reasoning, a 6.2% improvement in Language Understanding, and a 4.2% improvement in Commonsense Reasoning. These results establish Blu-WERP as a state-of-the-art preprocessing pipeline that substantially improves LLM training data quality and downstream model performance with reduced computational cost. Our findings contribute to the growing body of research on data-centric AI, demonstrating that preprocessing pipeline design significantly impacts LLM capabilities. The Blu-WERP pipeline represents a practical advancement in data quality optimization, offering researchers and practitioners an effective solution for improving LLM training efficiency and model performance.

Title: Vector Arithmetic in Concept and Token Subspaces

Authors: Sheridan Feucht, Byron Wallace, David Bau
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.18162
Pdf URL: https://arxiv.org/pdf/2511.18162
Copy Paste: [[2511.18162]] Vector Arithmetic in Concept and Token Subspaces(https://arxiv.org/abs/2511.18162)
Keywords: llm
Abstract: In order to predict the next token, LLMs must represent semantic and surface-level information about the current word. Previous work identified two types of attention heads that disentangle this information: (i) Concept induction heads, which copy word meanings, and (ii) Token induction heads, which copy literal token representations (Feucht et al., 2025). We show that these heads can be used to identify subspaces of model activations that exhibit coherent semantic structure in Llama-2-7b. Specifically, when we transform hidden states using the attention weights of concept heads, we are able to more accurately perform parallelogram arithmetic (Mikolov et al., 2013) on the resulting hidden states, e.g., showing that "Athens" - "Greece" + "China" = "Beijing". This transformation allows for much higher nearest-neighbor accuracy (80%) than direct use of raw hidden states (47%). Analogously, we show that token heads allow for transformations that reveal surface-level word information in hidden states, allowing for operations like "coding" - "code" + "dance" = "dancing".
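The parallelogram arithmetic mentioned in this abstract can be illustrated with a small sketch. The vectors below are made-up one-hot-style toys, not Llama-2-7b hidden states, and no concept-head transformation is applied; `cosine` and `analogy` are hypothetical helper names.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def analogy(emb, a, b, c):
    """Nearest neighbour of emb[a] - emb[b] + emb[c], excluding the query words."""
    target = [x - y + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

# Toy embedding: three "country" axes plus one "is a capital" axis.
emb = {
    "Greece":  [1.0, 0.0, 0.0, 0.0],
    "Athens":  [1.0, 0.0, 0.0, 1.0],
    "China":   [0.0, 1.0, 0.0, 0.0],
    "Beijing": [0.0, 1.0, 0.0, 1.0],
    "France":  [0.0, 0.0, 1.0, 0.0],
    "Paris":   [0.0, 0.0, 1.0, 1.0],
}
answer = analogy(emb, "Athens", "Greece", "China")
```

Nearest-neighbor accuracy, as reported in the abstract, would be the fraction of such analogy queries for which the returned word matches the expected one.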

Title: Rethinking Retrieval: From Traditional Retrieval Augmented Generation to Agentic and Non-Vector Reasoning Systems in the Financial Domain for Large Language Models

Authors: Elias Lumer, Matt Melich, Olivia Zino, Elena Kim, Sara Dieter, Pradeep Honaganahalli Basavaraju, Vamse Kumar Subbiah, James A. Burke, Roberto Hernandez
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.18177
Pdf URL: https://arxiv.org/pdf/2511.18177
Copy Paste: [[2511.18177]] Rethinking Retrieval: From Traditional Retrieval Augmented Generation to Agentic and Non-Vector Reasoning Systems in the Financial Domain for Large Language Models(https://arxiv.org/abs/2511.18177)
Keywords: language model, llm, retrieval augmented generation, retrieval-augmented generation, agent
Abstract: Recent advancements in Retrieval-Augmented Generation (RAG) have enabled Large Language Models to answer financial questions using external knowledge bases of U.S. SEC filings, earnings reports, and regulatory documents. However, existing work lacks systematic comparison of vector-based and non-vector RAG architectures for financial documents, and the empirical impact of advanced RAG techniques on retrieval accuracy, answer quality, latency, and cost remains unclear. We present the first systematic evaluation comparing vector-based agentic RAG using hybrid search and metadata filtering against hierarchical node-based systems that traverse document structure without embeddings. We evaluate two enhancement techniques applied to the vector-based architecture: i) cross-encoder reranking for retrieval precision, and ii) small-to-big chunk retrieval for context completeness. Across 1,200 SEC 10-K, 10-Q, and 8-K filings on a 150-question benchmark, we measure retrieval metrics (MRR, Recall@5), answer quality through LLM-as-a-judge pairwise comparisons, latency, and preprocessing costs. Vector-based agentic RAG achieves a 68% win rate over hierarchical node-based systems with comparable latency (5.2 versus 5.98 seconds). Cross-encoder reranking achieves a 59% absolute improvement at optimal parameters (10, 5) for MRR@5. Small-to-big retrieval achieves a 65% win rate over baseline chunking with only 0.2 seconds additional latency. Our findings reveal that applying advanced RAG techniques to financial Q&A systems improves retrieval accuracy and answer quality, with cost-performance tradeoffs to be considered in production.
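For readers unfamiliar with the retrieval metrics named in this abstract, the standard definitions of MRR@k and Recall@k can be sketched as follows. The document IDs and relevance sets are made up, and the function names are hypothetical; this is not the paper's evaluation code.

```python
def reciprocal_rank(ranked, relevant, k):
    """1 / rank of the first relevant item within the top k, else 0.0."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def mean_reciprocal_rank(runs, k):
    """Average reciprocal rank over (ranked_list, relevant_set) pairs."""
    return sum(reciprocal_rank(r, rel, k) for r, rel in runs) / len(runs)

# Two toy queries with made-up chunk IDs.
runs = [
    (["d3", "d1", "d7", "d2", "d9"], {"d1", "d2", "d4"}),
    (["d5", "d6", "d8", "d0", "d4"], {"d5"}),
]
mrr_at_5 = mean_reciprocal_rank(runs, k=5)
```

Here the first query contributes a reciprocal rank of 1/2 and the second contributes 1, so `mrr_at_5` is 0.75.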

Title: Agent-as-a-Graph: Knowledge Graph-Based Tool and Agent Retrieval for LLM Multi-Agent Systems

Authors: Faheem Nizar, Elias Lumer, Anmol Gulati, Pradeep Honaganahalli Basavaraju, Vamse Kumar Subbiah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.18194
Pdf URL: https://arxiv.org/pdf/2511.18194
Copy Paste: [[2511.18194]] Agent-as-a-Graph: Knowledge Graph-Based Tool and Agent Retrieval for LLM Multi-Agent Systems(https://arxiv.org/abs/2511.18194)
Keywords: language model, llm, retrieval augmented generation, agent
Abstract: Recent advances in Large Language Model Multi-Agent Systems enable scalable orchestration and retrieval of specialized, parallelized subagents, each equipped with hundreds or thousands of Model Context Protocol (MCP) servers and tools. However, existing agent, MCP, and retrieval methods typically match queries against a single agent description, obscuring fine-grained tool capabilities of each agent, resulting in suboptimal agent selection. We introduce Agent-as-a-Graph retrieval, a knowledge graph retrieval augmented generation approach that represents both tools and their parent agents as nodes and edges in a knowledge graph. During retrieval, i) relevant agents and tool nodes are first retrieved through vector search, ii) we apply a type-specific weighted reciprocal rank fusion (wRRF) for reranking tools and agents, and iii) parent agents are traversed in the knowledge graph for the final set of agents. We evaluate Agent-as-a-Graph on the LiveMCPBenchmark, achieving 14.9% and 14.6% improvements in Recall@5 and nDCG@5 over prior state-of-the-art retrievers, and 2.4% improvements in wRRF optimizations.
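Steps ii) and iii) of the retrieval procedure in this abstract can be sketched roughly as below. Reciprocal rank fusion with the conventional smoothing constant `k = 60` is standard, but the abstract does not define how the type-specific weights enter the score, so multiplying each contribution by a per-type weight is an assumption, as are the names `wrrf`, `node_type`, `type_weight`, and `parent_of` and all example values.

```python
def wrrf(rankings, node_type, type_weight, k=60):
    """Weighted reciprocal rank fusion over several ranked lists.

    Each appearance of a node at 1-based rank r adds
    type_weight[node_type[node]] / (k + r) to its fused score (assumed form).
    """
    scores = {}
    for ranking in rankings:
        for rank, node in enumerate(ranking, start=1):
            weight = type_weight[node_type[node]]
            scores[node] = scores.get(node, 0.0) + weight / (k + rank)
    return sorted(scores, key=lambda n: (-scores[n], n))

def parent_agents(ranked_nodes, node_type, parent_of):
    """Map tools to their parent agent, keeping first-seen order without repeats."""
    agents = []
    for node in ranked_nodes:
        agent = node if node_type[node] == "agent" else parent_of[node]
        if agent not in agents:
            agents.append(agent)
    return agents

# Made-up nodes: two tools and one agent retrieved by two vector searches.
node_type = {"toolA": "tool", "toolB": "tool", "agentX": "agent"}
parent_of = {"toolA": "agentY", "toolB": "agentX"}
rankings = [["toolA", "agentX", "toolB"], ["agentX", "toolB", "toolA"]]

fused = wrrf(rankings, node_type, {"agent": 1.0, "tool": 1.5})
final_agents = parent_agents(fused, node_type, parent_of)
```

With tools weighted above agents, `toolA` ranks first after fusion, so its parent `agentY` leads the final agent list even though `agentY` never appeared directly in either ranking.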

Title: From Archives to Decisions: Multi-Agent Pharmaceutical Co-Scientist for Traceable Drug Discovery and Reverse Translation

Authors: Xiaochen Zheng, Alvaro Serra, Ilya Schneider Chernov, Maddalena Marchesi, Eunice Musvasva, Tatyana Y. Doktorova
Subjects: cs.CL, cs.MA
Abstract URL: https://arxiv.org/abs/2511.18259
Pdf URL: https://arxiv.org/pdf/2511.18259
Copy Paste: [[2511.18259]] From Archives to Decisions: Multi-Agent Pharmaceutical Co-Scientist for Traceable Drug Discovery and Reverse Translation(https://arxiv.org/abs/2511.18259)
Keywords: agent
Abstract: Pharmaceutical research and development has accumulated vast, heterogeneous archives of data. Much of this knowledge stems from discontinued programs, and reusing these archives is invaluable for reverse translation. However, in practice, such reuse is often infeasible. In this work, we introduce DiscoVerse, a multi-agent co-scientist designed to support pharmaceutical research and development. The system implements semantic retrieval, cross-document linking, and auditable synthesis on a large historical corpus from Roche. To validate our approach at real-world scale, we selected a subset of 180 molecules from the Roche research repositories, covering over 0.87 billion BPE tokens and more than four decades of research. Given that automated evaluation metrics are poorly aligned with scientific utility, we evaluate the performance of DiscoVerse using blinded expert evaluation of source-linked outputs. To our knowledge, this is the first agentic framework systematically assessed on real pharmaceutical data for reverse translation, enabled by authorized access to confidential, end-to-end drug-development archives. Our contributions include role-specialized agent designs aligned with scientist workflows; human-in-the-loop support for reverse translation; expert evaluation; and a large-scale demonstration showing promising answer accuracy and decision-making insights. In brief, across seven benchmark queries covering 180 molecules, DiscoVerse achieved near-perfect recall ($\geq 0.99$) with moderate precision ($0.71-0.91$), while qualitative assessments of discontinuation rationale and organ-specific toxicity showed faithful, source-linked synthesis across preclinical and clinical evidence.

Title: "AGI" team at SHROOM-CAP: Data-Centric Approach to Multilingual Hallucination Detection using XLM-RoBERTa

Authors: Harsh Rathva, Pruthwik Mishra, Shrikant Malviya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.18301
Pdf URL: https://arxiv.org/pdf/2511.18301
Copy Paste: [[2511.18301]] "AGI" team at SHROOM-CAP: Data-Centric Approach to Multilingual Hallucination Detection using XLM-RoBERTa(https://arxiv.org/abs/2511.18301)
Keywords: language model, llm, hallucination
Abstract: The detection of hallucinations in multilingual scientific text generated by Large Language Models (LLMs) presents significant challenges for reliable AI systems. This paper describes our submission to the SHROOM-CAP 2025 shared task on scientific hallucination detection across 9 languages. Unlike most approaches that focus primarily on model architecture, we adopted a data-centric strategy that addressed the critical issue of training data scarcity and imbalance. We unify and balance five existing datasets to create a comprehensive training corpus of 124,821 samples (50% correct, 50% hallucinated), representing a 172x increase over the original SHROOM training data. Our approach, which fine-tunes XLM-RoBERTa-Large (560 million parameters) on this enhanced dataset, achieves competitive performance across all languages, including 2nd place in Gujarati (a zero-shot language) with a Factuality F1 of 0.5107, and rankings between 4th and 6th place across the remaining 8 languages. Our results demonstrate that systematic data curation can significantly outperform architectural innovations alone, particularly for low-resource languages in zero-shot settings.

Title: Table Comprehension in Building Codes using Vision Language Models and Domain-Specific Fine-Tuning

Authors: Mohammad Aqib, Mohd Hamza, Ying Hei Chui, Qipei Mei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.18306
Pdf URL: https://arxiv.org/pdf/2511.18306
Copy Paste: [[2511.18306]] Table Comprehension in Building Codes using Vision Language Models and Domain-Specific Fine-Tuning(https://arxiv.org/abs/2511.18306)
Keywords: language model, retrieval-augmented generation
Abstract: Building codes contain critical information for ensuring safety, regulatory compliance, and informed decision-making in construction and engineering. Automated question answering systems over such codes enable quick and accurate access to specific regulatory clauses, improving efficiency and reducing errors. Retrieval-Augmented Generation (RAG) systems are essential for this task as they combine the precision of information retrieval with the generative capabilities of language models. However, tabular data are challenging to extract as they often involve complex layouts, merged cells, multi-row headers, and embedded semantic relationships that are not easily captured by traditional natural language processing techniques and Vision Language Models (VLMs). This paper explores and compares two methods for extracting information from tabular data in building codes using several pre-trained VLMs. First, a direct input method is used, where the image of the page is input directly into the VLMs, which are then tasked with answering questions based on the image. Second, an indirect input method is introduced, which involves converting an image of a page containing tables into LaTeX code and then answering inquiries based on the LaTeX input. The experiments find that the direct input method generally resulted in higher accuracy than the indirect input method. To further improve the performance, we fine-tuned each VLM using Low Rank Adaptation (LoRA) on a domain-specific tabular dataset. The fine-tuned models exhibited substantial improvements, with Qwen2.5-VL-3B-Instruct achieving relative accuracy gains exceeding 100%. Our results highlight the potential of parameter-efficient fine-tuning methods to adapt powerful VLMs for understanding complex structured data in specialized fields, such as building code interpretation and regulatory compliance.

Title: Path-Constrained Retrieval: A Structural Approach to Reliable LLM Agent Reasoning Through Graph-Scoped Semantic Search

Authors: Joseph Oladokun
Subjects: cs.CL, cs.DB, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2511.18313
Pdf URL: https://arxiv.org/pdf/2511.18313
Copy Paste: [[2511.18313]] Path-Constrained Retrieval: A Structural Approach to Reliable LLM Agent Reasoning Through Graph-Scoped Semantic Search(https://arxiv.org/abs/2511.18313)
Keywords: language model, llm, agent
Abstract: Large Language Model agents often retrieve context from knowledge bases that lack structural consistency with the agent's current reasoning state, leading to incoherent reasoning chains. We introduce Path-Constrained Retrieval (PCR), a retrieval method that combines structural graph constraints with semantic search to ensure retrieved information maintains logical relationships within a knowledge graph. PCR restricts the search space to nodes reachable from an anchor node, preventing retrieval of structurally disconnected information that may lead to inconsistent reasoning. We evaluate PCR on PathRAG-6, a benchmark spanning six domains with 180 nodes and 360 edges. Our results show that PCR achieves full structural consistency compared to 24-32 percent in baseline methods, while maintaining strong relevance scores. On the technology domain, PCR obtains full relevance at rank 10 with full structural consistency, significantly outperforming vector search and hybrid retrieval. PCR reduces the average graph distance of retrieved context by 78 percent compared to baselines, demonstrating retrieval of more structurally consistent information. These findings suggest that path-constrained retrieval is an effective approach for improving the reliability and coherence of LLM agent reasoning systems.
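The core constraint in this abstract, restricting retrieval to nodes reachable from an anchor, can be sketched with a breadth-first search followed by similarity ranking. This is an illustration under assumptions: the graph is a plain adjacency dict, the relevance scores are made-up stand-ins for embedding similarities, the anchor itself is kept as a candidate, and `reachable` and `path_constrained_retrieve` are hypothetical names.

```python
from collections import deque

def reachable(graph, anchor):
    """All nodes reachable from `anchor` by following directed edges (BFS)."""
    seen = {anchor}
    queue = deque([anchor])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def path_constrained_retrieve(graph, anchor, scores, top_k):
    """Rank only reachable nodes by score; disconnected nodes are never returned."""
    allowed = reachable(graph, anchor)
    candidates = [n for n in allowed if n in scores]
    return sorted(candidates, key=lambda n: (-scores[n], n))[:top_k]

# Made-up graph with two disconnected components and made-up relevance scores.
graph = {"a": ["b"], "b": ["c"], "d": ["e"]}
scores = {"a": 0.10, "b": 0.40, "c": 0.70, "e": 0.99}

results = path_constrained_retrieve(graph, "a", scores, top_k=2)
```

Node `e` has the highest score but is structurally disconnected from the anchor `a`, so it is excluded; an unconstrained vector search over the same scores would have returned it first.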

Title: Gradient Masters at BLP-2025 Task 1: Advancing Low-Resource NLP for Bengali using Ensemble-Based Adversarial Training for Hate Speech Detection

Authors: Syed Mohaiminul Hoque, Naimur Rahman, Md Sakhawat Hossain
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.18324
Pdf URL: https://arxiv.org/pdf/2511.18324
Copy Paste: [[2511.18324]] Gradient Masters at BLP-2025 Task 1: Advancing Low-Resource NLP for Bengali using Ensemble-Based Adversarial Training for Hate Speech Detection(https://arxiv.org/abs/2511.18324)
Keywords: language model
Abstract: This paper introduces the approach of "Gradient Masters" for BLP-2025 Task 1: "Bangla Multitask Hate Speech Identification Shared Task". We present an ensemble-based fine-tuning strategy for addressing subtasks 1A (hate-type classification) and 1B (target group classification) in YouTube comments. We propose a hybrid approach on a Bangla Language Model, which outperformed the baseline models and secured the 6th position in subtask 1A with a micro F1 score of 73.23% and the third position in subtask 1B with 73.28%. We conducted extensive experiments that evaluated the robustness of the model throughout the development and evaluation phases, including comparisons with other Language Model variants, to measure generalization in low-resource Bangla hate speech scenarios and data set coverage. In addition, we provide a detailed analysis of our findings, exploring misclassification patterns in the detection of hate speech.

Title: OmniStruct: Universal Text-to-Structure Generation across Diverse Schemas

Authors: James Y. Huang, Wenxuan Zhou, Nan Xu, Fei Wang, Qin Liu, Sheng Zhang, Hoifung Poon, Muhao Chen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.18335
Pdf URL: https://arxiv.org/pdf/2511.18335
Copy Paste: [[2511.18335]] OmniStruct: Universal Text-to-Structure Generation across Diverse Schemas(https://arxiv.org/abs/2511.18335)
Keywords: language model, gpt, llm
Abstract: The ability of Large Language Models (LLMs) to generate structured outputs that follow arbitrary schemas is crucial to a wide range of downstream tasks that require diverse structured representations of results such as information extraction, table generation, and function calling. While modern LLMs excel in generating unstructured responses in natural language, whether this advancement translates to a strong performance on text-to-structure tasks remains unclear. To bridge this gap, we first introduce OmniStruct, a comprehensive benchmark for assessing LLMs' capabilities on diverse text-to-structure tasks such as information extraction, table generation, and function calling. We build OmniStruct by identifying existing datasets across a wide range of tasks that are suitable for a structured answer format, and adapting them under a unified text-to-structure problem setting. To facilitate the development of efficient text-to-structure models, we collect high-quality training data via synthetic task generation. Without using any supervised data for OmniStruct tasks, our experiments demonstrate the possibility of fine-tuning much smaller models on synthetic data into universal structured generation models that can rival the performance of GPT-4o.

Title: Towards Robust and Fair Next Visit Diagnosis Prediction under Noisy Clinical Notes with Large Language Models

Authors: Heejoon Koo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.18393
Pdf URL: https://arxiv.org/pdf/2511.18393
Copy Paste: [[2511.18393]] Towards Robust and Fair Next Visit Diagnosis Prediction under Noisy Clinical Notes with Large Language Models(https://arxiv.org/abs/2511.18393)
Keywords: language model, llm, chain-of-thought
Abstract: A decade of rapid advances in artificial intelligence (AI) has opened new opportunities for clinical decision support systems (CDSS), with large language models (LLMs) demonstrating strong reasoning abilities on timely medical tasks. However, clinical texts are often degraded by human errors or failures in automated pipelines, raising concerns about the reliability and fairness of AI-assisted decision-making. Yet the impact of such degradations remains under-investigated, particularly regarding how noise-induced shifts can heighten predictive uncertainty and unevenly affect demographic subgroups. We present a systematic study of state-of-the-art LLMs under diverse text corruption scenarios, focusing on robustness and equity in next-visit diagnosis prediction. To address the challenge posed by the large diagnostic label space, we introduce a clinically grounded label-reduction scheme and a hierarchical chain-of-thought (CoT) strategy that emulates clinicians' reasoning. Our approach improves robustness and reduces subgroup instability under degraded inputs, advancing the reliable use of LLMs in CDSS. We release code at this https URL.

Title: Findings of the BlackboxNLP 2025 Shared Task: Localizing Circuits and Causal Variables in Language Models

Authors: Dana Arad, Yonatan Belinkov, Hanjie Chen, Najoung Kim, Hosein Mohebbi, Aaron Mueller, Gabriele Sarti, Martin Tutek
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.18409
Pdf URL: https://arxiv.org/pdf/2511.18409
Copy Paste: [[2511.18409]] Findings of the BlackboxNLP 2025 Shared Task: Localizing Circuits and Causal Variables in Language Models(https://arxiv.org/abs/2511.18409)
Keywords: language model
Abstract: Mechanistic interpretability (MI) seeks to uncover how language models (LMs) implement specific behaviors, yet measuring progress in MI remains challenging. The recently released Mechanistic Interpretability Benchmark (MIB; Mueller et al., 2025) provides a standardized framework for evaluating circuit and causal variable localization. Building on this foundation, the BlackboxNLP 2025 Shared Task extends MIB into a community-wide reproducible comparison of MI techniques. The shared task features two tracks: circuit localization, which assesses methods that identify causally influential components and interactions driving model behavior, and causal variable localization, which evaluates approaches that map activations into interpretable features. With three teams spanning eight different methods, participants achieved notable gains in circuit localization using ensemble and regularization strategies for circuit discovery. With one team spanning two methods, participants achieved significant gains in causal variable localization using low-dimensional and non-linear projections to featurize activation vectors. The MIB leaderboard remains open; we encourage continued work in this standard evaluation framework to measure progress in MI research going forward.

Title: Multi-Agent Collaborative Filtering: Orchestrating Users and Items for Agentic Recommendations

Authors: Yu Xia, Sungchul Kim, Tong Yu, Ryan A. Rossi, Julian McAuley
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2511.18413
Pdf URL: https://arxiv.org/pdf/2511.18413
Copy Paste: [[2511.18413]] Multi-Agent Collaborative Filtering: Orchestrating Users and Items for Agentic Recommendations(https://arxiv.org/abs/2511.18413)
Keywords: language model, llm, agent
Abstract: Agentic recommendations cast recommenders as large language model (LLM) agents that can plan, reason, use tools, and interact with users of varying preferences in web applications. However, most existing agentic recommender systems focus on generic single-agent plan-execute workflows or multi-agent task decomposition pipelines. Without recommendation-oriented design, they often underuse the collaborative signals in the user-item interaction history, leading to unsatisfying recommendation results. To address this, we propose the Multi-Agent Collaborative Filtering (MACF) framework for agentic recommendations, drawing an analogy between traditional collaborative filtering algorithms and LLM-based multi-agent collaboration. Specifically, given a target user and query, we instantiate similar users and relevant items as LLM agents with unique profiles. Each agent is able to call retrieval tools, suggest candidate items, and interact with other agents. Different from the static preference aggregation in traditional collaborative filtering, MACF employs a central orchestrator agent to adaptively manage the collaboration between user and item agents via dynamic agent recruitment and personalized collaboration instruction. Experimental results on datasets from three different domains show the advantages of our MACF framework compared to strong agentic recommendation baselines.
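Abstract (translated from Chinese): Agentic recommendation treats recommenders as large language model (LLM) agents that can plan, reason, use tools, and interact with users of varying preferences in web applications. However, most existing agentic recommender systems focus on generic single-agent plan-execute workflows or multi-agent task-decomposition pipelines. Without recommendation-oriented design, they often underuse the collaborative signals in the user-item interaction history, which leads to unsatisfying recommendations. To address this, we propose the Multi-Agent Collaborative Filtering (MACF) framework for agentic recommendation, drawing an analogy between traditional collaborative filtering algorithms and LLM-based multi-agent collaboration. Specifically, given a target user and query, we instantiate similar users and relevant items as LLM agents with unique profiles. Each agent can call retrieval tools, suggest candidate items, and interact with other agents. Unlike the static preference aggregation of traditional collaborative filtering, MACF uses a central orchestrator agent to adaptively manage collaboration between user and item agents through dynamic agent recruitment and personalized collaboration instructions. Experimental results on datasets from three different domains show the advantages of MACF over strong agentic recommendation baselines.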

Title: General Agentic Memory Via Deep Research

Authors: B.Y. Yan, Chaofan Li, Hongjin Qian, Shuqi Lu, Zheng Liu
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2511.18423
Pdf URL: https://arxiv.org/pdf/2511.18423
Copy Paste: [[2511.18423]] General Agentic Memory Via Deep Research(https://arxiv.org/abs/2511.18423)
Keywords: language model, llm, agent
Abstract: Memory is critical for AI agents, yet the widely adopted static memory, which aims to create readily available memory in advance, is inevitably subject to severe information loss. To address this limitation, we propose a novel framework called general agentic memory (GAM). GAM follows the principle of "just-in-time (JIT) compilation": it focuses on creating optimized contexts for its client at runtime while keeping only simple but useful memory during the offline stage. To this end, GAM employs a duo-design with the following components. 1) Memorizer, which highlights key historical information using a lightweight memory, while maintaining complete historical information within a universal page-store. 2) Researcher, which retrieves and integrates useful information from the page-store for its online request, guided by the pre-constructed memory. This design allows GAM to effectively leverage the agentic capabilities and test-time scalability of frontier large language models (LLMs), while also facilitating end-to-end performance optimization through reinforcement learning. In our experimental study, we demonstrate that GAM achieves substantial improvement on various memory-grounded task completion scenarios against existing memory systems.
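Abstract (translated from Chinese): Memory is critical for AI agents, but the widely adopted static memory, which aims to create readily available memory in advance, inevitably suffers severe information loss. To address this limitation, we propose a novel framework called general agentic memory (GAM). GAM follows the principle of "just-in-time (JIT) compilation": it focuses on creating optimized contexts for its client at runtime while keeping only simple but useful memory during the offline stage. To this end, GAM adopts a dual design with the following components. 1) Memorizer, which highlights key historical information using a lightweight memory while maintaining the complete history in a universal page-store. 2) Researcher, which retrieves and integrates useful information from the page-store for its online request, guided by the pre-constructed memory. This design lets GAM effectively leverage the agentic capabilities and test-time scalability of frontier large language models (LLMs), while also facilitating end-to-end performance optimization through reinforcement learning. In our experimental study, we show that GAM achieves substantial improvements over existing memory systems on various memory-grounded task completion scenarios.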

Title: MindEval: Benchmarking Language Models on Multi-turn Mental Health Support

Authors: José Pombal, Maya D'Eon, Nuno M. Guerreiro, Pedro Henrique Martins, António Farinhas, Ricardo Rei
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.18491
Pdf URL: https://arxiv.org/pdf/2511.18491
Copy Paste: [[2511.18491]] MindEval: Benchmarking Language Models on Multi-turn Mental Health Support(https://arxiv.org/abs/2511.18491)
Keywords: language model, llm, prompt, chat
Abstract: Demand for mental health support through AI chatbots is surging, though current systems present several limitations, like sycophancy or overvalidation, and reinforcement of maladaptive beliefs. A core obstacle to the creation of better systems is the scarcity of benchmarks that capture the complexity of real therapeutic interactions. Most existing benchmarks either only test clinical knowledge through multiple-choice questions or assess single responses in isolation. To bridge this gap, we present MindEval, a framework designed in collaboration with Ph.D-level Licensed Clinical Psychologists for automatically evaluating language models in realistic, multi-turn mental health therapy conversations. Through patient simulation and automatic evaluation with LLMs, our framework balances resistance to gaming with reproducibility via its fully automated, model-agnostic design. We begin by quantitatively validating the realism of our simulated patients against human-generated text and by demonstrating strong correlations between automatic and human expert judgments. Then, we evaluate 12 state-of-the-art LLMs and show that all models struggle, scoring below 4 out of 6, on average, with particular weaknesses in problematic AI-specific patterns of communication. Notably, reasoning capabilities and model scale do not guarantee better performance, and systems deteriorate with longer interactions or when supporting patients with severe symptoms. We release all code, prompts, and human evaluation data.
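Abstract (translated from Chinese): Demand for mental health support through AI chatbots is surging, even though current systems have several limitations, such as sycophancy or overvalidation and the reinforcement of maladaptive beliefs. A core obstacle to building better systems is the scarcity of benchmarks that capture the complexity of real therapeutic interactions. Most existing benchmarks either test clinical knowledge only through multiple-choice questions or assess single responses in isolation. To bridge this gap, we present MindEval, a framework designed in collaboration with Ph.D.-level licensed clinical psychologists for automatically evaluating language models in realistic, multi-turn mental health therapy conversations. Through patient simulation and automatic evaluation with LLMs, our framework balances resistance to gaming with reproducibility via its fully automated, model-agnostic design. We first quantitatively validate the realism of our simulated patients against human-generated text and demonstrate strong correlations between automatic and human expert judgments. We then evaluate 12 state-of-the-art LLMs and find that all models struggle, scoring below 4 out of 6 on average, with particular weaknesses in problematic AI-specific patterns of communication. Notably, reasoning capabilities and model scale do not guarantee better performance, and systems deteriorate with longer interactions or when supporting patients with severe symptoms. We release all code, prompts, and human evaluation data.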

Title: For Those Who May Find Themselves on the Red Team

Authors: Tyler Shoemaker
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.18499
Pdf URL: https://arxiv.org/pdf/2511.18499
Copy Paste: [[2511.18499]] For Those Who May Find Themselves on the Red Team(https://arxiv.org/abs/2511.18499)
Keywords: language model, llm
Abstract: This position paper argues that literary scholars must engage with large language model (LLM) interpretability research. While doing so will involve ideological struggle, if not outright complicity, the necessity of this engagement is clear: the abiding instrumentality of current approaches to interpretability cannot be the only standard by which we measure interpretation with LLMs. One site at which this struggle could take place, I suggest, is the red team.
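Abstract (translated from Chinese): This position paper argues that literary scholars must engage with large language model (LLM) interpretability research. Although doing so will involve ideological struggle, if not outright complicity, the necessity of this engagement is clear: the abiding instrumentality of current approaches to interpretability cannot be the only standard by which we measure interpretation with LLMs. One site where this struggle could take place, I suggest, is the red team.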

Title: Toward Trustworthy Difficulty Assessments: Large Language Models as Judges in Programming and Synthetic Tasks

Authors: H.M. Shadman Tabib, Jaber Ahmed Deedar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.18597
Pdf URL: https://arxiv.org/pdf/2511.18597
Copy Paste: [[2511.18597]] Toward Trustworthy Difficulty Assessments: Large Language Models as Judges in Programming and Synthetic Tasks(https://arxiv.org/abs/2511.18597)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in natural language and code generation, and are increasingly deployed as automatic judges of model outputs and learning activities. Yet, their behavior on structured tasks such as predicting the difficulty of competitive programming problems remains under-explored. We conduct a systematic comparison of GPT-4o, used purely as a natural-language difficulty assessor, against an interpretable LightGBM ensemble trained on explicit numeric and textual features. On a dataset of 1,825 LeetCode problems labeled Easy, Medium, or Hard, LightGBM attains 86% accuracy, whereas GPT-4o reaches only 37.75%. Detailed analyses, including confusion matrices and SHAP-based interpretability, show that numeric constraints -- such as input size limits and acceptance rates -- play a crucial role in separating Hard problems from easier ones. By contrast, GPT-4o often overlooks these cues and exhibits a strong bias toward simpler categories. We further probe GPT-4o through a synthetic Hard-problem generation protocol. Surprisingly, GPT-4o labels almost all of its own synthetic Hard problems as Medium, contradicting its tendency to downgrade real Hard problems to Easy. Our findings connect to recent work on LLMs-as-judges and automatic difficulty estimation in programming and education, and highlight concrete failure modes that must be addressed before LLM-based judges can be considered trustworthy in competitive programming, educational platforms, or reinforcement-learning pipelines.
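Abstract (translated from Chinese): Large language models (LLMs) have shown impressive capabilities in natural language and code generation, and are increasingly deployed as automatic judges of model outputs and learning activities. However, their behavior on structured tasks, such as predicting the difficulty of competitive programming problems, remains under-explored. We systematically compare GPT-4o, used purely as a natural-language difficulty assessor, against an interpretable LightGBM ensemble trained on explicit numeric and textual features. On a dataset of 1,825 LeetCode problems labeled Easy, Medium, or Hard, LightGBM reaches 86% accuracy, whereas GPT-4o reaches only 37.75%. Detailed analyses, including confusion matrices and SHAP-based interpretability, show that numeric constraints such as input size limits and acceptance rates play a crucial role in separating Hard problems from easier ones. By contrast, GPT-4o often overlooks these cues and shows a strong bias toward simpler categories. We further probe GPT-4o with a synthetic Hard-problem generation protocol. Surprisingly, GPT-4o labels almost all of its own synthetic Hard problems as Medium, which contradicts its tendency to downgrade real Hard problems to Easy. Our findings connect to recent work on LLMs-as-judges and on automatic difficulty estimation in programming and education, and highlight concrete failure modes that must be addressed before LLM-based judges can be considered trustworthy in competitive programming, educational platforms, or reinforcement-learning pipelines.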

Title: A Benchmark for Zero-Shot Belief Inference in Large Language Models

Authors: Joseph Malone, Rachith Aiyappa, Byunghwee Lee, Haewoon Kwak, Jisun An, Yong-Yeol Ahn
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.18616
Pdf URL: https://arxiv.org/pdf/2511.18616
Copy Paste: [[2511.18616]] A Benchmark for Zero-Shot Belief Inference in Large Language Models(https://arxiv.org/abs/2511.18616)
Keywords: language model, llm
Abstract: Beliefs are central to how humans reason, communicate, and form social connections, yet most computational approaches to studying them remain confined to narrow sociopolitical contexts and rely on fine-tuning for optimal performance. Despite the growing use of large language models (LLMs) across disciplines, how well these systems generalize across diverse belief domains remains unclear. We introduce a systematic, reproducible benchmark that evaluates the ability of LLMs to predict individuals' stances on a wide range of topics in a zero-shot setting using data from an online debate platform. The benchmark includes multiple informational conditions that isolate the contribution of demographic context and known prior beliefs to predictive success. Across several small- to medium-sized models, we find that providing more background information about an individual improves predictive accuracy, but performance varies substantially across belief domains. These findings reveal both the capacity and limitations of current LLMs to emulate human reasoning, advancing the study of machine behavior and offering a scalable framework for modeling belief systems beyond the sociopolitical sphere.
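Abstract (translated from Chinese): Beliefs are central to how humans reason, communicate, and form social connections, yet most computational approaches to studying them remain confined to narrow sociopolitical contexts and rely on fine-tuning for optimal performance. Despite the growing use of large language models (LLMs) across disciplines, it remains unclear how well these systems generalize across diverse belief domains. We introduce a systematic, reproducible benchmark that evaluates the ability of LLMs to predict individuals' stances on a wide range of topics in a zero-shot setting, using data from an online debate platform. The benchmark includes multiple informational conditions that isolate the contribution of demographic context and known prior beliefs to predictive success. Across several small- to medium-sized models, we find that providing more background information about an individual improves predictive accuracy, but performance varies substantially across belief domains. These findings reveal both the capacity and the limitations of current LLMs in emulating human reasoning, advance the study of machine behavior, and offer a scalable framework for modeling belief systems beyond the sociopolitical sphere.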

Title: Prompt Optimization as a State-Space Search Problem

Authors: Maanas Taneja
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.18619
Pdf URL: https://arxiv.org/pdf/2511.18619
Copy Paste: [[2511.18619]] Prompt Optimization as a State-Space Search Problem(https://arxiv.org/abs/2511.18619)
Keywords: language model, prompt
Abstract: Language Models are extremely susceptible to performance collapse with even small changes to input prompt strings. Libraries such as DSPy (from Stanford NLP) avoid this problem through demonstration-based prompt optimisation. Inspired by this, I propose an alternative approach that treats prompt optimisation as a classical state-space search problem. I model the prompt space as a graph where nodes represent prompt states and edges correspond to deliberate transformations such as shortening, adding examples, or reordering content. Using beam search and random walk algorithms, I systematically explore this space, evaluating candidates on development sets and pruning unpromising branches. Across five NLP tasks (sentiment classification, question answering, summarisation, reasoning, and natural language inference), I find that even shallow search configurations (beam width=2, depth=2) improve upon seed prompts on development sets. For instance, beam search achieves development accuracy gains from 0.40 to 0.80 on reasoning tasks, though test set improvements are more modest (0.20 to 0.50), indicating overfitting to the development heuristic. Analysis of successful optimisation paths reveals that transformations that make prompts concise appear most frequently, while verbosity operators are never selected. My results validate prompt optimization as a search problem and suggest that with greater computational resources and improved evaluation metrics, deeper exploration could yield more robust prompts that generalize beyond development sets. Code and implementation are available at [this https URL].
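Abstract (translated from Chinese): Language models are extremely susceptible to performance collapse even under small changes to the input prompt string. Libraries such as DSPy (from Stanford NLP) avoid this problem through demonstration-based prompt optimization. Inspired by this, I propose an alternative approach that treats prompt optimization as a classical state-space search problem. I model the prompt space as a graph in which nodes represent prompt states and edges correspond to deliberate transformations such as shortening, adding examples, or reordering content. Using beam search and random walk algorithms, I systematically explore this space, evaluating candidates on development sets and pruning unpromising branches. Across five NLP tasks (sentiment classification, question answering, summarization, reasoning, and natural language inference), I find that even shallow search configurations (beam width=2, depth=2) improve on seed prompts on the development sets. For example, beam search raises development accuracy from 0.40 to 0.80 on reasoning tasks, although test set improvements are more modest (0.20 to 0.50), which indicates overfitting to the development heuristic. Analysis of successful optimization paths shows that transformations that make prompts concise appear most frequently, while verbosity operators are never selected. My results validate prompt optimization as a search problem and suggest that, with greater computational resources and improved evaluation metrics, deeper exploration could yield more robust prompts that generalize beyond development sets. Code and implementation are available at [this https URL].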
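
The search described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the paper's code: the three operators are toy stand-ins for the shortening, example-adding, and reordering transformations the abstract names, and `toy_score` is a hypothetical placeholder for accuracy measured on a development set.

```python
from typing import Callable, Dict, List, Tuple

# Toy stand-ins for the transformations named in the abstract (hypothetical).
def shorten(prompt: str) -> str:
    """Drop the last two words, keeping at least one."""
    words = prompt.split()
    return " ".join(words[: max(1, len(words) - 2)])

def add_example(prompt: str) -> str:
    """Append a fixed worked example."""
    return prompt + " Example: 2 + 2 = 4."

def reorder(prompt: str) -> str:
    """Reverse the order of the sentences."""
    sentences = [s.strip() for s in prompt.split(".") if s.strip()]
    return ". ".join(reversed(sentences)) + "."

OPERATORS: List[Callable[[str], str]] = [shorten, add_example, reorder]

def toy_score(prompt: str) -> float:
    """Placeholder for dev-set accuracy: prefers shorter prompts (assumption)."""
    return float(-len(prompt.split()))

def beam_search(
    seed: str,
    score: Callable[[str], float],
    operators: List[Callable[[str], str]],
    beam_width: int = 2,
    depth: int = 2,
) -> Tuple[float, str]:
    """Expand every prompt in the beam with every operator, keep the top
    `beam_width` children per level, and return the best (score, prompt) seen."""
    beam: List[Tuple[float, str]] = [(score(seed), seed)]
    best = beam[0]
    for _ in range(depth):
        candidates: Dict[str, float] = {}
        for _, prompt in beam:
            for op in operators:
                child = op(prompt)
                if child not in candidates:  # score each distinct prompt once
                    candidates[child] = score(child)
        if not candidates:
            break
        ranked = sorted(
            ((s, p) for p, s in candidates.items()),
            key=lambda pair: pair[0],
            reverse=True,
        )
        beam = ranked[:beam_width]  # prune unpromising branches
        if beam[0][0] > best[0]:
            best = beam[0]
    return best

if __name__ == "__main__":
    seed_prompt = "Answer the question. Think step by step. Be brief."
    print(beam_search(seed_prompt, toy_score, OPERATORS, beam_width=2, depth=2))
```

Because the scoring function is only ever evaluated on development data, a search like this can overfit to it, which is consistent with the gap between development and test gains reported in the abstract.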

Title: OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph

Authors: Michael J. Bommarito II
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.18622
Pdf URL: https://arxiv.org/pdf/2511.18622
Copy Paste: [[2511.18622]] OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph(https://arxiv.org/abs/2511.18622)
Keywords: llm, agent
Abstract: We present OpenGloss, a synthetic encyclopedic dictionary and semantic knowledge graph for English that integrates lexicographic definitions, encyclopedic context, etymological histories, and semantic relationships in a unified resource. OpenGloss contains 537K senses across 150K lexemes, on par with WordNet 3.1 and Open English WordNet, while providing more than four times as many sense definitions. These lexemes include 9.1M semantic edges, 1M usage examples, 3M collocations, and 60M words of encyclopedic content. Generated through a multi-agent procedural generation pipeline with schema-validated LLM outputs and automated quality assurance, the entire resource was produced in under one week for under $1,000. This demonstrates that structured generation can create comprehensive lexical resources at cost and time scales impractical for manual curation, enabling rapid iteration as foundation models improve. The resource addresses gaps in pedagogical applications by providing integrated content -- definitions, examples, collocations, encyclopedias, etymology -- that supports both vocabulary learning and natural language processing tasks. As a synthetically generated resource, OpenGloss reflects both the capabilities and limitations of current foundation models. The dataset is publicly available on Hugging Face under CC-BY 4.0, enabling researchers and educators to build upon and adapt this resource.
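Abstract (translated from Chinese): We present OpenGloss, a synthetic encyclopedic dictionary and semantic knowledge graph for English that integrates lexicographic definitions, encyclopedic context, etymological histories, and semantic relationships in a unified resource. OpenGloss contains 537K senses across 150K lexemes, on par with WordNet 3.1 and Open English WordNet, while providing more than four times as many sense definitions. These lexemes include 9.1M semantic edges, 1M usage examples, 3M collocations, and 60M words of encyclopedic content. The entire resource was generated through a multi-agent procedural generation pipeline with schema-validated LLM outputs and automated quality assurance, in under one week and for under $1,000. This shows that structured generation can create comprehensive lexical resources at cost and time scales that are impractical for manual curation, enabling rapid iteration as foundation models improve. The resource addresses gaps in pedagogical applications by providing integrated content (definitions, examples, collocations, encyclopedic entries, etymology) that supports both vocabulary learning and natural language processing tasks. As a synthetically generated resource, OpenGloss reflects both the capabilities and the limitations of current foundation models. The dataset is publicly available on Hugging Face under CC-BY 4.0, enabling researchers and educators to build on and adapt this resource.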

Title: No Free Lunch in Language Model Bias Mitigation? Targeted Bias Reduction Can Exacerbate Unmitigated LLM Biases

Authors: Shireen Chand, Faith Baca, Emilio Ferrara
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2511.18635
Pdf URL: https://arxiv.org/pdf/2511.18635
Copy Paste: [[2511.18635]] No Free Lunch in Language Model Bias Mitigation? Targeted Bias Reduction Can Exacerbate Unmitigated LLM Biases(https://arxiv.org/abs/2511.18635)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) inherit societal biases from their training data, potentially leading to harmful or unfair outputs. While various techniques aim to mitigate these biases, their effects are often evaluated only along the dimension of the bias being targeted. This work investigates the cross-category consequences of targeted bias mitigation. We study four bias mitigation techniques applied across ten models from seven model families, and we explore racial, religious, profession- and gender-related biases. We measure the impact of debiasing on model coherence and stereotypical preference using the StereoSet benchmark. Our results consistently show that while targeted mitigation can sometimes reduce bias in the intended dimension, it frequently leads to unintended and often negative consequences in others, such as increasing model bias and decreasing general coherence. These findings underscore the critical need for robust, multi-dimensional evaluation tools when examining and developing bias mitigation strategies to avoid inadvertently shifting or worsening bias along untargeted axes.
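Abstract (translated from Chinese): Large language models (LLMs) inherit societal biases from their training data, which can lead to harmful or unfair outputs. While various techniques aim to mitigate these biases, their effects are often evaluated only along the dimension of the bias being targeted. This work investigates the cross-category consequences of targeted bias mitigation. We study four bias mitigation techniques applied across ten models from seven model families, and explore racial, religious, profession-related, and gender-related biases. We use the StereoSet benchmark to measure the impact of debiasing on model coherence and stereotypical preference. Our results consistently show that, although targeted mitigation can sometimes reduce bias in the intended dimension, it frequently leads to unintended and often negative consequences in other dimensions, such as increased model bias and decreased general coherence. These findings underscore the critical need for robust, multi-dimensional evaluation tools when examining and developing bias mitigation strategies, to avoid inadvertently shifting or worsening bias along untargeted axes.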

Title: Evaluating Large Language Models on the 2026 Korean CSAT Mathematics Exam: Measuring Mathematical Ability in a Zero-Data-Leakage Setting

Authors: Goun Pyeon, Inbum Heo, Jeesu Jung, Taewook Hwang, Hyuk Namgoong, Hyein Seo, Yerim Han, Eunbin Kim, Hyeonseok Kang, Sangkeun Jung
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.18649
Pdf URL: https://arxiv.org/pdf/2511.18649
Copy Paste: [[2511.18649]] Evaluating Large Language Models on the 2026 Korean CSAT Mathematics Exam: Measuring Mathematical Ability in a Zero-Data-Leakage Setting(https://arxiv.org/abs/2511.18649)
Keywords: language model, gpt, llm, prompt
Abstract: This study systematically evaluated the mathematical reasoning capabilities of Large Language Models (LLMs) using the 2026 Korean College Scholastic Ability Test (CSAT) Mathematics section, ensuring a completely contamination-free evaluation environment. To address data leakage issues in existing benchmarks, we digitized all 46 questions (22 common and 24 elective) within two hours of the exam's public release, eliminating any possibility of inclusion in model training data. We conducted comprehensive evaluations of 24 state-of-the-art LLMs across varying input modalities (text, image, text+figure) and prompt languages (Korean, English). GPT-5 Codex achieved the only perfect score (100 points) with text input and Korean prompts, while Grok 4, GPT-5, and Deepseek R1 scored above 95 points. Notably, gpt-oss-20B achieved 95.7 points despite its relatively small size, demonstrating high cost-effectiveness. Problem-specific analysis revealed geometry as the weakest domain (77.7% average) with significant performance degradation on 4-point high-difficulty problems. Text input consistently outperformed image input, while prompt language effects varied by model scale. In reasoning enhancement experiments with GPT-5 series, increased reasoning intensity improved performance (from 82.6 to 100 points) but quadrupled token usage and drastically reduced efficiency, suggesting that models with minimal reasoning may be more practical. This research contributes: (1) implementation of a completely unexposed evaluation environment, (2) a real-exam-based LLM assessment framework, and (3) a practical evaluation perspective integrating performance, cost, and time considerations. Detailed results and model comparisons are available at the 2026 Korean CSAT LLM Evaluation Leaderboard (this https URL).
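Abstract (translated from Chinese): This study systematically evaluates the mathematical reasoning capabilities of large language models (LLMs) using the Mathematics section of the 2026 Korean College Scholastic Ability Test (CSAT), ensuring a completely contamination-free evaluation environment. To address data leakage issues in existing benchmarks, we digitized all 46 questions (22 common and 24 elective) within two hours of the exam's public release, eliminating any possibility of inclusion in model training data. We conducted comprehensive evaluations of 24 state-of-the-art LLMs across different input modalities (text, image, text+figure) and prompt languages (Korean, English). GPT-5 Codex achieved the only perfect score (100 points) with text input and Korean prompts, while Grok 4, GPT-5, and Deepseek R1 scored above 95 points. Notably, gpt-oss-20B achieved 95.7 points despite its relatively small size, demonstrating high cost-effectiveness. Problem-specific analysis showed geometry to be the weakest domain (77.7% average), with significant performance degradation on 4-point high-difficulty problems. Text input consistently outperformed image input, while the effect of prompt language varied with model scale. In reasoning enhancement experiments with the GPT-5 series, increased reasoning intensity improved performance (from 82.6 to 100 points) but quadrupled token usage and drastically reduced efficiency, suggesting that models with minimal reasoning may be more practical. This research contributes: (1) implementation of a completely unexposed evaluation environment, (2) a real-exam-based LLM assessment framework, and (3) a practical evaluation perspective that integrates performance, cost, and time considerations. Detailed results and model comparisons are available at the 2026 Korean CSAT LLM Evaluation Leaderboard (this https URL).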

Title: CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning

Authors: Jie He, Richard He Bai, Sinead Williamson, Jeff Z. Pan, Navdeep Jaitly, Yizhe Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.18659
Pdf URL: https://arxiv.org/pdf/2511.18659
Copy Paste: [[2511.18659]] CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning(https://arxiv.org/abs/2511.18659)
Keywords: language model, llm, long context, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge but still suffers from long contexts and disjoint retrieval-generation optimization. In this work, we propose CLaRa (Continuous Latent Reasoning), a unified framework that performs embedding-based compression and joint optimization in a shared continuous space. To obtain semantically rich and retrievable compressed vectors, we introduce SCP, a key-preserving data synthesis framework using QA and paraphrase supervision. CLaRa then trains the reranker and generator end-to-end via a single language modeling loss, with gradients flowing through both modules using a differentiable top-k estimator. Theoretically, this unified optimization aligns retrieval relevance with answer quality. Experiments across multiple QA benchmarks show that CLaRa achieves state-of-the-art compression and reranking performance, often surpassing text-based fine-tuned baselines.
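Abstract (translated from Chinese): Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge, but still suffers from long contexts and disjoint retrieval-generation optimization. In this work, we propose CLaRa (Continuous Latent Reasoning), a unified framework that performs embedding-based compression and joint optimization in a shared continuous space. To obtain semantically rich and retrievable compressed vectors, we introduce SCP, a key-preserving data synthesis framework that uses QA and paraphrase supervision. CLaRa then trains the reranker and generator end-to-end with a single language modeling loss, with gradients flowing through both modules via a differentiable top-k estimator. Theoretically, this unified optimization aligns retrieval relevance with answer quality. Experiments across multiple QA benchmarks show that CLaRa achieves state-of-the-art compression and reranking performance, often surpassing text-based fine-tuned baselines.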

Title: Empathetic Cascading Networks: A Multi-Stage Prompting Technique for Reducing Social Biases in Large Language Models

Authors: Wangjiaxuan Xin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.18696
Pdf URL: https://arxiv.org/pdf/2511.18696
Copy Paste: [[2511.18696]] Empathetic Cascading Networks: A Multi-Stage Prompting Technique for Reducing Social Biases in Large Language Models(https://arxiv.org/abs/2511.18696)
Keywords: language model, gpt, prompt
Abstract: This report presents the Empathetic Cascading Networks (ECN) framework, a multi-stage prompting method designed to enhance the empathetic and inclusive capabilities of large language models. ECN employs four stages: Perspective Adoption, Emotional Resonance, Reflective Understanding, and Integrative Synthesis, to guide models toward generating emotionally resonant and contextually aware responses. Experimental results demonstrate that ECN achieves the highest Empathy Quotient (EQ) scores across GPT-3.5-turbo and GPT-4, while maintaining competitive Regard and Perplexity metrics. These findings emphasize ECN's potential for applications requiring empathy and inclusivity in conversational AI.
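Abstract (translated from Chinese): This report presents the Empathetic Cascading Networks (ECN) framework, a multi-stage prompting method designed to enhance the empathetic and inclusive capabilities of large language models. ECN uses four stages, Perspective Adoption, Emotional Resonance, Reflective Understanding, and Integrative Synthesis, to guide models toward generating emotionally resonant and contextually aware responses. Experimental results show that ECN achieves the highest Empathy Quotient (EQ) scores across GPT-3.5-turbo and GPT-4, while maintaining competitive Regard and Perplexity metrics. These findings highlight ECN's potential for conversational AI applications that require empathy and inclusivity.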

Title: RhinoInsight: Improving Deep Research through Control Mechanisms for Model Behavior and Context

Authors: Yu Lei, Shuzheng Si, Wei Wang, Yifei Wu, Gang Chen, Fanchao Qi, Maosong Sun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.18743
Pdf URL: https://arxiv.org/pdf/2511.18743
Copy Paste: [[2511.18743]] RhinoInsight: Improving Deep Research through Control Mechanisms for Model Behavior and Context(https://arxiv.org/abs/2511.18743)
Keywords: language model, llm, hallucination, agent
Abstract: Large language models are evolving from single-turn responders into tool-using agents capable of sustained reasoning and decision-making for deep research. Prevailing systems adopt a linear plan-search-write-report pipeline, which suffers from error accumulation and context rot due to the lack of explicit control over both model behavior and context. We introduce RhinoInsight, a deep research framework that adds two control mechanisms to enhance robustness, traceability, and overall quality without parameter updates. First, a Verifiable Checklist module transforms user requirements into traceable and verifiable sub-goals, incorporates human or LLM critics for refinement, and compiles a hierarchical outline to anchor subsequent actions and prevent non-executable planning. Second, an Evidence Audit module structures search content, iteratively updates the outline, and prunes noisy context, while a critic ranks and binds high-quality evidence to drafted content to ensure verifiability and reduce hallucinations. Our experiments demonstrate that RhinoInsight achieves state-of-the-art performance on deep research tasks while remaining competitive on deep search tasks.
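Abstract (translated from Chinese): Large language models are evolving from single-turn responders into tool-using agents capable of sustained reasoning and decision-making for deep research. Prevailing systems adopt a linear pipeline that goes from planning to searching to writing a report, which suffers from error accumulation and context rot because it lacks explicit control over both model behavior and context. We introduce RhinoInsight, a deep research framework that adds two control mechanisms to improve robustness, traceability, and overall quality without parameter updates. First, a Verifiable Checklist module turns user requirements into traceable and verifiable sub-goals, incorporates human or LLM critics for refinement, and compiles a hierarchical outline to anchor subsequent actions and prevent non-executable planning. Second, an Evidence Audit module structures search content, iteratively updates the outline, and prunes noisy context, while a critic ranks high-quality evidence and binds it to drafted content to ensure verifiability and reduce hallucinations. Our experiments show that RhinoInsight achieves state-of-the-art performance on deep research tasks while remaining competitive on deep search tasks.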

Title: Large Language Models Require Curated Context for Reliable Political Fact-Checking -- Even with Reasoning and Web Search

Authors: Matthew R. DeVerna, Kai-Cheng Yang, Harry Yaojun Yan, Filippo Menczer
Subjects: cs.CL, cs.CY, cs.IR
Abstract URL: https://arxiv.org/abs/2511.18749
Pdf URL: https://arxiv.org/pdf/2511.18749
Copy Paste: [[2511.18749]] Large Language Models Require Curated Context for Reliable Political Fact-Checking -- Even with Reasoning and Web Search(https://arxiv.org/abs/2511.18749)
Keywords: language model, llm, chat
Abstract: Large language models (LLMs) have raised hopes for automated end-to-end fact-checking, but prior studies report mixed results. As mainstream chatbots increasingly ship with reasoning capabilities and web search tools -- and millions of users already rely on them for verification -- rigorous evaluation is urgent. We evaluate 15 recent LLMs from OpenAI, Google, Meta, and DeepSeek on more than 6,000 claims fact-checked by PolitiFact, comparing standard models with reasoning- and web-search variants. Standard models perform poorly, reasoning offers minimal benefits, and web search provides only moderate gains, despite fact-checks being available on the web. In contrast, a curated RAG system using PolitiFact summaries improved macro F1 by 233% on average across model variants. These findings suggest that giving models access to curated high-quality context is a promising path for automated fact-checking.
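Abstract (translated from Chinese): Large language models (LLMs) have raised hopes for automated end-to-end fact-checking, but prior studies report mixed results. As mainstream chatbots increasingly ship with reasoning capabilities and web search tools, and millions of users already rely on them for verification, rigorous evaluation is urgent. We evaluate 15 recent LLMs from OpenAI, Google, Meta, and DeepSeek on more than 6,000 claims fact-checked by PolitiFact, comparing standard models with reasoning and web-search variants. Standard models perform poorly, reasoning offers minimal benefits, and web search provides only moderate gains, even though the fact-checks are available on the web. In contrast, a curated RAG system using PolitiFact summaries improved macro F1 by 233% on average across model variants. These findings suggest that giving models access to curated, high-quality context is a promising path for automated fact-checking.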

Title: Context-Aware Whisper for Arabic ASR Under Linguistic Varieties

Authors: Bashar Talafha, Amin Abu Alhassan, Muhammad Abdul-Mageed
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.18774
Pdf URL: https://arxiv.org/pdf/2511.18774
Copy Paste: [[2511.18774]] Context-Aware Whisper for Arabic ASR Under Linguistic Varieties(https://arxiv.org/abs/2511.18774)
Keywords: hallucination, prompt
Abstract: Low-resource ASR remains a challenging problem, especially for languages like Arabic that exhibit wide dialectal variation and limited labeled data. We propose context-aware prompting strategies to adapt OpenAI's Whisper for Arabic speech recognition without retraining. Our methods include decoder prompting with first-pass transcriptions or retrieved utterances, and encoder prefixing using speech synthesized in the target speaker's voice. We introduce techniques such as prompt reordering, speaker-aware prefix synthesis, and modality-specific retrieval (lexical, semantic, acoustic) to improve transcription in real-world, zero-shot settings. Evaluated on nine Arabic linguistic conditions, our approach reduces WER by up to 22.3% on Modern Standard Arabic and 9.2% on dialectal speech, significantly mitigating hallucinations and speaker mismatch.
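Abstract (translated from Chinese): Low-resource ASR remains a challenging problem, especially for languages such as Arabic that exhibit wide dialectal variation and have limited labeled data. We propose context-aware prompting strategies that adapt OpenAI's Whisper to Arabic speech recognition without retraining. Our methods include decoder prompting with first-pass transcriptions or retrieved utterances, and encoder prefixing with speech synthesized in the target speaker's voice. We introduce techniques such as prompt reordering, speaker-aware prefix synthesis, and modality-specific retrieval (lexical, semantic, acoustic) to improve transcription in real-world, zero-shot settings. Evaluated on nine Arabic linguistic conditions, our approach reduces WER by up to 22.3% on Modern Standard Arabic and 9.2% on dialectal speech, significantly mitigating hallucinations and speaker mismatch.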
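
For readers unfamiliar with the metric behind these figures, the sketch below computes word error rate (WER) as word-level edit distance divided by the reference length, plus a relative-reduction helper. The abstract does not say whether its reported reductions are relative or absolute, so the helper illustrates only one possible reading; the function names and example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words,
    computed with a standard Levenshtein dynamic program over whitespace tokens."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer (one reading of 'reduces WER by')."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    print(word_error_rate(reference, "the cat sat on mat"))
    print(relative_reduction(0.5, 0.4))
```

Whitespace tokenization is a simplification; Arabic ASR evaluations typically also normalize text (for example diacritics) before scoring, a step this sketch leaves out.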

Title: HyperbolicRAG: Enhancing Retrieval-Augmented Generation with Hyperbolic Representations

Authors: Cao Linxiao, Wang Ruitao, Li Jindong, Zhou Zhipeng, Yang Menglin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.18808
Pdf URL: https://arxiv.org/pdf/2511.18808
Copy Paste: [[2511.18808]] HyperbolicRAG: Enhancing Retrieval-Augmented Generation with Hyperbolic Representations(https://arxiv.org/abs/2511.18808)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) enables large language models (LLMs) to access external knowledge, helping mitigate hallucinations and enhance domain-specific expertise. Graph-based RAG enhances structural reasoning by introducing explicit relational organization that enables information propagation across semantically connected text units. However, these methods typically rely on Euclidean embeddings that capture semantic similarity but lack a geometric notion of hierarchical depth, limiting their ability to represent abstraction relationships inherent in complex knowledge graphs. To capture both fine-grained semantics and global hierarchy, we propose HyperbolicRAG, a retrieval framework that integrates hyperbolic geometry into graph-based RAG. HyperbolicRAG introduces three key designs: (1) a depth-aware representation learner that embeds nodes within a shared Poincare manifold to align semantic similarity with hierarchical containment, (2) an unsupervised contrastive regularization that enforces geometric consistency across abstraction levels, and (3) a mutual-ranking fusion mechanism that jointly exploits retrieval signals from Euclidean and hyperbolic spaces, emphasizing cross-space agreement during inference. Extensive experiments across multiple QA benchmarks demonstrate that HyperbolicRAG outperforms competitive baselines, including both standard RAG and graph-augmented baselines.
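Abstract (translated from Chinese): Retrieval-augmented generation (RAG) gives large language models (LLMs) access to external knowledge, helping to mitigate hallucinations and strengthen domain-specific expertise. Graph-based RAG enhances structural reasoning by introducing explicit relational organization, which enables information to propagate across semantically connected text units. However, these methods typically rely on Euclidean embeddings that capture semantic similarity but lack a geometric notion of hierarchical depth, which limits their ability to represent the abstraction relationships inherent in complex knowledge graphs. To capture both fine-grained semantics and global hierarchy, we propose HyperbolicRAG, a retrieval framework that integrates hyperbolic geometry into graph-based RAG. HyperbolicRAG introduces three key designs: (1) a depth-aware representation learner that embeds nodes in a shared Poincaré manifold to align semantic similarity with hierarchical containment, (2) an unsupervised contrastive regularization that enforces geometric consistency across abstraction levels, and (3) a mutual-ranking fusion mechanism that jointly exploits retrieval signals from Euclidean and hyperbolic spaces, emphasizing cross-space agreement during inference. Extensive experiments across multiple QA benchmarks show that HyperbolicRAG outperforms competitive baselines, including both standard RAG and graph-augmented baselines.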
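
To make the geometric intuition concrete, here is a minimal sketch of the standard geodesic distance on the Poincaré ball with curvature -1. The abstract does not state which distance or curvature HyperbolicRAG uses, so treat this as the textbook formula rather than the paper's implementation; the function name and sample points are illustrative.

```python
import math
from typing import Sequence


def poincare_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Geodesic distance between two points strictly inside the unit ball:
    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))."""
    if len(u) != len(v):
        raise ValueError("points must have the same dimension")
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


if __name__ == "__main__":
    origin = (0.0, 0.0)
    # Two pairs with the same Euclidean gap (0.4) but different depths in the ball.
    print(poincare_distance(origin, (0.4, 0.0)))
    print(poincare_distance((0.5, 0.0), (0.9, 0.0)))
```

Distances grow rapidly toward the boundary of the ball, which is why embeddings of this kind can place general concepts near the origin and specific ones near the edge, giving the notion of hierarchical depth that the abstract says Euclidean embeddings lack.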

Title: Concept than Document: Context Compression via AMR-based Conceptual Entropy

Authors: Kaize Shi, Xueyao Sun, Xiaohui Tao, Lin Li, Qika Lin, Guandong Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.18832
Pdf URL: https://arxiv.org/pdf/2511.18832
Copy Paste: [[2511.18832]] Concept than Document: Context Compression via AMR-based Conceptual Entropy(https://arxiv.org/abs/2511.18832)
Keywords: language model, llm, long context, retrieval-augmented generation
Abstract: Large Language Models (LLMs) face information overload when handling long contexts, particularly in Retrieval-Augmented Generation (RAG) where extensive supporting documents often introduce redundant content. This issue not only weakens reasoning accuracy but also increases computational overhead. We propose an unsupervised context compression framework that exploits Abstract Meaning Representation (AMR) graphs to preserve semantically essential information while filtering out irrelevant text. By quantifying node-level entropy within AMR graphs, our method estimates the conceptual importance of each node, enabling the retention of core semantics. Specifically, we construct AMR graphs from raw contexts, compute the conceptual entropy of each node, and screen significant informative nodes to form a context that is more condensed and semantically focused than the raw documents. Experiments on the PopQA and EntityQuestions datasets show that our method outperforms vanilla and other baselines, achieving higher accuracy while substantially reducing context length. To the best of our knowledge, this is the first work introducing AMR-based conceptual entropy for context compression, demonstrating the potential of stable linguistic features in context engineering.

Title: Large Language Models for the Summarization of Czech Documents: From History to the Present

Authors: Václav Tran, Jakub Šmíd, Ladislav Lenc, Jean-Pierre Salmon, Pavel Král
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.18848
Pdf URL: https://arxiv.org/pdf/2511.18848
Copy Paste: [[2511.18848]] Large Language Models for the Summarization of Czech Documents: From History to the Present(https://arxiv.org/abs/2511.18848)
Keywords: language model, llm
Abstract: Text summarization is the task of automatically condensing longer texts into shorter, coherent summaries while preserving the original meaning and key information. Although this task has been extensively studied in English and other high-resource languages, Czech summarization, particularly in the context of historical documents, remains underexplored. This is largely due to the inherent linguistic complexity of Czech and the lack of high-quality annotated datasets. In this work, we address this gap by leveraging the capabilities of Large Language Models (LLMs), specifically Mistral and mT5, which have demonstrated strong performance across a wide range of natural language processing tasks and multilingual settings. In addition, we also propose a translation-based approach that first translates Czech texts into English, summarizes them using an English-language model, and then translates the summaries back into Czech. Our study makes the following main contributions: We demonstrate that LLMs achieve new state-of-the-art results on the SumeCzech dataset, a benchmark for modern Czech text summarization, showing the effectiveness of multilingual LLMs even for morphologically rich, medium-resource languages like Czech. We introduce a new dataset, Posel od Čerchova, designed for the summarization of historical Czech texts. This dataset is derived from digitized 19th-century publications and annotated for abstractive summarization. We provide initial baselines using modern LLMs to facilitate further research in this underrepresented area. By combining cutting-edge models with both modern and historical Czech datasets, our work lays the foundation for further progress in Czech summarization and contributes valuable resources for future research in Czech historical document processing and low-resource summarization more broadly.

Title: Cognitive Alpha Mining via LLM-Driven Code-Based Evolution

Authors: Fengyuan Liu, Huang Yi, Sichun Luo, Yuqi Wang, Yazheng Yang, Xinye Li, Zefa Hu, Junlan Feng, Qi Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.18850
Pdf URL: https://arxiv.org/pdf/2511.18850
Copy Paste: [[2511.18850]] Cognitive Alpha Mining via LLM-Driven Code-Based Evolution(https://arxiv.org/abs/2511.18850)
Keywords: language model, llm, prompt, agent
Abstract: Discovering effective predictive signals, or "alphas," from financial data with high dimensionality and extremely low signal-to-noise ratio remains a difficult open problem. Despite progress in deep learning, genetic programming, and, more recently, large language model (LLM)-based factor generation, existing approaches still explore only a narrow region of the vast alpha search space. Neural models tend to produce opaque and fragile patterns, while symbolic or formula-based methods often yield redundant or economically ungrounded expressions that generalize poorly. Although different in form, these paradigms share a key limitation: none can conduct broad, structured, and human-like exploration that balances logical consistency with creative leaps. To address this gap, we introduce the Cognitive Alpha Mining Framework (CogAlpha), which combines code-level alpha representation with LLM-driven reasoning and evolutionary search. Treating LLMs as adaptive cognitive agents, our framework iteratively refines, mutates, and recombines alpha candidates through multi-stage prompts and financial feedback. This synergistic design enables deeper thinking, richer structural diversity, and economically interpretable alpha discovery, while greatly expanding the effective search space. Experiments on A-share equities demonstrate that CogAlpha consistently discovers alphas with superior predictive accuracy, robustness, and generalization over existing methods. Our results highlight the promise of aligning evolutionary optimization with LLM-based reasoning for automated and explainable alpha discovery. All source code will be released.

Title: FanarGuard: A Culturally-Aware Moderation Filter for Arabic Language Models

Authors: Masoomali Fatehkia, Enes Altinisik, Husrev Taha Sencar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.18852
Pdf URL: https://arxiv.org/pdf/2511.18852
Copy Paste: [[2511.18852]] FanarGuard: A Culturally-Aware Moderation Filter for Arabic Language Models(https://arxiv.org/abs/2511.18852)
Keywords: language model, llm, prompt
Abstract: Content moderation filters are a critical safeguard against alignment failures in language models. Yet most existing filters focus narrowly on general safety and overlook cultural context. In this work, we introduce FanarGuard, a bilingual moderation filter that evaluates both safety and cultural alignment in Arabic and English. We construct a dataset of over 468K prompt and response pairs, drawn from synthetic and public datasets, scored by a panel of LLM judges on harmlessness and cultural awareness, and use it to train two filter variants. To rigorously evaluate cultural alignment, we further develop the first benchmark targeting Arabic cultural contexts, comprising over 1k norm-sensitive prompts with LLM-generated responses annotated by human raters. Results show that FanarGuard achieves stronger agreement with human annotations than inter-annotator reliability, while matching the performance of state-of-the-art filters on safety benchmarks. These findings highlight the importance of integrating cultural awareness into moderation and establish FanarGuard as a practical step toward more context-sensitive safeguards.

Title: Generating Reading Comprehension Exercises with Large Language Models for Educational Applications

Authors: Xingyu Huang, Fei Jiang, Jianli Xiao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.18860
Pdf URL: https://arxiv.org/pdf/2511.18860
Copy Paste: [[2511.18860]] Generating Reading Comprehension Exercises with Large Language Models for Educational Applications(https://arxiv.org/abs/2511.18860)
Keywords: language model, llm
Abstract: With the rapid development of large language models (LLMs), their applications have grown substantially. In the education domain, LLMs demonstrate significant potential, particularly in automatic text generation, which enables the creation of intelligent and adaptive learning content. This paper proposes a new LLM framework named Reading Comprehension Exercise Generation (RCEG), which automatically generates high-quality, personalized English reading comprehension exercises. First, RCEG uses fine-tuned LLMs to generate content candidates. Then, it uses a discriminator to select the best candidate. As a result, the quality of the generated content improves greatly. To evaluate the performance of RCEG, a dedicated dataset for English reading comprehension is constructed for the experiments, and comprehensive evaluation metrics are used to analyze the experimental results. These metrics include content diversity, factual accuracy, linguistic toxicity, and pedagogical alignment. Experimental results show that RCEG significantly improves the relevance and cognitive appropriateness of the generated exercises.

Title: Think Before You Prune: Selective Self-Generated Calibration for Pruning Large Reasoning Models

Authors: Yang Xiang, Yixin Ji, Juntao Li, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.18864
Pdf URL: https://arxiv.org/pdf/2511.18864
Copy Paste: [[2511.18864]] Think Before You Prune: Selective Self-Generated Calibration for Pruning Large Reasoning Models(https://arxiv.org/abs/2511.18864)
Keywords: language model, llm, chain-of-thought
Abstract: Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex reasoning benchmarks. However, their long chain-of-thought reasoning processes incur significant inference overhead. Pruning has emerged as a promising approach to reducing computational costs. However, existing efforts have primarily focused on large language models (LLMs), while pruning LRMs remains unexplored. In this work, we conduct the first empirical study on pruning LRMs and show that directly applying existing pruning techniques fails to yield satisfactory results. Our findings indicate that using self-generated reasoning data for calibration can substantially improve pruning performance. We further investigate how the difficulty and length of reasoning data affect pruning outcomes. Our analysis reveals that challenging and moderately long self-generated reasoning data serve as ideal calibration data. Based on these insights, we propose a Selective Self-Generated Reasoning (SSGR) data construction strategy to provide effective calibration data for pruning LRMs. Experimental results on the DeepSeek-R1-Distill model series validate that our strategy improves the reasoning ability of pruned LRMs by 10%-13% compared to general pruning methods.

Title: CoreEval: Automatically Building Contamination-Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation

Authors: Jingqian Zhao, Bingbing Wang, Geng Tu, Yice Zhang, Qianlong Wang, Bin Liang, Jing Li, Ruifeng Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.18889
Pdf URL: https://arxiv.org/pdf/2511.18889
Copy Paste: [[2511.18889]] CoreEval: Automatically Building Contamination-Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation(https://arxiv.org/abs/2511.18889)
Keywords: llm
Abstract: Data contamination poses a significant challenge to the fairness of LLM evaluations in natural language processing tasks by inadvertently exposing models to test data during training. Current studies attempt to mitigate this issue by modifying existing datasets or generating new ones from freshly collected information. However, these methods fall short of ensuring contamination-resilient evaluation, as they fail to fully eliminate pre-existing knowledge from models or preserve the semantic complexity of the original datasets. To address these limitations, we propose CoreEval, a Contamination-resilient Evaluation strategy for automatically updating data with real-world knowledge. This approach begins by extracting entity relationships from the original data and leveraging the GDELT database to retrieve relevant, up-to-date knowledge. The retrieved knowledge is then recontextualized and integrated with the original data, which is refined and restructured to ensure semantic coherence and enhanced task relevance. Ultimately, a robust data reflection mechanism is employed to iteratively verify and refine labels, ensuring consistency between the updated and original datasets. Extensive experiments on updated datasets validate the robustness of CoreEval, demonstrating its effectiveness in mitigating performance overestimation caused by data contamination.

Title: Reproducibility Study of Large Language Model Bayesian Optimization

Authors: Adam Rychert, Gasper Spagnolo, Evgenii Posashkov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.18891
Pdf URL: https://arxiv.org/pdf/2511.18891
Copy Paste: [[2511.18891]] Reproducibility Study of Large Language Model Bayesian Optimization(https://arxiv.org/abs/2511.18891)
Keywords: language model, gpt, prompt
Abstract: In this reproducibility study, we revisit the LLAMBO framework of Daxberger et al. (2024), a prompting-based Bayesian optimization (BO) method that uses large language models as discriminative surrogates and acquisition optimizers via text-only interactions. We replicate the core Bayesmark and HPOBench experiments under the original evaluation protocol, but replace GPT-3.5 with the open-weight Llama 3.1 70B model used for all text encoding components. Our results broadly confirm the main claims of LLAMBO. Contextual warm starting via textual problem and hyperparameter descriptions substantially improves early regret behaviour and reduces variance across runs. LLAMBO's discriminative surrogate is weaker than GP or SMAC as a pure single task regressor, yet benefits from cross task semantic priors induced by the language model. Ablations that remove textual context markedly degrade predictive accuracy and calibration, while the LLAMBO candidate sampler consistently generates higher quality and more diverse proposals than TPE or random sampling. Experiments with smaller backbones (Gemma 27B, Llama 3.1 8B) yield unstable or invalid predictions, suggesting insufficient capacity for reliable surrogate behaviour. Overall, our study shows that the LLAMBO architecture is robust to changing the language model backbone and remains effective when instantiated with Llama 3.1 70B.

Title: Look It Up: Analysing Internal Web Search Capabilities of Modern LLMs

Authors: Sahil Kale
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.18931
Pdf URL: https://arxiv.org/pdf/2511.18931
Copy Paste: [[2511.18931]] Look It Up: Analysing Internal Web Search Capabilities of Modern LLMs(https://arxiv.org/abs/2511.18931)
Keywords: language model, gpt, llm
Abstract: Modern large language models integrate web search to provide real-time answers, yet it remains unclear whether they are efficiently calibrated to use search when it is actually needed. We introduce a benchmark evaluating both the necessity and effectiveness of web access across commercial models with no access to internal states or parameters. The dataset includes a static split of 783 temporally anchored questions answerable from pre-cutoff knowledge, aimed at testing whether models invoke search based on low internal confidence, and a dynamic split of 288 post-cutoff queries designed to test whether models recognise when search is required and retrieve updated information. Web access substantially improves static accuracy for GPT-5-mini and Claude Haiku 4.5, though confidence calibration worsens. On dynamic queries, both models frequently invoke search yet remain below 70 percent accuracy due to weak query formulation. Costs per accuracy-improving call remain low, but returns diminish once initial retrieval fails. Selective invocation helps, but models become overconfident and inconsistent after search. Overall, built-in web search meaningfully improves factual accuracy and can be invoked selectively, yet models remain overconfident, skip retrieval when it is essential, and falter once initial search queries underperform. Taken together, internal web search works better as a good low-latency verification layer than a reliable analytical tool, with clear room for improvement.

Title: Skeletons Matter: Dynamic Data Augmentation for Text-to-Query

Authors: Yuchen Ji, Bo Xu, Jie Shi, Jiaqing Liang, Deqing Yang, Yu Mao, Hai Chen, Yanghua Xiao
Subjects: cs.CL, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2511.18934
Pdf URL: https://arxiv.org/pdf/2511.18934
Copy Paste: [[2511.18934]] Skeletons Matter: Dynamic Data Augmentation for Text-to-Query(https://arxiv.org/abs/2511.18934)
Keywords: language model, llm
Abstract: The task of translating natural language questions into query languages has long been a central focus in semantic parsing. Recent advancements in Large Language Models (LLMs) have significantly accelerated progress in this field. However, existing studies typically focus on a single query language, resulting in methods with limited generalizability across different languages. In this paper, we formally define the Text-to-Query task paradigm, unifying semantic parsing tasks across various query languages. We identify query skeletons as a shared optimization target of Text-to-Query tasks, and propose a general dynamic data augmentation framework that explicitly diagnoses model-specific weaknesses in handling these skeletons to synthesize targeted training data. Experiments on four Text-to-Query benchmarks demonstrate that our method achieves state-of-the-art performance using only a small amount of synthesized data, highlighting the efficiency and generality of our approach and laying a solid foundation for unified research on Text-to-Query tasks. We release our code at this https URL.

Title: GraphMind: Theorem Selection and Conclusion Generation Framework with Dynamic GNN for LLM Reasoning

Authors: Yutong Li, Yitian Zhou, Xudong Wang, GuoChen, Caiyan Qin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19078
Pdf URL: https://arxiv.org/pdf/2511.19078
Copy Paste: [[2511.19078]] GraphMind: Theorem Selection and Conclusion Generation Framework with Dynamic GNN for LLM Reasoning(https://arxiv.org/abs/2511.19078)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, including multi-step reasoning such as mathematical proving. However, existing approaches often lack an explicit and dynamic mechanism to structurally represent and evolve intermediate reasoning states, which limits their ability to perform context-aware theorem selection and iterative conclusion generation. To address these challenges, we propose GraphMind, a novel dynamic graph-based framework that integrates the graph neural network (GNN) with LLMs to iteratively select theorems and generate intermediate conclusions for multi-step reasoning. Our method models the reasoning process as a heterogeneous evolving graph, where nodes represent conditions, theorems, and conclusions, while edges capture logical dependencies between nodes. By encoding the current reasoning state with GNN and leveraging semantic matching for theorem selection, our framework enables context-aware, interpretable, and structured reasoning in a closed-loop manner. Experiments on various question-answering (QA) datasets demonstrate that our proposed GraphMind method achieves consistent performance improvements and significantly outperforms existing baselines in multi-step reasoning, validating the effectiveness and generalizability of our approach.

Title: A Multi-Agent LLM Framework for Multi-Domain Low-Resource In-Context NER via Knowledge Retrieval, Disambiguation and Reflective Analysis

Authors: Wenxuan Mu, Jinzhong Ning, Di Zhao, Yijia Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.19083
Pdf URL: https://arxiv.org/pdf/2511.19083
Copy Paste: [[2511.19083]] A Multi-Agent LLM Framework for Multi-Domain Low-Resource In-Context NER via Knowledge Retrieval, Disambiguation and Reflective Analysis(https://arxiv.org/abs/2511.19083)
Keywords: language model, llm, agent
Abstract: In-context learning (ICL) with large language models (LLMs) has emerged as a promising paradigm for named entity recognition (NER) in low-resource scenarios. However, existing ICL-based NER methods suffer from three key limitations: (1) reliance on dynamic retrieval of annotated examples, which is problematic when annotated data is scarce; (2) limited generalization to unseen domains due to the LLM's insufficient internal domain knowledge; and (3) failure to incorporate external knowledge or resolve entity ambiguities. To address these challenges, we propose KDR-Agent, a novel multi-agent framework for multi-domain low-resource in-context NER that integrates Knowledge retrieval, Disambiguation, and Reflective analysis. KDR-Agent leverages natural-language type definitions and a static set of entity-level contrastive demonstrations to reduce dependency on large annotated corpora. A central planner coordinates specialized agents to (i) retrieve factual knowledge from Wikipedia for domain-specific mentions, (ii) resolve ambiguous entities via contextualized reasoning, and (iii) reflect on and correct model predictions through structured self-assessment. Experiments across ten datasets from five domains demonstrate that KDR-Agent significantly outperforms existing zero-shot and few-shot ICL baselines across multiple LLM backbones. The code and data can be found at this https URL.

Title: DeCoRL: Decoupling Reasoning Chains via Parallel Sub-Step Generation and Cascaded Reinforcement for Interpretable and Scalable RLHF

Authors: Ziyuan Gao, Di Liang, Xianjie Wu, Philippe Morel, Minlong Peng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.19097
Pdf URL: https://arxiv.org/pdf/2511.19097
Copy Paste: [[2511.19097]] DeCoRL: Decoupling Reasoning Chains via Parallel Sub-Step Generation and Cascaded Reinforcement for Interpretable and Scalable RLHF(https://arxiv.org/abs/2511.19097)
Keywords: chain-of-thought
Abstract: Existing reinforcement learning methods for Chain-of-Thought reasoning suffer from two critical limitations. First, they operate as monolithic black boxes that provide undifferentiated reward signals, obscuring individual step contributions and hindering error diagnosis. Second, sequential decoding has O(n) time complexity. This makes real-time deployment impractical for complex reasoning tasks. We present DeCoRL (Decoupled Reasoning Chains via Coordinated Reinforcement Learning), a novel framework that transforms reasoning from sequential processing into collaborative modular orchestration. DeCoRL trains lightweight specialized models to generate reasoning sub-steps concurrently, eliminating sequential bottlenecks through parallel processing. To enable precise error attribution, the framework designs modular reward functions that score each sub-step independently. Cascaded DRPO optimization then coordinates these rewards while preserving inter-step dependencies. Comprehensive evaluation demonstrates state-of-the-art results across RM-Bench, RMB, and RewardBench, outperforming existing methods including large-scale models. DeCoRL delivers 3.8 times faster inference while maintaining superior solution quality and offers a 22.7% improvement in interpretability through explicit reward attribution. These advancements, combined with a 72.4% reduction in energy consumption and a 68% increase in throughput, make real-time deployment of complex reasoning systems a reality.

Title: Emotion-Enhanced Multi-Task Learning with LLMs for Aspect Category Sentiment Analysis

Authors: Yaping Chai, Haoran Xie, Joe S. Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.19122
Pdf URL: https://arxiv.org/pdf/2511.19122
Copy Paste: [[2511.19122]] Emotion-Enhanced Multi-Task Learning with LLMs for Aspect Category Sentiment Analysis(https://arxiv.org/abs/2511.19122)
Keywords: language model, llm
Abstract: Aspect category sentiment analysis (ACSA) has achieved remarkable progress with large language models (LLMs), yet existing approaches primarily emphasize sentiment polarity while overlooking the underlying emotional dimensions that shape sentiment expressions. This limitation hinders the model's ability to capture fine-grained affective signals toward specific aspect categories. To address this limitation, we introduce a novel emotion-enhanced multi-task ACSA framework that jointly learns sentiment polarity and category-specific emotions grounded in Ekman's six basic emotions. Leveraging the generative capabilities of LLMs, our approach enables the model to produce emotional descriptions for each aspect category, thereby enriching sentiment representations with affective expressions. Furthermore, to ensure the accuracy and consistency of the generated emotions, we introduce an emotion refinement mechanism based on the Valence-Arousal-Dominance (VAD) dimensional framework. Specifically, emotions predicted by the LLM are projected onto a VAD space, and those inconsistent with their corresponding VAD coordinates are re-annotated using a structured LLM-based refinement strategy. Experimental results demonstrate that our approach significantly outperforms strong baselines on all benchmark datasets. This underlines the effectiveness of integrating affective dimensions into ACSA.

Title: Eliciting Chain-of-Thought in Base LLMs via Gradient-Based Representation Optimization

Authors: Zijian Wang, Yanxiang Ma, Chang Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.19131
Pdf URL: https://arxiv.org/pdf/2511.19131
Copy Paste: [[2511.19131]] Eliciting Chain-of-Thought in Base LLMs via Gradient-Based Representation Optimization(https://arxiv.org/abs/2511.19131)
Keywords: language model, llm, chain-of-thought
Abstract: Chain-of-Thought (CoT) reasoning is a critical capability for large language models (LLMs), enabling them to tackle complex multi-step tasks. While base LLMs, pre-trained on general text corpora, often struggle with reasoning due to a lack of specialized training, recent studies reveal their latent reasoning potential tied to hidden states. However, existing hidden state manipulation methods, such as linear activation steering, suffer from limitations due to their rigid and unconstrained nature, often leading to distribution shifts and degraded text quality. In this work, we propose a novel approach for eliciting CoT reasoning from base LLMs through hidden state manipulation grounded in probabilistic conditional generation. By reformulating the challenge as an optimization problem with a balanced likelihood and prior regularization framework, our method guides hidden states toward reasoning-oriented trajectories while preserving linguistic coherence. Extensive evaluations across mathematical, commonsense, and logical reasoning benchmarks demonstrate that our approach consistently outperforms existing steering methods, offering a theoretically principled and effective solution for enhancing reasoning capabilities in base LLMs.
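Summary: Chain-of-Thought (CoT) reasoning is a key capability of large language models (LLMs), enabling them to handle complex multi-step tasks. Although base LLMs are pre-trained on general text corpora, they often struggle with reasoning because they lack specialized training; recent studies, however, reveal a latent reasoning potential tied to their hidden states. Existing hidden-state manipulation methods, such as linear activation steering, are limited by their rigid and unconstrained nature and often cause distribution shifts and degraded text quality. In this work, we propose a new method that elicits CoT reasoning from base LLMs through hidden-state manipulation grounded in probabilistic conditional generation. By reformulating the challenge as an optimization problem with a balanced likelihood and prior regularization framework, our method guides hidden states toward reasoning-oriented trajectories while preserving linguistic coherence. Extensive evaluations on mathematical, commonsense, and logical reasoning benchmarks show that our method consistently outperforms existing steering methods, providing a theoretically principled and effective solution for enhancing the reasoning capabilities of base LLMs.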

Title: Representational Stability of Truth in Large Language Models

Authors: Samantha Dies, Courtney Maynard, Germans Savcisens, Tina Eliassi-Rad
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.19166
Pdf URL: https://arxiv.org/pdf/2511.19166
Copy Paste: [[2511.19166]] Representational Stability of Truth in Large Language Models(https://arxiv.org/abs/2511.19166)
Keywords: language model, llm
Abstract: Large language models (LLMs) are widely used for factual tasks such as "What treats asthma?" or "What is the capital of Latvia?". However, it remains unclear how stably LLMs encode distinctions between true, false, and neither-true-nor-false content in their internal probabilistic representations. We introduce representational stability as the robustness of an LLM's veracity representations to perturbations in the operational definition of truth. We assess representational stability by (i) training a linear probe on an LLM's activations to separate true from not-true statements and (ii) measuring how its learned decision boundary shifts under controlled label changes. Using activations from sixteen open-source models and three factual domains, we compare two types of neither statements. The first are fact-like assertions about entities we believe to be absent from any training data. We call these unfamiliar neither statements. The second are nonfactual claims drawn from well-known fictional contexts. We call these familiar neither statements. The unfamiliar statements induce the largest boundary shifts, producing up to $40\%$ flipped truth judgements in fragile domains (such as word definitions), while familiar fictional statements remain more coherently clustered and yield smaller changes ($\leq 8.2\%$). These results suggest that representational stability stems more from epistemic familiarity than from linguistic form. More broadly, our approach provides a diagnostic for auditing and training LLMs to preserve coherent truth assignments under semantic uncertainty, rather than optimizing for output accuracy alone.
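Summary: Large language models (LLMs) are widely used for factual tasks such as "What treats asthma?" or "What is the capital of Latvia?". However, it remains unclear how stably LLMs encode the distinctions between true, false, and neither-true-nor-false content in their internal probabilistic representations. We introduce representational stability, defined as the robustness of an LLM's veracity representations to perturbations in the operational definition of truth. We assess representational stability by (i) training a linear probe on an LLM's activations to separate true from not-true statements and (ii) measuring how the learned decision boundary shifts under controlled label changes. Using activations from sixteen open-source models and three factual domains, we compare two types of "neither" statements. The first are fact-like assertions about entities that we believe to be absent from any training data; we call these unfamiliar neither statements. The second are nonfactual claims drawn from well-known fictional contexts; we call these familiar neither statements. The unfamiliar statements induce the largest boundary shifts, producing up to $40\%$ flipped truth judgements in fragile domains (such as word definitions), while familiar fictional statements remain more coherently clustered and yield smaller changes ($\leq 8.2\%$). These results suggest that representational stability stems more from epistemic familiarity than from linguistic form. More broadly, our approach provides a diagnostic for auditing and training LLMs to preserve coherent truth assignments under semantic uncertainty, rather than optimizing for output accuracy alone.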
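
Illustration (not from the paper): the two-step procedure in the abstract, a linear probe trained on activations followed by a measurement of how its decisions change under a controlled relabelling, can be sketched roughly as below. Everything here is an assumption made for illustration: the activations are synthetic Gaussian clusters rather than real LLM hidden states, the probe is a plain logistic regression fitted by gradient descent, and the helper names `train_linear_probe`, `predict`, and `flip_rate` are hypothetical. The authors' actual probe, data, and shift metric may differ.

```python
import numpy as np

def train_linear_probe(X, y, lr=0.1, steps=500, l2=1e-3):
    """Fit a logistic-regression probe (weights w, bias b) by gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        logits = np.clip(X @ w + b, -30.0, 30.0)  # clip to keep exp() stable
        p = 1.0 / (1.0 + np.exp(-logits))
        w -= lr * (X.T @ (p - y) / n + l2 * w)
        b -= lr * float(np.mean(p - y))
    return w, b

def predict(X, probe):
    """Return 1 where the probe labels a statement 'true', else 0."""
    w, b = probe
    return (X @ w + b > 0).astype(int)

def flip_rate(X_eval, probe_a, probe_b):
    """Fraction of statements whose predicted label differs between two probes."""
    return float(np.mean(predict(X_eval, probe_a) != predict(X_eval, probe_b)))

# Synthetic stand-ins for activations (assumption: three Gaussian clusters).
rng = np.random.default_rng(0)
d, n = 16, 200
true_acts = rng.normal(loc=1.0, size=(n, d))
false_acts = rng.normal(loc=-1.0, size=(n, d))
neither_acts = rng.normal(loc=0.0, size=(n, d))
X = np.vstack([true_acts, false_acts, neither_acts])

# Baseline definition of truth: "neither" statements count as not-true.
y_base = np.concatenate([np.ones(n), np.zeros(n), np.zeros(n)])
# Perturbed definition: "neither" statements are relabelled as true.
y_pert = np.concatenate([np.ones(n), np.zeros(n), np.ones(n)])

probe_base = train_linear_probe(X, y_base)
probe_pert = train_linear_probe(X, y_pert)

# Sanity check: the baseline probe should separate true from false statements.
X_tf = np.vstack([true_acts, false_acts])
y_tf = np.concatenate([np.ones(n), np.zeros(n)])
base_acc = float(np.mean(predict(X_tf, probe_base) == y_tf))

# Share of all statements whose judgement flips when the labels change.
rate = flip_rate(X, probe_base, probe_pert)
print(f"baseline true/false accuracy: {base_acc:.3f}, flip rate: {rate:.3f}")
```

In this toy setup a larger flip rate means the probe's decision boundary is more sensitive to how "truth" is operationalized, which is the kind of quantity the abstract reports as flipped truth judgements.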

Title: In Machina N400: Pinpointing Where a Causal Language Model Detects Semantic Violations

Authors: Christos-Nikolaos Zacharopoulos, Revekka Kyriakoglou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19232
Pdf URL: https://arxiv.org/pdf/2511.19232
Copy Paste: [[2511.19232]] In Machina N400: Pinpointing Where a Causal Language Model Detects Semantic Violations(https://arxiv.org/abs/2511.19232)
Keywords: language model
Abstract: How and where does a transformer notice that a sentence has gone semantically off the rails? To explore this question, we evaluated the causal language model (phi-2) using a carefully curated corpus, with sentences that concluded plausibly or implausibly. Our analysis focused on the hidden states sampled at each model layer. To investigate how violations are encoded, we utilized two complementary probes. First, we conducted a per-layer detection using a linear probe. Our findings revealed that a simple linear decoder struggled to distinguish between plausible and implausible endings in the lowest third of the model's layers. However, its accuracy sharply increased in the middle blocks, reaching a peak just before the top layers. Second, we examined the effective dimensionality of the encoded violation. Initially, the violation widens the representational subspace, followed by a collapse after a mid-stack bottleneck. This might indicate an exploratory phase that transitions into rapid consolidation. Taken together, these results contemplate the idea of alignment with classical psycholinguistic findings in human reading, where semantic anomalies are detected only after syntactic resolution, occurring later in the online processing sequence.
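Summary: How and where does a transformer notice that a sentence has gone semantically off the rails? To explore this question, we evaluated a causal language model (phi-2) on a carefully curated corpus of sentences that end either plausibly or implausibly. Our analysis focused on the hidden states sampled at each model layer. To investigate how violations are encoded, we used two complementary probes. First, we performed per-layer detection with a linear probe. We found that a simple linear decoder struggled to distinguish plausible from implausible endings in the lowest third of the model's layers, but that its accuracy rose sharply in the middle blocks and peaked just before the top layers. Second, we examined the effective dimensionality of the encoded violation. Initially the violation widens the representational subspace, which then collapses after a mid-stack bottleneck. This may indicate an exploratory phase that transitions into rapid consolidation. Taken together, these results contemplate the idea of alignment with classical psycholinguistic findings in human reading, where semantic anomalies are detected only after syntactic resolution, which occurs later in the online processing sequence.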
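
Illustration (not from the paper): the abstract does not say how "effective dimensionality" is computed. One common choice is the participation ratio of the covariance eigenvalues, $(\sum_i \lambda_i)^2 / \sum_i \lambda_i^2$, and the sketch below assumes that measure. The function name `participation_ratio` and the two synthetic "layers" are hypothetical stand-ins for real per-layer hidden states.

```python
import numpy as np

def participation_ratio(H):
    """Effective dimensionality of hidden states H with shape (samples, features).

    Computed as (sum of covariance eigenvalues)^2 / (sum of squared eigenvalues).
    It is close to the feature count for isotropic data and close to 1 when the
    variance lies along a single direction.
    """
    Hc = H - H.mean(axis=0, keepdims=True)            # center each feature
    cov = Hc.T @ Hc / (H.shape[0] - 1)                # sample covariance
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)  # drop tiny negatives
    return float(eig.sum() ** 2 / np.sum(eig ** 2))

rng = np.random.default_rng(0)
# Assumed toy "layers": one spreads variance over 8 directions, one over 1.
wide_layer = rng.normal(size=(5000, 8))
collapsed_layer = np.outer(rng.normal(size=500), np.ones(8))

pr_wide = participation_ratio(wide_layer)
pr_collapsed = participation_ratio(collapsed_layer)
print(f"wide: {pr_wide:.2f}, collapsed: {pr_collapsed:.2f}")
```

Applied layer by layer to real hidden states, a curve of this quantity that first rises and then falls would correspond to the widening and subsequent collapse described in the abstract.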

Title: Learning to Reason: Training LLMs with GPT-OSS or DeepSeek R1 Reasoning Traces

Authors: Shaltiel Shmidman, Asher Fredman, Oleg Sudakov, Meriem Bendris
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.19333
Pdf URL: https://arxiv.org/pdf/2511.19333
Copy Paste: [[2511.19333]] Learning to Reason: Training LLMs with GPT-OSS or DeepSeek R1 Reasoning Traces(https://arxiv.org/abs/2511.19333)
Keywords: language model, gpt, llm
Abstract: Test-time scaling, which leverages additional computation during inference to improve model accuracy, has enabled a new class of Large Language Models (LLMs) that are able to reason through complex problems by understanding the goal, turning this goal into a plan, working through intermediate steps, and checking their own work before answering. Frontier large language models with reasoning capabilities, such as DeepSeek-R1 and OpenAI's gpt-oss, follow the same procedure when solving complex problems by generating intermediate reasoning traces before giving the final answer. Today, these models are being increasingly used to generate reasoning traces that serve as high-quality supervised data for post-training of small and medium-sized language models to teach reasoning capabilities without requiring expensive human curation. In this work, we compare the performance of medium-sized LLMs on math problems after post-training on two kinds of reasoning traces. We compare the impact of reasoning traces generated by DeepSeek-R1 and gpt-oss LLMs in terms of accuracy and inference efficiency.
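Summary: Test-time scaling uses additional computation during inference to improve model accuracy. It has enabled a new class of large language models (LLMs) that reason through complex problems by understanding the goal, turning the goal into a plan, working through intermediate steps, and checking their own work before answering. Frontier LLMs with reasoning capabilities, such as DeepSeek-R1 and OpenAI's gpt-oss, follow the same procedure when solving complex problems: they generate intermediate reasoning traces and then give the final answer. These models are now increasingly used to generate reasoning traces that serve as high-quality supervised data for post-training small and medium-sized language models, teaching reasoning capabilities without expensive human curation. In this work, we compare the performance of medium-sized LLMs on math problems after post-training on two kinds of reasoning traces. We compare the impact of reasoning traces generated by DeepSeek-R1 and by gpt-oss in terms of accuracy and inference efficiency.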

Title: DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

Authors: Rulin Shao, Akari Asai, Shannon Zejiang Shen, Hamish Ivison, Varsha Kishore, Jingming Zhuo, Xinran Zhao, Molly Park, Samuel G. Finlayson, David Sontag, Tyler Murray, Sewon Min, Pradeep Dasigi, Luca Soldaini, Faeze Brahman, Wen-tau Yih, Tongshuang Wu, Luke Zettlemoyer, Yoon Kim, Hannaneh Hajishirzi, Pang Wei Koh
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.19399
Pdf URL: https://arxiv.org/pdf/2511.19399
Copy Paste: [[2511.19399]] DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research(https://arxiv.org/abs/2511.19399)
Keywords: agent
Abstract: Deep research models perform multi-step research to produce long-form, well-attributed answers. However, most open deep research models are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards (RLVR), which does not extend to realistic long-form tasks. We address this with Reinforcement Learning with Evolving Rubrics (RLER), in which we construct and maintain rubrics that co-evolve with the policy model during training; this allows the rubrics to incorporate information that the model has newly explored and to provide discriminative, on-policy feedback. Using RLER, we develop Deep Research Tulu (DR Tulu-8B), the first open model that is directly trained for open-ended, long-form deep research. Across four long-form deep research benchmarks in science, healthcare and general domains, DR Tulu substantially outperforms existing open deep research models, and matches or exceeds proprietary deep research systems, while being significantly smaller and cheaper per query. To facilitate future research, we release all data, models, and code, including our new MCP-based agent infrastructure for deep research systems.
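Summary: Deep research models perform multi-step research to produce long-form, well-attributed answers. However, most open deep research models are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards (RLVR), which does not extend to realistic long-form tasks. We address this with Reinforcement Learning with Evolving Rubrics (RLER), in which we construct and maintain rubrics that co-evolve with the policy model during training; this lets the rubrics incorporate information the model has newly explored and provide discriminative, on-policy feedback. Using RLER, we develop Deep Research Tulu (DR Tulu-8B), the first open model trained directly for open-ended, long-form deep research. Across four long-form deep research benchmarks in science, healthcare, and general domains, DR Tulu substantially outperforms existing open deep research models and matches or exceeds proprietary deep research systems, while being significantly smaller and cheaper per query. To facilitate future research, we release all data, models, and code, including our new MCP-based agent infrastructure for deep research systems.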

Title: Be My Eyes: Extending Large Language Models to New Modalities Through Multi-Agent Collaboration

Authors: James Y. Huang, Sheng Zhang, Qianchu Liu, Guanghui Qin, Tinghui Zhu, Tristan Naumann, Muhao Chen, Hoifung Poon
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.19417
Pdf URL: https://arxiv.org/pdf/2511.19417
Copy Paste: [[2511.19417]] Be My Eyes: Extending Large Language Models to New Modalities Through Multi-Agent Collaboration(https://arxiv.org/abs/2511.19417)
Keywords: language model, gpt, llm, agent
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in challenging, knowledge-intensive reasoning tasks. However, extending LLMs to perceive and reason over a new modality (e.g., vision), often requires costly development of large-scale vision language models (VLMs) with LLMs as backbones. Smaller VLMs are more efficient and adaptable but often lack the broad knowledge and reasoning capabilities of frontier LLMs. In this work, we propose BeMyEyes, a modular, multi-agent framework for extending LLMs to multimodal reasoning by orchestrating collaboration between efficient, adaptable VLMs as perceivers and powerful LLMs as reasoners through conversations. We then introduce a data synthesis and supervised fine-tuning pipeline to train the perceiver agent to effectively collaborate with the reasoner agent. By combining the complementary strengths of perception and reasoning agents, BeMyEyes avoids the need for training large-scale multimodal models, preserves the generalization and reasoning capabilities of LLMs, and allows flexible extension to new domains and modalities. Experiments show that our framework unlocks the multimodal reasoning capabilities for LLMs, enabling a lightweight and fully open-source solution, i.e. equipping text-only DeepSeek-R1 with Qwen2.5-VL-7B perceiver, to outperform large-scale proprietary VLMs such as GPT-4o on a wide range of knowledge-intensive multimodal tasks. These results demonstrate the effectiveness, modularity, and scalability of our multi-agent approach for building future multimodal reasoning systems.
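Summary: Large language models (LLMs) have demonstrated remarkable capabilities in challenging, knowledge-intensive reasoning tasks. However, extending LLMs to perceive and reason over a new modality (e.g., vision) often requires costly development of large-scale vision language models (VLMs) with LLMs as backbones. Smaller VLMs are more efficient and adaptable but often lack the broad knowledge and reasoning capabilities of frontier LLMs. In this work, we propose BeMyEyes, a modular multi-agent framework that extends LLMs to multimodal reasoning by orchestrating conversational collaboration between efficient, adaptable VLMs acting as perceivers and powerful LLMs acting as reasoners. We then introduce a data synthesis and supervised fine-tuning pipeline to train the perceiver agent to collaborate effectively with the reasoner agent. By combining the complementary strengths of perception and reasoning agents, BeMyEyes avoids the need to train large-scale multimodal models, preserves the generalization and reasoning capabilities of LLMs, and allows flexible extension to new domains and modalities. Experiments show that our framework unlocks multimodal reasoning capabilities for LLMs, enabling a lightweight and fully open-source solution, i.e., text-only DeepSeek-R1 equipped with a Qwen2.5-VL-7B perceiver, to outperform large-scale proprietary VLMs such as GPT-4o on a wide range of knowledge-intensive multimodal tasks. These results demonstrate the effectiveness, modularity, and scalability of our multi-agent approach for building future multimodal reasoning systems.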