2026-02-24

Title: ReportLogic: Evaluating Logical Quality in Deep Research Reports

Authors: Jujia Zhao, Zhaoxin Huan, Zihan Wang, Xiaolu Zhang, Jun Zhou, Suzan Verberne, Zhaochun Ren
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.18446
Pdf URL: https://arxiv.org/pdf/2602.18446
Copy Paste: [[2602.18446]] ReportLogic: Evaluating Logical Quality in Deep Research Reports(https://arxiv.org/abs/2602.18446)
Keywords: language model, llm
Abstract: Users increasingly rely on Large Language Models (LLMs) for Deep Research, using them to synthesize diverse sources into structured reports that support understanding and action. In this context, the practical reliability of such reports hinges on logical quality: whether the report's claims and arguments are explicitly supported and can be trusted as a basis for downstream use, rather than merely appearing fluent or informative. However, current evaluation frameworks largely overlook this requirement. To bridge this gap, we introduce ReportLogic, a benchmark that quantifies report-level logical quality through a reader-centric lens of auditability. Specifically, ReportLogic adopts a hierarchical taxonomy that evaluates whether readers can (1) trace an on-topic report structure with a unified analytical arc (Macro-Logic), (2) understand the progression with necessary context (Expositional-Logic), and (3) verify conclusions via explicit claim-support (Structural-Logic). Based on this taxonomy, we construct a human-annotated rubric-guided dataset and train an open-source LogicJudge for scalable evaluation. We further evaluate judge robustness via adversarial attacks, showing that off-the-shelf LLM judges are frequently influenced by superficial cues (e.g., verbosity), and reasoning modes can mask broken support relations. Overall, our results provide actionable guidance for building more robust logic evaluators and improving the logical reliability of LLM-generated reports.

Title: ConfSpec: Efficient Step-Level Speculative Reasoning via Confidence-Gated Verification

Authors: Siran Liu, Cyril Y. He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.18447
Pdf URL: https://arxiv.org/pdf/2602.18447
Copy Paste: [[2602.18447]] ConfSpec: Efficient Step-Level Speculative Reasoning via Confidence-Gated Verification(https://arxiv.org/abs/2602.18447)
Keywords: language model, chain-of-thought
Abstract: Chain-of-Thought reasoning significantly improves the performance of large language models on complex tasks, but incurs high inference latency due to long generation traces. Step-level speculative reasoning aims to mitigate this cost, yet existing approaches face a long-standing trade-off among accuracy, inference speed, and resource efficiency. We propose ConfSpec, a confidence-gated cascaded verification framework that resolves this trade-off. Our key insight is an asymmetry between generation and verification: while generating a correct reasoning step requires substantial model capacity, step-level verification is a constrained discriminative task for which small draft models are well-calibrated within their competence range, enabling high-confidence draft decisions to be accepted directly while selectively escalating uncertain cases to the large target model. Evaluation across diverse workloads shows that ConfSpec achieves up to 2.24x end-to-end speedups while matching target-model accuracy. Our method requires no external judge models and is orthogonal to token-level speculative decoding, enabling further multiplicative acceleration.
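The confidence gate described in the ConfSpec abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the verifier signature, the threshold value, and the toy verifiers are all assumptions made for illustration.

```python
from typing import Callable, Tuple

# Hypothetical interface: a verifier looks at (context, step) and returns
# (accept, confidence), with confidence in [0, 1].
Verifier = Callable[[str, str], Tuple[bool, float]]


def gated_verify(context: str, step: str,
                 draft_verify: Verifier, target_verify: Verifier,
                 threshold: float = 0.9) -> Tuple[bool, str]:
    """Return (accepted, decided_by) for one drafted reasoning step."""
    accept, confidence = draft_verify(context, step)
    if confidence >= threshold:
        # High-confidence draft decision is taken directly.
        return accept, "draft"
    # Uncertain case: escalate to the large target model.
    accept, _ = target_verify(context, step)
    return accept, "target"


# Toy verifiers, purely illustrative.
def toy_draft(context: str, step: str) -> Tuple[bool, float]:
    confidence = 0.95 if len(step) < 20 else 0.5
    return True, confidence


def toy_target(context: str, step: str) -> Tuple[bool, float]:
    return "wrong" not in step, 1.0


steps = ["x = 2 + 2", "a long and possibly wrong derivation"]
decisions = [gated_verify("q", s, toy_draft, toy_target) for s in steps]
print(decisions)
```

The first step is decided by the draft verifier alone, while the second falls below the threshold and is escalated, which is the cost-saving pattern the abstract describes.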

Title: INSURE-Dial: A Phase-Aware Conversational Dataset & Benchmark for Compliance Verification and Phase Detection

Authors: Shubham Kulkarni, Alexander Lyzhov, Preetam Joshi, Shiva Chaitanya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.18448
Pdf URL: https://arxiv.org/pdf/2602.18448
Copy Paste: [[2602.18448]] INSURE-Dial: A Phase-Aware Conversational Dataset & Benchmark for Compliance Verification and Phase Detection(https://arxiv.org/abs/2602.18448)
Keywords: agent
Abstract: Administrative phone tasks drain roughly 1 trillion USD annually from U.S. healthcare, with over 500 million insurance-benefit verification calls manually handled in 2024. We introduce INSURE-Dial, to our knowledge the first public benchmark for developing and assessing compliance-aware voice agents for phase-aware call auditing with span-based compliance verification. The corpus includes 50 de-identified, AI-initiated calls with live insurance representatives (mean 71 turns/call) and 1,000 synthetically generated calls that mirror the same workflow. All calls are annotated with a phase-structured JSON schema covering IVR navigation, patient identification, coverage status, medication checks (up to two drugs), and agent identification (CRN), and each phase is labeled for Information and Procedural compliance under explicit ask/answer logic. We define two novel evaluation tasks: (1) Phase Boundary Detection (span segmentation under phase-specific acceptance rules) and (2) Compliance Verification (IC/PC decisions given fixed spans). Per-phase scores are strong across small, low-latency baselines, but end-to-end reliability is constrained by span-boundary errors. On real calls, full-call exact segmentation is low, showing a gap between conversational fluency and audit-grade evidence.

Title: Prompt Optimization Via Diffusion Language Models

Authors: Shiyu Wang, Haolin Chen, Liangwei Yang, Jielin Qiu, Rithesh Murthy, Ming Zhu, Zixiang Chen, Silvio Savarese, Caiming Xiong, Shelby Heinecke, Huan Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.18449
Pdf URL: https://arxiv.org/pdf/2602.18449
Copy Paste: [[2602.18449]] Prompt Optimization Via Diffusion Language Models(https://arxiv.org/abs/2602.18449)
Keywords: language model, gpt, llm, prompt
Abstract: We propose a diffusion-based framework for prompt optimization that leverages Diffusion Language Models (DLMs) to iteratively refine system prompts through masked denoising. By conditioning on interaction traces, including user queries, model responses, and optional feedback, our method enables flexible, span-level prompt updates without requiring gradient access or modifying the downstream language model. Across diverse benchmarks (e.g., τ-bench, SST-2, SST-5), DLM-optimized prompts consistently improve the performance of a frozen target LLM (e.g., GPT-4o-mini). We further show that moderate diffusion step counts provide the best balance between refinement quality and stability. These results highlight diffusion-based prompt optimization as a general, model-agnostic, and scalable approach for enhancing LLM performance through iterative prompt refinement.

Title: Asymptotic Semantic Collapse in Hierarchical Optimization

Authors: Faruk Alpay, Bugra Kilictas
Subjects: cs.CL, cs.IT, cs.LG
Abstract URL: https://arxiv.org/abs/2602.18450
Pdf URL: https://arxiv.org/pdf/2602.18450
Copy Paste: [[2602.18450]] Asymptotic Semantic Collapse in Hierarchical Optimization(https://arxiv.org/abs/2602.18450)
Keywords: agent
Abstract: Multi-agent language systems can exhibit a failure mode where a shared dominant context progressively absorbs individual semantics, yielding near-uniform behavior across agents. We study this effect under the name Asymptotic Semantic Collapse in Hierarchical Optimization. In a closed linguistic setting with a Dominant Anchor Node whose semantic state has effectively infinite inertia, we show that repeated interactions with Peripheral Agent Nodes drive an asymptotic alignment that minimizes a global loss. We model semantic states as points on a Riemannian manifold and analyze the induced projection dynamics. Two consequences follow. First, the limiting semantic configuration is insensitive to the optimization history: both smooth gradient-style updates and stochastic noisy updates converge to the same topological endpoint, establishing path independence at convergence. Second, the degree of context dependence controls information content: moving from atomic (independent) representations to fully entangled (context-bound) representations forces the node entropy, interpreted as available degrees of freedom, to vanish in the limit. The theory connects information-theoretic quantities with differential-geometric structure and suggests an interpretation as an immutable consensus rule that constrains agents to a shared semantic grammar. A lightweight dataset-free benchmark on an RWKV-7 13B GGUF checkpoint complements the analysis, reporting zero hash collisions, mean compliance of 0.50 under greedy decoding and 0.531 under stochastic decoding, and final Jaccard-to-anchor similarity values of 0.295 and 0.224, respectively.
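The abstract above reports a "Jaccard-to-anchor similarity" without defining it. One plausible reading, sketched below, is the Jaccard index between the token set of an agent's output and that of the anchor text. The whitespace tokenization and lowercasing are assumptions and may differ from the paper's setup.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard index of two texts over lowercase whitespace tokens (assumed tokenization)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # convention chosen here: two empty texts are identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


anchor = "agents share one grammar"
agent_output = "agents share a grammar"
print(jaccard(agent_output, anchor))
```

Under this reading, a value near 1 would mean an agent's vocabulary has merged with the anchor's, and a value near 0 would mean little overlap.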

Title: Luna-2: Scalable Single-Token Evaluation with Small Language Models

Authors: Vatsal Goel, Rishon Dsouza, Nikhil Ega, Amey Ramesh Rambatla, Rob Friel, Shuai Shao, Yash Sheth
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.18583
Pdf URL: https://arxiv.org/pdf/2602.18583
Copy Paste: [[2602.18583]] Luna-2: Scalable Single-Token Evaluation with Small Language Models(https://arxiv.org/abs/2602.18583)
Keywords: language model, llm, hallucination
Abstract: Real-time guardrails require evaluation that is accurate, cheap, and fast - yet today's default, LLM-as-a-judge (LLMAJ), is slow, expensive, and operationally non-deterministic due to multi-token generation. We present Luna-2, a novel architecture that turns decoder-only small language models (SLMs) into a deterministic evaluation model that reliably computes complex task-specific LLMAJ metrics (e.g., toxicity, hallucination, tool selection quality) at an accuracy on par with or higher than LLMAJ using frontier LLMs, while drastically reducing the cost and latency of computation. Each metric is implemented as a lightweight LoRA/PEFT head on top of a shared SLM backbone, enabling hundreds of specialized metrics to run concurrently on a single GPU, deployable locally next to AI systems in a privacy-preserving and latency-optimizing manner. Across content safety and hallucination benchmarks, Luna-2 matches the accuracy of state-of-the-art LLM-based evaluators while reducing inference cost by over 80x and latency by over 20x. In this paper, we outline the model architecture and training methodology, and report real-world empirical results on accuracy, latency, and throughput. In production, Luna-2 is protecting 100M+ AI sessions and processing over 100B tokens per month for our customers with eval cost savings of over $30M annually.

Title: DP-RFT: Learning to Generate Synthetic Text via Differentially Private Reinforcement Fine-Tuning

Authors: Fangyuan Xu, Sihao Chen, Zinan Lin, Taiwei Shi, Sydney Graham, Pei Zhou, Mengting Wan, Alex Stein, Virginia Estellers, Charles Chen, Morris Sharp, Richard Speyer, Tadas Baltrusaitis, Jennifer Neville, Eunsol Choi, Longqi Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.18633
Pdf URL: https://arxiv.org/pdf/2602.18633
Copy Paste: [[2602.18633]] DP-RFT: Learning to Generate Synthetic Text via Differentially Private Reinforcement Fine-Tuning(https://arxiv.org/abs/2602.18633)
Keywords: language model, llm
Abstract: Differentially private (DP) synthetic data generation plays a pivotal role in developing large language models (LLMs) on private data, where data owners cannot provide eyes-on access to individual examples. Generating DP synthetic data typically involves a difficult trade-off. On one hand, DP finetuning methods train an LLM as a synthetic data generator with formal privacy guarantees, yet they still require the raw content of private examples for model training. On the other hand, methods that avoid direct exposure to private data are bounded by an off-the-shelf, un-finetuned model, whose outputs often lack domain fidelity. Can we train an LLM to generate high-quality synthetic text without eyes-on access to individual private examples? In this work, we introduce Differentially Private Reinforcement Fine-Tuning (DP-RFT), an online reinforcement learning algorithm for synthetic data generation with LLMs. DP-RFT leverages DP-protected nearest-neighbor votes from an eyes-off private corpus as a reward signal for on-policy synthetic samples generated by an LLM. The LLM iteratively learns to generate synthetic data to maximize the expected DP votes through Proximal Policy Optimization (PPO). We evaluate DP-RFT for long-form and domain-specific synthetic data generation, such as news articles, meeting transcripts, and medical article abstracts. Our experiments show that DP-RFT closes the gap between private evolution and DP finetuning methods in terms of the fidelity and downstream utility of the generated synthetic data, while respecting the private data boundary.
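As a rough illustration of "DP-protected nearest-neighbor votes", the sketch below lets each private example vote for its nearest synthetic sample and then adds Gaussian noise to the vote histogram. The embedding vectors, the squared-Euclidean distance, the Gaussian mechanism, and the function name `dp_nn_votes` are assumptions. The abstract does not specify these details, and calibrating `sigma` to a privacy budget is out of scope here.

```python
import random
from typing import List, Sequence


def sq_dist(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def dp_nn_votes(private: List[Sequence[float]],
                synthetic: List[Sequence[float]],
                sigma: float,
                rng: random.Random) -> List[float]:
    """Noisy histogram of nearest-neighbor votes (hypothetical helper).

    Each private embedding casts exactly one vote, so adding or removing one
    private record changes a single bin by 1.
    """
    votes = [0.0] * len(synthetic)
    for p in private:
        nearest = min(range(len(synthetic)), key=lambda j: sq_dist(p, synthetic[j]))
        votes[nearest] += 1.0
    # Assumed noise mechanism: independent Gaussian noise per bin.
    return [v + rng.gauss(0.0, sigma) for v in votes]


private_embeddings = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
synthetic_embeddings = [[0.0, 0.0], [5.0, 5.0]]
exact = dp_nn_votes(private_embeddings, synthetic_embeddings, 0.0, random.Random(0))
noisy = dp_nn_votes(private_embeddings, synthetic_embeddings, 1.0, random.Random(0))
print(exact, noisy)
```

In a reinforcement fine-tuning loop, the noisy vote for each synthetic sample would play the role of its reward. The policy update itself (PPO in the abstract) is not shown.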

Title: PolyFrame at MWE-2026 AdMIRe 2: When Words Are Not Enough: Multimodal Idiom Disambiguation

Authors: Nina Hosseini-Kivanani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.18652
Pdf URL: https://arxiv.org/pdf/2602.18652
Copy Paste: [[2602.18652]] PolyFrame at MWE-2026 AdMIRe 2: When Words Are Not Enough: Multimodal Idiom Disambiguation(https://arxiv.org/abs/2602.18652)
Keywords: llm
Abstract: Multimodal models struggle with idiomatic expressions due to their non-compositional meanings, a challenge amplified in multilingual settings. We introduce PolyFrame, our system for the MWE-2026 AdMIRe 2 shared task on multimodal idiom disambiguation, featuring a unified pipeline for both image+text ranking (Subtask A) and text-only caption ranking (Subtask B). All model variants retain frozen CLIP-style vision-language encoders and the multilingual BGE M3 encoder, training only lightweight modules: a logistic regression and LLM-based sentence-type predictor, idiom synonym substitution, distractor-aware scoring, and Borda rank fusion. Starting from a CLIP baseline (26.7% Top-1 on English dev, 6.7% on English test), adding idiom-aware paraphrasing and explicit sentence-type classification increased performance to 60.0% Top-1 on English and 60.0% Top-1 (0.822 NDCG@5) in zero-shot transfer to Portuguese. On the multilingual blind test, our systems achieved average Top-1/NDCG scores of 0.35/0.73 for Subtask A and 0.32/0.71 for Subtask B across 15 languages. Ablation results highlight idiom-aware rewriting as the main contributor to performance, while sentence-type prediction and multimodal fusion enhance robustness. These findings suggest that effective idiom disambiguation is feasible without fine-tuning large multimodal encoders.
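Borda rank fusion, named in the PolyFrame abstract, combines several rankings by awarding points according to position. The sketch below shows the standard Borda count. How PolyFrame weights or breaks ties between its rankers is not stated, so the alphabetical tie-break and the example candidates are assumptions.

```python
from typing import Dict, List


def borda_fuse(rankings: List[List[str]]) -> List[str]:
    """Fuse rankings with a plain Borda count: position i of n earns n - 1 - i points."""
    scores: Dict[str, int] = {}
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] = scores.get(item, 0) + (n - 1 - position)
    # Highest total first; ties broken alphabetically (an arbitrary choice).
    return sorted(scores, key=lambda item: (-scores[item], item))


# Three hypothetical rankers ordering the same candidate images.
rankings = [
    ["img_a", "img_b", "img_c"],
    ["img_b", "img_a", "img_c"],
    ["img_b", "img_c", "img_a"],
]
fused = borda_fuse(rankings)
print(fused)
```

Here `img_b` collects 5 points, `img_a` 3, and `img_c` 1, so the fused order favors the candidate that is consistently ranked near the top rather than the winner of any single ranker.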

Title: Contradiction to Consensus: Dual Perspective, Multi Source Retrieval Based Claim Verification with Source Level Disagreement using LLM

Authors: Md Badsha Biswas, Ozlem Uzuner
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.18693
Pdf URL: https://arxiv.org/pdf/2602.18693
Copy Paste: [[2602.18693]] Contradiction to Consensus: Dual Perspective, Multi Source Retrieval Based Claim Verification with Source Level Disagreement using LLM(https://arxiv.org/abs/2602.18693)
Keywords: language model, llm
Abstract: The spread of misinformation across digital platforms can pose significant societal risks. Claim verification (a.k.a. fact-checking) systems can help identify potential misinformation. However, their efficacy is limited by the knowledge sources that they rely on. Most automated claim verification systems depend on a single knowledge source and utilize the supporting evidence from that source; they ignore the disagreement of their source with others. This limits their knowledge coverage and transparency. To address these limitations, we present a novel system for open-domain claim verification (ODCV) that leverages large language models (LLMs), multi-perspective evidence retrieval, and cross-source disagreement analysis. Our approach introduces a novel retrieval strategy that collects evidence for both the original and the negated forms of a claim, enabling the system to capture supporting and contradicting information from diverse sources: Wikipedia, PubMed, and Google. These evidence sets are filtered, deduplicated, and aggregated across sources to form a unified and enriched knowledge base that better reflects the complexity of real-world information. This aggregated evidence is then used for claim verification using LLMs. We further enhance interpretability by analyzing model confidence scores to quantify and visualize inter-source disagreement. Through extensive evaluation on four benchmark datasets with five LLMs, we show that knowledge aggregation not only improves claim verification but also reveals differences in source-specific reasoning. Our findings underscore the importance of embracing diversity, contradiction, and aggregation in evidence for building reliable and transparent claim verification systems.
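The dual-perspective retrieval step described above (query each source with both the claim and its negation, then deduplicate and aggregate) might look roughly like the following. The retriever callables, the record fields, and the normalization used for deduplication are hypothetical. The real system's Wikipedia, PubMed, and Google clients, its filtering, and its claim-negation method are not shown.

```python
from typing import Callable, Dict, List

# Hypothetical interface: a retriever maps a query string to a list of passages.
Retriever = Callable[[str], List[str]]


def gather_evidence(claim: str, negated_claim: str,
                    retrievers: Dict[str, Retriever]) -> List[dict]:
    """Query every source with both claim forms and drop duplicate passages."""
    seen = set()
    evidence = []
    for source, retrieve in retrievers.items():
        for query, perspective in ((claim, "original"), (negated_claim, "negated")):
            for passage in retrieve(query):
                key = " ".join(passage.lower().split())  # case/whitespace-insensitive key
                if key in seen:
                    continue
                seen.add(key)
                evidence.append({"source": source, "query": perspective, "text": passage})
    return evidence


# Toy stand-ins for real search backends.
retrievers = {
    "wiki": lambda q: ["Coffee contains caffeine.", "coffee  contains caffeine."],
    "pubmed": lambda q: ["Decaf coffee has little caffeine."] if "not" in q else [],
}
evidence = gather_evidence("Coffee contains caffeine",
                           "Coffee does not contain caffeine", retrievers)
for item in evidence:
    print(item)
```

Keeping the source and query perspective on each passage is what would later allow per-source verdicts to be compared for disagreement.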

Title: ReHear: Iterative Pseudo-Label Refinement for Semi-Supervised Speech Recognition via Audio Large Language Models

Authors: Zefang Liu, Chenyang Zhu, Sangwoo Cho, Shi-Xiong Zhang
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2602.18721
Pdf URL: https://arxiv.org/pdf/2602.18721
Copy Paste: [[2602.18721]] ReHear: Iterative Pseudo-Label Refinement for Semi-Supervised Speech Recognition via Audio Large Language Models(https://arxiv.org/abs/2602.18721)
Keywords: language model, llm
Abstract: Semi-supervised learning in automatic speech recognition (ASR) typically relies on pseudo-labeling, which often suffers from confirmation bias and error accumulation due to noisy supervision. To address this limitation, we propose ReHear, a framework for iterative pseudo-label refinement that integrates an instruction-tuned, audio-aware large language model (LLM) into the self-training loop. Unlike conventional text-based correctors, our approach conditions the LLM on both the ASR hypothesis and the source audio, allowing it to recover phonetically accurate transcripts even from severe recognition errors. These refined pseudo-labels serve as high-fidelity targets for fine-tuning the ASR model in an iterative cycle. Experimental results across diverse benchmarks demonstrate that ReHear effectively mitigates error propagation, consistently outperforming both supervised and pseudo-labeling baselines.

Title: Rethinking Retrieval-Augmented Generation as a Cooperative Decision-Making Problem

Authors: Lichang Song, Ting Long, Yi Chang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.18734
Pdf URL: https://arxiv.org/pdf/2602.18734
Copy Paste: [[2602.18734]] Rethinking Retrieval-Augmented Generation as a Cooperative Decision-Making Problem(https://arxiv.org/abs/2602.18734)
Keywords: retrieval-augmented generation, agent
Abstract: Retrieval-Augmented Generation (RAG) has demonstrated strong effectiveness in knowledge-intensive tasks by grounding language generation in external evidence. Despite its success, many existing RAG systems are built on a ranking-centric, asymmetric dependency paradigm, where the generation quality of the generator is highly dependent on the reranking results of the reranker. To overcome this limitation, we reformulate RAG as a cooperative multi-agent decision-making problem and propose Cooperative Retrieval-Augmented Generation (CoRAG), a framework in which the reranker and the generator act as peer decision-makers rather than being connected through an asymmetric dependency pipeline. By jointly optimizing their behaviors toward a shared task objective, the reranker and generator are encouraged to cooperate, ensuring that document reranking and generation work in concert to improve the final response. Experimental results demonstrate good generalization and improved generation stability of CoRAG, even when the model is trained on only around 10K PopQA samples. Our model is released at this https URL.

Title: ArabicNumBench: Evaluating Arabic Number Reading in Large Language Models

Authors: Anas Alhumud, Abdulaziz Alhammadi, Muhammad Badruddin Khan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.18776
Pdf URL: https://arxiv.org/pdf/2602.18776
Copy Paste: [[2602.18776]] ArabicNumBench: Evaluating Arabic Number Reading in Large Language Models(https://arxiv.org/abs/2602.18776)
Keywords: language model, prompt, chain-of-thought
Abstract: We present ArabicNumBench, a comprehensive benchmark for evaluating large language models on Arabic number reading tasks across Eastern Arabic-Indic numerals (0-9 in Arabic script) and Western Arabic numerals (0-9). We evaluate 71 models from 10 providers using four prompting strategies (zero-shot, zero-shot CoT, few-shot, few-shot CoT) on 210 number reading tasks spanning six contextual categories: pure numerals, addresses, dates, quantities, and prices. Our evaluation comprises 59,010 individual test cases and tracks extraction methods to measure structured output generation. Evaluation reveals substantial performance variation, with accuracy ranging from 14.29% to 99.05% across models and strategies. Few-shot Chain-of-Thought prompting achieves 2.8x higher accuracy than zero-shot approaches (80.06% vs 28.76%). A striking finding emerges: models achieving elite accuracy (98-99%) often produce predominantly unstructured output, with most responses lacking Arabic CoT markers. Only 6 models consistently generate structured output across all test cases, while the majority require fallback extraction methods despite high numerical accuracy. Comprehensive evaluation of 281 model-strategy combinations demonstrates that numerical accuracy and instruction-following represent distinct capabilities, establishing baselines for Arabic number comprehension and providing actionable guidance for model selection in production Arabic NLP systems.
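The headline figures in the ArabicNumBench abstract can be cross-checked with simple arithmetic. The snippet below uses only numbers quoted in the abstract. The remark about missing combinations is an inference from those numbers, not something the abstract states.

```python
# Figures quoted in the abstract.
models, strategies = 71, 4
combinations_reported = 281
tasks = 210

# Test cases = evaluated model-strategy combinations x tasks.
total_cases = combinations_reported * tasks
print(total_cases)  # 59010, matching the reported 59,010 test cases

# A full grid would have 71 x 4 = 284 combinations, so the reported 281
# suggests that three combinations were not evaluated (an inference).
full_grid = models * strategies
print(full_grid - combinations_reported)

# Relative gain of few-shot CoT over zero-shot accuracy.
ratio = 80.06 / 28.76
print(round(ratio, 2))  # about 2.78, reported as 2.8x
```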

Title: BURMESE-SAN: Burmese NLP Benchmark for Evaluating Large Language Models

Authors: Thura Aung, Jann Railey Montalan, Jian Gang Ngui, Peerat Limkonchotiwat
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.18788
Pdf URL: https://arxiv.org/pdf/2602.18788
Copy Paste: [[2602.18788]] BURMESE-SAN: Burmese NLP Benchmark for Evaluating Large Language Models(https://arxiv.org/abs/2602.18788)
Keywords: language model, llm
Abstract: We introduce BURMESE-SAN, the first holistic benchmark that systematically evaluates large language models (LLMs) for Burmese across three core NLP competencies: understanding (NLU), reasoning (NLR), and generation (NLG). BURMESE-SAN consolidates seven subtasks spanning these competencies, including Question Answering, Sentiment Analysis, Toxicity Detection, Causal Reasoning, Natural Language Inference, Abstractive Summarization, and Machine Translation, several of which were previously unavailable for Burmese. The benchmark is constructed through a rigorous native-speaker-driven process to ensure linguistic naturalness, fluency, and cultural authenticity while minimizing translation-induced artifacts. We conduct a large-scale evaluation of both open-weight and commercial LLMs to examine challenges in Burmese modeling arising from limited pretraining coverage, rich morphology, and syntactic variation. Our results show that Burmese performance depends more on architectural design, language representation, and instruction tuning than on model scale alone. In particular, Southeast Asia regional fine-tuning and newer model generations yield substantial gains. Finally, we release BURMESE-SAN as a public leaderboard to support systematic evaluation and sustained progress in Burmese and other low-resource languages: this https URL.

Title: Think²: Grounded Metacognitive Reasoning in Large Language Models

Authors: Abraham Paul Elenjical, Vivek Hruday Kavuri, Vasudeva Varma
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.18806
Pdf URL: https://arxiv.org/pdf/2602.18806
Copy Paste: [[2602.18806]] Think²: Grounded Metacognitive Reasoning in Large Language Models(https://arxiv.org/abs/2602.18806)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large Language Models (LLMs) demonstrate strong reasoning performance, yet their ability to reliably monitor, diagnose, and correct their own errors remains limited. We introduce a psychologically grounded metacognitive framework that operationalizes Ann Brown's regulatory cycle (Planning, Monitoring, and Evaluation) as a structured prompting architecture, and study its integration within a lightweight dual-process MetaController for adaptive effort allocation. Across diverse reasoning and diagnostic benchmarks (GSM8K, CRUXEval, MBPP, AIME, CorrectBench, and TruthfulQA) using Llama-3 and Qwen-3 (8B), explicit regulatory structuring substantially improves error diagnosis and yields a threefold increase in successful self-correction. Blinded human evaluations over 580 query pairs show an 84% aggregate preference for trustworthiness and metacognitive self-awareness over standard and Chain-of-Thought baselines. Grounding LLM reasoning in established cognitive theory offers a principled path toward more transparent and diagnostically robust AI systems.

Title: EvalSense: A Framework for Domain-Specific LLM (Meta-)Evaluation

Authors: Adam Dejl, Jonathan Pearson
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.18823
Pdf URL: https://arxiv.org/pdf/2602.18823
Copy Paste: [[2602.18823]] EvalSense: A Framework for Domain-Specific LLM (Meta-)Evaluation(https://arxiv.org/abs/2602.18823)
Keywords: language model, llm, prompt
Abstract: Robust and comprehensive evaluation of large language models (LLMs) is essential for identifying effective LLM system configurations and mitigating risks associated with deploying LLMs in sensitive domains. However, traditional statistical metrics are poorly suited to open-ended generation tasks, leading to growing reliance on LLM-based evaluation methods. These methods, while often more flexible, introduce additional complexity: they depend on carefully chosen models, prompts, parameters, and evaluation strategies, making the evaluation process prone to misconfiguration and bias. In this work, we present EvalSense, a flexible, extensible framework for constructing domain-specific evaluation suites for LLMs. EvalSense provides out-of-the-box support for a broad range of model providers and evaluation strategies, and assists users in selecting and deploying suitable evaluation methods for their specific use-cases. This is achieved through two unique components: (1) an interactive guide aiding users in evaluation method selection and (2) automated meta-evaluation tools that assess the reliability of different evaluation approaches using perturbed data. We demonstrate the effectiveness of EvalSense in a case study involving the generation of clinical notes from unstructured doctor-patient dialogues, using a popular open dataset. All code, documentation, and assets associated with EvalSense are open-source and publicly available at this https URL.

Title: DeepInnovator: Triggering the Innovative Capabilities of LLMs

Authors: Tianyu Fan, Fengji Zhang, Yuxiang Zheng, Bei Chen, Xinyao Niu, Chengen Huang, Junyang Lin, Chao Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.18920
Pdf URL: https://arxiv.org/pdf/2602.18920
Copy Paste: [[2602.18920]] DeepInnovator: Triggering the Innovative Capabilities of LLMs(https://arxiv.org/abs/2602.18920)
Keywords: language model, llm, prompt, agent
Abstract: The application of Large Language Models (LLMs) in accelerating scientific discovery has garnered increasing attention, with a key focus on constructing research agents endowed with innovative capability, i.e., the ability to autonomously generate novel and significant research ideas. Existing approaches predominantly rely on sophisticated prompt engineering and lack a systematic training paradigm. To address this, we propose DeepInnovator, a training framework designed to trigger the innovative capability of LLMs. Our approach comprises two core components. (1) "Standing on the shoulders of giants": we construct an automated data extraction pipeline to extract and organize structured research knowledge from a vast corpus of unlabeled scientific literature. (2) "Conjectures and refutations": we introduce a "Next Idea Prediction" training paradigm, which models the generation of research ideas as an iterative process of continuously predicting, evaluating, and refining plausible and novel next ideas. Both automatic and expert evaluations demonstrate that our DeepInnovator-14B significantly outperforms untrained baselines, achieving win rates of 80.53%-93.81%, and attains performance comparable to that of current leading LLMs. This work provides a scalable training pathway toward building research agents with genuine, originative innovative capability, and we will open-source the dataset to foster community advancement. Source code and data are available at: this https URL.

Title: Why Agent Caching Fails and How to Fix It: Structured Intent Canonicalization with Few-Shot Learning

Authors: Abhinaba Basu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.18922
Pdf URL: https://arxiv.org/pdf/2602.18922
Copy Paste: [[2602.18922]] Why Agent Caching Fails and How to Fix It: Structured Intent Canonicalization with Few-Shot Learning(https://arxiv.org/abs/2602.18922)
Keywords: gpt, llm, agent
Abstract: Personal AI agents incur substantial cost via repeated LLM calls. We show existing caching methods fail: GPTCache achieves 37.9% accuracy on real benchmarks; APC achieves 0-12%. The root cause is optimizing for the wrong property -- cache effectiveness requires key consistency and precision, not classification accuracy. We observe cache-key evaluation reduces to clustering evaluation and apply V-measure decomposition to separate these on n=8,682 points across MASSIVE, BANKING77, CLINC150, and NyayaBench v2, our new 8,514-entry multilingual agentic dataset (528 intents, 20 W5H2 classes, 63 languages). We introduce W5H2, a structured intent decomposition framework. Using SetFit with 8 examples per class, W5H2 achieves 91.1%+/-1.7% on MASSIVE in ~2ms -- vs 37.9% for GPTCache and 68.8% for a 20B-parameter LLM at 3,447ms. On NyayaBench v2 (20 classes), SetFit achieves 55.3%, with cross-lingual transfer across 30 languages. Our five-tier cascade handles 85% of interactions locally, projecting 97.5% cost reduction. We provide risk-controlled selective prediction guarantees via RCPS with nine bound families.
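The abstract above treats cache-key evaluation as clustering evaluation and applies a V-measure decomposition. The following is a minimal sketch of that decomposition in plain Python, not the authors' code. It assumes that homogeneity (every cache key holds a single intent) stands in for key precision and that completeness (every intent lands on a single key) stands in for key consistency; that mapping, the function names, and the toy labels are illustrative assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given), estimated from two aligned label lists."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum(
        (c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items()
    )


def v_measure(intents, cache_keys):
    """Return (homogeneity, completeness, V-measure) for a key assignment.

    intents:    gold intent label per query (hypothetical input)
    cache_keys: cache key assigned to each query (hypothetical input)
    """
    h_intents = entropy(intents)
    h_keys = entropy(cache_keys)
    # Homogeneity: does each key contain queries of only one intent?
    homogeneity = (
        1.0 if h_intents == 0
        else 1.0 - conditional_entropy(intents, cache_keys) / h_intents
    )
    # Completeness: do all queries of one intent share the same key?
    completeness = (
        1.0 if h_keys == 0
        else 1.0 - conditional_entropy(cache_keys, intents) / h_keys
    )
    if homogeneity + completeness == 0:
        return homogeneity, completeness, 0.0
    v = 2 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v


# Toy data: the two "set_alarm" queries get different keys (a cache miss),
# but no key mixes intents (no wrong cache hit).
intents = ["set_alarm", "set_alarm", "get_weather", "get_weather"]
keys = ["k1", "k2", "k3", "k3"]
h, c, v = v_measure(intents, keys)
print(f"homogeneity={h:.3f} completeness={c:.3f} v_measure={v:.3f}")
```

On this toy input the sketch prints homogeneity=1.000 completeness=0.667 v_measure=0.800. The split makes the failure visible as an inconsistency problem (missed hits) rather than a precision problem (wrong hits), which a single accuracy number would hide.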

Title: Whisper: Courtside Edition Enhancing ASR Performance Through LLM-Driven Context Generation

Authors: Yonathan Ron, Shiri Gilboa, Tammuz Dubnov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.18966
Pdf URL: https://arxiv.org/pdf/2602.18966
Copy Paste: [[2602.18966]] Whisper: Courtside Edition Enhancing ASR Performance Through LLM-Driven Context Generation(https://arxiv.org/abs/2602.18966)
Keywords: language model, llm, prompt, agent
Abstract: Domain-specific speech remains a persistent challenge for automatic speech recognition (ASR), even for state-of-the-art systems like OpenAI's Whisper. We introduce Whisper: Courtside Edition, a novel multi-agent large language model (LLM) pipeline that enhances Whisper transcriptions without retraining. The pipeline intercepts Whisper's initial transcript, applies specialized LLM agents for domain context identification, named entity recognition, and jargon detection, and generates compact prompts that guide Whisper's decoder. Evaluated on 421 NBA basketball commentary segments (a domain characterized by dense proper nouns and technical terminology) our best pipeline achieves a statistically significant 17.0% relative reduction in word error rate (WER; from 0.217 to 0.180, p<0.001). Improvements are observed in 40.1% of segments with degradation in only 7.1%, substantially outperforming direct transcript post-editing. These results demonstrate that prompt-based augmentation can deliver scalable domain adaptation for ASR, offering a practical alternative to costly model fine-tuning.
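The headline number in the abstract above is a relative reduction in word error rate (WER). As a rough illustration, the sketch below computes WER with a standard word-level edit distance and then the relative reduction between two WER values. The function names and the sample sentences are hypothetical; only the two WER values (0.217 and 0.180) come from the abstract.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline, improved):
    """Fraction of the baseline error that was removed."""
    return (baseline - improved) / baseline


# Hypothetical commentary snippet: one misrecognized proper noun in five words.
print(word_error_rate("curry pulls up for three", "carry pulls up for three"))

# WER values reported in the abstract.
print(f"{relative_reduction(0.217, 0.180):.1%}")
```

With the rounded WER values from the abstract the relative reduction comes out at roughly 17%; the reported 17.0% was presumably computed from unrounded figures.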

Title: Capable but Unreliable: Canonical Path Deviation as a Causal Mechanism of Agent Failure in Long-Horizon Tasks

Authors: Wilson Y. Lee
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.19008
Pdf URL: https://arxiv.org/pdf/2602.19008
Copy Paste: [[2602.19008]] Capable but Unreliable: Canonical Path Deviation as a Causal Mechanism of Agent Failure in Long-Horizon Tasks(https://arxiv.org/abs/2602.19008)
Keywords: llm, agent
Abstract: Why do language agents fail on tasks they are capable of solving? We argue that many such failures are reliability failures caused by stochastic drift from a task's latent solution structure, not capability failures. Every well-defined tool-use task imposes a canonical solution path (i.e., a convergent set of tool invocations shared across successful runs) and agent success depends critically on whether a trajectory stays within this path's operating envelope. We establish this causally using a natural experiment that holds model capability and task difficulty fixed by construction. We analyze trajectories from the Toolathlon benchmark: 22 frontier models each attempt 108 real-world tool-use tasks across 3 independent runs, yielding 515 model$\times$task units where the same model succeeds on some runs and fails on others due to LLM sampling stochasticity alone. Within these units, successful runs adhere significantly more closely to the canonical solution path than failed runs ($+$0.060 Jaccard, $p<0.0001$, $n=488$ units, 95% CI [+0.043, +0.077]). This result survives six robustness checks including cross-model-family leave-one-out validation. Critically, the causal mechanism is gradual and self-reinforcing: the adherence gap is statistically indistinguishable from zero through the first 50% of the trajectory, ruling out early-branching selection bias, and each off-canonical tool call raises the probability that the next call is also off-canonical by 22.7 percentage points ($\hat{\beta}=+0.227$, $p<0.0001$), more than doubling the baseline rate. These findings imply that agent reliability cannot be improved by capability scaling alone, but offer a highly actionable intervention: a simple monitor that restarts the bottom tercile of runs based on mid-trajectory canonical adherence lifts success rates by $+$8.8 percentage points among intervened runs.
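The adherence measure in the abstract above is a Jaccard similarity between a run's tool calls and the canonical solution path. The sketch below shows one simple way to compute it, assuming the canonical path is approximated by the set of tools that appear in every successful run of a task. That approximation, the function names, and the tool names are illustrative assumptions; the paper may define the canonical path differently.

```python
def canonical_tool_set(successful_runs):
    """Tools invoked in every successful run (an assumed approximation)."""
    tool_sets = [set(run) for run in successful_runs]
    if not tool_sets:
        return set()
    return set.intersection(*tool_sets)


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical trajectories for one model-task unit.
successful = [
    ["search", "read_file", "write_file"],
    ["search", "read_file", "write_file", "list_dir"],
]
failed = ["search", "send_email", "delete_file"]

canonical = canonical_tool_set(successful)
for name, run in [("success_1", successful[0]),
                  ("success_2", successful[1]),
                  ("failed", failed)]:
    print(name, round(jaccard(run, canonical), 3))
```

Here the successful runs score 1.0 and 0.75 while the failed run scores 0.2. A monitor of the kind the abstract describes could compute this score partway through a trajectory and restart the lowest-scoring runs.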

Title: Uncovering Context Reliance in Unstructured Knowledge Editing

Authors: Zisheng Zhou, Mengqi Zhang, Shiguang Wu, Xiaotian Ye, Chi Zhang, Zhumin Chen, Pengjie Ren
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.19043
Pdf URL: https://arxiv.org/pdf/2602.19043
Copy Paste: [[2602.19043]] Uncovering Context Reliance in Unstructured Knowledge Editing(https://arxiv.org/abs/2602.19043)
Keywords: language model, llm
Abstract: Editing large language models (LLMs) with real-world, unstructured knowledge is essential for correcting and updating their internal parametric knowledge. In this work, we revisit the fundamental next-token prediction (NTP) as a candidate paradigm for unstructured editing. We identify Context Reliance as a critical failure mode of NTP-based approaches, where knowledge acquired from edited text becomes highly dependent on its preceding context, leading to recall failures when that context is absent during inference. This hypothesis is supported by our empirical validation that prepending context during inference recovers knowledge recall. We further theoretically demonstrate that Context Reliance is an inherent consequence of gradient-based optimization, which tends to bind acquired knowledge to a specific aggregated contextual representation. To address this, we propose a simple yet effective COntext-INdependent editing framework (COIN), encouraging the model to focus on knowledge within local scope rather than memorizing contextual patterns. Evaluations show that COIN reduces Context Reliance by 45.2% and outperforms strong baselines by 23.6% in editing success rate, highlighting the vital role of mitigating Context Reliance for robust editing.

Title: IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning

Authors: Yinhan He, Yaochen Zhu, Mingjia Shi, Wendy Zheng, Lin Su, Xiaoqing Wang, Qi Guo, Jundong Li
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.19049
Pdf URL: https://arxiv.org/pdf/2602.19049
Copy Paste: [[2602.19049]] IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning(https://arxiv.org/abs/2602.19049)
Keywords: language model
Abstract: Large language models increasingly rely on long chains of thought to improve accuracy, yet such gains come with substantial inference-time costs. We revisit token-efficient post-training and argue that existing sequence-level reward-shaping methods offer limited control over how reasoning effort is allocated across tokens. To bridge the gap, we propose IAPO, an information-theoretic post-training framework that assigns token-wise advantages based on each token's conditional mutual information (MI) with the final answer. This yields an explicit, principled mechanism for identifying informative reasoning steps and suppressing low-utility exploration. We provide a theoretical analysis showing that our IAPO can induce monotonic reductions in reasoning verbosity without harming correctness. Empirically, IAPO consistently improves reasoning accuracy while reducing reasoning length by up to 36%, outperforming existing token-efficient RL methods across various reasoning datasets. Extensive empirical evaluations demonstrate that information-aware advantage shaping is a powerful and general direction for token-efficient post-training. The code is available at this https URL.

Title: Do LLMs and VLMs Share Neurons for Inference? Evidence and Mechanisms of Cross-Modal Transfer

Authors: Chenhang Cui, An Zhang, Yuxin Chen, Gelei Deng, Jingnan Zheng, Zhenkai Liang, Xiang Wang, Tat-Seng Chua
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.19058
Pdf URL: https://arxiv.org/pdf/2602.19058
Copy Paste: [[2602.19058]] Do LLMs and VLMs Share Neurons for Inference? Evidence and Mechanisms of Cross-Modal Transfer(https://arxiv.org/abs/2602.19058)
Keywords: language model, llm
Abstract: Large vision-language models (LVLMs) have rapidly advanced across various domains, yet they still lag behind strong text-only large language models (LLMs) on tasks that require multi-step inference and compositional decision-making. Motivated by their shared transformer architectures, we investigate whether the two model families rely on common internal computation for such inference. At the neuron level, we uncover a surprisingly large overlap: more than half of the top-activated units during multi-step inference are shared between representative LLMs and LVLMs, revealing a modality-invariant inference subspace. Through causal probing via activation amplification, we further show that these shared neurons encode consistent and interpretable concept-level effects, demonstrating their functional contribution to inference. Building on this insight, we propose Shared Neuron Low-Rank Fusion (SNRF), a parameter-efficient framework that transfers mature inference circuitry from LLMs to LVLMs. SNRF profiles cross-model activations to identify shared neurons, computes a low-rank approximation of inter-model weight differences, and injects these updates selectively within the shared-neuron subspace. This mechanism strengthens multimodal inference performance with minimal parameter changes and requires no large-scale multimodal fine-tuning. Across diverse mathematics and perception benchmarks, SNRF consistently enhances LVLM inference performance while preserving perceptual capabilities. Our results demonstrate that shared neurons form an interpretable bridge between LLMs and LVLMs, enabling low-cost transfer of inference ability into multimodal models. Our code is available at this https URL.

Title: Value Entanglement: Conflation Between Different Kinds of Good In (Some) Large Language Models

Authors: Seong Hah Cho, Junyi Li, Anna Leshinskaya
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.19101
Pdf URL: https://arxiv.org/pdf/2602.19101
Copy Paste: [[2602.19101]] Value Entanglement: Conflation Between Different Kinds of Good In (Some) Large Language Models(https://arxiv.org/abs/2602.19101)
Keywords: language model, llm
Abstract: Value alignment of Large Language Models (LLMs) requires us to empirically measure these models' actual, acquired representation of value. One characteristic of value representation in humans is that it distinguishes among values of different kinds. We investigate whether LLMs likewise distinguish three different kinds of good: moral, grammatical, and economic. By probing model behavior, embeddings, and residual stream activations, we report pervasive cases of value entanglement: a conflation between these distinct representations of value. Specifically, both grammatical and economic valuation were found to be overly influenced by moral value, relative to human norms. This conflation was repaired by selective ablation of the activation vectors associated with morality.

Title: Astra: Activation-Space Tail-Eigenvector Low-Rank Adaptation of Large Language Models

Authors: Kainan Liu, Yong Zhang, Ning Cheng, Yun Zhu, Yanmeng Wang, Shaojun Wang, Jing Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.19111
Pdf URL: https://arxiv.org/pdf/2602.19111
Copy Paste: [[2602.19111]] Astra: Activation-Space Tail-Eigenvector Low-Rank Adaptation of Large Language Models(https://arxiv.org/abs/2602.19111)
Keywords: language model
Abstract: Parameter-Efficient Fine-Tuning (PEFT) methods, especially LoRA, are widely used for adapting pre-trained models to downstream tasks due to their computational and storage efficiency. However, in the context of LoRA and its variants, the potential of activation subspaces corresponding to tail eigenvectors remains substantially under-exploited, which may lead to suboptimal fine-tuning performance. In this work, we propose Astra (Activation-Space Tail-Eigenvector Low-Rank Adaptation), a novel PEFT method that leverages the tail eigenvectors of the model output activations (estimated from a small task-specific calibration set) to construct task-adaptive low-rank adapters. By constraining updates to the subspace spanned by these tail eigenvectors, Astra achieves faster convergence and improved downstream performance with a significantly reduced parameter budget. Extensive experiments across natural language understanding (NLU) and natural language generation (NLG) tasks demonstrate that Astra consistently outperforms existing PEFT baselines across 16 benchmarks and even surpasses full fine-tuning (FFT) in certain scenarios.

Title: How Do LLMs Encode Scientific Quality? An Empirical Study Using Monosemantic Features from Sparse Autoencoders

Authors: Michael McCoubrey, Angelo Salatino, Francesco Osborne, Enrico Motta
Subjects: cs.CL, cs.AI, cs.DL
Abstract URL: https://arxiv.org/abs/2602.19115
Pdf URL: https://arxiv.org/pdf/2602.19115
Copy Paste: [[2602.19115]] How Do LLMs Encode Scientific Quality? An Empirical Study Using Monosemantic Features from Sparse Autoencoders(https://arxiv.org/abs/2602.19115)
Keywords: language model, llm
Abstract: In recent years, there has been a growing use of generative AI, and large language models (LLMs) in particular, to support both the assessment and generation of scientific work. Although some studies have shown that LLMs can, to a certain extent, evaluate research according to perceived quality, our understanding of the internal mechanisms that enable this capability remains limited. This paper presents the first study that investigates how LLMs encode the concept of scientific quality through relevant monosemantic features extracted using sparse autoencoders. We derive such features under different experimental settings and assess their ability to serve as predictors across three tasks related to research quality: predicting citation count, journal SJR, and journal h-index. The results indicate that LLMs encode features associated with multiple dimensions of scientific quality. In particular, we identify four recurring types of features that capture key aspects of how research quality is represented: 1) features reflecting research methodologies; 2) features related to publication type, with literature reviews typically exhibiting higher impact; 3) features associated with high-impact research fields and technologies; and 4) features corresponding to specific scientific jargon. These findings represent an important step toward understanding how LLMs encapsulate concepts related to research quality.

Title: AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG

Authors: Qijie You, Wenkai Yu, Wentao Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.19127
Pdf URL: https://arxiv.org/pdf/2602.19127
Copy Paste: [[2602.19127]] AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG(https://arxiv.org/abs/2602.19127)
Keywords: language model, gpt, agent
Abstract: With the rapid advancement of agent-based methods in recent years, Agentic RAG has undoubtedly become an important research direction. Multi-hop reasoning, which requires models to engage in deliberate thinking and multi-step interaction, serves as a critical testbed for assessing such capabilities. However, existing benchmarks typically provide only final questions and answers, while lacking the intermediate hop-level questions that gradually connect atomic questions to the final multi-hop query. This limitation prevents researchers from analyzing at which step an agent fails and restricts more fine-grained evaluation of model capabilities. Moreover, most current benchmarks are manually constructed, which is both time-consuming and labor-intensive, while also limiting scalability and generalization. To address these challenges, we introduce AgenticRAGTracer, the first Agentic RAG benchmark that is primarily constructed automatically by large language models and designed to support step-by-step validation. Our benchmark spans multiple domains, contains 1,305 data points, and has no overlap with existing mainstream benchmarks. Extensive experiments demonstrate that even the best large language models perform poorly on our dataset. For instance, GPT-5 attains merely 22.6% EM accuracy on the hardest portion of our dataset. Hop-aware diagnosis reveals that failures are primarily driven by distorted reasoning chains -- either collapsing prematurely or wandering into over-extension. This highlights a critical inability to allocate steps consistent with the task's logical structure, providing a diagnostic dimension missing in traditional evaluations. We believe our work will facilitate research in Agentic RAG and inspire further meaningful progress in this area. Our code and data are available at this https URL.

Title: A Dataset for Named Entity Recognition and Relation Extraction from Art-historical Image Descriptions

Authors: Stefanie Schneider, Miriam Göldl, Julian Stalter, Ricarda Vollmer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.19133
Pdf URL: https://arxiv.org/pdf/2602.19133
Copy Paste: [[2602.19133]] A Dataset for Named Entity Recognition and Relation Extraction from Art-historical Image Descriptions(https://arxiv.org/abs/2602.19133)
Keywords: language model, llm
Abstract: This paper introduces FRAME (Fine-grained Recognition of Art-historical Metadata and Entities), a manually annotated dataset of art-historical image descriptions for Named Entity Recognition (NER) and Relation Extraction (RE). Descriptions were collected from museum catalogs, auction listings, open-access platforms, and scholarly databases, then filtered to ensure that each text focuses on a single artwork and contains explicit statements about its material, composition, or iconography. FRAME provides stand-off annotations in three layers: a metadata layer for object-level properties, a content layer for depicted subjects and motifs, and a co-reference layer linking repeated mentions. Across layers, entity spans are labeled with 37 types and connected by typed RE links between mentions. Entity types are aligned with Wikidata to support Named Entity Linking (NEL) and downstream knowledge-graph construction. The dataset is released as UIMA XMI Common Analysis Structure (CAS) files with accompanying images and bibliographic metadata, and can be used to benchmark and fine-tune NER and RE systems, including zero- and few-shot setups with Large Language Models (LLMs).

Title: Facet-Level Persona Control by Trait-Activated Routing with Contrastive SAE for Role-Playing LLMs

Authors: Wenqiu Tang, Zhen Wan, Takahiro Komamizu, Ichiro Ide
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.19157
Pdf URL: https://arxiv.org/pdf/2602.19157
Copy Paste: [[2602.19157]] Facet-Level Persona Control by Trait-Activated Routing with Contrastive SAE for Role-Playing LLMs(https://arxiv.org/abs/2602.19157)
Keywords: language model, llm, prompt, retrieval-augmented generation, agent
Abstract: Personality control in Role-Playing Agents (RPAs) is commonly achieved via training-free methods that inject persona descriptions and memory through prompts or retrieval-augmented generation, or via supervised fine-tuning (SFT) on persona-specific corpora. While SFT can be effective, it requires persona-labeled data and retraining for new roles, limiting flexibility. In contrast, prompt- and RAG-based signals are easy to apply but can be diluted in long dialogues, leading to drifting and sometimes inconsistent persona behavior. To address this, we propose a contrastive Sparse AutoEncoder (SAE) framework that learns facet-level personality control vectors aligned with the Big Five 30-facet model. A new 15,000-sample leakage-controlled corpus is constructed to provide balanced supervision for each facet. The learned vectors are integrated into the model's residual space and dynamically selected by a trait-activated routing module, enabling precise and interpretable personality steering. Experiments on Large Language Models (LLMs) show that the proposed method maintains stable character fidelity and output quality across contextualized settings, outperforming Contrastive Activation Addition (CAA) and prompt-only baselines. The combined SAE+Prompt configuration achieves the best overall performance, confirming that contrastively trained latent vectors can enhance persona control while preserving dialogue coherence.

Title: Next Reply Prediction X Dataset: Linguistic Discrepancies in Naively Generated Content

Authors: Simon Münker, Nils Schwager, Kai Kugler, Michael Heseltine, Achim Rettinger
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.19177
Pdf URL: https://arxiv.org/pdf/2602.19177
Copy Paste: [[2602.19177]] Next Reply Prediction X Dataset: Linguistic Discrepancies in Naively Generated Content(https://arxiv.org/abs/2602.19177)
Keywords: language model, llm, prompt
Abstract: The increasing use of Large Language Models (LLMs) as proxies for human participants in social science research presents a promising, yet methodologically risky, paradigm shift. While LLMs offer scalability and cost-efficiency, their "naive" application, where they are prompted to generate content without explicit behavioral constraints, introduces significant linguistic discrepancies that challenge the validity of research findings. This paper addresses these limitations by introducing a novel, history-conditioned reply prediction task on authentic X (formerly Twitter) data, to create a dataset designed to evaluate the linguistic output of LLMs against human-generated content. We analyze these discrepancies using stylistic and content-based metrics, providing a quantitative framework for researchers to assess the quality and authenticity of synthetic data. Our findings highlight the need for more sophisticated prompting techniques and specialized datasets to ensure that LLM-generated content accurately reflects the complex linguistic patterns of human communication, thereby improving the validity of computational social science studies.

Title: Retrieval Augmented Enhanced Dual Co-Attention Framework for Target Aware Multimodal Bengali Hateful Meme Detection

Authors: Raihan Tanvir, Md. Golam Rabiul Alam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.19212
Pdf URL: https://arxiv.org/pdf/2602.19212
Copy Paste: [[2602.19212]] Retrieval Augmented Enhanced Dual Co-Attention Framework for Target Aware Multimodal Bengali Hateful Meme Detection(https://arxiv.org/abs/2602.19212)
Keywords: language model, prompt
Abstract: Hateful content on social media increasingly appears as multimodal memes that combine images and text to convey harmful narratives. In low-resource languages such as Bengali, automated detection remains challenging due to limited annotated data, class imbalance, and pervasive code-mixing. To address these issues, we augment the Bengali Hateful Memes (BHM) dataset with semantically aligned samples from the Multimodal Aggression Dataset in Bengali (MIMOSA), improving both class balance and semantic diversity. We propose the Enhanced Dual Co-attention Framework (xDORA), integrating vision encoders (CLIP, DINOv2) and multilingual text encoders (XGLM, XLM-R) via weighted attention pooling to learn robust cross-modal representations. Building on these embeddings, we develop a FAISS-based k-nearest neighbor classifier for non-parametric inference and introduce RAG-Fused DORA, which incorporates retrieval-driven contextual reasoning. We further evaluate LLaVA under zero-shot, few-shot, and retrieval-augmented prompting settings. Experiments on the extended dataset show that xDORA (CLIP + XLM-R) achieves macro-average F1-scores of 0.78 for hateful meme identification and 0.71 for target entity detection, while RAG-Fused DORA improves performance to 0.79 and 0.74, yielding gains over the DORA baseline. The FAISS-based classifier performs competitively and demonstrates robustness for rare classes through semantic similarity modeling. In contrast, LLaVA exhibits limited effectiveness in few-shot settings, with only modest improvements under retrieval augmentation, highlighting constraints of pretrained vision-language models for code-mixed Bengali content without fine-tuning. These findings demonstrate the effectiveness of supervised, retrieval-augmented, and non-parametric multimodal frameworks for addressing linguistic and cultural complexities in low-resource hate speech detection.

Title: Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering

Authors: Maryam Amirizaniani, Alireza Salemi, Hamed Zamani
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2602.19317
Pdf URL: https://arxiv.org/pdf/2602.19317
Copy Paste: [[2602.19317]] Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering(https://arxiv.org/abs/2602.19317)
Keywords: llm, retrieval-augmented generation
Abstract: Personalization in Question Answering (QA) requires answers that are both accurate and aligned with users' background, preferences, and historical context. Existing state-of-the-art methods primarily rely on retrieval-augmented generation (RAG) solutions that construct personal context by retrieving relevant items from the user's profile. Existing methods use the user's query directly to retrieve personal documents, and such strategies often lead to surface-level personalization. We propose PR2 (Personalized Retrieval-Augmented Reasoning), a reinforcement learning framework that integrates reasoning and retrieval from personal context for personalization. PR2 learns adaptive retrieval-reasoning policies, determining when to retrieve, what evidence to retrieve from user profiles, and how to incorporate it into intermediate reasoning steps. By optimizing multi-turn reasoning trajectories under a personalized reward function, the framework reinforces reasoning paths that better align with user-specific preferences and contextual signals reflected by the reward model. Extensive experiments on the LaMP-QA benchmark using three LLMs show that PR2 consistently outperforms strong baselines, achieving an average relative improvement of 8.8%-12% in personalized QA.

Title: Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Authors: Dongming Jiang, Yi Li, Songtao Wei, Jinxin Yang, Ayushi Kishore, Alysa Zhao, Dingyi Kang, Xu Hu, Feng Chen, Qiannan Li, Bingzhe Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.19320
Pdf URL: https://arxiv.org/pdf/2602.19320
Copy Paste: [[2602.19320]] Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations(https://arxiv.org/abs/2602.19320)
Keywords: language model, llm, agent
Abstract: Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows. Despite rapid architectural development, the empirical foundations of these systems remain fragile: existing benchmarks are often underscaled, evaluation metrics are misaligned with semantic utility, performance varies significantly across backbone models, and system-level costs are frequently overlooked. This survey presents a structured analysis of agentic memory from both architectural and system perspectives. We first introduce a concise taxonomy of MAG systems based on four memory structures. Then, we analyze key pain points limiting current systems, including benchmark saturation effects, metric validity and judge sensitivity, backbone-dependent accuracy, and the latency and throughput overhead introduced by memory maintenance. By connecting the memory structure to empirical limitations, this survey clarifies why current agentic memory systems often underperform their theoretical promise and outlines directions for more reliable evaluation and scalable system design.

Title: PerSoMed: A Large-Scale Balanced Dataset for Persian Social Media Text Classification

Authors: Isun Chehreh, Ebrahim Ansari
Subjects: cs.CL, cs.IR, cs.SI
Abstract URL: https://arxiv.org/abs/2602.19333
Pdf URL: https://arxiv.org/pdf/2602.19333
Copy Paste: [[2602.19333]] PerSoMed: A Large-Scale Balanced Dataset for Persian Social Media Text Classification(https://arxiv.org/abs/2602.19333)
Keywords: gpt, prompt, chat
Abstract: This research introduces the first large-scale, well-balanced Persian social media text classification dataset, specifically designed to address the lack of comprehensive resources in this domain. The dataset comprises 36,000 posts across nine categories (Economic, Artistic, Sports, Political, Social, Health, Psychological, Historical, and Science & Technology), each containing 4,000 samples to ensure balanced class distribution. Data collection involved 60,000 raw posts from various Persian social media platforms, followed by rigorous preprocessing and hybrid annotation combining ChatGPT-based few-shot prompting with human verification. To mitigate class imbalance, we employed undersampling with semantic redundancy removal and advanced data augmentation strategies integrating lexical replacement and generative prompting. We benchmarked several models, including BiLSTM, XLM-RoBERTa (with LoRA and AdaLoRA adaptations), FaBERT, SBERT-based architectures, and the Persian-specific TookaBERT (Base and Large). Experimental results show that transformer-based models consistently outperform traditional neural networks, with TookaBERT-Large achieving the best performance (Precision: 0.9622, Recall: 0.9621, F1-score: 0.9621). Class-wise evaluation further confirms robust performance across all categories, though social and political texts exhibited slightly lower scores due to inherent ambiguity. This research presents a new high-quality dataset and provides comprehensive evaluations of cutting-edge models, establishing a solid foundation for further developments in Persian NLP, including trend analysis, social behavior modeling, and user classification. The dataset is publicly available to support future research endeavors.

Title: Personalized Prediction of Perceived Message Effectiveness Using Large Language Model Based Digital Twins

Authors: Jasmin Han (1), Janardan Devkota (1), Joseph Waring (1), Amanda Luken (2), Felix Naughton (3), Roger Vilardaga (4), Jonathan Bricker (5 and 6), Carl Latkin (7), Meghan Moran (7), Yiqun Chen (8 and 9), Johannes Thrul (1 and 10 and 11) ((1) Department of Mental Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA, (2) Department of Health Sciences, Towson University, Towson, USA, (3) Addiction Research Group, University of East Anglia, Norwich, UK, (4) Department of Implementation Science, Wake Forest University School of Medicine, Winston-Salem, USA, (5) Fred Hutchinson Cancer Center, Seattle, USA, (6) Department of Psychology, University of Washington, Seattle, USA, (7) Department of Health, Behavior and Society, Johns Hopkins Bloomberg School of Public Health, Baltimore, USA, (8) Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, USA, (9) Department of Computer Science, Johns Hopkins Whiting School of Engineering, Baltimore, USA, (10) Sidney Kimmel Comprehensive Cancer Center at Johns Hopkins, Baltimore, USA, (11) Centre for Alcohol Policy Research, La Trobe University, Melbourne, Australia)
Subjects: cs.CL, stat.AP
Abstract URL: https://arxiv.org/abs/2602.19403
Pdf URL: https://arxiv.org/pdf/2602.19403
Copy Paste: [[2602.19403]] Personalized Prediction of Perceived Message Effectiveness Using Large Language Model Based Digital Twins(https://arxiv.org/abs/2602.19403)
Keywords: language model, llm, prompt
Abstract: Perceived message effectiveness (PME) by potential intervention end-users is important for selecting and optimizing personalized smoking cessation intervention messages for mobile health (mHealth) platform delivery. This study evaluates whether large language models (LLMs) can accurately predict PME for smoking cessation messages. We evaluated multiple models for predicting PME across three domains: content quality, coping support, and quitting support. The dataset comprised 3010 message ratings (5-point Likert scale) from 301 young adult smokers. We compared (1) supervised learning models trained on labeled data, (2) zero- and few-shot LLMs prompted without task-specific fine-tuning, and (3) LLM-based digital twins that incorporate individual characteristics and prior PME histories to generate personalized predictions. Model performance was assessed on three held-out messages per participant using accuracy, Cohen's kappa, and F1. LLM-based digital twins outperformed zero- and few-shot LLMs (12 percentage points on average) and supervised baselines (13 percentage points), achieving accuracies of 0.49 (content), 0.45 (coping), and 0.49 (quitting), with directional accuracies of 0.75, 0.66, and 0.70 on a simplified 3-point scale. Digital twin predictions showed greater dispersion across rating categories, indicating improved sensitivity to individual differences. Integrating personal profiles with LLMs captures person-specific differences in PME and outperforms supervised and zero- and few-shot approaches. Improved PME prediction may enable more tailored intervention content in mHealth. LLM-based digital twins show potential for supporting personalization of mobile smoking cessation and other health behavior change interventions.

Title: Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference

Authors: Arindam Khaled
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.19509
Pdf URL: https://arxiv.org/pdf/2602.19509
Copy Paste: [[2602.19509]] Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference(https://arxiv.org/abs/2602.19509)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) face a persistent trade-off between inference cost and reasoning capability. While "Oracle" models (e.g., Llama-3-70B) achieve state-of-the-art accuracy, they are prohibitively expensive for high-volume deployment. Smaller models (e.g., 8B parameters) are cost-effective but struggle with complex tasks. In this work, we propose "Pyramid MoA", a hierarchical Mixture-of-Agents architecture that uses a lightweight Router to dynamically escalate queries only when necessary. By leveraging semantic agreement and confidence calibration among an ensemble of small models, our Router identifies "hard" problems with high precision. On the GSM8K benchmark, our system achieves 93.0% accuracy, effectively matching the Oracle baseline (98.0%) while reducing compute costs by 61%. We demonstrate that the system introduces negligible latency overhead (+0.82s) and allows for a tunable trade-off between performance and budget.
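
The escalation idea in the Pyramid MoA abstract can be illustrated with a minimal Python sketch. It is not the paper's Router: the abstract mentions semantic agreement and confidence calibration, whereas this sketch substitutes plain exact-string majority voting as a stand-in. The names `route`, `small_models`, `oracle`, and `min_agreement` are hypothetical and chosen for illustration only.

```python
from collections import Counter
from typing import Callable, List, Tuple

Model = Callable[[str], str]  # hypothetical: a model maps a question to an answer string


def route(question: str,
          small_models: List[Model],
          oracle: Model,
          min_agreement: float = 0.75) -> Tuple[str, str]:
    """Answer with the small ensemble when it agrees; otherwise escalate.

    Assumption: agreement is the vote share of the most common answer.
    A real router would compare answers semantically and calibrate confidence.
    """
    answers = [model(question) for model in small_models]
    top_answer, votes = Counter(answers).most_common(1)[0]
    agreement = votes / len(answers)
    if agreement >= min_agreement:
        return top_answer, "small"      # cheap path: ensemble is consistent
    return oracle(question), "oracle"   # expensive path: treat as a "hard" query
```

Raising `min_agreement` sends more queries to the large model, which is one simple way to picture the tunable trade-off between performance and budget that the abstract describes.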

Title: How to Train Your Deep Research Agent? Prompt, Reward, and Policy Optimization in Search-R1

Authors: Yinuo Xu, Shuo Lu, Jianjie Cheng, Meng Wang, Qianlong Xie, Xingxing Wang, Ran He, Jian Liang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.19526
Pdf URL: https://arxiv.org/pdf/2602.19526
Copy Paste: [[2602.19526]] How to Train Your Deep Research Agent? Prompt, Reward, and Policy Optimization in Search-R1(https://arxiv.org/abs/2602.19526)
Keywords: prompt, agent
Abstract: Deep Research agents tackle knowledge-intensive tasks through multi-round retrieval and decision-oriented generation. While reinforcement learning (RL) has been shown to improve performance in this paradigm, its contributions remain underexplored. To fully understand the role of RL, we conduct a systematic study along three decoupled dimensions: prompt template, reward function, and policy optimization. Our study reveals that: 1) the Fast Thinking template yields greater stability and better performance than the Slow Thinking template used in prior work; 2) the F1-based reward underperforms the EM due to training collapse driven by answer avoidance; this can be mitigated by incorporating action-level penalties, ultimately surpassing EM; 3) REINFORCE outperforms PPO while requiring fewer search actions, whereas GRPO shows the poorest stability among policy optimization methods. Building on these insights, we then introduce Search-R1++, a strong baseline that improves the performance of Search-R1 from 0.403 to 0.442 (Qwen2.5-7B) and 0.289 to 0.331 (Qwen2.5-3B). We hope that our findings can pave the way for more principled and reliable RL training strategies in Deep Research systems.
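
The two answer-level rewards compared in the abstract, exact match (EM) and F1, can be sketched as follows. This is the common SQuAD-style formulation and is an assumption here; the abstract does not spell out the normalization the authors use, and the action-level penalties it mentions are not modeled. The function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

EM is all-or-nothing, while F1 gives partial credit for overlapping tokens, which is the difference in reward shape that the study examines.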

Title: Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pretraining

Authors: Jeffrey Li, Josh Gardner, Doug Kang, Fangping Shi, Karanjeet Singh, Chun-Liang Li, Herumb Shandilya, David Hall, Oncel Tuzel, Percy Liang, Ludwig Schmidt, Hadi Pour Ansari, Fartash Faghri
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.19548
Pdf URL: https://arxiv.org/pdf/2602.19548
Copy Paste: [[2602.19548]] Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pretraining(https://arxiv.org/abs/2602.19548)
Keywords: llm
Abstract: One of the first pre-processing steps for constructing web-scale LLM pretraining datasets involves extracting text from HTML. Despite the immense diversity of web content, existing open-source datasets predominantly apply a single fixed extractor to all webpages. In this work, we investigate whether this practice leads to suboptimal coverage and utilization of Internet data. We first show that while different extractors may lead to similar model performance on standard language understanding tasks, the pages surviving a fixed filtering pipeline can differ substantially. This suggests a simple intervention: by taking a Union over different extractors, we can increase the token yield of DCLM-Baseline by up to 71% while maintaining benchmark performance. We further show that for structured content such as tables and code blocks, extractor choice can significantly impact downstream task performance, with differences of up to 10 percentage points (p.p.) on WikiTQ and 3 p.p. on HumanEval.
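
A rough Python sketch of the "Union over extractors" idea: a page is kept if the output of any extractor survives the filter, rather than depending on one fixed extractor. The abstract does not say which extractor's text is retained or how duplicates are handled, so the priority-order rule below is an assumption, and the toy extractors and the `keep` filter are hypothetical stand-ins for real extraction and filtering pipelines.

```python
import re
from typing import Callable, List, Optional

Extractor = Callable[[str], str]  # hypothetical: HTML in, plain text out


def paragraphs_only(html: str) -> str:
    """Toy extractor: keep only the contents of <p> elements."""
    return " ".join(re.findall(r"<p>(.*?)</p>", html, flags=re.S)).strip()


def strip_tags(html: str) -> str:
    """Toy extractor: remove every tag and keep all remaining text."""
    return " ".join(re.sub(r"<[^>]+>", " ", html).split())


def union_extract(html: str,
                  extractors: List[Extractor],
                  keep: Callable[[str], bool]) -> Optional[str]:
    """Return the first extractor output that passes the filter, else None.

    Assumption: extractors are tried in priority order.
    """
    for extract in extractors:
        text = extract(html)
        if keep(text):
            return text
    return None
```

With a single extractor, a page whose content lives in a table or code block may be dropped entirely; trying a second extractor recovers it, which is how a union can raise token yield.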

Title: Temporal-Aware Heterogeneous Graph Reasoning with Multi-View Fusion for Temporal Question Answering

Authors: Wuzhenghong Wen, Bowen Zhou, Jinwen Huang, Xianjie Wu, Yuwei Sun, Su Pan, Liang Li, Jianting Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.19569
Pdf URL: https://arxiv.org/pdf/2602.19569
Copy Paste: [[2602.19569]] Temporal-Aware Heterogeneous Graph Reasoning with Multi-View Fusion for Temporal Question Answering(https://arxiv.org/abs/2602.19569)
Keywords: language model
Abstract: Question Answering over Temporal Knowledge Graphs (TKGQA) has attracted growing interest for handling time-sensitive queries. However, existing methods still struggle with: 1) weak incorporation of temporal constraints in question representation, causing biased reasoning; 2) limited ability to perform explicit multi-hop reasoning; and 3) suboptimal fusion of language and graph representations. We propose a novel framework with temporal-aware question encoding, multi-hop graph reasoning, and multi-view heterogeneous information fusion. Specifically, our approach introduces: 1) a constraint-aware question representation that combines semantic cues from language models with temporal entity dynamics; 2) a temporal-aware graph neural network for explicit multi-hop reasoning via time-aware message passing; and 3) a multi-view attention mechanism for more effective fusion of question context and temporal graph knowledge. Experiments on multiple TKGQA benchmarks demonstrate consistent improvements over multiple baselines.

Title: Anatomy of Unlearning: The Dual Impact of Fact Salience and Model Fine-Tuning

Authors: Borisiuk Anna, Andrey Savchenko, Alexander Panchecko, Elena Tutubalina
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.19612
Pdf URL: https://arxiv.org/pdf/2602.19612
Copy Paste: [[2602.19612]] Anatomy of Unlearning: The Dual Impact of Fact Salience and Model Fine-Tuning(https://arxiv.org/abs/2602.19612)
Keywords: language model, llm
Abstract: Machine Unlearning (MU) enables Large Language Models (LLMs) to remove unsafe or outdated information. However, existing work assumes that all facts are equally forgettable and largely ignores whether the forgotten knowledge originates from pretraining or supervised fine-tuning (SFT). In this paper, we introduce DUAL (Dual Unlearning Evaluation across Training Stages), a benchmark of 28.6k Wikidata-derived triplets annotated with fact popularity using Wikipedia link counts and LLM-based salience scores. Our experiments show that pretrained and SFT models respond differently to unlearning. An SFT step on the forget data yields smoother forgetting, more stable tuning, and 10-50% higher retention, while direct unlearning on pretrained models remains unstable and prone to relearning or catastrophic forgetting.

Title: KGHaluBench: A Knowledge Graph-Based Hallucination Benchmark for Evaluating the Breadth and Depth of LLM Knowledge

Authors: Alex Robertson, Huizhi Liang, Mahbub Gani, Rohit Kumar, Srijith Rajamohan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.19643
Pdf URL: https://arxiv.org/pdf/2602.19643
Copy Paste: [[2602.19643]] KGHaluBench: A Knowledge Graph-Based Hallucination Benchmark for Evaluating the Breadth and Depth of LLM Knowledge(https://arxiv.org/abs/2602.19643)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) possess a remarkable capacity to generate persuasive and intelligible language. However, coherence does not equate to truthfulness, as the responses often contain subtle hallucinations. Existing benchmarks are limited by static and narrow questions, leading to limited coverage and misleading evaluations. We present KGHaluBench, a Knowledge Graph-based hallucination benchmark that assesses LLMs across the breadth and depth of their knowledge, providing a fairer and more comprehensive insight into LLM truthfulness. Our framework utilises the KG to dynamically construct challenging, multifaceted questions, whose difficulty is then statistically estimated to address popularity bias. Our automated verification pipeline detects abstentions and verifies the LLM's response at both conceptual and correctness levels to identify different types of hallucinations. We evaluate 25 frontier models, using novel accuracy and hallucination metrics. The results provide a more interpretable insight into the knowledge factors that cause hallucinations across different model sizes. KGHaluBench is publicly available to support future developments in hallucination mitigation.

Title: SAMAS: A Spectrum-Guided Multi-Agent System for Achieving Style Fidelity in Literary Translation

Authors: Jingzhuo Wu, Jiajun Zhang, Keyan Jin, Dehua Ma, Junbo Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.19840
Pdf URL: https://arxiv.org/pdf/2602.19840
Copy Paste: [[2602.19840]] SAMAS: A Spectrum-Guided Multi-Agent System for Achieving Style Fidelity in Literary Translation(https://arxiv.org/abs/2602.19840)
Keywords: language model, llm, agent
Abstract: Modern large language models (LLMs) excel at generating fluent and faithful translations. However, they struggle to preserve an author's unique literary style, often producing semantically correct but generic outputs. This limitation stems from the inability of current single-model and static multi-agent systems to perceive and adapt to stylistic variations. To address this, we introduce the Style-Adaptive Multi-Agent System (SAMAS), a novel framework that treats style preservation as a signal processing task. Specifically, our method quantifies literary style into a Stylistic Feature Spectrum (SFS) using the wavelet packet transform. This SFS serves as a control signal to dynamically assemble a tailored workflow of specialized translation agents based on the source text's structural patterns. Extensive experiments on translation benchmarks show that SAMAS achieves competitive semantic accuracy against strong baselines, primarily by leveraging its statistically significant advantage in style fidelity.

Title: SHIELD: Semantic Heterogeneity Integrated Embedding for Latent Discovery in Clinical Trial Safety Signals

Authors: Francois Vandenhende, Anna Georgiou, Theodoros Psaras, Ellie Karekla
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.19855
Pdf URL: https://arxiv.org/pdf/2602.19855
Copy Paste: [[2602.19855]] SHIELD: Semantic Heterogeneity Integrated Embedding for Latent Discovery in Clinical Trial Safety Signals(https://arxiv.org/abs/2602.19855)
Keywords: language model
Abstract: We present SHIELD, a novel methodology for automated and integrated safety signal detection in clinical trials. SHIELD combines disproportionality analysis with semantic clustering of adverse event (AE) terms applied to MedDRA term embeddings. For each AE, the pipeline computes an information-theoretic disproportionality measure (Information Component) with effect size derived via empirical Bayesian shrinkage. A utility matrix is constructed by weighting semantic term-term similarities by signal magnitude, followed by spectral embedding and clustering to identify groups of related AEs. Resulting clusters are annotated with syndrome-level summary labels using large language models, yielding a coherent, data-driven representation of treatment-associated safety profiles in the form of a network graph and hierarchical tree. We implement the SHIELD framework in the context of a single-arm incidence summary, to compare two treatment arms or for the detection of any treatment effect in a multi-arm trial. We illustrate its ability to recover known safety signals and generate interpretable, cluster-based summaries in a real clinical trial example. This work bridges statistical signal detection with modern natural language processing to enhance safety assessment and causal interpretation in clinical trials.
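
The abstract does not give the formula for its Information Component. A widely used shrinkage form from the disproportionality-analysis literature is IC = log2((O + 0.5) / (E + 0.5)), where O is the observed count and E the count expected under independence. The sketch below assumes that form; it may differ from what SHIELD actually computes, and the parameter names are illustrative.

```python
import math


def information_component(observed: float,
                          n_arm: float,
                          n_event: float,
                          n_total: float,
                          shrink: float = 0.5) -> float:
    """Shrunken log2 observed-to-expected ratio for one adverse event (AE).

    observed: count of this AE in the arm of interest
    n_arm:    count of all AEs in that arm
    n_event:  count of this AE across all arms
    n_total:  count of all AEs across all arms
    Assumption: `shrink` = 0.5 is added to both counts to damp small-sample noise.
    """
    expected = n_arm * n_event / n_total
    return math.log2((observed + shrink) / (expected + shrink))
```

A positive value means the AE occurs more often in the arm than independence would predict; values near zero indicate no disproportionality.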

Title: Janus-Q: End-to-End Event-Driven Trading via Hierarchical-Gated Reward Modeling

Authors: Xiang Li, Zikai Wei, Yiyan Qi, Wanyun Zhou, Xiang Liu, Penglei Sun, Yongqi Zhang, Xiaowen Chu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.19919
Pdf URL: https://arxiv.org/pdf/2602.19919
Copy Paste: [[2602.19919]] Janus-Q: End-to-End Event-Driven Trading via Hierarchical-Gated Reward Modeling(https://arxiv.org/abs/2602.19919)
Keywords: language model, llm
Abstract: Financial market movements are often driven by discrete financial events conveyed through news, whose impacts are heterogeneous, abrupt, and difficult to capture under purely numerical prediction objectives. These limitations have motivated growing interest in using textual information as the primary source of trading signals in learning-based systems. Two key challenges hinder existing approaches: (1) the absence of large-scale, event-centric datasets that jointly model news semantics and statistically grounded market reactions, and (2) the misalignment between language model reasoning and financially valid trading behavior under dynamic market conditions. To address these challenges, we propose Janus-Q, an end-to-end event-driven trading framework that elevates financial news events from auxiliary signals to primary decision units. Janus-Q unifies event-centric data construction and model optimization under a two-stage paradigm. Stage I focuses on event-centric data construction, building a large-scale financial news event dataset comprising 62,400 articles annotated with 10 fine-grained event types, associated stocks, sentiment labels, and event-driven cumulative abnormal return (CAR). Stage II performs decision-oriented fine-tuning, combining supervised learning with reinforcement learning guided by a Hierarchical Gated Reward Model (HGRM), which explicitly captures trade-offs among multiple trading objectives. Extensive experiments demonstrate that Janus-Q achieves more consistent, interpretable, and profitable trading decisions than market indices and LLM baselines, improving the Sharpe Ratio by up to 102.0% while increasing direction accuracy by over 17.5% compared to the strongest competing strategies.
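
For readers unfamiliar with cumulative abnormal return (CAR), here is a minimal Python sketch. It assumes a market-adjusted model in which the expected return on each day is simply the benchmark return; the abstract does not state which expected-return model Janus-Q uses, so this is illustrative only.

```python
from typing import Sequence


def cumulative_abnormal_return(stock_returns: Sequence[float],
                               benchmark_returns: Sequence[float]) -> float:
    """Sum of daily abnormal returns over an event window.

    Assumption: abnormal return = stock return minus benchmark return
    (market-adjusted model); other models estimate the expected return
    from a pre-event regression instead.
    """
    if len(stock_returns) != len(benchmark_returns):
        raise ValueError("return series must cover the same event window")
    return sum(s - b for s, b in zip(stock_returns, benchmark_returns))
```

For a three-day window with stock returns of 2%, 1%, and -0.5% against benchmark returns of 0.5%, 0%, and 0.5%, the daily abnormal returns are 1.5%, 1%, and -1%, so the CAR is 1.5%.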

Title: Assessing Risks of Large Language Models in Mental Health Support: A Framework for Automated Clinical AI Red Teaming

Authors: Ian Steenstra, Paola Pedrelli, Weiyan Shi, Stacy Marsella, Timothy W. Bickmore
Subjects: cs.CL, cs.AI, cs.CY, cs.HC, cs.MA
Abstract URL: https://arxiv.org/abs/2602.19948
Pdf URL: https://arxiv.org/pdf/2602.19948
Copy Paste: [[2602.19948]] Assessing Risks of Large Language Models in Mental Health Support: A Framework for Automated Clinical AI Red Teaming(https://arxiv.org/abs/2602.19948)
Keywords: language model, gpt, llm, chat, agent
Abstract: Large Language Models (LLMs) are increasingly utilized for mental health support; however, current safety benchmarks often fail to detect the complex, longitudinal risks inherent in therapeutic dialogue. We introduce an evaluation framework that pairs AI psychotherapists with simulated patient agents equipped with dynamic cognitive-affective models and assesses therapy session simulations against a comprehensive quality of care and risk ontology. We apply this framework to a high-impact test case, Alcohol Use Disorder, evaluating six AI agents (including ChatGPT, Gemini, and this http URL) against a clinically-validated cohort of 15 patient personas representing diverse clinical phenotypes. Our large-scale simulation (N=369 sessions) reveals critical safety gaps in the use of AI for mental health support. We identify specific iatrogenic risks, including the validation of patient delusions ("AI Psychosis") and failure to de-escalate suicide risk. Finally, we validate an interactive data visualization dashboard with diverse stakeholders, including AI engineers and red teamers, mental health professionals, and policy experts (N=9), demonstrating that this framework effectively enables stakeholders to audit the "black box" of AI psychotherapy. These findings underscore the critical safety risks of AI-provided mental health support and the necessity of simulation-based clinical red teaming before deployment.
摘要：大语言模型 (LLM) 越来越多地用于心理健康支持；然而，当前的安全基准往往无法检测治疗对话中固有的复杂的、纵向的风险。我们引入了一个评估框架，将人工智能心理治疗师与配备动态认知情感模型的模拟患者代理配对，并根据综合护理质量和风险本体评估治疗过程模拟。我们将此框架应用于高影响力的测试案例“酒精使用障碍”，针对代表不同临床表型的 15 名经过临床验证的患者角色队列评估了 6 种 AI 代理（包括 ChatGPT、Gemini 和此 http URL）。我们的大规模模拟（N=369 次）揭示了使用人工智能提供心理健康支持方面的关键安全漏洞。我们确定了特定的医源性风险，包括验证患者妄想（“AI 精神病”）和未能降低自杀风险。最后，我们与不同利益相关者（包括人工智能工程师和红队成员、心理健康专业人员和政策专家（N=9））验证了交互式数据可视化仪表板，证明该框架有效地使利益相关者能够审核人工智能心理治疗的“黑匣子”。这些发现强调了人工智能提供的心理健康支持的关键安全风险以及部署前基于模拟的临床红队的必要性。

Title: Unlocking Multimodal Document Intelligence: From Current Triumphs to Future Frontiers of Visual Document Retrieval

Authors: Yibo Yan, Jiahao Huo, Guanbo Feng, Mingdong Ou, Yi Cao, Xin Zou, Shuliang Liu, Yuanhuiyi Lyu, Yu Huang, Jungang Li, Kening Zheng, Xu Zheng, Philip S. Yu, James Kwok, Xuming Hu
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2602.19961
Pdf URL: https://arxiv.org/pdf/2602.19961
Copy Paste: [[2602.19961]] Unlocking Multimodal Document Intelligence: From Current Triumphs to Future Frontiers of Visual Document Retrieval(https://arxiv.org/abs/2602.19961)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: With the rapid proliferation of multimodal information, Visual Document Retrieval (VDR) has emerged as a critical frontier in bridging the gap between unstructured visually rich data and precise information acquisition. Unlike traditional natural image retrieval, visual documents exhibit unique characteristics defined by dense textual content, intricate layouts, and fine-grained semantic dependencies. This paper presents the first comprehensive survey of the VDR landscape, specifically through the lens of the Multimodal Large Language Model (MLLM) era. We begin by examining the benchmark landscape, and subsequently dive into the methodological evolution, categorizing approaches into three primary aspects: multimodal embedding models, multimodal reranker models, and the integration of Retrieval-Augmented Generation (RAG) and Agentic systems for complex document intelligence. Finally, we identify persistent challenges and outline promising future directions, aiming to provide a clear roadmap for future multimodal document intelligence.
摘要：随着多模态信息的快速扩散，视觉文档检索（VDR）已成为弥合非结构化视觉丰富数据和精确信息获取之间差距的关键前沿。与传统的自然图像检索不同，视觉文档表现出由密集的文本内容、复杂的布局和细粒度的语义依赖性定义的独特特征。本文首次从多模态大语言模型 (MLLM) 时代的角度对 VDR 格局进行了全面调查。我们首先检查基准环境，然后深入研究方法论的演变，将方法分为三个主要方面：多模态嵌入模型、多模态重排序模型以及检索增强生成（RAG）和复杂文档智能代理系统的集成。最后，我们确定了持续存在的挑战并概述了有希望的未来方向，旨在为未来的多模式文档智能提供清晰的路线图。

Title: ReAttn: Improving Attention-based Re-ranking via Attention Re-weighting

Authors: Yuxing Tian, Fengran Mo, Weixu Zhang, Yiyan Qi, Jian-Yun Nie
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.19969
Pdf URL: https://arxiv.org/pdf/2602.19969
Copy Paste: [[2602.19969]] ReAttn: Improving Attention-based Re-ranking via Attention Re-weighting(https://arxiv.org/abs/2602.19969)
Keywords: language model, llm
Abstract: The strong capabilities of recent Large Language Models (LLMs) have made them highly effective for zero-shot re-ranking tasks. Attention-based re-ranking methods, which derive relevance scores directly from attention weights, offer an efficient and interpretable alternative to generation-based re-ranking methods. However, they still face two major limitations. First, attention signals are highly concentrated on a small subset of tokens within a few documents, making others indistinguishable. Second, attention often overemphasizes phrases lexically similar to the query, yielding biased rankings in which irrelevant documents with mere lexical resemblance are regarded as relevant. In this paper, we propose ReAttn, a post-hoc re-weighting strategy for attention-based re-ranking methods. It first computes cross-document IDF weights to down-weight attention on query-overlapping tokens that frequently appear across the candidate documents, reducing lexical bias and emphasizing distinctive terms. It then employs entropy-based regularization to mitigate over-concentrated attention, encouraging a more balanced distribution across informative tokens. Both adjustments operate directly on existing attention weights without additional training or supervision. Extensive experiments demonstrate the effectiveness of our method.
摘要：最近的大型语言模型（LLM）的强大功能使其对于零样本重排序任务非常有效。基于注意力的重新排序方法直接从注意力权重中得出相关性分数，为基于生成的重新排序方法提供了一种有效且可解释的替代方法。然而，它们仍然面临两个主要限制。首先，注意力信号高度集中在几个文档中的一小部分标记中，使得其他标记无法区分。其次，注意力常常过分强调与查询在词汇上相似的短语，从而产生有偏见的排名，即仅具有词汇相似性的不相关文档被认为是相关的。在本文中，我们提出了 \textbf{ReAttn}，一种基于注意力的重新排序方法的事后重新加权策略。它首先计算跨文档 IDF 权重，以降低对候选文档中频繁出现的查询重叠标记的关注，减少词汇偏差并强调独特术语。然后，它采用基于熵的正则化来减轻过度集中的注意力，鼓励信息令牌之间更加平衡的分配。这两种调整都直接对现有的注意力权重进行操作，无需额外的培训或监督。大量的实验证明了我们方法的有效性。
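
The two adjustments described in the ReAttn abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' formulation: the smoothed IDF formula, the entropy bonus, the `lam` parameter, and all function names are hypothetical choices made only to show the shape of the idea (down-weight query terms that appear in many candidates, and favor documents whose attention is less concentrated).

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Set


def cross_document_idf(docs: Sequence[Sequence[str]], query_terms: Set[str]) -> Dict[str, float]:
    """IDF-style weight in (0, 1] for query terms found in the candidates.

    Assumed formula: log(1 + N/df) / log(1 + N), so a term that occurs in a
    single document gets weight 1 and a term in every document gets the least.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        for term in set(doc) & query_terms:
            df[term] += 1
    return {t: math.log(1 + n / df[t]) / math.log(1 + n) for t in df}


def normalized_entropy(weights: Sequence[float]) -> float:
    """Shannon entropy of the normalized weights, scaled to [0, 1]."""
    total = sum(weights)
    if total <= 0 or len(weights) < 2:
        return 0.0
    probs = [w / total for w in weights if w > 0]
    h = -sum(p * math.log(p) for p in probs)
    return h / math.log(len(weights))


def rerank_score(tokens: Sequence[str], attn: Sequence[float],
                 idf: Dict[str, float], lam: float = 0.5) -> float:
    """Document score from re-weighted attention.

    Assumption: the entropy term is a simple multiplicative bonus standing in
    for the paper's entropy-based regularization, whose exact form is not
    given in the abstract.
    """
    reweighted: List[float] = [a * idf.get(t, 1.0) for t, a in zip(tokens, attn)]
    return sum(reweighted) * (1.0 + lam * normalized_entropy(reweighted))
```

Both steps operate on attention weights that already exist, which matches the abstract's point that no additional training is needed.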

Title: gencat: Generative computerized adaptive testing

Authors: Wanyong Feng, Andrew Lan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.20020
Pdf URL: https://arxiv.org/pdf/2602.20020
Copy Paste: [[2602.20020]] gencat: Generative computerized adaptive testing(https://arxiv.org/abs/2602.20020)
Keywords: language model
Abstract: Existing Computerized Adaptive Testing (CAT) frameworks are typically built on predicting the correctness of a student response to a question. Although effective, this approach fails to leverage textual information in questions and responses, especially for open-ended questions. In this work, we propose GENCAT (GENerative CAT), a novel CAT framework that leverages Large Language Models for knowledge estimation and question selection. First, we develop a Generative Item Response Theory (GIRT) model that enables us to estimate student knowledge from their open-ended responses and predict responses to unseen questions. We train the model in a two-step process, first via Supervised Fine-Tuning and then via preference optimization for knowledge-response alignment. Second, we introduce three question selection algorithms that leverage the generative capabilities of the GIRT model, based on the uncertainty, linguistic diversity, and information of sampled student responses. Third, we conduct experiments on two real-world programming datasets and demonstrate that GENCAT outperforms existing CAT baselines, achieving an AUC improvement of up to 4.32% in the key early testing stages.
摘要：现有的计算机化自适应测试（CAT）框架通常建立在预测学生对问题的回答的正确性之上。尽管有效，但这种方法无法利用问题和回答中的文本信息，尤其是对于开放式问题。在这项工作中，我们提出了 GENCAT (\textbf{GEN}erative \textbf{CAT})，这是一种新颖的 CAT 框架，利用大型语言模型进行知识估计和问题选择。首先，我们开发了生成项目反应理论（GIRT）模型，使我们能够从学生的开放式回答中估计他们的知识，并预测对未见过的问题的回答。我们通过两步过程训练模型，首先通过监督微调，然后通过知识响应对齐的偏好优化。其次，我们介绍了三种问题选择算法，它们基于不确定性、语言多样性和样本学生回答的信息，利用 GIRT 模型的生成能力。第三，我们在两个真实世界的编程数据集上进行了实验，并证明 GENCAT 优于现有的 CAT 基线，在关键的早期测试阶段实现了高达 4.32% 的 AUC 改进。

Title: AgenticSum: An Agentic Inference-Time Framework for Faithful Clinical Text Summarization

Authors: Fahmida Liza Piya, Rahmatollah Beheshti
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.20040
Pdf URL: https://arxiv.org/pdf/2602.20040
Copy Paste: [[2602.20040]] AgenticSum: An Agentic Inference-Time Framework for Faithful Clinical Text Summarization(https://arxiv.org/abs/2602.20040)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) offer substantial promise for automating clinical text summarization, yet maintaining factual consistency remains challenging due to the length, noise, and heterogeneity of clinical documentation. We present AgenticSum, an inference-time, agentic framework that separates context selection, generation, verification, and targeted correction to reduce hallucinated content. The framework decomposes summarization into coordinated stages that compress task-relevant context, generate an initial draft, identify weakly supported spans using internal attention grounding signals, and selectively revise flagged content under supervisory control. We evaluate AgenticSum on two public datasets, using reference-based metrics, LLM-as-a-judge assessment, and human evaluation. Across various measures, AgenticSum demonstrates consistent improvements compared to vanilla LLMs and other strong baselines. Our results indicate that structured, agentic design with targeted correction offers an effective inference-time solution to improve clinical note summarization using LLMs.
摘要：大语言模型 (LLM) 为自动化临床文本摘要提供了巨大的希望，但由于临床文档的长度、噪音和异质性，保持事实一致性仍然具有挑战性。我们提出了 AgenticSum，一个推理时间的代理框架，它将上下文选择、生成、验证和有针对性的纠正分开，以减少幻觉内容。该框架将摘要分解为协调的阶段，压缩与任务相关的上下文，生成初始草案，使用内部注意力基础信号识别弱支持的跨度，并在监督控制下有选择地修改标记的内容。我们使用基于参考的指标、法学硕士作为法官评估和人工评估，在两个公共数据集上评估 AgenticSum。在各种衡量标准中，与普通法学硕士和其他强大的基线相比，AgenticSum 表现出持续的改进。我们的结果表明，具有针对性校正的结构化、代理设计提供了有效的推理时间解决方案，可以使用法学硕士改进临床记录总结。

Title: Position: General Alignment Has Hit a Ceiling; Edge Alignment Must Be Taken Seriously

Authors: Han Bao, Yue Huang, Xiaoda Wang, Zheyuan Zhang, Yujun Zhou, Carl Yang, Xiangliang Zhang, Yanfang Ye
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.20042
Pdf URL: https://arxiv.org/pdf/2602.20042
Copy Paste: [[2602.20042]] Position: General Alignment Has Hit a Ceiling; Edge Alignment Must Be Taken Seriously(https://arxiv.org/abs/2602.20042)
Keywords: language model
Abstract: Large language models are being deployed in complex socio-technical systems, which exposes limits in current alignment practice. We take the position that the dominant paradigm of General Alignment, which compresses diverse human values into a single scalar reward, reaches a structural ceiling in settings with conflicting values, plural stakeholders, and irreducible uncertainty. These failures follow from the mathematics and incentives of scalarization and lead to structural value flattening, normative representation loss, and cognitive uncertainty blindness. We introduce Edge Alignment as a distinct approach in which systems preserve multidimensional value structure, support plural and democratic representation, and incorporate epistemic mechanisms for interaction and clarification. To make this approach practical, we propose seven interdependent pillars organized into three phases. We identify key challenges in data collection, training objectives, and evaluation, outlining complementary technical and governance directions. Taken together, these measures reframe alignment as a lifecycle problem of dynamic normative governance rather than as a single-instance optimization task.
摘要：大型语言模型正在复杂的社会技术系统中部署，这暴露了当前对齐实践的局限性。我们的立场是，普遍联盟的主导范式将不同的人类价值观压缩为单一的标量奖励，在价值观冲突、多元化利益相关者和不可减少的不确定性的环境中达到了结构上限。这些失败源于数学和标量化的激励，并导致 \textbf{结构} 值扁平化、\textbf{规范} 表示损失和 \textbf{认知} 不确定性盲目。我们引入边缘对齐作为一种独特的方法，其中系统保留多维价值结构，支持多元化和民主代表，并纳入互动和澄清的认知机制。为了使这种方法切实可行，我们提出了七个相互依赖的支柱，分为三个阶段。我们确定数据收集、培训目标和评估方面的主要挑战，概述互补的技术和治理方向。总的来说，这些措施将对齐重新定义为动态规范治理的生命周期问题，而不是单个实例优化任务。

Title: Entropy in Large Language Models

Authors: Marco Scharringhausen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.20052
Pdf URL: https://arxiv.org/pdf/2602.20052
Copy Paste: [[2602.20052]] Entropy in Large Language Models(https://arxiv.org/abs/2602.20052)
Keywords: language model, llm
Abstract: In this study, the output of large language models (LLMs) is considered an information source generating an unlimited sequence of symbols drawn from a finite alphabet. Given the probabilistic nature of modern LLMs, we assume a probabilistic model for these LLMs that follows a constant random distribution, so that the source itself is stationary. We compare this source entropy (per word) to that of natural language (written or spoken) as represented by the Open American National Corpus (OANC). Our results indicate that the word entropy of such LLMs is lower than the word entropy of natural speech in both written and spoken form. The long-term goal of such studies is to formalize the intuitions of information and uncertainty in large language training to assess the impact of training an LLM from LLM-generated training data. This refers to texts from the world wide web in particular.
摘要：在这项研究中，大型语言模型（LLM）的输出被认为是一种信息源，生成从有限字母表中提取的无限符号序列。考虑到现代法学硕士的概率性质，我们假设这些法学硕士的概率模型遵循恒定的随机分布，因此来源本身是固定的。我们将此源熵（每个单词）与开放美国国家语料库 (OANC) 所代表的自然语言（书面或口头）的源熵进行比较。我们的结果表明，此类法学硕士的词熵低于书面或口头形式的自然语音的词熵。此类研究的长期目标是形式化大语言训练中信息和不确定性的直觉，以评估从法学硕士生成的训练数据中训练法学硕士的影响。这特指来自万维网的文本。
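
The per-word source entropy compared in the abstract can be illustrated with the standard Shannon estimate H = -sum(p(w) * log2 p(w)) over empirical word frequencies. The sketch below is a simplification under stated assumptions: it treats words as independent draws from a fixed distribution (consistent with the stationarity assumption above), uses lowercase whitespace tokenization, and the function name is hypothetical. It is not the study's actual estimation procedure.

```python
import math
from collections import Counter


def word_entropy(text: str) -> float:
    """Empirical unigram entropy in bits per word.

    Assumptions: whitespace tokenization, case-folding, and a
    plug-in (maximum-likelihood) frequency estimate.
    """
    words = text.lower().split()
    if not words:
        return 0.0
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Running the same function over an LLM-generated sample and over a human-written corpus of comparable size gives two numbers that can be compared directly; the plug-in estimate is biased downward on small samples, so sample sizes should be matched.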

Title: Multilingual Large Language Models do not comprehend all natural languages to equal degrees

Authors: Natalia Moskvina, Raquel Montero, Masaya Yoshida, Ferdy Hubers, Paolo Morosi, Walid Irhaymi, Jin Yan, Tamara Serrano, Elena Pagliarini, Fritz Günther, Evelina Leivada
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.20065
Pdf URL: https://arxiv.org/pdf/2602.20065
Copy Paste: [[2602.20065]] Multilingual Large Language Models do not comprehend all natural languages to equal degrees(https://arxiv.org/abs/2602.20065)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) play a critical role in how humans access information. While their core use relies on comprehending written requests, our understanding of this ability is currently limited, because most benchmarks evaluate LLMs in high-resource languages predominantly spoken by Western, Educated, Industrialised, Rich, and Democratic (WEIRD) communities. The default assumption is that English is the best-performing language for LLMs, while smaller, low-resource languages are linked to less reliable outputs, even in multilingual, state-of-the-art models. To track variation in the comprehension abilities of LLMs, we prompt 3 popular models on a language comprehension task across 12 languages, representing the Indo-European, Afro-Asiatic, Turkic, Sino-Tibetan, and Japonic language families. Our results suggest that the models exhibit remarkable linguistic accuracy across typologically diverse languages, yet they fall behind human baselines in all of them, albeit to different degrees. Contrary to what was expected, English is not the best-performing language, as it was systematically outperformed by several Romance languages, even lower-resource ones. We frame the results by discussing the role of several factors that drive LLM performance, such as tokenization, language distance from Spanish and English, size of training data, and data origin in high- vs. low-resource languages and WEIRD vs. non-WEIRD communities.
摘要：大型语言模型 (LLM) 在人类获取信息的方式中发挥着至关重要的作用。虽然其核心用途依赖于理解书面请求，但我们目前对这种能力的理解是有限的，因为大多数基准测试都是以主要由西方、受过教育、工业化、富裕和民主（WEIRD）社区使用的高资源语言来评估法学硕士。默认假设是英语是法学硕士表现最好的语言，而较小的、资源匮乏的语言与不太可靠的输出相关，即使在多语言、最先进的模型中也是如此。为了跟踪法学硕士理解能力的变化，我们在 12 种语言的语言理解任务中提出了 3 个流行模型，分别代表印欧语系、亚非语系、突厥语系、汉藏语系和日本语系。我们的结果表明，这些模型在不同类型的语言中表现出显着的语言准确性，但它们在所有这些语言中都落后于人类基线，尽管程度不同。与预期相反，英语并不是表现最好的语言，因为几种罗曼语系的语言，甚至是资源较低的语言，都系统地超越了英语。我们通过讨论推动 LLM 绩效的几个因素的作用来构建结果，例如标记化、与西班牙语和英语的语言距离、训练数据的大小、高资源语言与低资源语言以及 WEIRD 与非 WEIRD 社区的数据来源。

Title: How Retrieved Context Shapes Internal Representations in RAG

Authors: Samuel Yeh, Sharon Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.20091
Pdf URL: https://arxiv.org/pdf/2602.20091
Copy Paste: [[2602.20091]] How Retrieved Context Shapes Internal Representations in RAG(https://arxiv.org/abs/2602.20091)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) by conditioning generation on retrieved external documents, but the effect of retrieved context is often non-trivial. In realistic retrieval settings, the retrieved document set often contains a mixture of documents that vary in relevance and usefulness. While prior work has largely examined these phenomena through output behavior, little is known about how retrieved context shapes the internal representations that mediate information integration in RAG. In this work, we study RAG through the lens of latent representations. We systematically analyze how different types of retrieved documents affect the hidden states of LLMs, and how these internal representation shifts relate to downstream generation behavior. Across four question-answering datasets and three LLMs, we analyze internal representations under controlled single- and multi-document settings. Our results reveal how context relevancy and layer-wise processing influence internal representations, providing explanations of LLM output behaviors and insights for RAG system design.
摘要：检索增强生成 (RAG) 通过根据检索到的外部文档调节生成来增强大型语言模型 (LLM)，但检索到的上下文的影响通常并不小。在实际的检索设置中，检索到的文档集通常包含相关性和有用性各不相同的文档的混合体。虽然之前的工作主要通过输出行为来研究这些现象，但人们对检索到的上下文如何塑造在 RAG 中调解信息集成的内部表示知之甚少。在这项工作中，我们通过潜在表征的视角来研究 RAG。我们系统地分析了不同类型的检索文档如何影响 LLM 的隐藏状态，以及这些内部表示变化如何与下游生成行为相关。在四个问答数据集和三个法学硕士中，我们分析了受控单文档和多文档设置下的内部表示。我们的结果揭示了上下文相关性和分层处理如何影响内部表示，为法学硕士的输出行为提供了解释，并为 RAG 系统设计提供了见解。

Title: BabyLM Turns 4: Call for Papers for the 2026 BabyLM Workshop

Authors: Leshem Choshen, Ryan Cotterell, Mustafa Omer Gul, Jaap Jumelet, Tal Linzen, Aaron Mueller, Suchir Salhan, Raj Sanjay Shah, Alex Warstadt, Ethan Gotlieb Wilcox
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.20092
Pdf URL: https://arxiv.org/pdf/2602.20092
Copy Paste: [[2602.20092]] BabyLM Turns 4: Call for Papers for the 2026 BabyLM Workshop(https://arxiv.org/abs/2602.20092)
Keywords: language model
Abstract: BabyLM aims to dissolve the boundaries between cognitive modeling and language modeling. We call for both workshop papers and for researchers to join the 4th BabyLM competition. As in previous years, we call for participants in the data-efficient pretraining challenge in the general track. This year, we also offer a new track: Multilingual. We also call for papers outside the competition in any relevant areas. These include training efficiency, cognitively plausible research, weak model evaluation, and more.
摘要：BabyLM 旨在消除认知建模和语言建模之间的界限。我们呼吁研讨会论文和研究人员参加第四届 BabyLM 竞赛。与往年一样，我们呼吁参与者参加一般赛道的数据高效预训练挑战。今年，我们还提供了一个新曲目：多语言。我们还征集竞赛之外任何相关领域的论文。其中包括训练效率、认知上合理的研究、弱模型评估等等。

Title: NanoKnow: How to Know What Your Language Model Knows

Authors: Lingwei Gu, Nour Jedidi, Jimmy Lin
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2602.20122
Pdf URL: https://arxiv.org/pdf/2602.20122
Copy Paste: [[2602.20122]] NanoKnow: How to Know What Your Language Model Knows(https://arxiv.org/abs/2602.20122)
Keywords: language model, llm, chat
Abstract: How do large language models (LLMs) know what they know? Answering this question has been difficult because pre-training data is often a "black box" -- unknown or inaccessible. The recent release of nanochat -- a family of small LLMs with fully open pre-training data -- addresses this as it provides a transparent view into where a model's parametric knowledge comes from. Towards the goal of understanding how knowledge is encoded by LLMs, we release NanoKnow, a benchmark dataset that partitions questions from Natural Questions and SQuAD into splits based on whether their answers are present in nanochat's pre-training corpus. Using these splits, we can now properly disentangle the sources of knowledge that LLMs rely on when producing an output. To demonstrate NanoKnow's utility, we conduct experiments using eight nanochat checkpoints. Our findings show: (1) closed-book accuracy is strongly influenced by answer frequency in the pre-training data, (2) providing external evidence can mitigate this frequency dependence, (3) even with external evidence, models are more accurate when answers were seen during pre-training, demonstrating that parametric and external knowledge are complementary, and (4) non-relevant information is harmful, with accuracy decreasing based on both the position and the number of non-relevant contexts. We release all NanoKnow artifacts at this https URL.
摘要：大型语言模型 (LLM) 如何知道他们所知道的内容？回答这个问题很困难，因为预训练数据通常是一个“黑匣子”——未知或无法访问。最近发布的 nanochat（一系列具有完全开放的预训练数据的小型法学硕士）解决了这个问题，因为它提供了一个透明的视图来了解模型参数知识的来源。为了理解法学硕士如何编码知识，我们发布了 NanoKnow，这是一个基准数据集，它根据答案是否存在于 nanochat 的预训练语料库中，将 Natural Questions 和 SQuAD 中的问题划分为多个部分。使用这些分割，我们现在可以正确地理清法学硕士在产生输出时所依赖的知识来源。为了展示 NanoKnow 的实用性，我们使用八个 nanochat 检查点进行实验。我们的研究结果表明：（1）闭卷准确性受到预训练数据中答案频率的强烈影响，（2）提供外部证据可以减轻这种频率依赖性，（3）即使有外部证据，在预训练期间看到答案时模型也会更准确，这证明参数和外部知识是互补的，（4）不相关信息是有害的，准确性会根据不相关上下文的位置和数量而降低。我们在此 https URL 发布所有 NanoKnow 工件。
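
The partitioning idea behind NanoKnow (splitting questions by whether their answers occur in the pre-training corpus) can be sketched as below. This assumes a case-insensitive exact substring match over an in-memory list of documents; the abstract does not describe the actual matching procedure, and the function name is hypothetical.

```python
from typing import List, Sequence, Tuple

QAPair = Tuple[str, str]  # (question, answer)


def split_by_corpus_presence(
    qa_pairs: Sequence[QAPair], corpus_docs: Sequence[str]
) -> Tuple[List[QAPair], List[QAPair]]:
    """Return (seen, unseen) splits of the QA pairs.

    Assumption: an answer counts as "seen" if its lowercase form appears
    as a substring of any lowercase corpus document.
    """
    corpus_lower = [doc.lower() for doc in corpus_docs]
    seen: List[QAPair] = []
    unseen: List[QAPair] = []
    for question, answer in qa_pairs:
        needle = answer.lower()
        if any(needle in doc for doc in corpus_lower):
            seen.append((question, answer))
        else:
            unseen.append((question, answer))
    return seen, unseen
```

A linear scan like this only suits small corpora; at pre-training scale an inverted index or suffix structure would be needed.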

Title: To Reason or Not to: Selective Chain-of-Thought in Medical Question Answering

Authors: Zaifu Zhan, Min Zeng, Shuang Zhou, Yiran Song, Xiaoyi Chen, Yu Hou, Yifan Wu, Yang Ruan, Rui Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.20130
Pdf URL: https://arxiv.org/pdf/2602.20130
Copy Paste: [[2602.20130]] To Reason or Not to: Selective Chain-of-Thought in Medical Question Answering(https://arxiv.org/abs/2602.20130)
Keywords: language model, llm, chain-of-thought
Abstract: Objective: To improve the efficiency of medical question answering (MedQA) with large language models (LLMs) by avoiding unnecessary reasoning while maintaining accuracy. Methods: We propose Selective Chain-of-Thought (Selective CoT), an inference-time strategy that first predicts whether a question requires reasoning and generates a rationale only when needed. Two open-source LLMs (Llama-3.1-8B and Qwen-2.5-7B) were evaluated on four biomedical QA benchmarks: HeadQA, MedQA-USMLE, MedMCQA, and PubMedQA. Metrics included accuracy, total generated tokens, and inference time. Results: Selective CoT reduced inference time by 13-45% and token usage by 8-47% with minimal accuracy loss (≤4%). In some model-task pairs, it achieved both higher accuracy and greater efficiency than standard CoT. Compared with fixed-length CoT, Selective CoT reached similar or superior accuracy at substantially lower computational cost. Discussion: Selective CoT dynamically balances reasoning depth and efficiency by invoking explicit reasoning only when beneficial, reducing redundancy on recall-type questions while preserving interpretability. Conclusion: Selective CoT provides a simple, model-agnostic, and cost-effective approach for medical QA, aligning reasoning effort with question complexity to enhance real-world deployability of LLM-based clinical systems.
摘要：目标：通过在保持准确性的同时避免不必要的推理，提高使用大型语言模型 (LLM) 的医学问答 (MedQA) 的效率。方法：我们提出选择性思维链（Selective CoT），这是一种推理时间策略，首先预测问题是否需要推理，并仅在需要时生成基本原理。两个开源法学硕士（Llama-3.1-8B 和 Qwen-2.5-7B）在四个生物医学 QA 基准（HeadQA、MedQA-USMLE、MedMCQA 和 PubMedQA）上进行了评估。指标包括准确性、生成的令牌总数和推理时间。结果：选择性 CoT 将推理时间减少了 13-45%，将令牌使用量减少了 8-47%，同时精度损失最小 ($\leq$4\%)。在某些模型-任务对中，它比标准 CoT 实现了更高的准确性和更高的效率。与固定长度 CoT 相比，选择性 CoT 以低得多的计算成本达到相似或更高的精度。讨论：选择性 CoT 仅在有益时才调用显式推理，从而动态平衡推理深度和效率，减少回忆型问题的冗余，同时保留可解释性。结论：选择性 CoT 为医学 QA 提供了一种简单、与模型无关且经济高效的方法，将推理工作与问题复杂性结合起来，以增强基于 LLM 的临床系统的实际可部署性。
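
The control flow of Selective CoT described in the Methods sentence (first decide whether reasoning is needed, then generate a rationale only in that case) can be sketched as follows. The two callables and the prompt wording are hypothetical placeholders: `needs_reasoning` stands in for whatever predictor the authors use, and `generate` for any LLM call. Nothing here reflects the paper's actual prompts or models.

```python
from typing import Callable


def selective_cot(
    question: str,
    needs_reasoning: Callable[[str], bool],
    generate: Callable[[str], str],
) -> str:
    """Route a question to a reasoning prompt or a direct-answer prompt.

    Assumption: both prompt templates are illustrative only.
    """
    if needs_reasoning(question):
        # Reasoning path: ask for a rationale before the answer.
        prompt = f"{question}\nThink step by step, then give the final answer."
    else:
        # Direct path: skip the rationale to save tokens and latency.
        prompt = f"{question}\nGive only the final answer."
    return generate(prompt)
```

Because the predictor and generator are injected, the same routing can wrap any model, which is in the spirit of the abstract's "model-agnostic" claim.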

Title: KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration

Authors: Mohammad Amanlou, Erfan Shafiee Moghaddam, Yasaman Amou Jafari, Mahdi Noori, Farhan Farsi, Behnam Bahrak
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2602.20135
Pdf URL: https://arxiv.org/pdf/2602.20135
Copy Paste: [[2602.20135]] KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration(https://arxiv.org/abs/2602.20135)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: With the rise of large language models (LLMs), they have become instrumental in applications such as Retrieval-Augmented Generation (RAG). Yet evaluating these systems remains bottlenecked by the time and cost of building specialized assessment datasets. We introduce KNIGHT, an LLM-based, knowledge-graph-driven framework for generating multiple-choice question (MCQ) datasets from external sources. KNIGHT constructs a topic-specific knowledge graph, a structured and parsimonious summary of entities and relations, that can be reused to generate instructor-controlled difficulty levels, including multi-hop questions, without repeatedly re-feeding the full source text. This knowledge graph acts as a compressed, reusable state, making question generation a cheap read over the graph. We instantiate KNIGHT on Wikipedia/Wikidata while keeping the framework domain- and ontology-agnostic. As a case study, KNIGHT produces six MCQ datasets in History, Biology, and Mathematics. We evaluate quality on five criteria: fluency, unambiguity (single correct answer), topic relevance, option uniqueness, and answerability given the provided sources (as a proxy for hallucination). Results show that KNIGHT enables token- and cost-efficient generation from a reusable graph representation, achieves high quality across these criteria, and yields model rankings aligned with MMLU-style benchmarks, while supporting topic-specific and difficulty-controlled evaluation.
摘要：随着大型语言模型 (LLM) 的兴起，它们在检索增强生成 (RAG) 等应用中发挥了重要作用。然而，评估这些系统仍然受到构建专业评估数据集的时间和成本的瓶颈。我们引入了 KNIGHT，这是一个基于 LLM 的知识图驱动框架，用于从外部源生成多项选择题 (MCQ) 数据集。 KNIGHT 构建了一个特定于主题的知识图，这是对实体和关系的结构化且简约的摘要，可以重复使用它来生成教师控制的难度级别，包括多跳问题，而无需重复重新输入完整的源文本。该知识图作为压缩的、可重用的状态，使问题生成成为对图的廉价读取。我们在 Wikipedia/Wikidata 上实例化 KNIGHT，同时保持框架与领域和本体无关。作为案例研究，KNIGHT 生成了历史、生物学和数学领域的 6 个 MCQ 数据集。我们根据五个标准评估质量：流畅性、明确性（单一正确答案）、主题相关性、选项唯一性和给定所提供来源的可回答性（作为幻觉的代表）。结果表明，KNIGHT 能够从可重用的图形表示中实现令牌和成本高效的生成，在这些标准上实现高质量，并产生与 MMLU 风格基准一致的模型排名，同时支持特定于主题和难度控制的评估。