2026-01-14

Title: EmbeddingRWKV: State-Centric Retrieval with Reusable States

Authors: Haowen Hou, Jie Yang
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2601.07861
Pdf URL: https://arxiv.org/pdf/2601.07861
Copy Paste: [[2601.07861]] EmbeddingRWKV: State-Centric Retrieval with Reusable States(https://arxiv.org/abs/2601.07861)
Keywords: llm, retrieval-augmented generation
Abstract: Current Retrieval-Augmented Generation (RAG) systems typically employ a traditional two-stage pipeline: an embedding model for initial retrieval followed by a reranker for refinement. However, this paradigm suffers from significant inefficiency due to the lack of shared information between stages, leading to substantial redundant computation. To address this limitation, we propose State-Centric Retrieval, a unified retrieval paradigm that utilizes "states" as a bridge to connect embedding models and rerankers. First, we perform state representation learning by fine-tuning an RWKV-based LLM, transforming it into EmbeddingRWKV, a unified model that serves as both an embedding model and a state backbone for extracting compact, reusable states. Building upon these reusable states, we further design a state-based reranker to fully leverage precomputed information. During reranking, the model processes only query tokens, decoupling inference cost from document length and yielding a 5.4× to 44.8× speedup. Furthermore, we observe that retaining all intermediate layer states is unnecessary; with a uniform layer selection strategy, our model maintains 98.62% of full-model performance using only 25% of the layers. Extensive experiments demonstrate that State-Centric Retrieval achieves high-quality retrieval and reranking results while significantly enhancing overall system efficiency. Code is available at our GitHub repository.
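
The uniform layer selection mentioned in the abstract can be illustrated with a minimal sketch. It assumes that "uniform" means evenly spaced layer indices; the function name `uniform_layer_indices` and the 24-layer example are hypothetical and are not taken from the paper.

```python
def uniform_layer_indices(num_layers: int, keep_ratio: float) -> list[int]:
    """Return evenly spaced layer indices covering about keep_ratio of the layers.

    Assumption: "uniform" is read here as a fixed stride across model depth.
    """
    if not 0 < keep_ratio <= 1:
        raise ValueError("keep_ratio must be in (0, 1]")
    keep = max(1, round(num_layers * keep_ratio))
    # Integer arithmetic spreads the kept indices evenly over [0, num_layers).
    return [(i * num_layers) // keep for i in range(keep)]


# Hypothetical 24-layer model, keeping states from 25% of the layers.
selected = uniform_layer_indices(num_layers=24, keep_ratio=0.25)
print(selected)
```

Only the states of the selected layers would then need to be stored per document, which is where the storage saving would come from under this reading.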

Title: A Human-Centric Pipeline for Aligning Large Language Models with Chinese Medical Ethics

Authors: Haoan Jin, Han Ying, Jiacheng Ji, Hanhui Xu, Mengyue Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.07954
Pdf URL: https://arxiv.org/pdf/2601.07954
Copy Paste: [[2601.07954]] A Human-Centric Pipeline for Aligning Large Language Models with Chinese Medical Ethics(https://arxiv.org/abs/2601.07954)
Keywords: language model, llm, prompt
Abstract: Recent advances in large language models have enabled their application to a range of healthcare tasks. However, aligning LLMs with the nuanced demands of medical ethics, especially under complex real world scenarios, remains underexplored. In this work, we present MedES, a dynamic, scenario-centric benchmark specifically constructed from 260 authoritative Chinese medical, ethical, and legal sources to reflect the challenges in clinical decision-making. To facilitate model alignment, we introduce a guardian-in-the-loop framework that leverages a dedicated automated evaluator (trained on expert-labeled data and achieving over 97% accuracy within our domain) to generate targeted prompts and provide structured ethical feedback. Using this pipeline, we align a 7B-parameter LLM through supervised fine-tuning and domain-specific preference optimization. Experimental results, conducted entirely within the Chinese medical ethics context, demonstrate that our aligned model outperforms notably larger baselines on core ethical tasks, with observed improvements in both quality and composite evaluation metrics. Our work offers a practical and adaptable framework for aligning LLMs with medical ethics in the Chinese healthcare domain, and suggests that similar alignment pipelines may be instantiated in other legal and cultural environments through modular replacement of the underlying normative corpus.

Title: Knowing But Not Doing: Convergent Morality and Divergent Action in LLMs

Authors: Jen-tse Huang, Jiantong Qin, Xueli Qiu, Sharon Levy, Michelle R. Kaufman, Mark Dredze
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.07972
Pdf URL: https://arxiv.org/pdf/2601.07972
Copy Paste: [[2601.07972]] Knowing But Not Doing: Convergent Morality and Divergent Action in LLMs(https://arxiv.org/abs/2601.07972)
Keywords: language model, llm
Abstract: Value alignment is central to the development of safe and socially compatible artificial intelligence. However, how Large Language Models (LLMs) represent and enact human values in real-world decision contexts remains under-explored. We present ValAct-15k, a dataset of 3,000 advice-seeking scenarios derived from Reddit, designed to elicit ten values defined by Schwartz's Theory of Basic Human Values. Using both the scenario-based questions and the traditional value questionnaire, we evaluate ten frontier LLMs (five from U.S. companies, five from Chinese ones) and human participants ($n = 55$). We find near-perfect cross-model consistency in scenario-based decisions (Pearson $r \approx 1.0$), contrasting sharply with the broad variability observed among humans ($r \in [-0.79, 0.98]$). Yet, both humans and LLMs show weak correspondence between self-reported and enacted values ($r = 0.4, 0.3$), revealing a systematic knowledge-action gap. When instructed to "hold" a specific value, LLMs' performance declines by up to 6.6% compared to merely selecting the value, indicating a role-play aversion. These findings suggest that while alignment training yields normative value convergence, it does not eliminate the human-like incoherence between knowing and acting upon values.
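
The correspondence figures in this abstract are Pearson correlations. As a minimal sketch of how such a coefficient is computed between self-reported and enacted value scores, the snippet below uses made-up numbers for the ten Schwartz values; the data and variable names are hypothetical and do not come from the paper.

```python
from math import sqrt


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for ten values (illustration only, not paper data).
self_reported = [4.5, 3.0, 2.5, 4.0, 3.5, 2.0, 4.8, 3.2, 2.8, 3.9]
enacted = [3.1, 3.4, 2.9, 2.7, 3.8, 3.0, 3.5, 2.6, 3.3, 3.0]
r = pearson_r(self_reported, enacted)
print(f"self-report vs. enacted: r = {r:.2f}")
```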

Title: Explaining Generalization of AI-Generated Text Detectors Through Linguistic Analysis

Authors: Yuxi Xia, Kinga Stańczak, Benjamin Roth
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.07974
Pdf URL: https://arxiv.org/pdf/2601.07974
Copy Paste: [[2601.07974]] Explaining Generalization of AI-Generated Text Detectors Through Linguistic Analysis(https://arxiv.org/abs/2601.07974)
Keywords: language model, llm, prompt
Abstract: AI-text detectors achieve high accuracy on in-domain benchmarks, but often struggle to generalize across different generation conditions such as unseen prompts, model families, or domains. While prior work has reported these generalization gaps, there are limited insights about the underlying causes. In this work, we present a systematic study aimed at explaining generalization behavior through linguistic analysis. We construct a comprehensive benchmark that spans 6 prompting strategies, 7 large language models (LLMs), and 4 domain datasets, resulting in a diverse set of human- and AI-generated texts. Using this dataset, we fine-tune classification-based detectors on various generation settings and evaluate their cross-prompt, cross-model, and cross-dataset generalization. To explain the performance variance, we compute correlations between generalization accuracies and feature shifts of 80 linguistic features between training and test conditions. Our analysis reveals that generalization performance for specific detectors and evaluation conditions is significantly associated with linguistic features such as tense usage and pronoun frequency.

Title: Cross-Cultural Expert-Level Art Critique Evaluation with Vision-Language Models

Authors: Haorui Yu, Ramon Ruiz-Dolz, Xuehang Wen, Fengrui Zhang, Qiufeng Yi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.07984
Pdf URL: https://arxiv.org/pdf/2601.07984
Copy Paste: [[2601.07984]] Cross-Cultural Expert-Level Art Critique Evaluation with Vision-Language Models(https://arxiv.org/abs/2601.07984)
Keywords: language model
Abstract: Vision-Language Models (VLMs) excel at visual perception, yet their ability to interpret cultural meaning in art remains under-validated. We present a tri-tier evaluation framework for cross-cultural art-critique assessment: Tier I computes automated coverage and risk indicators offline; Tier II applies rubric-based scoring using a single primary judge across five dimensions; and Tier III calibrates the Tier II aggregate score to human ratings via isotonic regression, yielding a 5.2% reduction in MAE on a 152-sample held-out set. The framework outputs a calibrated cultural-understanding score for model selection and cultural-gap diagnosis, together with dimension-level diagnostics and risk indicators. We evaluate 15 VLMs on 294 expert anchors spanning six cultural traditions. Key findings are that (i) automated metrics are unreliable proxies for cultural depth, (ii) Western samples score higher than non-Western samples under our sampling and rubric, and (iii) cross-judge scale mismatch makes naive score averaging unreliable, motivating a single primary judge with explicit calibration. Dataset and code are available in the supplementary materials.
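
Tier III calibrates judge scores to human ratings with isotonic regression. The sketch below shows the general technique using the pool-adjacent-violators algorithm on four hypothetical (judge score, human rating) pairs; it is not the authors' implementation, and the data are invented for illustration.

```python
def isotonic_fit(values: list[float]) -> list[float]:
    """Least-squares non-decreasing fit via pool-adjacent-violators.

    `values` must already be ordered by the predictor (here: judge score).
    """
    blocks: list[list[float]] = []  # each block holds [sum, count]
    for v in values:
        blocks.append([v, 1])
        # Merge blocks while the previous block mean exceeds the last one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted: list[float] = []
    for total, count in blocks:
        fitted.extend([total / count] * int(count))
    return fitted


def mean_absolute_error(pred: list[float], target: list[float]) -> float:
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


# Hypothetical (judge score, human rating) pairs, sorted by judge score.
pairs = sorted([(1.0, 1.0), (2.0, 3.0), (3.0, 2.0), (4.0, 4.0)])
judge = [j for j, _ in pairs]
human = [h for _, h in pairs]

calibrated = isotonic_fit(human)  # calibrated score for each judge score
mae_raw = mean_absolute_error(judge, human)
mae_calibrated = mean_absolute_error(calibrated, human)
print(calibrated, mae_raw, mae_calibrated)
```

The errors here are computed on the same points used for fitting; the abstract reports its MAE reduction on a held-out set instead.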

Title: Multilingual, Multimodal Pipeline for Creating Authentic and Structured Fact-Checked Claim Dataset

Authors: Z. Melce Hüsünbeyi, Virginie Mouilleron, Leonie Uhling, Daniel Foppe, Tatjana Scheffler, Djamé Seddah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.07985
Pdf URL: https://arxiv.org/pdf/2601.07985
Copy Paste: [[2601.07985]] Multilingual, Multimodal Pipeline for Creating Authentic and Structured Fact-Checked Claim Dataset(https://arxiv.org/abs/2601.07985)
Keywords: language model, llm
Abstract: The rapid proliferation of misinformation across online platforms underscores the urgent need for robust, up-to-date, explainable, and multilingual fact-checking resources. However, existing datasets are limited in scope, often lacking multimodal evidence, structured annotations, and detailed links between claims, evidence, and verdicts. This paper introduces a comprehensive data collection and processing pipeline that constructs multimodal fact-checking datasets in French and German languages by aggregating ClaimReview feeds, scraping full debunking articles, normalizing heterogeneous claim verdicts, and enriching them with structured metadata and aligned visual content. We used state-of-the-art large language models (LLMs) and multimodal LLMs for (i) evidence extraction under predefined evidence categories and (ii) justification generation that links evidence to verdicts. Evaluation with G-Eval and human assessment demonstrates that our pipeline enables fine-grained comparison of fact-checking practices across different organizations or media markets, facilitates the development of more interpretable and evidence-grounded fact-checking models, and lays the groundwork for future research on multilingual, multimodal misinformation verification.

Title: VULCA-Bench: A Multicultural Vision-Language Benchmark for Evaluating Cultural Understanding

Authors: Haorui Yu, Ramon Ruiz-Dolz, Diji Yang, Hang He, Fengrui Zhang, Qiufeng Yi
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2601.07986
Pdf URL: https://arxiv.org/pdf/2601.07986
Copy Paste: [[2601.07986]] VULCA-Bench: A Multicultural Vision-Language Benchmark for Evaluating Cultural Understanding(https://arxiv.org/abs/2601.07986)
Keywords: language model
Abstract: We introduce VULCA-Bench, a multicultural art-critique benchmark for evaluating Vision-Language Models' (VLMs) cultural understanding beyond surface-level visual perception. Existing VLM benchmarks predominantly measure L1-L2 capabilities (object recognition, scene description, and factual question answering) while under-evaluating higher-order cultural interpretation. VULCA-Bench contains 7,410 matched image-critique pairs spanning eight cultural traditions, with Chinese-English bilingual coverage. We operationalise cultural understanding using a five-layer framework (L1-L5, from Visual Perception to Philosophical Aesthetics), instantiated as 225 culture-specific dimensions and supported by expert-written bilingual critiques. Our pilot results indicate that higher-layer reasoning (L3-L5) is consistently more challenging than visual and technical analysis (L1-L2). The dataset, evaluation scripts, and annotation tools are available under CC BY 4.0 in the supplementary materials.

Title: DYCP: Dynamic Context Pruning for Long-Form Dialogue with LLMs

Authors: Nayoung Choi, Jonathan Zhang, Jinho D. Choi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.07994
Pdf URL: https://arxiv.org/pdf/2601.07994
Copy Paste: [[2601.07994]] DYCP: Dynamic Context Pruning for Long-Form Dialogue with LLMs(https://arxiv.org/abs/2601.07994)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) often exhibit increased response latency and degraded answer quality as dialogue length grows, making effective context management essential. However, existing methods rely on extra LLM calls to build memory or perform offline memory construction without considering the current user utterance, which can introduce inefficiencies or disrupt conversational continuity. We introduce DyCP, a lightweight context management method that dynamically segments and retrieves relevant memory at query time. It preserves the sequential structure of dialogue without predefined topic boundaries and supports efficient, adaptive context retrieval. Across three long-form dialogue benchmarks, LoCoMo, MT-Bench+, and SCM4LLMs, and multiple LLMs, DyCP consistently improves answer quality while reducing response latency. We also examine the gap between modern LLMs' expanded context windows and their actual long-context processing capacity, highlighting the continued importance of effective context management.

Title: LLM Review: Enhancing Creative Writing via Blind Peer Review Feedback

Authors: Weiyue Li, Mingxiao Song, Zhenda Shen, Dachuan Zhao, Yunfan Long, Yi Li, Yongce Li, Ruyi Yang, Mengyu Wang
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2601.08003
Pdf URL: https://arxiv.org/pdf/2601.08003
Copy Paste: [[2601.08003]] LLM Review: Enhancing Creative Writing via Blind Peer Review Feedback(https://arxiv.org/abs/2601.08003)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) often struggle with creative generation, and multi-agent frameworks that improve reasoning through interaction can paradoxically hinder creativity by inducing content homogenization. We introduce LLM Review, a peer-review-inspired framework implementing Blind Peer Review: agents exchange targeted feedback while revising independently, preserving divergent creative trajectories. To enable rigorous evaluation, we propose SciFi-100, a science fiction writing dataset with a unified framework combining LLM-as-a-judge scoring, human annotation, and rule-based novelty metrics. Experiments demonstrate that LLM Review consistently outperforms multi-agent baselines, and smaller models with our framework can surpass larger single-agent models, suggesting interaction structure may substitute for model scale.

Title: Reasoning Beyond Chain-of-Thought: A Latent Computational Mode in Large Language Models

Authors: Zhenghao He, Guangzhi Xiong, Bohan Liu, Sanchit Sinha, Aidong Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.08058
Pdf URL: https://arxiv.org/pdf/2601.08058
Copy Paste: [[2601.08058]] Reasoning Beyond Chain-of-Thought: A Latent Computational Mode in Large Language Models(https://arxiv.org/abs/2601.08058)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Chain-of-Thought (CoT) prompting has improved the reasoning performance of large language models (LLMs), but it remains unclear why it works and whether it is the unique mechanism for triggering reasoning in large language models. In this work, we study this question by directly analyzing and intervening on the internal representations of LLMs with Sparse Autoencoders (SAEs), identifying a small set of latent features that are causally associated with LLM reasoning behavior. Across multiple model families and reasoning benchmarks, we find that steering a single reasoning-related latent feature can substantially improve accuracy without explicit CoT prompting. For large models, latent steering achieves performance comparable to standard CoT prompting while producing more efficient outputs. We further observe that this reasoning-oriented internal state is triggered early in generation and can override prompt-level instructions that discourage explicit reasoning. Overall, our results suggest that multi-step reasoning in LLMs is supported by latent internal activations that can be externally activated, while CoT prompting is one effective, but not unique, way of activating this mechanism rather than its necessary cause.

Title: Universal computation is intrinsic to language model decoding

Authors: Alex Lewandowski, Marlos C. Machado, Dale Schuurmans
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.08061
Pdf URL: https://arxiv.org/pdf/2601.08061
Copy Paste: [[2601.08061]] Universal computation is intrinsic to language model decoding(https://arxiv.org/abs/2601.08061)
Keywords: language model, prompt
Abstract: Language models now provide an interface to express and often solve general problems in natural language, yet their ultimate computational capabilities remain a major topic of scientific debate. Unlike a formal computer, a language model is trained to autoregressively predict successive elements in human-generated text. We prove that chaining a language model's autoregressive output is sufficient to perform universal computation. That is, a language model can simulate the execution of any algorithm on any input. The challenge of eliciting desired computational behaviour can thus be reframed in terms of programmability: the ease of finding a suitable prompt. Strikingly, we demonstrate that even randomly initialized language models are capable of universal computation before training. This implies that training does not give rise to computational expressiveness -- rather, it improves programmability, enabling a natural language interface for accessing these intrinsic capabilities.

Title: Calibration Is Not Enough: Evaluating Confidence Estimation Under Language Variations

Authors: Yuxi Xia, Dennis Ulmer, Terra Blevins, Yihong Liu, Hinrich Schütze, Benjamin Roth
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.08064
Pdf URL: https://arxiv.org/pdf/2601.08064
Copy Paste: [[2601.08064]] Calibration Is Not Enough: Evaluating Confidence Estimation Under Language Variations(https://arxiv.org/abs/2601.08064)
Keywords: language model, llm, prompt
Abstract: Confidence estimation (CE) indicates how reliable the answers of large language models (LLMs) are, and can impact user trust and decision-making. Existing work evaluates CE methods almost exclusively through calibration, examining whether stated confidence aligns with accuracy, or discrimination, whether confidence is ranked higher for correct predictions than incorrect ones. However, these facets ignore pitfalls of CE in the context of LLMs and language variation: confidence estimates should remain consistent under semantically equivalent prompt or answer variations, and should change when the answer meaning differs. Therefore, we present a comprehensive evaluation framework for CE that measures their confidence quality on three new aspects: robustness of confidence against prompt perturbations, stability across semantic equivalent answers, and sensitivity to semantically different answers. In our work, we demonstrate that common CE methods for LLMs often fail on these metrics: methods that achieve good performance on calibration or discrimination are not robust to prompt variations or are not sensitive to answer changes. Overall, our framework reveals limitations of existing CE evaluations relevant for real-world LLM use cases and provides practical guidance for selecting and designing more reliable CE methods.
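
The abstract describes discrimination as whether confidence is ranked higher for correct predictions than for incorrect ones. One common way to quantify that ranking is the area under the ROC curve, computed below by direct pairwise comparison. Treating AUROC as the discrimination measure is an assumption here, and the function name and example values are hypothetical.

```python
def discrimination_auroc(confidences: list[float], correct: list[bool]) -> float:
    """Fraction of (correct, incorrect) answer pairs in which the correct
    answer received the higher confidence; ties count as half."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical confidence estimates and correctness labels.
conf = [0.9, 0.8, 0.6, 0.3]
is_correct = [True, False, True, False]
score = discrimination_auroc(conf, is_correct)
print(score)
```

A score like this says nothing about whether confidence stays stable under paraphrased prompts, which is the gap the abstract's additional aspects are meant to cover.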

Title: AdaJudge: Adaptive Multi-Perspective Judging for Reward Modeling

Authors: Yongliang Miao, Yangyang Liang, Mengnan Du
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.08097
Pdf URL: https://arxiv.org/pdf/2601.08097
Copy Paste: [[2601.08097]] AdaJudge: Adaptive Multi-Perspective Judging for Reward Modeling(https://arxiv.org/abs/2601.08097)
Keywords: language model
Abstract: Reward modeling is essential for aligning large language models with human preferences, yet predominant architectures rely on a static pooling strategy to condense sequences into scalar scores. This paradigm, however, suffers from two key limitations: a static inductive bias that misaligns with task-dependent preference signals, and a representational mismatch, as the backbone is optimized for generation rather than fine-grained discrimination. To address this, we propose AdaJudge, a unified framework that jointly adapts representation and aggregation. AdaJudge first refines backbone representations into a discrimination-oriented space via gated refinement blocks. It then replaces the static readout with an adaptive multi-view pooling module that dynamically routes and combines evidence. Extensive experiments on RM-Bench and JudgeBench show that AdaJudge outperforms strong off-the-shelf reward models and traditional pooling baselines.

Title: Query Suggestion for Retrieval-Augmented Generation via Dynamic In-Context Learning

Authors: Fabian Spaeh, Tianyi Chen, Chen-Hao Chiang, Bin Shen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.08105
Pdf URL: https://arxiv.org/pdf/2601.08105
Copy Paste: [[2601.08105]] Query Suggestion for Retrieval-Augmented Generation via Dynamic In-Context Learning(https://arxiv.org/abs/2601.08105)
Keywords: llm, hallucination, retrieval-augmented generation, agent
Abstract: Retrieval-augmented generation with tool-calling agents (agentic RAG) has become increasingly powerful in understanding, processing, and responding to user queries. However, the scope of the grounding knowledge is limited and asking questions that exceed this scope may lead to issues like hallucination. While guardrail frameworks aim to block out-of-scope questions (Rodriguez et al., 2024), no research has investigated the question of suggesting answerable queries in order to complete the user interaction. In this paper, we initiate the study of query suggestion for agentic RAG. We consider the setting where user questions are not answerable, and the suggested queries should be similar to the original question to aid the user interaction. Such scenarios are frequent for tool-calling LLMs as communicating the restrictions of the tools or the underlying datasets to the user is difficult, and adding query suggestions enhances the interaction with the RAG agent. As opposed to traditional settings for query recommendations such as in search engines, ensuring that the suggested queries are answerable is a major challenge due to the RAG's multi-step workflow that demands a nuanced understanding of the RAG as a whole, which the executing LLM lacks. As such, we introduce robust dynamic few-shot learning which retrieves examples from relevant workflows. We show that our system can be self-learned, for instance on prior user queries, and is therefore easily applicable in practice. We evaluate our approach on three benchmark datasets based on two unlabeled question datasets collected from real-world user queries. Experiments on real-world datasets confirm that our method produces more relevant and answerable suggestions, outperforming few-shot and retrieval-only baselines, and thus enables safer, more effective user interaction with agentic RAG.

Title: Debiasing Large Language Models via Adaptive Causal Prompting with Sketch-of-Thought

Authors: Bowen Li, Ziqi Xu, Jing Ren, Renqiang Luo, Xikun Zhang, Xiuzhen Zhang, Yongli Ren, Feng Xia
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.08108
Pdf URL: https://arxiv.org/pdf/2601.08108
Copy Paste: [[2601.08108]] Debiasing Large Language Models via Adaptive Causal Prompting with Sketch-of-Thought(https://arxiv.org/abs/2601.08108)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Despite notable advancements in prompting methods for Large Language Models (LLMs), such as Chain-of-Thought (CoT), existing strategies still suffer from excessive token usage and limited generalisability across diverse reasoning tasks. To address these limitations, we propose an Adaptive Causal Prompting with Sketch-of-Thought (ACPS) framework, which leverages structural causal models to infer the causal effect of a query on its answer and adaptively select an appropriate intervention (i.e., standard front-door and conditional front-door adjustments). This design enables generalisable causal reasoning across heterogeneous tasks without task-specific retraining. By replacing verbose CoT with concise Sketch-of-Thought, ACPS enables efficient reasoning that significantly reduces token usage and inference cost. Extensive experiments on multiple reasoning benchmarks and LLMs demonstrate that ACPS consistently outperforms existing prompting baselines in terms of accuracy, robustness, and computational efficiency.

Title: Qalb: Largest State-of-the-Art Urdu Large Language Model for 230M Speakers with Systematic Continued Pre-training

Authors: Muhammad Taimoor Hassan, Jawad Ahmed, Muhammad Awais
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.08141
Pdf URL: https://arxiv.org/pdf/2601.08141
Copy Paste: [[2601.08141]] Qalb: Largest State-of-the-Art Urdu Large Language Model for 230M Speakers with Systematic Continued Pre-training(https://arxiv.org/abs/2601.08141)
Keywords: language model
Abstract: Despite remarkable progress in large language models, Urdu, a language spoken by over 230 million people, remains critically underrepresented in modern NLP systems. Existing multilingual models demonstrate poor performance on Urdu-specific tasks, struggling with the language's complex morphology, right-to-left Nastaliq script, and rich literary traditions. Even the base LLaMA-3.1 8B-Instruct model shows limited capability in generating fluent, contextually appropriate Urdu text. We introduce Qalb, an Urdu language model developed through a two-stage approach: continued pre-training followed by supervised fine-tuning. Starting from LLaMA 3.1 8B, we perform continued pre-training on a dataset of 1.97 billion tokens. This corpus comprises 1.84 billion tokens of diverse Urdu text (spanning news archives, classical and contemporary literature, government documents, and social media) combined with 140 million tokens of English Wikipedia data to prevent catastrophic forgetting. We then fine-tune the resulting model on the Alif Urdu-instruct dataset. Through extensive evaluation on Urdu-specific benchmarks, Qalb demonstrates substantial improvements, achieving a weighted average score of 90.34 and outperforming the previous state-of-the-art Alif-1.0-Instruct model (87.1) by 3.24 points, while also surpassing the base LLaMA-3.1 8B-Instruct model by 44.64 points. Qalb achieves state-of-the-art performance with comprehensive evaluation across seven diverse tasks including Classification, Sentiment Analysis, and Reasoning. Our results demonstrate that continued pre-training on diverse, high-quality language data, combined with targeted instruction fine-tuning, effectively adapts foundation models to low-resource languages.
摘要：尽管大型语言模型取得了显着进展，但乌尔都语（一种超过 2.3 亿人使用的语言）在现代 NLP 系统中的代表性仍然严重不足。现有的多语言模型在乌尔都语特定任务上表现不佳，难以应对该语言复杂的形态、从右到左的纳斯塔利克文字和丰富的文学传统。即使是基本的 LLaMA-3.1 8B-Instruct 模型在生成流畅、适合上下文的乌尔都语文本方面也显示出有限的能力。我们介绍 Qalb，这是一种通过两阶段方法开发的乌尔都语语言模型：持续预训练，然后是监督微调。从 LLaMA 3.1 8B 开始，我们对 19.7 亿个 token 的数据集进行持续的预训练。该语料库包含 18.4 亿个不同乌尔都语文本的标记，涵盖新闻档案、古典和当代文学、政府文件和社交媒体，以及 1.4 亿个英语维基百科数据标记，以防止灾难性遗忘。然后，我们在 Alif Urdu-instruct 数据集上微调生成的模型。通过对乌尔都语特定基准的广泛评估，Qalb 表现出了显着的改进，加权平均得分达到 90.34 分，比之前最先进的 Alif-1.0-Instruct 模型 (87.1) 高出 3.24 分，同时也比基础 LLaMA-3.1 8B-Instruct 模型高出 44.64 分。 Qalb 通过对分类、情感分析和推理等七种不同任务的综合评估，实现了最先进的性能。我们的结果表明，对多样化的高质量语言数据进行持续的预训练，结合有针对性的指令微调，可以有效地将基础模型适应低资源语言。

Title: Mechanisms are Transferable: Data-Efficient Low-Resource Adaptation via Circuit-Targeted Supervised Fine-Tuning

Authors: Khumaisa Nur'aini, Ayu Purwarianti, Alham Fikri Aji, Derry Wijaya
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.08146
Pdf URL: https://arxiv.org/pdf/2601.08146
Copy Paste: [[2601.08146]] Mechanisms are Transferable: Data-Efficient Low-Resource Adaptation via Circuit-Targeted Supervised Fine-Tuning(https://arxiv.org/abs/2601.08146)
Keywords: llm
Abstract: Adapting LLMs to low-resource languages is difficult: labeled data is scarce, full-model fine-tuning is unstable, and continued cross-lingual tuning can cause catastrophic forgetting. We propose Circuit-Targeted Supervised Fine-Tuning (CT-SFT): a counterfactual-free adaptation of CD-T (Contextual Decomposition Transformer) that uses a label-balanced mean baseline and task-directional relevance scoring to identify a sparse set of task-relevant attention heads in a proxy-language checkpoint, then transfers to a target language by updating only those heads (plus LayerNorm) via head-level gradient masking. Across NusaX-Senti and XNLI, CT-SFT improves cross-lingual accuracy over continued full fine-tuning while updating only a small subset of model parameters. We find an editing-preserving trade-off: harder transfers favor editing circuit heads, while easier transfers often favor near-zero (i.e., low-relevance heads) updates, preserving the source mechanism. CT-SFT also substantially reduces catastrophic forgetting, preserving proxy/source-language competence during transfer.
摘要：让法学硕士适应资源匮乏的语言是很困难的：标记数据稀缺，全模型微调不稳定，持续的跨语言调优可能会导致灾难性的遗忘。我们提出了电路目标监督微调（CT-SFT）：CD-T（上下文分解变换器）的无反事实改编，它使用标签平衡平均基线和任务方向相关性评分来识别代理语言检查点中一组稀疏的任务相关注意头，然后通过头级梯度掩码仅更新这些头（加上 LayerNorm），将迁移学习到目标语言。在 NusaX-Senti 和 XNLI 中，CT-SFT 通过持续的全面微调提高了跨语言准确性，同时仅更新模型参数的一小部分。我们发现了编辑与保留的权衡：较难的传输有利于编辑电路头，而较容易的传输通常有利于接近零（即低相关性头）更新，从而保留源机制。 CT-SFT 还大大减少了灾难性遗忘，在传输过程中保留了代理/源语言的能力。
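
The head-level gradient masking described in the abstract above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-head weight layout, the function name `masked_sgd_step`, the plain SGD update, and the example numbers are all hypothetical, and the LayerNorm update mentioned in the abstract is omitted.

```python
def masked_sgd_step(head_weights, head_grads, selected_heads, lr=0.1):
    """Apply one SGD step only to the heads listed in selected_heads.

    head_weights / head_grads: one list of floats per attention head
    (a hypothetical flat layout chosen for illustration).
    selected_heads: set of head indices that are allowed to change.
    """
    updated = []
    for idx, (weights, grads) in enumerate(zip(head_weights, head_grads)):
        if idx in selected_heads:
            # Selected head: ordinary gradient step.
            updated.append([w - lr * g for w, g in zip(weights, grads)])
        else:
            # Masked head: gradient is ignored, so the head stays frozen.
            updated.append(list(weights))
    return updated


weights = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
grads = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
new_weights = masked_sgd_step(weights, grads, selected_heads={1}, lr=0.5)
print(new_weights)
```

Only head 1 moves in this toy run; heads 0 and 2 keep their original values, which is the property that lets the unselected parts of the model stay unchanged.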

Title: WISE-Flow: Workflow-Induced Structured Experience for Self-Evolving Conversational Service Agents

Authors: Yuqing Zhou, Zhuoer Wang, Jie Yuan, Hong Wang, Samson Koelle, Ziwei Zhu, Wei Niu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.08158
Pdf URL: https://arxiv.org/pdf/2601.08158
Copy Paste: [[2601.08158]] WISE-Flow: Workflow-Induced Structured Experience for Self-Evolving Conversational Service Agents(https://arxiv.org/abs/2601.08158)
Keywords: language model, llm, agent
Abstract: Large language model (LLM)-based agents are widely deployed in user-facing services but remain error-prone in new tasks, tend to repeat the same failure patterns, and show substantial run-to-run variability. Fixing failures via environment-specific training or manual patching is costly and hard to scale. To enable self-evolving agents in user-facing service environments, we propose WISE-Flow, a workflow-centric framework that converts historical service interactions into reusable procedural experience by inducing workflows with prerequisite-augmented action blocks. At deployment, WISE-Flow aligns the agent's execution trajectory to retrieved workflows and performs prerequisite-aware feasibility reasoning to achieve state-grounded next actions. Experiments on ToolSandbox and $\tau^2$-bench show consistent improvement across base models.
摘要：基于大型语言模型 (LLM) 的代理广泛部署在面向用户的服务中，但在新任务中仍然容易出错，往往会重复相同的故障模式，并表现出巨大的运行差异。通过针对特定环境的培训或手动修补来修复故障成本高昂且难以扩展。为了在面向用户的服务环境中实现自我进化代理，我们提出了 WISE-Flow，这是一个以工作流为中心的框架，通过引入具有先决条件增强操作块的工作流，将历史服务交互转换为可重用的程序体验。在部署时，WISE-Flow 将代理的执行轨迹与检索到的工作流程对齐，并执行先决条件感知的可行性推理，以实现基于状态的下一步操作。 ToolSandbox 和 $\tau^2$-bench 上的实验显示了基础模型的一致改进。

Title: SwiftMem: Fast Agentic Memory via Query-aware Indexing

Authors: Anxin Tian, Yiming Li, Xing Li, Hui-Ling Zhen, Lei Chen, Xianzhi Yu, Zhenhua Dong, Mingxuan Yuan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.08160
Pdf URL: https://arxiv.org/pdf/2601.08160
Copy Paste: [[2601.08160]] SwiftMem: Fast Agentic Memory via Query-aware Indexing(https://arxiv.org/abs/2601.08160)
Keywords: llm, agent
Abstract: Agentic memory systems have become critical for enabling LLM agents to maintain long-term context and retrieve relevant information efficiently. However, existing memory frameworks suffer from a fundamental limitation: they perform exhaustive retrieval across the entire storage layer regardless of query characteristics. This brute-force approach creates severe latency bottlenecks as memory grows, hindering real-time agent interactions. We propose SwiftMem, a query-aware agentic memory system that achieves sub-linear retrieval through specialized indexing over temporal and semantic dimensions. Our temporal index enables logarithmic-time range queries for time-sensitive retrieval, while the semantic DAG-Tag index maps queries to relevant topics through hierarchical tag structures. To address memory fragmentation during growth, we introduce an embedding-tag co-consolidation mechanism that reorganizes storage based on semantic clusters to improve cache locality. Experiments on LoCoMo and LongMemEval benchmarks demonstrate that SwiftMem achieves 47$\times$ faster search compared to state-of-the-art baselines while maintaining competitive accuracy, enabling practical deployment of memory-augmented LLM agents.
摘要：代理记忆系统对于使 LLM 代理能够维护长期上下文并有效检索相关信息至关重要。然而，现有的内存框架存在一个根本性的限制：它们在整个存储层上执行详尽的检索，而不管查询特征如何。随着内存的增长，这种暴力方法会产生严重的延迟瓶颈，阻碍实时代理交互。我们提出了 SwiftMem，一种查询感知代理存储系统，它通过时间和语义维度上的专门索引来实现亚线性检索。我们的时间索引支持对数时间范围查询以进行时间敏感的检索，而语义 DAG 标签索引通过分层标签结构将查询映射到相关主题。为了解决增长过程中的内存碎片问题，我们引入了一种嵌入标签联合整合机制，该机制基于语义集群重新组织存储以提高缓存局部性。 LoCoMo 和 LongMemEval 基准测试表明，与最先进的基准相比，SwiftMem 的搜索速度提高了 47 倍，同时保持了有竞争力的准确性，从而实现了内存增强 LLM 代理的实际部署。
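
One common way to get logarithmic-time range lookups over timestamps is a sorted array with binary search. The sketch below uses Python's standard `bisect` module; the class name `TemporalIndex` and its methods are hypothetical, and the abstract does not say which data structure SwiftMem actually uses.

```python
import bisect


class TemporalIndex:
    """Keeps memory items sorted by timestamp for range queries."""

    def __init__(self):
        self._timestamps = []  # sorted ascending
        self._items = []       # items, aligned with _timestamps

    def add(self, timestamp, item):
        # Binary search finds the slot; list insertion itself is O(n).
        pos = bisect.bisect_right(self._timestamps, timestamp)
        self._timestamps.insert(pos, timestamp)
        self._items.insert(pos, item)

    def range_query(self, start, end):
        """Return items with start <= timestamp <= end."""
        # Both boundaries are located in O(log n); copying the slice
        # costs time proportional to the number of matches.
        lo = bisect.bisect_left(self._timestamps, start)
        hi = bisect.bisect_right(self._timestamps, end)
        return self._items[lo:hi]


index = TemporalIndex()
index.add(10, "a")
index.add(30, "c")
index.add(20, "b")
print(index.range_query(15, 30))
```

A balanced tree or skip list would also make insertion logarithmic; the sorted list is used here only because it keeps the example short.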

Title: Relational Knowledge Distillation Using Fine-tuned Function Vectors

Authors: Andrea Kang, Yingnian Wu, Hongjing Lu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.08169
Pdf URL: https://arxiv.org/pdf/2601.08169
Copy Paste: [[2601.08169]] Relational Knowledge Distillation Using Fine-tuned Function Vectors(https://arxiv.org/abs/2601.08169)
Keywords: language model, llm
Abstract: Representing relations between concepts is a core prerequisite for intelligent systems to make sense of the world. Recent work using causal mediation analysis has shown that a small set of attention heads encodes task representation in in-context learning, captured in a compact representation known as the function vector. We show that fine-tuning function vectors with only a small set of examples (about 20 word pairs) yields better performance on relation-based word-completion tasks than using the original vectors derived from causal mediation analysis. These improvements hold for both small and large language models. Moreover, the fine-tuned function vectors yield improved decoding performance for relation words and show stronger alignment with human similarity judgments of semantic relations. Next, we introduce the composite function vector, a weighted combination of fine-tuned function vectors, to extract relational knowledge and support analogical reasoning. At inference time, inserting this composite vector into LLM activations markedly enhances performance on challenging analogy problems drawn from cognitive science and SAT benchmarks. Our results highlight the potential of activation patching as a controllable mechanism for encoding and manipulating relational knowledge, advancing both the interpretability and reasoning capabilities of large language models.
摘要：表示概念之间的关系是智能系统理解世界的核心先决条件。最近使用因果中介分析的工作表明，一小组注意力头在上下文学习中对任务表示进行编码，并以称为函数向量的紧凑表示形式捕获。我们表明，仅使用一小组示例（大约 20 个单词对）微调函数向量在基于关系的单词完成任务上比使用从因果中介分析得出的原始向量产生更好的性能。这些改进适用于小型和大型语言模型。此外，微调后的函数向量提高了关系词的解码性能，并与人类对语义关系的相似性判断表现出更强的一致性。接下来，我们引入复合函数向量（微调函数向量的加权组合）来提取关系知识并支持类比推理。在推理时，将此复合向量插入 LLM 激活中可以显着提高从认知科学和 SAT 基准中提取的具有挑战性的类比问题的性能。我们的结果凸显了激活补丁作为编码和操作关系知识的可控机制的潜力，从而提高了大型语言模型的可解释性和推理能力。
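
The composite function vector is described as a weighted combination of fine-tuned function vectors that is inserted into model activations. A minimal sketch of that arithmetic follows. The function names, the toy two-dimensional vectors, the weights, and the choice to add the vector to a hidden state are all assumptions made for illustration; the abstract does not specify how the weights are obtained or where the vector is inserted.

```python
def composite_vector(function_vectors, weights):
    """Weighted sum of equally sized function vectors."""
    dim = len(function_vectors[0])
    return [
        sum(w * vec[i] for w, vec in zip(weights, function_vectors))
        for i in range(dim)
    ]


def inject(hidden_state, vector):
    """Add the vector to a hidden state (assumed insertion mechanism)."""
    return [h + v for h, v in zip(hidden_state, vector)]


# Two hypothetical relation vectors and hypothetical mixing weights.
vectors = [[1.0, 0.0], [0.0, 2.0]]
combined = composite_vector(vectors, weights=[0.5, 0.25])
patched = inject([1.0, 1.0], combined)
print(combined, patched)
```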

Title: Prompt-Based Clarity Evaluation and Topic Detection in Political Question Answering

Authors: Lavanya Prahallad, Sai Utkarsh Choudarypally, Pragna Prahallad, Pranathi Prahallad
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.08176
Pdf URL: https://arxiv.org/pdf/2601.08176
Copy Paste: [[2601.08176]] Prompt-Based Clarity Evaluation and Topic Detection in Political Question Answering(https://arxiv.org/abs/2601.08176)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: Automatic evaluation of large language model (LLM) responses requires not only factual correctness but also clarity, particularly in political question-answering. While recent datasets provide human annotations for clarity and evasion, the impact of prompt design on automatic clarity evaluation remains underexplored. In this paper, we study prompt-based clarity evaluation using the CLARITY dataset from the SemEval 2026 shared task. We compare a GPT-3.5 baseline provided with the dataset against GPT-5.2 evaluated under three prompting strategies: simple prompting, chain-of-thought prompting, and chain-of-thought with few-shot examples. Model predictions are evaluated against human annotations using accuracy and class-wise metrics for clarity and evasion, along with hierarchical exact match. Results show that GPT-5.2 consistently outperforms the GPT-3.5 baseline on clarity prediction, with accuracy improving from 56 percent to 63 percent under chain-of-thought with few-shot prompting. Chain-of-thought prompting yields the highest evasion accuracy at 34 percent, though improvements are less stable across fine-grained evasion categories. We further evaluate topic identification and find that reasoning-based prompting improves accuracy from 60 percent to 74 percent relative to human annotations. Overall, our findings indicate that prompt design reliably improves high-level clarity evaluation, while fine-grained evasion and topic detection remain challenging despite structured reasoning prompts.
摘要：大语言模型（LLM）回答的自动评估不仅需要事实正确性，而且还需要清晰度，特别是在政治问答方面。虽然最近的数据集为清晰度和规避提供了人工注释，但提示设计对自动清晰度评估的影响仍未得到充分探索。在本文中，我们使用 SemEval 2026 共享任务中的 CLARITY 数据集研究基于提示的清晰度评估。我们将数据集提供的 GPT-3.5 基线与在三种提示策略下评估的 GPT-5.2 进行比较：简单提示、思路链提示和带有少量示例的思路链。模型预测是根据人类注释进行评估的，使用准确性和类别指标来实现清晰度和规避，以及分层精确匹配。结果表明，GPT-5.2 在清晰度预测方面始终优于 GPT-3.5 基线，在少量镜头提示的思维链下，准确率从 56% 提高到 63%。思维链提示的规避准确率最高，为 34%，但在细粒度规避类别中的改进不太稳定。我们进一步评估主题识别，发现基于推理的提示相对于人工注释将准确性从 60% 提高到 74%。总体而言，我们的研究结果表明，提示设计可靠地提高了高级清晰度评估，而尽管有结构化推理提示，但细粒度的回避和主题检测仍然具有挑战性。

Title: Evaluating Implicit Regulatory Compliance in LLM Tool Invocation via Logic-Guided Synthesis

Authors: Da Song, Yuheng Huang, Boqi Chen, Tianshuo Cong, Randy Goebel, Lei Ma, Foutse Khomh
Subjects: cs.CL, cs.AI, cs.CR, cs.LO, cs.SE
Abstract URL: https://arxiv.org/abs/2601.08196
Pdf URL: https://arxiv.org/pdf/2601.08196
Copy Paste: [[2601.08196]] Evaluating Implicit Regulatory Compliance in LLM Tool Invocation via Logic-Guided Synthesis(https://arxiv.org/abs/2601.08196)
Keywords: language model, llm, agent
Abstract: The integration of large language models (LLMs) into autonomous agents has enabled complex tool use, yet in high-stakes domains, these systems must strictly adhere to regulatory standards beyond simple functional correctness. However, existing benchmarks often overlook implicit regulatory compliance, thus failing to evaluate whether LLMs can autonomously enforce mandatory safety constraints. To fill this gap, we introduce LogiSafetyGen, a framework that converts unstructured regulations into Linear Temporal Logic oracles and employs logic-guided fuzzing to synthesize valid, safety-critical traces. Building on this framework, we construct LogiSafetyBench, a benchmark comprising 240 human-verified tasks that require LLMs to generate Python programs that satisfy both functional objectives and latent compliance rules. Evaluations of 13 state-of-the-art (SOTA) LLMs reveal that larger models, despite achieving better functional correctness, frequently prioritize task completion over safety, which results in non-compliant behavior.
摘要：将大型语言模型 (LLM) 集成到自主代理中可以使用复杂的工具，但在高风险领域，这些系统必须严格遵守监管标准，而不仅仅是简单的功能正确性。然而，现有的基准往往忽视隐含的监管合规性，因此无法评估法学硕士是否可以自主执行强制性安全约束。为了填补这一空白，我们引入了 LogiSafetyGen，这是一个框架，可将非结构化法规转换为线性时序逻辑预言机，并采用逻辑引导模糊测试来合成有效的、安全关键的跟踪。在此框架的基础上，我们构建了 LogiSafetyBench，这是一个包含 240 个人工验证任务的基准测试，需要法学硕士生成满足功能目标和潜在合规性规则的 Python 程序。对 13 个最先进 (SOTA) 法学硕士的评估表明，较大的模型尽管实现了更好的功能正确性，但经常优先考虑任务完成而不是安全，从而导致不合规行为。
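
To make the idea of a temporal-logic oracle concrete, the sketch below checks one simple precedence rule over a finite trace of tool calls: a guarded action may only occur after a prerequisite action has occurred. The rule, the action names, and the function `first_violation` are hypothetical examples. This is a hand-written checker for a single pattern, not a general Linear Temporal Logic evaluator and not the LogiSafetyGen implementation.

```python
def first_violation(trace, guarded, prerequisite):
    """Return the index of the first guarded action that occurs before
    any prerequisite action, or None if the trace complies."""
    prerequisite_seen = False
    for position, action in enumerate(trace):
        if action == prerequisite:
            prerequisite_seen = True
        elif action == guarded and not prerequisite_seen:
            return position
    return None


# Hypothetical rule: "transfer_funds" requires a prior "verify_identity".
compliant = ["login", "verify_identity", "transfer_funds"]
non_compliant = ["login", "transfer_funds", "verify_identity"]
print(first_violation(compliant, "transfer_funds", "verify_identity"))
print(first_violation(non_compliant, "transfer_funds", "verify_identity"))
```

Note that the non-compliant trace still completes the functional task, which is the kind of gap between task completion and compliance that the abstract highlights.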

Title: Triplets Better Than Pairs: Towards Stable and Effective Self-Play Fine-Tuning for LLMs

Authors: Yibo Wang, Hai-Long Sun, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Lijun Zhang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.08198
Pdf URL: https://arxiv.org/pdf/2601.08198
Copy Paste: [[2601.08198]] Triplets Better Than Pairs: Towards Stable and Effective Self-Play Fine-Tuning for LLMs(https://arxiv.org/abs/2601.08198)
Keywords: language model, llm
Abstract: Recently, self-play fine-tuning (SPIN) has been proposed to adapt large language models to downstream applications with scarce expert-annotated data, by iteratively generating synthetic responses from the model itself. However, SPIN is designed to optimize the current reward advantages of annotated responses over synthetic responses at hand, which may gradually vanish during iterations, leading to unstable optimization. Moreover, the use of a reference policy induces a misalignment between the reward formulation for training and the metric for generation. To address these limitations, we propose a novel Triplet-based Self-Play fIne-tuNing (T-SPIN) method that integrates two key designs. First, beyond current advantages, T-SPIN additionally incorporates historical advantages between iteratively generated responses and proto-synthetic responses produced by the initial policy. Even if the current advantages diminish, historical advantages remain effective, stabilizing the overall optimization. Second, T-SPIN introduces an entropy constraint into the self-play framework, which is theoretically justified to support reference-free fine-tuning, eliminating the training-generation discrepancy. Empirical results on various tasks demonstrate not only the superior performance of T-SPIN over SPIN, but also its stable evolution during iterations. Remarkably, compared to supervised fine-tuning, T-SPIN achieves comparable or even better performance with only 25% of the samples, highlighting its effectiveness when faced with scarce annotated data.
摘要：最近，人们提出了自对弈微调（SPIN），通过从模型本身迭代生成合成响应，使大型语言模型适应缺乏专家注释数据的下游应用程序。然而，SPIN 的设计目的是优化注释响应相对于现有合成响应的当前奖励优势，这些优势可能在迭代过程中逐渐消失，从而导致优化不稳定。此外，参考策略的使用会导致训练奖励公式与生成指标之间的不一致问题。为了解决这些限制，我们提出了一种新颖的基于三元组的自对弈微调（T-SPIN）方法，该方法集成了两个关键设计。首先，除了当前的优势之外，T-SPIN 还结合了迭代生成的响应和初始策略产生的原始合成响应之间的历史优势。即使当前优势减弱，历史优势仍然有效，稳定了整体优化。其次，T-SPIN 将熵约束引入到 self-play 框架中，理论上可以支持无参考微调，消除训练生成差异。各种任务的实证结果不仅证明了 T-SPIN 相对于 SPIN 的优越性能，而且证明了它在迭代过程中的稳定演化。值得注意的是，与有监督的微调相比，T-SPIN 仅用 25% 的样本就实现了相当甚至更好的性能，突显了其在面对稀缺注释数据时的有效性。

Title: Generation-Augmented Generation: A Plug-and-Play Framework for Private Knowledge Injection in Large Language Models

Authors: Rongji Li, Jian Xu, Xueqing Chen, Yisheng Yang, Jiayi Wang, Xingyu Chen, Chunyu Xie, Dawei Leng, Xu-Yao Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.08209
Pdf URL: https://arxiv.org/pdf/2601.08209
Copy Paste: [[2601.08209]] Generation-Augmented Generation: A Plug-and-Play Framework for Private Knowledge Injection in Large Language Models(https://arxiv.org/abs/2601.08209)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: In domains such as biomedicine, materials, and finance, high-stakes deployment of large language models (LLMs) requires injecting private, domain-specific knowledge that is proprietary, fast-evolving, and under-represented in public pretraining. However, the two dominant paradigms for private knowledge injection each have pronounced drawbacks: fine-tuning is expensive to iterate, and continual updates risk catastrophic forgetting and general-capability regression; retrieval-augmented generation (RAG) keeps the base model intact but is brittle in specialized private corpora due to chunk-induced evidence fragmentation, retrieval drift, and long-context pressure that yields query-dependent prompt inflation. Inspired by how multimodal LLMs align heterogeneous modalities into a shared semantic space, we propose Generation-Augmented Generation (GAG), which treats private expertise as an additional expert modality and injects it via a compact, representation-level interface aligned to the frozen base model, avoiding prompt-time evidence serialization while enabling plug-and-play specialization and scalable multi-domain composition with reliable selective activation. Across two private scientific QA benchmarks (immunology adjuvant and catalytic materials) and mixed-domain evaluations, GAG improves specialist performance over strong RAG baselines by 15.34% and 14.86% on the two benchmarks, respectively, while maintaining performance on six open general benchmarks and enabling near-oracle selective activation for scalable multi-domain deployment.
摘要：在生物医学、材料和金融等领域，大型语言模型 (LLM) 的高风险部署需要注入私有的、特定领域的知识，这些知识是专有的、快速发展的，并且在公共预训练中代表性不足。然而，私有知识注入的两种主要范式都有明显的缺点：微调迭代成本高昂，持续更新可能会带来灾难性遗忘和一般能力回归的风险；检索增强生成（RAG）保持基本模型完整，但由于块引起的证据碎片、检索漂移和长上下文压力（产生依赖于查询的即时膨胀），在专门的私人语料库中很脆弱。受多模态法学硕士如何将异构模态整合到共享语义空间的启发，我们提出了生成增强生成（GAG），它将私人专业知识视为一种额外的专家模态，并通过与冻结基础模型对齐的紧凑的表示级接口注入它，避免即时证据序列化，同时通过可靠的选择性激活实现即插即用的专业化和可扩展的多域组合。在两个私人科学 QA 基准（免疫佐剂和催化材料）和混合域评估中，GAG 在两个基准上分别将专业性能比强大的 RAG 基准提高了 15.34% 和 14.86%，同时保持了在六个开放通用基准上的性能，并为可扩展的多域部署实现了近乎预言机的选择性激活。

Title: Towards Principled Design of Mixture-of-Experts Language Models under Memory and Inference Constraints

Authors: Seng Pei Liew, Kenta Shinzato, Yuyang Dong
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.08215
Pdf URL: https://arxiv.org/pdf/2601.08215
Copy Paste: [[2601.08215]] Towards Principled Design of Mixture-of-Experts Language Models under Memory and Inference Constraints(https://arxiv.org/abs/2601.08215)
Keywords: language model
Abstract: Modern Mixture-of-Experts (MoE) language models are designed based on total parameters (memory footprint) and active parameters (inference cost). However, we find these two factors alone are insufficient to describe an optimal architecture. Through a systematic study, we demonstrate that MoE performance is primarily determined by total parameters ($N_{total}$) and expert sparsity ($s:=n_{exp}/n_{topk}$). Moreover, $n_{exp}$ and $n_{topk}$ do not "cancel out" within the sparsity ratio; instead, a larger total number of experts slightly penalizes performance by forcing a reduction in core model dimensions (depth and width) to meet memory constraints. This motivates a simple principle for MoE design which maximizes $N_{total}$ while minimizing $s$ (maximizing $n_{topk}$) and $n_{exp}$ under the given constraints. Our findings provide a robust framework for resolving architectural ambiguity and guiding MoE design.
摘要：现代专家混合 (MoE) 语言模型是根据总参数（内存占用）和活动参数（推理成本）设计的。然而，我们发现仅这两个因素不足以描述最佳架构。通过系统研究，我们证明 MoE 性能主要由总参数 ($N_{total}$) 和专家稀疏性 ($s:=n_{exp}/n_{topk}$) 决定。此外，$n_{exp}$ 和 $n_{topk}$ 不会在稀疏率内“抵消”；相反，更多的专家会强制减少核心模型维度（深度和宽度）以满足内存限制，从而稍微降低性能。这激发了 MoE 设计的一个简单原则，即在给定约束下最大化 $N_{total}$，同时最小化 $s$（最大化 $n_{topk}$）和 $n_{exp}$。我们的研究结果为解决架构模糊性和指导教育部设计提供了一个强大的框架。
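
The sparsity ratio $s:=n_{exp}/n_{topk}$ from the abstract above is easy to compute for candidate configurations. The sketch below uses two hypothetical configurations (the expert counts are illustrative, not taken from the paper) to show that different designs can share the same $s$ while differing in $n_{exp}$.

```python
def expert_sparsity(n_exp, n_topk):
    """Sparsity ratio s = n_exp / n_topk as defined in the abstract."""
    if n_topk <= 0 or n_topk > n_exp:
        raise ValueError("need 0 < n_topk <= n_exp")
    return n_exp / n_topk


# Hypothetical configurations: (total experts, experts routed per token).
config_a = (64, 8)
config_b = (16, 2)
s_a = expert_sparsity(*config_a)
s_b = expert_sparsity(*config_b)
print(s_a, s_b)
```

Both configurations give the same ratio. According to the abstract, $n_{exp}$ and $n_{topk}$ do not cancel out within that ratio, so under a fixed memory budget the configuration with more experts would be expected to perform slightly worse.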

Title: User-Oriented Multi-Turn Dialogue Generation with Tool Use at scale

Authors: Jungho Cho, Minbyul Jeong, Sungrae Park
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.08225
Pdf URL: https://arxiv.org/pdf/2601.08225
Copy Paste: [[2601.08225]] User-Oriented Multi-Turn Dialogue Generation with Tool Use at scale(https://arxiv.org/abs/2601.08225)
Keywords: agent
Abstract: The recent paradigm shift toward large reasoning models (LRMs) as autonomous agents has intensified the demand for sophisticated, multi-turn tool-use capabilities. Yet, existing datasets and data-generation approaches are limited by static, predefined toolsets that cannot scale to the complexity of open-ended human-agent collaboration. To address this, we initially developed a framework for automated task-oriented multi-turn dialogue generation at scale, utilizing an LRM-based simulator to dynamically generate high-value, domain-specific tools to solve specified tasks. However, we observe that a purely task-oriented design often results in "solely task-solving" trajectories, where the agent completes the objective with minimal interaction, failing to generate the high turn-count conversations seen in realistic scenarios. To bridge this gap, we shift toward a user-oriented simulation paradigm. By decoupling task generation from a dedicated user simulator that mimics human behavioral rules (such as incremental request-making and turn-by-turn feedback), we facilitate more authentic, extended multi-turn dialogues that reflect the iterative nature of real-world problem solving. Our generation pipeline operates as a versatile, plug-and-play module capable of initiating generation from any state, ensuring high scalability in producing extended tool-use data. Furthermore, by facilitating multiple task completions within a single trajectory, it yields a high-density dataset that reflects the multifaceted demands of real-world human-agent interaction.
摘要：最近向大型推理模型（LRM）作为自主代理的范式转变加剧了对复杂的多轮工具使用能力的需求。然而，现有的数据集和数据生成方法受到静态、预定义工具集的限制，这些工具集无法扩展到开放式人类代理协作的复杂性。为了解决这个问题，我们最初开发了一个大规模自动生成面向任务的多轮对话的框架，利用基于 LRM 的模拟器动态生成高价值、特定领域的工具来解决指定的任务。然而，我们观察到，纯粹面向任务的设计通常会导致“仅解决任务”的轨迹，其中代理以最少的交互完成目标，无法生成现实场景中看到的高轮数对话。为了弥补这一差距，我们转向面向用户的模拟范例。通过将任务生成与模仿人类行为规则的专用用户模拟器（例如增量请求和逐轮反馈）解耦，我们可以促进更真实、更扩展的多回合对话，从而反映现实世界问题解决的迭代本质。我们的生成管道作为多功能、即插即用模块运行，能够从任何状态启动生成，确保生成扩展工具使用数据的高可扩展性。此外，通过促进单个轨迹内的多个任务完成，它产生了一个高密度数据集，反映了现实世界中人类与智能体交互的多方面需求。

Title: Med-CoReasoner: Reducing Language Disparities in Medical Reasoning via Language-Informed Co-Reasoning

Authors: Fan Gao, Sherry T. Tong, Jiwoong Sohn, Jiahao Huang, Junfeng Jiang, Ding Xia, Piyalitt Ittichaiwong, Kanyakorn Veerakanjana, Hyunjae Kim, Qingyu Chen, Edison Marrese Taylor, Kazuma Kobayashi, Akkiko Aizawa, Irene Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.08267
Pdf URL: https://arxiv.org/pdf/2601.08267
Copy Paste: [[2601.08267]] Med-CoReasoner: Reducing Language Disparities in Medical Reasoning via Language-Informed Co-Reasoning(https://arxiv.org/abs/2601.08267)
Keywords: language model
Abstract: While reasoning-enhanced large language models perform strongly on English medical tasks, a persistent multilingual gap remains, with substantially weaker reasoning in local languages, limiting equitable global medical deployment. To bridge this gap, we introduce Med-CoReasoner, a language-informed co-reasoning framework that elicits parallel English and local-language reasoning, abstracts them into structured concepts, and integrates local clinical knowledge into an English logical scaffold via concept-level alignment and retrieval. This design combines the structural robustness of English reasoning with the practice-grounded expertise encoded in local languages. To evaluate multilingual medical reasoning beyond multiple-choice settings, we construct MultiMed-X, a benchmark covering seven languages with expert-annotated long-form question answering and natural language inference tasks, comprising 350 instances per language. Experiments across three benchmarks show that Med-CoReasoner improves multilingual reasoning performance by an average of 5%, with particularly substantial gains in low-resource languages. Moreover, model distillation and expert evaluation analysis further confirm that Med-CoReasoner produces clinically sound and culturally grounded reasoning traces.
摘要：虽然推理增强型大型语言模型在英语医疗任务上表现强劲，但持续存在的多语言差距仍然存在，当地语言的推理能力明显较弱，限制了公平的全球医疗部署。为了弥补这一差距，我们引入了 Med-CoReasoner，这是一种语言信息协同推理框架，可引发并行的英语和本地语言推理，将其抽象为结构化概念，并通过概念级对齐和检索将本地临床知识整合到英语逻辑支架中。该设计将英语推理的结构稳健性与以当地语言编码的基于实践的专业知识结合起来。为了评估多项选择设置之外的多语言医学推理，我们构建了 MultiMed-X，这是一个涵盖七种语言的基准，包含专家注释的长格式问答和自然语言推理任务，每种语言包含 350 个实例。三个基准测试的实验表明，Med-CoReasoner 将多语言推理性能平均提高了 5%，尤其是在低资源语言中，效果尤其显着。此外，模型蒸馏和专家评估分析进一步证实 Med-CoReasoner 产生了临床合理且有文化基础的推理痕迹。

Title: Discovery and Reinforcement of Tool-Integrated Reasoning Chains via Rollout Trees

Authors: Kun Li, Zenan Xu, Junan Li, Zengrui Jin, Jinghao Deng, Zexuan Qiu, Bo Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.08274
Pdf URL: https://arxiv.org/pdf/2601.08274
Copy Paste: [[2601.08274]] Discovery and Reinforcement of Tool-Integrated Reasoning Chains via Rollout Trees(https://arxiv.org/abs/2601.08274)
Keywords: language model, llm, chain-of-thought
Abstract: Tool-Integrated Reasoning has emerged as a key paradigm to augment Large Language Models (LLMs) with computational capabilities, yet integrating tool-use into long Chain-of-Thought (long CoT) remains underexplored, largely due to the scarcity of training data and the challenge of integrating tool-use without compromising the model's intrinsic long-chain reasoning. In this paper, we introduce DART (Discovery And Reinforcement of Tool-Integrated Reasoning Chains via Rollout Trees), a reinforcement learning framework that enables spontaneous tool-use during long CoT reasoning without human annotation. DART operates by constructing dynamic rollout trees during training to discover valid tool-use opportunities, branching out at promising positions to explore diverse tool-integrated trajectories. Subsequently, a tree-based process advantage estimation identifies and credits specific sub-trajectories where tool invocation positively contributes to the solution, effectively reinforcing these beneficial behaviors. Extensive experiments on challenging benchmarks like AIME and GPQA-Diamond demonstrate that DART significantly outperforms existing methods, successfully harmonizing tool execution with long CoT reasoning.
摘要：工具集成推理已成为增强大型语言模型 (LLM) 计算能力的关键范式，但将工具使用集成到长思想链 (长 CoT) 中的研究仍未充分，这主要是由于训练数据的稀缺以及在不影响模型内在长链推理的情况下集成工具使用的挑战。在本文中，我们介绍了 DART（通过 Rollout Trees 发现和强化工具集成推理链），这是一种强化学习框架，可以在长时间 CoT 推理过程中自发使用工具，无需人工注释。 DART 的运作方式是在训练期间构建动态展示树，以发现有效的工具使用机会，并在有希望的位置进行分支，以探索不同的工具集成轨迹。随后，基于树的流程优势估计会识别并归功于工具调用对解决方案做出积极贡献的特定子轨迹，从而有效地强化这些有益行为。对 AIME 和 GPQA-Diamond 等具有挑战性的基准进行的大量实验表明，DART 的性能显着优于现有方法，成功地将工具执行与长 CoT 推理协调起来。

Title: D$^2$Plan: Dual-Agent Dynamic Global Planning for Complex Retrieval-Augmented Reasoning

Authors: Kangcheng Luo, Tinglang Wu, Yansong Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.08282
Pdf URL: https://arxiv.org/pdf/2601.08282
Copy Paste: [[2601.08282]] D$^2$Plan: Dual-Agent Dynamic Global Planning for Complex Retrieval-Augmented Reasoning(https://arxiv.org/abs/2601.08282)
Keywords: llm, agent
Abstract: Recent search-augmented LLMs trained with reinforcement learning (RL) can interleave searching and reasoning for multi-hop reasoning tasks. However, they face two critical failure modes as the accumulating context becomes flooded with both crucial evidence and irrelevant information: (1) ineffective search chain construction that produces incorrect queries or omits retrieval of critical information, and (2) reasoning hijacking by peripheral evidence that causes models to misidentify distractors as valid evidence. To address these challenges, we propose **D$^2$Plan**, a **D**ual-agent **D**ynamic global **Plan**ning paradigm for complex retrieval-augmented reasoning. **D$^2$Plan** operates through the collaboration of a *Reasoner* and a *Purifier*: the *Reasoner* constructs explicit global plans during reasoning and dynamically adapts them based on retrieval feedback; the *Purifier* assesses retrieval relevance and condenses key information for the *Reasoner*. We further introduce a two-stage training framework consisting of supervised fine-tuning (SFT) cold-start on synthesized trajectories and RL with plan-oriented rewards to teach LLMs to master the **D$^2$Plan** paradigm. Extensive experiments demonstrate that **D$^2$Plan** enables more coherent multi-step reasoning and stronger resilience to irrelevant information, thereby achieving superior performance on challenging QA benchmarks.
摘要：最近经过强化学习（RL）训练的搜索增强法学硕士可以将搜索和推理交错进行多跳推理任务。然而，随着不断积累的上下文充斥着关键证据和不相关信息，它们面临两种关键的失败模式：(1) 无效的搜索链构建，会产生错误的查询或忽略关键信息的检索；(2) 外围证据的推理劫持，导致模型将干扰因素误识别为有效证据。为了应对这些挑战，我们提出**D$^2$Plan**，这是一种用于复杂检索增强推理的**双代理**D**动态全局**规划**范式。 **D$^2$Plan** 通过 *Reasoner* 和 *Purifier* 的协作进行操作：*Reasoner* 在推理过程中构建明确的全局计划，并根据检索反馈动态调整它们； *Purifier* 评估检索相关性并为 *Reasoner* 压缩关键信息。我们进一步引入了一个两阶段训练框架，包括合成轨迹上的监督微调（SFT）冷启动和具有计划导向奖励的强化学习，以教导法学硕士掌握**D$^2$Plan**范式。大量实验表明，**D$^2$Plan** 能够实现更连贯的多步骤推理和对不相关信息更强的恢复能力，从而在具有挑战性的 QA 基准上实现卓越的性能。

Title: Enhancing Sentiment Classification and Irony Detection in Large Language Models through Advanced Prompt Engineering Techniques

Authors: Marvin Schmitt, Anne Schwerk, Sebastian Lempert
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.08302
Pdf URL: https://arxiv.org/pdf/2601.08302
Copy Paste: [[2601.08302]] Enhancing Sentiment Classification and Irony Detection in Large Language Models through Advanced Prompt Engineering Techniques(https://arxiv.org/abs/2601.08302)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: This study investigates the use of prompt engineering to enhance large language models (LLMs), specifically GPT-4o-mini and gemini-1.5-flash, in sentiment analysis tasks. It evaluates advanced prompting techniques like few-shot learning, chain-of-thought prompting, and self-consistency against a baseline. Key tasks include sentiment classification, aspect-based sentiment analysis, and detecting subtle nuances such as irony. The research details the theoretical background, datasets, and methods used, assessing performance of LLMs as measured by accuracy, recall, precision, and F1 score. Findings reveal that advanced prompting significantly improves sentiment analysis, with the few-shot approach excelling in GPT-4o-mini and chain-of-thought prompting boosting irony detection in gemini-1.5-flash by up to 46%. Thus, while advanced prompting techniques overall improve performance, the fact that few-shot prompting works best for GPT-4o-mini and chain-of-thought excels in gemini-1.5-flash for irony detection suggests that prompting strategies must be tailored to both the model and the task. This highlights the importance of aligning prompt design with both the LLM's architecture and the semantic complexity of the task.
摘要：本研究调查了在情感分析任务中使用即时工程来增强大型语言模型 (LLM)，特别是 GPT-4o-mini 和 Gemini-1.5-flash。它评估先进的提示技术，如少样本学习、思维链提示和与基线的自我一致性。关键任务包括情感分类、基于方面的情感分析以及检测讽刺等细微差别。该研究详细介绍了理论背景、数据集和使用的方法，通过准确性、召回率、精度和 F1 分数来评估法学硕士的表现。研究结果表明，高级提示显着改善了情感分析，GPT-4o-mini 中的少样本方法表现出色，而思维链提示将 Gemini-1.5-flash 中的反讽检测提高了高达 46%。因此，虽然先进的提示技术总体上提高了性能，但事实是，少样本提示最适合 GPT-4o-mini，而思想链在用于反讽检测的 Gemini-1.5-flash 中表现出色，这表明提示策略必须针对模型和任务进行定制。这凸显了将提示设计与法学硕士的架构和任务的语义复杂性保持一致的重要性。

Title: AgriAgent: Contract-Driven Planning and Capability-Aware Tool Orchestration in Real-World Agriculture

Authors: Bo Yang, Yu Zhang, Yunkui Chen, Lanfei Feng, Xiao Xu, Nueraili Aierken, Shijian Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.08308
Pdf URL: https://arxiv.org/pdf/2601.08308
Copy Paste: [[2601.08308]] AgriAgent: Contract-Driven Planning and Capability-Aware Tool Orchestration in Real-World Agriculture(https://arxiv.org/abs/2601.08308)
Keywords: agent
Abstract: Intelligent agent systems in real-world agricultural scenarios must handle diverse tasks under multimodal inputs, ranging from lightweight information understanding to complex multi-step execution. However, most existing approaches rely on a unified execution paradigm, which struggles to accommodate large variations in task complexity and incomplete tool availability commonly observed in agricultural environments. To address this challenge, we propose AgriAgent, a two-level agent framework for real-world agriculture. AgriAgent adopts a hierarchical execution strategy based on task complexity: simple tasks are handled through direct reasoning by modality-specific agents, while complex tasks trigger a contract-driven planning mechanism that formulates tasks as capability requirements and performs capability-aware tool orchestration and dynamic tool generation, enabling multi-step and verifiable execution with failure recovery. Experimental results show that AgriAgent achieves higher execution success rates and robustness on complex tasks compared to existing tool-centric agent baselines that rely on unified execution paradigms. All code and data will be released after our work is accepted, to promote reproducible research.
摘要：现实农业场景中的智能代理系统必须处理多模式输入下的各种任务，从轻量级信息理解到复杂的多步骤执行。然而，大多数现有方法依赖于统一的执行范例，该范例很难适应农业环境中常见的任务复杂性的巨大变化和不完整的工具可用性。为了应对这一挑战，我们提出了 AgriAgent，这是一个用于现实农业的两级代理框架。 AgriAgent采用基于任务复杂度的分层执行策略：简单任务由特定模态代理通过直接推理处理，而复杂任务则触发契约驱动的规划机制，将任务制定为能力需求，并进行能力感知的工具编排和动态工具生成，从而实现多步骤、可验证的执行和故障恢复。实验结果表明，与依赖统一执行范例的现有以工具为中心的代理基线相比，AgriAgent 在复杂任务上实现了更高的执行成功率和鲁棒性。所有代码、数据将在我们的工作被接受后发布，以促进可重复的研究。

Title: CLaS-Bench: A Cross-Lingual Alignment and Steering Benchmark

Authors: Daniil Gurgurov, Yusser Al Ghussin, Tanja Baeumel, Cheng-Ting Chou, Patrick Schramowski, Marius Mosbach, Josef van Genabith, Simon Ostermann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.08331
Pdf URL: https://arxiv.org/pdf/2601.08331
Copy Paste: [[2601.08331]] CLaS-Bench: A Cross-Lingual Alignment and Steering Benchmark(https://arxiv.org/abs/2601.08331)
Keywords: language model, llm, prompt
Abstract: Understanding and controlling the behavior of large language models (LLMs) is an increasingly important topic in multilingual NLP. Beyond prompting or fine-tuning, steering, i.e., manipulating internal representations during inference, has emerged as a more efficient and interpretable technique for adapting models to a target language. Yet, no dedicated benchmarks or evaluation protocols exist to quantify the effectiveness of steering techniques. We introduce CLaS-Bench, a lightweight parallel-question benchmark for evaluating language-forcing behavior in LLMs across 32 languages, enabling systematic evaluation of multilingual steering methods. We evaluate a broad array of steering techniques, including residual-stream DiffMean interventions, probe-derived directions, language-specific neurons, PCA/LDA vectors, Sparse Autoencoders, and prompting baselines. Steering performance is measured along two axes: language control and semantic relevance, combined into a single harmonic-mean steering score. We find that, across languages, the simple residual-based DiffMean method consistently outperforms all other methods. Moreover, a layer-wise analysis reveals that language-specific structure emerges predominantly in later layers and that steering directions cluster based on language family. CLaS-Bench is the first standardized benchmark for multilingual steering, enabling both rigorous scientific analysis of language representations and practical evaluation of steering as a low-cost adaptation alternative.
摘要：理解和控制大语言模型 (LLM) 的行为是多语言 NLP 中一个日益重要的主题。除了提示或微调之外，即在推理过程中操纵内部表示，已成为一种更有效且可解释的技术，用于使模型适应目标语言。然而，不存在专门的基准或评估协议来量化转向技术的有效性。我们推出了 CLaS-Bench，这是一个轻量级的并行问题基准，用于评估 32 种语言的法学硕士中的语言强迫行为，从而能够对多语言引导方法进行系统评估。我们评估了广泛的引导技术，包括残差流 DiffMean 干预、探针衍生方向、语言特定神经元、PCA/LDA 向量、稀疏自动编码器和提示基线。转向性能沿着两个轴进行测量：语言控制和语义相关性，组合成一个调和平均转向分数。我们发现，跨语言的简单的基于残差的 DiffMean 方法始终优于所有其他方法。此外，逐层分析表明，特定于语言的结构主要出现在后面的层中，并且基于语言家族的转向方向集群。 CLaS-Bench 是第一个多语言转向的标准化基准，既可以对语言表示进行严格的科学分析，又可以对转向进行实际评估，作为一种低成本的适应替代方案。
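
The abstract above says the two axes are combined into a single harmonic-mean steering score. A minimal sketch of that combination follows, assuming both axis scores lie in [0, 1]; the function name `steering_score` and the zero-handling are illustrative choices, not taken from the paper.

```python
def steering_score(language_control: float, semantic_relevance: float) -> float:
    """Harmonic mean of two axis scores (assumed to lie in [0, 1]).

    The harmonic mean is low whenever either axis is low, so a method
    cannot score well by forcing the language while losing the meaning.
    """
    if language_control < 0 or semantic_relevance < 0:
        raise ValueError("scores must be non-negative")
    total = language_control + semantic_relevance
    if total == 0:
        return 0.0  # both axes are zero; avoid division by zero
    return 2 * language_control * semantic_relevance / total
```

For instance, a method with perfect language control but only 0.5 semantic relevance scores 2/3 rather than the arithmetic mean of 0.75.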

Title: Detecting Mental Manipulation in Speech via Synthetic Multi-Speaker Dialogue

Authors: Run Chen, Wen Liang, Ziwei Gong, Lin Ai, Julia Hirschberg
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.08342
Pdf URL: https://arxiv.org/pdf/2601.08342
Copy Paste: [[2601.08342]] Detecting Mental Manipulation in Speech via Synthetic Multi-Speaker Dialogue(https://arxiv.org/abs/2601.08342)
Keywords: language model
Abstract: Mental manipulation, the strategic use of language to covertly influence or exploit others, is a newly emerging task in computational social reasoning. Prior work has focused exclusively on textual conversations, overlooking how manipulative tactics manifest in speech. We present the first study of mental manipulation detection in spoken dialogues, introducing a synthetic multi-speaker benchmark SPEECHMENTALMANIP that augments a text-based dataset with high-quality, voice-consistent Text-to-Speech rendered audio. Using few-shot large audio-language models and human annotation, we evaluate how modality affects detection accuracy and perception. Our results reveal that models exhibit high specificity but markedly lower recall on speech compared to text, suggesting sensitivity to missing acoustic or prosodic cues in training. Human raters show similar uncertainty in the audio setting, underscoring the inherent ambiguity of manipulative speech. Together, these findings highlight the need for modality-aware evaluation and safety alignment in multimodal dialogue systems.
摘要：心理操纵，即战略性地使用语言来秘密影响或利用他人，是计算社会推理中的一项新任务。之前的工作只关注文本对话，忽视了操纵策略在言语中的体现。我们提出了口语对话中心理操纵检测的第一项研究，引入了一个合成的多说话者基准 SPEECHMENTALMANIP，该基准通过高质量、语音一致的文本到语音渲染音频来增强基于文本的数据集。使用少镜头大型音频语言模型和人工注释，我们评估模态如何影响检测准确性和感知。我们的结果表明，与文本相比，模型表现出较高的特异性，但对语音的回忆明显较低，这表明模型对训练中丢失的声音或韵律线索很敏感。人类评分者在音频设置中表现出类似的不确定性，强调了操纵性语音固有的模糊性。总之，这些发现强调了在多模式对话系统中进行模式感知评估和安全调整的必要性。

Title: PATS: Personality-Aware Teaching Strategies with Large Language Model Tutors

Authors: Donya Rooein, Sankalan Pal Chowdhury, Mariia Eremeeva, Yuan Qin, Debora Nozza, Mrinmaya Sachan, Dirk Hovy
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.08402
Pdf URL: https://arxiv.org/pdf/2601.08402
Copy Paste: [[2601.08402]] PATS: Personality-Aware Teaching Strategies with Large Language Model Tutors(https://arxiv.org/abs/2601.08402)
Keywords: language model, llm
Abstract: Recent advances in large language models (LLMs) demonstrate their potential as educational tutors. However, different tutoring strategies benefit different student personalities, and mismatches can be counterproductive to student outcomes. Despite this, current LLM tutoring systems do not take into account student personality traits. To address this problem, we first construct a taxonomy that links pedagogical methods to personality profiles, based on pedagogical literature. We simulate student-teacher conversations and use our framework to let the LLM tutor adjust its strategy to the simulated student personality. We evaluate the scenario with human teachers and find that they consistently prefer our approach over two baselines. Our method also increases the use of less common, high-impact strategies such as role-playing, which human and LLM annotators prefer significantly. Our findings pave the way for developing more personalized and effective LLM use in educational applications.
摘要：大型语言模型（LLM）的最新进展展示了它们作为教育导师的潜力。然而，不同的辅导策略有利于不同的学生个性，不匹配可能会对学生的成绩产生反作用。尽管如此，目前的法学硕士辅导系统并没有考虑到学生的个性特征。为了解决这个问题，我们首先根据教学文献构建一个将教学方法与人格概况联系起来的分类法。我们模拟学生与老师的对话，并使用我们的框架让法学硕士导师根据模拟的学生个性调整其策略。我们与人类教师一起评估了这个场景，发现他们始终更喜欢我们的方法而不是两个基线。我们的方法还增加了不太常见、高影响力策略的使用，例如角色扮演，人类和法学硕士注释者非常喜欢这种策略。我们的研究结果为在教育应用中开发更加个性化和有效的法学硕士铺平了道路。

Title: Silence the Judge: Reinforcement Learning with Self-Verifier via Latent Geometric Clustering

Authors: Nonghai Zhang, Weitao Ma, Zhanyu Ma, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He, Jingwen Xu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.08427
Pdf URL: https://arxiv.org/pdf/2601.08427
Copy Paste: [[2601.08427]] Silence the Judge: Reinforcement Learning with Self-Verifier via Latent Geometric Clustering(https://arxiv.org/abs/2601.08427)
Keywords: language model, llm
Abstract: Group Relative Policy Optimization (GRPO) significantly enhances the reasoning performance of Large Language Models (LLMs). However, this success heavily relies on expensive external verifiers or human rules. Such dependency not only leads to significant computational costs and training latency, but also yields sparse rewards that hinder optimization efficiency. To address these challenges, we propose Latent-GRPO, a framework that derives intrinsic rewards directly from latent space geometry. Crucially, our empirical analysis reveals a compelling geometric property: terminal token representations of correct reasoning trajectories form dense clusters with high intra-class similarity, whereas incorrect trajectories remain scattered as outliers. In light of this discovery, we introduce the Iterative Robust Centroid Estimation (IRCE) algorithm, which generates dense, continuous rewards by mitigating magnitude fluctuations via spherical projection and estimating a robust "truth centroid" through iterative aggregation. Experimental results on multiple datasets show that our method maintains model performance while achieving a training speedup of over 2x compared to baselines. Furthermore, extensive results demonstrate strong generalization ability and robustness. The code will be released soon.
摘要：组相对策略优化 (GRPO) 显着增强了大型语言模型 (LLM) 的推理性能。然而，这种成功在很大程度上依赖于昂贵的外部验证者或人类规则。这种依赖性不仅会导致显着的计算成本和训练延迟，还会产生稀疏的奖励，从而阻碍优化效率。为了应对这些挑战，我们提出了 Latent-GRPO，这是一个直接从潜在空间几何中获得内在奖励的框架。至关重要的是，我们的实证分析揭示了一个引人注目的几何特性：正确推理轨迹的终端标记表示形成具有高类内相似性的密集簇，而不正确的轨迹仍然分散为异常值。鉴于这一发现，我们引入了迭代鲁棒质心估计（IRCE）算法，该算法通过球面投影减轻幅度波动并通过迭代聚合估计鲁棒的“真值质心”来生成密集、连续的奖励。多个数据集上的实验结果表明，我们的方法保持了模型性能，同时与基线相比，训练速度提高了 2 倍以上。此外，广泛的结果表明了强大的泛化能力和鲁棒性。该代码即将发布。
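
The abstract describes IRCE only at a high level: spherical projection, then iterative aggregation toward a robust centroid. The sketch below is one possible reading of those two steps and is not the authors' algorithm; the exponential reweighting, the `temperature` parameter, and the use of cosine similarity as the reward are all assumptions made for illustration.

```python
import math
from typing import List, Tuple

def normalize(v: List[float]) -> List[float]:
    """Project a vector onto the unit sphere (removes magnitude effects)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def centroid_rewards(
    reps: List[List[float]], n_iters: int = 5, temperature: float = 0.1
) -> Tuple[List[float], List[float]]:
    """Return (rewards, centroid) for a group of terminal-token vectors.

    Assumed scheme: start from the plain mean direction, then repeatedly
    re-estimate the centroid with weights that favour vectors already
    close to it, so scattered outliers lose influence.
    """
    unit = [normalize(r) for r in reps]
    dim = len(unit[0])
    weights = [1.0] * len(unit)
    sims: List[float] = []
    centroid: List[float] = []
    for _ in range(n_iters):
        mean = [sum(w * u[d] for w, u in zip(weights, unit)) for d in range(dim)]
        centroid = normalize(mean)
        # Cosine similarity to the current centroid acts as a dense reward.
        sims = [sum(a * b for a, b in zip(u, centroid)) for u in unit]
        weights = [math.exp(s / temperature) for s in sims]
    return sims, centroid
```

With three nearly parallel vectors and one pointing the opposite way, the three clustered vectors end with rewards near 1 and the outlier with a negative reward.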

Title: Fine-Mem: Fine-Grained Feedback Alignment for Long-Horizon Memory Management

Authors: Weitao Ma, Xiaocheng Feng, Lei Huang, Xiachong Feng, Zhanyu Ma, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He, Bing Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.08435
Pdf URL: https://arxiv.org/pdf/2601.08435
Copy Paste: [[2601.08435]] Fine-Mem: Fine-Grained Feedback Alignment for Long-Horizon Memory Management(https://arxiv.org/abs/2601.08435)
Keywords: language model, agent
Abstract: Effective memory management is essential for large language model agents to navigate long-horizon tasks. Recent research has explored using Reinforcement Learning to develop specialized memory manager agents. However, existing approaches rely on final task performance as the primary reward, which results in severe reward sparsity and ineffective credit assignment, providing insufficient guidance for individual memory operations. To this end, we propose Fine-Mem, a unified framework designed for fine-grained feedback alignment. First, we introduce a Chunk-level Step Reward to provide immediate step-level supervision via auxiliary chunk-specific question answering tasks. Second, we devise Evidence-Anchored Reward Attribution to redistribute global rewards by anchoring credit to key memory operations, based on the specific memory items utilized as evidence in reasoning. Together, these components enable stable policy optimization and align local memory operations with the long-term utility of memory. Experiments on Memalpha and MemoryAgentBench demonstrate that Fine-Mem consistently outperforms strong baselines, achieving superior success rates across various sub-tasks. Further analysis reveals its adaptability and strong generalization capabilities across diverse model configurations and backbones.
摘要：有效的内存管理对于大型语言模型代理导航长期任务至关重要。最近的研究探索了使用强化学习来开发专门的内存管理代理。然而，现有方法依赖最终任务表现作为主要奖励，这导致严重的奖励稀疏和无效的信用分配，为个体记忆操作提供不足的指导。为此，我们提出了 Fine-Mem，一个专为细粒度反馈对齐而设计的统一框架。首先，我们引入了块级步骤奖励，通过辅助块特定的问答任务提供即时的步骤级监督。其次，我们设计了证据锚定奖励归因，根据在推理中用作证据的特定记忆项目，通过将信用锚定到关键记忆操作来重新分配全局奖励。这些组件共同实现了稳定的策略优化，并使本地内存操作与内存的长期效用保持一致。 Memalpha 和 MemoryAgentBench 上的实验表明，Fine-Mem 始终优于强大的基线，在各种子任务中实现了卓越的成功率。进一步的分析揭示了它在不同模型配置和主干上的适应性和强大的泛化能力。

Title: JudgeRLVR: Judge First, Generate Second for Efficient Reasoning

Authors: Jiangshan Duo, Hanyu Li, Hailin Zhang, Yudong Wang, Sujian Li, Liang Zhao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.08468
Pdf URL: https://arxiv.org/pdf/2601.08468
Copy Paste: [[2601.08468]] JudgeRLVR: Judge First, Generate Second for Efficient Reasoning(https://arxiv.org/abs/2601.08468)
Keywords: language model
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for reasoning in Large Language Models. However, optimizing solely for final-answer correctness often drives models into aimless, verbose exploration, where they rely on exhaustive trial-and-error tactics rather than structured planning to reach solutions. While heuristic constraints like length penalties can reduce verbosity, they often truncate essential reasoning steps, creating a difficult trade-off between efficiency and verification. In this paper, we argue that discriminative capability is a prerequisite for efficient generation: by learning to distinguish valid solutions, a model can internalize a guidance signal that prunes the search space. We propose JudgeRLVR, a two-stage judge-then-generate paradigm. In the first stage, we train the model to judge solution responses with verifiable answers. In the second stage, we fine-tune the same model with vanilla generating RLVR initialized from the judge. Compared to Vanilla RLVR using the same math-domain training data, JudgeRLVR achieves a better quality-efficiency trade-off for Qwen3-30B-A3B: on in-domain math, it delivers about +3.7 points average accuracy gain with -42% average generation length; on out-of-domain benchmarks, it delivers about +4.5 points average accuracy improvement, demonstrating enhanced generalization.
摘要：具有可验证奖励的强化学习（RLVR）已成为大型语言模型中推理的标准范例。然而，仅仅针对最终答案的正确性进行优化通常会导致模型陷入漫无目的、冗长的探索，它们依赖于详尽的试错策略，而不是结构化的规划来达成解决方案。虽然像长度惩罚这样的启发式约束可以减少冗长，但它们通常会截断基本的推理步骤，从而在效率和验证之间造成困难的权衡。在本文中，我们认为判别能力是高效生成的先决条件：通过学习区分有效解决方案，模型可以内化修剪搜索空间的指导信号。我们提出 JudgeRLVR，一种两阶段判断然后生成范例。在第一阶段，我们训练模型来判断具有可验证答案的解决方案响应。在第二阶段，我们使用从法官初始化的普通生成 RLVR 来微调相同的模型。与使用相同数学领域训练数据的 Vanilla RLVR 相比，JudgeRLVR 为 Qwen3-30B-A3B 实现了更好的质量-效率权衡：在域内数学方面，它提供了约 +3.7 点的平均准确度增益，平均生成长度为 -42%；在域外基准测试中，它的平均准确度提高了约 4.5 分，展示了增强的泛化能力。

Title: sui-1: Grounded and Verifiable Long-Form Summarization

Authors: Benedikt Droste, Jan Philipp Harries, Maximilian Idahl, Björn Plüster
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.08472
Pdf URL: https://arxiv.org/pdf/2601.08472
Copy Paste: [[2601.08472]] sui-1: Grounded and Verifiable Long-Form Summarization(https://arxiv.org/abs/2601.08472)
Keywords: language model, prompt, chain-of-thought
Abstract: Large language models frequently generate plausible but unfaithful summaries that users cannot verify against source text, a critical limitation in compliance-sensitive domains such as government and legal analysis. We present sui-1, a 24B parameter model that produces abstractive summaries with inline citations, enabling users to trace each claim to its source sentence. Our synthetic data pipeline combines chain-of-thought prompting with multi-stage verification, generating over 22,000 high-quality training examples across five languages from diverse sources including parliamentary documents, web text, and Wikipedia. Evaluation shows sui-1 significantly outperforms all tested open-weight baselines, including models with 3x more parameters. These results demonstrate that task-specific training substantially outperforms scale alone for citation-grounded summarization. Model weights and an interactive demo are publicly available.
摘要：大型语言模型经常生成看似合理但不忠实的摘要，用户无法根据源文本进行验证，这是政府和法律分析等合规性敏感领域的一个关键限制。我们提出了 sui-1，这是一个 24B 参数模型，它可以生成带有内联引用的抽象摘要，使用户能够将每个声明追溯到其源语句。我们的合成数据管道将思想链提示与多阶段验证相结合，从议会文件、网络文本和维基百科等不同来源生成了跨五种语言的 22,000 多个高质量训练示例。评估显示 sui-1 显着优于所有测试的开放权重基线，包括参数增加 3 倍的模型。这些结果表明，针对基于引文的摘要，针对特定任务的训练大大优于单独的规模训练。模型权重和交互式演示是公开的。

Title: Do You Understand How I Feel?: Towards Verified Empathy in Therapy Chatbots

Authors: Francesco Dettori, Matteo Forasassi, Lorenzo Veronese, Livia Lestingi, Vincenzo Scotti, Matteo Giovanni Rossi
Subjects: cs.CL, cs.HC, cs.SE
Abstract URL: https://arxiv.org/abs/2601.08477
Pdf URL: https://arxiv.org/pdf/2601.08477
Copy Paste: [[2601.08477]] Do You Understand How I Feel?: Towards Verified Empathy in Therapy Chatbots(https://arxiv.org/abs/2601.08477)
Keywords: chat, agent
Abstract: Conversational agents are increasingly used as support tools along mental therapeutic pathways with significant societal impacts. In particular, empathy is a key non-functional requirement in therapeutic contexts, yet current chatbot development practices provide no systematic means to specify or verify it. This paper envisions a framework integrating natural language processing and formal verification to deliver empathetic therapy chatbots. A Transformer-based model extracts dialogue features, which are then translated into a Stochastic Hybrid Automaton model of dyadic therapy sessions. Empathy-related properties can then be verified through Statistical Model Checking, while strategy synthesis provides guidance for shaping agent behavior. Preliminary results show that the formal model captures therapy dynamics with good fidelity and that ad-hoc strategies improve the probability of satisfying empathy requirements.
摘要：对话代理越来越多地用作心理治疗途径的支持工具，具有重大的社会影响。特别是，同理心是治疗环境中的关键非功能性要求，但当前的聊天机器人开发实践没有提供系统方法来指定或验证它。本文设想了一个集成自然语言处理和形式验证的框架，以提供移情治疗聊天机器人。基于 Transformer 的模型提取对话特征，然后将其转换为二元治疗会话的随机混合自动机模型。然后可以通过统计模型检查来验证与同理心相关的属性，而策略合成则为塑造代理行为提供指导。初步结果表明，正式模型以良好的保真度捕获了治疗动态，并且临时策略提高了满足同理心要求的可能性。

Title: Surgical Refusal Ablation: Disentangling Safety from Intelligence via Concept-Guided Spectral Cleaning

Authors: Tony Cristofano
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.08489
Pdf URL: https://arxiv.org/pdf/2601.08489
Copy Paste: [[2601.08489]] Surgical Refusal Ablation: Disentangling Safety from Intelligence via Concept-Guided Spectral Cleaning(https://arxiv.org/abs/2601.08489)
Keywords: language model, prompt
Abstract: Safety-aligned language models systematically refuse harmful requests. While activation steering can modulate refusal, ablating the raw "refusal vector" calculated from contrastive harmful and harmless prompts often causes collateral damage and distribution drift. We argue this degradation occurs because the raw vector is polysemantic, entangling the refusal signal with core capability circuits and linguistic style. We introduce Surgical Refusal Ablation (SRA) to distill these steering directions. SRA constructs a registry of independent Concept Atoms representing protected capabilities and stylistic confounds, then uses ridge-regularized spectral residualization to orthogonalize the refusal vector against these directions. This yields a clean refusal direction that targets refusal-relevant structure while minimizing disruption to the model's semantic geometry. Across five models (Qwen3-VL and Ministral series), SRA achieves deep refusal reduction (0-2%) with negligible perplexity impact on Wikitext-2 (mean delta PPL approx. 0.02) and minimal distribution drift. Notably, standard ablation on Qwen3-VL-4B induces severe drift (first-token KL = 2.088), whereas SRA maintains the original distribution (KL = 0.044) while achieving the same 0% refusal rate. Using teacher-forced perplexity on GSM8K and MBPP as a high-resolution capability proxy, we show SRA preserves math and code distributions. These results suggest that common "model damage" is often "Ghost Noise," defined as the spectral bleeding of the dirty refusal direction into capability subspaces.
摘要：安全一致的语言模型系统地拒绝有害的请求。虽然激活引导可以调节拒绝，但消除根据有害和无害提示对比计算出的原始“拒绝向量”通常会导致附带损害和分布漂移。我们认为这种退化的发生是因为原始向量是多语义的，将拒绝信号与核心能力电路和语言风格纠缠在一起。我们引入手术拒绝消融（SRA）来提炼这些转向方向。 SRA 构建了代表受保护功能和风格混杂的独立概念原子的注册表，然后使用岭正则化谱残差使拒绝向量与这些方向正交。这产生了一个明确的拒绝方向，以拒绝相关结构为目标，同时最大限度地减少对模型语义几何的破坏。在五个模型（Qwen3-VL 和 Ministral 系列）中，SRA 实现了深度拒绝减少 (0-2%)，对 Wikitext-2 的困惑度影响可以忽略不计（平均 delta PPL 约为 0.02），并且分布漂移最小。值得注意的是，Qwen3-VL-4B 上的标准消融会引起严重的漂移（第一个标记 KL = 2.088），而 SRA 保持原始分布（KL = 0.044），同时实现相同的 0% 拒绝率。使用 GSM8K 和 MBPP 上教师强制的困惑作为高分辨率能力代理，我们展示了 SRA 保留了数学和代码分布。这些结果表明，常见的“模型损坏”通常是“幽灵噪声”，定义为脏拒绝方向的频谱渗入到能力子空间中。

Title: BenchOverflow: Measuring Overflow in Large Language Models via Plain-Text Prompts

Authors: Erin Feiglin, Nir Hutnik, Raz Lapid
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.08490
Pdf URL: https://arxiv.org/pdf/2601.08490
Copy Paste: [[2601.08490]] BenchOverflow: Measuring Overflow in Large Language Models via Plain-Text Prompts(https://arxiv.org/abs/2601.08490)
Keywords: language model, llm, prompt
Abstract: We investigate a failure mode of large language models (LLMs) in which plain-text prompts elicit excessive outputs, a phenomenon we term Overflow. Unlike jailbreaks or prompt injection, Overflow arises under ordinary interaction settings and can lead to elevated serving cost, latency, and cross-user performance degradation, particularly when scaled across many requests. Beyond usability, the stakes are economic and environmental: unnecessary tokens increase per-request cost and energy consumption, compounding into substantial operational spend and carbon footprint at scale. Moreover, Overflow represents a practical vector for compute amplification and service degradation in shared environments. We introduce BenchOverflow, a model-agnostic benchmark of nine plain-text prompting strategies that amplify output volume without adversarial suffixes or policy circumvention. Using a standardized protocol with a fixed budget of 5000 new tokens, we evaluate nine open- and closed-source models and observe pronounced rightward shifts and heavy tails in length distributions. Cap-saturation rates (CSR@1k/3k/5k) and empirical cumulative distribution functions (ECDFs) quantify tail risk; within-prompt variance and cross-model correlations show that Overflow is broadly reproducible yet heterogeneous across families and attack vectors. A lightweight mitigation, a fixed conciseness reminder, attenuates right tails and lowers CSR for all strategies across the majority of models. Our findings position length control as a measurable reliability, cost, and sustainability concern rather than a stylistic quirk. By enabling standardized comparison of length-control robustness across models, BenchOverflow provides a practical basis for selecting deployments that minimize resource waste and operating expense, and for evaluating defenses that curb compute amplification without eroding task performance.
摘要：我们研究了大型语言模型 (LLM) 的故障模式，其中纯文本提示会引发过多的输出，我们将这种现象称为溢出。与越狱或提示注入不同，溢出是在普通交互设置下出现的，可能会导致服务成本升高、延迟和跨用户性能下降，特别是在跨多个请求扩展时。除了可用性之外，风险还在于经济和环境：不必要的代币会增加每个请求的成本和能源消耗，从而大规模增加大量运营支出和碳足迹。此外，溢出代表了共享环境中计算放大和服务降级的实用向量。我们引入了 BenchOverflow，这是一个与模型无关的基准，包含九种纯文本提示策略，可以在没有对抗性后缀或策略规避的情况下放大输出量。使用固定预算为 5000 个新代币的标准化协议，我们评估了九个开源和闭源模型，并观察到长度分布中明显的右移和重尾。资本饱和率 (CSR@1k/3k/5k) 和经验累积分布函数 (ECDF) 量化尾部风险；提示内方差和跨模型相关性表明，溢出具有广泛的可重复性，但在不同的家族和攻击向量中却是异构的。轻量级缓解措施（固定的简洁性提醒）会削弱大多数模型中所有策略的右尾并降低 CSR。我们的研究结果将长度控制视为可衡量的可靠性、成本和可持续性问题，而不是一种风格上的怪癖。通过对跨模型的长度控制鲁棒性进行标准化比较，BenchOverflow 为选择最大限度地减少资源浪费和运营费用的部署以及评估在不影响任务性能的情况下抑制计算放大的防御措施提供了实用的基础。
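
The two tail-risk measures named in the abstract can be sketched as follows. The abstract does not define CSR@k precisely, so this sketch assumes it is the fraction of generations whose length reaches at least k new tokens; the function names are illustrative.

```python
from typing import Sequence

def cap_saturation_rate(lengths: Sequence[int], cap: int) -> float:
    """Assumed CSR@cap: share of outputs with at least `cap` new tokens."""
    if not lengths:
        raise ValueError("need at least one output length")
    return sum(1 for n in lengths if n >= cap) / len(lengths)

def ecdf(lengths: Sequence[int], x: int) -> float:
    """Empirical CDF: share of outputs with length at most `x`."""
    if not lengths:
        raise ValueError("need at least one output length")
    return sum(1 for n in lengths if n <= x) / len(lengths)

# Hypothetical output lengths under a 5000-token budget.
lengths = [200, 1000, 3500, 5000, 5000]
csr = {k: cap_saturation_rate(lengths, k) for k in (1000, 3000, 5000)}
```

A heavy right tail shows up as a CSR@5k that stays well above zero, meaning many generations run into the budget cap instead of stopping on their own.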

Title: It's All About the Confidence: An Unsupervised Approach for Multilingual Historical Entity Linking using Large Language Models

Authors: Cristian Santini, Marieke Van Erp, Mehwish Alam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.08500
Pdf URL: https://arxiv.org/pdf/2601.08500
Copy Paste: [[2601.08500]] It's All About the Confidence: An Unsupervised Approach for Multilingual Historical Entity Linking using Large Language Models(https://arxiv.org/abs/2601.08500)
Keywords: language model, llm, hallucination, prompt
Abstract: Despite the recent advancements in NLP with the advent of Large Language Models (LLMs), Entity Linking (EL) for historical texts remains challenging due to linguistic variation, noisy inputs, and evolving semantic conventions. Existing solutions either require substantial training data or rely on domain-specific rules that limit scalability. In this paper, we present MHEL-LLaMo (Multilingual Historical Entity Linking with Large Language MOdels), an unsupervised ensemble approach combining a Small Language Model (SLM) and an LLM. MHEL-LLaMo leverages a multilingual bi-encoder (BELA) for candidate retrieval and an instruction-tuned LLM for NIL prediction and candidate selection via prompt chaining. Our system uses the SLM's confidence scores to discriminate between easy and hard samples, applying an LLM only for hard cases. This strategy reduces computational costs while preventing hallucinations on straightforward cases. We evaluate MHEL-LLaMo on four established benchmarks in six European languages (English, Finnish, French, German, Italian and Swedish) from the 19th and 20th centuries. Results demonstrate that MHEL-LLaMo outperforms state-of-the-art models without requiring fine-tuning, offering a scalable solution for low-resource historical EL. The implementation of MHEL-LLaMo is available on GitHub.
摘要：尽管最近随着大型语言模型 (LLM) 的出现，自然语言处理 (NLP) 取得了进步，但由于语言变化、噪声输入和不断变化的语义约定，历史文本的实体链接 (EL) 仍然具有挑战性。现有的解决方案要么需要大量的训练数据，要么依赖于限制可扩展性的特定领域规则。在本文中，我们提出了 MHEL-LLaMo（与大语言模型链接的多语言历史实体），这是一种结合了小语言模型（SLM）和 LLM 的无监督集成方法。 MHEL-LLaMo 利用多语言双编码器 (BELA) 进行候选检索，并利用指令调整的 LLM 通过提示链接进行 NIL 预测和候选选择。我们的系统使用 SLM 的置信度分数来区分简单样本和困难样本，仅对困难样本应用 LLM。这种策略降低了计算成本，同时防止对简单情况的幻觉。我们根据 19 世纪和 20 世纪六种欧洲语言（英语、芬兰语、法语、德语、意大利语和瑞典语）的四个既定基准对 MHEL-LLaMo 进行评估。结果表明，MHEL-LLaMo 的性能优于最先进的模型，无需微调，为低资源历史 EL 提供了可扩展的解决方案。 MHEL-LLaMo 的实现可在 Github 上找到。
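
The confidence-based routing described above (accept the small model's answer on easy samples, call the LLM only on hard ones) might look roughly like this. It is a sketch under assumptions: the threshold value, the `llm_select` callback, and the entity ids are hypothetical, and the real system's prompt chaining is not modelled.

```python
from typing import Callable, List, Optional, Tuple

Candidate = Tuple[str, float]  # (entity_id, retriever confidence in [0, 1])

def link_mention(
    mention: str,
    candidates: List[Candidate],
    llm_select: Callable[[str, List[str]], Optional[str]],
    threshold: float = 0.9,  # hypothetical cut-off between easy and hard
) -> Optional[str]:
    """Return an entity id, or None for NIL (no matching entity)."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    if ranked and ranked[0][1] >= threshold:
        # Easy case: trust the small model's top candidate, skip the LLM.
        return ranked[0][0]
    # Hard case: the LLM picks among the candidates or predicts NIL.
    return llm_select(mention, [entity_id for entity_id, _ in ranked])
```

Because the LLM is only called below the threshold, its cost scales with the share of hard mentions rather than with the size of the corpus.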

Title: STAGE: A Benchmark for Knowledge Graph Construction, Question Answering, and In-Script Role-Playing over Movie Screenplays

Authors: Qiuyu Tian, Yiding Li, Fengyi Chen, Zequn Liu, Youyong Kong, Fan Guo, Yuyao Li, Jinjing Shen, Zhijing Xie, Yiyun Luo, Xin Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.08510
Pdf URL: https://arxiv.org/pdf/2601.08510
Copy Paste: [[2601.08510]] STAGE: A Benchmark for Knowledge Graph Construction, Question Answering, and In-Script Role-Playing over Movie Screenplays(https://arxiv.org/abs/2601.08510)
Keywords: agent
Abstract: Movie screenplays are rich long-form narratives that interleave complex character relationships, temporally ordered events, and dialogue-driven interactions. While prior benchmarks target individual subtasks such as question answering or dialogue generation, they rarely evaluate whether models can construct a coherent story world and use it consistently across multiple forms of reasoning and generation. We introduce STAGE (Screenplay Text, Agents, Graphs and Evaluation), a unified benchmark for narrative understanding over full-length movie screenplays. STAGE defines four tasks: knowledge graph construction, scene-level event summarization, long-context screenplay question answering, and in-script character role-playing, all grounded in a shared narrative world representation. The benchmark provides cleaned scripts, curated knowledge graphs, and event- and character-centric annotations for 150 films across English and Chinese, enabling holistic evaluation of models' abilities to build world representations, abstract and verify narrative events, reason over long narratives, and generate character-consistent responses.
摘要：电影剧本是丰富的长篇叙事，交织着复杂的人物关系、时间顺序的事件和对话驱动的互动。虽然之前的基准测试针对的是单个子任务，例如回答问题或对话生成，但它们很少评估模型是否可以构建一个连贯的故事世界，并在多种形式的推理和生成中一致地使用它。我们引入 STAGE（剧本文本、代理、图表和评估），这是对长电影剧本进行叙事理解的统一基准。 STAGE 定义了四个任务：知识图谱构建、场景级事件总结、长上下文剧本问答和脚本内角色扮演，所有这些任务都基于共享的叙事世界表示。该基准测试为 150 部中英电影提供了干净的剧本、精心策划的知识图以及以事件和人物为中心的注释，从而能够全面评估模型构建世界表征、抽象和验证叙事事件、推理长篇叙事以及生成人物一致反应的能力。

Title: STAR: Detecting Inference-time Backdoors in LLM Reasoning via State-Transition Amplification Ratio

Authors: Seong-Gyu Park, Sohee Park, Jisu Lee, Hyunsik Na, Daeseon Choi
Subjects: cs.CL, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2601.08511
Pdf URL: https://arxiv.org/pdf/2601.08511
Copy Paste: [[2601.08511]] STAR: Detecting Inference-time Backdoors in LLM Reasoning via State-Transition Amplification Ratio(https://arxiv.org/abs/2601.08511)
Keywords: llm, chain-of-thought
Abstract: Recent LLMs increasingly integrate reasoning mechanisms like Chain-of-Thought (CoT). However, this explicit reasoning exposes a new attack surface for inference-time backdoors, which inject malicious reasoning paths without altering model parameters. Because these attacks generate linguistically coherent paths, they effectively evade conventional detection. To address this, we propose STAR (State-Transition Amplification Ratio), a framework that detects backdoors by analyzing output probability shifts. STAR exploits the statistical discrepancy where a malicious input-induced path exhibits high posterior probability despite a low prior probability in the model's general knowledge. We quantify this state-transition amplification and employ the CUSUM algorithm to detect persistent anomalies. Experiments across diverse models (8B-70B) and five benchmark datasets demonstrate that STAR exhibits robust generalization capabilities, consistently achieving near-perfect performance (AUROC ≈ 1.0) with approximately 42× greater efficiency than existing baselines. Furthermore, the framework proves robust against adaptive attacks attempting to bypass detection.
摘要：最近的法学硕士越来越多地整合推理机制，例如思想链（CoT）。然而，这种显式推理暴露了推理时间后门的新攻击面，它会在不改变模型参数的情况下注入恶意推理路径。由于这些攻击生成语言上连贯的路径，因此它们有效地逃避了传统的检测。为了解决这个问题，我们提出了 STAR（状态转换放大率），这是一种通过分析输出概率变化来检测后门的框架。 STAR 利用了统计差异，其中恶意输入引起的路径表现出高后验概率，尽管模型常识中的先验概率较低。我们量化这种状态转换放大并采用 CUSUM 算法来检测持续异常。跨不同模型 (8B-70B) 和五个基准数据集的实验表明，STAR 表现出强大的泛化能力，始终如一地实现近乎完美的性能 (AUROC $\approx$ 1.0)，效率比现有基线高出约 $42\times$。此外，事实证明，该框架能够有效抵御试图绕过检测的自适应攻击。
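
To illustrate the detection idea, here is a minimal one-sided CUSUM detector over per-step amplification scores. The CUSUM recursion itself is standard; defining the score as the log-ratio of posterior to prior probability is an assumption about how the amplification ratio could be computed, and the `drift` and `threshold` values are hypothetical.

```python
import math
from typing import Optional, Sequence

def amplification(posterior: float, prior: float) -> float:
    """Assumed per-step score: log(posterior / prior), both in (0, 1]."""
    return math.log(posterior) - math.log(prior)

def cusum_alarm(
    scores: Sequence[float], drift: float, threshold: float
) -> Optional[int]:
    """One-sided CUSUM: return the first index where the statistic
    exceeds `threshold`, or None if no alarm is raised.

    `drift` is subtracted at every step so that isolated spikes decay,
    while a persistent upward shift accumulates.
    """
    stat = 0.0
    for t, x in enumerate(scores):
        stat = max(0.0, stat + x - drift)
        if stat > threshold:
            return t
    return None
```

A single large score is absorbed by the drift term, whereas a run of elevated scores pushes the statistic over the threshold, which matches the abstract's emphasis on persistent anomalies.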

Title: DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Report

Authors: Ruizhe Li, Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, Zhendong Mao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.08536
Pdf URL: https://arxiv.org/pdf/2601.08536
Copy Paste: [[2601.08536]] DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Report(https://arxiv.org/abs/2601.08536)
Keywords: llm, agent
Abstract: Deep Research Systems (DRS) aim to help users search the web, synthesize information, and deliver comprehensive investigative reports. However, how to rigorously evaluate these systems remains under-explored. Existing deep-research benchmarks often fall into two failure modes. Some do not adequately test a system's ability to analyze evidence and write coherent reports. Others rely on evaluation criteria that are either overly coarse or directly defined by LLMs (or both), leading to scores that can be biased relative to human experts and are hard to verify or interpret. To address these issues, we introduce Deep Research Bench II, a new benchmark for evaluating DRS-generated reports. It contains 132 grounded research tasks across 22 domains; for each task, a system must produce a long-form research report that is evaluated by a set of 9430 fine-grained binary rubrics in total, covering three dimensions: information recall, analysis, and presentation. All rubrics are derived from carefully selected expert-written investigative articles and are constructed through a four-stage LLM+human pipeline that combines automatic extraction with over 400 human-hours of expert review, ensuring that the criteria are atomic, verifiable, and aligned with human expert judgment. We evaluate several state-of-the-art deep-research systems on Deep Research Bench II and find that even the strongest models satisfy fewer than 50% of the rubrics, revealing a substantial gap between current DRSs and human experts.
摘要：深度研究系统 (DRS) 旨在帮助用户搜索网络、综合信息并提供全面的调查报告。然而，如何严格评估这些系统仍有待探索。现有的深度研究基准通常陷入两种失败模式。有些没有充分测试系统分析证据和编写连贯报告的能力。其他人依赖的评估标准要么过于粗略，要么由法学硕士（或两者兼而有之）直接定义，导致分数相对于人类专家可能存在偏差，并且难以验证或解释。为了解决这些问题，我们引入了 Deep Research Bench II，这是一个用于评估 DRS 生成的报告的新基准。它包含 22 个领域的 132 项扎根研究任务；对于每项任务，系统必须生成一份长篇研究报告，该报告由总共 9430 个细粒度的二元评分标准进行评估，涵盖三个维度：信息回忆、分析和呈现。所有评分标准均源自精心挑选的专家撰写的调查文章，并通过四阶段的法学硕士+人工管道构建，该管道将自动提取与超过 400 小时的专家评审相结合，确保标准是原子的、可验证的，并与人类专家判断保持一致。我们在 Deep Research Bench II 上评估了几个最先进的深度研究系统，发现即使是最强大的模型也只能满足不到 50% 的标准，这揭示了当前 DRS 与人类专家之间的巨大差距。

Title: Ministral 3

Authors: Alexander H. Liu, Kartik Khandelwal, Sandeep Subramanian, Victor Jouault, Abhinav Rastogi, Adrien Sadé, Alan Jeffares, Albert Jiang, Alexandre Cahill, Alexandre Gavaudan, Alexandre Sablayrolles, Amélie Héliou, Amos You, Andy Ehrenberg, Andy Lo, Anton Eliseev, Antonia Calvi, Avinash Sooriyarachchi, Baptiste Bout, Baptiste Rozière, Baudouin De Monicault, Clémence Lanfranchi, Corentin Barreau, Cyprien Courtot, Daniele Grattarola, Darius Dabert, Diego de las Casas, Elliot Chane-Sane, Faruk Ahmed, Gabrielle Berrada, Gaëtan Ecrepont, Gauthier Guinet, Georgii Novikov, Guillaume Kunsch, Guillaume Lample, Guillaume Martin, Gunshi Gupta, Jan Ludziejewski, Jason Rute, Joachim Studnia, Jonas Amar, Joséphine Delas, Josselin Somerville Roberts, Karmesh Yadav, Khyathi Chandu, Kush Jain, Laurence Aitchison, Laurent Fainsin, Léonard Blier, Lingxiao Zhao, Louis Martin, Lucile Saulnier, Luyu Gao, Maarten Buyl, Margaret Jennings, Marie Pellat, Mark Prins, Mathieu Poirée, Mathilde Guillaumin, Matthieu Dinot, Matthieu Futeral, Maxime Darrin, Maximilian Augustin, Mia Chiquier, Michel Schimpf, Nathan Grinsztajn, Neha Gupta, Nikhil Raghuraman, Olivier Bousquet, Olivier Duchenne, Patricia Wang, Patrick von Platen, Paul Jacob, Paul Wambergue, Paula Kurylowicz, Pavankumar Reddy Muddireddy, Philomène Chagniot, Pierre Stock, Pravesh Agrawal, Quentin Torroba, Romain Sauvestre, Roman Soletskyi, Rupert Menneer, Sagar Vaze, Samuel Barry, Sanchit Gandhi, Siddhant Waghjale, Siddharth Gandhi, Soham Ghosh, Srijan Mishra, Sumukh Aithal, Szymon Antoniak, Teven Le Scao, Théo Cachet, Theo Simon Sorg, Thibaut Lavril, Thiziri Nait Saada, Thomas Chabal, Thomas Foubert, Thomas Robert
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.08584
Pdf URL: https://arxiv.org/pdf/2601.08584
Copy Paste: [[2601.08584]] Ministral 3(https://arxiv.org/abs/2601.08584)
Keywords: language model
Abstract: We introduce the Ministral 3 series, a family of parameter-efficient dense language models designed for compute- and memory-constrained applications, available in three model sizes: 3B, 8B, and 14B parameters. For each model size, we release three variants: a pretrained base model for general-purpose use, an instruction-finetuned model, and a reasoning model for complex problem-solving. In addition, we present our recipe for deriving the Ministral 3 models through Cascade Distillation, an iterative technique combining pruning and continued training with distillation. Each model comes with image understanding capabilities, and all are released under the Apache 2.0 license.

Title: ExpSeek: Self-Triggered Experience Seeking for Web Agents

Authors: Wenyuan Zhang, Xinghua Zhang, Haiyang Yu, Shuaiyi Nie, Bingli Wu, Juwei Yue, Tingwen Liu, Yongbin Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.08605
Pdf URL: https://arxiv.org/pdf/2601.08605
Copy Paste: [[2601.08605]] ExpSeek: Self-Triggered Experience Seeking for Web Agents(https://arxiv.org/abs/2601.08605)
Keywords: agent
Abstract: Experience intervention in web agents is emerging as a promising technical paradigm, enhancing agent interaction capabilities by providing valuable insights from accumulated experiences. However, existing methods predominantly inject experience passively as global context before task execution and struggle to adapt to dynamically changing contextual observations during agent-environment interaction. We propose ExpSeek, which shifts experience toward step-level proactive seeking by (1) estimating step-level entropy thresholds to determine intervention timing using the model's intrinsic signals, and (2) designing experience content tailored to each step. Experiments on Qwen3-8B and 32B models across four challenging web agent benchmarks demonstrate that ExpSeek achieves absolute improvements of 9.3% and 7.5%, respectively. Our experiments validate the feasibility and advantages of entropy as a self-triggering signal and reveal that even a small-scale 4B experience model can significantly boost the performance of larger agent models.
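The entropy-based trigger described in the ExpSeek abstract can be illustrated with a minimal sketch. It assumes the agent exposes a next-token probability distribution for every token generated in a step; the function names, the mean-over-tokens aggregation, and the threshold value are hypothetical choices for illustration, not details taken from the paper.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def step_entropy(step_distributions: Sequence[Sequence[float]]) -> float:
    """Mean token entropy over all tokens generated in one agent step (assumed aggregation)."""
    if not step_distributions:
        return 0.0
    total = sum(token_entropy(dist) for dist in step_distributions)
    return total / len(step_distributions)


def should_seek_experience(step_distributions: Sequence[Sequence[float]], threshold: float) -> bool:
    """Trigger experience seeking only when the step entropy exceeds the threshold."""
    return step_entropy(step_distributions) > threshold


# Toy distributions: a confident step and a maximally uncertain step.
confident = [[0.97, 0.01, 0.01, 0.01]] * 3
uncertain = [[0.25, 0.25, 0.25, 0.25]] * 3
print(should_seek_experience(confident, threshold=0.5))
print(should_seek_experience(uncertain, threshold=0.5))
```

In this toy setup only the uncertain step crosses the threshold, so experience would be requested there and skipped on the confident step.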

Title: GraphSearch: Agentic Search-Augmented Reasoning for Zero-Shot Graph Learning

Authors: Jiajin Liu, Yuanfu Sun, Dongzhe Fan, Qiaoyu Tan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.08621
Pdf URL: https://arxiv.org/pdf/2601.08621
Copy Paste: [[2601.08621]] GraphSearch: Agentic Search-Augmented Reasoning for Zero-Shot Graph Learning(https://arxiv.org/abs/2601.08621)
Keywords: hallucination, agent
Abstract: Recent advances in search-augmented large reasoning models (LRMs) enable the retrieval of external knowledge to reduce hallucinations in multistep reasoning. However, their ability to operate on graph-structured data, prevalent in domains such as e-commerce, social networks, and scientific citations, remains underexplored. Unlike plain text corpora, graphs encode rich topological signals that connect related entities and can serve as valuable priors for retrieval, enabling more targeted search and improved reasoning efficiency. Yet, effectively leveraging such structure poses unique challenges, including the difficulty of generating graph-expressive queries and ensuring reliable retrieval that balances structural and semantic relevance. To address this gap, we introduce GraphSearch, the first framework that extends search-augmented reasoning to graph learning, enabling zero-shot graph learning without task-specific fine-tuning. GraphSearch combines a Graph-aware Query Planner, which disentangles search space (e.g., 1-hop, multi-hop, or global neighbors) from semantic queries, with a Graph-aware Retriever, which constructs candidate sets based on topology and ranks them using a hybrid scoring function. We further instantiate two traversal modes: GraphSearch-R, which recursively expands neighborhoods hop by hop, and GraphSearch-F, which flexibly retrieves across local and global neighborhoods without hop constraints. Extensive experiments across diverse benchmarks show that GraphSearch achieves competitive or even superior performance compared to supervised graph learning methods, setting state-of-the-art results in zero-shot node classification and link prediction. These findings position GraphSearch as a flexible and generalizable paradigm for agentic reasoning over graphs.

Title: How Order-Sensitive Are LLMs? OrderProbe for Deterministic Structural Reconstruction

Authors: Yingjie He, Zhaolu Kang, Kehan Jiang, Qianyuan Zhang, Jiachen Qian, Chunlei Meng, Yujie Feng, Yuan Wang, Jiabao Dou, Aming Wu, Leqi Zheng, Pengxiang Zhao, Jiaxin Liu, Zeyu Zhang, Lei Wang, Guansu Wang, Qishi Zhan, Xiaomin He, Meisheng Zhang, Jianyuan Ni
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.08626
Pdf URL: https://arxiv.org/pdf/2601.08626
Copy Paste: [[2601.08626]] How Order-Sensitive Are LLMs? OrderProbe for Deterministic Structural Reconstruction(https://arxiv.org/abs/2601.08626)
Keywords: language model, llm
Abstract: Large language models (LLMs) excel at semantic understanding, yet their ability to reconstruct internal structure from scrambled inputs remains underexplored. Sentence-level restoration is ill-posed for automated evaluation because multiple valid word orders often exist. We introduce OrderProbe, a deterministic benchmark for structural reconstruction using fixed four-character expressions in Chinese, Japanese, and Korean, which have a unique canonical order and thus support exact-match scoring. We further propose a diagnostic framework that evaluates models beyond recovery accuracy, including semantic fidelity, logical validity, consistency, robustness sensitivity, and information density. Experiments on twelve widely used LLMs show that structural reconstruction remains difficult even for frontier systems: zero-shot recovery frequently falls below 35%. We also observe a consistent dissociation between semantic recall and structural planning, suggesting that structural robustness is not an automatic byproduct of semantic competence.
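Because each four-character expression has a single canonical order, scoring reduces to exact string match. The sketch below shows that idea under stated assumptions: `scramble` and `exact_match_accuracy` are hypothetical helper names, and the two idioms are illustrative examples rather than items from the OrderProbe dataset.

```python
import random
from typing import Sequence


def scramble(expression: str, rng: random.Random) -> str:
    """Return a permutation of the characters that differs from the original order."""
    chars = list(expression)
    if len(set(chars)) < 2:
        return expression  # a single repeated character cannot be reordered
    while True:
        rng.shuffle(chars)
        candidate = "".join(chars)
        if candidate != expression:
            return candidate


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that reproduce the canonical expression exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(pred.strip() == ref for pred, ref in zip(predictions, references))
    return hits / len(references)


rng = random.Random(0)
references = ["画蛇添足", "守株待兔"]
prompts = [scramble(ref, rng) for ref in references]  # scrambled inputs shown to a model
predictions = ["画蛇添足", "待兔守株"]  # hypothetical model outputs
print(exact_match_accuracy(predictions, references))
```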

Title: Moral Lenses, Political Coordinates: Towards Ideological Positioning of Morally Conditioned LLMs

Authors: Chenchen Yuan, Bolei Ma, Zheyu Zhang, Bardh Prenkaj, Frauke Kreuter, Gjergji Kasneci
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.08634
Pdf URL: https://arxiv.org/pdf/2601.08634
Copy Paste: [[2601.08634]] Moral Lenses, Political Coordinates: Towards Ideological Positioning of Morally Conditioned LLMs(https://arxiv.org/abs/2601.08634)
Keywords: language model, llm
Abstract: While recent research has systematically documented political orientation in large language models (LLMs), existing evaluations rely primarily on direct probing or demographic persona engineering to surface ideological biases. In social psychology, however, political ideology is also understood as a downstream consequence of fundamental moral intuitions. In this work, we investigate the causal relationship between moral values and political positioning by treating moral orientation as a controllable condition. Rather than simply assigning a demographic persona, we condition models to endorse or reject specific moral values and evaluate the resulting shifts on their political orientations, using the Political Compass Test. By treating moral values as lenses, we observe how moral conditioning actively steers model trajectories across economic and social dimensions. Our findings show that such conditioning induces pronounced, value-specific shifts in models' political coordinates. We further notice that these effects are systematically modulated by role framing and model scale, and are robust across alternative assessment instruments instantiating the same moral value. This highlights that effective alignment requires anchoring political assessments within the context of broader social values including morality, paving the way for more socially grounded alignment techniques.

Title: RULERS: Locked Rubrics and Evidence-Anchored Scoring for Robust LLM Evaluation

Authors: Yihan Hong, Huaiyuan Yao, Bolin Shen, Wanpeng Xu, Hua Wei, Yushun Dong
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.08654
Pdf URL: https://arxiv.org/pdf/2601.08654
Copy Paste: [[2601.08654]] RULERS: Locked Rubrics and Evidence-Anchored Scoring for Robust LLM Evaluation(https://arxiv.org/abs/2601.08654)
Keywords: llm, prompt
Abstract: The LLM-as-a-Judge paradigm promises scalable rubric-based evaluation, yet aligning frozen black-box models with human standards remains a challenge due to inherent generation stochasticity. We reframe judge alignment as a criteria transfer problem and isolate three recurrent failure modes: rubric instability caused by prompt sensitivity, unverifiable reasoning that lacks auditable evidence, and scale misalignment with human grading boundaries. To address these issues, we introduce RULERS (Rubric Unification, Locking, and Evidence-anchored Robust Scoring), a compiler-executor framework that transforms natural language rubrics into executable specifications. RULERS operates by compiling criteria into versioned immutable bundles, enforcing structured decoding with deterministic evidence verification, and applying lightweight Wasserstein-based post-hoc calibration, all without updating model parameters. Extensive experiments on essay and summarization benchmarks demonstrate that RULERS significantly outperforms representative baselines in human agreement, maintains strong stability against adversarial rubric perturbations, and enables smaller models to rival larger proprietary judges. Overall, our results suggest that reliable LLM judging requires executable rubrics, verifiable evidence, and calibrated scales rather than prompt phrasing alone. Code is available at this https URL.
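The abstract does not spell out the Wasserstein-based calibration step. One simple way to picture it: for one-dimensional score distributions, the transport map that minimizes Wasserstein distance is the monotone quantile mapping. The sketch below implements that generic mapping on empirical samples; `fit_quantile_map` and the toy scores are assumptions for illustration and are not the RULERS implementation.

```python
from bisect import bisect_right
from typing import Callable, Sequence


def fit_quantile_map(judge_scores: Sequence[float], human_scores: Sequence[float]) -> Callable[[float], float]:
    """Build a monotone map from the judge's score scale onto the human score scale."""
    if not judge_scores or not human_scores:
        raise ValueError("both score samples must be non-empty")
    judge_sorted = sorted(judge_scores)
    human_sorted = sorted(human_scores)
    n, m = len(judge_sorted), len(human_sorted)

    def calibrate(score: float) -> float:
        # Number of judge scores <= score, i.e. n times the empirical CDF.
        rank = bisect_right(judge_sorted, score)
        # Index of the human score at the same quantile: ceil(rank * m / n) - 1.
        idx = (rank * m + n - 1) // n - 1
        return human_sorted[min(m - 1, max(0, idx))]

    return calibrate


# Toy data: the judge grades on a compressed, shifted scale relative to humans.
calibrate = fit_quantile_map([6, 7, 7, 8, 9], [2, 3, 3, 4, 5])
print([calibrate(s) for s in (6, 7, 8, 9)])
```

The mapping only rescales scores after the fact, which is consistent with the abstract's point that calibration happens without updating model parameters.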

Title: Analyzing Bias in False Refusal Behavior of Large Language Models for Hate Speech Detoxification

Authors: Kyuri Im, Shuzhou Yuan, Michael Färber
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.08668
Pdf URL: https://arxiv.org/pdf/2601.08668
Copy Paste: [[2601.08668]] Analyzing Bias in False Refusal Behavior of Large Language Models for Hate Speech Detoxification(https://arxiv.org/abs/2601.08668)
Keywords: language model, llm, prompt
Abstract: While large language models (LLMs) have increasingly been applied to hate speech detoxification, the prompts often trigger safety alerts, causing LLMs to refuse the task. In this study, we systematically investigate false refusal behavior in hate speech detoxification and analyze the contextual and linguistic biases that trigger such refusals. We evaluate nine LLMs on both English and multilingual datasets. Our results show that LLMs disproportionately refuse inputs with higher semantic toxicity and those targeting specific groups, particularly nationality, religion, and political ideology. Although multilingual datasets exhibit lower overall false refusal rates than English datasets, models still display systematic, language-dependent biases toward certain targets. Based on these findings, we propose a simple cross-translation strategy, translating English hate speech into Chinese for detoxification and back, which substantially reduces false refusals while preserving the original content, providing an effective and lightweight mitigation approach.

Title: Lessons from the Field: An Adaptable Lifecycle Approach to Applied Dialogue Summarization

Authors: Kushal Chawla, Chenyang Zhu, Pengshan Cai, Sangwoo Cho, Scott Novotney, Ayushman Singh, Jonah Lewis, Keasha Safewright, Alfy Samuel, Erin Babinsky, Shi-Xiong Zhang, Sambit Sahu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.08682
Pdf URL: https://arxiv.org/pdf/2601.08682
Copy Paste: [[2601.08682]] Lessons from the Field: An Adaptable Lifecycle Approach to Applied Dialogue Summarization(https://arxiv.org/abs/2601.08682)
Keywords: llm, prompt, agent
Abstract: Summarization of multi-party dialogues is a critical capability in industry, enhancing knowledge transfer and operational effectiveness across many domains. However, automatically generating high-quality summaries is challenging, as the ideal summary must satisfy a set of complex, multi-faceted requirements. While summarization has received immense attention in research, prior work has primarily utilized static datasets and benchmarks, a condition rare in practical scenarios where requirements inevitably evolve. In this work, we present an industry case study on developing an agentic system to summarize multi-party interactions. We share practical insights spanning the full development lifecycle to guide practitioners in building reliable, adaptable summarization systems, as well as to inform future research, covering: 1) robust methods for evaluation despite evolving requirements and task subjectivity, 2) component-wise optimization enabled by the task decomposition inherent in an agentic architecture, 3) the impact of upstream data bottlenecks, and 4) the realities of vendor lock-in due to the poor transferability of LLM prompts.

Title: QuantEval: A Benchmark for Financial Quantitative Tasks in Large Language Models

Authors: Zhaolu Kang, Junhao Gong, Wenqing Hu, Shuo Yin, Kehan Jiang, Zhicheng Fang, Yingjie He, Chunlei Meng, Rong Fu, Dongyang Chen, Leqi Zheng, Eric Hanchen Jiang, Yunfei Feng, Yitong Leng, Junfan Zhu, Xiaoyou Chen, Xi Yang, Richeng Xuan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.08689
Pdf URL: https://arxiv.org/pdf/2601.08689
Copy Paste: [[2601.08689]] QuantEval: A Benchmark for Financial Quantitative Tasks in Large Language Models(https://arxiv.org/abs/2601.08689)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have shown strong capabilities across many domains, yet their evaluation in financial quantitative tasks remains fragmented and mostly limited to knowledge-centric question answering. We introduce QuantEval, a benchmark that evaluates LLMs across three essential dimensions of quantitative finance: knowledge-based QA, quantitative mathematical reasoning, and quantitative strategy coding. Unlike prior financial benchmarks, QuantEval integrates a CTA-style backtesting framework that executes model-generated strategies and evaluates them using financial performance metrics, enabling a more realistic assessment of quantitative coding ability. We evaluate some state-of-the-art open-source and proprietary LLMs and observe substantial gaps to human experts, particularly in reasoning and strategy coding. Finally, we conduct large-scale supervised fine-tuning and reinforcement learning experiments on domain-aligned data, demonstrating consistent improvements. We hope QuantEval will facilitate research on LLMs' quantitative finance capabilities and accelerate their practical adoption in real-world trading workflows. We additionally release the full deterministic backtesting configuration (asset universe, cost model, and metric definitions) to ensure strict reproducibility.

Title: Nationality and Region Prediction from Names: A Comparative Study of Neural Models and Large Language Models

Authors: Keito Inoshita
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.08692
Pdf URL: https://arxiv.org/pdf/2601.08692
Copy Paste: [[2601.08692]] Nationality and Region Prediction from Names: A Comparative Study of Neural Models and Large Language Models(https://arxiv.org/abs/2601.08692)
Keywords: language model, llm, prompt
Abstract: Predicting nationality from personal names has practical value in marketing, demographic research, and genealogical studies. Conventional neural models learn statistical correspondences between names and nationalities from task-specific training data, posing challenges in generalizing to low-frequency nationalities and distinguishing similar nationalities within the same region. Large language models (LLMs) have the potential to address these challenges by leveraging world knowledge acquired during pre-training. In this study, we comprehensively compare neural models and LLMs on nationality prediction, evaluating six neural models and six LLM prompting strategies across three granularity levels (nationality, region, and continent), with frequency-based stratified analysis and error analysis. Results show that LLMs outperform neural models at all granularity levels, with the gap narrowing as granularity becomes coarser. Simple machine learning methods exhibit the highest frequency robustness, while pre-trained models and LLMs show degradation for low-frequency nationalities. Error analysis reveals that LLMs tend to make "near-miss" errors, predicting the correct region even when nationality is incorrect, whereas neural models exhibit more cross-regional errors and bias toward high-frequency classes. These findings indicate that LLM superiority stems from world knowledge, model selection should consider required granularity, and evaluation should account for error quality beyond accuracy.

Title: RAGShaper: Eliciting Sophisticated Agentic RAG Skills via Automated Data Synthesis

Authors: Zhengwei Tao, Bo Li, Jialong Wu, Guochen Yan, Huanyao Zhang, Jiahao Xu, Haitao Mi, Wentao Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.08699
Pdf URL: https://arxiv.org/pdf/2601.08699
Copy Paste: [[2601.08699]] RAGShaper: Eliciting Sophisticated Agentic RAG Skills via Automated Data Synthesis(https://arxiv.org/abs/2601.08699)
Keywords: language model, retrieval-augmented generation, agent
Abstract: Agentic Retrieval-Augmented Generation (RAG) empowers large language models to autonomously plan and retrieve information for complex problem-solving. However, the development of robust agents is hindered by the scarcity of high-quality training data that reflects the noise and complexity of real-world retrieval environments. Conventional manual annotation is unscalable and often fails to capture the dynamic reasoning strategies required to handle retrieval failures. To bridge this gap, we introduce RAGShaper, a novel data synthesis framework designed to automate the construction of RAG tasks and robust agent trajectories. RAGShaper incorporates an InfoCurator to build dense information trees enriched with adversarial distractors spanning Perception and Cognition levels. Furthermore, we propose a constrained navigation strategy that forces a teacher agent to confront these distractors, thereby eliciting trajectories that explicitly demonstrate error correction and noise rejection. Comprehensive experiments confirm that models trained on our synthesized corpus significantly outperform existing baselines, exhibiting superior robustness in noise-intensive and complex retrieval tasks.

Title: PrivGemo: Privacy-Preserving Dual-Tower Graph Retrieval for Empowering LLM Reasoning with Memory Augmentation

Authors: Xingyu Tan, Xiaoyang Wang, Qing Liu, Xiwei Xu, Xin Yuan, Liming Zhu, Wenjie Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.08739
Pdf URL: https://arxiv.org/pdf/2601.08739
Copy Paste: [[2601.08739]] PrivGemo: Privacy-Preserving Dual-Tower Graph Retrieval for Empowering LLM Reasoning with Memory Augmentation(https://arxiv.org/abs/2601.08739)
Keywords: language model, gpt, llm
Abstract: Knowledge graphs (KGs) provide structured evidence that can ground large language model (LLM) reasoning for knowledge-intensive question answering. However, many practical KGs are private, and sending retrieved triples or exploration traces to closed-source LLM APIs introduces leakage risk. Existing privacy treatments focus on masking entity names, but they still face four limitations: structural leakage under semantic masking, uncontrollable remote interaction, fragile multi-hop and multi-entity reasoning, and limited experience reuse for stability and efficiency. To address these issues, we propose PrivGemo, a privacy-preserving retrieval-augmented framework for KG-grounded reasoning with memory-guided exposure control. PrivGemo uses a dual-tower design to keep raw KG knowledge local while enabling remote reasoning over an anonymized view that goes beyond name masking to limit both semantic and structural exposure. PrivGemo supports multi-hop, multi-entity reasoning by retrieving anonymized long-hop paths that connect all topic entities, while keeping grounding and verification on the local KG. A hierarchical controller and a privacy-aware experience memory further reduce unnecessary exploration and remote interactions. Comprehensive experiments on six benchmarks show that PrivGemo achieves overall state-of-the-art results, outperforming the strongest baseline by up to 17.1%. Furthermore, PrivGemo enables smaller models (e.g., Qwen3-4B) to achieve reasoning performance comparable to that of GPT-4-Turbo.

Title: From Rows to Reasoning: A Retrieval-Augmented Multimodal Framework for Spreadsheet Understanding

Authors: Anmol Gulati, Sahil Sen, Waqar Sarguroh, Kevin Paul
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.08741
Pdf URL: https://arxiv.org/pdf/2601.08741
Copy Paste: [[2601.08741]] From Rows to Reasoning: A Retrieval-Augmented Multimodal Framework for Spreadsheet Understanding(https://arxiv.org/abs/2601.08741)
Keywords: language model, gpt, llm, retrieval-augmented generation
Abstract: Large Language Models (LLMs) struggle to reason over large-scale enterprise spreadsheets containing thousands of numeric rows, multiple linked sheets, and embedded visual content such as charts and receipts. Prior state-of-the-art spreadsheet reasoning approaches typically rely on single-sheet compression or full-context encoding, which limits scalability and fails to reflect how real users interact with complex, multimodal workbooks. We introduce FRTR-Bench, the first large-scale benchmark for multimodal spreadsheet reasoning, comprising 30 enterprise-grade Excel workbooks spanning nearly four million cells and more than 50 embedded images. To address these challenges, we present From Rows to Reasoning (FRTR), an advanced, multimodal retrieval-augmented generation framework that decomposes Excel workbooks into granular row, column, and block embeddings, employs hybrid lexical-dense retrieval with Reciprocal Rank Fusion (RRF), and integrates multimodal embeddings to reason over both numerical and visual information. We tested FRTR on six LLMs, achieving 74% answer accuracy on FRTR-Bench with Claude Sonnet 4.5, a substantial improvement over prior state-of-the-art approaches that reached only 24%. On the SpreadsheetLLM benchmark, FRTR achieved 87% accuracy with GPT-5 while reducing token usage by roughly 50% compared to context-compression methods.
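Reciprocal Rank Fusion, which FRTR uses to combine lexical and dense retrieval, scores each candidate as the sum of 1 / (k + rank) over the ranked lists that contain it. The sketch below is the generic formulation with the commonly used default k = 60; the chunk identifiers are hypothetical and the paper's exact settings are not stated in the abstract.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists into one list ordered by summed reciprocal ranks."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by identifier for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical chunk identifiers returned by a lexical and a dense retriever.
lexical = ["row_12", "col_B", "block_3"]
dense = ["block_3", "row_12", "img_1"]
print(reciprocal_rank_fusion([lexical, dense]))
# prints ['row_12', 'block_3', 'col_B', 'img_1']
```

Chunks that appear near the top of both lists rise above chunks that only one retriever returns, which is why the fusion needs no score normalization across retrievers.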

Title: Inferring Latent Intentions: Attributional Natural Language Inference in LLM Agents

Authors: Xin Quan, Jiafeng Xiong, Marco Valentino, André Freitas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.08742
Pdf URL: https://arxiv.org/pdf/2601.08742
Copy Paste: [[2601.08742]] Inferring Latent Intentions: Attributional Natural Language Inference in LLM Agents(https://arxiv.org/abs/2601.08742)
Keywords: language model, llm, agent
Abstract: Attributional inference, the ability to predict latent intentions behind observed actions, is a critical yet underexplored capability for large language models (LLMs) operating in multi-agent environments. Traditional natural language inference (NLI), in fact, fails to capture the nuanced, intention-driven reasoning essential for complex interactive systems. To address this gap, we introduce Attributional NLI (Att-NLI), a framework that extends NLI with principles from social psychology to assess an agent's capacity for abductive intentional inference (generating hypotheses about latent intentions), and subsequent deductive verification (drawing valid logical conclusions). We instantiate Att-NLI via a textual game, Undercover-V, experimenting with three types of LLM agents with varying reasoning capabilities and access to external tools: a standard NLI agent using only deductive inference, an Att-NLI agent employing abductive-deductive inference, and a neuro-symbolic Att-NLI agent performing abductive-deductive inference with external theorem provers. Extensive experiments demonstrate a clear hierarchy of attributional inference capabilities, with neuro-symbolic agents consistently outperforming others, achieving an average win rate of 17.08%. Our results underscore the role that Att-NLI can play in developing agents with sophisticated reasoning capabilities, highlighting, at the same time, the potential impact of neuro-symbolic AI in building rational LLM agents acting in multi-agent environments.

Title: TableCache: Primary Foreign Key Guided KV Cache Precomputation for Low Latency Text-to-SQL

Authors: Jinbo Su, Yuxuan Hu, Cuiping Li, Hong Chen, Jia Li, Lintao Ma, Jing Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.08743
Pdf URL: https://arxiv.org/pdf/2601.08743
Copy Paste: [[2601.08743]] TableCache: Primary Foreign Key Guided KV Cache Precomputation for Low Latency Text-to-SQL(https://arxiv.org/abs/2601.08743)
Keywords: llm, long context, prompt
Abstract: In Text-to-SQL tasks, existing LLM-based methods often include extensive database schemas in prompts, leading to long context lengths and increased prefilling latency. While user queries typically focus on recurrent table sets-offering an opportunity for KV cache sharing across queries-current inference engines, such as SGLang and vLLM, generate redundant prefix cache copies when processing user queries with varying table orders. To address this inefficiency, we propose precomputing table representations as KV caches offline and querying the required ones online. A key aspect of our approach is the computation of table caches while preserving primary foreign key relationships between tables. Additionally, we construct a Table Trie structure to facilitate efficient KV cache lookups during inference. To enhance cache performance, we introduce a cache management system with a query reranking strategy to improve cache hit rates and a computation loading pipeline for parallelizing model inference and cache loading. Experimental results show that our proposed TableCache achieves up to a 3.62x speedup in Time to First Token (TTFT) with negligible performance degradation.
摘要：在文本到 SQL 任务中，现有的基于 LLM 的方法通常在提示中包含大量数据库模式，导致上下文长度较长并增加预填充延迟。虽然用户查询通常关注循环表集（为跨查询提供 KV 缓存共享的机会），但当前的推理引擎（例如 SGLang 和 vLLM）在处理具有不同表顺序的用户查询时会生成冗余前缀缓存副本。为了解决这种低效率问题，我们建议将表表示形式预先计算为 KV 离线缓存，并在线查询所需的表示形式。我们方法的一个关键方面是表缓存的计算，同时保留表之间的主外键关系。此外，我们构建了一个 Table Trie 结构，以促进推理过程中高效的 KV 缓存查找。为了增强缓存性能，我们引入了一个缓存管理系统，该系统具有查询重排序策略以提高缓存命中率和计算加载管道以并行化模型推理和缓存加载。实验结果表明，我们提出的 TableCache 在首次令牌时间 (TTFT) 方面实现了高达 3.62 倍的加速，而性能下降可以忽略不计。
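
To make the Table Trie idea in the abstract above more concrete, here is a minimal Python sketch of a trie keyed by table names that maps a set of tables to a handle for a precomputed KV cache. It is an illustration under assumptions, not the authors' implementation: the class names, the `cache_id` handle, and the choice to canonicalise table order by sorting are all hypothetical. The paper itself orders tables using primary foreign key relationships, which this sketch does not model.

```python
from typing import Dict, Iterable, List, Optional


class TableTrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TableTrieNode"] = {}
        # Hypothetical handle pointing at a KV cache computed offline.
        self.cache_id: Optional[str] = None


class TableTrie:
    def __init__(self) -> None:
        self.root = TableTrieNode()

    @staticmethod
    def _key(tables: Iterable[str]) -> List[str]:
        # Assumption: sort table names so that the order in which a query
        # mentions tables does not create a separate cache entry.
        return sorted(set(tables))

    def insert(self, tables: Iterable[str], cache_id: str) -> None:
        node = self.root
        for name in self._key(tables):
            node = node.children.setdefault(name, TableTrieNode())
        node.cache_id = cache_id

    def lookup(self, tables: Iterable[str]) -> Optional[str]:
        node = self.root
        for name in self._key(tables):
            child = node.children.get(name)
            if child is None:
                return None  # no precomputed cache for this table set
            node = child
        return node.cache_id


trie = TableTrie()
trie.insert(["orders", "customers"], "kv-001")
hit = trie.lookup(["customers", "orders"])   # same set, different order
miss = trie.lookup(["orders"])               # table set never inserted
```

With a canonical key, two queries that mention the same tables in different orders resolve to the same cache entry, which is the redundancy the abstract attributes to prefix caching with varying table orders.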

Title: To Retrieve or To Think? An Agentic Approach for Context Evolution

Authors: Rubing Chen, Jian Wang, Wenjie Li, Xiao-Yong Wei, Qing Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.08747
Pdf URL: https://arxiv.org/pdf/2601.08747
Copy Paste: [[2601.08747]] To Retrieve or To Think? An Agentic Approach for Context Evolution(https://arxiv.org/abs/2601.08747)
Keywords: retrieval-augmented generation, agent
Abstract: Current context augmentation methods, such as retrieval-augmented generation, are essential for solving knowledge-intensive reasoning tasks. However, they typically adhere to a rigid, brute-force strategy that executes retrieval at every step. This indiscriminate approach not only incurs unnecessary computational costs but also degrades performance by saturating the context with irrelevant noise. To address these limitations, we introduce Agentic Context Evolution (ACE), a framework inspired by human metacognition that dynamically determines whether to seek new evidence or reason with existing knowledge. ACE employs a central orchestrator agent to make decisions strategically via majority voting. It aims to alternate between activating a retriever agent for external retrieval and a reasoner agent for internal analysis and refinement. By eliminating redundant retrieval steps, ACE maintains a concise and evolved context. Extensive experiments on challenging multi-hop QA benchmarks demonstrate that ACE significantly outperforms competitive baselines in accuracy while using tokens efficiently. This work provides valuable insights into advancing context-evolved generation for complex, knowledge-intensive tasks.
摘要：当前的上下文增强方法，例如检索增强生成，对于解决知识密集型推理任务至关重要。然而，它们通常遵循僵化的、强力的策略，在每一步都执行检索。这种不加区别的方法不仅会产生不必要的计算成本，而且还会因上下文中充满不相关的噪声而降低性能。为了解决这些限制，我们引入了代理上下文进化（ACE），这是一个受人类元认知启发的框架，可以动态地确定是寻求新证据还是利用现有知识进行推理。ACE 采用中央编排器代理，通过多数投票来战略性地做出决策，并在激活用于外部检索的检索器代理和用于内部分析和细化的推理器代理之间交替。通过消除冗余的检索步骤，ACE 保持了简洁且不断发展的上下文。在具有挑战性的多跳 QA 基准上进行的大量实验表明，ACE 在准确性方面显着优于竞争基准，同时高效地使用 token。这项工作为推进复杂、知识密集型任务的上下文演进生成提供了宝贵的见解。
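
The control loop this abstract describes, in which an orchestrator decides by majority whether to retrieve or to reason, could be sketched roughly as follows. Everything here is an assumption made for illustration: the function names, the vote labels `"retrieve"` and `"think"`, the tie-break, and the fixed step budget are hypothetical, and the callables stand in for LLM-backed agents.

```python
from collections import Counter
from typing import Callable, List


def majority_decision(votes: List[str]) -> str:
    """Pick the action with more votes; ties fall back to 'think' (assumed)."""
    counts = Counter(votes)
    return "retrieve" if counts["retrieve"] > counts["think"] else "think"


def evolve_context(
    question: str,
    context: List[str],
    vote_fn: Callable[[str, List[str]], str],
    retrieve_fn: Callable[[str, List[str]], str],
    reason_fn: Callable[[str, List[str]], List[str]],
    n_votes: int = 5,
    max_steps: int = 3,
) -> List[str]:
    for _ in range(max_steps):
        votes = [vote_fn(question, context) for _ in range(n_votes)]
        if majority_decision(votes) == "retrieve":
            # Retriever agent: append one new piece of external evidence.
            context = context + [retrieve_fn(question, context)]
        else:
            # Reasoner agent: analyse and condense what is already known.
            context = reason_fn(question, context)
    return context


# Stub agents for demonstration only.
def stub_vote(question: str, context: List[str]) -> str:
    return "retrieve" if len(context) < 2 else "think"


def stub_retrieve(question: str, context: List[str]) -> str:
    return f"doc{len(context)}"


def stub_reason(question: str, context: List[str]) -> List[str]:
    return [" | ".join(context)]


final_context = evolve_context("q", [], stub_vote, stub_retrieve, stub_reason)
```

In this toy run the loop retrieves twice and then condenses the two documents into a single context entry, so retrieval is skipped once the stub voter considers the evidence sufficient.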

Title: Multiplex Thinking: Reasoning via Token-wise Branch-and-Merge

Authors: Yao Tang, Li Dong, Yaru Hao, Qingxiu Dong, Furu Wei, Jiatao Gu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.08808
Pdf URL: https://arxiv.org/pdf/2601.08808
Copy Paste: [[2601.08808]] Multiplex Thinking: Reasoning via Token-wise Branch-and-Merge(https://arxiv.org/abs/2601.08808)
Keywords: language model, chain-of-thought
Abstract: Large language models often solve complex reasoning tasks more effectively with Chain-of-Thought (CoT), but at the cost of long, low-bandwidth token sequences. Humans, by contrast, often reason softly by maintaining a distribution over plausible next steps. Motivated by this, we propose Multiplex Thinking, a stochastic soft reasoning mechanism that, at each thinking step, samples K candidate tokens and aggregates their embeddings into a single continuous multiplex token. This preserves the vocabulary embedding prior and the sampling dynamics of standard discrete generation, while inducing a tractable probability distribution over multiplex rollouts. Consequently, multiplex trajectories can be directly optimized with on-policy reinforcement learning (RL). Importantly, Multiplex Thinking is self-adaptive: when the model is confident, the multiplex token is nearly discrete and behaves like standard CoT; when it is uncertain, it compactly represents multiple plausible next steps without increasing sequence length. Across challenging math reasoning benchmarks, Multiplex Thinking consistently outperforms strong discrete CoT and RL baselines from Pass@1 through Pass@1024, while producing shorter sequences. The code and checkpoints are available at this https URL.
摘要：大型语言模型通常通过思想链 (CoT) 更有效地解决复杂的推理任务，但代价是长、低带宽的标记序列。相比之下，人类通常会通过维持合理的后续步骤的分布来进行温和的推理。受此启发，我们提出了 Multiplex Thinking，这是一种随机软推理机制，在每个思考步骤中，对 K 个候选 token 进行采样，并将它们的嵌入聚合成单个连续的 Multiplex token。这保留了词汇嵌入先验和标准离散生成的采样动态，同时在多重部署上引入易于处理的概率分布。因此，可以通过策略强化学习（RL）直接优化多重轨迹。重要的是，多重思维是自适应的：当模型有信心时，多重令牌几乎是离散的，并且表现得像标准 CoT；当不确定时，它紧凑地表示多个看似合理的下一步，而不增加序列长度。在具有挑战性的数学推理基准中，多重思维始终优于从 Pass@1 到 Pass@1024 的强大离散 CoT 和 RL 基线，同时生成更短的序列。代码和检查点可从此 https URL 获取。
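
As a rough illustration of the branch-and-merge step in this abstract, the Python sketch below samples K token ids from the next-token distribution and merges their embeddings into one continuous vector. It is a minimal sketch under assumptions, not the paper's method: sampling independently with replacement and merging by a plain average are choices made here for simplicity, and `multiplex_token` is a hypothetical name.

```python
import math
import random
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multiplex_token(
    logits: Sequence[float],
    embeddings: Sequence[Sequence[float]],
    k: int,
    rng: random.Random,
) -> Tuple[List[int], List[float]]:
    """Sample k token ids and average their embeddings (assumed aggregation)."""
    probs = softmax(logits)
    # Assumption: k independent draws with replacement from the distribution.
    ids = rng.choices(range(len(probs)), weights=probs, k=k)
    dim = len(embeddings[0])
    merged = [sum(embeddings[i][d] for i in ids) / k for d in range(dim)]
    return ids, merged


rng = random.Random(0)
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# Confident model: one logit dominates, so every sample is the same token.
confident_ids, confident_vec = multiplex_token([100.0, 0.0, 0.0], embeddings, 4, rng)

# Uncertain model: flat logits, so the merged vector blends several tokens.
uncertain_ids, uncertain_vec = multiplex_token([0.0, 0.0, 0.0], embeddings, 4, rng)
```

The confident case collapses to the embedding of a single token, which mirrors the abstract's remark that the multiplex token is nearly discrete when the model is confident. In the uncertain case one vector stands in for several candidate next tokens without adding to the sequence length.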

Title: Modeling LLM Agent Reviewer Dynamics in Elo-Ranked Review System

Authors: Hsiang-Wei Huang, Junbin Lu, Kuang-Ming Chen, Jenq-Neng Hwang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.08829
Pdf URL: https://arxiv.org/pdf/2601.08829
Copy Paste: [[2601.08829]] Modeling LLM Agent Reviewer Dynamics in Elo-Ranked Review System(https://arxiv.org/abs/2601.08829)
Keywords: language model, llm, agent
Abstract: In this work, we explore Large Language Model (LLM) agent reviewer dynamics in an Elo-ranked review system using real-world conference paper submissions. Multiple LLM agent reviewers with different personas engage in multi-round review interactions moderated by an Area Chair. We compare a baseline setting with conditions that incorporate Elo ratings and reviewer memory. Our simulation results showcase several interesting findings, including how incorporating Elo improves Area Chair decision accuracy, as well as reviewers' adaptive review strategy that exploits our Elo system without improving review effort. Our code is available at this https URL.
摘要：在这项工作中，我们使用真实世界的会议论文提交来探索 Elo 排名评审系统中的大型语言模型 (LLM) 代理审稿人动态。具有不同角色的多名 LLM 代理审稿人参与由区域主席主持的多轮评审互动。我们将基线设置与包含 Elo 评级和审稿人记忆的条件进行比较。我们的模拟结果展示了一些有趣的发现，包括合并 Elo 如何提高区域主席决策的准确性，以及审稿人的自适应审稿策略，该策略利用我们的 Elo 系统而不提高审稿工作量。我们的代码可以在这个 https URL 上找到。