2026-01-29

Title: From Intuition to Expertise: Rubric-Based Cognitive Calibration for Human Detection of LLM-Generated Korean Text

Authors: Shinwoo Park, Yo-Sub Han
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.19913
Pdf URL: https://arxiv.org/pdf/2601.19913
Copy Paste: [[2601.19913]] From Intuition to Expertise: Rubric-Based Cognitive Calibration for Human Detection of LLM-Generated Korean Text(https://arxiv.org/abs/2601.19913)
Keywords: llm
Abstract: Distinguishing human-written Korean text from fluent LLM outputs remains difficult even for linguistically trained readers, who can over-trust surface well-formedness. We study whether expert detection can be treated as a learnable skill and improved through structured calibration. We introduce LREAD, a rubric derived from national Korean writing standards and adapted to target micro-level artifacts (e.g., punctuation optionality, spacing behavior, and register shifts). In a three-phase longitudinal blind protocol with Korean linguistics majors, Phase 1 measures intuition-only detection, Phase 2 enforces criterion-level scoring with explicit justifications, and Phase 3 evaluates domain-focused mastery on held-out elementary essays. Across phases, majority-vote accuracy increases from 60% to 100%, accompanied by stronger inter-annotator agreement (Fleiss' kappa: -0.09 --> 0.82). Compared to state-of-the-art LLM detectors, calibrated humans rely more on language-specific micro-diagnostics that are not well captured by coarse discourse priors. Our findings suggest that rubric-scaffolded expert judgment can serve as an interpretable complement to automated detectors for non-English settings, and we release the full rubric and a taxonomy of calibrated detection signatures.
摘要：即使对于受过语言训练的读者来说，区分人类书写的韩语文本和流利的法学硕士输出仍然很困难，他们可能会过度相信表面的格式良好。我们研究专家检测是否可以被视为一种可学习的技能，并通过结构化校准进行改进。我们引入了 LREAD，这是一个源自韩国国家书写标准的评分标准，适用于针对微观层面的工件（例如标点符号可选性、间距行为和寄存器移位）。在与韩国语言学专业的三阶段纵向盲法协议中，第一阶段测量仅凭直觉的检测，第二阶段执行具有明确理由的标准级别评分，第三阶段评估对所提出的基础论文的以领域为中心的掌握程度。在各个阶段中，多数票准确率从 60% 增加到 100%，伴随着更强的注释者间一致性（Fleiss 的 kappa：-0.09 --> 0.82）。与最先进的 LLM 检测器相比，经过校准的人类更多地依赖于特定于语言的微观诊断，而这些微观诊断无法被粗略的话语先验很好地捕获。我们的研究结果表明，基于标准的专家判断可以作为非英语环境下自动检测器的可解释补充，并且我们发布了完整的标准和校准检测签名的分类法。
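
For readers unfamiliar with the two statistics quoted in this abstract, the following is a minimal sketch of how majority-vote labels and Fleiss' kappa are computed from a panel of annotators. The function names and the toy labels ("H" for human, "L" for LLM) are illustrative assumptions and are not taken from the paper. The kappa formula is the standard one and is undefined when every rating falls in a single category.

```python
from collections import Counter


def majority_vote(item_labels):
    """Return the most frequent label among the raters of one item."""
    return Counter(item_labels).most_common(1)[0][0]


def fleiss_kappa(ratings, categories):
    """Standard Fleiss' kappa.

    ratings: list of items, each a list of labels (same number of raters per item).
    categories: list of all possible labels.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][c] = number of raters who assigned category c to item i
    counts = [Counter(item) for item in ratings]
    # Observed agreement per item, then averaged over items
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions
    p_cat = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_e = sum(p * p for p in p_cat)
    return (p_bar - p_e) / (1 - p_e)  # undefined (division by zero) if p_e == 1


# Hypothetical toy data: two texts, three annotators each
split = [["H", "H", "L"], ["H", "L", "L"]]
unanimous = [["H", "H", "H"], ["L", "L", "L"]]
kappa_split = fleiss_kappa(split, ["H", "L"])
kappa_unanimous = fleiss_kappa(unanimous, ["H", "L"])
```

A negative value, such as the -0.09 reported for Phase 1, means the annotators agreed less often than chance would predict.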

Title: Simulating Complex Multi-Turn Tool Calling Interactions in Stateless Execution Environments

Authors: Maxwell Crouse, Ibrahim Abdelaziz, Kshitij Fadnis, Siva Sankalp Patel, Kinjal Basu, Chulaka Gunasekara, Sadhana Kumaravel, Asim Munawar, Pavan Kapanipathi
Subjects: cs.CL, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2601.19914
Pdf URL: https://arxiv.org/pdf/2601.19914
Copy Paste: [[2601.19914]] Simulating Complex Multi-Turn Tool Calling Interactions in Stateless Execution Environments(https://arxiv.org/abs/2601.19914)
Keywords: language model
Abstract: Synthetic data has proven itself to be a valuable resource for tuning smaller, cost-effective language models to handle the complexities of multi-turn tool calling conversations. While many frameworks and systems for producing synthetic multi-turn tool calling data have been proposed, prior works have frequently assumed that any tool calling interactions will take place in an execution environment that maintains state. When such an environment is available, this is advantageous as it allows the validity of an interaction to be determined by whether or not the state of the execution environment matches some prespecified objective. Unfortunately, this assumption does not hold in many real-world tool use settings, e.g., in enterprise settings where data security is of the utmost importance or in cases where tool specifications are synthesized from multiple sources. In this work, we address this gap by introducing a data generation method, DiGiT-TC, that is designed to produce tool calling conversations that have the characteristics of conversations generated through search in a stateful environment. The key to our technique lies in a novel generation pattern that allows our approach to implicitly represent certain tool calls in the user request. We validate our approach on standard tool calling benchmarks and demonstrate that, even in stateful problem settings, our approach results in strong performance gains.
摘要：合成数据已被证明是一种宝贵的资源，可用于调整更小、更具成本效益的语言模型，以处理多轮工具调用对话的复杂性。虽然已经提出了许多用于生成合成多轮工具调用数据的框架和系统，但先前的工作经常假设任何工具调用交互都将发生在维护状态的执行环境中。当这样的环境可用时，这是有利的，因为它允许通过执行环境的状态是否与某个预先指定的目标匹配来确定交互的有效性。不幸的是，这在许多现实世界的工具使用设置中并不成立，例如，在数据安全至关重要的企业设置中，或者在从多个来源综合工具规范的情况下。在这项工作中，我们通过引入一种数据生成方法 DiGiT-TC 来解决这一差距，该方法旨在生成工具调用对话，这些对话具有通过在有状态环境中搜索生成的对话的特征。我们技术的关键在于一种新颖的生成模式，该模式允许我们的方法隐式表示用户请求中的某些工具调用。我们在标准工具调用基准上验证了我们的方法，并证明，即使在有状态的问题设置中，我们的方法也能带来强劲的性能提升。

Title: Modeling Next-Token Prediction as Left-Nested Intuitionistic Implication

Authors: Paul Tarau
Subjects: cs.CL, cs.AI, cs.LO
Abstract URL: https://arxiv.org/abs/2601.19915
Pdf URL: https://arxiv.org/pdf/2601.19915
Copy Paste: [[2601.19915]] Modeling Next-Token Prediction as Left-Nested Intuitionistic Implication(https://arxiv.org/abs/2601.19915)
Keywords: language model
Abstract: We introduce the Arrow Language Model, a neural architecture derived from an intuitionistic-logic interpretation of next-token prediction. Instead of representing tokens as additive embeddings mixed by attention, we encode a prefix as a left-nested implication chain whose structure preserves order through non-commutative composition. Next-token prediction corresponds to modus ponens, and sequence processing becomes constructive proof extension under the Curry-Howard correspondence. Our Prolog-based specialized theorem provers validate fundamental properties of the neural models, including relations between commutative vs. non-commutative sequencing and single-token vs. multi-token prediction choices. We show that a neural architecture equivalent to multiplicative RNNs arises naturally from a proof-theoretic interpretation of next-token prediction as nested intuitionistic implication. We also present a practical low-rank neural realization and position the model relative to Transformers and state-space models. Keywords: logic-based derivation of neural architectures, intuitionistic implicational logic, token-as-operator neural models, state-space models, alternatives to transformer-based foundational models.
摘要：我们引入 \emph{Arrow 语言模型}，这是一种从下一个标记预测的直觉逻辑解释衍生出来的神经架构。我们不是将标记表示为通过注意力混合的附加嵌入，而是将前缀编码为 \emph{左嵌套蕴涵链}，其结构通过非交换组合保留顺序。下一个标记预测对应于 \emph{modus ponens}，并且序列处理在 Curry--Howard 对应关系下成为构造性证明扩展。我们基于 Prolog 的专门定理证明器验证了神经模型的基本属性，其中包括交换排序与非交换排序以及单标记与多标记预测选择之间的关系。我们证明，相当于乘法 RNN 的神经架构自然地从下一个标记预测的证明理论解释中产生为嵌套直觉蕴涵，我们提出了一种实用的低秩神经实现，并将模型相对于 Transformer 和状态空间模型定位。关键词：基于逻辑的神经架构推导、直觉蕴涵逻辑、令牌作为运算符的神经模型、状态空间模型、基于变压器的基础模型的替代方案。

Title: PaperAudit-Bench: Benchmarking Error Detection in Research Papers for Critical Automated Peer Review

Authors: Songjun Tu, Yiwen Ma, Jiahao Lin, Qichao Zhang, Xiangyuan Lan, Junfeng Li, Nan Xu, Linjing Li, Dongbin Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.19916
Pdf URL: https://arxiv.org/pdf/2601.19916
Copy Paste: [[2601.19916]] PaperAudit-Bench: Benchmarking Error Detection in Research Papers for Critical Automated Peer Review(https://arxiv.org/abs/2601.19916)
Keywords: language model, llm
Abstract: Large language models can generate fluent peer reviews, yet their assessments often lack sufficient critical rigor when substantive issues are subtle and distributed across a paper. In this paper, we introduce PaperAudit-Bench, which consists of two components: (1) PaperAudit-Dataset, an error dataset covering both errors identifiable within individual sections and those requiring cross-section reasoning, designed for controlled evaluation under long-context settings; and (2) PaperAudit-Review, an automated review framework that integrates structured error detection with evidence-aware review generation to support critical assessment. Experiments on PaperAudit-Bench reveal large variability in error detectability across models and detection depths, highlighting the difficulty of identifying such errors under long-context settings. Relative to representative automated reviewing baselines, incorporating explicit error detection into the review workflow produces systematically stricter and more discriminative evaluations, demonstrating its suitability for peer review. Finally, we show that the dataset supports training lightweight LLM detectors via SFT and RL, enabling effective error detection at reduced computational cost.
摘要：大型语言模型可以生成流畅的同行评审，但当实质性问题很微妙且分布在整篇论文中时，它们的评估往往缺乏足够的批判严谨性。在本文中，我们介绍了PaperAudit-Bench，它由两个组件组成：（1）PaperAudit-Dataset，一个错误数据集，涵盖各个部分中可识别的错误和需要横截面推理的错误，专为长上下文设置下的受控评估而设计； (2) PaperAudit-Review，一个自动审查框架，它将结构化错误检测与证据感知审查生成相结合，以支持关键评估。 PaperAudit-Bench 上的实验揭示了不同模型和检测深度的错误可检测性存在很大差异，凸显了在长上下文设置下识别此类错误的难度。相对于代表性的自动评审基线，将明确的错误检测纳入评审工作流程会产生系统上更严格、更具歧视性的评估，证明其适合同行评审。最后，我们表明该数据集支持通过 SFT 和 RL 训练轻量级 LLM 检测器，从而以降低的计算成本实现有效的错误检测。

Title: PILOT: Planning via Internalized Latent Optimization Trajectories for Large Language Models

Authors: Haoyu Zheng, Yun Zhu, Yuqian Yuan, Bo Yuan, Wenqiao Zhang, Siliang Tang, Jun Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.19917
Pdf URL: https://arxiv.org/pdf/2601.19917
Copy Paste: [[2601.19917]] PILOT: Planning via Internalized Latent Optimization Trajectories for Large Language Models(https://arxiv.org/abs/2601.19917)
Keywords: language model, llm
Abstract: Strategic planning is critical for multi-step reasoning, yet compact Large Language Models (LLMs) often lack the capacity to formulate global strategies, leading to error propagation in long-horizon tasks. Our analysis reveals that LLMs possess latent reasoning capabilities that can be unlocked when conditioned on explicit plans from a teacher model; however, runtime reliance on external guidance is often impractical due to latency and availability constraints. To bridge this gap, we propose PILOT (Planning via Internalized Latent Optimization Trajectories), a non-invasive framework designed to internalize the strategic oversight of large models into intrinsic Latent Guidance. Instead of altering backbone weights, PILOT employs a lightweight Hyper-Network to synthesize a query-conditioned Latent Guidance vector. This vector acts as an internal steering mechanism, guiding the model's representations toward optimal reasoning paths. Extensive experiments on mathematical and coding benchmarks demonstrate that PILOT effectively stabilizes reasoning trajectories, consistently outperforming strong baselines (e.g., +8.9% on MATH500) with negligible inference latency.
摘要：战略规划对于多步推理至关重要，但紧凑的大型语言模型 (LLM) 通常缺乏制定全局战略的能力，导致长期任务中的错误传播。我们的分析表明，法学硕士拥有潜在的推理能力，当以教师模型的明确计划为条件时，可以释放这些能力；然而，由于延迟和可用性限制，运行时对外部指导的依赖通常是不切实际的。为了弥补这一差距，我们提出了 PILOT（通过内部潜在优化轨迹进行规划），这是一种非侵入性框架，旨在将对大型模型的战略监督内部化为内在的潜在指导。 PILOT 没有改变主干权重，而是采用轻量级超网络来合成查询条件的潜在指导向量。该向量充当内部转向机制，引导模型的表示走向最佳推理路径。关于数学和编码基准的大量实验表明，PILOT 有效地稳定了推理轨迹，始终优于强基线（例如，MATH500 上的 +8.9%），推理延迟可以忽略不计。

Title: Lowest Span Confidence: A Zero-Shot Metric for Efficient and Black-Box Hallucination Detection in LLMs

Authors: Yitong Qiao, Licheng Pan, Yu Mi, Lei Liu, Yue Shen, Fei Sun, Zhixuan Chu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.19918
Pdf URL: https://arxiv.org/pdf/2601.19918
Copy Paste: [[2601.19918]] Lowest Span Confidence: A Zero-Shot Metric for Efficient and Black-Box Hallucination Detection in LLMs(https://arxiv.org/abs/2601.19918)
Keywords: language model, llm, hallucination
Abstract: Hallucinations in Large Language Models (LLMs), i.e., the tendency to generate plausible but non-factual content, pose a significant challenge for their reliable deployment in high-stakes environments. However, existing hallucination detection methods generally operate under unrealistic assumptions, i.e., they require either expensive, intensive sampling strategies for consistency checks or white-box LLM states, which are unavailable or inefficient in common API-based scenarios. To this end, we propose a novel, efficient zero-shot metric called Lowest Span Confidence (LSC) for hallucination detection under minimal resource assumptions, requiring only a single forward pass with output probabilities. Concretely, LSC evaluates the joint likelihood of semantically coherent spans via a sliding window mechanism. By identifying regions of lowest marginal confidence across variable-length n-grams, LSC can capture local uncertainty patterns strongly correlated with factual inconsistency. Importantly, LSC can mitigate the dilution effect of perplexity and the noise sensitivity of minimum token probability, offering a more robust estimate of factual uncertainty. Extensive experiments across multiple state-of-the-art (SOTA) LLMs and diverse benchmarks show that LSC consistently outperforms existing zero-shot baselines, delivering strong detection performance even under resource-constrained conditions.
摘要：大型语言模型 (LLM) 中的幻觉，即生成看似合理但非事实内容的倾向，对其在高风险环境中的可靠部署构成了重大挑战。然而，现有的幻觉检测方法通常在不切实际的假设下运行，即要么需要昂贵的密集采样策略来进行一致性检查，要么需要白盒 LLM 状态，这在常见的基于 API 的场景中不可用或效率低下。为此，我们提出了一种新颖的高效零样本度量，称为最低跨度置信度（LSC），用于在最少资源假设下进行幻觉检测，仅需要具有输出概率的单个前向。具体来说，LSC 通过滑动窗口机制评估语义相干跨度的联合可能性。通过识别可变长度 n 元语法中边缘置信度最低的区域，LSC 可以很好地捕获与事实不一致密切相关的局部不确定性模式。重要的是，LSC 可以减轻困惑度的稀释效应和最小令牌概率的噪声敏感性，从而提供对事实不确定性的更稳健的估计。跨多个最先进 (SOTA) 法学硕士和不同基准的广泛实验表明，LSC 始终优于现有的零样本基线，即使在资源有限的条件下也能提供强大的检测性能。
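
The sliding-window idea in this abstract can be illustrated with a short sketch. This is not the paper's exact formula: the length normalization, the window sizes (`min_len`, `max_len`), and the function name are assumptions made for illustration. The sketch takes per-token log-probabilities from one forward pass and returns the span whose averaged joint log-likelihood is lowest.

```python
import math


def lowest_span_confidence(token_logprobs, min_len=2, max_len=4):
    """Return (score, start, end) for the least confident span.

    score is the span's joint log-likelihood divided by its length
    (an assumed normalization); the span covers tokens [start, end).
    """
    n = len(token_logprobs)
    if n == 0:
        raise ValueError("token_logprobs must not be empty")
    # Prefix sums give each window's joint log-likelihood in O(1)
    prefix = [0.0]
    for lp in token_logprobs:
        prefix.append(prefix[-1] + lp)
    best = None
    for length in range(min_len, min(max_len, n) + 1):
        for start in range(n - length + 1):
            score = (prefix[start + length] - prefix[start]) / length
            if best is None or score < best[0]:
                best = (score, start, start + length)
    if best is None:
        # Sequence shorter than min_len: fall back to the whole sequence
        best = (prefix[n] / n, 0, n)
    return best


# Hypothetical token probabilities for a five-token answer
probs = [0.9, 0.8, 0.1, 0.2, 0.95]
score, start, end = lowest_span_confidence(
    [math.log(p) for p in probs], min_len=2, max_len=3
)
```

Scoring a span rather than a single token or the whole sequence is what sits between the two baselines named in the abstract: one noisy token cannot dominate the score, and a long confident context cannot wash out a short uncertain region.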

Title: Demystifying Multi-Agent Debate: The Role of Confidence and Diversity

Authors: Xiaochen Zhu, Caiqi Zhang, Yizhou Chi, Tom Stafford, Nigel Collier, Andreas Vlachos
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.19921
Pdf URL: https://arxiv.org/pdf/2601.19921
Copy Paste: [[2601.19921]] Demystifying Multi-Agent Debate: The Role of Confidence and Diversity(https://arxiv.org/abs/2601.19921)
Keywords: language model, llm, agent
Abstract: Multi-agent debate (MAD) is widely used to improve large language model (LLM) performance through test-time scaling, yet recent work shows that vanilla MAD often underperforms simple majority vote despite higher computational cost. Studies show that, under homogeneous agents and uniform belief updates, debate preserves expected correctness and therefore cannot reliably improve outcomes. Drawing on findings from human deliberation and collective decision-making, we identify two key mechanisms missing from vanilla MAD: (i) diversity of initial viewpoints and (ii) explicit, calibrated confidence communication. We propose two lightweight interventions. First, a diversity-aware initialisation that selects a more diverse pool of candidate answers, increasing the likelihood that a correct hypothesis is present at the start of debate. Second, a confidence-modulated debate protocol in which agents express calibrated confidence and condition their updates on others' confidence. We show theoretically that diversity-aware initialisation improves the prior probability of MAD success without changing the underlying update dynamics, while confidence-modulated updates enable debate to systematically drift to the correct hypothesis. Empirically, across six reasoning-oriented QA benchmarks, our methods consistently outperform vanilla MAD and majority vote. Our results connect human deliberation with LLM-based debate and demonstrate that simple, principled modifications can substantially enhance debate effectiveness.
摘要：多智能体辩论 (MAD) 被广泛用于通过测试时间扩展来提高大型语言模型 (LLM) 性能，但最近的研究表明，尽管计算成本较高，但普通 MAD 的表现通常不如简单多数投票。研究表明，在同质主体和统一的信念更新下，辩论保留了预期的正确性，因此不能可靠地改善结果。根据人类审议和集体决策的发现，我们确定了普通 MAD 中缺少的两个关键机制：（i）初始观点的多样性和（ii）明确的、经过校准的信任沟通。我们提出两种轻量级干预措施。首先，具有多样性意识的初始化，选择更加多样化的候选答案池，增加了在辩论开始时出现正确假设的可能性。其次，一种信心调节的辩论协议，其中代理人表达校准的信心并根据其他人的信心来调整他们的更新。我们从理论上证明，多样性感知初始化在不改变潜在更新动态的情况下提高了 MAD 成功的先验概率，而置信度调制更新使辩论能够系统地转向正确的假设。根据经验，在六个面向推理的 QA 基准中，我们的方法始终优于普通的 MAD 和多数投票。我们的结果将人类审议与基于法学硕士的辩论联系起来，并证明简单、有原则的修改可以大大提高辩论的有效性。

Title: HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue

Authors: Laya Iyer, Kriti Aggarwal, Sanmi Koyejo, Gail Heyman, Desmond C. Ong, Subhabrata Mukherjee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.19922
Pdf URL: https://arxiv.org/pdf/2601.19922
Copy Paste: [[2601.19922]] HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue(https://arxiv.org/abs/2601.19922)
Keywords: language model, llm
Abstract: Supportive conversation depends on skills that go beyond language fluency, including reading emotions, adjusting tone, and navigating moments of resistance, frustration, or distress. Despite rapid progress in language models, we still lack a clear way to understand how their abilities in these interpersonal domains compare to those of humans. We introduce HEART, the first-ever framework that directly compares humans and LLMs on the same multi-turn emotional-support conversations. For each dialogue history, we pair human and model responses and evaluate them through blinded human raters and an ensemble of LLM-as-judge evaluators. All assessments follow a rubric grounded in interpersonal communication science across five dimensions: Human Alignment, Empathic Responsiveness, Attunement, Resonance, and Task-Following. HEART uncovers striking behavioral patterns. Several frontier models approach or surpass the average human responses in perceived empathy and consistency. At the same time, humans maintain advantages in adaptive reframing, tension-naming, and nuanced tone shifts, particularly in adversarial turns. Human and LLM-as-judge preferences align on about 80 percent of pairwise comparisons, matching inter-human agreement, and their written rationales emphasize similar HEART dimensions. This pattern suggests an emerging convergence in the criteria used to assess supportive quality. By placing humans and models on equal footing, HEART reframes supportive dialogue as a distinct capability axis, separable from general reasoning or linguistic fluency. It provides a unified empirical foundation for understanding where model-generated support aligns with human social judgment, where it diverges, and how affective conversational competence scales with model size.
摘要：支持性对话依赖于超越语言流畅性的技能，包括解读情绪、调整语气以及应对阻力、沮丧或痛苦的时刻。尽管语言模型取得了快速进展，但我们仍然缺乏一种明确的方法来了解他们在这些人际领域的能力与人类的能力相比如何。我们引入了 HEART，这是有史以来第一个在相同的多轮情感支持对话中直接比较人类和法学硕士的框架。对于每一个对话历史，我们将人类和模型的反应配对，并通过盲人评估者和法学硕士作为法官评估者的整体来评估它们。所有评估都遵循基于人际沟通科学的五个维度的标准：人类协调、同理心响应、协调、共鸣和任务跟踪。 HEART 揭示了惊人的行为模式。一些前沿模型在感知同理心和一致性方面接近或超过人类的平均反应。与此同时，人类在适应性重构、紧张命名和微妙的语气转变方面保持着优势，特别是在对抗性的转变中。人类和法学硕士作为法官的偏好在大约 80% 的成对比较中保持一致，匹配人与人之间的一致性，并且他们的书面理由强调相似的 HEART 维度。这种模式表明用于评估支持质量的标准正在出现趋同。通过将人类和模型置于平等的基础上，HEART 将支持性对话重新构建为独特的能力轴，与一般推理或语言流畅性分开。它为理解模型生成的支持在哪些方面与人类社会判断一致、在哪些方面存在分歧，以及情感对话能力如何随模型规模变化提供了统一的实证基础。

Title: Table-BiEval: A Self-Supervised, Dual-Track Framework for Decoupling Structure and Content in LLM Evaluation

Authors: Boxiang Zhao, Qince Li, Zhonghao Wang, Zelin Cao, Yi Wang, Peng Cheng, Bo Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.19923
Pdf URL: https://arxiv.org/pdf/2601.19923
Copy Paste: [[2601.19923]] Table-BiEval: A Self-Supervised, Dual-Track Framework for Decoupling Structure and Content in LLM Evaluation(https://arxiv.org/abs/2601.19923)
Keywords: language model, llm, agent
Abstract: As Large Language Models (LLMs) evolve into autonomous agents, the capability to faithfully translate natural language into rigorous structured formats (essential for tool invocation) and to convert complex tabular information into machine-readable specifications has become paramount. However, current evaluations lack effective methodologies to measure this structural fidelity without costly human intervention, as traditional text metrics fail to detect semantic drift in code-like outputs. This paper proposes Table-BiEval, a novel approach based on a human-free, self-supervised evaluation framework, to assess LLM performance quantitatively. By leveraging deterministic Intermediate Representations, our framework calculates Content Semantic Accuracy and Normalized Tree Edit Distance to decouple structure from content. It also empirically evaluates 15 state-of-the-art LLMs across dual topological dimensions: hierarchical structures and flat tables. The results reveal substantial variability, highlighting that mid-sized models can surprisingly outperform larger counterparts in structural efficiency and confirming that deep recursive nesting remains a universal bottleneck for current architectures.
摘要：随着大型语言模型 (LLM) 发展为自主代理，将自然语言忠实地转换为严格的结构化格式（对于工具调用至关重要）以及将复杂的表格信息转换为机器可读规范的能力已变得至关重要。然而，当前的评估缺乏有效的方法来衡量这种结构保真度，而无需昂贵的人工干预，因为传统的文本指标无法检测类似代码的输出中的语义漂移。本文提出了 Table-BiEval，这是一种基于无人、自我监督评估框架的新方法，用于定量评估法学硕士的表现。通过利用确定性中间表示，我们的框架计算内容语义准确性和标准化树编辑距离，以将结构与内容分离。此外，它还根据双重拓扑维度（层次结构和平面表）对 15 个最先进的法学硕士进行了实证评估。结果揭示了巨大的可变性，突显出中型模型在结构效率方面出人意料地优于大型模型，并证实深度递归嵌套仍然是当前架构的普遍瓶颈。

Title: OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling

Authors: Yitian Chen, Cheng Cheng, Yinan Sun, Zi Ling, Dongdong Ge
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.19924
Pdf URL: https://arxiv.org/pdf/2601.19924
Copy Paste: [[2601.19924]] OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling(https://arxiv.org/abs/2601.19924)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated impressive progress in optimization modeling, fostering a rapid expansion of new methodologies and evaluation benchmarks. However, the boundaries of their capabilities in automated formulation and problem solving remain poorly understood, particularly when extending to complex, real-world tasks. To bridge this gap, we propose OPT-ENGINE, an extensible benchmark framework designed to evaluate LLMs on optimization modeling with controllable and scalable difficulty levels. OPT-ENGINE spans 10 canonical tasks across operations research, with five Linear Programming tasks and five Mixed-Integer Programming tasks. Utilizing OPT-ENGINE, we conduct an extensive study of LLMs' reasoning capabilities, addressing two critical questions: (1) Does LLMs' performance remain robust when generalizing to out-of-distribution optimization tasks that scale in complexity beyond current benchmark levels? and (2) At what stage, from problem interpretation to solution generation, do current LLMs encounter the most significant bottlenecks? Our empirical results yield two key insights: first, tool-integrated reasoning with external solvers exhibits significantly higher robustness as task complexity escalates, while pure-text reasoning reaches a ceiling; second, the automated formulation of constraints constitutes the primary performance bottleneck. These findings provide actionable guidance for developing next-generation LLMs for advanced optimization. Our code is publicly available at this https URL.
摘要：大型语言模型 (LLM) 在优化建模方面取得了令人瞩目的进展，促进了新方法和评估基准的快速扩展。然而，人们对它们在自动化制定和解决问题方面的能力边界仍然知之甚少，特别是在扩展到复杂的现实世界任务时。为了弥补这一差距，我们提出了 OPT-ENGINE，这是一个可扩展的基准框架，旨在以可控和可扩展的难度级别评估优化建模的法学硕士。 OPT-ENGINE 涵盖运筹学领域的 10 项规范任务，其中包括 5 项线性规划和 5 项混合整数规划。利用 OPT-ENGINE，我们对 LLM 的推理能力进行了广泛的研究，解决了两个关键问题：1.) 当泛化到复杂性超出当前基准水平的分布外优化任务时，LLM 的性能是否保持稳健？ 2.) 目前的法学硕士在从问题解释到解决方案生成的哪个阶段遇到了最显着的瓶颈？我们的实证结果产生了两个关键见解：首先，随着任务复杂性的增加，与外部求解器的工具集成推理表现出显着更高的鲁棒性，而纯文本推理达到了上限；其次，约束的自动制定构成了主要的性能瓶颈。这些发现为开发下一代高级优化法学硕士提供了可行的指导。我们的代码可在 \textcolor{blue}{this https URL} 上公开获取。

Title: Evaluating Large Language Models for Abstract Evaluation Tasks: An Empirical Study

Authors: Yinuo Liu, Emre Sezgin, Eric A. Youngstrom
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.19925
Pdf URL: https://arxiv.org/pdf/2601.19925
Copy Paste: [[2601.19925]] Evaluating Large Language Models for Abstract Evaluation Tasks: An Empirical Study(https://arxiv.org/abs/2601.19925)
Keywords: language model, gpt, llm, chat
Abstract: Introduction: Large language models (LLMs) can process requests and generate texts, but their feasibility for assessing complex academic content needs further investigation. To explore LLMs' potential in assisting scientific review, this study examined the consistency and reliability of ChatGPT-5, Gemini-3-Pro, and Claude-Sonnet-4.5 in evaluating abstracts, compared to one another and to human reviewers. Methods: 160 abstracts from a local conference were graded by human reviewers and three LLMs using one rubric. Composite score distributions across three LLMs and fourteen reviewers were examined. Inter-rater reliability was calculated using intraclass correlation coefficients (ICCs) for within-AI reliability and AI-human concordance. Bland-Altman plots were examined for visual agreement patterns and systematic bias. Results: LLMs achieved good-to-excellent agreement with each other (ICCs: 0.59-0.87). ChatGPT and Claude reached moderate agreement with human reviewers on overall quality and content-specific criteria, with ICCs of approximately 0.45-0.60 for composite, impression, clarity, objective, and results. They exhibited fair agreement on subjective dimensions, with ICCs ranging from 0.23 to 0.38 for impact, engagement, and applicability. Gemini showed fair agreement on half of the criteria and no reliability on impact and applicability. The three LLMs showed acceptable or negligible mean differences (ChatGPT=0.24, Gemini=0.42, Claude=-0.02) from the human mean composite scores. Discussion: LLMs could process abstracts in batches with moderate agreement with human experts on overall quality and objective criteria. With appropriate process architecture, they can apply a rubric consistently across volumes of abstracts exceeding what is feasible for a human rater. The weaker performance on subjective dimensions indicates that AI should serve a complementary role in evaluation, while human expertise remains essential.
摘要：简介：大型语言模型（LLM）可以处理请求并生成文本，但其评估复杂学术内容的可行性需要进一步研究。为了探索 LLM 在协助科学审查方面的潜力，本研究检查了 ChatGPT-5、Gemini-3-Pro 和 Claude-Sonnet-4.5 在评估摘要方面的一致性和可靠性，并与彼此进行比较以及与人类审稿人进行比较。方法：来自当地会议的 160 篇摘要由人工审稿人和三位法学硕士使用一个评分标准进行评分。研究了三名法学硕士和十四名审稿人的综合分数分布。使用类内相关系数（ICC）计算评估者间的可靠性，以了解人工智能内部的可靠性和人工智能与人类的一致性。检查 Bland-Altman 图的视觉一致性模式和系统偏差。结果：法学硕士彼此之间达到了良好到优秀的一致性（ICC：0.59-0.87）。 ChatGPT 和 Claude 在整体质量和内容特定标准方面与人类审稿人达成了适度一致，综合、印象、清晰度、客观和结果的 ICC 约为 0.45-0.60。他们在主观维度上表现出相当的一致性，影响力、参与度和适用性的 ICC 范围为 0.23-0.38。双子座在半标准上表现出相当一致，但在影响和适用性方面没有可靠性。三位法学硕士与人类平均综合得分相比，显示出可接受或可忽略不计的平均差异（ChatGPT=0.24、Gemini=0.42、Claude=-0.02）。讨论：法学硕士可以批量处理摘要，并在整体质量和客观标准上与人类专家达成适度一致。通过适当的流程架构，他们可以在大量摘要中一致地应用评分标准，这超出了人类评分者的可行性。主观维度的表现较弱表明人工智能应该在评估中发挥补充作用，而人类的专业知识仍然至关重要。

Title: The Grammar of Transformers: A Systematic Review of Interpretability Research on Syntactic Knowledge in Language Models

Authors: Nora Graichen, Iria de-Dios-Flores, Gemma Boleda
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.19926
Pdf URL: https://arxiv.org/pdf/2601.19926
Copy Paste: [[2601.19926]] The Grammar of Transformers: A Systematic Review of Interpretability Research on Syntactic Knowledge in Language Models(https://arxiv.org/abs/2601.19926)
Keywords: language model
Abstract: We present a systematic review of 337 articles evaluating the syntactic abilities of Transformer-based language models, reporting on 1,015 model results from a range of syntactic phenomena and interpretability methods. Our analysis shows that the state of the art presents a healthy variety of methods and data, but an over-focus on a single language (English), a single model (BERT), and phenomena that are easy to get at (like part of speech and agreement). Results also suggest that TLMs capture these form-oriented phenomena well, but show more variable and weaker performance on phenomena at the syntax-semantics interface, like binding or filler-gap dependencies. We provide recommendations for future work, in particular reporting complete data, better aligning theoretical constructs and methods across studies, increasing the use of mechanistic methods, and broadening the empirical scope regarding languages and linguistic phenomena.
摘要：我们对 337 篇评估基于 Transformer 的语言模型的句法能力的文章进行了系统回顾，报告了来自一系列句法现象和可解释性方法的 1,015 个模型结果。我们的分析表明，最先进的技术提供了多种健康的方法和数据，但过度关注单一语言（英语）、单一模型（BERT）和易于理解的现象（例如词性和协议）。结果还表明，TLM 很好地捕获了这些面向形式的现象，但在语法语义接口的现象（如绑定或填充间隙依赖性）上表现出更多的变化和更弱的性能。我们为未来的工作提供建议，特别是报告完整的数据、更好地协调跨研究的理论结构和方法、增加机械方法的使用以及扩大有关语言和语言现象的实证范围。

Title: Attribution Techniques for Mitigating Hallucinated Information in RAG Systems: A Survey

Authors: Yuqing Zhao, Ziyao Liu, Yongsen Zheng, Kwok-Yan Lam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.19927
Pdf URL: https://arxiv.org/pdf/2601.19927
Copy Paste: [[2601.19927]] Attribution Techniques for Mitigating Hallucinated Information in RAG Systems: A Survey(https://arxiv.org/abs/2601.19927)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Large Language Model (LLM)-based question answering (QA) systems play a critical role in modern AI, demonstrating strong performance across various tasks. However, LLM-generated responses often suffer from hallucinations, i.e., unfaithful statements lacking reliable references. Retrieval-Augmented Generation (RAG) frameworks enhance LLM responses by incorporating external references but also introduce new forms of hallucination due to complex interactions between the retriever and generator. To address these challenges, researchers have explored attribution-based techniques that ensure responses are verifiably supported by retrieved content. Despite progress, a unified pipeline for these techniques, along with a clear taxonomy and systematic comparison of their strengths and weaknesses, remains lacking. A well-defined taxonomy is essential for identifying specific failure modes within RAG systems, while comparative analysis helps practitioners choose appropriate solutions based on hallucination types and application context. This survey investigates how attribution-based techniques are used within RAG systems to mitigate hallucinations and addresses the gap by: (i) outlining a taxonomy of hallucination types in RAG systems, (ii) presenting a unified pipeline for attribution techniques, (iii) reviewing techniques based on the hallucinations they target, and (iv) discussing strengths and weaknesses with practical guidelines. This work offers insights for future research and practical use of attribution techniques in RAG systems.
摘要：基于大型语言模型 (LLM) 的问答 (QA) 系统在现代人工智能中发挥着关键作用，在各种任务中表现出强大的性能。然而，法学硕士生成的答案经常受到幻觉、不忠实陈述的影响，缺乏可靠的参考资料。检索增强生成（RAG）框架通过合并外部参考来增强 LLM 响应，但由于检索器和生成器之间复杂的交互，也引入了新形式的幻觉。为了应对这些挑战，研究人员探索了基于归因的技术，以确保检索到的内容可验证地支持响应。尽管取得了进展，但这些技术仍然缺乏统一的管道，以及对其优缺点的清晰分类和系统比较。明确的分类对于识别 RAG 系统中的特定故障模式至关重要，而比较分析可以帮助从业者根据幻觉类型和应用环境选择适当的解决方案。本调查研究了如何在 RAG 系统中使用基于归因的技术来减轻幻觉并通过以下方式解决差距：（i）概述 RAG 系统中幻觉类型的分类，（ii）提出归因技术的统一管道，（iii）根据其针对的幻觉审查技术，以及（iv）通过实用指南讨论优点和缺点。这项工作为 RAG 系统中归因技术的未来研究和实际使用提供了见解。

Title: Stingy Context: 18:1 Hierarchical Code Compression for LLM Auto-Coding

Authors: David Linus Ostby
Subjects: cs.CL, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2601.19929
Pdf URL: https://arxiv.org/pdf/2601.19929
Copy Paste: [[2601.19929]] Stingy Context: 18:1 Hierarchical Code Compression for LLM Auto-Coding(https://arxiv.org/abs/2601.19929)
Keywords: llm
Abstract: We introduce Stingy Context, a hierarchical tree-based compression scheme achieving 18:1 reduction in LLM context for auto-coding tasks. Using our TREEFRAG exploit decomposition, we reduce a real source code base of 239k tokens to 11k tokens while preserving task fidelity. Empirical results across 12 Frontier models show 94 to 97% success on 40 real-world issues at low cost, outperforming flat methods and mitigating lost-in-the-middle effects.
摘要：我们引入了 Stingy Context，这是一种基于分层树的压缩方案，可实现自动编码任务的 LLM 上下文减少 18:1。使用我们的 TREEFRAG 漏洞分解，我们将 239k 令牌的真实源代码库减少到 11k 令牌，同时保持任务保真度。 12 个 Frontier 模型的实证结果显示，在 40 个现实世界问题上以低成本实现了 94% 到 97% 的成功，优于扁平方法并减轻了中间迷失效应。

Title: SDUs DAISY: A Benchmark for Danish Culture

Authors: Jacob Nielsen, Stine L. Beltoft, Peter Schneider-Kamp, Lukas Galke Poech
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.19930
Pdf URL: https://arxiv.org/pdf/2601.19930
Copy Paste: [[2601.19930]] SDUs DAISY: A Benchmark for Danish Culture(https://arxiv.org/abs/2601.19930)
Keywords: language model
Abstract: We introduce a new benchmark for Danish culture via cultural heritage, Daisy, based on the curated topics from the Danish Culture Canon 2006. For each artifact in the culture canon, we query the corresponding Wikipedia page and have a language model generate random questions. This yields a sampling strategy within each work, with a mix of central and peripheral questions for each work, covering not only knowledge of mainstream information but also the in-depth cornerstones defining the heritage of Danish culture, as defined by the Canon committee. Each question-answer pair is approved or corrected by a human in the final dataset, which consists of 741 closed-ended question-answer pairs covering topics from 1300 BC archaeological findings and 1700s poems and musical pieces to contemporary pop music and Danish design and architecture.
摘要：我们通过文化遗产 Daisy 引入了丹麦文化的新基准，它基于 2006 年丹麦文化经典中的策划主题。对于文化经典中的每个文物，我们查询相应的维基百科页面，并让语言模型生成随机问题。这在每部作品中产生了一个抽样策略，其中混合了每部作品的中心问题和外围问题，不仅是主流信息的知识，而且是定义丹麦文化遗产的深入基石，由佳能委员会定义。最终数据集中的每个问答对都经过人工批准或纠正，该数据集包含 741 个涵盖公元前 1300 年主题的封闭式问答对。考古发现、1700 世纪的诗歌和音乐剧、当代流行音乐以及丹麦设计和建筑。

Title: CascadeMind at SemEval-2026 Task 4: A Hybrid Neuro-Symbolic Cascade for Narrative Similarity

Authors: Sebastien Kawada, Dylan Holyoak
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.19931
Pdf URL: https://arxiv.org/pdf/2601.19931
Copy Paste: [[2601.19931]] CascadeMind at SemEval-2026 Task 4: A Hybrid Neuro-Symbolic Cascade for Narrative Similarity(https://arxiv.org/abs/2601.19931)
Keywords: language model
Abstract: We present a hybrid neuro-symbolic system for the SemEval-2026 Task 4 on Narrative Story Similarity. Our approach combines neural self-consistency voting with a novel Multi-Scale Narrative Analysis Ensemble that operates as a symbolic tiebreaker. The neural network component uses a large language model with multiple parallel votes, applying a supermajority threshold for confident decisions and escalating uncertain cases to additional voting rounds. When votes result in a perfect tie, a symbolic ensemble combining five narrative similarity signals (lexical overlap, semantic embeddings, story grammar structure, event chain alignment, and narrative tension curves) provides the final decision. Our cascade architecture achieves 81% accuracy on the development set, demonstrating that selective deferral to symbolic methods can enhance neural predictions on genuinely ambiguous narrative comparisons.
摘要：我们为 SemEval-2026 任务 4 的叙事故事相似性提出了一个混合神经符号系统。我们的方法将神经自洽投票与新颖的多尺度叙事分析集成相结合，作为象征性的决胜局。神经网络组件使用具有多个并行投票的大型语言模型，为自信的决策应用绝对多数阈值，并将不确定的情况升级到额外的投票轮次。当投票结果完美平局时，结合五个叙事相似性信号（词汇重叠、语义嵌入、故事语法结构、事件链对齐和叙事张力曲线）的符号集合将提供最终决定。我们的级联架构在开发集上达到了 81% 的准确率，这表明选择性推迟符号方法可以增强对真正模糊的叙述比较的神经预测。
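
The cascade described in the abstract above (supermajority voting, escalation to extra rounds, symbolic tiebreak on a perfect tie) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the 0.75 threshold, the three-round limit, the choice to accumulate votes across rounds, and the plain-majority fallback are all assumptions made for the example.

```python
from collections import Counter
from typing import Callable, Sequence


def cascade_decide(
    vote_round: Callable[[int], Sequence[str]],  # hypothetical: returns LLM votes for a round
    symbolic_tiebreak: Callable[[], str],        # hypothetical: symbolic ensemble decision
    threshold: float = 0.75,                     # assumed supermajority threshold
    max_rounds: int = 3,                         # assumed escalation limit
) -> str:
    """Pick a label by voting; defer to the symbolic ensemble only on a perfect tie."""
    tally: Counter = Counter()
    for round_idx in range(max_rounds):
        tally.update(vote_round(round_idx))
        total = sum(tally.values())
        if total == 0:
            continue  # no votes yet, try another round
        label, count = tally.most_common(1)[0]
        if count / total >= threshold:
            return label  # confident decision, stop escalating
    top = tally.most_common(2)
    if not top:
        return symbolic_tiebreak()  # no votes at all
    if len(top) == 2 and top[0][1] == top[1][1]:
        return symbolic_tiebreak()  # perfect tie after all rounds
    return top[0][0]  # assumed fallback: plain majority


# Usage: three of four votes agree, so the first round already clears the threshold.
confident = cascade_decide(lambda r: ["A", "A", "A", "B"], lambda: "B")
# Usage: every round splits evenly, so the symbolic tiebreaker decides.
tied = cascade_decide(lambda r: ["A", "B"], lambda: "B")
```

In this sketch the symbolic component is consulted only when the neural votes carry no signal, which mirrors the selective deferral the abstract describes.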

Title: "Newspaper Eat" Means "Not Tasty": A Taxonomy and Benchmark for Coded Languages in Real-World Chinese Online Reviews

Authors: Ruyuan Wan, Changye Li, Ting-Hao 'Kenneth' Huang
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2601.19932
Pdf URL: https://arxiv.org/pdf/2601.19932
Copy Paste: [[2601.19932]] "Newspaper Eat" Means "Not Tasty": A Taxonomy and Benchmark for Coded Languages in Real-World Chinese Online Reviews(https://arxiv.org/abs/2601.19932)
Keywords: language model
Abstract: Coded language is an important part of human communication. It refers to cases where users intentionally encode meaning so that the surface text differs from the intended meaning and must be decoded to be understood. Current language models handle coded language poorly. Progress has been limited by the lack of real-world datasets and clear taxonomies. This paper introduces CodedLang, a dataset of 7,744 Chinese Google Maps reviews, including 900 reviews with span-level annotations of coded language. We developed a seven-class taxonomy that captures common encoding strategies, including phonetic, orthographic, and cross-lingual substitutions. We benchmarked language models on coded language detection, classification, and review rating prediction. Results show that even strong models can fail to identify or understand coded language. Because many coded expressions rely on pronunciation-based strategies, we further conducted a phonetic analysis of coded and decoded forms. Together, our results highlight coded language as an important and underexplored challenge for real-world NLP systems.
摘要：编码语言是人类交流的重要组成部分。它指的是用户故意对含义进行编码，使得表面文本与预期含义不同，并且必须解码才能理解的情况。当前的语言模型不能很好地处理编码语言。由于缺乏真实世界的数据集和明确的分类法，进展受到限制。本文介绍了 CodedLang，这是一个包含 7,744 条中文 Google 地图评论的数据集，其中包括 900 条带有编码语言跨级注释的评论。我们开发了一个七类分类法，捕获常见的编码策略，包括语音、正字法和跨语言替换。我们对编码语言检测、分类和评论评级预测的语言模型进行了基准测试。结果表明，即使是强大的模型也可能无法识别或理解编码语言。由于许多编码表达依赖于基于发音的策略，我们进一步对编码和解码形式进行了语音分析。总之，我们的结果凸显了编码语言对于现实世界的 NLP 系统来说是一个重要且尚未充分探索的挑战。

Title: Text-to-State Mapping for Non-Resolution Reasoning: The Contradiction-Preservation Principle

Authors: Kei Saito
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.19933
Pdf URL: https://arxiv.org/pdf/2601.19933
Copy Paste: [[2601.19933]] Text-to-State Mapping for Non-Resolution Reasoning: The Contradiction-Preservation Principle(https://arxiv.org/abs/2601.19933)
Keywords: language model
Abstract: Non-Resolution Reasoning (NRR) provides a formal framework for maintaining semantic ambiguity rather than forcing premature interpretation collapse. While the foundational architecture establishes state spaces and operators for ambiguity-preserving computation, the critical question of how natural language maps to these mathematical structures remains open. This paper introduces the text-to-state mapping function $\phi$ that transforms linguistic input into superposition states within the NRR framework. We formalize the Contradiction-Preservation Principle, which requires that genuinely ambiguous expressions maintain non-zero entropy in their state representations, and develop extraction protocols using existing Large Language Models as interpretation generators. Empirical validation across 68 test sentences spanning lexical, structural, and pragmatic ambiguity demonstrates that our mapping achieves mean Shannon entropy H(S) = 1.087 bits for ambiguous inputs while baseline single-interpretation approaches yield H(S) = 0.000. The framework provides the missing algorithmic bridge between raw text and the formal state spaces on which NRR operators act, enabling architectural collapse deferment in language model inference.
摘要：非解析推理（NRR）提供了一个维持语义歧义的正式框架，而不是迫使解释过早崩溃。虽然基础架构为保留歧义的计算建立了状态空间和运算符，但自然语言如何映射到这些数学结构的关键问题仍然悬而未决。本文介绍了文本到状态映射函数 {\phi}，该函数将语言输入转换为 NRR 框架内的叠加状态。我们形式化了矛盾保持原理，它要求真正模糊的表达式在其状态表示中保持非零熵，并使用现有的大型语言模型作为解释生成器来开发提取协议。对跨越词汇、结构和语用歧义的 68 个测试句子进行的实证验证表明，对于歧义输入，我们的映射实现了平均香农熵 H(S) = 1.087 位，而基线单一解释方法产生 H(S) = 0.000。该框架提供了原始文本和 NRR 算子所作用的正式状态空间之间缺失的算法桥梁，从而实现了语言模型推理中的架构崩溃延迟。
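
To make the entropy figures in the abstract above concrete, here is a minimal sketch of Shannon entropy over a weighted set of interpretations. The function name and the example interpretation labels are hypothetical; the abstract does not specify how the state weights are produced, so the inputs here are simply assumed to be non-negative weights.

```python
import math
from typing import Mapping


def shannon_entropy_bits(weights: Mapping[str, float]) -> float:
    """Shannon entropy H(S) = -sum(p * log2(p)) of a normalized weight distribution."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    entropy = 0.0
    for w in weights.values():
        if w > 0:  # zero-weight interpretations contribute nothing
            p = w / total
            entropy -= p * math.log2(p)
    return entropy


# Two equally weighted readings of an ambiguous word (hypothetical labels).
ambiguous = shannon_entropy_bits({"bank/river": 0.5, "bank/finance": 0.5})
# A single committed reading, as in a single-interpretation baseline.
collapsed = shannon_entropy_bits({"bank/finance": 1.0})
```

A state that commits to one interpretation has zero entropy, which is why a single-interpretation baseline reports H(S) = 0.000, while two equally weighted readings give exactly 1 bit.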

Title: Quantifying non deterministic drift in large language models

Authors: Claire Nicholson
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.19934
Pdf URL: https://arxiv.org/pdf/2601.19934
Copy Paste: [[2601.19934]] Quantifying non deterministic drift in large language models(https://arxiv.org/abs/2601.19934)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) are widely used for tasks ranging from summarisation to decision support. In practice, identical prompts do not always produce identical outputs, even when temperature and other decoding parameters are fixed. In this work, we conduct repeated-run experiments to empirically quantify baseline behavioural drift, defined as output variability observed when the same prompt is issued multiple times under operator-free conditions. We evaluate two publicly accessible models, gpt-4o-mini and llama3.1-8b, across five prompt categories using exact repeats, perturbed inputs, and reuse modes at temperatures of 0.0 and 0.7. Drift is measured using unique output fractions, lexical similarity, and word count statistics, enabling direct comparison across models, prompting modes, and deployment types. The results show that nondeterminism persists even at temperature 0.0, with distinct variability patterns by model size, deployment, and prompt type. We situate these findings within existing work on concept drift, behavioural drift, and infrastructure-induced nondeterminism, discuss the limitations of lexical metrics, and highlight emerging semantic approaches. By establishing a systematic empirical baseline in the absence of stabilisation techniques, this study provides a reference point for evaluating future drift mitigation and control methods.
摘要：大型语言模型 (LLM) 广泛用于从摘要到决策支持等任务。实际上，即使温度和其他解码参数固定，相同的提示并不总是产生相同的输出。在这项工作中，我们进行了重复运行的实验，以凭经验量化基线行为漂移，定义为在无操作员的条件下多次发出相同提示时观察到的输出变异性。我们在 0.0 和 0.7 的温度下使用精确重复、扰动输入和重用模式，在五个提示类别中评估了两个可公开访问的模型 gpt-4o-mini 和 llama3.1-8b。漂移是使用独特的输出分数、词汇相似性和字数统计来衡量的，从而可以跨模型、提示模式和部署类型进行直接比较。结果表明，即使在温度 0.0 下，不确定性仍然存在，并且模型大小、部署和提示类型具有明显的可变性模式。我们将这些发现置于概念漂移、行为漂移和基础设施引起的不确定性的现有工作中，讨论词汇度量的局限性，并强调新兴的语义方法。通过在缺乏稳定技术的情况下建立系统的经验基线，本研究为评估未来的漂移缓解和控制方法提供了参考点。
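
The three drift measures named in the abstract above (unique output fraction, lexical similarity, word count statistics) could be computed along these lines. This is a sketch under stated assumptions: the abstract does not say which lexical similarity is used, so word-set Jaccard similarity stands in for it here, and all function names are hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev
from typing import Sequence, Tuple


def unique_output_fraction(outputs: Sequence[str]) -> float:
    """Distinct outputs divided by number of runs (1/n means fully repeatable)."""
    return len(set(outputs)) / len(outputs)


def jaccard(a: str, b: str) -> float:
    """Word-set Jaccard similarity; an assumed stand-in for 'lexical similarity'."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def mean_pairwise_similarity(outputs: Sequence[str]) -> float:
    """Average Jaccard similarity over all unordered pairs of runs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0
    return mean(jaccard(a, b) for a, b in pairs)


def word_count_stats(outputs: Sequence[str]) -> Tuple[float, float]:
    """Mean and population standard deviation of whitespace-token counts."""
    counts = [len(o.split()) for o in outputs]
    return mean(counts), pstdev(counts)


# Usage: three runs of the same prompt, one of which differs by a single word.
runs = ["the cat sat", "the cat sat", "the cat slept"]
fraction = unique_output_fraction(runs)
similarity = mean_pairwise_similarity(runs)
length_mean, length_sd = word_count_stats(runs)
```

Note that the word count statistics alone would miss this drift, since all three runs have the same length; that is one reason to report several measures side by side.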

Title: Mem2ActBench: A Benchmark for Evaluating Long-Term Memory Utilization in Task-Oriented Autonomous Agents

Authors: Yiting Shen, Kun Li, Wei Zhou, Songlin Hu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.19935
Pdf URL: https://arxiv.org/pdf/2601.19935
Copy Paste: [[2601.19935]] Mem2ActBench: A Benchmark for Evaluating Long-Term Memory Utilization in Task-Oriented Autonomous Agents(https://arxiv.org/abs/2601.19935)
Keywords: language model, llm, agent
Abstract: Large Language Model (LLM)-based agents are increasingly deployed for complex, tool-based tasks where long-term memory is critical to driving actions. Existing benchmarks, however, primarily test an agent's ability to passively retrieve isolated facts in response to explicit questions. They fail to evaluate the more crucial capability of actively applying memory to execute tasks. To address this gap, we introduce Mem2ActBench, a benchmark for evaluating whether agents can proactively leverage long-term memory to execute tool-based actions by selecting appropriate tools and grounding their parameters. The benchmark simulates persistent assistant usage, where users mention the same topic across long, interrupted interactions and expect previously established preferences and task states to be implicitly applied. We build the dataset with an automated pipeline that merges heterogeneous sources (ToolACE, BFCL, Oasst1), resolves conflicts via consistency modeling, and synthesizes 2,029 sessions with 12 user-assistant-tool turns on average. From these memory chains, a reverse-generation method produces 400 tool-use tasks, with human evaluation confirming that 91.3% are strongly memory-dependent. Experiments on seven memory frameworks show that current systems remain inadequate at actively utilizing memory for parameter grounding, highlighting the need for more effective approaches to evaluate and improve memory application in task execution.
摘要：基于大型语言模型 (LLM) 的代理越来越多地部署用于复杂的、基于工具的任务，其中长期记忆对于驱动操作至关重要。然而，现有的基准主要测试代理被动检索孤立事实以响应明确问题的能力。他们未能评估主动应用内存来执行任务的更重要的能力。为了解决这一差距，我们引入了 \textsc{Mem2ActBench}，这是一个基准，用于评估智能体是否可以通过选择适当的工具并根据其参数来主动利用长期记忆来执行基于工具的操作。该基准测试模拟持续的助理使用情况，其中用户在长时间、中断的交互中提及同一主题，并期望隐式应用先前建立的偏好和任务状态。我们使用自动化管道构建数据集，该管道合并异构源（ToolACE、BFCL、Oasst1），通过一致性建模解决冲突，并平均通过 12 个用户-助理-工具轮次合成 2,029 个会话。从这些记忆链中，反向生成方法产生 400 个工具使用任务，人工评估确认 91.3% 的任务强烈依赖于记忆。对七个内存框架的实验表明，当前系统在主动利用内存进行参数基础方面仍然不够，这凸显了需要更有效的方法来评估和改进任务执行中的内存应用。

Title: On the Effectiveness of LLM-Specific Fine-Tuning for Detecting AI-Generated Text

Authors: Michał Gromadzki, Anna Wróblewska, Agnieszka Kaliska
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.20006
Pdf URL: https://arxiv.org/pdf/2601.20006
Copy Paste: [[2601.20006]] On the Effectiveness of LLM-Specific Fine-Tuning for Detecting AI-Generated Text(https://arxiv.org/abs/2601.20006)
Keywords: language model, llm, prompt
Abstract: The rapid progress of large language models has enabled the generation of text that closely resembles human writing, creating challenges for authenticity verification in education, publishing, and digital security. Detecting AI-generated text has therefore become a crucial technical and ethical issue. This paper presents a comprehensive study of AI-generated text detection based on large-scale corpora and novel training strategies. We introduce a 1-billion-token corpus of human-authored texts spanning multiple genres and a 1.9-billion-token corpus of AI-generated texts produced by prompting a variety of LLMs across diverse domains. Using these resources, we develop and evaluate numerous detection models and propose two novel training paradigms: Per LLM and Per LLM family fine-tuning. Across a 100-million-token benchmark covering 21 large language models, our best fine-tuned detector achieves up to 99.6% token-level accuracy, substantially outperforming existing open-source baselines.
摘要：大型语言模型的快速进步使得能够生成与人类书写非常相似的文本，这给教育、出版和数字安全领域的真实性验证带来了挑战。因此，检测人工智能生成的文本已成为一个关键的技术和伦理问题。本文对基于大规模语料库和新颖训练策略的人工智能生成文本检测进行了全面研究。我们引入了一个包含 10 亿个令牌的跨越多种流派的人类创作文本的语料库，以及一个包含 19 亿个令牌的人工智能生成文本的语料库，这些文本是通过提示跨不同领域的各种法学硕士而生成的。利用这些资源，我们开发和评估了大量的检测模型，并提出了两种新颖的训练范例：每法学硕士和每法学硕士系列微调。在覆盖 21 种大型语言模型的 1 亿代币基准测试中，我们最好的微调检测器实现了高达 99.6\%$ 的代币级别准确度，大大优于现有的开源基准。

Title: LinguaMap: Which Layers of LLMs Speak Your Language and How to Tune Them?

Authors: J. Ben Tamo, Daniel Carlander-Reuterfelt, Jonathan Rubin, Dezhi Hong, Mingxian Wang, Oleg Poliannikov
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.20009
Pdf URL: https://arxiv.org/pdf/2601.20009
Copy Paste: [[2601.20009]] LinguaMap: Which Layers of LLMs Speak Your Language and How to Tune Them?(https://arxiv.org/abs/2601.20009)
Keywords: language model, llm, prompt
Abstract: Despite multilingual pretraining, large language models often struggle with non-English tasks, particularly in language control, the ability to respond in the intended language. We identify and characterize two key failure modes: the multilingual transfer bottleneck (correct language, incorrect task response) and the language consistency bottleneck (correct task response, wrong language). To systematically surface these issues, we design a four-scenario evaluation protocol spanning MMLU, MGSM, and XQuAD benchmarks. To probe these issues with interpretability, we extend logit lens analysis to track language probabilities layer by layer and compute cross-lingual semantic similarity of hidden states. The results reveal a three-phase internal structure: early layers align inputs into a shared semantic space, middle layers perform task reasoning, and late layers drive language-specific generation. Guided by these insights, we introduce selective fine-tuning of only the final layers responsible for language control. On Qwen-3-32B and Bloom-7.1B, this method achieves over 98 percent language consistency across six languages while fine-tuning only 3-5 percent of parameters, without sacrificing task accuracy. Importantly, this result is nearly identical to that of full-scope fine-tuning (for example, above 98 percent language consistency for both methods across all prompt scenarios) but uses a fraction of the computational resources. To the best of our knowledge, this is the first approach to leverage layer-localization of language control for efficient multilingual adaptation.
摘要：尽管进行了多语言预训练，大型语言模型常常难以应对非英语任务，特别是在语言控制方面，即以目标语言做出反应的能力方面。我们识别并描述了两种关键的故障模式：多语言传输瓶颈（正确的语言，不正确的任务响应）和语言一致性瓶颈（正确的任务响应，错误的语言）。为了系统地解决这些问题，我们设计了一个涵盖 MMLU、MGSM 和 XQuAD 基准的四种场景评估协议。为了探讨这些可解释性问题，我们扩展了逻辑透镜分析来逐层跟踪语言概率并计算隐藏状态的跨语言语义相似性。结果揭示了一个三阶段的内部结构：早期层将输入对齐到共享语义空间，中间层执行任务推理，后期层驱动特定于语言的生成。在这些见解的指导下，我们仅对负责语言控制的最后层进行选择性微调。在 Qwen-3-32B 和 Bloom-7.1B 上，该方法在六种语言中实现了超过 98% 的语言一致性，同时仅微调 3-5% 的参数，而不会牺牲任务准确性。重要的是，这个结果几乎与全范围微调的结果相同（例如，在所有提示场景中，两种方法的语言一致性超过 98%），但只使用一小部分计算资源。据我们所知，这是第一种利用语言控制的层本地化来实现高效多语言适应的方法。

Title: Semantic Uncertainty Quantification of Hallucinations in LLMs: A Quantum Tensor Network Based Method

Authors: Pragatheeswaran Vipulanandan, Kamal Premaratne, Dilip Sarkar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.20026
Pdf URL: https://arxiv.org/pdf/2601.20026
Copy Paste: [[2601.20026]] Semantic Uncertainty Quantification of Hallucinations in LLMs: A Quantum Tensor Network Based Method(https://arxiv.org/abs/2601.20026)
Keywords: language model, llm, hallucination, prompt, chat
Abstract: Large language models (LLMs) exhibit strong generative capabilities but remain vulnerable to confabulations, fluent yet unreliable outputs that vary arbitrarily even under identical prompts. Leveraging a quantum tensor network based pipeline, we propose a quantum physics inspired uncertainty quantification framework that accounts for aleatoric uncertainty in token sequence probability for semantic equivalence based clustering of LLM generations. This offers a principled and interpretable scheme for hallucination detection. We further introduce an entropy maximization strategy that prioritizes high certainty, semantically coherent outputs and highlights entropy regions where LLM decisions are likely to be unreliable, offering practical guidelines for when human oversight is warranted. We evaluate the robustness of our scheme under different generation lengths and quantization levels, dimensions overlooked in prior studies, demonstrating that our approach remains reliable even in resource constrained deployments. A total of 116 experiments on TriviaQA, NQ, SVAMP, and SQuAD across multiple architectures including Mistral-7B, Mistral-7B-instruct, Falcon-rw-1b, LLaMA-3.2-1b, LLaMA-2-13b-chat, LLaMA-2-7b-chat, LLaMA-2-13b, and LLaMA-2-7b show consistent improvements in AUROC and AURAC over state of the art baselines.
摘要：大型语言模型（LLM）表现出强大的生成能力，但仍然容易受到混淆的影响，即使在相同的提示下，输出也可能会任意变化，流畅但不可靠。利用基于量子张量网络的管道，我们提出了一种受量子物理学启发的不确定性量化框架，该框架考虑了基于语义等价的 LLM 世代聚类的令牌序列概率的任意不确定性。这为幻觉检测提供了一个原则性的、可解释的方案。我们进一步引入了一种熵最大化策略，该策略优先考虑高确定性、语义一致的输出，并突出显示 LLM 决策可能不可靠的熵区域，为何时需要人类监督提供实用指南。我们评估了我们的方案在不同的生成长度和量化水平下的鲁棒性，以及先前研究中忽略的维度，表明我们的方法即使在资源受限的部署中仍然可靠。在 TriviaQA、NQ、SVAMP 和 SQuAD 上跨多种架构（包括 Mistral-7B、Mistral-7B-instruct、Falcon-rw-1b、LLaMA-3.2-1b、LLaMA-2-13b-chat、LLaMA-2-7b-chat、LLaMA-2-13b 和 LLaMA-2-7b）进行的总共 116 次实验显示了在以下方面的一致改进： AUROC 和 AURAC 超过最先进的基线。

Title: VERGE: Formal Refinement and Guidance Engine for Verifiable LLM Reasoning

Authors: Vikash Singh, Darion Cassel, Nathaniel Weir, Nick Feng, Sam Bayless
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.20055
Pdf URL: https://arxiv.org/pdf/2601.20055
Copy Paste: [[2601.20055]] VERGE: Formal Refinement and Guidance Engine for Verifiable LLM Reasoning(https://arxiv.org/abs/2601.20055)
Keywords: language model, gpt, llm
Abstract: Despite the syntactic fluency of Large Language Models (LLMs), ensuring their logical correctness in high-stakes domains remains a fundamental challenge. We present a neurosymbolic framework that combines LLMs with SMT solvers to produce verification-guided answers through iterative refinement. Our approach decomposes LLM outputs into atomic claims, autoformalizes them into first-order logic, and verifies their logical consistency using automated theorem proving. We introduce three key innovations: (1) multi-model consensus via formal semantic equivalence checking to ensure logic-level alignment between candidates, eliminating the syntactic bias of surface-form metrics, (2) semantic routing that directs different claim types to appropriate verification strategies: symbolic solvers for logical claims and LLM ensembles for commonsense reasoning, and (3) precise logical error localization via Minimal Correction Subsets (MCS), which pinpoint the exact subset of claims to revise, transforming binary failure signals into actionable feedback. Our framework classifies claims by their logical status and aggregates multiple verification signals into a unified score with variance-based penalty. The system iteratively refines answers using structured feedback until acceptance criteria are met or convergence is achieved. This hybrid approach delivers formal guarantees where possible and consensus verification elsewhere, advancing trustworthy AI. With the GPT-OSS-120B model, VERGE demonstrates an average performance uplift of 18.7% at convergence across a set of reasoning benchmarks compared to single-pass approaches.
摘要：尽管大型语言模型 (LLM) 的语法非常流畅，但确保其在高风险领域的逻辑正确性仍然是一项根本挑战。我们提出了一个神经符号框架，它将 LLM 与 SMT 求解器相结合，通过迭代细化产生验证引导的答案。我们的方法将 LLM 输出分解为原子声明，将它们自动形式化为一阶逻辑，并使用自动定理证明来验证它们的逻辑一致性。我们引入了三项关键创新：（1）通过形式语义等价检查实现多模型共识，以确保候选人之间的逻辑级对齐，消除表面形式度量的句法偏差，（2）语义路由，将不同的声明类型引导到适当的验证策略：用于逻辑声明的符号解算器和用于常识推理的LLM集成，以及（3）通过最小校正子集（MCS）进行精确的逻辑错误定位，精确定位要修改的声明的确切子集，将二进制故障信号转换为可操作的反馈。我们的框架根据声明的逻辑状态对声明进行分类，并将多个验证信号聚合成具有基于方差的惩罚的统一分数。系统使用结构化反馈迭代地完善答案，直到满足验收标准或实现收敛。这种混合方法在可能的情况下提供正式的保证，并在其他地方提供共识验证，从而推进值得信赖的人工智能。借助 GPT-OSS-120B 模型，与单遍方法相比，VERGE 在一组推理基准的收敛时表现出平均性能提升 18.7%。

Title: Counterfactual Cultural Cues Reduce Medical QA Accuracy in LLMs: Identifier vs Context Effects

Authors: Amirhossein Haji Mohammad Rezaei, Zahra Shakeri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.20102
Pdf URL: https://arxiv.org/pdf/2601.20102
Copy Paste: [[2601.20102]] Counterfactual Cultural Cues Reduce Medical QA Accuracy in LLMs: Identifier vs Context Effects(https://arxiv.org/abs/2601.20102)
Keywords: language model, gpt, llm, prompt
Abstract: Engineering sustainable and equitable healthcare requires medical language models that do not change clinically correct diagnoses when presented with non-decisive cultural information. We introduce a counterfactual benchmark that expands 150 MedQA test items into 1650 variants by inserting culture-related (i) identifier tokens, (ii) contextual cues, or (iii) their combination for three groups (Indigenous Canadian, Middle-Eastern Muslim, Southeast Asian), plus a length-matched neutral control, where a clinician verified that the gold answer remains invariant in all variants. We evaluate GPT-5.2, Llama-3.1-8B, DeepSeek-R1, and MedGemma (4B/27B) under option-only and brief-explanation prompting. Across models, cultural cues significantly affect accuracy (Cochran's Q, $p<10^{-14}$), with the largest degradation when identifier and context co-occur (up to 3-7 percentage points under option-only prompting), while neutral edits produce smaller, non-systematic changes. A human-validated rubric ($\kappa=0.76$) applied via an LLM-as-judge shows that more than half of culturally grounded explanations end in an incorrect answer, linking culture-referential reasoning to diagnostic failure. We release prompts and augmentations to support evaluation and mitigation of culturally induced diagnostic errors.
摘要：工程可持续和公平的医疗保健需要医学语言模型，当呈现非决定性的文化信息时，该模型不会改变临床上正确的诊断。我们引入了一个反事实基准，通过插入与文化相关的 (i) 标识符标记、(ii) 上下文线索或 (iii) 它们对三个群体（加拿大土著、中东穆斯林、东南亚）的组合，以及长度匹配的中性控制，将 150 个 MedQA 测试项目扩展到 1650 个变体，其中临床医生验证黄金答案在所有变体中保持不变。我们在仅选项和简要解释提示下评估 GPT-5.2、Llama-3.1-8B、DeepSeek-R1 和 MedGemma (4B/27B)。在模型中，文化线索显着影响准确性（Cochran's Q，$p<10^-14$），当标识符和上下文同时出现时，准确性下降最大（在仅选项提示下最多 3-7 个百分点），而中性编辑会产生较小的非系统性变化。通过法学硕士法官应用的人工验证的标题（$\kappa=0.76$）显示，超过一半的基于文化的解释以错误的答案结束，将文化参照推理与诊断失败联系起来。我们发布提示和增强功能，以支持评估和减轻文化引起的诊断错误。
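
The variant count in the abstract above can be reproduced with a short enumeration. This is one reading that is consistent with the reported numbers, not a confirmed description of the benchmark: it assumes each base item appears in 3 groups x 3 cue types, plus one neutral control, plus the unmodified original, giving 11 conditions per item. The names below are hypothetical.

```python
from itertools import product
from typing import List

GROUPS = ("Indigenous Canadian", "Middle-Eastern Muslim", "Southeast Asian")
CUE_TYPES = ("identifier", "context", "identifier+context")
BASE_ITEMS = 150  # MedQA test items, as stated in the abstract


def variant_conditions(include_original: bool = True) -> List[str]:
    """List the conditions each base item is assumed to appear in."""
    conditions = [f"{group} / {cue}" for group, cue in product(GROUPS, CUE_TYPES)]
    conditions.append("neutral control")
    if include_original:
        # Assumption: the unmodified item is counted among the variants.
        conditions.insert(0, "original")
    return conditions


TOTAL_VARIANTS = BASE_ITEMS * len(variant_conditions())
```

Under this assumption the total is 150 x 11 = 1650, matching the abstract; without counting the original item the total would be 1500.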

Title: FFE-Hallu:Hallucinations in Fixed Figurative Expressions:Benchmark of Idioms and Proverbs in the Persian Language

Authors: Faezeh Hosseini, Mohammadali Yousefzadeh, Yadollah Yaghoobzadeh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.20105
Pdf URL: https://arxiv.org/pdf/2601.20105
Copy Paste: [[2601.20105]] FFE-Hallu:Hallucinations in Fixed Figurative Expressions:Benchmark of Idioms and Proverbs in the Persian Language(https://arxiv.org/abs/2601.20105)
Keywords: language model, gpt, llm, hallucination
Abstract: Figurative language, particularly fixed figurative expressions (FFEs) such as idioms and proverbs, poses persistent challenges for large language models (LLMs). Unlike literal phrases, FFEs are culturally grounded, largely non-compositional, and conventionally fixed, making them especially vulnerable to figurative hallucination. We define figurative hallucination as the generation or endorsement of expressions that sound idiomatic and plausible but do not exist as authentic figurative expressions in the target language. We introduce FFEHallu, the first comprehensive benchmark for evaluating figurative hallucination in LLMs, with a focus on Persian, a linguistically rich yet underrepresented language. FFEHallu consists of 600 carefully curated instances spanning three complementary tasks: (i) FFE generation from meaning, (ii) detection of fabricated FFEs across four controlled construction categories, and (iii) FFE-to-FFE translation from English to Persian. Evaluating six state-of-the-art multilingual LLMs, we find systematic weaknesses in figurative competence and cultural grounding. While models such as GPT4.1 demonstrate relatively strong performance in rejecting fabricated FFEs and retrieving authentic ones, most models struggle to reliably distinguish real expressions from high-quality fabrications and frequently hallucinate during cross-lingual translation. These findings reveal substantial gaps in current LLMs' handling of figurative language and underscore the need for targeted benchmarks to assess and mitigate figurative hallucination.
摘要：比喻语言，特别是成语和谚语等固定比喻表达（FFE），对大型语言模型（LLM）提出了持续的挑战。与字面短语不同，FFE 具有文化基础，很大程度上是非组合性的，并且传统上是固定的，这使得它们特别容易受到比喻幻觉的影响。我们将比喻性幻觉定义为听起来惯用且合理但在目标语言中并不存在的真实比喻性表达的产生或认可。我们推出了 FFEHallu，这是第一个评估法学硕士比喻性幻觉的综合基准，重点关注波斯语，这是一种语言丰富但代表性不足的语言。 FFEHallu 包含 600 个精心策划的实例，涵盖三个互补任务：(i) 根据含义生成 FFE，(ii) 跨四个受控构造类别检测制造的 FFE，以及 (iii) 从英语到波斯语的 FFE 到 FFE 翻译。通过评估六种最先进的多语言法学硕士，我们发现了比喻能力和文化基础方面的系统性弱点。虽然 GPT4.1 等模型在拒绝捏造的 FFE 和检索真实的 FFE 方面表现出相对较强的性能，但大多数模型难以可靠地区分真实表达与高质量捏造，并且在跨语言翻译过程中经常出现幻觉。这些发现揭示了当前法学硕士在处理比喻语言方面存在巨大差距，并强调需要有针对性的基准来评估和减轻比喻幻觉。

Title: Rewarding Intellectual Humility Learning When Not To Answer In Large Language Models

Authors: Abha Jha, Akanksha Mahajan, Ashwath Vaithinathan Aravindan, Praveen Saravanan, Sai Sailaja Policharla, Sonal Chaturbhuj Gehlot
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.20126
Pdf URL: https://arxiv.org/pdf/2601.20126
Copy Paste: [[2601.20126]] Rewarding Intellectual Humility Learning When Not To Answer In Large Language Models(https://arxiv.org/abs/2601.20126)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) often produce hallucinated or unverifiable content, undermining their reliability in factual domains. This work investigates Reinforcement Learning with Verifiable Rewards (RLVR) as a training paradigm that explicitly rewards abstention ("I don't know") alongside correctness to promote intellectual humility. We fine-tune and evaluate Granite-3.3-2B-Instruct and Qwen-3-4B-Instruct on the MedMCQA and Hendrycks Math benchmarks using a ternary reward structure ($-1$, r_abs, 1) under varying values of the abstention reward r_abs. We further study the effect of combining RLVR with supervised fine-tuning strategies that teach abstention prior to reinforcement learning. Our results show that moderate abstention rewards (r_abs $\approx -0.25$ to 0.3) consistently reduce incorrect responses without severe accuracy degradation on multiple-choice tasks, with larger models exhibiting greater robustness to abstention incentives. On open-ended question answering, we observe limitations due to insufficient exploration, which can be partially mitigated through supervised abstention training. Overall, these findings demonstrate the feasibility and flexibility of verifiable reward design as a practical approach for hallucination mitigation in language models. Reproducible code for our abstention training framework is available at this https URL.
摘要：大型语言模型 (LLM) 通常会产生幻觉或无法验证的内容，从而破坏了它们在事实领域的可靠性。这项工作研究了带有可验证奖励的强化学习（RLVR）作为一种训练范式，它明确奖励弃权（“我不知道”）和正确性，以促进智力谦逊。我们在不同的弃权奖励结构下使用三元奖励结构（$-1$，r_abs，1）在 MedMCQA 和 Hendrycks Math 基准上微调和评估 Granite-3.3-2B-Instruct 和 Qwen-3-4B-Instruct。我们进一步研究了将 RLVR 与在强化学习之前教授弃权的监督微调策略相结合的效果。我们的结果表明，适度的弃权奖励（r_abs $\approx -0.25$ 至 0.3）持续减少错误反应，而不会严重降低多项选择任务的准确性，并且较大的模型对弃权激励表现出更强的鲁棒性。在开放式问答中，我们观察到由于探索不足而造成的局限性，这可以通过监督弃权训练来部分缓解。总的来说，这些发现证明了可验证奖励设计作为语言模型中缓解幻觉的实用方法的可行性和灵活性。我们的弃权训练框架的可复制代码可以在这里 https URL 找到。
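
The ternary reward in the abstract above (incorrect: -1, abstain: r_abs, correct: 1) can be sketched as a small scoring function, together with the confidence level below which abstaining has the higher expected reward. This is an illustrative sketch, not the authors' training code: the function names, the exact-match comparison, and the abstention string handling are assumptions.

```python
def ternary_reward(answer: str, gold: str, r_abs: float,
                   abstain_token: str = "I don't know") -> float:
    """Return 1 for a correct answer, r_abs for abstention, -1 otherwise."""
    cleaned = answer.strip()
    if cleaned.lower() == abstain_token.lower():
        return r_abs
    # Assumption: correctness is checked by exact string match.
    return 1.0 if cleaned == gold.strip() else -1.0


def abstain_threshold(r_abs: float) -> float:
    """Probability of being correct below which abstaining pays more.

    Answering with success probability p has expected reward
    p * 1 + (1 - p) * (-1) = 2p - 1, so abstaining is better when
    2p - 1 < r_abs, that is, when p < (1 + r_abs) / 2.
    """
    return (1.0 + r_abs) / 2.0


# Usage with the two ends of the moderate range reported in the abstract.
low = abstain_threshold(-0.25)
high = abstain_threshold(0.3)
```

Under this reward, r_abs = -0.25 makes abstention worthwhile only when the chance of being right falls below 37.5%, while r_abs = 0.3 raises that break-even point to 65%, so larger abstention rewards push a reward-maximizing policy toward declining more often.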

Title: Trajectory2Task: Training Robust Tool-Calling Agents with Synthesized Yet Verifiable Data for Complex User Intents

Authors: Ziyi Wang, Yuxuan Lu, Yimeng Zhang, Jing Huang, Jiri Gesi, Xianfeng Tang, Chen Luo, Yisi Sang, Hanqing Lu, Manling Li, Dakuo Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.20144
Pdf URL: https://arxiv.org/pdf/2601.20144
Copy Paste: [[2601.20144]] Trajectory2Task: Training Robust Tool-Calling Agents with Synthesized Yet Verifiable Data for Complex User Intents(https://arxiv.org/abs/2601.20144)
Keywords: llm, agent
Abstract: Tool-calling agents are increasingly deployed in real-world customer-facing workflows. Yet most studies on tool-calling agents focus on idealized settings with general, fixed, and well-specified tasks. In real-world applications, user requests are often (1) ambiguous, (2) changing over time, or (3) infeasible due to policy constraints, and training and evaluation data that cover these diverse, complex interaction patterns remain under-represented. To bridge the gap, we present Trajectory2Task, a verifiable data generation pipeline for studying tool use at scale under three realistic user scenarios: ambiguous intent, changing intent, and infeasible intent. The pipeline first conducts multi-turn exploration to produce valid tool-call trajectories. It then converts these trajectories into user-facing tasks with controlled intent adaptations. This process yields verifiable tasks that support closed-loop evaluation and training. We benchmark seven state-of-the-art LLMs on the generated complex user scenario tasks and observe frequent failures. Finally, using successful trajectories obtained from task rollouts, we fine-tune lightweight LLMs and find consistent improvements across all three conditions, along with better generalization to unseen tool-use domains, indicating stronger general tool-calling ability.
摘要：工具调用代理越来越多地部署在现实世界中面向客户的工作流程中。然而，大多数关于工具调用代理的研究都集中在具有一般、固定和明确指定任务的理想化环境。在现实应用中，用户请求通常是 (1) 不明确，(2) 随着时间的推移而变化，或 (3) 由于政策限制而不可行，而涵盖这些多样化、复杂的交互模式的培训和评估数据仍然代表性不足。为了弥补这一差距，我们提出了 Trajectory2Task，这是一个可验证的数据生成管道，用于研究在三种现实用户场景下大规模使用工具：意图不明确、意图变化和意图不可行。该管道首先进行多轮探索以产生有效的工具调用轨迹。然后，它将这些轨迹转换为具有受控意图适应的面向用户的任务。此过程产生支持闭环评估和培训的可验证任务。我们在生成的复杂用户场景任务上对七个最先进的法学硕士进行了基准测试，并观察了频繁的失败。最后，利用从任务推出中获得的成功轨迹，我们对轻量级 LLM 进行了微调，并在所有三个条件下发现了一致的改进，以及对未见过的工具使用领域的更好的泛化，表明更强的通用工具调用能力。

Title: Me-Agent: A Personalized Mobile Agent with Two-Level User Habit Learning for Enhanced Interaction

Authors: Shuoxin Wang, Chang Liu, Gowen Loo, Lifan Zheng, Kaiwen Wei, Xinyi Zeng, Jingyuan Zhang, Yu Tian
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.20162
Pdf URL: https://arxiv.org/pdf/2601.20162
Copy Paste: [[2601.20162]] Me-Agent: A Personalized Mobile Agent with Two-Level User Habit Learning for Enhanced Interaction(https://arxiv.org/abs/2601.20162)
Keywords: language model, llm, prompt, agent
Abstract: Large Language Model (LLM)-based mobile agents have made significant performance advancements. However, these agents often follow explicit user instructions while overlooking personalized needs, leading to significant limitations for real users, particularly without personalized context: (1) inability to interpret ambiguous instructions, (2) lack of learning from user interaction history, and (3) failure to handle personalized instructions. To alleviate the above challenges, we propose Me-Agent, a learnable and memorable personalized mobile agent. Specifically, Me-Agent incorporates a two-level user habit learning approach. At the prompt level, we design a user preference learning strategy enhanced with a Personal Reward Model to improve personalization performance. At the memory level, we design a Hierarchical Preference Memory, which stores users' long-term memory and app-specific memory at different memory levels. To validate the personalization capabilities of mobile agents, we introduce User FingerTip, a new benchmark featuring numerous ambiguous instructions for daily life. Extensive experiments on User FingerTip and general benchmarks demonstrate that Me-Agent achieves state-of-the-art performance in personalization while maintaining competitive instruction execution performance.
摘要：基于大型语言模型 (LLM) 的移动代理取得了显着的性能提升。然而，这些代理通常遵循明确的用户指令，而忽视个性化需求，从而导致真实用户受到重大限制，特别是在没有个性化上下文的情况下：（1）无法解释模糊指令，（2）缺乏从用户交互历史中学习，以及（3）无法处理个性化指令。为了缓解上述挑战，我们提出了 Me-Agent，一个可学习且难忘的个性化移动代理。具体来说，Me-Agent采用了两级用户习惯学习方法。在提示级别，我们设计了一种通过个人奖励模型增强的用户偏好学习策略，以提高个性化性能。在记忆层面，我们设计了分层偏好记忆，将用户的长期记忆和应用特定记忆存储在不同级别的记忆中。为了验证移动代理的个性化功能，我们引入了 User FingerTip，这是一个新的基准，具有大量针对日常生活的模糊指令。对 User FingerTip 和一般基准的大量实验表明，Me-Agent 在个性化方面实现了最先进的性能，同时保持了有竞争力的指令执行性能。

Title: Unit-Based Agent for Semi-Cascaded Full-Duplex Dialogue Systems

Authors: Haoyuan Yu, Yuxuan Chen, Minjie Cai
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2601.20230
Pdf URL: https://arxiv.org/pdf/2601.20230
Copy Paste: [[2601.20230]] Unit-Based Agent for Semi-Cascaded Full-Duplex Dialogue Systems(https://arxiv.org/abs/2601.20230)
Keywords: language model, agent
Abstract: Full-duplex voice interaction is crucial for natural human-computer interaction. We present a framework that decomposes complex dialogue into minimal conversational units, enabling the system to process each unit independently and predict when to transition to the next. This framework is instantiated as a semi-cascaded full-duplex dialogue system built around a multimodal large language model, supported by auxiliary modules such as voice activity detection (VAD) and text-to-speech (TTS) synthesis. The resulting system operates in a training-free, plug-and-play manner. Experiments on the HumDial dataset demonstrate the effectiveness of our framework, which ranks second among all teams on the test set of the Human-like Spoken Dialogue Systems Challenge (Track 2: Full-Duplex Interaction). Code is available in the GitHub repository at this https URL.
摘要：全双工语音交互对于自然的人机交互至关重要。我们提出了一个框架，将复杂的对话分解为最小的对话单元，使系统能够独立处理每个单元并预测何时转换到下一个单元。该框架被实例化为围绕多模态大语言模型构建的半级联全双工对话系统，并由语音活动检测（VAD）和文本到语音（TTS）合成等辅助模块支持。由此产生的系统以免训练、即插即用的方式运行。在 HumDial 数据集上的实验证明了我们框架的有效性，该框架在类人口语系统挑战赛（赛道 2：全双工交互）测试集的所有团队中排名第二。代码可在 GitHub 存储库中使用此 https URL 获取。

Title: Automated Benchmark Generation from Domain Guidelines Informed by Bloom's Taxonomy

Authors: Si Chen, Le Huy Khiem, Annalisa Szymanski, Ronald Metoyer, Ting Hua, Nitesh V. Chawla
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.20253
Pdf URL: https://arxiv.org/pdf/2601.20253
Copy Paste: [[2601.20253]] Automated Benchmark Generation from Domain Guidelines Informed by Bloom's Taxonomy(https://arxiv.org/abs/2601.20253)
Keywords: llm
Abstract: Open-ended question answering (QA) evaluates a model's ability to perform contextualized reasoning beyond factual recall. This challenge is especially acute in practice-based domains, where knowledge is procedural and grounded in professional judgment, while most existing LLM benchmarks depend on pre-existing human exam datasets that are often unavailable in such settings. We introduce a framework for automated benchmark generation from expert-authored guidelines, informed by Bloom's Taxonomy. It converts expert practices into implicit violation-based scenarios and expands them into auto-graded multiple-choice questions (MCQs) and multi-turn dialogues across four cognitive levels, enabling deterministic, reproducible, and scalable evaluation. Applying the framework to three applied domains (teaching, dietetics, and caregiving), we find differences between model and human-like reasoning: LLMs sometimes perform relatively better on higher-order reasoning (Analyze) but fail more frequently on lower-level items (Remember). We produce large-scale, psychometrically informed benchmarks that surface these non-intuitive model behaviors and enable evaluation of contextualized reasoning in real-world settings.
摘要：开放式问答 (QA) 评估模型执行事实回忆之外的情境推理的能力。这一挑战在基于实践的领域尤其严峻，这些领域的知识是程序性的并以专业判断为基础，而大多数现有的 LLM 基准依赖于预先存在的人类考试数据集，而这些数据集在此类环境中通常不可用。我们引入了一个借鉴布卢姆分类法、根据专家撰写的指南自动生成基准的框架。它将专家实践转换为基于隐式违规的场景，并将其扩展为自动评分多项选择题 (MCQ) 和跨四个认知级别的多轮对话，从而实现确定性、可重复和可扩展的评估。将该框架应用于三个应用领域（教学、营养学和护理）后，我们发现模型推理和类人推理之间的差异：LLM 有时在高阶推理（分析）上表现相对较好，但在较低级别的项目（记住）上失败的频率更高。我们制定了大规模的、基于心理测量学的基准，这些基准揭示了这些非直观的模型行为，并能够在现实世界环境中评估情境化推理。

Title: SoftHateBench: Evaluating Moderation Models Against Reasoning-Driven, Policy-Compliant Hostility

Authors: Xuanyu Su, Diana Inkpen, Nathalie Japkowicz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.20256
Pdf URL: https://arxiv.org/pdf/2601.20256
Copy Paste: [[2601.20256]] SoftHateBench: Evaluating Moderation Models Against Reasoning-Driven, Policy-Compliant Hostility(https://arxiv.org/abs/2601.20256)
Keywords: llm
Abstract: Online hate on social media ranges from overt slurs and threats (hard hate speech) to soft hate speech: discourse that appears reasonable on the surface but uses framing and value-based arguments to steer audiences toward blaming or excluding a target group. We hypothesize that current moderation systems, largely optimized for surface toxicity cues, are not robust to this reasoning-driven hostility, yet existing benchmarks do not measure this gap systematically. We introduce SoftHateBench, a generative benchmark that produces soft-hate variants while preserving the underlying hostile standpoint. To generate soft hate, we integrate the Argumentum Model of Topics (AMT) and Relevance Theory (RT) in a unified framework: AMT provides the backbone argument structure for rewriting an explicit hateful standpoint into a seemingly neutral discussion while preserving the stance, and RT guides generation to keep the AMT chain logically coherent. The benchmark spans 7 sociocultural domains and 28 target groups, comprising 4,745 soft-hate instances. Evaluations across encoder-based detectors, general-purpose LLMs, and safety models show a consistent drop from hard to soft tiers: systems that detect explicit hostility often fail when the same stance is conveyed through subtle, reasoning-based language. Disclaimer: Contains offensive examples used solely for research.
摘要：社交媒体上的在线仇恨范围从公开的诽谤和威胁（硬仇恨言论）到软仇恨言论：表面上看似合理的话语，但使用框架和基于价值的论点来引导受众指责或排除目标群体。我们假设当前的审核系统主要针对表面毒性线索进行了优化，对于这种推理驱动的敌意并不稳健，但现有的基准并没有系统地衡量这一差距。我们引入了 SoftHateBench，这是一个生成基准，可以产生软仇恨变体，同时保留潜在的敌对立场。为了产生软仇恨，我们将主题论证模型 (AMT) 和相关性理论 (RT) 整合到一个统一的框架中：AMT 提供了骨干论证结构，用于将明确的仇恨观点重写为看似中立的讨论，同时保留立场，RT 指导生成以保持 AMT 链逻辑上的连贯。该基准跨越 7 个社会文化领域和 28 个目标群体，包括 4,745 个软仇恨实例。对基于编码器的检测器、通用 LLM 和安全模型的评估显示，从硬层到软层的持续下降：当通过微妙的、基于推理的语言传达相同的立场时，检测明确敌意的系统通常会失败。免责声明：包含仅用于研究的冒犯性示例。

Title: RusLICA: A Russian-Language Platform for Automated Linguistic Inquiry and Category Analysis

Authors: Elina Sigdel, Anastasia Panfilova
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.20275
Pdf URL: https://arxiv.org/pdf/2601.20275
Copy Paste: [[2601.20275]] RusLICA: A Russian-Language Platform for Automated Linguistic Inquiry and Category Analysis(https://arxiv.org/abs/2601.20275)
Keywords: language model
Abstract: Defining psycholinguistic characteristics in written texts is a task gaining increasing attention from researchers. One of the most widely used tools in this field is Linguistic Inquiry and Word Count (LIWC), which was originally developed to analyze English texts and has since been translated into multiple languages. Our approach adapts the LIWC methodology to the Russian language, taking its grammatical and cultural specificities into account. The suggested approach comprises 96 categories, integrating syntactic, morphological, lexical, and general statistical features, as well as predictions obtained from pre-trained language models (LMs) for text analysis. Rather than directly translating existing thesauri, we built the dictionary specifically for the Russian language based on content from several lexicographic resources, semantic dictionaries, and corpora. The paper describes the process of mapping lemmas to 42 psycholinguistic categories and the implementation of the analyzer as part of the RusLICA web service.
摘要：定义书面文本中的心理语言特征是一项越来越受到研究人员关注的任务。当前领域中使用最广泛的工具之一是语言查询和字数统计（LIWC），最初是为了分析英语文本并翻译成多种语言而开发的。考虑到俄语的语法和文化特殊性，我们的方法将 LIWC 方法改编为俄语。建议的方法包括 96 个类别，集成了句法、形态、词汇、一般统计特征以及使用用于文本分析的预训练语言模型 (LM) 获得的预测结果。我们没有将直接翻译应用于现有的辞典，而是根据多个词典资源、语义词典和语料库的内容专门为俄语构建了词典。本文描述了将引理映射到 42 个心理语言学类别的过程以及作为 RusLICA Web 服务一部分的分析器的实现。

Title: Beyond the Needle's Illusion: Decoupled Evaluation of Evidence Access and Use under Semantic Interference at 326M-Token Scale

Authors: Tianwei Lin, Zuyi Zhou, Xinda Zhao, Chenke Wang, Xiaohong Li, Yu Chen, Chuanrui Hu, Jian Pei, Yafeng Deng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.20276
Pdf URL: https://arxiv.org/pdf/2601.20276
Copy Paste: [[2601.20276]] Beyond the Needle's Illusion: Decoupled Evaluation of Evidence Access and Use under Semantic Interference at 326M-Token Scale(https://arxiv.org/abs/2601.20276)
Keywords: llm, prompt, agent
Abstract: Long-context LLM agents must access the right evidence from large environments and use it faithfully. However, the popular Needle-in-a-Haystack (NIAH) evaluation mostly measures benign span localization. The needle is near-unique, and the haystack is largely irrelevant. We introduce EverMemBench-S (EMB-S), an adversarial NIAH-style benchmark built on a 326M-token MemoryBank. While the full MemoryBank spans 326M tokens for retrieval-based (RAG) evaluation, we evaluate native long-context models only at scales that fit within each model's context window (up to 1M tokens in this work) to ensure a fair comparison. EMB-S pairs queries with collision-tested near-miss hard negatives and gold evidence sets spanning one or more documents, validated via human screening and LLM verification. We also propose a decoupled diagnostic protocol that reports evidence access (document-ID localization) separately from end-to-end QA quality under full-context prompting. This enables consistent diagnosis for both native long-context prompting and retrieval pipelines. Across a reference-corpus ladder from domain-isolated 64K contexts to a globally shared 326M-token environment, we observe a clear reality gap. Systems that saturate benign NIAH degrade sharply in evidence access under semantic interference. These results indicate that semantic discrimination, not context length alone, is the dominant bottleneck for long-context memory at scale.
摘要：长上下文 LLM 智能体必须从大型环境中获取正确的证据并忠实地使用它。然而，流行的大海捞针（NIAH）评估主要衡量良性跨度定位。针几乎是独一无二的，而草堆在很大程度上是无关的。我们推出了 EverMemBench-S (EMB-S)，这是一种基于 326M 令牌 MemoryBank 构建的对抗性 NIAH 风格基准测试。虽然完整的 MemoryBank 跨越 326M 令牌进行基于检索 (RAG) 的评估，但我们仅以适合每个模型上下文窗口的规模评估本机长上下文模型（在本工作中最多 1M 令牌），以确保公平比较。EMB-S 将查询与经过碰撞测试的近似命中的困难负例和跨越一份或多份文档的黄金证据集配对，并通过人工筛选和 LLM 验证进行验证。我们还提出了一种解耦的诊断协议，该协议在全上下文提示下将证据访问（文档 ID 定位）与端到端 QA 质量分开报告。这使得本机长上下文提示和检索管道能够进行一致的诊断。跨越从领域隔离的 64K 上下文到全球共享的 326M 令牌环境的参考语料库阶梯，我们观察到明显的现实差距。在语义干扰下，在良性 NIAH 上达到饱和的系统在证据访问方面急剧退化。这些结果表明，语义辨别（而不仅仅是上下文长度）是大规模长上下文记忆的主要瓶颈。

Title: SAPO: Self-Adaptive Process Optimization Makes Small Reasoners Stronger

Authors: Kaiyuan Chen, Guangmin Zheng, Jin Wang, Xiaobing Zhou, Xuejie Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.20312
Pdf URL: https://arxiv.org/pdf/2601.20312
Copy Paste: [[2601.20312]] SAPO: Self-Adaptive Process Optimization Makes Small Reasoners Stronger(https://arxiv.org/abs/2601.20312)
Keywords: language model
Abstract: Existing self-evolution methods overlook the influence of fine-grained reasoning steps, which leads to the reasoner-verifier gap. The computational inefficiency of Monte Carlo (MC) process supervision further exacerbates the difficulty of mitigating this gap. Motivated by Error-Related Negativity (ERN), by which a reasoner can localize errors following incorrect decisions and make rapid adjustments, we propose a Self-Adaptive Process Optimization (SAPO) method for self-improvement in Small Language Models (SLMs). SAPO adaptively and efficiently introduces process supervision signals by actively minimizing the reasoner-verifier gap rather than relying on inefficient MC estimations. Extensive experiments demonstrate that the proposed method outperforms most existing self-evolution methods on two challenging task types: mathematics and code. Additionally, to further investigate SAPO's impact on verifier performance, this work introduces two new benchmarks for process reward models in both mathematical and coding tasks.
摘要：现有的自进化方法忽视了细粒度推理步骤的影响，从而导致推理者与验证者之间的差距。蒙特卡罗（MC）过程监督的计算效率低下进一步加剧了缩小差距的难度。受错误相关负性（ERN）的启发，推理机可以在错误决策后定位错误，指导快速调整，我们提出了一种自适应过程优化（SAPO）方法，用于小语言模型（SLM）中的自我改进。 SAPO 通过主动最小化推理器-验证器差距而不是依赖于低效的 MC 估计，自适应且高效地引入过程监督信号。大量的实验表明，所提出的方法在数学和代码这两种具有挑战性的任务类型上优于大多数现有的自进化方法。此外，为了进一步研究 SAPO 对验证者性能的影响，这项工作为数学和编码任务中的过程奖励模型引入了两个新的基准。

Title: Beyond Speedup -- Utilizing KV Cache for Sampling and Reasoning

Authors: Zeyu Xing, Xing Li, Hui-Ling Zhen, Mingxuan Yuan, Sinno Jialin Pan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.20326
Pdf URL: https://arxiv.org/pdf/2601.20326
Copy Paste: [[2601.20326]] Beyond Speedup -- Utilizing KV Cache for Sampling and Reasoning(https://arxiv.org/abs/2601.20326)
Keywords: llm
Abstract: KV caches, typically used only to speed up autoregressive decoding, encode contextual information that can be reused for downstream tasks at no extra cost. We propose treating the KV cache as a lightweight representation, eliminating the need to recompute or store full hidden states. Despite being weaker than dedicated embeddings, KV-derived representations are shown to be sufficient for two key applications: (i) Chain-of-Embedding, where they achieve competitive or superior performance on Llama-3.1-8B-Instruct and Qwen2-7B-Instruct; and (ii) Fast/Slow Thinking Switching, where they enable adaptive reasoning on Qwen3-8B and DeepSeek-R1-Distill-Qwen-14B, reducing token generation by up to 5.7x with minimal accuracy loss. Our findings establish KV caches as a free, effective substrate for sampling and reasoning, opening new directions for representation reuse in LLM inference. Code: this https URL.
摘要：KV 缓存通常仅用于加速自回归解码，对上下文信息进行编码，这些信息可以重复用于下游任务，而无需额外成本。我们建议将 KV 缓存视为轻量级表示，从而无需重新计算或存储完整的隐藏状态。尽管弱于专用嵌入，但 KV 派生表示足以满足两个关键应用：(i) Chain-of-Embedding，它们在 Llama-3.1-8B-Instruct 和 Qwen2-7B-Instruct 上实现了有竞争力或优越的性能；和 (ii) 快/慢思维切换，它们在 Qwen3-8B 和 DeepSeek-R1-Distill-Qwen-14B 上启用自适应推理，将令牌生成减少高达 5.7 倍，同时精度损失最小。我们的研究结果将 KV 缓存确立为采样和推理的免费、有效的基础，为 LLM 推理中的表示重用开辟了新的方向。代码：此 https URL。

Title: CE-RM: A Pointwise Generative Reward Model Optimized via Two-Stage Rollout and Unified Criteria

Authors: Xinyu Hu, Yancheng He, Weixun Wang, Tao Feng, Li Lin, Jiashun Liu, Wenbo Su, Bo Zheng, Xiaojun Wan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.20327
Pdf URL: https://arxiv.org/pdf/2601.20327
Copy Paste: [[2601.20327]] CE-RM: A Pointwise Generative Reward Model Optimized via Two-Stage Rollout and Unified Criteria(https://arxiv.org/abs/2601.20327)
Keywords: llm
Abstract: Automatic evaluation is crucial yet challenging for open-ended natural language generation, especially when rule-based metrics are infeasible. Compared with traditional methods, the recent LLM-as-a-Judge paradigms enable better and more flexible evaluation, and show promise as generative reward models for reinforcement learning. However, prior work has revealed a notable gap between their seemingly impressive benchmark performance and actual effectiveness in RL practice. We attribute this issue to some limitations in existing studies, including the dominance of pairwise evaluation and inadequate optimization of evaluation criteria. Therefore, we propose CE-RM-4B, a pointwise generative reward model that is trained with a dedicated two-stage rollout method and adopts unified query-based criteria. Using only about 5.7K high-quality examples curated from the open-source preference dataset, our CE-RM-4B achieves superior performance on diverse reward model benchmarks, especially in Best-of-N scenarios, and delivers more effective improvements in downstream RL practice.
摘要：自动评估对于开放式自然语言生成至关重要但具有挑战性，特别是当基于规则的指标不可行时。与传统方法相比，最近的 LLM-as-a-Judge 范式可以实现更好、更灵活的评估，并显示出作为强化学习的生成奖励模型的前景。然而，之前的工作表明，它们看似令人印象深刻的基准性能与强化学习实践中的实际效果之间存在显着差距。我们将此问题归因于现有研究的一些局限性，包括成对评估占主导地位和评估标准优化不足。因此，我们提出了 CE-RM-4B，这是一种采用专用的两阶段推出方法训练的逐点生成奖励模型，并采用统一的基于查询的标准。我们的 CE-RM-4B 仅使用从开源偏好数据集中精选的约 5.7K 条高质量数据，在各种奖励模型基准上实现了卓越的性能，特别是在 Best-of-N 场景中，并在下游 RL 实践中提供了更有效的改进。

Title: PsychePass: Calibrating LLM Therapeutic Competence via Trajectory-Anchored Tournaments

Authors: Zhuang Chen, Dazhen Wan, Zhangkai Zheng, Guanqun Bi, Xiyao Xiao, Binghang Li, Minlie Huang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.20330
Pdf URL: https://arxiv.org/pdf/2601.20330
Copy Paste: [[2601.20330]] PsychePass: Calibrating LLM Therapeutic Competence via Trajectory-Anchored Tournaments(https://arxiv.org/abs/2601.20330)
Keywords: language model, llm
Abstract: While large language models show promise in mental healthcare, evaluating their therapeutic competence remains challenging due to the unstructured and longitudinal nature of counseling. We argue that current evaluation paradigms suffer from an unanchored defect, leading to two forms of instability: process drift, where unsteered client simulation wanders away from specific counseling goals, and standard drift, where static pointwise scoring lacks the stability for reliable judgment. To address this, we introduce PsychePass, a unified framework that calibrates the therapeutic competence of LLMs via trajectory-anchored tournaments. We first anchor the interaction trajectory in simulation, where clients precisely control the fluid consultation process to probe multifaceted capabilities. We then anchor the battle trajectory in judgments through an efficient Swiss-system tournament, utilizing dynamic pairwise battles to yield robust Elo ratings. Beyond ranking, we demonstrate that tournament trajectories can be transformed into credible reward signals, enabling on-policy reinforcement learning to enhance LLMs' performance. Extensive experiments validate the effectiveness of PsychePass and its strong consistency with human expert judgments.
摘要：虽然大型语言模型在心理保健方面显示出前景，但由于咨询的非结构化和纵向性质，评估其治疗能力仍然具有挑战性。我们认为，当前的评估范式存在非锚定缺陷，导致两种形式的不稳定：流程漂移（不受引导的客户模拟偏离特定的咨询目标）和标准漂移（静态逐点评分缺乏可靠判断的稳定性）。为了解决这个问题，我们引入了 PsychePass，一个统一的框架，通过轨迹锚定锦标赛来校准 LLM 的治疗能力。我们首先在模拟中锚定交互轨迹，客户可以精确控制流畅的咨询过程以探索多方面的能力。然后，我们通过高效的瑞士系统锦标赛将战斗轨迹锚定在判断中，利用动态的配对战斗来产生稳健的 Elo 评级。除了排名之外，我们还证明锦标赛轨迹可以转化为可信的奖励信号，从而实现在线策略（on-policy）强化学习来提高 LLM 的表现。大量实验验证了 PsychePass 的有效性及其与人类专家判断的高度一致性。
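
The abstract above mentions Swiss-system pairing and Elo ratings without giving details. As a rough illustration only, the sketch below shows the textbook Elo update together with a simplified Swiss-style pairing (sort by current rating, pair neighbours). The K-factor of 32, the 400-point scale, and the helper names are assumptions for illustration, not values or code from the paper, and the pairing omits the rematch avoidance that a full Swiss system would use.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one battle.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k=32 is an assumed K-factor, not a value from the paper.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    return rating_a + delta, rating_b - delta


def swiss_pairs(ratings):
    """Simplified Swiss-style pairing: sort by rating, pair neighbours.

    ratings maps a (hypothetical) model name to its current rating.
    With an odd number of entries, the lowest-rated one sits out.
    """
    ranked = sorted(ratings, key=ratings.get, reverse=True)
    return [(ranked[i], ranked[i + 1]) for i in range(0, len(ranked) - 1, 2)]
```

Pairing systems with similar current ratings is what makes each battle informative: a win against a near-equal opponent moves the rating more reliably than a win against a much weaker one.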

Title: MobileBench-OL: A Comprehensive Chinese Benchmark for Evaluating Mobile GUI Agents in Real-World Environment

Authors: Qinzhuo Wu, Zhizhuo Yang, Hanhao Li, Pengzhi Gao, Wei Liu, Jian Luan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.20335
Pdf URL: https://arxiv.org/pdf/2601.20335
Copy Paste: [[2601.20335]] MobileBench-OL: A Comprehensive Chinese Benchmark for Evaluating Mobile GUI Agents in Real-World Environment(https://arxiv.org/abs/2601.20335)
Keywords: agent
Abstract: Recent advances in mobile Graphical User Interface (GUI) agents highlight the growing need for comprehensive evaluation benchmarks. While new online benchmarks offer more realistic testing than offline ones, they tend to focus on the agents' task instruction-following ability while neglecting their reasoning and exploration ability. Moreover, these benchmarks do not consider the random noise in real-world mobile environments. This leads to a gap between benchmarks and real-world environments. To address these limitations, we propose MobileBench-OL, an online benchmark with 1080 tasks from 80 Chinese apps. It measures task execution, complex reasoning, and noise robustness of agents through 5 subsets that cover multiple evaluation dimensions. We also provide an auto-eval framework with a reset mechanism, enabling stable and repeatable real-world benchmarking. Evaluating 12 leading GUI agents on MobileBench-OL shows significant room for improvement to meet real-world requirements. Human evaluation further confirms that MobileBench-OL can reliably measure the performance of leading GUI agents in real environments. Our data and code will be released upon acceptance.
摘要：移动图形用户界面（GUI）代理的最新进展凸显了对综合评估基准日益增长的需求。虽然新的在线基准测试比离线基准测试提供了更真实的测试，但它们往往关注代理的任务指令遵循能力，而忽略了他们的推理和探索能力。此外，这些基准测试没有考虑现实移动环境中的随机噪声。这导致基准测试和现实环境之间存在差距。为了解决这些限制，我们提出了 MobileBench-OL，这是一个在线基准测试，包含来自 80 个中国应用程序的 1080 个任务。它通过包含 5 个子集来衡量智能体的任务执行、复杂推理和噪声鲁棒性，这些子集设置了多个评估维度。我们还提供带有重置机制的自动评估框架，从而实现稳定且可重复的现实世界基准测试。在 MobileBench-OL 上评估 12 个领先的 GUI 代理显示出满足实际需求的巨大改进空间。人工评估进一步证实 MobileBench-OL 可以可靠地测量真实环境中领先 GUI 代理的性能。我们的数据和代码将在接受后发布。

Title: Improving Diffusion Language Model Decoding through Joint Search in Generation Order and Token Space

Authors: Yangyi Shen, Tianjian Feng, Jiaqi Han, Wen Wang, Tianlang Chen, Chunhua Shen, Jure Leskovec, Stefano Ermon
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.20339
Pdf URL: https://arxiv.org/pdf/2601.20339
Copy Paste: [[2601.20339]] Improving Diffusion Language Model Decoding through Joint Search in Generation Order and Token Space(https://arxiv.org/abs/2601.20339)
Keywords: language model
Abstract: Diffusion Language Models (DLMs) offer order-agnostic generation that can explore many possible decoding trajectories. However, current decoding methods commit to a single trajectory, limiting exploration in trajectory space. We introduce Order-Token Search to explore this space through jointly searching over generation order and token values. Its core is a likelihood estimator that scores denoising actions, enabling stable pruning and efficient exploration of diverse trajectories. Across mathematical reasoning and coding benchmarks, Order-Token Search consistently outperforms baselines on GSM8K, MATH500, Countdown, and HumanEval (3.1%, 3.8%, 7.9%, and 6.8% absolute over backbone), matching or surpassing diffu-GRPO post-trained d1-LLaDA. Our work establishes joint search as a key component for advancing decoding in DLMs.
摘要：扩散语言模型 (DLM) 提供与顺序无关的生成，可以探索许多可能的解码轨迹。然而，当前的解码方法致力于单一轨迹，限制了轨迹空间的探索。我们引入订单令牌搜索，通过联合搜索生成订单和令牌值来探索这个空间。其核心是一个似然估计器，可以对去噪动作进行评分，从而实现稳定的剪枝和有效探索不同的轨迹。在数学推理和编码基准测试中，Order-Token Search 始终优于 GSM8K、MATH500、Countdown 和 HumanEval 上的基线（绝对超过主干网的 3.1%、3.8%、7.9% 和 6.8%），匹配或超过 diffu-GRPO 后训练的 d1-LLaDA。我们的工作将联合搜索确立为推进 DLM 解码的关键组成部分。

Title: Beyond Accuracy: A Cognitive Load Framework for Mapping the Capability Boundaries of Tool-use Agents

Authors: Qihao Wang, Yue Hu, Mingzhe Lu, Jiayue Wu, Yanbing Liu, Yuanmin Tang
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2601.20412
Pdf URL: https://arxiv.org/pdf/2601.20412
Copy Paste: [[2601.20412]] Beyond Accuracy: A Cognitive Load Framework for Mapping the Capability Boundaries of Tool-use Agents(https://arxiv.org/abs/2601.20412)
Keywords: language model, llm, agent
Abstract: The ability of Large Language Models (LLMs) to use external tools unlocks powerful real-world interactions, making rigorous evaluation essential. However, current benchmarks primarily report final accuracy, revealing what models can do but obscuring the cognitive bottlenecks that define their true capability boundaries. To move from simple performance scoring to a diagnostic tool, we introduce a framework grounded in Cognitive Load Theory. Our framework deconstructs task complexity into two quantifiable components: Intrinsic Load, the inherent structural complexity of the solution path, formalized with a novel Tool Interaction Graph; and Extraneous Load, the difficulty arising from ambiguous task presentation. To enable controlled experiments, we construct ToolLoad-Bench, the first benchmark with parametrically adjustable cognitive load. Our evaluation reveals distinct performance cliffs as cognitive load increases, allowing us to precisely map each model's capability boundary. We validate that our framework's predictions are highly calibrated with empirical results, establishing a principled methodology for understanding an agent's limits and a practical foundation for building more efficient systems.
摘要：大型语言模型 (LLM) 使用外部工具的能力可以解锁强大的现实世界交互，因此严格的评估至关重要。然而，当前的基准主要报告最终的准确性，揭示模型可以做什么，但掩盖了定义其真正能力边界的认知瓶颈。为了从简单的表现评分转变为诊断工具，我们引入了一个基于认知负荷理论的框架。我们的框架将任务复杂性解构为两个可量化的组成部分：内在负载，解决方案路径的固有结构复杂性，用新颖的工具交互图形式化；额外负载，即由于任务呈现不明确而产生的困难。为了实现受控实验，我们构建了 ToolLoad-Bench，这是第一个具有参数可调认知负载的基准。我们的评估揭示了随着认知负荷增加而出现明显的性能悬崖，使我们能够精确地映射每个模型的能力边界。我们验证了我们的框架的预测与经验结果高度校准，建立了理解代理局限性的原则方法，并为构建更高效的系统奠定了实际基础。

Title: SpeechMapper: Speech-to-text Embedding Projector for LLMs

Authors: Biswesh Mohapatra, Marcely Zanon Boito, Ioan Calapodescu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.20417
Pdf URL: https://arxiv.org/pdf/2601.20417
Copy Paste: [[2601.20417]] SpeechMapper: Speech-to-text Embedding Projector for LLMs(https://arxiv.org/abs/2601.20417)
Keywords: llm, prompt
Abstract: Current speech LLMs bridge speech foundation models to LLMs using projection layers, training all of these components on speech instruction data. This strategy is computationally intensive and susceptible to task and prompt overfitting. We present SpeechMapper, a cost-efficient speech-to-LLM-embedding training approach that mitigates overfitting, enabling more robust and generalizable models. Our model is first pretrained without the LLM on inexpensive hardware, and then efficiently attached to the target LLM via a brief 1K-step instruction tuning (IT) stage. Through experiments on speech translation and spoken question answering, we demonstrate the versatility of SpeechMapper's pretrained block, presenting results for both task-agnostic IT, an ASR-based adaptation strategy that does not train on the target task, and task-specific IT. In task-agnostic settings, SpeechMapper rivals the best instruction-following speech LLM from IWSLT25, despite never being trained on these tasks, while in task-specific settings, it outperforms this model across many datasets, despite requiring less data and compute. Overall, SpeechMapper offers a practical and scalable approach for efficient, generalizable speech-LLM integration without large-scale IT.
摘要：当前的语音 LLM 使用投影层将语音基础模型与 LLM 连接起来，并根据语音指令数据训练所有这些组件。该策略计算量大，容易受到任务和提示过度拟合的影响。我们推出了 SpeechMapper，这是一种经济高效的语音到 LLM 嵌入训练方法，可以减轻过度拟合，从而实现更强大和更通用的模型。我们的模型首先在廉价硬件上进行预训练，无需 LLM，然后通过简短的 1K 步指令调整 (IT) 阶段有效地连接到目标 LLM。通过语音翻译和口语问答实验，我们展示了 SpeechMapper 预训练块的多功能性，展示了任务无关 IT（一种不在目标任务上训练的基于 ASR 的适应策略）和特定于任务的 IT 的结果。在与任务无关的设置中，SpeechMapper 可以与 IWSLT25 中最好的指令跟随语音 LLM 相媲美，尽管从未接受过这些任务的训练；而在特定于任务的设置中，它在许多数据集上优于该模型，尽管需要较少的数据和计算。总体而言，SpeechMapper 提供了一种实用且可扩展的方法，无需大规模 IT 即可实现高效、可通用的语音-LLM 集成。

Title: PEARL: Plan Exploration and Adaptive Reinforcement Learning for Multihop Tool Use

Authors: Qihao Wang, Mingzhe Lu, Jiayue Wu, Yue Hu, Yanbing Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.20439
Pdf URL: https://arxiv.org/pdf/2601.20439
Copy Paste: [[2601.20439]] PEARL: Plan Exploration and Adaptive Reinforcement Learning for Multihop Tool Use(https://arxiv.org/abs/2601.20439)
Keywords: language model, llm, hallucination, agent
Abstract: Large Language Models show great potential with external tools, but face significant challenges in complex, multi-turn tool invocation. They often exhibit weak planning, tool hallucination, and erroneous parameter generation, and they struggle with robust interaction. To tackle these issues, we present PEARL, a novel framework to enhance LLM planning and execution for sophisticated tool use. PEARL adopts a two-stage approach: an offline phase where the agent explores tools to learn valid usage patterns and failure conditions, and an online reinforcement learning phase. In the online phase, a dedicated Planner is trained via Group Relative Policy Optimization (GRPO) with a carefully designed reward function that provides distinct signals for planning quality. Experiments on the ToolHop and T-Eval benchmarks show PEARL significantly outperforms existing methods, achieving a new state-of-the-art success rate of 56.5% on ToolHop while maintaining a low invocation error rate. Our work marks a key advance in addressing the complex planning challenges of tool use, contributing to the development of more robust and reliable LLM-based agents.
摘要：大型语言模型在外部工具方面表现出巨大的潜力，但在复杂的多轮工具调用方面面临着重大挑战。它们经常表现出规划薄弱、工具幻觉、参数生成错误，并且难以实现稳健的交互。为了解决这些问题，我们提出了 PEARL，这是一个新颖的框架，可以增强 LLM 规划和执行以实现复杂工具的使用。PEARL 采用两阶段方法：离线阶段，代理探索工具来学习有效的使用模式和故障条件；以及在线强化学习阶段。在在线阶段，专门的规划器通过组相对策略优化 (GRPO) 进行训练，并通过精心设计的奖励函数为规划质量提供独特的信号。ToolHop 和 T-Eval 基准测试表明，PEARL 的性能显着优于现有方法，在 ToolHop 上实现了 56.5% 的最新成功率，同时保持了较低的调用错误率。我们的工作标志着在解决工具使用的复杂规划挑战方面取得了关键进展，有助于开发更强大、更可靠的基于 LLM 的代理。
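
The abstract above names Group Relative Policy Optimization (GRPO) but does not spell out how rewards become training signals. As a hedged sketch, the snippet below shows the group-relative advantage commonly associated with GRPO: several rollouts for the same prompt are scored, and each reward is normalized by the group's mean and standard deviation. The function name, the use of the population standard deviation, and the `eps` term are assumptions for illustration; PEARL's actual reward design is not described in this listing.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    rewards: scores for several rollouts of the same prompt.
    eps: small constant (an assumption here) that avoids division
         by zero when every rollout receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because the baseline is the group mean, no separate value network is needed; a rollout is reinforced only to the extent that it beats its siblings for the same prompt.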

Title: BMAM: Brain-inspired Multi-Agent Memory Framework

Authors: Yang Li, Jiaxiang Liu, Yusong Wang, Yujie Wu, Mingkun Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.20465
Pdf URL: https://arxiv.org/pdf/2601.20465
Copy Paste: [[2601.20465]] BMAM: Brain-inspired Multi-Agent Memory Framework(https://arxiv.org/abs/2601.20465)
Keywords: agent
Abstract: Language-model-based agents operating over extended interaction horizons face persistent challenges in preserving temporally grounded information and maintaining behavioral consistency across sessions, a failure mode we term soul erosion. We present BMAM (Brain-inspired Multi-Agent Memory), a general-purpose memory architecture that models agent memory as a set of functionally specialized subsystems rather than a single unstructured store. Inspired by cognitive memory systems, BMAM decomposes memory into episodic, semantic, salience-aware, and control-oriented components that operate at complementary time scales. To support long-horizon reasoning, BMAM organizes episodic memories along explicit timelines and retrieves evidence by fusing multiple complementary signals. Experiments on the LoCoMo benchmark show that BMAM achieves 78.45 percent accuracy under the standard long-horizon evaluation setting, and ablation analyses confirm that the hippocampus-inspired episodic memory subsystem plays a critical role in temporal reasoning.
摘要：在扩展交互范围内运行的基于语言模型的代理在保留临时信息和保持跨会话行为一致性方面面临着持续的挑战，我们将这种失败模式称为灵魂侵蚀。我们提出了 BMAM（脑启发多智能体内存），这是一种通用内存架构，它将智能体内存建模为一组功能专用的子系统，而不是单个非结构化存储。受认知记忆系统的启发，BMAM 将记忆分解为情景、语义、显着性感知和控制导向的组件，这些组件在互补的时间尺度上运行。为了支持长视野推理，BMAM 沿着明确的时间线组织情景记忆，并通过融合多个互补信号来检索证据。 LoCoMo 基准测试表明，BMAM 在标准长视野评估设置下达到了 78.45% 的准确率，消融分析证实，受海马体启发的情景记忆子系统在时间推理中发挥着关键作用。

Title: Can We Improve Educational Diagram Generation with In-Context Examples? Not if a Hallucination Spoils the Bunch

Authors: Evanfiya Logacheva, Arto Hellas, Tsvetomila Mihaylova, Juha Sorva, Ava Heinonen, Juho Leinonen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.20476
Pdf URL: https://arxiv.org/pdf/2601.20476
Copy Paste: [[2601.20476]] Can We Improve Educational Diagram Generation with In-Context Examples? Not if a Hallucination Spoils the Bunch(https://arxiv.org/abs/2601.20476)
Keywords: language model, llm, hallucination
Abstract: Generative artificial intelligence (AI) has found widespread use in computing education; at the same time, the quality of generated materials raises concerns among educators and students. This study addresses this issue by introducing a novel method for diagram code generation with in-context examples based on Rhetorical Structure Theory (RST), which aims to improve diagram generation by aligning models' output with user expectations. Our approach is evaluated by computer science educators, who assessed 150 diagrams generated with large language models (LLMs) for logical organization, connectivity, layout aesthetics, and AI hallucination. The assessment dataset is additionally investigated for its utility in automated diagram evaluation. The preliminary results suggest that our method decreases the rate of factual hallucination and improves diagram faithfulness to the provided context; however, due to LLMs' stochasticity, the quality of the generated diagrams varies. Additionally, we present an in-depth analysis and discussion of the connection between AI hallucination and the quality of generated diagrams, which reveals that text contexts of higher complexity lead to higher rates of hallucination and that LLMs often fail to detect mistakes in their output.
摘要：生成式人工智能 (AI) 在计算机教育中得到了广泛应用；与此同时，生成材料的质量引起了教育工作者和学生的关注。本研究通过引入一种基于修辞结构理论（RST）的带有上下文示例的图表代码生成新方法来解决这个问题，该方法旨在通过使模型的输出与用户期望保持一致来改进图表生成。我们的方法由计算机科学教育工作者进行了评估，他们评估了使用大型语言模型 (LLM) 生成的 150 个图表的逻辑组织、连接性、布局美学和 AI 幻觉。另外还研究了评估数据集在自动图表评估中的实用性。初步结果表明，我们的方法降低了事实幻觉的发生率，并提高了图表对所提供上下文的忠实度；然而，由于 LLM 的随机性，生成的图表的质量各不相同。此外，我们对人工智能幻觉与生成图表的质量之间的联系进行了深入的分析和讨论，结果表明，较高复杂性的文本上下文会导致更高的幻觉发生率，而 LLM 通常无法检测到其输出中的错误。

Title: Beyond Divergent Creativity: A Human-Based Evaluation of Creativity in Large Language Models

Authors: Kumiko Nakajima, Jan Zuiderveld, Sandro Pezzelle
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.20546
Pdf URL: https://arxiv.org/pdf/2601.20546
Copy Paste: [[2601.20546]] Beyond Divergent Creativity: A Human-Based Evaluation of Creativity in Large Language Models(https://arxiv.org/abs/2601.20546)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly used in verbal creative tasks. However, previous assessments of the creative capabilities of LLMs remain weakly grounded in human creativity theory and are thus hard to interpret. The widely used Divergent Association Task (DAT) focuses on novelty, ignoring appropriateness, a core component of creativity. We evaluate a range of state-of-the-art LLMs on DAT and show that their scores on the task are lower than those of two baselines that do not possess any creative abilities, undermining its validity for model evaluation. Grounded in human creativity theory, which defines creativity as the combination of novelty and appropriateness, we introduce the Conditional Divergent Association Task (CDAT). CDAT evaluates novelty conditional on contextual appropriateness, separating noise from creativity better than DAT, while remaining simple and objective. Under CDAT, smaller model families often show the most creativity, whereas advanced families favor appropriateness at lower novelty. We hypothesize that training and alignment likely shift models along this frontier, making outputs more appropriate but less creative. We release the dataset and code.
摘要：大语言模型 (LLM) 越来越多地用于口头创意任务。然而，之前对法学硕士创造力的评估仍然缺乏人类创造力理论的基础，因此很难解释。广泛使用的发散关联任务（DAT）注重新颖性，忽视了创造力的核心组成部分——适当性。我们在 DAT 上评估了一系列最先进的法学硕士，结果表明他们在任务上的得分低于不具备任何创造性能力的两个基线，从而削弱了其模型评估的有效性。基于人类创造力理论，将创造力定义为新颖性和适当性的结合，我们引入了条件发散关联任务（CDAT）。 CDAT 根据上下文的适当性来评估新颖性，比 DAT 更好地将噪音与创造力分开，同时保持简单和客观。在 CDAT 下，较小的模型系列通常表现出最大的创造力，而高级系列则偏爱适当性，但新颖性较低。我们假设培训和调整可能会沿着这一前沿改变模型，使输出更合适但创意更少。我们发布数据集和代码。
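
Note: the abstract does not spell out how DAT is scored. As the task is commonly described, the score is the mean pairwise cosine distance between the embeddings of the submitted words, scaled by 100. The sketch below illustrates that idea only; the toy 2-d vectors and the function names are hypothetical, and CDAT's appropriateness conditioning is not modeled here.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dat_score(vectors):
    """Mean pairwise cosine distance between word vectors, scaled by 100."""
    pairs = list(combinations(vectors, 2))
    return 100.0 * sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy 2-d "embeddings" (hypothetical; a real DAT run uses pretrained word vectors).
toy_vectors = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(round(dat_score(toy_vectors), 2))
```

Because the score rewards distance alone, unrelated or random words can score highly, which is consistent with the abstract's point that baselines without creative ability can outscore LLMs on DAT.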

Title: A Computational Approach to Language Contact -- A Case Study of Persian

Authors: Ali Basirat, Danial Namazifard, Navid Baradaran Hemmati
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.20592
Pdf URL: https://arxiv.org/pdf/2601.20592
Copy Paste: [[2601.20592]] A Computational Approach to Language Contact -- A Case Study of Persian(https://arxiv.org/abs/2601.20592)
Keywords: language model
Abstract: We investigate structural traces of language contact in the intermediate representations of a monolingual language model. Focusing on Persian (Farsi) as a historically contact-rich language, we probe the representations of a Persian-trained model when exposed to languages with varying degrees and types of contact with Persian. Our methodology quantifies the amount of linguistic information encoded in intermediate representations and assesses how this information is distributed across model components for different morphosyntactic features. The results show that universal syntactic information is largely insensitive to historical contact, whereas morphological features such as Case and Gender are strongly shaped by language-specific structure, suggesting that contact effects in monolingual language models are selective and structurally constrained.
摘要：我们研究单语语言模型的中间表示中语言接触的结构痕迹。我们将重点放在波斯语（波斯语）作为一种历史上接触丰富的语言，我们探讨了波斯语训练模型在接触与波斯语有不同程度和类型接触的语言时的表示。我们的方法量化了中间表示中编码的语言信息量，并评估这些信息如何在不同形态句法特征的模型组件之间分布。结果表明，通用句法信息在很大程度上对历史接触不敏感，而诸如格和性别之类的形态特征则受到语言特定结构的强烈影响，这表明单语语言模型中的接触效应是选择性的且在结构上受到限制。

Title: AgentIF-OneDay: A Task-level Instruction-Following Benchmark for General AI Agents in Daily Scenarios

Authors: Kaiyuan Chen, Qimin Wu, Taiyu Hou, Tianhao Tang, Xueyu Hu, Yuchen Hou, Bikun Li, Chengming Qian, Guoyin Wang, Haolin Chen, Haotong Tian, Haoye Zhang, Haoyu Bian, Hongbing Pan, Hongkang Zhang, Hongyi Zhou, Jiaqi Cai, Jiewu Rao, Jiyuan Ren, Keduan Huang, Lucia Zhu Huang, Mingyu Yuan, Naixu Guo, Qicheng Tang, Qinyan Zhang, Shuai Chen, Siheng Chen, Ting Ting Li, Xiaoxing Guo, Yaocheng Zuo, Yaoqi Guo, Yinan Wang, Yinzhou Yu, Yize Wang, Yuan Jiang, Yuan Tian, Yuanshuo Zhang, Yuxuan Liu, Yvette Yan Zeng, Zenyu Shan, Zihan Yin, Xiaobo Hu, Yang Liu, Yixin Ren, Yuan Gong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.20613
Pdf URL: https://arxiv.org/pdf/2601.20613
Copy Paste: [[2601.20613]] AgentIF-OneDay: A Task-level Instruction-Following Benchmark for General AI Agents in Daily Scenarios(https://arxiv.org/abs/2601.20613)
Keywords: gpt, llm, chat, agent
Abstract: The capacity of AI agents to effectively handle tasks of increasing duration and complexity continues to grow, demonstrating exceptional performance in coding, deep research, and complex problem-solving evaluations. However, in daily scenarios, the perception of these advanced AI capabilities among general users remains limited. We argue that current evaluations prioritize increasing task difficulty without sufficiently addressing the diversity of agentic tasks necessary to cover the daily work, life, and learning activities of a broad demographic. To address this, we propose AgentIF-OneDay, aimed at determining whether general users can utilize natural language instructions and AI agents to complete a diverse array of daily tasks. These tasks require not only solving problems through dialogue but also understanding various attachment types and delivering tangible file-based results. The benchmark is structured around three user-centric categories: Open Workflow Execution, which assesses adherence to explicit and complex workflows; Latent Instruction, which requires agents to infer implicit instructions from attachments; and Iterative Refinement, which involves modifying or expanding upon ongoing work. We employ instance-level rubrics and a refined evaluation pipeline that aligns LLM-based verification with human judgment, achieving an 80.1% agreement rate using Gemini-3-Pro. AgentIF-OneDay comprises 104 tasks covering 767 scoring points. We benchmarked four leading general AI agents and found that agent products built on APIs and ChatGPT agents based on agent RL both rank in the first tier. Leading LLM APIs and open-source models have internalized agentic capabilities, enabling AI application teams to develop cutting-edge agent products.
摘要：人工智能代理有效处理持续时间和复杂性不断增加的任务的能力不断增强，在编码、深入研究和复杂问题解决评估方面表现出卓越的性能。然而，在日常场景中，普通用户对这些先进的人工智能能力的感知仍然有限。我们认为，当前的评估优先考虑增加任务难度，而没有充分解决覆盖广大人群日常工作、生活和学习活动所需的代理任务的多样性。为了解决这个问题，我们提出了 AgentIF-OneDay，旨在确定普通用户是否可以利用自然语言指令和 AI 代理来完成各种日常任务。这些任务不仅需要通过对话解决问题，还需要理解各种附件类型并提供有形的基于文件的结果。该基准测试围绕三个以用户为中心的类别构建：开放工作流程执行，评估对明确和复杂工作流程的遵守情况；潜在指令，要求智能体从附件中推断出隐式指令；迭代细化，涉及修改或扩展正在进行的工作。我们采用实例级标准和完善的评估管道，将基于 LLM 的验证与人类判断相结合，使用 Gemini-3-Pro 实现了 80.1% 的一致率。 AgentIF-OneDay 包含 104 项任务，涵盖 767 个得分点。我们对四款领先的通用人工智能代理进行了基准测试，发现基于 API 构建的代理产品和基于代理 RL 的 ChatGPT 代理同时保持在第一梯队。领先的LLM API和开源模型具有内化的Agent能力，使AI应用团队能够开发尖端的Agent产品。

Title: P2S: Probabilistic Process Supervision for General-Domain Reasoning Question Answering

Authors: Wenlin Zhong, Chengyuan Liu, Yiquan Wu, Bovin Tan, Changlong Sun, Yi Wang, Xiaozhong Liu, Kun Kuang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.20649
Pdf URL: https://arxiv.org/pdf/2601.20649
Copy Paste: [[2601.20649]] P2S: Probabilistic Process Supervision for General-Domain Reasoning Question Answering(https://arxiv.org/abs/2601.20649)
Keywords: llm
Abstract: While reinforcement learning with verifiable rewards (RLVR) has advanced LLM reasoning in structured domains like mathematics and programming, its application to general-domain reasoning tasks remains challenging due to the absence of verifiable reward signals. To this end, methods like Reinforcement Learning with Reference Probability Reward (RLPR) have emerged, leveraging the probability of generating the final answer as a reward signal. However, these outcome-focused approaches neglect crucial step-by-step supervision of the reasoning process itself. To address this gap, we introduce Probabilistic Process Supervision (P2S), a novel self-supervision framework that provides fine-grained process rewards without requiring a separate reward model or human-annotated reasoning steps. During reinforcement learning, P2S synthesizes and filters a high-quality reference reasoning chain (gold-CoT). The core of our method is to calculate a Path Faithfulness Reward (PFR) for each reasoning step, which is derived from the conditional probability of generating the gold-CoT's suffix, given the model's current reasoning prefix. Crucially, this PFR can be flexibly integrated with any outcome-based reward, directly tackling the reward sparsity problem by providing dense guidance. Extensive experiments on reading comprehension and medical Question Answering benchmarks show that P2S significantly outperforms strong baselines.
摘要：虽然具有可验证奖励的强化学习（RLVR）在数学和编程等结构化领域具有先进的法学硕士推理能力，但由于缺乏可验证的奖励信号，其在一般领域推理任务中的应用仍然具有挑战性。为此，出现了诸如带有参考概率奖励的强化学习（RLPR）之类的方法，利用生成最终答案的概率作为奖励信号。然而，这些以结果为中心的方法忽视了对推理过程本身的关键一步一步的监督。为了解决这一差距，我们引入了概率过程监督（P2S），这是一种新颖的自我监督框架，它提供细粒度的过程奖励，而不需要单独的奖励模型或人工注释的推理步骤。在强化学习过程中，P2S合成并过滤出一条高质量的参考推理链（gold-CoT）。我们方法的核心是计算每个推理步骤的路径忠实奖励（PFR），该奖励是根据给定模型当前推理前缀生成 gold-CoT 后缀的条件概率得出的。至关重要的是，这种 PFR 可以灵活地与任何基于结果的奖励相结合，通过提供密集的指导来直接解决奖励稀疏问题。关于阅读理解和医学问答基准的大量实验表明，P2S 显着优于强基线。
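
Note: the abstract says the Path Faithfulness Reward (PFR) is derived from the conditional probability of generating the gold-CoT suffix given the model's current reasoning prefix, but it does not give the exact formula. The sketch below is one plausible reading under stated assumptions: the per-token log-probabilities of the suffix are assumed to be already computed by the policy model, and the length normalization (geometric mean) and the function name are hypothetical choices, not the authors' definition.

```python
import math


def path_faithfulness_reward(suffix_token_logprobs):
    """Length-normalized probability of the gold-CoT suffix (assumed form).

    suffix_token_logprobs: log p(token_i | prefix, earlier suffix tokens),
    one entry per token of the gold-CoT suffix.
    """
    if not suffix_token_logprobs:
        raise ValueError("the gold-CoT suffix must contain at least one token")
    mean_logprob = sum(suffix_token_logprobs) / len(suffix_token_logprobs)
    return math.exp(mean_logprob)


# A prefix after which the gold suffix is likely earns a higher reward
# than a prefix after which the same suffix is unlikely.
on_track = path_faithfulness_reward([math.log(0.9), math.log(0.8)])
off_track = path_faithfulness_reward([math.log(0.2), math.log(0.1)])
print(on_track > off_track)
```

Scoring every reasoning step this way yields one reward per step, which is how a process signal of this kind can supply dense guidance alongside a sparse outcome reward.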

Title: A Dialectic Pipeline for Improving LLM Robustness

Authors: Sara Candussio
Subjects: cs.CL, cs.MA
Abstract URL: https://arxiv.org/abs/2601.20659
Pdf URL: https://arxiv.org/pdf/2601.20659
Copy Paste: [[2601.20659]] A Dialectic Pipeline for Improving LLM Robustness(https://arxiv.org/abs/2601.20659)
Keywords: language model, llm, hallucination, prompt, chain-of-thought
Abstract: Assessing ways in which Language Models can reduce their hallucinations and improve the quality of their outputs is crucial to ensure their large-scale use. However, methods such as fine-tuning on domain-specific data or training a separate ad hoc verifier require demanding computational resources (not feasible for many user applications) and constrain the models to specific fields of knowledge. In this thesis, we propose a dialectic pipeline that preserves an LLM's generalization abilities while improving the quality of its answers via self-dialogue, enabling it to reflect upon and correct tentative wrong answers. We experimented with different pipeline settings, testing our proposed method on different datasets and on different families of models. All the pipeline stages are enriched with the relevant context (in an oracle-RAG setting), and a study on the impact of its summarization or its filtering is conducted. We find that our proposed dialectic pipeline outperforms the standard model answers by significant margins and that it consistently achieves higher performance than Chain-of-Thought-only prompting.
摘要：评估语言模型减少幻觉并提高输出质量的方法对于确保其大规模使用至关重要。然而，诸如对特定领域数据进行微调或训练单独的 \textit{ad hoc} 验证器等方法需要苛刻的计算资源（对于许多用户应用程序来说不可行），并将模型限制在特定的知识领域。在本论文中，我们提出了一种辩证管道，可以保留法学硕士的泛化能力，同时通过自我对话提高其答案的质量，使其能够反思并纠正尝试性的错误答案。我们尝试了不同的管道设置，在不同的数据集和不同的模型系列上测试我们提出的方法。所有管道阶段都丰富了相关上下文（在 oracle-RAG 设置中），并对其汇总或过滤的影响进行了研究。我们发现我们提出的辩证管道能够显着优于标准模型答案，并且它始终比仅提示的思想链获得更高的性能。

Title: Harnessing Large Language Models for Precision Querying and Retrieval-Augmented Knowledge Extraction in Clinical Data Science

Authors: Juan Jose Rubio Jan, Jack Wu, Julia Ive
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.20674
Pdf URL: https://arxiv.org/pdf/2601.20674
Copy Paste: [[2601.20674]] Harnessing Large Language Models for Precision Querying and Retrieval-Augmented Knowledge Extraction in Clinical Data Science(https://arxiv.org/abs/2601.20674)
Keywords: language model, llm, retrieval augmented generation
Abstract: This study applies Large Language Models (LLMs) to two foundational Electronic Health Record (EHR) data science tasks: structured data querying (using a programming language, Python/Pandas) and information extraction from unstructured clinical text via a Retrieval Augmented Generation (RAG) pipeline. We test the ability of LLMs to interact accurately with large structured datasets for analytics and the reliability of LLMs in extracting semantically correct information from free-text health records when supported by RAG. To this end, we present a flexible evaluation framework that automatically generates synthetic question and answer pairs tailored to the characteristics of each dataset or task. Experiments were conducted on a curated subset of MIMIC-III (four structured tables and one clinical note type), using a mix of locally hosted and API-based LLMs. Evaluation combined exact-match metrics, semantic similarity, and human judgment. Our findings demonstrate the potential of LLMs to support precise querying and accurate information extraction in clinical workflows.
摘要：本研究将大型语言模型 (LLM) 应用于两项基础电子健康记录 (EHR) 数据科学任务：结构化数据查询（使用编程语言、Python/Pandas）以及通过检索增强生成 (RAG) 管道从非结构化临床文本中提取信息。我们测试了法学硕士与大型结构化数据集准确交互以进行分析的能力，以及法学硕士在 RAG 支持下从自由文本健康记录中提取语义正确信息的可靠性。为此，我们提出了一个灵活的评估框架，可以自动生成适合每个数据集或任务特征的合成问题和答案对。实验是在 MIMIC III 的精选子集（四个结构化表和一个临床记录类型）上进行的，混合使用本地托管和基于 API 的法学硕士。评估结合了精确匹配指标、语义相似性和人类判断。我们的研究结果证明了法学硕士在支持临床工作流程中精确查询和准确信息提取方面的潜力。

Title: Efficient Multimodal Planning Agent for Visual Question-Answering

Authors: Zhuo Chen, Xinyu Geng, Xinyu Wang, Yong Jiang, Zhen Zhang, Pengjun Xie, Kewei Tu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.20676
Pdf URL: https://arxiv.org/pdf/2601.20676
Copy Paste: [[2601.20676]] Efficient Multimodal Planning Agent for Visual Question-Answering(https://arxiv.org/abs/2601.20676)
Keywords: prompt, retrieval-augmented generation, agent
Abstract: Visual Question-Answering (VQA) is a challenging multimodal task that requires integrating visual and textual information to generate accurate responses. While multimodal Retrieval-Augmented Generation (mRAG) has shown promise in enhancing VQA systems by providing more evidence on both image and text sides, the default procedure that addresses VQA queries, especially the knowledge-intensive ones, often relies on multi-stage pipelines of mRAG with inherent dependencies. To mitigate this inefficiency while maintaining VQA task performance, this paper proposes a method that trains a multimodal planning agent to dynamically decompose the mRAG pipeline when solving the VQA task. Our method optimizes the trade-off between efficiency and effectiveness by training the agent to intelligently determine the necessity of each mRAG step. In our experiments, the agent helps reduce redundant computations, cutting search time by over 60% compared to existing methods and decreasing costly tool calls. Meanwhile, experiments demonstrate that our method outperforms all baselines, including a Deep Research agent and a carefully designed prompt-based method, on average across six datasets. Code will be released.
摘要：视觉问答（VQA）是一项具有挑战性的多模式任务，需要整合视觉和文本信息以生成准确的响应。虽然多模态检索增强生成 (mRAG) 在通过在图像和文本方面提供更多证据来增强 VQA 系统方面表现出了希望，但解决 VQA 查询（尤其是知识密集型查询）的默认程序通常依赖于具有固有依赖性的 mRAG 的多级管道。为了在保持 VQA 任务性能的同时减轻效率低下的限制，本文提出了一种训练多模态规划代理的方法，动态分解 mRAG 管道来解决 VQA 任务。我们的方法通过训练代理智能地确定每个 mRAG 步骤的必要性来优化效率和有效性之间的权衡。在我们的实验中，该代理可以帮助减少冗余计算，与现有方法相比，将搜索时间减少 60% 以上，并减少昂贵的工具调用。同时，实验表明，我们的方法平均超过六个不同的数据集，优于所有基线，包括深度研究代理和精心设计的基于提示的方法。代码将被发布。

Title: ShieldedCode: Learning Robust Representations for Virtual Machine Protected Code

Authors: Mingqiao Mo, Yunlong Tan, Hao Zhang, Heng Zhang, Yangfan He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.20679
Pdf URL: https://arxiv.org/pdf/2601.20679
Copy Paste: [[2601.20679]] ShieldedCode: Learning Robust Representations for Virtual Machine Protected Code(https://arxiv.org/abs/2601.20679)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) have achieved remarkable progress in code generation, yet their potential for software protection remains largely untapped. Reverse engineering continues to threaten software security, while traditional virtual machine protection (VMP) relies on rigid, rule-based transformations that are costly to design and vulnerable to automated analysis. In this work, we present ShieldedCode, the first protection-aware framework that learns robust representations of VMP-protected code. Our approach builds large-scale paired datasets of source code and normalized VM implementations, and introduces hierarchical dependency modeling at intra-, preceding-, and inter-instruction levels. We jointly optimize language modeling with functionality-aware and protection-aware contrastive objectives to capture both semantic equivalence and protection strength. To further assess resilience, we propose a protection effectiveness optimization task that quantifies and ranks different VM variants derived from the same source. Coupled with a two-stage continual pre-training and fine-tuning pipeline, our method enables models to generate, compare, and reason over protected code. Extensive experiments show that our framework significantly improves robustness across diverse protection levels, opening a new research direction for learning-based software defense. Our method achieves 26.95% Pass@1 on L0 VM code generation, compared to 22.58% for GPT-4o, and improves binary similarity detection Recall@1 by 10% over state-of-the-art methods like jTrans.
摘要：大型语言模型 (LLM) 在代码生成方面取得了显着的进步，但其软件保护的潜力在很大程度上尚未开发。逆向工程继续威胁软件安全，而传统的虚拟机保护 (VMP) 依赖于严格的、基于规则的转换，这种转换设计成本高昂，而且容易受到自动分析的影响。在这项工作中，我们提出了第一个保护感知框架，该框架可以学习受 VMP 保护的代码的稳健表示。我们的方法构建了源代码和标准化 VM 实现的大规模配对数据集，并在指令内、指令前和指令间引入了分层依赖关系建模。我们通过功能感知和保护感知对比目标联合优化语言建模，以捕获语义等价性和保护强度。为了进一步评估弹性，我们提出了一种保护有效性优化任务，该任务对源自同一来源的不同虚拟机变体进行量化和排名。与两阶段连续预训练和微调管道相结合，我们的方法使模型能够生成、比较和推理受保护的代码。大量实验表明，我们的框架显着提高了不同保护级别的鲁棒性，为基于学习的软件防御开辟了新的研究方向。在这项工作中，我们提出了 ShieldedCode，这是第一个能够学习 VMP 保护代码的稳健表示的保护感知框架。我们的方法在 L0 VM 代码生成上实现了 26.95% Pass@1，而 GPT-4o 为 22.58%。并且与 jTrans 等最先进的方法相比，二进制相似性检测 Recall@1 提高了 10%。
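
Note: the abstract reports Pass@1 without saying how it was computed. A widely used choice in code-generation evaluation is the unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass the tests. The sketch below shows that estimator as an assumption about the metric, not as the authors' evaluation code.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of generated samples, c: number that pass, k: budget (k <= n).
    """
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With k = 1 the estimator reduces to the fraction of passing samples.
print(pass_at_k(n=10, c=3, k=1))
```

A benchmark-level score is then the mean of this per-problem estimate over all problems.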

Title: AgentLongBench: A Controllable Long Benchmark For Long-Contexts Agents via Environment Rollouts

Authors: Shicheng Fang, Yuxin Wang, XiaoRan Liu, Jiahao Lu, Chuanyuan Tan, Xinchi Chen, Yining Zheng, Xuanjing Huang, Xipeng Qiu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.20730
Pdf URL: https://arxiv.org/pdf/2601.20730
Copy Paste: [[2601.20730]] AgentLongBench: A Controllable Long Benchmark For Long-Contexts Agents via Environment Rollouts(https://arxiv.org/abs/2601.20730)
Keywords: language model, llm, agent
Abstract: The evolution of Large Language Models (LLMs) into autonomous agents necessitates the management of extensive, dynamic contexts. Current benchmarks, however, remain largely static, relying on passive retrieval tasks that fail to simulate the complexities of agent-environment interaction, such as non-linear reasoning and iterative feedback. To address this, we introduce AgentLongBench, which evaluates agents through simulated environment rollouts based on Lateral Thinking Puzzles. This framework generates rigorous interaction trajectories across knowledge-intensive and knowledge-free scenarios. Experiments with state-of-the-art models and memory systems (32K to 4M tokens) expose a critical weakness: while adept at static retrieval, agents struggle with the dynamic information synthesis essential for workflows. Our analysis indicates that this degradation is driven by the minimum number of tokens required to resolve a query. This factor explains why the high information density inherent in massive tool responses poses a significantly greater challenge than the memory fragmentation typical of long-turn dialogues.
摘要：大型语言模型 (LLM) 向自主代理的演变需要对广泛的动态上下文进行管理。然而，当前的基准在很大程度上仍然是静态的，依赖于无法模拟代理与环境交互的复杂性的被动检索任务，例如非线性推理和迭代反馈。为了解决这个问题，我们引入了 \textbf{AgentLongBench}，它通过基于横向思维难题的模拟环境推出来评估代理。该框架在知识密集型和无知识场景中生成严格的交互轨迹。对最先进的模型和内存系统（32K 到 4M 令牌）的实验暴露了一个严重的弱点：虽然擅长静态检索，但代理却难以处理工作流程所必需的动态信息合成。我们的分析表明，这种降级是由解决查询所需的最小令牌数量驱动的。这个因素解释了为什么大规模工具响应中固有的高信息密度比长轮对话中典型的记忆碎片带来了更大的挑战。

Title: QueerGen: How LLMs Reflect Societal Norms on Gender and Sexuality in Sentence Completion Tasks

Authors: Mae Sosto, Delfina Sol Martinez Pandiani, Laura Hollink
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2601.20731
Pdf URL: https://arxiv.org/pdf/2601.20731
Copy Paste: [[2601.20731]] QueerGen: How LLMs Reflect Societal Norms on Gender and Sexuality in Sentence Completion Tasks(https://arxiv.org/abs/2601.20731)
Keywords: language model, llm
Abstract: This paper examines how Large Language Models (LLMs) reproduce societal norms, particularly heterocisnormativity, and how these norms translate into measurable biases in their text generations. We investigate whether explicit information about a subject's gender or sexuality influences LLM responses across three subject categories: queer-marked, non-queer-marked, and the normalized "unmarked" category. Representational imbalances are operationalized as measurable differences in English sentence completions across four dimensions: sentiment, regard, toxicity, and prediction diversity. Our findings show that Masked Language Models (MLMs) produce the least favorable sentiment, higher toxicity, and more negative regard for queer-marked subjects. Autoregressive Language Models (ARLMs) partially mitigate these patterns, while closed-access ARLMs tend to produce more harmful outputs for unmarked subjects. Results suggest that LLMs reproduce normative social assumptions, though the form and degree of bias depend strongly on specific model characteristics, which may redistribute, but not eliminate, representational harms.
摘要：本文研究了大型语言模型（LLM）如何再现社会规范，特别是异质规范性，以及这些规范如何转化为文本生成中可测量的偏差。我们调查有关受试者性别或性取向的明确信息是否会影响法学硕士在三个主题类别中的反应：酷儿标记、非酷儿标记和规范化的“未标记”类别。表征不平衡被具体化为四个维度上英语句子完成的可测量差异：情感、尊重、毒性和预测多样性。我们的研究结果表明，蒙面语言模型（MLM）对酷儿标记对象产生最不利的情绪、更高的毒性和更多的负面影响。自回归语言模型 (ARLM) 部分缓解了这些模式，而封闭访问 ARLM 往往会对未标记的主题产生更多有害的输出。结果表明，法学硕士再现了规范的社会假设，尽管偏见的形式和程度在很大程度上取决于特定的模型特征，这可能会重新分配但不能消除代表性危害。

Title: Like a Therapist, But Not: Reddit Narratives of AI in Mental Health Contexts

Authors: Elham Aghakhani, Rezvaneh Rezapour
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2601.20747
Pdf URL: https://arxiv.org/pdf/2601.20747
Copy Paste: [[2601.20747]] Like a Therapist, But Not: Reddit Narratives of AI in Mental Health Contexts(https://arxiv.org/abs/2601.20747)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly used for emotional support and mental health-related interactions outside clinical settings, yet little is known about how people evaluate and relate to these systems in everyday use. We analyze 5,126 Reddit posts from 47 mental health communities describing experiential or exploratory use of AI for emotional support or therapy. Grounded in the Technology Acceptance Model and therapeutic alliance theory, we develop a theory-informed annotation framework and apply a hybrid LLM-human pipeline to analyze evaluative language, adoption-related attitudes, and relational alignment at scale. Our results show that engagement is shaped primarily by narrated outcomes, trust, and response quality, rather than emotional bond alone. Positive sentiment is most strongly associated with task and goal alignment, while companionship-oriented use more often involves misaligned alliances and reported risks such as dependence and symptom escalation. Overall, this work demonstrates how theory-grounded constructs can be operationalized in large-scale discourse analysis and highlights the importance of studying how users interpret language technologies in sensitive, real-world contexts.
摘要：大语言模型 (LLM) 越来越多地用于临床环境之外的情感支持和心理健康相关交互，但人们对日常使用中如何评估和关联这些系统知之甚少。我们分析了来自 47 个心理健康社区的 5,126 个 Reddit 帖子，这些帖子描述了人工智能在情感支持或治疗方面的体验式或探索性使用。基于技术接受模型和治疗联盟理论，我们开发了一个基于理论的注释框架，并应用混合法学硕士-人类管道来大规模分析评估语言、与采用相关的态度和关系对齐。我们的研究结果表明，参与度主要取决于叙述结果、信任和响应质量，而不仅仅是情感纽带。积极情绪与任务和目标的一致性最为密切，而以陪伴为导向的使用往往涉及不一致的联盟和报告的风险，例如依赖性和症状升级。总的来说，这项工作展示了如何在大规模话语分析中实施基于理论的构造，并强调研究用户如何在敏感的现实世界环境中解释语言技术的重要性。

Title: Persona Prompting as a Lens on LLM Social Reasoning

Authors: Jing Yang, Moritz Hechtbauer, Elisabeth Khalilov, Evelyn Luise Brinkmann, Vera Schmitt, Nils Feldhus
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.20757
Pdf URL: https://arxiv.org/pdf/2601.20757
Copy Paste: [[2601.20757]] Persona Prompting as a Lens on LLM Social Reasoning(https://arxiv.org/abs/2601.20757)
Keywords: language model, llm, prompt
Abstract: For socially sensitive tasks like hate speech detection, the quality of explanations from Large Language Models (LLMs) is crucial for factors like user trust and model alignment. While Persona prompting (PP) is increasingly used as a way to steer models toward user-specific generation, its effect on model rationales remains underexplored. We investigate how LLM-generated rationales vary when conditioned on different simulated demographic personas. Using datasets annotated with word-level rationales, we measure agreement with human annotations from different demographic groups, and assess the impact of PP on model bias and human alignment. Our evaluation across three LLMs reveals three key findings: (1) PP improves classification on the most subjective task (hate speech) but degrades rationale quality. (2) Simulated personas fail to align with their real-world demographic counterparts, and high inter-persona agreement shows models are resistant to significant steering. (3) Models exhibit consistent demographic biases and a strong tendency to over-flag content as harmful, regardless of PP. Our findings reveal a critical trade-off: while PP can improve classification in socially sensitive tasks, it often comes at the cost of rationale quality and fails to mitigate underlying biases, urging caution in its application.
摘要：对于仇恨言论检测等社会敏感任务，大型语言模型 (LLM) 的解释质量对于用户信任和模型一致性等因素至关重要。虽然角色提示（PP）越来越多地被用作引导模型转向特定于用户的生成的一种方式，但它对模型原理的影响仍未得到充分探索。我们研究了法学硕士生成的基本原理在不同的模拟人口角色条件下如何变化。使用带有单词级基本原理注释的数据集，我们测量与来自不同人口群体的人类注释的一致性，并评估 PP 对模型偏差和人类对齐的影响。我们对三个法学硕士结果的评估揭示了三个关键发现：（1）PP 改善了最主观任务（仇恨言论）的分类，但降低了基本原理质量。 (2) 模拟的人物角色无法与现实世界的人口统计对应人物保持一致，并且人物角色之间的高度一致性表明模型难以接受重大指导。 (3) 模型表现出一致的人口统计偏见，并且无论 PP 如何，都倾向于将内容过度标记为有害。我们的研究结果揭示了一个关键的权衡：虽然 PP 可以改善社会敏感任务的分类，但它往往以理论质量为代价，并且无法减轻潜在的偏见，因此在应用时要谨慎。

Title: SERA: Soft-Verified Efficient Repository Agents

Authors: Ethan Shen, Danny Tormoen, Saurabh Shah, Ali Farhadi, Tim Dettmers
Subjects: cs.CL, cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2601.20789
Pdf URL: https://arxiv.org/pdf/2601.20789
Copy Paste: [[2601.20789]] SERA: Soft-Verified Efficient Repository Agents(https://arxiv.org/abs/2601.20789)
Keywords: agent
Abstract: Open-weight coding agents should hold a fundamental advantage over closed-source systems: they can be specialized to private codebases, encoding repository-specific information directly in their weights. Yet the cost and complexity of training have kept this advantage theoretical. We show it is now practical. We present Soft-Verified Efficient Repository Agents (SERA), an efficient method for training coding agents that enables the rapid and cheap creation of agents specialized to private codebases. Using only supervised finetuning (SFT), SERA achieves state-of-the-art results among fully open-source (open data, method, code) models while matching the performance of frontier open-weight models like Devstral-Small-2. Creating SERA models is 26x cheaper than reinforcement learning and 57x cheaper than previous synthetic data methods to reach equivalent performance. Our method, Soft Verified Generation (SVG), generates thousands of trajectories from a single code repository. Combined with cost-efficiency, this enables specialization to private codebases. Beyond repository specialization, we apply SVG to a larger corpus of codebases, generating over 200,000 synthetic trajectories. We use this dataset to provide detailed analysis of scaling laws, ablations, and confounding factors for training coding agents. Overall, we believe our work will greatly accelerate research on open coding agents and showcase the advantage of open-source models that can specialize to private codebases. We release SERA as the first model in Ai2's Open Coding Agents series, along with all our code, data, and Claude Code integration to support the research community.
摘要：开放权重编码代理应该比闭源系统拥有根本优势：它们可以专门用于私有代码库，直接在其权重中编码存储库特定的信息。然而，培训的成本和复杂性使这一优势保持在理论上。我们证明它现在是实用的。我们提出了软验证高效存储库代理（SERA），这是一种训练编码代理的有效方法，可以快速、廉价地创建专用于私有代码库的代理。仅使用监督微调 (SFT)，SERA 在完全开源（开放数据、方法、代码）模型中实现了最先进的结果，同时与 Devstral-Small-2 等前沿开放权重模型的性能相匹配。创建 SERA 模型比强化学习便宜 26 倍，比以前的合成数据方法便宜 57 倍，以达到同等性能。我们的方法“软验证生成”(SVG) 可从单个代码存储库生成数千条轨迹。与成本效益相结合，这使得私有代码库的专业化成为可能。除了存储库专业化之外，我们还将 SVG 应用于更大的代码库，生成超过 200,000 个合成轨迹。我们使用该数据集来提供对训练编码代理的缩放法则、消融和混杂因素的详细分析。总的来说，我们相信我们的工作将极大地加速开放编码代理的研究，并展示专门针对私有代码库的开源模型的优势。我们发布 SERA 作为 Ai2 开放编码代理系列中的第一个模型，以及我们所有的代码、数据和 Claude 代码集成，以支持研究社区。

Title: Dissecting Multimodal In-Context Learning: Modality Asymmetries and Circuit Dynamics in modern Transformers

Authors: Yiran Huang, Karsten Roth, Quentin Bouniot, Wenjia Xu, Zeynep Akata
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.20796
Pdf URL: https://arxiv.org/pdf/2601.20796
Copy Paste: [[2601.20796]] Dissecting Multimodal In-Context Learning: Modality Asymmetries and Circuit Dynamics in modern Transformers(https://arxiv.org/abs/2601.20796)
Keywords: language model
Abstract: Transformer-based multimodal large language models often exhibit in-context learning (ICL) abilities. Motivated by this phenomenon, we ask: how do transformers learn to associate information across modalities from in-context examples? We investigate this question through controlled experiments on small transformers trained on synthetic classification tasks, enabling precise manipulation of data statistics and model architecture. We begin by revisiting core principles of unimodal ICL in modern transformers. While several prior findings replicate, we find that Rotary Position Embeddings (RoPE) increases the data complexity threshold for ICL. Extending to the multimodal setting reveals a fundamental learning asymmetry: when pretrained on high-diversity data from a primary modality, surprisingly low data complexity in the secondary modality suffices for multimodal ICL to emerge. Mechanistic analysis shows that both settings rely on an induction-style mechanism that copies labels from matching in-context exemplars; multimodal training refines and extends these circuits across modalities. Our findings provide a mechanistic foundation for understanding multimodal ICL in modern transformers and introduce a controlled testbed for future investigation.
摘要：基于 Transformer 的多模态大语言模型通常表现出上下文学习（ICL）能力。受这种现象的启发，我们问：变压器如何学习从上下文示例中跨模态关联信息？我们通过对经过综合分类任务训练的小型变压器进行受控实验来研究这个问题，从而能够精确操作数据统计和模型架构。我们首先回顾现代变压器中单峰 ICL 的核心原理。虽然之前的一些研究结果重复了，但我们发现旋转位置嵌入 (RoPE) 增加了 ICL 的数据复杂性阈值。扩展到多模态设置揭示了一个基本的学习不对称性：当对来自主要模态的高多样性数据进行预训练时，次要模态中令人惊讶的低数据复杂性足以出现多模态 ICL。机制分析表明，这两种设置都依赖于归纳式机制，从匹配的上下文样本中复制标签。多模式培训完善并扩展了这些循环的跨模式。我们的研究结果为理解现代变压器中的多模态 ICL 提供了机械基础，并为未来的研究引入了受控测试平台。
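
Note: the abstract describes an induction-style mechanism that copies labels from matching in-context exemplars. The sketch below is only a functional analogy of that behavior, not the circuit itself or the authors' setup: the dot-product matching rule, the toy 2-d features, and the function name are all assumptions made for illustration.

```python
def copy_label_from_match(context, query):
    """Return the label of the in-context exemplar most similar to the query.

    context: list of (feature_vector, label) pairs shown in context.
    query:   feature vector of the item to classify.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    # "Match" step: find the exemplar whose features best align with the query.
    _, best_label = max(context, key=lambda pair: dot(pair[0], query))
    # "Copy" step: output that exemplar's label unchanged.
    return best_label


# Toy context with two exemplars (hypothetical features and labels).
context = [((1.0, 0.0), "A"), ((0.0, 1.0), "B")]
print(copy_label_from_match(context, (0.9, 0.1)))
```

In the multimodal setting the abstract discusses, the exemplar features and the query could come from different modalities, so the matching step would have to operate across modalities rather than within one.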

Title: Structured Semantic Information Helps Retrieve Better Examples for In-Context Learning in Few-Shot Relation Extraction

Authors: Aunabil Chakma, Mihai Surdeanu, Eduardo Blanco
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.20803
Pdf URL: https://arxiv.org/pdf/2601.20803
Copy Paste: [[2601.20803]] Structured Semantic Information Helps Retrieve Better Examples for In-Context Learning in Few-Shot Relation Extraction(https://arxiv.org/abs/2601.20803)
Keywords: llm
Abstract: This paper presents several strategies to automatically obtain additional examples for in-context learning of one-shot relation extraction. Specifically, we introduce a novel strategy for example selection, in which new examples are selected based on the similarity of their underlying syntactic-semantic structure to the provided one-shot example. We show that this method results in complementary word choices and sentence structures when compared to LLM-generated examples. When these strategies are combined, the resulting hybrid system achieves a more holistic picture of the relations of interest than either method alone. Our framework transfers well across datasets (FS-TACRED and FS-FewRel) and LLM families (Qwen and Gemma). Overall, our hybrid selection method consistently outperforms alternative strategies and achieves state-of-the-art performance on FS-TACRED and strong gains on a customized FewRel subset.
摘要：本文提出了几种自动获取额外示例的策略，用于一次性关系提取的上下文学习。具体来说，我们引入了一种新颖的示例选择策略，其中根据其底层句法语义结构与所提供的一次性示例的相似性来选择新示例。我们表明，与法学硕士生成的示例相比，该方法会产生互补的单词选择和句子结构。当这些策略结合起来时，所产生的混合系统比单独使用任何一种方法都能更全面地了解利益关系。我们的框架可以在数据集（FS-TACRED 和 FS-FewRel）和 LLM 系列（Qwen 和 Gemma）之间很好地传输。总体而言，我们的混合选择方法始终优于替代策略，并在 FS-TACRED 上实现了最先进的性能，并在定制的 FewRel 子集上取得了巨大的进步。

Title: Linear representations in language models can change dramatically over a conversation

Authors: Andrew Kyle Lampinen, Yuxuan Li, Eghbal Hosseini, Sangnie Bhardwaj, Murray Shanahan
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.20834
Pdf URL: https://arxiv.org/pdf/2601.20834
Copy Paste: [[2601.20834]] Linear representations in language models can change dramatically over a conversation(https://arxiv.org/abs/2601.20834)
Keywords: language model
Abstract: Language model representations often contain linear directions that correspond to high-level concepts. Here, we study the dynamics of these representations: how representations evolve along these dimensions within the context of (simulated) conversations. We find that linear representations can change dramatically over a conversation; for example, information that is represented as factual at the beginning of a conversation can be represented as non-factual at the end and vice versa. These changes are content-dependent; while representations of conversation-relevant information may change, generic information is generally preserved. These changes are robust even for dimensions that disentangle factuality from more superficial response patterns, and occur across different model families and layers of the model. These representation changes do not require on-policy conversations; even replaying a conversation script written by an entirely different model can produce similar changes. However, adaptation is much weaker from simply having a sci-fi story in context that is framed more explicitly as such. We also show that steering along a representational direction can have dramatically different effects at different points in a conversation. These results are consistent with the idea that representations may evolve in response to the model playing a particular role that is cued by a conversation. Our findings may pose challenges for interpretability and steering -- in particular, they imply that it may be misleading to use static interpretations of features or directions, or probes that assume a particular range of features consistently corresponds to a particular ground-truth value. However, these types of representational dynamics also point to exciting new research directions for understanding how models adapt to context.
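Abstract (translated from Chinese): Language model representations often contain linear directions that correspond to high-level concepts. Here, we study the dynamics of these representations: how they evolve along these dimensions within (simulated) conversations. We find that linear representations can change dramatically over a conversation; for example, information represented as factual at the beginning of a conversation can be represented as non-factual at the end, and vice versa. These changes are content-dependent: representations of conversation-relevant information may change, while generic information is generally preserved. The changes are robust even for dimensions that disentangle factuality from more superficial response patterns, and they occur across different model families and model layers. They do not require on-policy conversations; even replaying a conversation script written by an entirely different model can produce similar changes. However, adaptation is much weaker when the context simply contains a sci-fi story that is framed more explicitly as such. We also show that steering along a representational direction can have dramatically different effects at different points in a conversation. These results are consistent with the idea that representations may evolve as the model plays a particular role cued by the conversation. Our findings may pose challenges for interpretability and steering; in particular, they imply that it may be misleading to use static interpretations of features or directions, or probes that assume a particular range of feature values consistently corresponds to a particular ground-truth value. However, these representational dynamics also point to exciting new research directions for understanding how models adapt to context.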

Title: When Flores Bloomz Wrong: Cross-Direction Contamination in Machine Translation Evaluation

Authors: David Tan, Pinzhen Chen, Josef van Genabith, Koel Dutta Chowdhury
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.20858
Pdf URL: https://arxiv.org/pdf/2601.20858
Copy Paste: [[2601.20858]] When Flores Bloomz Wrong: Cross-Direction Contamination in Machine Translation Evaluation(https://arxiv.org/abs/2601.20858)
Keywords: language model, llm
Abstract: Large language models (LLMs) can be benchmark-contaminated, resulting in inflated scores that mask memorization as generalization, and in multilingual settings, this memorization can even transfer to "uncontaminated" languages. Using the FLORES-200 translation benchmark as a diagnostic, we study two 7-8B instruction-tuned multilingual LLMs: Bloomz, which was trained on FLORES, and Llama as an uncontaminated control. We confirm Bloomz's FLORES contamination and demonstrate that machine translation contamination can be cross-directional, artificially boosting performance in unseen translation directions due to target-side memorization. Further analysis shows that recall of memorized references often persists despite various source-side perturbation efforts like paraphrasing and named entity replacement. However, replacing named entities leads to a consistent decrease in BLEU, suggesting an effective probing method for memorization in contaminated models.
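Abstract (translated from Chinese): Large language models (LLMs) can be benchmark-contaminated, producing inflated scores that disguise memorization as generalization; in multilingual settings, this memorization can even transfer to "uncontaminated" languages. Using the FLORES-200 translation benchmark as a diagnostic, we study two 7-8B instruction-tuned multilingual LLMs: Bloomz, which was trained on FLORES, and Llama, which serves as an uncontaminated control. We confirm Bloomz's FLORES contamination and show that machine translation contamination can be cross-directional, artificially boosting performance in unseen translation directions through target-side memorization. Further analysis shows that recall of memorized references often persists despite various source-side perturbations such as paraphrasing and named entity replacement. However, replacing named entities leads to a consistent drop in BLEU, suggesting an effective probing method for memorization in contaminated models.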