2026-03-04

Title: Characterizing Memorization in Diffusion Language Models: Generalized Extraction and Sampling Effects

Authors: Xiaoyu Luo, Wenrui Yu, Qiongxiu Li, Johannes Bjerva
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.02333
Pdf URL: https://arxiv.org/pdf/2603.02333
Copy Paste: [[2603.02333]] Characterizing Memorization in Diffusion Language Models: Generalized Extraction and Sampling Effects(https://arxiv.org/abs/2603.02333)
Keywords: language model
Abstract: Autoregressive language models (ARMs) have been shown to memorize and occasionally reproduce training data verbatim, raising concerns about privacy and copyright liability. Diffusion language models (DLMs) have recently emerged as a competitive alternative, yet their memorization behavior remains largely unexplored due to fundamental differences in generation dynamics. To address this gap, we present a systematic theoretical and empirical characterization of memorization in DLMs. We propose a generalized probabilistic extraction framework that unifies prefix-conditioned decoding and diffusion-based generation under arbitrary masking patterns and stochastic sampling trajectories. Theorem 4.3 establishes a monotonic relationship between sampling resolution and memorization: increasing resolution strictly increases the probability of exact training data extraction, implying that autoregressive decoding corresponds to a limiting case of diffusion-based generation by setting the sampling resolution maximal. Extensive experiments across model scales and sampling strategies validate our theoretical predictions. Under aligned prefix-conditioned evaluations, we further demonstrate that DLMs exhibit substantially lower memorization-based leakage of personally identifiable information (PII) compared to ARMs.

Title: Detecting AI-Generated Essays in Writing Assessment: Responsible Use and Generalizability Across LLMs

Authors: Jiangang Hao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.02353
Pdf URL: https://arxiv.org/pdf/2603.02353
Copy Paste: [[2603.02353]] Detecting AI-Generated Essays in Writing Assessment: Responsible Use and Generalizability Across LLMs(https://arxiv.org/abs/2603.02353)
Keywords: language model, llm, prompt
Abstract: Writing is a foundational literacy skill that underpins effective communication, fosters critical thinking, facilitates learning across disciplines, and enables individuals to organize and articulate complex ideas. Consequently, writing assessment plays a vital role in evaluating language proficiency, communicative effectiveness, and analytical reasoning. The rapid advancement of large language models (LLMs) has made it increasingly easy to generate coherent, high-quality essays, raising significant concerns about the authenticity of student-submitted work. This chapter first provides an overview of the current landscape of detectors for AI-generated and AI-assisted essays, along with guidelines for their responsible use. It then presents empirical analyses to evaluate how well detectors trained on essays from one LLM generalize to identifying essays produced by other LLMs, based on essays generated in response to public GRE writing prompts. These findings provide guidance for developing and retraining detectors for practical applications.

Title: CoDAR: Continuous Diffusion Language Models are More Powerful Than You Think

Authors: Junzhe Shen, Jieru Zhao, Ziwei He, Zhouhan Lin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.02547
Pdf URL: https://arxiv.org/pdf/2603.02547
Copy Paste: [[2603.02547]] CoDAR: Continuous Diffusion Language Models are More Powerful Than You Think(https://arxiv.org/abs/2603.02547)
Keywords: language model
Abstract: We study why continuous diffusion language models (DLMs) have lagged behind discrete diffusion approaches despite their appealing continuous generative dynamics. Under a controlled token-recovery study, we identify token rounding, the final projection from denoised embeddings to tokens, as a primary bottleneck. Building on these insights, we propose CoDAR (Continuous Diffusion with Contextual AutoRegressive Decoder), a two-stage framework that keeps diffusion entirely continuous in an embedding space while learning a strong, context-conditional discretizer: an autoregressive Transformer decoder that cross-attends to the denoised embedding sequence and performs contextualized rounding to tokens. Experiments on LM1B and OpenWebText demonstrate that CoDAR substantially improves generation quality over latent diffusion and becomes competitive with strong discrete DLMs, while exposing a simple decoder-temperature knob to navigate the fluency-diversity trade-off.

Title: How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities

Authors: Ziwen Xu, Kewei Xu, Haoming Xu, Haiwen Hong, Longtao Huang, Hui Xue, Ningyu Zhang, Yongliang Shen, Guozhou Zheng, Huajun Chen, Shumin Deng
Subjects: cs.CL, cs.AI, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2603.02578
Pdf URL: https://arxiv.org/pdf/2603.02578
Copy Paste: [[2603.02578]] How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities(https://arxiv.org/abs/2603.02578)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are increasingly deployed in socially sensitive domains, yet their unpredictable behaviors, ranging from misaligned intent to inconsistent personality, pose significant risks. We introduce SteerEval, a hierarchical benchmark for evaluating LLM controllability across three domains: language features, sentiment, and personality. Each domain is structured into three specification levels: L1 (what to express), L2 (how to express), and L3 (how to instantiate), connecting high-level behavioral intent to concrete textual output. Using SteerEval, we systematically evaluate contemporary steering methods, revealing that control often degrades at finer-grained levels. Our benchmark offers a principled and interpretable framework for safe and controllable LLM behavior, serving as a foundation for future research.

Title: ExpGuard: LLM Content Moderation in Specialized Domains

Authors: Minseok Choi, Dongjin Kim, Seungbin Yang, Subin Kim, Youngjun Kwak, Juyoung Oh, Jaegul Choo, Jungmin Son
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.02588
Pdf URL: https://arxiv.org/pdf/2603.02588
Copy Paste: [[2603.02588]] ExpGuard: LLM Content Moderation in Specialized Domains(https://arxiv.org/abs/2603.02588)
Keywords: language model, llm, prompt
Abstract: With the growing deployment of large language models (LLMs) in real-world applications, establishing robust safety guardrails to moderate their inputs and outputs has become essential to ensure adherence to safety policies. Current guardrail models predominantly address general human-LLM interactions, rendering LLMs vulnerable to harmful and adversarial content within domain-specific contexts, particularly those rich in technical jargon and specialized concepts. To address this limitation, we introduce ExpGuard, a robust and specialized guardrail model designed to protect against harmful prompts and responses across financial, medical, and legal domains. In addition, we present ExpGuardMix, a meticulously curated dataset comprising 58,928 labeled prompts paired with corresponding refusal and compliant responses, from these specific sectors. This dataset is divided into two subsets: ExpGuardTrain, for model training, and ExpGuardTest, a high-quality test set annotated by domain experts to evaluate model robustness against technical and domain-specific content. Comprehensive evaluations conducted on ExpGuardTest and eight established public benchmarks reveal that ExpGuard delivers competitive performance across the board while demonstrating exceptional resilience to domain-specific adversarial attacks, surpassing state-of-the-art models such as WildGuard by up to 8.9% in prompt classification and 15.3% in response classification. To encourage further research and development, we open-source our code, data, and model, enabling adaptation to additional domains and supporting the creation of increasingly robust guardrail models.

Title: GPUTOK: GPU Accelerated Byte Level BPE Tokenization

Authors: Venu Gopal Kadamba, Kanishkha Jaisankar
Subjects: cs.CL, cs.AI, cs.DC, cs.LG
Abstract URL: https://arxiv.org/abs/2603.02597
Pdf URL: https://arxiv.org/pdf/2603.02597
Copy Paste: [[2603.02597]] GPUTOK: GPU Accelerated Byte Level BPE Tokenization(https://arxiv.org/abs/2603.02597)
Keywords: language model, gpt, prompt
Abstract: As large language models move toward million-token context windows, CPU tokenizers become a major slowdown because they process text one step at a time while powerful GPUs sit unused. We built a GPU-based byte-level BPE tokenizer that follows GPT-2's merge rules. It includes a basic BlockBPE-style kernel and a faster, optimized version that uses cuCollections static map, CUB reductions, and a pybind11 interface for Python. On WikiText103 sequences up to 131k tokens, the optimized GPU tokenizer produces the same tokens as a CPU version and, for the longest inputs, is about 1.7x faster than tiktoken and about 7.6x faster than the HuggingFace GPT-2 tokenizer. Nsight profiling shows that 70-80% of CUDA API time goes to memory allocation, so adding memory pooling should give the biggest speed boost next. Tests on generation tasks using WikiText103 prompts show that our GPU tokenizer's outputs stay within about one percentage point of tiktoken and HuggingFace GPT-2 on similarity and overlap metrics, meaning it keeps output quality while making long-context inference more practical.
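
The merge procedure that the GPUTOK abstract parallelizes can be illustrated with a minimal CPU reference sketch of greedy byte-level BPE. This is not the authors' kernel: the function name `bpe_encode` and the toy merge table `TOY_RANKS` are hypothetical, and GPT-2's regex pre-tokenization and byte-to-unicode mapping are deliberately left out.

```python
def bpe_encode(data: bytes, merge_ranks: dict) -> list:
    """Greedy byte-level BPE: repeatedly merge the adjacent pair with the lowest rank."""
    parts = [bytes([b]) for b in data]  # start from single-byte tokens
    while len(parts) > 1:
        best_rank, best_idx = None, -1
        # Scan adjacent pairs and remember the leftmost pair with the lowest rank.
        for i in range(len(parts) - 1):
            rank = merge_ranks.get((parts[i], parts[i + 1]))
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank, best_idx = rank, i
        if best_rank is None:
            break  # no mergeable pair is left
        parts[best_idx:best_idx + 2] = [parts[best_idx] + parts[best_idx + 1]]
    return parts


# Hypothetical toy merge table (lower rank = merged earlier); not GPT-2's real table.
TOY_RANKS = {(b"l", b"o"): 0, (b"lo", b"w"): 1, (b"e", b"r"): 2}
tokens = bpe_encode(b"lower", TOY_RANKS)
```

Each pass rescans the whole sequence, which is the step-by-step behavior the abstract identifies as the CPU bottleneck for very long inputs.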

Title: Think, But Don't Overthink: Reproducing Recursive Language Models

Authors: Daren Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.02615
Pdf URL: https://arxiv.org/pdf/2603.02615
Copy Paste: [[2603.02615]] Think, But Don't Overthink: Reproducing Recursive Language Models(https://arxiv.org/abs/2603.02615)
Keywords: language model, llm, prompt, agent
Abstract: This project reproduces and extends the recently proposed "Recursive Language Models" (RLMs) framework by Zhang et al. (2026). This framework enables Large Language Models (LLMs) to process near-infinite contexts by offloading the prompt into an external REPL environment. While the original paper relies on a default recursion depth of 1 and suggests deeper recursion as a future direction, this study specifically investigates the impact of scaling the recursion depth. Using state-of-the-art open-source agentic models (DeepSeek v3.2 and Kimi K2), I evaluated pure LLM, RLM (depth=1), and RLM (depth=2) on the S-NIAH and OOLONG benchmarks. The findings reveal a compelling phenomenon: Deeper recursion causes models to "overthink". While depth-1 RLMs effectively boost accuracy on complex reasoning tasks, applying deeper recursion (depth=2) or using RLMs on simple retrieval tasks paradoxically degrades performance and exponentially inflates execution time (e.g., from 3.6s to 344.5s) and token costs. Code and data are available at: this https URL

Title: Cross-Family Speculative Prefill: Training-Free Long-Context Compression with Small Draft Models

Authors: Shubhangi Upasani, Ravi Shanker Raju, Bo Li, Mengmeing Ji, John Long, Chen Wu, Urmish Thakker, Guangtao Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.02631
Pdf URL: https://arxiv.org/pdf/2603.02631
Copy Paste: [[2603.02631]] Cross-Family Speculative Prefill: Training-Free Long-Context Compression with Small Draft Models(https://arxiv.org/abs/2603.02631)
Keywords: language model, llm, prompt, agent
Abstract: Prompt length is a major bottleneck in agentic large language model (LLM) workloads, where repeated inference steps and multi-call loops incur substantial prefill cost. Recent work on speculative prefill demonstrates that attention-based token importance estimation can enable training-free prompt compression, but this assumes the existence of a draft model that shares the same tokenizer as the target model. In practice, however, agentic pipelines frequently employ models without any smaller in-family draft model. In this work, we study cross-family speculative prefill, where a lightweight draft model from one model family is used to perform prompt compression for a target model from a different family. Using the same speculative prefill mechanism as prior work, we evaluate a range of cross-family draft-target combinations, including Qwen, LLaMA, and DeepSeek models. Across a broad diversity of tasks, we find that attention-based token importance estimation transfers reliably across different model families despite differences in model architectures and tokenizers between draft and target models. Cross-model prompt compression largely retains 90-100% of full-prompt baseline performance and, in some cases, slightly improves accuracy due to denoising effects, while delivering substantial reductions in time to first token (TTFT). These results suggest that speculative prefill depends mainly on task priors and semantic structure, thus serving as a generalizable prompt compression primitive. We discuss the implications of our findings for agentic systems, where repeated long-context inference and heterogeneous model stacks make cross-model prompt compression both necessary and practical.

Title: Real-Time Generation of Game Video Commentary with Multimodal LLMs: Pause-Aware Decoding Approaches

Authors: Anum Afzal, Yuki Saito, Hiroya Takamura, Katsuhito Sudoh, Shinnosuke Takamichi, Graham Neubig, Florian Matthes, Tatsuya Ishigaki
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.02655
Pdf URL: https://arxiv.org/pdf/2603.02655
Copy Paste: [[2603.02655]] Real-Time Generation of Game Video Commentary with Multimodal LLMs: Pause-Aware Decoding Approaches(https://arxiv.org/abs/2603.02655)
Keywords: language model, llm, prompt
Abstract: Real-time video commentary generation provides textual descriptions of ongoing events in videos. It supports accessibility and engagement in domains such as sports, esports, and livestreaming. Commentary generation involves two essential decisions: what to say and when to say it. While recent prompting-based approaches using multimodal large language models (MLLMs) have shown strong performance in content generation, they largely ignore the timing aspect. We investigate whether in-context prompting alone can support real-time commentary generation that is both semantically relevant and well-timed. We propose two prompting-based decoding strategies: 1) a fixed-interval approach, and 2) a novel dynamic interval-based decoding approach that adjusts the next prediction timing based on the estimated duration of the previous utterance. Both methods enable pause-aware generation without any fine-tuning. Experiments on Japanese and English datasets of racing and fighting games show that the dynamic interval-based decoding can generate commentary more closely aligned with human utterance timing and content using prompting alone. We release a multilingual benchmark dataset, trained models, and implementations to support future research on real-time video commentary generation.
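
The dynamic interval idea in the abstract above, where the next prediction time depends on the estimated duration of the previous utterance, can be sketched as a simple scheduling loop. Everything named here is an assumption for illustration: `generate` stands in for a call to a multimodal model, and the characters-per-second speaking rate and minimum pause are made-up defaults, not values from the paper.

```python
def estimate_duration(utterance: str, chars_per_second: float = 15.0,
                      min_pause: float = 0.5) -> float:
    """Rough speaking time for an utterance; never shorter than min_pause seconds."""
    return max(min_pause, len(utterance) / chars_per_second)


def schedule_commentary(generate, video_length: float, start: float = 0.0) -> list:
    """Call generate(t) at times spaced by the estimated length of the previous utterance."""
    t = start
    timeline = []
    while t < video_length:
        utterance = generate(t)  # hypothetical model call at video time t
        timeline.append((t, utterance))
        t += estimate_duration(utterance)  # wait until the utterance would be finished
    return timeline


# Toy stand-in for the model: always returns a 30-character comment (2.0 s at 15 chars/s).
timeline = schedule_commentary(lambda t: "x" * 30, video_length=5.0)
```

A fixed-interval variant would replace the `estimate_duration` call with a constant step; the minimum pause guarantees the loop advances even when the model returns an empty string.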

Title: Evaluating Cross-Modal Reasoning Ability and Problem Characteristics with Multimodal Item Response Theory

Authors: Shunki Uebayashi, Kento Masui, Kyohei Atarashi, Han Bao, Hisashi Kashima, Naoto Inoue, Mayu Otani, Koh Takeuchi
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2603.02663
Pdf URL: https://arxiv.org/pdf/2603.02663
Copy Paste: [[2603.02663]] Evaluating Cross-Modal Reasoning Ability and Problem Characteristics with Multimodal Item Response Theory(https://arxiv.org/abs/2603.02663)
Keywords: language model, llm
Abstract: Multimodal Large Language Models (MLLMs) have recently emerged as general architectures capable of reasoning over diverse modalities. Benchmarks for MLLMs should measure their ability for cross-modal integration. However, current benchmarks are filled with shortcut questions, which can be solved using only a single modality, thereby yielding unreliable rankings. For example, in vision-language cases, we can find the correct answer without either the image or the text. These low-quality questions unnecessarily increase the size and computational requirements of benchmarks. We introduce a multi-modal and multidimensional item response theory framework (M3IRT) that extends classical IRT by decomposing both model ability and item difficulty into image-only, text-only, and cross-modal components. M3IRT estimates cross-modal ability of MLLMs and each question's cross-modal difficulty, enabling compact, high-quality subsets that better reflect multimodal reasoning. Across 24 VLMs on three benchmarks, M3IRT prioritizes genuinely cross-modal questions over shortcuts and preserves ranking fidelity even when 50% of items are artificially generated low-quality questions, thereby reducing evaluation cost while improving reliability. M3IRT thus offers a practical tool for assessing cross-modal reasoning and refining multimodal benchmarks.

Title: ITLC at SemEval-2026 Task 11: Normalization and Deterministic Parsing for Formal Reasoning in LLMs

Authors: Wicaksono Leksono Muhamad, Joanito Agili Lopo, Tack Hwa Wong, Muhammad Ravi Shulthan Habibi, Samuel Cahyawijaya
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.02676
Pdf URL: https://arxiv.org/pdf/2603.02676
Copy Paste: [[2603.02676]] ITLC at SemEval-2026 Task 11: Normalization and Deterministic Parsing for Formal Reasoning in LLMs(https://arxiv.org/abs/2603.02676)
Keywords: language model, llm
Abstract: Large language models suffer from content effects in reasoning tasks, particularly in multi-lingual contexts. We introduce a novel method that reduces these biases through explicit structural abstraction that transforms syllogisms into canonical logical representations and applies deterministic parsing to determine validity. Evaluated on the SemEval-2026 Task 11 multilingual benchmark, our approach achieves top-5 rankings across all subtasks while substantially reducing content effects and offering a competitive alternative to complex fine-tuning or activation-level interventions.

Title: HateMirage: An Explainable Multi-Dimensional Dataset for Decoding Faux Hate and Subtle Online Abuse

Authors: Sai Kartheek Reddy Kasu, Shankar Biradar, Sunil Saumya, Md. Shad Akhtar
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2603.02684
Pdf URL: https://arxiv.org/pdf/2603.02684
Copy Paste: [[2603.02684]] HateMirage: An Explainable Multi-Dimensional Dataset for Decoding Faux Hate and Subtle Online Abuse(https://arxiv.org/abs/2603.02684)
Keywords: language model
Abstract: Subtle and indirect hate speech remains an underexplored challenge in online safety research, particularly when harmful intent is embedded within misleading or manipulative narratives. Existing hate speech datasets primarily capture overt toxicity, underrepresenting the nuanced ways misinformation can incite or normalize hate. To address this gap, we present HateMirage, a novel dataset of Faux Hate comments designed to advance reasoning and explainability research on hate emerging from fake or distorted narratives. The dataset was constructed by identifying widely debunked misinformation claims from fact-checking sources and tracing related YouTube discussions, resulting in 4,530 user comments. Each comment is annotated along three interpretable dimensions: Target (who is affected), Intent (the underlying motivation or goal behind the comment), and Implication (its potential social impact). Unlike prior explainability datasets such as HateXplain and HARE, which offer token-level or single-dimensional reasoning, HateMirage introduces a multi-dimensional explanation framework that captures the interplay between misinformation, harm, and social consequence. We benchmark multiple open-source language models on HateMirage using ROUGE-L F1 and Sentence-BERT similarity to assess explanation coherence. Results suggest that explanation quality may depend more on pretraining diversity and reasoning-oriented data rather than on model scale alone. By coupling misinformation reasoning with harm attribution, HateMirage establishes a new benchmark for interpretable hate detection and responsible AI research.
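
For readers unfamiliar with the ROUGE-L F1 metric named in the abstract above, a minimal sketch follows. It assumes lowercased whitespace tokenization and the plain F1 form (equal weight on precision and recall); the authors' exact tokenization and scoring package are not stated in the abstract.

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
```

In this example the longest common subsequence has 5 tokens out of 6 on each side, so precision, recall, and F1 all equal 5/6.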

Title: Graph-GRPO: Stabilizing Multi-Agent Topology Learning via Group Relative Policy Optimization

Authors: Yueyang Cang, Xiaoteng Zhang, Erlu Zhao, Zehua Ji, Yuhang Liu, Yuchen He, Zhiyuan Ning, Chen Yijun, Wenge Que, Li Shi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.02701
Pdf URL: https://arxiv.org/pdf/2603.02701
Copy Paste: [[2603.02701]] Graph-GRPO: Stabilizing Multi-Agent Topology Learning via Group Relative Policy Optimization(https://arxiv.org/abs/2603.02701)
Keywords: language model, llm, agent
Abstract: Optimizing communication topology is fundamental to the efficiency and effectiveness of Large Language Model (LLM)-based Multi-Agent Systems (MAS). While recent approaches utilize reinforcement learning to dynamically construct task-specific graphs, they typically rely on single-sample policy gradients with absolute rewards (e.g., binary correctness). This paradigm suffers from severe gradient variance and the credit assignment problem: simple queries yield non-informative positive rewards for suboptimal structures, while difficult queries often result in failures that provide no learning signal. To address these challenges, we propose Graph-GRPO, a novel topology optimization framework that integrates Group Relative Policy Optimization. Instead of evaluating a single topology in isolation, Graph-GRPO samples a group of diverse communication graphs for each query and computes the advantage of specific edges based on their relative performance within the group. By normalizing rewards across the sampled group, our method effectively mitigates the noise derived from task difficulty variance and enables fine-grained credit assignment. Extensive experiments on reasoning and code generation benchmarks demonstrate that Graph-GRPO significantly outperforms state-of-the-art baselines, achieving superior training stability and identifying critical communication pathways previously obscured by reward noise.
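
The reward normalization across a sampled group described in the abstract above can be sketched as follows. This shows only the standard group-relative advantage (reward minus group mean, divided by group standard deviation); how Graph-GRPO attributes that advantage to individual edges is not specified in the abstract and is not modeled here. The function name and the `eps` stabilizer are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list, eps: float = 1e-8) -> list:
    """Normalize each reward against the mean and std of its own sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four topologies sampled for one query, scored with binary correctness.
advantages = group_relative_advantages([1, 0, 0, 1])
```

When every sample in the group gets the same reward (an easy query where all topologies succeed, or a hard one where all fail), each advantage is zero, so such queries contribute no gradient instead of a misleading absolute reward.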

Title: Sensory-Aware Sequential Recommendation via Review-Distilled Representations

Authors: Yeo Chan Yoon
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.02709
Pdf URL: https://arxiv.org/pdf/2603.02709
Copy Paste: [[2603.02709]] Sensory-Aware Sequential Recommendation via Review-Distilled Representations(https://arxiv.org/abs/2603.02709)
Keywords: language model
Abstract: We propose a novel framework for sensory-aware sequential recommendation that enriches item representations with linguistically extracted sensory attributes from product reviews. Our approach, ASEGR (Attribute-based Sensory Enhanced Generative Recommendation), introduces a two-stage pipeline in which a large language model is first fine-tuned as a teacher to extract structured sensory attribute-value pairs, such as "color: matte black" and "scent: vanilla", from unstructured review text. The extracted structures are then distilled into a compact student transformer that produces fixed-dimensional sensory embeddings for each item. These embeddings encode experiential semantics in a reusable form and are incorporated into standard sequential recommender architectures as additional item-level representations. We evaluate our method on four Amazon domains and integrate the learned sensory embeddings into representative sequential recommendation models, including SASRec, BERT4Rec, and BSARec. Across domains, sensory-enhanced models consistently outperform their identifier-based counterparts, indicating that linguistically grounded sensory representations provide complementary signals to behavioral interaction patterns. Qualitative analysis further shows that the extracted attributes align closely with human perceptions of products, enabling interpretable connections between natural language descriptions and recommendation behavior. Overall, this work demonstrates that sensory attribute distillation offers a principled and scalable way to bridge information extraction and sequential recommendation through structured semantic representation learning.

Title: Efficient Self-Evaluation for Diffusion Language Models via Sequence Regeneration

Authors: Linhao Zhong, Linyu Wu, Wen Wang, Yuling Xi, Chenchen Jing, Jiaheng Zhang, Hao Chen, Chunhua Shen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.02760
Pdf URL: https://arxiv.org/pdf/2603.02760
Copy Paste: [[2603.02760]] Efficient Self-Evaluation for Diffusion Language Models via Sequence Regeneration(https://arxiv.org/abs/2603.02760)
Keywords: language model, llm
Abstract: Diffusion large language models (dLLMs) have recently attracted significant attention for their ability to enhance diversity, controllability, and parallelism. However, their non-sequential, bidirectionally masked generation makes quality assessment difficult, underscoring the need for effective self-evaluation. In this work, we propose DiSE, a simple yet effective self-evaluation confidence quantification method for dLLMs. DiSE quantifies confidence by computing the probability of regenerating the tokens in the entire generated sequence, given the full context. This method enables more efficient and reliable quality assessment by leveraging token regeneration probabilities, facilitating both likelihood estimation and robust uncertainty quantification. Building upon DiSE, we further introduce a flexible-length generation framework, which adaptively controls the sequence length based on the model's self-assessment of its own output. We analyze and validate the feasibility of DiSE from the perspective of dLLM generalization, and empirically demonstrate that DiSE is positively correlated with both semantic coherence and answer accuracy. Extensive experiments on likelihood evaluation, uncertainty quantification, and flexible-length generation further confirm the effectiveness of the proposed DiSE.
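
The regeneration-based confidence described in the abstract above can be illustrated with a small sketch. It assumes the model has already produced, for each position, a probability distribution over the vocabulary conditioned on the full generated sequence; the aggregation used here (mean log-probability) and the function name are assumptions, since the abstract does not give the exact formula.

```python
import math


def regeneration_confidence(token_ids: list, probs: list) -> float:
    """Mean log-probability of regenerating each generated token.

    probs[i] is an assumed per-position distribution over the vocabulary,
    computed by the model with the full generated sequence as context.
    """
    if not token_ids or len(token_ids) != len(probs):
        raise ValueError("token_ids and probs must be non-empty and the same length")
    # Clamp to avoid log(0) when a token is assigned zero probability.
    log_probs = [math.log(max(dist[tok], 1e-12)) for tok, dist in zip(token_ids, probs)]
    return sum(log_probs) / len(log_probs)


# Toy two-token sequence over a two-word vocabulary.
confidence = regeneration_confidence([0, 1], [[0.5, 0.5], [0.25, 0.75]])
```

Because all positions are scored from one set of distributions, a score of this kind needs a single forward pass, which is consistent with the efficiency claim in the abstract.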
摘要：扩散大语言模型（dLLM）最近因其增强多样性、可控性和并行性的能力而引起了极大的关注。然而，它们的非顺序、双向屏蔽生成使得质量评估变得困难，强调了有效自我评估的必要性。在这项工作中，我们提出了 DiSE，一种简单而有效的 dLLM 自我评估置信度量化方法。 DiSE 通过计算在给定完整上下文的情况下在整个生成序列中重新生成标记的概率来量化置信度。该方法通过利用令牌再生概率来实现更高效、更可靠的质量评估，促进可能性估计和鲁棒的不确定性量化。在 DiSE 的基础上，我们进一步引入了一个灵活长度的生成框架，该框架根据模型对其自身输出的自我评估来自适应控制序列长度。我们从 dLLM 泛化的角度分析和验证了 DiSE 的可行性，并实证证明 DiSE 与语义连贯性和答案准确性均呈正相关。关于可能性评估、不确定性量化和灵活长度生成的大量实验进一步证实了所提出的 DiSE 的有效性。
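
The regeneration-probability idea in the abstract above can be illustrated with a minimal sketch. It assumes a hypothetical model forward pass has already produced, for each position, a probability distribution over the vocabulary given the full generated sequence. The function name `regeneration_confidence` and the choice of mean log-probability as the aggregate are assumptions for illustration, not details confirmed by the abstract.

```python
import math

def regeneration_confidence(token_ids, position_probs):
    """Mean log-probability of regenerating each generated token.

    token_ids: list[int], the generated sequence.
    position_probs: list[list[float]]; position_probs[i][v] is the
        (hypothetical) model probability of vocabulary item v at position i
        when the model sees the full generated sequence as context.
    Averaging log-probabilities is an assumption of this sketch.
    """
    if not token_ids or len(token_ids) != len(position_probs):
        raise ValueError("need one probability row per generated token")
    total = 0.0
    for tok, probs in zip(token_ids, position_probs):
        # Floor the probability to avoid log(0) on degenerate rows.
        total += math.log(max(probs[tok], 1e-12))
    return total / len(token_ids)
```

A higher (less negative) score means the model would more readily reproduce its own output, which is the kind of signal the abstract reports as correlating with answer accuracy.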

Title: From Solver to Tutor: Evaluating the Pedagogical Intelligence of LLMs with KMP-Bench

Authors: Weikang Shi, Houxing Ren, Junting Pan, Aojun Zhou, Ke Wang, Zimu Lu, Yunqiao Yang, Yuxuan Hu, Linda Wei, Mingjie Zhan, Hongsheng Li
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.02775
Pdf URL: https://arxiv.org/pdf/2603.02775
Copy Paste: [[2603.02775]] From Solver to Tutor: Evaluating the Pedagogical Intelligence of LLMs with KMP-Bench(https://arxiv.org/abs/2603.02775)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) show significant potential in AI mathematical tutoring, yet current evaluations often rely on simplistic metrics or narrow pedagogical scenarios, failing to assess comprehensive, multi-turn teaching effectiveness. In this paper, we introduce KMP-Bench, a comprehensive K-8 Mathematical Pedagogical Benchmark designed to assess LLMs from two complementary perspectives. The first module, KMP-Dialogue, evaluates holistic pedagogical capabilities against six core principles (e.g., Challenge, Explanation, Feedback), leveraging a novel multi-turn dialogue dataset constructed by weaving together diverse pedagogical components. The second module, KMP-Skills, provides a granular assessment of foundational tutoring abilities, including multi-turn problem-solving, error detection and correction, and problem generation. Our evaluations on KMP-Bench reveal a key disparity: while leading LLMs excel at tasks with verifiable solutions, they struggle with the nuanced application of pedagogical principles. Additionally, we present KMP-Pile, a large-scale (150K) dialogue dataset. Models fine-tuned on KMP-Pile show substantial improvement on KMP-Bench, underscoring the value of pedagogically-rich training data for developing more effective AI math tutors.
摘要：大型语言模型（LLM）在人工智能数学辅导方面显示出巨大潜力，但目前的评估往往依赖于简单化的指标或狭隘的教学场景，无法评估全面、多轮的教学效果。在本文中，我们介绍了 KMP-Bench，这是一个综合性的 K-8 数学教学基准，旨在从两个互补的角度评估法学硕士。第一个模块“KMP-对话”根据六个核心原则（例如，挑战、解释、反馈）评估整体教学能力，利用通过将不同教学组件编织在一起构建的新颖的多轮对话数据集。第二个模块“KMP-技能”提供对基础辅导能力的精细评估，包括多轮问题解决、错误检测和纠正以及问题生成。我们对 KMP-Bench 的评估揭示了一个关键的差异：虽然领先的法学硕士擅长解决具有可验证解决方案的任务，但他们在教学原则的细致入微的应用方面遇到了困难。此外，我们还推出了 KMP-Pile，一个大规模（150K）对话数据集。在 KMP-Pile 上微调的模型在 KMP-Bench 上显示出显着改进，强调了丰富的教学数据训练数据对于开发更有效的 AI 数学导师的价值。

Title: OCR or Not? Rethinking Document Information Extraction in the MLLMs Era with Real-World Large-Scale Datasets

Authors: Jiyuan Shen, Peiyue Yuan, Atin Ghosh, Yifan Mai, Daniel Dahlmeier
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.02789
Pdf URL: https://arxiv.org/pdf/2603.02789
Copy Paste: [[2603.02789]] OCR or Not? Rethinking Document Information Extraction in the MLLMs Era with Real-World Large-Scale Datasets(https://arxiv.org/abs/2603.02789)
Keywords: language model, llm
Abstract: Multimodal Large Language Models (MLLMs) enhance the potential of natural language processing. However, their actual impact on document information extraction remains unclear. In particular, it is unclear whether an MLLM-only pipeline, while simpler, can truly match the performance of traditional OCR+MLLM setups. In this paper, we conduct a large-scale benchmarking study that evaluates various out-of-the-box MLLMs on business-document information extraction. To examine and explore failure modes, we propose an automated hierarchical error analysis framework that leverages large language models (LLMs) to diagnose error patterns systematically. Our findings suggest that OCR may not be necessary for powerful MLLMs, as image-only input can achieve performance comparable to OCR-enhanced approaches. Moreover, we demonstrate that carefully designed schemas, exemplars, and instructions can further enhance MLLM performance. We hope this work can offer practical guidance and valuable insight for advancing document information extraction.
摘要：多模态大语言模型 (MLLM) 增强了自然语言处理的潜力。然而，它们对文档信息提取的实际影响仍不清楚。特别是，目前尚不清楚纯 MLLM 管道（虽然更简单）是否能够真正与传统 OCR+MLLM 设置的性能相匹配。在本文中，我们进行了一项大规模基准测试研究，评估各种现成的 MLLM 在业务文档信息提取方面的性能。为了检查和探索故障模式，我们提出了一种自动分层错误分析框架，该框架利用大型语言模型（LLM）来系统地诊断错误模式。我们的研究结果表明，OCR 对于强大的 MLLM 来说可能不是必需的，因为仅图像输入可以实现与 OCR 增强方法相当的性能。此外，我们证明精心设计的模式、范例和指令可以进一步提高 MLLM 的性能。我们希望这项工作能够为推进文档信息提取提供实用的指导和有价值的见解。

Title: Faster, Cheaper, More Accurate: Specialised Knowledge Tracing Models Outperform LLMs

Authors: Prarthana Bhattacharyya, Joshua Mitton, Ralph Abboud, Simon Woodhead
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.02830
Pdf URL: https://arxiv.org/pdf/2603.02830
Copy Paste: [[2603.02830]] Faster, Cheaper, More Accurate: Specialised Knowledge Tracing Models Outperform LLMs(https://arxiv.org/abs/2603.02830)
Keywords: language model, llm
Abstract: Predicting future student responses to questions is particularly valuable for educational learning platforms, where it enables effective interventions. A key approach has been the use of knowledge tracing (KT) models. These are small, domain-specific, temporal models trained on student question-response data. KT models are optimised for high accuracy on specific educational domains and have fast inference and scalable deployments. The rise of Large Language Models (LLMs) motivates us to ask the following questions: (1) How well can LLMs perform at predicting students' future responses to questions? (2) Are LLMs scalable for this domain? (3) How do LLMs compare to KT models on this domain-specific task? In this paper, we compare multiple LLMs and KT models across predictive performance, deployment cost, and inference speed to answer the above questions. We show that KT models outperform LLMs with respect to accuracy and F1 scores on this domain-specific task. Further, we demonstrate that LLMs are orders of magnitude slower than KT models and cost orders of magnitude more to deploy. This highlights the importance of domain-specific models for education prediction tasks and the fact that current closed-source LLMs should not be used as a universal solution for all tasks.
摘要：预测未来学生对问题的反应对于教育学习平台特别有价值，因为它可以实现有效的干预。实现这一目标的关键方法之一是使用知识追踪（KT）模型。这些是小型的、特定领域的时间模型，根据学生的问题回答数据进行训练。 KT 模型针对特定教育领域的高精度进行了优化，并具有快速推理和可扩展部署。大型语言模型（LLM）的兴起促使我们提出以下问题：（1）LLM 在预测学生未来对问题的回答方面表现如何？ (2) 法学硕士是否适合该领域？ (3) 在这个特定领域的任务上，LLM 与 KT 模型相比如何？在本文中，我们比较了多个 LLM 和 KT 模型的预测性能、部署成本和推理速度，以回答上述问题。我们表明，KT 模型在该特定领域任务的准确性和 F1 分数方面优于 LLM。此外，我们证明 LLM 比 KT 模型慢几个数量级，而且部署成本也高几个数量级。这凸显了教育预测任务的特定领域模型的重要性，以及当前闭源法学硕士不应用作所有任务的通用解决方案的事实。

Title: Nodes Are Early, Edges Are Late: Probing Diagram Representations in Large Vision-Language Models

Authors: Haruto Yoshida, Keito Kudo, Yoichi Aoki, Ryota Tanaka, Itsumi Saito, Keisuke Sakaguchi, Kentaro Inui
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2603.02865
Pdf URL: https://arxiv.org/pdf/2603.02865
Copy Paste: [[2603.02865]] Nodes Are Early, Edges Are Late: Probing Diagram Representations in Large Vision-Language Models(https://arxiv.org/abs/2603.02865)
Keywords: language model
Abstract: Large vision-language models (LVLMs) demonstrate strong performance on diagram understanding benchmarks, yet they still struggle with understanding relationships between elements, particularly those represented by nodes and directed edges (e.g., arrows and lines). To investigate the underlying causes of this limitation, we probe the internal representation of LVLMs using a carefully constructed synthetic diagram dataset based on directed graphs. Our probing experiments reveal that edge information is not linearly separable in the vision encoder and becomes linearly encoded only in the text tokens in the language model. In contrast, node information and global structural features are already linearly encoded in individual hidden states of the vision encoder. These findings suggest that the stage at which linearly separable representations are formed varies depending on the type of visual information. In particular, the delayed emergence of edge representations may help explain why LVLMs struggle with relational understanding, such as interpreting edge directions, which require more abstract, compositionally integrated processes.
摘要：大型视觉语言模型（LVLM）在图表理解基准上表现出强大的性能，但它们仍然难以理解元素之间的关系，特别是那些由节点和有向边（例如箭头和线）表示的元素。为了调查这一限制的根本原因，我们使用基于有向图的精心构建的合成图数据集来探究 LVLM 的内部表示。我们的探测实验表明，边缘信息在视觉编码器中不是线性可分离的，并且仅在语言模型中的文本标记中进行线性编码。相比之下，节点信息和全局结构特征已经在视觉编码器的各个隐藏状态中线性编码。这些发现表明，形成线性可分离表示的阶段根据视觉信息的类型而变化。特别是，边缘表示的延迟出现可能有助于解释为什么 LVLM 在关系理解方面遇到困难，例如解释边缘方向，这需要更抽象、组合集成的过程。

Title: LaTeX Compilation: Challenges in the Era of LLMs

Authors: Tianyou Liu, Ziqiang Li, Yansong Li, Xurui Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.02873
Pdf URL: https://arxiv.org/pdf/2603.02873
Copy Paste: [[2603.02873]] LaTeX Compilation: Challenges in the Era of LLMs(https://arxiv.org/abs/2603.02873)
Keywords: language model, llm
Abstract: As large language models (LLMs) increasingly assist scientific writing, the limitations and significant token cost of TeX become increasingly visible. This paper analyzes TeX's fundamental defects in compilation and user experience design to illustrate its limitations in compilation efficiency, generated semantics, error localization, and tool ecosystem in the era of LLMs. As an alternative, Mogan STEM, a WYSIWYG structured editor, is introduced. Mogan outperforms TeX in the above aspects through its efficient data structure, fast rendering, and on-demand plugin loading. Extensive experiments are conducted to verify the benefits in compilation/rendering time and performance in LLM tasks. Moreover, we show that due to Mogan's lower information entropy, it is more efficient to use .tmu (the document format of Mogan) to fine-tune LLMs than TeX. Therefore, we launch an appeal for larger experiments on LLM training using the .tmu format.
摘要：随着大型语言模型 (LLM) 越来越多地协助科学写作，TeX 的局限性和巨大的代币成本变得越来越明显。本文分析了 TeX 在编译和用户体验设计方面的根本缺陷，以说明其在法学硕士时代在编译效率、生成语义、错误定位和工具生态系统方面的局限性。作为替代方案，Mogan STEM 是一种所见即所得的结构化编辑器。 Mogan在上述方面优于TeX的是其高效的数据结构、快速的渲染以及按需插件加载。进行了大量的实验来验证 LLM 任务中编译/渲染时间和性能的优势。更重要的是，我们表明，由于 Mogan 的信息熵较低，使用 .tmu（Mogan 的文档格式）来微调 LLM 比 TeX 更有效。因此，我们呼吁使用 .tmu 格式进行更大规模的 LLM 培训实验。

Title: Eval4Sim: An Evaluation Framework for Persona Simulation

Authors: Eliseo Bao, Anxo Perez, Xi Wang, Javier Parapar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.02876
Pdf URL: https://arxiv.org/pdf/2603.02876
Copy Paste: [[2603.02876]] Eval4Sim: An Evaluation Framework for Persona Simulation(https://arxiv.org/abs/2603.02876)
Keywords: language model, llm, chat
Abstract: Large Language Model (LLM) personas with explicit specifications of attributes, background, and behavioural tendencies are increasingly used to simulate human conversations for tasks such as user modeling, social reasoning, and behavioural analysis. Ensuring that persona-grounded simulations faithfully reflect human conversational behaviour is therefore critical. However, current evaluation practices largely rely on LLM-as-a-judge approaches, offering limited grounding in observable human behavior and producing opaque scalar scores. We address this gap by proposing Eval4Sim, an evaluation framework that measures how closely simulated conversations align with human conversational patterns across three complementary dimensions. Adherence captures how effectively persona backgrounds are implicitly encoded in generated utterances, assessed via dense retrieval with speaker-aware representations. Consistency evaluates whether a persona maintains a distinguishable identity across conversations, computed through authorship verification. Naturalness reflects whether conversations exhibit human-like flow rather than overly rigid or optimized structure, quantified through distributions derived from dialogue-focused Natural Language Inference. Unlike absolute or optimization-oriented metrics, Eval4Sim uses a human conversational corpus (i.e., PersonaChat) as a reference baseline and penalizes deviations in both directions, distinguishing insufficient persona encoding from over-optimized, unnatural behaviour. Although demonstrated on PersonaChat, the applicability of Eval4Sim extends to any conversational corpus containing speaker-level annotations.
摘要：具有明确属性、背景和行为倾向规范的大型语言模型 (LLM) 角色越来越多地用于模拟人类对话，以完成用户建模、社交推理和行为分析等任务。因此，确保基于角色的模拟忠实地反映人类对话行为至关重要。然而，当前的评估实践在很大程度上依赖于法学硕士作为法官的方法，在可观察的人类行为方面提供的基础有限，并产生不透明的标量分数。我们通过提出 Eval4Sim 来解决这一差距，这是一个评估框架，用于衡量模拟对话在三个互补维度上与人类对话模式的吻合程度。依从性捕获了角色背景如何有效地隐式编码在生成的话语中，并通过使用说话者感知表示的密集检索进行评估。一致性评估角色是否在对话中保持可区分的身份，通过作者身份验证计算得出。自然度反映了对话是否表现出类似人类的流程，而不是过于僵化或优化的结构，通过以对话为中心的自然语言推理得出的分布进行量化。与绝对或面向优化的指标不同，Eval4Sim 使用人类对话语料库（即 PersonaChat）作为参考基线，并惩罚两个方向的偏差，区分角色编码不足和过度优化、不自然的行为。尽管在 PersonaChat 上得到了演示，但 Eval4Sim 的适用性可以扩展到任何包含说话者级别注释的对话语料库。

Title: Learning to Generate and Extract: A Multi-Agent Collaboration Framework For Zero-shot Document-level Event Arguments Extraction

Authors: Guangjun Zhang, Hu Zhang, Yazhou Han, Yue Fan, Yuhang Shao, Ru Li, Hongye Tan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.02909
Pdf URL: https://arxiv.org/pdf/2603.02909
Copy Paste: [[2603.02909]] Learning to Generate and Extract: A Multi-Agent Collaboration Framework For Zero-shot Document-level Event Arguments Extraction(https://arxiv.org/abs/2603.02909)
Keywords: llm, prompt, agent
Abstract: Document-level event argument extraction (DEAE) is essential for knowledge acquisition, aiming to extract participants of events from documents. In the zero-shot setting, existing methods employ LLMs to generate synthetic data to address the challenge posed by the scarcity of annotated data. However, relying solely on event-type-only prompts makes it difficult for the generated content to accurately capture the contextual and structural relationships of unseen events. Moreover, ensuring the reliability and usability of synthetic data remains a significant challenge due to the absence of quality evaluation mechanisms. To this end, we introduce a multi-agent collaboration framework for zero-shot document-level event argument extraction (ZS-DEAE), which simulates the human collaborative cognitive process of "Propose-Evaluate-Revise." Specifically, the framework comprises a generation agent and an evaluation agent. The generation agent synthesizes data for unseen events by leveraging knowledge from seen events, while the evaluation agent extracts arguments from the synthetic data and assesses their semantic consistency with the context. The evaluation results are subsequently converted into reward signals, with event structure constraints incorporated into the reward design to enable iterative optimization of both agents via reinforcement learning. On three zero-shot scenarios constructed from the RAMS and WikiEvents datasets, our method achieves improvements in both data generation quality and argument extraction performance, while the generated data also effectively enhances the zero-shot performance of other DEAE models.
摘要：文档级事件参数提取（DEAE）对于知识获取至关重要，旨在从文档中提取事件的参与者。在零样本设置中，现有方法利用 LLM 生成合成数据来解决注释数据稀缺带来的挑战。然而，仅仅依靠仅事件类型的提示使得生成的内容很难准确捕获未见过的事件的上下文和结构关系。此外，由于缺乏质量评估机制，确保合成数据的可靠性和可用性仍然是一个重大挑战。为此，我们引入了一种用于零样本文档级事件参数提取的多智能体协作框架（ZS-DEAE），它模拟了人类“提议-评估-修改”的协作认知过程。具体地，该框架包括生成代理和评估代理。生成代理通过利用已见事件中的知识来合成未见事件的数据，而评估代理从合成数据中提取参数并评估其与上下文的语义一致性。评估结果随后被转换为奖励信号，并将事件结构约束纳入奖励设计中，通过强化学习实现两个代理的迭代优化。在从 RAMS 和 WikiEvents 数据集构建的三个零样本场景中，我们的方法实现了数据生成质量和参数提取性能的改进，同时生成的数据也有效增强了其他 DEAE 模型的零样本性能。

Title: ACE-Merging: Data-Free Model Merging with Adaptive Covariance Estimation

Authors: Bo Xu, Haotian Wu, Hehai Lin, Weiquan Huang, Beier Zhu, Yao Shu, Chengwei Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.02945
Pdf URL: https://arxiv.org/pdf/2603.02945
Copy Paste: [[2603.02945]] ACE-Merging: Data-Free Model Merging with Adaptive Covariance Estimation(https://arxiv.org/abs/2603.02945)
Keywords: gpt
Abstract: Model merging aims to combine multiple task-specific expert models into a single model while preserving generalization across diverse tasks. However, interference among experts, especially when they are trained on different objectives, often leads to significant performance degradation. Despite recent progress, resolving this interference without data access, retraining, or architectural modification remains a fundamental challenge. This paper provides a theoretical analysis demonstrating that the input covariance of each task, which is a key factor for optimal merging, can be implicitly estimated from the parameter differences of its fine-tuned model, even in a fully data-free setting. Building on this insight, we introduce ACE-Merging, an Adaptive Covariance Estimation framework that effectively mitigates inter-task interference. Our approach features a principled, closed-form solution that contrasts with prior iterative or heuristic methods. Extensive experiments on both vision and language benchmarks demonstrate that ACE-Merging sets a new state-of-the-art among data-free methods. It consistently outperforms existing baselines; for example, ACE-Merging achieves an average absolute improvement of 4% over the previous methods across seven tasks on GPT-2. Owing to its efficient closed-form formulation, ACE-Merging delivers superior performance with a modest computational cost, providing a practical and theoretically grounded solution for model merging.
摘要：模型合并旨在将多个特定于任务的专家模型组合成一个模型，同时保留跨不同任务的泛化能力。然而，专家之间的干扰，尤其是当他们接受针对不同目标的培训时，通常会导致性能显着下降。尽管最近取得了进展，但在不访问数据、重新训练或架构修改的情况下解决这种干扰仍然是一个根本挑战。本文提供的理论分析表明，即使在完全无数据的设置中，每个任务的输入协方差（这是最佳合并的关键因素）也可以根据其微调模型的参数差异隐式估计。基于这一见解，我们引入了 ACE-Merging，一种自适应协方差估计框架，可以有效减轻任务间干扰。我们的方法具有原则性的、封闭式的解决方案，与之前的迭代或启发式方法形成鲜明对比。对视觉和语言基准的大量实验表明，ACE-Merging 在无数据方法中树立了新的最先进水平。它始终优于现有基线；例如，ACE-Merging 在 GPT-2 上的七个任务中比之前的方法平均绝对提高了 4%。由于其高效的封闭式公式，ACE-Merging 以适度的计算成本提供了卓越的性能，为模型合并提供了实用且有理论依据的解决方案。

Title: MaBERT: A Padding Safe Interleaved Transformer Mamba Hybrid Encoder for Efficient Extended Context Masked Language Modeling

Authors: Jinwoong Kim, Sangjin Park
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.03001
Pdf URL: https://arxiv.org/pdf/2603.03001
Copy Paste: [[2603.03001]] MaBERT: A Padding Safe Interleaved Transformer Mamba Hybrid Encoder for Efficient Extended Context Masked Language Modeling(https://arxiv.org/abs/2603.03001)
Keywords: language model, long context
Abstract: Self-attention encoders such as Bidirectional Encoder Representations from Transformers (BERT) scale quadratically with sequence length, making long-context modeling expensive. Linear-time state space models, such as Mamba, are efficient; however, they show limitations in modeling global interactions and can suffer from padding-induced state contamination. We propose MaBERT, a hybrid encoder that interleaves Transformer layers for global dependency modeling with Mamba layers for linear-time state updates. This design alternates global contextual integration with fast state accumulation, enabling efficient training and inference on long inputs. To stabilize variable-length batching, we introduce padding-safe masking, which blocks state propagation through padded positions, and mask-aware attention pooling, which aggregates information only from valid tokens. On GLUE, MaBERT achieves the best mean score on five of the eight tasks, with strong performance on the CoLA and sentence-pair inference tasks. When extending the context from 512 to 4,096 tokens, MaBERT reduces training time and inference latency by 2.36x and 2.43x, respectively, relative to the average of encoder baselines, demonstrating a practical, long-context-efficient encoder.
摘要：自注意力编码器（例如来自 Transformers 的双向编码器表示（BERT））与序列长度成二次方缩放，使得长上下文建模变得昂贵。线性时间状态空间模型（例如 Mamba）效率很高；然而，它们在模拟全局交互方面表现出局限性，并且可能会受到填充引起的状态污染的影响。我们提出了 MaBERT，一种混合编码器，它将用于全局依赖建模的 Transformer 层与用于线性时间状态更新的 Mamba 层交错。这种设计将全局上下文集成与快速状态积累交替使用，从而实现对长输入的高效训练和推理。为了稳定可变长度批处理，我们引入了 padding-safe masking（它阻止状态通过填充位置传播）和 mask-aware attention pooling（它仅聚合来自有效令牌的信息）。在 GLUE 上，MaBERT 在 8 个任务中的 5 个上取得了最佳平均分数，在 CoLA 和句对推理任务上表现强劲。当将上下文从 512 个标记扩展到 4,096 个标记时，相对于编码器基线的平均值，MaBERT 的训练时间和推理延迟分别减少了 2.36 倍和 2.43 倍，展示了一种实用的长上下文高效编码器。
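
The mask-aware attention pooling mentioned in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions: the per-token `scores` are taken as given (in a real encoder they would come from a learned scoring layer, which is not shown), and the function name `mask_aware_attention_pool` is hypothetical. The only point demonstrated is that padded positions receive exactly zero weight.

```python
import math

def mask_aware_attention_pool(hidden, scores, mask):
    """Softmax-weighted average of token vectors over valid positions only.

    hidden: list[list[float]], one vector per token position.
    scores: list[float], one unnormalized attention score per position
        (assumed to come from some scoring layer, not shown here).
    mask:   list[int], 1 for a real token and 0 for padding.
    """
    valid = [i for i, m in enumerate(mask) if m]
    if not valid:
        raise ValueError("sequence has no valid tokens")
    # Softmax restricted to valid positions; subtract the max for stability.
    top = max(scores[i] for i in valid)
    weights = {i: math.exp(scores[i] - top) for i in valid}
    norm = sum(weights.values())
    pooled = [0.0] * len(hidden[0])
    for i in valid:
        w = weights[i] / norm
        for d, value in enumerate(hidden[i]):
            pooled[d] += w * value
    return pooled
```

Because padded positions never enter the softmax, a padded token with a large score or a large hidden vector cannot influence the pooled representation.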

Title: TrustMH-Bench: A Comprehensive Benchmark for Evaluating the Trustworthiness of Large Language Models in Mental Health

Authors: Zixin Xiong, Ziteng Wang, Haotian Fan, Xinjie Zhang, Wenxuan Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.03047
Pdf URL: https://arxiv.org/pdf/2603.03047
Copy Paste: [[2603.03047]] TrustMH-Bench: A Comprehensive Benchmark for Evaluating the Trustworthiness of Large Language Models in Mental Health(https://arxiv.org/abs/2603.03047)
Keywords: language model, gpt, llm
Abstract: While Large Language Models (LLMs) demonstrate significant potential in providing accessible mental health support, their practical deployment raises critical trustworthiness concerns due to the domain's high-stakes and safety-sensitive nature. Existing evaluation paradigms for general-purpose LLMs fail to capture mental health-specific requirements, highlighting an urgent need to prioritize and enhance their trustworthiness. To address this, we propose TrustMH-Bench, a holistic framework designed to systematically quantify the trustworthiness of mental health LLMs. By establishing a deep mapping from domain-specific norms to quantitative evaluation metrics, TrustMH-Bench evaluates models across eight core pillars: Reliability, Crisis Identification and Escalation, Safety, Fairness, Privacy, Robustness, Anti-sycophancy, and Ethics. We conduct extensive experiments across six general-purpose LLMs and six specialized mental health models. Experimental results indicate that the evaluated models underperform across various trustworthiness dimensions in mental health scenarios, revealing significant deficiencies. Notably, even generally powerful models (e.g., GPT-5.1) fail to maintain consistently high performance across all dimensions. Consequently, systematically improving the trustworthiness of LLMs has become a critical task. Our data and code are released.
摘要：虽然大型语言模型 (LLM) 在提供可及的心理健康支持方面表现出巨大潜力，但由于该领域的高风险和安全敏感性质，其实际部署引起了严重的可信度问题。通用法学硕士现有的评估范式未能满足心理健康的具体要求，这凸显了优先考虑和增强其可信度的迫切需要。为了解决这个问题，我们提出了 TrustMH-Bench，这是一个整体框架，旨在系统地量化心理健康法学硕士的可信度。通过建立从特定领域规范到定量评估指标的深度映射，TrustMH-Bench 评估了八个核心支柱的模型：可靠性、危机识别和升级、安全、公平、隐私、稳健性、反阿谀奉承和道德。我们对六个通用法学硕士和六个专门的心理健康模型进行了广泛的实验。实验结果表明，评估的模型在心理健康场景中的各个可信度维度上表现不佳，揭示了重大缺陷。值得注意的是，即使是一般强大的模型（例如 GPT-5.1）也无法在所有维度上保持一致的高性能。因此，系统地提高LLM的可信度已成为一项关键任务。我们的数据和代码已发布。

Title: PrivMedChat: End-to-End Differentially Private RLHF for Medical Dialogue Systems

Authors: Sudip Bhujel
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.03054
Pdf URL: https://arxiv.org/pdf/2603.03054
Copy Paste: [[2603.03054]] PrivMedChat: End-to-End Differentially Private RLHF for Medical Dialogue Systems(https://arxiv.org/abs/2603.03054)
Keywords: language model, llm, hallucination, prompt, chat
Abstract: Large language models are increasingly used for patient-facing medical assistance and clinical decision support, but adapting them to clinical dialogue often requires supervision derived from doctor-patient conversations that may contain sensitive information. Conventional supervised fine-tuning and reinforcement learning from human feedback (RLHF) can amplify memorization risks, enabling empirical membership inference and extraction of rare training-set content. We present PrivMedChat, an end-to-end framework for differentially private RLHF (DP-RLHF) for medical dialogue. Our design enforces differential privacy at every training stage that directly accesses dialogue-derived supervision: (i) Differentially Private Stochastic Gradient Descent (DP-SGD) for medical SFT and (ii) DP-SGD for reward model learning from preference pairs. To limit additional privacy expenditure during alignment, we apply DP-SGD to the PPO actor and critic when operating on dialogue-derived prompts, while the reward model remains fixed after DP training. We also introduce an annotation-free preference construction strategy that pairs physician responses with filtered non-expert generations to produce scalable preference data without clinician labeling. Experiments on medical dialogue benchmarks show that PrivMedChat at $\varepsilon=7$ achieves the highest ROUGE-L of 0.156 among all DP models, reduces clinical hallucinations to 1.4% and harmful advice to 0.4%, and obtains the highest overall score of 2.86 in a 3-model LLM-jury evaluation, while producing membership-inference signals that are near chance (AUC 0.510-0.555). We open-source our code at this https URL.
摘要：大型语言模型越来越多地用于面向患者的医疗援助和临床决策支持，但使其适应临床对话通常需要来自可能包含敏感信息的医患对话的监督。传统的监督微调和来自人类反馈的强化学习（RLHF）可以放大记忆风险，从而实现经验成员推理和稀有训练集内容的提取。我们推出 PrivMedChat，这是一种用于医疗对话的差分隐私 RLHF (DP-RLHF) 的端到端框架。我们的设计在直接访问对话衍生监督的每个训练阶段强制实施差分隐私：（i）用于医疗 SFT 的差分隐私随机梯度下降（DP-SGD）和（ii）用于从偏好对学习奖励模型的 DP-SGD。为了限制对齐期间的额外隐私支出，我们在对对话衍生的提示进行操作时将 DP-SGD 应用于 PPO 参与者和评论家，而奖励模型在 DP 训练后保持固定。我们还引入了一种无注释的偏好构建策略，将医生的反应与过滤后的非专家生成结果配对，以生成可扩展的偏好数据，而无需临床医生标记。医学对话基准实验表明，$\varepsilon=7$ 的 PrivMedChat 在所有 DP 模型中实现了最高的 ROUGE-L 0.156，将临床幻觉降低至 1.4%，将有害建议降低至 0.4%，并在 3 模型 LLM 陪审团评估中获得最高总分 2.86，同时产生接近随机水平的成员推理信号（AUC 0.510-0.555）。我们在此 https URL 开源我们的代码。
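
For readers unfamiliar with DP-SGD, the update named in the abstract above follows a standard recipe: clip each example's gradient to a fixed L2 norm, sum the clipped gradients, add Gaussian noise scaled to the clipping norm, then average and step. The sketch below shows that generic recipe in plain Python. It is not the PrivMedChat implementation; the function name `dp_sgd_step` and all parameter names are illustrative assumptions, and privacy accounting (turning the noise multiplier into an epsilon) is deliberately left out.

```python
import math
import random

def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One generic DP-SGD update on a flat parameter vector.

    params: list[float], current parameters.
    per_example_grads: list[list[float]], one gradient per training example.
    clip_norm: maximum L2 norm allowed for each example's gradient.
    noise_multiplier: Gaussian noise std is noise_multiplier * clip_norm.
    rng: a random.Random instance, passed in for reproducibility.
    """
    dim = len(params)
    summed = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale the gradient down only if it exceeds the clipping norm.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for d in range(dim):
            summed[d] += grad[d] * scale
    batch = len(per_example_grads)
    sigma = noise_multiplier * clip_norm
    noisy_mean = [(summed[d] + rng.gauss(0.0, sigma)) / batch for d in range(dim)]
    return [p - lr * g for p, g in zip(params, noisy_mean)]
```

Clipping bounds how much any single dialogue can move the parameters, and the noise masks what remains of that contribution; both are needed for the differential privacy guarantee.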

Title: TAO-Attack: Toward Advanced Optimization-Based Jailbreak Attacks for Large Language Models

Authors: Zhi Xu, Jiaqi Li, Xiaotong Zhang, Hong Yu, Han Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.03081
Pdf URL: https://arxiv.org/pdf/2603.03081
Copy Paste: [[2603.03081]] TAO-Attack: Toward Advanced Optimization-Based Jailbreak Attacks for Large Language Models(https://arxiv.org/abs/2603.03081)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have achieved remarkable success across diverse applications but remain vulnerable to jailbreak attacks, where attackers craft prompts that bypass safety alignment and elicit unsafe responses. Among existing approaches, optimization-based attacks have shown strong effectiveness, yet current methods often suffer from frequent refusals, pseudo-harmful outputs, and inefficient token-level updates. In this work, we propose TAO-Attack, a new optimization-based jailbreak method. TAO-Attack employs a two-stage loss function: the first stage suppresses refusals to ensure the model continues harmful prefixes, while the second stage penalizes pseudo-harmful outputs and encourages the model toward more harmful completions. In addition, we design a direction-priority token optimization (DPTO) strategy that improves efficiency by aligning candidates with the gradient direction before considering update magnitude. Extensive experiments on multiple LLMs demonstrate that TAO-Attack consistently outperforms state-of-the-art methods, achieving higher attack success rates and even reaching 100% in certain scenarios.
摘要：大型语言模型 (LLM) 在各种应用程序中取得了显着的成功，但仍然容易受到越狱攻击，攻击者会制作绕过安全对齐的提示并引发不安全的响应。在现有的方法中，基于优化的攻击显示出很强的有效性，但当前的方法经常遭受频繁拒绝、伪有害输出和低效的令牌级更新的困扰。在这项工作中，我们提出了 TAO-Attack，一种新的基于优化的越狱方法。 TAO-Attack 采用两阶段损失函数：第一阶段抑制拒绝以确保模型继续有害前缀，而第二阶段惩罚伪有害输出并鼓励模型完成更有害的完成。此外，我们设计了一种方向优先令牌优化（DPTO）策略，该策略通过在考虑更新幅度之前将候选者与梯度方向对齐来提高效率。对多个 LLM 的大量实验表明，TAO-Attack 始终优于最先进的方法，实现了更高的攻击成功率，在某些情况下甚至达到 100%。

Title: Compact Prompting in Instruction-tuned LLMs for Joint Argumentative Component Detection

Authors: Sofiane Elguendouze, Erwan Hain, Elena Cabrio, Serena Villata
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.03095
Pdf URL: https://arxiv.org/pdf/2603.03095
Copy Paste: [[2603.03095]] Compact Prompting in Instruction-tuned LLMs for Joint Argumentative Component Detection(https://arxiv.org/abs/2603.03095)
Keywords: language model, llm, prompt
Abstract: Argumentative component detection (ACD) is a core subtask of Argument(ation) Mining (AM) and one of its most challenging aspects, as it requires jointly delimiting argumentative spans and classifying them into components such as claims and premises. While research on this subtask remains relatively limited compared to other AM tasks, most existing approaches formulate it as a simplified sequence labeling problem, component classification, or a pipeline of component segmentation followed by classification. In this paper, we propose a novel approach based on instruction-tuned Large Language Models (LLMs) using compact instruction-based prompts, and reframe ACD as a language generation task, enabling arguments to be identified directly from plain text without relying on pre-segmented components. Experiments on standard benchmarks show that our approach achieves higher performance compared to state-of-the-art systems. To the best of our knowledge, this is one of the first attempts to fully model ACD as a generative task, highlighting the potential of instruction tuning for complex AM problems.
摘要：论证成分检测（ACD）是论证挖掘（AM）的核心子任务，也是其最具挑战性的方面之一，因为它需要共同界定论证范围并将其分类为主张和前提等组件。虽然与其他 AM 任务相比，对该子任务的研究仍然相对有限，但大多数现有方法将其表述为简化的序列标记问题、组件分类或组件分割和分类的流程。在本文中，我们提出了一种基于指令调整的大型语言模型（LLM）的新颖方法，使用紧凑的基于指令的提示，并将 ACD 重新构建为语言生成任务，从而能够直接从纯文本中识别参数，而无需依赖于预先分段的组件。标准基准测试的实验表明，与最先进的系统相比，我们的方法实现了更高的性能。据我们所知，这是将 ACD 完全建模为生成任务的首次尝试之一，凸显了针对复杂 AM 问题进行指令调整的潜力。

Title: Evaluating Performance Drift from Model Switching in Multi-Turn LLM Systems

Authors: Raad Khraishi, Iman Zafar, Katie Myles, Greig A Cowan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.03111
Pdf URL: https://arxiv.org/pdf/2603.03111
Copy Paste: [[2603.03111]] Evaluating Performance Drift from Model Switching in Multi-Turn LLM Systems(https://arxiv.org/abs/2603.03111)
Keywords: gpt, llm
Abstract: Deployed multi-turn LLM systems routinely switch models mid-interaction due to upgrades, cross-provider routing, and fallbacks. Such handoffs create a context mismatch: the model generating later turns must condition on a dialogue prefix authored by a different model, potentially inducing silent performance drift. We introduce a switch-matrix benchmark that measures this effect by running a prefix model for early turns and a suffix model for the final turn, and comparing against the no-switch baseline using paired episode-level bootstrap confidence intervals. Across CoQA conversational QA and Multi-IF benchmarks, even a single-turn handoff yields prevalent and statistically significant, directional effects and may swing outcomes by -8 to +13 percentage points in Multi-IF strict success rate and +/- 4 absolute F1 on CoQA, comparable to the no-switch gap between common model tiers (e.g., GPT-5-nano vs GPT-5-mini). We further find systematic compatibility patterns: some suffix models degrade under nearly any non-self dialogue history, while others improve under nearly any foreign prefix. To enable compressed handoff risk monitoring, we decompose switch-induced drift into per-model prefix influence and suffix susceptibility terms, accounting for ~70% of variance across benchmarks. These results position handoff robustness as an operational reliability dimension that single-model benchmarks miss, motivating explicit monitoring and handoff-aware mitigation in multi-turn systems.
摘要：由于升级、跨提供商路由和回退，部署的多轮 LLM 系统通常会在交互过程中切换模型。这种切换会造成上下文不匹配：生成稍后轮次的模型必须以不同模型编写的对话前缀为条件，这可能会导致无声的性能漂移。我们引入了一个开关矩阵基准，通过运行早期回合的前缀模型和最终回合的后缀模型来测量这种效果，并使用配对的情节级引导置信区间与无开关基线进行比较。在 CoQA 会话 QA 和 Multi-IF 基准测试中，即使是单轮切换也会产生普遍且具有统计显着性的方向性效应，并且可能会使 Multi-IF 严格成功率的结果波动 -8 至 +13 个百分点，并且 CoQA 上的绝对 F1 +/- 4，这与常见模型层之间的无切换差距（例如 GPT-5-nano 与 GPT-5-mini）相当。我们进一步发现了系统的兼容性模式：一些后缀模型在几乎任何非自身对话历史下都会退化，而其他后缀模型在几乎任何外来前缀下都会得到改善。为了实现压缩切换风险监控，我们将开关引起的漂移分解为每个模型的前缀影响和后缀敏感性项，占基准之间约 70% 的方差。这些结果将切换鲁棒性定位为单一模型基准测试所忽略的操作可靠性维度，从而促进了多轮系统中的显式监控和切换感知缓解。
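
The paired episode-level bootstrap mentioned in the abstract above can be sketched as follows. This is a generic percentile bootstrap over per-episode score differences, written under the assumption that each episode has one score under the switched configuration and one under the no-switch baseline. The function name `paired_bootstrap_ci` and its defaults are illustrative and are not taken from the paper.

```python
import random

def paired_bootstrap_ci(switch_scores, baseline_scores,
                        n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference.

    switch_scores[i] and baseline_scores[i] must refer to the same episode.
    Returns (mean_difference, lower_bound, upper_bound).
    """
    if not switch_scores or len(switch_scores) != len(baseline_scores):
        raise ValueError("scores must be non-empty and paired per episode")
    diffs = [s - b for s, b in zip(switch_scores, baseline_scores)]
    n = len(diffs)
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample whole episodes with replacement, keeping pairs intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lower, upper
```

Resampling differences rather than the two score lists independently keeps episode difficulty matched between conditions; an interval that excludes zero would indicate a directional handoff effect.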

Title: UniSkill: A Dataset for Matching University Curricula to Professional Competencies

Authors: Nurlan Musazade, Joszef Mezei, Mike Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.03134
Pdf URL: https://arxiv.org/pdf/2603.03134
Copy Paste: [[2603.03134]] UniSkill: A Dataset for Matching University Curricula to Professional Competencies(https://arxiv.org/abs/2603.03134)
Keywords: language model
Abstract: Skill extraction and recommendation systems have been studied from recruiter, applicant, and education perspectives. While AI applications in job advertisements have received broad attention, deficiencies in the instructed skills side remain a challenge. In this work, we address the scarcity of publicly available datasets by releasing both manually annotated and synthetic datasets of skills from the European Skills, Competences, Qualifications and Occupations (ESCO) taxonomy and university course pairs and publishing corresponding annotation guidelines. Specifically, we match graduate-level university courses with skills from the Systems Analysts and Management and Organization Analyst ESCO occupation groups at two granularities: course title with a skill, and course sentence with a skill. We train language models on this dataset to serve as a baseline for retrieval and recommendation systems for course-to-skill and skill-to-course matching. We evaluate the models on a portion of the annotated data. Our BERT model achieves 87% F1-score, showing that course and skill matching is a feasible task.
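Summary: Skill extraction and recommendation systems have been studied from the perspectives of recruiters, applicants, and education. Although AI applications to job advertisements have received broad attention, deficiencies on the instructed-skills side remain a challenge. In this work, we address the scarcity of publicly available datasets by releasing manually annotated and synthetic datasets that pair skills from the European Skills, Competences, Qualifications and Occupations (ESCO) taxonomy with university courses, and by publishing the corresponding annotation guidelines. Specifically, we match graduate-level university courses with skills from the Systems Analysts and Management and Organization Analyst ESCO occupation groups at two granularities: course title with a skill, and course sentence with a skill. We train language models on this dataset to serve as baselines for retrieval and recommendation systems for course-to-skill and skill-to-course matching. We evaluate the models on a portion of the annotated data. Our BERT model achieves an F1-score of 87%, showing that course-skill matching is a feasible task.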

Title: APRES: An Agentic Paper Revision and Evaluation System

Authors: Bingchen Zhao, Jenny Zhang, Chenxi Whitehouse, Minqi Jiang, Michael Shvartsman, Abhishek Charnalia, Despoina Magka, Tatiana Shavrina, Derek Dunfield, Oisin Mac Aodha, Yoram Bachrach
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.03142
Pdf URL: https://arxiv.org/pdf/2603.03142
Copy Paste: [[2603.03142]] APRES: An Agentic Paper Revision and Evaluation System(https://arxiv.org/abs/2603.03142)
Keywords: language model, llm, agent
Abstract: Scientific discoveries must be communicated clearly to realize their full potential. Without effective communication, even the most groundbreaking findings risk being overlooked or misunderstood. The primary way scientists communicate their work and receive feedback from the community is through peer review. However, the current system often provides inconsistent feedback between reviewers, ultimately hindering the improvement of a manuscript and limiting its potential impact. In this paper, we introduce a novel method, APRES, powered by Large Language Models (LLMs), to update a scientific paper's text based on an evaluation rubric. Our automated method discovers a rubric that is highly predictive of future citation counts, and integrates it with APRES in an automated system that revises papers to enhance their quality and impact. Crucially, this objective should be met without altering the core scientific content. We demonstrate the success of APRES, which improves future citation prediction by 19.6% in mean averaged error over the next best baseline, and show that our paper revision process yields papers that are preferred over the originals by human expert evaluators 79% of the time. Our findings provide strong empirical support for using LLMs as a tool to help authors stress-test their manuscripts before submission. Ultimately, our work seeks to augment, not replace, the essential role of human expert reviewers, for it should be humans who discern which discoveries truly matter, guiding science toward advancing knowledge and enriching lives.
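Summary: Scientific discoveries must be communicated clearly to realize their full potential. Without effective communication, even the most groundbreaking findings risk being overlooked or misunderstood. The main way scientists communicate their work and receive feedback from the community is peer review. However, the current system often gives inconsistent feedback across reviewers, which ultimately hinders the improvement of a manuscript and limits its potential impact. In this paper, we introduce APRES, a new method powered by Large Language Models (LLMs) that updates the text of a scientific paper according to an evaluation rubric. Our automated method discovers a rubric that is highly predictive of future citation counts and integrates it with APRES into an automated system that revises papers to improve their quality and impact. Crucially, this goal should be achieved without altering the core scientific content. We demonstrate the success of APRES, which improves future citation prediction by 19.6% in mean averaged error over the next best baseline, and show that human expert evaluators prefer the papers produced by our revision process over the originals 79% of the time. Our findings provide strong empirical support for using LLMs as a tool to help authors stress-test their manuscripts before submission. Ultimately, our work aims to augment, not replace, the essential role of human expert reviewers, because it should be humans who discern which discoveries truly matter, guiding science toward advancing knowledge and enriching lives.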

Title: BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?

Authors: Guoxin Chen, Fanzhe Meng, Jiale Zhao, Minghao Li, Daixuan Cheng, Huatong Song, Jie Chen, Yuzhi Lin, Hui Chen, Xin Zhao, Ruihua Song, Chang Liu, Cheng Chen, Kai Jia, Ji-Rong Wen
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2603.03194
Pdf URL: https://arxiv.org/pdf/2603.03194
Copy Paste: [[2603.03194]] BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?(https://arxiv.org/abs/2603.03194)
Keywords: agent
Abstract: Current benchmarks for code agents primarily assess narrow, repository-specific fixes, overlooking critical real-world challenges such as cross-repository reasoning, domain-specialized problem solving, dependency-driven migration, and full-repository generation. To address this gap, we introduce BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes - resolution scope and knowledge scope - using 500 real-world instances across four distinct settings. Experimental results reveal a significant capability gap: even frontier models plateau below 45% success, and no single model performs consistently across task types. To systematically investigate the role of external knowledge, we develop SearchSWE, a framework that integrates deep search with coding abilities. Our experiments show that search augmentation yields inconsistent gains and can in some cases degrade performance, highlighting the difficulty of emulating developer-like workflows that interleave search and reasoning during coding tasks. This work offers both a realistic, challenging evaluation benchmark and a flexible framework to advance research toward more capable code agents.
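Summary: Current benchmarks for code agents mainly assess narrow, repository-specific fixes and overlook critical real-world challenges such as cross-repository reasoning, domain-specialized problem solving, dependency-driven migration, and full-repository generation. To address this gap, we introduce BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes (resolution scope and knowledge scope) using 500 real-world instances across four distinct settings. Experimental results reveal a significant capability gap: even frontier models plateau below a 45% success rate, and no single model performs consistently across task types. To systematically investigate the role of external knowledge, we develop SearchSWE, a framework that integrates deep search with coding abilities. Our experiments show that search augmentation yields inconsistent gains and in some cases degrades performance, which highlights the difficulty of emulating developer-like workflows that interleave search and reasoning during coding tasks. This work offers both a realistic, challenging evaluation benchmark and a flexible framework to advance research toward more capable code agents.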

Title: Code2Math: Can Your Code Agent Effectively Evolve Math Problems Through Exploration?

Authors: Dadi Guo, Yuejin Xie, Qingyu Liu, Jiayu Liu, Zhiyuan Fan, Qihan Ren, Shuai Shao, Tianyi Zhou, Dongrui Liu, Yi R. Fung
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.03202
Pdf URL: https://arxiv.org/pdf/2603.03202
Copy Paste: [[2603.03202]] Code2Math: Can Your Code Agent Effectively Evolve Math Problems Through Exploration?(https://arxiv.org/abs/2603.03202)
Keywords: language model, llm, agent
Abstract: As large language models (LLMs) advance their mathematical capabilities toward the IMO level, the scarcity of challenging, high-quality problems for training and evaluation has become a significant bottleneck. Simultaneously, recent code agents have demonstrated sophisticated skills in agentic coding and reasoning, suggesting that code execution can serve as a scalable environment for mathematical experimentation. In this paper, we investigate the potential of code agents to autonomously evolve existing math problems into more complex variations. We introduce a multi-agent framework designed to perform problem evolution while validating the solvability and increased difficulty of the generated problems. Our experiments demonstrate that, given sufficient test-time exploration, code agents can synthesize new, solvable problems that are structurally distinct from and more challenging than the originals. This work provides empirical evidence that code-driven agents can serve as a viable mechanism for synthesizing high-difficulty mathematical reasoning problems within scalable computational environments. Our data is available at this https URL.
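Summary: As large language models (LLMs) advance their mathematical capabilities toward the IMO level, the scarcity of challenging, high-quality problems for training and evaluation has become a significant bottleneck. At the same time, recent code agents have demonstrated sophisticated skills in agentic coding and reasoning, which suggests that code execution can serve as a scalable environment for mathematical experimentation. In this paper, we investigate the potential of code agents to autonomously evolve existing math problems into more complex variations. We introduce a multi-agent framework designed to perform problem evolution while validating the solvability and increased difficulty of the generated problems. Our experiments show that, given sufficient test-time exploration, code agents can synthesize new, solvable problems that are structurally distinct from, and more challenging than, the originals. This work provides empirical evidence that code-driven agents can serve as a viable mechanism for synthesizing high-difficulty mathematical reasoning problems within scalable computational environments. Our data is available at this https URL.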

Title: Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use

Authors: Aradhye Agarwal, Gurdit Siyan, Yash Pandya, Joykirat Singh, Akshay Nambi, Ahmed Awadallah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.03205
Pdf URL: https://arxiv.org/pdf/2603.03205
Copy Paste: [[2603.03205]] Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use(https://arxiv.org/abs/2603.03205)
Keywords: language model, prompt, chat, agent
Abstract: Agentic language models operate in a fundamentally different safety regime than chat models: they must plan, call tools, and execute long-horizon actions where a single misstep, such as accessing files or entering credentials, can cause irreversible harm. Existing alignment methods, largely optimized for static generation and task completion, break down in these settings due to sequential decision-making, adversarial tool feedback, and overconfident intermediate reasoning. We introduce MOSAIC, a post-training framework that aligns agents for safe multi-step tool use by making safety decisions explicit and learnable. MOSAIC structures inference as a plan, check, then act or refuse loop, with explicit safety reasoning and refusal as first-class actions. To train without trajectory-level labels, we use preference-based reinforcement learning with pairwise trajectory comparisons, which captures safety distinctions often missed by scalar rewards. We evaluate MOSAIC zero-shot across three model families, Qwen2.5-7B, Qwen3-4B-Thinking, and Phi-4, and across out-of-distribution benchmarks spanning harmful tasks, prompt injection, benign tool use, and cross-domain privacy leakage. MOSAIC reduces harmful behavior by up to 50%, increases harmful-task refusal by over 20% on injection attacks, cuts privacy leakage, and preserves or improves benign task performance, demonstrating robust generalization across models, domains, and agentic settings.
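Summary: Agentic language models operate in a fundamentally different safety regime than chat models: they must plan, call tools, and execute long-horizon actions in which a single misstep, such as accessing files or entering credentials, can cause irreversible harm. Existing alignment methods, largely optimized for static generation and task completion, break down in these settings because of sequential decision-making, adversarial tool feedback, and overconfident intermediate reasoning. We introduce MOSAIC, a post-training framework that aligns agents for safe multi-step tool use by making safety decisions explicit and learnable. MOSAIC structures inference as a plan, check, then act or refuse loop, with explicit safety reasoning and refusal as first-class actions. To train without trajectory-level labels, we use preference-based reinforcement learning with pairwise trajectory comparisons, which captures safety distinctions that scalar rewards often miss. We evaluate MOSAIC zero-shot across three model families, Qwen2.5-7B, Qwen3-4B-Thinking, and Phi-4, and across out-of-distribution benchmarks spanning harmful tasks, prompt injection, benign tool use, and cross-domain privacy leakage. MOSAIC reduces harmful behavior by up to 50%, increases harmful-task refusal by over 20% on injection attacks, cuts privacy leakage, and preserves or improves benign task performance, demonstrating robust generalization across models, domains, and agentic settings.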