2026-03-27

Title: When Consistency Becomes Bias: Interviewer Effects in Semi-Structured Clinical Interviews

Authors: Hasindri Watawana, Sergio Burdisso, Diego A. Moreno-Galván, Fernando Sánchez-Vega, A. Pastor López-Monroy, Petr Motlicek, Esaú Villatoro-Tello
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2603.24651
Pdf URL: https://arxiv.org/pdf/2603.24651
Copy Paste: [[2603.24651]] When Consistency Becomes Bias: Interviewer Effects in Semi-Structured Clinical Interviews(https://arxiv.org/abs/2603.24651)
Keywords: language model, prompt
Abstract: Automatic depression detection from doctor-patient conversations has gained momentum thanks to the availability of public corpora and advances in language modeling. However, interpretability remains limited: strong performance is often reported without revealing what drives predictions. We analyze three datasets (ANDROIDS, DAIC-WOZ, and E-DAIC) and identify a systematic bias from interviewer prompts in semi-structured interviews. Models trained on interviewer turns exploit fixed prompts and positions to distinguish depressed from control subjects, often achieving high classification scores without using participant language. Restricting models to participant utterances distributes decision evidence more broadly and reflects genuine linguistic cues. While semi-structured protocols ensure consistency, including interviewer prompts inflates performance by leveraging script artifacts. Our results highlight a cross-dataset, architecture-agnostic bias and emphasize the need for analyses that localize decision evidence by time and speaker to ensure models learn from participants' language.

Title: Demystifying When Pruning Works via Representation Hierarchies

Authors: Shwai He, Guoheng Sun, Haichao Zhang, Yun Fu, Ang Li
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.24652
Pdf URL: https://arxiv.org/pdf/2603.24652
Copy Paste: [[2603.24652]] Demystifying When Pruning Works via Representation Hierarchies(https://arxiv.org/abs/2603.24652)
Keywords: language model
Abstract: Network pruning, which removes less important parameters or architectures, is often expected to improve efficiency while preserving performance. However, this expectation does not consistently hold across language tasks: pruned models can perform well on non-generative tasks but frequently fail in generative settings. To understand this discrepancy, we analyze network pruning from a representation-hierarchy perspective, decomposing the internal computation of language models into three sequential spaces: embedding (hidden representations), logit (pre-softmax outputs), and probability (post-softmax distributions). We find that representations in the embedding and logit spaces are largely robust to pruning-induced perturbations. However, the nonlinear transformation from logits to probabilities amplifies these deviations, which accumulate across time steps and lead to substantial degradation during generation. In contrast, the stability of the categorical-token probability subspace, together with the robustness of the embedding space, supports the effectiveness of pruning for non-generative tasks such as retrieval and multiple-choice selection. Our analysis disentangles the effects of pruning across tasks and provides practical guidance for its application. Code is available at this https URL
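
The claim that the logit-to-probability step amplifies small deviations can be illustrated with a toy decoding step. The sketch below is not the paper's analysis: the logit values are hypothetical, and relative L2 change is just one assumed way to compare the two spaces.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def relative_change(base, other):
    """L2 norm of (other - base) divided by the L2 norm of base."""
    diff = math.sqrt(sum((o - b) ** 2 for b, o in zip(base, other)))
    return diff / math.sqrt(sum(b ** 2 for b in base))

def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical logits for one decoding step (not taken from the paper).
dense_logits = [4.0, 3.8, 1.0, 0.5]
pruned_logits = [3.9, 4.0, 1.0, 0.5]  # small pruning-induced shift

dense_probs = softmax(dense_logits)
pruned_probs = softmax(pruned_logits)

logit_shift = relative_change(dense_logits, pruned_logits)
prob_shift = relative_change(dense_probs, pruned_probs)

print(f"relative logit change:       {logit_shift:.3f}")
print(f"relative probability change: {prob_shift:.3f}")
print("greedy token (dense, pruned):", argmax(dense_probs), argmax(pruned_probs))
```

In this toy case the greedy token also changes, which is the kind of error that would then feed into later time steps during generation.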

Title: Fine-Tuning A Large Language Model for Systematic Review Screening

Authors: Kweku Yamoah, Noah Schroeder, Emmanuel Dorley, Neha Rani, Caleb Schutz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.24767
Pdf URL: https://arxiv.org/pdf/2603.24767
Copy Paste: [[2603.24767]] Fine-Tuning A Large Language Model for Systematic Review Screening(https://arxiv.org/abs/2603.24767)
Keywords: language model, llm, prompt
Abstract: Systematic reviews traditionally have taken considerable amounts of human time and energy to complete, in part due to the extensive number of titles and abstracts that must be reviewed for potential inclusion. Recently, researchers have begun to explore how to use large language models (LLMs) to make this process more efficient. However, research to date has shown inconsistent results. We posit this is because prompting alone may not provide sufficient context for the model(s) to perform well. In this study, we fine-tune a small 1.2 billion parameter open-weight LLM specifically for study screening in the context of a systematic review in which humans rated more than 8500 titles and abstracts for potential inclusion. Our results showed strong performance improvements from the fine-tuned model, with the weighted F1 score improving 80.79% compared to the base model. When run on the full dataset of 8,277 studies, the fine-tuned model had 86.40% agreement with the human coder, a 91.18% true positive rate, an 86.38% true negative rate, and perfect agreement across multiple inference runs. Taken together, our results show that there is promise for fine-tuning LLMs for title and abstract screening in large-scale systematic reviews.
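
For readers who want to see how the reported screening metrics relate to one another, here is a minimal sketch that computes agreement, true positive rate, true negative rate, and support-weighted F1 from binary include/exclude labels. The function name and the toy labels are assumptions for illustration; nothing here reproduces the paper's data or evaluation code.

```python
def screening_metrics(y_true, y_pred):
    """Binary screening metrics; 1 = include, 0 = exclude."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def f1(hits, false_pos, false_neg):
        denom = 2 * hits + false_pos + false_neg
        return 2 * hits / denom if denom else 0.0

    n = len(y_true)
    f1_include = f1(tp, fp, fn)
    f1_exclude = f1(tn, fn, fp)  # roles swap when "exclude" is the target class
    return {
        "agreement": (tp + tn) / n,
        "true_positive_rate": tp / (tp + fn) if tp + fn else 0.0,
        "true_negative_rate": tn / (tn + fp) if tn + fp else 0.0,
        # Per-class F1 weighted by how many true items each class has.
        "weighted_f1": ((tp + fn) * f1_include + (tn + fp) * f1_exclude) / n,
    }

# Hypothetical human labels and model predictions.
human = [1, 1, 0, 0, 0, 0]
model = [1, 0, 0, 0, 0, 1]
metrics = screening_metrics(human, model)
print(metrics)
```

Because excluded studies usually dominate a screening set, weighted F1 is driven mostly by the exclude class, so it is worth reading alongside the true positive rate.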

Title: Evaluating Fine-Tuned LLM Model For Medical Transcription With Small Low-Resource Languages Validated Dataset

Authors: Mohammed Nowshad Ruhani Chowdhury, Mohammed Nowaz Rabbani Chowdhury, Sakari Lukkarinen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.24772
Pdf URL: https://arxiv.org/pdf/2603.24772
Copy Paste: [[2603.24772]] Evaluating Fine-Tuned LLM Model For Medical Transcription With Small Low-Resource Languages Validated Dataset(https://arxiv.org/abs/2603.24772)
Keywords: language model, llm
Abstract: Clinical documentation is a critical factor for patient safety, diagnosis, and continuity of care. The administrative burden of EHRs is a significant factor in physician burnout. This is a critical issue for low-resource languages, including Finnish. This study investigates the effectiveness of a domain-aligned natural language processing (NLP) large language model for medical transcription in Finnish by fine-tuning LLaMA 3.1-8B on a small validated corpus of simulated clinical conversations by students at Metropolia University of Applied Sciences. The fine-tuning process for medical transcription used a controlled preprocessing and optimization approach. The fine-tuning effectiveness was evaluated by sevenfold cross-validation. The evaluation metrics for fine-tuned LLaMA 3.1-8B were BLEU = 0.1214, ROUGE-L = 0.4982, and BERTScore F1 = 0.8230. The results showed a low n-gram overlap but a strong semantic similarity with reference transcripts. This study indicates that fine-tuning can be an effective approach for translation of medical discourse in spoken Finnish and supports the feasibility of fine-tuning a privacy-oriented domain-specific large language model for clinical documentation in Finnish. It also provides directions for future work.
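
To make the ROUGE-L figure easier to interpret, the sketch below computes a ROUGE-L F1 score from the longest common subsequence (LCS) of two token lists. It assumes whitespace tokenization and the balanced F1 variant; the paper's exact scoring configuration is not stated in the abstract, and the example sentences are invented.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 over whitespace tokens (assumed tokenization)."""
    ref_tokens = reference.split()
    cand_tokens = candidate.split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    lcs = lcs_length(ref_tokens, cand_tokens)
    if lcs == 0:
        return 0.0
    recall = lcs / len(ref_tokens)
    precision = lcs / len(cand_tokens)
    return 2 * precision * recall / (precision + recall)

# Invented example pair, not from the study's corpus.
score = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
print(round(score, 4))
```

Since the LCS tolerates gaps, ROUGE-L can stay moderate when exact n-gram overlap (and therefore BLEU) is low, which is consistent with the pattern reported above.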

Title: Synthetic Rewriting as a Quality Multiplier: Evidence from Portuguese Continued Pretraining

Authors: Thales Sales Almeida, Rodrigo Nogueira, Hélio Pedrini
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.24826
Pdf URL: https://arxiv.org/pdf/2603.24826
Copy Paste: [[2603.24826]] Synthetic Rewriting as a Quality Multiplier: Evidence from Portuguese Continued Pretraining(https://arxiv.org/abs/2603.24826)
Keywords: language model
Abstract: Synthetic data generation through document rewriting has emerged as a promising technique for improving language model pretraining, yet most studies focus on English and do not systematically control for the quality of the source data being rewritten. We present a controlled study of how synthetic rewriting interacts with source data quality in the context of Portuguese continued pretraining. Starting from ClassiCC-PT, a Portuguese corpus annotated with STEM and Educational quality scores, we construct two 10B-token subsets at different quality levels and rewrite each into four styles using a 7B instruction-tuned model, producing approximately 40B tokens of synthetic data per condition. We train two English-centric base models (1.1B and 7B parameters) on each condition and evaluate on PoETa V2, a comprehensive 44-task Portuguese benchmark. At the 7B scale, rewriting high-quality data yields a +3.4 NPM gain over the same data unmodified, while rewriting low-quality data provides only +0.5 NPM. At the 1.1B scale, this interaction is weaker, with unmodified low-quality data performing comparably to rewritten high-quality data. Our results demonstrate that synthetic rewriting acts primarily as a quality multiplier rather than a substitute for data curation, and that this effect is scale-dependent.

Title: Prune as You Generate: Online Rollout Pruning for Faster and Better RLVR

Authors: Haobo Xu, Sirui Chen, Ruizhong Qiu, Yuchen Yan, Chen Luo, Monica Cheng, Jingrui He, Hanghang Tong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.24840
Pdf URL: https://arxiv.org/pdf/2603.24840
Copy Paste: [[2603.24840]] Prune as You Generate: Online Rollout Pruning for Faster and Better RLVR(https://arxiv.org/abs/2603.24840)
Keywords: language model, llm, prompt
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, methods such as GRPO and DAPO suffer from substantial computational cost, since they rely on sampling many rollouts for each prompt. Moreover, in RLVR the relative advantage is often sparse: many samples become nearly all-correct or all-incorrect, yielding low within-group reward variance and thus weak learning signals. In this paper, we introduce arrol (Accelerating RLVR via online Rollout Pruning), an online rollout pruning method that prunes rollouts during generation while explicitly steering the surviving ones to be more correctness-balanced to enhance learning signals. Specifically, arrol trains a lightweight quality head on-the-fly to predict the success probability of partial rollouts and uses it to make early pruning decisions. The learned quality head can further weigh candidates to improve inference accuracy during test-time scaling. To improve efficiency, we present a system design that prunes rollouts inside the inference engine and re-batches the remaining ones for log-probability computation and policy updates. Across GRPO and DAPO on Qwen-3 and LLaMA-3.2 models (1B-8B), arrol improves average accuracy by +2.30 to +2.99 while achieving up to 1.7x training speedup, and yielding up to +8.33 additional gains in average accuracy in test-time scaling. The code is available at this https URL.
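
The point about sparse relative advantages can be made concrete with a small sketch. The first function shows group-normalized advantages of the kind used in GRPO-style training, which collapse to zero when every rollout in a group gets the same reward. The second function, `prune_balanced`, is a purely hypothetical selection rule invented here to illustrate "correctness-balanced" survivors; it is not arrol's actual pruning criterion.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Reward minus the group mean, scaled by the group standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def prune_balanced(success_probs, keep):
    """Hypothetical rule: alternately keep the rollouts predicted most and
    least likely to succeed, so the survivors mix both outcomes."""
    order = sorted(range(len(success_probs)), key=success_probs.__getitem__)
    kept = []
    low, high = 0, len(order) - 1
    while len(kept) < keep and low <= high:
        kept.append(order[high])
        high -= 1
        if len(kept) < keep and low <= high:
            kept.append(order[low])
            low += 1
    return sorted(kept)

# An all-correct group carries no learning signal.
flat = group_advantages([1, 1, 1, 1])
# A mixed group yields non-zero advantages of both signs.
mixed = group_advantages([1, 0, 1, 0])

# Hypothetical success probabilities from a quality head on partial rollouts.
predicted = [0.9, 0.8, 0.85, 0.1, 0.95, 0.2]
survivors = prune_balanced(predicted, keep=4)
print(flat, mixed, survivors)
```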

Title: Estimating near-verbatim extraction risk in language models with decoding-constrained beam search

Authors: A. Feder Cooper, Mark A. Lemley, Christopher De Sa, Lea Duesterwald, Allison Casasola, Jamie Hayes, Katherine Lee, Daniel E. Ho, Percy Liang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.24917
Pdf URL: https://arxiv.org/pdf/2603.24917
Copy Paste: [[2603.24917]] Estimating near-verbatim extraction risk in language models with decoding-constrained beam search(https://arxiv.org/abs/2603.24917)
Keywords: language model, llm
Abstract: Recent work shows that standard greedy-decoding extraction methods for quantifying memorization in LLMs miss how extraction risk varies across sequences. Probabilistic extraction -- computing the probability of generating a target suffix given a prefix under a decoding scheme -- addresses this, but is tractable only for verbatim memorization, missing near-verbatim instances that pose similar privacy and copyright risks. Quantifying near-verbatim extraction risk is expensive: the set of near-verbatim suffixes is combinatorially large, and reliable Monte Carlo (MC) estimation can require ~100,000 samples per sequence. To mitigate this cost, we introduce decoding-constrained beam search, which yields deterministic lower bounds on near-verbatim extraction risk at a cost comparable to ~20 MC samples per sequence. Across experiments, our approach surfaces information invisible to verbatim methods: many more extractable sequences, substantially larger per-sequence extraction mass, and patterns in how near-verbatim extraction risk manifests across model sizes and types of text.
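
The idea of a deterministic lower bound on near-verbatim extraction mass can be sketched on a toy model. The code below is an illustration under strong assumptions, not the paper's algorithm: `toy_next_token_probs` is a hypothetical stand-in for a language model, "near-verbatim" is taken to mean at most `max_mismatch` token substitutions at fixed positions, and sampling is assumed to be unmodified (temperature 1), so a suffix's probability is the product of its per-token probabilities.

```python
def toy_next_token_probs(context):
    """Hypothetical model: strongly prefers the repeating pattern a, b, c."""
    vocab = ["a", "b", "c"]
    favored = vocab[len(context) % 3]
    return {tok: (0.8 if tok == favored else 0.1) for tok in vocab}

def near_verbatim_lower_bound(prefix, target, max_mismatch, beam_width, next_probs):
    """Beam search restricted to suffixes within max_mismatch substitutions
    of target. The summed probability of the surviving beams is a lower bound
    on the total near-verbatim mass, because every beam is a distinct member
    of the near-verbatim set and the beam may miss some members."""
    beams = [((), 1.0, 0)]  # (suffix tokens so far, probability, mismatches)
    for position in range(len(target)):
        candidates = []
        for tokens, prob, mismatches in beams:
            dist = next_probs(tuple(prefix) + tokens)
            for tok, p in dist.items():
                m = mismatches + (tok != target[position])
                if m <= max_mismatch:  # decoding constraint
                    candidates.append((tokens + (tok,), prob * p, m))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return sum(prob for _, prob, _ in beams)

prefix = ["a", "b", "c"]
target = ["a", "b", "c"]
verbatim = near_verbatim_lower_bound(prefix, target, 0, 1, toy_next_token_probs)
near = near_verbatim_lower_bound(prefix, target, 1, 10, toy_next_token_probs)
narrow = near_verbatim_lower_bound(prefix, target, 1, 2, toy_next_token_probs)
print(verbatim, near, narrow)
```

A narrower beam returns a smaller but still valid bound, which is the trade-off between cost and tightness that motivates comparing against Monte Carlo sampling.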

Title: Toward domain-specific machine translation and quality estimation systems

Authors: Javad Pourmostafa Roshan Sharami
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.24955
Pdf URL: https://arxiv.org/pdf/2603.24955
Copy Paste: [[2603.24955]] Toward domain-specific machine translation and quality estimation systems(https://arxiv.org/abs/2603.24955)
Keywords: language model
Abstract: Machine Translation (MT) and Quality Estimation (QE) perform well in general domains but degrade under domain mismatch. This dissertation studies how to adapt MT and QE systems to specialized domains through a set of data-focused contributions. Chapter 2 presents a similarity-based data selection method for MT. Small, targeted in-domain subsets outperform much larger generic datasets and reach strong translation quality at lower computational cost. Chapter 3 introduces a staged QE training pipeline that combines domain adaptation with lightweight data augmentation. The method improves performance across domains, languages, and resource settings, including zero-shot and cross-lingual cases. Chapter 4 studies the role of subword tokenization and vocabulary in fine-tuning. Aligned tokenization-vocabulary setups lead to stable training and better translation quality, while mismatched configurations reduce performance. Chapter 5 proposes a QE-guided in-context learning method for large language models. QE models select examples that improve translation quality without parameter updates and outperform standard retrieval methods. The approach also supports a reference-free setup, reducing reliance on a single reference set. These results show that domain adaptation depends on data selection, representation, and efficient adaptation strategies. The dissertation provides methods for building MT and QE systems that perform reliably in domain-specific settings.

Title: LLM-Driven Reasoning for Constraint-Aware Feature Selection in Industrial Systems

Authors: Yuhang Zhou, Zhuokai Zhao, Ke Li, Spilios Evmorfos, Gökalp Demirci, Mingyi Wang, Qiao Liu, Qifei Wang, Serena Li, Weiwei Li, Tingting Wang, Mingze Gao, Gedi Zhou, Abhishek Kumar, Xiangjun Fan, Lizhu Zhang, Jiayi Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.24979
Pdf URL: https://arxiv.org/pdf/2603.24979
Copy Paste: [[2603.24979]] LLM-Driven Reasoning for Constraint-Aware Feature Selection in Industrial Systems(https://arxiv.org/abs/2603.24979)
Keywords: llm, prompt, agent
Abstract: Feature selection is a crucial step in large-scale industrial machine learning systems, directly affecting model accuracy, efficiency, and maintainability. Traditional feature selection methods rely on labeled data and statistical heuristics, making them difficult to apply in production environments where labeled data are limited and multiple operational constraints must be satisfied. To address this, we propose Model Feature Agent (MoFA), a model-driven framework that performs sequential, reasoning-based feature selection using both semantic and quantitative feature information. MoFA incorporates feature definitions, importance scores, correlations, and metadata (e.g., feature groups or types) into structured prompts and selects features through interpretable, constraint-aware reasoning. We evaluate MoFA in three real-world industrial applications: (1) True Interest and Time-Worthiness Prediction, where it improves accuracy while reducing feature group complexity, (2) Value Model Enhancement, where it discovers high-order interaction terms that yield substantial engagement gains in online experiments, and (3) Notification Behavior Prediction, where it selects compact, high-value feature subsets that improve both model accuracy and inference efficiency. Together, these results demonstrate the practicality and effectiveness of LLM-based reasoning for feature selection in real production systems.

Title: Exons-Detect: Identifying and Amplifying Exonic Tokens via Hidden-State Discrepancy for Robust AI-Generated Text Detection

Authors: Xiaowei Zhu, Yubing Ren, Fang Fang, Shi Wang, Yanan Cao, Li Guo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.24981
Pdf URL: https://arxiv.org/pdf/2603.24981
Copy Paste: [[2603.24981]] Exons-Detect: Identifying and Amplifying Exonic Tokens via Hidden-State Discrepancy for Robust AI-Generated Text Detection(https://arxiv.org/abs/2603.24981)
Keywords: language model
Abstract: The rapid advancement of large language models has increasingly blurred the boundary between human-written and AI-generated text, raising societal risks such as misinformation dissemination, authorship ambiguity, and threats to intellectual property rights. These concerns highlight the urgent need for effective and reliable detection methods. While existing training-free approaches often achieve strong performance by aggregating token-level signals into a global score, they typically assume uniform token contributions, making them less robust under short sequences or localized token modifications. To address these limitations, we propose Exons-Detect, a training-free method for AI-generated text detection based on an exon-aware token reweighting perspective. Exons-Detect identifies and amplifies informative exonic tokens by measuring hidden-state discrepancy under a dual-model setting, and computes an interpretable translation score from the resulting importance-weighted token sequence. Empirical evaluations demonstrate that Exons-Detect achieves state-of-the-art detection performance and exhibits strong robustness to adversarial attacks and varying input lengths. In particular, it attains a 2.2% relative improvement in average AUROC over the strongest prior baseline on DetectRL.

Title: Imperative Interference: Social Register Shapes Instruction Topology in Large Language Models

Authors: Tony Mason
Subjects: cs.CL, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2603.25015
Pdf URL: https://arxiv.org/pdf/2603.25015
Copy Paste: [[2603.25015]] Imperative Interference: Social Register Shapes Instruction Topology in Large Language Models(https://arxiv.org/abs/2603.25015)
Keywords: language model, prompt
Abstract: System prompt instructions that cooperate in English compete in Spanish, with the same semantic content, but opposite interaction topology. We present instruction-level ablation experiments across four languages and four models showing that this topology inversion is mediated by social register: the imperative mood carries different obligatory force across speech communities, and models trained on multilingual data have learned these conventions. Declarative rewriting of a single instruction block reduces cross-linguistic variance by 81% (p = 0.029, permutation test). Rewriting three of eleven imperative blocks shifts Spanish instruction topology from competitive to cooperative, with spillover effects on unrewritten blocks. These findings suggest that models process instructions as social acts, not technical specifications: "NEVER do X" is an exercise of authority whose force is language-dependent, while "X: disabled" is a factual description that transfers across languages. If register mediates instruction-following at inference time, it plausibly does so during training. We state this as a testable prediction: constitutional AI principles authored in imperative mood may create language-dependent alignment. Corpus: 22 hand-authored probes against a production system prompt decomposed into 56 blocks.

Title: Approaches to Analysing Historical Newspapers Using LLMs

Authors: Filip Dobranić, Tina Munda, Oliver Pejić, Vojko Gorjanc, Uroš Šmajdek, David Bordon, Jakob Lenardič, Tjaša Konovšek, Kristina Pahor de Maiti Tekavčič, Ciril Bohak, Darja Fišer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.25051
Pdf URL: https://arxiv.org/pdf/2603.25051
Copy Paste: [[2603.25051]] Approaches to Analysing Historical Newspapers Using LLMs(https://arxiv.org/abs/2603.25051)
Keywords: language model, llm
Abstract: This study presents a computational analysis of the Slovene historical newspapers Slovenec and Slovenski narod from the sPeriodika corpus, combining topic modelling, large language model (LLM)-based aspect-level sentiment analysis, entity-graph visualisation, and qualitative discourse analysis to examine how collective identities, political orientations, and national belonging were represented in public discourse at the turn of the twentieth century. Using BERTopic, we identify major thematic patterns and show both shared concerns and clear ideological differences between the two newspapers, reflecting their conservative-Catholic and liberal-progressive orientations. We further evaluate four instruction-following LLMs for targeted sentiment classification in OCR-degraded historical Slovene and select the Slovene-adapted GaMS3-12B-Instruct model as the most suitable for large-scale application, while also documenting important limitations, particularly its stronger performance on neutral sentiment than on positive or negative sentiment. Applied at dataset scale, the model reveals meaningful variation in the portrayal of collective identities, with some groups appearing predominantly in neutral descriptive contexts and others more often in evaluative or conflict-related discourse. We then create NER graphs to explore the relationships between collective identities and places. We apply a mixed methods approach to analyse the named entity graphs, combining quantitative network analysis with critical discourse analysis. The investigation focuses on the emergence and development of intertwined historical political and socionomic identities. Overall, the study demonstrates the value of combining scalable computational methods with critical interpretation to support digital humanities research on noisy historical newspaper data.

Title: Closing the Confidence-Faithfulness Gap in Large Language Models

Authors: Miranda Muqing Miao, Lyle Ungar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.25052
Pdf URL: https://arxiv.org/pdf/2603.25052
Copy Paste: [[2603.25052]] Closing the Confidence-Faithfulness Gap in Large Language Models(https://arxiv.org/abs/2603.25052)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) tend to verbalize confidence scores that are largely detached from their actual accuracy, yet the geometric relationship governing this behavior remains poorly understood. In this work, we present a mechanistic interpretability analysis of verbalized confidence, using linear probes and contrastive activation addition (CAA) steering to show that calibration and verbalized confidence signals are encoded linearly but are orthogonal to one another -- a finding consistent across three open-weight models and four datasets. Interestingly, when models are prompted to simultaneously reason through a problem and verbalize a confidence score, the reasoning process disrupts the verbalized confidence direction, exacerbating miscalibration. We term this the "Reasoning Contamination Effect." Leveraging this insight, we introduce a two-stage adaptive steering pipeline that reads the model's internal accuracy estimate and steers verbalized output to match it, substantially improving calibration alignment across all evaluated models.
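
As a rough illustration of the two geometric ideas in this abstract, the sketch below builds a contrastive steering direction as a difference of mean activations, checks its cosine similarity against a second direction, and applies additive steering. The two-dimensional activations, the `calibration_direction` vector, and the scaling factor are all hypothetical; real experiments would operate on a model's hidden states at a chosen layer.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def steering_vector(positive_acts, negative_acts):
    """Difference of mean activations between two contrastive sets."""
    pos_mean = mean_vector(positive_acts)
    neg_mean = mean_vector(negative_acts)
    return [p - n for p, n in zip(pos_mean, neg_mean)]

def cosine(u, v):
    """Cosine similarity; values near 0 indicate near-orthogonal directions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def steer(activation, direction, alpha):
    """Add a scaled direction to an activation vector."""
    return [a + alpha * d for a, d in zip(activation, direction)]

# Hypothetical activations for prompts eliciting high vs. low stated confidence.
high_confidence = [[1.0, 0.2], [3.0, 0.2]]
low_confidence = [[0.0, 0.2], [0.0, 0.2]]
confidence_direction = steering_vector(high_confidence, low_confidence)

# Hypothetical probe direction for actual accuracy (calibration).
calibration_direction = [0.0, 1.0]

similarity = cosine(confidence_direction, calibration_direction)
steered = steer([1.0, 1.0], confidence_direction, alpha=0.5)
print(confidence_direction, similarity, steered)
```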

Title: OMIND: Framework for Knowledge Grounded Finetuning and Multi-Turn Dialogue Benchmark for Mental Health LLMs

Authors: Suraj Racha, Prashant Harish Joshi, Utkarsh Maurya, Nitin Yadav, Mridul Sharma, Ananya Kunisetty, Saranya Darisipudi, Nirmal Punjabi, Ganesh Ramakrishnan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.25105
Pdf URL: https://arxiv.org/pdf/2603.25105
Copy Paste: [[2603.25105]] OMIND: Framework for Knowledge Grounded Finetuning and Multi-Turn Dialogue Benchmark for Mental Health LLMs(https://arxiv.org/abs/2603.25105)
Keywords: language model, llm, chat, agent
Abstract: Large Language Models (LLMs) have shown remarkable capabilities for complex tasks, yet adaptation to the medical domain, specifically mental health, poses specific challenges. Mental health is a rising concern globally, and LLMs have large potential to help address it. We highlight three primary challenges for LLMs in mental health: lack of high-quality, interpretable, knowledge-grounded training data; training paradigms restricted to core capabilities; and evaluation of multi-turn dialogue settings. To address these, we present the oMind framework, which includes training and aligning LLM agents for diverse capabilities including conversations, and a high-quality ~164k multi-task SFT dataset produced by our generation pipeline based on structured knowledge retrieval, LLM-based pruning, and review actions. We also introduce oMind-Chat, a novel multi-turn benchmark dataset with expert-annotated turn-level and conversation-level rubrics. Our diverse experiments on both core capabilities and conversations show that oMind LLMs consistently outperform baselines. oMind-LLM also shows significantly better reasoning, with up to an 80% win rate.

Title: Do LLMs Know What They Know? Measuring Metacognitive Efficiency with Signal Detection Theory

Authors: Jon-Paul Cacioli
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.25112
Pdf URL: https://arxiv.org/pdf/2603.25112
Copy Paste: [[2603.25112]] Do LLMs Know What They Know? Measuring Metacognitive Efficiency with Signal Detection Theory(https://arxiv.org/abs/2603.25112)
Keywords: llm
Abstract: Standard evaluation of LLM confidence relies on calibration metrics (ECE, Brier score) that conflate two distinct capacities: how much a model knows (Type-1 sensitivity) and how well it knows what it knows (Type-2 metacognitive sensitivity). We introduce an evaluation framework based on Type-2 Signal Detection Theory that decomposes these capacities using meta-d' and the metacognitive efficiency ratio M-ratio. Applied to four LLMs (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3, Llama-3-8B-Base, Gemma-2-9B-Instruct) across 224,000 factual QA trials, we find: (1) metacognitive efficiency varies substantially across models even when Type-1 sensitivity is similar -- Mistral achieves the highest d' but the lowest M-ratio; (2) metacognitive efficiency is domain-specific, with different models showing different weakest domains, invisible to aggregate metrics; (3) temperature manipulation shifts Type-2 criterion while meta-d' remains stable for two of four models, dissociating confidence policy from metacognitive capacity; (4) AUROC_2 and M-ratio produce fully inverted model rankings, demonstrating these metrics answer fundamentally different evaluation questions. The meta-d' framework reveals which models "know what they don't know" versus which merely appear well-calibrated due to criterion placement -- a distinction with direct implications for model selection, deployment, and human-AI collaboration. Pre-registered analysis; code and data publicly available.
摘要：LLM 置信度的标准评估依赖于校准指标（ECE、Brier 分数），该指标合并了两种不同的能力：模型知道多少（1 类敏感性）以及模型对所知道的内容了解程度（2 类元认知敏感性）。我们引入了一个基于 Type-2 信号检测理论的评估框架，该框架使用元 d' 和元认知效率比 M-ratio 来分解这些能力。将 4 个 LLM（Llama-3-8B-Instruct、Mistral-7B-Instruct-v0.3、Llama-3-8B-Base、Gemma-2-9B-Instruct）应用于 224,000 项事实 QA 试验中，我们发现：（1）即使 1 型敏感性相似，不同模型的元认知效率也有很大差异——Mistral 实现了最高的 d' 但 M 比率最低；（2）元认知效率是特定领域的，不同的模型显示不同的最弱领域，而聚合指标不可见； (3) 温度操纵改变了 Type-2 标准，而 Meta-d' 对于四个模型中的两个保持稳定，将置信策略与元认知能力分离； (4) AUROC_2 和 M-ratio 产生完全倒置的模型排名，证明这些指标回答了根本不同的评估问题。 Meta-d' 框架揭示了哪些模型“知道自己不知道什么”，哪些模型只是由于标准放置而显得经过了良好校准——这种区别对模型选择、部署和人类与人工智能的协作有直接影响。预注册分析；代码和数据公开。
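
The quantities named in this abstract can be sketched with textbook signal detection definitions. This is a minimal illustration, not the paper's code: the function names, the log-linear correction, and the way trials are counted are all assumptions, and fitting meta-d' itself (which needs a model fit over confidence ratings) is left out, so `m_ratio` takes an already estimated meta-d' as input.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Type-1 sensitivity: d' = z(hit rate) - z(false-alarm rate)."""
    # Log-linear correction (assumption): add 0.5 per cell so rates never hit 0 or 1.
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)


def auroc2(confidences, correct):
    """Type-2 AUROC: chance that a correct trial carries higher confidence
    than an incorrect one, with ties counted as half."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect trial")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def m_ratio(meta_d, d):
    """Metacognitive efficiency: meta-d' divided by d'."""
    return meta_d / d
```

Because AUROC_2 is computed directly from confidence rankings while M-ratio normalizes by d', the two can rank models differently, which is the contrast the abstract draws.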

Title: To Write or to Automate Linguistic Prompts, That Is the Question

Authors: Marina Sánchez-Torrón, Daria Akselrod, Jason Rauchwerk
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.25169
Pdf URL: https://arxiv.org/pdf/2603.25169
Copy Paste: [[2603.25169]] To Write or to Automate Linguistic Prompts, That Is the Question(https://arxiv.org/abs/2603.25169)
Keywords: llm, prompt
Abstract: LLM performance is highly sensitive to prompt design, yet whether automatic prompt optimization can replace expert prompt engineering in linguistic tasks remains unexplored. We present the first systematic comparison of hand-crafted zero-shot expert prompts, base DSPy signatures, and GEPA-optimized DSPy signatures across translation, terminology insertion, and language quality assessment, evaluating five model configurations. Results are task-dependent. In terminology insertion, optimized and manual prompts produce mostly statistically indistinguishable quality. In translation, each approach wins on different models. In LQA, expert prompts achieve stronger error detection while optimization improves characterization. Across all tasks, GEPA elevates minimal DSPy signatures, and the majority of expert-optimized comparisons show no statistically significant difference. We note that the comparison is asymmetric: GEPA optimization searches programmatically over gold-standard splits, whereas expert prompts require in principle no labeled data, relying instead on domain expertise and iterative refinement.
摘要：LLM 的表现对提示设计高度敏感，但自动提示优化是否可以取代语言任务中的专家提示工程仍有待探索。我们首次对手工制作的零样本专家提示、基本 DSPy 签名和 GEPA 优化的 DSPy 签名在翻译、术语插入和语言质量评估方面进行了系统比较，评估了五种模型配置。结果取决于任务。在术语插入中，优化和手动提示产生的质量在统计上几乎无法区分。在翻译中，每种方法都在不同的模型上获胜。在 LQA 中，专家提示可实现更强的错误检测，同时优化可改善表征。在所有任务中，GEPA 都提升了最小的 DSPy 特征，并且大多数专家优化的比较显示没有统计上的显着差异。我们注意到这种比较是不对称的：GEPA 优化以编程方式搜索黄金标准分割，而专家提示原则上不需要标记数据，而是依赖于领域专业知识和迭代细化。

Title: Prompt Attack Detection with LLM-as-a-Judge and Mixture-of-Models

Authors: Hieu Xuan Le, Benjamin Goh, Quy Anh Tang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.25176
Pdf URL: https://arxiv.org/pdf/2603.25176
Copy Paste: [[2603.25176]] Prompt Attack Detection with LLM-as-a-Judge and Mixture-of-Models(https://arxiv.org/abs/2603.25176)
Keywords: language model, llm, prompt, chat
Abstract: Prompt attacks, including jailbreaks and prompt injections, pose a critical security risk to Large Language Model (LLM) systems. In production, guardrails must mitigate these attacks under strict low-latency constraints, resulting in a deployment gap in which lightweight classifiers and rule-based systems struggle to generalize under distribution shift, while high-capacity LLM-based judges remain too slow or costly for live enforcement. In this work, we examine whether lightweight, general-purpose LLMs can reliably serve as security judges under real-world production constraints. Through careful prompt and output design, lightweight LLMs are guided through a structured reasoning process involving explicit intent decomposition, safety-signal verification, harm assessment, and self-reflection. We evaluate our method on a curated dataset combining benign queries from real-world chatbots with adversarial prompts generated via automated red teaming (ART), covering diverse and evolving patterns. Our results show that general-purpose LLMs, such as gemini-2.0-flash-lite-001, can serve as effective low-latency judges for live guardrails. This configuration is currently deployed in production as a centralized guardrail service for public service chatbots in Singapore. We additionally evaluate a Mixture-of-Models (MoM) setting to assess whether aggregating multiple LLM judges improves prompt-attack detection performance relative to single-model judges, with only modest gains observed.
摘要：即时攻击，包括越狱和即时注入，对大型语言模型 (LLM) 系统构成严重的安全风险。在生产中，护栏必须在严格的低延迟约束下缓解这些攻击，从而导致部署差距，轻量级分类器和基于规则的系统难以在分布转移下泛化，而基于 LLM 的大容量法官对于实时执行来说仍然太慢或成本太高。在这项工作中，我们研究了轻量级、通用的法学硕士是否可以在现实世界的生产限制下可靠地充当安全法官。通过仔细的提示和输出设计，轻量级法学硕士被引导通过结构化推理过程，包括明确的意图分解、安全信号验证、危害评估和自我反思。我们在精心策划的数据集上评估我们的方法，该数据集结合了来自现实世界聊天机器人的良性查询和通过自动红队（ART）生成的对抗性提示，涵盖了多样化和不断发展的模式。我们的结果表明，通用的 LLM，例如 gemini-2.0-flash-lite-001，可以作为实时护栏的有效低延迟法官。此配置目前在生产中部署为新加坡公共服务聊天机器人的集中护栏服务。我们还评估了模型混合（MoM）设置，以评估聚合多个 LLM 法官是否可以相对于单模型法官提高即时攻击检测性能，但仅观察到适度的收益。

Title: Probing the Lack of Stable Internal Beliefs in LLMs

Authors: Yifan Luo, Kangping Xu, Yanzhen Lu, Yang Yuan, Andrew Chi-Chih Yao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.25187
Pdf URL: https://arxiv.org/pdf/2603.25187
Copy Paste: [[2603.25187]] Probing the Lack of Stable Internal Beliefs in LLMs(https://arxiv.org/abs/2603.25187)
Keywords: language model, llm
Abstract: Persona-driven large language models (LLMs) require consistent behavioral tendencies across interactions to simulate human-like personality traits, such as persistence or reliability. However, current LLMs often lack stable internal representations that anchor their responses over extended dialogues. This work explores whether LLMs can maintain "implicit consistency", defined as persistent adherence to an unstated goal in multi-turn interactions. We designed a 20-question-style riddle game paradigm where an LLM is tasked with secretly selecting a target and responding to users' guesses with "yes/no" answers. Through evaluations, we find that LLMs struggle to preserve latent consistency: their implicit "goals" shift across turns unless explicitly provided their selected target in context. These findings highlight critical limitations in the building of persona-driven LLMs and underscore the need for mechanisms that anchor implicit goals over time, which is a key to realistic personality modeling in interactive applications such as dialogue systems.
摘要：角色驱动的大语言模型 (LLM) 需要在交互过程中保持一致的行为倾向，以模拟类人的人格特征，例如持久性或可靠性。然而，目前的法学硕士通常缺乏稳定的内部代表，无法将他们的反应锚定在扩展对话上。这项工作探讨了法学硕士是否可以保持“隐式一致性”，其定义为在多轮交互中持续遵守未声明的目标。我们设计了一个包含 20 个问题的谜语游戏范例，其中法学硕士的任务是秘密选择一个目标，并用“是/否”的答案来回答用户的猜测。通过评估，我们发现法学硕士很难保持潜在的一致性：除非在上下文中明确提供他们选择的目标，否则他们隐含的“目标”会不断变化。这些发现凸显了建立角色驱动的法学硕士的关键局限性，并强调需要随着时间的推移锚定隐含目标的机制，这是对话系统等交互式应用中现实人格建模的关键。

Title: A Decade-Scale Benchmark Evaluating LLMs' Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations

Authors: Andong Tan, Shuyu Dai, Jinglu Wang, Fengtao Zhou, Yan Lu, Xi Wang, Yingcong Chen, Can Yang, Shujie Liu, Hao Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.25196
Pdf URL: https://arxiv.org/pdf/2603.25196
Copy Paste: [[2603.25196]] A Decade-Scale Benchmark Evaluating LLMs' Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations(https://arxiv.org/abs/2603.25196)
Keywords: language model, llm
Abstract: Clinical practice guidelines (CPGs) play a pivotal role in ensuring evidence-based decision-making and improving patient outcomes. While Large Language Models (LLMs) are increasingly deployed in healthcare scenarios, it is unclear to what extent LLMs can identify and adhere to CPGs during conversations. To address this gap, we introduce CPGBench, an automated framework benchmarking the clinical guideline detection and adherence capabilities of LLMs in multi-turn conversations. We collect 3,418 CPG documents from 9 countries/regions and 2 international organizations published in the last decade, spanning 24 specialties. From these documents, we extract 32,155 clinical recommendations with corresponding publication institute, date, country, specialty, recommendation strength, evidence level, etc. One multi-turn conversation is generated for each recommendation to evaluate the detection and adherence capabilities of 8 leading LLMs. We find that 71.1%-89.6% of recommendations can be correctly detected, while only 3.6%-29.7% of the corresponding titles can be correctly referenced, revealing the gap between knowing the guideline contents and knowing where they come from. The adherence rates range from 21.8% to 63.2% across models, indicating a large gap between knowing the guidelines and being able to apply them. To confirm the validity of our automatic analysis, we further conduct a comprehensive human evaluation involving 56 clinicians from different specialties. To our knowledge, CPGBench is the first benchmark systematically revealing which clinical recommendations LLMs fail to detect or adhere to during conversations. Given that each clinical recommendation may affect a large population and that clinical applications are inherently safety critical, addressing these gaps is crucial for the safe and responsible deployment of LLMs in real-world clinical practice.
摘要：临床实践指南 (CPG) 在确保基于证据的决策和改善患者治疗结果方面发挥着关键作用。虽然大型语言模型 (LLM) 越来越多地部署在医疗保健场景中，但尚不清楚 LLM 在对话过程中可以在多大程度上识别并遵守 CPG。为了解决这一差距，我们引入了 CPGBench，这是一个自动化框架，用于对法学硕士在多轮对话中的临床指南检测和遵守能力进行基准测试。我们收集了过去十年中来自 9 个国家/地区和 2 个国际组织发表的 3,418 篇 CPG 文献，涵盖 24 个专业。从这些文件中，我们提取了 32,155 条临床建议，以及相应的出版机构、日期、国家、专业、推荐强度、证据级别等。每条建议都会相应生成一个多轮对话，以评估 8 位领先的法学硕士的检测和依从能力。我们发现，71.1%-89.6%的推荐可以被正确检测到，而只有3.6%-29.7%的相应标题可以被正确引用，揭示了了解指南内容和它们来自哪里之间的差距。不同模型的遵守率从 21.8% 到 63.2% 不等，这表明了解指南和能够应用指南之间存在很大差距。为了确认我们自动分析的有效性，我们进一步对来自不同专业的 56 名临床医生进行了全面的人体评估。据我们所知，CPGBench 是第一个系统地揭示法学硕士在对话中未能发现或遵守哪些临床建议的基准。鉴于每项临床建议都可能影响大量人群，并且临床应用本质上对安全至关重要，因此解决这些差距对于在现实世界的临床实践中安全、负责任地部署法学硕士至关重要。

Title: SafeMath: Inference-time Safety improves Math Accuracy

Authors: Sagnik Basu, Subhrajit Mitra, Aman Juneja, Somnath Banerjee, Rima Hazra, Animesh Mukherjee
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2603.25201
Pdf URL: https://arxiv.org/pdf/2603.25201
Copy Paste: [[2603.25201]] SafeMath: Inference-time Safety improves Math Accuracy(https://arxiv.org/abs/2603.25201)
Keywords: llm
Abstract: Recent research points toward LLMs being manipulated through adversarial and seemingly benign inputs, resulting in harmful, biased, or policy-violating outputs. In this paper, we study an underexplored issue concerning harmful and toxic mathematical word problems. We show that math questions, particularly those framed as natural language narratives, can serve as a subtle medium for propagating biased, unethical, or psychologically harmful content, with heightened risks in educational settings involving children. To support a systematic study of this phenomenon, we introduce ToxicGSM, a dataset of 1.9k arithmetic problems in which harmful or sensitive context is embedded while preserving mathematically well-defined reasoning tasks. Using this dataset, we audit the behaviour of existing LLMs and analyse the trade-offs between safety enforcement and mathematical correctness. We further propose SafeMath -- a safety alignment technique that reduces harmful outputs while maintaining, and in some cases improving, mathematical reasoning performance. Our results highlight the importance of disentangling linguistic harm from math reasoning and demonstrate that effective safety alignment need not come at the cost of accuracy. We release the source code and dataset at this https URL.
摘要：最近的研究指出，法学硕士受到对抗性和看似良性的输入的操纵，导致有害的、有偏见的或违反政策的输出。在本文中，我们研究了一个尚未充分探索的有关有害和有毒数学应用问题的问题。我们表明，数学问题，特别是那些被定义为自然语言叙述的问题，可以作为传播偏见、不道德或心理有害内容的微妙媒介，在涉及儿童的教育环境中风险更高。为了支持对这种现象的系统研究，我们引入了 ToxicGSM，这是一个包含 1.9k 算术问题的数据集，其中嵌入了有害或敏感的上下文，同时保留了数学上明确定义的推理任务。使用该数据集，我们审核现有法学硕士的行为，并分析安全执行和数学正确性之间的权衡。我们进一步提出 SafeMath——一种安全对齐技术，可以减少有害输出，同时保持（在某些情况下提高）数学推理性能。我们的结果强调了将语言危害与数学推理分开的重要性，并证明有效的安全调整不必以牺牲准确性为代价。我们在此 https URL 发布源代码和数据集。

Title: Comparing Natural and Synthetic Structured Data: A Study of the Passive Verb Alternation in French and Italian

Authors: Giuseppe Samo, Paola Merlo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.25227
Pdf URL: https://arxiv.org/pdf/2603.25227
Copy Paste: [[2603.25227]] Comparing Natural and Synthetic Structured Data: A Study of the Passive Verb Alternation in French and Italian(https://arxiv.org/abs/2603.25227)
Keywords: language model, llm
Abstract: This study compares the impact of natural and synthetic data on training and evaluating large language models (LLMs), using the case of passive verb alternation in French and Italian. We use Blackbird Language Matrices (BLMs), structured datasets designed to probe linguistic knowledge of underlying patterns across sentence sets. We compare structured templates instantiated with natural sentences extracted from Universal Dependencies to structured templates of synthetic sentences. Experiments show that while models achieve ceiling performance when trained and tested on synthetic datasets, they do not reliably generalize to natural sentences. In contrast, models trained on natural data exhibit robust performance across both natural and synthetic test suites, demonstrating their superior ability to capture abstract linguistic patterns. These results corroborate the value of natural data and of structured setups in linguistic evaluation for probing LLMs' syntactic and semantic knowledge.
摘要：本研究以法语和意大利语的被动动词交替为例，比较了自然数据和合成数据对训练和评估大型语言模型 (LLM) 的影响。我们使用 Blackbird 语言矩阵 (BLM)，这是一种结构化数据集，旨在探索跨句子集的潜在模式的语言知识。我们将使用从通用依赖关系中提取的自然句子实例化的结构化模板与合成句子的结构化模板进行比较。实验表明，虽然模型在合成数据集上进行训练和测试时达到了最高性能，但它们不能可靠地推广到自然句子。相比之下，在自然数据上训练的模型在自然和合成测试套件中都表现出强大的性能，展示了它们捕获抽象语言模式的卓越能力。这些结果证实了自然数据和结构化设置在语言评估中对于探索法学硕士句法和语义知识的价值。

Title: MolQuest: A Benchmark for Agentic Evaluation of Abductive Reasoning in Chemical Structure Elucidation

Authors: Taolin Han, Shuang Wu, Jinghang Wang, Yuhao Zhou, Renquan Lv, Bing Zhao, Wei Hu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.25253
Pdf URL: https://arxiv.org/pdf/2603.25253
Copy Paste: [[2603.25253]] MolQuest: A Benchmark for Agentic Evaluation of Abductive Reasoning in Chemical Structure Elucidation(https://arxiv.org/abs/2603.25253)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) hold considerable potential for advancing scientific discovery, yet systematic assessment of their dynamic reasoning in real-world research remains limited. Current scientific evaluation benchmarks predominantly rely on static, single-turn Question Answering (QA) formats, which are inadequate for measuring model performance in complex scientific tasks that require multi-step iteration and experimental interaction. To address this gap, we introduce MolQuest, a novel agent-based evaluation framework for molecular structure elucidation built upon authentic chemical experimental data. Unlike existing datasets, MolQuest formalizes molecular structure elucidation as a multi-turn interactive task, requiring models to proactively plan experimental steps, integrate heterogeneous spectral sources (e.g., NMR, MS), and iteratively refine structural hypotheses. This framework systematically evaluates LLMs' abductive reasoning and strategic decision-making abilities within a vast and complex chemical space. Empirical results reveal that contemporary frontier models exhibit significant limitations in authentic scientific scenarios: notably, even state-of-the-art (SOTA) models achieve an accuracy of only approximately 50%, while the performance of most other models remains below the 30% threshold. This work provides a reproducible and extensible framework for science-oriented LLM evaluation. Our findings highlight the critical gap in current LLMs' strategic scientific reasoning, setting a clear direction for future research toward AI that can actively participate in the scientific process.
摘要：大型语言模型（LLM）在推进科学发现方面具有巨大潜力，但对其在现实世界研究中的动态推理的系统评估仍然有限。当前的科学评估基准主要依赖于静态的单轮问答（QA）格式，这不足以衡量需要多步迭代和实验交互的复杂科学任务中的模型性能。为了解决这一差距，我们引入了 MolQuest，这是一种基于真实代理的新型评估框架，用于基于真实的化学实验数据来阐明分子结构。与现有数据集不同，MolQuest 将分子结构阐明形式化为多轮交互任务，要求模型主动规划实验步骤、集成异构光谱源（例如 NMR、MS）并迭代完善结构假设。该框架系统地评估了法学硕士在广阔而复杂的化学空间中的溯因推理和战略决策能力。实证结果表明，当代前沿模型在真实的科学场景中表现出显着的局限性：值得注意的是，即使是最先进的 (SOTA) 模型也只能达到约 50% 的准确度，而大多数其他模型的性能仍低于 30% 的阈值。这项工作为以科学为导向的法学硕士评估提供了一个可重复和可扩展的框架，我们的研究结果突出了当前法学硕士战略科学推理中的关键差距，为未来能够积极参与科学过程的人工智能研究设定了明确的方向。

Title: CRAFT: Grounded Multi-Agent Coordination Under Partial Information

Authors: Abhijnan Nath, Hannah VanderHoeven, Nikhil Krishnaswamy
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.25268
Pdf URL: https://arxiv.org/pdf/2603.25268
Copy Paste: [[2603.25268]] CRAFT: Grounded Multi-Agent Coordination Under Partial Information(https://arxiv.org/abs/2603.25268)
Keywords: language model, agent
Abstract: We introduce CRAFT, a multi-agent benchmark for evaluating pragmatic communication in large language models under strict partial information. In this setting, multiple agents with complementary but incomplete views must coordinate through natural language to construct a shared 3D structure that no single agent can fully observe. We formalize this problem as a multi-sender pragmatic reasoning task and provide a diagnostic framework that decomposes failures into spatial grounding, belief modeling and pragmatic communication errors, including a taxonomy of behavioral failure profiles in both frontier and open-weight models. Across a diverse set of models, comprising 8 open-weight and 7 frontier models (reasoning models among them), we find that stronger reasoning ability does not reliably translate to better coordination: smaller open-weight models often match or outperform frontier systems, and improved individual communication does not guarantee successful collaboration. These results suggest that multi-agent coordination remains a fundamentally unsolved challenge for current language models. Our code can be found at this https URL.
摘要：我们引入了 CRAFT，这是一种多智能体基准，用于在严格的部分信息下评估大语言模型中的语用交流。在这种情况下，具有互补但不完整视图的多个智能体必须通过自然语言进行协调，以构建一个共享的 3D 结构，而单个智能体无法完全观察到该结构。我们将这个问题形式化为多发送者实用推理任务，并提供一个诊断框架，将失败分解为空间基础、信念建模和实用沟通错误，包括前沿模型和开放权重模型中行为失败概况的分类。在一系列不同的模型中，包括 8 个开放权重模型和 7 个包含推理模型的前沿模型，我们发现更强的推理能力并不能可靠地转化为更好的协调：较小的开放权重模型通常匹配或优于前沿系统，而改进的个人沟通并不能保证成功的协作。这些结果表明，多智能体协调对于当前语言模型来说仍然是一个根本上未解决的挑战。我们的代码可以在这个 https URL 中找到

Title: When Hate Meets Facts: LLMs-in-the-Loop for Check-worthiness Detection in Hate Speech

Authors: Nicolás Benjamín Ocampo, Tommaso Caselli, Davide Ceolin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.25269
Pdf URL: https://arxiv.org/pdf/2603.25269
Copy Paste: [[2603.25269]] When Hate Meets Facts: LLMs-in-the-Loop for Check-worthiness Detection in Hate Speech(https://arxiv.org/abs/2603.25269)
Keywords: llm
Abstract: Hateful content online is often expressed using fact-like, not necessarily correct information, especially in coordinated online harassment campaigns and extremist propaganda. Failing to jointly address hate speech (HS) and misinformation can deepen prejudice, reinforce harmful stereotypes, and expose bystanders to psychological distress, while polluting public debate. Moreover, these messages require more effort from content moderators because they must assess both harmfulness and veracity, i.e., fact-check them. To address this challenge, we release WSF-ARG+, the first dataset that combines hate speech with check-worthiness information. We also introduce a novel LLM-in-the-loop framework to facilitate the annotation of check-worthy claims. We run our framework, testing it with 12 open-weight LLMs of different sizes and architectures. We validate it through extensive human evaluation, and show that our LLM-in-the-loop framework reduces human effort without compromising the annotation quality of the data. Finally, we show that HS messages with check-worthy claims show significantly higher harassment and hate, and that incorporating check-worthiness labels improves LLM-based HS detection by up to 0.213 macro-F1, and by 0.154 macro-F1 on average for large models.
摘要：网上的仇恨内容通常使用事实性的、不一定正确的信息来表达，特别是在协调一致的在线骚扰活动和极端主义宣传中。如果未能共同解决仇恨言论 (HS) 和错误信息，可能会加深偏见，强化有害的刻板印象，并使旁观者面临心理困扰，同时污染公共辩论。此外，这些消息需要内容审核者付出更多努力，因为他们必须评估危害性和真实性，即对它们进行事实检查。为了应对这一挑战，我们发布了 WSF-ARG+，这是第一个将仇恨言论与检查价值信息相结合的数据集。我们还引入了一种新颖的 LLM-in-the-loop 框架，以方便对值得检查的声明进行注释。我们运行我们的框架，并使用 12 个不同规模和架构的开放权重法学硕士进行测试。我们通过广泛的人工评估对其进行了验证，并表明我们的法学硕士循环框架减少了人工工作量，同时又不影响数据的注释质量。最后，我们表明，具有值得检查的声明的 HS 消息显示出明显更高的骚扰和仇恨，并且结合值得检查的标签可将大型模型的基于 LLM 的 HS 检测平均提高至 0.213 宏 F1 和 0.154 宏 F1。
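
The macro-F1 figures quoted in this abstract refer to the unweighted mean of per-class F1 scores. As a minimal sketch of that standard metric (the function name and the toy labels below are illustrative assumptions, not taken from the paper):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy binary example (hypothetical labels: 1 = hateful, 0 = not hateful).
gold = [1, 1, 0, 0]
pred = [1, 0, 0, 0]
score = macro_f1(gold, pred)
```

Because every class counts equally, macro-F1 rewards gains on the minority class, which matters when hateful messages are rare in the data.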

Title: Separate Before You Compress: The WWHO Tokenization Architecture

Authors: Kusal Darshana
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.25309
Pdf URL: https://arxiv.org/pdf/2603.25309
Copy Paste: [[2603.25309]] Separate Before You Compress: The WWHO Tokenization Architecture(https://arxiv.org/abs/2603.25309)
Keywords: language model, llm
Abstract: Current Large Language Models (LLMs) mostly use BPE (Byte Pair Encoding) based tokenizers, which are very effective for simply structured Latin scripts such as English. However, standard BPE tokenizers struggle to process complex Abugida scripts due to their structural complexity. The problem is that these tokenizers break complex conjuncts, which are multi-codepoint grapheme clusters, into meaningless sub-character units. This degrades the LLM's reasoning efficiency by forcing it to learn basic orthographic structures at inference time and raises inference costs, resulting in a significant "Token Tax" for the Global South. We propose a new three-layer architecture, the WWHO (Where-What-How Often), and an algorithm named SGPE (Syllable-aware Grapheme Pair Encoding) that separates the linguistic rules of the script from the statistical compression process while enabling seamless multilingual tokenization. Using Sinhala and Devanagari (Hindi/Sanskrit) as highly complex Abugida scripts, we trained WWHO on a cleaned 30-million-sentence dataset and evaluated on a 1,499,950-sentence test set. For Sinhala, SGPE achieves a Token to Word Ratio (TWR) of 1.274 with 4.83 characters per token, representing a 61.7 percent reduction in tokens compared to OpenAI's o200k base. For Hindi, it achieves a TWR of 1.181 (27.0 percent reduction vs o200k). On the mixed-script (Sinhala, Devanagari, and English) dataset, SGPE achieves an overall TWR of 1.240, representing token reductions of 36.7 percent, 39.6 percent, and 60.2 percent relative to o200k base, Llama 4 Scout, and DeepSeek V3, respectively. This effectively extends the usable context window by up to 4.38 times for these Abugida languages while ensuring a Linguistic Zero-Breakage Guarantee, which ensures that no valid syllable is ever split across multiple tokens.
摘要：当前的大型语言模型（LLM）大多使用基于 BPE（字节对编码）的分词器，这对于简单结构的拉丁文字（例如英语）非常有效。然而，由于结构复杂性，标准 BPE 分词器很难处理复杂的 Abugida 脚本。问题在于，这些分词器将复杂的连接词（多代码点字素簇）分解为无意义的子字符单元。这迫使法学硕士在推理时学习基本的拼写结构，从而降低了法学硕士的推理效率，并增加了推理成本，导致南半球国家征收巨额“代币税”。我们提出了一种新的三层架构 WWHO（Where-What-How Every）和一种名为 SGPE（音节感知字素对编码）的算法，该算法将脚本的语言规则与统计压缩过程分开，同时实现无缝的多语言标记化。我们使用僧伽罗语和梵文（印地语/梵语）作为高度复杂的阿布吉达文字，在经过清理的 3000 万个句子数据集上训练 WWHO，并在 1,499,950 个句子测试集上进行评估。对于僧伽罗语，SGPE 的标记与单词比率 (TWR) 为 1.274，每个标记 4.83 个字符，与 OpenAI 的 o200k 基础相比，标记数量减少了 61.7%。对于印地语，它的 TWR 为 1.181（与 200k 相比减少了 27.0%）。在混合文字（僧伽罗文、梵文和英语）数据集上，SGPE 的总体 TWR 为 1.240，相对于 o200k Base、Llama 4 Scout 和 DeepSeek V3 分别减少了 36.7%、39.6% 和 60.2%。这有效地将这些 Abugida 语言的可用上下文窗口扩展了 4.38 倍，同时确保了语言零中断保证，从而确保没有有效音节被分割到多个标记中。
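
The two headline measurements in this abstract, Token to Word Ratio and percentage token reduction, are simple ratios. The sketch below shows how they could be computed under stated assumptions: words are counted by whitespace splitting, and `tokenize` is a hypothetical callable standing in for any tokenizer. Neither choice is confirmed by the paper.

```python
def token_to_word_ratio(n_tokens, n_words):
    """TWR: average number of tokens spent per word (lower is better)."""
    if n_words <= 0:
        raise ValueError("n_words must be positive")
    return n_tokens / n_words


def token_reduction_percent(baseline_tokens, new_tokens):
    """Percentage of tokens saved relative to a baseline tokenizer."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 100.0 * (baseline_tokens - new_tokens) / baseline_tokens


def corpus_twr(sentences, tokenize):
    """TWR over a corpus; `tokenize` is a hypothetical str -> list callable."""
    n_tokens = sum(len(tokenize(s)) for s in sentences)
    n_words = sum(len(s.split()) for s in sentences)  # whitespace words (assumption)
    return token_to_word_ratio(n_tokens, n_words)


# Usage with a stand-in tokenizer that emits one token per character.
example_twr = corpus_twr(["ab cd", "ef"], tokenize=list)
```

A TWR of 1.274 means roughly 1,274 tokens per 1,000 words, so a lower TWR leaves more of a fixed context window for actual text.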

Title: Beyond Detection: Rethinking Education in the Age of AI-writing

Authors: Maria Marina, Alexander Panchenko, Vasily Konovalov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.25329
Pdf URL: https://arxiv.org/pdf/2603.25329
Copy Paste: [[2603.25329]] Beyond Detection: Rethinking Education in the Age of AI-writing(https://arxiv.org/abs/2603.25329)
Keywords: gpt, chat
Abstract: As generative AI tools like ChatGPT enter classrooms, workplaces and everyday thinking, writing is at risk of becoming a formality -- outsourced, automated and stripped of its cognitive value. But writing is not just output; it is how we learn to think. This paper explores what is lost when we let machines write for us, drawing on cognitive psychology, educational theory and real classroom practices. We argue that the process of writing -- messy, slow, often frustrating -- is where human deep learning happens. The paper also explores the current possibilities of AI-text detection, how educators can adapt through smarter pedagogy rather than bans, and why the ability to recognize machine-generated language may become a critical literacy of the 21st century. In a world where writing can be faked, learning cannot.
摘要：随着像 ChatGPT 这样的生成式人工智能工具进入课堂、工作场所和日常思维，写作有可能成为一种形式——外包、自动化并剥夺其认知价值。但写作不仅仅是输出，更是输出。这就是我们学习思考的方式。本文借鉴认知心理学、教育理论和真实的课堂实践，探讨了当我们让机器为我们写作时会失去什么。我们认为，写作的过程——混乱、缓慢、常常令人沮丧——正是人类深度学习发生的地方。该论文还探讨了当前人工智能文本检测的可能性、教育工作者如何通过更智能的教学法而不是禁令来适应，以及为什么识别机器生成语言的能力可能成为 21 世纪的关键素养。在一个写作可以伪造的世界里，学习却不能。

Title: Adaptive Chunking: Optimizing Chunking-Method Selection for RAG

Authors: Paulo Roberto de Moura Júnior, Jean Lelong, Annabelle Blangero
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2603.25333
Pdf URL: https://arxiv.org/pdf/2603.25333
Copy Paste: [[2603.25333]] Adaptive Chunking: Optimizing Chunking-Method Selection for RAG(https://arxiv.org/abs/2603.25333)
Keywords: llm, prompt, retrieval-augmented generation
Abstract: The effectiveness of Retrieval-Augmented Generation (RAG) is highly dependent on how documents are chunked, that is, segmented into smaller units for indexing and retrieval. Yet, commonly used "one-size-fits-all" approaches often fail to capture the nuanced structure and semantics of diverse texts. Despite its central role, chunking lacks a dedicated evaluation framework, making it difficult to assess and compare strategies independently of downstream performance. We challenge this paradigm by introducing Adaptive Chunking, a framework that selects the most suitable chunking strategy for each document based on a set of five novel intrinsic, document-based metrics: References Completeness (RC), Intrachunk Cohesion (ICC), Document Contextual Coherence (DCC), Block Integrity (BI), and Size Compliance (SC), which directly assess chunking quality across key dimensions. To support this framework, we also introduce two new chunkers, an LLM-regex splitter and a split-then-merge recursive splitter, alongside targeted post-processing techniques. On a diverse corpus spanning legal, technical, and social science domains, our metric-guided adaptive method significantly improves downstream RAG performance. Without changing models or prompts, our framework improves RAG outcomes, raising answer correctness to 72% (from 62-64%) and increasing the number of successfully answered questions by over 30% (65 vs. 49). These results demonstrate that adaptive, document-aware chunking, guided by a complementary suite of intrinsic metrics, offers a practical and effective path to more robust RAG systems. Code available at this https URL.
摘要：检索增强生成 (RAG) 的有效性很大程度上取决于文档的分块方式，即将文档分割成更小的单元以进行索引和检索。然而，常用的“一刀切”方法往往无法捕捉不同文本的细微结构和语义。尽管分块具有核心作用，但它缺乏专门的评估框架，因此很难独立于下游性能来评估和比较策略。我们通过引入自适应分块来挑战这种范式，该框架根据一组五个新颖的内在的、基于文档的指标为每个文档选择最合适的分块策略：参考完整性（RC）、分块内聚（ICC）、文档上下文连贯性（DCC）、块完整性（BI）和大小合规性（SC），它们直接评估跨关键维度的分块质量。为了支持这个框架，我们还引入了两个新的分块器，一个 LLM-regex 拆分器和一个拆分然后合并递归拆分器，以及有针对性的后处理技术。在跨越法律、技术和社会科学领域的多样化语料库中，我们的度量引导自适应方法显着提高了下游 RAG 的性能。在不改变模型或提示的情况下，我们的框架提高了 RAG 结果，将答案正确率提高到 72%（从 62-64%），并将成功回答的问题数量增加了 30% 以上（65 比 49）。这些结果表明，自适应的、文档感知的分块在一套互补的内在指标的指导下，为更强大的 RAG 系统提供了一条实用且有效的途径。代码可在此 https URL 获取。

Title: Large Language Model as Token Compressor and Decompressor

Authors: Wenbing Li, Zikai Song, Jielei Zhang, Tianhao Zhao, Junkai Lin, Yiran Wang, Wei Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.25340
Pdf URL: https://arxiv.org/pdf/2603.25340
Copy Paste: [[2603.25340]] Large Language Model as Token Compressor and Decompressor(https://arxiv.org/abs/2603.25340)
Keywords: language model, llm, prompt
Abstract: In this paper, we establish the novel insight that an off-the-shelf LLM can function as an excellent token compressor and decompressor. To demonstrate this, we design a self-expressive autoencoding learning framework that fine-tunes a pretrained LLM to translate long texts into a compact internal language of discrete, variable-length latent codes, termed Z-tokens, and to reconstruct the original text exactly from them. The resulting representation is content-adaptive: semantically dense segments receive more Z-tokens, while redundant or predictable regions are aggressively compressed, via lightweight LoRA-based adapter heads. Empirically, our method achieves up to 18 times token reduction on Wikipedia, CNN/DailyMail, HotpotQA, and Qulac-style long-query datasets, while preserving reconstruction fidelity and downstream performance. This simple yet effective design supports applications including prompt compression and autoregressive generation directly in the Z-token space, offering a potential pathway toward token-efficient long-context reasoning.
摘要：在本文中，我们提出了一个新颖的见解：现成的法学硕士可以充当出色的令牌压缩器和解压缩器。为了进行演示，我们设计了一个自我表达的自动编码学习框架，对预训练的 LLM 进行微调，将长文本翻译成离散、可变长度潜在代码（称为 Z 令牌）的紧凑内部语言，并根据它们准确地重建原始文本。由此产生的表示是内容自适应的：语义密集的片段接收更多的 Z 令牌，而冗余或可预测的区域则通过基于 LoRA 的轻量级适配器头进行积极压缩。根据经验，我们的方法在 Wikipedia、CNN/DailyMail、HotpotQA 和 Qulac 式长查询数据集上实现了高达 18 倍的标记减少，同时保留了重建保真度和下游性能。这种简单而有效的设计支持直接在 Z 令牌空间中进行快速压缩和自回归生成等应用，为令牌高效的长上下文推理提供了潜在的途径。

Title: TAPO: Translation Augmented Policy Optimization for Multilingual Mathematical Reasoning

Authors: Xu Huang, Zhejian Lai, Zixian Huang, Jiajun Chen, Shujian Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.25419
Pdf URL: https://arxiv.org/pdf/2603.25419
Copy Paste: [[2603.25419]] TAPO: Translation Augmented Policy Optimization for Multilingual Mathematical Reasoning(https://arxiv.org/abs/2603.25419)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in English mathematical reasoning, yet a significant performance disparity persists in multilingual contexts, largely attributed to deficiencies in language understanding. To bridge this gap, we introduce Translation-Augmented Policy Optimization (TAPO), a novel reinforcement learning framework built upon GRPO. TAPO enforces an explicit alignment strategy where the model leverages English as a pivot and follows an understand-then-reason paradigm. Crucially, we employ a step-level relative advantage mechanism that decouples understanding from reasoning, allowing the integration of translation quality rewards without introducing optimization conflicts. Extensive experiments reveal that TAPO effectively synergizes language understanding with reasoning capabilities and is compatible with various models. It outperforms baseline methods in both multilingual mathematical reasoning and translation tasks, while generalizing well to unseen languages and out-of-domain tasks.
摘要：大型语言模型（LLM）在英语数学推理方面表现出了卓越的熟练程度，但在多语言环境中仍然存在显着的表现差异，这在很大程度上归因于语言理解的缺陷。为了弥补这一差距，我们引入了翻译增强策略优化（TAPO），这是一种基于 GRPO 的新型强化学习框架。 TAPO 强制执行明确的对齐策略，其中模型利用英语作为支点，并遵循先理解后推理的范式。至关重要的是，我们采用了一种阶梯级相对优势机制，将理解与推理分离，允许整合翻译质量奖励，而不会引入优化冲突。大量实验表明，TAPO 有效地协同了语言理解和推理能力，并且兼容各种模型。它在多语言数学推理和翻译任务中都优于基线方法，同时很好地推广到未见过的语言和域外任务。
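
TAPO is described as building on GRPO, whose core idea is to score each sampled response against the other responses drawn for the same prompt. The sketch below shows only that commonly described group-relative normalization. It is not TAPO's step-level mechanism, which the abstract does not specify; the function name, the use of population standard deviation, and the `eps` stabilizer are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt's group of sampled responses:
    advantage_i = (r_i - mean(group)) / (std(group) + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std (assumption); eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, rewarded 1 if correct and 0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

When every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal, which is one reason extra reward terms such as translation quality can be useful.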

Title: Navigating the Prompt Space: Improving LLM Classification of Social Science Texts Through Prompt Engineering

Authors: Erkan Gunes, Christoffer Florczak, Tevfik Murat Yildirim
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2603.25422
Pdf URL: https://arxiv.org/pdf/2603.25422
Copy Paste: [[2603.25422]] Navigating the Prompt Space: Improving LLM Classification of Social Science Texts Through Prompt Engineering(https://arxiv.org/abs/2603.25422)
Keywords: language model, llm, prompt
Abstract: Recent developments in text classification using Large Language Models (LLMs) in the social sciences suggest that costs can be cut significantly, while performance can sometimes rival existing computational methods. However, given the wide variance in performance in current tests, we turn to the question of how to maximize performance. In this paper, we focus on prompt context as a possible avenue for increasing accuracy by systematically varying three aspects of prompt engineering: label descriptions, instructional nudges, and few-shot examples. Across two different examples, our tests illustrate that a minimal increase in prompt context yields the highest increase in performance, while further increases in context tend to yield only marginal performance increases thereafter. Alarmingly, increasing prompt context sometimes decreases accuracy. Furthermore, our tests suggest substantial heterogeneity across models, tasks, and batch size, underlining the need for individual validation of each LLM coding task rather than reliance on general rules.
摘要：社会科学中使用大型语言模型 (LLM) 进行文本分类的最新进展表明，成本可以显着降低，而性能有时可以与现有的计算方法相媲美。然而，由于当前测试中的性能差异很大，我们转向如何最大化性能的问题。在本文中，我们将提示上下文作为提高准确性的可能途径，通过系统地改变提示工程的三个方面：标签描述、指导性推动和少样本示例。在两个不同的示例中，我们的测试表明，提示上下文的最小增加会带来最高的性能提升，而上下文的进一步增加只会带来此后边际性能的提升。令人担忧的是，增加提示上下文有时会降低准确性。此外，我们的测试表明模型、任务和批量大小之间存在很大的异质性，这强调了对每个 LLM 编码任务进行单独验证的必要性，而不是依赖一般规则。

Title: Translation Asymmetry in LLMs as a Data Augmentation Factor: A Case Study for 6 Romansh Language Varieties

Authors: Jannis Vamvas, Ignacio Pérez Prat, Angela Heldstab, Dominic P. Fischer, Sina Ahmadi, Rico Sennrich
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.25489
Pdf URL: https://arxiv.org/pdf/2603.25489
Copy Paste: [[2603.25489]] Translation Asymmetry in LLMs as a Data Augmentation Factor: A Case Study for 6 Romansh Language Varieties(https://arxiv.org/abs/2603.25489)
Keywords: llm
Abstract: Recent strategies for low-resource machine translation rely on LLMs to generate synthetic data from higher-resource languages. We find that this method fails for Romansh, because LLMs tend to confuse its 6 distinct language varieties. Our experiments show that instead, the direction of data augmentation should be aligned with the resource gradient between source and target language. This approach surpasses Gemini 3 Pro in the lowest-resource variety of Romansh by 23 BLEU. A human evaluation confirms that our experiments yield the first model that generates fluent translations in the individual Romansh varieties.
摘要：最近的低资源机器翻译策略依赖于 LLM 从高资源语言生成合成数据。我们发现这种方法对于罗曼什语是失败的，因为 LLM 往往会混淆它的 6 种不同的语言变体。我们的实验表明，数据增强的方向应该与源语言和目标语言之间的资源梯度保持一致。这种方法在资源最低的罗曼什语变体上比 Gemini 3 Pro 高出 23 BLEU。人类评估证实，我们的实验产生了第一个模型，可以在各个罗曼什语变体中生成流畅的翻译。

Title: An Experimental Comparison of the Most Popular Approaches to Fake News Detection

Authors: Pietro Dell'Oglio, Alessandro Bondielli, Francesco Marcelloni, Lucia C. Passaro
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.25501
Pdf URL: https://arxiv.org/pdf/2603.25501
Copy Paste: [[2603.25501]] An Experimental Comparison of the Most Popular Approaches to Fake News Detection(https://arxiv.org/abs/2603.25501)
Keywords: language model, llm
Abstract: In recent years, fake news detection has received increasing attention in public debate and scientific research. Despite advances in detection techniques, the production and spread of false information have become more sophisticated, driven by Large Language Models (LLMs) and the amplification power of social media. We present a critical assessment of 12 representative fake news detection approaches, spanning traditional machine learning, deep learning, transformers, and specialized cross-domain architectures. We evaluate these methods on 10 publicly available datasets differing in genre, source, topic, and labeling rationale. We address text-only English fake news detection as a binary classification task by harmonizing labels into "Real" and "Fake" to ensure a consistent evaluation protocol. We acknowledge that label semantics vary across datasets and that harmonization inevitably removes such semantic nuances. Each dataset is treated as a distinct domain. We conduct in-domain, multi-domain and cross-domain experiments to simulate real-world scenarios involving domain shift and out-of-distribution data. Fine-tuned models perform well in-domain but struggle to generalize. Cross-domain architectures can reduce this gap but are data-hungry, while LLMs offer a promising alternative through zero- and few-shot learning. Given inherent dataset confounds and possible pre-training exposure, results should be interpreted as robustness evaluations within this English, text-only protocol.
摘要：近年来，假新闻检测在公众辩论和科学研究中受到越来越多的关注。尽管检测技术取得了进步，但在大型语言模型 (LLM) 和社交媒体的放大能力的推动下，虚假信息的产生和传播变得更加复杂。我们对 12 种具有代表性的假新闻检测方法进行了严格评估，涵盖传统机器学习、深度学习、Transformer 和专门的跨域架构。我们在 10 个不同类型、来源、主题和标签原理的公开数据集上评估这些方法。我们将纯文本英语假新闻检测作为二元分类任务，将标签统一为“真实”和“假”，以确保一致的评估协议。我们承认标签语义因数据集而异，并且协调不可避免地消除了这种语义细微差别。每个数据集都被视为一个不同的域。我们进行域内、多域和跨域实验来模拟涉及域转移和分布外数据的真实场景。微调模型在领域内表现良好，但难以推广。跨域架构可以缩小这种差距，但需要大量数据，而 LLM 通过零样本和少样本学习提供了一种有前景的替代方案。考虑到固有的数据集混淆和可能的预训练暴露，结果应解释为这个纯文本英文协议中的鲁棒性评估。

Title: Humans vs Vision-Language Models: A Unified Measure of Narrative Coherence

Authors: Nikolai Ilinykh, Hyewon Jang, Shalom Lappin, Asad Sayeed, Sharid Loáiciga
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.25537
Pdf URL: https://arxiv.org/pdf/2603.25537
Copy Paste: [[2603.25537]] Humans vs Vision-Language Models: A Unified Measure of Narrative Coherence(https://arxiv.org/abs/2603.25537)
Keywords: language model, prompt
Abstract: We study narrative coherence in visually grounded stories by comparing human-written narratives with those generated by vision-language models (VLMs) on the Visual Writing Prompts corpus. Using a set of metrics that capture different aspects of narrative coherence, including coreference, discourse relation types, topic continuity, character persistence, and multimodal character grounding, we compute a narrative coherence score. We find that VLMs show broadly similar coherence profiles that differ systematically from those of humans. In addition, differences for individual measures are often subtle, but they become clearer when considered jointly. Overall, our results indicate that, despite human-like surface fluency, model narratives exhibit systematic differences from those of humans in how they organise discourse across a visually grounded story. Our code is available at this https URL.
摘要：我们通过将人类书写的叙述与视觉写作提示语料库上的视觉语言模型（VLM）生成的叙述进行比较，研究视觉基础故事中的叙述连贯性。使用一组捕获叙事连贯性不同方面的指标，包括共指、话语关系类型、主题连续性、人物持久性和多模态人物基础，我们计算叙事连贯性得分。我们发现 VLM 显示出大致相似的相干性特征，但与人类的相干性特征存在系统性差异。此外，个别措施的差异通常很微妙，但当综合考虑时，差异就会变得更加明显。总的来说，我们的结果表明，尽管模型叙事表面上与人类相似，但在如何组织基于视觉的故事中的话语方面却表现出与人类的系统性差异。我们的代码可以在这个 https URL 上找到。

Title: PICon: A Multi-Turn Interrogation Framework for Evaluating Persona Agent Consistency

Authors: Minseo Kim, Sujeong Im, Junseong Choi, Junhee Lee, Chaeeun Shim, Edward Choi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.25620
Pdf URL: https://arxiv.org/pdf/2603.25620
Copy Paste: [[2603.25620]] PICon: A Multi-Turn Interrogation Framework for Evaluating Persona Agent Consistency(https://arxiv.org/abs/2603.25620)
Keywords: language model, llm, agent
Abstract: Large language model (LLM)-based persona agents are rapidly being adopted as scalable proxies for human participants across diverse domains. Yet there is no systematic method for verifying whether a persona agent's responses remain free of contradictions and factual inaccuracies throughout an interaction. A principle from interrogation methodology offers a lens: no matter how elaborate a fabricated identity, systematic interrogation will expose its contradictions. We apply this principle to propose PICon, an evaluation framework that probes persona agents through logically chained multi-turn questioning. PICon evaluates consistency along three core dimensions: internal consistency (freedom from self-contradiction), external consistency (alignment with real-world facts), and retest consistency (stability under repetition). Evaluating seven groups of persona agents alongside 63 real human participants, we find that even systems previously reported as highly consistent fail to meet the human baseline across all three dimensions, revealing contradictions and evasive responses under chained questioning. This work provides both a conceptual foundation and a practical methodology for evaluating persona agents before trusting them as substitutes for human participants. We provide the source code and an interactive demo at: this https URL
摘要：基于大语言模型 (LLM) 的角色代理正迅速被采用作为跨不同领域的人类参与者的可扩展代理。然而，还没有系统的方法来验证角色代理的响应在整个交互过程中是否不存在矛盾和事实不准确。审讯方法论的一个原则提供了一个视角：无论多么精心捏造的身份，系统的审讯都会暴露其矛盾。我们应用这一原则提出了 PICon，这是一种通过逻辑链式多轮提问来探测角色代理的评估框架。 PICon 从三个核心维度评估一致性：内部一致性（免于自相矛盾）、外部一致性（与现实世界事实的一致性）和重新测试一致性（重复下的稳定性）。通过评估 7 组角色代理和 63 名真实的人类参与者，我们发现，即使是之前报告为高度一致的系统，也无法在所有三个维度上满足人类基线，揭示了链式提问下的矛盾和回避反应。这项工作为在信任角色代理作为人类参与者的替代品之前评估角色代理提供了概念基础和实用方法。我们在以下位置提供源代码和交互式演示：此 https URL

Title: Beyond Via: Analysis and Estimation of the Impact of Large Language Models in Academic Papers

Authors: Mingmeng Geng, Yuhang Dong, Thierry Poibeau
Subjects: cs.CL, cs.AI, cs.CY, cs.DL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.25638
Pdf URL: https://arxiv.org/pdf/2603.25638
Copy Paste: [[2603.25638]] Beyond Via: Analysis and Estimation of the Impact of Large Language Models in Academic Papers(https://arxiv.org/abs/2603.25638)
Keywords: language model, llm, prompt
Abstract: Through an analysis of arXiv papers, we report several shifts in word usage that are likely driven by large language models (LLMs) but have not previously received sufficient attention, such as the increased frequency of "beyond" and "via" in titles and the decreased frequency of "the" and "of" in abstracts. Due to the similarities among different LLMs, experiments show that current classifiers struggle to accurately determine which specific model generated a given text in multi-class classification tasks. Meanwhile, variations across LLMs also result in evolving patterns of word usage in academic papers. By adopting a direct and highly interpretable linear approach and accounting for differences between models and prompts, we quantitatively assess these effects and show that real-world LLM usage is heterogeneous and dynamic.
摘要：通过对 arXiv 论文的分析，我们报告了一些词语使用的变化，这些变化可能是由大型语言模型 (LLM) 驱动的，但之前没有得到足够的关注，例如标题中“beyond”和“via”的频率增加，以及摘要中“the”和“of”的频率减少。由于不同 LLM 之间的相似性，实验表明，当前的分类器很难在多类分类任务中准确确定哪个特定模型生成给定的文本。与此同时，LLM 之间的差异也导致学术论文中词语使用模式的不断变化。通过采用直接且高度可解释的线性方法并考虑模型和提示之间的差异，我们定量评估了这些影响，并表明现实世界的 LLM 使用是异构的和动态的。

Title: Measuring What Matters -- or What's Convenient?: Robustness of LLM-Based Scoring Systems to Construct-Irrelevant Factors

Authors: Cole Walsh, Rodica Ivan
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2603.25674
Pdf URL: https://arxiv.org/pdf/2603.25674
Copy Paste: [[2603.25674]] Measuring What Matters -- or What's Convenient?: Robustness of LLM-Based Scoring Systems to Construct-Irrelevant Factors(https://arxiv.org/abs/2603.25674)
Keywords: language model, llm, hallucination
Abstract: Automated systems have been widely adopted across the educational testing industry for open-response assessment and essay scoring. These systems commonly achieve performance levels comparable or superior to those of trained human raters, but have frequently been demonstrated to be vulnerable to the influence of construct-irrelevant factors (i.e., features of responses that are unrelated to the construct assessed) and adversarial conditions. Given the rising usage of large language models in automated scoring systems, there is a renewed focus on "hallucinations" and the robustness of these LLM-based automated scoring approaches to construct-irrelevant factors. This study investigates the effects of construct-irrelevant factors on a dual-architecture LLM-based scoring system designed to score short essay-like open-response items in a situational judgment test. It was found that the scoring system was generally robust to padding responses with meaningless text, spelling errors, and writing sophistication. Duplicating large passages of text resulted in lower scores predicted by the system, on average, contradicting results from previous studies of non-LLM-based scoring systems, while off-topic responses were heavily penalized by the scoring system. These results provide encouraging support for the robustness of future LLM-based scoring systems when designed with construct relevance in mind.
摘要：自动化系统已在教育考试行业广泛采用，用于开放式回答评估和论文评分。这些系统通常达到与训练有素的人类评估者相当或更好的性能水平，但经常被证明容易受到与结构无关的因素（即与评估的结构无关的响应特征）和对抗条件的影响。鉴于大型语言模型在自动评分系统中的使用不断增加，人们重新关注“幻觉”以及这些基于 LLM 的自动评分方法对构建无关因素的稳健性。本研究调查了结构无关因素对基于 LLM 的双架构评分系统的影响，该评分系统旨在对情境判断测试中的短文式开放式回答项目进行评分。研究发现，评分系统对于用无意义的文本、拼写错误和书写复杂性填充响应通常具有鲁棒性。平均而言，重复大段文本会导致系统预测的分数较低，这与之前非基于 LLM 的评分系统的研究结果相矛盾，而偏离主题的回答会受到评分系统的严重惩罚。这些结果为未来基于 LLM 的评分系统在设计时考虑到结构相关性的稳健性提供了令人鼓舞的支持。

Title: Self-Improvement of Large Language Models: A Technical Overview and Future Outlook

Authors: Haoyan Yang, Mario Xerri, Solha Park, Huajian Zhang, Yiyang Feng, Sai Akhil Kogilathota, Jiawei Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.25681
Pdf URL: https://arxiv.org/pdf/2603.25681
Copy Paste: [[2603.25681]] Self-Improvement of Large Language Models: A Technical Overview and Future Outlook(https://arxiv.org/abs/2603.25681)
Keywords: language model, llm
Abstract: As large language models (LLMs) continue to advance, improving them solely through human supervision is becoming increasingly costly and limited in scalability. As models approach human-level capabilities in certain domains, human feedback may no longer provide sufficiently informative signals for further improvement. At the same time, the growing ability of models to make autonomous decisions and execute complex actions naturally enables abstractions in which components of the model development process can be progressively automated. Together, these challenges and opportunities have driven increasing interest in self-improvement, where models autonomously generate data, evaluate outputs, and iteratively refine their own capabilities. In this paper, we present a system-level perspective on self-improving language models and introduce a unified framework that organizes existing techniques. We conceptualize the self-improvement system as a closed-loop lifecycle, consisting of four tightly coupled processes: data acquisition, data selection, model optimization, and inference refinement, along with an autonomous evaluation layer. Within this framework, the model itself plays a central role in driving each stage: collecting or generating data, selecting informative signals, updating its parameters, and refining outputs, while the autonomous evaluation layer continuously monitors progress and guides the improvement cycle across stages. Following this lifecycle perspective, we systematically review and analyze representative methods for each component from a technical standpoint. We further discuss current limitations and outline our vision for future research toward fully self-improving LLMs.
摘要：随着大型语言模型 (LLM) 的不断发展，仅通过人工监督来改进它们的成本越来越高，而且可扩展性也受到限制。随着模型在某些领域接近人类水平的能力，人类反馈可能不再为进一步改进提供足够的信息信号。与此同时，模型自主决策和执行复杂操作的能力不断增强，自然可以实现抽象，其中模型开发过程的组件可以逐步自动化。这些挑战和机遇共同推动了人们对自我完善的兴趣日益浓厚，模型可以自主生成数据、评估输出并迭代地完善自己的能力。在本文中，我们提出了自我改进语言模型的系统级视角，并介绍了组织现有技术的统一框架。我们将自我改进系统概念化为一个闭环生命周期，由四个紧密耦合的过程组成：数据采集、数据选择、模型优化和推理细化，以及一个自主评估层。在此框架内，模型本身在驱动每个阶段中发挥着核心作用：收集或生成数据、选择信息信号、更新其参数和细化输出，而自主评估层则持续监控进度并指导跨阶段的改进周期。遵循这个生命周期的视角，我们从技术角度系统地回顾和分析每个组件的代表性方法。我们进一步讨论了当前的局限性，并概述了我们对未来研究的愿景，以实现完全自我改进的 LLM。

Title: S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation

Authors: Ligong Han, Hao Wang, Han Gao, Kai Xu, Akash Srivastava
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.25702
Pdf URL: https://arxiv.org/pdf/2603.25702
Copy Paste: [[2603.25702]] S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation(https://arxiv.org/abs/2603.25702)
Keywords: language model, llm
Abstract: Block-diffusion language models offer a promising path toward faster-than-autoregressive generation by combining block-wise autoregressive decoding with within-block parallel denoising. However, in the few-step regime needed for practical acceleration, standard confidence-thresholded decoding is often brittle: aggressive thresholds hurt quality, while conservative thresholds require unnecessary denoising steps. Existing approaches that address this issue either require additional training or incur extra test-time compute. We present S2D2, a training-free self-speculative decoding framework for block-diffusion language models. Our key observation is that a block-diffusion model becomes autoregressive when the block size is reduced to one, allowing the same pretrained model to act as both drafter and verifier. S2D2 inserts a speculative verification step into standard block-diffusion decoding and uses lightweight routing policies to decide when verification is worth its cost. This yields a hybrid decoding trajectory in which diffusion proposes tokens in parallel, while the autoregressive mode acts as a local sequence-level critic. Across three mainstream block-diffusion families, S2D2 consistently improves the accuracy-speed tradeoff over strong confidence-thresholding baselines. On SDAR, we observe up to $4.7\times$ speedup over autoregressive decoding, and up to $1.57\times$ over a tuned dynamic decoding baseline while improving accuracy by up to $4.5$ points. On LLaDA2.1-Mini, S2D2 remains complementary to built-in self-correction, including a conservative setting where it is $4.4\times$ faster than the static baseline with slightly higher accuracy.
摘要：块扩散语言模型通过将块方式自回归解码与块内并行去噪相结合，提供了一条比自回归生成速度更快的有希望的途径。然而，在实际加速所需的几步机制中，标准置信阈值解码通常很脆弱：激进的阈值会损害质量，而保守的阈值则需要不必要的去噪步骤。解决此问题的现有方法要么需要额外的训练，要么需要额外的测试时计算。我们提出了 S2D2，一种用于块扩散语言模型的免训练自推测解码框架。我们的主要观察结果是，当块大小减小到 1 时，块扩散模型就会变得自回归，从而允许相同的预训练模型充当起草者和验证者。 S2D2 将推测性验证步骤插入标准块扩散解码中，并使用轻量级路由策略来决定验证何时值得其成本。这产生了一种混合解码轨迹，其中扩散并行提出标记，而自回归模式充当局部序列级批评家。在三个主流的块扩散系列中，S2D2 持续改进了强置信阈值基线的准确性与速度权衡。在 SDAR 上，我们观察到与自回归解码相比，速度提升高达 $4.7\times$，与调整后的动态解码基线相比，速度提升高达 $1.57\times$，同时精度提高了高达 $4.5$ 点。在 LLaDA2.1-Mini 上，S2D2 仍然是对内置自校正的补充，包括在保守的设置下，比静态基线快 $4.4\times$，且精度稍高。

Title: Natural-Language Agent Harnesses

Authors: Linyue Pan, Lexiao Zou, Shuo Guo, Jingchen Ni, Hai-Tao Zheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.25723
Pdf URL: https://arxiv.org/pdf/2603.25723
Copy Paste: [[2603.25723]] Natural-Language Agent Harnesses(https://arxiv.org/abs/2603.25723)
Keywords: agent
Abstract: Agent performance increasingly depends on harness engineering, yet harness design is usually buried in controller code and runtime-specific conventions, making it hard to transfer, compare, and study as a scientific object. We ask whether the high-level control logic of an agent harness can instead be externalized as a portable executable artifact. We introduce Natural-Language Agent Harnesses (NLAHs), which express harness behavior in editable natural language, and Intelligent Harness Runtime (IHR), a shared runtime that executes these harnesses through explicit contracts, durable artifacts, and lightweight adapters. Across coding and computer-use benchmarks, we conduct controlled evaluations of operational viability, module ablation, and code-to-text harness migration.
摘要：代理性能越来越依赖于线束工程，但线束设计通常隐藏在控制器代码和特定于运行时的约定中，这使得它很难作为科学对象进行传输、比较和研究。我们询问代理工具的高级控制逻辑是否可以外部化为可移植的可执行工件。我们引入了自然语言代理线束 (NLAH)，它以可编辑的自然语言表达线束行为，以及智能线束运行时 (IHR)，这是一个共享运行时，通过显式契约、持久工件和轻量级适配器执行这些线束。在编码和计算机使用基准方面，我们对操作可行性、模块消融和代码到文本工具迁移进行受控评估。