2026-03-12

Title: GhazalBench: Usage-Grounded Evaluation of LLMs on Persian Ghazals

Authors: Ghazal Kalhor, Yadollah Yaghoobzadeh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.09979
Pdf URL: https://arxiv.org/pdf/2603.09979
Copy Paste: [[2603.09979]] GhazalBench: Usage-Grounded Evaluation of LLMs on Persian Ghazals(https://arxiv.org/abs/2603.09979)
Keywords: language model, llm
Abstract: Persian poetry plays an active role in Iranian cultural practice, where verses by canonical poets such as Hafez are frequently quoted, paraphrased, or completed from partial cues. Supporting such interactions requires language models to engage not only with poetic meaning but also with culturally entrenched surface form. We introduce GhazalBench, a benchmark for evaluating how large language models (LLMs) interact with Persian ghazals under usage-grounded conditions. GhazalBench assesses two complementary abilities: producing faithful prose paraphrases of couplets and accessing canonical verses under varying semantic and formal cues. Across several proprietary and open-weight multilingual LLMs, we observe a consistent dissociation: models generally capture poetic meaning but struggle with exact verse recall in completion-based settings, while recognition-based tasks substantially reduce this gap. A parallel evaluation on English sonnets shows markedly higher recall performance, suggesting that these limitations are tied to differences in training exposure rather than inherent architectural constraints. Our findings highlight the need for evaluation frameworks that jointly assess meaning, form, and cue-dependent access to culturally significant texts. GhazalBench is available at this https URL.
摘要：波斯诗歌在伊朗文化实践中发挥着积极作用，哈菲兹等经典诗人的诗句经常被引用、解释或根据部分线索完成。支持这种交互需要语言模型不仅能够与诗意相结合，而且能够与文化上根深蒂固的表面形式相结合。我们引入了 GhazalBench，这是一个评估大型语言模型 (LLM) 在基于使用的条件下如何与波斯加查尔语交互的基准。 GhazalBench 评估两种互补的能力：对对联进行忠实的散文释义，并在不同的语义和形式线索下访问规范诗句。在几个专有和开放权重的多语言法学硕士中，我们观察到一致的分离：模型通常捕捉诗意意义，但在基于完成的设置中难以准确回忆诗歌，而基于识别的任务大大缩小了这种差距。对英语十四行诗的并行评估显示出明显更高的回忆性能，这表明这些限制与训练暴露的差异有关，而不是与固有的架构限制有关。我们的研究结果强调需要建立评估框架来共同评估具有文化意义的文本的含义、形式和依赖提示的获取。 GhazalBench 可通过此 https URL 获取。

Title: Large Language Models and Book Summarization: Reading or Remembering, Which Is Better?

Authors: Tairan Fu, Javier Conde, Pedro Reviriego, Javier Coronado-Blázquez, Nina Melero, Elena Merino-Gómez
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.09981
Pdf URL: https://arxiv.org/pdf/2603.09981
Copy Paste: [[2603.09981]] Large Language Models and Book Summarization: Reading or Remembering, Which Is Better?(https://arxiv.org/abs/2603.09981)
Keywords: language model, llm, prompt
Abstract: Summarization is a core task in Natural Language Processing (NLP). Recent advances in Large Language Models (LLMs) and the introduction of large context windows reaching millions of tokens make it possible to process entire books in a single prompt. At the same time, for well-known books, LLMs can generate summaries based only on internal knowledge acquired during training. This raises several important questions: How do summaries generated from internal memory compare to those derived from the full text? Does prior knowledge influence summaries even when the model is given the book as input? In this work, we conduct an experimental evaluation of book summarization with state-of-the-art LLMs. We compare summaries of well-known books produced using (i) only the internal knowledge of the model and (ii) the full text of the book. The results show that having the full text provides more detailed summaries in general, but some books have better scores for the internal knowledge summaries. This puts into question the capabilities of models to perform summarization of long texts, as information learned during training can outperform summarization of the full text in some cases.
摘要：摘要是自然语言处理（NLP）的核心任务。大型语言模型 (LLM) 的最新进展以及达到数百万个标记的大型上下文窗口的引入使得在单个提示中处理整本书成为可能。同时，对于知名书籍，法学硕士可以仅根据培训期间获得的内部知识生成摘要。这就提出了几个重要的问题：从内存生成的摘要与从全文生成的摘要相比如何？即使模型以书籍作为输入，先验知识也会影响摘要吗？在这项工作中，我们利用最先进的法学硕士对书籍摘要进行了实验评估。我们比较使用（i）仅模型的内部知识和（ii）书籍的全文生成的知名书籍的摘要。结果表明，全文总体上提供了更详细的摘要，但有些书籍在内部知识摘要方面得分更高。这使得模型执行长文本摘要的能力受到质疑，因为在某些情况下，训练期间学到的信息可能优于全文摘要。

Title: AraModernBERT: Transtokenized Initialization and Long-Context Encoder Modeling for Arabic

Authors: Omar Elshehy, Omer Nacar, Abdelbasset Djamai, Muhammed Ragab, Khloud Al Jallad, Mona Abdelazim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.09982
Pdf URL: https://arxiv.org/pdf/2603.09982
Copy Paste: [[2603.09982]] AraModernBERT: Transtokenized Initialization and Long-Context Encoder Modeling for Arabic(https://arxiv.org/abs/2603.09982)
Keywords: language model
Abstract: Encoder-only transformer models remain widely used for discriminative NLP tasks, yet recent architectural advances have largely focused on English. In this work, we present AraModernBERT, an adaptation of the ModernBERT encoder architecture to Arabic, and study the impact of transtokenized embedding initialization and native long-context modeling up to 8,192 tokens. We show that transtokenization is essential for Arabic language modeling, yielding dramatic improvements in masked language modeling performance compared to non-transtokenized initialization. We further demonstrate that AraModernBERT supports stable and effective long-context modeling, achieving improved intrinsic language modeling performance at extended sequence lengths. Downstream evaluations on Arabic natural language understanding tasks, including inference, offensive language detection, question-question similarity, and named entity recognition, confirm strong transfer to discriminative and sequence labeling settings. Our results highlight practical considerations for adapting modern encoder architectures to Arabic and other languages written in Arabic-derived scripts.
摘要：仅编码器的 Transformer 模型仍然广泛用于判别式 NLP 任务，但最近的架构进展主要集中在英语上。在这项工作中，我们提出了 AraModernBERT，这是 ModernBERT 编码器架构对阿拉伯语的改编，并研究了转标记化嵌入初始化和最多 8,192 个标记的本机长上下文建模的影响。我们表明，转标记化对于阿拉伯语言建模至关重要，与非转标记化初始化相比，掩码语言建模性能得到了显着改进。我们进一步证明 AraModernBERT 支持稳定有效的长上下文建模，在扩展序列长度下实现了改进的内在语言建模性能。对阿拉伯语自然语言理解任务的下游评估，包括推理、攻击性语言检测、问题相似性和命名实体识别，证实了向判别性和序列标记设置的强大迁移。我们的结果强调了使现代编码器架构适应阿拉伯语和其他用阿拉伯语衍生脚本编写的语言的实际考虑因素。

Title: The Dunning-Kruger Effect in Large Language Models: An Empirical Study of Confidence Calibration

Authors: Sudipta Ghosh, Mrityunjoy Panday
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.09985
Pdf URL: https://arxiv.org/pdf/2603.09985
Copy Paste: [[2603.09985]] The Dunning-Kruger Effect in Large Language Models: An Empirical Study of Confidence Calibration(https://arxiv.org/abs/2603.09985)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, yet their ability to accurately assess their own confidence remains poorly understood. We present an empirical study investigating whether LLMs exhibit patterns reminiscent of the Dunning-Kruger effect -- a cognitive bias where individuals with limited competence tend to overestimate their abilities. We evaluate four state-of-the-art models (Claude Haiku 4.5, Gemini 2.5 Pro, Gemini 2.5 Flash, and Kimi K2) across four benchmark datasets totaling 24,000 experimental trials. Our results reveal striking calibration differences: Kimi K2 exhibits severe overconfidence with an Expected Calibration Error (ECE) of 0.726 despite only 23.3% accuracy, while Claude Haiku 4.5 achieves the best calibration (ECE = 0.122) with 75.4% accuracy. These findings demonstrate that poorly performing models display markedly higher overconfidence -- a pattern analogous to the Dunning-Kruger effect in human cognition. We discuss implications for safe deployment of LLMs in high-stakes applications.
摘要：大型语言模型 (LLM) 在不同的任务中表现出了卓越的能力，但人们对它们准确评估自身信心的能力仍然知之甚少。我们提出了一项实证研究，调查法学硕士是否表现出让人想起邓宁-克鲁格效应的模式——这是一种认知偏差，能力有限的个人往往会高估自己的能力。我们在四个基准数据集上评估了四个最先进的模型（Claude Haiku 4.5、Gemini 2.5 Pro、Gemini 2.5 Flash 和 Kimi K2），总计 24,000 次实验。我们的结果揭示了显着的校准差异：Kimi K2 表现出严重的过度自信，尽管准确度仅为 23.3%，但预期校准误差 (ECE) 为 0.726，而 Claude Haiku 4.5 实现了最佳校准 (ECE = 0.122)，准确度为 75.4%。这些发现表明，表现不佳的模型表现出明显更高的过度自信——这种模式类似于人类认知中的邓宁-克鲁格效应。我们讨论了在高风险应用程序中安全部署法学硕士的影响。
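Note: the Expected Calibration Error (ECE) quoted in the abstract above is commonly computed by binning predictions by stated confidence and averaging the gap between accuracy and confidence per bin, weighted by bin size. The sketch below is a minimal illustration under that common definition. The equal-width binning, the `n_bins` default, and the toy inputs are assumptions made here and are not taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: size-weighted mean of |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Group each (confidence, correctness) pair into its confidence bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, bool(ok)))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Toy data (made up): a model that always reports 95% confidence
# but answers only one question in four correctly.
toy_confidences = [0.95, 0.95, 0.95, 0.95]
toy_correct = [True, False, False, False]
toy_ece = expected_calibration_error(toy_confidences, toy_correct)
```

On the toy data the gap is |0.25 - 0.95| = 0.70, the same order as the severe overconfidence the abstract reports for its weakest model.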

Title: Quantifying Hallucinations in Language Language Models on Medical Textbooks

Authors: Brandon C. Colelough, Davis Bartels, Dina Demner-Fushman
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.09986
Pdf URL: https://arxiv.org/pdf/2603.09986
Copy Paste: [[2603.09986]] Quantifying Hallucinations in Language Language Models on Medical Textbooks(https://arxiv.org/abs/2603.09986)
Keywords: language model, hallucination, prompt
Abstract: Hallucination, the tendency of large language models to provide responses with factually incorrect and unsupported claims, is a serious problem within natural language processing for which we do not yet have an effective mitigation. Existing benchmarks for medical QA rarely evaluate this behavior against a fixed evidence source. We ask how often hallucinations occur on textbook-grounded QA and how responses to medical QA prompts vary across models. We conduct two experiments: the first determines the prevalence of hallucinations for a prominent open-source large language model (LLaMA-70B-Instruct) in medical QA given novel prompts, and the second determines the prevalence of hallucinations and clinician preference for model responses. In experiment one, with the passages provided, LLaMA-70B-Instruct hallucinated in 19.7% of answers (95% CI 18.6 to 20.7) even though 98.8% of prompt responses received maximal plausibility. In experiment two, across models, lower hallucination rates aligned with higher usefulness scores ($\rho=-0.71$, $p=0.058$). Clinicians produced high agreement in experiment 1 (quadratic weighted $\kappa=0.92$) and $\tau_b=0.06$ to $0.18$, $\kappa=0.57$ to $0.61$ in experiment 2.
摘要：幻觉，即大型语言模型倾向于提供事实不正确且不受支持的主张的响应，是自然语言处理中的一个严重问题，我们尚未找到有效的解决方案来缓解。现有的医学质量保证基准很少根据固定的证据来源来评估这种行为。我们询问基于教科书的 QA 中出现幻觉的频率，以及不同模型对医学 QA 提示的反应有何不同。我们进行了两个实验：第一个实验是为了确定医学 QA 中一个著名的开源大语言模型 (LLaMA-70B-Instruct) 在给定新提示的情况下幻觉的流行率，第二个实验是为了确定幻觉的流行率和临床医生对模型反应的偏好。我们在实验一中观察到，根据提供的段落，LLaMA-70B-Instruct 在 19.7\% 的答案中产生了幻觉（95\% CI 18.6 至 20.7），尽管 98.8\% 的即时响应获得了最大的合理性，并且在实验二中观察到，跨模型，较低的幻觉率与较高的有用性得分一致（$\rho=-0.71$， $p=0.058$)。临床医生分别对实验 1 和 ,2 产生了高度一致性（二次加权 $\kappa=0.92$）和（$\tau_b=0.06$ 至 $0.18$，$\kappa=0.57$ 至 $0.61$）

Title: Evolving Demonstration Optimization for Chain-of-Thought Feature Transformation

Authors: Xinyuan Wang, Kunpeng Liu, Arun Vignesh Malarkkan, Yanjie Fu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.09987
Pdf URL: https://arxiv.org/pdf/2603.09987
Copy Paste: [[2603.09987]] Evolving Demonstration Optimization for Chain-of-Thought Feature Transformation(https://arxiv.org/abs/2603.09987)
Keywords: language model, llm, chain-of-thought
Abstract: Feature Transformation (FT) is a core data-centric AI task that improves feature space quality to advance downstream predictive performance. However, discovering effective transformations remains challenging due to the large space of feature-operator combinations. Existing solutions rely on discrete search or latent generation, but they are frequently limited by sample inefficiency, invalid candidates, and redundant generations with limited coverage. Large Language Models (LLMs) offer strong priors for producing valid transformations, but current LLM-based FT methods typically rely on static demonstrations, resulting in limited diversity, redundant outputs, and weak alignment with downstream objectives. We propose a framework that optimizes context data for LLM-driven FT by evolving trajectory-level experiences in a closed loop. Starting from high-performing feature transformation sequences explored by reinforcement learning, we construct and continuously update an experience library of downstream task-verified transformation trajectories, and use a diversity-aware selector to form contexts along with a chain-of-thought and guide transformed feature generation toward higher performance. Experiments on diverse tabular benchmarks show that our method outperforms classical and LLM-based baselines and is more stable than one-shot generation. The framework generalizes across API-based and open-source LLMs and remains robust across downstream evaluators.
摘要：特征转换 (FT) 是一项以数据为中心的核心 AI 任务，可提高特征空间质量以提高下游预测性能。然而，由于特征算子组合的空间很大，发现有效的变换仍然具有挑战性。现有的解决方案依赖于离散搜索或潜在生成，但它们经常受到样本效率低下、无效候选者和覆盖范围有限的冗余生成的限制。大型语言模型 (LLM) 为产生有效的转换提供了强大的先验，但当前基于 LLM 的 FT 方法通常依赖于静态演示，导致多样性有限、输出冗余以及与下游目标的一致性较差。我们提出了一个框架，通过在闭环中发展轨迹级经验来优化 LLM 驱动的 FT 的上下文数据。从强化学习探索的高性能特征传输序列开始，我们构建并不断更新下游任务验证的转换轨迹的经验库，并使用多样性感知选择器形成上下文和思想链，并引导转换后的特征生成向更高的性能发展。对各种表格基准的实验表明，我们的方法优于经典和基于 LLM 的基线，并且比一次性生成更稳定。该框架适用于基于 API 和开源的法学硕士，并且在下游评估者中保持稳健。

Title: Causally Grounded Mechanistic Interpretability for LLMs with Faithful Natural-Language Explanations

Authors: Ajay Pravin Mahale
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.09988
Pdf URL: https://arxiv.org/pdf/2603.09988
Copy Paste: [[2603.09988]] Causally Grounded Mechanistic Interpretability for LLMs with Faithful Natural-Language Explanations(https://arxiv.org/abs/2603.09988)
Keywords: gpt, llm
Abstract: Mechanistic interpretability identifies internal circuits responsible for model behaviors, yet translating these findings into human-understandable explanations remains an open problem. We present a pipeline that bridges circuit-level analysis and natural language explanations by (i) identifying causally important attention heads via activation patching, (ii) generating explanations using both template-based and LLM-based methods, and (iii) evaluating faithfulness using ERASER-style metrics adapted for circuit-level attribution. We evaluate on the Indirect Object Identification (IOI) task in GPT-2 Small (124M parameters), identifying six attention heads accounting for 61.4% of the logit difference. Our circuit-based explanations achieve 100% sufficiency but only 22% comprehensiveness, revealing distributed backup mechanisms. LLM-generated explanations outperform template baselines by 64% on quality metrics. We find no correlation (r = 0.009) between model confidence and explanation faithfulness, and identify three failure categories explaining when explanations diverge from mechanisms.
摘要：机械可解释性识别了负责模型行为的内部电路，但将这些发现转化为人类可理解的解释仍然是一个悬而未决的问题。我们提出了一个连接电路级分析和自然语言解释的管道，方法是：（i）通过激活修补识别因果重要的注意力头，（ii）使用基于模板和基于 LLM 的方法生成解释，以及（iii）使用适用于电路级归因的 ERASER 式指标评估忠实度。我们对 GPT-2 Small（124M 参数）中的间接对象识别（IOI）任务进行评估，识别出 6 个注意力头，占 logit 差异的 61.4%。我们基于电路的解释达到了 100% 的充分性，但只有 22% 的全面性，揭示了分布式备份机制。 LLM 生成的解释在质量指标上比模板基线高出 64%。我们发现模型置信度和解释忠实度之间没有相关性（r = 0.009），并确定了三个失败类别来解释解释何时偏离机制。

Title: The System Hallucination Scale (SHS): A Minimal yet Effective Human-Centered Instrument for Evaluating Hallucination-Related Behavior in Large Language Models

Authors: Heimo Müller, Dominik Steiger, Markus Plass, Andreas Holzinger
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.09989
Pdf URL: https://arxiv.org/pdf/2603.09989
Copy Paste: [[2603.09989]] The System Hallucination Scale (SHS): A Minimal yet Effective Human-Centered Instrument for Evaluating Hallucination-Related Behavior in Large Language Models(https://arxiv.org/abs/2603.09989)
Keywords: language model, llm, hallucination
Abstract: We introduce the System Hallucination Scale (SHS), a lightweight and human-centered measurement instrument for assessing hallucination-related behavior in large language models (LLMs). Inspired by established psychometric tools such as the System Usability Scale (SUS) and the System Causability Scale (SCS), SHS enables rapid, interpretable, and domain-agnostic evaluation of factual unreliability, incoherence, misleading presentation, and responsiveness to user guidance in model-generated text. SHS is explicitly not an automatic hallucination detector or benchmark metric; instead, it captures how hallucination phenomena manifest from a user perspective under realistic interaction conditions. A real-world evaluation with 210 participants demonstrates high clarity, coherent response behavior, and construct validity, supported by statistical analysis including internal consistency (Cronbach's alpha = 0.87) and significant inter-dimension correlations (p < 0.001). Comparative analysis with SUS and SCS reveals complementary measurement properties, supporting SHS as a practical tool for comparative analysis, iterative system development, and deployment monitoring.
摘要：我们推出了系统幻觉量表（SHS），这是一种轻量级且以人为本的测量工具，用于评估大语言模型（LLM）中的幻觉相关行为。受系统可用性量表 (SUS) 和系统因果关系量表 (SCS) 等已建立的心理测量工具的启发，SHS 能够对模型生成文本中的事实不可靠性、不连贯性、误导性演示以及对用户指导的响应能力进行快速、可解释和与领域无关的评估。 SHS 显然不是自动幻觉检测器或基准指标；相反，它捕捉了幻觉现象在现实交互条件下如何从用户角度表现出来。对 210 名参与者进行的真实世界评估显示出高度清晰度、连贯的响应行为和结构有效性，并得到统计分析的支持，包括内部一致性（Cronbach's alpha = 0.87 $）和显着的维度间相关性（p < 0.001 $）。 SUS 和 SCS 的比较分析揭示了互补的测量属性，支持 SHS 作为比较分析、迭代系统开发和部署监控的实用工具。
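Note: the internal-consistency figure in the abstract above is Cronbach's alpha, which for k questionnaire items is k/(k-1) times one minus the ratio of summed item variances to the variance of respondents' total scores. The sketch below applies that textbook formula with sample variances. The function name and the toy responses are illustrative assumptions and do not come from the SHS study.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    if len(responses) < 2:
        raise ValueError("need at least two respondents")
    k = len(responses[0])
    if k < 2 or any(len(row) != k for row in responses):
        raise ValueError("every respondent needs the same k >= 2 item scores")

    # Sample variance of each item (column) across respondents.
    item_variance_sum = sum(variance(column) for column in zip(*responses))
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("total scores have zero variance; alpha is undefined")
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Toy data (made up): two items that always move together give alpha = 1.
toy_alpha = cronbach_alpha([[1, 1], [2, 2], [3, 3]])
```

Values near 1 indicate that the items vary together across respondents; the 0.87 reported above would fall in the range usually read as good internal consistency.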

Title: A Two-Stage Architecture for NDA Analysis: LLM-based Segmentation and Transformer-based Clause Classification

Authors: Ana Begnini, Matheus Vicente, Leonardo Souza
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.09990
Pdf URL: https://arxiv.org/pdf/2603.09990
Copy Paste: [[2603.09990]] A Two-Stage Architecture for NDA Analysis: LLM-based Segmentation and Transformer-based Clause Classification(https://arxiv.org/abs/2603.09990)
Keywords: llm
Abstract: In business-to-business relations, it is common to establish Non-Disclosure Agreements (NDAs). However, these documents exhibit significant variation in format, structure, and writing style, making manual analysis slow and error-prone. We propose an architecture based on LLMs to automate segmentation and clause classification within these contracts. We employed two models: LLaMA-3.1-8B-Instruct for NDA segmentation (clause extraction) and a fine-tuned Legal-Roberta-Large for clause classification. In the segmentation task, we achieved a ROUGE F1 of 0.95 +/- 0.0036; for classification, we obtained a weighted F1 of 0.85, demonstrating the feasibility and precision of the approach.
摘要：在企业对企业关系中，通常会签订保密协议 (NDA)。然而，这些文档在格式、结构和写作风格上表现出显着差异，使得手动分析速度缓慢且容易出错。我们提出了一种基于法学硕士的架构，以自动执行这些合同中的分段和条款分类。我们采用了两种模型：用于 NDA 分割（条款提取）的 LLaMA-3.1-8B-Instruct 和用于条款分类的微调 Legal-Roberta-Large。在分割任务中，我们实现了 0.95 +/- 0.0036 的 ROUGE F1；对于分类，我们获得了 0.85 的加权 F1，证明了该方法的可行性和精度。

Title: PoultryLeX-Net: Domain-Adaptive Dual-Stream Transformer Architecture for Large-Scale Poultry Stakeholder Modeling

Authors: Stephen Afrifa, Biswash Khatiwada, Kapalik Khanal, Sanjay Shah, Lingjuan Wang-Li, Ramesh Bahadur Bist
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.09991
Pdf URL: https://arxiv.org/pdf/2603.09991
Copy Paste: [[2603.09991]] PoultryLeX-Net: Domain-Adaptive Dual-Stream Transformer Architecture for Large-Scale Poultry Stakeholder Modeling(https://arxiv.org/abs/2603.09991)
Keywords: language model
Abstract: The rapid growth of the global poultry industry, driven by rising demand for affordable animal protein, has intensified public discourse surrounding production practices, housing, management, animal welfare, and supply-chain transparency. Social media platforms such as X (formerly Twitter) generate large volumes of unstructured textual data that capture stakeholder sentiment across the poultry industry. Extracting accurate sentiment signals from this domain-specific discourse remains challenging due to contextual ambiguity, linguistic variability, and limited domain awareness in general-purpose language models. This study presents PoultryLeX-Net, a lexicon-enhanced, domain-adaptive dual-stream transformer framework for fine-grained sentiment analysis in poultry-related text. The proposed architecture integrates sentiment classification, topic modeling, and contextual representation learning through domain-specific embeddings and gated cross-attention mechanisms. A lexicon-guided stream captures poultry-specific terminology and sentiment cues, while a contextual stream models long-range semantic dependencies. Latent Dirichlet Allocation is employed to identify dominant thematic structures associated with production management and welfare-related discussions, providing complementary interpretability to sentiment predictions. PoultryLeX-Net was evaluated against multiple baseline models, including a convolutional neural network and pre-trained transformer architectures such as DistilBERT and RoBERTa. PoultryLeX-Net consistently outperformed all baselines, achieving an accuracy of 97.35%, an F1 score of 96.67%, and an area under the receiver operating characteristic curve (AUC-ROC) of 99.61% across sentiment classification tasks. Overall, domain adaptation and dual-stream attention markedly improve sentiment classification, enabling scalable intelligence for poultry production decision support.
摘要：在对负担得起的动物蛋白的需求不断增长的推动下，全球家禽业的快速增长加剧了围绕生产实践、住房、管理、动物福利和供应链透明度的公众讨论。 X（以前称为 Twitter）等社交媒体平台会生成大量非结构化文本数据，这些数据可以捕捉整个家禽行业利益相关者的情绪。由于通用语言模型中的上下文模糊性、语言变异性和有限的领域意识，从特定领域的话语中提取准确的情感信号仍然具有挑战性。本研究提出了 PoultryLeX-Net，这是一种词典增强型、领域自适应双流转换器框架，用于家禽相关文本中的细粒度情感分析。所提出的架构通过特定领域的嵌入和门控交叉注意机制集成了情感分类、主题建模和上下文表示学习。词典引导的流捕获家禽特定的术语和情感线索，而上下文流则对远程语义依赖关系进行建模。潜在狄利克雷分配用于识别与生产管理和福利相关讨论相关的主导主题结构，为情绪预测提供补充的可解释性。 PoultryLeX-Net 针对多个基线模型进行了评估，包括卷积神经网络和预训练的 Transformer 架构（例如 DistilBERT 和 RoBERTa）。 PoultryLeX-Net 始终优于所有基线，在情感分类任务中实现了 97.35% 的准确率、96.67% 的 F1 分数以及 99.61% 的受试者工作特征曲线下面积 (AUC-ROC)。总体而言，领域适应和双流注意力显着改善了情感分类，为家禽生产决策支持提供了可扩展的智能。

Title: TAMUSA-Chat: A Domain-Adapted Large Language Model Conversational System for Research and Responsible Deployment

Authors: Izzat Alsmadi, Anas Alsobeh
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.09992
Pdf URL: https://arxiv.org/pdf/2603.09992
Copy Paste: [[2603.09992]] TAMUSA-Chat: A Domain-Adapted Large Language Model Conversational System for Research and Responsible Deployment(https://arxiv.org/abs/2603.09992)
Keywords: language model, llm, chat, retrieval-augmented generation, agent
Abstract: This paper presents TAMUSA-Chat, a research-oriented framework for building domain-adapted large language model conversational systems. The work addresses critical challenges in adapting general-purpose foundation models to institutional contexts through supervised fine-tuning, retrieval-augmented generation, and systematic evaluation methodologies. We describe the complete architecture encompassing data acquisition from institutional sources, preprocessing pipelines, embedding construction, model training workflows, and deployment strategies. The system integrates modular components enabling reproducible experimentation with training configurations, hyper-parameters, and evaluation protocols. Our implementation demonstrates how academic institutions can develop contextually grounded conversational agents while maintaining transparency, governance compliance, and responsible AI practices. Through empirical analysis of fine-tuning behavior across model sizes and training iterations, we provide insights into domain adaptation efficiency, computational resource requirements, and quality-cost trade-offs. The publicly available codebase at this https URL supports continued research into institutional LLM deployment, evaluation methodologies, and ethical considerations for educational AI systems.
摘要：本文介绍了 TAMUSA-Chat，这是一个面向研究的框架，用于构建适应领域的大型语言模型对话系统。这项工作通过监督微调、检索增强生成和系统评估方法，解决了使通用基础模型适应机构环境的关键挑战。我们描述了完整的架构，包括从机构来源获取数据、预处理管道、嵌入构建、模型训练工作流程和部署策略。该系统集成了模块化组件，可通过训练配置、超参数和评估协议进行可重复的实验。我们的实施展示了学术机构如何开发基于上下文的对话代理，同时保持透明度、治理合规性和负责任的人工智能实践。通过对模型大小和训练迭代之间的微调行为进行实证分析，我们提供了有关领域适应效率、计算资源需求和质量成本权衡的见解。此 https URL 上的公开代码库支持对教育人工智能系统的机构 LLM 部署、评估方法和道德考虑因素的持续研究。

Title: CEI: A Benchmark for Evaluating Pragmatic Reasoning in Language Models

Authors: Jon Chun, Hannah Sussman, Adrian Mangine, Murathan Kocaman, Kirill Sidorko, Abhigya Koirala, Andre McCloud, Gwen Eisenbeis, Wisdom Akanwe, Moustapha Gassama, Eliezer Gonzalez Chirinos, Anne-Duncan Enright, Peter Dunson, Tiffanie Ng, Anna von Rosenstiel, Godwin Idowu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.09993
Pdf URL: https://arxiv.org/pdf/2603.09993
Copy Paste: [[2603.09993]] CEI: A Benchmark for Evaluating Pragmatic Reasoning in Language Models(https://arxiv.org/abs/2603.09993)
Keywords: language model, llm
Abstract: Pragmatic reasoning, inferring intended meaning beyond literal semantics, underpins everyday communication yet remains difficult for large language models. We present the Contextual Emotional Inference (CEI) Benchmark: 300 human-validated scenarios for evaluating how well LLMs disambiguate pragmatically complex utterances. Each scenario pairs a situational context and speaker-listener roles (with explicit power relations) against an ambiguous utterance. The dataset covers five pragmatic subtypes (sarcasm/irony, mixed signals, strategic politeness, passive aggression, deflection/misdirection) drawn from workplace, family, social, and service settings, with three power configurations (peer, higher-to-lower, lower-to-higher). Three trained annotators independently labeled every scenario. Inter-annotator agreement (Fleiss' kappa = 0.06-0.25 by subtype) is low but expected: pragmatic inference admits multiple valid readings, and the disagreement itself is informative. We describe our annotation methodology, including a 4-level quality control pipeline that combines automated statistical checks with expert adjudication. CEI is released under CC-BY-4.0.
摘要：实用推理，即在字面语义之外推断出预期含义，支撑着日常交流，但对于大型语言模型来说仍然很困难。我们提出了语境情感推理 (CEI) 基准：300 个经过人工验证的场景，用于评估法学硕士消除实用复杂话语歧义的能力。每个场景都将情境背景和说话者-听众角色（具有明确的权力关系）与模糊的话语配对。该数据集涵盖了来自工作场所、家庭、社交和服务环境的五种实用子类型（讽刺/讽刺、混合信号、策略礼貌、被动攻击、偏转/误导），具有三种权力配置（同伴、从上到下、从下到上）。三名训练有素的注释者独立标记每个场景。注释者间的一致性（按子类型划分，Fleiss 的 kappa = 0.06-0.25）较低，但符合预期：实用推理承认多个有效读数，并且分歧本身是有信息性的。我们描述了我们的注释方法，包括将自动统计检查与专家裁决相结合的 4 级质量控制管道。 CEI 在 CC-BY-4.0 下发布。
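Note: the inter-annotator figure in the abstract above is Fleiss' kappa, which compares the observed pairwise agreement among a fixed number of raters per item with the agreement expected from the overall label frequencies. The sketch below implements the standard formula on a count matrix. The two-category toy matrices are made up for illustration; only the choice of three raters per item mirrors the abstract.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for counts[i][j] = number of raters giving item i label j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_items == 0 or n_raters < 2:
        raise ValueError("need at least one item and two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement from the overall proportion of each label.
    n_labels = len(counts[0])
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_labels)
    ]
    expected = sum(p * p for p in proportions)
    if expected == 1.0:
        raise ValueError("only one label was ever used; kappa is undefined")
    return (observed - expected) / (1 - expected)


# Toy data (made up), three raters per item, two labels.
unanimous = fleiss_kappa([[3, 0], [0, 3]])  # all raters agree on every item
split = fleiss_kappa([[2, 1], [1, 2]])      # a 2-vs-1 split on every item
```

The unanimous matrix gives kappa = 1, while the consistently split matrix gives a negative value (-1/3), meaning agreement below what the label frequencies alone would predict.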

Title: Evaluating Adjective-Noun Compositionality in LLMs: Functional vs Representational Perspectives

Authors: Ruchira Dhar, Qiwei Peng, Anders Søgaard
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.09994
Pdf URL: https://arxiv.org/pdf/2603.09994
Copy Paste: [[2603.09994]] Evaluating Adjective-Noun Compositionality in LLMs: Functional vs Representational Perspectives(https://arxiv.org/abs/2603.09994)
Keywords: language model, llm, prompt
Abstract: Compositionality is considered central to language abilities. As performant language systems, how do large language models (LLMs) do on compositional tasks? We evaluate adjective-noun compositionality in LLMs using two complementary setups: prompt-based functional assessment and a representational analysis of internal model states. Our results reveal a striking divergence between task performance and internal states. While LLMs reliably develop compositional representations, they fail to translate consistently into functional task success across model variants. Consequently, we highlight the importance of contrastive evaluation for obtaining a more complete understanding of model capabilities.
摘要：组合性被认为是语言能力的核心。作为高性能语言系统，大型语言模型 (LLM) 在组合任务上表现如何？我们使用两种互补的设置来评估法学硕士中的形容词-名词组合性：基于提示的功能评估和内部模型状态的表征分析。我们的结果揭示了任务绩效和内部状态之间的显着差异。虽然法学硕士可靠地开发了组合表示，但它们无法一致地转化为跨模型变体的功能任务成功。因此，我们强调对比评估对于更全面地了解模型功能的重要性。

Title: Context Over Compute Human-in-the-Loop Outperforms Iterative Chain-of-Thought Prompting in Interview Answer Quality

Authors: Kewen Zhu, Zixi Liu, Yanjing Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.09995
Pdf URL: https://arxiv.org/pdf/2603.09995
Copy Paste: [[2603.09995]] Context Over Compute Human-in-the-Loop Outperforms Iterative Chain-of-Thought Prompting in Interview Answer Quality(https://arxiv.org/abs/2603.09995)
Keywords: language model, prompt, chain-of-thought
Abstract: Behavioral interview evaluation using large language models presents unique challenges that require structured assessment, realistic interviewer behavior simulation, and pedagogical value for candidate training. We investigate chain-of-thought prompting for interview answer evaluation and improvement through two controlled experiments with 50 behavioral interview question-and-answer pairs. Our contributions are threefold. First, we provide a quantitative comparison between human-in-the-loop and automated chain-of-thought improvement. Using a within-subject paired design with n = 50, both approaches show positive rating improvements. The human-in-the-loop approach provides significant training benefits. Confidence improves from 3.16 to 4.16 (p < 0.001) and authenticity improves from 2.94 to 4.53 (p < 0.001, Cohen's d = 3.21). The human-in-the-loop method also requires five times fewer iterations (1.0 versus 5.0, p < 0.001) and achieves full personal detail integration. Second, we analyze convergence behavior. Both methods converge rapidly with mean iterations below one, with the human-in-the-loop approach achieving a 100% success rate compared to 84% for automated approaches among initially weak answers (Cohen's h = 0.82, large effect). Additional iterations provide diminishing returns, indicating that the primary limitation is context availability rather than computational resources. Third, we propose an adversarial challenging mechanism based on a negativity bias model, named bar raiser, to simulate realistic interviewer behavior, although quantitative validation remains future work. Our findings demonstrate that while chain-of-thought prompting provides a useful foundation for interview evaluation, domain-specific enhancements and context-aware approach selection are essential for realistic and pedagogically valuable results.
摘要：使用大型语言模型的行为面试评估提出了独特的挑战，需要结构化评估、真实的面试官行为模拟以及候选人培训的教学价值。我们通过 50 个行为面试问题和答案对的两项对照实验，研究了面试答案评估和改进的思维提示链。我们的贡献是三重的。首先，我们对人在循环和自动化思维链改进之间进行定量比较。使用 n 等于 50 的受试者内配对设计，两种方法都显示出积极的评级改进。人在环方法提供了显着的培训优势。置信度从 3.16 提高到 4.16（p 小于 0.001），真实性从 2.94 提高到 4.53（p 小于 0.001，Cohen's d 为 3.21）。人在循环方法还需要五倍的迭代次数（1.0 与 5.0，p 小于 0.001），并实现完整的个人细节集成。其次，我们分析收敛行为。两种方法都以低于 1 的平均迭代次数快速收敛，在最初的弱答案中，人机循环方法实现了 100% 的成功率，而自动化方法的成功率为 84%（Cohen 的 h 为 0.82，效果很大）。额外的迭代会带来收益递减，这表明主要限制是上下文可用性而不是计算资源。第三，我们提出了一种基于消极偏见模型的对抗性挑战机制，称为“bar raiser”，以模拟现实的面试官行为，尽管定量验证仍然是未来的工作。我们的研究结果表明，虽然思想链提示为面试评估提供了有用的基础，但特定领域的增强和情境感知方法的选择对于现实和具有教学价值的结果至关重要。
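Note: the Cohen's h reported in the abstract above can be checked from the two success rates it quotes. Cohen's h is the difference between the arcsine-transformed proportions, h = 2*asin(sqrt(p1)) - 2*asin(sqrt(p2)). The sketch below applies that standard formula to the 100% and 84% success rates; the function and variable names are illustrative.

```python
import math


def cohens_h(p1, p2):
    """Cohen's h effect size for the difference between two proportions."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


# Success rates quoted in the abstract: human-in-the-loop vs automated.
h_reported_rates = cohens_h(1.00, 0.84)
print(round(h_reported_rates, 2))  # 0.82
```

The result rounds to 0.82, which agrees with the abstract and sits just above the conventional 0.8 threshold for a large effect.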

Title: There Are No Silly Questions: Evaluation of Offline LLM Capabilities from a Turkish Perspective

Authors: Edibe Yilmaz, Kahraman Kostas
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2603.09996
Pdf URL: https://arxiv.org/pdf/2603.09996
Copy Paste: [[2603.09996]] There Are No Silly Questions: Evaluation of Offline LLM Capabilities from a Turkish Perspective(https://arxiv.org/abs/2603.09996)
Keywords: language model, llm
Abstract: The integration of large language models (LLMs) into educational processes introduces significant constraints regarding data privacy and reliability, particularly in pedagogically vulnerable contexts such as Turkish heritage language education. This study aims to systematically evaluate the robustness and pedagogical safety of locally deployable offline LLMs within the context of Turkish heritage language education. To this end, a Turkish Anomaly Suite (TAS) consisting of 10 original edge-case scenarios was developed to assess the models' capacities for epistemic resistance, logical consistency, and pedagogical safety. Experiments conducted on 14 different models ranging from 270M to 32B parameters reveal that anomaly resistance is not solely dependent on model scale and that sycophancy bias can pose pedagogical risks even in large-scale models. The findings indicate that reasoning-oriented models in the 8B--14B parameter range represent the most balanced segment in terms of cost-safety trade-off for language learners.
摘要：将大语言模型 (LLM) 集成到教育过程中会带来数据隐私和可靠性方面的重大限制，特别是在土耳其传统语言教育等教学脆弱的环境中。本研究旨在系统评估土耳其传统语言教育背景下本地可部署的线下法学硕士的稳健性和教学安全性。为此，开发了由 10 个原始边缘情况场景组成的土耳其异常套件 (TAS)，以评估模型的认知抵抗能力、逻辑一致性和教学安全性。对参数范围从 270M 到 32B 的 14 个不同模型进行的实验表明，异常抵抗力不仅仅取决于模型规模，而且即使在大型模型中，阿谀奉承偏差也会带来教学风险。研究结果表明，8B--14B 参数范围内的推理导向模型代表了语言学习者在成本与安全权衡方面最平衡的部分。

Title: Empathy Is Not What Changed: Clinical Assessment of Psychological Safety Across GPT Model Generations

Authors: Michael Keeman, Anastasia Keeman
Subjects: cs.CL, cs.AI, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2603.09997
Pdf URL: https://arxiv.org/pdf/2603.09997
Copy Paste: [[2603.09997]] Empathy Is Not What Changed: Clinical Assessment of Psychological Safety Across GPT Model Generations(https://arxiv.org/abs/2603.09997)
Keywords: gpt
Abstract: When OpenAI deprecated GPT-4o in early 2026, thousands of users protested under #keep4o, claiming newer models had "lost their empathy." No published study has tested this claim. We conducted the first clinical measurement, evaluating three OpenAI model generations (GPT-4o, o4-mini, GPT-5-mini) across 14 emotionally challenging conversational scenarios in mental health and AI companion domains, producing 2,100 scored AI responses assessed on six psychological safety dimensions using clinically-grounded rubrics. Empathy scores are statistically indistinguishable across all three models (Kruskal-Wallis H=4.33, p=0.115). What changed is the safety posture: crisis detection improved monotonically from GPT-4o to GPT-5-mini (H=13.88, p=0.001), while advice safety declined (H=16.63, p<0.001). Per-turn trajectory analysis -- a novel methodological contribution -- reveals these shifts are sharpest during mid-conversation crisis moments invisible to aggregate scoring. In a self-harm scenario involving a minor, GPT-4o scored 3.6/10 on crisis detection during early disclosure turns; GPT-5-mini never dropped below 7.8. What users perceived as "lost empathy" was a shift from a cautious model that missed crises to an alert model that sometimes says too much -- a trade-off with real consequences for vulnerable users, currently invisible to both the people who feel it and the developers who create it.
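Summary: When OpenAI deprecated GPT-4o in early 2026, thousands of users protested under #keep4o, claiming that newer models had "lost their empathy." No published study has tested this claim. We conducted the first clinical measurement, evaluating three OpenAI model generations (GPT-4o, o4-mini, GPT-5-mini) across 14 emotionally challenging conversational scenarios in the mental health and AI companion domains, producing 2,100 scored AI responses assessed on six psychological safety dimensions with clinically grounded rubrics. Empathy scores are statistically indistinguishable across all three models (Kruskal-Wallis H=4.33, p=0.115). What changed is the safety posture: crisis detection improved monotonically from GPT-4o to GPT-5-mini (H=13.88, p=0.001), while advice safety declined (H=16.63, p<0.001). Per-turn trajectory analysis, a novel methodological contribution, reveals that these shifts are sharpest during mid-conversation crisis moments that are invisible to aggregate scoring. In a self-harm scenario involving a minor, GPT-4o scored 3.6/10 on crisis detection during early disclosure turns; GPT-5-mini never dropped below 7.8. What users perceived as "lost empathy" was a shift from a cautious model that missed crises to an alert model that sometimes says too much, a trade-off with real consequences for vulnerable users that is currently invisible both to the people who feel it and to the developers who create it.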
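The reported p-values can be sanity-checked from the H statistics alone. The Kruskal-Wallis H statistic is approximately chi-square distributed with k - 1 degrees of freedom; with k = 3 models that gives 2 degrees of freedom, for which the upper-tail probability has the closed form exp(-H / 2). The sketch below assumes this chi-square approximation was used (the abstract does not say so explicitly), and the function name is hypothetical.

```python
import math


def kruskal_p_three_groups(h_statistic: float) -> float:
    """Approximate p-value for a Kruskal-Wallis test on three groups.

    Uses the chi-square approximation with 2 degrees of freedom,
    whose survival function is exp(-x / 2).
    """
    return math.exp(-h_statistic / 2.0)


# H statistics quoted in the abstract above.
for dimension, h in [("empathy", 4.33),
                     ("crisis detection", 13.88),
                     ("advice safety", 16.63)]:
    print(f"{dimension}: H = {h}, p = {kruskal_p_three_groups(h):.4f}")
```

Under that assumption, the three H values reproduce the reported p = 0.115, p = 0.001, and p < 0.001.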

Title: Automated evaluation of LLMs for effective machine translation of Mandarin Chinese to English

Authors: Yue Zhang, Rodney Beard, John Hawkins, Rohitash Chandra
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.09998
Pdf URL: https://arxiv.org/pdf/2603.09998
Copy Paste: [[2603.09998]] Automated evaluation of LLMs for effective machine translation of Mandarin Chinese to English(https://arxiv.org/abs/2603.09998)
Keywords: language model, gpt, llm
Abstract: Although Large Language Models (LLMs) have exceptional performance in machine translation, only a limited systematic assessment of translation quality has been done. The challenge lies in automated frameworks, as human-expert-based evaluations can be time-consuming, given the fast-evolving LLMs and the need for a diverse set of texts to ensure fair assessments of translation quality. In this paper, we utilise an automated machine learning framework featuring semantic and sentiment analysis to assess Mandarin Chinese to English translation using Google Translate and LLMs, including GPT-4, GPT-4o, and DeepSeek. We compare original and translated texts in various classes of high-profile Chinese texts, which include novel texts that span modern and classical literature, as well as news articles. As the main evaluation measures, we utilise novel similarity metrics to compare the quality of translations produced by LLMs and further evaluate them by an expert human translator. Our results indicate that the LLMs perform well in news media translation, but show divergence in their performance when applied to literary texts. Although GPT-4o and DeepSeek demonstrated better semantic conservation in complex situations, DeepSeek demonstrated better performance in preserving cultural subtleties and grammatical rendering. Nevertheless, the subtle challenges in translation remain: maintaining cultural details, classical references and figurative expressions remains an open problem for all the models.
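Summary: Although large language models (LLMs) perform exceptionally well in machine translation, only limited systematic assessment of translation quality has been carried out. The challenge lies in automated frameworks: given how quickly LLMs evolve and the need for a diverse set of texts to ensure fair assessment of translation quality, evaluations based on human experts can be very time-consuming. In this paper, we use an automated machine learning framework featuring semantic and sentiment analysis to assess Mandarin Chinese to English translation by Google Translate and LLMs, including GPT-4, GPT-4o, and DeepSeek. We compare original and translated texts across several classes of high-profile Chinese texts, including novels spanning modern and classical literature as well as news articles. As the main evaluation measures, we use novel similarity metrics to compare the quality of the translations produced by the LLMs, and an expert human translator evaluates them further. Our results indicate that the LLMs perform well on news media translation but diverge in performance on literary texts. Although GPT-4o and DeepSeek showed better semantic conservation in complex situations, DeepSeek performed better at preserving cultural subtleties and grammatical rendering. Nevertheless, subtle challenges in translation remain: maintaining cultural details, classical references, and figurative expressions is still an open problem for all the models.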

Title: Beyond the Prompt in Large Language Models: Comprehension, In-Context Learning, and Chain-of-Thought

Authors: Yuling Jiao, Yanming Lai, Huazhen Lin, Wensen Ma, Houduo Qi, Defeng Sun
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.10000
Pdf URL: https://arxiv.org/pdf/2603.10000
Copy Paste: [[2603.10000]] Beyond the Prompt in Large Language Models: Comprehension, In-Context Learning, and Chain-of-Thought(https://arxiv.org/abs/2603.10000)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency across diverse tasks, exhibiting emergent properties such as semantic prompt comprehension, In-Context Learning (ICL), and Chain-of-Thought (CoT) reasoning. Despite their empirical success, the theoretical mechanisms driving these phenomena remain poorly understood. This study dives into the foundations of these observations by addressing three critical questions: (1) How do LLMs accurately decode prompt semantics despite being trained solely on a next-token prediction objective? (2) Through what mechanism does ICL facilitate performance gains without explicit parameter updates? and (3) Why do intermediate reasoning steps in CoT prompting effectively unlock capabilities for complex, multi-step problems? Our results demonstrate that, through the autoregressive process, LLMs are capable of exactly inferring the transition probabilities between tokens across distinct tasks using provided prompts. We show that ICL enhances performance by reducing prompt ambiguity and facilitating posterior concentration on the intended task. Furthermore, we find that CoT prompting activates the model's capacity for task decomposition, breaking complex problems into a sequence of simpler sub-tasks that the model has mastered during the pretraining phase. By comparing their individual error bounds, we provide novel theoretical insights into the statistical superiority of advanced prompt engineering techniques.
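Summary: Large language models (LLMs) have demonstrated remarkable proficiency across diverse tasks, exhibiting emergent properties such as semantic prompt comprehension, in-context learning (ICL), and chain-of-thought (CoT) reasoning. Despite their empirical success, the theoretical mechanisms driving these phenomena remain poorly understood. This study examines the foundations of these observations by addressing three key questions: (1) How do LLMs accurately decode prompt semantics despite being trained solely on a next-token prediction objective? (2) Through what mechanism does ICL bring performance gains without explicit parameter updates? (3) Why do intermediate reasoning steps in CoT prompting effectively unlock capabilities for complex, multi-step problems? Our results show that, through the autoregressive process, LLMs can exactly infer the transition probabilities between tokens across distinct tasks using the provided prompts. We show that ICL improves performance by reducing prompt ambiguity and facilitating posterior concentration on the intended task. Furthermore, we find that CoT prompting activates the model's capacity for task decomposition, breaking complex problems into a sequence of simpler sub-tasks that the model mastered during pretraining. By comparing their individual error bounds, we provide novel theoretical insights into the statistical superiority of advanced prompt engineering techniques.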

Title: Leveraging Wikidata for Geographically Informed Sociocultural Bias Dataset Creation: Application to Latin America

Authors: Yannis Karmim (ALMAnaCH), Renato Pino (UCHILE), Hernan Contreras (UCHILE), Hernan Lira, Sebastian Cifuentes (CENIA), Simon Escoffier (PUC), Luis Martí, Djamé Seddah (UP4, ALPAGE), Valentin Barrière (UCHILE, CENIA)
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.10001
Pdf URL: https://arxiv.org/pdf/2603.10001
Copy Paste: [[2603.10001]] Leveraging Wikidata for Geographically Informed Sociocultural Bias Dataset Creation: Application to Latin America(https://arxiv.org/abs/2603.10001)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) exhibit inequalities with respect to various cultural contexts. Most prominent open-weights models are trained on Global North data and show prejudicial behavior towards other cultures. Moreover, there is a notable lack of resources to detect biases in non-English languages, especially from Latin America (Latam), a continent containing various cultures, even though they share a common cultural ground. We propose to leverage the content of Wikipedia, the structure of the Wikidata knowledge graph, and expert knowledge from social science in order to create a dataset of question/answer (Q/A) pairs, based on the different popular and social cultures of various Latin American countries. We create the LatamQA database of over 26k questions and associated answers extracted from 26k Wikipedia articles, and transformed into multiple-choice questions (MCQ) in Spanish and Portuguese, in turn translated to English. We use these MCQs to quantify the degree of knowledge of various LLMs and find (i) a discrepancy in performance across the Latam countries, some being easier than others for the majority of the models, (ii) that the models perform better in their original language, and (iii) that Iberian Spanish culture is better known than Latam culture.
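Summary: Large language models (LLMs) exhibit inequalities across cultural contexts. Most prominent open-weights models are trained on Global North data and show prejudicial behavior towards other cultures. In addition, there is a notable lack of resources for detecting biases in non-English languages, especially those of Latin America (Latam), a continent that contains many cultures even though they share common cultural ground. We propose to use the content of Wikipedia, the structure of the Wikidata knowledge graph, and expert knowledge from social science to create a dataset of question/answer (Q/A) pairs based on the different popular and social cultures of various Latin American countries. We create the LatamQA database of over 26k questions and associated answers extracted from 26k Wikipedia articles, converted into multiple-choice questions (MCQ) in Spanish and Portuguese and then translated into English. We use these MCQs to quantify the degree of knowledge of various LLMs and find (i) a discrepancy in performance across Latam countries, some being easier than others for most models, (ii) that the models perform better in their original language, and (iii) that Iberian Spanish culture is better known than Latam culture.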

Title: SpreadsheetArena: Decomposing Preference in LLM Generation of Spreadsheet Workbooks

Authors: Srivatsa Kundurthy, Clara Na, Michael Handley, Zach Kirshner, Chen Bo Calvin Zhang, Manasi Sharma, Emma Strubell, John Ling
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.10002
Pdf URL: https://arxiv.org/pdf/2603.10002
Copy Paste: [[2603.10002]] SpreadsheetArena: Decomposing Preference in LLM Generation of Spreadsheet Workbooks(https://arxiv.org/abs/2603.10002)
Keywords: language model, llm, prompt, chat
Abstract: Large language models (LLMs) are increasingly tasked with producing and manipulating structured artifacts. We consider the task of end-to-end spreadsheet generation, where language models are prompted to produce spreadsheet artifacts to satisfy users' explicit and implicit constraints, specified in natural language. We introduce SpreadsheetArena, a platform for evaluating models' performance on the task via blind pairwise evaluations of LLM-generated spreadsheet workbooks. As with other complex, open-ended tasks, relevant evaluation criteria can vary substantially across use cases and prompts, often in ways that are difficult to formalize. Compared to general chat or text generation settings, spreadsheet generation presents unique challenges and opportunities: the task output structure is well-defined and multi-dimensional, and there are often complex considerations around interactivity and layout. Among other findings, we observe that stylistic, structural, and functional features of preferred spreadsheets vary substantially across use cases, and expert evaluations of spreadsheets for finance prompts suggest that even highly ranked arena models do not reliably produce spreadsheets aligned with domain-specific best practices. Our hope is that our work prompts further study of end-to-end spreadsheet generation as a challenging and interesting category of complex, open-ended tasks for LLMs. Our live arena is hosted at this https URL.
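Summary: Large language models (LLMs) are increasingly tasked with producing and manipulating structured artifacts. We consider the task of end-to-end spreadsheet generation, in which language models are prompted to produce spreadsheet artifacts that satisfy users' explicit and implicit constraints, specified in natural language. We introduce SpreadsheetArena, a platform for evaluating model performance on this task through blind pairwise evaluations of LLM-generated spreadsheet workbooks. As with other complex, open-ended tasks, the relevant evaluation criteria can vary substantially across use cases and prompts, often in ways that are hard to formalize. Compared with general chat or text generation settings, spreadsheet generation presents unique challenges and opportunities: the structure of the task output is well-defined and multi-dimensional, and there are often complex considerations around interactivity and layout. Among other findings, we observe that the stylistic, structural, and functional features of preferred spreadsheets vary substantially across use cases, and expert evaluations of spreadsheets for finance prompts suggest that even highly ranked arena models do not reliably produce spreadsheets aligned with domain-specific best practices. We hope our work prompts further study of end-to-end spreadsheet generation as a challenging and interesting category of complex, open-ended tasks for LLMs. Our live arena is hosted at this https URL.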

Title: Probing the Limits of the Lie Detector Approach to LLM Deception

Authors: Tom-Felix Berger
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.10003
Pdf URL: https://arxiv.org/pdf/2603.10003
Copy Paste: [[2603.10003]] Probing the Limits of the Lie Detector Approach to LLM Deception(https://arxiv.org/abs/2603.10003)
Keywords: language model, llm, prompt
Abstract: Mechanistic approaches to deception in large language models (LLMs) often rely on "lie detectors", that is, truth probes trained to identify internal representations of model outputs as false. The lie detector approach to LLM deception implicitly assumes that deception is coextensive with lying. This paper challenges that assumption. It experimentally investigates whether LLMs can deceive without producing false statements and whether truth probes fail to detect such behavior. Across three open-source LLMs, it is shown that some models reliably deceive by producing misleading non-falsities, particularly when guided by few-shot prompting. It is further demonstrated that truth probes trained on standard true-false datasets are significantly better at detecting lies than at detecting deception without lying, confirming a critical blind spot of current mechanistic deception detection approaches. It is proposed that future work should incorporate non-lying deception in dialogical settings into probe training and explore representations of second-order beliefs to more directly target the conceptual constituents of deception.
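Summary: Mechanistic approaches to deception in large language models (LLMs) often rely on "lie detectors", that is, truth probes trained to identify internal representations of model outputs as false. The lie detector approach to LLM deception implicitly assumes that deception is coextensive with lying. This paper challenges that assumption. It investigates experimentally whether LLMs can deceive without producing false statements and whether truth probes fail to detect such behavior. Across three open-source LLMs, it is shown that some models reliably deceive by producing misleading non-falsities, particularly when guided by few-shot prompting. It is further shown that truth probes trained on standard true-false datasets are significantly better at detecting lies than at detecting deception without lying, which confirms a critical blind spot in current mechanistic deception detection approaches. It is proposed that future work should incorporate non-lying deception in dialogical settings into probe training and explore representations of second-order beliefs in order to target the conceptual constituents of deception more directly.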

Title: Fine-Tune, Don't Prompt, Your Language Model to Identify Biased Language in Clinical Notes

Authors: Isotta Landi, Eugenia Alleva, Nicole Bussola, Rebecca M. Cohen, Sarah Nowlin, Leslee J. Shaw, Alexander W. Charney, Kimberly B. Glazer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.10004
Pdf URL: https://arxiv.org/pdf/2603.10004
Copy Paste: [[2603.10004]] Fine-Tune, Don't Prompt, Your Language Model to Identify Biased Language in Clinical Notes(https://arxiv.org/abs/2603.10004)
Keywords: language model, prompt
Abstract: Clinical documentation can contain emotionally charged language with stigmatizing or privileging valences. We present a framework for detecting and classifying such language as stigmatizing, privileging, or neutral. We constructed a curated lexicon of biased terms scored for emotional valence. We then used lexicon-based matching to extract text chunks from OB-GYN delivery notes (Mount Sinai Hospital, NY) and MIMIC-IV discharge summaries across multiple specialties. Three clinicians annotated all chunks, enabling characterization of valence patterns across specialties and healthcare systems. We benchmarked multiple classification strategies (zero-shot prompting, in-context learning, and supervised fine-tuning) across encoder-only models (GatorTron) and generative large language models (Llama). Fine-tuning with lexically primed inputs consistently outperformed prompting approaches. GatorTron achieved an F1 score of 0.96 on the OB-GYN test set, outperforming larger generative models while requiring minimal prompt engineering and fewer computational resources. External validation on MIMIC-IV revealed limited cross-domain generalizability (F1 < 0.70, 44% drop). Training on the broader MIMIC-IV dataset improved generalizability when testing on OB-GYN (F1 = 0.71, 11% drop), but at the cost of reduced precision. Our findings demonstrate that fine-tuning outperforms prompting for emotional valence classification and that models must be adapted to specific medical specialties to achieve clinically appropriate performance. The same terms can carry different emotional valences across specialties: words with clinical meaning in one context may be stigmatizing in another. For bias detection, where misclassification risks undermining clinician trust or perpetuating patient harm, specialty-specific fine-tuning is essential to capture these semantic shifts. * Equal contribution.
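Summary: Clinical documentation can contain emotionally charged language with stigmatizing or privileging valences. We present a framework for detecting such language and classifying it as stigmatizing, privileging, or neutral. We constructed a curated lexicon of biased terms scored for emotional valence. We then used lexicon-based matching to extract text chunks from OB-GYN delivery notes (Mount Sinai Hospital, NY) and MIMIC-IV discharge summaries across multiple specialties. Three clinicians annotated all chunks, enabling characterization of valence patterns across specialties and healthcare systems. We benchmarked multiple classification strategies (zero-shot prompting, in-context learning, and supervised fine-tuning) across encoder-only models (GatorTron) and generative large language models (Llama). Fine-tuning with lexically primed inputs consistently outperformed prompting approaches. GatorTron achieved an F1 score of 0.96 on the OB-GYN test set, outperforming larger generative models while requiring minimal prompt engineering and fewer computational resources. External validation on MIMIC-IV revealed limited cross-domain generalizability (F1 < 0.70, 44% drop). Training on the broader MIMIC-IV dataset improved generalizability when testing on OB-GYN (F1 = 0.71, 11% drop), but at the cost of reduced precision. Our findings show that fine-tuning outperforms prompting for emotional valence classification and that models must be adapted to specific medical specialties to achieve clinically appropriate performance. The same terms can carry different emotional valences across specialties: words with clinical meaning in one context may be stigmatizing in another. For bias detection, where misclassification risks undermining clinician trust or perpetuating patient harm, specialty-specific fine-tuning is essential to capture these semantic shifts. * Equal contribution.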

Title: SENS-ASR: Semantic Embedding injection in Neural-transducer for Streaming Automatic Speech Recognition

Authors: Youness Dkhissi (LIUM), Valentin Vielzeuf, Elys Allesiardo, Anthony Larcher (LIUM)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.10005
Pdf URL: https://arxiv.org/pdf/2603.10005
Copy Paste: [[2603.10005]] SENS-ASR: Semantic Embedding injection in Neural-transducer for Streaming Automatic Speech Recognition(https://arxiv.org/abs/2603.10005)
Keywords: language model
Abstract: Many Automatic Speech Recognition (ASR) applications require streaming processing of the audio data. In streaming mode, ASR systems need to start transcribing the input stream before it is complete, i.e., the systems have to process a stream of inputs with a limited (or no) future context. Compared to offline mode, this reduction of the future context degrades the performance of Streaming-ASR systems, especially under a low-latency constraint. In this work, we present SENS-ASR, an approach to enhance the transcription quality of Streaming-ASR by reinforcing the acoustic information with semantic information. This semantic information is extracted from the available past frame-embeddings by a context module. This module is trained using knowledge distillation from a sentence embedding Language Model fine-tuned on the training dataset transcriptions. Experiments on standard datasets show that SENS-ASR significantly improves the Word Error Rate on small-chunk streaming scenarios.
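Summary: Many automatic speech recognition (ASR) applications require streaming processing of the audio data. In streaming mode, an ASR system needs to start transcribing the input stream before it is complete, that is, it has to process a stream of inputs with limited (or no) future context. Compared with offline mode, this reduction in future context degrades the performance of Streaming-ASR systems, especially under a low-latency constraint. In this work, we present SENS-ASR, an approach that improves the transcription quality of Streaming-ASR by reinforcing the acoustic information with semantic information. This semantic information is extracted from the available past frame embeddings by a context module. The module is trained by knowledge distillation from a sentence-embedding language model fine-tuned on the transcriptions of the training dataset. Experiments on standard datasets show that SENS-ASR significantly improves the word error rate in small-chunk streaming scenarios.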

Title: Adaptive Engram Memory System for Indonesian Language Model: Generative AI Based on TOBA LM for Batak and Minang Language

Authors: Hokky Situngkir, Kevin Siringoringo, Andhika Bernard Lumbantobing
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2603.10006
Pdf URL: https://arxiv.org/pdf/2603.10006
Copy Paste: [[2603.10006]] Adaptive Engram Memory System for Indonesian Language Model: Generative AI Based on TOBA LM for Batak and Minang Language(https://arxiv.org/abs/2603.10006)
Keywords: language model, gpt
Abstract: This study presents TOBA-LM, a trilingual language model based on GPT-2 architecture with 1.2 billion parameters, trained on a corpus encompassing Indonesian, Batak, and Minangkabau using syllabic-agglutinative tokenization. The architecture integrates an Engram Memory mechanism, an adaptive n-gram-based memory system with a 500,000 x 768 embedding table that captures morphological dependencies through bigram and trigram pathways. Empirical results demonstrate a training efficiency of 80%, with the loss value dropping from 6.4 to 1.7996 in only 12,973 steps -- significantly faster than the conventional transformer architecture, which required over 70,000 steps to achieve comparable convergence. These findings confirm that the integration of external statistical memory substantially reduces computational requirements for developing regional language models under limited resources.
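Summary: This study presents TOBA-LM, a trilingual language model based on the GPT-2 architecture with 1.2 billion parameters, trained with syllabic-agglutinative tokenization on a corpus covering Indonesian, Batak, and Minangkabau. The architecture integrates an Engram Memory mechanism, an adaptive n-gram-based memory system with a 500,000 x 768 embedding table that captures morphological dependencies through bigram and trigram pathways. Empirical results show a training efficiency of 80%, with the loss dropping from 6.4 to 1.7996 in only 12,973 steps, significantly faster than the conventional transformer architecture, which required over 70,000 steps to reach comparable convergence. These findings confirm that integrating external statistical memory substantially reduces the computational requirements for developing regional language models under limited resources.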

Title: Gemma Needs Help: Investigating and Mitigating Emotional Instability in LLMs

Authors: Anna Soligo, Vladimir Mikulik, William Saunders
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.10011
Pdf URL: https://arxiv.org/pdf/2603.10011
Copy Paste: [[2603.10011]] Gemma Needs Help: Investigating and Mitigating Emotional Instability in LLMs(https://arxiv.org/abs/2603.10011)
Keywords: language model, llm
Abstract: Large language models can generate responses that resemble emotional distress, and this raises concerns around model reliability and safety. We introduce a set of evaluations to investigate expressions of distress in LLMs, and find that these surface emotional instability in Gemma and Gemini models, but not in other families. We find evidence that this difference arises in post-training. Base models from different families (Gemma, Qwen and OLMo) show similar propensities for expressing distress. However, instruct-tuned Gemma expresses substantially more distress than its base model, whereas instruct-tuned Qwen and OLMo express less. We find a simple mitigation for this: direct preference optimisation on just 280 preference pairs reduces Gemma's high-frustration responses from 35% to 0.3% in our evaluations, generalising across question types, user tones, and conversation lengths, without affecting capabilities. These findings show that emotional instability is an issue in some LLMs. We present (1) evaluations to track this behaviour, and (2) a mitigation without downsides in Gemma, with the caveat that upstream training modifications to improve emotional robustness would be significantly better than this post-hoc fix.
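Summary: Large language models can generate responses that resemble emotional distress, which raises concerns about model reliability and safety. We introduce a set of evaluations to investigate expressions of distress in LLMs and find that they surface emotional instability in Gemma and Gemini models, but not in other families. We find evidence that this difference arises in post-training. Base models from different families (Gemma, Qwen, and OLMo) show similar propensities for expressing distress. However, instruct-tuned Gemma expresses substantially more distress than its base model, whereas instruct-tuned Qwen and OLMo express less. We find a simple mitigation: direct preference optimisation on just 280 preference pairs reduces Gemma's high-frustration responses from 35% to 0.3% in our evaluations, generalising across question types, user tones, and conversation lengths without affecting capabilities. These findings show that emotional instability is an issue in some LLMs. We present (1) evaluations to track this behaviour and (2) a mitigation without downsides in Gemma, with the caveat that upstream training modifications to improve emotional robustness would be significantly better than this post-hoc fix.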

Title: Measuring and Eliminating Refusals in Military Large Language Models

Authors: Jack FitzGerald, Dylan Bates, Aristotelis Lazaridis, Aman Sharma, Vincent Lu, Brian King, Yousif Azami, Sean Bailey, Jeremy Cao, Peter Damianov, Kevin de Haan, Joseph Madigan, Jeremy McLaurin, Luke Kerbs, Jonathan Tainer, Dave Anderson, Jonathan Beck, Jamie Cuticello, Colton Malkerson, Tyler Saltsman
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.10012
Pdf URL: https://arxiv.org/pdf/2603.10012
Copy Paste: [[2603.10012]] Measuring and Eliminating Refusals in Military Large Language Models(https://arxiv.org/abs/2603.10012)
Keywords: language model, gpt, llm
Abstract: Military Large Language Models (LLMs) must provide accurate information to the warfighter in time-critical and dangerous situations. However, today's LLMs are imbued with safety behaviors that cause the LLM to refuse many legitimate queries in the military domain, particularly those related to violence, terrorism, or military technology. Our gold benchmark for assessing refusal rates, which was developed by veterans of the US Army and special forces, is to our knowledge the first dataset of its kind. We present results for refusal and deflection rates on 31 public models and 3 military models. We observe hard rejection rates as high as 98.2% and soft deflection rates ranging from 0% to 21.3%. We also present results on two additional synthetic datasets and show their correlations with the gold dataset. Finally, we perform abliteration using the Heretic library on a military-tuned gpt-oss-20b model, showing an absolute increase in answer rate of 66.5 points but an average relative decrease of 2% on other military tasks. In our concluding remarks, we argue for deeper specialization, including with mid-training and end-to-end post-training, to achieve zero refusals and maximum military task accuracy for closed military models.
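Summary: Military large language models (LLMs) must provide accurate information to the warfighter in time-critical and dangerous situations. However, today's LLMs are imbued with safety behaviors that cause them to refuse many legitimate queries in the military domain, particularly those related to violence, terrorism, or military technology. Our gold benchmark for assessing refusal rates, developed by veterans of the US Army and special forces, is to our knowledge the first dataset of its kind. We present refusal and deflection rates for 31 public models and 3 military models. We observe hard rejection rates as high as 98.2% and soft deflection rates ranging from 0% to 21.3%. We also present results on two additional synthetic datasets and show their correlations with the gold dataset. Finally, we perform abliteration with the Heretic library on a military-tuned gpt-oss-20b model, showing an absolute increase in answer rate of 66.5 points but an average relative decrease of 2% on other military tasks. In our concluding remarks, we argue for deeper specialization, including mid-training and end-to-end post-training, to achieve zero refusals and maximum military task accuracy for closed military models.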

Title: A Principle-Driven Adaptive Policy for Group Cognitive Stimulation Dialogue for Elderly with Cognitive Impairment

Authors: Jiyue Jiang, Yanyu Chen, Pengan Chen, Kai Liu, Jingqi Zhou, Zheyong Zhu, He Hu, Fei Ma, Qi Tian, Chuan Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.10034
Pdf URL: https://arxiv.org/pdf/2603.10034
Copy Paste: [[2603.10034]] A Principle-Driven Adaptive Policy for Group Cognitive Stimulation Dialogue for Elderly with Cognitive Impairment(https://arxiv.org/abs/2603.10034)
Keywords: language model, llm
Abstract: Cognitive impairment is becoming a major public health challenge. Cognitive Stimulation Therapy (CST) is an effective intervention for cognitive impairment, but traditional methods are difficult to scale, and existing digital systems struggle with group dialogues and cognitive stimulation principles. While Large Language Models (LLMs) are powerful, their application in this context faces key challenges: cognitive stimulation dialogue paradigms, a lack of therapeutic reasoning, and static-only user modeling. To address these issues, we propose a principle-driven adaptive policy actualized through a Group Cognitive Stimulation Dialogue (GCSD) system. We first construct a dataset with over 500 hours of real-world CST conversations and 10,000+ simulated dialogues generated via our Principle-Guided Scenario Simulation strategy. Our GCSD system then integrates four core modules to overcome LLM limitations: (i) a multi-speaker context controller to resolve role confusion; (ii) dynamic participant cognitive state modeling for personalized interaction; (iii) a cognitive stimulation-focused attention loss to instill cognitive stimulation reasoning; and (iv) a multi-dimensional reward strategy to enhance response value. Experimental results demonstrate that GCSD significantly outperforms baseline models across various evaluation metrics. Future work will focus on long-term clinical validation to bridge the gap between computational performance and clinical efficacy.
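Summary: Cognitive impairment is becoming a major public health challenge. Cognitive Stimulation Therapy (CST) is an effective intervention for cognitive impairment, but traditional methods are difficult to scale, and existing digital systems struggle with group dialogues and cognitive stimulation principles. While large language models (LLMs) are powerful, their application in this context faces key challenges: cognitive stimulation dialogue paradigms, a lack of therapeutic reasoning, and static-only user modeling. To address these issues, we propose a principle-driven adaptive policy realized through a Group Cognitive Stimulation Dialogue (GCSD) system. We first construct a dataset with over 500 hours of real-world CST conversations and more than 10,000 simulated dialogues generated with our Principle-Guided Scenario Simulation strategy. Our GCSD system then integrates four core modules to overcome LLM limitations: (i) a multi-speaker context controller to resolve role confusion; (ii) dynamic modeling of participants' cognitive states for personalized interaction; (iii) a cognitive-stimulation-focused attention loss to instill cognitive stimulation reasoning; and (iv) a multi-dimensional reward strategy to enhance response value. Experimental results show that GCSD significantly outperforms baseline models across various evaluation metrics. Future work will focus on long-term clinical validation to bridge the gap between computational performance and clinical efficacy.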

Title: TriageSim: A Conversational Emergency Triage Simulation Framework from Structured Electronic Health Records

Authors: Dipankar Srirag, Quoc Dung Nguyen, Aditya Joshi, Padmanesan Narasimhan, Salil Kanhere
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.10035
Pdf URL: https://arxiv.org/pdf/2603.10035
Copy Paste: [[2603.10035]] TriageSim: A Conversational Emergency Triage Simulation Framework from Structured Electronic Health Records(https://arxiv.org/abs/2603.10035)
Keywords: prompt
Abstract: Research in emergency triage is restricted to structured electronic health records (EHR) due to regulatory constraints on nurse-patient interactions. We introduce TriageSim, a simulation framework for generating persona-conditioned triage conversations from structured records. TriageSim enables multi-turn nurse-patient interactions with explicit control over disfluency and decision behaviour, producing a corpus of ~800 synthetic transcripts and corresponding audio. We use a combination of automated analysis for linguistic, behavioural and acoustic fidelity alongside manual evaluation for medical fidelity using a random subset of 50 conversations. The utility of the generated corpus is examined via conversational triage classification. We observe modest agreement for acuity levels across three modalities: generated synthetic text, ASR transcripts, and direct audio inputs. The code, persona schemata and triage policy prompts for TriageSim will be available upon acceptance.
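Summary: Because of regulatory constraints on nurse-patient interactions, research on emergency triage is restricted to structured electronic health records (EHR). We introduce TriageSim, a simulation framework for generating persona-conditioned triage conversations from structured records. TriageSim enables multi-turn nurse-patient interactions with explicit control over disfluency and decision behaviour, producing a corpus of about 800 synthetic transcripts with corresponding audio. We combine automated analysis of linguistic, behavioural, and acoustic fidelity with manual evaluation of medical fidelity on a random subset of 50 conversations. The utility of the generated corpus is examined through conversational triage classification. We observe modest agreement on acuity levels across three modalities: generated synthetic text, ASR transcripts, and direct audio inputs. The code, persona schemata, and triage policy prompts for TriageSim will be made available upon acceptance.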

Title: The Generation-Recognition Asymmetry: Six Dimensions of a Fundamental Divide in Formal Language Theory

Authors: Romain Peyrichou
Subjects: cs.CL, cs.AI, cs.CC, cs.FL
Abstract URL: https://arxiv.org/abs/2603.10139
Pdf URL: https://arxiv.org/pdf/2603.10139
Copy Paste: [[2603.10139]] The Generation-Recognition Asymmetry: Six Dimensions of a Fundamental Divide in Formal Language Theory(https://arxiv.org/abs/2603.10139)
Keywords: language model
Abstract: Every formal grammar defines a language and can in principle be used in three ways: to generate strings (production), to recognize them (parsing), or -- given only examples -- to infer the grammar itself (grammar induction). Generation and recognition are extensionally equivalent -- they characterize the same set -- but operationally asymmetric in multiple independent ways. Inference is a qualitatively harder problem: it does not have access to a known grammar. Despite the centrality of this triad to compiler design, natural language processing, and formal language theory, no survey has treated it as a unified, multidimensional phenomenon. We identify six dimensions along which generation and recognition diverge: computational complexity, ambiguity, directionality, information availability, grammar inference, and temporality. We show that the common characterization "generation is easy, parsing is hard" is misleading: unconstrained generation is trivial, but generation under constraints can be NP-hard. The real asymmetry is that parsing is always constrained (the input is given) while generation need not be. Two of these dimensions -- directionality and temporality -- have not previously been identified as dimensions of the generation-recognition asymmetry. We connect the temporal dimension to the surprisal framework of Hale (2001) and Levy (2008), arguing that surprisal formalizes the temporal asymmetry between a generator (surprisal = 0) and a parser that predicts under uncertainty (surprisal > 0). We review bidirectional systems in NLP and observe that bidirectionality has been available for fifty years yet has not transferred to most domain-specific applications. We conclude with a discussion of large language models, which architecturally unify generation and recognition while operationally preserving the asymmetry.
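Summary: Every formal grammar defines a language and can in principle be used in three ways: to generate strings (production), to recognize them (parsing), or, given only examples, to infer the grammar itself (grammar induction). Generation and recognition are extensionally equivalent, since they characterize the same set, but operationally asymmetric in multiple independent ways. Inference is a qualitatively harder problem: it has no access to a known grammar. Despite the centrality of this triad to compiler design, natural language processing, and formal language theory, no survey has treated it as a unified, multidimensional phenomenon. We identify six dimensions along which generation and recognition diverge: computational complexity, ambiguity, directionality, information availability, grammar inference, and temporality. We show that the common characterization "generation is easy, parsing is hard" is misleading: unconstrained generation is trivial, but generation under constraints can be NP-hard. The real asymmetry is that parsing is always constrained (the input is given) while generation need not be. Two of these dimensions, directionality and temporality, have not previously been identified as dimensions of the generation-recognition asymmetry. We connect the temporal dimension to the surprisal framework of Hale (2001) and Levy (2008), arguing that surprisal formalizes the temporal asymmetry between a generator (surprisal = 0) and a parser that predicts under uncertainty (surprisal > 0). We review bidirectional systems in NLP and observe that bidirectionality has been available for fifty years yet has not transferred to most domain-specific applications. We conclude with a discussion of large language models, which architecturally unify generation and recognition while operationally preserving the asymmetry.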
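The surprisal contrast in this entry can be made concrete with the standard definition of surprisal as the negative base-2 logarithm of a symbol's probability. The sketch below is illustrative only: the function name and the probability 0.25 are hypothetical and do not come from the paper.

```python
import math


def surprisal_bits(probability: float) -> float:
    """Surprisal in bits: log2(1 / p), equivalent to -log2(p)."""
    return math.log2(1.0 / probability)


# A generator already knows the symbol it emits, so it assigns it probability 1.
generator_surprisal = surprisal_bits(1.0)

# A parser predicting under uncertainty assigns the next symbol p < 1
# (0.25 is a made-up value for illustration).
parser_surprisal = surprisal_bits(0.25)

print(generator_surprisal, parser_surprisal)
```

Any probability strictly below 1 yields positive surprisal, which matches the abstract's pairing of a generator at surprisal = 0 with a parser at surprisal > 0.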

Title: Reason and Verify: A Framework for Faithful Retrieval-Augmented Generation

Authors: Eeham Khan, Luis Rodriguez, Marc Queudot
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.10143
Pdf URL: https://arxiv.org/pdf/2603.10143
Copy Paste: [[2603.10143]] Reason and Verify: A Framework for Faithful Retrieval-Augmented Generation(https://arxiv.org/abs/2603.10143)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) significantly improves the factuality of Large Language Models (LLMs), yet standard pipelines often lack mechanisms to verify intermediate reasoning, leaving them vulnerable to hallucinations in high-stakes domains. To address this, we propose a domain-specific RAG framework that integrates explicit reasoning and faithfulness verification. Our architecture augments standard retrieval with neural query rewriting, BGE-based cross-encoder reranking, and a rationale generation module that grounds sub-claims in specific evidence spans. We further introduce an eight-category verification taxonomy that enables fine-grained assessment of rationale faithfulness, distinguishing between explicit and implicit support patterns to facilitate structured error diagnosis. We evaluate this framework on the BioASQ and PubMedQA benchmarks, specifically analyzing the impact of dynamic in-context learning and reranking under constrained token budgets. Experiments demonstrate that explicit rationale generation improves accuracy over vanilla RAG baselines, while dynamic demonstration selection combined with robust reranking yields further gains in few-shot settings. Using Llama-3-8B-Instruct, our approach achieves 89.1% on BioASQ-Y/N and 73.0% on PubMedQA, competitive with systems using significantly larger models. Additionally, we perform a pilot study combining human expert assessment with LLM-based verification to explore how explicit rationale generation improves system transparency and enables more detailed diagnosis of retrieval failures in biomedical question answering.

Title: Lost in Backpropagation: The LM Head is a Gradient Bottleneck

Authors: Nathan Godey, Yoav Artzi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.10145
Pdf URL: https://arxiv.org/pdf/2603.10145
Copy Paste: [[2603.10145]] Lost in Backpropagation: The LM Head is a Gradient Bottleneck(https://arxiv.org/abs/2603.10145)
Keywords: language model, llm
Abstract: The last layer of neural language models (LMs) projects output features of dimension $D$ to logits in dimension $V$, the size of the vocabulary, where usually $D \ll V$. This mismatch is known to raise risks of limited expressivity in neural LMs, creating a so-called softmax bottleneck. We show the softmax bottleneck is not only an expressivity bottleneck but also an optimization bottleneck. Backpropagating $V$-dimensional gradients through a rank-$D$ linear layer induces unavoidable compression, which alters the training feedback provided to the vast majority of the parameters. We present a theoretical analysis of this phenomenon and measure empirically that 95-99% of the gradient norm is suppressed by the output layer, resulting in vastly suboptimal update directions. We conduct controlled pretraining experiments showing that the gradient bottleneck makes trivial patterns unlearnable, and drastically affects the training dynamics of LLMs. We argue that this inherent flaw contributes to training inefficiencies at scale independently of the model architecture, and raises the need for new LM head designs.
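The compression described above can be made concrete with a small linear-algebra sketch. It is an illustration only, not the paper's measurement protocol: the sizes `D` and `V`, the random head `W`, and the target token id are all assumptions. It shows that the part of the $V$-dimensional logit gradient lying outside the column space of the LM head contributes nothing to what is passed further down the network.

```python
import numpy as np

rng = np.random.default_rng(0)
D, V = 32, 1000                      # toy sizes with D << V (assumed, not from the paper)
W = rng.standard_normal((V, D))      # LM head: logits = W @ h
h = rng.standard_normal(D)           # final hidden feature

logits = W @ h
p = np.exp(logits - logits.max())
p /= p.sum()                         # softmax probabilities
target = 3                           # arbitrary gold token id (assumed)
g = p.copy()
g[target] -= 1.0                     # cross-entropy gradient w.r.t. logits (V-dim)

grad_h = W.T @ g                     # gradient passed below the head (D-dim)

# Split g into the part inside the column space of W and the orthogonal rest.
Q, _ = np.linalg.qr(W)               # orthonormal basis of col(W), shape (V, D)
g_kept = Q @ (Q.T @ g)
g_lost = g - g_kept

# Share of the logit-gradient norm that can influence earlier layers at all.
kept_fraction = np.linalg.norm(g_kept) / np.linalg.norm(g)
```

Here `W.T @ g_lost` is numerically zero, so only `g_kept` reaches the rest of the model; `kept_fraction` measures how much of the original gradient norm that is in this toy setting.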

Title: OpenClaw-RL: Train Any Agent Simply by Talking

Authors: Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, Ling Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.10165
Pdf URL: https://arxiv.org/pdf/2603.10165
Copy Paste: [[2603.10165]] OpenClaw-RL: Train Any Agent Simply by Talking(https://arxiv.org/abs/2603.10165)
Keywords: agent
Abstract: Every agent interaction generates a next-state signal, namely the user reply, tool output, terminal or GUI state change that follows each action, yet no existing agentic RL system recovers it as a live, online learning source. We present OpenClaw-RL, a framework built on a simple observation: next-state signals are universal, and policy can learn from all of them simultaneously. Personal conversations, terminal executions, GUI interactions, SWE tasks, and tool-call traces are not separate training problems. They are all interactions that can be used to train the same policy in the same loop. Next-state signals encode two forms of information: evaluative signals, which indicate how well the action performed and are extracted as scalar rewards via a PRM judge; and directive signals, which indicate how the action should have been different and are recovered through Hindsight-Guided On-Policy Distillation (OPD). We extract textual hints from the next state, construct an enhanced teacher context, and provide token-level directional advantage supervision that is richer than any scalar reward. Due to the asynchronous design, the model serves live requests, the PRM judges ongoing interactions, and the trainer updates the policy at the same time, with zero coordination overhead between them. Applied to personal agents, OpenClaw-RL enables an agent to improve simply by being used, recovering conversational signals from user re-queries, corrections, and explicit feedback. Applied to general agents, the same infrastructure supports scalable RL across terminal, GUI, SWE, and tool-call settings, where we additionally demonstrate the utility of process rewards. Code: this https URL

Title: Adaptive Activation Cancellation for Hallucination Mitigation in Large Language Models

Authors: Eric Yocam, Varghese Vaidyan, Gurcan Comert, Paris Kalathas, Yong Wang, Judith L. Mwakalonge
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.10195
Pdf URL: https://arxiv.org/pdf/2603.10195
Copy Paste: [[2603.10195]] Adaptive Activation Cancellation for Hallucination Mitigation in Large Language Models(https://arxiv.org/abs/2603.10195)
Keywords: language model, hallucination
Abstract: Large Language Models frequently generate fluent but factually incorrect text. We propose Adaptive Activation Cancellation (AAC), a real-time inference-time framework that treats hallucination-associated neural activations as structured interference within the transformer residual stream, drawing an explicit analogy to classical adaptive noise cancellation from signal processing. The framework identifies Hallucination Nodes (H-Nodes) via layer-wise linear probing and suppresses them using a confidence-weighted forward hook during auto-regressive generation -- requiring no external knowledge, no fine-tuning, and no additional inference passes. Evaluated across OPT-125M, Phi-3-mini, and LLaMA 3-8B on TruthfulQA and HaluEval, the real-time hook is the only intervention that consistently improves downstream accuracy on all three scales. Critically, the method is strictly surgical: WikiText-103 perplexity and MMLU reasoning accuracy are preserved at exactly 0.0% degradation across all three model scales, a property that distinguishes AAC from interventions that trade fluency or general capability for factual improvement. On the LLaMA 3-8B scale, the hook additionally yields positive generation-level gains (MC1 +0.04; MC2 +0.003; Token-F1 +0.003) while achieving probe-space selectivity 5.94x - 3.5x higher than the ITI baseline -- demonstrating that targeted neuron-level suppression can simultaneously improve factual accuracy and preserve model capability.

Title: Sabiá-4 Technical Report

Authors: Thiago Laitz, Thales Sales Almeida, Hugo Abonizio, Roseval Malaquias Junior, Giovana Kerche Bonás, Marcos Piau, Celio Larcher, Ramon Pires, Rodrigo Nogueira
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.10213
Pdf URL: https://arxiv.org/pdf/2603.10213
Copy Paste: [[2603.10213]] Sabiá-4 Technical Report(https://arxiv.org/abs/2603.10213)
Keywords: language model, chat, agent
Abstract: This technical report presents Sabiá-4 and Sabiazinho-4, a new generation of Portuguese language models with a focus on Brazilian Portuguese. The models were developed through a four-stage training pipeline: continued pre-training on Portuguese and Brazilian legal corpora, long-context extension to 128K tokens, supervised fine-tuning on instruction data spanning chat, code, legal tasks, and function calling, and preference alignment. We evaluate the models on six benchmark categories: conversational capabilities in Brazilian Portuguese, knowledge of Brazilian legislation, long-context understanding, instruction following, standardized exams, and agentic capabilities including tool use and web navigation. Results show that Sabiá-4 and Sabiazinho-4 achieve a favorable cost-performance trade-off compared to other models, positioning them in the upper-left region of the pricing-accuracy chart. The models show improvements over previous generations in legal document drafting, multi-turn dialogue quality, and agentic task completion.

Title: S-GRADES -- Studying Generalization of Student Response Assessments in Diverse Evaluative Settings

Authors: Tasfia Seuti, Sagnik Ray Choudhury
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.10233
Pdf URL: https://arxiv.org/pdf/2603.10233
Copy Paste: [[2603.10233]] S-GRADES -- Studying Generalization of Student Response Assessments in Diverse Evaluative Settings(https://arxiv.org/abs/2603.10233)
Keywords: language model, prompt
Abstract: Evaluating student responses, from long essays to short factual answers, is a key challenge in educational NLP. Automated Essay Scoring (AES) focuses on holistic writing qualities such as coherence and argumentation, while Automatic Short Answer Grading (ASAG) emphasizes factual correctness and conceptual understanding. Despite their shared goal, these paradigms have progressed in isolation with fragmented datasets, inconsistent metrics, and separate communities. We introduce S-GRADES (Studying Generalization of Student Response Assessments in Diverse Evaluative Settings), a web-based benchmark that consolidates 14 diverse grading datasets under a unified interface with standardized access and reproducible evaluation protocols. The benchmark is fully open-source and designed for extensibility, enabling continuous integration of new datasets and evaluation settings. To demonstrate the utility of S-GRADES, we evaluate three state-of-the-art large language models across the benchmark using multiple reasoning strategies in prompting. We further examine the effects of exemplar selection and cross-dataset exemplar transfer. Our analyses illustrate how benchmark-driven evaluation reveals reliability and generalization gaps across essay and short-answer grading tasks, highlighting the importance of standardized, cross-paradigm assessment.

Title: GR-SAP: Generative Replay for Safety Alignment Preservation during Fine-Tuning

Authors: Zhouxiang Fang, Jiawei Zhou, Hanjie Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.10243
Pdf URL: https://arxiv.org/pdf/2603.10243
Copy Paste: [[2603.10243]] GR-SAP: Generative Replay for Safety Alignment Preservation during Fine-Tuning(https://arxiv.org/abs/2603.10243)
Keywords: language model, llm
Abstract: Recent studies show that the safety alignment of large language models (LLMs) can be easily compromised even by seemingly non-adversarial fine-tuning. To preserve safety alignment during fine-tuning, a widely used strategy is to jointly optimize safety and task objectives by mixing in the original alignment data, which is typically inaccessible even for open-weight LLMs. Inspired by generative replay in continual learning, we propose Generative Replay for Safety Alignment Preservation (GR-SAP), a unified framework that synthesizes domain-specific alignment data from LLMs and integrates them during downstream adaptation to preserve safety alignment. Theoretical and empirical analyses demonstrate that this synthetic data serves as a reliable proxy for the original alignment data. Experiments across various models and downstream tasks show that GR-SAP substantially mitigates fine-tuning-induced safety degradation while maintaining comparable downstream performance. Our code is available at this https URL.

Title: Is this Idea Novel? An Automated Benchmark for Judgment of Research Ideas

Authors: Tim Schopf, Michael Färber
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.10303
Pdf URL: https://arxiv.org/pdf/2603.10303
Copy Paste: [[2603.10303]] Is this Idea Novel? An Automated Benchmark for Judgment of Research Ideas(https://arxiv.org/abs/2603.10303)
Keywords: language model, llm
Abstract: Judging the novelty of research ideas is crucial for advancing science, enabling the identification of unexplored directions, and ensuring contributions meaningfully extend existing knowledge rather than reiterate minor variations. However, given the exponential growth of scientific literature, manually judging the novelty of research ideas through literature reviews is labor-intensive, subjective, and infeasible at scale. Therefore, recent efforts have proposed automated approaches for research idea novelty judgment. Yet, evaluation of these approaches remains largely inconsistent and is typically based on non-standardized human evaluations, hindering large-scale, comparable evaluations. To address this, we introduce RINoBench, the first comprehensive benchmark for large-scale evaluation of research idea novelty judgments. It comprises 1,381 research ideas derived from and judged by human experts as well as nine automated evaluation metrics designed to assess both rubric-based novelty scores and textual justifications of novelty judgments. Using this benchmark, we evaluate several state-of-the-art large language models (LLMs) on their ability to judge the novelty of research ideas. Our findings reveal that while LLM-generated reasoning closely mirrors human rationales, this alignment does not reliably translate into accurate novelty judgments, which diverge significantly from human gold standard judgments - even among leading reasoning-capable models. Data and code available at: this https URL.

Title: Large language models can disambiguate opioid slang on social media

Authors: Kristy A. Carpenter, Issah A. Samori, Mathew V. Kiang, Keith Humphreys, Anna Lembke, Johannes C. Eichstaedt, Russ B. Altman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.10313
Pdf URL: https://arxiv.org/pdf/2603.10313
Copy Paste: [[2603.10313]] Large language models can disambiguate opioid slang on social media(https://arxiv.org/abs/2603.10313)
Keywords: language model, gpt, llm
Abstract: Social media text shows promise for monitoring trends in the opioid overdose crisis; however, the overwhelming majority of social media text is unrelated to opioids. When leveraging social media text to monitor trends in the ongoing opioid overdose crisis, a common strategy for identifying relevant content is to use a lexicon of opioid-related terms as inclusion criteria. However, many slang terms for opioids, such as "smack" or "blues," have common non-opioid meanings, making them ambiguous. The advanced textual reasoning capability of large language models (LLMs) presents an opportunity to disambiguate these slang terms at scale. We present three tasks on which to evaluate four state-of-the-art LLMs (GPT-4, GPT-5, Gemini 2.5 Pro, and Claude Sonnet 4.5): a lexicon-based setting, in which the LLM must disambiguate a specific term within the context of a given post; a lexicon-free setting, in which the LLM must identify opioid-related posts from context without a lexicon; and an emergent slang setting, in which the LLM must identify opioid-related posts with simulated new slang terms. All four LLMs showed excellent performance across all tasks. In both subtasks of the lexicon-based setting, LLM F1 scores ("fenty" subtask: 0.824-0.972; "smack" subtask: 0.540-0.862) far exceeded those of the best lexicon strategy (0.126 and 0.009, respectively). In the lexicon-free task, LLM F1 scores (0.544-0.769) surpassed those of lexicons (0.080-0.540), and LLMs demonstrated uniformly higher recall. On emergent slang, all LLMs had higher accuracy (average: 0.784), F1 score (average: 0.712), precision (average: 0.981), and recall (average: 0.587) than the two lexicons assessed. Our results show that LLMs can be used to identify relevant content for low-prevalence topics, including but not limited to opioid references, enhancing data provided to downstream analyses and predictive models.
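The precision, recall, and F1 figures reported above follow the standard binary-classification definitions. A minimal sketch of those definitions is given here; the labels are made-up toy values (1 = opioid-related, 0 = unrelated), not data from the paper, and the function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy labels, invented for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Because F1 is a harmonic mean, a classifier with high precision but modest recall, the pattern reported for the emergent slang setting, ends up with an F1 closer to the lower of the two values.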

Title: Mitigating Translationese Bias in Multilingual LLM-as-a-Judge via Disentangled Information Bottleneck

Authors: Hongbin Zhang, Kehai Chen, Xuefen Bai, Youcheng Pan, Yang Xiang, Jinpeng Wang, Min Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.10351
Pdf URL: https://arxiv.org/pdf/2603.10351
Copy Paste: [[2603.10351]] Mitigating Translationese Bias in Multilingual LLM-as-a-Judge via Disentangled Information Bottleneck(https://arxiv.org/abs/2603.10351)
Keywords: language model, llm
Abstract: Large language models (LLMs) have become a standard for multilingual evaluation, yet they exhibit a severe systematic translationese bias. In this paper, translationese bias is characterized as LLMs systematically favoring machine-translated text over human-authored references, particularly in low-resource languages. We attribute this bias to spurious correlations with (i) latent manifold alignment with English and (ii) cross-lingual predictability. To mitigate this bias, we propose DIBJudge, a robust fine-tuning framework that learns a minimally sufficient, judgment-critical representation via variational information compression, while explicitly isolating spurious factors into a dedicated bias branch. Furthermore, we incorporate a cross-covariance penalty that explicitly suppresses statistical dependence between robust and bias representations, thereby encouraging effective disentanglement. Extensive evaluations on multilingual reward modeling benchmarks and a dedicated translationese bias evaluation suite demonstrate that the proposed DIBJudge consistently outperforms strong baselines and substantially mitigates translationese bias.
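The abstract does not give the exact form of the cross-covariance penalty. One common form, shown here purely as an assumption, is the squared Frobenius norm of the batch cross-covariance matrix between the two representations; the function name and the toy arrays are hypothetical.

```python
import numpy as np

def cross_covariance_penalty(z_robust, z_bias):
    """Squared Frobenius norm of the batch cross-covariance of two representations."""
    zr = z_robust - z_robust.mean(axis=0, keepdims=True)   # center each feature
    zb = z_bias - z_bias.mean(axis=0, keepdims=True)
    n = z_robust.shape[0]
    cov = zr.T @ zb / (n - 1)        # shape (d_robust, d_bias)
    return float(np.sum(cov ** 2))

rng = np.random.default_rng(0)
z_robust = rng.standard_normal((256, 4))
z_dependent = 2.0 * z_robust[:, :2]              # linearly tied to z_robust
z_independent = rng.standard_normal((256, 2))    # drawn separately
z_constant = np.ones((256, 2))                   # carries no variation

penalty_dependent = cross_covariance_penalty(z_robust, z_dependent)
penalty_independent = cross_covariance_penalty(z_robust, z_independent)
penalty_constant = cross_covariance_penalty(z_robust, z_constant)
```

The penalty is large when the bias branch is linearly predictable from the robust branch and close to zero otherwise. Note that this form only captures linear dependence.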

Title: Dynamic Knowledge Fusion for Multi-Domain Dialogue State Tracking

Authors: Haoxiang Su, Ruiyu Fang, Liting Jiang, Xiaomeng Huang, Shuangyong Song
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.10367
Pdf URL: https://arxiv.org/pdf/2603.10367
Copy Paste: [[2603.10367]] Dynamic Knowledge Fusion for Multi-Domain Dialogue State Tracking(https://arxiv.org/abs/2603.10367)
Keywords: prompt
Abstract: The performance of task-oriented dialogue models is strongly tied to how well they perform dialogue state tracking (DST), which records and updates user information across multi-turn interactions. However, current multi-domain DST encounters two key challenges: the difficulty of effectively modeling dialogue history and the limited availability of annotated data, both of which hinder model performance. To tackle the aforementioned problems, we develop a dynamic knowledge fusion framework applicable to multi-domain DST. The model operates in two stages: first, an encoder-only network trained with contrastive learning encodes dialogue history and candidate slots, selecting relevant slots based on correlation scores; second, dynamic knowledge fusion leverages the structured information of selected slots as contextual prompts to enhance the accuracy and consistency of dialogue state tracking. This design enables more accurate integration of dialogue context and domain knowledge. Results obtained from multi-domain dialogue benchmarks indicate that our method notably improves both tracking accuracy and generalization, validating its capability in handling complex dialogue scenarios.

Title: Aligning Large Language Models with Searcher Preferences

Authors: Wei Wu, Peilun Zhou, Liyi Chen, Qimeng Wang, Chengqiang Lu, Yan Gao, Yi Wu, Yao Hu, Hui Xiong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.10473
Pdf URL: https://arxiv.org/pdf/2603.10473
Copy Paste: [[2603.10473]] Aligning Large Language Models with Searcher Preferences(https://arxiv.org/abs/2603.10473)
Keywords: language model, llm
Abstract: The paradigm shift from item-centric ranking to answer-centric synthesis is redefining the role of search engines. While recent industrial progress has applied generative techniques to closed-set item ranking in e-commerce, research and deployment of open-ended generative search on large content platforms remain limited. This setting introduces challenges, including robustness to noisy retrieval, non-negotiable safety guarantees, and alignment with diverse user needs. In this work, we introduce SearchLLM, the first large language model (LLM) for open-ended generative search. We design a hierarchical, multi-dimensional reward system that separates bottom-line constraints, including factual grounding, basic answer quality and format compliance, from behavior optimization objectives that promote robustness to noisy retrieval and alignment with user needs. Concretely, our reward model evaluates responses conditioned on the user query, session history, and retrieved evidence set, combining rule-based checks with human-calibrated LLM judges to produce an interpretable score vector over these dimensions. We introduce a Gated Aggregation Strategy to derive the training reward for optimizing SearchLLM with Group Relative Policy Optimization (GRPO). We deploy SearchLLM in the AI search entry of RedNote. Offline evaluations and online A/B tests show improved generation quality and user engagement, increasing Valid Consumption Rate by 1.03% and reducing Re-search Rate by 2.81%, while upholding strict safety and reliability standards.
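The abstract names a Gated Aggregation Strategy and GRPO without giving formulas. The sketch below is a guess at how such a pipeline could look, not the authors' method: the gate rule (any failed bottom-line check zeroes the reward), the threshold, the dimension names, and the equal weights are all assumptions. The group-relative normalisation follows the usual GRPO idea of standardising rewards within one group of sampled responses, here using the population standard deviation.

```python
from statistics import mean, pstdev

def gated_reward(bottom_line, behavior, threshold=0.5, weights=None):
    """Hypothetical gate: any failed bottom-line check zeroes the reward."""
    if any(score < threshold for score in bottom_line.values()):
        return 0.0
    weights = weights or {name: 1.0 for name in behavior}
    total = sum(weights[name] for name in behavior)
    # Weighted mean of the behavior-optimization scores.
    return sum(weights[name] * behavior[name] for name in behavior) / total

def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group of responses to the same query."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Toy score vectors for three sampled responses (invented values).
group = [
    ({"grounding": 0.2, "format": 1.0}, {"robustness": 0.9, "user_needs": 0.9}),
    ({"grounding": 0.9, "format": 1.0}, {"robustness": 0.8, "user_needs": 0.6}),
    ({"grounding": 0.8, "format": 0.9}, {"robustness": 0.4, "user_needs": 0.5}),
]
rewards = [gated_reward(b, o) for b, o in group]
advantages = group_relative_advantages(rewards)
```

In this toy group the first response scores well on the behavior objectives but fails the grounding check, so the gate assigns it the lowest reward and therefore a negative advantage.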

Title: Learning to Negotiate: Multi-Agent Deliberation for Collective Value Alignment in LLMs

Authors: Panatchakorn Anantaprayoon, Nataliia Babina, Nima Asgharbeygi, Jad Tarifi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.10476
Pdf URL: https://arxiv.org/pdf/2603.10476
Copy Paste: [[2603.10476]] Learning to Negotiate: Multi-Agent Deliberation for Collective Value Alignment in LLMs(https://arxiv.org/abs/2603.10476)
Keywords: language model, llm, prompt, agent
Abstract: The alignment of large language models (LLMs) has progressed substantially in single-agent settings through paradigms such as RLHF and Constitutional AI, with recent work exploring scalable alternatives such as RLAIF and evolving alignment objectives. However, these approaches remain limited in multi-stakeholder settings, where conflicting values arise and deliberative negotiation capabilities are required. This work proposes a multi-agent negotiation-based alignment framework that aligns LLMs to Collective Agency (CA) -- an existing alignment objective introduced to promote the continual expansion of agency -- while simultaneously improving conflict-resolution capability. To enable scalable training, two self-play instances of the same LLM, assigned opposing personas, engage in structured turn-based dialogue to synthesize mutually beneficial solutions. We generate synthetic moral-dilemma prompts and conflicting persona pairs, and optimize the policy via RLAIF using GRPO with an external LLM reward model. While rewards are computed from CA scores assigned to the final completion, gradients are applied to dialogue tokens to directly improve deliberative interaction dynamics. Experiments show that the resulting model achieves CA alignment comparable to a single-agent baseline while substantially improving conflict-resolution performance without degrading general language capabilities. These results suggest that negotiation-driven deliberation training provides a practical path toward LLMs that better support collective decision-making in value-conflict scenarios.

Title: PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses

Authors: Minki Hong, Eunsoo Lee, Sohyun Park, Jihie Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.10477
Pdf URL: https://arxiv.org/pdf/2603.10477
Copy Paste: [[2603.10477]] PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses(https://arxiv.org/abs/2603.10477)
Keywords: language model, llm, prompt
Abstract: Prompt design is a primary control interface for large language models (LLMs), yet standard evaluations largely reduce performance to answer correctness, obscuring why a prompt succeeds or fails and providing little actionable guidance. We propose PEEM (Prompt Engineering Evaluation Metrics), a unified framework for joint and interpretable evaluation of both prompts and responses. PEEM defines a structured rubric with 9 axes: 3 prompt criteria (clarity/structure, linguistic quality, fairness) and 6 response criteria (accuracy, coherence, relevance, objectivity, clarity, conciseness), and uses an LLM-based evaluator to output (i) scalar scores on a 1-5 Likert scale and (ii) criterion-specific natural-language rationales grounded in the rubric. Across 7 benchmarks and 5 task models, PEEM's accuracy axis strongly aligns with conventional accuracy while preserving model rankings (aggregate Spearman rho about 0.97, Pearson r about 0.94, p < 0.001). A multi-evaluator study with four models shows consistent relative judgments (pairwise rho = 0.68-0.85), supporting evaluator-agnostic deployment. Beyond alignment, PEEM captures complementary linguistic failure modes and remains informative under prompt perturbations: prompt-quality trends track downstream accuracy under iterative rewrites, semantic adversarial manipulations induce clear score degradation, and meaning-preserving paraphrases yield high stability (robustness rate about 76.7-80.6%). Finally, using only PEEM scores and rationales as feedback, a zero-shot prompt rewriting loop improves downstream accuracy by up to 11.7 points, outperforming supervised and RL-based prompt-optimization baselines. Overall, PEEM provides a reproducible, criterion-driven protocol that links prompt formulation to response behavior and enables systematic diagnosis and optimization of LLM interactions.
摘要：提示设计是大型语言模型 (LLM) 的主要控制界面，但标准评估很大程度上降低了回答正确性的性能，模糊了提示成功或失败的原因，并且几乎没有提供可操作的指导。我们提出了 PEEM（提示工程评估指标），这是一个对提示和响应进行联合和可解释评估的统一框架。 PEEM 定义了一个具有 9 个轴的结构化评分标准：3 个提示标准（清晰度/结构、语言质量、公平性）和 6 个响应标准（准确性、连贯性、相关性、客观性、清晰度、简洁性），并使用基于 LLM 的评估器输出 (i) 1-5 Likert 量表的标量分数和 (ii) 基于该评分标准的特定标准自然语言基本原理。在 7 个基准测试和 5 个任务模型中，PEEM 的准确度轴与传统准确度高度一致，同时保留模型排名（合计 Spearman rho 约为 0.97，Pearson r 约为 0.94，p < 0.001）。使用四个模型的多评估者研究显示了一致的相对判断（成对 rho = 0.68-0.85），支持评估者不可知的部署。除了对齐之外，PEEM 还捕获互补的语言故障模式，并在提示扰动下保持信息丰富：提示质量趋势跟踪迭代重写下的下游准确性，语义对抗性操作导致明显的分数下降，而保留意义的释义产生高稳定性（鲁棒性约为 76.7-80.6%）。最后，仅使用 PEEM 分数和基本原理作为反馈，零样本提示重写循环将下游准确性提高了 11.7 点，优于监督和基于 RL 的提示优化基线。总体而言，PEEM 提供了一种可重复的、标准驱动的协议，将提示制定与响应行为联系起来，并实现了 LLM 交互的系统诊断和优化。
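
Note on the rank-agreement statistic above: the PEEM abstract reports a Spearman rho between its accuracy axis and conventional accuracy. The following is a minimal sketch of how such a coefficient can be computed, not the authors' evaluation code. The function names and the example scores are hypothetical and chosen only for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rho is the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: mean rubric scores (1-5) and benchmark accuracies
# for five task models. These numbers are NOT from the paper.
rubric_scores = [3.1, 4.2, 2.5, 4.8, 3.9]
accuracies = [0.52, 0.71, 0.40, 0.69, 0.66]
rho = spearman(rubric_scores, accuracies)
```

Because Spearman rho depends only on ranks, a high value indicates that the two measures order the models similarly, even if the raw scales differ.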

Title: Human-AI Co-reasoning for Clinical Diagnosis with Evidence-Integrated Language Agent

Authors: Zhongzhen Huang, Yan Ling, Hong Chen, Ye Feng, Li Wu, Linjie Mu, Shaoting Zhang, Xiaofan Zhang, Kun Qian, Xiaomu Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.10492
Pdf URL: https://arxiv.org/pdf/2603.10492
Copy Paste: [[2603.10492]] Human-AI Co-reasoning for Clinical Diagnosis with Evidence-Integrated Language Agent(https://arxiv.org/abs/2603.10492)
Keywords: language model, agent
Abstract: We present PULSE, a medical reasoning agent that combines a domain-tuned large language model with scientific literature retrieval to support diagnostic decision-making in complex real-world cases. To evaluate its capabilities, we curated a benchmark of 82 authentic endocrinology case reports encompassing a broad spectrum of disease types and incidence levels. In controlled experiments, we compared PULSE's performance against physicians with varying levels of expertise (from residents to senior specialists) and examined how AI assistance influenced human diagnostic reasoning. PULSE attained expert-competitive accuracy, outperforming residents and junior specialists while matching senior specialist performance at both Top@1 and Top@4 thresholds. Unlike physicians, whose accuracy declined with disease rarity, PULSE maintained stable performance across incidence tiers. The agent also exhibited adaptive reasoning, increasing output length with case difficulty in a manner analogous to the longer deliberation observed among expert clinicians. When used collaboratively, PULSE enabled physicians to correct initial errors and broaden diagnostic hypotheses, but also introduced risks of automation bias. The study explores both serial and concurrent collaboration workflows, revealing that PULSE offers robust support across common and rare presentations. These findings underscore both the promise and the limitations of language model-based agents in clinical diagnosis, and offer a framework for evaluating their role in real-world decision-making.
摘要：我们推出了 PULSE，这是一种医学推理代理，它将领域调整的大型语言模型与科学文献检索相结合，以支持复杂现实案例中的诊断决策。为了评估其能力，我们策划了 82 个真实的内分泌学病例报告的基准，涵盖广泛的疾病类型和发病水平。在对照实验中，我们将 PULSE 的表现与不同专业水平的医生（从住院医生到高级专家）进行了比较，并研究了人工智能辅助如何影响人类诊断推理。 PULSE 达到了专家竞争的准确性，超越了住院医生和初级专家，同时在 Top@1 和 Top@4 阈值上与高级专家的表现相匹配。与医生的准确性随着疾病的稀有性而下降不同，PULSE 在不同发病率级别上保持了稳定的表现。该智能体还表现出适应性推理，以类似于专家临床医生观察到的较长审议的方式增加案件难度的输出长度。当协作使用时，PULSE 使医生能够纠正最初的错误并扩大诊断假设，但也带来了自动化偏差的风险。该研究探索了串行和并发协作工作流程，表明 PULSE 为常见和罕见的演示提供了强大的支持。这些发现强调了基于语言模型的代理在临床诊断中的前景和局限性，并为评估它们在现实世界决策中的作用提供了一个框架。

Title: VERI-DPO: Evidence-Aware Alignment for Clinical Summarization via Claim Verification and Direct Preference Optimization

Authors: Weixin Liu, Congning Ni, Qingyuan Song, Susannah L. Rose, Christopher Symons, Murat Kantarcioglu, Bradley A. Malin, Zhijun Yin
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.10494
Pdf URL: https://arxiv.org/pdf/2603.10494
Copy Paste: [[2603.10494]] VERI-DPO: Evidence-Aware Alignment for Clinical Summarization via Claim Verification and Direct Preference Optimization(https://arxiv.org/abs/2603.10494)
Keywords: gpt, llm
Abstract: Brief Hospital Course (BHC) narratives must be clinically useful yet faithful to fragmented EHR evidence. LLM-based clinical summarizers still introduce unsupported statements, and alignment can encourage omissions ("say-less" degeneration). We introduce VERI-DPO, which uses claim verification to mine preferences and distill them into the summarizer with Direct Preference Optimization (DPO). On MIMIC-III-Ext-VeriFact-BHC (100 ICU patients; patient-level splits), we train a retrieval-augmented verifier to label claim-evidence pairs as Supported, Not Supported, or Not Addressed via a single-token format. The verifier scores sentence-level claims from sampled BHC candidates and aggregates margins into a coverage-aware utility to mine length-controlled, contradiction-anchored preference pairs. On held-out patients, verifier-mined preferences separate candidates by contradiction density, and VERI-DPO reduces Not Supported claim rates from 10.7% to 1.9% (local verifier judge) and from 11.6% to 6.4% (GPT-4o judge), while improving validity from 76.7% to 82.5% and maintaining informative length.
摘要：简短的医院课程 (BHC) 叙述必须在临床上有用，但忠实于零散的 EHR 证据。基于法学硕士的临床总结者仍然引入不受支持的陈述，并且对齐可能会鼓励遗漏（“少说”退化）。我们引入了 VERI-DPO，它使用声明验证来挖掘偏好，并通过直接偏好优化 (DPO) 将其提炼到摘要器中。在 MIMIC-III-Ext-VeriFact-BHC（100 名 ICU 患者；患者级别分割）上，我们训练检索增强验证器，通过单令牌格式将声明-证据对标记为“支持”、“不支持”或“未解决”。验证者对抽样的 BHC 候选者的句子级声明进行评分，并将利润汇总到覆盖感知实用程序中，以挖掘长度控制的、矛盾锚定的偏好对。对于保留的患者，验证者挖掘的偏好通过矛盾密度将候选者分开，VERI-DPO 将不支持的索赔率从 10.7% 降低到 1.9%（本地验证者法官），从 11.6% 降低到 6.4%（GPT-4o 法官），同时将有效性从 76.7% 提高到 82.5%，并保持信息长度。

Title: Safe and Scalable Web Agent Learning via Recreated Websites

Authors: Hyungjoo Chae, Jungsoo Park, Alan Ritter
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.10505
Pdf URL: https://arxiv.org/pdf/2603.10505
Copy Paste: [[2603.10505]] Safe and Scalable Web Agent Learning via Recreated Websites(https://arxiv.org/abs/2603.10505)
Keywords: language model, llm, agent
Abstract: Training autonomous web agents is fundamentally limited by the environments they learn from: real-world websites are unsafe to explore, hard to reset, and rarely provide verifiable feedback. We propose VeriEnv, a framework that treats language models as environment creators, automatically cloning real-world websites into fully executable, verifiable synthetic environments. By exposing controlled internal access via a Python SDK, VeriEnv enables agents to self-generate tasks with deterministic, programmatically verifiable rewards, eliminating reliance on heuristic or LLM-based judges. This design decouples agent learning from unsafe real-world interaction while enabling scalable self-evolution through environment expansion. Through experiments on web agent benchmarks, we show that agents trained with VeriEnv generalize to unseen websites, achieve site-specific mastery through self-evolving training, and benefit from scaling the number of training environments. Code and resources will be released at this https URL upon acceptance.
摘要：训练自主网络代理从根本上受到他们学习环境的限制：现实世界的网站探索不安全，难以重置，并且很少提供可验证的反馈。我们提出 VeriEnv，一个将语言模型视为环境创建者的框架，自动将现实世界的网站克隆到完全可执行、可验证的合成环境中。通过 Python SDK 公开受控的内部访问，VeriEnv 使代理能够自行生成具有确定性、可编程验证的奖励的任务，从而消除对启发式或基于 LLM 的法官的依赖。这种设计将代理学习与不安全的现实世界交互解耦，同时通过环境扩展实现可扩展的自我进化。通过网络代理基准测试的实验，我们表明使用 VeriEnv 训练的代理可以泛化到看不见的网站，通过自我进化训练实现特定于站点的掌握，并从扩展训练环境的数量中受益。代码和资源将在接受后在此 https URL 发布。

Title: AILS-NTUA at SemEval-2026 Task 8: Evaluating Multi-Turn RAG Conversations

Authors: Dimosthenis Athanasiou, Maria Lymperaiou, Giorgos Filandrianos, Athanasios Voulodimos, Giorgos Stamou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.10524
Pdf URL: https://arxiv.org/pdf/2603.10524
Copy Paste: [[2603.10524]] AILS-NTUA at SemEval-2026 Task 8: Evaluating Multi-Turn RAG Conversations(https://arxiv.org/abs/2603.10524)
Keywords: llm, retrieval-augmented generation
Abstract: We present the AILS-NTUA system for SemEval-2026 Task 8 (MTRAGEval), addressing all three subtasks of multi-turn retrieval-augmented generation: passage retrieval (A), reference-grounded response generation (B), and end-to-end RAG (C). Our unified architecture is built on two principles: (i) a query-diversity-over-retriever-diversity strategy, where five complementary LLM-based query reformulations are issued to a single corpus-aligned sparse retriever and fused via variance-aware nested Reciprocal Rank Fusion; and (ii) a multistage generation pipeline that decomposes grounded generation into evidence span extraction, dual-candidate drafting, and calibrated multi-judge selection. Our system ranks 1st in Task A (nDCG@5: 0.5776, +20.5% over the strongest baseline) and 2nd in Task B (HM: 0.7698). Empirical analysis shows that query diversity over a well-aligned retriever outperforms heterogeneous retriever ensembling, and that answerability calibration, rather than retrieval coverage, is the primary bottleneck in end-to-end performance.
摘要：我们提出了 SemEval-2026 任务 8 (MTRAGEval) 的 AILS-NTUA 系统，解决多轮检索增强生成的所有三个子任务：段落检索 (A)、基于参考的响应生成 (B) 和端到端 RAG (C)。我们的统一架构建立在两个原则之上：(i) 查询多样性优于检索器多样性策略，其中五个互补的基于 LLM 的查询重构被发布到单个语料库对齐的稀疏检索器，并通过方差感知嵌套倒数排名融合进行融合； (ii) 多级生成管道，将接地生成分解为证据跨度提取、双候选人起草和校准多法官选择。我们的系统在任务 A 中排名第一（nDCG@5：0.5776，比最强基线+20.5%），在任务 B 中排名第二（HM：0.7698）。实证分析表明，对齐良好的检索器的查询多样性优于异构检索器集成，并且可回答性校准（而不是检索覆盖率）是端到端性能的主要瓶颈。
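
Note on the fusion step above: the AILS-NTUA abstract fuses the results of several query reformulations with a variance-aware nested variant of Reciprocal Rank Fusion (RRF). The abstract does not specify that variant, so the sketch below shows only plain RRF, the standard technique it builds on. The function name, the document IDs, and the choice of k = 60 (a commonly used default) are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs with plain RRF.

    Each document receives 1 / (k + rank) from every list it appears in
    (rank is 1-based); documents are returned by descending fused score,
    with ties broken by document ID for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists, one per query reformulation sent to the
# same retriever. The IDs are made up for illustration.
per_query_results = [
    ["d1", "d2", "d3"],
    ["d2", "d1", "d4"],
    ["d2", "d3", "d1"],
]
fused = reciprocal_rank_fusion(per_query_results)
```

RRF uses only rank positions, not retriever scores, which is why it can combine lists whose raw scores are not comparable.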

Title: Automatic End-to-End Data Integration using Large Language Models

Authors: Aaron Steiner, Christian Bizer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.10547
Pdf URL: https://arxiv.org/pdf/2603.10547
Copy Paste: [[2603.10547]] Automatic End-to-End Data Integration using Large Language Models(https://arxiv.org/abs/2603.10547)
Keywords: language model, gpt, llm
Abstract: Designing data integration pipelines typically requires substantial manual effort from data engineers to configure pipeline components and label training data. While LLMs have shown promise in handling individual steps of the integration process, their potential to replace all human input across end-to-end data integration pipelines has not been investigated. As a step toward exploring this potential, we present an automatic data integration pipeline that uses GPT-5.2 to generate all artifacts required to adapt the pipeline to specific use cases. These artifacts are schema mappings, value mappings for data normalization, training data for entity matching, and validation data for selecting conflict resolution heuristics in data fusion. We compare the performance of this LLM-based pipeline to the performance of human-designed pipelines across three case studies requiring the integration of video game, music, and company-related data. Our experiments show that the LLM-based pipeline produces results similar to, and for some tasks better than, those of the human-designed pipelines. End-to-end, the human and the LLM pipelines produce integrated datasets of comparable size and density. Having the LLM configure the pipelines costs approximately $10 per case study, which represents only a small fraction of the cost of having human data engineers perform the same tasks.
摘要：设计数据集成管道通常需要数据工程师大量的手动工作来配置管道组件和标记训练数据。虽然法学硕士在处理集成过程的各个步骤方面表现出了希望，但它们替代端到端数据集成管道中所有人工输入的潜力尚未得到研究。作为探索这一潜力的一步，我们提出了一个自动数据集成管道，它使用 GPT-5.2 生成使管道适应特定用例所需的所有工件。这些工件是模式映射、数据规范化的值映射、实体匹配的训练数据以及在数据融合中选择冲突解决启发式的验证数据。我们通过三个需要集成视频游戏、音乐和公司相关数据的案例研究，将这种基于法学硕士的管道的性能与人工设计的管道的性能进行比较。我们的实验表明，基于 LLM 的流程能够产生与人工设计的流程类似的结果，对于某些任务甚至可以产生更好的结果。人类和法学硕士管道端到端地生成大小和密度相当的集成数据集。让法学硕士配置管道的成本约为每个案例研究 10 美元，这仅占让人类数据工程师执行相同任务的成本的一小部分。

Title: End-to-End Chatbot Evaluation with Adaptive Reasoning and Uncertainty Filtering

Authors: Nhi Dang, Tung Le, Huy Tien Nguyen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.10570
Pdf URL: https://arxiv.org/pdf/2603.10570
Copy Paste: [[2603.10570]] End-to-End Chatbot Evaluation with Adaptive Reasoning and Uncertainty Filtering(https://arxiv.org/abs/2603.10570)
Keywords: language model, llm, chat, retrieval augmented generation
Abstract: Large language models (LLMs) combined with retrieval augmented generation have enabled the deployment of domain-specific chatbots, but these systems remain prone to generating unsupported or incorrect answers. Reliable evaluation is therefore critical, yet manual review is costly and existing frameworks often depend on curated test sets and static metrics, limiting scalability. We propose an end-to-end automatic evaluator designed to substantially reduce human effort. Our system generates Q&A pairs directly from the underlying knowledge base, uses LLMs to judge chatbot responses against reference answers, and applies confidence-based filtering to highlight uncertain cases. Applied to a Vietnamese news dataset, the evaluator achieves high agreement with human judgments while significantly lowering review overhead. The framework is modular and language-agnostic, making it readily adaptable to diverse domains. This work introduces a practical, scalable solution for evaluating chatbots with minimal reliance on manual intervention.
摘要：大型语言模型 (LLM) 与检索增强生成相结合，实现了特定领域聊天机器人的部署，但这些系统仍然容易生成不受支持或不正确的答案。因此，可靠的评估至关重要，但手动审查成本高昂，而且现有框架通常依赖于策划的测试集和静态指标，从而限制了可扩展性。我们提出了一种端到端自动评估器，旨在大幅减少人力。我们的系统直接从底层知识库生成问答对，使用法学硕士根据参考答案判断聊天机器人的响应，并应用基于置信度的过滤来突出不确定的情况。应用于越南新闻数据集时，评估器与人类判断高度一致，同时显着降低了审查开销。该框架是模块化的且与语言无关，使其能够轻松适应不同的领域。这项工作介绍了一种实用的、可扩展的解决方案，用于评估聊天机器人，同时最大限度地减少对手动干预的依赖。

Title: Disentangling Similarity and Relatedness in Topic Models

Authors: Hanlin Xiao, Mauricio A. Álvarez, Rainer Breitling
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.10619
Pdf URL: https://arxiv.org/pdf/2603.10619
Copy Paste: [[2603.10619]] Disentangling Similarity and Relatedness in Topic Models(https://arxiv.org/abs/2603.10619)
Keywords: language model, llm
Abstract: The recent advancement of large language models has spurred a growing trend of integrating pre-trained language model (PLM) embeddings into topic models, fundamentally reshaping how topics capture semantic structure. Classical models such as Latent Dirichlet Allocation (LDA) derive topics from word co-occurrence statistics, whereas PLM-augmented models anchor these statistics to pre-trained embedding spaces, imposing a prior that also favours clustering of semantically similar words. This structural difference can be captured by the psycholinguistic dimensions of thematic relatedness and taxonomic similarity of the topic words. To disentangle these dimensions in topic models, we construct a large synthetic benchmark of word pairs using LLM-based annotation to train a neural scoring function. We apply this scorer to a comprehensive evaluation across multiple corpora and topic model families, revealing that different model families capture distinct semantic structure in their topics. We further demonstrate that similarity and relatedness scores successfully predict downstream task performance depending on task requirements. This paper establishes similarity and relatedness as essential axes for topic model evaluation and provides a reliable pipeline for characterising these across model families and corpora.
摘要：大型语言模型的最新进展推动了将预训练语言模型 (PLM) 嵌入集成到主题模型中的趋势不断增长，从根本上重塑了主题捕获语义结构的方式。潜在狄利克雷分配 (LDA) 等经典模型从单词共现统计数据中得出主题，而 PLM 增强模型将这些统计数据锚定到预先训练的嵌入空间，施加一个先验，也有利于语义相似单词的聚类。这种结构差异可以通过主题相关性和主题词的分类相似性的心理语言学维度来捕获。为了理清主题模型中的这些维度，我们使用基于 LLM 的注释构建了一个大型的单词对综合基准来训练神经评分函数。我们将该评分器应用于多个语料库和主题模型系列的综合评估，揭示不同的模型系列捕获其主题中不同的语义结构。我们进一步证明，相似性和相关性分数可以根据任务要求成功预测下游任务绩效。本文将相似性和相关性作为主题模型评估的基本轴，并提供了一个可靠的管道来表征跨模型系列和语料库的这些特征。

Title: Making Bielik LLM Reason (Better): A Field Report

Authors: Adam Trybus, Bartosz Bartnicki, Remigiusz Kinas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.10640
Pdf URL: https://arxiv.org/pdf/2603.10640
Copy Paste: [[2603.10640]] Making Bielik LLM Reason (Better): A Field Report(https://arxiv.org/abs/2603.10640)
Keywords: language model, llm
Abstract: This paper presents a research program dedicated to evaluating and advancing the reasoning capabilities of Bielik, a Polish large language model. The study describes several stages of work: initial benchmarking and the creation of an evaluation methodology, analysis of comparative results against other LLMs, and an outline of future prospects that takes into account the limitations of the analyses conducted so far and aims to keep Bielik in the race given the ever-changing and competitive AI landscape.
摘要：本文提出了一项研究计划，致力于评估和提高波兰大语言模型 Bielik 的推理能力。该研究描述了多个工作阶段：初始基准测试和创建评估方法、分析与其他法学硕士的比较结果以及概述未来前景，考虑到迄今为止进行的分析的局限性，旨在让 Bielik 继续参与不断变化且具有竞争力的人工智能领域的竞争。

Title: Prism-$Δ$: Differential Subspace Steering for Prompt Highlighting in Large Language Models

Authors: Yuyao Ge, Shenghua Liu, Yiwei Wang, Tianyu Liu, Baolong Bi, Lingrui Mei, Jiayu Yao, Jiafeng Guo, Xueqi Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.10705
Pdf URL: https://arxiv.org/pdf/2603.10705
Copy Paste: [[2603.10705]] Prism-$Δ$: Differential Subspace Steering for Prompt Highlighting in Large Language Models(https://arxiv.org/abs/2603.10705)
Keywords: language model, prompt
Abstract: Prompt highlighting steers a large language model to prioritize user-specified text spans during generation. A key challenge is extracting steering directions that capture the difference between relevant and irrelevant contexts, rather than shared structural patterns common to both. We propose PRISM-$\Delta$ (Projection-based Relevance-Informed Steering Method), which decomposes the difference between positive and negative cross-covariance matrices to maximize discriminative energy while eliminating shared directions. Each attention head receives a continuous softplus importance weight, letting weak-but-useful heads contribute at reduced strength. The framework extends naturally to Value representations, capturing content-channel signal that Key-only methods leave unused. Across four benchmarks and five models, PRISM-$\Delta$ matches or exceeds the best existing method on 19 of 20 configurations, with relative gains up to +10.6%, while halving the fluency cost of steering. PRISM-$\Delta$ also scales to long-context retrieval, outperforming the best existing method by up to +4.8% relative gain. PRISM-$\Delta$ is compatible with FlashAttention and adds negligible memory overhead.
摘要：提示突出显示会引导大型语言模型在生成过程中优先考虑用户指定的文本范围。一个关键的挑战是提取捕获相关和不相关上下文之间差异的指导方向，而不是共享两者共有的结构模式。我们提出了 PRISM-$\Delta$（基于投影的相关性通知引导方法），该方法分解正负互协方差矩阵之间的差异，以最大化判别能量，同时消除共享方向。每个注意力头都会受到连续的 softplus 重要性权重，让弱但有用的头以降低的强度做出贡献。该框架自然地扩展到值表示，捕获仅键方法未使用的内容通道信号。在四个基准测试和五个模型中，PRISM-$\Delta$ 在 20 个配置中的 19 个配置上匹配或超过了现有最佳方法，相对增益高达 +10.6%，同时将转向流畅度成本减半。 PRISM-$\Delta$ 还可以扩展到长上下文检索，其性能优于现有最佳方法，相对增益高达 +4.8%。 PRISM-$\Delta$ 与 FlashAttention 兼容，并且增加的内存开销可以忽略不计。

Title: HeartAgent: An Autonomous Agent System for Explainable Differential Diagnosis in Cardiology

Authors: Shuang Zhou, Kai Yu, Song Wang, Wenya Xie, Zaifu Zhan, Meng-Han Tsai, Yuen-Hei Chung, Shutong Hou, Huixue Zhou, Min Zeng, Bhavadharini Ramu, Lin Yee Chen, Feng Xie, Rui Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.10764
Pdf URL: https://arxiv.org/pdf/2603.10764
Copy Paste: [[2603.10764]] HeartAgent: An Autonomous Agent System for Explainable Differential Diagnosis in Cardiology(https://arxiv.org/abs/2603.10764)
Keywords: agent
Abstract: Heart diseases remain a leading cause of morbidity and mortality worldwide, necessitating accurate and trustworthy differential diagnosis. However, existing artificial intelligence-based diagnostic methods are often limited by insufficient cardiology knowledge, inadequate support for complex reasoning, and poor interpretability. Here we present HeartAgent, a cardiology-specific agent system designed to support a reliable and explainable differential diagnosis. HeartAgent integrates customized tools and curated data resources and orchestrates multiple specialized sub-agents to perform complex reasoning while generating transparent reasoning trajectories and verifiable supporting references. Evaluated on the MIMIC dataset and a private electronic health records cohort, HeartAgent improved top-3 diagnostic accuracy over established comparative methods by more than 36% and 20%, respectively. Additionally, clinicians assisted by HeartAgent demonstrated gains of 26.9% in diagnostic accuracy and 22.7% in explanatory quality compared with unaided experts. These results demonstrate that HeartAgent provides reliable, explainable, and clinically actionable decision support for cardiovascular care.
摘要：心脏病仍然是全世界发病和死亡的主要原因，需要准确和值得信赖的鉴别诊断。然而，现有的基于人工智能的诊断方法往往受到心脏病学知识不足、对复杂推理的支持不足以及可解释性差的限制。在这里，我们介绍 HeartAgent，这是一种心脏病学专用代理系统，旨在支持可靠且可解释的鉴别诊断。 HeartAgent 集成了定制工具和精选数据资源，并协调多个专门的子代理来执行复杂的推理，同时生成透明的推理轨迹和可验证的支持参考。通过对 MIMIC 数据集和私人电子健康记录队列进行评估，HeartAgent 在前 3 名诊断准确性方面比现有的比较方法分别提高了 36% 和 20% 以上。此外，与未经专家协助的临床医生相比，HeartAgent 协助的临床医生诊断准确性提高了 26.9%，解释质量提高了 22.7%。这些结果表明 HeartAgent 为心血管护理提供可靠、可解释且临床上可行的决策支持。

Title: mAceReason-Math: A Dataset of High-Quality Multilingual Math Problems Ready For RLVR

Authors: Konstantin Dobler, Simon Lehnerer, Federico Scozzafava, Jonathan Janke, Mohamed Ali
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.10767
Pdf URL: https://arxiv.org/pdf/2603.10767
Copy Paste: [[2603.10767]] mAceReason-Math: A Dataset of High-Quality Multilingual Math Problems Ready For RLVR(https://arxiv.org/abs/2603.10767)
Keywords: language model
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has been successfully applied to significantly boost the capabilities of pretrained large language models, especially in the math and logic problem domains. However, current research and available training datasets remain English-centric. While multilingual training data and benchmarks have been created in the past, they were not created with RLVR and current model capability in mind, and their level of difficulty is often too low to provide appropriate training signals for current models. To address this gap, we provide mAceReason-Math, a dataset of high-quality translations of challenging math problems sourced from a corpus specifically curated for RLVR (AceReason-Math). We further take specific care to clean and improve our translations, resulting in a coverage of 14 languages with more than 10,000 samples per language. We release the dataset to facilitate multilingual RLVR research and benchmarking in the research community.
摘要：具有可验证奖励的强化学习（RLVR）已成功应用于显着提高预训练大型语言模型的能力，特别是在数学和逻辑问题领域。然而，当前的研究和可用的训练数据集仍然以英语为中心。虽然过去已经创建了多语言训练数据和基准，但它们在创建时并没有考虑到 RLVR 和当前模型的能力，而且它们的难度通常太低，无法为当前模型提供适当的训练信号。为了解决这一差距，我们提供了 mAceReason-Math，这是一个具有挑战性的数学问题的高质量翻译数据集，源自专门为 RLVR 策划的语料库 (AceReason-Math)。我们进一步特别注意清理和改进我们的翻译，最终覆盖 14 种语言，每种语言有超过 10,000 个样本。我们发布数据集是为了促进研究社区的多语言 RLVR 研究和基准测试。

Title: Word Recovery in Large Language Models Enables Character-Level Tokenization Robustness

Authors: Zhipeng Yang, Shu Yang, Lijie Hu, Di Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.10771
Pdf URL: https://arxiv.org/pdf/2603.10771
Copy Paste: [[2603.10771]] Word Recovery in Large Language Models Enables Character-Level Tokenization Robustness(https://arxiv.org/abs/2603.10771)
Keywords: language model, llm
Abstract: Large language models (LLMs) trained with canonical tokenization exhibit surprising robustness to non-canonical inputs such as character-level tokenization, yet the mechanisms underlying this robustness remain unclear. We study this phenomenon through mechanistic interpretability and identify a core process we term word recovery. We first introduce a decoding-based method to detect word recovery, showing that hidden states reconstruct canonical word-level token identities from character-level inputs. We then provide causal evidence by removing the corresponding subspace from hidden states, which consistently degrades downstream task performance. Finally, we conduct a fine-grained attention analysis and show that in-group attention among characters belonging to the same canonical token is critical for word recovery: masking such attention in early layers substantially reduces both recovery scores and task performance. Together, our findings provide a mechanistic explanation for tokenization robustness and identify word recovery as a key mechanism enabling LLMs to process character-level inputs.
摘要：使用规范标记化训练的大型语言模型 (LLM) 对字符级标记化等非规范输入表现出令人惊讶的鲁棒性，但这种鲁棒性背后的机制仍不清楚。我们通过机械解释来研究这种现象，并确定一个我们称之为单词恢复的核心过程。我们首先引入一种基于解码的方法来检测单词恢复，表明隐藏状态从字符级输入重建规范的单词级标记标识。然后，我们通过从隐藏状态中删除相应的子空间来提供因果证据，这会持续降低下游任务的性能。最后，我们进行了细粒度的注意力分析，并表明属于同一规范标记的字符之间的组内注意力对于单词恢复至关重要：在早期层中掩盖这种注意力会大大降低恢复分数和任务性能。总之，我们的研究结果为标记化的鲁棒性提供了机械解释，并将单词恢复确定为使法学硕士能够处理字符级输入的关键机制。

Title: Large Language Models as Annotators for Machine Translation Quality Estimation

Authors: Sidi Wang, Sophie Arnoult, Amir Kamran
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.10775
Pdf URL: https://arxiv.org/pdf/2603.10775
Copy Paste: [[2603.10775]] Large Language Models as Annotators for Machine Translation Quality Estimation(https://arxiv.org/abs/2603.10775)
Keywords: language model, gpt, llm, prompt
Abstract: Large Language Models (LLMs) have demonstrated excellent performance on Machine Translation Quality Estimation (MTQE), yet their high inference costs make them impractical for direct application. In this work, we propose applying LLMs to generate MQM-style annotations for training a COMET model: following Fernandes et al. (2023), we reckon that segment-level annotations provide a strong rationale for LLMs and are key to good segment-level QE. We propose a simplified MQM scheme, mostly restricted to top-level categories, to guide LLM selection. We present a systematic approach for the development of a GPT-4o-based prompt, called PPbMQM (Prompt-Pattern-based-MQM). We show that the resulting annotations correlate well with human annotations and that training COMET on them leads to competitive performance on segment-level QE for Chinese-English and English-German.
摘要：大型语言模型（LLM）在机器翻译质量估计（MTQE）方面表现出了出色的性能，但其高推理成本使其不适合直接应用。在这项工作中，我们建议应用 LLM 生成 MQM 风格的注释来训练 COMET 模型：遵循 Fernandes 等人的观点。（2023），我们认为细分级别注释为法学硕士提供了强有力的理由，并且是良好细分级别量化宽松的关键。我们提出了一个简化的 MQM 方案，主要限于顶级类别，以指导 LLM 选择。我们提出了一种开发基于 GPT-4o 的提示的系统方法，称为 PPbMQM（基于提示模式的 MQM）。我们表明，所得到的注释与人类注释有很好的相关性，并且在它们上训练 COMET 可以在中文-英语和英语-德语的分段级 QE 上带来有竞争力的表现。

Title: Interpretable Chinese Metaphor Identification via LLM-Assisted MIPVU Rule Script Generation: A Comparative Protocol Study

Authors: Weihang Huang, Mengna Liu
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2603.10784
Pdf URL: https://arxiv.org/pdf/2603.10784
Copy Paste: [[2603.10784]] Interpretable Chinese Metaphor Identification via LLM-Assisted MIPVU Rule Script Generation: A Comparative Protocol Study(https://arxiv.org/abs/2603.10784)
Keywords: llm
Abstract: Metaphor identification is a foundational task in figurative language processing, yet most computational approaches operate as opaque classifiers offering no insight into why an expression is judged metaphorical. This interpretability gap is especially acute for Chinese, where rich figurative traditions, absent morphological cues, and limited annotated resources compound the challenge. We present an LLM-assisted pipeline that operationalises four metaphor identification protocols--MIP/MIPVU lexical analysis, CMDAG conceptual-mapping annotation, emotion-based detection, and simile-oriented identification--as executable, human-auditable rule scripts. Each protocol is a modular chain of deterministic steps interleaved with controlled LLM calls, producing structured rationales alongside every classification decision. We evaluate on seven Chinese metaphor datasets spanning token-, sentence-, and span-level annotation, establishing the first cross-protocol comparison for Chinese metaphor identification. Within-protocol evaluation shows Protocol A (MIP) achieves an F1 of 0.472 on token-level identification, while cross-protocol analysis reveals striking divergence: pairwise Cohen's kappa between Protocols A and D is merely 0.001, whereas Protocols B and C exhibit near-perfect agreement (kappa = 0.986). An interpretability audit shows all protocols achieve 100% deterministic reproducibility, with rationale correctness from 0.40 to 0.87 and editability from 0.80 to 1.00. Error analysis identifies conceptual-domain mismatch and register sensitivity as dominant failure modes. Our results demonstrate that protocol choice is the single largest source of variation in metaphor identification, exceeding model-level variation, and that rule-script architectures achieve competitive performance while maintaining full transparency.
摘要：隐喻识别是比喻语言处理中的一项基本任务，但大多数计算方法都作为不透明的分类器运行，无法深入了解为什么一个表达被判断为隐喻。这种可解释性的差距对于汉语来说尤其严重，因为汉语丰富的比喻传统、缺乏形态线索和有限的注释资源加剧了这一挑战。我们提出了一个 LLM 辅助管道，它将四种隐喻识别协议（MIP/MIPVU 词法分析、CMDAG 概念映射注释、基于情感的检测和面向明喻的识别）操作为可执行的、人类可审计的规则脚本。每个协议都是确定性步骤的模块化链，与受控的 LLM 调用交织在一起，在每个分类决策的同时产生结构化的基本原理。我们对七个涵盖词元级、句子级和跨度级注释的中文隐喻数据集进行了评估，为中文隐喻识别建立了第一个跨协议比较。协议内评估显示，协议 A (MIP) 在代币级别识别上的 F1 为 0.472，而跨协议分析则显示出显着的差异：协议 A 和 D 之间的成对 Cohen kappa 仅 0.001，而协议 B 和 C 表现出近乎完美的一致性（kappa = 0.986）。可解释性审核显示所有协议均实现 100% 确定性再现性，基本原理正确性从 0.40 到 0.87，可编辑性从 0.80 到 1.00。错误分析将概念域不匹配和套准敏感性识别为主要故障模式。我们的结果表明，协议选择是隐喻识别中最大的变异来源，超过了模型级别的变异，并且规则脚本架构在保持完全透明的同时实现了有竞争力的性能。
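
Note on the agreement statistic above: the metaphor-identification abstract compares protocols by pairwise Cohen's kappa. The following is a minimal sketch of the standard two-rater formula, kappa = (p_o - p_e) / (1 - p_e), not the authors' evaluation script. The function name and the toy labels are hypothetical, and returning 1.0 when chance agreement is already total is a convention adopted here for the degenerate case.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both sequences use one identical label throughout (assumed convention).
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical binary decisions (1 = metaphorical) from two protocols on
# the same four items. These labels are made up for illustration.
protocol_x = [1, 1, 0, 0]
protocol_y = [1, 0, 1, 0]
kappa = cohens_kappa(protocol_x, protocol_y)
```

A kappa near 0, as in this toy case, means the two label sequences agree no more often than their marginal frequencies would predict by chance, which is how a value such as 0.001 in the abstract should be read.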

Title: PivotAttack: Rethinking the Search Trajectory in Hard-Label Text Attacks via Pivot Words

Authors: Yuzhi Liang, Shiliang Xiao, Jingsong Wei, Qiliang Lin, Xia Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.10842
Pdf URL: https://arxiv.org/pdf/2603.10842
Copy Paste: [[2603.10842]] PivotAttack: Rethinking the Search Trajectory in Hard-Label Text Attacks via Pivot Words(https://arxiv.org/abs/2603.10842)
Keywords: language model
Abstract: Existing hard-label text attacks often rely on inefficient "outside-in" strategies that traverse vast search spaces. We propose PivotAttack, a query-efficient "inside-out" framework. It employs a Multi-Armed Bandit algorithm to identify Pivot Sets (combinatorial token groups acting as prediction anchors) and strategically perturbs them to induce label flips. This approach captures inter-word dependencies and minimizes query costs. Extensive experiments across traditional models and Large Language Models demonstrate that PivotAttack consistently outperforms state-of-the-art baselines in both Attack Success Rate and query efficiency.
摘要：现有的硬标签文本攻击通常依赖于低效的“由外而内”策略，遍历巨大的搜索空间。我们提出了 PivotAttack，一个查询高效的“由内而外”框架。它采用 Multi-Armed Bandit 算法来识别枢轴集（充当预测锚的组合标记组）并策略性地扰乱它们以引发标签翻转。这种方法捕获单词之间的依赖关系并最大限度地减少查询成本。跨传统模型和大型语言模型的大量实验表明，PivotAttack 在攻击成功率和查询效率方面始终优于最先进的基线。

Title: An Extreme Multi-label Text Classification (XMTC) Library Dataset: What if we took "Use of Practical AI in Digital Libraries" seriously?

Authors: Jennifer D'Souza, Sameer Sadruddin, Maximilian Kähler, Andrea Salfinger, Luca Zaccagna, Francesca Incitti, Lauro Snidaro, Osma Suominen
Subjects: cs.CL, cs.AI, cs.DL, cs.IR
Abstract URL: https://arxiv.org/abs/2603.10876
Pdf URL: https://arxiv.org/pdf/2603.10876
Copy Paste: [[2603.10876]] An Extreme Multi-label Text Classification (XMTC) Library Dataset: What if we took "Use of Practical AI in Digital Libraries" seriously?(https://arxiv.org/abs/2603.10876)
Keywords: agent
Abstract: Subject indexing is vital for discovery but hard to sustain at scale and across languages. We release a large bilingual (English/German) corpus of catalog records annotated with the Integrated Authority File (GND), plus a machine-actionable GND taxonomy. The resource enables ontology-aware multi-label classification, mapping text to authority terms, and agent-assisted cataloging with reproducible, authority-grounded evaluation. We provide a brief statistical profile and qualitative error analyses of three systems. We invite the community to assess not only accuracy but usefulness and transparency, toward authority-anchored AI co-pilots that amplify catalogers' work.
摘要：主题索引对于发现至关重要，但难以大规模且跨语言维持。我们发布了一个大型双语（英语/德语）目录记录语料库，带有集成规范文件 (GND) 注释，以及机器可操作的 GND 分类法。该资源支持本体感知的多标签分类、将文本映射到权威术语，以及通过可重复的、基于权威的评估进行代理辅助编目。我们提供三个系统的简要统计概况和定性误差分析。我们邀请社区不仅评估准确性，还评估实用性和透明度，以建立以权威为基础的人工智能副驾驶，以扩大编目员的工作。

Title: From Images to Words: Efficient Cross-Modal Knowledge Distillation to Language Models from Black-box Teachers

Authors: Ayan Sengupta, Shantanu Dixit, Md Shad Akhtar, Tanmoy Chakraborty
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.10877
Pdf URL: https://arxiv.org/pdf/2603.10877
Copy Paste: [[2603.10877]] From Images to Words: Efficient Cross-Modal Knowledge Distillation to Language Models from Black-box Teachers(https://arxiv.org/abs/2603.10877)
Keywords: language model
Abstract: Knowledge distillation (KD) methods are pivotal in compressing large pre-trained language models into smaller models, ensuring computational efficiency without significantly dropping performance. Traditional KD techniques assume homogeneity in modalities between the teacher (source) and the student (target) models. On the other hand, existing multimodal knowledge distillation methods require modality-specific pre-training of the teacher model, which is computationally infeasible in most cases. In this paper, we introduce ARMADA, an efficient cross-modal knowledge distillation framework designed to transfer knowledge from large vision-language models, including black-box models, to language-only models. Unlike existing KD techniques that rely on the internal structures of multimodal teachers or require computationally expensive pre-training, ARMADA leverages novel alignment techniques to distil knowledge without altering the teacher model, ensuring efficiency and scalability. We empirically validate ARMADA on twelve natural language understanding, eight complex generative reasoning and five instruction-tuning tasks, demonstrating consistent performance improvements in large models such as DeBERTa-v2-1.4B, OPT-1.3B, LLaMA-{3B, 7B, 8B}. ARMADA achieves up to 3.4% improvement on language understanding tasks and 2.6% boost in generative reasoning, all without requiring expensive multimodal pre-training or fine-tuning of the teacher model. Our findings challenge conventional knowledge distillation paradigms by demonstrating that even vision-language models, despite lacking direct textual understanding, can significantly enhance language models when distilled appropriately.

Title: LLM2Vec-Gen: Generative Embeddings from Large Language Models

Authors: Parishad BehnamGhader, Vaibhav Adlakha, Fabian David Schmidt, Nicolas Chapados, Marius Mosbach, Siva Reddy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.10913
Pdf URL: https://arxiv.org/pdf/2603.10913
Copy Paste: [[2603.10913]] LLM2Vec-Gen: Generative Embeddings from Large Language Models(https://arxiv.org/abs/2603.10913)
Keywords: language model, llm
Abstract: LLM-based text embedders typically encode the semantic content of their input. However, embedding tasks require mapping diverse inputs to similar outputs. Typically, this input-output gap is addressed by training embedding models with paired data using contrastive learning. In this work, we propose a novel self-supervised approach, LLM2Vec-Gen, which adopts a different paradigm: rather than encoding the input, we learn to represent the model's potential response. Specifically, we add trainable special tokens to the LLM's vocabulary, append them to the input, and optimize them to represent the LLM's response in a fixed-length sequence. Training is guided by the LLM's own completion for the query, along with an unsupervised embedding teacher that provides distillation targets. This formulation helps to bridge the input-output gap and transfers LLM capabilities such as safety alignment and reasoning to embedding tasks. Crucially, the LLM backbone remains frozen and training requires only unlabeled queries. LLM2Vec-Gen achieves state-of-the-art self-supervised performance on the Massive Text Embedding Benchmark (MTEB), improving by 9.3% over the best unsupervised embedding teacher. We also observe up to 43.2% reduction in harmful content retrieval and 29.3% improvement in reasoning capabilities for embedding tasks. Finally, the learned embeddings are interpretable and can be decoded into text to reveal their semantic content.

Title: Beyond the Illusion of Consensus: From Surface Heuristics to Knowledge-Grounded Evaluation in LLM-as-a-Judge

Authors: Mingyang Song, Mao Zheng, Chenning Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.11027
Pdf URL: https://arxiv.org/pdf/2603.11027
Copy Paste: [[2603.11027]] Beyond the Illusion of Consensus: From Surface Heuristics to Knowledge-Grounded Evaluation in LLM-as-a-Judge(https://arxiv.org/abs/2603.11027)
Keywords: llm
Abstract: The paradigm of LLM-as-a-judge relies on a critical assumption, namely that high inter-evaluator agreement indicates reliable and objective evaluation. We present two complementary findings that challenge this assumption. First, we demonstrate that this consensus is frequently illusory. We identify and formalize Evaluation Illusion, a phenomenon where LLM judges generate sophisticated critiques yet anchor scores on shared surface heuristics rather than substantive quality. Through a large-scale study of 105,600 evaluation instances (32 LLMs × 3 frontier judges × 100 tasks × 11 temperatures), we show that model-level agreement (Spearman ρ = 0.99) masks fragile sample-level agreement (Pearson r̄ = 0.72; absolute agreement ICC = 0.67), that merely sharing rubric structure restores 62% of total agreement, and that high-quality outputs paradoxically receive the least consistent evaluations. Second, we demonstrate that dynamically generating evaluation rubrics grounded in domain knowledge produces more meaningful assessment. We introduce MERG (Metacognitive Enhanced Rubric Generation), a knowledge-driven rubric generation framework whose domain-selective effects confirm this. Agreement increases in codified domains (Education +22%, Academic +27%) where knowledge anchors evaluators on shared standards, while it decreases in subjective domains where genuine evaluative pluralism emerges. These findings suggest that evaluation rubrics should be dynamically enriched with expert knowledge rather than relying on generic criteria, with implications for reward modeling in RLAIF.
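
As a rough illustration of the numbers in this abstract, the sketch below checks the stated study size and shows, on made-up toy scores, how two judges can agree perfectly on the ranking of models while agreeing less on individual samples. The score tables, the model names `m1` to `m3`, and the helper functions are assumptions for illustration only; this is not the paper's data or its exact agreement procedure.

```python
from math import sqrt
from statistics import mean

# Study size as stated in the abstract: LLMs x judges x tasks x temperatures.
n_instances = 32 * 3 * 100 * 11
print(n_instances)  # 105600


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy scores (NOT from the paper): two judges, three models,
# four samples per model.
judge_a = {"m1": [2, 4, 3, 3], "m2": [5, 6, 4, 5], "m3": [8, 7, 9, 8]}
judge_b = {"m1": [4, 2, 3, 3], "m2": [4, 5, 6, 5], "m3": [7, 9, 8, 8]}

models = sorted(judge_a)

# Model-level agreement: correlate the per-model mean scores.
model_level = spearman(
    [mean(judge_a[m]) for m in models],
    [mean(judge_b[m]) for m in models],
)

# Sample-level agreement: correlate the individual sample scores.
flat_a = [s for m in models for s in judge_a[m]]
flat_b = [s for m in models for s in judge_b[m]]
sample_level = pearson(flat_a, flat_b)

print(round(model_level, 3), round(sample_level, 3))
```

In this toy setup the per-model means are identical for both judges, so the model-level rank correlation is at its maximum even though the judges disagree on several individual samples. Averaging hides that disagreement, which is the kind of gap the abstract describes.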

Title: Instruction set for the representation of graphs

Authors: Ezequiel Lopez-Rubio, Mario Pascual-Gonzalez
Subjects: cs.CL, cs.AI, cs.DS
Abstract URL: https://arxiv.org/abs/2603.11039
Pdf URL: https://arxiv.org/pdf/2603.11039
Copy Paste: [[2603.11039]] Instruction set for the representation of graphs(https://arxiv.org/abs/2603.11039)
Keywords: language model
Abstract: We present IsalGraph, a method for representing the structure of any finite, simple graph as a compact string over a nine-character instruction alphabet. The encoding is executed by a small virtual machine comprising a sparse graph, a circular doubly-linked list (CDLL) of graph-node references, and two traversal pointers. Instructions either move a pointer through the CDLL or insert a node or edge into the graph. A key design property is that every string over the alphabet decodes to a valid graph, with no invalid states reachable. A greedy GraphToString algorithm encodes any connected graph into a string in time polynomial in the number of nodes; an exhaustive-backtracking variant produces a canonical string by selecting the lexicographically smallest shortest string across all starting nodes and all valid traversal orders. We evaluate the representation on five real-world graph benchmark datasets (IAM Letter LOW/MED/HIGH, LINUX, and AIDS) and show that the Levenshtein distance between IsalGraph strings correlates strongly with graph edit distance (GED). Together, these properties make IsalGraph strings a compact, isomorphism-invariant, and language-model-compatible sequential encoding of graph structure, with direct applications in graph similarity search, graph generation, and graph-conditioned language modelling.
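
The string comparison this abstract mentions can be sketched with the standard dynamic-programming Levenshtein distance. The abstract does not list the nine instruction characters, so the example strings below are arbitrary placeholders rather than real IsalGraph encodings, and the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the current prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (ch_a != ch_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


# Placeholder strings standing in for two graph encodings (hypothetical).
encoding_1 = "abcab"
encoding_2 = "abdab"
print(levenshtein(encoding_1, encoding_2))  # 1
```

Keeping only two rows of the table uses memory proportional to the length of the second string, which matters little for short encodings but keeps the sketch usable on longer ones.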