2025-12-22

Title: A Women's Health Benchmark for Large Language Models

Authors: Victoria-Elisabeth Gruber, Razvan Marinescu, Diego Fajardo, Amin H. Nassar, Christopher Arkfeld, Alexandria Ludlow, Shama Patel, Mehrnoosh Samaei, Valerie Klug, Anna Huber, Marcel Gühner, Albert Botta i Orfila, Irene Lagoja, Kimya Tarr, Haleigh Larson, Mary Beth Howard
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.17028
Pdf URL: https://arxiv.org/pdf/2512.17028
Copy Paste: [[2512.17028]] A Women's Health Benchmark for Large Language Models(https://arxiv.org/abs/2512.17028)
Keywords: language model, gpt, llm, chat
Abstract: As large language models (LLMs) become primary sources of health information for millions, their accuracy in women's health remains critically unexamined. We introduce the Women's Health Benchmark (WHB), the first benchmark evaluating LLM performance specifically in women's health. Our benchmark comprises 96 rigorously validated model stumps covering five medical specialties (obstetrics and gynecology, emergency medicine, primary care, oncology, and neurology), three query types (patient query, clinician query, and evidence/policy query), and eight error types (dosage/medication errors, missing critical information, outdated guidelines/treatment recommendations, incorrect treatment advice, incorrect factual information, missing/incorrect differential diagnosis, missed urgency, and inappropriate recommendations). We evaluated 13 state-of-the-art LLMs and revealed alarming gaps: current models show approximately 60% failure rates on the women's health benchmark, with performance varying dramatically across specialties and error types. Notably, models universally struggle with "missed urgency" indicators, while newer models like GPT-5 show significant improvements in avoiding inappropriate recommendations. Our findings underscore that AI chatbots are not yet fully capable of providing reliable advice in women's health.

Title: Knowledge Distillation with Structured Chain-of-Thought for Text-to-SQL

Authors: Khushboo Thaker, Yony Bresler
Subjects: cs.CL, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2512.17053
Pdf URL: https://arxiv.org/pdf/2512.17053
Copy Paste: [[2512.17053]] Knowledge Distillation with Structured Chain-of-Thought for Text-to-SQL(https://arxiv.org/abs/2512.17053)
Keywords: language model, llm, chain-of-thought
Abstract: Deploying accurate Text-to-SQL systems at the enterprise level faces a difficult trilemma involving cost, security and performance. Current solutions force enterprises to choose between expensive, proprietary Large Language Models (LLMs) and low-performing Small Language Models (SLMs). Efforts to improve SLMs often rely on distilling reasoning from large LLMs using unstructured Chain-of-Thought (CoT) traces, a process that remains inherently ambiguous. Instead, we hypothesize that a formal, structured reasoning representation provides a clearer, more reliable teaching signal, as the Text-to-SQL task requires explicit and precise logical steps. To evaluate this hypothesis, we propose Struct-SQL, a novel Knowledge Distillation (KD) framework that trains an SLM to emulate a powerful large LLM. Consequently, we adopt a query execution plan as a formal blueprint to derive this structured reasoning. Our SLM, distilled with structured CoT, achieves an absolute improvement of 8.1% over an unstructured CoT distillation baseline. A detailed error analysis reveals that a key factor in this gain is a marked reduction in syntactic errors. This demonstrates that teaching a model to reason using a structured logical blueprint is beneficial for reliable SQL generation in SLMs.

Title: XLM: A Python package for non-autoregressive language models

Authors: Dhruvesh Patel, Durga Prasad Maram, Sai Sreenivas Chintha, Benjamin Rozonoyer, Andrew McCallum
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.17065
Pdf URL: https://arxiv.org/pdf/2512.17065
Copy Paste: [[2512.17065]] XLM: A Python package for non-autoregressive language models(https://arxiv.org/abs/2512.17065)
Keywords: language model
Abstract: In recent years, there has been a resurgence of interest in non-autoregressive text generation in the context of general language modeling. Unlike the well-established autoregressive language modeling paradigm, which has a plethora of standard training and inference libraries, implementations of non-autoregressive language modeling have largely been bespoke, making it difficult to perform systematic comparisons of different methods. Moreover, each non-autoregressive language model typically requires its own data collation, loss, and prediction logic, making it challenging to reuse common components. In this work, we present the XLM Python package, which is designed to make implementing small non-autoregressive language models faster, with a secondary goal of providing a suite of small pre-trained models (through a companion xlm-models package) that can be used by the research community. The code is available at this https URL.

Title: Perturb Your Data: Paraphrase-Guided Training Data Watermarking

Authors: Pranav Shetty, Mirazul Haque, Petr Babkin, Zhiqiang Ma, Xiaomo Liu, Manuela Veloso
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2512.17075
Pdf URL: https://arxiv.org/pdf/2512.17075
Copy Paste: [[2512.17075]] Perturb Your Data: Paraphrase-Guided Training Data Watermarking(https://arxiv.org/abs/2512.17075)
Keywords: language model, llm
Abstract: Training data detection is critical for enforcing copyright and data licensing, as Large Language Models (LLMs) are trained on massive text corpora scraped from the internet. We present SPECTRA, a watermarking approach that makes training data reliably detectable even when it comprises less than 0.001% of the training corpus. SPECTRA works by paraphrasing text using an LLM and assigning a score based on how likely each paraphrase is, according to a separate scoring model. A paraphrase is chosen so that its score closely matches that of the original text, to avoid introducing any distribution shifts. To test whether a suspect model has been trained on the watermarked data, we compare its token probabilities against those of the scoring model. We demonstrate that SPECTRA achieves a consistent p-value gap of over nine orders of magnitude when detecting data used for training versus data not used for training, which is greater than all baselines tested. SPECTRA equips data owners with a scalable, deploy-before-release watermark that survives even large-scale LLM training.
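The paraphrase-selection step described in the SPECTRA abstract (pick the paraphrase whose score is closest to that of the original text) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the unigram scorer, the tiny reference corpus, and the names `unigram_log_likelihood` and `pick_paraphrase` are all assumptions standing in for the paper's separate scoring model.

```python
import math
from collections import Counter


def unigram_log_likelihood(text, counts, total):
    """Toy scoring model (an assumption): mean add-one-smoothed
    unigram log-probability per whitespace token."""
    tokens = text.lower().split()
    vocab = len(counts) + 1  # one extra bucket for unseen tokens
    logps = [math.log((counts[t] + 1) / (total + vocab)) for t in tokens]
    return sum(logps) / len(logps)


def pick_paraphrase(original, paraphrases, score):
    """Return the paraphrase whose score is closest to the original's score."""
    target = score(original)
    return min(paraphrases, key=lambda p: abs(score(p) - target))


# Hypothetical reference corpus standing in for the scoring model's knowledge.
corpus = "the cat sat on the mat while the dog slept on the rug".split()
counts, total = Counter(corpus), len(corpus)


def score(text):
    return unigram_log_likelihood(text, counts, total)


original = "the cat sat on the mat"
candidates = [
    "a feline rested upon a carpet",  # unseen tokens: score far from the original
    "the cat slept on the rug",       # familiar tokens: score close to the original
]
chosen = pick_paraphrase(original, candidates, score)
print(chosen)
```

Matching scores rather than maximizing them is the point of the selection rule: the chosen paraphrase should look as likely as the original under the scorer, so that the watermarked text does not shift the data distribution.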

Title: When F1 Fails: Granularity-Aware Evaluation for Dialogue Topic Segmentation

Authors: Michael H. Coen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.17083
Pdf URL: https://arxiv.org/pdf/2512.17083
Copy Paste: [[2512.17083]] When F1 Fails: Granularity-Aware Evaluation for Dialogue Topic Segmentation(https://arxiv.org/abs/2512.17083)
Keywords: llm
Abstract: Dialogue topic segmentation supports summarization, retrieval, memory management, and conversational continuity. Despite decades of prior work, evaluation practice in dialogue topic segmentation remains dominated by strict boundary matching and F1-based metrics, even as modern LLM-based conversational systems increasingly rely on segmentation to manage conversation history beyond the model's fixed context window, where unstructured context accumulation degrades efficiency and coherence. This paper introduces an evaluation objective for dialogue topic segmentation that treats boundary density and segment coherence as primary criteria, alongside window-tolerant F1 (W-F1). Through extensive cross-dataset empirical evaluation, we show that reported performance differences across dialogue segmentation benchmarks are driven not by model quality, but by annotation granularity mismatches and sparse boundary labels. This indicates that many reported improvements arise from evaluation artifacts rather than improved boundary detection. We evaluated multiple, structurally distinct dialogue segmentation strategies across eight dialogue datasets spanning task-oriented, open-domain, meeting-style, and synthetic interactions. Across these settings, we observe high segment coherence combined with extreme oversegmentation relative to sparse labels, producing misleadingly low exact-match F1 scores. We show that topic segmentation is best understood as selecting an appropriate granularity rather than predicting a single correct boundary set. We operationalize this view by explicitly separating boundary scoring from boundary selection.
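To make the contrast between exact-match F1 and window-tolerant F1 concrete, here is a small sketch. The abstract does not give the exact W-F1 definition, so the greedy nearest-match rule and the function name `window_tolerant_f1` are assumptions chosen for illustration only.

```python
def window_tolerant_f1(predicted, gold, window=1):
    """F1 where a predicted boundary counts as correct if it lies within
    `window` positions of a gold boundary that has not been matched yet.
    With window=0 this reduces to strict exact-match F1."""
    unmatched_gold = sorted(gold)
    true_positives = 0
    for p in sorted(predicted):
        # Greedily match to the nearest unmatched gold boundary in range.
        near = [g for g in unmatched_gold if abs(g - p) <= window]
        if near:
            best = min(near, key=lambda g: abs(g - p))
            unmatched_gold.remove(best)
            true_positives += 1
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Sparse gold labels versus a denser (oversegmented) prediction.
gold_boundaries = [5, 12]
predicted_boundaries = [2, 4, 6, 9, 11, 14]

strict = window_tolerant_f1(predicted_boundaries, gold_boundaries, window=0)
tolerant = window_tolerant_f1(predicted_boundaries, gold_boundaries, window=1)
print(strict, tolerant)
```

In this toy case every gold boundary has a prediction one turn away, yet the strict score is zero; the tolerant score recovers the near misses while still penalizing the extra boundaries through precision.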

Title: Data Augmentation Supporting a Conversational Agent Designed for Smoking Cessation Support Groups

Authors: Salar Hashemitaheri, Ian Harris
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.17092
Pdf URL: https://arxiv.org/pdf/2512.17092
Copy Paste: [[2512.17092]] Data Augmentation Supporting a Conversational Agent Designed for Smoking Cessation Support Groups(https://arxiv.org/abs/2512.17092)
Keywords: gpt, llm, prompt, agent
Abstract: Online support groups for smoking cessation are economical and accessible, yet they often face challenges with low user engagement and stigma. The use of an automatic conversational agent would improve engagement by ensuring that all user comments receive a timely response. We address the challenge of insufficient high-quality data by employing a two-level data augmentation strategy: synthetic data augmentation and real data augmentation. First, we fine-tuned an open source LLM to classify posts from our existing smoking cessation support groups and identify intents with low F1 (precision and recall) scores. Then, for these intents, we generate additional synthetic data using prompt engineering with the GPT model, with an average of 87% of the generated synthetic posts deemed high quality by human annotators. Overall, the synthetic augmentation process resulted in 43% of the original posts being selected for augmentation, followed by 140% synthetic expansion of these posts. Additionally, we scraped more than 10,000 real posts from a related online support context, of which 73% were validated as good quality by human annotators. Each synthetic or scraped post underwent rigorous validation involving human reviewers to ensure quality and relevance. The validated new data, combined with the original support group posts, formed an augmented dataset used to retrain the intent classifier. Performance evaluation of the retrained model demonstrated a 32% improvement in F1, confirming the effectiveness of our data augmentation approach. Synthetic and real post augmentation led to similar performance improvements. This study provides a replicable framework for enhancing conversational agent performance in domains where data scarcity is a critical issue.
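The first step of the pipeline above (find intents whose F1 is low, then target them for augmentation) can be sketched in a few lines. F1 is the harmonic mean of precision and recall; the intent names, the confusion counts, and the 0.5 threshold below are hypothetical values for illustration and do not come from the paper.

```python
def f1_score(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def low_f1_intents(per_intent_counts, threshold):
    """Return the intents whose F1 falls below `threshold`, sorted by name."""
    return sorted(
        intent
        for intent, (tp, fp, fn) in per_intent_counts.items()
        if f1_score(tp, fp, fn) < threshold
    )


# Hypothetical (true positive, false positive, false negative) counts per intent.
counts = {
    "craving": (8, 2, 2),
    "relapse": (3, 5, 6),
}
to_augment = low_f1_intents(counts, threshold=0.5)
print(to_augment)
```

Only the intents returned by `low_f1_intents` would then receive synthetic posts, which keeps the augmentation budget focused on the classes the classifier handles worst.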

Title: Enhancing Long Document Long Form Summarisation with Self-Planning

Authors: Xiaotang Du, Rohit Saxena, Laura Perez-Beltrachini, Pasquale Minervini, Ivan Titov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.17179
Pdf URL: https://arxiv.org/pdf/2512.17179
Copy Paste: [[2512.17179]] Enhancing Long Document Long Form Summarisation with Self-Planning(https://arxiv.org/abs/2512.17179)
Keywords: long context
Abstract: We introduce a novel approach for long context summarisation, highlight-guided generation, that leverages sentence-level information as a content plan to improve the traceability and faithfulness of generated summaries. Our framework applies self-planning methods to identify important content and then generates a summary conditioned on the plan. We explore both end-to-end and two-stage variants of the approach, finding that the two-stage pipeline performs better on long and information-dense documents. Experiments on long-form summarisation datasets demonstrate that our method consistently improves factual consistency while preserving relevance and overall quality. On GovReport, our best approach improves ROUGE-L by 4.1 points and achieves about 35% gains in SummaC scores. Qualitative analysis shows that highlight-guided summarisation helps preserve important details, leading to more accurate and insightful summaries across domains.

Title: Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding

Authors: Yuqing Li, Jiangnan Li, Zheng Lin, Ziyan Zhou, Junjie Wu, Weiping Wang, Jie Zhou, Mo Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.17220
Pdf URL: https://arxiv.org/pdf/2512.17220
Copy Paste: [[2512.17220]] Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding(https://arxiv.org/abs/2512.17220)
Keywords: llm, long context, retrieval augmented generation, retrieval-augmented generation
Abstract: Humans understand long and complex texts by relying on a holistic semantic representation of the content. This global view helps organize prior knowledge, interpret new information, and integrate evidence dispersed across a document, as revealed by the Mindscape-Aware Capability of humans in psychology. Current Retrieval-Augmented Generation (RAG) systems lack such guidance and therefore struggle with long-context tasks. In this paper, we propose Mindscape-Aware RAG (MiA-RAG), the first approach that equips LLM-based RAG systems with explicit global context awareness. MiA-RAG builds a mindscape through hierarchical summarization and conditions both retrieval and generation on this global semantic representation. This enables the retriever to form enriched query embeddings and the generator to reason over retrieved evidence within a coherent global context. We evaluate MiA-RAG across diverse long-context and bilingual benchmarks for evidence-based understanding and global sense-making. It consistently surpasses baselines, and further analysis shows that it aligns local details with a coherent global representation, enabling more human-like long-context retrieval and reasoning.

Title: Incorporating Error Level Noise Embedding for Improving LLM-Assisted Robustness in Persian Speech Recognition

Authors: Zahra Rahmani (1), Hossein Sameti (1) ((1) Department of Computer Engineering, Sharif University of Technology)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.17247
Pdf URL: https://arxiv.org/pdf/2512.17247
Copy Paste: [[2512.17247]] Incorporating Error Level Noise Embedding for Improving LLM-Assisted Robustness in Persian Speech Recognition(https://arxiv.org/abs/2512.17247)
Keywords: llm
Abstract: Automatic Speech Recognition (ASR) systems suffer significant performance degradation in noisy environments, a challenge that is especially severe for low-resource languages such as Persian. Even state-of-the-art models such as Whisper struggle to maintain accuracy under varying signal-to-noise ratios (SNRs). This study presents a robust noise-sensitive ASR error correction framework that combines multiple hypotheses and noise-aware modeling. Using noisy Persian speech, we generate 5-best hypotheses from a modified Whisper-large decoder. Error Level Noise (ELN) is introduced as a representation that captures semantic- and token-level disagreement across hypotheses, quantifying the linguistic distortions caused by noise. ELN thus provides a direct measure of noise-induced uncertainty, enabling the LLM to reason about the reliability of each hypothesis during correction. Three models are evaluated: (1) a base LLaMA-2-7B model without fine-tuning, (2) a fine-tuned variant trained on text-only hypotheses, and (3) a noise-conditioned model integrating ELN embeddings at both sentence and word levels. Experimental results demonstrate that the ELN-conditioned model achieves substantial reductions in Word Error Rate (WER). Specifically, on the challenging Mixed Noise test set, the proposed Fine-tuned + ELN (Ours) model reduces the WER from a baseline of 31.10% (Raw Whisper) to 24.84%, significantly surpassing the Fine-tuned (No ELN) text-only baseline of 30.79%, whereas the original LLaMA-2-7B model increased the WER to 64.58%, demonstrating that it is unable to correct Persian errors on its own. This confirms the effectiveness of combining multiple hypotheses with noise-aware embeddings for robust Persian ASR in noisy real-world scenarios.
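For readers less familiar with the metric, Word Error Rate is the word-level edit distance between hypothesis and reference divided by the reference length. The sketch below shows the standard computation and the relative reduction implied by the two WER figures quoted in the abstract; the English example sentences and the function name `word_error_rate` are illustrative assumptions, not material from the paper.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the quick brown fox", "the quick fox jumps")

# Relative WER reduction implied by the abstract's Mixed Noise figures.
raw_whisper_wer, eln_wer = 31.10, 24.84
relative_reduction = (raw_whisper_wer - eln_wer) / raw_whisper_wer
print(wer, round(relative_reduction, 3))
```

The absolute drop from 31.10% to 24.84% is 6.26 points, which corresponds to a relative reduction of roughly one fifth.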

Title: Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience

Authors: Jiangjie Chen, Wenxiang Chen, Jiacheng Du, Jinyi Hu, Zhicheng Jiang, Allan Jie, Xiaoran Jin, Xing Jin, Chenggang Li, Wenlei Shi, Zhihong Wang, Mingxuan Wang, Chenrui Wei, Shufa Wei, Huajian Xin, Fan Yang, Weihao Gao, Zheng Yuan, Tianyang Zhan, Zeyu Zheng, Tianxi Zhou, Thomas Hanwen Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.17260
Pdf URL: https://arxiv.org/pdf/2512.17260
Copy Paste: [[2512.17260]] Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience(https://arxiv.org/abs/2512.17260)
Keywords: language model, llm, agent
Abstract: Large language models have recently made significant progress in generating rigorous mathematical proofs. In contrast, utilizing LLMs for theorem proving in formal languages (such as Lean) remains challenging and computationally expensive, particularly when addressing problems at the undergraduate level and beyond. In this work, we present Seed-Prover 1.5, a formal theorem-proving model trained via large-scale agentic reinforcement learning, alongside an efficient test-time scaling (TTS) workflow. Through extensive interactions with Lean and other tools, the model continuously accumulates experience during the RL process, substantially enhancing the capability and efficiency of formal theorem proving. Furthermore, leveraging recent advancements in natural language proving, our TTS workflow efficiently bridges the gap between natural and formal languages. Compared to state-of-the-art methods, Seed-Prover 1.5 achieves superior performance with a smaller compute budget. It solves 88% of PutnamBench (undergraduate-level), 80% of Fate-H (graduate-level), and 33% of Fate-X (PhD-level) problems. Notably, using our system, we solved 11 out of 12 problems from Putnam 2025 within 9 hours. Our findings suggest that scaling learning from experience, driven by high-quality formal feedback, holds immense potential for the future of formal mathematical reasoning.

Title: AutoMetrics: Approximate Human Judgements with Automatically Generated Evaluators

Authors: Michael J. Ryan, Yanzhe Zhang, Amol Salunkhe, Yi Chu, Di Xu, Diyi Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.17267
Pdf URL: https://arxiv.org/pdf/2512.17267
Copy Paste: [[2512.17267]] AutoMetrics: Approximate Human Judgements with Automatically Generated Evaluators(https://arxiv.org/abs/2512.17267)
Keywords: llm
Abstract: Evaluating user-facing AI applications remains a central challenge, especially in open-ended domains such as travel planning, clinical note generation, or dialogue. The gold standard is user feedback (e.g., thumbs up/down) or behavioral signals (e.g., retention), but these are often scarce in prototypes and research projects, or too slow to use for system optimization. We present AutoMetrics, a framework for synthesizing evaluation metrics under low-data constraints. AutoMetrics combines retrieval from MetricBank, a collection of 48 metrics we curate, with automatically generated LLM-as-a-Judge criteria informed by lightweight human feedback. These metrics are composed via regression to maximize correlation with human signal. AutoMetrics takes you from expensive measures to interpretable automatic metrics. Across 5 diverse tasks, AutoMetrics improves Kendall correlation with human ratings by up to 33.4% over LLM-as-a-Judge while requiring fewer than 100 feedback points. We show that AutoMetrics can be used as a proxy reward with an effect equal to that of a verifiable reward. We release the full AutoMetrics toolkit and MetricBank to accelerate adaptive evaluation of LLM applications.
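The idea of composing several metrics by regression and then checking Kendall correlation against human ratings can be illustrated with a toy example. This is only a sketch under stated assumptions: it uses ordinary least squares on two hypothetical metrics, evaluates in-sample on five made-up ratings, and the helper names (`kendall_tau`, `fit_two_metric_weights`) are not from the AutoMetrics toolkit.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def fit_two_metric_weights(a, b, human):
    """Least-squares weights (no intercept) for human ~ w_a * a + w_b * b,
    obtained by solving the 2x2 normal equations with Cramer's rule."""
    aa, bb, ab = dot(a, a), dot(b, b), dot(a, b)
    ah, bh = dot(a, human), dot(b, human)
    det = aa * bb - ab * ab
    if det == 0:
        raise ValueError("metrics are collinear; weights are not unique")
    return (ah * bb - ab * bh) / det, (aa * bh - ab * ah) / det


# Hypothetical human ratings and two imperfect automatic metrics.
human = [1, 2, 3, 4, 5]
metric_a = [1, 3, 2, 4, 5]
metric_b = [2, 1, 3, 5, 4]

w_a, w_b = fit_two_metric_weights(metric_a, metric_b, human)
combined = [w_a * a + w_b * b for a, b in zip(metric_a, metric_b)]

tau_a = kendall_tau(metric_a, human)
tau_b = kendall_tau(metric_b, human)
tau_combined = kendall_tau(combined, human)
print(tau_a, tau_b, tau_combined)
```

Each metric alone misorders some pairs, but their errors fall on different items, so the fitted combination ranks the toy items in the same order as the human ratings. With real data the fit would need held-out feedback points to avoid overstating the gain.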

Title: Subjective Question Generation and Answer Evaluation using NLP

Authors: G. M. Refatul Islam, Safwan Shaheer, Yaseen Nur, Mohammad Rafid Hamid
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.17289
Pdf URL: https://arxiv.org/pdf/2512.17289
Copy Paste: [[2512.17289]] Subjective Question Generation and Answer Evaluation using NLP(https://arxiv.org/abs/2512.17289)
Keywords: chat
Abstract: Natural Language Processing (NLP) is one of the most revolutionary technologies today. It uses artificial intelligence to understand human text and spoken words. It is used for text summarization, grammar checking, sentiment analysis, and advanced chatbots and has many more potential use cases. Furthermore, it has also made its mark on the education sector. Much research has already been conducted on objective question generation; however, automated subjective question generation and answer evaluation are still in progress. An automated system to generate subjective questions and evaluate the answers can help teachers assess student work and enhance students' learning experience by allowing them to self-assess their understanding after reading an article or a chapter of a book. This research aims to improve current NLP models or make a novel one for automated subjective question generation and answer evaluation from text input.

Title: Governance-Aware Hybrid Fine-Tuning for Multilingual Large Language Models

Authors: Haomin Qi, Chengbo Huang, Zihan Dai, Yunkai Gao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.17344
Pdf URL: https://arxiv.org/pdf/2512.17344
Copy Paste: [[2512.17344]] Governance-Aware Hybrid Fine-Tuning for Multilingual Large Language Models(https://arxiv.org/abs/2512.17344)
Keywords: language model
Abstract: We present a governance-aware hybrid fine-tuning framework for multilingual, low-resource adaptation of large language models. The core algorithm combines gradient-aligned low-rank updates with structured orthogonal transformations through layer-wise mixing and introduces unitary constraints in selected sub-layers to stabilize deep optimization. In tandem with lightweight, label-free data governance steps, including language identification, near-duplicate removal, and quality filtering, the framework targets accuracy, calibration, and cross-language parity under tight compute budgets. Across XNLI and FLORES, the hybrid approach delivers consistent gains over strong PEFT baselines while maintaining directional balance and improving probability calibration, as shown in Tables II and III. It is more resilient to lightweight orthographic variants, as shown in Table IV, and benefits additively from simple governance steps, as shown in Table V. Training footprint measurements indicate modest overhead and a favorable cost-quality frontier, as shown in Table VI and Figure 2. Together, these results show that hybrid and unitary PEFT provide a stable and accessible path to resource-efficient multilingual adaptation when paired with practical data governance.

Title: Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers

Authors: Zeyuan Allen-Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.17351
Pdf URL: https://arxiv.org/pdf/2512.17351
Copy Paste: [[2512.17351]] Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers(https://arxiv.org/abs/2512.17351)
Keywords: language model
Abstract: Understanding architectural differences in language models is challenging, especially at academic-scale pretraining (e.g., 1.3B parameters, 100B tokens), where results are often dominated by noise and randomness. To overcome this, we introduce controlled synthetic pretraining tasks that isolate and evaluate core model capabilities. Within this framework, we discover CANON LAYERS: lightweight architectural components -- named after the musical term "canon" -- that promote horizontal information flow across neighboring tokens. Canon layers compute weighted sums of nearby token representations and integrate seamlessly into Transformers, linear attention, state-space models, or any sequence architecture. We present 12 key results. This includes how Canon layers enhance reasoning depth (e.g., by 2×), reasoning breadth, knowledge manipulation, etc. They lift weak architectures like NoPE to match RoPE, and linear attention to rival SOTA linear models like Mamba2/GDN -- validated both through synthetic tasks and real-world academic-scale pretraining. This synthetic playground offers an economical, principled path to isolate core model capabilities often obscured at academic scales. Equipped with infinite high-quality data, it may even PREDICT how future architectures will behave as training pipelines improve -- e.g., through better data curation or RL-based post-training -- unlocking deeper reasoning and hierarchical inference.
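The abstract's one-line description of Canon layers ("weighted sums of nearby token representations") can be illustrated with a very small sketch. The details here are assumptions made for illustration: the window is causal (current token plus preceding ones), the weights are fixed scalars rather than learned parameters, and the name `canon_mix` is hypothetical. The paper's actual layer may differ.

```python
def canon_mix(hidden, weights):
    """Causal weighted sum over the current and preceding token vectors.

    hidden  : list of token vectors (lists of floats), one per position
    weights : weights[k] scales the token k positions back
              (k = 0 is the current token itself)
    """
    dim = len(hidden[0])
    output = []
    for t in range(len(hidden)):
        mixed = [0.0] * dim
        for k, w in enumerate(weights):
            if t - k < 0:
                break  # no tokens exist before the start of the sequence
            for d in range(dim):
                mixed[d] += w * hidden[t - k][d]
        output.append(mixed)
    return output


# Three tokens with 2-dimensional representations (toy values).
hidden_states = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
mixed_states = canon_mix(hidden_states, weights=[1.0, 0.5])
print(mixed_states)
```

Each output position now carries information from its left neighbor, which is the kind of horizontal information flow across neighboring tokens that the abstract attributes to these layers.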

Title: UCoder: Unsupervised Code Generation by Internal Probing of Large Language Models

Authors: Jiajun Wu, Jian Yang, Wei Zhang, Lin Jing, Yuqing Ma, Ensheng Shi, Yuchi Ma, Zhoujun Li, Xianglong Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.17385
Pdf URL: https://arxiv.org/pdf/2512.17385
Copy Paste: [[2512.17385]] UCoder: Unsupervised Code Generation by Internal Probing of Large Language Models(https://arxiv.org/abs/2512.17385)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in code generation tasks. However, their effectiveness heavily relies on supervised training with extensive labeled (e.g., question-answering pairs) or unlabeled datasets (e.g., code snippets), which are often expensive and difficult to obtain at scale. To address this limitation, this paper introduces IPC, an unsupervised framework that leverages Internal Probing of LLMs for Code generation without any external corpus, not even unlabeled code snippets. We introduce problem space probing, test understanding probing, solution space probing, and knowledge consolidation and reinforcement to probe the internal knowledge and confidence patterns existing in LLMs. Further, IPC identifies reliable code candidates through self-consistency mechanisms and representation-based quality estimation to train UCoder (coder with unsupervised learning). We validate the proposed approach across multiple code benchmarks, demonstrating that unsupervised methods can achieve competitive performance compared to supervised approaches while significantly reducing the dependency on labeled data and computational resources. Analytic experiments reveal that internal model states contain rich signals about code quality and correctness, and that properly harnessing these signals enables effective unsupervised learning for code generation tasks, opening new directions for training code LLMs in resource-constrained scenarios.
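Abstract (from the Chinese translation): Large language models (LLMs) have shown remarkable ability on code generation tasks. However, their effectiveness depends heavily on supervised training with extensive labeled data (e.g., question-answer pairs) or unlabeled data (e.g., code snippets), which are often expensive and hard to obtain at scale. To address this limitation, the paper introduces IPC, an unsupervised framework that uses internal probing of LLMs to generate code without any external corpus, not even unlabeled code snippets. We introduce problem space probing, test understanding probing, solution space probing, and knowledge consolidation and reinforcement to probe the internal knowledge and confidence patterns present in LLMs. IPC then identifies reliable code candidates through self-consistency mechanisms and representation-based quality estimation in order to train UCoder (a coder trained with unsupervised learning). We validate the approach on multiple code benchmarks, showing that unsupervised methods can reach performance competitive with supervised methods while significantly reducing dependence on labeled data and computational resources. Analytic experiments show that internal model states contain rich signals about code quality and correctness, and that properly exploiting these signals enables effective unsupervised learning for code generation, opening new directions for training code LLMs in resource-constrained settings.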
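
The abstract mentions selecting reliable code candidates through self-consistency. One common way to realise that idea, sketched below, is to run every candidate on the same inputs and keep the largest group that agrees on all outputs. This execution-based grouping is an assumption of the sketch, not a description of IPC itself, and `select_by_self_consistency` is a hypothetical helper.

```python
from collections import defaultdict

def select_by_self_consistency(candidates, test_inputs):
    """Return the largest group of candidates that agree on every test input.

    candidates:  list of callables taking one argument
    test_inputs: list of inputs fed to every candidate
    """
    clusters = defaultdict(list)
    for fn in candidates:
        outputs = []
        for x in test_inputs:
            try:
                outputs.append(repr(fn(x)))
            except Exception:
                # A crashing candidate gets its own behaviour signature.
                outputs.append("<error>")
        clusters[tuple(outputs)].append(fn)
    # Ties are broken in favour of the group seen first.
    return max(clusters.values(), key=len)

# Three hypothetical candidates for "double the input"; the third is wrong.
candidates = [lambda x: x * 2, lambda x: x + x, lambda x: x ** 2]
agreeing = select_by_self_consistency(candidates, [1, 3])
print(len(agreeing))
```

Agreement is only a proxy for correctness: a majority of candidates can share the same bug, which is presumably why the abstract pairs it with a second, representation-based quality signal.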

Title: Are Vision Language Models Cross-Cultural Theory of Mind Reasoners?

Authors: Zabir Al Nazi, G M Shahariar, Abrar Hossain, Wei Peng
Subjects: cs.CL, cs.CV, cs.CY
Abstract URL: https://arxiv.org/abs/2512.17394
Pdf URL: https://arxiv.org/pdf/2512.17394
Copy Paste: [[2512.17394]] Are Vision Language Models Cross-Cultural Theory of Mind Reasoners?(https://arxiv.org/abs/2512.17394)
Keywords: language model, agent
Abstract: Theory of Mind (ToM) -- the ability to attribute beliefs, desires, and emotions to others -- is fundamental for human social intelligence, yet remains a major challenge for artificial agents. Existing Vision-Language Models (VLMs) are increasingly applied in socially grounded tasks, but their capacity for cross-cultural ToM reasoning is largely unexplored. In this work, we introduce CulturalToM-VQA, a new evaluation benchmark containing 5095 questions designed to probe ToM reasoning across diverse cultural contexts through visual question answering. The dataset captures culturally grounded cues such as rituals, attire, gestures, and interpersonal dynamics, enabling systematic evaluation of ToM reasoning beyond Western-centric benchmarks. Our dataset is built through a VLM-assisted human-in-the-loop pipeline, where human experts first curate culturally rich images across traditions, rituals, and social interactions; a VLM then assists in generating structured ToM-focused scene descriptions, which are refined into question-answer pairs spanning a taxonomy of six ToM tasks and four graded complexity levels. The resulting dataset covers diverse theory of mind facets such as mental state attribution, false belief reasoning, non-literal communication, social norm violations, perspective coordination, and multi-agent reasoning.
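Abstract (from the Chinese translation): Theory of Mind (ToM), the ability to attribute beliefs, desires, and emotions to others, is fundamental to human social intelligence, yet it remains a major challenge for artificial agents. Existing vision-language models (VLMs) are increasingly applied to socially grounded tasks, but their capacity for cross-cultural ToM reasoning is largely unexplored. In this work we introduce CulturalToM-VQA, a new evaluation benchmark of 5095 questions designed to probe ToM reasoning across diverse cultural contexts through visual question answering. The dataset captures culturally grounded cues such as rituals, attire, gestures, and interpersonal dynamics, enabling systematic evaluation of ToM reasoning beyond Western-centric benchmarks. It is built through a VLM-assisted human-in-the-loop pipeline: human experts first curate culturally rich images covering traditions, rituals, and social interactions; a VLM then helps generate structured, ToM-focused scene descriptions, which are refined into question-answer pairs spanning a taxonomy of six ToM tasks and four graded complexity levels. The resulting dataset covers diverse facets of theory of mind, such as mental state attribution, false belief reasoning, non-literal communication, social norm violations, perspective coordination, and multi-agent reasoning.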

Title: Confidence-Credibility Aware Weighted Ensembles of Small LLMs Outperform Large LLMs in Emotion Detection

Authors: Menna Elgabry, Ali Hamdi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2512.17630
Pdf URL: https://arxiv.org/pdf/2512.17630
Copy Paste: [[2512.17630]] Confidence-Credibility Aware Weighted Ensembles of Small LLMs Outperform Large LLMs in Emotion Detection(https://arxiv.org/abs/2512.17630)
Keywords: language model, llm
Abstract: This paper introduces a confidence-weighted, credibility-aware ensemble framework for text-based emotion detection, inspired by Condorcet's Jury Theorem (CJT). Unlike conventional ensembles that often rely on homogeneous architectures, our approach combines architecturally diverse small transformer-based large language models (sLLMs) - BERT, RoBERTa, DistilBERT, DeBERTa, and ELECTRA, each fully fine-tuned for emotion classification. To preserve error diversity, we minimize parameter convergence while taking advantage of the unique biases of each model. A dual-weighted voting mechanism integrates both global credibility (validation F1 score) and local confidence (instance-level probability) to dynamically weight model contributions. Experiments on the DAIR-AI dataset demonstrate that our credibility-confidence ensemble achieves a macro F1 score of 93.5 percent, surpassing state-of-the-art benchmarks and significantly outperforming large-scale LLMs, including Falcon, Mistral, Qwen, and Phi, even after task-specific Low-Rank Adaptation (LoRA). With only 595M parameters in total, our small LLMs ensemble proves more parameter-efficient and robust than models up to 7B parameters, establishing that carefully designed ensembles of small, fine-tuned models can outperform much larger LLMs in specialized natural language processing (NLP) tasks such as emotion detection.
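Abstract (from the Chinese translation): Inspired by Condorcet's Jury Theorem (CJT), this paper introduces a confidence-weighted, credibility-aware ensemble framework for text-based emotion detection. Unlike conventional ensembles, which often rely on homogeneous architectures, our approach combines architecturally diverse small transformer-based language models (sLLMs): BERT, RoBERTa, DistilBERT, DeBERTa, and ELECTRA, each fully fine-tuned for emotion classification. To preserve error diversity, we minimize parameter convergence while exploiting the distinct biases of each model. A dual-weighted voting mechanism combines global credibility (validation F1 score) with local confidence (instance-level probability) to weight each model's contribution dynamically. Experiments on the DAIR-AI dataset show that our credibility-confidence ensemble reaches a macro F1 score of 93.5%, surpassing state-of-the-art benchmarks and significantly outperforming large-scale LLMs, including Falcon, Mistral, Qwen, and Phi, even after task-specific Low-Rank Adaptation (LoRA). With only 595M parameters in total, the ensemble of small LLMs proves more parameter-efficient and more robust than models of up to 7B parameters, showing that carefully designed ensembles of small, fine-tuned models can outperform much larger LLMs on specialized natural language processing (NLP) tasks such as emotion detection.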
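
The dual-weighted voting described above can be sketched as follows. Each model votes for its top class, and the vote is weighted by the model's validation F1 (global credibility) times its probability for that class (local confidence). Combining the two by multiplication is an assumption of this sketch; the abstract does not give the exact formula. `dual_weighted_vote` and the example numbers are hypothetical.

```python
import numpy as np

def dual_weighted_vote(probs, credibility):
    """Pick a class by credibility- and confidence-weighted voting.

    probs:       (n_models, n_classes) class probabilities for ONE instance
    credibility: (n_models,) validation F1 score of each model
    Returns (winning_class_index, per_class_scores).
    """
    probs = np.asarray(probs, dtype=float)
    credibility = np.asarray(credibility, dtype=float)
    votes = probs.argmax(axis=1)     # each model's predicted class
    confidence = probs.max(axis=1)   # probability of that predicted class
    scores = np.zeros(probs.shape[1])
    for cls, conf, cred in zip(votes, confidence, credibility):
        scores[cls] += cred * conf   # assumed combination: product
    return int(scores.argmax()), scores

# Two weak, unsure models vote for class 0; one strong, sure model for class 1.
probs = [[0.51, 0.49], [0.05, 0.95], [0.52, 0.48]]
winner, scores = dual_weighted_vote(probs, credibility=[0.6, 0.95, 0.6])
print(winner, scores)
```

In this example a plain majority vote would pick class 0, while the weighted vote lets the single credible, confident model prevail.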

Title: Linear Personality Probing and Steering in LLMs: A Big Five Study

Authors: Michel Frising, Daniel Balcells
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.17639
Pdf URL: https://arxiv.org/pdf/2512.17639
Copy Paste: [[2512.17639]] Linear Personality Probing and Steering in LLMs: A Big Five Study(https://arxiv.org/abs/2512.17639)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) exhibit distinct and consistent personalities that greatly impact trust and engagement. While this means that personality frameworks would be highly valuable tools to characterize and control LLMs' behavior, current approaches remain either costly (post-training) or brittle (prompt engineering). Probing and steering via linear directions has recently emerged as a cheap and efficient alternative. In this paper, we investigate whether linear directions aligned with the Big Five personality traits can be used for probing and steering model behavior. Using Llama 3.3 70B, we generate descriptions of 406 fictional characters and their Big Five trait scores. We then prompt the model with these descriptions and questions from the Alpaca questionnaire, allowing us to sample hidden activations that vary along personality traits in known, quantifiable ways. Using linear regression, we learn a set of per-layer directions in activation space, and test their effectiveness for probing and steering model behavior. Our results suggest that linear directions aligned with trait-scores are effective probes for personality detection, while their steering capabilities strongly depend on context, producing reliable effects in forced-choice tasks but limited influence in open-ended generation or when additional context is present in the prompt.
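Abstract (from the Chinese translation): Large language models (LLMs) exhibit distinct and consistent personalities that strongly affect trust and engagement. While this suggests that personality frameworks would be highly valuable tools for characterizing and controlling LLM behavior, current approaches remain either costly (post-training) or brittle (prompt engineering). Probing and steering via linear directions has recently emerged as a cheap and efficient alternative. In this paper we investigate whether linear directions aligned with the Big Five personality traits can be used to probe and steer model behavior. Using Llama 3.3 70B, we generate descriptions of 406 fictional characters together with their Big Five trait scores. We then prompt the model with these descriptions and with questions from the Alpaca questionnaire, which lets us sample hidden activations that vary along personality traits in known, quantifiable ways. Using linear regression, we learn a set of per-layer directions in activation space and test how effective they are for probing and steering model behavior. Our results suggest that linear directions aligned with trait scores are effective probes for personality detection, whereas their steering ability depends strongly on context: they produce reliable effects in forced-choice tasks but have limited influence in open-ended generation or when the prompt contains additional context.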
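
The probing-and-steering recipe above (fit a linear regression from hidden activations to trait scores, then reuse the fitted direction) can be sketched with NumPy on synthetic data. Everything here is illustrative: the activations are random vectors rather than real Llama hidden states, and `fit_direction`, `probe`, and `steer` are hypothetical helpers, not the authors' code.

```python
import numpy as np

def fit_direction(activations, trait_scores):
    """Least-squares fit of trait score from activations for one layer.

    Returns (direction, bias) such that score ~ activation @ direction + bias.
    """
    X = np.asarray(activations, dtype=float)
    y = np.asarray(trait_scores, dtype=float)
    X1 = np.hstack([X, np.ones((len(X), 1))])   # append a bias column
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef[:-1], coef[-1]

def probe(activation, direction, bias):
    """Read a trait score off a single activation vector."""
    return float(np.asarray(activation, dtype=float) @ direction + bias)

def steer(activation, direction, alpha):
    """Shift an activation by alpha along the unit-normalised direction."""
    unit = direction / np.linalg.norm(direction)
    return np.asarray(activation, dtype=float) + alpha * unit

# Synthetic stand-in for hidden states: 200 samples, 8 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
true_w = rng.normal(size=8)
y = X @ true_w + 0.5                      # noise-free "trait scores"

direction, bias = fit_direction(X, y)
before = probe(X[0], direction, bias)
after = probe(steer(X[0], direction, alpha=2.0), direction, bias)
print(before, after)
```

In this linear toy setting, steering by `alpha` raises the probed score by `alpha` times the norm of the direction. The abstract's finding is that the effect on real model behavior is much more context-dependent than this toy suggests.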

Title: Toward Ethical AI Through Bayesian Uncertainty in Neural Question Answering

Authors: Riccardo Di Sipio
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.17677
Pdf URL: https://arxiv.org/pdf/2512.17677
Copy Paste: [[2512.17677]] Toward Ethical AI Through Bayesian Uncertainty in Neural Question Answering(https://arxiv.org/abs/2512.17677)
Keywords: language model
Abstract: We explore Bayesian reasoning as a means to quantify uncertainty in neural networks for question answering. Starting with a multilayer perceptron on the Iris dataset, we show how posterior inference conveys confidence in predictions. We then extend this to language models, applying Bayesian inference first to a frozen head and finally to LoRA-adapted transformers, evaluated on the CommonsenseQA benchmark. Rather than aiming for state-of-the-art accuracy, we compare Laplace approximations against maximum a posteriori (MAP) estimates to highlight uncertainty calibration and selective prediction. This allows models to abstain when confidence is low. An "I don't know" response not only improves interpretability but also illustrates how Bayesian methods can contribute to more responsible and ethical deployment of neural question-answering systems.
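Abstract (from the Chinese translation): We explore Bayesian reasoning as a means of quantifying uncertainty in neural networks for question answering. Starting with a multilayer perceptron on the Iris dataset, we show how posterior inference conveys confidence in predictions. We then extend this to language models, applying Bayesian inference first to a frozen head and finally to LoRA-adapted transformers, evaluated on the CommonsenseQA benchmark. Rather than aiming for state-of-the-art accuracy, we compare Laplace approximations with maximum a posteriori (MAP) estimates to highlight uncertainty calibration and selective prediction. This allows a model to abstain when its confidence is low. An "I don't know" response not only improves interpretability but also illustrates how Bayesian methods can contribute to more responsible and ethical deployment of neural question-answering systems.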
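
Selective prediction, as described above, can be illustrated in a few lines: average the class probabilities obtained from several posterior weight samples, then abstain when the averaged confidence falls below a threshold. The threshold value, the sample probabilities, and the helper `selective_predict` are assumptions of this sketch. How the posterior samples are produced (for instance by a Laplace approximation) is outside its scope.

```python
import numpy as np

ABSTAIN = "I don't know"

def selective_predict(sample_probs, threshold=0.7):
    """Predict a class index, or abstain when confidence is low.

    sample_probs: (n_samples, n_classes) class probabilities, one row per
                  posterior weight sample
    threshold:    minimum averaged probability required to answer
    """
    sample_probs = np.asarray(sample_probs, dtype=float)
    predictive = sample_probs.mean(axis=0)   # Bayesian model average
    best = int(predictive.argmax())
    if predictive[best] < threshold:
        return ABSTAIN
    return best

# Posterior samples that disagree: the average is unsure, so abstain.
print(selective_predict([[0.9, 0.1], [0.2, 0.8]]))
# Posterior samples that agree: the average is confident, so answer.
print(selective_predict([[0.9, 0.1], [0.8, 0.2]]))
```

A single MAP estimate corresponds to passing one row. The averaging step is what lets disagreement between posterior samples lower the reported confidence.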

Title: When the Gold Standard isn't Necessarily Standard: Challenges of Evaluating the Translation of User-Generated Content

Authors: Lydia Nishimwe, Benoît Sagot, Rachel Bawden
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.17738
Pdf URL: https://arxiv.org/pdf/2512.17738
Copy Paste: [[2512.17738]] When the Gold Standard isn't Necessarily Standard: Challenges of Evaluating the Translation of User-Generated Content(https://arxiv.org/abs/2512.17738)
Keywords: language model, llm, prompt
Abstract: User-generated content (UGC) is characterised by frequent use of non-standard language, from spelling errors to expressive choices such as slang, character repetitions, and emojis. This makes evaluating UGC translation particularly challenging: what counts as a "good" translation depends on the level of standardness desired in the output. To explore this, we examine the human translation guidelines of four UGC datasets, and derive a taxonomy of twelve non-standard phenomena and five translation actions (NORMALISE, COPY, TRANSFER, OMIT, CENSOR). Our analysis reveals notable differences in how UGC is treated, resulting in a spectrum of standardness in reference translations. Through a case study on large language models (LLMs), we show that translation scores are highly sensitive to prompts with explicit translation instructions for UGC, and that they improve when these align with the dataset's guidelines. We argue that when preserving UGC style is important, fair evaluation requires both models and metrics to be aware of translation guidelines. Finally, we call for clear guidelines during dataset creation and for the development of controllable, guideline-aware evaluation frameworks for UGC translation.
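Abstract (from the Chinese translation): User-generated content (UGC) is characterised by frequent use of non-standard language, ranging from spelling errors to expressive choices such as slang, character repetitions, and emojis. This makes evaluating UGC translation particularly challenging: what counts as a "good" translation depends on the level of standardness desired in the output. To explore this, we examine the human translation guidelines of four UGC datasets and derive a taxonomy of twelve non-standard phenomena and five translation actions (NORMALISE, COPY, TRANSFER, OMIT, CENSOR). Our analysis reveals notable differences in how UGC is treated, which results in a spectrum of standardness across reference translations. Through a case study on large language models (LLMs), we show that translation scores are highly sensitive to prompts containing explicit translation instructions for UGC, and that scores improve when those instructions align with the dataset's guidelines. We argue that when preserving UGC style matters, fair evaluation requires both models and metrics to be aware of the translation guidelines. Finally, we call for clear guidelines during dataset creation and for the development of controllable, guideline-aware evaluation frameworks for UGC translation.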

Title: AncientBench: Towards Comprehensive Evaluation on Excavated and Transmitted Chinese Corpora

Authors: Zhihan Zhou, Daqian Shi, Rui Song, Lida Shi, Xiaolei Diao, Hao Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.17756
Pdf URL: https://arxiv.org/pdf/2512.17756
Copy Paste: [[2512.17756]] AncientBench: Towards Comprehensive Evaluation on Excavated and Transmitted Chinese Corpora(https://arxiv.org/abs/2512.17756)
Keywords: language model
Abstract: Comprehension of ancient texts plays an important role in archaeology and in understanding Chinese history and civilization. The rapid development of large language models calls for benchmarks that can evaluate their comprehension of ancient characters. Existing Chinese benchmarks mostly target modern Chinese and transmitted documents in ancient Chinese; excavated documents in ancient Chinese are not covered. To meet this need, we propose AncientBench, which aims to evaluate the comprehension of ancient characters, especially in the scenario of excavated documents. AncientBench is divided into four dimensions, which correspond to the four competencies of ancient character comprehension: glyph comprehension, pronunciation comprehension, meaning comprehension, and contextual comprehension. The benchmark also contains ten tasks, including radical, phonetic radical, homophone, cloze, translation, and more, providing a comprehensive framework for evaluation. We convened archaeological researchers to conduct experimental evaluations, proposed an ancient model as a baseline, and conducted extensive experiments on the currently best-performing large language models. The experimental results reveal the great potential of large language models in ancient textual scenarios as well as the gap with humans. Our research aims to promote the development and application of large language models in the field of archaeology and ancient Chinese language.
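Abstract (from the Chinese translation): Comprehension of ancient texts plays an important role in archaeology and in understanding Chinese history and civilization. The rapid development of large language models calls for benchmarks that can evaluate their comprehension of ancient characters. Existing Chinese benchmarks mostly target modern Chinese and transmitted documents in ancient Chinese, and do not cover excavated documents in ancient Chinese. To meet this need, we propose AncientBench, which aims to evaluate comprehension of ancient characters, especially in the setting of excavated documents. AncientBench is divided into four dimensions corresponding to the four competencies of ancient character comprehension: glyph comprehension, pronunciation comprehension, meaning comprehension, and contextual comprehension. The benchmark also contains ten tasks, including radical, phonetic radical, homophone, cloze, translation, and more, providing a comprehensive evaluation framework. We convened archaeological researchers to conduct experimental evaluations, proposed an ancient model as a baseline, and ran extensive experiments on the currently best-performing large language models. The results reveal both the great potential of large language models in ancient-text scenarios and the gap that remains relative to humans. Our research aims to promote the development and application of large language models in archaeology and the ancient Chinese language.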

Title: DEER: A Comprehensive and Reliable Benchmark for Deep-Research Expert Reports

Authors: Janghoon Han, Heegyu Kim, Changho Lee, Dahm Lee, Min Hyung Park, Hosung Song, Stanley Jungkyu Choi, Moontae Lee, Honglak Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.17776
Pdf URL: https://arxiv.org/pdf/2512.17776
Copy Paste: [[2512.17776]] DEER: A Comprehensive and Reliable Benchmark for Deep-Research Expert Reports(https://arxiv.org/abs/2512.17776)
Keywords: language model, llm
Abstract: As large language models (LLMs) advance, deep research systems can generate expert-level reports via multi-step reasoning and evidence-based synthesis, but evaluating such reports remains challenging. Existing benchmarks often lack systematic criteria for expert reporting; evaluations that rely heavily on LLM judges can fail to capture issues that require expert judgment; and source verification typically covers only a limited subset of explicitly cited statements rather than report-wide factual reliability. We introduce DEER, a benchmark for evaluating expert-level deep research reports. DEER comprises 50 report-writing tasks spanning 13 domains and an expert-grounded evaluation taxonomy (7 dimensions, 25 sub-dimensions) operationalized into 130 fine-grained rubric items. DEER further provides task-specific expert guidance to help LLM judges assess expert-level report quality more consistently. Complementing rubric-based assessment, we propose a document-level fact-checking architecture that extracts and verifies all claims across the entire report, including both cited and uncited ones, and quantifies external-evidence quality. DEER correlates closely with human expert judgments and yields interpretable diagnostics of system strengths and weaknesses.
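Abstract (from the Chinese translation): As large language models (LLMs) advance, deep research systems can generate expert-level reports through multi-step reasoning and evidence-based synthesis, but evaluating such reports remains challenging. Existing benchmarks often lack systematic criteria for expert reporting; evaluations that rely heavily on LLM judges may miss issues that require expert judgment; and source verification usually covers only a limited subset of explicitly cited statements rather than the factual reliability of the whole report. We introduce DEER, a benchmark for evaluating expert-level deep research reports. DEER comprises 50 report-writing tasks spanning 13 domains and an expert-grounded evaluation taxonomy (7 dimensions, 25 sub-dimensions) operationalized into 130 fine-grained rubric items. DEER also provides task-specific expert guidance to help LLM judges assess expert-level report quality more consistently. To complement rubric-based assessment, we propose a document-level fact-checking architecture that extracts and verifies every claim in the report, cited or uncited, and quantifies the quality of external evidence. DEER correlates closely with human expert judgments and yields interpretable diagnostics of system strengths and weaknesses.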

Title: ShareChat: A Dataset of Chatbot Conversations in the Wild

Authors: Yueru Yan, Tuc Nguyen, Bo Su, Melissa Lieffers, Thai Le
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2512.17843
Pdf URL: https://arxiv.org/pdf/2512.17843
Copy Paste: [[2512.17843]] ShareChat: A Dataset of Chatbot Conversations in the Wild(https://arxiv.org/abs/2512.17843)
Keywords: language model, gpt, llm, chat
Abstract: While Large Language Models (LLMs) have evolved into distinct platforms with unique interface designs and capabilities, existing public datasets treat models as generic text generators, stripping away the interface context that actively shapes user interaction. To address this limitation, we present ShareChat, a large-scale, cross-platform corpus comprising 142,808 conversations and over 660,000 turns collected from publicly shared URLs across five major platforms: ChatGPT, Claude, Gemini, Perplexity, and Grok. ShareChat distinguishes itself by preserving native platform affordances often lost in standard logs, including reasoning traces, source links, and code artifacts, while spanning 101 languages over the period from April 2023 to October 2025. Furthermore, ShareChat offers substantially longer context windows and greater interaction depth than prior datasets. We demonstrate the dataset's multifaceted utility through three representative analyses: (1) analyzing conversation completeness to measure user intent satisfaction; (2) evaluating source citation behaviors in content generation; and (3) conducting temporal analysis to track evolving usage patterns. This work provides the community with a vital and timely resource for understanding authentic user-LLM chatbot interactions in the wild.
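Abstract (from the Chinese translation): Although large language models (LLMs) have evolved into distinct platforms with unique interface designs and capabilities, existing public datasets treat models as generic text generators, stripping away the interface context that actively shapes user interaction. To address this limitation, we present ShareChat, a large-scale, cross-platform corpus of 142,808 conversations and more than 660,000 turns collected from publicly shared URLs on five major platforms: ChatGPT, Claude, Gemini, Perplexity, and Grok. ShareChat stands out by preserving native platform affordances that are often lost in standard logs, including reasoning traces, source links, and code artifacts, while spanning 101 languages over the period from April 2023 to October 2025. ShareChat also offers substantially longer context windows and greater interaction depth than prior datasets. We demonstrate the dataset's multifaceted utility through three representative analyses: (1) analyzing conversation completeness to measure how well user intent is satisfied; (2) evaluating source citation behavior in content generation; and (3) conducting temporal analysis to track evolving usage patterns. This work gives the community a vital and timely resource for understanding authentic user-LLM chatbot interactions in the wild.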