2025-11-10

Title: Evaluating LLMs' Reasoning Over Ordered Procedural Steps

Authors: Adrita Anika, Md Messal Monem Miah
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.04688
Pdf URL: https://arxiv.org/pdf/2511.04688
Copy Paste: [[2511.04688]] Evaluating LLMs' Reasoning Over Ordered Procedural Steps(https://arxiv.org/abs/2511.04688)
Keywords: language model, llm
Abstract: Reasoning over procedural sequences, where the order of steps directly impacts outcomes, is a critical capability for large language models (LLMs). In this work, we study the task of reconstructing globally ordered sequences from shuffled procedural steps, using a curated dataset of food recipes, a domain where correct sequencing is essential for task success. We evaluate several LLMs under zero-shot and few-shot settings and present a comprehensive evaluation framework that adapts established metrics from ranking and sequence alignment. These include Kendall's Tau, Normalized Longest Common Subsequence (NLCS), and Normalized Edit Distance (NED), which capture complementary aspects of ordering quality. Our analysis shows that model performance declines with increasing sequence length, reflecting the added complexity of longer procedures. We also find that greater step displacement in the input, corresponding to more severe shuffling, leads to further degradation. These findings highlight the limitations of current LLMs in procedural reasoning, especially with longer and more disordered inputs.
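The three ordering metrics named in the abstract above can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the function names are hypothetical, steps are assumed to be unique hashable IDs, and the normalizations (LCS divided by the gold length, edit distance divided by the longer sequence) are assumptions, since the abstract does not spell them out.

```python
from itertools import combinations


def kendall_tau(pred, gold):
    """Kendall's Tau between two orderings of the same unique items."""
    pos = {item: i for i, item in enumerate(pred)}
    concordant = discordant = 0
    for a, b in combinations(gold, 2):  # a precedes b in the gold order
        if pos[a] < pos[b]:
            concordant += 1
        else:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 1.0


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def nlcs(pred, gold):
    """LCS length normalized by the gold length (assumed normalization)."""
    return lcs_length(pred, gold) / len(gold)


def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete, substitute."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def ned(pred, gold):
    """Edit distance normalized by the longer sequence (assumed normalization)."""
    return edit_distance(pred, gold) / max(len(pred), len(gold))


gold = [0, 1, 2, 3]
pred = [1, 0, 2, 3]  # the first two steps are swapped
print(kendall_tau(pred, gold), nlcs(pred, gold), ned(pred, gold))
```

The three scores react differently to the same error, which is why they are complementary: a single adjacent swap removes one concordant pair for Tau, shortens the common subsequence by one step, and costs two edits.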

Title: Adaptive Testing for LLM Evaluation: A Psychometric Alternative to Static Benchmarks

Authors: Peiyu Li, Xiuxiu Tang, Si Chen, Ying Cheng, Ronald Metoyer, Ting Hua, Nitesh V. Chawla
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.04689
Pdf URL: https://arxiv.org/pdf/2511.04689
Copy Paste: [[2511.04689]] Adaptive Testing for LLM Evaluation: A Psychometric Alternative to Static Benchmarks(https://arxiv.org/abs/2511.04689)
Keywords: language model, llm
Abstract: Large language model evaluation requires thousands of benchmark items, making evaluations expensive and slow. Existing methods compute average accuracy across fixed item sets, treating all items equally despite varying quality and informativeness. We present ATLAS, an adaptive testing framework using Item Response Theory (IRT) to estimate model ability through Fisher information-guided item selection. Our analysis of five major benchmarks reveals that 3-6% of items exhibit negative discrimination, indicating annotation errors that corrupt static evaluation. ATLAS achieves 90% item reduction while maintaining measurement precision: on HellaSwag (5,608 items), we match full-benchmark estimates using only 42 items with 0.154 MAE. Our framework maintains item exposure rates below 10% and test overlap at 16-27%, compared to static benchmarks where every model sees all items (100% exposure). Among 4,000+ tested models, IRT ranks differ from accuracy ranks: models with the same accuracy get different IRT scores, and 23-31% of all models shift by more than 10 rank positions. Code and calibrated item banks are available at this https URL.
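Fisher information-guided item selection can be illustrated with a small sketch. It assumes a two-parameter logistic (2PL) IRT model, which the abstract does not confirm; the item parameters, the function names, and the idea of screening out negative-discrimination items before selection are all hypothetical and are not taken from the ATLAS code.

```python
import math


def p_correct(theta, a, b):
    """2PL probability that a model of ability theta answers the item correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def fisher_information(theta, a, b):
    """Item information under 2PL: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def negative_discrimination(items):
    """Indices of items whose discrimination parameter a is below zero."""
    return [i for i, (a, _b) in enumerate(items) if a < 0]


def select_next_item(theta, items, used):
    """Pick the unused item that is most informative at the current ability estimate."""
    candidates = [i for i in range(len(items)) if i not in used]
    return max(candidates, key=lambda i: fisher_information(theta, *items[i]))


# Hypothetical calibrated item bank: (discrimination a, difficulty b).
items = [(1.0, -2.0), (1.5, 0.1), (-0.4, 0.0), (0.8, 2.0)]
print(select_next_item(0.0, items, used=set()))
print(negative_discrimination(items))
```

Information peaks where item difficulty is close to the current ability estimate and grows with the square of the discrimination, so an adaptive test spends its budget on items that separate models near that ability level.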

Title: Reasoning Up the Instruction Ladder for Controllable Language Models

Authors: Zishuo Zheng, Vidhisha Balachandran, Chan Young Park, Faeze Brahman, Sachin Kumar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.04694
Pdf URL: https://arxiv.org/pdf/2511.04694
Copy Paste: [[2511.04694]] Reasoning Up the Instruction Ladder for Controllable Language Models(https://arxiv.org/abs/2511.04694)
Keywords: language model, llm, prompt
Abstract: As large language model (LLM) based systems take on high-stakes roles in real-world decision-making, they must reconcile competing instructions from multiple sources (e.g., model developers, users, and tools) within a single prompt context. Thus, enforcing an instruction hierarchy (IH) in LLMs, where higher-level directives override lower-priority requests, is critical for the reliability and controllability of LLMs. In this work, we reframe instruction hierarchy resolution as a reasoning task. Specifically, the model must first "think" about the relationship between a given user prompt and higher-priority (system) instructions before generating a response. To enable this capability via training, we construct VerIH, an instruction hierarchy dataset of constraint-following tasks with verifiable answers. This dataset comprises both aligned and conflicting system-user instructions. We show that lightweight reinforcement learning with VerIH effectively transfers general reasoning capabilities of models to instruction prioritization. Our finetuned models achieve consistent improvements on instruction following and instruction hierarchy benchmarks. This reasoning ability also generalizes to safety-critical settings beyond the training distribution. By treating safety issues as resolving conflicts between adversarial user inputs and predefined higher-priority policies, our trained model enhances robustness against jailbreak and prompt injection attacks. These results demonstrate that reasoning over instruction hierarchies provides a practical path to reliable LLMs, where updates to system prompts yield controllable and robust changes in model behavior.

Title: EncouRAGe: Evaluating RAG Local, Fast, and Reliable

Authors: Jan Strich, Adeline Scharfenberg, Chris Biemann, Martin Semmann
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2511.04696
Pdf URL: https://arxiv.org/pdf/2511.04696
Copy Paste: [[2511.04696]] EncouRAGe: Evaluating RAG Local, Fast, and Reliable(https://arxiv.org/abs/2511.04696)
Keywords: language model, llm, retrieval-augmented generation
Abstract: We introduce EncouRAGe, a comprehensive Python framework designed to streamline the development and evaluation of Retrieval-Augmented Generation (RAG) systems using Large Language Models (LLMs) and Embedding Models. EncouRAGe comprises five modular and extensible components: Type Manifest, RAG Factory, Inference, Vector Store, and Metrics, facilitating flexible experimentation and extensible development. The framework emphasizes scientific reproducibility, diverse evaluation metrics, and local deployment, enabling researchers to efficiently assess datasets within RAG workflows. This paper presents implementation details and an extensive evaluation across multiple benchmark datasets, including 25k QA pairs and over 51k documents. Our results show that RAG still underperforms compared to the Oracle Context, while Hybrid BM25 consistently achieves the best results across all four datasets. We further examine the effects of reranking, observing only marginal performance improvements accompanied by higher response latency.

Title: multiMentalRoBERTa: A Fine-tuned Multiclass Classifier for Mental Health Disorder

Authors: K M Sajjadul Islam, John Fields, Praveen Madiraju
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.04698
Pdf URL: https://arxiv.org/pdf/2511.04698
Copy Paste: [[2511.04698]] multiMentalRoBERTa: A Fine-tuned Multiclass Classifier for Mental Health Disorder(https://arxiv.org/abs/2511.04698)
Keywords: language model, prompt
Abstract: The early detection of mental health disorders from social media text is critical for enabling timely support, risk assessment, and referral to appropriate resources. This work introduces multiMentalRoBERTa, a fine-tuned RoBERTa model designed for multiclass classification of common mental health conditions, including stress, anxiety, depression, post-traumatic stress disorder (PTSD), suicidal ideation, and neutral discourse. Drawing on multiple curated datasets, data exploration is conducted to analyze class overlaps, revealing strong correlations between depression and suicidal ideation as well as anxiety and PTSD, while stress emerges as a broad, overlapping category. Comparative experiments with traditional machine learning methods, domain-specific transformers, and prompting-based large language models demonstrate that multiMentalRoBERTa achieves superior performance, with macro F1-scores of 0.839 in the six-class setup and 0.870 in the five-class setup (excluding stress), outperforming both fine-tuned MentalBERT and baseline classifiers. Beyond predictive accuracy, explainability methods, including Layer Integrated Gradients and KeyBERT, are applied to identify lexical cues that drive classification, with a particular focus on distinguishing depression from suicidal ideation. The findings emphasize the effectiveness of fine-tuned transformers for reliable and interpretable detection in sensitive contexts, while also underscoring the importance of fairness, bias mitigation, and human-in-the-loop safety protocols. Overall, multiMentalRoBERTa is presented as a lightweight, robust, and deployable solution for enhancing support in mental health platforms.

Title: Separate the Wheat from the Chaff: Winnowing Down Divergent Views in Retrieval Augmented Generation

Authors: Song Wang, Zihan Chen, Peng Wang, Zhepei Wei, Zhen Tan, Yu Meng, Cong Shen, Jundong Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.04700
Pdf URL: https://arxiv.org/pdf/2511.04700
Copy Paste: [[2511.04700]] Separate the Wheat from the Chaff: Winnowing Down Divergent Views in Retrieval Augmented Generation(https://arxiv.org/abs/2511.04700)
Keywords: language model, llm, retrieval augmented generation, retrieval-augmented generation, agent
Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) by integrating external knowledge sources to address their limitations in accessing up-to-date or specialized information. A natural strategy to increase the likelihood of retrieving relevant information is to expand the number of retrieved documents. However, involving more documents could introduce significant noise, as many documents may be irrelevant or misleading, thereby reducing the overall accuracy of the generated responses. To overcome the challenge associated with handling a larger number of documents, we propose WinnowRAG, a novel RAG framework designed to systematically filter out noisy documents while preserving valuable content -- a process we refer to as winnowing. WinnowRAG operates in two stages: In Stage I, we perform query-aware clustering to group similar documents and form distinct topic clusters. Each cluster is assigned to an LLM agent for generating a unique answer. In Stage II, we perform winnowing, wherein a critic LLM evaluates the outputs of multiple agents and iteratively separates useful documents from noisy ones. To retain useful documents when discarding agents, we propose two strategic merging techniques to ensure that only relevant knowledge is used for generating the final response. Crucially, WinnowRAG is model-agnostic and does not require any model fine-tuning, making it easily adaptable to various tasks. Extensive experiments on various realistic datasets demonstrate the effectiveness of WinnowRAG over state-of-the-art baselines.

Title: Measuring what Matters: Construct Validity in Large Language Model Benchmarks

Authors: Andrew M. Bean, Ryan Othniel Kearns, Angelika Romanou, Franziska Sofia Hafner, Harry Mayne, Jan Batzner, Negar Foroutan, Chris Schmitz, Karolina Korgul, Hunar Batra, Oishi Deb, Emma Beharry, Cornelius Emde, Thomas Foster, Anna Gausen, María Grandury, Simeng Han, Valentin Hofmann, Lujain Ibrahim, Hazel Kim, Hannah Rose Kirk, Fangru Lin, Gabrielle Kaili-May Liu, Lennart Luettgau, Jabez Magomere, Jonathan Rystrøm, Anna Sotnikova, Yushi Yang, Yilun Zhao, Adel Bibi, Antoine Bosselut, Ronald Clark, Arman Cohan, Jakob Foerster, Yarin Gal, Scott A. Hale, Inioluwa Deborah Raji, Christopher Summerfield, Philip H.S. Torr, Cozmin Ududec, Luc Rocher, Adam Mahdi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.04703
Pdf URL: https://arxiv.org/pdf/2511.04703
Copy Paste: [[2511.04703]] Measuring what Matters: Construct Validity in Large Language Model Benchmarks(https://arxiv.org/abs/2511.04703)
Keywords: language model, llm
Abstract: Evaluating large language models (LLMs) is crucial for both assessing their capabilities and identifying safety or robustness issues prior to deployment. Reliably measuring abstract and complex phenomena such as 'safety' and 'robustness' requires strong construct validity, that is, having measures that represent what matters to the phenomenon. With a team of 29 expert reviewers, we conduct a systematic review of 445 LLM benchmarks from leading conferences in natural language processing and machine learning. Across the reviewed articles, we find patterns related to the measured phenomena, tasks, and scoring metrics which undermine the validity of the resulting claims. To address these shortcomings, we provide eight key recommendations and detailed actionable guidance to researchers and practitioners in developing LLM benchmarks.

Title: POLIS-Bench: Towards Multi-Dimensional Evaluation of LLMs for Bilingual Policy Tasks in Governmental Scenarios

Authors: Tingyue Yang, Junchi Yao, Yuhui Guo, Chang Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.04705
Pdf URL: https://arxiv.org/pdf/2511.04705
Copy Paste: [[2511.04705]] POLIS-Bench: Towards Multi-Dimensional Evaluation of LLMs for Bilingual Policy Tasks in Governmental Scenarios(https://arxiv.org/abs/2511.04705)
Keywords: llm
Abstract: We introduce POLIS-Bench, the first rigorous, systematic evaluation suite designed for LLMs operating in governmental bilingual policy scenarios. Compared to existing benchmarks, POLIS-Bench introduces three major advancements. (i) Up-to-date Bilingual Corpus: We construct an extensive, up-to-date policy corpus that significantly scales the effective assessment sample size, ensuring relevance to current governance practice. (ii) Scenario-Grounded Task Design: We distill three specialized, scenario-grounded tasks -- Clause Retrieval & Interpretation, Solution Generation, and Compliance Judgment -- to comprehensively probe model understanding and application. (iii) Dual-Metric Evaluation Framework: We establish a novel dual-metric evaluation framework combining semantic similarity with accuracy rate to precisely measure both content alignment and task requirement adherence. A large-scale evaluation of over 10 state-of-the-art LLMs on POLIS-Bench reveals a clear performance hierarchy where reasoning models maintain superior cross-task stability and accuracy, highlighting the difficulty of compliance tasks. Furthermore, leveraging our benchmark, we successfully fine-tune a lightweight open-source model. The resulting POLIS series models achieve parity with, or surpass, strong proprietary baselines on multiple policy subtasks at a significantly reduced cost, providing a cost-effective and compliant path for robust real-world governmental deployment.

Title: GEMMA-SQL: A Novel Text-to-SQL Model Based on Large Language Models

Authors: Hari Mohan Pandey, Anshul Gupta, Subham Sarkar, Minakshi Tomer, Schneider Johannes, Yan Gong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.04710
Pdf URL: https://arxiv.org/pdf/2511.04710
Copy Paste: [[2511.04710]] GEMMA-SQL: A Novel Text-to-SQL Model Based on Large Language Models(https://arxiv.org/abs/2511.04710)
Keywords: language model, llm, prompt
Abstract: Text-to-SQL systems enable users to interact with structured databases using natural language, eliminating the need for specialized programming knowledge. In this work, we introduce GEMMA-SQL, a lightweight and efficient text-to-SQL model built upon the open-source Gemma 2B architecture. Unlike many large language models (LLMs), GEMMA-SQL is fine-tuned in a resource-efficient, iterative manner and can be deployed on low-cost hardware. Leveraging the SPIDER benchmark for training and evaluation, GEMMA-SQL combines multiple prompting strategies, including few-shot learning, to enhance SQL query generation accuracy. The instruction-tuned variant, GEMMA-SQL Instruct, achieves 66.8% Test-Suite accuracy and 63.3% Exact Set Match accuracy, outperforming several state-of-the-art baselines such as IRNet, RYANSQL, and CodeXDavinci. The proposed approach demonstrates that effective prompt design and targeted instruction tuning can significantly boost performance while maintaining high scalability and adaptability. These results position GEMMA-SQL as a practical, open-source alternative for robust and accessible text-to-SQL systems.

Title: First is Not Really Better Than Last: Evaluating Layer Choice and Aggregation Strategies in Language Model Data Influence Estimation

Authors: Dmytro Vitel, Anshuman Chhabra
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.04715
Pdf URL: https://arxiv.org/pdf/2511.04715
Copy Paste: [[2511.04715]] First is Not Really Better Than Last: Evaluating Layer Choice and Aggregation Strategies in Language Model Data Influence Estimation(https://arxiv.org/abs/2511.04715)
Keywords: language model, llm
Abstract: Identifying how training samples influence Large Language Model (LLM) decision-making is essential for effectively interpreting model decisions and auditing large-scale datasets. Current training sample influence estimation methods (also known as influence functions) undertake this goal by utilizing information flow through the model via its first-order and higher-order gradient terms. However, because today's models consist of billions of parameters, these influence computations are often restricted to some subset of model layers to ensure computational feasibility. Prior seminal work by Yeh et al. (2022) in assessing which layers are best suited for computing language data influence concluded that the first (embedding) layers are the most informative for this purpose, using a hypothesis based on influence scores canceling out (i.e., the cancellation effect). In this work, we provide theoretical and empirical evidence that the cancellation effect is unreliable, and that middle attention layers are better estimators for influence. Furthermore, we address the broader challenge of aggregating influence scores across layers, and showcase how alternatives to standard averaging (such as ranking and vote-based methods) can lead to significantly improved performance. Finally, we propose better methods for evaluating influence score efficacy in LLMs without undertaking model retraining, and propose a new metric known as the Noise Detection Rate (NDR) that exhibits strong predictive capability compared to the cancellation effect. Through extensive experiments across LLMs of varying types and scales, we concretely determine that the first (layers) are not necessarily better than the last (layers) for LLM influence estimation, contrasting with prior knowledge in the field.
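To make the contrast between standard averaging and the ranking and vote-based alternatives concrete, here is a generic sketch. It is not the paper's procedure: the function names, the average-rank rule, the top-k voting rule, and the toy scores are all assumptions chosen for illustration.

```python
def mean_aggregate(layer_scores):
    """Average each sample's raw influence score across layers."""
    n = len(layer_scores[0])
    return [sum(layer[i] for layer in layer_scores) / len(layer_scores) for i in range(n)]


def rank_aggregate(layer_scores):
    """Average each sample's within-layer rank (0 = least influential)."""
    n = len(layer_scores[0])
    totals = [0.0] * n
    for layer in layer_scores:
        order = sorted(range(n), key=lambda i: layer[i])
        for rank, i in enumerate(order):
            totals[i] += rank
    return [t / len(layer_scores) for t in totals]


def vote_aggregate(layer_scores, k):
    """Each layer casts one vote for each of its top-k samples."""
    n = len(layer_scores[0])
    votes = [0] * n
    for layer in layer_scores:
        top = sorted(range(n), key=lambda i: layer[i], reverse=True)[:k]
        for i in top:
            votes[i] += 1
    return votes


# Toy scores: rows are layers, columns are training samples.
# The third layer has a much larger scale and a single outlier.
layers = [
    [0.1, 0.2, 0.3],
    [0.2, 0.3, 0.4],
    [100.0, 0.0, 0.1],
]
print(mean_aggregate(layers))
print(rank_aggregate(layers))
print(vote_aggregate(layers, k=1))
```

In this toy case the raw mean is dominated by the one large-scale layer and picks sample 0, while rank and vote aggregation both pick sample 2, which two of the three layers agree on. Scale invariance of this kind is one reason rank-based aggregation may behave differently from averaging.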

Title: Learning to reason about rare diseases through retrieval-augmented agents

Authors: Ha Young Kim, Jun Li, Ana Beatriz Solana, Carolin M. Pirkl, Benedikt Wiestler, Julia A. Schnabel, Cosmin I. Bercea
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.04720
Pdf URL: https://arxiv.org/pdf/2511.04720
Copy Paste: [[2511.04720]] Learning to reason about rare diseases through retrieval-augmented agents(https://arxiv.org/abs/2511.04720)
Keywords: language model, agent
Abstract: Rare diseases represent the long tail of medical imaging, where AI models often fail due to the scarcity of representative training data. In clinical workflows, radiologists frequently consult case reports and literature when confronted with unfamiliar findings. Following this line of reasoning, we introduce RADAR, Retrieval Augmented Diagnostic Reasoning Agents, an agentic system for rare disease detection in brain MRI. Our approach gives AI agents access to external medical knowledge by embedding both case reports and literature using sentence transformers and indexing them with FAISS to enable efficient similarity search. The agent retrieves clinically relevant evidence to guide diagnostic decision making on unseen diseases, without the need for additional training. Designed as a model-agnostic reasoning module, RADAR can be seamlessly integrated with diverse large language models, consistently improving their rare pathology recognition and interpretability. On the NOVA dataset comprising 280 distinct rare diseases, RADAR achieves up to a 10.2% performance gain, with the strongest improvements observed for open-source models such as DeepSeek. Beyond accuracy, the retrieved examples provide interpretable, literature-grounded explanations, highlighting retrieval-augmented reasoning as a powerful paradigm for low-prevalence conditions in medical imaging.

Title: Surprisal reveals diversity gaps in image captioning and different scorers change the story

Authors: Nikolai Ilinykh, Simon Dobnik
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.04754
Pdf URL: https://arxiv.org/pdf/2511.04754
Copy Paste: [[2511.04754]] Surprisal reveals diversity gaps in image captioning and different scorers change the story(https://arxiv.org/abs/2511.04754)
Keywords: language model, llm
Abstract: We quantify linguistic diversity in image captioning with surprisal variance: the spread of token-level negative log-probabilities within a caption set. On the MSCOCO test set, we compare five state-of-the-art vision-and-language LLMs, decoded with greedy and nucleus sampling, to human captions. Measured with a caption-trained n-gram LM, humans display roughly twice the surprisal variance of models, but rescoring the same captions with a general-language model reverses the pattern. Our analysis introduces the surprisal-based diversity metric for image captioning. We show that relying on a single scorer can completely invert conclusions; thus, robust diversity evaluation must report surprisal under several scorers.
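Surprisal variance as described above can be sketched in a few lines. The sketch assumes that a scorer has already produced per-token probabilities for each caption, that surprisal is measured in bits, and that tokens are pooled across the caption set before taking a population variance; none of these choices is confirmed by the abstract, and the function names are hypothetical.

```python
import math
from statistics import pvariance


def surprisals(token_probs):
    """Token-level surprisal in bits: -log2 p(token | context)."""
    return [-math.log2(p) for p in token_probs]


def surprisal_variance(caption_set):
    """Population variance of surprisal, pooled over all tokens in the set."""
    pooled = [s for caption in caption_set for s in surprisals(caption)]
    return pvariance(pooled)


# Hypothetical per-token probabilities assigned by one scorer.
uniform_set = [[0.5, 0.5], [0.5]]    # every token equally predictable
varied_set = [[0.5, 0.25]]           # surprisals of 1 bit and 2 bits
print(surprisal_variance(uniform_set))
print(surprisal_variance(varied_set))
```

Because the probabilities come from whichever language model does the scoring, the same captions can yield a different variance, and a different ranking of systems, under a different scorer. Running the function once per scorer is the simplest way to report the metric under several of them.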

Title: Explore Data Left Behind in Reinforcement Learning for Reasoning Language Models

Authors: Chenxi Liu, Junjie Liang, Yuqi Jia, Bochuan Cao, Yang Bai, Heng Huang, Xun Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.04800
Pdf URL: https://arxiv.org/pdf/2511.04800
Copy Paste: [[2511.04800]] Explore Data Left Behind in Reinforcement Learning for Reasoning Language Models(https://arxiv.org/abs/2511.04800)
Keywords: language model, llm, prompt
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach for improving the reasoning abilities of large language models (LLMs). The Group Relative Policy Optimization (GRPO) family has demonstrated strong performance in training LLMs with RLVR. However, as models train longer and scale larger, more training prompts become residual prompts, those with zero variance rewards that provide no training signal. Consequently, fewer prompts contribute to training, reducing diversity and hindering effectiveness. To fully exploit these residual prompts, we propose the Explore Residual Prompts in Policy Optimization (ERPO) framework, which encourages exploration on residual prompts and reactivates their training signals. ERPO maintains a history tracker for each prompt and adaptively increases the sampling temperature for residual prompts that previously produced all correct responses. This encourages the model to generate more diverse reasoning traces, introducing incorrect responses that revive training signals. Empirical results on the Qwen2.5 series demonstrate that ERPO consistently surpasses strong baselines across multiple mathematical reasoning benchmarks.
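Why a zero-variance prompt gives no training signal, and how a history-based temperature bump might look, can be sketched as follows. The advantage formula is the usual group-standardized form associated with GRPO-style methods; the temperature schedule (base value, step size, cap, and the "consecutive all-correct visits" rule) is a hypothetical stand-in, not ERPO's actual rule.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: rewards standardized within one prompt's samples."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_residual(rewards):
    """A prompt is residual when all sampled responses receive the same reward."""
    return len(set(rewards)) == 1


def next_temperature(history, base=1.0, step=0.1, max_temp=1.5):
    """Hypothetical schedule: raise temperature once per consecutive all-correct visit."""
    streak = 0
    for rewards in reversed(history):
        if all(r == 1.0 for r in rewards):
            streak += 1
        else:
            break
    return min(base + step * streak, max_temp)


all_correct = [1.0, 1.0, 1.0, 1.0]
print(group_advantages(all_correct))   # every advantage is zero
print(is_residual(all_correct))
print(next_temperature([all_correct, all_correct]))
```

With identical rewards every advantage is zero, so the policy gradient for that prompt vanishes. Sampling at a higher temperature makes an incorrect response more likely, which restores reward variance within the group.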

Title: Trained on Tokens, Calibrated on Concepts: The Emergence of Semantic Calibration in LLMs

Authors: Preetum Nakkiran, Arwen Bradley, Adam Goliński, Eugene Ndiaye, Michael Kirchhof, Sinead Williamson
Subjects: cs.CL, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2511.04869
Pdf URL: https://arxiv.org/pdf/2511.04869
Copy Paste: [[2511.04869]] Trained on Tokens, Calibrated on Concepts: The Emergence of Semantic Calibration in LLMs(https://arxiv.org/abs/2511.04869)
Keywords: language model, llm, chain-of-thought
Abstract: Large Language Models (LLMs) often lack meaningful confidence estimates for their outputs. While base LLMs are known to exhibit next-token calibration, it remains unclear whether they can assess confidence in the actual meaning of their responses beyond the token level. We find that, when using a certain sampling-based notion of semantic calibration, base LLMs are remarkably well-calibrated: they can meaningfully assess confidence in open-domain question-answering tasks, despite not being explicitly trained to do so. Our main theoretical contribution establishes a mechanism for why semantic calibration emerges as a byproduct of next-token prediction, leveraging a recent connection between calibration and local loss optimality. The theory relies on a general definition of "B-calibration," which is a notion of calibration parameterized by a choice of equivalence classes (semantic or otherwise). This theoretical mechanism leads to a testable prediction: base LLMs will be semantically calibrated when they can easily predict their own distribution over semantic answer classes before generating a response. We state three implications of this prediction, which we validate through experiments: (1) Base LLMs are semantically calibrated across question-answering tasks, (2) RL instruction-tuning systematically breaks this calibration, and (3) chain-of-thought reasoning breaks calibration. To our knowledge, our work provides the first principled explanation of when and why semantic calibration emerges in LLMs.
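One plausible reading of a sampling-based semantic confidence is sketched below: draw several answers to the same question, collapse them into equivalence classes, and take the share of the most common class as the confidence. The abstract does not give the paper's exact definition, so this is an assumption; the `semantic_class` normalizer in particular is a crude hypothetical stand-in for whatever equivalence relation is actually used.

```python
from collections import Counter


def semantic_class(answer):
    """Hypothetical equivalence class: lowercase and keep only alphanumerics."""
    return "".join(ch for ch in answer.lower() if ch.isalnum())


def semantic_confidence(samples):
    """Return the most common semantic class and its empirical frequency."""
    counts = Counter(semantic_class(s) for s in samples)
    label, n = counts.most_common(1)[0]
    return label, n / len(samples)


# Hypothetical sampled answers to one question.
samples = ["Paris", "paris.", " Paris ", "Lyon"]
print(semantic_confidence(samples))
```

Under this reading, a model is semantically calibrated if, among questions where the confidence is about 0.75, the majority-class answer is correct about 75% of the time.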

Title: Minimal and Mechanistic Conditions for Behavioral Self-Awareness in LLMs

Authors: Matthew Bozoukov, Matthew Nguyen, Shubkarman Singh, Bart Bussmann, Patrick Leask
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.04875
Pdf URL: https://arxiv.org/pdf/2511.04875
Copy Paste: [[2511.04875]] Minimal and Mechanistic Conditions for Behavioral Self-Awareness in LLMs(https://arxiv.org/abs/2511.04875)
Keywords: llm
Abstract: Recent studies have revealed that LLMs can exhibit behavioral self-awareness: the ability to accurately describe or predict their own learned behaviors without explicit supervision. This capability raises safety concerns as it may, for example, allow models to better conceal their true abilities during evaluation. We attempt to characterize the minimal conditions under which such self-awareness emerges, and the mechanistic processes through which it manifests. Through controlled finetuning experiments on instruction-tuned LLMs with low-rank adapters (LoRA), we find: (1) that self-awareness can be reliably induced using a single rank-1 LoRA adapter; (2) that the learned self-aware behavior can be largely captured by a single steering vector in activation space, recovering nearly all of the fine-tune's behavioral effect; and (3) that self-awareness is non-universal and domain-localized, with independent representations across tasks. Together, these findings suggest that behavioral self-awareness emerges as a domain-specific, linear feature that can be easily induced and modulated.
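Findings (1) and (2) in this abstract are related by simple linear algebra: a rank-1 adapter changes a layer's output by a scalar multiple of one fixed vector, which resembles adding a steering vector whose strength depends on the input. The sketch below illustrates that identity with plain Python lists. The function names and the `scale` parameter are illustrative assumptions, not the authors' code, and the abstract does not say how its steering vector was obtained.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank1_lora_delta(b, a, x, scale=1.0):
    """Output change from a rank-1 adapter: (scale * b a^T) x = scale * (a . x) * b."""
    coeff = scale * dot(a, x)
    return [coeff * bi for bi in b]


def apply_steering(hidden, direction, strength):
    """Add a fixed direction to a hidden state, as in activation steering."""
    return [h + strength * d for h, d in zip(hidden, direction)]
```

Whatever the input `x`, `rank1_lora_delta` always returns a vector parallel to `b`; only the coefficient changes.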

Title: SDS KoPub VDR: A Benchmark Dataset for Visual Document Retrieval in Korean Public Documents

Authors: Jaehoon Lee, Sohyun Kim, Wanggeun Park, Geon Lee, Seungkyung Kim, Minyoung Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.04910
Pdf URL: https://arxiv.org/pdf/2511.04910
Copy Paste: [[2511.04910]] SDS KoPub VDR: A Benchmark Dataset for Visual Document Retrieval in Korean Public Documents(https://arxiv.org/abs/2511.04910)
Keywords: gpt
Abstract: Existing benchmarks for visual document retrieval (VDR) largely overlook non-English languages and the structural complexity of official publications. To address this critical gap, we introduce SDS KoPub VDR, the first large-scale, publicly available benchmark for retrieving and understanding Korean public documents. The benchmark is built upon a corpus of 361 real-world documents (40,781 pages), including 256 files under the KOGL Type 1 license and 105 from official legal portals, capturing complex visual elements like tables, charts, and multi-column layouts. To establish a challenging and reliable evaluation set, we constructed 600 query-page-answer triples. These were initially generated using multimodal models (e.g., GPT-4o) and subsequently underwent a rigorous human verification and refinement process to ensure factual accuracy and contextual relevance. The queries span six major public domains and are systematically categorized by the reasoning modality required: text-based, visual-based (e.g., chart interpretation), and cross-modal. We evaluate SDS KoPub VDR on two complementary tasks that reflect distinct retrieval paradigms: (1) text-only retrieval, which measures a model's ability to locate relevant document pages based solely on textual signals, and (2) multimodal retrieval, which assesses retrieval performance when visual features (e.g., tables, charts, and layouts) are jointly leveraged alongside text. This dual-task evaluation reveals substantial performance gaps, particularly in multimodal scenarios requiring cross-modal reasoning, even for state-of-the-art models. As a foundational resource, SDS KoPub VDR not only enables rigorous and fine-grained evaluation across textual and multimodal retrieval tasks but also provides a clear roadmap for advancing multimodal AI in complex, real-world document intelligence.

Title: BudgetMem: Learning Selective Memory Policies for Cost-Efficient Long-Context Processing in Language Models

Authors: Chandra Vamsi Krishna Alla, Harish Naidu Gaddam, Manohar Kommi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.04919
Pdf URL: https://arxiv.org/pdf/2511.04919
Copy Paste: [[2511.04919]] BudgetMem: Learning Selective Memory Policies for Cost-Efficient Long-Context Processing in Language Models(https://arxiv.org/abs/2511.04919)
Keywords: language model, llm, long context, retrieval augmented generation
Abstract: Large Language Models (LLMs) face significant computational and memory constraints when processing long contexts, despite growing demand for applications requiring reasoning over extensive documents, multi-session dialogues, and book-length texts. While recent advances have extended context windows to 100K-1M tokens, such approaches incur prohibitive costs for resource-constrained deployments. We propose BudgetMem, a novel memory-augmented architecture that learns what to remember rather than remembering everything. Our system combines selective memory policies with feature-based salience scoring (entity density, TF-IDF, discourse markers, position bias) to decide which information merits storage under strict budget constraints. Unlike existing retrieval-augmented generation (RAG) systems that store all chunks, BudgetMem employs learned gating mechanisms coupled with BM25 sparse retrieval for efficient information access. Through comprehensive experiments on 700 question-answer pairs across short (237 tokens) and long (5K-10K tokens) documents with Llama-3.2-3B-Instruct, we demonstrate that BudgetMem achieves remarkable results on long documents: only 1.0% F1 score degradation while saving 72.4% memory compared to baseline RAG. We validate our approach through budget sensitivity analysis (testing 7 budget ratios), naive baseline comparisons, and document length analysis, showing that BudgetMem's benefits increase with document length. Our work provides a practical pathway for deploying capable long-context systems on modest hardware, democratizing access to advanced language understanding capabilities.
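The selection step described in this abstract (score each chunk for salience, then store only what fits a budget) can be sketched roughly as follows. Everything here is an assumption made for illustration: the feature proxies, the fixed weights, and the marker list are hypothetical. The paper's learned gating, TF-IDF feature, and BM25 retrieval are not reproduced.

```python
import math

# Illustrative marker list, not taken from the paper.
DISCOURSE_MARKERS = ("however", "therefore", "in conclusion", "importantly")


def entity_density(chunk):
    # Crude proxy: fraction of capitalized tokens (an assumption, not an NER model).
    tokens = chunk.split()
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t[:1].isupper()) / len(tokens)


def marker_score(chunk):
    # 1.0 if any discourse marker appears in the chunk, else 0.0.
    text = chunk.lower()
    return float(any(m in text for m in DISCOURSE_MARKERS))


def position_bias(index, total):
    # Favor chunks near the start or end of the document.
    if total <= 1:
        return 1.0
    return abs(index / (total - 1) - 0.5) * 2


def salience(chunk, index, total, weights=(0.4, 0.3, 0.3)):
    # Fixed hypothetical weights stand in for a learned gate.
    w_entity, w_marker, w_position = weights
    return (w_entity * entity_density(chunk)
            + w_marker * marker_score(chunk)
            + w_position * position_bias(index, total))


def select_under_budget(chunks, budget_ratio):
    """Keep the highest-salience chunks, at most budget_ratio of them, in document order."""
    keep = max(1, math.floor(len(chunks) * budget_ratio))
    ranked = sorted(range(len(chunks)),
                    key=lambda i: salience(chunks[i], i, len(chunks)),
                    reverse=True)
    return [chunks[i] for i in sorted(ranked[:keep])]
```

With a budget ratio of 0.5, half of the chunks are dropped before any retrieval index is built, which is where the memory saving in such a scheme would come from.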

Title: AgentExpt: Automating AI Experiment Design with LLM-based Resource Retrieval Agent

Authors: Yu Li, Lehui Li, Qingmin Liao, Fengli Xu, Yong Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.04921
Pdf URL: https://arxiv.org/pdf/2511.04921
Copy Paste: [[2511.04921]] AgentExpt: Automating AI Experiment Design with LLM-based Resource Retrieval Agent(https://arxiv.org/abs/2511.04921)
Keywords: language model, llm, agent
Abstract: Large language model agents are becoming increasingly capable at web-centric tasks such as information retrieval and complex reasoning. These emerging capabilities have given rise to a surge of research interest in developing LLM agents to facilitate scientific inquiry. One key application in AI research is to automate experiment design through agentic dataset and baseline retrieval. However, prior efforts suffer from limited data coverage, as recommendation datasets primarily harvest candidates from public portals and omit many datasets actually used in published papers, and from an overreliance on content similarity that biases models toward superficial similarity and overlooks experimental suitability. Harnessing collective perception embedded in the baseline and dataset citation network, we present a comprehensive framework for baseline and dataset recommendation. First, we design an automated data-collection pipeline that links roughly one hundred thousand accepted papers to the baselines and datasets they actually used. Second, we propose a collective perception enhanced retriever. To represent the position of each dataset or baseline within the scholarly network, it concatenates self-descriptions with aggregated citation contexts. To achieve efficient candidate recall, we finetune an embedding model on these representations. Finally, we develop a reasoning-augmented reranker that extracts interaction chains to construct explicit reasoning chains and finetunes a large language model to produce interpretable justifications and refined rankings. The dataset we curated covers 85% of the datasets and baselines used at top AI conferences over the past five years. On our dataset, the proposed method outperforms the strongest prior baseline with average gains of +5.85% in Recall@20 and +8.30% in HitRate@5. Taken together, our results advance reliable, interpretable automation of experimental design.

Title: LoPT: Lossless Parallel Tokenization Acceleration for Long Context Inference of Large Language Model

Authors: Wei Shao, Lingchao Zheng, Pengyu Wang, Peizhen Zheng, Jun Li, Yuwei Fan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.04952
Pdf URL: https://arxiv.org/pdf/2511.04952
Copy Paste: [[2511.04952]] LoPT: Lossless Parallel Tokenization Acceleration for Long Context Inference of Large Language Model(https://arxiv.org/abs/2511.04952)
Keywords: language model, long context
Abstract: Long context inference scenarios have become increasingly important for large language models, yet they introduce significant computational latency. While prior research has optimized long-sequence inference through operators, model architectures, and system frameworks, tokenization remains an overlooked bottleneck. Existing parallel tokenization methods accelerate processing through text segmentation and multi-process tokenization, but they suffer from inconsistent results due to boundary artifacts that occur after merging. To address this, we propose LoPT, a novel Lossless Parallel Tokenization framework that ensures output identical to standard sequential tokenization. Our approach employs character-position-based matching and dynamic chunk length adjustment to align and merge tokenized segments accurately. Extensive experiments across diverse long-text datasets demonstrate that LoPT achieves significant speedup while guaranteeing lossless tokenization. We also provide theoretical proof of consistency and comprehensive analytical studies to validate the robustness of our method.
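The boundary artifact this abstract refers to can be reproduced with a toy tokenizer. The sketch below is not LoPT and does not implement its matching or chunk-length adjustment. It only shows, under an assumed five-entry vocabulary and a greedy longest-match rule, why tokenizing fixed-size chunks independently can disagree with sequential tokenization.

```python
VOCAB = {"a", "b", "c", "ab", "bc"}  # toy vocabulary, purely illustrative


def tokenize(text, vocab=VOCAB):
    """Greedy longest-match tokenization (a toy stand-in for a real tokenizer)."""
    max_len = max(len(t) for t in vocab)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # No vocabulary entry matches at this position.
            raise ValueError(f"cannot tokenize at position {i}")
    return tokens


def naive_chunked_tokenize(text, chunk_len):
    """Tokenize fixed-size chunks independently and concatenate the results."""
    tokens = []
    for start in range(0, len(text), chunk_len):
        tokens.extend(tokenize(text[start:start + chunk_len]))
    return tokens
```

For the text "abcab", sequential tokenization ends with the token "ab", while chunks of length 4 cut that token in half and yield "a" and "b" instead. Chunks of length 3 happen to fall on a token boundary and agree with the sequential result, which is why a lossless scheme has to reason about where boundaries land.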

Title: Too Good to be Bad: On the Failure of LLMs to Role-Play Villains

Authors: Zihao Yi, Qingxuan Jiang, Ruotian Ma, Xingyu Chen, Qu Yang, Mengru Wang, Fanghua Ye, Ying Shen, Zhaopeng Tu, Xiaolong Li, Linus
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.04962
Pdf URL: https://arxiv.org/pdf/2511.04962
Copy Paste: [[2511.04962]] Too Good to be Bad: On the Failure of LLMs to Role-Play Villains(https://arxiv.org/abs/2511.04962)
Keywords: language model, llm, chat
Abstract: Large Language Models (LLMs) are increasingly tasked with creative generation, including the simulation of fictional characters. However, their ability to portray non-prosocial, antagonistic personas remains largely unexamined. We hypothesize that the safety alignment of modern LLMs creates a fundamental conflict with the task of authentically role-playing morally ambiguous or villainous characters. To investigate this, we introduce the Moral RolePlay benchmark, a new dataset featuring a four-level moral alignment scale and a balanced test set for rigorous evaluation. We task state-of-the-art LLMs with role-playing characters from moral paragons to pure villains. Our large-scale evaluation reveals a consistent, monotonic decline in role-playing fidelity as character morality decreases. We find that models struggle most with traits directly antithetical to safety principles, such as "Deceitful" and "Manipulative", often substituting nuanced malevolence with superficial aggression. Furthermore, we demonstrate that general chatbot proficiency is a poor predictor of villain role-playing ability, with highly safety-aligned models performing particularly poorly. Our work provides the first systematic evidence of this critical limitation, highlighting a key tension between model safety and creative fidelity. Our benchmark and findings pave the way for developing more nuanced, context-aware alignment methods.

Title: Acquiring Common Chinese Emotional Events Using Large Language Model

Authors: Ya Wang, Guangzheng Zhu, Cungen Cao, Jingjing Li, He Li, Xin Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.04989
Pdf URL: https://arxiv.org/pdf/2511.04989
Copy Paste: [[2511.04989]] Acquiring Common Chinese Emotional Events Using Large Language Model(https://arxiv.org/abs/2511.04989)
Keywords: language model, llm, prompt
Abstract: Knowledge about emotional events is an important kind of knowledge which has been applied to improve the effectiveness of different applications. However, emotional events cannot be easily acquired, especially common or generalized emotional events that are context-independent. The goal of this paper is to obtain common emotional events in Chinese, such as "win a prize" and "be criticized". Our approach begins by collecting a comprehensive list of Chinese emotional event indicators. Then, we generate emotional events by prompting a Chinese large language model (LLM) using these indicators. To ensure the quality of these emotional events, we train a filter to discard invalid generated results. We also classify these emotional events as positive or negative using different techniques. Finally, we harvest a total of 102,218 high-quality common emotional events with sentiment polarity labels, which is the only large-scale commonsense knowledge base of emotional events in Chinese. Intrinsic evaluation results show that the proposed method can be effectively used to acquire common Chinese emotional events. An extrinsic use case also demonstrates the strong potential of common emotional events in the field of emotion cause extraction (ECE). Related resources including emotional event indicators and emotional events will be released after the publication of this paper.

Title: Pluralistic Behavior Suite: Stress-Testing Multi-Turn Adherence to Custom Behavioral Policies

Authors: Prasoon Varshney, Makesh Narsimhan Sreedhar, Liwei Jiang, Traian Rebedea, Christopher Parisien
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.05018
Pdf URL: https://arxiv.org/pdf/2511.05018
Copy Paste: [[2511.05018]] Pluralistic Behavior Suite: Stress-Testing Multi-Turn Adherence to Custom Behavioral Policies(https://arxiv.org/abs/2511.05018)
Keywords: language model, llm
Abstract: Large language models (LLMs) are typically aligned to a universal set of safety and usage principles intended for broad public acceptability. Yet, real-world applications of LLMs often take place within organizational ecosystems shaped by distinctive corporate policies, regulatory requirements, use cases, brand guidelines, and ethical commitments. This reality highlights the need for rigorous and comprehensive evaluation of LLMs with pluralistic alignment goals, an alignment paradigm that emphasizes adaptability to diverse user values and needs. In this work, we present PLURALISTIC BEHAVIOR SUITE (PBSUITE), a dynamic evaluation suite designed to systematically assess LLMs' capacity to adhere to pluralistic alignment specifications in multi-turn, interactive conversations. PBSUITE consists of (1) a diverse dataset of 300 realistic LLM behavioral policies, grounded in 30 industries; and (2) a dynamic evaluation framework for stress-testing model compliance with custom behavioral specifications under adversarial conditions. Using PBSUITE, we find that leading open- and closed-source LLMs maintain robust adherence to behavioral policies in single-turn settings (less than 4% failure rates), but their compliance weakens substantially in multi-turn adversarial interactions (up to 84% failure rates). These findings highlight that existing model alignment and safety moderation methods fall short in coherently enforcing pluralistic behavioral policies in real-world LLM interactions. Our work contributes both the dataset and analytical framework to support future research toward robust and context-aware pluralistic alignment techniques.

Title: UA-Code-Bench: A Competitive Programming Benchmark for Evaluating LLM Code Generation in Ukrainian

Authors: Mykyta Syromiatnikov, Victoria Ruvinskaya
Subjects: cs.CL, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2511.05040
Pdf URL: https://arxiv.org/pdf/2511.05040
Copy Paste: [[2511.05040]] UA-Code-Bench: A Competitive Programming Benchmark for Evaluating LLM Code Generation in Ukrainian(https://arxiv.org/abs/2511.05040)
Keywords: language model, gpt, llm, prompt
Abstract: Evaluating the real capabilities of large language models in low-resource languages still represents a challenge, as many existing benchmarks focus on widespread tasks translated from English or evaluate only simple language understanding. This paper introduces UA-Code-Bench, a new open-source benchmark established for a thorough evaluation of language models' code generation and competitive programming problem-solving abilities in Ukrainian. The benchmark comprises 500 problems from the Eolymp platform, evenly distributed across five complexity levels from very easy to very hard. A diverse set of 13 leading proprietary and open-source models, generating Python solutions based on a one-shot prompt, was evaluated via the dedicated Eolymp environment against hidden tests, ensuring code correctness. The obtained results reveal that even top-performing models, such as OpenAI o3 and GPT-5, solve only half of the problems, highlighting the challenge of code generation in low-resource natural language. Furthermore, this research presents a comprehensive analysis of performance across various difficulty levels, as well as an assessment of solution uniqueness and computational efficiency, measured by both elapsed time and memory consumption of the generated solutions. In conclusion, this work demonstrates the value of competitive programming benchmarks in evaluating large language models, especially in underrepresented languages. It also paves the way for future research on multilingual code generation and reasoning-enhanced models. The benchmark, data parsing, preparation, code generation, and evaluation scripts are available at this https URL.

Title: Order-Level Attention Similarity Across Language Models: A Latent Commonality

Authors: Jinglin Liang, Jin Zhong, Shuangping Huang, Yunqing Hu, Huiyuan Zhang, Huifang Li, Lixin Fan, Hanlin Gu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.05064
Pdf URL: https://arxiv.org/pdf/2511.05064
Copy Paste: [[2511.05064]] Order-Level Attention Similarity Across Language Models: A Latent Commonality(https://arxiv.org/abs/2511.05064)
Keywords: language model
Abstract: In this paper, we explore an important yet previously neglected question: Do context aggregation patterns across Language Models (LMs) share commonalities? While some works have investigated context aggregation or attention weights in LMs, they typically focus on individual models or attention heads, lacking a systematic analysis across multiple LMs to explore their commonalities. In contrast, we focus on the commonalities among LMs, which can deepen our understanding of LMs and even facilitate cross-model knowledge transfer. In this work, we introduce the Order-Level Attention (OLA) derived from the order-wise decomposition of Attention Rollout and reveal that the OLA at the same order across LMs exhibits significant similarities. Furthermore, we discover an implicit mapping between OLA and syntactic knowledge. Based on these two findings, we propose the Transferable OLA Adapter (TOA), a training-free cross-LM adapter transfer method. Specifically, we treat the OLA as a unified syntactic feature representation and train an adapter that takes OLA as input. Due to the similarities in OLA across LMs, the adapter generalizes to unseen LMs without requiring any parameter updates. Extensive experiments demonstrate that TOA's cross-LM generalization effectively enhances the performance of unseen LMs. Code is available at this https URL.

Title: On Text Simplification Metrics and General-Purpose LLMs for Accessible Health Information, and A Potential Architectural Advantage of The Instruction-Tuned LLM class

Authors: P. Bilha Githinji, Aikaterini Meilliou, Peiwu Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.05080
Pdf URL: https://arxiv.org/pdf/2511.05080
Copy Paste: [[2511.05080]] On Text Simplification Metrics and General-Purpose LLMs for Accessible Health Information, and A Potential Architectural Advantage of The Instruction-Tuned LLM class(https://arxiv.org/abs/2511.05080)
Keywords: language model, llm
Abstract: The increasing health-seeking behavior and digital consumption of biomedical information by the general public necessitate scalable solutions for automatically adapting complex scientific and technical documents into plain language. Automatic text simplification solutions, including advanced large language models, however, continue to face challenges in reliably arbitrating the tension between optimizing readability performance and ensuring preservation of discourse fidelity. This report empirically assesses the performance of two major classes of general-purpose LLMs, demonstrating their linguistic capabilities and foundational readiness for the task compared to a human benchmark. Using a comparative analysis of the instruction-tuned Mistral 24B and the reasoning-augmented QWen2.5 32B, we identify a potential architectural advantage in the instruction-tuned LLM. Mistral exhibits a tempered lexical simplification strategy that enhances readability across a suite of metrics and the simplification-specific formula SARI (mean 42.46), while preserving human-level discourse with a BERTScore of 0.91. QWen also attains enhanced readability performance, but its operational strategy shows a disconnect in balancing between readability and accuracy, reaching a statistically significantly lower BERTScore of 0.89. Additionally, a comprehensive correlation analysis of 21 metrics spanning readability, discourse fidelity, content safety, and underlying distributional measures for mechanistic insights, confirms strong functional redundancies among five readability indices. This empirical evidence tracks baseline performance of the evolving LLMs for the task of text simplification, identifies the instruction-tuned Mistral 24B for simplification, provides necessary heuristics for metric selection, and points to lexical support as a primary domain-adaptation issue for simplification.

Title: Iterative Layer-wise Distillation for Efficient Compression of Large Language Models

Authors: Grigory Kovalev, Mikhail Tikhomirov
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.05085
Pdf URL: https://arxiv.org/pdf/2511.05085
Copy Paste: [[2511.05085]] Iterative Layer-wise Distillation for Efficient Compression of Large Language Models(https://arxiv.org/abs/2511.05085)
Keywords: language model, gpt, llm
Abstract: This work investigates distillation methods for large language models (LLMs) with the goal of developing compact models that preserve high performance. Several existing approaches are reviewed, with a discussion of their respective strengths and limitations. An improved method based on the ShortGPT approach has been developed, building upon the idea of incorporating iterative evaluation of layer importance. At each step, importance is assessed by measuring performance degradation when individual layers are removed, using a set of representative datasets. This process is combined with further training using a joint loss function based on KL divergence and mean squared error. Experiments on the Qwen2.5-3B model show that the number of layers can be reduced from 36 to 28 (resulting in a 2.47 billion parameter model) with only a 9.7% quality loss, and to 24 layers with an 18% loss. The findings suggest that the middle transformer layers contribute less to inference, underscoring the potential of the proposed method for creating efficient models. The results demonstrate the effectiveness of iterative distillation and fine-tuning, making the approach suitable for deployment in resource-limited settings.
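The procedure outlined in this abstract (repeatedly measure how much performance drops when each layer is removed, drop the least important one, then train further with a KL plus MSE loss) might be sketched as below. This is a minimal illustration under assumptions. `evaluate` and `retrain` are hypothetical callbacks, and the KL direction and the weights `alpha` and `beta` are not specified in the abstract.

```python
import math


def iterative_layer_pruning(layers, evaluate, target_count, retrain=None):
    """Greedily drop, one per step, the layer whose removal hurts the score least."""
    kept = list(layers)
    while len(kept) > target_count:
        best_score, best_idx = None, None
        for i in range(len(kept)):
            candidate = kept[:i] + kept[i + 1:]
            score = evaluate(candidate)  # hypothetical: higher is better
            if best_score is None or score > best_score:
                best_score, best_idx = score, i
        kept.pop(best_idx)
        if retrain is not None:
            retrain(kept)  # hypothetical recovery-training hook
    return kept


def joint_loss(teacher_probs, student_probs, teacher_hidden, student_hidden,
               alpha=1.0, beta=1.0):
    """alpha * KL(teacher || student) + beta * MSE(hidden states); weights are assumptions."""
    kl = sum(p * math.log(p / q)
             for p, q in zip(teacher_probs, student_probs) if p > 0)
    mse = sum((t - s) ** 2
              for t, s in zip(teacher_hidden, student_hidden)) / len(teacher_hidden)
    return alpha * kl + beta * mse
```

Re-scoring every remaining layer after each removal is what makes the procedure iterative: a layer's importance can change once its neighbors are gone, which a one-shot ranking would miss.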

Title: A Toolbox for Improving Evolutionary Prompt Search

Authors: Daniel Grießhaber, Maximilian Kimmich, Johannes Maucher, Ngoc Thang Vu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.05120
Pdf URL: https://arxiv.org/pdf/2511.05120
Copy Paste: [[2511.05120]] A Toolbox for Improving Evolutionary Prompt Search(https://arxiv.org/abs/2511.05120)
Keywords: llm, prompt
Abstract: Evolutionary prompt optimization has demonstrated effectiveness in refining prompts for LLMs. However, existing approaches lack robust operators and efficient evaluation mechanisms. In this work, we propose several key improvements to evolutionary prompt optimization that can partially generalize to prompt optimization in general: 1) decomposing evolution into distinct steps to enhance the evolution and its control, 2) introducing an LLM-based judge to verify the evolutions, 3) integrating human feedback to refine the evolutionary operator, and 4) developing more efficient evaluation strategies that maintain performance while reducing computational overhead. Our approach improves both optimization quality and efficiency. We release our code, enabling prompt optimization on new tasks and facilitating further research in this area.

Title: Mind the Gap... or Not? How Translation Errors and Evaluation Details Skew Multilingual Results

Authors: Jan-Thorsten Peter, David Vilar, Tobias Domhan, Dan Malkin, Markus Freitag
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.05162
Pdf URL: https://arxiv.org/pdf/2511.05162
Copy Paste: [[2511.05162]] Mind the Gap... or Not? How Translation Errors and Evaluation Details Skew Multilingual Results(https://arxiv.org/abs/2511.05162)
Keywords: language model, llm
Abstract: Most current large language models (LLMs) support a wide variety of languages in addition to English, including high-resource languages (e.g. German, Chinese, French), as well as low-resource ones (e.g. Swahili, Telugu). In addition they have also shown impressive capabilities in different domains, like coding, science and math. In this short paper, taking math as an example domain, we study the performance of different LLMs across languages. Experimental results show that there exists a non-negligible and consistent gap in the performance of the models across languages. Interestingly, and somewhat against expectations, the gap exists for both high- and low-resource languages. We hope that these results influence further research into cross-lingual capability generalization for next generation LLMs. If it weren't for the fact that they are false! By analyzing one of the standard multilingual math benchmarks (MGSM), we determine that several translation errors are present in the data. Furthermore, the lack of standardized answer extraction from LLM outputs further influences the final results. We propose a method for automatic quality assurance to address the first issue at scale, and give recommendations to address the second one. Combining these two approaches we show that the aforementioned language gap mostly disappears, leading to completely different conclusions from our research. We additionally release the corrected dataset to the community.
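The answer-extraction issue raised in this abstract can be made concrete with a small sketch. The rule used here (take the last number in the model output, strip thousands separators, compare numerically) is a common convention and an assumption for illustration, not the authors' recommended procedure; the names `extract_final_number` and `is_correct` are hypothetical.

```python
import re
from typing import Optional

# Integers or decimals, allowing thousands separators such as 1,234.5.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(output: str) -> Optional[float]:
    """Return the last number mentioned in a model output, or None if there is none."""
    matches = _NUMBER.findall(output)
    if not matches:
        return None
    # Drop thousands separators (and any trailing comma the pattern picked up).
    return float(matches[-1].replace(",", ""))


def is_correct(output: str, gold: float, tol: float = 1e-6) -> bool:
    """Compare the extracted answer with the reference numerically, not as a string."""
    pred = extract_final_number(output)
    return pred is not None and abs(pred - gold) <= tol
```

Applying one such rule uniformly to every language removes a source of score variation that is unrelated to model ability. The sketch assumes Western digits and `.` as the decimal separator, which would itself need checking for some target languages.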

Title: Effectiveness of Chain-of-Thought in Distilling Reasoning Capability from Large Language Models

Authors: Cong-Thanh Do, Rama Doddipatla, Kate Knill
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.05184
Pdf URL: https://arxiv.org/pdf/2511.05184
Copy Paste: [[2511.05184]] Effectiveness of Chain-of-Thought in Distilling Reasoning Capability from Large Language Models(https://arxiv.org/abs/2511.05184)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Chain-of-Thought (CoT) prompting is a widely used method to improve the reasoning capability of Large Language Models (LLMs). More recently, CoT has been leveraged in Knowledge Distillation (KD) to transfer reasoning capability from a larger LLM to a smaller one. This paper examines the role of CoT in distilling the reasoning capability from larger LLMs to smaller LLMs using white-box KD, analysing its effectiveness in improving the performance of the distilled models for various natural language reasoning and understanding tasks. We conduct white-box KD experiments using LLMs from the Qwen and Llama2 families, employing CoT data from the CoT-Collection dataset. The distilled models are then evaluated on natural language reasoning and understanding tasks from the BIG-Bench-Hard (BBH) benchmark, which presents complex challenges for smaller LLMs. Experimental results demonstrate the role of CoT in improving white-box KD effectiveness, enabling the distilled models to achieve better average performance in natural language reasoning and understanding tasks from BBH.
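For readers unfamiliar with the term, white-box KD generally means the student is trained against the teacher's output distributions rather than only its generated text. The abstract does not state the exact objective, so the following is a minimal sketch under the assumption of a temperature-scaled forward KL divergence over one token's logits; `softmax` and `kd_loss` are illustrative names, not the authors' code.

```python
import math
from typing import Sequence, List


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits: Sequence[float],
            student_logits: Sequence[float],
            temperature: float = 1.0) -> float:
    """Forward KL(teacher || student) for a single token position."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
```

In a CoT setting, a loss of this kind would be summed over the tokens of the rationale as well as the final answer, which is what distinguishes it from distilling on answers alone.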

Title: Translation via Annotation: A Computational Study of Translating Classical Chinese into Japanese

Authors: Zilong Li, Jie Cao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.05239
Pdf URL: https://arxiv.org/pdf/2511.05239
Copy Paste: [[2511.05239]] Translation via Annotation: A Computational Study of Translating Classical Chinese into Japanese(https://arxiv.org/abs/2511.05239)
Keywords: language model, llm
Abstract: Ancient people translated classical Chinese into Japanese by annotating around each character. We abstract this process as sequence tagging tasks and fit them into modern language technologies. Research on this annotation and translation system faces a low-resource problem. We alleviate this problem by introducing an LLM-based annotation pipeline and constructing a new dataset from digitized open-source translation data. We show that, in the low-resource setting, introducing auxiliary Chinese NLP tasks benefits the training of sequence tagging tasks. We also evaluate the performance of large language models: they achieve high scores in direct machine translation but become confused when asked to annotate characters. Our method could serve as a supplement to LLMs.

Title: Reflective Personalization Optimization: A Post-hoc Rewriting Framework for Black-Box Large Language Models

Authors: Teqi Hao, Xioayu Tan, Shaojie Shi, Yinghui Xu, Xihe Qiu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.05286
Pdf URL: https://arxiv.org/pdf/2511.05286
Copy Paste: [[2511.05286]] Reflective Personalization Optimization: A Post-hoc Rewriting Framework for Black-Box Large Language Models(https://arxiv.org/abs/2511.05286)
Keywords: language model, llm, prompt
Abstract: The personalization of black-box large language models (LLMs) is a critical yet challenging task. Existing approaches predominantly rely on context injection, where user history is embedded into the prompt to directly guide the generation process. However, this single-step paradigm imposes a dual burden on the model: generating accurate content while simultaneously aligning with user-specific styles. This often results in a trade-off that compromises output quality and limits precise control. To address this fundamental tension, we propose Reflective Personalization Optimization (RPO), a novel framework that redefines the personalization paradigm by decoupling content generation from alignment. RPO operates in two distinct stages: first, a base model generates a high-quality, generic response; then, an external reflection module explicitly rewrites this output to align with the user's preferences. This reflection module is trained using a two-stage process. Initially, supervised fine-tuning is employed on structured rewriting trajectories to establish a core personalized reasoning policy that models the transformation from generic to user-aligned responses. Subsequently, reinforcement learning is applied to further refine and enhance the quality of the personalized outputs. Comprehensive experiments on the LaMP benchmark demonstrate that RPO, by decoupling content generation from personalization, significantly outperforms state-of-the-art baselines. These findings underscore the superiority of explicit response shaping over implicit context injection. Moreover, RPO introduces an efficient, model-agnostic personalization layer that can be seamlessly integrated with any underlying base model, paving the way for a new and effective direction in user-centric generation scenarios.

Title: Listening Between the Lines: Decoding Podcast Narratives with Language Modeling

Authors: Shreya Gupta, Ojasva Saxena, Arghodeep Nandi, Sarah Masud, Kiran Garimella, Tanmoy Chakraborty
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2511.05310
Pdf URL: https://arxiv.org/pdf/2511.05310
Copy Paste: [[2511.05310]] Listening Between the Lines: Decoding Podcast Narratives with Language Modeling(https://arxiv.org/abs/2511.05310)
Keywords: language model
Abstract: Podcasts have become a central arena for shaping public opinion, making them a vital source for understanding contemporary discourse. Their typically unscripted, multi-themed, and conversational style offers a rich but complex form of data. To analyze how podcasts persuade and inform, we must examine their narrative structures -- specifically, the narrative frames they employ. The fluid and conversational nature of podcasts presents a significant challenge for automated analysis. We show that existing large language models, typically trained on more structured text such as news articles, struggle to capture the subtle cues that human listeners rely on to identify narrative frames. As a result, current approaches fall short of accurately analyzing podcast narratives at scale. To solve this, we develop and evaluate a fine-tuned BERT model that explicitly links narrative frames to specific entities mentioned in the conversation, effectively grounding the abstract frame in concrete details. Our approach then uses these granular frame labels and correlates them with high-level topics to reveal broader discourse trends. The primary contributions of this paper are: (i) a novel frame-labeling methodology that more closely aligns with human judgment for messy, conversational data, and (ii) a new analysis that uncovers the systematic relationship between what is being discussed (the topic) and how it is being presented (the frame), offering a more robust framework for studying influence in digital media.

Title: What Are the Facts? Automated Extraction of Court-Established Facts from Criminal-Court Opinions

Authors: Klára Bendová, Tomáš Knap, Jan Černý, Vojtěch Pour, Jaromir Savelka, Ivana Kvapilíková, Jakub Drápal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.05320
Pdf URL: https://arxiv.org/pdf/2511.05320
Copy Paste: [[2511.05320]] What Are the Facts? Automated Extraction of Court-Established Facts from Criminal-Court Opinions(https://arxiv.org/abs/2511.05320)
Keywords: language model, llm, prompt
Abstract: Criminal justice administrative data contain only a limited amount of information about the committed offense. However, there is an unused source of extensive information in continental European courts' decisions: descriptions of criminal behaviors in verdicts by which offenders are found guilty. In this paper, we study the feasibility of extracting these descriptions from publicly available court decisions from Slovakia. We use two different approaches for retrieval: regular expressions and large language models (LLMs). Our baseline was a simple method employing regular expressions to identify typical words occurring before and after the description. The advanced regular expression approach further focused on "sparing" and its normalization (insertion of spaces between individual letters), which is typically used to delineate the description. The LLM approach involved prompting the Gemini Flash 2.0 model to extract the descriptions using predefined instructions. The baseline identified descriptions in only 40.5% of verdicts; both advanced methods significantly outperformed it, achieving 97% with advanced regular expressions, 98.75% with LLMs, and 99.5% when combined. Evaluation by law students showed that both advanced methods matched human annotations in about 90% of cases, compared to just 34.5% for the baseline. LLMs fully matched human-labeled descriptions in 91.75% of instances, and a combination of advanced regular expressions with LLMs reached 92%.
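The regular-expression idea described in this abstract (normalize letter-spaced words, then capture the text between typical delimiting words) can be sketched as follows. This is an illustration under assumptions, not the authors' patterns: the function names are hypothetical, the start and end markers are supplied by the caller because the abstract does not list the actual Slovak marker words, and the sample sentence in the usage note is made up.

```python
import re
from typing import Optional

# A run of three or more single letters separated by single spaces,
# e.g. "r o z h o d o l". [^\W\d_] matches any Unicode letter.
_SPACED_WORD = re.compile(r"(?<!\w)(?:[^\W\d_] ){2,}[^\W\d_](?!\w)")


def collapse_letter_spacing(text: str) -> str:
    """Rewrite letter-spaced words ("r o z h o d o l") as ordinary words ("rozhodol")."""
    return _SPACED_WORD.sub(lambda m: m.group(0).replace(" ", ""), text)


def extract_between(text: str, start_marker: str, end_marker: str) -> Optional[str]:
    """Return the shortest span between two marker strings after normalization."""
    pattern = re.compile(
        re.escape(start_marker) + r"(.*?)" + re.escape(end_marker), re.DOTALL
    )
    match = pattern.search(collapse_letter_spacing(text))
    return match.group(1).strip() if match else None
```

For example, `collapse_letter_spacing("súd r o z h o d o l takto")` yields `"súd rozhodol takto"`. One known limitation of this sketch: two letter-spaced words separated by only a single space would be merged into one, so real documents may need a stricter rule.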

Title: Evaluating Subword Tokenization Techniques for Bengali: A Benchmark Study with BengaliBPE

Authors: Firoj Ahmmed Patwary, Abdullah Al Noman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.05324
Pdf URL: https://arxiv.org/pdf/2511.05324
Copy Paste: [[2511.05324]] Evaluating Subword Tokenization Techniques for Bengali: A Benchmark Study with BengaliBPE(https://arxiv.org/abs/2511.05324)
Keywords: language model
Abstract: Tokenization is an important first step in Natural Language Processing (NLP) pipelines because it decides how models learn and represent linguistic information. However, current subword tokenizers like SentencePiece or HuggingFace BPE are mostly designed for Latin or multilingual corpora and do not perform well on languages with rich morphology such as Bengali. To address this limitation, we present BengaliBPE, a Byte Pair Encoding (BPE) tokenizer specifically developed for the Bengali script. BengaliBPE applies Unicode normalization, grapheme-level initialization, and morphology-aware merge rules to maintain linguistic consistency and preserve subword integrity. We use a large-scale Bengali news classification dataset to compare BengaliBPE with three baselines: Whitespace, SentencePiece BPE, and HuggingFace BPE. The evaluation considers tokenization granularity, encoding speed, and downstream classification accuracy. While all methods perform reasonably well, BengaliBPE provides the most detailed segmentation and the best morphological interpretability, albeit with slightly higher computational cost. These findings highlight the importance of language-aware tokenization for morphologically rich scripts and establish BengaliBPE as a strong foundation for future Bengali NLP systems, including large-scale pretraining of contextual language models.
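Two of the ingredients named in this abstract, Unicode normalization and grapheme-level initialization, can be combined with a plain BPE merge loop in a few lines. The sketch below is not BengaliBPE itself: it approximates grapheme clusters by attaching combining marks to the preceding base character (full segmentation follows Unicode UAX #29 and also handles conjuncts), it omits the morphology-aware merge rules, and the names `graphemes` and `learn_bpe` are hypothetical.

```python
import unicodedata
from collections import Counter
from typing import Iterable, List, Tuple


def graphemes(word: str) -> List[str]:
    """NFC-normalize, then attach each combining mark to the preceding base character."""
    clusters: List[str] = []
    for ch in unicodedata.normalize("NFC", word):
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def learn_bpe(words: Iterable[str], num_merges: int) -> List[Tuple[str, str]]:
    """Learn BPE merges starting from approximate grapheme clusters instead of code points."""
    corpus = Counter(tuple(graphemes(w)) for w in words)
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs: Counter = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge to every word in the corpus.
        new_corpus: Counter = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges
```

Starting from clusters rather than code points means a dependent vowel sign is never split from its consonant by a merge boundary, which is the kind of subword integrity the abstract refers to.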

Title: Large Language Models for Explainable Threat Intelligence

Authors: Tiago Dinis, Miguel Correia, Roger Tavares
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.05406
Pdf URL: https://arxiv.org/pdf/2511.05406
Copy Paste: [[2511.05406]] Large Language Models for Explainable Threat Intelligence(https://arxiv.org/abs/2511.05406)
Keywords: language model, llm, retrieval-augmented generation
Abstract: As cyber threats continue to grow in complexity, traditional security mechanisms struggle to keep up. Large language models (LLMs) offer significant potential in cybersecurity due to their advanced capabilities in text processing and generation. This paper explores the use of LLMs with retrieval-augmented generation (RAG) to obtain threat intelligence by combining real-time information retrieval with domain-specific data. The proposed system, RAGRecon, uses an LLM with RAG to answer questions about cybersecurity threats. Moreover, it makes this form of Artificial Intelligence (AI) explainable by generating a knowledge graph for every reply and presenting it visually to the user. This increases the transparency and interpretability of the model's reasoning, allowing analysts to better understand the connections made by the system based on the context recovered by the RAG system. We evaluated RAGRecon experimentally with two datasets and seven different LLMs, and the responses matched the reference responses more than 91% of the time for the best combinations.

Title: Steering Language Models with Weight Arithmetic

Authors: Constanza Fierro, Fabien Roger
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.05408
Pdf URL: https://arxiv.org/pdf/2511.05408
Copy Paste: [[2511.05408]] Steering Language Models with Weight Arithmetic(https://arxiv.org/abs/2511.05408)
Keywords: language model, llm
Abstract: Providing high-quality feedback to Large Language Models (LLMs) on a diverse training distribution can be difficult and expensive, and providing feedback only on a narrow distribution can result in unintended generalizations. To better leverage narrow training data, we propose contrastive weight steering, a simple post-training method that edits the model parameters using weight arithmetic. We isolate a behavior direction in weight-space by subtracting the weight deltas from two small fine-tunes -- one that induces the desired behavior and another that induces its opposite -- and then add or remove this direction to modify the model's weights. We apply this technique to mitigate sycophancy and induce misalignment, and find that weight steering often generalizes further than activation steering, achieving stronger out-of-distribution behavioral control before degrading general capabilities. We also show that, in the context of task-specific fine-tuning, weight steering can partially mitigate undesired behavioral drift: it can reduce sycophancy and under-refusals introduced during fine-tuning while preserving task performance gains. Finally, we provide preliminary evidence that emergent misalignment can be detected by measuring the similarity between fine-tuning updates and an "evil" weight direction, suggesting that it may be possible to monitor the evolution of weights during training and detect rare misaligned behaviors that never manifest during training or evaluations.
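The weight arithmetic described in this abstract can be written out in a few lines. The sketch below follows the abstract's wording: the behavior direction is the difference between the weight deltas of two fine-tunes, it is added to (or subtracted from) a model's weights with a scaling factor, and a fine-tuning update can be compared with a direction by cosine similarity. The dict-of-lists weight format, the function names, and the scaling parameter `alpha` are assumptions made for illustration; real models would use tensors.

```python
import math
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flattened values (toy format)


def delta(a: Weights, b: Weights) -> Weights:
    """Element-wise a - b for every parameter."""
    return {k: [x - y for x, y in zip(a[k], b[k])] for k in a}


def steering_direction(base: Weights, positive: Weights, negative: Weights) -> Weights:
    """(positive - base) - (negative - base): the contrast between the two fine-tunes."""
    return delta(delta(positive, base), delta(negative, base))


def steer(weights: Weights, direction: Weights, alpha: float) -> Weights:
    """Add the direction (alpha > 0) or remove it (alpha < 0)."""
    return {k: [w + alpha * d for w, d in zip(weights[k], direction[k])] for k in weights}


def cosine(u: Weights, v: Weights) -> float:
    """Cosine similarity over all parameters, e.g. between an update and a direction."""
    dot = sum(x * y for k in u for x, y in zip(u[k], v[k]))
    norm_u = math.sqrt(sum(x * x for k in u for x in u[k]))
    norm_v = math.sqrt(sum(y * y for k in v for y in v[k]))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Because the base weights cancel in `steering_direction`, the direction equals `positive - negative`; writing it as a difference of deltas keeps the code aligned with the abstract's description. Monitoring in the sense of the final sentence would amount to tracking `cosine(delta(checkpoint, base), direction)` across training checkpoints.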