2025-11-12

Title: A Preliminary Study of RAG for Taiwanese Historical Archives

Authors: Claire Lin, Bo-Han Feng, Xuanjun Chen, Te-Lun Yang, Hung-yi Lee, Jyh-Shing Roger Jang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.07445
Pdf URL: https://arxiv.org/pdf/2511.07445
Copy Paste: [[2511.07445]] A Preliminary Study of RAG for Taiwanese Historical Archives(https://arxiv.org/abs/2511.07445)
Keywords: hallucination, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a promising approach for knowledge-intensive tasks. However, few studies have examined RAG for Taiwanese Historical Archives. In this paper, we present an initial study of a RAG pipeline applied to two historical Traditional Chinese datasets, Fort Zeelandia and the Taiwan Provincial Council Gazette, along with their corresponding open-ended query sets. We systematically investigate the effects of query characteristics and metadata integration strategies on retrieval quality, answer generation, and the performance of the overall system. The results show that early-stage metadata integration enhances both retrieval and answer accuracy while also revealing persistent challenges for RAG systems, including hallucinations during generation and difficulties in handling temporal or multi-hop historical queries.
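Example sketch: one way to read "early-stage metadata integration" is that metadata is merged into each chunk's text before indexing, so the retriever can match on it. The abstract does not give the pipeline, so everything below is an assumption for illustration: the records are invented toy data, the token-overlap retriever stands in for a real embedding model, and the names `index_text` and `retrieve` are hypothetical.

```python
import re
from typing import Dict, List, Set


def tokens(text: str) -> Set[str]:
    """Lowercase word tokens; a toy stand-in for an embedding."""
    return set(re.findall(r"\w+", text.lower()))


def index_text(record: Dict, early_metadata: bool) -> str:
    """Return the text that gets indexed for one record."""
    if early_metadata:
        # Assumed "early" strategy: prepend metadata before indexing.
        meta = " ".join(f"{k} {v}" for k, v in record["meta"].items())
        return f"{meta} {record['text']}"
    return record["text"]


def retrieve(query: str, records: List[Dict], early_metadata: bool) -> Dict:
    """Return the record with the largest token overlap (first one wins ties)."""
    q = tokens(query)
    return max(records, key=lambda r: len(q & tokens(index_text(r, early_metadata))))


# Invented toy records, not taken from either dataset.
records = [
    {"text": "The council discussed the irrigation budget.",
     "meta": {"source": "Gazette", "year": "1952"}},
    {"text": "The council discussed the school budget.",
     "meta": {"source": "Gazette", "year": "1960"}},
]

query = "council budget 1960"
hit_early = retrieve(query, records, early_metadata=True)
hit_plain = retrieve(query, records, early_metadata=False)
```

With the metadata indexed, the year in the query can select the 1960 record; without it, both records tie on the body text and the first one is returned.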

Title: Large Language Models for Scientific Idea Generation: A Creativity-Centered Survey

Authors: Fatemeh Shahhosseini, Arash Marioriyad, Ali Momen, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban, Shaghayegh Haghjooy Javanmard
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.07448
Pdf URL: https://arxiv.org/pdf/2511.07448
Copy Paste: [[2511.07448]] Large Language Models for Scientific Idea Generation: A Creativity-Centered Survey(https://arxiv.org/abs/2511.07448)
Keywords: language model, llm, prompt, agent
Abstract: Scientific idea generation lies at the heart of scientific discovery and has driven human progress, whether by solving unsolved problems or proposing novel hypotheses to explain unknown phenomena. Unlike standard scientific reasoning or general creative generation, idea generation in science is a multi-objective and open-ended task, where the novelty of a contribution is as essential as its empirical soundness. Large language models (LLMs) have recently emerged as promising generators of scientific ideas, capable of producing coherent and factual outputs with surprising intuition and acceptable reasoning, yet their creative capacity remains inconsistent and poorly understood. This survey provides a structured synthesis of methods for LLM-driven scientific ideation, examining how different approaches balance creativity with scientific soundness. We categorize existing methods into five complementary families: External knowledge augmentation, Prompt-based distributional steering, Inference-time scaling, Multi-agent collaboration, and Parameter-level adaptation. To interpret their contributions, we employ two complementary frameworks: Boden's taxonomy of Combinatorial, Exploratory and Transformational creativity to characterize the level of ideas each family is expected to generate, and Rhodes' 4Ps framework (Person, Process, Press, and Product) to locate the aspect or source of creativity that each method emphasizes. By aligning methodological advances with creativity frameworks, this survey clarifies the state of the field and outlines key directions toward reliable, systematic, and transformative applications of LLMs in scientific discovery.

Title: GRIP: In-Parameter Graph Reasoning through Fine-Tuning Large Language Models

Authors: Jiarui Feng, Donghong Cai, Yixin Chen, Muhan Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.07457
Pdf URL: https://arxiv.org/pdf/2511.07457
Copy Paste: [[2511.07457]] GRIP: In-Parameter Graph Reasoning through Fine-Tuning Large Language Models(https://arxiv.org/abs/2511.07457)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in modeling sequential textual data and generalizing across diverse tasks. However, adapting LLMs to effectively handle structural data, such as knowledge graphs or web data, remains a challenging problem. Some approaches adopt complex strategies to convert graphs into text sequences, resulting in significant token overhead and rendering them impractical for large-scale graphs. Others introduce additional modules to encode graphs into fixed-size token representations for LLMs. However, these methods typically require large-scale post-training on graph-text corpora and complex alignment procedures, yet often yield sub-optimal results due to poor modality alignment. Inspired by in-parameter knowledge injection for test-time adaptation of LLMs, we propose GRIP, a novel framework that equips LLMs with the ability to internalize complex relational information from graphs through carefully designed fine-tuning tasks. This knowledge is efficiently stored within lightweight LoRA parameters, enabling the fine-tuned LLM to perform a wide range of graph-related tasks without requiring access to the original graph at inference time. Extensive experiments across multiple benchmarks validate the effectiveness and efficiency of our approach.

Title: REFLEX: Reference-Free Evaluation of Log Summarization via Large Language Model Judgment

Authors: Priyanka Mudgal
Subjects: cs.CL, cs.AI, cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2511.07458
Pdf URL: https://arxiv.org/pdf/2511.07458
Copy Paste: [[2511.07458]] REFLEX: Reference-Free Evaluation of Log Summarization via Large Language Model Judgment(https://arxiv.org/abs/2511.07458)
Keywords: language model, llm
Abstract: Evaluating log summarization systems is challenging due to the lack of high-quality reference summaries and the limitations of existing metrics like ROUGE and BLEU, which depend on surface-level lexical overlap. We introduce REFLEX, a reference-free evaluation metric for log summarization based on large language model (LLM) judgment. REFLEX uses LLMs as zero-shot evaluators to assess summary quality along dimensions such as relevance, informativeness, and coherence, without requiring gold-standard references or human annotations. We show that REFLEX produces stable, interpretable, and fine-grained evaluations across multiple log summarization datasets, and more effectively distinguishes model outputs than traditional metrics. REFLEX provides a scalable alternative for evaluating log summaries in real-world settings where reference data is scarce or unavailable.

Title: It Takes Two: A Dual Stage Approach for Terminology-Aware Translation

Authors: Akshat Singh Jaswal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.07461
Pdf URL: https://arxiv.org/pdf/2511.07461
Copy Paste: [[2511.07461]] It Takes Two: A Dual Stage Approach for Terminology-Aware Translation(https://arxiv.org/abs/2511.07461)
Keywords: llm, prompt
Abstract: This paper introduces DuTerm, a novel two-stage architecture for terminology-constrained machine translation. Our system combines a terminology-aware NMT model, adapted via fine-tuning on large-scale synthetic data, with a prompt-based LLM for post-editing. The LLM stage refines NMT output and enforces terminology adherence. We evaluate DuTerm on English-to-German, English-to-Spanish, and English-to-Russian with the WMT 2025 Terminology Shared Task corpus. We demonstrate that flexible, context-driven terminology handling by the LLM consistently yields higher quality translations than strict constraint enforcement. Our results highlight a critical trade-off, revealing that LLMs work best for high-quality translation as context-driven mutators rather than generators.

Title: Motif 2 12.7B technical report

Authors: Junghwan Lim, Sungmin Lee, Dongseok Kim, Taehyun Kim, Eunhwan Park, Jeesoo Lee, Jeongdoo Lee, Junhyeok Lee, Wai Ting Cheung, Dahye Choi, Jaeheui Her, Jaeyeon Huh, Hanbin Jung, Changjin Kang, Beomgyu Kim, Minjae Kim, Taewhan Kim, Youngrok Kim, Hyukjin Kweon, Haesol Lee, Kungyu Lee, Dongpin Oh, Yeongjae Park, Bokki Ryu, Dongjoo Weon
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.07464
Pdf URL: https://arxiv.org/pdf/2511.07464
Copy Paste: [[2511.07464]] Motif 2 12.7B technical report(https://arxiv.org/abs/2511.07464)
Keywords: language model
Abstract: We introduce Motif-2-12.7B, a new open-weight foundation model that pushes the efficiency frontier of large language models by combining architectural innovation with system-level optimization. Designed for scalable language understanding and robust instruction generalization under constrained compute budgets, Motif-2-12.7B builds upon Motif-2.6B with the integration of Grouped Differential Attention (GDA), which improves representational efficiency by disentangling signal and noise-control attention pathways. The model is pre-trained on 5.5 trillion tokens spanning diverse linguistic, mathematical, scientific, and programming domains using a curriculum-driven data scheduler that gradually changes the data composition ratio. The training system leverages the MuonClip optimizer alongside custom high-performance kernels, including fused PolyNorm activations and the Parallel Muon algorithm, yielding significant throughput and memory efficiency gains in large-scale distributed environments. Post-training employs a three-stage supervised fine-tuning pipeline that successively enhances general instruction adherence, compositional understanding, and linguistic precision. Motif-2-12.7B demonstrates competitive performance across diverse benchmarks, showing that thoughtful architectural scaling and optimized training design can rival the capabilities of much larger models.

Title: Focusing on Language: Revealing and Exploiting Language Attention Heads in Multilingual Large Language Models

Authors: Xin Liu, Qiyang Song, Qihang Zhou, Haichao Du, Shaowen Xu, Wenbo Jiang, Weijuan Zhang, Xiaoqi Jia
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.07498
Pdf URL: https://arxiv.org/pdf/2511.07498
Copy Paste: [[2511.07498]] Focusing on Language: Revealing and Exploiting Language Attention Heads in Multilingual Large Language Models(https://arxiv.org/abs/2511.07498)
Keywords: language model, llm
Abstract: Large language models (LLMs) increasingly support multilingual understanding and generation. Meanwhile, efforts to interpret their internal mechanisms have emerged, offering insights to enhance multilingual performance. While multi-head self-attention (MHA) has proven critical in many areas, its role in multilingual capabilities remains underexplored. In this work, we study the contribution of MHA in supporting multilingual processing in LLMs. We propose Language Attention Head Importance Scores (LAHIS), an effective and efficient method that identifies attention head importance for multilingual capabilities via a single forward and backward pass through the LLM. Applying LAHIS to Aya-23-8B, Llama-3.2-3B, and Mistral-7B-v0.1, we reveal the existence of both language-specific and language-general heads. Language-specific heads enable cross-lingual attention transfer to guide the model toward target language contexts and mitigate the off-target language generation issue, contributing to addressing challenges in multilingual LLMs. We also introduce a lightweight adaptation that learns a soft head mask to modulate attention outputs over language heads, requiring only 20 tunable parameters to improve XQuAD accuracy. Overall, our work enhances both the interpretability and multilingual capabilities of LLMs from the perspective of MHA.

Title: LLM Optimization Unlocks Real-Time Pairwise Reranking

Authors: Jingyu Wu, Aditya Shrivastava, Jing Zhu, Alfy Samuel, Anoop Kumar, Daben Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.07555
Pdf URL: https://arxiv.org/pdf/2511.07555
Copy Paste: [[2511.07555]] LLM Optimization Unlocks Real-Time Pairwise Reranking(https://arxiv.org/abs/2511.07555)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Efficiently reranking documents retrieved from information retrieval (IR) pipelines to enhance the overall quality of Retrieval-Augmented Generation (RAG) systems remains an important yet challenging problem. Recent studies have highlighted the importance of Large Language Models (LLMs) in reranking tasks. In particular, Pairwise Reranking Prompting (PRP) has emerged as a promising plug-and-play approach due to its usability and effectiveness. However, the inherent complexity of the algorithm, coupled with the high computational demands and latency incurred due to LLMs, raises concerns about its feasibility in real-time applications. To address these challenges, this paper presents a focused study on pairwise reranking, demonstrating that carefully applied optimization methods can significantly mitigate these issues. By implementing these methods, we achieve a remarkable latency reduction of up to 166 times, from 61.36 seconds to 0.37 seconds per query, with an insignificant drop in performance measured by Recall@k. Our study highlights the importance of design choices that were previously overlooked, such as using smaller models, limiting the reranked set, using lower precision, reducing positional bias with one-directional order inference, and restricting output tokens. These optimizations make LLM-based reranking substantially more efficient and feasible for latency-sensitive, real-world deployments.
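Example sketch: two of the design choices named in the abstract, limiting the reranked set and comparing each pair in one direction only, can be illustrated as below. This is not the paper's implementation. The insertion-sort ordering, the names `pairwise_rerank` and `toy_prefers`, and the `top_n` parameter are assumptions, and `toy_prefers` is a word-overlap stand-in for an LLM call. The last lines check the reported speedup from the two latencies quoted above.

```python
from typing import Callable, List

Prefers = Callable[[str, str, str], bool]


def pairwise_rerank(query: str, docs: List[str], prefers: Prefers,
                    top_n: int = 5) -> List[str]:
    """Rerank only the first top_n docs; the tail keeps its retrieval order."""
    head, tail = list(docs[:top_n]), list(docs[top_n:])
    # Insertion sort: each pair is judged once, in a single (a, b) order.
    for i in range(1, len(head)):
        j = i
        while j > 0 and prefers(query, head[j], head[j - 1]):
            head[j], head[j - 1] = head[j - 1], head[j]
            j -= 1
    return head + tail


def toy_prefers(query: str, a: str, b: str) -> bool:
    """Hypothetical stand-in for an LLM judgment: more query words wins."""
    q = set(query.lower().split())
    return len(q & set(a.lower().split())) > len(q & set(b.lower().split()))


docs = [
    "cats sleep",
    "llm reranking latency",
    "reranking",
    "tail document about llm reranking latency",
]
ranked = pairwise_rerank("llm reranking latency", docs, toy_prefers, top_n=3)

# Speedup implied by the latencies quoted in the abstract.
speedup = 61.36 / 0.37
```

Only the first three documents are compared here, so the fourth stays last even though it overlaps the query; that is the cost of limiting the reranked set in exchange for fewer model calls.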

Title: LLMs vs. Traditional Sentiment Tools in Psychology: An Evaluation on Belgian-Dutch Narratives

Authors: Ratna Kandala, Katie Hoemann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.07641
Pdf URL: https://arxiv.org/pdf/2511.07641
Copy Paste: [[2511.07641]] LLMs vs. Traditional Sentiment Tools in Psychology: An Evaluation on Belgian-Dutch Narratives(https://arxiv.org/abs/2511.07641)
Keywords: language model, llm, chat
Abstract: Understanding emotional nuances in everyday language is crucial for computational linguistics and emotion research. While traditional lexicon-based tools like LIWC and Pattern have served as foundational instruments, Large Language Models (LLMs) promise enhanced context understanding. We evaluated three Dutch-specific LLMs (ChocoLlama-8B-Instruct, Reynaerde-7B-chat, and GEITje-7B-ultra) against LIWC and Pattern for valence prediction in Flemish, a low-resource language variant. Our dataset comprised approximately 25000 spontaneous textual responses from 102 Dutch-speaking participants, each providing narratives about their current experiences with self-assessed valence ratings (-50 to +50). Surprisingly, despite architectural advancements, the Dutch-tuned LLMs underperformed compared to traditional methods, with Pattern showing superior performance. These findings challenge assumptions about LLM superiority in sentiment analysis tasks and highlight the complexity of capturing emotional valence in spontaneous, real-world narratives. Our results underscore the need for developing culturally and linguistically tailored evaluation frameworks for low-resource language variants, while questioning whether current LLM fine-tuning approaches adequately address the nuanced emotional expressions found in everyday language use.

Title: Revisiting NLI: Towards Cost-Effective and Human-Aligned Metrics for Evaluating LLMs in Question Answering

Authors: Sai Shridhar Balamurali, Lu Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.07659
Pdf URL: https://arxiv.org/pdf/2511.07659
Copy Paste: [[2511.07659]] Revisiting NLI: Towards Cost-Effective and Human-Aligned Metrics for Evaluating LLMs in Question Answering(https://arxiv.org/abs/2511.07659)
Keywords: language model, gpt, llm
Abstract: Evaluating answers from state-of-the-art large language models (LLMs) is challenging: lexical metrics miss semantic nuances, whereas "LLM-as-Judge" scoring is computationally expensive. We re-evaluate a lightweight alternative, off-the-shelf Natural Language Inference (NLI) scoring augmented by a simple lexical-match flag, and find that this decades-old technique matches GPT-4o's accuracy (89.9%) on long-form QA, while requiring orders-of-magnitude fewer parameters. To test human alignment of these metrics rigorously, we introduce DIVER-QA, a new 3000-sample human-annotated benchmark spanning five QA datasets and five candidate LLMs. Our results highlight that inexpensive NLI-based evaluation remains competitive and offer DIVER-QA as an open resource for future metric research.
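Example sketch: the abstract does not say how the NLI score and the lexical-match flag are combined, so the rule below (accept if the reference appears in the answer, otherwise fall back to an entailment threshold) is an assumption. The names `lexical_match`, `judge`, and `toy_entail_prob`, and the 0.5 threshold, are hypothetical; `toy_entail_prob` is a word-overlap placeholder where a real NLI model would go.

```python
import re
from typing import Callable


def normalize(text: str) -> str:
    """Lowercase and keep only alphanumeric tokens, joined by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def lexical_match(reference: str, answer: str) -> bool:
    """True if the normalized reference occurs as whole tokens in the answer."""
    ref = normalize(reference)
    return bool(ref) and f" {ref} " in f" {normalize(answer)} "


def judge(reference: str, answer: str,
          entail_prob: Callable[[str, str], float],
          threshold: float = 0.5) -> bool:
    """Assumed rule: lexical flag first, then NLI entailment of the reference."""
    if lexical_match(reference, answer):
        return True
    # Premise = model answer, hypothesis = reference answer.
    return entail_prob(answer, reference) >= threshold


def toy_entail_prob(premise: str, hypothesis: str) -> float:
    """Placeholder for an NLI model: share of hypothesis tokens in the premise."""
    hyp = normalize(hypothesis).split()
    prem = set(normalize(premise).split())
    return sum(t in prem for t in hyp) / len(hyp) if hyp else 0.0


ok = judge("Paris", "The capital of France is Paris.", toy_entail_prob)
bad = judge("Paris", "It is Lyon.", toy_entail_prob)
```

The lexical flag is a cheap shortcut for answers that contain the reference verbatim; the entailment score is meant to cover paraphrases that a string match would miss.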

Title: CAPO: Confidence Aware Preference Optimization Learning for Multilingual Preferences

Authors: Rhitabrat Pokharel, Yufei Tao, Ameeta Agrawal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.07691
Pdf URL: https://arxiv.org/pdf/2511.07691
Copy Paste: [[2511.07691]] CAPO: Confidence Aware Preference Optimization Learning for Multilingual Preferences(https://arxiv.org/abs/2511.07691)
Keywords: language model, llm
Abstract: Preference optimization is a critical post-training technique used to align large language models (LLMs) with human preferences, typically by fine-tuning on ranked response pairs. While methods like Direct Preference Optimization (DPO) have proven effective in English, they often fail to generalize robustly to multilingual settings. We propose a simple yet effective alternative, Confidence-Aware Preference Optimization (CAPO), which replaces DPO's fixed treatment of preference pairs with a dynamic loss scaling mechanism based on a relative reward. By modulating the learning signal according to the confidence in each preference pair, CAPO enhances robustness to noisy or low-margin comparisons, typically encountered in multilingual text. Empirically, CAPO outperforms existing preference optimization baselines by at least 16% in reward accuracy, and improves alignment by widening the gap between preferred and dispreferred responses across languages.

Title: Critical Confabulation: Can LLMs Hallucinate for Social Good?

Authors: Peiqi Sui, Eamon Duede, Hoyt Long, Richard Jean So
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.07722
Pdf URL: https://arxiv.org/pdf/2511.07722
Copy Paste: [[2511.07722]] Critical Confabulation: Can LLMs Hallucinate for Social Good?(https://arxiv.org/abs/2511.07722)
Keywords: llm, hallucination, prompt
Abstract: LLMs hallucinate, yet some confabulations can have social affordances if carefully bounded. We propose critical confabulation (inspired by critical fabulation from literary and social theory), the use of LLM hallucinations to "fill-in-the-gap" for omissions in archives due to social and political inequality, and reconstruct divergent yet evidence-bound narratives for history's "hidden figures". We simulate these gaps with an open-ended narrative cloze task: asking LLMs to generate a masked event in a character-centric timeline sourced from a novel corpus of unpublished texts. We evaluate audited (for data contamination), fully-open models (the OLMo-2 family) and unaudited open-weight and proprietary baselines under a range of prompts designed to elicit controlled and useful hallucinations. Our findings validate LLMs' foundational narrative understanding capabilities to perform critical confabulation, and show how controlled and well-specified hallucinations can support LLM applications for knowledge production without collapsing speculation into a lack of historical accuracy and fidelity.

Title: Back to the Future: The Role of Past and Future Context Predictability in Incremental Language Production

Authors: Shiva Upadhye, Richard Futrell
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.07752
Pdf URL: https://arxiv.org/pdf/2511.07752
Copy Paste: [[2511.07752]] Back to the Future: The Role of Past and Future Context Predictability in Incremental Language Production(https://arxiv.org/abs/2511.07752)
Keywords: language model
Abstract: Contextual predictability shapes both the form and choice of words in online language production. The effects of the predictability of a word given its previous context are generally well-understood in both production and comprehension, but studies of naturalistic production have also revealed a poorly-understood backward predictability effect of a word given its future context, which may be related to future planning. Here, in two studies of naturalistic speech corpora, we investigate backward predictability effects using improved measures and more powerful language models, introducing a new principled and conceptually motivated information-theoretic predictability measure that integrates predictability from both the future and the past context. Our first study revisits classic predictability effects on word duration. Our second study investigates substitution errors within a generative framework that independently models the effects of lexical, contextual, and communicative factors on word choice, while predicting the actual words that surface as speech errors. We find that our proposed conceptually-motivated alternative to backward predictability yields qualitatively similar effects across both studies. Through a fine-grained analysis of substitution errors, we further show that different kinds of errors are suggestive of how speakers prioritize form, meaning, and context-based information during lexical planning. Together, these findings illuminate the functional roles of past and future context in how speakers encode and choose words, offering a bridge between contextual predictability effects and the mechanisms of sentence planning.

Title: Design, Results and Industry Implications of the World's First Insurance Large Language Model Evaluation Benchmark

Authors: Hua Zhou (Central University of Finance and Economics), Bing Ma (Central University of Finance and Economics), Yufei Zhang (Zetavision AI Lab), Yi Zhao (Zetavision AI Lab)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.07794
Pdf URL: https://arxiv.org/pdf/2511.07794
Copy Paste: [[2511.07794]] Design, Results and Industry Implications of the World's First Insurance Large Language Model Evaluation Benchmark(https://arxiv.org/abs/2511.07794)
Keywords: language model, agent
Abstract: This paper comprehensively elaborates on the construction methodology, multi-dimensional evaluation system, and underlying design philosophy of CUFEInse v1.0. Adhering to the principles of "quantitative-oriented, expert-driven, and multi-validation," the benchmark establishes an evaluation framework covering 5 core dimensions, 54 sub-indicators, and 14,430 high-quality questions, encompassing insurance theoretical knowledge, industry understanding, safety and compliance, intelligent agent application, and logical rigor. Based on this benchmark, a comprehensive evaluation was conducted on 11 mainstream large language models. The evaluation results reveal that general-purpose models suffer from common bottlenecks such as weak actuarial capabilities and inadequate compliance adaptation. High-quality domain-specific training demonstrates significant advantages in insurance vertical scenarios but exhibits shortcomings in business adaptation and compliance. The evaluation also accurately identifies the common bottlenecks of current large models in professional scenarios such as insurance actuarial, underwriting and claim settlement reasoning, and compliant marketing copywriting. The establishment of CUFEInse not only fills the gap in professional evaluation benchmarks for the insurance field, providing academia and industry with a professional, systematic, and authoritative evaluation tool, but also its construction concept and methodology offer important references for the evaluation paradigm of large models in vertical fields, serving as an authoritative reference for academic model optimization and industrial model selection. Finally, the paper looks forward to the future iteration direction of the evaluation benchmark and the core development direction of "domain adaptation + reasoning enhancement" for insurance large models.

Title: From Experience to Strategy: Empowering LLM Agents with Trainable Graph Memory

Authors: Siyu Xia, Zekun Xu, Jiajun Chai, Wentian Fan, Yan Song, Xiaohan Wang, Guojun Yin, Wei Lin, Haifeng Zhang, Jun Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.07800
Pdf URL: https://arxiv.org/pdf/2511.07800
Copy Paste: [[2511.07800]] From Experience to Strategy: Empowering LLM Agents with Trainable Graph Memory(https://arxiv.org/abs/2511.07800)
Keywords: language model, llm, prompt, agent
Abstract: Large Language Model (LLM)-based agents have demonstrated remarkable potential in autonomous task-solving across complex, open-ended environments. A promising approach for improving the reasoning capabilities of LLM agents is to better utilize prior experiences in guiding current decisions. However, LLMs acquire experience either through implicit memory via training, which suffers from catastrophic forgetting and limited interpretability, or explicit memory via prompting, which lacks adaptability. In this paper, we introduce a novel agent-centric, trainable, multi-layered graph memory framework and evaluate how context memory enhances the ability of LLMs to utilize parametric information. The graph abstracts raw agent trajectories into structured decision paths in a state machine and further distills them into high-level, human-interpretable strategic meta-cognition. In order to make memory adaptable, we propose a reinforcement-based weight optimization procedure that estimates the empirical utility of each meta-cognition based on reward feedback from downstream tasks. These optimized strategies are then dynamically integrated into the LLM agent's training loop through meta-cognitive prompting. Empirically, the learnable graph memory delivers robust generalization, improves LLM agents' strategic reasoning performance, and provides consistent benefits during Reinforcement Learning (RL) training.

Title: AlignSurvey: A Comprehensive Benchmark for Human Preferences Alignment in Social Surveys

Authors: Chenxi Lin, Weikang Yuan, Zhuoren Jiang, Biao Huang, Ruitao Zhang, Jianan Ge, Yueqian Xu, Jianxing Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.07871
Pdf URL: https://arxiv.org/pdf/2511.07871
Copy Paste: [[2511.07871]] AlignSurvey: A Comprehensive Benchmark for Human Preferences Alignment in Social Surveys(https://arxiv.org/abs/2511.07871)
Keywords: language model, llm
Abstract: Understanding human attitudes, preferences, and behaviors through social surveys is essential for academic research and policymaking. Yet traditional surveys face persistent challenges, including fixed-question formats, high costs, limited adaptability, and difficulties ensuring cross-cultural equivalence. While recent studies explore large language models (LLMs) to simulate survey responses, most are limited to structured questions, overlook the entire survey process, and risk under-representing marginalized groups due to training data biases. We introduce AlignSurvey, the first benchmark that systematically replicates and evaluates the full social survey pipeline using LLMs. It defines four tasks aligned with key survey stages: social role modeling, semi-structured interview modeling, attitude stance modeling, and survey response modeling. It also provides task-specific evaluation metrics to assess alignment fidelity, consistency, and fairness at both individual and group levels, with a focus on demographic diversity. To support AlignSurvey, we construct a multi-tiered dataset architecture: (i) the Social Foundation Corpus, a cross-national resource with 44K+ interview dialogues and 400K+ structured survey records; and (ii) a suite of Entire-Pipeline Survey Datasets, including the expert-annotated AlignSurvey-Expert (ASE) and two nationally representative surveys for cross-cultural evaluation. We release the SurveyLM family, obtained through two-stage fine-tuning of open-source LLMs, and offer reference models for evaluating domain-specific alignment. All datasets, models, and tools are available at github and huggingface to support transparent and socially responsible research.
摘要：通过社会调查了解人类的态度、偏好和行为对于学术研究和政策制定至关重要。然而，传统调查面临持续的挑战，包括固定问题格式、成本高、适应性有限以及难以确保跨文化平等。虽然最近的研究探索大型语言模型（LLM）来模拟调查响应，但大多数仅限于结构化问题，忽视了整个调查过程，并且由于训练数据偏差而存在无法充分代表边缘化群体的风险。我们推出 AlignSurvey，这是第一个使用法学硕士系统地复制和评估完整社会调查流程的基准。它定义了与关键调查阶段一致的四个任务：社会角色建模、半结构化访谈建模、态度立场建模和调查响应建模。它还提供特定于任务的评估指标，以评估个人和群体层面的一致性保真度、一致性和公平性，重点关注人口多样性。为了支持 AlignSurvey，我们构建了一个多层数据集架构：(i) Social Foundation Corpus，这是一个跨国资源，包含 44K+ 采访对话和 400K+ 结构化调查记录； (ii) 一套整个管道调查数据集，包括专家注释的 AlignSurvey-Expert (ASE) 和两项具有全国代表性的跨文化评估调查。我们发布了通过开源 LLM 的两阶段微调获得的 SurveyLM 系列，并提供用于评估特定领域对齐的参考模型。所有数据集、模型和工具均可在 github 和 Huggingface 上获取，以支持透明且对社会负责的研究。

Title: Last Layer Logits to Logic: Empowering LLMs with Logic-Consistent Structured Knowledge Reasoning

Authors: Songze Li, Zhiqiang Liu, Zhaoyan Gong, Xiaoke Guo, Zhengke Gui, Huajun Chen, Wen Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.07910
Pdf URL: https://arxiv.org/pdf/2511.07910
Copy Paste: [[2511.07910]] Last Layer Logits to Logic: Empowering LLMs with Logic-Consistent Structured Knowledge Reasoning(https://arxiv.org/abs/2511.07910)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) achieve excellent performance in natural language reasoning tasks through pre-training on vast unstructured text, enabling them to understand the logic in natural language and generate logic-consistent responses. However, the representational differences between unstructured and structured knowledge make LLMs inherently struggle to maintain logic consistency, leading to Logic Drift challenges in structured knowledge reasoning tasks such as Knowledge Graph Question Answering (KGQA). Existing methods address this limitation by designing complex workflows embedded in prompts to guide LLM reasoning. Nevertheless, these approaches only provide input-level guidance and fail to fundamentally address the Logic Drift in LLM outputs. Additionally, their inflexible reasoning workflows cannot adapt to different tasks and knowledge graphs. To enhance LLMs' logic consistency in structured knowledge reasoning, we specifically target the logits output from the autoregressive generation process. We propose the Logits-to-Logic framework, which incorporates logits strengthening and logits filtering as core modules to correct logical defects in LLM outputs. Extensive experiments show that our approach significantly improves LLMs' logic consistency in structured knowledge reasoning and achieves state-of-the-art performance on multiple KGQA benchmarks.
摘要：大型语言模型（LLM）通过对大量非结构化文本进行预训练，在自然语言推理任务中取得优异的性能，使其能够理解自然语言的逻辑并生成逻辑一致的响应。然而，非结构化知识和结构化知识之间的表征差异使得法学硕士本质上难以保持逻辑一致性，从而导致知识图问答（KGQA）等结构化知识推理任务中的逻辑漂移挑战。现有方法通过设计嵌入提示的复杂工作流程来指导 LLM 推理，从而解决了这一限制。然而，这些方法仅提供输入级指导，未能从根本上解决 LLM 输出中的 \textit{逻辑漂移}。此外，它们的推理工作流程不灵活，无法适应不同的任务和知识图。为了增强法学硕士在结构化知识推理中的逻辑一致性，我们专门针对自回归生成过程的 logits 输出。我们提出了 \textit{Logits-to-Logic} 框架，该框架将 logits 强化和 logits 过滤作为核心模块来纠正 LLM 输出中的逻辑缺陷。大量实验表明，我们的方法显着提高了法学硕士在结构化知识推理中的逻辑一致性，并在多个 KGQA 基准上实现了最先进的性能。

Title: Unified Work Embeddings: Contrastive Learning of a Bidirectional Multi-task Ranker

Authors: Matthias De Lange, Jens-Joris Decorte, Jeroen Van Hautte
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.07969
Pdf URL: https://arxiv.org/pdf/2511.07969
Copy Paste: [[2511.07969]] Unified Work Embeddings: Contrastive Learning of a Bidirectional Multi-task Ranker(https://arxiv.org/abs/2511.07969)
Keywords: prompt
Abstract: Workforce transformation across diverse industries has driven an increased demand for specialized natural language processing capabilities. Nevertheless, tasks derived from work-related contexts inherently reflect real-world complexities, characterized by long-tailed distributions, extreme multi-label target spaces, and scarce data availability. The rise of generalist embedding models prompts the question of their performance in the work domain, especially as progress in the field has focused mainly on individual tasks. To this end, we introduce WorkBench, the first unified evaluation suite spanning six work-related tasks formulated explicitly as ranking problems, establishing a common ground for multi-task progress. Based on this benchmark, we find significant positive cross-task transfer, and use this insight to compose task-specific bipartite graphs from real-world data, synthetically enriched through grounding. This leads to Unified Work Embeddings (UWE), a task-agnostic bi-encoder that exploits our training-data structure with a many-to-many InfoNCE objective, and leverages token-level embeddings with task-agnostic soft late interaction. UWE demonstrates zero-shot ranking performance on unseen target spaces in the work domain, enables low-latency inference by caching the task target space embeddings, and shows significant gains in macro-averaged MAP and RP@10 over generalist embedding models.
摘要：不同行业的劳动力转型推动了对专业自然语言处理能力的需求增加。然而，源自工作相关环境的任务本质上反映了现实世界的复杂性，其特点是长尾分布、极端的多标签目标空间和稀缺的数据可用性。通用嵌入模型的兴起引发了它们在工作领域中表现的问题，特别是当该领域的进展主要集中在个人任务上时。为此，我们引入了 WorkBench，这是第一个统一的评估套件，涵盖六个与工作相关的任务，明确表示为排名问题，为多任务进展建立了共同基础。基于这个基准，我们发现了显着的正向跨任务转移，并利用这种洞察力从现实世界的数据中组成特定于任务的二分图，并通过基础进行综合丰富。这导致了统一工作嵌入（UWE），一种任务无关的双编码器，它利用我们的训练数据结构和多对多 InfoNCE 目标，并利用令牌级嵌入和任务无关的软后期交互。 UWE 展示了工作领域中看不见的目标空间的零样本排名性能，通过缓存任务目标空间嵌入实现低延迟推理，并在宏观平均 MAP 和 RP@10 方面显示出比通用嵌入模型显着的增益。

Title: NOTAM-Evolve: A Knowledge-Guided Self-Evolving Optimization Framework with LLMs for NOTAM Interpretation

Authors: Maoqi Liu, Quan Fang, Yuhao Wu, Can Zhao, Yang Yang, Kaiquan Cai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.07982
Pdf URL: https://arxiv.org/pdf/2511.07982
Copy Paste: [[2511.07982]] NOTAM-Evolve: A Knowledge-Guided Self-Evolving Optimization Framework with LLMs for NOTAM Interpretation(https://arxiv.org/abs/2511.07982)
Keywords: language model, llm
Abstract: Accurate interpretation of Notices to Airmen (NOTAMs) is critical for aviation safety, yet their condensed and cryptic language poses significant challenges to both manual and automated processing. Existing automated systems are typically limited to shallow parsing, failing to extract the actionable intelligence needed for operational decisions. We formalize the complete interpretation task as deep parsing, a dual-reasoning challenge requiring both dynamic knowledge grounding (linking the NOTAM to evolving real-world aeronautical data) and schema-based inference (applying static domain rules to deduce operational status). To tackle this challenge, we propose NOTAM-Evolve, a self-evolving framework that enables a large language model (LLM) to autonomously master complex NOTAM interpretation. Leveraging a knowledge graph-enhanced retrieval module for data grounding, the framework introduces a closed-loop learning process where the LLM progressively improves from its own outputs, minimizing the need for extensive human-annotated reasoning traces. In conjunction with this framework, we introduce a new benchmark dataset of 10,000 expert-annotated NOTAMs. Our experiments demonstrate that NOTAM-Evolve achieves a 30.4% absolute accuracy improvement over the base LLM, establishing a new state of the art on the task of structured NOTAM interpretation.
摘要：准确解释飞行员通知 (NOTAM) 对于航空安全至关重要，但其简洁而神秘的语言给手动和自动处理带来了重大挑战。现有的自动化系统通常仅限于浅层解析，无法提取操作决策所需的可操作情报。我们将完整的解释任务形式化为深度解析，这是一个双重推理挑战，需要动态知识基础（将 NOTAM 与不断发展的现实世界航空数据联系起来）和基于模式的推理（应用静态域规则来推断运行状态）。为了应对这一挑战，我们提出了 NOTAM-Evolve，这是一个自我进化框架，使大型语言模型 (LLM) 能够自主掌握复杂的 NOTAM 解释。该框架利用知识图增强检索模块进行数据基础，引入了闭环学习过程，其中法学硕士从其自身输出中逐步改进，最大限度地减少了对大量人工注释推理痕迹的需求。结合该框架，我们引入了包含 10,000 个专家注释的 NOTAM 的新基准数据集。我们的实验表明，与基础 LLM 相比，NOTAM-Evolve 的绝对精度提高了 30.4%，在结构化 NOTAM 解释任务上建立了新的技术水平。

Title: State of the Art in Text Classification for South Slavic Languages: Fine-Tuning or Prompting?

Authors: Taja Kuzman Pungeršek, Peter Rupnik, Ivan Porupski, Vuk Dinić, Nikola Ljubešić
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.07989
Pdf URL: https://arxiv.org/pdf/2511.07989
Copy Paste: [[2511.07989]] State of the Art in Text Classification for South Slavic Languages: Fine-Tuning or Prompting?(https://arxiv.org/abs/2511.07989)
Keywords: language model, llm, prompt
Abstract: Until recently, fine-tuned BERT-like models provided state-of-the-art performance on text classification tasks. With the rise of instruction-tuned decoder-only models, commonly known as large language models (LLMs), the field has increasingly moved toward zero-shot and few-shot prompting. However, the performance of LLMs on text classification, particularly on less-resourced languages, remains under-explored. In this paper, we evaluate the performance of current language models on text classification tasks across several South Slavic languages. We compare openly available fine-tuned BERT-like models with a selection of open-source and closed-source LLMs across three tasks in three domains: sentiment classification in parliamentary speeches, topic classification in news articles and parliamentary speeches, and genre identification in web texts. Our results show that LLMs demonstrate strong zero-shot performance, often matching or surpassing fine-tuned BERT-like models. Moreover, when used in a zero-shot setup, LLMs perform comparably in South Slavic languages and English. However, we also point out key drawbacks of LLMs, including less predictable outputs, significantly slower inference, and higher computational costs. Due to these limitations, fine-tuned BERT-like models remain a more practical choice for large-scale automatic text annotation.
摘要：直到最近，经过微调的类似 BERT 的模型在文本分类任务上提供了最先进的性能。随着仅指令调整的解码器模型（通常称为大语言模型（LLM））的兴起，该领域越来越多地转向零样本和少样本提示。然而，法学硕士在文本分类方面的表现，特别是在资源匮乏的语言方面的表现仍有待探索。在本文中，我们评估了当前语言模型在几种南斯拉夫语言的文本分类任务上的性能。我们将公开可用的微调 BERT 模型与精选的开源和闭源法学硕士在三个领域的三个任务上进行比较：议会演讲中的情感分类、新闻文章和议会演讲中的主题分类以及网络文本中的流派识别。我们的结果表明，LLM 表现出强大的零样本性能，通常匹配或超越微调的类 BERT 模型。此外，当在零样本设置中使用时，法学硕士在南斯拉夫语言和英语中的表现相当。然而，我们也指出了法学硕士的主要缺点，包括输出可预测性较差、推理速度明显较慢以及计算成本较高。由于这些限制，微调的类 BERT 模型仍然是大规模自动文本注释的更实用的选择。

Title: Self-Correction Distillation for Structured Data Question Answering

Authors: Yushan Zhu, Wen Zhang, Long Jin, Mengshu Sun, Ling Zhong, Zhiqiang Liu, Juan Li, Lei Liang, Chong Long, Chao Deng, Junlan Feng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.07998
Pdf URL: https://arxiv.org/pdf/2511.07998
Copy Paste: [[2511.07998]] Self-Correction Distillation for Structured Data Question Answering(https://arxiv.org/abs/2511.07998)
Keywords: language model, gpt, llm, prompt
Abstract: Structured data question answering (QA), including table QA, Knowledge Graph (KG) QA, and temporal KG QA, is a pivotal research area. Advances in large language models (LLMs) have driven significant progress in unified structural QA frameworks like TrustUQA. However, these frameworks face challenges when applied to small-scale LLMs since small-scale LLMs are prone to errors in generating structured queries. To improve the structured data QA ability of small-scale LLMs, we propose a self-correction distillation (SCD) method. In SCD, an error prompt mechanism (EPM) is designed to detect errors and provide customized error messages during inference, and a two-stage distillation strategy is designed to transfer large-scale LLMs' query-generation and error-correction capabilities to small-scale LLMs. Experiments across 5 benchmarks with 3 structured data types demonstrate that our SCD achieves the best performance and superior generalization on a small-scale LLM (8B) compared to other distillation methods, and closely approaches the performance of GPT4 on some datasets. Furthermore, large-scale LLMs equipped with EPM surpass the state-of-the-art results on most datasets.
摘要：结构化数据问答（QA），包括表 QA、知识图（KG）QA 和时态 KG QA，是一个关键的研究领域。大型语言模型 (LLM) 的进步推动了 TrustUQA 等统一结构 QA 框架的重大进展。然而，这些框架在应用于小型法学硕士时面临挑战，因为小型法学硕士在生成结构化查询时容易出错。为了提高小型法学硕士的结构化数据质量保证能力，我们提出了一种自校正蒸馏（SCD）方法。在SCD中，错误提示机制（EPM）旨在检测错误并在推理过程中提供定制的错误消息，并且设计了两阶段蒸馏策略将大规模LLM的查询生成和纠错能力转移到小型LLM。使用 3 种结构化数据类型进行 5 个基准测试的实验表明，与其他蒸馏方法相比，我们的 SCD 在小规模 LLM (8B) 上实现了最佳性能和卓越的泛化能力，并且在某些数据集上接近 GPT4 的性能。此外，配备 EPM 的大型法学硕士在大多数数据集上都超越了最先进的结果。

Title: HyCoRA: Hyper-Contrastive Role-Adaptive Learning for Role-Playing

Authors: Shihao Yang, Zhicong Lu, Yong Yang, Bo Lv, Yang Shen, Nayu Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.08017
Pdf URL: https://arxiv.org/pdf/2511.08017
Copy Paste: [[2511.08017]] HyCoRA: Hyper-Contrastive Role-Adaptive Learning for Role-Playing(https://arxiv.org/abs/2511.08017)
Keywords: gpt
Abstract: Multi-character role-playing aims to equip models with the capability to simulate diverse roles. Existing methods either use one shared parameterized module across all roles or assign a separate parameterized module to each role. However, the role-shared module may ignore distinct traits of each role, weakening personality learning, while the role-specific module may overlook shared traits across multiple roles, hindering commonality modeling. In this paper, we propose a novel HyCoRA: Hyper-Contrastive Role-Adaptive learning framework, which efficiently improves multi-character role-playing ability by balancing the learning of distinct and shared traits. Specifically, we propose a Hyper-Half Low-Rank Adaptation structure, where one half is a role-specific module generated by a lightweight hyper-network, and the other half is a trainable role-shared module. The role-specific module is devised to represent distinct persona signatures, while the role-shared module serves to capture common traits. Moreover, to better reflect distinct personalities across different roles, we design a hyper-contrastive learning mechanism to help the hyper-network distinguish their unique characteristics. Extensive experimental results on both English and Chinese available benchmarks demonstrate the superiority of our framework. Further GPT-4 evaluations and visual analyses also verify the capability of HyCoRA to capture role characteristics.
摘要：多角色角色扮演旨在使模型具有模拟不同角色的能力。现有方法要么在所有角色中使用一个共享的参数化模块，要么为每个角色分配一个单独的参数化模块。然而，角色共享模块可能会忽略每个角色的独特特征，从而削弱个性学习，而角色特定模块可能会忽略多个角色之间的共享特征，从而阻碍共性建模。在本文中，我们提出了一种新颖的 HyCoRA：超对比角色自适应学习框架，该框架通过平衡不同特征和共享特征的学习来有效提高多角色角色扮演能力。具体来说，我们提出了一种 Hyper-Half 低秩适应结构，其中一半是由轻量级超网络生成的角色特定模块，另一半是可训练的角色共享模块。角色特定模块旨在代表不同的角色签名，而角色共享模块则用于捕获共同特征。此外，为了更好地反映不同角色的不同个性，我们设计了一种超对比学习机制来帮助超网络区分他们的独特特征。在英语和中文可用基准上的大量实验结果证明了我们框架的优越性。进一步的 GPT-4 评估和视觉分析也验证了 HyCoRA 捕捉角色特征的能力。

Title: Estranged Predictions: Measuring Semantic Category Disruption with Masked Language Modelling

Authors: Yuxuan Liu, Haim Dubossarsky, Ruth Ahnert
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.08109
Pdf URL: https://arxiv.org/pdf/2511.08109
Copy Paste: [[2511.08109]] Estranged Predictions: Measuring Semantic Category Disruption with Masked Language Modelling(https://arxiv.org/abs/2511.08109)
Keywords: language model
Abstract: This paper examines how science fiction destabilises ontological categories by measuring conceptual permeability across the terms human, animal, and machine using masked language modelling (MLM). Drawing on corpora of science fiction (Gollancz SF Masterworks) and general fiction (NovelTM), we operationalise Darko Suvin's theory of estrangement as computationally measurable deviation in token prediction, using RoBERTa to generate lexical substitutes for masked referents and classifying them via Gemini. We quantify conceptual slippage through three metrics: retention rate, replacement rate, and entropy, mapping the stability or disruption of category boundaries across genres. Our findings reveal that science fiction exhibits heightened conceptual permeability, particularly around machine referents, which show significant cross-category substitution and dispersion. Human terms, by contrast, maintain semantic coherence and often anchor substitutional hierarchies. These patterns suggest a genre-specific restructuring within anthropocentric logics. We argue that estrangement in science fiction operates as a controlled perturbation of semantic norms, detectable through probabilistic modelling, and that MLMs, when used critically, serve as interpretive instruments capable of surfacing genre-conditioned ontological assumptions. This study contributes to the methodological repertoire of computational literary studies and offers new insights into the linguistic infrastructure of science fiction.
摘要：本文通过使用掩码语言模型 (MLM) 测量人类、动物和机器术语的概念渗透性，探讨科幻小说如何破坏本体论类别。借鉴科幻小说 (Gollancz SF Masterworks) 和一般小说 (NovelTM) 的语料库，我们将 Darko Suvin 的疏远理论操作为标记预测中可计算测量的偏差，使用 RoBERTa 生成屏蔽所指对象的词汇替代品，并通过 Gemini 对它们进行分类。我们通过三个指标来量化概念滑移：保留率、替代率和熵，映射跨流派类别边界的稳定性或破坏。我们的研究结果表明，科幻小说表现出较高的概念渗透性，特别是围绕机器所指对象，表现出显着的跨类别替代和分散。相比之下，人类术语保持语义连贯性，并且经常锚定替代层次结构。这些模式暗示了人类中心逻辑中特定流派的重组。我们认为，科幻小说中的疏远是对语义规范的受控扰动，可以通过概率建模来检测，而传销在批判性地使用时，可以作为解释工具，能够呈现以流派为条件的本体论假设。这项研究为计算文学研究的方法论做出了贡献，并为科幻小说的语言基础设施提供了新的见解。

Title: Multimodal LLMs Do Not Compose Skills Optimally Across Modalities

Authors: Paula Ontalvilla, Aitor Ormazabal, Gorka Azkune
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.08113
Pdf URL: https://arxiv.org/pdf/2511.08113
Copy Paste: [[2511.08113]] Multimodal LLMs Do Not Compose Skills Optimally Across Modalities(https://arxiv.org/abs/2511.08113)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Skill composition is the ability to combine previously learned skills to solve new tasks. As neural networks acquire increasingly complex skills during their pretraining, it is not clear how successfully they can compose them. In this paper, we focus on Multimodal Large Language Models (MLLM), and study their ability to compose skills across modalities. To this end, we design three evaluation tasks which can be solved by sequentially composing two modality-dependent skills, and evaluate several open MLLMs under two main settings: i) prompting the model to directly solve the task, and ii) using a two-step cascaded inference approach, which manually enforces the composition of the two skills for a given task. Even with these straightforward compositions, we find that all evaluated MLLMs exhibit a significant cross-modality skill composition gap. To mitigate the aforementioned gap, we explore two alternatives: i) using chain-of-thought prompting to explicitly instruct MLLMs for skill composition and ii) a specific fine-tuning recipe to promote skill composition. Although those strategies improve model performance, they still exhibit significant skill composition gaps, suggesting that more research is needed to improve cross-modal skill composition in MLLMs.
摘要：技能组合是结合以前学到的技能来解决新任务的能力。随着神经网络在预训练过程中获得越来越复杂的技能，目前尚不清楚它们能否成功地组合这些技能。在本文中，我们重点关注多模态大型语言模型（MLLM），并研究它们跨模态组合技能的能力。为此，我们设计了三个评估任务，可以依次解决组成两个模态相关技能的问题，并在两个主要设置下评估几个开放的 MLLM：i）提示模型直接解决任务，ii）使用两步级联推理方法，手动强制给定任务组合两种技能。即使使用这些简单的组合，我们发现所有评估的 MLLM 都表现出显着的跨模态技能组合差距。为了缩小上述差距，我们探索了两种替代方案：i）使用思维链提示来明确指示 MLLM 进行技能构成；ii）使用特定的微调方法来促进技能构成。尽管这些策略提高了模型性能，但它们仍然表现出显着的技能构成差距，这表明需要更多的研究来改善 MLLM 中的跨模式技能构成。

Title: Quantification and object perception in Multimodal Large Language Models deviate from human linguistic cognition

Authors: Raquel Montero, Natalia Moskvina, Paolo Morosi, Tamara Serrano, Elena Pagliarini, Evelina Leivada
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.08126
Pdf URL: https://arxiv.org/pdf/2511.08126
Copy Paste: [[2511.08126]] Quantification and object perception in Multimodal Large Language Models deviate from human linguistic cognition(https://arxiv.org/abs/2511.08126)
Keywords: language model, llm, agent
Abstract: Quantification has been proven to be a particularly difficult linguistic phenomenon for (Multimodal) Large Language Models (MLLMs). However, given that quantification interfaces with the logic, pragmatic, and numerical domains, the exact reasons for the poor performance are still unclear. This paper looks at three key features of human quantification shared cross-linguistically that have remained so far unexplored in the (M)LLM literature: the ordering of quantifiers into scales, the ranges of use and prototypicality, and the biases inherent in the human approximate number system. The aim is to determine how these features are encoded in the models' architecture, how they may differ from humans, and whether the results are affected by the type of model and language under investigation. We find that there are clear differences between humans and MLLMs with respect to these features across various tasks that tap into the representation of quantification in vivo vs. in silico. This work, thus, paves the way for addressing the nature of MLLMs as semantic and pragmatic agents, while the cross-linguistic lens can elucidate whether their abilities are robust and stable across different languages.
摘要：对于（多模态）大型语言模型（MLLM）来说，量化已被证明是一种特别困难的语言现象。然而，考虑到量化与逻辑、语用和数值领域的接口，性能不佳的确切原因仍不清楚。本文着眼于跨语言共享的人类量化的三个关键特征，这些特征迄今为止在法学硕士文献中尚未探讨：量词的尺度排序、使用范围和原型性，以及人类近似数字系统固有的偏差。目的是确定这些特征如何在模型架构中编码、它们与人类有何不同，以及结果是否受到所研究的模型和语言类型的影响。我们发现，在利用体内定量表示和计算机模拟表示的各种任务中，人类和 MLLM 之间在这些特征方面存在明显差异。因此，这项工作为解决 MLLM 作为语义和语用代理的本质铺平了道路，而跨语言视角可以阐明它们的能力在不同语言中是否强大和稳定。

Title: Sentence-Anchored Gist Compression for Long-Context LLMs

Authors: Dmitrii Tarasov, Elizaveta Goncharova, Kuznetsov Andrey
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.08128
Pdf URL: https://arxiv.org/pdf/2511.08128
Copy Paste: [[2511.08128]] Sentence-Anchored Gist Compression for Long-Context LLMs(https://arxiv.org/abs/2511.08128)
Keywords: language model, llm
Abstract: This work investigates context compression for Large Language Models (LLMs) using learned compression tokens to reduce the memory and computational demands of processing long sequences. We demonstrate that pre-trained LLMs can be fine-tuned to compress their context by factors of 2x to 8x without significant performance degradation, as evaluated on both short-context and long-context benchmarks. Furthermore, in experiments on a 3-billion-parameter LLaMA model, our method achieves results on par with alternative compression techniques while attaining higher compression ratios.
摘要：这项工作研究了大型语言模型 (LLM) 的上下文压缩，使用学习的压缩标记来减少处理长序列的内存和计算需求。我们证明，根据短上下文和长上下文基准的评估，预训练的 LLM 可以进行微调，将其上下文压缩 2 倍到 8 倍，而不会显着降低性能。此外，在 30 亿参数 LLaMA 模型的实验中，我们的方法取得了与替代压缩技术相当的结果，同时获得了更高的压缩比。

Title: On the Interplay between Positional Encodings, Morphological Complexity, and Word Order Flexibility

Authors: Kushal Tatariya, Wessel Poelman, Miryam de Lhoneux
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.08139
Pdf URL: https://arxiv.org/pdf/2511.08139
Copy Paste: [[2511.08139]] On the Interplay between Positional Encodings, Morphological Complexity, and Word Order Flexibility(https://arxiv.org/abs/2511.08139)
Keywords: language model
Abstract: Language model architectures are predominantly first created for English and subsequently applied to other languages. It is an open question whether this architectural bias leads to degraded performance for languages that are structurally different from English. We examine one specific architectural choice, positional encodings, through the lens of the trade-off hypothesis: the supposed interplay between morphological complexity and word order flexibility. This hypothesis posits a trade-off between the two: a more morphologically complex language can have a more flexible word order, and vice-versa. Positional encodings are a direct target to investigate the implications of this hypothesis in relation to language modelling. We pretrain monolingual model variants with absolute, relative, and no positional encodings for seven typologically diverse languages and evaluate them on four downstream tasks. Contrary to previous findings, we do not observe a clear interaction between position encodings and morphological complexity or word order flexibility, as measured by various proxies. Our results show that the choice of tasks, languages, and metrics is essential for drawing stable conclusions.
摘要：语言模型架构主要首先是为英语创建的，然后应用于其他语言。这种架构偏差是否会导致结构上不同于英语的语言的性能下降，这是一个悬而未决的问题。我们通过权衡假设的视角来研究一种特定的架构选择：位置编码：形态复杂性和词序灵活性之间假定的相互作用。该假设提出了两者之间的权衡：形态更复杂的语言可以具有更灵活的词序，反之亦然。位置编码是研究该假设与语言建模相关的影响的直接目标。我们对七种类型不同的语言使用绝对、相对和无位置编码来预训练单语言模型变体，并在四个下游任务中对其进行评估。与之前的发现相反，我们没有观察到位置编码与形态复杂性或词序灵活性之间存在明显的相互作用（通过各种代理测量）。我们的结果表明，任务、语言和指标的选择对于得出稳定的结论至关重要

Title: Relation as a Prior: A Novel Paradigm for LLM-based Document-level Relation Extraction

Authors: Qiankun Pi, Yepeng Sun, Jicang Lu, Qinlong Fan, Ningbo Huang, Shiyu Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.08143
Pdf URL: https://arxiv.org/pdf/2511.08143
Copy Paste: [[2511.08143]] Relation as a Prior: A Novel Paradigm for LLM-based Document-level Relation Extraction(https://arxiv.org/abs/2511.08143)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated their remarkable capabilities in document understanding. However, recent research reveals that LLMs still exhibit performance gaps in Document-level Relation Extraction (DocRE), which requires fine-grained comprehension. The commonly adopted "extract entities then predict relations" paradigm in LLM-based methods leads to these gaps for two main reasons: (1) Numerous unrelated entity pairs introduce noise and interfere with the relation prediction for truly related entity pairs. (2) Although LLMs have identified semantic associations between entities, relation labels beyond the predefined set are still treated as prediction errors. To address these challenges, we propose a novel Relation as a Prior (RelPrior) paradigm for LLM-based DocRE. For challenge (1), RelPrior utilizes binary relation as a prior to extract and determine whether two entities are correlated, thereby filtering out irrelevant entity pairs and reducing prediction noise. For challenge (2), RelPrior utilizes predefined relation as a prior to match entities for triples extraction instead of directly predicting relation. Thus, it avoids misjudgment caused by strict predefined relation labeling. Extensive experiments on two benchmarks demonstrate that RelPrior achieves state-of-the-art performance, surpassing existing LLM-based methods.
摘要：大型语言模型（LLM）已经展示了其在文档理解方面的卓越能力。然而，最近的研究表明，法学硕士在文档级关系提取（DocRE）方面仍然存在性能差距，因为需要细粒度的理解。基于LLM的方法中普遍采用的“提取实体然后预测关系”范式导致这些差距的主要原因有两个：（1）大量不相关的实体对引入噪声并干扰真正相关实体对的关系预测。 (2)虽然法学硕士已经识别了实体之间的语义关联，但超出预定义集合的关系标签仍然被视为预测错误。为了应对这些挑战，我们为基于 LLM 的 DocRE 提出了一种新颖的关系作为先验 (RelPrior) 范例。对于挑战（1），RelPrior利用二元关系作为先验来提取并确定两个实体是否相关，从而过滤掉不相关的实体对并减少预测噪声。对于挑战（2），RelPrior 利用预定义关系作为匹配实体的先验来进行三元组提取，而不是直接预测关系。这样就避免了严格预定义的关系标注造成的误判。对两个基准的大量实验表明，RelPrior 实现了最先进的性能，超越了现有的基于 LLM 的方法。

Title: Still Not There: Can LLMs Outperform Smaller Task-Specific Seq2Seq Models on the Poetry-to-Prose Conversion Task?

Authors: Kunal Kingkar Das, Manoj Balaji Jagadeeshan, Nallani Chakravartula Sahith, Jivnesh Sandhan, Pawan Goyal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.08145
Pdf URL: https://arxiv.org/pdf/2511.08145
Copy Paste: [[2511.08145]] Still Not There: Can LLMs Outperform Smaller Task-Specific Seq2Seq Models on the Poetry-to-Prose Conversion Task?(https://arxiv.org/abs/2511.08145)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are increasingly treated as universal, general-purpose solutions across NLP tasks, particularly in English. But does this assumption hold for low-resource, morphologically rich languages such as Sanskrit? We address this question by comparing instruction-tuned and in-context-prompted LLMs with smaller task-specific encoder-decoder models on the Sanskrit poetry-to-prose conversion task. This task is intrinsically challenging: Sanskrit verse exhibits free word order combined with rigid metrical constraints, and its conversion to canonical prose (anvaya) requires multi-step reasoning involving compound segmentation, dependency resolution, and syntactic linearisation. This makes it an ideal testbed to evaluate whether LLMs can surpass specialised models. For LLMs, we apply instruction fine-tuning on general-purpose models and design in-context learning templates grounded in Paninian grammar and classical commentary heuristics. For task-specific modelling, we fully fine-tune a ByT5-Sanskrit Seq2Seq model. Our experiments show that domain-specific fine-tuning of ByT5-Sanskrit significantly outperforms all instruction-driven LLM approaches. Human evaluation strongly corroborates this result, with scores exhibiting high correlation with Kendall's Tau scores. Additionally, our prompting strategies provide an alternative to fine-tuning when domain-specific verse corpora are unavailable, and the task-specific Seq2Seq model demonstrates robust generalisation on out-of-domain evaluations.
摘要：大型语言模型 (LLM) 越来越多地被视为跨 NLP 任务（尤其是英语）的通用、通用解决方案。但这个假设对于资源匮乏、形态丰富的语言（例如梵语）是否成立？我们通过将指令调整和上下文提示的法学硕士与梵文诗歌到散文转换任务中较小的特定任务编码器-解码器模型进行比较来解决这个问题。这项任务本质上是具有挑战性的：梵文诗句展现出自由的词序与严格的格律约束相结合，并且将其转换为规范散文（anvaya）需要涉及复合分段、依赖解析和句法线性化的多步骤推理。这使其成为评估法学硕士是否可以超越专业模型的理想测试平台。对于法学硕士，我们对通用模型进行指令微调，并设计基于潘尼尼亚语法和经典评论启发法的上下文学习模板。对于特定于任务的建模，我们完全微调了 ByT5-Sanskrit Seq2Seq 模型。我们的实验表明，ByT5-Sanskrit 的特定领域微调显着优于所有指令驱动的 LLM 方法。人类评估强烈证实了这一结果，分数与 Kendall 的 Tau 分数表现出高度相关性。此外，当特定领域的诗句语料库不可用时，我们的提示策略提供了微调的替代方案，并且特定于任务的 Seq2Seq 模型展示了对域外评估的稳健泛化。

Title: Do Syntactic Categories Help in Developmentally Motivated Curriculum Learning for Language Models?

Authors: Arzu Burcu Güven, Anna Rogers, Rob van der Goot
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.08199
Pdf URL: https://arxiv.org/pdf/2511.08199
Copy Paste: [[2511.08199]] Do Syntactic Categories Help in Developmentally Motivated Curriculum Learning for Language Models?(https://arxiv.org/abs/2511.08199)
Keywords: language model
Abstract: We examine the syntactic properties of the BabyLM corpus and of age groups within CHILDES. While we find that CHILDES does not exhibit strong syntactic differentiation by age, we show that syntactic knowledge about the training data can be helpful in interpreting model performance on linguistic tasks. For curriculum learning, we explore developmental and several alternative cognitively inspired curriculum approaches. We find that some curricula help with reading tasks, but the main performance improvement comes from using the subset of syntactically categorizable data, rather than the full noisy corpus.
摘要：我们检查 BabyLM 语料库和 CHILDES 中年龄组的句法属性。虽然我们发现 CHILDES 并没有根据年龄表现出强烈的句法差异，但我们表明有关训练数据的句法知识有助于解释模型在语言任务上的表现。对于课程学习，我们探索发展性和几种替代的认知启发课程方法。我们发现一些课程有助于阅读任务，但主要的性能改进来自于使用语法可分类数据的子集，而不是完整的噪声语料库。

Title: Encoder Fine-tuning with Stochastic Sampling Outperforms Open-weight GPT in Astronomy Knowledge Extraction

Authors: Shivam Rawat, Lucie Flek, Akbar Karimi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.08204
Pdf URL: https://arxiv.org/pdf/2511.08204
Copy Paste: [[2511.08204]] Encoder Fine-tuning with Stochastic Sampling Outperforms Open-weight GPT in Astronomy Knowledge Extraction(https://arxiv.org/abs/2511.08204)
Keywords: gpt
Abstract: Scientific literature in astronomy is rapidly expanding, making it increasingly important to automate the extraction of key entities and contextual information from research papers. In this paper, we present an encoder-based system for extracting knowledge from astronomy articles. Our objective is to develop models capable of classifying telescope references, detecting auxiliary semantic attributes, and recognizing instrument mentions from textual content. To this end, we implement a multi-task transformer-based system built upon the SciBERT model and fine-tuned for astronomy corpora classification. To carry out the fine-tuning, we stochastically sample segments from the training data and use majority voting over the test segments at inference time. Our system, despite its simplicity and low-cost implementation, significantly outperforms the open-weight GPT baseline.
摘要：天文学中的科学文献正在迅速扩展，这使得从研究论文中自动提取关键实体和上下文信息变得越来越重要。在本文中，我们提出了一种基于编码器的系统，用于从天文学文章中提取知识。我们的目标是开发能够对望远镜参考进行分类、检测辅助语义属性以及从文本内容中识别仪器提及的模型。为此，我们实现了一个基于 SciBERT 模型的多任务 Transformer 系统，并针对天文学语料库分类进行了微调。为了进行微调，我们从训练数据中随机采样片段，并在推理时对测试片段使用多数投票。尽管我们的系统简单且实施成本低，但其性能明显优于开放权重 GPT 基线。
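
The segment sampling and majority voting described above could look roughly like the sketch below. It is a sketch under stated assumptions, not the authors' implementation: the segment length, the non-overlapping split at test time, the tie-breaking rule, and the keyword-based `toy_classifier` (a stand-in for the fine-tuned encoder) are all hypothetical.

```python
import random
from collections import Counter


def sample_training_segments(tokens, segment_len, num_segments, rng):
    """Draw random contiguous segments from one training document."""
    if len(tokens) <= segment_len:
        return [list(tokens)]
    max_start = len(tokens) - segment_len
    starts = [rng.randint(0, max_start) for _ in range(num_segments)]
    return [tokens[s:s + segment_len] for s in starts]


def split_test_segments(tokens, segment_len):
    """Cut a test document into consecutive, non-overlapping segments."""
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]


def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def classify_document(tokens, segment_len, classify_segment):
    """Label each test segment, then aggregate by majority vote."""
    segments = split_test_segments(tokens, segment_len)
    return majority_vote([classify_segment(seg) for seg in segments])


# Toy stand-in for a fine-tuned encoder (hypothetical keyword rule).
def toy_classifier(segment):
    return "telescope" if "hubble" in segment else "other"


doc = "we used hubble data and hubble images from the hubble archive".split()
train_segments = sample_training_segments(doc, 4, 3, random.Random(0))
prediction = classify_document(doc, 4, toy_classifier)
```

Voting over segments lets a model with a fixed input length label a document longer than that length, since no single segment has to carry the whole decision.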

Title: Benchmarking Educational LLMs with Analytics: A Case Study on Gender Bias in Feedback

Authors: Yishan Du, Conrad Borchers, Mutlu Cukurova
Subjects: cs.CL, cs.AI, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2511.08225
Pdf URL: https://arxiv.org/pdf/2511.08225
Copy Paste: [[2511.08225]] Benchmarking Educational LLMs with Analytics: A Case Study on Gender Bias in Feedback(https://arxiv.org/abs/2511.08225)
Keywords: language model, gpt, llm, prompt
Abstract: As teachers increasingly turn to GenAI in their educational practice, we need robust methods to benchmark large language models (LLMs) for pedagogical purposes. This article presents an embedding-based benchmarking framework to detect bias in LLMs in the context of formative feedback. Using 600 authentic student essays from the AES 2.0 corpus, we constructed controlled counterfactuals along two dimensions: (i) implicit cues via lexicon-based swaps of gendered terms within essays, and (ii) explicit cues via gendered author background in the prompt. We investigated six representative LLMs (i.e. GPT-5 mini, GPT-4o mini, DeepSeek-R1, DeepSeek-R1-Qwen, Gemini 2.5 Pro, Llama-3-8B). We first quantified the response divergence with cosine and Euclidean distances over sentence embeddings, then assessed significance via permutation tests, and finally, visualised structure using dimensionality reduction. In all models, implicit manipulations reliably induced larger semantic shifts for male-female counterfactuals than for female-male. Only the GPT and Llama models showed sensitivity to explicit gender cues. These findings show that even state-of-the-art LLMs exhibit asymmetric semantic responses to gender substitutions, suggesting persistent gender biases in feedback they provide learners. Qualitative analyses further revealed consistent linguistic differences (e.g., more autonomy-supportive feedback under male cues vs. more controlling feedback under female cues). We discuss implications for fairness auditing of pedagogical GenAI, propose reporting standards for counterfactual evaluation in learning analytics, and outline practical guidance for prompt design and deployment to safeguard equitable feedback.
摘要：随着教师在教育实践中越来越多地转向 GenAI，我们需要强大的方法来对大型语言模型 (LLM) 进行基准测试，以达到教学目的。本文提出了一个基于嵌入的基准测试框架，用于检测法学硕士在形成性反馈背景下的偏见。使用 AES 2.0 语料库中的 600 篇真实的学生论文，我们沿着两个维度构建了受控的反事实：(i) 通过论文中基于词典的性别术语交换来隐式提示，以及 (ii) 通过提示中的性别作者背景来显式提示。我们研究了六种代表性的法学硕士（即 GPT-5 mini、GPT-4o mini、DeepSeek-R1、DeepSeek-R1-Qwen、Gemini 2.5 Pro、Llama-3-8B）。我们首先用余弦和欧几里得距离对句子嵌入的响应散度进行量化，然后通过排列测试评估显着性，最后使用降维来可视化结构。在所有模型中，隐式操作可靠地引起了男性-女性反事实比女性-男性更大的语义转变。只有 GPT 和 Llama 模型对明确的性别线索表现出敏感性。这些发现表明，即使是最先进的法学硕士也对性别替代表现出不对称的语义反应，这表明他们向学习者提供的反馈中持续存在性别偏见。定性分析进一步揭示了一致的语言差异（例如，男性暗示下更多的自主支持反馈与女性暗示下更多的控制性反馈）。我们讨论了教学 GenAI 公平性审核的影响，提出了学习分析中反事实评估的报告标准，并概述了及时设计和部署的实用指南，以保障公平的反馈。
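
The divergence measures and the significance test named above can be sketched in plain Python. This is an illustration under assumptions: the abstract does not specify the embedding model, the test statistic, or the number of permutations, so the difference-of-means statistic, the default of 10,000 permutations, and the toy inputs here are hypothetical.

```python
import math
import random


def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b, num_permutations=10000, seed=0):
    """Two-sided permutation test on the absolute difference of means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)  # randomly reassign the group labels
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (num_permutations + 1)


# Hypothetical per-essay distances for two counterfactual directions.
shifts_a = [0.0, 0.1, 0.2, 0.1, 0.0, 0.2]
shifts_b = [5.0, 5.1, 5.2, 5.1, 5.0, 5.2]
p_value = permutation_test(shifts_a, shifts_b, num_permutations=2000)
```

A permutation test makes no distributional assumption about the distances, which may be why it suits embedding-based comparisons of this kind.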

Title: VocalBench-zh: Decomposing and Benchmarking the Speech Conversational Abilities in Mandarin Context

Authors: Heyang Liu, Ziyang Cheng, Yuhao Wang, Hongcheng Liu, Yiqi Li, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.08230
Pdf URL: https://arxiv.org/pdf/2511.08230
Copy Paste: [[2511.08230]] VocalBench-zh: Decomposing and Benchmarking the Speech Conversational Abilities in Mandarin Context(https://arxiv.org/abs/2511.08230)
Keywords: language model, llm
Abstract: The development of multi-modal large language models (LLMs) leads to intelligent approaches capable of speech interactions. As one of the most widely spoken languages globally, Mandarin is supported by most models to enhance their applicability and reach. However, the scarcity of comprehensive speech-to-speech (S2S) benchmarks in Mandarin contexts impedes systematic evaluation for developers and hinders fair model comparison for users. In this work, we propose VocalBench-zh, an ability-level divided evaluation suite adapted to Mandarin context consisting of 10 well-crafted subsets and over 10K high-quality instances, covering 12 user-oriented characters. The evaluation experiment on 14 mainstream models reveals the common challenges for current routes, and highlights the need for new insights into next-generation speech interactive systems. The evaluation codes and datasets will be available at this https URL.
摘要：多模态大语言模型（LLM）的发展带来了能够进行语音交互的智能方法。作为全球使用最广泛的语言之一，普通话得到大多数模型的支持，以增强其适用性和影响范围。然而，普通话环境下全面的语音到语音（S2S）基准的缺乏阻碍了开发者的系统评估，也阻碍了用户公平的模型比较。在这项工作中，我们提出了 VocalBench-zh，这是一个适合普通话语境的能力级别划分评估套件，由 10 个精心设计的子集和超过 10K 的高质量实例组成，涵盖 12 个面向用户的字符。对14种主流模型的评估实验揭示了当前路线的共同挑战，并强调需要对下一代语音交互系统有新的见解。评估代码和数据集将在此 https URL 中提供。

Title: Prompt Tuning for Natural Language to SQL with Embedding Fine-Tuning and RAG

Authors: Jisoo Jang, Tien-Cuong Bui, Yunjun Choi, Wen-Syan Li
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.08245
Pdf URL: https://arxiv.org/pdf/2511.08245
Copy Paste: [[2511.08245]] Prompt Tuning for Natural Language to SQL with Embedding Fine-Tuning and RAG(https://arxiv.org/abs/2511.08245)
Keywords: llm, prompt
Abstract: This paper introduces Error Correction through Prompt Tuning for NL-to-SQL, leveraging the latest advancements in generative pre-training-based LLMs and RAG. Our work addresses the crucial need, amid the growing use of natural language interfaces, for efficient and accurate translation of natural language queries into SQL expressions in various settings. We explore the evolution of natural language interfaces to databases (NLIDBs) from early rule-based systems to advanced neural network-driven approaches. Drawing inspiration from the medical diagnostic process, we propose a novel framework integrating an error correction mechanism that diagnoses error types, identifies their causes, provides fixing instructions, and applies these corrections to SQL queries. This approach is further enriched by embedding fine-tuning and RAG, which harnesses external knowledge bases for improved accuracy and transparency. Through comprehensive experiments, we demonstrate that our framework achieves a significant 12 percent accuracy improvement over existing baselines, highlighting its potential to revolutionize data access and handling in contemporary data-driven environments.
摘要：本文介绍了通过即时调优对 NL 到 SQL 进行纠错，利用基于生成预训练的 LLM 和 RAG 的最新进展。随着自然语言界面的使用不断增加，我们的工作解决了在各种环境下将自然语言查询高效、准确地转换为 SQL 表达式的迫切需求。我们探索 NLIDB 从早期基于规则的系统到先进的神经网络驱动方法的演变。从医疗诊断过程中汲取灵感，我们提出了一种集成纠错机制的新颖框架，该机制可以诊断错误类型，识别其原因，提供修复指令，并将这些更正应用于 SQL 查询。通过嵌入微调和 RAG 进一步丰富了这种方法，RAG 利用外部知识库来提高准确性和透明度。通过全面的实验，我们证明我们的框架比现有基线准确率显着提高了 12%，凸显了其在当代数据驱动环境中彻底改变数据访问和处理的潜力。

Title: ParliaBench: An Evaluation and Benchmarking Framework for LLM-Generated Parliamentary Speech

Authors: Marios Koniaris, Argyro Tsipi, Panayiotis Tsanakas
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2511.08247
Pdf URL: https://arxiv.org/pdf/2511.08247
Copy Paste: [[2511.08247]] ParliaBench: An Evaluation and Benchmarking Framework for LLM-Generated Parliamentary Speech(https://arxiv.org/abs/2511.08247)
Keywords: language model, llm
Abstract: Parliamentary speech generation presents specific challenges for large language models beyond standard text generation tasks. Unlike general text generation, parliamentary speeches require not only linguistic quality but also political authenticity and ideological consistency. Current language models lack specialized training for parliamentary contexts, and existing evaluation methods focus on standard NLP metrics rather than political authenticity. To address this, we present ParliaBench, a benchmark for parliamentary speech generation. We constructed a dataset of speeches from UK Parliament to enable systematic model training. We introduce an evaluation framework combining computational metrics with LLM-as-a-judge assessments for measuring generation quality across three dimensions: linguistic quality, semantic coherence, and political authenticity. We propose two novel embedding-based metrics, Political Spectrum Alignment and Party Alignment, to quantify ideological positioning. We fine-tuned five large language models (LLMs), generated 28k speeches, and evaluated them using our framework, comparing baseline and fine-tuned models. Results show that fine-tuning produces statistically significant improvements across the majority of metrics and our novel metrics demonstrate strong discriminative power for political dimensions.
摘要：议会语音生成对标准文本生成任务之外的大型语言模型提出了特定的挑战。与一般文本生成不同，议会演讲不仅需要语言质量，还需要政治真实性和意识形态一致性。当前的语言模型缺乏针对议会环境的专门训练，现有的评估方法侧重于标准 NLP 指标而不是政治真实性。为了解决这个问题，我们推出了 ParliaBench，这是议会演讲生成的基准。我们构建了英国议会演讲的数据集，以实现系统的模型训练。我们引入了一个评估框架，将计算指标与法学硕士法官评估相结合，用于衡量三个维度的生成质量：语言质量、语义连贯性和政治真实性。我们提出了两种新颖的基于嵌入的指标：政治谱系联盟和政党联盟，来量化意识形态定位。我们对五个大型语言模型 (LLM) 进行了微调，生成了 28k 个演讲，并使用我们的框架对其进行了评估，比较了基线模型和微调模型。结果表明，微调对大多数指标产生了统计上显着的改善，并且我们的新颖指标显示出对政治维度的强大区分能力。

Title: Hierarchical structure understanding in complex tables with VLLMs: a benchmark and experiments

Authors: Luca Bindini, Simone Giovannini, Simone Marinai, Valeria Nardoni, Kimiya Noor Ali
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.08298
Pdf URL: https://arxiv.org/pdf/2511.08298
Copy Paste: [[2511.08298]] Hierarchical structure understanding in complex tables with VLLMs: a benchmark and experiments(https://arxiv.org/abs/2511.08298)
Keywords: language model, llm, prompt
Abstract: This work investigates the ability of Vision Large Language Models (VLLMs) to understand and interpret the structure of tables in scientific articles. Specifically, we explore whether VLLMs can infer the hierarchical structure of tables without additional processing. As a basis for our experiments we use the PubTables-1M dataset, a large-scale corpus of scientific tables. From this dataset, we extract a subset of tables that we introduce as Complex Hierarchical Tables (CHiTab): a benchmark collection of complex tables containing hierarchical headings. We adopt a series of prompt engineering strategies to probe the models' comprehension capabilities, experimenting with various prompt formats and writing styles. Multiple state-of-the-art open-weights VLLMs are evaluated on the benchmark, first in their off-the-shelf versions and then after fine-tuning some models on our task. We also measure human performance on the task for a small set of tables and compare it with the performance of the evaluated VLLMs. The experiments support our intuition that generic VLLMs, not explicitly designed for understanding the structure of tables, can perform this task. This study provides insights into the potential and limitations of VLLMs to process complex tables and offers guidance for future work on integrating structured data understanding into general-purpose VLLMs.
摘要：这项工作研究了视觉大型语言模型 (VLLM) 理解和解释科学文章中表格结构的能力。具体来说，我们探索 VLLM 是否可以在无需额外处理的情况下推断出表的层次结构。作为我们实验的基础，我们使用 PubTables-1M 数据集，这是一个大型科学表格语料库。从该数据集中，我们提取了表的子集，将其引入为复杂层次表 (CHiTab)：包含层次标题的复杂表的基准集合。我们采用一系列提示工程策略来探索模型的理解能力，尝试各种提示格式和写作风格。首先使用现成版本对多个最先进的开放权重 VLLM 在基准上进行评估，然后根据我们的任务微调一些模型。我们还测量人类在一小部分表上解决任务的性能，并与评估的 VLLM 的性能进行比较。这些实验支持我们的直觉，即通用 VLLM（未明确设计用于理解表结构）可以执行此任务。这项研究深入了解了 VLLM 处理复杂表的潜力和局限性，并为未来将结构化数据理解集成到通用 VLLM 中的工作提供了指导。

Title: Automatic Paper Reviewing with Heterogeneous Graph Reasoning over LLM-Simulated Reviewer-Author Debates

Authors: Shuaimin Li, Liyang Fan, Yufang Lin, Zeyang Li, Xian Wei, Shiwen Ni, Hamid Alinejad-Rokny, Min Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.08317
Pdf URL: https://arxiv.org/pdf/2511.08317
Copy Paste: [[2511.08317]] Automatic Paper Reviewing with Heterogeneous Graph Reasoning over LLM-Simulated Reviewer-Author Debates(https://arxiv.org/abs/2511.08317)
Keywords: language model, llm, hallucination, agent
Abstract: Existing paper review methods often rely on superficial manuscript features or directly on large language models (LLMs), which are prone to hallucinations, biased scoring, and limited reasoning capabilities. Moreover, these methods often fail to capture the complex argumentative reasoning and negotiation dynamics inherent in reviewer-author interactions. To address these limitations, we propose ReViewGraph (Reviewer-Author Debates Graph Reasoner), a novel framework that performs heterogeneous graph reasoning over LLM-simulated multi-round reviewer-author debates. In our approach, reviewer-author exchanges are simulated through LLM-based multi-agent collaboration. Diverse opinion relations (e.g., acceptance, rejection, clarification, and compromise) are then explicitly extracted and encoded as typed edges within a heterogeneous interaction graph. By applying graph neural networks to reason over these structured debate graphs, ReViewGraph captures fine-grained argumentative dynamics and enables more informed review decisions. Extensive experiments on three datasets demonstrate that ReViewGraph outperforms strong baselines with an average relative improvement of 15.73%, underscoring the value of modeling detailed reviewer-author debate structures.
摘要：现有的论文评审方法往往依赖于肤浅的稿件特征或直接依赖于大型语言模型（LLM），这些模型容易产生幻觉、评分出现偏差且推理能力有限。此外，这些方法通常无法捕捉审稿人与作者互动中固有的复杂的论证推理和谈判动态。为了解决这些限制，我们提出了 ReViewGraph（审稿人-作者辩论图推理器），这是一种新颖的框架，可以对 LLM 模拟的多轮审稿人-作者辩论执行异构图推理。在我们的方法中，审稿人与作者之间的交流是通过基于 LLM 的多智能体协作来模拟的。然后，不同的意见关系（例如，接受、拒绝、澄清和妥协）被显式提取并编码为异构交互图中的类型边。通过应用图神经网络对这些结构化辩论图进行推理，ReViewGraph 捕获了细粒度的争论动态，并实现了更明智的评审决策。对三个数据集的广泛实验表明，ReViewGraph 的平均相对改进为 15.73%，优于强大的基线，强调了对详细审稿人-作者辩论结构进行建模的价值。

Title: Adaptive Multi-Agent Response Refinement in Conversational Systems

Authors: Soyeong Jeong, Aparna Elangovan, Emine Yilmaz, Oleg Rokhlenko
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2511.08319
Pdf URL: https://arxiv.org/pdf/2511.08319
Copy Paste: [[2511.08319]] Adaptive Multi-Agent Response Refinement in Conversational Systems(https://arxiv.org/abs/2511.08319)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) have demonstrated remarkable success in conversational systems by generating human-like responses. However, they can fall short, especially when required to account for personalization or specific knowledge. In real-life settings, it is impractical to rely on users to detect these errors and request a new response. One way to address this problem is to refine the response before returning it to the user. While existing approaches focus on refining responses within a single LLM, this method struggles to consider diverse aspects needed for effective conversations. In this work, we propose refining responses through a multi-agent framework, where each agent is assigned a specific role for each aspect. We focus on three key aspects crucial to conversational quality: factuality, personalization, and coherence. Each agent is responsible for reviewing and refining one of these aspects, and their feedback is then merged to improve the overall response. To enhance collaboration among them, we introduce a dynamic communication strategy. Instead of following a fixed sequence of agents, our approach adaptively selects and coordinates the most relevant agents based on the specific requirements of each query. We validate our framework on challenging conversational datasets, demonstrating that ours significantly outperforms relevant baselines, particularly in tasks involving knowledge or user's persona, or both.
摘要：大型语言模型 (LLM) 通过生成类似人类的响应，在对话系统中取得了显着的成功。然而，它们可能会达不到要求，尤其是在需要考虑个性化或特定知识时。在现实生活中，依靠用户来检测这些错误并请求新的响应是不切实际的。解决此问题的一种方法是在将响应返回给用户之前对其进行优化。虽然现有的方法侧重于在单个法学硕士内完善回答，但这种方法很难考虑有效对话所需的不同方面。在这项工作中，我们建议通过多代理框架来完善响应，其中每个代理在每个方面都被分配了特定的角色。我们关注对对话质量至关重要的三个关键方面：事实性、个性化和连贯性。每个代理负责审查和完善其中一个方面，然后合并他们的反馈以改进整体响应。为了加强他们之间的协作，我们引入了动态沟通策略。我们的方法不是遵循固定的代理序列，而是根据每个查询的具体要求自适应地选择和协调最相关的代理。我们在具有挑战性的会话数据集上验证了我们的框架，证明我们的框架显着优于相关基线，特别是在涉及知识或用户角色或两者的任务中。

Title: AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress

Authors: Zhiheng Xi, Chenyang Liao, Guanyu Li, Yajie Yang, Wenxiang Chen, Zhihao Zhang, Binghai Wang, Senjie Jin, Yuhao Zhou, Jian Guan, Wei Wu, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2511.08325
Pdf URL: https://arxiv.org/pdf/2511.08325
Copy Paste: [[2511.08325]] AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress(https://arxiv.org/abs/2511.08325)
Keywords: language model, llm, prompt, agent
Abstract: Despite rapid development, large language models (LLMs) still encounter challenges in multi-turn decision-making tasks (i.e., agent tasks) like web shopping and browser navigation, which require making a sequence of intelligent decisions based on environmental feedback. Previous work for LLM agents typically relies on elaborate prompt engineering or fine-tuning with expert trajectories to improve performance. In this work, we take a different perspective: we explore constructing process reward models (PRMs) to evaluate each decision and guide the agent's decision-making process. Unlike LLM reasoning, where each step is scored based on correctness, actions in agent tasks do not have a clear-cut correctness. Instead, they should be evaluated based on their proximity to the goal and the progress they have made. Building on this insight, we propose a re-defined PRM for agent tasks, named AgentPRM, to capture both the interdependence between sequential decisions and their contribution to the final goal. This enables better progress tracking and exploration-exploitation balance. To scalably obtain labeled data for training AgentPRM, we employ a Temporal Difference-based (TD-based) estimation method combined with Generalized Advantage Estimation (GAE), which proves more sample-efficient than prior methods. Extensive experiments across different agentic tasks show that AgentPRM is over 8× more compute-efficient than baselines, and it demonstrates robust improvement when scaling up test-time compute. Moreover, we perform detailed analyses to show how our method works and offer more insights, e.g., applying AgentPRM to the reinforcement learning of LLM agents.
摘要：尽管发展迅速，大型语言模型（LLM）在网络购物和浏览器导航等多轮决策任务（即代理任务）中仍然遇到挑战，这些任务需要根据环境反馈做出一系列智能决策。 LLM 代理之前的工作通常依赖于精心设计的即时工程或根据专家轨迹进行微调来提高性能。在这项工作中，我们采取了不同的视角：我们探索构建过程奖励模型（PRM）来评估每个决策并指导代理的决策过程。与 LLM 推理不同，每个步骤都根据正确性进行评分，代理任务中的操作没有明确的正确性。相反，应该根据他们与目标的接近程度和取得的进展来评估他们。基于这一见解，我们提出了一种重新定义的代理任务 PRM，名为 AgentPRM，以捕获顺序决策之间的相互依赖关系及其对最终目标的贡献。这可以实现更好的进度跟踪和探索-利用平衡。为了可扩展地获取用于训练 AgentPRM 的标记数据，我们采用了基于时间差（TD）的估计方法与广义优势估计（GAE）相结合，这证明比先前的方法更具样本效率。跨不同代理任务的大量实验表明，AgentPRM 的计算效率比基准高出 8 倍以上，并且在扩展测试时计算时表现出强大的改进。此外，我们还进行了详细的分析，以展示我们的方法是如何工作的，并提供更多见解，例如，将 AgentPRM 应用于 LLM 代理的强化学习。
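
For readers unfamiliar with the estimator named above, the following is a minimal sketch of standard Generalized Advantage Estimation built on one-step TD errors. It shows the textbook recursion only; how AgentPRM turns these quantities into training labels is not stated in the abstract, and the function name, the default `gamma` and `lam` values, and the toy trajectory are assumptions.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    `values` must have len(rewards) + 1 entries: the final entry is the
    bootstrap value of the state reached after the last action (0.0 if
    the episode terminated there).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error for step t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of the current and later TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Toy three-step agent episode with a single terminal reward.
rewards = [0.0, 0.0, 1.0]
values = [0.0, 0.0, 0.0, 0.0]
advantages = gae_advantages(rewards, values, gamma=0.5, lam=1.0)
# Value targets are commonly formed as advantage plus current value estimate.
value_targets = [a + v for a, v in zip(advantages, values[:-1])]
```

Setting `lam` closer to 0 relies more on the learned value estimates (lower variance, more bias), while `lam` closer to 1 relies more on observed returns.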

Title: DPRM: A Dual Implicit Process Reward Model in Multi-Hop Question Answering

Authors: Xinyi Wang, Yiping Song, Zhiliang Tian, Bo Liu, Tingjin Luo, Minlie Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.08364
Pdf URL: https://arxiv.org/pdf/2511.08364
Copy Paste: [[2511.08364]] DPRM: A Dual Implicit Process Reward Model in Multi-Hop Question Answering(https://arxiv.org/abs/2511.08364)
Keywords: language model, llm, hallucination
Abstract: In multi-hop question answering (MHQA) tasks, Chain of Thought (CoT) improves the quality of generation by guiding large language models (LLMs) through multi-step reasoning, and Knowledge Graphs (KGs) reduce hallucinations via semantic matching. Outcome Reward Models (ORMs) provide feedback after generating the final answers but fail to evaluate the process for multi-step reasoning. Traditional Process Reward Models (PRMs) evaluate the reasoning process but require costly human annotations or rollout generation. Implicit PRM, by contrast, is trained only with outcome signals and derives step rewards through reward parameterization without explicit annotations, which makes it more suitable for multi-step reasoning in MHQA tasks. However, existing implicit PRM has only been explored for plain text scenarios. When adapting to MHQA tasks, it cannot handle the graph structure constraints in KGs and capture the potential inconsistency between CoT and KG paths. To address these limitations, we propose the DPRM (Dual Implicit Process Reward Model). It trains two implicit PRMs for CoT and KG reasoning in MHQA tasks. Both PRMs, namely KG-PRM and CoT-PRM, derive step-level rewards from outcome signals via reward parameterization without additional explicit annotations. Among them, KG-PRM uses preference pairs to learn structural constraints from KGs. DPRM further introduces a consistency constraint between CoT and KG reasoning steps, making the two PRMs mutually verify and collaboratively optimize the reasoning paths. We also provide a theoretical demonstration of the derivation of process rewards. Experimental results show that our method outperforms 13 baselines on multiple datasets with up to 16.6% improvement on Hit@1.
摘要：在多跳问答（MHQA）任务中，思想链（CoT）通过多步推理指导大语言模型（LLM）来提高生成质量，知识图（KG）通过语义匹配减少幻觉。结果奖励模型 (ORM) 在生成最终答案后提供反馈，但无法评估多步骤推理的过程。传统的过程奖励模型 (PRM) 评估推理过程，但需要昂贵的人工注释或部署生成。虽然隐式 PRM 仅使用结果信号进行训练，并通过奖励参数化获得步骤奖励而无需显式注释，但它更适合 MHQA 任务中的多步骤推理。然而，现有的隐式 PRM 仅针对纯文本场景进行了探索。当适应MHQA任务时，它无法处理KG中的图结构约束并捕获CoT和KG路径之间潜在的不一致。为了解决这些限制，我们提出了 DPRM（双重隐式过程奖励模型）。它训练两个隐式 PRM，用于 MHQA 任务中的 CoT 和 KG 推理。这两种 PRM，即 KG-PRM 和 CoT-PRM，都通过奖励参数化从结果信号中获得阶梯级奖励，而无需额外的显式注释。其中，KG-PRM使用偏好对从KG中学习结构约束。 DPRM进一步在CoT和KG推理步骤之间引入一致性约束，使得两个PRM相互验证并协同优化推理路径。我们还提供了过程奖励的推导的理论论证。实验结果表明，我们的方法在多个数据集上优于 13 个基线，比 Hit@1 提高了 16.6%。

Title: PCRLLM: Proof-Carrying Reasoning with Large Language Models under Stepwise Logical Constraints

Authors: Tangrui Li, Pei Wang, Hongzheng Wang, Christian Hahm, Matteo Spatola, Justin Shi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.08392
Pdf URL: https://arxiv.org/pdf/2511.08392
Copy Paste: [[2511.08392]] PCRLLM: Proof-Carrying Reasoning with Large Language Models under Stepwise Logical Constraints(https://arxiv.org/abs/2511.08392)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) often exhibit limited logical coherence, mapping premises to conclusions without adherence to explicit inference rules. We propose Proof-Carrying Reasoning with LLMs (PCRLLM), a framework that constrains reasoning to single-step inferences while preserving natural language formulations. Each output explicitly specifies premises, rules, and conclusions, thereby enabling verification against a target logic. This mechanism mitigates trustworthiness concerns by supporting chain-level validation even in black-box settings. Moreover, PCRLLM facilitates systematic multi-LLM collaboration, allowing intermediate steps to be compared and integrated under formal rules. Finally, we introduce a benchmark schema for generating large-scale step-level reasoning data, combining natural language expressiveness with formal rigor.
摘要：大型语言模型 (LLM) 通常表现出有限的逻辑一致性，将前提映射到结论，而不遵守明确的推理规则。我们提出了 LLM 的证明推理（PCRLLM），这是一个将推理限制为单步推理，同时保留自然语言表述的框架。每个输出都明确指定前提、规则和结论，从而能够针对目标逻辑进行验证。即使在黑盒设置中，该机制也支持链级验证，从而减轻了可信度问题。此外，PCRLLM 促进系统性的多法学硕士协作，允许在正式规则下比较和集成中间步骤。最后，我们引入了一种用于生成大规模步骤级推理数据的基准模式，将自然语言表达力与形式严谨性相结合。

Title: Interaction Dynamics as a Reward Signal for LLMs

Authors: Sian Gooding, Edward Grefenstette
Subjects: cs.CL, cs.AI, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2511.08394
Pdf URL: https://arxiv.org/pdf/2511.08394
Copy Paste: [[2511.08394]] Interaction Dynamics as a Reward Signal for LLMs(https://arxiv.org/abs/2511.08394)
Keywords: language model, llm, agent
Abstract: The alignment of Large Language Models (LLMs) for multi-turn conversations typically relies on reward signals derived from the content of the text. This approach, however, overlooks a rich, complementary source of signal: the dynamics of the interaction itself. This paper introduces TRACE (Trajectory-based Reward for Agent Collaboration Estimation), a novel reward signal derived from the geometric properties of a dialogue's embedding trajectory--a concept we term 'conversational geometry'. Our central finding is that a reward model trained only on these structural signals achieves a pairwise accuracy (68.20%) comparable to a powerful LLM baseline that analyzes the full transcript (70.04%). Furthermore, a hybrid model combining interaction dynamics with textual analysis achieves the highest performance (80.17%), demonstrating their complementary nature. This work provides strong evidence that for interactive settings, how an agent communicates is as powerful a predictor of success as what it says, offering a new, privacy-preserving framework that not only aligns agents but also serves as a diagnostic tool for understanding the distinct interaction patterns that drive successful collaboration.
摘要：多轮对话的大型语言模型 (LLM) 的对齐通常依赖于从文本内容派生的奖励信号。然而，这种方法忽略了丰富的、互补的信号源：交互本身的动态。本文介绍了 TRACE（基于轨迹的智能体协作估计奖励），这是一种源自对话嵌入轨迹的几何特性的新颖奖励信号——我们称之为“对话几何”的概念。我们的主要发现是，仅针对这些结构信号进行训练的奖励模型所达到的成对准确度 (68.20%) 与分析完整成绩单的强大 LLM 基线 (70.04%) 相当。此外，将交互动态与文本分析相结合的混合模型实现了最高性能（80.17％），证明了它们的互补性。这项工作提供了强有力的证据，证明对于交互设置，代理的沟通方式正如其所说的那样，是成功的有力预测因素，提供了一个新的隐私保护框架，该框架不仅可以协调代理，还可以作为诊断工具来理解推动成功协作的不同交互模式。

Title: Bot Meets Shortcut: How Can LLMs Aid in Handling Unknown Invariance OOD Scenarios?

Authors: Shiyan Zheng, Herun Wan, Minnan Luo, Junhang Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.08455
Pdf URL: https://arxiv.org/pdf/2511.08455
Copy Paste: [[2511.08455]] Bot Meets Shortcut: How Can LLMs Aid in Handling Unknown Invariance OOD Scenarios?(https://arxiv.org/abs/2511.08455)
Keywords: language model, llm
Abstract: While existing social bot detectors perform well on benchmarks, their robustness across diverse real-world scenarios remains limited due to unclear ground truth and varied misleading cues. In particular, the impact of shortcut learning, where models rely on spurious correlations instead of capturing causal task-relevant features, has received limited attention. To address this gap, we conduct an in-depth study to assess how detectors are influenced by potential shortcuts based on textual features, which are most susceptible to manipulation by social bots. We design a series of shortcut scenarios by constructing spurious associations between user labels and superficial textual cues to evaluate model robustness. Results show that shifts in irrelevant feature distributions significantly degrade social bot detector performance, with an average relative accuracy drop of 32% in the baseline models. To tackle this challenge, we propose mitigation strategies based on large language models, leveraging counterfactual data augmentation. These methods mitigate the problem from data and model perspectives across three levels, including data distribution at both the individual user text and overall dataset levels, as well as the model's ability to extract causal information. Our strategies achieve an average relative performance improvement of 56% under shortcut scenarios.
摘要：尽管现有的社交机器人检测器在基准测试中表现良好，但由于不清楚的事实真相和各种误导性线索，它们在不同的现实场景中的鲁棒性仍然有限。特别是，捷径学习的影响受到的关注有限，其中模型依赖于虚假相关性而不是捕获因果任务相关的特征。为了解决这一差距，我们进行了深入研究，以评估检测器如何受到基于文本特征的潜在捷径的影响，这些捷径最容易受到社交机器人的操纵。我们通过在用户标签和表面文本提示之间构建虚假关联来设计一系列快捷场景，以评估模型的稳健性。结果表明，不相关特征分布的变化会显着降低社交机器人检测器的性能，基线模型的平均相对准确度下降 32%。为了应对这一挑战，我们提出了基于大型语言模型的缓解策略，利用反事实数据增强。这些方法从三个级别的数据和模型角度缓解了问题，包括单个用户文本和整体数据集级别的数据分布，以及模型提取因果信息的能力。我们的策略在捷径场景下平均相对性能提高了 56%。

Title: SPEAR-MM: Selective Parameter Evaluation and Restoration via Model Merging for Efficient Financial LLM Adaptation

Authors: Berkcan Kapusuzoglu, Supriyo Chakraborty, Renkun Ni, Stephen Rawls, Sambit Sahu
Subjects: cs.CL, cs.AI, cs.LG, math.SP
Abstract URL: https://arxiv.org/abs/2511.08500
Pdf URL: https://arxiv.org/pdf/2511.08500
Copy Paste: [[2511.08500]] SPEAR-MM: Selective Parameter Evaluation and Restoration via Model Merging for Efficient Financial LLM Adaptation(https://arxiv.org/abs/2511.08500)
Keywords: language model, llm
Abstract: Large language models (LLMs) adapted to financial domains often suffer from catastrophic forgetting of general reasoning capabilities essential for customer interactions and complex financial analysis. We introduce Selective Parameter Evaluation and Restoration via Model Merging (SPEAR-MM), a practical framework that preserves critical capabilities while enabling domain adaptation. Our method approximates layer-wise impact on external benchmarks through post-hoc analysis, then selectively freezes or restores transformer layers via spherical interpolation merging. Applied to LLaMA-3.1-8B for financial tasks, SPEAR-MM achieves 91.2% retention of general capabilities versus 69.7% for standard continual pretraining, while maintaining 94% of domain adaptation gains. The approach provides interpretable trade-off control and reduces computational costs by 90%, which is crucial for resource-constrained financial institutions.
摘要：适应金融领域的大型语言模型（LLM）经常会灾难性地忘记对客户交互和复杂金融分析至关重要的一般推理能力。我们引入了通过模型合并进行选择性参数评估和恢复（SPEAR-MM），这是一个实用的框架，可以保留关键功能，同时实现领域适应。我们的方法通过事后分析来近似对外部基准的逐层影响，然后通过球形插值合并选择性地冻结或恢复变压器层。应用于金融任务的 LLaMA-3.1-8B 中，SPEAR-MM 实现了 91.2% 的一般能力保留率，而标准连续预训练的保留率为 69.7%，同时保持了 94% 的领域适应增益。该方法提供了可解释的权衡控制，并将计算成本降低了 90%，这对于资源有限的金融机构至关重要。
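
The spherical interpolation mentioned above is commonly implemented as SLERP between two weight vectors. The sketch below shows that generic operation on flat Python lists; it is not the SPEAR-MM procedure itself. The function name, the fallback to linear interpolation for near-parallel vectors, and the idea of choosing `t` per layer are assumptions made for illustration.

```python
import math


def slerp(w_base, w_tuned, t, eps=1e-8):
    """Spherical linear interpolation between two flat weight vectors.

    t = 0 returns the base weights and t = 1 the fine-tuned weights.
    Falls back to linear interpolation when the vectors are (nearly)
    parallel or one of them is (nearly) zero.
    """
    def lerp():
        return [(1.0 - t) * a + t * b for a, b in zip(w_base, w_tuned)]

    norm_base = math.sqrt(sum(a * a for a in w_base))
    norm_tuned = math.sqrt(sum(b * b for b in w_tuned))
    if norm_base < eps or norm_tuned < eps:
        return lerp()
    dot = sum(a * b for a, b in zip(w_base, w_tuned))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_omega = max(-1.0, min(1.0, dot / (norm_base * norm_tuned)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        return lerp()
    coeff_base = math.sin((1.0 - t) * omega) / sin_omega
    coeff_tuned = math.sin(t * omega) / sin_omega
    return [coeff_base * a + coeff_tuned * b for a, b in zip(w_base, w_tuned)]


# Hypothetical per-layer merge: restore a layer halfway toward the base model.
base_layer = [1.0, 0.0]
tuned_layer = [0.0, 1.0]
merged_layer = slerp(base_layer, tuned_layer, 0.5)
```

In a layer-wise scheme, a layer judged harmful to general capabilities could use a small `t` (closer to the base model) while other layers keep `t = 1`.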

Title: Structured RAG for Answering Aggregative Questions

Authors: Omri Koshorek, Niv Granot, Aviv Alloni, Shahar Admati, Roee Hendel, Ido Weiss, Alan Arazi, Shay-Nitzan Cohen, Yonatan Belinkov
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.08505
Pdf URL: https://arxiv.org/pdf/2511.08505
Copy Paste: [[2511.08505]] Structured RAG for Answering Aggregative Questions(https://arxiv.org/abs/2511.08505)
Keywords: llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) has become the dominant approach for answering questions over large corpora. However, current datasets and methods are highly focused on cases where only a small part of the corpus (usually a few paragraphs) is relevant per query, and fail to capture the rich world of aggregative queries. These require gathering information from a large set of documents and reasoning over them. To address this gap, we propose S-RAG, an approach specifically designed for such queries. At ingestion time, S-RAG constructs a structured representation of the corpus; at inference time, it translates natural-language queries into formal queries over said representation. To validate our approach and promote further research in this area, we introduce two new datasets of aggregative queries: HOTELS and WORLD CUP. Experiments with S-RAG on the newly introduced datasets, as well as on a public benchmark, demonstrate that it substantially outperforms both common RAG systems and long-context LLMs.
摘要：检索增强生成（RAG）已成为回答大型语料库问题的主要方法。然而，当前的数据集和方法高度关注只有一小部分语料库（通常是几个段落）与每个查询相关的情况，并且无法捕获聚合查询的丰富世界。这些需要从大量文档中收集信息并对它们进行推理。为了解决这一差距，我们提出了 S-RAG，这是一种专门为此类查询设计的方法。在摄取时，S-RAG 构建语料库的结构化表示；在推理时，它将自然语言查询转换为针对所述表示的正式查询。为了验证我们的方法并促进该领域的进一步研究，我们引入了两个新的聚合查询数据集：HOTELS 和 WORLD CUP。在新引入的数据集以及公共基准上使用 S-RAG 进行的实验表明，它的性能大大优于常见的 RAG 系统和长上下文法学硕士。
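
The two phases described above (build a structured representation at ingestion, run a formal query at inference) can be illustrated with an in-memory SQLite table. This is a sketch under assumptions, not the S-RAG system: the `hotels` schema, the sample records, and the hand-written SQL (standing in for the query a model would produce from natural language) are all hypothetical.

```python
import sqlite3


def ingest(records):
    """Ingestion time: store extracted records in a structured table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hotels (name TEXT, city TEXT, stars INTEGER, price REAL)"
    )
    conn.executemany("INSERT INTO hotels VALUES (?, ?, ?, ?)", records)
    conn.commit()
    return conn


def answer(conn, sql, params=()):
    """Inference time: run the formal query over the structured table."""
    return conn.execute(sql, params).fetchall()


# Hypothetical records, as if extracted from one document per hotel.
records = [
    ("Alpha", "Paris", 4, 200.0),
    ("Beta", "Paris", 5, 300.0),
    ("Gamma", "Rome", 3, 100.0),
    ("Delta", "Rome", 4, 150.0),
]
conn = ingest(records)

# Stand-in for the translation of the question
# "What is the average price of hotels with at least 4 stars in each city?"
sql = (
    "SELECT city, AVG(price) FROM hotels "
    "WHERE stars >= ? GROUP BY city ORDER BY city"
)
rows = answer(conn, sql, (4,))
```

The aggregation touches every matching record, which is the case that top-k passage retrieval handles poorly because the answer depends on the whole set rather than on a few paragraphs.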

Title: Introducing A Bangla Sentence - Gloss Pair Dataset for Bangla Sign Language Translation and Research

Authors: Neelavro Saha, Rafi Shahriyar, Nafis Ashraf Roudra, Saadman Sakib, Annajiat Alim Rasel
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.08507
Pdf URL: https://arxiv.org/pdf/2511.08507
Copy Paste: [[2511.08507]] Introducing A Bangla Sentence - Gloss Pair Dataset for Bangla Sign Language Translation and Research(https://arxiv.org/abs/2511.08507)
Keywords: gpt, prompt, retrieval-augmented generation
Abstract: Bangla Sign Language (BdSL) translation is a low-resource NLP task because no large-scale datasets address sentence-level translation. Consequently, existing research in this field has been limited to word- and alphabet-level detection. In this work, we introduce Bangla-SGP, a novel parallel dataset consisting of 1,000 human-annotated sentence-gloss pairs, augmented with around 3,000 synthetically generated pairs produced using syntactic and morphological rules through a rule-based Retrieval-Augmented Generation (RAG) pipeline. The gloss sequences of the spoken Bangla sentences are made up of individual glosses, which are Bangla sign-supported words and serve as an intermediate representation for a continuous sign. Our dataset consists of 1,000 high-quality Bangla sentences that are manually annotated into gloss sequences by a professional signer. The augmentation process incorporates rule-based linguistic strategies and prompt engineering techniques that we adopted by critically analyzing our human-annotated sentence-gloss pairs and by working closely with our professional signer. Furthermore, we fine-tune several transformer-based models, such as mBart50, Google mT5, and GPT4.1-nano, and evaluate their sentence-to-gloss translation performance using BLEU scores. Based on these scores, we compare the models' gloss-translation consistency across our dataset and the RWTH-PHOENIX-2014T benchmark.
摘要：由于缺乏面向句子级翻译的大规模数据集，孟加拉手语 (BdSL) 翻译是一项低资源 NLP 任务。相应地，该领域的现有研究仅限于单词和字母级别的检测。在这项工作中，我们介绍了 Bangla-SGP，这是一个新颖的平行数据集，由 1,000 个人工标注的句子-gloss 对组成，并通过基于规则的检索增强生成 (RAG) 管道，使用句法和形态规则扩充了大约 3,000 个合成生成的对。孟加拉语口语句子的 gloss 序列由单个 gloss 组成，这些 gloss 是有孟加拉手语对应的词，并用作连续手语的中间表示。我们的数据集包含 1,000 个高质量的孟加拉语句子，由专业手语者手动标注为 gloss 序列。扩充过程结合了基于规则的语言策略和提示工程技术，这些技术是我们通过批判性地分析人工标注的句子-gloss 对并与专业手语者密切合作而采用的。此外，我们微调了几个基于 Transformer 的模型，例如 mBart50、Google mT5 和 GPT4.1-nano，并使用 BLEU 分数评估它们的句子到 gloss 的翻译性能。基于这些分数，我们比较了各模型在我们的数据集和 RWTH-PHOENIX-2014T 基准上的 gloss 翻译一致性。

Title: AlphaResearch: Accelerating New Algorithm Discovery with Language Models

Authors: Zhaojian Yu, Kaiyue Feng, Yilun Zhao, Shilin He, Xiao-Ping Zhang, Arman Cohan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.08522
Pdf URL: https://arxiv.org/pdf/2511.08522
Copy Paste: [[2511.08522]] AlphaResearch: Accelerating New Algorithm Discovery with Language Models(https://arxiv.org/abs/2511.08522)
Keywords: language model, llm, agent
Abstract: Large language models have made significant progress on complex but easy-to-verify problems, yet they still struggle with discovering the unknown. In this paper, we present AlphaResearch, an autonomous research agent designed to discover new algorithms on open-ended problems. To synergize the feasibility and innovation of the discovery process, we construct a novel dual research environment by combining execution-based verification with a simulated real-world peer review environment. AlphaResearch discovers new algorithms by iteratively running the following steps: (1) propose new ideas, (2) verify the ideas in the dual research environment, and (3) optimize the research proposals for better performance. To promote a transparent evaluation process, we construct AlphaResearchComp, a new evaluation benchmark comprising a competition of eight open-ended algorithmic problems, each carefully curated and verified through executable pipelines, objective metrics, and reproducibility checks. AlphaResearch achieves a 2/8 win rate in head-to-head comparison with human researchers, demonstrating the possibility of accelerating algorithm discovery with LLMs. Notably, the algorithm discovered by AlphaResearch on the "packing circles" problem achieves the best-known performance, surpassing the results of human researchers and strong baselines from recent work (e.g., AlphaEvolve). Additionally, we conduct a comprehensive analysis of the remaining challenges in the 6/8 failure cases, providing valuable insights for future research.
摘要：大型语言模型在复杂但易于验证的问题上取得了重大进展，但它们仍然难以发现未知事物。在本文中，我们提出了 AlphaResearch，这是一个自主研究智能体，旨在为开放式问题发现新算法。为了兼顾发现过程的可行性和创新性，我们将基于执行的验证与模拟的真实世界同行评审环境相结合，构建了一个新颖的双重研究环境。AlphaResearch 通过迭代运行以下步骤来发现新算法：（1）提出新想法；（2）在双重研究环境中验证想法；（3）优化研究方案以获得更好的性能。为了促进透明的评估过程，我们构建了 AlphaResearchComp，这是一个新的评估基准，包含由八个开放式算法问题组成的竞赛，每个问题都通过可执行管道、客观指标和可重复性检查精心策划和验证。AlphaResearch 在与人类研究人员的正面比较中获得了 2/8 的胜率，证明了利用 LLM 加速算法发现的可能性。值得注意的是，AlphaResearch 在 "packing circles" 问题上发现的算法达到了已知最佳性能，超越了人类研究人员的结果和近期工作（例如 AlphaEvolve）的强基线。此外，我们对 6/8 失败案例中仍然存在的挑战进行了全面分析，为未来的研究提供了宝贵的见解。

Title: Investigating CoT Monitorability in Large Reasoning Models

Authors: Shu Yang, Junchao Wu, Xilin Gou, Xuansheng Wu, Derek Wong, Ninhao Liu, Di Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.08525
Pdf URL: https://arxiv.org/pdf/2511.08525
Copy Paste: [[2511.08525]] Investigating CoT Monitorability in Large Reasoning Models(https://arxiv.org/abs/2511.08525)
Keywords: llm, chain-of-thought
Abstract: Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex tasks by engaging in extended reasoning before producing final answers. Beyond improving abilities, these detailed reasoning traces also create a new opportunity for AI safety, CoT Monitorability: monitoring potential model misbehavior, such as the use of shortcuts or sycophancy, through their chain-of-thought (CoT) during decision-making. However, two key fundamental challenges arise when attempting to build more effective monitors through CoT analysis. First, as prior research on CoT faithfulness has pointed out, models do not always truthfully represent their internal decision-making in the generated reasoning. Second, monitors themselves may be either overly sensitive or insufficiently sensitive, and can potentially be deceived by models' long, elaborate reasoning traces. In this paper, we present the first systematic investigation of the challenges and potential of CoT monitorability. Motivated by two fundamental challenges we mentioned before, we structure our study around two central perspectives: (i) verbalization: to what extent do LRMs faithfully verbalize the true factors guiding their decisions in the CoT, and (ii) monitor reliability: to what extent can misbehavior be reliably detected by a CoT-based monitor? Specifically, we provide empirical evidence and correlation analyses between verbalization quality, monitor reliability, and LLM performance across mathematical, scientific, and ethical domains. Then we further investigate how different CoT intervention methods, designed to improve reasoning efficiency or performance, will affect monitoring effectiveness. Finally, we propose MoME, a new paradigm in which LLMs monitor other models' misbehavior through their CoT and provide structured judgments along with supporting evidence.
摘要：大型推理模型（LRM）通过在产生最终答案之前进行扩展推理，在复杂任务上表现出了卓越的性能。除了提高能力之外，这些详细的推理痕迹还为人工智能安全创造了新的机会，即 CoT 可监控性：通过决策过程中的思维链 (CoT) 监控潜在的模型不当行为，例如使用捷径或谄媚。然而，当尝试通过 CoT 分析构建更有效的监控器时，出现了两个关键的基本挑战。首先，正如先前关于 CoT 忠实度的研究所指出的那样，模型并不总是在生成的推理中真实地呈现其内部决策。其次，监控器本身可能过于敏感或不够敏感，并且可能会被模型漫长而复杂的推理痕迹所欺骗。在本文中，我们首次系统地研究了 CoT 可监控性的挑战和潜力。受上述两个基本挑战的启发，我们围绕两个中心视角构建我们的研究：（i）言语化：LRM 在多大程度上在 CoT 中忠实地表达了指导其决策的真实因素；（ii）监控可靠性：基于 CoT 的监控器可以在多大程度上可靠地检测到不当行为？具体来说，我们提供了数学、科学和伦理领域中言语化质量、监控可靠性和 LLM 性能之间的经验证据和相关性分析。然后我们进一步研究旨在提高推理效率或性能的不同 CoT 干预方法将如何影响监控有效性。最后，我们提出了 MoME，这是一种新范式，其中 LLM 通过其他模型的 CoT 监控它们的不当行为，并提供结构化判断和支持证据。

Title: From Semantic Roles to Opinion Roles: SRL Data Extraction for Multi-Task and Transfer Learning in Low-Resource ORL

Authors: Amirmohammad Omidi Galdiani, Sepehr Rezaei Melal, Mohammad Norasteh, Arash Yousefi Jordehi, Seyed Abolghasem Mirroshandel
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.08537
Pdf URL: https://arxiv.org/pdf/2511.08537
Copy Paste: [[2511.08537]] From Semantic Roles to Opinion Roles: SRL Data Extraction for Multi-Task and Transfer Learning in Low-Resource ORL(https://arxiv.org/abs/2511.08537)
Keywords: agent
Abstract: This report presents a detailed methodology for constructing a high-quality Semantic Role Labeling (SRL) dataset from the Wall Street Journal (WSJ) portion of the OntoNotes 5.0 corpus and adapting it for Opinion Role Labeling (ORL) tasks. Leveraging the PropBank annotation framework, we implement a reproducible extraction pipeline that aligns predicate-argument structures with surface text, converts syntactic tree pointers to coherent spans, and applies rigorous cleaning to ensure semantic fidelity. The resulting dataset comprises 97,169 predicate-argument instances with clearly defined Agent (ARG0), Predicate (REL), and Patient (ARG1) roles, mapped to ORL's Holder, Expression, and Target schema. We provide a detailed account of our extraction algorithms, discontinuous argument handling, annotation corrections, and statistical analysis of the resulting dataset. This work offers a reusable resource for researchers aiming to leverage SRL for enhancing ORL, especially in low-resource opinion mining scenarios.
摘要：本报告介绍了从 OntoNotes 5.0 语料库的《华尔街日报》(WSJ) 部分构建高质量语义角色标注 (SRL) 数据集并使其适应观点角色标注 (ORL) 任务的详细方法。利用 PropBank 标注框架，我们实现了一个可复现的提取管道，将谓词-论元结构与表层文本对齐，将句法树指针转换为连贯的跨度，并应用严格的清理以确保语义保真度。生成的数据集包含 97,169 个谓词-论元实例，具有明确定义的施事 (ARG0)、谓词 (REL) 和受事 (ARG1) 角色，映射到 ORL 的持有者 (Holder)、表达 (Expression) 和目标 (Target) 模式。我们详细介绍了我们的提取算法、不连续论元处理、标注更正以及结果数据集的统计分析。这项工作为旨在利用 SRL 增强 ORL 的研究人员提供了可重用的资源，特别是在资源匮乏的观点挖掘场景中。
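
Note: the role mapping stated in this abstract (ARG0, REL, ARG1 to Holder, Expression, Target) can be sketched as a small lookup. The instance format below, a dict from role label to text span, is an assumption made for illustration; the abstract does not describe the dataset's actual file format.

```python
# Mapping given in the abstract: SRL roles -> ORL schema.
SRL_TO_ORL = {"ARG0": "Holder", "REL": "Expression", "ARG1": "Target"}

def srl_to_orl(instance):
    """Relabel one predicate-argument instance, dropping roles outside the mapping."""
    return {
        SRL_TO_ORL[role]: span
        for role, span in instance.items()
        if role in SRL_TO_ORL
    }

# Hypothetical instance; ARGM-TMP has no ORL counterpart and is discarded.
example = {
    "ARG0": "The analysts",
    "REL": "criticized",
    "ARG1": "the merger",
    "ARGM-TMP": "yesterday",
}
converted = srl_to_orl(example)
```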

Title: Moral Susceptibility and Robustness under Persona Role-Play in Large Language Models

Authors: Davi Bastos Costa, Felippe Alves, Renato Vicente
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2511.08565
Pdf URL: https://arxiv.org/pdf/2511.08565
Copy Paste: [[2511.08565]] Moral Susceptibility and Robustness under Persona Role-Play in Large Language Models(https://arxiv.org/abs/2511.08565)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) increasingly operate in social contexts, motivating analysis of how they express and shift moral judgments. In this work, we investigate the moral response of LLMs to persona role-play, prompting an LLM to assume a specific character. Using the Moral Foundations Questionnaire (MFQ), we introduce a benchmark that quantifies two properties: moral susceptibility and moral robustness, defined from the variability of MFQ scores across and within personas, respectively. We find that, for moral robustness, model family accounts for most of the variance, while model size shows no systematic effect. The Claude family is, by a significant margin, the most robust, followed by Gemini and GPT-4 models, with other families exhibiting lower robustness. In contrast, moral susceptibility exhibits a mild family effect but a clear within-family size effect, with larger variants being more susceptible. Moreover, robustness and susceptibility are positively correlated, an association that is more pronounced at the family level. Additionally, we present moral foundation profiles for models without persona role-play and for personas averaged across models. Together, these analyses provide a systematic view of how persona conditioning shapes moral behavior in large language models.
摘要：大型语言模型（LLM）越来越多地在社会环境中运作，这促使人们分析它们如何表达和改变道德判断。在这项工作中，我们研究了 LLM 对人物角色扮演（即提示 LLM 扮演特定角色）的道德反应。使用道德基础问卷（MFQ），我们引入了一个量化两个属性的基准：道德易感性和道德稳健性，分别根据角色之间和角色内部的 MFQ 分数的变异性进行定义。我们发现，对于道德稳健性，模型系列解释了大部分方差，而模型大小没有显示出系统性影响。Claude 系列以显著优势成为最稳健的系列，其次是 Gemini 和 GPT-4 模型，其他系列的稳健性较低。相比之下，道德易感性表现出轻微的系列效应，但存在明显的系列内部规模效应，较大的模型变体更易受影响。此外，稳健性和易感性呈正相关，这种关联在系列层面更为明显。此外，我们还给出了未进行角色扮演的模型以及跨模型平均的角色的道德基础概况。总之，这些分析提供了关于角色设定如何塑造大型语言模型道德行为的系统视图。
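
Note: the abstract defines susceptibility and robustness from the variability of MFQ scores across and within personas, but does not give the formulas. The sketch below only computes the two underlying quantities under one plausible reading: standard deviation of persona means (across) and mean per-persona standard deviation over repeated runs (within). The persona names and scores are invented, and the paper's actual estimators may differ.

```python
from statistics import mean, pstdev

# Hypothetical MFQ scores: one list of repeated runs per persona.
scores = {
    "persona_a": [3.0, 3.0, 3.0],
    "persona_b": [1.0, 2.0, 3.0],
    "persona_c": [4.0, 4.0, 4.0],
}

def across_persona_variability(scores):
    """Spread of persona-level mean scores (input to a susceptibility measure)."""
    return pstdev([mean(runs) for runs in scores.values()])

def within_persona_variability(scores):
    """Average spread of repeated runs per persona (input to a robustness measure)."""
    return mean(pstdev(runs) for runs in scores.values())

across = across_persona_variability(scores)
within = within_persona_variability(scores)
```

Under this reading, a model whose answers shift strongly between personas has high across-persona variability, while a model that answers consistently for a fixed persona has low within-persona variability.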

Title: Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models

Authors: Tianyu Fu, Yichen You, Zekai Chen, Guohao Dai, Huazhong Yang, Yu Wang
Subjects: cs.CL, cs.AI, cs.LG, cs.PF
Abstract URL: https://arxiv.org/abs/2511.08577
Pdf URL: https://arxiv.org/pdf/2511.08577
Copy Paste: [[2511.08577]] Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models(https://arxiv.org/abs/2511.08577)
Keywords: language model, llm
Abstract: Improving reasoning capabilities of Large Language Models (LLMs), especially under parameter constraints, is crucial for real-world applications. Prior work proposes recurrent transformers, which allocate a fixed number of extra iterations per token to improve generation quality. After the first, standard forward pass, instead of verbalization, last-layer hidden states are fed back as inputs for additional iterations to refine token predictions. Yet we identify a latent overthinking phenomenon: easy token predictions that are already correct after the first pass are sometimes revised into errors in additional iterations. To address this, we propose Think-at-Hard (TaH), a dynamic latent thinking method that iterates deeper only at hard tokens. It employs a lightweight neural decider to trigger latent iterations only at tokens that are likely incorrect after the standard forward pass. During latent iterations, Low-Rank Adaptation (LoRA) modules shift the LLM objective from general next-token prediction to focused hard-token refinement. We further introduce a duo-causal attention mechanism that extends attention from the token sequence dimension to an additional iteration depth dimension. This enables cross-iteration information flow while maintaining full sequential parallelism. Experiments show that TaH boosts LLM reasoning performance across five challenging benchmarks while maintaining the same parameter count. Compared with baselines that iterate twice for all output tokens, TaH delivers 8.1-11.3% accuracy gains while exempting 94% of tokens from the second iteration. Against strong single-iteration Qwen3 models finetuned with the same data, it also delivers 4.0-5.0% accuracy gains. When allowing less than 3% additional parameters from LoRA and the iteration decider, the gains increase to 8.5-12.6% and 5.3-5.4%, respectively. Our code is available at this https URL.
摘要：提高大型语言模型（LLM）的推理能力，特别是在参数受限的情况下，对于实际应用至关重要。之前的工作提出了循环 Transformer，它为每个 token 分配固定数量的额外迭代，以提高生成质量。在第一次标准前向传递之后，最后一层隐藏状态不被转换为文字输出，而是作为输入反馈给额外的迭代，以完善 token 预测。然而，我们发现了一种潜在过度思考现象：在第一次传递后已经正确的简单 token 预测有时会在额外的迭代中被改错。为了解决这个问题，我们提出了 Think-at-Hard (TaH)，这是一种动态潜在思考方法，仅在困难 token 上进行更深入的迭代。它采用轻量级神经决策器，仅在标准前向传递后可能不正确的 token 处触发潜在迭代。在潜在迭代期间，低秩适应 (LoRA) 模块将 LLM 目标从一般的下一个 token 预测转变为集中的困难 token 细化。我们进一步引入了一种双因果注意力机制，将注意力从 token 序列维度扩展到额外的迭代深度维度。这使得跨迭代信息流成为可能，同时保持完全的序列并行性。实验表明，TaH 在五个具有挑战性的基准测试中提高了 LLM 推理性能，同时保持相同的参数数量。与对所有输出 token 迭代两次的基线相比，TaH 提供了 8.1-11.3% 的准确度增益，同时使 94% 的 token 免于第二次迭代。与使用相同数据进行微调的强大单次迭代 Qwen3 模型相比，它还提供了 4.0-5.0% 的准确度增益。当允许来自 LoRA 和迭代决策器的附加参数少于 3% 时，增益分别增加到 8.5-12.6% 和 5.3-5.4%。我们的代码可以在这个 https URL 上找到。
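
Note: the control flow described in the TaH abstract (run one standard pass, then iterate again only at tokens a decider flags as likely wrong) can be sketched in a few lines. Everything here is a stand-in: the confidence threshold replaces the paper's neural decider, and `refine` replaces a latent iteration over hidden states with LoRA modules, so the sketch shows only the selection logic.

```python
def selective_decode(first_pass, confidences, refine, threshold=0.5):
    """Refine only the predictions whose first-pass confidence is below threshold.

    Returns the final predictions and the number of tokens that were iterated on.
    """
    outputs, iterated = [], 0
    for pred, conf in zip(first_pass, confidences):
        if conf < threshold:
            # "Hard" token: spend a second iteration on it.
            pred = refine(pred)
            iterated += 1
        outputs.append(pred)
    return outputs, iterated

# Hypothetical first-pass predictions with per-token confidences.
first_pass = ["the", "cat", "sta", "on"]
confidences = [0.9, 0.8, 0.2, 0.95]
corrections = {"sta": "sat"}  # toy refinement table

outputs, iterated = selective_decode(
    first_pass, confidences, lambda p: corrections.get(p, p)
)
```

Because confident tokens skip the second iteration, predictions that were already correct cannot be revised into errors, which is the overthinking failure the abstract describes.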

Title: Training Language Models to Explain Their Own Computations

Authors: Belinda Z. Li, Zifan Carl Guo, Vincent Huang, Jacob Steinhardt, Jacob Andreas
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.08579
Pdf URL: https://arxiv.org/pdf/2511.08579
Copy Paste: [[2511.08579]] Training Language Models to Explain Their Own Computations(https://arxiv.org/abs/2511.08579)
Keywords: language model
Abstract: Can language models (LMs) learn to faithfully describe their internal computations? Are they better able to describe themselves than other models? We study the extent to which LMs' privileged access to their own internals can be leveraged to produce new techniques for explaining their behavior. Using existing interpretability techniques as a source of ground truth, we fine-tune LMs to generate natural language descriptions of (1) the information encoded by LM features, (2) the causal structure of LMs' internal activations, and (3) the influence of specific input tokens on LM outputs. When trained with only tens of thousands of example explanations, explainer models exhibit non-trivial generalization to new queries. This generalization appears partly attributable to explainer models' privileged access to their own internals: using a model to explain its own computations generally works better than using a *different* model to explain its computations (even if the other model is significantly more capable). Our results suggest not only that LMs can learn to reliably explain their internal computations, but that such explanations offer a scalable complement to existing interpretability methods.
摘要：语言模型 (LM) 能否学会如实地描述其内部计算？它们是否比其他模型更能描述自己？我们研究了 LM 对自身内部的特权访问可以在多大程度上被利用来产生解释其行为的新技术。使用现有的可解释性技术作为基本事实的来源，我们对 LM 进行微调，以生成以下内容的自然语言描述：(1) LM 特征编码的信息，(2) LM 内部激活的因果结构，以及 (3) 特定输入 token 对 LM 输出的影响。当仅使用数万个示例解释进行训练时，解释器模型对新查询表现出非平凡的泛化能力。这种泛化似乎部分归因于解释器模型对其自身内部结构的特权访问：使用模型来解释其自身的计算通常比使用*不同的*模型来解释其计算效果更好（即使另一个模型的能力明显更强）。我们的结果表明，语言模型不仅可以学会可靠地解释其内部计算，而且这种解释为现有的可解释性方法提供了可扩展的补充。