2025-11-14

Title: Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages

Authors: Omnilingual ASR team: Gil Keren, Artyom Kozhevnikov, Yen Meng, Christophe Ropers, Matthew Setzler, Skyler Wang, Ife Adebara, Michael Auli, Can Balioglu, Kevin Chan, Chierh Cheng, Joe Chuang, Caley Droof, Mark Duppenthaler, Paul-Ambroise Duquenne, Alexander Erben, Cynthia Gao, Gabriel Mejia Gonzalez, Kehan Lyu, Sagar Miglani, Vineel Pratap, Kaushik Ram Sadagopan, Safiyyah Saleem, Arina Turkatenko, Albert Ventayol-Boada, Zheng-Xin Yong, Yu-An Chung, Jean Maillard, Rashel Moritz, Alexandre Mourachko, Mary Williamson, Shireen Yates
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.09690
Pdf URL: https://arxiv.org/pdf/2511.09690
Copy Paste: [[2511.09690]] Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages(https://arxiv.org/abs/2511.09690)
Keywords: llm
Abstract: Automatic speech recognition (ASR) has advanced in high-resource languages, but most of the world's 7,000+ languages remain unsupported, leaving thousands of long-tail languages behind. Expanding ASR coverage has been costly and limited by architectures that restrict language support, making extension inaccessible to most, and it is entangled with ethical concerns when pursued without community collaboration. To transcend these limitations, we introduce Omnilingual ASR, the first large-scale ASR system designed for extensibility. Omnilingual ASR enables communities to introduce unserved languages with only a handful of data samples. It scales self-supervised pre-training to 7B parameters to learn robust speech representations and introduces an encoder-decoder architecture designed for zero-shot generalization, leveraging an LLM-inspired decoder. This capability is grounded in a massive and diverse training corpus; by combining breadth of coverage with linguistic variety, the model learns representations robust enough to adapt to unseen languages. Incorporating public resources with community-sourced recordings gathered through compensated local partnerships, Omnilingual ASR expands coverage to over 1,600 languages, the largest such effort to date, including over 500 never before served by ASR. Automatic evaluations show substantial gains over prior systems, especially in low-resource conditions, and strong generalization. We release Omnilingual ASR as a family of models, from 300M variants for low-power devices to 7B for maximum accuracy. We reflect on the ethical considerations shaping this design and conclude by discussing its societal impact. In particular, we highlight how open-sourcing models and tools can lower barriers for researchers and communities, inviting new forms of participation. Open-source artifacts are available at this https URL.

Title: Order Matters: Rethinking Prompt Construction in In-Context Learning

Authors: Warren Li, Yiqian Wang, Zihan Wang, Jingbo Shang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.09700
Pdf URL: https://arxiv.org/pdf/2511.09700
Copy Paste: [[2511.09700]] Order Matters: Rethinking Prompt Construction in In-Context Learning(https://arxiv.org/abs/2511.09700)
Keywords: language model, gpt, prompt
Abstract: In-context learning (ICL) enables large language models to perform new tasks by conditioning on a sequence of examples. Most prior work reasonably and intuitively assumes that which examples are chosen has a far greater effect on performance than how those examples are ordered, leading to a focus on example selection. We revisit this assumption and conduct a systematic comparison between the effect of selection and ordering. Through controlled experiments on both classification and generation tasks, using multiple open-source model families (0.5B to 27B parameters) and GPT-5, we find that the variance in performance due to different example orderings is comparable to that from using entirely different example sets. Furthermore, we show that strong orderings can be identified using only a development set, achieving performance close to an oracle that selects the best ordering based on test labels. Our findings highlight the equal and intertwined importance of example selection and ordering in prompt design, calling for a reexamination of the assumptions held in ICL.
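The idea of picking a strong ordering with only a development set can be sketched as follows. This is a minimal illustration, not the authors' code: `build_prompt`, `best_ordering`, and the `toy_predict` stand-in for a real model call are hypothetical names, and exhaustive enumeration of permutations is only practical for a handful of demonstrations.

```python
from itertools import permutations
from typing import Callable, Sequence

Example = tuple[str, str]  # (input text, label)


def build_prompt(demos: Sequence[Example], query: str) -> str:
    """Concatenate demonstrations in the given order, then append the query."""
    shots = "\n".join(f"Input: {x}\nLabel: {y}" for x, y in demos)
    return f"{shots}\nInput: {query}\nLabel:"


def best_ordering(
    demos: Sequence[Example],
    dev_set: Sequence[Example],
    predict: Callable[[str], str],
) -> tuple[Example, ...]:
    """Return the demonstration ordering with the highest dev-set accuracy."""
    def accuracy(order: Sequence[Example]) -> float:
        hits = sum(predict(build_prompt(order, x)) == y for x, y in dev_set)
        return hits / len(dev_set)

    # Exhaustive search over orderings (assumption: few demonstrations).
    return max(permutations(demos), key=accuracy)


def toy_predict(prompt: str) -> str:
    """Hypothetical stand-in for a model: copies the label of the last demo."""
    labels = [ln for ln in prompt.splitlines() if ln.startswith("Label: ")]
    return labels[-1][len("Label: "):]


demos = [("good", "pos"), ("bad", "neg")]
dev_set = [("great", "pos"), ("nice", "pos")]
chosen = best_ordering(demos, dev_set, toy_predict)
```

With the toy predictor, the ordering that ends on the `pos` demonstration wins on this dev set, which mimics the recency sensitivity that makes ordering matter. A real setup would replace `toy_predict` with a call to the model under study.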

Title: Contextual morphologically-guided tokenization for Latin encoder models

Authors: Marisa Hudspeth, Patrick J. Burns, Brendan O'Connor
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.09709
Pdf URL: https://arxiv.org/pdf/2511.09709
Copy Paste: [[2511.09709]] Contextual morphologically-guided tokenization for Latin encoder models(https://arxiv.org/abs/2511.09709)
Keywords: language model
Abstract: Tokenization is a critical component of language model pretraining, yet standard tokenization methods often prioritize information-theoretical goals like high compression and low fertility rather than linguistic goals like morphological alignment. In fact, they have been shown to be suboptimal for morphologically rich languages, where tokenization quality directly impacts downstream performance. In this work, we investigate morphologically-aware tokenization for Latin, a morphologically rich language that is medium-resource in terms of pretraining data but high-resource in terms of curated lexical resources, a distinction that is often overlooked but critical in discussions of low-resource language modeling. We find that morphologically-guided tokenization improves overall performance on four downstream tasks. Performance gains are most pronounced for out-of-domain texts, highlighting our models' improved generalization ability. Our findings demonstrate the utility of linguistic resources to improve language modeling for morphologically complex languages. For low-resource languages that lack large-scale pretraining data, the development and incorporation of linguistic resources can serve as a feasible alternative to improve LM performance.

Title: How Small Can You Go? Compact Language Models for On-Device Critical Error Detection in Machine Translation

Authors: Muskaan Chopra, Lorenz Sparrenberg, Sarthak Khanna, Rafet Sifa
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.09748
Pdf URL: https://arxiv.org/pdf/2511.09748
Copy Paste: [[2511.09748]] How Small Can You Go? Compact Language Models for On-Device Critical Error Detection in Machine Translation(https://arxiv.org/abs/2511.09748)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) excel at evaluating machine translation (MT), but their scale and cost hinder deployment on edge devices and in privacy-sensitive workflows. We ask: how small can you get while still detecting meaning-altering translation errors? Focusing on English->German Critical Error Detection (CED), we benchmark sub-2B models (LFM2-350M, Qwen-3-0.6B/1.7B, Llama-3.2-1B-Instruct, Gemma-3-1B) across WMT21, WMT22, and SynCED-EnDe-2025. Our framework standardizes prompts, applies lightweight logit-bias calibration and majority voting, and reports both semantic quality (MCC, F1-ERR/F1-NOT) and compute metrics (VRAM, latency, throughput). Results reveal a clear sweet spot around one billion parameters: Gemma-3-1B provides the best quality-efficiency trade-off, reaching MCC=0.77 with F1-ERR=0.98 on SynCED-EnDe-2025 after merged-weights fine-tuning, while maintaining 400 ms single-sample latency on a MacBook Pro M4 Pro (24 GB). At larger scale, Qwen-3-1.7B attains the highest absolute MCC (+0.11 over Gemma) but with higher compute cost. In contrast, ultra-small models (0.6B) remain usable with few-shot calibration yet under-detect entity and number errors. Overall, compact, instruction-tuned LLMs augmented with lightweight calibration and small-sample supervision can deliver trustworthy, on-device CED for MT, enabling private, low-cost error screening in real-world translation pipelines. All datasets, prompts, and scripts are publicly available at our GitHub repository.
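Two of the ingredients named in the abstract, majority voting over repeated model outputs and the Matthews correlation coefficient (MCC), can be sketched as below. This is an illustrative sketch under assumptions: the `ERR`/`NOT` label strings are taken from the F1-ERR/F1-NOT metric names, and the function names are hypothetical rather than part of the authors' released scripts.

```python
from collections import Counter
from math import sqrt


def majority_vote(votes: list[str]) -> str:
    """Most frequent label among repeated outputs (ties go to the first seen)."""
    return Counter(votes).most_common(1)[0][0]


def mcc(y_true: list[str], y_pred: list[str], positive: str = "ERR") -> float:
    """Matthews correlation coefficient for a binary labelling task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Three sampled outputs for one translation pair, reduced to a single label.
label = majority_vote(["ERR", "NOT", "ERR"])

gold = ["ERR", "ERR", "NOT", "NOT"]
pred = ["ERR", "NOT", "NOT", "NOT"]
score = mcc(gold, pred)  # one missed error, no false alarms
```

MCC uses all four cells of the confusion matrix, which makes it a more informative summary than accuracy when critical errors are rare.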

Title: Predicate-Argument Structure Divergences in Chinese and English Parallel Sentences and their Impact on Language Transfer

Authors: Rocco Tripodi, Xiaoyu Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.09796
Pdf URL: https://arxiv.org/pdf/2511.09796
Copy Paste: [[2511.09796]] Predicate-Argument Structure Divergences in Chinese and English Parallel Sentences and their Impact on Language Transfer(https://arxiv.org/abs/2511.09796)
Keywords: language model
Abstract: Cross-lingual Natural Language Processing (NLP) has gained significant traction in recent years, offering practical solutions in low-resource settings by transferring linguistic knowledge from resource-rich to low-resource languages. This field leverages techniques like annotation projection and model transfer for language adaptation, supported by multilingual pre-trained language models. However, linguistic divergences hinder language transfer, especially among typologically distant languages. In this paper, we present an analysis of predicate-argument structures in parallel Chinese and English sentences. We explore the alignment and misalignment of predicate annotations, inspecting similarities and differences and proposing a categorization of structural divergences. The analysis and the categorization are supported by a qualitative and quantitative analysis of the results of an annotation projection experiment in which each of the two languages is used in turn as the source language to project annotations into the corresponding parallel sentences. The results of this analysis show clearly that language transfer is asymmetric, an aspect that requires attention when selecting the source language in transfer learning applications and that needs to be investigated before any scientific claim about cross-lingual NLP is made.

Title: TARG: Training-Free Adaptive Retrieval Gating for Efficient RAG

Authors: Yufeng Wang, Lu wei, Haibin Ling
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.09803
Pdf URL: https://arxiv.org/pdf/2511.09803
Copy Paste: [[2511.09803]] TARG: Training-Free Adaptive Retrieval Gating for Efficient RAG(https://arxiv.org/abs/2511.09803)
Keywords: llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) improves factuality but retrieving for every query often hurts quality while inflating tokens and latency. We propose Training-free Adaptive Retrieval Gating (TARG), a single-shot policy that decides when to retrieve using only a short, no-context draft from the base model. From the draft's prefix logits, TARG computes lightweight uncertainty scores: mean token entropy, a margin signal derived from the top-1/top-2 logit gap via a monotone link, or small-N variance across a handful of stochastic prefixes, and triggers retrieval only when the score exceeds a threshold. The gate is model agnostic, adds only tens to hundreds of draft tokens, and requires no additional training or auxiliary heads. On NQ-Open, TriviaQA, and PopQA, TARG consistently shifts the accuracy-efficiency frontier: compared with Always-RAG, TARG matches or improves EM/F1 while reducing retrieval by 70-90% and cutting end-to-end latency, and it remains close to Never-RAG in overhead. A central empirical finding is that under modern instruction-tuned LLMs the margin signal is a robust default (entropy compresses as backbones sharpen), with small-N variance offering a conservative, budget-first alternative. We provide ablations over gate type and prefix length and use a delta-latency view to make budget trade-offs explicit.
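The gating rule described in the abstract (compute an uncertainty score from the draft's prefix logits, retrieve only when it exceeds a threshold) can be sketched as follows. This is a minimal sketch under assumptions, not the TARG implementation: the `exp(-gap)` link is just one possible monotone mapping from the top-1/top-2 gap to an uncertainty score, and the function names and the threshold of 0.5 are hypothetical.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over one token's logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_token_entropy(prefix_logits: list[list[float]]) -> float:
    """Average Shannon entropy (in nats) over the draft prefix tokens."""
    entropies = []
    for logits in prefix_logits:
        probs = softmax(logits)
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0))
    return sum(entropies) / len(entropies)


def margin_score(prefix_logits: list[list[float]]) -> float:
    """Uncertainty from the top-1/top-2 logit gap.

    Assumption: exp(-gap) is used as the monotone link, so a small gap
    (an uncertain token) yields a score near 1.
    """
    scores = []
    for logits in prefix_logits:
        top1, top2 = sorted(logits, reverse=True)[:2]
        scores.append(math.exp(-(top1 - top2)))
    return sum(scores) / len(scores)


def should_retrieve(score: float, threshold: float) -> bool:
    """Trigger retrieval only when uncertainty exceeds the threshold."""
    return score > threshold


confident = [[10.0, 0.0, 0.0]]  # one sharply peaked token
uncertain = [[1.0, 1.0, 1.0]]   # one uniform token
gate_confident = should_retrieve(margin_score(confident), threshold=0.5)
gate_uncertain = should_retrieve(margin_score(uncertain), threshold=0.5)
```

In this toy case the peaked token keeps the gate closed and the uniform token opens it. In practice the threshold would be tuned against a retrieval budget.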

Title: Khmer Spellchecking: A Holistic Approach

Authors: Marry Kong, Rina Buoy, Sovisal Chenda, Nguonly Taing
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.09812
Pdf URL: https://arxiv.org/pdf/2511.09812
Copy Paste: [[2511.09812]] Khmer Spellchecking: A Holistic Approach(https://arxiv.org/abs/2511.09812)
Keywords: language model
Abstract: Compared to English and other high-resource languages, spellchecking for Khmer remains an unresolved problem due to several challenges. First, there are misalignments between words in the lexicon and the word segmentation model. Second, a Khmer word can be written in different forms. Third, Khmer compound words are often loosely and easily formed, and these compound words are not always found in the lexicon. Fourth, some proper nouns may be flagged as misspellings due to the absence of a Khmer named-entity recognition (NER) model. Unfortunately, existing solutions do not adequately address these challenges. This paper proposes a holistic approach to the Khmer spellchecking problem by integrating Khmer subword segmentation, Khmer NER, Khmer grapheme-to-phoneme (G2P) conversion, and a Khmer language model to tackle these challenges, identify potential correction candidates, and rank the most suitable candidate. Experimental results show that the proposed approach achieves state-of-the-art Khmer spellchecking accuracy of up to 94.4%, outperforming existing solutions. The benchmark datasets for Khmer spellchecking and NER tasks in this study will be made publicly available.

Title: Answering Students' Questions on Course Forums Using Multiple Chain-of-Thought Reasoning and Finetuning RAG-Enabled LLM

Authors: Neo Wang, Sonit Singh
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2511.09831
Pdf URL: https://arxiv.org/pdf/2511.09831
Copy Paste: [[2511.09831]] Answering Students' Questions on Course Forums Using Multiple Chain-of-Thought Reasoning and Finetuning RAG-Enabled LLM(https://arxiv.org/abs/2511.09831)
Keywords: language model, llm, hallucination, retrieval augmented generation, chain-of-thought
Abstract: Course forums are increasingly significant and play a vital role in facilitating student discussions and answering course-related questions. They provide a platform for students to post questions about course content and administrative issues. However, growing enrollment creates several challenges. The primary challenge is that students' queries cannot be answered immediately and instructors face many repetitive questions. To mitigate these issues, we propose a question answering system based on a large language model (LLM) with retrieval-augmented generation (RAG). This work focuses on designing a question answering system with an open-source LLM and fine-tuning it on the relevant course dataset. To further improve performance, we use a local knowledge base containing all the course content and apply RAG to retrieve documents relevant to students' queries. To mitigate hallucination in LLMs, we also integrate multiple chain-of-thought reasoning. In this work, we evaluate the fine-tuned LLM with RAG on the HotpotQA dataset. The experimental results demonstrate that the fine-tuned LLM with RAG performs strongly on the question answering task.

Title: TermGPT: Multi-Level Contrastive Fine-Tuning for Terminology Adaptation in Legal and Financial Domain

Authors: Yidan Sun, Mengying Zhu, Feiyue Chen, Yangyang Wu, Xiaolei Dan, Mengyuan Yang, Xiaolin Zheng, Shenglin Ben
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.09854
Pdf URL: https://arxiv.org/pdf/2511.09854
Copy Paste: [[2511.09854]] TermGPT: Multi-Level Contrastive Fine-Tuning for Terminology Adaptation in Legal and Financial Domain(https://arxiv.org/abs/2511.09854)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) have demonstrated impressive performance in text generation tasks; however, their embedding spaces often suffer from the isotropy problem, resulting in poor discrimination of domain-specific terminology, particularly in legal and financial contexts. This weakness in terminology-level representation can severely hinder downstream tasks such as legal judgment prediction or financial risk analysis, where subtle semantic distinctions are critical. To address this problem, we propose TermGPT, a multi-level contrastive fine-tuning framework designed for terminology adaptation. We first construct a sentence graph to capture semantic and structural relations, and generate semantically consistent yet discriminative positive and negative samples based on contextual and topological cues. We then devise a multi-level contrastive learning approach at both the sentence and token levels, enhancing global contextual understanding and fine-grained terminology discrimination. To support robust evaluation, we construct the first financial terminology dataset derived from official regulatory documents. Experiments show that TermGPT outperforms existing baselines in term discrimination tasks within the finance and legal domains.

Title: In-Token Rationality Optimization: Towards Accurate and Concise LLM Reasoning via Self-Feedback

Authors: Mingye Zhu, Yi Liu, Zheren Fu, Quan Wang, Yongdong Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.09865
Pdf URL: https://arxiv.org/pdf/2511.09865
Copy Paste: [[2511.09865]] In-Token Rationality Optimization: Towards Accurate and Concise LLM Reasoning via Self-Feedback(https://arxiv.org/abs/2511.09865)
Keywords: language model, llm, chain-of-thought
Abstract: Training Large Language Models (LLMs) for chain-of-thought reasoning presents a significant challenge: supervised fine-tuning on a single "golden" rationale hurts generalization as it penalizes equally valid alternatives, whereas reinforcement learning with verifiable rewards struggles with credit assignment and prohibitive computational cost. To tackle these limitations, we introduce InTRO (In-Token Rationality Optimization), a new framework that enables both token-level exploration and self-feedback for accurate and concise reasoning. Instead of directly optimizing an intractable objective over all valid reasoning paths, InTRO leverages correction factors, that is, token-wise importance weights estimated from the information discrepancy between the generative policy and its answer-conditioned counterpart, for informative next-token selection. This approach allows the model to perform token-level exploration and receive self-generated feedback within a single forward pass, ultimately encouraging accurate and concise rationales. Across six math-reasoning benchmarks, InTRO consistently outperforms other baselines, raising solution accuracy by up to 20% relative to the base model. Its chains of thought are also notably more concise, exhibiting reduced verbosity. Beyond this, InTRO enables cross-domain transfer, successfully adapting to out-of-domain reasoning tasks that extend beyond the realm of mathematics, demonstrating robust generalization.

Title: HierRouter: Coordinated Routing of Specialized Large Language Models via Reinforcement Learning

Authors: Nikunj Gupta, Bill Guo, Rajgopal Kannan, Viktor K. Prasanna
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.09873
Pdf URL: https://arxiv.org/pdf/2511.09873
Copy Paste: [[2511.09873]] HierRouter: Coordinated Routing of Specialized Large Language Models via Reinforcement Learning(https://arxiv.org/abs/2511.09873)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) deliver state-of-the-art performance across many tasks but impose high computational and memory costs, limiting their deployment in resource-constrained or real-time settings. To address this, we propose HierRouter, a hierarchical routing approach that dynamically assembles inference pipelines from a pool of specialized, lightweight language models. Formulated as a finite-horizon Markov Decision Process (MDP), our approach trains a Proximal Policy Optimization (PPO)-based reinforcement learning agent to iteratively select which models to invoke at each stage of multi-hop inference. The agent conditions on the evolving context and accumulated cost to make context-aware routing decisions. Experiments with three open-source candidate LLMs across six benchmarks, including QA, code generation, and mathematical reasoning, show that HierRouter improves response quality by up to 2.4x compared to using individual models independently, while incurring only a minimal additional inference cost on average. These results highlight the promise of hierarchical routing for cost-efficient, high-performance LLM inference. The code is available at this https URL Nikunj-Gupta/hierouter.

Title: EnchTable: Unified Safety Alignment Transfer in Fine-tuned Large Language Models

Authors: Jialin Wu, Kecen Li, Zhicong Huang, Xinfeng Li, Xiaofeng Wang, Cheng Hong
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2511.09880
Pdf URL: https://arxiv.org/pdf/2511.09880
Copy Paste: [[2511.09880]] EnchTable: Unified Safety Alignment Transfer in Fine-tuned Large Language Models(https://arxiv.org/abs/2511.09880)
Keywords: language model, llm, prompt
Abstract: Many machine learning models are fine-tuned from large language models (LLMs) to achieve high performance in specialized domains like code generation, biomedical analysis, and mathematical problem solving. However, this fine-tuning process often introduces a critical vulnerability: the systematic degradation of safety alignment, undermining ethical guidelines and increasing the risk of harmful outputs. Addressing this challenge, we introduce EnchTable, a novel framework designed to transfer and maintain safety alignment in downstream LLMs without requiring extensive retraining. EnchTable leverages a Neural Tangent Kernel (NTK)-based safety vector distillation method to decouple safety constraints from task-specific reasoning, ensuring compatibility across diverse model architectures and sizes. Additionally, our interference-aware merging technique effectively balances safety and utility, minimizing performance compromises across various task domains. We implemented a fully functional prototype of EnchTable on three different task domains and three distinct LLM architectures, and evaluated its performance through extensive experiments on eleven diverse datasets, assessing both utility and model safety. Our evaluations include LLMs from different vendors, demonstrating EnchTable's generalization capability. Furthermore, EnchTable exhibits robust resistance to static and dynamic jailbreaking attacks, outperforming vendor-released safety models in mitigating adversarial prompts. Comparative analyses with six parameter modification methods and two inference-time alignment baselines reveal that EnchTable achieves a significantly lower unsafe rate, higher utility score, and universal applicability across different task domains. Additionally, we validate EnchTable can be seamlessly integrated into various deployment pipelines without significant overhead.

Title: MINDS: A Cross-cultural Dialogue Corpus for Social Norm Classification and Adherence Detection

Authors: Pritish Sahu, Anirudh Som, Dimitra Vergyri, Ajay Divakaran
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.09918
Pdf URL: https://arxiv.org/pdf/2511.09918
Copy Paste: [[2511.09918]] MINDS: A Cross-cultural Dialogue Corpus for Social Norm Classification and Adherence Detection(https://arxiv.org/abs/2511.09918)
Keywords: agent
Abstract: Social norms are implicit, culturally grounded expectations that guide interpersonal communication. Unlike factual commonsense, norm reasoning is subjective, context-dependent, and varies across cultures, posing challenges for computational models. Prior works provide valuable normative annotations but mostly target isolated utterances or synthetic dialogues, limiting their ability to capture the fluid, multi-turn nature of real-world conversations. In this work, we present Norm-RAG, a retrieval-augmented, agentic framework for nuanced social norm inference in multi-turn dialogues. Norm-RAG models utterance-level attributes, including communicative intent, speaker roles, interpersonal framing, and linguistic cues, and grounds them in structured normative documentation retrieved via a novel Semantic Chunking approach. This enables interpretable and context-aware reasoning about norm adherence and violation across multilingual dialogues. We further introduce MINDS (Multilingual Interactions with Norm-Driven Speech), a bilingual dataset comprising 31 multi-turn Mandarin-English and Spanish-English conversations. Each turn is annotated for norm category and adherence status using multi-annotator consensus, reflecting cross-cultural and realistic norm expression. Our experiments show that Norm-RAG improves norm detection and generalization and demonstrates improved performance for culturally adaptive and socially intelligent dialogue systems.

Title: Leveraging Large Language Models for Identifying Knowledge Components

Authors: Canwen Wang, Jionghao Lin, Kenneth R. Koedinger
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2511.09935
Pdf URL: https://arxiv.org/pdf/2511.09935
Copy Paste: [[2511.09935]] Leveraging Large Language Models for Identifying Knowledge Components(https://arxiv.org/abs/2511.09935)
Keywords: language model, gpt, llm, prompt
Abstract: Knowledge Components (KCs) are foundational to adaptive learning systems, but their manual identification by domain experts is a significant bottleneck. While Large Language Models (LLMs) offer a promising avenue for automating this process, prior research has been limited to small datasets and has been shown to produce superfluous, redundant KC labels. This study addresses these limitations by first scaling a "simulated textbook" LLM prompting strategy (using GPT-4o-mini) to a larger dataset of 646 multiple-choice questions. We found that this initial automated approach performed significantly worse than an expert-designed KC model (RMSE 0.4285 vs. 0.4206) and generated an excessive number of KCs (569 vs. 101). To address the issue of redundancy, we proposed and evaluated a novel method for merging semantically similar KC labels based on their cosine similarity. This merging strategy significantly improved the model's performance; a model using a cosine similarity threshold of 0.8 achieved the best result, reducing the KC count to 428 and improving the RMSE to 0.4259. This demonstrates that while scaled LLM generation alone is insufficient, combining it with a semantic merging technique offers a viable path toward automating and refining KC identification.

Title: REAP: Enhancing RAG with Recursive Evaluation and Adaptive Planning for Multi-Hop Question Answering

Authors: Yijie Zhu, Haojie Zhou, Wanting Hong, Tailin Liu, Ning Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.09966
Pdf URL: https://arxiv.org/pdf/2511.09966
Copy Paste: [[2511.09966]] REAP: Enhancing RAG with Recursive Evaluation and Adaptive Planning for Multi-Hop Question Answering(https://arxiv.org/abs/2511.09966)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) has been extensively employed to mitigate hallucinations in large language models (LLMs). However, existing methods for multi-hop reasoning tasks often lack global planning, increasing the risk of falling into local reasoning impasses. Insufficient exploitation of retrieved content and the neglect of latent clues fail to ensure the accuracy of reasoning outcomes. To overcome these limitations, we propose Recursive Evaluation and Adaptive Planning (REAP), whose core idea is to explicitly maintain structured sub-tasks and facts related to the current task through the Sub-task Planner (SP) and Fact Extractor (FE) modules. SP maintains a global perspective, guiding the overall reasoning direction and evaluating the task state based on the outcomes of FE, enabling dynamic optimization of the task-solving trajectory. FE performs fine-grained analysis over retrieved content to extract reliable answers and clues. These two modules incrementally enrich a logically coherent representation of global knowledge, enhancing the reliability and the traceability of the reasoning process. Furthermore, we propose a unified task paradigm design that enables effective multi-task fine-tuning, significantly enhancing SP's performance on complex, data-scarce tasks. We conduct extensive experiments on multiple public multi-hop datasets, and the results demonstrate that our method significantly outperforms existing RAG methods in both in-domain and out-of-domain settings, validating its effectiveness in complex multi-hop reasoning tasks.

Title: NumPert: Numerical Perturbations to Probe Language Models for Veracity Prediction

Authors: Peter Røysland Aarnes, Vinay Setty
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.09971
Pdf URL: https://arxiv.org/pdf/2511.09971
Copy Paste: [[2511.09971]] NumPert: Numerical Perturbations to Probe Language Models for Veracity Prediction(https://arxiv.org/abs/2511.09971)
Keywords: language model
Abstract: Large language models show strong performance on knowledge-intensive tasks such as fact-checking and question answering, yet they often struggle with numerical reasoning. We present a systematic evaluation of state-of-the-art models for veracity prediction on numerical claims and evidence pairs using controlled perturbations, including label-flipping probes, to test robustness. Our results indicate that even leading proprietary systems experience accuracy drops of up to 62% under certain perturbations. No model proves to be robust across all conditions. We further find that increasing context length generally reduces accuracy, but when extended context is enriched with perturbed demonstrations, most models substantially recover. These findings highlight critical limitations in numerical fact-checking and suggest that robustness remains an open challenge for current language models.

Title: Modeling Uncertainty Trends for Timely Retrieval in Dynamic RAG

Authors: Bo Li, Tian Tian, Zhenghua Xu, Hao Cheng, Shikun Zhang, Wei Ye
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.09980
Pdf URL: https://arxiv.org/pdf/2511.09980
Copy Paste: [[2511.09980]] Modeling Uncertainty Trends for Timely Retrieval in Dynamic RAG(https://arxiv.org/abs/2511.09980)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Dynamic retrieval-augmented generation (RAG) allows large language models (LLMs) to fetch external knowledge on demand, offering greater adaptability than static RAG. A central challenge in this setting lies in determining the optimal timing for retrieval. Existing methods often trigger retrieval based on low token-level confidence, which may lead to delayed intervention after errors have already propagated. We introduce Entropy-Trend Constraint (ETC), a training-free method that determines optimal retrieval timing by modeling the dynamics of token-level uncertainty. Specifically, ETC utilizes first- and second-order differences of the entropy sequence to detect emerging uncertainty trends, enabling earlier and more precise retrieval. Experiments on six QA benchmarks with three LLM backbones demonstrate that ETC consistently outperforms strong baselines while reducing retrieval frequency. ETC is particularly effective in domain-specific scenarios, exhibiting robust generalization capabilities. Ablation studies and qualitative analyses further confirm that trend-aware uncertainty modeling yields more effective retrieval timing. The method is plug-and-play, model-agnostic, and readily integrable into existing decoding pipelines. Implementation code is included in the supplementary materials.
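As a rough illustration of the trend-based idea in the abstract, the sketch below computes first- and second-order differences of a token-level entropy sequence and fires when uncertainty is both rising and accelerating. The abstract does not give the actual trigger rule, so the `should_retrieve` condition and the thresholds `d1_min` and `d2_min` are hypothetical choices, not the ETC method itself.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def differences(seq):
    """Successive differences: seq[i + 1] - seq[i]."""
    return [b - a for a, b in zip(seq, seq[1:])]

def should_retrieve(entropies, d1_min=0.1, d2_min=0.0):
    """Assumed trigger rule: retrieve when the latest first-order difference
    exceeds d1_min and the latest second-order difference exceeds d2_min."""
    d1 = differences(entropies)
    d2 = differences(d1)
    return bool(d2) and d1[-1] > d1_min and d2[-1] > d2_min

# Hypothetical entropy traces over three decoding steps.
rising = [0.2, 0.3, 0.6]   # uncertainty increasing and accelerating
flat = [0.5, 0.5, 0.5]     # uncertainty stable
trigger_rising = should_retrieve(rising)
trigger_flat = should_retrieve(flat)
```

A rule of this shape reacts to the trajectory of uncertainty rather than to a single low-confidence token, which is the contrast the abstract draws with confidence-threshold triggers.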

Title: Language Drift in Multilingual Retrieval-Augmented Generation: Characterization and Decoding-Time Mitigation

Authors: Bo Li, Zhenghua Xu, Rui Xie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.09984
Pdf URL: https://arxiv.org/pdf/2511.09984
Copy Paste: [[2511.09984]] Language Drift in Multilingual Retrieval-Augmented Generation: Characterization and Decoding-Time Mitigation(https://arxiv.org/abs/2511.09984)
Keywords: language model, llm, retrieval-augmented generation, chain-of-thought
Abstract: Multilingual Retrieval-Augmented Generation (RAG) enables large language models (LLMs) to perform knowledge-intensive tasks in multilingual settings by leveraging retrieved documents as external evidence. However, when the retrieved evidence differs in language from the user query and in-context exemplars, the model often exhibits language drift by generating responses in an unintended language. This phenomenon is especially pronounced during reasoning-intensive decoding, such as Chain-of-Thought (CoT) generation, where intermediate steps introduce further language instability. In this paper, we systematically study output language drift in multilingual RAG across multiple datasets, languages, and LLM backbones. Our controlled experiments reveal that the drift results not from comprehension failure but from decoder-level collapse, where dominant token distributions and high-frequency English patterns dominate the intended generation language. We further observe that English serves as a semantic attractor under cross-lingual conditions, emerging as both the strongest interference source and the most frequent fallback language. To mitigate this, we propose Soft Constrained Decoding (SCD), a lightweight, training-free decoding strategy that gently steers generation toward the target language by penalizing non-target-language tokens. SCD is model-agnostic and can be applied to any generation algorithm without modifying the architecture or requiring additional data. Experiments across three multilingual datasets and multiple typologically diverse languages show that SCD consistently improves language alignment and task performance, providing an effective and generalizable solution in multilingual RAG.
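The decoding-time idea can be sketched as a logit adjustment. The abstract says only that SCD penalizes non-target-language tokens; the fixed subtractive `penalty`, the per-token language mask, and the four-word vocabulary below are assumptions for illustration and may differ from the paper's formulation.

```python
def soft_constrain(logits, is_target, penalty=2.0):
    """Subtract a fixed penalty from logits of tokens outside the target language.
    Tokens are down-weighted, not forbidden, so the constraint stays soft."""
    return [x if ok else x - penalty for x, ok in zip(logits, is_target)]

def argmax_token(vocab, logits):
    """Return the vocabulary item with the highest logit (greedy decoding step)."""
    return vocab[max(range(len(vocab)), key=logits.__getitem__)]

# Hypothetical vocabulary and language mask for a French target language.
vocab = ["the", "le", "chat", "cat"]
is_french = [False, True, True, False]
logits = [3.0, 2.5, 1.0, 2.0]

before = argmax_token(vocab, logits)
after = argmax_token(vocab, soft_constrain(logits, is_french))
```

Because the adjustment acts only on logits, it can sit in front of any sampling or search procedure, consistent with the abstract's claim that SCD needs no architectural changes or extra data.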

Title: PustakAI: Curriculum-Aligned and Interactive Textbooks Using Large Language Models

Authors: Shivam Sharma (1), Riya Naik (1), Tejas Gawas (1), Heramb Patil (1), Kunal Korgaonkar (1) ((1) CSIS Department, BITS Pilani K K Birla Goa Campus, India)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.10002
Pdf URL: https://arxiv.org/pdf/2511.10002
Copy Paste: [[2511.10002]] PustakAI: Curriculum-Aligned and Interactive Textbooks Using Large Language Models(https://arxiv.org/abs/2511.10002)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in understanding and generating human-like content. This has revolutionized various sectors such as healthcare, software development, and education. In education, LLMs offer potential for personalized and interactive learning experiences, especially in regions with limited teaching resources. However, adapting these models effectively to curriculum-specific content, such as the National Council of Educational Research and Training (NCERT) syllabus in India, presents unique challenges in terms of accuracy, alignment, and pedagogical relevance. In this paper, we present the framework "PustakAI" (Pustak means "book" in many Indian languages) for the design and evaluation of a novel question-answering dataset "NCERT-QA" aligned with the NCERT curriculum for English and Science subjects of grades 6 to 8. We classify the curated QA pairs as Factoid, Inferential, and Others (evaluative and reasoning). We evaluate the dataset with various prompting techniques, such as meta-prompt, few-shot, and CoT-style prompting, using diverse evaluation metrics to understand which approach aligns more efficiently with the structure and demands of the curriculum. Along with the usability of the dataset, we analyze the strengths and limitations of current open-source LLMs (Gemma3:1b, Llama3.2:3b, and Nemotron-mini:4b) and high-end LLMs (Llama-4-Scout-17B and Deepseek-r1-70B) as AI-based learning tools in formal education systems.

Title: Do Language Models Associate Sound with Meaning? A Multimodal Study of Sound Symbolism

Authors: Jinhong Jeong, Sunghyun Lee, Jaeyoung Lee, Seonah Han, Youngjae Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.10045
Pdf URL: https://arxiv.org/pdf/2511.10045
Copy Paste: [[2511.10045]] Do Language Models Associate Sound with Meaning? A Multimodal Study of Sound Symbolism(https://arxiv.org/abs/2511.10045)
Keywords: language model, llm
Abstract: Sound symbolism is a linguistic concept that refers to non-arbitrary associations between phonetic forms and their meanings. We suggest that this can be a compelling probe into how Multimodal Large Language Models (MLLMs) interpret auditory information in human languages. We investigate MLLMs' performance on phonetic iconicity across textual (orthographic and IPA) and auditory forms of inputs with up to 25 semantic dimensions (e.g., sharp vs. round), observing models' layer-wise information processing by measuring phoneme-level attention fraction scores. To this end, we present LEX-ICON, an extensive mimetic word dataset consisting of 8,052 words from four natural languages (English, French, Japanese, and Korean) and 2,930 systematically constructed pseudo-words, annotated with semantic features applied across both text and audio modalities. Our key findings demonstrate (1) MLLMs' phonetic intuitions that align with existing linguistic research across multiple semantic dimensions and (2) phonosemantic attention patterns that highlight models' focus on iconic phonemes. These results bridge domains of artificial intelligence and cognitive linguistics, providing the first large-scale, quantitative analyses of phonetic iconicity in terms of MLLMs' interpretability.

Title: GraphIF: Enhancing Multi-Turn Instruction Following for Large Language Models with Relation Graph Prompt

Authors: Zhenhe Li, Can Lin, Ling Zheng, Wen-Da Wei, Junli Liang, Qi Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.10051
Pdf URL: https://arxiv.org/pdf/2511.10051
Copy Paste: [[2511.10051]] GraphIF: Enhancing Multi-Turn Instruction Following for Large Language Models with Relation Graph Prompt(https://arxiv.org/abs/2511.10051)
Keywords: language model, llm, prompt, agent
Abstract: Multi-turn instruction following is essential for building intelligent conversational systems that can consistently adhere to instructions across dialogue turns. However, existing approaches to enhancing multi-turn instruction following primarily rely on collecting or generating large-scale multi-turn dialogue datasets to fine-tune large language models (LLMs), which treat each response generation as an isolated task and fail to explicitly incorporate multi-turn instruction following into the optimization objectives. As a result, instruction-tuned LLMs often struggle with complex long-distance constraints. In multi-turn dialogues, relational constraints across turns can be naturally modeled as labeled directed edges, making graph structures particularly suitable for modeling multi-turn instruction following. Despite this potential, leveraging graph structures to enhance the multi-turn instruction following capabilities of LLMs remains unexplored. To bridge this gap, we propose GraphIF, a plug-and-play framework that models multi-turn dialogues as directed relation graphs and leverages graph prompts to enhance the instruction following capabilities of LLMs. GraphIF comprises three key components: (1) an agent-based relation extraction module that captures inter-turn semantic relations via action-triggered mechanisms to construct structured graphs; (2) a relation graph prompt generation module that converts structured graph information into natural language prompts; and (3) a response rewriting module that refines initial LLM outputs using the generated graph prompts. Extensive experiments on two long multi-turn dialogue datasets demonstrate that GraphIF can be seamlessly integrated into instruction-tuned LLMs and leads to significant improvements across all four multi-turn instruction-following evaluation metrics.

Title: Format Matters: The Robustness of Multimodal LLMs in Reviewing Evidence from Tables and Charts

Authors: Xanh Ho, Yun-Ang Wu, Sunisth Kumar, Florian Boudin, Atsuhiro Takasu, Akiko Aizawa
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.10075
Pdf URL: https://arxiv.org/pdf/2511.10075
Copy Paste: [[2511.10075]] Format Matters: The Robustness of Multimodal LLMs in Reviewing Evidence from Tables and Charts(https://arxiv.org/abs/2511.10075)
Keywords: language model, llm
Abstract: With the growing number of submitted scientific papers, there is an increasing demand for systems that can assist reviewers in evaluating research claims. Experimental results are a core component of scientific work, often presented in varying formats such as tables or charts. Understanding how robust current multimodal large language models (multimodal LLMs) are at verifying scientific claims across different evidence formats remains an important and underexplored challenge. In this paper, we design and conduct a series of experiments to assess the ability of multimodal LLMs to verify scientific claims using both tables and charts as evidence. To enable this evaluation, we adapt two existing datasets of scientific papers by incorporating annotations and structures necessary for a multimodal claim verification task. Using this adapted dataset, we evaluate 12 multimodal LLMs and find that current models perform better with table-based evidence while struggling with chart-based evidence. We further conduct human evaluations and observe that humans maintain strong performance across both formats, unlike the models. Our analysis also reveals that smaller multimodal LLMs (under 8B) show weak correlation in performance between table-based and chart-based tasks, indicating limited cross-modal generalization. These findings highlight a critical gap in current models' multimodal reasoning capabilities. We suggest that future multimodal LLMs should place greater emphasis on improving chart understanding to better support scientific claim verification.

Title: On the Military Applications of Large Language Models

Authors: Satu Johansson, Taneli Riihonen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.10093
Pdf URL: https://arxiv.org/pdf/2511.10093
Copy Paste: [[2511.10093]] On the Military Applications of Large Language Models(https://arxiv.org/abs/2511.10093)
Keywords: language model, gpt, chat
Abstract: In this paper, we consider military use cases and applications, and their implementation, for natural language processing and large language models, which have risen to prominence with the invention of the generative pre-trained transformer (GPT) and the extensive foundation model pretraining done by OpenAI for ChatGPT and others. First, we interrogate a GPT-based language model (viz. Microsoft Copilot) to make it reveal its own knowledge about the potential military applications of such models and then critically assess the information. Second, we study how commercial cloud services (viz. Microsoft Azure) could be used readily to build such applications and assess which of them are feasible. We conclude that the summarization and generative properties of language models directly facilitate many applications at large, and that other features may find particular uses.

Title: Beyond the Black Box: Demystifying Multi-Turn LLM Reasoning with VISTA

Authors: Yiran Zhang, Mingyang Lin, Mark Dras, Usman Naseem
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.10182
Pdf URL: https://arxiv.org/pdf/2511.10182
Copy Paste: [[2511.10182]] Beyond the Black Box: Demystifying Multi-Turn LLM Reasoning with VISTA(https://arxiv.org/abs/2511.10182)
Keywords: language model, llm
Abstract: Recent research has increasingly focused on the reasoning capabilities of Large Language Models (LLMs) in multi-turn interactions, as these scenarios more closely mirror real-world problem-solving. However, analyzing the intricate reasoning processes within these interactions presents a significant challenge due to complex contextual dependencies and a lack of specialized visualization tools, leading to a high cognitive load for researchers. To address this gap, we present VISTA, a web-based Visual Interactive System for Textual Analytics in multi-turn reasoning tasks. VISTA allows users to visualize the influence of context on model decisions and interactively modify conversation histories to conduct "what-if" analyses across different models. Furthermore, the platform can automatically parse a session and generate a reasoning dependency tree, offering a transparent view of the model's step-by-step logical path. By providing a unified and interactive framework, VISTA significantly reduces the complexity of analyzing reasoning chains, thereby facilitating a deeper understanding of the capabilities and limitations of current LLMs. The platform is open-source and supports easy integration of custom benchmarks and local models.

Title: Text2SQL-Flow: A Robust SQL-Aware Data Augmentation Framework for Text-to-SQL

Authors: Qifeng Cai, Hao Liang, Chang Xu, Tao Xie, Wentao Zhang, Bin Cui
Subjects: cs.CL, cs.DB
Abstract URL: https://arxiv.org/abs/2511.10192
Pdf URL: https://arxiv.org/pdf/2511.10192
Copy Paste: [[2511.10192]] Text2SQL-Flow: A Robust SQL-Aware Data Augmentation Framework for Text-to-SQL(https://arxiv.org/abs/2511.10192)
Keywords: llm, chain-of-thought
Abstract: The data-centric paradigm has become pivotal in AI, especially for Text-to-SQL, where performance is limited by scarce, simplistic, and low-diversity datasets. To address this, we propose Text2SQL-Flow, a SQL-aware data augmentation framework that generates large-scale, semantically valid, and structurally diverse Text-to-SQL pairs from minimal seed data. It operates across six augmentation dimensions and integrates an end-to-end pipeline featuring SQL execution verification, natural language question generation, chain-of-thought reasoning traces, and data classification. A modular Database Manager ensures cross-database compatibility and scalability. Using this framework, we build SQLFlow, a high-quality dataset of 89,544 annotated examples. We evaluate SQLFlow in two settings: (1) For open-source LLMs, fine-tuning on SQLFlow consistently improves performance across benchmarks under the same data budget. (2) For closed-source LLMs, we introduce a masked alignment retrieval method that treats SQLFlow as both knowledge base and training data for the retriever. This enables structure-aware example matching by modeling fine-grained alignments between questions and SQL queries. Experiments show our retrieval strategy outperforms existing methods, underscoring the value of SQLFlow's high-fidelity data and our novel technique. Our work establishes a scalable, data-centric foundation for advancing Text-to-SQL systems and highlights the critical role of high-quality structured data in modern AI.

Title: EffiReason-Bench: A Unified Benchmark for Evaluating and Advancing Efficient Reasoning in Large Language Models

Authors: Junquan Huang, Haotian Wu, Yubo Gao, Yibo Yan, Junyan Zhang, Yonghua Hei, Song Dai, Jie Zhang, Puay Siew Tan, Xuming Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.10201
Pdf URL: https://arxiv.org/pdf/2511.10201
Copy Paste: [[2511.10201]] EffiReason-Bench: A Unified Benchmark for Evaluating and Advancing Efficient Reasoning in Large Language Models(https://arxiv.org/abs/2511.10201)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large language models (LLMs) with Chain-of-Thought (CoT) prompting achieve strong reasoning but often produce unnecessarily long explanations, increasing cost and sometimes reducing accuracy. Fair comparison of efficiency-oriented approaches is hindered by fragmented evaluation practices. We introduce EffiReason-Bench, a unified benchmark for rigorous cross-paradigm evaluation of efficient reasoning methods across three categories: Reasoning Blueprints, Dynamic Execution, and Post-hoc Refinement. To enable step-by-step evaluation, we construct verified CoT annotations for CommonsenseQA and LogiQA via a pipeline that enforces standardized reasoning structures, comprehensive option-wise analysis, and human verification. We evaluate 7 methods across 6 open-source LLMs (1B-70B) on 4 datasets spanning mathematics, commonsense, and logic, and propose the E3-Score, a principled metric inspired by economic trade-off modeling that provides smooth, stable evaluation without discontinuities or heavy reliance on heuristics. Experiments show that no single method universally dominates; optimal strategies depend on backbone scale, task complexity, and architecture.

Title: Persona-Aware Alignment Framework for Personalized Dialogue Generation

Authors: Guanrong Li, Xinyu Liu, Zhen Wu, Xinyu Dai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.10215
Pdf URL: https://arxiv.org/pdf/2511.10215
Copy Paste: [[2511.10215]] Persona-Aware Alignment Framework for Personalized Dialogue Generation(https://arxiv.org/abs/2511.10215)
Keywords: language model
Abstract: Personalized dialogue generation aims to leverage persona profiles and dialogue history to generate persona-relevant and consistent responses. Mainstream models typically rely on token-level language model training with persona dialogue data, such as Next Token Prediction, to implicitly achieve personalization, which makes these methods prone to neglecting the given personas and generating generic responses. To address this issue, we propose a novel Persona-Aware Alignment Framework (PAL), which directly treats persona alignment as the training objective of dialogue generation. Specifically, PAL employs a two-stage training method comprising Persona-aware Learning and Persona Alignment, equipped with an easy-to-use inference strategy, Select then Generate, to improve persona sensitivity and generate more persona-relevant responses at the semantics level. Through extensive experiments, we demonstrate that our framework outperforms many state-of-the-art personalized dialogue methods and large language models.

Title: LangGPS: Language Separability Guided Data Pre-Selection for Joint Multilingual Instruction Tuning

Authors: Yangfan Ye, Xiaocheng Feng, Xiachong Feng, Lei Huang, Weitao Ma, Qichen Hong, Yunfei Lu, Duyu Tang, Dandan Tu, Bing Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.10229
Pdf URL: https://arxiv.org/pdf/2511.10229
Copy Paste: [[2511.10229]] LangGPS: Language Separability Guided Data Pre-Selection for Joint Multilingual Instruction Tuning(https://arxiv.org/abs/2511.10229)
Keywords: language model, llm
Abstract: Joint multilingual instruction tuning is a widely adopted approach to improve the multilingual instruction-following ability and downstream performance of large language models (LLMs), but the resulting multilingual capability remains highly sensitive to the composition and selection of the training data. Existing selection methods, often based on features like text quality, diversity, or task relevance, typically overlook the intrinsic linguistic structure of multilingual data. In this paper, we propose LangGPS, a lightweight two-stage pre-selection framework guided by language separability, which quantifies how well samples in different languages can be distinguished in the model's representation space. LangGPS first filters training data based on separability scores and then refines the subset using existing selection methods. Extensive experiments across six benchmarks and 22 languages demonstrate that applying LangGPS on top of existing selection methods improves their effectiveness and generalizability in multilingual training, especially for understanding tasks and low-resource languages. Further analysis reveals that highly separable samples facilitate the formation of clearer language boundaries and support faster adaptation, while low-separability samples tend to function as bridges for cross-lingual alignment. In addition, we find that language separability can serve as an effective signal for multilingual curriculum learning, where interleaving samples with diverse separability levels yields stable and generalizable gains. Together, we hope our work offers a new perspective on data utility in multilingual contexts and supports the development of more linguistically informed LLMs.

Title: VocalNet-M2: Advancing Low-Latency Spoken Language Modeling via Integrated Multi-Codebook Tokenization and Multi-Token Prediction

Authors: Yuhao Wang, Ziyang Cheng, Heyang Liu, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang
Subjects: cs.CL, cs.AI, cs.SD
Abstract URL: https://arxiv.org/abs/2511.10232
Pdf URL: https://arxiv.org/pdf/2511.10232
Copy Paste: [[2511.10232]] VocalNet-M2: Advancing Low-Latency Spoken Language Modeling via Integrated Multi-Codebook Tokenization and Multi-Token Prediction(https://arxiv.org/abs/2511.10232)
Keywords: language model
Abstract: Current end-to-end spoken language models (SLMs) have made notable progress, yet they still encounter considerable response latency. This delay primarily arises from the autoregressive generation of speech tokens and the reliance on complex flow-matching models for speech synthesis. To overcome this, we introduce VocalNet-M2, a novel low-latency SLM that integrates a multi-codebook tokenizer and a multi-token prediction (MTP) strategy. Our model directly generates multi-codebook speech tokens, thus eliminating the need for a latency-inducing flow-matching model. Furthermore, our MTP strategy enhances generation efficiency and improves overall performance. Extensive experiments demonstrate that VocalNet-M2 achieves a substantial reduction in first chunk latency (from approximately 725ms to 350ms) while maintaining competitive performance across mainstream SLMs. This work also provides a comprehensive comparison of single-codebook and multi-codebook strategies, offering valuable insights for developing efficient and high-performance SLMs for real-time interactive applications.

Title: MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models

Authors: He Zhang, Wenqian Cui, Haoning Xu, Xiaohui Li, Lei Zhu, Shaohua Ma, Irwin King
Subjects: cs.CL, cs.AI, eess.AS
Abstract URL: https://arxiv.org/abs/2511.10262
Pdf URL: https://arxiv.org/pdf/2511.10262
Copy Paste: [[2511.10262]] MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models(https://arxiv.org/abs/2511.10262)
Keywords: language model
Abstract: Full-Duplex Speech Language Models (FD-SLMs) enable real-time, overlapping conversational interactions, offering a more dynamic user experience compared to traditional half-duplex models. However, existing benchmarks primarily focus on evaluating single-round interactions and conversational features, neglecting the complexities of multi-round communication and critical capabilities such as instruction following and safety. Evaluating FD-SLMs in multi-round settings poses significant challenges, including blurred turn boundaries in communication and context inconsistency during model inference. To address these gaps, we introduce MTR-DuplexBench, a novel benchmark that segments continuous full-duplex dialogues into discrete turns, enabling comprehensive, turn-by-turn evaluation of FD-SLMs across dialogue quality, conversational dynamics, instruction following, and safety. Experimental results reveal that current FD-SLMs face difficulties in maintaining consistent performance across multiple rounds and evaluation dimensions, highlighting the necessity and effectiveness of our proposed benchmark. The benchmark and code will be available in the future.

Title: Rectify Evaluation Preference: Improving LLMs' Critique on Math Reasoning via Perplexity-aware Reinforcement Learning

Authors: Changyuan Tian, Zhicong Lu, Shuang Qian, Nayu Liu, Peiguang Li, Li Jin, Leiyi Hu, Zhizhao Zeng, Sirui Wang, Ke Zeng, Zhi Guo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.10303
Pdf URL: https://arxiv.org/pdf/2511.10303
Copy Paste: [[2511.10303]] Rectify Evaluation Preference: Improving LLMs' Critique on Math Reasoning via Perplexity-aware Reinforcement Learning(https://arxiv.org/abs/2511.10303)
Keywords: language model, llm
Abstract: To improve Multi-step Mathematical Reasoning (MsMR) of Large Language Models (LLMs), it is crucial to obtain scalable supervision from the corpus by automatically critiquing mistakes in the reasoning process of MsMR and rendering a final verdict of the problem-solution. Most existing methods rely on crafting high-quality supervised fine-tuning demonstrations for critiquing capability enhancement and pay little attention to delving into the underlying reason for the poor critiquing performance of LLMs. In this paper, we orthogonally quantify and investigate the potential reason, imbalanced evaluation preference, and conduct a statistical preference analysis. Motivated by the analysis of the reason, a novel perplexity-aware reinforcement learning algorithm is proposed to rectify the evaluation preference, elevating the critiquing capability. Specifically, to probe into LLMs' critiquing characteristics, a One-to-many Problem-Solution (OPS) benchmark is meticulously constructed to quantify the behavior difference of LLMs when evaluating the problem solutions generated by themselves and by others. Then, to investigate the behavior difference in depth, we conduct a statistical preference analysis oriented on perplexity and find an intriguing phenomenon: "LLMs incline to judge solutions with lower perplexity as correct", which is dubbed imbalanced evaluation preference. To rectify this preference, we regard perplexity as the baton in the algorithm of Group Relative Policy Optimization, supporting the LLMs to explore trajectories that judge lower perplexity as wrong and higher perplexity as correct. Extensive experimental results on our OPS benchmark and existing critic benchmarks demonstrate the validity of our method.
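The preference described in this abstract is stated in terms of perplexity, so a small worked example may help. The sketch below uses the standard definition (the exponential of the mean negative log-likelihood per token); the function name and the sample log-probabilities are illustrative assumptions, and the abstract does not say how the authors compute perplexity or feed it into Group Relative Policy Optimization.

```python
import math


def perplexity(token_logprobs):
    """Standard perplexity: exp of the mean negative natural-log likelihood."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Hypothetical per-token probabilities for two candidate solutions.
fluent_solution = [math.log(p) for p in (0.9, 0.8, 0.95)]
unusual_solution = [math.log(p) for p in (0.3, 0.2, 0.4)]

# The fluent solution has the lower perplexity; per the abstract, a critic
# model would tend to judge it "correct" regardless of actual correctness.
print(perplexity(fluent_solution) < perplexity(unusual_solution))
```

A sequence whose tokens each have probability 0.5 has perplexity 2, which is a convenient sanity check for the function.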

Title: BhashaKritika: Building Synthetic Pretraining Data at Scale for Indic Languages

Authors: Guduru Manoj, Neel Prabhanjan Rachamalla, Ashish Kulkarni, Gautam Rajeev, Jay Piplodiya, Arul Menezes, Shaharukh Khan, Souvik Rana, Manya Sah, Chandra Khatri, Shubham Agarwal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.10338
Pdf URL: https://arxiv.org/pdf/2511.10338
Copy Paste: [[2511.10338]] BhashaKritika: Building Synthetic Pretraining Data at Scale for Indic Languages(https://arxiv.org/abs/2511.10338)
Keywords: language model, llm, prompt
Abstract: In the context of pretraining of Large Language Models (LLMs), synthetic data has emerged as an alternative for generating high-quality pretraining data at scale. This is particularly beneficial in low-resource language settings where the benefits of recent LLMs have been unevenly distributed across languages. In this work, we present a systematic study on the generation and evaluation of synthetic multilingual pretraining data for Indic languages, where we construct a large-scale synthetic dataset BhashaKritika, comprising 540B tokens using 5 different techniques for 10 languages. We explore the impact of grounding generation in documents, personas, and topics. We analyze how language choice, both in the prompt instructions and document grounding, affects data quality, and we compare translations of English content with native generation in Indic languages. To support scalable and language-sensitive evaluation, we introduce a modular quality evaluation pipeline that integrates script and language detection, metadata consistency checks, n-gram repetition analysis, and perplexity-based filtering using KenLM models. Our framework enables robust quality control across diverse scripts and linguistic contexts. Empirical results through model runs reveal key trade-offs in generation strategies and highlight best practices for constructing effective multilingual corpora.
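One stage of the quality pipeline named in this abstract is n-gram repetition analysis. A minimal sketch of such a check follows; the metric (share of n-grams that duplicate another n-gram in the same text), the whitespace tokenization, and the function name are all assumptions made for illustration, since the abstract does not specify how the authors measure repetition.

```python
def ngram_repetition_ratio(tokens, n=3):
    """Share of n-grams in `tokens` that are repeats of another n-gram.

    Returns 0.0 when the text is too short to contain any n-gram.
    """
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


# Whitespace splitting is a simplification; Indic scripts would likely
# need a proper tokenizer in practice.
looping_text = "a b a b a b".split()
varied_text = "a b c d".split()

print(ngram_repetition_ratio(looping_text, n=2))
print(ngram_repetition_ratio(varied_text, n=2))
```

A filter built on this would drop documents whose ratio exceeds some threshold; any such threshold would have to be tuned per language and is not given in the source.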

Title: Knowledge Graphs Generation from Cultural Heritage Texts: Combining LLMs and Ontological Engineering for Scholarly Debates

Authors: Andrea Schimmenti, Valentina Pasqual, Fabio Vitali, Marieke van Erp
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.10354
Pdf URL: https://arxiv.org/pdf/2511.10354
Copy Paste: [[2511.10354]] Knowledge Graphs Generation from Cultural Heritage Texts: Combining LLMs and Ontological Engineering for Scholarly Debates(https://arxiv.org/abs/2511.10354)
Keywords: language model, gpt, llm
Abstract: Cultural Heritage texts contain rich knowledge that is difficult to query systematically due to the challenges of converting unstructured discourse into structured Knowledge Graphs (KGs). This paper introduces ATR4CH (Adaptive Text-to-RDF for Cultural Heritage), a systematic five-step methodology for Large Language Model-based Knowledge Extraction from Cultural Heritage documents. We validate the methodology through a case study on authenticity assessment debates. Methodology - ATR4CH combines annotation models, ontological frameworks, and LLM-based extraction through iterative development: foundational analysis, annotation schema development, pipeline architecture, integration refinement, and comprehensive evaluation. We demonstrate the approach using Wikipedia articles about disputed items (documents, artifacts...), implementing a sequential pipeline with three LLMs (Claude Sonnet 3.7, Llama 3.3 70B, GPT-4o-mini). Findings - The methodology successfully extracts complex Cultural Heritage knowledge: 0.96-0.99 F1 for metadata extraction, 0.7-0.8 F1 for entity recognition, 0.65-0.75 F1 for hypothesis extraction, 0.95-0.97 for evidence extraction, and 0.62 G-EVAL for discourse representation. Smaller models performed competitively, enabling cost-effective deployment. Originality - This is the first systematic methodology for coordinating LLM-based extraction with Cultural Heritage ontologies. ATR4CH provides a replicable framework adaptable across CH domains and institutional resources. Research Limitations - The produced KG is limited to Wikipedia articles. While the results are encouraging, human oversight is necessary during post-processing. Practical Implications - ATR4CH enables Cultural Heritage institutions to systematically convert textual knowledge into queryable KGs, supporting automated metadata enrichment and knowledge discovery.

Title: TruthfulRAG: Resolving Factual-level Conflicts in Retrieval-Augmented Generation with Knowledge Graphs

Authors: Shuyi Liu, Yuming Shang, Xi Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.10375
Pdf URL: https://arxiv.org/pdf/2511.10375
Copy Paste: [[2511.10375]] TruthfulRAG: Resolving Factual-level Conflicts in Retrieval-Augmented Generation with Knowledge Graphs(https://arxiv.org/abs/2511.10375)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful framework for enhancing the capabilities of Large Language Models (LLMs) by integrating retrieval-based methods with generative models. As external knowledge repositories continue to expand and the parametric knowledge within models becomes outdated, a critical challenge for RAG systems is resolving conflicts between retrieved external information and LLMs' internal knowledge, which can significantly compromise the accuracy and reliability of generated content. However, existing approaches to conflict resolution typically operate at the token or semantic level, often leading to fragmented and partial understanding of factual discrepancies between LLMs' knowledge and context, particularly in knowledge-intensive tasks. To address this limitation, we propose TruthfulRAG, the first framework that leverages Knowledge Graphs (KGs) to resolve factual-level knowledge conflicts in RAG systems. Specifically, TruthfulRAG constructs KGs by systematically extracting triples from retrieved content, utilizes query-based graph retrieval to identify relevant knowledge, and employs entropy-based filtering mechanisms to precisely locate conflicting elements and mitigate factual inconsistencies, thereby enabling LLMs to generate faithful and accurate responses. Extensive experiments reveal that TruthfulRAG outperforms existing methods, effectively alleviating knowledge conflicts and improving the robustness and trustworthiness of RAG systems.

Title: Position: On the Methodological Pitfalls of Evaluating Base LLMs for Reasoning

Authors: Jason Chan, Zhixue Zhao, Robert Gaizauskas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.10381
Pdf URL: https://arxiv.org/pdf/2511.10381
Copy Paste: [[2511.10381]] Position: On the Methodological Pitfalls of Evaluating Base LLMs for Reasoning(https://arxiv.org/abs/2511.10381)
Keywords: language model, llm
Abstract: Existing work investigates the reasoning capabilities of large language models (LLMs) to uncover their limitations, human-like biases and underlying processes. Such studies include evaluations of base LLMs (pre-trained on unlabeled corpora only) for this purpose. Our position paper argues that evaluating base LLMs' reasoning capabilities raises inherent methodological concerns that are overlooked in such existing studies. We highlight the fundamental mismatch between base LLMs' pretraining objective and normative qualities, such as correctness, by which reasoning is assessed. In particular, we show how base LLMs generate logically valid or invalid conclusions as coincidental byproducts of conforming to purely linguistic patterns of statistical plausibility. This fundamental mismatch challenges the assumptions that (a) base LLMs' outputs can be assessed as their bona fide attempts at correct answers or conclusions; and (b) conclusions about base LLMs' reasoning can generalize to post-trained LLMs optimized for successful instruction-following. We call for a critical re-examination of existing work that relies implicitly on these assumptions, and for future work to account for these methodological pitfalls.

Title: Analogical Structure, Minimal Contextual Cues and Contrastive Distractors: Input Design for Sample-Efficient Linguistic Rule Induction

Authors: Chunyang Jiang, Paola Merlo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.10441
Pdf URL: https://arxiv.org/pdf/2511.10441
Copy Paste: [[2511.10441]] Analogical Structure, Minimal Contextual Cues and Contrastive Distractors: Input Design for Sample-Efficient Linguistic Rule Induction(https://arxiv.org/abs/2511.10441)
Keywords: language model, gpt
Abstract: Large language models achieve strong performance through training on vast datasets. Can analogical paradigm organization enable lightweight models to match this performance with minimal data? We develop a computational approach implementing three cognitive-inspired principles: analogical structure, contrastive learning, and minimal contextual cues. We test this approach with structured completion tasks where models identify correct sentence completions from analogical patterns with contrastive alternatives. Training lightweight models (BERT+CNN, 0.5M parameters) on only one hundred structured examples of English causative/inchoative alternations achieves F1 = 0.95, outperforming zero-shot GPT-o3 (F1 = 0.87). Ablation studies confirm that analogical organization and contrastive structure improve performance, consistently surpassing randomly shuffled baselines across architectures. Cross-phenomenon validation using unspecified object alternations replicates these efficiency gains, confirming approach robustness. Our results show that analogical paradigm organization enables competitive linguistic rule learning with orders of magnitude less data than conventional approaches require.

Title: Reasoning About Intent for Ambiguous Requests

Authors: Irina Saparina, Mirella Lapata
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.10453
Pdf URL: https://arxiv.org/pdf/2511.10453
Copy Paste: [[2511.10453]] Reasoning About Intent for Ambiguous Requests(https://arxiv.org/abs/2511.10453)
Keywords: language model
Abstract: Large language models often respond to ambiguous requests by implicitly committing to one interpretation. Intent misunderstandings can frustrate users and create safety risks. To address this, we propose generating multiple interpretation-answer pairs in a single structured response to ambiguous requests. Our models are trained with reinforcement learning and customized reward functions using multiple valid answers as supervision. Experiments on conversational question answering and semantic parsing demonstrate that our method achieves higher coverage of valid answers than baseline approaches. Human evaluation confirms that predicted interpretations are highly aligned with their answers. Our approach promotes transparency with explicit interpretations, achieves efficiency by requiring only one generation step, and supports downstream applications through its structured output format.

Title: Exploring State Tracking Capabilities of Large Language Models

Authors: Kiamehr Rezaee, Jose Camacho-Collados, Mohammad Taher Pilehvar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.10457
Pdf URL: https://arxiv.org/pdf/2511.10457
Copy Paste: [[2511.10457]] Exploring State Tracking Capabilities of Large Language Models(https://arxiv.org/abs/2511.10457)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in solving complex tasks, including those requiring a certain level of reasoning. In this paper, we focus on state tracking, a problem where models need to keep track of the state governing a number of entities. To isolate the state tracking component from other factors, we propose a benchmark based on three well-defined state tracking tasks and analyse the performance of LLMs in different scenarios. The results indicate that the recent generation of LLMs (specifically, GPT-4 and Llama3) is capable of tracking state, especially when integrated with mechanisms such as Chain of Thought. However, models from the previous generation, while understanding the task and being able to solve it at the initial stages, often fail at this task after a certain number of steps.

Title: LocalBench: Benchmarking LLMs on County-Level Local Knowledge and Reasoning

Authors: Zihan Gao, Yifei Xu, Jacob Thebault-Spieker
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2511.10459
Pdf URL: https://arxiv.org/pdf/2511.10459
Copy Paste: [[2511.10459]] LocalBench: Benchmarking LLMs on County-Level Local Knowledge and Reasoning(https://arxiv.org/abs/2511.10459)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) have been widely evaluated on macro-scale geographic tasks, such as global factual recall, event summarization, and regional reasoning. Yet, their ability to handle hyper-local knowledge remains poorly understood. This gap is increasingly consequential as real-world applications, from civic platforms to community journalism, demand AI systems that can reason about neighborhood-specific dynamics, cultural narratives, and local governance. Existing benchmarks fall short in capturing this complexity, often relying on coarse-grained data or isolated references. We present LocalBench, the first benchmark designed to systematically evaluate LLMs on county-level local knowledge across the United States. Grounded in the Localness Conceptual Framework, LocalBench includes 14,782 validated question-answer pairs across 526 U.S. counties in 49 states, integrating diverse sources such as Census statistics, local subreddit discourse, and regional news. It spans physical, cognitive, and relational dimensions of locality. Using LocalBench, we evaluate 13 state-of-the-art LLMs under both closed-book and web-augmented settings. Our findings reveal critical limitations: even the best-performing models reach only 56.8% accuracy on narrative-style questions and perform below 15.5% on numerical reasoning. Moreover, larger model size and web augmentation do not guarantee better performance: for example, search improves Gemini's accuracy by 13.6% but reduces GPT-series performance by 11.4%. These results underscore the urgent need for language models that can support equitable, place-aware AI systems: capable of engaging with the diverse, fine-grained realities of local communities across geographic and cultural contexts.

Title: Beyond Elicitation: Provision-based Prompt Optimization for Knowledge-Intensive Tasks

Authors: Yunzhe Xu, Zhuosheng Zhang, Zhe Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.10465
Pdf URL: https://arxiv.org/pdf/2511.10465
Copy Paste: [[2511.10465]] Beyond Elicitation: Provision-based Prompt Optimization for Knowledge-Intensive Tasks(https://arxiv.org/abs/2511.10465)
Keywords: language model, prompt
Abstract: While prompt optimization has emerged as a critical technique for enhancing language model performance, existing approaches primarily focus on elicitation-based strategies that search for optimal prompts to activate models' capabilities. These methods exhibit fundamental limitations when addressing knowledge-intensive tasks, as they operate within fixed parametric boundaries rather than providing the factual knowledge, terminology precision, and reasoning patterns required in specialized domains. To address these limitations, we propose Knowledge-Provision-based Prompt Optimization (KPPO), a framework that reformulates prompt optimization as systematic knowledge integration rather than potential elicitation. KPPO introduces three key innovations: 1) a knowledge gap filling mechanism for knowledge gap identification and targeted remediation; 2) a batch-wise candidate evaluation approach that considers both performance improvement and distributional stability; 3) an adaptive knowledge pruning strategy that balances performance and token efficiency, reducing token usage by up to 29%. Extensive evaluation on 15 knowledge-intensive benchmarks from various domains demonstrates KPPO's superiority over elicitation-based methods, with an average performance improvement of ~6% over the strongest baseline while achieving comparable or lower token consumption. Code at: this https URL.

Title: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following

Authors: Yun He, Wenzhe Li, Hejia Zhang, Songlin Li, Karishma Mandyam, Sopan Khosla, Yuanhao Xiong, Nanshu Wang, Selina Peng, Beibin Li, Shengjie Bi, Shishir G. Patil, Qi Qi, Shengyu Feng, Julian Katz-Samuels, Richard Yuanzhe Pang, Sujan Gonugondla, Hunter Lang, Yue Yu, Yundi Qian, Maryam Fazel-Zarandi, Licheng Yu, Amine Benhalloum, Hany Awadalla, Manaal Faruqui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.10507
Pdf URL: https://arxiv.org/pdf/2511.10507
Copy Paste: [[2511.10507]] Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following(https://arxiv.org/abs/2511.10507)
Keywords: language model, llm, prompt
Abstract: Recent progress in large language models (LLMs) has led to impressive performance on a range of tasks, yet advanced instruction following (IF), especially for complex, multi-turn, and system-prompted instructions, remains a significant challenge. Rigorous evaluation and effective training for such capabilities are hindered by the lack of high-quality, human-annotated benchmarks and reliable, interpretable reward signals. In this work, we introduce AdvancedIF (we will release this benchmark soon), a comprehensive benchmark featuring over 1,600 prompts and expert-curated rubrics that assess LLMs' ability to follow complex, multi-turn, and system-level instructions. We further propose RIFL (Rubric-based Instruction-Following Learning), a novel post-training pipeline that leverages rubric generation, a finetuned rubric verifier, and reward shaping to enable effective reinforcement learning for instruction following. Extensive experiments demonstrate that RIFL substantially improves the instruction-following abilities of LLMs, achieving a 6.7% absolute gain on AdvancedIF and strong results on public benchmarks. Our ablation studies confirm the effectiveness of each component in RIFL. This work establishes rubrics as a powerful tool for both training and evaluating advanced IF in LLMs, paving the way for more capable and reliable AI systems.

Title: Say It Differently: Linguistic Styles as Jailbreak Vectors

Authors: Srikant Panda, Avinash Rai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.10519
Pdf URL: https://arxiv.org/pdf/2511.10519
Copy Paste: [[2511.10519]] Say It Differently: Linguistic Styles as Jailbreak Vectors(https://arxiv.org/abs/2511.10519)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are commonly evaluated for robustness against paraphrased or semantically equivalent jailbreak prompts, yet little attention has been paid to linguistic variation as an attack surface. In this work, we systematically study how linguistic styles such as fear or curiosity can reframe harmful intent and elicit unsafe responses from aligned models. We construct a style-augmented jailbreak benchmark by transforming prompts from 3 standard datasets into 11 distinct linguistic styles using handcrafted templates and LLM-based rewrites, while preserving semantic intent. Evaluating 16 open- and closed-source instruction-tuned models, we find that stylistic reframing increases jailbreak success rates by up to 57 percentage points. Styles such as fearful, curious, and compassionate are most effective, and contextualized rewrites outperform templated variants. To mitigate this, we introduce a style neutralization preprocessing step using a secondary LLM to strip manipulative stylistic cues from user inputs, significantly reducing jailbreak success rates. Our findings reveal a systemic and scaling-resistant vulnerability overlooked in current safety pipelines.

Title: Convomem Benchmark: Why Your First 150 Conversations Don't Need RAG

Authors: Egor Pakhomov, Erik Nijkamp, Caiming Xiong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.10523
Pdf URL: https://arxiv.org/pdf/2511.10523
Copy Paste: [[2511.10523]] Convomem Benchmark: Why Your First 150 Conversations Don't Need RAG(https://arxiv.org/abs/2511.10523)
Keywords: long context, retrieval-augmented generation
Abstract: We introduce a comprehensive benchmark for conversational memory evaluation containing 75,336 question-answer pairs across diverse categories including user facts, assistant recall, abstention, preferences, temporal changes, and implicit connections. While existing benchmarks have advanced the field, our work addresses fundamental challenges in statistical power, data generation consistency, and evaluation flexibility that limit current memory evaluation frameworks. We examine the relationship between conversational memory and retrieval-augmented generation (RAG). While these systems share fundamental architectural patterns--temporal reasoning, implicit extraction, knowledge updates, and graph representations--memory systems have a unique characteristic: they start from zero and grow progressively with each conversation. This characteristic enables naive approaches that would be impractical for traditional RAG. Consistent with recent findings on long context effectiveness, we observe that simple full-context approaches achieve 70-82% accuracy even on our most challenging multi-message evidence cases, while sophisticated RAG-based memory systems like Mem0 achieve only 30-45% when operating on conversation histories under 150 interactions. Our analysis reveals practical transition points: long context excels for the first 30 conversations, remains viable with manageable trade-offs up to 150 conversations, and typically requires hybrid or RAG approaches beyond that point as costs and latencies become prohibitive. These patterns indicate that the small-corpus advantage of conversational memory--where exhaustive search and complete reranking are feasible--deserves dedicated research attention rather than simply applying general RAG solutions to conversation histories.
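
A minimal sketch of the transition points reported in the abstract above, written as a lookup from conversation count to memory strategy. The function name, the strategy labels, and the choice to treat 30 and 150 as inclusive boundaries are assumptions made for illustration; the abstract gives only the approximate ranges.

```python
# Thresholds taken from the abstract; inclusive boundaries are an assumption.
LONG_CONTEXT_LIMIT = 30    # long context "excels" up to here
VIABLE_LIMIT = 150         # still viable, with cost and latency trade-offs


def choose_memory_strategy(num_conversations: int) -> str:
    """Map a conversation count to the regime described in the abstract."""
    if num_conversations < 0:
        raise ValueError("num_conversations must be non-negative")
    if num_conversations <= LONG_CONTEXT_LIMIT:
        return "full-context"
    if num_conversations <= VIABLE_LIMIT:
        return "full-context-with-tradeoffs"
    return "hybrid-or-rag"


if __name__ == "__main__":
    for n in (5, 30, 31, 150, 151):
        print(n, choose_memory_strategy(n))
```

In practice the switch would more likely be driven by measured token cost and latency than by a fixed count, which is why the abstract describes these as typical transition points rather than hard limits.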

Title: URaG: Unified Retrieval and Generation in Multimodal LLMs for Efficient Long Document Understanding

Authors: Yongxin Shi, Jiapeng Wang, Zeyu Shan, Dezhi Peng, Zening Lin, Lianwen Jin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.10552
Pdf URL: https://arxiv.org/pdf/2511.10552
Copy Paste: [[2511.10552]] URaG: Unified Retrieval and Generation in Multimodal LLMs for Efficient Long Document Understanding(https://arxiv.org/abs/2511.10552)
Keywords: language model, llm
Abstract: Recent multimodal large language models (MLLMs) still struggle with long document understanding due to two fundamental challenges: information interference from abundant irrelevant content, and the quadratic computational cost of Transformer-based architectures. Existing approaches primarily fall into two categories: token compression, which sacrifices fine-grained details; and introducing external retrievers, which increase system complexity and prevent end-to-end optimization. To address these issues, we conduct an in-depth analysis and observe that MLLMs exhibit a human-like coarse-to-fine reasoning pattern: early Transformer layers attend broadly across the document, while deeper layers focus on relevant evidence pages. Motivated by this insight, we posit that the inherent evidence localization capabilities of MLLMs can be explicitly leveraged to perform retrieval during the reasoning process, facilitating efficient long document understanding. To this end, we propose URaG, a simple-yet-effective framework that Unifies Retrieval and Generation within a single MLLM. URaG introduces a lightweight cross-modal retrieval module that converts the early Transformer layers into an efficient evidence selector, identifying and preserving the most relevant pages while discarding irrelevant content. This design enables the deeper layers to concentrate computational resources on pertinent information, improving both accuracy and efficiency. Extensive experiments demonstrate that URaG achieves state-of-the-art performance while reducing computational overhead by 44-56%. The code is available at this https URL.
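
The evidence-selection idea in the abstract above (score pages, keep the most relevant, discard the rest before the deeper layers run) can be illustrated with a toy top-k selector. This is only a sketch: the abstract does not specify how the retrieval module scores pages, so cosine similarity over made-up page vectors is an assumption, as are the names `cosine` and `select_pages`.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_pages(query_vec, page_vecs, top_k):
    """Return indices of the top_k pages most similar to the query, in document order."""
    ranked = sorted(
        range(len(page_vecs)),
        key=lambda i: cosine(query_vec, page_vecs[i]),
        reverse=True,
    )
    return sorted(ranked[:top_k])


if __name__ == "__main__":
    # Toy stand-ins for early-layer representations of a question and four pages.
    query = [1.0, 0.0, 1.0]
    pages = [
        [0.0, 1.0, 0.0],  # page 0: unrelated
        [1.0, 0.1, 0.9],  # page 1: relevant
        [0.2, 0.9, 0.1],  # page 2: mostly unrelated
        [0.8, 0.0, 1.0],  # page 3: relevant
    ]
    print(select_pages(query, pages, top_k=2))  # [1, 3]
```

Only the selected pages would then be passed on, which is where the reported reduction in computational overhead would come from.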

Title: DESS: DeBERTa Enhanced Syntactic-Semantic Aspect Sentiment Triplet Extraction

Authors: Vishal Thenuwara, Nisansa de Silva
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.10577
Pdf URL: https://arxiv.org/pdf/2511.10577
Copy Paste: [[2511.10577]] DESS: DeBERTa Enhanced Syntactic-Semantic Aspect Sentiment Triplet Extraction(https://arxiv.org/abs/2511.10577)
Keywords: language model
Abstract: Fine-grained sentiment analysis faces ongoing challenges in Aspect Sentiment Triplet Extraction (ASTE), particularly in accurately capturing the relationships between aspects, opinions, and sentiment polarities. While researchers have made progress using BERT and Graph Neural Networks, the full potential of advanced language models in understanding complex language patterns remains unexplored. We introduce DESS, a new approach that builds upon previous work by integrating DeBERTa's enhanced attention mechanism to better understand context and relationships in text. Our framework maintains a dual-channel structure, where DeBERTa works alongside an LSTM channel to process both meaning and grammatical patterns in text. We have carefully refined how these components work together, paying special attention to how different types of language information interact. When we tested DESS on standard datasets, it showed meaningful improvements over current methods, with F1-score increases of 4.85, 8.36, and 2.42 in identifying aspect-opinion pairs and determining sentiment accurately. Looking deeper into the results, we found that DeBERTa's sophisticated attention system helps DESS handle complicated sentence structures better, especially when important words are far apart. Our findings suggest that upgrading to more advanced language models, when thoughtfully integrated, can lead to real improvements in how well we can analyze sentiments in text. The implementation of our approach is publicly available at: this https URL.

Title: Evaluating Prompting Strategies with MedGemma for Medical Order Extraction

Authors: Abhinand Balachandran, Bavana Durgapraveen, Gowsikkan Sikkan Sudhagar, Vidhya Varshany J S, Sriram Rajkumar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.10583
Pdf URL: https://arxiv.org/pdf/2511.10583
Copy Paste: [[2511.10583]] Evaluating Prompting Strategies with MedGemma for Medical Order Extraction(https://arxiv.org/abs/2511.10583)
Keywords: language model, prompt, agent
Abstract: The accurate extraction of medical orders from doctor-patient conversations is a critical task for reducing clinical documentation burdens and ensuring patient safety. This paper details our team submission to the MEDIQA-OE-2025 Shared Task. We investigate the performance of MedGemma, a new domain-specific open-source language model, for structured order extraction. We systematically evaluate three distinct prompting paradigms: a straightforward one-shot approach, a reasoning-focused ReAct framework, and a multi-step agentic workflow. Our experiments reveal that while more complex frameworks like ReAct and agentic flows are powerful, the simpler one-shot prompting method achieved the highest performance on the official validation set. We posit that on manually annotated transcripts, complex reasoning chains can lead to "overthinking" and introduce noise, making a direct approach more robust and efficient. Our work provides valuable insights into selecting appropriate prompting strategies for clinical information extraction in varied data conditions.

Title: Mined Prompting and Metadata-Guided Generation for Wound Care Visual Question Answering

Authors: Bavana Durgapraveen, Sornaraj Sivasankaran, Abhinand Balachandran, Sriram Rajkumar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.10591
Pdf URL: https://arxiv.org/pdf/2511.10591
Copy Paste: [[2511.10591]] Mined Prompting and Metadata-Guided Generation for Wound Care Visual Question Answering(https://arxiv.org/abs/2511.10591)
Keywords: prompt
Abstract: The rapid expansion of asynchronous remote care has intensified provider workload, creating demand for AI systems that can assist clinicians in managing patient queries more efficiently. The MEDIQA-WV 2025 shared task addresses this challenge by focusing on generating free-text responses to wound care queries paired with images. In this work, we present two complementary approaches developed for the English track. The first leverages a mined prompting strategy, where training data is embedded and the top-k most similar examples are retrieved to serve as few-shot demonstrations during generation. The second approach builds on a metadata ablation study, which identified four metadata attributes that consistently enhance response quality. We train classifiers to predict these attributes for test cases and incorporate them into the generation pipeline, dynamically adjusting outputs based on prediction confidence. Experimental results demonstrate that mined prompting improves response relevance, while metadata-guided generation further refines clinical precision. Together, these methods highlight promising directions for developing AI-driven tools that can provide reliable and efficient wound care support.
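
The mined prompting strategy in the abstract above (embed the training data, retrieve the top-k most similar examples, use them as few-shot demonstrations) can be sketched as follows. A bag-of-words vector stands in for the embedding model, which the abstract does not name; the helper names, the prompt layout, and the toy training pairs with placeholder responses are all invented for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words counts (an assumed stand-in for a real encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def mine_examples(query, train_pairs, k):
    """Return the k (query, response) training pairs most similar to the query."""
    q = embed(query)
    ranked = sorted(train_pairs, key=lambda p: cosine(q, embed(p[0])), reverse=True)
    return ranked[:k]


def build_prompt(query, demos):
    """Lay out the mined demonstrations followed by the new query."""
    parts = [f"Query: {q}\nResponse: {r}" for q, r in demos]
    parts.append(f"Query: {query}\nResponse:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    train = [
        ("how often should i change the dressing on a burn", "placeholder response A"),
        ("is redness around a surgical cut normal", "placeholder response B"),
        ("what dressing is best for a minor burn", "placeholder response C"),
    ]
    new_query = "which dressing should i use for a burn"
    demos = mine_examples(new_query, train, k=2)
    print(build_prompt(new_query, demos))
```

With these toy inputs the two burn-related training queries are retrieved and the surgical one is left out.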

Title: Know Your Limits: Entropy Estimation Modeling for Compression and Generalization

Authors: Benjamin L. Badger, Matthew Neligeorge
Subjects: cs.CL, cs.AI, cs.IT, cs.LG
Abstract URL: https://arxiv.org/abs/2511.10618
Pdf URL: https://arxiv.org/pdf/2511.10618
Copy Paste: [[2511.10618]] Know Your Limits: Entropy Estimation Modeling for Compression and Generalization(https://arxiv.org/abs/2511.10618)
Keywords: language model
Abstract: Language prediction is constrained by informational entropy intrinsic to language, such that there exists a limit to how accurate any language model can become and equivalently a lower bound to language compression. The most efficient language compression algorithms today are causal (next token prediction) large language models, but the use of these models to form accurate estimates of language entropy is currently computationally infeasible. We introduce encoder-augmented causal decoder model architectures that exhibit superior training efficiency characteristics and achieve higher compression than causal transformers even when trained on modest hardware. We demonstrate how entropy estimates can be obtained on a per-token basis, and show that the generalization of models trained to approach the entropy of their training data necessarily exceeds the generalization of models trained to minimize loss beyond this value. We show empirically that causal models trained to approach but not exceed estimated per-token entropies exhibit greater generalization than models trained without taking entropy into account.
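
The link between prediction and compression that the abstract above relies on can be made concrete with a small calculation. A model that assigns probability p to the token that actually occurs can, with an ideal entropy coder, encode that token in about -log2(p) bits, so the average code length per token is the model's cross-entropy, which cannot fall below the entropy of the source. The function names and the toy probabilities below are illustrative, not taken from the paper.

```python
import math


def shannon_entropy_bits(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def code_length_bits(token_probs):
    """Ideal code length in bits, given the probability the model gave each actual token."""
    return sum(-math.log2(p) for p in token_probs)


if __name__ == "__main__":
    # A uniform choice among four symbols carries 2 bits per symbol.
    print(shannon_entropy_bits([0.25, 0.25, 0.25, 0.25]))  # 2.0

    # Probabilities a hypothetical model assigned to four observed tokens.
    model_probs = [0.5, 0.25, 0.125, 0.125]
    total_bits = code_length_bits(model_probs)    # 1 + 2 + 3 + 3
    print(total_bits, total_bits / len(model_probs))  # 9.0 2.25
```

The per-token terms -log2(p) are the quantities a per-token entropy estimate would be compared against.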

Title: SSR: Socratic Self-Refine for Large Language Model Reasoning

Authors: Haizhou Shi, Ye Liu, Bo Pang, Zeyu Leo Liu, Hao Wang, Silvio Savarese, Caiming Xiong, Yingbo Zhou, Semih Yavuz
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.10621
Pdf URL: https://arxiv.org/pdf/2511.10621
Copy Paste: [[2511.10621]] SSR: Socratic Self-Refine for Large Language Model Reasoning(https://arxiv.org/abs/2511.10621)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable reasoning abilities, yet existing test-time frameworks often rely on coarse self-verification and self-correction, limiting their effectiveness on complex tasks. In this paper, we propose Socratic Self-Refine (SSR), a novel framework for fine-grained evaluation and precise refinement of LLM reasoning. Our proposed SSR decomposes model responses into verifiable (sub-question, sub-answer) pairs, enabling step-level confidence estimation through controlled re-solving and self-consistency checks. By pinpointing unreliable steps and iteratively refining them, SSR produces more accurate and interpretable reasoning chains. Empirical results across five reasoning benchmarks and three LLMs show that SSR consistently outperforms state-of-the-art iterative self-refinement baselines. Beyond performance gains, SSR provides a principled black-box approach for evaluating and understanding the internal reasoning processes of LLMs. Code is available at this https URL.
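
The step-level confidence estimation described in the abstract above can be sketched as a self-consistency check over (sub-question, sub-answer) pairs: each sub-question is re-solved several times, and the fraction of re-solved answers that agree with the original sub-answer serves as the confidence. This is a simplified illustration; the function names are hypothetical, and `toy_resolve` with its canned answers is a stand-in for sampling an LLM.

```python
from collections import Counter


def step_confidence(sub_answer, resolved_answers):
    """Fraction of re-solved answers that agree with the original sub-answer."""
    if not resolved_answers:
        return 0.0
    return Counter(resolved_answers)[sub_answer] / len(resolved_answers)


def least_reliable_step(steps, resolve, n_samples=5):
    """Return (index, confidence) of the step with the lowest self-consistency.

    steps:   list of (sub_question, sub_answer) pairs
    resolve: callable (sub_question, sample_index) -> answer string
    """
    scored = []
    for idx, (question, answer) in enumerate(steps):
        samples = [resolve(question, i) for i in range(n_samples)]
        scored.append((step_confidence(answer, samples), idx))
    confidence, idx = min(scored)  # ties go to the earliest step
    return idx, confidence


# Canned re-solve results standing in for repeated model samples (toy data).
CANNED = {
    "What is 12 * 3?": ["36", "36", "36", "36", "36"],
    "What is 36 + 7?": ["43", "43", "42", "43", "43"],
}


def toy_resolve(question, i):
    return CANNED[question][i]


if __name__ == "__main__":
    steps = [("What is 12 * 3?", "36"), ("What is 36 + 7?", "42")]
    print(least_reliable_step(steps, toy_resolve))  # (1, 0.2)
```

The step flagged here is the one a refinement loop would rewrite first before re-checking the chain.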

Title: Instella: Fully Open Language Models with Stellar Performance

Authors: Jiang Liu, Jialian Wu, Xiaodong Yu, Yusheng Su, Prakamya Mishra, Gowtham Ramesh, Sudhanshu Ranjan, Chaitanya Manem, Ximeng Sun, Ze Wang, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.10628
Pdf URL: https://arxiv.org/pdf/2511.10628
Copy Paste: [[2511.10628]] Instella: Fully Open Language Models with Stellar Performance(https://arxiv.org/abs/2511.10628)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated remarkable performance across a wide range of tasks, yet the majority of high-performing models remain closed-source or partially open, limiting transparency and reproducibility. In this work, we introduce Instella, a family of fully open three-billion-parameter language models trained entirely on openly available data and codebase. Powered by AMD Instinct MI300X GPUs, Instella is developed through large-scale pre-training, general-purpose instruction tuning, and alignment with human preferences. Despite using substantially fewer pre-training tokens than many contemporaries, Instella achieves state-of-the-art results among fully open models and is competitive with leading open-weight models of comparable size. We further release two specialized variants: Instella-Long, capable of handling context lengths up to 128K tokens, and Instella-Math, a reasoning-focused model enhanced through supervised fine-tuning and reinforcement learning on mathematical tasks. Together, these contributions establish Instella as a transparent, performant, and versatile alternative for the community, advancing the goal of open and reproducible language modeling research.

Title: Black-Box On-Policy Distillation of Large Language Models

Authors: Tianzhu Ye, Li Dong, Zewen Chi, Xun Wu, Shaohan Huang, Furu Wei
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.10643
Pdf URL: https://arxiv.org/pdf/2511.10643
Copy Paste: [[2511.10643]] Black-Box On-Policy Distillation of Large Language Models(https://arxiv.org/abs/2511.10643)
Keywords: language model, gpt, llm, chat
Abstract: Black-box distillation creates student large language models (LLMs) by learning from a proprietary teacher model's text outputs alone, without access to its internal logits or parameters. In this work, we introduce Generative Adversarial Distillation (GAD), which enables on-policy and black-box distillation. GAD frames the student LLM as a generator and trains a discriminator to distinguish its responses from the teacher LLM's, creating a minimax game. The discriminator acts as an on-policy reward model that co-evolves with the student, providing stable, adaptive feedback. Experimental results show that GAD consistently surpasses the commonly used sequence-level knowledge distillation. In particular, Qwen2.5-14B-Instruct (student) trained with GAD becomes comparable to its teacher, GPT-5-Chat, on the LMSYS-Chat automatic evaluation. The results establish GAD as a promising and effective paradigm for black-box LLM distillation.
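
For orientation, the minimax game mentioned in the abstract above can be written in the standard GAN form below. This is the generic adversarial objective, not necessarily the loss used in the paper, which the abstract does not state. The notation is chosen here for illustration: G is the student (generator), D the discriminator, x a prompt, and y^teacher the teacher's response to x.

```latex
\min_{G}\;\max_{D}\;
\mathbb{E}_{x}\Big[
  \log D\big(x,\, y^{\mathrm{teacher}}\big)
  + \log\Big(1 - D\big(x,\, G(x)\big)\Big)
\Big]
```

Because sampled text is discrete, the student in such a setup is usually updated with a policy-gradient method that treats the discriminator's score as a reward, which matches the abstract's description of the discriminator as an on-policy reward model.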

Title: ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference

Authors: Yesheng Liang, Haisheng Chen, Song Han, Zhijian Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.10645
Pdf URL: https://arxiv.org/pdf/2511.10645
Copy Paste: [[2511.10645]] ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference(https://arxiv.org/abs/2511.10645)
Keywords: language model, llm
Abstract: Weight-only post-training quantization (PTQ) compresses the weights of Large Language Models (LLMs) into low-precision representations to reduce memory footprint and accelerate inference. However, the presence of outliers in weights and activations often leads to large quantization errors and severe accuracy degradation, especially in recent reasoning LLMs where errors accumulate across long chains of thought. Existing PTQ methods either fail to sufficiently suppress outliers or introduce significant overhead during inference. In this paper, we propose Pairwise Rotation Quantization (ParoQuant), a weight-only PTQ method that combines hardware-efficient and optimizable independent Givens rotations with channel-wise scaling to even out the magnitude across channels and narrow the dynamic range within each quantization group. We further co-design the inference kernel to fully exploit GPU parallelism and keep the rotations and scaling lightweight at runtime. ParoQuant achieves an average 2.4% accuracy improvement over AWQ on reasoning tasks with less than 10% overhead. This paves the way for more efficient and accurate deployment of reasoning LLMs.
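
A small sketch of why a Givens rotation can help before quantization, as the abstract above suggests: rotating a pair of channels spreads an outlier's magnitude across both coordinates, which narrows the range the quantizer must cover, while the rotation stays exactly invertible and norm-preserving. The vector, the chosen pair, and the hand-picked 45-degree angle are assumptions for illustration; in the paper the rotations are described as optimizable and are combined with channel-wise scaling, which is not shown here.

```python
import math


def givens_rotate(x, i, j, theta):
    """Rotate coordinates i and j of vector x by angle theta; returns a new list."""
    c, s = math.cos(theta), math.sin(theta)
    y = list(x)
    y[i] = c * x[i] - s * x[j]
    y[j] = s * x[i] + c * x[j]
    return y


def l2_norm(x):
    return math.sqrt(sum(v * v for v in x))


def max_abs(x):
    return max(abs(v) for v in x)


if __name__ == "__main__":
    group = [8.0, 0.3, 0.2, -0.1]   # toy quantization group with one outlier
    theta = math.pi / 4             # hand-picked angle, not an optimized one

    rotated = givens_rotate(group, 0, 1, theta)
    restored = givens_rotate(rotated, 0, 1, -theta)  # inverse rotation

    print("max |w| before:", max_abs(group))
    print("max |w| after: ", max_abs(rotated))
    print("norm preserved:", math.isclose(l2_norm(group), l2_norm(rotated)))
    print("invertible:    ", all(math.isclose(a, b, abs_tol=1e-12)
                                 for a, b in zip(group, restored)))
```

A smaller maximum magnitude does not by itself guarantee lower quantization error for every input, which is presumably part of why the angles are optimized rather than fixed.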