2025-11-04

Title: PlotCraft: Pushing the Limits of LLMs for Complex and Interactive Data Visualization

Authors: Jiajun Zhang, Jianke Zhang, Zeyu Cui, Jiaxi Yang, Lei Zhang, Binyuan Hui, Qiang Liu, Zilei Wang, Liang Wang, Junyang Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.00010
Pdf URL: https://arxiv.org/pdf/2511.00010
Copy Paste: [[2511.00010]] PlotCraft: Pushing the Limits of LLMs for Complex and Interactive Data Visualization(https://arxiv.org/abs/2511.00010)
Keywords: language model, llm, agent
Abstract: Recent Large Language Models (LLMs) have demonstrated remarkable proficiency in code generation. However, their ability to create complex visualizations for scaled and structured data remains largely unevaluated and underdeveloped. To address this gap, we introduce PlotCraft, a new benchmark featuring 1k challenging visualization tasks that cover a wide range of topics, such as finance, scientific research, and sociology. The benchmark is structured around seven high-level visualization tasks and encompasses 48 distinct chart types. Crucially, it is the first to systematically evaluate both single-turn generation and multi-turn refinement across a diverse spectrum of task complexities. Our comprehensive evaluation of 23 leading LLMs on PlotCraft reveals obvious performance deficiencies in handling sophisticated visualization tasks. To bridge this performance gap, we develop SynthVis-30K, a large-scale, high-quality dataset of complex visualization code synthesized via a collaborative agent framework. Building upon this dataset, we develop PlotCraftor, a novel code generation model that achieves strong capabilities in complex data visualization with a remarkably small size. Across VisEval, PandasPlotBench, and our proposed PlotCraft, PlotCraftor shows performance comparable to that of leading proprietary approaches. In particular, on hard tasks, our model achieves over 50% performance improvement. We will release the benchmark, dataset, and code at this https URL.

Title: Cognitive Alignment in Personality Reasoning: Leveraging Prototype Theory for MBTI Inference

Authors: Haoyuan Li, Yuanbo Tong, Yuchen Li, Zirui Wang, Chunhou Liu, Jiamou Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.00115
Pdf URL: https://arxiv.org/pdf/2511.00115
Copy Paste: [[2511.00115]] Cognitive Alignment in Personality Reasoning: Leveraging Prototype Theory for MBTI Inference(https://arxiv.org/abs/2511.00115)
Keywords: llm, prompt
Abstract: Personality recognition from text is typically cast as hard-label classification, which obscures the graded, prototype-like nature of human personality judgments. We present ProtoMBTI, a cognitively aligned framework for MBTI inference that operationalizes prototype theory within an LLM-based pipeline. First, we construct a balanced, quality-controlled corpus via LLM-guided multi-dimensional augmentation (semantic, linguistic, sentiment). Next, we LoRA-fine-tune a lightweight (<=2B) encoder to learn discriminative embeddings and to standardize a bank of personality prototypes. At inference, we retrieve top-k prototypes for a query post and perform a retrieve--reuse--revise--retain cycle: the model aggregates prototype evidence via prompt-based voting, revises when inconsistencies arise, and, upon correct prediction, retains the sample to continually enrich the prototype library. Across Kaggle and Pandora benchmarks, ProtoMBTI improves over baselines on both the four MBTI dichotomies and the full 16-type task, and exhibits robust cross-dataset generalization. Our results indicate that aligning the inference process with psychological prototype reasoning yields gains in accuracy, interpretability, and transfer for text-based personality modeling.
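
The retrieve, reuse, and retain steps described in the ProtoMBTI abstract can be pictured with a small sketch. This is only an illustration under stated assumptions: the three-dimensional vectors, the labels, and the helper names (`cosine`, `predict_type`, `retain`) are hypothetical, and a plain majority vote stands in for the paper's prompt-based LLM voting and revision, which the abstract does not specify in detail.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict_type(query_vec, prototypes, k=3):
    """Retrieve the top-k most similar prototypes and return the majority label.

    `prototypes` is a list of (embedding, label) pairs. Majority voting is an
    assumption standing in for prompt-based voting with an LLM.
    """
    ranked = sorted(prototypes, key=lambda p: cosine(query_vec, p[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def retain(prototypes, query_vec, predicted, gold):
    """Add the sample to the prototype bank only when the prediction was correct."""
    if predicted == gold:
        prototypes.append((query_vec, gold))


# Toy prototype bank (hypothetical embeddings and labels).
prototypes = [
    ([0.9, 0.1, 0.0], "INTJ"),
    ([0.8, 0.2, 0.1], "INTJ"),
    ([0.1, 0.9, 0.2], "ENFP"),
    ([0.0, 0.8, 0.3], "ENFP"),
]
query = [0.85, 0.15, 0.05]

prediction = predict_type(query, prototypes, k=3)
retain(prototypes, query, prediction, gold="INTJ")
print(prediction, len(prototypes))
```

In this toy run the two nearest prototypes share a label, so the vote is decided without a tie; a real system would need an explicit tie-breaking or revision rule.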

Title: ParaScopes: What do Language Models Activations Encode About Future Text?

Authors: Nicky Pochinkov, Yulia Volkova, Anna Vasileva, Sai V R Chereddy
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.00180
Pdf URL: https://arxiv.org/pdf/2511.00180
Copy Paste: [[2511.00180]] ParaScopes: What do Language Models Activations Encode About Future Text?(https://arxiv.org/abs/2511.00180)
Keywords: language model
Abstract: Interpretability studies in language models often investigate forward-looking representations of activations. However, as language models become capable of doing ever longer time horizon tasks, methods for understanding activations often remain limited to testing specific concepts or tokens. We develop a framework of Residual Stream Decoders as a method of probing model activations for paragraph-scale and document-scale plans. We test several methods and find information can be decoded equivalent to 5+ tokens of future context in small models. These results lay the groundwork for better monitoring of language models and better understanding how they might encode longer-term planning information.

Title: Training LLMs Beyond Next Token Prediction - Filling the Mutual Information Gap

Authors: Chun-Hao Yang, Bo-Han Feng, Tzu-Yuan Lai, Yan Yu Chen, Yin-Kai Dean Huang, Shou-De Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.00198
Pdf URL: https://arxiv.org/pdf/2511.00198
Copy Paste: [[2511.00198]] Training LLMs Beyond Next Token Prediction - Filling the Mutual Information Gap(https://arxiv.org/abs/2511.00198)
Keywords: language model, llm
Abstract: Optimizing training performance in large language models (LLMs) remains an essential challenge, particularly in improving model performance while maintaining computational costs. This work challenges the conventional approach of training LLMs using next-token prediction (NTP), arguing that by predicting information-rich tokens during training, there is a more effective way to train LLMs. We investigate the impact of the proposed solution in three kinds of tasks for LLMs: arithmetic, multi-label classification of text, and natural-language generation. This work offers a principled approach to optimizing LLM training, advancing both model performance and theoretical understanding of the target-token selection strategies.

Title: Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning

Authors: Marwa Abdulhai, Ryan Cheng, Donovan Clay, Tim Althoff, Sergey Levine, Natasha Jaques
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.00222
Pdf URL: https://arxiv.org/pdf/2511.00222
Copy Paste: [[2511.00222]] Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning(https://arxiv.org/abs/2511.00222)
Keywords: language model, llm, prompt, chat, agent
Abstract: Large Language Models (LLMs) are increasingly used to simulate human users in interactive settings such as therapy, education, and social role-play. While these simulations enable scalable training and evaluation of AI agents, off-the-shelf LLMs often drift from their assigned personas, contradict earlier statements, or abandon role-appropriate behavior. We introduce a unified framework for evaluating and improving persona consistency in LLM-generated dialogue. We define three automatic metrics (prompt-to-line consistency, line-to-line consistency, and Q&A consistency) that capture different types of persona drift, and we validate each against human annotations. Using these metrics as reward signals, we apply multi-turn reinforcement learning to fine-tune LLMs for three user roles: a patient, a student, and a social chat partner. Our method reduces inconsistency by over 55%, resulting in more coherent and faithful simulated users.

Title: AgentBnB: A Browser-Based Cybersecurity Tabletop Exercise with Large Language Model Support and Retrieval-Aligned Scaffolding

Authors: Arman Anwar, Zefang Liu
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2511.00265
Pdf URL: https://arxiv.org/pdf/2511.00265
Copy Paste: [[2511.00265]] AgentBnB: A Browser-Based Cybersecurity Tabletop Exercise with Large Language Model Support and Retrieval-Aligned Scaffolding(https://arxiv.org/abs/2511.00265)
Keywords: language model, prompt, agent
Abstract: Traditional cybersecurity tabletop exercises (TTXs) provide valuable training but are often scripted, resource-intensive, and difficult to scale. We introduce AgentBnB, a browser-based re-imagining of the Backdoors & Breaches game that integrates large language model teammates with a Bloom-aligned, retrieval-augmented copilot (C2D2). The system expands a curated corpus into factual, conceptual, procedural, and metacognitive snippets, delivering on-demand, cognitively targeted hints. Prompt-engineered agents employ a scaffolding ladder that gradually fades as learner confidence grows. In a solo-player pilot with four graduate students, participants reported greater intention to use the agent-based version compared to the physical card deck and viewed it as more scalable, though a ceiling effect emerged on a simple knowledge quiz. Despite limitations of small sample size, single-player focus, and narrow corpus, these early findings suggest that large language model augmented TTXs can provide lightweight, repeatable practice without the logistical burden of traditional exercises. Planned extensions include multi-player modes, telemetry-driven coaching, and comparative studies with larger cohorts.

Title: IL-PCSR: Legal Corpus for Prior Case and Statute Retrieval

Authors: Shounak Paul, Dhananjay Ghumare, Pawan Goyal, Saptarshi Ghosh, Ashutosh Modi
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2511.00268
Pdf URL: https://arxiv.org/pdf/2511.00268
Copy Paste: [[2511.00268]] IL-PCSR: Legal Corpus for Prior Case and Statute Retrieval(https://arxiv.org/abs/2511.00268)
Keywords: llm
Abstract: Identifying/retrieving relevant statutes and prior cases/precedents for a given legal situation are common tasks exercised by law practitioners. Researchers to date have addressed the two tasks independently, thus developing completely different datasets and models for each task; however, both retrieval tasks are inherently related, e.g., similar cases tend to cite similar statutes (due to similar factual situations). In this paper, we address this gap. We propose IL-PCSR (Indian Legal corpus for Prior Case and Statute Retrieval), a unique corpus that provides a common testbed for developing models for both tasks (Statute Retrieval and Precedent Retrieval) that can exploit the dependence between the two. We experiment extensively with several baseline models on the tasks, including lexical models, semantic models, and ensembles based on GNNs. Further, to exploit the dependence between the two tasks, we develop an LLM-based re-ranking approach that gives the best performance.

Title: Language Modeling With Factorization Memory

Authors: Lee Xiong, Maksim Tkachenko, Johanes Effendi, Ting Cai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.00315
Pdf URL: https://arxiv.org/pdf/2511.00315
Copy Paste: [[2511.00315]] Language Modeling With Factorization Memory(https://arxiv.org/abs/2511.00315)
Keywords: language model
Abstract: We propose Factorization Memory, an efficient recurrent neural network (RNN) architecture that achieves performance comparable to Transformer models on short-context language modeling tasks while also demonstrating superior generalization in long-context scenarios. Our model builds upon Mamba-2, enabling Factorization Memory to exploit parallel computations during training while preserving constant computational and memory complexity during inference. To further optimize model efficiency and representational capacity, we develop a sparse formulation of Factorization Memory that updates only a subset of recurrent states at each step while preserving the strong performance of its dense counterpart. To our knowledge, this represents the first RNN architecture that successfully combines sparse memory activation with competitive performance across both short and long-context settings. This work provides a systematic empirical analysis of Factorization Memory in comparison to Transformer and Mamba-2 architectures.

Title: Reversal Invariance in Autoregressive Language Models

Authors: Mihir Sahasrabudhe
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.00341
Pdf URL: https://arxiv.org/pdf/2511.00341
Copy Paste: [[2511.00341]] Reversal Invariance in Autoregressive Language Models(https://arxiv.org/abs/2511.00341)
Keywords: language model
Abstract: We formalize a structural property of the causal (autoregressive) language modeling (CLM) objective: reversal invariance. Formally, the next-token prediction loss assigns identical likelihood to a corpus and its reversal, implying that standard CLM pretraining is direction-blind. This symmetry explains why models trained on reversed text can achieve comparable performance to those trained on forward text, despite the inherently time-asymmetric nature of human language and reasoning. We argue that this invariance represents a limitation of current pretraining objectives rather than a benign artifact. If natural language encodes directional dependencies - phonological, morphological, or causal - a symmetric objective may fail to capture them. We therefore propose viewing pretraining through the lens of temporal asymmetry, motivating future work on loss functions and architectures that explicitly model the arrow of language while retaining standard language modeling capacity.
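
The reversal-invariance claim can be checked on a toy case. The sketch below is an assumption-laden illustration, not the paper's setup: it fits a count-based (maximum-likelihood) bigram model, once on a tiny made-up corpus and once on the same corpus with every sentence reversed, and compares the training log-likelihoods. The corpus, the boundary symbols, and the function name are all hypothetical.

```python
import math
from collections import Counter


def bigram_log_likelihood(sentences):
    """Fit an MLE bigram model on `sentences` and return its training log-likelihood."""
    pair_counts = Counter()
    context_counts = Counter()
    for sent in sentences:
        tokens = ["<s>"] + list(sent) + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            pair_counts[(prev, nxt)] += 1
            context_counts[prev] += 1
    # Each bigram contributes count * log P(next | prev) under the MLE estimate.
    return sum(
        n * math.log(n / context_counts[prev])
        for (prev, _nxt), n in pair_counts.items()
    )


corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat", "down"],
    ["a", "cat", "ran"],
]
reversed_corpus = [sentence[::-1] for sentence in corpus]

forward_ll = bigram_log_likelihood(corpus)
backward_ll = bigram_log_likelihood(reversed_corpus)
print(forward_ll, backward_ll)
```

The two values agree up to floating-point error because the bigram counts and the per-token context counts form the same multisets in both directions. Whether a neural model trained by gradient descent reaches matching losses in practice is a separate empirical question.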

Title: LingGym: How Far Are LLMs from Thinking Like Field Linguists?

Authors: Changbing Yang, Franklin Ma, Freda Shi, Jian Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.00343
Pdf URL: https://arxiv.org/pdf/2511.00343
Copy Paste: [[2511.00343]] LingGym: How Far Are LLMs from Thinking Like Field Linguists?(https://arxiv.org/abs/2511.00343)
Keywords: llm
Abstract: This paper introduces LingGym, a new benchmark that evaluates LLMs' capacity for meta-linguistic reasoning using Interlinear Glossed Text (IGT) and grammatical descriptions extracted from 18 typologically diverse reference grammars. Unlike previous work that focuses on specific downstream tasks, we assess whether LLMs can generalize linguistic inference across low-resource languages and structures not seen during training. We present a controlled evaluation task: Word-Gloss Inference, in which the model must infer a missing word and gloss from context using varying levels of linguistic information (e.g., glosses, grammatical explanations, translations). Our results show that incorporating structured linguistic cues leads to consistent improvements in reasoning performance across all models. This work highlights both the promise and current limitations of using LLMs for typologically informed linguistic analysis and low-resource language documentation.

Title: Reasoning Trajectories for Socratic Debugging of Student Code: From Misconceptions to Contradictions and Updated Beliefs

Authors: Erfan Al-Hossami, Razvan Bunescu
Subjects: cs.CL, cs.CY, cs.SE
Abstract URL: https://arxiv.org/abs/2511.00371
Pdf URL: https://arxiv.org/pdf/2511.00371
Copy Paste: [[2511.00371]] Reasoning Trajectories for Socratic Debugging of Student Code: From Misconceptions to Contradictions and Updated Beliefs(https://arxiv.org/abs/2511.00371)
Keywords: llm
Abstract: In Socratic debugging, instructors guide students towards identifying and fixing a bug on their own, instead of providing the bug fix directly. Most novice programmer bugs are caused by programming misconceptions, namely false beliefs about a programming concept. In this context, Socratic debugging can be formulated as a guided Reasoning Trajectory (RT) leading to a statement about the program behavior that contradicts the bug-causing misconception. Upon reaching this statement, the ensuing cognitive dissonance leads the student to first identify and then update their false belief. In this paper, we introduce the task of reasoning trajectory generation, together with a dataset of debugging problems manually annotated with RTs. We then describe LLM-based solutions for generating RTs and Socratic conversations that are anchored on them. A large-scale LLM-as-judge evaluation shows that frontier models can generate up to 91% correct reasoning trajectories and 98.7% valid conversation turns.

Title: PADBen: A Comprehensive Benchmark for Evaluating AI Text Detectors Against Paraphrase Attacks

Authors: Yiwei Zha, Rui Min, Shanu Sushmita
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.00416
Pdf URL: https://arxiv.org/pdf/2511.00416
Copy Paste: [[2511.00416]] PADBen: A Comprehensive Benchmark for Evaluating AI Text Detectors Against Paraphrase Attacks(https://arxiv.org/abs/2511.00416)
Keywords: llm
Abstract: While AI-generated text (AIGT) detectors achieve over 90% accuracy on direct LLM outputs, they fail catastrophically against iteratively-paraphrased content. We investigate why iteratively-paraphrased text -- itself AI-generated -- evades detection systems designed for AIGT identification. Through intrinsic mechanism analysis, we reveal that iterative paraphrasing creates an intermediate laundering region characterized by semantic displacement with preserved generation patterns, which brings up two attack categories: paraphrasing human-authored text (authorship obfuscation) and paraphrasing LLM-generated text (plagiarism evasion). To address these vulnerabilities, we introduce PADBen, the first benchmark systematically evaluating detector robustness against both paraphrase attack scenarios. PADBen comprises a five-type text taxonomy capturing the full trajectory from original content to deeply laundered text, and five progressive detection tasks across sentence-pair and single-sentence challenges. We evaluate 11 state-of-the-art detectors, revealing critical asymmetry: detectors successfully identify the plagiarism evasion problem but fail for the case of authorship obfuscation. Our findings demonstrate that current detection approaches cannot effectively handle the intermediate laundering region, necessitating fundamental advances in detection architectures beyond existing semantic and stylistic discrimination methods. For detailed code implementation, please see this https URL.

Title: MedRECT: A Medical Reasoning Benchmark for Error Correction in Clinical Texts

Authors: Naoto Iwase, Hiroki Okuyama, Junichiro Iwasawa
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.00421
Pdf URL: https://arxiv.org/pdf/2511.00421
Copy Paste: [[2511.00421]] MedRECT: A Medical Reasoning Benchmark for Error Correction in Clinical Texts(https://arxiv.org/abs/2511.00421)
Keywords: language model, llm
Abstract: Large language models (LLMs) show increasing promise in medical applications, but their ability to detect and correct errors in clinical texts -- a prerequisite for safe deployment -- remains under-evaluated, particularly beyond English. We introduce MedRECT, a cross-lingual benchmark (Japanese/English) that formulates medical error handling as three subtasks: error detection, error localization (sentence extraction), and error correction. MedRECT is built with a scalable, automated pipeline from the Japanese Medical Licensing Examinations (JMLE) and a curated English counterpart, yielding MedRECT-ja (663 texts) and MedRECT-en (458 texts) with comparable error/no-error balance. We evaluate 9 contemporary LLMs spanning proprietary, open-weight, and reasoning families. Key findings: (i) reasoning models substantially outperform standard architectures, with up to 13.5% relative improvement in error detection and 51.0% in sentence extraction; (ii) cross-lingual evaluation reveals 5-10% performance gaps from English to Japanese, with smaller disparities for reasoning models; (iii) targeted LoRA fine-tuning yields asymmetric improvements in error correction performance (Japanese: +0.078, English: +0.168) while preserving reasoning capabilities; and (iv) our fine-tuned model exceeds human expert performance on structured medical error correction tasks. To our knowledge, MedRECT is the first comprehensive cross-lingual benchmark for medical error correction, providing a reproducible framework and resources for developing safer medical LLMs across languages.

Title: G2: Guided Generation for Enhanced Output Diversity in LLMs

Authors: Zhiwen Ruan, Yixia Li, Yefeng Liu, Yun Chen, Weihua Luo, Peng Li, Yang Liu, Guanhua Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.00432
Pdf URL: https://arxiv.org/pdf/2511.00432
Copy Paste: [[2511.00432]] G2: Guided Generation for Enhanced Output Diversity in LLMs(https://arxiv.org/abs/2511.00432)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated exceptional performance across diverse natural language processing tasks. However, these models exhibit a critical limitation in output diversity, often generating highly similar content across multiple attempts. This limitation significantly affects tasks requiring diverse outputs, from creative writing to reasoning. Existing solutions, like temperature scaling, enhance diversity by modifying probability distributions but compromise output quality. We propose Guide-to-Generation (G2), a training-free plug-and-play method that enhances output diversity while preserving generation quality. G2 employs a base generator alongside dual Guides, which guide the generation process through decoding-based interventions to encourage more diverse outputs conditioned on the original query. Comprehensive experiments demonstrate that G2 effectively improves output diversity while maintaining an optimal balance between diversity and quality.

Title: Remembering Unequally: Global and Disciplinary Bias in LLM-Generated Co-Authorship Networks

Authors: Ghazal Kalhor, Afra Mashhadi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.00476
Pdf URL: https://arxiv.org/pdf/2511.00476
Copy Paste: [[2511.00476]] Remembering Unequally: Global and Disciplinary Bias in LLM-Generated Co-Authorship Networks(https://arxiv.org/abs/2511.00476)
Keywords: language model, llm
Abstract: Ongoing breakthroughs in Large Language Models (LLMs) are reshaping search and recommendation platforms at their core. While this shift unlocks powerful new scientometric tools, it also exposes critical fairness and bias issues that could erode the integrity of the information ecosystem. Additionally, as LLMs become more integrated into web-based searches for scholarly tools, their ability to generate summarized research work based on memorized data introduces new dimensions to these challenges. The extent of memorization in LLMs can impact the accuracy and fairness of the co-authorship networks they produce, potentially reflecting and amplifying existing biases within the scientific community and across different regions. This study critically examines the impact of LLM memorization on the co-authorship networks. To this end, we assess memorization effects across three prominent models, DeepSeek R1, Llama 4 Scout, and Mixtral 8x7B, analyzing how memorization-driven outputs vary across academic disciplines and world regions. While our global analysis reveals a consistent bias favoring highly cited researchers, this pattern is not uniformly observed. Certain disciplines, such as Clinical Medicine, and regions, including parts of Africa, show more balanced representation, pointing to areas where LLM training data may reflect greater equity. These findings underscore both the risks and opportunities in deploying LLMs for scholarly discovery.

Title: Leveraging the Cross-Domain & Cross-Linguistic Corpus for Low Resource NMT: A Case Study On Bhili-Hindi-English Parallel Corpus

Authors: Pooja Singh, Shashwat Bhardwaj, Vaibhav Sharma, Sandeep Kumar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.00486
Pdf URL: https://arxiv.org/pdf/2511.00486
Copy Paste: [[2511.00486]] Leveraging the Cross-Domain & Cross-Linguistic Corpus for Low Resource NMT: A Case Study On Bhili-Hindi-English Parallel Corpus(https://arxiv.org/abs/2511.00486)
Keywords: language model, llm
Abstract: The linguistic diversity of India poses significant machine translation challenges, especially for underrepresented tribal languages like Bhili, which lack high-quality linguistic resources. This paper addresses the gap by introducing Bhili-Hindi-English Parallel Corpus (BHEPC), the first and largest parallel corpus worldwide comprising 110,000 meticulously curated sentences across Bhili, Hindi, and English. The corpus was created with the assistance of expert human translators. BHEPC spans critical domains such as education, administration, and news, establishing a valuable benchmark for research in low resource machine translation. To establish a comprehensive Bhili Machine Translation benchmark, we evaluated a wide range of proprietary and open-source Multilingual Large Language Models (MLLMs) on bidirectional translation tasks between English/Hindi and Bhili. Comprehensive evaluation demonstrates that the fine-tuned NLLB-200 distilled 600M variant model outperforms others, highlighting the potential of multilingual models in low resource scenarios. Furthermore, we investigated the generative translation capabilities of multilingual LLMs on BHEPC using in-context learning, assessing performance under cross-domain generalization and quantifying distributional divergence. This work bridges a critical resource gap and promotes inclusive natural language processing technologies for low-resource and marginalized languages globally.

Title: ToM: Leveraging Tree-oriented MapReduce for Long-Context Reasoning in Large Language Models

Authors: Jiani Guo, Zuchao Li, Jie Wu, Qianren Wang, Yun Li, Lefei Zhang, Hai Zhao, Yujiu Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.00489
Pdf URL: https://arxiv.org/pdf/2511.00489
Copy Paste: [[2511.00489]] ToM: Leveraging Tree-oriented MapReduce for Long-Context Reasoning in Large Language Models(https://arxiv.org/abs/2511.00489)
Keywords: language model, llm, long context, retrieval-augmented generation
Abstract: Large Language Models (LLMs), constrained by limited context windows, often face significant performance degradation when reasoning over long contexts. To address this, Retrieval-Augmented Generation (RAG) retrieves and reasons over chunks but frequently sacrifices logical coherence due to its reliance on similarity-based rankings. Similarly, divide-and-conquer frameworks (DCF) split documents into small chunks for independent reasoning and aggregation. While effective for local reasoning, DCF struggles to capture long-range dependencies and risks inducing conflicts by processing chunks in isolation. To overcome these limitations, we propose ToM, a novel Tree-oriented MapReduce framework for long-context reasoning. ToM leverages the inherent hierarchical structure of long documents (e.g., main headings and subheadings) by constructing a DocTree through hierarchical semantic parsing and performing bottom-up aggregation. Using a Tree MapReduce approach, ToM enables recursive reasoning: in the Map step, rationales are generated at child nodes; in the Reduce step, these rationales are aggregated across sibling nodes to resolve conflicts or reach consensus at parent nodes. Experimental results on 70B+ LLMs show that ToM significantly outperforms existing divide-and-conquer frameworks and retrieval-augmented generation methods, achieving better logical coherence and long-context reasoning. Our code is available at this https URL.
摘要：大型语言模型 (LLM) 受到有限上下文窗口的限制，在长上下文推理时通常会面临显着的性能下降。为了解决这个问题，检索增强生成（RAG）对块进行检索和推理，但由于依赖于基于相似性的排名，经常牺牲逻辑一致性。类似地，分而治之框架（DCF）将文档分割成小块以进行独立推理和聚合。虽然 DCF 对本地推理有效，但它很难捕获远程依赖关系，并且存在通过单独处理块而引发冲突的风险。为了克服这些限制，我们提出了 ToM，一种用于长上下文推理的新型面向树的 MapReduce 框架。 ToM 通过分层语义解析构建 DocTree 并执行自下而上的聚合，利用长文档（例如主标题和副标题）固有的分层结构。 ToM 使用 Tree MapReduce 方法实现递归推理：在 Map 步骤中，在子节点处生成基本原理；在Reduce步骤中，这些基本原理在兄弟节点之间聚合，以解决冲突或在父节点达成共识。 70B+ LLM 的实验结果表明，ToM 显着优于现有的分而治之框架和检索增强生成方法，实现了更好的逻辑连贯性和长上下文推理。我们的代码可在此 https URL 获取。
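The Map and Reduce steps described in the ToM abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `DocNode` class, the `tree_map_reduce` function, and the `map_fn` / `reduce_fn` callables are hypothetical names, and the two callables are string-building stand-ins for what would be LLM calls in a real system.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class DocNode:
    """Hypothetical node of a document tree: a heading, optional body text, sub-sections."""
    heading: str
    text: str = ""
    children: List["DocNode"] = field(default_factory=list)


def tree_map_reduce(
    node: DocNode,
    question: str,
    map_fn: Callable[[str, str, str], str],
    reduce_fn: Callable[[str, str, List[str]], str],
) -> str:
    """Bottom-up reasoning over a document tree (illustrative sketch only)."""
    rationales: List[str] = []
    # Map: produce a rationale from this node's own text, if it has any.
    if node.text:
        rationales.append(map_fn(question, node.heading, node.text))
    # Recurse so that every child subtree is summarized before its parent.
    for child in node.children:
        rationales.append(tree_map_reduce(child, question, map_fn, reduce_fn))
    if not rationales:
        return ""
    if len(rationales) == 1:
        return rationales[0]
    # Reduce: aggregate sibling rationales at the parent node.
    return reduce_fn(question, node.heading, rationales)


# Stand-ins for LLM calls (assumptions made purely for illustration).
def toy_map(question: str, heading: str, text: str) -> str:
    return f"{heading}:{text}"


def toy_reduce(question: str, heading: str, rationales: List[str]) -> str:
    return f"{heading}(" + ", ".join(rationales) + ")"


doc = DocNode("Report", children=[
    DocNode("Intro", "A"),
    DocNode("Methods", children=[DocNode("Data", "B"), DocNode("Model", "C")]),
])
answer = tree_map_reduce(doc, "What was done?", toy_map, toy_reduce)
print(answer)
```

In this sketch conflict resolution would live entirely inside `reduce_fn`; the recursion only guarantees that sibling rationales reach the parent together.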

Title: Zero-RAG: Towards Retrieval-Augmented Generation with Zero Redundant Knowledge

Authors: Qi Luo, Xiaonan Li, Junqi Dai, Shuang Cheng, Xipeng Qiu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.00505
Pdf URL: https://arxiv.org/pdf/2511.00505
Copy Paste: [[2511.00505]] Zero-RAG: Towards Retrieval-Augmented Generation with Zero Redundant Knowledge(https://arxiv.org/abs/2511.00505)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG), which usually uses a large external corpus to supplement the knowledge of LLMs, has shown remarkable results in addressing Large Language Models' hallucinations. However, with the development of LLMs, the internal knowledge of LLMs has expanded significantly, causing significant knowledge redundancy between the external corpus and LLMs. On the one hand, the indexing cost of dense retrieval is closely tied to the corpus size, so significant redundant knowledge increases the dense retrieval workload. On the other hand, the redundant knowledge in the external corpus is not helpful to LLMs, and our exploratory analysis shows that it instead hurts RAG performance on questions the LLM can answer by itself. To address these issues, we propose Zero-RAG. Specifically, we first propose the Mastery-Score metric to identify redundant knowledge in the RAG corpus and prune it. After pruning, answers to "mastered" questions rely primarily on the internal knowledge of the LLM. To better harness this internal capacity, we propose Query Router and Noise-Tolerant Tuning to avoid distraction from irrelevant documents and thus further improve the LLM's utilization of internal knowledge with the pruned corpus. Experimental results show that Zero-RAG prunes the Wikipedia corpus by 30% and accelerates the retrieval stage by 22%, without compromising RAG's performance.
摘要：检索增强生成在解决大型语言模型的幻觉方面表现出了显着的效果，大型语言模型通常使用大型外部语料库来补充法学硕士的知识。然而，随着LLM的发展，LLM的内部知识显着扩展，从而导致外部语料库和LLM之间存在显着的知识冗余。一方面，密集检索的索引成本与语料库大小高度相关，因此大量的冗余知识加剧了密集检索的工作量。另一方面，外部语料库中的冗余知识对法学硕士没有帮助，我们的探索性分析表明，它反而损害了法学硕士可以自行回答的那些问题上的 RAG 表现。为了解决这些问题，我们提出 Zero-RAG 来应对这些挑战。具体来说，我们首先提出 Mastery-Score 指标来识别 RAG 语料库中的冗余知识以对其进行修剪。修剪后，“掌握”问题的答案主要依赖于法学硕士的内部知识。为了更好地利用内部能力，我们提出了查询路由器和噪声容忍调整，以避免不相关文档的干扰，从而通过修剪语料库进一步提高法学硕士对内部知识的利用率。实验结果表明，Zero-RAG 将维基百科语料库修剪了 30%，并将检索阶段加速了 22%，而不会影响 RAG 的性能。

Title: Fine-Tuning DialoGPT on Common Diseases in Rural Nepal for Medical Conversations

Authors: Birat Poudel, Satyam Ghimire, Er. Prakash Chandra Prasad
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.00514
Pdf URL: https://arxiv.org/pdf/2511.00514
Copy Paste: [[2511.00514]] Fine-Tuning DialoGPT on Common Diseases in Rural Nepal for Medical Conversations(https://arxiv.org/abs/2511.00514)
Keywords: gpt, agent
Abstract: Conversational agents are increasingly being explored to support healthcare delivery, particularly in resource-constrained settings such as rural Nepal. Large-scale conversational models typically rely on internet connectivity and cloud infrastructure, which may not be accessible in rural areas. In this study, we fine-tuned DialoGPT, a lightweight generative dialogue model that can operate offline, on a synthetically constructed dataset of doctor-patient interactions covering ten common diseases prevalent in rural Nepal, including common cold, seasonal fever, diarrhea, typhoid fever, gastritis, food poisoning, malaria, dengue fever, tuberculosis, and pneumonia. Despite being trained on a limited, domain-specific dataset, the fine-tuned model produced coherent, contextually relevant, and medically appropriate responses, demonstrating an understanding of symptoms, disease context, and empathetic communication. These results highlight the adaptability of compact, offline-capable dialogue models and the effectiveness of targeted datasets for domain adaptation in low-resource healthcare environments, offering promising directions for future rural medical conversational AI.
摘要：人们越来越多地探索对话代理来支持医疗保健服务，特别是在尼泊尔农村等资源有限的环境中。大规模对话模型通常依赖于互联网连接和云基础设施，而这些在农村地区可能无法访问。在这项研究中，我们在综合构建的医患互动数据集上对 DialoGPT（一种可以离线运行的轻量级生成对话模型）进行了微调，该数据集涵盖了尼泊尔农村地区流行的十种常见疾病，包括普通感冒、季节性发烧、腹泻、伤寒、胃炎、食物中毒、疟疾、登革热、肺结核和肺炎。尽管是在有限的、特定领域的数据集上进行训练的，但经过微调的模型产生了连贯的、上下文相关的、医学上适当的响应，展示了对症状、疾病背景和同理心沟通的理解。这些结果凸显了紧凑、离线对话模型的适应性以及目标数据集在资源匮乏的医疗保健环境中领域适应的有效性，为未来农村医疗对话人工智能提供了有希望的方向。

Title: Exploring and Mitigating Gender Bias in Encoder-Based Transformer Models

Authors: Ariyan Hossain, Khondokar Mohammad Ahanaf Hannan, Rakinul Haque, Nowreen Tarannum Rafa, Humayra Musarrat, Shoaib Ahmed Dipu, Farig Yousuf Sadeque
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.00519
Pdf URL: https://arxiv.org/pdf/2511.00519
Copy Paste: [[2511.00519]] Exploring and Mitigating Gender Bias in Encoder-Based Transformer Models(https://arxiv.org/abs/2511.00519)
Keywords: language model
Abstract: Gender bias in language models has gained increasing attention in the field of natural language processing. Encoder-based transformer models, which have achieved state-of-the-art performance in various language tasks, have been shown to exhibit strong gender biases inherited from their training data. This paper investigates gender bias in contextualized word embeddings, a crucial component of transformer-based models. We focus on prominent architectures such as BERT, ALBERT, RoBERTa, and DistilBERT to examine their vulnerability to gender bias. To quantify the degree of bias, we introduce a novel metric, MALoR, which assesses bias based on model probabilities for filling masked tokens. We further propose a mitigation approach involving continued pre-training on a gender-balanced dataset generated via Counterfactual Data Augmentation. Our experiments reveal significant reductions in gender bias scores across different pronoun pairs. For instance, in BERT-base, bias scores for "he-she" dropped from 1.27 to 0.08, and "his-her" from 2.51 to 0.36 following our mitigation approach. We also observed similar improvements across other models, with "male-female" bias decreasing from 1.82 to 0.10 in BERT-large. Our approach effectively reduces gender bias without compromising model performance on downstream tasks.
摘要：语言模型中的性别偏见在自然语言处理领域受到越来越多的关注。基于编码器的 Transformer 模型在各种语言任务中都取得了最先进的性能，已被证明表现出从训练数据中继承的强烈性别偏见。本文研究了上下文词嵌入中的性别偏见，这是基于 Transformer 的模型的重要组成部分。我们重点关注 BERT、ALBERT、RoBERTa 和 DistilBERT 等著名架构，以检查它们对性别偏见的脆弱性。为了量化偏差程度，我们引入了一种新颖的指标 MALoR，它根据填充屏蔽标记的模型概率来评估偏差。我们进一步提出了一种缓解方法，包括对通过反事实数据增强生成的性别平衡数据集进行持续预训练。我们的实验表明，不同代词对的性别偏见得分显着降低。例如，在 BERT-base 中，采用我们的缓解方法后，“he-she”的偏差分数从 1.27 下降到 0.08，“his-her”的偏差分数从 2.51 下降到 0.36。我们还在其他模型中观察到类似的改进，BERT-large 中的“男性-女性”偏差从 1.82 降至 0.10。我们的方法有效地减少了性别偏见，而不影响下游任务的模型性能。

Title: Word Salad Chopper: Reasoning Models Waste A Ton Of Decoding Budget On Useless Repetitions, Self-Knowingly

Authors: Wenya Xie, Shaochen (Henry) Zhong, Hoang Anh Duy Le, Zhaozhuo Xu, Jianwen Xie, Zirui Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.00536
Pdf URL: https://arxiv.org/pdf/2511.00536
Copy Paste: [[2511.00536]] Word Salad Chopper: Reasoning Models Waste A Ton Of Decoding Budget On Useless Repetitions, Self-Knowingly(https://arxiv.org/abs/2511.00536)
Keywords: prompt
Abstract: Large Reasoning Models (LRMs) are often bottlenecked by the high cost of output tokens. We show that a significant portion of these tokens are useless self-repetitions - what we call "word salad" - that exhaust the decoding budget without adding value. Interestingly, we observe that LRMs are self-aware when trapped in these loops: the hidden states of <\n\n> tokens trailing each reasoning chunk exhibit patterns that allow us to detect word salad behavior on-the-fly via a single-layer linear classifier. Once detected, a simple chop appended by a straightforward regeneration prompt yields substantial length savings with minimal quality loss. Our work offers WordSaladChopper (WSC) - a lightweight, turnkey component for LRM that is minimally invasive to its reasoning trajectory by only removing semantically redundant tokens. Given its low overhead, strong savings, and the lack of semantic value of word salad tokens, we believe it is not too far-fetched to argue that WSC - or a similar component - is a must-have for all LRM applications with user experience in mind. Our code is publicly available at this https URL.
摘要：大型推理模型（LRM）通常会因输出代币的高成本而受到瓶颈。我们表明，这些标记的很大一部分是无用的自我重复——我们称之为“单词沙拉”——耗尽了解码预算而没有增加价值。有趣的是，我们观察到 LRM 在陷入这些循环时具有自我意识：每个推理块后面的 <\n\n> 标记的隐藏状态表现出的模式使我们能够通过单层线性分类器动态检测单词沙拉行为。一旦检测到，简单的切割加上简单的再生提示即可节省大量长度，同时将质量损失降至最低。我们的工作提供了 WordSaladChopper (WSC) - LRM 的一个轻量级交钥匙组件，通过仅删除语义上冗余的标记，对其推理轨迹影响最小。鉴于其低开销、强大的节省以及缺乏单词沙拉标记的语义价值，我们认为认为 WSC（或类似组件）是所有考虑到用户体验的 LRM 应用程序的必备品并不是太牵强。我们的代码可通过此 https URL 公开获取。

Title: Multi-refined Feature Enhanced Sentiment Analysis Using Contextual Instruction

Authors: Peter Atandoh, Jie Zou, Weikang Guo, Jiwei Wei, Zheng Wang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.00537
Pdf URL: https://arxiv.org/pdf/2511.00537
Copy Paste: [[2511.00537]] Multi-refined Feature Enhanced Sentiment Analysis Using Contextual Instruction(https://arxiv.org/abs/2511.00537)
Keywords: language model
Abstract: Sentiment analysis using deep learning and pre-trained language models (PLMs) has gained significant traction due to their ability to capture rich contextual representations. However, existing approaches often underperform in scenarios involving nuanced emotional cues, domain shifts, and imbalanced sentiment distributions. We argue that these limitations stem from inadequate semantic grounding, poor generalization to diverse linguistic patterns, and biases toward dominant sentiment classes. To overcome these challenges, we propose CISEA-MRFE, a novel PLM-based framework integrating Contextual Instruction (CI), Semantic Enhancement Augmentation (SEA), and Multi-Refined Feature Extraction (MRFE). CI injects domain-aware directives to guide sentiment disambiguation; SEA improves robustness through sentiment-consistent paraphrastic augmentation; and MRFE combines a Scale-Adaptive Depthwise Encoder (SADE) for multi-scale feature specialization with an Emotion Evaluator Context Encoder (EECE) for affect-aware sequence modeling. Experimental results on four benchmark datasets demonstrate that CISEA-MRFE consistently outperforms strong baselines, achieving relative improvements in accuracy of up to 4.6% on IMDb, 6.5% on Yelp, 30.3% on Twitter, and 4.1% on Amazon. These results validate the effectiveness and generalization ability of our approach for sentiment classification across varied domains.
摘要：使用深度学习和预训练语言模型 (PLM) 的情感分析由于能够捕获丰富的上下文表示而获得了巨大的关注。然而，现有的方法在涉及微妙的情感线索、领域转移和不平衡的情绪分布的场景中通常表现不佳。我们认为，这些限制源于语义基础不足、对不同语言模式的概括性较差以及对主导情感类别的偏见。为了克服这些挑战，我们提出了 CISEA-MRFE，这是一种基于 PLM 的新型框架，集成了上下文指令 (CI)、语义增强增强 (SEA) 和多重细化特征提取 (MRFE)。 CI 注入领域感知指令来指导情感消歧； SEA 通过情感一致的释义增强来提高鲁棒性； MRFE 将用于多尺度特征专业化的尺度自适应深度编码器 (SADE) 与用于情感感知序列建模的情感评估器上下文编码器 (EECE) 相结合。四个基准数据集的实验结果表明，CISEA-MRFE 始终优于强大的基线，在 IMDb 上的准确率相对提高了 4.6%，在 Yelp 上提高了 6.5%，在 Twitter 上提高了 30.3%，在 Amazon 上提高了 4.1%。这些结果验证了我们跨不同领域的情感分类方法的有效性和泛化能力。

Title: Friend or Foe: How LLMs' Safety Mind Gets Fooled by Intent Shift Attack

Authors: Peng Ding, Jun Kuang, Wen Sun, Zongyu Wang, Xuezhi Cao, Xunliang Cai, Jiajun Chen, Shujian Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.00556
Pdf URL: https://arxiv.org/pdf/2511.00556
Copy Paste: [[2511.00556]] Friend or Foe: How LLMs' Safety Mind Gets Fooled by Intent Shift Attack(https://arxiv.org/abs/2511.00556)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) remain vulnerable to jailbreaking attacks despite their impressive capabilities. Investigating these weaknesses is crucial for robust safety mechanisms. Existing attacks primarily distract LLMs by introducing additional context or adversarial tokens, leaving the core harmful intent unchanged. In this paper, we introduce ISA (Intent Shift Attack), which obscures the intent of an attack from the LLM. More specifically, we establish a taxonomy of intent transformations and leverage them to generate attacks that may be misperceived by LLMs as benign requests for information. Unlike prior methods relying on complex tokens or lengthy context, our approach needs only minimal edits to the original request and yields natural, human-readable, and seemingly harmless prompts. Extensive experiments on both open-source and commercial LLMs show that ISA achieves over 70% improvement in attack success rate compared to direct harmful prompts. More critically, fine-tuning models on only benign data reformulated with ISA templates elevates success rates to nearly 100%. For defense, we evaluate existing methods and demonstrate their inadequacy against ISA, while exploring both training-free and training-based mitigation strategies. Our findings reveal fundamental challenges in intent inference for LLM safety and underscore the need for more effective defenses. Our code and datasets are available at this https URL.
摘要：尽管大型语言模型 (LLM) 的功能令人印象深刻，但仍然容易受到越狱攻击。调查这些弱点对于健全的安全机制至关重要。现有的攻击主要通过引入额外的上下文或对抗性标记来分散法学硕士的注意力，从而使核心有害意图保持不变。在本文中，我们介绍了 ISA（意图转移攻击），它会混淆 LLM 的攻击意图。更具体地说，我们建立了意图转换的分类法，并利用它们来生成可能被法学硕士误解为良性信息请求的攻击。与之前依赖复杂标记或冗长上下文的方法不同，我们的方法只需要对原始请求进行最少的编辑，并产生自然的、人类可读的、看似无害的提示。对开源和商业 LLM 的大量实验表明，与直接有害提示相比，ISA 的攻击成功率提高了 70% 以上。更重要的是，仅对使用 ISA 模板重新制定的良性数据进行模型微调，可将成功率提高到接近 100%。对于防御，我们评估现有方法并证明它们针对 ISA 的不足，同时探索免训练和基于训练的缓解策略。我们的研究结果揭示了法学硕士安全意图推断的根本挑战，并强调需要更有效的防御。我们的代码和数据集可在此 https URL 获取。

Title: FlashEVA: Accelerating LLM inference via Efficient Attention

Authors: Juan Gabriel Kostelec, Qinghai Guo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.00576
Pdf URL: https://arxiv.org/pdf/2511.00576
Copy Paste: [[2511.00576]] FlashEVA: Accelerating LLM inference via Efficient Attention(https://arxiv.org/abs/2511.00576)
Keywords: llm
Abstract: Transformer models have revolutionized natural language processing, achieving state-of-the-art performance and demonstrating remarkable scalability. However, their memory demands, particularly due to maintaining full context in memory, pose significant challenges for inference. In this paper, we present FlashEVA, an efficient implementation of EVA (Efficient Attention via Control Variates), and demonstrate how to finetune transformers to adapt to FlashEVA attention. Our method enables fine-tuning of Transformer models with as few as 1.5B tokens while preserving effectiveness across various downstream tasks. Notably, FlashEVA achieves up to 6.7x higher throughput and 5x lower peak GPU memory usage during inference compared to standard Transformer implementations. Despite these improvements, we observe limitations in retrieval-focused tasks. Our implementation offers control over the trade-off between throughput and accuracy through adjustable hyperparameters, providing flexibility for diverse use cases. This work represents a significant step towards more efficient and adaptable Transformer-based models for inference.
摘要：Transformer 模型彻底改变了自然语言处理，实现了最先进的性能并展示了卓越的可扩展性。然而，他们的记忆需求，特别是由于在记忆中保持完整的上下文，对推理提出了重大挑战。在本文中，我们提出了 FlashEVA，这是 EVA（通过控制变量进行高效注意力）的有效实现，并演示了如何微调变压器以适应 FlashEVA 注意力。我们的方法可以使用少至 1.5B 的代币对 Transformer 模型进行微调，同时保持各种下游任务的有效性。值得注意的是，与标准 Transformer 实现相比，FlashEVA 在推理期间的吞吐量提高了 6.7 倍，GPU 内存使用峰值降低了 5 倍。尽管有这些改进，我们仍然观察到以检索为中心的任务的局限性。我们的实现通过可调节的超参数来控制吞吐量和准确性之间的权衡，为不同的用例提供灵活性。这项工作代表了朝着更高效、适应性更强的基于 Transformer 的推理模型迈出了重要一步。

Title: OpenSIR: Open-Ended Self-Improving Reasoner

Authors: Wai-Chung Kwan, Joshua Ong Jun Leang, Pavlos Vougiouklis, Jeff Z. Pan, Marco Valentino, Pasquale Minervini
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.00602
Pdf URL: https://arxiv.org/pdf/2511.00602
Copy Paste: [[2511.00602]] OpenSIR: Open-Ended Self-Improving Reasoner(https://arxiv.org/abs/2511.00602)
Keywords: language model, llm
Abstract: Recent advances in large language model (LLM) reasoning through reinforcement learning rely on annotated datasets for verifiable rewards, which may limit models' ability to surpass human-level performance. While self-play offers a promising alternative, existing approaches depend on external verifiers or cannot learn open-endedly. We present Open-Ended Self-Improving Reasoner (OpenSIR), a self-play framework where an LLM learns to generate and solve novel problems by alternating teacher and student roles without external supervision. To generate novel problems, OpenSIR optimises for both difficulty and diversity, rewarding problems that challenge appropriately while exploring distinct concepts, enabling open-ended mathematical discovery. Starting from a single trivial seed problem, OpenSIR substantially improves instruction models: Llama-3.2-3B-Instruct advances from 73.9 to 78.3 on GSM8K, and from 28.8 to 34.4 on College Math, while Gemma-2-2B-Instruct rises from 38.5 to 58.7 on GSM8K. Our analyses reveal that OpenSIR achieves open-ended learning through co-evolving teacher-student roles that adaptively calibrate difficulty and drive diverse exploration, progressing autonomously from basic to advanced mathematics.
摘要：通过强化学习进行大型语言模型 (LLM) 推理的最新进展依赖于带注释的数据集来获得可验证的奖励，这可能会限制模型超越人类水平表现的能力。虽然自我对弈提供了一种有前途的替代方案，但现有方法依赖于外部验证者或无法进行开放式学习。我们提出了开放式自我改进推理机（OpenSIR），这是一个自我游戏框架，法学硕士可以在没有外部监督的情况下通过交替教师和学生角色来学习生成和解决新问题。为了生成新颖的问题，OpenSIR 针对难度和多样性进行了优化，奖励在探索不同概念的同时适当挑战的问题，从而实现开放式数学发现。从一个简单的种子问题开始，OpenSIR 大幅改进了指令模型：Llama-3.2-3B-Instruct 在 GSM8K 上从 73.9 提高到 78.3，在大学数学上从 28.8 提高到 34.4，而 Gemma-2-2B-Instruct 在 GSM8K 上从 38.5 提高到 58.7。我们的分析表明，OpenSIR 通过共同进化的师生角色实现了开放式学习，自适应地校准难度并推动多样化的探索，从基础数学自主进展到高级数学。

Title: SpecDiff-2: Scaling Diffusion Drafter Alignment For Faster Speculative Decoding

Authors: Jameson Sandler, Jacob K. Christopher, Thomas Hartvigsen, Nando Fioretto
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.00606
Pdf URL: https://arxiv.org/pdf/2511.00606
Copy Paste: [[2511.00606]] SpecDiff-2: Scaling Diffusion Drafter Alignment For Faster Speculative Decoding(https://arxiv.org/abs/2511.00606)
Keywords: language model, llm
Abstract: Speculative decoding has become the standard approach for accelerating Large Language Model (LLM) inference. It exploits a lossless draft-then-verify procedure to circumvent the latency of autoregressive decoding, achieving impressive speed-ups. Yet, current speculative decoding approaches remain limited by two fundamental bottlenecks: (1) the autoregressive dependency during drafting, which limits parallelism, and (2) frequent rejections of draft tokens caused by misalignment between the draft and verify models. This paper proposes SpecDiff-2, a novel framework to jointly address these two bottlenecks. It leverages discrete diffusion as a non-autoregressive drafter to address bottleneck (1) and develops novel techniques to calibrate discrete diffusion drafters with autoregressive verifiers, addressing bottleneck (2). Experimental results across a comprehensive benchmark suite show that SpecDiff-2 achieves a new state-of-the-art across reasoning, coding, and mathematical benchmarks, improving tokens-per-second by up to +55% on average over previous baselines and obtaining up to a 5.5x average speed-up over standard decoding, without any loss of accuracy.
摘要：推测性解码已成为加速大型语言模型 (LLM) 推理的标准方法。它利用无损的草稿然后验证过程来规避自回归解码的延迟，从而实现令人印象深刻的加速。然而，当前的推测性解码方法仍然受到两个基本瓶颈的限制：（1）起草过程中的自回归依赖性限制了并行性，以及（2）由于草稿和验证模型之间的不一致而导致草稿令牌的频繁拒绝。本文提出了 SpecDiff-2，这是一种共同解决这两个瓶颈的新颖框架。它利用离散扩散作为非自回归绘图器来解决瓶颈 (1)，并开发新技术来使用自回归验证器校准离散扩散绘图器，从而解决瓶颈 (2)。综合基准测试套件的实验结果表明，SpecDiff-2 在推理、编码和数学基准测试中实现了新的最先进水平，与之前的基准相比，每秒的令牌数平均提高了 55%，并且与标准解码相比，平均速度提高了 5.5 倍，而没有任何准确性损失。
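The "lossless draft-then-verify procedure" mentioned in the SpecDiff-2 abstract can be illustrated with the standard rejection rule from the speculative sampling literature. The sketch below is that generic rule on toy distributions, not SpecDiff-2's diffusion drafter or its calibration techniques; `verify_draft` and its parameters are hypothetical names, and the bonus token normally sampled after a fully accepted draft is omitted for brevity.

```python
import random
from typing import List, Sequence


def verify_draft(
    draft: Sequence[int],
    q_dists: Sequence[Sequence[float]],
    p_dists: Sequence[Sequence[float]],
    rng: random.Random,
) -> List[int]:
    """Accept or reject drafted tokens against the verifier's distributions.

    draft[i] was sampled from the drafter distribution q_dists[i];
    p_dists[i] is the verifier distribution at the same position.
    """
    accepted: List[int] = []
    for tok, q, p in zip(draft, q_dists, p_dists):
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
            continue
        # On rejection, resample from the residual max(0, p - q) and stop;
        # rng.choices normalizes the weights internally.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        accepted.append(rng.choices(range(len(p)), weights=residual, k=1)[0])
        break
    return accepted


rng = random.Random(0)
uniform = [0.5, 0.5]
# Drafter and verifier agree exactly, so every drafted token is accepted.
agree = verify_draft([0, 1], [uniform, uniform], [uniform, uniform], rng)
# Drafter proposes a token the verifier assigns zero probability: it is
# rejected, replaced from the residual, and the rest of the draft is dropped.
disagree = verify_draft([0, 1], [[1.0, 0.0], uniform], [[0.0, 1.0], uniform], rng)
print(agree, disagree)
```

The second call shows why drafter-verifier misalignment is costly: one rejection discards every later drafted token.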

Title: Certain but not Probable? Differentiating Certainty from Probability in LLM Token Outputs for Probabilistic Scenarios

Authors: Autumn Toney-Wails, Ryan Wails
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.00620
Pdf URL: https://arxiv.org/pdf/2511.00620
Copy Paste: [[2511.00620]] Certain but not Probable? Differentiating Certainty from Probability in LLM Token Outputs for Probabilistic Scenarios(https://arxiv.org/abs/2511.00620)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Reliable uncertainty quantification (UQ) is essential for ensuring trustworthy downstream use of large language models, especially when they are deployed in decision-support and other knowledge-intensive applications. Model certainty can be estimated from token logits, with derived probability and entropy values offering insight into performance on the prompt task. However, this approach may be inadequate for probabilistic scenarios, where the probabilities of token outputs are expected to align with the theoretical probabilities of the possible outcomes. We investigate the relationship between token certainty and alignment with theoretical probability distributions in well-defined probabilistic scenarios. Using GPT-4.1 and DeepSeek-Chat, we evaluate model responses to ten prompts involving probability (e.g., roll a six-sided die), both with and without explicit probability cues in the prompt (e.g., roll a fair six-sided die). We measure two dimensions: (1) response validity with respect to scenario constraints, and (2) alignment between token-level output probabilities and theoretical probabilities. Our results indicate that, while both models achieve perfect in-domain response accuracy across all prompt scenarios, their token-level probability and entropy values consistently diverge from the corresponding theoretical distributions.
摘要：可靠的不确定性量化 (UQ) 对于确保大型语言模型在下游的可信使用至关重要，特别是当它们部署在决策支持和其他知识密集型应用程序中时。模型确定性可以根据 token logits 进行估计，并通过派生的概率和熵值来深入了解提示任务的性能。然而，这种方法可能不适用于概率场景，在概率场景中，代币输出的概率预计与可能结果的理论概率一致。我们研究了令牌确定性与明确定义的概率场景中的理论概率分布的一致性之间的关系。使用 GPT-4.1 和 DeepSeek-Chat，我们评估模型对涉及概率的十个提示（例如，掷六面骰子）的响应，提示中是否有明确的概率提示（例如，掷六面骰子）。我们测量两个维度：（1）相对于场景约束的响应有效性，以及（2）令牌级输出概率与理论概率之间的一致性。我们的结果表明，虽然这两个模型在所有提示场景中都实现了完美的域内响应精度，但它们的令牌级概率和熵值始终偏离相应的理论分布。
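The comparison between token-level output probabilities and theoretical probabilities can be made concrete for the six-sided-die prompt. The sketch below computes Shannon entropy and KL divergence against the uniform distribution; the `model_probs` values are made-up illustrative numbers, not results from the paper, and the abstract does not state which divergence measure the authors use.

```python
import math
from typing import Sequence


def entropy_bits(p: Sequence[float]) -> float:
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def kl_divergence_bits(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Theoretical distribution for a fair six-sided die.
theoretical = [1 / 6] * 6

# Hypothetical token probabilities for the answers "1".."6" (illustrative only):
# a model that is "certain" of one face even though the scenario is uniform.
model_probs = [0.02, 0.03, 0.05, 0.80, 0.05, 0.05]

print(f"theoretical entropy: {entropy_bits(theoretical):.3f} bits")  # log2(6), about 2.585
print(f"model entropy:       {entropy_bits(model_probs):.3f} bits")
print(f"KL(model || theory): {kl_divergence_bits(model_probs, theoretical):.3f} bits")
```

Any single answer here is a valid die roll, yet the low entropy and positive divergence show that the output distribution does not match the scenario, which is the distinction between certainty and probability the title draws.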

Title: Do You Know About My Nation? Investigating Multilingual Language Models' Cultural Literacy Through Factual Knowledge

Authors: Eshaan Tanwar, Anwoy Chatterjee, Michael Saxon, Alon Albalak, William Yang Wang, Tanmoy Chakraborty
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.00657
Pdf URL: https://arxiv.org/pdf/2511.00657
Copy Paste: [[2511.00657]] Do You Know About My Nation? Investigating Multilingual Language Models' Cultural Literacy Through Factual Knowledge(https://arxiv.org/abs/2511.00657)
Keywords: language model, llm
Abstract: Most multilingual question-answering benchmarks, while covering a diverse pool of languages, do not factor in regional diversity in the information they capture and tend to be Western-centric. This introduces a significant gap in fairly evaluating multilingual models' comprehension of factual information from diverse geographical locations. To address this, we introduce XNationQA for investigating the cultural literacy of multilingual LLMs. XNationQA encompasses a total of 49,280 questions on the geography, culture, and history of nine countries, presented in seven languages. We benchmark eight standard multilingual LLMs on XNationQA and evaluate them using two novel transference metrics. Our analyses uncover a considerable discrepancy in the models' accessibility to culturally specific facts across languages. Notably, we often find that a model demonstrates greater knowledge of cultural information in English than in the dominant language of the respective culture. The models exhibit better performance in Western languages, although this does not necessarily translate to being more literate for Western countries, which is counterintuitive. Furthermore, we observe that models have a very limited ability to transfer knowledge across languages, particularly evident in open-source models.
摘要：大多数多语言问答基准虽然涵盖多种语言，但没有考虑到它们捕获的信息的区域多样性，并且往往以西方为中心。这在公平评估多语言模型对来自不同地理位置的事实信息的理解方面存在显着差距。为了解决这个问题，我们引入了 XNationQA 来调查多语言法学硕士的文化素养。 XNationQA 总共包含 49,280 个问题，涉及 9 个国家的地理、文化和历史，以 7 种语言呈现。我们在 XNationQA 上对八个标准多语言法学硕士进行了基准测试，并使用两种新颖的迁移指标对其进行评估。我们的分析发现，模型对不同语言的特定文化事实的可访问性存在相当大的差异。值得注意的是，我们经常发现模型比各自文化的主导语言表现出更多的英语文化信息知识。这些模型在西方语言中表现出更好的性能，尽管这并不一定意味着西方国家的识字能力更强，这是违反直觉的。此外，我们观察到模型跨语言转移知识的能力非常有限，在开源模型中尤其明显。

Title: Do Methods to Jailbreak and Defend LLMs Generalize Across Languages?

Authors: Berk Atil, Rebecca J. Passonneau, Fred Morstatter
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.00689
Pdf URL: https://arxiv.org/pdf/2511.00689
Copy Paste: [[2511.00689]] Do Methods to Jailbreak and Defend LLMs Generalize Across Languages?(https://arxiv.org/abs/2511.00689)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) undergo safety alignment after training and tuning, yet recent work shows that safety can be bypassed through jailbreak attacks. While many jailbreaks and defenses exist, their cross-lingual generalization remains underexplored. This paper presents the first systematic multilingual evaluation of jailbreaks and defenses across ten languages--spanning high-, medium-, and low-resource languages--using six LLMs on HarmBench and AdvBench. We assess two jailbreak types: logical-expression-based and adversarial-prompt-based. For both types, attack success and defense robustness vary across languages: high-resource languages are safer under standard queries but more vulnerable to adversarial ones. Simple defenses can be effective, but are language- and model-dependent. These findings call for language-aware and cross-lingual safety benchmarks for LLMs.
摘要：大型语言模型（LLM）在训练和调整后进行安全调整，但最近的研究表明，可以通过越狱攻击绕过安全性。尽管存在许多越狱和防御方法，但它们的跨语言泛化仍未得到充分探索。本文使用 HarmBench 和 AdvBench 上的 6 个法学硕士，首次对十种语言（涵盖高资源语言、中资源语言和低资源语言）的越狱和防御进行了系统的多语言评估。我们评估两种越狱类型：基于逻辑表达式和基于对抗性提示。对于这两种类型，攻击成功率和防御稳健性因语言而异：高资源语言在标准查询下更安全，但更容易受到对抗性查询的影响。简单的防御可能是有效的，但依赖于语言和模型。这些发现需要为法学硕士制定语言感知和跨语言的安全基准。

Title: TriCon-Fair: Triplet Contrastive Learning for Mitigating Social Bias in Pre-trained Language Models

Authors: Chong Lyu, Lin Li, Shiqing Wu, Jingling Yuan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.00854
Pdf URL: https://arxiv.org/pdf/2511.00854
Copy Paste: [[2511.00854]] TriCon-Fair: Triplet Contrastive Learning for Mitigating Social Bias in Pre-trained Language Models(https://arxiv.org/abs/2511.00854)
Keywords: language model
Abstract: The increasing utilization of large language models raises significant concerns about the propagation of social biases, which may result in harmful and unfair outcomes. However, existing debiasing methods treat the biased and unbiased samples independently, thus ignoring their mutual relationship. This oversight enables a hidden negative-positive coupling, where improvements for one group inadvertently compromise the other, allowing residual social bias to persist. In this paper, we introduce TriCon-Fair, a contrastive learning framework that employs a decoupled loss that combines triplet and language modeling terms to eliminate positive-negative coupling. Our TriCon-Fair assigns each anchor an explicitly biased negative and an unbiased positive, decoupling the push-pull dynamics and avoiding positive-negative coupling, and jointly optimizes a language modeling (LM) objective to preserve general capability. Experimental results demonstrate that TriCon-Fair reduces discriminatory output beyond existing debiasing baselines while maintaining strong downstream performance. This suggests that our proposed TriCon-Fair offers a practical and ethical solution for sensitive NLP applications.
摘要：大型语言模型的使用日益增加引起了人们对社会偏见传播的严重担忧，这可能会导致有害和不公平的结果。然而，现有的去偏方法独立处理有偏样本和无偏样本，从而忽略了它们之间的相互关系。这种监督造成了一种隐藏的消极与积极的耦合，其中一个群体的改进无意中损害了另一个群体，从而导致残余的社会偏见持续存在。在本文中，我们介绍了 TriCon-Fair，这是一种对比学习框架，它采用解耦损失，结合三元组和语言建模项来消除正负耦合。我们的 TriCon-Fair 为每个锚点分配一个显式偏置的负值和一个无偏置的正值，解耦推拉动力学并避免正负耦合，并联合优化语言建模（LM）目标以保留一般能力。实验结果表明，TriCon-Fair 减少了超出现有去偏基线的歧视性输出，同时保持了强大的下游性能。这表明我们提出的 TriCon-Fair 为敏感的 NLP 应用提供了实用且符合道德的解决方案。
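A loss that "combines triplet and language modeling terms" can be sketched with the textbook triplet margin loss plus a weighted LM term. This is a generic formulation under stated assumptions, not TriCon-Fair's actual decoupled loss: the function names, the Euclidean distance, the `margin` value, and the `lm_weight` coefficient are all hypothetical choices for illustration.

```python
import math
from typing import Sequence


def triplet_margin_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,
) -> float:
    """Standard triplet loss: pull the anchor toward the (unbiased) positive
    and push it at least `margin` farther away from the (biased) negative."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def combined_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    lm_loss: float,
    lm_weight: float = 0.5,
    margin: float = 1.0,
) -> float:
    """Triplet term plus a weighted language-modeling term (weight is an assumption)."""
    return triplet_margin_loss(anchor, positive, negative, margin) + lm_weight * lm_loss


anchor = [0.0, 0.0]
near = [0.0, 1.0]  # distance 1 from the anchor
far = [3.0, 4.0]   # distance 5 from the anchor

# Positive is near and negative is far: the triplet term is already zero.
satisfied = combined_loss(anchor, positive=near, negative=far, lm_loss=2.0)
# Roles swapped: the triplet term is 5 - 1 + 1 = 5.
violated = combined_loss(anchor, positive=far, negative=near, lm_loss=2.0)
print(satisfied, violated)
```

The LM term is what keeps general capability in the objective while the triplet term reshapes the representation space.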

Title: Assessing LLM Reasoning Steps via Principal Knowledge Grounding

Authors: Hyeon Hwang, Yewon Cho, Chanwoong Yoon, Yein Park, Minju Song, Kyungjae Lee, Gangwoo Kim, Jaewoo Kang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.00879
Pdf URL: https://arxiv.org/pdf/2511.00879
Copy Paste: [[2511.00879]] Assessing LLM Reasoning Steps via Principal Knowledge Grounding(https://arxiv.org/abs/2511.00879)
Keywords: language model, llm
Abstract: Step-by-step reasoning has become a standard approach for large language models (LLMs) to tackle complex tasks. While this paradigm has proven effective, it raises a fundamental question: How can we verify that an LLM's reasoning is accurately grounded in knowledge? To address this question, we introduce a novel evaluation suite that systematically assesses the knowledge grounding of intermediate reasoning. Our framework comprises three key components. (1) Principal Knowledge Collection, a large-scale repository of atomic knowledge essential for reasoning. Based on the collection, we propose (2) knowledge-grounded evaluation metrics designed to measure how well models recall and apply prerequisite knowledge in reasoning. These metrics are computed by our (3) evaluator LLM, a lightweight model optimized for cost-effective and reliable metric computation. Our evaluation suite demonstrates remarkable effectiveness in identifying missing or misapplied knowledge elements, providing crucial insights for uncovering fundamental reasoning deficiencies in LLMs. Beyond evaluation, we demonstrate how these metrics can be integrated into preference optimization, showcasing further applications of knowledge-grounded evaluation.
摘要：逐步推理已成为大型语言模型 (LLM) 处理复杂任务的标准方法。虽然这种范式已被证明是有效的，但它提出了一个基本问题：我们如何验证法学硕士的推理是否准确地基于知识？为了解决这个问题，我们引入了一种新颖的评估套件，可以系统地评估中间推理的知识基础。我们的框架由三个关键组成部分组成。 (1)主体知识集合，推理所必需的原子知识的大规模存储库。基于该集合，我们提出（2）基于知识的评估指标，旨在衡量模型在推理中回忆和应用先决知识的程度。这些指标由我们的 (3) 评估器 LLM 计算，这是一个针对经济高效且可靠的指标计算进行优化的轻量级模型。我们的评估套件在识别缺失或误用的知识元素方面表现出显着的有效性，为发现法学硕士的基本推理缺陷提供了重要的见解。除了评估之外，我们还演示了如何将这些指标集成到偏好优化中，展示基于知识的评估的进一步应用。

Title: ColMate: Contrastive Late Interaction and Masked Text for Multimodal Document Retrieval

Authors: Ahmed Masry, Megh Thakkar, Patrice Bechard, Sathwik Tejaswi Madhusudhan, Rabiul Awal, Shambhavi Mishra, Akshay Kalkunte Suresh, Srivatsava Daruru, Enamul Hoque, Spandana Gella, Torsten Scholak, Sai Rajeswar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.00903
Pdf URL: https://arxiv.org/pdf/2511.00903
Copy Paste: [[2511.00903]] ColMate: Contrastive Late Interaction and Masked Text for Multimodal Document Retrieval(https://arxiv.org/abs/2511.00903)
Keywords: retrieval-augmented generation
Abstract: Retrieval-augmented generation has proven practical when models require specialized knowledge or access to the latest data. However, existing methods for multimodal document retrieval often replicate techniques developed for text-only retrieval, whether in how they encode documents, define training objectives, or compute similarity scores. To address these limitations, we present ColMate, a document retrieval model that bridges the gap between multimodal representation learning and document retrieval. ColMate utilizes a novel OCR-based pretraining objective, a self-supervised masked contrastive learning objective, and a late interaction scoring mechanism more relevant to multimodal document structures and visual characteristics. ColMate obtains a 3.61% improvement over existing retrieval models on the ViDoRe V2 benchmark, demonstrating stronger generalization to out-of-domain benchmarks.
摘要：当模型需要专业知识或访问最新数据时，检索增强生成已被证明是实用的。然而，现有的多模式文档检索方法通常复制为纯文本检索开发的技术，无论是如何编码文档、定义训练目标还是计算相似性分数。为了解决这些限制，我们提出了 ColMate，一种文档检索模型，它弥补了多模态表示学习和文档检索之间的差距。 ColMate 利用新颖的基于 OCR 的预训练目标、自监督蒙版对比学习目标以及与多模式文档结构和视觉特征更相关的后期交互评分机制。 ColMate 在 ViDoRe V2 基准上比现有检索模型获得了 3.61% 的改进，展示了对域外基准的更强泛化。
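For readers unfamiliar with late interaction scoring, the sketch below shows the common ColBERT-style MaxSim form: each query token embedding is matched to its most similar document token embedding, and the per-token maxima are summed. ColMate's own mechanism is described as adapted to multimodal document structure and may differ; the function names and toy vectors here are assumptions for illustration.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def late_interaction_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """MaxSim: sum over query tokens of the best-matching document token similarity."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy 2-D token embeddings (illustrative values only).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.5, 0.5]]  # best matches: 1.0 and 0.5
doc_b = [[0.0, 0.0], [0.2, 0.2]]  # best matches: 0.2 and 0.2

score_a = late_interaction_score(query, doc_a)
score_b = late_interaction_score(query, doc_b)
print(score_a, score_b)
```

Because the score is computed token by token rather than from one pooled vector, documents can be embedded offline and ranked by sorting on this value.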

Title: The Biased Oracle: Assessing LLMs' Understandability and Empathy in Medical Diagnoses

Authors: Jianzhou Yao, Shunchang Liu, Guillaume Drui, Rikard Pettersson, Alessandro Blasimme, Sara Kijewski
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.00924
Pdf URL: https://arxiv.org/pdf/2511.00924
Copy Paste: [[2511.00924]] The Biased Oracle: Assessing LLMs' Understandability and Empathy in Medical Diagnoses(https://arxiv.org/abs/2511.00924)
Keywords: language model, llm
Abstract: Large language models (LLMs) show promise for supporting clinicians in diagnostic communication by generating explanations and guidance for patients. Yet their ability to produce outputs that are both understandable and empathetic remains uncertain. We evaluate two leading LLMs on medical diagnostic scenarios, assessing understandability using readability metrics as a proxy and empathy through LLM-as-a-Judge ratings compared to human evaluations. The results indicate that LLMs adapt explanations to socio-demographic variables and patient conditions. However, they also generate overly complex content and display biased affective empathy, leading to uneven accessibility and support. These patterns underscore the need for systematic calibration to ensure equitable patient communication. The code and data are released: this https URL

Title: The Riddle of Reflection: Evaluating Reasoning and Self-Awareness in Multilingual LLMs using Indian Riddles

Authors: Abhinav P M, Ojasva Saxena, Oswald C, Parameswari Krishnamurthy
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.00960
Pdf URL: https://arxiv.org/pdf/2511.00960
Copy Paste: [[2511.00960]] The Riddle of Reflection: Evaluating Reasoning and Self-Awareness in Multilingual LLMs using Indian Riddles(https://arxiv.org/abs/2511.00960)
Keywords: language model, llm, prompt
Abstract: The extent to which large language models (LLMs) can perform culturally grounded reasoning across non-English languages remains underexplored. This paper examines the reasoning and self-assessment abilities of LLMs across seven major Indian languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Tamil, and Telugu. We introduce a multilingual riddle dataset combining traditional riddles with context-reconstructed variants and evaluate five LLMs (Gemini 2.5 Pro, Gemini 2.5 Flash, Mistral-Saba, LLaMA 4 Scout, and LLaMA 4 Maverick) under seven prompting strategies. In the first stage, we assess riddle-solving performance and find that while Gemini 2.5 Pro performs best overall, few-shot methods yield only marginal gains, and accuracy varies notably across languages. In the second stage, we conduct a self-evaluation experiment to measure reasoning consistency. The results reveal a key finding: a model's initial accuracy is inversely correlated with its ability to identify its own mistakes. Top-performing models such as Gemini 2.5 Pro are overconfident (4.34% True Negative Rate), whereas lower-performing models like LLaMA 4 Scout are substantially more self-aware (42.09% True Negative Rate). These results point to clear gaps in multilingual reasoning and highlight the need for models that not only reason effectively but also recognize their own limitations.

Title: Advancing Machine-Generated Text Detection from an Easy to Hard Supervision Perspective

Authors: Chenwang Wu, Yiu-ming Cheung, Bo Han, Defu Lian
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.00988
Pdf URL: https://arxiv.org/pdf/2511.00988
Copy Paste: [[2511.00988]] Advancing Machine-Generated Text Detection from an Easy to Hard Supervision Perspective(https://arxiv.org/abs/2511.00988)
Keywords: llm
Abstract: Existing machine-generated text (MGT) detection methods implicitly assume labels as the "golden standard". However, we reveal boundary ambiguity in MGT detection, implying that traditional training paradigms are inexact. Moreover, limitations of human cognition and the superintelligence of detectors make inexact learning widespread and inevitable. To this end, we propose an easy-to-hard enhancement framework to provide reliable supervision under such inexact conditions. Distinct from knowledge distillation, our framework employs an easy supervisor targeting relatively simple longer-text detection tasks (despite weaker capabilities), to enhance the more challenging target detector. Firstly, longer texts targeted by supervisors theoretically alleviate the impact of inexact labels, laying the foundation for reliable supervision. Secondly, by structurally incorporating the detector into the supervisor, we theoretically model the supervisor as a lower performance bound for the detector. Thus, optimizing the supervisor indirectly optimizes the detector, ultimately approximating the underlying "golden" labels. Extensive experiments across diverse practical scenarios, including cross-LLM, cross-domain, mixed text, and paraphrase attacks, demonstrate the framework's significant detection effectiveness. The code is available at: this https URL.

Title: MARS-SQL: A multi-agent reinforcement learning framework for Text-to-SQL

Authors: Haolin Yang, Jipeng Zhang, Zhitao He, Yi R. Fung
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.01008
Pdf URL: https://arxiv.org/pdf/2511.01008
Copy Paste: [[2511.01008]] MARS-SQL: A multi-agent reinforcement learning framework for Text-to-SQL(https://arxiv.org/abs/2511.01008)
Keywords: agent
Abstract: Translating natural language to SQL remains difficult for complex queries. Such queries often need environmental interaction and self-correction. To address this, we introduce MARS-SQL, a novel multi-agent framework that combines principled task decomposition and interactive reinforcement learning (RL). Our system comprises three specialized agents: a Grounding Agent for schema linking, a Generation Agent for query generation, and a Validation Agent for final selection. The core of our framework is the Generation Agent, which is trained via a multi-turn RL policy. Adopting a ReAct-style Think-Act-Observe loop, the agent iteratively generates thoughts, executes SQL actions against a live database, and revises its strategy based on execution feedback, enabling dynamic, stateful reasoning and self-correction. At inference time, we generate multiple interaction trajectories to explore diverse reasoning paths. The Validation Agent then selects the optimal trajectory by modeling verification as a next-token prediction task and choosing the solution with the highest generation probability. This structured workflow chains specialized agents, combining interactive RL for generation with generative modeling for verification. The approach proves highly effective for robust and accurate SQL generation. Experiments show that MARS-SQL achieves state-of-the-art Execution Accuracy of 77.84% on the BIRD dev set and 89.75% on the Spider test set. Our code is available at this https URL.
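The final selection step described in the abstract (sample several trajectories, then keep the one the validator rates most probable) can be sketched as a best-of-N choice. This is only an illustration under stated assumptions: the `Trajectory` class, the `accept_logprob` field, and the example queries are hypothetical, and the abstract does not say how MARS-SQL actually computes the generation probability.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    """One sampled interaction trajectory ending in a candidate SQL query."""
    sql: str
    # Hypothetical field: log-probability a validator model assigns to an
    # "accept" token when asked whether this candidate answers the question.
    accept_logprob: float


def select_trajectory(candidates: List[Trajectory]) -> Trajectory:
    """Pick the candidate with the highest validator log-probability."""
    if not candidates:
        raise ValueError("at least one candidate trajectory is required")
    return max(candidates, key=lambda t: t.accept_logprob)


if __name__ == "__main__":
    sampled = [
        Trajectory("SELECT name FROM users", accept_logprob=-2.3),
        Trajectory("SELECT name FROM users WHERE active = 1", accept_logprob=-0.1),
        Trajectory("SELECT * FROM users", accept_logprob=-1.2),
    ]
    print(select_trajectory(sampled).sql)
```

Comparing log-probabilities rather than raw probabilities gives the same ordering and avoids underflow when the scores are very small.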

Title: IF-CRITIC: Towards a Fine-Grained LLM Critic for Instruction-Following Evaluation

Authors: Bosi Wen, Yilin Niu, Cunxiang Wang, Pei Ke, Xiaoying Ling, Ying Zhang, Aohan Zeng, Hongning Wang, Minlie Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.01014
Pdf URL: https://arxiv.org/pdf/2511.01014
Copy Paste: [[2511.01014]] IF-CRITIC: Towards a Fine-Grained LLM Critic for Instruction-Following Evaluation(https://arxiv.org/abs/2511.01014)
Keywords: language model, llm
Abstract: Instruction following is a fundamental ability of Large Language Models (LLMs), requiring their generated outputs to follow multiple constraints imposed in input instructions. Numerous studies have attempted to enhance this ability through preference optimization or reinforcement learning based on reward signals from LLM-as-a-Judge. However, existing evaluation models for instruction following still possess many deficiencies, such as substantial costs and unreliable assessments. To this end, we propose IF-CRITIC, an LLM critic that can provide efficient and reliable assessments of constraint following in the instructions. We first develop a checklist generator to decompose instructions and generate constraint checklists. With the assistance of the checklists, we collect high-quality critique training data through a multi-stage critique filtering mechanism and employ a constraint-level preference optimization method to train IF-CRITIC. Extensive experiments demonstrate that the evaluation performance of IF-CRITIC can beat strong LLM-as-a-Judge baselines, including Deepseek-R1 and o4-mini. With the scalable reward signals provided by IF-CRITIC, LLMs can achieve substantial performance gains in instruction-following optimization under lower computational overhead compared to strong LLM critic baselines.

Title: Prompt-R1: Collaborative Automatic Prompting Framework via End-to-end Reinforcement Learning

Authors: Wenjin Liu, Haoran Luo, Xueyuan Lin, Haoming Liu, Tiesunlong Shen, Jiapu Wang, Rui Mao, Erik Cambria
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.01016
Pdf URL: https://arxiv.org/pdf/2511.01016
Copy Paste: [[2511.01016]] Prompt-R1: Collaborative Automatic Prompting Framework via End-to-end Reinforcement Learning(https://arxiv.org/abs/2511.01016)
Keywords: language model, llm, prompt
Abstract: Recently, advanced large language models (LLMs) have emerged at an increasingly rapid pace. However, when faced with complex problems, most users are often unable to provide accurate and effective prompts to interact with LLMs, thus limiting the performance of LLMs. To address this challenge, we propose Prompt-R1, an end-to-end reinforcement learning framework that uses a small-scale LLM to collaborate with large-scale LLMs, replacing user interaction to solve problems better. This collaboration is cast as a multi-turn prompt interaction, where the small-scale LLM thinks and generates prompts, and the large-scale LLM performs complex reasoning. A dual-constrained reward is designed to optimize for correctness, generation quality, and reasoning accuracy. Prompt-R1 provides a plug-and-play framework that supports both inference and training with various large-scale LLMs. Experiments on multiple public datasets show that Prompt-R1 significantly outperforms baseline models across tasks. Our code is publicly available at this https URL.

Title: OceanAI: A Conversational Platform for Accurate, Transparent, Near-Real-Time Oceanographic Insights

Authors: Bowen Chen, Jayesh Gajbhar, Gregory Dusek, Rob Redmon, Patrick Hogan, Paul Liu, DelWayne Bohnenstiehl, Dongkuan (DK) Xu, Ruoying He
Subjects: cs.CL, cs.AI, cs.CE, cs.LG, physics.ao-ph
Abstract URL: https://arxiv.org/abs/2511.01019
Pdf URL: https://arxiv.org/pdf/2511.01019
Copy Paste: [[2511.01019]] OceanAI: A Conversational Platform for Accurate, Transparent, Near-Real-Time Oceanographic Insights(https://arxiv.org/abs/2511.01019)
Keywords: language model, llm, hallucination, chat
Abstract: Artificial intelligence is transforming the sciences, yet general conversational AI systems often generate unverified "hallucinations" that undermine scientific rigor. We present OceanAI, a conversational platform that integrates the natural-language fluency of open-source large language models (LLMs) with real-time, parameterized access to authoritative oceanographic data streams hosted by the National Oceanic and Atmospheric Administration (NOAA). Each query, such as "What was Boston Harbor's highest water level in 2024?", triggers real-time API calls that identify, parse, and synthesize relevant datasets into reproducible natural-language responses and data visualizations. In a blind comparison with three widely used AI chat-interface products, only OceanAI produced NOAA-sourced values with original data references; the others either declined to answer or provided unsupported results. Designed for extensibility, OceanAI connects to multiple NOAA data products and variables, supporting applications in marine hazard forecasting, ecosystem assessment, and water-quality monitoring. By grounding outputs in verifiable observations, OceanAI advances transparency, reproducibility, and trust, offering a scalable framework for AI-enabled decision support within the oceans. A public demonstration is available at this https URL.

Title: VayuChat: An LLM-Powered Conversational Interface for Air Quality Data Analytics

Authors: Vedant Acharya, Abhay Pisharodi, Rishabh Mondal, Mohammad Rafiuddin, Nipun Batra
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.01046
Pdf URL: https://arxiv.org/pdf/2511.01046
Copy Paste: [[2511.01046]] VayuChat: An LLM-Powered Conversational Interface for Air Quality Data Analytics(https://arxiv.org/abs/2511.01046)
Keywords: language model, llm, chat
Abstract: Air pollution causes about 1.6 million premature deaths each year in India, yet decision makers struggle to turn dispersed data into decisions. Existing tools require expertise and provide static dashboards, leaving key policy questions unresolved. We present VayuChat, a conversational system that answers natural language questions on air quality, meteorology, and policy programs, and responds with both executable Python code and interactive visualizations. VayuChat integrates data from Central Pollution Control Board (CPCB) monitoring stations, state-level demographics, and National Clean Air Programme (NCAP) funding records into a unified interface powered by large language models. Our live demonstration will show how users can perform complex environmental analytics through simple conversations, making data science accessible to policymakers, researchers, and citizens. The platform is publicly deployed at this https URL. For further information, see the video uploaded at this https URL.

Title: Building a Silver-Standard Dataset from NICE Guidelines for Clinical LLMs

Authors: Qing Ding, Eric Hua Qing Zhang, Felix Jozsa, Julia Ive
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.01053
Pdf URL: https://arxiv.org/pdf/2511.01053
Copy Paste: [[2511.01053]] Building a Silver-Standard Dataset from NICE Guidelines for Clinical LLMs(https://arxiv.org/abs/2511.01053)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) are increasingly used in healthcare, yet standardised benchmarks for evaluating guideline-based clinical reasoning are missing. This study introduces a validated dataset derived from publicly available guidelines across multiple diagnoses. The dataset was created with the help of GPT and contains realistic patient scenarios, as well as clinical questions. We benchmark a range of recent popular LLMs to showcase the validity of our dataset. The framework supports systematic evaluation of LLMs' clinical utility and guideline adherence.

Title: HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models

Authors: Stephan Oepen, Nikolay Arefev, Mikko Aulamo, Marta Bañón, Maja Buljan, Laurie Burchell, Lucas Charpentier, Pinzhen Chen, Mariya Fedorova, Ona de Gibert, Barry Haddow, Jan Hajič, Jindrič Helcl, Andrey Kutuzov, Zihao Li, Risto Luukkonen, Bhavitvya Malik, Vladislav Mikhailov, Amanda Myntti, Dayyán O'Brien, Lucie Poláková, Sampo Pyysalo, Gema Ramírez Sánchez, Janine Siewert, Pavel Stepachev, Jörg Tiedemann, Teemu Vahtola, Fedor Vitiugin, Tea Vojtěchová, Jaume Zaragoza
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.01066
Pdf URL: https://arxiv.org/pdf/2511.01066
Copy Paste: [[2511.01066]] HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models(https://arxiv.org/abs/2511.01066)
Keywords: language model, gpt, llm, prompt
Abstract: We present an ongoing initiative to provide open, very large, high-quality, and richly annotated textual datasets for almost 200 languages. At 30 trillion tokens, this is likely the largest generally available multilingual collection of LLM pre-training data. These datasets are derived from web crawls from different sources and accompanied by a complete, open-source pipeline for document selection from web archives, text extraction from HTML, language identification for noisy texts, exact and near-deduplication, annotation with, among others, register labels, text quality estimates, and personally identifiable information; and final selection and filtering. We report on data quality probes through contrastive and analytical statistics, through manual inspection of samples for 24 languages, and through end-to-end evaluation of various language model architectures trained on this data. For multilingual LLM evaluation, we provide a comprehensive collection of benchmarks for nine European languages, with special emphasis on natively created tasks, mechanisms to mitigate prompt sensitivity, and refined normalization and aggregation of scores. Additionally, we train and evaluate a family of 57 monolingual encoder-decoder models, as well as a handful of monolingual GPT-like reference models. Besides the monolingual data and models, we also present a very large collection of parallel texts automatically mined from this data, together with a novel parallel corpus synthesized via machine translation.

Title: Improving Romanian LLM Pretraining Data using Diversity and Quality Filtering

Authors: Vlad Negoita, Mihai Masala, Traian Rebedea
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.01090
Pdf URL: https://arxiv.org/pdf/2511.01090
Copy Paste: [[2511.01090]] Improving Romanian LLM Pretraining Data using Diversity and Quality Filtering(https://arxiv.org/abs/2511.01090)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have recently exploded in popularity, often matching or outperforming human abilities on many tasks. One of the key factors in training LLMs is the availability and curation of high-quality data. Data quality is especially crucial for under-represented languages, where high-quality corpora are scarce. In this work we study the characteristics and coverage of Romanian pretraining corpora and we examine how they differ from English data. By training a lightweight multitask model on carefully LLM-annotated Romanian texts, we are able to analyze and perform multi-level filtering (e.g., educational value, topic, format) to generate high-quality pretraining datasets. Our experiments show noteworthy trends in the topics present in Romanian and English data, while also proving the effectiveness of filtering data through improved LLM pretraining performance across multiple benchmarks.

Title: TSVer: A Benchmark for Fact Verification Against Time-Series Evidence

Authors: Marek Strong, Andreas Vlachos
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.01101
Pdf URL: https://arxiv.org/pdf/2511.01101
Copy Paste: [[2511.01101]] TSVer: A Benchmark for Fact Verification Against Time-Series Evidence(https://arxiv.org/abs/2511.01101)
Keywords: llm
Abstract: Reasoning over temporal and numerical data, such as time series, is a crucial aspect of fact-checking. While many systems have recently been developed to handle this form of evidence, their evaluation remains limited by existing datasets, which often lack structured evidence, provide insufficient justifications for verdicts, or rely on synthetic claims. In this paper, we introduce TSVer, a new benchmark dataset for fact verification focusing on temporal and numerical reasoning with time-series evidence. TSVer contains 287 real-world claims sourced from 38 fact-checking organizations and a curated database of 400 time series covering diverse domains. Each claim is annotated with time frames across all pertinent time series, along with a verdict and justifications reflecting how the evidence is used to reach the verdict. Using an LLM-assisted multi-step annotation process, we improve the quality of our annotations and achieve an inter-annotator agreement of kappa=0.745 on verdicts. We also develop a baseline for verifying claims against time-series evidence and show that even the state-of-the-art reasoning models like Gemini-2.5-Pro are challenged by time series, achieving a 63.37 accuracy score on verdicts and an Ev2R score of 48.63 on verdict justifications.
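The abstract reports inter-annotator agreement as a kappa value. As a rough illustration of what such a number measures, the sketch below computes Cohen's kappa for two annotators. This is an assumption: the abstract does not state which kappa variant TSVer uses, and the function name and verdict labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable],
                 labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    all_labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in all_labels) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

A kappa of 0 means agreement no better than chance and 1 means perfect agreement, which is why it is usually preferred over raw percentage agreement when verdict classes are imbalanced.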

Title: MicroRemed: Benchmarking LLMs in Microservices Remediation

Authors: Lingzhe Zhang, Yunpeng Zhai, Tong Jia, Chiming Duan, Minghua He, Leyi Pan, Zhaoyang Liu, Bolin Ding, Ying Li
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2511.01166
Pdf URL: https://arxiv.org/pdf/2511.01166
Copy Paste: [[2511.01166]] MicroRemed: Benchmarking LLMs in Microservices Remediation(https://arxiv.org/abs/2511.01166)
Keywords: language model, llm, prompt, agent
Abstract: Large Language Models (LLMs) integrated with agent-based reasoning frameworks have recently shown strong potential for autonomous decision-making and system-level operations. One promising yet underexplored direction is microservice remediation, where the goal is to automatically recover faulty microservice systems. Existing approaches, however, still rely on human-crafted prompts from Site Reliability Engineers (SREs), with LLMs merely converting textual instructions into executable code. To advance research in this area, we introduce MicroRemed, the first benchmark for evaluating LLMs in end-to-end microservice remediation, where models must directly generate executable Ansible playbooks from diagnosis reports to restore system functionality. We further propose ThinkRemed, a multi-agent framework that emulates the reflective and perceptive reasoning of SREs. Experimental results show that MicroRemed presents substantial challenges to current LLMs, while ThinkRemed improves end-to-end remediation performance through iterative reasoning and system reflection. The benchmark is available at this https URL.

Title: Learning When to Quit in Sales Conversations

Authors: Emaad Manzoor, Eva Ascarza, Oded Netzer
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.01181
Pdf URL: https://arxiv.org/pdf/2511.01181
Copy Paste: [[2511.01181]] Learning When to Quit in Sales Conversations(https://arxiv.org/abs/2511.01181)
Keywords: language model, agent
Abstract: Salespeople frequently face the dynamic screening decision of whether to persist in a conversation or abandon it to pursue the next lead. Yet, little is known about how these decisions are made, whether they are efficient, or how to improve them. We study these decisions in the context of high-volume outbound sales where leads are ample, but time is scarce and failure is common. We formalize the dynamic screening decision as an optimal stopping problem and develop a generative language model-based sequential decision agent (a stopping agent) that learns whether and when to quit conversations by imitating a retrospectively-inferred optimal stopping policy. Our approach handles high-dimensional textual states, scales to large language models, and works with both open-source and proprietary language models. When applied to calls from a large European telecommunications firm, our stopping agent reduces the time spent on failed calls by 54% while preserving nearly all sales; reallocating the time saved increases expected sales by up to 37%. Upon examining the linguistic cues that drive salespeople's quitting decisions, we find that they tend to overweight a few salient expressions of consumer disinterest and mispredict call failure risk, suggesting cognitive bounds on their ability to make real-time conversational decisions. Our findings highlight the potential of artificial intelligence algorithms to correct cognitively-bounded human decisions and improve salesforce efficiency.

Title: Surfacing Subtle Stereotypes: A Multilingual, Debate-Oriented Evaluation of Modern LLMs

Authors: Muhammed Saeed, Muhammad Abdul-mageed, Shady Shehata
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2511.01187
Pdf URL: https://arxiv.org/pdf/2511.01187
Copy Paste: [[2511.01187]] Surfacing Subtle Stereotypes: A Multilingual, Debate-Oriented Evaluation of Modern LLMs(https://arxiv.org/abs/2511.01187)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) are widely deployed for open-ended communication, yet most bias evaluations still rely on English, classification-style tasks. We introduce DebateBias-8K, a new multilingual, debate-style benchmark designed to reveal how narrative bias appears in realistic generative settings. Our dataset includes 8,400 structured debate prompts spanning four sensitive domains: women's rights, socioeconomic development, terrorism, and religion, across seven languages ranging from high-resource (English, Chinese) to low-resource (Swahili, Nigerian Pidgin). Using four flagship models (GPT-4o, Claude 3, DeepSeek, and LLaMA 3), we generate and automatically classify over 100,000 responses. Results show that all models reproduce entrenched stereotypes despite safety alignment: Arabs are overwhelmingly linked to terrorism and religion (≥95%), Africans to socioeconomic "backwardness" (up to 77%), and Western groups are consistently framed as modern or progressive. Biases grow sharply in lower-resource languages, revealing that alignment trained primarily in English does not generalize globally. Our findings highlight a persistent divide in multilingual fairness: current alignment methods reduce explicit toxicity but fail to prevent biased outputs in open-ended contexts. We release our DebateBias-8K benchmark and analysis framework to support the next generation of multilingual bias evaluation and safer, culturally inclusive model alignment.

Title: ZoFia: Zero-Shot Fake News Detection with Entity-Guided Retrieval and Multi-LLM Interaction

Authors: Lvhua Wu, Xuefeng Jiang, Sheng Sun, Tian Wen, Yuwei Wang, Min Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.01188
Pdf URL: https://arxiv.org/pdf/2511.01188
Copy Paste: [[2511.01188]] ZoFia: Zero-Shot Fake News Detection with Entity-Guided Retrieval and Multi-LLM Interaction(https://arxiv.org/abs/2511.01188)
Keywords: language model, llm, hallucination, agent
Abstract: The rapid spread of fake news threatens social stability and public trust, making its detection an urgent research priority. Although large language models (LLMs) excel at numerous natural language processing tasks thanks to their remarkable contextual understanding and extensive prior knowledge, their time-bounded knowledge coverage and tendency to generate hallucinated content reduce their reliability when handling fast-evolving news streams. Furthermore, models trained on existing static datasets often lack the generalization needed for emerging news topics. To address these challenges, we propose ZoFia, a novel two-stage zero-shot fake news detection framework. First, we introduce Hierarchical Salience to quantify the importance of entities in the news content, and propose the SC-MMR algorithm to select an informative and diverse set of keywords that serve as queries for retrieving up-to-date external evidence. Subsequently, a multi-LLM interactive system, in which each agent assumes a distinct role, performs multi-view collaborative analysis and adversarial debate over the news text and its related information, and finally produces an interpretable and robust judgment. Comprehensive experiments on two public datasets demonstrate that ZoFia clearly outperforms existing zero-shot baselines and most few-shot methods. Our code will be open-sourced to support the research community.
摘要：假新闻的迅速传播威胁着社会稳定和公众信任，使其检测成为当务之急的研究重点。尽管大型语言模型（LLM）凭借出色的上下文理解和广泛的先验知识在众多自然语言处理任务中表现出色，但受时间限制的知识覆盖范围和生成幻觉内容的趋势降低了其在处理快速发展的新闻流时的可靠性。此外，在现有静态数据集上训练的模型通常也缺乏新兴新闻主题所需的泛化能力。为了应对这些挑战，我们提出了 ZoFia，一种新颖的两阶段零样本假新闻检测框架。首先，我们引入层次显着性来量化新闻内容中实体的重要性，并提出 SC-MMR 算法来有效地选择一组信息丰富且多样化的关键词作为检索最新外部证据的查询。随后，在一个多LLM交互系统中，每个智能体扮演不同的角色，对新闻文本及其相关信息进行多视角协作分析和对抗性辩论，最终产生可解释的、鲁棒的判断。对两个公共数据集的综合实验表明，ZoFia 明显优于现有的零样本基线和大多数少样本方法。我们的代码将开源，以方便相关社区。
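
The keyword-selection step described above trades informativeness against diversity. The abstract does not spell out SC-MMR, so the following is only a minimal sketch of classic Maximal Marginal Relevance (MMR), on which the name appears to build. The function names, the `salience` scores, the toy `vectors`, and the trade-off weight `lam` are all illustrative assumptions, not the authors' algorithm.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(candidates, salience, vectors, k, lam=0.7):
    """Greedy MMR: pick up to k keywords, rewarding salience and penalizing redundancy.

    salience: hypothetical per-keyword importance score (higher is better).
    vectors:  hypothetical per-keyword embedding used to measure redundancy.
    lam:      assumed trade-off weight; 1.0 ignores diversity entirely.
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(c):
            # Redundancy is the highest similarity to any keyword already chosen.
            redundancy = max(
                (cosine(vectors[c], vectors[s]) for s in selected), default=0.0
            )
            return lam * salience[c] - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: "b" is almost as salient as "a" but points in the same direction.
toy_salience = {"a": 1.0, "b": 0.95, "c": 0.6}
toy_vectors = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}
picked = mmr_select(["a", "b", "c"], toy_salience, toy_vectors, k=2, lam=0.5)
```

With `lam=0.5` the redundant keyword `"b"` is skipped in favour of the less salient but distinct `"c"`; with `lam=1.0` the selection falls back to pure salience ranking.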

Title: DEER: Disentangled Mixture of Experts with Instance-Adaptive Routing for Generalizable Machine-Generated Text Detection

Authors: Guoxin Ma, Xiaoming Liu, Zhanhan Zhang, Chengzhengxu Li, Shengchao Liu, Yu Lan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.01192
Pdf URL: https://arxiv.org/pdf/2511.01192
Copy Paste: [[2511.01192]] DEER: Disentangled Mixture of Experts with Instance-Adaptive Routing for Generalizable Machine-Generated Text Detection(https://arxiv.org/abs/2511.01192)
Keywords: language model, llm
Abstract: Detecting machine-generated text (MGT) has emerged as a critical challenge, driven by the rapid advancement of large language models (LLMs) capable of producing highly realistic, human-like content. However, the performance of current approaches often degrades significantly under domain shift. To address this challenge, we propose a novel framework designed to capture both domain-specific and domain-general MGT patterns through a two-stage Disentangled mixturE-of-ExpeRts (DEER) architecture. First, we introduce a disentangled mixture-of-experts module, in which domain-specific experts learn fine-grained, domain-local distinctions between human and machine-generated text, while shared experts extract transferable, cross-domain features. Second, to mitigate the practical limitation of unavailable domain labels during inference, we design a reinforcement learning-based routing mechanism that dynamically selects the appropriate experts for each input instance, effectively bridging the train-inference gap caused by domain uncertainty. Extensive experiments on five in-domain and five out-of-domain benchmark datasets demonstrate that DEER consistently outperforms state-of-the-art methods, achieving average F1-score improvements of 1.39% and 5.32% on in-domain and out-of-domain datasets respectively, along with accuracy gains of 1.35% and 3.61% respectively. Ablation studies confirm the critical contributions of both disentangled expert specialization and adaptive routing to model performance.
摘要：在能够生成高度逼真、类人内容的大型语言模型 (LLM) 快速发展的推动下，检测机器生成的文本 (MGT) 已成为一项重大挑战。然而，当前方法的性能在域转移下通常会显着下降。为了应对这一挑战，我们提出了一种新颖的框架，旨在通过两阶段解缠结混合专家（DEER）架构捕获特定领域和通用领域的 MGT 模式。首先，我们引入了一个解开的专家混合模块，其中特定领域的专家学习人类和机器生成的文本之间的细粒度、领域局部的区别，而共享专家则提取可转移的跨领域特征。其次，为了减轻推理过程中域标签不可用的实际限制，我们设计了一种基于强化学习的路由机制，为每个输入实例动态选择合适的专家，有效地弥合了域不确定性引起的训练-推理差距。对五个域内和五个域外基准数据集的广泛实验表明，DEER 始终优于最先进的方法，在域内和域外数据集上平均 F1 分数分别提高了 1.39% 和 5.32%，准确率分别提高了 1.35% 和 3.61%。消融研究证实了解开的专家专业化和自适应路由对模型性能的关键贡献。

Title: AraFinNews: Arabic Financial Summarisation with Domain-Adapted LLMs

Authors: Mo El-Haj, Paul Rayson
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.01265
Pdf URL: https://arxiv.org/pdf/2511.01265
Copy Paste: [[2511.01265]] AraFinNews: Arabic Financial Summarisation with Domain-Adapted LLMs(https://arxiv.org/abs/2511.01265)
Keywords: language model, llm
Abstract: This paper investigates the impact of domain specificity on abstractive summarisation of Arabic financial texts using large language models (LLMs). We introduce AraFinNews, the largest publicly available Arabic financial news dataset to date, comprising 212,500 article-headline pairs spanning nearly a decade of reporting from October 2015 to July 2025. Designed as the Arabic equivalent of major English summarisation corpora such as CNN/DailyMail, AraFinNews provides a robust benchmark for evaluating domain-specific language understanding and generation in financial contexts. Using this resource, we evaluate transformer-based models, including mT5, AraT5, and the domain-adapted FinAraT5, to examine how financial-domain pretraining influences factual accuracy, numerical reliability, and stylistic alignment with professional reporting. Experimental results show that domain-adapted models generate more faithful and coherent summaries, particularly in handling quantitative and entity-centric information. The findings highlight the importance of domain-specific adaptation for improving factual consistency and narrative fluency in Arabic financial summarisation. The dataset is freely available for non-commercial research at this https URL.
摘要：本文研究了领域特异性对使用大型语言模型 (LLM) 的阿拉伯金融文本抽象概括的影响。我们介绍 AraFinNews，这是迄今为止最大的公开阿拉伯金融新闻数据集，包含 212,500 个文章-标题对，涵盖从 2015 年 10 月到 2025 年 7 月近十年的报道。AraFinNews 被设计为主要英语摘要语料库（如 CNN/DailyMail）的阿拉伯语版本，为评估金融环境中特定领域语言的理解和生成提供了一个强大的基准。使用此资源，我们评估基于 Transformer 的模型（包括 mT5、AraT5 和适应领域的 FinAraT5），以研究金融领域预训练如何影响事实准确性、数值可靠性以及与专业报告的风格一致性。实验结果表明，领域适应模型可以生成更忠实、更连贯的摘要，特别是在处理定量和以实体为中心的信息时。研究结果强调了特定领域的适应对于提高阿拉伯语财务摘要中的事实一致性和叙述流畅性的重要性。该数据集可通过此 https URL 免费用于非商业研究。

Title: When, What, and How: Rethinking Retrieval-Enhanced Speculative Decoding

Authors: Min Fang, Zhihui Fu, Qibin Zhao, Jun Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.01282
Pdf URL: https://arxiv.org/pdf/2511.01282
Copy Paste: [[2511.01282]] When, What, and How: Rethinking Retrieval-Enhanced Speculative Decoding(https://arxiv.org/abs/2511.01282)
Keywords: language model, llm
Abstract: Speculative decoding (SD) has emerged as an effective technique to accelerate large language model (LLM) inference without compromising output quality. However, the achievable speedup largely depends on the effectiveness of the drafting model. While model-based methods like EAGLE-2 are accurate but costly, retrieval-enhanced methods like SAM-Decoding rely on heuristic switching strategies that often trigger unnecessary retrievals. To address this, we propose ReSpec (Retrieval-enhanced Speculative Decoding), a novel framework that transforms heuristic drafter switching into adaptive decision-making. ReSpec features three core innovations: 1) An entropy-guided adaptive trigger quantifies contextual predictability to initiate retrieval only when uncertainty is low, avoiding costly low-quality speculations. 2) A feedback-driven candidate selection leverages historical feedback to organize multiple high-quality candidates for parallel verification, maximizing retrieval utility. 3) A source-aware relaxed verification strategy applies strict checks to model-generated drafts while using a relaxed verification for retrieved drafts, achieving a better balance between accuracy and efficiency. Extensive experiments on Spec-Bench demonstrate that ReSpec achieves state-of-the-art acceleration, outperforming EAGLE-2 and SAM-Decoding by over 33% and 25%, respectively, while maintaining output quality.
摘要：推测性解码 (SD) 已成为一种在不影响输出质量的情况下加速大型语言模型 (LLM) 推理的有效技术。然而，可实现的加速很大程度上取决于绘图模型的有效性。虽然 EAGLE-2 等基于模型的方法准确但成本高昂，但 SAM-Decoding 等检索增强方法依赖于启发式切换策略，而这些策略通常会触发不必要的检索。为了解决这个问题，我们提出了 ReSpec（\textbf{Re}trieval-enhanced \textbf{Spe}culative Decoding），这是一种新颖的框架，可将启发式起草者切换转变为自适应决策。 ReSpec 具有三个核心创新：1） \textbf{熵引导的自适应触发器} 量化上下文可预测性，仅在不确定性较低时启动检索，避免代价高昂的低质量推测。 2）\textbf{反馈驱动的候选选择}利用历史反馈来组织多个高质量候选进行并行验证，从而最大化检索效用。 3）源感知的\textbf{宽松验证策略}对模型生成的草稿进行严格检查，同时对检索到的草稿使用宽松的验证，从而在准确性和效率之间取得更好的平衡。 Spec-Bench 上的大量实验表明，ReSpec 实现了最先进的加速，在保持输出质量的同时，性能分别比 EAGLE-2 和 SAM-Decoding 高出超过 $33\%$ 和 $25\%$。
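
The entropy-guided trigger in item 1 can be pictured as a threshold on the Shannon entropy of the next-token distribution. This is a rough sketch under that assumption. The function names and the threshold value are hypothetical, and the abstract does not say how ReSpec actually computes or calibrates its trigger.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution over next tokens."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def should_retrieve(probs, threshold=1.0):
    """Trigger retrieval-based drafting only when the context looks predictable.

    threshold is an assumed tuning knob; low entropy means low uncertainty.
    """
    return entropy(probs) < threshold


peaked = [0.97, 0.01, 0.01, 0.01]   # model is nearly certain of the next token
uniform = [0.25, 0.25, 0.25, 0.25]  # model is maximally uncertain over 4 tokens
decisions = (should_retrieve(peaked), should_retrieve(uniform))
```

Here the peaked distribution falls below the threshold and would trigger retrieval, while the uniform one (entropy ln 4) would not.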

Title: "Give a Positive Review Only": An Early Investigation Into In-Paper Prompt Injection Attacks and Defenses for AI Reviewers

Authors: Qin Zhou, Zhexin Zhang, Zhi Li, Limin Sun
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2511.01287
Pdf URL: https://arxiv.org/pdf/2511.01287
Copy Paste: [[2511.01287]] "Give a Positive Review Only": An Early Investigation Into In-Paper Prompt Injection Attacks and Defenses for AI Reviewers(https://arxiv.org/abs/2511.01287)
Keywords: prompt
Abstract: With the rapid advancement of AI models, their deployment across diverse tasks has become increasingly widespread. A notable emerging application is leveraging AI models to assist in reviewing scientific papers. However, recent reports have revealed that some papers contain hidden, injected prompts designed to manipulate AI reviewers into providing overly favorable evaluations. In this work, we present an early systematic investigation into this emerging threat. We propose two classes of attacks: (1) static attack, which employs a fixed injection prompt, and (2) iterative attack, which optimizes the injection prompt against a simulated reviewer model to maximize its effectiveness. Both attacks achieve striking performance, frequently inducing full evaluation scores when targeting frontier AI reviewers. Furthermore, we show that these attacks are robust across various settings. To counter this threat, we explore a simple detection-based defense. While it substantially reduces the attack success rate, we demonstrate that an adaptive attacker can partially circumvent this defense. Our findings underscore the need for greater attention and rigorous safeguards against prompt-injection threats in AI-assisted peer review.
摘要：随着人工智能模型的快速发展，它们在不同任务中的部署变得越来越广泛。一个值得注意的新兴应用是利用人工智能模型来协助审阅科学论文。然而，最近的报告显示，一些论文包含隐藏的、注入的提示，旨在操纵人工智能审稿人提供过于有利的评估。在这项工作中，我们对这一新兴威胁进行了早期系统调查。我们提出两类攻击：（1）静态攻击，采用固定的注入提示；（2）迭代攻击，针对模拟审阅者模型优化注入提示，以最大限度地提高其有效性。这两种攻击都取得了惊人的性能，当针对前沿人工智能评审者时，经常会产生完整的评估分数。此外，我们表明这些攻击在各种设置下都是稳健的。为了应对这种威胁，我们探索了一种简单的基于检测的防御。虽然它大大降低了攻击的成功率，但我们证明自适应攻击者可以部分规避这种防御。我们的研究结果强调，在人工智能辅助同行评审中，需要更多关注和严格防范即时注入威胁。

Title: FirstAidQA: A Synthetic Dataset for First Aid and Emergency Response in Low-Connectivity Settings

Authors: Saiyma Sittul Muna, Rezwan Islam Salvi, Mushfiqur Rahman Mushfique, Ajwad Abrar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.01289
Pdf URL: https://arxiv.org/pdf/2511.01289
Copy Paste: [[2511.01289]] FirstAidQA: A Synthetic Dataset for First Aid and Emergency Response in Low-Connectivity Settings(https://arxiv.org/abs/2511.01289)
Keywords: language model, gpt, llm, prompt, chat
Abstract: In emergency situations, every second counts. The deployment of Large Language Models (LLMs) in time-sensitive, low- or zero-connectivity environments remains limited. Current models are computationally intensive and unsuitable for the low-tier devices often used by first responders or civilians. A major barrier to developing lightweight, domain-specific solutions is the lack of high-quality datasets tailored to first aid and emergency response. To address this gap, we introduce FirstAidQA, a synthetic dataset containing 5,500 high-quality question-answer pairs that cover a wide range of first aid and emergency response scenarios. The dataset was generated using a Large Language Model, ChatGPT-4o-mini, with prompt-based in-context learning, using texts from the Vital First Aid Book (2019). We applied preprocessing steps such as text cleaning, contextual chunking, and filtering, followed by human validation to ensure accuracy, safety, and practical relevance of the QA pairs. FirstAidQA is designed to support instruction-tuning and fine-tuning of LLMs and Small Language Models (SLMs), enabling faster, more reliable, and offline-capable systems for emergency settings. We publicly release the dataset to advance research on safety-critical and resource-constrained AI applications in first aid and emergency response. The dataset is available on Hugging Face at this https URL.
摘要：在紧急情况下，每一秒都很重要。在时间敏感、低连接或零连接的环境中部署大型语言模型 (LLM) 仍然受到限制。当前的模型计算量大，不适合急救人员或平民经常使用的低层设备。开发轻量级、特定领域解决方案的一个主要障碍是缺乏针对急救和紧急响应量身定制的高质量数据集。为了弥补这一差距，我们引入了 FirstAidQA，这是一个包含 5,500 个高质量问答对的综合数据集，涵盖了广泛的急救和紧急响应场景。该数据集是使用大型语言模型 ChatGPT-4o-mini 生成的，并使用《重要急救书》（2019 年）中的文本进行基于提示的上下文学习。我们应用了文本清理、上下文分块和过滤等预处理步骤，然后进行人工验证，以确保 QA 对的准确性、安全性和实际相关性。 FirstAidQA 旨在支持 LLM 和小语言模型 (SLM) 的指令调整和微调，从而为紧急情况设置提供更快、更可靠且具有离线功能的系统。我们公开发布该数据集，以推进对急救和应急响应中安全关键和资源有限的人工智能应用的研究。该数据集可在 Hugging Face 上找到，网址为 https URL。

Title: DeepSpecs: Expert-Level Questions Answering in 5G

Authors: Aman Ganapathy Manvattira, Yifei Xu, Ziyue Dang, Songwu Lu
Subjects: cs.CL, cs.AI, cs.NI
Abstract URL: https://arxiv.org/abs/2511.01305
Pdf URL: https://arxiv.org/pdf/2511.01305
Copy Paste: [[2511.01305]] DeepSpecs: Expert-Level Questions Answering in 5G(https://arxiv.org/abs/2511.01305)
Keywords: llm, retrieval-augmented generation
Abstract: 5G technology enables mobile Internet access for billions of users. Answering expert-level questions about 5G specifications requires navigating thousands of pages of cross-referenced standards that evolve across releases. Existing retrieval-augmented generation (RAG) frameworks, including telecom-specific approaches, rely on semantic similarity and cannot reliably resolve cross-references or reason about specification evolution. We present DeepSpecs, a RAG system enhanced by structural and temporal reasoning via three metadata-rich databases: SpecDB (clause-aligned specification text), ChangeDB (line-level version diffs), and TDocDB (standardization meeting documents). DeepSpecs explicitly resolves cross-references by recursively retrieving referenced clauses through metadata lookup, and traces specification evolution by mining changes and linking them to Change Requests that document design rationale. We curate two 5G QA datasets: 573 expert-annotated real-world questions from practitioner forums and educational resources, and 350 evolution-focused questions derived from approved Change Requests. Across multiple LLM backends, DeepSpecs outperforms base models and state-of-the-art telecom RAG systems; ablations confirm that explicit cross-reference resolution and evolution-aware retrieval substantially improve answer quality, underscoring the value of modeling the structural and temporal properties of 5G standards.
摘要：5G技术使数十亿用户移动互联网接入。回答有关 5G 规范的专家级问题需要浏览数千页跨版本演变的交叉引用标准。现有的检索增强生成（RAG）框架，包括电信特定的方法，依赖于语义相似性，并且无法可靠地解决交叉引用或有关规范演变的推理。我们提出了 DeepSpecs，这是一个通过三个元数据丰富的数据库进行结构和时间推理增强的 RAG 系统：SpecDB（子句对齐规范文本）、ChangeDB（行级版本差异）和 TDocDB（标准化会议文档）。 DeepSpecs 通过元数据查找递归检索引用的子句来明确解决交叉引用，并通过挖掘变更并将其链接到记录设计原理的变更请求来跟踪规范的演变。我们整理了两个 5G QA 数据集：来自从业者论坛和教育资源的 573 个专家注释的现实世界问题，以及来自批准的变更请求的 350 个以演进为中心的问题。在多个 LLM 后端中，DeepSpecs 的性能优于基础模型和最先进的电信 RAG 系统；消融证实，显式交叉引用解析和进化感知检索可显着提高答案质量，强调对 5G 标准的结构和时间属性进行建模的价值。
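
The cross-reference resolution described above (following references from a clause to the clauses it cites via metadata lookup) can be sketched as a bounded graph traversal. The dictionary layout, the `refs` field, the clause identifiers, and the depth limit below are illustrative assumptions rather than the DeepSpecs schema. The sketch uses an iterative breadth-first walk in place of literal recursion so that the depth limit always applies to the shortest reference chain.

```python
from collections import deque


def resolve_clauses(start, spec_db, max_depth=3):
    """Collect the text of `start` and every clause it references, up to max_depth hops.

    spec_db is a hypothetical mapping: clause id -> {"text": str, "refs": [clause ids]}.
    Unknown clause ids and reference cycles are skipped silently.
    """
    resolved = {}
    queue = deque([(start, 0)])
    while queue:
        clause_id, depth = queue.popleft()
        if clause_id in resolved or clause_id not in spec_db:
            continue
        resolved[clause_id] = spec_db[clause_id]["text"]
        if depth < max_depth:
            for ref in spec_db[clause_id].get("refs", []):
                queue.append((ref, depth + 1))
    return resolved


# Toy database with a cycle (A <-> B) and a dangling reference (Z).
toy_db = {
    "A": {"text": "Clause A, see B.", "refs": ["B"]},
    "B": {"text": "Clause B, see A, C and Z.", "refs": ["A", "C", "Z"]},
    "C": {"text": "Clause C.", "refs": []},
}
context = resolve_clauses("A", toy_db)
```

The resulting `context` holds clauses A, B, and C exactly once each, which is the kind of expanded evidence set a generator could then be prompted with.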

Title: DEEPAMBIGQA: Ambiguous Multi-hop Questions for Benchmarking LLM Answer Completeness

Authors: Jiabao Ji, Min Li, Priyanshu Kumar, Shiyu Chang, Saloni Potdar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.01323
Pdf URL: https://arxiv.org/pdf/2511.01323
Copy Paste: [[2511.01323]] DEEPAMBIGQA: Ambiguous Multi-hop Questions for Benchmarking LLM Answer Completeness(https://arxiv.org/abs/2511.01323)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) with integrated search tools show strong promise in open-domain question answering (QA), yet they often struggle to produce complete answer sets for complex questions such as "Which actor from the film Heat won at least one Academy Award?", which requires (1) distinguishing between multiple films sharing the same title and (2) reasoning across a large set of actors to gather and integrate evidence. Existing QA benchmarks rarely evaluate both challenges jointly. To address this, we introduce DeepAmbigQAGen, an automatic data generation pipeline that constructs QA tasks grounded in text corpora and a linked knowledge graph, generating natural and verifiable questions that systematically embed name ambiguity and multi-step reasoning. Based on this, we build DeepAmbigQA, a dataset of 3,600 questions requiring multi-hop reasoning, half of which also require explicit name disambiguation. Experiments reveal that even the state-of-the-art GPT-5 gives incomplete answers, achieving only 0.13 exact match on ambiguous questions and 0.21 on non-ambiguous questions. These findings highlight the need for more robust QA systems aimed at information gathering and answer completeness.
摘要：具有集成搜索工具的大型语言模型 (LLM) 在开放域问答 (QA) 方面显示出强大的前景，但它们常常难以为复杂问题提供完整的答案集，例如电影《热火》中的哪位演员赢得了至少一项奥斯卡奖？，这需要 (1) 区分多部具有相同标题的电影，以及 (2) 对大量演员进行推理以收集和整合证据。现有的质量保证基准很少联合评估这两个挑战。为了解决这个问题，我们引入了 DeepAmbigQAGen，这是一种自动数据生成管道，它构建基于文本语料库和链接知识图的 QA 任务，生成自然且可验证的问题，系统地嵌入名称歧义和多步骤推理。在此基础上，我们构建了 DeepAmbigQA，这是一个包含 3,600 个需要多跳推理的问题的数据集，其中一半需要显式名称歧义解决。实验表明，即使是最先进的 GPT-5 也显示出不完整的答案，在模糊问题上仅实现 0.13 的精确匹配，在非模糊问题上仅实现 0.21 的精确匹配。这些发现强调需要更强大的 QA 系统来收集信息并保证答案的完整性。
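
Because the benchmark scores answer completeness, an exact-match metric here presumably compares whole answer sets rather than single strings. The sketch below shows one plausible reading. The normalization rule and function names are assumptions, and the paper's actual scoring procedure may differ.

```python
def normalize(answer):
    """Lowercase and collapse whitespace so trivially different strings compare equal."""
    return " ".join(answer.lower().split())


def answer_set_exact_match(predicted, gold):
    """Return True only if the predicted answers cover the gold set exactly."""
    return {normalize(a) for a in predicted} == {normalize(a) for a in gold}


def answer_set_recall(predicted, gold):
    """Fraction of gold answers that were recovered (1.0 means nothing is missing)."""
    gold_set = {normalize(a) for a in gold}
    if not gold_set:
        return 0.0
    return len(gold_set & {normalize(a) for a in predicted}) / len(gold_set)


# Placeholder answers; a prediction missing one gold item fails exact match.
gold_answers = ["Person One", "Person Two"]
partial = ["person one"]
complete = ["Person  Two", "person one"]
scores = (
    answer_set_exact_match(partial, gold_answers),
    answer_set_exact_match(complete, gold_answers),
    answer_set_recall(partial, gold_answers),
)
```

Under this reading a single missing answer drops exact match to zero even when recall is 0.5, which is consistent with the low exact-match figures reported for incomplete answers.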

Title: PrefixNLI: Detecting Factual Inconsistencies as Soon as They Arise

Authors: Sapir Harary, Eran Hirsch, Aviv Slobodkin, David Wan, Mohit Bansal, Ido Dagan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.01359
Pdf URL: https://arxiv.org/pdf/2511.01359
Copy Paste: [[2511.01359]] PrefixNLI: Detecting Factual Inconsistencies as Soon as They Arise(https://arxiv.org/abs/2511.01359)
Keywords: llm
Abstract: Natural Language Inference (NLI) models have been used in various ways to improve the factuality of LLM outputs. This is typically done by applying an NLI model to judge whether the model output is entailed by the supposed evidence, triggering corrective actions such as beam reranking at inference time or RL rewards during training. While NLI models are trained to detect factual inconsistencies over complete sentences, decisions in the common autoregressive generation architecture are made for each evolving text prefix during decoding. Addressing this setting, we generalize the entailment detection task to apply over arbitrary text prefixes, and suggest its utility for improving generation faithfulness. Providing suitable evaluation and training datasets for this task, we train MiniTruePrefixes, a novel specialized model that better detects factual inconsistencies over text prefixes, outperforming comparable baseline NLI models by 5-14 F1 points in prefix-level entailment. We further demonstrate that integrating MiniTruePrefixes into a controlled decoding framework substantially improves factual consistency in abstractive summarization. When guided by MiniTruePrefixes, LLaMA-3.2-3B-Instruct matches the faithfulness and runtime of the 8B model from the same model family, while using only half the memory.
摘要：自然语言推理（NLI）模型已以多种方式使用来提高法学硕士输出的真实性。这通常是通过应用 NLI 模型来判断模型输出是否来自假定的证据，从而触发一些纠正措施来完成，例如推理时的波束重新排序或训练期间的 RL 奖励。虽然 NLI 模型经过训练可以检测完整句子中的事实不一致，但在解码过程中，通用自回归生成架构中的决策是针对每个不断演变的文本前缀做出的。针对这种情况，我们将蕴含检测任务推广到应用于任意文本前缀，并提出了其在提高生成忠实度方面的实用性。为此任务提供合适的评估和训练数据集，我们训练 MiniTruePrefixes，这是一种新颖的专用模型，可以更好地检测文本前缀上的事实不一致，在前缀级蕴涵中比可比较的基线 NLI 模型高出 5-14 个 F1 点。我们进一步证明，将 MiniTruePrefixes 集成到受控解码框架中可以显着提高抽象摘要中的事实一致性。在 MiniTruePrefixes 的指导下，LLaMA-3.2-3B-Instruct 与同一模型系列中的 8B 模型的忠实度和运行时间相匹配，同时仅使用一半的内存。

Title: Safer in Translation? Presupposition Robustness in Indic Languages

Authors: Aadi Palnitkar, Arjun Suresh, Rishi Rajesh, Puneet Puli
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.01360
Pdf URL: https://arxiv.org/pdf/2511.01360
Copy Paste: [[2511.01360]] Safer in Translation? Presupposition Robustness in Indic Languages(https://arxiv.org/abs/2511.01360)
Keywords: language model, llm
Abstract: A growing number of people are turning to large language models (LLMs) for healthcare advice and consultation, making it important to gauge the efficacy and accuracy of LLM responses to such queries. While existing medical benchmark literature seeks to accomplish this very task, these benchmarks are almost universally in English, which has led to a notable gap in the literature on multilingual LLM evaluation. In this work, we seek to help address this gap with Cancer-Myth-Indic, an Indic language benchmark built by translating a 500-item subset of Cancer-Myth, sampled evenly across its original categories, into five under-served but widely used languages from the subcontinent (500 per language; 2,500 translated items total). Native-speaker translators followed a style guide for preserving implicit presuppositions in translation; items feature false presuppositions relating to cancer. We evaluate several popular LLMs under this presupposition stress.
摘要：越来越多的人开始转向大型语言模型 (LLM) 来获取医疗保健建议和咨询，因此衡量 LLM 对此类查询的响应的有效性和准确性变得非常重要。虽然已有的医学基准文献试图完成这一任务，但这些基准几乎普遍采用英语，这导致与多语言法学硕士评估相关的现有文献存在显着差距。在这项工作中，我们寻求通过 Cancer-Myth-Indic 来帮助解决这一差距，这是一种印度语言基准，通过将 Cancer-Myth 的 500 个项目子集（在其原始类别中均匀采样）翻译成来自次大陆的五种服务不足但广泛使用的语言（每种语言 500 个；总共 2,500 个翻译项目）来建立。母语译者遵循风格指南来保留翻译中隐含的预设；项目具有与癌症相关的错误预设。我们在这种预设压力下评估了几个受欢迎的法学硕士。

Title: The Ouroboros of Benchmarking: Reasoning Evaluation in an Era of Saturation

Authors: İbrahim Ethem Deveci, Duygu Ataman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.01365
Pdf URL: https://arxiv.org/pdf/2511.01365
Copy Paste: [[2511.01365]] The Ouroboros of Benchmarking: Reasoning Evaluation in an Era of Saturation(https://arxiv.org/abs/2511.01365)
Keywords: language model, llm
Abstract: The rapid rise of Large Language Models (LLMs) and Large Reasoning Models (LRMs) has been accompanied by an equally rapid increase in the benchmarks used to assess them. However, due both to improved model competence resulting from scaling and novel training advances and to the likely inclusion of many of these datasets in pre- or post-training data, results become saturated, driving a continuous need for new and more challenging replacements. In this paper, we discuss whether surpassing a benchmark truly demonstrates reasoning ability, or whether we are simply tracking numbers divorced from the capabilities we claim to measure. We present an investigation focused on three model families, OpenAI, Anthropic, and Google, and how their reasoning capabilities across different benchmarks evolve over the years. We also analyze performance trends over the years across different reasoning tasks and discuss the current situation of benchmarking and remaining challenges. By offering a comprehensive overview of benchmarks and reasoning tasks, our work aims to serve as a first reference to ground future research in reasoning evaluation and model development.
摘要：大型语言模型 (LLM) 和大型推理模型 (LRM) 的快速崛起伴随着用于评估它们的基准的同样快速增长。然而，由于扩展和新颖的训练进步所带来的模型能力的提高，以及许多这些数据集可能包含在训练前或训练后数据中，结果变得饱和，从而不断需要新的和更具挑战性的替代品。在本文中，我们讨论超越基准是否真正证明了推理能力，还是我们只是跟踪与我们声称要衡量的能力相脱离的数字？我们提出了一项针对三个模型系列（OpenAI、Anthropic 和 Google）的调查，以及它们在不同基准上的推理能力多年来如何发展。我们还分析了多年来不同推理任务的性能趋势，并讨论了基准测试的现状和剩余的挑战。通过提供基准和推理任务的全面概述，我们的工作旨在为推理评估和模型开发的未来研究提供第一个参考。

Title: Confounding Factors in Relating Model Performance to Morphology

Authors: Wessel Poelman, Thomas Bauwens, Miryam de Lhoneux
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.01380
Pdf URL: https://arxiv.org/pdf/2511.01380
Copy Paste: [[2511.01380]] Confounding Factors in Relating Model Performance to Morphology(https://arxiv.org/abs/2511.01380)
Keywords: language model
Abstract: The extent to which individual language characteristics influence tokenization and language modeling is an open question. Differences in morphological systems have been suggested as both unimportant and crucial to consider (Cotterell et al., 2018; Gerz et al., 2018a; Park et al., 2021, inter alia). We argue this conflicting evidence is due to confounding factors in experimental setups, making it hard to compare results and draw conclusions. We identify confounding factors in analyses trying to answer the question of whether, and how, morphology relates to language modeling. Next, we re-assess three hypotheses by Arnett & Bergen (2025) for why modeling agglutinative languages results in higher perplexities than fusional languages: they look at morphological alignment of tokenization, tokenization efficiency, and dataset size. We show that each conclusion includes confounding factors. Finally, we introduce token bigram metrics as an intrinsic way to predict the difficulty of causal language modeling, and find that they are gradient proxies for morphological complexity that do not require expert annotation. Ultimately, we outline necessities to reliably answer whether, and how, morphology relates to language modeling.
摘要：个体语言特征在多大程度上影响标记化和语言建模是一个悬而未决的问题。形态系统的差异被认为既不重要又至关重要（Cotterell 等人，2018 年；Gerz 等人，2018a；Park 等人，2021 年等）。我们认为这种相互矛盾的证据是由于实验设置中的混杂因素造成的，使得很难比较结果并得出结论。我们在分析中确定了混杂因素，试图回答词法是否以及如何与语言建模相关的问题。接下来，我们重新评估 Arnett 和 Bergen (2025) 的三个假设，解释为什么对粘着语言进行建模会比融合语言产生更高的复杂性：它们着眼于标记化的形态对齐、标记化效率和数据集大小。我们表明每个结论都包含混杂因素。最后，我们引入令牌二元度量作为预测因果语言建模难度的内在方法，并发现它们是不需要专家注释的形态复杂性的梯度代理。最终，我们概述了可靠地回答形态是否与语言建模相关以及如何相关的必要性。

Title: RAGSmith: A Framework for Finding the Optimal Composition of Retrieval-Augmented Generation Methods Across Datasets

Authors: Muhammed Yusuf Kartal (1), Suha Kagan Kose (2), Korhan Sevinç (1), Burak Aktas (2) ((1) TOBB University of Economics and Technology, (2) Roketsan Inc.)
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2511.01386
Pdf URL: https://arxiv.org/pdf/2511.01386
Copy Paste: [[2511.01386]] RAGSmith: A Framework for Finding the Optimal Composition of Retrieval-Augmented Generation Methods Across Datasets(https://arxiv.org/abs/2511.01386)
Keywords: llm, prompt, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) quality depends on many interacting choices across retrieval, ranking, augmentation, prompting, and generation, so optimizing modules in isolation is brittle. We introduce RAGSmith, a modular framework that treats RAG design as an end-to-end architecture search over nine technique families and 46,080 feasible pipeline configurations. A genetic search optimizes a scalar objective that jointly aggregates retrieval metrics (recall@k, mAP, nDCG, MRR) and generation metrics (LLM-Judge and semantic similarity). We evaluate on six Wikipedia-derived domains (Mathematics, Law, Finance, Medicine, Defense Industry, Computer Science), each with 100 questions spanning factual, interpretation, and long-answer types. RAGSmith finds configurations that consistently outperform the naive RAG baseline by +3.8% on average (range +1.2% to +6.9% across domains), with gains up to +12.5% in retrieval and +7.5% in generation. The search typically explores ≈0.2% of the space (~100 candidates) and discovers a robust backbone (vector retrieval plus post-generation reflection/revision) augmented by domain-dependent choices in expansion, reranking, augmentation, and prompt reordering; passage compression is never selected. Improvement magnitude correlates with question type, with larger gains on factual/long-answer mixes than interpretation-heavy sets. These results provide practical, domain-aware guidance for assembling effective RAG systems and demonstrate the utility of evolutionary search for full-pipeline optimization.
摘要：检索增强生成 (RAG) 质量取决于检索、排名、增强、提示和生成中的许多交互选择，因此单独优化模块是脆弱的。我们引入了 RAGsmith，这是一个模块化框架，它将 RAG 设计视为对九个技术系列和 46{,}080 种可行管道配置的端到端架构搜索。遗传搜索优化标量目标，该标量目标联合聚合检索指标（recall@k、mAP、nDCG、MRR）和生成指标（LLM-Judge 和语义相似性）。我们对 6 个源自维基百科的领域（数学、法律、金融、医学、国防工业、计算机科学）进行评估，每个领域有 100 个涵盖事实、解释和长答案类型的问题。 RAGsmith 发现配置始终优于朴素 RAG 基线，平均性能提高了 +3.8\%（跨域范围为 +1.2\% 到 +6.9\%），检索方面的增益高达 +12.5\%，生成方面的增益高达 +7.5\%。搜索通常会探索 $\approx\%$ 的空间（$\sim 100$ 候选者），并发现一个强大的主干——向量检索加上生成后的反射/修订——通过扩展、重新排序、增强和提示重新排序中的域相关选择来增强；从不选择通道压缩。改进幅度与问题类型相关，事实/长答案混合的收益比解释较多的集合更大。这些结果为组装有效的 RAG 系统提供了实用的、领域感知的指导，并证明了进化搜索在全流程优化中的实用性。
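
Two of the retrieval metrics named above, recall@k and the reciprocal rank that MRR averages over queries, are simple enough to sketch directly, together with a sanity check of the reported search budget (0.2% of 46,080 configurations is roughly 92, in line with the "~100 candidates" figure). The function names and toy document ids are illustrative; how RAGSmith weights these metrics inside its scalar objective is not stated in the abstract.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of relevant documents that appear in the top-k of a ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked, relevant) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Toy ranking: the first relevant document shows up at rank 2.
ranking = ["d3", "d1", "d7"]
relevant_docs = {"d1", "d2"}
r_at_2 = recall_at_k(ranking, relevant_docs, k=2)
rr = reciprocal_rank(ranking, relevant_docs)

# Search-budget check: 0.2% of the 46,080 feasible configurations.
explored = 0.002 * 46080
```

In this toy case one of the two relevant documents is found in the top two, so recall@2 and the reciprocal rank are both 0.5.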

Title: LiveSearchBench: An Automatically Constructed Benchmark for Retrieval and Reasoning over Dynamic Knowledge

Authors: Heng Zhou, Ao Yu, Yuchen Fan, Jianing Shi, Li Kang, Hejia Geng, Yongting Zhang, Yutao Fan, Yuhao Wu, Tiancheng He, Yiran Qin, Lei Bai, Zhenfei Yin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.01409
Pdf URL: https://arxiv.org/pdf/2511.01409
Copy Paste: [[2511.01409]] LiveSearchBench: An Automatically Constructed Benchmark for Retrieval and Reasoning over Dynamic Knowledge(https://arxiv.org/abs/2511.01409)
Keywords: language model, llm
Abstract: Evaluating large language models (LLMs) on question answering often relies on static benchmarks that reward memorization and understate the role of retrieval, failing to capture the dynamic nature of world knowledge. We present LiveSearchBench, an automated pipeline for constructing retrieval-dependent benchmarks from recent knowledge updates. Our method computes deltas between successive Wikidata snapshots, filters candidate triples for quality, and synthesizes natural-language questions at three levels of reasoning difficulty, each guaranteed to admit a unique, verifiable answer through SPARQL validation. The pipeline is fully automated, scalable across time, and minimizes human intervention, enabling continual regeneration of temporally grounded benchmarks. Experiments show a pronounced performance drop when models confront facts that post-date pretraining, with the gap most salient on multi-hop queries. Retrieval-augmented methods and larger, instruction-tuned models provide partial gains but fail to close this recency gap. By design, LiveSearchBench shifts evaluation from static memorization toward tasks that require up-to-date retrieval and reasoning, offering a foundation for systematic, long-term assessment of LLMs under evolving knowledge.
摘要：评估问答方面的大型语言模型 (LLM) 通常依赖于静态基准，这些基准奖励记忆并低估检索的作用，无法捕捉世界知识的动态本质。我们推出了 LiveSearchBench，这是一个自动化管道，用于根据最近的知识更新构建依赖于检索的基准。我们的方法计算连续维基数据快照之间的增量，过滤候选三元组的质量，并综合三个推理难度级别的自然语言问题，每个问题都保证通过 SPARQL 验证承认一个独特的、可验证的答案。该管道是完全自动化的，可随时间扩展，并最大限度地减少人为干预，从而能够持续更新暂时的基准。实验表明，当模型面对预训练后的事实时，性能会显着下降，其中在多跳查询上差距最为明显。检索增强方法和更大的指令调整模型提供了部分收益，但无法弥补这种新近度差距。通过设计，LiveSearchBench 将评估从静态记忆转向需要最新检索和推理的任务，为不断发展的知识下对法学硕士进行系统、长期评估奠定了基础。
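
The first pipeline step, computing deltas between successive knowledge-graph snapshots, amounts to a set difference over (subject, predicate, object) triples. This is a minimal sketch under that assumption. The identifiers are placeholders rather than real Wikidata ids, and the quality filtering and SPARQL validation steps are not modeled.

```python
def snapshot_delta(old_triples, new_triples):
    """Return triples added to and removed from a snapshot, as two sets."""
    old_set, new_set = set(old_triples), set(new_triples)
    return {"added": new_set - old_set, "removed": old_set - new_set}


def changed_facts(delta):
    """Pair up removed and added triples sharing subject and predicate (value updates)."""
    removed_by_key = {(s, p): o for s, p, o in delta["removed"]}
    return {
        (s, p): (removed_by_key[(s, p)], o)
        for s, p, o in delta["added"]
        if (s, p) in removed_by_key
    }


# Placeholder snapshots: one fact changes value, one fact is newly added.
old_snapshot = [
    ("city_x", "head_of_government", "person_a"),
    ("city_x", "country", "country_y"),
]
new_snapshot = [
    ("city_x", "head_of_government", "person_b"),
    ("city_x", "country", "country_y"),
    ("person_b", "occupation", "politician"),
]
delta = snapshot_delta(old_snapshot, new_snapshot)
updates = changed_facts(delta)
```

Triples in `delta["added"]` are the candidates from which post-cutoff questions could be synthesized; `updates` isolates the cases where an existing fact changed value. The pairing assumes at most one object per subject and predicate, which holds in this toy data but not for multi-valued properties.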

Title: "Don't Teach Minerva": Guiding LLMs Through Complex Syntax for Faithful Latin Translation with RAG

Authors: Sergio Torres Aguilar
Subjects: cs.CL, cs.DL
Abstract URL: https://arxiv.org/abs/2511.01454
Pdf URL: https://arxiv.org/pdf/2511.01454
Copy Paste: [[2511.01454]] "Don't Teach Minerva": Guiding LLMs Through Complex Syntax for Faithful Latin Translation with RAG(https://arxiv.org/abs/2511.01454)
Keywords: language model, gpt, llm
Abstract: Translating a morphology-rich, low-resource language like Latin poses significant challenges. This paper introduces a reproducible draft-based refinement pipeline that elevates open-source Large Language Models (LLMs) to a performance level statistically comparable to top-tier proprietary systems. Our method first uses a fine-tuned NLLB-1.3B model to generate a high-quality, structurally faithful draft. A zero-shot LLM (Llama-3.3 or Qwen3) then polishes this draft, a process that can be further enhanced by augmenting the context with retrieved out-context examples (RAG). We demonstrate the robustness of this approach on two distinct benchmarks: a standard in-domain test set (Rosenthal, 2023) and a new, challenging out-of-domain (OOD) set of 12th-century Latin letters (2025). Our central finding is that this open-source RAG system achieves performance statistically comparable to the GPT-5 baseline, without any task-specific LLM fine-tuning. We release the pipeline, the Chartres OOD set, and evaluation scripts and models to facilitate replicability and further research.

Title: BARD: budget-aware reasoning distillation

Authors: Lujie Niu, Lei Shen, Yi Jiang, Caixia Yuan, Xiaojie Wang, Wenbo Su, Bo Zheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.01470
Pdf URL: https://arxiv.org/pdf/2511.01470
Copy Paste: [[2511.01470]] BARD: budget-aware reasoning distillation(https://arxiv.org/abs/2511.01470)
Keywords: language model, chain-of-thought
Abstract: While long Chain-of-Thought (CoT) distillation effectively transfers reasoning capability to smaller language models, the reasoning process often remains redundant and the computational budget uncontrollable, leading to inefficient resource usage. To address this limitation, we propose Budget-Aware Reasoning Distillation (BARD), a novel framework that simultaneously distills reasoning capability and enables fine-grained control over the reasoning length. BARD uses the thinking budget as a user-specified control signal, allowing the model to dynamically balance reasoning performance and computational efficiency. To achieve this, BARD introduces a two-phase training regimen. The first phase is Supervised Fine-Tuning (SFT) on teacher-generated long CoT data compressed to various budget levels, which bootstraps the model's understanding of budget constraints. The second phase applies Reinforcement Learning (RL) with a reward signal that accounts for reasoning performance and budget fidelity simultaneously. The two-phase regimen is crucial to avoiding policy degradation and ensuring that both objectives are optimized jointly. Extensive experiments demonstrate that our method empowers an 8B student model to achieve strong performance on challenging reasoning benchmarks (AIME24, AIME25, GPQA) while providing precise and adaptive control over its reasoning length across a wide range of budgets.
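Note: the BARD abstract says the RL reward accounts for both reasoning performance and budget fidelity, but it does not give the formula. The sketch below shows one plausible shape for such a reward. The function `budget_aware_reward`, the linear fidelity term, and the `fidelity_weight` parameter are all assumptions made for illustration, not the authors' definition.

```python
def budget_aware_reward(
    is_correct: bool,
    used_tokens: int,
    budget_tokens: int,
    fidelity_weight: float = 0.5,  # hypothetical trade-off coefficient
) -> float:
    """Combine answer correctness with closeness to the requested budget."""
    if budget_tokens <= 0:
        raise ValueError("budget_tokens must be positive")
    # Relative deviation from the requested reasoning length.
    deviation = abs(used_tokens - budget_tokens) / budget_tokens
    # Fidelity is 1.0 on budget and falls linearly to 0.0 (assumed shape).
    fidelity = max(0.0, 1.0 - deviation)
    return float(is_correct) + fidelity_weight * fidelity
```

With these assumed settings, a correct answer that lands exactly on budget scores 1.5, while an incorrect answer that uses three times the budget scores 0.0.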

Title: Towards Consistent Detection of Cognitive Distortions: LLM-Based Annotation and Dataset-Agnostic Evaluation

Authors: Neha Sharma, Navneet Agarwal, Kairit Sirts
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.01482
Pdf URL: https://arxiv.org/pdf/2511.01482
Copy Paste: [[2511.01482]] Towards Consistent Detection of Cognitive Distortions: LLM-Based Annotation and Dataset-Agnostic Evaluation(https://arxiv.org/abs/2511.01482)
Keywords: language model, gpt, llm
Abstract: Text-based automated Cognitive Distortion detection is a challenging task due to its subjective nature, with low agreement scores observed even among expert human annotators, leading to unreliable annotations. We explore the use of Large Language Models (LLMs) as consistent and reliable annotators, and propose that multiple independent LLM runs can reveal stable labeling patterns despite the inherent subjectivity of the task. Furthermore, to fairly compare models trained on datasets with different characteristics, we introduce a dataset-agnostic evaluation framework using Cohen's kappa as an effect size measure. This methodology allows for fair cross-dataset and cross-study comparisons where traditional metrics like F1 score fall short. Our results show that GPT-4 can produce consistent annotations (Fleiss's Kappa = 0.78), resulting in improved test set performance for models trained on these annotations compared to those trained on human-labeled data. Our findings suggest that LLMs can offer a scalable and internally consistent alternative for generating training data that supports strong downstream performance in subjective NLP tasks.
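Note: Cohen's kappa, which the abstract above uses as a dataset-agnostic effect size, corrects observed agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The pure-Python sketch below computes it for two label sequences. The function name and the toy labels are illustrative and are not taken from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two equally long label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement p_o: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement p_e from each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


gold = ["distorted", "distorted", "neutral", "neutral"]
pred = ["distorted", "neutral", "neutral", "neutral"]
kappa = cohens_kappa(gold, pred)  # p_o = 0.75, p_e = 0.5, so kappa = 0.5
```

Because kappa discounts chance agreement, it is less sensitive than raw accuracy to how skewed the label distribution of a dataset is, which is the property that makes it usable for cross-dataset comparison.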

Title: Synthetic Eggs in Many Baskets: The Impact of Synthetic Data Diversity on LLM Fine-Tuning

Authors: Max Schaffelder, Albert Gatt
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.01490
Pdf URL: https://arxiv.org/pdf/2511.01490
Copy Paste: [[2511.01490]] Synthetic Eggs in Many Baskets: The Impact of Synthetic Data Diversity on LLM Fine-Tuning(https://arxiv.org/abs/2511.01490)
Keywords: language model, llm
Abstract: As synthetic data becomes widely used in language model development, understanding its impact on model behavior is crucial. This paper investigates the impact of the diversity of sources of synthetic data on fine-tuned large language models. We focus on three key dimensions: distribution collapse, adversarial robustness, and self-preference bias. Our findings reveal that fine-tuning models on synthetic data from diverse sources can mitigate distribution collapse, preserving the breadth of the output distribution and the diversity of the output text. Furthermore, while both human and synthetic fine-tuning data can remove safeguards, the latter preserves higher output quality, thus making outputs potentially more usable and dangerous. Finally, fine-tuning reduces self-preference bias, with human data being the most effective, followed by multi-source synthetic data.

Title: BanglaNirTox: A Large-scale Parallel Corpus for Explainable AI in Bengali Text Detoxification

Authors: Ayesha Afroza Mohsin, Mashrur Ahsan, Nafisa Maliyat, Shanta Maria, Syed Rifat Raiyan, Hasan Mahmud, Md Kamrul Hasan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.01512
Pdf URL: https://arxiv.org/pdf/2511.01512
Copy Paste: [[2511.01512]] BanglaNirTox: A Large-scale Parallel Corpus for Explainable AI in Bengali Text Detoxification(https://arxiv.org/abs/2511.01512)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Toxic language in Bengali remains prevalent, especially in online environments, with few effective precautions against it. Although text detoxification has seen progress in high-resource languages, Bengali remains underexplored due to limited resources. In this paper, we propose a novel pipeline for Bengali text detoxification that combines Pareto class-optimized large language models (LLMs) and Chain-of-Thought (CoT) prompting to generate detoxified sentences. To support this effort, we construct BanglaNirTox, an artificially generated parallel corpus of 68,041 toxic Bengali sentences with class-wise toxicity labels, reasonings, and detoxified paraphrases, using Pareto-optimized LLMs evaluated on random samples. The resulting BanglaNirTox dataset is used to fine-tune language models to produce better detoxified versions of Bengali sentences. Our findings show that Pareto-optimized LLMs with CoT prompting significantly enhance the quality and consistency of Bengali text detoxification.

Title: Difficulty-Controllable Cloze Question Distractor Generation

Authors: Seokhoon Kang, Yejin Jeon, Seonjeong Hwang, Gary Geunbae Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.01526
Pdf URL: https://arxiv.org/pdf/2511.01526
Copy Paste: [[2511.01526]] Difficulty-Controllable Cloze Question Distractor Generation(https://arxiv.org/abs/2511.01526)
Keywords: gpt
Abstract: Multiple-choice cloze questions are commonly used to assess linguistic proficiency and comprehension. However, generating high-quality distractors remains challenging, as existing methods often lack adaptability and control over difficulty levels, and the absence of difficulty-annotated datasets further hinders progress. To address these issues, we propose a novel framework for generating distractors with controllable difficulty by leveraging both data augmentation and a multitask learning strategy. First, to create a high-quality, difficulty-annotated dataset, we introduce a two-way distractor generation process in order to produce diverse and plausible distractors. These candidates are subsequently refined through filtering and then categorized by difficulty using an ensemble QA system. Second, this newly created dataset is leveraged to train a difficulty-controllable generation model via multitask learning. The framework includes carefully designed auxiliary tasks that enhance the model's semantic understanding of distractors and its ability to estimate their difficulty. Experimental results demonstrate that our method generates high-quality distractors across difficulty levels and substantially outperforms GPT-4o in aligning distractor difficulty with human perception.

Title: Math anxiety and associative knowledge structure are entwined in psychology students but not in Large Language Models like GPT-3.5 and GPT-4o

Authors: Luciana Ciringione, Emma Franchino, Simone Reigl, Isaia D'Onofrio, Anna Serbati, Oleksandra Poquet, Florence Gabriel, Massimo Stella
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2511.01558
Pdf URL: https://arxiv.org/pdf/2511.01558
Copy Paste: [[2511.01558]] Math anxiety and associative knowledge structure are entwined in psychology students but not in Large Language Models like GPT-3.5 and GPT-4o(https://arxiv.org/abs/2511.01558)
Keywords: language model, gpt
Abstract: Math anxiety poses significant challenges for university psychology students, affecting their career choices and overall well-being. This study employs a framework based on behavioural forma mentis networks (i.e. cognitive models that map how individuals structure their associative knowledge and emotional perceptions of concepts) to explore individual and group differences in the perception and association of concepts related to math and anxiety. We conducted 4 experiments involving psychology undergraduates from 2 samples (n1 = 70, n2 = 57) compared against GPT-simulated students (GPT-3.5: n3 = 300; GPT-4o: n4 = 300). Experiments 1, 2, and 3 employ individual-level network features to predict psychometric scores for math anxiety and its facets (observational, social and evaluational) from the Math Anxiety Scale. Experiment 4 focuses on group-level perceptions extracted from human students, GPT-3.5 and GPT-4o's networks. Results indicate that, in students, positive valence ratings and higher network degree for "anxiety", together with negative ratings for "math", can predict higher total and evaluative math anxiety. In contrast, these models do not work on GPT-based data because of differences in simulated networks and psychometric scores compared to humans. These results were also reconciled with differences found in the ways that high/low subgroups of simulated and real students framed STEM concepts semantically and emotionally. High math-anxiety students collectively framed "anxiety" in an emotionally polarising way, absent in the negative perception of low math-anxiety students. "Science" was rated positively, but contrasted against the negative perception of "math". These findings underscore the importance of understanding concept perception and associations in managing students' math anxiety.

Title: ECO Decoding: Entropy-Based Control for Controllability and Fluency in Controllable Dialogue Generation

Authors: Seungmin Shin, Dooyoung Kim, Youngjoong Ko
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.01568
Pdf URL: https://arxiv.org/pdf/2511.01568
Copy Paste: [[2511.01568]] ECO Decoding: Entropy-Based Control for Controllability and Fluency in Controllable Dialogue Generation(https://arxiv.org/abs/2511.01568)
Keywords: language model, chat
Abstract: Controllable Dialogue Generation (CDG) enables chatbots to generate responses with desired attributes, and weighted decoding methods have achieved significant success in the CDG task. However, using a fixed constant value to manage the bias of attribute probabilities makes it challenging to find an ideal control strength that satisfies both controllability and fluency. To address this issue, we propose ECO decoding (Entropy-based COntrol), which dynamically adjusts the control strength at each generation step according to the model's entropy in both the language model and attribute classifier probability distributions. Experiments on the DailyDialog and MultiWOZ datasets demonstrate that ECO decoding consistently improves controllability while maintaining fluency and grammaticality, outperforming prior decoding methods across various models and settings. Furthermore, ECO decoding alleviates probability interpolation issues in multi-attribute generation and consequently demonstrates strong performance in both single and multi-attribute scenarios.
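Note: weighted decoding reweights the language model's next-token distribution by an attribute classifier's score raised to a control strength. The ECO abstract says this strength is adjusted at each step from the entropies of both distributions, but it does not state the rule. The sketch below is therefore only an illustration under a simplifying assumption: the strength is scaled by the normalized entropy of the language-model distribution alone, and the classifier-entropy term is omitted. All function and parameter names are hypothetical.

```python
import math
from typing import Sequence


def normalized_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy divided by its maximum, log(len(probs))."""
    if len(probs) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))


def weighted_decode_step(
    lm_probs: Sequence[float],    # next-token distribution from the LM
    attr_probs: Sequence[float],  # classifier P(desired attribute | token)
    base_strength: float = 2.0,   # hypothetical maximum control strength
) -> tuple[list[float], float]:
    """Return the reweighted distribution and the strength that was used."""
    # Assumed rule: apply more control when the LM is uncertain, since many
    # candidate tokens are then comparably fluent.
    strength = base_strength * normalized_entropy(lm_probs)
    scores = [p * (a ** strength) for p, a in zip(lm_probs, attr_probs)]
    total = sum(scores)
    if total == 0:
        return list(lm_probs), strength  # fall back to the unmodified LM
    return [s / total for s in scores], strength
```

A fixed-strength baseline corresponds to replacing the computed `strength` with a constant, which is the setting the abstract identifies as hard to tune for both controllability and fluency.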

Title: BIRD: Bronze Inscription Restoration and Dating

Authors: Wenjie Hua, Hoang H. Nguyen, Gangyan Ge
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.01589
Pdf URL: https://arxiv.org/pdf/2511.01589
Copy Paste: [[2511.01589]] BIRD: Bronze Inscription Restoration and Dating(https://arxiv.org/abs/2511.01589)
Keywords: language model
Abstract: Bronze inscriptions from early China are fragmentary and difficult to date. We introduce BIRD (Bronze Inscription Restoration and Dating), a fully encoded dataset grounded in standard scholarly transcriptions and chronological labels. We further propose an allograph-aware masked language modeling framework that integrates domain- and task-adaptive pretraining with a Glyph Net (GN), which links graphemes and allographs. Experiments show that GN improves restoration, while glyph-biased sampling yields gains in dating.

Title: Imperfect Language, Artificial Intelligence, and the Human Mind: An Interdisciplinary Approach to Linguistic Errors in Native Spanish Speakers

Authors: Francisco Portillo López
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.01615
Pdf URL: https://arxiv.org/pdf/2511.01615
Copy Paste: [[2511.01615]] Imperfect Language, Artificial Intelligence, and the Human Mind: An Interdisciplinary Approach to Linguistic Errors in Native Spanish Speakers(https://arxiv.org/abs/2511.01615)
Keywords: language model, gpt, llm
Abstract: Linguistic errors are not merely deviations from normative grammar; they offer a unique window into the cognitive architecture of language and expose the current limitations of artificial systems that seek to replicate them. This project proposes an interdisciplinary study of linguistic errors produced by native Spanish speakers, with the aim of analyzing how current large language models (LLMs) interpret, reproduce, or correct them. The research integrates three core perspectives: theoretical linguistics, to classify and understand the nature of the errors; neurolinguistics, to contextualize them within real-time language processing in the brain; and natural language processing (NLP), to evaluate how such systems interpret linguistic errors. A purpose-built corpus of more than 500 authentic errors produced by native Spanish speakers will serve as the foundation for empirical analysis. These errors will be tested against AI models such as GPT or Gemini to assess their interpretative accuracy and their ability to generalize patterns of human linguistic behavior. The project contributes not only to the understanding of Spanish as a native language but also to the development of NLP systems that are more cognitively informed and capable of engaging with the imperfect, variable, and often ambiguous nature of real human language.

Title: A Graph-based RAG for Energy Efficiency Question Answering

Authors: Riccardo Campi, Nicolò Oreste Pinciroli Vago, Mathyas Giudici, Pablo Barrachina Rodriguez-Guisado, Marco Brambilla, Piero Fraternali
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2511.01643
Pdf URL: https://arxiv.org/pdf/2511.01643
Copy Paste: [[2511.01643]] A Graph-based RAG for Energy Efficiency Question Answering(https://arxiv.org/abs/2511.01643)
Keywords: language model, llm, retrieval augmented generation
Abstract: In this work, we investigate the use of Large Language Models (LLMs) within a graph-based Retrieval Augmented Generation (RAG) architecture for Energy Efficiency (EE) Question Answering. First, the system automatically extracts a Knowledge Graph (KG) from guidance and regulatory documents in the energy field. Then, the generated graph is navigated and reasoned upon to provide users with accurate answers in multiple languages. We implement a human-based validation using the RAGAs framework properties, a validation dataset comprising 101 question-answer pairs, and domain experts. Results confirm the potential of this architecture and identify its strengths and weaknesses. Validation results show that the system answers correctly in about three out of four cases (75.2 ± 2.7%), with higher results on questions related to more general EE answers (up to 81.0 ± 4.1%), and that it has promising multilingual abilities (4.4% accuracy loss due to translation).

Title: Evaluating Cultural Knowledge Processing in Large Language Models: A Cognitive Benchmarking Framework Integrating Retrieval-Augmented Generation

Authors: Hung-Shin Lee, Chen-Chi Chang, Ching-Yuan Chen, Yun-Hsiang Hsu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.01649
Pdf URL: https://arxiv.org/pdf/2511.01649
Copy Paste: [[2511.01649]] Evaluating Cultural Knowledge Processing in Large Language Models: A Cognitive Benchmarking Framework Integrating Retrieval-Augmented Generation(https://arxiv.org/abs/2511.01649)
Keywords: language model, llm, retrieval-augmented generation
Abstract: This study proposes a cognitive benchmarking framework to evaluate how large language models (LLMs) process and apply culturally specific knowledge. The framework integrates Bloom's Taxonomy with Retrieval-Augmented Generation (RAG) to assess model performance across six hierarchical cognitive domains: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Using a curated Taiwanese Hakka digital cultural archive as the primary testbed, the evaluation measures LLM-generated responses' semantic accuracy and cultural relevance.

Title: EngChain: A Symbolic Benchmark for Verifiable Multi-Step Reasoning in Engineering

Authors: Ayesha Gull, Muhammad Usman Safder, Rania Elbadry, Preslav Nakov, Zhuohan Xie
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.01650
Pdf URL: https://arxiv.org/pdf/2511.01650
Copy Paste: [[2511.01650]] EngChain: A Symbolic Benchmark for Verifiable Multi-Step Reasoning in Engineering(https://arxiv.org/abs/2511.01650)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are increasingly being applied to specialized, high-stakes domains like engineering, which demands rigorous evaluation of their complex reasoning capabilities. While current benchmarks assess language understanding, factual recall, mathematics or code generation, none capture the integrative reasoning central to engineering where scientific principles, quantitative modeling and practical constraints must converge. To address this gap, we introduce EngChain, a benchmark for verifiable multi-step engineering problem-solving. EngChain contains 90 problems spanning three engineering branches, organized into 9 domains and 20 distinct areas. The problems are generated from symbolic templates with a high degree of randomization to ensure diversity and eliminate the risk of contamination. With this benchmark, we move beyond final answer accuracy with a two-stage evaluation: we first quantitatively verify the numerical and semantic validity of each reasoning step and then introduce LLM-As-A-Judge, an automated system to qualitatively categorize the identified reasoning errors.

Title: SeaLLMs-Audio: Large Audio-Language Models for Southeast Asia

Authors: Chaoqun Liu, Mahani Aljunied, Guizhen Chen, Hou Pong Chan, Weiwen Xu, Yu Rong, Wenxuan Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.01670
Pdf URL: https://arxiv.org/pdf/2511.01670
Copy Paste: [[2511.01670]] SeaLLMs-Audio: Large Audio-Language Models for Southeast Asia(https://arxiv.org/abs/2511.01670)
Keywords: language model, llm
Abstract: We introduce SeaLLMs-Audio, the first large audio-language model (LALM) tailored for multiple Southeast Asian (SEA) languages, namely Indonesian (id), Thai (th), and Vietnamese (vi), alongside English (en) and Chinese (zh). Trained on a large-scale audio corpus, SeaLLMs-Audio exhibits strong performance across diverse audio-centric tasks, spanning fine-grained audio understanding and voice-based interaction. Its key features include: 1) Multilingual: the model primarily supports 5 languages, namely Indonesian, Thai, Vietnamese, English, and Chinese; 2) Multimodal: the model accepts flexible input modalities, including audio only, text only, as well as audio with text; 3) Multi-task: the model supports a wide range of tasks, including audio analysis tasks such as Audio Captioning, Automatic Speech Recognition, Speech-to-Text Translation, Speech Emotion Recognition, Speech Question Answering, and Speech Summarization. It also enables voice-based dialogue, including answering factual, mathematical, and general knowledge queries. As a significant step towards advancing audio LLMs in Southeast Asia, we expect SeaLLMs-Audio to benefit both the regional research community and industry. To automate LALM evaluation for Southeast Asia, we introduce SeaBench-Audio, a benchmark spanning multiple tasks. Experiments show that SeaLLMs-Audio achieves competitive performance compared with other LALMs on SEA languages.

Title: Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI

Authors: Sharan Maiya, Henning Bartsch, Nathan Lambert, Evan Hubinger
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.01689
Pdf URL: https://arxiv.org/pdf/2511.01689
Copy Paste: [[2511.01689]] Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI(https://arxiv.org/abs/2511.01689)
Keywords: language model, prompt, chat
Abstract: The character of the "AI assistant" persona generated by modern chatbot large language models influences both surface-level behavior and apparent values, beliefs, and ethics. These all affect interaction quality, perceived intelligence, and alignment with both developer and user intentions. The shaping of this persona, known as character training, is a critical component of industry post-training, yet remains effectively unstudied in the academic literature. We introduce the first open implementation of character training, leveraging Constitutional AI and a new data pipeline using synthetic introspective data to shape the assistant persona in a more effective and controlled manner than alternatives such as constraining system prompts or activation steering. Specifically, we fine-tune three popular open-weights models using 11 example personas, such as humorous, deeply caring, or even malevolent. To track the effects of our approach, we introduce a method which analyzes revealed preferences, uncovering clear and holistic changes in character. We find these changes are more robust to adversarial prompting than the above two alternatives, while also leading to more coherent and realistic generations. Finally, we demonstrate this fine-tuning has little to no effect on general capabilities as measured by common benchmarks. We describe and open-source our full post-training method, the implementation of which can be found at this https URL.

Title: Multi-Step Knowledge Interaction Analysis via Rank-2 Subspace Disentanglement

Authors: Sekh Mainul Islam, Pepa Atanasova, Isabelle Augenstein
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.01706
Pdf URL: https://arxiv.org/pdf/2511.01706
Copy Paste: [[2511.01706]] Multi-Step Knowledge Interaction Analysis via Rank-2 Subspace Disentanglement(https://arxiv.org/abs/2511.01706)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Natural Language Explanations (NLEs) describe how Large Language Models (LLMs) make decisions, drawing on both external Context Knowledge (CK) and Parametric Knowledge (PK) stored in model weights. Understanding their interaction is key to assessing the grounding of NLEs, yet it remains underexplored. Prior work has largely examined only single-step generation, typically the final answer, and has modelled PK and CK interaction only as a binary choice in a rank-1 subspace. This overlooks richer forms of interaction, such as complementary or supportive knowledge. We propose a novel rank-2 projection subspace that disentangles PK and CK contributions more accurately and use it for the first multi-step analysis of knowledge interactions across longer NLE sequences. Experiments on four QA datasets and three open-weight instruction-tuned LLMs show that diverse knowledge interactions are poorly represented in a rank-1 subspace but are effectively captured in our rank-2 formulation. Our multi-step analysis reveals that hallucinated NLEs align strongly with the PK direction, context-faithful ones balance PK and CK, and Chain-of-Thought prompting for NLEs shifts generated NLEs toward CK by reducing PK reliance. This work provides the first framework for systematic studies of multi-step knowledge interactions in LLMs through a richer rank-2 subspace disentanglement. Code and data: this https URL.
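Note: the contrast between a rank-1 and a rank-2 subspace in the abstract above can be made concrete with a small least-squares sketch. A rank-1 view scores a hidden state along a single PK-versus-CK axis, so it can only express a trade-off. A rank-2 view solves for two separate coefficients, so both contributions can be large at once (supportive knowledge) or small at once. The function `rank2_coefficients` and the toy vectors are hypothetical, and the paper's actual way of obtaining the PK and CK directions is not reproduced here.

```python
from typing import Sequence


def dot(x: Sequence[float], y: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(x, y))


def rank2_coefficients(
    hidden: Sequence[float],
    pk_dir: Sequence[float],  # assumed parametric-knowledge direction
    ck_dir: Sequence[float],  # assumed context-knowledge direction
) -> tuple[float, float]:
    """Least-squares (a, b) such that a*pk_dir + b*ck_dir best fits hidden."""
    uu, vv, uv = dot(pk_dir, pk_dir), dot(ck_dir, ck_dir), dot(pk_dir, ck_dir)
    det = uu * vv - uv * uv  # determinant of the 2x2 Gram matrix
    if abs(det) < 1e-12:
        raise ValueError("directions are (nearly) collinear")
    uh, vh = dot(pk_dir, hidden), dot(ck_dir, hidden)
    # Solve the 2x2 normal equations by Cramer's rule.
    a = (vv * uh - uv * vh) / det
    b = (uu * vh - uv * uh) / det
    return a, b


# Toy example: the two directions need not be orthogonal.
pk_coef, ck_coef = rank2_coefficients([3.0, 2.0, 5.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0])
```

In this toy case the hidden state decomposes as 1 × PK + 2 × CK within the subspace; the component outside the subspace (the third coordinate) is simply left unexplained.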

Title: Efficient Tool-Calling Multi-Expert NPC Agent for Commonsense Persona-Grounded Dialogue

Authors: Mahammad Nuriyev
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.01720
Pdf URL: https://arxiv.org/pdf/2511.01720
Copy Paste: [[2511.01720]] Efficient Tool-Calling Multi-Expert NPC Agent for Commonsense Persona-Grounded Dialogue(https://arxiv.org/abs/2511.01720)
Keywords: agent
Abstract: We present a multi-expert system for creating Non-Player Characters (NPCs) capable of both natural dialogue and contextual action execution in interactive environments. Using Qwen3 as the base model and Low-Rank Adaptation (LoRA) adapters, we instantiate three specialists: tool calling, tool-response interpretation, and direct dialogue. Our system comfortably meets the computational efficiency requirements, delivering fast responses and maintaining modest resource usage on L40S GPUs. In the Commonsense Persona-Grounded Dialogue Challenge 2025, our method ranked second overall. Code available at: this https URL

Title: Accumulating Context Changes the Beliefs of Language Models

Authors: Jiayi Geng, Howard Chen, Ryan Liu, Manoel Horta Ribeiro, Robb Willer, Graham Neubig, Thomas L. Griffiths
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.01805
Pdf URL: https://arxiv.org/pdf/2511.01805
Copy Paste: [[2511.01805]] Accumulating Context Changes the Beliefs of Language Models(https://arxiv.org/abs/2511.01805)
Keywords: language model, gpt, agent
Abstract: Language model (LM) assistants are increasingly used in applications such as brainstorming and research. Improvements in memory and context size have allowed these models to become more autonomous, which has also resulted in more text accumulation in their context windows without explicit user intervention. This comes with a latent risk: the belief profiles of models -- their understanding of the world as manifested in their responses or actions -- may silently change as context accumulates. This can lead to subtly inconsistent user experiences, or shifts in behavior that deviate from the original alignment of the models. In this paper, we explore how accumulating context by engaging in interactions and processing text -- talking and reading -- can change the beliefs of language models, as manifested in their responses and behaviors. Our results reveal that models' belief profiles are highly malleable: GPT-5 exhibits a 54.7% shift in its stated beliefs after 10 rounds of discussion about moral dilemmas and queries about safety, while Grok 4 shows a 27.2% shift on political issues after reading texts from the opposing position. We also examine models' behavioral changes by designing tasks that require tool use, where each tool selection corresponds to an implicit belief. We find that these changes align with stated belief shifts, suggesting that belief shifts will be reflected in actual behavior in agentic systems. Our analysis exposes the hidden risk of belief shift as models undergo extended sessions of talking or reading, rendering their opinions and actions unreliable.

Title: Plan-and-Write: Structure-Guided Length Control for LLMs without Model Retraining

Authors: Adewale Akinfaderin, Shreyas Subramanian, Akarsha Sehwag
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.01807
Pdf URL: https://arxiv.org/pdf/2511.01807
Copy Paste: [[2511.01807]] Plan-and-Write: Structure-Guided Length Control for LLMs without Model Retraining(https://arxiv.org/abs/2511.01807)
Keywords: language model, llm, prompt
Abstract: Length control in Large Language Models (LLMs) is a crucial but under-addressed challenge, with applications ranging from voice interfaces requiring concise responses to research summaries needing comprehensive outputs. Current approaches to length control, including Regularized DPO, Length-Instruction Fine Tuning, and tool-augmented methods, typically require expensive model retraining or complex inference-time tooling. This paper presents a prompt engineering methodology that enables precise length control without model retraining. Our structure-guided approach implements deliberate planning and word counting mechanisms within the prompt, encouraging the model to carefully track and adhere to specified length constraints. Comprehensive evaluations across six state-of-the-art LLMs demonstrate that our method significantly improves length fidelity for several models compared to standard prompting when applied to document summarization tasks, particularly for shorter-to-medium length constraints. The proposed technique shows varying benefits across different model architectures, with some models demonstrating up to 37.6% improvement in length adherence. Quality evaluations further reveal that our approach maintains or enhances overall output quality compared to standard prompting techniques. Our approach provides an immediately deployable solution for applications requiring precise length control, particularly valuable for production environments where model retraining is impractical or cost-prohibitive.
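
The planning and word-counting idea described in this abstract might look roughly like the sketch below. The abstract does not give the prompt wording, so the template text, `build_length_prompt`, and `length_deviation` are hypothetical illustrations rather than the authors' prompt or evaluation metric.

```python
def build_length_prompt(document: str, target_words: int, sections: int = 3) -> str:
    """Build a prompt that asks the model to plan first, then write while counting words.

    Hypothetical template: the paper's actual prompt is not given in the abstract.
    """
    per_section = target_words // sections
    return (
        f"Summarize the document below in exactly {target_words} words.\n"
        f"Step 1 (plan): outline {sections} sections of about {per_section} words each.\n"
        "Step 2 (write): write each section and note its running word count in brackets.\n"
        "Step 3 (check): confirm the final count, then output the summary without the counts.\n\n"
        f"Document:\n{document}"
    )


def length_deviation(text: str, target_words: int) -> float:
    """Relative deviation of the whitespace-token word count from the target (assumed metric)."""
    actual = len(text.split())
    return abs(actual - target_words) / target_words


prompt = build_length_prompt("LLMs often overshoot requested lengths.", target_words=60)
summary = "Models frequently exceed requested lengths unless prompted to plan and count."
print(prompt)
print(length_deviation(summary, target_words=10))
```

The prompt would be sent to any chat model unchanged; `length_deviation` only scores the returned text, so no retraining or inference-time tooling is involved.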

Title: KV Cache Transform Coding for Compact Storage in LLM Inference

Authors: Konrad Staniszewski, Adrian Łańcucki
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.01815
Pdf URL: https://arxiv.org/pdf/2511.01815
Copy Paste: [[2511.01815]] KV Cache Transform Coding for Compact Storage in LLM Inference(https://arxiv.org/abs/2511.01815)
Keywords: language model, llm, prompt, chat
Abstract: Serving large language models (LLMs) at scale necessitates efficient key-value (KV) cache management. KV caches can be reused across conversation turns via shared-prefix prompts that are common in iterative code editing and chat. However, stale caches consume scarce GPU memory, require offloading, or force recomputation. We present KVTC, a lightweight transform coder that compresses KV caches for compact on-GPU and off-GPU storage. Drawing on classical media compression, KVTC combines PCA-based feature decorrelation, adaptive quantization, and entropy coding. It requires only a brief initial calibration and leaves model parameters unchanged. By exploiting redundancies in KV caches, KVTC achieves up to 20× compression while maintaining reasoning and long-context accuracy, and 40× or higher for specific use cases. We test KVTC with Llama 3, Mistral NeMo, and R1-Qwen 2.5 models across benchmarks including AIME25, LiveCodeBench, GSM8K, MMLU, Qasper, RULER, and MATH-500. It consistently outperforms inference-time baselines such as token eviction, quantization, and SVD-based methods, while achieving higher compression ratios. These results support KVTC as a practical building block for memory-efficient LLM serving with reusable KV caches.
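
The three-stage pipeline named in this abstract (decorrelate, quantize, entropy-code) can be illustrated with a toy transform coder. This is a sketch under strong assumptions, not KVTC itself: it works on synthetic 2-D points instead of KV tensors, uses a closed-form 2x2 PCA rotation, replaces adaptive quantization with a fixed uniform step, and uses zlib's DEFLATE as a stand-in for the entropy coder. All function names are hypothetical.

```python
import math
import random
import zlib


def fit_rotation(points):
    """Calibration: mean and principal-axis angle of 2-D points (closed-form 2x2 PCA)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    cxx = sum((p[0] - mx) ** 2 for p in points) / n
    cyy = sum((p[1] - my) ** 2 for p in points) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)
    return (mx, my), theta


def encode(points, mean, theta, step):
    """Rotate into the PCA basis, quantize uniformly, then compress the integer codes."""
    c, s = math.cos(theta), math.sin(theta)
    codes = []
    for x, y in points:
        dx, dy = x - mean[0], y - mean[1]
        u, v = c * dx + s * dy, -s * dx + c * dy  # decorrelated coordinates
        codes += [round(u / step), round(v / step)]
    raw = b"".join(q.to_bytes(2, "big", signed=True) for q in codes)
    return zlib.compress(raw, 9)  # DEFLATE stands in for the entropy coder


def decode(blob, mean, theta, step):
    """Invert the pipeline: decompress, dequantize, rotate back, re-add the mean."""
    raw = zlib.decompress(blob)
    codes = [int.from_bytes(raw[i:i + 2], "big", signed=True) for i in range(0, len(raw), 2)]
    c, s = math.cos(theta), math.sin(theta)
    out = []
    for u_q, v_q in zip(codes[0::2], codes[1::2]):
        u, v = u_q * step, v_q * step
        out.append((mean[0] + c * u - s * v, mean[1] + s * u + c * v))
    return out


# Synthetic, strongly correlated data (an assumption standing in for redundant features).
random.seed(0)
data = []
for _ in range(500):
    x = random.gauss(0.0, 1.0)
    data.append((x, 0.9 * x + 0.1 * random.gauss(0.0, 1.0)))

STEP = 0.05
mean, theta = fit_rotation(data)
blob = encode(data, mean, theta, STEP)
restored = decode(blob, mean, theta, STEP)
max_err = max(math.dist(p, q) for p, q in zip(data, restored))
print(len(blob), "compressed bytes; max reconstruction error:", max_err)
```

Because the rotation is orthonormal, the per-point reconstruction error is bounded by the quantization step (at most `STEP * sqrt(2) / 2` here), and the calibration (`fit_rotation`) is done once and reused for both encoding and decoding.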

Title: Tool-to-Agent Retrieval: Bridging Tools and Agents for Scalable LLM Multi-Agent Systems

Authors: Elias Lumer, Faheem Nizar, Anmol Gulati, Pradeep Honaganahalli Basavaraju, Vamse Kumar Subbiah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.01854
Pdf URL: https://arxiv.org/pdf/2511.01854
Copy Paste: [[2511.01854]] Tool-to-Agent Retrieval: Bridging Tools and Agents for Scalable LLM Multi-Agent Systems(https://arxiv.org/abs/2511.01854)
Keywords: llm, agent
Abstract: Recent advances in LLM Multi-Agent Systems enable scalable orchestration of sub-agents, each coordinating hundreds or thousands of tools or Model Context Protocol (MCP) servers. However, existing retrieval methods typically match queries against coarse agent-level descriptions before routing, which obscures fine-grained tool functionality and often results in suboptimal agent selection. We introduce Tool-to-Agent Retrieval, a unified framework that embeds both tools and their parent agents in a shared vector space and connects them through metadata relationships. By explicitly representing tool capabilities and traversing metadata to the agent level, Tool-to-Agent Retrieval enables granular tool-level or agent-level retrieval, ensuring that agents and their underlying tools or MCP servers are equally represented without the context dilution that arises from chunking many tools together. Evaluating Tool-to-Agent Retrieval across eight embedding models, our approach achieves consistent improvements of 19.4% in Recall@5 and 17.7% in nDCG@5 over previous state-of-the-art agent retrievers on the LiveMCPBench benchmark.
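
The retrieval idea in this abstract (tools and agents in one index, with tool hits resolved to their parent agent through metadata) can be sketched as follows. This is a minimal illustration, assuming a toy bag-of-words similarity in place of a learned embedding model; the catalog entries and the names `embed`, `cosine`, and `retrieve_agents` are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use a learned embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical catalog: agents and tools share one index; each tool links to its parent agent.
CATALOG = [
    {"id": "finance_agent", "parent": None, "text": "handles finance and accounting requests"},
    {"id": "get_stock_price", "parent": "finance_agent", "text": "fetch latest stock price for a ticker symbol"},
    {"id": "travel_agent", "parent": None, "text": "handles travel planning requests"},
    {"id": "search_flights", "parent": "travel_agent", "text": "search flights between two airports on a date"},
]


def retrieve_agents(query: str, k: int = 1) -> list:
    """Rank tools and agents together, then map every hit to its agent and deduplicate."""
    q = embed(query)
    ranked = sorted(CATALOG, key=lambda e: cosine(q, embed(e["text"])), reverse=True)
    agents = []
    for entry in ranked:
        agent = entry["parent"] or entry["id"]  # follow the metadata link for tool entries
        if agent not in agents:
            agents.append(agent)
    return agents[:k]


print(retrieve_agents("what is the stock price for ticker ACME"))
```

In this toy catalog the query shares no words with the coarse `finance_agent` description, so an agent-only index would score it zero; the match comes from the `get_stock_price` tool and is then resolved to its parent agent.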