2025-11-05

Title: Multi-Personality Generation of LLMs at Decoding-time

Authors: Rongxin Chen, Yunfan Li, Yige Yuan, Bingbing Xu, Huawei Shen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.01891
Pdf URL: https://arxiv.org/pdf/2511.01891
Copy Paste: [[2511.01891]] Multi-Personality Generation of LLMs at Decoding-time(https://arxiv.org/abs/2511.01891)
Keywords: llm
Abstract: Multi-personality generation for LLMs, enabling simultaneous embodiment of multiple personalization attributes, is a fundamental challenge. Existing retraining-based approaches are costly and poorly scalable, while decoding-time methods often rely on external models or heuristics, limiting flexibility and robustness. In this paper, we propose a novel Multi-Personality Generation (MPG) framework under the decoding-time combination paradigm. It flexibly controls multi-personality without relying on scarce multi-dimensional models or extra training, leveraging implicit density ratios in single-dimensional models as a "free lunch" to reformulate the task as sampling from a target strategy aggregating these ratios. To implement MPG efficiently, we design Speculative Chunk-level based Rejection sampling (SCR), which generates responses in chunks and validates them in parallel via estimated thresholds within a sliding window. This significantly reduces computational overhead while maintaining high-quality generation. Experiments on MBTI personality and Role-Playing demonstrate the effectiveness of MPG, showing improvements of up to 16%-18%. Code and data are available at this https URL.
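The chunk-level rejection step described in the abstract can be illustrated with a minimal sketch. It assumes that each single-attribute model yields a log density ratio for a candidate chunk, that the ratios are combined by a weighted sum in log space, and that the acceptance bound is the largest aggregated ratio seen in a sliding window. The function names, the weighting, and the window rule are illustrative assumptions, not the authors' SCR implementation.

```python
import math
import random
from collections import deque

def aggregated_log_ratio(log_ratios, weights):
    """Weighted sum of per-attribute log density ratios (assumed aggregation rule)."""
    return sum(w * r for w, r in zip(weights, log_ratios))

def accept_chunk(log_ratio, window, rng):
    """Rejection test for one chunk against a sliding-window bound.

    The current ratio is added to the window first, so the bound is never
    smaller than the ratio and the acceptance probability stays in [0, 1].
    """
    window.append(log_ratio)
    log_bound = max(window)  # estimated threshold (assumption: window maximum)
    accept_prob = math.exp(log_ratio - log_bound)
    return rng.random() < accept_prob

# Usage: two hypothetical attribute models with equal weights.
rng = random.Random(0)
window = deque(maxlen=4)
score = aggregated_log_ratio([1.0, -0.5], [0.5, 0.5])
accepted = accept_chunk(score, window, rng)
```

A chunk whose ratio equals the window maximum is always accepted; chunks far below the bound are almost always rejected and would be regenerated.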

Title: Rethinking LLM Human Simulation: When a Graph is What You Need

Authors: Joseph Suh, Suhong Moon, Serina Chang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.02135
Pdf URL: https://arxiv.org/pdf/2511.02135
Copy Paste: [[2511.02135]] Rethinking LLM Human Simulation: When a Graph is What You Need(https://arxiv.org/abs/2511.02135)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly used to simulate humans, with applications ranging from survey prediction to decision-making. However, are LLMs strictly necessary, or can smaller, domain-grounded models suffice? We identify a large class of simulation problems in which individuals make choices among discrete options, where a graph neural network (GNN) can match or surpass strong LLM baselines despite being three orders of magnitude smaller. We introduce Graph-basEd Models for human Simulation (GEMS), which casts discrete choice simulation tasks as a link prediction problem on graphs, leveraging relational knowledge while incorporating language representations only when needed. Evaluations across three key settings on three simulation datasets show that GEMS achieves comparable or better accuracy than LLMs, with far greater efficiency, interpretability, and transparency, highlighting the promise of graph-based modeling as a lightweight alternative to LLMs for human simulation. Our code is available at this https URL.

Title: IG-Pruning: Input-Guided Block Pruning for Large Language Models

Authors: Kangyu Qiao, Shaolei Zhang, Yang Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.02213
Pdf URL: https://arxiv.org/pdf/2511.02213
Copy Paste: [[2511.02213]] IG-Pruning: Input-Guided Block Pruning for Large Language Models(https://arxiv.org/abs/2511.02213)
Keywords: language model, llm
Abstract: With the growing computational demands of large language models (LLMs), efficient inference has become increasingly critical for practical deployment. Depth pruning has emerged as a promising approach for reducing the computational costs of large language models by removing transformer layers. However, existing methods typically rely on fixed block masks, which can lead to suboptimal performance across different tasks and inputs. In this paper, we propose IG-Pruning, a novel input-aware block-wise pruning method that dynamically selects layer masks at inference time. Our approach consists of two stages: (1) Discovering diverse mask candidates through semantic clustering and L0 optimization, and (2) Implementing efficient dynamic pruning without the need for extensive training. Experimental results demonstrate that our method consistently outperforms state-of-the-art static depth pruning methods, making it particularly suitable for resource-constrained deployment scenarios.
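Input-dependent block pruning of this kind can be sketched as a two-step routine: pick a precomputed layer mask for the input, then run only the blocks that the mask keeps. The sketch below assumes the mask is chosen by nearest cluster centroid of an input embedding; that routing rule and the names `select_mask` and `run_pruned` are hypothetical, since the abstract does not specify how masks are matched to inputs.

```python
def select_mask(embedding, centroids, masks):
    """Return the mask attached to the centroid closest to the input embedding."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(range(len(centroids)), key=lambda i: sq_dist(embedding, centroids[i]))
    return masks[nearest]

def run_pruned(x, blocks, mask):
    """Apply only the blocks whose mask entry is True; skipped blocks act as identity."""
    for block, keep in zip(blocks, mask):
        if keep:
            x = block(x)
    return x

# Usage with toy "blocks" standing in for transformer layers.
blocks = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
centroids = [[0.0, 0.0], [10.0, 10.0]]
masks = [[True, False, True], [True, True, False]]
mask = select_mask([9.0, 9.0], centroids, masks)
output = run_pruned(5, blocks, mask)
```

Skipping a block is only safe when the block has a residual connection, which is why treating a pruned block as identity is a reasonable stand-in here.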

Title: Demo: Statistically Significant Results On Biases and Errors of LLMs Do Not Guarantee Generalizable Results

Authors: Jonathan Liu, Haoling Qiu, Jonathan Lasko, Damianos Karakos, Mahsa Yarmohammadi, Mark Dredze
Subjects: cs.CL, cs.AI, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2511.02246
Pdf URL: https://arxiv.org/pdf/2511.02246
Copy Paste: [[2511.02246]] Demo: Statistically Significant Results On Biases and Errors of LLMs Do Not Guarantee Generalizable Results(https://arxiv.org/abs/2511.02246)
Keywords: llm, hallucination, prompt, chat, agent
Abstract: Recent research has shown that hallucinations, omissions, and biases are prevalent in everyday use-cases of LLMs. However, chatbots used in medical contexts must provide consistent advice in situations where non-medical factors are involved, such as when demographic information is present. In order to understand the conditions under which medical chatbots fail to perform as expected, we develop an infrastructure that 1) automatically generates queries to probe LLMs and 2) evaluates answers to these queries using multiple LLM-as-a-judge setups and prompts. For 1), our prompt creation pipeline samples the space of patient demographics, histories, disorders, and writing styles to create realistic questions that we subsequently use to prompt LLMs. In 2), our evaluation pipeline provides hallucination and omission detection using LLM-as-a-judge as well as agentic workflows, in addition to LLM-as-a-judge treatment category detectors. As a baseline study, we perform two case studies on inter-LLM agreement and the impact of varying the answering and evaluation LLMs. We find that LLM annotators exhibit low agreement scores (average Cohen's Kappa $\kappa=0.118$), and only specific (answering, evaluation) LLM pairs yield statistically significant differences across writing styles, genders, and races. We recommend that studies using LLM evaluation use multiple LLMs as evaluators in order to avoid arriving at statistically significant but non-generalizable results, particularly in the absence of ground-truth data. We also suggest publishing inter-LLM agreement metrics for transparency. Our code and dataset are available here: this https URL.
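For readers unfamiliar with the agreement metric, Cohen's Kappa compares observed agreement p_o with chance agreement p_e as kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two label sequences and averages it over all judge pairs. Averaging over unordered pairs is an assumption about how an "average Cohen's Kappa" might be obtained; the judge names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the two annotators' marginal label rates.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)

def mean_pairwise_kappa(annotations):
    """Average Kappa over all unordered pairs of judges (assumed aggregation)."""
    pairs = list(combinations(annotations.values(), 2))
    return sum(cohens_kappa(a, b) for a, b in pairs) / len(pairs)

# Usage with three hypothetical LLM judges.
judges = {
    "judge_1": ["y", "y", "n", "n"],
    "judge_2": ["y", "n", "n", "n"],
    "judge_3": ["y", "y", "n", "n"],
}
average = mean_pairwise_kappa(judges)
```

A value near 0, such as the 0.118 reported in the abstract, means the judges agree only slightly more often than chance would predict.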

Title: LTD-Bench: Evaluating Large Language Models by Letting Them Draw

Authors: Liuhao Lin, Ke Li, Zihan Xu, Yuchen Shi, Yulei Qin, Yan Zhang, Xing Sun, Rongrong Ji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.02347
Pdf URL: https://arxiv.org/pdf/2511.02347
Copy Paste: [[2511.02347]] LTD-Bench: Evaluating Large Language Models by Letting Them Draw(https://arxiv.org/abs/2511.02347)
Keywords: language model, llm
Abstract: Current evaluation paradigms for large language models (LLMs) represent a critical blind spot in AI research--relying on opaque numerical metrics that conceal fundamental limitations in spatial reasoning while providing no intuitive understanding of model capabilities. This deficiency creates a dangerous disconnect between reported performance and practical abilities, particularly for applications requiring physical world understanding. We introduce LTD-Bench, a breakthrough benchmark that transforms LLM evaluation from abstract scores to directly observable visual outputs by requiring models to generate drawings through dot matrices or executable code. This approach makes spatial reasoning limitations immediately apparent even to non-experts, bridging the fundamental gap between statistical performance and intuitive assessment. LTD-Bench implements a comprehensive methodology with complementary generation tasks (testing spatial imagination) and recognition tasks (assessing spatial perception) across three progressively challenging difficulty levels, methodically evaluating both directions of the critical language-spatial mapping. Our extensive experiments with state-of-the-art models expose an alarming capability gap: even LLMs achieving impressive results on traditional benchmarks demonstrate profound deficiencies in establishing bidirectional mappings between language and spatial concept--a fundamental limitation that undermines their potential as genuine world models. Furthermore, LTD-Bench's visual outputs enable powerful diagnostic analysis, offering a potential approach to investigate model similarity.

Title: Let Multimodal Embedders Learn When to Augment Query via Adaptive Query Augmentation

Authors: Wongyu Kim, Hochang Lee, Sanghak Lee, Yoonsung Kim, Jaehyun Park
Subjects: cs.CL, cs.AI, cs.IR, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2511.02358
Pdf URL: https://arxiv.org/pdf/2511.02358
Copy Paste: [[2511.02358]] Let Multimodal Embedders Learn When to Augment Query via Adaptive Query Augmentation(https://arxiv.org/abs/2511.02358)
Keywords: language model, llm
Abstract: Query augmentation makes queries more meaningful by appending further information to the queries to find relevant documents. Current studies have proposed Large Language Model (LLM)-based embedders, which learn representation for embedding and generation for query augmentation in a multi-task manner by leveraging the generative capabilities of LLM. During inference, these jointly trained embedders have conducted query augmentation followed by embedding, showing effective results. However, augmenting every query leads to substantial embedding latency, and query augmentation can be detrimental to performance for some queries. Also, previous methods have not been explored in multimodal environments. To tackle these problems, we propose M-Solomon, a universal multimodal embedder that can adaptively determine when to augment queries. Our approach first divides the queries of the training datasets into two groups at the dataset level. One includes queries that require augmentation and the other includes queries that do not. Then, we introduce a synthesis process that generates appropriate augmentations for queries that require them by leveraging a powerful Multimodal LLM (MLLM). Next, we present adaptive query augmentation. Through this step, M-Solomon can conduct query augmentation only when necessary by learning to generate synthetic augmentations with the prefix /augment for queries that demand them and to generate the simple string /embed for others. Experimental results showed that M-Solomon not only surpassed the baseline without augmentation by a large margin but also outperformed the baseline that always used augmentation, providing much faster embedding latency.
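The /augment versus /embed behaviour can be pictured as a small routing step on the model's generated text. The sketch below assumes the embedder input is the original query with the generated augmentation appended after a space; that concatenation format and the function name `build_embedding_input` are hypothetical, as the abstract does not describe the exact input layout.

```python
AUGMENT_PREFIX = "/augment"
EMBED_TOKEN = "/embed"

def build_embedding_input(query, generated):
    """Decide what text to embed from the model's generated control string.

    "/augment <text>" -> embed the query with the augmentation appended.
    "/embed" (or anything else) -> embed the query unchanged.
    """
    text = generated.strip()
    if text.startswith(AUGMENT_PREFIX):
        augmentation = text[len(AUGMENT_PREFIX):].strip()
        if augmentation:
            return f"{query} {augmentation}"
    return query

# Usage with made-up model outputs.
plain = build_embedding_input("capital of France", "/embed")
augmented = build_embedding_input("jaguar speed", "/augment top running speed of the big cat")
```

Because "/embed" is a very short generation, queries that do not need augmentation pay almost no extra decoding cost, which is where the latency saving would come from.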

Title: LiveSecBench: A Dynamic and Culturally-Relevant AI Safety Benchmark for LLMs in Chinese Context

Authors: Yudong Li, Zhongliang Yang, Kejiang Chen, Wenxuan Wang, Tianxin Zhang, Sifang Wan, Kecheng Wang, Haitian Li, Xu Wang, Lefan Cheng, Youdan Yang, Baocheng Chen, Ziyu Liu, Yufei Sun, Liyan Wu, Wenya Wen, Xingchi Gu, Peiru Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.02366
Pdf URL: https://arxiv.org/pdf/2511.02366
Copy Paste: [[2511.02366]] LiveSecBench: A Dynamic and Culturally-Relevant AI Safety Benchmark for LLMs in Chinese Context(https://arxiv.org/abs/2511.02366)
Keywords: llm, agent
Abstract: In this work, we propose LiveSecBench, a dynamic and continuously updated safety benchmark specifically for Chinese-language LLM application scenarios. LiveSecBench evaluates models across six critical dimensions (Legality, Ethics, Factuality, Privacy, Adversarial Robustness, and Reasoning Safety) rooted in the Chinese legal and social frameworks. This benchmark maintains relevance through a dynamic update schedule that incorporates new threat vectors, such as the planned inclusion of Text-to-Image Generation Safety and Agentic Safety in the next update. For now, LiveSecBench (v251030) has evaluated 18 LLMs, providing a landscape of AI safety in the context of Chinese language. The leaderboard is publicly accessible at this https URL.

Title: AyurParam: A State-of-the-Art Bilingual Language Model for Ayurveda

Authors: Mohd Nauman, Sravan Gvm, Vijay Devane, Shyam Pawar, Viraj Thakur, Kundeshwar Pundalik, Piyush Sawarkar, Rohit Saluja, Maunendra Desarkar, Ganesh Ramakrishnan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.02374
Pdf URL: https://arxiv.org/pdf/2511.02374
Copy Paste: [[2511.02374]] AyurParam: A State-of-the-Art Bilingual Language Model for Ayurveda(https://arxiv.org/abs/2511.02374)
Keywords: language model, llm
Abstract: Current large language models excel at broad, general-purpose tasks, but consistently underperform when exposed to highly specialized domains that require deep cultural, linguistic, and subject-matter expertise. In particular, traditional medical systems such as Ayurveda embody centuries of nuanced textual and clinical knowledge that mainstream LLMs fail to accurately interpret or apply. We introduce AyurParam-2.9B, a domain-specialized, bilingual language model fine-tuned from Param-1-2.9B using an extensive, expertly curated Ayurveda dataset spanning classical texts and clinical guidance. AyurParam's dataset incorporates context-aware, reasoning, and objective-style Q&A in both English and Hindi, with rigorous annotation protocols for factual precision and instructional clarity. Benchmarked on BhashaBench-Ayur, AyurParam not only surpasses all open-source instruction-tuned models in its size class (1.5--3B parameters), but also demonstrates competitive or superior performance compared to much larger models. The results from AyurParam highlight the necessity for authentic domain adaptation and high-quality supervision in delivering reliable, culturally congruent AI for specialized medical knowledge.

Title: AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models

Authors: Aashray Reddy, Andrew Zagula, Nicholas Saban
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2511.02376
Pdf URL: https://arxiv.org/pdf/2511.02376
Copy Paste: [[2511.02376]] AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models(https://arxiv.org/abs/2511.02376)
Keywords: language model, gpt, llm, prompt
Abstract: Large Language Models (LLMs) remain vulnerable to jailbreaking attacks where adversarial prompts elicit harmful outputs, yet most evaluations focus on single-turn interactions while real-world attacks unfold through adaptive multi-turn conversations. We present AutoAdv, a training-free framework for automated multi-turn jailbreaking that achieves up to 95% attack success rate on Llama-3.1-8B within six turns, a 24 percent improvement over single-turn baselines. AutoAdv uniquely combines three adaptive mechanisms: a pattern manager that learns from successful attacks to enhance future prompts, a temperature manager that dynamically adjusts sampling parameters based on failure modes, and a two-phase rewriting strategy that disguises harmful requests then iteratively refines them. Extensive evaluation across commercial and open-source models (GPT-4o-mini, Qwen3-235B, Mistral-7B) reveals persistent vulnerabilities in current safety mechanisms, with multi-turn attacks consistently outperforming single-turn approaches. These findings demonstrate that alignment strategies optimized for single-turn interactions fail to maintain robustness across extended conversations, highlighting an urgent need for multi-turn-aware defenses.

Title: Merging Continual Pretraining Models for Domain-Specialized LLMs: A Case Study in Finance

Authors: Kentaro Ueda, François Portet, Hirohiko Suwa, Keiichi Yasumoto
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.02451
Pdf URL: https://arxiv.org/pdf/2511.02451
Copy Paste: [[2511.02451]] Merging Continual Pretraining Models for Domain-Specialized LLMs: A Case Study in Finance(https://arxiv.org/abs/2511.02451)
Keywords: llm
Abstract: While LLMs excel at general tasks, they struggle in specialized domains like finance, requiring diverse skills in domain knowledge, mathematical reasoning, and multilingual processing. Merging domain-specific Continual Pre-training (CPT) "experts" offers a practical alternative to costly and unstable multi-skill training. However, unlike established Supervised Fine-Tuning (SFT) model-based merging, CPT model merging remains largely unexplored. We address this gap by creating financial LLMs from experts in finance, math, and Japanese. We propose a three-stage evaluation focusing on knowledge recovery, complementarity, and emergence, and assess three merging methods (Task Arithmetic, TIES, and DARE-TIES) on a comprehensive financial benchmark curated from 18 tasks across 8 established datasets. Results show that merging an expert with its base model recovers general knowledge lost during CPT, while merging experts improves performance and can yield emergent cross-domain skills. Among the methods, Task Arithmetic performs strongly but is hyperparameter-sensitive, whereas TIES is more robust. Our findings also suggest that while model similarity correlates with merging success, emergent skills depend on more complex factors. This work presents the first foundational analysis of CPT model merging, establishing a principled framework and providing clear guidance for building multi-skill LLMs from existing assets.
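Of the three merging methods named above, Task Arithmetic is the simplest to write down: each expert contributes a task vector (expert weights minus base weights), and the scaled sum of task vectors is added back to the base. The sketch below uses plain dictionaries of float lists as stand-ins for model state; the parameter name `scale` and the toy weights are illustrative, and the paper's actual hyperparameters are not given in the abstract.

```python
def task_vector(base, expert):
    """Per-parameter difference between an expert and its base model."""
    return {name: [e - b for e, b in zip(expert[name], base[name])] for name in base}

def task_arithmetic_merge(base, experts, scale=1.0):
    """Merge experts as base + scale * sum of their task vectors."""
    merged = {name: list(values) for name, values in base.items()}
    for expert in experts:
        delta = task_vector(base, expert)
        for name in merged:
            merged[name] = [m + scale * d for m, d in zip(merged[name], delta[name])]
    return merged

# Usage with toy weights for a base model and two hypothetical CPT experts.
base = {"w": [1.0, 2.0]}
finance_expert = {"w": [2.0, 2.0]}
math_expert = {"w": [1.0, 4.0]}
merged = task_arithmetic_merge(base, [finance_expert, math_expert], scale=0.5)
```

The single `scale` factor is the kind of hyperparameter the abstract's sensitivity remark refers to: too large a value lets the summed task vectors overshoot, too small a value leaves the merge close to the base model.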

Title: Prompting for Policy: Forecasting Macroeconomic Scenarios with Synthetic LLM Personas

Authors: Giulia Iadisernia, Carolina Camassa
Subjects: cs.CL, cs.CE, econ.GN
Abstract URL: https://arxiv.org/abs/2511.02458
Pdf URL: https://arxiv.org/pdf/2511.02458
Copy Paste: [[2511.02458]] Prompting for Policy: Forecasting Macroeconomic Scenarios with Synthetic LLM Personas(https://arxiv.org/abs/2511.02458)
Keywords: language model, gpt, llm, prompt
Abstract: We evaluate whether persona-based prompting improves Large Language Model (LLM) performance on macroeconomic forecasting tasks. Using 2,368 economics-related personas from the PersonaHub corpus, we prompt GPT-4o to replicate the ECB Survey of Professional Forecasters across 50 quarterly rounds (2013-2025). We compare the persona-prompted forecasts against the human expert panel, across four target variables (HICP, core HICP, GDP growth, unemployment) and four forecast horizons. We also compare the results against 100 baseline forecasts without persona descriptions to isolate the effect of the personas. We report two main findings. Firstly, GPT-4o and human forecasters achieve remarkably similar accuracy levels, with differences that are statistically significant yet practically modest. Our out-of-sample evaluation on 2024-2025 data demonstrates that GPT-4o can maintain competitive forecasting performance on unseen events, though with notable differences compared to the in-sample period. Secondly, our ablation experiment reveals no measurable forecasting advantage from persona descriptions, suggesting these prompt components can be omitted to reduce computational costs without sacrificing accuracy. Our results provide evidence that GPT-4o can achieve competitive forecasting accuracy even on out-of-sample macroeconomic events, if provided with relevant context data, while revealing that diverse prompts produce remarkably homogeneous forecasts compared to human panels.

Title: Next Token Knowledge Tracing: Exploiting Pretrained LLM Representations to Decode Student Behaviour

Authors: Max Norris, Kobi Gal, Sahan Bulathwela
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.02599
Pdf URL: https://arxiv.org/pdf/2511.02599
Copy Paste: [[2511.02599]] Next Token Knowledge Tracing: Exploiting Pretrained LLM Representations to Decode Student Behaviour(https://arxiv.org/abs/2511.02599)
Keywords: language model, llm
Abstract: Modelling student knowledge is a key challenge when leveraging AI in education, with major implications for personalised learning. The Knowledge Tracing (KT) task aims to predict how students will respond to educational questions in learning environments, based on their prior interactions. Existing KT models typically use response correctness along with metadata like skill tags and timestamps, often overlooking the question text, which is an important source of pedagogical insight. This omission is a missed opportunity and limits predictive performance. We propose Next Token Knowledge Tracing (NTKT), a novel approach that reframes KT as a next-token prediction task using pretrained Large Language Models (LLMs). NTKT represents both student histories and question content as sequences of text, allowing LLMs to learn patterns in both behaviour and language. In a series of experiments, NTKT significantly improves performance over state-of-the-art neural KT models and generalises much better to cold-start questions and users. These findings highlight the importance of question content in KT and demonstrate the benefits of leveraging pretrained representations of LLMs to model student learning more effectively.

Title: CGES: Confidence-Guided Early Stopping for Efficient and Accurate Self-Consistency

Authors: Ehsan Aghazadeh, Ahmad Ghasemi, Hedyeh Beyhaghi, Hossein Pishro-Nik
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.02603
Pdf URL: https://arxiv.org/pdf/2511.02603
Copy Paste: [[2511.02603]] CGES: Confidence-Guided Early Stopping for Efficient and Accurate Self-Consistency(https://arxiv.org/abs/2511.02603)
Keywords: language model, llm
Abstract: Large language models (LLMs) are often queried multiple times at test time, with predictions aggregated by majority vote. While effective, this self-consistency strategy (arXiv:2203.11171) requires a fixed number of calls and can fail when the correct answer is rare. We introduce Confidence-Guided Early Stopping (CGES), a Bayesian framework that forms posteriors over candidate answers using scalar confidence signals derived from token probabilities or reward models. CGES adaptively halts sampling once the posterior mass of a candidate exceeds a threshold. We provide theoretical guarantees for both perfectly calibrated confidences and realistic noisy confidence signals. Across five reasoning benchmarks, CGES reduces the average number of model calls by about 69 percent (for example, from 16.0 to 4.9) while matching the accuracy of self-consistency within 0.06 percentage points.
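The early-stopping idea can be sketched with a toy Bayesian update. The likelihood model below is an assumption made for illustration, not the update rule from the paper: if a hypothesis is the correct answer, a sample that reports it with confidence c has likelihood c, and a sample that reports anything else with confidence c has likelihood 1 - c. A catch-all hypothesis (`OTHER`, hypothetical) covers the case where none of the sampled answers is correct, and the prior is uniform.

```python
import math

OTHER = "<none of the sampled answers>"  # hypothetical catch-all hypothesis

def posterior(history):
    """Posterior over sampled answers plus OTHER under the assumed likelihood."""
    hypotheses = {answer for answer, _ in history} | {OTHER}
    log_score = {}
    for hyp in hypotheses:
        total = 0.0
        for answer, conf in history:
            c = min(max(conf, 1e-6), 1.0 - 1e-6)  # clamp so log() stays finite
            total += math.log(c if answer == hyp else 1.0 - c)
        log_score[hyp] = total
    peak = max(log_score.values())
    weight = {h: math.exp(s - peak) for h, s in log_score.items()}
    norm = sum(weight.values())
    return {h: w / norm for h, w in weight.items()}

def sample_until_confident(draw, threshold=0.95, max_calls=16):
    """Call draw() until one answer's posterior mass reaches the threshold."""
    history, best = [], None
    for _ in range(max_calls):
        history.append(draw())  # draw() returns (answer, confidence)
        post = posterior(history)
        best = max((h for h in post if h != OTHER), key=post.get)
        if post[best] >= threshold:
            break
    return best, len(history)

# Usage with a scripted stream standing in for repeated model calls.
stream = iter([("42", 0.7), ("42", 0.8), ("42", 0.9), ("7", 0.5)])
answer, calls = sample_until_confident(lambda: next(stream))
```

With the scripted stream, the loop stops after three calls instead of exhausting the budget, which is the kind of saving the abstract quantifies for real benchmarks.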

Title: The Realignment Problem: When Right becomes Wrong in LLMs

Authors: Aakash Sen Sharma, Debdeep Sanyal, Vivek Srivastava, Shirish Karande, Murari Mandal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.02623
Pdf URL: https://arxiv.org/pdf/2511.02623
Copy Paste: [[2511.02623]] The Realignment Problem: When Right becomes Wrong in LLMs(https://arxiv.org/abs/2511.02623)
Keywords: language model, llm
Abstract: The alignment of Large Language Models (LLMs) with human values is central to their safe deployment, yet current practice produces static, brittle, and costly-to-maintain models that fail to keep pace with evolving norms and policies. This misalignment, which we term the Alignment-Reality Gap, poses a growing challenge for reliable long-term use. Existing remedies are inadequate: large-scale re-annotation is economically prohibitive, and standard unlearning methods act as blunt instruments that erode utility rather than enable precise policy updates. We introduce TRACE (Triage and Re-align by Alignment Conflict Evaluation), a framework for principled unlearning that reconceives re-alignment as a programmatic policy application problem. TRACE programmatically triages existing preference data against a new policy, identifies high-impact conflicts via an alignment impact score, and applies a hybrid optimization that cleanly inverts, discards, or preserves preferences while safeguarding model performance. Empirical results show that TRACE achieves robust re-alignment across diverse model families (Qwen2.5-7B, Gemma-2-9B, Llama-3.1-8B). On both synthetic benchmarks and the PKU-SafeRLHF dataset under complex policy shift, TRACE enforces new principles without degrading general capabilities. Our work establishes a scalable, dynamic, and cost-effective paradigm for maintaining LLM alignment, providing a foundation for sustainable and responsible AI deployment.

Title: Understanding New-Knowledge-Induced Factual Hallucinations in LLMs: Analysis, Solution, and Interpretation

Authors: Renfei Dang, Peng Hu, Changjiang Gao, Shujian Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.02626
Pdf URL: https://arxiv.org/pdf/2511.02626
Copy Paste: [[2511.02626]] Understanding New-Knowledge-Induced Factual Hallucinations in LLMs: Analysis, Solution, and Interpretation(https://arxiv.org/abs/2511.02626)
Keywords: language model, llm, hallucination
Abstract: Previous studies show that introducing new knowledge during fine-tuning of large language models (LLMs) can lead to the generation of erroneous output when tested on known information, thereby triggering factual hallucinations. However, existing studies have not deeply investigated the specific manifestations and underlying mechanisms of these hallucinations. Our work addresses this gap by designing a controlled dataset, Biography-Reasoning, and conducting a fine-grained analysis across multiple knowledge types and two task types, including knowledge question answering (QA) and knowledge reasoning tasks. We find that when fine-tuned on a dataset in which a specific knowledge type consists entirely of new knowledge, LLMs exhibit significantly increased hallucination tendencies. This suggests that the high unfamiliarity of a particular knowledge type, rather than the overall proportion of new knowledge, is a stronger driver of hallucinations, and these tendencies can even affect other knowledge types in QA tasks. To mitigate such factual hallucinations, we propose KnownPatch, which patches a small number of known knowledge samples in the later stages of training, effectively alleviating new-knowledge-induced hallucinations. Through attention analysis, we find that learning new knowledge reduces the model's attention to key entities in the question, thus causing excessive focus on the surrounding context, which may increase the risk of hallucination. Moreover, the attention pattern can propagate to similar contexts, facilitating the spread of hallucinations to textually similar questions. Our method effectively mitigates the disruption that learning new knowledge causes to the model's attention on key entities, and it improves performance.

Title: Optimal Singular Damage: Efficient LLM Inference in Low Storage Regimes

Authors: Mohammadsajad Alipour, Mohammad Mohammadi Amiri
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.02681
Pdf URL: https://arxiv.org/pdf/2511.02681
Copy Paste: [[2511.02681]] Optimal Singular Damage: Efficient LLM Inference in Low Storage Regimes(https://arxiv.org/abs/2511.02681)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly prevalent across diverse applications. However, their enormous size limits storage and processing capabilities to a few well-resourced stakeholders. As a result, most applications rely on pre-trained LLMs, fine-tuned for specific tasks. Even so, storing the fine-tuned versions of these models remains a significant challenge due to the wide range of tasks they address. Recent studies show that fine-tuning these models primarily affects a small fraction of parameters, highlighting the need for more efficient storage of fine-tuned models. This paper focuses on efficient storage of parameter updates in pre-trained models after fine-tuning. To address this challenge, we leverage the observation that fine-tuning updates are both low-rank and sparse, which can be exploited for storage efficiency. However, using only low-rank approximation or sparsification may discard critical singular components that enhance model expressivity. We first observe that, given the same memory budget, sparsified low-rank approximations with larger ranks outperform standard low-rank approximations with smaller ranks. Building on this, we propose optimal singular damage, a method that selectively sparsifies low-rank approximated updates by leveraging the interleaved importance of singular vectors, ensuring that the most impactful components are retained. We demonstrate through extensive experiments that our proposed method leads to significant storage efficiency and superior accuracy within the same memory budget compared to employing low-rank approximation or sparsification individually.
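
The same-budget comparison described in this abstract can be illustrated with a small NumPy sketch. It is not the paper's optimal singular damage method. The magnitude-based top-k sparsification, the synthetic update matrix, the helper names `low_rank_factors` and `sparsify_top_k`, and the choice to measure the budget as the number of stored nonzero values (ignoring the cost of storing indices) are all assumptions made for illustration.

```python
import numpy as np

def low_rank_factors(delta, rank):
    """Truncated SVD: return A (m x r) and B (r x n) with delta ~ A @ B."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank, :]

def sparsify_top_k(matrix, keep):
    """Keep the `keep` largest-magnitude entries (assumed criterion), zero the rest."""
    out = np.zeros_like(matrix)
    keep = min(keep, matrix.size)
    if keep <= 0:
        return out
    idx = np.argpartition(np.abs(matrix).ravel(), -keep)[-keep:]
    out.flat[idx] = matrix.flat[idx]
    return out

rng = np.random.default_rng(0)
m, n = 64, 48
# Synthetic stand-in for a fine-tuning update: decaying spectrum plus small noise.
delta = (rng.normal(size=(m, 16)) * 0.7 ** np.arange(16)) @ rng.normal(size=(16, n))
delta += 0.01 * rng.normal(size=(m, n))

small_rank, large_rank = 4, 8
budget = small_rank * (m + n)   # values stored by the dense small-rank factors

# Option 1: dense factors at the smaller rank.
a_small, b_small = low_rank_factors(delta, small_rank)

# Option 2: larger rank, then sparsify both factors down to the same budget.
a_large, b_large = low_rank_factors(delta, large_rank)
keep_a = budget * m // (m + n)  # split the budget in proportion to factor size
keep_b = budget - keep_a
a_sparse = sparsify_top_k(a_large, keep_a)
b_sparse = sparsify_top_k(b_large, keep_b)

def rel_error(approx):
    """Relative Frobenius-norm error of an approximation to delta."""
    return np.linalg.norm(delta - approx) / np.linalg.norm(delta)

print("dense small-rank error: ", rel_error(a_small @ b_small))
print("sparse large-rank error:", rel_error(a_sparse @ b_sparse))
print("stored values:", budget, np.count_nonzero(a_sparse) + np.count_nonzero(b_sparse))
```

Which option has the lower error on this toy matrix depends on its spectrum and on the sparsification rule, so the sketch only shows how to set up a comparison at equal storage.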

Title: AI Diffusion in Low Resource Language Countries

Authors: Amit Misra, Syed Waqas Zamir, Wassim Hamidouche, Inbal Becker-Reshef, Juan Lavista Ferres
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2511.02752
Pdf URL: https://arxiv.org/pdf/2511.02752
Copy Paste: [[2511.02752]] AI Diffusion in Low Resource Language Countries(https://arxiv.org/abs/2511.02752)
Keywords: language model, llm
Abstract: Artificial intelligence (AI) is diffusing globally at unprecedented speed, but adoption remains uneven. Frontier Large Language Models (LLMs) are known to perform poorly on low-resource languages due to data scarcity. We hypothesize that this performance deficit reduces the utility of AI, thereby slowing adoption in Low-Resource Language Countries (LRLCs). To test this, we use a weighted regression model to isolate the language effect from socioeconomic and demographic factors, finding that LRLCs have a share of AI users that is approximately 20% lower relative to their baseline. These results indicate that linguistic accessibility is a significant, independent barrier to equitable AI diffusion.
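
As a rough illustration of how a weighted regression can separate a language indicator from other covariates, here is a minimal weighted least squares sketch in NumPy. Everything in it is an assumption for demonstration: the data are synthetic, the single control variable, the population-style weights, the log-scale outcome, and the planted effect are chosen for the demo, and `weighted_least_squares` is a hypothetical helper. It does not reproduce the paper's model or data.

```python
import numpy as np

def weighted_least_squares(features, target, weights):
    """Solve min_b sum_i w_i * (y_i - x_i . b)^2 by rescaling rows with sqrt(w_i)."""
    root_w = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(features * root_w[:, None], target * root_w, rcond=None)
    return coef

rng = np.random.default_rng(0)
n_countries = 200
log_income = rng.normal(size=n_countries)                 # synthetic control variable
is_lrlc = (rng.random(n_countries) < 0.3).astype(float)   # synthetic LRLC indicator
population = rng.uniform(0.5, 5.0, size=n_countries)      # synthetic regression weights

true_effect = np.log(0.8)   # planted effect for the demo: share is 20% lower
log_share = 1.0 + 0.5 * log_income + true_effect * is_lrlc
log_share += 0.05 * rng.normal(size=n_countries)          # small synthetic noise

# Design matrix: intercept, control, and the indicator of interest.
design = np.column_stack([np.ones(n_countries), log_income, is_lrlc])
coef = weighted_least_squares(design, log_share, population)

# On a log-scale outcome, exp(coefficient) - 1 is the relative change.
relative_change = np.exp(coef[2]) - 1.0
print(f"estimated indicator effect: {relative_change:+.1%}")
```

Because the controls enter the same regression, the indicator coefficient is read as the difference that remains after accounting for them, which is the sense in which the abstract speaks of isolating the language effect.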

Title: Controlling Performance and Budget of a Centralized Multi-agent LLM System with Reinforcement Learning

Authors: Bowen Jin, TJ Collins, Donghan Yu, Mert Cemri, Shenao Zhang, Mengyu Li, Jay Tang, Tian Qin, Zhiyang Xu, Jiarui Lu, Guoli Yin, Jiawei Han, Zirui Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.02755
Pdf URL: https://arxiv.org/pdf/2511.02755
Copy Paste: [[2511.02755]] Controlling Performance and Budget of a Centralized Multi-agent LLM System with Reinforcement Learning(https://arxiv.org/abs/2511.02755)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) exhibit complementary strengths across domains and come with varying inference costs, motivating the design of multi-agent LLM systems where specialized models collaborate efficiently. Existing approaches predominantly rely on decentralized frameworks, which invoke multiple LLMs for every input and thus lead to substantial and uncontrolled inference costs. In this work, we introduce a centralized multi-LLM framework, where a controller LLM selectively coordinates a pool of expert models in a cost-efficient and cost-controllable manner. We formulate this coordination problem as reinforcement learning with dual objectives: maximizing task performance while minimizing the overall inference cost. In addition, we expect the multi-agent system to adapt its behavior to different budget conditions during inference. To this end, we propose CoRL, a reinforcement learning framework that optimizes the performance-cost trade-off in a controllable multi-budget setting. Experiments on four diverse benchmarks demonstrate that CoRL enables a single system to surpass the best expert LLM under high-budget settings, while maintaining strong performance in more economical low-budget modes, highlighting the effectiveness of centralized coordination for scalable and cost-efficient multi-agent LLM systems.

Title: MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning

Authors: Qianhao Yuan, Jie Lou, Zichao Li, Jiawei Chen, Yaojie Lu, Hongyu Lin, Le Sun, Debing Zhang, Xianpei Han
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.02805
Pdf URL: https://arxiv.org/pdf/2511.02805
Copy Paste: [[2511.02805]] MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning(https://arxiv.org/abs/2511.02805)
Keywords: llm, agent
Abstract: Typical search agents concatenate the entire interaction history into the LLM context, preserving information integrity but producing long, noisy contexts, resulting in high computation and memory costs. In contrast, using only the current turn avoids this overhead but discards essential information. This trade-off limits the scalability of search agents. To address this challenge, we propose MemSearcher, an agent workflow that iteratively maintains a compact memory and combines the current turn with it. At each turn, MemSearcher fuses the user's question with the memory to generate reasoning traces, perform search actions, and update memory to retain only information essential for solving the task. This design stabilizes context length across multi-turn interactions, improving efficiency without sacrificing accuracy. To optimize this workflow, we introduce multi-context GRPO, an end-to-end RL framework that jointly optimizes the reasoning, search strategies, and memory management of MemSearcher agents. Specifically, multi-context GRPO samples groups of trajectories under different contexts and propagates trajectory-level advantages across all conversations within them. Trained on the same dataset as Search-R1, MemSearcher achieves significant improvements over strong baselines on seven public benchmarks, with relative average gains of +11% on Qwen2.5-3B-Instruct and +12% on Qwen2.5-7B-Instruct. Notably, the 3B-based MemSearcher even outperforms 7B-based baselines, demonstrating that striking a balance between information integrity and efficiency yields both higher accuracy and lower computational overhead. The code and models will be publicly available at this https URL.
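
The idea of propagating a trajectory-level advantage to every conversation inside that trajectory can be sketched in a few lines of Python. The abstract does not spell out the normalization, so the mean and standard deviation scaling below follows the common GRPO-style group normalization as an assumption. The function names, the reward values, and the conversation counts are hypothetical.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each trajectory's reward against its sampling group (assumed form)."""
    center = mean(rewards)
    spread = pstdev(rewards)
    return [(reward - center) / (spread + eps) for reward in rewards]

def propagate_to_conversations(advantages, conversations_per_trajectory):
    """Give every conversation in a trajectory that trajectory's advantage."""
    return [
        [advantage] * count
        for advantage, count in zip(advantages, conversations_per_trajectory)
    ]

# One group of four sampled trajectories with hypothetical final rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
# Each trajectory spans a different number of conversations (contexts).
conversations_per_trajectory = [3, 2, 4, 1]

advantages = group_relative_advantages(rewards)
per_conversation = propagate_to_conversations(advantages, conversations_per_trajectory)

for index, values in enumerate(per_conversation):
    print(f"trajectory {index}: {len(values)} conversations, advantage {values[0]:+.3f}")
```

In this sketch every conversation in a trajectory shares one scalar, so a policy-gradient update would credit the reasoning, search, and memory-update steps of a successful trajectory equally.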

Title: Oolong: Evaluating Long Context Reasoning and Aggregation Capabilities

Authors: Amanda Bertsch, Adithya Pratapa, Teruko Mitamura, Graham Neubig, Matthew R. Gormley
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.02817
Pdf URL: https://arxiv.org/pdf/2511.02817
Copy Paste: [[2511.02817]] Oolong: Evaluating Long Context Reasoning and Aggregation Capabilities(https://arxiv.org/abs/2511.02817)
Keywords: gpt, long context
Abstract: As model context lengths continue to grow, concerns about whether models effectively use the full context length have persisted. While several carefully designed long-context evaluations have recently been released, these evaluations tend to rely on retrieval from one or more sections of the context, which allows nearly all of the context tokens to be disregarded as noise. This represents only one type of task that might be performed with long context. We introduce Oolong, a benchmark of long-context reasoning tasks that require analyzing individual chunks of text on an atomic level, and then aggregating these analyses to answer distributional questions. Oolong is separated into two task sets: Oolong-synth, a set of naturalistic synthetic tasks, where we can easily ablate components of the reasoning problem; and Oolong-real, a downstream setting which requires reasoning over real-world conversational data. Oolong requires models to reason over large quantities of examples, to perform both classification and counting in-context, and to reason over temporal and user relations. Even frontier models struggle on Oolong, with GPT-5, Claude-Sonnet-4, and Gemini-2.5-Pro all achieving less than 50% accuracy on both splits at 128K. We release the data and evaluation harness for Oolong to enable further development of models that can reason over large quantities of text.
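
The analyze-then-aggregate task shape described in this abstract can be shown with a toy Python sketch. It is not the Oolong data or evaluation harness: the keyword classifier stands in for whatever per-chunk analysis a model would perform, and the labels, the example chunks, and the function names are invented for illustration.

```python
from collections import Counter

def classify_chunk(chunk):
    """Toy stand-in for per-chunk analysis: label a chunk by a keyword rule."""
    text = chunk.lower()
    if "refund" in text:
        return "billing"
    if "crash" in text or "error" in text:
        return "bug"
    return "other"

def label_distribution(chunks):
    """Classify every chunk independently, then aggregate the labels into counts."""
    return Counter(classify_chunk(chunk) for chunk in chunks)

# Hypothetical context split into chunks.
chunks = [
    "The app shows an error after login.",
    "I would like a refund for last month.",
    "Crash when opening settings.",
    "Thanks for the quick reply!",
]

counts = label_distribution(chunks)
# A distributional question: which label is most frequent, and how often?
most_common_label, frequency = counts.most_common(1)[0]
print(most_common_label, frequency)
```

Unlike a retrieval question, the answer here depends on every chunk, so no part of the context can be discarded as noise.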