2025-12-16

Title: Enhancing Urban Visual Place Recognition for Crowdsourced Flood Imagery via LLM-Guided Attention

Authors: Fengyi Xu, Jun Ma, Waishan Qiu, Cui Guo
Subjects: cs.CL, cs.AI, cs.CV, cs.CY
Abstract URL: https://arxiv.org/abs/2512.11811
Pdf URL: https://arxiv.org/pdf/2512.11811
Copy Paste: [[2512.11811]] Enhancing Urban Visual Place Recognition for Crowdsourced Flood Imagery via LLM-Guided Attention(https://arxiv.org/abs/2512.11811)
Keywords: language model, llm
Abstract: Crowdsourced street-view imagery from social media provides valuable real-time visual evidence of urban flooding and other crisis events, yet it often lacks reliable geographic metadata for emergency response. Existing Visual Place Recognition (VPR) models exhibit substantial performance degradation when applied to such imagery due to visual distortions and domain shifts inherent in cross-source scenarios. This paper presents VPR-AttLLM, a model-agnostic framework that integrates the semantic reasoning and geospatial knowledge of Large Language Models (LLMs) into established VPR pipelines through attention-guided descriptor enhancement. By leveraging LLMs to identify location-informative regions within the city context and suppress transient visual noise, VPR-AttLLM improves retrieval performance without requiring model retraining or additional data. Comprehensive evaluations are conducted on extended benchmarks including SF-XL enriched with real social-media flood images, synthetic flooding scenarios over established query sets and Mapillary photos, and a new HK-URBAN dataset capturing morphologically distinct cityscapes. Integrating VPR-AttLLM with three state-of-the-art VPR models (CosPlace, EigenPlaces, and SALAD) consistently improves recall performance, yielding relative gains typically between 1-3% and reaching up to 8% on the most challenging real flood imagery. Beyond measurable gains in retrieval accuracy, this study establishes a generalizable paradigm for LLM-guided multimodal fusion in visual retrieval systems. By embedding principles from urban perception theory into attention mechanisms, VPR-AttLLM bridges human-like spatial reasoning with modern VPR architectures. Its plug-and-play design, strong cross-source robustness, and interpretability highlight its potential for scalable urban monitoring and rapid geo-localization of crowdsourced crisis imagery.

Title: Reinforcement Learning for Latent-Space Thinking in LLMs

Authors: Enes Özeren, Matthias Aßenmacher
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2512.11816
Pdf URL: https://arxiv.org/pdf/2512.11816
Copy Paste: [[2512.11816]] Reinforcement Learning for Latent-Space Thinking in LLMs(https://arxiv.org/abs/2512.11816)
Keywords: llm, chain-of-thought
Abstract: Chain-of-Thought (CoT) reasoning typically utilizes the discrete language space for thinking, which is inherently inefficient, as many generated tokens only enforce linguistic rules that are not required for reasoning. To bypass this, latent-space thinking allows models to think using the continuous embedding space. While existing methods for training those models show domain-specific gains, they fail to maintain performance in complex tasks, such as mathematical reasoning. We experimentally demonstrate that the Coconut approach, a form of supervised fine-tuning for latent-space thinking, is highly sensitive to design choices and exhibits several inherent limitations. To address these issues, we investigate reinforcement learning (RL) techniques -- an underexplored direction in latent-space thinking -- including GRPO and design a novel Latent RL method for directly optimizing the latent thinking steps. Our experimental results reveal that these RL-trained models still lag behind traditional language-space CoT models in the mathematical reasoning domain. We make our codebase publicly available.

Title: Direct Confidence Alignment: Aligning Verbalized Confidence with Internal Confidence In Large Language Models

Authors: Glenn Zhang, Treasure Mayowa, Jason Fan, Yicheng Fu, Aaron Sandoval, Sean O'Brien, Kevin Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.11998
Pdf URL: https://arxiv.org/pdf/2512.11998
Copy Paste: [[2512.11998]] Direct Confidence Alignment: Aligning Verbalized Confidence with Internal Confidence In Large Language Models(https://arxiv.org/abs/2512.11998)
Keywords: language model, llm
Abstract: Producing trustworthy and reliable Large Language Models (LLMs) has become increasingly important as their usage becomes more widespread. Calibration seeks to achieve this by improving the alignment between the model's confidence and the actual likelihood of its responses being correct or desirable. However, it has been observed that the internal confidence of a model, derived from token probabilities, is not well aligned with its verbalized confidence, leading to misleading results with different calibration methods. In this paper, we propose Direct Confidence Alignment (DCA), a method using Direct Preference Optimization to align an LLM's verbalized confidence with its internal confidence rather than ground-truth accuracy, enhancing model transparency and reliability by ensuring closer alignment between the two confidence measures. We evaluate DCA across multiple open-weight LLMs on a wide range of datasets. To further assess this alignment, we also introduce three new calibration error-based metrics. Our results show that DCA improves alignment metrics on certain model architectures, reducing inconsistencies in a model's confidence expression. However, we also show that it can be ineffective on others, highlighting the need for more model-aware approaches in the pursuit of more interpretable and trustworthy LLMs.
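For readers unfamiliar with calibration error, the sketch below shows the standard binned expected calibration error (ECE) and a simple gap between two confidence measures. It is an illustration only: the abstract does not define the paper's three new metrics, and `mean_confidence_gap` is a hypothetical helper, not one of them.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: weighted mean of |avg confidence - accuracy| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece


def mean_confidence_gap(verbalized, internal):
    """Hypothetical helper: mean absolute gap between the two confidence measures."""
    return sum(abs(v - i) for v, i in zip(verbalized, internal)) / len(verbalized)


# Four answers stated at 90% confidence, three of them correct.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
print(mean_confidence_gap([0.9, 0.6], [0.7, 0.6]))
```

A lower gap would mean the stated confidence tracks the token-probability confidence more closely, which is the direction of alignment the abstract describes.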

Title: Hold Onto That Thought: Assessing KV Cache Compression On Reasoning

Authors: Minghui Liu, Aadi Palnitkar, Tahseen Rabbani, Hyunwoo Jae, Kyle Rui Sang, Dixi Yao, Shayan Shabihi, Fuheng Zhao, Tian Li, Ce Zhang, Furong Huang, Kunpeng Zhang
Subjects: cs.CL, cs.AI, cs.PF
Abstract URL: https://arxiv.org/abs/2512.12008
Pdf URL: https://arxiv.org/pdf/2512.12008
Copy Paste: [[2512.12008]] Hold Onto That Thought: Assessing KV Cache Compression On Reasoning(https://arxiv.org/abs/2512.12008)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have demonstrated remarkable performance on long-context tasks, but are often bottlenecked by memory constraints. Namely, the KV cache, which is used to significantly speed up attention computations, grows linearly with context length. A suite of compression algorithms has been introduced to alleviate cache growth by evicting unimportant tokens. However, several popular strategies are targeted towards the prefill phase, i.e., processing long prompt context, and their performance is rarely assessed on reasoning tasks requiring long decoding. In particular, short but complex prompts, such as those in benchmarks like GSM8K and MATH500, often benefit from multi-step reasoning and self-reflection, resulting in thinking sequences thousands of tokens long. In this work, we benchmark the performance of several popular compression strategies on long-reasoning tasks. For the non-reasoning Llama-3.1-8B-Instruct, we determine that no singular strategy fits all, and that performance is heavily influenced by dataset type. However, we discover that H2O and our decoding-enabled variant of SnapKV are dominant strategies for reasoning models, indicating the utility of heavy-hitter tracking for reasoning traces. We also find that eviction strategies at low budgets can produce longer reasoning traces, revealing a tradeoff between cache size and inference costs.
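The two ideas in this abstract, linear cache growth and eviction of unimportant tokens, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the H2O or SnapKV implementation: the function names, the `recent` window, and the model dimensions in the example are all hypothetical.

```python
def kv_cache_bytes(seq_len, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Cache size grows linearly in seq_len: one key and one value vector
    per token, per layer, per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * seq_len


def keep_heavy_hitters(cum_scores, budget, recent=2):
    """Return sorted indices of cached tokens to keep.

    cum_scores[i] is the attention mass token i has accumulated so far.
    The last `recent` tokens are always kept; the remaining slots go to
    the older tokens with the highest accumulated scores.
    """
    n = len(cum_scores)
    if n <= budget:
        return list(range(n))
    recent = min(recent, budget)
    recent_idx = list(range(n - recent, n))
    older = range(n - recent)
    heavy = sorted(older, key=lambda i: cum_scores[i], reverse=True)[: budget - recent]
    return sorted(heavy + recent_idx)


# Illustrative dimensions only (assumed, not taken from the paper).
print(kv_cache_bytes(seq_len=1000, layers=32, kv_heads=8, head_dim=128))
print(keep_heavy_hitters([0.9, 0.1, 0.5, 0.05, 0.2, 0.3], budget=4))
```

During long decoding the eviction step would run each time the cache exceeds its budget, which is why the choice of scoring rule matters for reasoning traces thousands of tokens long.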

Title: Benchmarking Contextual Understanding for In-Car Conversational Systems

Authors: Philipp Habicht, Lev Sorokin, Abdullah Saydemir, Ken E. Friedl, Andrea Stocco
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.12042
Pdf URL: https://arxiv.org/pdf/2512.12042
Copy Paste: [[2512.12042]] Benchmarking Contextual Understanding for In-Car Conversational Systems(https://arxiv.org/abs/2512.12042)
Keywords: language model, llm, prompt, chain-of-thought, agent
Abstract: In-Car Conversational Question Answering (ConvQA) systems significantly enhance user experience by enabling seamless voice interactions. However, assessing their accuracy and reliability remains a challenge. This paper explores the use of Large Language Models (LLMs) alongside advanced prompting techniques and agent-based methods to evaluate the extent to which ConvQA system responses adhere to user utterances. The focus lies on contextual understanding and the ability to provide accurate venue recommendations considering user constraints and situational context. To evaluate utterance-response coherence using an LLM, we synthetically generate user utterances accompanied by correct and modified failure-containing system responses. We use input-output, chain-of-thought, self-consistency prompting, and multi-agent prompting techniques with 13 reasoning and non-reasoning LLMs of varying sizes and providers, including OpenAI, DeepSeek, Mistral AI, and Meta. We evaluate our approach on a case study involving restaurant recommendations. The most substantial improvements occur for small non-reasoning models when applying advanced prompting techniques, particularly multi-agent prompting. However, reasoning models consistently outperform non-reasoning models, with the best performance achieved using single-agent prompting with self-consistency. Notably, DeepSeek-R1 reaches an F1-score of 0.99 at a cost of 0.002 USD per request. Overall, the best balance between effectiveness and cost-time efficiency is reached with the non-reasoning model DeepSeek-V3. Our findings show that LLM-based evaluation offers a scalable and accurate alternative to traditional human evaluation for benchmarking contextual understanding in ConvQA systems.
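Self-consistency prompting and the F1-score mentioned above can be illustrated with a short sketch. It assumes, hypothetically, that each sampled LLM judgment is reduced to a label such as "ok" or "fail"; the label names and function names are not from the paper.

```python
from collections import Counter


def self_consistent_verdict(samples):
    """Majority vote over several independently sampled judgments."""
    return Counter(samples).most_common(1)[0][0]


def f1_score(predicted, gold, positive="fail"):
    """F1 for detecting responses that contain a failure."""
    tp = sum(p == positive and g == positive for p, g in zip(predicted, gold))
    fp = sum(p == positive and g != positive for p, g in zip(predicted, gold))
    fn = sum(p != positive and g == positive for p, g in zip(predicted, gold))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Three sampled judgments for one utterance-response pair.
print(self_consistent_verdict(["fail", "ok", "fail"]))
print(f1_score(["fail", "ok", "fail", "ok"], ["fail", "ok", "ok", "fail"]))
```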

Title: VOYAGER: A Training Free Approach for Generating Diverse Datasets using LLMs

Authors: Avinash Amballa, Yashas Malur Saidutta, Chi-Heng Lin, Vivek Kulkarni, Srinivas Chappidi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2512.12072
Pdf URL: https://arxiv.org/pdf/2512.12072
Copy Paste: [[2512.12072]] VOYAGER: A Training Free Approach for Generating Diverse Datasets using LLMs(https://arxiv.org/abs/2512.12072)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly being used to generate synthetic datasets for the evaluation and training of downstream models. However, prior work has noted that such generated data lacks diversity. In this paper, we propose Voyager, a novel principled approach to generate diverse datasets. Our approach is iterative and directly optimizes a mathematical measure of dataset diversity using the machinery of determinantal point processes. Furthermore, our approach is training-free, applicable to closed-source models, and scalable. In addition to providing theoretical justification for the working of our method, we also demonstrate through comprehensive experiments that Voyager significantly outperforms popular baseline approaches by providing a 1.5-3x improvement in diversity.
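As background on determinantal point processes, the sketch below scores a subset by the determinant of its similarity kernel and grows the subset greedily. This is a generic illustration, not the Voyager algorithm: the greedy rule, the dot-product kernel, and all function names are assumptions.

```python
def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(a[r][i]))
        if abs(a[pivot][i]) < 1e-12:
            return 0.0
        if pivot != i:
            a[i], a[pivot] = a[pivot], a[i]
            det = -det
        det *= a[i][i]
        for r in range(i + 1, n):
            factor = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= factor * a[i][c]
    return det


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def greedy_diverse_subset(vectors, k):
    """Greedily add the item that maximizes the kernel determinant of the subset.
    Near-duplicate items make the kernel almost singular, so they score low."""
    selected = []
    for _ in range(k):
        best, best_det = None, -1.0
        for i in range(len(vectors)):
            if i in selected:
                continue
            idx = selected + [i]
            kernel = [[dot(vectors[a], vectors[b]) for b in idx] for a in idx]
            d = determinant(kernel)
            if d > best_det:
                best, best_det = i, d
        selected.append(best)
    return selected


# Items 0 and 1 are near-duplicates; item 2 is orthogonal to item 0.
print(greedy_diverse_subset([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]], k=2))
```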

Title: BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding

Authors: Jiayi Yuan, Cameron Shinn, Kai Xu, Jingze Cui, George Klimiashvili, Guangxuan Xiao, Perkz Zheng, Bo Li, Yuxin Zhou, Zhouhai Ye, Weijie You, Tian Zheng, Dominic Brown, Pengbo Wang, Richard Cai, Julien Demouth, John D. Owens, Xia Hu, Song Han, Timmy Liu, Huizi Mao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.12087
Pdf URL: https://arxiv.org/pdf/2512.12087
Copy Paste: [[2512.12087]] BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding(https://arxiv.org/abs/2512.12087)
Keywords: language model, llm
Abstract: The growing demand for long-context inference capabilities in Large Language Models (LLMs) has intensified the computational and memory bottlenecks inherent to the standard attention mechanism. To address this challenge, we introduce BLASST, a drop-in sparse attention method that dynamically prunes the attention matrix without any pre-computation or proxy scores. Our method uses a fixed threshold and existing information from online softmax to identify negligible attention scores, skipping softmax computation, Value block loading, and the subsequent matrix multiplication. This fits seamlessly into existing FlashAttention kernel designs with negligible latency overhead. The approach is applicable to both prefill and decode stages across all attention variants (MHA, GQA, MQA, and MLA), providing a unified solution for accelerating long-context inference. We develop an automated calibration procedure that reveals a simple inverse relationship between optimal threshold and context length, enabling robust deployment across diverse scenarios. Maintaining high accuracy, we demonstrate a 1.62x speedup for prefill at 74.7% sparsity and a 1.48x speedup for decode at 73.2% sparsity on modern GPUs. Furthermore, we explore sparsity-aware training as a natural extension, showing that models can be trained to be inherently more robust to sparse attention patterns, pushing the accuracy-sparsity frontier even further.
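The thresholding idea can be sketched for a single query with online softmax over key/value blocks. The skip rule used here (drop a block when exp(block max minus running max) is below the threshold) is an assumption about how "existing information from online softmax" might be used; the abstract does not give the exact criterion, and this pure-Python toy is not the BLASST kernel.

```python
import math


def blocked_attention(q, keys, values, block=2, threshold=None):
    """Single-query attention with online softmax over blocks.

    If `threshold` is set, a block is skipped when
    exp(block_max - running_max) < threshold, so neither its softmax
    terms nor its value rows are touched.
    Returns (output vector, number of skipped blocks).
    """
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    skipped = 0
    log_t = math.log(threshold) if threshold is not None else None
    for start in range(0, len(keys), block):
        scores = [sum(a * b for a, b in zip(q, k)) for k in keys[start:start + block]]
        block_max = max(scores)
        if log_t is not None and block_max - running_max < log_t:
            skipped += 1
            continue
        new_max = max(running_max, block_max)
        # Rescale what has been accumulated so far to the new running max.
        scale = math.exp(running_max - new_max)
        denom *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, values[start:start + block]):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc], skipped


q = [1.0, 0.0]
keys = [[10.0, 0.0], [9.0, 0.0], [-10.0, 0.0], [-9.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 5.0]]
dense, _ = blocked_attention(q, keys, values)
sparse, n_skipped = blocked_attention(q, keys, values, threshold=1e-3)
print(n_skipped, max(abs(a - b) for a, b in zip(dense, sparse)))
```

In this toy the second block carries almost no softmax mass, so skipping it changes the output only negligibly while avoiding the exponentials and the value accumulation for that block.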

Title: Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings

Authors: Yoav Gelberg, Koshi Eguchi, Takuya Akiba, Edoardo Cetin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.12167
Pdf URL: https://arxiv.org/pdf/2512.12167
Copy Paste: [[2512.12167]] Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings(https://arxiv.org/abs/2512.12167)
Keywords: language model, llm
Abstract: So far, expensive finetuning beyond the pretraining sequence length has been a requirement for effectively extending the context of language models (LM). In this work, we break this key bottleneck by Dropping the Positional Embeddings of LMs after training (DroPE). Our simple method is motivated by three key theoretical and empirical observations. First, positional embeddings (PEs) serve a crucial role during pretraining, providing an important inductive bias that significantly facilitates convergence. Second, over-reliance on this explicit positional information is also precisely what prevents test-time generalization to sequences of unseen length, even when using popular PE-scaling methods. Third, positional embeddings are not an inherent requirement of effective language modeling and can be safely removed after pretraining, following a short recalibration phase. Empirically, DroPE yields seamless zero-shot context extension without any long-context finetuning, quickly adapting pretrained LMs without compromising their capabilities in the original training context. Our findings hold across different models and dataset sizes, far outperforming previous specialized architectures and established rotary positional embedding scaling methods.

Title: Diffusion Language Model Inference with Monte Carlo Tree Search

Authors: Zheng Huang, Kiran Ramnath, Yueyan Chen, Aosong Feng, Sangmin Woo, Balasubramaniam Srinivasan, Zhichao Xu, Kang Zhou, Shuai Wang, Haibo Ding, Lin Lee Cheong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.12168
Pdf URL: https://arxiv.org/pdf/2512.12168
Copy Paste: [[2512.12168]] Diffusion Language Model Inference with Monte Carlo Tree Search(https://arxiv.org/abs/2512.12168)
Keywords: language model
Abstract: Diffusion language models (DLMs) have recently emerged as a compelling alternative to autoregressive generation, offering parallel generation and improved global coherence. During inference, DLMs generate text by iteratively denoising masked sequences in parallel; however, determining which positions to unmask and which tokens to commit forms a large combinatorial search problem. Existing inference methods approximate this search using heuristics, which often yield suboptimal decoding paths; other approaches instead rely on additional training to guide token selection. To provide a principled search mechanism for DLM inference, we introduce MEDAL, a framework that integrates Monte Carlo Tree SEarch initialization for Diffusion LAnguage Model inference. We employ Monte Carlo Tree Search at the initialization stage to explore promising unmasking trajectories, providing a robust starting point for subsequent refinement. This integration is enabled by restricting the search space to high-confidence actions and prioritizing token choices that improve model confidence over remaining masked positions. Across multiple benchmarks, MEDAL achieves up to 22.0% improvement over existing inference strategies, establishing a new paradigm for search-based inference in diffusion language models.

Title: Semantic Distance Measurement based on Multi-Kernel Gaussian Processes

Authors: Yinzhu Cheng, Haihua Xie, Yaqing Wang, Miao He, Mingming Sun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.12238
Pdf URL: https://arxiv.org/pdf/2512.12238
Copy Paste: [[2512.12238]] Semantic Distance Measurement based on Multi-Kernel Gaussian Processes(https://arxiv.org/abs/2512.12238)
Keywords: language model
Abstract: Semantic distance measurement is a fundamental problem in computational linguistics, providing a quantitative characterization of similarity or relatedness between text segments, and underpinning tasks such as text retrieval and text classification. From a mathematical perspective, a semantic distance can be viewed as a metric defined on a space of texts or on a representation space derived from them. However, most classical semantic distance methods are essentially fixed, making them difficult to adapt to specific data distributions and task requirements. In this paper, a semantic distance measure based on multi-kernel Gaussian processes (MK-GP) was proposed. The latent semantic function associated with texts was modeled as a Gaussian process, with its covariance function given by a combined kernel combining Matérn and polynomial components. The kernel parameters were learned automatically from data under supervision, rather than being hand-crafted. This semantic distance was instantiated and evaluated in the context of fine-grained sentiment classification with large language models under an in-context learning (ICL) setup. The experimental results demonstrated the effectiveness of the proposed measure.
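A combined Matérn-plus-polynomial kernel and a distance derived from it can be sketched as follows. The abstract does not say how the two components are combined or how the distance is obtained from the kernel, so the weighted sum, the Matérn 3/2 form, the kernel-induced distance, and every parameter value below are assumptions; in the paper the kernel parameters are learned from data rather than fixed.

```python
import math


def matern32(x, y, length_scale=1.0):
    """Matérn kernel with smoothness 3/2."""
    z = math.sqrt(3.0) * math.dist(x, y) / length_scale
    return (1.0 + z) * math.exp(-z)


def polynomial(x, y, degree=2, offset=1.0):
    """Polynomial kernel (x . y + offset) ** degree."""
    return (sum(a * b for a, b in zip(x, y)) + offset) ** degree


def combined_kernel(x, y, w_matern=0.5, w_poly=0.5):
    """Nonnegative weighted sum of the two kernels (assumed combination)."""
    return w_matern * matern32(x, y) + w_poly * polynomial(x, y)


def kernel_distance(x, y, kernel=combined_kernel):
    """Distance induced by a positive semi-definite kernel:
    d(x, y)^2 = k(x, x) + k(y, y) - 2 k(x, y)."""
    squared = kernel(x, x) + kernel(y, y) - 2.0 * kernel(x, y)
    return math.sqrt(max(squared, 0.0))


# Toy 2-D text embeddings (hypothetical values).
a, b = [1.0, 0.0], [0.0, 1.0]
print(kernel_distance(a, a), kernel_distance(a, b))
```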

Title: Market-Bench: Evaluating Large Language Models on Introductory Quantitative Trading and Market Dynamics

Authors: Abhay Srivastava, Sam Jung, Spencer Mateega
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.12264
Pdf URL: https://arxiv.org/pdf/2512.12264
Copy Paste: [[2512.12264]] Market-Bench: Evaluating Large Language Models on Introductory Quantitative Trading and Market Dynamics(https://arxiv.org/abs/2512.12264)
Keywords: language model, gpt, llm
Abstract: We introduce MARKET-BENCH, a benchmark that evaluates large language models (LLMs) on introductory quantitative trading tasks by asking them to construct executable backtesters from natural-language strategy descriptions and market assumptions. Each instance specifies one of three canonical strategies -- scheduled trading on Microsoft (NASDAQ: MSFT), pairs trading on Coca-Cola (NASDAQ: KO) and Pepsi (NASDAQ: PEP), or delta hedging on MSFT -- and models must produce code whose P&L, drawdown, and position paths match a verifiable reference implementation. We assess twelve state-of-the-art models using a multi-round pass@k metric that separates structural reliability (whether the backtest runs) from numerical accuracy (mean absolute error of the backtest metrics). While most models reliably execute the simplest strategy (average pass@3 of 0.80), errors vary by orders of magnitude across models and tasks: Gemini 3 Pro and Claude 4.5 Sonnet combine strong reliability with low error on simpler strategies, GPT-5.1 Codex-Max achieves perfect pass@1 on the first two strategies and the lowest best-run error on the easiest task, and Qwen3 Max attains perfect pass@3 yet sometimes produces inaccurate P&L paths. These results show that current LLMs can scaffold basic trading infrastructure but still struggle to reason robustly about prices, inventory, and risk; we release MARKET-BENCH and a public leaderboard at this https URL.
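The two quantities the abstract separates, whether a backtest runs and how far its numbers are from the reference, can be illustrated with the widely used unbiased pass@k estimator and a mean absolute error. This is a sketch under assumptions: the paper's "multi-round" pass@k may be defined differently, and the sample numbers are made up for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn from n attempts (c of which succeeded) is a success."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_absolute_error(predicted, reference):
    """Average absolute deviation of backtest metrics from the reference."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)


# Hypothetical run: 3 attempts, 1 produced an executable backtest.
print(pass_at_k(n=3, c=1, k=1), pass_at_k(n=3, c=1, k=3))
# Hypothetical P&L path versus a reference path.
print(mean_absolute_error([100.0, 102.0, 99.0], [100.0, 101.0, 101.0]))
```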

Title: SCIR: A Self-Correcting Iterative Refinement Framework for Enhanced Information Extraction Based on Schema

Authors: Yushen Fang, Jianjun Li, Mingqian Ding, Chang Liu, Xinchi Zou, Wenqi Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.12337
Pdf URL: https://arxiv.org/pdf/2512.12337
Copy Paste: [[2512.12337]] SCIR: A Self-Correcting Iterative Refinement Framework for Enhanced Information Extraction Based on Schema(https://arxiv.org/abs/2512.12337)
Keywords: language model, gpt, llm
Abstract: Although Large Language Model (LLM)-powered information extraction (IE) systems have shown impressive capabilities, current fine-tuning paradigms face two major limitations: high training costs and difficulties in aligning with LLM preferences. To address these issues, we propose a novel universal IE paradigm, the Self-Correcting Iterative Refinement (SCIR) framework, along with a Multi-task Bilingual (Chinese-English) Self-Correcting (MBSC) dataset containing over 100,000 entries. The SCIR framework achieves plug-and-play compatibility with existing LLMs and IE systems through its Dual-Path Self-Correcting module and feedback-driven optimization, thereby significantly reducing training costs. Concurrently, the MBSC dataset tackles the challenge of preference alignment by indirectly distilling GPT-4's capabilities into IE result detection models. Experimental results demonstrate that SCIR outperforms state-of-the-art IE methods across three key tasks: named entity recognition, relation extraction, and event extraction, achieving a 5.27 percent average improvement in span-based Micro-F1 while reducing training costs by 87 percent compared to baseline approaches. These advancements not only enhance the flexibility and accuracy of IE systems but also pave the way for lightweight and efficient IE paradigms.

Title: Can GPT replace human raters? Validity and reliability of machine-generated norms for metaphors

Authors: Veronica Mangiaterra, Hamad Al-Azary, Chiara Barattieri di San Pietro, Paolo Canal, Valentina Bambini
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.12444
Pdf URL: https://arxiv.org/pdf/2512.12444
Copy Paste: [[2512.12444]] Can GPT replace human raters? Validity and reliability of machine-generated norms for metaphors(https://arxiv.org/abs/2512.12444)
Keywords: language model, gpt, llm
Abstract: As Large Language Models (LLMs) are increasingly being used in scientific research, the issue of their trustworthiness becomes crucial. In psycholinguistics, LLMs have been recently employed in automatically augmenting human-rated datasets, with promising results obtained by generating ratings for single words. Yet, performance for ratings of complex items, i.e., metaphors, is still unexplored. Here, we present the first assessment of the validity and reliability of ratings of metaphors on familiarity, comprehensibility, and imageability, generated by three GPT models for a total of 687 items gathered from the Italian Figurative Archive and three English studies. We performed a thorough validation in terms of both alignment with human data and ability to predict behavioral and electrophysiological responses. We found that machine-generated ratings positively correlated with human-generated ones. Familiarity ratings reached moderate-to-strong correlations for both English and Italian metaphors, although correlations weakened for metaphors with high sensorimotor load. Imageability showed moderate correlations in English and moderate-to-strong in Italian. Comprehensibility for English metaphors exhibited the strongest correlations. Overall, larger models outperformed smaller ones and greater human-model misalignment emerged with familiarity and imageability. Machine-generated ratings significantly predicted response times and the EEG amplitude, with a strength comparable to human ratings. Moreover, GPT ratings obtained across independent sessions were highly stable. We conclude that GPT, especially larger models, can validly and reliably replace - or augment - human subjects in rating metaphor properties. Yet, LLMs align worse with humans when dealing with conventionality and multimodal aspects of metaphorical meaning, calling for careful consideration of the nature of stimuli.

Title: Large language models have learned to use language

Authors: Gary Lupyan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.12447
Pdf URL: https://arxiv.org/pdf/2512.12447
Copy Paste: [[2512.12447]] Large language models have learned to use language(https://arxiv.org/abs/2512.12447)
Keywords: language model
Abstract: Acknowledging that large language models have learned to use language can open doors to breakthrough language science. Achieving these breakthroughs may require abandoning some long-held ideas about how language knowledge is evaluated and reckoning with the difficult fact that we have entered a post-Turing test era.

Title: The American Ghost in the Machine: How language models align culturally and the effects of cultural prompting

Authors: James Luther, Donald Brown
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.12488
Pdf URL: https://arxiv.org/pdf/2512.12488
Copy Paste: [[2512.12488]] The American Ghost in the Machine: How language models align culturally and the effects of cultural prompting(https://arxiv.org/abs/2512.12488)
Keywords: language model, gpt, llm, prompt
Abstract: Culture is the bedrock of human interaction; it dictates how we perceive and respond to everyday interactions. As the field of human-computer interaction grows via the rise of generative Large Language Models (LLMs), the cultural alignment of these models becomes an important field of study. This work, using the VSM13 International Survey and Hofstede's cultural dimensions, identifies the cultural alignment of popular LLMs (DeepSeek-V3, V3.1, GPT-5, GPT-4.1, GPT-4, Claude Opus 4, Llama 3.1, and Mistral Large). We then use cultural prompting, i.e., using system prompts to shift the cultural alignment of a model toward a desired country, to test the adaptability of these models to other cultures, namely China, France, India, Iran, Japan, and the United States. We find that the majority of the eight LLMs tested favor the United States when the culture is not specified, with varying results when prompted for other cultures. When using cultural prompting, seven of the eight models shifted closer to the expected culture. We find that models had trouble aligning with Japan and China, despite two of the models tested originating with the Chinese company DeepSeek.

Title: NagaNLP: Bootstrapping NLP for Low-Resource Nagamese Creole with Human-in-the-Loop Synthetic Data

Authors: Agniva Maiti, Manya Pandey, Murari Mandal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.12537
Pdf URL: https://arxiv.org/pdf/2512.12537
Copy Paste: [[2512.12537]] NagaNLP: Bootstrapping NLP for Low-Resource Nagamese Creole with Human-in-the-Loop Synthetic Data(https://arxiv.org/abs/2512.12537)
Keywords: llm
Abstract: The vast majority of the world's languages, particularly creoles like Nagamese, remain severely under-resourced in Natural Language Processing (NLP), creating a significant barrier to their representation in digital technology. This paper introduces NagaNLP, a comprehensive open-source toolkit for Nagamese, bootstrapped through a novel methodology that relies on LLM-driven but human-validated synthetic data generation. We detail a multi-stage pipeline where an expert-guided LLM (Gemini) generates a candidate corpus, which is then refined and annotated by native speakers. This synthetic-hybrid approach yielded a 10K-pair conversational dataset and a high-quality annotated corpus for foundational tasks. To assess the effectiveness of our methodology, we trained both discriminative and generative models. Our fine-tuned XLM-RoBERTa-base model establishes a new benchmark for Nagamese, achieving 93.81% accuracy (0.90 F1-Macro) on Part-of-Speech tagging and 0.75 F1-Macro on Named Entity Recognition, massively outperforming strong zero-shot baselines. Furthermore, we fine-tuned a Llama-3.2-3B Instruct model, named NagaLLaMA, which demonstrates superior performance on conversational tasks, achieving a Perplexity of 3.85, an order of magnitude improvement over its few-shot counterpart (96.76). We release the NagaNLP toolkit, including all datasets, models, and code, providing a foundational resource for a previously underserved language and a reproducible framework for reducing data scarcity in other low-resource contexts.
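For readers comparing the two perplexity figures quoted in the NagaNLP abstract, the following is a minimal sketch of how perplexity is conventionally computed from per-token log-probabilities. The function name and the toy probabilities are illustrative assumptions, not taken from the NagaNLP code.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical example: a model that assigns probability 1/4 to every token.
toy_log_probs = [math.log(0.25)] * 8
print(perplexity(toy_log_probs))  # about 4.0: as uncertain as a uniform 4-way choice
```

Under this definition, the reported perplexities of 3.85 and 96.76 correspond to roughly 1.35 and 4.57 nats of average per-token loss, respectively.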

Title: HyperEdit: Unlocking Instruction-based Text Editing in LLMs via Hypernetworks

Authors: Yiming Zeng, Jinghan Cao, Zexin Li, Wanhao Yu, Zhankai Ye, Dawei Xiang, Ting Hua, Xin Liu, Shangqian Gao, Tingting Yu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2512.12544
Pdf URL: https://arxiv.org/pdf/2512.12544
Copy Paste: [[2512.12544]] HyperEdit: Unlocking Instruction-based Text Editing in LLMs via Hypernetworks(https://arxiv.org/abs/2512.12544)
Keywords: language model, llm
Abstract: Instruction-based text editing is increasingly critical for real-world applications such as code editors (e.g., Cursor), but Large Language Models (LLMs) continue to struggle with this task. Unlike free-form generation, editing requires faithfully implementing user instructions while preserving unchanged content, as even minor unintended modifications can break functionality. Existing approaches treat editing as generic text generation, leading to two key failures: they struggle to faithfully align edits with diverse user intents, and they often over-edit unchanged regions. We propose HyperEdit to address both issues. First, we introduce hypernetwork-based dynamic adaptation that generates request-specific parameters, enabling the model to tailor its editing strategy to each instruction. Second, we develop difference-aware regularization that focuses supervision on modified spans, preventing over-editing while ensuring precise, minimal changes. HyperEdit achieves a 9% to 30% relative improvement in BLEU on modified regions over state-of-the-art baselines, despite utilizing only 3B parameters.
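The HyperEdit abstract does not spell out its difference-aware regularization, so the following is only a rough sketch of the general idea of focusing supervision on modified spans. It assumes a token-level diff (via Python's standard `difflib`) and per-token loss weights; the function names and the weight values 1.0 and 0.1 are hypothetical.

```python
import difflib


def modified_token_weights(source_tokens, target_tokens,
                           changed_weight=1.0, unchanged_weight=0.1):
    """Return one loss weight per target token, higher where the edit changed text."""
    weights = [unchanged_weight] * len(target_tokens)
    matcher = difflib.SequenceMatcher(a=source_tokens, b=target_tokens, autojunk=False)
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "replace" and "insert" cover target tokens that differ from the source.
        # Pure deletions have no target tokens, so they receive no weight here.
        if tag in ("replace", "insert"):
            for j in range(j1, j2):
                weights[j] = changed_weight
    return weights


def weighted_loss(per_token_losses, weights):
    """Weighted mean of per-token losses."""
    return sum(w * l for w, l in zip(weights, per_token_losses)) / sum(weights)


# Hypothetical edit: fix an operator in a one-line function.
source = "def add ( a , b ) : return a - b".split()
target = "def add ( a , b ) : return a + b".split()
weights = modified_token_weights(source, target)
print(weights)
```

In this toy case only the replaced operator token receives the full weight, so a training loss built on these weights would concentrate on the edited span while still lightly penalizing drift elsewhere.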

Title: Coupled Variational Reinforcement Learning for Language Model General Reasoning

Authors: Xueru Wen, Jie Lou, Yanjiang Liu, Hongyu Lin, Ben He, Xianpei Han, Le Sun, Yaojie Lu, Debing Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.12576
Pdf URL: https://arxiv.org/pdf/2512.12576
Copy Paste: [[2512.12576]] Coupled Variational Reinforcement Learning for Language Model General Reasoning(https://arxiv.org/abs/2512.12576)
Keywords: language model, llm
Abstract: While reinforcement learning methods have achieved impressive progress in language model reasoning, they are constrained by the requirement for verifiable rewards. Recent verifier-free RL methods address this limitation by utilizing the intrinsic probabilities of LLMs generating reference answers as reward signals. However, these approaches typically sample reasoning traces conditioned only on the question. This design decouples reasoning-trace sampling from answer information, leading to inefficient exploration and incoherence between traces and final answers. In this paper, we propose Coupled Variational Reinforcement Learning (CoVRL), which bridges variational inference and reinforcement learning by coupling prior and posterior distributions through a hybrid sampling strategy. By constructing and optimizing a composite distribution that integrates these two distributions, CoVRL enables efficient exploration while preserving strong thought-answer coherence. Extensive experiments on mathematical and general reasoning benchmarks show that CoVRL improves performance by 12.4% over the base model and achieves an additional 2.3% improvement over strong state-of-the-art verifier-free RL baselines, providing a principled framework for enhancing the general reasoning capabilities of language models.

Title: Human-Inspired Learning for Large Language Models via Obvious Record and Maximum-Entropy Method Discovery

Authors: Hong Su
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.12608
Pdf URL: https://arxiv.org/pdf/2512.12608
Copy Paste: [[2512.12608]] Human-Inspired Learning for Large Language Models via Obvious Record and Maximum-Entropy Method Discovery(https://arxiv.org/abs/2512.12608)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) excel at extracting common patterns from large-scale corpora, yet they struggle with rare, low-resource, or previously unseen scenarios (such as niche hardware deployment issues or irregular IoT device behaviors) because such cases are sparsely represented in training data. Moreover, LLMs rely primarily on implicit parametric memory, which limits their ability to explicitly acquire, recall, and refine methods, causing them to behave predominantly as intuition-driven predictors rather than deliberate, method-oriented learners. Inspired by how humans learn from rare experiences, this paper proposes a human-inspired learning framework that integrates two complementary mechanisms. The first, Obvious Record, explicitly stores cause-result (or question-solution) relationships as symbolic memory, enabling persistent learning even from single or infrequent encounters. The second, Maximum-Entropy Method Discovery, prioritizes and preserves methods with high semantic dissimilarity, allowing the system to capture diverse and underrepresented strategies that are typically overlooked by next-token prediction. Verification on a benchmark of 60 semantically diverse question-solution pairs demonstrates that the proposed entropy-guided approach achieves stronger coverage of unseen questions and significantly greater internal diversity than a random baseline, confirming its effectiveness in discovering more generalizable and human-inspired methods.
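The abstract above describes preserving methods with high semantic dissimilarity but does not give the selection rule. As a stand-in, the sketch below uses greedy farthest-point selection over embedding vectors with cosine distance. This is a common diversity heuristic, not the paper's entropy criterion, and the embeddings shown are made-up toy vectors.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def select_diverse(embeddings, k):
    """Greedy farthest-point selection.

    Starts from index 0, then repeatedly adds the candidate whose minimum
    distance to the already chosen items is largest.
    """
    if k <= 0 or not embeddings:
        return []
    chosen = [0]
    while len(chosen) < min(k, len(embeddings)):
        best_idx, best_score = None, -1.0
        for i, emb in enumerate(embeddings):
            if i in chosen:
                continue
            score = min(cosine_distance(emb, embeddings[j]) for j in chosen)
            if score > best_score:
                best_idx, best_score = i, score
        chosen.append(best_idx)
    return chosen


# Hypothetical 2-D "method embeddings": items 0 and 1 are near-duplicates.
toy = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(select_diverse(toy, 3))  # [0, 3, 2]: the near-duplicate (index 1) is skipped
```

The design choice here is to score each candidate by its distance to the closest stored item, which penalizes near-duplicates more directly than an average distance would.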

Title: Understanding Syllogistic Reasoning in LLMs from Formal and Natural Language Perspectives

Authors: Aheli Poddar (1), Saptarshi Sahoo (2), Sujata Ghosh (2) ((1) Institute of Engineering & Management, Kolkata, (2) Indian Statistical Institute, Chennai)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.12620
Pdf URL: https://arxiv.org/pdf/2512.12620
Copy Paste: [[2512.12620]] Understanding Syllogistic Reasoning in LLMs from Formal and Natural Language Perspectives(https://arxiv.org/abs/2512.12620)
Keywords: language model, llm
Abstract: We study syllogistic reasoning in LLMs from the logical and natural language perspectives. In the process, we explore the fundamental reasoning capabilities of LLMs and the direction in which this research is moving. To aid our study, we use 14 large language models and investigate their syllogistic reasoning capabilities in terms of symbolic inference as well as natural language understanding. Even though this reasoning mechanism is not a uniform emergent property across LLMs, the perfect symbolic performance of certain models makes us wonder whether LLMs are becoming more and more formal reasoning mechanisms, rather than making explicit the nuances of human reasoning.

Title: LexRel: Benchmarking Legal Relation Extraction for Chinese Civil Cases

Authors: Yida Cai, Ranjuexiao Hu, Huiyuan Xie, Chenyang Li, Yun Liu, Yuxiao Ye, Zhenghao Liu, Weixing Shen, Zhiyuan Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.12643
Pdf URL: https://arxiv.org/pdf/2512.12643
Copy Paste: [[2512.12643]] LexRel: Benchmarking Legal Relation Extraction for Chinese Civil Cases(https://arxiv.org/abs/2512.12643)
Keywords: language model, llm
Abstract: Legal relations form a highly consequential analytical framework of the civil law system, serving as a crucial foundation for resolving disputes and realizing the values of the rule of law in judicial practice. However, legal relations in Chinese civil cases remain underexplored in the field of legal artificial intelligence (legal AI), largely due to the absence of comprehensive schemas. In this work, we first introduce a comprehensive schema, which contains a hierarchical taxonomy and definitions of arguments, for AI systems to capture legal relations in Chinese civil cases. Based on this schema, we then formulate the legal relation extraction task and present LexRel, an expert-annotated benchmark for legal relation extraction in Chinese civil law. We use LexRel to evaluate state-of-the-art large language models (LLMs) on legal relation extraction, showing that current LLMs exhibit significant limitations in accurately identifying civil legal relations. Furthermore, we demonstrate that incorporating legal relations information leads to consistent performance gains on other downstream legal AI tasks.

Title: Fine-Tuning Causal LLMs for Text Classification: Embedding-Based vs. Instruction-Based Approaches

Authors: Amirhossein Yousefiramandi, Ciaran Cooney
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.12677
Pdf URL: https://arxiv.org/pdf/2512.12677
Copy Paste: [[2512.12677]] Fine-Tuning Causal LLMs for Text Classification: Embedding-Based vs. Instruction-Based Approaches(https://arxiv.org/abs/2512.12677)
Keywords: language model, llm, prompt
Abstract: We explore efficient strategies to fine-tune decoder-only Large Language Models (LLMs) for downstream text classification under resource constraints. Two approaches are investigated: (1) attaching a classification head to a pre-trained causal LLM and fine-tuning on the task (using the LLM's final token embedding as a sequence representation), and (2) instruction-tuning the LLM in a prompt->response format for classification. To enable single-GPU fine-tuning of models up to 8B parameters, we combine 4-bit model quantization with Low-Rank Adaptation (LoRA) for parameter-efficient training. Experiments on two datasets (a proprietary single-label dataset and the public WIPO-Alpha patent dataset, an extreme multi-label classification task) show that the embedding-based method significantly outperforms the instruction-tuned method in F1-score and is very competitive with, and even surpasses, fine-tuned domain-specific models (e.g. BERT) on the same tasks. These results demonstrate that directly leveraging the internal representations of causal LLMs, along with efficient fine-tuning techniques, yields impressive classification performance under limited computational resources. We discuss the advantages of each approach while outlining practical guidelines and future directions for optimizing LLM fine-tuning in classification scenarios.
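Approach (1) in the abstract above, using the final token embedding as the sequence representation and feeding it to a classification head, can be sketched without any deep-learning library. The code below is a simplified illustration in plain Python. It assumes right-padded sequences and a single linear head, and all tensors are hypothetical toy values rather than outputs of a real LLM.

```python
def last_token_embedding(hidden_states, attention_mask):
    """Return the hidden vector of the last non-padding token.

    hidden_states: list of per-token vectors for one sequence.
    attention_mask: list of 1 (real token) / 0 (padding), right-padded.
    """
    last = max(i for i, m in enumerate(attention_mask) if m == 1)
    return hidden_states[last]


def linear_head(embedding, weights, biases):
    """One logit per class: logits[c] = dot(weights[c], embedding) + biases[c]."""
    return [sum(w * x for w, x in zip(row, embedding)) + b
            for row, b in zip(weights, biases)]


def predict(hidden_states, attention_mask, weights, biases):
    """Class index with the highest logit."""
    pooled = last_token_embedding(hidden_states, attention_mask)
    logits = linear_head(pooled, weights, biases)
    return max(range(len(logits)), key=logits.__getitem__)


# Toy sequence of three positions, the last of which is padding.
hidden = [[0.1, 0.2], [0.9, -0.3], [0.0, 0.0]]
mask = [1, 1, 0]
head_w = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical 2-class head
head_b = [0.0, 0.0]
print(predict(hidden, mask, head_w, head_b))  # 0
```

The last real token is used because, in a causal model, it is the only position that has attended to the entire input. In a setup like the paper's, the head and the LoRA adapters would be the trained parameters while the quantized backbone stays frozen.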

Title: CoDA: A Context-Decoupled Hierarchical Agent with Reinforcement Learning

Authors: Xuanzhang Liu, Jianglun Feng, Zhuoran Zhuang, Junzhe Zhao, Maofei Que, Jieting Li, Dianlei Wang, Hao Tong, Ye Chen, Pan Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.12716
Pdf URL: https://arxiv.org/pdf/2512.12716
Copy Paste: [[2512.12716]] CoDA: A Context-Decoupled Hierarchical Agent with Reinforcement Learning(https://arxiv.org/abs/2512.12716)
Keywords: language model, llm, agent
Abstract: Large Language Model (LLM) agents trained with reinforcement learning (RL) show great promise for solving complex, multi-step tasks. However, their performance is often crippled by "Context Explosion", where the accumulation of long text outputs overwhelms the model's context window and leads to reasoning failures. To address this, we introduce CoDA, a Context-Decoupled hierarchical Agent, a simple but effective reinforcement learning framework that decouples high-level planning from low-level execution. It employs a single, shared LLM backbone that learns to operate in two distinct, contextually isolated roles: a high-level Planner that decomposes tasks within a concise strategic context, and a low-level Executor that handles tool interactions in an ephemeral, isolated workspace. We train this unified agent end-to-end using PECO (Planner-Executor Co-Optimization), a reinforcement learning methodology that applies a trajectory-level reward to jointly optimize both roles, fostering seamless collaboration through context-dependent policy updates. Extensive experiments demonstrate that CoDA achieves significant performance improvements over state-of-the-art baselines on complex multi-hop question-answering benchmarks, and it exhibits strong robustness in long-context scenarios, maintaining stable performance while all other baselines suffer severe degradation, thus further validating the effectiveness of our hierarchical design in mitigating context overload.

Title: NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents

Authors: Jingzhe Ding, Shengda Long, Changxin Pu, Huan Zhou, Hongwan Gao, Xiang Gao, Chao He, Yue Hou, Fei Hu, Zhaojian Li, Weiran Shi, Zaiyuan Wang, Daoguang Zan, Chenchen Zhang, Xiaoxu Zhang, Qizhi Chen, Xianfu Cheng, Bo Deng, Qingshui Gu, Kai Hua, Juntao Lin, Pai Liu, Mingchen Li, Xuanguang Pan, Zifan Peng, Yujia Qin, Yong Shan, Zhewen Tan, Weihao Xie, Zihan Wang, Yishuo Yuan, Jiayu Zhang, Enduo Zhao, Yunfei Zhao, He Zhu, Chenyang Zou, Ming Ding, Jianpeng Jiao, Jiaheng Liu, Minghao Liu, Qian Liu, Chongyao Tao, Jian Yang, Tong Yang, Zhaoxiang Zhang, Xinjie Chen, Wenhao Huang, Ge Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.12730
Pdf URL: https://arxiv.org/pdf/2512.12730
Copy Paste: [[2512.12730]] NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents(https://arxiv.org/abs/2512.12730)
Keywords: agent
Abstract: Recent advances in coding agents suggest rapid progress toward autonomous software development, yet existing benchmarks fail to rigorously evaluate the long-horizon capabilities required to build complete software systems. Most prior evaluations focus on localized code generation, scaffolded completion, or short-term repair tasks, leaving open the question of whether agents can sustain coherent reasoning, planning, and execution over the extended horizons demanded by real-world repository construction. To address this gap, we present NL2Repo Bench, a benchmark explicitly designed to evaluate the long-horizon repository generation ability of coding agents. Given only a single natural-language requirements document and an empty workspace, agents must autonomously design the architecture, manage dependencies, implement multi-module logic, and produce a fully installable Python library. Our experiments across state-of-the-art open- and closed-source models reveal that long-horizon repository generation remains largely unsolved: even the strongest agents achieve below 40% average test pass rates and rarely complete an entire repository correctly. Detailed analysis uncovers fundamental long-horizon failure modes, including premature termination, loss of global coherence, fragile cross-file dependencies, and inadequate planning over hundreds of interaction steps. NL2Repo Bench establishes a rigorous, verifiable testbed for measuring sustained agentic competence and highlights long-horizon reasoning as a central bottleneck for the next generation of autonomous coding agents.

Title: Curió-Edu 7B: Examining Data Selection Impacts in LLM Continued Pretraining

Authors: Thales Sales Almeida, Rodrigo Nogueira, Hélio Pedrini
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.12770
Pdf URL: https://arxiv.org/pdf/2512.12770
Copy Paste: [[2512.12770]] Curió-Edu 7B: Examining Data Selection Impacts in LLM Continued Pretraining(https://arxiv.org/abs/2512.12770)
Keywords: language model, llm
Abstract: Continued pretraining extends a language model's capabilities by further exposing it to additional data, often tailored to a specific linguistic or domain context. This strategy has emerged as an efficient alternative to full retraining when adapting general-purpose models to new settings. In this work, we investigate this paradigm through Curió 7B, a 7-billion-parameter model derived from LLaMA-2 and trained on 100 billion Portuguese tokens from the ClassiCC-PT corpus - the most extensive Portuguese-specific continued-pretraining effort above the three-billion-parameter scale to date. Beyond scale, we investigate whether quantity alone suffices or whether data quality plays a decisive role in linguistic adaptation. To this end, we introduce Curió-Edu 7B, a variant trained exclusively on the educational and STEM-filtered subset of the same corpus, totaling just 10 billion tokens. Despite using only 10% of the data and 20% of the computation, Curió-Edu 7B surpasses the full-corpus model in our evaluations, demonstrating that data selection can be fundamental even when adapting models with limited prior exposure to the target language. The developed models are available at this https URL

Title: Persistent Personas? Role-Playing, Instruction Following, and Safety in Extended Interactions

Authors: Pedro Henrique Luz de Araujo, Michael A. Hedderich, Ali Modarressi, Hinrich Schuetze, Benjamin Roth
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.12775
Pdf URL: https://arxiv.org/pdf/2512.12775
Copy Paste: [[2512.12775]] Persistent Personas? Role-Playing, Instruction Following, and Safety in Extended Interactions(https://arxiv.org/abs/2512.12775)
Keywords: language model, llm
Abstract: Persona-assigned large language models (LLMs) are used in domains such as education, healthcare, and sociodemographic simulation. Yet, they are typically evaluated only in short, single-round settings that do not reflect real-world usage. We introduce an evaluation protocol that combines long persona dialogues (over 100 rounds) and evaluation datasets to create dialogue-conditioned benchmarks that can robustly measure long-context effects. We then investigate the effects of dialogue length on persona fidelity, instruction-following, and safety of seven state-of-the-art open- and closed-weight LLMs. We find that persona fidelity degrades over the course of dialogues, especially in goal-oriented conversations, where models must sustain both persona fidelity and instruction following. We identify a trade-off between fidelity and instruction following, with non-persona baselines initially outperforming persona-assigned models; as dialogues progress and fidelity fades, persona responses become increasingly similar to baseline responses. Our findings highlight the fragility of persona applications in extended interactions and our work provides a protocol to systematically measure such failures.

Title: State over Tokens: Characterizing the Role of Reasoning Tokens

Authors: Mosh Levy, Zohar Elyoseph, Shauli Ravfogel, Yoav Goldberg
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.12777
Pdf URL: https://arxiv.org/pdf/2512.12777
Copy Paste: [[2512.12777]] State over Tokens: Characterizing the Role of Reasoning Tokens(https://arxiv.org/abs/2512.12777)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) can generate reasoning tokens before their final answer to boost performance on complex tasks. While these sequences seem like human thought processes, empirical evidence reveals that they are not a faithful explanation of the model's actual reasoning process. To address this gap between appearance and function, we introduce the State over Tokens (SoT) conceptual framework. SoT reframes reasoning tokens not as a linguistic narrative, but as an externalized computational state: the sole persistent information carrier across the model's stateless generation cycles. This explains how the tokens can drive correct reasoning without being a faithful explanation when read as text, and it surfaces previously overlooked research questions about these tokens. We argue that to truly understand the process LLMs carry out, research must move beyond reading the reasoning tokens as text and focus on decoding them as state.

Title: Does Tone Change the Answer? Evaluating Prompt Politeness Effects on Modern LLMs: GPT, Gemini, LLaMA

Authors: Hanyu Cai, Binqi Shen, Lier Jin, Lan Hu, Xiaojing Fan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.12812
Pdf URL: https://arxiv.org/pdf/2512.12812
Copy Paste: [[2512.12812]] Does Tone Change the Answer? Evaluating Prompt Politeness Effects on Modern LLMs: GPT, Gemini, LLaMA(https://arxiv.org/abs/2512.12812)
Keywords: language model, gpt, llm, prompt
Abstract: Prompt engineering has emerged as a critical factor influencing large language model (LLM) performance, yet the impact of pragmatic elements such as linguistic tone and politeness remains underexplored, particularly across different model families. In this work, we propose a systematic evaluation framework to examine how interaction tone affects model accuracy and apply it to three recently released and widely available LLMs: GPT-4o mini (OpenAI), Gemini 2.0 Flash (Google DeepMind), and Llama 4 Scout (Meta). Using the MMMLU benchmark, we evaluate model performance under Very Friendly, Neutral, and Very Rude prompt variants across six tasks spanning STEM and Humanities domains, and analyze pairwise accuracy differences with statistical significance testing. Our results show that tone sensitivity is both model-dependent and domain-specific. Neutral or Very Friendly prompts generally yield higher accuracy than Very Rude prompts, but statistically significant effects appear only in a subset of Humanities tasks, where rude tone reduces accuracy for GPT and Llama, while Gemini remains comparatively tone-insensitive. When performance is aggregated across tasks within each domain, tone effects diminish and largely lose statistical significance. Compared with earlier research, these findings suggest that dataset scale and coverage materially influence the detection of tone effects. Overall, our study indicates that while interaction tone can matter in specific interpretive settings, modern LLMs are broadly robust to tonal variation in typical mixed-domain use, providing practical guidance for prompt design and model selection in real-world deployments.
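The abstract above mentions pairwise accuracy differences with significance testing but does not name the test. One reasonable choice for paired per-question outcomes is an exact McNemar test, sketched below. The choice of test, the function name, and the toy correctness flags are all assumptions made for illustration.

```python
import math


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired correctness flags.

    Only discordant pairs (one condition right, the other wrong) carry
    information. Under the null hypothesis they split 50/50.
    """
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical results on 14 questions under two prompt tones.
neutral = [True] * 8 + [False] * 1 + [True] * 5
rude = [False] * 8 + [True] * 1 + [True] * 5
acc_gap = sum(neutral) / len(neutral) - sum(rude) / len(rude)
print(acc_gap, mcnemar_exact(neutral, rude))
```

A paired test is used here because the same questions are asked under each tone. Treating the two accuracy figures as independent samples would discard that pairing and typically lose power.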

Title: Hindsight is 20/20: Building Agent Memory that Retains, Recalls, and Reflects

Authors: Chris Latimer, Nicoló Boschi, Andrew Neeser, Chris Bartholomew, Gaurav Srivastava, Xuan Wang, Naren Ramakrishnan
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2512.12818
Pdf URL: https://arxiv.org/pdf/2512.12818
Copy Paste: [[2512.12818]] Hindsight is 20/20: Building Agent Memory that Retains, Recalls, and Reflects(https://arxiv.org/abs/2512.12818)
Keywords: gpt, llm, prompt, agent
Abstract: Agent memory has been touted as a dimension of growth for LLM-based applications, enabling agents that can accumulate experience, adapt across sessions, and move beyond single-shot question answering. The current generation of agent memory systems treats memory as an external layer that extracts salient snippets from conversations, stores them in vector or graph-based stores, and retrieves top-k items into the prompt of an otherwise stateless model. While these systems improve personalization and context carry-over, they still blur the line between evidence and inference, struggle to organize information over long horizons, and offer limited support for agents that must explain their reasoning. We present Hindsight, a memory architecture that treats agent memory as a structured, first-class substrate for reasoning by organizing it into four logical networks that distinguish world facts, agent experiences, synthesized entity summaries, and evolving beliefs. This framework supports three core operations (retain, recall, and reflect) that govern how information is added, accessed, and updated. Under this abstraction, a temporal, entity-aware memory layer incrementally turns conversational streams into a structured, queryable memory bank, while a reflection layer reasons over this bank to produce answers and to update information in a traceable way. On key long-horizon conversational memory benchmarks like LongMemEval and LoCoMo, Hindsight with an open-source 20B model lifts overall accuracy from 39% to 83.6% over a full-context baseline with the same backbone and outperforms full-context GPT-4o. Scaling the backbone further pushes Hindsight to 91.4% on LongMemEval and up to 89.61% on LoCoMo (vs. 75.78% for the strongest prior open system), consistently outperforming existing memory architectures on multi-session and open-domain questions.

Title: What Matters in Evaluating Book-Length Stories? A Systematic Study of Long Story Evaluation

Authors: Dingyi Yang, Qin Jin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.12839
Pdf URL: https://arxiv.org/pdf/2512.12839
Copy Paste: [[2512.12839]] What Matters in Evaluating Book-Length Stories? A Systematic Study of Long Story Evaluation(https://arxiv.org/abs/2512.12839)
Keywords: gpt
Abstract: In this work, we conduct systematic research in a challenging area: the automatic evaluation of book-length stories (>100K tokens). Our study focuses on two key questions: (1) understanding which evaluation aspects matter most to readers, and (2) exploring effective methods for evaluating lengthy stories. We introduce the first large-scale benchmark, LongStoryEval, comprising 600 newly published books with an average length of 121K tokens (maximum 397K). Each book includes its average rating and multiple reader reviews, presented as critiques organized by evaluation aspects. By analyzing all user-mentioned aspects, we propose an evaluation criteria structure and conduct experiments to identify the most significant aspects among the 8 top-level criteria. For evaluation methods, we compare the effectiveness of three types: aggregation-based, incremental-updated, and summary-based evaluations. Our findings reveal that aggregation- and summary-based evaluations perform better, with the former excelling in detail assessment and the latter offering greater efficiency. Building on these insights, we further propose NovelCritique, an 8B model that leverages the efficient summary-based framework to review and score stories across specified aspects. NovelCritique outperforms commercial models like GPT-4o in aligning with human evaluations. Our datasets and codes are available at this https URL.

Title: Counting Clues: A Lightweight Probabilistic Baseline Can Match an LLM

Authors: Furong Jia, Yuan Pu, Finn Guo, Monica Agrawal
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.12868
Pdf URL: https://arxiv.org/pdf/2512.12868
Copy Paste: [[2512.12868]] Counting Clues: A Lightweight Probabilistic Baseline Can Match an LLM(https://arxiv.org/abs/2512.12868)
Keywords: language model, llm
Abstract: Large language models (LLMs) excel on multiple-choice clinical diagnosis benchmarks, yet it is unclear how much of this performance reflects underlying probabilistic reasoning. We study this through questions from MedQA, where the task is to select the most likely diagnosis. We introduce the Frequency-Based Probabilistic Ranker (FBPR), a lightweight method that scores options with a smoothed Naive Bayes over concept-diagnosis co-occurrence statistics from a large corpus. When co-occurrence statistics were sourced from the pretraining corpora for OLMo and Llama, FBPR achieves comparable performance to the corresponding LLMs pretrained on that same corpus. Direct LLM inference and FBPR largely get different questions correct, with an overlap only slightly above random chance, indicating complementary strengths of each method. These findings highlight the continued value of explicit probabilistic baselines: they provide a meaningful performance reference point and a complementary signal for potential hybridization. While the performance of LLMs seems to be driven by a mechanism other than simple frequency aggregation, we show that an approach similar to the historically grounded, low-complexity expert systems still accounts for a substantial portion of benchmark performance.
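A frequency-based ranker of the kind described here can be sketched in a few lines. This is a minimal illustration that assumes add-alpha smoothing and a plain Naive Bayes score; the function name `fbpr_rank`, its parameters, and the toy counts are hypothetical, and the paper's actual concept extraction and smoothing scheme may differ.

```python
import math


def fbpr_rank(concepts, options, cooc, diag_count, vocab_size, alpha=1.0):
    """Rank diagnosis options by log P(d) + sum over concepts of log P(c | d).

    cooc[(concept, diagnosis)] and diag_count[diagnosis] are corpus counts.
    Both probabilities use add-alpha smoothing (an assumption of this sketch).
    """
    total = sum(diag_count.get(d, 0) for d in options)
    scores = {}
    for d in options:
        n_d = diag_count.get(d, 0)
        # Smoothed prior, restricted to the candidate options.
        score = math.log((n_d + alpha) / (total + alpha * len(options)))
        for c in concepts:
            n_cd = cooc.get((c, d), 0)
            # Smoothed likelihood of concept c co-occurring with diagnosis d.
            score += math.log((n_cd + alpha) / (n_d + alpha * vocab_size))
        scores[d] = score
    return sorted(options, key=lambda d: scores[d], reverse=True)
```

With toy counts in which "fever" and "cough" co-occur far more often with "flu" than with "fracture", calling `fbpr_rank(["fever", "cough"], ["fracture", "flu"], cooc, diag_count, vocab_size=50)` places "flu" first.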

Title: Building from Scratch: A Multi-Agent Framework with Human-in-the-Loop for Multilingual Legal Terminology Mapping

Authors: Lingyi Meng, Maolin Liu, Hao Wang, Yilan Cheng, Qi Yang, Idlkaid Mohanmmed
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.12950
Pdf URL: https://arxiv.org/pdf/2512.12950
Copy Paste: [[2512.12950]] Building from Scratch: A Multi-Agent Framework with Human-in-the-Loop for Multilingual Legal Terminology Mapping(https://arxiv.org/abs/2512.12950)
Keywords: language model, agent
Abstract: Accurately mapping legal terminology across languages remains a significant challenge, especially for language pairs like Chinese and Japanese, which share a large number of homographs with different meanings. Existing resources and standardized tools for these languages are limited. To address this, we propose a human-AI collaborative approach for building a multilingual legal terminology database, based on a multi-agent framework. This approach integrates advanced large language models and legal domain experts throughout the entire process, from raw document preprocessing and article-level alignment to terminology extraction, mapping, and quality assurance. Unlike a single automated pipeline, our approach places greater emphasis on how human experts participate in this multi-agent system. Humans and AI agents take on different roles: AI agents handle specific, repetitive tasks, such as OCR, text segmentation, semantic alignment, and initial terminology extraction, while human experts provide crucial oversight, reviewing and supervising the outputs with contextual knowledge and legal judgment. We tested the effectiveness of this framework using a trilingual parallel corpus comprising 35 key Chinese statutes, along with their English and Japanese translations. The experimental results show that this human-in-the-loop, multi-agent workflow not only improves the precision and consistency of multilingual legal terminology mapping but also offers greater scalability compared to traditional manual methods.

Title: QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management

Authors: Weizhou Shen, Ziyi Yang, Chenliang Li, Zhiyuan Lu, Miao Peng, Huashan Sun, Yingcheng Shi, Shengyi Liao, Shaopeng Lai, Bo Zhang, Dayiheng Liu, Fei Huang, Jingren Zhou, Ming Yan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.12967
Pdf URL: https://arxiv.org/pdf/2512.12967
Copy Paste: [[2512.12967]] QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management(https://arxiv.org/abs/2512.12967)
Keywords: gpt, long context, agent
Abstract: We introduce QwenLong-L1.5, a model that achieves superior long-context reasoning capabilities through systematic post-training innovations. The key technical breakthroughs of QwenLong-L1.5 are as follows: (1) Long-Context Data Synthesis Pipeline: We develop a systematic synthesis framework that generates challenging reasoning tasks requiring multi-hop grounding over globally distributed evidence. By deconstructing documents into atomic facts and their underlying relationships, and then programmatically composing verifiable reasoning questions, our approach creates high-quality training data at scale, moving substantially beyond simple retrieval tasks to enable genuine long-range reasoning capabilities. (2) Stabilized Reinforcement Learning for Long-Context Training: To overcome the critical instability in long-context RL, we introduce task-balanced sampling with task-specific advantage estimation to mitigate reward bias, and propose Adaptive Entropy-Controlled Policy Optimization (AEPO), which dynamically regulates exploration-exploitation trade-offs. (3) Memory-Augmented Architecture for Ultra-Long Contexts: Recognizing that even extended context windows cannot accommodate arbitrarily long sequences, we develop a memory management framework with multi-stage fusion RL training that seamlessly integrates single-pass reasoning with iterative memory-based processing for tasks exceeding 4M tokens. Based on Qwen3-30B-A3B-Thinking, QwenLong-L1.5 achieves performance comparable to GPT-5 and Gemini-2.5-Pro on long-context reasoning benchmarks, surpassing its baseline by 9.90 points on average. On ultra-long tasks (1M to 4M tokens), QwenLong-L1.5's memory-agent framework yields a 9.48-point gain over the agent baseline. Additionally, the acquired long-context reasoning ability translates to enhanced performance in general domains like scientific reasoning, memory tool use, and extended dialogue.
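One way to read "task-specific advantage estimation" is that each reward is normalized against the other rewards from the same task rather than against the whole batch, so a task with a large reward scale cannot dominate the update. The sketch below illustrates only that reading. The function name and the mean/standard-deviation normalization are assumptions, and it covers neither task-balanced sampling nor AEPO.

```python
from collections import defaultdict
from statistics import mean, pstdev


def task_specific_advantages(samples, eps=1e-6):
    """samples: list of (task_name, reward) pairs.

    Returns one advantage per sample, normalized with the mean and
    population standard deviation of rewards from the same task only.
    """
    by_task = defaultdict(list)
    for task, reward in samples:
        by_task[task].append(reward)
    stats = {t: (mean(r), pstdev(r)) for t, r in by_task.items()}
    return [
        (reward - stats[task][0]) / (stats[task][1] + eps)
        for task, reward in samples
    ]
```

For rewards `[("qa", 1.0), ("qa", 0.0), ("math", 10.0), ("math", 8.0)]`, the better sample in each task receives nearly the same advantage even though the raw reward scales differ by an order of magnitude.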

Title: Authors Should Annotate

Authors: Marcus Ma, Cole Johnson, Nolan Bridges, Jackson Trager, Georgios Chochlakis, Shrikanth Narayanan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.12976
Pdf URL: https://arxiv.org/pdf/2512.12976
Copy Paste: [[2512.12976]] Authors Should Annotate(https://arxiv.org/abs/2512.12976)
Keywords: chat
Abstract: The status quo for labeling text is third-party annotation, but there are many cases where information directly from the document's source would be preferable over a third-person proxy, especially for egocentric features like sentiment and belief. We introduce author labeling, an annotation technique where the writer of the document itself annotates the data at the moment of creation. We collaborate with a commercial chatbot with over 10,000 users to deploy an author labeling annotation system for subjective features related to product recommendation. This system identifies task-relevant queries, generates on-the-fly labeling questions, and records authors' answers in real time. We train and deploy an online-learning model architecture for product recommendation that continuously improves from author labeling and find it achieved a 534% increase in click-through rate compared to an industry advertising baseline running concurrently. We then compare the quality and practicality of author labeling to three traditional annotation approaches for sentiment analysis and find author labeling to be higher quality, faster to acquire, and cheaper. These findings reinforce existing literature that annotations, especially for egocentric and subjective beliefs, are significantly higher quality when labeled by the author rather than a third party. To facilitate broader scientific adoption, we release an author labeling service for the research community at this http URL.

Title: An Open and Reproducible Deep Research Agent for Long-Form Question Answering

Authors: Ikuya Yamada, Wataru Ikeda, Ko Yoshida, Mengyu Ye, Hinata Sugimoto, Masatoshi Suzuki, Hisanori Ozaki, Jun Suzuki
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.13059
Pdf URL: https://arxiv.org/pdf/2512.13059
Copy Paste: [[2512.13059]] An Open and Reproducible Deep Research Agent for Long-Form Question Answering(https://arxiv.org/abs/2512.13059)
Keywords: language model, llm, agent
Abstract: We present an open deep research system for long-form question answering, selected as a winning system in the text-to-text track of the MMU-RAG competition at NeurIPS 2025. The system combines an open-source large language model (LLM) with an open web search API to perform iterative retrieval, reasoning, and synthesis in real-world open-domain settings. To enhance reasoning quality, we apply preference tuning based on LLM-as-a-judge feedback that evaluates multiple aspects, including clarity, insightfulness, and factuality. Our experimental results show that the proposed method consistently improves answer quality across all three aspects. Our source code is publicly available at this https URL.

Title: LLM Rationalis? Measuring Bargaining Capabilities of AI Negotiators

Authors: Cheril Shah, Akshit Agarwal, Kanak Garg, Mourad Heddaya
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.13063
Pdf URL: https://arxiv.org/pdf/2512.13063
Copy Paste: [[2512.13063]] LLM Rationalis? Measuring Bargaining Capabilities of AI Negotiators(https://arxiv.org/abs/2512.13063)
Keywords: language model, llm
Abstract: Bilateral negotiation is a complex, context-sensitive task in which human negotiators dynamically adjust anchors, pacing, and flexibility to exploit power asymmetries and informal cues. We introduce a unified mathematical framework for modeling concession dynamics based on a hyperbolic tangent curve, and propose two metrics, burstiness tau and the Concession-Rigidity Index (CRI), to quantify the timing and rigidity of offer trajectories. We conduct a large-scale empirical comparison between human negotiators and four state-of-the-art large language models (LLMs) across natural-language and numeric-offers settings, with and without rich market context, as well as six controlled power-asymmetry scenarios. Our results reveal that, unlike humans, who smoothly adapt to situations and infer the opponent's position and strategies, LLMs systematically anchor at extremes of the possible agreement zone for negotiations and optimize for fixed points irrespective of leverage or context. Qualitative analysis further shows limited strategy diversity and occasional deceptive tactics used by LLMs. Moreover, the ability of LLMs to negotiate does not improve with better models. These findings highlight fundamental limitations in current LLM negotiation capabilities and point to the need for models that better internalize opponent reasoning and context-dependent strategy.

Title: Uncovering the Role of Initial Saliency in U-Shaped Attention Bias: Scaling Initial Token Weight for Enhanced Long-Text Processing

Authors: Zewen Qiang, Sendong Zhao, Haochun Wang, Bing Qin, Ting Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.13109
Pdf URL: https://arxiv.org/pdf/2512.13109
Copy Paste: [[2512.13109]] Uncovering the Role of Initial Saliency in U-Shaped Attention Bias: Scaling Initial Token Weight for Enhanced Long-Text Processing(https://arxiv.org/abs/2512.13109)
Keywords: language model, llm, long context
Abstract: Large language models (LLMs) have demonstrated strong performance on a variety of natural language processing (NLP) tasks. However, they often struggle with long-text sequences due to the "lost in the middle" phenomenon. This issue has been shown to arise from a U-shaped attention bias, where attention is disproportionately focused on the beginning and end of a text, leaving the middle section underrepresented. While previous studies have attributed this bias to position encoding, our research first identifies an additional factor: initial saliency. It means that in the attention computation for each token, tokens with higher attention weights relative to the initial token tend to receive more attention in the prediction of the next token. We further find that utilizing this property by scaling the attention weight between the initial token and others improves the model's ability to process long contexts, achieving a maximum improvement of 3.6% on the MDQA dataset. Moreover, combining this approach with existing methods to reduce position encoding bias further enhances performance, achieving a maximum improvement of 3.4% on KV-Retrieval tasks.
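The idea of scaling the initial token's attention weight can be illustrated on a single row of attention probabilities. This is a sketch under stated assumptions: the abstract does not say whether the scaling is applied before or after the softmax, nor whether the factor is above or below 1, so the post-softmax rescaling and the `gamma` parameter here are hypothetical choices.

```python
def scale_initial_token(attn_row, gamma):
    """Multiply the attention weight on position 0 by gamma, then renormalize.

    attn_row: attention probabilities for one query token (sums to 1).
    gamma: hypothetical scaling factor for the initial token.
    """
    if not attn_row:
        raise ValueError("attn_row must be non-empty")
    scaled = [attn_row[0] * gamma] + list(attn_row[1:])
    z = sum(scaled)
    return [w / z for w in scaled]
```

For the row `[0.5, 0.25, 0.25]` and `gamma = 0.5`, the three positions end up with equal weight, which shows how a change to the first position redistributes attention mass across the rest of the sequence.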

Title: Efficient Adaptive Rejection Sampling for Accelerating Speculative Decoding in Large Language Models

Authors: Chendong Sun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.13194
Pdf URL: https://arxiv.org/pdf/2512.13194
Copy Paste: [[2512.13194]] Efficient Adaptive Rejection Sampling for Accelerating Speculative Decoding in Large Language Models(https://arxiv.org/abs/2512.13194)
Keywords: language model, llm
Abstract: Speculative Decoding is a prominent technique for accelerating the autoregressive inference of large language models (LLMs) by employing a fast draft model to propose candidate token sequences and a large target model to verify them in parallel. However, its core component -- the rejection sampling mechanism -- relies on a fixed, context-independent random threshold. This leads to a significant "random rejection" problem in high-uncertainty generation scenarios, where plausible candidate tokens are frequently rejected due to random chance, undermining inference efficiency. This paper introduces Efficient Adaptive Rejection Sampling (EARS), a novel method that dynamically adjusts the acceptance threshold by incorporating the target model's own predictive uncertainty, measured as $1 - \max(P_{\mathrm{target}})$. By introducing a tolerance term proportional to this uncertainty, EARS intelligently relaxes the acceptance criterion when the model is uncertain, effectively reducing random rejections while maintaining strict standards when the model is confident. Experiments on creative writing and open-domain QA tasks demonstrate that EARS significantly enhances the efficiency of speculative decoding, achieving up to an 18.12% increase in throughput with a negligible 0.84% accuracy drop on the GSM8K benchmark. The method requires no modifications to model architectures and can be seamlessly integrated into existing speculative decoding frameworks.
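The acceptance rule can be sketched by starting from the standard speculative-sampling test, which accepts a drafted token when a uniform draw falls below min(1, P_target / P_draft), and adding a tolerance proportional to 1 - max(P_target). How the tolerance enters the test, and the coefficient `beta`, are assumptions of this sketch; the abstract does not give the exact formula.

```python
import random


def ears_accept(p_target, p_draft, token, beta=0.1, rng=random):
    """Decide whether to accept a drafted token.

    p_target, p_draft: dicts mapping token -> probability under each model.
    beta: hypothetical tolerance coefficient; beta = 0 recovers the standard test.
    rng: any object with a random() method returning a float in [0, 1).
    """
    ratio = p_target.get(token, 0.0) / p_draft[token]
    # Predictive uncertainty of the target model, as defined in the abstract.
    uncertainty = 1.0 - max(p_target.values())
    # Relax the threshold more when the target model is less certain.
    threshold = min(1.0, ratio + beta * uncertainty)
    return rng.random() < threshold
```

With a flat target distribution the threshold rises, so a draw that the standard test would reject can be accepted. Relaxing the test this way gives up the exact distribution-matching guarantee of standard speculative sampling, so any tolerance trades some fidelity for throughput.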

Title: AutoTool: Dynamic Tool Selection and Integration for Agentic Reasoning

Authors: Jiaru Zou, Ling Yang, Yunzhe Qi, Sirui Chen, Mengting Ai, Ke Shen, Jingrui He, Mengdi Wang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2512.13278
Pdf URL: https://arxiv.org/pdf/2512.13278
Copy Paste: [[2512.13278]] AutoTool: Dynamic Tool Selection and Integration for Agentic Reasoning(https://arxiv.org/abs/2512.13278)
Keywords: language model, llm, chain-of-thought, agent
Abstract: Agentic reinforcement learning has advanced large language models (LLMs) to reason through long chain-of-thought trajectories while interleaving external tool use. Existing approaches assume a fixed inventory of tools, limiting LLM agents' adaptability to new or evolving toolsets. We present AutoTool, a framework that equips LLM agents with dynamic tool-selection capabilities throughout their reasoning trajectories. We first construct a 200k dataset with explicit tool-selection rationales across 1,000+ tools and 100+ tasks spanning mathematics, science, code generation, and multimodal reasoning. Building on this data foundation, AutoTool employs a dual-phase optimization pipeline: (i) supervised and RL-based trajectory stabilization for coherent reasoning, and (ii) KL-regularized Plackett-Luce ranking to refine consistent multi-step tool selection. Across ten diverse benchmarks, we train two base models, Qwen3-8B and Qwen2.5-VL-7B, with AutoTool. With fewer parameters, AutoTool consistently outperforms advanced LLM agents and tool-integration methods, yielding average gains of 6.4% in math & science reasoning, 4.5% in search-based QA, 7.7% in code generation, and 6.9% in multimodal understanding. In addition, AutoTool exhibits stronger generalization by dynamically leveraging unseen tools from evolving toolsets during inference.
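The Plackett-Luce model referenced in phase (ii) assigns a probability to an entire ranking by repeatedly picking the next item with a softmax over the items not yet placed. The sketch below computes that log-likelihood for a ranking of tools. The function name and the tool scores are hypothetical, and the KL regularization term that AutoTool adds is not shown.

```python
import math


def plackett_luce_log_likelihood(scores, ranking):
    """Log-probability of `ranking` (best first) under a Plackett-Luce model.

    scores: dict mapping item -> real-valued score.
    ranking: list of items in preferred order.
    """
    remaining = list(ranking)
    log_lik = 0.0
    for item in ranking:
        # Softmax over the items that have not been placed yet.
        log_z = math.log(sum(math.exp(scores[r]) for r in remaining))
        log_lik += scores[item] - log_z
        remaining.remove(item)
    return log_lik
```

A ranking that orders tools by descending score receives a higher log-likelihood than the reverse ordering, so maximizing this quantity pushes the scores toward the desired tool order.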

Title: AIR: Post-training Data Selection for Reasoning via Attention Head Influence

Authors: Jinrui Liu, Jeff Wu, Xuanguang Pan, Gavin Cheung, Shuai Ma, Chongyang Tao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.13279
Pdf URL: https://arxiv.org/pdf/2512.13279
Copy Paste: [[2512.13279]] AIR: Post-training Data Selection for Reasoning via Attention Head Influence(https://arxiv.org/abs/2512.13279)
Keywords: llm
Abstract: LLMs achieve remarkable multi-step reasoning capabilities, yet effectively transferring these skills via post-training distillation remains challenging. Existing data selection methods, ranging from manual curation to heuristics based on length, entropy, or overall loss, fail to capture the causal importance of individual reasoning steps, limiting distillation efficiency. To address this, we propose Attention Influence for Reasoning (AIR), a principled, unsupervised and training-free framework that leverages mechanistic insights of the retrieval head to select high-value post-training data. AIR first identifies reasoning-critical attention heads of an off-the-shelf model, then constructs a weakened reference model with disabled head influence, and finally quantifies the resulting loss divergence as the Attention Influence Score. This score enables fine-grained assessment at both the step and sample levels, supporting step-level weighted fine-tuning and global sample selection. Experiments across multiple reasoning benchmarks show that AIR consistently improves reasoning accuracy, surpassing heuristic baselines and effectively isolating the most critical steps and samples. Our work establishes a mechanism-driven, data-efficient approach for reasoning distillation in LLMs.
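The scoring step can be sketched once per-step losses are available from both the base model and the weakened reference model. The example assumes those losses have already been computed elsewhere, takes the "loss divergence" to be a simple difference, and averages steps to get a sample score; all three choices, and the function names, are assumptions rather than the paper's definitions.

```python
def attention_influence_scores(base_losses, weakened_losses):
    """Per-step score: how much the loss rises when the selected heads are disabled."""
    if len(base_losses) != len(weakened_losses):
        raise ValueError("loss lists must have the same length")
    return [w - b for b, w in zip(base_losses, weakened_losses)]


def sample_score(base_losses, weakened_losses):
    """Sample-level score: mean of the step-level scores."""
    steps = attention_influence_scores(base_losses, weakened_losses)
    return sum(steps) / len(steps)


def select_top_samples(samples, k):
    """samples: list of (sample_id, base_losses, weakened_losses) tuples.

    Returns the ids of the k samples with the highest sample-level score.
    """
    ranked = sorted(samples, key=lambda s: sample_score(s[1], s[2]), reverse=True)
    return [s[0] for s in ranked[:k]]
```

The step-level scores could serve as weights in fine-tuning, while `select_top_samples` corresponds to global sample selection.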

Title: MiniLingua: A Small Open-Source LLM for European Languages

Authors: Anna Aksenova, Boris Zverkov, Nicola Dainese, Alexander Nikitin, Pekka Marttinen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.13298
Pdf URL: https://arxiv.org/pdf/2512.13298
Copy Paste: [[2512.13298]] MiniLingua: A Small Open-Source LLM for European Languages(https://arxiv.org/abs/2512.13298)
Keywords: language model, llm
Abstract: Large language models are powerful but often limited by high computational cost, privacy concerns, and English-centric training. Recent progress demonstrates that small, efficient models with around one billion parameters can deliver strong results and enable on-device use. This paper introduces MiniLingua, a multilingual open-source LLM of one billion parameters trained from scratch for 13 European languages, designed to balance coverage and instruction-following capabilities. Based on evaluation results, the instruction-tuned version of MiniLingua outperforms EuroLLM, a model with a similar training approach but a larger training budget, on summarization, classification and both open- and closed-book question answering. Moreover, it remains competitive with more advanced state-of-the-art models on open-ended generation tasks. We release model weights, tokenizer and source code used for data processing and model training.

Title: FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models

Authors: Joona Kytöniemi, Jousia Piha, Akseli Reunamo, Fedor Vitiugin, Farrokh Mehryary, Sampo Pyysalo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.13330
Pdf URL: https://arxiv.org/pdf/2512.13330
Copy Paste: [[2512.13330]] FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models(https://arxiv.org/abs/2512.13330)
Keywords: language model, prompt
Abstract: We introduce FIN-bench-v2, a unified benchmark suite for evaluating large language models in Finnish. FIN-bench-v2 consolidates Finnish versions of widely used benchmarks together with an updated and expanded version of the original FIN-bench into a single, consistently formatted collection, covering multiple-choice and generative tasks across reading comprehension, commonsense reasoning, sentiment analysis, world knowledge, and alignment. All datasets are converted to HuggingFace Datasets, which include both cloze and multiple-choice prompt formulations with five variants per task, and we incorporate human annotation or review for machine-translated resources such as GoldenSwag and XED. To select robust tasks, we pretrain a set of 2.15B-parameter decoder-only models and use their learning curves to compute monotonicity, signal-to-noise, non-random performance, and model ordering consistency, retaining only tasks that satisfy all criteria. We further evaluate a set of larger instruction-tuned models to characterize performance across tasks and prompt formulations. All datasets, prompts, and evaluation configurations are publicly available via our fork of the Language Model Evaluation Harness at this https URL. Supplementary resources are released in a separate repository at this https URL.

Title: Large language models are not about language

Authors: Johan J. Bolhuis, Andrea Moro, Stephen Crain, Sandiway Fong
Subjects: cs.CL, q-bio.NC
Abstract URL: https://arxiv.org/abs/2512.13441
Pdf URL: https://arxiv.org/pdf/2512.13441
Copy Paste: [[2512.13441]] Large language models are not about language(https://arxiv.org/abs/2512.13441)
Keywords: language model
Abstract: Large Language Models are useless for linguistics, as they are probabilistic models that require a vast amount of data to analyse externalized strings of words. In contrast, human language is underpinned by a mind-internal computational system that recursively generates hierarchical thought structures. The language system grows with minimal external input and can readily distinguish between real language and impossible languages.

Title: Scaling Laws for Code: Every Programming Language Matters

Authors: Jian Yang, Shawn Guo, Lin Jing, Wei Zhang, Aishan Liu, Chuan Hao, Zhoujun Li, Wayne Xin Zhao, Xianglong Liu, Weifeng Lv, Bryan Dai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.13472
Pdf URL: https://arxiv.org/pdf/2512.13472
Copy Paste: [[2512.13472]] Scaling Laws for Code: Every Programming Language Matters(https://arxiv.org/abs/2512.13472)
Keywords: language model, llm
Abstract: Code large language models (Code LLMs) are powerful but costly to train, with scaling laws predicting performance from model size, data, and compute. However, different programming languages (PLs) have varying impacts during pre-training that significantly affect base model performance, leading to inaccurate performance prediction. Besides, existing works focus on language-agnostic settings, neglecting the inherently multilingual nature of modern software development. Therefore, it is first necessary to investigate the scaling laws of different PLs, and then consider their mutual influences to arrive at the final multilingual scaling law. In this paper, we present the first systematic exploration of scaling laws for multilingual code pre-training, conducting 1000+ experiments (equivalent to 336,000+ H800 hours) across multiple PLs, model sizes (0.2B to 14B parameters), and dataset sizes (1T tokens). We establish comprehensive scaling laws for code LLMs across multiple PLs, revealing that interpreted languages (e.g., Python) benefit more from increased model size and data than compiled languages (e.g., Rust). The study demonstrates that multilingual pre-training provides synergistic benefits, particularly between syntactically similar PLs. Further, the parallel-pairing pre-training strategy (concatenating code snippets with their translations) significantly enhances cross-lingual abilities with favorable scaling properties. Finally, a proportion-dependent multilingual scaling law is proposed to optimally allocate training tokens by prioritizing high-utility PLs (e.g., Python), balancing high-synergy pairs (e.g., JavaScript-TypeScript), and reducing allocation to fast-saturating languages (e.g., Rust), achieving superior average performance across all PLs compared to uniform distribution under the same compute budget.

Title: Non-Resolution Reasoning: A Framework for Preserving Semantic Ambiguity in Language Models

Authors: Kei Saito
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.13478
Pdf URL: https://arxiv.org/pdf/2512.13478
Copy Paste: [[2512.13478]] Non-Resolution Reasoning: A Framework for Preserving Semantic Ambiguity in Language Models(https://arxiv.org/abs/2512.13478)
Keywords: language model
Abstract: Premature semantic collapse -- the forced early commitment to a single meaning -- remains a core architectural limitation of current language models. Softmax-driven competition and greedy decoding cause models to discard valid interpretations before sufficient context is available, resulting in brittle reasoning and context failures. We introduce Non-Resolution Reasoning (NRR), a general computational framework that preserves semantic ambiguity during inference and performs resolution only when explicitly required. NRR integrates three components: (1) Multi-Vector Embeddings that maintain multiple viable interpretations per token, (2) Non-Collapsing Attention that prevents winner-take-all dynamics across layers, and (3) Contextual Identity Tracking (CIT), which assigns context-specific identities to recurring entities (e.g., distinguishing "Dr. Smith the cardiologist" from "Dr. Smith the researcher"). These mechanisms are unified by an external Resolution Operator $\rho$ that makes semantic commitment explicit, controllable, and task-dependent. Unlike standard architectures, NRR separates representation from resolution, allowing a single model to shift between creative, factual, and ambiguity-preserving reasoning without retraining. A synthetic evaluation demonstrates NRR's ability to preserve ambiguity and track context: CIT-enhanced models achieve 90.9% accuracy on out-of-distribution identity-shift tasks, compared to 9.1% for transformer baselines. NRR provides a principled alternative to premature collapse, reframing ambiguity as an explicit representational state rather than a failure mode. The question is not whether AI should resolve ambiguity, but when, how, and under whose control.

Title: SkipCat: Rank-Maximized Low-Rank Compression of Large Language Models via Shared Projection and Block Skipping

Authors: Yu-Chen Lu, Sheng-Feng Yu, Hui-Hsien Weng, Pei-Shuo Wang, Yu-Fang Hu, Liang Hung-Chun, Hung-Yueh Chiang, Kai-Chiang Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.13494
Pdf URL: https://arxiv.org/pdf/2512.13494
Copy Paste: [[2512.13494]] SkipCat: Rank-Maximized Low-Rank Compression of Large Language Models via Shared Projection and Block Skipping(https://arxiv.org/abs/2512.13494)
Keywords: language model, llm
Abstract: Large language models (LLMs) have achieved remarkable performance across a wide range of tasks. However, their substantial parameter sizes pose significant challenges for deployment on edge devices with limited computational and memory resources. Low-rank compression is a promising approach to address this issue, as it reduces both computational and memory costs, making LLMs more suitable for resource-constrained environments. Nonetheless, naïve low-rank compression methods require a significant reduction in the retained rank to achieve meaningful memory and computation savings. For a low-rank model, the ranks need to be reduced by more than half to yield efficiency gains. Such aggressive truncation, however, typically results in substantial performance degradation. To address this trade-off, we propose SkipCat, a novel low-rank compression framework that enables the use of higher ranks while achieving the same compression rates. First, we introduce an intra-layer shared low-rank projection method, where multiple matrices that share the same input use a common projection. This reduces redundancy and improves compression efficiency. Second, we propose a block skipping technique that omits computations and memory transfers for selected sub-blocks within the low-rank decomposition. These two techniques jointly enable our compressed model to retain more effective ranks under the same compression budget. Experimental results show that, without any additional fine-tuning, our method outperforms previous low-rank compression approaches with a 7% accuracy improvement on zero-shot tasks under the same compression rate. These results highlight the effectiveness of our rank-maximized compression strategy in preserving model performance under tight resource constraints.
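The statement that ranks must be cut by more than half before low-rank compression saves anything can be checked with plain parameter counting. The sketch below is only an illustration under stated assumptions: square d x d weights, a two-factor decomposition W ≈ A·B, and k matrices that read the same input (for example, three attention projections). The function names are hypothetical and not taken from the paper, and block skipping is not modeled.

```python
def dense_params(d: int, k: int = 1) -> int:
    """Parameter count of k dense d x d matrices."""
    return k * d * d


def separate_low_rank_params(d: int, r: int, k: int = 1) -> int:
    """Each matrix has its own factors A (d x r) and B (r x d)."""
    return k * 2 * d * r


def shared_projection_params(d: int, r: int, k: int) -> int:
    """One shared projection (d x r) plus k per-matrix factors (r x d)."""
    return d * r + k * r * d


def max_rank_for_budget(d: int, k: int, keep_ratio: float, shared: bool) -> int:
    """Largest rank whose parameter count fits keep_ratio of the dense size."""
    budget = keep_ratio * dense_params(d, k)
    # Parameters added per unit of rank under each scheme.
    per_rank = (d + k * d) if shared else (2 * k * d)
    return int(budget // per_rank)
```

With d = 4096 and k = 3, separate factors break even with the dense matrices exactly at r = d / 2 = 2048, which is why the rank has to drop below half. At a 50% parameter budget, separate factors allow rank 1024, while one shared projection allows rank 1536 in this simplified count.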

Title: Memory in the Age of AI Agents

Authors: Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, Senjie Jin, Jiejun Tan, Yanbin Yin, Jiongnan Liu, Zeyu Zhang, Zhongxiang Sun, Yutao Zhu, Hao Sun, Boci Peng, Zhenrong Cheng, Xuanbo Fan, Jiaxin Guo, Xinlei Yu, Zhenhong Zhou, Zewen Hu, Jiahao Huo, Junhao Wang, Yuwei Niu, Yu Wang, Zhenfei Yin, Xiaobin Hu, Yue Liao, Qiankun Li, Kun Wang, Wangchunshu Zhou, Yixin Liu, Dawei Cheng, Qi Zhang, Tao Gui, Shirui Pan, Yan Zhang, Philip Torr, Zhicheng Dou, Ji-Rong Wen, Xuanjing Huang, Yu-Gang Jiang, Shuicheng Yan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.13564
Pdf URL: https://arxiv.org/pdf/2512.13564
Copy Paste: [[2512.13564]] Memory in the Age of AI Agents(https://arxiv.org/abs/2512.13564)
Keywords: llm, retrieval augmented generation, agent
Abstract: Memory has emerged as, and will remain, a core capability of foundation model-based agents. As research on agent memory rapidly expands and attracts unprecedented attention, the field has also become increasingly fragmented. Existing works that fall under the umbrella of agent memory often differ substantially in their motivations, implementations, and evaluation protocols, while the proliferation of loosely defined memory terminologies has further obscured conceptual clarity. Traditional taxonomies such as long/short-term memory have proven insufficient to capture the diversity of contemporary agent memory systems. This work aims to provide an up-to-date landscape of current agent memory research. We begin by clearly delineating the scope of agent memory and distinguishing it from related concepts such as LLM memory, retrieval augmented generation (RAG), and context engineering. We then examine agent memory through the unified lenses of forms, functions, and dynamics. From the perspective of forms, we identify three dominant realizations of agent memory, namely token-level, parametric, and latent memory. From the perspective of functions, we propose a finer-grained taxonomy that distinguishes factual, experiential, and working memory. From the perspective of dynamics, we analyze how memory is formed, evolves, and is retrieved over time. To support practical development, we compile a comprehensive summary of memory benchmarks and open-source frameworks. Beyond consolidation, we articulate a forward-looking perspective on emerging research frontiers, including memory automation, reinforcement learning integration, multimodal memory, multi-agent memory, and trustworthiness issues. We hope this survey serves not only as a reference for existing work, but also as a conceptual foundation for rethinking memory as a first-class primitive in the design of future agentic intelligence.

Title: ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding

Authors: Jia-Nan Li, Jian Guan, Wei Wu, Chongxuan Li
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.13586
Pdf URL: https://arxiv.org/pdf/2512.13586
Copy Paste: [[2512.13586]] ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding(https://arxiv.org/abs/2512.13586)
Keywords: language model
Abstract: Autoregressive models (ARMs) are hindered by slow sequential inference. While masked diffusion models (MDMs) offer a parallel alternative, they suffer from critical drawbacks: high computational overhead from precluding Key-Value (KV) caching, and incoherent generation arising from learning dependencies over an intractable space of token combinations. To address these limitations, we introduce ReFusion, a novel masked diffusion model that achieves superior performance and efficiency by elevating parallel decoding from the token level to a higher slot level, where each slot is a fixed-length, contiguous sub-sequence. This is achieved through an iterative "plan-and-infill" decoding process: a diffusion-based planning step first identifies a set of weakly dependent slots, and an autoregressive infilling step then decodes these selected slots in parallel. The slot-based design simultaneously unlocks full KV cache reuse with a unified causal framework and reduces the learning complexity from the token combination space to a manageable slot-level permutation space. Extensive experiments on seven diverse benchmarks show that ReFusion not only overwhelmingly surpasses prior MDMs with 34% performance gains and an over 18$\times$ speedup on average, but also bridges the performance gap to strong ARMs while maintaining a 2.33$\times$ average speedup.

Title: Textual Gradients are a Flawed Metaphor for Automatic Prompt Optimization

Authors: Daniel Melcer, Qi Chen, Wen-Hao Chiang, Shweta Garg, Pranav Garg, Christian Bock
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2512.13598
Pdf URL: https://arxiv.org/pdf/2512.13598
Copy Paste: [[2512.13598]] Textual Gradients are a Flawed Metaphor for Automatic Prompt Optimization(https://arxiv.org/abs/2512.13598)
Keywords: language model, prompt
Abstract: A well-engineered prompt can increase the performance of large language models; automatic prompt optimization techniques aim to increase performance without requiring human effort to tune the prompts. One leading class of prompt optimization techniques introduces the analogy of textual gradients. We investigate the behavior of these textual gradient methods through a series of experiments and case studies. While such methods often result in a performance improvement, our experiments suggest that the gradient analogy does not accurately explain their behavior. Our insights may inform the selection of prompt optimization strategies and the development of new approaches.

Title: Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models

Authors: Boxin Wang, Chankyu Lee, Nayeon Lee, Sheng-Chieh Lin, Wenliang Dai, Yang Chen, Yangyi Chen, Zhuolin Yang, Zihan Liu, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.13607
Pdf URL: https://arxiv.org/pdf/2512.13607
Copy Paste: [[2512.13607]] Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models(https://arxiv.org/abs/2512.13607)
Keywords: prompt
Abstract: Building general-purpose reasoning models with reinforcement learning (RL) entails substantial cross-domain heterogeneity, including large variation in inference-time response lengths and verification latency. Such variability complicates the RL infrastructure, slows training, and makes training curriculum (e.g., response length extension) and hyperparameter selection challenging. In this work, we propose cascaded domain-wise reinforcement learning (Cascade RL) to develop general-purpose reasoning models, Nemotron-Cascade, capable of operating in both instruct and deep thinking modes. Departing from conventional approaches that blend heterogeneous prompts from different domains, Cascade RL orchestrates sequential, domain-wise RL, reducing engineering complexity and delivering state-of-the-art performance across a wide range of benchmarks. Notably, RLHF for alignment, when used as a pre-step, boosts the model's reasoning ability far beyond mere preference optimization, and subsequent domain-wise RLVR stages rarely degrade the benchmark performance attained in earlier domains and may even improve it (see an illustration in Figure 1). Our 14B model, after RL, outperforms its SFT teacher, DeepSeek-R1-0528, on LiveCodeBench v5/v6/Pro and achieves silver-medal performance in the 2025 International Olympiad in Informatics (IOI). We transparently share our training and data recipes.

Title: Temporal Tokenization Strategies for Event Sequence Modeling with Large Language Models

Authors: Zefang Liu, Nam Nguyen, Yinzhu Quan, Austin Zhang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2512.13618
Pdf URL: https://arxiv.org/pdf/2512.13618
Copy Paste: [[2512.13618]] Temporal Tokenization Strategies for Event Sequence Modeling with Large Language Models(https://arxiv.org/abs/2512.13618)
Keywords: language model, llm
Abstract: Representing continuous time is a critical and under-explored challenge in modeling temporal event sequences with large language models (LLMs). Various strategies like byte-level representations or calendar tokens have been proposed. However, the optimal approach remains unclear, especially given the diverse statistical distributions of real-world event data, which range from smooth log-normal to discrete, spiky patterns. This paper presents the first empirical study of temporal tokenization for event sequences, comparing distinct encoding strategies: naive numeric strings, high-precision byte-level representations, human-semantic calendar tokens, classic uniform binning, and adaptive residual scalar quantization. We evaluate these strategies by fine-tuning LLMs on real-world datasets that exemplify these diverse distributions. Our analysis reveals that no single strategy is universally superior; instead, prediction performance depends heavily on aligning the tokenizer with the data's statistical properties, with log-based strategies excelling on skewed distributions and human-centric formats proving robust for mixed modalities.
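The contrast between classic uniform binning and log-based strategies on skewed data can be illustrated with a small sketch. This is an assumed, simplified tokenizer rather than the paper's implementation: the function names, the bin count, and the 1-second-to-1-day range are hypothetical choices made for illustration. Each function maps an inter-event time in seconds to a bin index that could then serve as a time token.

```python
import math


def uniform_bin(dt: float, t_max: float, n_bins: int) -> int:
    """Map dt in [0, t_max] to one of n_bins equal-width bins."""
    dt = min(max(dt, 0.0), t_max)  # clamp to the supported range
    return min(int(dt / t_max * n_bins), n_bins - 1)


def log_bin(dt: float, t_min: float, t_max: float, n_bins: int) -> int:
    """Map dt in [t_min, t_max] to one of n_bins bins of equal width in log space."""
    dt = min(max(dt, t_min), t_max)  # clamp to the supported range
    frac = math.log(dt / t_min) / math.log(t_max / t_min)
    return min(int(frac * n_bins), n_bins - 1)
```

With 16 bins over one day (86,400 seconds), uniform binning places gaps of 1 second and 60 seconds in the same bin (index 0), while log binning separates them (indices 0 and 5). This is one way to see why a log scale can suit heavily skewed gap distributions.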

Title: Large-Language Memorization During the Classification of United States Supreme Court Cases

Authors: John E. Ortega, Dhruv D. Joshi, Matt P. Borkowski
Subjects: cs.CL, cs.AI, cs.ET, cs.IR
Abstract URL: https://arxiv.org/abs/2512.13654
Pdf URL: https://arxiv.org/pdf/2512.13654
Copy Paste: [[2512.13654]] Large-Language Memorization During the Classification of United States Supreme Court Cases(https://arxiv.org/abs/2512.13654)
Keywords: language model, llm, hallucination, prompt
Abstract: Large-language models (LLMs) have been shown to respond in a variety of ways for classification tasks outside of question-answering. LLM responses are sometimes called "hallucinations" since the output is not what is expected. Memorization strategies in LLMs are being studied in detail, with the goal of understanding how LLMs respond. We perform a deep dive into a classification task based on United States Supreme Court (SCOTUS) decisions. The SCOTUS corpus is an ideal classification task to study for LLM memory accuracy because it presents significant challenges due to extensive sentence length, complex legal terminology, non-standard structure, and domain-specific vocabulary. Experimentation is performed with the latest LLM fine-tuning and retrieval-based approaches, such as parameter-efficient fine-tuning, auto-modeling, and others, on two traditional category-based SCOTUS classification tasks: one with 15 labeled topics and another with 279. We show that prompt-based models with memories, such as DeepSeek, can be more robust than previous BERT-based models on both tasks, scoring about 2 points better than previous models not based on prompting.

Title: Comparative Analysis of LLM Abliteration Methods: A Cross-Architecture Evaluation

Authors: Richard J. Young
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2512.13655
Pdf URL: https://arxiv.org/pdf/2512.13655
Copy Paste: [[2512.13655]] Comparative Analysis of LLM Abliteration Methods: A Cross-Architecture Evaluation(https://arxiv.org/abs/2512.13655)
Keywords: language model, llm
Abstract: Safety alignment mechanisms in large language models prevent responses to harmful queries through learned refusal behavior, yet these same mechanisms impede legitimate research applications including cognitive modeling, adversarial testing, and security analysis. While abliteration techniques enable surgical removal of refusal representations through directional orthogonalization, the relative effectiveness of available implementations remains uncharacterized. This study evaluates four abliteration tools (Heretic, DECCP, ErisForge, FailSpy) across sixteen instruction-tuned models (7B-14B parameters), reporting tool compatibility on all 16 models and quantitative metrics on subsets dictated by tool support. Single-pass methods demonstrated superior capability preservation on the benchmarked subset (avg GSM8K change across three models: ErisForge -0.28 pp; DECCP -0.13 pp), while Bayesian-optimized abliteration produced variable distribution shift (KL divergence: 0.043-1.646) with model-dependent capability impact. These findings provide researchers with evidence-based selection criteria for abliteration tool deployment across diverse model architectures. The principal finding indicates that mathematical reasoning capabilities exhibit the highest sensitivity to abliteration interventions, with GSM8K change ranging from +1.51 pp to -18.81 pp (-26.5% relative) depending on tool selection and model architecture.
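Directional orthogonalization, as named in the abstract, can be pictured as removing a vector's component along one chosen direction. The sketch below is a minimal pure-Python illustration that assumes a single "refusal direction" r and a toy activation vector. The function names are hypothetical and do not correspond to the API of Heretic, DECCP, ErisForge, or FailSpy.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v: List[float]) -> List[float]:
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def project_out(x: List[float], direction: List[float]) -> List[float]:
    """Return x minus its component along direction: x - (x . r) r."""
    r = normalize(direction)
    coeff = dot(x, r)
    return [xi - coeff * ri for xi, ri in zip(x, r)]
```

After projection, the result is orthogonal to the chosen direction (up to floating-point error), so nothing along that direction remains. Applying the same operation to the output side of a weight matrix is the usual way this idea is described for models, though the details vary by tool.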

Title: Towards Effective Model Editing for LLM Personalization

Authors: Baixiang Huang, Limeng Cui, Jiapeng Liu, Haoran Wang, Jiawei Xu, Zhuiyue Tan, Yutong Chen, Chen Luo, Yi Liu, Kai Shu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.13676
Pdf URL: https://arxiv.org/pdf/2512.13676
Copy Paste: [[2512.13676]] Towards Effective Model Editing for LLM Personalization(https://arxiv.org/abs/2512.13676)
Keywords: llm, prompt
Abstract: Personalization is becoming indispensable for LLMs to align with individual user preferences and needs. Yet current approaches are often computationally expensive, data-intensive, susceptible to catastrophic forgetting, and prone to performance degradation in multi-turn interactions or when handling implicit queries. To address these challenges, we conceptualize personalization as a model editing task and introduce Personalization Editing, a framework that applies localized edits guided by clustered preference representations. This design enables precise preference-aligned updates while preserving overall model capabilities. In addition, existing personalization benchmarks frequently rely on persona-based dialogs between LLMs rather than user-LLM interactions, or focus primarily on stylistic imitation while neglecting information-seeking tasks that require accurate recall of user-specific preferences. We introduce User Preference Question Answering (UPQA), a short-answer QA dataset constructed from in-situ user queries with varying levels of difficulty. Unlike prior benchmarks, UPQA directly evaluates a model's ability to recall and apply specific user preferences. Across experimental settings, Personalization Editing achieves higher editing accuracy and greater computational efficiency than fine-tuning, while outperforming prompting-based baselines in multi-turn conversation and implicit preference question settings.

Title: Beyond surface form: A pipeline for semantic analysis in Alzheimer's Disease detection from spontaneous speech

Authors: Dylan Phelps, Rodrigo Wilkens, Edward Gow-Smith, Lilian Hubner, Bárbara Malcorra, César Rennó-Costa, Marco Idiart, Maria-Cruz Villa-Uriol, Aline Villavicencio
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.13685
Pdf URL: https://arxiv.org/pdf/2512.13685
Copy Paste: [[2512.13685]] Beyond surface form: A pipeline for semantic analysis in Alzheimer's Disease detection from spontaneous speech(https://arxiv.org/abs/2512.13685)
Keywords: language model
Abstract: Alzheimer's Disease (AD) is a progressive neurodegenerative condition that adversely affects cognitive abilities. Language-related changes can be automatically identified through the analysis of outputs from linguistic assessment tasks, such as picture description. Language models show promise as a basis for screening tools for AD, but their limited interpretability poses a challenge in distinguishing true linguistic markers of cognitive decline from surface-level textual patterns. To address this issue, we examine how surface form variation affects classification performance, with the goal of assessing the ability of language models to represent underlying semantic indicators. We introduce a novel approach in which the surface forms of texts are transformed by altering syntax and vocabulary while preserving semantic content. The transformations significantly modify structure and lexical content, as indicated by low BLEU and chrF scores, yet retain the underlying semantics, as reflected in high semantic similarity scores. This isolates the effect of semantic information, and we find that models perform similarly to how they perform on the original text, with only small deviations in macro-F1. We also investigate whether language from picture descriptions retains enough detail to reconstruct the original image using generative models. We found that image-based transformations add substantial noise, reducing classification accuracy. Our methodology provides a novel way of looking at which features influence model predictions, and allows the removal of possible spurious correlations. We find that, using semantic information alone, language-model-based classifiers can still detect AD. This work shows that difficult-to-detect semantic impairment can be identified, addressing an overlooked feature of linguistic deterioration and opening new pathways for early detection systems.