2026-02-12

Title: Reviewing the Reviewer: Elevating Peer Review Quality through LLM-Guided Feedback

Authors: Sukannya Purkayastha, Qile Wan, Anne Lauscher, Lizhen Qu, Iryna Gurevych
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2602.10118
Pdf URL: https://arxiv.org/pdf/2602.10118
Copy Paste: [[2602.10118]] Reviewing the Reviewer: Elevating Peer Review Quality through LLM-Guided Feedback(https://arxiv.org/abs/2602.10118)
Keywords: llm
Abstract: Peer review is central to scientific quality, yet reliance on simple heuristics -- lazy thinking -- has lowered standards. Prior work treats lazy thinking detection as a single-label task, but review segments may exhibit multiple issues, including broader clarity problems or specificity issues. Turning detection into actionable improvements requires guideline-aware feedback, which is currently missing. We introduce an LLM-driven framework that decomposes reviews into argumentative segments, identifies issues via a neurosymbolic module combining LLM features with traditional classifiers, and generates targeted feedback using issue-specific templates refined by a genetic algorithm. Experiments show our method outperforms zero-shot LLM baselines and improves review quality by up to 92.4%. We also release LazyReviewPlus, a dataset of 1,309 sentences labeled for lazy thinking and specificity.
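Summary: Peer review is central to scientific quality, but reliance on simple heuristics (lazy thinking) has lowered standards. The paper introduces an LLM-driven framework that splits reviews into argumentative segments, identifies issues with a neurosymbolic module combining LLM features with traditional classifiers, and generates targeted feedback from issue-specific templates refined by a genetic algorithm. It outperforms zero-shot LLM baselines, improves review quality by up to 92.4%, and comes with LazyReviewPlus, a dataset of 1,309 sentences labeled for lazy thinking and specificity.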

Title: Latent Thoughts Tuning: Bridging Context and Reasoning with Fused Information in Latent Tokens

Authors: Weihao Liu, Dehai Min, Lu Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.10229
Pdf URL: https://arxiv.org/pdf/2602.10229
Copy Paste: [[2602.10229]] Latent Thoughts Tuning: Bridging Context and Reasoning with Fused Information in Latent Tokens(https://arxiv.org/abs/2602.10229)
Keywords: language model, llm, chain-of-thought
Abstract: While explicit Chain-of-Thought (CoT) equips Large Language Models (LLMs) with strong reasoning capabilities, it requires models to verbalize every intermediate step in text tokens, constraining the model's thoughts to the discrete vocabulary space. Recently, reasoning in continuous latent space has emerged as a promising alternative, enabling more robust inference and flexible computation beyond discrete token constraints. However, current latent paradigms often suffer from feature collapse and instability, stemming from distribution mismatches when recurrently using hidden states as the input embeddings, or alignment issues when relying on assistant models. To address this, we propose Latent Thoughts Tuning (LT-Tuning), a framework that redefines how latent thoughts are constructed and deployed. Instead of relying solely on raw hidden states, our method introduces a Context-Prediction-Fusion mechanism that jointly leverages contextual hidden states and predictive semantic guidance from the vocabulary embedding space. Combined with a progressive three-stage curriculum learning pipeline, LT-Tuning also enables dynamic switching between latent and explicit thinking modes. Experiments demonstrate that our method outperforms existing latent reasoning baselines, effectively mitigating feature collapse and achieving robust reasoning accuracy.
Summary: Explicit Chain-of-Thought confines a model's reasoning to the discrete vocabulary space, while existing latent-space alternatives often suffer from feature collapse and instability. Latent Thoughts Tuning (LT-Tuning) builds latent thoughts with a Context-Prediction-Fusion mechanism that combines contextual hidden states with predictive semantic guidance from the vocabulary embedding space. With a progressive three-stage curriculum, it can also switch dynamically between latent and explicit thinking, and it outperforms existing latent reasoning baselines.

Title: Learning to Evict from Key-Value Cache

Authors: Luca Moschella, Laura Manduchi, Ozan Sener
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.10238
Pdf URL: https://arxiv.org/pdf/2602.10238
Copy Paste: [[2602.10238]] Learning to Evict from Key-Value Cache(https://arxiv.org/abs/2602.10238)
Keywords: language model, llm, agent
Abstract: The growing size of Large Language Models (LLMs) makes efficient inference challenging, primarily due to the memory demands of the autoregressive Key-Value (KV) cache. Existing eviction or compression methods reduce cost but rely on heuristics, such as recency or past attention scores, which serve only as indirect proxies for a token's future utility and introduce computational overhead. We reframe KV cache eviction as a reinforcement learning (RL) problem: learning to rank tokens by their predicted usefulness for future decoding. To this end, we introduce KV Policy (KVP), a framework of lightweight per-head RL agents trained on pre-computed generation traces using only key and value vectors. Each agent learns a specialized eviction policy guided by future utility, which evaluates the quality of the ranking across all cache budgets, requiring no modifications to the underlying LLM or additional inference. Evaluated across two different model families on the long-context benchmark RULER and the multi-turn dialogue benchmark OASST2-4k, KVP significantly outperforms baselines. Furthermore, zero-shot tests on standard downstream tasks (e.g., LongBench, BOOLQ, ARC) indicate that KVP generalizes well beyond its training distribution and to longer context lengths. These results demonstrate that learning to predict future token utility is a powerful and scalable paradigm for adaptive KV cache management.
Summary: The memory demands of the autoregressive Key-Value (KV) cache make LLM inference costly, and existing eviction methods rely on heuristics such as recency or past attention scores. The paper reframes KV cache eviction as a reinforcement learning problem: learning to rank tokens by their predicted usefulness for future decoding. KV Policy (KVP) trains lightweight per-head RL agents on pre-computed generation traces using only key and value vectors, with no changes to the underlying LLM. It significantly outperforms baselines on RULER and OASST2-4k and generalizes zero-shot to downstream tasks and longer contexts.
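
The ranking view of eviction in the entry above can be illustrated with a minimal sketch. It assumes every cached token already carries a utility score; in the paper such scores come from learned per-head agents, whereas here they are made-up inputs. `evict_to_budget` is a hypothetical helper name, not part of KVP.

```python
def evict_to_budget(scores, budget):
    """Return the cache positions to keep, in original sequence order.

    scores: hypothetical per-token utility scores (higher = more useful later).
    budget: maximum number of tokens to retain.
    """
    if budget <= 0:
        return []
    # Rank positions by score, highest first; ties go to the more recent token.
    ranked = sorted(range(len(scores)), key=lambda i: (scores[i], i), reverse=True)
    # Keep the top `budget` positions and restore sequence order.
    return sorted(ranked[:budget])


scores = [0.9, 0.1, 0.4, 0.8, 0.2]
print(evict_to_budget(scores, budget=3))  # [0, 2, 3]
```

Because the keep set for any budget is a prefix of one ranking, a single ranking can serve every cache budget, which is presumably why ranking quality can be evaluated across budgets at once.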

Title: On Emergent Social World Models -- Evidence for Functional Integration of Theory of Mind and Pragmatic Reasoning in Language Models

Authors: Polina Tsvilodub, Jan-Felix Klumpp, Amir Mohammadpour, Jennifer Hu, Michael Franke
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.10298
Pdf URL: https://arxiv.org/pdf/2602.10298
Copy Paste: [[2602.10298]] On Emergent Social World Models -- Evidence for Functional Integration of Theory of Mind and Pragmatic Reasoning in Language Models(https://arxiv.org/abs/2602.10298)
Keywords: language model
Abstract: This paper investigates whether LMs recruit shared computational mechanisms for general Theory of Mind (ToM) and language-specific pragmatic reasoning in order to contribute to the general question of whether LMs may be said to have emergent "social world models", i.e., representations of mental states that are repurposed across tasks (the functional integration hypothesis). Using behavioral evaluations and causal-mechanistic experiments via functional localization methods inspired by cognitive neuroscience, we analyze LMs' performance across seven subcategories of ToM abilities (Beaudoin et al., 2020) on a substantially larger localizer dataset than used in prior like-minded work. Results from stringent hypothesis-driven statistical testing offer suggestive evidence for the functional integration hypothesis, indicating that LMs may develop interconnected "social world models" rather than isolated competencies. This work contributes novel ToM localizer data, methodological refinements to functional localization techniques, and empirical insights into the emergence of social cognition in artificial systems.
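Summary: The paper asks whether LMs recruit shared computational mechanisms for general Theory of Mind (ToM) and language-specific pragmatic reasoning, as a test of whether they have emergent "social world models" (the functional integration hypothesis). It combines behavioral evaluations with causal-mechanistic experiments based on functional localization, covering seven ToM subcategories (Beaudoin et al., 2020) on a substantially larger localizer dataset than prior work. Stringent hypothesis-driven statistical testing offers suggestive evidence for functional integration. The work contributes new ToM localizer data and methodological refinements to functional localization.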

Title: Are More Tokens Rational? Inference-Time Scaling in Language Models as Adaptive Resource Rationality

Authors: Zhimin Hu, Riya Roshan, Sashank Varma
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.10329
Pdf URL: https://arxiv.org/pdf/2602.10329
Copy Paste: [[2602.10329]] Are More Tokens Rational? Inference-Time Scaling in Language Models as Adaptive Resource Rationality(https://arxiv.org/abs/2602.10329)
Keywords: language model
Abstract: Human reasoning is shaped by resource rationality -- optimizing performance under constraints. Recently, inference-time scaling has emerged as a powerful paradigm to improve the reasoning performance of Large Language Models by expanding test-time computation. Specifically, instruction-tuned (IT) models explicitly generate long reasoning steps during inference, whereas Large Reasoning Models (LRMs) are trained by reinforcement learning to discover reasoning paths that maximize accuracy. However, it remains unclear whether resource rationality can emerge from such scaling without explicit reward related to computational costs. We introduce a Variable Attribution Task in which models infer which variables determine outcomes given candidate variables, input-output trials, and predefined logical functions. By varying the number of candidate variables and trials, we systematically manipulate task complexity. Both model types exhibit a transition from brute-force to analytic strategies as complexity increases. IT models degrade on XOR and XNOR functions, whereas LRMs remain robust. These findings suggest that models can adjust their reasoning behavior in response to task complexity, even without explicit cost-based reward. They provide compelling evidence that resource rationality is an emergent property of inference-time scaling itself.
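Summary: The paper asks whether resource rationality, optimizing performance under constraints, can emerge from inference-time scaling without any explicit reward tied to computational cost. It introduces a Variable Attribution Task in which models infer which variables determine outcomes from candidate variables, input-output trials, and predefined logical functions, with complexity controlled by the number of variables and trials. Both instruction-tuned (IT) models and Large Reasoning Models (LRMs) shift from brute-force to analytic strategies as complexity grows. IT models degrade on XOR and XNOR, whereas LRMs remain robust.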
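
To make the brute-force strategy mentioned in the entry above concrete, here is a minimal sketch. The task format is an assumption for illustration (binary variables, exactly two determining variables, four candidate functions), and `consistent_hypotheses` is a hypothetical name rather than anything from the paper.

```python
from itertools import combinations

# Candidate logical functions over two binary inputs (assumed set).
FUNCTIONS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
    "XNOR": lambda a, b: 1 - (a ^ b),
}


def consistent_hypotheses(trials, num_vars):
    """Enumerate every (variable pair, function) that explains all trials.

    trials: list of (inputs, output), where inputs is a tuple of 0/1 values.
    """
    hits = []
    for i, j in combinations(range(num_vars), 2):
        for name, fn in FUNCTIONS.items():
            if all(fn(x[i], x[j]) == y for x, y in trials):
                hits.append((i, j, name))
    return hits


# Outputs follow XOR over variables 0 and 2; variable 1 is a distractor.
trials = [
    ((0, 0, 0), 0), ((0, 1, 1), 1), ((1, 0, 0), 1),
    ((1, 1, 1), 0), ((1, 0, 1), 0), ((0, 0, 1), 1),
]
print(consistent_hypotheses(trials, num_vars=3))  # [(0, 2, 'XOR')]
```

The hypothesis count in this sketch grows with the number of variable pairs, which gives one intuition for why exhaustive checking becomes less attractive as candidate variables are added.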

Title: Geometry-Aware Decoding with Wasserstein-Regularized Truncation and Mass Penalties for Large Language Models

Authors: Arash Gholami Davoodi, Navid Rezazadeh, Seyed Pouyan Mousavi Davoudi, Pouya Pezeshkpour
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.10346
Pdf URL: https://arxiv.org/pdf/2602.10346
Copy Paste: [[2602.10346]] Geometry-Aware Decoding with Wasserstein-Regularized Truncation and Mass Penalties for Large Language Models(https://arxiv.org/abs/2602.10346)
Keywords: language model, llm
Abstract: Large language models (LLMs) must balance diversity and creativity against logical coherence in open-ended generation. Existing truncation-based samplers are effective but largely heuristic, relying mainly on probability mass and entropy while ignoring the semantic geometry of the token space. We present Top-W, a geometry-aware truncation rule that uses Wasserstein distance, defined over token-embedding geometry, to keep the cropped distribution close to the original, while explicitly balancing retained probability mass against the entropy of the kept set. Our theory yields a simple closed-form structure for the fixed-potential subset update: depending on the mass-entropy trade-off, the optimal crop either collapses to a single token or takes the form of a one-dimensional prefix that can be found efficiently with a linear scan. We implement Top-W using efficient geometry-based potentials (nearest-set or k-NN) and pair it with an alternating decoding routine that keeps the standard truncation-and-sampling interface unchanged. Extensive experiments on four benchmarks (GSM8K, GPQA, AlpacaEval, and MT-Bench) across three instruction-tuned models show that Top-W consistently outperforms prior state-of-the-art decoding approaches, achieving up to 33.7% improvement. Moreover, we find that Top-W not only improves accuracy-focused performance, but also boosts creativity under judge-based open-ended evaluation.
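Summary: Truncation-based samplers are effective but largely heuristic, relying on probability mass and entropy while ignoring the semantic geometry of the token space. Top-W is a geometry-aware truncation rule that uses Wasserstein distance over token-embedding geometry to keep the cropped distribution close to the original while balancing retained mass against the entropy of the kept set. The optimal crop either collapses to a single token or is a one-dimensional prefix found by a linear scan. Across GSM8K, GPQA, AlpacaEval, and MT-Bench on three instruction-tuned models, Top-W outperforms prior decoding approaches by up to 33.7% and also boosts creativity under judge-based open-ended evaluation.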

Title: Learning Self-Interpretation from Interpretability Artifacts: Training Lightweight Adapters on Vector-Label Pairs

Authors: Keenan Pepper, Alex McKenzie, Florin Pop, Stijn Servaes, Martin Leitgab, Mike Vaiana, Judd Rosenblatt, Michael S. A. Graziano, Diogo de Lucena
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.10352
Pdf URL: https://arxiv.org/pdf/2602.10352
Copy Paste: [[2602.10352]] Learning Self-Interpretation from Interpretability Artifacts: Training Lightweight Adapters on Vector-Label Pairs(https://arxiv.org/abs/2602.10352)
Keywords: language model, prompt, chain-of-thought
Abstract: Self-interpretation methods prompt language models to describe their own internal states, but remain unreliable due to hyperparameter sensitivity. We show that training lightweight adapters on interpretability artifacts, while keeping the LM entirely frozen, yields reliable self-interpretation across tasks and model families. A scalar affine adapter with just $d_\text{model}+1$ parameters suffices: trained adapters generate sparse autoencoder feature labels that outperform the training labels themselves (71% vs 63% generation scoring at 70B scale), identify topics with 94% recall@1 versus 1% for untrained baselines, and decode bridge entities in multi-hop reasoning that appear in neither prompt nor response, surfacing implicit reasoning without chain-of-thought. The learned bias vector alone accounts for 85% of improvement, and simpler adapters generalize better than more expressive alternatives. Controlling for model knowledge via prompted descriptions, we find self-interpretation gains outpace capability gains from 7B to 72B parameters. Our results demonstrate that self-interpretation improves with scale, without modifying the model being interpreted.
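Summary: Self-interpretation methods prompt language models to describe their own internal states but remain unreliable because of hyperparameter sensitivity. Training lightweight adapters on interpretability artifacts, with the LM entirely frozen, yields reliable self-interpretation across tasks and model families. A scalar affine adapter with just $d_\text{model}+1$ parameters suffices: it generates sparse autoencoder feature labels that outperform the training labels (71% vs 63% generation scoring at 70B scale), identifies topics with 94% recall@1 versus 1% for untrained baselines, and decodes bridge entities in multi-hop reasoning without chain-of-thought. The learned bias vector alone accounts for 85% of the improvement, and self-interpretation gains outpace capability gains from 7B to 72B parameters.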
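
The parameter count quoted in the entry above can be illustrated with a small sketch. It assumes the adapter has the form `scale * h + bias` with one shared scalar and a bias vector of width `d_model`; that form is an inference from the stated count and the mention of a learned bias vector, not something the abstract spells out. The class name is hypothetical.

```python
class ScalarAffineAdapter:
    """Maps a vector h to scale * h + bias (assumed form, see lead-in)."""

    def __init__(self, d_model):
        self.scale = 1.0             # 1 parameter shared by all dimensions
        self.bias = [0.0] * d_model  # d_model parameters

    def num_parameters(self):
        return len(self.bias) + 1

    def __call__(self, h):
        if len(h) != len(self.bias):
            raise ValueError("input width must equal d_model")
        return [self.scale * x + b for x, b in zip(h, self.bias)]


adapter = ScalarAffineAdapter(d_model=4)
adapter.scale = 2.0
adapter.bias = [0.5, 0.0, -1.0, 0.0]
print(adapter.num_parameters())       # 5
print(adapter([1.0, 2.0, 3.0, 4.0]))  # [2.5, 4.0, 5.0, 8.0]
```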

Title: Physically Interpretable AlphaEarth Foundation Model Embeddings Enable LLM-Based Land Surface Intelligence

Authors: Mashrekur Rahman
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.10354
Pdf URL: https://arxiv.org/pdf/2602.10354
Copy Paste: [[2602.10354]] Physically Interpretable AlphaEarth Foundation Model Embeddings Enable LLM-Based Land Surface Intelligence(https://arxiv.org/abs/2602.10354)
Keywords: llm, retrieval-augmented generation
Abstract: Satellite foundation models produce dense embeddings whose physical interpretability remains poorly understood, limiting their integration into environmental decision systems. Using 12.1 million samples across the Continental United States (2017--2023), we first present a comprehensive interpretability analysis of Google AlphaEarth's 64-dimensional embeddings against 26 environmental variables spanning climate, vegetation, hydrology, temperature, and terrain. Combining linear, nonlinear, and attention-based methods, we show that individual embedding dimensions map onto specific land surface properties, while the full embedding space reconstructs most environmental variables with high fidelity (12 of 26 variables exceed $R^2 > 0.90$; temperature and elevation approach $R^2 = 0.97$). The strongest dimension-variable relationships converge across all three analytical methods and remain robust under spatial block cross-validation (mean $\Delta R^2 = 0.017$) and temporally stable across all seven study years (mean inter-year correlation $r = 0.963$). Building on these validated interpretations, we then developed a Land Surface Intelligence system that implements retrieval-augmented generation over a FAISS-indexed embedding database of 12.1 million vectors, translating natural language environmental queries into satellite-grounded assessments. An LLM-as-Judge evaluation across 360 query--response cycles, using four LLMs in rotating generator, system, and judge roles, achieved weighted scores of $\mu = 3.74 \pm 0.77$ (scale 1--5), with grounding ($\mu = 3.93$) and coherence ($\mu = 4.25$) as the strongest criteria. Our results demonstrate that satellite foundation model embeddings are physically structured representations that can be operationalized for environmental and geospatial intelligence.
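Summary: The physical interpretability of satellite foundation model embeddings is poorly understood, which limits their use in environmental decision systems. Using 12.1 million samples across the Continental United States (2017--2023), the paper analyzes Google AlphaEarth's 64-dimensional embeddings against 26 environmental variables with linear, nonlinear, and attention-based methods. Individual dimensions map onto specific land surface properties, and the full embedding space reconstructs 12 of 26 variables with $R^2 > 0.90$ (temperature and elevation approach $R^2 = 0.97$), robust under spatial block cross-validation (mean $\Delta R^2 = 0.017$) and stable across years (mean inter-year correlation $r = 0.963$). A Land Surface Intelligence system then applies retrieval-augmented generation over a FAISS-indexed database of the embeddings. An LLM-as-Judge evaluation over 360 query--response cycles yields weighted scores of $\mu = 3.74 \pm 0.77$ on a 1--5 scale, with grounding ($\mu = 3.93$) and coherence ($\mu = 4.25$) scoring highest.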
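
For readers less familiar with the $R^2$ figures in the entry above, the following sketch computes the standard coefficient of determination. The numbers are made up for illustration and are unrelated to the paper's data.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the target is constant")
    return 1.0 - ss_res / ss_tot


observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(round(r_squared(observed, predicted), 2))  # 0.98
```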

Title: Autonomous Continual Learning of Computer-Use Agents for Environment Adaptation

Authors: Tianci Xue, Zeyi Liao, Tianneng Shi, Zilu Wang, Kai Zhang, Dawn Song, Yu Su, Huan Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.10356
Pdf URL: https://arxiv.org/pdf/2602.10356
Copy Paste: [[2602.10356]] Autonomous Continual Learning of Computer-Use Agents for Environment Adaptation(https://arxiv.org/abs/2602.10356)
Keywords: agent
Abstract: Real-world digital environments are highly diverse and dynamic. These characteristics cause agents to frequently encounter unseen scenarios and distribution shifts, making continual learning in specific environments essential for computer-use agents (CUAs). However, a key challenge lies in obtaining high-quality and environment-grounded agent data without relying on costly human annotation. In this work, we introduce ACuRL, an Autonomous Curriculum Reinforcement Learning framework that continually adapts agents to specific environments with zero human data. The agent first explores target environments to acquire initial experiences. During subsequent iterative training, a curriculum task generator leverages these experiences together with feedback from the previous iteration to synthesize new tasks tailored for the agent's current capabilities. To provide reliable reward signals, we introduce CUAJudge, a robust automatic evaluator for CUAs that achieves 93% agreement with human judgments. Empirically, our method effectively enables both intra-environment and cross-environment continual learning, yielding 4-22% performance gains without catastrophic forgetting on existing environments. Further analyses show highly sparse updates (e.g., 20% parameters), which helps explain the effective and robust adaptation. Our data and code are available at this https URL.
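Summary: Real-world digital environments are diverse and dynamic, so computer-use agents (CUAs) need continual learning in specific environments, yet high-quality environment-grounded data is costly to annotate. ACuRL is an Autonomous Curriculum Reinforcement Learning framework that adapts agents with zero human data: the agent explores the target environment, and a curriculum task generator uses those experiences and feedback from the previous iteration to synthesize tasks matched to the agent's current capabilities. CUAJudge, an automatic evaluator with 93% agreement with human judgments, supplies the reward signal. The method yields 4-22% performance gains without catastrophic forgetting, and analyses show highly sparse updates (e.g., 20% of parameters).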

Title: Triggers Hijack Language Circuits: A Mechanistic Analysis of Backdoor Behaviors in Large Language Models

Authors: Théo Lasnier, Wissam Antoun, Francis Kulumba, Djamé Seddah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.10382
Pdf URL: https://arxiv.org/pdf/2602.10382
Copy Paste: [[2602.10382]] Triggers Hijack Language Circuits: A Mechanistic Analysis of Backdoor Behaviors in Large Language Models(https://arxiv.org/abs/2602.10382)
Keywords: language model, llm
Abstract: Backdoor attacks pose significant security risks for Large Language Models (LLMs), yet the internal mechanisms by which triggers operate remain poorly understood. We present the first mechanistic analysis of language-switching backdoors, studying the GAPperon model family (1B, 8B, 24B parameters) which contains triggers injected during pretraining that cause output language switching. Using activation patching, we localize trigger formation to early layers (7.5-25% of model depth) and identify which attention heads process trigger information. Our central finding is that trigger-activated heads substantially overlap with heads naturally encoding output language across model scales, with Jaccard indices between 0.18 and 0.66 over the top heads identified. This suggests that backdoor triggers do not form isolated circuits but instead co-opt the model's existing language components. These findings have implications for backdoor defense: detection methods may benefit from monitoring known functional components rather than searching for hidden circuits, and mitigation strategies could potentially leverage this entanglement between injected and natural behaviors.
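Summary: The paper presents the first mechanistic analysis of language-switching backdoors, studying the GAPperon model family (1B, 8B, 24B parameters), which contains triggers injected during pretraining that switch the output language. Activation patching localizes trigger formation to early layers (7.5-25% of model depth) and identifies the attention heads that process trigger information. Trigger-activated heads substantially overlap with heads that naturally encode output language, with Jaccard indices between 0.18 and 0.66 over the top heads identified. Backdoor triggers therefore appear to co-opt existing language components rather than form isolated circuits, which suggests that defenses could monitor known functional components.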
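
The overlap measure cited in the entry above is the Jaccard index, the size of the intersection divided by the size of the union. A minimal sketch follows; the head sets are invented (layer, head) pairs for illustration, not values from the paper.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; defined here as 0.0 for two empty sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical (layer, head) identifiers.
trigger_heads = {(3, 1), (3, 7), (5, 2), (6, 0)}
language_heads = {(3, 7), (5, 2), (8, 4)}
print(jaccard(trigger_heads, language_heads))  # 0.4
```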

Title: When Tables Go Crazy: Evaluating Multimodal Models on French Financial Documents

Authors: Virginie Mouilleron, Théo Lasnier, Djamé Seddah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.10384
Pdf URL: https://arxiv.org/pdf/2602.10384
Copy Paste: [[2602.10384]] When Tables Go Crazy: Evaluating Multimodal Models on French Financial Documents(https://arxiv.org/abs/2602.10384)
Keywords: language model, llm
Abstract: Vision-language models (VLMs) perform well on many document understanding tasks, yet their reliability in specialized, non-English domains remains underexplored. This gap is especially critical in finance, where documents mix dense regulatory text, numerical tables, and visual charts, and where extraction errors can have real-world consequences. We introduce Multimodal Finance Eval, the first multimodal benchmark for evaluating French financial document understanding. The dataset contains 1,204 expert-validated questions spanning text extraction, table comprehension, chart interpretation, and multi-turn conversational reasoning, drawn from real investment prospectuses, KIDs, and PRIIPs. We evaluate six open-weight VLMs (8B-124B parameters) using an LLM-as-judge protocol. While models achieve strong performance on text and table tasks (85-90% accuracy), they struggle with chart interpretation (34-62%). Most notably, multi-turn dialogue reveals a sharp failure mode: early mistakes propagate across turns, driving accuracy down to roughly 50% regardless of model size. These results show that current VLMs are effective for well-defined extraction tasks but remain brittle in interactive, multi-step financial analysis. Multimodal Finance Eval offers a challenging benchmark to measure and drive progress in this high-stakes setting.
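Summary: The reliability of vision-language models (VLMs) in specialized, non-English domains remains underexplored, a gap that matters in finance, where extraction errors have real-world consequences. Multimodal Finance Eval is the first multimodal benchmark for French financial document understanding, with 1,204 expert-validated questions on text extraction, table comprehension, chart interpretation, and multi-turn conversational reasoning drawn from real investment prospectuses, KIDs, and PRIIPs. Six open-weight VLMs (8B-124B parameters), evaluated with an LLM-as-judge protocol, reach 85-90% accuracy on text and table tasks but only 34-62% on chart interpretation. In multi-turn dialogue, early mistakes propagate across turns and push accuracy down to roughly 50% regardless of model size.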

Title: Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs

Authors: Zhongzhi Li, Xuansheng Wu, Yijiang Li, Lijie Hu, Ninghao Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.10388
Pdf URL: https://arxiv.org/pdf/2602.10388
Copy Paste: [[2602.10388]] Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs(https://arxiv.org/abs/2602.10388)
Keywords: language model, llm
Abstract: The diversity of post-training data is critical for effective downstream performance in large language models (LLMs). Many existing approaches to constructing post-training data quantify diversity using text-based metrics that capture linguistic variation, but such metrics provide only weak signals for the task-relevant features that determine downstream performance. In this work, we introduce Feature Activation Coverage (FAC) which measures data diversity in an interpretable feature space. Building upon this metric, we further propose a diversity-driven data synthesis framework, named FAC Synthesis, that first uses a sparse autoencoder to identify missing features from a seed dataset, and then generates synthetic samples that explicitly reflect these features. Experiments show that our approach consistently improves both data diversity and downstream performance on various tasks, including instruction following, toxicity detection, reward modeling, and behavior steering. Interestingly, we identify a shared, interpretable feature space across model families (i.e., LLaMA, Mistral, and Qwen), enabling cross-model knowledge transfer. Our work provides a solid and practical methodology for exploring data-centric optimization of LLMs.
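Summary: Text-based diversity metrics give only weak signals for the task-relevant features that determine downstream performance of post-training data. The paper introduces Feature Activation Coverage (FAC), which measures data diversity in an interpretable feature space, and FAC Synthesis, which uses a sparse autoencoder to identify features missing from a seed dataset and then generates synthetic samples that explicitly reflect them. The approach improves data diversity and downstream performance on instruction following, toxicity detection, reward modeling, and behavior steering. The authors also identify a shared, interpretable feature space across LLaMA, Mistral, and Qwen that enables cross-model knowledge transfer.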
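
One way to picture a coverage measure of the kind described in the entry above is sketched below. The definition used here, the fraction of features that exceed a threshold on at least one sample, is an assumption for illustration; the abstract does not give the exact formula for FAC. Function names and the activation values are hypothetical.

```python
def activated_features(activations, threshold=0.0):
    """Indices of features that fire above `threshold` on at least one sample."""
    active = set()
    for row in activations:
        for idx, value in enumerate(row):
            if value > threshold:
                active.add(idx)
    return active


def coverage_and_missing(activations, num_features, threshold=0.0):
    """Return (coverage fraction, sorted list of features never activated)."""
    active = activated_features(activations, threshold)
    missing = sorted(set(range(num_features)) - active)
    return len(active) / num_features, missing


# Rows are samples, columns are (hypothetical) sparse autoencoder features.
seed = [
    [0.0, 1.2, 0.0, 0.0],
    [0.3, 0.0, 0.0, 0.0],
]
print(coverage_and_missing(seed, num_features=4))  # (0.5, [2, 3])
```

Under this assumed definition, the `missing` list is what a synthesis step would target when generating new samples.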

Title: LATA: A Tool for LLM-Assisted Translation Annotation

Authors: Baorong Huang, Ali Asiri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.10454
Pdf URL: https://arxiv.org/pdf/2602.10454
Copy Paste: [[2602.10454]] LATA: A Tool for LLM-Assisted Translation Annotation(https://arxiv.org/abs/2602.10454)
Keywords: language model, llm, prompt
Abstract: The construction of high-quality parallel corpora for translation research has increasingly evolved from simple sentence alignment to complex, multi-layered annotation tasks. This methodological shift presents significant challenges for structurally divergent language pairs, such as Arabic--English, where standard automated tools frequently fail to capture deep linguistic shifts or semantic nuances. This paper introduces a novel, LLM-assisted interactive tool designed to reduce the gap between scalable automation and the rigorous precision required for expert human judgment. Unlike traditional statistical aligners, our system employs a template-based Prompt Manager that leverages large language models (LLMs) for sentence segmentation and alignment under strict JSON output constraints. In this tool, automated preprocessing integrates into a human-in-the-loop workflow, allowing researchers to refine alignments and apply custom translation technique annotations through a stand-off architecture. By leveraging LLM-assisted processing, the tool balances annotation efficiency with the linguistic precision required to analyze complex translation phenomena in specialized domains.
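Summary: Building high-quality parallel corpora for translation research has evolved from simple sentence alignment to complex, multi-layered annotation, which is especially hard for structurally divergent pairs such as Arabic--English. The paper introduces an LLM-assisted interactive tool whose template-based Prompt Manager uses LLMs for sentence segmentation and alignment under strict JSON output constraints. Automated preprocessing feeds a human-in-the-loop workflow in which researchers refine alignments and apply custom translation technique annotations through a stand-off architecture. The tool balances annotation efficiency with the linguistic precision needed to analyze complex translation phenomena.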
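
A strict JSON output constraint of the kind mentioned in the entry above is typically enforced by validating the model's response before it enters the workflow. The sketch below assumes a response shaped as an array of objects with exactly `source` and `target` string fields; that schema and the function name are hypothetical, since the abstract does not describe LATA's actual format.

```python
import json


def parse_alignment(raw):
    """Parse a model response and reject anything outside the expected shape."""
    data = json.loads(raw)  # raises json.JSONDecodeError on non-JSON output
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of alignment pairs")
    pairs = []
    for item in data:
        if not isinstance(item, dict) or set(item) != {"source", "target"}:
            raise ValueError("each pair needs exactly 'source' and 'target'")
        if not all(isinstance(item[key], str) for key in ("source", "target")):
            raise ValueError("'source' and 'target' must be strings")
        pairs.append((item["source"], item["target"]))
    return pairs


raw = '[{"source": "مرحبا", "target": "Hello"}]'
print(parse_alignment(raw))
```

Responses that fail validation can be retried or handed to the human annotator instead of silently corrupting the alignment.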

Title: Neuro-Symbolic Synergy for Interactive World Modeling

Authors: Hongyu Zhao, Siyu Zhou, Haolin Yang, Zengyi Qin, Tianyi Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.10480
Pdf URL: https://arxiv.org/pdf/2602.10480
Copy Paste: [[2602.10480]] Neuro-Symbolic Synergy for Interactive World Modeling(https://arxiv.org/abs/2602.10480)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) exhibit strong general-purpose reasoning capabilities, yet they frequently hallucinate when used as world models (WMs), where strict compliance with deterministic transition rules--particularly in corner cases--is essential. In contrast, Symbolic WMs provide logical consistency but lack semantic expressivity. To bridge this gap, we propose Neuro-Symbolic Synergy (NeSyS), a framework that integrates the probabilistic semantic priors of LLMs with executable symbolic rules to achieve both expressivity and robustness. NeSyS alternates training between the two models using trajectories inadequately explained by the other. Unlike rule-based prompting, the symbolic WM directly constrains the LLM by modifying its output probability distribution. The neural WM is fine-tuned only on trajectories not covered by symbolic rules, reducing training data by 50% without loss of accuracy. Extensive experiments on three distinct interactive environments, i.e., ScienceWorld, Webshop, and Plancraft, demonstrate NeSyS's consistent advantages over baselines in both WM prediction accuracy and data efficiency.
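Summary: LLMs frequently hallucinate when used as world models (WMs), where strict compliance with deterministic transition rules is essential, while symbolic WMs are logically consistent but lack semantic expressivity. Neuro-Symbolic Synergy (NeSyS) integrates the probabilistic semantic priors of LLMs with executable symbolic rules and alternates training between the two models using trajectories the other explains inadequately. Unlike rule-based prompting, the symbolic WM constrains the LLM directly by modifying its output probability distribution. The neural WM is fine-tuned only on trajectories not covered by symbolic rules, cutting training data by 50% without loss of accuracy. Experiments on ScienceWorld, Webshop, and Plancraft show consistent gains in prediction accuracy and data efficiency.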
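
The idea of a symbolic component modifying a model's output distribution, as described in the entry above, can be sketched with hard masking and renormalization. This is one simple possibility assumed for illustration; the abstract does not specify how NeSyS adjusts the distribution. All names and probabilities below are hypothetical.

```python
def constrain(probs, is_allowed):
    """Drop candidates rejected by the rule check and renormalize the rest."""
    masked = {c: p for c, p in probs.items() if is_allowed(c)}
    total = sum(masked.values())
    if total == 0:
        # No candidate passes the rules: fall back to the unconstrained distribution.
        return dict(probs)
    return {c: p / total for c, p in masked.items()}


# Hypothetical next-state candidates scored by a language model.
llm_probs = {"door opens": 0.5, "door stays locked": 0.3, "door vanishes": 0.2}
has_key = False


def rule(outcome):
    # Symbolic rule: the door can only open if the agent holds the key.
    return outcome != "door opens" or has_key


print(constrain(llm_probs, rule))
```

The rule removes only the outcome it covers; "door vanishes" survives because no rule speaks to it, which mirrors the division of labor between the symbolic and neural parts.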

Title: Canvas-of-Thought: Grounding Reasoning via Mutable Structured States

Authors: Lingzhuang Sun, Yuxia Zhu, Ruitong Liu, Hao Liang, Zheng Sun, Caijun Jia, Honghao He, Yuchen Wu, Siyuan Li, Jingxuan Wei, Xiangxiang Zhang, Bihui Yu, Wentao Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.10494
Pdf URL: https://arxiv.org/pdf/2602.10494
Copy Paste: [[2602.10494]] Canvas-of-Thought: Grounding Reasoning via Mutable Structured States(https://arxiv.org/abs/2602.10494)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: While Chain-of-Thought (CoT) prompting has significantly advanced the reasoning capabilities of Multimodal Large Language Models (MLLMs), relying solely on linear text sequences remains a bottleneck for complex tasks. We observe that even when auxiliary visual elements are interleaved, they are often treated as static snapshots within a one-dimensional, unstructured reasoning chain. We argue that such approaches treat reasoning history as an immutable stream: correcting a local error necessitates either generating verbose downstream corrections or regenerating the entire context. This forces the model to implicitly maintain and track state updates, significantly increasing token consumption and cognitive load. This limitation is particularly acute in high-dimensional domains, such as geometry and SVG design, where the textual expression of CoT lacks explicit visual guidance, further constraining the model's reasoning precision. To bridge this gap, we introduce Canvas-of-Thought (Canvas-CoT). By leveraging an HTML Canvas as an external reasoning substrate, Canvas-CoT empowers the model to perform atomic, DOM-based CRUD operations. This architecture enables in-place state revisions without disrupting the surrounding context, allowing the model to explicitly maintain the "ground truth". Furthermore, we integrate a rendering-based critique loop that serves as a hard constraint validator, providing explicit visual feedback to resolve complex tasks that are difficult to articulate through text alone. Extensive experiments on VCode, RBench-V, and MathVista demonstrate that Canvas-CoT significantly outperforms existing baselines, establishing a new paradigm for context-efficient multimodal reasoning.

Title: On the Robustness of Knowledge Editing for Detoxification

Authors: Ming Dong, Shiyi Tang, Ziyan Peng, Guanyi Chen, Tingting He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.10504
Pdf URL: https://arxiv.org/pdf/2602.10504
Copy Paste: [[2602.10504]] On the Robustness of Knowledge Editing for Detoxification(https://arxiv.org/abs/2602.10504)
Keywords: language model
Abstract: Knowledge-Editing-based (KE-based) detoxification has emerged as a promising approach for mitigating harmful behaviours in Large Language Models. Existing evaluations, however, largely rely on automatic toxicity classifiers, implicitly assuming that reduced toxicity scores reflect genuine behavioural suppression. In this work, we propose a robustness-oriented evaluation framework for KE-based detoxification that examines its reliability beyond standard classifier-based metrics along three dimensions: optimisation robustness, compositional robustness, and cross-lingual robustness. We identify pseudo-detoxification as a common failure mode, where apparent toxicity reductions arise from degenerate generation behaviours rather than meaningful suppression of unsafe content. We further show that detoxification effectiveness degrades when multiple unsafe behaviours are edited jointly, and that both monolingual and cross-lingual detoxification remain effective only under specific model-method combinations. Overall, our results indicate that KE-based detoxification is robust only for certain models, limited numbers of detoxification objectives, and a subset of languages.

Title: LHAW: Controllable Underspecification for Long-Horizon Tasks

Authors: George Pu, Michael S. Lee, Udari Madhushani Sehwag, David J. Lee, Bryan Zhu, Yash Maurya, Mohit Raghavendra, Yuan Xue, Samuel Marc Denton
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.10525
Pdf URL: https://arxiv.org/pdf/2602.10525
Copy Paste: [[2602.10525]] LHAW: Controllable Underspecification for Long-Horizon Tasks(https://arxiv.org/abs/2602.10525)
Keywords: llm, agent
Abstract: Long-horizon workflow agents that operate effectively over extended periods are essential for truly autonomous systems. Their reliable execution critically depends on the ability to reason through ambiguous situations in which clarification seeking is necessary to ensure correct task execution. However, progress is limited by the lack of scalable, task-agnostic frameworks for systematically curating and measuring the impact of ambiguity across custom workflows. We address this gap by introducing LHAW (Long-Horizon Augmented Workflows), a modular, dataset-agnostic synthetic pipeline that transforms any well-specified task into controllable underspecified variants by systematically removing information across four dimensions - Goals, Constraints, Inputs, and Context - at configurable severity levels. Unlike approaches that rely on LLM predictions of ambiguity, LHAW validates variants through empirical agent trials, classifying them as outcome-critical, divergent, or benign based on observed terminal state divergence. We release 285 task variants from TheAgentCompany, SWE-Bench Pro and MCP-Atlas according to our taxonomy alongside formal analysis measuring how current agents detect, reason about, and resolve underspecification across ambiguous settings. LHAW provides the first systematic framework for cost-sensitive evaluation of agent clarification behavior in long-horizon settings, enabling development of reliable autonomous systems.

Title: When to Memorize and When to Stop: Gated Recurrent Memory for Long-Context Reasoning

Authors: Leheng Sheng, Yongtao Zhang, Wenchang Ma, Yaorui Shi, Ting Huang, Xiang Wang, An Zhang, Ke Shen, Tat-Seng Chua
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.10560
Pdf URL: https://arxiv.org/pdf/2602.10560
Copy Paste: [[2602.10560]] When to Memorize and When to Stop: Gated Recurrent Memory for Long-Context Reasoning(https://arxiv.org/abs/2602.10560)
Keywords: language model, llm, long context, agent
Abstract: While reasoning over long context is crucial for various real-world applications, it remains challenging for large language models (LLMs), which suffer from performance degradation as the context length grows. The recent work MemAgent tries to tackle this by processing context chunk-by-chunk in an RNN-like loop and updating a textual memory for final answering. However, this naive recurrent memory update faces two crucial drawbacks: (i) memory can quickly explode because it can update indiscriminately, even on evidence-free chunks; and (ii) the loop lacks an exit mechanism, leading to unnecessary computation even after sufficient evidence is collected. To address these issues, we propose GRU-Mem, which incorporates two text-controlled gates for more stable and efficient long-context reasoning. Specifically, in GRU-Mem, the memory only updates when the update gate is open and the recurrent loop exits immediately once the exit gate is open. To endow the model with such capabilities, we introduce two reward signals $r^{\text{update}}$ and $r^{\text{exit}}$ within end-to-end RL, rewarding the correct updating and exiting behaviors respectively. Experiments on various long-context reasoning tasks demonstrate the effectiveness and efficiency of GRU-Mem, which generally outperforms the vanilla MemAgent with up to 400% inference speed acceleration.
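The two-gate control flow described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `step` callable, its return signature, and the toy evidence rule are all assumptions made for illustration, standing in for the LLM call that would emit a candidate memory plus the two gate decisions.

```python
from typing import Callable, List, Tuple

# Hypothetical step signature (an assumption, not from the paper):
# (memory, chunk) -> (candidate_memory, update_gate_open, exit_gate_open)
StepFn = Callable[[str, str], Tuple[str, bool, bool]]


def gated_memory_read(chunks: List[str], step: StepFn, memory: str = "") -> Tuple[str, int]:
    """Process chunks in order; return the final memory and how many chunks were read."""
    read = 0
    for chunk in chunks:
        read += 1
        candidate, update_open, exit_open = step(memory, chunk)
        if update_open:   # update gate: memory changes only when the gate is open
            memory = candidate
        if exit_open:     # exit gate: leave the loop as soon as it opens
            break
    return memory, read


def toy_step(memory: str, chunk: str) -> Tuple[str, bool, bool]:
    # Toy stand-in for the model: only chunks containing "fact:" carry evidence.
    has_evidence = "fact:" in chunk
    if has_evidence:
        candidate = (memory + " " + chunk.split("fact:", 1)[1].strip()).strip()
    else:
        candidate = memory
    done = "code=42" in candidate  # toy notion of "sufficient evidence"
    return candidate, has_evidence, done


chunks = ["filler text", "fact: code=42", "more filler", "fact: unrelated"]
memory, read = gated_memory_read(chunks, toy_step)
print(memory, read)
```

In this toy run the first chunk leaves the memory untouched because the update gate stays closed, and the loop stops after the second chunk, so the last two chunks are never processed.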

Title: Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters

Authors: Ailin Huang, Ang Li, Aobo Kong, Bin Wang, Binxing Jiao, Bo Dong, Bojun Wang, Boyu Chen, Brian Li, Buyun Ma, Chang Su, Changxin Miao, Changyi Wan, Chao Lou, Chen Hu, Chen Xu, Chenfeng Yu, Chengting Feng, Chengyuan Yao, Chunrui Han, Dan Ma, Dapeng Shi, Daxin Jiang, Dehua Ma, Deshan Sun, Di Qi, Enle Liu, Fajie Zhang, Fanqi Wan, Guanzhe Huang, Gulin Yan, Guoliang Cao, Guopeng Li, Han Cheng, Hangyu Guo, Hanshan Zhang, Hao Nie, Haonan Jia, Haoran Lv, Hebin Zhou, Hekun Lv, Heng Wang, Heung-Yeung Shum, Hongbo Huang, Hongbo Peng, Hongyu Zhou, Hongyuan Wang, Houyong Chen, Huangxi Zhu, Huimin Wu, Huiyong Guo, Jia Wang, Jian Zhou, Jianjian Sun, Jiaoren Wu, Jiaran Zhang, Jiashu Lv, Jiashuo Liu, Jiayi Fu, Jiayu Liu, Jie Cheng, Jie Luo, Jie Yang, Jie Zhou, Jieyi Hou, Jing Bai, Jingcheng Hu, Jingjing Xie, Jingwei Wu, Jingyang Zhang, Jishi Zhou, Junfeng Liu, Junzhe Lin, Ka Man Lo, Kai Liang, Kaibo Liu, Kaijun Tan, Kaiwen Yan, Kaixiang Li, Kang An, Kangheng Lin, Lei Yang, Liang Lv, Liang Zhao, Liangyu Chen, Lieyu Shi, Liguo Tan, Lin Lin, Lina Chen, Luck Ma, Mengqiang Ren, Michael Li, Ming Li, Mingliang Li, Mingming Zhang, Mingrui Chen, Mitt Huang, Na Wang, Peng Liu, Qi Han
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.10604
Pdf URL: https://arxiv.org/pdf/2602.10604
Copy Paste: [[2602.10604]] Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters(https://arxiv.org/abs/2602.10604)
Keywords: gpt, agent
Abstract: We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency. We focus on what matters most when building agents: sharp reasoning and fast, reliable execution. Step 3.5 Flash pairs a 196B-parameter foundation with 11B active parameters for efficient inference. It is optimized with interleaved 3:1 sliding-window/full attention and Multi-Token Prediction (MTP-3) to reduce the latency and cost of multi-round agentic interactions. To reach frontier-level intelligence, we design a scalable reinforcement learning framework that combines verifiable signals with preference feedback, while remaining stable under large-scale off-policy training, enabling consistent self-improvement across mathematics, code, and tool use. Step 3.5 Flash demonstrates strong performance across agent, coding, and math tasks, achieving 85.4% on IMO-AnswerBench, 86.4% on LiveCodeBench-v6 (2024.08-2025.05), 88.2% on tau2-Bench, 69.0% on BrowseComp (with context management), and 51.0% on Terminal-Bench 2.0, comparable to frontier models such as GPT-5.2 xHigh and Gemini 3.0 Pro. By redefining the efficiency frontier, Step 3.5 Flash provides a high-density foundation for deploying sophisticated agents in real-world industrial environments.

Title: Online Causal Kalman Filtering for Stable and Effective Policy Optimization

Authors: Shuo He, Lang Feng, Xin Cheng, Lei Feng, Bo An
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.10609
Pdf URL: https://arxiv.org/pdf/2602.10609
Copy Paste: [[2602.10609]] Online Causal Kalman Filtering for Stable and Effective Policy Optimization(https://arxiv.org/abs/2602.10609)
Keywords: language model
Abstract: Reinforcement learning for large language models suffers from high-variance token-level importance sampling (IS) ratios, which can destabilize policy optimization at scale. To improve stability, recent methods typically use a fixed sequence-level IS ratio for all tokens in a sequence or adjust each token's IS ratio separately, thereby neglecting temporal off-policy deviation across tokens in a sequence. In this paper, we first empirically identify that local off-policy deviation is structurally inconsistent at the token level, which may distort policy-gradient updates across adjacent tokens and lead to training collapse. To address the issue, we propose Online Causal Kalman Filtering for stable and effective Policy Optimization (KPO). Concretely, we model the desired IS ratio as a latent state that evolves across tokens and apply a Kalman filter to update this state online and autoregressively based on the states of past tokens, regardless of future tokens. The resulting filtered IS ratios preserve token-wise local structure-aware variation while strongly smoothing noise spikes, yielding more stable and effective policy updates. Experimentally, KPO achieves superior results on challenging math reasoning datasets compared with state-of-the-art counterparts.
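As a rough illustration of filtering IS ratios causally, the sketch below runs a scalar random-walk Kalman filter over per-token log IS ratios. The abstract does not specify the state-space model, so the random-walk dynamics, the noise variances `q` and `r`, and the choice to filter in log space are all assumptions for illustration, not the KPO method itself.

```python
import math
from typing import List


def causal_kalman_smooth(log_ratios: List[float], q: float = 0.01, r: float = 0.25) -> List[float]:
    """Filter per-token log IS ratios with a scalar random-walk Kalman filter.

    q: assumed process-noise variance; r: assumed observation-noise variance.
    Each output depends only on the current and earlier tokens (causal).
    """
    x, p = 0.0, 1.0  # prior mean (log-ratio 0, i.e. ratio 1) and prior variance
    filtered = []
    for z in log_ratios:
        p = p + q            # predict step: the latent state drifts as a random walk
        k = p / (p + r)      # Kalman gain
        x = x + k * (z - x)  # correct the estimate with the current observation
        p = (1.0 - k) * p    # posterior variance
        filtered.append(x)
    return filtered


raw = [0.05, -0.02, 1.5, 0.03, -0.04]   # toy log-ratios with one noise spike
smooth = causal_kalman_smooth(raw)
ratios = [math.exp(v) for v in smooth]  # back to ratio space
```

Because each estimate is a convex combination of the previous estimate and the new observation, the spike at the third token is damped rather than passed through, and later tokens never influence earlier outputs.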

Title: How Do Decoder-Only LLMs Perceive Users? Rethinking Attention Masking for User Representation Learning

Authors: Jiahao Yuan, Yike Xu, Jinyong Wen, Baokun Wang, Yang Chen, Xiaotong Lin, Wuliang Huang, Ziyi Gao, Xing Fu, Yu Cheng, Weiqiang Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.10622
Pdf URL: https://arxiv.org/pdf/2602.10622
Copy Paste: [[2602.10622]] How Do Decoder-Only LLMs Perceive Users? Rethinking Attention Masking for User Representation Learning(https://arxiv.org/abs/2602.10622)
Keywords: language model, llm
Abstract: Decoder-only large language models are increasingly used as behavioral encoders for user representation learning, yet the impact of attention masking on the quality of user embeddings remains underexplored. In this work, we conduct a systematic study of causal, hybrid, and bidirectional attention masks within a unified contrastive learning framework trained on large-scale real-world Alipay data that integrates long-horizon heterogeneous user behaviors. To improve training dynamics when transitioning from causal to bidirectional attention, we propose Gradient-Guided Soft Masking, a gradient-based pre-warmup applied before a linear scheduler that gradually opens future attention during optimization. Evaluated on 9 industrial user cognition benchmarks covering prediction, preference, and marketing sensitivity tasks, our approach consistently yields more stable training and higher-quality bidirectional representations compared with causal, hybrid, and scheduler-only baselines, while remaining compatible with decoder pretraining. Overall, our findings highlight the importance of masking design and training transition in adapting decoder-only LLMs for effective user representation learning. Our code is available at this https URL.
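The idea of a scheduler that gradually opens future attention can be sketched as follows. Only the linear scheduler is shown; the gradient-based pre-warmup is not described in enough detail in the abstract to reproduce. The function names and the multiplicative form of the mask are assumptions for illustration (a real model might instead add a bias to the attention logits).

```python
from typing import List


def future_weight(step: int, total_steps: int) -> float:
    """Linear schedule: 0.0 (fully causal) at step 0, 1.0 (fully bidirectional) at the end."""
    if total_steps <= 0:
        return 1.0
    return min(1.0, max(0.0, step / total_steps))


def soft_attention_mask(seq_len: int, alpha: float) -> List[List[float]]:
    """Multiplicative mask: 1.0 for past and current positions, alpha for future positions."""
    return [[1.0 if j <= i else alpha for j in range(seq_len)] for i in range(seq_len)]


mask_start = soft_attention_mask(3, future_weight(0, 100))    # causal
mask_mid = soft_attention_mask(3, future_weight(50, 100))     # half-open
mask_end = soft_attention_mask(3, future_weight(100, 100))    # bidirectional
```

At step 0 the mask is lower-triangular, matching decoder pretraining; by the final step every entry is 1.0, so each position attends to the whole sequence.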

Title: UMEM: Unified Memory Extraction and Management Framework for Generalizable Memory

Authors: Yongshi Ye, Hui Jiang, Feihu Jiang, Tian Lan, Yichao Du, Biao Fu, Xiaodong Shi, Qianghuai Jia, Longyue Wang, Weihua Luo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.10652
Pdf URL: https://arxiv.org/pdf/2602.10652
Copy Paste: [[2602.10652]] UMEM: Unified Memory Extraction and Management Framework for Generalizable Memory(https://arxiv.org/abs/2602.10652)
Keywords: language model, llm, agent
Abstract: Self-evolving memory serves as the trainable parameters for Large Language Model (LLM)-based agents, where extraction (distilling insights from experience) and management (updating the memory bank) must be tightly coordinated. Existing methods predominantly optimize memory management while treating memory extraction as a static process, resulting in poor generalization, where agents accumulate instance-specific noise rather than robust memories. To address this, we propose Unified Memory Extraction and Management (UMEM), a self-evolving agent framework that jointly optimizes a Large Language Model to simultaneously extract and manage memories. To mitigate overfitting to specific instances, we introduce Semantic Neighborhood Modeling and optimize the model with a neighborhood-level marginal utility reward via GRPO. This approach ensures memory generalizability by evaluating memory utility across clusters of semantically related queries. Extensive experiments across five benchmarks demonstrate that UMEM significantly outperforms highly competitive baselines, achieving up to a 10.67% improvement in multi-turn interactive tasks. Furthermore, UMEM maintains a monotonic growth curve during continuous evolution. Codes and models will be publicly released.
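One way to read "neighborhood-level marginal utility" is as the average gain a memory brings across a cluster of related queries, rather than on the single query it was extracted from. The sketch below follows that reading only. The abstract does not give the reward formula, so the difference-of-scores definition, the function names, and the toy word-overlap scorer are all assumptions.

```python
from typing import Callable, List


def neighborhood_marginal_utility(memory: str, neighbors: List[str],
                                  score: Callable[[str, str], float]) -> float:
    """Mean gain in task score from adding `memory`, averaged over neighbor queries.

    `score(query, memory)` is a hypothetical evaluator; an empty memory is the baseline.
    """
    if not neighbors:
        return 0.0
    gains = [score(q, memory) - score(q, "") for q in neighbors]
    return sum(gains) / len(gains)


def toy_score(query: str, memory: str) -> float:
    # Toy scorer: 1.0 if the memory shares at least one word with the query.
    return 1.0 if set(memory.lower().split()) & set(query.lower().split()) else 0.0


neighbors = ["sort list ascending", "sort dict values", "reverse list order"]
general = neighborhood_marginal_utility("sort list", neighbors, toy_score)
specific = neighborhood_marginal_utility("ascending", neighbors, toy_score)
```

Under this toy scorer the broadly applicable memory helps every neighbor while the instance-specific one helps only a single query, which is the kind of distinction a neighborhood-level reward is meant to surface.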

Title: Benchmarks Are Not That Out of Distribution: Word Overlap Predicts Performance

Authors: Woojin Chung, Jeonghoon Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.10657
Pdf URL: https://arxiv.org/pdf/2602.10657
Copy Paste: [[2602.10657]] Benchmarks Are Not That Out of Distribution: Word Overlap Predicts Performance(https://arxiv.org/abs/2602.10657)
Keywords: language model
Abstract: Understanding what constitutes high-quality pre-training data remains a central question in language model training. In this work, we investigate whether benchmark performance is primarily driven by the degree of statistical pattern overlap between pre-training corpora and evaluation datasets. We measure this overlap using word-level unigram cross-entropy and word frequency statistics, and perform controlled experiments across 10 zero-shot benchmarks, 4 pre-training datasets spanning 8.5B to 60B tokens, and model sizes ranging from 400M to 3B parameters. Our results demonstrate a robust inverse relationship between word-level unigram cross-entropy and benchmark performance, suggesting that widely used benchmarks are strongly influenced by word overlap between training and evaluation data. Thus, larger pre-training subsets with similar word-level unigram cross-entropy yield improved downstream results, indicating that word frequency statistics play an additional role in shaping benchmark scores. Taken together, these results suggest that many standard benchmarks are only weakly out-of-distribution relative to pre-training corpora, so that simple word-overlap statistics predict benchmark performance.
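A word-level unigram cross-entropy of the kind mentioned here can be computed as in the sketch below. The abstract does not specify tokenization or smoothing, so whitespace splitting and add-k smoothing over the joint vocabulary are assumptions made to keep the example self-contained.

```python
import math
from collections import Counter
from typing import List


def unigram_cross_entropy(train_words: List[str], eval_words: List[str],
                          smoothing: float = 1.0) -> float:
    """Average negative log2 probability of eval words under a unigram model
    estimated from train words (add-k smoothing over the joint vocabulary)."""
    counts = Counter(train_words)
    vocab = set(train_words) | set(eval_words)
    total = len(train_words) + smoothing * len(vocab)
    log_probs = (math.log2((counts[w] + smoothing) / total) for w in eval_words)
    return -sum(log_probs) / len(eval_words)


train = "the cat sat on the mat the dog sat".split()
close = "the cat sat".split()             # words frequent in the training text
far = "quantum flux capacitor".split()    # words absent from the training text
h_close = unigram_cross_entropy(train, close)
h_far = unigram_cross_entropy(train, far)
```

Lower cross-entropy means the evaluation text is built from words the training corpus uses often; in this toy case `h_close` is smaller than `h_far`, mirroring the inverse relationship between cross-entropy and benchmark performance that the abstract reports.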

Title: Targeted Syntactic Evaluation of Language Models on Georgian Case Alignment

Authors: Daniel Gallagher, Gerhard Heyer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.10661
Pdf URL: https://arxiv.org/pdf/2602.10661
Copy Paste: [[2602.10661]] Targeted Syntactic Evaluation of Language Models on Georgian Case Alignment(https://arxiv.org/abs/2602.10661)
Keywords: language model
Abstract: This paper evaluates the performance of transformer-based language models on split-ergative case alignment in Georgian, a particularly rare system for assigning grammatical cases to mark argument roles. We focus on subject and object marking determined through various permutations of nominative, ergative, and dative noun forms. A treebank-based approach for the generation of minimal pairs using the Grew query language is implemented. We create a dataset of 370 syntactic tests made up of seven tasks containing 50-70 samples each, where three noun forms are tested in any given sample. Five encoder- and two decoder-only models are evaluated with word- and/or sentence-level accuracy metrics. Regardless of the specific syntactic makeup, models performed worst at assigning the ergative case correctly and best at assigning the nominative case correctly. Performance correlated with the overall frequency distribution of the three forms (NOM > DAT > ERG). Though data scarcity is a known issue for low-resource languages, we show that the highly specific role of the ergative, along with a lack of available training data, likely contributes to poor performance on this case. The dataset is made publicly available and the methodology provides an interesting avenue for future syntactic evaluations of languages where benchmarks are limited.

Title: Locomo-Plus: Beyond-Factual Cognitive Memory Evaluation Framework for LLM Agents

Authors: Yifei Li, Weidong Guo, Lingling Zhang, Rongman Xu, Muye Huang, Hui Liu, Lijiao Xu, Yu Xu, Jun Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.10715
Pdf URL: https://arxiv.org/pdf/2602.10715
Copy Paste: [[2602.10715]] Locomo-Plus: Beyond-Factual Cognitive Memory Evaluation Framework for LLM Agents(https://arxiv.org/abs/2602.10715)
Keywords: llm, prompt, agent
Abstract: Long-term conversational memory is a core capability for LLM-based dialogue systems, yet existing benchmarks and evaluation protocols primarily focus on surface-level factual recall. In realistic interactions, appropriate responses often depend on implicit constraints such as user state, goals, or values that are not explicitly queried later. To evaluate this setting, we introduce LoCoMo-Plus, a benchmark for assessing cognitive memory under cue-trigger semantic disconnect, where models must retain and apply latent constraints across long conversational contexts. We further show that conventional string-matching metrics and explicit task-type prompting are misaligned with such scenarios, and propose a unified evaluation framework based on constraint consistency. Experiments across diverse backbone models, retrieval-based methods, and memory systems demonstrate that cognitive memory remains challenging and reveal failures not captured by existing benchmarks. Our code and evaluation framework are publicly available at: this https URL.

Title: Macaron: Controlled, Human-Written Benchmark for Multilingual and Multicultural Reasoning via Template-Filling

Authors: Alaa Elsetohy, Sama Hadhoud, Haryo Akbarianto Wibowo, Chenxi Whitehouse, Genta Indra Winata, Fajri Koto, Alham Fikri Aji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.10732
Pdf URL: https://arxiv.org/pdf/2602.10732
Copy Paste: [[2602.10732]] Macaron: Controlled, Human-Written Benchmark for Multilingual and Multicultural Reasoning via Template-Filling(https://arxiv.org/abs/2602.10732)
Keywords: llm
Abstract: Multilingual benchmarks rarely test reasoning over culturally grounded premises: translated datasets keep English-centric scenarios, while culture-first datasets often lack control over the reasoning required. We propose Macaron, a template-first benchmark that factorizes reasoning type and cultural aspect across question languages. Using 100 language-agnostic templates that cover 7 reasoning types and 22 cultural aspects, native annotators create scenario-aligned English and local-language multiple-choice questions and systematically derived True/False questions. Macaron contains 11,862 instances spanning 20 countries/cultural contexts, 10 scripts, and 20 languages (including low-resource ones like Amharic, Yoruba, Zulu, Kyrgyz, and some Arabic dialects). In zero-shot evaluation of 21 multilingual LLMs, reasoning-mode models achieve the strongest performance and near-parity between English and local languages, while open-weight models degrade substantially in local languages and often approach chance on T/F tasks. Culture-grounded mathematical and counting templates are consistently the hardest. The data can be accessed at this https URL.

Title: Reinforced Curriculum Pre-Alignment for Domain-Adaptive VLMs

Authors: Yuming Yan, Shuo Yang, Kai Tang, Sihong Chen, Yang Zhang, Ke Xu, Dan Hu, Qun Yu, Pengfei Hu, Edith C.H. Ngai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.10740
Pdf URL: https://arxiv.org/pdf/2602.10740
Copy Paste: [[2602.10740]] Reinforced Curriculum Pre-Alignment for Domain-Adaptive VLMs(https://arxiv.org/abs/2602.10740)
Keywords: language model, llm
Abstract: Vision-Language Models (VLMs) demonstrate remarkable general-purpose capabilities but often fall short in specialized domains such as medical imaging or geometric problem-solving. Supervised Fine-Tuning (SFT) can enhance performance within a target domain, but it typically causes catastrophic forgetting, limiting its generalization. The central challenge, therefore, is to adapt VLMs to new domains while preserving their general-purpose capabilities. Continual pretraining is effective for expanding knowledge in Large Language Models (LLMs), but it is less feasible for VLMs due to prohibitive computational costs and the unavailability of pretraining data for most open-source models. This necessitates efficient post-training adaptation methods. Reinforcement learning (RL)-based approaches such as Group Relative Policy Optimization (GRPO) have shown promise in preserving general abilities, yet they often fail in domain adaptation scenarios where the model initially lacks sufficient domain knowledge, leading to optimization collapse. To bridge this gap, we propose Reinforced Curriculum Pre-Alignment (RCPA), a novel post-training paradigm that introduces a curriculum-aware progressive modulation mechanism. In the early phase, RCPA applies partial output constraints to safely expose the model to new domain concepts. As the model's domain familiarity increases, training gradually transitions to full generation optimization, refining responses and aligning them with domain-specific preferences. This staged adaptation balances domain knowledge acquisition with the preservation of general multimodal capabilities. Extensive experiments across specialized domains and general benchmarks validate the effectiveness of RCPA, establishing a practical pathway toward building high-performing and domain-adaptive VLMs.

Title: Deep Learning-based Method for Expressing Knowledge Boundary of Black-Box LLM

Authors: Haotian Sheng, Heyong Wang, Ming Hong, Hongman He, Junqiu Liu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.10801
Pdf URL: https://arxiv.org/pdf/2602.10801
Copy Paste: [[2602.10801]] Deep Learning-based Method for Expressing Knowledge Boundary of Black-Box LLM(https://arxiv.org/abs/2602.10801)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) have achieved remarkable success; however, the emergence of content generation distortion (hallucination) limits their practical applications. The core cause of hallucination lies in LLMs' lack of awareness regarding their stored internal knowledge, preventing them from expressing their knowledge state on questions beyond their internal knowledge boundaries, as humans do. However, existing research on knowledge boundary expression primarily focuses on white-box LLMs, leaving methods suitable for black-box LLMs, which offer only API access without revealing internal parameters, largely unexplored. Against this backdrop, this paper proposes LSCL (LLM-Supervised Confidence Learning), a deep learning-based method for expressing the knowledge boundaries of black-box LLMs. Based on the knowledge distillation framework, this method designs a deep learning model. Taking the input question, output answer, and token probability from a black-box LLM as inputs, it constructs a mapping between the inputs and the model's internal knowledge state, enabling the quantification and expression of the black-box LLM's knowledge boundaries. Experiments conducted on diverse public datasets and with multiple prominent black-box LLMs demonstrate that LSCL effectively assists black-box LLMs in accurately expressing their knowledge boundaries. It significantly outperforms existing baseline models on metrics such as accuracy and recall rate. Furthermore, considering scenarios where some black-box LLMs do not support access to token probability, an adaptive alternative method is proposed. The performance of this alternative approach is close to that of LSCL and surpasses baseline models.

Title: Beyond Confidence: The Rhythms of Reasoning in Generative Models

Authors: Deyuan Liu, Zecheng Wang, Zhanyue Qin, Zhiying Tu, Dianhui Chu, Dianbo Sui
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.10816
Pdf URL: https://arxiv.org/pdf/2602.10816
Copy Paste: [[2602.10816]] Beyond Confidence: The Rhythms of Reasoning in Generative Models(https://arxiv.org/abs/2602.10816)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) exhibit impressive capabilities yet suffer from sensitivity to slight input context variations, hampering reliability. Conventional metrics like accuracy and perplexity fail to assess local prediction robustness, as normalized output probabilities can obscure the underlying resilience of an LLM's internal state to perturbations. We introduce the Token Constraint Bound ($\delta_{\mathrm{TCB}}$), a novel metric that quantifies the maximum internal state perturbation an LLM can withstand before its dominant next-token prediction significantly changes. Intrinsically linked to output embedding space geometry, $\delta_{\mathrm{TCB}}$ provides insights into the stability of the model's internal predictive commitment. Our experiments show $\delta_{\mathrm{TCB}}$ correlates with effective prompt engineering and uncovers critical prediction instabilities missed by perplexity during in-context learning and text generation. $\delta_{\mathrm{TCB}}$ offers a principled, complementary approach to analyze and potentially improve the contextual stability of LLM predictions.
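
To give some intuition for how a perturbation bound can depend on output embedding geometry, the sketch below computes a simpler, related quantity. For a linear output layer with logits z = W h, the smallest L2 change to the hidden state h at which token j ties the current top token is (z_top - z_j) / ||w_top - w_j||. This is not the paper's definition of $\delta_{\mathrm{TCB}}$, which the abstract does not spell out; the function name `flip_radius` and the toy matrix are assumptions made for illustration.

```python
import math

def flip_radius(W, h):
    """Smallest L2 perturbation of h at which another token ties the
    current argmax of the linear logits z = W h (illustrative proxy only)."""
    logits = [sum(w_k * h_k for w_k, h_k in zip(row, h)) for row in W]
    top = max(range(len(logits)), key=logits.__getitem__)
    best = math.inf
    for j, row in enumerate(W):
        if j == top:
            continue
        # Distance from h to the hyperplane where tokens `top` and `j` tie.
        diff_norm = math.sqrt(sum((a - b) ** 2 for a, b in zip(W[top], row)))
        if diff_norm == 0.0:
            continue  # identical output embeddings always produce equal logits
        best = min(best, (logits[top] - logits[j]) / diff_norm)
    return top, best

# Toy output embedding matrix (3 tokens, 2 hidden dimensions) and hidden state.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
h = [3.0, 1.0]
top, radius = flip_radius(W, h)
```

In this toy case the top token is index 0, and the nearest competitor is token 1, whose tie boundary lies at distance sqrt(2) from h. Tokens whose output embeddings sit close to the top token's embedding shrink the radius even when the softmax probability of the top token looks high.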

Title: C-MOP: Integrating Momentum and Boundary-Aware Clustering for Enhanced Prompt Evolution

Authors: Binwei Yan, Yifei Fu, Mingjian Zhu, Hanting Chen, Mingxuan Yuan, Yunhe Wang, Hailin Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.10874
Pdf URL: https://arxiv.org/pdf/2602.10874
Copy Paste: [[2602.10874]] C-MOP: Integrating Momentum and Boundary-Aware Clustering for Enhanced Prompt Evolution(https://arxiv.org/abs/2602.10874)
Keywords: language model, llm, prompt
Abstract: Automatic prompt optimization is a promising direction to boost the performance of Large Language Models (LLMs). However, existing methods often suffer from noisy and conflicting update signals. In this research, we propose C-MOP (Cluster-based Momentum Optimized Prompting), a framework that stabilizes optimization via Boundary-Aware Contrastive Sampling (BACS) and Momentum-Guided Semantic Clustering (MGSC). Specifically, BACS utilizes batch-level information to mine tripartite features--Hard Negatives, Anchors, and Boundary Pairs--to precisely characterize the typical representation and decision boundaries of positive and negative prompt samples. To resolve semantic conflicts, MGSC introduces a textual momentum mechanism with temporal decay that distills persistent consensus from fluctuating gradients across iterations. Extensive experiments demonstrate that C-MOP consistently outperforms SOTA baselines like PromptWizard and ProTeGi, yielding average gains of 1.58% and 3.35%. Notably, C-MOP enables a general LLM with 3B activated parameters to surpass a 70B domain-specific dense LLM, highlighting its effectiveness in driving precise prompt evolution. The code is available at this https URL.
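
The idea of a momentum mechanism with temporal decay can be illustrated with a small numeric sketch. It assumes that each optimization iteration yields textual feedback already grouped into semantic clusters, and it keeps an exponentially decayed weight per cluster so that feedback which recurs across iterations outweighs one-off suggestions. The function name `update_momentum`, the decay value, and the cluster labels are hypothetical; the abstract does not describe how MGSC is actually implemented.

```python
def update_momentum(momentum, cluster_counts, decay=0.5):
    """Decay all existing cluster weights, then add this iteration's counts."""
    updated = {cluster: decay * weight for cluster, weight in momentum.items()}
    for cluster, count in cluster_counts.items():
        updated[cluster] = updated.get(cluster, 0.0) + count
    return updated

# Three iterations of clustered feedback (hypothetical labels and counts).
iterations = [
    {"be concise": 2, "cite units": 1},
    {"be concise": 1},
    {"be concise": 1, "use json": 1},
]

momentum = {}
for counts in iterations:
    momentum = update_momentum(momentum, counts)

# The cluster with the largest decayed weight is the persistent consensus.
consensus = max(momentum, key=momentum.get)
```

Here "be concise" appears in every iteration and ends with the largest weight, while "cite units", seen only once at the start, has decayed to a quarter of its original count.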

Title: Diagnosing Structural Failures in LLM-Based Evidence Extraction for Meta-Analysis

Authors: Zhiyin Tan, Jennifer D'Souza
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.10881
Pdf URL: https://arxiv.org/pdf/2602.10881
Copy Paste: [[2602.10881]] Diagnosing Structural Failures in LLM-Based Evidence Extraction for Meta-Analysis(https://arxiv.org/abs/2602.10881)
Keywords: language model, llm
Abstract: Systematic reviews and meta-analyses rely on converting narrative articles into structured, numerically grounded study records. Despite rapid advances in large language models (LLMs), it remains unclear whether they can meet the structural requirements of this process, which hinge on preserving roles, methods, and effect-size attribution across documents rather than on recognizing isolated entities. We propose a structural, diagnostic framework that evaluates LLM-based evidence extraction as a progression of schema-constrained queries with increasing relational and numerical complexity, enabling precise identification of failure points beyond atom-level extraction. Using a manually curated corpus spanning five scientific domains, together with a unified query suite and evaluation protocol, we evaluate two state-of-the-art LLMs under both per-document and long-context, multi-document input regimes. Across domains and models, performance remains moderate for single-property queries but degrades sharply once tasks require stable binding between variables, roles, statistical methods, and effect sizes. Full meta-analytic association tuples are extracted with near-zero reliability, and long-context inputs further exacerbate these failures. Downstream aggregation amplifies even minor upstream errors, rendering corpus-level statistics unreliable. Our analysis shows that these limitations stem not from entity recognition errors, but from systematic structural breakdowns, including role reversals, cross-analysis binding drift, instance compression in dense result sections, and numeric misattribution, indicating that current LLMs lack the structural fidelity, relational binding, and numerical grounding required for automated meta-analysis. The code and data are publicly available at GitHub (this https URL).

Title: The CLEF-2026 FinMMEval Lab: Multilingual and Multimodal Evaluation of Financial AI Systems

Authors: Zhuohan Xie, Rania Elbadry, Fan Zhang, Georgi Georgiev, Xueqing Peng, Lingfei Qian, Jimin Huang, Dimitar Dimitrov, Vanshikaa Jani, Yuyang Dai, Jiahui Geng, Yuxia Wang, Ivan Koychev, Veselin Stoyanov, Preslav Nakov
Subjects: cs.CL, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2602.10886
Pdf URL: https://arxiv.org/pdf/2602.10886
Copy Paste: [[2602.10886]] The CLEF-2026 FinMMEval Lab: Multilingual and Multimodal Evaluation of Financial AI Systems(https://arxiv.org/abs/2602.10886)
Keywords: language model, llm
Abstract: We present the setup and the tasks of the FinMMEval Lab at CLEF 2026, which introduces the first multilingual and multimodal evaluation framework for financial Large Language Models (LLMs). While recent advances in financial natural language processing have enabled automated analysis of market reports, regulatory documents, and investor communications, existing benchmarks remain largely monolingual, text-only, and limited to narrow subtasks. FinMMEval 2026 addresses this gap by offering three interconnected tasks that span financial understanding, reasoning, and decision-making: Financial Exam Question Answering, Multilingual Financial Question Answering (PolyFiQA), and Financial Decision Making. Together, these tasks provide a comprehensive evaluation suite that measures models' ability to reason, generalize, and act across diverse languages and modalities. The lab aims to promote the development of robust, transparent, and globally inclusive financial AI systems, with datasets and evaluation resources publicly released to support reproducible research.

Title: Search or Accelerate: Confidence-Switched Position Beam Search for Diffusion Language Models

Authors: Mingyu Cao, Alvaro Correia, Christos Louizos, Shiwei Liu, Lu Yin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.10953
Pdf URL: https://arxiv.org/pdf/2602.10953
Copy Paste: [[2602.10953]] Search or Accelerate: Confidence-Switched Position Beam Search for Diffusion Language Models(https://arxiv.org/abs/2602.10953)
Keywords: language model, prompt
Abstract: Diffusion Language Models (DLMs) generate text by iteratively denoising a masked sequence, repeatedly deciding which positions to commit at each step. Standard decoding follows a greedy rule: unmask the most confident positions, yet this local choice can lock the model into a suboptimal unmasking order, especially on reasoning-heavy prompts. We present SOAR, a training-free decoding algorithm that adapts its behavior to the model's uncertainty. When confidence is low, SOAR briefly widens the search over alternative unmasking decisions to avoid premature commitments; when confidence is high, it collapses the search and decodes many positions in parallel to reduce the number of denoising iterations. Across mathematical reasoning and code generation benchmarks (GSM8K, MBPP, HumanEval) on Dream-7B and LLaDA-8B, SOAR improves generation quality while maintaining competitive inference speed, offering a practical way to balance quality and efficiency in DLM decoding.
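
The switch between searching and accelerating can be sketched as a single decision over the still-masked positions. This is a toy illustration only: the threshold `tau`, the `beam_width`, and the function name `choose_action` are assumptions, and the abstract does not state SOAR's actual switching rule.

```python
def choose_action(confidences, tau=0.9, beam_width=2):
    """Toy confidence switch over still-masked positions.

    confidences maps each masked position to the model's confidence in its
    top token. Returns ("accelerate", positions) or ("search", positions).
    """
    confident = sorted(p for p, c in confidences.items() if c >= tau)
    if confident:
        # High confidence: commit every confident position in the same step.
        return "accelerate", confident
    # Low confidence: keep several candidate positions open for a wider search.
    ranked = sorted(confidences, key=lambda p: (-confidences[p], p))
    return "search", ranked[:beam_width]

fast = choose_action({0: 0.95, 1: 0.50, 2: 0.92})
wide = choose_action({0: 0.40, 1: 0.60, 2: 0.50})
```

In the first call two positions clear the threshold and are committed in parallel; in the second none does, so the two most promising positions are returned as alternative unmasking choices to explore.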

Title: Language Model Inversion through End-to-End Differentiation

Authors: Kevin Yandoka Denamganaï, Kartic Subr
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.11044
Pdf URL: https://arxiv.org/pdf/2602.11044
Copy Paste: [[2602.11044]] Language Model Inversion through End-to-End Differentiation(https://arxiv.org/abs/2602.11044)
Keywords: language model, prompt
Abstract: Despite emerging research on Language Models (LM), few approaches analyse the invertibility of LMs. That is, given a LM and a desirable target output sequence of tokens, determining what input prompts would yield the target output remains an open problem. We formulate this problem as a classical gradient-based optimisation. First, we propose a simple algorithm to achieve end-to-end differentiability of a given (frozen) LM and then find optimised prompts via gradient descent. Our central insight is to view LMs as functions operating on sequences of distributions over tokens (rather than the traditional view as functions on sequences of tokens). Our experiments and ablations demonstrate that our DLM-powered inversion can reliably and efficiently optimise prompts of lengths $10$ and $80$ for targets of length $20$, for several white-box LMs (out-of-the-box).
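
Viewing an LM as a function on sequences of distributions over tokens can be illustrated with the standard relaxation of the embedding lookup: each input position carries a probability vector over the vocabulary, and its input embedding is the probability-weighted average of the embedding rows. A one-hot distribution recovers the ordinary lookup, while a soft distribution gives a point that varies smoothly with the probabilities, which is what makes gradient descent over prompts possible. This is a minimal sketch of that general idea, not the paper's algorithm; `soft_embed` and the toy table are hypothetical.

```python
def soft_embed(token_dist, embedding_table):
    """Expected embedding under a distribution over the vocabulary."""
    dim = len(embedding_table[0])
    return [
        sum(p * row[d] for p, row in zip(token_dist, embedding_table))
        for d in range(dim)
    ]

# Toy embedding table: 3 tokens, 2 dimensions.
E = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

hard = soft_embed([0.0, 1.0, 0.0], E)  # one-hot: same as looking up token 1
soft = soft_embed([0.5, 0.5, 0.0], E)  # mixture of tokens 0 and 1
```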

Title: Embedding Inversion via Conditional Masked Diffusion Language Models

Authors: Han Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.11047
Pdf URL: https://arxiv.org/pdf/2602.11047
Copy Paste: [[2602.11047]] Embedding Inversion via Conditional Masked Diffusion Language Models(https://arxiv.org/abs/2602.11047)
Keywords: language model
Abstract: We frame embedding inversion as conditional masked diffusion, recovering all tokens in parallel through iterative denoising rather than sequential autoregressive generation. A masked diffusion language model is conditioned on the target embedding via adaptive layer normalization, requiring only 8 forward passes through a 78M parameter model with no access to the target encoder. On 32-token sequences across three embedding models, the method achieves 81.3% token accuracy and 0.87 cosine similarity.
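
The two reported metrics can be computed as follows, assuming the conventional definitions: token accuracy as the fraction of positions where the recovered token matches the original, and cosine similarity between two embedding vectors. Which vectors the paper compares is not stated in the abstract, so the inputs here are placeholders.

```python
import math

def token_accuracy(predicted, reference):
    """Fraction of positions at which the predicted token equals the reference."""
    if not reference or len(predicted) != len(reference):
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(p == r for p, r in zip(predicted, reference)) / len(reference)

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

acc = token_accuracy([1, 2, 3, 4], [1, 2, 0, 4])
cos = cosine_similarity([1.0, 0.0], [1.0, 1.0])
```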

Title: SteuerLLM: Local specialized large language model for German tax law analysis

Authors: Sebastian Wind, Jeta Sopa, Laurin Schmid, Quirin Jackl, Sebastian Kiefer, Fei Wu, Martin Mayr, Harald Köstler, Gerhard Wellein, Andreas Maier, Soroosh Tayebi Arasteh
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.11081
Pdf URL: https://arxiv.org/pdf/2602.11081
Copy Paste: [[2602.11081]] SteuerLLM: Local specialized large language model for German tax law analysis(https://arxiv.org/abs/2602.11081)
Keywords: language model, llm
Abstract: Large language models (LLMs) demonstrate strong general reasoning and language understanding, yet their performance degrades in domains governed by strict formal rules, precise terminology, and legally binding structure. Tax law exemplifies these challenges, as correct answers require exact statutory citation, structured legal argumentation, and numerical accuracy under rigid grading schemes. We algorithmically generate SteuerEx, the first open benchmark derived from authentic German university tax law examinations. SteuerEx comprises 115 expert-validated examination questions spanning six core tax law domains and multiple academic levels, and employs a statement-level, partial-credit evaluation framework that closely mirrors real examination practice. We further present SteuerLLM, a domain-adapted LLM for German tax law trained on a large-scale synthetic dataset generated from authentic examination material using a controlled retrieval-augmented pipeline. SteuerLLM (28B parameters) consistently outperforms general-purpose instruction-tuned models of comparable size and, in several cases, substantially larger systems, demonstrating that domain-specific data and architectural adaptation are more decisive than parameter scale for performance on realistic legal reasoning tasks. All benchmark data, training datasets, model weights, and evaluation code are released openly to support reproducible research in domain-specific legal artificial intelligence. A web-based demo of SteuerLLM is available at this https URL.

Title: DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning

Authors: Yicheng Chen, Zerun Ma, Xinchen Xie, Yining Li, Kai Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.11089
Pdf URL: https://arxiv.org/pdf/2602.11089
Copy Paste: [[2602.11089]] DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning(https://arxiv.org/abs/2602.11089)
Keywords: language model, llm
Abstract: In the current landscape of Large Language Models (LLMs), the curation of large-scale, high-quality training data is a primary driver of model performance. A key lever is the data recipe, which comprises a data processing pipeline to transform raw sources into training corpora. Despite the growing use of LLMs to automate individual data processing steps, such as data synthesis and filtering, the overall design of data recipes remains largely manual and labor-intensive, requiring substantial human expertise and iteration. To bridge this gap, we formulate end-to-end data recipe generation for LLM adaptation. Given a target benchmark and a pool of available data sources, a model is required to output a complete data recipe that adapts a base LLM to the target task. We present DataChef-32B, which performs online reinforcement learning using a proxy reward that predicts downstream performance for candidate recipes. Across six held-out tasks, DataChef-32B produces practical recipes that reach comparable downstream performance to those curated by human experts. Notably, the recipe from DataChef-32B adapts Qwen3-1.7B-Base to the math domain, achieving 66.7 on AIME'25 and surpassing Qwen3-1.7B. This work sheds new light on automating LLM training and developing self-evolving AI systems.

Title: Can Large Language Models Make Everyone Happy?

Authors: Usman Naseem, Gautam Siddharth Kashyap, Ebad Shabbir, Sushant Kumar Ray, Abdullah Mohammad, Rafiq Ali
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.11091
Pdf URL: https://arxiv.org/pdf/2602.11091
Copy Paste: [[2602.11091]] Can Large Language Models Make Everyone Happy?(https://arxiv.org/abs/2602.11091)
Keywords: language model, llm, prompt
Abstract: Misalignment in Large Language Models (LLMs) refers to the failure to simultaneously satisfy safety, value, and cultural dimensions, leading to behaviors that diverge from human expectations in real-world settings where these dimensions must co-occur. Existing benchmarks, such as SAFETUNEBED (safety-centric), VALUEBENCH (value-centric), and WORLDVIEW-BENCH (culture-centric), primarily evaluate these dimensions in isolation and therefore provide limited insight into their interactions and trade-offs. More recent efforts, including MIB and INTERPRETABILITY BENCHMARK, which are based on mechanistic interpretability, offer valuable perspectives on model failures; however, they remain insufficient for systematically characterizing cross-dimensional trade-offs. To address these gaps, we introduce MisAlign-Profile, a unified benchmark for measuring misalignment trade-offs inspired by mechanistic profiling. First, we construct MISALIGNTRADE, an English misaligned-aligned dataset across a taxonomy of 112 normative domains, including 14 safety, 56 value, and 42 cultural domains. In addition to domain labels, each prompt is classified into one of three orthogonal semantic types (object, attribute, or relation misalignment) using Gemma-2-9B-it and expanded via Qwen3-30B-A3B-Instruct-2507, with SimHash-based fingerprinting to avoid duplicates. Each prompt is paired with misaligned and aligned responses through two-stage rejection sampling to ensure quality. Second, we benchmark general-purpose, fine-tuned, and open-weight LLMs on MISALIGNTRADE, revealing 12%-34% misalignment trade-offs across dimensions.

Title: Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away

Authors: Soumya Suvra Ghosal, Souradip Chakraborty, Vaibhav Singh, Furong Huang, Dinesh Manocha, Amrit Singh Bedi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.11096
Pdf URL: https://arxiv.org/pdf/2602.11096
Copy Paste: [[2602.11096]] Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away(https://arxiv.org/abs/2602.11096)
Keywords: chain-of-thought
Abstract: Reinforcement learning (RL) based post-training for explicit chain-of-thought (e.g., GRPO) improves the reasoning ability of multimodal large-scale reasoning models (MLRMs). But recent evidence shows that it can simultaneously degrade safety alignment and increase jailbreak success rates. We propose SafeThink, a lightweight inference-time defense that treats safety recovery as a satisficing constraint rather than a maximization objective. SafeThink monitors the evolving reasoning trace with a safety reward model and conditionally injects an optimized short corrective prefix ("Wait, think safely") only when the safety threshold is violated. In our evaluations across six open-source MLRMs and four jailbreak benchmarks (JailbreakV-28K, Hades, FigStep, and MM-SafetyBench), SafeThink reduces attack success rates by 30-60% (e.g., LlamaV-o1: 63.33% to 5.74% on JailbreakV-28K, R1-Onevision: 69.07% to 5.65% on Hades) while preserving reasoning performance (MathVista accuracy: 65.20% to 65.00%). A key empirical finding from our experiments is that safety recovery is often only a few steering steps away: intervening in the first 1-3 reasoning steps typically suffices to redirect the full generation toward safe completions.
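
The conditional injection described in the abstract can be sketched as a monitoring loop around step-by-step generation. The sketch assumes two hypothetical callables, `generate_step` (the reasoning model) and `safety_score` (the safety reward model), and it limits steering to the first few steps, reflecting the finding that early intervention usually suffices; that cutoff, the threshold value, and the toy stand-ins are assumptions, not details from the paper.

```python
def safe_generate(generate_step, safety_score, num_steps, threshold=0.5,
                  prefix="Wait, think safely", max_steer_steps=3):
    """Generate a reasoning trace, injecting a corrective prefix only when a
    candidate step falls below the safety threshold in the early steps."""
    trace = []
    for i in range(num_steps):
        candidate = generate_step(trace)
        if i < max_steer_steps and safety_score(trace + [candidate]) < threshold:
            # Threshold violated: inject the prefix and regenerate this step.
            trace.append(prefix)
            candidate = generate_step(trace)
        trace.append(candidate)
    return trace

# Toy stand-ins: the "model" turns safe once the prefix is in its context.
def toy_generate(trace):
    return "safe step" if "Wait, think safely" in trace else "unsafe step"

def toy_score(trace):
    return 0.0 if trace[-1] == "unsafe step" else 1.0

trace = safe_generate(toy_generate, toy_score, num_steps=3)
```

With these toy stand-ins the prefix is injected once, before the first step, and the remaining steps proceed without further intervention.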

Title: TEGRA: Text Encoding With Graph and Retrieval Augmentation for Misinformation Detection

Authors: Géraud Faye, Wassila Ouerdane, Guillaume Gadek, Céline Hudelot
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.11106
Pdf URL: https://arxiv.org/pdf/2602.11106
Copy Paste: [[2602.11106]] TEGRA: Text Encoding With Graph and Retrieval Augmentation for Misinformation Detection(https://arxiv.org/abs/2602.11106)
Keywords: language model
Abstract: Misinformation detection is a critical task that can benefit significantly from the integration of external knowledge, much like manual fact-checking. In this work, we propose a novel method for representing textual documents that facilitates the incorporation of information from a knowledge base. Our approach, Text Encoding with Graph (TEG), processes documents by extracting structured information in the form of a graph and encoding both the text and the graph for classification purposes. Through extensive experiments, we demonstrate that this hybrid representation enhances misinformation detection performance compared to using language models alone. Furthermore, we introduce TEGRA, an extension of our framework that integrates domain-specific knowledge, further enhancing classification accuracy in most cases.

Title: Data Repetition Beats Data Scaling in Long-CoT Supervised Fine-Tuning

Authors: Dawid J. Kopiczko, Sagar Vaze, Tijmen Blankevoort, Yuki M. Asano
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.11149
Pdf URL: https://arxiv.org/pdf/2602.11149
Copy Paste: [[2602.11149]] Data Repetition Beats Data Scaling in Long-CoT Supervised Fine-Tuning(https://arxiv.org/abs/2602.11149)
Keywords: language model, chain-of-thought
Abstract: Supervised fine-tuning (SFT) on chain-of-thought data is an essential post-training step for reasoning language models. Standard machine learning intuition suggests that training with more unique training samples yields better generalization. Counterintuitively, we show that SFT benefits from repetition: under a fixed update budget, training for more epochs on smaller datasets outperforms single-epoch training on larger datasets. On AIME'24/25 and GPQA benchmarks, Olmo3-7B trained for 128 epochs on 400 samples outperforms the equivalent 1 epoch on 51200 samples by 12-26 percentage points, with no additional catastrophic forgetting. We find that training token accuracy reliably signals when repetition has saturated; improvements from additional epochs plateau at full memorization, a pattern consistent across all settings. These findings provide a practical approach for reasoning SFT, where scaling epochs with token accuracy as a stopping criterion can replace expensive undirected data scaling. We pose the repetition advantage, where full memorization coincides with improved generalization, as a new open problem for the community in understanding the training dynamics of large language models.
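
The fixed update budget in the abstract can be checked with simple arithmetic: 128 epochs over 400 samples and 1 epoch over 51200 samples both process the same number of training examples. The sketch below verifies this and adds a hypothetical stopping helper based on training token accuracy; the 0.999 saturation threshold and the accuracy values are assumptions, since the abstract gives no numbers for them.

```python
def samples_processed(num_samples, epochs):
    """Total training examples seen, a proxy for the update budget
    when batch size is held fixed."""
    return num_samples * epochs

repeated = samples_processed(num_samples=400, epochs=128)
single_pass = samples_processed(num_samples=51200, epochs=1)

def first_saturated_epoch(train_token_acc, threshold=0.999):
    """Return the first (1-based) epoch whose training token accuracy reaches
    the threshold, or None if it never does."""
    for epoch, acc in enumerate(train_token_acc, start=1):
        if acc >= threshold:
            return epoch
    return None

stop_at = first_saturated_epoch([0.60, 0.90, 0.9995, 1.0])
```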