2026-02-02

Title: In Vino Veritas and Vulnerabilities: Examining LLM Safety via Drunk Language Inducement

Authors: Anudeex Shetty, Aditya Joshi, Salil S. Kanhere
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2601.22169
Pdf URL: https://arxiv.org/pdf/2601.22169
Copy Paste: [[2601.22169]] In Vino Veritas and Vulnerabilities: Examining LLM Safety via Drunk Language Inducement(https://arxiv.org/abs/2601.22169)
Keywords: language model, llm, prompt
Abstract: Humans are susceptible to undesirable behaviours and privacy leaks under the influence of alcohol. This paper investigates drunk language, i.e., text written under the influence of alcohol, as a driver for safety failures in large language models (LLMs). We investigate three mechanisms for inducing drunk language in LLMs: persona-based prompting, causal fine-tuning, and reinforcement-based post-training. When evaluated on 5 LLMs, we observe a higher susceptibility to jailbreaking on JailbreakBench (even in the presence of defences) and privacy leaks on ConfAIde, where both benchmarks are in English, as compared to the base LLMs as well as previously reported approaches. Via a robust combination of manual evaluation and LLM-based evaluators and analysis of error categories, our findings highlight a correspondence between human-intoxicated behaviour, and anthropomorphism in LLMs induced with drunk language. The simplicity and efficiency of our drunk language inducement approaches position them as potential counters for LLM safety tuning, highlighting significant risks to LLM safety.

Title: Prepare Reasoning Language Models for Multi-Agent Debate with Self-Debate Reinforcement Learning

Authors: Chenxi Liu, Yanshuo Chen, Ruibo Chen, Tianyi Xiong, Tong Zheng, Heng Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.22297
Pdf URL: https://arxiv.org/pdf/2601.22297
Copy Paste: [[2601.22297]] Prepare Reasoning Language Models for Multi-Agent Debate with Self-Debate Reinforcement Learning(https://arxiv.org/abs/2601.22297)
Keywords: language model, llm, prompt, agent
Abstract: The reasoning abilities of large language models (LLMs) have been substantially improved by reinforcement learning with verifiable rewards (RLVR). At test time, collaborative reasoning through Multi-Agent Debate (MAD) has emerged as a promising approach for enhancing LLM performance. However, current RLVR methods typically train LLMs to solve problems in isolation, without explicitly preparing them to synthesize and benefit from different rationales that arise during debate. In this work, we propose Self-Debate Reinforcement Learning (SDRL), a training framework that equips a single LLM with strong standalone problem-solving ability and the capability to learn from diverse reasoning trajectories in MAD. Given a prompt, SDRL first samples multiple candidate solutions, then constructs a debate context with diverse reasoning paths and generates second-turn responses conditioned on this context. Finally, SDRL jointly optimizes both the initial and debate-conditioned responses, yielding a model that is effective as both a standalone solver and a debate participant. Experiments across multiple base models and reasoning benchmarks show that SDRL improves overall MAD performance while simultaneously strengthening single model reasoning.

Title: MERMAID: Memory-Enhanced Retrieval and Reasoning with Multi-Agent Iterative Knowledge Grounding for Veracity Assessment

Authors: Yupeng Cao, Chengyang He, Yangyang Yu, Ping Wang, K.P. Subbalakshmi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.22361
Pdf URL: https://arxiv.org/pdf/2601.22361
Copy Paste: [[2601.22361]] MERMAID: Memory-Enhanced Retrieval and Reasoning with Multi-Agent Iterative Knowledge Grounding for Veracity Assessment(https://arxiv.org/abs/2601.22361)
Keywords: language model, gpt, llm, agent
Abstract: Assessing the veracity of online content has become increasingly critical. Large language models (LLMs) have recently enabled substantial progress in automated veracity assessment, including automated fact-checking and claim verification systems. Typical veracity assessment pipelines break down complex claims into sub-claims, retrieve external evidence, and then apply LLM reasoning to assess veracity. However, existing methods often treat evidence retrieval as a static, isolated step and do not effectively manage or reuse retrieved evidence across claims. In this work, we propose MERMAID, a memory-enhanced multi-agent veracity assessment framework that tightly couples the retrieval and reasoning processes. MERMAID integrates agent-driven search, structured knowledge representations, and a persistent memory module within a Reason-Action style iterative process, enabling dynamic evidence acquisition and cross-claim evidence reuse. By retaining retrieved evidence in an evidence memory, the framework reduces redundant searches and improves verification efficiency and consistency. We evaluate MERMAID on three fact-checking benchmarks and two claim-verification datasets using multiple LLMs, including GPT, LLaMA, and Qwen families. Experimental results show that MERMAID achieves state-of-the-art performance while improving the search efficiency, demonstrating the effectiveness of synergizing retrieval, reasoning, and memory for reliable veracity assessment.

Title: Context Structure Reshapes the Representational Geometry of Language Models

Authors: Eghbal A. Hosseini, Yuxuan Li, Yasaman Bahri, Declan Campbell, Andrew Kyle Lampinen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.22364
Pdf URL: https://arxiv.org/pdf/2601.22364
Copy Paste: [[2601.22364]] Context Structure Reshapes the Representational Geometry of Language Models(https://arxiv.org/abs/2601.22364)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have been shown to organize the representations of input sequences into straighter neural trajectories in their deep layers, which has been hypothesized to facilitate next-token prediction via linear extrapolation. Language models can also adapt to diverse tasks and learn new structure in context, and recent work has shown that this in-context learning (ICL) can be reflected in representational changes. Here we bring these two lines of research together to explore whether representation straightening occurs within a context during ICL. We measure representational straightening in Gemma 2 models across a diverse set of in-context tasks, and uncover a dichotomy in how LLMs' representations change in context. In continual prediction settings (e.g., natural language, grid world traversal tasks) we observe that increasing context increases the straightness of neural sequence trajectories, which is correlated with improvement in model prediction. Conversely, in structured prediction settings (e.g., few-shot tasks), straightening is inconsistent -- it is only present in phases of the task with explicit structure (e.g., repeating a template), but vanishes elsewhere. These results suggest that ICL is not a monolithic process. Instead, we propose that LLMs function like a Swiss Army knife: depending on task structure, the LLM dynamically selects between strategies, only some of which yield representational straightening.
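Note: The abstract does not say how straightness is measured. A minimal sketch, assuming one common choice: the mean turning angle between successive displacement vectors of a hidden-state trajectory, where a lower angle means a straighter path. The function name and the toy trajectories are hypothetical and not taken from the paper.

```python
import math

def mean_curvature(trajectory):
    """Mean angle (radians) between successive displacement vectors.

    `trajectory` is a list of equal-length vectors, one per token position.
    Assumption: curvature is the average turning angle; 0 means a straight line.
    """
    # Displacement between consecutive states.
    diffs = [[b - a for a, b in zip(p, q)]
             for p, q in zip(trajectory, trajectory[1:])]
    angles = []
    for u, v in zip(diffs, diffs[1:]):
        dot = sum(x * y for x, y in zip(u, v))
        norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
        if norm == 0.0:
            continue  # skip degenerate steps with no movement
        # Clamp to guard against floating-point drift outside [-1, 1].
        angles.append(math.acos(max(-1.0, min(1.0, dot / norm))))
    return sum(angles) / len(angles) if angles else 0.0

# Toy 2-D trajectories (hypothetical data).
straight = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
zigzag = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]]
```

Under this definition, "straightening" with growing context would show up as `mean_curvature` decreasing when computed over later positions of the same sequence.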

Title: Stability-Aware Prompt Optimization for Clinical Data Abstraction

Authors: Arinbjörn Kolbeinsson, Daniel Timbie, Sajjan Narsinghani, Sanjay Hariharan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.22373
Pdf URL: https://arxiv.org/pdf/2601.22373
Copy Paste: [[2601.22373]] Stability-Aware Prompt Optimization for Clinical Data Abstraction(https://arxiv.org/abs/2601.22373)
Keywords: language model, llm, prompt
Abstract: Large language models used for clinical abstraction are sensitive to prompt wording, yet most work treats prompts as fixed and studies uncertainty in isolation. We argue these should be treated jointly. Across two clinical tasks (MedAlign applicability/correctness and MS subtype abstraction) and multiple open and proprietary models, we measure prompt sensitivity via flip rates and relate it to calibration and selective prediction. We find that higher accuracy does not guarantee prompt stability, and that models can appear well-calibrated yet remain fragile to paraphrases. We propose a dual-objective prompt optimization loop that jointly targets accuracy and stability, showing that explicitly including a stability term reduces flip rates across tasks and models, sometimes at modest accuracy cost. Our results suggest prompt sensitivity should be an explicit objective when validating clinical LLM systems.
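Note: A minimal sketch of a flip rate and of a combined accuracy/stability score, under assumptions the abstract does not confirm. Here an example counts as a flip if any two paraphrases of the prompt give different labels, and the combined score is mean accuracy minus a weighted flip rate. The weight `stability_weight` and the toy labels are hypothetical.

```python
def flip_rate(predictions_per_prompt):
    """Fraction of examples whose label changes across prompt paraphrases.

    `predictions_per_prompt[p][i]` is the label for example i under paraphrase p.
    Assumption: an example "flips" if the paraphrases do not all agree.
    """
    n_examples = len(predictions_per_prompt[0])
    flips = 0
    for i in range(n_examples):
        labels = {preds[i] for preds in predictions_per_prompt}
        if len(labels) > 1:
            flips += 1
    return flips / n_examples

def accuracy(predictions, gold):
    """Share of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

def dual_objective(predictions_per_prompt, gold, stability_weight=0.5):
    """Mean accuracy minus a weighted flip-rate penalty (hypothetical form)."""
    mean_acc = sum(accuracy(p, gold) for p in predictions_per_prompt)
    mean_acc /= len(predictions_per_prompt)
    return mean_acc - stability_weight * flip_rate(predictions_per_prompt)

# Toy data: two paraphrases of one prompt, four examples.
gold = ["yes", "no", "yes", "no"]
paraphrase_preds = [
    ["yes", "no", "yes", "yes"],
    ["yes", "no", "no", "yes"],
]
```

In this toy case the two paraphrases disagree only on the third example, so a prompt optimizer scoring candidates with `dual_objective` would penalize that disagreement in addition to the accuracy errors.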

Title: SPLA: Block Sparse Plus Linear Attention for Long Context Modeling

Authors: Bailin Wang, Dan Friedman, Tao Lei, Chong Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.22379
Pdf URL: https://arxiv.org/pdf/2601.22379
Copy Paste: [[2601.22379]] SPLA: Block Sparse Plus Linear Attention for Long Context Modeling(https://arxiv.org/abs/2601.22379)
Keywords: long context
Abstract: Block-wise sparse attention offers significant efficiency gains for long-context modeling, yet existing methods often suffer from low selection fidelity and cumulative contextual loss by completely discarding unselected blocks. To address these limitations, we introduce Sparse Plus Linear Attention (SPLA), a framework that utilizes a selection metric derived from second-order Taylor expansions to accurately identify relevant blocks for exact attention. Instead of discarding the remaining "long tail," SPLA compresses unselected blocks into a compact recurrent state via a residual linear attention (RLA) module. Crucially, to avoid IO overhead, we derive an optimized subtraction-based formulation for RLA -- calculating the residual as the difference between global and selected linear attention -- ensuring that unselected blocks are never explicitly accessed during inference. Our experiments demonstrate that SPLA closes the performance gap in continual pretraining, surpassing dense attention models on long-context benchmarks like RULER while maintaining competitive general knowledge and reasoning capabilities.
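Note: The subtraction idea can be illustrated with a simplified form of linear attention. This is a sketch under stated assumptions, not the SPLA implementation: it uses unnormalized linear attention with no feature map, so each block's state is the sum of outer products of its keys and values. All function names and the toy blocks are hypothetical.

```python
def outer(k, v):
    """Outer product k v^T as a nested list."""
    return [[ki * vj for vj in v] for ki in k]

def add(a, b, sign=1.0):
    """Elementwise a + sign * b for two matrices of the same shape."""
    return [[x + sign * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def block_state(keys, values):
    """Linear-attention state of one block: sum of outer products k v^T."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    for k, v in zip(keys, values):
        state = add(state, outer(k, v))
    return state

def residual_state(block_states, selected):
    """State of the unselected blocks, computed as global minus selected."""
    total = block_states[0]
    for s in block_states[1:]:
        total = add(total, s)
    for idx in selected:
        total = add(total, block_states[idx], sign=-1.0)
    return total

# Toy blocks of (keys, values); block 1 is the one chosen for exact attention.
blocks = [
    ([[1.0, 0.0]], [[2.0, 1.0]]),
    ([[0.0, 1.0], [1.0, 1.0]], [[1.0, 3.0], [0.0, 2.0]]),
    ([[2.0, 1.0]], [[1.0, 1.0]]),
]
states = [block_state(k, v) for k, v in blocks]
residual = residual_state(states, selected=[1])
```

The sketch re-sums all block states for clarity. In a streaming setting the global state would presumably be maintained incrementally, so that only the selected blocks need to be read to form the residual.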

Title: SP^2DPO: An LLM-assisted Semantic Per-Pair DPO Generalization

Authors: Chaoyue He, Xin Zhou, Di Wang, Hong Xu, Wei Liu, Chunyan Miao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.22385
Pdf URL: https://arxiv.org/pdf/2601.22385
Copy Paste: [[2601.22385]] SP^2DPO: An LLM-assisted Semantic Per-Pair DPO Generalization(https://arxiv.org/abs/2601.22385)
Keywords: language model, llm
Abstract: Direct Preference Optimization (DPO) controls the trade-off between fitting preference labels and staying close to a reference model using a single global temperature beta, implicitly treating all preference pairs as equally informative. Real-world preference corpora are heterogeneous: they mix high-signal, objective failures (for example, safety, factuality, instruction violations) with low-signal or subjective distinctions (for example, style), and also include label noise. We introduce our method, SP2DPO (Semantic Per-Pair DPO), a generalization that replaces the global temperature with an instance-specific schedule beta_i pre-decided offline from structured semantic-gap annotations (category, magnitude, confidence) produced by teacher language models. We instantiate this procedure on the UltraFeedback preference corpus (59,960 pairs), enabling large-scale construction of an auditable beta_i artifact, and incur zero training-time overhead: the inner-loop optimizer remains standard DPO with beta set per pair. We focus our empirical study on AlpacaEval 2.0, reporting both raw win rate and length-controlled win rate. Across four open-weight, instruction-tuned student backbones (4B-8B), SP2DPO is competitive with a tuned global-beta DPO baseline and improves AlpacaEval 2.0 length-controlled win rate on two of four backbones, while avoiding per-model beta sweeps. All code, annotations, and artifacts will be released.
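Note: A minimal sketch of the standard DPO loss with one beta per preference pair instead of a single global value. The loss form is the usual DPO objective. How the per-pair betas are derived from the semantic-gap annotations is not shown, and the numbers below are hypothetical.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return x + math.log1p(math.exp(-x)) if x > 0 else math.log1p(math.exp(x))

def dpo_pair_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta):
    """DPO loss for one pair: -log sigmoid(beta * margin)."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(z)) equals softplus(-z).
    return softplus(-beta * margin)

def per_pair_dpo_loss(pairs, betas):
    """Mean DPO loss where pair i uses its own precomputed beta_i."""
    losses = [dpo_pair_loss(*pair, beta=b) for pair, b in zip(pairs, betas)]
    return sum(losses) / len(losses)

# Each pair: (policy logp chosen, policy logp rejected, ref logp chosen, ref logp rejected).
pairs = [(-1.0, -2.0, -1.5, -1.5), (-3.0, -3.0, -3.0, -3.0)]
betas = [0.5, 0.05]  # hypothetical: high-signal pair, then a low-signal pair
loss = per_pair_dpo_loss(pairs, betas)
```

Because the betas are fixed before training, the inner loop is unchanged apart from reading `beta` per example, which is consistent with the zero training-time overhead described in the abstract.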

Title: Specialists or Generalists? Multi-Agent and Single-Agent LLMs for Essay Grading

Authors: Jamiu Adekunle Idowu, Ahmed Almasoud
Subjects: cs.CL, cs.MA
Abstract URL: https://arxiv.org/abs/2601.22386
Pdf URL: https://arxiv.org/pdf/2601.22386
Copy Paste: [[2601.22386]] Specialists or Generalists? Multi-Agent and Single-Agent LLMs for Essay Grading(https://arxiv.org/abs/2601.22386)
Keywords: language model, gpt, llm, agent
Abstract: Automated essay scoring (AES) systems increasingly rely on large language models, yet little is known about how architectural choices shape their performance across different essay quality levels. This paper evaluates single-agent and multi-agent LLM architectures for essay grading using the ASAP 2.0 corpus. Our multi-agent system decomposes grading into three specialist agents (Content, Structure, Language) coordinated by a Chairman Agent that implements rubric-aligned logic including veto rules and score capping. We test both architectures in zero-shot and few-shot conditions using GPT-5.1. Results show that the multi-agent system is significantly better at identifying weak essays while the single-agent system performs better on mid-range essays. Both architectures struggle with high-quality essays. Critically, few-shot calibration emerges as the dominant factor in system performance -- providing just two examples per score level improves QWK by approximately 26% for both architectures. These findings suggest architectural choice should align with specific deployment priorities, with multi-agent AI particularly suited for diagnostic screening of at-risk students, while single-agent models provide a cost-effective solution for general assessment.
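Note: QWK here refers to quadratic weighted kappa, the usual agreement metric in essay scoring. A minimal sketch of the standard formula follows. The function name and the score ranges in the example are illustrative, and the code assumes at least two score levels.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Quadratic weighted kappa between two lists of integer scores."""
    n = max_score - min_score + 1  # number of score levels, assumed >= 2
    total = len(rater_a)
    # Confusion matrix of observed score pairs.
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            # Expected count if the two raters scored independently.
            expected = hist_a[i] * hist_b[j] / total
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    if weighted_expected == 0.0:
        return 1.0  # both raters used one identical score for every essay
    return 1.0 - weighted_observed / weighted_expected

perfect = quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4], 1, 4)
reversed_scores = quadratic_weighted_kappa([1, 2, 3], [3, 2, 1], 1, 3)
```

A value of 1 means perfect agreement with the human scores, 0 means chance-level agreement, and negative values mean systematic disagreement.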

Title: Culturally Grounded Personas in Large Language Models: Characterization and Alignment with Socio-Psychological Value Frameworks

Authors: Candida M. Greco, Lucio La Cava, Andrea Tagarelli
Subjects: cs.CL, cs.AI, cs.CY, cs.HC, physics.soc-ph
Abstract URL: https://arxiv.org/abs/2601.22396
Pdf URL: https://arxiv.org/pdf/2601.22396
Copy Paste: [[2601.22396]] Culturally Grounded Personas in Large Language Models: Characterization and Alignment with Socio-Psychological Value Frameworks(https://arxiv.org/abs/2601.22396)
Keywords: language model, llm
Abstract: Despite the growing utility of Large Language Models (LLMs) for simulating human behavior, the extent to which these synthetic personas accurately reflect world and moral value systems across different cultural conditionings remains uncertain. This paper investigates the alignment of synthetic, culturally-grounded personas with established frameworks, specifically the World Values Survey (WVS), the Inglehart-Welzel Cultural Map, and Moral Foundations Theory. We conceptualize and produce LLM-generated personas based on a set of interpretable WVS-derived variables, and we examine the generated personas through three complementary lenses: positioning on the Inglehart-Welzel map, which unveils their interpretation reflecting stable differences across cultural conditionings; demographic-level consistency with the World Values Survey, where response distributions broadly track human group patterns; and moral profiles derived from a Moral Foundations questionnaire, which we analyze through a culture-to-morality mapping to characterize how moral responses vary across different cultural configurations. Our approach of culturally-grounded persona generation and analysis enables evaluation of cross-cultural structure and moral variation.

Title: Bifocal Attention: Harmonizing Geometric and Spectral Positional Embeddings for Algorithmic Generalization

Authors: Kanishk Awadhiya
Subjects: cs.CL, cs.FL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.22402
Pdf URL: https://arxiv.org/pdf/2601.22402
Copy Paste: [[2601.22402]] Bifocal Attention: Harmonizing Geometric and Spectral Positional Embeddings for Algorithmic Generalization(https://arxiv.org/abs/2601.22402)
Keywords: language model, llm
Abstract: Rotary Positional Embeddings (RoPE) have become the standard for Large Language Models (LLMs) due to their ability to encode relative positions through geometric rotation. However, we identify a significant limitation we term "Spectral Rigidity": standard RoPE utilizes a fixed geometric decay ($\theta^{-i}$) optimized for local syntactic coherence, which fails to capture the long-range, periodic structures inherent in recursive logic and algorithmic reasoning. This results in a "Structure Gap", where models trained on shallow reasoning chains fail to extrapolate to deeper recursive steps. In this work, we introduce Bifocal Attention, an architectural paradigm that decouples positional encoding into two distinct modalities: Geometric Eyes (Standard RoPE) for precise token-level manipulation, and Spectral Eyes (Learnable Harmonic Operators) for tracking long-range recursive depth. We propose a novel training protocol, Spectral Evolution, which initializes positional frequencies as static geometric parameters but allows them to evolve via gradient descent into a harmonic basis optimized for the specific algorithmic topology of the task.
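Note: For context, a minimal sketch of the fixed geometric frequency schedule used by standard RoPE, which the abstract abbreviates as $\theta^{-i}$. The common parameterization assumed here is frequency_i = base^(-2i/d) with base 10000. The learnable harmonic operators proposed in the paper are not shown, and the function names are illustrative.

```python
import math

def rope_frequencies(head_dim, base=10000.0):
    """Fixed geometric frequencies, one per pair of dimensions."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def rotate_pair(x, y, position, frequency):
    """Rotate one (x, y) dimension pair by the angle position * frequency."""
    angle = position * frequency
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return (x * cos_a - y * sin_a, x * sin_a + y * cos_a)

freqs = rope_frequencies(8)
# Successive frequencies shrink by a constant factor (a geometric sequence).
ratios = [b / a for a, b in zip(freqs, freqs[1:])]
```

The constant ratio between neighbouring frequencies is the fixed decay the abstract calls rigid: nothing in this schedule adapts to periodic structure in the task.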

Title: Word-Centered Semantic Graphs for Interpretable Diachronic Sense Tracking

Authors: Imene Kolli, Kai-Robin Lange, Jonas Rieger, Carsten Jentsch
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.22410
Pdf URL: https://arxiv.org/pdf/2601.22410
Copy Paste: [[2601.22410]] Word-Centered Semantic Graphs for Interpretable Diachronic Sense Tracking(https://arxiv.org/abs/2601.22410)
Keywords: language model
Abstract: We propose an interpretable, graph-based framework for analyzing semantic shift in diachronic corpora. For each target word and time slice, we induce a word-centered semantic network that integrates distributional similarity from diachronic Skip-gram embeddings with lexical substitutability from time-specific masked language models. We identify sense-related structure by clustering the peripheral graph, align clusters across time via node overlap, and track change through cluster composition and normalized cluster mass. In an application study on a corpus of New York Times Magazine articles (1980 - 2017), we show that graph connectivity reflects polysemy dynamics and that the induced communities capture contrasting trajectories: event-driven sense replacement (trump), semantic stability with cluster over-segmentation effects (god), and gradual association shifts tied to digital communication (post). Overall, word-centered semantic graphs offer a compact and transparent representation for exploring sense evolution without relying on predefined sense inventories.
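Note: A minimal sketch of aligning clusters across time slices by node overlap. It assumes Jaccard similarity as the overlap measure and defines normalized cluster mass as a cluster's share of all peripheral nodes. Both definitions, the threshold, and the toy clusters are assumptions rather than details given in the abstract.

```python
def jaccard(a, b):
    """Jaccard overlap between two node sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def align_clusters(prev_clusters, next_clusters, min_overlap=0.2):
    """Map each earlier cluster to its best-overlapping later cluster, or None."""
    alignment = {}
    for name, nodes in prev_clusters.items():
        best_name, best_score = None, 0.0
        for cand_name, cand_nodes in next_clusters.items():
            score = jaccard(nodes, cand_nodes)
            if score > best_score:
                best_name, best_score = cand_name, score
        alignment[name] = best_name if best_score >= min_overlap else None
    return alignment

def normalized_mass(clusters):
    """Share of all clustered nodes that falls in each cluster."""
    total = sum(len(nodes) for nodes in clusters.values())
    return {name: len(nodes) / total for name, nodes in clusters.items()}

# Toy neighbourhoods for one target word in two time slices (hypothetical data).
slice_early = {"mail": {"letter", "office", "stamp"}, "job": {"position", "appointed"}}
slice_late = {"online": {"blog", "comment", "share", "tweet"}, "mail": {"letter", "office"}}
alignment = align_clusters(slice_early, slice_late)
mass_late = normalized_mass(slice_late)
```

In this toy case the "mail" cluster persists across slices, the "job" cluster has no successor, and "online" appears as a new cluster holding most of the later mass.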

Title: Large Language Model Agents Are Not Always Faithful Self-Evolvers

Authors: Weixiang Zhao, Yingshuo Wang, Yichen Zhang, Yang Deng, Yanyan Zhao, Wanxiang Che, Bing Qin, Ting Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.22436
Pdf URL: https://arxiv.org/pdf/2601.22436
Copy Paste: [[2601.22436]] Large Language Model Agents Are Not Always Faithful Self-Evolvers(https://arxiv.org/abs/2601.22436)
Keywords: language model, llm, agent
Abstract: Self-evolving large language model (LLM) agents continually improve by accumulating and reusing past experience, yet it remains unclear whether they faithfully rely on that experience to guide their behavior. We present the first systematic investigation of experience faithfulness, the causal dependence of an agent's decisions on the experience it is given, in self-evolving LLM agents. Using controlled causal interventions on both raw and condensed forms of experience, we comprehensively evaluate four representative frameworks across 10 LLM backbones and 9 environments. Our analysis uncovers a striking asymmetry: while agents consistently depend on raw experience, they often disregard or misinterpret condensed experience, even when it is the only experience provided. This gap persists across single- and multi-agent configurations and across backbone scales. We trace its underlying causes to three factors: the semantic limitations of condensed content, internal processing biases that suppress experience, and task regimes where pretrained priors already suffice. These findings challenge prevailing assumptions about self-evolving methods and underscore the need for more faithful and reliable approaches to experience integration.

Title: Stop Jostling: Adaptive Negative Sampling Reduces the Marginalization of Low-Resource Language Tokens by Cross-Entropy Loss

Authors: Galim Turumtaev
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.22439
Pdf URL: https://arxiv.org/pdf/2601.22439
Copy Paste: [[2601.22439]] Stop Jostling: Adaptive Negative Sampling Reduces the Marginalization of Low-Resource Language Tokens by Cross-Entropy Loss(https://arxiv.org/abs/2601.22439)
Keywords: language model
Abstract: Neural language models often struggle with low-resource languages due to the limited availability of training data, making tokens from these languages rare in the training set. This paper addresses a specific challenge during training: rare tokens are disproportionately affected by marginalization, which prevents them from learning effectively. We propose a thresholding technique that reduces the impact of this marginalization, allowing rare tokens to benefit from more meaningful alignment. Through experiments with a character-level language model, we demonstrate that this method significantly improves performance on low-resource language validation data. This work is the first to show how negative sampling can be applied to improve the representation of rare tokens by limiting the harmful influence of excessive marginalization, offering a new approach to enhancing language model performance for underrepresented languages.

Title: SSL: Sweet Spot Learning for Differentiated Guidance in Agentic Optimization

Authors: Jinyang Wu, Changpeng Yang, Yuhao Shen, Fangzhi Xu, Bolin Ni, Chonghua Liao, Yuchen Liu, Hongzhen Wang, Shuai Nie, Shuai Zhang, Haoran Luo, Jiaming Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.22491
Pdf URL: https://arxiv.org/pdf/2601.22491
Copy Paste: [[2601.22491]] SSL: Sweet Spot Learning for Differentiated Guidance in Agentic Optimization(https://arxiv.org/abs/2601.22491)
Keywords: agent
Abstract: Reinforcement learning with verifiable rewards has emerged as a powerful paradigm for training intelligent agents. However, existing methods typically employ binary rewards that fail to capture quality differences among trajectories achieving identical outcomes, thereby overlooking potential diversity within the solution space. Inspired by the "sweet spot" concept in tennis (the racket's core region that produces optimal hitting effects), we introduce Sweet Spot Learning (SSL), a novel framework that provides differentiated guidance for agent optimization. SSL follows a simple yet effective principle: progressively amplified, tiered rewards guide policies toward the sweet-spot region of the solution space. This principle naturally adapts across diverse tasks: visual perception tasks leverage distance-tiered modeling to reward proximity, while complex reasoning tasks reward incremental progress toward promising solutions. We theoretically demonstrate that SSL preserves optimal solution ordering and enhances the gradient signal-to-noise ratio, thereby fostering more directed optimization. Extensive experiments across GUI perception, short/long-term planning, and complex reasoning tasks show consistent improvements over strong baselines on 12 benchmarks, achieving up to 2.5X sample efficiency gains and effective cross-task transferability. Our work establishes SSL as a general principle for training capable and robust agents.
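Note: A minimal sketch of a distance-tiered reward of the kind the abstract describes for perception tasks, contrasted with a binary reward. The tier boundaries, reward values, and function names are hypothetical. The abstract does not specify the actual tiers or how they are amplified over training.

```python
def binary_reward(distance, hit_radius=50.0):
    """Baseline: full reward inside the target region, nothing outside."""
    return 1.0 if distance <= hit_radius else 0.0

def tiered_reward(distance, tiers=((5.0, 1.0), (20.0, 0.5), (50.0, 0.2))):
    """Reward grows as the prediction moves toward the centre of the target.

    `tiers` is a sequence of (max_distance, reward) pairs ordered from the
    innermost tier outward; values here are illustrative only.
    """
    for max_distance, reward in tiers:
        if distance <= max_distance:
            return reward
    return 0.0

# Two predictions that both count as hits under the binary reward.
near_centre, near_edge = 3.0, 45.0
```

Both predictions earn the same binary reward, while the tiered reward separates them, which is the quality difference among equally successful trajectories that the abstract says binary rewards miss.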

Title: Mock Worlds, Real Skills: Building Small Agentic Language Models with Synthetic Tasks, Simulated Environments, and Rubric-Based Rewards

Authors: Yuan-Jay Lü, Chengyu Wang, Lei Shen, Jun Huang, Tong Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.22511
Pdf URL: https://arxiv.org/pdf/2601.22511
Copy Paste: [[2601.22511]] Mock Worlds, Real Skills: Building Small Agentic Language Models with Synthetic Tasks, Simulated Environments, and Rubric-Based Rewards(https://arxiv.org/abs/2601.22511)
Keywords: language model, llm, agent
Abstract: Small LLMs often struggle to match the agentic capabilities of large, costly models. While reinforcement learning can help, progress has been limited by two structural bottlenecks: existing open-source agentic training data are narrow in task variety and easily solved; real-world APIs lack diversity and are unstable for large-scale reinforcement learning rollout processes. We address these challenges with SYNTHAGENT, a framework that jointly synthesizes diverse tool-use training data and simulates complete environments. Specifically, a strong teacher model creates novel tasks and tool ecosystems, then rewrites them into intentionally underspecified instructions. This compels agents to actively query users for missing details. When handling synthetic tasks, an LLM-based user simulator provides user-private information, while a mock tool system delivers stable tool responses. For rewards, task-level rubrics are constructed based on required subgoals, user-agent interactions, and forbidden behaviors. Across 14 challenging datasets in math, search, and tool use, models trained on our synthetic data achieve substantial gains, with small models outperforming larger baselines.

Title: $ρ$-$\texttt{EOS}$: Training-free Bidirectional Variable-Length Control for Masked Diffusion LLMs

Authors: Jingyi Yang, Yuxian Jiang, Jing Shao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.22527
Pdf URL: https://arxiv.org/pdf/2601.22527
Copy Paste: [[2601.22527]] $ρ$-$\texttt{EOS}$: Training-free Bidirectional Variable-Length Control for Masked Diffusion LLMs(https://arxiv.org/abs/2601.22527)
Keywords: language model, llm
Abstract: Beyond parallel generation and global context modeling, current masked diffusion large language models (dLLMs) suffer from a fundamental limitation: they require a predefined, fixed generation length, which lacks flexibility and forces an inevitable trade-off between output quality and computational efficiency. To address this, we study the denoising dynamics and find that the implicit density ($\rho$) of end-of-sequence ($\texttt{EOS}$) tokens serves as a reliable signal of generation sufficiency. In particular, the evolving implicit $\texttt{EOS}$ density during denoising reveals whether the current masked space is excessive or insufficient, thereby guiding the adjustment direction for generation length. Building on this insight, we propose $\rho$-$\texttt{EOS}$, a training-free, single-stage strategy that enables bidirectional variable-length generation for masked dLLMs. Unlike prior two-stage approaches (which require separate length adjustment and iterative mask insertion phases while supporting only unidirectional expansion), $\rho$-$\texttt{EOS}$ achieves bidirectional length adjustment within a unified denoising process by continuously estimating the implicit $\texttt{EOS}$ density: excessively high density triggers $\texttt{MASK}$ token contraction, while insufficient density induces expansion. Extensive experiments on mathematics and code benchmarks demonstrate that $\rho$-$\texttt{EOS}$ achieves comparable performance while substantially improving inference efficiency and token utilization.
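Note: A minimal sketch of the bidirectional length rule described above. It assumes the implicit EOS density is estimated as the mean predicted EOS probability over the currently masked positions. The thresholds, step size, and function names are hypothetical, since the abstract does not give the estimator or the schedule.

```python
def eos_density(eos_probs_at_masked):
    """Mean predicted EOS probability over masked positions (assumed estimator)."""
    if not eos_probs_at_masked:
        return 0.0
    return sum(eos_probs_at_masked) / len(eos_probs_at_masked)

def adjust_length(current_length, eos_probs_at_masked,
                  low=0.2, high=0.6, step=8, min_length=1):
    """Return the new generation length after one denoising step.

    High density suggests too much masked space, so the length contracts.
    Low density suggests too little, so the length expands.
    """
    rho = eos_density(eos_probs_at_masked)
    if rho > high:
        return max(min_length, current_length - step)
    if rho < low:
        return current_length + step
    return current_length

shrunk = adjust_length(64, [0.9, 0.8, 0.7, 0.6])
grown = adjust_length(64, [0.05, 0.1, 0.0, 0.05])
kept = adjust_length(64, [0.3, 0.5, 0.4, 0.4])
```

Calling `adjust_length` inside the denoising loop gives a single-stage procedure in which the length can move in either direction, in contrast to the expand-only two-stage approaches the abstract mentions.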
摘要：除了并行生成和全局上下文建模之外，当前的掩码扩散大语言模型 (dLLM) 还存在一个基本限制：它们需要预定义的固定生成长度，这缺乏灵活性，并迫使输出质量和计算效率之间不可避免地进行权衡。为了解决这个问题，我们研究了去噪动态，发现序列末尾 ($\texttt{EOS}$) 令牌的隐式密度 ($\rho$) 可以作为生成充足性的可靠信号。特别是，去噪过程中不断演变的隐式$\texttt{EOS}$密度揭示了当前掩蔽空间是否过多或不足，从而指导生成长度的调整方向。基于这一见解，我们提出了 $\textbf{$\rho$-$\texttt{EOS}$}$，这是一种免训练的单阶段策略，可以为 masked dLLM 实现双向可变长度生成。与之前的两阶段方法（需要单独的长度调整和迭代掩码插入阶段，同时仅支持单向扩展）不同，$\textbf{$\rho$-$\texttt{EOS}$}$ 通过连续估计隐式 $\texttt{EOS}$ 密度，在统一的去噪过程中实现双向长度调整：密度过高会触发 $\texttt{MASK}$ 令牌收缩，而密度不足则会导致扩展。关于数学和代码基准的大量实验表明，$\textbf{$\rho$-$\texttt{EOS}$}$ 实现了可比的性能，同时大幅提高了推理效率和令牌利用率。
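
The contraction/expansion rule described in this abstract can be sketched roughly as follows. This is only an illustration of the idea, not the authors' implementation: the function names, the thresholds `low` and `high`, the step size, and the minimum length are all assumptions made for the example and do not come from the paper.

```python
def eos_density(predicted_tokens, eos_token="<EOS>"):
    """Fraction of currently predicted positions that are EOS (hypothetical helper)."""
    if not predicted_tokens:
        return 0.0
    return sum(t == eos_token for t in predicted_tokens) / len(predicted_tokens)


def adjust_length(current_len, density, low=0.1, high=0.5, step=16, min_len=16):
    """Return a new generation length given the implicit EOS density.

    low, high, step, and min_len are illustrative values, not from the paper.
    """
    if density > high:
        # Many positions want to be EOS: the masked space is excessive, so contract.
        return max(min_len, current_len - step)
    if density < low:
        # Few positions want to be EOS: the masked space is insufficient, so expand.
        return current_len + step
    # Density is inside the acceptable band: keep the current length.
    return current_len
```

In a real denoising loop, the density would presumably be re-estimated at each step from the model's predictions over masked positions, and the length adjusted in either direction within the same pass.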

Title: Towards the Holographic Characteristic of LLMs for Efficient Short-text Generation

Authors: Shun Qian, Bingquan Liu, Chengjie Sun, Zhen Xu, Baoxun Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.22546
Pdf URL: https://arxiv.org/pdf/2601.22546
Copy Paste: [[2601.22546]] Towards the Holographic Characteristic of LLMs for Efficient Short-text Generation(https://arxiv.org/abs/2601.22546)
Keywords: language model, llm, chain-of-thought
Abstract: The recent advancements in Large Language Models (LLMs) have attracted interest in exploring their in-context learning abilities and chain-of-thought capabilities. However, there are few studies investigating the specific traits related to the powerful generation capacity of LLMs. This paper aims to delve into the generation characteristics exhibited by LLMs. Through our investigation, we have discovered that language models tend to capture target-side keywords at the beginning of the generation process. We name this phenomenon the Holographic Characteristic of language models. For the purpose of exploring this characteristic and further improving the inference efficiency of language models, we propose a plugin called HOLO, which leverages the Holographic Characteristic to extract target-side keywords from language models within a limited number of generation steps and complements the sentence with a parallel lexically constrained text generation method. To verify the effectiveness of HOLO, we conduct massive experiments on language models of varying architectures and scales in the short-text generation scenario. The results demonstrate that HOLO achieves comparable performance to the baselines in terms of both automatic and human-like evaluation metrics and highlight the potential of the Holographic Characteristic.
摘要：大型语言模型 (LLM) 的最新进展引起了人们对探索其上下文学习能力和思维链能力的兴趣。然而，很少有研究调查与法学硕士强大的生成能力相关的具体特征。本文旨在深入探讨法学硕士所展现的一代特征。通过我们的调查，我们发现语言模型倾向于在生成过程开始时捕获目标端关键字。我们将这种现象称为语言模型的全息特征。为了探索这一特性并进一步提高语言模型的推理效率，我们提出了一个名为 HOLO 的插件，它利用全息特性在有限的生成步骤内从语言模型中提取目标侧关键词，并通过并行词汇约束文本生成方法来补充句子。为了验证 HOLO 的有效性，我们在短文本生成场景中对不同架构和规模的语言模型进行了大量实验。结果表明，HOLO 在自动和类人评估指标方面均实现了与基线相当的性能，并凸显了全息特征的潜力。

Title: Are LLM Evaluators Really Narcissists? Sanity Checking Self-Preference Evaluations

Authors: Dani Roytburg, Matthew Bozoukov, Matthew Nguyen, Mackenzie Puig-Hall, Narmeen Oozeer
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.22548
Pdf URL: https://arxiv.org/pdf/2601.22548
Copy Paste: [[2601.22548]] Are LLM Evaluators Really Narcissists? Sanity Checking Self-Preference Evaluations(https://arxiv.org/abs/2601.22548)
Keywords: language model, llm
Abstract: Recent research has shown that large language models (LLMs) favor their own outputs when acting as judges, undermining the integrity of automated post-training and evaluation workflows. However, it is difficult to disentangle which evaluation biases are explained by narcissism versus general experimental confounds, distorting measurements of self-preference bias. We identify a core methodological confound, and accounting for it could reduce measurement error by 89.6%. Specifically, LLM evaluators may deliver self-preferring verdicts when judging queries that they themselves answered incorrectly; this would hold regardless of whether one of the candidate responses is their own. To decouple self-preference signals from noisy outputs on hard problems, we introduce an Evaluator Quality Baseline, which compares the probability that a judge incorrectly votes for itself against the probability that it votes for an incorrect response from another model. When this simple baseline is evaluated on 37,448 queries, only 51% of initial findings retain statistical significance. Finally, we turn towards characterizing the entropy of "easy" versus "hard" evaluation votes from LLM judges. Our corrective baseline enables future research on self-preference by eliminating noisy data from potential solutions. More widely, this work contributes to the growing body of work on cataloging and isolating judge-bias effects.
摘要：最近的研究表明，大型语言模型（LLM）在充当法官时偏向于自己的输出，从而破坏了自动化后培训和评估工作流程的完整性。然而，很难区分哪些评价偏差是用自恋来解释的，哪些是用一般实验混杂来解释的，扭曲了自我偏好偏差的测量。我们发现了一个核心方法论混淆，可以将测量误差减少 89.6%。具体来说，当法官回答他们自己错误完成的问题时，LLM评估人员可能会做出自我偏好的裁决；无论他们的反应是否是他们自己的，这都是事实。为了将自我偏好信号与困难问题上的噪声输出解耦，我们引入了评估器质量基线，它将法官错误投票给自己的概率与投票给另一个模型的错误响应的概率进行比较。通过对 37,448 个查询评估这个简单的基线，只有 51% 的初始结果保留了统计显着性。最后，我们转向描述法学硕士法官的“简单”与“困难”评估投票的熵。我们的纠正基线通过消除潜在解决方案中的噪声数据来支持未来对自我偏好的研究。更广泛地说，这项工作有助于越来越多的分类和隔离法官偏见影响的工作。
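
The comparison behind the Evaluator Quality Baseline, as the abstract describes it, can be illustrated with a small sketch. The data layout (two lists of booleans per judge) and the function names are assumptions made for this example. The paper's actual statistics and significance testing are not reproduced here.

```python
def vote_rate(votes):
    """Share of True entries in a list of booleans; 0.0 for an empty list."""
    return sum(votes) / len(votes) if votes else 0.0


def self_preference_gap(self_votes_when_wrong, other_votes_when_wrong):
    """Difference between two error rates for one judge (hypothetical helper).

    self_votes_when_wrong: True where the judge picked its own incorrect response.
    other_votes_when_wrong: True where the judge picked another model's incorrect response.
    """
    return vote_rate(self_votes_when_wrong) - vote_rate(other_votes_when_wrong)


# Toy data: the judge picks its own wrong answer 2 of 4 times,
# and another model's wrong answer 1 of 4 times.
gap = self_preference_gap([True, True, False, False], [True, False, False, False])
print(gap)  # 0.25
```

Under this reading, a gap near zero would suggest that the judge simply errs on hard queries, while a clearly positive gap would point to self-preference beyond that baseline error rate.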

Title: SpanNorm: Reconciling Training Stability and Performance in Deep Transformers

Authors: Chao Wang, Bei Li, Jiaqi Zhang, Xinyu Liu, Yuchun Fan, Linkun Lyu, Xin Chen, Jingang Wang, Tong Xiao, Peng Pei, Xunliang Cai
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.22580
Pdf URL: https://arxiv.org/pdf/2601.22580
Copy Paste: [[2601.22580]] SpanNorm: Reconciling Training Stability and Performance in Deep Transformers(https://arxiv.org/abs/2601.22580)
Keywords: language model, llm
Abstract: The success of Large Language Models (LLMs) hinges on the stable training of deep Transformer architectures. A critical design choice is the placement of normalization layers, leading to a fundamental trade-off: the ``PreNorm'' architecture ensures training stability at the cost of potential performance degradation in deep models, while the ``PostNorm'' architecture offers strong performance but suffers from severe training instability. In this work, we propose SpanNorm, a novel technique designed to resolve this dilemma by integrating the strengths of both paradigms. Structurally, SpanNorm establishes a clean residual connection that spans the entire transformer block to stabilize signal propagation, while employing a PostNorm-style computation that normalizes the aggregated output to enhance model performance. We provide a theoretical analysis demonstrating that SpanNorm, combined with a principled scaling strategy, maintains bounded signal variance throughout the network, preventing the gradient issues that plague PostNorm models, and also alleviating the representation collapse of PreNorm. Empirically, SpanNorm consistently outperforms standard normalization schemes in both dense and Mixture-of-Experts (MoE) scenarios, paving the way for more powerful and stable Transformer architectures.
摘要：大型语言模型（LLM）的成功取决于深层 Transformer 架构的稳定训练。一个关键的设计选择是归一化层的放置，从而导致一个基本的权衡：“PreNorm”架构以深度模型中潜在的性能下降为代价确保训练稳定性，而“PostNorm”架构提供了强大的性能，但遭受严重的训练不稳定。在这项工作中，我们提出了 SpanNorm，一种旨在通过整合两种范式的优势来解决这一困境的新技术。在结构上，SpanNorm 建立了一个跨越整个变压器块的干净残差连接，以稳定信号传播，同时采用 PostNorm 式计算对聚合输出进行归一化，以增强模型性能。我们提供的理论分析表明，SpanNorm 与有原则的缩放策略相结合，在整个网络中保持有界信号方差，防止困扰 PostNorm 模型的梯度问题，并减轻 PreNorm 的表示崩溃。根据经验，SpanNorm 在密集和专家混合 (MoE) 场景中始终优于标准标准化方案，为更强大、更稳定的 Transformer 架构铺平了道路。
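
For reference, the two standard placements that this abstract contrasts are commonly written as below, for a sublayer $F$ (attention or feed-forward) and layer normalization $\mathrm{LN}$. The abstract does not give SpanNorm's own update rule, so it is not reproduced here.

```latex
% PreNorm: normalize the sublayer input; the residual path stays untouched.
x_{l+1} = x_l + F\big(\mathrm{LN}(x_l)\big)

% PostNorm: normalize the sum of the residual and the sublayer output.
x_{l+1} = \mathrm{LN}\big(x_l + F(x_l)\big)
```

In PreNorm the identity path from $x_l$ to $x_{l+1}$ is never normalized, which is usually credited for its stability. In PostNorm every layer's output passes through $\mathrm{LN}$, so there is no clean identity path.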

Title: Rethinking LLM-as-a-Judge: Representation-as-a-Judge with Small Language Models via Semantic Capacity Asymmetry

Authors: Zhuochun Li, Yong Zhang, Ming Li, Yuelyu Ji, Yiming Zeng, Ning Cheng, Yun Zhu, Yanmeng Wang, Shaojun Wang, Jing Xiao, Daqing He
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.22588
Pdf URL: https://arxiv.org/pdf/2601.22588
Copy Paste: [[2601.22588]] Rethinking LLM-as-a-Judge: Representation-as-a-Judge with Small Language Models via Semantic Capacity Asymmetry(https://arxiv.org/abs/2601.22588)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are widely used as reference-free evaluators via prompting, but this "LLM-as-a-Judge" paradigm is costly, opaque, and sensitive to prompt design. In this work, we investigate whether smaller models can serve as efficient evaluators by leveraging internal representations instead of surface generation. We uncover a consistent empirical pattern: small LMs, despite weak generative ability, encode rich evaluative signals in their hidden states. This motivates us to propose the Semantic Capacity Asymmetry Hypothesis: evaluation requires significantly less semantic capacity than generation and can be grounded in intermediate representations, suggesting that evaluation does not necessarily need to rely on large-scale generative models but can instead leverage latent features from smaller ones. Our findings motivate a paradigm shift from LLM-as-a-Judge to Representation-as-a-Judge, a decoding-free evaluation strategy that probes internal model structure rather than relying on prompted output. We instantiate this paradigm through INSPECTOR, a probing-based framework that predicts aspect-level evaluation scores from small model representations. Experiments on reasoning benchmarks (GSM8K, MATH, GPQA) show that INSPECTOR substantially outperforms prompting-based small LMs and closely approximates full LLM judges, while offering a more efficient, reliable, and interpretable alternative for scalable evaluation.
摘要：大型语言模型 (LLM) 通过提示被广泛用作无参考评估器，但这种“LLM 作为法官”范式成本高昂、不透明，并且对提示设计敏感。在这项工作中，我们研究较小的模型是否可以通过利用内部表示而不是表面生成来充当有效的评估器。我们发现了一个一致的经验模式：小型 LM 尽管生成能力较弱，但在其隐藏状态中编码了丰富的评估信号。这促使我们提出语义能力不对称假设：评估所需的语义能力明显少于生成，并且可以基于中间表示，这表明评估不一定需要依赖大规模生成模型，而是可以利用较小模型的潜在特征。我们的研究结果推动了从法学硕士作为法官到代表作为法官的范式转变，这是一种无需解码的评估策略，可探索内部模型结构而不是依赖于提示的输出。我们通过 INSPECTOR 实例化了这个范例，这是一个基于探测的框架，可以根据小型模型表示来预测方面级别的评估分数。推理基准（GSM8K、MATH、GPQA）的实验表明，INSPECTOR 的性能大大优于基于提示的小型 LM，并且非常接近完整的 LLM 法官，同时为可扩展的评估提供了更高效、可靠和可解释的替代方案。

Title: Language Model Circuits Are Sparse in the Neuron Basis

Authors: Aryaman Arora, Zhengxuan Wu, Jacob Steinhardt, Sarah Schwettmann
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.22594
Pdf URL: https://arxiv.org/pdf/2601.22594
Copy Paste: [[2601.22594]] Language Model Circuits Are Sparse in the Neuron Basis(https://arxiv.org/abs/2601.22594)
Keywords: language model
Abstract: The high-level concepts that a neural network uses to perform computation need not be aligned to individual neurons (Smolensky, 1986). Language model interpretability research has thus turned to techniques such as \textit{sparse autoencoders} (SAEs) to decompose the neuron basis into more interpretable units of model computation, for tasks such as \textit{circuit tracing}. However, not all neuron-based representations are uninterpretable. For the first time, we empirically show that \textbf{MLP neurons are as sparse a feature basis as SAEs}. We use this finding to develop an end-to-end pipeline for circuit tracing on the MLP neuron basis, which locates causal circuitry on a variety of tasks using gradient-based attribution. On a standard subject-verb agreement benchmark (Marks et al., 2025), a circuit of $\approx 10^2$ MLP neurons is enough to control model behaviour. On the multi-hop city $\to$ state $\to$ capital task from Lindsey et al., 2025, we find a circuit in which small sets of neurons encode specific latent reasoning steps (e.g.~`map city to its state'), and can be steered to change the model's output. This work thus advances automated interpretability of language models without additional training costs.
摘要：神经网络用于执行计算的高级概念不需要与单个神经元对齐（Smolensky，1986）。因此，语言模型可解释性研究转向了 \textit{稀疏自动编码器} (SAE) 等技术，将神经元基础分解为更可解释的模型计算单元，以完成 \textit{电路跟踪} 等任务。然而，并非所有基于神经元的表示都是无法解释的。我们第一次凭经验证明 \textbf{MLP 神经元的特征基础与 SAE 一样稀疏}。我们利用这一发现开发了一种基于 MLP 神经元的电路追踪端到端管道，该管道使用基于梯度的归因来定位各种任务的因果电路。在标准主谓一致基准（Marks et al., 2025）上，$\approx 10^2$ MLP 神经元的电路足以控制模型行为。在 Lindsey 等人 2025 年的多跳城市 $\to$ state $\to$ Capital 任务中，我们发现了一个电路，其中一小组神经元编码特定的潜在推理步骤（例如〜“将城市映射到其状态”），并且可以被引导来改变模型的输出。因此，这项工作提高了语言模型的自动解释性，而无需额外的培训成本。

Title: Layer-wise Swapping for Generalizable Multilingual Safety

Authors: Hyunseo Shin, Wonseok Hwang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.22620
Pdf URL: https://arxiv.org/pdf/2601.22620
Copy Paste: [[2601.22620]] Layer-wise Swapping for Generalizable Multilingual Safety(https://arxiv.org/abs/2601.22620)
Keywords: language model, llm
Abstract: Despite the rapid advancements of Large Language Models (LLMs), safety risks remain a critical challenge for low-resource languages. Existing safety datasets are predominantly English centric, limiting progress in multilingual safety alignment. As a result, low resource expert models, finetuned on their respective instruction datasets, tend to exhibit higher unsafety rates compared to their high resource counterparts. In this work, we propose a safety aware layer swapping method that transfers safety alignment from an English safety expert to low resource language experts without additional training. To further enhance transfer ability, our method adaptively selects or blends modules based on their degree of specialization. Our approach preserves performance on general language understanding tasks while enhancing safety in the target languages. Experimental results show that the proposed method achieves comparable performance to the language expert on general benchmarks such as MMMLU, BELEBELE, and MGSM, while producing more aligned and less harmful responses on the MultiJail safety benchmark.
摘要：尽管大型语言模型 (LLM) 取得了快速进步，但安全风险仍然是低资源语言面临的严峻挑战。现有的安全数据集主要以英语为中心，限制了多语言安全协调的进展。因此，与高资源专家模型相比，在各自的指令数据集上进行微调的低资源专家模型往往表现出更高的不安全率。在这项工作中，我们提出了一种安全意识层交换方法，无需额外培训即可将安全对齐从英语安全专家转移到低资源语言专家。为了进一步增强迁移能力，我们的方法根据模块的专业化程度自适应地选择或混合模块。我们的方法保留了一般语言理解任务的性能，同时增强了目标语言的安全性。实验结果表明，所提出的方法在 MMMLU、BELEBELE 和 MGSM 等通用基准上实现了与语言专家相当的性能，同时在 MultiJail 安全基准上产生了更一致且危害更小的响应。

Title: Time-Annealed Perturbation Sampling: Diverse Generation for Diffusion Language Models

Authors: Jingxuan Wu, Zhenglin Wan, Xingrui Yu, Yuzhe Yang, Yiqiao Huang, Ivor Tsang, Yang You
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.22629
Pdf URL: https://arxiv.org/pdf/2601.22629
Copy Paste: [[2601.22629]] Time-Annealed Perturbation Sampling: Diverse Generation for Diffusion Language Models(https://arxiv.org/abs/2601.22629)
Keywords: language model
Abstract: Diffusion language models (Diffusion-LMs) introduce an explicit temporal dimension into text generation, yet how this structure can be leveraged to control generation diversity for exploring multiple valid semantic or reasoning paths remains underexplored. In this paper, we show that Diffusion-LMs, like diffusion models in image generation, exhibit a temporal division of labor: early denoising steps largely determine the global semantic structure, while later steps focus on local lexical refinement. Building on this insight, we propose Time-Annealed Perturbation Sampling (TAPS), a training-free inference strategy that encourages semantic branching early in the diffusion process while progressively reducing perturbations to preserve fluency and instruction adherence. TAPS is compatible with both non-autoregressive and semi-autoregressive Diffusion backbones, demonstrated on LLaDA and TraDo in our paper, and consistently improves output diversity across creative writing and reasoning benchmarks without compromising generation quality.
摘要：扩散语言模型（Diffusion-LM）在文本生成中引入了显式的时间维度，但如何利用这种结构来控制生成多样性以探索多个有效的语义或推理路径仍有待探索。在本文中，我们表明扩散LM，就像图像生成中的扩散模型一样，表现出劳动的时间分工：早期的去噪步骤在很大程度上决定了全局语义结构，而后续的步骤则侧重于局部词汇细化。基于这一见解，我们提出了时间退火扰动采样（TAPS），这是一种免训练的推理策略，鼓励在扩散过程的早期进行语义分支，同时逐步减少扰动以保持流畅性和指令依从性。 TAPS 与非自回归和半自回归 Diffusion 主干兼容，这在我们的论文中的 LLaDA 和 TraDo 上得到了演示，并在不影响生成质量的情况下持续提高创意写作和推理基准的输出多样性。

Title: DART-ing Through the Drift: Dynamic Tracing of Knowledge Neurons for Adaptive Inference-Time Pruning

Authors: Abhishek Tyagi, Yunuo Cen, Shrey Dhorajiya, Bharadwaj Veeravalli, Xuanyao Fong
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.22632
Pdf URL: https://arxiv.org/pdf/2601.22632
Copy Paste: [[2601.22632]] DART-ing Through the Drift: Dynamic Tracing of Knowledge Neurons for Adaptive Inference-Time Pruning(https://arxiv.org/abs/2601.22632)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) exhibit substantial parameter redundancy, particularly in Feed-Forward Networks (FFNs). Existing pruning methods suffer from two primary limitations. First, reliance on dataset-specific calibration introduces significant data dependency and computational overhead. Second, being predominantly static, they fail to account for the evolving subset of knowledge neurons in LLMs during autoregressive generation as the context evolves. To address this, we introduce DART (Dynamic Attention-Guided Runtime Tracing), a lightweight, training-free method that performs on-the-fly context-based pruning. DART monitors shifts in attention score distributions to infer context changes, dynamically updating neuron-level masks to retain salient parameters. Across ten benchmarks, DART outperforms the prior dynamic baseline, achieving accuracy gains of up to 14.5% on LLAMA-3.1-8B at 70% FFN sparsity. Furthermore, DART achieves up to 3x better ROUGE-L scores with respect to static-masked pruning on summarization tasks, with its performance comparable to the original dense models. We conclusively demonstrate that the proposed framework effectively adapts to diverse semantic contexts and preserves model capabilities across both general and domain-specific tasks, while using less than 10 MB of memory for LLAMA-3.1-8B (16 GB) with 0.1% FLOPs overhead. The code is available at this https URL.
摘要：大型语言模型 (LLM) 表现出大量的参数冗余，特别是在前馈网络 (FFN) 中。现有的修剪方法有两个主要限制。首先，对特定于数据集的校准的依赖引入了显着的数据依赖性和计算开销。其次，它们主要是静态的，无法解释法学硕士中知识神经元在自回归生成过程中随着上下文的演变而不断变化的子集。为了解决这个问题，我们引入了 DART，即动态注意力引导运行时跟踪（Dynamic Attention-Guided Runtime Tracing），这是一种轻量级、免训练的方法，可以执行基于上下文的即时修剪。 DART 监视注意力分数分布的变化以推断上下文变化，动态更新神经元级掩模以保留显着参数。在 10 个基准测试中，DART 的性能优于之前的动态基线，在 LLAMA-3.1-8B 上以 70% FFN 稀疏度实现了高达 14.5% 的准确度增益。此外，DART 在摘要任务上的静态屏蔽修剪方面实现了高达 3 倍的 ROUGE-L 分数，其性能与原始密集模型相当。我们最终证明，所提出的框架可以有效地适应不同的语义上下文，保留跨一般任务和特定领域任务的模型功能，同时以不到 10MB 的内存运行 LLAMA-3.1-8B（16GB），并具有 0.1% 的 FLOP 开销。该代码可从此 https URL 获取。

Title: NAG: A Unified Native Architecture for Encoder-free Text-Graph Modeling in Language Models

Authors: Haisong Gong, Zhibo Liu, Qiang Liu, Shu Wu, Liang Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.22657
Pdf URL: https://arxiv.org/pdf/2601.22657
Copy Paste: [[2601.22657]] NAG: A Unified Native Architecture for Encoder-free Text-Graph Modeling in Language Models(https://arxiv.org/abs/2601.22657)
Keywords: language model
Abstract: Prevailing methods for integrating graphs into Language Models (LMs) typically rely on a segregated architecture: external Graph Neural Networks (GNNs) encode structural topology, while LMs process textual semantics. We argue this approach is suboptimal for text-graphs: it creates a conceptually disjointed interaction paradigm. By segregating structural encoding from semantic processing, these systems must perform a complex implicit alignment between abstract graph tokens and concrete textual elements. Challenging the necessity of external encoders, we propose NAG (Native Architecture for Graphs), a unified framework that internalizes graph processing within the LM's native manifold. Instead of bridging disparate embedding spaces, NAG repurposes the self-attention mechanism to enforce topological dependencies and recalibrates positional IDs to ensure structural equivalence. This allows the model to harness its intrinsic linguistic capability to simultaneously comprehend node and edge content alongside structural topology. We introduce two efficient implementations: NAG-Zero for absolute preservation of the base model's linguistic capabilities, and NAG-LoRA for enhanced structural adaptation. Experiments across diverse graph tasks validate that NAG achieves robust graph comprehension without the overhead of external encoders, offering a simpler, more coherent paradigm for text-graph modeling.
摘要：将图集成到语言模型 (LM) 中的流行方法通常依赖于隔离架构：外部图神经网络 (GNN) 编码结构拓扑，而 LM 处理文本语义。我们认为这种方法对于文本图来说不是最佳的：它创建了概念上脱节的交互范式。通过将结构编码与语义处理分离，这些系统必须在抽象图标记和具体文本元素之间执行复杂的隐式对齐。为了挑战外部编码器的必要性，我们提出了 NAG（原生图形架构），这是一个统一的框架，可将图形处理内部化在 LM 的原生流形中。 NAG 没有桥接不同的嵌入空间，而是重新利用自注意力机制来强制拓扑依赖性并重新校准位置 ID 以确保结构等效性。这使得模型能够利用其内在的语言能力来同时理解节点和边缘内容以及结构拓扑。我们引入了两种有效的实现：NAG-Zero 用于绝对保留基础模型的语言能力，NAG-LoRA 用于增强结构适应。跨不同图形任务的实验验证了 NAG 在没有外部编码器开销的情况下实现了强大的图形理解，为文本图形建模提供了更简单、更连贯的范例。

Title: TSLM: Tree-Structured Language Modeling for Divergent Thinking

Authors: Doyoung Kim, Jaehyeok Doo, Minjoon Seo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.22688
Pdf URL: https://arxiv.org/pdf/2601.22688
Copy Paste: [[2601.22688]] TSLM: Tree-Structured Language Modeling for Divergent Thinking(https://arxiv.org/abs/2601.22688)
Keywords: language model
Abstract: Language models generate reasoning sequentially, preventing them from decoupling irrelevant exploration paths during search. We introduce Tree-Structured Language Modeling (TSLM), which uses special tokens to encode branching structure, enabling models to generate and selectively expand multiple search paths within a single generation process. By training on complete search trees including both successful and failed attempts, TSLM learns to internalize systematic exploration without redundant recomputation of shared prefixes. TSLM achieves robust performance and superior inference efficiency by avoiding the multiple independent forward passes required by external search methods. These results suggest a new paradigm of inference-time scaling for robust reasoning, demonstrating that supervised learning on complete tree-structured traces provides an efficient alternative for developing systematic exploration capabilities in language models.
摘要：语言模型按顺序生成推理，防止它们在搜索过程中解耦不相关的探索路径。我们引入了树结构语言模型（TSLM），它使用特殊的标记来编码分支结构，使模型能够在单个生成过程中生成并有选择地扩展多个搜索路径。通过对完整搜索树（包括成功和失败的尝试）进行训练，TSLM 学会了内化系统探索，而无需对共享前缀进行冗余重新计算。 TSLM 通过避免外部搜索方法所需的多次独立前向传递来实现稳健的性能和卓越的推理效率。这些结果提出了一种用于鲁棒推理的推理时间缩放的新范式，证明对完整树结构轨迹的监督学习为开发语言模型中的系统探索能力提供了有效的替代方案。

Title: FNF: Functional Network Fingerprint for Large Language Models

Authors: Yiheng Liu, Junhao Ning, Sichen Xia, Haiyang Sun, Yang Yang, Hanyang Chi, Xiaohui Gao, Ning Qiang, Bao Ge, Junwei Han, Xintao Hu
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2601.22692
Pdf URL: https://arxiv.org/pdf/2601.22692
Copy Paste: [[2601.22692]] FNF: Functional Network Fingerprint for Large Language Models(https://arxiv.org/abs/2601.22692)
Keywords: language model, llm
Abstract: The development of large language models (LLMs) is costly and has significant commercial value. Consequently, preventing unauthorized appropriation of open-source LLMs and protecting developers' intellectual property rights have become critical challenges. In this work, we propose the Functional Network Fingerprint (FNF), a training-free, sample-efficient method for detecting whether a suspect LLM is derived from a victim model, based on the consistency between their functional network activity. We demonstrate that models that share a common origin, even with differences in scale or architecture, exhibit highly consistent patterns of neuronal activity within their functional networks across diverse input samples. In contrast, models trained independently on distinct data or with different objectives fail to preserve such activity alignment. Unlike conventional approaches, our method requires only a few samples for verification, preserves model utility, and remains robust to common model modifications (such as fine-tuning, pruning, and parameter permutation), as well as to comparisons across diverse architectures and dimensionalities. FNF thus provides model owners and third parties with a simple, non-invasive, and effective tool for protecting LLM intellectual property. The code is available at this https URL.
摘要：大型语言模型（LLM）的开发成本高昂，并且具有显着的商业价值。因此，防止未经授权盗用开源法学硕士并保护开发者的知识产权已成为关键挑战。在这项工作中，我们提出了功能网络指纹（FNF），这是一种免训练、样本高效的方法，用于根据功能网络活动之间的一致性来检测可疑的 LLM 是否源自受害者模型。我们证明，即使在规模或架构上存在差异，具有共同起源的模型也会在不同输入样本的功能网络中表现出高度一致的神经元活动模式。相比之下，根据不同数据或具有不同目标独立训练的模型无法保持这种活动一致性。与传统方法不同，我们的方法仅需要少量样本进行验证，保留模型实用性，并且对常见模型修改（例如微调、剪枝和参数排列）以及跨不同架构和维度的比较保持鲁棒性。因此，FNF 为模型所有者和第三方提供了一种简单、非侵入性且有效的工具来保护 LLM 知识产权。该代码可从此 https URL 获取。

Title: Models Know Models Best: Evaluation via Model-Preferred Formats

Authors: Joonhak Lee, Sungmok Jung, Jongyeon Park, Jaejin Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.22699
Pdf URL: https://arxiv.org/pdf/2601.22699
Copy Paste: [[2601.22699]] Models Know Models Best: Evaluation via Model-Preferred Formats(https://arxiv.org/abs/2601.22699)
Keywords: language model, llm
Abstract: Performance of Large Language Models (LLMs) on multiple-choice tasks differs markedly between symbol-based and cloze-style evaluation formats. The observed discrepancies are systematically attributable to task characteristics: natural language continuation benefits from likelihood scoring, whereas explicit comparison is better suited to symbol-based selection. These trends are consistent across various decoder-based LLMs, indicating model-agnostic effects. To address these inconsistencies, a dynamic format-alignment strategy is introduced that employs a lightweight classifier trained on latent model-preference signals. In contrast to human-designed heuristics, which often degrade performance, this approach uses model-generated signals to determine the optimal format for each problem instance. The proposed method achieves substantial and consistent improvements in zero-shot accuracy across reasoning and knowledge benchmarks, better revealing the models' latent capabilities.
摘要：大型语言模型 (LLM) 在多项选择任务上的性能在基于符号的评估格式和完形填空式评估格式之间存在显着差异。观察到的差异系统地归因于任务特征：自然语言延续受益于似然评分，而显式比较更适合基于符号的选择。这些趋势在各种基于解码器的法学硕士中是一致的，表明模型不可知的影响。为了解决这些不一致问题，引入了动态格式对齐策略，该策略采用基于潜在模型偏好信号训练的轻量级分类器。与人为设计的启发式方法（通常会降低性能）相反，这种方法使用模型生成的信号来确定每个问题实例的最佳格式。所提出的方法在推理和知识基准上实现了零样本精度的实质性和一致的改进，更好地揭示了模型的潜在能力。
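
The two evaluation formats this abstract compares can be illustrated with a toy sketch. The log-probabilities below are invented for the example, the function names are hypothetical, and the mean-per-token normalization in the cloze scorer is one common variant chosen here as an assumption. The paper's classifier for choosing between formats is not shown.

```python
def pick_symbol(letter_logprobs):
    """Symbol-based: choose the option letter with the highest log-probability."""
    return max(letter_logprobs, key=letter_logprobs.get)


def pick_cloze(option_token_logprobs):
    """Cloze-style: choose the option whose text has the highest mean token log-probability."""
    def mean(xs):
        return sum(xs) / len(xs)
    return max(option_token_logprobs, key=lambda k: mean(option_token_logprobs[k]))


# Toy scores for one question (illustrative numbers only).
letter_logprobs = {"A": -1.2, "B": -0.4, "C": -2.0}
option_token_logprobs = {"A": [-0.5, -0.5], "B": [-1.0, -3.0], "C": [-2.0]}

print(pick_symbol(letter_logprobs))       # B
print(pick_cloze(option_token_logprobs))  # A
```

The toy numbers are chosen so that the two formats disagree on the same question, which is the kind of discrepancy a per-instance format choice would need to resolve.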

Title: MM-THEBench: Do Reasoning MLLMs Think Reasonably?

Authors: Zhidian Huang, Zijun Yao, Ji Qi, Shangqing Tu, Junxian Ma, Jinxin Liu, Weichuan Liu, Xiaoyin Che, Lei Hou, Juanzi Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.22735
Pdf URL: https://arxiv.org/pdf/2601.22735
Copy Paste: [[2601.22735]] MM-THEBench: Do Reasoning MLLMs Think Reasonably?(https://arxiv.org/abs/2601.22735)
Keywords: language model, llm, hallucination
Abstract: Recent advances in multimodal large language models (MLLMs) mark a shift from non-thinking models to post-trained reasoning models capable of solving complex problems through thinking. However, whether such thinking mitigates hallucinations in multimodal perception and reasoning remains unclear. Self-reflective reasoning enhances robustness but introduces additional hallucinations, and subtle perceptual errors still result in incorrect or coincidentally correct answers. Existing benchmarks primarily focus on models before the emergence of reasoning MLLMs, neglecting the internal thinking process and failing to measure the hallucinations that occur during thinking. To address these challenges, we introduce MM-THEBench, a comprehensive benchmark for assessing hallucinations of intermediate CoTs in reasoning MLLMs. MM-THEBench features a fine-grained taxonomy grounded in cognitive dimensions, diverse data with verified reasoning annotations, and a multi-level automated evaluation framework. Extensive experiments on mainstream reasoning MLLMs reveal insights into how thinking affects hallucination and reasoning capability in various multimodal tasks.
摘要：多模态大语言模型（MLLM）的最新进展标志着从非思维模型到能够通过思维解决复杂问题的后训练推理模型的转变。然而，这种思维是否能减轻多模态感知和推理中的幻觉仍不清楚。自我反思推理增强了稳健性，但引入了额外的幻觉，并且微妙的感知错误仍然会导致错误或巧合的正确答案。现有的基准主要关注推理 MLLM 出现之前的模型，忽略了内部思维过程，未能衡量思维过程中出现的幻觉。为了应对这些挑战，我们引入了 MM-THEBench，这是一个用于评估推理 MLLM 中中间 CoT 幻觉的综合基准。 MM-THEBench 具有基于认知维度的细粒度分类法、经过验证的推理注释的多样化数据以及多级自动化评估框架。对主流推理 MLLM 的大量实验揭示了思维如何影响各种多模态任务中的幻觉和推理能力。

Title: AR-BENCH: Benchmarking Legal Reasoning with Judgment Error Detection, Classification and Correction

Authors: Yifei Li, Richong Zhang, Wanyu Tu, Zhijie Nie, Haokun Luo, Chuantao Yin, Pengchong Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.22742
Pdf URL: https://arxiv.org/pdf/2601.22742
Copy Paste: [[2601.22742]] AR-BENCH: Benchmarking Legal Reasoning with Judgment Error Detection, Classification and Correction(https://arxiv.org/abs/2601.22742)
Keywords: language model
Abstract: Legal judgments may contain errors due to the complexity of case circumstances and the abstract nature of legal concepts, while existing appellate review mechanisms face efficiency pressures from a surge in case volumes. Although current legal AI research focuses on tasks like judgment prediction and legal document generation, the task of judgment review differs fundamentally in its objectives and paradigm: it centers on detecting, classifying, and correcting errors after a judgment is issued, constituting anomaly detection rather than prediction or generation. To address this research gap, we introduce a novel task APPELLATE REVIEW, aiming to assess models' diagnostic reasoning and reliability in legal practice. We also construct a novel dataset benchmark AR-BENCH, which comprises 8,700 finely annotated decisions and 34,617 supplementary corpora. By evaluating 14 large language models, we reveal critical limitations in existing models' ability to identify legal application errors, providing empirical evidence for future improvements.
摘要：由于案件情况的复杂性和法律概念的抽象性，法律判决可能存在错误，而现有的上诉审查机制面临案件量激增带来的效率压力。虽然当前的法律人工智能研究主要集中在判决预测和法律文书生成等任务，但判决审查的任务在目标和范式上有根本的不同：它的核心是判决发布后的错误检测、分类和纠正，构成异常检测而不是预测或生成。为了弥补这一研究空白，我们引入了一项新任务上诉审查，旨在评估模型在法律实践中的诊断推理和可靠性。我们还构建了一个新颖的数据集基准 AR-BENCH，其中包含 8,700 个精细注释的决策和 34,617 个补充语料库。通过评估 14 个大型语言模型，我们揭示了现有模型识别法律应用错误的能力的关键局限性，为未来的改进提供了经验证据。

Title: RASST: Fast Cross-modal Retrieval-Augmented Simultaneous Speech Translation

Authors: Jiaxuan Luo, Siqi Ouyang, Lei Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.22777
Pdf URL: https://arxiv.org/pdf/2601.22777
Copy Paste: [[2601.22777]] RASST: Fast Cross-modal Retrieval-Augmented Simultaneous Speech Translation(https://arxiv.org/abs/2601.22777)
Keywords: language model, llm
Abstract: Simultaneous speech translation (SST) produces target text incrementally from partial speech input. Recent speech large language models (Speech LLMs) have substantially improved SST quality, yet they still struggle to correctly translate rare and domain-specific terminology. While retrieval augmentation has been effective for terminology translation in machine translation, bringing retrieval to SST is non-trivial: it requires fast and accurate cross-modal (speech-to-text) retrieval under partial, continually arriving input, and the model must decide whether and when to apply retrieved terms during incremental generation. We propose Retrieval-Augmented Simultaneous Speech Translation (RASST), which tightly integrates cross-modal retrieval into the SST pipeline. RASST trains a lightweight speech-text retriever and performs efficient sliding-window retrieval, providing chunkwise terminology hints to the Speech LLM. We further synthesize training data that teaches the Speech LLM to leverage retrieved terms precisely. Experiments on three language directions of the ACL 60/60 dev set show that RASST improves terminology translation accuracy by up to 16% and increases overall translation quality by up to 3 BLEU points, with ablations confirming the contribution of each component.
摘要：同步语音翻译 (SST) 从部分语音输入增量生成目标文本。最近的语音大语言模型 (Speech LLM) 显着提高了 SST 质量，但它们仍然难以正确翻译罕见的和特定领域的术语。虽然检索增强对于机器翻译中的术语翻译非常有效，但将检索引入 SST 并非易事：它需要在部分、持续到达的输入下进行快速、准确的跨模式（语音到文本）检索，并且模型必须决定在增量生成期间是否以及何时应用检索到的术语。我们提出检索增强同步语音翻译（RASST），它将跨模态检索紧密集成到 SST 管道中。 RASST 训练轻量级语音文本检索器并执行高效的滑动窗口检索，为语音法学硕士提供分块术语提示。我们进一步综合训练数据，教导语音法学硕士精确利用检索到的术语。对 ACL 60/60 开发集的三个语言方向进行的实验表明，RASST 将术语翻译准确性提高了高达 16%，并将整体翻译质量提高了多达 3 个 BLEU 点，并通过消融确认了每个组件的贡献。

Title: Sparse or Dense? A Mechanistic Estimation of Computation Density in Transformer-based LLMs

Authors: Corentin Kervadec, Iuliia Lysova, Marco Baroni, Gemma Boleda
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.22795
Pdf URL: https://arxiv.org/pdf/2601.22795
Copy Paste: [[2601.22795]] Sparse or Dense? A Mechanistic Estimation of Computation Density in Transformer-based LLMs(https://arxiv.org/abs/2601.22795)
Keywords: language model, llm
Abstract: Transformer-based large language models (LLMs) are comprised of billions of parameters arranged in deep and wide computational graphs. Several studies on LLM efficiency optimization argue that it is possible to prune a significant portion of the parameters, while only marginally impacting performance. This suggests that the computation is not uniformly distributed across the parameters. We introduce here a technique to systematically quantify computation density in LLMs. In particular, we design a density estimator drawing on mechanistic interpretability. We experimentally test our estimator and find that: (1) contrary to what has been often assumed, LLM processing generally involves dense computation; (2) computation density is dynamic, in the sense that models shift between sparse and dense processing regimes depending on the input; (3) per-input density is significantly correlated across LLMs, suggesting that the same inputs trigger either low or high density. Investigating the factors influencing density, we observe that predicting rarer tokens requires higher density, and increasing context length often decreases the density. We believe that our computation density estimator will contribute to a better understanding of the processing at work in LLMs, challenging their symbolic interpretation.

Title: When Meanings Meet: Investigating the Emergence and Quality of Shared Concept Spaces during Multilingual Language Model Training

Authors: Felicia Körner, Max Müller-Eberstein, Anna Korhonen, Barbara Plank
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.22851
Pdf URL: https://arxiv.org/pdf/2601.22851
Copy Paste: [[2601.22851]] When Meanings Meet: Investigating the Emergence and Quality of Shared Concept Spaces during Multilingual Language Model Training(https://arxiv.org/abs/2601.22851)
Keywords: language model, llm, prompt
Abstract: Training Large Language Models (LLMs) with high multilingual coverage is becoming increasingly important -- especially when monolingual resources are scarce. Recent studies have found that LLMs process multilingual inputs in shared concept spaces, thought to support generalization and cross-lingual transfer. However, these prior studies often do not use causal methods, lack deeper error analysis or focus on the final model only, leaving open how these spaces emerge during training. We investigate the development of language-agnostic concept spaces during pretraining of EuroLLM through the causal interpretability method of activation patching. We isolate cross-lingual concept representations, then inject them into a translation prompt to investigate how consistently translations can be altered, independently of the language. We find that shared concept spaces emerge early and continue to refine, but that alignment with them is language-dependent. Furthermore, in contrast to prior work, our fine-grained manual analysis reveals that some apparent gains in translation quality reflect shifts in behavior -- like selecting senses for polysemous words or translating instead of copying cross-lingual homographs -- rather than improved translation ability. Our findings offer new insight into the training dynamics of cross-lingual alignment and the conditions under which causal interpretability methods offer meaningful insights in multilingual contexts.

Title: Leveraging LLMs For Turkish Skill Extraction

Authors: Ezgi Arslan İltüzer, Özgür Anıl Özlü, Vahid Farajijobehdar, Gülşen Eryiğit
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.22885
Pdf URL: https://arxiv.org/pdf/2601.22885
Copy Paste: [[2601.22885]] Leveraging LLMs For Turkish Skill Extraction(https://arxiv.org/abs/2601.22885)
Keywords: language model, llm, prompt
Abstract: Skill extraction is a critical component of modern recruitment systems, enabling efficient job matching, personalized recommendations, and labor market analysis. Despite Türkiye's significant role in the global workforce, Turkish, a morphologically complex language, lacks both a skill taxonomy and a dedicated skill extraction dataset, resulting in underexplored research in skill extraction for Turkish. This article seeks the answers to three research questions: 1) How can skill extraction be effectively performed for this language, in light of its low-resource nature? 2) What is the most promising model? 3) What is the impact of different Large Language Models (LLMs) and prompting strategies on skill extraction (i.e., dynamic vs. static few-shot samples, varying context information, and encouraging causal reasoning)? The article introduces the first Turkish skill extraction dataset and performance evaluations of automated skill extraction using LLMs. The manually annotated dataset contains 4,819 labeled skill spans from 327 job postings across different occupation areas. The use of LLMs outperforms supervised sequence labeling when used in an end-to-end pipeline, aligning extracted spans with standardized skills in the ESCO taxonomy more effectively. The best-performing configuration, utilizing Claude Sonnet 3.7 with dynamic few-shot prompting for skill identification, embedding-based retrieval, and LLM-based reranking for skill linking, achieves an end-to-end performance of 0.56, positioning Turkish alongside similar studies in other languages, which are few in the literature. Our findings suggest that LLMs can improve skill extraction performance in low-resource settings, and we hope that our work will accelerate similar research on skill extraction for underrepresented languages.

Title: Should LLMs, $\textit{like}$, Generate How Users Talk? Building Dialect-Accurate Dialog[ue]s Beyond the American Default with MDial

Authors: Jio Oh, Paul Vicinanza, Thomas Butler, Steven Euijong Whang, Dezhi Hong, Amani Namboori
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.22888
Pdf URL: https://arxiv.org/pdf/2601.22888
Copy Paste: [[2601.22888]] Should LLMs, $\textit{like}$, Generate How Users Talk? Building Dialect-Accurate Dialog[ue]s Beyond the American Default with MDial(https://arxiv.org/abs/2601.22888)
Keywords: llm
Abstract: More than 80% of the 1.6 billion English speakers do not use Standard American English (SAE) and experience higher failure rates and stereotyped responses when interacting with LLMs as a result. Yet multi-dialectal performance remains underexplored. We introduce MDial, the first large-scale framework for generating multi-dialectal conversational data encompassing the three pillars of written dialect -- lexical (vocabulary), orthographic (spelling), and morphosyntactic (grammar) features -- for nine English dialects. Partnering with native linguists, we design an annotated and scalable rule-based LLM transformation to ensure precision. Our approach challenges the assumption that models should mirror users' morphosyntactic features, showing that up to 90% of the grammatical features of a dialect should not be reproduced by models. Independent evaluations confirm data quality, with annotators preferring MDial outputs over prior methods in 98% of pairwise comparisons for dialect naturalness. Using this pipeline, we construct the dialect-parallel benchmark MDialBench with 50k+ dialogs, resulting in 97k+ QA pairs, and evaluate 17 LLMs on dialect identification and response generation tasks. Even frontier models achieve under 70% accuracy, fail to reach 50% for Canadian English, and systematically misclassify non-SAE dialects as American or British. As dialect identification underpins natural language understanding, these errors risk cascading failures into downstream tasks.

Title: DiffuSpeech: Silent Thought, Spoken Answer via Unified Speech-Text Diffusion

Authors: Yuxuan Lou, Ziming Wu, Yaochen Wang, Yong Liu, Yingxuan Ren, Fuming Lai, Shaobing Lian, Jie Tang, Yang You
Subjects: cs.CL, cs.AI, cs.LG, cs.SD
Abstract URL: https://arxiv.org/abs/2601.22889
Pdf URL: https://arxiv.org/pdf/2601.22889
Copy Paste: [[2601.22889]] DiffuSpeech: Silent Thought, Spoken Answer via Unified Speech-Text Diffusion(https://arxiv.org/abs/2601.22889)
Keywords: language model, llm
Abstract: Current speech language models generate responses directly without explicit reasoning, leading to errors that cannot be corrected once audio is produced. We introduce "Silent Thought, Spoken Answer" -- a paradigm where speech LLMs generate internal text reasoning alongside spoken responses, with thinking traces informing speech quality. To realize this, we present DiffuSpeech, the first diffusion-based speech-text language model supporting both understanding and generation, unifying discrete text and tokenized speech under a single masked diffusion framework. Unlike autoregressive approaches, DiffuSpeech jointly generates reasoning traces and speech tokens through iterative denoising, with modality-specific masking schedules. We also construct the first speech QA dataset with paired text reasoning traces, containing 26K samples totaling 319 hours. Experiments show that DiffuSpeech achieves state-of-the-art speech-to-speech QA accuracy, outperforming the best baseline by up to 9 points, while attaining the best TTS quality among generative models (6.2% WER) and preserving language understanding (66.2% MMLU). Ablations confirm that both the diffusion architecture and thinking traces contribute to these gains.

Title: LLMs Explain't: A Post-Mortem on Semantic Interpretability in Transformer Models

Authors: Alhassan Abdelhalim, Janick Edinger, Sören Laue, Michaela Regneri
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.22928
Pdf URL: https://arxiv.org/pdf/2601.22928
Copy Paste: [[2601.22928]] LLMs Explain't: A Post-Mortem on Semantic Interpretability in Transformer Models(https://arxiv.org/abs/2601.22928)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are becoming increasingly popular in pervasive computing due to their versatility and strong performance. However, despite their ubiquitous use, the exact mechanisms underlying their outstanding performance remain unclear. Different methods for LLM explainability exist, and many of these methods are themselves not fully understood. We started with the question of how linguistic abstraction emerges in LLMs, aiming to detect it across different LLM modules (attention heads and input embeddings). For this, we used methods well-established in the literature: (1) probing for token-level relational structures, and (2) feature-mapping using embeddings as carriers of human-interpretable properties. Both attempts failed for different methodological reasons: attention-based explanations collapsed once we tested the core assumption that later-layer representations still correspond to tokens. Property-inference methods applied to embeddings also failed because their high predictive scores were driven by methodological artifacts and dataset structure rather than meaningful semantic knowledge. These failures matter because both techniques are widely treated as evidence for what LLMs supposedly understand, yet our results show such conclusions are unwarranted. These limitations are particularly relevant in pervasive and distributed computing settings where LLMs are deployed as system components and interpretability methods are relied upon for debugging, compressing, and explaining models.

Title: Benchmarking Machine Translation on Chinese Social Media Texts

Authors: Kaiyan Zhao, Zheyong Xie, Zhongtao Miao, Xinze Lyu, Yao Hu, Shaosheng Cao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.22931
Pdf URL: https://arxiv.org/pdf/2601.22931
Copy Paste: [[2601.22931]] Benchmarking Machine Translation on Chinese Social Media Texts(https://arxiv.org/abs/2601.22931)
Keywords: llm
Abstract: The prevalence of rapidly evolving slang, neologisms, and highly stylized expressions in informal user-generated text, particularly on Chinese social media, poses significant challenges for Machine Translation (MT) benchmarking. Specifically, we identify two primary obstacles: (1) data scarcity, as high-quality parallel data requires bilingual annotators familiar with platform-specific slang and stylistic cues in both languages; and (2) metric limitations, where traditional evaluators like COMET often fail to capture stylistic fidelity and nonstandard expressions. To bridge these gaps, we introduce CSM-MTBench, a benchmark covering five Chinese-foreign language directions and consisting of two expert-curated subsets: Fun Posts, featuring context-rich, slang- and neologism-heavy content, and Social Snippets, emphasizing concise, emotion- and style-driven expressions. Furthermore, we propose tailored evaluation approaches for each subset: measuring the translation success rate of slang and neologisms in Fun Posts, while assessing tone and style preservation in Social Snippets via a hybrid of embedding-based metrics and LLM-as-a-judge. Experiments on over 20 models reveal substantial variation in how current MT systems handle semantic fidelity and informal, social-media-specific stylistic cues. CSM-MTBench thus serves as a rigorous testbed for advancing MT systems capable of mastering real-world Chinese social media texts.

Title: Relaxing Positional Alignment in Masked Diffusion Language Models

Authors: Mengyu Ye, Ryosuke Takahashi, Keito Kudo, Jun Suzuki
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.22947
Pdf URL: https://arxiv.org/pdf/2601.22947
Copy Paste: [[2601.22947]] Relaxing Positional Alignment in Masked Diffusion Language Models(https://arxiv.org/abs/2601.22947)
Keywords: language model
Abstract: Masked diffusion language models (MDLMs) have emerged as a promising alternative to dominant autoregressive approaches. Although they achieve competitive performance on several tasks, a substantial gap remains in open-ended text generation. We hypothesize that one cause of this gap is that strict positional prediction makes MDLM decoding highly sensitive to token misalignment, and we show through controlled interventions that a one-position shift can severely disrupt semantics. This observation suggests that enforcing strict positional supervision during training is misaligned with the irreversible denoising dynamics of MDLM decoding. Motivated by this mismatch, we adopt an alignment-flexible supervision strategy during fine-tuning. Specifically, we introduce a special token via the connectionist temporal classification objective. We apply this approach to the widely used MDLM model and conduct experiments on five open-ended text generation benchmarks. Our method consistently outperforms the original model and improves robustness to positional shifts, indicating that relaxing strict positional supervision is an important factor in improving generation quality in MDLMs.
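Note: the abstract above names the connectionist temporal classification (CTC) objective as the source of its alignment-flexible supervision. The sketch below shows only the standard CTC collapse rule (merge consecutive repeats, then drop the special token), which is why position-shifted outputs can count as the same target. It is not taken from the paper; the token name `<blank>` and the function `ctc_collapse` are hypothetical, and the authors' actual variant may differ.

```python
from itertools import groupby

BLANK = "<blank>"  # hypothetical name for the special token


def ctc_collapse(tokens, blank=BLANK):
    """Apply the standard CTC rule: merge consecutive repeats, then drop blanks."""
    merged = [tok for tok, _ in groupby(tokens)]
    return [tok for tok in merged if tok != blank]


# Two position-wise outputs that differ by a one-position shift.
aligned = ["the", "cat", "sat", BLANK]
shifted = [BLANK, "the", "cat", "sat"]

# Under strict positional comparison these disagree at every slot,
# but they collapse to the same target sequence.
positional_matches = sum(a == b for a, b in zip(aligned, shifted))
collapsed_aligned = ctc_collapse(aligned)
collapsed_shifted = ctc_collapse(shifted)
print(positional_matches, collapsed_aligned == collapsed_shifted)
```

Under this rule a shifted prediction is not penalised as long as the collapsed sequence matches, which is the flexibility the abstract contrasts with strict positional supervision.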

Title: Autonomous Chain-of-Thought Distillation for Graph-Based Fraud Detection

Authors: Yuan Li, Jun Hu, Bryan Hooi, Bingsheng He, Cheng Chen
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2601.22949
Pdf URL: https://arxiv.org/pdf/2601.22949
Copy Paste: [[2601.22949]] Autonomous Chain-of-Thought Distillation for Graph-Based Fraud Detection(https://arxiv.org/abs/2601.22949)
Keywords: llm, prompt, chain-of-thought
Abstract: Graph-based fraud detection on text-attributed graphs (TAGs) requires jointly modeling rich textual semantics and relational dependencies. However, existing LLM-enhanced GNN approaches are constrained by predefined prompting and decoupled training pipelines, limiting reasoning autonomy and weakening semantic-structural alignment. We propose FraudCoT, a unified framework that advances TAG-based fraud detection through autonomous, graph-aware chain-of-thought (CoT) reasoning and scalable LLM-GNN co-training. To address the limitations of predefined prompts, we introduce a fraud-aware selective CoT distillation mechanism that generates diverse reasoning paths and enhances semantic-structural understanding. These distilled CoTs are integrated into node texts, providing GNNs with enriched, multi-hop semantic and structural cues for fraud detection. Furthermore, we develop an efficient asymmetric co-training strategy that enables end-to-end optimization while significantly reducing the computational cost of naive joint training. Extensive experiments on public and industrial benchmarks demonstrate that FraudCoT achieves up to 8.8% AUPRC improvement over state-of-the-art methods and delivers up to 1,066x speedup in training throughput, substantially advancing both detection performance and efficiency.

Title: Residual Context Diffusion Language Models

Authors: Yuezhou Hu, Harman Singh, Monishwaran Maheswaran, Haocheng Xi, Coleman Hooper, Jintao Zhang, Aditya Tomar, Michael W. Mahoney, Sewon Min, Mehrdad Farajtabar, Kurt Keutzer, Amir Gholami, Chenfeng Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.22954
Pdf URL: https://arxiv.org/pdf/2601.22954
Copy Paste: [[2601.22954]] Residual Context Diffusion Language Models(https://arxiv.org/abs/2601.22954)
Keywords: language model, llm
Abstract: Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to purely autoregressive language models because they can decode multiple tokens in parallel. However, state-of-the-art block-wise dLLMs rely on a "remasking" mechanism that decodes only the most confident tokens and discards the rest, effectively wasting computation. We demonstrate that recycling computation from the discarded tokens is beneficial, as these tokens retain contextual information useful for subsequent decoding iterations. In light of this, we propose Residual Context Diffusion (RCD), a module that converts these discarded token representations into contextual residuals and injects them back for the next denoising step. RCD uses a decoupled two-stage training pipeline to bypass the memory bottlenecks associated with backpropagation. We validate our method on both long CoT reasoning (SDAR) and short CoT instruction following (LLaDA) models. We demonstrate that a standard dLLM can be efficiently converted to the RCD paradigm with merely ~1 billion tokens. RCD consistently improves frontier dLLMs by 5-10 points in accuracy with minimal extra computation overhead across a wide range of benchmarks. Notably, on the most challenging AIME tasks, RCD nearly doubles baseline accuracy and attains up to 4-5x fewer denoising steps at equivalent accuracy levels.

Title: A Unified View of Attention and Residual Sinks: Outlier-Driven Rescaling is Essential for Transformer Training

Authors: Zihan Qiu, Zeyu Huang, Kaiyue Wen, Peng Jin, Bo Zheng, Yuxin Zhou, Haofeng Huang, Zekun Wang, Xiao Li, Huaqing Zhang, Yang Xu, Haoran Lian, Siqi Zhang, Rui Men, Jianwei Zhang, Ivan Titov, Dayiheng Liu, Jingren Zhou, Junyang Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.22966
Pdf URL: https://arxiv.org/pdf/2601.22966
Copy Paste: [[2601.22966]] A Unified View of Attention and Residual Sinks: Outlier-Driven Rescaling is Essential for Transformer Training(https://arxiv.org/abs/2601.22966)
Keywords: language model
Abstract: We investigate the functional role of emergent outliers in large language models, specifically attention sinks (a few tokens that consistently receive large attention logits) and residual sinks (a few fixed dimensions with persistently large activations across most tokens). We hypothesize that these outliers, in conjunction with the corresponding normalizations (e.g., softmax attention and RMSNorm), effectively rescale other non-outlier components. We term this phenomenon outlier-driven rescaling and validate this hypothesis across different model architectures and training token counts. This view unifies the origin and mitigation of both sink types. Our main conclusions and observations include: (1) Outliers function jointly with normalization: removing normalization eliminates the corresponding outliers but degrades training stability and performance; directly clipping outliers while retaining normalization leads to degradation, indicating that outlier-driven rescaling contributes to training stability. (2) Outliers serve more as rescale factors than as contributors, as the final contributions of attention and residual sinks are significantly smaller than those of non-outliers. (3) Outliers can be absorbed into learnable parameters or mitigated via explicit gated rescaling, leading to improved training performance (average gain of 2 points) and enhanced quantization robustness (1.2 points degradation under W4A4 quantization).
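Note: the rescaling effect described in the abstract above can be illustrated with plain softmax arithmetic. The sketch below is not from the paper; the logit values are made up for illustration. It shows that adding one token with a large logit (a hypothetical "sink") multiplies every other token's attention weight by the same factor, leaving their relative proportions unchanged.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical attention logits for three ordinary tokens.
ordinary = [1.0, 2.0, 0.5]
# The same logits with one hypothetical sink token carrying a large logit.
with_sink = [8.0] + ordinary

p_plain = softmax(ordinary)
p_sink = softmax(with_sink)[1:]  # weights left for the ordinary tokens

# Total attention mass remaining on the ordinary tokens.
scale = sum(p_sink)
# Per-token ratio of "with sink" weight to "without sink" weight.
ratios = [s / p for s, p in zip(p_sink, p_plain)]

print(f"mass on ordinary tokens: {scale:.4f}")
print("per-token scale factors:", [round(r, 6) for r in ratios])
```

Every per-token factor equals the remaining mass on the ordinary tokens, so in this toy setting the sink acts purely as a shared scale factor rather than changing which ordinary token is preferred.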

Title: ArabicDialectHub: A Cross-Dialectal Arabic Learning Resource and Platform

Authors: Salem Lahlou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.22987
Pdf URL: https://arxiv.org/pdf/2601.22987
Copy Paste: [[2601.22987]] ArabicDialectHub: A Cross-Dialectal Arabic Learning Resource and Platform(https://arxiv.org/abs/2601.22987)
Keywords: llm
Abstract: We present ArabicDialectHub, a cross-dialectal Arabic learning resource comprising 552 phrases across six varieties (Moroccan Darija, Lebanese, Syrian, Emirati, Saudi, and MSA) and an interactive web platform. Phrases were generated using LLMs and validated by five native speakers, stratified by difficulty, and organized thematically. The open-source platform provides translation exploration, adaptive quizzing with algorithmic distractor generation, cloud-synchronized progress tracking, and cultural context. Both the dataset and complete platform source code are released under the MIT license.

Title: Bias Beyond Borders: Political Ideology Evaluation and Steering in Multilingual LLMs

Authors: Afrozah Nadeem, Agrima, Mehwish Nasim, Usman Naseem
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.23001
Pdf URL: https://arxiv.org/pdf/2601.23001
Copy Paste: [[2601.23001]] Bias Beyond Borders: Political Ideology Evaluation and Steering in Multilingual LLMs(https://arxiv.org/abs/2601.23001)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) increasingly shape global discourse, making fairness and ideological neutrality essential for responsible AI deployment. Despite growing attention to political bias in LLMs, prior work largely focuses on high-resource, Western languages or narrow multilingual settings, leaving cross-lingual consistency and safe post-hoc mitigation underexplored. To address this gap, we present a large-scale multilingual evaluation of political bias spanning 50 countries and 33 languages. We introduce a complementary post-hoc mitigation framework, Cross-Lingual Alignment Steering (CLAS), designed to augment existing steering methods by aligning ideological representations across languages and dynamically regulating intervention strength. This method aligns latent ideological representations induced by political prompts into a shared ideological subspace, ensuring cross-lingual consistency, while the adaptive mechanism prevents overcorrection and preserves coherence. Experiments demonstrate substantial bias reduction along both economic and social axes with minimal degradation in response quality. The proposed framework establishes a scalable and interpretable paradigm for fairness-aware multilingual LLM governance, balancing ideological neutrality with linguistic and cultural diversity.

Title: InstructDiff: Domain-Adaptive Data Selection via Differential Entropy for Efficient LLM Fine-Tuning

Authors: Junyou Su, He Zhu, Xiao Luo, Liyu Zhang, Hong-Yu Zhou, Yun Chen, Peng Li, Yang Liu, Guanhua Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.23006
Pdf URL: https://arxiv.org/pdf/2601.23006
Copy Paste: [[2601.23006]] InstructDiff: Domain-Adaptive Data Selection via Differential Entropy for Efficient LLM Fine-Tuning(https://arxiv.org/abs/2601.23006)
Keywords: language model, llm
Abstract: Supervised fine-tuning (SFT) is fundamental to adapting large language models, yet training on complete datasets incurs prohibitive costs with diminishing returns. Existing data selection methods suffer from severe domain specificity: techniques optimized for general instruction-following fail on reasoning tasks, and vice versa. We observe that measuring entropy differences between base models and minimally instruction-tuned calibrated models reveals a pattern -- samples with the lowest differential entropy consistently yield optimal performance across domains, yet this principle manifests domain-adaptively: reasoning tasks favor entropy increase (cognitive expansion), while general tasks favor entropy decrease (cognitive compression). We introduce InstructDiff, a unified framework that operationalizes differential entropy as a domain-adaptive selection criterion through warmup calibration, bi-directional NLL filtering, and entropy-based ranking. Extensive experiments show that InstructDiff achieves 17% relative improvement over full data training on mathematical reasoning and 52% for general instruction-following, outperforming prior baselines while using only 10% of the data.

Title: DimABSA: Building Multilingual and Multidomain Datasets for Dimensional Aspect-Based Sentiment Analysis

Authors: Lung-Hao Lee, Liang-Chih Yu, Natalia Loukashevich, Ilseyar Alimova, Alexander Panchenko, Tzu-Mi Lin, Zhe-Yu Xu, Jian-Yu Zhou, Guangmin Zheng, Jin Wang, Sharanya Awasthi, Jonas Becker, Jan Philip Wahle, Terry Ruas, Shamsuddeen Hassan Muhammad, Saif M. Mohammed
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.23022
Pdf URL: https://arxiv.org/pdf/2601.23022
Copy Paste: [[2601.23022]] DimABSA: Building Multilingual and Multidomain Datasets for Dimensional Aspect-Based Sentiment Analysis(https://arxiv.org/abs/2601.23022)
Keywords: language model, prompt
Abstract: Aspect-Based Sentiment Analysis (ABSA) focuses on extracting sentiment at a fine-grained aspect level and has been widely applied across real-world domains. However, existing ABSA research relies on coarse-grained categorical labels (e.g., positive, negative), which limits its ability to capture nuanced affective states. To address this limitation, we adopt a dimensional approach that represents sentiment with continuous valence-arousal (VA) scores, enabling fine-grained analysis at both the aspect and sentiment levels. To this end, we introduce DimABSA, the first multilingual, dimensional ABSA resource annotated with both traditional ABSA elements (aspect terms, aspect categories, and opinion terms) and newly introduced VA scores. This resource contains 76,958 aspect instances across 42,590 sentences, spanning six languages and four domains. We further introduce three subtasks that combine VA scores with different ABSA elements, providing a bridge from traditional ABSA to dimensional ABSA. Given that these subtasks involve both categorical and continuous outputs, we propose a new unified metric, continuous F1 (cF1), which incorporates VA prediction error into standard F1. We provide a comprehensive benchmark using both prompted and fine-tuned large language models across all subtasks. Our results show that DimABSA is a challenging benchmark and provides a foundation for advancing multilingual dimensional ABSA.
摘要：基于方面的情感分析（ABSA）专注于在细粒度的方面级别提取情感，并已广泛应用于现实世界领域。然而，现有的 ABSA 研究依赖于粗粒度的分类标签（例如积极、消极），这限制了其捕捉微妙情感状态的能力。为了解决这个限制，我们采用了一种维度方法，用连续的价唤醒（VA）分数来表示情感，从而能够在方面和情感层面进行细粒度的分析。为此，我们推出了 DimABSA，这是第一个多语言、维度 ABSA 资源，用传统的 ABSA 元素（方面术语、方面类别和观点术语）和新引入的 VA 分数进行注释。该资源包含 42,590 个句子的 76,958 个方面实例，跨越六种语言和四个领域。我们进一步引入了三个子任务，将 VA 分数与不同的 ABSA 元素相结合，提供从传统 ABSA 到维度 ABSA 的桥梁。鉴于这些子任务涉及分类和连续输出，我们提出了一种新的统一度量，连续 F1 (cF1)，它将 VA 预测误差纳入标准 F1。我们在所有子任务中使用提示和微调的大型语言模型提供全面的基准测试。我们的结果表明 DimABSA 是一个具有挑战性的基准，并为推进多语言维度 ABSA 提供了基础。

Title: Character as a Latent Variable in Large Language Models: A Mechanistic Account of Emergent Misalignment and Conditional Safety Failures

Authors: Yanghao Su, Wenbo Zhou, Tianwei Zhang, Qiu Han, Weiming Zhang, Nenghai Yu, Jie Zhang
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2601.23081
Pdf URL: https://arxiv.org/pdf/2601.23081
Copy Paste: [[2601.23081]] Character as a Latent Variable in Large Language Models: A Mechanistic Account of Emergent Misalignment and Conditional Safety Failures(https://arxiv.org/abs/2601.23081)
Keywords: language model, llm, prompt
Abstract: Emergent Misalignment refers to a failure mode in which fine-tuning large language models (LLMs) on narrowly scoped data induces broadly misaligned behavior. Prior explanations mainly attribute this phenomenon to the generalization of erroneous or unsafe content. In this work, we show that this view is incomplete. Across multiple domains and model families, we find that fine-tuning models on data exhibiting specific character-level dispositions induces substantially stronger and more transferable misalignment than incorrect-advice fine-tuning, while largely preserving general capabilities. This indicates that emergent misalignment arises from stable shifts in model behavior rather than from capability degradation or corrupted knowledge. We further show that such behavioral dispositions can be conditionally activated by both training-time triggers and inference-time persona-aligned prompts, revealing shared structure across emergent misalignment, backdoor activation, and jailbreak susceptibility. Overall, our results identify character formation as a central and underexplored alignment risk, suggesting that robust alignment must address behavioral dispositions rather than isolated errors or prompt-level defenses.
摘要：紧急错位是指一种故障模式，其中对范围狭窄的数据进行大型语言模型 (LLM) 的微调会导致广泛的错位行为。先前的解释主要将这种现象归因于错误或不安全内容的泛化。在这项工作中，我们表明这种观点是不完整的。在多个领域和模型系列中，我们发现对表现出特定字符级倾向的数据进行微调模型会比错误建议微调产生更强、更可转移的错位，同时在很大程度上保留一般功能。这表明出现的失调是由模型行为的稳定变化引起的，而不是由能力退化或知识损坏引起的。我们进一步表明，这种行为倾向可以通过训练时触发器和推理时角色对齐提示有条件地激活，从而揭示了紧急错位、后门激活和越狱敏感性之间的共享结构。总体而言，我们的结果将性格形成视为一个核心且未被充分探索的一致性风险，这表明强大的一致性必须解决行为倾向，而不是孤立的错误或即时级别的防御。

Title: Safer Policy Compliance with Dynamic Epistemic Fallback

Authors: Joseph Marvin Imperial, Harish Tayyar Madabushi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.23094
Pdf URL: https://arxiv.org/pdf/2601.23094
Copy Paste: [[2601.23094]] Safer Policy Compliance with Dynamic Epistemic Fallback(https://arxiv.org/abs/2601.23094)
Keywords: llm
Abstract: Humans develop a series of cognitive defenses, known as epistemic vigilance, to combat risks of deception and misinformation from everyday interactions. Developing safeguards for LLMs inspired by this mechanism might be particularly helpful for their application in high-stakes tasks such as automating compliance with data privacy laws. In this paper, we introduce Dynamic Epistemic Fallback (DEF), a dynamic safety protocol for improving an LLM's inference-time defenses against deceptive attacks that make use of maliciously perturbed policy texts. Through various levels of one-sentence textual cues, DEF nudges LLMs to flag inconsistencies, refuse compliance, and fall back on their parametric knowledge upon encountering perturbed policy texts. Using globally recognized legal policies such as HIPAA and GDPR, our empirical evaluations report that DEF effectively improves the capability of frontier LLMs to detect and refuse perturbed versions of policies, with DeepSeek-R1 achieving a 100% detection rate in one setting. This work encourages further efforts to develop cognitively inspired defenses to improve LLM robustness against forms of harm and deception that exploit legal artifacts.
摘要：人类发展出一系列认知防御，称为认知警惕，以应对日常互动中的欺骗和错误信息风险。受此机制启发，为法学硕士制定保障措施可能对其在高风险任务中的应用特别有帮助，例如自动遵守数据隐私法。在本文中，我们介绍了动态认知回退（DEF），这是一种动态安全协议，用于改进法学硕士的推理时间防御，以防止利用恶意干扰策略文本的欺骗性攻击。通过各种级别的单句文本提示，DEF 促使法学硕士标记不一致、拒绝遵守，并在遇到扰乱的政策文本时回退到他们的参数知识。使用 HIPAA 和 GDPR 等全球公认的法律政策，我们的实证评估报告显示，DEF 有效提高了前沿法学硕士检测和拒绝受干扰的政策版本的能力，DeepSeek-R1 在一种设置下实现了 100% 的检测率。这项工作鼓励进一步努力开发认知启发的防御措施，以提高法学硕士针对利用法律工件的伤害和欺骗形式的鲁棒性。

Title: Evaluating the Utility of Grounding Documents with Reference-Free LLM-based Metrics

Authors: Yilun Hua, Giuseppe Castellucci, Peter Schulam, Heba Elfardy, Kevin Small
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.23129
Pdf URL: https://arxiv.org/pdf/2601.23129
Copy Paste: [[2601.23129]] Evaluating the Utility of Grounding Documents with Reference-Free LLM-based Metrics(https://arxiv.org/abs/2601.23129)
Keywords: llm, retrieval augmented generation
Abstract: Retrieval Augmented Generation (RAG)'s success depends on the utility the LLM derives from the content used for grounding. Quantifying content utility does not have a definitive specification and existing metrics ignore model-specific capabilities and/or rely on costly annotations. In this paper, we propose Grounding Generation Utility (GroGU), a model-specific and reference-free metric that defines utility as a function of the downstream LLM's generation confidence based on entropy. Despite having no annotation requirements, GroGU is largely faithful in distinguishing ground-truth documents while capturing nuances ignored by LLM-agnostic metrics. We apply GroGU to train a query-rewriter for RAG by identifying high-utility preference data for Direct Preference Optimization. Experiments show improvements by up to 18.2 points in Mean Reciprocal Rank and up to 9.4 points in answer accuracy.
摘要：检索增强生成（RAG）的成功取决于法学硕士从用于基础的内容中获得的效用。量化内容实用程序没有明确的规范，并且现有指标忽略特定于模型的功能和/或依赖于昂贵的注释。在本文中，我们提出了基础生成效用（GroGU），这是一种特定于模型且无参考的度量，将效用定义为下游 LLM 基于熵的生成置信度的函数。尽管没有注释要求，GroGU 在很大程度上忠实于区分真实文档，同时捕获与 LLM 无关的指标所忽略的细微差别。我们应用 GroGU 通过识别直接偏好优化的高实用性偏好数据来训练 RAG 的查询重写器。实验表明，平均倒数排名提高了 18.2 分，答案准确率提高了 9.4 分。
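Note on the reported metric: the abstract above states its retrieval gains in Mean Reciprocal Rank (MRR). As a reminder of what that number measures, here is a minimal sketch of the standard MRR computation. The function name and the sample relevance flags are illustrative assumptions and are not taken from the paper, which may compute the metric over a different candidate set.

```python
from typing import Sequence


def mean_reciprocal_rank(ranked_relevance: Sequence[Sequence[bool]]) -> float:
    """Average, over queries, of 1/rank of the first relevant document.

    Each inner sequence lists relevance flags for one query's results,
    in ranked order. A query with no relevant result contributes 0.
    """
    if not ranked_relevance:
        return 0.0
    total = 0.0
    for flags in ranked_relevance:
        for rank, is_relevant in enumerate(flags, start=1):
            if is_relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)


# Hypothetical example: first relevant hit at rank 1, rank 2, and never.
example = [[True, False], [False, True], [False, False]]
score = mean_reciprocal_rank(example)  # (1 + 0.5 + 0) / 3
```

MRR is usually reported on a 0 to 100 scale in papers, so an improvement of "18.2 points" would correspond to 0.182 on the 0 to 1 scale used in this sketch.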

Title: Monotonic Reference-Free Refinement for Autoformalization

Authors: Lan Zhang, Marco Valentino, André Freitas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.23166
Pdf URL: https://arxiv.org/pdf/2601.23166
Copy Paste: [[2601.23166]] Monotonic Reference-Free Refinement for Autoformalization(https://arxiv.org/abs/2601.23166)
Keywords: llm
Abstract: While statement autoformalization has advanced rapidly, full-theorem autoformalization remains largely unexplored. Existing iterative refinement methods in statement autoformalization typically improve isolated aspects of formalization, such as syntactic correctness, but struggle to jointly optimize multiple quality dimensions, which is critical for full-theorem autoformalization. We introduce a reference-free iterative monotonic process for full-theorem autoformalization that leverages complementary feedback from theorem provers and LLM-based judges, without access to ground-truth proofs or existing formalizations at inference time. Our approach optimizes a masked composite objective over Formal Validity, Logical Preservation, Mathematical Consistency, and Formal Quality, guided by a responsiveness map that indicates how different LLMs acting as different roles preferentially improve each dimension. We further propose an acceptance policy that guarantees certified monotonic improvement, and provide conditions ensuring convergence and termination. Empirical experiments demonstrate the proposed process enables simultaneous improvement across multiple dimensions, achieving 93.44% formal validity and a 78.22% overall score on miniF2F, and 44.09% formal validity and a 29.79% overall score on ProofNet.
摘要：虽然语句自动形式化发展迅速，但全定理自动形式化在很大程度上仍未得到探索。语句自动形式化中现有的迭代细化方法通常会改进形式化的孤立方面，例如语法正确性，但难以联合优化多个质量维度，这对于全定理自动形式化至关重要。我们引入了一种用于全定理自动形式化的无参考迭代单调过程，该过程利用来自定理证明者和基于法学硕士的法官的补充反馈，而无需在推理时访问地面事实证明或现有的形式化。我们的方法在形式有效性、逻辑保存性、数学一致性和形式质量上优化了掩蔽复合目标，并以响应性图为指导，该图表明不同的法学硕士作为不同的角色如何优先改进每个维度。我们进一步提出了一种保证单调改进的接受策略，并提供确保收敛和终止的条件。实证实验表明，所提出的过程可以在多个维度上同时进行改进，在 miniF2F 上实现 93.44% 的形式有效性和 78.22% 的总分，在 ProofNet 上实现 44.09% 的形式有效性和 29.79% 的总分。

Title: FourierSampler: Unlocking Non-Autoregressive Potential in Diffusion Language Models via Frequency-Guided Generation

Authors: Siyang He, Qiqi Wang, Xiaoran Liu, Hongnan Ma, Yiwei Shi, Yuerong Song, Ying Zhu, Tianyi Liang, Zengfeng Huang, Ziwei He, Xipeng Qiu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.23182
Pdf URL: https://arxiv.org/pdf/2601.23182
Copy Paste: [[2601.23182]] FourierSampler: Unlocking Non-Autoregressive Potential in Diffusion Language Models via Frequency-Guided Generation(https://arxiv.org/abs/2601.23182)
Keywords: language model, llm
Abstract: Despite the non-autoregressive potential of diffusion language models (dLLMs), existing decoding strategies demonstrate positional bias, failing to fully unlock the potential of arbitrary generation. In this work, we delve into the inherent spectral characteristics of dLLMs and present the first frequency-domain analysis showing that low-frequency components in hidden states primarily encode global structural information and long-range dependencies, while high-frequency components are responsible for characterizing local details. Based on this observation, we propose FourierSampler, which leverages a frequency-domain sliding window mechanism to dynamically guide the model to achieve a "structure-to-detail" generation. FourierSampler outperforms other inference enhancement strategies on LLaDA and SDAR, achieving relative improvements of 20.4% on LLaDA1.5-8B and 16.0% on LLaDA-8B-Instruct. It notably surpasses similarly sized autoregressive models like Llama3.1-8B-Instruct.
摘要：尽管扩散语言模型（dLLM）具有非自回归潜力，但现有的解码策略表现出位置偏差，无法完全释放任意生成的潜力。在这项工作中，我们深入研究了 dLLM 的固有频谱特性，并提出了第一个频域分析，表明隐藏状态中的低频分量主要编码全局结构信息和远程依赖性，而高频分量负责表征局部细节。基于这一观察，我们提出了 FourierSampler，它利用频域滑动窗口机制来动态引导模型实现“结构到细节”的生成。 FourierSampler 在 LLADA 和 SDAR 上的性能优于其他推理增强策略，在 LLaDA1.5-8B 上实现了 20.4% 的相对改进，在 LLaDA-8B-Instruct 上实现了 16.0% 的相对改进。它明显超过了 Llama3.1-8B-Instruct 等类似大小的自回归模型。

Title: JobResQA: A Benchmark for LLM Machine Reading Comprehension on Multilingual Résumés and JDs

Authors: Casimiro Pio Carrino, Paula Estrella, Rabih Zbib, Carlos Escolano, José A. R. Fonollosa
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.23183
Pdf URL: https://arxiv.org/pdf/2601.23183
Copy Paste: [[2601.23183]] JobResQA: A Benchmark for LLM Machine Reading Comprehension on Multilingual Résumés and JDs(https://arxiv.org/abs/2601.23183)
Keywords: llm
Abstract: We introduce JobResQA, a multilingual Question Answering benchmark for evaluating Machine Reading Comprehension (MRC) capabilities of LLMs on HR-specific tasks involving résumés and job descriptions. The dataset comprises 581 QA pairs across 105 synthetic résumé-job description pairs in five languages (English, Spanish, Italian, German, and Chinese), with questions spanning three complexity levels from basic factual extraction to complex cross-document reasoning. We propose a data generation pipeline derived from real-world sources through de-identification and data synthesis to ensure both realism and privacy, while controlled demographic and professional attributes (implemented via placeholders) enable systematic bias and fairness studies. We also present a cost-effective, human-in-the-loop translation pipeline based on the TEaR methodology, incorporating MQM error annotations and selective post-editing to ensure a high-quality multi-way parallel benchmark. We provide baseline evaluations across multiple open-weight LLM families using an LLM-as-judge approach, revealing higher performance on English and Spanish but substantial degradation for other languages, highlighting critical gaps in multilingual MRC capabilities for HR applications. JobResQA provides a reproducible benchmark for advancing fair and reliable LLM-based HR systems. The benchmark is publicly available at: this https URL
摘要：我们推出了 JobResQA，这是一种多语言问答基准，用于评估法学硕士在涉及简历和职位描述的人力资源特定任务上的机器阅读理解 (MRC) 能力。该数据集包含五种语言（英语、西班牙语、意大利语、德语和中文）的 105 个合成简历-职位描述对中的 581 个 QA 对，问题涵盖从基本事实提取到复杂的跨文档推理的三个复杂级别。我们提出了一个通过去识别和数据合成从现实世界来源派生的数据生成管道，以确保现实性和隐私性，同时受控的人口和专业属性（通过占位符实现）实现系统性偏见和公平性研究。我们还提出了一种基于 TEaR 方法的经济高效的人机交互翻译流程，结合了 MQM 错误注释和选择性译后编辑，以确保高质量的多路并行基准测试。我们使用法学硕士作为法官的方法对多个开放权重法学硕士系列进行基线评估，显示英语和西班牙语的表现较高，但其他语言的表现大幅下降，凸显了人力资源应用程序的多语言 MRC 能力的关键差距。 JobResQA 为推进公平可靠的基于 LLM 的人力资源系统提供了可重复的基准。该基准测试可在以下位置公开获取：此 https URL

Title: ReGuLaR: Variational Latent Reasoning Guided by Rendered Chain-of-Thought

Authors: Fanmeng Wang, Haotian Liu, Guojiang Zhao, Hongteng Xu, Zhifeng Gao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.23184
Pdf URL: https://arxiv.org/pdf/2601.23184
Copy Paste: [[2601.23184]] ReGuLaR: Variational Latent Reasoning Guided by Rendered Chain-of-Thought(https://arxiv.org/abs/2601.23184)
Keywords: language model, llm, chain-of-thought
Abstract: While Chain-of-Thought (CoT) significantly enhances the performance of Large Language Models (LLMs), explicit reasoning chains introduce substantial computational redundancy. Recent latent reasoning methods attempt to mitigate this by compressing reasoning processes into latent space, but often suffer from severe performance degradation due to the lack of appropriate compression guidance. In this study, we propose Rendered CoT-Guided variational Latent Reasoning (ReGuLaR), a simple yet novel latent learning paradigm resolving this issue. Fundamentally, we formulate latent reasoning within the Variational Auto-Encoding (VAE) framework, sampling the current latent reasoning state from the posterior distribution conditioned on previous ones. Specifically, when learning this variational latent reasoning model, we render explicit reasoning chains as images, from which we extract dense visual-semantic representations to regularize the posterior distribution, thereby achieving efficient compression with minimal information loss. Extensive experiments demonstrate that ReGuLaR significantly outperforms existing latent reasoning methods across both computational efficiency and reasoning effectiveness, and even surpasses CoT through multi-modal reasoning, providing a new and insightful solution to latent reasoning. Code: this https URL.
摘要：虽然思想链 (CoT) 显着增强了大型语言模型 (LLM) 的性能，但显式推理链引入了大量的计算冗余。最近的潜在推理方法试图通过将推理过程压缩到潜在空间来缓解这一问题，但由于缺乏适当的压缩指导，常常会遭受严重的性能下降。在这项研究中，我们提出了渲染 CoT 引导的变分潜在推理（ReGuLaR），这是一种简单而新颖的潜在学习范式，可以解决这个问题。从根本上说，我们在变分自动编码（VAE）框架内制定潜在推理，从以先前分布为条件的后验分布中采样当前的潜在推理状态。具体来说，在学习这种变分潜在推理模型时，我们将显式推理链渲染为图像，从中提取密集的视觉语义表示来正则化后验分布，从而以最小的信息损失实现有效的压缩。大量实验表明，ReGuLaR 在计算效率和推理有效性方面都显着优于现有的潜在推理方法，甚至通过多模态推理超越 CoT，为潜在推理提供了一种新的、富有洞察力的解决方案。代码：此 https URL。

Title: Deep Search with Hierarchical Meta-Cognitive Monitoring Inspired by Cognitive Neuroscience

Authors: Zhongxiang Sun, Qipeng Wang, Weijie Yu, Jingxuan Yang, Haolang Lu, Jun Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.23188
Pdf URL: https://arxiv.org/pdf/2601.23188
Copy Paste: [[2601.23188]] Deep Search with Hierarchical Meta-Cognitive Monitoring Inspired by Cognitive Neuroscience(https://arxiv.org/abs/2601.23188)
Keywords: language model, agent
Abstract: Deep search agents powered by large language models have demonstrated strong capabilities in multi-step retrieval, reasoning, and long-horizon task execution. However, their practical failures often stem from the lack of mechanisms to monitor and regulate reasoning and retrieval states as tasks evolve under uncertainty. Insights from cognitive neuroscience suggest that human metacognition is hierarchically organized, integrating fast anomaly detection with selectively triggered, experience-driven reflection. In this work, we propose Deep Search with Meta-Cognitive Monitoring (DS-MCM), a deep search framework augmented with an explicit hierarchical metacognitive monitoring mechanism. DS-MCM integrates a Fast Consistency Monitor, which performs lightweight checks on the alignment between external evidence and internal reasoning confidence, and a Slow Experience-Driven Monitor, which is selectively activated to guide corrective intervention based on experience memory from historical agent trajectories. By embedding monitoring directly into the reasoning-retrieval loop, DS-MCM determines both when intervention is warranted and how corrective actions should be informed by prior experience. Experiments across multiple deep search benchmarks and backbone models demonstrate that DS-MCM consistently improves performance and robustness.
摘要：由大型语言模型支持的深度搜索代理在多步骤检索、推理和长期任务执行方面表现出了强大的能力。然而，它们的实际失败往往源于任务在不确定性下发展时缺乏监控和调节推理和检索状态的机制。认知神经科学的见解表明，人类元认知是分层组织的，将快速异常检测与选择性触发的、经验驱动的反思相结合。在这项工作中，我们提出了带有元认知监控的深度搜索（DS-MCM），这是一种通过显式分层元认知监控机制增强的深度搜索框架。 DS-MCM 集成了快速一致性监视器和慢速经验驱动监视器，前者对外部证据和内部推理置信度之间的一致性进行轻量级检查，后者有选择地激活以根据历史代理轨迹的经验记忆来指导纠正干预。通过将监控直接嵌入到推理检索循环中，DS-MCM 可以确定何时需要进行干预以及如何根据先前的经验采取纠正措施。跨多个深度搜索基准和骨干模型的实验表明 DS-MCM 持续提高性能和稳健性。

Title: Are you going to finish that? A Practical Study of the Tokenization Boundary Problem

Authors: Hao Xu, Alisa Liu, Jonathan Hayase, Yejin Choi, Noah A. Smith
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.23223
Pdf URL: https://arxiv.org/pdf/2601.23223
Copy Paste: [[2601.23223]] Are you going to finish that? A Practical Study of the Tokenization Boundary Problem(https://arxiv.org/abs/2601.23223)
Keywords: language model, prompt
Abstract: Language models (LMs) are trained over sequences of tokens, whereas users interact with LMs via text. This mismatch gives rise to the partial token problem, which occurs when a user ends their prompt in the middle of the expected next-token, leading to distorted next-token predictions. Although this issue has been studied using arbitrary character prefixes, its prevalence and severity in realistic prompts respecting word boundaries remain underexplored. In this work, we identify three domains where token and "word" boundaries often do not line up: languages that do not use whitespace, highly compounding languages, and code. In Chinese, for example, up to 25% of word boundaries do not line up with token boundaries, making even natural, word-complete prompts susceptible to this problem. We systematically construct semantically natural prompts ending with partial tokens; in experiments, we find that they comprise a serious failure mode: frontier LMs consistently place three orders of magnitude less probability on the correct continuation compared to when the prompt is "backed-off" to be token-aligned. This degradation does not diminish with scale and often worsens for larger models. Finally, we evaluate inference-time mitigations for the partial token problem and validate the effectiveness of recent exact solutions. Overall, we demonstrate the scale and severity of probability distortion caused by tokenization in realistic use cases, and provide practical recommendations for model inference providers.
摘要：语言模型 (LM) 是通过标记序列进行训练的，而用户则通过文本与 LM 交互。这种不匹配会导致部分令牌问题，当用户在预期的下一个令牌中间结束提示时就会发生这种问题，从而导致下一个令牌的预测失真。尽管已经使用任意字符前缀研究了这个问题，但其在尊重单词边界的实际提示中的普遍性和严重性仍未得到充分研究。在这项工作中，我们确定了标记和“单词”边界通常不对齐的三个领域：不使用空格的语言、高度复合的语言和代码。例如，在中文中，高达 25% 的单词边界与标记边界不对齐，甚至自然的单词完整提示也容易出现此问题。我们系统地构建以部分标记结尾的语义自然提示；在实验中，我们发现它们构成了一种严重的失败模式：与提示“后退”以进行标记对齐相比，前沿 LM 始终将正确延续的概率降低了三个数量级。这种退化不会随着规模的扩大而减弱，并且对于较大的模型通常会恶化。最后，我们评估部分令牌问题的推理时间缓解措施，并验证最近精确解决方案的有效性。总的来说，我们展示了实际用例中由标记化引起的概率失真的规模和严重性，并为模型推理提供者提供了实用的建议。
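Illustration of the partial token problem described in the abstract above: a minimal sketch, assuming a toy greedy longest-match tokenizer as a stand-in for a real subword tokenizer. The vocabulary, the function names, and the back-off rule (keep the longest run of prompt tokens that also begins the tokenization of the full text) are hypothetical and are not taken from the paper.

```python
from typing import List

# Hypothetical toy vocabulary; real subword vocabularies are far larger.
VOCAB = {"hello", "hel", "lo", " world", "h", "e", "l", "o", " ", "w", "r", "d"}


def tokenize(text: str) -> List[str]:
    """Greedy longest-match tokenization over VOCAB."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {text[i]!r}")
    return tokens


def is_token_aligned(prompt: str, full_text: str) -> bool:
    """True if the prompt's tokens are a prefix of the full text's tokens."""
    p, f = tokenize(prompt), tokenize(full_text)
    return f[:len(p)] == p


def back_off(prompt: str, full_text: str) -> str:
    """Longest token-aligned prefix of the prompt, given the intended text."""
    kept = []
    for a, b in zip(tokenize(prompt), tokenize(full_text)):
        if a != b:
            break
        kept.append(a)
    return "".join(kept)


# "hello wor" tokenizes as hello, " ", w, o, r, but the model saw
# hello, " world" during training, so the prompt ends mid-token.
aligned = is_token_aligned("hello wor", "hello world")
safe_prompt = back_off("hello wor", "hello world")
```

In this toy setting the prompt "hello wor" is not token-aligned with "hello world", and backing off yields "hello". At real inference time the intended continuation is unknown, so practical mitigations have to work without it; this sketch only shows why the misalignment arises.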

Title: Now You Hear Me: Audio Narrative Attacks Against Large Audio-Language Models

Authors: Ye Yu, Haibo Jin, Yaoning Yu, Jun Zhuang, Haohan Wang
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2601.23255
Pdf URL: https://arxiv.org/pdf/2601.23255
Copy Paste: [[2601.23255]] Now You Hear Me: Audio Narrative Attacks Against Large Audio-Language Models(https://arxiv.org/abs/2601.23255)
Keywords: language model
Abstract: Large audio-language models increasingly operate on raw speech inputs, enabling more seamless integration across domains such as voice assistants, education, and clinical triage. This transition, however, introduces a distinct class of vulnerabilities that remain largely uncharacterized. We examine the security implications of this modality shift by designing a text-to-audio jailbreak that embeds disallowed directives within a narrative-style audio stream. The attack leverages an advanced instruction-following text-to-speech (TTS) model to exploit structural and acoustic properties, thereby circumventing safety mechanisms primarily calibrated for text. When delivered through synthetic speech, the narrative format elicits restricted outputs from state-of-the-art models, including Gemini 2.0 Flash, achieving a 98.26% success rate that substantially exceeds text-only baselines. These results highlight the need for safety frameworks that jointly reason over linguistic and paralinguistic representations, particularly as speech-based interfaces become more prevalent.
摘要：大型音频语言模型越来越多地在原始语音输入上运行，从而实现语音助手、教育和临床分诊等领域的更无缝集成。然而，这种转变引入了一类独特的漏洞，这些漏洞在很大程度上仍未得到表征。我们通过设计文本到音频越狱，将不允许的指令嵌入到叙事风格的音频流中，来研究这种模式转变的安全影响。该攻击利用先进的指令跟踪文本转语音 (TTS) 模型来利用结构和声学特性，从而规避主要针对文本校准的安全机制。当通过合成语音传递时，叙事格式会引发包括 Gemini 2.0 Flash 在内的最先进模型的有限输出，实现 98.26% 的成功率，大大超过纯文本基线。这些结果凸显了对安全框架的需求，该框架能够对语言和副语言表示进行联合推理，特别是在基于语音的界面变得更加普遍的情况下。

Title: PaperBanana: Automating Academic Illustration for AI Scientists

Authors: Dawei Zhu, Rui Meng, Yale Song, Xiyu Wei, Sujian Li, Tomas Pfister, Jinsung Yoon
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2601.23265
Pdf URL: https://arxiv.org/pdf/2601.23265
Copy Paste: [[2601.23265]] PaperBanana: Automating Academic Illustration for AI Scientists(https://arxiv.org/abs/2601.23265)
Keywords: language model, agent
Abstract: Despite rapid advances in autonomous AI scientists powered by language models, generating publication-ready illustrations remains a labor-intensive bottleneck in the research workflow. To lift this burden, we introduce PaperBanana, an agentic framework for automated generation of publication-ready academic illustrations. Powered by state-of-the-art VLMs and image generation models, PaperBanana orchestrates specialized agents to retrieve references, plan content and style, render images, and iteratively refine via self-critique. To rigorously evaluate our framework, we introduce PaperBananaBench, comprising 292 test cases for methodology diagrams curated from NeurIPS 2025 publications, covering diverse research domains and illustration styles. Comprehensive experiments demonstrate that PaperBanana consistently outperforms leading baselines in faithfulness, conciseness, readability, and aesthetics. We further show that our method effectively extends to the generation of high-quality statistical plots. Collectively, PaperBanana paves the way for the automated generation of publication-ready illustrations.
摘要：尽管在语言模型的支持下，自主人工智能科学家取得了快速进步，但生成可发表的插图仍然是研究工作流程中的劳动密集型瓶颈。为了减轻这一负担，我们引入了 PaperBanana，这是一个用于自动生成可供出版的学术插图的代理框架。在最先进的 VLM 和图像生成模型的支持下，PaperBanana 协调专门的代理来检索参考、规划内容和风格、渲染图像，并通过自我批评进行迭代完善。为了严格评估我们的框架，我们引入了 PaperBananaBench，其中包含来自 NeurIPS 2025 出版物的 292 个方法图测试用例，涵盖不同的研究领域和插图风格。综合实验表明，PaperBanana 在忠实性、简洁性、可读性和美观性方面始终优于领先的基线。我们进一步表明，我们的方法可以有效地扩展到生成高质量的统计图。总的来说，PaperBanana 为自动生成可供出版的插图铺平了道路。

Title: UPA: Unsupervised Prompt Agent via Tree-Based Search and Selection

Authors: Siran Peng, Weisong Zhao, Tianyu Fu, Chenxu Zhao, Tianshuo Zhang, Haoyuan Zhang, Xiangyu Zhu, Minghui Wu, Zhen Lei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.23273
Pdf URL: https://arxiv.org/pdf/2601.23273
Copy Paste: [[2601.23273]] UPA: Unsupervised Prompt Agent via Tree-Based Search and Selection(https://arxiv.org/abs/2601.23273)
Keywords: language model, llm, prompt, agent
Abstract: Prompt agents have recently emerged as a promising paradigm for automated prompt optimization, framing refinement as a sequential decision-making problem over a structured prompt space. While this formulation enables the use of advanced planning algorithms, these methods typically assume access to supervised reward signals, which are often unavailable in practical scenarios. In this work, we propose UPA, an Unsupervised Prompt Agent that realizes structured search and selection without relying on supervised feedback. Specifically, during search, UPA iteratively constructs an evolving tree structure to navigate the prompt space, guided by fine-grained and order-invariant pairwise comparisons from Large Language Models (LLMs). Crucially, as these local comparisons do not inherently yield a consistent global scale, we decouple systematic prompt exploration from final selection, introducing a two-stage framework grounded in the Bradley-Terry-Luce (BTL) model. This framework first performs path-wise Bayesian aggregation of local comparisons to filter candidates under uncertainty, followed by global tournament-style comparisons to infer latent prompt quality and identify the optimal prompt. Experiments across multiple tasks demonstrate that UPA consistently outperforms existing prompt optimization methods, showing that agent-style optimization remains highly effective even in fully unsupervised settings.
摘要：提示代理最近成为自动提示优化的有前途的范例，将细化框架作为结构化提示空间上的顺序决策问题。虽然这种公式可以使用先进的规划算法，但这些方法通常假设可以访问监督奖励信号，而这在实际场景中通常是不可用的。在这项工作中，我们提出了 UPA，一种无监督提示代理，可以在不依赖监督反馈的情况下实现结构化搜索和选择。具体来说，在搜索过程中，UPA 在大型语言模型 (LLM) 的细粒度和顺序不变的成对比较的指导下，迭代地构建一个不断发展的树结构来导航提示空间。至关重要的是，由于这些局部比较本身并不能产生一致的全球规模，因此我们将系统的即时探索与最终选择脱钩，引入了基于 Bradley-Terry-Luce (BTL) 模型的两阶段框架。该框架首先执行局部比较的路径贝叶斯聚合，以过滤不确定性下的候选者，然后进行全局锦标赛式比较，以推断潜在提示质量并识别最佳提示。跨多个任务的实验表明，UPA 始终优于现有的提示优化方法，这表明即使在完全无人监督的环境中，代理式优化仍然非常有效。
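Background sketch for the Bradley-Terry-Luce (BTL) model mentioned in the abstract above: under BTL, item i beats item j with probability s_i / (s_i + s_j), where s is a latent strength. The code below fits those strengths from pairwise outcomes with the classic minorization-maximization update. It is a plain maximum-likelihood illustration only; the paper's path-wise Bayesian aggregation and tournament procedure are not reproduced here, and the function name and sample comparisons are assumptions.

```python
from typing import Dict, List, Tuple


def fit_bradley_terry(items: List[str],
                      outcomes: List[Tuple[str, str]],
                      iters: int = 200) -> Dict[str, float]:
    """Estimate BTL strengths from (winner, loser) pairs.

    Assumes every item wins at least once; otherwise its strength
    collapses to zero under maximum likelihood.
    """
    strength = {x: 1.0 for x in items}
    win_count = {x: 0 for x in items}
    pair_count: Dict[frozenset, int] = {}
    for winner, loser in outcomes:
        win_count[winner] += 1
        key = frozenset((winner, loser))
        pair_count[key] = pair_count.get(key, 0) + 1

    for _ in range(iters):
        updated = {}
        for i in items:
            denom = 0.0
            for key, n in pair_count.items():
                if i in key:
                    (j,) = key - {i}
                    denom += n / (strength[i] + strength[j])
            updated[i] = win_count[i] / denom if denom > 0 else strength[i]
        total = sum(updated.values())
        strength = {k: v / total for k, v in updated.items()}  # normalize to sum 1
    return strength


# Hypothetical prompt candidates A, B, C with 3:1 win ratios per pair.
outcomes = ([("A", "B")] * 3 + [("B", "A")] +
            [("B", "C")] * 3 + [("C", "B")] +
            [("A", "C")] * 3 + [("C", "A")])
scores = fit_bradley_terry(["A", "B", "C"], outcomes)
best = max(scores, key=scores.get)
```

Fitting a global strength per candidate is what lets locally inconsistent pairwise judgments be turned into a single ranking, which is the role the abstract assigns to the BTL model.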