2025-12-17

Title: FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition

Authors: Jonas Golde, Patrick Haller, Alan Akbik
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.13884
Pdf URL: https://arxiv.org/pdf/2512.13884
Copy Paste: [[2512.13884]] FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition(https://arxiv.org/abs/2512.13884)
Keywords: language model, llm
Abstract: Recent multilingual named entity recognition (NER) work has shown that large language models (LLMs) can provide effective synthetic supervision, yet such datasets have mostly appeared as by-products of broader experiments rather than as systematic, reusable resources. We introduce FiNERweb, a dataset-creation pipeline that scales the teacher-student paradigm to 91 languages and 25 scripts. Building on FineWeb-Edu, our approach trains regression models to identify NER-relevant passages and annotates them with multilingual LLMs, resulting in about 225k passages with 235k distinct entity labels. Our experiments show that the regression model achieves more than 84 F1, and that models trained on FiNERweb obtain comparable or improved performance in zero shot transfer settings on English, Thai, and Swahili, despite being trained on 19x less data than strong baselines. In addition, we assess annotation quality using LLM-as-a-judge and observe consistently high scores for both faithfulness (3.99 out of 5) and completeness (4.05 out of 5), indicating reliable and informative annotations. Further, we release the dataset with both English labels and translated label sets in the respective target languages because we observe that the performance of current state-of-the-art models drops by 0.02 to 0.09 F1 when evaluated using target language labels instead of English ones. We release FiNERweb together with all accompanying artifacts to the research community in order to facilitate more effective student-teacher training for multilingual named entity recognition.

Title: Olmo 3

Authors: Team Olmo: Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee Chen, Michael Noukhovitch, Nathan Lambert, Pete Walsh, Pradeep Dasigi, Robert Berry, Saumya Malik, Saurabh Shah, Scott Geng, Shane Arora, Shashank Gupta, Taira Anderson, Teng Xiao, Tyler Murray, Tyler Romero, Victoria Graf, Akari Asai, Akshita Bhagia, Alexander Wettig, Alisa Liu, Aman Rangapur, Chloe Anastasiades, Costa Huang, Dustin Schwenk, Harsh Trivedi, Ian Magnusson, Jaron Lochner, Jiacheng Liu, Lester James V. Miranda, Maarten Sap, Malia Morgan, Michael Schmitz, Michal Guerquin, Michael Wilson, Regan Huff, Ronan Le Bras, Rui Xin, Rulin Shao, Sam Skjonsberg, Shannon Zejiang Shen, Shuyue Stella Li, Tucker Wilde, Valentina Pyatkin, Will Merrill, Yapei Chang, Yuling Gu, Zhiyuan Zeng, Ashish Sabharwal, Luke Zettlemoyer, Pang Wei Koh, Ali Farhadi, Noah A. Smith, Hannaneh Hajishirzi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2512.13961
Pdf URL: https://arxiv.org/pdf/2512.13961
Copy Paste: [[2512.13961]] Olmo 3(https://arxiv.org/abs/2512.13961)
Keywords: language model, chat
Abstract: We introduce Olmo 3, a family of state-of-the-art, fully-open language models at the 7B and 32B parameter scales. Olmo 3 model construction targets long-context reasoning, function calling, coding, instruction following, general chat, and knowledge recall. This release includes the entire model flow, i.e., the full lifecycle of the family of models, including every stage, checkpoint, data point, and dependency used to build it. Our flagship model, Olmo 3 Think 32B, is the strongest fully-open thinking model released to-date.

Title: Structure-Aware Decoding Mechanisms for Complex Entity Extraction with Large-Scale Language Models

Authors: Zhimin Qiu, Di Wu, Feng Liu, Chenrui Hu, Yuxiao Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.13980
Pdf URL: https://arxiv.org/pdf/2512.13980
Copy Paste: [[2512.13980]] Structure-Aware Decoding Mechanisms for Complex Entity Extraction with Large-Scale Language Models(https://arxiv.org/abs/2512.13980)
Keywords: language model
Abstract: This paper proposes a structure-aware decoding method based on large language models to address the difficulty of traditional approaches in maintaining both semantic integrity and structural consistency in nested and overlapping entity extraction tasks. The method introduces a candidate span generation mechanism and structured attention modeling to achieve unified modeling of entity boundaries, hierarchical relationships, and cross-dependencies. The model first uses a pretrained language model to obtain context-aware semantic representations, then captures multi-granular entity span features through candidate representation combinations, and introduces hierarchical structural constraints during decoding to ensure consistency between semantics and structure. To enhance stability in complex scenarios, the model jointly optimizes classification loss and structural consistency loss, maintaining high recognition accuracy under multi-entity co-occurrence and long-sentence dependency conditions. Experiments conducted on the ACE 2005 dataset demonstrate significant improvements in Accuracy, Precision, Recall, and F1-Score, particularly in nested and overlapping entity recognition, where the model shows stronger boundary localization and structural modeling capability. This study verifies the effectiveness of structure-aware decoding in complex semantic extraction tasks, provides a new perspective for developing language models with hierarchical understanding, and establishes a methodological foundation for high-precision information extraction.

Title: What Affects the Effective Depth of Large Language Models?

Authors: Yi Hu, Cai Zhou, Muhan Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.14064
Pdf URL: https://arxiv.org/pdf/2512.14064
Copy Paste: [[2512.14064]] What Affects the Effective Depth of Large Language Models?(https://arxiv.org/abs/2512.14064)
Keywords: language model, llm
Abstract: The scaling of large language models (LLMs) emphasizes increasing depth, yet performance gains diminish with added layers. Prior work introduces the concept of "effective depth", arguing that deeper models fail to fully utilize their layers for meaningful computation. Building on this, we systematically study how effective depth varies with model scale, training type, and task difficulty. First, we analyze the model behavior of the Qwen-2.5 family (1.5B-32B) and find that while the number of effective layers grows with model size, the effective depth ratio remains stable. Second, comparisons between base and corresponding long-CoT models show no increase in effective depth, suggesting that improved reasoning stems from longer context rather than deeper per-token computation. Furthermore, evaluations across tasks of varying difficulty indicate that models do not dynamically use more layers for harder problems. Our results suggest that current LLMs underuse available depth across scales, training paradigms, and tasks of varying difficulty, pointing to research opportunities in increasing the layer utilization rate of LLMs, model pruning, and early exiting. Our code is released.

Title: Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed

Authors: Yonggan Fu, Lexington Whalen, Zhifan Ye, Xin Dong, Shizhe Diao, Jingyu Liu, Chengyue Wu, Hao Zhang, Enze Xie, Song Han, Maksim Khadkevich, Jan Kautz, Yingyan Celine Lin, Pavlo Molchanov
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.14067
Pdf URL: https://arxiv.org/pdf/2512.14067
Copy Paste: [[2512.14067]] Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed(https://arxiv.org/abs/2512.14067)
Keywords: language model
Abstract: Diffusion language models (dLMs) have emerged as a promising paradigm that enables parallel, non-autoregressive generation, but their learning efficiency lags behind that of autoregressive (AR) language models when trained from scratch. To this end, we study AR-to-dLM conversion to transform pretrained AR models into efficient dLMs that excel in speed while preserving AR models' task accuracy. We achieve this by identifying limitations in the attention patterns and objectives of existing AR-to-dLM methods and then proposing principles and methodologies for more effective AR-to-dLM conversion. Specifically, we first systematically compare different attention patterns and find that maintaining pretrained AR weight distributions is critical for effective AR-to-dLM conversion. As such, we introduce a continuous pretraining scheme with a block-wise attention pattern, which remains causal across blocks while enabling bidirectional modeling within each block. We find that this approach can better preserve pretrained AR models' weight distributions than fully bidirectional modeling, in addition to its known benefit of enabling KV caching, and leads to a win-win in accuracy and efficiency. Second, to mitigate the training-test gap in mask token distributions (uniform vs. highly left-to-right), we propose a position-dependent token masking strategy that assigns higher masking probabilities to later tokens during training to better mimic test-time behavior. Leveraging this framework, we conduct extensive studies of dLMs' attention patterns, training dynamics, and other design choices, providing actionable insights into scalable AR-to-dLM conversion. These studies lead to the Efficient-DLM family, which outperforms state-of-the-art AR models and dLMs, e.g., our Efficient-DLM 8B achieves +5.4%/+2.7% higher accuracy with 4.5x/2.7x higher throughput compared to Dream 7B and Qwen3 4B, respectively.
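
The block-wise attention pattern described in this abstract (causal across blocks, bidirectional within each block) can be illustrated with a small mask builder. This is only a sketch of the general idea, not the authors' implementation: the function name `blockwise_attention_mask` and the plain boolean-list representation are assumptions made for illustration.

```python
def blockwise_attention_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Build a mask where mask[i][j] is True if query i may attend to key j.

    Illustrative assumption: position i sees every position in earlier
    blocks (causal across blocks) and every position in its own block,
    including later ones (bidirectional within the block).
    """
    if seq_len < 0 or block_size <= 0:
        raise ValueError("seq_len must be >= 0 and block_size must be > 0")
    mask = []
    for i in range(seq_len):
        # Exclusive end index of the block that contains position i.
        block_end = (i // block_size + 1) * block_size
        mask.append([j < block_end for j in range(seq_len)])
    return mask


if __name__ == "__main__":
    for row in blockwise_attention_mask(6, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

With `block_size=1` the mask reduces to the ordinary causal mask, and with `block_size >= seq_len` it becomes fully bidirectional, which is one way to see the pattern as an interpolation between the two.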

Title: A Unified Sparse Attention via Multi-Granularity Compression

Authors: Siran Liu, Zane Cao, Yongchao He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.14082
Pdf URL: https://arxiv.org/pdf/2512.14082
Copy Paste: [[2512.14082]] A Unified Sparse Attention via Multi-Granularity Compression(https://arxiv.org/abs/2512.14082)
Keywords: language model, llm
Abstract: Efficient long-context understanding and reasoning are increasingly vital for large language model (LLM) applications such as multi-turn dialogue and program analysis. However, the core self-attention mechanism scales quadratically with sequence length, creating a fundamental computational bottleneck. Existing sparse attention methods alleviate this issue but face trade-offs: training-based methods are costly and cannot be directly applied as acceleration plugins for other models, while inference-time methods often compromise efficiency or cross-modal generality. To address these limitations, we present UniSparse, a unified mechanism that introduces the notion of composite tokens, compact representations that aggregate multi-granularity contextual information. Building on this abstraction, UniSparse dynamically constructs sparse attention through multi-granularity compression and block-level selection, enabling efficient and hardware-friendly execution on GPU. Across multiple modalities and tasks ranging from synthetic benchmarks to real-world applications, UniSparse consistently surpasses state-of-the-art sparse attention methods (e.g., MInference, XAttention, FlexPrefill) in both accuracy and efficiency, achieving at least 99% of full-attention accuracy and up to 2.61x faster attention computation than FlashAttention.

Title: CogMem: A Cognitive Memory Architecture for Sustained Multi-Turn Reasoning in Large Language Models

Authors: Yiran Zhang, Jincheng Hu, Mark Dras, Usman Naseem
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.14118
Pdf URL: https://arxiv.org/pdf/2512.14118
Copy Paste: [[2512.14118]] CogMem: A Cognitive Memory Architecture for Sustained Multi-Turn Reasoning in Large Language Models(https://arxiv.org/abs/2512.14118)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) excel at single-turn reasoning but often lose accuracy and coherence over extended, multi-turn interactions. Recent evaluations such as TurnBench highlight recurring failure modes: reasoning bias, task drift, hallucination, overconfidence, and memory decay. Current approaches typically append full conversational histories, causing unbounded context growth, higher computational costs, and degraded reasoning efficiency. We introduce CogMem, a cognitively inspired, memory-augmented LLM architecture that supports sustained iterative reasoning through structured, persistent memory. CogMem incorporates three layers: a Long-Term Memory (LTM) that consolidates cross-session reasoning strategies; a Direct Access (DA) memory that maintains session-level notes and retrieves relevant long-term memories; and a Focus of Attention (FoA) mechanism that dynamically reconstructs concise, task-relevant context at each turn. Experiments on TurnBench show that this layered design mitigates reasoning failures, controls context growth, and improves consistency across extended reasoning chains, moving toward more reliable, human-like reasoning in LLMs.

Title: Astraea: A State-Aware Scheduling Engine for LLM-Powered Agents

Authors: Hongqiu Ni, Jiabao Zhang, Guopeng Li, Zilong Wang, Ruiqi Wu, Chi Zhang, Haisheng Tan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.14142
Pdf URL: https://arxiv.org/pdf/2512.14142
Copy Paste: [[2512.14142]] Astraea: A State-Aware Scheduling Engine for LLM-Powered Agents(https://arxiv.org/abs/2512.14142)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) are increasingly being deployed as intelligent agents. Their multi-stage workflows, which alternate between local computation and calls to external network services like Web APIs, introduce a mismatch between their execution pattern and the scheduling granularity of existing inference systems such as vLLM. Existing systems typically focus on per-segment optimization, which prevents them from minimizing the end-to-end latency of the complete agentic workflow, i.e., the global Job Completion Time (JCT) over the entire request lifecycle. To address this limitation, we propose Astraea, a service engine designed to shift the optimization from local segments to the global request lifecycle. Astraea employs a state-aware, hierarchical scheduling algorithm that integrates a request's historical state with future predictions. It dynamically classifies requests as I/O-intensive or compute-intensive and uses an enhanced HRRN policy to balance efficiency and fairness. Astraea also implements an adaptive KV cache manager that intelligently handles the agent state during I/O waits based on the system memory pressure. Extensive experiments show that Astraea reduces average JCT by up to 25.5% compared to baseline methods. Moreover, our approach demonstrates strong robustness and stability under high load across various model scales.
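
For readers unfamiliar with HRRN (Highest Response Ratio Next), the textbook policy picks the waiting job with the largest ratio (waiting time + service time) / service time, so short jobs are favored but long jobs gain priority as they wait. The sketch below shows only that classic rule; the abstract does not specify how Astraea's enhanced variant differs, and the `Request` fields and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical fields for illustration; not taken from the paper.
    name: str
    arrival_time: float
    estimated_service_time: float


def response_ratio(req: Request, now: float) -> float:
    """Classic HRRN ratio: (waiting + service) / service."""
    if req.estimated_service_time <= 0:
        raise ValueError("estimated_service_time must be positive")
    waiting = max(0.0, now - req.arrival_time)
    return (waiting + req.estimated_service_time) / req.estimated_service_time


def pick_next(queue: list[Request], now: float) -> Request:
    """Return the queued request with the highest response ratio."""
    if not queue:
        raise ValueError("queue is empty")
    return max(queue, key=lambda r: response_ratio(r, now))


if __name__ == "__main__":
    queue = [Request("long", 0.0, 10.0), Request("short", 4.0, 2.0)]
    print(pick_next(queue, now=6.0).name)
```

Because the ratio grows with waiting time, a long request that keeps being passed over eventually overtakes newly arrived short ones, which is the fairness property the policy is usually chosen for.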

Title: A Comparative Analysis of Retrieval-Augmented Generation Techniques for Bengali Standard-to-Dialect Machine Translation Using LLMs

Authors: K. M. Jubair Sami, Dipto Sumit, Ariyan Hossain, Farig Sadeque
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2512.14179
Pdf URL: https://arxiv.org/pdf/2512.14179
Copy Paste: [[2512.14179]] A Comparative Analysis of Retrieval-Augmented Generation Techniques for Bengali Standard-to-Dialect Machine Translation Using LLMs(https://arxiv.org/abs/2512.14179)
Keywords: gpt, llm, retrieval-augmented generation
Abstract: Translating from a standard language to its regional dialects is a significant NLP challenge due to scarce data and linguistic variation, a problem prominent in the Bengali language. This paper proposes and compares two novel RAG pipelines for standard-to-dialectal Bengali translation. The first, a Transcript-Based Pipeline, uses large dialect sentence contexts from audio transcripts. The second, a more effective Standardized Sentence-Pairs Pipeline, utilizes structured local_dialect:standard_bengali sentence pairs. We evaluated both pipelines across six Bengali dialects and multiple LLMs using BLEU, ChrF, WER, and BERTScore. Our findings show that the sentence-pair pipeline consistently outperforms the transcript-based one, reducing Word Error Rate (WER) from 76% to 55% for the Chittagong dialect. Critically, this RAG approach enables smaller models (e.g., Llama-3.1-8B) to outperform much larger models (e.g., GPT-OSS-120B), demonstrating that a well-designed retrieval strategy can be more crucial than model size. This work contributes an effective, fine-tuning-free solution for low-resource dialect translation, offering a practical blueprint for preserving linguistic diversity.
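
Word Error Rate, one of the metrics named in this abstract, is conventionally the word-level edit distance between hypothesis and reference divided by the number of reference words. A minimal sketch of that standard definition follows; it assumes whitespace tokenization, which may differ from the paper's evaluation setup, and the function name is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference. On this scale, the reported drop from 76% to 55% corresponds to 21 percentage points.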

Title: Ladder Up, Memory Down: Low-Cost Fine-Tuning With Side Nets

Authors: Estelle Zheng (LORIA, ALE), Nathan Cerisara (LORIA), Sébastien Warichet (ALE), Emmanuel Helbert (ALE), Christophe Cerisara (SYNALP, LORIA)
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2512.14237
Pdf URL: https://arxiv.org/pdf/2512.14237
Copy Paste: [[2512.14237]] Ladder Up, Memory Down: Low-Cost Fine-Tuning With Side Nets(https://arxiv.org/abs/2512.14237)
Keywords: language model, llm, chain-of-thought
Abstract: Fine-tuning large language models (LLMs) is often limited by the memory available on commodity GPUs. Parameter-efficient fine-tuning (PEFT) methods such as QLoRA reduce the number of trainable parameters, yet still incur high memory usage induced by the backward pass in the full model. We revisit Ladder Side Tuning (LST), a rarely explored PEFT technique that adds a lightweight side network, and show that it matches QLoRA's compute scaling slope while cutting peak memory by 50%. Across different downstream benchmarks spanning natural language understanding, mathematical and LLM-critic tasks, LST is competitive with QLoRA in average accuracy while being much more memory-efficient. This efficiency enables fine-tuning of 7B-parameter models on a single 12 GB consumer GPU with 2k-token contexts, requiring no gradient checkpointing, conditions under which QLoRA exhausts memory. Beyond memory efficiency, we also establish scaling laws showing that LST scales similarly to QLoRA. We exploit Ladder's architectural flexibility by introducing xLadder, a depth-extended variant that increases effective depth via cross-connections and shortens chain-of-thought (CoT) at fixed parameter count. Ladder is strong when memory is the bottleneck; xLadder builds on this by enabling deeper reasoning without additional memory overhead.

Title: Two CFG Nahuatl for automatic corpora expansion

Authors: Juan-José Guzmán-Landa, Juan-Manuel Torres-Moreno, Miguel Figueroa-Saavedra, Ligia Quintana-Torres, Graham Ranger, Martha-Lorena Avendaño-Garrido
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.14239
Pdf URL: https://arxiv.org/pdf/2512.14239
Copy Paste: [[2512.14239]] Two CFG Nahuatl for automatic corpora expansion(https://arxiv.org/abs/2512.14239)
Keywords: language model, llm
Abstract: The aim of this article is to introduce two Context-Free Grammars (CFG) for Nawatl corpora expansion. Nawatl is an Amerindian language (it is a National Language of Mexico) of the $\pi$-language type, i.e. a language with few digital resources. For this reason the corpora available for the learning of Large Language Models (LLMs) are virtually non-existent, posing a significant challenge. The goal is to produce a substantial number of syntactically valid artificial Nawatl sentences and thereby to expand the corpora for the purpose of learning non-contextual embeddings. For this objective, we introduce two new Nawatl CFGs and use them in generative mode. Using these grammars, it is possible to expand the Nawatl corpus significantly and subsequently to use it to learn embeddings and to evaluate their relevance in a sentence semantic similarity task. The results show an improvement compared to the results obtained using only the original corpus without artificial expansion, and also demonstrate that economic embeddings often perform better than some LLMs.
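
Using a context-free grammar "in generative mode" means expanding the start symbol through its production rules until only terminals remain. The sketch below enumerates every sentence of a tiny non-recursive grammar. The grammar is a made-up placeholder with dummy terminals (`noun_a`, `verb_x`, and so on), not one of the paper's Nawatl grammars, and the function name `expand` is an assumption.

```python
# Toy grammar: keys are non-terminals, values are lists of productions.
# Any symbol that is not a key is treated as a terminal.
TOY_GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["noun_a"], ["noun_b"]],
    "VP": [["verb_x"], ["verb_y", "NP"]],
}


def expand(symbol: str, grammar: dict[str, list[list[str]]]) -> list[list[str]]:
    """Return every terminal sequence derivable from `symbol`.

    Assumes the grammar is non-recursive; a recursive grammar would need
    a depth limit or random sampling instead of full enumeration.
    """
    if symbol not in grammar:
        return [[symbol]]  # terminal: a single one-word sequence
    sentences = []
    for production in grammar[symbol]:
        partials = [[]]
        for part in production:
            # Append every expansion of `part` to every partial sequence.
            partials = [p + tail for p in partials for tail in expand(part, grammar)]
        sentences.extend(partials)
    return sentences


if __name__ == "__main__":
    for words in expand("S", TOY_GRAMMAR):
        print(" ".join(words))
```

Even this three-rule grammar yields six distinct sentences, which gives a sense of how a modest rule set can multiply the size of a small corpus.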

Title: From Context to EDUs: Faithful and Structured Context Compression via Elementary Discourse Unit Decomposition

Authors: Yiqing Zhou, Yu Lei, Shuzheng Si, Qingyan Sun, Wei Wang, Yifei Wu, Hao Wen, Gang Chen, Fanchao Qi, Maosong Sun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.14244
Pdf URL: https://arxiv.org/pdf/2512.14244
Copy Paste: [[2512.14244]] From Context to EDUs: Faithful and Structured Context Compression via Elementary Discourse Unit Decomposition(https://arxiv.org/abs/2512.14244)
Keywords: language model, llm, hallucination, agent
Abstract: Managing extensive context remains a critical bottleneck for Large Language Models (LLMs), particularly in applications like long-document question answering and autonomous agents where lengthy inputs incur high computational costs and introduce noise. Existing compression techniques often disrupt local coherence through discrete token removal or rely on implicit latent encoding that suffers from positional bias and incompatibility with closed-source APIs. To address these limitations, we introduce the EDU-based Context Compressor, a novel explicit compression framework designed to preserve both global structure and fine-grained details. Our approach reformulates context compression as a structure-then-select process. First, our LingoEDU transforms linear text into a structural relation tree of Elementary Discourse Units (EDUs) which are anchored strictly to source indices to eliminate hallucination. Second, a lightweight ranking module selects query-relevant sub-trees for linearization. To rigorously evaluate structural understanding, we release StructBench, a manually annotated dataset of 248 diverse documents. Empirical results demonstrate that our method achieves state-of-the-art structural prediction accuracy and significantly outperforms frontier LLMs while reducing costs. Furthermore, our structure-aware compression substantially enhances performance across downstream tasks ranging from long-context tasks to complex Deep Search scenarios.

Title: Inflation Attitudes of Large Language Models

Authors: Nikoleta Anesti, Edward Hill, Andreas Joseph
Subjects: cs.CL, econ.EM
Abstract URL: https://arxiv.org/abs/2512.14306
Pdf URL: https://arxiv.org/pdf/2512.14306
Copy Paste: [[2512.14306]] Inflation Attitudes of Large Language Models(https://arxiv.org/abs/2512.14306)
Keywords: language model, gpt, llm, prompt
Abstract: This paper investigates the ability of Large Language Models (LLMs), specifically GPT-3.5-turbo (GPT), to form inflation perceptions and expectations based on macroeconomic price signals. We compare the LLM's output to household survey data and official statistics, mimicking the information set and demographic characteristics of the Bank of England's Inflation Attitudes Survey (IAS). Our quasi-experimental design exploits the timing of GPT's training cut-off in September 2021 which means it has no knowledge of the subsequent UK inflation surge. We find that GPT tracks aggregate survey projections and official statistics at short horizons. At a disaggregated level, GPT replicates key empirical regularities of households' inflation perceptions, particularly for income, housing tenure, and social class. A novel Shapley value decomposition of LLM outputs suited for the synthetic survey setting provides well-defined insights into the drivers of model outputs linked to prompt content. We find that GPT demonstrates a heightened sensitivity to food inflation information similar to that of human respondents. However, we also find that it lacks a consistent model of consumer price inflation. More generally, our approach could be used to evaluate the behaviour of LLMs for use in the social sciences, to compare different models, or to assist in survey design.
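
A Shapley value decomposition attributes a model output to individual inputs by averaging each input's marginal contribution over all subsets of the other inputs. The sketch below computes exact Shapley values for a toy value function. The prompt components (`food_prices`, `energy_prices`) and their numeric effects are invented for illustration, and the abstract does not describe the paper's own decomposition in enough detail to reproduce it.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values; `value` maps a frozenset of players to a number.

    Cost grows exponentially with the number of players, so this is only
    practical for a handful of components.
    """
    n = len(players)
    result = {}
    for player in players:
        others = [p for p in players if p != player]
        total = 0.0
        for k in range(n):
            # Weight of coalitions of size k that exclude `player`.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                coalition = frozenset(subset)
                total += weight * (value(coalition | {player}) - value(coalition))
        result[player] = total
    return result


def toy_value(coalition):
    """Hypothetical model output given which prompt components are present."""
    output = 0.0
    if "food_prices" in coalition:
        output += 1.0
    if "energy_prices" in coalition:
        output += 0.5
    if {"food_prices", "energy_prices"} <= coalition:
        output += 0.2  # interaction effect, split evenly by Shapley
    return output


if __name__ == "__main__":
    print(shapley_values(["food_prices", "energy_prices"], toy_value))
```

The attributions always sum to the difference between the output with all components present and the output with none, which is what makes the decomposition easy to read in a survey-style setting.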

Title: Effect of Document Packing on the Latent Multi-Hop Reasoning Capabilities of Large Language Models

Authors: Gabriele Prato, Shagun Sodhani, Alessandro Sordoni, Sarath Chandar
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.14427
Pdf URL: https://arxiv.org/pdf/2512.14427
Copy Paste: [[2512.14427]] Effect of Document Packing on the Latent Multi-Hop Reasoning Capabilities of Large Language Models(https://arxiv.org/abs/2512.14427)
Keywords: language model, llm
Abstract: The standard practice for training large language models involves packing multiple documents together to optimize computational efficiency. However, the impact of this process on the models' capabilities remains largely unexplored. To address this gap, we investigate how different document-packing strategies influence the latent multi-hop reasoning abilities of LLMs. Our findings indicate that packing can improve model performance compared to training on individual documents, at the expense of more compute. To further understand the underlying mechanisms, we conduct an ablation study, identifying key factors that explain the advantages of packing. Ultimately, our research deepens the understanding of LLM training dynamics and provides practical insights for optimizing model development.
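
Document packing, in its most common form, concatenates tokenized documents with a separator token and cuts the resulting stream into fixed-length training sequences. The sketch below shows that baseline only; the abstract does not say which packing strategies the paper compares, and the function name, separator id, and token ids are placeholders.

```python
def pack_documents(docs: list[list[int]], seq_len: int, sep_token: int) -> list[list[int]]:
    """Concatenate token lists with a separator, then chunk into seq_len pieces.

    The final chunk may be shorter than seq_len; real pipelines usually
    pad or drop it. Documents can be split across chunk boundaries.
    """
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    stream: list[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(sep_token)  # mark the document boundary
    return [stream[i:i + seq_len] for i in range(0, len(stream), seq_len)]


if __name__ == "__main__":
    # Three short "documents" packed into sequences of four tokens.
    print(pack_documents([[1, 2, 3], [4, 5], [6]], seq_len=4, sep_token=0))
```

In the example, the second sequence mixes tokens from two unrelated documents, which is the kind of cross-document context the paper's question about packing is concerned with.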

Title: SASQ: Static Activation Scaling for Quantization-Aware Training in Large Language Models

Authors: Shizhuo Mao, Song Chen, Yi Kang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.14481
Pdf URL: https://arxiv.org/pdf/2512.14481
Copy Paste: [[2512.14481]] SASQ: Static Activation Scaling for Quantization-Aware Training in Large Language Models(https://arxiv.org/abs/2512.14481)
Keywords: language model, llm
Abstract: Large language models (LLMs) excel at natural language tasks but face deployment challenges due to their growing size outpacing GPU memory advancements. Model quantization mitigates this issue by lowering weight and activation precision, but existing solutions face fundamental trade-offs: dynamic quantization incurs high computational overhead and poses deployment challenges on edge devices, while static quantization sacrifices accuracy. Existing quantization-aware training (QAT) approaches additionally incur the cost of training the weights. We propose SASQ: a lightweight QAT framework specifically tailored for activation quantization factors. SASQ optimizes only the quantization factors (without changing pre-trained weights), enabling static inference with high accuracy while maintaining deployment efficiency. SASQ adaptively truncates some outliers, thereby reducing the difficulty of quantization while preserving the distributional characteristics of the activations. SASQ not only surpasses existing SOTA quantization schemes but also outperforms the corresponding FP16 models. On LLaMA2-7B, it achieves 5.2% lower perplexity than QuaRot and 4.7% lower perplexity than the FP16 model on WikiText2.
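Note: the sketch below illustrates only the general idea of static activation quantization with a fixed scale, where values beyond the representable range are truncated. It does not reproduce how SASQ learns its quantization factors. The symmetric signed 8-bit scheme, the function names, and the example scale of 0.01 are all assumptions made for illustration.

```python
from typing import List


def static_quantize(x: List[float], scale: float, n_bits: int = 8) -> List[int]:
    """Quantize activations with a fixed (static) scale; out-of-range values are clipped."""
    qmax = 2 ** (n_bits - 1) - 1   # 127 for 8 bits
    qmin = -qmax - 1               # -128 for 8 bits
    out = []
    for v in x:
        q = round(v / scale)
        out.append(max(qmin, min(qmax, q)))  # truncation of outliers happens here
    return out


def dequantize(q: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate real values."""
    return [v * scale for v in q]


acts = [0.1, -0.5, 0.25, 12.0]   # the last value is an outlier
scale = 0.01                     # hypothetical scale, fixed before inference
q = static_quantize(acts, scale)
print(q)  # [10, -50, 25, 127]
```

Because the scale is fixed ahead of time, no per-batch statistics are needed at inference; the trade-off is that the outlier 12.0 is represented as roughly 1.27 after dequantization.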

Title: C-ing Clearly: Enhanced Binary Code Explanations using C code

Authors: Teodor Poncu, Ioana Pintilie, Marius Dragoi, Dragos Tantaru, Florin Brad
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2512.14500
Pdf URL: https://arxiv.org/pdf/2512.14500
Copy Paste: [[2512.14500]] C-ing Clearly: Enhanced Binary Code Explanations using C code(https://arxiv.org/abs/2512.14500)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) typically excel at coding tasks involving high-level programming languages, as opposed to lower-level programming languages, such as assembly. We propose a synthetic data generation method named C-ing Clearly, which leverages the corresponding C code to enhance an LLM's understanding of assembly. By fine-tuning on data generated through our method, we demonstrate improved LLM performance for binary code summarization and vulnerability detection. Our approach demonstrates consistent gains across different LLM families and model sizes.

Title: Linguists should learn to love speech-based deep learning models

Authors: Marianne de Heer Kloots, Paul Boersma, Willem Zuidema
Subjects: cs.CL, cs.SD, eess.AS, q-bio.NC
Abstract URL: https://arxiv.org/abs/2512.14506
Pdf URL: https://arxiv.org/pdf/2512.14506
Copy Paste: [[2512.14506]] Linguists should learn to love speech-based deep learning models(https://arxiv.org/abs/2512.14506)
Keywords: llm
Abstract: Futrell and Mahowald present a useful framework bridging technology-oriented deep learning systems and explanation-oriented linguistic theories. Unfortunately, the target article's focus on generative text-based LLMs fundamentally limits fruitful interactions with linguistics, as many interesting questions on human language fall outside what is captured by written text. We argue that audio-based deep learning models can and should play a crucial role.

Title: VersatileFFN: Achieving Parameter Efficiency in LLMs via Adaptive Wide-and-Deep Reuse

Authors: Ying Nie, Kai Han, Hongguang Li, Hang Zhou, Tianyu Guo, Enhua Wu, Xinghao Chen, Yunhe Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.14531
Pdf URL: https://arxiv.org/pdf/2512.14531
Copy Paste: [[2512.14531]] VersatileFFN: Achieving Parameter Efficiency in LLMs via Adaptive Wide-and-Deep Reuse(https://arxiv.org/abs/2512.14531)
Keywords: language model, llm
Abstract: The rapid scaling of Large Language Models (LLMs) has achieved remarkable performance, but it also leads to prohibitive memory costs. Existing parameter-efficient approaches such as pruning and quantization mainly compress pretrained models without enhancing architectural capacity, thereby hitting the representational ceiling of the base model. In this work, we propose VersatileFFN, a novel feed-forward network (FFN) that enables flexible reuse of parameters in both width and depth dimensions within a fixed parameter budget. Inspired by the dual-process theory of cognition, VersatileFFN comprises two adaptive pathways: a width-versatile path that generates a mixture of sub-experts from a single shared FFN, mimicking sparse expert routing without increasing parameters, and a depth-versatile path that recursively applies the same FFN to emulate deeper processing for complex tokens. A difficulty-aware gating dynamically balances the two pathways, steering "easy" tokens through the efficient width-wise route and allocating deeper iterative refinement to "hard" tokens. Crucially, both pathways reuse the same parameters, so all additional capacity comes from computation rather than memory. Experiments across diverse benchmarks and model scales demonstrate the effectiveness of the method. The code will be available at this https URL.

Title: Dual Language Models: Balancing Training Efficiency and Overfitting Resilience

Authors: David Samuel, Lucas Georges Gabriel Charpentier
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.14549
Pdf URL: https://arxiv.org/pdf/2512.14549
Copy Paste: [[2512.14549]] Dual Language Models: Balancing Training Efficiency and Overfitting Resilience(https://arxiv.org/abs/2512.14549)
Keywords: language model
Abstract: This paper combines autoregressive and masked-diffusion training objectives without any architectural modifications, resulting in flexible language models that outperform single-objective models. Autoregressive modeling has been a popular approach, partly because of its training efficiency; however, that comes at the cost of sensitivity to overfitting. On the other hand, masked-diffusion models are less efficient to train while being more resilient to overfitting. In this work, we demonstrate that dual-objective training achieves the best of both worlds. To derive the optimal ratio between both objectives, we train and evaluate 50 language models under varying levels of data repetition. We show that it is optimal to combine both objectives under all evaluated settings and that the optimal ratio is similar whether targeting autoregressive or masked-diffusion downstream performance.

Title: VLegal-Bench: Cognitively Grounded Benchmark for Vietnamese Legal Reasoning of Large Language Models

Authors: Nguyen Tien Dong, Minh-Anh Nguyen, Thanh Dat Hoang, Nguyen Tuan Ngoc, Dao Xuan Quang Minh, Phan Phi Hai, Nguyen Thi Ngoc Anh, Dang Van Tu, Binh Vu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.14554
Pdf URL: https://arxiv.org/pdf/2512.14554
Copy Paste: [[2512.14554]] VLegal-Bench: Cognitively Grounded Benchmark for Vietnamese Legal Reasoning of Large Language Models(https://arxiv.org/abs/2512.14554)
Keywords: language model, llm, retrieval-augmented generation
Abstract: The rapid advancement of large language models (LLMs) has enabled new possibilities for applying artificial intelligence within the legal domain. Nonetheless, the complexity, hierarchical organization, and frequent revisions of Vietnamese legislation pose considerable challenges for evaluating how well these models interpret and utilize legal knowledge. To address this gap, we introduce the Vietnamese Legal Benchmark (VLegal-Bench), the first comprehensive benchmark designed to systematically assess LLMs on Vietnamese legal tasks. Informed by Bloom's cognitive taxonomy, VLegal-Bench encompasses multiple levels of legal understanding through tasks designed to reflect practical usage scenarios. The benchmark comprises 10,450 samples generated through a rigorous annotation pipeline, where legal experts label and cross-validate each instance using our annotation system to ensure every sample is grounded in authoritative legal documents and mirrors real-world legal assistant workflows, including general legal questions and answers, retrieval-augmented generation, multi-step reasoning, and scenario-based problem solving tailored to Vietnamese law. By providing a standardized, transparent, and cognitively informed evaluation framework, VLegal-Bench establishes a solid foundation for assessing LLM performance in Vietnamese legal contexts and supports the development of more reliable, interpretable, and ethically aligned AI-assisted legal systems.

Title: Agreement Between Large Language Models and Human Raters in Essay Scoring: A Research Synthesis

Authors: Hongli Li, Che Han Chen, Kevin Fan, Chiho Young-Johnson, Soyoung Lim, Yali Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.14561
Pdf URL: https://arxiv.org/pdf/2512.14561
Copy Paste: [[2512.14561]] Agreement Between Large Language Models and Human Raters in Essay Scoring: A Research Synthesis(https://arxiv.org/abs/2512.14561)
Keywords: language model, llm
Abstract: Despite the growing promise of large language models (LLMs) in automatic essay scoring (AES), empirical findings regarding their reliability compared to human raters remain mixed. Following the PRISMA 2020 guidelines, we synthesized 65 published and unpublished studies from January 2022 to August 2025 that examined agreement between LLMs and human raters in AES. Across studies, reported LLM-human agreement was generally moderate to good, with agreement indices (e.g., Quadratic Weighted Kappa, Pearson correlation, and Spearman's rho) mostly ranging between 0.30 and 0.80. Substantial variability in agreement levels was observed across studies, reflecting differences in study-specific factors as well as the lack of standardized reporting practices. Implications and directions for future research are discussed.
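Note: Quadratic Weighted Kappa (QWK), one of the agreement indices named in the abstract, can be computed from two raters' integer scores as sketched below. This is the standard textbook definition written in plain Python; the function name and the toy scores are illustrative and are not taken from the synthesized studies.

```python
from typing import Sequence


def quadratic_weighted_kappa(a: Sequence[int], b: Sequence[int],
                             min_score: int, max_score: int) -> float:
    """QWK between two raters; assumes at least two score levels and non-constant ratings."""
    n = max_score - min_score + 1
    total = len(a)
    # Observed confusion matrix between rater a (rows) and rater b (columns).
    observed = [[0.0] * n for _ in range(n)]
    for x, y in zip(a, b):
        observed[x - min_score][y - min_score] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]
    num = 0.0
    den = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2       # quadratic disagreement penalty
            expected = hist_a[i] * hist_b[j] / total   # agreement expected by chance
            num += weight * observed[i][j]
            den += weight * expected
    return 1.0 - num / den


human = [1, 2, 3, 4]
print(quadratic_weighted_kappa(human, [1, 2, 3, 4], 1, 4))  # 1.0 (perfect agreement)
```

On this scale, 1 means perfect agreement and 0 means chance-level agreement, so the 0.30 to 0.80 range reported in the abstract sits between those two extremes.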

Title: Polypersona: Persona-Grounded LLM for Synthetic Survey Responses

Authors: Tejaswani Dash, Dinesh Karri, Anudeep Vurity, Gautam Datla, Tazeem Ahmad, Saima Rafi, Rohith Tangudu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.14562
Pdf URL: https://arxiv.org/pdf/2512.14562
Copy Paste: [[2512.14562]] Polypersona: Persona-Grounded LLM for Synthetic Survey Responses(https://arxiv.org/abs/2512.14562)
Keywords: language model, llm, chat
Abstract: This paper introduces PolyPersona, a generative framework for synthesizing persona-conditioned survey responses across multiple domains. The framework instruction-tunes compact chat models using parameter-efficient LoRA adapters with 4-bit quantization under a resource-adaptive training setup. A dialogue-based data pipeline explicitly preserves persona cues, ensuring consistent behavioral alignment across generated responses. Using this pipeline, we construct a dataset of 3,568 synthetic survey responses spanning ten domains and 433 distinct personas, enabling controlled instruction tuning and systematic multi-domain evaluation. We evaluate the generated responses using a multi-metric evaluation suite that combines standard text generation metrics, including BLEU, ROUGE, and BERTScore, with survey-specific metrics designed to assess structural coherence, stylistic consistency, and sentiment. The results show that compact models such as TinyLlama 1.1B and Phi-2 achieve performance comparable to larger 7B to 8B baselines, with a highest BLEU score of 0.090 and ROUGE-1 of 0.429. These findings demonstrate that persona-conditioned fine-tuning enables small language models to generate reliable and coherent synthetic survey data. The proposed framework provides an efficient and reproducible approach for survey data generation, supporting scalable evaluation while facilitating bias analysis through transparent and open protocols.

Title: Towards Nepali-language LLMs: Efficient GPT training with a Nepali BPE tokenizer

Authors: Adarsha Shrestha, Basanta Pokharel, Binit Shrestha, Smriti Adhikari, Dinesh Gothe
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.14585
Pdf URL: https://arxiv.org/pdf/2512.14585
Copy Paste: [[2512.14585]] Towards Nepali-language LLMs: Efficient GPT training with a Nepali BPE tokenizer(https://arxiv.org/abs/2512.14585)
Keywords: language model, gpt, llm
Abstract: Nepali, a low-resource language spoken by over 32 million people, continues to face challenges in natural language processing (NLP) due to its complex grammar, agglutinative morphology, and limited availability of high-quality corpora. Most efforts to date have centered on basic encoder architectures; they remain insufficient for Nepali-specific text generation. This study presents a GPT-2-based Nepali language model trained using several training strategies inspired by GPT-3, including optimized learning rate schedules, batch scaling, and architectural refinements. A custom 16k Byte-Pair Encoding (BPE) tokenizer was trained exclusively on Nepali text to ensure more consistent segmentation and improved input representation. The model was pretrained on a combined dataset comprising a 10.75GB cleaned NepBERTa corpus and additional web-scraped Nepali news articles. FlashAttention was integrated to reduce memory usage and stabilize training. After two epochs, the model achieved a training loss of 3.168177, a validation loss of 3.081982, and a final perplexity of 21.80, demonstrating its capability to generate coherent Nepali news-style text.
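Note: the reported perplexity is consistent with the reported validation loss under the usual relation perplexity = exp(loss). The check below assumes the loss is a mean per-token cross-entropy in nats and that the perplexity was derived from the validation loss; the abstract does not state either point explicitly.

```python
import math

val_loss = 3.081982          # validation loss reported in the abstract
ppl = math.exp(val_loss)     # perplexity = exp(mean cross-entropy in nats)
print(round(ppl, 2))         # 21.8
```

Note that this is a per-token perplexity under the model's own 16k BPE vocabulary, so it is not directly comparable to models that use a different tokenizer.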

Title: JMMMU-Pro: Image-based Japanese Multi-discipline Multimodal Understanding Benchmark via Vibe Benchmark Construction

Authors: Atsuyuki Miyai, Shota Onohara, Jeonghun Baek, Kiyoharu Aizawa
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2512.14620
Pdf URL: https://arxiv.org/pdf/2512.14620
Copy Paste: [[2512.14620]] JMMMU-Pro: Image-based Japanese Multi-discipline Multimodal Understanding Benchmark via Vibe Benchmark Construction(https://arxiv.org/abs/2512.14620)
Keywords: prompt
Abstract: This paper introduces JMMMU-Pro, an image-based Japanese Multi-discipline Multimodal Understanding Benchmark, and Vibe Benchmark Construction, a scalable construction method. Following the evolution from MMMU to MMMU-Pro, JMMMU-Pro extends JMMMU by composing the question image and question text into a single image, thereby creating a benchmark that requires integrated visual-textual understanding through visual perception. To build JMMMU-Pro, we propose Vibe Benchmark Construction, a methodology in which an image generative model (e.g., Nano Banana Pro) produces candidate visual questions, and humans verify the outputs and, when necessary, regenerate with adjusted prompts to ensure quality. By leveraging Nano Banana Pro's highly realistic image generation capabilities and its ability to embed clean Japanese text, we construct a high-quality benchmark at low cost, covering a wide range of background and layout designs. Experimental results show that all open-source LMMs struggle substantially with JMMMU-Pro, underscoring JMMMU-Pro as an important benchmark for guiding future efforts in the open-source community. We believe that JMMMU-Pro provides a more rigorous evaluation tool for assessing the Japanese capabilities of LMMs and that our Vibe Benchmark Construction also offers an efficient guideline for future development of image-based VQA benchmarks.

Title: TiME: Tiny Monolingual Encoders for Efficient NLP Pipelines

Authors: David Schulmeister, Valentin Hartmann, Lars Klein, Robert West
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2512.14645
Pdf URL: https://arxiv.org/pdf/2512.14645
Copy Paste: [[2512.14645]] TiME: Tiny Monolingual Encoders for Efficient NLP Pipelines(https://arxiv.org/abs/2512.14645)
Keywords: language model
Abstract: Today, a lot of research on language models is focused on large, general-purpose models. However, many NLP pipelines only require models with a well-defined, small set of capabilities. While large models are capable of performing the tasks of those smaller models, they are simply not fast enough to process large amounts of data or offer real-time responses. Furthermore, they often use unnecessarily large amounts of energy, leading to sustainability concerns and problems when deploying them on battery-powered devices. In our work, we show how to train small models for such efficiency-critical applications. As opposed to many off-the-shelf NLP pipelines, our models use modern training techniques such as distillation, and offer support for low-resource languages. We call our models TiME (Tiny Monolingual Encoders) and comprehensively evaluate them on a range of common NLP tasks, observing an improved trade-off between benchmark performance on one hand, and throughput, latency and energy consumption on the other. Along the way, we show that distilling monolingual models from multilingual teachers is possible, and likewise distilling models with absolute positional embeddings from teachers with relative positional embeddings.

Title: Fast and Accurate Causal Parallel Decoding using Jacobi Forcing

Authors: Lanxiang Hu, Siqi Kou, Yichao Fu, Samyam Rajbhandari, Tajana Rosing, Yuxiong He, Zhijie Deng, Hao Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.14681
Pdf URL: https://arxiv.org/pdf/2512.14681
Copy Paste: [[2512.14681]] Fast and Accurate Causal Parallel Decoding using Jacobi Forcing(https://arxiv.org/abs/2512.14681)
Keywords: language model, llm
Abstract: Multi-token generation has emerged as a promising paradigm for accelerating transformer-based large model inference. Recent efforts primarily explore diffusion Large Language Models (dLLMs) for parallel decoding to reduce inference latency. To achieve AR-level generation quality, many techniques adapt AR models into dLLMs to enable parallel decoding. However, they suffer from limited speedup compared to AR models due to a pretrain-to-posttrain mismatch. Specifically, the masked data distribution in post-training deviates significantly from the real-world data distribution seen during pretraining, and dLLMs rely on bidirectional attention, which conflicts with the causal prior learned during pretraining and hinders the integration of exact KV cache reuse. To address this, we introduce Jacobi Forcing, a progressive distillation paradigm where models are trained on their own generated parallel decoding trajectories, smoothly shifting AR models into efficient parallel decoders while preserving their pretrained causal inference property. The models trained under this paradigm, Jacobi Forcing Models, achieve 3.8x wall-clock speedup on coding and math benchmarks with minimal loss in performance. Based on Jacobi Forcing Models' trajectory characteristics, we introduce multi-block decoding with rejection recycling, which enables up to 4.5x higher token acceptance count per iteration and nearly 4.0x wall-clock speedup, effectively trading additional compute for lower inference latency. Our code is available at this https URL.
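Note: assuming the parallel decoding referred to here is Jacobi-style fixed-point iteration, the toy sketch below shows the basic mechanism: guess a whole block of tokens, refine every position in parallel from the previous guess, and stop when the block no longer changes. The `next_token` function is a hypothetical deterministic stand-in for greedy decoding with a causal model. The sketch does not cover the Jacobi Forcing training procedure or multi-block decoding with rejection recycling.

```python
from typing import List, Tuple


def next_token(prefix: List[int]) -> int:
    """Hypothetical stand-in for greedy next-token prediction (needs >= 2 tokens)."""
    return (prefix[-1] * 3 + prefix[-2] + 1) % 11


def ar_decode(prompt: List[int], n: int) -> List[int]:
    """Standard autoregressive decoding: one token per step."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prompt):]


def jacobi_decode(prompt: List[int], n: int, init_token: int = 0) -> Tuple[List[int], int]:
    """Refine an n-token guess until it is a fixed point of greedy decoding."""
    guess = [init_token] * n
    iterations = 0
    while True:
        iterations += 1
        # With a real model, this whole list comes from one parallel forward pass.
        new = [next_token(list(prompt) + guess[:i]) for i in range(n)]
        if new == guess:          # fixed point: identical to the AR output
            return guess, iterations
        guess = new


prompt = [2, 5]
tokens, iters = jacobi_decode(prompt, n=6)
assert tokens == ar_decode(prompt, 6)
```

Each iteration fixes at least one more leading token, so the loop ends within n + 1 iterations. Any speedup depends on how often several tokens become correct in the same iteration, which this toy model does not attempt to demonstrate.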

Title: Spoken DialogSum: An Emotion-Rich Conversational Dataset for Spoken Dialogue Summarization

Authors: Yen-Ju Lu, Kunxiao Gao, Mingrui Liang, Helin Wang, Thomas Thebaud, Laureano Moro-Velazquez, Najim Dehak, Jesus Villalba
Subjects: cs.CL, cs.AI, cs.LG, eess.AS
Abstract URL: https://arxiv.org/abs/2512.14687
Pdf URL: https://arxiv.org/pdf/2512.14687
Copy Paste: [[2512.14687]] Spoken DialogSum: An Emotion-Rich Conversational Dataset for Spoken Dialogue Summarization(https://arxiv.org/abs/2512.14687)
Keywords: language model, llm
Abstract: Recent audio language models can follow long conversations. However, research on emotion-aware or spoken dialogue summarization is constrained by the lack of data that links speech, summaries, and paralinguistic cues. We introduce Spoken DialogSum, the first corpus aligning raw conversational audio with factual summaries, emotion-rich summaries, and utterance-level labels for speaker age, gender, and emotion. The dataset is built in two stages: first, an LLM rewrites DialogSum scripts with Switchboard-style fillers and back-channels, then tags each utterance with emotion, pitch, and speaking rate. Second, an expressive TTS engine synthesizes speech from the tagged scripts, aligned with paralinguistic labels. Spoken DialogSum comprises 13,460 emotion-diverse dialogues, each paired with both a factual and an emotion-focused summary. The dataset is available online at this https URL. Baselines show that an Audio-LLM raises emotional-summary ROUGE-L by 28% relative to a cascaded ASR-LLM system, confirming the value of end-to-end speech modeling.
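Note: ROUGE-L, the metric behind the 28% relative gain above, is based on the longest common subsequence (LCS) between a candidate and a reference summary. The sketch below uses whitespace tokenization and an equally weighted F-measure, which may differ from the exact ROUGE configuration used in the paper. The example sentences and the two scores passed to `relative_gain` are made up for illustration.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over whitespace tokens (assumption: equal precision/recall weight)."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def relative_gain(baseline: float, new: float) -> float:
    """Relative improvement of `new` over `baseline`."""
    return (new - baseline) / baseline


score = rouge_l_f1("the speaker sounds angry about the delay",
                   "the speaker is angry about the delay")
gain = relative_gain(0.25, 0.32)   # hypothetical scores, not from the paper
```

Here the LCS has 6 of 7 tokens, so `score` is 6/7, and the hypothetical pair 0.25 and 0.32 corresponds to a 28% relative gain, which shows that a relative figure says nothing about the absolute scores involved.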

Title: MMGR: Multi-Modal Generative Reasoning

Authors: Zefan Cai, Haoyi Qiu, Tianyi Ma, Haozhe Zhao, Gengze Zhou, Kung-Hsiang Huang, Parisa Kordjamshidi, Minjia Zhang, Xiao Wen, Jiuxiang Gu, Nanyun Peng, Junjie Hu
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2512.14691
Pdf URL: https://arxiv.org/pdf/2512.14691
Copy Paste: [[2512.14691]] MMGR: Multi-Modal Generative Reasoning(https://arxiv.org/abs/2512.14691)
Keywords: gpt
Abstract: Video foundation models generate visually realistic and temporally coherent content, but their reliability as world simulators depends on whether they capture physical, logical, and spatial constraints. Existing metrics such as Frechet Video Distance (FVD) emphasize perceptual quality and overlook reasoning failures, including violations of causality, physics, and global consistency. We introduce MMGR (Multi-Modal Generative Reasoning Evaluation and Benchmark), a principled evaluation framework based on five reasoning abilities: Physical, Logical, 3D Spatial, 2D Spatial, and Temporal. MMGR evaluates generative reasoning across three domains: Abstract Reasoning (ARC-AGI, Sudoku), Embodied Navigation (real-world 3D navigation and localization), and Physical Commonsense (sports and compositional interactions). MMGR applies fine-grained metrics that require holistic correctness across both video and image generation. We benchmark leading video models (Veo-3, Sora-2, Wan-2.2) and image models (Nano-banana, Nano-banana Pro, GPT-4o-image, Qwen-image), revealing strong performance gaps across domains. Models show moderate success on Physical Commonsense tasks but perform poorly on Abstract Reasoning (below 10 percent accuracy on ARC-AGI) and struggle with long-horizon spatial planning in embodied settings. Our analysis highlights key limitations in current models, including overreliance on perceptual data, weak global state consistency, and objectives that reward visual plausibility over causal correctness. MMGR offers a unified diagnostic benchmark and a path toward reasoning-aware generative world models.