2025-11-20

Title: Test-time Scaling of LLMs: A Survey from A Subproblem Structure Perspective

Authors: Zhuoyi Yang, Xu Guo, Tong Zhang, Huijuan Xu, Boyang Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.14772
Pdf URL: https://arxiv.org/pdf/2511.14772
Copy Paste: [[2511.14772]] Test-time Scaling of LLMs: A Survey from A Subproblem Structure Perspective(https://arxiv.org/abs/2511.14772)
Keywords: language model, llm, chain-of-thought, tree-of-thought
Abstract: In this paper, we survey techniques for improving the predictive accuracy of pretrained large language models by allocating additional compute at inference time. In categorizing test-time scaling methods, we place special emphasis on how a problem is decomposed into subproblems and on the topological organization of these subproblems, whether sequential, parallel, or tree-structured. This perspective allows us to unify diverse approaches such as Chain-of-Thought, Branch-Solve-Merge, and Tree-of-Thought under a common lens. We further synthesize existing analyses of these techniques, highlighting their respective strengths and weaknesses, and conclude by outlining promising directions for future research.

Title: Temporal Predictors of Outcome in Reasoning Language Models

Authors: Joey David
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.14773
Pdf URL: https://arxiv.org/pdf/2511.14773
Copy Paste: [[2511.14773]] Temporal Predictors of Outcome in Reasoning Language Models(https://arxiv.org/abs/2511.14773)
Keywords: language model, llm, chain-of-thought
Abstract: The chain-of-thought (CoT) paradigm uses the elicitation of step-by-step rationales as a proxy for reasoning, gradually refining the model's latent representation of a solution. However, it remains unclear just how early a Large Language Model (LLM) internally commits to an eventual outcome. We probe this by training linear classifiers on hidden states after the first t reasoning tokens, showing that eventual correctness is highly predictable after only a few tokens, even when longer outputs are needed to reach a definite answer. We show that, for harder questions, a drop in predictive accuracy highlights a selection artifact: hard items are disproportionately represented in long CoTs. Overall, our results imply that for reasoning models, internal self-assessment of success tends to emerge after only a few tokens, with implications for interpretability and for inference-time control.
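The probing setup in this abstract (a linear classifier trained on hidden states to predict eventual correctness) can be illustrated with a minimal sketch. This is not the paper's code: the four two-dimensional "hidden states", the labels, and the hyperparameters are hypothetical toy values, and plain batch gradient descent on a logistic loss stands in for whatever training procedure the author used.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit logistic regression with batch gradient descent (toy-scale only)."""
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the logistic loss w.r.t. the logit
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x) -> int:
    """Return 1 (predicted correct) if the logit is positive, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


# Hypothetical hidden states after the first t tokens, with 0/1 labels
# marking whether the final answer turned out to be correct.
toy_states = [[1.0, 0.2], [0.8, -0.1], [-1.0, 0.1], [-0.9, -0.3]]
toy_labels = [1, 1, 0, 0]
probe_w, probe_b = train_linear_probe(toy_states, toy_labels)
```

In a real experiment the features would be model activations at a fixed token position, and accuracy would be measured on held-out questions rather than on the training set.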

Title: LiveCLKTBench: Towards Reliable Evaluation of Cross-Lingual Knowledge Transfer in Multilingual LLMs

Authors: Pei-Fu Guo, Yun-Da Tsai, Chun-Chia Hsu, Kai-Xin Chen, Ya-An Tsai, Kai-Wei Chang, Nanyun Peng, Mi-Yen Yeh, Shou-De Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.14774
Pdf URL: https://arxiv.org/pdf/2511.14774
Copy Paste: [[2511.14774]] LiveCLKTBench: Towards Reliable Evaluation of Cross-Lingual Knowledge Transfer in Multilingual LLMs(https://arxiv.org/abs/2511.14774)
Keywords: language model, llm
Abstract: Evaluating cross-lingual knowledge transfer in large language models is challenging, as correct answers in a target language may arise either from genuine transfer or from prior exposure during pre-training. We present LiveCLKTBench, an automated generation pipeline specifically designed to isolate and measure cross-lingual knowledge transfer. Our pipeline identifies self-contained, time-sensitive knowledge entities from real-world domains, filters them based on temporal occurrence, and verifies them against the model's knowledge. The documents of these valid entities are then used to generate factual questions, which are translated into multiple languages to evaluate transferability across linguistic boundaries. Using LiveCLKTBench, we evaluate several LLMs across five languages and observe that cross-lingual transfer is strongly influenced by linguistic distance and often asymmetric across language directions. While larger models improve transfer, the gains diminish with scale and vary across domains. These findings provide new insights into multilingual transfer and demonstrate the value of LiveCLKTBench as a reliable benchmark for future research.

Title: COMPASS: Context-Modulated PID Attention Steering System for Hallucination Mitigation

Authors: Snigdha Pandya, Rohan Nagale, Kenji Sahay, Anna Lin, Shikhar Shiromani, Kevin Zhu, Dev Sunishchal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.14776
Pdf URL: https://arxiv.org/pdf/2511.14776
Copy Paste: [[2511.14776]] COMPASS: Context-Modulated PID Attention Steering System for Hallucination Mitigation(https://arxiv.org/abs/2511.14776)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) often generate fluent but factually incorrect statements despite having access to relevant evidence, a failure mode rooted in how they allocate attention between contextual and parametric knowledge. Understanding and steering this internal behavior is key both for trustworthy deployment and for scientific interpretability of model mechanisms. We introduce COMPASS (Context-Modulated PID Attention Steering System), a lightweight, interpretable control framework that embeds a model-based feedback loop directly within decoding. COMPASS quantifies context reliance via a transparent metric, the Context Reliance Score (CRS), which serves as an online probe of how attention heads ground generation in evidence. Using this interpretable signal, a PID controller dynamically modulates attention heads to maintain factual consistency without retraining or multi-pass decoding. Across benchmarks (HotpotQA, XSum, HaluEval, RAGTruth), COMPASS consistently reduces contextual hallucination rates (2.8 to 5.8 percent absolute) while revealing how distinct attention heads contribute to evidence alignment. These results highlight feedback-driven interpretability as a pathway toward scientific understanding of LLM behavior.
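The feedback loop described in this abstract can be pictured with a textbook discrete PID controller. The sketch below is not the COMPASS implementation: the gains, the target value, and the toy update rule that stands in for a Context Reliance Score are all hypothetical, chosen only to show how a PID signal pulls a measured quantity toward a setpoint.

```python
from dataclasses import dataclass


@dataclass
class PIDController:
    """Textbook discrete PID controller (gains are hypothetical)."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, target: float, measured: float) -> float:
        """Return a control signal computed from the tracking error."""
        error = target - measured
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(steps: int = 50, target: float = 0.8) -> list[float]:
    """Drive a toy 'context reliance' value toward a target.

    The plant (score shifts by the control signal, clipped to [0, 1])
    is an illustrative stand-in, not a model of attention heads.
    """
    pid = PIDController(kp=0.5, ki=0.05, kd=0.1)
    score = 0.2
    history = [score]
    for _ in range(steps):
        control = pid.step(target, score)
        score = min(1.0, max(0.0, score + control))
        history.append(score)
    return history
```

Calling `simulate()` returns the score trajectory; with these gains it overshoots slightly and then settles near the target.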

Title: Human or LLM as Standardized Patients? A Comparative Study for Medical Education

Authors: Bingquan Zhang, Xiaoxiao Liu, Yuchi Wang, Lei Zhou, Qianqian Xie, Benyou Wang
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2511.14783
Pdf URL: https://arxiv.org/pdf/2511.14783
Copy Paste: [[2511.14783]] Human or LLM as Standardized Patients? A Comparative Study for Medical Education(https://arxiv.org/abs/2511.14783)
Keywords: llm, agent
Abstract: Standardized Patients (SP) are indispensable for clinical skills training but remain expensive, inflexible, and difficult to scale. Existing large-language-model (LLM)-based SP simulators promise lower cost yet show inconsistent behavior and lack rigorous comparison with human SP. We present EasyMED, a multi-agent framework combining a Patient Agent for realistic dialogue, an Auxiliary Agent for factual consistency, and an Evaluation Agent that delivers actionable feedback. To support systematic assessment, we introduce SPBench, a benchmark of real SP-doctor interactions spanning 14 specialties and eight expert-defined evaluation criteria. Experiments demonstrate that EasyMED matches human SP learning outcomes while producing greater skill gains for lower-baseline students and offering improved flexibility, psychological safety, and cost efficiency.

Title: Hierarchical Token Prepending: Enhancing Information Flow in Decoder-based LLM Embeddings

Authors: Xueying Ding, Xingyue Huang, Mingxuan Ju, Liam Collins, Yozen Liu, Leman Akoglu, Neil Shah, Tong Zhao
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.14868
Pdf URL: https://arxiv.org/pdf/2511.14868
Copy Paste: [[2511.14868]] Hierarchical Token Prepending: Enhancing Information Flow in Decoder-based LLM Embeddings(https://arxiv.org/abs/2511.14868)
Keywords: language model, llm
Abstract: Large language models produce powerful text embeddings, but their causal attention mechanism restricts the flow of information from later to earlier tokens, degrading representation quality. While recent methods attempt to solve this by prepending a single summary token, they over-compress information, hence harming performance on long documents. We propose Hierarchical Token Prepending (HTP), a method that resolves two critical bottlenecks. To mitigate attention-level compression, HTP partitions the input into blocks and prepends block-level summary tokens to subsequent blocks, creating multiple pathways for backward information flow. To address readout-level over-squashing, we replace last-token pooling with mean-pooling, a choice supported by theoretical analysis. HTP achieves consistent performance gains across 11 retrieval datasets and 30 general embedding benchmarks, especially in long-context settings. As a simple, architecture-agnostic method, HTP enhances both zero-shot and finetuned models, offering a scalable route to superior long-document embeddings.
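The readout change mentioned in this abstract, replacing last-token pooling with mean pooling, can be shown on toy data. This is a minimal sketch under the assumption that per-token hidden states are plain lists of floats; it does not reproduce the block partitioning or summary-token prepending of HTP itself.

```python
def last_token_pool(hidden_states: list[list[float]]) -> list[float]:
    """Use only the final token's hidden state as the embedding."""
    return list(hidden_states[-1])


def mean_pool(hidden_states: list[list[float]]) -> list[float]:
    """Average the hidden states of all tokens, dimension by dimension."""
    n = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(tok[i] for tok in hidden_states) / n for i in range(dim)]


# Hypothetical hidden states for a three-token input, two dimensions each.
states = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
print(last_token_pool(states))  # [5.0, 4.0]
print(mean_pool(states))        # [3.0, 2.0]
```

Mean pooling lets every token contribute directly to the embedding, whereas last-token pooling relies on the final position having absorbed everything before it.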

Title: Mathematical Analysis of Hallucination Dynamics in Large Language Models: Uncertainty Quantification, Advanced Decoding, and Principled Mitigation

Authors: Moses Kiprono
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.15005
Pdf URL: https://arxiv.org/pdf/2511.15005
Copy Paste: [[2511.15005]] Mathematical Analysis of Hallucination Dynamics in Large Language Models: Uncertainty Quantification, Advanced Decoding, and Principled Mitigation(https://arxiv.org/abs/2511.15005)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) are powerful linguistic engines but remain susceptible to hallucinations: plausible-sounding outputs that are factually incorrect or unsupported. In this work, we present a mathematically grounded framework to understand, measure, and mitigate these hallucinations. Drawing on probabilistic modeling, information theory, trigonometric signal analysis, and Bayesian uncertainty estimation, we analyze how errors compound autoregressively, propose refined uncertainty metrics, including semantic and phase-aware variants, and develop principled mitigation strategies such as contrastive decoding, retrieval-augmented grounding, factual alignment, and abstention. This unified lens connects recent advances in calibration, retrieval, and alignment to support safer and more reliable LLMs.

Title: Teaching According to Students' Aptitude: Personalized Mathematics Tutoring via Persona-, Memory-, and Forgetting-Aware LLMs

Authors: Yang Wu, Rujing Yao, Tong Zhang, Yufei Shi, Zhuoren Jiang, Zhushan Li, Xiaozhong Liu
Subjects: cs.CL, cs.AI, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2511.15163
Pdf URL: https://arxiv.org/pdf/2511.15163
Copy Paste: [[2511.15163]] Teaching According to Students' Aptitude: Personalized Mathematics Tutoring via Persona-, Memory-, and Forgetting-Aware LLMs(https://arxiv.org/abs/2511.15163)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are increasingly integrated into intelligent tutoring systems to provide human-like and adaptive instruction. However, most existing approaches fail to capture how students' knowledge evolves dynamically across their proficiencies, conceptual gaps, and forgetting patterns. This challenge is particularly acute in mathematics tutoring, where effective instruction requires fine-grained scaffolding precisely calibrated to each student's mastery level and cognitive retention. To address this issue, we propose TASA (Teaching According to Students' Aptitude), a student-aware tutoring framework that integrates persona, memory, and forgetting dynamics for personalized mathematics learning. Specifically, TASA maintains a structured student persona capturing proficiency profiles and an event memory recording prior learning interactions. By incorporating a continuous forgetting curve with knowledge tracing, TASA dynamically updates each student's mastery state and generates contextually appropriate, difficulty-calibrated questions and explanations. Empirical results demonstrate that TASA achieves superior learning outcomes and more adaptive tutoring behavior compared to representative baselines, underscoring the importance of modeling temporal forgetting and learner profiles in LLM-based tutoring systems.
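The abstract does not state which forgetting curve TASA uses. As an illustration only, the sketch below assumes the classic exponential (Ebbinghaus-style) form R = exp(-t / S) and applies it to a mastery estimate; the function names, the 7-day stability value, and the multiplicative update are hypothetical.

```python
import math


def retention(elapsed_days: float, stability_days: float) -> float:
    """Exponential forgetting curve R = exp(-t / S) (assumed form)."""
    return math.exp(-elapsed_days / stability_days)


def decayed_mastery(mastery: float, elapsed_days: float,
                    stability_days: float = 7.0) -> float:
    """Scale a mastery estimate in [0, 1] by the retention since last practice."""
    return mastery * retention(elapsed_days, stability_days)
```

With these assumptions, a skill mastered at 0.8 and left unpracticed for one stability period decays to 0.8 / e, roughly 0.29, which a tutor could use as a signal to schedule review.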

Title: HinTel-AlignBench: A Framework and Benchmark for Hindi-Telugu with English-Aligned Samples

Authors: Rishikant Chigrupaatii, Ponnada Sai Tulasi Kanishka, Lalit Chandra Routhu, Martin Patel Sama Supratheek Reddy, Divyam Gupta, Dasari Srikar, Krishna Teja Kuchimanchi, Rajiv Misra, Rohun Tripathi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.15183
Pdf URL: https://arxiv.org/pdf/2511.15183
Copy Paste: [[2511.15183]] HinTel-AlignBench: A Framework and Benchmark for Hindi-Telugu with English-Aligned Samples(https://arxiv.org/abs/2511.15183)
Keywords: language model
Abstract: With nearly 1.5 billion people and more than 120 major languages, India represents one of the most diverse regions in the world. As multilingual Vision-Language Models (VLMs) gain prominence, robust evaluation methodologies are essential to drive progress toward equitable AI for low-resource languages. Current multilingual VLM evaluations suffer from four major limitations: reliance on unverified auto-translations, narrow task/domain coverage, limited sample sizes, and lack of cultural and natively sourced Question-Answering (QA). To address these gaps, we present a scalable framework to evaluate VLMs in Indian languages and compare their performance with that in English. Using the framework, we generate HinTel-AlignBench, a benchmark that draws from diverse sources in Hindi and Telugu with English-aligned samples. Our contributions are threefold: (1) a semi-automated dataset creation framework combining back-translation, filtering, and human verification; (2) the most comprehensive vision-language benchmark for Hindi and Telugu, including adapted English datasets (VQAv2, RealWorldQA, CLEVR-Math) and native novel Indic datasets (JEE for STEM, VAANI for cultural grounding) with approximately 4,000 QA pairs per language; and (3) a detailed performance analysis of various State-of-the-Art (SOTA) open-weight and closed-source VLMs. We find that performance regresses from English to Indian languages on 4 out of 5 tasks across all the models, with an average regression of 8.3 points in Hindi and 5.5 points in Telugu. We categorize common failure modes to highlight concrete areas of improvement in multilingual multimodal understanding.

Title: Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story

Authors: Vladislav Pedashenko, Laida Kushnareva, Yana Khassan Nibal, Eduard Tulchinskii, Kristian Kuznetsov, Vladislav Zharchinskii, Yury Maximov, Irina Piontkovskaya
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.15210
Pdf URL: https://arxiv.org/pdf/2511.15210
Copy Paste: [[2511.15210]] Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story(https://arxiv.org/abs/2511.15210)
Keywords: llm
Abstract: Intrinsic dimension (ID) is an important tool in modern LLM analysis, informing studies of training dynamics, scaling behavior, and dataset structure, yet its textual determinants remain underexplored. We provide the first comprehensive study grounding ID in interpretable text properties through cross-encoder analysis, linguistic features, and sparse autoencoders (SAEs). In this work, we establish three key findings. First, ID is complementary to entropy-based metrics: after controlling for length, the two are uncorrelated, with ID capturing geometric complexity orthogonal to prediction quality. Second, ID exhibits robust genre stratification: scientific prose shows low ID (~8), encyclopedic content medium ID (~9), and creative/opinion writing high ID (~10.5) across all models tested. This reveals that contemporary LLMs find scientific text "representationally simple" while fiction requires additional degrees of freedom. Third, using SAEs, we identify causal features: scientific signals (formal tone, report templates, statistics) reduce ID; humanized signals (personalization, emotion, narrative) increase it. Steering experiments confirm these effects are causal. Thus, for contemporary models, scientific writing appears comparatively "easy", whereas fiction, opinion, and affect add representational degrees of freedom. Our multi-faceted analysis provides practical guidance for the proper use of ID and the sound interpretation of ID-based results.

Title: OEMA: Ontology-Enhanced Multi-Agent Collaboration Framework for Zero-Shot Clinical Named Entity Recognition

Authors: Xinli Tao, Xin Dong, Xuezhong Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.15211
Pdf URL: https://arxiv.org/pdf/2511.15211
Copy Paste: [[2511.15211]] OEMA: Ontology-Enhanced Multi-Agent Collaboration Framework for Zero-Shot Clinical Named Entity Recognition(https://arxiv.org/abs/2511.15211)
Keywords: language model, llm, prompt, agent
Abstract: Clinical named entity recognition (NER) is crucial for extracting information from electronic health records (EHRs), but supervised models like CRF and BioClinicalBERT require costly annotated data. While zero-shot NER with large language models (LLMs) reduces this dependency, it struggles with example selection granularity and integrating prompts with self-improvement. To address this, we propose OEMA, a zero-shot clinical NER framework using multi-agent collaboration. OEMA's three components are: a self-annotator generating examples, a discriminator filtering them via SNOMED CT, and a predictor using entity descriptions for accurate inference. On MTSamples and VAERS datasets, OEMA achieves state-of-the-art exact-match performance. Under related-match, it matches supervised BioClinicalBERT and surpasses CRF. OEMA addresses key zero-shot NER challenges through ontology-guided reasoning and multi-agent collaboration, achieving near-supervised performance and showing promise for clinical NLP applications.

Title: Context Cascade Compression: Exploring the Upper Limits of Text Compression

Authors: Fanfan Liu, Haibo Qiu
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2511.15244
Pdf URL: https://arxiv.org/pdf/2511.15244
Copy Paste: [[2511.15244]] Context Cascade Compression: Exploring the Upper Limits of Text Compression(https://arxiv.org/abs/2511.15244)
Keywords: language model, llm, long context
Abstract: Million-level token inputs in long-context tasks pose significant computational and memory challenges for Large Language Models (LLMs). Recently, DeepSeek-OCR conducted research into the feasibility of Contexts Optical Compression and achieved preliminary results. Inspired by this, we introduce Context Cascade Compression (C3) to explore the upper limits of text compression. Our method cascades two LLMs of different sizes to handle the compression and decoding tasks. Specifically, a small LLM, acting as the first stage, performs text compression by condensing a long context into a set of latent tokens (e.g., 32 or 64 in length), achieving a high ratio of text tokens to latent tokens. A large LLM, as the second stage, then executes the decoding task on this compressed context. Experiments show that at a 20x compression ratio (where the number of text tokens is 20 times the number of latent tokens), our model achieves 98% decoding accuracy, compared to approximately 60% for DeepSeek-OCR. When we further increase the compression ratio to 40x, the accuracy is maintained at around 93%. This indicates that in the domain of context compression, C3 demonstrates superior performance and feasibility over optical character compression. C3 uses a simpler, pure-text pipeline that ignores factors like layout, color, and information loss from a visual encoder. This also suggests a potential upper bound for compression ratios in future work on optical character compression, OCR, and related fields. Code and model weights are publicly accessible at this https URL.
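The compression ratio in this abstract is defined as the number of text tokens divided by the number of latent tokens. The short sketch below works through that arithmetic for the latent lengths the abstract mentions (32 and 64). The function names are hypothetical, and pairing each latent length with each ratio is an illustration, not a claim about the configurations the paper evaluated.

```python
def compression_ratio(text_tokens: int, latent_tokens: int) -> float:
    """Ratio of original text tokens to latent tokens."""
    return text_tokens / latent_tokens


def text_tokens_at_ratio(latent_tokens: int, ratio: int) -> int:
    """How many text tokens a latent budget covers at a given ratio."""
    return latent_tokens * ratio


for latent in (32, 64):
    for ratio in (20, 40):
        covered = text_tokens_at_ratio(latent, ratio)
        print(f"{latent} latent tokens at {ratio}x cover {covered} text tokens")
```

Under this definition, 32 latent tokens at 20x correspond to 640 text tokens, and 64 latent tokens at 40x correspond to 2,560 text tokens.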

Title: IndicGEC: Powerful Models, or a Measurement Mirage?

Authors: Sowmya Vajjala
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.15260
Pdf URL: https://arxiv.org/pdf/2511.15260
Copy Paste: [[2511.15260]] IndicGEC: Powerful Models, or a Measurement Mirage?(https://arxiv.org/abs/2511.15260)
Keywords: language model, prompt
Abstract: In this paper, we report the results of TeamNRC's participation in the BHASHA-Task 1 Grammatical Error Correction shared task (this https URL) for 5 Indian languages. Our approach, focusing on zero/few-shot prompting of language models of varying sizes (4B to large proprietary models), achieved Rank 4 in Telugu and Rank 2 in Hindi, with GLEU scores of 83.78 and 84.31 respectively. In this paper, we extend the experiments to the other three languages of the shared task (Tamil, Malayalam, and Bangla) and take a closer look at the data quality and evaluation metric used. Our results primarily highlight the potential of small language models, and summarize the concerns related to creating good-quality datasets and appropriate metrics for this task that are suitable for Indian language scripts.

Title: Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models

Authors: Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Francesco Giarrusso, Marcantonio Bracale, Marcello Galisai, Vincenzo Suriani, Olga Sorokoletova, Federico Sartore, Daniele Nardi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.15304
Pdf URL: https://arxiv.org/pdf/2511.15304
Copy Paste: [[2511.15304]] Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models(https://arxiv.org/abs/2511.15304)
Keywords: language model, llm, prompt
Abstract: We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for large language models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%. Mapping prompts to MLCommons and EU CoP risk taxonomies shows that poetic attacks transfer across CBRN, manipulation, cyber-offence, and loss-of-control domains. Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. Outputs are evaluated using an ensemble of open-weight judge models and a human-validated stratified subset (with double-annotations to measure agreement). Disagreements were manually resolved. Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions (compared to non-poetic baselines), substantially outperforming non-poetic baselines and revealing a systematic vulnerability across model families and safety training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.

Title: HEAD-QA v2: Expanding a Healthcare Benchmark for Reasoning

Authors: Alexis Correa-Guillén, Carlos Gómez-Rodríguez, David Vilares
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.15355
Pdf URL: https://arxiv.org/pdf/2511.15355
Copy Paste: [[2511.15355]] HEAD-QA v2: Expanding a Healthcare Benchmark for Reasoning(https://arxiv.org/abs/2511.15355)
Keywords: llm, prompt
Abstract: We introduce HEAD-QA v2, an expanded and updated version of a Spanish/English healthcare multiple-choice reasoning dataset originally released by Vilares and Gómez-Rodríguez (2019). The update responds to the growing need for high-quality datasets that capture the linguistic and conceptual complexity of healthcare reasoning. We extend the dataset to over 12,000 questions from ten years of Spanish professional exams, benchmark several open-source LLMs using prompting, RAG, and probability-based answer selection, and provide additional multilingual versions to support future work. Results indicate that performance is mainly driven by model scale and intrinsic reasoning ability, with complex inference strategies obtaining limited gains. Together, these results establish HEAD-QA v2 as a reliable resource for advancing research on biomedical reasoning and model improvement.

Title: The Empowerment of Science of Science by Large Language Models: New Tools and Methods

Authors: Guoqiang Liang, Jingqian Gong, Mengxuan Li, Gege Lin, Shuo Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.15370
Pdf URL: https://arxiv.org/pdf/2511.15370
Copy Paste: [[2511.15370]] The Empowerment of Science of Science by Large Language Models: New Tools and Methods(https://arxiv.org/abs/2511.15370)
Keywords: language model, llm, prompt, retrieval augmented generation, agent
Abstract: Large language models (LLMs) have exhibited exceptional capabilities in natural language understanding and generation, image recognition, and multimodal tasks, charting a course towards AGI and emerging as a central issue in the global technological race. This manuscript conducts a comprehensive review of the core technologies that support LLMs from a user standpoint, including prompt engineering, knowledge-enhanced retrieval-augmented generation, fine-tuning, pretraining, and tool learning. Additionally, it traces the historical development of Science of Science (SciSci) and presents a forward-looking perspective on the potential applications of LLMs within the scientometric domain. Furthermore, it discusses the prospect of an AI-agent-based model for scientific evaluation, and presents new LLM-based methods for research front detection and knowledge graph building.

Title: A Compliance-Preserving Retrieval System for Aircraft MRO Task Search

Authors: Byungho Jo
Subjects: cs.CL, cs.AI, cs.ET, cs.IR
Abstract URL: https://arxiv.org/abs/2511.15383
Pdf URL: https://arxiv.org/pdf/2511.15383
Copy Paste: [[2511.15383]] A Compliance-Preserving Retrieval System for Aircraft MRO Task Search(https://arxiv.org/abs/2511.15383)
Keywords: llm
Abstract: Aircraft Maintenance Technicians (AMTs) spend up to 30% of work time searching manuals, a documented efficiency bottleneck in MRO operations where every procedure must be traceable to certified sources. We present a compliance-preserving retrieval system that adapts LLM reranking and semantic search to aviation MRO environments by operating alongside, rather than replacing, certified legacy viewers. The system constructs revision-robust embeddings from ATA chapter hierarchies and uses vision-language parsing to structure certified content, allowing technicians to preview ranked tasks and access verified procedures in existing viewers. Evaluation on 49k synthetic queries achieves >90% retrieval accuracy, while bilingual controlled studies with 10 licensed AMTs demonstrate 90.9% top-10 success rate and 95% reduction in lookup time, from 6-15 minutes to 18 seconds per task. These gains provide concrete evidence that semantic retrieval can operate within strict regulatory constraints and meaningfully reduce operational workload in real-world multilingual MRO workflows.

Title: DEPO: Dual-Efficiency Preference Optimization for LLM Agents

Authors: Sirui Chen, Mengshi Zhao, Lei Xu, Yuying Zhao, Beier Zhu, Hanwang Zhang, Shengjie Zhao, Chaochao Lu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.15392
Pdf URL: https://arxiv.org/pdf/2511.15392
Copy Paste: [[2511.15392]] DEPO: Dual-Efficiency Preference Optimization for LLM Agents(https://arxiv.org/abs/2511.15392)
Keywords: language model, llm, agent
Abstract: Recent advances in large language models (LLMs) have greatly improved their reasoning and decision-making abilities when deployed as agents. Richer reasoning, however, often comes at the cost of a longer chain of thought (CoT), hampering interaction efficiency in real-world scenarios. Moreover, a systematic definition of LLM agent efficiency is still lacking, which hinders targeted improvements. To this end, we introduce dual-efficiency, comprising (i) step-level efficiency, which minimizes tokens per step, and (ii) trajectory-level efficiency, which minimizes the number of steps to complete a task. Building on this definition, we propose DEPO, a dual-efficiency preference optimization method that jointly rewards succinct responses and fewer action steps. Experiments on WebShop and BabyAI show that DEPO cuts token usage by up to 60.9% and steps by up to 26.9%, while achieving up to a 29.3% improvement in performance. DEPO also generalizes to three out-of-domain math benchmarks and retains its efficiency gains when trained on only 25% of the data. Our project page is at this https URL.

Title: NAMeGEn: Creative Name Generation via A Novel Agent-based Multiple Personalized Goal Enhancement Framework

Authors: Shanlin Zhou (1), Xinpeng Wang (1), Jianxun Lian (2), Zhenghao Liu (3), Laks V.S. Lakshmanan (4), Xiaoyuan Yi (2), Yongtao Hao (1) ((1) Tongji University, (2) Microsoft Research Asia, (3) Northeastern University, (4) The University of British Columbia)
Subjects: cs.CL, cs.AI, cs.IR, cs.MA, cs.NE
Abstract URL: https://arxiv.org/abs/2511.15408
Pdf URL: https://arxiv.org/pdf/2511.15408
Copy Paste: [[2511.15408]] NAMeGEn: Creative Name Generation via A Novel Agent-based Multiple Personalized Goal Enhancement Framework(https://arxiv.org/abs/2511.15408)
Keywords: language model, llm, agent
Abstract: Trained on diverse human-authored texts, Large Language Models (LLMs) have unlocked the potential for Creative Natural Language Generation (CNLG), benefiting applications such as advertising and storytelling. Nevertheless, CNLG remains difficult due to two main challenges. (1) Multi-objective flexibility: user requirements are often personalized, fine-grained, and pluralistic, which LLMs struggle to satisfy simultaneously; (2) Interpretive complexity: beyond generation, creativity also involves understanding and interpreting implicit meaning to enhance users' perception. These challenges significantly limit the ability of current methods to generate creative and insightful content, especially in short-form text generation. To address this, we focus on Chinese baby naming, a representative short-form CNLG task requiring adherence to explicit user constraints (e.g., length, semantics, anthroponymy) while offering meaningful aesthetic explanations. We propose NAMeGEn, a novel multi-agent optimization framework that iteratively alternates between objective extraction, name generation, and evaluation to meet diverse requirements and generate accurate explanations. To support this task, we further construct a classical Chinese poetry corpus with 17k+ poems to enhance aesthetics, and introduce CBNames, a new benchmark with tailored metrics. Extensive experiments demonstrate that NAMeGEn effectively generates creative names that meet diverse, personalized requirements while providing meaningful explanations, outperforming six baseline methods spanning various LLM backbones without any training.

Title: LLM-MemCluster: Empowering Large Language Models with Dynamic Memory for Text Clustering

Authors: Yuanjie Zhu, Liangwei Yang, Ke Xu, Weizhi Zhang, Zihe Song, Jindong Wang, Philip S. Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.15424
Pdf URL: https://arxiv.org/pdf/2511.15424
Copy Paste: [[2511.15424]] LLM-MemCluster: Empowering Large Language Models with Dynamic Memory for Text Clustering(https://arxiv.org/abs/2511.15424)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are reshaping unsupervised learning by offering an unprecedented ability to perform text clustering based on their deep semantic understanding. However, their direct application is fundamentally limited by a lack of stateful memory for iterative refinement and the difficulty of managing cluster granularity. As a result, existing methods often rely on complex pipelines with external modules, sacrificing a truly end-to-end approach. We introduce LLM-MemCluster, a novel framework that reconceptualizes clustering as a fully LLM-native task. It leverages a Dynamic Memory to instill state awareness and a Dual-Prompt Strategy to enable the model to reason about and determine the number of clusters. Evaluated on several benchmark datasets, our tuning-free framework significantly and consistently outperforms strong baselines. LLM-MemCluster presents an effective, interpretable, and truly end-to-end paradigm for LLM-based text clustering.

Title: Standardising the NLP Workflow: A Framework for Reproducible Linguistic Analysis

Authors: Yves Pauli, Jan-Bernard Marsman, Finn Rabe, Victoria Edkins, Roya Hüppi, Silvia Ciampelli, Akhil Ratan Misra, Nils Lang, Wolfram Hinzen, Iris Sommer, Philipp Homan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.15512
Pdf URL: https://arxiv.org/pdf/2511.15512
Copy Paste: [[2511.15512]] Standardising the NLP Workflow: A Framework for Reproducible Linguistic Analysis(https://arxiv.org/abs/2511.15512)
Keywords: language model
Abstract: The introduction of large language models and other influential developments in AI-based language processing have led to an evolution in the methods available to quantitatively analyse language data. With the resultant growth of attention on language processing, significant challenges have emerged, including the lack of standardisation in organising and sharing linguistic data and the absence of standardised and reproducible processing methodologies. Striving for future standardisation, we first propose the Language Processing Data Structure (LPDS), a data structure inspired by the Brain Imaging Data Structure (BIDS), a widely adopted standard for handling neuroscience data. It provides a folder structure and file naming conventions for linguistic research. Second, we introduce pelican nlp, a modular and extensible Python package designed to enable streamlined language processing, from initial data cleaning and task-specific preprocessing to the extraction of sophisticated linguistic and acoustic features, such as semantic embeddings and prosodic metrics. The entire processing workflow can be specified within a single, shareable configuration file, which pelican nlp then executes on LPDS-formatted data. Depending on the specifications, the reproducible output can consist of preprocessed language data or standardised extraction of both linguistic and acoustic features and corresponding result aggregations. LPDS and pelican nlp collectively offer an end-to-end processing pipeline for linguistic data, designed to ensure methodological transparency and enhance reproducibility.

Title: Multimodal Evaluation of Russian-language Architectures

Authors: Artem Chervyakov, Ulyana Isaeva, Anton Emelyanov, Artem Safin, Maria Tikhonova, Alexander Kharitonov, Yulia Lyakh, Petr Surovtsev, Denis Shevelev, Vildan Saburov, Vasily Konovalov, Elisei Rykov, Ivan Sviridov, Amina Miftakhova, Ilseyar Alimova, Alexander Panchenko, Alexander Kapitanov, Alena Fenogenova
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2511.15552
Pdf URL: https://arxiv.org/pdf/2511.15552
Copy Paste: [[2511.15552]] Multimodal Evaluation of Russian-language Architectures(https://arxiv.org/abs/2511.15552)
Keywords: language model, llm, prompt
Abstract: Multimodal large language models (MLLMs) are currently at the center of research attention, showing rapid progress in scale and capabilities, yet their intelligence, limitations, and risks remain insufficiently understood. To address these issues, particularly in the context of the Russian language, where no multimodal benchmarks currently exist, we introduce Mera Multi, an open multimodal evaluation framework for Russian-language architectures. The benchmark is instruction-based and encompasses default text, image, audio, and video modalities, comprising 18 newly constructed evaluation tasks for both general-purpose models and modality-specific architectures (image-to-text, video-to-text, and audio-to-text). Our contributions include: (i) a universal taxonomy of multimodal abilities; (ii) 18 datasets created entirely from scratch with attention to Russian cultural and linguistic specificity, unified prompts, and metrics; (iii) baseline results for both closed-source and open-source models; (iv) a methodology for preventing benchmark leakage, including watermarking and licenses for private sets. While our current focus is on Russian, the proposed benchmark provides a replicable methodology for constructing multimodal benchmarks in typologically diverse languages, particularly within the Slavic language family.

Title: HSKBenchmark: Modeling and Benchmarking Chinese Second Language Acquisition in Large Language Models through Curriculum Tuning

Authors: Qihao Yang, Xuelin Wang, Jiale Chen, Xuelian Dong, Yuxin Hao, Tianyong Hao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.15574
Pdf URL: https://arxiv.org/pdf/2511.15574
Copy Paste: [[2511.15574]] HSKBenchmark: Modeling and Benchmarking Chinese Second Language Acquisition in Large Language Models through Curriculum Tuning(https://arxiv.org/abs/2511.15574)
Keywords: language model, llm, agent
Abstract: Language acquisition is vital to revealing the nature of human language intelligence and has recently emerged as a promising perspective for improving the interpretability of large language models (LLMs). However, it is ethically and practically infeasible to conduct experiments that require controlling human learners' language inputs. This poses challenges for the verifiability and scalability of language acquisition modeling, particularly in Chinese second language acquisition (SLA). While LLMs provide a controllable and reproducible alternative, a systematic benchmark to support phase-wise modeling and assessment is still lacking. In this paper, we present HSKBenchmark, the first benchmark for staged modeling and writing assessment of LLMs in Chinese SLA. It covers HSK levels 3 to 6 and includes authentic textbooks with 6.76 million tokens, 16K synthetic instruction samples, 30 test topics, and a linguistically grounded evaluation system. To simulate human learning trajectories, we introduce a curriculum-tuning framework that trains models from beginner to advanced levels. An evaluation system is created to examine level-based grammar coverage, writing errors, lexical and syntactic complexity, and holistic scoring. We also build HSKAgent, fine-tuned on 10K learner compositions. Extensive experimental results demonstrate that HSKBenchmark not only models Chinese SLA effectively, but also serves as a reliable benchmark for dynamic writing assessment in LLMs. Our fine-tuned LLMs achieve writing performance on par with advanced human learners and exhibit human-like acquisition characteristics. The HSKBenchmark, HSKAgent, and checkpoints serve as foundational tools and resources, with the potential to pave the way for future research on language acquisition modeling and LLM interpretability. Code and data are publicly available at: this https URL.