2025-12-29

Title: Teaching People LLM's Errors and Getting it Right

Authors: Nathan Stringham, Fateme Hashemi Chaleshtori, Xinyuan Yan, Zhichao Xu, Bei Wang, Ana Marasović
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.21422
Pdf URL: https://arxiv.org/pdf/2512.21422
Copy Paste: [[2512.21422]] Teaching People LLM's Errors and Getting it Right(https://arxiv.org/abs/2512.21422)
Keywords: language model, llm, prompt
Abstract: People use large language models (LLMs) when they should not. This is partly because they see LLMs compose poems and answer intricate questions, so they understandably, but incorrectly, assume LLMs won't stumble on basic tasks like simple arithmetic. Prior work has tried to address this by clustering instance embeddings into regions where an LLM is likely to fail and automatically describing patterns in these regions. The found failure patterns are taught to users to mitigate their overreliance. Yet, this approach has not fully succeeded. In this analysis paper, we aim to understand why. We first examine whether the negative result stems from the absence of failure patterns. We group instances in two datasets by their meta-labels and evaluate an LLM's predictions on these groups. We then define criteria to flag groups that are sizable and where the LLM is error-prone, and find meta-label groups that meet these criteria. Their meta-labels are the LLM's failure patterns that could be taught to users, so they do exist. We next test whether prompting and embedding-based approaches can surface these known failures. Without this, users cannot be taught about them to reduce their overreliance. We find mixed results across methods, which could explain the negative result. Finally, we revisit the final metric that measures teaching effectiveness. We propose to assess a user's ability to effectively use the given failure patterns to anticipate when an LLM is error-prone. A user study shows a positive effect from teaching with this metric, unlike the human-AI team accuracy. Our findings show that teaching failure patterns could be a viable approach to mitigating overreliance, but success depends on better automated failure-discovery methods and using metrics like ours.
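
The flagging step described in the abstract (groups that are sizable and where the LLM is error-prone) can be sketched as below. This is a minimal illustration, not the authors' code: the function name, the `min_size` and `max_accuracy` thresholds, and the toy records are all assumptions, since the abstract does not state the actual criteria.

```python
from collections import defaultdict


def flag_failure_groups(records, min_size=3, max_accuracy=0.5):
    """Return meta-label groups that are both sizable and error-prone.

    records: iterable of (meta_label, is_correct) pairs.
    min_size and max_accuracy are hypothetical thresholds.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for label, is_correct in records:
        totals[label] += 1
        correct[label] += int(is_correct)

    flagged = {}
    for label, n in totals.items():
        accuracy = correct[label] / n
        # A group is a candidate failure pattern only if it meets both criteria.
        if n >= min_size and accuracy <= max_accuracy:
            flagged[label] = (n, accuracy)
    return flagged


# Toy data: (meta-label, whether the LLM's prediction was correct).
records = [
    ("arithmetic", False), ("arithmetic", False), ("arithmetic", True),
    ("poetry", True), ("poetry", True), ("poetry", True),
    ("dates", False),
]
flagged = flag_failure_groups(records)
```

Here only the "arithmetic" group is flagged: "poetry" is accurate, and "dates" is error-prone but too small to count as a pattern.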

Title: Morality is Contextual: Learning Interpretable Moral Contexts from Human Data with Probabilistic Clustering and Large Language Models

Authors: Geoffroy Morlat, Marceau Nahon, Augustin Chartouny, Raja Chatila, Ismael T. Freire, Mehdi Khamassi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.21439
Pdf URL: https://arxiv.org/pdf/2512.21439
Copy Paste: [[2512.21439]] Morality is Contextual: Learning Interpretable Moral Contexts from Human Data with Probabilistic Clustering and Large Language Models(https://arxiv.org/abs/2512.21439)
Keywords: language model, llm, prompt
Abstract: Moral actions are judged not only by their outcomes but by the context in which they occur. We present COMETH (Contextual Organization of Moral Evaluation from Textual Human inputs), a framework that integrates a probabilistic context learner with LLM-based semantic abstraction and human moral evaluations to model how context shapes the acceptability of ambiguous actions. We curate an empirically grounded dataset of 300 scenarios across six core actions (violating Do not kill, Do not deceive, and Do not break the law) and collect ternary judgments (Blame/Neutral/Support) from N=101 participants. A preprocessing pipeline standardizes actions via an LLM filter and MiniLM embeddings with K-means, producing robust, reproducible core-action clusters. COMETH then learns action-specific moral contexts by clustering scenarios online from human judgment distributions using principled divergence criteria. To generalize and explain predictions, a Generalization module extracts concise, non-evaluative binary contextual features and learns feature weights in a transparent likelihood-based model. Empirically, COMETH roughly doubles alignment with majority human judgments relative to end-to-end LLM prompting (approx. 60% vs. approx. 30% on average), while revealing which contextual features drive its predictions. The contributions are: (i) an empirically grounded moral-context dataset, (ii) a reproducible pipeline combining human judgments with model-based context learning and LLM semantics, and (iii) an interpretable alternative to end-to-end LLMs for context-sensitive moral prediction and explanation.
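
The idea of clustering scenarios online from human judgment distributions can be sketched as follows. The abstract only says "principled divergence criteria", so the Jensen-Shannon divergence, the `threshold` value, and the running-mean update used here are assumptions chosen for illustration, not COMETH's actual procedure.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def online_cluster(distributions, threshold=0.1):
    """Assign each (Blame, Neutral, Support) distribution to a context.

    A scenario joins the closest existing context if the divergence to that
    context's mean is within the (hypothetical) threshold; otherwise it
    starts a new context.
    """
    contexts = []      # each context: {"mean": [...], "count": int}
    assignments = []
    for dist in distributions:
        best, best_div = None, None
        for idx, ctx in enumerate(contexts):
            d = js_divergence(dist, ctx["mean"])
            if best_div is None or d < best_div:
                best, best_div = idx, d
        if best is not None and best_div <= threshold:
            ctx = contexts[best]
            n = ctx["count"]
            # Update the running mean of the context's judgment distribution.
            ctx["mean"] = [(m * n + x) / (n + 1) for m, x in zip(ctx["mean"], dist)]
            ctx["count"] = n + 1
            assignments.append(best)
        else:
            contexts.append({"mean": list(dist), "count": 1})
            assignments.append(len(contexts) - 1)
    return assignments, contexts


# Toy judgment distributions over (Blame, Neutral, Support).
scenarios = [[0.8, 0.1, 0.1], [0.75, 0.15, 0.1], [0.1, 0.1, 0.8]]
assignments, contexts = online_cluster(scenarios)
```

The first two scenarios are mostly blamed and fall into one context; the third is mostly supported and opens a second one.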

Title: Oogiri-Master: Benchmarking Humor Understanding via Oogiri

Authors: Soichiro Murakami, Hidetaka Kamigaito, Hiroya Takamura, Manabu Okumura
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.21494
Pdf URL: https://arxiv.org/pdf/2512.21494
Copy Paste: [[2512.21494]] Oogiri-Master: Benchmarking Humor Understanding via Oogiri(https://arxiv.org/abs/2512.21494)
Keywords: language model, llm, prompt
Abstract: Humor is a salient testbed for human-like creative thinking in large language models (LLMs). We study humor using the Japanese creative response game Oogiri, in which participants produce witty responses to a given prompt, and ask the following research question: What makes such responses funny to humans? Previous work has offered only limited reliable means to answer this question. Existing datasets contain few candidate responses per prompt, expose popularity signals during ratings, and lack objective and comparable metrics for funniness. Thus, we introduce Oogiri-Master and Oogiri-Corpus, which are a benchmark and dataset designed to enable rigorous evaluation of humor understanding in LLMs. Each prompt is paired with approximately 100 diverse candidate responses, and funniness is rated independently by approximately 100 human judges without access to others' ratings, reducing popularity bias and enabling robust aggregation. Using Oogiri-Corpus, we conduct a quantitative analysis of the linguistic factors associated with funniness, such as text length, ambiguity, and incongruity resolution, and derive objective metrics for predicting human judgments. Subsequently, we benchmark a range of LLMs and human baselines in Oogiri-Master, demonstrating that state-of-the-art models approach human performance and that insight-augmented prompting improves the model performance. Our results provide a principled basis for evaluating and advancing humor understanding in LLMs.

Title: Beyond Heuristics: A Decision-Theoretic Framework for Agent Memory Management

Authors: Changzhi Sun, Xiangyu Chen, Jixiang Luo, Dell Zhang, Xuelong Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.21567
Pdf URL: https://arxiv.org/pdf/2512.21567
Copy Paste: [[2512.21567]] Beyond Heuristics: A Decision-Theoretic Framework for Agent Memory Management(https://arxiv.org/abs/2512.21567)
Keywords: language model, llm, agent
Abstract: External memory is a key component of modern large language model (LLM) systems, enabling long-term interaction and personalization. Despite its importance, memory management is still largely driven by hand-designed heuristics, offering little insight into the long-term and uncertain consequences of memory decisions. In practice, choices about what to read or write shape future retrieval and downstream behavior in ways that are difficult to anticipate. We argue that memory management should be viewed as a sequential decision-making problem under uncertainty, where the utility of memory is delayed and dependent on future interactions. To this end, we propose DAM (Decision-theoretic Agent Memory), a decision-theoretic framework that decomposes memory management into immediate information access and hierarchical storage maintenance. Within this architecture, candidate operations are evaluated via value functions and uncertainty estimators, enabling an aggregate policy to arbitrate decisions based on estimated long-term utility and risk. Our contribution is not a new algorithm, but a principled reframing that clarifies the limitations of heuristic approaches and provides a foundation for future research on uncertainty-aware memory systems.
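
The arbitration idea in the abstract (candidate memory operations scored by estimated value and uncertainty, with a policy choosing among them) can be sketched as below. The `Candidate` class, the linear `value - risk_aversion * uncertainty` score, and the numbers are assumptions; the paper frames the problem and does not prescribe this rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A hypothetical memory operation with estimated value and uncertainty."""
    name: str
    value: float        # estimated long-term utility
    uncertainty: float  # estimated risk of that estimate


def arbitrate(candidates, risk_aversion=1.0):
    """Pick the candidate with the best risk-adjusted score."""
    if not candidates:
        raise ValueError("at least one candidate operation is required")
    return max(candidates, key=lambda c: c.value - risk_aversion * c.uncertainty)


candidates = [
    Candidate("write_summary", value=0.9, uncertainty=0.6),
    Candidate("read_recent", value=0.7, uncertainty=0.1),
    Candidate("skip", value=0.0, uncertainty=0.0),
]
cautious_choice = arbitrate(candidates, risk_aversion=1.0)
risk_neutral_choice = arbitrate(candidates, risk_aversion=0.0)
```

With `risk_aversion=1.0` the low-uncertainty read wins; with `risk_aversion=0.0` the higher-value but riskier write wins, which shows how the same estimates can yield different memory decisions.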

Title: A Unified Definition of Hallucination, Or: It's the World Model, Stupid

Authors: Emmy Liu, Varun Gangal, Chelsea Zou, Xiaoqi Huang, Michael Yu, Alex Chang, Zhuofu Tao, Sachin Kumar, Steven Y. Feng
Subjects: cs.CL, cs.AI, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2512.21577
Pdf URL: https://arxiv.org/pdf/2512.21577
Copy Paste: [[2512.21577]] A Unified Definition of Hallucination, Or: It's the World Model, Stupid(https://arxiv.org/abs/2512.21577)
Keywords: language model, hallucination
Abstract: Despite numerous attempts to solve the issue of hallucination since the inception of neural language models, it remains a problem in even frontier large language models today. Why is this the case? We walk through definitions of hallucination used in the literature from a historical perspective up to the current day, and fold them into a single definition of hallucination, wherein different prior definitions focus on different aspects of our definition. At its core, we argue that hallucination is simply inaccurate (internal) world modeling, in a form where it is observable to the user (e.g., stating a fact which contradicts a knowledge base, or producing a summary which contradicts a known source). By varying the reference world model as well as the knowledge conflict policy (e.g., knowledge base vs. in-context), we arrive at the different existing definitions of hallucination present in the literature. We argue that this unified view is useful because it forces evaluations to make clear their assumed "world" or source of truth, clarifies what should and should not be called hallucination (as opposed to planning or reward/incentive-related errors), and provides a common language to compare benchmarks and mitigation techniques. Building on this definition, we outline plans for a family of benchmarks in which hallucinations are defined as mismatches with synthetic but fully specified world models in different environments, and sketch out how these benchmarks can use such settings to stress-test and improve the world modeling components of language models.
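
The claim that existing definitions differ in their reference world model and knowledge conflict policy can be illustrated with a small sketch. Everything here is hypothetical: facts are modeled as key-value pairs, and the two policy names (`context_first`, `kb_first`) are illustrative labels, not terminology from the paper.

```python
def reference_value(key, knowledge_base, context, policy="context_first"):
    """Look up the 'true' value for a key under a given conflict policy."""
    if policy == "context_first":
        sources = (context, knowledge_base)
    else:  # "kb_first"
        sources = (knowledge_base, context)
    for source in sources:
        if key in source:
            return source[key]
    return None  # the reference world is silent on this key


def find_hallucinations(claims, knowledge_base, context, policy="context_first"):
    """Return the keys of claims that contradict the reference world model."""
    flagged = []
    for key, value in claims.items():
        expected = reference_value(key, knowledge_base, context, policy)
        if expected is not None and expected != value:
            flagged.append(key)
    return flagged


knowledge_base = {("France", "capital"): "Paris"}
context = {("France", "capital"): "Lyon"}   # a counterfactual source document
claims = {("France", "capital"): "Paris"}   # what the model output states

flagged_context_first = find_hallucinations(claims, knowledge_base, context, "context_first")
flagged_kb_first = find_hallucinations(claims, knowledge_base, context, "kb_first")
```

The same output is flagged when the in-context source is the reference and not flagged when the knowledge base is, which is why an evaluation needs to state its assumed source of truth.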

Title: Gamayun's Path to Multilingual Mastery: Cost-Efficient Training of a 1.5B-Parameter LLM

Authors: Alexander Podolskiy, Semen Molokov, Timofey Gerasin, Maksim Titov, Alexey Rukhovich, Artem Khrapov, Kirill Morozov, Evgeny Tetin, Constantine Korikov, Pavel Efimov, Polina Lazukova, Yuliya Skripkar, Nikita Okhotnikov, Irina Piontkovskaya, Meng Xiaojun, Zou Xueyi, Zhang Zhenhe
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.21580
Pdf URL: https://arxiv.org/pdf/2512.21580
Copy Paste: [[2512.21580]] Gamayun's Path to Multilingual Mastery: Cost-Efficient Training of a 1.5B-Parameter LLM(https://arxiv.org/abs/2512.21580)
Keywords: language model, llm
Abstract: We present Gamayun, a 1.5B-parameter multilingual language model trained entirely from scratch on 2.5T tokens. Designed for efficiency and deployment in resource-constrained environments, Gamayun addresses the lack of research on small non-English-centric LLMs by adopting a novel two-stage pre-training strategy: balanced multilingual training for cross-lingual alignment, followed by high-quality English enrichment to transfer performance gains across languages. Our model supports 12 languages, with special focus on Russian. Despite a significantly smaller training budget than comparable models, Gamayun outperforms LLaMA3.2-1B (9T tokens) on all considered benchmarks, and surpasses Qwen2.5-1.5B (18T tokens) on a wide range of English and multilingual tasks. It matches or exceeds Qwen3 (36T tokens) on most tasks outside advanced STEM, achieving state-of-the-art results in Russian, including the MERA benchmark, among the models of comparable size (1-2B parameters).

Title: Heaven-Sent or Hell-Bent? Benchmarking the Intelligence and Defectiveness of LLM Hallucinations

Authors: Chengxu Yang, Jingling Yuan, Siqi Cai, Jiawei Jiang, Chuang Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.21635
Pdf URL: https://arxiv.org/pdf/2512.21635
Copy Paste: [[2512.21635]] Heaven-Sent or Hell-Bent? Benchmarking the Intelligence and Defectiveness of LLM Hallucinations(https://arxiv.org/abs/2512.21635)
Keywords: language model, llm, hallucination, prompt
Abstract: Hallucinations in large language models (LLMs) are commonly regarded as errors to be minimized. However, recent perspectives suggest that some hallucinations may encode creative or epistemically valuable content, a dimension that remains underquantified in current literature. Existing hallucination detection methods primarily focus on factual consistency, struggling to handle heterogeneous scientific tasks and balance creativity with accuracy. To address these challenges, we propose HIC-Bench, a novel evaluation framework that categorizes hallucinations into Intelligent Hallucinations (IH) and Defective Hallucinations (DH), enabling systematic investigation of their interplay in LLM creativity. HIC-Bench features three core characteristics: (1) Structured IH/DH Assessment, using a multi-dimensional metric matrix integrating Torrance Tests of Creative Thinking (TTCT) metrics (Originality, Feasibility, Value) with hallucination-specific dimensions (scientific plausibility, factual deviation); (2) Cross-Domain Applicability, spanning ten scientific domains with open-ended innovation tasks; and (3) Dynamic Prompt Optimization, leveraging the Dynamic Hallucination Prompt (DHP) to guide models toward creative and reliable outputs. The evaluation process employs multiple LLM judges, averaging scores to mitigate bias, with human annotators verifying IH/DH classifications. Experimental results reveal a nonlinear relationship between IH and DH, demonstrating that creativity and correctness can be jointly optimized. These insights position IH as a catalyst for creativity and reveal the ability of LLM hallucinations to drive scientific [...]. The HIC-Bench offers a valuable platform for advancing research into the creative intelligence of LLM hallucinations.

Title: MoRAgent: Parameter Efficient Agent Tuning with Mixture-of-Roles

Authors: Jing Han, Binwei Yan, Tianyu Guo, Zheyuan Bai, Mengyu Zheng, Hanting Chen, Ying Nie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.21708
Pdf URL: https://arxiv.org/pdf/2512.21708
Copy Paste: [[2512.21708]] MoRAgent: Parameter Efficient Agent Tuning with Mixture-of-Roles(https://arxiv.org/abs/2512.21708)
Keywords: language model, llm, agent
Abstract: Despite recent advancements in fine-tuning large language models (LLMs) to facilitate agent tasks, parameter-efficient fine-tuning (PEFT) methodologies for agents remain largely unexplored. In this paper, we introduce three key strategies for PEFT in agent tasks: 1) Inspired by the increasingly dominant Reason+Action paradigm, we first decompose the capabilities necessary for the agent tasks into three distinct roles: reasoner, executor, and summarizer. The reasoner is responsible for comprehending the user's query and determining the next role based on the execution trajectory. The executor is tasked with identifying the appropriate functions and parameters to invoke. The summarizer conveys the distilled information from conversations back to the user. 2) We then propose the Mixture-of-Roles (MoR) framework, which comprises three specialized Low-Rank Adaptation (LoRA) groups, each designated to fulfill a distinct role. By focusing on their respective specialized capabilities and engaging in collaborative interactions, these LoRAs collectively accomplish the agent task. 3) To effectively fine-tune the framework, we develop a multi-role data generation pipeline based on publicly available datasets, incorporating role-specific content completion and reliability verification. We conduct extensive experiments and thorough ablation studies on various LLMs and agent benchmarks, demonstrating the effectiveness of the proposed method. This project is publicly available at this https URL.
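
The division of labor among the three roles in the abstract can be sketched as a simple control loop. This is an illustration under heavy assumptions: each role is a hand-written stub standing in for a LoRA-adapted model, and the `TOOLS` table, the routing rule, and the fixed `add` call are hypothetical.

```python
# Hypothetical tool table; a real agent would expose actual functions here.
TOOLS = {"add": lambda a, b: a + b}


def reasoner(query, trajectory):
    """Decide the next role from the execution trajectory (stub rule)."""
    return "executor" if not trajectory else "summarizer"


def executor(query, trajectory):
    """Choose a function and its arguments (stub: always the same call)."""
    return {"function": "add", "arguments": {"a": 2, "b": 3}}


def summarizer(query, trajectory):
    """Turn the last tool result into a reply for the user."""
    last = trajectory[-1]
    return f"{last['function']} returned {last['result']}"


def run_agent(query, max_steps=5):
    """Route between roles until the summarizer produces the final answer."""
    trajectory = []
    for _ in range(max_steps):
        role = reasoner(query, trajectory)
        if role == "executor":
            call = executor(query, trajectory)
            result = TOOLS[call["function"]](**call["arguments"])
            trajectory.append({**call, "result": result})
        else:
            return summarizer(query, trajectory)
    raise RuntimeError("step budget exhausted before a summary was produced")


answer = run_agent("What is 2 + 3?")
```

In this sketch the reasoner routes once to the executor and then to the summarizer, so `answer` is the string "add returned 5".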

Title: Detecting AI-Generated Paraphrases in Bengali: A Comparative Study of Zero-Shot and Fine-Tuned Transformers

Authors: Md. Rakibul Islam, Most. Sharmin Sultana Samu, Md. Zahid Hossain, Farhad Uz Zaman, Md. Kamrozzaman Bhuiyan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.21709
Pdf URL: https://arxiv.org/pdf/2512.21709
Copy Paste: [[2512.21709]] Detecting AI-Generated Paraphrases in Bengali: A Comparative Study of Zero-Shot and Fine-Tuned Transformers(https://arxiv.org/abs/2512.21709)
Keywords: language model, llm
Abstract: Large language models (LLMs) can produce text that closely resembles human writing. This capability raises concerns about misuse, including disinformation and content manipulation. Detecting AI-generated text is essential to maintain authenticity and prevent malicious applications. Existing research has addressed detection in multiple languages, but the Bengali language remains largely unexplored. Bengali's rich vocabulary and complex structure make distinguishing human-written and AI-generated text particularly challenging. This study investigates five transformer-based models: XLM-RoBERTa-Large, mDeBERTaV3-Base, BanglaBERT-Base, IndicBERT-Base and MultilingualBERT-Base. Zero-shot evaluation shows that all models perform near chance levels (around 50% accuracy) and highlights the need for task-specific fine-tuning. Fine-tuning significantly improves performance, with XLM-RoBERTa, mDeBERTa and MultilingualBERT achieving around 91% on both accuracy and F1-score. IndicBERT demonstrates comparatively weaker performance, indicating limited effectiveness in fine-tuning for this task. This work advances AI-generated text detection in Bengali and establishes a foundation for building robust systems to counter AI-generated content.

Title: Do Latent Tokens Think? A Causal and Adversarial Analysis of Chain-of-Continuous-Thought

Authors: Yuyi Zhang, Boyu Tang, Tianjie Ju, Sufeng Duan, Gongshen Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.21711
Pdf URL: https://arxiv.org/pdf/2512.21711
Copy Paste: [[2512.21711]] Do Latent Tokens Think? A Causal and Adversarial Analysis of Chain-of-Continuous-Thought(https://arxiv.org/abs/2512.21711)
Keywords: language model, llm, chain-of-thought
Abstract: Latent tokens are gaining attention for enhancing reasoning in large language models (LLMs), yet their internal mechanisms remain unclear. This paper examines the problem from a reliability perspective, uncovering fundamental weaknesses: latent tokens function as uninterpretable placeholders rather than encoding faithful reasoning. While resistant to perturbation, they promote shortcut usage over genuine reasoning. We focus on Chain-of-Continuous-Thought (COCONUT), which claims better efficiency and stability than explicit Chain-of-Thought (CoT) while maintaining performance. We investigate this through two complementary approaches. First, steering experiments perturb specific token subsets, namely COCONUT and explicit CoT. Unlike CoT tokens, COCONUT tokens show minimal sensitivity to steering and lack reasoning-critical information. Second, shortcut experiments evaluate models under biased and out-of-distribution settings. Results on MMLU and HotpotQA demonstrate that COCONUT consistently exploits dataset artifacts, inflating benchmark performance without true reasoning. These findings reposition COCONUT as a pseudo-reasoning mechanism: it generates plausible traces that conceal shortcut dependence rather than faithfully representing reasoning processes.

Title: CATCH: A Controllable Theme Detection Framework with Contextualized Clustering and Hierarchical Generation

Authors: Rui Ke, Jiahui Xu, Shenghao Yang, Kuang Wang, Feng Jiang, Haizhou Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.21715
Pdf URL: https://arxiv.org/pdf/2512.21715
Copy Paste: [[2512.21715]] CATCH: A Controllable Theme Detection Framework with Contextualized Clustering and Hierarchical Generation(https://arxiv.org/abs/2512.21715)
Keywords: llm
Abstract: Theme detection is a fundamental task in user-centric dialogue systems, aiming to identify the latent topic of each utterance without relying on predefined schemas. Unlike intent induction, which operates within fixed label spaces, theme detection requires cross-dialogue consistency and alignment with personalized user preferences, posing significant challenges. Existing methods often struggle with sparse, short utterances for accurate topic representation and fail to capture user-level thematic preferences across dialogues. To address these challenges, we propose CATCH (Controllable Theme Detection with Contextualized Clustering and Hierarchical Generation), a unified framework that integrates three core components: (1) context-aware topic representation, which enriches utterance-level semantics using surrounding topic segments; (2) preference-guided topic clustering, which jointly models semantic proximity and personalized feedback to align themes across dialogue; and (3) a hierarchical theme generation mechanism designed to suppress noise and produce robust, coherent topic labels. Experiments on a multi-domain customer dialogue benchmark (DSTC-12) demonstrate the effectiveness of CATCH with 8B LLM in both theme clustering and topic generation quality.

Title: Ara-HOPE: Human-Centric Post-Editing Evaluation for Dialectal Arabic to Modern Standard Arabic Translation

Authors: Abdullah Alabdullah, Lifeng Han, Chenghua Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.21787
Pdf URL: https://arxiv.org/pdf/2512.21787
Copy Paste: [[2512.21787]] Ara-HOPE: Human-Centric Post-Editing Evaluation for Dialectal Arabic to Modern Standard Arabic Translation(https://arxiv.org/abs/2512.21787)
Keywords: gpt
Abstract: Dialectal Arabic to Modern Standard Arabic (DA-MSA) translation is a challenging task in Machine Translation (MT) due to significant lexical, syntactic, and semantic divergences between Arabic dialects and MSA. Existing automatic evaluation metrics and general-purpose human evaluation frameworks struggle to capture dialect-specific MT errors, hindering progress in translation assessment. This paper introduces Ara-HOPE, a human-centric post-editing evaluation framework designed to systematically address these challenges. The framework includes a five-category error taxonomy and a decision-tree annotation protocol. Through comparative evaluation of three MT systems (Arabic-centric Jais, general-purpose GPT-3.5, and baseline NLLB-200), Ara-HOPE effectively highlights systematic performance differences between these systems. The results show that dialect-specific terminology and semantic preservation remain the most persistent challenges in DA-MSA translation. Ara-HOPE establishes a new framework for evaluating Dialectal Arabic MT quality and provides actionable guidance for improving dialect-aware MT systems.

Title: Five Years of SciCap: What We Learned and Future Directions for Scientific Figure Captioning

Authors: Ting-Hao K. Huang, Ryan A. Rossi, Sungchul Kim, Tong Yu, Ting-Yao E. Hsu, Ho Yin (Sam) Ng, C. Lee Giles
Subjects: cs.CL, cs.AI, cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2512.21789
Pdf URL: https://arxiv.org/pdf/2512.21789
Copy Paste: [[2512.21789]] Five Years of SciCap: What We Learned and Future Directions for Scientific Figure Captioning(https://arxiv.org/abs/2512.21789)
Keywords: language model, llm
Abstract: Between 2021 and 2025, the SciCap project grew from a small seed-funded idea at The Pennsylvania State University (Penn State) into one of the central efforts shaping the scientific figure-captioning landscape. Supported by a Penn State seed grant, Adobe, and the Alfred P. Sloan Foundation, what began as our attempt to test whether domain-specific training, which was successful in text models like SciBERT, could also work for figure captions expanded into a multi-institution collaboration. Over these five years, we curated, released, and continually updated a large collection of figure-caption pairs from arXiv papers, conducted extensive automatic and human evaluations on both generated and author-written captions, navigated the rapid rise of large language models (LLMs), launched annual challenges, and built interactive systems that help scientists write better captions. In this piece, we look back at the first five years of SciCap and summarize the key technical and methodological lessons we learned. We then outline five major unsolved challenges and propose directions for the next phase of research in scientific figure captioning.

Title: On The Conceptualization and Societal Impact of Cross-Cultural Bias

Authors: Vitthal Bhandari
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2512.21809
Pdf URL: https://arxiv.org/pdf/2512.21809
Copy Paste: [[2512.21809]] On The Conceptualization and Societal Impact of Cross-Cultural Bias(https://arxiv.org/abs/2512.21809)
Keywords: language model, llm
Abstract: Research has shown that while large language models (LLMs) can generate their responses based on cultural context, they are not perfect and tend to generalize across cultures. However, when evaluating the cultural bias of a language technology on any dataset, researchers may choose not to engage with stakeholders actually using that technology in real life, which evades the very fundamental problem they set out to address. Inspired by the work done by arXiv:2005.14050v2, I set out to analyse recent literature about identifying and evaluating cultural bias in Natural Language Processing (NLP). I picked out 20 papers published in 2025 about cultural bias and came up with a set of observations to allow NLP researchers in the future to conceptualize bias concretely and evaluate its harms effectively. My aim is to advocate for a robust assessment of the societal impact of language technologies exhibiting cross-cultural bias.

Title: Method Decoration (DeMe): A Framework for LLM-Driven Adaptive Method Generation in Dynamic IoT Environments

Authors: Hong Su
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.21817
Pdf URL: https://arxiv.org/pdf/2512.21817
Copy Paste: [[2512.21817]] Method Decoration (DeMe): A Framework for LLM-Driven Adaptive Method Generation in Dynamic IoT Environments(https://arxiv.org/abs/2512.21817)
Keywords: language model, llm, agent
Abstract: Intelligent IoT systems increasingly rely on large language models (LLMs) to generate task-execution methods for dynamic environments. However, existing approaches lack the ability to systematically produce new methods when facing previously unseen situations, and they often depend on fixed, device-specific logic that cannot adapt to changing environmental conditions. In this paper, we propose Method Decoration (DeMe), a general framework that modifies the method-generation path of an LLM using explicit decorations derived from hidden goals, accumulated learned methods, and environmental feedback. Unlike traditional rule augmentation, decorations in DeMe are not hardcoded; instead, they are extracted from universal behavioral principles, experience, and observed environmental differences. DeMe enables the agent to reshuffle the structure of its method path through pre-decoration, post-decoration, intermediate-step modification, and step insertion, thereby producing context-aware, safety-aligned, and environment-adaptive methods. Experimental results show that method decoration allows IoT devices to derive more appropriate methods when confronting unknown or faulty operating conditions.

Title: Knowledge Reasoning of Large Language Models Integrating Graph-Structured Information for Pest and Disease Control in Tobacco

Authors: Siyu Li, Chenwei Song, Wan Zhou, Xinyi Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.21837
Pdf URL: https://arxiv.org/pdf/2512.21837
Copy Paste: [[2512.21837]] Knowledge Reasoning of Large Language Models Integrating Graph-Structured Information for Pest and Disease Control in Tobacco(https://arxiv.org/abs/2512.21837)
Keywords: language model, llm, chat
Abstract: This paper proposes a large language model (LLM) approach that integrates graph-structured information for knowledge reasoning in tobacco pest and disease control. Built upon the GraphRAG framework, the proposed method enhances knowledge retrieval and reasoning by explicitly incorporating structured information from a domain-specific knowledge graph. Specifically, LLMs are first leveraged to assist in the construction of a tobacco pest and disease knowledge graph, which organizes key entities such as diseases, symptoms, control methods, and their relationships. Based on this graph, relevant knowledge is retrieved and integrated into the reasoning process to support accurate answer generation. The Transformer architecture is adopted as the core inference model, while a graph neural network (GNN) is employed to learn expressive node representations that capture both local and global relational information within the knowledge graph. A ChatGLM-based model serves as the backbone LLM and is fine-tuned using LoRA to achieve parameter-efficient adaptation. Extensive experimental results demonstrate that the proposed approach consistently outperforms baseline methods across multiple evaluation metrics, significantly improving both the accuracy and depth of reasoning, particularly in complex multi-hop and comparative reasoning scenarios.

Title: AlignAR: Generative Sentence Alignment for Arabic-English Parallel Corpora of Legal and Literary Texts

Authors: Baorong Huang, Ali Asiri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.21842
Pdf URL: https://arxiv.org/pdf/2512.21842
Copy Paste: [[2512.21842]] AlignAR: Generative Sentence Alignment for Arabic-English Parallel Corpora of Legal and Literary Texts(https://arxiv.org/abs/2512.21842)
Keywords: llm
Abstract: High-quality parallel corpora are essential for Machine Translation (MT) research and translation teaching. However, Arabic-English resources remain scarce and existing datasets mainly consist of simple one-to-one mappings. In this paper, we present AlignAR, a generative sentence alignment method, and a new Arabic-English dataset comprising complex legal and literary texts. Our evaluation demonstrates that "Easy" datasets lack the discriminatory power to fully assess alignment methods. By reducing one-to-one mappings in our "Hard" subset, we exposed the limitations of traditional alignment methods. In contrast, LLM-based approaches demonstrated superior robustness, achieving an overall F1-score of 85.5%, a 9% improvement over previous methods. Our datasets and codes are open-sourced at this https URL.

Title: HeartBench: Probing Core Dimensions of Anthropomorphic Intelligence in LLMs

Authors: Jiaxin Liu, Peiyi Tu, Wenyu Chen, Yihong Zhuang, Xinxia Ling, Anji Zhou, Chenxi Wang, Zhuo Han, Zhengkai Yang, Junbo Zhao, Zenan Huang, Yuanyuan Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.21849
Pdf URL: https://arxiv.org/pdf/2512.21849
Copy Paste: [[2512.21849]] HeartBench: Probing Core Dimensions of Anthropomorphic Intelligence in LLMs(https://arxiv.org/abs/2512.21849)
Keywords: language model, llm
Abstract: While Large Language Models (LLMs) have achieved remarkable success in cognitive and reasoning benchmarks, they exhibit a persistent deficit in anthropomorphic intelligence, that is, the capacity to navigate complex social, emotional, and ethical nuances. This gap is particularly acute in the Chinese linguistic and cultural context, where a lack of specialized evaluation frameworks and high-quality socio-emotional data impedes progress. To address these limitations, we present HeartBench, a framework designed to evaluate the integrated emotional, cultural, and ethical dimensions of Chinese LLMs. Grounded in authentic psychological counseling scenarios and developed in collaboration with clinical experts, the benchmark is structured around a theory-driven taxonomy comprising five primary dimensions and 15 secondary capabilities. We implement a case-specific, rubric-based methodology that translates abstract human-like traits into granular, measurable criteria through a "reasoning-before-scoring" evaluation protocol. Our assessment of 13 state-of-the-art LLMs indicates a substantial performance ceiling: even leading models achieve only 60% of the expert-defined ideal score. Furthermore, analysis using a difficulty-stratified "Hard Set" reveals a significant performance decay in scenarios involving subtle emotional subtexts and complex ethical trade-offs. HeartBench establishes a standardized metric for anthropomorphic AI evaluation and provides a methodological blueprint for constructing high-quality, human-aligned training data.

Title: TimeBill: Time-Budgeted Inference for Large Language Models

Authors: Qi Fan, An Zou, Yehan Ma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.21859
Pdf URL: https://arxiv.org/pdf/2512.21859
Copy Paste: [[2512.21859]] TimeBill: Time-Budgeted Inference for Large Language Models(https://arxiv.org/abs/2512.21859)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are increasingly deployed in time-critical systems, such as robotics, autonomous driving, embodied intelligence, and industrial automation, where generating accurate responses within a given time budget is crucial for decision-making, control, or safety-critical tasks. However, the auto-regressive generation process of LLMs makes it challenging to model and estimate the end-to-end execution time. Furthermore, existing efficient inference methods based on a fixed key-value (KV) cache eviction ratio struggle to adapt to varying tasks with diverse time budgets, where an improper eviction ratio may lead to incomplete inference or a drop in response performance. In this paper, we propose TimeBill, a novel time-budgeted inference framework for LLMs that balances the inference efficiency and response performance. To be more specific, we propose a fine-grained response length predictor (RLP) and an execution time estimator (ETE) to accurately predict the end-to-end execution time of LLMs. Following this, we develop a time-budgeted efficient inference approach that adaptively adjusts the KV cache eviction ratio based on execution time prediction and the given time budget. Finally, through extensive experiments, we demonstrate the advantages of TimeBill in improving task completion rate and maintaining response performance under various overrun strategies.
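The budget-driven choice of a KV cache eviction ratio described in the TimeBill abstract can be pictured with a minimal sketch. Everything below is an assumption made for illustration: the toy cost model in `estimate_time`, its constants, the candidate ratios, and the function names are hypothetical and are not TimeBill's RLP or ETE components, which the abstract does not specify.

```python
def estimate_time(prompt_len, resp_len, eviction_ratio,
                  prefill_cost=0.001, decode_cost=0.00001):
    """Toy execution-time model in seconds (an assumption, not the paper's ETE).

    Prefill scales with the prompt length; each decoded token costs time
    proportional to the number of KV entries it attends to.
    """
    kept = prompt_len * (1.0 - eviction_ratio)  # prompt KV entries kept after eviction
    prefill = prefill_cost * prompt_len
    # Token t attends to the kept prompt entries plus t previously generated tokens.
    decode = decode_cost * (resp_len * kept + resp_len * (resp_len - 1) / 2)
    return prefill + decode


def choose_eviction_ratio(prompt_len, predicted_resp_len, budget,
                          candidates=(0.0, 0.25, 0.5, 0.75, 0.9)):
    """Return the smallest candidate ratio whose estimated time fits the budget.

    Returns None when even the most aggressive candidate overruns the budget.
    """
    for ratio in sorted(candidates):
        if estimate_time(prompt_len, predicted_resp_len, ratio) <= budget:
            return ratio
    return None


if __name__ == "__main__":
    # Hypothetical request: 1000 prompt tokens, 100 predicted response tokens.
    for budget in (3.0, 1.6, 1.0):
        print(budget, choose_eviction_ratio(1000, 100, budget))
```

Preferring the smallest feasible ratio reflects the trade-off named in the abstract: evicting more of the cache saves time but risks degrading the response, so the sketch evicts only as much as the budget requires.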

Title: Bridging the Copyright Gap: Do Large Vision-Language Models Recognize and Respect Copyrighted Content?

Authors: Naen Xu, Jinghuai Zhang, Changjiang Li, Hengyu An, Chunyi Zhou, Jun Wang, Boyu Xu, Yuyuan Li, Tianyu Du, Shouling Ji
Subjects: cs.CL, cs.AI, cs.CR, cs.CY
Abstract URL: https://arxiv.org/abs/2512.21871
Pdf URL: https://arxiv.org/pdf/2512.21871
Copy Paste: [[2512.21871]] Bridging the Copyright Gap: Do Large Vision-Language Models Recognize and Respect Copyrighted Content?(https://arxiv.org/abs/2512.21871)
Keywords: language model
Abstract: Large vision-language models (LVLMs) have achieved remarkable advancements in multimodal reasoning tasks. However, their widespread accessibility raises critical concerns about potential copyright infringement. Will LVLMs accurately recognize and comply with copyright regulations when encountering copyrighted content (i.e., user input, retrieved documents) in the context? Failure to comply with copyright regulations may lead to serious legal and ethical consequences, particularly when LVLMs generate responses based on copyrighted materials (e.g., retrieved book excerpts, news reports). In this paper, we present a comprehensive evaluation of various LVLMs, examining how they handle copyrighted content, such as book excerpts, news articles, music lyrics, and code documentation, when it is presented as visual input. To systematically measure copyright compliance, we introduce a large-scale benchmark dataset comprising 50,000 multimodal query-content pairs designed to evaluate how effectively LVLMs handle queries that could lead to copyright infringement. Given that real-world copyrighted content may or may not include a copyright notice, the dataset includes query-content pairs in two distinct scenarios: with and without a copyright notice. For the former, we extensively cover four types of copyright notices to account for different cases. Our evaluation reveals that even state-of-the-art closed-source LVLMs exhibit significant deficiencies in recognizing and respecting copyrighted content, even when presented with the copyright notice. To address this limitation, we introduce a novel tool-augmented defense framework for copyright compliance, which reduces infringement risks in all scenarios. Our findings underscore the importance of developing copyright-aware LVLMs to ensure the responsible and lawful use of copyrighted content.

Title: CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics

Authors: Vaibhav Devraj, Dhruv Kumar, Jagat Sesh Challa
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.21877
Pdf URL: https://arxiv.org/pdf/2512.21877
Copy Paste: [[2512.21877]] CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics(https://arxiv.org/abs/2512.21877)
Keywords: language model, gpt, llm, prompt
Abstract: Cricket is the second most popular sport in the world, with a following of over 2.5 billion fans. Enthusiasts and analysts frequently seek advanced statistical insights, such as long-term historical performance trends or complex player comparisons, that are often unavailable through standard web searches. While Large Language Models (LLMs) have advanced significantly in Text-to-SQL tasks, their capability to handle the domain-specific nuances, complex schema variations, and multilingual requirements inherent to sports analytics remains under-explored. To investigate this potential capability gap, we present CricBench, a comprehensive benchmark suite for evaluating LLMs on specialized cricket data. To curate a "Gold Standard" dataset, we collaborate with domain experts in cricket and SQL to manually author complex queries, ensuring logical correctness. Recognizing linguistic diversity, we construct the benchmark in both English and Hindi, establishing a framework that is open for further extension to other regional languages. We evaluate six state-of-the-art models, including GPT-4o, Claude 3.7 Sonnet, and open-source models, using a strict evaluation protocol. Our results reveal that high performance on general benchmarks does not guarantee success in specialized domains. While the open-weights reasoning model DeepSeek R1 achieves state-of-the-art performance (50.6%), surpassing proprietary giants like Claude 3.7 Sonnet (47.7%) and GPT-4o (33.7%), it still exhibits a significant accuracy drop when moving from general benchmarks (BIRD) to CricBench. Furthermore, we observe that code-mixed Hindi queries frequently yield parity or higher accuracy compared to English, challenging the assumption that English is the optimal prompt language for specialized SQL tasks.

Title: Explainable Statute Prediction via Attention-based Model and LLM Prompting

Authors: Sachin Pawar, Girish Keshav Palshikar, Anindita Sinha Banerjee, Nitin Ramrakhiyani, Basit Ali
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.21902
Pdf URL: https://arxiv.org/pdf/2512.21902
Copy Paste: [[2512.21902]] Explainable Statute Prediction via Attention-based Model and LLM Prompting(https://arxiv.org/abs/2512.21902)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: In this paper, we explore the problem of automatic statute prediction, where a subset of relevant statutes is to be predicted for a given case description. Here, the term "statute" refers to a section, a sub-section, or an article of any specific Act. Addressing this problem would be useful in several applications, such as AI assistants for lawyers and legal question-answering systems. For better user acceptance of such Legal AI systems, we believe the predictions should also be accompanied by human-understandable explanations. We propose two techniques for addressing this problem of statute prediction with explanations: (i) AoS (Attention-over-Sentences), which uses attention over the sentences in a case description to predict the statutes relevant to it, and (ii) LLMPrompt, which prompts an LLM to predict as well as explain the relevance of a certain statute. AoS uses smaller language models, specifically sentence transformers, and is trained in a supervised manner, whereas LLMPrompt uses larger language models in a zero-shot manner and explores both standard and Chain-of-Thought (CoT) prompting techniques. Both models produce explanations for their predictions in human-understandable forms. We compare the statute prediction performance of the two proposed techniques with each other as well as with a set of competent baselines, across two popular datasets. We also evaluate the quality of the generated explanations through an automated counterfactual method as well as through human evaluation.

Title: Accelerate Speculative Decoding with Sparse Computation in Verification

Authors: Jikai Wang, Jianchao Tan, Yuxuan Hu, Jiayu Qin, Yerui Sun, Yuchen Xie, Xunliang Cai, Juntao Li, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.21911
Pdf URL: https://arxiv.org/pdf/2512.21911
Copy Paste: [[2512.21911]] Accelerate Speculative Decoding with Sparse Computation in Verification(https://arxiv.org/abs/2512.21911)
Keywords: language model, llm
Abstract: Speculative decoding accelerates autoregressive language model inference by verifying multiple draft tokens in parallel. However, the verification stage often becomes the dominant computational bottleneck, especially for long-context inputs and mixture-of-experts (MoE) models. Existing sparsification methods are designed primarily for standard token-by-token autoregressive decoding to remove substantial computational redundancy in LLMs. This work systematically applies different sparsification methods to the verification stage of speculative decoding and identifies structured redundancy across multiple dimensions. Based on these observations, we propose a sparse verification framework that jointly sparsifies attention, FFN, and MoE components during the verification stage to reduce the dominant computation cost. The framework also incorporates an inter-draft token and inter-layer retrieval reuse strategy that further reduces redundant computation without introducing additional training. Extensive experiments across summarization, question answering, and mathematical reasoning datasets demonstrate that the proposed methods achieve favorable efficiency-accuracy trade-offs while maintaining stable acceptance length.
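For readers unfamiliar with the verification step that this paper sparsifies, here is a minimal sketch of greedy draft verification. It is the generic speculative-decoding idea under simplifying assumptions, not the paper's sparse framework. `toy_target` is a hypothetical stand-in for the target model, and the loop queries it one position at a time for readability; real systems score all draft positions in a single parallel forward pass.

```python
def toy_target(context):
    """Hypothetical target model: greedily predicts (last token + 1) mod 10."""
    return (context[-1] + 1) % 10


def verify_draft(draft_tokens, target_next_token, prefix):
    """Greedy verification of a draft continuation.

    Draft tokens are accepted while they match the target model's own greedy
    choice. At the first mismatch the target's token is emitted instead and
    verification stops. If every draft token is accepted, the target
    contributes one extra token.
    """
    accepted = []
    context = list(prefix)
    for tok in draft_tokens:
        expected = target_next_token(context)
        if tok != expected:
            accepted.append(expected)  # correct the draft and stop
            return accepted
        accepted.append(tok)
        context.append(tok)
    accepted.append(target_next_token(context))  # bonus token after full acceptance
    return accepted


if __name__ == "__main__":
    print(verify_draft([1, 2, 5, 6], toy_target, [0]))  # draft diverges at the third token
    print(verify_draft([1, 2], toy_target, [0]))        # draft fully accepted
```

The number of tokens emitted per verification call corresponds to the acceptance length mentioned in the abstract: the longer the accepted run, the more target-model calls are saved.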

Title: SWE-RM: Execution-free Feedback For Software Engineering Agents

Authors: KaShun Shum, Binyuan Hui, Jiawei Chen, Lei Zhang, X. W., Jiaxi Yang, Yuzhen Huang, Junyang Lin, Junxian He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.21919
Pdf URL: https://arxiv.org/pdf/2512.21919
Copy Paste: [[2512.21919]] SWE-RM: Execution-free Feedback For Software Engineering Agents(https://arxiv.org/abs/2512.21919)
Keywords: agent
Abstract: Execution-based feedback like unit testing is widely used in the development of coding agents through test-time scaling (TTS) and reinforcement learning (RL). This paradigm requires scalable and reliable collection of unit test cases to provide accurate feedback, and the resulting feedback is often sparse and cannot effectively distinguish between trajectories that are both successful or both unsuccessful. In contrast, execution-free feedback from reward models can provide more fine-grained signals without depending on unit test cases. Despite this potential, execution-free feedback for realistic software engineering (SWE) agents remains underexplored. In aiming to develop versatile reward models that are effective across both TTS and RL, however, we observe that two verifiers with nearly identical TTS performance can yield very different results in RL. Intuitively, TTS primarily reflects the model's ability to select the best trajectory, but this ability does not necessarily generalize to RL. To address this limitation, we identify two additional aspects that are crucial for RL training: classification accuracy and calibration. We then conduct comprehensive controlled experiments to investigate how to train a robust reward model that performs well across these metrics. In particular, we analyze the impact of various factors such as training data scale, policy mixtures, and data source composition. Guided by these investigations, we introduce SWE-RM, an accurate and robust reward model adopting a mixture-of-experts architecture with 30B total parameters and 3B activated during inference. SWE-RM substantially improves SWE agents on both TTS and RL performance. For example, it increases the accuracy of Qwen3-Coder-Flash from 51.6% to 62.0%, and Qwen3-Coder-Max from 67.0% to 74.6% on SWE-Bench Verified using TTS, achieving new state-of-the-art performance among open-source models.
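The distinction the abstract draws between TTS selection and calibration can be made concrete with a small sketch. The code below is illustrative only: `best_of_n` and `expected_calibration_error` are hypothetical helper names, the scores are invented, and the binned expected calibration error is one common calibration measure that the paper may or may not use.

```python
def best_of_n(trajectories, scores):
    """TTS-style selection: return the trajectory with the highest reward score."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return trajectories[best_index]


def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE for binary outcomes.

    Predictions are grouped by predicted success probability; the result is
    the size-weighted average gap between the mean predicted probability and
    the empirical success rate in each bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_p = sum(p for p, _ in bucket) / len(bucket)
        success_rate = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / total * abs(mean_p - success_rate)
    return ece


if __name__ == "__main__":
    labels = [1, 0, 0]                      # only the first trajectory resolves the issue
    verifier_a = [0.90, 0.10, 0.20]         # invented scores
    verifier_b = [0.55, 0.45, 0.50]         # same ranking, compressed scores
    trajectories = ["t0", "t1", "t2"]
    print(best_of_n(trajectories, verifier_a), best_of_n(trajectories, verifier_b))
    print(expected_calibration_error(verifier_a, labels),
          expected_calibration_error(verifier_b, labels))
```

Both invented verifiers pick the same trajectory, so a best-of-n evaluation cannot tell them apart, yet their score magnitudes differ, which matters when the scores are used directly as RL rewards.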

Title: Broken Words, Broken Performance: Effect of Tokenization on Performance of LLMs

Authors: Sachin Pawar, Manoj Apte, Kshitij Jadhav, Girish Keshav Palshikar, Nitin Ramrakhiyani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.21933
Pdf URL: https://arxiv.org/pdf/2512.21933
Copy Paste: [[2512.21933]] Broken Words, Broken Performance: Effect of Tokenization on Performance of LLMs(https://arxiv.org/abs/2512.21933)
Keywords: language model, llm
Abstract: Tokenization is the first step in training any Large Language Model (LLM), where the text is split into a sequence of tokens as per the model's fixed vocabulary. This tokenization in LLMs is different from the traditional tokenization in NLP where the text is split into a sequence of "natural" words. In LLMs, a natural word may also be broken into multiple tokens due to limited vocabulary size of the LLMs (e.g., Mistral's tokenizer splits "martial" into "mart" and "ial"). In this paper, we hypothesize that such breaking of natural words negatively impacts LLM performance on various NLP tasks. To quantify this effect, we propose a set of penalty functions that compute a tokenization penalty for a given text for a specific LLM, indicating how "bad" the tokenization is. We establish statistical significance of our hypothesis on multiple NLP tasks for a set of different LLMs.
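As a rough illustration of what a tokenization penalty could look like, the sketch below scores a text by the fraction of its whitespace-delimited words that a tokenizer breaks into more than one token. The toy greedy longest-match tokenizer, the `VOCAB` set, and the name `fragmentation_penalty` are all assumptions made for this example; the paper's actual penalty functions are not given in the abstract.

```python
# Hypothetical toy vocabulary; real LLM vocabularies are far larger.
VOCAB = {"mart", "ial", "arts", "the"}


def greedy_tokenize(word, vocab):
    """Split a word by greedy longest-match against vocab.

    Characters that start no vocabulary entry become single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest candidate first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])
            i += 1
    return tokens


def fragmentation_penalty(text, vocab):
    """Fraction of whitespace-delimited words split into more than one token."""
    words = text.lower().split()
    if not words:
        return 0.0
    broken = sum(1 for w in words if len(greedy_tokenize(w, vocab)) > 1)
    return broken / len(words)


if __name__ == "__main__":
    print(greedy_tokenize("martial", VOCAB))
    print(fragmentation_penalty("the martial arts", VOCAB))
```

With this toy vocabulary, "martial" is split into "mart" and "ial", mirroring the example in the abstract, while "the" and "arts" stay whole, so one of the three words counts as broken.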

Title: Context as a Tool: Context Management for Long-Horizon SWE-Agents

Authors: Shukai Liu, Jian Yang, Bo Jiang, Yizhi Li, Jinyang Guo, Xianglong Liu, Bryan Dai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.22087
Pdf URL: https://arxiv.org/pdf/2512.22087
Copy Paste: [[2512.22087]] Context as a Tool: Context Management for Long-Horizon SWE-Agents(https://arxiv.org/abs/2512.22087)
Keywords: language model, agent
Abstract: Agents based on large language models have recently shown strong potential on real-world software engineering (SWE) tasks that require long-horizon interaction with repository-scale codebases. However, most existing agents rely on append-only context maintenance or passively triggered compression heuristics, which often lead to context explosion, semantic drift, and degraded reasoning in long-running interactions. We propose CAT, a new context management paradigm that elevates context maintenance to a callable tool integrated into the decision-making process of agents. CAT formalizes a structured context workspace consisting of stable task semantics, condensed long-term memory, and high-fidelity short-term interactions, and enables agents to proactively compress historical trajectories into actionable summaries at appropriate milestones. To support context management for SWE-agents, we propose a trajectory-level supervision framework, CAT-GENERATOR, based on an offline data construction pipeline that injects context-management actions into complete interaction trajectories. Using this framework, we train a context-aware model, SWE-Compressor. Experiments on SWE-Bench-Verified demonstrate that SWE-Compressor reaches a 57.6% solved rate and significantly outperforms ReAct-based agents and static compression baselines, while maintaining stable and scalable long-horizon reasoning under a bounded context budget.

Title: Introducing TrGLUE and SentiTurca: A Comprehensive Benchmark for Turkish General Language Understanding and Sentiment Analysis

Authors: Duygu Altinok
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.22100
Pdf URL: https://arxiv.org/pdf/2512.22100
Copy Paste: [[2512.22100]] Introducing TrGLUE and SentiTurca: A Comprehensive Benchmark for Turkish General Language Understanding and Sentiment Analysis(https://arxiv.org/abs/2512.22100)
Keywords: language model, llm
Abstract: Evaluating the performance of various model architectures, such as transformers, large language models (LLMs), and other NLP systems, requires comprehensive benchmarks that measure performance across multiple dimensions. Among these, the evaluation of natural language understanding (NLU) is particularly critical as it serves as a fundamental criterion for assessing model capabilities. Thus, it is essential to establish benchmarks that enable thorough evaluation and analysis of NLU abilities from diverse perspectives. While the GLUE benchmark has set a standard for evaluating English NLU, similar benchmarks have been developed for other languages, such as CLUE for Chinese, FLUE for French, and JGLUE for Japanese. However, no comparable benchmark currently exists for the Turkish language. To address this gap, we introduce TrGLUE, a comprehensive benchmark encompassing a variety of NLU tasks for Turkish. In addition, we present SentiTurca, a specialized benchmark for sentiment analysis. To support researchers, we also provide fine-tuning and evaluation code for transformer-based models, facilitating the effective use of these benchmarks. TrGLUE comprises Turkish-native corpora curated to mirror the domains and task formulations of GLUE-style evaluations, with labels obtained through a semi-automated pipeline that combines strong LLM-based annotation, cross-model agreement checks, and subsequent human validation. This design prioritizes linguistic naturalness, minimizes direct translation artifacts, and yields a scalable, reproducible workflow. With TrGLUE, our goal is to establish a robust evaluation framework for Turkish NLU, empower researchers with valuable resources, and provide insights into generating high-quality semi-automated datasets.