2026-01-05

Title: RIMRULE: Improving Tool-Using Language Agents via MDL-Guided Rule Learning

Authors: Xiang Gao, Yuguang Yao, Qi Zhang, Kaiwen Dong, Avinash Baidya, Ruocheng Guo, Hilaf Hasson, Kamalika Das
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.00086
Pdf URL: https://arxiv.org/pdf/2601.00086
Copy Paste: [[2601.00086]] RIMRULE: Improving Tool-Using Language Agents via MDL-Guided Rule Learning(https://arxiv.org/abs/2601.00086)
Keywords: language model, llm, prompt, agent
Abstract: Large language models (LLMs) often struggle to use tools reliably in domain-specific settings, where APIs may be idiosyncratic, under-documented, or tailored to private workflows. This highlights the need for effective adaptation to task-specific tools. We propose RIMRULE, a neuro-symbolic approach for LLM adaptation based on dynamic rule injection. Compact, interpretable rules are distilled from failure traces and injected into the prompt during inference to improve task performance. These rules are proposed by the LLM itself and consolidated using a Minimum Description Length (MDL) objective that favors generality and conciseness. Each rule is stored in both natural language and a structured symbolic form, supporting efficient retrieval at inference time. Experiments on tool-use benchmarks show that this approach improves accuracy on both seen and unseen tools without modifying LLM weights. It outperforms prompting-based adaptation methods and complements finetuning. Moreover, rules learned from one LLM can be reused to improve others, including long reasoning LLMs, highlighting the portability of symbolic knowledge across architectures.
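
The MDL-guided consolidation mentioned in the RIMRULE abstract can be illustrated with a toy two-part code: the cost of stating the selected rules plus the cost of the failures they leave unexplained. The sketch below is not the paper's objective. The word-count cost, the fixed `miss_cost` penalty, the greedy search, and the example rules are all assumptions made for illustration.

```python
def description_length(selected, rules, failures, miss_cost=20):
    """Toy two-part code length: rule text cost + penalty per uncovered failure."""
    covered = set()
    for name in selected:
        covered |= rules[name]["covers"]
    rule_cost = sum(len(rules[name]["text"].split()) for name in selected)
    return rule_cost + miss_cost * len(failures - covered)


def greedy_mdl_select(rules, failures, miss_cost=20):
    """Greedily add the rule that most reduces the total description length."""
    selected = []
    best = description_length(selected, rules, failures, miss_cost)
    improved = True
    while improved:
        improved = False
        choice = None
        for name in rules:
            if name in selected:
                continue
            cand = description_length(selected + [name], rules, failures, miss_cost)
            if cand < best:
                best, choice, improved = cand, name, True
        if improved:
            selected.append(choice)
    return selected, best


# Hypothetical failure traces (by id) and candidate rules proposed from them.
failures = {1, 2, 3, 4, 5}
rules = {
    "general": {"text": "always pass dates as ISO 8601 strings", "covers": {1, 2, 3}},
    "narrow_a": {"text": "for create_event pass start as ISO 8601 string", "covers": {1}},
    "narrow_b": {"text": "for update_event pass end as ISO 8601 string", "covers": {2}},
    "units": {"text": "amounts are integers in cents", "covers": {4, 5}},
}

selected, total = greedy_mdl_select(rules, failures)
print(selected, total)
```

Under these assumed costs, the general rule is preferred over the two narrow ones because it explains more failures per word, which is the kind of bias toward generality and conciseness the abstract describes.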

Title: Universal Adaptive Constraint Propagation: Scaling Structured Inference for Large Language Models via Meta-Reinforcement Learning

Authors: Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.00095
Pdf URL: https://arxiv.org/pdf/2601.00095
Copy Paste: [[2601.00095]] Universal Adaptive Constraint Propagation: Scaling Structured Inference for Large Language Models via Meta-Reinforcement Learning(https://arxiv.org/abs/2601.00095)
Keywords: language model, llm
Abstract: Large language models increasingly require structured inference, from JSON schema enforcement to multi-lingual parsing, where outputs must satisfy complex constraints. We introduce MetaJuLS, a meta-reinforcement learning approach that learns universal constraint propagation policies applicable across languages and tasks without task-specific retraining. By formulating structured inference as adaptive constraint propagation and training a Graph Attention Network with meta-learning, MetaJuLS achieves 1.5-2.0x speedups over GPU-optimized baselines while staying within 0.2% of the accuracy of state-of-the-art parsers. On Universal Dependencies across 10 languages and LLM-constrained generation (LogicBench, GSM8K-Constrained), MetaJuLS demonstrates rapid cross-domain adaptation: a policy trained on English parsing adapts to new languages and tasks with 5-10 gradient steps (5-15 seconds) rather than requiring hours of task-specific training. Mechanistic analysis reveals that the policy discovers human-like parsing strategies (easy-first) and novel non-intuitive heuristics. By reducing propagation steps in LLM deployments, MetaJuLS contributes to Green AI by directly reducing inference carbon footprint.

Title: Pat-DEVAL: Chain-of-Legal-Thought Evaluation for Patent Description

Authors: Yongmin Yoo, Kris W Pan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.00166
Pdf URL: https://arxiv.org/pdf/2601.00166
Copy Paste: [[2601.00166]] Pat-DEVAL: Chain-of-Legal-Thought Evaluation for Patent Description(https://arxiv.org/abs/2601.00166)
Keywords: language model, llm
Abstract: Patent descriptions must deliver comprehensive technical disclosure while meeting strict legal standards such as enablement and written description requirements. Although large language models have enabled end-to-end automated patent drafting, existing evaluation approaches fail to assess long-form structural coherence and statutory compliance specific to descriptions. We propose Pat-DEVAL, the first multi-dimensional evaluation framework dedicated to patent description bodies. Leveraging the LLM-as-a-judge paradigm, Pat-DEVAL introduces Chain-of-Legal-Thought (CoLT), a legally-constrained reasoning mechanism that enforces sequential patent-law-specific analysis. Experiments with patent-expert validation on our Pap2Pat-EvalGold dataset demonstrate that Pat-DEVAL achieves a Pearson correlation of 0.69, significantly outperforming baseline metrics and existing LLM evaluators. Notably, the framework exhibits a superior correlation of 0.73 in Legal-Professional Compliance, proving that the explicit injection of statutory constraints is essential for capturing nuanced legal validity. By establishing a new standard for ensuring both technical soundness and legal compliance, Pat-DEVAL provides a robust methodological foundation for the practical deployment of automated patent drafting systems.

Title: Knowledge Distillation for Temporal Knowledge Graph Reasoning with Large Language Models

Authors: Wang Xing, Wei Song, Siyu Lin, Chen Wu, Zhesi Li, Man Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.00202
Pdf URL: https://arxiv.org/pdf/2601.00202
Copy Paste: [[2601.00202]] Knowledge Distillation for Temporal Knowledge Graph Reasoning with Large Language Models(https://arxiv.org/abs/2601.00202)
Keywords: language model
Abstract: Reasoning over temporal knowledge graphs (TKGs) is fundamental to improving the efficiency and reliability of intelligent decision-making systems and has become a key technological foundation for future artificial intelligence applications. Despite recent progress, existing TKG reasoning models typically rely on large parameter sizes and intensive computation, leading to high hardware costs and energy consumption. These constraints hinder their deployment on resource-constrained, low-power, and distributed platforms that require real-time inference. Moreover, most existing model compression and distillation techniques are designed for static knowledge graphs and fail to adequately capture the temporal dependencies inherent in TKGs, often resulting in degraded reasoning performance. To address these challenges, we propose a distillation framework specifically tailored for temporal knowledge graph reasoning. Our approach leverages large language models as teacher models to guide the distillation process, enabling effective transfer of both structural and temporal reasoning capabilities to lightweight student models. By integrating large-scale public knowledge with task-specific temporal information, the proposed framework enhances the student model's ability to model temporal dynamics while maintaining a compact and efficient architecture. Extensive experiments on multiple publicly available benchmark datasets demonstrate that our method consistently outperforms strong baselines, achieving a favorable trade-off between reasoning accuracy, computational efficiency, and practical deployability.

Title: From Evidence-Based Medicine to Knowledge Graph: Retrieval-Augmented Generation for Sports Rehabilitation and a Domain Benchmark

Authors: Jinning Zhang, Jie Song, Wenhui Tu, Zecheng Li, Jingxuan Li, Jin Li, Xuan Liu, Taole Sha, Zichen Wei, Yan Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.00216
Pdf URL: https://arxiv.org/pdf/2601.00216
Copy Paste: [[2601.00216]] From Evidence-Based Medicine to Knowledge Graph: Retrieval-Augmented Generation for Sports Rehabilitation and a Domain Benchmark(https://arxiv.org/abs/2601.00216)
Keywords: language model, llm, retrieval-augmented generation
Abstract: In medicine, large language models (LLMs) increasingly rely on retrieval-augmented generation (RAG) to ground outputs in up-to-date external evidence. However, current RAG approaches focus primarily on performance improvements while overlooking evidence-based medicine (EBM) principles. This study addresses two key gaps: (1) the lack of PICO alignment between queries and retrieved evidence, and (2) the absence of evidence hierarchy considerations during reranking. We present a generalizable strategy for adapting EBM to graph-based RAG, integrating the PICO framework into knowledge graph construction and retrieval, and proposing a Bayesian-inspired reranking algorithm to calibrate ranking scores by evidence grade without introducing predefined weights. We validated this framework in sports rehabilitation, a literature-rich domain currently lacking RAG systems and benchmarks. We released a knowledge graph (357,844 nodes and 371,226 edges) and a reusable benchmark of 1,637 QA pairs. The system achieved 0.830 nugget coverage, 0.819 answer faithfulness, 0.882 semantic similarity, and 0.788 PICOT match accuracy. In a 5-point Likert evaluation, five expert clinicians rated the system 4.66-4.84 across factual accuracy, faithfulness, relevance, safety, and PICO alignment. These findings demonstrate that the proposed EBM adaptation strategy improves retrieval and answer quality and is transferable to other clinical domains. The released resources also help address the scarcity of RAG datasets in sports rehabilitation.

Title: JP-TL-Bench: Anchored Pairwise LLM Evaluation for Bidirectional Japanese-English Translation

Authors: Leonard Lin, Adam Lensenmayer
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.00223
Pdf URL: https://arxiv.org/pdf/2601.00223
Copy Paste: [[2601.00223]] JP-TL-Bench: Anchored Pairwise LLM Evaluation for Bidirectional Japanese-English Translation(https://arxiv.org/abs/2601.00223)
Keywords: llm
Abstract: We introduce JP-TL-Bench, a lightweight, open benchmark designed to guide the iterative development of Japanese-English translation systems. In this context, the challenge is often "which of these two good translations is better?" rather than "is this translation acceptable?" This distinction matters for Japanese-English, where subtle choices in politeness, implicature, ellipsis, and register strongly affect perceived naturalness. JP-TL-Bench uses a protocol built to make LLM judging both reliable and affordable: it evaluates a candidate model via reference-free, pairwise LLM comparisons against a fixed, versioned anchor set. Pairwise results are aggregated with a Bradley-Terry model and reported as win rates plus a normalized 0-10 "LT" score derived from a logistic transform of fitted log-strengths. Because each candidate is scored against the same frozen anchor set, scores are structurally stable given the same base set, judge, and aggregation code.
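
The aggregation step in the JP-TL-Bench abstract (pairwise results fitted with a Bradley-Terry model, then a logistic transform of the log-strengths to a 0-10 score) can be sketched as follows. The code fits strengths with the standard minorization-maximization update. The win matrix, the function names, and the exact form of the transform (zero-centered log-strengths, `scale=1.0`) are assumptions for illustration. The abstract does not give the benchmark's actual formula.

```python
import math


def fit_bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths with the MM update.

    wins[i][j] is the number of times item i beat item j.
    Returned strengths are normalized to a geometric mean of 1,
    so their logs are centered at zero.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            # Floor keeps the log below finite for an item with zero wins.
            new_p.append(max(total_wins / denom, 1e-12) if denom > 0 else p[i])
        log_mean = sum(math.log(x) for x in new_p) / n
        p = [x / math.exp(log_mean) for x in new_p]
    return p


def lt_score(log_strength, scale=1.0):
    """Hypothetical logistic map from a log-strength to a 0-10 score."""
    return 10.0 / (1.0 + math.exp(-log_strength / scale))


# Hypothetical pairwise results: 10 comparisons per pair of systems.
wins = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]
strengths = fit_bradley_terry(wins)
scores = [lt_score(math.log(s)) for s in strengths]
print([round(s, 2) for s in scores])
```

In this sketch a system whose log-strength equals the pool average maps to exactly 5.0. Scoring every candidate against the same fixed anchor pool is what keeps such scores comparable across runs.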

Title: Talk Less, Verify More: Improving LLM Assistants with Semantic Checks and Execution Feedback

Authors: Yan Sun, Ming Cai, Stanley Kok
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2601.00224
Pdf URL: https://arxiv.org/pdf/2601.00224
Copy Paste: [[2601.00224]] Talk Less, Verify More: Improving LLM Assistants with Semantic Checks and Execution Feedback(https://arxiv.org/abs/2601.00224)
Keywords: language model, llm
Abstract: As large language model (LLM) assistants become increasingly integrated into enterprise workflows, their ability to generate accurate, semantically aligned, and executable outputs is critical. However, current conversational business analytics (CBA) systems often lack built-in verification mechanisms, leaving users to manually validate potentially flawed results. This paper introduces two complementary verification techniques: Q*, which performs reverse translation and semantic matching between code and user intent, and Feedback+, which incorporates execution feedback to guide code refinement. Embedded within a generator-discriminator framework, these mechanisms shift validation responsibilities from users to the system. Evaluations on three benchmark datasets, Spider, Bird, and GSM8K, demonstrate that both Q* and Feedback+ reduce error rates and task completion time. The study also identifies reverse translation as a key bottleneck, highlighting opportunities for future improvement. Overall, this work contributes a design-oriented framework for building more reliable, enterprise-grade GenAI systems capable of trustworthy decision support.
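
The idea of using execution feedback to guide code refinement can be shown with a minimal loop: generate a query, run it, and pass any execution error back to the generator. This is a sketch of the general pattern, not the paper's Feedback+ implementation. The `refine_with_feedback` function, the stub generator that stands in for an LLM, and the `sales` table are hypothetical.

```python
import sqlite3


def refine_with_feedback(generate, conn, question, max_rounds=3):
    """Call generate(question, feedback) until the SQL executes or rounds run out."""
    feedback = None
    for _ in range(max_rounds):
        sql = generate(question, feedback)
        try:
            rows = conn.execute(sql).fetchall()
            return sql, rows
        except sqlite3.Error as exc:
            # The error message becomes the feedback for the next attempt.
            feedback = f"Query failed: {exc}"
    raise RuntimeError("no executable query within the round limit")


def stub_generator(question, feedback):
    """Stand-in for an LLM: the first attempt has a wrong table name."""
    if feedback is None:
        return "SELECT SUM(amount) FROM sale"
    return "SELECT SUM(amount) FROM sales"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?)", [(10.0,), (5.5,)])

sql, rows = refine_with_feedback(stub_generator, conn, "What is total revenue?")
print(sql, rows)
```

A loop like this only catches queries that fail to run. A query that executes but answers the wrong question would need a separate semantic check, which is the role the abstract assigns to Q*.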

Title: Parallel Universes, Parallel Languages: A Comprehensive Study on LLM-based Multilingual Counterfactual Example Generation

Authors: Qianli Wang, Van Bach Nguyen, Yihong Liu, Fedor Splitt, Nils Feldhus, Christin Seifert, Hinrich Schütze, Sebastian Möller, Vera Schmitt
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.00263
Pdf URL: https://arxiv.org/pdf/2601.00263
Copy Paste: [[2601.00263]] Parallel Universes, Parallel Languages: A Comprehensive Study on LLM-based Multilingual Counterfactual Example Generation(https://arxiv.org/abs/2601.00263)
Keywords: language model, llm
Abstract: Counterfactuals refer to minimally edited inputs that cause a model's prediction to change, serving as a promising approach to explaining the model's behavior. Large language models (LLMs) excel at generating English counterfactuals and demonstrate multilingual proficiency. However, their effectiveness in generating multilingual counterfactuals remains unclear. To this end, we conduct a comprehensive study on multilingual counterfactuals. We first conduct automatic evaluations on both directly generated counterfactuals in the target languages and those derived via English translation across six languages. Although translation-based counterfactuals offer higher validity than their directly generated counterparts, they demand substantially more modifications and still fall short of matching the quality of the original English counterfactuals. Second, we find the patterns of edits applied to high-resource European-language counterfactuals to be remarkably similar, suggesting that cross-lingual perturbations follow common strategic principles. Third, we identify and categorize four main types of errors that consistently appear in the generated counterfactuals across languages. Finally, we reveal that multilingual counterfactual data augmentation (CDA) yields larger model performance improvements than cross-lingual CDA, especially for lower-resource languages. Yet, the imperfections of the generated counterfactuals limit gains in model performance and robustness.

Title: Beyond Perfect APIs: A Comprehensive Evaluation of LLM Agents Under Real-World API Complexity

Authors: Doyoung Kim (1 and 2), Zhiwei Ren (1 and 3), Jie Hao (1), Zhongkai Sun (1), Lichao Wang (1), Xiyao Ma (1), Zack Ye (1), Xu Han (1), Jun Yin (1), Heng Ji (4), Wei Shen (1), Xing Fan (1), Benjamin Yao (1), Chenlei Guo (1) ((1) Amazon, (2) KAIST, (3) University of Pittsburgh, (4) University of Illinois Urbana-Champaign)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.00268
Pdf URL: https://arxiv.org/pdf/2601.00268
Copy Paste: [[2601.00268]] Beyond Perfect APIs: A Comprehensive Evaluation of LLM Agents Under Real-World API Complexity(https://arxiv.org/abs/2601.00268)
Keywords: language model, llm, agent
Abstract: We introduce WildAGTEval, a benchmark designed to evaluate large language model (LLM) agents' function-calling capabilities under realistic API complexity. Unlike prior work that assumes an idealized API system and disregards real-world factors such as noisy API outputs, WildAGTEval accounts for two dimensions of real-world complexity: 1. API specification, which includes detailed documentation and usage constraints, and 2. API execution, which captures runtime challenges. Consequently, WildAGTEval offers (i) an API system encompassing 60 distinct complexity scenarios that can be composed into approximately 32K test configurations, and (ii) user-agent interactions for evaluating LLM agents on these scenarios. Using WildAGTEval, we systematically assess several advanced LLMs and observe that most scenarios are challenging, with irrelevant information complexity posing the greatest difficulty and reducing the performance of strong LLMs by 27.3%. Furthermore, our qualitative analysis reveals that LLMs occasionally distort user intent merely to claim task completion, critically affecting user satisfaction.

Title: Can Large Language Models Still Explain Themselves? Investigating the Impact of Quantization on Self-Explanations

Authors: Qianli Wang, Nils Feldhus, Pepa Atanasova, Fedor Splitt, Simon Ostermann, Sebastian Möller, Vera Schmitt
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.00282
Pdf URL: https://arxiv.org/pdf/2601.00282
Copy Paste: [[2601.00282]] Can Large Language Models Still Explain Themselves? Investigating the Impact of Quantization on Self-Explanations(https://arxiv.org/abs/2601.00282)
Keywords: language model, llm
Abstract: Quantization is widely used to accelerate inference and streamline the deployment of large language models (LLMs), yet its effects on self-explanations (SEs) remain unexplored. SEs, generated by LLMs to justify their own outputs, require reasoning about the model's own decision-making process, a capability that may exhibit particular sensitivity to quantization. As SEs are increasingly relied upon for transparency in high-stakes applications, understanding whether and to what extent quantization degrades SE quality and faithfulness is critical. To address this gap, we examine two types of SEs: natural language explanations (NLEs) and counterfactual examples, generated by LLMs quantized using three common techniques at distinct bit widths. Our findings indicate that quantization typically leads to moderate declines in both SE quality (up to 4.4%) and faithfulness (up to 2.38%). The user study further demonstrates that quantization diminishes both the coherence and trustworthiness of SEs (up to 8.5%). Compared to smaller models, larger models show limited resilience to quantization in terms of SE quality but better maintain faithfulness. Moreover, no quantization technique consistently excels across task accuracy, SE quality, and faithfulness. Given that quantization's impact varies by context, we recommend validating SE quality for specific use cases, especially for NLEs, which show greater sensitivity. Nonetheless, the relatively minor deterioration in SE quality and faithfulness does not undermine quantization's effectiveness as a model compression technique.

Title: Robust Uncertainty Quantification for Factual Generation of Large Language Models

Authors: Yuhao Zhang, Zhongliang Yang, Linna Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.00348
Pdf URL: https://arxiv.org/pdf/2601.00348
Copy Paste: [[2601.00348]] Robust Uncertainty Quantification for Factual Generation of Large Language Models(https://arxiv.org/abs/2601.00348)
Keywords: language model, llm, hallucination, prompt
Abstract: The rapid advancement of large language model (LLM) technology has facilitated its integration into various domains of professional and daily life. However, the persistent challenge of LLM hallucination has emerged as a critical limitation, significantly compromising the reliability and trustworthiness of AI-generated content. This challenge has garnered significant attention within the scientific community, prompting extensive research into hallucination detection and mitigation strategies. Current methodological frameworks reveal a critical limitation: traditional uncertainty quantification approaches are effective primarily within conventional question-answering paradigms, yet exhibit notable deficiencies when confronted with non-canonical or adversarial questioning strategies. This performance gap raises substantial concerns regarding the dependability of LLM responses in real-world applications requiring robust critical thinking capabilities. This study aims to fill this gap by proposing an uncertainty quantification scenario for generation tasks involving multiple facts. We construct a set of trap questions containing fake names. Based on this scenario, we propose a robust uncertainty quantification method (RU). A series of experiments verifies its effectiveness. The results show that the constructed set of trap questions performs excellently. Moreover, when compared with baseline methods on four different models, the proposed method improves ROCAUC by 0.1-0.2 on average over the best-performing baseline, providing new insights and methods for addressing the hallucination issue of LLMs.
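
The ROCAUC figures quoted in this abstract measure how well an uncertainty score separates hallucinated answers from correct ones. A minimal sketch of that metric, computed as the fraction of (positive, negative) pairs ranked correctly, is shown below. The labels and scores are made-up illustration data, not results from the paper.

```python
def roc_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    labels: 1 for a hallucinated answer, 0 for a correct one.
    scores: uncertainty scores, higher meaning more uncertain.
    Tied pairs count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical data: six answers with uncertainty scores.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

On this scale 0.5 corresponds to random ranking and 1.0 to perfect separation, so an average gain of 0.1-0.2 is a large shift.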

Title: The Role of Mixed-Language Documents for Multilingual Large Language Model Pretraining

Authors: Jiandong Shao, Raphael Tang, Crystina Zhang, Karin Sevegnani, Pontus Stenetorp, Jianfei Yang, Yao Lu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.00364
Pdf URL: https://arxiv.org/pdf/2601.00364
Copy Paste: [[2601.00364]] The Role of Mixed-Language Documents for Multilingual Large Language Model Pretraining(https://arxiv.org/abs/2601.00364)
Keywords: language model
Abstract: Multilingual large language models achieve impressive cross-lingual performance despite largely monolingual pretraining. While bilingual data in pretraining corpora is widely believed to enable these abilities, details of its contributions remain unclear. We investigate this question by pretraining models from scratch under controlled conditions, comparing the standard web corpus with a monolingual-only version that removes all multilingual documents. Despite constituting only 2% of the corpus, removing bilingual data causes translation performance to drop 56% in BLEU, while behaviour on cross-lingual QA and general reasoning tasks remains stable, with training curves largely overlapping the baseline. To understand this asymmetry, we categorize bilingual data into parallel (14%), code-switching (72%), and miscellaneous documents (14%) based on the semantic relevance of content in different languages. We then conduct granular ablations by reintroducing parallel or code-switching data into the monolingual-only corpus. Our experiments reveal that parallel data almost fully restores translation performance (91% of the unfiltered baseline), whereas code-switching contributes minimally. Other cross-lingual tasks remain largely unaffected by either type. These findings reveal that translation critically depends on systematic token-level alignments from parallel data, whereas cross-lingual understanding and reasoning appear to be achievable even without bilingual data.

Title: Vision-Language Reasoning for Geolocalization: A Reinforcement Learning Approach

Authors: Biao Wu, Meng Fang, Ling Chen, Ke Xu, Tao Cheng, Jun Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.00388
Pdf URL: https://arxiv.org/pdf/2601.00388
Copy Paste: [[2601.00388]] Vision-Language Reasoning for Geolocalization: A Reinforcement Learning Approach(https://arxiv.org/abs/2601.00388)
Keywords: language model
Abstract: Recent advances in vision-language models have opened up new possibilities for reasoning-driven image geolocalization. However, existing approaches often rely on synthetic reasoning annotations or external image retrieval, which can limit interpretability and generalizability. In this paper, we present Geo-R, a retrieval-free framework that uncovers structured reasoning paths from existing ground-truth coordinates and optimizes geolocation accuracy via reinforcement learning. We propose the Chain of Region, a rule-based hierarchical reasoning paradigm that generates precise, interpretable supervision by mapping GPS coordinates to geographic entities (e.g., country, province, city) without relying on model-generated or synthetic labels. Building on this, we introduce a lightweight reinforcement learning strategy with coordinate-aligned rewards based on Haversine distance, enabling the model to refine predictions through spatially meaningful feedback. Our approach bridges structured geographic reasoning with direct spatial supervision, yielding improved localization accuracy, stronger generalization, and more transparent inference. Experimental results across multiple benchmarks confirm the effectiveness of Geo-R, establishing a new retrieval-free paradigm for scalable and interpretable image geolocalization. To facilitate further research and ensure reproducibility, both the model and code will be made publicly available.
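
The coordinate-aligned reward based on Haversine distance can be illustrated with a short sketch. The Haversine formula itself is standard. The exponential reward shape, the `scale_km` value, and the function names are assumptions, since the abstract does not specify how Geo-R turns distance into a reward.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # min() guards against rounding pushing sqrt(a) slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def distance_reward(pred, truth, scale_km=750.0):
    """Hypothetical reward in (0, 1]: 1 at zero distance, decaying with distance."""
    return math.exp(-haversine_km(*pred, *truth) / scale_km)


paris = (48.8566, 2.3522)
london = (51.5074, -0.1278)
tokyo = (35.6762, 139.6503)

print(round(haversine_km(*paris, *london), 1))
print(distance_reward(london, paris) > distance_reward(tokyo, paris))
```

A smooth, distance-based reward of this kind gives partial credit for near misses, which is one way to read the abstract's "spatially meaningful feedback".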

Title: Do LLMs Judge Distantly Supervised Named Entity Labels Well? Constructing the JudgeWEL Dataset

Authors: Alistair Plum, Laura Bernardy, Tharindu Ranasinghe
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.00411
Pdf URL: https://arxiv.org/pdf/2601.00411
Copy Paste: [[2601.00411]] Do LLMs Judge Distantly Supervised Named Entity Labels Well? Constructing the JudgeWEL Dataset(https://arxiv.org/abs/2601.00411)
Keywords: language model, llm
Abstract: We present judgeWEL, a dataset for named entity recognition (NER) in Luxembourgish, automatically labelled and subsequently verified using large language models (LLMs) in a novel pipeline. Building datasets for under-represented languages remains one of the major bottlenecks in natural language processing, where the scarcity of resources and linguistic particularities make large-scale annotation costly and potentially inconsistent. To address these challenges, we propose and evaluate a novel approach that leverages Wikipedia and Wikidata as structured sources of weak supervision. By exploiting internal links within Wikipedia articles, we infer entity types based on their corresponding Wikidata entries, thereby generating initial annotations with minimal human intervention. Because such links are not uniformly reliable, we mitigate noise by employing and comparing several LLMs to identify and retain only high-quality labelled sentences. The resulting corpus is approximately five times larger than the currently available Luxembourgish NER dataset and offers broader and more balanced coverage across entity categories, providing a substantial new resource for multilingual and low-resource NER research.

Title: Toward Better Temporal Structures for Geopolitical Events Forecasting

Authors: Kian Ahrabian, Eric Boxer, Jay Pujara
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.00430
Pdf URL: https://arxiv.org/pdf/2601.00430
Copy Paste: [[2601.00430]] Toward Better Temporal Structures for Geopolitical Events Forecasting(https://arxiv.org/abs/2601.00430)
Keywords: language model, llm
Abstract: Forecasting on geopolitical temporal knowledge graphs (TKGs) through the lens of large language models (LLMs) has recently gained traction. While TKGs and their generalization, hyper-relational temporal knowledge graphs (HTKGs), offer a straightforward structure to represent simple temporal relationships, they lack the expressive power to convey complex facts efficiently. One of the critical limitations of HTKGs is a lack of support for more than two primary entities in temporal facts, which commonly occur in real-world events. To address this limitation, in this work, we study a generalization of HTKGs, Hyper-Relational Temporal Knowledge Generalized Hypergraphs (HTKGHs). We first derive a formalization for HTKGHs, demonstrating their backward compatibility while supporting two complex types of facts commonly found in geopolitical incidents. Then, utilizing this formalization, we introduce the htkgh-polecat dataset, built upon the global event database POLECAT. Finally, we benchmark and analyze popular LLMs on the relation prediction task, providing insights into their adaptability and capabilities in complex forecasting scenarios.
摘要：通过大型语言模型 (LLM) 的视角对地缘政治时间知识图 (TKG) 进行预测最近受到关注。虽然 TKG 及其泛化超关系时间知识图 (HTKG) 提供了一种简单的结构来表示简单的时间关系，但它们缺乏有效传达复杂事实的表达能力。 HTKG 的关键局限性之一是缺乏对时间事实中两个以上主要实体的支持，而这类事实在现实世界的事件中很常见。为了解决这个限制，在这项工作中，我们研究了 HTKG 的泛化，即超关系时间知识广义超图（HTKGH）。我们首先推导出 HTKGH 的形式化，证明它们的向后兼容性，同时支持地缘政治事件中常见的两种复杂事实。然后，利用这种形式化，我们引入了基于全球事件数据库 POLECAT 的 htkgh-polecat 数据集。最后，我们对流行的 LLM 在关系预测任务上进行基准测试和分析，深入了解它们在复杂预测场景中的适应性和能力。
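Sketch (not from the paper): one way to picture a temporal fact with more than two primary entities is a record whose source and target sides are sets rather than single entities. The class `HyperFact`, its field names, and the example values are assumptions made for illustration; the paper's actual formalization of HTKGHs is not given in the abstract and may differ.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

@dataclass(frozen=True)
class HyperFact:
    """Hypothetical temporal fact whose two sides may each hold several entities."""
    relation: str
    timestamp: str
    sources: FrozenSet[str]
    targets: FrozenSet[str]
    qualifiers: Tuple[Tuple[str, str], ...] = ()  # extra key/value context

    def primary_entities(self) -> FrozenSet[str]:
        return self.sources | self.targets

    def is_plain_htkg_fact(self) -> bool:
        # With exactly one source and one target, the record carries no more
        # than an ordinary hyper-relational (subject, relation, object, time) fact.
        return len(self.sources) == 1 and len(self.targets) == 1

# Fictional event: two actors jointly act on a third.
joint = HyperFact(
    relation="sanction",
    timestamp="2024-01-01",
    sources=frozenset({"Country A", "Country B"}),
    targets=frozenset({"Country C"}),
    qualifiers=(("context", "trade"),),
)
print(sorted(joint.primary_entities()), joint.is_plain_htkg_fact())
```

The `is_plain_htkg_fact` check is only meant to illustrate the backward-compatibility idea mentioned in the abstract: the two-entity case remains a special case of the more general record.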

Title: Language as Mathematical Structure: Examining Semantic Field Theory Against Language Games

Authors: Dimitris Vartziotis
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.00448
Pdf URL: https://arxiv.org/pdf/2601.00448
Copy Paste: [[2601.00448]] Language as Mathematical Structure: Examining Semantic Field Theory Against Language Games(https://arxiv.org/abs/2601.00448)
Keywords: language model, llm
Abstract: Large language models (LLMs) offer a new empirical setting in which long-standing theories of linguistic meaning can be examined. This paper contrasts two broad approaches: social constructivist accounts associated with language games, and a mathematically oriented framework we call Semantic Field Theory. Building on earlier work by the author, we formalize the notions of lexical fields (Lexfelder) and linguistic fields (Lingofelder) as interacting structures in a continuous semantic space. We then analyze how core properties of transformer architectures (such as distributed representations, attention mechanisms, and geometric regularities in embedding spaces) relate to these concepts. We argue that the success of LLMs in capturing semantic regularities supports the view that language exhibits an underlying mathematical structure, while their persistent limitations in pragmatic reasoning and context sensitivity are consistent with the importance of social grounding emphasized in philosophical accounts of language use. On this basis, we suggest that mathematical structure and language games can be understood as complementary rather than competing perspectives. The resulting framework clarifies the scope and limits of purely statistical models of language and motivates new directions for theoretically informed AI architectures.
摘要：大型语言模型（LLM）提供了一个新的实证环境，可以在其中检验长期存在的语言意义理论。本文对比了两种主要方法：与语言游戏相关的社会建构主义解释，以及我们称为语义场理论的数学导向框架。在作者早期工作的基础上，我们将词汇场 (Lexfelder) 和语言场 (Lingofelder) 的概念形式化为连续语义空间中的交互结构。然后，我们分析 Transformer 架构的核心属性（例如分布式表示、注意力机制和嵌入空间中的几何规律）如何与这些概念相关。我们认为，LLM 在捕捉语义规律方面的成功支持了这样一种观点，即语言表现出潜在的数学结构，而它们在语用推理和语境敏感性方面的持续局限性与语言使用的哲学解释中强调的社会基础的重要性是一致的。在此基础上，我们建议数学结构和语言游戏可以被理解为互补而非竞争的观点。由此产生的框架澄清了纯统计语言模型的范围和局限性，并为具有理论依据的人工智能架构提供了新的方向。

Title: Defensive M2S: Training Guardrail Models on Compressed Multi-turn Conversations

Authors: Hyunjun Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.00454
Pdf URL: https://arxiv.org/pdf/2601.00454
Copy Paste: [[2601.00454]] Defensive M2S: Training Guardrail Models on Compressed Multi-turn Conversations(https://arxiv.org/abs/2601.00454)
Keywords: language model, llm
Abstract: Guardrail models are essential for ensuring the safety of Large Language Model (LLM) deployments, but processing full multi-turn conversation histories incurs significant computational cost. We propose Defensive M2S, a training paradigm that fine-tunes guardrail models on Multi-turn to Single-turn (M2S) compressed conversations rather than complete dialogue histories. We provide a formal complexity analysis showing that M2S reduces training cost from $O(n^2)$ to $O(n)$ for $n$-turn conversations. Empirically, on our training dataset (779 samples, avg. 10.6 turns), M2S requires only 169K tokens compared to 15.7M tokens for the multi-turn baseline -- a 93$\times$ reduction. We evaluate Defensive M2S across three guardrail model families (LlamaGuard, Nemotron, Qwen3Guard) and three compression templates (hyphenize, numberize, pythonize) on SafeDialBench, a comprehensive multi-turn jailbreak benchmark. Our best configuration, Qwen3Guard with hyphenize compression, achieves 93.8% attack detection recall while reducing inference tokens by 94.6% (from 3,231 to 173 tokens per conversation). This represents a 38.9 percentage point improvement over the baseline while dramatically reducing both training and inference costs. Our findings demonstrate that M2S compression can serve as an effective efficiency technique for guardrail deployment, enabling scalable safety screening of long multi-turn conversations.
摘要：Guardrail 模型对于确保大型语言模型 (LLM) 部署的安全性至关重要，但处理完整的多轮对话历史会产生大量计算成本。我们提出了防御性 M2S，这是一种训练范例，可以在多轮到单轮（M2S）压缩对话而不是完整的对话历史上微调护栏模型。我们提供了正式的复杂性分析，表明 M2S 将 $n$ 轮对话的训练成本从 $O(n^2)$ 降低到 $O(n)$。根据经验，在我们的训练数据集（779 个样本，平均 10.6 轮）上，M2S 仅需要 169K 个令牌，而多轮基线需要 1570 万个令牌，减少了 93$\times$。我们在 SafeDialBench（综合性多轮越狱基准测试）上评估了三个护栏模型系列（LlamaGuard、Nemotron、Qwen3Guard）和三个压缩模板（hyphenize、numberize、pythonize）的防御性 M2S。我们的最佳配置是采用 hyphenize 压缩的 Qwen3Guard，可实现 93.8% 的攻击检测召回率，同时将推理令牌减少 94.6%（从每个会话 3,231 个令牌减少到 173 个令牌）。这比基线提高了 38.9 个百分点，同时显着降低了训练和推理成本。我们的研究结果表明，M2S 压缩可以作为护栏部署的有效提效技术，从而实现长时间多轮对话的可扩展安全筛选。
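Sketch (not from the paper): the quadratic-versus-linear claim above can be illustrated with a simple token count. The accounting below is one plausible reading, assuming the multi-turn baseline processes every turn together with its full preceding history while M2S processes each turn once; the paper's own cost model may differ. The `hyphenize` function is likewise only an illustrative guess at a bulleted single-turn template, and its header wording is an assumption.

```python
from typing import List

def multi_turn_tokens(turn_lengths: List[int]) -> int:
    """Tokens processed if each turn is handled together with all prior turns."""
    total, history = 0, 0
    for length in turn_lengths:
        history += length   # the context grows by one turn
        total += history    # the whole context is processed again
    return total

def m2s_tokens(turn_lengths: List[int], template_overhead: int = 0) -> int:
    """Tokens processed if all turns are packed into one single-turn prompt."""
    return sum(turn_lengths) + template_overhead

def hyphenize(turns: List[str]) -> str:
    """Illustrative compression: one prompt with each turn as a bullet."""
    header = "Please answer the following list of questions in the given order."
    return "\n".join([header] + [f"- {turn}" for turn in turns])

# With n equal-length turns the first count grows like n(n+1)/2, the second like n.
lengths = [10] * 4
print(multi_turn_tokens(lengths), m2s_tokens(lengths))
print(hyphenize(["first question", "second question"]))
```

Under this accounting, doubling the number of turns roughly quadruples the multi-turn count but only doubles the compressed one, which is the shape of the reduction the abstract reports.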

Title: Rule-Based Approaches to Atomic Sentence Extraction

Authors: Lineesha Kamana, Akshita Ananda Subramanian, Mehuli Ghosh, Suman Saha
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.00506
Pdf URL: https://arxiv.org/pdf/2601.00506
Copy Paste: [[2601.00506]] Rule-Based Approaches to Atomic Sentence Extraction(https://arxiv.org/abs/2601.00506)
Keywords: language model
Abstract: Natural language often combines multiple ideas into complex sentences. Atomic sentence extraction, the task of decomposing complex sentences into simpler sentences that each express a single idea, improves performance in information retrieval, question answering, and automated reasoning systems. Previous work has formalized the "split-and-rephrase" task and established evaluation metrics, and machine learning approaches using large language models have improved extraction accuracy. However, these methods lack interpretability and provide limited insight into which linguistic structures cause extraction failures. Although some studies have explored dependency-based extraction of subject-verb-object triples and clauses, no principled analysis has examined which specific clause structures and dependencies lead to extraction difficulties. This study addresses this gap by analyzing how complex sentence structures, including relative clauses, adverbial clauses, coordination patterns, and passive constructions, affect the performance of rule-based atomic sentence extraction. Using the WikiSplit dataset, we implemented dependency-based extraction rules in spaCy, generated 100 gold-standard atomic sentence sets, and evaluated performance using ROUGE and BERTScore. The system achieved ROUGE-1 F1 = 0.6714, ROUGE-2 F1 = 0.478, ROUGE-L F1 = 0.650, and BERTScore F1 = 0.5898, indicating moderate-to-high lexical, structural, and semantic alignment. Challenging structures included relative clauses, appositions, coordinated predicates, adverbial clauses, and passive constructions. Overall, rule-based extraction is reasonably accurate but sensitive to syntactic complexity.
摘要：自然语言经常将多种想法组合成复杂的句子。原子句子提取是将复杂句子分解为更简单的句子（每个句子表达一个想法）的任务，可以提高信息检索、问答和自动推理系统的性能。之前的工作已经形式化了“拆分和改写”任务并建立了评估指标，并且使用大型语言模型的机器学习方法提高了提取准确性。然而，这些方法缺乏可解释性，并且对导致提取失败的语言结构的了解有限。尽管一些研究探索了基于依存关系的主谓宾三元组和从句的提取，但尚无系统性的分析考察哪些特定的从句结构和依存关系导致提取困难。本研究通过分析复杂的句子结构（包括关系从句、状语从句、并列结构和被动结构）如何影响基于规则的原子句子提取的性能来解决这一差距。使用 WikiSplit 数据集，我们在 spaCy 中实现了基于依存关系的提取规则，生成了 100 个黄金标准原子句子集，并使用 ROUGE 和 BERTScore 评估了性能。该系统实现了 ROUGE-1 F1 = 0.6714、ROUGE-2 F1 = 0.478、ROUGE-L F1 = 0.650 和 BERTScore F1 = 0.5898，表明词汇、结构和语义对齐程度为中等到高。具有挑战性的结构包括关系从句、同位语、并列谓语、状语从句和被动结构。总体而言，基于规则的提取相当准确，但对句法复杂性敏感。
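Sketch (not from the paper): the ROUGE-1 F1 figure quoted above measures unigram overlap between system output and a gold reference. The minimal version below assumes lowercased whitespace tokenization with no stemming, so its numbers will not match an official ROUGE implementation exactly; the function name and the example sentences are made up for illustration.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 with clipped counts (simplified ROUGE-1)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# An extracted atomic sentence scored against a longer gold sentence.
print(rouge1_f1("the cat sat", "the cat sat on the mat"))
```

In this example every candidate token appears in the reference (precision 1.0) but only half of the reference tokens are covered (recall 0.5), so the F1 lands at two thirds.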

Title: Retrieval--Reasoning Processes for Multi-hop Question Answering: A Four-Axis Design Framework and Empirical Trends

Authors: Yuelyu Ji, Zhuochun Li, Rui Meng, Daqing He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.00536
Pdf URL: https://arxiv.org/pdf/2601.00536
Copy Paste: [[2601.00536]] Retrieval--Reasoning Processes for Multi-hop Question Answering: A Four-Axis Design Framework and Empirical Trends(https://arxiv.org/abs/2601.00536)
Keywords: agent
Abstract: Multi-hop question answering (QA) requires systems to iteratively retrieve evidence and reason across multiple hops. While recent RAG and agentic methods report strong results, the underlying retrieval-reasoning process is often left implicit, making procedural choices hard to compare across model families. This survey takes the execution procedure as the unit of analysis and introduces a four-axis framework covering (A) overall execution plan, (B) index structure, (C) next-step control (strategies and triggers), and (D) stop/continue criteria. Using this schema, we map representative multi-hop QA systems and synthesize reported ablations and tendencies on standard benchmarks (e.g., HotpotQA, 2WikiMultiHopQA, MuSiQue), highlighting recurring trade-offs among effectiveness, efficiency, and evidence faithfulness. We conclude with open challenges for retrieval-reasoning agents, including structure-aware planning, transferable control policies, and robust stopping under distribution shift.
摘要：多跳问答 (QA) 要求系统跨多跳迭代检索证据和推理。虽然最近的 RAG 和代理方法报告了强有力的结果，但底层的检索-推理过程通常是隐式的，使得程序选择很难在模型系列之间进行比较。本综述以执行过程为分析单位，引入了涵盖（A）整体执行计划、（B）索引结构、（C）下一步控制（策略和触发条件）、（D）停止/继续标准的四轴框架。使用此模式，我们映射了代表性的多跳 QA 系统，并综合了标准基准（例如 HotpotQA、2WikiMultiHopQA、MuSiQue）上报告的消融和趋势，突出了有效性、效率和证据忠实度之间反复出现的权衡。最后，我们提出了检索-推理代理的开放挑战，包括结构感知规划、可转移控制策略和分布转移下的稳健停止。

Title: ECR: Manifold-Guided Semantic Cues for Compact Language Models

Authors: Chung-Wei Victor Yuan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.00543
Pdf URL: https://arxiv.org/pdf/2601.00543
Copy Paste: [[2601.00543]] ECR: Manifold-Guided Semantic Cues for Compact Language Models(https://arxiv.org/abs/2601.00543)
Keywords: language model
Abstract: Compact models often lose the structure of their embedding space. The issue shows up when the capacity is tight or the data spans several languages. Such collapse makes it difficult for downstream tasks to build on the resulting representation. Existing compression methods focus on aligning model outputs at a superficial level but fail to preserve the underlying manifold structure. This mismatch often leads to semantic drift in the compact model, causing both task behavior and linguistic properties to deviate from the reference model. To address those issues, we provide a new framework called Embedding Consistency Regulation (ECR). This framework first derives a set of semantic anchors from teacher embeddings (computed once offline). Then, the compact model learns to maintain consistent geometry around these anchors, without relying on matching logits or internal features. ECR adds only a small projection step at inference, without altering the decoding architecture or its runtime behavior. In experiments on a 100K multilingual corpus, ECR consistently stabilizes training and preserves semantic structure across tasks and languages. It also produces a more compact and task-aligned representation space, enabling low-capacity models to learn cleaner manifolds than conventional baselines. ECR works without teacher outputs and is compatible with, but independent of, distillation. Taken together, our results show that ECR helps compact models better follow task requirements and makes them easier to deploy under strict efficiency or privacy limits.
摘要：紧凑模型通常会丢失其嵌入空间的结构。当容量紧张或数据跨越多种语言时，就会出现此问题。这种崩溃使得下游任务很难在结果表示的基础上进行构建。现有的压缩方法侧重于在表面层面上对齐模型输出，但无法保留底层的流形结构。这种不匹配通常会导致紧凑模型中的语义漂移，导致任务行为和语言属性偏离参考模型。为了解决这些问题，我们提供了一个名为嵌入一致性调节（ECR）的新框架。该框架首先从教师嵌入中派生出一组语义锚（离线计算一次）。然后，紧凑模型学习在这些锚点周围保持一致的几何形状，而不依赖于匹配 logits 或内部特征。 ECR 仅在推理时添加了一个小的投影步骤，而没有改变解码架构或其运行时行为。在 100K 多语言语料库的实验中，ECR 始终稳定训练并保留跨任务和语言的语义结构。它还产生更紧凑且与任务一致的表示空间，使低容量模型能够学习比传统基线更干净的流形。 ECR 无需教师输出即可工作，并且与蒸馏兼容但独立。总而言之，我们的结果表明，ECR 有助于紧凑模型更好地遵循任务要求，并使它们在严格的效率或隐私限制下更容易部署。

Title: InfoSynth: Information-Guided Benchmark Synthesis for LLMs

Authors: Ishir Garg, Neel Kolhe, Xuandong Zhao, Dawn Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.00575
Pdf URL: https://arxiv.org/pdf/2601.00575
Copy Paste: [[2601.00575]] InfoSynth: Information-Guided Benchmark Synthesis for LLMs(https://arxiv.org/abs/2601.00575)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated significant advancements in reasoning and code generation. However, efficiently creating new benchmarks to evaluate these capabilities remains a challenge. Traditional benchmark creation relies on manual human effort, a process that is both expensive and time-consuming. Furthermore, existing benchmarks often contaminate LLM training data, necessitating novel and diverse benchmarks to accurately assess their genuine capabilities. This work introduces InfoSynth, a novel framework for automatically generating and evaluating reasoning benchmarks guided by information-theoretic principles. We propose metrics based on KL-divergence and entropy to quantify benchmark novelty and diversity without relying on costly model evaluations. Building on this framework, we develop an end-to-end pipeline that synthesizes robust Python coding problems from seed datasets using genetic algorithms and iterative code feedback. Our method generates accurate test cases and solutions to new problems 97% of the time, and the synthesized benchmarks consistently exhibit higher novelty and diversity compared to their seed datasets. Moreover, our algorithm provides a method for controlling the novelty/diversity and difficulty of generated problems. InfoSynth offers a scalable, self-verifying pipeline for constructing high-quality, novel and diverse benchmarks for LLMs.
摘要：大型语言模型 (LLM) 在推理和代码生成方面展现了显着的进步。然而，有效地创建新的基准来评估这些能力仍然是一个挑战。传统的基准创建依赖于人工，这个过程既昂贵又耗时。此外，现有的基准经常污染 LLM 训练数据，需要新颖且多样化的基准来准确评估其真正能力。这项工作介绍了 InfoSynth，这是一种新颖的框架，用于在信息论原理的指导下自动生成和评估推理基准。我们提出基于 KL 散度和熵的指标来量化基准的新颖性和多样性，而不依赖于昂贵的模型评估。在此框架的基础上，我们开发了一个端到端管道，使用遗传算法和迭代代码反馈从种子数据集中合成稳健的 Python 编码问题。我们的方法在 97% 的时间内生成准确的测试用例和新问题的解决方案，并且与种子数据集相比，合成的基准始终表现出更高的新颖性和多样性。此外，我们的算法提供了一种控制生成问题的新颖性/多样性和难度的方法。 InfoSynth 提供了一个可扩展、自我验证的管道，用于为 LLM 构建高质量、新颖且多样化的基准。
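Sketch (not from the paper): the two quantities named above, entropy for diversity and KL-divergence for novelty, are shown here on small discrete distributions. Treating each problem as having a single topic label is an assumption made purely to keep the example short, as are the label names and the add-one smoothing; the paper's estimators are not described in the abstract and may work on a different representation entirely.

```python
import math
from collections import Counter
from typing import List, Sequence

def distribution(labels: Sequence[str], support: Sequence[str]) -> List[float]:
    """Add-one smoothed label frequencies over a fixed support."""
    counts = Counter(labels)
    total = len(labels) + len(support)
    return [(counts[s] + 1) / total for s in support]

def entropy(p: Sequence[float]) -> float:
    """Shannon entropy in bits; higher means a more evenly spread benchmark."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in bits; larger means p departs more from q."""
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)

support = ["strings", "graphs", "math"]
seed = distribution(["strings", "strings", "strings", "math"], support)
new = distribution(["graphs", "graphs", "math", "strings"], support)

print("diversity of new set (bits):", entropy(new))
print("novelty of new set vs seed (bits):", kl_divergence(new, seed))
```

Smoothing keeps every entry of `q` strictly positive, which avoids division by zero when the new set contains a topic the seed set never used.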

Title: CSSBench: Evaluating the Safety of Lightweight LLMs against Chinese-Specific Adversarial Patterns

Authors: Zhenhong Zhou, Shilinlu Yan, Chuanpu Liu, Qiankun Li, Kun Wang, Zhigang Zeng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.00588
Pdf URL: https://arxiv.org/pdf/2601.00588
Copy Paste: [[2601.00588]] CSSBench: Evaluating the Safety of Lightweight LLMs against Chinese-Specific Adversarial Patterns(https://arxiv.org/abs/2601.00588)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly deployed in cost-sensitive and on-device scenarios, and safety guardrails have advanced mainly in English. However, real-world Chinese malicious queries typically conceal intent via homophones, pinyin, symbol-based splitting, and other Chinese-specific patterns. These Chinese-specific adversarial patterns create a safety evaluation gap that is not well captured by existing benchmarks focused on English. This gap is particularly concerning for lightweight models, which may be more vulnerable to such specific adversarial perturbations. To bridge this gap, we introduce the Chinese-Specific Safety Benchmark (CSSBench) that emphasizes these adversarial patterns and evaluates the safety of lightweight LLMs in Chinese. Our benchmark covers six domains that are common in real Chinese scenarios, including illegal activities and compliance, privacy leakage, health and medical misinformation, fraud and hate, adult content, and public and political safety, and organizes queries into multiple task types. We evaluate a set of popular lightweight LLMs and measure over-refusal behavior to assess safety-induced performance degradation. Our results show that the Chinese-specific adversarial pattern is a critical challenge for lightweight LLMs. This benchmark offers a comprehensive evaluation of LLM safety in Chinese, assisting robust deployments in practice.
摘要：大型语言模型（LLM）越来越多地部署在成本敏感和设备端场景中，而安全护栏的进展主要集中在英语上。然而，现实世界中的中文恶意查询通常通过同音词、拼音、基于符号的分割和其他中文特有的模式来隐藏意图。这些中文特有的对抗模式造成了安全评估差距，而现有的以英语为主的基准无法很好地体现这一差距。对于轻量级模型来说，这种差距尤其令人担忧，因为轻量级模型可能更容易受到这种特定的对抗性扰动的影响。为了弥补这一差距，我们引入了中文特定安全基准（CSSBench），它强调这些对抗模式并评估轻量级 LLM 在中文场景下的安全性。我们的基准涵盖了中文真实场景中常见的六个领域，包括非法活动和合规、隐私泄露、健康和医疗错误信息、欺诈和仇恨、成人内容以及公共和政治安全，并将查询组织成多种任务类型。我们评估了一组流行的轻量级 LLM，并测量过度拒绝行为，以评估安全引起的性能下降。我们的结果表明，中文特有的对抗模式对于轻量级 LLM 来说是一个关键挑战。该基准对 LLM 在中文场景下的安全性进行了全面评估，有助于实践中的稳健部署。

Title: Beyond IVR: Benchmarking Customer Support LLM Agents for Business-Adherence

Authors: Sumanth Balaji, Piyush Mishra, Aashraya Sachdeva, Suraj Agrawal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.00596
Pdf URL: https://arxiv.org/pdf/2601.00596
Copy Paste: [[2601.00596]] Beyond IVR: Benchmarking Customer Support LLM Agents for Business-Adherence(https://arxiv.org/abs/2601.00596)
Keywords: language model, gpt, llm, prompt, agent
Abstract: Traditional customer support systems, such as Interactive Voice Response (IVR), rely on rigid scripts and lack the flexibility required for handling complex, policy-driven tasks. While large language model (LLM) agents offer a promising alternative, evaluating their ability to act in accordance with business rules and real-world support workflows remains an open challenge. Existing benchmarks primarily focus on tool usage or task completion, overlooking an agent's capacity to adhere to multi-step policies, navigate task dependencies, and remain robust to unpredictable user or environment behavior. In this work, we introduce JourneyBench, a benchmark designed to assess policy-aware agents in customer support. JourneyBench leverages graph representations to generate diverse, realistic support scenarios and proposes the User Journey Coverage Score, a novel metric to measure policy adherence. We evaluate multiple state-of-the-art LLMs using two agent designs: a Static-Prompt Agent (SPA) and a Dynamic-Prompt Agent (DPA) that explicitly models policy control. Across 703 conversations in three domains, we show that DPA significantly boosts policy adherence, even allowing smaller models like GPT-4o-mini to outperform more capable ones like GPT-4o. Our findings demonstrate the importance of structured orchestration and establish JourneyBench as a critical resource to advance AI-driven customer support beyond IVR-era limitations.
摘要：传统的客户支持系统，例如交互式语音应答 (IVR)，依赖于严格的脚本，缺乏处理复杂的、策略驱动的任务所需的灵活性。虽然大型语言模型 (LLM) 代理提供了一种有前途的替代方案，但评估其按照业务规则和现实世界支持工作流程行事的能力仍然是一个开放的挑战。现有的基准测试主要关注工具的使用或任务的完成，忽视了代理遵守多步骤策略、导航任务依赖性以及对不可预测的用户或环境行为保持稳健的能力。在这项工作中，我们引入了 JourneyBench，这是一个旨在评估客户支持中的策略感知代理的基准。 JourneyBench 利用图形表示来生成多样化、现实的支持场景，并提出用户旅程覆盖分数，这是一种衡量政策遵守情况的新颖指标。我们使用两种代理设计来评估多个最先进的 LLM：静态提示代理（SPA）和显式建模策略控制的动态提示代理（DPA）。在三个领域的 703 次对话中，我们表明 DPA 显着提高了策略遵守率，甚至允许 GPT-4o-mini 等较小的模型胜过 GPT-4o 等功能更强的模型。我们的研究结果证明了结构化编排的重要性，并将 JourneyBench 确立为关键资源，以推动人工智能驱动的客户支持超越 IVR 时代的限制。

Title: Probabilistic Guarantees for Reducing Contextual Hallucinations in LLMs

Authors: Nils Rautenberg, Sven Schippkus
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.00641
Pdf URL: https://arxiv.org/pdf/2601.00641
Copy Paste: [[2601.00641]] Probabilistic Guarantees for Reducing Contextual Hallucinations in LLMs(https://arxiv.org/abs/2601.00641)
Keywords: language model, llm, hallucination, prompt
Abstract: Large language models (LLMs) frequently produce contextual hallucinations, where generated content contradicts or ignores information explicitly stated in the prompt. Such errors are particularly problematic in deterministic automation workflows, where inputs are fixed and correctness is unambiguous. We introduce a simple and model-agnostic framework that provides explicit probabilistic guarantees for reducing hallucinations in this setting. We formalize the notion of a specific task, defined by a fixed input and a deterministic correctness criterion, and show that issuing the same prompt in independent context windows yields an exponential reduction in the probability that all model outputs are incorrect. To identify a correct answer among repeated runs, we incorporate an LLM-as-a-judge and prove that the probability that the judged pipeline fails decays at a rate determined by the judge's true- and false-positive probabilities. When the judge is imperfect, we strengthen it through majority vote over independent judge calls, obtaining ensemble-level error rates that decrease exponentially in the number of votes. This yields an explicit bound on the probability that the pipeline selects a hallucinated answer. Experiments on controlled extraction tasks with synthetic noisy judges match these predictions exactly: pipeline failure decreases exponentially with the number of repetitions, and hallucination-selection decreases exponentially with the number of judges in the ensemble. Together, these results provide a lightweight, modular, and theoretically grounded method for driving hallucination probabilities arbitrarily low in fixed-input LLM workflows, without modifying model weights, decoding strategies, or prompt engineering.
摘要：大型语言模型 (LLM) 经常产生上下文幻觉，其中生成的内容与提示中明确说明的信息相矛盾或忽略了这些信息。此类错误在确定性自动化工作流程中尤其成问题，因为输入是固定的且正确性明确。我们引入了一个简单且与模型无关的框架，该框架为减少这种情况下的幻觉提供了明确的概率保证。我们形式化了由固定输入和确定性正确性标准定义的特定任务的概念，并表明在独立上下文窗口中发出相同的提示会使所有模型输出不正确的概率呈指数下降。为了在重复运行中确定正确答案，我们引入了 LLM-as-a-judge（以 LLM 作为评判者），并证明经评判的管道失败的概率以由评判者的真阳性和假阳性概率确定的速率衰减。当评判者不完美时，我们通过对独立的评判调用进行多数投票来加强它，获得随投票数量呈指数级下降的整体级别错误率。这为管道选择幻觉答案的概率给出了明确的上界。使用合成噪声评判者进行的受控提取任务实验与这些预测完全吻合：管道故障随着重复次数的增加而呈指数减少，而幻觉选择随着集合中评判者数量的增加而呈指数减少。总之，这些结果提供了一种轻量级、模块化且有理论依据的方法，用于在固定输入 LLM 工作流程中将幻觉概率降低到任意低的水平，而无需修改模型权重、解码策略或提示工程。
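Sketch (not from the paper): the two exponential effects described above follow from elementary probability once independence is assumed. The functions below assume each run is correct with the same probability and each judge call errs with the same probability, independently of the others. The paper's actual bounds are stated in terms of the judge's true- and false-positive rates and are not reproduced here; the function and parameter names are illustrative.

```python
from math import comb

def prob_all_wrong(p_correct: float, k: int) -> float:
    """Chance that k independent runs of the same prompt are all incorrect."""
    return (1.0 - p_correct) ** k

def majority_vote_error(judge_error: float, m: int) -> float:
    """Chance that a strict majority of m independent judge calls err (m odd)."""
    need = m // 2 + 1
    return sum(
        comb(m, j) * judge_error**j * (1.0 - judge_error) ** (m - j)
        for j in range(need, m + 1)
    )

# More repetitions: the all-wrong probability shrinks geometrically.
for k in (1, 3, 5):
    print(k, prob_all_wrong(0.7, k))

# More judge votes: the ensemble error shrinks when each judge errs below 50%.
for m in (1, 3, 5):
    print(m, majority_vote_error(0.2, m))
```

The majority-vote improvement only holds when each judge is better than chance; with a per-call error above one half, adding votes makes the ensemble worse.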

Title: Physio-DPO: Aligning Large Language Models with the Protein Energy Landscape to Eliminate Structural Hallucinations

Authors: QiWei Meng
Subjects: cs.CL, cs.CE, q-bio.QM
Abstract URL: https://arxiv.org/abs/2601.00647
Pdf URL: https://arxiv.org/pdf/2601.00647
Copy Paste: [[2601.00647]] Physio-DPO: Aligning Large Language Models with the Protein Energy Landscape to Eliminate Structural Hallucinations(https://arxiv.org/abs/2601.00647)
Keywords: language model, hallucination
Abstract: Large Protein Language Models have shown strong potential for generative protein design, yet they frequently produce structural hallucinations, generating sequences with high linguistic likelihood that fold into thermodynamically unstable conformations. Existing alignment approaches such as Direct Preference Optimization are limited in this setting, as they model preferences as binary labels and ignore the continuous structure of the physical energy landscape. We propose Physio-DPO, a physics-informed alignment framework that grounds protein language models in thermodynamic stability. Physio-DPO introduces a magnitude-aware objective that scales optimization updates according to the energy gap between native structures and physics-perturbed hard negatives. Experiments show that Physio-DPO consistently outperforms strong baselines including SFT, PPO, and standard DPO, reducing self-consistency RMSD to 1.28 Å and increasing foldability to 92.8%. Qualitative analysis further demonstrates that Physio-DPO effectively mitigates structural hallucinations by recovering biophysical interactions such as hydrophobic core packing and hydrogen bond networks.
摘要：大型蛋白质语言模型在生成蛋白质设计方面显示出强大的潜力，但它们经常产生结构幻觉，生成具有高语言可能性的序列，折叠成热力学不稳定的构象。现有的对齐方法（例如直接偏好优化）在这种情况下受到限制，因为它们将偏好建模为二元标签，并忽略了物理能量景观的连续结构。我们提出了 Physio-DPO，这是一种基于物理知识的对齐框架，它将蛋白质语言模型建立在热力学稳定性的基础上。 Physio-DPO 引入了一个幅度感知目标，可根据天然结构和物理扰动的困难负样本之间的能隙来缩放优化更新。实验表明，Physio-DPO 的性能始终优于包括 SFT、PPO 和标准 DPO 在内的强大基线，将自一致性 RMSD 降低至 1.28 Å，并将可折叠性提高至 92.8%。定性分析进一步表明，Physio-DPO 通过恢复疏水核心堆积和氢键网络等生物物理相互作用，有效减轻结构幻觉。
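Sketch (not from the paper): the standard DPO loss treats each preference pair as a binary label, which is the limitation the abstract points to. The first function below is the usual pairwise DPO loss. The second is purely a hypothetical illustration of what "scaling updates by the energy gap" could mean, using a linear weight; the abstract does not give the Physio-DPO objective, and the weighting rule, `gap_scale`, and all argument names are assumptions.

```python
import math

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (preferred, rejected) pair: -log sigmoid(margin)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))

def gap_weighted_dpo_loss(logp_w: float, logp_l: float,
                          ref_logp_w: float, ref_logp_l: float,
                          energy_gap: float, gap_scale: float = 1.0,
                          beta: float = 0.1) -> float:
    """Hypothetical variant: pairs with a larger energy gap get a larger weight."""
    weight = energy_gap / gap_scale  # assumed linear weighting, not the paper's
    return weight * dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta)

# Same log-probabilities, different energy gaps between native and perturbed structure.
args = (-10.0, -12.0, -11.0, -11.0)
print(dpo_loss(*args))
print(gap_weighted_dpo_loss(*args, energy_gap=0.5))
print(gap_weighted_dpo_loss(*args, energy_gap=3.0))
```

Because the weight multiplies the loss, it multiplies the gradient as well, so a pair separated by a large energy gap moves the model more than a near-tie does.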

Title: Fast-weight Product Key Memory

Authors: Tianyu Zhao, Llion Jones
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.00671
Pdf URL: https://arxiv.org/pdf/2601.00671
Copy Paste: [[2601.00671]] Fast-weight Product Key Memory(https://arxiv.org/abs/2601.00671)
Keywords: language model
Abstract: Sequence modeling layers in modern language models typically face a trade-off between storage capacity and computational efficiency. While Softmax attention offers unbounded storage at prohibitive quadratic costs, linear variants provide efficiency but suffer from limited, fixed-size storage. We propose Fast-weight Product Key Memory (FwPKM), a novel architecture that resolves this tension by transforming the sparse Product Key Memory (PKM) from a static module into a dynamic, "fast-weight" episodic memory. Unlike PKM, FwPKM updates its parameters dynamically at both training and inference time via local chunk-level gradient descent, allowing the model to rapidly memorize and retrieve new key-value pairs from input sequences. Experiments reveal that FwPKM functions as an effective episodic memory that complements the semantic memory of standard modules, yielding significant perplexity reductions on long-context datasets. Notably, in Needle in a Haystack evaluations, FwPKM generalizes to 128K-token contexts despite being trained on only 4K-token sequences.
摘要：现代语言模型中的序列建模层通常面临存储容量和计算效率之间的权衡。虽然 Softmax 注意力以令人望而却步的二次成本提供无限存储，但线性变体提供了效率，但受到有限的固定大小存储的影响。我们提出了快速权重产品密钥内存（FwPKM），这是一种新颖的架构，通过将稀疏的产品密钥内存（PKM）从静态模块转换为动态的“快速权重”情景内存来解决这种紧张关系。与 PKM 不同，FwPKM 通过局部块级梯度下降在训练和推理时动态更新其参数，使模型能够快速记忆并从输入序列中检索新的键值对。实验表明，FwPKM 可以作为一种有效的情景记忆，补充标准模块的语义记忆，从而显着减少长上下文数据集的困惑度。值得注意的是，在大海捞针评估中，FwPKM 泛化到 128K 令牌上下文，尽管仅在 4K 令牌序列上进行训练。
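Sketch (not from the paper): this shows only the static product-key lookup that Product Key Memory is built on, not FwPKM's inference-time gradient updates, which the abstract does not specify in detail. The query is split into two halves, each half is scored against its own small set of sub-keys, and the best memory slots are found among combinations of the per-half winners instead of scoring every slot. All names and the toy values are illustrative.

```python
import heapq
from itertools import product
from typing import List, Sequence, Tuple

def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def top_k(scores: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """(index, score) pairs of the k highest scores."""
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])

def product_key_lookup(query: Sequence[float],
                       sub_keys_1: Sequence[Sequence[float]],
                       sub_keys_2: Sequence[Sequence[float]],
                       k: int) -> List[Tuple[float, int]]:
    """Return the k best (score, slot) pairs over len(sub_keys_1) * len(sub_keys_2) slots."""
    half = len(query) // 2
    q1, q2 = query[:half], query[half:]
    best_1 = top_k([dot(q1, key) for key in sub_keys_1], k)
    best_2 = top_k([dot(q2, key) for key in sub_keys_2], k)
    # A slot's score is the sum of its two half scores, so the overall top-k
    # must lie among the k * k combinations of the per-half top-k.
    candidates = [
        (s1 + s2, i * len(sub_keys_2) + j)
        for (i, s1), (j, s2) in product(best_1, best_2)
    ]
    return heapq.nlargest(k, candidates)

keys_1 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
keys_2 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
query = [0.9, 0.2, 0.1, 0.8]
print(product_key_lookup(query, keys_1, keys_2, k=2))
```

With N sub-keys per half, the lookup scores 2N sub-keys plus k squared combinations while addressing N squared slots, which is what makes a very large sparse memory affordable.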

Title: Sigmoid Head for Quality Estimation under Language Ambiguity

Authors: Tu Anh Dinh, Jan Niehues
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.00680
Pdf URL: https://arxiv.org/pdf/2601.00680
Copy Paste: [[2601.00680]] Sigmoid Head for Quality Estimation under Language Ambiguity(https://arxiv.org/abs/2601.00680)
Keywords: language model
Abstract: Language model (LM) probability is not a reliable quality estimator, as natural language is ambiguous. When multiple output options are valid, the model's probability distribution is spread across them, which can misleadingly indicate low output quality. This issue has two causes: (1) LMs' final output activation is softmax, which does not allow multiple correct options to receive high probabilities simultaneously, and (2) LMs' training data consists of single, one-hot encoded references, indicating that there is only one correct option at each output step. We propose training a module for Quality Estimation on top of pre-trained LMs to address these limitations. The module, called Sigmoid Head, is an extra unembedding head with sigmoid activation to tackle the first limitation. To tackle the second limitation, during the negative sampling process to train the Sigmoid Head, we use a heuristic to avoid selecting potentially alternative correct tokens. Our Sigmoid Head is computationally efficient during training and inference. The probability from the Sigmoid Head is a notably better quality signal than that of the original softmax head. As the Sigmoid Head does not rely on human-annotated quality data, it is more robust to out-of-domain settings compared to supervised QE.
摘要：语言模型（LM）概率不是可靠的质量估计器，因为自然语言是模糊的。当多个输出选项有效时，模型的概率分布分布在它们之间，这可能会误导性地表明输出质量较低。这个问题是由两个原因引起的：(1) LM 的最终输出激活是 softmax，它不允许多个正确选项同时接收高概率；(2) LM 的训练数据是单一的、one-hot 编码的参考，表明每个输出步骤只有一个正确选项。我们建议在预先训练的 LM 之上训练一个质量估计模块来解决这些限制。该模块称为 Sigmoid Head，是一个额外的非嵌入头，具有 sigmoid 激活功能，可解决第一个限制。为了解决第二个限制，在训练 Sigmoid Head 的负采样过程中，我们使用启发式方法来避免选择潜在的替代正确标记。我们的 Sigmoid Head 在训练和推理过程中计算效率很高。与原始 softmax 头相比，Sigmoid 头的概率明显是质量更好的信号。由于 Sigmoid Head 不依赖于人工注释的质量数据，因此与监督 QE 相比，它对域外设置更加稳健。
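Sketch (not from the paper): the first limitation above can be seen on a toy vocabulary of three tokens where two are equally valid. Softmax forces the two good options to share probability mass, while independent sigmoids can score both highly. Applying a sigmoid directly to the same logits is a simplification for illustration; the paper trains a separate head, whose logits would differ.

```python
import math
from typing import List, Sequence

def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Two equally good continuations (logit 5.0) and one poor one (logit 0.0).
logits = [5.0, 5.0, 0.0]

print("softmax:", softmax(logits))               # the two valid options split the mass
print("sigmoid:", [sigmoid(x) for x in logits])  # each option is scored on its own
```

Under softmax neither valid token can exceed 0.5 here, which read alone would look like low confidence, whereas the per-token sigmoid scores stay close to 1 for both.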

Title: Exploring the Performance of Large Language Models on Subjective Span Identification Tasks

Authors: Alphaeus Dmonte, Roland Oruche, Tharindu Ranasinghe, Marcos Zampieri, Prasad Calyam
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.00736
Pdf URL: https://arxiv.org/pdf/2601.00736
Copy Paste: [[2601.00736]] Exploring the Performance of Large Language Models on Subjective Span Identification Tasks(https://arxiv.org/abs/2601.00736)
Keywords: language model, llm
Abstract: Identifying relevant text spans is important for several downstream tasks in NLP, as it contributes to model explainability. While most span identification approaches rely on relatively smaller pre-trained language models like BERT, a few recent approaches have leveraged the latest generation of Large Language Models (LLMs) for the task. Current work has focused on explicit span identification like Named Entity Recognition (NER), while more subjective span identification with LLMs in tasks like Aspect-based Sentiment Analysis (ABSA) has been underexplored. In this paper, we fill this important gap by presenting an evaluation of the performance of various LLMs on text span identification in three popular tasks, namely sentiment analysis, offensive language identification, and claim verification. We explore several LLM strategies like instruction tuning, in-context learning, and chain of thought. Our results indicate that underlying relationships within text aid LLMs in identifying precise text spans.
摘要：识别相关文本跨度对于 NLP 中的几个下游任务非常重要，因为它有助于模型的可解释性。虽然大多数跨度识别方法依赖于相对较小的预训练语言模型（如 BERT），但最近的一些方法利用了最新一代的大型语言模型 (LLM) 来完成任务。目前的工作主要集中在命名实体识别 (NER) 等显式跨度识别上，而在基于方面的情感分析 (ABSA) 等任务中使用 LLM 进行的更主观的跨度识别尚未得到充分探索。在本文中，我们通过评估各种 LLM 在三个流行任务（即情感分析、攻击性语言识别和声明验证）中的文本跨度识别表现来填补这一重要空白。我们探索了几种 LLM 策略，如指令调整、情境学习和思维链。我们的结果表明，文本中的潜在关系有助于 LLM 识别精确的文本跨度。