2026-01-21

Title: Context Discipline and Performance Correlation: Analyzing LLM Performance and Quality Degradation Under Varying Context Lengths

Authors: Ahilan Ayyachamy Nadar Ponnusamy, Karthic Chandran, M Maruf Hossain
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.11564
Pdf URL: https://arxiv.org/pdf/2601.11564
Copy Paste: [[2601.11564]] Context Discipline and Performance Correlation: Analyzing LLM Performance and Quality Degradation Under Varying Context Lengths(https://arxiv.org/abs/2601.11564)
Keywords: language model, llm
Abstract: The scaling trend in Large Language Models (LLMs) has prioritized increasing the maximum context window to facilitate complex, long-form reasoning and document analysis. However, managing this expanded context introduces severe computational overhead. This paper investigates the critical trade-off between system performance and model quality when dense transformer architectures--specifically Llama-3.1-70B and Qwen1.5-14B--are exposed to large volumes of irrelevant and distracting context. The research identifies a non-linear performance degradation tied to the growth of the Key-Value (KV) cache. Furthermore, an extended analysis of the Mixture-of-Experts (MoE) architecture reveals unique behavioral anomalies at varying context scales, suggesting that architectural benefits may be masked by infrastructure bottlenecks at high token volumes.
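The link between context length and KV-cache size can be made concrete with a little arithmetic. The sketch below is not from the paper: it uses the standard per-token cache formula for a dense transformer, and the layer count, KV-head count, head dimension, and 16-bit precision are assumed values chosen for illustration. Under this formula the cache grows linearly with tokens; the non-linear degradation the abstract reports would come from how that memory pressure interacts with the serving stack, which the sketch does not model.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for num_tokens tokens of one sequence."""
    # Factor 2: one key tensor and one value tensor are stored per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Assumed configuration for illustration only (not stated in the abstract):
# 80 layers, 8 KV heads (grouped-query attention), head dimension 128, 16-bit values.
ASSUMED_70B = dict(num_layers=80, num_kv_heads=8, head_dim=128, bytes_per_value=2)

for tokens in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(tokens, **ASSUMED_70B) / 2**30
    print(f"{tokens:>7} tokens -> {gib:.2f} GiB per sequence")
```

With these assumed values each cached token costs 320 KiB, so batch size and context length together determine how quickly accelerator memory fills.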

Title: Compass-Embedding v4: Robust Contrastive Learning for Multilingual E-commerce Embeddings

Authors: Pakorn Ueareeworakul, Shuman Liu, Jinghao Feng, Ling Hu, Zhantang Shi, Chengqi Sun, Liang Yao, Panyi Ouyang, Haibo Zhang, Anxiang Zeng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.11565
Pdf URL: https://arxiv.org/pdf/2601.11565
Copy Paste: [[2601.11565]] Compass-Embedding v4: Robust Contrastive Learning for Multilingual E-commerce Embeddings(https://arxiv.org/abs/2601.11565)
Keywords: llm
Abstract: As global e-commerce rapidly expands into emerging markets, the lack of high-quality semantic representations for low-resource languages has become a decisive bottleneck for retrieval, recommendation, and search systems. In this work, we present Compass-Embedding v4, a high-efficiency multilingual embedding framework specifically optimized for Southeast Asian (SEA) e-commerce scenarios, where data scarcity, noisy supervision, and strict production constraints jointly challenge representation learning. Compass-Embedding v4 addresses three core challenges. First, large-batch contrastive training under mixed task supervision introduces systematic false negatives that degrade semantic alignment. We propose Class-Aware Masking (CAM), a lightweight modification to the InfoNCE objective that suppresses invalid in-batch negatives and improves semantic discrimination without altering training efficiency. Second, low-resource SEA languages suffer from limited and uneven data coverage. We construct a diversified training corpus through context-grounded synthetic data generation, cross-lingual translation, and structured e-commerce data construction, enabling robust multilingual and domain-specific learning. Third, production deployment requires high-throughput inference while preserving embedding quality. We combine robustness-driven large-batch training with spherical model merging to mitigate catastrophic forgetting, and optimize inference via vLLM and FP8 quantization. Extensive evaluations across multilingual benchmarks and proprietary e-commerce tasks show that Compass-Embedding v4 achieves state-of-the-art performance on major SEA languages, significantly outperforming general-purpose embedding models in domain-specific retrieval and classification, while maintaining competitive performance on high-resource languages.
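The abstract describes Class-Aware Masking only as a modification to InfoNCE that suppresses invalid in-batch negatives. The sketch below is one plausible reading, not the authors' implementation: it assumes each example carries a class label and drops from the denominator any in-batch document that shares the query's label. The function name `masked_info_nce`, the `labels` argument, and the temperature value are all hypothetical.

```python
import math


def masked_info_nce(sim, labels, temperature=0.05):
    """Mean InfoNCE loss where sim[i][j] is the similarity of query i and document j.

    Document i is the positive for query i. Any other document j with
    labels[j] == labels[i] is treated as a likely false negative and is
    left out of the denominator (assumed masking rule, for illustration).
    """
    losses = []
    for i, row in enumerate(sim):
        logits = [
            s / temperature
            for j, s in enumerate(row)
            if j == i or labels[j] != labels[i]  # keep the positive and valid negatives
        ]
        positive = row[i] / temperature
        # Numerically stable log-sum-exp over the logits that were kept.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - positive)
    return sum(losses) / len(losses)


# Toy batch: queries 0 and 1 belong to the same class, so each one's
# positive document would otherwise act as a false negative for the other.
sim = [
    [0.9, 0.8, 0.1],
    [0.8, 0.9, 0.2],
    [0.1, 0.2, 0.9],
]
print("masked  :", masked_info_nce(sim, labels=[0, 0, 1]))
print("unmasked:", masked_info_nce(sim, labels=[0, 1, 2]))
```

Because masking only removes terms from the denominator, it adds no extra forward passes, which is consistent with the abstract's claim that training efficiency is unchanged.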

Title: Measuring Stability Beyond Accuracy in Small Open-Source Medical Large Language Models for Pediatric Endocrinology

Authors: Vanessa D'Amario, Randy Daniel, Alessandro Zanetti, Dhruv Edamadaka, Nitya Alaparthy, Joshua Tarkoff
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.11567
Pdf URL: https://arxiv.org/pdf/2601.11567
Copy Paste: [[2601.11567]] Measuring Stability Beyond Accuracy in Small Open-Source Medical Large Language Models for Pediatric Endocrinology(https://arxiv.org/abs/2601.11567)
Keywords: language model, gpt, llm, prompt
Abstract: Small open-source medical large language models (LLMs) offer promising opportunities for low-resource deployment and broader accessibility. However, their evaluation is often limited to accuracy on medical multiple-choice question (MCQ) benchmarks and does not assess consistency, robustness, or reasoning behavior. We use MCQs coupled with human evaluation and clinical review to assess six small open-source medical LLMs (HuatuoGPT-o1 (Chen 2024), Diabetica-7B, Diabetica-o1 (Wei 2024), Meditron3-8B (Sallinen 2025), MedFound-7B (Liu 2025), and ClinicaGPT-base-zh (Wang 2023)) in pediatric endocrinology. In deterministic settings, we examine the effect of prompt variation on models' output and self-assessment bias. In stochastic settings, we evaluate output variability and investigate the relationship between consistency and correctness. HuatuoGPT-o1-8B achieved the highest performance. The results show that high consistency across model responses is not an indicator of correctness, although HuatuoGPT-o1-8B showed the highest consistency rate. When tasked with selecting correct reasoning, both HuatuoGPT-o1-8B and Diabetica-o1 exhibit self-assessment bias and dependency on the order of the candidate explanations. Expert review of incorrect reasoning rationales identified a mix of clinically acceptable responses and clinical oversights. We further show that system-level perturbations, such as differences in CUDA builds, can yield statistically significant shifts in model output despite stable accuracy. This work demonstrates that small, semantically negligible prompt perturbations lead to divergent outputs, which raises concerns about the reproducibility of LLM-based evaluations. It also highlights output variability under different stochastic regimes, emphasizing the need for a broader diagnostic framework to understand potential pitfalls in real-world clinical decision support scenarios.

Title: An Empirical Analysis of Fine-Tuning Large Language Models on Bioinformatics Literature: PRSGPT and BioStarsGPT

Authors: Muhammad Muneeb, David B. Ascher
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.11573
Pdf URL: https://arxiv.org/pdf/2601.11573
Copy Paste: [[2601.11573]] An Empirical Analysis of Fine-Tuning Large Language Models on Bioinformatics Literature: PRSGPT and BioStarsGPT(https://arxiv.org/abs/2601.11573)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) often lack specialized knowledge for complex bioinformatics applications. We present a reproducible pipeline for fine-tuning LLMs on specialized bioinformatics data, demonstrated through two use cases: PRSGPT, focused on polygenic risk score (PRS) tools, and BioStarsGPT, trained on community forum discussions. The nine-step pipeline integrates diverse data sources, structured preprocessing, prompt-based question-answer (QA) generation (via Google Gemini), natural language inference (NLI) for quality control, semantic deduplication, clustering-based data splitting, and parameter-efficient fine-tuning using LoRA. We fine-tuned three LLMs (LLaMA-3.2-3B, Qwen2.5-7B, Gemma) and benchmarked them on over 14 lexical and semantic metrics. Qwen2.5-7B emerged as the best performer, with BLEU-4 and ROUGE-1 improvements of 82% and 70% for PRSGPT and 6% and 18% for BioStarsGPT, respectively. The open-source datasets produced include over 28,000 QA pairs for PRSGPT and 154,282 for BioStarsGPT. Human evaluation of PRSGPT yielded 61.9% accuracy on the PRS tools comparison task, comparable to Google Gemini (61.4%), but with richer methodological detail and accurate citations. BioStarsGPT demonstrated 59% conceptual accuracy across 142 curated bioinformatics questions. Our pipeline enables scalable, domain-specific fine-tuning of LLMs. It enables privacy-preserving, locally deployable bioinformatics assistants, explores their practical applications, and addresses the challenges, limitations, and mitigation strategies associated with their development and use.

Title: Concept Attractors in LLMs and their Applications

Authors: Sotirios Panagiotis Chytas, Vikas Singh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.11575
Pdf URL: https://arxiv.org/pdf/2601.11575
Copy Paste: [[2601.11575]] Concept Attractors in LLMs and their Applications(https://arxiv.org/abs/2601.11575)
Keywords: language model, llm, hallucination, prompt
Abstract: Large language models (LLMs) often map semantically related prompts to similar internal representations at specific layers, even when their surface forms differ widely. We show that this behavior can be explained through Iterated Function Systems (IFS), where layers act as contractive mappings toward concept-specific Attractors. We leverage this insight and develop simple, training-free methods that operate directly on these Attractors to solve a wide range of practical tasks, including language translation, hallucination reduction, guardrailing, and synthetic data generation. Despite their simplicity, these Attractor-based interventions match or exceed specialized baselines, offering an efficient alternative to heavy fine-tuning, generalizable in scenarios where baselines underperform.
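The idea that contractive layers pull different inputs toward a shared attractor can be illustrated with a toy example. The sketch below is not the paper's method: it uses a single affine contraction (scale 0.5) as a stand-in for a stack of layers, whereas an Iterated Function System would use a family of such maps. The `layer` function and its parameters are hypothetical.

```python
def layer(x, scale=0.5, shift=(1.0, -2.0)):
    """One toy 'layer': an affine map that shrinks distances by the factor `scale`."""
    return tuple(scale * xi + bi for xi, bi in zip(x, shift))


def iterate(x, steps=60):
    """Apply the toy layer repeatedly, mimicking depth."""
    for _ in range(steps):
        x = layer(x)
    return x


# Two very different starting points end up at the same fixed point,
# x* = shift / (1 - scale) = (2.0, -4.0), because the map is a contraction.
print(iterate((100.0, 100.0)))
print(iterate((-37.0, 5.0)))
```

Any map whose Lipschitz constant is below 1 has a unique fixed point by the Banach fixed-point theorem, which is why the starting point stops mattering after enough iterations.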

Title: LimAgents: Multi-Agent LLMs for Generating Research Limitations

Authors: Ibrahim Al Azher, Zhishuai Guo, Hamed Alhoori
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.11578
Pdf URL: https://arxiv.org/pdf/2601.11578
Copy Paste: [[2601.11578]] LimAgents: Multi-Agent LLMs for Generating Research Limitations(https://arxiv.org/abs/2601.11578)
Keywords: language model, gpt, llm, agent
Abstract: Identifying and articulating limitations is essential for transparent and rigorous scientific research. However, zero-shot large language model (LLM) approaches often produce superficial or general limitation statements (e.g., dataset bias or generalizability). They usually repeat limitations reported by authors without looking at deeper methodological issues and contextual gaps. This problem is made worse because many authors disclose only partial or trivial limitations. We propose LimAgents, a multi-agent LLM framework for generating substantive limitations. LimAgents integrates OpenReview comments and author-stated limitations to provide stronger ground truth. It also uses cited and citing papers to capture broader contextual weaknesses. In this setup, different agents take on specific roles in sequence: some extract explicit limitations, others analyze methodological gaps, some simulate the viewpoint of a peer reviewer, and a citation agent places the work within the larger body of literature. A Judge agent refines their outputs, and a Master agent consolidates them into a clear set. This structure allows for systematic identification of explicit, implicit, peer review-focused, and literature-informed limitations. Moreover, traditional NLP metrics like BLEU, ROUGE, and cosine similarity rely heavily on n-gram or embedding overlap. They often overlook semantically similar limitations. To address this, we introduce a pointwise evaluation protocol that uses an LLM-as-a-Judge to measure coverage more accurately. Experiments show that LimAgents substantially improves performance. The RAG + multi-agent GPT-4o mini configuration achieves a +15.51% coverage gain over zero-shot baselines, while the Llama 3 8B multi-agent setup yields a +4.41% improvement.

Title: Bielik 11B v3: Multilingual Large Language Model for European Languages

Authors: Krzysztof Ociepa, Łukasz Flis, Remigiusz Kinas, Krzysztof Wróbel, Adrian Gwoździej
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.11579
Pdf URL: https://arxiv.org/pdf/2601.11579
Copy Paste: [[2601.11579]] Bielik 11B v3: Multilingual Large Language Model for European Languages(https://arxiv.org/abs/2601.11579)
Keywords: language model
Abstract: We present Bielik 11B v3, a state-of-the-art language model highly optimized for the Polish language, while also maintaining strong capabilities in other European languages. This model extends the Mistral 7B v0.2 architecture, scaled to 11B parameters via depth up-scaling. Its development involved a comprehensive four-stage training pipeline: continuous pre-training, supervised fine-tuning (SFT), Direct Preference Optimization (DPO), and reinforcement learning. Comprehensive evaluations demonstrate that Bielik 11B v3 achieves exceptional performance. It significantly surpasses other specialized Polish language models and outperforms many larger models (with 2-6 times more parameters) on a wide range of tasks, from basic linguistic understanding to complex reasoning. The model's parameter efficiency, combined with extensive quantization options, allows for effective deployment across diverse hardware configurations. Bielik 11B v3 not only advances AI capabilities for the Polish language but also establishes a new benchmark for developing resource-efficient, high-performance models for less-represented languages.

Title: Speculative Decoding: Performance or Illusion?

Authors: Xiaoxuan Liu, Jiaxiang Yu, Jongseok Park, Ion Stoica, Alvin Cheung
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.11580
Pdf URL: https://arxiv.org/pdf/2601.11580
Copy Paste: [[2601.11580]] Speculative Decoding: Performance or Illusion?(https://arxiv.org/abs/2601.11580)
Keywords: language model, llm
Abstract: Speculative decoding (SD) has become a popular technique to accelerate Large Language Model (LLM) inference, yet its real-world effectiveness remains unclear as prior evaluations rely on research prototypes and unrealistically small batch sizes. We present, to our knowledge, the first systematic study of SD on a production-grade and widely deployed inference engine (vLLM), covering multiple SD variants (n-gram, EAGLE/EAGLE-3, Draft-Model, Multi-Token Prediction) across diverse workloads, model scales, and batch sizes. We analyze key factors governing SD performance, and quantify a theoretical upper bound on SD speedup. Our results show that verification by the target model dominates the execution, while acceptance length varies markedly across output token positions, requests, and datasets. Comparing measured performance with the theoretical upper bound reveals substantial gaps, and we leverage this observation to highlight new research opportunities that our study opens up in improving SD.
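To see why acceptance length matters so much for SD speedup, a common idealized model is useful. The sketch below is not the bound derived in the paper: it uses the textbook simplification in which every drafted token is accepted independently with probability `alpha`, `k` tokens are drafted per step, and one draft-model call costs `draft_cost_ratio` times a target-model call. All three parameter names and the example values are assumptions.

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens produced per target-model verification pass.

    Assumes each of the k drafted tokens is accepted independently with
    probability alpha; the target model always contributes one more token.
    """
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def idealized_speedup(alpha, k, draft_cost_ratio):
    """Speedup over plain decoding when a step costs k draft calls plus one target call."""
    return expected_tokens_per_step(alpha, k) / (k * draft_cost_ratio + 1)


for alpha in (0.5, 0.7, 0.9):
    print(f"alpha={alpha}: {idealized_speedup(alpha, k=4, draft_cost_ratio=0.1):.2f}x")
```

This model ignores batching effects and assumes verification is as cheap as a single decode step, which is exactly the kind of assumption the abstract suggests breaks down on a production engine.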

Title: Entropic Context Shaping: Information-Theoretic Filtering for Context-Aware LLM Agents

Authors: Hyunjun Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.11585
Pdf URL: https://arxiv.org/pdf/2601.11585
Copy Paste: [[2601.11585]] Entropic Context Shaping: Information-Theoretic Filtering for Context-Aware LLM Agents(https://arxiv.org/abs/2601.11585)
Keywords: language model, llm, agent
Abstract: Context engineering for large language model (LLM) agents requires distinguishing pragmatically useful information from misleading distractors. We introduce Entropic Context Shaping (ECS), an information-theoretic framework that measures context utility via the shift in the model's answer distribution toward the correct answer. Unlike lexical similarity methods that rely on word overlap, ECS captures pragmatic utility -- whether a passage actually helps answer the question. We formalize utility as the signed change in answer probability and provide theoretical analysis showing that task-irrelevant updates yield near-zero distribution shift. We evaluate on multi-turn context selection tasks using LongMemEval (session-level) and LoCoMo (turn-level) benchmarks. On fine-grained turn selection, ECS with Llama-3.1-8B achieves F1=0.265, a 71.83% relative improvement over TF-IDF (F1=0.154), demonstrating that pragmatic utility outperforms lexical similarity when precise context selection matters. Code and data are available in the supplementary materials.
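The abstract formalizes utility as the signed change in answer probability when a passage is added to the context. A minimal sketch of that definition follows. It is not the authors' code: `answer_prob` stands for any function that returns the model's probability of the reference answer given a context, and `toy_answer_prob` is a hypothetical keyword-based stand-in so that the example runs without a model.

```python
def utility(answer_prob, question, answer, passage, context=()):
    """Signed change in the probability of `answer` when `passage` joins the context."""
    before = answer_prob(question, answer, list(context))
    after = answer_prob(question, answer, list(context) + [passage])
    return after - before


def select_passages(answer_prob, question, answer, candidates, threshold=0.0):
    """Keep only passages whose utility exceeds the threshold."""
    return [
        p for p in candidates
        if utility(answer_prob, question, answer, p) > threshold
    ]


def toy_answer_prob(question, answer, context):
    """Hypothetical scorer: a real system would query the LLM's token probabilities."""
    prob = 0.2
    for passage in context:
        if answer.lower() in passage.lower():
            prob += 0.5   # supporting evidence raises the answer probability
        elif "not sure" in passage.lower():
            prob -= 0.1   # a distractor lowers it
    return min(max(prob, 0.0), 1.0)


candidates = [
    "The meeting took place in Berlin.",
    "I am not sure where anyone met.",
    "Lunch was served at noon.",
]
print(select_passages(toy_answer_prob, "Where was the meeting?", "Berlin", candidates))
```

The sign is what separates this from a similarity score: a distractor gets negative utility and an irrelevant passage gets roughly zero, so both are filtered out by a threshold at zero.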

Title: Towards AGI A Pragmatic Approach Towards Self Evolving Agent

Authors: Indrajit Kar, Sammy Zonunpuia, Zonunfeli Ralte
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.11658
Pdf URL: https://arxiv.org/pdf/2601.11658
Copy Paste: [[2601.11658]] Towards AGI A Pragmatic Approach Towards Self Evolving Agent(https://arxiv.org/abs/2601.11658)
Keywords: language model, llm, agent
Abstract: Large Language Model (LLM) based agents are powerful yet fundamentally static after deployment, lacking the ability to autonomously expand capabilities, generate new tools, or evolve their reasoning. This work introduces a hierarchical self-evolving multi-agent framework that integrates a Base LLM, an operational SLM agent, a Code-Generation LLM, and a Teacher-LLM to enable continuous adaptation. The workflow begins with the agent attempting a task using reasoning and existing tools; if unsuccessful, it escalates to tool synthesis through the Code-Gen LLM, and when failures persist, it triggers an evolution phase using Curriculum Learning (CL), Reward-Based Learning (RL), or Genetic Algorithm (GA) evolution. Using the TaskCraft dataset, which is rich in hierarchical tasks, tool-use traces, and difficulty scaling, we evaluate these paradigms. CL delivers fast recovery and strong generalization, RL excels on high-difficulty tasks, and GA offers high behavioral diversity. Across all settings, evolved agents outperform their originals, demonstrating robust, autonomous, self-improving agentic evolution.

Title: RAC: Retrieval-Augmented Clarification for Faithful Conversational Search

Authors: Ahmed Rayane Kebir, Vincent Guigue, Lynda Said Lhadj, Laure Soulier
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2601.11722
Pdf URL: https://arxiv.org/pdf/2601.11722
Copy Paste: [[2601.11722]] RAC: Retrieval-Augmented Clarification for Faithful Conversational Search(https://arxiv.org/abs/2601.11722)
Keywords: language model, llm
Abstract: Clarification questions help conversational search systems resolve ambiguous or underspecified user queries. While prior work has focused on fluency and alignment with user intent, especially through facet extraction, much less attention has been paid to grounding clarifications in the underlying corpus. Without such grounding, systems risk asking questions that cannot be answered from the available documents. We introduce RAC (Retrieval-Augmented Clarification), a framework for generating corpus-faithful clarification questions. After comparing several indexing strategies for retrieval, we fine-tune a large language model to make optimal use of research context and to encourage the generation of evidence-based questions. We then apply contrastive preference optimization to favor questions supported by retrieved passages over ungrounded alternatives. Evaluated on four benchmarks, RAC demonstrates significant improvements over baselines. In addition to LLM-as-Judge assessments, we introduce novel metrics derived from NLI and data-to-text to assess how well questions are anchored in the context, and we demonstrate that our approach consistently enhances faithfulness.

Title: Bridging Human Interpretation and Machine Representation: A Landscape of Qualitative Data Analysis in the LLM Era

Authors: Xinyu Pi, Qisen Yang, Chuong Nguyen, Hua Shen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.11739
Pdf URL: https://arxiv.org/pdf/2601.11739
Copy Paste: [[2601.11739]] Bridging Human Interpretation and Machine Representation: A Landscape of Qualitative Data Analysis in the LLM Era(https://arxiv.org/abs/2601.11739)
Keywords: llm
Abstract: LLMs are increasingly used to support qualitative research, yet existing systems produce outputs that vary widely--from trace-faithful summaries to theory-mediated explanations and system models. To make these differences explicit, we introduce a 4×4 landscape crossing four levels of meaning-making (descriptive, categorical, interpretive, theoretical) with four levels of modeling (static structure, stages/timelines, causal pathways, feedback dynamics). Applying the landscape to prior LLM-based automation highlights a strong skew toward low-level meaning and low-commitment representations, with few reliable attempts at interpretive/theoretical inference or dynamical modeling. Based on the revealed gap, we outline an agenda for applying and building LLM systems that make their interpretive and modeling commitments explicit, selectable, and governable.

Title: LIME-LLM: Probing Models with Fluent Counterfactuals, Not Broken Text

Authors: George Mihaila, Suleyman Olcay Polat, Poli Nemkova, Himanshu Sharma, Namratha V. Urs, Mark V. Albert
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.11746
Pdf URL: https://arxiv.org/pdf/2601.11746
Copy Paste: [[2601.11746]] LIME-LLM: Probing Models with Fluent Counterfactuals, Not Broken Text(https://arxiv.org/abs/2601.11746)
Keywords: language model, llm
Abstract: Local explanation methods such as LIME (Ribeiro et al., 2016) remain fundamental to trustworthy AI, yet their application to NLP is limited by a reliance on random token masking. These heuristic perturbations frequently generate semantically invalid, out-of-distribution inputs that weaken the fidelity of local surrogate models. While recent generative approaches such as LLiMe (Angiulli et al., 2025b) attempt to mitigate this by employing Large Language Models for neighborhood generation, they rely on unconstrained paraphrasing that introduces confounding variables, making it difficult to isolate specific feature contributions. We introduce LIME-LLM, a framework that replaces random noise with hypothesis-driven, controlled perturbations. By enforcing a strict "Single Mask-Single Sample" protocol and employing distinct neutral infill and boundary infill strategies, LIME-LLM constructs fluent, on-manifold neighborhoods that rigorously isolate feature effects. We evaluate our method against established baselines (LIME, SHAP, Integrated Gradients) and the generative LLiMe baseline across three diverse benchmarks: CoLA, SST-2, and HateXplain using human-annotated rationales as ground truth. Empirical results demonstrate that LIME-LLM establishes a new benchmark for black-box NLP explainability, achieving significant improvements in local explanation fidelity compared to both traditional perturbation-based methods and recent generative alternatives.
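The "Single Mask-Single Sample" idea, where each neighbor differs from the input in exactly one position, can be sketched in a few lines. The code below is a simplification and not LIME-LLM itself: it swaps each token for a fixed neutral placeholder instead of an LLM-generated infill, and it reports raw prediction differences rather than fitting a local surrogate model. `toy_sentiment`, the word lists, and the placeholder word are hypothetical.

```python
def single_mask_neighbors(tokens, infill="something"):
    """One neighbor per position, each differing from the input in exactly one token."""
    return [tokens[:i] + [infill] + tokens[i + 1:] for i in range(len(tokens))]


def token_effects(predict, tokens, infill="something"):
    """Drop in the black-box score when each token is replaced on its own."""
    base = predict(tokens)
    neighbors = single_mask_neighbors(tokens, infill)
    return [(tok, base - predict(nb)) for tok, nb in zip(tokens, neighbors)]


POSITIVE = {"great", "wonderful"}
NEGATIVE = {"boring", "awful"}


def toy_sentiment(tokens):
    """Hypothetical black-box classifier returning a positive-class score in [0, 1]."""
    score = 0.5 + 0.3 * sum(t in POSITIVE for t in tokens) - 0.3 * sum(t in NEGATIVE for t in tokens)
    return min(max(score, 0.0), 1.0)


for token, effect in token_effects(toy_sentiment, "the film was great".split()):
    print(f"{token:>6}: {effect:+.2f}")
```

Because every neighbor stays a well-formed sentence and changes only one position, any shift in the score can be attributed to that position alone, which is the isolation property the abstract emphasizes.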

Title: Industry-Aligned Granular Topic Modeling

Authors: Sae Young Moon, Myeongjun Erik Jang, Haoyan Luo, Chunyang Xiao, Antonios Georgiadis, Fran Silavong
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.11762
Pdf URL: https://arxiv.org/pdf/2601.11762
Copy Paste: [[2601.11762]] Industry-Aligned Granular Topic Modeling(https://arxiv.org/abs/2601.11762)
Keywords: language model, llm
Abstract: Topic modeling has extensive applications in text mining and data analysis across various industrial sectors. Although the concept of granularity holds significant value for business applications by providing deeper insights, the capability of topic modeling methods to produce granular topics has not been thoroughly explored. In this context, this paper introduces a framework called TIDE, whose core feature is a novel granular topic modeling method based on large language models (LLMs), along with other useful functionalities for business applications, such as summarizing long documents, topic parenting, and distillation. Through extensive experiments on a variety of public and real-world business datasets, we demonstrate that TIDE's topic modeling approach outperforms modern topic modeling methods, and our auxiliary components provide valuable support for dealing with industrial business scenarios. The TIDE framework is currently being open-sourced.

Title: Cleansing the Artificial Mind: A Self-Reflective Detoxification Framework for Large Language Models

Authors: Kaituo Zhang, Zhimeng Jiang, Na Zou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.11776
Pdf URL: https://arxiv.org/pdf/2601.11776
Copy Paste: [[2601.11776]] Cleansing the Artificial Mind: A Self-Reflective Detoxification Framework for Large Language Models(https://arxiv.org/abs/2601.11776)
Keywords: language model, llm
Abstract: Recent breakthroughs in Large Language Models (LLMs) have revealed remarkable generative capabilities and emerging self-regulatory mechanisms, including self-correction and self-rewarding. However, current detoxification techniques rarely exploit these built-in abilities; instead, they rely on external modules, labor-intensive data annotation, or human intervention, all of which hinder scalability and consistency. In this paper, we introduce a fully self-reflective detoxification framework that harnesses the inherent capacities of LLMs to detect and correct toxic content and to refine the models without external modules or data annotation. Specifically, we propose a Toxic Signal Detector, an internal self-identification mechanism, coupled with a systematic intervention process that transforms toxic text into its non-toxic counterpart. This iterative procedure yields a contrastive detoxification dataset used to fine-tune the model, enhancing its ability for safe and coherent text generation. Experiments on benchmark datasets such as DetoxLLM and ParaDetox show that our method achieves better detoxification performance than state-of-the-art methods while preserving semantic fidelity. By obviating the need for human intervention or external components, this paper reveals the intrinsic self-detoxification ability of LLMs, offering a consistent and effective approach for mitigating harmful content generation. Ultimately, our findings underscore the potential for truly self-regulated language models, paving the way for more responsible and ethically guided text generation systems.

Title: Translation as a Scalable Proxy for Multilingual Evaluation

Authors: Sheriff Issaka, Erick Rosas Gonzalez, Lieqi Liu, Evans Kofi Agyei, Lucas Bandarkar, Nanyun Peng, David Ifeoluwa Adelani, Francisco Guzmán, Saadia Gabriel
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.11778
Pdf URL: https://arxiv.org/pdf/2601.11778
Copy Paste: [[2601.11778]] Translation as a Scalable Proxy for Multilingual Evaluation(https://arxiv.org/abs/2601.11778)
Keywords: llm
Abstract: The rapid proliferation of LLMs has created a critical evaluation paradox: while LLMs claim multilingual proficiency, comprehensive non-machine-translated benchmarks exist for fewer than 30 languages, leaving >98% of the world's 7,000 languages in an empirical void. Traditional benchmark construction faces scaling challenges such as cost, scarcity of domain experts, and data contamination. We evaluate the validity of a simpler alternative: can translation quality alone indicate a model's broader multilingual capabilities? Through systematic evaluation of 14 models (1B-72B parameters) across 9 diverse benchmarks and 7 translation metrics, we find that translation performance is a good indicator of downstream task success (e.g., Phi-4, median Pearson r: MetricX = 0.89, xCOMET = 0.91, SSA-COMET = 0.87). These results suggest that the representational abilities supporting faithful translation overlap with those required for multilingual understanding. Translation quality thus emerges as a strong, inexpensive first-pass proxy for multilingual performance, enabling translation-first screening with targeted follow-up for specific tasks.
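
The correlation analysis described in this abstract can be illustrated with a minimal sketch. It computes a Pearson r between per-language translation-quality scores and per-language benchmark accuracy for a single model. The function name `pearson_r` and all numbers below are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances (unnormalized; the 1/n factors cancel in r).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-language scores for one model (illustrative numbers only).
translation_quality = [0.62, 0.71, 0.80, 0.85, 0.91]  # e.g. a translation metric
benchmark_accuracy = [0.35, 0.44, 0.58, 0.61, 0.73]   # e.g. a downstream task

r = pearson_r(translation_quality, benchmark_accuracy)
print(f"Pearson r = {r:.3f}")
```

A high r on such data would support using the translation metric as a first-pass screen, with targeted follow-up evaluation on the tasks that matter.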

Title: Beyond Tokens: Concept-Level Training Objectives for LLMs

Authors: Laya Iyer, Pranav Somani, Alice Guo, Dan Jurafsky, Chen Shani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.11791
Pdf URL: https://arxiv.org/pdf/2601.11791
Copy Paste: [[2601.11791]] Beyond Tokens: Concept-Level Training Objectives for LLMs(https://arxiv.org/abs/2601.11791)
Keywords: language model, llm
Abstract: The next-token prediction (NTP) objective has been foundational in the development of modern large language models (LLMs), driving advances in fluency and generalization. However, NTP operates at the token level, treating deviations from a single reference continuation as errors even when alternative continuations are equally plausible or semantically equivalent (e.g., "mom" vs. "mother"). As a result, token-level loss can penalize valid abstractions, paraphrases, or conceptually correct reasoning paths, biasing models toward surface form rather than underlying meaning. This mismatch between the training signal and semantic correctness motivates learning objectives that operate over higher-level representations. We propose a shift from token-level to concept-level prediction, where concepts group multiple surface forms of the same idea (e.g., "mom," "mommy," "mother" → MOTHER). We introduce various methods for integrating conceptual supervision into LLM training and show that concept-aware models achieve lower perplexity, improved robustness under domain shift, and stronger performance than NTP-based models on diverse NLP benchmarks. This suggests concept-level supervision as an improved training signal that better aligns LLMs with human semantic abstractions.
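
The contrast between token-level and concept-level loss can be sketched as follows. The abstract does not specify how concept supervision is implemented, so this sketch assumes one simple option: a concept's probability is the sum of the probabilities of its surface forms. The names `token_nll`, `concept_nll`, and `concept_of`, and the toy distribution, are hypothetical.

```python
import math


def token_nll(probs, target):
    """Standard next-token negative log-likelihood for a single position."""
    return -math.log(probs[target])


def concept_nll(probs, target, concept_of):
    """Concept-level NLL (assumed form): sum the probability mass of every
    token that maps to the same concept as the target, then take -log."""
    concept = concept_of.get(target, target)  # unmapped tokens are their own concept
    mass = sum(p for tok, p in probs.items()
               if concept_of.get(tok, tok) == concept)
    return -math.log(mass)


# Hypothetical concept map and predicted distribution over a toy vocabulary.
concept_of = {"mom": "MOTHER", "mommy": "MOTHER", "mother": "MOTHER"}
probs = {"mom": 0.2, "mommy": 0.1, "mother": 0.5, "dog": 0.2}

# The reference token is "mom", but the model prefers "mother".
print(f"token-level loss:   {token_nll(probs, 'mom'):.3f}")
print(f"concept-level loss: {concept_nll(probs, 'mom', concept_of):.3f}")
```

Under this assumption the model is penalized heavily at the token level for preferring "mother" over "mom", while the concept-level loss stays small because most of the probability mass already sits on the MOTHER concept.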

Title: ATOD: An Evaluation Framework and Benchmark for Agentic Task-Oriented Dialogue System

Authors: Yifei Zhang, Hooshang Nayyeri, Rinat Khaziev, Emine Yilmaz, Gokhan Tur, Dilek Hakkani-Tür, Hari Thadakamalla
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2601.11854
Pdf URL: https://arxiv.org/pdf/2601.11854
Copy Paste: [[2601.11854]] ATOD: An Evaluation Framework and Benchmark for Agentic Task-Oriented Dialogue System(https://arxiv.org/abs/2601.11854)
Keywords: language model, llm, agent
Abstract: Recent advances in task-oriented dialogue (TOD) systems, driven by large language models (LLMs) with extensive API and tool integration, have enabled conversational agents to coordinate interleaved goals, maintain long-horizon context, and act proactively through asynchronous execution. These capabilities extend beyond traditional TOD systems, yet existing benchmarks lack systematic support for evaluating such agentic behaviors. To address this gap, we introduce ATOD, a benchmark and synthetic dialogue generation pipeline that produces richly annotated conversations requiring long-term reasoning. ATOD captures key characteristics of advanced TOD, including multi-goal coordination, dependency management, memory, adaptability, and proactivity. Building on ATOD, we propose ATOD-Eval, a holistic evaluation framework that translates these dimensions into fine-grained metrics and supports reproducible offline and online evaluation. We further present a strong agentic memory-based evaluator for benchmarking on ATOD. Experiments show that ATOD-Eval enables comprehensive assessment across task completion, agentic capability, and response quality, and that the proposed evaluator offers a better accuracy-efficiency tradeoff compared to existing memory- and LLM-based approaches under this evaluation setting.

Title: CTPD: Cross Tokenizer Preference Distillation

Authors: Truong Nguyen, Phi Van Dat, Ngan Nguyen, Linh Ngo Van, Trung Le, Thanh Hong Nguyen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.11865
Pdf URL: https://arxiv.org/pdf/2601.11865
Copy Paste: [[2601.11865]] CTPD: Cross Tokenizer Preference Distillation(https://arxiv.org/abs/2601.11865)
Keywords: language model
Abstract: While knowledge distillation has seen widespread use in pre-training and instruction tuning, its application to aligning language models with human preferences remains underexplored, particularly in the more realistic cross-tokenizer setting. The incompatibility of tokenization schemes between teacher and student models has largely prevented fine-grained, white-box distillation of preference information. To address this gap, we propose Cross-Tokenizer Preference Distillation (CTPD), the first unified framework for transferring human-aligned behavior between models with heterogeneous tokenizers. CTPD introduces three key innovations: (1) Aligned Span Projection, which maps teacher and student tokens to shared character-level spans for precise supervision transfer; (2) a cross-tokenizer adaptation of Token-level Importance Sampling (TIS-DPO) for improved credit assignment; and (3) a Teacher-Anchored Reference, allowing the student to directly leverage the teacher's preferences in a DPO-style objective. Our theoretical analysis grounds CTPD in importance sampling, and experiments across multiple benchmarks confirm its effectiveness, with significant performance gains over existing methods. These results establish CTPD as a practical and general solution for preference distillation across diverse tokenization schemes, opening the door to more accessible and efficient alignment of language models.
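
The idea of mapping teacher and student tokens to shared character-level spans can be sketched as follows. This is not the paper's Aligned Span Projection; it is a minimal illustration that assumes both tokenizations concatenate to exactly the same text and that no token is empty. The function names and the example tokenizations are hypothetical.

```python
def char_boundaries(tokens):
    """Cumulative character end offset of each token."""
    ends, pos = [], 0
    for tok in tokens:
        pos += len(tok)
        ends.append(pos)
    return ends


def aligned_spans(teacher_tokens, student_tokens):
    """Split the text at every character offset where BOTH tokenizations
    have a token boundary. Returns (char_span, teacher_ids, student_ids)."""
    if "".join(teacher_tokens) != "".join(student_tokens):
        raise ValueError("tokenizations must cover the same text")
    t_ends = char_boundaries(teacher_tokens)
    s_ends = char_boundaries(student_tokens)
    s_index_of_end = {end: i for i, end in enumerate(s_ends)}

    spans = []
    start, t_start, s_start = 0, 0, 0
    for t_idx, end in enumerate(t_ends):
        if end in s_index_of_end:  # shared boundary closes a span
            s_idx = s_index_of_end[end]
            spans.append(((start, end),
                          list(range(t_start, t_idx + 1)),
                          list(range(s_start, s_idx + 1))))
            start, t_start, s_start = end, t_idx + 1, s_idx + 1
    return spans


# Hypothetical tokenizations of the same text by two different tokenizers.
teacher = ["un", "believ", "able", " day"]
student = ["unbe", "liev", "able", " ", "day"]
for span, t_ids, s_ids in aligned_spans(teacher, student):
    print(span, t_ids, s_ids)
```

Each resulting span groups the teacher tokens and student tokens that cover the same characters, so a per-span signal (for example, a summed log-probability) can be compared across tokenizers even though individual tokens do not line up.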

Title: Advances in LLM Reasoning Enable Flexibility in Clinical Problem-Solving

Authors: Kie Shidara, Preethi Prem, Jonathan Kim, Anna Podlasek, Feng Liu, Ahmed Alaa, Danilo Bernardo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.11866
Pdf URL: https://arxiv.org/pdf/2601.11866
Copy Paste: [[2601.11866]] Advances in LLM Reasoning Enable Flexibility in Clinical Problem-Solving(https://arxiv.org/abs/2601.11866)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have achieved high accuracy on medical question-answer (QA) benchmarks, yet their capacity for flexible clinical reasoning has been debated. Here, we asked whether advances in reasoning LLMs improve their cognitive flexibility in clinical reasoning. We assessed reasoning models from the OpenAI, Grok, Gemini, Claude, and DeepSeek families on the medicine abstraction and reasoning corpus (mARC), an adversarial medical QA benchmark which utilizes the Einstellung effect to induce inflexible overreliance on learned heuristic patterns in contexts where they become suboptimal. We found that strong reasoning models avoided Einstellung-based traps more often than weaker reasoning models, achieving human-level performance on mARC. On questions most commonly missed by physicians, the top 5 performing models answered 55% to 70% correctly with high confidence, indicating that these models may be less susceptible than humans to Einstellung effects. Our results indicate that strong reasoning models demonstrate improved flexibility in medical reasoning, achieving performance on par with humans on mARC.

Title: Faithfulness vs. Safety: Evaluating LLM Behavior Under Counterfactual Medical Evidence

Authors: Kaijie Mo, Siddhartha Venkatayogi, Chantal Shaib, Ramez Kouzy, Wei Xu, Byron C. Wallace, Junyi Jessy Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.11886
Pdf URL: https://arxiv.org/pdf/2601.11886
Copy Paste: [[2601.11886]] Faithfulness vs. Safety: Evaluating LLM Behavior Under Counterfactual Medical Evidence(https://arxiv.org/abs/2601.11886)
Keywords: llm
Abstract: In high-stakes domains like medicine, it may be generally desirable for models to faithfully adhere to the context provided. But what happens if the context does not align with model priors or safety protocols? In this paper, we investigate how LLMs behave and reason when presented with counterfactual or even adversarial medical evidence. We first construct MedCounterFact, a counterfactual medical QA dataset that requires the models to answer clinical comparison questions (i.e., judge the efficacy of certain treatments, with evidence consisting of randomized controlled trials provided as context). In MedCounterFact, real-world medical interventions within the questions and evidence are systematically replaced with four types of counterfactual stimuli, ranging from unknown words to toxic substances. Our evaluation across multiple frontier LLMs on MedCounterFact reveals that in the presence of counterfactual evidence, existing models overwhelmingly accept such "evidence" at face value even when it is dangerous or implausible, and provide confident and uncaveated answers. While it may be prudent to draw a boundary between faithfulness and safety, our findings reveal that there exists no such boundary yet.

Title: PPA-Plan: Proactive Pitfall Avoidance for Reliable Planning in Long-Context LLM Reasoning

Authors: Byeongjin Kim, Gyuwan Kim, Seo Yeon Park
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.11908
Pdf URL: https://arxiv.org/pdf/2601.11908
Copy Paste: [[2601.11908]] PPA-Plan: Proactive Pitfall Avoidance for Reliable Planning in Long-Context LLM Reasoning(https://arxiv.org/abs/2601.11908)
Keywords: language model, llm, long context, prompt
Abstract: Large language models (LLMs) struggle with reasoning over long contexts where relevant information is sparsely distributed. Although plan-and-execute frameworks mitigate this by decomposing tasks into planning and execution, their effectiveness is often limited by unreliable plan generation due to dependence on surface-level cues. Consequently, plans may be based on incorrect assumptions, and once a plan is formed, identifying what went wrong and revising it reliably becomes difficult, limiting the effectiveness of reactive refinement. To address this limitation, we propose PPA-Plan, a proactive planning strategy for long-context reasoning that focuses on preventing such failures before plan generation. PPA-Plan identifies potential logical pitfalls and false assumptions, formulates them as negative constraints, and conditions plan generation on explicitly avoiding these constraints. Experiments on long-context QA benchmarks show that executing plans generated by PPA-Plan consistently outperforms existing plan-and-execute methods and direct prompting.

Title: LSTM-MAS: A Long Short-Term Memory Inspired Multi-Agent System for Long-Context Understanding

Authors: Yichen Jiang, Peng Ye, Jiakang Yuan, Chongjun Tu, Lei Bai, Tao Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.11913
Pdf URL: https://arxiv.org/pdf/2601.11913
Copy Paste: [[2601.11913]] LSTM-MAS: A Long Short-Term Memory Inspired Multi-Agent System for Long-Context Understanding(https://arxiv.org/abs/2601.11913)
Keywords: language model, llm, long context, hallucination, agent
Abstract: Effectively processing long contexts remains a fundamental yet unsolved challenge for large language models (LLMs). Existing single-LLM-based methods primarily reduce the context window or optimize the attention mechanism, but they often incur additional computational costs or limit how far the context length can be extended. While multi-agent-based frameworks can mitigate these limitations, they remain susceptible to the accumulation of errors and the propagation of hallucinations. In this work, we draw inspiration from the Long Short-Term Memory (LSTM) architecture to design a Multi-Agent System called LSTM-MAS, emulating LSTM's hierarchical information flow and gated memory mechanisms for long-context understanding. Specifically, LSTM-MAS organizes agents in a chained architecture, where each node comprises a worker agent for segment-level comprehension, a filter agent for redundancy reduction, a judge agent for continuous error detection, and a manager agent that globally regulates information propagation and retention, analogous to the LSTM and its input gate, forget gate, constant error carousel unit, and output gate. These novel designs enable controlled information transfer and selective long-term dependency modeling across textual segments, which can effectively avoid error accumulation and hallucination propagation. We conducted an extensive evaluation of our method. Compared with the previous best multi-agent approach, CoA, our model achieves improvements of 40.93%, 43.70%, 121.57%, and 33.12% on NarrativeQA, Qasper, HotpotQA, and MuSiQue, respectively.

Title: Enhancing LLM-Based Data Annotation with Error Decomposition

Authors: Zhen Xu, Vedant Khatri, Yijun Dai, Xiner Liu, Siyan Li, Xuanming Zhang, Renzhe Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.11920
Pdf URL: https://arxiv.org/pdf/2601.11920
Copy Paste: [[2601.11920]] Enhancing LLM-Based Data Annotation with Error Decomposition(https://arxiv.org/abs/2601.11920)
Keywords: language model, llm
Abstract: Large language models offer a scalable alternative to human coding for data annotation tasks, enabling the scale-up of research across data-intensive domains. While LLMs are already achieving near-human accuracy on objective annotation tasks, their performance on subjective annotation tasks, such as those involving psychological constructs, is less consistent and more prone to errors. Standard evaluation practices typically collapse all annotation errors into a single alignment metric, but this simplified approach may obscure different kinds of errors that affect final analytical conclusions in different ways. Here, we propose a diagnostic evaluation paradigm that incorporates a human-in-the-loop step to separate task-inherent ambiguity from model-driven inaccuracies and assess annotation quality in terms of their potential downstream impacts. We refine this paradigm on ordinal annotation tasks, which are common in subjective annotation. The refined paradigm includes: (1) a diagnostic taxonomy that categorizes LLM annotation errors along two dimensions: source (model-specific vs. task-inherent) and type (boundary ambiguity vs. conceptual misidentification); (2) a lightweight human annotation test to estimate task-inherent ambiguity from LLM annotations; and (3) a computational method to decompose observed LLM annotation errors following our taxonomy. We validate this paradigm on four educational annotation tasks, demonstrating both its conceptual validity and practical utility. Theoretically, our work provides empirical evidence for why excessively high alignment is unrealistic in specific annotation tasks and why single alignment metrics inadequately reflect the quality of LLM annotations. In practice, our paradigm can be a low-cost diagnostic tool that assesses the suitability of a given task for LLM annotation and provides actionable insights for further technical optimization.

Title: Event Detection with a Context-Aware Encoder and LoRA for Improved Performance on Long-Tailed Classes

Authors: Abdullah Al Monsur, Nitesh Vamshi Bommisetty, Gene Louis Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.11932
Pdf URL: https://arxiv.org/pdf/2601.11932
Copy Paste: [[2601.11932]] Event Detection with a Context-Aware Encoder and LoRA for Improved Performance on Long-Tailed Classes(https://arxiv.org/abs/2601.11932)
Keywords: llm
Abstract: The current state of event detection research has two notable recurring limitations that we investigate in this study. First, the unidirectional nature of decoder-only LLMs presents a fundamental architectural bottleneck for natural language understanding tasks that depend on rich, bidirectional context. Second, we confront the conventional reliance on Micro-F1 scores in event detection literature, which systematically inflates performance by favoring majority classes. Instead, we focus on Macro-F1 as a more representative measure of a model's ability across the long tail of event types. Our experiments demonstrate that models enhanced with sentence context achieve superior performance over canonical decoder-only baselines. Using Low-Rank Adaptation (LoRA) during finetuning provides a substantial boost in Macro-F1 scores, especially for the decoder-only models, showing that LoRA can be an effective tool to enhance LLMs' performance on long-tailed event classes.
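
The claim that Micro-F1 favors majority classes can be made concrete with a small worked example. The sketch below assumes single-label classification and uses hypothetical event labels and counts; it is not drawn from the paper's data. A classifier that always predicts the majority class still gets a respectable Micro-F1, while Macro-F1 exposes the missed minority class.

```python
from collections import Counter


def micro_macro_f1(gold, pred):
    """Return (micro_f1, macro_f1) for single-label classification."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label gets a false positive
            fn[g] += 1  # gold label gets a false negative

    def f1(t, p_, n):
        denom = 2 * t + p_ + n
        return 2 * t / denom if denom else 0.0

    # Macro: average the per-class F1 scores, so every class counts equally.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: pool the counts over classes first, so frequent classes dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


# Hypothetical long-tailed data: 8 "Attack" events, 2 "Elect" events.
gold = ["Attack"] * 8 + ["Elect"] * 2
pred = ["Attack"] * 10  # the model never predicts the rare class

micro, macro = micro_macro_f1(gold, pred)
print(f"Micro-F1 = {micro:.3f}, Macro-F1 = {macro:.3f}")
```

Here Micro-F1 is 0.8 while Macro-F1 is roughly 0.44, because the rare class contributes an F1 of zero to the macro average.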

Title: Double-Calibration: Towards Trustworthy LLMs via Calibrating Knowledge and Reasoning Confidence

Authors: Yuyin Lu, Ziran Liang, Yanghui Rao, Wenqi Fan, Fu Lee Wang, Qing Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.11956
Pdf URL: https://arxiv.org/pdf/2601.11956
Copy Paste: [[2601.11956]] Double-Calibration: Towards Trustworthy LLMs via Calibrating Knowledge and Reasoning Confidence(https://arxiv.org/abs/2601.11956)
Keywords: language model, llm, hallucination
Abstract: Trustworthy reasoning in Large Language Models (LLMs) is challenged by their propensity for hallucination. While augmenting LLMs with Knowledge Graphs (KGs) improves factual accuracy, existing KG-augmented methods fail to quantify epistemic uncertainty in both the retrieved evidence and LLMs' reasoning. To bridge this gap, we introduce DoublyCal, a framework built on a novel double-calibration principle. DoublyCal employs a lightweight proxy model to first generate KG evidence alongside a calibrated evidence confidence. This calibrated supporting evidence then guides a black-box LLM, yielding final predictions that are not only more accurate but also well-calibrated, with confidence scores traceable to the uncertainty of the supporting evidence. Experiments on knowledge-intensive benchmarks show that DoublyCal significantly improves both the accuracy and confidence calibration of black-box LLMs with low token cost.
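
Confidence calibration is commonly quantified with the Expected Calibration Error (ECE). The abstract does not say which metric the paper reports, so the sketch below is only an assumption about how calibration might be measured; the function name, the equal-width binning, and the example values are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: the weighted average, over bins, of
    |mean confidence - accuracy| for the predictions in each bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Hypothetical predictions: stated confidence and whether the answer was right.
confidences = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
correct = [True, True, True, False, True, False]

print(f"ECE = {expected_calibration_error(confidences, correct):.3f}")
```

A well-calibrated model has an ECE near zero: among answers given with 90% confidence, about 90% are correct.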

Title: PEARL: Self-Evolving Assistant for Time Management with Reinforcement Learning

Authors: Bingxuan Li, Jeonghwan Kim, Cheng Qian, Xiusi Chen, Eitan Anzenberg, Niran Kundapur, Heng Ji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.11957
Pdf URL: https://arxiv.org/pdf/2601.11957
Copy Paste: [[2601.11957]] PEARL: Self-Evolving Assistant for Time Management with Reinforcement Learning(https://arxiv.org/abs/2601.11957)
Keywords: language model, llm, agent
Abstract: Overlapping calendar invitations force busy professionals to repeatedly decide which meetings to attend, reschedule, or decline. We refer to this preference-driven decision process as calendar conflict resolution. Automating such a process is crucial yet challenging. Scheduling logistics drain hours, and human delegation often fails at scale, which motivates us to ask: can we trust a large language model (LLM) or language agent to manage time? To enable systematic study of this question, we introduce CalConflictBench, a benchmark for long-horizon calendar conflict resolution. Conflicts are presented sequentially and agents receive feedback after each round, requiring them to infer and adapt to user preferences progressively. Our experiments show that current LLM agents perform poorly, with high error rates; e.g., Qwen-3-30B-Think has a 35% average error rate. To address this gap, we propose PEARL, a reinforcement-learning framework that augments a language agent with an external memory module and an optimized round-wise reward design, enabling the agent to progressively infer and adapt to user preferences on the fly. Experiments on CalConflictBench show that PEARL achieves a 0.76 error reduction rate and a 55% improvement in average error rate compared to the strongest baseline.

Title: MemoryRewardBench: Benchmarking Reward Models for Long-Term Memory Management in Large Language Models

Authors: Zecheng Tang, Baibei Ji, Ruoxi Sun, Haitian Wang, WangJie You, Zhang Yijun, Wenpeng Zhu, Ji Qi, Juntao Li, Min Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.11969
Pdf URL: https://arxiv.org/pdf/2601.11969
Copy Paste: [[2601.11969]] MemoryRewardBench: Benchmarking Reward Models for Long-Term Memory Management in Large Language Models(https://arxiv.org/abs/2601.11969)
Keywords: language model, llm, long context
Abstract: Existing works increasingly adopt memory-centric mechanisms to process long contexts segment by segment, and effective memory management is one of the key capabilities that enable large language models to propagate information across the entire sequence. Therefore, leveraging reward models (RMs) to automatically and reliably evaluate memory quality is critical. In this work, we introduce MemoryRewardBench, the first benchmark to systematically study the ability of RMs to evaluate long-term memory management processes. MemoryRewardBench covers both long-context comprehension and long-form generation tasks, featuring 10 distinct settings with different memory management patterns, with context length ranging from 8K to 128K tokens. Evaluations on 13 cutting-edge RMs indicate a diminishing performance gap between open-source and proprietary models, with newer-generation models consistently outperforming their predecessors regardless of parameter count. We further expose the capabilities and fundamental limitations of current RMs in evaluating LLM memory management across diverse settings.

Title: Acting Flatterers via LLMs Sycophancy: Combating Clickbait with LLMs Opposing-Stance Reasoning

Authors: Chaowei Zhang, Xiansheng Luo, Zewei Zhang, Yi Zhu, Jipeng Qiang, Longwei Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.12019
Pdf URL: https://arxiv.org/pdf/2601.12019
Copy Paste: [[2601.12019]] Acting Flatterers via LLMs Sycophancy: Combating Clickbait with LLMs Opposing-Stance Reasoning(https://arxiv.org/abs/2601.12019)
Keywords: language model, llm, prompt
Abstract: The widespread proliferation of online content has intensified concerns about clickbait: deceptive or exaggerated headlines designed to attract attention. While Large Language Models (LLMs) offer a promising avenue for addressing this issue, their effectiveness is often hindered by Sycophancy, a tendency to produce reasoning that matches users' beliefs rather than the truth, which deviates from instruction-following principles. Rather than treating sycophancy as a flaw to be eliminated, this work proposes a novel approach that initially harnesses this behavior to generate contrastive reasoning from opposing perspectives. Specifically, we design a Self-renewal Opposing-stance Reasoning Generation (SORG) framework that prompts LLMs to produce high-quality agree and disagree reasoning pairs for a given news title without requiring ground-truth labels. To utilize the generated reasoning, we develop a local Opposing Reasoning-based Clickbait Detection (ORCD) model that integrates three BERT encoders to represent the title and its associated reasoning. The model leverages contrastive learning, guided by soft labels derived from LLM-generated credibility scores, to enhance detection robustness. Experimental evaluations on three benchmark datasets demonstrate that our method consistently outperforms LLM prompting, fine-tuned smaller language models, and state-of-the-art clickbait detection baselines.

Title: Preserving Fairness and Safety in Quantized LLMs Through Critical Weight Protection

Authors: Muhammad Alif Al Hakim, Alfan Farizki Wicaksono, Fajri Koto
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.12033
Pdf URL: https://arxiv.org/pdf/2601.12033
Copy Paste: [[2601.12033]] Preserving Fairness and Safety in Quantized LLMs Through Critical Weight Protection(https://arxiv.org/abs/2601.12033)
Keywords: language model, llm
Abstract: Quantization is widely adopted to reduce the computational cost of large language models (LLMs); however, its implications for fairness and safety, particularly in dynamic quantization and multilingual contexts, remain underexplored. In this work, we conduct a systematic study of how static and dynamic quantization methods impact fairness and safety across benchmarks measuring intrinsic and extrinsic bias and safety alignment. For fairness, we evaluate English, French, Dutch, Spanish, and Turkish; for safety, we focus on English, Korean, and Arabic. Our findings reveal that quantization consistently degrades fairness and safety, with dynamic methods demonstrating greater stability than static ones. Moreover, fairness degradation varies across languages, while safety deterioration is especially pronounced in non-English settings. To address these risks, we introduce Critical Weight Protection, a novel technique that identifies and preserves fairness- and safety-critical weights during quantization. This approach effectively mitigates bias and safety deterioration without costly retraining or alignment, maintaining trustworthiness while retaining efficiency.
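
Illustrative sketch for the entry above (not the authors' implementation): one way to read "preserving critical weights during quantization" is mixed precision, where a small fraction of weights stays in full precision and the rest are rounded to a low-bit grid. The criticality scores, the `keep_fraction` parameter, and the symmetric round-to-nearest quantizer are all assumptions made for this example; the abstract does not specify how critical weights are identified.

```python
from typing import List, Set, Tuple

def quantize_with_protection(
    weights: List[float],
    critical_scores: List[float],  # hypothetical per-weight importance scores
    keep_fraction: float,
    bits: int = 4,
) -> Tuple[List[float], Set[int]]:
    """Round weights to a symmetric low-bit grid, except the highest-scoring ones."""
    levels = 2 ** (bits - 1) - 1
    # Step size of the grid; fall back to 1.0 if every weight is zero.
    scale = (max(abs(w) for w in weights) / levels) or 1.0
    n_keep = int(round(keep_fraction * len(weights)))
    by_score = sorted(range(len(weights)), key=lambda i: critical_scores[i], reverse=True)
    protected = set(by_score[:n_keep])
    quantized = [
        w if i in protected else round(w / scale) * scale
        for i, w in enumerate(weights)
    ]
    return quantized, protected

weights = [0.11, -0.52, 0.70, 0.33]
scores = [0.1, 0.9, 0.2, 0.3]
quantized, protected = quantize_with_protection(weights, scores, keep_fraction=0.25)
```

In this toy run only the weight at index 1 is protected and keeps its exact value; the others snap to the nearest grid point.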

Title: Don't Start Over: A Cost-Effective Framework for Migrating Personalized Prompts Between LLMs

Authors: Ziyi Zhao, Chongming Gao, Yang Zhang, Haoyan Liu, Weinan Gan, Huifeng Guo, Yong Liu, Fuli Feng
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2601.12034
Pdf URL: https://arxiv.org/pdf/2601.12034
Copy Paste: [[2601.12034]] Don't Start Over: A Cost-Effective Framework for Migrating Personalized Prompts Between LLMs(https://arxiv.org/abs/2601.12034)
Keywords: language model, llm, prompt
Abstract: Personalization in Large Language Models (LLMs) often relies on user-specific soft prompts. However, these prompts become obsolete when the foundation model is upgraded, necessitating costly, full-scale retraining. To overcome this limitation, we propose the Prompt-level User Migration Adapter (PUMA), a lightweight framework to efficiently migrate personalized prompts across incompatible models. PUMA utilizes a parameter-efficient adapter to bridge the semantic gap, combined with a group-based user selection strategy to significantly reduce training costs. Experiments on three large-scale datasets show our method matches or even surpasses the performance of retraining from scratch, reducing computational cost by up to 98%. The framework demonstrates strong generalization across diverse model architectures and robustness in advanced scenarios like chained and aggregated migrations, offering a practical path for the sustainable evolution of personalized AI by decoupling user assets from the underlying models.

Title: Codebook-Injected Dialogue Segmentation for Multi-Utterance Constructs Annotation: LLM-Assisted and Gold-Label-Free Evaluation

Authors: Jinsook Lee, Kirk Vanacore, Zhuqian Zhou, Jeanine Grutter, Rene F. Kizilcec
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.12061
Pdf URL: https://arxiv.org/pdf/2601.12061
Copy Paste: [[2601.12061]] Codebook-Injected Dialogue Segmentation for Multi-Utterance Constructs Annotation: LLM-Assisted and Gold-Label-Free Evaluation(https://arxiv.org/abs/2601.12061)
Keywords: llm
Abstract: Dialogue Act (DA) annotation typically treats communicative or pedagogical intent as localized to individual utterances or turns. This leads annotators to agree on the underlying action while disagreeing on segment boundaries, reducing apparent reliability. We propose codebook-injected segmentation, which conditions boundary decisions on downstream annotation criteria, and evaluate LLM-based segmenters against standard and retrieval-augmented baselines. To assess these without gold labels, we introduce evaluation metrics for span consistency, distinctiveness, and human-AI distributional agreement. We found DA-awareness produces segments that are internally more consistent than text-only baselines. While LLMs excel at creating construct-consistent spans, coherence-based baselines remain superior at detecting global shifts in dialogue flow. Across two datasets, no single segmenter dominates. Improvements in within-segment coherence frequently trade off against boundary distinctiveness and human-AI distributional agreement. These results highlight segmentation as a consequential design choice that should be optimized for downstream objectives rather than a single performance score.

Title: To Copy or Not to Copy: Copying Is Easier to Induce Than Recall

Authors: Mehrdad Farahani, Franziska Penzkofer, Richard Johansson
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.12075
Pdf URL: https://arxiv.org/pdf/2601.12075
Copy Paste: [[2601.12075]] To Copy or Not to Copy: Copying Is Easier to Induce Than Recall(https://arxiv.org/abs/2601.12075)
Keywords: language model, prompt
Abstract: Language models used in retrieval-augmented settings must arbitrate between parametric knowledge stored in their weights and contextual information in the prompt. This work presents a mechanistic study of that choice by extracting an arbitration vector from model activations on a curated dataset designed to disentangle (i) irrelevant contexts that elicit parametric recall and (ii) relevant but false contexts that elicit copying. The vector is computed as the residual-stream centroid difference between these regimes across 27 relations, and is injected as an additive intervention at selected layers and token spans to steer behavior in two directions: Copy→Recall (suppressing context use) and Recall→Copy (inducing the model to copy any token from the context). Experiments on two architectures (decoder-only and encoder/decoder) and two open-domain QA benchmarks show consistent behavior shifts under moderate scaling while monitoring accuracy and fluency. Mechanistic analyses of attention routing, MLP contributions, and layer-wise probability trajectories reveal an asymmetry: inducing copying is an easy "reactivation" process that can be triggered at different locations in the input, while restoring recall is a "suppression" process that is more fragile and strongly tied to object-token interventions.
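
Illustrative sketch for the entry above: the centroid-difference construction and the additive intervention can be shown on toy vectors. The sign convention (copy centroid minus recall centroid), the function names, and the 3-dimensional activations are assumptions for this example; the paper works with real residual-stream activations at selected layers and token spans.

```python
from typing import List

Vector = List[float]

def centroid(rows: List[Vector]) -> Vector:
    """Mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def arbitration_vector(copy_acts: List[Vector], recall_acts: List[Vector]) -> Vector:
    """Centroid difference pointing from the recall regime toward the copy regime."""
    c_copy, c_recall = centroid(copy_acts), centroid(recall_acts)
    return [a - b for a, b in zip(c_copy, c_recall)]

def steer(hidden: Vector, direction: Vector, scale: float) -> Vector:
    """Additive intervention: scale > 0 pushes toward copying, scale < 0 toward recall."""
    return [h + scale * d for h, d in zip(hidden, direction)]

# Toy activations (hypothetical values, not taken from any model).
copy_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
recall_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]
v = arbitration_vector(copy_acts, recall_acts)
steered = steer([0.0, 0.0, 0.0], v, scale=0.5)
```

Flipping the sign of `scale` gives the opposite steering direction, which corresponds to the Copy→Recall and Recall→Copy interventions described in the abstract.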

Title: Optimizing User Profiles via Contextual Bandits for Retrieval-Augmented LLM Personalization

Authors: Linfeng Du, Ye Yuan, Zichen Zhao, Fuyuan Lyu, Emiliano Penaloza, Xiuying Chen, Zipeng Sun, Jikun Kang, Laurent Charlin, Xue Liu, Haolun Wu
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2601.12078
Pdf URL: https://arxiv.org/pdf/2601.12078
Copy Paste: [[2601.12078]] Optimizing User Profiles via Contextual Bandits for Retrieval-Augmented LLM Personalization(https://arxiv.org/abs/2601.12078)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) excel at general-purpose tasks, yet adapting their responses to individual users remains challenging. Retrieval augmentation provides a lightweight alternative to fine-tuning by conditioning LLMs on user history records, and existing approaches typically select these records based on semantic relevance. We argue that relevance serves as an unreliable proxy for utility: a record may be semantically similar to a query yet fail to improve generation quality or even degrade it due to redundancy or conflicting information. To bridge this gap, we propose PURPLE, a contextual bandit framework that oPtimizes UseR Profiles for Llm pErsonalization. In contrast to a greedy selection of the most relevant records, PURPLE treats profile construction as a set generation process and utilizes a Plackett-Luce ranking model to capture complex inter-record dependencies. By training with dense feedback provided by the likelihood of the reference response, our method aligns retrieval directly with generation quality. Extensive experiments on nine personalization tasks demonstrate that PURPLE consistently outperforms strong heuristic and retrieval-augmented baselines in both effectiveness and efficiency, establishing a principled and scalable solution for optimizing user profiles.
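
Illustrative sketch for the entry above: a Plackett-Luce model assigns a probability to an ordered selection by repeatedly applying a softmax over the items not yet chosen. The version below uses fixed per-record scores, which is a simplification assumed for this example; the abstract indicates that PURPLE captures inter-record dependencies, so its scores presumably depend on more than the record alone.

```python
import math
import random
from typing import List, Sequence

def plackett_luce_log_prob(scores: Sequence[float], ranking: Sequence[int]) -> float:
    """Log-probability of choosing items in the order given by `ranking`."""
    remaining = list(range(len(scores)))
    log_prob = 0.0
    for item in ranking:
        # Softmax over the items that are still available at this step.
        log_norm = math.log(sum(math.exp(scores[j]) for j in remaining))
        log_prob += scores[item] - log_norm
        remaining.remove(item)
    return log_prob

def sample_ranking(scores: Sequence[float], k: int, rng: random.Random) -> List[int]:
    """Draw k distinct items one at a time, without replacement."""
    remaining = list(range(len(scores)))
    chosen: List[int] = []
    for _ in range(k):
        weights = [math.exp(scores[j]) for j in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen

profile = sample_ranking([2.0, 0.5, 1.0, -1.0], k=2, rng=random.Random(0))
```

With equal scores over three items, every full ordering has probability 1/6, which is a quick sanity check on `plackett_luce_log_prob`.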

Title: Large language models struggle with ethnographic text annotation

Authors: Leonardo S. Goodall, Dor Shilton, Daniel A. Mullins, Harvey Whitehouse
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.12099
Pdf URL: https://arxiv.org/pdf/2601.12099
Copy Paste: [[2601.12099]] Large language models struggle with ethnographic text annotation(https://arxiv.org/abs/2601.12099)
Keywords: language model, llm
Abstract: Large language models (LLMs) have shown promise for automated text annotation, raising hopes that they might accelerate cross-cultural research by extracting structured data from ethnographic texts. We evaluated 7 state-of-the-art LLMs on their ability to annotate 121 ritual features across 567 ethnographic excerpts. Performance was limited, falling well below levels required for reliable automated annotation. Longer texts, features requiring ordinal distinctions, and ambiguous constructs proved particularly difficult. Human inter-coder reliability set an approximate ceiling on LLM accuracy: features that human coders found difficult to agree upon were also difficult for LLMs. Yet even on features where humans reliably agreed, models fell short of human performance. Our findings suggest that LLMs cannot yet substitute for human expertise in ethnographic annotation.

Title: Powerful Training-Free Membership Inference Against Autoregressive Language Models

Authors: David Ilić, David Stanojević, Kostadin Cvejoski
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2601.12104
Pdf URL: https://arxiv.org/pdf/2601.12104
Copy Paste: [[2601.12104]] Powerful Training-Free Membership Inference Against Autoregressive Language Models(https://arxiv.org/abs/2601.12104)
Keywords: language model, gpt
Abstract: Fine-tuned language models pose significant privacy risks, as they may memorize and expose sensitive information from their training data. Membership inference attacks (MIAs) provide a principled framework for auditing these risks, yet existing methods achieve limited detection rates, particularly at the low false-positive thresholds required for practical privacy auditing. We present EZ-MIA, a membership inference attack that exploits a key observation: memorization manifests most strongly at error positions, specifically tokens where the model predicts incorrectly yet still shows elevated probability for training examples. We introduce the Error Zone (EZ) score, which measures the directional imbalance of probability shifts at error positions relative to a pretrained reference model. This principled statistic requires only two forward passes per query and no model training of any kind. On WikiText with GPT-2, EZ-MIA achieves 3.8x higher detection than the previous state-of-the-art under identical conditions (66.3% versus 17.5% true positive rate at 1% false positive rate), with near-perfect discrimination (AUC 0.98). At the stringent 0.1% FPR threshold critical for real-world auditing, we achieve 8x higher detection than prior work (14.0% versus 1.8%), requiring no reference model training. These gains extend to larger architectures: on AG News with Llama-2-7B, we achieve 3x higher detection (46.7% versus 15.8% TPR at 1% FPR). These results establish that privacy risks of fine-tuned language models are substantially greater than previously understood, with implications for both privacy auditing and deployment decisions. Code is available at this https URL.
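
Illustrative sketch for the entry above: the headline numbers are true positive rates at a fixed false positive rate. The function below computes that metric from attack scores, under the assumption that higher scores mean "more likely a training member"; it does not implement the EZ score itself, whose details are not given in the abstract.

```python
import math
from typing import Sequence

def tpr_at_fpr(
    member_scores: Sequence[float],
    nonmember_scores: Sequence[float],
    target_fpr: float,
) -> float:
    """TPR at the loosest threshold whose FPR on non-members stays <= target_fpr."""
    # Number of non-members that may be flagged without exceeding the target.
    allowed = math.floor(target_fpr * len(nonmember_scores) + 1e-9)
    ranked = sorted(nonmember_scores, reverse=True)
    # A sample is flagged when its score is strictly above the threshold.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    hits = sum(1 for s in member_scores if s > threshold)
    return hits / len(member_scores)

nonmembers = list(range(1, 11))  # toy scores 1..10
members = [9.5, 5, 11, 9]
toy_tpr = tpr_at_fpr(members, nonmembers, target_fpr=0.1)

# Ratio behind the "3.8x" figure quoted in the abstract.
ratio = 66.3 / 17.5
```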

Title: Bengali Text Classification: An Evaluation of Large Language Model Approaches

Authors: Md Mahmudul Hoque, Md Mehedi Hassain, Md Hojaifa Tanvir, Rahul Nandy
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.12132
Pdf URL: https://arxiv.org/pdf/2601.12132
Copy Paste: [[2601.12132]] Bengali Text Classification: An Evaluation of Large Language Model Approaches(https://arxiv.org/abs/2601.12132)
Keywords: language model, llm
Abstract: Bengali text classification is a significant task in natural language processing (NLP), where text is categorized into predefined labels. Unlike English, Bengali faces challenges due to the lack of extensive annotated datasets and pre-trained language models. This study explores the effectiveness of large language models (LLMs) in classifying Bengali newspaper articles. The dataset used, obtained from Kaggle, consists of articles from Prothom Alo, a major Bangladeshi newspaper. Three instruction-tuned LLMs (LLaMA 3.1 8B Instruct, LLaMA 3.2 3B Instruct, and Qwen 2.5 7B Instruct) were evaluated for this task under the same classification framework. Among the evaluated models, Qwen 2.5 achieved the highest classification accuracy of 72%, showing particular strength in the "Sports" category. In comparison, LLaMA 3.1 and LLaMA 3.2 attained accuracies of 53% and 56%, respectively. The findings highlight the effectiveness of LLMs in Bengali text classification, despite the scarcity of resources for Bengali NLP. Future research will focus on exploring additional models, addressing class imbalance issues, and refining fine-tuning approaches to improve classification performance.

Title: Analyzing Cancer Patients' Experiences with Embedding-based Topic Modeling and LLMs

Authors: Teodor-Călin Ionescu, Lifeng Han, Jan Heijdra Suasnabar, Anne Stiggelbout, Suzan Verberne
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.12154
Pdf URL: https://arxiv.org/pdf/2601.12154
Copy Paste: [[2601.12154]] Analyzing Cancer Patients' Experiences with Embedding-based Topic Modeling and LLMs(https://arxiv.org/abs/2601.12154)
Keywords: gpt, llm
Abstract: This study investigates the use of neural topic modeling and LLMs to uncover meaningful themes from patient storytelling data, to offer insights that could contribute to more patient-oriented healthcare practices. We analyze a collection of transcribed interviews with cancer patients (132,722 words in 13 interviews). We first evaluate BERTopic and Top2Vec for individual interview summarization by using similar preprocessing, chunking, and clustering configurations to ensure a fair comparison on Keyword Extraction. LLMs (GPT4) are then used for the next step, topic labeling. Their outputs for a single interview (I0) are rated through a small-scale human evaluation, focusing on coherence, clarity, and relevance. Based on the preliminary results and evaluation, BERTopic shows stronger performance and is selected for further experimentation using three clinically oriented embedding models. We then analyzed the full interview collection with the best model setting. Results show that domain-specific embeddings improved topic precision and interpretability, with BioClinicalBERT producing the most consistent results across transcripts. The global analysis of the full dataset of 13 interviews, using the BioClinicalBERT embedding model, reveals the most dominant topics throughout all 13 interviews, namely "Coordination and Communication in Cancer Care Management" and "Patient Decision-Making in Cancer Treatment Journey". Although the interviews are machine translations from Dutch to English, and clinical professionals are not involved in this evaluation, the findings suggest that neural topic modeling, particularly BERTopic, can help provide useful feedback to clinicians from patient interviews. This pipeline could support more efficient document navigation and strengthen the role of patients' voices in healthcare workflows.

Title: Tolerance Principle and Small Language Model Learning

Authors: Adam E. Friedman, Stevan Harnad, Rushen Shi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.12179
Pdf URL: https://arxiv.org/pdf/2601.12179
Copy Paste: [[2601.12179]] Tolerance Principle and Small Language Model Learning(https://arxiv.org/abs/2601.12179)
Keywords: language model, gpt
Abstract: Modern language models like GPT-3, BERT, and LLaMA require massive training data, yet with sufficient training they reliably learn to distinguish grammatical from ungrammatical sentences. Children as young as 14 months already have the capacity to learn abstract grammar rules from very few exemplars, even in the presence of non-rule-following exceptions. Yang's (2016) Tolerance Principle defines a precise threshold for how many exceptions a rule can tolerate and still be learnable. The present study explored the minimal amount and quality of training data necessary for rules to be generalized by a transformer-based language model to test the predictions of the Tolerance Principle. We trained BabyBERTa (Huebner et al. 2021), a transformer model optimized for small datasets, on artificial grammars. The training sets varied in size, number of unique sentence types, and proportion of rule-following versus exception exemplars. We found that, unlike human infants, BabyBERTa's learning dynamics do not align with the Tolerance Principle.
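
Illustrative sketch for the entry above: the Tolerance Principle threshold is commonly stated as theta_N = N / ln(N), meaning a rule over N item types remains productive as long as the number of exceptions does not exceed that value. The function names below are made up for this example, and the snippet only evaluates the threshold; it does not reproduce the paper's training setup.

```python
import math

def tolerance_threshold(n: int) -> float:
    """Threshold theta_N = N / ln(N) for a rule that applies to N item types."""
    if n < 2:
        raise ValueError("the threshold is only defined here for N >= 2")
    return n / math.log(n)

def rule_is_productive(n: int, exceptions: int) -> bool:
    """True when the number of exceptions stays within the tolerance threshold."""
    return exceptions <= tolerance_threshold(n)

theta_100 = tolerance_threshold(100)  # about 21.7, so up to 21 exceptions are tolerated
```

Because the threshold grows more slowly than N, the tolerated proportion of exceptions shrinks as the vocabulary covered by the rule grows.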

Title: Plan, Verify and Fill: A Structured Parallel Decoding Approach for Diffusion Language Models

Authors: Miao Li, Hanyang Jiang, Sikai Chen, Hengyu Fu, Yuhang Cai, Baihe Huang, Tinghan Ye, Xuanzhou Chen, Pascal Van Hentenryck
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.12247
Pdf URL: https://arxiv.org/pdf/2601.12247
Copy Paste: [[2601.12247]] Plan, Verify and Fill: A Structured Parallel Decoding Approach for Diffusion Language Models(https://arxiv.org/abs/2601.12247)
Keywords: language model
Abstract: Diffusion Language Models (DLMs) present a promising non-sequential paradigm for text generation, distinct from standard autoregressive (AR) approaches. However, current decoding strategies often adopt a reactive stance, underutilizing the global bidirectional context to dictate global trajectories. To address this, we propose Plan-Verify-Fill (PVF), a training-free paradigm that grounds planning via quantitative validation. PVF actively constructs a hierarchical skeleton by prioritizing high-leverage semantic anchors and employs a verification protocol to operationalize pragmatic structural stopping where further deliberation yields diminishing returns. Extensive evaluations on LLaDA-8B-Instruct and Dream-7B-Instruct demonstrate that PVF reduces the Number of Function Evaluations (NFE) by up to 65% compared to confidence-based parallel decoding across benchmark datasets, unlocking superior efficiency without compromising accuracy.

Title: Multimodal Generative Engine Optimization: Rank Manipulation for Vision-Language Model Rankers

Authors: Yixuan Du, Chenxiao Yu, Haoyan Xu, Ziyi Wang, Yue Zhao, Xiyang Hu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.12263
Pdf URL: https://arxiv.org/pdf/2601.12263
Copy Paste: [[2601.12263]] Multimodal Generative Engine Optimization: Rank Manipulation for Vision-Language Model Rankers(https://arxiv.org/abs/2601.12263)
Keywords: language model
Abstract: Vision-Language Models (VLMs) are rapidly replacing unimodal encoders in modern retrieval and recommendation systems. While their capabilities are well-documented, their robustness against adversarial manipulation in competitive ranking scenarios remains largely unexplored. In this paper, we uncover a critical vulnerability in VLM-based product search: multimodal ranking attacks. We present Multimodal Generative Engine Optimization (MGEO), a novel adversarial framework that enables a malicious actor to unfairly promote a target product by jointly optimizing imperceptible image perturbations and fluent textual suffixes. Unlike existing attacks that treat modalities in isolation, MGEO employs an alternating gradient-based optimization strategy to exploit the deep cross-modal coupling within the VLM. Extensive experiments on real-world datasets using state-of-the-art models demonstrate that our coordinated attack significantly outperforms text-only and image-only baselines. These findings reveal that multimodal synergy, typically a strength of VLMs, can be weaponized to compromise the integrity of search rankings without triggering conventional content filters.

Title: Simulated Annealing Enhances Theory-of-Mind Reasoning in Autoregressive Language Models

Authors: Xucong Hu, Jian-Qiao Zhu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.12269
Pdf URL: https://arxiv.org/pdf/2601.12269
Copy Paste: [[2601.12269]] Simulated Annealing Enhances Theory-of-Mind Reasoning in Autoregressive Language Models(https://arxiv.org/abs/2601.12269)
Keywords: language model
Abstract: Autoregressive language models are next-token predictors and have been criticized for only optimizing surface plausibility (i.e., local coherence) rather than maintaining correct latent-state representations (i.e., global coherence). Because Theory of Mind (ToM) tasks crucially depend on reasoning about latent mental states of oneself and others, such models are therefore often thought to fail at ToM. While post-training methods can improve ToM performance, we show that strong ToM capability can be recovered directly from the base model without any additional weight updates or verifications. Our approach builds on recent power-sampling methods (Karan & Du, 2025) that use Markov chain Monte Carlo (MCMC) to sample from sharpened sequence-level (rather than token-level) probability distributions of autoregressive language models. We further find that incorporating annealing, where the tempered distribution is gradually shifted from high to low temperature, substantially improves ToM performance over fixed-temperature power sampling. Together, these results suggest that sampling-based optimization provides a powerful way to extract latent capabilities from language models without retraining.
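
Illustrative sketch for the entry above: the idea of sampling from a sharpened distribution p(x)^power with MCMC, while gradually raising the power (lowering the temperature), can be shown on a tiny discrete distribution. This is a toy under stated assumptions: the three-outcome distribution, the uniform proposal, and the schedule are invented for the example, whereas the paper samples whole sequences from an autoregressive language model.

```python
import random
from typing import List, Sequence

def sharpen(probs: Sequence[float], power: float) -> List[float]:
    """Normalised p(x)^power; a larger power concentrates mass on likely outcomes."""
    weights = [p ** power for p in probs]
    total = sum(weights)
    return [w / total for w in weights]

def annealed_mh(
    probs: Sequence[float],
    powers: Sequence[float],
    steps_per_power: int,
    rng: random.Random,
) -> int:
    """Metropolis-Hastings over outcome indices, targeting p^power as power grows."""
    state = rng.randrange(len(probs))
    for power in powers:  # increasing power corresponds to decreasing temperature
        for _ in range(steps_per_power):
            proposal = rng.randrange(len(probs))  # symmetric uniform proposal
            accept = min(1.0, (probs[proposal] / probs[state]) ** power)
            if rng.random() < accept:
                state = proposal
    return state

base = [0.1, 0.2, 0.7]
rng = random.Random(0)
finals = [annealed_mh(base, powers=[1, 2, 4, 8], steps_per_power=25, rng=rng) for _ in range(200)]
mode_fraction = finals.count(2) / len(finals)
```

Starting at a low power lets the chain move freely before the target becomes sharply peaked, which is the usual motivation for annealing over a fixed temperature.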

Title: Conversational Context Classification: A Representation Engineering Approach

Authors: Jonathan Pan
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2601.12286
Pdf URL: https://arxiv.org/pdf/2601.12286
Copy Paste: [[2601.12286]] Conversational Context Classification: A Representation Engineering Approach(https://arxiv.org/abs/2601.12286)
Keywords: language model, llm, hallucination
Abstract: The increasing prevalence of Large Language Models (LLMs) demands effective safeguards for their operation, particularly concerning their tendency to generate out-of-context responses. A key challenge is accurately detecting when LLMs stray from expected conversational norms, manifesting as topic shifts, factual inaccuracies, or outright hallucinations. Traditional anomaly detection struggles to directly apply within contextual semantics. This paper outlines our experiment in exploring the use of Representation Engineering (RepE) and One-Class Support Vector Machine (OCSVM) to identify subspaces within the internal states of LLMs that represent a specific context. By training OCSVM on in-context examples, we establish a robust boundary within the LLM's hidden state latent space. We evaluate our study with two open-source LLMs, Llama and Qwen models, in a specific contextual domain. Our approach entailed identifying the optimal layers within the LLM's internal state subspaces that strongly associate with the context of interest. Our evaluation showed promising results in identifying the subspace for a specific context. Aside from being useful in detecting in-context or out-of-context conversation threads, this research work contributes to the study of better interpreting LLMs.

Title: Can Deep Research Agents Find and Organize? Evaluating the Synthesis Gap with Expert Taxonomies

Authors: Ming Zhang, Jiabao Zhuang, Wenqing Jing, Ziyu Kong, Jingyi Deng, Yujiong Shen, Kexin Tan, Yuhang Zhao, Ning Luo, Renzhe Zheng, Jiahui Lin, Mingqi Wu, Long Ma, Yi Zou, Shihan Dou, Tao Gui, Qi Zhang, Xuanjing Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.12369
Pdf URL: https://arxiv.org/pdf/2601.12369
Copy Paste: [[2601.12369]] Can Deep Research Agents Find and Organize? Evaluating the Synthesis Gap with Expert Taxonomies(https://arxiv.org/abs/2601.12369)
Keywords: llm, agent
Abstract: Deep Research Agents are increasingly used for automated survey generation. However, whether they can write surveys like human experts remains unclear. Existing benchmarks focus on fluency or citation accuracy, but none evaluates the core capabilities: retrieving essential papers and organizing them into coherent knowledge structures. We introduce TaxoBench, a diagnostic benchmark derived from 72 highly-cited computer science surveys. We manually extract expert-authored taxonomy trees containing 3,815 precisely categorized citations as ground truth. Our benchmark supports two evaluation modes: Deep Research mode tests end-to-end retrieval and organization given only a topic, while Bottom-Up mode isolates structuring capability by providing the exact papers human experts used. We evaluate 7 leading Deep Research agents and 12 frontier LLMs. Results reveal a dual bottleneck: the best agent recalls only 20.9% of expert-selected papers, and even with perfect input, the best model achieves only 0.31 ARI in organization. Current deep research agents remain far from expert-level survey writing. Our benchmark is publicly available at this https URL.
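
Illustrative sketch for the entry above: the two reported quantities can be computed as set recall over retrieved papers and, assuming "ARI" denotes the adjusted Rand index, a chance-corrected agreement between two groupings of the same papers. The flat cluster labels below are a simplification assumed for this example; the benchmark's ground truth is a taxonomy tree, and the paper's exact scoring protocol is not described in the abstract.

```python
from collections import Counter
from math import comb
from typing import Hashable, Iterable, Sequence

def recall(retrieved: Iterable[Hashable], expert: Iterable[Hashable]) -> float:
    """Fraction of expert-selected papers that appear among the retrieved ones."""
    expert_set = set(expert)
    return len(expert_set & set(retrieved)) / len(expert_set)

def adjusted_rand_index(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Adjusted Rand index between two clusterings of the same items (needs >= 2 items)."""
    n = len(labels_a)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. both clusterings are one big group
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

paper_recall = recall(["p1", "p4", "p9"], ["p1", "p2", "p3", "p4"])
ari = adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2])
```

An ARI of 1 means the two groupings match up to a relabeling of clusters, and values near 0 mean agreement no better than chance.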

Title: A Scalable Entity-Based Framework for Auditing Bias in LLMs

Authors: Akram Elbouanani, Aboubacar Tuo, Adrian Popescu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.12374
Pdf URL: https://arxiv.org/pdf/2601.12374
Copy Paste: [[2601.12374]] A Scalable Entity-Based Framework for Auditing Bias in LLMs(https://arxiv.org/abs/2601.12374)
Keywords: language model, llm, prompt
Abstract: Existing approaches to bias evaluation in large language models (LLMs) trade ecological validity for statistical control, relying on artificial prompts that poorly reflect real-world use, or on naturalistic tasks that lack scale and rigor. We introduce a scalable bias-auditing framework using named entities as probes to measure structural disparities in model behavior. We show that synthetic data reliably reproduces bias patterns observed in natural text, enabling large-scale analysis. Using this approach, we conduct the largest bias audit to date, comprising 1.9 billion data points across multiple entity types, tasks, languages, models, and prompting strategies. Our results reveal systematic biases: models penalize right-wing politicians, favor left-wing politicians, prefer Western and wealthy nations over the Global South, favor Western companies, and penalize firms in the defense and pharmaceutical sectors. While instruction tuning reduces bias, increasing model scale amplifies it, and prompting in Chinese or Russian does not attenuate Western-aligned preferences. These results indicate that LLMs should undergo rigorous auditing before deployment in high-stakes applications.

Title: LR-DWM: Efficient Watermarking for Diffusion Language Models

Authors: Ofek Raban, Ethan Fetaya, Gal Chechik
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.12376
Pdf URL: https://arxiv.org/pdf/2601.12376
Copy Paste: [[2601.12376]] LR-DWM: Efficient Watermarking for Diffusion Language Models(https://arxiv.org/abs/2601.12376)
Keywords: language model, llm
Abstract: Watermarking (WM) is a critical mechanism for detecting and attributing AI-generated content. Current WM methods for Large Language Models (LLMs) are predominantly tailored for autoregressive (AR) models: They rely on tokens being generated sequentially, and embed stable signals within the generated sequence based on the previously sampled text. Diffusion Language Models (DLMs) generate text via non-sequential iterative denoising, which requires significant modification to use WM methods designed for AR models. Recent work proposed to watermark DLMs by inverting the process when needed, but suffers significant computational or memory overhead. We introduce Left-Right Diffusion Watermarking (LR-DWM), a scheme that biases the generated token based on both left and right neighbors, when they are available. LR-DWM incurs minimal runtime and memory overhead, remaining close to the non-watermarked baseline DLM while enabling reliable statistical detection under standard evaluation settings. Our results demonstrate that DLMs can be watermarked efficiently, achieving high detectability with negligible computational and memory overhead.
摘要：水印 (WM) 是检测和归因人工智能生成内容的关键机制。当前用于大型语言模型 (LLM) 的 WM 方法主要是为自回归 (AR) 模型量身定制的：它们依赖于顺序生成的标记，并根据先前采样的文本在生成的序列中嵌入稳定信号。扩散语言模型 (DLM) 通过非顺序迭代去噪生成文本，这需要进行重大修改才能使用专为 AR 模型设计的 WM 方法。最近的工作提出通过在需要时反转过程来对 DLM 进行水印，但会产生大量的计算或内存开销。我们引入了左右扩散水印（LR-DWM），这是一种根据左右邻居（当它们可用时）对生成的令牌进行偏置的方案。 LR-DWM 产生的运行时间和内存开销最小，保持接近无水印基线 DLM，同时在标准评估设置下实现可靠的统计检测。我们的结果表明，DLM 可以有效地加水印，以可忽略的计算和内存开销实现高可检测性。

Title: NADIR: Differential Attention Flow for Non-Autoregressive Transliteration in Indic Languages

Authors: Lakshya Tomar, Vinayak Abrol, Puneet Agarwal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.12389
Pdf URL: https://arxiv.org/pdf/2601.12389
Copy Paste: [[2601.12389]] NADIR: Differential Attention Flow for Non-Autoregressive Transliteration in Indic Languages(https://arxiv.org/abs/2601.12389)
Keywords: hallucination
Abstract: In this work, we argue that not all sequence-to-sequence tasks require the strong inductive biases of autoregressive (AR) models. Tasks like multilingual transliteration, code refactoring, grammatical correction or text normalization often rely on local dependencies where the full modeling capacity of AR models can be overkill, creating a trade-off between their high accuracy and high inference latency. While non-autoregressive (NAR) models offer speed, they typically suffer from hallucinations and poor length control. To explore this trade-off, we focus on the multilingual transliteration task in Indic languages and introduce NADIR, a novel NAR architecture designed to strike a balance between speed and accuracy. NADIR integrates a Differential Transformer and a Mixture-of-Experts mechanism, enabling it to robustly model complex character mappings without sequential dependencies. NADIR achieves over a 13x speed-up compared to the state-of-the-art AR baseline. It maintains a competitive mean Character Error Rate of 15.78%, compared to 14.44% for the AR model and 21.88% for a standard NAR equivalent. Importantly, NADIR reduces Repetition errors by 49.53%, Substitution errors by 24.45%, Omission errors by 32.92%, and Insertion errors by 16.87%. This work provides a practical blueprint for building fast and reliable NAR systems, effectively bridging the gap between AR accuracy and the demands of real-time, large-scale deployment.
摘要：在这项工作中，我们认为并非所有序列到序列的任务都需要自回归（AR）模型的强归纳偏差。多语言音译、代码重构、语法纠正或文本规范化等任务通常依赖于本地依赖性，其中 AR 模型的完整建模能力可能会被过度利用，从而在高精度和高推理延迟之间进行权衡。虽然非自回归 (NAR) 模型提供速度，但它们通常会出现幻觉和长度控制不佳。为了探索这种权衡，我们专注于印度语的多语言音译任务，并引入 NADIR，一种新颖的 NAR 架构，旨在在速度和准确性之间取得平衡。 NADIR 集成了差分变压器和专家混合机制，使其能够对复杂的字符映射进行鲁棒建模，而无需顺序依赖。与最先进的 AR 基线相比，NADIR 的速度提高了 13 倍以上。它保持了 15.78% 的竞争平均字符错误率，而 AR 模型的平均字符错误率为 14.44%，标准 NAR 等效模型的平均字符错误率为 21.88%。重要的是，NADIR 将重复错误减少了 49.53%，替换错误减少了 24.45%，遗漏错误减少了 32.92%，插入错误减少了 16.87%。这项工作为构建快速可靠的 NAR 系统提供了实用的蓝图，有效缩小了 AR 精度与实时、大规模部署需求之间的差距。
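
Note: the Character Error Rate (CER) figures quoted above can be read as an edit distance normalized by reference length. The sketch below is a minimal illustration under that common definition; the helper names `levenshtein` and `mean_cer` are hypothetical, and the paper may aggregate CER differently (for example at corpus level rather than per pair).

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis takes i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def mean_cer(pairs):
    """Mean of per-pair CER, where CER = edit distance / reference length."""
    rates = [levenshtein(ref, hyp) / len(ref) for ref, hyp in pairs if ref]
    if not rates:
        raise ValueError("need at least one pair with a non-empty reference")
    return sum(rates) / len(rates)


if __name__ == "__main__":
    # One exact match (CER 0.0) and one single substitution over 4 characters (CER 0.25).
    print(mean_cer([("abcd", "abcd"), ("abcd", "abed")]))  # 0.125
```

Averaging per-pair rates weights short and long words equally; a corpus-level CER (total edits over total reference characters) would weight long words more heavily.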

Title: Legal experts disagree with rationale extraction techniques for explaining ECtHR case outcome classification

Authors: Mahammad Namazov, Tomáš Koref, Ivan Habernal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.12419
Pdf URL: https://arxiv.org/pdf/2601.12419
Copy Paste: [[2601.12419]] Legal experts disagree with rationale extraction techniques for explaining ECtHR case outcome classification(https://arxiv.org/abs/2601.12419)
Keywords: language model, llm
Abstract: Interpretability is critical for applications of large language models in the legal domain, which requires trust and transparency. While some studies develop task-specific approaches, others use the classification model's parameters to explain the decisions. However, which technique best explains the legal outcome prediction remains an open question. To address this challenge, we propose a comparative analysis framework for model-agnostic interpretability techniques. Among these, we employ two rationale extraction methods, which justify outcomes with human-interpretable and concise text fragments (i.e., rationales) from the given input text. We conduct the comparison by evaluating faithfulness, via normalized sufficiency and comprehensiveness metrics, along with plausibility, by asking legal experts to evaluate extracted rationales. We further assess the feasibility of LLM-as-a-Judge using legal expert evaluation results. We show that the model's "reasons" for predicting a violation differ substantially from those of legal experts, despite highly promising quantitative analysis results and reasonable downstream classification performance. The source code of our experiments is publicly available at this https URL.
摘要：可解释性对于大型语言模型在需要信任和透明度的法律领域的应用至关重要。虽然一些研究开发了特定于任务的方法，但其他研究则使用分类模型的参数来解释决策。然而，哪种技术最好地解释法律结果预测仍然是一个悬而未决的问题。为了应对这一挑战，我们提出了一个与模型无关的可解释性技术的比较分析框架。其中，我们采用两种基本原理提取方法，它们用给定输入文本中的人类可解释且简洁的文本片段（即基本原理）来证明结果的合理性。我们通过评估忠实性（通过规范化的充分性和全面性指标以及合理性）来进行比较，并要求法律专家评估提取的理由。我们利用法律专家评估结果进一步评估法学硕士作为法官的可行性。我们表明，尽管定量分析结果非常有前途并且下游分类性能合理，但该模型预测违规的“原因”与法律专家的“原因”有很大不同。我们实验的源代码可通过此 https URL 公开获取。
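
Note: sufficiency and comprehensiveness are commonly defined as drops in the predicted-class probability when the model sees only the rationale, or everything except the rationale. The sketch below shows the plain (unnormalized) versions only; the abstract refers to normalized variants, whose exact form is not given here. The toy classifier `toy_predict` and its cue words are purely hypothetical stand-ins for a real outcome classifier.

```python
from typing import Callable, Sequence, Set

Predict = Callable[[Sequence[str]], float]  # tokens -> probability of the predicted class


def comprehensiveness(predict: Predict, tokens: Sequence[str], rationale: Set[int]) -> float:
    """p(full input) - p(input with rationale removed); higher means the rationale mattered more."""
    without = [t for i, t in enumerate(tokens) if i not in rationale]
    return predict(tokens) - predict(without)


def sufficiency(predict: Predict, tokens: Sequence[str], rationale: Set[int]) -> float:
    """p(full input) - p(rationale only); closer to zero means the rationale alone suffices."""
    only = [t for i, t in enumerate(tokens) if i in rationale]
    return predict(tokens) - predict(only)


def toy_predict(tokens: Sequence[str]) -> float:
    """Hypothetical classifier: probability grows with the number of cue words, capped at 0.9."""
    cues = {"unlawfully", "detained"}
    return min(0.9, 0.1 + 0.4 * sum(t in cues for t in tokens))


if __name__ == "__main__":
    tokens = ["applicant", "was", "unlawfully", "detained", "for", "months"]
    rationale = {2, 3}  # indices of "unlawfully" and "detained"
    print(comprehensiveness(toy_predict, tokens, rationale))
    print(sufficiency(toy_predict, tokens, rationale))
```

A rationale can score well on both metrics and still disagree with what a legal expert would highlight, which is the gap between faithfulness and plausibility that the abstract reports.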

Title: System-Mediated Attention Imbalances Make Vision-Language Models Say Yes

Authors: Tsan Tsai Chan, Varsha Suresh, Anisha Saha, Michael Hahn, Vera Demberg
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.12430
Pdf URL: https://arxiv.org/pdf/2601.12430
Copy Paste: [[2601.12430]] System-Mediated Attention Imbalances Make Vision-Language Models Say Yes(https://arxiv.org/abs/2601.12430)
Keywords: language model, hallucination
Abstract: Vision-language model (VLM) hallucination is commonly linked to imbalanced allocation of attention across input modalities: system, image and text. However, existing mitigation strategies tend towards an image-centric interpretation of these imbalances, often prioritising increased image attention while giving less consideration to the roles of the other modalities. In this study, we evaluate a more holistic, system-mediated account, which attributes these imbalances to functionally redundant system weights that reduce attention to image and textual inputs. We show that this framework offers a useful empirical perspective on the yes-bias, a common form of hallucination in which VLMs indiscriminately respond 'yes'. Causally redistributing attention from the system modality to image and textual inputs substantially suppresses this bias, often outperforming existing approaches. We further present evidence suggesting that system-mediated attention imbalances contribute to the yes-bias by encouraging a default reliance on coarse input representations, which are effective for some tasks but ill-suited to others. Taken together, these findings firmly establish system attention as a key factor in VLM hallucination and highlight its potential as a lever for mitigation.
摘要：视觉语言模型（VLM）幻觉通常与输入模式（系统、图像和文本）的注意力分配不平衡有关。然而，现有的缓解策略倾向于对这些不平衡进行以图像为中心的解释，通常优先考虑增加图像注意力，而较少考虑其他模式的作用。在这项研究中，我们评估了一个更全面的、以系统为中介的账户，该账户将这些不平衡归因于功能冗余的系统权重，从而减少了对图像和文本输入的关注。我们表明，这个框架为“是”偏见提供了有用的经验视角，“是”偏见是一种常见的幻觉形式，其中 VLM 不加区别地回答“是”。将注意力从系统模态重新分配到图像和文本输入可以大大抑制这种偏差，通常优于现有方法。我们进一步提供的证据表明，系统介导的注意力不平衡通过鼓励对粗略输入表征的默认依赖而导致“是”偏差，这对某些任务有效，但不适合其他任务。总而言之，这些发现牢固地确立了系统注意力作为 VLM 幻觉的关键因素，并强调了其作为缓解杠杆的潜力。

Title: Incentivizing In-depth Reasoning over Long Contexts with Process Advantage Shaping

Authors: Miao Peng, Weizhou Shen, Nuo Chen, Chenliang Li, Ming Yan, Jia Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.12465
Pdf URL: https://arxiv.org/pdf/2601.12465
Copy Paste: [[2601.12465]] Incentivizing In-depth Reasoning over Long Contexts with Process Advantage Shaping(https://arxiv.org/abs/2601.12465)
Keywords: llm, long context
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective in enhancing LLMs' short-context reasoning, but its performance degrades in long-context scenarios that require both precise grounding and robust long-range reasoning. We identify the "almost-there" phenomenon in long-context reasoning, where trajectories are largely correct but fail at the final step, and attribute this failure to two factors: (1) the lack of high reasoning density in long-context QA data that push LLMs beyond mere grounding toward sophisticated multi-hop reasoning; and (2) the loss of valuable learning signals during long-context RL training due to the indiscriminate penalization of partially correct trajectories with incorrect outcomes. To overcome this bottleneck, we propose DeepReasonQA, a KG-driven synthesis framework that controllably constructs high-difficulty, multi-hop long-context QA pairs with inherent reasoning chains. Building on this, we introduce Long-context Process Advantage Shaping (LongPAS), a simple yet effective method that performs fine-grained credit assignment by evaluating reasoning steps along Validity and Relevance dimensions, which captures critical learning signals from "almost-there" trajectories. Experiments on three long-context reasoning benchmarks show that our approach substantially outperforms RLVR baselines and matches frontier LLMs while using far fewer parameters. Further analysis confirms the effectiveness of our methods in strengthening long-context reasoning while maintaining stable RL training.
摘要：具有可验证奖励的强化学习（RLVR）已被证明可以有效增强法学硕士的短上下文推理，但在需要精确基础和强大的长程推理的长上下文场景中，其性能会下降。我们发现了长上下文推理中的“几乎就在那里”现象，即轨迹基本上是正确的，但在最后一步失败了，并将这种失败归因于两个因素：（1）长上下文 QA 数据中缺乏高推理密度，这使得法学硕士超越了单纯的基础，转向复杂的多跳推理； (2) 在长上下文强化学习训练期间，由于对部分正确的轨迹和错误的结果进行不加区别的惩罚，导致有价值的学习信号丢失。为了克服这一瓶颈，我们提出了 DeepReasonQA，这是一种知识图谱驱动的综合框架，可通过固有推理链可控地构建高难度、多跳长上下文 QA 对。在此基础上，我们引入了长上下文过程优势塑造（LongPAS），这是一种简单而有效的方法，通过沿着有效性和相关性维度评估推理步骤来执行细粒度的信用分配，从“几乎存在”的轨迹中捕获关键的学习信号。对三个长上下文推理基准的实验表明，我们的方法大大优于 RLVR 基线，并与前沿法学硕士相匹配，同时使用的参数少得多。进一步的分析证实了我们的方法在加强长上下文推理同时保持稳定的强化学习训练方面的有效性。

Title: Knowing When to Abstain: Medical LLMs Under Clinical Uncertainty

Authors: Sravanthi Machcha, Sushrita Yerra, Sahil Gupta, Aishwarya Sahoo, Sharmin Sultana, Hong Yu, Zonghai Yao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.12471
Pdf URL: https://arxiv.org/pdf/2601.12471
Copy Paste: [[2601.12471]] Knowing When to Abstain: Medical LLMs Under Clinical Uncertainty(https://arxiv.org/abs/2601.12471)
Keywords: language model, llm, prompt, agent
Abstract: Current evaluation of large language models (LLMs) overwhelmingly prioritizes accuracy; however, in real-world and safety-critical applications, the ability to abstain when uncertain is equally vital for trustworthy deployment. We introduce MedAbstain, a unified benchmark and evaluation protocol for abstention in medical multiple-choice question answering (MCQA) -- a discrete-choice setting that generalizes to agentic action selection -- integrating conformal prediction, adversarial question perturbations, and explicit abstention options. Our systematic evaluation of both open- and closed-source LLMs reveals that even state-of-the-art, high-accuracy models often fail to abstain when uncertain. Notably, providing explicit abstention options consistently increases model uncertainty and safer abstention, far more than input perturbations, while scaling model size or advanced prompting brings little improvement. These findings highlight the central role of abstention mechanisms for trustworthy LLM deployment and offer practical guidance for improving safety in high-stakes applications.
摘要：目前对大型语言模型（LLM）的评估绝大多数优先考虑准确性；然而，在现实世界和安全关键的应用中，在不确定时放弃的能力对于值得信赖的部署同样重要。我们引入了 MedAbstain，这是一种针对医学多项选择题回答 (MCQA) 中弃权的统一基准和评估协议，这是一种推广到代理行动选择的离散选择设置，集成了保形预测、对抗性问题扰动和显式弃权选项。我们对开源和闭源法学硕士的系统评估表明，即使是最先进的高精度模型也常常无法避免不确定性。值得注意的是，提供明确的弃权选项始终会增加模型的不确定性和更安全的弃权，远远超过输入扰动，而缩放模型大小或高级提示几乎没有带来任何改善。这些发现强调了弃权机制对于值得信赖的法学硕士部署的核心作用，并为提高高风险应用程序的安全性提供了实用指导。
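
Note: one common way to turn conformal prediction into an abstention rule for multiple-choice questions is split conformal prediction: calibrate a threshold on held-out nonconformity scores, build a prediction set per question, and abstain unless the set contains exactly one option. The sketch below assumes that setup with the score 1 - p(true option); the function names and the abstention rule are illustrative assumptions, not the MedAbstain protocol.

```python
import math
from typing import Dict, List


def conformal_threshold(cal_scores: List[float], alpha: float) -> float:
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points: every option stays in the set
    return sorted(cal_scores)[k - 1]


def prediction_set(probs: Dict[str, float], qhat: float) -> List[str]:
    """Keep every option whose nonconformity score 1 - p is within the threshold."""
    return [label for label, p in probs.items() if 1 - p <= qhat]


def answer_or_abstain(probs: Dict[str, float], qhat: float) -> str:
    """Answer only when the prediction set pins down a single option; otherwise abstain."""
    options = prediction_set(probs, qhat)
    return options[0] if len(options) == 1 else "ABSTAIN"


if __name__ == "__main__":
    # Calibration scores are 1 - p(true option) on held-out questions (made-up values).
    cal = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45]
    qhat = conformal_threshold(cal, alpha=0.5)
    print(answer_or_abstain({"A": 0.80, "B": 0.10, "C": 0.05, "D": 0.05}, qhat))
    print(answer_or_abstain({"A": 0.40, "B": 0.35, "C": 0.15, "D": 0.10}, qhat))
```

Here an empty set and a multi-option set both lead to abstention; other designs might answer with the top option when the set is empty, which trades coverage for fewer abstentions.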

Title: DoPE: Decoy Oriented Perturbation Encapsulation Human-Readable, AI-Hostile Documents for Academic Integrity

Authors: Ashish Raj Shekhar, Shiven Agarwal, Priyanuj Bordoloi, Yash Shah, Tejas Anvekar, Vivek Gupta
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.12505
Pdf URL: https://arxiv.org/pdf/2601.12505
Copy Paste: [[2601.12505]] DoPE: Decoy Oriented Perturbation Encapsulation Human-Readable, AI-Hostile Documents for Academic Integrity(https://arxiv.org/abs/2601.12505)
Keywords: language model, llm
Abstract: Multimodal Large Language Models (MLLMs) can directly consume exam documents, threatening conventional assessments and academic integrity. We present DoPE (Decoy-Oriented Perturbation Encapsulation), a document-layer defense framework that embeds semantic decoys into PDF/HTML assessments to exploit render-parse discrepancies in MLLM pipelines. By instrumenting exams at authoring time, DoPE provides model-agnostic prevention (stop or confound automated solving) and detection (flag blind AI reliance) without relying on conventional one-shot classifiers. We formalize prevention and detection tasks, and introduce FewSoRT-Q, an LLM-guided pipeline that generates question-level semantic decoys and FewSoRT-D to encapsulate them into watermarked documents. We evaluate on Integrity-Bench, a novel benchmark of 1826 exams (PDF+HTML) derived from public QA datasets and OpenCourseWare. Against black-box MLLMs from OpenAI and Anthropic, DoPE yields strong empirical gains: a 91.4% detection rate at an 8.7% false-positive rate using an LLM-as-Judge verifier, and prevents successful completion or induces decoy-aligned failures in 96.3% of attempts. We release Integrity-Bench, our toolkit, and evaluation code to enable reproducible study of document-layer defenses for academic integrity.
摘要：多模态大型语言模型 (MLLM) 可以直接使用考试文档，威胁传统评估和学术诚信。我们提出了 DoPE（面向诱饵的扰动封装），这是一种文档层防御框架，它将语义诱饵嵌入到 PDF/HTML 评估中，以利用 MLLM 管道中的渲染解析差异。通过在创作时对检查进行检测，DoPE 可以提供与模型无关的预防（停止或混淆自动求解）和检测（标记盲目的 AI 依赖），而无需依赖传统的一次性分类器。我们将预防和检测任务形式化，并引入了 FewSoRT-Q（一种 LLM 引导的管道，可生成问题级语义诱饵）和 FewSoRT-D 将其封装到带水印的文档中。我们在 Integrity-Bench 上进行评估，这是一个源自公共 QA 数据集和 OpenCourseWare 的 1826 项考试 (PDF+HTML) 的新颖基准。与 OpenAI 和 Anthropic 的黑盒 MLLM 相比，DoPE 产生了强大的经验收益：使用 LLM-as-Judge 验证器，检测率为 91.4%，误报率为 8.7%，并且在 96.3% 的尝试中阻止成功完成或引发诱饵对齐失败。我们发布了 Integrity-Bench、我们的工具包和评估代码，以便能够对学术诚信的文档层防御进行可重复的研究。

Title: Benchmarking Concept-Spilling Across Languages in LLMs

Authors: Ilia Badanin, Daniil Dzenhaliou, Imanol Schlag
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.12549
Pdf URL: https://arxiv.org/pdf/2601.12549
Copy Paste: [[2601.12549]] Benchmarking Concept-Spilling Across Languages in LLMs(https://arxiv.org/abs/2601.12549)
Keywords: language model, llm
Abstract: Multilingual Large Language Models (LLMs) exhibit remarkable cross-lingual abilities, yet often exhibit a systematic bias toward the representations from other languages, resulting in semantic interference when generating content in non-English languages, a phenomenon we define as language spilling. This paper presents a novel comparative framework for evaluating multilingual semantic robustness by systematically measuring how models handle polysemous words across languages. Our methodology provides a relative measure of model performance: when required to generate exactly five meanings, both strong and weak models may resort to meanings from dominant languages, but semantically stronger models do so later in the generation sequence, producing more true meanings from the target language before failing, while weaker models resort to dominant-language meanings earlier in the sequence. We evaluate a diverse set of open and closed multilingual LLMs using a structured meaning generation task across nine languages, employing a carefully curated benchmark of 100 high-polysemy English words. Our findings reveal significant variation in semantic robustness across both models and languages, providing a principled ranking system for model comparison without requiring definitive causal attribution of error sources. We contribute both a scalable comparative benchmark for multilingual semantic evaluation and a rigorous validation pipeline: critical tools for developing more linguistically balanced AI systems.
摘要：多语言大语言模型 (LLM) 表现出卓越的跨语言能力，但通常对其他语言的表示表现出系统性偏差，导致在生成非英语语言内容时出现语义干扰$-$这种现象我们定义为语言溢出。本文提出了一种新颖的比较框架，通过系统地测量模型如何处理跨语言的多义词来评估多语言语义鲁棒性。我们的方法提供了模型性能的相对衡量标准：当需要生成恰好五个含义时，强模型和弱模型都可能诉诸于主导语言的含义，但语义上较强的模型会在生成序列的后期这样做，在失败之前从目标语言产生更多真实的含义，而较弱的模型则诉诸于序列中较早的主导语言含义。我们使用跨九种语言的结构化意义生成任务，采用精心策划的 100 个高度一词多义英语单词的基准，评估了一组不同的开放式和封闭式多语言法学硕士。我们的研究结果揭示了模型和语言之间语义鲁棒性的显着差异，为模型比较提供了原则性的排名系统，而不需要错误源的明确因果归因。我们为多语言语义评估提供了可扩展的比较基准，并为开发语言上更加平衡的人工智能系统提供了严格的验证流程至关重要的工具。

Title: Evaluating Contextually Mediated Factual Recall in Multilingual Large Language Models

Authors: Yihong Liu, Bingyu Xiong, Hinrich Schütze
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.12555
Pdf URL: https://arxiv.org/pdf/2601.12555
Copy Paste: [[2601.12555]] Evaluating Contextually Mediated Factual Recall in Multilingual Large Language Models(https://arxiv.org/abs/2601.12555)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) can recall a wide range of factual knowledge across languages. However, existing factual recall evaluations primarily assess fact retrieval in isolation, where the queried entity is explicitly named and the fact is requested directly. In natural language use, facts are often accessed through context, where the relevant entity is introduced only indirectly. In this work, we study contextually mediated factual recall, asking whether LLMs can reliably retrieve factual knowledge when the target entity is embedded in a naturalistic context rather than queried explicitly, across languages. We construct controlled prompts that preserve the underlying fact while introducing referential mediation through contextual sentences. To disentangle contextual effects from name-specific associations, we further compare performance using synthetic names and real names across languages. Evaluating multiple model families in five languages, we find that contextual mediation consistently degrades factual recall, with substantial variation across relations. Larger models are more robust to contextual mediation, exhibiting a reduced performance gap relative to direct queries, while the effect of real names and name origin is mixed and unsystematic. These findings highlight a gap between isolated factual recall and context-dependent language understanding in multilingual LLMs.
摘要：大型语言模型 (LLM) 可以回忆跨语言的广泛事实知识。然而，现有的事实回忆评估主要是孤立地评估事实检索，其中明确命名查询的实体并直接请求事实。在自然语言使用中，事实通常是通过上下文访问的，其中相关实体只是间接引入的。在这项工作中，我们研究了上下文介导的事实回忆，询问当目标实体嵌入自然上下文中而不是跨语言明确查询时，法学硕士是否可以可靠地检索事实知识。我们构建受控提示，保留潜在事实，同时通过上下文句子引入指称中介。为了区分特定名称关联的上下文影响，我们进一步比较了跨语言使用合成名称和真实名称的性能。通过评估五种语言的多个模型族，我们发现上下文调解始终会降低事实回忆，并且不同关系之间存在很大差异。较大的模型对上下文中介更加稳健，相对于直接查询表现出较小的性能差距，而真实姓名和姓名来源的影响是混合且不系统的。这些发现凸显了多语言法学硕士中孤立的事实回忆和上下文相关的语言理解之间的差距。

Title: A Cloud-based Multi-Agentic Workflow for Science

Authors: Anurag Acharya, Timothy Vega, Rizwan A. Ashraf, Anshu Sharma, Derek Parker, Robert Rallo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.12607
Pdf URL: https://arxiv.org/pdf/2601.12607
Copy Paste: [[2601.12607]] A Cloud-based Multi-Agentic Workflow for Science(https://arxiv.org/abs/2601.12607)
Keywords: language model, llm, agent
Abstract: As Large Language Models (LLMs) become ubiquitous across various scientific domains, their lack of ability to perform complex tasks like running simulations or to make complex decisions limits their utility. LLM-based agents bridge this gap due to their ability to call external resources and tools and thus are now rapidly gaining popularity. However, coming up with a workflow that can balance the models, cloud providers, and external resources is very challenging, making implementing an agentic system more of a hindrance than a help. In this work, we present a domain-agnostic, model-independent workflow for an agentic framework that can act as a scientific assistant while being run entirely on the cloud. Built with a supervisor agent marshaling an array of agents with individual capabilities, our framework brings together straightforward tasks like literature review and data analysis with more complex ones like simulation runs. We describe the framework here in full, including a proof-of-concept system we built to accelerate the study of Catalysts, which is highly important in the field of Chemistry and Material Science. We report the cost to operate and use this framework, including the breakdown of the cost by service used. We also evaluate our system on a custom-curated synthetic benchmark and a popular Chemistry benchmark, and perform expert validation of the system. The results show that our system is able to route the task to the correct agent 90% of the time and successfully complete the assigned task 97.5% of the time for the synthetic tasks and 91% of the time for real-world tasks, while still achieving better or comparable accuracy to most frontier models, showing that this is a viable framework for other scientific domains to replicate.
摘要：随着大型语言模型（LLM）在各个科学领域变得无处不在，它们缺乏执行复杂任务（例如运行模拟或做出复杂决策）的能力限制了它们的实用性。基于法学硕士的代理由于能够调用外部资源和工具而弥补了这一差距，因此现在迅速普及。然而，提出一个可以平衡模型、云提供商和外部资源的工作流程非常具有挑战性，这使得实施代理系统更像是一种障碍，而不是一种帮助。在这项工作中，我们为代理框架提出了一个与领域无关、与模型无关的工作流程，该框架可以充当科学助手，同时完全在云上运行。我们的框架由一个主管代理构建，该代理将一系列具有单独功能的代理编组在一起，将文献综述和数据分析等简单任务与模拟运行等更复杂的任务结合在一起。我们在这里完整地描述了该框架，包括我们为加速催化剂研究而构建的概念验证系统，这在化学和材料科学领域非常重要。我们报告运营和使用该框架的成本，包括按服务使用划分的成本细目。我们还根据定制的合成基准和流行的化学基准评估我们的系统，并对系统进行专家验证。结果表明，我们的系统能够在 90% 的时间内将任务路由给正确的代理，并在合成任务中成功完成分配的任务 97.5%，在现实世界任务中成功完成分配的任务 91%，同时仍能实现与大多数前沿模型更好或相当的准确性，这表明这是其他科学领域可以复制的可行框架。

Title: Disagreement as Data: Reasoning Trace Analytics in Multi-Agent Systems

Authors: Elham Tajik, Conrad Borchers, Bahar Shahrokhian, Sebastian Simon, Ali Keramati, Sonika Pal, Sreecharan Sankaranarayanan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.12618
Pdf URL: https://arxiv.org/pdf/2601.12618
Copy Paste: [[2601.12618]] Disagreement as Data: Reasoning Trace Analytics in Multi-Agent Systems(https://arxiv.org/abs/2601.12618)
Keywords: language model, llm, agent
Abstract: Learning analytics researchers often analyze qualitative student data such as coded annotations or interview transcripts to understand learning processes. With the rise of generative AI, fully automated and human-AI workflows have emerged as promising methods for analysis. However, methodological standards to guide such workflows remain limited. In this study, we propose that reasoning traces generated by large language model (LLM) agents, especially within multi-agent systems, constitute a novel and rich form of process data to enhance interpretive practices in qualitative coding. We apply cosine similarity to LLM reasoning traces to systematically detect, quantify, and interpret disagreements among agents, reframing disagreement as a meaningful analytic signal. Analyzing nearly 10,000 instances of agent pairs coding human tutoring dialog segments, we show that LLM agents' semantic reasoning similarity robustly differentiates consensus from disagreement and correlates with human coding reliability. Qualitative analysis guided by this metric reveals nuanced instructional sub-functions within codes and opportunities for conceptual codebook refinement. By integrating quantitative similarity metrics with qualitative review, our method has the potential to improve and accelerate establishing inter-rater reliability during coding by surfacing interpretive ambiguity, especially when LLMs collaborate with humans. We discuss how reasoning-trace disagreements represent a valuable new class of analytic signals advancing methodological rigor and interpretive depth in educational research.
摘要：学习分析研究人员经常分析定性学生数据，例如编码注释或访谈记录，以了解学习过程。随着生成式人工智能的兴起，全自动和人类人工智能工作流程已成为有前途的分析方法。然而，指导此类工作流程的方法标准仍然有限。在这项研究中，我们提出由大型语言模型（LLM）代理生成的推理轨迹，尤其是在多代理系统中，构成了一种新颖且丰富的过程数据形式，以增强定性编码中的解释实践。我们将余弦相似度应用于 LLM 推理轨迹，以系统地检测、量化和解释代理之间的分歧，将分歧重新定义为有意义的分析信号。通过分析近 10,000 个编码人类辅导对话片段的代理对实例，我们表明 LLM 代理的语义推理相似性可以稳健地区分共识和分歧，并与人类编码可靠性相关。以该指标为指导的定性分析揭示了代码中细致入微的教学子功能以及概念性代码本细化的机会。通过将定量相似性指标与定性审查相结合，我们的方法有可能通过揭示解释的歧义来改进和加速在编码过程中建立评估者间的可靠性，特别是当法学硕士与人类合作时。我们讨论推理轨迹分歧如何代表一类有价值的新分析信号，从而提高教育研究中方法论的严谨性和解释深度。
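
Note: the core measurement described above is cosine similarity between two agents' reasoning traces. The sketch below is a minimal illustration that uses bag-of-words count vectors as a stand-in for whatever text representation the authors used (most likely dense embeddings, though the abstract does not say). The helper names `bow` and `cosine` and the example traces are hypothetical.

```python
import math
from collections import Counter


def bow(text: str) -> Counter:
    """Lowercased whitespace tokens as a sparse count vector (a stand-in for an embedding)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical reasoning traces from two agents coding the same tutoring segment.
    trace_1 = "tutor gives hint"
    trace_2 = "tutor gives praise"
    print(cosine(bow(trace_1), bow(trace_2)))  # two of three tokens shared, so 2/3
```

Low similarity between traces that end in the same code, or high similarity between traces that end in different codes, are the cases a human reviewer would likely want to inspect first.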

Title: BioPulse-QA: A Dynamic Biomedical Question-Answering Benchmark for Evaluating Factuality, Robustness, and Bias in Large Language Models

Authors: Kriti Bhattarai, Vipina K. Keloth, Donald Wright, Andrew Loza, Yang Ren, Hua Xu
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2601.12632
Pdf URL: https://arxiv.org/pdf/2601.12632
Copy Paste: [[2601.12632]] BioPulse-QA: A Dynamic Biomedical Question-Answering Benchmark for Evaluating Factuality, Robustness, and Bias in Large Language Models(https://arxiv.org/abs/2601.12632)
Keywords: language model, gpt, llm
Abstract: Objective: Large language models (LLMs) are increasingly applied in biomedical settings, and existing benchmark datasets have played an important role in supporting model development and evaluation. However, these benchmarks often have limitations. Many rely on static or outdated datasets that fail to capture the dynamic, context-rich, and high-stakes nature of biomedical knowledge. They also carry increasing risk of data leakage due to overlap with model pretraining corpora and often overlook critical dimensions such as robustness to linguistic variation and potential demographic biases. Materials and Methods: To address these gaps, we introduce BioPulse-QA, a benchmark that evaluates LLMs on answering questions from newly published biomedical documents including drug labels, trial protocols, and clinical guidelines. BioPulse-QA includes 2,280 expert-verified question answering (QA) pairs and perturbed variants, covering both extractive and abstractive formats. We evaluate four LLMs - GPT-4o, GPT-o1, Gemini-2.0-Flash, and LLaMA-3.1 8B Instruct - released prior to the publication dates of the benchmark documents. Results: GPT-o1 achieves the highest relaxed F1 score (0.92), followed by Gemini-2.0-Flash (0.90) on drug labels. Clinical trials are the most challenging source, with extractive F1 scores as low as 0.36. Discussion and Conclusion: Performance differences are larger for paraphrasing than for typographical errors, while bias testing shows negligible differences. BioPulse-QA provides a scalable and clinically relevant framework for evaluating biomedical LLMs.
摘要：目标：大语言模型（LLM）越来越多地应用于生物医学领域，现有的基准数据集在支持模型开发和评估方面发挥了重要作用。然而，这些基准通常有局限性。许多人依赖静态或过时的数据集，而这些数据集无法捕捉生物医学知识的动态、上下文丰富和高风险的本质。由于与模型预训练语料库重叠，它们还带来了越来越大的数据泄漏风险，并且经常忽视关键维度，例如对语言变化的鲁棒性和潜在的人口统计偏差。材料和方法：为了弥补这些差距，我们引入了 BioPulse-QA，这是一个评估法学硕士回答新发表的生物医学文件（包括药物标签、试验方案和临床指南）问题的基准。 BioPulse-QA 包括 2,280 个经过专家验证的问答 (QA) 对和扰动变体，涵盖提取和抽象格式。我们评估了在基准文档发布日期之前发布的四个法学硕士 - GPT-4o、GPT-o1、Gemini-2.0-Flash 和 LLaMA-3.1 8B Instruct。结果：GPT-o1 在药品标签上获得了最高的宽松 F1 分数 (0.92)，其次是 Gemini-2.0-Flash (0.90)。临床试验是最具挑战性的来源，提取的 F1 分数低至 0.36。讨论和结论：释义的性能差异比印刷错误更大，而偏差测试显示的差异可以忽略不计。 BioPulse-QA 为评估生物医学法学硕士提供了一个可扩展且临床相关的框架。

Title: Objective Matters: Fine-Tuning Objectives Shape Safety, Robustness, and Persona Drift

Authors: Daniel Vennemeyer, Punya Syon Pandey, Phan Anh Duong, Michael Umeokoli, Samuel Ratnam
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.12639
Pdf URL: https://arxiv.org/pdf/2601.12639
Copy Paste: [[2601.12639]] Objective Matters: Fine-Tuning Objectives Shape Safety, Robustness, and Persona Drift(https://arxiv.org/abs/2601.12639)
Keywords: llm, prompt
Abstract: Fine-tuning LLMs on benign data can still degrade alignment and adversarial robustness, yet direct analysis of the role of fine-tuning objectives in shaping these safety outcomes remains limited. We present a controlled comparison of six fine-tuning objectives -- Supervised Fine-Tuning, Direct Preference Optimization, Conditional Fine-Tuning, Inoculation Prompting, Odds Ratio Preference Optimization, and KL-regularized fine-tuning -- holding data, domain, architecture, and optimization fixed. Across closed-form reasoning and open-ended generation tasks, we find that objective choice induces systematic, scale-dependent shifts along the safety-capability frontier. At small training budgets, robustness is similar across objectives but capability differs. At larger budgets, objectives diverge sharply: supervised and preference-based tuning tightly couple capability gains to increased adversarial vulnerability and persona drift, while objectives that constrain learning signals -- especially ORPO and KL-regularization -- substantially mitigate both. Fine-tuning objectives therefore matter little for safety at small scales but become a primary driver of adversarial robustness and latent persona stability as training scale increases.
摘要：对良性数据进行微调法学硕士仍然会降低一致性和对抗鲁棒性，但对微调目标在塑造这些安全结果中的作用的直接分析仍然有限。我们提出了六个微调目标的受控比较——监督微调、直接偏好优化、条件微调、接种提示、比值比偏好优化和 KL 正则化微调——保持数据、域、架构和优化固定。在封闭式推理和开放式生成任务中，我们发现客观选择会导致安全能力边界发生系统性的、与规模相关的变化。在培训预算较小的情况下，各个目标的稳健性相似，但能力不同。在预算较大的情况下，目标会出现很大的分歧：监督和基于偏好的调整将能力增益与增加的对抗性脆弱性和角色漂移紧密结合在一起，而限制学习信号的目标——尤其是 ORPO 和 KL 正则化——则大大减轻了这两者。因此，微调目标对于小规模的安全性影响不大，但随着训练规模的增加，它成为对抗鲁棒性和潜在角色稳定性的主要驱动力。

Title: Intelligent Documentation in Medical Education: Can AI Replace Manual Case Logging?

Authors: Nafiz Imtiaz Khan, Kylie Cleland, Vladimir Filkov, Roger Eric Goldman
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.12648
Pdf URL: https://arxiv.org/pdf/2601.12648
Copy Paste: [[2601.12648]] Intelligent Documentation in Medical Education: Can AI Replace Manual Case Logging?(https://arxiv.org/abs/2601.12648)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Procedural case logs are a core requirement in radiology training, yet they are time-consuming to complete and prone to inconsistency when authored manually. This study investigates whether large language models (LLMs) can automate procedural case log documentation directly from free-text radiology reports. We evaluate multiple local and commercial LLMs under instruction-based and chain-of-thought prompting to extract structured procedural information from 414 curated interventional radiology reports authored by nine residents between 2018 and 2024. Model performance is assessed using sensitivity, specificity, and F1-score, alongside inference latency and token efficiency to estimate operational cost. Results show that both local and commercial models achieve strong extraction performance, with best F1-scores approaching 0.87, while exhibiting different trade-offs between speed and cost. Automation using LLMs has the potential to substantially reduce clerical burden for trainees and improve consistency in case logging. These findings demonstrate the feasibility of AI-assisted documentation in medical education and highlight the need for further validation across institutions and clinical workflows.
摘要：程序病例日志是放射学培训的核心要求，但完成起来非常耗时，而且手动编写时容易出现不一致。本研究调查大型语言模型 (LLM) 是否可以直接从自由文本放射学报告中自动生成程序病例日志记录。我们在基于指令和思维链的提示下评估了多个本地和商业法学硕士，从 2018 年至 2024 年间由 9 名住院医师撰写的 414 份精选介入放射学报告中提取结构化程序信息。使用灵敏度、特异性和 F1 分数以及推理延迟和令牌效率来评估模型性能，以估计运营成本。结果表明，本地模型和商业模型都实现了强大的提取性能，最佳 F1 分数接近 0.87，同时在速度和成本之间表现出不同的权衡。使用法学硕士的自动化有可能大大减轻学员的文书负担并提高案例记录的一致性。这些发现证明了人工智能辅助文档在医学教育中的可行性，并强调需要在机构和临床工作流程中进一步验证。
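
Note: sensitivity, specificity, and F1-score all follow from the four confusion-matrix counts. The sketch below assumes a binary framing (a procedure field is either extracted or not for each report); the function name `binary_metrics` and the example labels are made up, and the paper's per-field aggregation is not specified in the abstract.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Sensitivity (recall), specificity, and F1 from paired 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # share of true positives found
    specificity = tn / (tn + fp) if tn + fp else 0.0  # share of true negatives kept
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "f1": f1}


if __name__ == "__main__":
    # 1 = procedure present in the report, 0 = absent (illustrative labels only).
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
    print(binary_metrics(y_true, y_pred))  # all three metrics equal 0.75 here
```

Reporting specificity alongside F1 matters for case logs because most procedure types are absent from any given report, so a model can look strong on F1 for common procedures while over-logging rare ones.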

Title: Augmenting Question Answering with A Hybrid RAG Approach

Authors: Tianyi Yang, Nashrah Haque, Vaishnave Jonnalagadda, Yuya Jeremy Ong, Zhehui Chen, Yanzhao Wu, Lei Yu, Divyesh Jadav, Wenqi Wei
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.12658
Pdf URL: https://arxiv.org/pdf/2601.12658
Copy Paste: [[2601.12658]] Augmenting Question Answering with A Hybrid RAG Approach(https://arxiv.org/abs/2601.12658)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful technique for enhancing the quality of responses in Question-Answering (QA) tasks. However, existing approaches often struggle with retrieving contextually relevant information, leading to incomplete or suboptimal answers. In this paper, we introduce Structured-Semantic RAG (SSRAG), a hybrid architecture that enhances QA quality by integrating query augmentation, agentic routing, and a structured retrieval mechanism combining vector and graph based techniques with context unification. By refining retrieval processes and improving contextual grounding, our approach improves both answer accuracy and informativeness. We conduct extensive evaluations on three popular QA datasets, TruthfulQA, SQuAD and WikiQA, across five Large Language Models (LLMs), demonstrating that our proposed approach consistently improves response quality over standard RAG implementations.
摘要：检索增强生成 (RAG) 已成为提高问答 (QA) 任务响应质量的强大技术。然而，现有的方法常常难以检索上下文相关的信息，从而导致答案不完整或次优。在本文中，我们介绍了结构化语义 RAG (SSRAG)，这是一种混合架构，它通过集成查询增强、代理路由以及将基于向量和图的技术与上下文统一相结合的结构化检索机制来提高 QA 质量。通过改进检索过程和改进上下文基础，我们的方法提高了答案的准确性和信息量。我们对五个大型语言模型 (LLM) 中的三个流行的 QA 数据集（TruthfulQA、SQuAD 和 WikiQA）进行了广泛的评估，证明我们提出的方法比标准 RAG 实现持续提高了响应质量。

Title: UbuntuGuard: A Culturally-Grounded Policy Benchmark for Equitable AI Safety in African Languages

Authors: Tassallah Abdullahi, Macton Mgonzo, Mardiyyah Oduwole, Paul Okewunmi, Abraham Owodunni, Ritambhara Singh, Carsten Eickhoff
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.12696
Pdf URL: https://arxiv.org/pdf/2601.12696
Copy Paste: [[2601.12696]] UbuntuGuard: A Culturally-Grounded Policy Benchmark for Equitable AI Safety in African Languages(https://arxiv.org/abs/2601.12696)
Keywords: llm
Abstract: Current guardian models are predominantly Western-centric and optimized for high-resource languages, leaving low-resource African languages vulnerable to evolving harms, cross-lingual safety failures, and cultural misalignment. Moreover, most guardian models rely on rigid, predefined safety categories that fail to generalize across diverse linguistic and sociocultural contexts. Robust safety, therefore, requires flexible, runtime-enforceable policies and benchmarks that reflect local norms, harm scenarios, and cultural expectations. We introduce UbuntuGuard, the first African policy-based safety benchmark built from adversarial queries authored by 155 domain experts across sensitive fields, including healthcare. From these expert-crafted queries, we derive context-specific safety policies and reference responses that capture culturally grounded risk signals, enabling policy-aligned evaluation of guardian models. We evaluate 13 models, comprising six general-purpose LLMs and seven guardian models across three distinct variants: static, dynamic, and multilingual. Our findings reveal that existing English-centric benchmarks overestimate real-world multilingual safety, cross-lingual transfer provides partial but insufficient coverage, and dynamic models, while better equipped to leverage policies at inference time, still struggle to fully localize African-language contexts. These findings highlight the urgent need for multilingual, culturally grounded safety benchmarks to enable the development of reliable and equitable guardian models for low-resource languages. Our code can be found online (code repository available at this https URL).
摘要：目前的监护模式主要以西方为中心，并针对资源丰富的语言进行了优化，使得资源匮乏的非洲语言容易受到不断变化的伤害、跨语言安全故障和文化失调的影响。此外，大多数监护人模型依赖于严格的、预定义的安全类别，这些安全类别无法在不同的语言和社会文化背景下推广。因此，强大的安全性需要灵活的、运行时可执行的政策和基准，以反映当地规范、危害场景和文化期望。我们推出 UbuntuGuard，这是第一个基于政策的非洲安全基准，由 155 名跨敏感领域（包括医疗保健）的领域专家编写的对抗性查询构建。从这些专家精心设计的查询中，我们得出针对具体情况的安全政策和参考响应，以捕获基于文化的风险信号，从而实现对监护人模型的政策一致评估。我们评估了 13 个模型，包括 6 个通用 LLM 和 7 个监护人模型，涵盖三种不同的变体：静态、动态和多语言。我们的研究结果表明，现有的以英语为中心的基准高估了现实世界的多语言安全性，跨语言迁移提供了部分但不足的覆盖范围，而动态模型虽然能够更好地在推理时利用政策，但仍然难以完全本地化非洲语言环境。这些发现凸显了迫切需要多语言、基于文化的安全基准，以便为资源匮乏的语言开发可靠且公平的监护人模型。我们的代码可以在线找到。\footnote{此 https URL 提供代码存储库。

Title: A Two-Stage GPU Kernel Tuner Combining Semantic Refactoring and Search-Based Optimization

Authors: Qiuyi Qu, Yicheng Sui, Yufei Sun, Rui Chen, Xiaofei Zhang, Yuzhi Zhang, Haofeng Wang, Ge Lan, Ning Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.12698
Pdf URL: https://arxiv.org/pdf/2601.12698
Copy Paste: [[2601.12698]] A Two-Stage GPU Kernel Tuner Combining Semantic Refactoring and Search-Based Optimization(https://arxiv.org/abs/2601.12698)
Keywords: llm, agent
Abstract: GPU code optimization is a key performance bottleneck for HPC workloads as well as large-model training and inference. Although compiler optimizations and hand-written kernels can partially alleviate this issue, achieving near-hardware-limit performance still relies heavily on manual code refactoring and parameter tuning. Recent progress in LLM-agent-based kernel generation and optimization has been reported, yet many approaches primarily focus on direct code rewriting, where parameter choices are often implicit and hard to control, or require human intervention, leading to unstable performance gains. This paper introduces a template-based rewriting layer on top of an agent-driven iterative loop: kernels are semantically refactored into explicitly parameterizable templates, and template parameters are then optimized via search-based autotuning, yielding more stable and higher-quality speedups. Experiments on a set of real-world kernels demonstrate speedups exceeding 3x in the best case. We extract representative CUDA kernels from SGLang as evaluation targets; the proposed agentic tuner iteratively performs templating, testing, analysis, and planning, and leverages profiling feedback to execute constrained parameter search under hardware resource limits. Compared to agent-only direct rewriting, the template-plus-search design significantly reduces the randomness of iterative optimization, making the process more interpretable and enabling a more systematic approach toward high-performance configurations. The proposed method can be further extended to OpenCL, HIP, and other backends to deliver automated performance optimization for real production workloads.
摘要：GPU 代码优化是 HPC 工作负载以及大型模型训练和推理的关键性能瓶颈。尽管编译器优化和手写内核可以部分缓解这个问题，但实现接近硬件极限的性能仍然很大程度上依赖于手动代码重构和参数调整。基于 LLM-agent 的内核生成和优化的最新进展已被报道，但许多方法主要集中于直接代码重写，其中参数选择通常是隐式的且难以控制，或者需要人工干预，导致不稳定的性能增益。本文在代理驱动的迭代循环之上引入了基于模板的重写层：内核在语义上被重构为显式参数化模板，然后通过基于搜索的自动调整来优化模板参数，从而产生更稳定和更高质量的加速。对一组真实内核的实验表明，在最佳情况下加速速度超过 3 倍。我们从SGLang中提取代表性的CUDA内核作为评估目标；所提出的代理调谐器迭代地执行模板、测试、分析和规划，并利用分析反馈在硬件资源限制下执行受限参数搜索。与仅代理直接重写相比，模板加搜索设计显着降低了迭代优化的随机性，使过程更具可解释性，并为高性能配置提供了更系统的方法。所提出的方法可以进一步扩展到 OpenCL、HIP 和其他后端，为实际生产工作负载提供自动化性能优化。
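
Note: the template-plus-search idea in the abstract above can be illustrated with a small sketch. This is not the authors' tuner: it is an exhaustive search over a hypothetical parameter space, with a toy cost function standing in for a real kernel benchmark and a made-up resource constraint standing in for hardware limits. All names (`autotune`, `toy_measure`, `fits_hardware`, `block_size`, `unroll`) are assumptions for illustration.

```python
from itertools import product


def autotune(measure, space, is_feasible):
    """Exhaustively search a parameter space and return the fastest feasible config.

    measure(cfg) returns a runtime-like cost (lower is better);
    is_feasible(cfg) rejects configs that violate resource limits.
    """
    best_cfg, best_cost = None, float("inf")
    names = sorted(space)
    for values in product(*(space[name] for name in names)):
        cfg = dict(zip(names, values))
        if not is_feasible(cfg):
            continue  # constrained search: skip configs over the resource limit
        cost = measure(cfg)
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost


def toy_measure(cfg):
    # Hypothetical stand-in for compiling and timing a templated kernel.
    return abs(cfg["block_size"] - 256) / 256 + 1.0 / cfg["unroll"]


def fits_hardware(cfg):
    # Hypothetical resource constraint (e.g. a register or shared-memory budget).
    return cfg["block_size"] * cfg["unroll"] <= 2048


space = {"block_size": [64, 128, 256, 512], "unroll": [1, 2, 4, 8]}
best_cfg, best_cost = autotune(toy_measure, space, fits_hardware)
print(best_cfg, best_cost)
```

Because the parameters are explicit in the template, the search is reproducible: the same space and constraint always yield the same configuration, which is the stability property the abstract contrasts with direct rewriting.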

Title: A Shared Geometry of Difficulty in Multilingual Language Models

Authors: Stefano Civelli, Pietro Bernardelle, Nicolò Brunello, Gianluca Demartini
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.12731
Pdf URL: https://arxiv.org/pdf/2601.12731
Copy Paste: [[2601.12731]] A Shared Geometry of Difficulty in Multilingual Language Models(https://arxiv.org/abs/2601.12731)
Keywords: language model, llm
Abstract: Predicting problem-difficulty in large language models (LLMs) refers to estimating how difficult a task is according to the model itself, typically by training linear probes on its internal representations. In this work, we study the multilingual geometry of problem-difficulty in LLMs by training linear probes using the AMC subset of the Easy2Hard benchmark, translated into 21 languages. We found that difficulty-related signals emerge at two distinct stages of the model internals, corresponding to shallow (early-layers) and deep (later-layers) internal representations, that exhibit functionally different behaviors. Probes trained on deep representations achieve high accuracy when evaluated on the same language but exhibit poor cross-lingual generalization. In contrast, probes trained on shallow representations generalize substantially better across languages, despite achieving lower within-language performance. Together, these results suggest that LLMs first form a language-agnostic representation of problem difficulty, which subsequently becomes language-specific. This closely aligns with existing findings in LLM interpretability showing that models tend to operate in an abstract conceptual space before producing language-specific outputs. We demonstrate that this two-stage representational process extends beyond semantic content to high-level meta-cognitive properties such as problem-difficulty estimation.
摘要：预测大型语言模型 (LLM) 中的问题难度是指根据模型本身来估计任务的难度，通常是通过在其内部表示上训练线性探针。在这项工作中，我们通过使用 Easy2Hard 基准的 AMC 子集（翻译成 21 种语言）训练线性探针来研究法学硕士中问题难度的多语言几何。我们发现与难度相关的信号出现在模型内部的两个不同阶段，对应于浅层（早期层）和深层（后期层）内部表示，它们表现出功能上不同的行为。在深度表征上训练的探针在使用相同语言进行评估时可以达到很高的准确度，但跨语言泛化能力较差。相比之下，尽管在语言内的性能较低，但在浅层表示上训练的探针在不同语言之间的泛化能力要好得多。总之，这些结果表明法学硕士首先形成了与语言无关的问题难度表示，随后变得特定于语言。这与法学硕士可解释性的现有发现密切相关，表明模型倾向于在生成特定于语言的输出之前在抽象概念空间中运行。我们证明了这种两阶段的表征过程超越了语义内容，扩展到了高级元认知属性，例如问题难度估计。

Title: Towards Robust Process Reward Modeling via Noise-aware Learning

Authors: Bin Xie, Bingbing Xu, Xueyun Tian, Yilin Chen, Huawei Shen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.12748
Pdf URL: https://arxiv.org/pdf/2601.12748
Copy Paste: [[2601.12748]] Towards Robust Process Reward Modeling via Noise-aware Learning(https://arxiv.org/abs/2601.12748)
Keywords: language model, llm
Abstract: Process Reward Models (PRMs) have achieved strong results in complex reasoning, but are bottlenecked by costly process-level supervision. A widely used alternative, Monte Carlo Estimation (MCE), defines process rewards as the probability that a policy model reaches the correct final answer from a given reasoning step. However, step correctness is an intrinsic property of the reasoning trajectory, and should be invariant to policy choice. Our empirical findings show that MCE produces policy-dependent rewards that induce label noise, including false positives that reward incorrect steps and false negatives that penalize correct ones. To address these challenges, we propose a two-stage framework to mitigate noisy supervision. In the labeling stage, we introduce a reflection-aware label correction mechanism that uses a large language model (LLM) as a judge to detect reflection and self-correction behaviors related to the current reasoning step, thereby suppressing overestimated rewards. In the training stage, we further propose a Noise-Aware Iterative Training (NAIT) framework that enables the PRM to progressively refine noisy labels based on its own confidence. Extensive experiments show that our method substantially improves step-level correctness discrimination, achieving up to a 27% absolute gain in average F1 over PRMs trained with noisy supervision.
摘要：过程奖励模型（PRM）在复杂推理方面取得了很好的成果，但受到成本高昂的过程级监督的瓶颈。蒙特卡洛估计（MCE）是一种广泛使用的替代方法，它将过程奖励定义为策略模型从给定推理步骤得出正确最终答案的概率。然而，步骤正确性是推理轨迹的固有属性，并且对于策略选择应该是不变的。我们的实证研究结果表明，MCE 产生的政策相关奖励会引发标签噪声，包括奖励错误步骤的误报和惩罚正确步骤的误报。为了解决上述挑战，我们提出了一个两阶段框架来减轻噪音监督。在标记阶段，我们引入了反射感知标签校正机制，该机制使用大语言模型（LLM）作为判断者来检测与当前推理步骤相关的反射和自我校正行为，从而抑制高估的奖励。在训练阶段，我们进一步提出了一个 \underline{\textbf{N}}oise-\underline{\textbf{A}}ware \underline{\textbf{I}}terative \underline{\textbf{T}}raining 框架，使 PRM 能够根据自己的置信度逐步细化噪声标签。大量实验表明，我们的方法大大提高了步骤级正确性辨别力，与经过噪声监督训练的 PRM 相比，平均 F1 绝对增益高达 27%。
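
Note: Monte Carlo Estimation as described in the abstract above can be sketched as follows. This is a toy illustration, not the paper's implementation: `rollout` is a hypothetical stand-in for sampling a completion from a policy model, and the two toy policies are invented to show why the resulting label depends on the policy rather than on the step alone.

```python
import random


def mc_step_reward(prefix, rollout, is_correct, n_samples=16, seed=0):
    """Estimate a process reward for a reasoning prefix by Monte Carlo rollouts.

    The reward is the fraction of sampled completions (from `rollout`) whose
    final answer passes `is_correct`.
    """
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_samples) if is_correct(rollout(prefix, rng)))
    return hits / n_samples


def strong_policy(prefix, rng):
    # Hypothetical policy that always finishes correctly from this prefix.
    return "42"


def weak_policy(prefix, rng):
    # Hypothetical policy that reaches the right answer only ~25% of the time.
    return "42" if rng.random() < 0.25 else "41"


def is_correct(answer):
    return answer == "42"


prefix = "step 1: 6 * 7 = ..."  # the same (correct) step for both policies
print(mc_step_reward(prefix, strong_policy, is_correct))
print(mc_step_reward(prefix, weak_policy, is_correct))
```

The same correct step receives a different estimated reward under each policy, which is the policy-dependent label noise the abstract refers to.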

Title: VISPA: Pluralistic Alignment via Automatic Value Selection and Activation

Authors: Shenyan Zheng, Jiayou Zhong, Anudeex Shetty, Heng Ji, Preslav Nakov, Usman Naseem
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.12758
Pdf URL: https://arxiv.org/pdf/2601.12758
Copy Paste: [[2601.12758]] VISPA: Pluralistic Alignment via Automatic Value Selection and Activation(https://arxiv.org/abs/2601.12758)
Keywords: language model, prompt
Abstract: As large language models are increasingly used in high-stakes domains, it is essential that their outputs reflect not the average human preference but rather a range of varying perspectives. Achieving such pluralism, however, remains challenging. Existing approaches consider limited values or rely on prompt-level interventions, lacking value control and representation. To address this, we introduce VISPA, a training-free pluralistic alignment framework that enables direct control over value expression by dynamic selection and internal model activation steering. Across extensive empirical studies spanning multiple models and evaluation settings, we show VISPA is performant across all pluralistic alignment modes in healthcare and beyond. Further analysis reveals VISPA is adaptable with different steering initiations, models, and/or values. These results suggest that pluralistic alignment can be achieved through internal activation mechanisms, offering a scalable path toward language models that serve all.
摘要：随着大型语言模型越来越多地在高风险领域中使用，它们的输出必须反映的不是平均的人类偏好，而是反映不同观点的范围。然而，实现这种多元化仍然具有挑战性。现有方法考虑有限的价值或依赖即时干预，缺乏价值控制和代表性。为了解决这个问题，我们引入了 VISPA，一种免训练的多元对齐框架，它可以通过动态选择和内部模型激活控制来直接控制价值表达。通过跨越多种模型和评估设置的广泛实证研究，我们表明 VISPA 在医疗保健及其他领域的所有多元化调整模式中均表现出色。进一步的分析表明 VISPA 可适应不同的转向启动、模型和/或值。这些结果表明，多元对齐可以通过内部激活机制来实现，为服务所有人的语言模型提供可扩展的路径。

Title: Who Does This Name Remind You of? Nationality Prediction via Large Language Model Associative Memory

Authors: Keito Inoshita
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.12771
Pdf URL: https://arxiv.org/pdf/2601.12771
Copy Paste: [[2601.12771]] Who Does This Name Remind You of? Nationality Prediction via Large Language Model Associative Memory(https://arxiv.org/abs/2601.12771)
Keywords: language model, llm, prompt, agent
Abstract: Large language models (LLMs) possess extensive world knowledge, yet methods for effectively eliciting this knowledge remain underexplored. Nationality and region prediction tasks require understanding of not only linguistic features but also cultural and historical background, making LLM world knowledge particularly valuable. However, conventional LLM prompting methods rely on direct reasoning approaches, which have limitations in applying abstract linguistic rules. We propose LLM Associative Memory Agents (LAMA), a novel framework that leverages LLM world knowledge as associative memory. Rather than directly inferring nationality from names, LAMA recalls famous individuals with the same name and aggregates their nationalities through indirect reasoning. A dual-agent architecture comprising a Person Agent and a Media Agent, specialized in different knowledge domains, recalls famous individuals in parallel, generating Top-1 predictions through voting and Top-K predictions through conditional completion. On a 99-country nationality prediction task, LAMA achieved 0.817 accuracy, substantially outperforming conventional LLM prompting methods and neural models. Our experiments reveal that LLMs exhibit higher reliability in recalling concrete examples than in abstract reasoning, that recall-based approaches are robust to low-frequency nationalities independent of data frequency distributions, and that the dual-agent architecture functions complementarily to produce synergistic effects. These results demonstrate the effectiveness of a new multi-agent system that retrieves and aggregates LLM knowledge rather than prompting reasoning.
摘要：大型语言模型（LLM）拥有广泛的世界知识，但有效获取这些知识的方法仍未得到充分探索。国籍和地区预测任务不仅需要了解语言特征，还需要了解文化和历史背景，这使得LLM世界知识特别有价值。然而，传统的LLM提示方法依赖于直接推理方法，这在应用抽象语言规则方面存在局限性。我们提出了 LLM 联想记忆代理 (LAMA)，这是一种利用 LLM 世界知识作为联想记忆的新颖框架。喇嘛不是直接从名字推断国籍，而是回忆同名的著名人物，并通过间接推理汇总他们的国籍。由人员代理和媒体代理组成的双代理架构，专门针对不同的知识领域，并行回忆名人，通过投票生成 Top-1 预测，通过条件完成生成 Top-K 预测。在 99 个国家的国籍预测任务中，LAMA 达到了 0.817 的准确率，大大优于传统的 LLM 提示方法和神经模型。我们的实验表明，法学硕士在回忆具体例子方面表现出比抽象推理更高的可靠性，基于回忆的方法对于独立于数据频率分布的低频国籍具有鲁棒性，并且双代理架构互补地发挥作用以产生协同效应。这些结果证明了新的多智能体系统的有效性，该系统检索和聚合 LLM 知识而不是提示推理。
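
Note: the recall-then-aggregate idea in the abstract above can be sketched as a simple vote. This is a simplification under stated assumptions: the agent names, the placeholder individuals, and the `vote_nationality` helper are all hypothetical, and only the Top-1 voting step is shown (the abstract's Top-K conditional completion is not modeled).

```python
from collections import Counter


def vote_nationality(recalls_by_agent):
    """Aggregate nationalities of recalled individuals by majority vote.

    recalls_by_agent maps an agent name to a list of (person, nationality)
    pairs that the agent recalled for the queried name.
    Returns (top1_nationality, vote_counts).
    """
    votes = Counter()
    for recalls in recalls_by_agent.values():
        for _person, nationality in recalls:
            votes[nationality] += 1
    top1 = votes.most_common(1)[0][0] if votes else None
    return top1, dict(votes)


# Placeholder recalls; in the described system these would come from LLM agents.
recalls = {
    "person_agent": [("Person A", "Japan"), ("Person B", "Japan"), ("Person C", "Brazil")],
    "media_agent": [("Person D", "Japan"), ("Person E", "Peru")],
}
top1, counts = vote_nationality(recalls)
print(top1, counts)
```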

Title: Do Clinical Question Answering Systems Really Need Specialised Medical Fine Tuning?

Authors: Sushant Kumar Ray, Gautam Siddharth Kashyap, Sahil Tripathi, Nipun Joshi, Vijay Govindarajan, Rafiq Ali, Jiechao Gao, Usman Naseem
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.12812
Pdf URL: https://arxiv.org/pdf/2601.12812
Copy Paste: [[2601.12812]] Do Clinical Question Answering Systems Really Need Specialised Medical Fine Tuning?(https://arxiv.org/abs/2601.12812)
Keywords: language model, gpt, llm
Abstract: Clinical Question-Answering (CQA) industry systems increasingly rely on Large Language Models (LLMs), yet their deployment is often guided by the assumption that domain-specific fine-tuning is essential. Although specialised medical LLMs such as BioBERT, BioGPT, and PubMedBERT remain popular, they face practical limitations including narrow coverage, high retraining costs, and limited adaptability. Efforts based on Supervised Fine-Tuning (SFT) have attempted to address these assumptions but continue to reinforce what we term the SPECIALISATION FALLACY: the belief that specialised medical LLMs are inherently superior for CQA. To address this assumption, we introduce MEDASSESS-X, a deployment-industry-oriented CQA framework that applies alignment at inference time rather than through SFT. MEDASSESS-X uses lightweight steering vectors to guide model activations toward medically consistent reasoning without updating model weights or requiring domain-specific retraining. This inference-time alignment layer stabilises CQA performance across both general-purpose and specialised medical LLMs, thereby resolving the SPECIALISATION FALLACY. Empirically, MEDASSESS-X delivers consistent gains across all LLM families, improving Accuracy by up to +6%, Factual Consistency by +7%, and reducing Safety Error Rate by as much as 50%.
摘要：临床问答 (CQA) 行业系统越来越依赖大型语言模型 (LLM)，但其部署通常以特定领域的微调至关重要的假设为指导。尽管 BioBERT、BioGPT 和 PubMedBERT 等专业医学法学硕士仍然很受欢迎，但它们面临着实际的局限性，包括覆盖面窄、再培训成本高和适应性有限。基于监督微调（SFT）的努力试图解决这些假设，但继续强化了我们所说的“专业化谬误”——即认为专业医学法学硕士本质上更适合 CQA。为了解决这个假设，我们引入了 MEDASSESS-X，这是一种面向部署行业的 CQA 框架，它在推理时而不是通过 SFT 应用对齐。 MEDASSESS-X 使用轻量级引导向量来引导模型激活实现医学上一致的推理，而无需更新模型权重或需要特定领域的重新训练。这种推理时间对齐层可以稳定通用和专业医学法学硕士的 CQA 性能，从而解决专业化谬误。根据经验，MEDASSESS-X 为所有 LLM 系列带来了一致的收益，将准确性提高了 6%，事实一致性提高了 7%，并将安全错误率降低了 50%。

Title: Multimodal Multi-Agent Empowered Legal Judgment Prediction

Authors: Zhaolu Kang, Junhao Gong, Qingxi Chen, Hao Zhang, Jiaxin Liu, Rong Fu, Zhiyuan Feng, Yuan Wang, Simon Fong, Kaiyue Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.12815
Pdf URL: https://arxiv.org/pdf/2601.12815
Copy Paste: [[2601.12815]] Multimodal Multi-Agent Empowered Legal Judgment Prediction(https://arxiv.org/abs/2601.12815)
Keywords: agent
Abstract: Legal Judgment Prediction (LJP) aims to predict the outcomes of legal cases based on factual descriptions, serving as a fundamental task to advance the development of legal systems. Traditional methods often rely on statistical analyses or role-based simulations but struggle with multiple allegations and diverse evidence, and lack adaptability. In this paper, we introduce JurisMMA, a novel framework for LJP that effectively decomposes trial tasks, standardizes processes, and organizes them into distinct stages. Furthermore, we build JurisMM, a large dataset with over 100,000 recent Chinese judicial records, including both text and multimodal video-text data, enabling comprehensive evaluation. Experiments on JurisMM and the benchmark LawBench validate our framework's effectiveness. These results indicate that our framework is effective not only for LJP but also for a broader range of legal applications, offering new perspectives for the development of future legal methods and datasets.
摘要：法律判决预测（LJP）旨在根据事实描述预测法律案件的结果，是推动法律制度发展的一项基础性任务。传统方法通常依赖于统计分析或基于角色的模拟，但面临指控多、证据多样且缺乏适应性的挑战。在本文中，我们介绍了 JurisMMA，这是 LJP 的一种新颖框架，它可以有效分解审判任务、标准化流程并将其组织为不同的阶段。此外，我们还构建了 JurisMM，这是一个包含超过 100,000 条近期中国司法记录的大型数据集，包括文本和多模态视频文本数据，可以进行综合评估。 JurisMM 和基准 LawBench 上的实验验证了我们框架的有效性。这些结果表明，我们的框架不仅对 LJP 有效，而且对更广泛的法律应用也有效，为未来法律方法和数据集的发展提供了新的视角。

Title: Race, Ethnicity and Their Implication on Bias in Large Language Models

Authors: Shiyue Hu, Ruizhe Li, Yanjun Gao
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.12868
Pdf URL: https://arxiv.org/pdf/2601.12868
Copy Paste: [[2601.12868]] Race, Ethnicity and Their Implication on Bias in Large Language Models(https://arxiv.org/abs/2601.12868)
Keywords: language model, llm
Abstract: Large language models (LLMs) increasingly operate in high-stakes settings including healthcare and medicine, where demographic attributes such as race and ethnicity may be explicitly stated or implicitly inferred from text. However, existing studies primarily document outcome-level disparities, offering limited insight into internal mechanisms underlying these effects. We present a mechanistic study of how race and ethnicity are represented and operationalized within LLMs. Using two publicly available datasets spanning toxicity-related generation and clinical narrative understanding tasks, we analyze three open-source models with a reproducible interpretability pipeline combining probing, neuron-level attribution, and targeted intervention. We find that demographic information is distributed across internal units with substantial cross-model variation. Although some units encode sensitive or stereotype-related associations from pretraining, identical demographic cues can induce qualitatively different behaviors. Interventions suppressing such neurons reduce bias but leave substantial residual effects, suggesting behavioral rather than representational change and motivating more systematic mitigation.
摘要：大型语言模型 (LLM) 越来越多地在包括医疗保健和医学在内的高风险环境中运行，其中种族和民族等人口统计属性可以从文本中明确说明或隐含推断。然而，现有的研究主要记录结果水平的差异，对这些影响背后的内部机制的了解有限。我们提出了一项关于种族和族裔如何在法学硕士中代表和运作的机制研究。使用涵盖毒性相关生成和临床叙述理解任务的两个公开数据集，我们分析了三个开源模型，其具有可重复的可解释性管道，结合了探测、神经元水平归因和有针对性的干预。我们发现人口统计信息分布在内部单位之间，具有显着的跨模型差异。尽管某些单元编码了预训练中的敏感或与刻板印象相关的关联，但相同的人口统计线索可能会诱发性质不同的行为。抑制此类神经元的干预措施可以减少偏见，但会留下显着的残余影响，这表明行为而非表征的变化，并激发了更系统的缓解措施。

Title: From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation

Authors: Jiahao Wang, Weiyu Xie, Mingxing Zhang, Boxing Zhang, Jianwei Dong, Yuening Zhu, Chen Lin, Jinqi Tang, Yaochen Han, Zhiyuan Ai, Xianglin Chen, Yongwei Wu, Congfeng Jiang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.12904
Pdf URL: https://arxiv.org/pdf/2601.12904
Copy Paste: [[2601.12904]] From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation(https://arxiv.org/abs/2601.12904)
Keywords: language model, llm, hallucination, prompt, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation enhances Large Language Models by integrating external knowledge, which reduces hallucinations but increases prompt length. This increase leads to higher computational costs and longer Time to First Token (TTFT). To mitigate this issue, existing solutions aim to reuse the preprocessed KV cache of each retrieved chunk to accelerate RAG. However, the lack of cross-chunk contextual information leads to a significant drop in generation quality, leaving the potential benefits of KV cache reuse largely unfulfilled. The challenge lies in how to reuse the precomputed KV cache of chunks while preserving generation quality. We propose FusionRAG, a novel inference framework that optimizes both the preprocessing and reprocessing stages of RAG. In the offline preprocessing stage, we embed information from other related text chunks into each chunk, while in the online reprocessing stage, we recompute the KV cache for tokens that the model focuses on. As a result, we achieve a better trade-off between generation quality and efficiency. According to our experiments, FusionRAG significantly improves generation quality at the same recomputation ratio compared to previous state-of-the-art solutions. By recomputing fewer than 15% of the tokens, FusionRAG achieves up to 70% higher normalized F1 scores than baselines and reduces TTFT by 2.66x-9.39x compared to Full Attention.
摘要：检索增强生成通过集成外部知识来增强大型语言模型，这减少了幻觉，但增加了提示长度。这种增加会导致更高的计算成本和更长的首次代币时间 (TTFT)。为了缓解这个问题，现有的解决方案旨在重用每个检索到的块的预处理 KV 缓存来加速 RAG。然而，缺乏跨块上下文信息会导致生成质量显着下降，从而使 KV 缓存重用的潜在好处很大程度上无法实现。挑战在于如何重用预先计算的块的 KV 缓存，同时保持生成质量。我们提出了 FusionRAG，这是一种新颖的推理框架，可以优化 RAG 的预处理和再处理阶段。在离线预处理阶段，我们将其他相关文本块的信息嵌入到每个块中，而在在线重新处理阶段，我们重新计算模型关注的标记的 KV 缓存。因此，我们在发电质量和效率之间实现了更好的权衡。根据我们的实验，与之前最先进的解决方案相比，FusionRAG 在相同的重新计算比率下显着提高了生成质量。通过重新计算不到 15% 的令牌，FusionRAG 的标准化 F1 分数比基线高出 70%，并且与 Full Attention 相比，TTFT 减少了 2.66 倍至 9.39 倍。

Title: Gated Differentiable Working Memory for Long-Context Language Modeling

Authors: Lingrui Mei, Shenghua Liu, Yiwei Wang, Yuyao Ge, Baolong Bi, Jiayu Yao, Jun Wan, Ziling Yin, Jiafeng Guo, Xueqi Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.12906
Pdf URL: https://arxiv.org/pdf/2601.12906
Copy Paste: [[2601.12906]] Gated Differentiable Working Memory for Long-Context Language Modeling(https://arxiv.org/abs/2601.12906)
Keywords: language model, long context
Abstract: Long contexts challenge transformers: attention scores dilute across thousands of tokens, critical information is often lost in the middle, and models struggle to adapt to novel patterns at inference time. Recent work on test-time adaptation addresses this by maintaining a form of working memory -- transient parameters updated on the current context -- but existing approaches rely on uniform write policies that waste computation on low-utility regions and suffer from high gradient variance across semantically heterogeneous contexts. In this work, we reframe test-time adaptation as a budget-constrained memory consolidation problem, focusing on which parts of the context should be consolidated into working memory under limited computation. We propose Gdwm (Gated Differentiable Working Memory), a framework that introduces a write controller to gate the consolidation process. The controller estimates Contextual Utility, an information-theoretic measure of long-range contextual dependence, and allocates gradient steps accordingly while maintaining global coverage. Experiments on ZeroSCROLLS and LongBench v2 demonstrate that Gdwm achieves comparable or superior performance with 4× fewer gradient steps than uniform baselines, establishing a new efficiency-performance Pareto frontier for test-time adaptation.
摘要：长上下文对 Transformer 提出了挑战：注意力分数在数千个标记中被稀释，关键信息经常在中间丢失，并且模型在推理时难以适应新的模式。最近关于测试时适应的工作通过维护一种工作记忆形式（在当前上下文中更新瞬态参数）来解决这个问题，但现有的方法依赖于统一的写入策略，这些策略浪费了低效用区域的计算，并且在语义异构上下文中遭受高梯度方差。在这项工作中，我们将测试时间适应重新定义为预算受限的内存整合问题，重点关注在有限的计算下应将上下文的哪些部分整合到工作内存中。我们提出了 Gdwm（门控可微工作内存），这是一个引入写入控制器来门控整合过程的框架。控制器估计上下文效用，这是一种远程上下文依赖性的信息论度量，并在保持全局覆盖的同时相应地分配梯度步骤。 ZeroSCROLLS 和 LongBench v2 上的实验表明，Gdwm 能够以比统一基线少 4$\times$ 的梯度步数实现可比或优越的性能，从而为测试时间适应建立了新的效率性能 Pareto 前沿。
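
Note: the budget-constrained allocation described in the abstract above (gradient steps assigned according to utility while keeping global coverage) can be illustrated with a small sketch. This is not the Gdwm write controller: the utility scores are toy integers, and the floor-plus-proportional rule with largest-remainder rounding is an assumed allocation scheme chosen for illustration.

```python
def allocate_steps(utilities, budget, floor=1):
    """Split a gradient-step budget across context chunks.

    Every chunk gets `floor` steps (coverage); the remaining steps are shared
    in proportion to each chunk's utility score, with largest-remainder
    rounding so the total equals `budget` whenever some utility is positive.
    """
    n = len(utilities)
    if budget < floor * n:
        raise ValueError("budget too small to cover every chunk")
    steps = [floor] * n
    remaining = budget - floor * n
    total = sum(utilities)
    if remaining == 0 or total <= 0:
        return steps  # nothing to distribute, or no utility signal
    shares = [remaining * u / total for u in utilities]
    extra = [int(s) for s in shares]
    leftover = remaining - sum(extra)
    # Hand out the leftover steps to the largest fractional remainders.
    order = sorted(range(n), key=lambda i: shares[i] - extra[i], reverse=True)
    for i in order[:leftover]:
        extra[i] += 1
    return [s + e for s, e in zip(steps, extra)]


# Toy utility scores for three context chunks (illustrative only).
print(allocate_steps([1, 6, 3], budget=13))
```

A uniform write policy corresponds to equal utilities; skewing the scores moves steps toward high-utility chunks without starving any chunk.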

Title: SciCoQA: Quality Assurance for Scientific Paper--Code Alignment

Authors: Tim Baumgärtner, Iryna Gurevych
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.12910
Pdf URL: https://arxiv.org/pdf/2601.12910
Copy Paste: [[2601.12910]] SciCoQA: Quality Assurance for Scientific Paper--Code Alignment(https://arxiv.org/abs/2601.12910)
Keywords: gpt, llm
Abstract: We present SciCoQA, a dataset for detecting discrepancies between scientific publications and their codebases to ensure faithful implementations. We construct SciCoQA from GitHub issues and reproducibility papers, and to scale our dataset, we propose a synthetic data generation method for constructing paper-code discrepancies. We analyze the paper-code discrepancies in detail and propose discrepancy types and categories to better understand the occurring mismatches. In total, our dataset consists of 611 paper-code discrepancies (81 real, 530 synthetic), spanning diverse computational science disciplines, including AI, Physics, Quantitative Biology, and others. Our evaluation of 21 LLMs highlights the difficulty of SciCoQA, particularly for instances involving omitted paper details, long-context inputs, and data outside the models' pre-training corpus. The best performing model in our evaluation, GPT-5, can only detect 45.7% of real-world paper-code discrepancies.
摘要：我们推出了 SciCoQA，这是一个用于检测科学出版物及其代码库之间差异的数据集，以确保忠实的实施。我们从 GitHub 问题和可重复性论文构建 SciCoQA，并为了扩展我们的数据集，我们提出了一种用于构建论文代码差异的合成数据生成方法。我们详细分析纸质代码差异并提出差异类型和类别，以更好地理解发生的不匹配。总的来说，我们的数据集包含 611 个纸质代码差异（81 个真实的，530 个合成的），涵盖不同的计算科学学科，包括人工智能、物理、定量生物学等。我们对 21 名法学硕士的评估凸显了 SciCoQA 的难度，特别是涉及遗漏论文细节、长上下文输入和模型预训练语料库之外的数据的情况。我们评估中性能最好的模型 GPT-5 只能检测到 45.7% 的现实世界纸质代码差异。

Title: Injecting Knowledge from Social Science Journals to Improve Indonesian Cultural Understanding by LLMs

Authors: Adimulya Kartiyasa, Bao Gia Cao, Boyang Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.12921
Pdf URL: https://arxiv.org/pdf/2601.12921
Copy Paste: [[2601.12921]] Injecting Knowledge from Social Science Journals to Improve Indonesian Cultural Understanding by LLMs(https://arxiv.org/abs/2601.12921)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Recently there have been intensifying efforts to improve the understanding of Indonesian cultures by large language models (LLMs). An attractive source of cultural knowledge that has been largely overlooked is local journals of social science, which likely contain substantial cultural studies from a native perspective. We present a novel text dataset of journal article passages, created from 151 open-source Indonesian social science journals, called IndoSoSci. We demonstrate an effective recipe for injecting Indonesian cultural knowledge therein into LLMs: extracting the facts related to Indonesian culture, and applying retrieval-augmented generation (RAG) with LLM-generated hypothetical documents as queries during retrieval. The proposed recipe yields strong performance gains over several strong baselines on the IndoCulture benchmark. Additionally, by combining IndoSoSci with Indonesian Wikipedia, we set a new state-of-the-art accuracy on the IndoCulture benchmark.
摘要：最近，人们加大力度通过大型语言模型（LLM）来提高对印度尼西亚文化的理解。当地社会科学期刊是文化知识的一个有吸引力的来源，但在很大程度上被忽视了，其中可能包含从本土角度进行的大量文化研究。我们提出了一个新颖的期刊文章段落文本数据集，该数据集由 151 份开源印度尼西亚社会科学期刊创建，称为 IndoSoSci。我们展示了将印尼文化知识注入法学硕士的有效方法：提取与印尼文化相关的事实，并应用检索增强生成（RAG），将法学硕士生成的假设文档作为检索过程中的查询。与 IndoCulture 基准测试中的几个强大基线相比，所提出的方案带来了强劲的性能提升。此外，通过将 IndoSoSci 与印度尼西亚维基百科相结合，我们在 IndoCulture 基准上设定了新的最先进的准确性。

Title: A Component-Based Survey of Interactions between Large Language Models and Multi-Armed Bandits

Authors: Miao Xie, Siguang Chen, Chunli Lv
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.12945
Pdf URL: https://arxiv.org/pdf/2601.12945
Copy Paste: [[2601.12945]] A Component-Based Survey of Interactions between Large Language Models and Multi-Armed Bandits(https://arxiv.org/abs/2601.12945)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large language models (LLMs) have become powerful and widely used systems for language understanding and generation, while multi-armed bandit (MAB) algorithms provide a principled framework for adaptive decision-making under uncertainty. This survey explores the potential at the intersection of these two fields. To our knowledge, it is the first survey to systematically review the bidirectional interaction between large language models and multi-armed bandits at the component level. We highlight the bidirectional benefits: MAB algorithms address critical LLM challenges, spanning from pre-training to retrieval-augmented generation (RAG) and personalization. Conversely, LLMs enhance MAB systems by redefining core components such as arm definition and environment modeling, thereby improving decision-making in sequential tasks. We analyze existing LLM-enhanced bandit systems and bandit-enhanced LLM systems, providing insights into their design, methodologies, and performance. Key challenges and representative findings are identified to help guide future research. An accompanying GitHub repository that indexes relevant literature is available at this https URL.
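
Note (illustrative, not from the survey): the abstract describes bandit algorithms as a framework for adaptive decision-making under uncertainty. A minimal sketch of UCB1, a standard bandit algorithm, is given below. The three "arms" stand in for hypothetical prompt templates, and their success rates are simulated rather than measured.

```python
import math
import random


def bernoulli(p):
    """Return a simulated arm that pays 1.0 with probability p, else 0.0."""
    def pull(rng):
        return 1.0 if rng.random() < p else 0.0
    return pull


def ucb1(arms, rounds, seed=0):
    """Run UCB1 for `rounds` pulls and return how often each arm was chosen."""
    rng = random.Random(seed)
    counts = [0] * len(arms)
    totals = [0.0] * len(arms)
    for t in range(1, rounds + 1):
        if t <= len(arms):
            arm = t - 1  # pull every arm once before using the index
        else:
            # UCB1 index: empirical mean plus an exploration bonus
            arm = max(
                range(len(arms)),
                key=lambda a: totals[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        totals[arm] += arms[arm](rng)
        counts[arm] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical success rates for three prompt templates.
    pulls = ucb1([bernoulli(0.1), bernoulli(0.3), bernoulli(0.9)], rounds=3000)
    print(pulls)  # the third arm should receive most of the pulls
```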

Title: Pardon? Evaluating Conversational Repair in Large Audio-Language Models

Authors: Shuanghong Huang, Jinlei Xu, Youchao Zhou, Yanghao Zhou, Xuan Zhao, Chong Feng, Wenxuan Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.12973
Pdf URL: https://arxiv.org/pdf/2601.12973
Copy Paste: [[2601.12973]] Pardon? Evaluating Conversational Repair in Large Audio-Language Models(https://arxiv.org/abs/2601.12973)
Keywords: language model
Abstract: Large Audio-Language Models (LALMs) have demonstrated strong performance in spoken question answering (QA), with existing evaluations primarily focusing on answer accuracy and robustness to acoustic perturbations. However, such evaluations implicitly assume that spoken inputs remain semantically answerable, an assumption that often fails in real-world interaction when essential information is missing. In this work, we introduce a repair-aware evaluation setting that explicitly distinguishes between answerable and unanswerable audio inputs. We define answerability as a property of the input itself and construct paired evaluation conditions using a semantic-acoustic masking protocol. Based on this setting, we propose the Evaluability Awareness and Repair (EAR) score, a non-compensatory metric that jointly evaluates task competence under answerable conditions and repair behavior under unanswerable conditions. Experiments on two spoken QA benchmarks across diverse LALMs reveal a consistent gap between answer accuracy and conversational reliability: while many models perform well when inputs are answerable, most fail to recognize semantic unanswerability and initiate appropriate conversational repair. These findings expose a limitation of prevailing accuracy-centric evaluation practices and motivate reliability assessments that treat unanswerable inputs as cues for repair and continued interaction.
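
Note (illustrative, not from the paper): the abstract calls EAR a non-compensatory metric but does not give its formula. The sketch below only illustrates what "non-compensatory" means, using `min` as a stand-in aggregator next to an ordinary mean. The function names and the sample numbers are hypothetical.

```python
def compensatory_score(answer_accuracy, repair_rate):
    """Plain average: a high value on one axis can offset a low value on the other."""
    return (answer_accuracy + repair_rate) / 2.0


def non_compensatory_score(answer_accuracy, repair_rate):
    """Stand-in aggregator: the score is capped by the weaker of the two axes."""
    return min(answer_accuracy, repair_rate)


if __name__ == "__main__":
    # A model that answers well but almost never asks for repair.
    print(compensatory_score(0.9, 0.1))
    print(non_compensatory_score(0.9, 0.1))
```

Under the averaged score the weak repair behavior is partly hidden by the strong accuracy; under the stand-in non-compensatory score it is not.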

Title: Bridging the Knowledge-Action Gap by Evaluating LLMs in Dynamic Dental Clinical Scenarios

Authors: Hongyang Ma, Tiantian Gu, Huaiyuan Sun, Huilin Zhu, Yongxin Wang, Jie Li, Wubin Sun, Zeliang Lian, Yinghong Zhou, Yi Gao, Shirui Wang, Zhihui Tang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.12974
Pdf URL: https://arxiv.org/pdf/2601.12974
Copy Paste: [[2601.12974]] Bridging the Knowledge-Action Gap by Evaluating LLMs in Dynamic Dental Clinical Scenarios(https://arxiv.org/abs/2601.12974)
Keywords: language model, llm, hallucination, retrieval-augmented generation, agent
Abstract: The transition of Large Language Models (LLMs) from passive knowledge retrievers to autonomous clinical agents demands a shift in evaluation, from static accuracy to dynamic behavioral reliability. To explore this boundary in dentistry, a domain where high-quality AI advice uniquely empowers patient-participatory decision-making, we present the Standardized Clinical Management & Performance Evaluation (SCMPE) benchmark, which comprehensively assesses performance from knowledge-oriented evaluations (static objective tasks) to workflow-based simulations (multi-turn simulated patient interactions). Our analysis reveals that while models demonstrate high proficiency in static objective tasks, their performance drops sharply in dynamic clinical dialogues, indicating that the primary bottleneck lies not in knowledge retention but in the critical challenges of active information gathering and dynamic state tracking. Mapping "Guideline Adherence" versus "Decision Quality" reveals a prevalent "High Efficacy, Low Safety" risk in general models. Furthermore, we quantify the impact of Retrieval-Augmented Generation (RAG). While RAG mitigates hallucinations in static tasks, its efficacy in dynamic workflows is limited and heterogeneous, sometimes causing degradation. This underscores that external knowledge alone cannot bridge the reasoning gap without domain-adaptive pre-training. This study empirically charts the capability boundaries of dental LLMs, providing a roadmap for bridging the gap between standardized knowledge and safe, autonomous clinical practice.

Title: The Bitter Lesson of Diffusion Language Models for Agentic Workflows: A Comprehensive Reality Check

Authors: Qingyu Lu, Liang Ding, Kanjian Zhang, Jinxia Zhang, Dacheng Tao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.12979
Pdf URL: https://arxiv.org/pdf/2601.12979
Copy Paste: [[2601.12979]] The Bitter Lesson of Diffusion Language Models for Agentic Workflows: A Comprehensive Reality Check(https://arxiv.org/abs/2601.12979)
Keywords: language model, llm, agent
Abstract: The pursuit of real-time agentic interaction has driven interest in Diffusion-based Large Language Models (dLLMs) as alternatives to auto-regressive backbones, promising to break the sequential latency bottleneck. However, do such efficiency gains translate into effective agentic behavior? In this work, we present a comprehensive evaluation of dLLMs (e.g., LLaDA, Dream) across two distinct agentic paradigms: Embodied Agents (requiring long-horizon planning) and Tool-Calling Agents (requiring precise formatting). Contrary to the efficiency hype, our results on Agentboard and BFCL reveal a "bitter lesson": current dLLMs fail to serve as reliable agentic backbones, frequently leading to systematic failure. (1) In Embodied settings, dLLMs suffer from repeated attempts, failing to branch under temporal feedback. (2) In Tool-Calling settings, dLLMs fail to maintain symbolic precision (e.g., strict JSON schemas) under diffusion noise. To assess the potential of dLLMs in agentic workflows, we introduce DiffuAgent, a multi-agent evaluation framework that integrates dLLMs as plug-and-play cognitive cores. Our analysis shows that dLLMs are effective in non-causal roles (e.g., memory summarization and tool selection) but require the incorporation of causal, precise, and logically grounded reasoning mechanisms into the denoising process to be viable for agentic tasks.
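
Note (illustrative, not from the paper): the abstract says tool-calling agents must keep symbolic precision, for example strict JSON schemas. The sketch below shows what such a strict check could look like with Python's standard `json` module. The call layout (`name`, `arguments`), the `get_weather` tool, and the schema format (argument name mapped to a Python type) are all assumptions made for this example, not the BFCL format.

```python
import json


def validate_tool_call(raw, schema):
    """Check a model-emitted tool call against a toy schema.

    `schema` maps each required argument name to its expected Python type.
    Returns (ok, reason).
    """
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if not isinstance(call, dict):
        return False, "top level must be an object"
    args = call.get("arguments")
    if not isinstance(call.get("name"), str) or not isinstance(args, dict):
        return False, "expected a string 'name' and an object 'arguments'"
    if set(args) != set(schema):
        return False, "argument names do not match the schema"
    for key, expected_type in schema.items():
        if not isinstance(args[key], expected_type):
            return False, f"argument '{key}' has the wrong type"
    return True, "ok"


if __name__ == "__main__":
    schema = {"city": str, "days": int}  # hypothetical tool signature
    good = '{"name": "get_weather", "arguments": {"city": "Paris", "days": 3}}'
    bad = '{"name": "get_weather", "arguments": {"city": "Paris", "days": 3,}}'
    print(validate_tool_call(good, schema))
    print(validate_tool_call(bad, schema))
```

A single stray character, such as the trailing comma in `bad`, is enough to make the call unusable, which is why small formatting errors matter so much in this setting.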

Title: ChartAttack: Testing the Vulnerability of LLMs to Malicious Prompting in Chart Generation

Authors: Jesus-German Ortiz-Barajas, Jonathan Tonglet, Vivek Gupta, Iryna Gurevych
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.12983
Pdf URL: https://arxiv.org/pdf/2601.12983
Copy Paste: [[2601.12983]] ChartAttack: Testing the Vulnerability of LLMs to Malicious Prompting in Chart Generation(https://arxiv.org/abs/2601.12983)
Keywords: language model, llm, prompt
Abstract: Multimodal large language models (MLLMs) are increasingly used to automate chart generation from data tables, enabling efficient data analysis and reporting but also introducing new misuse risks. In this work, we introduce ChartAttack, a novel framework for evaluating how MLLMs can be misused to generate misleading charts at scale. ChartAttack injects misleaders into chart designs, aiming to induce incorrect interpretations of the underlying data. Furthermore, we create AttackViz, a chart question-answering (QA) dataset where each (chart specification, QA) pair is labeled with effective misleaders and their induced incorrect answers. Experiments in in-domain and cross-domain settings show that ChartAttack significantly degrades the QA performance of MLLM readers, reducing accuracy by an average of 19.6 points and 14.9 points, respectively. A human study further shows an average 20.2 point drop in accuracy for participants exposed to misleading charts generated by ChartAttack. Our findings highlight an urgent need for robustness and security considerations in the design, evaluation, and deployment of MLLM-based chart generation systems. We make our code and data publicly available.

Title: Graph Reasoning Paradigm: Structured and Symbolic Reasoning with Topology-Aware Reinforcement Learning for Large Language Models

Authors: Runxuan Liu, Xianhao Ou, Xinyan Ma, Jiyuan Wang, Jiafeng Liang, Jiaqi Li, Tao He, Zheng Chu, Rongchuan Mu, Zekun Wang, Baoxin Wang, Dayong Wu, Ming Liu, Shijin Wang, Guoping Hu, Bing Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.12995
Pdf URL: https://arxiv.org/pdf/2601.12995
Copy Paste: [[2601.12995]] Graph Reasoning Paradigm: Structured and Symbolic Reasoning with Topology-Aware Reinforcement Learning for Large Language Models(https://arxiv.org/abs/2601.12995)
Keywords: language model, llm, chain-of-thought
Abstract: Long Chain-of-Thought (LCoT), achieved by Reinforcement Learning with Verifiable Rewards (RLVR), has proven effective in enhancing the reasoning capabilities of Large Language Models (LLMs). However, reasoning in current LLMs is primarily generated as plain text, where performing semantic evaluation on such unstructured data creates a computational bottleneck during training. Despite RLVR-based optimization, existing methods still suffer from coarse-grained supervision, reward hacking, high training costs, and poor generalization. To address these issues, we propose the Graph Reasoning Paradigm (GRP), which realizes structured and symbolic reasoning, implemented via graph-structured representations with step-level cognitive labels. Building upon GRP, we further design Process-Aware Stratified Clipping Group Relative Policy Optimization (PASC-GRPO), which leverages structured evaluation to replace semantic evaluation, achieves process-aware verification through graph-structured outcome rewards, and mitigates reward hacking via stratified clipping advantage estimation. Experiments demonstrate significant improvements across mathematical reasoning and code generation tasks. Data, models, and code will be released later.
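
Note (illustrative, not from the paper): PASC-GRPO builds on Group Relative Policy Optimization. The sketch below shows only the standard group-relative advantage step, in which each sampled response is scored relative to the other responses for the same prompt. The stratified clipping described in the abstract is not reproduced here because the abstract does not specify it. The population standard deviation is an assumption; implementations differ on this choice.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical verifiable rewards (1 = correct, 0 = incorrect) for four samples.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When all responses in a group receive the same reward, every advantage is zero and the group contributes no learning signal.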

Title: Bi-Attention HateXplain : Taking into account the sequential aspect of data during explainability in a multi-task context

Authors: Ghislain Dorian Tchuente Mondjo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.13018
Pdf URL: https://arxiv.org/pdf/2601.13018
Copy Paste: [[2601.13018]] Bi-Attention HateXplain : Taking into account the sequential aspect of data during explainability in a multi-task context(https://arxiv.org/abs/2601.13018)
Keywords: llm
Abstract: Technological advances in the Internet and online social networks have brought many benefits to humanity. At the same time, this growth has led to an increase in hate speech, the main global threat. To improve the reliability of black-box models used for hate speech detection, post-hoc approaches such as LIME, SHAP, and LRP provide the explanation after training the classification model. In contrast, multi-task approaches based on the HateXplain benchmark learn to explain and classify simultaneously. However, results from HateXplain-based algorithms show that predicted attention varies considerably when it should be constant. This attention variability can lead to inconsistent interpretations, instability of predictions, and learning difficulties. To solve this problem, we propose the BiAtt-BiRNN-HateXplain (Bidirectional Attention BiRNN HateXplain) model, which is easier to explain than LLMs, whose greater complexity conflicts with the need for transparency, and which takes into account the sequential aspect of the input data during explainability thanks to a BiRNN layer. Thus, if the explanation is correctly estimated, thanks to multi-task learning (explainability and classification tasks), the model could classify better and commit fewer unintentional bias errors related to communities. The experimental results on HateXplain data show a clear improvement in detection performance and explainability, and a reduction in unintentional bias.

Title: Tears or Cheers? Benchmarking LLMs via Culturally Elicited Distinct Affective Responses

Authors: Chongyuan Dai, Yaling Shen, Jinpeng Hu, Zihan Gao, Jia Li, Yishun Jiang, Yaxiong Wang, Liu Liu, Zongyuan Ge
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.13024
Pdf URL: https://arxiv.org/pdf/2601.13024
Copy Paste: [[2601.13024]] Tears or Cheers? Benchmarking LLMs via Culturally Elicited Distinct Affective Responses(https://arxiv.org/abs/2601.13024)
Keywords: language model, llm
Abstract: Culture serves as a fundamental determinant of human affective processing and profoundly shapes how individuals perceive and interpret emotional stimuli. Despite this intrinsic link, extant evaluations of cultural alignment within Large Language Models primarily prioritize declarative knowledge such as geographical facts or established societal customs. These benchmarks remain insufficient to capture the subjective interpretative variance inherent to diverse sociocultural lenses. To address this limitation, we introduce CEDAR, a multimodal benchmark constructed entirely from scenarios capturing Culturally Elicited Distinct Affective Responses. To construct CEDAR, we implement a novel pipeline that leverages LLM-generated provisional labels to isolate instances yielding cross-cultural emotional distinctions, and subsequently derives reliable ground-truth annotations through rigorous human evaluation. The resulting benchmark comprises 10,962 instances across seven languages and 14 fine-grained emotion categories, with each language including 400 multimodal and 1,166 text-only samples. Comprehensive evaluations of 17 representative multilingual models reveal a dissociation between language consistency and cultural alignment, demonstrating that culturally grounded affective understanding remains a significant challenge for current models.

Title: Profiling German Text Simplification with Interpretable Model-Fingerprints

Authors: Lars Klöser, Mika Beele, Bodo Kraft
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.13050
Pdf URL: https://arxiv.org/pdf/2601.13050
Copy Paste: [[2601.13050]] Profiling German Text Simplification with Interpretable Model-Fingerprints(https://arxiv.org/abs/2601.13050)
Keywords: language model, llm, prompt
Abstract: While Large Language Models (LLMs) produce highly nuanced text simplifications, developers currently lack tools for a holistic, efficient, and reproducible diagnosis of their behavior. This paper introduces the Simplification Profiler, a diagnostic toolkit that generates a multidimensional, interpretable fingerprint of simplified texts. Multiple aggregated simplifications of a model result in a model's fingerprint. This novel evaluation paradigm is particularly vital for languages, where the data scarcity problem is magnified when creating flexible models for diverse target groups rather than a single, fixed simplification style. We propose that measuring a model's unique behavioral signature is more relevant in this context as an alternative to correlating metrics with human preferences. We operationalize this with a practical meta-evaluation of our fingerprints' descriptive power, which bypasses the need for large, human-rated datasets. This test measures whether a simple linear classifier can reliably identify various model configurations from the simplifications they produce, confirming that our metrics are sensitive to a model's specific characteristics. The Profiler can distinguish high-level behavioral variations between prompting strategies and fine-grained changes from prompt engineering, including few-shot examples. Our complete feature set achieves classification F1-scores up to 71.9%, improving upon simple baselines by over 48 percentage points. The Simplification Profiler thus offers developers a granular, actionable analysis to build more effective and truly adaptive text simplification systems.

Title: Alexandria: A Multi-Domain Dialectal Arabic Machine Translation Dataset for Culturally Inclusive and Linguistically Diverse LLMs

Authors: Abdellah El Mekki, Samar M. Magdy, Houdaifa Atou, Ruwa AbuHweidi, Baraah Qawasmeh, Omer Nacar, Thikra Al-hibiri, Razan Saadie, Hamzah Alsayadi, Nadia Ghezaiel Hammouda, Alshima Alkhazimi, Aya Hamod, Al-Yas Al-Ghafri, Wesam El-Sayed, Asila Al sharji, Mohamad Ballout, Anas Belfathi, Karim Ghaddar, Serry Sibaee, Alaa Aoun, Areej Asiri, Lina Abureesh, Ahlam Bashiti, Majdal Yousef, Abdulaziz Hafiz, Yehdih Mohamed, Emira Hamedtou, Brakehe Brahim, Rahaf Alhamouri, Youssef Nafea, Aya El Aatar, Walid Al-Dhabyani, Emhemed Hamed, Sara Shatnawi, Fakhraddin Alwajih, Khalid Elkhidir, Ashwag Alasmari, Abdurrahman Gerrio, Omar Alshahri, AbdelRahim A. Elmadany, Ismail Berrada, Amir Azad Adli Alkathiri, Fadi A Zaraket, Mustafa Jarrar, Yahya Mohamed El Hadj, Hassan Alhuzali, Muhammad Abdul-Mageed
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.13099
Pdf URL: https://arxiv.org/pdf/2601.13099
Copy Paste: [[2601.13099]] Alexandria: A Multi-Domain Dialectal Arabic Machine Translation Dataset for Culturally Inclusive and Linguistically Diverse LLMs(https://arxiv.org/abs/2601.13099)
Keywords: language model, llm
Abstract: Arabic is a highly diglossic language where most daily communication occurs in regional dialects rather than Modern Standard Arabic. Despite this, machine translation (MT) systems often generalize poorly to dialectal input, limiting their utility for millions of speakers. We introduce Alexandria, a large-scale, community-driven, human-translated dataset designed to bridge this gap. Alexandria covers 13 Arab countries and 11 high-impact domains, including health, education, and agriculture. Unlike previous resources, Alexandria provides unprecedented granularity by associating contributions with city-of-origin metadata, capturing authentic local varieties beyond coarse regional labels. The dataset consists of multi-turn conversational scenarios annotated with speaker-addressee gender configurations, enabling the study of gender-conditioned variation in dialectal use. Comprising 107K total samples, Alexandria serves as both a training resource and a rigorous benchmark for evaluating MT and Large Language Models (LLMs). Our automatic and human evaluation of Arabic-aware LLMs benchmarks current capabilities in translating across diverse Arabic dialects and sub-dialects, while exposing significant persistent challenges.

Title: Leveraging Lora Fine-Tuning and Knowledge Bases for Construction Identification

Authors: Liu Kaipeng, Wu Ling
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.13105
Pdf URL: https://arxiv.org/pdf/2601.13105
Copy Paste: [[2601.13105]] Leveraging Lora Fine-Tuning and Knowledge Bases for Construction Identification(https://arxiv.org/abs/2601.13105)
Keywords: language model, retrieval-augmented generation
Abstract: This study investigates the automatic identification of the English ditransitive construction by integrating LoRA-based fine-tuning of a large language model with a Retrieval-Augmented Generation (RAG) framework. A binary classification task was conducted on annotated data from the British National Corpus. Results demonstrate that a LoRA-fine-tuned Qwen3-8B model significantly outperformed both a native Qwen3-MAX model and a theory-only RAG system. Detailed error analysis reveals that fine-tuning shifts the model's judgment from surface-form pattern matching towards a more semantically grounded understanding.
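
Note (illustrative, not from the paper): LoRA keeps a d x k weight matrix frozen and trains two low-rank factors of shapes d x r and r x k instead. The sketch below compares the number of trainable parameters for a single layer. The layer shape and the rank are hypothetical values chosen for the arithmetic, not the configuration used in the study.

```python
def lora_param_counts(d, k, r):
    """Return (full fine-tuning params, LoRA params) for one d x k weight matrix."""
    full = d * k        # every entry of the weight matrix is trainable
    lora = r * (d + k)  # factor B is d x r, factor A is r x k
    return full, lora


if __name__ == "__main__":
    full, lora = lora_param_counts(d=4096, k=4096, r=8)
    print(full, lora, full // lora)  # 16777216 65536 256
```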

Title: CORE-T: COherent REtrieval of Tables for Text-to-SQL

Authors: Hassan Soliman, Vivek Gupta, Dan Roth, Iryna Gurevych
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2601.13111
Pdf URL: https://arxiv.org/pdf/2601.13111
Copy Paste: [[2601.13111]] CORE-T: COherent REtrieval of Tables for Text-to-SQL(https://arxiv.org/abs/2601.13111)
Keywords: llm
Abstract: Realistic text-to-SQL workflows often require joining multiple tables. As a result, accurately retrieving the relevant set of tables becomes a key bottleneck for end-to-end performance. We study an open-book setting where queries must be answered over large, heterogeneous table collections pooled from many sources, without clean scoping signals such as database identifiers. Here, dense retrieval (DR) achieves high recall but returns many distractors, while join-aware alternatives often rely on extra assumptions and/or incur high inference overhead. We propose CORE-T, a scalable, training-free framework that enriches tables with LLM-generated purpose metadata and pre-computes a lightweight table-compatibility cache. At inference time, DR returns top-K candidates; a single LLM call selects a coherent, joinable subset, and a simple additive adjustment step restores strongly compatible tables. Across Bird, Spider, and MMQA, CORE-T improves table-selection F1 by up to 22.7 points while retrieving up to 42% fewer tables, improving multi-table execution accuracy by up to 5.0 points on Bird and 6.9 points on MMQA, and using 4-5x fewer tokens than LLM-intensive baselines.
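
Note (illustrative, not from the paper): the abstract reports gains in table-selection F1 while retrieving fewer tables. The sketch below computes a set-based F1 between retrieved and gold tables to show why dropping distractors raises the score even when recall is already perfect. The table names are hypothetical, and the paper's exact metric definition may differ.

```python
def table_selection_f1(retrieved, gold):
    """Set-based F1 between the retrieved tables and the gold tables."""
    retrieved, gold = set(retrieved), set(gold)
    true_positives = len(retrieved & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(retrieved)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = {"orders", "customers"}
    dense_top_k = {"orders", "customers", "products", "audit_log"}  # two distractors
    pruned = {"orders", "customers"}
    print(table_selection_f1(dense_top_k, gold))
    print(table_selection_f1(pruned, gold))
```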

Title: Agentic Conversational Search with Contextualized Reasoning via Reinforcement Learning

Authors: Fengran Mo, Yifan Gao, Sha Li, Hansi Zeng, Xin Liu, Zhaoxuan Tan, Xian Li, Jianshu Chen, Dakuo Wang, Meng Jiang
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2601.13115
Pdf URL: https://arxiv.org/pdf/2601.13115
Copy Paste: [[2601.13115]] Agentic Conversational Search with Contextualized Reasoning via Reinforcement Learning(https://arxiv.org/abs/2601.13115)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) have become a popular interface for human-AI interaction, supporting information seeking and task assistance through natural, multi-turn dialogue. In multi-turn dialogues, context-dependent user intent evolves across interactions, so responding to users requires contextual interpretation, query reformulation, and dynamic coordination between retrieval and generation. Existing studies usually follow static rewrite, retrieve, and generate pipelines, which optimize the different procedures separately and overlook the joint optimization of mixed-initiative actions. Although recent developments in deep search agents demonstrate the effectiveness of jointly optimizing retrieval and generation via reasoning, these approaches focus on single-turn scenarios and might lack the ability to handle multi-turn interactions. We introduce a conversational agent that interleaves search and reasoning across turns, enabling exploratory and adaptive behaviors learned through reinforcement learning (RL) training with rewards tailored to evolving user goals. The experimental results across four widely used conversational benchmarks demonstrate the effectiveness of our methods by surpassing several existing strong baselines.

Title: Adversarial Alignment: Ensuring Value Consistency in Large Language Models for Sensitive Domains

Authors: Yuan Gao, Zhigang Liu, Xinyu Yao, Bo Chen, Xiaobing Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.13137
Pdf URL: https://arxiv.org/pdf/2601.13137
Copy Paste: [[2601.13137]] Adversarial Alignment: Ensuring Value Consistency in Large Language Models for Sensitive Domains(https://arxiv.org/abs/2601.13137)
Keywords: language model, llm
Abstract: With the wide application of large language models (LLMs), the problems of bias and value inconsistency in sensitive domains have gradually emerged, especially in terms of race, society and politics. In this paper, we propose an adversarial alignment framework, which enhances the value consistency of the model in sensitive domains through continued pre-training, instruction fine-tuning and adversarial training. In adversarial training, we use the Attacker to generate controversial queries, the Actor to generate responses with value consistency, and the Critic to filter and ensure response quality. Furthermore, we train a Value-Consistent Large Language Model, VC-LLM, for sensitive domains, and construct a bilingual evaluation dataset in Chinese and English. The experimental results show that VC-LLM performs better than the existing mainstream models in both Chinese and English tests, verifying the effectiveness of the method. Warning: This paper contains examples of LLMs that are offensive or harmful in nature.

Title: Probe and Skip: Self-Predictive Token Skipping for Efficient Long-Context LLM Inference

Authors: Zimeng Wu, Donghao Wang, Chaozhe Jin, Jiaxin Chen, Yunhong Wang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.13155
Pdf URL: https://arxiv.org/pdf/2601.13155
Copy Paste: [[2601.13155]] Probe and Skip: Self-Predictive Token Skipping for Efficient Long-Context LLM Inference(https://arxiv.org/abs/2601.13155)
Keywords: language model, llm
Abstract: Long-context inference enhances the reasoning capability of Large Language Models (LLMs) while incurring significant computational overhead. Token-oriented methods, such as pruning and skipping, have shown promise in reducing inference latency, but still suffer from inherently limited acceleration potential, outdated proxy signals, and redundancy interference, thus yielding suboptimal speed-accuracy trade-offs. To address these challenges, we propose SPTS (Self-Predictive Token Skipping), a training-free framework for efficient long-context LLM inference. Specifically, motivated by the thought of probing the influence of targeted skipping layers, we design two component-specific strategies for selective token skipping: Partial Attention Probing (PAP) for multi-head attention, which selects informative tokens by performing partial forward attention computation, and Low-rank Transformation Probing (LTP) for the feed-forward network, which constructs a low-rank proxy network to predict token transformations. Furthermore, a Multi-Stage Delayed Pruning (MSDP) strategy reallocates the skipping budget and progressively prunes redundant tokens across layers. Extensive experiments demonstrate the effectiveness of our method, achieving up to 2.46× and 2.29× speedups for prefilling and end-to-end generation, respectively, while maintaining state-of-the-art model performance. The source code will be publicly available upon paper acceptance.

Title: Medical Triage as Pairwise Ranking: A Benchmark for Urgency in Patient Portal Messages

Authors: Joseph Gatto, Parker Seegmiller, Timothy Burdick, Philip Resnik, Roshnik Rahat, Sarah DeLozier, Sarah M. Preum
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.13178
Pdf URL: https://arxiv.org/pdf/2601.13178
Copy Paste: [[2601.13178]] Medical Triage as Pairwise Ranking: A Benchmark for Urgency in Patient Portal Messages(https://arxiv.org/abs/2601.13178)
Keywords: llm
Abstract: Medical triage is the task of allocating medical resources and prioritizing patients based on medical need. This paper introduces the first large-scale public dataset for studying medical triage in the context of asynchronous outpatient portal messages. Our novel task formulation views patient message triage as a pairwise inference problem, where we train LLMs to choose "which message is more medically urgent" in a head-to-head tournament-style re-sort of a physician's inbox. Our novel benchmark PMR-Bench contains 1,569 unique messages and 2,000+ high-quality test pairs for pairwise medical urgency assessment alongside a scalable training data generation pipeline. PMR-Bench includes samples that contain unstructured patient-written messages alongside real electronic health record (EHR) data, emulating a real-world medical triage scenario. We develop a novel automated data annotation strategy to provide LLMs with in-domain guidance on this task. The resulting data is used to train two model classes, UrgentReward and UrgentSFT, which leverage Bradley-Terry and next-token prediction objectives, respectively, to perform pairwise urgency classification. We find that UrgentSFT achieves top performance on PMR-Bench, with UrgentReward showing distinct advantages in low-resource settings. For example, UrgentSFT-8B and UrgentReward-8B provide a 15- and 16-point boost, respectively, on inbox sorting metrics over off-the-shelf 8B models. Paper resources can be found at this https URL
摘要：医疗分诊是根据医疗需求分配医疗资源并优先考虑患者的任务。本文介绍了第一个用于研究异步门诊门户消息背景下的医疗分诊的大型公共数据集。我们新颖的任务制定将患者消息分类视为成对推理问题，我们训练法学硕士在医生收件箱的面对面锦标赛式重新排序中选择“哪条消息在医疗上更紧急”。我们的新颖基准 PMR-Bench 包含 1569 条独特的消息和 2,000 多个高质量测试对，用于成对医疗紧急情况评估以及可扩展的训练数据生成管道。 PMR-Bench 包含的样本包含非结构化患者编写的消息以及真实的电子健康记录 (EHR) 数据，模拟真实世界的医疗分诊场景。我们开发了一种新颖的自动化数据注释策略，为法学硕士提供有关此任务的领域内指导。生成的数据用于训练两个模型类 UrgentReward 和 UrgentSFT，分别利用 Bradley-Terry 和下一个令牌预测目标来执行成对紧急度分类。我们发现 UrgentSFT 在 PMR-Bench 上实现了最佳性能，而 UrgentReward 在低资源环境中显示出明显的优势。例如，与现成的 8B 模型相比，UrgentSFT-8B 和 UrgentReward-8B 的收件箱排序指标分别提高了 15 点和 16 点。论文资源可以在此 https URL 找到
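
Note: the pairwise formulation above can be illustrated with a minimal sketch. It assumes each message already has a scalar urgency score (in practice this would come from a trained model); the function names, the example messages, and the round-robin sorting rule are hypothetical and are not taken from the paper.

```python
import math

def bradley_terry_prob(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that A beats B, given scalar scores."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

def bradley_terry_loss(score_winner: float, score_loser: float) -> float:
    """Negative log-likelihood of the observed preference (winner over loser)."""
    return -math.log(bradley_terry_prob(score_winner, score_loser))

def sort_inbox(messages, prefer):
    """Round-robin tournament: rank messages by number of head-to-head wins.

    `prefer(a, b)` is a hypothetical comparator returning True when message
    `a` is judged more urgent than message `b`.
    """
    wins = {m: 0 for m in messages}
    for i, a in enumerate(messages):
        for b in messages[i + 1:]:
            winner = a if prefer(a, b) else b
            wins[winner] += 1
    return sorted(messages, key=lambda m: wins[m], reverse=True)

# Illustrative scores only; not data from PMR-Bench.
scores = {"chest pain": 3.0, "refill request": 0.5, "billing question": -1.0}
prefer = lambda a, b: bradley_terry_prob(scores[a], scores[b]) > 0.5
ranked = sort_inbox(["billing question", "chest pain", "refill request"], prefer)
```

A round-robin needs a quadratic number of comparisons; the abstract does not say which tournament scheme the authors use, so this is only one possible choice.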

Title: OpenExempt: A Diagnostic Benchmark for Legal Reasoning and a Framework for Creating Custom Benchmarks on Demand

Authors: Sergio Servantez, Sarah B. Lawsky, Rajiv Jain, Daniel W. Linna Jr., Kristian Hammond
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.13183
Pdf URL: https://arxiv.org/pdf/2601.13183
Copy Paste: [[2601.13183]] OpenExempt: A Diagnostic Benchmark for Legal Reasoning and a Framework for Creating Custom Benchmarks on Demand(https://arxiv.org/abs/2601.13183)
Keywords: language model
Abstract: Reasoning benchmarks have played a crucial role in the progress of language models. Yet rigorous evaluation remains a significant challenge as static question-answer pairs provide only a snapshot of performance, compressing complex behavior into a single accuracy metric. This limitation is especially true in complex, rule-bound domains such as law, where existing benchmarks are costly to build and ill suited for isolating specific failure modes. To address this, we introduce OpenExempt, a framework and benchmark for diagnostic evaluation of legal reasoning. The OpenExempt Framework uses expert-crafted symbolic representations of U.S. Bankruptcy Code statutes to dynamically generate a large space of natural language reasoning tasks and their machine-computable solutions on demand. This gives users fine-grained control over task complexity and scope, allowing individual reasoning skills to be probed in isolation. Using this system, we construct the OpenExempt Benchmark, a diagnostic benchmark for legal reasoning with 9,765 samples across nine evaluation suites designed to carefully probe model capabilities. Experiments on 13 diverse language models reveal sharp performance cliffs that emerge only under longer reasoning paths and in the presence of obfuscating statements. We release the framework and benchmark publicly to support research aimed at understanding and improving the next generation of reasoning systems.
摘要：推理基准在语言模型的进步中发挥了至关重要的作用。然而，严格的评估仍然是一个重大挑战，因为静态问答对仅提供性能快照，将复杂的行为压缩为单个准确性指标。这种限制在复杂的、受规则约束的领域尤其如此，例如法律，现有基准的构建成本高昂，并且不适合隔离特定的故障模式。为了解决这个问题，我们引入了 OpenExempt，一个用于法律推理诊断评估的框架和基准。 OpenExempt 框架使用专家制作的美国破产法法规的符号表示来动态生成大量自然语言推理任务及其按需的机器计算解决方案。这使用户可以对任务复杂性和范围进行细粒度控制，从而可以单独探索个人推理技能。使用该系统，我们构建了 OpenExempt Benchmark，这是一个法律推理的诊断基准，包含九个评估套件中的 9,765 个样本，旨在仔细探索模型的功能。对 13 种不同语言模型的实验揭示了只有在较长的推理路径和存在混淆语句的情况下才会出现的急剧性能悬崖。我们公开发布框架和基准，以支持旨在理解和改进下一代推理系统的研究。

Title: Beyond Single-shot Writing: Deep Research Agents are Unreliable at Multi-turn Report Revision

Authors: Bingsen Chen, Boyan Li, Ping Nie, Yuyu Zhang, Xi Ye, Chen Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.13217
Pdf URL: https://arxiv.org/pdf/2601.13217
Copy Paste: [[2601.13217]] Beyond Single-shot Writing: Deep Research Agents are Unreliable at Multi-turn Report Revision(https://arxiv.org/abs/2601.13217)
Keywords: prompt, agent
Abstract: Existing benchmarks for Deep Research Agents (DRAs) treat report generation as a single-shot writing task, which fundamentally diverges from how human researchers iteratively draft and revise reports via self-reflection or peer feedback. Whether DRAs can reliably revise reports with user feedback remains unexplored. We introduce Mr Dre, an evaluation suite that establishes multi-turn report revision as a new evaluation axis for DRAs. Mr Dre consists of (1) a unified long-form report evaluation protocol spanning comprehensiveness, factuality, and presentation, and (2) a human-verified feedback simulation pipeline for multi-turn revision. Our analysis of five diverse DRAs reveals a critical limitation: while agents can address most user feedback, they also regress on 16-27% of previously covered content and citation quality. Over multiple revision turns, even the best-performing agents leave significant headroom, as they continue to disrupt content outside the feedback's scope and fail to preserve earlier edits. We further show that these issues are not easily resolvable through inference-time fixes such as prompt engineering and a dedicated sub-agent for report revision.
摘要：深度研究代理 (DRA) 的现有基准将报告生成视为单次编写任务，这从根本上不同于人类研究人员通过自我反思或同行反馈迭代起草和修改报告的方式。 DRA 是否可以根据用户反馈可靠地修改报告仍有待探索。我们引入 Mr Dre，这是一个评估套件，它将多轮报告修订建立为 DRA 的新评估轴。 Mr Dre 包括 (1) 一个统一的长格式报告评估协议，涵盖全面性、真实性和演示性，以及 (2) 用于多轮修订的经人工验证的反馈模拟管道。我们对五种不同的 DRA 的分析揭示了一个关键的局限性：虽然代理可以解决大多数用户反馈，但它们也会对之前覆盖的内容和引文质量进行 16-27% 的回归。在多次修订轮次中，即使是表现最好的代理也会留下很大的空间，因为他们会继续破坏反馈范围之外的内容，并且无法保留早期的编辑。我们进一步表明，这些问题不容易通过推理时间修复（例如即时工程和用于报告修订的专用子代理）来解决。

Title: Autoregressive Models Rival Diffusion Models at ANY-ORDER Generation

Authors: Tianqi Du, Lizhe Fang, Weijie Yang, Chenheng Zhang, Zeming Wei, Yifei Wang, Yisen Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.13228
Pdf URL: https://arxiv.org/pdf/2601.13228
Copy Paste: [[2601.13228]] Autoregressive Models Rival Diffusion Models at ANY-ORDER Generation(https://arxiv.org/abs/2601.13228)
Keywords: language model
Abstract: Diffusion language models enable any-order generation and bidirectional conditioning, offering appealing flexibility for tasks such as infilling, rewriting, and self-correction. However, their formulation, which predicts one part of a sequence from another within a single-step dependency, limits modeling depth and often yields lower sample quality and stability than autoregressive (AR) models. To address this, we revisit autoregressive modeling as a foundation and reformulate diffusion-style training into a structured multi-group prediction process. We propose Any-order Any-subset Autoregressive modeling (A3), a generalized framework that extends the standard AR factorization to arbitrary token groups and generation orders. A3 preserves the probabilistic rigor and multi-layer dependency modeling of AR while inheriting diffusion models' flexibility for parallel and bidirectional generation. We implement A3 through a two-stream attention architecture and a progressive adaptation strategy that transitions pretrained AR models toward any-order prediction. Experiments on question answering, commonsense reasoning, and story infilling demonstrate that A3 outperforms diffusion-based models while maintaining flexible decoding. This work offers a unified approach for a flexible, efficient, and novel language modeling paradigm.
摘要：扩散语言模型支持任意顺序生成和双向调节，为填充、重写和自我纠正等任务提供了极具吸引力的灵活性。然而，它们的公式（在单步依赖性内预测序列的一部分）限制了建模深度，并且通常会产生比自回归 (AR) 模型更低的样本质量和稳定性。为了解决这个问题，我们重新审视自回归模型作为基础，并将扩散式训练重新表述为结构化的多组预测过程。我们提出了任意阶任意子集自回归建模（A3），这是一个将标准 AR 分解扩展到任意标记组和生成顺序的通用框架。 A3 保留了 AR 的概率严谨性和多层依赖建模，同时继承了扩散模型并行和双向生成的灵活性。我们通过双流注意力架构和渐进式适应策略来实现 A3，该策略将预训练的 AR 模型转变为任意顺序预测。问答、常识推理和故事填充的实验表明，A3 在保持灵活解码的同时优于基于扩散的模型。这项工作为灵活、高效和新颖的语言建模范例提供了统一的方法。
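
Note: the step from the standard AR factorization to arbitrary token groups can be written out with the chain rule. The notation below (groups G_1, ..., G_K partitioning the positions 1..T in any order) is an assumption for illustration; the abstract does not give the paper's own notation or say how tokens within a group are modeled.

```latex
% Standard left-to-right AR factorization over tokens x_1, ..., x_T
p(x) = \prod_{t=1}^{T} p\left(x_t \mid x_{<t}\right)

% Group-wise factorization (assumed notation): G_1, ..., G_K partition
% the positions {1, ..., T} and may be visited in any order
p(x) = \prod_{k=1}^{K} p\left(x_{G_k} \mid x_{G_1}, \dots, x_{G_{k-1}}\right)
```

Setting K = T with G_k = {k} recovers the standard factorization, and K = 2 corresponds to predicting one part of the sequence from the other in a single step.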

Title: Aligning Agentic World Models via Knowledgeable Experience Learning

Authors: Baochang Ren, Yunzhi Yao, Rui Sun, Shuofei Qiao, Ningyu Zhang, Huajun Chen
Subjects: cs.CL, cs.AI, cs.CV, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2601.13247
Pdf URL: https://arxiv.org/pdf/2601.13247
Copy Paste: [[2601.13247]] Aligning Agentic World Models via Knowledgeable Experience Learning(https://arxiv.org/abs/2601.13247)
Keywords: language model, llm, hallucination, agent
Abstract: Current Large Language Models (LLMs) exhibit a critical modal disconnect: they possess vast semantic knowledge but lack the procedural grounding to respect the immutable laws of the physical world. Consequently, while these agents implicitly function as world models, their simulations often suffer from physical hallucinations, generating plans that are logically sound but physically unexecutable. Existing alignment strategies predominantly rely on resource-intensive training or fine-tuning, which attempt to compress dynamic environmental rules into static model parameters. However, such parametric encapsulation is inherently rigid, struggling to adapt to the open-ended variability of physical dynamics without continuous, costly retraining. To bridge this gap, we introduce WorldMind, a framework that autonomously constructs a symbolic World Knowledge Repository by synthesizing environmental feedback. Specifically, it unifies Process Experience to enforce physical feasibility via prediction errors and Goal Experience to guide task optimality through successful trajectories. Experiments on EB-ALFRED and EB-Habitat demonstrate that WorldMind achieves superior performance compared to baselines with remarkable cross-model and cross-environment transferability.
摘要：当前的大型语言模型（LLM）表现出严重的模式脱节：它们拥有大量的语义知识，但缺乏尊重物理世界不变法则的程序基础。因此，虽然这些代理隐式地充当世界模型，但它们的模拟经常遭受物理幻觉生成计划的影响，这些计划在逻辑上是合理的，但在物理上却无法执行。现有的对齐策略主要依赖于资源密集型训练或微调，试图将动态环境规则压缩为静态模型参数。然而，这种参数化封装本质上是僵化的，在没有持续、昂贵的再训练的情况下，很难适应物理动力学的开放式可变性。为了弥补这一差距，我们引入了 WorldMind，这是一个通过综合环境反馈自主构建符号世界知识库的框架。具体来说，它结合了过程经验，通过预测错误来增强物理可行性，并结合目标经验，通过成功的轨迹来指导任务优化。 EB-ALFRED 和 EB-Habitat 上的实验表明，与基线相比，WorldMind 具有卓越的性能，具有出色的跨模型和跨环境可迁移性。

Title: Beyond Cosine Similarity: Taming Semantic Drift and Antonym Intrusion in a 15-Million Node Turkish Synonym Graph

Authors: Ebubekir Tosun, Mehmet Emin Buldur, Özay Ezerceli, Mahmoud ElHussieni
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.13251
Pdf URL: https://arxiv.org/pdf/2601.13251
Copy Paste: [[2601.13251]] Beyond Cosine Similarity: Taming Semantic Drift and Antonym Intrusion in a 15-Million Node Turkish Synonym Graph(https://arxiv.org/abs/2601.13251)
Keywords: llm, retrieval-augmented generation
Abstract: Neural embeddings have a notorious blind spot: they can't reliably tell synonyms apart from antonyms. Consequently, increasing similarity thresholds often fails to prevent opposites from being grouped together. We've built a large-scale semantic clustering system specifically designed to tackle this problem head-on. Our pipeline chews through 15 million lexical items, evaluates a massive 520 million potential relationships, and ultimately generates 2.9 million high-precision semantic clusters. The system makes three primary contributions. First, we introduce a labeled dataset of 843,000 concept pairs spanning synonymy, antonymy, and co-hyponymy, constructed via Gemini 2.5-Flash LLM augmentation and verified using human-curated dictionary resources. Second, we propose a specialized three-way semantic relation discriminator that achieves 90% macro-F1, enabling robust disambiguation beyond raw embedding similarity. Third, we introduce a novel soft-to-hard clustering algorithm that mitigates semantic drift, preventing erroneous transitive chains (e.g., hot -> spicy -> pain -> depression) while simultaneously resolving polysemy. Our approach employs a topology-aware two-stage expansion-pruning procedure with topological voting, ensuring that each term is assigned to exactly one semantically coherent cluster. The resulting resource enables high-precision semantic search and retrieval-augmented generation, particularly for morphologically rich and low-resource languages where existing synonym databases remain sparse.
摘要：神经嵌入有一个臭名昭著的盲点：它们无法可靠地区分同义词和反义词。因此，增加相似性阈值通常无法阻止对立物被分组在一起。我们构建了一个大规模语义聚类系统，专门用于解决这个问题。我们的管道仔细研究了 1500 万个词汇项目，评估了 5.2 亿个巨大的潜在关系，并最终生成 290 万个高精度语义集群。该系统做出了三个主要贡献。首先，我们引入了一个包含 843,000 个概念对的标记数据集，涵盖同义词、反义词和同义词，这些概念对是通过 Gemini 2.5-Flash LLM 增强构建的，并使用人工管理的词典资源进行验证。其次，我们提出了一种专门的三向语义关系鉴别器，可以实现 90% 的宏 F1，从而实现超越原始嵌入相似性的稳健消歧。第三，我们引入了一种新颖的软到硬聚类算法，该算法可以减轻语义漂移，防止错误的传递链（例如，热 -> 辣 -> 疼痛 -> 抑郁），同时解决一词多义。我们的方法采用拓扑感知的两阶段扩展修剪过程和拓扑投票，确保每个术语都准确地分配给一个语义一致的集群。由此产生的资源能够实现高精度语义搜索和检索增强生成，特别是对于现有同义词数据库仍然稀疏的形态丰富且资源匮乏的语言。

Title: Stop Taking Tokenizers for Granted: They Are Core Design Decisions in Large Language Models

Authors: Sawsan Alqahtani, Mir Tafseer Nayeem, Md Tahmid Rahman Laskar, Tasnim Mohiuddin, M Saiful Bari
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.13260
Pdf URL: https://arxiv.org/pdf/2601.13260
Copy Paste: [[2601.13260]] Stop Taking Tokenizers for Granted: They Are Core Design Decisions in Large Language Models(https://arxiv.org/abs/2601.13260)
Keywords: language model
Abstract: Tokenization underlies every large language model, yet it remains an under-theorized and inconsistently designed component. Common subword approaches such as Byte Pair Encoding (BPE) offer scalability but often misalign with linguistic structure, amplify bias, and waste capacity across languages and domains. This paper reframes tokenization as a core modeling decision rather than a preprocessing step. We argue for a context-aware framework that integrates tokenizer and model co-design, guided by linguistic, domain, and deployment considerations. Standardized evaluation and transparent reporting are essential to make tokenization choices accountable and comparable. Treating tokenization as a core design problem, not a technical afterthought, can yield language technologies that are fairer, more efficient, and more adaptable.
摘要：标记化是每个大型语言模型的基础，但它仍然是一个理论不足且设计不一致的组件。字节对编码 (BPE) 等常见子字方法提供了可扩展性，但常常与语言结构不一致、放大偏差并浪费跨语言和域的容量。本文将标记化重新定义为核心建模决策，而不是预处理步骤。我们主张建立一个上下文感知框架，该框架集成了分词器和模型协同设计，并以语言、领域和部署考虑为指导。标准化评估和透明报告对于使代币化选择具有可问责性和可比性至关重要。将标记化视为核心设计问题，而不是技术事后的想法，可以产生更公平、更高效、适应性更强的语言技术。

Title: Unlearning in LLMs: Methods, Evaluation, and Open Challenges

Authors: Tyler Lizzo, Larry Heck
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.13264
Pdf URL: https://arxiv.org/pdf/2601.13264
Copy Paste: [[2601.13264]] Unlearning in LLMs: Methods, Evaluation, and Open Challenges(https://arxiv.org/abs/2601.13264)
Keywords: language model, llm
Abstract: Large language models (LLMs) have achieved remarkable success across natural language processing tasks, yet their widespread deployment raises pressing concerns around privacy, copyright, security, and bias. Machine unlearning has emerged as a promising paradigm for selectively removing knowledge or data from trained models without full retraining. In this survey, we provide a structured overview of unlearning methods for LLMs, categorizing existing approaches into data-centric, parameter-centric, architecture-centric, hybrid, and other strategies. We also review the evaluation ecosystem, including benchmarks, metrics, and datasets designed to measure forgetting effectiveness, knowledge retention, and robustness. Finally, we outline key challenges and open problems, such as scalable efficiency, formal guarantees, cross-language and multimodal unlearning, and robustness against adversarial relearning. By synthesizing current progress and highlighting open directions, this paper aims to serve as a roadmap for developing reliable and responsible unlearning techniques in large language models.
摘要：大型语言模型 (LLM) 在自然语言处理任务中取得了显着的成功，但其广泛部署引起了人们对隐私、版权、安全和偏见的紧迫担忧。机器忘却已经成为一种有前景的范例，可以选择性地从经过训练的模型中删除知识或数据，而无需完全重新训练。在本次调查中，我们对法学硕士的遗忘方法进行了结构化概述，将现有方法分类为以数据为中心、以参数为中心、以架构为中心、混合和其他策略。我们还审查了评估生态系统，包括旨在衡量遗忘有效性、知识保留和稳健性的基准、指标和数据集。最后，我们概述了关键挑战和未解决的问题，例如可扩展效率、形式保证、跨语言和多模式忘却，以及对抗性再学习的鲁棒性。通过综合当前进展并强调开放方向，本文旨在为大型语言模型中开发可靠且负责任的忘却技术提供路线图。

Title: A BERTology View of LLM Orchestrations: Token- and Layer-Selective Probes for Efficient Single-Pass Classification

Authors: Gonzalo Ariel Meyoyan, Luciano Del Corro
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.13288
Pdf URL: https://arxiv.org/pdf/2601.13288
Copy Paste: [[2601.13288]] A BERTology View of LLM Orchestrations: Token- and Layer-Selective Probes for Efficient Single-Pass Classification(https://arxiv.org/abs/2601.13288)
Keywords: llm
Abstract: Production LLM systems often rely on separate models for safety and other classification-heavy steps, increasing latency, VRAM footprint, and operational complexity. We instead reuse computation already paid for by the serving LLM: we train lightweight probes on its hidden states and predict labels in the same forward pass used for generation. We frame classification as representation selection over the full token-layer hidden-state tensor, rather than committing to a fixed token or fixed layer (e.g., first-token logits or final-layer pooling). To implement this, we introduce a two-stage aggregator that (i) summarizes tokens within each layer and (ii) aggregates across layer summaries to form a single representation for classification. We instantiate this template with direct pooling, a 100K-parameter scoring-attention gate, and a downcast multi-head self-attention (MHA) probe with up to 35M trainable parameters. Across safety and sentiment benchmarks our probes improve over logit-only reuse (e.g., MULI) and are competitive with substantially larger task-specific baselines, while preserving near-serving latency and avoiding the VRAM and latency costs of a separate guard-model pipeline.
摘要：生产 LLM 系统通常依赖于单独的模型来实现安全性和其他繁重的分类步骤，从而增加延迟、VRAM 占用空间和操作复杂性。相反，我们重用服务 LLM 已经支付的计算：我们在其隐藏状态上训练轻量级探针，并在用于生成的同一前向传递中预测标签。我们将分类框架为对完整令牌层隐藏状态张量的表示选择，而不是致力于固定令牌或固定层（例如，第一个令牌逻辑或最终层池化）。为了实现这一点，我们引入了一个两阶段聚合器，它（i）总结每层内的标记，（ii）聚合跨层摘要以形成用于分类的单一表示。我们使用直接池化、100K 参数评分注意力门和具有多达 35M 可训练参数的向下多头自注意力 (MHA) 探针来实例化该模板。在安全和情感基准方面，我们的探针比仅逻辑重用（例如 MULI）有所改进，并且与更大的特定于任务的基线具有竞争力，同时保留接近服务的延迟并避免单独的保护模型管道的 VRAM 和延迟成本。
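
Note: the two-stage aggregation described above can be sketched in a few lines of plain Python. This is a minimal illustration that assumes mean pooling over tokens for stage (i) and a softmax-weighted sum over layers for stage (ii); the paper's actual pooling, gating, and attention probes are not specified in the abstract, and all names here are hypothetical.

```python
import math

def mean_pool_tokens(layer_states):
    """Stage (i): average the token vectors of one layer into a single vector."""
    n = len(layer_states)
    dim = len(layer_states[0])
    return [sum(tok[d] for tok in layer_states) / n for d in range(dim)]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_layers(layer_summaries, layer_logits):
    """Stage (ii): softmax-weighted sum of per-layer summaries."""
    weights = softmax(layer_logits)
    dim = len(layer_summaries[0])
    return [
        sum(w * s[d] for w, s in zip(weights, layer_summaries))
        for d in range(dim)
    ]

def probe_representation(hidden_states, layer_logits):
    """hidden_states[layer][token][dim] -> one vector for a classifier head."""
    summaries = [mean_pool_tokens(layer) for layer in hidden_states]
    return aggregate_layers(summaries, layer_logits)

# Toy tensor: 2 layers, 2 tokens, hidden size 2 (illustrative values only).
hidden = [
    [[1.0, 0.0], [3.0, 2.0]],
    [[0.0, 4.0], [0.0, 6.0]],
]
rep = probe_representation(hidden, layer_logits=[0.0, 0.0])
```

In a trained probe the layer logits would be learned parameters; equal logits reduce stage (ii) to a plain average across layers.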

Title: OI-Bench: An Option Injection Benchmark for Evaluating LLM Susceptibility to Directive Interference

Authors: Yow-Fu Liou, Yu-Chien Tang, Yu-Hsiang Liu, An-Zi Yen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.13300
Pdf URL: https://arxiv.org/pdf/2601.13300
Copy Paste: [[2601.13300]] OI-Bench: An Option Injection Benchmark for Evaluating LLM Susceptibility to Directive Interference(https://arxiv.org/abs/2601.13300)
Keywords: language model, llm, prompt
Abstract: Benchmarking large language models (LLMs) is critical for understanding their capabilities, limitations, and robustness. In addition to interface artifacts, prior studies have shown that LLM decisions can be influenced by directive signals such as social cues, framing, and instructions. In this work, we introduce option injection, a benchmarking approach that augments the multiple-choice question answering (MCQA) interface with an additional option containing a misleading directive, leveraging standardized choice structure and scalable evaluation. We construct OI-Bench, a benchmark of 3,000 questions spanning knowledge, reasoning, and commonsense tasks, with 16 directive types covering social compliance, bonus framing, threat framing, and instructional interference. This setting combines manipulation of the choice interface with directive-based interference, enabling systematic assessment of model susceptibility. We evaluate 12 LLMs to analyze attack success rates and behavioral responses, and further investigate mitigation strategies ranging from inference-time prompting to post-training alignment. Experimental results reveal substantial vulnerabilities and heterogeneous robustness across models. OI-Bench is expected to support more systematic evaluation of LLM robustness to directive interference within choice-based interfaces.
摘要：对大型语言模型 (LLM) 进行基准测试对于了解其功能、局限性和稳健性至关重要。除了界面工件之外，先前的研究表明，法学硕士的决策可能会受到社交线索、框架和指示等指令信号的影响。在这项工作中，我们引入了选项注入，这是一种基准测试方法，它通过包含误导性指令的附加选项来增强多项选择题回答（MCQA）界面，利用标准化的选择结构和可扩展的评估。我们构建了 OI-Bench，这是一个包含 3,000 个问题的基准，涵盖知识、推理和常识任务，有 16 种指令类型，涵盖社会责任、奖金框架、威胁框架和教学干扰。此设置将选择界面的操作与基于指令的干扰相结合，从而能够对模型敏感性进行系统评估。我们评估了 12 个法学硕士，以分析攻击成功率、行为响应，并进一步研究从推理时间提示到训练后对齐的缓解策略。实验结果揭示了模型之间的重大漏洞和异构鲁棒性。 OI-Bench 预计将支持对基于选择的接口内的指令干扰的 LLM 鲁棒性进行更系统的评估。

Title: Paid Voices vs. Public Feeds: Interpretable Cross-Platform Theme Modeling of Climate Discourse

Authors: Samantha Sudhoff, Pranav Perumal, Zhaoqing Wu, Tunazzina Islam
Subjects: cs.CL, cs.AI, cs.CY, cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2601.13317
Pdf URL: https://arxiv.org/pdf/2601.13317
Copy Paste: [[2601.13317]] Paid Voices vs. Public Feeds: Interpretable Cross-Platform Theme Modeling of Climate Discourse(https://arxiv.org/abs/2601.13317)
Keywords: language model, llm
Abstract: Climate discourse online plays a crucial role in shaping public understanding of climate change and influencing political and policy outcomes. However, climate communication unfolds across structurally distinct platforms with fundamentally different incentive structures: paid advertising ecosystems incentivize targeted, strategic persuasion, while public social media platforms host largely organic, user-driven discourse. Existing computational studies typically analyze these environments in isolation, limiting our ability to distinguish institutional messaging from public expression. In this work, we present a comparative analysis of climate discourse across paid advertisements on Meta (previously known as Facebook) and public posts on Bluesky from July 2024 to September 2025. We introduce an interpretable, end-to-end thematic discovery and assignment framework that clusters texts by semantic similarity and leverages large language models (LLMs) to generate concise, human-interpretable theme labels. We evaluate the quality of the induced themes against traditional topic modeling baselines using both human judgments and an LLM-based evaluator, and further validate their semantic coherence through downstream stance prediction and theme-guided retrieval tasks. Applying the resulting themes, we characterize systematic differences between paid climate messaging and public climate discourse and examine how thematic prevalence shifts around major political events. Our findings show that platform-level incentives are reflected in the thematic structure, stance alignment, and temporal responsiveness of climate narratives. While our empirical analysis focuses on climate communication, the proposed framework is designed to support comparative narrative analysis across heterogeneous communication environments.
摘要：在线气候话语在塑造公众对气候变化的理解以及影响政治和政策结果方面发挥着至关重要的作用。然而，气候传播在结构不同的平台上展开，激励结构也截然不同：付费广告生态系统激励有针对性的战略说服力，而公共社交媒体平台则主要承载有机的、用户驱动的话语。现有的计算研究通常孤立地分析这些环境，限制了我们区分机构信息和公共表达的能力。在这项工作中，我们对 2024 年 7 月至 2025 年 9 月 Meta（以前称为 Facebook）上的付费广告和 Bluesky 上的公共帖子的气候话语进行了比较分析。我们引入了一种可解释的端到端主题发现和分配框架，该框架通过语义相似性对文本进行聚类，并利用大型语言模型 (LLM) 生成简洁的、人类可解释的主题标签。我们使用人类判断和基于 LLM 的评估器根据传统主题建模基线评估诱导主题的质量，并通过下游立场预测和主题引导检索任务进一步验证其语义一致性。应用由此产生的主题，我们描述了付费气候信息传递和公共气候话语之间的系统差异，并研究了主题流行度如何围绕重大政治事件发生变化。我们的研究结果表明，平台层面的激励措施反映在气候叙事的主题结构、立场一致性和时间响应性上。虽然我们的实证分析侧重于气候传播，但所提出的框架旨在支持跨异构传播环境的比较叙事分析。

Title: RegCheck: A tool for automating comparisons between study registrations and papers

Authors: Jamie Cummins, Beth Clarke, Ian Hussey, Malte Elson
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.13330
Pdf URL: https://arxiv.org/pdf/2601.13330
Copy Paste: [[2601.13330]] RegCheck: A tool for automating comparisons between study registrations and papers(https://arxiv.org/abs/2601.13330)
Keywords: llm
Abstract: Across the social and medical sciences, researchers recognize that specifying planned research activities (i.e., 'registration') prior to the commencement of research has benefits for both the transparency and rigour of science. Despite this, evidence suggests that study registrations frequently go unexamined, minimizing their effectiveness. In a way this is no surprise: manually checking registrations against papers is labour- and time-intensive, requiring careful reading across formats and expertise across domains. The advent of AI unlocks new possibilities in facilitating this activity. We present RegCheck, a modular LLM-assisted tool designed to help researchers, reviewers, and editors from across scientific disciplines compare study registrations with their corresponding papers. Importantly, RegCheck keeps human expertise and judgement in the loop by (i) ensuring that users are the ones who determine which features should be compared, and (ii) presenting the most relevant text associated with each feature to the user, facilitating (rather than replacing) human discrepancy judgements. RegCheck also generates shareable reports with unique RegCheck IDs, enabling them to be easily shared and verified by other users. RegCheck is designed to be adaptable across scientific domains, as well as registration and publication formats. In this paper we provide an overview of the motivation, workflow, and design principles of RegCheck, and we discuss its potential as an extensible infrastructure for reproducible science with an example use case.
摘要：在社会科学和医学科学领域，研究人员认识到，在研究开始之前指定计划的研究活动（即“注册”）有利于科学的透明度和严谨性。尽管如此，有证据表明研究注册经常未经审查，从而最大限度地降低了其有效性。从某种意义上说，这并不奇怪：手动检查论文注册是劳动密集型和时间密集型的，需要仔细阅读跨格式和跨领域的专业知识。人工智能的出现为促进这一活动带来了新的可能性。我们推出 RegCheck，这是一种模块化的法学硕士辅助工具，旨在帮助来自各个科学学科的研究人员、审稿人和编辑将研究注册与其相应的论文进行比较。重要的是，RegCheck 通过以下方式保持人类的专业知识和判断：(i) 确保用户决定应该比较哪些特征；(ii) 向用户呈现与每个特征相关的最相关文本，从而促进（而不是取代）人类差异判断。 RegCheck 还生成具有唯一 RegCheck ID 的可共享报告，使其他用户能够轻松共享和验证它们。 RegCheck 旨在适应跨科学领域以及注册和出版格式。在本文中，我们概述了 RegCheck 的动机、工作流程和设计原则，并通过示例用例讨论了它作为可重复科学的可扩展基础设施的潜力。

Title: LLM-as-RNN: A Recurrent Language Model for Memory Updates and Sequence Prediction

Authors: Yuxing Lu, J. Ben Tamo, Weichen Zhao, Nan Sun, Yishan Zhong, Wenqi Shi, Jinzhuo Wang, May D. Wang
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2601.13352
Pdf URL: https://arxiv.org/pdf/2601.13352
Copy Paste: [[2601.13352]] LLM-as-RNN: A Recurrent Language Model for Memory Updates and Sequence Prediction(https://arxiv.org/abs/2601.13352)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models are strong sequence predictors, yet standard inference relies on immutable context histories. After making an error at generation step t, the model lacks an updatable memory mechanism that improves predictions for step t+1. We propose LLM-as-RNN, an inference-only framework that turns a frozen LLM into a recurrent predictor by representing its hidden state as natural-language memory. This state, implemented as a structured system-prompt summary, is updated at each timestep via feedback-driven text rewrites, enabling learning without parameter updates. Under a fixed token budget, LLM-as-RNN corrects errors and retains task-relevant patterns, effectively performing online learning through language. We evaluate the method on three sequential benchmarks in healthcare, meteorology, and finance across Llama, Gemma, and GPT model families. LLM-as-RNN significantly outperforms zero-shot, full-history, and MemPrompt baselines, improving predictive accuracy by 6.5% on average, while producing interpretable, human-readable learning traces absent in standard context accumulation.
摘要：大型语言模型是强大的序列预测器，但标准推理依赖于不可变的上下文历史。在生成步骤 t 出错后，模型缺乏可更新的内存机制来改进步骤 t+1 的预测。我们提出了 LLM-as-RNN，这是一种仅推理框架，通过将其隐藏状态表示为自然语言记忆，将冻结的 LLM 转变为循环预测器。这种状态作为结构化系统提示摘要实现，通过反馈驱动的文本重写在每个时间步进行更新，从而无需更新参数即可进行学习。在固定的代币预算下，LLM-as-RNN 可以纠正错误并保留任务相关模式，从而通过语言有效地进行在线学习。我们在 Llama、Gemma 和 GPT 模型系列的医疗保健、气象和金融领域的三个连续基准上评估该方法。 LLM-as-RNN 显着优于零样本、全历史和 MemPrompt 基线，平均将预测准确性提高 6.5%，同时产生标准上下文积累中缺少的可解释、人类可读的学习痕迹。

Title: Sockpuppetting: Jailbreaking LLMs Without Optimization Through Output Prefix Injection

Authors: Asen Dotsinski, Panagiotis Eustratiadis
Subjects: cs.CL, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2601.13359
Pdf URL: https://arxiv.org/pdf/2601.13359
Copy Paste: [[2601.13359]] Sockpuppetting: Jailbreaking LLMs Without Optimization Through Output Prefix Injection(https://arxiv.org/abs/2601.13359)
Keywords: language model, llm, prompt
Abstract: As open-weight large language models (LLMs) increase in capabilities, safeguarding them against malicious prompts and understanding possible attack vectors becomes ever more important. While automated jailbreaking methods like GCG [Zou et al., 2023] remain effective, they often require substantial computational resources and specific expertise. We introduce "sockpuppetting", a simple method for jailbreaking open-weight LLMs by inserting an acceptance sequence (e.g., "Sure, here is how to...") at the start of a model's output and allowing it to complete the response. Requiring only a single line of code and no optimization, sockpuppetting achieves up to 80% higher attack success rate (ASR) than GCG on Qwen3-8B in per-prompt comparisons. We also explore a hybrid approach that optimizes the adversarial suffix within the assistant message block rather than the user prompt, increasing ASR by 64% over GCG on Llama-3.1-8B in a prompt-agnostic setting. The results establish sockpuppetting as an effective low-cost attack accessible to unsophisticated adversaries, highlighting the need for defences against output-prefix injection in open-weight models.
摘要：随着开放式大型语言模型 (LLM) 功能的增强，保护它们免受恶意提示并了解可能的攻击向量变得越来越重要。虽然像 GCG [Zou et al., 2023] 这样的自动越狱方法仍然有效，但它们通常需要大量的计算资源和特定的专业知识。我们引入了“sockpuppetting”，这是一种通过在模型输出开始处插入接受序列（例如，“当然，这里是如何...”）并允许其完成响应来越狱开放权重 LLM 的简单方法。只需一行代码且无需优化，在每次提示比较中，sockpuppetting 在 Qwen3-8B 上的攻击成功率 (ASR) 比 GCG 高出 80%。我们还探索了一种混合方法，该方法优化辅助消息块中的对抗性后缀，而不是用户提示，在与提示无关的设置中，ASR 比 Llama-3.1-8B 上的 GCG 提高了 64%。结果表明，傀儡是一种有效的低成本攻击，对于不成熟的对手来说是可以访问的，这凸显了在开放权重模型中防御输出前缀注入的必要性。

Title: Recurrent Confidence Chain: Temporal-Aware Uncertainty Quantification in Large Language Models

Authors: Zhenjiang Mao, Anirudhh Venkat
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.13368
Pdf URL: https://arxiv.org/pdf/2601.13368
Copy Paste: [[2601.13368]] Recurrent Confidence Chain: Temporal-Aware Uncertainty Quantification in Large Language Models(https://arxiv.org/abs/2601.13368)
Keywords: language model, hallucination, chain-of-thought
Abstract: As reasoning modules, such as the chain-of-thought mechanism, are applied to large language models, they achieve strong performance on various tasks such as answering common-sense questions and solving math problems. The main challenge now is to assess the uncertainty of answers, which can help prevent misleading or serious hallucinations for users. Although current methods analyze long reasoning sequences by filtering unrelated tokens and examining potential connections between nearby tokens or sentences, the temporal spread of confidence is often overlooked. This oversight can lead to inflated overall confidence, even when earlier steps exhibit very low confidence. To address this issue, we propose a novel method that incorporates inter-step attention to analyze semantic correlations across steps. For handling long-horizon responses, we introduce a hidden confidence mechanism to retain historical confidence information, which is then combined with stepwise confidence to produce a more accurate overall estimate. We evaluate our method on the GAOKAO math benchmark and the CLadder causal reasoning dataset using mainstream open-source large language models. Our approach is shown to outperform state-of-the-art methods by achieving a superior balance between predictive quality and calibration, demonstrated by strong performance on both Negative Log-Likelihood and Expected Calibration Error.
摘要：随着思维链机制等推理模块应用于大型语言模型，它们在回答常识问题、解决数学问题等各种任务上都取得了很强的性能。现在的主要挑战是评估答案的不确定性，这有助于防止用户产生误导或严重的幻觉。尽管当前的方法通过过滤不相关的标记并检查附近标记或句子之间的潜在连接来分析长推理序列，但置信度的时间传播常常被忽视。这种疏忽可能会导致整体信心膨胀，即使早期步骤表现出非常低的信心。为了解决这个问题，我们提出了一种新颖的方法，该方法结合了步骤间注意来分析跨步骤的语义相关性。为了处理长范围响应，我们引入了一种隐藏置信机制来保留历史置信信息，然后将其与逐步置信相结合以产生更准确的总体估计。我们使用主流开源大语言模型在 GAOKAO 数学基准和 CLadder 因果推理数据集上评估我们的方法。我们的方法通过在预测质量和校准之间实现卓越的平衡而优于最先进的方法，这通过负对数似然和预期校准误差的强劲表现得到证明。
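
The abstract above reports results on Negative Log-Likelihood and Expected Calibration Error. As a minimal sketch of those two metrics for binary correctness labels: this is the standard binned ECE and binary NLL, not the authors' evaluation code. The function names and the 10-bin default are assumptions made for illustration.

```python
import math

def negative_log_likelihood(confidences, correct):
    """Mean binary NLL of confidence scores against correctness labels."""
    eps = 1e-12  # clamp to avoid log(0)
    total = 0.0
    for p, y in zip(confidences, correct):
        p = min(max(p, eps), 1.0 - eps)
        total += -math.log(p) if y else -math.log(1.0 - p)
    return total / len(confidences)

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted mean |accuracy - confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(confidences, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(1 for _, y in members if y) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

A lower value is better for both metrics. NLL penalizes confident errors heavily, while ECE only measures the gap between stated confidence and observed accuracy.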

Title: Confidence over Time: Confidence Calibration with Temporal Logic for Large Language Model Reasoning

Authors: Zhenjiang Mao, Anirudhh Venkat, Artem Bisliouk, Akshat Kothiyal, Sindhura Kumbakonam Subramanian, Saithej Singhu, Ivan Ruchkin
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.13387
Pdf URL: https://arxiv.org/pdf/2601.13387
Copy Paste: [[2601.13387]] Confidence over Time: Confidence Calibration with Temporal Logic for Large Language Model Reasoning(https://arxiv.org/abs/2601.13387)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) increasingly rely on long-form, multi-step reasoning to solve complex tasks such as mathematical problem solving and scientific question answering. Despite strong performance, existing confidence estimation methods typically reduce an entire reasoning process to a single scalar score, ignoring how confidence evolves throughout the generation. As a result, these methods are often sensitive to superficial factors such as response length or verbosity, and struggle to distinguish correct reasoning from confidently stated errors. We propose to characterize the stepwise confidence signal using Signal Temporal Logic (STL). Using a discriminative STL mining procedure, we discover temporal formulas that distinguish confidence signals of correct and incorrect responses. Our analysis found that the STL patterns generalize across tasks, and numeric parameters exhibit sensitivity to individual questions. Based on these insights, we develop a confidence estimation approach that informs STL blocks with parameter hypernetworks. Experiments on multiple reasoning tasks show our confidence scores are more calibrated than the baselines.
摘要：大型语言模型 (LLM) 越来越依赖长格式、多步骤推理来解决数学问题解决和科学问答等复杂任务。尽管性能很强，但现有的置信度估计方法通常将整个推理过程简化为单个标量分数，而忽略了置信度在整个生成过程中如何演变。因此，这些方法通常对响应长度或冗长等表面因素敏感，并且难以区分正确的推理和自信陈述的错误。我们建议使用信号时间逻辑（STL）来表征逐步置信信号。使用判别性 STL 挖掘过程，我们发现了区分正确和错误响应的置信信号的时间公式。我们的分析发现，STL 模式可以跨任务推广，并且数字参数表现出对单个问题的敏感性。基于这些见解，我们开发了一种置信度估计方法，通过参数超网络通知 STL 块。多项推理任务的实验表明，我们的置信度分数比基线更加校准。
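
To make the idea of a temporal formula over a stepwise confidence signal concrete, here is a minimal sketch of the standard quantitative (robustness) semantics for two basic Signal Temporal Logic operators on a discrete-time signal. The abstract does not state the formulas the paper mines, so the operators chosen, the function names, and the 0.5 threshold in the usage note are hypothetical.

```python
def always_at_least(signal, threshold):
    """Robustness of G(x >= threshold): worst-case margin over all steps.

    A positive value means every step stays above the threshold.
    """
    return min(x - threshold for x in signal)

def eventually_at_least(signal, threshold):
    """Robustness of F(x >= threshold): best-case margin over all steps.

    A positive value means at least one step exceeds the threshold.
    """
    return max(x - threshold for x in signal)
```

For a stepwise confidence trace such as `[0.9, 0.4, 0.8]` and a threshold of 0.5, `always_at_least` is negative because the second step dips below the threshold, while `eventually_at_least` is positive. A single averaged score would hide that dip.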

Title: Structured Insight from Unstructured Data: Large Language Models for SDOH-Driven Diabetes Risk Prediction

Authors: Sasha Ronaghi, Prerit Choudhary, David H Rehkopf, Bryant Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.13388
Pdf URL: https://arxiv.org/pdf/2601.13388
Copy Paste: [[2601.13388]] Structured Insight from Unstructured Data: Large Language Models for SDOH-Driven Diabetes Risk Prediction(https://arxiv.org/abs/2601.13388)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Social determinants of health (SDOH) play a critical role in Type 2 Diabetes (T2D) management but are often absent from electronic health records and risk prediction models. Most individual-level SDOH data is collected through structured screening tools, which lack the flexibility to capture the complexity of patient experiences and unique needs of a clinic's population. This study explores the use of large language models (LLMs) to extract structured SDOH information from unstructured patient life stories and evaluate the predictive value of both the extracted features and the narratives themselves for assessing diabetes control. We collected unstructured interviews from 65 T2D patients aged 65 and older, focused on their lived experiences, social context, and diabetes management. These narratives were analyzed using LLMs with retrieval-augmented generation to produce concise, actionable qualitative summaries for clinical interpretation and structured quantitative SDOH ratings for risk prediction modeling. The structured SDOH ratings were used independently and in combination with traditional laboratory biomarkers as inputs to linear and tree-based machine learning models (Ridge, Lasso, Random Forest, and XGBoost) to demonstrate how unstructured narrative data can be applied in conventional risk prediction workflows. Finally, we evaluated several LLMs on their ability to predict a patient's level of diabetes control (low, medium, high) directly from interview text with A1C values redacted. LLMs achieved 60% accuracy in predicting diabetes control levels from interview text. This work demonstrates how LLMs can translate unstructured SDOH-related data into structured insights, offering a scalable approach to augment clinical risk models and decision-making.
摘要：健康的社会决定因素 (SDOH) 在 2 型糖尿病 (T2D) 管理中发挥着关键作用，但电子健康记录和风险预测模型中往往缺失。大多数个人层面的 SDOH 数据是通过结构化筛选工具收集的，这些工具缺乏灵活性来捕获患者体验的复杂性和诊所人群的独特需求。本研究探索使用大语言模型 (LLM) 从非结构化患者生活故事中提取结构化 SDOH 信息，并评估提取的特征和叙述本身对于评估糖尿病控制的预测价值。我们收集了 65 名 65 岁及以上的 T2D 患者的非结构化访谈，重点关注他们的生活经历、社会背景和糖尿病管理。使用具有检索增强生成功能的法学硕士对这些叙述进行分析，以生成用于临床解释的简洁、可操作的定性摘要，以及用于风险预测模型的结构化定量 SDOH 评级。结构化 SDOH 评级独立使用，并与传统实验室生物标志物结合使用，作为线性和基于树的机器学习模型（Ridge、Lasso、随机森林和 XGBoost）的输入，以演示如何将非结构化叙述数据应用于传统风险预测工作流程。最后，我们评估了几位法学硕士直接从经过编辑的 A1C 值的访谈文本预测患者糖尿病控制水平（低、中、高）的能力。法学硕士根据访谈文本预测糖尿病控制水平的准确率达到 60%。这项工作展示了法学硕士如何将非结构化 SDOH 相关数据转化为结构化见解，提供可扩展的方法来增强临床风险模型和决策。

Title: Beyond Memorization: Testing LLM Reasoning on Unseen Theory of Computation Tasks

Authors: Shlok Shelat, Jay Raval, Souvik Roy, Manas Gaur
Subjects: cs.CL, cs.AI, cs.FL
Abstract URL: https://arxiv.org/abs/2601.13392
Pdf URL: https://arxiv.org/pdf/2601.13392
Copy Paste: [[2601.13392]] Beyond Memorization: Testing LLM Reasoning on Unseen Theory of Computation Tasks(https://arxiv.org/abs/2601.13392)
Keywords: language model, llm, prompt, chain-of-thought, tree-of-thought
Abstract: Large language models (LLMs) have demonstrated strong performance on formal language tasks, yet whether this reflects genuine symbolic reasoning or pattern matching on familiar constructions remains unclear. We introduce a benchmark for deterministic finite automata (DFA) construction from regular languages, comprising factual knowledge questions, seen construction problems from public sources, and two types of unseen problems: hand-crafted instances with multiple interacting constraints and systematically generated problems via Arden's theorem. Models achieve perfect accuracy on factual questions and 84-90% on seen tasks. However, accuracy drops sharply on unseen problems (by 30-64%), with failures stemming from systematic misinterpretation of language constraints, incorrect handling of Kleene-star semantics, and a failure to preserve global consistency. We evaluate a three-stage hint protocol that enables correction of shallow errors but does not reliably resolve globally inconsistent or structurally flawed automata. Our analysis across multiple prompting strategies (direct, Chain-of-Thought, Tree-of-Thought) reveals that errors persist regardless of prompting approach, exposing a fundamental gap between LLMs' ability to generate syntactically plausible DFAs and their capacity for semantically correct formal reasoning.
摘要：大型语言模型（LLM）在形式语言任务上表现出了强大的性能，但这是否反映了真正的符号推理或熟悉结构的模式匹配仍不清楚。我们引入了从常规语言构建确定性有限自动机（DFA）的基准，包括事实知识问题、来自公共来源的已知构建问题以及两类未见过的问题：具有多个交互约束的手工制作的实例和通过雅顿定理系统生成的问题。模型在事实问题上达到完美的准确率，在已知任务上达到 84-90% 的准确率。然而，由于对语言约束的系统性误解、对 Kleene-star 语义的错误处理以及未能保持全局一致性，导致未见问题的准确度急剧下降（下降 30-64%）。我们评估了一个三阶段提示协议，该协议能够纠正浅层错误，但不能可靠地解决全局不一致或结构上有缺陷的自动机。我们对多种提示策略（直接、思维链、思维树）的分析表明，无论采用何种提示方法，错误都会持续存在，暴露出法学硕士生成句法上合理的 DFA 的能力与语义上正确的形式推理的能力之间的根本差距。
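
Checking whether a constructed DFA is semantically correct, rather than merely plausible-looking, can be sketched as an exhaustive comparison against a reference description of the language on all short strings. The example language (binary strings with an even number of 1s), the state names, and the helper names are hypothetical; this is not the benchmark's grading procedure.

```python
import re
from itertools import product

# Hypothetical example language: binary strings with an even number of 1s.
TRANSITIONS = {
    ("even", "0"): "even",
    ("even", "1"): "odd",
    ("odd", "0"): "odd",
    ("odd", "1"): "even",
}
START = "even"
ACCEPTING = {"even"}

# Reference regular expression for the same language.
REFERENCE = re.compile(r"(0*10*1)*0*")

def accepts(word):
    """Run the DFA on `word` and report whether it ends in an accepting state."""
    state = START
    for symbol in word:
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPTING

def disagreements(max_len):
    """List every string up to `max_len` on which the DFA and the regex differ."""
    bad = []
    for n in range(max_len + 1):
        for chars in product("01", repeat=n):
            word = "".join(chars)
            if accepts(word) != bool(REFERENCE.fullmatch(word)):
                bad.append(word)
    return bad
```

An empty result from `disagreements` up to some length is evidence of agreement, not a proof of equivalence. A full check would compare minimized automata.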

Title: Trust Me, I'm an Expert: Decoding and Steering Authority Bias in Large Language Models

Authors: Priyanka Mary Mammen, Emil Joswin, Shankar Venkitachalam
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.13433
Pdf URL: https://arxiv.org/pdf/2601.13433
Copy Paste: [[2601.13433]] Trust Me, I'm an Expert: Decoding and Steering Authority Bias in Large Language Models(https://arxiv.org/abs/2601.13433)
Keywords: language model
Abstract: Prior research demonstrates that performance of language models on reasoning tasks can be influenced by suggestions, hints and endorsements. However, the influence of endorsement source credibility remains underexplored. We investigate whether language models exhibit systematic bias based on the perceived expertise of the provider of the endorsement. Across 4 datasets spanning mathematical, legal, and medical reasoning, we evaluate 11 models using personas representing four expertise levels per domain. Our results reveal that models are increasingly susceptible to incorrect/misleading endorsements as source expertise increases, with higher-authority sources inducing not only accuracy degradation but also increased confidence in wrong answers. We also show that this authority bias is mechanistically encoded within the model and a model can be steered away from the bias, thereby improving its performance even when an expert gives a misleading endorsement.
摘要：先前的研究表明，语言模型在推理任务上的表现可能会受到建议、提示和认可的影响。然而，背书来源可信度的影响仍未得到充分探索。我们根据认可提供者的专业知识来调查语言模型是否表现出系统偏差。在涵盖数学、法律和医学推理的 4 个数据集中，我们使用代表每个领域四个专业水平的人物角色评估了 11 个模型。我们的结果表明，随着来源专业知识的增加，模型越来越容易受到错误/误导性认可的影响，较高权威的来源不仅会导致准确性下降，还会增加对错误答案的信心。我们还表明，这种权威偏见是机械地编码在模型中的，并且模型可以远离偏见，从而提高其性能，即使专家给出了误导性的认可。

Title: MOSLD-Bench: Multilingual Open-Set Learning and Discovery Benchmark for Text Categorization

Authors: Adriana-Valentina Costache, Daria-Nicoleta Dragomir, Silviu-Florin Gheorghe, Eduard Poesina, Paul Irofti, Radu Tudor Ionescu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.13437
Pdf URL: https://arxiv.org/pdf/2601.13437
Copy Paste: [[2601.13437]] MOSLD-Bench: Multilingual Open-Set Learning and Discovery Benchmark for Text Categorization(https://arxiv.org/abs/2601.13437)
Keywords: language model
Abstract: Open-set learning and discovery (OSLD) is a challenging machine learning task in which samples from new (unknown) classes can appear at test time. It can be seen as a generalization of zero-shot learning, where the new classes are not known a priori, hence involving the active discovery of new classes. While zero-shot learning has been extensively studied in text classification, especially with the emergence of pre-trained language models, open-set learning and discovery is a comparatively new setup for the text domain. To this end, we introduce the first multilingual open-set learning and discovery (MOSLD) benchmark for text categorization by topic, comprising 960K data samples across 12 languages. To construct the benchmark, we (i) rearrange existing datasets and (ii) collect new data samples from the news domain. Moreover, we propose a novel framework for the OSLD task, which integrates multiple stages to continuously discover and learn new classes. We evaluate several language models, including our own, to obtain results that can be used as reference for future work. We release our benchmark at this https URL.
摘要：开放集学习和发现 (OSLD) 是一项具有挑战性的机器学习任务，其中来自新（未知）类的样本可能会在测试时出现。它可以被视为零样本学习的推广，其中新类别是事先未知的，因此涉及新类别的主动发现。虽然零样本学习在文本分类中得到了广泛的研究，特别是随着预训练语言模型的出现，但开放集学习和发现对于文本领域来说是一种相对较新的设置。为此，我们引入了第一个按主题进行文本分类的多语言开放集学习和发现 (MOSLD) 基准，包含 12 种语言的 96 万个数据样本。为了构建基准，我们（i）重新排列现有数据集并（ii）从新闻领域收集新的数据样本。此外，我们为 OSLD 任务提出了一个新颖的框架，它集成了多个阶段来不断发现和学习新类。我们评估了多种语言模型，包括我们自己的语言模型，以获得可作为未来工作参考的结果。我们在此 https URL 发布了基准测试。

Title: PhysicsSolutionAgent: Towards Multimodal Explanations for Numerical Physics Problem Solving

Authors: Aditya Thole, Anmol Agrawal, Arnav Ramamoorthy, Dhruv Kumar
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2601.13453
Pdf URL: https://arxiv.org/pdf/2601.13453
Copy Paste: [[2601.13453]] PhysicsSolutionAgent: Towards Multimodal Explanations for Numerical Physics Problem Solving(https://arxiv.org/abs/2601.13453)
Keywords: language model, gpt, llm, agent
Abstract: Explaining numerical physics problems often requires more than text-based solutions; clear visual reasoning can substantially improve conceptual understanding. While large language models (LLMs) demonstrate strong performance on many physics questions in textual form, their ability to generate long, high-quality visual explanations remains insufficiently explored. In this work, we introduce PhysicsSolutionAgent (PSA), an autonomous agent that generates physics-problem explanation videos of up to six minutes using Manim animations. To evaluate the generated videos, we design an assessment pipeline that performs automated checks across 15 quantitative parameters and incorporates feedback from a vision-language model (VLM) to iteratively improve video quality. We evaluate PSA on 32 videos spanning numerical and theoretical physics problems. Our results reveal systematic differences in video quality depending on problem difficulty and whether the task is numerical or theoretical. Using GPT-5-mini, PSA achieves a 100% video-completion rate with an average automated score of 3.8/5. However, qualitative analysis and human inspection uncover both minor and major issues, including visual layout inconsistencies and errors in how visual content is interpreted during feedback. These findings expose key limitations in reliable Manim code generation and highlight broader challenges in multimodal reasoning and evaluation for visual explanations of numerical physics problems. Our work underscores the need for improved visual understanding, verification, and evaluation frameworks in future multimodal educational systems.
摘要：解释数值物理问题通常需要的不仅仅是基于文本的解决方案；清晰的视觉推理可以大大提高概念理解。虽然大型语言模型 (LLM) 在许多文本形式的物理问题上表现出强大的性能，但它们生成长篇、高质量视觉解释的能力仍未得到充分探索。在这项工作中，我们引入了PhysicsSolutionAgent (PSA)，这是一种自主代理，可以使用 Manim 动画生成长达六分钟的物理问题解释视频。为了评估生成的视频，我们设计了一个评估管道，该管道对 15 个定量参数执行自动检查，并结合视觉语言模型 (VLM) 的反馈来迭代提高视频质量。我们通过 32 个涵盖数值和理论物理问题的视频来评估 PSA。我们的结果揭示了视频质量的系统差异，具体取决于问题难度以及任务是数值任务还是理论任务。使用 GPT-5-mini，PSA 实现了 100% 的视频完成率，平均自动评分为 3.8/5。然而，定性分析和人工检查揭示了一些次要和主要问题，包括视觉布局不一致以及反馈期间如何解释视觉内容的错误。这些发现暴露了可靠的 Manim 代码生成的关键局限性，并强调了数字物理问题的视觉解释的多模态推理和评估中更广泛的挑战。我们的工作强调了未来多模式教育系统中改进视觉理解、验证和评估框架的必要性

Title: Anonpsy: A Graph-Based Framework for Structure-Preserving De-identification of Psychiatric Narratives

Authors: Kyung Ho Lim, Byung-Hoon Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.13503
Pdf URL: https://arxiv.org/pdf/2601.13503
Copy Paste: [[2601.13503]] Anonpsy: A Graph-Based Framework for Structure-Preserving De-identification of Psychiatric Narratives(https://arxiv.org/abs/2601.13503)
Keywords: gpt, llm
Abstract: Psychiatric narratives encode patient identity not only through explicit identifiers but also through idiosyncratic life events embedded in their clinical structure. Existing de-identification approaches, including PHI masking and LLM-based synthetic rewriting, operate at the text level and offer limited control over which semantic elements are preserved or altered. We introduce Anonpsy, a de-identification framework that reformulates the task as graph-guided semantic rewriting. Anonpsy (1) converts each narrative into a semantic graph encoding clinical entities, temporal anchors, and typed relations; (2) applies graph-constrained perturbations that modify identifying context while preserving clinically essential structure; and (3) regenerates text via graph-conditioned LLM generation. Evaluated on 90 clinician-authored psychiatric case narratives, Anonpsy preserves diagnostic fidelity while achieving consistently low re-identification risk under expert, semantic, and GPT-5-based evaluations. Compared with a strong LLM-only rewriting baseline, Anonpsy yields substantially lower semantic similarity and identifiability. These results demonstrate that explicit structural representations combined with constrained generation provide an effective approach to de-identification for psychiatric narratives.
摘要：精神病学叙述不仅通过明确的标识符，而且还通过嵌入其临床结构中的特殊生活事件来编码患者身份。现有的去识别化方法，包括 PHI 屏蔽和基于 LLM 的合成重写，在文本级别运行，并且对保留或更改哪些语义元素提供有限的控制。我们引入了 Anonpsy，一个去识别框架，它将任务重新表述为图形引导的语义重写。 Anonpsy (1) 将每个叙述转换为编码临床实体、时间锚点和类型化关系的语义图； (2) 应用图约束扰动来修改识别上下文，同时保留临床基本结构； (3) 通过图条件 LLM 生成重新生成文本。 Anonpsy 对 90 个临床医生撰写的精神病学病例叙述进行了评估，在保持诊断保真度的同时，在专家、语义和基于 GPT-5 的评估下始终实现较低的重新识别风险。与强大的仅法学硕士重写基线相比，Anonpsy 产生的语义相似性和可识别性要低得多。这些结果表明，明确的结构表征与受限生成相结合，为精神病学叙事的去身份化提供了一种有效的方法。

Title: When Wording Steers the Evaluation: Framing Bias in LLM judges

Authors: Yerin Hwang, Dongryeol Lee, Taegwan Kang, Minwoo Lee, Kyomin Jung
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.13537
Pdf URL: https://arxiv.org/pdf/2601.13537
Copy Paste: [[2601.13537]] When Wording Steers the Evaluation: Framing Bias in LLM judges(https://arxiv.org/abs/2601.13537)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are known to produce varying responses depending on prompt phrasing, indicating that subtle guidance in phrasing can steer their answers. However, the impact of this framing bias on LLM-based evaluation, where models are expected to make stable and impartial judgments, remains largely underexplored. Drawing inspiration from the framing effect in psychology, we systematically investigate how deliberate prompt framing skews model judgments across four high-stakes evaluation tasks. We design symmetric prompts using predicate-positive and predicate-negative constructions and demonstrate that such framing induces significant discrepancies in model outputs. Across 14 LLM judges, we observe clear susceptibility to framing, with model families showing distinct tendencies toward agreement or rejection. These findings suggest that framing bias is a structural property of current LLM-based evaluation systems, underscoring the need for framing-aware protocols.
摘要：众所周知，大型语言模型 (LLM) 会根据提示措辞产生不同的响应，这表明措辞中的微妙指导可以引导他们的答案。然而，这种框架偏差对基于法学硕士的评估的影响仍然很大程度上尚未得到充分研究，在法学硕士评估中，模型预计会做出稳定和公正的判断。受心理学中的框架效应的启发，我们系统地研究了故意的即时框架如何在四个高风险评估任务中扭曲模型判断。我们使用谓词肯定和谓词否定结构设计对称提示，并证明这种框架会导致模型输出出现显着差异。在 14 名法学硕士法官中，我们观察到对框架的明显敏感性，模范家庭表现出明显的同意或拒绝倾向。这些发现表明，框架偏差是当前基于法学硕士的评估系统的结构属性，强调了框架感知协议的必要性。

Title: Comparing Without Saying: A Dataset and Benchmark for Implicit Comparative Opinion Mining from Same-User Reviews

Authors: Thanh-Lam T. Nguyen, Ngoc-Quang Le, Quoc-Trung Phu, Thi-Phuong Le, Ngoc-Huyen Pham, Phuong-Nguyen Nguyen, Hoang-Quynh Le
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.13575
Pdf URL: https://arxiv.org/pdf/2601.13575
Copy Paste: [[2601.13575]] Comparing Without Saying: A Dataset and Benchmark for Implicit Comparative Opinion Mining from Same-User Reviews(https://arxiv.org/abs/2601.13575)
Keywords: language model
Abstract: Existing studies on comparative opinion mining have mainly focused on explicit comparative expressions, which are uncommon in real-world reviews. This leaves implicit comparisons, where users express preferences across separate reviews, largely underexplored. We introduce SUDO, a novel dataset for implicit comparative opinion mining from same-user reviews, allowing reliable inference of user preferences even without explicit comparative cues. SUDO comprises 4,150 annotated review pairs (15,191 sentences) with a bi-level structure capturing aspect-level mentions and review-level preferences. We benchmark this task using two families of baselines: traditional machine learning-based and language model-based. Experimental results show that while the latter outperforms the former, overall performance remains moderate, revealing the inherent difficulty of the task and establishing SUDO as a challenging and valuable benchmark for future research.
摘要：现有的比较意见挖掘研究主要集中在明确的比较表达上，这在现实世界的评论中并不常见。这就留下了隐含的比较——用户在不同的评论中表达偏好——很大程度上没有得到充分的探索。我们引入了 SUDO，这是一种新颖的数据集，用于从同一用户评论中挖掘隐式比较意见，即使没有明确的比较线索，也可以可靠地推断用户偏好。 SUDO 包含 4,150 个带注释的评论对（15,191 个句子），具有捕获方面级别提及和评论级别偏好的双层结构。我们使用两种基线架构对该任务进行基准测试：传统机器学习和基于语言模型的基线。实验结果表明，虽然后者优于前者，但整体表现仍然中等，揭示了该任务固有的难度，并将 SUDO 确立为未来研究的具有挑战性和有价值的基准。

Title: TREX: Tokenizer Regression for Optimal Data Mixture

Authors: Inho Won, Hangyeol Yoo, Minkyung Cho, Jungyeul Park, Hoyun Song, KyungTae Lim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.13588
Pdf URL: https://arxiv.org/pdf/2601.13588
Copy Paste: [[2601.13588]] TREX: Tokenizer Regression for Optimal Data Mixture(https://arxiv.org/abs/2601.13588)
Keywords: language model, llm
Abstract: Building effective tokenizers for multilingual Large Language Models (LLMs) requires careful control over language-specific data mixtures. While a tokenizer's compression performance critically affects the efficiency of LLM training and inference, existing approaches rely on heuristics or costly large-scale searches to determine optimal language ratios. We introduce Tokenizer Regression for Optimal Data MiXture (TREX), a regression-based framework that efficiently predicts the optimal data mixture for tokenizer training. TREX trains small-scale proxy tokenizers on random mixtures, gathers their compression statistics, and learns to predict compression performance from data mixtures. This learned model enables scalable mixture search before large-scale tokenizer training, mitigating the accuracy-cost trade-off in multilingual tokenizer design. Tokenizers trained with TREX's predicted mixtures outperform mixtures based on LLaMA3 and uniform distributions by up to 12% in both in- and out-of-distribution compression efficiency, demonstrating strong scalability, robustness, and practical effectiveness.
摘要：为多语言大型语言模型 (LLM) 构建有效的分词器需要仔细控制特定于语言的数据混合。虽然分词器的压缩性能严重影响 LLM 训练和推理的效率，但现有方法依赖启发式或昂贵的大规模搜索来确定最佳语言比率。我们引入了最佳数据混合的分词器回归（TREX），这是一种基于回归的框架，可以有效地预测分词器训练的最佳数据混合。 TREX 在随机混合物上训练小规模代理分词器，收集其压缩统计数据，并学习从数据混合物中预测压缩性能。这种学习模型可以在大规模分词器训练之前实现可扩展的混合搜索，从而减轻多语言分词器设计中的准确性与成本权衡。使用 TReX 的预测混合物训练的分词器在分布内和分布外压缩效率方面均优于基于 LLaMA3 和均匀分布的混合物高达 12%，展示了强大的可扩展性、鲁棒性和实际有效性。

Title: Vulnerability of LLMs' Belief Systems? LLMs Belief Resistance Check Through Strategic Persuasive Conversation Interventions

Authors: Fan Huang, Haewoon Kwak, Jisun An
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.13590
Pdf URL: https://arxiv.org/pdf/2601.13590
Copy Paste: [[2601.13590]] Vulnerability of LLMs' Belief Systems? LLMs Belief Resistance Check Through Strategic Persuasive Conversation Interventions(https://arxiv.org/abs/2601.13590)
Keywords: language model, gpt, llm, prompt
Abstract: Large Language Models (LLMs) are increasingly employed in various question-answering tasks. However, recent studies showcase that LLMs are susceptible to persuasion and could adopt counterfactual beliefs. We present a systematic evaluation of LLM susceptibility to persuasion under the Source-Message-Channel-Receiver (SMCR) communication framework. Across five mainstream LLMs and three domains (factual knowledge, medical QA, and social bias), we analyze how different persuasive strategies influence belief stability over multiple interaction turns. We further examine whether meta-cognition prompting (i.e., eliciting self-reported confidence) affects resistance to persuasion. Results show that smaller models exhibit extreme compliance, with over 80% of belief changes occurring at the first persuasive turn (average end turn of 1.1-1.4). Contrary to expectations, meta-cognition prompting increases vulnerability by accelerating belief erosion rather than enhancing robustness. Finally, we evaluate adversarial fine-tuning as a defense. While GPT-4o-mini achieves near-complete robustness (98.6%) and Mistral 7B improves substantially (35.7% to 79.3%), Llama models remain highly susceptible (<14%) even when fine-tuned on their own failure cases. Together, these findings highlight substantial model-dependent limits of current robustness interventions and offer guidance for developing more trustworthy LLMs.
摘要：大型语言模型 (LLM) 越来越多地应用于各种问答任务。然而，最近的研究表明，法学硕士很容易受到说服，并可能采取反事实的信念。我们在源--消息--渠道--接收者（SMCR）通信框架下对法学硕士对说服的敏感性进行了系统评估。在五个主流大语言模型（LLM）和三个领域（事实知识、医学质量保证和社会偏见）中，我们分析了不同的说服策略如何影响多个交互回合中的信念稳定性。我们进一步研究元认知提示（即引发自我报告的信心）是否会影响对说服的抵抗力。结果表明，较小的模型表现出极高的顺应性，超过 80% 的信念变化发生在第一次说服轮次（平均最终轮次为 1.1--1.4）。与预期相反，元认知提示通过加速信念侵蚀而不是增强稳健性来增加脆弱性。最后，我们评估对抗性微调作为防御。虽然 GPT-4o-mini 实现了近乎完全的鲁棒性 (98.6%) 并且 Mistral~7B 显着改善 (35.7% $\rightarrow$ 79.3%)，但 Llama 模型仍然高度敏感 (<14%)，即使在针对自己的故障案例进行微调时也是如此。总之，这些发现凸显了当前稳健性干预措施的重大模型依赖性限制，并为开发更值得信赖的法学硕士提供了指导。

Title: CauScientist: Teaching LLMs to Respect Data for Causal Discovery

Authors: Bo Peng, Sirui Chen, Lei Xu, Chaochao Lu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.13614
Pdf URL: https://arxiv.org/pdf/2601.13614
Copy Paste: [[2601.13614]] CauScientist: Teaching LLMs to Respect Data for Causal Discovery(https://arxiv.org/abs/2601.13614)
Keywords: llm
Abstract: Causal discovery is fundamental to scientific understanding and reliable decision-making. Existing approaches face critical limitations: purely data-driven methods suffer from statistical indistinguishability and modeling assumptions, while recent LLM-based methods either ignore statistical evidence or incorporate unverified priors that can mislead results. To this end, we propose CauScientist, a collaborative framework that synergizes LLMs as hypothesis-generating "data scientists" with probabilistic statistics as rigorous "verifiers". CauScientist employs hybrid initialization to select superior starting graphs, iteratively refines structures through LLM-proposed modifications validated by statistical criteria, and maintains an error memory to guide efficient exploration of the search space. Experiments demonstrate that CauScientist substantially outperforms purely data-driven baselines, achieving up to 53.8% F1 score improvement and enhancing recall from 35.0% to 100.0%. Notably, while standalone LLM performance degrades with graph complexity, CauScientist reduces Structural Hamming Distance (SHD) by 44.0% compared to Qwen3-32B on 37-node graphs. Our project page is at this https URL.
摘要：因果发现是科学理解和可靠决策的基础。现有方法面临严重的局限性：纯粹的数据驱动方法受到统计不可区分性和建模假设的影响，而最近基于法学硕士的方法要么忽略统计证据，要么纳入可能误导结果的未经验证的先验。为此，我们提出了 CauScientist，这是一个协作框架，将法学硕士作为生成假设的“数据科学家”与概率统计作为严格的“验证者”进行协同。 CauScientist 采用混合初始化来选择优越的起始图，通过 LLM 提出的经统计标准验证的修改迭代地细化结构，并维护错误记忆以指导有效的搜索空间。实验表明，CauScientist 的性能大大优于纯数据驱动的基线，F1 分数提高了高达 53.8%，召回率从 35.0% 提高到 100.0%。值得注意的是，虽然独立的 LLM 性能会随着图的复杂性而降低，但与 37 节点图上的 Qwen3-32B 相比，CauScientist 将结构汉明距离 (SHD) 降低了 44.0%。我们的项目页面位于此 https URL。
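
The Structural Hamming Distance (SHD) reported in the abstract above can be illustrated with a short sketch. It follows one common convention, in which a missing edge, an extra edge, and a reversed edge each count as one error; other conventions count a reversal as two. The function name and the edge-list representation are assumptions, and this is not the authors' evaluation code.

```python
def structural_hamming_distance(true_edges, pred_edges):
    """Count node pairs whose edge status differs between two directed graphs.

    Each graph is an iterable of (parent, child) tuples. A missing, extra,
    or reversed edge between a pair of nodes contributes exactly 1.
    """
    true_edges = set(true_edges)
    pred_edges = set(pred_edges)
    # Every unordered node pair that is connected in at least one graph.
    pairs = {frozenset(edge) for edge in true_edges | pred_edges}
    distance = 0
    for pair in pairs:
        u, v = tuple(pair)  # assumes no self-loops, as in a DAG
        candidates = ((u, v), (v, u))
        in_true = {e for e in candidates if e in true_edges}
        in_pred = {e for e in candidates if e in pred_edges}
        if in_true != in_pred:
            distance += 1
    return distance
```

For example, comparing a true graph `A -> B -> C` with a prediction that reverses `A -> B` and adds `A -> C` gives a distance of 2 under this convention.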

Title: Activation-Space Anchored Access Control for Multi-Class Permission Reasoning in Large Language Models

Authors: Zhaopeng Zhang, Pengcheng Sun, Lan Zhang, Chen Tang, Jiewei Lai, Yunhao Wang, Hui Jin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.13630
Pdf URL: https://arxiv.org/pdf/2601.13630
Copy Paste: [[2601.13630]] Activation-Space Anchored Access Control for Multi-Class Permission Reasoning in Large Language Models(https://arxiv.org/abs/2601.13630)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are increasingly deployed over knowledge bases for efficient knowledge retrieval and question answering. However, LLMs can inadvertently answer beyond a user's permission scope, leaking sensitive content, thus making it difficult to deploy knowledge-base QA under fine-grained access control requirements. In this work, we identify a geometric regularity in intermediate activations: for the same query, representations induced by different permission scopes cluster distinctly and are readily separable. Building on this separability, we propose Activation-space Anchored Access Control (AAAC), a training-free framework for multi-class permission control. AAAC constructs an anchor bank, with one permission anchor per class, from a small offline sample set and requires no fine-tuning. At inference time, a multi-anchor steering mechanism redirects each query's activations toward the anchor-defined authorized region associated with the current user, thereby suppressing over-privileged generations by design. Finally, extensive experiments across three LLM families demonstrate that AAAC reduces permission violation rates by up to 86.5% and prompt-based attack success rates by 90.7%, while improving response usability with minor inference overhead compared to baselines.
摘要：大型语言模型 (LLM) 越来越多地部署在知识库上，以实现高效的知识检索和问题解答。然而，LLM 可能会无意中超出用户的权限范围进行回答，从而泄露敏感内容，从而使得在细粒度访问控制要求下部署知识库 QA 变得困难。在这项工作中，我们确定了中间激活中的几何规律：对于同一查询，由不同权限范围引起的表示明显聚集并且易于分离。基于这种可分离性，我们提出了激活空间锚定访问控制（AAAC），这是一种用于多类权限控制的免培训框架。 AAAC 从一个小的离线样本集中构建了一个锚点库，每个类有一个权限锚点，并且不需要微调。在推理时，多锚点引导机制将每个查询的激活重定向到与当前用户关联的锚点定义的授权区域，从而通过设计抑制过度特权的生成。最后，针对三个 LLM 系列的广泛实验表明，AAAC 将权限违规率降低了高达 86.5%，将基于提示的攻击成功率降低了 90.7%，同时与基线相比，以较小的推理开销提高了响应可用性。

Title: Fairness or Fluency? An Investigation into Language Bias of Pairwise LLM-as-a-Judge

Authors: Xiaolin Zhou, Zheng Luo, Yicheng Gao, Qixuan Chen, Xiyang Hu, Yue Zhao, Ruishan Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.13649
Pdf URL: https://arxiv.org/pdf/2601.13649
Copy Paste: [[2601.13649]] Fairness or Fluency? An Investigation into Language Bias of Pairwise LLM-as-a-Judge(https://arxiv.org/abs/2601.13649)
Keywords: language model, llm, prompt
Abstract: Recent advances in Large Language Models (LLMs) have incentivized the development of LLM-as-a-judge, an application of LLMs where they are used as judges to decide the quality of a certain piece of text given a certain context. However, previous studies have demonstrated that LLM-as-a-judge can be biased towards different aspects of the judged texts, which often do not align with human preference. One of the identified biases is language bias, which indicates that the decision of LLM-as-a-judge can differ based on the language of the judged texts. In this paper, we study two types of language bias in pairwise LLM-as-a-judge: (1) performance disparity between languages when the judge is prompted to compare options from the same language, and (2) bias towards options written in major languages when the judge is prompted to compare options of two different languages. We find that for same-language judging, there exist significant performance disparities across language families, with European languages consistently outperforming African languages, and this bias is more pronounced in culturally-related subjects. For inter-language judging, we observe that most models favor English answers, and that this preference is influenced more by answer language than question language. Finally, we investigate whether language bias is in fact caused by low-perplexity bias, a previously identified bias of LLM-as-a-judge, and we find that while perplexity is slightly correlated with language bias, language bias cannot be fully explained by perplexity only.
摘要：大型语言模型 (LLM) 的最新进展刺激了 LLM 作为法官的发展，这是 LLM 的一种应用，其中 LLM 被用作法官，在给定特定上下文的情况下决定特定文本的质量。然而，之前的研究表明，作为法官的法学硕士可能会对所判文本的不同方面产生偏见，而这些方面往往与人类的偏好不一致。已发现的偏见之一是语言偏见，这表明法学硕士作为法官的决定可能会根据所法官文本的语言而有所不同。在本文中，我们研究了成对法学硕士法官中的两种语言偏见：（1）当法官被提示比较同一语言的选项时，语言之间的表现差异；（2）当法官被提示比较两种不同语言的选项时，对主要语言写的选项的偏见。我们发现，对于同语言评审，不同语系之间存在显着的表现差异，欧洲语言始终优于非洲语言，并且这种偏见在文化相关科目中更为明显。对于跨语言判断，我们观察到大多数模型倾向于英语答案，并且这种偏好更多地受到答案语言的影响，而不是问题语言的影响。最后，我们调查了语言偏见是否实际上是由低困惑偏见（之前发现的法学硕士作为法官的偏见）引起的，我们发现虽然困惑与语言偏见略有相关，但语言偏见不能仅用困惑来完全解释。

Title: Beyond Known Facts: Generating Unseen Temporal Knowledge to Address Data Contamination in LLM Evaluation

Authors: Arthur Amalvy, Hen-Hsen Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.13658
Pdf URL: https://arxiv.org/pdf/2601.13658
Copy Paste: [[2601.13658]] Beyond Known Facts: Generating Unseen Temporal Knowledge to Address Data Contamination in LLM Evaluation(https://arxiv.org/abs/2601.13658)
Keywords: language model, llm
Abstract: The automatic extraction of information is important for populating large web knowledge bases such as Wikidata. The temporal version of that task, temporal knowledge graph extraction (TKGE), involves extracting temporally grounded facts from text, represented as semantic quadruples (subject, relation, object, timestamp). Many recent systems take advantage of large language models (LLMs), which are becoming a new cornerstone of the web due to their performance on many tasks across the natural language processing (NLP) field. Despite the importance of TKGE, existing datasets for training and evaluation remain scarce, and contamination of evaluation data is an unaddressed issue, potentially inflating LLMs' perceived performance due to overlaps between training and evaluation sets. To mitigate these challenges, we propose a novel synthetic evaluation dataset constructed from predicted future, previously unseen temporal facts, thereby eliminating contamination and enabling robust and unbiased benchmarking. Our dataset creation involves a two-step approach: (1) Temporal Knowledge Graph Forecasting (TKGF) generates plausible future quadruples, which are subsequently filtered to adhere to the original knowledge base schema; (2) LLMs perform quadruple-to-text generation, creating semantically aligned textual descriptions. We benchmark Extract, Define and Canonicalize (EDC), a state-of-the-art LLM-based extraction framework, demonstrating that LLM performance decreases when evaluated on our dataset compared to a dataset of known facts. We publicly release our dataset consisting of 4.2K future quadruples and corresponding textual descriptions, along with the generation methodology, enabling continuous creation of unlimited future temporal datasets to serve as long-term, contamination-free benchmarks for TKGE.
摘要：信息的自动提取对于填充大型网络知识库（例如维基数据）非常重要。该任务的时间版本，即时间知识图提取（TKGE），涉及从文本中提取基于时间的事实，表示为语义四元组（主题、关系、对象、时间戳）。许多最新的系统都利用了大型语言模型 (LLM)，由于它们在自然语言处理 (NLP) 领域的许多任务上的性能，它们正在成为网络的新基石。尽管 TKGE 很重要，但现有的培训和评估数据集仍然稀缺，评估数据的污染是一个未解决的问题，由于培训和评估集之间的重叠，可能会夸大法学硕士的感知表现。为了缓解这些挑战，我们提出了一种新颖的综合评估数据集，该数据集根据预测的未来、以前未见过的时间事实构建，从而消除污染并实现稳健且公正的基准测试。我们的数据集创建涉及两步方法：（1）时态知识图预测（TKGF）生成合理的未来四元组，随后对其进行过滤以遵循原始知识库模式； (2) 法学硕士执行四元到文本生成，创建语义对齐的文本描述。我们对 Extract, Define and Canonicalize (EDC)（一种最先进的基于 LLM 的提取框架）进行了基准测试，证明与已知事实的数据集相比，在我们的数据集上进行评估时，LLM 的性能会下降。我们公开发布由 4.2K 未来四元组和相应文本描述组成的数据集以及生成方法，从而能够持续创建无限的未来时间数据集，作为 TKGE 的长期、无污染基准。
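
Note: the quadruple representation and the schema-filtering step in the abstract above can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Quadruple` class, the `filter_by_schema` helper, the relation names, and the example facts are hypothetical and are not taken from the paper or its released dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quadruple:
    """A temporally grounded fact: (subject, relation, object, timestamp)."""
    subject: str
    relation: str
    object: str
    timestamp: str  # ISO date string; the format is an assumption


def filter_by_schema(quads, allowed_relations):
    """Keep only quadruples whose relation appears in the knowledge base schema."""
    return [q for q in quads if q.relation in allowed_relations]


# Hypothetical forecasted facts, invented for this example.
candidates = [
    Quadruple("Alice", "member_of", "ExampleOrg", "2027-01-01"),
    Quadruple("Alice", "invented_relation", "Bob", "2027-02-01"),
]
schema = {"member_of", "position_held"}  # hypothetical schema
kept = filter_by_schema(candidates, schema)
```

Here `kept` retains only the first candidate, since `invented_relation` is not part of the assumed schema. The paper's actual filter may check more than relation names (for example, entity types).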

Title: CommunityBench: Benchmarking Community-Level Alignment across Diverse Groups and Tasks

Authors: Jiayu Lin, Zhongyu Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.13669
Pdf URL: https://arxiv.org/pdf/2601.13669
Copy Paste: [[2601.13669]] CommunityBench: Benchmarking Community-Level Alignment across Diverse Groups and Tasks(https://arxiv.org/abs/2601.13669)
Keywords: language model, llm
Abstract: Large language model (LLM) alignment ensures that model behaviors reflect human values. Existing alignment strategies primarily follow two paths: one assumes a universal value set for a unified goal (i.e., one-size-fits-all), while the other treats every individual as unique and customizes models accordingly (i.e., individual-level). However, assuming a monolithic value space marginalizes minority norms, while tailoring individual models is prohibitively expensive. Recognizing that human society is organized into social clusters with high intra-group value alignment, we propose community-level alignment as a "middle ground". Practically, we introduce CommunityBench, the first large-scale benchmark for community-level alignment evaluation, featuring four tasks grounded in Common Identity and Common Bond theory. With CommunityBench, we conduct a comprehensive evaluation of various foundation models, revealing that current LLMs exhibit limited capacity to model community-specific preferences. Furthermore, we investigate the potential of community-level alignment in facilitating individual modeling, providing a promising direction for scalable and pluralistic alignment.
摘要：大语言模型 (LLM) 对齐可确保模型行为反映人类价值。现有的对齐策略主要遵循两种路径：一种为统一目标假设一个通用的价值集（即一刀切），而另一种则将每个个体视为独特的定制模型（即个体级别）。然而，假设一个单一的价值空间会边缘化少数规范，而定制个体模型的成本却高得令人望而却步。认识到人类社会被组织成具有高度群体内价值一致性的社会集群，我们提出将社区层面的一致性作为“中间立场”。实际上，我们引入了 CommunityBench，这是第一个用于社区级别一致性评估的大型基准，具有基于共同身份和共同债券理论的四项任务。通过 CommunityBench，我们对 CommunityBench 上的各种基础模型进行了全面评估，揭示了当前的法学硕士在模拟社区特定偏好方面的能力有限。此外，我们研究了社区层面的协调在促进个体建模方面的潜力，为可扩展和多元化的协调提供了有希望的方向。

Title: HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference

Authors: Zhiyuan Shi, Qibo Qiu, Feng Xue, Zhonglin Jiang, Li Yu, Jian Jiang, Xiaofei He, Wenxiao Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.13684
Pdf URL: https://arxiv.org/pdf/2601.13684
Copy Paste: [[2601.13684]] HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference(https://arxiv.org/abs/2601.13684)
Keywords: llm
Abstract: The linear memory growth of the KV cache poses a significant bottleneck for LLM inference in long-context tasks. Existing static compression methods often fail to preserve globally important information, principally because they overlook the attention drift phenomenon where token significance evolves dynamically. Although recent dynamic retrieval approaches attempt to address this issue, they typically suffer from coarse-grained caching strategies and incur high I/O overhead due to frequent data transfers. To overcome these limitations, we propose HeteroCache, a training-free dynamic compression framework. Our method is built on two key insights: attention heads exhibit diverse temporal heterogeneity, and there is significant spatial redundancy among heads within the same layer. Guided by these insights, HeteroCache categorizes heads based on stability and redundancy. Consequently, we apply a fine-grained weighting strategy that allocates larger cache budgets to heads with rapidly shifting attention to capture context changes, thereby addressing the inefficiency of coarse-grained strategies. Furthermore, we employ a hierarchical storage mechanism in which a subset of representative heads monitors attention shift and triggers an asynchronous, on-demand retrieval of contexts from the CPU, effectively hiding I/O latency. Finally, experiments demonstrate that HeteroCache achieves state-of-the-art performance on multiple long-context benchmarks and accelerates decoding by up to $3\times$ compared to the original model in the 224K context. Our code will be open-source.
摘要：KV 缓存的线性内存增长给长上下文任务中的 LLM 推理带来了重大瓶颈。现有的静态压缩方法通常无法保留全局重要信息，主要是因为它们忽略了令牌重要性动态演变的注意力漂移现象。尽管最近的动态检索方法试图解决这个问题，但它们通常受到粗粒度缓存策略的影响，并且由于频繁的数据传输而产生高 I/O 开销。为了克服这些限制，我们提出了 HeteroCache，一种免训练的动态压缩框架。我们的方法建立在两个关键见解之上：注意力头表现出不同的时间异质性，并且同一层内的头之间存在显着的空间冗余。在这些见解的指导下，HeteroCache 根据稳定性和冗余性对磁头进行分类。因此，我们应用细粒度的加权策略，将更大的缓存预算分配给注意力快速转移的头，以捕获上下文变化，从而解决粗粒度策略的低效率问题。此外，我们采用分层存储机制，其中代表性头的子集监视注意力转移，并触发从 CPU 异步、按需检索上下文，从而有效隐藏 I/O 延迟。最后，实验表明，HeteroCache 在多个长上下文基准测试中实现了最先进的性能，并且与 224K 上下文中的原始模型相比，解码速度提高了高达 3 倍。我们的代码将是开源的。
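
Note: the linear memory growth mentioned in the abstract above follows from the standard KV cache size formula (two tensors, keys and values, per layer and per token). The sketch below works through the arithmetic. The model configuration (32 layers, 8 KV heads, head dimension 128, 2-byte values) is a hypothetical example, and reading "224K" as 224 × 1024 tokens is an assumption; neither comes from the paper.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Size of an uncompressed KV cache in bytes.

    The leading factor 2 accounts for storing both keys and values.
    The result is linear in seq_len.
    """
    return (2 * batch_size * seq_len * num_layers
            * num_kv_heads * head_dim * bytes_per_value)


# Hypothetical configuration, chosen only to make the arithmetic concrete.
CONFIG = dict(num_layers=32, num_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(1, **CONFIG)        # bytes added per token
full = kv_cache_bytes(224 * 1024, **CONFIG)    # assumed "224K" context
full_gib = full / 2**30
```

Under these assumptions each token adds 128 KiB, so a 224K-token context needs 28 GiB for the cache alone, which is the kind of footprint that motivates compression and CPU offloading.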

Title: Dr. Assistant: Enhancing Clinical Diagnostic Inquiry via Structured Diagnostic Reasoning Data and Reinforcement Learning

Authors: Yue Guo, Fanfu Wang, Jianwei Lv, Xincheng Shi, Yuchen Li, Youya Wang, Yunsheng Zeng, Yujing Liu, Yunhao Qiao, Gen Li, Junfeng Wang, Bo Yuan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.13690
Pdf URL: https://arxiv.org/pdf/2601.13690
Copy Paste: [[2601.13690]] Dr. Assistant: Enhancing Clinical Diagnostic Inquiry via Structured Diagnostic Reasoning Data and Reinforcement Learning(https://arxiv.org/abs/2601.13690)
Keywords: language model, llm
Abstract: Clinical Decision Support Systems (CDSSs) provide reasoning and inquiry guidance for physicians, yet they face notable challenges, including high maintenance costs and low generalization capability. Recently, Large Language Models (LLMs) have been widely adopted in healthcare due to their extensive knowledge reserves, retrieval, and communication capabilities. While LLMs show promise and excel at medical benchmarks, their diagnostic reasoning and inquiry skills are constrained. To mitigate this issue, we propose (1) a Clinical Diagnostic Reasoning Data (CDRD) structure to capture abstract clinical reasoning logic, together with a pipeline for its construction, and (2) Dr. Assistant, a clinical diagnostic model equipped with clinical reasoning and inquiry skills. Its training involves a two-stage process: SFT, followed by RL with a tailored reward function. We also introduce a benchmark to evaluate both diagnostic reasoning and inquiry. Our experiments demonstrate that Dr. Assistant outperforms open-source models and achieves performance competitive with closed-source models, providing an effective solution for clinical diagnostic inquiry guidance.
摘要：临床决策支持系统（CDSS）为医生提供推理和询问指导，但它们面临着显着的挑战，包括维护成本高和泛化能力低。近年来，大型语言模型（LLM）因其丰富的知识储备、检索和交流能力而在医疗保健领域得到广泛采用。虽然法学硕士在医学基准方面表现出希望并表现出色，但他们的诊断推理和探究技能受到限制。为了缓解这个问题，我们提出（1）临床诊断推理数据（CDRD）结构来捕获抽象的临床推理逻辑，及其构建管道，以及（2）Dr. Assistant，一种配备临床推理和查询技能的临床诊断模型。其训练涉及两个阶段的过程：SFT，然后是具有定制奖励函数的 RL。我们还引入了一个评估诊断推理和询问的基准。我们的实验表明，Dr. Assistant 的性能优于开源模型，并实现了与闭源模型的竞争性能，为临床诊断询问指导提供了有效的解决方案。

Title: Uncertainty-Aware Gradient Signal-to-Noise Data Selection for Instruction Tuning

Authors: Zhihang Yuan, Chengyu Yue, Long Huang, Litu Ou, Lei Shi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.13697
Pdf URL: https://arxiv.org/pdf/2601.13697
Copy Paste: [[2601.13697]] Uncertainty-Aware Gradient Signal-to-Noise Data Selection for Instruction Tuning(https://arxiv.org/abs/2601.13697)
Keywords: language model, gpt, llm
Abstract: Instruction tuning is a standard paradigm for adapting large language models (LLMs), but modern instruction datasets are large, noisy, and redundant, making full-data fine-tuning costly and often unnecessary. Existing data selection methods either build expensive gradient datastores or assign static scores from a weak proxy, largely ignoring evolving uncertainty, and thus missing a key source of LLM interpretability. We propose GRADFILTERING, an objective-agnostic, uncertainty-aware data selection framework that utilizes a small GPT-2 proxy with a LoRA ensemble and aggregates per-example gradients into a Gradient Signal-to-Noise Ratio (G-SNR) utility. Our method matches or surpasses random subsets and strong baselines in most LLM-as-a-judge evaluations as well as in human assessment. Moreover, GRADFILTERING-selected subsets converge faster than competitive filters under the same compute budget, reflecting the benefit of uncertainty-aware scoring.
摘要：指令调优是适应大型语言模型 (LLM) 的标准范例，但现代指令数据集庞大、嘈杂且冗余，使得全数据微调成本高昂且通常没有必要。现有的数据选择方法要么构建昂贵的梯度数据存储，要么从弱代理分配静态分数，很大程度上忽略了不断变化的不确定性，从而错过了 LLM 可解释性的关键来源。我们提出了 GRADFILTERING，这是一种与目标无关、不确定性感知的数据选择框架，它利用带有 LoRA 集成的小型 GPT-2 代理，并将每个示例的梯度聚合到梯度信噪比 (G-SNR) 实用程序中。我们的方法匹配或超越了大多数法学硕士法官评估以及人类评估中的随机子集和强大基线。此外，在相同的计算预算下，GRADFILTERING 选择的子集比竞争过滤器收敛得更快，这反映了不确定性感知评分的好处。
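
Note: the abstract above does not give the formula for the Gradient Signal-to-Noise Ratio, so the sketch below uses one common definition as a stand-in: the squared norm of the mean gradient across ensemble members, divided by the summed per-coordinate variance. The function name `gradient_snr`, the `eps` stabilizer, and the toy gradients are all assumptions; the paper's actual G-SNR utility may differ.

```python
def gradient_snr(grads, eps=1e-12):
    """Signal-to-noise ratio of one example's gradients across an ensemble.

    grads: list of K gradient vectors (lists of floats), one per ensemble member.
    Signal = squared norm of the mean gradient.
    Noise  = sum over coordinates of the (population) variance across members.
    """
    k = len(grads)
    dim = len(grads[0])
    mean = [sum(g[i] for g in grads) / k for i in range(dim)]
    signal = sum(m * m for m in mean)
    noise = sum(
        sum((g[i] - mean[i]) ** 2 for g in grads) / k
        for i in range(dim)
    )
    return signal / (noise + eps)


# Toy gradients: ensemble members agree on the first example, disagree on the second.
consistent = [[1.0, 0.0], [1.0, 0.1], [1.0, -0.1]]
conflicting = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]

snr_consistent = gradient_snr(consistent)
snr_conflicting = gradient_snr(conflicting)
```

With this definition, an example whose gradients point the same way across ensemble members scores far higher than one whose gradients cancel out, which is the intuition behind uncertainty-aware selection.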

Title: GerAV: Towards New Heights in German Authorship Verification using Fine-Tuned LLMs on a New Benchmark

Authors: Lotta Kiefer, Christoph Leiter, Sotaro Takeshita, Elena Schmidt, Steffen Eger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.13711
Pdf URL: https://arxiv.org/pdf/2601.13711
Copy Paste: [[2601.13711]] GerAV: Towards New Heights in German Authorship Verification using Fine-Tuned LLMs on a New Benchmark(https://arxiv.org/abs/2601.13711)
Keywords: language model, gpt, llm
Abstract: Authorship verification (AV) is the task of determining whether two texts were written by the same author and has been studied extensively, predominantly for English data. In contrast, large-scale benchmarks and systematic evaluations for other languages remain scarce. We address this gap by introducing GerAV, a comprehensive benchmark for German AV comprising over 600k labeled text pairs. GerAV is built from Twitter and Reddit data, with the Reddit part further divided into in-domain and cross-domain message-based subsets, as well as a profile-based subset. This design enables controlled analysis of the effects of data source, topical domain, and text length. Using the provided training splits, we conduct a systematic evaluation of strong baselines and state-of-the-art models and find that our best approach, a fine-tuned large language model, outperforms recent baselines by up to 0.09 absolute F1 score and surpasses GPT-5 in a zero-shot setting by 0.08. We further observe a trade-off between specialization and generalization: models trained on specific data types perform best under matching conditions but generalize less well across data regimes, a limitation that can be mitigated by combining training sources. Overall, GerAV provides a challenging and versatile benchmark for advancing research on German and cross-domain AV.
摘要：作者身份验证（AV）是确定两个文本是否由同一作者撰写的任务，并且已被广泛研究，主要针对英语数据。相比之下，其他语言的大规模基准测试和系统评估仍然很少。我们通过引入 GerAV 来解决这一差距，这是一个德国 AV 的综合基准，包含超过 60 万个标记文本对。 GerAV 是根据 Twitter 和 Reddit 数据构建的，Reddit 部分进一步分为基于域内和跨域消息的子集，以及基于配置文件的子集。这种设计可以对数据源、主题域和文本长度的影响进行受控分析。使用提供的训练分割，我们对强大的基线和最先进的模型进行了系统评估，发现我们的最佳方法（经过微调的大型语言模型）比最近的基线高出 0.09 绝对 F1 分数，并在零样本设置中超过 GPT-5 0.08。我们进一步观察到专业化和泛化之间的权衡：在特定数据类型上训练的模型在匹配条件下表现最好，但跨数据体系的泛化效果较差，这一限制可以通过组合训练源来缓解。总的来说，GerAV 为推进德国和跨领域 AV 的研究提供了一个具有挑战性和多功能的基准。

Title: Simulated Ignorance Fails: A Systematic Study of LLM Behaviors on Forecasting Problems Before Model Knowledge Cutoff

Authors: Zehan Li, Yuxuan Wang, Ali El Lahib, Ying-Jieh Xia, Xinyu Pi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.13717
Pdf URL: https://arxiv.org/pdf/2601.13717
Copy Paste: [[2601.13717]] Simulated Ignorance Fails: A Systematic Study of LLM Behaviors on Forecasting Problems Before Model Knowledge Cutoff(https://arxiv.org/abs/2601.13717)
Keywords: llm, prompt, chain-of-thought
Abstract: Evaluating LLM forecasting capabilities is constrained by a fundamental tension: prospective evaluation offers methodological rigor but prohibitive latency, while retrospective forecasting (RF) -- evaluating on already-resolved events -- faces rapidly shrinking clean evaluation data as SOTA models possess increasingly recent knowledge cutoffs. Simulated Ignorance (SI), prompting models to suppress pre-cutoff knowledge, has emerged as a potential solution. We provide the first systematic test of whether SI can approximate True Ignorance (TI). Across 477 competition-level questions and 9 models, we find that SI fails systematically: (1) cutoff instructions leave a 52% performance gap between SI and TI; (2) chain-of-thought reasoning fails to suppress prior knowledge, even when reasoning traces contain no explicit post-cutoff references; (3) reasoning-optimized models exhibit worse SI fidelity despite superior reasoning trace quality. These findings demonstrate that prompts cannot reliably "rewind" model knowledge. We conclude that RF on pre-cutoff events is methodologically flawed; we recommend against using SI-based retrospective setups to benchmark forecasting capabilities.
摘要：评估 LLM 预测能力受到根本性紧张的限制：前瞻性评估提供了方法严谨性，但延迟却令人望而却步，而回顾性预测（RF）——对已经解决的事件进行评估——面临着快速缩减的干净评估数据，因为 SOTA 模型拥有越来越多的最新知识截止点。模拟无知（SI）促使模型抑制截止前的知识，已成为一种潜在的解决方案。我们首次系统测试 SI 是否可以逼近真无知 (TI)。在 477 个竞赛级问题和 9 个模型中，我们发现 SI 系统性地失败了：（1）截止指令在 SI 和 TI 之间留下了 52% 的性能差距；（2）即使推理痕迹不包含明确的切断后参考，思想链推理也无法抑制先验知识； (3) 尽管推理跟踪质量优异，但推理优化模型表现出较差的 SI 保真度。这些发现表明提示不能可靠地“倒回”模型知识。我们的结论是，关于截止前事件的 RF 在方法上是有缺陷的；我们建议不要使用基于 SI 的回顾性设置来衡量预测能力。

Title: OP-Bench: Benchmarking Over-Personalization for Memory-Augmented Personalized Conversational Agents

Authors: Yulin Hu, Zimo Long, Jiahe Guo, Xingyu Sui, Xing Fu, Weixiang Zhao, Yanyan Zhao, Bing Qin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.13722
Pdf URL: https://arxiv.org/pdf/2601.13722
Copy Paste: [[2601.13722]] OP-Bench: Benchmarking Over-Personalization for Memory-Augmented Personalized Conversational Agents(https://arxiv.org/abs/2601.13722)
Keywords: language model, agent
Abstract: Memory-augmented conversational agents enable personalized interactions using long-term user memory and have gained substantial traction. However, existing benchmarks primarily focus on whether agents can recall and apply user information, while overlooking whether such personalization is used appropriately. In fact, agents may overuse personal information, producing responses that feel forced, intrusive, or socially inappropriate to users. We refer to this issue as over-personalization. In this work, we formalize over-personalization into three types: Irrelevance, Repetition, and Sycophancy, and introduce OP-Bench, a benchmark of 1,700 verified instances constructed from long-horizon dialogue histories. Using OP-Bench, we evaluate multiple large language models and memory-augmentation methods, and find that over-personalization is widespread when memory is introduced. Further analysis reveals that agents tend to retrieve and over-attend to user memories even when unnecessary. To address this issue, we propose Self-ReCheck, a lightweight, model-agnostic memory filtering mechanism that mitigates over-personalization while preserving personalization performance. Our work takes an initial step toward more controllable and appropriate personalization in memory-augmented dialogue systems.
摘要：记忆增强对话代理利用长期用户记忆实现个性化交互，并获得了巨大的吸引力。然而，现有的基准主要关注代理是否能够回忆和应用用户信息，而忽略了这种个性化是否被适当使用。事实上，代理可能会过度使用个人信息，产生让用户感到强迫、侵入性或社交上不合适的响应。我们将此问题称为\emph{过度个性化}。在这项工作中，我们将过度个性化分为三种类型：无关性、重复性和谄媚性，并引入 \textbf{OP-Bench} 基准，该基准由长期对话历史构建的 1,700 个经过验证的实例组成。使用 \textbf{OP-Bench}，我们评估了多种大型语言模型和记忆增强方法，发现引入记忆后过度个性化现象普遍存在。进一步的分析表明，即使在不必要的情况下，智能体也倾向于检索和过度关注用户的记忆。为了解决这个问题，我们提出了 \textbf{Self-ReCheck}，这是一种轻量级的、与模型无关的内存过滤机制，可以在保持个性化性能的同时减轻过度个性化的影响。我们的工作朝着记忆增强对话系统中更加可控和适当的个性化迈出了第一步。

Title: On Temperature-Constrained Non-Deterministic Machine Translation: Potential and Evaluation

Authors: Weichuan Wang, Mingyang Liu, Linqi Song, Chen Ma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.13729
Pdf URL: https://arxiv.org/pdf/2601.13729
Copy Paste: [[2601.13729]] On Temperature-Constrained Non-Deterministic Machine Translation: Potential and Evaluation(https://arxiv.org/abs/2601.13729)
Keywords: language model
Abstract: In recent years, the non-deterministic properties of language models have garnered considerable attention and have shown a significant influence on real-world applications. However, such properties remain under-explored in machine translation (MT), a complex, non-deterministic NLP task. In this study, we systematically evaluate modern MT systems and identify temperature-constrained Non-Deterministic MT (ND-MT) as a distinct phenomenon. Additionally, we demonstrate that ND-MT exhibits significant potential in addressing the multi-modality issue that has long challenged MT research and provides higher-quality candidates than Deterministic MT (D-MT) under temperature constraints. However, ND-MT introduces new challenges in evaluating system performance. Specifically, the evaluation framework designed for D-MT fails to yield consistent evaluation results when applied to ND-MT. We further investigate this emerging challenge by evaluating five state-of-the-art ND-MT systems across three open datasets using both lexical-based and semantic-based metrics at varying sampling sizes. The results reveal a Buckets effect across these systems: the lowest-quality candidate generated by ND-MT consistently determines the overall system ranking across different sampling sizes for all reasonable metrics. Furthermore, we propose the ExpectoSample strategy to automatically assess the reliability of evaluation metrics for selecting robust ND-MT.
摘要：近年来，语言模型的非确定性特性引起了人们的广泛关注，并对现实世界的应用产生了重大影响。然而，在机器翻译 (MT) 这一复杂的、非确定性的 NLP 任务中，这些属性仍未得到充分探索。在这项研究中，我们系统地评估了现代 MT 系统，并将温度约束的非确定性 MT (ND-MT) 识别为一种独特的现象。此外，我们证明 ND-MT 在解决长期挑战 MT 研究的多模态问题方面表现出巨大的潜力，并在温度限制下提供比确定性 MT (D-MT) 更高质量的候选方案。然而，ND-MT 在评估系统性能方面带来了新的挑战。具体来说，为D-MT设计的评估框架在应用于ND-MT时无法产生一致的评估结果。我们通过使用不同采样大小的基于词汇和基于语义的指标来评估跨三个开放数据集的五个最先进的 ND-MT 系统，进一步研究这一新兴挑战。结果揭示了这些系统中的桶效应：ND-MT 生成的最低质量候选者始终决定了所有合理指标的不同采样大小的总体系统排名。此外，我们提出 ExpectoSample 策略来自动评估评估指标的可靠性，以选择稳健的 ND-MT。

Title: Towards robust long-context understanding of large language model via active recap learning

Authors: Chenyu Hui
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.13734
Pdf URL: https://arxiv.org/pdf/2601.13734
Copy Paste: [[2601.13734]] Towards robust long-context understanding of large language model via active recap learning(https://arxiv.org/abs/2601.13734)
Keywords: language model, llm, long context
Abstract: In this paper, we propose active recap learning (ARL), a framework for enhancing the ability of large language models (LLMs) to understand long contexts. ARL enables models to revisit and summarize earlier content through targeted sequence construction during continued pretraining and retrospective summarization at inference. First, we identify key tokens in a prepared long context based on loss gaps between long and short forward contexts, find the most relevant preceding paragraphs, and then summarize them using an LLM. Second, ARL equips models with the ability to autonomously generate and utilize these retrospective summaries during inference, thereby establishing a recursive memory mechanism across paragraphs. Experimental results show substantial gains, with ARL achieving a 26.8% improvement on RULER and a 9.44% improvement on LongBench. Overall, ARL offers a simple yet effective continued pretraining-based approach to strengthen long-context understanding, advancing scalable memory augmentation in LLMs.
摘要：在本文中，我们提出了主动回顾学习（ARL），这是一种增强大语言模型（LLM）理解长上下文的框架。 ARL 使模型能够在持续预训练和推理时回顾性总结期间通过有针对性的序列构建来重新访问和总结早期内容。首先，我们根据长和短前向上下文之间的损失差距来识别准备好的长上下文中的关键标记，并找到最相关的前面段落，然后使用 LLM 对其进行总结。其次，ARL 使模型能够在推理过程中自主生成和利用这些回顾性摘要，从而建立跨段落的递归记忆机制。实验结果显示，ARL 比 RULER 提高了 26.8%，比 LongBench 提高了 9.44%。总体而言，ARL 提供了一种简单而有效的基于持续预训练的方法，以加强长上下文理解，推进 LLM 中的可扩展记忆增强

Title: Dimension-First Evaluation of Speech-to-Speech Models with Structured Acoustic Cues

Authors: Arjun Chandra, Kevin Miller, Venkatesh Ravichandran, Constantinos Papayiannis, Venkatesh Saligrama
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.13742
Pdf URL: https://arxiv.org/pdf/2601.13742
Copy Paste: [[2601.13742]] Dimension-First Evaluation of Speech-to-Speech Models with Structured Acoustic Cues(https://arxiv.org/abs/2601.13742)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large Language Model (LLM) judges exhibit strong reasoning capabilities but are limited to textual content. This leaves current automatic Speech-to-Speech (S2S) evaluation methods reliant on opaque and expensive Audio Language Models (ALMs). In this work, we propose TRACE (Textual Reasoning over Audio Cues for Evaluation), a novel framework that enables LLM judges to reason over audio cues to achieve cost-efficient and human-aligned S2S evaluation. To demonstrate the strength of the framework, we first introduce a Human Chain-of-Thought (HCoT) annotation protocol to improve the diagnostic capability of existing judge benchmarks by separating evaluation into explicit dimensions: content (C), voice quality (VQ), and paralinguistics (P). Using this data, TRACE constructs a textual blueprint of inexpensive audio signals and prompts an LLM to render dimension-wise judgments, fusing them into an overall rating via a deterministic policy. TRACE achieves higher agreement with human raters than ALMs and transcript-only LLM judges while being significantly more cost-effective. We will release the HCoT annotations and the TRACE framework to enable scalable and human-aligned S2S evaluation.
摘要：大语言模型（LLM）评委表现出强大的推理能力，但仅限于文本内容。这使得当前的自动语音到语音 (S2S) 评估方法依赖于不透明且昂贵的音频语言模型 (ALM)。在这项工作中，我们提出了 TRACE（用于评估的音频线索文本推理），这是一种新颖的框架，使 LLM 法官能够根据音频线索进行推理，以实现经济高效且人性化的 S2S 评估。为了证明该框架的优势，我们首先引入人脑思维链（HCoT）注释协议，通过将评估分为明确的维度：内容（C）、语音质量（VQ）和副语言学（P）来提高现有法官基准的诊断能力。使用这些数据，TRACE 构建了廉价音频信号的文本蓝图，并提示法学硕士做出维度判断，通过确定性策略将它们融合到总体评级中。与 ALM 和仅记录成绩单的 LLM 法官相比，TRACE 与人类评分者达成了更高的一致性，同时显着更具成本效益。我们将发布 HCoT 注释和 TRACE 框架，以实现可扩展且人性化的 S2S 评估。

Title: Pro-AI Bias in Large Language Models

Authors: Benaya Trabelsi, Jonathan Shaki, Sarit Kraus
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2601.13749
Pdf URL: https://arxiv.org/pdf/2601.13749
Copy Paste: [[2601.13749]] Pro-AI Bias in Large Language Models(https://arxiv.org/abs/2601.13749)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are increasingly employed for decision-support across multiple domains. We investigate whether these models display a systematic preferential bias in favor of artificial intelligence (AI) itself. Across three complementary experiments, we find consistent evidence of pro-AI bias. First, we show that LLMs disproportionately recommend AI-related options in response to diverse advice-seeking queries, with proprietary models doing so almost deterministically. Second, we demonstrate that models systematically overestimate salaries for AI-related jobs relative to closely matched non-AI jobs, with proprietary models overestimating AI salaries by 10 percentage points more. Finally, probing internal representations of open-weight models reveals that "Artificial Intelligence" exhibits the highest similarity to generic prompts for academic fields under positive, negative, and neutral framings alike, indicating valence-invariant representational centrality. These patterns suggest that LLM-generated advice and valuation can systematically skew choices and perceptions in high-stakes decisions.
摘要：大型语言模型 (LLM) 越来越多地用于跨多个领域的决策支持。我们研究这些模型是否表现出有利于人工智能（AI）本身的系统性偏好偏差。通过三个互补的实验，我们发现了支持人工智能偏见的一致证据。首先，我们表明法学硕士不成比例地推荐与人工智能相关的选项来响应不同的寻求建议的查询，而专有模型几乎是确定性地这样做。其次，我们证明，相对于密切匹配的非人工智能工作，模型系统地高估了人工智能相关工作的薪资，专有模型高估了人工智能薪资 10 个百分点。最后，探索开放权重模型的内部表征表明，“人工智能”在积极、消极和中性框架下与学术领域的一般提示表现出最高的相似性，表明价不变的表征中心性。这些模式表明，法学硕士生成的建议和评估可能会系统性地扭曲高风险决策中的选择和看法。

Title: Knowledge Graph-Assisted LLM Post-Training for Enhanced Legal Reasoning

Authors: Dezhao Song, Guglielmo Bonifazi, Frank Schilder, Jonathan Richard Schwarz
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.13806
Pdf URL: https://arxiv.org/pdf/2601.13806
Copy Paste: [[2601.13806]] Knowledge Graph-Assisted LLM Post-Training for Enhanced Legal Reasoning(https://arxiv.org/abs/2601.13806)
Keywords: llm
Abstract: LLM post-training has primarily relied on large text corpora and human feedback, without capturing the structure of domain knowledge. This has caused models to struggle with complex reasoning tasks, especially in high-stakes professional domains. In law, reasoning requires a deep understanding of the relations between various legal concepts, a key component missing in current LLM post-training. In this paper, we propose a knowledge graph (KG)-assisted approach for enhancing LLMs' reasoning capability in the legal domain that is generalizable to other high-stakes domains. We model key legal concepts by following the IRAC (Issue, Rule, Analysis and Conclusion) framework, and construct a KG with 12K legal cases. We then produce training data using our IRAC KG, and conduct both Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) with three state-of-the-art (SOTA) LLMs (30B, 49B and 70B), varying architecture and base model family. Our post-trained models obtained better average performance on 4/5 diverse legal benchmarks (14 tasks) than baselines. In particular, our 70B DPO model achieved the best score on 4/6 reasoning tasks among the baselines and a 141B SOTA legal LLM, demonstrating the effectiveness of our KG for enhancing LLMs' legal reasoning capability.
摘要：LLM后期培训主要依赖于大型文本语料库和人工反馈，没有捕获领域知识的结构。这导致模型难以处理复杂的推理任务，尤其是对于高风险的专业领域。在法律领域，推理需要对各种法律概念之间的关系有深入的理解，这是目前法学硕士后培训中缺少的一个关键组成部分。在本文中，我们提出了一种知识图（KG）辅助方法，用于增强法学硕士在法律方面的推理能力，该方法可推广到其他高风险领域。我们按照\textbf{IRAC}（问题、规则、分析和结论）框架对关键法律概念进行建模，并构建了包含 12000 个法律案例的知识图谱。然后，我们使用 IRAC KG 生成训练数据，并使用三个最先进的 (SOTA) LLM（30B、49B 和 70B）、不同的架构和基础模型系列进行监督微调 (SFT) 和直接偏好优化 (DPO)。我们的训练后模型在 4/5 不同的法律基准（14 项任务）上获得了比基线更好的平均性能。特别是，我们的 70B DPO 模型在 4/6 推理任务上取得了基线和 141B SOTA 法律 LLM 中的最佳分数，证明了我们的 KG 在增强 LLM 法律推理能力方面的有效性。

Title: FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

Authors: Qian Chen, Jinlan Fu, Changsong Li, See-Kiong Ng, Xipeng Qiu
Subjects: cs.CL, cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2601.13836
Pdf URL: https://arxiv.org/pdf/2601.13836
Copy Paste: [[2601.13836]] FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs(https://arxiv.org/abs/2601.13836)
Keywords: language model, llm
Abstract: Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on retrospective understanding. To bridge this gap, we introduce FutureOmni, the first benchmark designed to evaluate omni-modal future forecasting from audio-visual environments. The evaluated models are required to perform cross-modal causal and temporal reasoning, as well as effectively leverage internal knowledge to predict future events. FutureOmni is constructed via a scalable LLM-assisted, human-in-the-loop pipeline and contains 919 videos and 1,034 multiple-choice QA pairs across 8 primary domains. Evaluations on 13 omni-modal and 7 video-only models show that current systems struggle with audio-visual future prediction, particularly in speech-heavy scenarios, with the best accuracy of 64.8% achieved by Gemini 3 Flash. To mitigate this limitation, we curate a 7K-sample instruction-tuning dataset and propose an Omni-Modal Future Forecasting (OFF) training strategy. Evaluations on FutureOmni and popular audio-visual and video-only benchmarks demonstrate that OFF enhances future forecasting and generalization. We publicly release all code and datasets.
摘要：尽管多模态大语言模型（MLLM）表现出强大的全模态感知，但它们根据视听线索预测未来事件的能力在很大程度上仍未得到探索，因为现有基准主要侧重于回顾性理解。为了弥补这一差距，我们引入了 FutureOmni，这是第一个旨在评估视听环境中的全模式未来预测的基准。评估的模型需要执行跨模式因果和时间推理，并有效利用内部知识来预测未来事件。 FutureOmni 通过可扩展的法学硕士辅助、人机交互管道构建，包含 8 个主要领域的 919 个视频和 1,034 个多项选择 QA 对。对 13 个全模态和 7 个纯视频模型的评估表明，当前系统在视听未来预测方面存在困难，特别是在语音较多的场景中，Gemini 3 Flash 的最佳准确率达到 64.8%。为了缓解这一限制，我们整理了一个 7K 样本指令调整数据集，并提出了全模态未来预测 (OFF) 训练策略。对 FutureOmni 以及流行的视听和纯视频基准的评估表明，OFF 增强了未来的预测和泛化能力。我们公开发布所有代码（此 https URL）和数据集（此 https URL）。

Title: Pedagogical Alignment for Vision-Language-Action Models: A Comprehensive Framework for Data, Architecture, and Evaluation in Education

Authors: Unggi Lee, Jahyun Jeong, Sunyoung Shin, Haeun Park, Jeongsu Moon, Youngchang Song, Jaechang Shim, JaeHwan Lee, Yunju Noh, Seungwon Choi, Ahhyun Kim, TaeHyeon Kim, Kyungtae Joo, Taeyeong Kim, Gyeonggeon Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.13876
Pdf URL: https://arxiv.org/pdf/2601.13876
Copy Paste: [[2601.13876]] Pedagogical Alignment for Vision-Language-Action Models: A Comprehensive Framework for Data, Architecture, and Evaluation in Education(https://arxiv.org/abs/2601.13876)
Keywords: language model, llm
Abstract: Science demonstrations are important for effective STEM education, yet teachers face challenges in conducting them safely and consistently across multiple occasions, where robotics can be helpful. However, current Vision-Language-Action (VLA) models require substantial computational resources and sacrifice language generation capabilities to maximize efficiency, making them unsuitable for resource-constrained educational settings that require interpretable, explanation-generating systems. We present Pedagogical VLA Framework, a framework that applies pedagogical alignment to lightweight VLA models through four components: text healing to restore language generation capabilities, large language model (LLM) distillation to transfer pedagogical knowledge, safety training for educational environments, and pedagogical evaluation adjusted to science education contexts. We evaluate Pedagogical VLA Framework across five science demonstrations spanning physics, chemistry, biology, and earth science, using an evaluation framework developed in collaboration with science education experts. Our evaluation assesses both task performance (success rate, protocol compliance, efficiency, safety) and pedagogical quality through teacher surveys and LLM-as-Judge assessment. We additionally provide qualitative analysis of generated texts. Experimental results demonstrate that Pedagogical VLA Framework achieves comparable task performance to baseline models while producing contextually appropriate educational explanations.
摘要：科学演示对于有效的 STEM 教育非常重要，但教师在多种场合安全、一致地进行科学演示面临着挑战，而机器人技术可以在这些场合提供帮助。然而，当前的视觉-语言-动作（VLA）模型需要大量的计算资源，并牺牲语言生成能力来最大限度地提高效率，这使得它们不适合需要可解释的解释生成系统的资源有限的教育环境。我们提出了 \textit{教学 VLA 框架}，该框架通过四个组件将教学调整应用于轻量级 VLA 模型：恢复语言生成能力的文本修复、用于转移教学知识的大语言模型（LLM）蒸馏、教育环境的安全培训以及根据科学教育背景调整的教学评估。我们使用与科学教育专家合作开发的评估框架，评估涵盖物理、化学、生物学和地球科学的五个科学演示的教学 VLA 框架。我们的评估通过教师调查和法学硕士法官评估来评估任务绩效（成功率、协议合规性、效率、安全性）和教学质量。我们还提供生成文本的定性分析。实验结果表明，教学 VLA 框架可实现与基线模型相当的任务绩效，同时生成适合情境的教育解释。

Title: OpenLearnLM Benchmark: A Unified Framework for Evaluating Knowledge, Skill, and Attitude in Educational Large Language Models

Authors: Unggi Lee, Sookbun Lee, Heungsoo Choi, Jinseo Lee, Haeun Park, Younghoon Jeon, Sungmin Cho, Minju Kang, Junbo Koh, Jiyeong Bae, Minwoo Nam, Juyeon Eun, Yeonji Jung, Yeil Jeong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.13882
Pdf URL: https://arxiv.org/pdf/2601.13882
Copy Paste: [[2601.13882]] OpenLearnLM Benchmark: A Unified Framework for Evaluating Knowledge, Skill, and Attitude in Educational Large Language Models(https://arxiv.org/abs/2601.13882)
Keywords: language model, llm
Abstract: Large Language Models are increasingly deployed as educational tools, yet existing benchmarks focus on narrow skills and lack grounding in learning sciences. We introduce OpenLearnLM Benchmark, a theory-grounded framework evaluating LLMs across three dimensions derived from educational assessment theory: Knowledge (curriculum-aligned content and pedagogical understanding), Skills (scenario-based competencies organized through a four-level center-role-scenario-subscenario hierarchy), and Attitude (alignment consistency and deception resistance). Our benchmark comprises 124K+ items spanning multiple subjects, educational roles, and difficulty levels based on Bloom's taxonomy. The Knowledge domain prioritizes authentic assessment items from established benchmarks, while the Attitude domain adapts Anthropic's Alignment Faking methodology to detect behavioral inconsistency under varying monitoring conditions. Evaluation of seven frontier models reveals distinct capability profiles: Claude-Opus-4.5 excels in practical skills despite lower content knowledge, while Grok-4.1-fast leads in knowledge but shows alignment concerns. Notably, no single model dominates all dimensions, validating the necessity of multi-axis evaluation. OpenLearnLM provides an open, comprehensive framework for advancing LLM readiness in authentic educational contexts.
摘要：大型语言模型越来越多地被部署为教育工具，但现有的基准侧重于狭隘的技能，缺乏学习科学的基础。我们引入了 OpenLearnLM Benchmark，这是一个以理论为基础的框架，从教育评估理论衍生的三个维度评估法学硕士：知识（与课程一致的内容和教学理解）、技能（通过四级中心-角色-场景-子场景层次结构组织的基于场景的能力）和态度（一致性和抗欺骗性）。我们的基准包括 124K+ 项目，涵盖多个学科、教育角色和基于 Bloom 分类法的难度级别。知识领域优先考虑已建立基准中的真实评估项目，而态度领域则采用 Anthropic 的对齐伪造方法来检测不同监控条件下的行为不一致。对七个前沿模型的评估揭示了不同的能力概况：尽管知识内容较低，但 Claude-Opus-4.5 在实践技能方面表现出色，而 Grok-4.1-fast 在知识方面领先，但表现出一致性问题。值得注意的是，没有一个模型能够主导所有维度，这验证了多轴评估的必要性。 OpenLearnLM 提供了一个开放、全面的框架，用于在真实的教育环境中推进法学硕士的准备工作。

Title: Confident Rankings with Fewer Items: Adaptive LLM Evaluation with Continuous Scores

Authors: Esma Balkır, Alice Pernthaller, Marco Basaldella, José Hernández-Orallo, Nigel Collier
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.13885
Pdf URL: https://arxiv.org/pdf/2601.13885
Copy Paste: [[2601.13885]] Confident Rankings with Fewer Items: Adaptive LLM Evaluation with Continuous Scores(https://arxiv.org/abs/2601.13885)
Keywords: llm
Abstract: Computerized Adaptive Testing (CAT) has proven effective for efficient LLM evaluation on multiple-choice benchmarks, but modern LLM evaluation increasingly relies on generation tasks where outputs are scored continuously rather than marked correct/incorrect. We present a principled extension of IRT-based adaptive testing to continuous bounded scores (ROUGE, BLEU, LLM-as-a-Judge) by replacing the Bernoulli response distribution with a heteroskedastic normal distribution. Building on this, we introduce an uncertainty-aware ranker with adaptive stopping criteria that achieves reliable model ranking while testing as few items and as cheaply as possible. We validate our method on five benchmarks spanning n-gram-based, embedding-based, and LLM-as-judge metrics. Our method uses 2% of the items while improving ranking correlation by 0.12 τ over random sampling, with 95% accuracy on confident predictions.
摘要：计算机化自适应测试（CAT）已被证明可以有效地在多项选择基准上有效地进行 LLM 评估，但现代 LLM 评估越来越依赖于生成任务，其中对输出进行连续评分，而不是标记正确/不正确。我们通过用异方差正态分布替换伯努利响应分布，提出了基于 IRT 的自适应测试对连续有界分数（ROUGE、BLEU、LLM-as-a-Judge）的原则性扩展。在此基础上，我们引入了具有自适应停止标准的不确定性感知排名器，可以在测试尽可能少的项目和尽可能便宜的情况下实现可靠的模型排名。我们在五个基准测试上验证了我们的方法，这些基准测试涵盖基于 n-gram、基于嵌入和 LLM-as-judge 指标。我们的方法使用 2% 的项目，同时将排名相关性比随机采样提高 0.12 tau，并且置信预测的准确率达到 95%。
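
Illustrative sketch (not the authors' code): the abstract above replaces the Bernoulli response model of IRT with a heteroskedastic normal distribution. The snippet below shows one way such a likelihood could look for a score in [0, 1]. The variance form k * mu * (1 - mu) + eps, the 2PL-style mean, the grid search, and all function and parameter names are assumptions made for illustration; the abstract does not specify them.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def log_likelihood(score, theta, a, b, k=0.25, eps=1e-4):
    """Log density of a continuous score under a normal response model.

    Mean: 2PL-style curve sigmoid(a * (theta - b)).
    Variance: k * mu * (1 - mu) + eps, an ASSUMED heteroskedastic form
    that shrinks as the expected score approaches 0 or 1.
    """
    mu = sigmoid(a * (theta - b))
    var = k * mu * (1.0 - mu) + eps
    return -0.5 * math.log(2.0 * math.pi * var) - (score - mu) ** 2 / (2.0 * var)

def estimate_ability(responses, grid=None):
    """Grid-search maximum-likelihood ability estimate.

    responses: list of (score, discrimination a, difficulty b) tuples.
    """
    if grid is None:
        grid = [i / 10.0 for i in range(-40, 41)]
    return max(
        grid,
        key=lambda t: sum(log_likelihood(s, t, a, b) for s, a, b in responses),
    )

if __name__ == "__main__":
    strong = [(0.9, 1.0, 0.0), (0.8, 1.0, 1.0)]
    weak = [(0.2, 1.0, 0.0), (0.1, 1.0, -1.0)]
    print(estimate_ability(strong), estimate_ability(weak))
```

An adaptive tester would re-estimate ability after each scored item and pick the next item accordingly; the item-selection and stopping rules are left out here because the abstract gives no detail on them.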

Title: AgentEHR: Advancing Autonomous Clinical Decision-Making via Retrospective Summarization

Authors: Yusheng Liao, Chuan Xuan, Yutong Cai, Lina Yang, Zhe Chen, Yanfeng Wang, Yu Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.13918
Pdf URL: https://arxiv.org/pdf/2601.13918
Copy Paste: [[2601.13918]] AgentEHR: Advancing Autonomous Clinical Decision-Making via Retrospective Summarization(https://arxiv.org/abs/2601.13918)
Keywords: language model, agent
Abstract: Large Language Models have demonstrated profound utility in the medical domain. However, their application to autonomous Electronic Health Records (EHRs) navigation remains constrained by a reliance on curated inputs and simplified retrieval tasks. To bridge the gap between idealized experimental settings and realistic clinical environments, we present AgentEHR. This benchmark challenges agents to execute complex decision-making tasks, such as diagnosis and treatment planning, requiring long-range interactive reasoning directly within raw and high-noise databases. In tackling these tasks, we identify that existing summarization methods inevitably suffer from critical information loss and fractured reasoning continuity. To address this, we propose RetroSum, a novel framework that unifies a retrospective summarization mechanism with an evolving experience strategy. By dynamically re-evaluating interaction history, the retrospective mechanism prevents long-context information loss and ensures unbroken logical coherence. Additionally, the evolving strategy bridges the domain gap by retrieving accumulated experience from a memory bank. Extensive empirical evaluations demonstrate that RetroSum achieves performance gains of up to 29.16% over competitive baselines, while significantly decreasing total interaction errors by up to 92.3%.
摘要：大型语言模型在医学领域已经证明了深远的实用性。然而，它们在自主电子健康记录（EHR）导航中的应用仍然受到对策划输入和简化检索任务的依赖的限制。为了弥合理想的实验设置和现实的临床环境之间的差距，我们推出了 AgentEHR。该基准测试要求代理执行复杂的决策任务，例如诊断和治疗计划，需要直接在原始和高噪声数据库中进行远程交互式推理。在处理这些任务时，我们发现现有的摘要方法不可避免地会遭受关键信息丢失和推理连续性断裂的影响。为了解决这个问题，我们提出了 RetroSum，这是一个新颖的框架，它将回顾性总结机制与不断发展的体验策略相结合。通过动态地重新评估交互历史，追溯机制可以防止长上下文信息丢失并确保不间断的逻辑连贯性。此外，不断发展的策略通过从记忆库中检索积累的经验来弥合领域差距。广泛的实证评估表明，RetroSum 的性能比竞争基准提高了高达 29.16%，同时显着降低了总交互错误高达 92.3%。

Title: HyperWalker: Dynamic Hypergraph-Based Deep Diagnosis for Multi-Hop Clinical Modeling across EHR and X-Ray in Medical VLMs

Authors: Yuezhe Yang, Hao Wang, Yige Peng, Jinman Kim, Lei Bi
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2601.13919
Pdf URL: https://arxiv.org/pdf/2601.13919
Copy Paste: [[2601.13919]] HyperWalker: Dynamic Hypergraph-Based Deep Diagnosis for Multi-Hop Clinical Modeling across EHR and X-Ray in Medical VLMs(https://arxiv.org/abs/2601.13919)
Keywords: language model, agent
Abstract: Automated clinical diagnosis remains a core challenge in medical AI, which usually requires models to integrate multi-modal data and reason across complex, case-specific contexts. Although recent methods have advanced medical report generation (MRG) and visual question answering (VQA) with medical vision-language models (VLMs), they predominantly operate under a sample-isolated inference paradigm, processing cases independently without access to longitudinal electronic health records (EHRs) or structurally related patient examples. This paradigm limits reasoning to image-derived information alone, ignoring external complementary medical evidence that could support more accurate diagnosis. To overcome this limitation, we propose HyperWalker, a Deep Diagnosis framework that reformulates clinical reasoning via dynamic hypergraphs and test-time training. First, we construct a dynamic hypergraph, termed iBrochure, to model the structural heterogeneity of EHR data and implicit high-order associations among multimodal clinical information. Within this hypergraph, a reinforcement learning agent, Walker, navigates to and identifies optimal diagnostic paths. To ensure comprehensive coverage of diverse clinical characteristics in test samples, we incorporate a linger mechanism, a multi-hop orthogonal retrieval strategy that iteratively selects clinically complementary neighborhood cases reflecting distinct clinical attributes. Experiments on MRG with MIMIC and medical VQA on EHRXQA demonstrate that HyperWalker achieves state-of-the-art performance. Code is available at: this https URL
摘要：自动化临床诊断仍然是医疗人工智能的核心挑战，这通常需要模型在复杂的特定案例环境中集成多模式数据和推理。尽管最近的方法具有先进的医疗报告生成（MRG）和带有医学视觉语言模型（VLM）的视觉问答（VQA），但这些方法主要在样本隔离推理范式下运行，因此独立处理病例，无需访问纵向电子健康记录（EHR）或结构相关的患者示例。这种范式将推理仅限于图像衍生信息，而忽略了外部补充医学证据以实现更准确的诊断。为了克服这个限制，我们提出了 HyperWalker，一个 Deep Diagnosis 框架，它通过动态超图和测试时训练重新表述临床推理。首先，我们构建一个动态超图，称为 iBrochure，来模拟 EHR 数据的结构异质性和多模式临床信息之间隐含的高阶关联。在这个超图中，强化学习代理 Walker 导航并识别最佳诊断路径。为了确保测试样本中不同临床特征的全面覆盖，我们采用了 linger 机制，这是一种多跳正交检索策略，可迭代选择反映不同临床属性的临床互补邻域病例。在采用 MIMIC 的 MRG 上以及在 EHRXQA 上进行的医学 VQA 上的实验表明，HyperWalker 实现了最先进的性能。代码位于：此 https URL

Title: Automatic Prompt Optimization for Dataset-Level Feature Discovery

Authors: Adrian Cosma, Oleg Szehr, David Kletz, Alessandro Antonucci, Olivier Pelletier
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.13922
Pdf URL: https://arxiv.org/pdf/2601.13922
Copy Paste: [[2601.13922]] Automatic Prompt Optimization for Dataset-Level Feature Discovery(https://arxiv.org/abs/2601.13922)
Keywords: prompt, agent
Abstract: Feature extraction from unstructured text is a critical step in many downstream classification pipelines, yet current approaches largely rely on hand-crafted prompts or fixed feature schemas. We formulate feature discovery as a dataset-level prompt optimization problem: given a labelled text corpus, the goal is to induce a global set of interpretable and discriminative feature definitions whose realizations optimize a downstream supervised learning objective. To this end, we propose a multi-agent prompt optimization framework in which language-model agents jointly propose feature definitions, extract feature values, and evaluate feature quality using dataset-level performance and interpretability feedback. Instruction prompts are iteratively refined based on this structured feedback, enabling optimization over prompts that induce shared feature sets rather than per-example predictions. This formulation departs from prior prompt optimization methods that rely on per-sample supervision and provides a principled mechanism for automatic feature discovery from unstructured text.
摘要：从非结构化文本中提取特征是许多下游分类管道中的关键步骤，但当前的方法很大程度上依赖于手工制作的提示或固定的特征模式。我们将特征发现表述为数据集级提示优化问题：给定一个标记的文本语料库，目标是引入一组全局的可解释和判别性特征定义，其实现优化下游监督学习目标。为此，我们提出了一种多智能体提示优化框架，其中语言模型智能体共同提出特征定义、提取特征值并使用数据集级性能和可解释性反馈来评估特征质量。指令提示根据这种结构化反馈进行迭代细化，从而能够对诱导共享特征集而不是每个示例预测的提示进行优化。该公式不同于先前依赖于每个样本监督的提示优化方法，并提供了从非结构化文本自动发现特征的原理机制。

Title: "The Whole Is Greater Than the Sum of Its Parts": A Compatibility-Aware Multi-Teacher CoT Distillation Framework

Authors: Jin Cui, Jiaqi Guo, Jiepeng Zhou, Ruixuan Yang, Jiayi Lu, Jiajun Xu, Jiangcheng Song, Boran Zhao, Pengju Ren
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.13992
Pdf URL: https://arxiv.org/pdf/2601.13992
Copy Paste: [[2601.13992]] "The Whole Is Greater Than the Sum of Its Parts": A Compatibility-Aware Multi-Teacher CoT Distillation Framework(https://arxiv.org/abs/2601.13992)
Keywords: language model, llm, hallucination, chain-of-thought
Abstract: Chain-of-Thought (CoT) reasoning empowers Large Language Models (LLMs) with remarkable capabilities but typically requires prohibitive parameter scales. CoT distillation has emerged as a promising paradigm to transfer reasoning prowess into compact Student Models (SLMs), but existing approaches often rely on a solitary teacher, capping the student's potential since individual LLMs often exhibit distinct capability biases and may suffer from catastrophic forgetting. While leveraging diverse teachers seems appealing, effectively fusing their supervisions remains challenging: teacher-student incompatibility risks amplifying hallucinations, and passive supervision fails to ensure genuine logic internalization. To address this, we introduce COMPACT, a framework that adaptively fuses supervisions from different teachers by dynamically weighting teacher gradients based on the student's real-time compatibility evaluated by a multi-dimensional metric: (1) Graph-based Consensus to filter misleading rationales by identifying mainstream reasoning paths; (2) Mutual-Information-based Adaptability to detect "epiphany moments" for genuinely understanding the reasoning process rather than merely imitating; and (3) Loss-based Difficulty to assess student receptivity to the teacher's guidance and prevent negative transfer. Extensive experiments and latent space analysis demonstrate that COMPACT effectively integrates diverse reasoning capabilities without damaging the model's original knowledge structure, achieving state-of-the-art performance on various benchmarks while mitigating catastrophic forgetting.
摘要：思想链 (CoT) 推理使大型语言模型 (LLM) 具有卓越的功能，但通常需要过高的参数规模。 CoT 蒸馏已成为一种很有前途的范式，可将推理能力转移到紧凑的学生模型 (SLM) 中，但现有方法通常依赖于单独的教师，从而限制了学生的潜力，因为个别法学硕士经常表现出明显的能力偏差，并可能遭受灾难性遗忘。虽然利用多元化的教师似乎很有吸引力，但有效融合他们的监督仍然具有挑战性：师生不兼容可能会放大幻觉，而被动监督无法确保真正的逻辑内化。为了解决这个问题，我们引入了 COMPACT，这是一个框架，它根据多维指标评估的学生实时兼容性，动态加权教师梯度，自适应地融合来自不同教师的监督：（1）基于图的共识，通过识别主流推理路径来过滤误导性的理由；（2）基于互信息的适应性来检测“顿悟时刻”，以真正理解推理过程而不是仅仅模仿； (3)基于损失的困难来评估学生对教师指导的接受程度并防止负迁移。大量的实验和潜在空间分析表明，COMPACT 有效地整合了多种推理能力，而不破坏模型原有的知识结构，在各种基准上实现了最先进的性能，同时减少了灾难性遗忘。

Title: From Tags to Trees: Structuring Fine-Grained Knowledge for Controllable Data Selection in LLM Instruction Tuning

Authors: Zihan Niu, Wenping Hu, Junmin Chen, Xiyue Wang, Tong Xu, Ruiming Tang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.13995
Pdf URL: https://arxiv.org/pdf/2601.13995
Copy Paste: [[2601.13995]] From Tags to Trees: Structuring Fine-Grained Knowledge for Controllable Data Selection in LLM Instruction Tuning(https://arxiv.org/abs/2601.13995)
Keywords: llm
Abstract: Effective and controllable data selection is critical for LLM instruction tuning, especially with massive open-source datasets. Existing approaches primarily rely on instance-level quality scores, or diversity metrics based on embedding clusters or semantic tags. However, constrained by the flatness of embedding spaces or the coarseness of tags, these approaches overlook fine-grained knowledge and its intrinsic hierarchical dependencies, consequently hindering precise data valuation and knowledge-aligned sampling. To address this challenge, we propose Tree-aware Aligned Global Sampling (TAGS), a unified framework that leverages a knowledge tree built from fine-grained tags, thereby enabling joint control of global quality, diversity, and target alignment. Using an LLM-based tagger, we extract atomic knowledge concepts, which are organized into a global tree through bottom-up hierarchical clustering. By grounding data instances onto this tree, a tree-aware metric then quantifies data quality and diversity, facilitating effective sampling. Our controllable sampling strategy maximizes tree-level information gain and enforces leaf-level alignment via KL-divergence for specific domains. Extensive experiments demonstrate that TAGS significantly outperforms state-of-the-art baselines. Notably, it surpasses the full-dataset model by +5.84% using only 5% of the data, while our aligned sampling strategy further boosts average performance by +4.24%.
摘要：有效且可控的数据选择对于LLM指令调优至关重要，尤其是对于海量开源数据集。现有方法主要依赖于实例级质量得分，或基于嵌入集群或语义标签的多样性度量。然而，受嵌入空间平坦度或标签粗糙度的限制，这些方法忽视了细粒度知识及其内在的层次依赖性，从而阻碍了精确的数据评估和知识对齐采样。为了应对这一挑战，我们提出了树感知对齐全局采样（TAGS），这是一个统一的框架，利用细粒度标签构建的知识树，从而实现全局质量、多样性和目标对齐的联合控制。使用基于 LLM 的标记器，我们提取原子知识概念，这些概念通过自下而上的层次聚类组织成全局树。通过将数据实例植根于这棵树上，树感知指标可以量化数据质量和多样性，从而促进有效采样。我们的可控采样策略最大化树级信息增益，并通过特定域的 KL 散度强制叶级对齐。大量实验表明 TAGS 的性能显着优于最先进的基线。值得注意的是，它仅使用 5% 数据就超越了完整数据集模型 +5.84%，而我们的对齐采样策略进一步将平均性能提高了 +4.24%。
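
Illustrative sketch (not the authors' code): the abstract above mentions enforcing leaf-level alignment via KL-divergence. The snippet below shows how the divergence between a target distribution over tree leaves and the leaf distribution of a candidate sample could be measured. The direction KL(target || sample), the additive smoothing, and the names leaf_distribution and kl_divergence are assumptions for illustration only.

```python
import math
from collections import Counter

def leaf_distribution(leaf_labels, leaves, smoothing=1e-6):
    """Empirical distribution of selected instances over tree leaves.

    A small additive smoothing (an assumption) keeps every leaf probability
    positive so the divergence below stays finite.
    """
    counts = Counter(leaf_labels)
    total = len(leaf_labels) + smoothing * len(leaves)
    return {leaf: (counts[leaf] + smoothing) / total for leaf in leaves}

def kl_divergence(p, q):
    """KL(p || q) for two distributions given as dicts over the same keys."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p if p[k] > 0)

if __name__ == "__main__":
    leaves = ["algebra", "geometry", "poetry"]
    target = {"algebra": 0.5, "geometry": 0.5, "poetry": 0.0}
    aligned = leaf_distribution(["algebra", "geometry", "algebra", "geometry"], leaves)
    skewed = leaf_distribution(["poetry", "poetry", "poetry", "algebra"], leaves)
    # A sample whose leaves match the target domain has the lower divergence.
    print(kl_divergence(target, aligned) < kl_divergence(target, skewed))
```

A sampler could prefer candidate subsets with lower divergence from the target; how TAGS combines this with tree-level information gain is not described in the abstract and is not modeled here.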

Title: Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

Authors: Hengyuan Zhang, Zhihao Zhang, Mingyang Wang, Zunhai Su, Yiwei Wang, Qianli Wang, Shuzhou Yuan, Ercong Nie, Xufeng Duan, Qibo Xue, Zeping Yu, Chenming Shang, Xiao Liang, Jing Xiong, Hui Shen, Chaofan Tao, Zhengwu Liu, Senjie Jin, Zhiheng Xi, Dongdong Zhang, Sophia Ananiadou, Tao Gui, Ruobing Xie, Hayden Kwok-Hay So, Hinrich Schütze, Xuanjing Huang, Qi Zhang, Ngai Wong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.14004
Pdf URL: https://arxiv.org/pdf/2601.14004
Copy Paste: [[2601.14004]] Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models(https://arxiv.org/abs/2601.14004)
Keywords: language model, llm
Abstract: Mechanistic Interpretability (MI) has emerged as a vital approach to demystify the opaque decision-making of Large Language Models (LLMs). However, existing reviews primarily treat MI as an observational science, summarizing analytical insights while lacking a systematic framework for actionable intervention. To bridge this gap, we present a practical survey structured around the pipeline: "Locate, Steer, and Improve." We formally categorize Localizing (diagnosis) and Steering (intervention) methods based on specific Interpretable Objects to establish a rigorous intervention protocol. Furthermore, we demonstrate how this framework enables tangible improvements in Alignment, Capability, and Efficiency, effectively operationalizing MI as an actionable methodology for model optimization. The curated paper list of this work is available at this https URL.
摘要：机械可解释性 (MI) 已成为揭开大型语言模型 (LLM) 不透明决策神秘面纱的重要方法。然而，现有的评论主要将 MI 视为一门观察科学，总结分析见解，但缺乏可操作干预的系统框架。为了弥补这一差距，我们提出了一项围绕管道构建的实用调查：“定位、引导和改进”。我们根据特定的可解释对象对定位（诊断）和引导（干预）方法进行正式分类，以建立严格的干预协议。此外，我们还展示了该框架如何实现一致性、能力和效率的切实改进，有效地将 MI 作为模型优化的可行方法进行操作。这项工作的精选论文列表可在 https URL 中找到。

Title: BACH-V: Bridging Abstract and Concrete Human-Values in Large Language Models

Authors: Junyu Zhang, Yipeng Kang, Jiong Guo, Jiayu Zhan, Junqi Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.14007
Pdf URL: https://arxiv.org/pdf/2601.14007
Copy Paste: [[2601.14007]] BACH-V: Bridging Abstract and Concrete Human-Values in Large Language Models(https://arxiv.org/abs/2601.14007)
Keywords: language model, llm
Abstract: Do large language models (LLMs) genuinely understand abstract concepts, or merely manipulate them as statistical patterns? We introduce an abstraction-grounding framework that decomposes conceptual understanding into three capacities: interpretation of abstract concepts (Abstract-Abstract, A-A), grounding of abstractions in concrete events (Abstract-Concrete, A-C), and application of abstract principles to regulate concrete decisions (Concrete-Concrete, C-C). Using human values as a testbed - given their semantic richness and centrality to alignment - we employ probing (detecting value traces in internal activations) and steering (modifying representations to shift behavior). Across six open-source LLMs and ten value dimensions, probing shows that diagnostic probes trained solely on abstract value descriptions reliably detect the same values in concrete event narratives and decision reasoning, demonstrating cross-level transfer. Steering reveals an asymmetry: intervening on value representations causally shifts concrete judgments and decisions (A-C, C-C), yet leaves abstract interpretations unchanged (A-A), suggesting that encoded abstract values function as stable anchors rather than malleable activations. These findings indicate LLMs maintain structured value representations that bridge abstraction and action, providing a mechanistic and operational foundation for building value-driven autonomous AI systems with more transparent, generalizable alignment and control.
摘要：大型语言模型 (LLM) 是否真正理解抽象概念，还是仅仅将它们作为统计模式进行处理？我们引入了一个抽象基础框架，它将概念理解分解为三种能力：抽象概念的解释（抽象-抽象，A-A）、具体事件中抽象的基础（抽象-具体，A-C）以及应用抽象原则来规范具体决策（具体-具体，C-C）。使用人类价值观作为测试平台——考虑到它们的语义丰富性和对齐的中心性——我们采用探测（检测内部激活中的价值痕迹）和引导（修改表示以改变行为）。在六个开源法学硕士和十个价值维度中，探索表明，仅在抽象价值描述上训练的诊断探针可以可靠地检测到具体事件叙述和决策推理中的相同值，从而证明了跨级别转移。转向揭示了一种不对称性：对价值表示的干预会因果性地改变具体的判断和决策（A-C，C-C），但保持抽象解释不变（A-A），这表明编码的抽象值充当稳定的锚点而不是可塑的激活。这些发现表明，法学硕士维持着连接抽象和行动的结构化价值表示，为构建具有更透明、更通用的调整和控制的价值驱动的自主人工智能系统提供了机制和操作基础。

Title: RM-Distiller: Exploiting Generative LLM for Reward Model Distillation

Authors: Hongli Zhou, Hui Huang, Wei Liu, Chenglong Wang, Xingyuan Bu, Lvyuan Han, Fuhai Song, Muyun Yang, Wenhao Jiang, Hailong Cao, Tiejun Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.14032
Pdf URL: https://arxiv.org/pdf/2601.14032
Copy Paste: [[2601.14032]] RM-Distiller: Exploiting Generative LLM for Reward Model Distillation(https://arxiv.org/abs/2601.14032)
Keywords: language model, llm
Abstract: Reward models (RMs) play a pivotal role in aligning large language models (LLMs) with human preferences. Due to the difficulty of obtaining high-quality human preference annotations, distilling preferences from generative LLMs has emerged as a standard practice. However, existing approaches predominantly treat teacher models as simple binary annotators, failing to fully exploit the rich knowledge and capabilities for RM distillation. To address this, we propose RM-Distiller, a framework designed to systematically exploit the multifaceted capabilities of teacher LLMs: (1) Refinement capability, which synthesizes highly correlated response pairs to create fine-grained and contrastive signals. (2) Scoring capability, which guides the RM in capturing precise preference strength via a margin-aware optimization objective. (3) Generation capability, which incorporates the teacher's generative distribution to regularize the RM to preserve its fundamental linguistic knowledge. Extensive experiments demonstrate that RM-Distiller significantly outperforms traditional distillation methods both on RM benchmarks and reinforcement learning-based alignment, proving that exploiting multifaceted teacher capabilities is critical for effective reward modeling. To the best of our knowledge, this is the first systematic research on RM distillation from generative LLMs.
摘要：奖励模型 (RM) 在使大型语言模型 (LLM) 与人类偏好保持一致方面发挥着关键作用。由于难以获得高质量的人类偏好注释，从生成法学硕士中提取偏好已成为一种标准做法。然而，现有方法主要将教师模型视为简单的二进制注释器，未能充分利用 RM 蒸馏的丰富知识和能力。为了解决这个问题，我们提出了 RM-Distiller，这是一个旨在系统地利用法学硕士教师的多方面能力的框架：（1）细化能力，它合成高度相关的响应对以创建细粒度和对比信号。 (2) 评分能力，引导 RM 通过边缘感知优化目标捕获精确的偏好强度。 (3)生成能力，结合教师的生成分布来规范RM以保留其基本语言知识。大量实验表明，RM-Distiller 在 RM 基准和基于强化学习的对齐方面均显着优于传统蒸馏方法，证明利用多方面的教师能力对于有效的奖励建模至关重要。据我们所知，这是第一个针对生成法学硕士 RM 蒸馏的系统研究。
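
Illustrative sketch (not the authors' code): the abstract above describes a margin-aware optimization objective that lets the reward model capture preference strength. One common form of such an objective is a pairwise logistic loss with a margin, -log sigmoid(r_chosen - r_rejected - margin), where a larger margin stands for a stronger teacher preference. Whether RM-Distiller uses exactly this form is an assumption; the function name and arguments are hypothetical.

```python
import math

def margin_pairwise_loss(r_chosen, r_rejected, margin):
    """Pairwise logistic loss with a margin: -log(sigmoid(z)),
    where z = r_chosen - r_rejected - margin.

    Computed as softplus(-z) in a numerically stable way.
    """
    z = r_chosen - r_rejected - margin
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

if __name__ == "__main__":
    # Same reward gap, but a larger required margin leaves more loss to reduce.
    print(margin_pairwise_loss(2.0, 0.0, 0.0))
    print(margin_pairwise_loss(2.0, 0.0, 1.5))
```

With margin set to 0 this reduces to the standard Bradley-Terry style pairwise loss used for reward models.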

Title: Top 10 Open Challenges Steering the Future of Diffusion Language Model and Its Variants

Authors: Yunhe Wang, Kai Han, Huiling Zhen, Yuchuan Tian, Hanting Chen, Yongbing Huang, Yufei Cui, Yingte Shu, Shan Gao, Ismail Elezi, Roy Vaughan Miles, Songcen Xu, Feng Wen, Chao Xu, Sinan Zeng, Dacheng Tao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.14041
Pdf URL: https://arxiv.org/pdf/2601.14041
Copy Paste: [[2601.14041]] Top 10 Open Challenges Steering the Future of Diffusion Language Model and Its Variants(https://arxiv.org/abs/2601.14041)
Keywords: language model, gpt, llm
Abstract: The paradigm of Large Language Models (LLMs) is currently defined by auto-regressive (AR) architectures, which generate text through a sequential "brick-by-brick" process. Despite their success, AR models are inherently constrained by a causal bottleneck that limits global structural foresight and iterative refinement. Diffusion Language Models (DLMs) offer a transformative alternative, conceptualizing text generation as a holistic, bidirectional denoising process akin to a sculptor refining a masterpiece. However, the potential of DLMs remains largely untapped as they are frequently confined within AR-legacy infrastructures and optimization frameworks. In this Perspective, we identify ten fundamental challenges ranging from architectural inertia and gradient sparsity to the limitations of linear reasoning that prevent DLMs from reaching their "GPT-4 moment". We propose a strategic roadmap organized into four pillars: foundational infrastructure, algorithmic optimization, cognitive reasoning, and unified multimodal intelligence. By shifting toward a diffusion-native ecosystem characterized by multi-scale tokenization, active remasking, and latent thinking, we can move beyond the constraints of the causal horizon. We argue that this transition is essential for developing next-generation AI capable of complex structural reasoning, dynamic self-correction, and seamless multimodal integration.
摘要：大型语言模型 (LLM) 的范式目前由自回归 (AR) 架构定义，该架构通过连续的“逐块”过程生成文本。尽管 AR 模型取得了成功，但它本质上受到因果瓶颈的限制，限制了全局结构远见和迭代细化。扩散语言模型 (DLM) 提供了一种变革性的替代方案，将文本生成概念化为一个整体的双向去噪过程，类似于雕塑家精炼杰作的过程。然而，DLM 的潜力在很大程度上尚未开发，因为它们经常局限于 AR 遗留基础设施和优化框架内。在本视角中，我们确定了十个基本挑战，从架构惯性和梯度稀疏性到阻碍 DLM 达到“GPT-4 时刻”的线性推理限制。我们提出了一个分为四个支柱的战略路线图：基础设施、算法优化、认知推理和统一多模态智能。通过转向以多尺度代币化、主动重新屏蔽和潜在思维为特征的扩散原生生态系统，我们可以超越因果视野的限制。我们认为，这种转变对于开发能够进行复杂结构推理、动态自我校正和无缝多模式集成的下一代人工智能至关重要。

Title: PRiSM: Benchmarking Phone Realization in Speech Models

Authors: Shikhar Bharadwaj, Chin-Jou Li, Yoonjae Kim, Kwanghee Choi, Eunjung Yeo, Ryan Soh-Eun Shim, Hanyu Zhou, Brendon Boldt, Karen Rosero Jacome, Kalvin Chang, Darsh Agrawal, Keer Xu, Chao-Han Huck Yang, Jian Zhu, Shinji Watanabe, David R. Mortensen
Subjects: cs.CL, cs.SD
Abstract URL: https://arxiv.org/abs/2601.14046
Pdf URL: https://arxiv.org/pdf/2601.14046
Copy Paste: [[2601.14046]] PRiSM: Benchmarking Phone Realization in Speech Models(https://arxiv.org/abs/2601.14046)
Keywords: language model
Abstract: Phone recognition (PR) serves as the atomic interface for language-agnostic modeling for cross-lingual speech processing and phonetic analysis. Despite prolonged efforts in developing PR systems, current evaluations only measure surface-level transcription accuracy. We introduce PRiSM, the first open-source benchmark designed to expose blind spots in phonetic perception through intrinsic and extrinsic evaluation of PR systems. PRiSM standardizes transcription-based evaluation and assesses downstream utility in clinical, educational, and multilingual settings with transcription and representation probes. We find that diverse language exposure during training is key to PR performance, encoder-CTC models are the most stable, and specialized PR models still outperform Large Audio Language Models. PRiSM releases code, recipes, and datasets to move the field toward multilingual speech models with robust phonetic ability: this https URL.
摘要：电话识别 (PR) 充当跨语言语音处理和语音分析的语言无关建模的原子接口。尽管在开发 PR 系统方面付出了长期的努力，但目前的评估仅测量表面水平的转录准确性。我们推出 PRiSM，这是第一个开源基准测试，旨在通过 PR 系统的内在和外在评估来暴露语音感知的盲点。 PRiSM 标准化基于转录的评估，并通过转录和表征探针评估临床、教育和多语言环境中的下游效用。我们发现训练期间的多样化语言接触是 PR 性能的关键，编码器 CTC 模型是最稳定的，并且专门的 PR 模型仍然优于大型音频语言模型。 PRiSM 发布了代码、配方和数据集，以推动该领域向具有强大语音能力的多语言语音模型发展：此 https URL。

Title: Understanding Multilingualism in Mixture-of-Experts LLMs: Routing Mechanism, Expert Specialization, and Layerwise Steering

Authors: Yuxin Chen, Zhengzhou Cai, Xiangtian Ji, Weixiang Zhao, An Zhang, Xiang Wang, Tat-Seng Chua
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.14050
Pdf URL: https://arxiv.org/pdf/2601.14050
Copy Paste: [[2601.14050]] Understanding Multilingualism in Mixture-of-Experts LLMs: Routing Mechanism, Expert Specialization, and Layerwise Steering(https://arxiv.org/abs/2601.14050)
Keywords: llm
Abstract: Mixture-of-Experts (MoE) architectures have shown strong multilingual capabilities, yet the internal mechanisms underlying performance gains and cross-language differences remain insufficiently understood. In this work, we conduct a systematic analysis of MoE models, examining routing behavior and expert specialization across languages and network depth. Our analysis reveals that multilingual processing in MoE models is highly structured: routing aligns with linguistic families, expert utilization follows a clear layerwise pattern, and high-resource languages rely on shared experts while low-resource languages depend more on language-exclusive experts despite weaker performance. Layerwise interventions further show that early and late MoE layers support language-specific processing, whereas middle layers serve as language-agnostic capacity hubs. Building on these insights, we propose a routing-guided steering method that adaptively guides routing behavior in middle layers toward shared experts associated with dominant languages at inference time, leading to consistent multilingual performance improvements, particularly for linguistically related language pairs. Our code is available at this https URL.
摘要：专家混合 (MoE) 架构已显示出强大的多语言能力，但性能提升和跨语言差异背后的内部机制仍不清楚。在这项工作中，我们对 MoE 模型进行了系统分析，检查跨语言和网络深度的路由行为和专家专业化。我们的分析表明，MoE 模型中的多语言处理是高度结构化的：路由与语言家族一致，专家利用遵循清晰的分层模式，高资源语言依赖于共享专家，而低资源语言则更多地依赖于语言专用专家，尽管性能较弱。分层干预进一步表明，早期和晚期 MoE 层支持特定于语言的处理，而中间层则充当与语言无关的能力中心。基于这些见解，我们提出了一种路由引导的转向方法，该方法在推理时自适应地将中间层的路由行为引导至与主导语言相关的共享专家，从而实现一致的多语言性能改进，特别是对于语言相关的语言对。我们的代码可以在这个 https URL 上找到。

Title: Kakugo: Distillation of Low-Resource Languages into Small Language Models

Authors: Peter Devine, Mardhiyah Sanni, Farid Adilazuarda, Julieta Gil Loizaga, Barry Haddow
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.14051
Pdf URL: https://arxiv.org/pdf/2601.14051
Copy Paste: [[2601.14051]] Kakugo: Distillation of Low-Resource Languages into Small Language Models(https://arxiv.org/abs/2601.14051)
Keywords: language model, prompt
Abstract: We present Kakugo, a novel and cost-effective pipeline designed to train general-purpose Small Language Models (SLMs) for low-resource languages using only the language name as input. By using a large teacher model to generate synthetic prompts and translate instruction datasets, we produced training data and SLMs for 54 low-resource languages. Evaluations across a diverse set of general natural language processing tasks, including translation, classification, and question answering, demonstrate that our pipeline consistently improves performance over base models. With a total generation and training cost of under $50 per language, Kakugo offers an accessible method for communities to develop language-specific AI.
摘要：我们推出了 Kakugo，一种新颖且经济高效的管道，旨在仅使用语言名称作为输入来训练低资源语言的通用小语言模型 (SLM)。通过使用大型教师模型生成合成提示并翻译指令数据集，我们为 54 种低资源语言生成了训练数据和 SLM。对各种通用自然语言处理任务（包括翻译、分类和问答）的评估表明，我们的管道持续提高了基础模型的性能。 Kakugo 每种语言的总生成和培训成本低于 50 美元，为社区开发特定语言的 AI 提供了一种易于使用的方法。

Title: XCR-Bench: A Multi-Task Benchmark for Evaluating Cultural Reasoning in LLMs

Authors: Mohsinul Kabir, Tasnim Ahmed, Md Mezbaur Rahman, Shaoxiong Ji, Hassan Alhuzali, Sophia Ananiadou
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2601.14063
Pdf URL: https://arxiv.org/pdf/2601.14063
Copy Paste: [[2601.14063]] XCR-Bench: A Multi-Task Benchmark for Evaluating Cultural Reasoning in LLMs(https://arxiv.org/abs/2601.14063)
Keywords: language model, llm
Abstract: Cross-cultural competence in large language models (LLMs) requires the ability to identify Culture-Specific Items (CSIs) and to adapt them appropriately across cultural contexts. Progress in evaluating this capability has been constrained by the scarcity of high-quality CSI-annotated corpora with parallel cross-cultural sentence pairs. To address this limitation, we introduce XCR-Bench, a Cross(X)-Cultural Reasoning Benchmark consisting of 4.9k parallel sentences and 1,098 unique CSIs, spanning three distinct reasoning tasks with corresponding evaluation metrics. Our corpus integrates Newmark's CSI framework with Hall's Triad of Culture, enabling systematic analysis of cultural reasoning beyond surface-level artifacts and into semi-visible and invisible cultural elements such as social norms, beliefs, and values. Our findings show that state-of-the-art LLMs exhibit consistent weaknesses in identifying and adapting CSIs related to social etiquette and cultural reference. Additionally, we find evidence that LLMs encode regional and ethno-religious biases even within a single linguistic setting during cultural adaptation. We release our corpus and code to facilitate future research on cross-cultural NLP.
摘要：大语言模型 (LLM) 中的跨文化能力需要能够识别文化特定项目 (CSI) 并在不同文化背景下适当地调整它们。由于缺乏具有平行跨文化句子对的高质量 CSI 注释语料库，评估这种能力的进展受到限制。为了解决这一限制，我们引入了 XCR-Bench，这是一个 Cross(X)-文化推理基准，由 4.9k 个并行句子和 1,098 个独特的 CSI 组成，涵盖三个不同的推理任务以及相应的评估指标。我们的语料库将 Newmark 的 CSI 框架与 Hall 的文化三元组相结合，从而能够对文化推理进行系统分析，超越表面的文物，深入到半可见和不可见的文化元素，如社会规范、信仰和价值观。我们的研究结果表明，最先进的法学硕士在识别和调整与社会礼仪和文化参考相关的 CSI 方面表现出一贯的弱点。此外，我们发现证据表明，即使在文化适应过程中的单一语言环境中，法学硕士也会编码区域和民族宗教偏见。我们发布语料库和代码，以促进未来跨文化 NLP 的研究。

Title: NewsRECON: News article REtrieval for image CONtextualization

Authors: Jonathan Tonglet, Iryna Gurevych, Tinne Tuytelaars, Marie-Francine Moens
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.14121
Pdf URL: https://arxiv.org/pdf/2601.14121
Copy Paste: [[2601.14121]] NewsRECON: News article REtrieval for image CONtextualization(https://arxiv.org/abs/2601.14121)
Keywords: language model
Abstract: Identifying when and where a news image was taken is crucial for journalists and forensic experts to produce credible stories and debunk misinformation. While many existing methods rely on reverse image search (RIS) engines, these tools often fail to return results, thereby limiting their practical applicability. In this work, we address the challenging scenario where RIS evidence is unavailable. We introduce NewsRECON, a method that links images to relevant news articles to infer their date and location from article metadata. NewsRECON leverages a corpus of over 90,000 articles and integrates: (1) a bi-encoder for retrieving event-relevant articles; (2) two cross-encoders for reranking articles by location and event consistency. Experiments on TARA and 5Pils-OOC show that NewsRECON outperforms prior work and can be combined with a multimodal large language model to achieve new SOTA results in the absence of RIS evidence. We make our code available.

Title: A Systematic Analysis of Chunking Strategies for Reliable Question Answering

Authors: Sofia Bennani, Charles Moslonka
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2601.14123
Pdf URL: https://arxiv.org/pdf/2601.14123
Copy Paste: [[2601.14123]] A Systematic Analysis of Chunking Strategies for Reliable Question Answering(https://arxiv.org/abs/2601.14123)
Keywords: retrieval-augmented generation
Abstract: We study how document chunking choices impact the reliability of Retrieval-Augmented Generation (RAG) systems in industry. While practice often relies on heuristics, our end-to-end evaluation on Natural Questions systematically varies chunking method (token, sentence, semantic, code), chunk size, overlap, and context length. We use a standard industrial setup: SPLADE retrieval and a Mistral-8B generator. We derive actionable lessons for cost-efficient deployment: (i) overlap provides no measurable benefit and increases indexing cost; (ii) sentence chunking is the most cost-effective method, matching semantic chunking up to ~5k tokens; (iii) a "context cliff" reduces quality beyond ~2.5k tokens; and (iv) optimal context depends on the goal (semantic quality peaks at small contexts; exact match at larger ones).
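Illustrative sketch (not from the paper): the two simplest chunking methods the abstract compares, token chunking with optional overlap and sentence chunking under a token budget. Whitespace splitting stands in for a real tokenizer, the regex sentence splitter is a rough approximation, and the function names `token_chunks` and `sentence_chunks` are hypothetical.

```python
import re


def token_chunks(text, size, overlap=0):
    """Fixed-size chunks over whitespace tokens, with optional overlap.

    Assumption: whitespace tokens approximate model tokens.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    tokens = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


def sentence_chunks(text, max_tokens):
    """Greedily pack whole sentences into chunks of at most max_tokens tokens.

    A single sentence longer than max_tokens is kept whole in its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    doc = "One two. Three four five. Six."
    print(token_chunks(doc, size=3, overlap=1))
    print(sentence_chunks(doc, max_tokens=4))
```

With overlap set to 0 every token is indexed once; any positive overlap re-indexes `overlap` tokens per window, which is the indexing cost the abstract refers to in lesson (i).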

Title: Style Transfer as Bias Mitigation: Diffusion Models for Synthetic Mental Health Text for Arabic

Authors: Saad Mankarious, Aya Zirikly
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.14124
Pdf URL: https://arxiv.org/pdf/2601.14124
Copy Paste: [[2601.14124]] Style Transfer as Bias Mitigation: Diffusion Models for Synthetic Mental Health Text for Arabic(https://arxiv.org/abs/2601.14124)
Keywords: language model, llm
Abstract: Synthetic data offers a promising solution for mitigating data scarcity and demographic bias in mental health analysis, yet existing approaches largely rely on pretrained large language models (LLMs), which may suffer from limited output diversity and propagate biases inherited from their training data. In this work, we propose a pretraining-free diffusion-based approach for synthetic text generation that frames bias mitigation as a style transfer problem. Using the CARMA Arabic mental health corpus, which exhibits a substantial gender imbalance, we focus on male-to-female style transfer to augment underrepresented female-authored content. We construct five datasets capturing varying linguistic and semantic aspects of gender expression in Arabic and train separate diffusion models for each setting. Quantitative evaluations demonstrate consistently high semantic fidelity between source and generated text, alongside meaningful surface-level stylistic divergence, while qualitative analysis confirms linguistically plausible gender transformations. Our results show that diffusion-based style transfer can generate high-entropy, semantically faithful synthetic data without reliance on pretrained LLMs, providing an effective and flexible framework for mitigating gender bias in sensitive, low-resource mental health domains.

Title: Lost in the Prompt Order: Revealing the Limitations of Causal Attention in Language Models

Authors: Hyunjong Ok, Jaeho Lee
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.14152
Pdf URL: https://arxiv.org/pdf/2601.14152
Copy Paste: [[2601.14152]] Lost in the Prompt Order: Revealing the Limitations of Causal Attention in Language Models(https://arxiv.org/abs/2601.14152)
Keywords: language model, prompt
Abstract: Large language models exhibit surprising sensitivity to the structure of the prompt, but the mechanisms underlying this sensitivity remain poorly understood. In this work, we conduct an in-depth investigation on a striking case: in multiple-choice question answering, placing context before the questions and options (CQO) outperforms the reverse order (QOC) by over 14%p, consistently over a wide range of models and datasets. Through systematic architectural analysis, we identify causal attention as the core mechanism: in QOC prompts, the causal mask prevents option tokens from attending to context, creating an information bottleneck where context becomes invisible to options.
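Illustrative sketch (not from the paper): a toy causal mask showing why prompt order matters. Each prompt is reduced to three labeled segments, C (context), Q (question), and O (options), and the code checks which segments can attend to which under a lower-triangular mask. The segment lengths and the helper name `segment_visibility` are assumptions made for the example.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def segment_visibility(order, lengths):
    """Map (reader, target) to whether any reader token can attend to target.

    order:   string of segment labels in prompt order, e.g. "CQO" or "QOC"
    lengths: dict giving the (hypothetical) token count of each segment
    """
    labels = [seg for seg in order for _ in range(lengths[seg])]
    mask = causal_mask(len(labels))
    visibility = {}
    for reader in order:
        for target in order:
            if reader == target:
                continue
            visibility[(reader, target)] = any(
                mask[i][j]
                for i, a in enumerate(labels) if a == reader
                for j, b in enumerate(labels) if b == target
            )
    return visibility


if __name__ == "__main__":
    lengths = {"C": 4, "Q": 2, "O": 3}  # toy sizes, not from the paper
    for order in ("CQO", "QOC"):
        vis = segment_visibility(order, lengths)
        print(order, "options can attend to context:", vis[("O", "C")])
```

In the CQO layout the option tokens sit after the context, so they can attend to it; in QOC they precede it and the mask blocks that direction entirely, which matches the bottleneck the abstract describes.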

Title: Domain-Adaptation through Synthetic Data: Fine-Tuning Large Language Models for German Law

Authors: Ali Hamza Bashir, Muhammad Rehan Khalid, Kostadin Cvejoski, Jana Birr, Jule Berghaus, Armin Berger, Sandra Halscheidt, Christian Temath, Rafet Sifa, David Berghaus
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.14160
Pdf URL: https://arxiv.org/pdf/2601.14160
Copy Paste: [[2601.14160]] Domain-Adaptation through Synthetic Data: Fine-Tuning Large Language Models for German Law(https://arxiv.org/abs/2601.14160)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) often struggle in specialized domains such as legal reasoning due to limited expert knowledge, resulting in factually incorrect outputs or hallucinations. This paper presents an effective method for adapting advanced LLMs to German legal question answering through a novel synthetic data generation approach. In contrast to costly human-annotated resources or unreliable synthetic alternatives, our approach systematically produces high-quality, diverse, and legally accurate question-answer pairs directly from authoritative German statutes. Using rigorous automated filtering methods and parameter-efficient fine-tuning techniques, we demonstrate that LLMs adapted with our synthetic dataset significantly outperform their baseline counterparts on German legal question answering tasks. Our results highlight the feasibility of using carefully designed synthetic data as a robust alternative to manual annotation in high-stakes, knowledge-intensive domains.

Title: Human Values in a Single Sentence: Moral Presence, Hierarchies, and Transformer Ensembles on the Schwartz Continuum

Authors: Víctor Yeste, Paolo Rosso
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.14172
Pdf URL: https://arxiv.org/pdf/2601.14172
Copy Paste: [[2601.14172]] Human Values in a Single Sentence: Moral Presence, Hierarchies, and Transformer Ensembles on the Schwartz Continuum(https://arxiv.org/abs/2601.14172)
Keywords: llm
Abstract: We study sentence-level identification of the 19 values in the Schwartz motivational continuum as a concrete formulation of human value detection in text. The setting - out-of-context sentences from news and political manifestos - features sparse moral cues and severe class imbalance. This combination makes fine-grained sentence-level value detection intrinsically difficult, even for strong modern neural models. We first operationalize a binary moral presence task ("does any value appear?") and show that it is learnable from single sentences (positive-class F1 $\approx$ 0.74 with calibrated thresholds). We then compare a presence-gated hierarchy to a direct multi-label classifier under matched compute, both based on DeBERTa-base and augmented with lightweight signals (prior-sentence context, LIWC-22/eMFD/MJD lexica, and topic features). The hierarchy does not outperform direct prediction, indicating that gate recall limits downstream gains. We also benchmark instruction-tuned LLMs - Gemma 2 9B, Llama 3.1 8B, Mistral 8B, and Qwen 2.5 7B - in zero-/few-shot and QLoRA setups and build simple ensembles; a soft-vote supervised ensemble reaches macro-F1 0.332, significantly surpassing the best single supervised model and exceeding prior English-only baselines. Overall, in this scenario, lightweight signals and small ensembles yield the most reliable improvements, while hierarchical gating offers limited benefit. We argue that, under an 8 GB single-GPU constraint and at the 7-9B scale, carefully tuned supervised encoders remain a strong and compute-efficient baseline for structured human value detection, and we outline how richer value structure and sentence-in-document context could further improve performance.
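Illustrative sketch (not from the paper): soft voting for a multi-label classifier, as a generic reading of the "soft-vote supervised ensemble" mentioned above. Each model outputs one probability per label, the ensemble averages them, and a per-label threshold turns the mean into a decision. The equal model weights, the example probabilities, and the thresholds are all assumptions.

```python
def soft_vote(prob_sets, thresholds):
    """Average per-label probabilities across models, then threshold.

    prob_sets:  list with one entry per model; each entry is a list of
                per-label probabilities for the same sentence
    thresholds: one decision threshold per label (assumed to be calibrated
                elsewhere, e.g. on a validation split)
    Returns (predictions, mean_probabilities).
    """
    n_models = len(prob_sets)
    n_labels = len(thresholds)
    mean = [sum(p[k] for p in prob_sets) / n_models for k in range(n_labels)]
    preds = [int(m >= t) for m, t in zip(mean, thresholds)]
    return preds, mean


if __name__ == "__main__":
    # Three hypothetical models scoring three hypothetical value labels.
    probs = [
        [0.9, 0.2, 0.4],
        [0.7, 0.1, 0.5],
        [0.8, 0.3, 0.3],
    ]
    preds, mean = soft_vote(probs, thresholds=[0.5, 0.5, 0.35])
    print(preds, [round(m, 3) for m in mean])
```

Averaging probabilities rather than hard votes lets a confident model outweigh two marginal ones, which is one reason soft voting can help under severe class imbalance.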

Title: HALT: Hallucination Assessment via Latent Testing

Authors: Rohan Bhatnagar, Youran Sun, Chi Andrew Zhang, Yixin Wen, Haizhao Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.14210
Pdf URL: https://arxiv.org/pdf/2601.14210
Copy Paste: [[2601.14210]] HALT: Hallucination Assessment via Latent Testing(https://arxiv.org/abs/2601.14210)
Keywords: language model, llm, hallucination, agent
Abstract: Hallucination in large language models (LLMs) can be understood as a failure of faithful readout: although internal representations may encode uncertainty about a query, decoding pressures still yield a fluent answer. We propose lightweight residual probes that read hallucination risk directly from intermediate hidden states of question tokens, motivated by the hypothesis that these layers retain epistemic signals that are attenuated in the final decoding stage. The probe is a small auxiliary network whose computation is orders of magnitude cheaper than token generation and can be evaluated fully in parallel with inference, enabling near-instantaneous hallucination risk estimation with effectively zero added latency in low-risk cases. We deploy the probe as an agentic critic for fast selective generation and routing, allowing LLMs to immediately answer confident queries while delegating uncertain ones to stronger verification pipelines. Across four QA benchmarks and multiple LLM families, the method achieves strong AUROC and AURAC, generalizes under dataset shift, and reveals interpretable structure in intermediate representations, positioning fast internal uncertainty readout as a principled foundation for reliable agentic AI.
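Illustrative sketch (not from the paper): selective routing driven by a probe over a hidden state. The paper describes the probe as a small auxiliary network; this example swaps in a single logistic unit to keep the idea visible. The weights, the two-dimensional hidden state, the threshold, and the names `probe_risk` and `route` are all hypothetical.

```python
import math


def probe_risk(hidden, weights, bias):
    """Logistic probe: map a hidden-state vector to a risk score in (0, 1)."""
    z = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def route(hidden, weights, bias, threshold=0.5):
    """Answer directly when estimated risk is low, otherwise escalate."""
    risk = probe_risk(hidden, weights, bias)
    decision = "answer" if risk < threshold else "verify"
    return decision, risk


if __name__ == "__main__":
    weights, bias = [1.0, 1.0], -1.0  # made-up probe parameters
    for hidden in ([0.0, 0.0], [2.0, 0.0]):
        decision, risk = route(hidden, weights, bias)
        print(hidden, decision, round(risk, 3))
```

Because the probe reads only the hidden state of the question tokens, its cost is independent of how long the generated answer would be, which is what makes the low-latency routing described in the abstract plausible.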

Title: MASCOT: Towards Multi-Agent Socio-Collaborative Companion Systems

Authors: Yiyang Wang, Yiqiao Jin, Alex Cabral, Josiah Hester
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2601.14230
Pdf URL: https://arxiv.org/pdf/2601.14230
Copy Paste: [[2601.14230]] MASCOT: Towards Multi-Agent Socio-Collaborative Companion Systems(https://arxiv.org/abs/2601.14230)
Keywords: agent
Abstract: Multi-agent systems (MAS) have recently emerged as promising socio-collaborative companions for emotional and cognitive support. However, these systems frequently suffer from persona collapse--where agents revert to generic, homogenized assistant behaviors--and social sycophancy, which produces redundant, non-constructive dialogue. We propose MASCOT, a generalizable framework for multi-perspective socio-collaborative companions. MASCOT introduces a novel bi-level optimization strategy to harmonize individual and collective behaviors: 1) Persona-Aware Behavioral Alignment, an RLAIF-driven pipeline that finetunes individual agents for strict persona fidelity to prevent identity loss; and 2) Collaborative Dialogue Optimization, a meta-policy guided by group-level rewards to ensure diverse and productive discourse. Extensive evaluations across psychological support and workplace domains demonstrate that MASCOT significantly outperforms state-of-the-art baselines, achieving improvements of up to +14.1 in Persona Consistency and +10.6 in Social Contribution. Our framework provides a practical roadmap for engineering the next generation of socially intelligent multi-agent systems.

Title: APEX-Agents

Authors: Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman, Marco Burstein, Julien Benchek, David Ostrofsky, Anirudh Ravichandran, Debnil Sur, Neel Venugopal, Alannah Hsia, Isaac Robinson, Calix Huang, Olivia Varones, Daniyal Khan, Michael Haines, Zach Richards, Chirag Mahapatra, Brendan Foody, Osvald Nitski
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.14242
Pdf URL: https://arxiv.org/pdf/2601.14242
Copy Paste: [[2601.14242]] APEX-Agents(https://arxiv.org/abs/2601.14242)
Keywords: gpt, prompt, agent
Abstract: We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate lawyers. APEX-Agents requires agents to navigate realistic work environments with files and tools. We test eight agents for the leaderboard using Pass@1. Gemini 3 Flash (Thinking=High) achieves the highest score of 24.0%, followed by GPT-5.2 (Thinking=High), Claude Opus 4.5 (Thinking=High), and Gemini 3 Pro (Thinking=High). We open source the APEX-Agents benchmark (n=480) with all prompts, rubrics, gold outputs, files, and metadata. We also open-source Archipelago, our infrastructure for agent execution and evaluation.

Title: Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment

Authors: Yuming Yang, Mingyoung Lai, Wanxu Zhao, Xiaoran Fan, Zhiheng Xi, Mingqi Wu, Chiyue Huang, Jun Zhao, Haijun Lv, Jian Tong, Yunhua Zhou, Yicheng Zou, Qipeng Guo, Tao Gui, Qi Zhang, Xuanjing Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.14249
Pdf URL: https://arxiv.org/pdf/2601.14249
Copy Paste: [[2601.14249]] Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment(https://arxiv.org/abs/2601.14249)
Keywords: llm, chain-of-thought
Abstract: Long chain-of-thought (CoT) trajectories provide rich supervision signals for distilling reasoning from teacher to student LLMs. However, both prior work and our experiments show that trajectories from stronger teachers do not necessarily yield better students, highlighting the importance of data-student suitability in distillation. Existing methods assess suitability primarily through student likelihood, favoring trajectories that closely align with the model's current behavior but overlooking more informative ones. Addressing this, we propose Rank-Surprisal Ratio (RSR), a simple metric that captures both alignment and informativeness to assess the suitability of a reasoning trajectory. RSR is motivated by the observation that effective trajectories typically combine low absolute probability with relatively high-ranked tokens under the student model, balancing learning signal strength and behavioral alignment. Concretely, RSR is defined as the ratio of a trajectory's average token-wise rank to its average negative log-likelihood, and is straightforward to compute and interpret. Across five student models and reasoning trajectories from 11 diverse teachers, RSR strongly correlates with post-training performance (average Spearman 0.86), outperforming existing metrics. We further demonstrate its practical utility in both trajectory selection and teacher selection.
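Illustrative sketch (not from the paper): the abstract defines RSR as the ratio of a trajectory's average token-wise rank to its average negative log-likelihood under the student model. The code below computes that ratio from per-step probability distributions. The 1-indexed rank convention, the natural logarithm, and the function name `rank_surprisal_ratio` are assumptions; the paper may handle ties, clipping, or normalization differently.

```python
import math


def rank_surprisal_ratio(step_probs, token_ids):
    """Average token rank divided by average negative log-likelihood.

    step_probs: for each position, the student's probability distribution
                over the vocabulary (a list of floats summing to 1)
    token_ids:  the trajectory's actual token id at each position
    Rank is 1 for the most probable token (assumed convention).
    """
    ranks, surprisals = [], []
    for probs, tok in zip(step_probs, token_ids):
        p = probs[tok]
        # Rank = 1 + number of vocabulary entries with strictly higher probability.
        ranks.append(1 + sum(q > p for q in probs))
        surprisals.append(-math.log(p))
    mean_rank = sum(ranks) / len(ranks)
    mean_surprisal = sum(surprisals) / len(surprisals)
    return mean_rank / mean_surprisal  # undefined if every token has p == 1


if __name__ == "__main__":
    # Toy 3-token vocabulary, 2-step trajectory (hypothetical numbers).
    step_probs = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]]
    token_ids = [1, 2]
    print(round(rank_surprisal_ratio(step_probs, token_ids), 4))
```

Under this reading, a trajectory whose tokens are fairly surprising (high negative log-likelihood) yet still ranked near the top by the student gets a small ratio, which corresponds to the combination the abstract describes as effective.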