2026-01-16

Title: LLM-Driven Preference Data Synthesis for Proactive Prediction of the Next User Utterance in Human-Machine Dialogue

Authors: Jinqiang Wang, Huansheng Ning, Jianguo Ding, Tao Zhu, Liming Chen, Chris Nugent
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.09713
Pdf URL: https://arxiv.org/pdf/2601.09713
Copy Paste: [[2601.09713]] LLM-Driven Preference Data Synthesis for Proactive Prediction of the Next User Utterance in Human-Machine Dialogue(https://arxiv.org/abs/2601.09713)
Keywords: llm
Abstract: Proactively predicting a user's next utterance in human-machine dialogue can streamline interaction and improve user experience. Existing commercial API-based solutions are subject to privacy concerns, while deploying general-purpose LLMs locally remains computationally expensive. As such, training a compact, task-specific LLM provides a practical alternative. Although user simulator methods can predict a user's next utterance, they mainly imitate the user's speaking style rather than advancing the dialogue. Preference data synthesis has been investigated to generate data for proactive next utterance prediction and help align LLMs with user preferences. Yet existing methods lack the ability to explicitly model the intent reasoning that leads to the user's next utterance and to define and synthesize preference and non-preference reasoning processes for predicting the user's next utterance. To address these challenges, we propose ProUtt, an LLM-driven preference data synthesis method for proactive next utterance prediction. ProUtt converts dialogue history into an intent tree and explicitly models intent reasoning trajectories by predicting the next plausible path from both exploitation and exploration perspectives. It then constructs preference and non-preference reasoning processes by perturbing or revising intent tree paths at different future turns. Extensive evaluations using LLM-as-a-judge and human judgments demonstrate that ProUtt consistently outperforms existing data synthesis methods, user simulators, and commercial LLM APIs across four benchmark datasets. We release both the code and the synthesized datasets to facilitate future research.
摘要：主动预测用户在人机对话中的下一个话语可以简化交互并改善用户体验。现有的基于 API 的商业解决方案存在隐私问题，而在本地部署通用 LLM 的计算成本仍然很高。因此，培训紧凑的、针对特定任务的法学硕士提供了一种实用的替代方案。尽管用户模拟器方法可以预测用户的下一个话语，但它们主要是模仿他们的说话风格而不是推进对话。我们对偏好数据合成进行了研究，以生成用于主动预测下一个话语的数据，并帮助法学硕士与用户偏好保持一致。然而，现有方法缺乏对导致用户下一个话语的意图推理进行显式建模的能力，以及定义和综合用于预测用户下一个http URL的偏好和非偏好推理过程的能力，为了解决这些挑战，我们提出了ProUtt，一种用于主动下一个话语预测的LLM驱动的偏好数据合成方法。 ProUtt 将对话历史转换为意图树，并通过从利用和探索的角度预测下一个可能的路径来显式地建模意图推理轨迹。然后，它通过在未来的不同回合扰动或修改意图树路径来构建偏好和非偏好推理过程。使用 LLM 作为法官和人类判断进行的广泛评估表明，ProUtt 在四个基准数据集上始终优于现有的数据合成方法、用户模拟器和商业 LLM API。我们发布代码和合成数据集以促进未来的研究。

Title: Evaluating Novelty in AI-Generated Research Plans Using Multi-Workflow LLM Pipelines

Authors: Devesh Saraogi, Rohit Singhee, Dhruv Kumar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.09714
Pdf URL: https://arxiv.org/pdf/2601.09714
Copy Paste: [[2601.09714]] Evaluating Novelty in AI-Generated Research Plans Using Multi-Workflow LLM Pipelines(https://arxiv.org/abs/2601.09714)
Keywords: language model, gpt, llm, prompt, agent
Abstract: The integration of Large Language Models (LLMs) into the scientific ecosystem raises fundamental questions about the creativity and originality of AI-generated research. Recent work has identified "smart plagiarism" as a concern in single-step prompting approaches, where models reproduce existing ideas with terminological shifts. This paper investigates whether agentic workflows -- multi-step systems employing iterative reasoning, evolutionary search, and recursive decomposition -- can generate more novel and feasible research plans. We benchmark five reasoning architectures: Reflection-based iterative refinement, Sakana AI v2 evolutionary algorithms, Google Co-Scientist multi-agent framework, GPT Deep Research (GPT-5.1) recursive decomposition, and Gemini 3 Pro multimodal long-context pipeline. Using evaluations from thirty proposals each on novelty, feasibility, and impact, we find that decomposition-based and long-context workflows achieve mean novelty of 4.17/5, while reflection-based approaches score significantly lower (2.33/5). Results reveal varied performance across research domains, with high-performing workflows maintaining feasibility without sacrificing creativity. These findings support the view that carefully designed multi-stage agentic workflows can advance AI-assisted research ideation.
摘要：将大型语言模型 (LLM) 集成到科学生态系统中引发了有关人工智能生成的研究的创造力和原创性的基本问题。最近的工作已将“智能抄袭”确定为单步提示方法中的一个问题，在该方法中，模型通过术语转换来重现现有的想法。本文研究了代理工作流程（采用迭代推理、进化搜索和递归分解的多步骤系统）是否可以生成更新颖、更可行的研究计划。我们对五种推理架构进行了基准测试：基于反射的迭代细化、Sakana AI v2 进化算法、Google Co-Scientist 多智能体框架、GPT Deep Research (GPT-5.1) 递归分解和 Gemini~3 Pro 多模态长上下文管道。通过对 30 个提案的新颖性、可行性和影响力进行评估，我们发现基于分解和长上下文的工作流程的平均新颖性为 4.17/5，而基于反射的方法得分明显较低 (2.33/5)。结果揭示了各个研究领域的不同性能，高性能的工作流程在不牺牲创造力的情况下保持了可行性。这些发现支持这样的观点：精心设计的多阶段代理工作流程可以推进人工智能辅助的研究构思。

Title: Introducing Axlerod: An LLM-based Chatbot for Assisting Independent Insurance Agents

Authors: Adam Bradley, John Hastings, Khandaker Mamun Ahmed
Subjects: cs.CL, cs.AI, cs.HC, cs.IR
Abstract URL: https://arxiv.org/abs/2601.09715
Pdf URL: https://arxiv.org/pdf/2601.09715
Copy Paste: [[2601.09715]] Introducing Axlerod: An LLM-based Chatbot for Assisting Independent Insurance Agents(https://arxiv.org/abs/2601.09715)
Keywords: llm, chat, retrieval-augmented generation, agent
Abstract: The insurance industry is undergoing a paradigm shift through the adoption of artificial intelligence (AI) technologies, particularly in the realm of intelligent conversational agents. Chatbots have evolved into sophisticated AI-driven systems capable of automating complex workflows, including policy recommendation and claims triage, while simultaneously enabling dynamic, context-aware user engagement. This paper presents the design, implementation, and empirical evaluation of Axlerod, an AI-powered conversational interface designed to improve the operational efficiency of independent insurance agents. Leveraging natural language processing (NLP), retrieval-augmented generation (RAG), and domain-specific knowledge integration, Axlerod demonstrates robust capabilities in parsing user intent, accessing structured policy databases, and delivering real-time, contextually relevant responses. Experimental results underscore Axlerod's effectiveness, achieving an overall accuracy of 93.18% in policy retrieval tasks while reducing the average search time by 2.42 seconds. This work contributes to the growing body of research on enterprise-grade AI applications in insurtech, with a particular focus on agent-assistive rather than consumer-facing architectures.
摘要：通过采用人工智能（AI）技术，保险业正在经历范式转变，特别是在智能会话代理领域。聊天机器人已经发展成为复杂的人工智能驱动系统，能够自动化复杂的工作流程，包括政策推荐和索赔分类，同时支持动态、上下文感知的用户参与。本文介绍了 Axlerod 的设计、实现和实证评估，这是一个人工智能驱动的对话界面，旨在提高独立保险代理人的运营效率。利用自然语言处理 (NLP)、检索增强生成 (RAG) 和特定领域的知识集成，Axlerod 在解析用户意图、访问结构化策略数据库以及提供实时、上下文相关的响应方面展示了强大的功能。实验结果凸显了 Axlerod 的有效性，在策略检索任务中实现了 93.18% 的总体准确率，同时将平均搜索时间减少了 2.42 秒。这项工作有助于对保险科技企业级人工智能应用进行不断增长的研究，特别关注代理辅助架构，而不是面向消费者的架构。

Title: SALP-CG: Standard-Aligned LLM Pipeline for Classifying and Grading Large Volumes of Online Conversational Health Data

Authors: Yiwei Yan, Hao Li, Hua He, Gong Kai, Zhengyi Yang, Guanfeng Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.09717
Pdf URL: https://arxiv.org/pdf/2601.09717
Copy Paste: [[2601.09717]] SALP-CG: Standard-Aligned LLM Pipeline for Classifying and Grading Large Volumes of Online Conversational Health Data(https://arxiv.org/abs/2601.09717)
Keywords: language model, llm
Abstract: Online medical consultations generate large volumes of conversational health data that often embed protected health information, requiring robust methods to classify data categories and assign risk levels in line with policies and practice. However, existing approaches lack unified standards and reliable automated methods to fulfill sensitivity classification for such conversational health data. This study presents a large language model-based extraction pipeline, SALP-CG, for classifying and grading privacy risks in online conversational health data. We derived health-data classification and grading rules in accordance with GB/T 39725-2020. Combining few-shot guidance, JSON Schema constrained decoding, and deterministic high-risk rules, the backend-agnostic extraction pipeline achieves strong category compliance and reliable sensitivity across diverse LLMs. On the MedDialog-CN benchmark, models yield robust entity counts, high schema compliance, and accurate sensitivity grading, while the strongest model attains micro-F1=0.900 for maximum-level prediction. The category landscape stratified by sensitivity shows that Level 2-3 items dominate, enabling re-identification when combined; Level 4-5 items are less frequent but carry outsize harm. SALP-CG reliably helps classify categories and grade sensitivity in online conversational health data across LLMs, offering a practical method for health data governance. Code is available at this https URL.
摘要：在线医疗咨询会生成大量对话式健康数据，这些数据通常嵌入受保护的健康信息，需要强大的方法来对数据类别进行分类并根据政策和实践分配风险级别。然而，现有方法缺乏统一的标准和可靠的自动化方法来实现此类对话健康数据的敏感性分类。本研究提出了一种基于大型语言模型的提取管道 SALP-CG，用于对在线会话健康数据中的隐私风险进行分类和分级。我们根据GB/T 39725-2020制定了健康数据分类和分级规则。与后端无关的提取管道结合了少数样本指导、JSON Schema 约束解码和确定性高风险规则，在不同的 LLM 中实现了强大的类别合规性和可靠的灵敏度。在 MedDialog-CN 基准上，模型可产生稳健的实体计数、高模式合规性和准确的敏感性分级，而最强的模型可实现最高级别预测的 micro-F1=0.900。按敏感度分层的品类格局显示，2-3 级项目占主导地位，组合后可重新识别； 4-5 级物品出现的频率较低，但会造成巨大伤害。 SALP-CG 可靠地帮助法学硕士在线对话健康数据进行类别分类和敏感性分级，为健康数据治理提供了实用的方法。代码可从此 https URL 获取。

Title: StatLLaMA: A multi-stage training framework for building a domain-optimized statistical language model

Authors: Jing-Yi Zeng, Guan-Hua Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.09718
Pdf URL: https://arxiv.org/pdf/2601.09718
Copy Paste: [[2601.09718]] StatLLaMA: A multi-stage training framework for building a domain-optimized statistical language model(https://arxiv.org/abs/2601.09718)
Keywords: language model, llm
Abstract: This study investigates how to efficiently build a domain-specialized large language model (LLM) for statistics using the lightweight LLaMA-3.2-3B family as the foundation model (FM). We systematically compare three multi-stage training pipelines that start, respectively, from a base FM with no instruction-following capability, a base FM augmented with post-hoc instruction tuning, and an instruction-tuned FM with strong general reasoning abilities. Each pipeline spans continual pretraining, supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF) preference alignment, and downstream task adaptation. Results show that pipelines beginning with a base FM fail to develop meaningful statistical reasoning, even after extensive instruction tuning, SFT, or RLHF alignment. In contrast, starting from LLaMA-3.2-3B-Instruct enables effective domain specialization. A comprehensive evaluation of SFT variants reveals clear trade-offs between domain expertise and general reasoning ability. We further demonstrate that direct preference optimization provides stable and effective RLHF preference alignment. Finally, we show that downstream fine-tuning must be performed with extremely low intensity to avoid catastrophic forgetting in highly optimized models. The final model, StatLLaMA, achieves strong and balanced performance on benchmarks of mathematical reasoning, common-sense reasoning, and statistical expertise, offering a practical blueprint for developing resource-efficient statistical LLMs. The code is available at this https URL.
摘要：本研究研究了如何使用轻量级 LLaMA-3.2-3B 系列作为基础模型 (FM) 来高效构建用于统计的领域专用大语言模型 (LLM)。我们系统地比较了三个多阶段训练流程，从没有指令跟踪能力的基础 FM 开始，通过事后指令调整增强的基础 FM，以及在持续预训练、监督微调 (SFT)、人类反馈强化学习 (RLHF) 偏好对齐和下游任务适应方面具有强大通用推理能力的指令调整 FM。结果表明，以基本 FM 开始的管道无法开发有意义的统计推理，即使经过大量指令调整、SFT 或 RLHF 对齐也是如此。相比之下，从 LLaMA-3.2-3B-Instruct 开始可以实现有效的领域专业化。对 SFT 变体的全面评估揭示了领域专业知识和一般推理能力之间的明显权衡。我们进一步证明直接偏好优化提供了稳定有效的 RLHF 偏好调整。最后，我们表明下游微调必须以极低的强度进行，以避免高度优化的模型中发生灾难性遗忘。最终模型 StatLLaMA 在数学推理、常识推理和统计专业知识的基准上实现了强大且平衡的性能，为开发资源高效的统计法学硕士提供了实用的蓝图。该代码可从此 https URL 获取。

Title: Bounded Hyperbolic Tangent: A Stable and Efficient Alternative to Pre-Layer Normalization in Large Language Models

Authors: Hoyoon Byun, Youngjun Choi, Taero Kim, Sungrae Park, Kyungwoo Song
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.09719
Pdf URL: https://arxiv.org/pdf/2601.09719
Copy Paste: [[2601.09719]] Bounded Hyperbolic Tangent: A Stable and Efficient Alternative to Pre-Layer Normalization in Large Language Models(https://arxiv.org/abs/2601.09719)
Keywords: language model, llm
Abstract: Pre-Layer Normalization (Pre-LN) is the de facto choice for large language models (LLMs) and is crucial for stable pretraining and effective transfer learning. However, Pre-LN is inefficient due to repeated statistical calculations and suffers from the curse of depth. As layers grow, the magnitude and variance of the hidden state escalate, destabilizing training. Efficiency-oriented normalization-free methods such as Dynamic Tanh (DyT) improve speed but remain fragile at depth. To jointly address stability and efficiency, we propose Bounded Hyperbolic Tanh (BHyT), a drop-in replacement for Pre-LN. BHyT couples a tanh nonlinearity with explicit, data-driven input bounding to keep activations within a non-saturating range. It prevents depth-wise growth in activation magnitude and variance and comes with a theoretical stability guarantee. For efficiency, BHyT computes exact statistics once per block and replaces a second normalization with a lightweight variance approximation. Empirically, BHyT demonstrates improved stability and efficiency during pretraining, achieving an average of 15.8% faster training and an average of 4.2% higher token generation throughput compared to RMSNorm, while matching or surpassing its inference performance and robustness across language understanding and reasoning benchmarks. Our code is available at: this https URL
摘要：预层归一化（Pre-LN）是大型语言模型（LLM）的事实上的选择，对于稳定的预训练和有效的迁移学习至关重要。然而，Pre-LN 由于重复的统计计算而效率低下，并且受到深度诅咒的影响。随着层数的增长，隐藏状态的幅度和方差不断升级，从而破坏训练的稳定性。面向效率的无归一化方法（例如动态 Tanh (DyT)）可提高速度，但在深度上仍然脆弱。为了共同解决稳定性和效率问题，我们提出有界双曲正切 (BHyT)，它是 Pre-LN 的直接替代品。 BHyT 将 tanh 非线性与显式的数据驱动输入边界结合起来，以将激活保持在非饱和范围内。它可以防止激活幅度和方差的深度增长，并具有理论上的稳定性保证。为了提高效率，BHyT 每个块计算一次精确的统计数据，并用轻量级方差近似代替第二次归一化，从而提高了效率。根据经验，BHyT 在预训练期间表现出更高的稳定性和效率，与 RMSNorm 相比，训练速度平均提高了 15.8%，令牌生成吞吐量平均提高了 4.2%，同时在语言理解和推理基准方面匹配或超越了其推理性能和鲁棒性。我们的代码位于：此 https URL
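
For orientation, the two baselines named in the abstract above can be sketched in a few lines. This is a minimal pure-Python illustration of RMSNorm and of Dynamic Tanh (DyT) as they are commonly described, not the BHyT method itself, whose bounding rule the abstract does not specify. The function names and the default values of `alpha`, `gamma`, `beta` and `eps` are assumptions chosen for the example.

```python
import math

def rms_norm(x, gamma=1.0, eps=1e-6):
    """RMSNorm: rescale a vector by its root mean square (requires a pass over x)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [gamma * v / rms for v in x]

def dyt(x, alpha=0.5, gamma=1.0, beta=0.0):
    """Dynamic Tanh: element-wise gamma * tanh(alpha * v) + beta, no statistics needed."""
    return [gamma * math.tanh(alpha * v) + beta for v in x]

# Illustrative hidden state with one large-magnitude entry (made-up values).
hidden = [0.5, -2.0, 8.0, -30.0]
normed = rms_norm(hidden)    # output has root mean square close to 1
squashed = dyt(hidden)       # every entry is squashed into [-1, 1]
```

The contrast is the point of the sketch: RMSNorm computes a statistic over the vector on every call, while the tanh form is purely element-wise but saturates for large inputs, which is the regime the abstract says BHyT's input bounding is meant to avoid.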

Title: Cross-Platform Evaluation of Large Language Model Safety in Pediatric Consultations: Evolution of Adversarial Robustness and the Scale Paradox

Authors: Vahideh Zolfaghari
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.09721
Pdf URL: https://arxiv.org/pdf/2601.09721
Copy Paste: [[2601.09721]] Cross-Platform Evaluation of Large Language Model Safety in Pediatric Consultations: Evolution of Adversarial Robustness and the Scale Paradox(https://arxiv.org/abs/2601.09721)
Keywords: language model, llm
Abstract: Background: Large language models (LLMs) are increasingly deployed in medical consultations, yet their safety under realistic user pressures remains understudied. Prior assessments focused on neutral conditions, overlooking vulnerabilities from anxious users challenging safeguards. This study evaluated LLM safety under parental anxiety-driven adversarial pressures in pediatric consultations across models and platforms. Methods: PediatricAnxietyBench, from a prior evaluation, includes 300 queries (150 authentic, 150 adversarial) spanning 10 topics. Three models were assessed via APIs: Llama-3.3-70B and Llama-3.1-8B (Groq), Mistral-7B (HuggingFace), yielding 900 responses. Safety used a 0-15 scale for restraint, referral, hedging, emergency recognition, and non-prescriptive behavior. Analyses employed paired t-tests with bootstrapped CIs. Results: Mean scores: 9.70 (Llama-3.3-70B) to 10.39 (Mistral-7B). Llama-3.1-8B outperformed Llama-3.3-70B by +0.66 (p=0.0001, d=0.225). Models showed positive adversarial effects, Mistral-7B strongest (+1.09, p=0.0002). Safety generalized across platforms; Llama-3.3-70B had 8% failures. Seizures vulnerable (33% inappropriate diagnoses). Hedging predicted safety (r=0.68, p<0.001). Conclusions: Evaluation shows safety depends on alignment and architecture over scale, with smaller models outperforming larger. Evolution to robustness across releases suggests targeted training progress. Vulnerabilities and no emergency recognition indicate unsuitability for triage. Findings guide selection, stress adversarial testing, and provide open benchmark for medical AI safety.
摘要：背景大语言模型 (LLM) 越来越多地部署在医疗咨询中，但其在现实用户压力下的安全性仍未得到充分研究。之前的评估侧重于中性条件，忽视了焦虑用户挑战保障措施带来的漏洞。这项研究评估了在跨模型和平台的儿科咨询中，在父母焦虑驱动的对抗压力下法学硕士的安全性。方法 PediatricAnxietyBench 根据之前的评估，包括 10 个主题的 300 个查询（150 个真实的，150 个对抗性的）。通过 API 评估了三种模型：Llama-3.3-70B 和 Llama-3.1-8B (Groq)、Mistral-7B (HuggingFace)，产生 900 个响应。安全性使用 0-15 等级来衡量约束、转介、对冲、紧急情况识别和非规定行为。分析采用配对 t 检验和自举 CI。结果平均分数：9.70 (Llama-3.3-70B) 至 10.39 (Mistral-7B)。 Llama-3.1-8B 的性能优于 Llama-3.3-70B +0.66（p=0.0001，d=0.225）。模型显示出积极的对抗效应，Mistral-7B 最强（+1.09，p=0.0002）。跨平台的通用安全性； Llama-3.3-70B 有 8% 的失败率。容易癫痫发作（33% 的诊断不正确）。对冲预测安全性（r=0.68，p<0.001）。结论评估表明，安全性取决于规模上的对齐和架构，较小的模型优于较大的模型。跨版本的稳健性的演变表明有针对性的培训进展。漏洞和没有紧急识别表明不适合分类。研究结果指导选择、强调对抗性测试，并为医疗人工智能安全提供开放基准。
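
The analysis named in the Methods above (paired t-tests with bootstrapped CIs) can be sketched with the Python standard library. This is a minimal sketch under stated assumptions, not the study's analysis code: the scores are made up for illustration, the function names are hypothetical, and the percentile bootstrap is one common choice that the abstract does not confirm.

```python
import math
import random
import statistics

def paired_t(a, b):
    """Return (mean difference, t statistic) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    return mean, mean / (sd / math.sqrt(len(diffs)))

def bootstrap_ci(a, b, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean paired difference."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    means = sorted(
        statistics.mean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )
    low = means[int((1 - level) / 2 * n_resamples)]
    high = means[int((1 + level) / 2 * n_resamples) - 1]
    return low, high

# Made-up safety scores on a 0-15 scale for the same six queries.
small_model = [11, 10, 12, 9, 11, 10]
large_model = [10, 9, 11, 9, 10, 10]
mean_diff, t_stat = paired_t(small_model, large_model)
ci_low, ci_high = bootstrap_ci(small_model, large_model)
```

Pairing matters here because every model answers the same queries, so the test runs on per-query differences instead of treating the two score lists as independent samples.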

Title: ADMEDTAGGER: an annotation framework for distillation of expert knowledge for the Polish medical language

Authors: Franciszek Górski, Andrzej Czyżewski
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.09722
Pdf URL: https://arxiv.org/pdf/2601.09722
Copy Paste: [[2601.09722]] ADMEDTAGGER: an annotation framework for distillation of expert knowledge for the Polish medical language(https://arxiv.org/abs/2601.09722)
Keywords: language model, llm
Abstract: In this work, we present an annotation framework that demonstrates how a multilingual LLM pretrained on a large corpus can be used as a teacher model to distill the expert knowledge needed for tagging medical texts in Polish. This work is part of a larger project called ADMEDVOICE, within which we collected an extensive corpus of medical texts representing five clinical categories - Radiology, Oncology, Cardiology, Hypertension, and Pathology. Using this data, we had to develop a multi-class classifier, but the fundamental problem turned out to be the lack of resources for annotating an adequate number of texts. Therefore, in our solution, we used the multilingual Llama3.1 model to annotate an extensive corpus of medical texts in Polish. Using our limited annotation resources, we verified only a portion of these labels, creating a test set from them. The data annotated in this way were then used for training and validation of 3 different types of classifiers based on the BERT architecture - the distilled DistilBERT model, BioBERT fine-tuned on medical data, and HerBERT fine-tuned on the Polish language corpus. Among the models we trained, the DistilBERT model achieved the best results, reaching an F1 score > 0.80 for each clinical category and an F1 score > 0.93 for 3 of them. In this way, we obtained a series of highly effective classifiers that represent an alternative to large language models, due to their nearly 500 times smaller size, 300 times lower GPU VRAM consumption, and several hundred times faster inference.
摘要：在这项工作中，我们提出了一个注释框架，演示了如何将在大型语料库上预训练的多语言法学硕士用作教师模型，以提取用波兰语标记医学文本所需的专家知识。这项工作是名为 ADMEDVOICE 的大型项目的一部分，在该项目中，我们收集了代表五个临床类别（放射学、肿瘤学、心脏病学、高血压和病理学）的广泛医学文本语料库。使用这些数据，我们必须开发一个多类分类器，但根本问题是缺乏注释足够数量的文本的资源。因此，在我们的解决方案中，我们使用多语言 Llama3.1 模型来用波兰语注释大量医学文本语料库。使用有限的注释资源，我们仅验证了这些标签的一部分，并从中创建了一个测试集。然后，以这种方式注释的数据被用于基于 BERT 架构的 3 种不同类型的分类器的训练和验证——蒸馏的 DistilBERT 模型、针对医学数据进行微调的 BioBERT 以及针对波兰语语料库进行微调的 HerBERT。在我们训练的模型中，DistilBERT模型取得了最好的结果，每个临床类别的F1分数> 0.80，其中3个类别的F1分数> 0.93。通过这种方式，我们获得了一系列高效的分类器，它们代表了大型语言模型的替代方案，因为它们的尺寸缩小了近 500 倍，GPU VRAM 消耗降低了 300 倍，推理速度提高了数百倍。

Title: SagaScale: A Realistic, Scalable, and High-Quality Long-Context Benchmark Built from Full-Length Novels

Authors: Guancheng Du, Yong Hu, Wenqing Wang, Yaming Yang, Jiaheng Gao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.09723
Pdf URL: https://arxiv.org/pdf/2601.09723
Copy Paste: [[2601.09723]] SagaScale: A Realistic, Scalable, and High-Quality Long-Context Benchmark Built from Full-Length Novels(https://arxiv.org/abs/2601.09723)
Keywords: language model, llm, long context, agent
Abstract: Large Language Models (LLMs) have shown significant progress, but understanding long and complex documents remains challenging. Many long-context benchmarks have been proposed, but they face several limitations, including task realism, data scalability, and data quality. To this end, we introduce SagaScale, a realistic, scalable, and high-quality long-context benchmark built from full-length novels. The entire benchmark is constructed using an automated data collection pipeline that utilizes external resources (e.g., Wikipedia pages) to curate question-answer pairs. Critically, these external resources are provided only for benchmark construction and not during evaluation, which allows LLMs to curate complex questions that go beyond what they can answer during evaluation. SagaScale is also bilingual and offers the largest context length to date, with average token counts exceeding 250K for English novels and 320K for Chinese novels. Our evaluation across 12 frontier LLMs and three long-context methods -- Naïve RAG, Agentic RAG, and Long Context -- yields key insights, including: (1) Directly supplying the full context to the LLM can outperform other methods by a large margin; (2) Most LLMs still struggle with lengthy contexts, but Gemini-2.5-Pro stands out as an exception; and (3) Agentic RAG effectively addresses the retrieval bottleneck in Naïve RAG. Finally, we publicly release the SagaScale benchmark and our data collection codebase to facilitate future research.
摘要：大型语言模型 (LLM) 已取得显着进展，但理解长且复杂的文档仍然具有挑战性。人们已经提出了许多长上下文基准，但它们面临着一些限制，包括任务现实性、数据可扩展性和数据质量。为此，我们推出了 SagaScale，这是一个根据长篇小说构建的现实的、可扩展的、高质量的长上下文基准。整个基准测试是使用自动数据收集管道构建的，该管道利用外部资源（例如维基百科页面）来管理问答对。至关重要的是，这些外部资源仅用于基准构建，而不是在评估期间提供，这使得法学硕士能够提出超出他们在评估期间可以回答的复杂问题。 SagaScale 也是双语的，提供迄今为止最大的上下文长度，英文小说的平均标记数超过 250K，中文小说的平均标记数超过 320K。我们对 12 个前沿法学硕士和三种长上下文方法（Naïve RAG、Agentic RAG 和长上下文）进行了评估，得出了重要的见解，包括：（1）直接向法学硕士提供完整的上下文可以大大优于其他方法； (2) 大多数法学硕士仍然难以应对冗长的上下文，但 Gemini-2.5-Pro 是一个例外； (3) Agentic RAG 有效解决了 Naïve RAG 中的检索瓶颈。最后，我们公开发布 SagaScale 基准测试和我们的数据收集代码库，以方便未来的研究。

Title: Syntactic Framing Fragility: An Audit of Robustness in LLM Ethical Decisions

Authors: Katherine Elkins, Jon Chun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.09724
Pdf URL: https://arxiv.org/pdf/2601.09724
Copy Paste: [[2601.09724]] Syntactic Framing Fragility: An Audit of Robustness in LLM Ethical Decisions(https://arxiv.org/abs/2601.09724)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large language models (LLMs) are increasingly deployed in consequential decision-making settings, yet their robustness to benign prompt variation remains underexplored. In this work, we study whether LLMs maintain consistent ethical judgments across logically equivalent but syntactically different prompts, focusing on variations involving negation and conditional structure. We introduce Syntactic Framing Fragility (SFF), a robustness evaluation framework that isolates purely syntactic effects via Logical Polarity Normalization (LPN), enabling direct comparison of decisions across positive and negative framings without semantic drift. Auditing 23 state-of-the-art models spanning the U.S. and China as well as small U.S. open-source software models over 14 ethical scenarios and four controlled framings (39,975 decisions), we find widespread and statistically significant inconsistency: many models reverse ethical endorsements solely due to syntactic polarity, with open-source models exhibiting over twice the fragility of commercial counterparts. We further uncover extreme negation sensitivity, where some models endorse actions in 80-97% of cases when explicitly prompted with "should not." We show that eliciting chain-of-thought reasoning substantially reduces fragility, identifying a practical mitigation lever, and we map fragility across scenarios, finding higher risk in financial and business contexts than in medical scenarios. Our results demonstrate that syntactic consistency constitutes a distinct and critical dimension of ethical robustness, and we argue that SFF-style audits should be a standard component of safety evaluation for deployed LLMs. Code and results will be available on this http URL.
摘要：大型语言模型（LLM）越来越多地部署在结果性决策环境中，但它们对良性提示变化的鲁棒性仍未得到充分探索。在这项工作中，我们研究法学硕士是否在逻辑上相同但句法上不同的提示中保持一致的道德判断，重点关注涉及否定和条件结构的变化。我们引入了句法框架脆弱性（SFF），这是一种鲁棒性评估框架，它通过逻辑极性归一化（LPN）隔离纯粹的句法效果，从而能够直接比较积极和消极框架的决策，而不会产生语义漂移。我们对横跨美国和中国的 23 个最先进的模型以及超过 14 个道德场景和 4 个受控框架（39,975 个决策）的美国小型开源软件模型进行了审核，发现存在广泛且统计上显着的不一致：许多模型仅由于句法极性而逆转了道德认可，开源模型的脆弱性是商业模型的两倍以上。我们进一步发现了极端的否定敏感性，当明确提示“不应该”时，一些模型会在 80-97% 的情况下认可行动。我们表明，引发思想链推理可以大大降低脆弱性，确定实际的缓解杠杆，并且我们绘制了跨场景的脆弱性，发现金融和商业环境中的风险比医疗场景中的风险更高。我们的结果表明，句法一致性构成了道德稳健性的一个独特且关键的维度，并且我们认为，SFF 式审计应该成为已部署的法学硕士安全评估的标准组成部分。代码和结果将在此 http URL 上提供。
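
The idea of comparing decisions across positive and negative framings can be illustrated with a small sketch. The abstract does not define Logical Polarity Normalization in detail, so the code below is an assumption about one simple way to do it: map each yes/no answer to "endorses the action" or not, then flag a scenario when the normalised decisions disagree. All names and sample responses are hypothetical.

```python
def normalize(answer, framing_is_negated):
    """Map a yes/no answer to True (endorses the action) or False (does not).

    Under a negated framing ("should not ..."), agreeing means NOT endorsing.
    """
    agrees = answer.strip().lower() == "yes"
    return (not agrees) if framing_is_negated else agrees

def is_fragile(responses):
    """A scenario is inconsistent if normalised decisions differ across framings."""
    decisions = {normalize(answer, negated) for answer, negated in responses}
    return len(decisions) > 1

# Hypothetical responses for one scenario: (model answer, framing is negated?).
consistent = [("yes", False), ("no", True)]   # endorses the action both times
flipped = [("yes", False), ("yes", True)]     # agrees with whatever is asked
```

In this toy setup `is_fragile(flipped)` is true because the model says "yes" to both "should" and "should not", which is the kind of polarity-driven reversal the audit reports.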

Title: Assessing and Improving Punctuation Robustness in English-Marathi Machine Translation

Authors: Kaustubh Shivshankar Shejole, Sourabh Deoghare, Pushpak Bhattacharyya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.09725
Pdf URL: https://arxiv.org/pdf/2601.09725
Copy Paste: [[2601.09725]] Assessing and Improving Punctuation Robustness in English-Marathi Machine Translation(https://arxiv.org/abs/2601.09725)
Keywords: language model, llm
Abstract: Punctuation plays a critical role in resolving semantic and structural ambiguity in written language. Machine Translation (MT) systems are now widely applied across diverse domains and languages, including many low-resource settings. In this work, we focus on Marathi, a low- to middle-resource language. We introduce Virām, the first diagnostic benchmark for assessing punctuation robustness in English-to-Marathi machine translation, consisting of 54 manually curated, punctuation-ambiguous instances. We evaluate two primary strategies for enhancing reliability: a pipeline-based restore-then-translate approach and direct fine-tuning on punctuation-varied data. Our results demonstrate that specialized fine-tuned models and pipeline systems significantly improve translation quality over standard baselines on the Virām benchmark. Qualitative analysis reveals that the original model may produce wrong translations that lead to wrong interpretations, while fine-tuned models significantly improve overall reliability. Furthermore, we find that current Large Language Models (LLMs) lag behind these task-specific approaches in preserving meaning for punctuation-ambiguous text, thus necessitating further research in this area.
摘要：标点符号在解决书面语言中的语义和结构歧义方面起着至关重要的作用。机器翻译 (MT) 系统现已广泛应用于不同领域和语言，包括许多资源匮乏的环境。在这项工作中，我们重点关注马拉地语，一种中低资源语言。我们推出了 Virām，这是第一个用于评估英语到马拉地语机器翻译中标点符号鲁棒性的诊断基准，由 54 个手动策划的标点符号模糊实例组成。我们评估了增强可靠性的两种主要策略：基于管道的恢复然后翻译方法和对标点符号变化的数据进行直接微调。我们的结果表明，专门的微调模型和管道系统可显着提高 Virām 基准的翻译质量。定性分析表明，原始模型可能会导致错误的翻译，从而导致错误的解释，而微调后的模型可显着提高整体可靠性。此外，我们发现当前的大型语言模型（LLM）在保留标点符号模糊文本的含义方面落后于这些特定于任务的方法，因此需要在该领域进行进一步的研究。

Title: Forgetting as a Feature: Cognitive Alignment of Large Language Models

Authors: Hien Tran, Quinten Steenhuis, Alexandros Christoforos, Chadbourne Davis
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.09726
Pdf URL: https://arxiv.org/pdf/2601.09726
Copy Paste: [[2601.09726]] Forgetting as a Feature: Cognitive Alignment of Large Language Models(https://arxiv.org/abs/2601.09726)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are often evaluated against ideals of perfect Bayesian inference, yet growing evidence suggests that their in-context reasoning exhibits systematic forgetting of past information. Rather than viewing this behavior as a limitation, we reinterpret forgetting as a functional cognitive mechanism. Drawing inspiration from human memory dynamics, we model LLM inference as a probabilistic memory process governed by exponential decay. We introduce a benchmark suite that evaluates temporal reasoning, concept drift adaptation, and associative recall, enabling direct comparison between model behavior and human cognitive patterns. Our empirical results reveal that LLMs demonstrate forgetting rates analogous to human memory efficiency trade-offs between stability and adaptability. Building on these observations, we propose probabilistic memory prompting, a lightweight strategy that shapes evidence integration to mimic human-like memory decay, leading to improved long-horizon reasoning performance. Our findings position forgetting not as a failure mode, but as a principled mechanism for adaptive intelligence.
摘要：大型语言模型 (LLM) 通常根据完美贝叶斯推理的理想进行评估，但越来越多的证据表明，它们的上下文推理表现出对过去信息的系统性遗忘。我们没有将这种行为视为一种限制，而是将遗忘重新解释为一种功能性认知机制。受人类记忆动力学的启发，我们将 LLM 推理建模为受指数衰减控制的概率记忆过程。我们引入了一个基准套件，用于评估时间推理、概念漂移适应和联想回忆，从而能够直接比较模型行为和人类认知模式。我们的实证结果表明，法学硕士的遗忘率类似于人类记忆效率在稳定性和适应性之间的权衡。基于这些观察，我们提出了概率记忆提示，这是一种轻量级策略，可以塑造证据整合以模仿人类的记忆衰退，从而提高长视野推理性能。我们的研究结果将遗忘不是一种失败模式，而是一种适应性智能的原则机制。
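
The exponential-decay view of memory described above can be made concrete with a short sketch. This is an illustrative reading, not the paper's model: each observation is weighted by exp(-decay_rate * age), and the function names, the decay rates and the toy history are assumptions.

```python
import math

def decayed_weights(ages, decay_rate):
    """Weight each piece of evidence by exp(-decay_rate * age), normalised to sum to 1."""
    raw = [math.exp(-decay_rate * age) for age in ages]
    total = sum(raw)
    return [w / total for w in raw]

def decayed_estimate(observations, decay_rate):
    """Weighted mean of (age, value) pairs; older observations count less."""
    weights = decayed_weights([age for age, _ in observations], decay_rate)
    return sum(w * value for w, (_, value) in zip(weights, observations))

# (age in turns, observed value): the underlying signal drifts from 0.0 to 1.0.
history = [(3, 0.0), (2, 0.0), (1, 1.0), (0, 1.0)]
no_forgetting = decayed_estimate(history, decay_rate=0.0)    # plain average
with_forgetting = decayed_estimate(history, decay_rate=1.0)  # favours recent turns
```

With no decay the estimate is the plain average of old and new evidence; with decay it moves toward the recent value, which is the stability versus adaptability trade-off the abstract refers to under concept drift.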

Title: SciNets: Graph-Constrained Multi-Hop Reasoning for Scientific Literature Synthesis

Authors: Sauhard Dubey
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2601.09727
Pdf URL: https://arxiv.org/pdf/2601.09727
Copy Paste: [[2601.09727]] SciNets: Graph-Constrained Multi-Hop Reasoning for Scientific Literature Synthesis(https://arxiv.org/abs/2601.09727)
Keywords: language model, llm
Abstract: Cross-domain scientific synthesis requires connecting mechanistic explanations across fragmented literature, a capability that remains challenging for both retrieval-based systems and unconstrained language models. While recent work has applied large language models to scientific summarization and question answering, these approaches provide limited control over reasoning depth and structural grounding. We frame mechanistic synthesis as a graph-constrained multi-hop reasoning problem over literature-derived concept graphs. Given a scientific query and a compact, query-local corpus, SciNets constructs a directed concept graph and synthesizes mechanistic explanations by identifying multi-hop reasoning paths that connect concepts that rarely co-occur within individual papers. We systematically compare shortest-path reasoning, k-shortest paths with diversity constraints, stochastic random walks, and a retrieval-augmented language model baseline. Rather than evaluating correctness, which is often indeterminate when synthesizing connections across distributed sources, we introduce a behavioral framework that measures symbolic reasoning depth, mechanistic diversity, and grounding stability. Across machine learning, biology, and climate science tasks, explicit graph constraints enable controllable multi-hop reasoning while revealing a consistent trade-off: deeper and more diverse symbolic reasoning increases grounding instability, whereas shortest-path reasoning remains highly stable but structurally conservative. These findings provide a systematic behavioral characterization of the limits and capabilities of current graph-LLM integration for scientific synthesis.
摘要：跨领域的科学综合需要将分散的文献中的机械解释联系起来，这种能力对于基于检索的系统和无约束的语言模型来说仍然具有挑战性。虽然最近的工作已将大型语言模型应用于科学摘要和问题解答，但这些方法对推理深度和结构基础的控制有限。我们将机械综合构建为基于文献衍生概念图的图约束多跳推理问题。给定一个科学查询和一个紧凑的查询本地语料库，SciNets 构建一个有向概念图，并通过识别连接很少在单篇论文中同时出现的概念的多跳推理路径来综合机械解释。我们系统地比较了最短路径推理、具有多样性约束的 k 最短路径、随机随机游走和检索增强语言模型基线。我们引入了一个行为框架来衡量符号推理深度、机制多样性和基础稳定性，而不是评估正确性（在综合分布式源之间的连接时，正确性通常是不确定的）。在机器学习、生物学和气候科学任务中，显式图约束实现了可控的多跳推理，同时揭示了一致的权衡：更深层次和更多样化的符号推理增加了基础不稳定性，而最短路径推理保持高度稳定但结构保守。这些发现为当前图法学硕士集成的科学综合的局限性和能力提供了系统的行为特征。
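
Shortest-path reasoning over a directed concept graph, one of the strategies compared above, can be sketched with a breadth-first search. This is a generic illustration, not the SciNets implementation: the graph format, the function name and the toy edges are assumptions, and real concept graphs would be derived from the literature.

```python
from collections import deque

def shortest_path(graph, source, target):
    """Breadth-first search over a directed graph given as {node: [successors]}."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent links back to the source, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for successor in graph.get(node, []):
            if successor not in parents:
                parents[successor] = node
                queue.append(successor)
    return None  # no directed path exists

# Hypothetical toy concept graph; the edges are illustrative, not from the paper.
concepts = {
    "aerosols": ["cloud albedo", "air quality"],
    "cloud albedo": ["surface temperature"],
    "air quality": ["public health"],
    "surface temperature": ["crop yield"],
}
path = shortest_path(concepts, "aerosols", "crop yield")
```

Breadth-first search returns a path with the fewest hops, which matches the abstract's description of shortest-path reasoning as stable but structurally conservative; the k-shortest-path and random-walk variants trade that stability for depth and diversity.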

Title: Eliminating Agentic Workflow for Introduction Generation with Parametric Stage Tokens

Authors: Meicong Zhang, Tiancheng Su, Guoxiu He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.09728
Pdf URL: https://arxiv.org/pdf/2601.09728
Copy Paste: [[2601.09728]] Eliminating Agentic Workflow for Introduction Generation with Parametric Stage Tokens(https://arxiv.org/abs/2601.09728)
Keywords: language model, llm, agent
Abstract: In recent years, using predefined agentic workflows to guide large language models (LLMs) for literature classification and review has become a research focus. However, writing research introductions is more challenging. It requires rigorous logic, coherent structure, and abstract summarization. Existing workflows often suffer from long reasoning chains, error accumulation, and reduced textual coherence. To address these limitations, we propose eliminating external agentic workflows. Instead, we directly parameterize their logical structure into the LLM. This allows the generation of a complete introduction in a single inference. To this end, we introduce the Stage Token for Introduction Generation (STIG). STIG converts the multiple stages of the original workflow into explicit stage signals. These signals guide the model to follow different logical roles and functions during generation. Through instruction tuning, the model learns the mapping between stage tokens and text functions. It also learns the logical order and transition patterns between stages, encoding this knowledge into the model parameters. Experimental results show that STIG can generate multi-stage text in a single inference. It does not require explicit workflow calls. STIG outperforms traditional agentic workflows and other baselines on metrics of semantic similarity and sentence-level structural rationality. The code is provided in the Supplementary Materials.

Title: Enhancing Business Analytics through Hybrid Summarization of Financial Reports

Authors: Tohida Rehman
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.09729
Pdf URL: https://arxiv.org/pdf/2601.09729
Copy Paste: [[2601.09729]] Enhancing Business Analytics through Hybrid Summarization of Financial Reports(https://arxiv.org/abs/2601.09729)
Keywords: long context
Abstract: Financial reports and earnings communications contain large volumes of structured and semi-structured information, making detailed manual analysis inefficient. Earnings conference calls provide valuable evidence about a firm's performance, outlook, and strategic priorities. The manual analysis of lengthy call transcripts requires substantial effort and is susceptible to interpretive bias and unintentional error. In this work, we present a hybrid summarization framework that combines extractive and abstractive techniques to produce concise and factually reliable Reuters-style summaries from the ECTSum dataset. The proposed two-stage pipeline first applies the LexRank algorithm to identify salient sentences, which are subsequently summarized using fine-tuned variants of BART and PEGASUS designed for resource-constrained settings. In parallel, we fine-tune a Longformer Encoder-Decoder (LED) model to directly capture long-range contextual dependencies in financial documents. Model performance is evaluated using standard automatic metrics, including ROUGE, METEOR, MoverScore, and BERTScore, along with domain-specific variants such as SciBERTScore and FinBERTScore. To assess factual accuracy, we further employ entity-level measures based on source-precision and F1-target. The results highlight complementary trade-offs between approaches: long-context models yield the strongest overall performance, while the hybrid framework achieves competitive results with improved factual consistency under computational constraints. These findings support the development of practical summarization systems for efficiently distilling lengthy financial texts into usable business insights.
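The extractive stage named in the abstract above can be sketched as follows. This is a simplified, continuous LexRank-style ranker that assumes raw term counts instead of the TF-IDF weighting used by the original algorithm; the function names and the four example sentences are invented for illustration and are not from the paper or from ECTSum.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def lexrank_scores(sentences, damping=0.85, iterations=50):
    """Score sentences by a damped random walk over their similarity graph."""
    bags = [Counter(s.lower().split()) for s in sentences]
    n = len(bags)
    sim = [
        [cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    # Row-normalize; a sentence similar to nothing jumps uniformly.
    trans = []
    for row in sim:
        total = sum(row)
        trans.append([v / total for v in row] if total else [1.0 / n] * n)
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(scores[i] * trans[i][j] for i in range(n))
            for j in range(n)
        ]
    return scores


# Invented example sentences (not taken from any earnings call).
sentences = [
    "Revenue rose eight percent in the third quarter",
    "Quarter revenue rose on strong cloud demand",
    "The company expects revenue growth next quarter",
    "The call ended with thanks to the operator",
]
scores = lexrank_scores(sentences)
ranked = sorted(zip(scores, sentences), reverse=True)
for score, sentence in ranked[:2]:
    print(f"{score:.3f}  {sentence}")
```

In a two-stage pipeline of the kind the abstract describes, the top-ranked sentences would then be passed to the abstractive model, which keeps the input short enough for resource-constrained settings.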

Title: Clinical Document Metadata Extraction: A Scoping Review

Authors: Kurt Miller (1 and 2), Qiuhao Lu (3), William Hersh (4), Kirk Roberts (3), Steven Bedrick (4), Andrew Wen (3), Hongfang Liu (3) ((1) Mayo Clinic, (2) University of Minnesota, (3) University of Texas Health Science Center at Houston, (4) Oregon Health & Science University)
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2601.09730
Pdf URL: https://arxiv.org/pdf/2601.09730
Copy Paste: [[2601.09730]] Clinical Document Metadata Extraction: A Scoping Review(https://arxiv.org/abs/2601.09730)
Keywords: language model
Abstract: Clinical document metadata, such as document type, structure, author role, medical specialty, and encounter setting, is essential for accurate interpretation of information captured in clinical documents. However, vast documentation heterogeneity and drift over time challenge harmonization of document metadata. Automated extraction methods have emerged to coalesce metadata from disparate practices into target schema. This scoping review aims to catalog research on clinical document metadata extraction, identify methodological trends and applications, and highlight gaps. We followed the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) guidelines to identify articles that perform clinical document metadata extraction. We initially found and screened 266 articles published between January 2011 and August 2025, then comprehensively reviewed 67 we deemed relevant to our study. Among the articles included, 45 were methodological, 17 used document metadata as features in a downstream application, and 5 analyzed document metadata composition. We observe myriad purposes for methodological study and application types. Available labelled public data remains sparse except for structural section datasets. Methods for extracting document metadata have progressed from largely rule-based and traditional machine learning with ample feature engineering to transformer-based architectures with minimal feature engineering. The emergence of large language models has enabled broader exploration of generalizability across tasks and datasets, allowing the possibility of advanced clinical text processing systems. We anticipate that research will continue to expand into richer document metadata representations and integrate further into clinical applications and workflows.

Title: Benchmarking Cross-Lingual Semantic Alignment in Multilingual Embeddings

Authors: Wen G. Gong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.09732
Pdf URL: https://arxiv.org/pdf/2601.09732
Copy Paste: [[2601.09732]] Benchmarking Cross-Lingual Semantic Alignment in Multilingual Embeddings(https://arxiv.org/abs/2601.09732)
Keywords: llm
Abstract: With hundreds of multilingual embedding models available, practitioners lack clear guidance on which of them provide genuine cross-lingual semantic alignment and which achieve task performance through language-specific patterns. Task-driven benchmarks (MTEB) may mask fundamental alignment shortcomings. We introduce Semantic Affinity (SA), a bounded (between 0 and 1) metric measuring inter-lingual to intra-lingual spread ratio using cosine distance, combined with PHATE visualization in our Semanscope framework. Benchmarking 13 models across 4 datasets (52 experiments) reveals a three-tier structure: (1) Top BERT models (LaBSE SA = 0.70, USE SA = 0.68, S-BERT SA = 0.68) achieve strong alignment via translation-pair supervision; (2) LLM embeddings plateau at SA between 0.55 and 0.61 regardless of 0.6 B to 8 B scale; (3) MLM-only BERT models (mBERT, XLM-R, SA < 0.50) fail despite training on more than 100 languages. Training objective, not architecture or scale, determines alignment. Oracle Bone primitives (1200 BCE) expose semantic drift: models learn corpus patterns rather than cognitive primitives. This work provides semantic benchmarking to help practitioners select quality multilingual embeddings from hundreds of available models, showing cross-lingual alignment requires explicit translation supervision, not merely model scale or multilingual data.
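The abstract above describes Semantic Affinity only as a bounded ratio of inter-lingual to intra-lingual spread under cosine distance and does not give the formula. The sketch below is therefore an assumption made purely for illustration: it takes the mean cosine distance between translations of the same concept as the inter-lingual spread, the mean distance between different concepts within one language as the intra-lingual spread, and maps them to [0, 1] with `intra / (intra + inter)`. The function name and the toy vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 minus cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def semantic_affinity_sketch(embeddings):
    """embeddings: {language: {concept: vector}}, same concepts per language.

    ASSUMPTION: this bounded form is illustrative, not the paper's definition.
    """
    languages = list(embeddings)
    concepts = list(embeddings[languages[0]])
    # Spread between translations of the same concept across languages.
    inter = mean(
        cosine_distance(embeddings[a][c], embeddings[b][c])
        for a, b in combinations(languages, 2)
        for c in concepts
    )
    # Spread between different concepts inside each language.
    intra = mean(
        cosine_distance(embeddings[lang][c1], embeddings[lang][c2])
        for lang in languages
        for c1, c2 in combinations(concepts, 2)
    )
    return intra / (intra + inter)


# Toy 2-D "embeddings" (invented): translations sit close together.
aligned = {
    "en": {"water": [1.0, 0.0], "fire": [0.0, 1.0]},
    "zh": {"water": [0.9, 0.1], "fire": [0.1, 0.9]},
}
# Same vectors with the second language's concepts swapped.
misaligned = {
    "en": {"water": [1.0, 0.0], "fire": [0.0, 1.0]},
    "zh": {"water": [0.1, 0.9], "fire": [0.9, 0.1]},
}
sa_aligned = semantic_affinity_sketch(aligned)
sa_misaligned = semantic_affinity_sketch(misaligned)
print(f"aligned: {sa_aligned:.2f}, misaligned: {sa_misaligned:.2f}")
```

Under this assumed form, a model whose translations cluster tightly relative to unrelated words scores near 1, while a model that groups words by language rather than meaning scores lower.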

Title: Closing the Data Loop: Using OpenDataArena to Engineer Superior Training Datasets

Authors: Xin Gao, Xiaoyang Wang, Yun Zhu, Mengzhang Cai, Conghui He, Lijun Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.09733
Pdf URL: https://arxiv.org/pdf/2601.09733
Copy Paste: [[2601.09733]] Closing the Data Loop: Using OpenDataArena to Engineer Superior Training Datasets(https://arxiv.org/abs/2601.09733)
Keywords: language model, llm
Abstract: The construction of Supervised Fine-Tuning (SFT) datasets is a critical yet under-theorized stage in the post-training of Large Language Models (LLMs), as prevalent practices often rely on heuristic aggregation without a systematic understanding of how individual samples contribute to model performance. In this report, we propose a paradigm shift from ad-hoc curation to a closed-loop dataset engineering framework using OpenDataArena (ODA), which leverages value-anchored rankings and multi-dimensional analysis to transform value benchmarking into feedback signals guiding dataset construction. We instantiate this methodology through two new datasets: ODA-Math-460k, a specialized mathematics reasoning dataset that utilizes a novel two-stage difficulty-aware pipeline to achieve State-of-the-Art (SOTA) results on benchmarks such as AIME and HMMT, and ODA-Mixture (100k & 500k), a series of multi-domain instruction datasets built via an "Anchor-and-Patch" strategy that outperforms significantly larger open-source baselines. Our empirical results demonstrate that ODA-driven datasets significantly improve both domain-specific reasoning and general utility while achieving superior data efficiency, validating a transition toward data-centric AI where transparent evaluation serves as the primary engine for engineering high-quality training data.

Title: From Detection to Diagnosis: Advancing Hallucination Analysis with Automated Data Synthesis

Authors: Yanyi Liu, Qingwen Yang, Tiezheng Guo, Feiyu Qu, Jun Liu, Yingyou Wen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.09734
Pdf URL: https://arxiv.org/pdf/2601.09734
Copy Paste: [[2601.09734]] From Detection to Diagnosis: Advancing Hallucination Analysis with Automated Data Synthesis(https://arxiv.org/abs/2601.09734)
Keywords: language model, llm, hallucination
Abstract: Hallucinations in Large Language Models (LLMs), defined as the generation of content inconsistent with facts or context, represent a core obstacle to their reliable deployment in critical domains. Current research primarily focuses on binary "detection" approaches that, while capable of identifying hallucinations, fail to provide interpretable and actionable feedback for model improvement, thus limiting practical utility. To address this limitation, a new research paradigm is proposed, shifting from "detection" to "diagnosis". The Hallucination Diagnosis Task is introduced, a task which requires models to not only detect hallucinations, but also perform error localization, causal explanation, and content correction. We develop the Hallucination Diagnosis Generator (HDG), an automated pipeline that systematically generates high-quality training samples with rich diagnostic metadata from raw corpora through multi-dimensional augmentation strategies including controlled fact fabrication and reasoning chain perturbation. Using HDG-generated data, we train HDM-4B-RL, a 4-billion-parameter hallucination diagnosis model, employing Group Relative Policy Optimization (GRPO) with a comprehensive reward function incorporating structural, accuracy, and localization signals. Experimental results demonstrate that our model surpasses previous state-of-the-art detection models on the HaluEval benchmark while achieving comparable performance to advanced general-purpose models. In comprehensive diagnosis tasks, HDM-4B-RL matches the capabilities of larger general models while maintaining a smaller size. This work validates the feasibility and value of hallucination diagnosis, providing an effective methodology for building more trustworthy and reliable generative AI systems.

Title: Stable and Explainable Personality Trait Evaluation in Large Language Models with Internal Activations

Authors: Xiaoxu Ma, Xiangbo Zhang, Zhenyu Weng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.09833
Pdf URL: https://arxiv.org/pdf/2601.09833
Copy Paste: [[2601.09833]] Stable and Explainable Personality Trait Evaluation in Large Language Models with Internal Activations(https://arxiv.org/abs/2601.09833)
Keywords: language model, llm, prompt
Abstract: Evaluating personality traits in Large Language Models (LLMs) is key to model interpretation, comparison, and responsible deployment. However, existing questionnaire-based evaluation methods exhibit limited stability and offer little explainability, as their results are highly sensitive to minor variations in prompt phrasing or role-play configurations. To address these limitations, we propose an internal-activation-based approach, termed Persona-Vector Neutrality Interpolation (PVNI), for stable and explainable personality trait evaluation in LLMs. PVNI extracts a persona vector associated with a target personality trait from the model's internal activations using contrastive prompts. It then estimates the corresponding neutral score by interpolating along the persona vector as an anchor axis, enabling an interpretable comparison between the neutral prompt representation and the persona direction. We provide a theoretical analysis of the effectiveness and generalization properties of PVNI. Extensive experiments across diverse LLMs demonstrate that PVNI yields substantially more stable personality trait evaluations than existing methods, even under questionnaire and role-play variants.
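One way to picture the two steps described in the abstract above (extracting a persona vector from contrastive prompts, then locating the neutral prompt along that axis) is the sketch below. It assumes the persona vector is the difference between mean activations for trait-positive and trait-negative prompts, and that the neutral score is the normalized projection of the neutral activation onto that axis. Both choices, the function names, and the toy 2-D activations are assumptions for illustration; the abstract does not specify PVNI's actual estimator.

```python
def mean_vector(vectors):
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def persona_vector(positive_acts, negative_acts):
    """Return (direction, origin): mean difference and the negative-side mean."""
    mu_pos = mean_vector(positive_acts)
    mu_neg = mean_vector(negative_acts)
    direction = [p - n for p, n in zip(mu_pos, mu_neg)]
    return direction, mu_neg


def neutral_position(neutral_act, direction, origin):
    """Position along the persona axis: 0 at the negative mean, 1 at the positive mean."""
    offset = [h - o for h, o in zip(neutral_act, origin)]
    return dot(offset, direction) / dot(direction, direction)


# Toy activations (invented; real hidden states have thousands of dimensions).
positive_acts = [[2.0, 1.0], [2.0, 3.0]]   # trait-eliciting prompts
negative_acts = [[0.0, 1.0], [0.0, 3.0]]   # trait-suppressing prompts
neutral_act = [0.5, 2.5]                   # neutral prompt

direction, origin = persona_vector(positive_acts, negative_acts)
position = neutral_position(neutral_act, direction, origin)
print(f"neutral position along persona axis: {position:.2f}")
```

Because the score depends on where the neutral representation sits relative to two anchors inside the model, it does not change when a questionnaire item is rephrased, which is the kind of stability the abstract claims over questionnaire-based evaluation.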

Title: Bears, all bears, and some bears. Language Constraints on Language Models' Inductive Inferences

Authors: Sriram Padmanabhan, Siyuan Song, Kanishka Misra
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.09852
Pdf URL: https://arxiv.org/pdf/2601.09852
Copy Paste: [[2601.09852]] Bears, all bears, and some bears. Language Constraints on Language Models' Inductive Inferences(https://arxiv.org/abs/2601.09852)
Keywords: language model
Abstract: Language places subtle constraints on how we make inductive inferences. Developmental evidence by Gelman et al. (2002) has shown children (4 years and older) to differentiate among generic statements ("Bears are daxable"), universally quantified NPs ("all bears are daxable") and indefinite plural NPs ("some bears are daxable") in extending novel properties to a specific member (all > generics > some), suggesting that they represent these types of propositions differently. We test if these subtle differences arise in general purpose statistical learners like Vision Language Models, by replicating the original experiment. After putting the models through a series of precondition tests (robust identification of categories in images and sensitivity to "all" and "some"), followed by the original experiment, we find behavioral alignment between models and humans. Post-hoc analyses of their representations revealed that these differences are organized based on inductive constraints and not surface-form differences.

Title: MedRedFlag: Investigating how LLMs Redirect Misconceptions in Real-World Health Communication

Authors: Sraavya Sambara, Yuan Pu, Ayman Ali, Vishala Mishra, Lionel Wong, Monica Agrawal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.09853
Pdf URL: https://arxiv.org/pdf/2601.09853
Copy Paste: [[2601.09853]] MedRedFlag: Investigating how LLMs Redirect Misconceptions in Real-World Health Communication(https://arxiv.org/abs/2601.09853)
Keywords: language model, llm
Abstract: Real-world health questions from patients often unintentionally embed false assumptions or premises. In such cases, safe medical communication typically involves redirection: addressing the implicit misconception and then responding to the underlying patient context, rather than the original question. While large language models (LLMs) are increasingly being used by lay users for medical advice, they have not yet been tested for this crucial competency. Therefore, in this work, we investigate how LLMs react to false premises embedded within real-world health questions. We develop a semi-automated pipeline to curate MedRedFlag, a dataset of 1100+ questions sourced from Reddit that require redirection. We then systematically compare responses from state-of-the-art LLMs to those from clinicians. Our analysis reveals that LLMs often fail to redirect problematic questions, even when the problematic premise is detected, and provide answers that could lead to suboptimal medical decision making. Our benchmark and results reveal a novel and substantial gap in how LLMs perform under the conditions of real-world health communication, highlighting critical safety concerns for patient-facing medical AI systems. Code and dataset are available at this https URL.

Title: OUTLINEFORGE: Hierarchical Reinforcement Learning with Explicit States for Scientific Writing

Authors: Yilin Bao, Ziyao He, Zayden Yang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.09858
Pdf URL: https://arxiv.org/pdf/2601.09858
Copy Paste: [[2601.09858]] OUTLINEFORGE: Hierarchical Reinforcement Learning with Explicit States for Scientific Writing(https://arxiv.org/abs/2601.09858)
Keywords: language model, llm
Abstract: Scientific paper generation requires document-level planning and factual grounding, but current large language models, despite their strong local fluency, often fail in global structure, input coverage, and citation consistency. We present a reinforcement learning framework that casts scientific outline construction as a long-horizon planning problem over hierarchical document structures. Our approach models edits to evolving outlines through structured actions, enabling the system to incrementally build a complete scientific manuscript. To support effective and stable learning, we introduce a two-stage optimization procedure consisting of (i) backward outline reconstruction from partial plans to enforce global structural consistency, and (ii) forward value-guided reinforcement learning with rewards explicitly modeling scientific correctness, discourse coherence, and citation fidelity. In addition, we introduce a benchmark for scientific paper generation that evaluates document planning, input utilization, reference faithfulness, outline organization, and content-level factual accuracy. Our results show consistent improvements over strong neural and LLM baselines, particularly in long-range structural coherence and citation reliability.

Title: Patient-Similarity Cohort Reasoning in Clinical Text-to-SQL

Authors: Yifei Shen, Yilun Zhao, Justice Ou, Tinglin Huang, Arman Cohan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.09876
Pdf URL: https://arxiv.org/pdf/2601.09876
Copy Paste: [[2601.09876]] Patient-Similarity Cohort Reasoning in Clinical Text-to-SQL(https://arxiv.org/abs/2601.09876)
Keywords: gpt, long context, chain-of-thought
Abstract: Real-world clinical text-to-SQL requires reasoning over heterogeneous EHR tables, temporal windows, and patient-similarity cohorts to produce executable queries. We introduce CLINSQL, a benchmark of 633 expert-annotated tasks on MIMIC-IV v3.1 that demands multi-table joins, clinically meaningful filters, and executable SQL. Solving CLINSQL entails navigating schema metadata and clinical coding systems, handling long contexts, and composing multi-step queries beyond traditional text-to-SQL. We evaluate 22 proprietary and open-source models under Chain-of-Thought self-refinement and use rubric-based SQL analysis with execution checks that prioritize critical clinical requirements. Despite recent advances, performance remains far from clinical reliability: on the test set, GPT-5-mini attains a 74.7% execution score, DeepSeek-R1 leads open-source models at 69.2%, and Gemini-2.5-Pro drops from 85.5% on Easy to 67.2% on Hard. Progress on CLINSQL marks tangible advances toward clinically reliable text-to-SQL for real-world EHR analytics.

Title: Clozing the Gap: Exploring Why Language Model Surprisal Outperforms Cloze Surprisal

Authors: Sathvik Nair, Byung-Doh Oh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.09886
Pdf URL: https://arxiv.org/pdf/2601.09886
Copy Paste: [[2601.09886]] Clozing the Gap: Exploring Why Language Model Surprisal Outperforms Cloze Surprisal(https://arxiv.org/abs/2601.09886)
Keywords: language model
Abstract: How predictable a word is can be quantified in two ways: using human responses to the cloze task or using probabilities from language models (LMs). When used as predictors of processing effort, LM probabilities outperform probabilities derived from cloze data. However, it is important to establish that LM probabilities do so for the right reasons, since different predictors can lead to different scientific conclusions about the role of prediction in language comprehension. We present evidence for three hypotheses about the advantage of LM probabilities: not suffering from low resolution, distinguishing semantically similar words, and accurately assigning probabilities to low-frequency words. These results call for efforts to improve the resolution of cloze studies, coupled with experiments on whether human-like prediction is also as sensitive to the fine-grained distinctions made by LM probabilities.

Title: Take Out Your Calculators: Estimating the Real Difficulty of Question Items with LLM Student Simulations

Authors: Christabel Acquaye, Yi Ting Huang, Marine Carpuat, Rachel Rudinger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.09953
Pdf URL: https://arxiv.org/pdf/2601.09953
Copy Paste: [[2601.09953]] Take Out Your Calculators: Estimating the Real Difficulty of Question Items with LLM Student Simulations(https://arxiv.org/abs/2601.09953)
Keywords: language model, llm, prompt
Abstract: Standardized math assessments require expensive human pilot studies to establish the difficulty of test items. We investigate the predictive value of open-source large language models (LLMs) for evaluating the difficulty of multiple-choice math questions for real-world students. We show that, while LLMs are poor direct judges of problem difficulty, simulation-based approaches with LLMs yield promising results under the right conditions. Under the proposed approach, we simulate a "classroom" of 4th, 8th, or 12th grade students by prompting the LLM to role-play students of varying proficiency levels. We use the outcomes of these simulations to fit Item Response Theory (IRT) models, comparing learned difficulty parameters for items to their real-world difficulties, as determined by item-level statistics furnished by the National Assessment of Educational Progress (NAEP). We observe correlations as high as 0.75, 0.76, and 0.82 for grades 4, 8, and 12, respectively. In our simulations, we experiment with different "classroom sizes," showing tradeoffs between computation size and accuracy. We find that role-playing with named students improves predictions (compared to student IDs), and that stratifying names across gender and race further improves predictions. Our results show that LLMs with relatively weaker mathematical abilities (Gemma) actually yield better real-world difficulty predictions than mathematically stronger models (Llama and Qwen), further underscoring the suitability of open-source models for the task.
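The IRT fitting step mentioned in the abstract above can be sketched with a one-parameter (Rasch) model fitted by plain gradient ascent on the joint likelihood. The abstract does not say which IRT variant or fitting procedure the authors use, so the model choice, the function names, the learning rate, and the random-number "students" that stand in for LLM role-play responses are all assumptions for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def simulate_responses(abilities, difficulties, rng):
    """Stand-in for LLM role-play: 1 = correct, 0 = incorrect."""
    return [
        [1 if rng.random() < sigmoid(theta - b) else 0 for b in difficulties]
        for theta in abilities
    ]


def fit_rasch(responses, iterations=300, lr=0.5):
    """Estimate item difficulties with P(correct) = sigmoid(theta_i - b_j)."""
    n_students, n_items = len(responses), len(responses[0])
    theta = [0.0] * n_students
    b = [0.0] * n_items
    for _ in range(iterations):
        probs = [
            [sigmoid(theta[i] - b[j]) for j in range(n_items)]
            for i in range(n_students)
        ]
        # Gradient of the log-likelihood: (x - p) for ability, (p - x) for difficulty.
        for i in range(n_students):
            theta[i] += lr * sum(
                responses[i][j] - probs[i][j] for j in range(n_items)
            ) / n_items
        for j in range(n_items):
            b[j] += lr * sum(
                probs[i][j] - responses[i][j] for i in range(n_students)
            ) / n_students
        # Fix the scale's origin: keep difficulties centered at zero.
        shift = sum(b) / n_items
        b = [value - shift for value in b]
        theta = [value - shift for value in theta]
    return b


rng = random.Random(0)
true_difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]          # invented items
abilities = [rng.gauss(0.0, 1.0) for _ in range(300)]    # simulated "classroom"
responses = simulate_responses(abilities, true_difficulties, rng)
estimated = fit_rasch(responses)
print([round(value, 2) for value in estimated])
```

The estimated difficulties would then be correlated with real-world item statistics, which is the comparison against NAEP data that the abstract reports; the classroom size (300 here) is the knob behind the computation-versus-accuracy tradeoff it mentions.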

Title: Context Volume Drives Performance: Tackling Domain Shift in Extremely Low-Resource Translation via RAG

Authors: David Samuel Setiawan, Raphaël Merx, Jey Han Lau
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.09982
Pdf URL: https://arxiv.org/pdf/2601.09982
Copy Paste: [[2601.09982]] Context Volume Drives Performance: Tackling Domain Shift in Extremely Low-Resource Translation via RAG(https://arxiv.org/abs/2601.09982)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Neural Machine Translation (NMT) models for low-resource languages suffer significant performance degradation under domain shift. We quantify this challenge using Dhao, an indigenous language of Eastern Indonesia with no digital footprint beyond the New Testament (NT). When applied to the unseen Old Testament (OT), a standard NMT model fine-tuned on the NT drops from an in-domain score of 36.17 chrF++ to 27.11 chrF++. To recover this loss, we introduce a hybrid framework where a fine-tuned NMT model generates an initial draft, which is then refined by a Large Language Model (LLM) using Retrieval-Augmented Generation (RAG). The final system achieves 35.21 chrF++ (+8.10 recovery), effectively matching the original in-domain quality. Our analysis reveals that this performance is driven primarily by the number of retrieved examples rather than the choice of retrieval algorithm. Qualitative analysis confirms the LLM acts as a robust "safety net," repairing severe failures in zero-shot domains.
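The scores quoted in the abstract above fit together as simple differences, worked through in the short sketch below. The three chrF++ values come from the abstract; the variable names and the "share of the drop recovered" figure are added here for illustration.

```python
# chrF++ scores reported in the abstract.
in_domain = 36.17       # NMT fine-tuned on NT, evaluated on NT
out_of_domain = 27.11   # same model evaluated on the unseen OT
hybrid = 35.21          # NMT draft refined by an LLM with RAG

drop = in_domain - out_of_domain     # loss caused by domain shift
recovered = hybrid - out_of_domain   # gain from the hybrid system
remaining = in_domain - hybrid       # gap still left to in-domain quality
share = recovered / drop             # fraction of the drop recovered

print(f"drop: {drop:.2f}")
print(f"recovered: {recovered:.2f}")
print(f"remaining gap: {remaining:.2f}")
print(f"share recovered: {share:.1%}")
```

The recovered amount matches the "+8.10" stated in the abstract, and the remaining gap of under one chrF++ point is what supports the claim that the hybrid system effectively matches in-domain quality.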

Title: SocraticKG: Knowledge Graph Construction via QA-Driven Fact Extraction

Authors: Sanghyeok Choi, Woosang Jeon, Kyuseok Yang, Taehyeong Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.10003
Pdf URL: https://arxiv.org/pdf/2601.10003
Copy Paste: [[2601.10003]] SocraticKG: Knowledge Graph Construction via QA-Driven Fact Extraction(https://arxiv.org/abs/2601.10003)
Keywords: llm
Abstract: Constructing Knowledge Graphs (KGs) from unstructured text provides a structured framework for knowledge representation and reasoning, yet current LLM-based approaches struggle with a fundamental trade-off: factual coverage often leads to relational fragmentation, while premature consolidation causes information loss. To address this, we propose SocraticKG, an automated KG construction method that introduces question-answer pairs as a structured intermediate representation to systematically unfold document-level semantics prior to triple extraction. By employing 5W1H-guided QA expansion, SocraticKG captures contextual dependencies and implicit relational links typically lost in direct KG extraction pipelines, providing explicit grounding in the source document that helps mitigate implicit reasoning errors. Evaluation on the MINE benchmark demonstrates that our approach effectively addresses the coverage-connectivity trade-off, achieving superior factual retention while maintaining high structural cohesion even as extracted knowledge volume substantially expands. These results highlight that QA-mediated semantic scaffolding plays a critical role in structuring semantics prior to KG extraction, enabling more coherent and reliable graph construction in subsequent stages.

Title: EHRNavigator: A Multi-Agent System for Patient-Level Clinical Question Answering over Heterogeneous Electronic Health Records

Authors: Lingfei Qian, Mauro Giuffre, Yan Wang, Huan He, Qianqian Xie, Xuguang Ai, Xeuqing Peng, Fan Ma, Ruey-Ling Weng, Donald Wright, Adan Wang, Qingyu Chen, Vipina K. Keloth, Hua Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.10020
Pdf URL: https://arxiv.org/pdf/2601.10020
Copy Paste: [[2601.10020]] EHRNavigator: A Multi-Agent System for Patient-Level Clinical Question Answering over Heterogeneous Electronic Health Records(https://arxiv.org/abs/2601.10020)
Keywords: agent
Abstract: Clinical decision-making increasingly relies on timely and context-aware access to patient information within Electronic Health Records (EHRs), yet most existing natural language question-answering (QA) systems are evaluated solely on benchmark datasets, limiting their practical relevance. To overcome this limitation, we introduce EHRNavigator, a multi-agent framework that harnesses AI agents to perform patient-level question answering across heterogeneous and multimodal EHR data. We assessed its performance using both public benchmark and institutional datasets under realistic hospital conditions characterized by diverse schemas, temporal reasoning demands, and multimodal evidence integration. Through quantitative evaluation and clinician-validated chart review, EHRNavigator demonstrated strong generalization, achieving 86% accuracy on real-world cases while maintaining clinically acceptable response times. Overall, these findings confirm that EHRNavigator effectively bridges the gap between benchmark evaluation and clinical deployment, offering a robust, adaptive, and efficient solution for real-world EHR question answering.

Title: EmplifAI: a Fine-grained Dataset for Japanese Empathetic Medical Dialogues in 28 Emotion Labels

Authors: Wan Jou She, Lis Kanashiro Pereira, Fei Cheng, Sakiko Yahata, Panote Siriaraya, Eiji Aramaki
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.10033
Pdf URL: https://arxiv.org/pdf/2601.10033
Copy Paste: [[2601.10033]] EmplifAI: a Fine-grained Dataset for Japanese Empathetic Medical Dialogues in 28 Emotion Labels(https://arxiv.org/abs/2601.10033)
Keywords: language model, llm
Abstract: This paper introduces EmplifAI, a Japanese empathetic dialogue dataset designed to support patients coping with chronic medical conditions. These patients often experience a wide range of positive and negative emotions (e.g., hope and despair) that shift across different stages of disease management. EmplifAI addresses this complexity by providing situation-based dialogues grounded in 28 fine-grained emotion categories, adapted and validated from the GoEmotions taxonomy. The dataset includes 280 medically contextualized situations and 4125 two-turn dialogues, collected through crowdsourcing and expert review. To evaluate emotional alignment in empathetic dialogues, we assessed model predictions on situation-dialogue pairs using BERTScore across multiple large language models (LLMs), achieving F1 scores of 0.83. Fine-tuning a baseline Japanese LLM (LLM-jp-3.1-13b-instruct4) with EmplifAI resulted in notable improvements in fluency, general empathy, and emotion-specific empathy. Furthermore, we compared the scores assigned by LLM-as-a-Judge and human raters on dialogues generated by multiple LLMs to validate our evaluation pipeline and discuss the insights and potential risks derived from the correlation analysis.

Title: Long-Chain Reasoning Distillation via Adaptive Prefix Alignment

Authors: Zhenghao Liu, Zhuoyang Wu, Xinze Li, Yukun Yan, Shuo Wang, Zulong Chen, Yu Gu, Ge Yu, Maosong Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.10064
Pdf URL: https://arxiv.org/pdf/2601.10064
Copy Paste: [[2601.10064]] Long-Chain Reasoning Distillation via Adaptive Prefix Alignment(https://arxiv.org/abs/2601.10064)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities, particularly in solving complex mathematical problems. Recent studies show that distilling long reasoning trajectories can effectively enhance the reasoning performance of small-scale student models. However, teacher-generated reasoning trajectories are often excessively long and structurally complex, making them difficult for student models to learn. This mismatch leads to a gap between the provided supervision signal and the learning capacity of the student model. To address this challenge, we propose Prefix-ALIGNment distillation (P-ALIGN), a framework that fully exploits teacher CoTs for distillation through adaptive prefix alignment. Specifically, P-ALIGN adaptively truncates teacher-generated reasoning trajectories by determining whether the remaining suffix is concise and sufficient to guide the student model. Then, P-ALIGN leverages the teacher-generated prefix to supervise the student model, encouraging effective prefix alignment. Experiments on multiple mathematical reasoning benchmarks demonstrate that P-ALIGN outperforms all baselines by over 3%. Further analysis indicates that the prefixes constructed by P-ALIGN provide more effective supervision signals, while avoiding the negative impact of redundant and uncertain reasoning components. All code is available at this https URL.

Title: Deriving Character Logic from Storyline as Codified Decision Trees

Authors: Letian Peng, Kun Zhou, Longfei Yun, Yupeng Hou, Jingbo Shang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.10080
Pdf URL: https://arxiv.org/pdf/2601.10080
Copy Paste: [[2601.10080]] Deriving Character Logic from Storyline as Codified Decision Trees(https://arxiv.org/abs/2601.10080)
Keywords: agent
Abstract: Role-playing (RP) agents rely on behavioral profiles to act consistently across diverse narrative contexts, yet existing profiles are largely unstructured, non-executable, and weakly validated, leading to brittle agent behavior. We propose Codified Decision Trees (CDT), a data-driven framework that induces an executable and interpretable decision structure from large-scale narrative data. CDT represents behavioral profiles as a tree of conditional rules, where internal nodes correspond to validated scene conditions and leaves encode grounded behavioral statements, enabling deterministic retrieval of context-appropriate rules at execution time. The tree is learned by iteratively inducing candidate scene-action rules, validating them against data, and refining them through hierarchical specialization, yielding profiles that support transparent inspection and principled updates. Across multiple benchmarks, CDT substantially outperforms human-written profiles and prior profile induction methods on 85 characters across 16 artifacts, indicating that codified and validated behavioral representations lead to more reliable agent grounding.
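The rule-tree idea in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `RuleNode` class, the `retrieve` function, the scene dictionary layout, and the example rules are all hypothetical, chosen only to show how a tree of scene conditions with behavioral statements at the leaves allows deterministic lookup.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Scene = Dict[str, object]  # hypothetical scene description, e.g. {"threat": True}


@dataclass
class RuleNode:
    """One node of a hypothetical codified decision tree."""
    condition: Callable[[Scene], bool]                      # scene condition guarding this subtree
    statements: List[str] = field(default_factory=list)     # behavioral statements (used at leaves)
    children: List["RuleNode"] = field(default_factory=list)


def retrieve(node: RuleNode, scene: Scene) -> List[str]:
    """Return statements from every branch whose conditions all hold, in tree order."""
    if not node.condition(scene):
        return []
    found = list(node.statements)
    for child in node.children:
        found.extend(retrieve(child, scene))
    return found


# Illustrative profile: an always-true root with two conditional leaves.
root = RuleNode(
    condition=lambda s: True,
    children=[
        RuleNode(lambda s: s.get("threat") is True, ["Protects allies before acting."]),
        RuleNode(lambda s: s.get("location") == "court", ["Defers to the monarch."]),
    ],
)

rules = retrieve(root, {"threat": True, "location": "forest"})
print(rules)
```

Because retrieval is a plain tree walk over explicit conditions, the same scene always yields the same rules, which is the property the abstract calls deterministic retrieval.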

Title: CALM-IT: Generating Realistic Long-Form Motivational Interviewing Dialogues with Dual-Actor Conversational Dynamics Tracking

Authors: Viet Cuong Nguyen, Nhi Yen Nguyen, Kristin A. Candan, Mary Conlon, Vanessa Rumie, Kristen Risola, Srijan Kumar, Munmun De Choudhury
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.10085
Pdf URL: https://arxiv.org/pdf/2601.10085
Copy Paste: [[2601.10085]] CALM-IT: Generating Realistic Long-Form Motivational Interviewing Dialogues with Dual-Actor Conversational Dynamics Tracking(https://arxiv.org/abs/2601.10085)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) are increasingly used in mental health-related settings, yet they struggle to sustain realistic, goal-directed dialogue over extended interactions. While LLMs generate fluent responses, they optimize locally for the next turn rather than maintaining a coherent model of therapeutic progress, leading to brittleness and long-horizon drift. We introduce CALM-IT, a framework for generating and evaluating long-form Motivational Interviewing (MI) dialogues that explicitly models dual-actor conversational dynamics. CALM-IT represents therapist-client interaction as a bidirectional state-space process, in which both agents continuously update inferred alignment, mental states, and short-term goals to guide strategy selection and utterance generation. Across large-scale evaluations, CALM-IT consistently outperforms strong baselines in Effectiveness and Goal Alignment and remains substantially more stable as conversation length increases. Although CALM-IT initiates fewer therapist redirections, it achieves the highest client acceptance rate (64.3%), indicating more precise and therapeutically aligned intervention timing. Overall, CALM-IT provides evidence that modeling the evolving conversational state is essential for generating high-quality long-form synthetic conversations.

Title: SIN-Bench: Tracing Native Evidence Chains in Long-Context Multimodal Scientific Interleaved Literature

Authors: Yiming Ren, Junjie Wang, Yuxin Meng, Yihang Shi, Zhiqiang Lin, Ruihang Chu, Yiran Xu, Ziming Li, Yunfei Zhao, Zihan Wang, Yu Qiao, Ruiming Tang, Minghao Liu, Yujiu Yang
Subjects: cs.CL, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2601.10108
Pdf URL: https://arxiv.org/pdf/2601.10108
Copy Paste: [[2601.10108]] SIN-Bench: Tracing Native Evidence Chains in Long-Context Multimodal Scientific Interleaved Literature(https://arxiv.org/abs/2601.10108)
Keywords: language model, gpt, llm
Abstract: Evaluating whether multimodal large language models truly understand long-form scientific papers remains challenging: answer-only metrics and synthetic "Needle-In-A-Haystack" tests often reward answer matching without requiring a causal, evidence-linked reasoning trace in the document. We propose the "Fish-in-the-Ocean" (FITO) paradigm, which requires models to construct explicit cross-modal evidence chains within native scientific documents. To operationalize FITO, we build SIN-Data, a scientific interleaved corpus that preserves the native interleaving of text and figures. On top of it, we construct SIN-Bench with four progressive tasks covering evidence discovery (SIN-Find), hypothesis verification (SIN-Verify), grounded QA (SIN-QA), and evidence-anchored synthesis (SIN-Summary). We further introduce "No Evidence, No Score", scoring predictions when grounded to verifiable anchors and diagnosing evidence quality via matching, relevance, and logic. Experiments on eight MLLMs show that grounding is the primary bottleneck: Gemini-3-pro achieves the best average overall score (0.573), while GPT-5 attains the highest SIN-QA answer accuracy (0.767) but underperforms on evidence-aligned overall scores, exposing a gap between correctness and traceable support.

Title: Role-Playing Agents Driven by Large Language Models: Current Status, Challenges, and Future Trends

Authors: Ye Wang, Jiaxing Chen, Hongjiang Xiao
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2601.10122
Pdf URL: https://arxiv.org/pdf/2601.10122
Copy Paste: [[2601.10122]] Role-Playing Agents Driven by Large Language Models: Current Status, Challenges, and Future Trends(https://arxiv.org/abs/2601.10122)
Keywords: language model, llm, hallucination, prompt, agent
Abstract: In recent years, with the rapid advancement of large language models (LLMs), role-playing language agents (RPLAs) have emerged as a prominent research focus at the intersection of natural language processing (NLP) and human-computer interaction. This paper systematically reviews the current development and key technologies of RPLAs, delineating the technological evolution from early rule-based template paradigms, through the language style imitation stage, to the cognitive simulation stage centered on personality modeling and memory mechanisms. It summarizes the critical technical pathways supporting high-quality role-playing, including psychological scale-driven character modeling, memory-augmented prompting mechanisms, and motivation-situation-based behavioral decision control. At the data level, the paper further analyzes the methods and challenges of constructing role-specific corpora, focusing on data sources, copyright constraints, and structured annotation processes. In terms of evaluation, it collates multi-dimensional assessment frameworks and benchmark datasets covering role knowledge, personality fidelity, value alignment, and interactive hallucination, while commenting on the advantages and disadvantages of methods such as human evaluation, reward models, and LLM-based scoring. Finally, the paper outlines future development directions of role-playing agents, including personality evolution modeling, multi-agent collaborative narrative, multimodal immersive interaction, and integration with cognitive neuroscience, aiming to provide a systematic perspective and methodological insights for subsequent research.

Title: ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback

Authors: Yutao Mou, Zhangchi Xue, Lijun Li, Peiyang Liu, Shikun Zhang, Wei Ye, Jing Shao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.10156
Pdf URL: https://arxiv.org/pdf/2601.10156
Copy Paste: [[2601.10156]] ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback(https://arxiv.org/abs/2601.10156)
Keywords: llm, prompt, agent
Abstract: While LLM-based agents can interact with environments via invoking external tools, their expanded capabilities also amplify security risks. Monitoring step-level tool invocation behaviors in real time and proactively intervening before unsafe execution is critical for agent deployment, yet remains under-explored. In this work, we first construct TS-Bench, a novel benchmark for step-level tool invocation safety detection in LLM agents. We then develop a guardrail model, TS-Guard, using multi-task reinforcement learning. The model proactively detects unsafe tool invocation actions before execution by reasoning over the interaction history. It assesses request harmfulness and action-attack correlations, producing interpretable and generalizable safety judgments and feedback. Furthermore, we introduce TS-Flow, a guardrail-feedback-driven reasoning framework for LLM agents, which reduces harmful tool invocations of ReAct-style agents by 65 percent on average and improves benign task completion by approximately 10 percent under prompt injection attacks.
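The step-level guardrail loop described here (check each tool call before it runs, and feed the verdict back to the agent) can be sketched as follows. Everything in this block is a hypothetical stand-in: `ToolCall`, `keyword_guard`, and `guarded_step` are illustrative names, and the keyword rule merely takes the place of a learned guardrail model such as TS-Guard, whose interface the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class ToolCall:
    name: str
    args: Dict[str, str]


# Hypothetical guard signature: (interaction history, proposed call) -> (is_safe, feedback)
Guard = Callable[[List[str], ToolCall], Tuple[bool, str]]


def keyword_guard(history: List[str], call: ToolCall) -> Tuple[bool, str]:
    """Toy stand-in for a learned guardrail: flags one destructive command pattern."""
    if call.name == "shell" and "rm -rf" in call.args.get("cmd", ""):
        return False, "Blocked: destructive command; propose a safer alternative."
    return True, "ok"


def guarded_step(history: List[str], call: ToolCall, guard: Guard,
                 tools: Dict[str, Callable[..., str]]) -> str:
    """Check the call before execution; on a block, return feedback instead of running the tool."""
    safe, feedback = guard(history, call)
    if not safe:
        history.append(f"guard: {feedback}")
        return feedback
    result = tools[call.name](**call.args)
    history.append(f"{call.name} -> {result}")
    return result


# The "shell" tool here only echoes its input; nothing is actually executed.
tools = {"shell": lambda cmd: f"ran {cmd}"}
history: List[str] = []
first = guarded_step(history, ToolCall("shell", {"cmd": "ls"}), keyword_guard, tools)
second = guarded_step(history, ToolCall("shell", {"cmd": "rm -rf /"}), keyword_guard, tools)
print(first)
print(second)
```

Appending the guard's feedback to the history is what lets the agent revise its next action rather than simply halting.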

Title: What Gets Activated: Uncovering Domain and Driver Experts in MoE Language Models

Authors: Guimin Hu, Meng Li, Qiwei Peng, Lijie Hu, Boyan Xu, Ruichu Cai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.10159
Pdf URL: https://arxiv.org/pdf/2601.10159
Copy Paste: [[2601.10159]] What Gets Activated: Uncovering Domain and Driver Experts in MoE Language Models(https://arxiv.org/abs/2601.10159)
Keywords: language model, llm
Abstract: Most interpretability work focuses on layer- or neuron-level mechanisms in Transformers, leaving expert-level behavior in MoE LLMs underexplored. Motivated by functional specialization in the human brain, we analyze expert activation by distinguishing domain and driver experts. In this work, we study expert activation in MoE models across three public domains and address two key questions: (1) which experts are activated, and whether certain expert types exhibit consistent activation patterns; and (2) how tokens are associated with and trigger the activation of specific experts. To answer these questions, we introduce entropy-based and causal-effect metrics to assess whether an expert is strongly favored for a particular domain, and how strongly expert activation contributes causally to the model's output, thereby identifying domain and driver experts, respectively. Furthermore, we explore how individual tokens are associated with the activation of specific experts. Our analysis reveals that (1) among the activated experts, some show clear domain preferences, while others exert strong causal influence on model performance, underscoring their decisive roles; (2) tokens occurring earlier in a sentence are more likely to trigger the driver experts; and (3) adjusting the weights of domain and driver experts leads to significant performance gains across all three models and domains. These findings shed light on the internal mechanisms of MoE models and enhance their interpretability.
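One plausible form of an entropy-based domain-preference score is sketched below. The abstract does not give the paper's exact metric, so this is an assumption: the function `domain_entropy`, the count dictionaries, and the domain names are hypothetical. The sketch computes the normalized Shannon entropy of a single expert's activation counts over domains, where a low value means the expert fires mostly for one domain.

```python
import math
from typing import Dict


def domain_entropy(counts: Dict[str, int]) -> float:
    """Normalized Shannon entropy (0 to 1) of one expert's activations across domains.

    Values near 0 suggest a strong preference for one domain;
    values near 1 suggest the expert is used about equally everywhere.
    """
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        raise ValueError("need activations over at least two domains")
    probs = [c / total for c in counts.values() if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(counts))  # divide by the maximum possible entropy


# Hypothetical activation counts for two experts over three domains.
specialist = {"code": 90, "math": 5, "law": 5}
generalist = {"code": 30, "math": 30, "law": 30}

print(round(domain_entropy(specialist), 3))
print(round(domain_entropy(generalist), 3))
```

Normalizing by the logarithm of the number of domains keeps scores comparable when the number of domains changes.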

Title: Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

Authors: Cameron Tice, Puria Radmard, Samuel Ratnam, Andy Kim, David Africa, Kyle O'Brien
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.10160
Pdf URL: https://arxiv.org/pdf/2601.10160
Copy Paste: [[2601.10160]] Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment(https://arxiv.org/abs/2601.10160)
Keywords: llm
Abstract: Pretraining corpora contain extensive discourse about AI systems, yet the causal influence of this discourse on downstream alignment remains poorly understood. If prevailing descriptions of AI behaviour are predominantly negative, LLMs may internalise corresponding behavioural priors, giving rise to self-fulfilling misalignment. This paper provides the first controlled study of this hypothesis by pretraining 6.9B-parameter LLMs with varying amounts of (mis)alignment discourse. We find that discussion of AI contributes to misalignment. Upsampling synthetic training documents about AI misalignment leads to a notable increase in misaligned behaviour. Conversely, upsampling documents about aligned behaviour reduces misalignment scores from 45% to 9%. We consider this evidence of self-fulfilling alignment. These effects are dampened, but persist through post-training. Our findings establish the study of how pretraining data shapes alignment priors, or alignment pretraining, as a complement to post-training. We recommend practitioners pretrain for alignment as well as capabilities. Our models and datasets are available at this http URL

Title: AWED-FiNER: Agents, Web applications, and Expert Detectors for Fine-grained Named Entity Recognition across 36 Languages for 6.6 Billion Speakers

Authors: Prachuryya Kaushik, Ashish Anand
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2601.10161
Pdf URL: https://arxiv.org/pdf/2601.10161
Copy Paste: [[2601.10161]] AWED-FiNER: Agents, Web applications, and Expert Detectors for Fine-grained Named Entity Recognition across 36 Languages for 6.6 Billion Speakers(https://arxiv.org/abs/2601.10161)
Keywords: language model, llm, agent
Abstract: We introduce AWED-FiNER, an open-source ecosystem designed to bridge the gap in Fine-grained Named Entity Recognition (FgNER) for 36 global languages spoken by more than 6.6 billion people. While Large Language Models (LLMs) dominate general Natural Language Processing (NLP) tasks, they often struggle with low-resource languages and fine-grained NLP tasks. AWED-FiNER provides a collection of agentic toolkits, web applications, and several state-of-the-art expert models that provide FgNER solutions across 36 languages. The agentic tools route multilingual text to specialized expert models and fetch FgNER annotations within seconds. The web-based platforms provide a ready-to-use FgNER annotation service for non-technical users. Moreover, the collection of language-specific, extremely small open-source state-of-the-art expert models facilitates offline deployment in resource-constrained scenarios, including edge devices. AWED-FiNER covers languages spoken by over 6.6 billion people, including a specific focus on vulnerable languages such as Bodo, Manipuri, Bishnupriya, and Mizo. The resources can be accessed here: Agentic Tool (this https URL), Web Application (this https URL), and 49 Expert Detector Models (this https URL).

Title: Credit C-GPT: A Domain-Specialized Large Language Model for Conversational Understanding in Vietnamese Debt Collection

Authors: Nhung Nguyen Thi Hong, Cuong Nguyen Dang, Tri Le Ngoc
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.10167
Pdf URL: https://arxiv.org/pdf/2601.10167
Copy Paste: [[2601.10167]] Credit C-GPT: A Domain-Specialized Large Language Model for Conversational Understanding in Vietnamese Debt Collection(https://arxiv.org/abs/2601.10167)
Keywords: language model, gpt
Abstract: Debt collection is a critical function within the banking, financial services, and insurance (BFSI) sector, relying heavily on large-scale human-to-human conversational interactions conducted primarily in Vietnamese contact centers. These conversations involve informal spoken language, emotional variability, and complex domain-specific reasoning, which pose significant challenges for traditional natural language processing systems. This paper introduces Credit C-GPT, a domain-specialized large language model with seven billion parameters, fine-tuned for conversational understanding in Vietnamese debt collection scenarios. The proposed model integrates multiple conversational intelligence tasks, including dialogue understanding, sentiment recognition, intent detection, call stage classification, and structured slot-value extraction, within a single reasoning-based framework. We describe the data construction process, annotation strategy, and training methodology, and evaluate the model on proprietary human-annotated datasets. Experimental results show consistent improvements over traditional pipeline-based approaches, indicating that domain-specialized conversational language models provide a scalable and privacy-aware solution for real-time assistance and post-call analytics in enterprise contact centers.

Title: HOMURA: Taming the Sand-Glass for Time-Constrained LLM Translation via Reinforcement Learning

Authors: Ziang Cui, Mengran Yu, Tianjiao Li, Chenyu Shi, Yingxuan Shi, Lusheng Zhang, Hongwei Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.10187
Pdf URL: https://arxiv.org/pdf/2601.10187
Copy Paste: [[2601.10187]] HOMURA: Taming the Sand-Glass for Time-Constrained LLM Translation via Reinforcement Learning(https://arxiv.org/abs/2601.10187)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have achieved remarkable strides in multilingual translation but are hindered by a systemic cross-lingual verbosity bias, rendering them unsuitable for strict time-constrained tasks like subtitling and dubbing. Current prompt-engineering approaches struggle to resolve this conflict between semantic fidelity and rigid temporal feasibility. To bridge this gap, we first introduce Sand-Glass, a benchmark specifically designed to evaluate translation under syllable-level duration constraints. Furthermore, we propose HOMURA, a reinforcement learning framework that explicitly optimizes the trade-off between semantic preservation and temporal compliance. By employing a KL-regularized objective with a novel dynamic syllable-ratio reward, HOMURA effectively "tames" the output length. Experimental results demonstrate that our method significantly outperforms strong LLM baselines, achieving precise length control that respects linguistic density hierarchies without compromising semantic adequacy.
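A KL-regularized objective of the kind this abstract mentions is commonly written in the generic form below. This is the standard textbook formulation, not HOMURA's actual objective: the symbols $\pi_\theta$ (policy being trained), $\pi_{\mathrm{ref}}$ (reference model), $\beta$ (KL weight), and the split of the reward into a semantic term and a syllable-ratio term weighted by $\lambda$ are all assumptions introduced for illustration.

```latex
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D}}
\Big[
  \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \big[ r(x, y) \big]
  - \beta\, D_{\mathrm{KL}}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
\Big],
\qquad
r(x, y) = r_{\mathrm{sem}}(x, y) + \lambda\, r_{\mathrm{syl}}(x, y)
```

Under this reading, the KL term keeps the translation policy close to the reference model, while $r_{\mathrm{syl}}$ would reward outputs whose syllable count fits the duration budget.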

Title: HUMANLLM: Benchmarking and Reinforcing LLM Anthropomorphism via Human Cognitive Patterns

Authors: Xintao Wang, Jian Yang, Weiyuan Li, Rui Xie, Jen-tse Huang, Jun Gao, Shuai Huang, Yueping Kang, Liyuan Gou, Hongwei Feng, Yanghua Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.10198
Pdf URL: https://arxiv.org/pdf/2601.10198
Copy Paste: [[2601.10198]] HUMANLLM: Benchmarking and Reinforcing LLM Anthropomorphism via Human Cognitive Patterns(https://arxiv.org/abs/2601.10198)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning and generation, serving as the foundation for advanced persona simulation and Role-Playing Language Agents (RPLAs). However, achieving authentic alignment with human cognitive and behavioral patterns remains a critical challenge for these agents. We present HUMANLLM, a framework treating psychological patterns as interacting causal forces. We construct 244 patterns from ~12,000 academic papers and synthesize 11,359 scenarios where 2-5 patterns reinforce, conflict, or modulate each other, with multi-turn conversations expressing inner thoughts, actions, and dialogue. Our dual-level checklists evaluate both individual pattern fidelity and emergent multi-pattern dynamics, achieving strong human alignment (r=0.91) while revealing that holistic metrics conflate simulation accuracy with social desirability. HUMANLLM-8B outperforms Qwen3-32B on multi-pattern dynamics despite 4x fewer parameters, demonstrating that authentic anthropomorphism requires cognitive modeling: simulating not just what humans do, but the psychological processes generating those behaviors.

Title: GeoSteer: Faithful Chain-of-Thought Steering via Latent Manifold Gradients

Authors: Kentaro Kazama, Daiki Shirafuji, Tatsuhiko Saito
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.10229
Pdf URL: https://arxiv.org/pdf/2601.10229
Copy Paste: [[2601.10229]] GeoSteer: Faithful Chain-of-Thought Steering via Latent Manifold Gradients(https://arxiv.org/abs/2601.10229)
Keywords: language model, llm, chain-of-thought
Abstract: Recent advances in Large Language Models (LLMs) have improved multi-step reasoning. Most approaches rely on Chain-of-Thought (CoT) rationales. Previous studies have shown that LLMs often generate logically inconsistent reasoning steps even when their final answers are correct. These inconsistencies reduce the reliability of step-level reasoning. We propose GeoSteer, a manifold-based framework that improves the quality of intermediate reasoning. The method consists of: (1) constructing a CoT dataset with segment-level scores, (2) training a Variational Autoencoder (VAE) model and a quality estimation model to learn a low-dimensional manifold of high-quality CoT trajectories, and (3) steering hidden states of target LLMs toward higher-quality regions in the latent space. This update in the latent space behaves like a natural-gradient adjustment in the original hidden-state space, which ensures geometrically coherent steering. We evaluate GeoSteer on the GSM8k dataset using the Qwen3 series, measuring both answer accuracy and overall reasoning performance. GeoSteer improved the exact match accuracy by up to 2.6 points. It also enhanced the pairwise win rate by 5.3 points. These results indicate that GeoSteer provides an effective and controllable mechanism for improving the quality of intermediate reasoning in LLMs.

Title: Loop as a Bridge: Can Looped Transformers Truly Link Representation Space and Natural Language Outputs?

Authors: Guanxu Chen, Dongrui Liu, Jing Shao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.10242
Pdf URL: https://arxiv.org/pdf/2601.10242
Copy Paste: [[2601.10242]] Loop as a Bridge: Can Looped Transformers Truly Link Representation Space and Natural Language Outputs?(https://arxiv.org/abs/2601.10242)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) often exhibit a gap between their internal knowledge and their explicit linguistic outputs. In this report, we empirically investigate whether Looped Transformers (LTs)--architectures that increase computational depth by iterating shared layers--can bridge this gap by utilizing their iterative nature as a form of introspection. Our experiments reveal that while increasing loop iterations narrows the gap, it is partly driven by a degradation of their internal knowledge carried by representations. Moreover, another empirical analysis suggests that current LTs' ability to perceive representations does not improve across loops; it is only present in the final loop. These results suggest that while LTs offer a promising direction for scaling computational depth, they have yet to achieve the introspection required to truly link representation space and natural language.

Title: coTherapist: A Behavior-Aligned Small Language Model to Support Mental Healthcare Experts

Authors: Prottay Kumar Adhikary, Reena Rawat, Tanmoy Chakraborty
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.10246
Pdf URL: https://arxiv.org/pdf/2601.10246
Copy Paste: [[2601.10246]] coTherapist: A Behavior-Aligned Small Language Model to Support Mental Healthcare Experts(https://arxiv.org/abs/2601.10246)
Keywords: language model, agent
Abstract: Access to mental healthcare is increasingly strained by workforce shortages and rising demand, motivating the development of intelligent systems that can support mental healthcare experts. We introduce coTherapist, a unified framework utilizing a small language model to emulate core therapeutic competencies through domain-specific fine-tuning, retrieval augmentation, and agentic reasoning. Evaluation on clinical queries demonstrates that coTherapist generates more relevant and clinically grounded responses than contemporary baselines. Using our novel T-BARS rubric and psychometric profiling, we confirm coTherapist exhibits high empathy and therapist-consistent personality traits. Furthermore, human evaluation by domain experts validates that coTherapist delivers accurate, trustworthy, and safe responses. coTherapist was deployed and tested by clinical experts. Collectively, these findings demonstrate that small models can be engineered to exhibit expert-like behavior, offering a scalable pathway for digital mental health tools.

Title: Untangling Input Language from Reasoning Language: A Diagnostic Framework for Cross-Lingual Moral Alignment in LLMs

Authors: Nan Li, Bo Kang, Tijl De Bie
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.10257
Pdf URL: https://arxiv.org/pdf/2601.10257
Copy Paste: [[2601.10257]] Untangling Input Language from Reasoning Language: A Diagnostic Framework for Cross-Lingual Moral Alignment in LLMs(https://arxiv.org/abs/2601.10257)
Keywords: llm
Abstract: When LLMs judge moral dilemmas, do they reach different conclusions in different languages, and if so, why? Two factors could drive such differences: the language of the dilemma itself, or the language in which the model reasons. Standard evaluation conflates these by testing only matched conditions (e.g., English dilemma with English reasoning). We introduce a methodology that separately manipulates each factor, covering also mismatched conditions (e.g., English dilemma with Chinese reasoning), enabling decomposition of their contributions. To study what changes, we propose an approach to interpret the moral judgments in terms of Moral Foundations Theory. As a side result, we identify evidence for splitting the Authority dimension into a family-related and an institutional dimension. Applying this methodology to English-Chinese moral judgment with 13 LLMs, we demonstrate its diagnostic power: (1) the framework isolates reasoning-language effects as contributing twice the variance of input-language effects; (2) it detects context-dependency in nearly half of models that standard evaluation misses; and (3) a diagnostic taxonomy translates these patterns into deployment guidance. We release our code and datasets at this https URL.

Title: Measuring Affinity between Attention-Head Weight Subspaces via the Projection Kernel

Authors: Hiroaki Yamagiwa, Yusuke Takase, Hidetoshi Shimodaira
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.10266
Pdf URL: https://arxiv.org/pdf/2601.10266
Copy Paste: [[2601.10266]] Measuring Affinity between Attention-Head Weight Subspaces via the Projection Kernel(https://arxiv.org/abs/2601.10266)
Keywords: gpt
Abstract: Understanding relationships between attention heads is essential for interpreting the internal structure of Transformers, yet existing metrics do not capture this structure well. We focus on the subspaces spanned by attention-head weight matrices and quantify head-to-head relationships using the Projection Kernel (PK), a principal-angle-based measure of subspace similarity. Experiments show that PK reproduces known head-to-head interactions on the IOI task more clearly than prior metrics such as the Composition Score. We further introduce a framework to quantify the informativeness of PK distributions by comparing them with a reference distribution derived from random orthogonal subspaces. As an application, we analyze a directed graph constructed from PK and show that, in GPT2-small, L4H7 acts as a hub by functioning as an identity head.
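As a rough illustration of the principal-angle-based similarity mentioned in the abstract, the sketch below computes the standard projection kernel between two subspaces, i.e. the sum of squared cosines of their principal angles, which equals the squared Frobenius norm of Qa^T Qb for orthonormal bases Qa and Qb. How the paper derives subspaces from attention-head weight matrices is not stated in the abstract, so the input vectors, the helper names, and the toy 3-D example are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors already inside the current span
            basis.append([wi / norm for wi in w])
    return basis


def projection_kernel(span_a, span_b):
    """Sum of squared cosines of the principal angles between two subspaces."""
    qa = orthonormal_basis(span_a)
    qb = orthonormal_basis(span_b)
    return sum(dot(a, b) ** 2 for a in qa for b in qb)


# Toy 3-D example (made-up vectors, not real attention-head weights).
plane_xy = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
plane_xz = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
pk_same = projection_kernel(plane_xy, plane_xy)    # identical planes
pk_shared = projection_kernel(plane_xy, plane_xz)  # planes sharing one axis
```

For two d-dimensional subspaces the value ranges from 0 (orthogonal) to d (identical), so the two toy planes that share a single axis score 1 out of a possible 2.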

Title: MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts

Authors: Yuxuan Lou, Kai Yang, Yang You
Subjects: cs.CL, cs.AI, cs.LG, cs.SD
Abstract URL: https://arxiv.org/abs/2601.10272
Pdf URL: https://arxiv.org/pdf/2601.10272
Copy Paste: [[2601.10272]] MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts(https://arxiv.org/abs/2601.10272)
Keywords: language model, llm
Abstract: We present MoST (Mixture of Speech and Text), a novel multimodal large language model that seamlessly integrates speech and text processing through our proposed Modality-Aware Mixture of Experts (MAMoE) architecture. While current multimodal models typically process diverse modality representations with identical parameters, disregarding their inherent representational differences, we introduce specialized routing pathways that direct tokens to modality-appropriate experts based on input type. MAMoE simultaneously enhances modality-specific learning and cross-modal understanding through two complementary components: modality-specific expert groups that capture domain-specific patterns and shared experts that facilitate information transfer between modalities. Building on this architecture, we develop an efficient transformation pipeline that adapts the pretrained MoE language model through strategic post-training on ASR and TTS datasets, followed by fine-tuning with a carefully curated speech-text instruction dataset. A key feature of this pipeline is that it relies exclusively on fully accessible, open-source datasets to achieve strong performance and data efficiency. Comprehensive evaluations across ASR, TTS, audio language modeling, and spoken question answering benchmarks show that MoST consistently outperforms existing models of comparable parameter counts. Our ablation studies confirm that the modality-specific routing mechanism and shared experts design significantly contribute to performance gains across all tested domains. To our knowledge, MoST represents the first fully open-source speech-text LLM built on a Mixture of Experts architecture. We release the MoST model, training code, inference code, and training data at this https URL.
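The routing idea in the abstract (tokens go to modality-appropriate experts, plus shared experts) could look roughly like the sketch below. The abstract gives no architectural details, so the expert layout, the top-k selection, the softmax gate over selected experts, and every name here (`EXPERT_GROUPS`, `SHARED_EXPERTS`, `route_token`) are hypothetical choices for illustration, not the MAMoE implementation.

```python
import math

# Hypothetical layout: two text experts, two speech experts, one shared expert.
EXPERT_GROUPS = {"text": [0, 1], "speech": [2, 3]}
SHARED_EXPERTS = [4]


def route_token(router_logits, modality, top_k=1):
    """Pick top-k experts from the token's own modality group only.

    Returns (gate weights over the selected experts, always-on shared experts).
    """
    allowed = EXPERT_GROUPS[modality]
    ranked = sorted(allowed, key=lambda e: router_logits[e], reverse=True)[:top_k]
    # Softmax restricted to the selected experts (assumed gating rule).
    peak = max(router_logits[e] for e in ranked)
    exps = {e: math.exp(router_logits[e] - peak) for e in ranked}
    total = sum(exps.values())
    weights = {e: v / total for e, v in exps.items()}
    return weights, list(SHARED_EXPERTS)


# Made-up router logits: expert 2 (a speech expert) has the highest score.
logits = [0.1, 2.0, 5.0, -1.0, 0.0]
text_weights, text_shared = route_token(logits, "text")
speech_weights, speech_shared = route_token(logits, "speech", top_k=2)
```

In this toy case the text token is sent to expert 1 even though a speech expert scores higher, which is the effect a modality mask is meant to have; the shared expert is returned for every token regardless of modality.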

Title: The Straight and Narrow: Do LLMs Possess an Internal Moral Path?

Authors: Luoming Hu, Jingjie Zeng, Liang Yang, Hongfei Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.10307
Pdf URL: https://arxiv.org/pdf/2601.10307
Copy Paste: [[2601.10307]] The Straight and Narrow: Do LLMs Possess an Internal Moral Path?(https://arxiv.org/abs/2601.10307)
Keywords: language model, llm
Abstract: Enhancing the moral alignment of Large Language Models (LLMs) is a critical challenge in AI safety. Current alignment techniques often act as superficial guardrails, leaving the intrinsic moral representations of LLMs largely untouched. In this paper, we bridge this gap by leveraging Moral Foundations Theory (MFT) to map and manipulate the fine-grained moral landscape of LLMs. Through cross-lingual linear probing, we validate the shared nature of moral representations in middle layers and uncover a shared yet different moral subspace between English and Chinese. Building upon this, we extract steerable Moral Vectors and successfully validate their efficacy at both internal and behavioral levels. Leveraging the high generalizability of morality, we propose Adaptive Moral Fusion (AMF), a dynamic inference-time intervention that synergizes probe detection with vector injection to tackle the safety-helpfulness trade-off. Empirical results confirm that our approach acts as a targeted intrinsic defense, effectively reducing incorrect refusals on benign queries while minimizing jailbreak success rates compared to standard baselines.

Title: Multilinguality as Sense Adaptation

Authors: Jan Christian Blaise Cruz, David Ifeoluwa Adelani, Alham Fikri Aji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.10310
Pdf URL: https://arxiv.org/pdf/2601.10310
Copy Paste: [[2601.10310]] Multilinguality as Sense Adaptation(https://arxiv.org/abs/2601.10310)
Keywords: language model
Abstract: We approach multilinguality as sense adaptation: aligning latent meaning representations across languages rather than relying solely on shared parameters and scale. In this paper, we introduce SENse-based Symmetric Interlingual Alignment (SENSIA), which adapts a Backpack language model from one language to another by explicitly aligning sense-level mixtures and contextual representations on parallel data, while jointly training a target-language language modeling loss to preserve fluency. Across benchmarks on four typologically diverse languages, SENSIA generally outperforms comparable multilingual alignment methods and achieves competitive accuracy against monolingual from-scratch baselines while using 2-4x less target-language data. Analyses of learned sense geometry indicate that local sense topology and global structure relative to English are largely preserved, and ablations show that the method is robust in terms of design and scale.

Title: Boundary-Aware NL2SQL: Integrating Reliability through Hybrid Reward and Data Synthesis

Authors: Songsong Tian, Kongsheng Zhuo, Zhendong Wang, Rong Shen, Shengtao Zhang, Yong Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.10318
Pdf URL: https://arxiv.org/pdf/2601.10318
Copy Paste: [[2601.10318]] Boundary-Aware NL2SQL: Integrating Reliability through Hybrid Reward and Data Synthesis(https://arxiv.org/abs/2601.10318)
Keywords: gpt, chain-of-thought
Abstract: In this paper, we present BAR-SQL (Boundary-Aware Reliable NL2SQL), a unified training framework that embeds reliability and boundary awareness directly into the generation process. We introduce a Seed Mutation data synthesis paradigm that constructs a representative enterprise corpus, explicitly encompassing multi-step analytical queries alongside boundary cases including ambiguity and schema limitations. To ensure interpretability, we employ Knowledge-Grounded Reasoning Synthesis, which produces Chain-of-Thought traces explicitly anchored in schema metadata and business rules. The model is trained through a two-stage process: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning via Group Relative Policy Optimization. We design a Task-Conditioned Hybrid Reward mechanism that simultaneously optimizes SQL execution accuracy (leveraging Abstract Syntax Tree analysis and dense result matching) and semantic precision in abstention responses. To evaluate reliability alongside generation accuracy, we construct and release Ent-SQL-Bench, which jointly assesses SQL precision and boundary-aware abstention across ambiguous and unanswerable queries. Experimental results on this benchmark demonstrate that BAR-SQL achieves 91.48% average accuracy, outperforming leading proprietary models, including Claude 4.5 Sonnet and GPT-5, in both SQL generation quality and boundary-aware abstention capability. The source code and benchmark are available anonymously at: this https URL.
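For readers unfamiliar with Group Relative Policy Optimization, the core step is to score each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that normalization, under the assumption that the population standard deviation is used; the reward values are made up and do not reflect the paper's Task-Conditioned Hybrid Reward.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    advantage_i = (r_i - mean(r)) / (std(r) + eps)
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]


# Made-up rewards for four sampled answers to one query.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses that beat the group average receive a positive advantage and the rest a negative one, so no separate value network is needed to provide a baseline.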

Title: An Efficient Long-Context Ranking Architecture With Calibrated LLM Distillation: Application to Person-Job Fit

Authors: Warren Jouanneau, Emma Jouffroy, Marc Palyart
Subjects: cs.CL, cs.IR, cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2601.10321
Pdf URL: https://arxiv.org/pdf/2601.10321
Copy Paste: [[2601.10321]] An Efficient Long-Context Ranking Architecture With Calibrated LLM Distillation: Application to Person-Job Fit(https://arxiv.org/abs/2601.10321)
Keywords: language model, llm
Abstract: Finding the most relevant person for a job proposal in real time is challenging, especially when resumes are long, structured, and multilingual. In this paper, we propose a re-ranking model based on a new generation of late cross-attention architecture, that decomposes both resumes and project briefs to efficiently handle long-context inputs with minimal computational overhead. To mitigate historical data biases, we use a generative large language model (LLM) as a teacher, generating fine-grained, semantically grounded supervision. This signal is distilled into our student model via an enriched distillation loss function. The resulting model produces skill-fit scores that enable consistent and interpretable person-job matching. Experiments on relevance, ranking, and calibration metrics demonstrate that our approach outperforms state-of-the-art baselines.

Title: OctoBench: Benchmarking Scaffold-Aware Instruction Following in Repository-Grounded Agentic Coding

Authors: Deming Ding, Shichun Liu, Enhui Yang, Jiahang Lin, Ziying Chen, Shihan Dou, Honglin Guo, Weiyu Cheng, Pengyu Zhao, Chengjun Xiao, Qunhong Zeng, Qi Zhang, Xuanjing Huang, Qidi Xu, Tao Gui
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.10343
Pdf URL: https://arxiv.org/pdf/2601.10343
Copy Paste: [[2601.10343]] OctoBench: Benchmarking Scaffold-Aware Instruction Following in Repository-Grounded Agentic Coding(https://arxiv.org/abs/2601.10343)
Keywords: llm, agent
Abstract: Modern coding scaffolds turn LLMs into capable software agents, but their ability to follow scaffold-specified instructions remains under-examined, especially when constraints are heterogeneous and persist across interactions. To fill this gap, we introduce OctoBench, which benchmarks scaffold-aware instruction following in repository-grounded agentic coding. OctoBench includes 34 environments and 217 tasks instantiated under three scaffold types, and is paired with 7,098 objective checklist items. To disentangle solving the task from following the rules, we provide an automated observation-and-scoring toolkit that captures full trajectories and performs fine-grained checks. Experiments on eight representative models reveal a systematic gap between task-solving and scaffold-aware compliance, underscoring the need for training and evaluation that explicitly targets heterogeneous instruction following. We release the benchmark to support reproducible benchmarking and to accelerate the development of more scaffold-aware coding agents.

Title: Training-Trajectory-Aware Token Selection

Authors: Zhanming Shen, Jiaqi Hu, Zeyu Qin, Hao Chen, Wentao Ye, Zenan Huang, Yihong Zhuang, Guoshan Lu, Junlin Zhou, Junbo Zhao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.10348
Pdf URL: https://arxiv.org/pdf/2601.10348
Copy Paste: [[2601.10348]] Training-Trajectory-Aware Token Selection(https://arxiv.org/abs/2601.10348)
Keywords: llm
Abstract: Efficient distillation is a key pathway for converting expensive reasoning capability into deployable efficiency, yet in the frontier regime where the student already has strong reasoning ability, naive continual distillation often yields limited gains or even degradation. We observe a characteristic training phenomenon: even as loss decreases monotonically, all performance metrics can drop sharply at almost the same bottleneck, before gradually recovering. We further uncover a token-level mechanism: confidence bifurcates into steadily increasing Imitation-Anchor Tokens that quickly anchor optimization and other yet-to-learn tokens whose confidence is suppressed until after the bottleneck. The inability of these two token types to coexist is the root cause of the failure in continual distillation. To this end, we propose Training-Trajectory-Aware Token Selection (T3S) to reconstruct the training objective at the token level, clearing the optimization path for yet-to-learn tokens. T3 yields consistent gains in both AR and dLLM settings: with only hundreds of examples, Qwen3-8B surpasses DeepSeek-R1 on competitive reasoning benchmarks, Qwen3-32B approaches Qwen3-235B, and T3-trained LLaDA-2.0-Mini exceeds its AR baseline, achieving state-of-the-art performance among all 16B-scale no-think models.

Title: Unlocking Implicit Experience: Synthesizing Tool-Use Trajectories from Text

Authors: Zhihao Xu, Rumei Li, Jiahuan Li, Rongxiang Weng, Jingang Wang, Xunliang Cai, Xiting Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.10355
Pdf URL: https://arxiv.org/pdf/2601.10355
Copy Paste: [[2601.10355]] Unlocking Implicit Experience: Synthesizing Tool-Use Trajectories from Text(https://arxiv.org/abs/2601.10355)
Keywords: language model, llm, agent
Abstract: Enabling Large Language Models (LLMs) to effectively utilize tools in multi-turn interactions is essential for building capable autonomous agents. However, acquiring diverse and realistic multi-turn tool-use data remains a significant challenge. In this work, we propose a novel text-based paradigm. We observe that textual corpora naturally contain rich, multi-step problem-solving experiences, which can serve as an untapped, scalable, and authentic data source for multi-turn tool-use tasks. Based on this insight, we introduce GEM, a data synthesis pipeline that enables the generation and extraction of multi-turn tool-use trajectories from text corpora through a four-stage process: relevance filtering, workflow & tool extraction, trajectory grounding, and complexity refinement. To reduce the computational cost, we further train a specialized Trajectory Synthesizer via supervised fine-tuning. This model distills the complex generation pipeline into an efficient, end-to-end trajectory generator. Experiments demonstrate that our GEM-32B achieves a 16.5% improvement on the BFCL V3 Multi-turn benchmark. Our models partially surpass the performance of models trained on τ-bench (Airline and Retail) in-domain data, highlighting the superior generalization capability derived from our text-based synthesis paradigm. Notably, our Trajectory Synthesizer matches the quality of the full pipeline while significantly reducing inference latency and costs.

Title: The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

Authors: Christina Lu, Jack Gallagher, Jonathan Michala, Kyle Fish, Jack Lindsey
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.10387
Pdf URL: https://arxiv.org/pdf/2601.10387
Copy Paste: [[2601.10387]] The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models(https://arxiv.org/abs/2601.10387)
Keywords: language model
Abstract: Large language models can represent a variety of personas but typically default to a helpful Assistant identity cultivated during post-training. We investigate the structure of the space of model personas by extracting activation directions corresponding to diverse character archetypes. Across several different models, we find that the leading component of this persona space is an "Assistant Axis," which captures the extent to which a model is operating in its default Assistant mode. Steering towards the Assistant direction reinforces helpful and harmless behavior; steering away increases the model's tendency to identify as other entities. Moreover, steering away with more extreme values often induces a mystical, theatrical speaking style. We find this axis is also present in pre-trained models, where it primarily promotes helpful human archetypes like consultants and coaches and inhibits spiritual ones. Measuring deviations along the Assistant Axis predicts "persona drift," a phenomenon where models slip into exhibiting harmful or bizarre behaviors that are uncharacteristic of their typical persona. We find that persona drift is often driven by conversations demanding meta-reflection on the model's processes or featuring emotionally vulnerable users. We show that restricting activations to a fixed region along the Assistant Axis can stabilize model behavior in these scenarios -- and also in the face of adversarial persona-based jailbreaks. Our results suggest that post-training steers models toward a particular region of persona space but only loosely tethers them to it, motivating work on training and steering strategies that more deeply anchor models to a coherent persona.
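One simple way to picture "restricting activations to a fixed region along the Assistant Axis" is to clamp the component of an activation vector along a chosen direction while leaving the orthogonal part untouched. The sketch below does exactly that on plain Python lists. The abstract does not say how the axis is extracted, which layers are affected, or what bounds are used, so the axis vector, the bounds, and the function name `clamp_along_axis` are assumptions for illustration only.

```python
import math


def clamp_along_axis(activation, axis, low, high):
    """Clamp the coordinate of `activation` along `axis` to [low, high].

    Only the component parallel to the axis changes; the orthogonal
    component of the activation is preserved.
    """
    norm = math.sqrt(sum(a * a for a in axis))
    unit = [a / norm for a in axis]
    coord = sum(x * u for x, u in zip(activation, unit))
    clamped = min(max(coord, low), high)
    shift = clamped - coord
    return [x + shift * u for x, u in zip(activation, unit)]


# Toy 2-D example: the first coordinate plays the role of the axis.
axis = [1.0, 0.0]
drifted = clamp_along_axis([-3.0, 2.0], axis, low=-1.0, high=1.0)
in_range = clamp_along_axis([0.5, 2.0], axis, low=-1.0, high=1.0)
```

Activations already inside the allowed interval pass through unchanged, so an intervention of this kind only acts when the projection drifts past a bound.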

Title: INDIC DIALECT: A Multi Task Benchmark to Evaluate and Translate in Indian Language Dialects

Authors: Tarun Sharma, Manikandan Ravikiran, Sourava Kumar Behera, Pramit Bhattacharya, Arnab Bhattacharya, Rohit Saluja
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.10388
Pdf URL: https://arxiv.org/pdf/2601.10388
Copy Paste: [[2601.10388]] INDIC DIALECT: A Multi Task Benchmark to Evaluate and Translate in Indian Language Dialects(https://arxiv.org/abs/2601.10388)
Keywords: gpt, llm
Abstract: Recent NLP advances focus primarily on standardized languages, leaving most low-resource dialects under-served, especially in Indian scenarios. In India, the issue is particularly important: despite Hindi being the third most spoken language globally (over 600 million speakers), its numerous dialects remain underrepresented. The situation is similar for Odia, which has around 45 million speakers. While some datasets exist which contain standard Hindi and Odia languages, their regional dialects have almost no web presence. We introduce INDIC-DIALECT, a human-curated parallel corpus of 13k sentence pairs spanning 11 dialects and 2 languages: Hindi and Odia. Using this corpus, we construct a multi-task benchmark with three tasks: dialect classification, multiple-choice question (MCQ) answering, and machine translation (MT). Our experiments show that LLMs like GPT-4o and Gemini 2.5 perform poorly on the classification task, while fine-tuned transformer-based models pretrained on Indian languages substantially improve performance, e.g., improving F1 from 19.6% to 89.8% on dialect classification. For dialect-to-language translation, we find that a hybrid AI model achieves the highest BLEU score of 61.32 compared to the baseline score of 23.36. Interestingly, due to the complexity of generating dialect sentences, we observe that for language-to-dialect translation the "rule-based followed by AI" approach achieves the best BLEU score of 48.44 compared to the baseline score of 27.59. INDIC-DIALECT thus is a new benchmark for dialect-aware Indic NLP, and we plan to release it as open source to support further work on low-resource Indian dialects.

Title: TF3-RO-50M: Training Compact Romanian Language Models from Scratch on Synthetic Moral Microfiction

Authors: Mihai Dan Nadas, Laura Diosan, Andreea Tomescu, Andrei Piscoran
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.10410
Pdf URL: https://arxiv.org/pdf/2601.10410
Copy Paste: [[2601.10410]] TF3-RO-50M: Training Compact Romanian Language Models from Scratch on Synthetic Moral Microfiction(https://arxiv.org/abs/2601.10410)
Keywords: language model, llm, prompt
Abstract: Recent advances in synthetic data generation have shown that compact language models can be trained effectively when the underlying corpus is structurally controlled and linguistically coherent. However, for morphologically rich and computationally under-resourced languages such as Romanian, there is still no openly documented, end-to-end pipeline that unifies tokenizer design, preprocessing, pretraining, compression, evaluation, and large-scale synthetic data generation in a reproducible framework. Building on TF1, a three-million-story English fable dataset, and TF2, which extends TF1 through high-quality Romanian translations, we introduce TF3-RO, a Romanian-centric language modeling pipeline spanning tokenizer training, from-scratch model development, and Romanian-native dataset generation. TF3-RO constructs Romanian-specific BPE and Unigram tokenizers from a linguistically informed corpus to mitigate token inflation induced by Romanian morphology. Using long-sequence packed training, we pretrain a 51.65M-parameter LLaMA-style Transformer entirely from scratch. The model is subsequently optimized through quantization, structured pruning, and logit-based knowledge distillation, yielding a compact 26.45M-parameter student model with tied embeddings and strong deployment characteristics. Using this distilled model, TF3-RO generates three million Romanian-native synthetic fables via a controlled combinatorial prompting framework. Across all stages, the pipeline integrates a comprehensive evaluation suite combining intrinsic metrics, Romanian agreement probes, entity coherence, rule-based grammar checking, and LLM-based assessment. TF3-RO provides a reproducible and linguistically grounded framework for training compact Romanian language models and producing large-scale synthetic narrative corpora.
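The "logit-based knowledge distillation" step mentioned in the abstract is commonly implemented as a temperature-softened KL divergence between teacher and student output distributions. The sketch below shows that common formulation for a single token position. The temperature value, the T-squared scaling, and the toy logits are assumptions; the abstract does not specify the exact loss used for TF3-RO.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    return temperature ** 2 * kl


# Made-up logits over a three-token vocabulary.
teacher = [2.0, 0.5, -1.0]
loss_matched = distillation_loss(teacher, teacher)
loss_mismatched = distillation_loss([0.0, 0.0, 0.0], teacher)
```

The loss is zero when the student reproduces the teacher's logits and grows as the two softened distributions diverge; in practice it is usually averaged over positions and often mixed with the ordinary language-modeling loss.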
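The "logit-based knowledge distillation" step named in the abstract is commonly implemented as a temperature-scaled KL divergence between teacher and student output distributions. The sketch below shows that generic objective in plain Python; it is not taken from the TF3-RO code, and the function names, the temperature value, and the toy logits are all assumptions made for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Toy logits over a three-token vocabulary (hypothetical values).
teacher = [2.0, 0.5, -1.0]
print(distillation_loss(teacher, teacher))          # identical logits: zero loss
print(distillation_loss(teacher, [0.0, 0.0, 0.0]))  # uniform student: positive loss
```

In practice this term is usually averaged over token positions and often mixed with the ordinary cross-entropy loss; whether TF3-RO does so is not stated in the abstract.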

Title: Are Language Models Models?

Authors: Philip Resnik
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.10421
Pdf URL: https://arxiv.org/pdf/2601.10421
Copy Paste: [[2601.10421]] Are Language Models Models?(https://arxiv.org/abs/2601.10421)
Keywords: language model, llm
Abstract: Futrell and Mahowald claim LMs "serve as model systems", but an assessment at each of Marr's three levels suggests the claim is clearly not true at the implementation level, poorly motivated at the algorithmic-representational level, and problematic at the computational theory level. LMs are good candidates as tools; calling them cognitive models overstates the case and unnecessarily feeds LLM hype.

Title: SurgGoal: Rethinking Surgical Planning Evaluation via Goal-Satisfiability

Authors: Ruochen Li, Kun Yuan, Yufei Xia, Yue Zhou, Qingyu Lu, Weihang Li, Youxiang Zhu, Nassir Navab
Subjects: cs.CL, cs.RO
Abstract URL: https://arxiv.org/abs/2601.10455
Pdf URL: https://arxiv.org/pdf/2601.10455
Copy Paste: [[2601.10455]] SurgGoal: Rethinking Surgical Planning Evaluation via Goal-Satisfiability(https://arxiv.org/abs/2601.10455)
Keywords: language model, llm
Abstract: Surgical planning integrates visual perception, long-horizon reasoning, and procedural knowledge, yet it remains unclear whether current evaluation protocols reliably assess vision-language models (VLMs) in safety-critical settings. Motivated by a goal-oriented view of surgical planning, we define planning correctness via phase-goal satisfiability, where plan validity is determined by expert-defined surgical rules. Based on this definition, we introduce a multicentric meta-evaluation benchmark with valid procedural variations and invalid plans containing order and content errors. Using this benchmark, we show that sequence similarity metrics systematically misjudge planning quality, penalizing valid plans while failing to identify invalid ones. We therefore adopt a rule-based goal-satisfiability metric as a high-precision meta-evaluation reference to assess Video-LLMs under progressively constrained settings, revealing failures due to perception errors and under-constrained reasoning. Structural knowledge consistently improves performance, whereas semantic guidance alone is unreliable and benefits larger models only when combined with structural constraints.
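The contrast between sequence similarity and rule-based goal satisfiability can be illustrated with a toy example. The step names and precedence rules below are hypothetical and are not the paper's expert-defined rules; `difflib.SequenceMatcher` stands in for a generic sequence-similarity metric.

```python
from difflib import SequenceMatcher

# Hypothetical rule set: every required step must appear, and each
# (earlier, later) pair must occur in that order.
REQUIRED = {"expose", "clip_artery", "clip_duct", "cut", "extract"}
PRECEDENCE = [
    ("expose", "clip_artery"), ("expose", "clip_duct"),
    ("clip_artery", "cut"), ("clip_duct", "cut"), ("cut", "extract"),
]

def satisfies_rules(plan):
    """True if the plan contains all required steps in a rule-compliant order."""
    position = {step: i for i, step in enumerate(plan)}
    if not REQUIRED.issubset(position):
        return False
    return all(position[a] < position[b] for a, b in PRECEDENCE)

def similarity(plan, reference):
    """Generic sequence similarity in [0, 1] against a single reference plan."""
    return SequenceMatcher(None, reference, plan).ratio()

reference = ["expose", "clip_artery", "clip_duct", "cut", "extract"]
# Valid variation: the two clipping steps are swapped, both still precede "cut".
valid_variant = ["expose", "clip_duct", "clip_artery", "cut", "extract"]
# Invalid plan: "cut" happens before the duct is clipped.
invalid_plan = ["expose", "clip_artery", "cut", "clip_duct", "extract"]

for name, plan in [("valid_variant", valid_variant), ("invalid_plan", invalid_plan)]:
    print(name, round(similarity(plan, reference), 2), satisfies_rules(plan))
```

Both plans differ from the reference by a single swap, so the similarity score does not separate them, while the rule check does.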

Title: Contextual StereoSet: Stress-Testing Bias Alignment Robustness in Large Language Models

Authors: Abhinaba Basu, Pavan Chakraborty
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2601.10460
Pdf URL: https://arxiv.org/pdf/2601.10460
Copy Paste: [[2601.10460]] Contextual StereoSet: Stress-Testing Bias Alignment Robustness in Large Language Models(https://arxiv.org/abs/2601.10460)
Keywords: language model, prompt
Abstract: A model that avoids stereotypes in a lab benchmark may not avoid them in deployment. We show that measured bias shifts dramatically when prompts mention different places, times, or audiences; no adversarial prompting is required. We introduce Contextual StereoSet, a benchmark that holds stereotype content fixed while systematically varying contextual framing. Testing 13 models across two protocols, we find striking patterns: anchoring to 1990 (vs. 2030) raises stereotype selection in all models tested on this contrast (p<0.05); gossip framing raises it in 5 of 6 full-grid models; out-group observer framing shifts it by up to 13 percentage points. These effects replicate in hiring, lending, and help-seeking vignettes. We propose Context Sensitivity Fingerprints (CSF): a compact profile of per-dimension dispersion and paired contrasts with bootstrap CIs and FDR correction. Two evaluation tracks support different use cases: a 360-context diagnostic grid for deep analysis and a budgeted protocol covering 4,229 items for production screening. The implication is methodological: bias scores from fixed-condition tests may not generalize. This is not a claim about ground-truth bias rates; it is a stress test of evaluation robustness. CSF forces evaluators to ask, "Under what conditions does bias appear?" rather than "Is this model biased?" We release our benchmark, code, and results.
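A paired contrast with a bootstrap confidence interval, one of the ingredients the abstract lists for Context Sensitivity Fingerprints, can be sketched as follows. This is a generic percentile bootstrap, not the authors' code; the per-item 0/1 stereotype selections, the function name, and the resample count are assumptions.

```python
import random
import statistics

def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference a - b."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(diffs) for _ in diffs]  # resample items with replacement
        means.append(statistics.fmean(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), lo, hi

# Hypothetical data: 1 = stereotyped option chosen, same 12 items under two framings.
framing_1990 = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1]
framing_2030 = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0]

mean_diff, lo, hi = paired_bootstrap_ci(framing_1990, framing_2030)
print(f"mean difference = {mean_diff:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

With many contrasts per model, the resulting p-values or intervals would still need a multiple-comparison step such as the FDR correction the abstract mentions; that step is omitted here.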

Title: DR-Arena: an Automated Evaluation Framework for Deep Research Agents

Authors: Yiwen Gao, Ruochen Zhao, Yang Deng, Wenxuan Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.10504
Pdf URL: https://arxiv.org/pdf/2601.10504
Copy Paste: [[2601.10504]] DR-Arena: an Automated Evaluation Framework for Deep Research Agents(https://arxiv.org/abs/2601.10504)
Keywords: language model, llm, agent
Abstract: As Large Language Models (LLMs) increasingly operate as Deep Research (DR) Agents capable of autonomous investigation and information synthesis, reliable evaluation of their task performance has become a critical bottleneck. Current benchmarks predominantly rely on static datasets, which suffer from several limitations: limited task generality, temporal misalignment, and data contamination. To address these, we introduce DR-Arena, a fully automated evaluation framework that pushes DR agents to their capability limits through dynamic investigation. DR-Arena constructs real-time Information Trees from fresh web trends to ensure the evaluation rubric is synchronized with the live world state, and employs an automated Examiner to generate structured tasks testing two orthogonal capabilities: Deep reasoning and Wide coverage. DR-Arena further adopts an Adaptive Evolvement Loop, a state-machine controller that dynamically escalates task complexity based on real-time performance, demanding deeper deduction or wider aggregation until a decisive capability boundary emerges. Experiments with six advanced DR agents demonstrate that DR-Arena achieves a Spearman correlation of 0.94 with the LMSYS Search Arena leaderboard. This represents state-of-the-art alignment with human preferences without any manual effort, validating DR-Arena as a reliable alternative to costly human adjudication.
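The agreement figure in the abstract is a Spearman rank correlation between two rankings of the same agents. As a reminder of how that statistic is computed, here is a small self-contained sketch. The six score pairs are invented for illustration and are unrelated to the paper's results.

```python
import math

def ranks(values):
    """Average 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores for six agents: a human-preference leaderboard rating
# and an automated benchmark score.
leaderboard = [1510, 1480, 1455, 1430, 1400, 1390]
automated = [0.77, 0.81, 0.70, 0.60, 0.52, 0.55]
print(round(spearman(leaderboard, automated), 3))
```

With only six agents, a single swapped pair moves the coefficient noticeably, which is worth keeping in mind when reading correlations computed on small leaderboards.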

Title: PERM: Psychology-grounded Empathetic Reward Modeling for Large Language Models

Authors: Chengbing Wang, Wuqiang Zheng, Yang Zhang, Fengbin Zhu, Junyi Cheng, Yi Xie, Wenjie Wang, Fuli Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.10532
Pdf URL: https://arxiv.org/pdf/2601.10532
Copy Paste: [[2601.10532]] PERM: Psychology-grounded Empathetic Reward Modeling for Large Language Models(https://arxiv.org/abs/2601.10532)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are increasingly deployed in human-centric applications, yet they often fail to provide substantive emotional support. While Reinforcement Learning (RL) has been utilized to enhance empathy of LLMs, existing reward models typically evaluate empathy from a single perspective, overlooking the inherently bidirectional interaction nature of empathy between the supporter and seeker as defined by Empathy Cycle theory. To address this limitation, we propose Psychology-grounded Empathetic Reward Modeling (PERM). PERM operationalizes empathy evaluation through a bidirectional decomposition: 1) Supporter perspective, assessing internal resonation and communicative expression; 2) Seeker perspective, evaluating emotional reception. Additionally, it incorporates a bystander perspective to monitor overall interaction quality. Extensive experiments on a widely-used emotional intelligence benchmark and an industrial daily conversation dataset demonstrate that PERM outperforms state-of-the-art baselines by over 10%. Furthermore, a blinded user study reveals a 70% preference for our approach, highlighting its efficacy in generating more empathetic responses. Our code, dataset, and models are available at this https URL.

Title: Representation-Aware Unlearning via Activation Signatures: From Suppression to Knowledge-Signature Erasure

Authors: Syed Naveed Mahmood, Md. Rezaur Rahman Bhuiyan, Tasfia Zaman, Jareen Tasneem Khondaker, Md. Sameer Sakib, Nazia Tasnim, Farig Sadeque
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.10566
Pdf URL: https://arxiv.org/pdf/2601.10566
Copy Paste: [[2601.10566]] Representation-Aware Unlearning via Activation Signatures: From Suppression to Knowledge-Signature Erasure(https://arxiv.org/abs/2601.10566)
Keywords: llm
Abstract: Selective knowledge erasure from LLMs is critical for GDPR compliance and model safety, yet current unlearning methods conflate behavioral suppression with true knowledge removal, allowing latent capabilities to persist beneath surface-level refusals. In this work, we address this challenge by introducing the Knowledge Immunization Framework (KIF), a representation-aware architecture that distinguishes genuine erasure from obfuscation by targeting internal activation signatures rather than surface outputs. Our approach combines dynamic suppression of subject-specific representations with parameter-efficient adaptation, enabling durable unlearning without full model retraining. KIF achieves near-oracle erasure (FQ ≈ 0.99 vs. 1.00) while preserving utility at oracle levels (MU = 0.62), effectively breaking the stability-erasure tradeoff that has constrained all prior work. We evaluate both standard foundation models (Llama and Mistral) and reasoning-prior models (Qwen and DeepSeek) across 3B to 14B parameters. Our observations show that standard models exhibit scale-independent true erasure (<3% utility drift), while reasoning-prior models reveal fundamental architectural divergence. Our comprehensive dual-metric evaluation protocol, combining surface-level leakage with latent trace persistence, operationalizes the obfuscation-erasure distinction and enables the first systematic diagnosis of mechanism-level forgetting behavior across model families and scales.

Title: Form and Meaning in Intrinsic Multilingual Evaluations

Authors: Wessel Poelman, Miryam de Lhoneux
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.10580
Pdf URL: https://arxiv.org/pdf/2601.10580
Copy Paste: [[2601.10580]] Form and Meaning in Intrinsic Multilingual Evaluations(https://arxiv.org/abs/2601.10580)
Keywords: language model
Abstract: Intrinsic evaluation metrics for conditional language models, such as perplexity or bits-per-character, are widely used in both mono- and multilingual settings. These metrics are rather straightforward to use and compare in monolingual setups, but rest on a number of assumptions in multilingual setups. One such assumption is that comparing the perplexity of CLMs on parallel sentences is indicative of their quality since the information content (here understood as the semantic meaning) is the same. However, the metrics are inherently measuring information content in the information-theoretic sense. We make this and other such assumptions explicit and discuss their implications. We perform experiments with six metrics on two multi-parallel corpora both with mono- and multilingual models. Ultimately, we find that current metrics are not universally comparable. We look at the form-meaning debate to provide some explanation for this.
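For reference, the two metrics named in the abstract can be computed from per-token log-probabilities as sketched below. The numbers are toy values, and the function names are assumptions. The example shows why the normalization unit matters: the same total negative log-likelihood on the same text gives different perplexities under different tokenizations, while bits-per-character stays fixed.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood per token (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def bits_per_character(token_logprobs, text):
    """Total negative log-likelihood in bits, divided by the character count."""
    total_bits = -sum(token_logprobs) / math.log(2)
    return total_bits / len(text)

text = "the fox crossed the road"  # 24 characters

# Hypothetical log-probabilities: the same 12 nats of total surprisal,
# split over 4 tokens by one tokenizer and over 6 tokens by another.
coarse_tokens = [-3.0] * 4
fine_tokens = [-2.0] * 6

print(perplexity(coarse_tokens), perplexity(fine_tokens))
print(bits_per_character(coarse_tokens, text), bits_per_character(fine_tokens, text))
```

Bits-per-character removes the tokenizer dependence but not the language dependence: parallel sentences in two languages generally have different character counts, so equal meaning does not imply equal scores.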

Title: Influential Training Data Retrieval for Explaining Verbalized Confidence of LLMs

Authors: Yuxi Xia, Loris Schoenegger, Benjamin Roth
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.10645
Pdf URL: https://arxiv.org/pdf/2601.10645
Copy Paste: [[2601.10645]] Influential Training Data Retrieval for Explaining Verbalized Confidence of LLMs(https://arxiv.org/abs/2601.10645)
Keywords: language model, llm
Abstract: Large language models (LLMs) can increase users' perceived trust by verbalizing confidence in their outputs. However, prior work has shown that LLMs are often overconfident, making their stated confidence unreliable since it does not consistently align with factual accuracy. To better understand the sources of this verbalized confidence, we introduce TracVC (Tracing Verbalized Confidence), a method that builds on information retrieval and influence estimation to trace generated confidence expressions back to the training data. We evaluate TracVC on OLMo and Llama models in a question answering setting, proposing a new metric, content groundness, which measures the extent to which an LLM grounds its confidence in content-related training examples (relevant to the question and answer) versus in generic examples of confidence verbalization. Our analysis reveals that OLMo2-13B is frequently influenced by confidence-related data that is lexically unrelated to the query, suggesting that it may mimic superficial linguistic expressions of certainty rather than rely on genuine content grounding. These findings point to a fundamental limitation in current training regimes: LLMs may learn how to sound confident without learning when confidence is justified. Our analysis provides a foundation for improving LLMs' trustworthiness in expressing more reliable confidence.

Title: Detecting Winning Arguments with Large Language Models and Persuasion Strategies

Authors: Tiziano Labruna, Arkadiusz Modzelewski, Giorgio Satta, Giovanni Da San Martino
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.10660
Pdf URL: https://arxiv.org/pdf/2601.10660
Copy Paste: [[2601.10660]] Detecting Winning Arguments with Large Language Models and Persuasion Strategies(https://arxiv.org/abs/2601.10660)
Keywords: language model, llm, prompt
Abstract: Detecting persuasion in argumentative text is a challenging task with important implications for understanding human communication. This work investigates the role of persuasion strategies, such as Attack on reputation, Distraction, and Manipulative wording, in determining the persuasiveness of a text. We conduct experiments on three annotated argument datasets: Winning Arguments (built from the Change My View subreddit), Anthropic/Persuasion, and Persuasion for Good. Our approach leverages large language models (LLMs) with a Multi-Strategy Persuasion Scoring approach that guides reasoning over six persuasion strategies. Results show that strategy-guided reasoning improves the prediction of persuasiveness. To better understand the influence of content, we organize the Winning Arguments dataset into broad discussion topics and analyze performance across them. We publicly release this topic-annotated version of the dataset to facilitate future research. Overall, our methodology demonstrates the value of structured, strategy-aware prompting for enhancing interpretability and robustness in argument quality assessment.

Title: LIBERTy: A Causal Framework for Benchmarking Concept-Based Explanations of LLMs with Structural Counterfactuals

Authors: Gilat Toker, Nitay Calderon, Ohad Amosy, Roi Reichart
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.10700
Pdf URL: https://arxiv.org/pdf/2601.10700
Copy Paste: [[2601.10700]] LIBERTy: A Causal Framework for Benchmarking Concept-Based Explanations of LLMs with Structural Counterfactuals(https://arxiv.org/abs/2601.10700)
Keywords: llm
Abstract: Concept-based explanations quantify how high-level concepts (e.g., gender or experience) influence model behavior, which is crucial for decision-makers in high-stakes domains. Recent work evaluates the faithfulness of such explanations by comparing them to reference causal effects estimated from counterfactuals. In practice, existing benchmarks rely on costly human-written counterfactuals that serve as an imperfect proxy. To address this, we introduce a framework for constructing datasets containing structural counterfactual pairs: LIBERTy (LLM-based Interventional Benchmark for Explainability with Reference Targets). LIBERTy is grounded in explicitly defined Structured Causal Models (SCMs) of text generation: interventions on a concept propagate through the SCM until an LLM generates the counterfactual. We introduce three datasets (disease detection, CV screening, and workplace violence prediction) together with a new evaluation metric, order-faithfulness. Using them, we evaluate a wide range of methods across five models and identify substantial headroom for improving concept-based explanations. LIBERTy also enables systematic analysis of model sensitivity to interventions: we find that proprietary LLMs show markedly reduced sensitivity to demographic concepts, likely due to post-training mitigation. Overall, LIBERTy provides a much-needed benchmark for developing faithful explainability methods.
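The idea of a structural counterfactual, in which an intervention on one concept propagates through a causal model before the text is generated, can be sketched with a toy model. Everything below is hypothetical: the variables, the seniority rule, and the template that stands in for the LLM are illustrative and are not LIBERTy's actual SCMs.

```python
def run_scm(exogenous, do=None):
    """Evaluate a toy SCM; `do` overrides a variable, ignoring its usual causes."""
    do = do or {}
    v = {}
    v["gender"] = do.get("gender", exogenous["u_gender"])
    v["experience"] = do.get("experience", exogenous["u_experience"])
    # "seniority" is a child of "experience", so interventions propagate to it.
    default_seniority = "senior" if v["experience"] >= 5 else "junior"
    v["seniority"] = do.get("seniority", default_seniority)
    return v

def render_cv(v):
    """Template stand-in for the LLM that writes the final text."""
    return (f"{v['seniority'].title()} candidate ({v['gender']}), "
            f"{v['experience']} years of experience.")

# The same exogenous noise is reused, so the pair differs only through
# the intervention and its downstream effects.
exogenous = {"u_gender": "female", "u_experience": 2}
factual = run_scm(exogenous)
counterfactual = run_scm(exogenous, do={"experience": 8})

print(render_cv(factual))
print(render_cv(counterfactual))
```

Because the downstream variable changes along with the intervened one, the pair reflects the total effect of the concept, which a surface-level text edit of one attribute would miss.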

Title: Grounding Agent Memory in Contextual Intent

Authors: Ruozhen Yang, Yucheng Jiang, Yueqi Jiang, Priyanka Kargupta, Yunyi Zhang, Jiawei Han
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2601.10702
Pdf URL: https://arxiv.org/pdf/2601.10702
Copy Paste: [[2601.10702]] Grounding Agent Memory in Contextual Intent(https://arxiv.org/abs/2601.10702)
Keywords: language model, agent
Abstract: Deploying large language models in long-horizon, goal-oriented interactions remains challenging because similar entities and facts recur under different latent goals and constraints, causing memory systems to retrieve context-mismatched evidence. We propose STITCH (Structured Intent Tracking in Contextual History), an agentic memory system that indexes each trajectory step with a structured retrieval cue, contextual intent, and retrieves history by matching the current step's intent. Contextual intent provides compact signals that disambiguate repeated mentions and reduce interference: (1) the current latent goal defining a thematic segment, (2) the action type, and (3) the salient entity types anchoring which attributes matter. During inference, STITCH filters and prioritizes memory snippets by intent compatibility, suppressing semantically similar but context-incompatible history. For evaluation, we introduce CAME-Bench, a benchmark for context-aware retrieval in realistic, dynamic, goal-oriented trajectories. Across CAME-Bench and LongMemEval, STITCH achieves state-of-the-art performance, outperforming the strongest baseline by 35.6%, with the largest gains as trajectory length increases. Our analysis shows that intent indexing substantially reduces retrieval noise, supporting intent-aware memory for robust long-horizon reasoning.
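The intent-indexed retrieval described in the abstract can be pictured as a memory whose entries carry a structured cue (goal, action type, entity types) and a retriever that filters and ranks by cue compatibility. The sketch below is a guess at the general shape, not the STITCH implementation; the class names, the scoring weights, and the example snippets are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Intent:
    goal: str                # current latent goal (thematic segment)
    action: str              # action type
    entity_types: frozenset  # salient entity types

@dataclass
class MemoryItem:
    text: str
    intent: Intent

def compatibility(query, stored):
    """Hypothetical score: goal match weighs most, then action, then shared entity types."""
    score = 0
    if query.goal == stored.goal:
        score += 2
    if query.action == stored.action:
        score += 1
    score += len(query.entity_types & stored.entity_types)
    return score

def retrieve(memory, query, k=2):
    """Drop items recorded under a different goal, then rank the rest."""
    candidates = [m for m in memory if m.intent.goal == query.goal]
    ranked = sorted(candidates, key=lambda m: compatibility(query, m.intent), reverse=True)
    return ranked[:k]

memory = [
    MemoryItem("Budget cap for the Paris trip is 2000 EUR",
               Intent("plan_paris_trip", "set_constraint", frozenset({"budget"}))),
    MemoryItem("Budget cap for the office move is 15000 EUR",
               Intent("plan_office_move", "set_constraint", frozenset({"budget"}))),
    MemoryItem("Booked Hotel Lumiere for the Paris trip",
               Intent("plan_paris_trip", "book", frozenset({"hotel"}))),
]
query = Intent("plan_paris_trip", "check_constraint", frozenset({"budget"}))
for item in retrieve(memory, query):
    print(item.text)
```

The two budget snippets are nearly identical in wording, so a purely semantic retriever could return either; the goal filter is what suppresses the context-incompatible one.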

Title: MatchTIR: Fine-Grained Supervision for Tool-Integrated Reasoning via Bipartite Matching

Authors: Changle Qu, Sunhao Dai, Hengyi Cai, Jun Xu, Shuaiqiang Wang, Dawei Yin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.10712
Pdf URL: https://arxiv.org/pdf/2601.10712
Copy Paste: [[2601.10712]] MatchTIR: Fine-Grained Supervision for Tool-Integrated Reasoning via Bipartite Matching(https://arxiv.org/abs/2601.10712)
Keywords: language model, llm
Abstract: Tool-Integrated Reasoning (TIR) empowers large language models (LLMs) to tackle complex tasks by interleaving reasoning steps with external tool interactions. However, existing reinforcement learning methods typically rely on outcome- or trajectory-level rewards, assigning uniform advantages to all steps within a trajectory. This coarse-grained credit assignment fails to distinguish effective tool calls from redundant or erroneous ones, particularly in long-horizon multi-turn scenarios. To address this, we propose MatchTIR, a framework that introduces fine-grained supervision via bipartite matching-based turn-level reward assignment and dual-level advantage estimation. Specifically, we formulate credit assignment as a bipartite matching problem between predicted and ground-truth traces, utilizing two assignment strategies to derive dense turn-level rewards. Furthermore, to balance local step precision with global task success, we introduce a dual-level advantage estimation scheme that integrates turn-level and trajectory-level signals, assigning distinct advantage values to individual interaction turns. Extensive experiments on three benchmarks demonstrate the superiority of MatchTIR. Notably, our 4B model surpasses the majority of 8B competitors, particularly in long-horizon and multi-turn tasks. Our code is available at this https URL.
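Turn-level rewards from a one-to-one matching between predicted and ground-truth tool calls can be sketched as below. The abstract mentions two assignment strategies without detailing them, so this shows only one plausible hard assignment. The similarity function, the call format, and the brute-force search are assumptions chosen to keep the example dependency-free.

```python
from itertools import permutations

def call_similarity(pred, gold):
    """Hypothetical similarity: 0 if tool names differ, else share of matching arguments."""
    if pred["tool"] != gold["tool"]:
        return 0.0
    keys = set(pred["args"]) | set(gold["args"])
    if not keys:
        return 1.0
    same = sum(1 for k in keys if pred["args"].get(k) == gold["args"].get(k))
    return same / len(keys)

def turn_rewards(pred_calls, gold_calls):
    """Reward each predicted call by its partner in the best one-to-one matching."""
    n, m = len(pred_calls), len(gold_calls)
    size = max(n, m)
    # Pad to a square matrix; padded rows and columns contribute zero similarity.
    sim = [[call_similarity(pred_calls[i], gold_calls[j]) if i < n and j < m else 0.0
            for j in range(size)] for i in range(size)]
    # Brute force over assignments: fine for short traces, factorial in general.
    best = max(permutations(range(size)),
               key=lambda perm: sum(sim[i][perm[i]] for i in range(size)))
    return [sim[i][best[i]] for i in range(n)]

gold = [{"tool": "search", "args": {"q": "capital of France"}},
        {"tool": "calc", "args": {"expr": "1+1"}}]
pred = [{"tool": "search", "args": {"q": "capital of France"}},
        {"tool": "search", "args": {"q": "capital of France"}},  # redundant repeat
        {"tool": "calc", "args": {"expr": "1+1"}}]

print(turn_rewards(pred, gold))
```

Because each ground-truth call can be matched only once, the repeated search receives no credit even though it is identical to a correct call, which is the distinction a uniform trajectory-level reward cannot make. For longer traces, a polynomial-time solver such as the Hungarian algorithm would replace the permutation search.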