2026-01-15

Title: DeliberationBench: When Do More Voices Hurt? A Controlled Study of Multi-LLM Deliberation Protocols

Authors: Vaarunay Kaushal, Taranveer Singh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.08835
Pdf URL: https://arxiv.org/pdf/2601.08835
Copy Paste: [[2601.08835]] DeliberationBench: When Do More Voices Hurt? A Controlled Study of Multi-LLM Deliberation Protocols(https://arxiv.org/abs/2601.08835)
Keywords: language model, llm, agent
Abstract: Multi-agent systems where Large Language Models (LLMs) deliberate to form consensus have gained significant attention, yet their practical value over simpler methods remains under-scrutinized. We introduce DELIBERATIONBENCH, a controlled benchmark evaluating three deliberation protocols against a strong baseline of selecting the best response from a pool of model outputs. Across 270 questions and three independent seeds (810 total evaluations), we find a striking negative result: the best-single baseline achieves an 82.5% ± 3.3% win rate, dramatically outperforming the best deliberation protocol (13.8% ± 2.6%). This 6.0x performance gap is statistically significant (p < 0.01) and comes at 1.5-2.5x higher computational cost. Our findings challenge assumptions that complexity enhances quality in multi-LLM systems.
摘要：大型语言模型（LLM）精心形成共识的多智能体系统已经引起了广泛关注，但它们相对于更简单方法的实际价值仍然没有得到充分审查。我们引入了 DELIBERATIONBENCH，这是一个受控基准，根据从模型输出池中选择最佳响应的强大基线来评估三种审议协议。在 270 个问题和三个独立种子（总共 810 个评估）中，我们发现了一个惊人的负面结果：最佳单一基线实现了 82.5% ± 3.3% 的获胜率，显着优于最佳审议协议（13.8% ± 2.6%）。这种 6.0 倍的性能差距具有统计显着性 (p < 0.01)，并且计算成本高出 1.5-2.5 倍。我们的研究结果挑战了复杂性可以提高多法学硕士系统质量的假设。
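The headline figures in the abstract above can be checked with a few lines of arithmetic. This is only a sanity check on the quoted numbers; the variable names are illustrative and not taken from the paper.

```python
# Figures quoted in the abstract; variable names are illustrative.
questions = 270
seeds = 3
best_single_win_rate = 82.5        # percent
best_deliberation_win_rate = 13.8  # percent

# 270 questions x 3 seeds gives the reported number of evaluations.
total_evaluations = questions * seeds

# Ratio of the two win rates, which the abstract rounds to 6.0x.
gap_ratio = best_single_win_rate / best_deliberation_win_rate

print(total_evaluations)
print(round(gap_ratio, 1))
```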

Title: From Adversarial Poetry to Adversarial Tales: An Interpretability Research Agenda

Authors: Piercosma Bisconti, Marcello Galisai, Matteo Prandi, Federico Pierucci, Olga Sorokoletova, Francesco Giarrusso, Vincenzo Suriani, Marcantonio Brancale, Daniele Nardi
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2601.08837
Pdf URL: https://arxiv.org/pdf/2601.08837
Copy Paste: [[2601.08837]] From Adversarial Poetry to Adversarial Tales: An Interpretability Research Agenda(https://arxiv.org/abs/2601.08837)
Keywords: llm, prompt
Abstract: Safety mechanisms in LLMs remain vulnerable to attacks that reframe harmful requests through culturally coded structures. We introduce Adversarial Tales, a jailbreak technique that embeds harmful content within cyberpunk narratives and prompts models to perform functional analysis inspired by Vladimir Propp's morphology of folktales. By casting the task as structural decomposition, the attack induces models to reconstruct harmful procedures as legitimate narrative interpretation. Across 26 frontier models from nine providers, we observe an average attack success rate of 71.3%, with no model family proving reliably robust. Together with our prior work on Adversarial Poetry, these findings suggest that structurally-grounded jailbreaks constitute a broad vulnerability class rather than isolated techniques. The space of culturally coded frames that can mediate harmful intent is vast, likely inexhaustible by pattern-matching defenses alone. Understanding why these attacks succeed is therefore essential: we outline a mechanistic interpretability research agenda to investigate how narrative cues reshape model representations and whether models can learn to recognize harmful intent independently of surface form.
摘要：法学硕士的安全机制仍然容易受到通过文化编码结构重新构建有害请求的攻击。我们介绍了 Adversarial Tales，这是一种越狱技术，可将有害内容嵌入到赛博朋克叙事中，并提示模型执行受弗拉基米尔·普罗普民间故事形态启发的功能分析。通过将任务视为结构分解，攻击诱导模型将有害程序重建为合法的叙事解释。在来自 9 个提供商的 26 个前沿模型中，我们观察到平均攻击成功率为 71.3%，没有一个模型系列被证明是可靠稳健的。结合我们之前关于对抗性诗歌的工作，这些发现表明，结构性越狱构成了广泛的漏洞类别，而不是孤立的技术。可以调解有害意图的文化编码框架的空间是巨大的，仅靠模式匹配防御可能是取之不尽用之不竭的。因此，理解这些攻击成功的原因至关重要：我们概述了一个机械可解释性研究议程，以研究叙事线索如何重塑模型表征，以及模型是否能够学会独立于表面形式识别有害意图。

Title: Companion Agents: A Table-Information Mining Paradigm for Text-to-SQL

Authors: Jiahui Chen, Lei Fu, Jian Cui, Yu Lei, Zhenning Dong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.08838
Pdf URL: https://arxiv.org/pdf/2601.08838
Copy Paste: [[2601.08838]] Companion Agents: A Table-Information Mining Paradigm for Text-to-SQL(https://arxiv.org/abs/2601.08838)
Keywords: agent
Abstract: Large-scale Text-to-SQL benchmarks such as BIRD typically assume complete and accurate database annotations as well as readily available external knowledge, which fails to reflect common industrial settings where annotations are missing, incomplete, or erroneous. This mismatch substantially limits the real-world applicability of state-of-the-art (SOTA) Text-to-SQL systems. To bridge this gap, we explore a database-centric approach that leverages intrinsic, fine-grained information residing in relational databases to construct missing evidence and improve Text-to-SQL accuracy under annotation-scarce conditions. Our key hypothesis is that when a query requires multi-step reasoning over extensive table information, existing methods often struggle to reliably identify and utilize the truly relevant knowledge. We therefore propose to "cache" query-relevant knowledge on the database side in advance, so that it can be selectively activated at inference time. Based on this idea, we introduce Companion Agents (CA), a new Text-to-SQL paradigm that incorporates a group of agents accompanying database schemas to proactively mine and consolidate hidden inter-table relations, value-domain distributions, statistical regularities, and latent semantic cues before query generation. Experiments on BIRD under the fully missing evidence setting show that CA recovers +4.49 / +4.37 / +14.13 execution accuracy points on RSL-SQL / CHESS / DAIL-SQL, respectively, with larger gains on the Challenging subset (+9.65 / +7.58 / +16.71). These improvements stem from CA's automatic database-side mining and evidence construction, suggesting a practical path toward industrial-grade Text-to-SQL deployment without reliance on human-curated evidence.
摘要：BIRD 等大规模文本到 SQL 基准测试通常假设完整且准确的数据库注释以及易于获得的外部知识，这无法反映注释缺失、不完整或错误的常见工业环境。这种不匹配极大地限制了最先进 (SOTA) 文本到 SQL 系统的实际适用性。为了弥补这一差距，我们探索了一种以数据库为中心的方法，该方法利用关系数据库中固有的细粒度信息来构建缺失的证据，并在注释稀缺的情况下提高文本到 SQL 的准确性。我们的关键假设是，当查询需要对大量表信息进行多步骤推理时，现有方法通常难以可靠地识别和利用真正相关的知识。因此，我们建议提前在数据库端“缓存”查询相关知识，以便在推理时选择性激活。基于这个想法，我们引入了伴侣代理（CA），这是一种新的文本到 SQL 范式，它结合了一组伴随数据库模式的代理，在查询生成之前主动挖掘和巩固隐藏的表间关系、值域分布、统计规律和潜在语义线索。在完全缺失证据设置下的 BIRD 上进行的实验表明，CA 在 RSL-SQL / CHESS / DAIL-SQL 上分别恢复了 +4.49 / +4.37 / +14.13 点执行准确度，在 Challenging 子集上有更大的增益 +9.65 / +7.58 / +16.71。这些改进源于 CA 的自动数据库端挖掘和证据构建，提出了一条无需依赖人工证据即可实现工业级文本到 SQL 部署的实用路径。
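To make the idea of caching database-side knowledge ahead of query time concrete, here is a minimal sketch that mines one kind of evidence named in the abstract above (value domains per column) from a SQLite database. The abstract does not describe how the Companion Agents actually mine or store evidence, so the function `mine_value_domains`, the `orders` table, and the dictionary cache are all hypothetical.

```python
import sqlite3

def mine_value_domains(conn, max_values=5):
    """Collect up to max_values distinct values for every column of every table.

    Hypothetical stand-in for database-side evidence mining; not the paper's method.
    """
    evidence = {}
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        # PRAGMA table_info returns one row per column; index 1 is the column name.
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        for column in columns:
            rows = conn.execute(
                f'SELECT DISTINCT "{column}" FROM "{table}" '
                f'ORDER BY "{column}" LIMIT ?', (max_values,)).fetchall()
            evidence[f"{table}.{column}"] = [r[0] for r in rows]
    return evidence

# Toy database used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "shipped"), (2, "pending"), (3, "shipped")])

cache = mine_value_domains(conn)
print(cache["orders.status"])
```

A cache like this could be consulted at inference time to tell a query generator which literal values a column can take, which is one plausible reading of "selectively activated" evidence.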

Title: Recursive Knowledge Synthesis for Multi-LLM Systems: Stability Analysis and Tri-Agent Audit Framework

Authors: Toshiyuki Shigemura
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.08839
Pdf URL: https://arxiv.org/pdf/2601.08839
Copy Paste: [[2601.08839]] Recursive Knowledge Synthesis for Multi-LLM Systems: Stability Analysis and Tri-Agent Audit Framework(https://arxiv.org/abs/2601.08839)
Keywords: llm, agent
Abstract: This paper presents a tri-agent cross-validation framework for analyzing stability and explainability in multi-model large language systems. The architecture integrates three heterogeneous LLMs (used for semantic generation, analytical consistency checking, and transparency auditing) into a recursive interaction cycle. This design induces Recursive Knowledge Synthesis (RKS), where intermediate representations are continuously refined through mutually constraining transformations irreducible to single-model behavior. Across 47 controlled trials using public-access LLM deployments (October 2025), we evaluated system stability via four metrics: Reflex Reliability Score (RRS), Transparency Score (TS), Deviation Detection Rate (DDR), and Correction Success Rate (CSR). The system achieved mean RRS = 0.78 ± 0.06 and maintained TS >= 0.8 in about 68% of trials. Approximately 89% of trials converged, supporting the theoretical prediction that transparency auditing acts as a contraction operator within the composite validation mapping. The contributions are threefold: (1) a structured tri-agent framework for coordinated reasoning across heterogeneous LLMs, (2) a formal RKS model grounded in fixed-point theory, and (3) empirical evaluation of inter-model stability under realistic, non-API public-access conditions. These results provide initial empirical evidence that a safety-preserving, human-supervised multi-LLM architecture can achieve stable recursive knowledge synthesis in realistic, publicly deployed environments.
摘要：本文提出了一个三代理交叉验证框架，用于分析多模型大型语言系统的稳定性和可解释性。该架构将三个异构 LLM（用于语义生成、分析一致性检查和透明度审核）集成到递归交互周期中。这种设计引入了递归知识合成（RKS），其中中间表示通过不可简化为单模型行为的相互约束的转换不断细化。在使用公共访问 LLM 部署的 47 项对照试验中（2025 年 10 月），我们通过四个指标评估了系统稳定性：反射可靠性评分 (RRS)、透明度评分 (TS)、偏差检测率 (DDR) 和纠正成功率 (CSR)。该系统在大约 68% 的试验中实现了平均 RRS = 0.78+-0.06 并保持 TS >= 0.8。大约 89% 的试验收敛，支持透明度审计在复合验证映射中充当收缩算子的理论预测。这些贡献有三个方面：（1）用于跨异构法学硕士协调推理的结构化三代理框架，（2）基于定点理论的正式 RKS 模型，以及（3）在现实的非 API 公共访问条件下对模型间稳定性的实证评估。这些结果提供了初步的经验证据，证明安全的、人类监督的多法学硕士架构可以在现实的、公开部署的环境中实现稳定的递归知识合成。
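The claim that an operator "acts as a contraction" has a standard meaning in fixed-point theory: a map that shrinks distances by a factor below 1 has a unique fixed point, and repeated application converges to it. The sketch below illustrates only that general fact with a toy scalar map; `toy_audit` is a hypothetical stand-in and is not the composite validation mapping from the paper.

```python
def iterate_to_fixed_point(mapping, start, tolerance=1e-9, max_steps=1000):
    """Apply mapping repeatedly until successive values differ by less than tolerance."""
    current = start
    for step in range(1, max_steps + 1):
        updated = mapping(current)
        if abs(updated - current) < tolerance:
            return updated, step
        current = updated
    raise RuntimeError("did not converge within max_steps")

def toy_audit(x):
    # Hypothetical map with Lipschitz constant 0.5 < 1, so it is a contraction.
    # Its fixed point solves x = 0.5 * x + 0.4, i.e. x = 0.8.
    return 0.5 * x + 0.4

fixed_point, steps = iterate_to_fixed_point(toy_audit, start=0.0)
print(fixed_point, steps)
```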

Title: Consistency-Aware Editing for Entity-level Unlearning in Language Models

Authors: Xiaoqi Han, Víctor Gutiérrez-Basulto, Ru Li, Xiaoli Li, Jiye Liang, Jeff Z. Pan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.08840
Pdf URL: https://arxiv.org/pdf/2601.08840
Copy Paste: [[2601.08840]] Consistency-Aware Editing for Entity-level Unlearning in Language Models(https://arxiv.org/abs/2601.08840)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) risk retaining sensitive, copyrighted, or harmful information from their training data. Entity-level unlearning addresses this issue by removing all knowledge of a specific entity while preserving the model's overall capabilities. Existing approaches typically rely on full-model fine-tuning or prompt-based interventions, which can be computationally expensive or brittle when handling paraphrased queries. Recently, model editing has emerged as an efficient alternative for updating knowledge in LLMs, offering a promising direction for unlearning. However, existing editing techniques are typically designed for instance-level updates, modifying responses to specific attributes of an entity rather than eliminating all knowledge associated with the entity. In this paper, we investigate how editing techniques can be adapted for effective and efficient entity-level unlearning. To this end, we introduce a novel consistency-aware editing (CAE) framework. CAE aggregates a diverse set of prompts related to a target entity, including its attributes, relations, and adversarial paraphrases. It then jointly learns a low-rank update guided by a consistency regularizer that aligns the editing directions across prompts. This promotes robust and comprehensive forgetting while minimizing interference with unrelated knowledge. We further examine where different entities are stored within the model and how many diverse prompts are needed for successful unlearning. We evaluate CAE on two challenging benchmarks, RWKU and ToFU, and demonstrate that it (i) provides insights into how entity-level knowledge is internally represented and deleted in LLMs, (ii) significantly improves forgetting accuracy and robustness over traditional unlearning and editing baselines, and (iii) enables scalable entity removal using only tens of carefully selected prompts.
摘要：大型语言模型 (LLM) 存在保留训练数据中敏感、受版权保护或有害信息的风险。实体级遗忘通过删除特定实体的所有知识同时保留模型的整体功能来解决此问题。现有的方法通常依赖于全模型微调或基于提示的干预，在处理释义查询时，这可能会导致计算成本高昂或脆弱。最近，模型编辑已成为法学硕士知识更新的有效替代方案，为忘却学习提供了一个有希望的方向。然而，现有的编辑技术通常是为实例级更新而设计的，修改对实体的特定属性的响应，而不是消除与实体相关的所有知识。在本文中，我们研究了如何调整编辑技术以实现有效且高效的实体级遗忘。为此，我们引入了一种新颖的一致性感知编辑（CAE）框架。 CAE 聚合了与目标实体相关的一组不同的提示，包括其属性、关系和对抗性释义。然后，它共同学习由一致性正则化器引导的低秩更新，该正则化器可以跨提示对齐编辑方向。这促进了强大而全面的遗忘，同时最大限度地减少了对不相关知识的干扰。我们进一步检查不同实体在模型中的存储位置，以及成功忘记学习需要多少不同的提示。我们在 RWKU 和 ToFU 这两个具有挑战性的基准上评估 CAE，并证明它 (i) 提供了关于实体级知识如何在 LLM 中内部表示和删除的见解，(ii) 与传统的遗忘和编辑基线相比，显着提高了遗忘的准确性和鲁棒性，(iii) 仅使用数十个精心选择的提示即可实现可扩展的实体删除。

Title: Resisting Correction: How RLHF Makes Language Models Ignore External Safety Signals in Natural Conversation

Authors: Felipe Biava Cataneo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.08842
Pdf URL: https://arxiv.org/pdf/2601.08842
Copy Paste: [[2601.08842]] Resisting Correction: How RLHF Makes Language Models Ignore External Safety Signals in Natural Conversation(https://arxiv.org/abs/2601.08842)
Keywords: language model, prompt
Abstract: Safety architectures for language models increasingly rely on external monitors to detect errors and inject corrective signals at inference time. For such systems to function in interactive settings, models must be able to incorporate externally provided confidence information into their verbal responses. In this work, we test whether instruction-tuned language models preserve this controllability across different interaction modes. Using Llama-3.2-3B on GSM8K, we perform a causal intervention study in which explicit external confidence signals are injected and model compliance is measured under multiple prompt strategies. We find that base models exhibit near-perfect controllability (Spearman rho close to 1.0), while instruction-tuned models display a striking context dependence: they fully comply with external corrections under explicit command prompts (bias approximately 0 percent, rho = 0.93), yet systematically ignore the same signals in natural conversational queries (bias +40 percent, rho = 0.04). This behavior is not a capability failure (the model can process the signal) but an emergent property of RLHF optimization that prioritizes conversational fluency over external calibration cues in natural dialogue. We further show that internal token-level confidence in small models is uninformative (r = 0.035), underscoring the necessity of external supervision. Our findings highlight a deployment-critical failure mode: the interaction style users expect is precisely where safety corrections are least effective.
摘要：语言模型的安全架构越来越依赖外部监视器来检测错误并在推理时注入纠正信号。为了使此类系统在交互环境中发挥作用，模型必须能够将外部提供的置信信息纳入其口头响应中。在这项工作中，我们测试指令调整的语言模型是否在不同的交互模式下保持这种可控性。我们在 GSM8K 上使用 Llama-3.2-3B 进行因果干预研究，其中注入明确的外部置信信号，并在多种提示策略下测量模型合规性。我们发现基础模型表现出近乎完美的可控性（Spearman rho 接近 1.0），而指令调整模型则显示出惊人的上下文依赖性：它们完全符合显式命令提示下的外部校正（偏差约为 0%，rho = 0.93），但系统地忽略自然会话查询中的相同信号（偏差加 40%，rho = 0.04）。这种行为不是能力失败；而是能力失败。该模型可以处理信号，但 RLHF 优化的一个新兴特性是在自然对话中优先考虑对话流畅性而不是外部校准线索。我们进一步表明，小模型中的内部代币级别信心是无信息的（r = 0.035），强调了外部监督的必要性。我们的研究结果强调了部署关键型故障模式：用户期望的交互方式恰恰是安全纠正最无效的地方。
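The abstract above reports controllability as a Spearman rank correlation between the injected confidence signal and the model's expressed confidence. A minimal, dependency-free sketch of that statistic follows; the two response lists are invented toy data chosen to show the two extremes (rho near 1 and rho near 0) and are not results from the paper.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)

def spearman_rho(x, y):
    """Spearman rho is the Pearson correlation of the rank-transformed data."""
    return pearson(ranks(x), ranks(y))

injected = [0.1, 0.3, 0.5, 0.7, 0.9]          # hypothetical external signals
compliant = [0.15, 0.25, 0.55, 0.65, 0.95]    # hypothetical: tracks the signal
ignoring = [0.90, 0.85, 0.95, 0.85, 0.90]     # hypothetical: unrelated to the signal

rho_compliant = spearman_rho(injected, compliant)
rho_ignoring = spearman_rho(injected, ignoring)
print(rho_compliant, rho_ignoring)
```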

Title: Rubric-Conditioned LLM Grading: Alignment, Uncertainty, and Robustness

Authors: Haotian Deng, Chris Farber, Jiyoon Lee, David Tang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.08843
Pdf URL: https://arxiv.org/pdf/2601.08843
Copy Paste: [[2601.08843]] Rubric-Conditioned LLM Grading: Alignment, Uncertainty, and Robustness(https://arxiv.org/abs/2601.08843)
Keywords: language model, llm, prompt
Abstract: Automated short-answer grading (ASAG) remains a challenging task due to the linguistic variability of student responses and the need for nuanced, rubric-aligned partial credit. While Large Language Models (LLMs) offer a promising solution, their reliability as automated judges in rubric-based settings requires rigorous assessment. In this paper, we systematically evaluate the performance of LLM-judges for rubric-based short-answer grading. We investigate three key aspects: the alignment of LLM grading with expert judgment across varying rubric complexities, the trade-off between uncertainty and accuracy facilitated by a consensus-based deferral mechanism, and the model's robustness under random input perturbations and adversarial attacks. Using the SciEntsBank benchmark and Qwen 2.5-72B, we find that alignment is strong for binary tasks but degrades with increased rubric granularity. Our "Trust Curve" analysis demonstrates a clear trade-off where filtering low-confidence predictions improves accuracy on the remaining subset. Additionally, robustness experiments reveal that while the model is resilient to prompt injection, it is sensitive to synonym substitutions. Our work provides critical insights into the capabilities and limitations of rubric-conditioned LLM judges, highlighting the importance of uncertainty estimation and robustness testing for reliable deployment.
摘要：由于学生回答的语言差异以及对细致入微、符合标准的部分学分的需要，自动简答评分 (ASAG) 仍然是一项具有挑战性的任务。虽然大型语言模型 (LLM) 提供了一种很有前景的解决方案，但它们在基于评分标准的环境中作为自动法官的可靠性需要严格的评估。在本文中，我们系统地评估了法学硕士评委在基于标题的简答评分中的表现。我们研究了三个关键方面：法学硕士评分与不同复杂程度的专家判断的一致性、基于共识的延迟机制促进的不确定性和准确性之间的权衡，以及模型在随机输入扰动和对抗性攻击下的稳健性。使用 SciEntsBank 基准和 Qwen 2.5-72B，我们发现二进制任务的对齐能力很强，但随着标题粒度的增加而降低。我们的“信任曲线”分析表明了一个明显的权衡，过滤低置信度预测可以提高剩余子集的准确性。此外，稳健性实验表明，虽然该模型对提示注入具有弹性，但它对同义词替换很敏感。我们的工作为以标题为条件的法学硕士法官的能力和局限性提供了重要的见解，强调了不确定性估计和稳健性测试对于可靠部署的重要性。
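The trade-off described in the abstract above (deferring low-confidence predictions raises accuracy on what remains, at the cost of coverage) can be sketched as follows. The confidence values here are hypothetical; in a consensus-based setup they might be the fraction of sampled gradings that agree, but the abstract does not specify the exact mechanism, so treat `trust_curve` as an illustration rather than the paper's procedure.

```python
def trust_curve(confidences, correct, thresholds):
    """Return (threshold, coverage, accuracy) for the predictions kept at each threshold."""
    points = []
    for threshold in thresholds:
        kept = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
        if not kept:
            continue  # every prediction was deferred at this threshold
        coverage = len(kept) / len(correct)
        accuracy = sum(kept) / len(kept)
        points.append((threshold, coverage, accuracy))
    return points

# Hypothetical per-answer confidences and correctness flags (1 = grade matches expert).
confidences = [0.95, 0.90, 0.80, 0.60, 0.55, 0.40]
correct = [1, 1, 1, 0, 1, 0]

curve = trust_curve(confidences, correct, thresholds=[0.0, 0.5, 0.75])
for threshold, coverage, accuracy in curve:
    print(threshold, round(coverage, 2), round(accuracy, 2))
```

On this toy data, accuracy rises as the threshold increases while coverage falls, which is the shape a "Trust Curve" would be expected to show.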

Title: Emissions and Performance Trade-off Between Small and Large Language Models

Authors: Anandita Garg, Uma Gaba, Deepan Muthirayan, Anish Roy Chowdhury
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2601.08844
Pdf URL: https://arxiv.org/pdf/2601.08844
Copy Paste: [[2601.08844]] Emissions and Performance Trade-off Between Small and Large Language Models(https://arxiv.org/abs/2601.08844)
Keywords: language model, llm
Abstract: The advent of Large Language Models (LLMs) has raised concerns about their enormous carbon footprint, starting with energy-intensive training and continuing through repeated inference. This study investigates the potential of using fine-tuned Small Language Models (SLMs) as a sustainable alternative for predefined tasks. Here, we present a comparative analysis of the performance-emissions trade-off between LLMs and fine-tuned SLMs across selected tasks under Natural Language Processing, Reasoning and Programming. Our results show that in four out of the six selected tasks, SLMs maintained comparable performance with a significant reduction in carbon emissions during inference. Our findings demonstrate the viability of smaller models in mitigating the environmental impact of resource-heavy LLMs, thus advancing towards sustainable, green AI.
摘要：大型语言模型（LLM）的出现引起了人们对其巨大碳足迹的担忧，从能源密集型训练开始，一直到重复推理。本研究探讨了使用微调小语言模型 (SLM) 作为预定义任务的可持续替代方案的潜力。在这里，我们对自然语言处理、推理和编程下选定任务中法学硕士和微调的 SLM 之间的性能与排放权衡进行了比较分析。我们的结果表明，在六项选定任务中的四项中，SLM 保持了可比较的性能，从而在推理过程中显着减少了碳排放。我们的研究结果证明了较小模型在减轻资源密集型法学硕士对环境影响方面的可行性，从而迈向可持续的绿色人工智能。

Title: Directional Attractors in LLM Reasoning: How Similarity Retrieval Steers Iterative Summarization Based Reasoning

Authors: Cagatay Tekin, Charbel Barakat, Luis Joseph Luna Limgenco
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.08846
Pdf URL: https://arxiv.org/pdf/2601.08846
Copy Paste: [[2601.08846]] Directional Attractors in LLM Reasoning: How Similarity Retrieval Steers Iterative Summarization Based Reasoning(https://arxiv.org/abs/2601.08846)
Keywords: language model, llm
Abstract: Iterative summarization based reasoning frameworks such as InftyThink enable long-horizon reasoning in large language models (LLMs) by controlling context growth, but they repeatedly regenerate similar reasoning strategies across tasks. We introduce InftyThink with Cross-Chain Memory, an extension that augments iterative reasoning with an embedding-based semantic cache of previously successful reasoning patterns. At each reasoning step, the model retrieves and conditions on the most semantically similar stored lemmas, guiding inference without expanding the context window indiscriminately. Experiments on MATH500, AIME2024, and GPQA-Diamond demonstrate that semantic lemma retrieval improves accuracy in structured domains while exposing failure modes in tests that include heterogeneous domains. Geometric analyses of reasoning trajectories reveal that cache retrieval induces directional biases in embedding space, leading to consistent fix (improvement over baseline accuracy) and break (degradation of baseline accuracy) attractors. Our results highlight both the benefits and limits of similarity-based memory for self-improving LLM reasoning.
摘要：基于迭代总结的推理框架（例如 InftyThink）通过控制上下文增长来实现大型语言模型（LLM）中的长视野推理，但它们会在任务中重复重新生成类似的推理策略。我们引入了带有跨链内存的 InftyThink，这是一种扩展，通过对先前成功的推理模式进行基于嵌入的语义缓存来增强迭代推理。在每个推理步骤中，模型都会检索和条件化语义上最相似的存储引理，从而指导推理，而不会不加区别地扩展上下文窗口。 MATH500、AIME2024 和 GPQA-Diamond 上的实验表明，语义引理检索提高了结构化域中的准确性，同时暴露了包含异构域的测试中的故障模式。推理轨迹的几何分析表明，缓存检索会引起嵌入空间中的方向偏差，从而导致一致的修复（提高基线精度）和破坏（基线精度下降）吸引子。我们的结果强调了基于相似性的记忆对于自我改进法学硕士推理的好处和局限性。
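The retrieval step described in the abstract above (fetch the stored lemmas most similar to the current reasoning state) is commonly implemented as nearest-neighbor search by cosine similarity. The sketch below shows that generic technique with hand-written two-dimensional vectors; the lemma texts, the embeddings, and the `retrieve` helper are hypothetical, and the paper's actual embedding model and cache layout are not given in the abstract.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, cache, top_k=1):
    """Return the texts of the top_k cached lemmas most similar to query_vec.

    cache is a list of (lemma_text, embedding) pairs.
    """
    scored = sorted(cache, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in scored[:top_k]]

# Toy cache with made-up embeddings.
cache = [
    ("factor the quadratic", [1.0, 0.0]),
    ("apply the triangle inequality", [0.0, 1.0]),
]

print(retrieve([0.9, 0.1], cache))
```

Because retrieval always returns the nearest stored lemma, a query from an unrelated domain still pulls in something, which is one way to picture the directional bias the abstract mentions.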

Title: Scalable and Reliable Evaluation of AI Knowledge Retrieval Systems: RIKER and the Coherent Simulated Universe

Authors: JV Roig
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.08847
Pdf URL: https://arxiv.org/pdf/2601.08847
Copy Paste: [[2601.08847]] Scalable and Reliable Evaluation of AI Knowledge Retrieval Systems: RIKER and the Coherent Simulated Universe(https://arxiv.org/abs/2601.08847)
Keywords: llm, hallucination
Abstract: Evaluating knowledge systems (LLMs, RAG, knowledge graphs, etc.) faces fundamental challenges: static benchmarks are vulnerable to contamination, LLM-based judges exhibit systematic biases, and ground truth extraction requires expensive human annotation. We present RIKER (Retrieval Intelligence and Knowledge Extraction Rating), both a benchmark and a replicable methodology based on paradigm inversion: generating documents from known ground truth rather than extracting ground truth from documents. This approach enables deterministic scoring and scalable evaluation without human annotation or reference models, and contamination resistance through regenerable corpora. Our evaluation of 33 models using over 21 billion tokens reveals that context length claims frequently exceed usable capacity, with significant degradation beyond 32K tokens; cross-document aggregation proves substantially harder than single-document extraction; and grounding ability and hallucination resistance are distinct capabilities: models excelling at finding facts that exist may still fabricate facts that do not. Beyond the specific benchmark, we contribute a domain-agnostic methodology for constructing scalable and contamination-resistant evaluations wherever synthetic documents can be generated from structured ground truth.
摘要：评估知识系统（LLM、RAG、知识图等）面临着根本性挑战：静态基准容易受到污染、基于 LLM 的法官表现出系统性偏见、地面实况提取需要昂贵的人工注释。我们提出 RIKER（检索智能和知识提取评级），它既是一个基准，又是一种基于范式反转的可复制方法 - 从已知的基本事实生成文档，而不是从文档中提取基本事实。这种方法无需人工注释或参考模型即可实现确定性评分和可扩展评估，并通过可再生语料库实现抗污染。我们对使用超过 210 亿个令牌的 33 个模型进行的评估表明，上下文长度声明经常超出可用容量，超过 32K 令牌时，性能会显着下降；事实证明，跨文档聚合比单文档提取困难得多；接地能力和抗幻觉能力是不同的能力——擅长发现存在事实的模型仍然可能捏造不存在的事实。除了特定的基准之外，我们还提供了一种与领域无关的方法，用于构建可扩展且抗污染的评估，只要可以从结构化的地面事实生成合成文档即可。
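The "paradigm inversion" in the abstract above can be pictured in a few lines: start from structured facts, render a document from them, and then score answers against the same facts without a judge model. The entity, the template, and the exact-match scoring rule below are all hypothetical; the abstract does not specify RIKER's generation or scoring details.

```python
# Hypothetical structured ground truth; the entity and values are invented.
facts = {"Acme Corp": {"founded": "1987", "ceo": "Dana Ortiz"}}

def render_document(entity, attributes):
    """Generate a synthetic document from known facts (the inverted direction)."""
    return (f"{entity} was founded in {attributes['founded']}. "
            f"Its chief executive is {attributes['ceo']}.")

def score(answer, entity, attribute):
    """Deterministic scoring: normalized exact match against the ground truth."""
    expected = facts[entity][attribute]
    return answer.strip().lower() == expected.lower()

document = render_document("Acme Corp", facts["Acme Corp"])
print(document)
print(score(" 1987 ", "Acme Corp", "founded"))
print(score("1990", "Acme Corp", "founded"))
```

Because the corpus is generated from the fact table, regenerating it with new invented entities yields a fresh benchmark, which is the contamination-resistance argument in miniature.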

Title: PediaMind-R1: A Temperament-Aware Language Model for Personalized Early Childhood Care Reasoning via Cognitive Modeling and Preference Alignment

Authors: Zihe Zhang, Can Zhang, Yanheng Xu, Xin Hu, Jichao Leng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.08848
Pdf URL: https://arxiv.org/pdf/2601.08848
Copy Paste: [[2601.08848]] PediaMind-R1: A Temperament-Aware Language Model for Personalized Early Childhood Care Reasoning via Cognitive Modeling and Preference Alignment(https://arxiv.org/abs/2601.08848)
Keywords: language model, llm, chain-of-thought
Abstract: This paper presents PediaMind-R1, a domain-specialized large language model designed to achieve active personalization in intelligent parenting scenarios. Unlike conventional systems that provide generic suggestions, PediaMind-R1 draws on insights from developmental psychology. It introduces temperament theory from the Thomas-Chess framework and builds a temperament knowledge graph for infants and toddlers (0-3 years). Our two-stage training pipeline first uses supervised fine-tuning to teach structured chain-of-thought reasoning, and then applies a GRPO-based alignment stage to reinforce logical consistency, domain expertise, and empathetic caregiving strategies. We further design an evaluation framework comprising temperament-sensitive multiple-choice tests and human assessments. The results demonstrate that PediaMind-R1 can accurately interpret early childhood temperament profiles and proactively engage in individualized reasoning. This work highlights the value of integrating vertical-domain modeling with psychological theory. It offers a novel approach to developing user-centered LLMs that advance the practice of active personalization in sensitive caregiving contexts.
摘要：本文提出了 PediaMind-R1，这是一种专门领域的大语言模型，旨在实现智能育儿场景中的主动个性化。与提供通用建议的传统系统不同，PediaMind-R1 借鉴了发展心理学的见解。引入Thomas-Chess框架的气质理论，构建婴幼儿（0-3岁）气质知识图谱。我们的两阶段训练流程首先使用监督微调来教授结构化思想链推理，然后应用基于 GRPO 的对齐阶段来加强逻辑一致性、领域专业知识和同理心护理策略。我们进一步设计了一个评估框架，包括气质敏感的多项选择测试和人类评估。结果表明，PediaMind-R1 可以准确解读幼儿气质特征并主动进行个性化推理。这项工作强调了将垂直领域建模与心理学理论相结合的价值。它提供了一种开发以用户为中心的法学硕士的新颖方法，可促进敏感护理环境中主动个性化的实践。

Title: Gaming the Answer Matcher: Examining the Impact of Text Manipulation on Automated Judgment

Authors: Manas Khatore, Sumana Sridharan, Kevork Sulahian, Benjamin J. Smith, Shi Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.08849
Pdf URL: https://arxiv.org/pdf/2601.08849
Copy Paste: [[2601.08849]] Gaming the Answer Matcher: Examining the Impact of Text Manipulation on Automated Judgment(https://arxiv.org/abs/2601.08849)
Keywords: llm, prompt
Abstract: Automated answer matching, which leverages LLMs to evaluate free-text responses by comparing them to a reference answer, shows substantial promise as a scalable and aligned alternative to human evaluation. However, its reliability requires robustness against strategic attacks such as guesswork or verbosity that may artificially inflate scores without improving actual correctness. In this work, we systematically investigate whether such tactics deceive answer matching models by prompting examinee models to: (1) generate verbose responses, (2) provide multiple answers when unconfident, and (3) embed conflicting answers with the correct answer near the start of their response. Our results show that these manipulations do not increase scores and often reduce them. Additionally, binary scoring (which requires a matcher to answer with a definitive "correct" or "incorrect") is more robust to attacks than continuous scoring (which requires a matcher to determine partial correctness). These findings show that answer matching is generally robust to inexpensive text manipulation and is a viable alternative to traditional LLM-as-a-judge or human evaluation when reference answers are available.
摘要：自动答案匹配利用法学硕士通过将自由文本答案与参考答案进行比较来评估自由文本答案，显示出作为人类评估的可扩展且一致的替代方案的巨大前景。然而，其可靠性需要能够抵御诸如猜测或冗长之类的战略攻击，这些攻击可能会人为地抬高分数，而不会提高实际的正确性。在这项工作中，我们系统地调查了这种策略是否通过提示考生模型来欺骗答案匹配模型：（1）生成详细的答案，（2）在不自信时提供多个答案，以及（3）在答案开头附近嵌入与正确答案相冲突的答案。我们的结果表明，这些操作不会增加分数，而且常常会降低分数。此外，二进制评分（需要匹配器以明确的“正确”或“不正确”来回答）比连续评分（需要匹配器确定部分正确性）对攻击更稳健。这些发现表明，答案匹配通常对于廉价的文本操作来说是稳健的，并且当参考答案可用时，是传统法学硕士作为法官或人工评估的可行替代方案。

Title: NewsScope: Schema-Grounded Cross-Domain News Claim Extraction with Open Models

Authors: Nidhi Pandya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.08852
Pdf URL: https://arxiv.org/pdf/2601.08852
Copy Paste: [[2601.08852]] NewsScope: Schema-Grounded Cross-Domain News Claim Extraction with Open Models(https://arxiv.org/abs/2601.08852)
Keywords: gpt
Abstract: Automated news verification requires structured claim extraction, but existing approaches either lack schema compliance or generalize poorly across domains. This paper presents NewsScope, a cross-domain dataset, benchmark, and fine-tuned model for schema-grounded news claim extraction. The dataset contains 455 articles across politics, health, science/environment, and business, consisting of 395 in-domain articles and 60 out-of-source articles for generalization testing. LLaMA 3.1 8B was fine-tuned using LoRA on 315 training examples and evaluated on held-out in-domain (80 articles) and out-of-source (60 articles) test sets. Human evaluation on 400 claims shows NewsScope achieves 89.4% human-evaluated accuracy compared to GPT-4o-mini's 93.7% (p=0.07). NewsScope outperforms GPT-4o-mini on political claims (94.3% vs. 87.8%). A numeric grounding filter further improves accuracy to 91.6%, narrowing the gap to 2.1 percentage points. Inter-annotator agreement studies (160 claims) confirm labeling reliability (94.6% positive agreement on SUPPORTED judgments). The open-weight model enables offline deployment at approximately $15 on-demand compute (or $0 on free tiers). Code and benchmark are publicly released.
摘要：自动化新闻验证需要结构化声明提取，但现有方法要么缺乏模式合规性，要么跨领域泛化性较差。本文介绍了 NewsScope，这是一个跨域数据集、基准测试和微调模型，用于基于模式的新闻声明提取。该数据集包含 455 篇涉及政治、健康、科学/环境和商业的文章，其中包括 395 篇域内文章和 60 篇用于泛化测试的源外文章。 LLaMA 3.1 8B 使用 LoRA 对 315 个训练示例进行了微调，并在保留的域内（80 篇文章）和源外（60 篇文章）测试集上进行了评估。对 400 项声明的人工评估显示，NewsScope 的人工评估准确度为 89.4%，而 GPT-4o-mini 的人工评估准确度为 93.7% (p=0.07)。 NewsScope 在政治主张方面优于 GPT-4o-mini（94.3% vs. 87.8%）。数字接地滤波器进一步将准确度提高至 91.6%，将差距缩小至 2.1 个百分点。注释者间一致性研究（160 个声明）证实了标签的可靠性（支持的判断有 94.6% 的积极一致性）。开放权重模型支持离线部署，按需计算费用约为 15 美元（免费套餐为 0 美元）。代码和基准已公开发布。
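The dataset sizes and the accuracy gap quoted in the abstract above are mutually consistent, as a short check shows. The only assumption is that the 315 training examples and the 80 held-out in-domain articles together make up the 395 in-domain articles; variable names are illustrative.

```python
# Figures quoted in the abstract; variable names are illustrative.
in_domain_articles = 395
out_of_source_articles = 60
training_examples = 315
in_domain_test_articles = 80
gpt4o_mini_accuracy = 93.7   # percent
filtered_accuracy = 91.6     # percent, after the numeric grounding filter

total_articles = in_domain_articles + out_of_source_articles
in_domain_split_total = training_examples + in_domain_test_articles
remaining_gap = round(gpt4o_mini_accuracy - filtered_accuracy, 1)

print(total_articles)
print(in_domain_split_total)
print(remaining_gap)
```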

Title: Evaluating Role-Consistency in LLMs for Counselor Training

Authors: Eric Rudolph, Natalie Engert, Jens Albrecht
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.08892
Pdf URL: https://arxiv.org/pdf/2601.08892
Copy Paste: [[2601.08892]] Evaluating Role-Consistency in LLMs for Counselor Training(https://arxiv.org/abs/2601.08892)
Keywords: language model, llm
Abstract: The rise of online counseling services has highlighted the need for effective training methods for future counselors. This paper extends research on VirCo, a Virtual Client for Online Counseling, designed to complement traditional role-playing methods in academic training by simulating realistic client interactions. Building on previous work, we introduce a new dataset incorporating adversarial attacks to test the ability of large language models (LLMs) to maintain their assigned roles (role-consistency). The study focuses on evaluating the role consistency and coherence of the Vicuna model's responses, comparing these findings with earlier research. Additionally, we assess and compare various open-source LLMs for their performance in sustaining role consistency during virtual client interactions. Our contributions include creating an adversarial dataset, evaluating conversation coherence and persona consistency, and providing a comparative analysis of different LLMs.
摘要：在线咨询服务的兴起凸显了对未来咨询师有效培训方法的需求。本文扩展了 VirCo（一种在线咨询虚拟客户）的研究，旨在通过模拟真实的客户互动来补充学术培训中的传统角色扮演方法。在之前的工作基础上，我们引入了一个包含对抗性攻击的新数据集，以测试大型语言模型（LLM）维持其指定角色（角色一致性）的能力。该研究的重点是评估骆驼模型反应的角色一致性和连贯性，并将这些发现与早期研究进行比较。此外，我们还评估和比较了各种开源法学硕士在虚拟客户端交互期间维持角色一致性方面的表现。我们的贡献包括创建对抗性数据集、评估对话连贯性和角色一致性，以及提供不同法学硕士的比较分析。

Title: Imagine-then-Plan: Agent Learning from Adaptive Lookahead with World Models

Authors: Youwei Liu, Jian Wang, Hanlin Wang, Beichen Guo, Wenjie Li
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.08955
Pdf URL: https://arxiv.org/pdf/2601.08955
Copy Paste: [[2601.08955]] Imagine-then-Plan: Agent Learning from Adaptive Lookahead with World Models(https://arxiv.org/abs/2601.08955)
Keywords: agent
Abstract: Recent advances in world models have shown promise for modeling future dynamics of environmental states, enabling agents to reason and act without accessing real environments. Current methods mainly perform single-step or fixed-horizon rollouts, leaving their potential for complex task planning under-exploited. We propose Imagine-then-Plan (ITP), a unified framework for agent learning via lookahead imagination, where an agent's policy model interacts with the learned world model, yielding multi-step "imagined" trajectories. Since the imagination horizon may vary by tasks and stages, we introduce a novel adaptive lookahead mechanism by trading off the ultimate goal and task progress. The resulting imagined trajectories provide rich signals about future consequences, such as achieved progress and potential conflicts, which are fused with current observations, formulating a partially observable and imaginable Markov decision process to guide policy learning. We instantiate ITP with both training-free and reinforcement-trained variants. Extensive experiments across representative agent benchmarks demonstrate that ITP significantly outperforms competitive baselines. Further analyses validate that our adaptive lookahead largely enhances agents' reasoning capability, providing valuable insights into addressing broader, complex tasks.

Title: Entropy Sentinel: Continuous LLM Accuracy Monitoring from Decoding Entropy Traces in STEM

Authors: Pedro Memoli Buffa, Luciano Del Corro
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.09001
Pdf URL: https://arxiv.org/pdf/2601.09001
Copy Paste: [[2601.09001]] Entropy Sentinel: Continuous LLM Accuracy Monitoring from Decoding Entropy Traces in STEM(https://arxiv.org/abs/2601.09001)
Keywords: llm
Abstract: Deploying LLMs raises two coupled challenges: (1) monitoring, i.e., estimating where a model underperforms as traffic and domains drift, and (2) improvement, i.e., prioritizing data acquisition to close the largest performance gaps. We test whether an inference-time signal can estimate slice-level accuracy under domain shift. For each response, we compute an output-entropy profile from final-layer next-token probabilities (from top-k logprobs) and summarize it with eleven statistics. A lightweight classifier predicts instance correctness, and averaging predicted probabilities yields a domain-level accuracy estimate. We evaluate on ten STEM reasoning benchmarks with exhaustive train/test compositions (k in {1,2,3,4}; all "10 choose k" combinations), across nine LLMs from six families (3B-20B). Estimates often track held-out benchmark accuracy, and several models show near-monotonic ordering of domains. Output-entropy profiles are thus an accessible signal for scalable monitoring and for targeting data acquisition.
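
The entropy-profile idea in the abstract above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the top-k probabilities are renormalized (the abstract does not say how the truncated tail is handled), and only five summary statistics are shown rather than the eleven the authors use. The lightweight classifier is left out; `domain_accuracy_estimate` just averages whatever per-instance correctness probabilities such a classifier would produce.

```python
import math
import statistics


def token_entropy(top_logprobs):
    """Shannon entropy (in nats) of one decoding step from its top-k logprobs.

    Assumption of this sketch: the top-k probabilities are renormalized to
    sum to 1, so the probability mass outside the top k is ignored.
    """
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    return -sum((p / total) * math.log(p / total) for p in probs if p > 0)


def entropy_profile(steps):
    """Summarize a response's per-token entropy trace with a few statistics.

    `steps` is a list with one entry per generated token; each entry is the
    list of top-k logprobs at that step. The statistics chosen here are
    illustrative, not the paper's feature set.
    """
    trace = [token_entropy(step) for step in steps]
    return {
        "mean": statistics.fmean(trace),
        "max": max(trace),
        "min": min(trace),
        "std": statistics.pstdev(trace),
        "last": trace[-1],
    }


def domain_accuracy_estimate(predicted_probs):
    """Average per-instance correctness probabilities into one domain-level estimate."""
    return statistics.fmean(predicted_probs)
```

A profile such as `entropy_profile(steps)` would be the feature vector for one response; averaging the classifier's outputs over all responses in a domain gives the accuracy estimate for that slice.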

Title: Multicultural Spyfall: Assessing LLMs through Dynamic Multilingual Social Deduction Game

Authors: Haryo Akbarianto Wibowo, Alaa Elsetohy, Qinrong Cui, Alham Fikri Aji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.09017
Pdf URL: https://arxiv.org/pdf/2601.09017
Copy Paste: [[2601.09017]] Multicultural Spyfall: Assessing LLMs through Dynamic Multilingual Social Deduction Game(https://arxiv.org/abs/2601.09017)
Keywords: language model, llm, chat, agent
Abstract: The rapid advancement of Large Language Models (LLMs) has necessitated more robust evaluation methods that go beyond static benchmarks, which are increasingly prone to data saturation and leakage. In this paper, we propose a dynamic benchmarking framework for evaluating multilingual and multicultural capabilities through the social deduction game Spyfall. In our setup, models must engage in strategic dialogue to either identify a secret agent or avoid detection, utilizing culturally relevant locations or local foods. Our results show that our game-based rankings align closely with the Chatbot Arena. However, we find a significant performance gap in non-English contexts: models are generally less proficient when handling locally specific entities and often struggle with rule-following or strategic integrity in non-English languages. We demonstrate that this game-based approach provides a scalable, leakage-resistant, and culturally nuanced alternative to traditional NLP benchmarks. The game history can be accessed at this https URL.

Title: OpenDecoder: Open Large Language Model Decoding to Incorporate Document Quality in RAG

Authors: Fengran Mo, Zhan Su, Yuchen Hui, Jinghan Zhang, Jia Ao Sun, Zheyuan Liu, Chao Zhang, Tetsuya Sakai, Jian-Yun Nie
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2601.09028
Pdf URL: https://arxiv.org/pdf/2601.09028
Copy Paste: [[2601.09028]] OpenDecoder: Open Large Language Model Decoding to Incorporate Document Quality in RAG(https://arxiv.org/abs/2601.09028)
Keywords: language model, llm, retrieval-augmented generation
Abstract: The development of large language models (LLMs) has achieved superior performance in a range of downstream tasks, including LLM-based retrieval-augmented generation (RAG). The quality of generated content heavily relies on the usefulness of the retrieved information and the capacity of LLMs' internal information processing mechanism to incorporate it in answer generation. It is generally assumed that the retrieved information is relevant to the question. However, the retrieved information may have a variable degree of relevance and usefulness, depending on the question and the document collection. It is important to take into account the relevance of the retrieved information in answer generation. In this paper, we propose OpenDecoder, a new approach that leverages explicit evaluation of the retrieved information as quality indicator features for generation. We aim to build a RAG model that is more robust to varying levels of noisy context. Three types of explicit evaluation information are considered: relevance score, ranking score, and QPP (query performance prediction) score. The experimental results on five benchmark datasets demonstrate the effectiveness and better robustness of OpenDecoder by outperforming various baseline methods. Importantly, this paradigm is flexible: it can be integrated with LLM post-training for any purpose and can incorporate any type of external indicator.

Title: SpectraQuery: A Hybrid Retrieval-Augmented Conversational Assistant for Battery Science

Authors: Sreya Vangara, Jagjit Nanda, Yan-Kai Tzeng, Eric Darve
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2601.09036
Pdf URL: https://arxiv.org/pdf/2601.09036
Copy Paste: [[2601.09036]] SpectraQuery: A Hybrid Retrieval-Augmented Conversational Assistant for Battery Science(https://arxiv.org/abs/2601.09036)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Scientific reasoning increasingly requires linking structured experimental data with the unstructured literature that explains it, yet most large language model (LLM) assistants cannot reason jointly across these modalities. We introduce SpectraQuery, a hybrid natural-language query framework that integrates a relational Raman spectroscopy database with a vector-indexed scientific literature corpus using a Structured and Unstructured Query Language (SUQL)-inspired design. By combining semantic parsing with retrieval-augmented generation, SpectraQuery translates open-ended questions into coordinated SQL and literature retrieval operations, producing cited answers that unify numerical evidence with mechanistic explanation. Across SQL correctness, answer groundedness, retrieval effectiveness, and expert evaluation, SpectraQuery demonstrates strong performance: approximately 80 percent of generated SQL queries are fully correct, synthesized answers reach 93-97 percent groundedness with 10-15 retrieved passages, and battery scientists rate responses highly across accuracy, relevance, grounding, and clarity (4.1-4.6/5). These results show that hybrid retrieval architectures can meaningfully support scientific workflows by bridging data and discourse for high-volume experimental datasets.

Title: Can LLMs interpret figurative language as humans do?: surface-level vs representational similarity

Authors: Samhita Bollepally, Aurora Sloman-Moll, Takashi Yamauchi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.09041
Pdf URL: https://arxiv.org/pdf/2601.09041
Copy Paste: [[2601.09041]] Can LLMs interpret figurative language as humans do?: surface-level vs representational similarity(https://arxiv.org/abs/2601.09041)
Keywords: language model, gpt, llm
Abstract: Large language models generate judgments that resemble those of humans. Yet the extent to which these models align with human judgments in interpreting figurative and socially grounded language remains uncertain. To investigate this, human participants and four instruction-tuned LLMs of different sizes (GPT-4, Gemma-2-9B, Llama-3.2, and Mistral-7B) rated 240 dialogue-based sentences representing six linguistic traits: conventionality, sarcasm, funny, emotional, idiomacy, and slang. Each of the 240 sentences was paired with 40 interpretive questions, and both humans and LLMs rated these sentences on a 10-point Likert scale. Results indicated that LLMs aligned with humans at the surface level but diverged significantly at the representational level, especially in interpreting figurative sentences involving idioms and Gen Z slang. GPT-4 most closely approximates human representational patterns, while all models struggle with context-dependent and socio-pragmatic expressions like sarcasm, slang, and idiomacy.

Title: Is Grokking Worthwhile? Functional Analysis and Transferability of Generalization Circuits in Transformers

Authors: Kaiyu He, Zhang Mian, Peilin Wu, Xinya Du, Zhiyu Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.09049
Pdf URL: https://arxiv.org/pdf/2601.09049
Copy Paste: [[2601.09049]] Is Grokking Worthwhile? Functional Analysis and Transferability of Generalization Circuits in Transformers(https://arxiv.org/abs/2601.09049)
Keywords: language model, llm
Abstract: While Large Language Models (LLMs) excel at factual retrieval, they often struggle with the "curse of two-hop reasoning" in compositional tasks. Recent research suggests that parameter-sharing transformers can bridge this gap by forming a "Generalization Circuit" during a prolonged "grokking" phase. A fundamental question arises: Is a grokked model superior to its non-grokked counterparts on downstream tasks? Furthermore, is the extensive computational cost of waiting for the grokking phase worthwhile? In this work, we conduct a mechanistic study to evaluate the Generalization Circuit's role in knowledge assimilation and transfer. We demonstrate that: (i) The inference paths established by non-grokked and grokked models for in-distribution compositional queries are identical. This suggests that the "Generalization Circuit" does not represent the sudden acquisition of a new reasoning paradigm. Instead, we argue that grokking is the process of integrating memorized atomic facts into a naturally established reasoning path. (ii) Achieving high accuracy on unseen cases after prolonged training and the formation of a certain reasoning path are not bound; they can occur independently under specific data regimes. (iii) Even a mature circuit exhibits limited transferability when integrating new knowledge, suggesting that "grokked" Transformers do not achieve a full mastery of compositional logic.

Title: Efficient Multilingual Dialogue Processing via Translation Pipelines and Distilled Language Models

Authors: Santiago Martínez Novoa, Nicolás Rozo Fajardo, Diego Alejandro González Vargas, Nicolás Bedoya Figueroa
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.09059
Pdf URL: https://arxiv.org/pdf/2601.09059
Copy Paste: [[2601.09059]] Efficient Multilingual Dialogue Processing via Translation Pipelines and Distilled Language Models(https://arxiv.org/abs/2601.09059)
Keywords: language model
Abstract: This paper presents team Kl33n3x's multilingual dialogue summarization and question answering system developed for the NLPAI4Health 2025 shared task. The approach employs a three-stage pipeline: forward translation from Indic languages to English, multitask text generation using a 2.55B parameter distilled language model, and reverse translation back to source languages. By leveraging knowledge distillation techniques, this work demonstrates that compact models can achieve highly competitive performance across nine languages. The system achieved strong win rates across the competition's tasks, with particularly robust performance on Marathi (86.7% QnA), Tamil (86.7% QnA), and Hindi (80.0% QnA), demonstrating the effectiveness of translation-based approaches for low-resource language processing without task-specific fine-tuning.

Title: Mi:dm 2.0 Korea-centric Bilingual Language Models

Authors: Donghoon Shin, Sejung Lee, Soonmin Bae, Hwijung Ryu, Changwon Ok, Hoyoun Jung, Hyesung Ji, Jeehyun Lim, Jehoon Lee, Ji-Eun Han, Jisoo Baik, Mihyeon Kim, Riwoo Chung, Seongmin Lee, Wonjae Park, Yoonseok Heo, Youngkyung Seo, Seyoun Won, Boeun Kim, Cheolhun Heo, Eunkyeong Lee, Honghee Lee, Hyeongju Ju, Hyeontae Seo, Jeongyong Shim, Jisoo Lee, Junseok Koh, Junwoo Kim, Minho Lee, Minji Kang, Minju Kim, Sangha Nam, Seongheum Park, Taehyeong Kim, Euijai Ahn, Hong Seok Jeung, Jisu Shin, Jiyeon Kim, Seonyeong Song, Seung Hyun Kong, Sukjin Hong, Taeyang Yun, Yu-Seon Kim, A-Hyun Lee, Chae-Jeong Lee, Hye-Won Yu, Ji-Hyun Ahn, Song-Yeon Kim, Sun-Woo Jung, Eunju Kim, Eunji Ha, Jinwoo Baek, Yun-ji Lee, Wanjin Park, Jeong Yeop Kim, Eun Mi Kim, Hyoung Jun Park, Jung Won Yoon, Min Sung Noh, Myung Gyo Oh, Wongyoung Lee, Yun Jin Park, Young S. Kwon, Hyun Keun Kim, Jieun Lee, YeoJoo Park
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.09066
Pdf URL: https://arxiv.org/pdf/2601.09066
Copy Paste: [[2601.09066]] Mi:dm 2.0 Korea-centric Bilingual Language Models(https://arxiv.org/abs/2601.09066)
Keywords: language model, llm
Abstract: We introduce Mi:dm 2.0, a bilingual large language model (LLM) specifically engineered to advance Korea-centric AI. This model goes beyond Korean text processing by integrating the values, reasoning patterns, and commonsense knowledge inherent to Korean society, enabling nuanced understanding of cultural contexts, emotional subtleties, and real-world scenarios to generate reliable and culturally appropriate responses. To address limitations of existing LLMs, often caused by insufficient or low-quality Korean data and lack of cultural alignment, Mi:dm 2.0 emphasizes robust data quality through a comprehensive pipeline that includes proprietary data cleansing, high-quality synthetic data generation, strategic data mixing with curriculum learning, and a custom Korean-optimized tokenizer to improve efficiency and coverage. To realize this vision, we offer two complementary configurations: Mi:dm 2.0 Base (11.5B parameters), built with a depth-up scaling strategy for general-purpose use, and Mi:dm 2.0 Mini (2.3B parameters), optimized for resource-constrained environments and specialized tasks. Mi:dm 2.0 achieves state-of-the-art performance on Korean-specific benchmarks, with top-tier zero-shot results on KMMLU and strong internal evaluation results across language, humanities, and social science tasks. The Mi:dm 2.0 lineup is released under the MIT license to support extensive research and commercial use. By offering accessible and high-performance Korea-centric LLMs, KT aims to accelerate AI adoption across Korean industries, public services, and education, strengthen the Korean AI developer community, and lay the groundwork for the broader vision of K-intelligence. Our models are available at this https URL. For technical inquiries, please contact midm-llm@kt.com.

Title: From Symbolic to Natural-Language Relations: Rethinking Knowledge Graph Construction in the Era of Large Language Models

Authors: Kanyao Han, Yushang Lai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.09069
Pdf URL: https://arxiv.org/pdf/2601.09069
Copy Paste: [[2601.09069]] From Symbolic to Natural-Language Relations: Rethinking Knowledge Graph Construction in the Era of Large Language Models(https://arxiv.org/abs/2601.09069)
Keywords: language model, llm, prompt
Abstract: Knowledge graphs (KGs) have commonly been constructed using predefined symbolic relation schemas, typically implemented as categorical relation labels. This design has notable shortcomings: real-world relations are often contextual, nuanced, and sometimes uncertain, and compressing them into discrete relation labels abstracts away critical semantic detail. Nevertheless, symbolic-relation KGs remain widely used because they have been operationally effective and broadly compatible with pre-LLM downstream models and algorithms, in which KG knowledge could be retrieved or encoded into quantified features and embeddings at scale. The emergence of LLMs has reshaped how knowledge is created and consumed. LLMs support scalable synthesis of domain facts directly in concise natural language, and prompting-based inference favors context-rich free-form text over quantified representations. This position paper argues that these changes call for rethinking the representation of relations themselves rather than merely using LLMs to populate conventional schemas more efficiently. We therefore advocate moving from symbolic to natural-language relation descriptions, and we propose hybrid design principles that preserve a minimal structural backbone while enabling more flexible and context-sensitive relational representations.

Title: How Many Human Judgments Are Enough? Feasibility Limits of Human Preference Evaluation

Authors: Wilson Y. Lee
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.09084
Pdf URL: https://arxiv.org/pdf/2601.09084
Copy Paste: [[2601.09084]] How Many Human Judgments Are Enough? Feasibility Limits of Human Preference Evaluation(https://arxiv.org/abs/2601.09084)
Keywords: prompt, chat
Abstract: Human preference evaluations are widely used to compare generative models, yet it remains unclear how many judgments are required to reliably detect small improvements. We show that when preference signal is diffuse across prompts (i.e., all prompt types are similarly informative), proportional allocation is minimax-optimal: no allocation strategy substantially improves detectability. Empirical analysis of large-scale human preference datasets shows that most comparisons fall into this diffuse regime, exhibiting small preference margins that require far more judgments than typically collected, even in well-sampled comparisons. These limits persist across evaluation protocols and modalities, including chat, image generation, and code generation with execution feedback. In contrast, curated benchmarks that reduce prompt-induced variability systematically induce larger margins and improve detectability through a $1.5\times$ reduction in prompt-level variance. Our results show that inconclusive or negative human evaluation outcomes frequently reflect underpowered evaluation rather than model equivalence, underscoring the need to account explicitly for effect size, budget, and protocol design.
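
To give a feel for why small preference margins need so many judgments, the following is a textbook power calculation, not the paper's analysis. It assumes independent pairwise judgments, a two-sided test of the win rate against 0.5, and the normal approximation to the binomial; the function name and the default `alpha` and `power` values are assumptions made for this sketch.

```python
import math
from statistics import NormalDist


def judgments_needed(margin, alpha=0.05, power=0.8):
    """Approximate number of pairwise judgments needed to detect a win rate
    of 0.5 + margin against the null of 0.5 (two-sided, normal approximation).

    Assumes independent judgments and ignores prompt-level variance, which
    the abstract identifies as a further source of difficulty.
    """
    if not 0 < margin < 0.5:
        raise ValueError("margin must be strictly between 0 and 0.5")
    p = 0.5 + margin
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Standard one-sample proportion formula: null std is 0.5, alternative std is sqrt(p(1-p)).
    n = ((z_alpha * 0.5 + z_beta * math.sqrt(p * (1 - p))) / margin) ** 2
    return math.ceil(n)
```

Because the required count scales roughly with the inverse square of the margin, halving the margin about quadruples the number of judgments; compare `judgments_needed(0.05)` with `judgments_needed(0.025)`.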

Title: SubTokenTest: A Practical Benchmark for Real-World Sub-token Understanding

Authors: Shuyang Hou, Yi Hu, Muhan Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.09089
Pdf URL: https://arxiv.org/pdf/2601.09089
Copy Paste: [[2601.09089]] SubTokenTest: A Practical Benchmark for Real-World Sub-token Understanding(https://arxiv.org/abs/2601.09089)
Keywords: language model, llm
Abstract: Recent advancements in large language models (LLMs) have significantly enhanced their reasoning capabilities. However, they continue to struggle with basic character-level tasks, such as counting letters in words, a problem rooted in their tokenization process. While existing benchmarks have highlighted this weakness through basic character operations, such failures are often dismissed due to lacking practical relevance. Yet, many real-world applications, such as navigating text-based maps or interpreting structured tables, rely heavily on precise sub-token understanding. In this regard, we introduce SubTokenTest, a comprehensive benchmark that assesses sub-token understanding through practical, utility-driven tasks. Our benchmark includes ten tasks across four domains and isolates tokenization-related failures by decoupling performance from complex reasoning. We provide a comprehensive evaluation of nine advanced LLMs. Additionally, we investigate the impact of test-time scaling on sub-token reasoning and explore how character-level information is encoded within the hidden states.

Title: Contrastive Bi-Encoder Models for Multi-Label Skill Extraction: Enhancing ESCO Ontology Matching with BERT and Attention Mechanisms

Authors: Yongming Sun
Subjects: cs.CL, econ.GN
Abstract URL: https://arxiv.org/abs/2601.09119
Pdf URL: https://arxiv.org/pdf/2601.09119
Copy Paste: [[2601.09119]] Contrastive Bi-Encoder Models for Multi-Label Skill Extraction: Enhancing ESCO Ontology Matching with BERT and Attention Mechanisms(https://arxiv.org/abs/2601.09119)
Keywords: language model, llm
Abstract: Fine-grained labor market analysis increasingly relies on mapping unstructured job advertisements to standardized skill taxonomies such as ESCO. This mapping is naturally formulated as an Extreme Multi-Label Classification (XMLC) problem, but supervised solutions are constrained by the scarcity and cost of large-scale, taxonomy-aligned annotations, especially in non-English settings where job-ad language diverges substantially from formal skill definitions. We propose a zero-shot skill extraction framework that eliminates the need for manually labeled job-ad training data. The framework uses a Large Language Model (LLM) to synthesize training instances from ESCO definitions, and introduces hierarchically constrained multi-skill generation based on ESCO Level-2 categories to improve semantic coherence in multi-label contexts. On top of the synthetic corpus, we train a contrastive bi-encoder that aligns job-ad sentences with ESCO skill descriptions in a shared embedding space; the encoder augments a BERT backbone with BiLSTM and attention pooling to better model long, information-dense requirement statements. An upstream RoBERTa-based binary filter removes non-skill sentences to improve end-to-end precision. Experiments show that (i) hierarchy-conditioned generation improves both fluency and discriminability relative to unconstrained pairing, and (ii) the resulting multi-label model transfers effectively to real-world Chinese job advertisements, achieving strong zero-shot retrieval performance (F1@5 = 0.72) and outperforming TF-IDF and standard BERT baselines. Overall, the proposed pipeline provides a scalable, data-efficient pathway for automated skill coding in labor economics and workforce analytics.
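
The abstract above does not name the contrastive objective used to train the bi-encoder. As an illustration only, the sketch below uses the common in-batch InfoNCE loss over cosine similarities, where sentence `i` is paired with skill description `i` and the other skills in the batch act as negatives. The function names, the plain-list vectors, and the temperature value are all assumptions of this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def info_nce(sentence_vecs, skill_vecs, temperature=0.05):
    """In-batch InfoNCE loss: sentence i should score highest against skill i.

    Returns the mean cross-entropy of the matching skill under a softmax
    over temperature-scaled cosine similarities.
    """
    losses = []
    for i, sentence in enumerate(sentence_vecs):
        logits = [cosine(sentence, skill) / temperature for skill in skill_vecs]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

With well-aligned embeddings the loss is close to zero; if the pairs are swapped, it grows to roughly the similarity gap divided by the temperature.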

Title: Adaptive Multi-Stage Patent Claim Generation with Unified Quality Assessment

Authors: Chen-Wei Liang, Bin Guo, Zhen-Yuan Wei, Mu-Jiang-Shan Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.09120
Pdf URL: https://arxiv.org/pdf/2601.09120
Copy Paste: [[2601.09120]] Adaptive Multi-Stage Patent Claim Generation with Unified Quality Assessment(https://arxiv.org/abs/2601.09120)
Keywords: gpt
Abstract: Current patent claim generation systems face three fundamental limitations: poor cross-jurisdictional generalization, inadequate semantic relationship modeling between claims and prior art, and unreliable quality assessment. We introduce a novel three-stage framework that addresses these challenges through relationship-aware similarity analysis, domain-adaptive claim generation, and unified quality assessment. Our approach employs multi-head attention with eight specialized heads for explicit relationship modeling, integrates curriculum learning with dynamic LoRA adapter selection across five patent domains, and implements cross-attention mechanisms between evaluation aspects for comprehensive quality assessment. Extensive experiments on USPTO HUPD dataset, EPO patent collections, and Patent-CE benchmark demonstrate substantial improvements: 7.6-point ROUGE-L gain over GPT-4o, 8.3% BERTScore enhancement over Llama-3.1-8B, and 0.847 correlation with human experts compared to 0.623 for separate evaluation models. Our method maintains 89.4% cross-jurisdictional performance retention versus 76.2% for baselines, establishing a comprehensive solution for automated patent prosecution workflows.

Title: Identity-Robust Language Model Generation via Content Integrity Preservation

Authors: Miao Zhang, Kelly Chen, Md Mehrab Tanjim, Rumi Chunara
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.09141
Pdf URL: https://arxiv.org/pdf/2601.09141
Copy Paste: [[2601.09141]] Identity-Robust Language Model Generation via Content Integrity Preservation(https://arxiv.org/abs/2601.09141)
Keywords: language model, llm, prompt
Abstract: Large Language Model (LLM) outputs often vary across user sociodemographic attributes, leading to disparities in factual accuracy, utility, and safety, even for objective questions where demographic information is irrelevant. Unlike prior work on stereotypical or representational bias, this paper studies identity-dependent degradation of core response quality. We show empirically that such degradation arises from biased generation behavior, despite factual knowledge being robustly encoded across identities. Motivated by this mismatch, we propose a lightweight, training-free framework for identity-robust generation that selectively neutralizes non-critical identity information while preserving semantically essential attributes, thus maintaining output content integrity. Experiments across four benchmarks and 18 sociodemographic identities demonstrate an average 77% reduction in identity-dependent bias compared to vanilla prompting and a 45% reduction relative to prompt-based defenses. Our work addresses a critical gap in mitigating the impact of user identity cues in prompts on core generation quality.

Title: OrthoGeoLoRA: Geometric Parameter-Efficient Fine-Tuning for Structured Social Science Concept Retrieval on the Web

Authors: Zeqiang Wang, Xinyue Wu, Chenxi Li, Zixi Chen, Nishanth Sastry, Jon Johnson, Suparna De
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.09185
Pdf URL: https://arxiv.org/pdf/2601.09185
Copy Paste: [[2601.09185]] OrthoGeoLoRA: Geometric Parameter-Efficient Fine-Tuning for Structured Social Science Concept Retrieval on the Web(https://arxiv.org/abs/2601.09185)
Keywords: language model
Abstract: Large language models and text encoders increasingly power web-based information systems in the social sciences, including digital libraries, data catalogues, and search interfaces used by researchers, policymakers, and civil society. Full fine-tuning is often computationally and energy intensive, which can be prohibitive for smaller institutions and non-profit organizations in the Web4Good ecosystem. Parameter-Efficient Fine-Tuning (PEFT), especially Low-Rank Adaptation (LoRA), reduces this cost by updating only a small number of parameters. We show that the standard LoRA update $\Delta W = BA^\top$ has geometric drawbacks: gauge freedom, scale ambiguity, and a tendency toward rank collapse. We introduce OrthoGeoLoRA, which enforces an SVD-like form $\Delta W = B\Sigma A^\top$ by constraining the low-rank factors to be orthogonal (Stiefel manifold). A geometric reparameterization implements this constraint while remaining compatible with standard optimizers such as Adam and existing fine-tuning pipelines. We also propose a benchmark for hierarchical concept retrieval over the European Language Social Science Thesaurus (ELSST), widely used to organize social science resources in digital repositories. Experiments with a multilingual sentence encoder show that OrthoGeoLoRA outperforms standard LoRA and several strong PEFT variants on ranking metrics under the same low-rank budget, offering a more compute- and parameter-efficient path to adapt foundation models in resource-constrained settings.

Title: ProFit: Leveraging High-Value Signals in SFT via Probability-Guided Token Selection

Authors: Tao Liu, Taiqiang Wu, Runming Yang, Shaoning Sun, Junjie Wang, Yujiu Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.09195
Pdf URL: https://arxiv.org/pdf/2601.09195
Copy Paste: [[2601.09195]] ProFit: Leveraging High-Value Signals in SFT via Probability-Guided Token Selection(https://arxiv.org/abs/2601.09195)
Keywords: language model, llm
Abstract: Supervised fine-tuning (SFT) is a fundamental post-training strategy to align Large Language Models (LLMs) with human intent. However, traditional SFT often ignores the one-to-many nature of language by forcing alignment with a single reference answer, leading to the model overfitting to non-core expressions. Although our empirical analysis suggests that introducing multiple reference answers can mitigate this issue, the prohibitive data and computational costs necessitate a strategic shift: prioritizing the mitigation of single-reference overfitting over the costly pursuit of answer diversity. To achieve this, we reveal the intrinsic connection between token probability and semantic importance: high-probability tokens carry the core logical framework, while low-probability tokens are mostly replaceable expressions. Based on this insight, we propose ProFit, which selectively masks low-probability tokens to prevent surface-level overfitting. Extensive experiments confirm that ProFit consistently outperforms traditional SFT baselines on general reasoning and mathematical benchmarks.
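Sketch: The idea of masking low-probability tokens can be illustrated with a toy loss over per-token probabilities. This is a minimal sketch under stated assumptions: the hard threshold, its value of 0.5, and the function name `masked_sft_loss` are hypothetical, and the abstract does not say how ProFit actually chooses or applies the mask.

```python
import math


def masked_sft_loss(token_probs, threshold=0.5):
    """Mean negative log-likelihood over reference tokens whose model
    probability is at least `threshold`; lower-probability tokens are
    masked out and contribute nothing to the loss.

    Assumption: a fixed hard threshold, used only for illustration.
    """
    kept = [p for p in token_probs if p >= threshold]
    if not kept:
        return 0.0  # every token was masked
    return -sum(math.log(p) for p in kept) / len(kept)


# Hypothetical probabilities the model assigns to each reference token.
probs = [0.9, 0.8, 0.05, 0.6, 0.1]
full_loss = masked_sft_loss(probs, threshold=0.0)    # standard SFT: no masking
masked_loss = masked_sft_loss(probs, threshold=0.5)  # low-probability tokens dropped
```

In this toy case the two low-probability tokens dominate the unmasked loss, which is the kind of surface-level pressure the abstract argues against fitting.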

Title: A.X K1 Technical Report

Authors: Sung Jun Cheon, Jaekyung Cho, Seongho Choi, Hyunjun Eun, Seokhwan Jo, Jaehyun Jun, Minsoo Kang, Jin Kim, Jiwon Kim, Minsang Kim, Sungwan Kim, Seungsik Kim, Tae Yoon Kim, Youngrang Kim, Hyeongmun Lee, Sangyeol Lee, Sungeun Lee, Youngsoon Lee, Yujin Lee, Seongmin Ok, Chanyong Park, Hyewoong Park, Junyoung Park, Hyunho Yang, Subin Yi, Soohyun Bae, Dhammiko Arya, Yongseok Choi, Sangho Choi, Dongyeon Cho, Seungmo Cho, Gyoungeun Han, Yong-jin Han, Seokyoung Hong, Hyeon Hwang, Wonbeom Jang, Minjeong Ju, Wonjin Jung, Keummin Ka, Sungil Kang, Dongnam Kim, Joonghoon Kim, Jonghwi Kim, SaeRom Kim, Sangjin Kim, Seongwon Kim, Youngjin Kim, Seojin Lee, Sunwoo Lee, Taehoon Lee, Chanwoo Park, Sohee Park, Sooyeon Park, Yohan Ra, Sereimony Sek, Seungyeon Seo, Gun Song, Sanghoon Woo, Janghan Yoon, Sungbin Yoon
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.09200
Pdf URL: https://arxiv.org/pdf/2601.09200
Copy Paste: [[2601.09200]] A.X K1 Technical Report(https://arxiv.org/abs/2601.09200)
Keywords: language model
Abstract: We introduce A.X K1, a 519B-parameter Mixture-of-Experts (MoE) language model trained from scratch. Our design leverages scaling laws to optimize training configurations and vocabulary size under fixed computational budgets. A.X K1 is pre-trained on a corpus of approximately 10T tokens, curated by a multi-stage data processing pipeline. Designed to bridge the gap between reasoning capability and inference efficiency, A.X K1 supports explicitly controllable reasoning to facilitate scalable deployment across diverse real-world scenarios. We propose a simple yet effective Think-Fusion training recipe, enabling user-controlled switching between thinking and non-thinking modes within a single unified model. Extensive evaluations demonstrate that A.X K1 achieves performance competitive with leading open-source models, while establishing a distinctive advantage in Korean-language benchmarks.

Title: UserLM-R1: Modeling Human Reasoning in User Language Models with Multi-Reward Reinforcement Learning

Authors: Feng Zhang, Shijia Li, Chunmao Zhang, Zhanyu Ma, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He, Jingwen Xu, Han Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.09215
Pdf URL: https://arxiv.org/pdf/2601.09215
Copy Paste: [[2601.09215]] UserLM-R1: Modeling Human Reasoning in User Language Models with Multi-Reward Reinforcement Learning(https://arxiv.org/abs/2601.09215)
Keywords: language model, agent
Abstract: User simulators serve as the critical interactive environment for agent post-training, and an ideal user simulator generalizes across domains and proactively engages in negotiation by challenging or bargaining. However, current methods exhibit two issues. They rely on static and context-unaware profiles, necessitating extensive manual redesign for new scenarios, thus limiting generalizability. Moreover, they neglect human strategic thinking, leading to vulnerability to agent manipulation. To address these issues, we propose UserLM-R1, a novel user language model with reasoning capability. Specifically, we first construct comprehensive user profiles with both static roles and dynamic scenario-specific goals for adaptation to diverse scenarios. Then, we propose a goal-driven decision-making policy to generate high-quality rationales before producing responses, and further refine the reasoning and improve strategic capabilities with supervised fine-tuning and multi-reward reinforcement learning. Extensive experimental results demonstrate that UserLM-R1 outperforms competitive baselines, particularly on the more challenging adversarial set.

Title: When to Trust: A Causality-Aware Calibration Framework for Accurate Knowledge Graph Retrieval-Augmented Generation

Authors: Jing Ren, Bowen Li, Ziqi Xu, Xinkun Zhang, Haytham Fayek, Xiaodong Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.09241
Pdf URL: https://arxiv.org/pdf/2601.09241
Copy Paste: [[2601.09241]] When to Trust: A Causality-Aware Calibration Framework for Accurate Knowledge Graph Retrieval-Augmented Generation(https://arxiv.org/abs/2601.09241)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Knowledge Graph Retrieval-Augmented Generation (KG-RAG) extends the RAG paradigm by incorporating structured knowledge from knowledge graphs, enabling Large Language Models (LLMs) to perform more precise and explainable reasoning. While KG-RAG improves factual accuracy in complex tasks, existing KG-RAG models are often severely overconfident, producing high-confidence predictions even when retrieved sub-graphs are incomplete or unreliable, which raises concerns for deployment in high-stakes domains. To address this issue, we propose Ca2KG, a Causality-aware Calibration framework for KG-RAG. Ca2KG integrates counterfactual prompting, which exposes retrieval-dependent uncertainties in knowledge quality and reasoning reliability, with a panel-based re-scoring mechanism that stabilises predictions across interventions. Extensive experiments on two complex QA datasets demonstrate that Ca2KG consistently improves calibration while maintaining or even enhancing predictive accuracy.
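Sketch: To make "overconfident" concrete, the snippet below computes expected calibration error (ECE), a common way to quantify the gap between stated confidence and observed accuracy. The abstract does not name the calibration metric used for Ca2KG, so the choice of ECE, the equal-width binning, and the toy data are assumptions for illustration only.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: the sample-weighted average over bins of
    |accuracy - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 falls in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / total * abs(accuracy - avg_conf)
    return ece


# Toy data: a model that claims 95% confidence but is right 1 time in 4,
# versus one that claims 75% confidence and is right 3 times in 4.
ece_overconfident = expected_calibration_error([0.95] * 4, [True, False, False, False])
ece_calibrated = expected_calibration_error([0.75] * 4, [True, True, True, False])
```

Both toy models could have very different accuracy and still be well or badly calibrated; the metric only measures whether confidence matches the observed hit rate.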

Title: TeachPro: Multi-Label Qualitative Teaching Evaluation via Cross-View Graph Synergy and Semantic Anchored Evidence Encoding

Authors: Xiangqian Wang, Yifan Jia, Yang Xiang, Yumin Zhang, Yanbin Wang, Ke Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.09246
Pdf URL: https://arxiv.org/pdf/2601.09246
Copy Paste: [[2601.09246]] TeachPro: Multi-Label Qualitative Teaching Evaluation via Cross-View Graph Synergy and Semantic Anchored Evidence Encoding(https://arxiv.org/abs/2601.09246)
Keywords: prompt
Abstract: Standardized Student Evaluations of Teaching often suffer from low reliability, restricted response options, and response distortion. Existing machine learning methods that mine open-ended comments usually reduce feedback to binary sentiment, which overlooks concrete concerns such as content clarity, feedback timeliness, and instructor demeanor, and provides limited guidance for instruction. We propose TeachPro, a multi-label learning framework that systematically assesses five key teaching dimensions: professional expertise, instructional behavior, pedagogical efficacy, classroom experience, and other performance metrics. We first propose a Dimension-Anchored Evidence Encoder, which integrates three core components: (i) a pre-trained text encoder that transforms qualitative feedback annotations into contextualized embeddings; (ii) a prompt module that represents five teaching dimensions as learnable semantic anchors; and (iii) a cross-attention mechanism that aligns evidence with pedagogical dimensions within a structured semantic space. We then propose a Cross-View Graph Synergy Network to represent student comments. This network comprises two components: (i) a Syntactic Branch that extracts explicit grammatical dependencies from parse trees, and (ii) a Semantic Branch that models latent conceptual relations derived from BERT-based similarity graphs. A BiAffine fusion module aligns syntactic and semantic units, while a differential regularizer disentangles embeddings to encourage complementary representations. Finally, a cross-attention mechanism bridges the dimension-anchored evidence with the multi-view comment representations. We also contribute a novel benchmark dataset featuring expert qualitative annotations and multi-label scores. Extensive experiments demonstrate that TeachPro offers superior diagnostic granularity and robustness across diverse evaluation settings.

Title: When to Invoke: Refining LLM Fairness with Toxicity Assessment

Authors: Jing Ren, Bowen Li, Ziqi Xu, Renqiang Luo, Shuo Yu, Xin Ye, Haytham Fayek, Xiaodong Li, Feng Xia
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.09250
Pdf URL: https://arxiv.org/pdf/2601.09250
Copy Paste: [[2601.09250]] When to Invoke: Refining LLM Fairness with Toxicity Assessment(https://arxiv.org/abs/2601.09250)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are increasingly used for toxicity assessment in online moderation systems, where fairness across demographic groups is essential for equitable treatment. However, LLMs often produce inconsistent toxicity judgements for subtle expressions, particularly those involving implicit hate speech, revealing underlying biases that are difficult to correct through standard training. This raises a key question that existing approaches often overlook: when should corrective mechanisms be invoked to ensure fair and reliable assessments? To address this, we propose FairToT, an inference-time framework that enhances LLM fairness through prompt-guided toxicity assessment. FairToT identifies cases where demographic-related variation is likely to occur and determines when additional assessment should be applied. In addition, we introduce two interpretable fairness indicators that detect such cases and improve inference consistency without modifying model parameters. Experiments on benchmark datasets show that FairToT reduces group-level disparities while maintaining stable and reliable toxicity predictions, demonstrating that inference-time refinement offers an effective and practical approach for fairness improvement in LLM-based toxicity assessment systems. The source code can be found at this https URL.

Title: MCGA: A Multi-task Classical Chinese Literary Genre Audio Corpus

Authors: Yexing Du, Kaiyuan Liu, Bihe Zhang, Youcheng Pan, Bo Yang, Liangyu Huo, Xiyuan Zhang, Jian Xie, Daojing He, Yang Xiang, Ming Liu, Bin Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.09270
Pdf URL: https://arxiv.org/pdf/2601.09270
Copy Paste: [[2601.09270]] MCGA: A Multi-task Classical Chinese Literary Genre Audio Corpus(https://arxiv.org/abs/2601.09270)
Keywords: language model, llm
Abstract: With the rapid advancement of Multimodal Large Language Models (MLLMs), their potential has garnered significant attention in Chinese Classical Studies (CCS). While existing research has primarily focused on text and visual modalities, the audio corpus within this domain remains largely underexplored. To bridge this gap, we propose the Multi-task Classical Chinese Literary Genre Audio Corpus (MCGA). It encompasses a diverse range of literary genres across six tasks: Automatic Speech Recognition (ASR), Speech-to-Text Translation (S2TT), Speech Emotion Captioning (SEC), Spoken Question Answering (SQA), Speech Understanding (SU), and Speech Reasoning (SR). Through the evaluation of ten MLLMs, our experimental results demonstrate that current models still face substantial challenges when evaluated on the MCGA test set. Furthermore, we introduce an evaluation metric for SEC and a metric to measure the consistency between the speech and text capabilities of MLLMs. We release MCGA and our code to the public to facilitate the development of MLLMs with more robust multidimensional audio capabilities in CCS. MCGA Corpus: this https URL

Title: ReGraM: Region-First Knowledge Graph Reasoning for Medical Question Answering

Authors: Chaerin Lee, Sohee Park, Hyunsik Na, Daseon Choi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.09280
Pdf URL: https://arxiv.org/pdf/2601.09280
Copy Paste: [[2601.09280]] ReGraM: Region-First Knowledge Graph Reasoning for Medical Question Answering(https://arxiv.org/abs/2601.09280)
Keywords: language model, llm, hallucination
Abstract: Recent studies in medical question answering (Medical QA) have actively explored the integration of large language models (LLMs) with biomedical knowledge graphs (KGs) to improve factual accuracy. However, most existing approaches still rely on traversing the entire KG or performing large-scale retrieval, which introduces substantial noise and leads to unstable multi-hop reasoning. We argue that the core challenge lies not in expanding access to knowledge, but in identifying and reasoning over the appropriate subset of evidence for each query. ReGraM is a region-first knowledge graph reasoning framework that addresses this challenge by constructing a query-aligned subgraph and performing stepwise reasoning constrained to this localized region under multiple evidence-aware modes. By focusing inference on only the most relevant portion of the KG, ReGraM departs from the assumption that all relations are equally useful, an assumption that rarely holds in domain-specific medical settings. Experiments on seven medical QA benchmarks demonstrate that ReGraM consistently outperforms a strong baseline (KGARevion), achieving an 8.04% absolute accuracy gain on MCQ, a 4.50% gain on SAQ, and a 42.9% reduction in hallucination rate. Ablation and qualitative analyses further show that aligning region construction with hop-wise reasoning is the primary driver of these improvements. Overall, our results highlight region-first KG reasoning as an effective paradigm for improving factual accuracy and consistency in medical QA.

Title: Understanding or Memorizing? A Case Study of German Definite Articles in Language Models

Authors: Jonathan Drechsel, Erisa Bytyqi, Steffen Herbold
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.09313
Pdf URL: https://arxiv.org/pdf/2601.09313
Copy Paste: [[2601.09313]] Understanding or Memorizing? A Case Study of German Definite Articles in Language Models(https://arxiv.org/abs/2601.09313)
Keywords: language model
Abstract: Language models perform well on grammatical agreement, but it is unclear whether this reflects rule-based generalization or memorization. We study this question for German definite singular articles, whose forms depend on gender and case. Using GRADIEND, a gradient-based interpretability method, we learn parameter update directions for gender-case specific article transitions. We find that updates learned for a specific gender-case article transition frequently affect unrelated gender-case settings, with substantial overlap among the most affected neurons across settings. These results argue against a strictly rule-based encoding of German definite articles, indicating that models at least partly rely on memorized associations rather than abstract grammatical rules.

Title: Improving Implicit Hate Speech Detection via a Community-Driven Multi-Agent Framework

Authors: Ewelina Gajewska, Katarzyna Budzynska, Jarosław A Chudziak
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.09342
Pdf URL: https://arxiv.org/pdf/2601.09342
Copy Paste: [[2601.09342]] Improving Implicit Hate Speech Detection via a Community-Driven Multi-Agent Framework(https://arxiv.org/abs/2601.09342)
Keywords: prompt, chain-of-thought, agent
Abstract: This work proposes a contextualised detection framework for implicitly hateful speech, implemented as a multi-agent system comprising a central Moderator Agent and dynamically constructed Community Agents representing specific demographic groups. Our approach explicitly integrates socio-cultural context from publicly available knowledge sources, enabling identity-aware moderation that surpasses state-of-the-art prompting methods (zero-shot prompting, few-shot prompting, chain-of-thought prompting) and alternative approaches on the challenging ToxiGen dataset. We enhance the technical rigour of performance evaluation by incorporating balanced accuracy as a central metric of classification fairness that accounts for the trade-off between true positive and true negative rates. We demonstrate that our community-driven consultative framework significantly improves both classification accuracy and fairness across all target groups.
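Sketch: Balanced accuracy, the metric highlighted in the abstract, is the mean of the true positive rate and the true negative rate. The short example below shows why it matters on imbalanced data; the labels and the always-negative classifier are made-up toy inputs, not results from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of true positive rate and true negative rate for binary labels (0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return (tpr + tnr) / 2


# Toy data: 8 non-hateful (0) and 2 hateful (1) examples,
# scored against a classifier that always predicts "non-hateful".
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_accuracy = balanced_accuracy(y_true, y_pred)
```

Here plain accuracy is 0.8 even though no hateful example is detected, while balanced accuracy drops to 0.5, the value a trivial classifier gets.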

Title: Frame of Reference: Addressing the Challenges of Common Ground Representation in Situational Dialogs

Authors: Biswesh Mohapatra, Théo Charlot, Giovanni Duca, Mayank Palan, Laurent Romary, Justine Cassell
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.09365
Pdf URL: https://arxiv.org/pdf/2601.09365
Copy Paste: [[2601.09365]] Frame of Reference: Addressing the Challenges of Common Ground Representation in Situational Dialogs(https://arxiv.org/abs/2601.09365)
Keywords: llm
Abstract: Common ground plays a critical role in situated spoken dialogues, where interlocutors must establish and maintain shared references to entities, events, and relations to sustain coherent interaction. For dialog systems, the ability to correctly ground conversational content in order to refer back to it later is particularly important. Prior studies have demonstrated that LLMs are capable of performing grounding acts such as requesting clarification or producing acknowledgments, yet relatively little work has investigated how common ground can be explicitly represented and stored for later use. Without such mechanisms, it remains unclear whether acknowledgment or clarification behaviors truly reflect a grounded understanding. In this work, we evaluate a model's ability to establish and exploit common ground through relational references to entities within the shared context in a situational dialogue. We test multiple methods for representing common ground in situated dialogues and further propose approaches to improve both the establishment of common ground and its subsequent use in the conversation.

Title: Relation Extraction Capabilities of LLMs on Clinical Text: A Bilingual Evaluation for English and Turkish

Authors: Aidana Aidynkyzy, Oğuz Dikenelli, Oylum Alatlı, Şebnem Bora
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.09367
Pdf URL: https://arxiv.org/pdf/2601.09367
Copy Paste: [[2601.09367]] Relation Extraction Capabilities of LLMs on Clinical Text: A Bilingual Evaluation for English and Turkish(https://arxiv.org/abs/2601.09367)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: The scarcity of annotated datasets for clinical information extraction in non-English languages hinders the evaluation of large language model (LLM)-based methods developed primarily in English. In this study, we present the first comprehensive bilingual evaluation of LLMs for the clinical Relation Extraction (RE) task in both English and Turkish. To facilitate this evaluation, we introduce the first English-Turkish parallel clinical RE dataset, derived and carefully curated from the 2010 i2b2/VA relation classification corpus. We systematically assess a diverse set of prompting strategies, including multiple in-context learning (ICL) and Chain-of-Thought (CoT) approaches, and compare their performance to fine-tuned baselines such as PURE. Furthermore, we propose Relation-Aware Retrieval (RAR), a novel in-context example selection method based on contrastive learning, that is specifically designed to capture both sentence-level and relation-level semantics. Our results show that prompting-based LLM approaches consistently outperform traditional fine-tuned models. Moreover, evaluations for English performed better than their Turkish counterparts across all evaluated LLMs and prompting techniques. Among ICL methods, RAR achieves the highest performance, with Gemini 1.5 Flash reaching a micro-F1 score of 0.906 in English and 0.888 in Turkish. Performance further improves to 0.918 F1 in English when RAR is combined with a structured reasoning prompt using the DeepSeek-V3 model. These findings highlight the importance of high-quality demonstration retrieval and underscore the potential of advanced retrieval and prompting techniques to bridge resource gaps in clinical natural language processing.

Title: The Imperfective Paradox in Large Language Models

Authors: Bolei Ma, Yusuke Miyao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.09373
Pdf URL: https://arxiv.org/pdf/2601.09373
Copy Paste: [[2601.09373]] The Imperfective Paradox in Large Language Models(https://arxiv.org/abs/2601.09373)
Keywords: language model, llm, prompt
Abstract: Do Large Language Models (LLMs) genuinely grasp the compositional semantics of events, or do they rely on surface-level probabilistic heuristics? We investigate the Imperfective Paradox, a logical phenomenon where the past progressive aspect entails event realization for activities (e.g., running $\to$ ran) but not for accomplishments (e.g., building $\nrightarrow$ built). We introduce ImperfectiveNLI, a diagnostic dataset designed to probe this distinction across diverse semantic classes. Evaluating state-of-the-art open-weight models, we uncover a pervasive Teleological Bias: models systematically hallucinate completion for goal-oriented events, often overriding explicit textual negation. Representational analyses show that while internal embeddings often distinguish process from result, inference decisions are dominated by strong priors about goal attainment. We further find that prompting-based interventions reduce hallucinated completions but also increase incorrect rejections of valid entailments. Our findings suggest that current LLMs lack structural aspectual awareness, operating as predictive narrative engines rather than faithful logical reasoners.

Title: Ability Transfer and Recovery via Modularized Parameters Localization

Authors: Songyao Jin, Kun Zhou, Wenqi Li, Peng Wang, Biwei Huang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.09398
Pdf URL: https://arxiv.org/pdf/2601.09398
Copy Paste: [[2601.09398]] Ability Transfer and Recovery via Modularized Parameters Localization(https://arxiv.org/abs/2601.09398)
Keywords: language model, llm
Abstract: Large language models can be continually pre-trained or fine-tuned to improve performance in specific domains, languages, or skills, but this specialization often degrades other capabilities and may cause catastrophic forgetting. We investigate how abilities are distributed within LLM parameters by analyzing module activations under domain- and language-specific inputs for closely related models. Across layers and modules, we find that ability-related activations are highly concentrated in a small set of channels (typically <5%), and these channels are largely disentangled with good sufficiency and stability. Building on these observations, we propose ACT (Activation-Guided Channel-wise Ability Transfer), which localizes ability-relevant channels via activation differences and selectively transfers only the corresponding parameters, followed by lightweight fine-tuning for compatibility. Experiments on multilingual mathematical and scientific reasoning show that ACT can recover forgotten abilities while preserving retained skills. It can also merge multiple specialized models to integrate several abilities into a single model with minimal interference. Our code and data will be publicly released.
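Sketch: The two steps named in the abstract, localizing channels by activation differences and transferring only the matching parameters, can be sketched with NumPy as follows. Everything concrete here is an assumption for illustration: the mean-activation gap as the selection score, the 5% fraction, treating a channel as one row of a weight matrix, and the function names. The lightweight fine-tuning step mentioned in the abstract is omitted.

```python
import numpy as np


def select_channels(acts_source, acts_target, fraction=0.05):
    """Return sorted indices of the channels with the largest gap in mean
    activation between two models. Inputs have shape (n_samples, n_channels)."""
    gap = np.abs(acts_source.mean(axis=0) - acts_target.mean(axis=0))
    k = max(1, int(round(fraction * gap.size)))
    return np.sort(np.argsort(gap)[-k:])


def transfer_channels(w_source, w_target, channels):
    """Copy only the selected output channels (rows) from source to target."""
    merged = w_target.copy()
    merged[channels] = w_source[channels]
    return merged


# Synthetic demo: two "models" whose activations differ only on channels 3 and 17.
rng = np.random.default_rng(0)
n_samples, n_channels, d_in = 32, 40, 16
acts_target = rng.normal(size=(n_samples, n_channels))
acts_source = acts_target.copy()
acts_source[:, [3, 17]] += 5.0

channels = select_channels(acts_source, acts_target, fraction=0.05)
w_source = rng.normal(size=(n_channels, d_in))
w_target = rng.normal(size=(n_channels, d_in))
w_merged = transfer_channels(w_source, w_target, channels)
```

Only the rows picked by `select_channels` change in `w_merged`; every other row keeps the target model's weights, which is what limits interference with retained skills in this toy setup.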

Title: Structured Knowledge Representation through Contextual Pages for Retrieval-Augmented Generation

Authors: Xinze Li, Zhenghao Liu, Haidong Xin, Yukun Yan, Shuo Wang, Zheni Zeng, Sen Mei, Ge Yu, Maosong Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.09402
Pdf URL: https://arxiv.org/pdf/2601.09402
Copy Paste: [[2601.09402]] Structured Knowledge Representation through Contextual Pages for Retrieval-Augmented Generation(https://arxiv.org/abs/2601.09402)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by incorporating external knowledge. Recently, some works have incorporated iterative knowledge accumulation processes into RAG models to progressively accumulate and refine query-related knowledge, thereby constructing more comprehensive knowledge representations. However, these iterative processes often lack a coherent organizational structure, which limits the construction of more comprehensive and cohesive knowledge representations. To address this, we propose PAGER, a page-driven autonomous knowledge representation framework for RAG. PAGER first prompts an LLM to construct a structured cognitive outline for a given question, which consists of multiple slots, each representing a distinct knowledge aspect. Then, PAGER iteratively retrieves and refines relevant documents to populate each slot, ultimately constructing a coherent page that serves as contextual input for guiding answer generation. Experiments on multiple knowledge-intensive benchmarks and backbone models show that PAGER consistently outperforms all RAG baselines. Further analyses demonstrate that PAGER constructs higher-quality and information-dense knowledge representations, better mitigates knowledge conflicts, and enables LLMs to leverage external knowledge more effectively. All code is available at this https URL.
摘要：检索增强生成 (RAG) 通过整合外部知识来增强大型语言模型 (LLM)。最近，一些工作将迭代知识积累过程纳入RAG模型中，以逐步积累和细化与查询相关的知识，从而构建更全面的知识表示。然而，这些迭代过程往往缺乏连贯的组织结构，这限制了更全面、更有凝聚力的知识表示的构建。为了解决这个问题，我们提出了 PAGER，一种用于 RAG 的页面驱动的自主知识表示框架。 PAGER 首先提示法学硕士为给定问题构建结构化认知大纲，该大纲由代表不同知识方面的多个槽组成。然后，PAGER 迭代检索并细化相关文档以填充每个槽，最终构建一个连贯的页面，作为指导答案生成的上下文输入。对多个知识密集型基准和主干模型的实验表明，PAGER 始终优于所有 RAG 基准。进一步的分析表明，PAGER 构建了更高质量和信息密集的知识表示，更好地缓解了知识冲突，并使法学硕士能够更有效地利用外部知识。所有代码均可在此 https URL 中获取。

Title: Bias Dynamics in BabyLMs: Towards a Compute-Efficient Sandbox for Democratising Pre-Training Debiasing

Authors: Filip Trhlik, Andrew Caines, Paula Buttery
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.09421
Pdf URL: https://arxiv.org/pdf/2601.09421
Copy Paste: [[2601.09421]] Bias Dynamics in BabyLMs: Towards a Compute-Efficient Sandbox for Democratising Pre-Training Debiasing(https://arxiv.org/abs/2601.09421)
Keywords: language model
Abstract: Pre-trained language models (LMs) have, over the last few years, grown substantially in both societal adoption and training costs. This rapid growth in size has constrained progress in understanding and mitigating their biases. Since re-training LMs is prohibitively expensive, most debiasing work has focused on post-hoc or masking-based strategies, which often fail to address the underlying causes of bias. In this work, we seek to democratise pre-model debiasing research by using low-cost proxy models. Specifically, we investigate BabyLMs, compact BERT-like models trained on small and mutable corpora that can approximate bias acquisition and learning dynamics of larger models. We show that BabyLMs display closely aligned patterns of intrinsic bias formation and performance development compared to standard BERT models, despite their drastically reduced size. Furthermore, correlations between BabyLMs and BERT hold across multiple intra-model and post-model debiasing methods. Leveraging these similarities, we conduct pre-model debiasing experiments with BabyLMs, replicating prior findings and presenting new insights regarding the influence of gender imbalance and toxicity on bias formation. Our results demonstrate that BabyLMs can serve as an effective sandbox for large-scale LMs, reducing pre-training costs from over 500 GPU-hours to under 30 GPU-hours. This provides a way to democratise pre-model debiasing research and enables faster, more accessible exploration of methods for building fairer LMs.
摘要：在过去的几年里，预训练语言模型 (LM) 在社会采用和培训成本方面都大幅增长。规模的快速增长限制了理解和减轻偏见的进展。由于重新训练 LM 的成本过高，因此大多数消除偏差的工作都集中在事后或基于屏蔽的策略上，而这些策略往往无法解决偏差的根本原因。在这项工作中，我们寻求通过使用低成本代理模型使模型前去偏研究民主化。具体来说，我们研究了 BabyLM，这是一种在小型且可变的语料库上训练的类似 BERT 的紧凑模型，可以近似较大模型的偏差获取和学习动态。我们表明，与标准 BERT 模型相比，BabyLM 显示出紧密一致的内在偏差形成和性能发展模式，尽管它们的尺寸大大减小。此外，BabyLM 和 BERT 之间的相关性在多种模型内和模型后去偏方法中都成立。利用这些相似性，我们使用 BabyLM 进行了模型前去偏差实验，复制了之前的研究结果，并提出了关于性别不平衡和毒性对偏差形成的影响的新见解。我们的结果表明，BabyLM 可以作为大规模 LM 的有效沙箱，将预训练成本从超过 500 个 GPU 小时减少到 30 个 GPU 小时以下。这提供了一种使模型前去偏差研究民主化的方法，并能够更快、更容易地探索构建更公平的 LM 的方法。

Title: Where Knowledge Collides: A Mechanistic Study of Intra-Memory Knowledge Conflict in Language Models

Authors: Minh Vu Pham, Hsuvas Borkakoty, Yufang Hou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.09445
Pdf URL: https://arxiv.org/pdf/2601.09445
Copy Paste: [[2601.09445]] Where Knowledge Collides: A Mechanistic Study of Intra-Memory Knowledge Conflict in Language Models(https://arxiv.org/abs/2601.09445)
Keywords: language model
Abstract: In language models (LMs), intra-memory knowledge conflict largely arises when inconsistent information about the same event is encoded within the model's parametric knowledge. While prior work has primarily focused on resolving conflicts between a model's internal knowledge and external resources through approaches such as fine-tuning or knowledge editing, the problem of localizing conflicts that originate during pre-training within the model's internal representations remains unexplored. In this work, we design a framework based on mechanistic interpretability methods to identify where and how conflicting knowledge from the pre-training data is encoded within LMs. Our findings contribute to a growing body of evidence that specific internal components of a language model are responsible for encoding conflicting knowledge from pre-training, and we demonstrate how mechanistic interpretability methods can be leveraged to causally intervene in and control conflicting knowledge at inference time.
摘要：在语言模型（LM）中，当模型的参数知识中编码有关同一事件的不一致信息时，很大程度上会出现内存内知识冲突。虽然之前的工作主要集中在通过微调或知识编辑等方法解决模型内部知识和外部资源之间的冲突，但模型内部表示中预训练期间产生的局部冲突问题仍未得到探索。在这项工作中，我们设计了一个基于机械可解释性方法的框架，以识别来自预训练数据的冲突知识在语言模型中的编码位置和方式。我们的研究结果提供了越来越多的证据，表明语言模型的特定内部组件负责编码预训练中的冲突知识，并且我们演示了如何利用机械可解释性方法在推理时因果地干预和控制冲突知识。

Title: Improving Symbolic Translation of Language Models for Logical Reasoning

Authors: Ramya Keerthy Thatikonda, Jiuzhou Han, Wray Buntine, Ehsan Shareghi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.09446
Pdf URL: https://arxiv.org/pdf/2601.09446
Copy Paste: [[2601.09446]] Improving Symbolic Translation of Language Models for Logical Reasoning(https://arxiv.org/abs/2601.09446)
Keywords: language model
Abstract: The use of formal language for deductive logical reasoning aligns well with language models (LMs), where translating natural language (NL) into first-order logic (FOL) and employing an external solver results in a verifiable and therefore reliable reasoning system. However, smaller LMs often struggle with this translation task, frequently producing incorrect symbolic outputs due to formatting and translation errors. Existing approaches typically rely on self-iteration to correct these errors, but such methods depend heavily on the capabilities of the underlying model. To address this, we first categorize common errors and fine-tune smaller LMs using data synthesized by large language models. The evaluation is performed using the defined error categories. We introduce incremental inference, which divides inference into two stages, predicate generation and FOL translation, providing greater control over model behavior and enhancing generation quality as measured by predicate metrics. This decomposition framework also enables the use of a verification module that targets predicate-arity errors to further improve performance. Our study evaluates three families of models across four logical-reasoning datasets. The comprehensive fine-tuning, incremental inference, and verification modules reduce error rates, increase predicate coverage, and improve reasoning performance for smaller LMs, moving us closer to developing reliable and accessible symbolic-reasoning systems.
摘要：使用形式语言进行演绎逻辑推理与语言模型 (LM) 非常一致，其中将自然语言 (NL) 转换为一阶逻辑 (FOL) 并使用外部求解器产生可验证且因此可靠的推理系统。然而，较小的语言模型常常难以完成这项翻译任务，经常由于格式和翻译错误而产生不正确的符号输出。现有方法通常依靠自迭代来纠正这些错误，但此类方法在很大程度上取决于底层模型的功能。为了解决这个问题，我们首先对常见错误进行分类，并使用大型语言模型合成的数据对较小的语言模型进行微调。使用定义的错误类别进行评估。我们引入增量推理，它将推理分为两个阶段：谓词生成和 FOL 翻译，提供对模型行为的更好控制，并提高由谓词指标衡量的生成质量。该分解框架还允许使用针对谓词数量错误的验证模块来进一步提高性能。我们的研究评估了四个逻辑推理数据集的三个模型系列。全面的微调、增量推理和验证模块可降低错误率、增加谓词覆盖率并提高小型 LM 的推理性能，使我们更接近开发可靠且可访问的符号推理系统。

Title: SlidesGen-Bench: Evaluating Slides Generation via Computational and Quantitative Metrics

Authors: Yunqiao Yang, Wenbo Li, Houxing Ren, Zimu Lu, Ke Wang, Zhiyuan Huang, Zhuofan Zong, Mingjie Zhan, Hongsheng Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.09487
Pdf URL: https://arxiv.org/pdf/2601.09487
Copy Paste: [[2601.09487]] SlidesGen-Bench: Evaluating Slides Generation via Computational and Quantitative Metrics(https://arxiv.org/abs/2601.09487)
Keywords: language model, llm
Abstract: The rapid evolution of Large Language Models (LLMs) has fostered diverse paradigms for automated slide generation, ranging from code-driven layouts to image-centric synthesis. However, evaluating these heterogeneous systems remains challenging, as existing protocols often struggle to provide comparable scores across architectures or rely on uncalibrated judgments. In this paper, we introduce SlidesGen-Bench, a benchmark designed to evaluate slide generation through a lens of three core principles: universality, quantification, and reliability. First, to establish a unified evaluation framework, we ground our analysis in the visual domain, treating terminal outputs as renderings to remain agnostic to the underlying generation method. Second, we propose a computational approach that quantitatively assesses slides across three distinct dimensions - Content, Aesthetics, and Editability - offering reproducible metrics where prior works relied on subjective or reference-dependent proxies. Finally, to ensure high correlation with human preference, we construct the Slides-Align1.5k dataset, a human preference aligned dataset covering slides from nine mainstream generation systems across seven scenarios. Our experiments demonstrate that SlidesGen-Bench achieves a higher degree of alignment with human judgment than existing evaluation pipelines. Our code and data are available at this https URL.
摘要：大型语言模型 (LLM) 的快速发展催生了自动幻灯片生成的多种范式，从代码驱动的布局到以图像为中心的合成。然而，评估这些异构系统仍然具有挑战性，因为现有协议通常难以提供跨架构的可比分数或依赖于未经校准的判断。在本文中，我们介绍了 SlidesGen-Bench，这是一个旨在通过三个核心原则（通用性、量化和可靠性）评估幻灯片生成的基准。首先，为了建立统一的评估框架，我们将分析立足于视觉领域，将终端输出视为渲染，以保持与底层生成方法无关。其次，我们提出了一种计算方法，可以跨三个不同的维度（内容、美学和可编辑性）定量评估幻灯片，提供可重复的指标，而先前的工作依赖于主观或依赖于参考的代理。最后，为了确保与人类偏好的高度相关性，我们构建了 Slides-Align1.5k 数据集，这是一个人类偏好一致的数据集，涵盖来自七种场景的九种主流生成系统的幻灯片。我们的实验表明，与现有的评估流程相比，SlidesGen-Bench 与人类判断的一致性更高。我们的代码和数据可在此 https URL 中获取。

Title: SERM: Self-Evolving Relevance Model with Agent-Driven Learning from Massive Query Streams

Authors: Chenglong Wang, Canjia Li, Xingzhao Zhu, Yifu Huo, Huiyu Wang, Weixiong Lin, Yun Yang, Qiaozhi He, Tianhua Zhou, Xiaojia Chang, Jingbo Zhu, Tong Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.09515
Pdf URL: https://arxiv.org/pdf/2601.09515
Copy Paste: [[2601.09515]] SERM: Self-Evolving Relevance Model with Agent-Driven Learning from Massive Query Streams(https://arxiv.org/abs/2601.09515)
Keywords: agent
Abstract: Due to the dynamically evolving nature of real-world query streams, relevance models struggle to generalize to practical search scenarios. Self-evolution techniques offer a sophisticated solution. However, in large-scale industrial settings with massive query streams, these techniques face two challenges: (1) informative samples are often sparse and difficult to identify, and (2) pseudo-labels generated by the current model could be unreliable. To address these challenges, in this work, we propose a Self-Evolving Relevance Model approach (SERM), which comprises two complementary multi-agent modules: a multi-agent sample miner, designed to detect distributional shifts and identify informative training samples, and a multi-agent relevance annotator, which provides reliable labels through a two-level agreement framework. We evaluate SERM in a large-scale industrial setting, which serves billions of user requests daily. Experimental results demonstrate that SERM can achieve significant performance gains through iterative self-evolution, as validated by extensive offline multilingual evaluations and online testing.
摘要：由于现实世界查询流的动态发展性质，相关性模型很难推广到实际的搜索场景。一个复杂的解决方案是自我进化技术。然而，在具有大量查询流的大规模工业环境中，该技术面临两个挑战：（1）信息样本通常稀疏且难以识别，（2）当前模型生成的伪标签可能不可靠。为了应对这些挑战，在这项工作中，我们提出了一种自我进化相关性模型方法（SERM），它包含两个互补的多智能体模块：一个多智能体样本挖掘器，旨在检测分布变化并识别信息丰富的训练样本；以及一个多智能体相关性注释器，通过两级协议框架提供可靠的标签。我们在大型工业环境中评估 SERM，该环境每天服务数十亿用户请求。实验结果表明，SERM 可以通过迭代自我进化实现显着的性能提升，并通过广泛的离线多语言评估和在线测试得到验证。

Title: Benchmarking Post-Training Quantization of Large Language Models under Microscaling Floating Point Formats

Authors: Manyi Zhang, Ji-Fu Li, Zhongao Sun, Haoli Bai, Hui-Ling Zhen, Zhenhua Dong, Xianzhi Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.09555
Pdf URL: https://arxiv.org/pdf/2601.09555
Copy Paste: [[2601.09555]] Benchmarking Post-Training Quantization of Large Language Models under Microscaling Floating Point Formats(https://arxiv.org/abs/2601.09555)
Keywords: language model, llm
Abstract: Microscaling Floating-Point (MXFP) has emerged as a promising low-precision format for large language models (LLMs). Despite various post-training quantization (PTQ) algorithms being proposed, they mostly focus on integer quantization, while their applicability and behavior under MXFP formats remain largely unexplored. To address this gap, this work conducts a systematic investigation of PTQ under MXFP formats, encompassing over 7 PTQ algorithms, 15 evaluation benchmarks, and 3 LLM families. The key findings include: 1) MXFP8 consistently achieves near-lossless performance, while MXFP4 introduces substantial accuracy degradation and remains challenging; 2) PTQ effectiveness under MXFP depends strongly on format compatibility, with some algorithmic paradigms being consistently more effective than others; 3) PTQ performance exhibits highly consistent trends across model families and modalities, in particular, quantization sensitivity is dominated by the language model rather than the vision encoder in multimodal LLMs; 4) The scaling factor of quantization is a critical error source in MXFP4, and a simple pre-scale optimization strategy can significantly mitigate its impact. Together, these results provide practical guidance on adapting existing PTQ methods to MXFP quantization.
摘要：微缩放浮点 (MXFP) 已成为大型语言模型 (LLM) 的一种有前途的低精度格式。尽管提出了各种训练后量化 (PTQ) 算法，但它们主要关注整数量化，而它们在 MXFP 格式下的适用性和行为在很大程度上仍未得到探索。为了弥补这一差距，这项工作对 MXFP 格式下的 PTQ 进行了系统研究，涵盖超过 7 种 PTQ 算法、15 个评估基准和 3 个 LLM 系列。主要发现包括：1) MXFP8 始终实现近乎无损的性能，而 MXFP4 会导致精度大幅下降，并且仍然具有挑战性； 2) MXFP 下的 PTQ 有效性很大程度上取决于格式兼容性，某些算法范式始终比其他算法范式更有效； 3）PTQ性能在模型族和模态之间表现出高度一致的趋势，特别是，量化灵敏度由语言模型主导，而不是多模态LLM中的视觉编码器； 4) 量化的缩放因子是 MXFP4 中的关键误差源，简单的预缩放优化策略可以显着减轻其影响。总之，这些结果为将现有 PTQ 方法应用于 MXFP 量化提供了实用指导。
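
To make the MXFP4 format discussed in the abstract above concrete, here is a minimal sketch of quantizing one block to a shared power-of-two scale plus FP4 (E2M1) elements. It is an illustration under stated assumptions, not the paper's benchmark code: the function names are hypothetical, ties are broken toward the smaller grid value rather than round-to-nearest-even, and the shared exponent is not clamped to the E8M0 range. Blocks in the OCP Microscaling format hold 32 elements; the sketch accepts any length.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (sign handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2  # exponent of the largest E2M1 binade (4.0 == 2**2)


def mxfp4_quantize_block(block):
    """Return (scale, elements) such that scale * elements[i] approximates block[i].

    Hypothetical helper: scale is a power of two shared by the whole block,
    and every element is snapped to the nearest signed E2M1 value.
    """
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Choose the shared exponent so the largest magnitude lands in the top binade.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    elements = []
    for v in block:
        mag = min(abs(v) / scale, E2M1_GRID[-1])  # saturate at 6.0
        nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
        elements.append(math.copysign(nearest, v))
    return scale, elements


def dequantize(scale, elements):
    """Reconstruct approximate values from the shared scale and the elements."""
    return [scale * e for e in elements]
```

For example, in the block `[0.3, 24.0]` the outlier forces a shared scale of 4.0, so 0.3 rounds to zero after scaling. This is one simple way to see why the choice of scaling factor matters so much at 4 bits.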

Title: Dialogue Telemetry: Turn-Level Instrumentation for Autonomous Information Gathering

Authors: Dimitris Panagopoulos, Adolfo Perrusquia, Weisi Guo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.09570
Pdf URL: https://arxiv.org/pdf/2601.09570
Copy Paste: [[2601.09570]] Dialogue Telemetry: Turn-Level Instrumentation for Autonomous Information Gathering(https://arxiv.org/abs/2601.09570)
Keywords: language model, llm
Abstract: Autonomous systems conducting schema-grounded information-gathering dialogues face an instrumentation gap, lacking turn-level observables for monitoring acquisition efficiency and detecting when questioning becomes unproductive. We introduce Dialogue Telemetry (DT), a measurement framework that produces two model-agnostic signals after each question-answer exchange: (i) a Progress Estimator (PE) quantifying residual information potential per category (with a bits-based variant), and (ii) a Stalling Index (SI) detecting an observable failure signature characterized by repeated category probing with semantically similar, low-marginal-gain responses. SI flags this pattern without requiring causal diagnosis, supporting monitoring in settings where attributing degradation to specific causes may be impractical. We validate DT in controlled search-and-rescue (SAR)-inspired interviews using large language model (LLM)-based simulations, distinguishing efficient from stalled dialogue traces and illustrating downstream utility by integrating DT signals into a reinforcement learning (RL) policy. Across these settings, DT provides interpretable turn-level instrumentation that improves policy performance when stalling carries operational costs.
摘要：进行基于模式的信息收集对话的自治系统面临着仪器缺口，缺乏用于监控采集效率和检测提问何时变得无效的轮次级可观测值。我们引入了对话遥测（DT），这是一种测量框架，在每次问答交换后产生两个与模型无关的信号：（i）进度估计器（PE）量化每个类别的剩余信息潜力（使用基于位的变体），以及（ii）停滞指数（SI）检测可观察到的故障特征，其特征是通过语义相似的低边际增益响应进行重复类别探测。 SI 无需进行因果诊断即可标记此模式，支持在将退化归因于特定原因可能不切实际的环境中进行监控。我们使用基于大语言模型 (LLM) 的模拟在受控搜索和救援 (SAR) 启发的访谈中验证 DT，区分有效的对话轨迹和停滞的对话轨迹，并通过将 DT 信号集成到强化学习 (RL) 策略中来说明下游效用。在这些设置中，DT 提供了可解释的轮次级别工具，可以在停滞带来运营成本时提高策略性能。

Title: DPWriter: Reinforcement Learning with Diverse Planning Branching for Creative Writing

Authors: Qian Cao, Yahui Liu, Wei Bi, Yi Zhao, Ruihua Song, Xiting Wang, Ruiming Tang, Guorui Zhou, Han Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.09609
Pdf URL: https://arxiv.org/pdf/2601.09609
Copy Paste: [[2601.09609]] DPWriter: Reinforcement Learning with Diverse Planning Branching for Creative Writing(https://arxiv.org/abs/2601.09609)
Keywords: language model, llm, chain-of-thought
Abstract: Reinforcement learning (RL)-based enhancement of large language models (LLMs) often leads to reduced output diversity, undermining their utility in open-ended tasks like creative writing. Current methods lack explicit mechanisms for guiding diverse exploration and instead prioritize optimization efficiency and performance over diversity. This paper proposes an RL framework structured around a semi-structured long Chain-of-Thought (CoT), in which the generation process is decomposed into explicitly planned intermediate steps. We introduce a Diverse Planning Branching method that strategically introduces divergence at the planning phase based on diversity variation, alongside a group-aware diversity reward to encourage distinct trajectories. Experimental results on creative writing benchmarks demonstrate that our approach significantly improves output diversity without compromising generation quality, consistently outperforming existing baselines.
摘要：基于强化学习 (RL) 的大语言模型 (LLM) 增强通常会导致输出多样性减少，从而削弱其在创意写作等开放式任务中的效用。当前的方法缺乏指导多样化探索的明确机制，而是优先考虑优化效率和性能而不是多样性。本文提出了一种围绕半结构化长思想链（CoT）构建的强化学习框架，其中生成过程被分解为明确计划的中间步骤。我们引入了一种多样化规划分支方法，该方法根据多样性变化在规划阶段战略性地引入分歧，同时采用群体意识多样性奖励来鼓励不同的轨迹。创意写作基准的实验结果表明，我们的方法在不影响生成质量的情况下显着提高了输出多样性，始终优于现有基准。

Title: LLMs Got Rhythm? Hybrid Phonological Filtering for Greek Poetry Rhyme Detection and Generation

Authors: Stergios Chatzikyriakidis
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.09631
Pdf URL: https://arxiv.org/pdf/2601.09631
Copy Paste: [[2601.09631]] LLMs Got Rhythm? Hybrid Phonological Filtering for Greek Poetry Rhyme Detection and Generation(https://arxiv.org/abs/2601.09631)
Keywords: language model, gpt, llm, prompt, chain-of-thought, agent
Abstract: Large Language Models (LLMs), despite their remarkable capabilities across NLP tasks, struggle with phonologically-grounded phenomena like rhyme detection and generation. This is even more evident in lower-resource languages such as Modern Greek. In this paper, we present a hybrid system that combines LLMs with deterministic phonological algorithms to achieve accurate rhyme identification/analysis and generation. Our approach implements a comprehensive taxonomy of Greek rhyme types, including Pure, Rich, Imperfect, Mosaic, and Identical Pre-rhyme Vowel (IDV) patterns, and employs an agentic generation pipeline with phonological verification. We evaluate multiple prompting strategies (zero-shot, few-shot, Chain-of-Thought, and RAG-augmented) across several LLMs including Claude 3.7 and 4.5, GPT-4o, Gemini 2.0 and open-weight models like Llama 3.1 8B and 70B and Mistral Large. Results reveal a significant "Reasoning Gap": while native-like models (Claude 3.7) perform intuitively (40% accuracy in identification), reasoning-heavy models (Claude 4.5) achieve state-of-the-art performance (54%) only when prompted with Chain-of-Thought. Most critically, pure LLM generation fails catastrophically (under 4% valid poems), while our hybrid verification loop restores performance to 73.1%. We release our system and a crucial, rigorously cleaned corpus of 40,000+ rhymes, derived from the Anemoskala and Interwar Poetry corpora, to support future research.
摘要：大型语言模型 (LLM) 尽管在 NLP 任务中具有出色的能力，但仍难以解决韵律检测和生成等基于语音的现象。这在现代希腊语等资源匮乏的语言中更为明显。在本文中，我们提出了一种混合系统，将 LLM 与确定性音系算法相结合，以实现准确的韵律识别/分析和生成。我们的方法实现了希腊语韵类型的全面分类，包括纯韵、丰富韵、不完美韵、马赛克韵和相同前韵元音 (IDV) 模式，并采用具有音韵验证的代理生成管道。我们评估了多个 LLM 的多种提示策略（零样本、少样本、思想链和 RAG 增强），包括 Claude 3.7 和 4.5、GPT-4o、Gemini 2.0 以及 Llama 3.1 8B 和 70B 和 Mistral Large 等开放权重模型。结果揭示了显着的“推理差距”：虽然类原生模型 (Claude 3.7) 表现直观（识别准确度为 40%），但推理密集型模型 (Claude 4.5) 仅在思想链提示时才实现最先进的性能 (54%)。最关键的是，纯 LLM 生成灾难性地失败（有效诗歌低于 4%），而我们的混合验证循环将性能恢复到 73.1%。我们发布了我们的系统和一个重要的、经过严格清理的包含 40,000 多个韵律的语料库，这些语料库源自 Anemoskala 和 Interwar Poetry 语料库，以支持未来的研究。

Title: DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation

Authors: Yibo Wang, Lei Wang, Yue Deng, Keming Wu, Yao Xiao, Huanjin Yao, Liwei Kang, Hai Ye, Yongcheng Jing, Lidong Bing
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.09688
Pdf URL: https://arxiv.org/pdf/2601.09688
Copy Paste: [[2601.09688]] DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation(https://arxiv.org/abs/2601.09688)
Keywords: agent
Abstract: Deep research systems are widely used for multi-step web research, analysis, and cross-source synthesis, yet their evaluation remains challenging. Existing benchmarks often require annotation-intensive task construction, rely on static evaluation dimensions, or fail to reliably verify facts when citations are missing. To bridge these gaps, we introduce DeepResearchEval, an automated framework for deep research task construction and agentic evaluation. For task construction, we propose a persona-driven pipeline generating realistic, complex research tasks anchored in diverse user profiles, applying a two-stage filter (Task Qualification and Search Necessity) to retain only tasks requiring multi-source evidence integration and external retrieval. For evaluation, we propose an agentic pipeline with two components: an Adaptive Point-wise Quality Evaluation that dynamically derives task-specific evaluation dimensions, criteria, and weights conditioned on each generated task, and an Active Fact-Checking that autonomously extracts and verifies report statements via web search, even when citations are missing.
摘要：深度研究系统广泛用于多步骤网络研究、分析和跨源综合，但对其评估仍然具有挑战性。现有的基准通常需要注释密集型任务构建，依赖于静态评估维度，或者在引用缺失时无法可靠地验证事实。为了弥补这些差距，我们引入了 DeepResearchEval，这是一个用于深度研究任务构建和代理评估的自动化框架。对于任务构建，我们提出了一个角色驱动的管道，生成基于不同用户配置文件的现实、复杂的研究任务，应用两阶段过滤器任务资格和搜索必要性来仅保留需要多源证据集成和外部检索的任务。对于评估，我们提出了一个包含两个组件的代理管道：一个自适应逐点质量评估，根据每个生成的任务动态导出特定于任务的评估维度、标准和权重；以及一个主动事实检查，即使在引用缺失的情况下，也可以通过网络搜索自动提取和验证报告陈述。

Title: Routing with Generated Data: Annotation-Free LLM Skill Estimation and Expert Selection

Authors: Tianyi Niu, Justin Chih-Yao Chen, Genta Indra Winata, Shi-Xiong Zhang, Supriyo Chakraborty, Sambit Sahu, Yue Zhang, Elias Stengel-Eskin, Mohit Bansal
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.09692
Pdf URL: https://arxiv.org/pdf/2601.09692
Copy Paste: [[2601.09692]] Routing with Generated Data: Annotation-Free LLM Skill Estimation and Expert Selection(https://arxiv.org/abs/2601.09692)
Keywords: language model, llm
Abstract: Large Language Model (LLM) routers dynamically select optimal models for given inputs. Existing approaches typically assume access to ground-truth labeled data, which is often unavailable in practice, especially when user request distributions are heterogeneous and unknown. We introduce Routing with Generated Data (RGD), a challenging setting in which routers are trained exclusively on generated queries and answers produced from high-level task descriptions by generator LLMs. We evaluate query-answer routers (using both queries and labels) and query-only routers across four diverse benchmarks and 12 models, finding that query-answer routers degrade faster than query-only routers as generator quality decreases. Our analysis reveals two crucial characteristics of effective generators: they must accurately respond to their own questions, and their questions must produce sufficient performance differentiation among the model pool. We then show how filtering for these characteristics can improve the quality of generated data. We further propose CASCAL, a novel query-only router that estimates model correctness through consensus voting and identifies model-specific skill niches via hierarchical clustering. CASCAL is substantially more robust to generator quality, outperforming the best query-answer router by 4.6% absolute accuracy when trained on weak generator data.
摘要：大型语言模型 (LLM) 路由器动态地为给定输入选择最佳模型。现有方法通常假设访问真实标记数据，这在实践中通常是不可用的，特别是当用户请求分布是异构且未知时。我们引入了生成数据路由（RGD），这是一种具有挑战性的设置，其中路由器专门针对由生成器 LLM 的高级任务描述生成的生成查询和答案进行训练。我们通过四个不同的基准和 12 个模型评估查询-应答路由器（同时使用查询和标签）和仅查询路由器，发现随着生成器质量的下降，查询-应答路由器比仅查询路由器降级得更快。我们的分析揭示了有效生成器的两个关键特征：它们必须准确地回答自己的问题，并且它们的问题必须在模型池中产生足够的性能差异。然后，我们展示过滤这些特征如何提高生成数据的质量。我们进一步提出了 CASCAL，一种新颖的仅查询路由器，它通过共识投票来估计模型的正确性，并通过层次聚类来识别特定于模型的技能领域。 CASCAL 对生成器质量的鲁棒性显着提高，在弱生成器数据上进行训练时，其绝对准确度比最佳查询应答路由器高出 4.6%。
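
The label-free idea of estimating model correctness through consensus voting, mentioned in the abstract above, can be sketched as follows. This is a minimal illustration assuming a plain majority vote over exact-match answers. The abstract does not specify CASCAL's voting rule, and the hierarchical clustering step is omitted. The function names and the data layout are hypothetical.

```python
from collections import Counter


def consensus_accuracy(answers_by_model):
    """Estimate per-model accuracy without labels.

    answers_by_model maps a model name to its list of answers, one per query.
    The majority answer for each query serves as a pseudo-label (an assumption
    of this sketch), and a model scores a point whenever it matches it.
    """
    models = list(answers_by_model)
    n_queries = len(answers_by_model[models[0]])
    agree = {m: 0 for m in models}
    for q in range(n_queries):
        votes = Counter(answers_by_model[m][q] for m in models)
        consensus, _ = votes.most_common(1)[0]
        for m in models:
            if answers_by_model[m][q] == consensus:
                agree[m] += 1
    return {m: agree[m] / n_queries for m in models}


def route(estimated_accuracy):
    """Pick the model with the highest estimated accuracy."""
    return max(estimated_accuracy, key=estimated_accuracy.get)
```

A query-only router built this way never sees ground-truth answers, so its estimates are only as good as the pool's agreement. Computing the same estimate separately per cluster of similar queries would be one way to surface per-model strengths.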

Title: LLMs can Compress LLMs: Adaptive Pruning by Agents

Authors: Sai Varun Kodathala, Rakesh Vunnam
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2601.09694
Pdf URL: https://arxiv.org/pdf/2601.09694
Copy Paste: [[2601.09694]] LLMs can Compress LLMs: Adaptive Pruning by Agents(https://arxiv.org/abs/2601.09694)
Keywords: language model, gpt, llm, agent
Abstract: As Large Language Models (LLMs) continue to scale, post-training pruning has emerged as a promising approach to reduce computational costs while preserving performance. Existing methods such as SparseGPT and Wanda achieve high sparsity through layer-wise weight reconstruction or activation-aware magnitude pruning, but rely on uniform or hand-crafted heuristics to determine per-layer sparsity ratios. Moreover, recent work has shown that pruned LLMs suffer from severe factual knowledge degradation, with structured pruning methods experiencing near-total collapse in factual question-answering capabilities. We introduce agent-guided pruning, where a foundation model acts as an adaptive pruning agent to intelligently select which layers to prune at each iteration while preserving critical knowledge pathways. Our method constructs layer-wise sensitivity profiles by combining Wanda-inspired weight-activation metrics with gradient importance scores, normalized as z-scores for model-agnostic comparison. These statistics are processed by an LLM agent equipped with self-reflection capabilities, enabling it to learn from previous pruning outcomes and iteratively refine its strategy. A checkpoint rollback mechanism maintains model quality by reverting when perplexity degradation exceeds a threshold. We evaluate our approach on Qwen3 models (4B and 8B parameters) at approximately 45% sparsity, demonstrating substantial improvements over structured pruning baselines: 56% relative improvement in MMLU accuracy, 19x better factual knowledge retention on FreebaseQA, and 69% lower perplexity degradation. Notably, our framework requires no retraining, operates in a model-agnostic manner, and exhibits effective self-correction with only 2-4 rollbacks across 21-40 iterations, demonstrating that foundation models can effectively guide the compression of other foundation models.
摘要：随着大型语言模型 (LLM) 的不断扩展，训练后剪枝已成为一种有前途的方法，可以在保持性能的同时降低计算成本。 SparseGPT 和 Wanda 等现有方法通过逐层权重重建或激活感知幅度剪枝实现高稀疏性，但依赖统一或手工设计的启发式方法来确定每层稀疏率。此外，最近的研究表明，经过剪枝的 LLM 遭受了严重的事实知识退化，结构化剪枝方法的事实问答能力几乎完全崩溃。我们引入了代理引导的剪枝，其中基础模型充当自适应剪枝代理，智能地选择每次迭代时要剪枝的层，同时保留关键的知识路径。我们的方法通过将 Wanda 启发的权重激活指标与梯度重要性分数相结合来构建分层的敏感性配置文件，并标准化为 z 分数以进行与模型无关的比较。这些统计数据由配备自我反思功能的 LLM 代理进行处理，使其能够从之前的修剪结果中学习并迭代完善其策略。检查点回滚机制通过在困惑度下降超过阈值时进行恢复来维持模型质量。我们在 Qwen3 模型（4B 和 8B 参数）上以大约 45% 的稀疏度评估我们的方法，证明了相对于结构化修剪基线的显着改进：MMLU 准确性相对提高了 56%，FreebaseQA 上的事实知识保留提高了 19 倍，困惑度降低降低了 69%。值得注意的是，我们的框架不需要重新训练，以与模型无关的方式运行，并且在 21-40 次迭代中仅进行 2-4 次回滚，从而表现出有效的自我校正，这表明基础模型可以有效地指导其他基础模型的压缩。
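
The ingredients named in the abstract above (a Wanda-style weight-activation metric, z-score normalization across layers, and a perplexity-based rollback check) can be sketched in a few lines. This is a hedged illustration, not the authors' implementation: the helper names, the use of one scalar statistic per layer, and the 10% rollback threshold are assumptions. The per-weight score |W_ij| * ||X_j||_2 follows the published Wanda metric.

```python
import math


def wanda_scores(weights, act_norms):
    """Wanda-style importance |W[i][j]| * ||X_j||_2 for every weight.

    weights is a list of rows; act_norms[j] is the L2 norm of input feature j.
    """
    return [[abs(w) * act_norms[j] for j, w in enumerate(row)] for row in weights]


def z_scores(values):
    """Standardize per-layer statistics so layers are comparable on one scale."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var)
    if std == 0.0:
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]


def should_rollback(ppl_before, ppl_after, max_rel_increase=0.1):
    """Revert to the checkpoint if perplexity grew by more than the threshold.

    The 10% default is a placeholder, not a value taken from the paper.
    """
    return (ppl_after - ppl_before) / ppl_before > max_rel_increase
```

In such a loop, layers with low z-scored sensitivity would be the natural pruning candidates, and `should_rollback` would gate whether each pruning step is kept.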

Title: Empathy Applicability Modeling for General Health Queries

Authors: Shan Randhawa, Agha Ali Raza, Kentaro Toyama, Julie Hui, Mustafa Naseem
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.09696
Pdf URL: https://arxiv.org/pdf/2601.09696
Copy Paste: [[2601.09696]] Empathy Applicability Modeling for General Health Queries(https://arxiv.org/abs/2601.09696)
Keywords: gpt, llm
Abstract: LLMs are increasingly being integrated into clinical workflows, yet they often lack clinical empathy, an essential aspect of effective doctor-patient communication. Existing NLP frameworks focus on reactively labeling empathy in doctors' responses but offer limited support for anticipatory modeling of empathy needs, especially in general health queries. We introduce the Empathy Applicability Framework (EAF), a theory-driven approach that classifies patient queries in terms of the applicability of emotional reactions and interpretations, based on clinical, contextual, and linguistic cues. We release a benchmark of real patient queries, dual-annotated by Humans and GPT-4o. In the subset with human consensus, we also observe substantial human-GPT alignment. To validate EAF, we train classifiers on human-labeled and GPT-only annotations to predict empathy applicability, achieving strong performance and outperforming the heuristic and zero-shot LLM baselines. Error analysis highlights persistent challenges: implicit distress, clinical-severity ambiguity, and contextual hardship, underscoring the need for multi-annotator modeling, clinician-in-the-loop calibration, and culturally diverse annotation. EAF provides a framework for identifying empathy needs before response generation, establishes a benchmark for anticipatory empathy modeling, and enables supporting empathetic communication in asynchronous healthcare.
摘要：LLM 越来越多地融入临床工作流程，但它们往往缺乏临床同理心，而临床同理心是有效医患沟通的一个重要方面。现有的 NLP 框架侧重于在医生的回答中反应性地标记同理心，但对同理心需求的预期建模提供有限的支持，尤其是在一般健康查询中。我们引入了同理心适用性框架（EAF），这是一种理论驱动的方法，根据临床、情境和语言线索，根据情绪反应和解释的适用性对患者的询问进行分类。我们发布了真实患者查询的基准，由人类和 GPT-4o 双重注释。在具有人类共识的子集中，我们还观察到大量的人类-GPT 对齐。为了验证 EAF，我们在人工标记和仅 GPT 注释上训练分类器，以预测同理心适用性，实现强大的性能并超越启发式和零样本 LLM 基线。错误分析强调了持续存在的挑战：隐性痛苦、临床严重性模糊和上下文困难，强调了多注释器建模、临床医生在环校准和文化多样性注释的需要。 EAF 提供了一个在响应生成之前识别同理心需求的框架，为预期同理心建模建立了基准，并支持异步医疗保健中的同理心沟通。

Title: Value-Aware Numerical Representations for Transformer Language Models

Authors: Andreea Dutulescu, Stefan Ruseti, Mihai Dascalu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.09706
Pdf URL: https://arxiv.org/pdf/2601.09706
Copy Paste: [[2601.09706]] Value-Aware Numerical Representations for Transformer Language Models(https://arxiv.org/abs/2601.09706)
Keywords: language model
Abstract: Transformer-based language models often achieve strong results on mathematical reasoning benchmarks while remaining fragile on basic numerical understanding and arithmetic operations. A central limitation is that numbers are processed as symbolic tokens whose embeddings do not explicitly encode numerical value, leading to systematic errors. We introduce a value-aware numerical representation that augments standard tokenized inputs with a dedicated prefix token whose embedding is explicitly conditioned on the underlying numerical value. This mechanism injects magnitude information directly into the model's input space while remaining compatible with existing tokenizers and decoder-only Transformer architectures. Evaluation on arithmetic tasks shows that the proposed approach outperforms baselines across numerical formats, tasks, and operand lengths. These results indicate that explicitly encoding numerical value is an effective and efficient way to improve fundamental numerical robustness in language models.
摘要：基于 Transformer 的语言模型通常在数学推理基准上取得了很好的结果，但在基本的数值理解和算术运算方面仍然脆弱。一个主要限制是数字被处理为符号标记，其嵌入没有显式编码数值，从而导致系统错误。我们引入了一种值感知的数字表示，它使用专用的前缀标记来增强标准标记化输入，该前缀标记的嵌入显式地以底层数值为条件。该机制将幅度信息直接注入模型的输入空间，同时保持与现有标记器和仅解码器 Transformer 架构的兼容性。对算术任务的评估表明，所提出的方法在数值格式、任务和操作数长度方面优于基线。这些结果表明，显式编码数值是提高语言模型中基本数值鲁棒性的有效且高效的方法。