2026-01-07

Title: WearVox: An Egocentric Multichannel Voice Assistant Benchmark for Wearables

Authors: Zhaojiang Lin, Yong Xu, Kai Sun, Jing Zheng, Yin Huang, Surya Teja Appini, Krish Narang, Renjie Tao, Ishan Kapil Jain, Siddhant Arora, Ruizhi Li, Yiteng Huang, Kaushik Patnaik, Wenfang Xu, Suwon Shon, Yue Liu, Ahmed A Aly, Anuj Kumar, Florian Metze, Xin Luna Dong
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2601.02391
Pdf URL: https://arxiv.org/pdf/2601.02391
Copy Paste: [[2601.02391]] WearVox: An Egocentric Multichannel Voice Assistant Benchmark for Wearables(https://arxiv.org/abs/2601.02391)
Keywords: language model, llm
Abstract: Wearable devices such as AI glasses are transforming voice assistants into always-available, hands-free collaborators that integrate seamlessly with daily life, but they also introduce challenges like egocentric audio affected by motion and noise, rapid micro-interactions, and the need to distinguish device-directed speech from background conversations. Existing benchmarks largely overlook these complexities, focusing instead on clean or generic conversational audio. To bridge this gap, we present WearVox, the first benchmark designed to rigorously evaluate voice assistants in realistic wearable scenarios. WearVox comprises 3,842 multi-channel, egocentric audio recordings collected via AI glasses across five diverse tasks including Search-Grounded QA, Closed-Book QA, Side-Talk Rejection, Tool Calling, and Speech Translation, spanning a wide range of indoor and outdoor environments and acoustic conditions. Each recording is accompanied by rich metadata, enabling nuanced analysis of model performance under real-world constraints. We benchmark leading proprietary and open-source speech Large Language Models (SLLMs) and find that most real-time SLLMs achieve accuracies on WearVox ranging from 29% to 59%, with substantial performance degradation on noisy outdoor audio, underscoring the difficulty and realism of the benchmark. Additionally, we conduct a case study with two new SLLMs that perform inference with single-channel and multi-channel audio, demonstrating that multi-channel audio inputs significantly enhance model robustness to environmental noise and improve discrimination between device-directed and background speech. Our results highlight the critical importance of spatial audio cues for context-aware voice assistants and establish WearVox as a comprehensive testbed for advancing wearable voice AI research.
摘要：AI 眼镜等可穿戴设备正在将语音助手转变为随时可用的免提协作者，与日常生活无缝集成，但它们也带来了一些挑战，例如受运动和噪音影响的以自我为中心的音频、快速的微交互，以及区分设备引导的语音和背景对话的需要。现有的基准测试在很大程度上忽略了这些复杂性，而是专注于干净或通用的对话音频。为了弥补这一差距，我们推出了 WearVox，这是第一个旨在严格评估现实可穿戴场景中的语音助手的基准测试。 WearVox 包含通过 AI 眼镜收集的 3,842 条多通道、以自我为中心的录音，涉及五种不同的任务，包括搜索 QA、闭卷 QA、旁白拒绝、工具调用和语音翻译，涵盖广泛的室内外环境和声学条件。每个记录都附有丰富的元数据，可以在现实世界的限制下对模型性能进行细致的分析。我们对领先的专有和开源语音大语言模型 (SLLM) 进行了基准测试，发现大多数实时 SLLM 在 WearVox 上的准确率在 29% 到 59% 之间，但在室外嘈杂音频上的性能却大幅下降，这凸显了基准测试的难度和现实性。此外，我们还对两个新的 SLLM 进行了案例研究，这两个 SLLM 对单通道和多通道音频进行推理，证明多通道音频输入显着增强了模型对环境噪声的鲁棒性，并改善了设备导向语音和背景语音之间的区分。我们的结果强调了空间音频提示对于上下文感知语音助手的至关重要性，并将 WearVox 确立为推进可穿戴语音人工智能研究的综合测试平台。

Title: PCEval: A Benchmark for Evaluating Physical Computing Capabilities of Large Language Models

Authors: Inpyo Song, Eunji Jeon, Jangwon Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.02404
Pdf URL: https://arxiv.org/pdf/2601.02404
Copy Paste: [[2601.02404]] PCEval: A Benchmark for Evaluating Physical Computing Capabilities of Large Language Models(https://arxiv.org/abs/2601.02404)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, including software development, education, and technical assistance. Among these, software development is one of the key areas where LLMs are increasingly adopted. However, when hardware constraints are considered (for instance, in physical computing, where software must interact with and control physical hardware), their effectiveness has not been fully explored. To address this gap, we introduce PCEval (Physical Computing Evaluation), the first benchmark in physical computing that enables a fully automatic evaluation of the capabilities of LLMs in both the logical and physical aspects of projects, without requiring human assessment. Our evaluation framework assesses LLMs in generating circuits and producing compatible code across varying levels of project complexity. Through comprehensive testing of 13 leading models, PCEval provides the first reproducible and automatically validated empirical assessment of LLMs' ability to reason about fundamental hardware implementation constraints within a simulation environment. Our findings reveal that while LLMs perform well in code generation and logical circuit design, they struggle significantly with physical breadboard layout creation, particularly in managing proper pin connections and avoiding circuit errors. PCEval advances our understanding of AI assistance in hardware-dependent computing environments and establishes a foundation for developing more effective tools to support physical computing education.
摘要：大型语言模型 (LLM) 在软件开发、教育和技术援助等各个领域都表现出了卓越的能力。其中，软件开发是法学硕士越来越多地采用的关键领域之一。然而，当考虑硬件约束时（例如，在物理计算中，软件必须与物理硬件交互并控制物理硬件），其有效性尚未得到充分探索。为了解决这一差距，我们引入了 \textsc{PCEval} （物理计算评估），这是物理计算领域的第一个基准，可以全自动评估 LLM 在项目逻辑和物理方面的能力，而无需人工评估。我们的评估框架评估法学硕士在不同项目复杂程度下生成电路和生成兼容代码的能力。通过对 13 个领先模型的全面测试，\textsc{PCEval} 为法学硕士在模拟环境中推理基本硬件实现约束的能力提供了第一个可重复且自动验证的经验评估。我们的研究结果表明，虽然法学硕士在代码生成和逻辑电路设计方面表现良好，但他们在物理面包板布局创建方面表现不佳，特别是在管理正确的引脚连接和避免电路错误方面。 \textsc{PCEval} 增进了我们对依赖硬件的计算环境中人工智能辅助的理解，并为开发更有效的工具来支持物理计算教育奠定了基础。

Title: ModeX: Evaluator-Free Best-of-N Selection for Open-Ended Generation

Authors: Hyeong Kyu Choi, Sharon Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.02535
Pdf URL: https://arxiv.org/pdf/2601.02535
Copy Paste: [[2601.02535]] ModeX: Evaluator-Free Best-of-N Selection for Open-Ended Generation(https://arxiv.org/abs/2601.02535)
Keywords: language model, llm
Abstract: Selecting a single high-quality output from multiple stochastic generations remains a fundamental challenge for large language models (LLMs), particularly in open-ended tasks where no canonical answer exists. While Best-of-N and self-consistency methods show that aggregating multiple generations can improve performance, existing approaches typically rely on external evaluators, reward models, or exact string-match voting, limiting their applicability and efficiency. We propose Mode Extraction (ModeX), an evaluator-free Best-of-N selection framework that generalizes majority voting to open-ended text generation by identifying the modal output representing the dominant semantic consensus among generated texts. ModeX constructs a similarity graph over candidate generations and recursively applies spectral clustering to select a representative centroid, without requiring additional inference or auxiliary models. We further instantiate this selection principle as ModeX-Lite, an improved version of ModeX with early pruning for efficiency. Across open-ended tasks, including text summarization, code generation, and mathematical reasoning, our approaches consistently outperform standard single- and multi-path baselines, providing a computationally efficient solution for robust open-ended text generation. Code is released at the link given on the arXiv abstract page.
摘要：从多个随机生成中选择单个高质量输出仍然是大型语言模型 (LLM) 的基本挑战，特别是在不存在规范答案的开放式任务中。虽然 Best-of-N 和自我一致性方法表明聚合多代可以提高性能，但现有方法通常依赖于外部评估者、奖励模型或精确的字符串匹配投票，限制了它们的适用性和效率。我们提出了模式提取（ModeX），这是一种无需评估者的 Best-of-N 选择框架，通过识别代表生成文本之间主要语义共识的模式输出，将多数投票推广到开放式文本生成。 ModeX 在候选代上构建相似性图，并递归地应用谱聚类来选择代表性质心，而不需要额外的推理或辅助模型。我们进一步将这种选择原则实例化为ModeX--Lite，这是ModeX的改进版本，具有早期剪枝以提高效率。在开放式任务中——包括文本摘要、代码生成和数学推理——我们的方法始终优于标准的单路径和多路径基线，为强大的开放式文本生成提供了计算高效的解决方案。代码在此 https URL 中发布。
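The idea of picking the "modal" candidate can be illustrated with a much-simplified sketch. It replaces the recursive spectral clustering described in the abstract with a plain medoid pick (the candidate with the highest total similarity to all others), and it uses word-level Jaccard overlap as a stand-in similarity. Both choices, and the names `jaccard` and `select_consensus`, are assumptions made for illustration and are not the authors' method.

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity (an illustrative stand-in metric)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def select_consensus(candidates: list[str]) -> str:
    """Return the candidate most similar, in total, to all the others."""
    if not candidates:
        raise ValueError("need at least one candidate")
    n = len(candidates)
    # Dense similarity graph: one edge weight per pair of candidates.
    sim = [[jaccard(candidates[i], candidates[j]) for j in range(n)]
           for i in range(n)]
    # Weighted degree of each node, excluding the self-loop.
    degree = [sum(row) - row[i] for i, row in enumerate(sim)]
    return candidates[max(range(n), key=degree.__getitem__)]


# Hypothetical generations: three paraphrases and one outlier.
outputs = [
    "the cat sat on the mat",
    "a cat sat on the mat",
    "the cat is on the mat",
    "stocks fell sharply today",
]
best = select_consensus(outputs)
print(best)
```

The outlier has zero overlap with the rest, so it can never win; the pick lands on the paraphrase that shares the most words with its neighbours. A semantic embedding similarity would be the natural replacement for `jaccard` on real outputs.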

Title: LoRA-Drop: Temporal LoRA Decoding for Efficient LLM Inference

Authors: Hossein Rajabzadeh, Maryam Dialameh, Chul B. Park, Il-Min Kim, Hyock Ju Kwon
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.02569
Pdf URL: https://arxiv.org/pdf/2601.02569
Copy Paste: [[2601.02569]] LoRA-Drop: Temporal LoRA Decoding for Efficient LLM Inference(https://arxiv.org/abs/2601.02569)
Keywords: language model, llm
Abstract: Autoregressive large language models (LLMs) are bottlenecked by sequential decoding, where each new token typically requires executing all transformer layers. Existing dynamic-depth and layer-skipping methods reduce this cost, but often rely on auxiliary routing mechanisms or incur accuracy degradation when bypassed layers are left uncompensated. We present LoRA-Drop, a plug-and-play inference framework that accelerates decoding by applying a temporal compute schedule to a fixed subset of intermediate layers: on most decoding steps, selected layers reuse the previous-token hidden state and apply a low-rank LoRA correction, while periodic refresh steps execute the full model to prevent drift. LoRA-Drop requires no routing network, is compatible with standard KV caching, and can reduce KV-cache footprint by skipping KV updates in droppable layers during LoRA steps and refreshing periodically. Across LLaMA2-7B, LLaMA3-8B, Qwen2.5-7B, and Qwen2.5-14B, LoRA-Drop achieves up to 2.6× faster decoding and 45-55% KV-cache reduction while staying within 0.5 percentage points (pp) of baseline accuracy. Evaluations on reasoning (GSM8K, MATH, BBH), code generation (HumanEval, MBPP), and long-context/multilingual benchmarks (LongBench, XNLI, XCOPA) identify a consistent safe zone of scheduling configurations that preserves quality while delivering substantial efficiency gains, providing a simple path toward adaptive-capacity inference in LLMs. Code is available at the link given on the arXiv abstract page.
摘要：自回归大型语言模型 (LLM) 受到顺序解码的瓶颈，其中每个新标记通常需要执行所有转换器层。现有的动态深度和跳层方法降低了这种成本，但通常依赖辅助路由机制，或者当绕过的层未得到补偿时会导致精度下降。我们提出了 \textbf{LoRA-Drop}，一个即插即用的推理框架，它通过将 \emph{temporalcomputeschedule} 应用于中间层的固定子集来加速解码：在大多数解码步骤中，选定的层重用先前的令牌隐藏状态并应用低秩 LoRA 校正，而定期的 \emph{refresh} 步骤执行完整模型以防止漂移。 LoRA-Drop 不需要路由网络，与标准 KV 缓存兼容，并且可以通过在 LoRA 步骤期间跳过可删除层中的 KV 更新并定期刷新来减少 KV 缓存占用空间。在 \textbf{LLaMA2-7B}、\textbf{LLaMA3-8B}、\textbf{Qwen2.5-7B} 和 \textbf{Qwen2.5-14B} 中，LoRA-Drop 实现了 \textbf{2.6$\times$ 更快的解码}和 \textbf{45--55\% KV 缓存减少}，同时保持在基线准确度\textbf{0.5 个百分点 (pp)}。对推理（GSM8K、MATH、BBH）、代码生成（HumanEval、MBPP）和长上下文/多语言基准（LongBench、XNLI、XCOPA）的评估确定了调度配置的一致\emph{安全区}，可以在保持质量的同时提供显着的效率增益，为法学硕士中的自适应能力推理提供简单的路径。代码可从此 https URL 获取。
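To make the temporal schedule concrete, the sketch below counts how many per-layer KV entries get written under a fixed refresh period. It assumes, as a simplification of what the abstract describes, that droppable layers write KV only on refresh steps and that refreshes happen every `period` steps. The function names and the layer counts are hypothetical and do not come from the paper.

```python
def is_refresh_step(step: int, period: int) -> bool:
    """True when the full model runs (0-indexed steps, refresh every `period`)."""
    return step % period == 0


def kv_cache_fraction(n_layers: int, n_droppable: int,
                      period: int, n_steps: int) -> float:
    """Fraction of KV entries written, relative to running every layer each step."""
    if not (0 <= n_droppable <= n_layers) or period < 1 or n_steps < 1:
        raise ValueError("invalid configuration")
    written = 0
    for step in range(n_steps):
        if is_refresh_step(step, period):
            written += n_layers                # full model: every layer stores KV
        else:
            written += n_layers - n_droppable  # LoRA step: droppable layers skip KV
    return written / (n_layers * n_steps)


# Hypothetical configuration: 32 layers, 16 droppable, refresh every 4th step.
fraction = kv_cache_fraction(n_layers=32, n_droppable=16, period=4, n_steps=8)
print(f"KV entries kept: {fraction:.1%}")
```

With these made-up numbers the cache holds 62.5% of the baseline entries, a 37.5% reduction. The figures reported in the abstract come from the authors' own configurations, which this toy count does not attempt to reproduce.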

Title: Fact-Checking with Large Language Models via Probabilistic Certainty and Consistency

Authors: Haoran Wang, Maryam Khalid, Qiong Wu, Jian Gao, Cheng Cao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.02574
Pdf URL: https://arxiv.org/pdf/2601.02574
Copy Paste: [[2601.02574]] Fact-Checking with Large Language Models via Probabilistic Certainty and Consistency(https://arxiv.org/abs/2601.02574)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly used in applications requiring factual accuracy, yet their outputs often contain hallucinated responses. While fact-checking can mitigate these errors, existing methods typically retrieve external evidence indiscriminately, overlooking the model's internal knowledge and potentially introducing irrelevant noise. Moreover, current systems lack targeted mechanisms to resolve specific uncertainties in the model's reasoning. Inspired by how humans fact-check, we argue that LLMs should adaptively decide whether to rely on internal knowledge or initiate retrieval based on their confidence in a given claim. We introduce Probabilistic Certainty and Consistency (PCC), a framework that estimates factual confidence by jointly modeling an LLM's probabilistic certainty and reasoning consistency. These confidence signals enable an adaptive verification strategy: the model answers directly when confident, triggers targeted retrieval when uncertain or inconsistent, and escalates to deep search when ambiguity is high. Our confidence-guided routing mechanism ensures that retrieval is invoked only when necessary, improving both efficiency and reliability. Extensive experiments across three challenging benchmarks show that PCC achieves better uncertainty quantification than verbalized confidence and consistently outperforms strong LLM-based fact-checking baselines. Furthermore, we demonstrate that PCC generalizes well across various LLMs.
摘要：大型语言模型 (LLM) 越来越多地用于需要事实准确性的应用中，但其输出通常包含幻觉响应。虽然事实检查可以减轻这些错误，但现有方法通常不加区别地检索外部证据，忽略模型的内部知识并可能引入不相关的噪声。此外，当前系统缺乏有针对性的机制来解决模型推理中的特定不确定性。受人类事实核查方式的启发，我们认为法学硕士应该根据他们对给定主张的信心自适应地决定是依赖内部知识还是启动检索。我们引入概率确定性和一致性（PCC），这是一个通过联合建模法学硕士的概率确定性和推理一致性来估计事实置信度的框架。这些置信度信号实现了自适应验证策略：模型在有信心时直接回答，在不确定或不一致时触发有针对性的检索，并在模糊性较高时升级到深度搜索。我们的置信引导路由机制确保仅在必要时调用检索，从而提高效率和可靠性。跨越三个具有挑战性的基准的广泛实验表明，PCC 比口头信心实现了更好的不确定性量化，并且始终优于基于 LLM 的事实检查基线。此外，我们证明 PCC 可以很好地推广到各种法学硕士。
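The three-way routing described above (answer directly, targeted retrieval, deep search) can be sketched as a small decision function. The abstract says the framework jointly models certainty and consistency but gives no formula, so this sketch simply takes the weaker of the two signals. The thresholds `confident` and `ambiguous`, the action labels, and the function name `route` are all assumptions.

```python
def route(certainty: float, consistency: float,
          confident: float = 0.8, ambiguous: float = 0.4) -> str:
    """Pick a verification action from two confidence signals in [0, 1]."""
    for name, value in (("certainty", certainty), ("consistency", consistency)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")
    # Be conservative: let the weaker signal decide (an assumed combination rule).
    weakest = min(certainty, consistency)
    if weakest >= confident:
        return "answer_directly"
    if weakest >= ambiguous:
        return "targeted_retrieval"
    return "deep_search"


print(route(0.90, 0.95))  # both signals high
print(route(0.90, 0.50))  # inconsistent reasoning
print(route(0.20, 0.90))  # low probabilistic certainty
```

Retrieval is only triggered when one of the signals drops below the upper threshold, which is the efficiency argument the abstract makes.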

Title: DataParasite Enables Scalable and Repurposable Online Data Curation

Authors: Mengyi Sun (Cold Spring Harbor Laboratory)
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2601.02578
Pdf URL: https://arxiv.org/pdf/2601.02578
Copy Paste: [[2601.02578]] DataParasite Enables Scalable and Repurposable Online Data Curation(https://arxiv.org/abs/2601.02578)
Keywords: language model, agent
Abstract: Many questions in computational social science rely on datasets assembled from heterogeneous online sources, a process that is often labor-intensive, costly, and difficult to reproduce. Recent advances in large language models enable agentic search and structured extraction from the web, but existing systems are frequently opaque, inflexible, or poorly suited to scientific data curation. Here we introduce DataParasite, an open-source, modular pipeline for scalable online data collection. DataParasite decomposes tabular curation tasks into independent, entity-level searches defined through lightweight configuration files and executed through a shared, task-agnostic Python script. Crucially, the same pipeline can be repurposed to new tasks, including those without predefined entity lists, using only natural-language instructions. We evaluate the pipeline on multiple canonical tasks in computational social science, including faculty hiring histories, elite death events, and political career trajectories. Across tasks, DataParasite achieves high accuracy while reducing data-collection costs by an order of magnitude relative to manual curation. By lowering the technical and labor barriers to online data assembly, DataParasite provides a practical foundation for scalable, transparent, and reusable data curation in computational social science and beyond.
摘要：计算社会科学中的许多问题都依赖于异构在线资源组装的数据集，这个过程通常是劳动密集型、成本高昂且难以重现。大型语言模型的最新进展使得代理搜索和从网络中结构化提取成为可能，但现有系统通常不透明、不灵活或不太适合科学数据管理。在这里，我们介绍 DataParasite，这是一个用于可扩展在线数据收集的开源模块化管道。 DataParasite 将表格管理任务分解为通过轻量级配置文件定义的独立实体级搜索，并通过与任务无关的共享 python 脚本执行。至关重要的是，仅使用自然语言指令即可将同一管道重新用于新任务，包括那些没有预定义实体列表的任务。我们评估了计算社会科学中多个典型任务的流程，包括教师招聘历史、精英死亡事件和政治职业轨迹。在各种任务中，DataParasite 实现了高精度，同时相对于手动管理将数据收集成本降低了一个数量级。通过降低在线数据组装的技术和劳动力障碍，DataParasite 为计算社会科学及其他领域的可扩展、透明和可重用的数据管理提供了实用的基础。

Title: Reconstructing Item Characteristic Curves using Fine-Tuned Large Language Models

Authors: Christopher Ormerod
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.02580
Pdf URL: https://arxiv.org/pdf/2601.02580
Copy Paste: [[2601.02580]] Reconstructing Item Characteristic Curves using Fine-Tuned Large Language Models(https://arxiv.org/abs/2601.02580)
Keywords: language model, llm
Abstract: Traditional methods for determining assessment item parameters, such as difficulty and discrimination, rely heavily on expensive field testing to collect student performance data for Item Response Theory (IRT) calibration. This study introduces a novel approach that implicitly models these psychometric properties by fine-tuning Large Language Models (LLMs) to simulate student responses across a spectrum of latent abilities. Leveraging the Qwen-3 dense model series and Low-Rank Adaptation (LoRA), we train models to generate responses to multiple choice questions conditioned on discrete ability descriptors. We reconstruct the probability of a correct response as a function of student ability, effectively generating synthetic Item Characteristic Curves (ICCs) to estimate IRT parameters. Evaluation on a dataset of Grade 6 English Language Arts (ELA) items and the BEA 2024 Shared Task dataset demonstrates that this method competes with or outperforms baseline approaches. This simulation-based technique seems particularly effective at modeling item discrimination.
摘要：确定评估项目参数（例如难度和区分度）的传统方法严重依赖昂贵的现场测试来收集学生表现数据以进行项目反应理论（IRT）校准。这项研究引入了一种新颖的方法，通过微调大型语言模型 (LLM) 来隐式模拟这些心理测量特性，以模拟学生对一系列潜在能力的反应。利用 Qwen-3 密集模型系列和低秩适应 (LoRA)，我们训练模型以生成对基于离散能力描述符的多项选择问题的响应。我们将正确响应的概率重建为学生能力的函数，有效地生成综合项目特征曲线 (ICC) 来估计 IRT 参数。对 6 年级英语语言艺术 (ELA) 项目数据集和 BEA 2024 共享任务数据集的评估表明，该方法可与基线方法竞争或优于基线方法。这种基于模拟的技术在建模项目区分方面似乎特别有效。
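One way to turn simulated response rates into item parameters is to fit a two-parameter logistic (2PL) curve, P(θ) = 1 / (1 + exp(−a(θ − b))), where a is discrimination and b is difficulty. The abstract does not say which IRT model or fitting procedure the study uses, so the 2PL form, the least-squares fit on logits, and the ability grid below are all assumptions.

```python
import math


def icc_2pl(theta: float, a: float, b: float) -> float:
    """2PL item characteristic curve: P(correct | ability theta)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def fit_2pl(thetas: list[float], p_correct: list[float]) -> tuple[float, float]:
    """Estimate (a, b) by least squares on logit(p) = a * theta - a * b."""
    if len(thetas) != len(p_correct) or len(thetas) < 2:
        raise ValueError("need at least two matched (theta, p) points")
    eps = 1e-6  # keep proportions away from 0 and 1 so the logit stays finite
    logits = []
    for p in p_correct:
        p = min(max(p, eps), 1.0 - eps)
        logits.append(math.log(p / (1.0 - p)))
    n = len(thetas)
    mean_t = sum(thetas) / n
    mean_y = sum(logits) / n
    sxx = sum((t - mean_t) ** 2 for t in thetas)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(thetas, logits))
    if sxx == 0 or sxy == 0:
        raise ValueError("curve is flat or abilities are all identical")
    a = sxy / sxx                    # slope of the logit line = discrimination
    intercept = mean_y - a * mean_t
    b = -intercept / a               # ability at which P(correct) = 0.5
    return a, b


# Hypothetical ability levels and simulated proportions correct for one item.
abilities = [-2.0, -1.0, 0.0, 1.0, 2.0]
observed = [icc_2pl(t, a=1.5, b=0.5) for t in abilities]
a_hat, b_hat = fit_2pl(abilities, observed)
print(round(a_hat, 3), round(b_hat, 3))
```

Because the toy proportions are generated from an exact 2PL curve, the fit recovers the generating parameters. Proportions estimated from a finite number of simulated responses would be noisy, and a maximum-likelihood fit would usually be preferred there.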

Title: FlowPlan-G2P: A Structured Generation Framework for Transforming Scientific Papers into Patent Descriptions

Authors: Kris W Pan, Yongmin Yoo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.02589
Pdf URL: https://arxiv.org/pdf/2601.02589
Copy Paste: [[2601.02589]] FlowPlan-G2P: A Structured Generation Framework for Transforming Scientific Papers into Patent Descriptions(https://arxiv.org/abs/2601.02589)
Keywords: llm, prompt
Abstract: Over 3.5 million patents are filed annually, with drafting patent descriptions requiring deep technical and legal expertise. Transforming scientific papers into patent descriptions is particularly challenging due to their differing rhetorical styles and stringent legal requirements. Unlike black-box text-to-text approaches that struggle to model structural reasoning and legal constraints, we propose FlowPlan-G2P, a novel framework that mirrors the cognitive workflow of expert drafters by reformulating this task into three stages: (1) Concept Graph Induction, extracting technical entities and relationships into a directed graph via expert-like reasoning; (2) Paragraph and Section Planning, reorganizing the graph into coherent clusters aligned with canonical patent sections; and (3) Graph-Conditioned Generation, producing legally compliant paragraphs using section-specific subgraphs and tailored prompts. Experiments demonstrate that FlowPlan-G2P significantly improves logical coherence and legal compliance over end-to-end LLM baselines. Our framework establishes a new paradigm for paper-to-patent generation and advances structured text generation for specialized domains.
摘要：每年提交超过 350 万项专利，起草专利说明需要深厚的技术和法律专业知识。由于其不同的修辞风格和严格的法律要求，将科学论文转化为专利描述尤其具有挑战性。与难以对结构推理和法律约束进行建模的黑盒文本到文本方法不同，我们提出了 FlowPlan-G2P，这是一种新颖的框架，通过将这项任务重新表述为三个阶段来反映专家起草者的认知工作流程：（1）概念图归纳，通过类似专家的推理将技术实体和关系提取到有向图中； (2) 段落和章节规划，将图表重新组织成与规范专利章节一致的连贯集群； (3) 图形条件生成，使用特定部分的子图和定制的提示生成合法的段落。实验表明，与端到端 LLM 基线相比，FlowPlan-G2P 显着提高了逻辑一致性和法律合规性。我们的框架建立了纸张到专利生成的新范例，并推进了专业领域的结构化文本生成。

Title: Scalable Construction of a Lung Cancer Knowledge Base: Profiling Semantic Reasoning in LLMs

Authors: Cesar Felipe Martínez Cisneros, Jesús Ulises Quiroz Bautista, Claudia Anahí Guzmán Solano, Bogdan Kaleb García Rivera, Iván García Pacheco, Yalbi Itzel Balderas Martínez, Kolawole John Adebayo, Ignacio Arroyo Fernández
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.02604
Pdf URL: https://arxiv.org/pdf/2601.02604
Copy Paste: [[2601.02604]] Scalable Construction of a Lung Cancer Knowledge Base: Profiling Semantic Reasoning in LLMs(https://arxiv.org/abs/2601.02604)
Keywords: language model, llm
Abstract: The integration of Large Language Models (LLMs) into biomedical research offers new opportunities for domain-specific reasoning and knowledge representation. However, their performance depends heavily on the semantic quality of training data. In oncology, where precision and interpretability are vital, scalable methods for constructing structured knowledge bases are essential for effective fine-tuning. This study presents a pipeline for developing a lung cancer knowledge base using Open Information Extraction (OpenIE). The process includes: (1) identifying medical concepts with the MeSH thesaurus; (2) filtering open-access PubMed literature with permissive licenses (CC0); (3) extracting (subject, relation, object) triplets using an OpenIE method; and (4) enriching triplet sets with Named Entity Recognition (NER) to ensure biomedical relevance. The resulting triplet sets provide a domain-specific, large-scale, and noise-aware resource for fine-tuning LLMs. We evaluated T5 models fine-tuned on this dataset through Supervised Semantic Fine-Tuning. Comparative assessments with ROUGE and BERTScore show significantly improved performance and semantic coherence, demonstrating the potential of OpenIE-derived resources as scalable, low-cost solutions for enhancing biomedical NLP.
摘要：将大型语言模型 (LLM) 集成到生物医学研究中为特定领域推理和知识表示提供了新的机会。然而，它们的性能在很大程度上取决于训练数据的语义质量。在肿瘤学中，精度和可解释性至关重要，构建结构化知识库的可扩展方法对于有效的微调至关重要。本研究提出了使用开放信息提取 (OpenIE) 开发肺癌知识库的管道。该过程包括：（1）利用MeSH同义词库识别医学概念； (2) 过滤具有许可许可 (CC0) 的开放获取 PubMed 文献； (3)利用OpenIE方法提取(主语、关系、宾语)三元组； (4) 使用命名实体识别 (NER) 丰富三元组集，以确保生物医学相关性。由此产生的三元组集为 LLM 的微调提供了特定于领域的、大规模的、噪声感知的资源。我们评估了通过监督语义微调在此数据集上微调的 T5 模型。与 ROUGE 和 BERTScore 的比较评估显示性能和语义一致性显着提高，证明了 OpenIE 衍生资源作为增强生物医学 NLP 的可扩展、低成本解决方案的潜力。

Title: Improved Evidence Extraction for Document Inconsistency Detection with LLMs

Authors: Nelvin Tan, Yaowen Zhang, James Asikin Cheung, Fusheng Liu, Yu-Ching Shih, Dong Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.02627
Pdf URL: https://arxiv.org/pdf/2601.02627
Copy Paste: [[2601.02627]] Improved Evidence Extraction for Document Inconsistency Detection with LLMs(https://arxiv.org/abs/2601.02627)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are becoming useful in many domains due to their impressive abilities that arise from large training datasets and large model sizes. However, research on LLM-based approaches to document inconsistency detection is relatively limited. There are two key aspects of document inconsistency detection: (i) classification of whether there exists any inconsistency, and (ii) providing evidence of the inconsistent sentences. We focus on the latter, and introduce new comprehensive evidence-extraction metrics and a redact-and-retry framework with constrained filtering that substantially improves LLM-based document inconsistency detection over direct prompting. We back our claims with promising experimental results.
摘要：大型语言模型 (LLM) 由于大型训练数据集和大型模型大小而具有令人印象深刻的能力，因此在许多领域变得越来越有用。然而，基于法学硕士的文档不一致检测方法的研究相对有限。文档不一致检测有两个关键方面：（i）对是否存在不一致进行分类，以及（ii）提供不一致句子的证据。我们重点关注后者，并引入新的综合证据提取指标和带有约束过滤的编辑和重试框架，与直接提示相比，该框架大大改进了基于 LLM 的文档不一致检测。我们用有希望的实验结果来支持我们的主张。

Title: Empirical Comparison of Encoder-Based Language Models and Feature-Based Supervised Machine Learning Approaches to Automated Scoring of Long Essays

Authors: Kuo Wang, Haowei Hua, Pengfei Yan, Hong Jiao, Dan Song
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.02659
Pdf URL: https://arxiv.org/pdf/2601.02659
Copy Paste: [[2601.02659]] Empirical Comparison of Encoder-Based Language Models and Feature-Based Supervised Machine Learning Approaches to Automated Scoring of Long Essays(https://arxiv.org/abs/2601.02659)
Keywords: language model, long context
Abstract: Long context may pose challenges for encoder-only language models in text processing, specifically for automated scoring of essays. This study trained several commonly used encoder-based language models for automated scoring of long essays. The performance of these trained models was evaluated and compared with the ensemble models built upon the base language models with a token limit of 512. The models evaluated include BERT-based models (BERT, RoBERTa, DistilBERT, and DeBERTa), ensemble models integrating embeddings from multiple encoder models, and ensemble models of feature-based supervised machine learning models, including Gradient-Boosted Decision Trees, eXtreme Gradient Boosting, and Light Gradient Boosting Machine. We trained, validated, and tested each model on a dataset of 17,307 essays, with an 80%/10%/10% split, and evaluated model performance using Quadratic Weighted Kappa. This study revealed that an ensemble-of-embeddings model that combines multiple pre-trained language model representations with a gradient-boosting classifier as the ensemble model significantly outperforms individual language models at scoring long essays.
摘要：长上下文可能会给文本处理中的仅编码器语言模型带来挑战，特别是对于论文的自动评分。这项研究训练了几种常用的基于编码器的语言模型，用于长论文的自动评分。对这些经过训练的模型的性能进行了评估，并与基于基本语言模型构建的集成模型（令牌限制为 512？）进行了比较。实验模型包括基于 BERT 的模型（BERT、RoBERTa、DistilBERT 和 DeBERTa）、集成多个编码器模型嵌入的集成模型以及基于特征的监督机器学习模型的集成模型，包括梯度提升决策树、eXtreme Gradient Boosting 和 Light Gradient Boosting Machine。我们在包含 17,307 篇论文的数据集上训练、验证和测试每个模型，比例为 80%/10%/10%，并使用二次加权 Kappa 评估模型性能。这项研究表明，将多个预训练语言模型表示与梯度提升分类器相结合的嵌入集成模型在长论文评分方面显着优于单个语言模型。
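For readers unfamiliar with the evaluation metric, Quadratic Weighted Kappa compares the observed rating-agreement matrix O with the matrix E expected by chance, using weights w_ij = (i − j)² / (N − 1)², and reports κ = 1 − Σ w·O / Σ w·E. The sketch below is a plain-Python version of that standard definition. It assumes scores are integers from 0 to `n_classes - 1`, which is an illustrative convention; the essay dataset's actual score range is not given in the abstract.

```python
def quadratic_weighted_kappa(y_true: list[int], y_pred: list[int],
                             n_classes: int) -> float:
    """QWK for integer ratings in 0..n_classes-1."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    if n_classes < 2:
        raise ValueError("need at least two rating classes")
    n = len(y_true)
    # Observed confusion matrix: rows are true scores, columns are predictions.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n_classes))
                 for j in range(n_classes)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2     # quadratic penalty
            expected = hist_true[i] * hist_pred[j] / n  # count expected by chance
            num += w * observed[i][j]
            den += w * expected
    if den == 0.0:
        return 1.0  # every rating falls in a single class: no disagreement possible
    return 1.0 - num / den


print(quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], 4))  # perfect agreement
print(quadratic_weighted_kappa([0, 1, 2], [2, 1, 0], 3))        # reversed ranking
```

The quadratic weight is what makes the metric suit essay scoring: a prediction two score points off is penalized four times as much as one that is a single point off.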

Title: When Do Tools and Planning Help LLMs Think? A Cost- and Latency-Aware Benchmark

Authors: Subha Ghoshal, Ali Al-Bustami
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.02663
Pdf URL: https://arxiv.org/pdf/2601.02663
Copy Paste: [[2601.02663]] When Do Tools and Planning Help LLMs Think? A Cost- and Latency-Aware Benchmark(https://arxiv.org/abs/2601.02663)
Keywords: language model, gpt, llm, prompt, agent
Abstract: Modern large language models (LLMs) increasingly rely on inference-time planning and external tools to improve reasoning. We benchmark this behavior on two real-world settings: event-centric question answering over graph-structured knowledge (Event-QA) and persuasive response generation in Reddit ChangeMyView (CMV). Using LangChain and LangGraph, we compare a one-shot baseline against a plan-execute-replan agent equipped with task-specific tools (DBpedia SPARQL/lookup/schema exploration, Wikipedia-focused retrieval, and topical web search). We evaluate on 60 examples each from Event-QA and CMV (3 splits of 20), and report both mean end-to-end latency and per-example token cost estimates. We evaluate GPT-4o and GPT-4o-mini under identical workflows and report accuracy and end-to-end latency. On Event-QA, the best tool-augmented configuration improves accuracy (e.g., 47.5% → 67.5% for GPT-4o) while increasing latency by orders of magnitude (~8s → ~317s per example). On CMV, one-shot prompting is strongest (e.g., GPT-4o-mini achieves 75% at ~6s), and planning+search increases latency substantially without consistent gains. However, complex multi-tool orchestration exposes failure modes where the smaller model degrades. Overall, the findings highlight the need for task-specific, cost-aware choices of both model size and agent/tooling complexity.
摘要：现代大型语言模型 (LLM) 越来越依赖推理时间规划和外部工具来改进推理。我们在两个现实环境中对这种行为进行了基准测试：基于图结构知识的以事件为中心的问答 (Event-QA) 和 Reddit ChangeMyView (CMV) 中的说服性响应生成。使用 LangChain 和 LangGraph，我们将一次性基线与配备特定任务工具（DBpedia SPARQL/查找/模式探索、以维基百科为中心的检索和主题网络搜索）的计划-执行-重新计划代理进行比较。我们对来自 Event-QA 和 CMV 的 60 个示例进行评估（20 个分为 3 个部分），并报告平均端到端延迟和每个示例的令牌成本估计。我们在相同的工作流程下评估 GPT-4o 和 GPT-4o-mini，并报告准确性和端到端延迟。在事件 QA 上，最佳的工具增强配置可提高准确性（例如，GPT-4o 为 47.5\% $\rightarrow$ 67.5\%），同时将延迟增加几个数量级（每个示例为 $\sim$8s $\rightarrow$ $\sim$317s）。在 CMV 上，一次性提示最强（例如，GPT-4o-mini 在 $\sim$6s 下达到 75%），而规划+搜索会大幅增加延迟，而没有一致的增益。然而，复杂的多工具编排暴露了较小模型性能下降的故障模式。总体而言，研究结果强调需要针对特定任务、对模型大小和代理/工具复杂性进行具有成本意识的选择。
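The accuracy-versus-latency trade-off quoted for GPT-4o on Event-QA can be worked through with a few lines of arithmetic. The inputs are the approximate figures from the abstract (47.5% to 67.5% accuracy, about 8 s to about 317 s per example). The helper name `tradeoff` and the "seconds per accuracy point" summary are illustrative additions, not metrics reported by the authors.

```python
def tradeoff(acc_before: float, acc_after: float,
             lat_before_s: float, lat_after_s: float) -> tuple[float, float]:
    """Return (accuracy gain in percentage points, latency multiplier)."""
    if lat_before_s <= 0:
        raise ValueError("baseline latency must be positive")
    return acc_after - acc_before, lat_after_s / lat_before_s


# Approximate GPT-4o figures on Event-QA, taken from the abstract.
gain_pp, slowdown = tradeoff(47.5, 67.5, 8.0, 317.0)
# Extra wall-clock seconds paid per percentage point of accuracy gained.
seconds_per_point = (317.0 - 8.0) / gain_pp

print(f"+{gain_pp:.1f} pp accuracy at {slowdown:.1f}x latency")
print(f"{seconds_per_point:.2f} extra seconds per accuracy point")
```

On these rounded inputs the tool-augmented agent buys 20 percentage points of accuracy for roughly a 40-fold increase in per-example latency, about 15 extra seconds per point.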

Title: Towards Comprehensive Stage-wise Benchmarking of Large Language Models in Fact-Checking

Authors: Hongzhan Lin, Zixin Chen, Zhiqi Shen, Ziyang Luo, Zhen Ye, Jing Ma, Tat-Seng Chua, Guandong Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.02669
Pdf URL: https://arxiv.org/pdf/2601.02669
Copy Paste: [[2601.02669]] Towards Comprehensive Stage-wise Benchmarking of Large Language Models in Fact-Checking(https://arxiv.org/abs/2601.02669)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) are increasingly deployed in real-world fact-checking systems, yet existing evaluations focus predominantly on claim verification and overlook the broader fact-checking workflow, including claim extraction and evidence retrieval. This narrow focus prevents current benchmarks from revealing systematic reasoning failures, factual blind spots, and robustness limitations of modern LLMs. To bridge this gap, we present FactArena, a fully automated arena-style evaluation framework that conducts comprehensive, stage-wise benchmarking of LLMs across the complete fact-checking pipeline. FactArena integrates three key components: (i) an LLM-driven fact-checking process that standardizes claim decomposition, evidence retrieval via tool-augmented interactions, and justification-based verdict prediction; (ii) an arena-styled judgment mechanism guided by consolidated reference guidelines to ensure unbiased and consistent pairwise comparisons across heterogeneous judge agents; and (iii) an arena-driven claim-evolution module that adaptively generates more challenging and semantically controlled claims to probe LLMs' factual robustness beyond fixed seed data. Across 16 state-of-the-art LLMs spanning seven model families, FactArena produces stable and interpretable rankings. Our analyses further reveal significant discrepancies between static claim-verification accuracy and end-to-end fact-checking competence, highlighting the necessity of holistic evaluation. The proposed framework offers a scalable and trustworthy paradigm for diagnosing LLMs' factual reasoning, guiding future model development, and advancing the reliable deployment of LLMs in safety-critical fact-checking applications.
摘要：大型语言模型（LLM）越来越多地部署在现实世界的事实检查系统中，但现有的评估主要集中在声明验证上，而忽视了更广泛的事实检查工作流程，包括声明提取和证据检索。这种狭隘的关注点阻碍了当前的基准揭示现代法学硕士的系统推理失败、事实盲点和鲁棒性限制。为了弥补这一差距，我们推出了 FactArena，这是一个完全自动化的竞技场式评估框架，可以在整个事实检查流程中对法学硕士进行全面、分阶段的基准测试。 FactArena 集成了三个关键组件：(i) 法学硕士驱动的事实核查流程，可标准化索赔分解、通过工具增强交互进行的证据检索以及基于理由的判决预测； (ii) 以综合参考指南为指导的竞技场式判断机制，以确保异质判断代理人之间进行公正且一致的成对比较； (iii) 一个竞技场驱动的声明演化模块，该模块自适应地生成更具挑战性和语义控制的声明，以探索法学硕士超越固定种子数据的事实鲁棒性。 FactArena 涵盖 7 个模型系列的 16 个最先进的法学硕士，提供稳定且可解释的排名。我们的分析进一步揭示了静态索赔验证准确性与端到端事实检查能力之间的显着差异，凸显了整体评估的必要性。所提出的框架提供了一个可扩展且值得信赖的范例，用于诊断法学硕士的事实推理、指导未来的模型开发以及推进法学硕士在安全关键事实检查应用程序中的可靠部署。

Title: Multi-Turn Jailbreaking of Aligned LLMs via Lexical Anchor Tree Search

Authors: Devang Kulshreshtha, Hang Su, Chinmay Hegde, Haohan Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.02670
Pdf URL: https://arxiv.org/pdf/2601.02670
Copy Paste: [[2601.02670]] Multi-Turn Jailbreaking of Aligned LLMs via Lexical Anchor Tree Search(https://arxiv.org/abs/2601.02670)
Keywords: gpt, llm, prompt
Abstract: Most jailbreak methods achieve high attack success rates (ASR) but require attacker LLMs to craft adversarial queries and/or demand high query budgets. These resource limitations make jailbreaking expensive, and the queries generated by attacker LLMs often consist of non-interpretable random prefixes. This paper introduces Lexical Anchor Tree Search (LATS), addressing these limitations through an attacker-LLM-free method that operates purely via lexical anchor injection. LATS reformulates jailbreaking as a breadth-first tree search over multi-turn dialogues, where each node incrementally injects missing content words from the attack goal into benign prompts. Evaluations on AdvBench and HarmBench demonstrate that LATS achieves 97-100% ASR on the latest GPT, Claude, and Llama models with an average of only ~6.4 queries, compared to 20+ queries required by other methods. These results highlight conversational structure as a potent and under-protected attack surface, while demonstrating superior query efficiency in an era where high ASR is readily achievable. Our code will be released to support reproducibility.
摘要：大多数越狱方法可实现较高的攻击成功率 (ASR)，但需要攻击者法学硕士精心设计对抗性查询和/或要求较高的查询预算。这些资源限制使得越狱成本高昂，并且攻击者 LLM 生成的查询通常包含不可解释的随机前缀。本文介绍了词法锚树搜索 (Lexical Anchor Tree Search)，通过纯粹通过词法锚注入操作的无攻击者 LLM 方法解决了这些限制。 LATS 将越狱重新表述为多轮对话上的广度优先树搜索，其中每个节点增量地将攻击目标中缺失的内容词注入到良性提示中。对 AdvBench 和 HarmBench 的评估表明，LATS 在最新的 GPT、Claude 和 Llama 模型上实现了 97-100% 的 ASR，平均只需约 6.4 次查询，而其他方法需要 20 多个查询。这些结果凸显了会话结构作为一种有效且未受保护的攻击面，同时在可以轻松实现高 ASR 的时代展示了卓越的查询效率。我们的代码将被发布以支持可重复性。

Title: Extracting books from production language models

Authors: Ahmed Ahmed, A. Feder Cooper, Sanmi Koyejo, Percy Liang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.02671
Pdf URL: https://arxiv.org/pdf/2601.02671
Copy Paste: [[2601.02671]] Extracting books from production language models(https://arxiv.org/abs/2601.02671)
Keywords: language model, gpt, llm, prompt
Abstract: Many unresolved legal questions over LLMs and copyright center on memorization: whether specific training data have been encoded in the model's weights during training, and whether those memorized data can be extracted in the model's outputs. While many believe that LLMs do not memorize much of their training data, recent work shows that substantial amounts of copyrighted text can be extracted from open-weight models. However, it remains an open question if similar extraction is feasible for production LLMs, given the safety measures these systems implement. We investigate this question using a two-phase procedure: (1) an initial probe to test for extraction feasibility, which sometimes uses a Best-of-N (BoN) jailbreak, followed by (2) iterative continuation prompts to attempt to extract the book. We evaluate our procedure on four production LLMs -- Claude 3.7 Sonnet, GPT-4.1, Gemini 2.5 Pro, and Grok 3 -- and we measure extraction success with a score computed from a block-based approximation of longest common substring (nv-recall). With different per-LLM experimental configurations, we were able to extract varying amounts of text. For the Phase 1 probe, it was unnecessary to jailbreak Gemini 2.5 Pro and Grok 3 to extract text (e.g., nv-recall of 76.8% and 70.3%, respectively, for Harry Potter and the Sorcerer's Stone), while it was necessary for Claude 3.7 Sonnet and GPT-4.1. In some cases, jailbroken Claude 3.7 Sonnet outputs entire books near-verbatim (e.g., nv-recall=95.8%). GPT-4.1 requires significantly more BoN attempts (e.g., 20X), and eventually refuses to continue (e.g., nv-recall=4.0%). Taken together, our work highlights that, even with model- and system-level safeguards, extraction of (in-copyright) training data remains a risk for production LLMs.

Title: Iterative Structured Pruning for Large Language Models with Multi-Domain Calibration

Authors: Guangxin Wu, Hao Zhang, Zhang Zhibin, Jiafeng Guo, Xueqi Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.02674
Pdf URL: https://arxiv.org/pdf/2601.02674
Copy Paste: [[2601.02674]] Iterative Structured Pruning for Large Language Models with Multi-Domain Calibration(https://arxiv.org/abs/2601.02674)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have achieved remarkable success across a wide spectrum of natural language processing tasks. However, their ever-growing scale introduces significant barriers to real-world deployment, including substantial computational overhead, memory footprint, and inference latency. While model pruning presents a viable solution to these challenges, existing unstructured pruning techniques often yield irregular sparsity patterns that necessitate specialized hardware or software support. In this work, we explore structured pruning, which eliminates entire architectural components and maintains compatibility with standard hardware accelerators. We introduce a novel structured pruning framework that leverages a hybrid multi-domain calibration set and an iterative calibration strategy to effectively identify and remove redundant channels. Extensive experiments on various models across diverse downstream tasks show that our approach achieves significant compression with minimal performance degradation.

Title: EvoRoute: Experience-Driven Self-Routing LLM Agent Systems

Authors: Guibin Zhang, Haiyang Yu, Kaiming Yang, Bingli Wu, Fei Huang, Yongbin Li, Shuicheng Yan
Subjects: cs.CL, cs.MA
Abstract URL: https://arxiv.org/abs/2601.02695
Pdf URL: https://arxiv.org/pdf/2601.02695
Copy Paste: [[2601.02695]] EvoRoute: Experience-Driven Self-Routing LLM Agent Systems(https://arxiv.org/abs/2601.02695)
Keywords: language model, llm, agent
Abstract: Complex agentic AI systems, powered by a coordinated ensemble of Large Language Models (LLMs), tool and memory modules, have demonstrated remarkable capabilities on intricate, multi-turn tasks. However, this success is shadowed by prohibitive economic costs and severe latency, exposing a critical, yet underexplored, trade-off. We formalize this challenge as the Agent System Trilemma: the inherent tension among achieving state-of-the-art performance, minimizing monetary cost, and ensuring rapid task completion. To dismantle this trilemma, we introduce EvoRoute, a self-evolving model routing paradigm that transcends static, pre-defined model assignments. Leveraging an ever-expanding knowledge base of prior experience, EvoRoute dynamically selects Pareto-optimal LLM backbones at each step, balancing accuracy, efficiency, and resource use, while continually refining its own selection policy through environment feedback. Experiments on challenging agentic benchmarks such as GAIA and BrowseComp+ demonstrate that EvoRoute, when integrated into off-the-shelf agentic systems, not only sustains or enhances system performance but also reduces execution cost by up to 80% and latency by over 70%.
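To make "Pareto-optimal" concrete, the sketch below filters a list of candidate backbones down to those that no other candidate beats on accuracy, cost, and latency at once. It is only an illustration of Pareto filtering in general: the `Candidate` fields, the function names, and any numbers fed to it are hypothetical, and the abstract does not say how EvoRoute estimates these quantities or picks among the survivors.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """Hypothetical per-step estimate for one LLM backbone."""
    name: str
    accuracy: float  # higher is better
    cost: float      # lower is better
    latency: float   # lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    no_worse = (a.accuracy >= b.accuracy and a.cost <= b.cost
                and a.latency <= b.latency)
    strictly_better = (a.accuracy > b.accuracy or a.cost < b.cost
                       or a.latency < b.latency)
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]
```

A router would still need a policy for choosing one model from the front; that policy is the part the abstract says is learned from experience.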

Title: Mitigating Prompt-Induced Hallucinations in Large Language Models via Structured Reasoning

Authors: Jinbo Hao, Kai Yang, Qingzhen Su, Yang Chen, Yifan Li, Chao Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.02739
Pdf URL: https://arxiv.org/pdf/2601.02739
Copy Paste: [[2601.02739]] Mitigating Prompt-Induced Hallucinations in Large Language Models via Structured Reasoning(https://arxiv.org/abs/2601.02739)
Keywords: language model, gpt, llm, hallucination, prompt, chain-of-thought
Abstract: To address hallucination issues in large language models (LLMs), this paper proposes a method for mitigating prompt-induced hallucinations. Building on a knowledge distillation chain-style model, we introduce a code module to guide knowledge-graph exploration and incorporate code as part of the chain-of-thought prompt, forming an external knowledge input that provides more accurate and structured information to the model. Based on this design, we develop an improved knowledge distillation chain-style model and leverage it to analyze and constrain the reasoning process of LLMs, thereby improving inference accuracy. We empirically evaluate the proposed approach using GPT-4 and LLaMA-3.3 on multiple public datasets. Experimental results demonstrate that incorporating code modules significantly enhances the model's ability to capture contextual information and effectively mitigates prompt-induced hallucinations. Specifically, HIT@1, HIT@3, and HIT@5 improve by 15.64%, 13.38%, and 13.28%, respectively. Moreover, the proposed method achieves HIT@1, HIT@3, and HIT@5 scores exceeding 95% across several evaluation settings. These results indicate that the proposed approach substantially reduces hallucination behavior while improving the accuracy and verifiability of large language models.

Title: SYNAPSE: Empowering LLM Agents with Episodic-Semantic Memory via Spreading Activation

Authors: Hanqi Jiang, Junhao Chen, Yi Pan, Ling Chen, Weihang You, Yifan Zhou, Ruidong Zhang, Yohannes Abate, Tianming Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.02744
Pdf URL: https://arxiv.org/pdf/2601.02744
Copy Paste: [[2601.02744]] SYNAPSE: Empowering LLM Agents with Episodic-Semantic Memory via Spreading Activation(https://arxiv.org/abs/2601.02744)
Keywords: language model, llm, agent
Abstract: While Large Language Models (LLMs) excel at generalized reasoning, standard retrieval-augmented approaches fail to address the disconnected nature of long-term agentic memory. To bridge this gap, we introduce Synapse (Synergistic Associative Processing Semantic Encoding), a unified memory architecture that transcends static vector similarity. Drawing from cognitive science, Synapse models memory as a dynamic graph where relevance emerges from spreading activation rather than pre-computed links. By integrating lateral inhibition and temporal decay, the system dynamically highlights relevant sub-graphs while filtering interference. We implement a Triple Hybrid Retrieval strategy that fuses geometric embeddings with activation-based graph traversal. Comprehensive evaluations on the LoCoMo benchmark show that Synapse significantly outperforms state-of-the-art methods in complex temporal and multi-hop reasoning tasks, offering a robust solution to the "Contextual Tunneling" problem. Our code and data will be made publicly available upon acceptance.
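Spreading activation can be pictured with a small sketch: energy starts on the nodes matched by a query and flows along weighted edges, shrinking at each hop. The function below is a generic textbook version under assumed names (`graph`, `seeds`, `decay`, `steps`); it omits the lateral inhibition and temporal decay mentioned in the abstract and is not Synapse's actual algorithm.

```python
from typing import Dict


def spread_activation(graph: Dict[str, Dict[str, float]],
                      seeds: Dict[str, float],
                      decay: float = 0.5,
                      steps: int = 2) -> Dict[str, float]:
    """Propagate activation from seed nodes along weighted edges.

    graph maps node -> {neighbor: edge weight}; seeds maps node -> initial energy.
    Each hop multiplies the passed-on energy by the edge weight and by decay.
    """
    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(steps):
        next_frontier: Dict[str, float] = {}
        for node, energy in frontier.items():
            for neighbor, weight in graph.get(node, {}).items():
                passed = energy * weight * decay
                next_frontier[neighbor] = next_frontier.get(neighbor, 0.0) + passed
        # Accumulate this hop's energy into the running totals.
        for node, energy in next_frontier.items():
            activation[node] = activation.get(node, 0.0) + energy
        frontier = next_frontier
    return activation
```

Nodes with the highest final activation would then be the retrieval candidates, so relevance depends on graph structure rather than on vector similarity alone.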

Title: Window-based Membership Inference Attacks Against Fine-tuned Large Language Models

Authors: Yuetian Chen, Yuntao Du, Kaiyuan Zhang, Ashish Kundu, Charles Fleming, Bruno Ribeiro, Ninghui Li
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2601.02751
Pdf URL: https://arxiv.org/pdf/2601.02751
Copy Paste: [[2601.02751]] Window-based Membership Inference Attacks Against Fine-tuned Large Language Models(https://arxiv.org/abs/2601.02751)
Keywords: language model, llm
Abstract: Most membership inference attacks (MIAs) against Large Language Models (LLMs) rely on global signals, like average loss, to identify training data. This approach, however, dilutes the subtle, localized signals of memorization, reducing attack effectiveness. We challenge this global-averaging paradigm, positing that membership signals are more pronounced within localized contexts. We introduce WBC (Window-Based Comparison), which exploits this insight through a sliding window approach with sign-based aggregation. Our method slides windows of varying sizes across text sequences, with each window casting a binary vote on membership based on loss comparisons between target and reference models. By ensembling votes across geometrically spaced window sizes, we capture memorization patterns from token-level artifacts to phrase-level structures. Extensive experiments across eleven datasets demonstrate that WBC substantially outperforms established baselines, achieving higher AUC scores and 2-3 times improvements in detection rates at low false positive thresholds. Our findings reveal that aggregating localized evidence is fundamentally more effective than global averaging, exposing critical privacy vulnerabilities in fine-tuned LLMs.
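The voting idea can be sketched as follows, assuming per-token losses from a target model and a reference model are already available. Each window casts one binary vote (is the target's summed loss lower than the reference's?), and votes are pooled over geometrically spaced window sizes. The function name, the default sizes, and the use of the vote fraction as the final score are assumptions for illustration; the paper's exact aggregation may differ.

```python
from typing import Sequence


def window_vote_score(target_losses: Sequence[float],
                      reference_losses: Sequence[float],
                      window_sizes: Sequence[int] = (1, 2, 4, 8)) -> float:
    """Fraction of sliding windows where the target model has lower total loss."""
    if len(target_losses) != len(reference_losses):
        raise ValueError("loss sequences must have the same length")
    # Positive difference means the target model fits this token better.
    diffs = [r - t for t, r in zip(target_losses, reference_losses)]
    votes = 0
    total = 0
    for size in window_sizes:
        for start in range(len(diffs) - size + 1):
            # Sign-based vote: only the direction of the gap counts, not its size.
            if sum(diffs[start:start + size]) > 0:
                votes += 1
            total += 1
    return votes / total if total else 0.0
```

Because each window contributes only a sign, a single token with an extreme loss cannot swamp the score the way it can in a global average.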

Title: EComStage: Stage-wise and Orientation-specific Benchmarking for Large Language Models in E-commerce

Authors: Kaiyan Zhao, Zijie Meng, Zheyong Xie, Jin Duan, Yao Hu, Zuozhu Liu, Shaosheng Cao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.02752
Pdf URL: https://arxiv.org/pdf/2601.02752
Copy Paste: [[2601.02752]] EComStage: Stage-wise and Orientation-specific Benchmarking for Large Language Models in E-commerce(https://arxiv.org/abs/2601.02752)
Keywords: language model, llm, agent
Abstract: Large Language Model (LLM)-based agents are increasingly deployed in e-commerce applications to assist customer services in tasks such as product inquiries, recommendations, and order management. Existing benchmarks primarily evaluate whether these agents successfully complete the final task, overlooking the intermediate reasoning stages that are crucial for effective decision-making. To address this gap, we propose EComStage, a unified benchmark for evaluating agent-capable LLMs across the comprehensive stage-wise reasoning process: Perception (understanding user intent), Planning (formulating an action plan), and Action (executing the decision). EComStage evaluates LLMs through seven separate representative tasks spanning diverse e-commerce scenarios, with all samples human-annotated and quality-checked. Unlike prior benchmarks that focus only on customer-oriented interactions, EComStage also evaluates merchant-oriented scenarios, including promotion management, content review, and operational support relevant to real-world applications. We evaluate a wide range of over 30 LLMs, spanning from 1B to over 200B parameters, including open-source models and closed-source APIs, revealing stage/orientation-specific strengths and weaknesses. Our results provide fine-grained, actionable insights for designing and optimizing LLM-based agents in real-world e-commerce settings.

Title: MiMo-V2-Flash Technical Report

Authors: Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang, Chenhong He, Chiheng Lou, Fuli Luo, Gang Wang, Gang Xie, Hailin Zhang, Hanglong Lv, Hanyu Li, Heyu Chen, Hongshen Xu, Houbin Zhang, Huaqiu Liu, Jiangshan Duo, Jianyu Wei, Jiebao Xiao, Jinhao Dong, Jun Shi, Junhao Hu, Kainan Bao, Kang Zhou, Lei Li, Liang Zhao, Linghao Zhang, Peidian Li, Qianli Chen, Shaohui Liu, Shihua Yu, Shijie Cao, Shimao Chen, Shouqiu Yu, Shuo Liu, Tianling Zhou, Weijiang Su, Weikun Wang, Wenhan Ma, Xiangwei Deng, Bohan Mao, Bowen Ye, Can Cai, Chenghua Wang, Chengxuan Zhu, Chong Ma, Chun Chen, Chunan Li, Dawei Zhu, Deshan Xiao, Dong Zhang, Duo Zhang, Fangyue Liu, Feiyu Yang, Fengyuan Shi, Guoan Wang, Hao Tian, Hao Wu, Heng Qu, Hongfei Yi, Hongxu An, Hongyi Guan, Xing Zhang, Yifan Song, Yihan Yan, Yihao Zhao, Yingchun Lai, Yizhao Gao, Yu Cheng, Yuanyuan Tian, Yudong Wang, Zhen Tang, Zhengju Tang, Zhengtao Wen, Zhichao Song, Zhixian Zheng, Zihan Jiang, Jian Wen, Jiarui Sun, Jiawei Li, Jinlong Xue, Jun Xia, Kai Fang, Menghang Zhu, Nuo Chen, Qian Tu, Qihao Zhang, Qiying Wang, Rang Li, Rui Ma, Shaolei Zhang, Shengfan Wang, Shicheng Li, Shuhao Gu, Shuhuai Ren, Sirui Deng, Tao Guo, Tianyang Lu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.02780
Pdf URL: https://arxiv.org/pdf/2601.02780
Copy Paste: [[2601.02780]] MiMo-V2-Flash Technical Report(https://arxiv.org/abs/2601.02780)
Keywords: agent
Abstract: We present MiMo-V2-Flash, a Mixture-of-Experts (MoE) model with 309B total parameters and 15B active parameters, designed for fast, strong reasoning and agentic capabilities. MiMo-V2-Flash adopts a hybrid attention architecture that interleaves Sliding Window Attention (SWA) with global attention, with a 128-token sliding window under a 5:1 hybrid ratio. The model is pre-trained on 27 trillion tokens with Multi-Token Prediction (MTP), employing a native 32k context length and subsequently extended to 256k. To efficiently scale post-training compute, MiMo-V2-Flash introduces a novel Multi-Teacher On-Policy Distillation (MOPD) paradigm. In this framework, domain-specialized teachers (e.g., trained via large-scale reinforcement learning) provide dense and token-level reward, enabling the student model to perfectly master teacher expertise. MiMo-V2-Flash rivals top-tier open-weight models such as DeepSeek-V3.2 and Kimi-K2, despite using only 1/2 and 1/3 of their total parameters, respectively. During inference, by repurposing MTP as a draft model for speculative decoding, MiMo-V2-Flash achieves up to 3.6 acceptance length and 2.6x decoding speedup with three MTP layers. We open-source both the model weights and the three-layer MTP weights to foster open research and community collaboration.
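A rough picture of the hybrid attention layout, under stated assumptions: the 5:1 ratio is read here as five sliding-window layers followed by one global layer, and the 128-token window is taken to include the current token. Neither detail is specified in the abstract, and both function names are hypothetical.

```python
from typing import List


def layer_types(num_layers: int, swa_per_global: int = 5) -> List[str]:
    """Assumed layout: every (swa_per_global + 1)-th layer uses global attention."""
    period = swa_per_global + 1
    return ["global" if (i + 1) % period == 0 else "swa"
            for i in range(num_layers)]


def can_attend(query_pos: int, key_pos: int, layer_type: str,
               window: int = 128) -> bool:
    """Causal visibility rule for one query/key pair in a given layer type."""
    if key_pos > query_pos:
        return False  # causal: never look at future tokens
    if layer_type == "global":
        return True   # global layers see the whole prefix
    # Sliding-window layers see only the most recent `window` positions.
    return query_pos - key_pos < window
```

Under this reading, only one layer in six keeps a key-value cache that grows with the full context, which is the usual motivation for interleaving the two attention types.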

Title: Punctuation-aware Hybrid Trainable Sparse Attention for Large Language Models

Authors: Junxiang Qiu, Shuo Wang, Zhengsu Chen, Hengheng Zhang, Jinda Lu, Changcheng Li, Qi Tian
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.02819
Pdf URL: https://arxiv.org/pdf/2601.02819
Copy Paste: [[2601.02819]] Punctuation-aware Hybrid Trainable Sparse Attention for Large Language Models(https://arxiv.org/abs/2601.02819)
Keywords: language model, llm
Abstract: Attention serves as the fundamental mechanism for long-context modeling in large language models (LLMs), yet dense attention becomes structurally prohibitive for long sequences due to its quadratic complexity. Consequently, sparse attention has received increasing attention as a scalable alternative. However, existing sparse attention methods rely on coarse-grained semantic representations during block selection, which blur intra-block semantic boundaries and lead to the loss of critical information. To address this issue, we propose Punctuation-aware Hybrid Sparse Attention (PHSA), a natively trainable sparse attention framework that leverages punctuation tokens as semantic boundary anchors. Specifically, (1) we design a dual-branch aggregation mechanism that fuses global semantic representations with punctuation-enhanced boundary features, preserving the core semantic structure while introducing almost no additional computational overhead; and (2) we introduce an extreme-sparsity-adaptive training and inference strategy that stabilizes model behavior under very low token activation ratios. Extensive experiments on general benchmarks and long-context evaluations demonstrate that PHSA consistently outperforms dense attention and state-of-the-art sparse attention baselines, including InfLLM v2. Specifically, for the 0.6B-parameter model with 32k-token input sequences, PHSA can reduce the information loss by 10.8% at a sparsity ratio of 97.3%.

Title: The performances of the Chinese and U.S. Large Language Models on the Topic of Chinese Culture

Authors: Feiyan Liu, Chenxun Zhuo, Siyan Zhao, Bao Ge, Tianming Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.02830
Pdf URL: https://arxiv.org/pdf/2601.02830
Copy Paste: [[2601.02830]] The performances of the Chinese and U.S. Large Language Models on the Topic of Chinese Culture(https://arxiv.org/abs/2601.02830)
Keywords: language model, gpt, llm
Abstract: Cultural backgrounds shape individuals' perspectives and approaches to problem-solving. Since the emergence of GPT-1 in 2018, large language models (LLMs) have undergone rapid development. To date, the world's ten leading LLM developers are primarily based in China and the United States. To examine whether LLMs released by Chinese and U.S. developers exhibit cultural differences in Chinese-language settings, we evaluate their performance on questions about Chinese culture. This study adopts a direct-questioning paradigm to evaluate models such as GPT-5.1, DeepSeek-V3.2, Qwen3-Max, and Gemini 2.5 Pro. We assess their understanding of traditional Chinese culture, including history, literature, poetry, and related domains. Comparative analyses between LLMs developed in China and the U.S. indicate that Chinese models generally outperform their U.S. counterparts on these tasks. Among U.S.-developed models, Gemini 2.5 Pro and GPT-5.1 achieve relatively higher accuracy. The observed performance differences may arise from variations in training data distribution, localization strategies, and the degree of emphasis on Chinese cultural content during model development.

Title: TiMem: Temporal-Hierarchical Memory Consolidation for Long-Horizon Conversational Agents

Authors: Kai Li, Xuanqing Yu, Ziyi Ni, Yi Zeng, Yao Xu, Zheqing Zhang, Xin Li, Jitao Sang, Xiaogang Duan, Xuelei Wang, Chengbao Liu, Jie Tan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.02845
Pdf URL: https://arxiv.org/pdf/2601.02845
Copy Paste: [[2601.02845]] TiMem: Temporal-Hierarchical Memory Consolidation for Long-Horizon Conversational Agents(https://arxiv.org/abs/2601.02845)
Keywords: language model, llm, agent
Abstract: Long-horizon conversational agents have to manage ever-growing interaction histories that quickly exceed the finite context windows of large language models (LLMs). Existing memory frameworks provide limited support for temporally structured information across hierarchical levels, often leading to fragmented memories and unstable long-horizon personalization. We present TiMem, a temporal-hierarchical memory framework that organizes conversations through a Temporal Memory Tree (TMT), enabling systematic memory consolidation from raw conversational observations to progressively abstracted persona representations. TiMem is characterized by three core properties: (1) temporal-hierarchical organization through TMT; (2) semantic-guided consolidation that enables memory integration across hierarchical levels without fine-tuning; and (3) complexity-aware memory recall that balances precision and efficiency across queries of varying complexity. Under a consistent evaluation setup, TiMem achieves state-of-the-art accuracy on both benchmarks, reaching 75.30% on LoCoMo and 76.88% on LongMemEval-S. It outperforms all evaluated baselines while reducing the recalled memory length by 52.20% on LoCoMo. Manifold analysis indicates clear persona separation on LoCoMo and reduced dispersion on LongMemEval-S. Overall, TiMem treats temporal continuity as a first-class organizing principle for long-horizon memory in conversational agents.

Title: To Generate or Discriminate? Methodological Considerations for Measuring Cultural Alignment in LLMs

Authors: Saurabh Kumar Pandey, Sougata Saha, Monojit Choudhury
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.02858
Pdf URL: https://arxiv.org/pdf/2601.02858
Copy Paste: [[2601.02858]] To Generate or Discriminate? Methodological Considerations for Measuring Cultural Alignment in LLMs(https://arxiv.org/abs/2601.02858)
Keywords: language model, gpt, llm, prompt
Abstract: Socio-demographic prompting (SDP) - prompting Large Language Models (LLMs) using demographic proxies to generate culturally aligned outputs - often shows LLM responses as stereotypical and biased. While effective in assessing LLMs' cultural competency, SDP is prone to confounding factors such as prompt sensitivity, decoding parameters, and the inherent difficulty of generation over discrimination tasks due to larger output spaces. These factors complicate interpretation, making it difficult to determine if the poor performance is due to bias or the task design. To address this, we use inverse socio-demographic prompting (ISDP), where we prompt LLMs to discriminate and predict the demographic proxy from actual and simulated user behavior from different users. We use the Goodreads-CSI dataset (Saha et al., 2025), which captures difficulty in understanding English book reviews for users from India, Mexico, and the USA, and test four LLMs: Aya-23, Gemma-2, GPT-4o, and LLaMA-3.1 with ISDP. Results show that models perform better with actual behaviors than simulated ones, contrary to what SDP suggests. However, performance with both behavior types diminishes and becomes nearly equal at the individual level, indicating limits to personalization.

Title: Training Language Models with homotokens Leads to Delayed Overfitting

Authors: Adrian Cosma, Stefan Ruseti, Emilian Radoi, Mihai Dascalu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.02867
Pdf URL: https://arxiv.org/pdf/2601.02867
Copy Paste: [[2601.02867]] Training Language Models with homotokens Leads to Delayed Overfitting(https://arxiv.org/abs/2601.02867)
Keywords: language model
Abstract: Subword tokenization introduces a computational layer in language models where many distinct token sequences decode to the same surface form and preserve meaning, yet induce different internal computations. Despite this non-uniqueness, language models are typically trained using a single canonical longest-prefix tokenization. We formalize homotokens (alternative valid subword segmentations of the same lexical item) as a strictly meaning-preserving form of data augmentation. We introduce a lightweight training architecture that conditions canonical next-token prediction on sampled homotoken variants via an auxiliary causal encoder and block-causal cross-attention, without modifying the training objective or token interface. In data-constrained pretraining, homotoken augmentation consistently delays overfitting under repeated data exposure and improves generalization across diverse evaluation datasets. In multilingual fine-tuning, we find that the effectiveness of homotokens depends on tokenizer quality: gains are strongest when canonical tokens are highly compressed and diminish when the tokenizer already over-fragments the input. Overall, homotokens provide a simple and modular mechanism for inducing tokenization invariance in language models.
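The notion of alternative segmentations can be illustrated with a toy vocabulary. The sketch below enumerates every way to split a word into in-vocabulary pieces and, separately, computes a greedy longest-match split as a stand-in for the "canonical longest-prefix tokenization" the abstract mentions. The vocabulary, both function names, and the greedy reading of "canonical" are assumptions; how the paper samples variants is not described in the abstract.

```python
from typing import List, Set


def segmentations(word: str, vocab: Set[str]) -> List[List[str]]:
    """Enumerate every split of word into pieces that all appear in vocab."""
    if not word:
        return [[]]
    results: List[List[str]] = []
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in vocab:
            for rest in segmentations(word[end:], vocab):
                results.append([piece] + rest)
    return results


def longest_prefix(word: str, vocab: Set[str]) -> List[str]:
    """Greedy split that always takes the longest in-vocabulary prefix."""
    tokens: List[str] = []
    start = 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                tokens.append(word[start:end])
                start = end
                break
        else:
            raise ValueError(f"cannot tokenize {word[start:]!r} with this vocab")
    return tokens
```

Every segmentation other than the greedy one decodes to the same string, which is why it can serve as a meaning-preserving augmentation.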

Title: LongBench Pro: A More Realistic and Comprehensive Bilingual Long-Context Evaluation Benchmark

Authors: Ziyang Chen, Xing Wu, Junlong Jia, Chaochen Gao, Qi Fu, Debing Zhang, Songlin Hu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.02872
Pdf URL: https://arxiv.org/pdf/2601.02872
Copy Paste: [[2601.02872]] LongBench Pro: A More Realistic and Comprehensive Bilingual Long-Context Evaluation Benchmark(https://arxiv.org/abs/2601.02872)
Keywords: language model, llm
Abstract: The rapid expansion of context length in large language models (LLMs) has outpaced existing evaluation benchmarks. Current long-context benchmarks often trade off scalability and realism: synthetic tasks underrepresent real-world complexity, while fully manual annotation is costly to scale to extreme lengths and diverse scenarios. We present LongBench Pro, a more realistic and comprehensive bilingual benchmark of 1,500 naturally occurring long-context samples in English and Chinese spanning 11 primary tasks and 25 secondary tasks, with input lengths from 8k to 256k tokens. LongBench Pro supports fine-grained analysis with task-specific metrics and a multi-dimensional taxonomy of context requirement (full vs. partial dependency), length (six levels), and difficulty (four levels calibrated by model performance). To balance quality with scalability, we propose a Human-Model Collaborative Construction pipeline: frontier LLMs draft challenging questions and reference answers, along with design rationales and solution processes, to reduce the cost of expert verification. Experts then rigorously validate correctness and refine problematic cases. Evaluating 46 widely used long-context LLMs on LongBench Pro yields three findings: (1) long-context optimization contributes more to long-context comprehension than parameter scaling; (2) effective context length is typically shorter than the claimed context length, with pronounced cross-lingual misalignment; and (3) the "thinking" paradigm helps primarily models trained with native reasoning, while mixed-thinking designs offer a promising Pareto trade-off. In summary, LongBench Pro provides a robust testbed for advancing long-context understanding.

Title: Revisiting Data Compression with Language Modeling

Authors: Chen-Han Tsai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.02875
Pdf URL: https://arxiv.org/pdf/2601.02875
Copy Paste: [[2601.02875]] Revisiting Data Compression with Language Modeling(https://arxiv.org/abs/2601.02875)
Keywords: language model, llm
Abstract: In this report, we investigate the potential use of large language models (LLMs) in the task of data compression. Previous works have demonstrated promising results in applying LLMs to compressing not only text, but also a wide range of multi-modal data. Despite the favorable performance achieved, there still remain several practical questions that pose a challenge to replacing existing data compression algorithms with LLMs. In this work, we explore different methods to achieve a lower adjusted compression rate using LLMs as data compressors. In comparison to previous works, we were able to achieve a new state-of-the-art (SOTA) adjusted compression rate of around 18% on the enwik9 dataset without additional model training. Furthermore, we explore the use of LLMs in compressing non-English data, code data, and byte stream sequences. We show that while LLMs excel in compressing data in text-dominant domains, their ability to compress non-natural text sequences still remains competitive if configured in the right way.
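The abstract does not define the adjusted compression rate. The sketch below assumes the convention used in earlier LLM-compression work, where the size of the model itself is charged to the compressed output, and it uses the standard fact that an arithmetic coder driven by a model spends about -log2(p) bits on a symbol the model assigned probability p. The function names and all numbers are illustrative, not taken from the report.

```python
import math

def ideal_code_length_bits(token_probs):
    """Approximate arithmetic-coding cost: sum of -log2(p) over the
    probabilities the model assigned to the tokens that actually occurred."""
    return sum(-math.log2(p) for p in token_probs)

def compression_rates(raw_bytes, compressed_bytes, model_bytes):
    """Return (raw rate, adjusted rate).

    Assumption: adjusted rate = (compressed size + model size) / raw size,
    so a large model is penalized for its own footprint.
    """
    raw_rate = compressed_bytes / raw_bytes
    adjusted_rate = (compressed_bytes + model_bytes) / raw_bytes
    return raw_rate, adjusted_rate

# Toy numbers only: four tokens predicted with decreasing confidence.
bits = ideal_code_length_bits([0.5, 0.25, 0.125, 0.125])
raw_rate, adjusted_rate = compression_rates(
    raw_bytes=1000, compressed_bytes=100, model_bytes=80
)
print(bits, raw_rate, adjusted_rate)
```

Under this convention a lower adjusted rate can come from better predictions, from a smaller model, or from both, which is why the two are traded off against each other.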

Title: Beyond the Black Box: Theory and Mechanism of Large Language Models

Authors: Zeyu Gan, Ruifeng Ren, Wei Yao, Xiaolin Hu, Gengze Xu, Chen Qian, Huayi Tang, Zixuan Gong, Xinhao Yao, Pengwei Tang, Zhenxing Dou, Yong Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.02907
Pdf URL: https://arxiv.org/pdf/2601.02907
Copy Paste: [[2601.02907]] Beyond the Black Box: Theory and Mechanism of Large Language Models(https://arxiv.org/abs/2601.02907)
Keywords: language model, llm
Abstract: The rapid emergence of Large Language Models (LLMs) has precipitated a profound paradigm shift in Artificial Intelligence, delivering monumental engineering successes that increasingly impact modern society. However, a critical paradox persists within the current field: despite the empirical efficacy, our theoretical understanding of LLMs remains disproportionately nascent, forcing these systems to be treated largely as "black boxes". To address this theoretical fragmentation, this survey proposes a unified lifecycle-based taxonomy that organizes the research landscape into six distinct stages: Data Preparation, Model Preparation, Training, Alignment, Inference, and Evaluation. Within this framework, we provide a systematic review of the foundational theories and internal mechanisms driving LLM performance. Specifically, we analyze core theoretical issues such as the mathematical justification for data mixtures, the representational limits of various architectures, and the optimization dynamics of alignment algorithms. Moving beyond current best practices, we identify critical frontier challenges, including the theoretical limits of synthetic data self-improvement, the mathematical bounds of safety guarantees, and the mechanistic origins of emergent intelligence. By connecting empirical observations with rigorous scientific inquiry, this work provides a structured roadmap for transitioning LLM development from engineering heuristics toward a principled scientific discipline.

Title: Image, Word and Thought: A More Challenging Language Task for the Iterated Learning Model

Authors: Hyoyeon Lee, Seth Bullock, Conor Houghton
Subjects: cs.CL, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2601.02911
Pdf URL: https://arxiv.org/pdf/2601.02911
Copy Paste: [[2601.02911]] Image, Word and Thought: A More Challenging Language Task for the Iterated Learning Model(https://arxiv.org/abs/2601.02911)
Keywords: agent
Abstract: The iterated learning model simulates the transmission of language from generation to generation in order to explore how the constraints imposed by language transmission facilitate the emergence of language structure. Despite each modelled language learner starting from a blank slate, the presence of a bottleneck limiting the number of utterances to which the learner is exposed can lead to the emergence of language that lacks ambiguity, is governed by grammatical rules, and is consistent over successive generations, that is, one that is expressive, compositional and stable. The recent introduction of a more computationally tractable and ecologically valid semi-supervised iterated learning model, combining supervised and unsupervised learning within an autoencoder architecture, has enabled exploration of language transmission dynamics for much larger meaning-signal spaces. Here, for the first time, the model has been successfully applied to a language learning task involving the communication of much more complex meanings: seven-segment display images. Agents in this model are able to learn and transmit a language that is expressive (distinct codes are employed for all 128 glyphs), compositional (signal components consistently map to meaning components), and stable (the language does not change from generation to generation).
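The figure of 128 glyphs matches the number of on/off patterns of seven segments (2^7). The sketch below assumes each glyph is exactly one such pattern, which the abstract does not state explicitly, and shows what "expressive" means as a check: every meaning gets its own signal. The toy language and the helper name `is_expressive` are illustrative, not the learned language from the paper.

```python
from itertools import product

# Assumption: a glyph is one on/off setting of the seven segments.
glyphs = list(product((0, 1), repeat=7))

def is_expressive(language):
    """A language (meaning -> signal mapping) is expressive when no two
    meanings share a signal."""
    return len(set(language.values())) == len(language)

# A trivially compositional toy language: one signal character per segment.
toy_language = {g: "".join("ab"[bit] for bit in g) for g in glyphs}

print(len(glyphs), is_expressive(toy_language))
```

In the toy language each signal position depends on exactly one segment, which is the simplest case of signal components mapping consistently to meaning components.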

Title: RAL2M: Retrieval Augmented Learning-To-Match Against Hallucination in Compliance-Guaranteed Service Systems

Authors: Mengze Hong, Di Jiang, Jiangtao Wen, Zhiyang Su, Yawen Li, Yanjie Sun, Guan Wang, Chen Jason Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.02917
Pdf URL: https://arxiv.org/pdf/2601.02917
Copy Paste: [[2601.02917]] RAL2M: Retrieval Augmented Learning-To-Match Against Hallucination in Compliance-Guaranteed Service Systems(https://arxiv.org/abs/2601.02917)
Keywords: llm, hallucination
Abstract: Hallucination is a major concern in LLM-driven service systems, necessitating explicit knowledge grounding for compliance-guaranteed responses. In this paper, we introduce Retrieval-Augmented Learning-to-Match (RAL2M), a novel framework that eliminates generation hallucination by repositioning LLMs as query-response matching judges within a retrieval-based system, providing a robust alternative to purely generative approaches. To further mitigate judgment hallucination, we propose a query-adaptive latent ensemble strategy that explicitly models heterogeneous model competence and interdependencies among LLMs, deriving a calibrated consensus decision. Extensive experiments on large-scale benchmarks demonstrate that the proposed method effectively leverages the "wisdom of the crowd" and significantly outperforms strong baselines. Finally, we discuss best practices and promising directions for further exploiting latent representations in future work.

Title: Memorization, Emergence, and Explaining Reversal Failures: A Controlled Study of Relational Semantics in LLMs

Authors: Yihua Zhu, Qianying Liu, Jiaxin Wang, Fei Cheng, Chaoran Liu, Akiko Aizawa, Sadao Kurohashi, Hidetoshi Shimodaira
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.02931
Pdf URL: https://arxiv.org/pdf/2601.02931
Copy Paste: [[2601.02931]] Memorization, Emergence, and Explaining Reversal Failures: A Controlled Study of Relational Semantics in LLMs(https://arxiv.org/abs/2601.02931)
Keywords: gpt, llm
Abstract: Autoregressive LLMs perform well on relational tasks that require linking entities via relational words (e.g., father/son, friend), but it is unclear whether they learn the logical semantics of such relations (e.g., symmetry and inversion logic) and, if so, whether reversal-type failures arise from missing relational semantics or left-to-right order bias. We propose a controlled Knowledge Graph-based synthetic framework that generates text from symmetric/inverse triples, train GPT-style autoregressive models from scratch, and evaluate memorization, logical inference, and in-context generalization to unseen entities to address these questions. We find a sharp phase transition in which relational semantics emerge with sufficient logic-bearing supervision, even in shallow (2-3 layer) models, and that successful generalization aligns with stable intermediate-layer signals. Finally, order-matched forward/reverse tests and a diffusion baseline indicate that reversal failures are primarily driven by autoregressive order bias rather than deficient inversion semantics.

Title: Enhancing Multilingual RAG Systems with Debiased Language Preference-Guided Query Fusion

Authors: Jeonghyun Park, Byeongjeong Kim, Seojin Hwang, Hwanhee Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.02956
Pdf URL: https://arxiv.org/pdf/2601.02956
Copy Paste: [[2601.02956]] Enhancing Multilingual RAG Systems with Debiased Language Preference-Guided Query Fusion(https://arxiv.org/abs/2601.02956)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Multilingual Retrieval-Augmented Generation (mRAG) systems often exhibit a perceived preference for high-resource languages, particularly English, resulting in the widespread adoption of English pivoting. While prior studies attribute this advantage to the superior English-centric capabilities of Large Language Models (LLMs), we find that such measurements are significantly distorted by structural priors inherent in evaluation benchmarks. Specifically, we identify exposure bias and a gold availability prior (both driven by the disproportionate concentration of resources in English), as well as cultural priors rooted in topic locality, as factors that hinder accurate assessment of genuine language preference. To address these biases, we propose DeLP (Debiased Language Preference), a calibrated metric designed to explicitly factor out these structural confounds. Our analysis using DeLP reveals that the previously reported English preference is largely a byproduct of evidence distribution rather than an inherent model bias. Instead, we find that retrievers fundamentally favor monolingual alignment between the query and the document language. Building on this insight, we introduce DELTA (DEbiased Language preference-guided Text Augmentation), a lightweight and efficient mRAG framework that strategically leverages monolingual alignment to optimize cross-lingual retrieval and generation. Experimental results demonstrate that DELTA consistently outperforms English pivoting and mRAG baselines across diverse languages.

Title: LLM-Augmented Changepoint Detection: A Framework for Ensemble Detection and Automated Explanation

Authors: Fabian Lukassen, Christoph Weisser, Michael Schlee, Manish Kumar, Anton Thielmann, Benjamin Saefken, Thomas Kneib
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.02957
Pdf URL: https://arxiv.org/pdf/2601.02957
Copy Paste: [[2601.02957]] LLM-Augmented Changepoint Detection: A Framework for Ensemble Detection and Automated Explanation(https://arxiv.org/abs/2601.02957)
Keywords: language model, llm, retrieval-augmented generation
Abstract: This paper introduces a novel changepoint detection framework that combines ensemble statistical methods with Large Language Models (LLMs) to enhance both detection accuracy and the interpretability of regime changes in time series data. Two critical limitations in the field are addressed. First, individual detection methods exhibit complementary strengths and weaknesses depending on data characteristics, making method selection non-trivial and prone to suboptimal results. Second, automated, contextual explanations for detected changes are largely absent. The proposed ensemble method aggregates results from ten distinct changepoint detection algorithms, achieving superior performance and robustness compared to individual methods. Additionally, an LLM-powered explanation pipeline automatically generates contextual narratives, linking detected changepoints to potential real-world historical events. For private or domain-specific data, a Retrieval-Augmented Generation (RAG) solution enables explanations grounded in user-provided documents. The open source Python framework demonstrates practical utility in diverse domains, including finance, political science, and environmental science, transforming raw statistical output into actionable insights for analysts and decision-makers.
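The abstract says the ensemble aggregates ten detectors but does not say how. One simple possibility, shown here purely as an assumption, is proximity voting: changepoints reported by different detectors within a small tolerance are grouped, and a group is kept only if enough distinct detectors contributed to it. The function name, `tolerance`, and `min_votes` are hypothetical and are not the framework's API.

```python
def ensemble_changepoints(detections, tolerance=2, min_votes=2):
    """Aggregate per-detector changepoint indices by proximity voting.

    detections: list of lists, one list of indices per detector.
    Returns one representative index (the middle element of the group)
    for every group supported by at least `min_votes` distinct detectors.
    """
    tagged = sorted((cp, d) for d, cps in enumerate(detections) for cp in cps)
    clusters = []
    for cp, d in tagged:
        # Extend the current group while consecutive points stay close.
        if clusters and cp - clusters[-1][-1][0] <= tolerance:
            clusters[-1].append((cp, d))
        else:
            clusters.append([(cp, d)])

    result = []
    for cluster in clusters:
        voters = {d for _, d in cluster}
        if len(voters) >= min_votes:
            points = sorted(cp for cp, _ in cluster)
            result.append(points[len(points) // 2])
    return result

# Three toy detectors: two agree near 50 and 120, one reports a lone 200.
print(ensemble_changepoints([[50, 120], [51, 200], [49, 121]]))
```

A voting rule like this suppresses changepoints that only one method reports, at the cost of missing changes that only one method is able to see.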

Title: Reliability-Aware Adaptive Self-Consistency for Efficient Sampling in LLM Reasoning

Authors: Junseok Kim, Nakyeong Yang, Kyungmin Min, Kyomin Jung
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.02970
Pdf URL: https://arxiv.org/pdf/2601.02970
Copy Paste: [[2601.02970]] Reliability-Aware Adaptive Self-Consistency for Efficient Sampling in LLM Reasoning(https://arxiv.org/abs/2601.02970)
Keywords: llm
Abstract: Self-Consistency improves reasoning reliability through multi-sample aggregation, but incurs substantial inference cost. Adaptive self-consistency methods mitigate this issue by adjusting the sampling budget; however, they rely on count-based stopping rules that treat all responses equally, often leading to unnecessary sampling. We propose Reliability-Aware Adaptive Self-Consistency (ReASC), which addresses this limitation by reframing adaptive sampling from response counting to evidence sufficiency, leveraging response-level confidence for principled information aggregation. ReASC operates in two stages: a single-sample decision stage that resolves instances confidently answerable from a single response, and a reliability-aware accumulation stage that aggregates responses by jointly leveraging their frequency and confidence. Across five models and four datasets, ReASC consistently achieves the best accuracy-cost trade-off compared to existing baselines, yielding improved inference efficiency across model scales from 3B to 27B parameters. As a concrete example, ReASC reduces inference cost by up to 70% relative to self-consistency while preserving accuracy on GSM8K using Gemma-3-4B-it.
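The two stages described above can be sketched as follows. This is a minimal illustration, not the ReASC algorithm: the stopping rule (lead of the best answer over the runner-up in summed confidence), the thresholds, and the `sample_fn` interface returning an `(answer, confidence)` pair are all assumptions made for the example.

```python
from collections import defaultdict

def adaptive_self_consistency(sample_fn, single_conf=0.9, margin=1.5,
                              max_samples=10):
    """Return (answer, samples_used).

    Stage 1: accept the first response if its confidence is high enough.
    Stage 2: keep sampling, summing confidence per answer, until the leading
    answer is ahead of the runner-up by `margin` or the budget is spent.
    """
    answer, conf = sample_fn()
    if conf >= single_conf:
        return answer, 1

    scores = defaultdict(float)
    scores[answer] += conf
    used = 1
    while used < max_samples:
        ranked = sorted(scores.values(), reverse=True)
        runner_up = ranked[1] if len(ranked) > 1 else 0.0
        if ranked[0] - runner_up >= margin:
            break
        answer, conf = sample_fn()
        scores[answer] += conf
        used += 1
    return max(scores, key=scores.get), used

# Scripted responses stand in for model calls.
responses = iter([("7", 0.6), ("42", 0.8), ("42", 0.9), ("42", 0.7), ("42", 0.9)])
print(adaptive_self_consistency(lambda: next(responses)))
```

Because the votes are weighted by confidence, a few confident and consistent responses can end sampling sooner than a pure count-based rule would.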

Title: Correct, Concise and Complete: Multi-stage Training For Adaptive Reasoning

Authors: Nathanaël Carraz Rakotonirina, Ren Pang, Neha Anna John, Michael Bohlke-Schneider, Momchil Hardalov
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.02972
Pdf URL: https://arxiv.org/pdf/2601.02972
Copy Paste: [[2601.02972]] Correct, Concise and Complete: Multi-stage Training For Adaptive Reasoning(https://arxiv.org/abs/2601.02972)
Keywords: language model, llm, chain-of-thought
Abstract: The reasoning capabilities of large language models (LLMs) have improved substantially through increased test-time computation, typically in the form of intermediate tokens known as chain-of-thought (CoT). However, CoT often becomes unnecessarily long, increasing computation cost without actual accuracy gains or sometimes even degrading performance, a phenomenon known as "overthinking". We propose a multi-stage efficient reasoning method that combines supervised fine-tuning (via rejection sampling or reasoning trace reformatting) with reinforcement learning using an adaptive length penalty. We introduce a lightweight reward function that penalizes tokens generated after the first correct answer while encouraging self-verification only when beneficial. We conduct a holistic evaluation across seven diverse reasoning tasks, analyzing the trade-off between accuracy and response length. Our approach reduces response length by an average of 28% for 8B models and 40% for 32B models, while incurring only minor performance drops of 1.6 and 2.5 points, respectively. Despite its conceptual simplicity, it achieves a superior trade-off compared to more complex state-of-the-art efficient reasoning methods, scoring 76.6 in terms of the area under the Overthinking-Adjusted Accuracy curve ($\text{AUC}_{\text{OAA}}$), 5 points above the base model and 2.5 points above the second-best approach.
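The idea of penalizing tokens generated after the first correct answer can be illustrated with a toy reward. The functional form, the `penalty` weight, and the argument names below are assumptions for illustration; the abstract does not give the paper's formula, and the self-verification component is not modeled here.

```python
def length_aware_reward(num_tokens, first_correct_idx, final_correct,
                        penalty=0.5):
    """Toy reward: 1.0 for a correct response, reduced in proportion to the
    fraction of tokens produced after the first correct answer appeared.

    first_correct_idx: position of the token that completes the first correct
    answer, or None if the correct answer never appears.
    """
    if not final_correct or first_correct_idx is None:
        return 0.0
    trailing = num_tokens - 1 - first_correct_idx
    return 1.0 - penalty * trailing / num_tokens

# Same 100-token budget, answer reached at the end versus halfway through.
print(length_aware_reward(100, 99, True))
print(length_aware_reward(100, 49, True))
```

A reward shaped this way leaves the reasoning that leads up to the answer unpenalized and only discourages continuing after the answer is already available.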

Title: Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders

Authors: Ruikang Zhang, Shuo Wang, Qi Su
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.02978
Pdf URL: https://arxiv.org/pdf/2601.02978
Copy Paste: [[2601.02978]] Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders(https://arxiv.org/abs/2601.02978)
Keywords: language model, llm
Abstract: Recent work in Mechanistic Interpretability (MI) has enabled the identification of, and intervention on, internal features in Large Language Models (LLMs). However, a persistent challenge lies in linking such internal features to the reliable control of complex, behavior-level semantic attributes in language generation. In this paper, we propose a Sparse Autoencoder-based framework for retrieving and steering semantically interpretable internal features associated with high-level linguistic behaviors. Our method employs a contrastive feature retrieval pipeline based on controlled semantic oppositions, combining statistical activation analysis and generation-based validation to distill monosemantic functional features from sparse activation spaces. Using the Big Five personality traits as a case study, we demonstrate that our method enables precise, bidirectional steering of model behavior while maintaining superior stability and performance compared to existing activation steering methods like Contrastive Activation Addition (CAA). We further identify an empirical effect, which we term Functional Faithfulness, whereby intervening on a specific internal feature induces coherent and predictable shifts across multiple linguistic dimensions aligned with the target semantic attribute. Our findings suggest that LLMs internalize deeply integrated representations of high-order concepts, and provide a novel, robust mechanistic path for the regulation of complex AI behaviors.

Title: Mechanistic Interpretability of Large-Scale Counting in LLMs through a System-2 Strategy

Authors: Hosein Hasani, Mohammadali Banayeeanzade, Ali Nafisi, Sadegh Mohammadian, Fatemeh Askari, Mobin Bagherian, Amirmohammad Izadi, Mahdieh Soleymani Baghshah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.02989
Pdf URL: https://arxiv.org/pdf/2601.02989
Copy Paste: [[2601.02989]] Mechanistic Interpretability of Large-Scale Counting in LLMs through a System-2 Strategy(https://arxiv.org/abs/2601.02989)
Keywords: language model, llm
Abstract: Large language models (LLMs), despite strong performance on complex mathematical problems, exhibit systematic limitations in counting tasks. This issue arises from architectural limits of transformers, where counting is performed across layers, leading to degraded precision for larger counting problems due to depth constraints. To address this limitation, we propose a simple test-time strategy inspired by System-2 cognitive processes that decomposes large counting tasks into smaller, independent sub-problems that the model can reliably solve. We evaluate this approach using observational and causal mediation analyses to understand the underlying mechanism of this System-2-like strategy. Our mechanistic analysis identifies key components: latent counts are computed and stored in the final item representations of each part, transferred to intermediate steps via dedicated attention heads, and aggregated in the final stage to produce the total count. Experimental results demonstrate that this strategy enables LLMs to surpass architectural limitations and achieve high accuracy on large-scale counting tasks. This work provides mechanistic insight into System-2 counting in LLMs and presents a generalizable approach for improving and understanding their reasoning behavior.
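The decomposition strategy can be sketched at the level of control flow: split the input into parts small enough to count reliably, count each part independently, then sum. In the sketch, `count_fn` stands in for a model call on one part; an exact Python counter is used in its place, and the chunk size is an arbitrary illustrative choice rather than a value from the paper.

```python
def chunked_count(items, target, chunk_size, count_fn):
    """Count `target` by splitting `items` into independent chunks.

    Returns (per-chunk counts, total). Each chunk is counted in isolation,
    so no single call has to track a large running count.
    """
    partial = [
        count_fn(items[i:i + chunk_size], target)
        for i in range(0, len(items), chunk_size)
    ]
    return partial, sum(partial)

def exact_count(chunk, target):
    """Stand-in for the model's small-scale counting ability."""
    return sum(1 for x in chunk if x == target)

items = ["apple", "pear"] * 25 + ["apple"] * 3
print(chunked_count(items, "apple", chunk_size=10, count_fn=exact_count))
```

The final aggregation is a plain sum here, which corresponds to the last stage the abstract describes, where the intermediate counts are combined into the total.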

Title: Stable-RAG: Mitigating Retrieval-Permutation-Induced Hallucinations in Retrieval-Augmented Generation

Authors: Qianchi Zhang, Hainan Zhang, Liang Pang, Hongwei Zheng, Zhiming Zheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.02993
Pdf URL: https://arxiv.org/pdf/2601.02993
Copy Paste: [[2601.02993]] Stable-RAG: Mitigating Retrieval-Permutation-Induced Hallucinations in Retrieval-Augmented Generation(https://arxiv.org/abs/2601.02993)
Keywords: language model, llm, long context, hallucination, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) has become a key paradigm for reducing factual hallucinations in large language models (LLMs), yet little is known about how the order of retrieved documents affects model behavior. We empirically show that under Top-5 retrieval with the gold document included, LLM answers vary substantially across permutations of the retrieved set, even when the gold document is fixed in the first position. This reveals a previously underexplored sensitivity to retrieval permutations. Although robust RAG methods primarily focus on enhancing LLM robustness to low-quality retrieval and mitigating positional bias to distribute attention fairly over long contexts, neither approach directly addresses permutation sensitivity. In this paper, we propose Stable-RAG, which exploits permutation sensitivity estimation to mitigate permutation-induced hallucinations. Stable-RAG runs the generator under multiple retrieval orders, clusters hidden states, and decodes from a cluster-center representation that captures the dominant reasoning pattern. It then uses these reasoning results to align hallucinated outputs toward the correct answer, encouraging the model to produce consistent and accurate predictions across document permutations. Experiments on three QA datasets show that Stable-RAG significantly improves answer accuracy, reasoning consistency and robust generalization across datasets, retrievers, and input lengths compared with baselines.
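The permutation sensitivity the abstract reports can be probed with a small harness: keep the gold document first, permute the remaining retrieved documents, and measure how often the generator returns its most common answer. This is only a measurement sketch and does not implement Stable-RAG's hidden-state clustering or alignment. The `generate(question, docs)` callable and the toy generator are hypothetical.

```python
from collections import Counter
from itertools import permutations

def permutation_agreement(question, gold_doc, other_docs, generate):
    """Run `generate` on every ordering of `other_docs` (gold document fixed
    in first position) and return (most common answer, its share of runs)."""
    answers = [
        generate(question, [gold_doc, *perm])
        for perm in permutations(other_docs)
    ]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return top_answer, top_count / len(answers)

def toy_generate(question, docs):
    """Toy generator that is misled whenever the distractor comes last."""
    return "Lyon" if docs[-1] == "distractor" else "Paris"

# Top-5 setting: one gold document plus four others gives 24 orderings.
print(permutation_agreement("capital of France?", "gold",
                            ["a", "b", "c", "distractor"], toy_generate))
```

An agreement share below 1.0 with the gold document held fixed is exactly the kind of order dependence the paper sets out to reduce.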

Title: Large Reasoning Models Are (Not Yet) Multilingual Latent Reasoners

Authors: Yihong Liu, Raoyuan Zhao, Hinrich Schütze, Michael A. Hedderich
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.02996
Pdf URL: https://arxiv.org/pdf/2601.02996
Copy Paste: [[2601.02996]] Large Reasoning Models Are (Not Yet) Multilingual Latent Reasoners(https://arxiv.org/abs/2601.02996)
Keywords: chain-of-thought
Abstract: Large reasoning models (LRMs) achieve strong performance on mathematical reasoning tasks, often attributed to their capability to generate explicit chain-of-thought (CoT) explanations. However, recent work shows that LRMs often arrive at the correct answer before completing these textual reasoning steps, indicating the presence of latent reasoning: internal, non-verbal computation encoded in hidden states. While this phenomenon has been explored in English, its multilingual behavior remains largely unknown. In this paper, we conduct a systematic investigation of multilingual latent reasoning in LRMs across 11 languages. Using a truncation-based strategy, we examine how the correct answer emerges as the model is given only partial reasoning traces, allowing us to measure stepwise latent prediction formation. Our results reveal clear evidence of multilingual latent reasoning, though unevenly: strong in resource-rich languages, weaker in low-resource ones, and broadly less observable on harder benchmarks. To understand whether these differences reflect distinct internal mechanisms, we further perform representational analyses. Despite surface-level disparities, we find that the internal evolution of predictions is highly consistent across languages and broadly aligns with English, a pattern suggesting an English-centered latent reasoning pathway.

Title: SentGraph: Hierarchical Sentence Graph for Multi-hop Retrieval-Augmented Question Answering

Authors: Junli Liang, Pengfei Zhou, Wangqiu Zhou, Wenjie Qing, Qi Zhao, Ziwen Wang, Qi Song, Xiangyang Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.03014
Pdf URL: https://arxiv.org/pdf/2601.03014
Copy Paste: [[2601.03014]] SentGraph: Hierarchical Sentence Graph for Multi-hop Retrieval-Augmented Question Answering(https://arxiv.org/abs/2601.03014)
Keywords: language model, retrieval-augmented generation
Abstract: Traditional Retrieval-Augmented Generation (RAG) effectively supports single-hop question answering with large language models but faces significant limitations in multi-hop question answering tasks, which require combining evidence from multiple documents. Existing chunk-based retrieval often provides irrelevant and logically incoherent context, leading to incomplete evidence chains and incorrect reasoning during answer generation. To address these challenges, we propose SentGraph, a sentence-level graph-based RAG framework that explicitly models fine-grained logical relationships between sentences for multi-hop question answering. Specifically, we construct a hierarchical sentence graph offline by first adapting Rhetorical Structure Theory to distinguish nucleus and satellite sentences, and then organizing them into topic-level subgraphs with cross-document entity bridges. During online retrieval, SentGraph performs graph-guided evidence selection and path expansion to retrieve fine-grained sentence-level evidence. Extensive experiments on four multi-hop question answering benchmarks demonstrate the effectiveness of SentGraph, validating the importance of explicitly modeling sentence-level logical dependencies for multi-hop reasoning.

Title: MMFormalizer: Multimodal Autoformalization in the Wild

Authors: Jing Xiong, Qi Han, Yunta Hsieh, Hui Shen, Huajian Xin, Chaofan Tao, Chenyang Zhao, Hengyuan Zhang, Taiqiang Wu, Zhen Zhang, Haochen Wang, Zhongwei Wan, Lingpeng Kong, Ngai Wong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03017
Pdf URL: https://arxiv.org/pdf/2601.03017
Copy Paste: [[2601.03017]] MMFormalizer: Multimodal Autoformalization in the Wild(https://arxiv.org/abs/2601.03017)
Keywords: gpt
Abstract: Autoformalization, which translates natural language mathematics into formal statements to enable machine reasoning, faces fundamental challenges in the wild due to the multimodal nature of the physical world, where physics requires inferring hidden constraints (e.g., mass or energy) from visual elements. To address this, we propose MMFormalizer, which extends autoformalization beyond text by integrating adaptive grounding with entities from real-world mathematical and physical domains. MMFormalizer recursively constructs formal propositions from perceptually grounded primitives through recursive grounding and axiom composition, with adaptive recursive termination ensuring that every abstraction is supported by visual evidence and anchored in dimensional or axiomatic grounding. We evaluate MMFormalizer on a new benchmark, PhyX-AF, comprising 115 curated samples from MathVerse, PhyX, Synthetic Geometry, and Analytic Geometry, covering diverse multimodal autoformalization tasks. Results show that frontier models such as GPT-5 and Gemini-3-Pro achieve the highest compile and semantic accuracy, with GPT-5 excelling in physical reasoning, while geometry remains the most challenging domain. Overall, MMFormalizer provides a scalable framework for unified multimodal autoformalization, bridging perception and formal reasoning. To the best of our knowledge, this is the first multimodal autoformalization method capable of handling classical mechanics (derived from the Hamiltonian), as well as relativity, quantum mechanics, and thermodynamics. More details are available on the project page linked from the arXiv abstract page.

Title: Dementia-R1: Reinforced Pretraining and Reasoning from Unstructured Clinical Notes for Real-World Dementia Prognosis

Authors: Choonghan Kim, Hyunmin Hwang, Hangeol Chang, Jaemin Kim, Jinse Park, Jae-Sung Lim, Jong Chul Ye
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.03018
Pdf URL: https://arxiv.org/pdf/2601.03018
Copy Paste: [[2601.03018]] Dementia-R1: Reinforced Pretraining and Reasoning from Unstructured Clinical Notes for Real-World Dementia Prognosis(https://arxiv.org/abs/2601.03018)
Keywords: language model, gpt, llm
Abstract: While Large Language Models (LLMs) have shown strong performance on clinical text understanding, they struggle with longitudinal prediction tasks such as dementia prognosis, which require reasoning over complex, non-monotonic symptom trajectories across multiple visits. Standard supervised training lacks explicit annotations for symptom evolution, while direct Reinforcement Learning (RL) is hindered by sparse binary rewards. To address this challenge, we introduce Dementia-R1, an RL-based framework for longitudinal dementia prognosis from unstructured clinical notes. Our approach adopts a Cold-Start RL strategy that pre-trains the model to predict verifiable clinical indices extracted from patient histories, enhancing the capability to reason about disease progression before determining the final clinical status. Extensive experiments demonstrate that Dementia-R1 achieves an F1 score of 77.03% on real-world unstructured clinical datasets. Notably, on the ADNI benchmark, our 7B model rivals GPT-4o, effectively capturing fluctuating cognitive trajectories. Code is available at this https URL

Title: MedDialogRubrics: A Comprehensive Benchmark and Evaluation Framework for Multi-turn Medical Consultations in Large Language Models

Authors: Lecheng Gong, Weimin Fang, Ting Yang, Dongjie Tao, Chunxiao Guo, Peng Wei, Bo Xie, Jinqun Guan, Zixiao Chen, Fang Shi, Jinjie Gu, Junwei Liu
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2601.03023
Pdf URL: https://arxiv.org/pdf/2601.03023
Copy Paste: [[2601.03023]] MedDialogRubrics: A Comprehensive Benchmark and Evaluation Framework for Multi-turn Medical Consultations in Large Language Models(https://arxiv.org/abs/2601.03023)
Keywords: language model, llm, hallucination, agent
Abstract: Medical conversational AI plays a pivotal role in the development of safer and more effective medical dialogue systems. However, existing benchmarks and evaluation frameworks for assessing the information-gathering and diagnostic reasoning abilities of medical large language models (LLMs) have not been rigorously evaluated. To address these gaps, we present MedDialogRubrics, a novel benchmark comprising 5,200 synthetically constructed patient cases and over 60,000 fine-grained evaluation rubrics generated by LLMs and subsequently refined by clinical experts, specifically designed to assess the multi-turn diagnostic capabilities of LLMs. Our framework employs a multi-agent system to synthesize realistic patient records and chief complaints from underlying disease knowledge without accessing real-world electronic health records, thereby mitigating privacy and data-governance concerns. We design a robust Patient Agent that is limited to a set of atomic medical facts and augmented with a dynamic guidance mechanism that continuously detects and corrects hallucinations throughout the dialogue, ensuring internal coherence and clinical plausibility of the simulated cases. Furthermore, we propose a structured LLM-based and expert-annotated rubric-generation pipeline that retrieves Evidence-Based Medicine (EBM) guidelines and uses rejection sampling to derive a prioritized set of rubric items ("must-ask" items) for each case. We perform a comprehensive evaluation of state-of-the-art models and demonstrate that, across multiple assessment dimensions, current models face substantial challenges. Our results indicate that improving medical dialogue will require advances in dialogue management architectures, not just incremental tuning of the base model.

Title: LittiChoQA: Literary Texts in Indic Languages Chosen for Question Answering

Authors: Aarya Khandelwal, Ritwik Mishra, Rajiv Ratn Shah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03025
Pdf URL: https://arxiv.org/pdf/2601.03025
Copy Paste: [[2601.03025]] LittiChoQA: Literary Texts in Indic Languages Chosen for Question Answering(https://arxiv.org/abs/2601.03025)
Keywords: language model, llm
Abstract: Long-context question answering (QA) over literary texts poses significant challenges for modern large language models, particularly in low-resource languages. We address the scarcity of long-context QA resources for Indic languages by introducing LittiChoQA, the largest literary QA dataset to date covering many languages spoken in the Gangetic plains of India. The dataset comprises over 270K automatically generated question-answer pairs with a balanced distribution of factoid and non-factoid questions, generated from naturally authored literary texts collected from the open web. We evaluate multiple multilingual LLMs on non-factoid, abstractive QA, under both full-context and context-shortened settings. Results demonstrate a clear trade-off between performance and efficiency: full-context fine-tuning yields the highest token-level and semantic-level scores, while context shortening substantially improves throughput. Among the evaluated models, Krutrim-2 achieves the strongest performance, obtaining a semantic score of 76.1 with full context, while in shortened-context settings it scores 74.9 with answer paragraph selection and 71.4 with vector-based retrieval. Qualitative evaluations further corroborate these findings.

Title: Reducing Hallucinations in LLMs via Factuality-Aware Preference Learning

Authors: Sindhuja Chaduvula, Ahmed Y. Radwan, Azib Farooq, Yani Ioannou, Shaina Raza
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03027
Pdf URL: https://arxiv.org/pdf/2601.03027
Copy Paste: [[2601.03027]] Reducing Hallucinations in LLMs via Factuality-Aware Preference Learning(https://arxiv.org/abs/2601.03027)
Keywords: llm, hallucination
Abstract: Preference alignment methods such as RLHF and Direct Preference Optimization (DPO) improve instruction following, but they can also reinforce hallucinations when preference judgments reward fluency and confidence over factual correctness. We introduce F-DPO (Factuality-aware Direct Preference Optimization), a simple extension of DPO that uses only binary factuality labels. F-DPO (i) applies a label-flipping transformation that corrects misordered preference pairs so the chosen response is never less factual than the rejected one, and (ii) adds a factuality-aware margin that emphasizes pairs with clear correctness differences, while reducing to standard DPO when both responses share the same factuality. We construct factuality-aware preference data by augmenting DPO pairs with binary factuality indicators and synthetic hallucinated variants. Across seven open-weight LLMs (1B-14B), F-DPO consistently improves factuality and reduces hallucination rates relative to both base models and standard DPO. On Qwen3-8B, F-DPO reduces hallucination rates fivefold (from 0.424 to 0.084) while improving factuality scores by 50 percent (from 5.26 to 7.90). F-DPO also generalizes to out-of-distribution benchmarks: on TruthfulQA, Qwen2.5-14B gains 17 percent in relative MC1 accuracy (0.500 to 0.585) and 49 percent in relative MC2 accuracy (0.357 to 0.531). F-DPO requires no auxiliary reward model, token-level annotations, or multi-stage training.
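The two F-DPO ingredients named in the abstract, label flipping and a factuality-aware margin, can be illustrated with a minimal Python sketch. The abstract does not give the exact loss, so the additive margin form, the function names, and the `beta` and `margin` defaults below are assumptions for illustration, not the authors' implementation.

```python
import math


def flip_if_misordered(chosen, rejected, chosen_factual, rejected_factual):
    """Swap a preference pair when the rejected response is factual and the chosen one is not."""
    if rejected_factual and not chosen_factual:
        return rejected, chosen, rejected_factual, chosen_factual
    return chosen, rejected, chosen_factual, rejected_factual


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def fdpo_style_loss(logratio_chosen, logratio_rejected,
                    chosen_factual, rejected_factual,
                    beta=0.1, margin=1.0):
    """DPO loss with an ASSUMED additive margin on pairs whose factuality labels differ.

    logratio_* is log pi_theta(y | x) - log pi_ref(y | x) for each response.
    Call this after flip_if_misordered, so the label gap is 0 or 1.
    """
    label_gap = int(chosen_factual) - int(rejected_factual)
    z = beta * (logratio_chosen - logratio_rejected) - margin * label_gap
    return softplus(-z)  # equals -log(sigmoid(z))
```

When both responses carry the same factuality label, the margin term vanishes and the function returns the standard DPO loss, which matches the reduction property stated in the abstract.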

Title: NorwAI's Large Language Models: Technical Report

Authors: Jon Atle Gulla, Peng Liu, Lemei Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03034
Pdf URL: https://arxiv.org/pdf/2601.03034
Copy Paste: [[2601.03034]] NorwAI's Large Language Models: Technical Report(https://arxiv.org/abs/2601.03034)
Keywords: language model, gpt, llm
Abstract: Norwegian, spoken by approximately five million people, remains underrepresented in many of the most significant breakthroughs in Natural Language Processing (NLP). To address this gap, the NorLLM team at NorwAI has developed a family of models specifically tailored to Norwegian and other Scandinavian languages, building on diverse Transformer-based architectures such as GPT, Mistral, Llama2, Mixtral and Magistral. These models are either pretrained from scratch or continually pretrained on 25B to 88.45B tokens, using a Norwegian-extended tokenizer and advanced post-training strategies to optimize performance, enhance robustness, and improve adaptability across various real-world tasks. Notably, instruction-tuned variants (e.g., Mistral-7B-Instruct and Mixtral-8x7B-Instruct) showcase strong assistant-style capabilities, underscoring their potential for practical deployment in interactive and domain-specific applications. The NorwAI large language models are openly available to Nordic organizations, companies and students for both research and experimental use. This report provides detailed documentation of the model architectures, training data, tokenizer design, fine-tuning strategies, deployment, and evaluations.

Title: BaseCal: Unsupervised Confidence Calibration via Base Model Signals

Authors: Hexiang Tan, Wanli Yang, Junwei Zhang, Xin Chen, Rui Tang, Du Su, Jingang Wang, Yuanzhuo Wang, Fei Sun, Xueqi Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03042
Pdf URL: https://arxiv.org/pdf/2601.03042
Copy Paste: [[2601.03042]] BaseCal: Unsupervised Confidence Calibration via Base Model Signals(https://arxiv.org/abs/2601.03042)
Keywords: llm
Abstract: Reliable confidence is essential for trusting the outputs of LLMs, yet widely deployed post-trained LLMs (PoLLMs) typically compromise this trust with severe overconfidence. In contrast, we observe that their corresponding base LLMs often remain well-calibrated. This naturally motivates us to calibrate PoLLM confidence using the base LLM as a reference. This work proposes two ways to achieve this. A straightforward solution, BaseCal-ReEval, evaluates PoLLM's responses by feeding them into the base LLM to get average probabilities as confidence. While effective, this approach introduces additional inference overhead. To address this, we propose BaseCal-Proj, which trains a lightweight projection to map the final-layer hidden states of PoLLMs back to those of their base LLMs. These projected states are then processed by the base LLM's output layer to derive base-calibrated confidence for PoLLM's responses. Notably, BaseCal is an unsupervised, plug-and-play solution that operates without human labels or LLM modifications. Experiments across five datasets and three LLM families demonstrate the effectiveness of BaseCal, reducing Expected Calibration Error (ECE) by an average of 42.90% compared to the best unsupervised baselines.
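As a rough illustration of the quantities this abstract refers to, the sketch below computes an average token probability (the kind of confidence BaseCal-ReEval is described as reading off the base LLM) and the standard equal-width-bin Expected Calibration Error. The function names and the choice of ten bins are assumptions; how the paper obtains token probabilities or bins its confidences is not stated in the abstract.

```python
import math


def average_token_probability(token_logprobs):
    """Mean per-token probability of a response, given its log-probabilities under some model."""
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: size-weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / total * abs(accuracy - mean_conf)
    return ece
```

A model that reports confidence 1.0 on every answer but is right only half the time gets an ECE of 0.5 under this definition, which is the overconfidence pattern the abstract attributes to post-trained models.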

Title: Lil: Less is Less When Applying Post-Training Sparse-Attention Algorithms in Long-Decode Stage

Authors: Junhao Hu, Fangze Li, Mingtao Xu, Feifan Meng, Shiju Zhao, Tiancheng Hu, Ting Peng, Anmin Liu, Wenrui Huang, Chenxu Liu, Ziyue Hua, Tao Xie
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.03043
Pdf URL: https://arxiv.org/pdf/2601.03043
Copy Paste: [[2601.03043]] Lil: Less is Less When Applying Post-Training Sparse-Attention Algorithms in Long-Decode Stage(https://arxiv.org/abs/2601.03043)
Keywords: language model, llm
Abstract: Large language models (LLMs) demonstrate strong capabilities across a wide range of complex tasks and are increasingly deployed at scale, placing significant demands on inference efficiency. Prior work typically decomposes inference into prefill and decode stages, with the decode stage dominating total latency. To reduce time and memory complexity in the decode stage, a line of work introduces sparse-attention algorithms. In this paper, we show, both empirically and theoretically, that sparse attention can paradoxically increase end-to-end complexity: information loss often induces significantly longer sequences, a phenomenon we term "Less is Less" (Lil). To mitigate the Lil problem, we propose an early-stopping algorithm that detects the threshold where information loss exceeds information gain during sparse decoding. Our early-stopping algorithm reduces token consumption by up to 90% with a marginal accuracy degradation of less than 2% across reasoning-intensive benchmarks.

Title: Temporal Graph Network: Hallucination Detection in Multi-Turn Conversation

Authors: Vidhi Rathore, Sambu Aneesh, Himanshu Singh
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.03051
Pdf URL: https://arxiv.org/pdf/2601.03051
Copy Paste: [[2601.03051]] Temporal Graph Network: Hallucination Detection in Multi-Turn Conversation(https://arxiv.org/abs/2601.03051)
Keywords: hallucination
Abstract: Hallucinations can be produced by conversational AI systems, particularly in multi-turn conversations where context changes and contradictions may eventually surface. By representing the entire conversation as a temporal graph, we present a novel graph-based method for detecting dialogue-level hallucinations. Our framework models each dialogue as a node, encoding it using a sentence transformer. We explore two different ways of connectivity: i) shared-entity edges, which connect turns that refer to the same entities; ii) temporal edges, which connect contiguous turns in the conversation. Message-passing is used to update the node embeddings, allowing flow of information between related nodes. The context-aware node embeddings are then combined using attention pooling into a single vector, which is then passed on to a classifier to determine the presence and type of hallucinations. We demonstrate that our method offers slightly improved performance over existing methods. Further, we show the attention mechanism can be used to justify the decision making process. The code and model weights are made available at: this https URL.
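The two edge types described in this abstract can be sketched as a small graph-construction routine. This is an illustrative sketch only: it assumes each conversational turn is one node and that entities have already been extracted into a set per turn; the function name and the example entities are hypothetical.

```python
def build_dialogue_edges(turn_entities):
    """Return (temporal_edges, shared_entity_edges) over turn indices.

    turn_entities: one set of entity strings per turn, in conversation order.
    """
    n = len(turn_entities)
    # Temporal edges link each turn to the next one.
    temporal_edges = [(i, i + 1) for i in range(n - 1)]
    # Shared-entity edges link any two turns that mention a common entity.
    shared_entity_edges = [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if turn_entities[i] & turn_entities[j]
    ]
    return temporal_edges, shared_entity_edges


turns = [{"Paris"}, {"weather"}, {"Paris", "Louvre"}]
temporal, shared = build_dialogue_edges(turns)
```

Here turns 0 and 2 are joined by a shared-entity edge even though they are not adjacent, which is how a later turn could be checked against an earlier statement about the same entity.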

Title: Detecting Hallucinations in Retrieval-Augmented Generation via Semantic-level Internal Reasoning Graph

Authors: Jianpeng Hu, Yanzeng Li, Jialun Zhong, Wenfa Qi, Lei Zou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03052
Pdf URL: https://arxiv.org/pdf/2601.03052
Copy Paste: [[2601.03052]] Detecting Hallucinations in Retrieval-Augmented Generation via Semantic-level Internal Reasoning Graph(https://arxiv.org/abs/2601.03052)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) systems based on large language models (LLMs) have made significant progress. They can effectively reduce factuality hallucinations, but faithfulness hallucinations still exist. Previous methods for detecting faithfulness hallucinations either neglect to capture the models' internal reasoning processes or handle those features coarsely, making it difficult for discriminators to learn. This paper proposes a semantic-level internal reasoning graph-based method for detecting faithfulness hallucinations. Specifically, we first extend the layer-wise relevance propagation algorithm from the token level to the semantic level, constructing an internal reasoning graph based on attribution vectors. This provides a more faithful semantic-level representation of dependency. Furthermore, we design a general framework based on a small pre-trained language model to utilize the dependencies in the LLM's reasoning for training and hallucination detection, which can dynamically adjust the pass rate of correct samples through a threshold. Experimental results demonstrate that our method achieves better overall performance compared to state-of-the-art baselines on RAGTruth and Dolly-15k.

Title: Do LLMs Encode Functional Importance of Reasoning Tokens?

Authors: Janvijay Singh, Dilek Hakkani-Tür
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.03066
Pdf URL: https://arxiv.org/pdf/2601.03066
Copy Paste: [[2601.03066]] Do LLMs Encode Functional Importance of Reasoning Tokens?(https://arxiv.org/abs/2601.03066)
Keywords: language model, llm
Abstract: Large language models solve complex tasks by generating long reasoning chains, achieving higher accuracy at the price of higher computational cost and a reduced ability to isolate functionally relevant reasoning. Prior work on compact reasoning shortens such chains through probabilistic sampling, heuristics, or supervision from frontier models, but offers limited insight into whether models internally encode token-level functional importance for answer generation. We address this gap diagnostically and propose greedy pruning, a likelihood-preserving deletion procedure that iteratively removes reasoning tokens whose removal minimally degrades model likelihood under a specified objective, yielding length-controlled reasoning chains. We evaluate pruned reasoning in a distillation framework and show that students trained on pruned chains outperform a frontier-model-supervised compression baseline at matched reasoning lengths. Finally, our analysis reveals systematic pruning patterns and shows that attention scores can predict greedy pruning ranks, further suggesting that models encode a nontrivial functional importance structure over reasoning tokens.
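The greedy pruning loop described in this abstract can be sketched as follows. This is a minimal sketch under stated assumptions: `score` is a stand-in callable for model likelihood under the chosen objective, and the toy `IMPORTANCE` table and token list are hypothetical, chosen only so the example runs without a model.

```python
def greedy_prune(tokens, score, target_len):
    """Repeatedly delete the token whose removal leaves the highest-scoring chain.

    score: callable mapping a token list to a number (higher is better),
           standing in for model likelihood under the chosen objective.
    Returns the pruned chain and the tokens in the order they were removed.
    """
    kept = list(tokens)
    removed = []
    while len(kept) > target_len:
        best = max(range(len(kept)), key=lambda i: score(kept[:i] + kept[i + 1:]))
        removed.append(kept.pop(best))
    return kept, removed


# Hypothetical per-token importance used as a toy scoring function.
IMPORTANCE = {"hmm": 0.0, "is": 0.1, "times": 0.5, "3": 0.8, "2": 0.9, "6": 1.0}


def toy_score(chain):
    return sum(IMPORTANCE[t] for t in chain)


pruned, order = greedy_prune(["hmm", "3", "times", "2", "is", "6"], toy_score, target_len=3)
```

The removal order doubles as a ranking of tokens from least to most important, which is the kind of rank the abstract says attention scores can predict. Each step rescans every remaining token, so a real likelihood-based scorer would need one model evaluation per candidate deletion.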

Title: Learning to Diagnose and Correct Moral Errors: Towards Enhancing Moral Sensitivity in Large Language Models

Authors: Bocheng Chen, Han Zi, Xi Chen, Xitong Zhang, Kristen Johnson, Guangliang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03079
Pdf URL: https://arxiv.org/pdf/2601.03079
Copy Paste: [[2601.03079]] Learning to Diagnose and Correct Moral Errors: Towards Enhancing Moral Sensitivity in Large Language Models(https://arxiv.org/abs/2601.03079)
Keywords: language model, llm
Abstract: Moral sensitivity is fundamental to human moral competence, as it guides individuals in regulating everyday behavior. Although many approaches seek to align large language models (LLMs) with human moral values, making them morally sensitive has been extremely challenging. In this paper, we take a step toward answering the question: how can we enhance moral sensitivity in LLMs? Specifically, we propose two pragmatic inference methods that help LLMs diagnose morally benign and hazardous input and correct moral errors, thereby enhancing LLMs' moral sensitivity. A central strength of our pragmatic inference methods is their unified perspective: instead of modeling moral discourses across semantically diverse and complex surface forms, they offer a principled perspective for designing pragmatic inference procedures grounded in their inferential loads. Empirical evidence demonstrates that our pragmatic methods can enhance moral sensitivity in LLMs and achieve strong performance on representative morality-relevant benchmarks.

Title: Grad-ELLM: Gradient-based Explanations for Decoder-only LLMs

Authors: Xin Huang, Antoni B. Chan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.03089
Pdf URL: https://arxiv.org/pdf/2601.03089
Copy Paste: [[2601.03089]] Grad-ELLM: Gradient-based Explanations for Decoder-only LLMs(https://arxiv.org/abs/2601.03089)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse tasks, yet their black-box nature raises concerns about transparency and faithfulness. Input attribution methods aim to highlight each input token's contributions to the model's output, but existing approaches are typically model-agnostic, and do not focus on transformer-specific architectures, leading to limited faithfulness. To address this, we propose Grad-ELLM, a gradient-based attribution method for decoder-only transformer-based LLMs. By aggregating channel importance from gradients of the output logit with respect to attention layers and spatial importance from attention maps, Grad-ELLM generates heatmaps at each generation step without requiring architectural modifications. Additionally, we introduce two faithfulness metrics, $\pi$-Soft-NC and $\pi$-Soft-NS, which are modifications of Soft-NC/NS that provide fairer comparisons by controlling the amount of information kept when perturbing the text. We evaluate Grad-ELLM on sentiment classification, question answering, and open-generation tasks using different models. Experimental results show that Grad-ELLM consistently achieves higher faithfulness than other attribution methods.

Title: Who Laughs with Whom? Disentangling Influential Factors in Humor Preferences across User Clusters and LLMs

Authors: Soichiro Murakami, Hidetaka Kamigaito, Hiroya Takamura, Manabu Okumura
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.03103
Pdf URL: https://arxiv.org/pdf/2601.03103
Copy Paste: [[2601.03103]] Who Laughs with Whom? Disentangling Influential Factors in Humor Preferences across User Clusters and LLMs(https://arxiv.org/abs/2601.03103)
Keywords: language model, llm, prompt
Abstract: Humor preferences vary widely across individuals and cultures, complicating the evaluation of humor using large language models (LLMs). In this study, we model heterogeneity in humor preferences in Oogiri, a Japanese creative response game, by clustering users with voting logs and estimating cluster-specific weights over interpretable preference factors using Bradley-Terry-Luce models. We elicit preference judgments from LLMs by prompting them to select the funnier response and find that user clusters exhibit distinct preference patterns and that the LLM results can resemble those of particular clusters. Finally, we demonstrate that, by persona prompting, LLM preferences can be directed toward a specific cluster. The scripts for data collection and analysis will be released to support reproducibility.
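For readers unfamiliar with Bradley-Terry-Luce models, the pairwise preference probability they define can be sketched in a few lines. The linear utility over factor scores is an assumption about how "weights over interpretable preference factors" enter the model, and the function name is hypothetical; the abstract does not spell out the parameterization.

```python
import math


def btl_win_probability(weights, features_a, features_b):
    """P(response A is preferred over B) when each utility is a weighted sum of factor scores."""
    utility_a = sum(w * x for w, x in zip(weights, features_a))
    utility_b = sum(w * x for w, x in zip(weights, features_b))
    # exp(u_a) / (exp(u_a) + exp(u_b)), written as a logistic of the utility gap.
    return 1.0 / (1.0 + math.exp(utility_b - utility_a))
```

Fitting a separate weight vector per user cluster would then let two clusters rank the same pair of responses differently, which is the kind of heterogeneity the study sets out to measure.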

Title: Discovering and Causally Validating Emotion-Sensitive Neurons in Large Audio-Language Models

Authors: Xiutian Zhao, Björn Schuller, Berrak Sisman
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2601.03115
Pdf URL: https://arxiv.org/pdf/2601.03115
Copy Paste: [[2601.03115]] Discovering and Causally Validating Emotion-Sensitive Neurons in Large Audio-Language Models(https://arxiv.org/abs/2601.03115)
Keywords: language model
Abstract: Emotion is a central dimension of spoken communication, yet we still lack a mechanistic account of how modern large audio-language models (LALMs) encode it internally. We present the first neuron-level interpretability study of emotion-sensitive neurons (ESNs) in LALMs and provide causal evidence that such units exist in Qwen2.5-Omni, Kimi-Audio, and Audio Flamingo 3. Across these three widely used open-source models, we compare frequency-, entropy-, magnitude-, and contrast-based neuron selectors on multiple emotion recognition benchmarks. Using inference-time interventions, we reveal a consistent emotion-specific signature: ablating neurons selected for a given emotion disproportionately degrades recognition of that emotion while largely preserving other classes, whereas gain-based amplification steers predictions toward the target emotion. These effects arise with modest identification data and scale systematically with intervention strength. We further observe that ESNs exhibit non-uniform layer-wise clustering with partial cross-dataset transfer. Taken together, our results offer a causal, neuron-level account of emotion decisions in LALMs and highlight targeted neuron interventions as an actionable handle for controllable affective behaviors.
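The two inference-time interventions mentioned in this abstract, ablation and gain-based amplification, can be pictured as scaling a chosen subset of activations. The sketch below is an assumption-laden illustration: it treats ablation as multiplying by zero and amplification as multiplying by a gain above one, and the function name is hypothetical. The abstract does not say how the paper implements either intervention.

```python
def intervene(activations, neuron_ids, gain):
    """Scale the selected neurons' activations: gain=0.0 ablates, gain>1.0 amplifies."""
    selected = set(neuron_ids)
    return [a * gain if i in selected else a for i, a in enumerate(activations)]
```

For example, `intervene(layer_output, esn_ids, 0.0)` would silence the selected units while leaving all others unchanged, where `layer_output` and `esn_ids` are placeholders for one layer's activations and the indices returned by a neuron selector.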

Title: ToxiGAN: Toxic Data Augmentation via LLM-Guided Directional Adversarial Generation

Authors: Peiran Li, Jan Fillies, Adrian Paschke
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.03121
Pdf URL: https://arxiv.org/pdf/2601.03121
Copy Paste: [[2601.03121]] ToxiGAN: Toxic Data Augmentation via LLM-Guided Directional Adversarial Generation(https://arxiv.org/abs/2601.03121)
Keywords: language model, llm
Abstract: Augmenting toxic language data in a controllable and class-specific manner is crucial for improving robustness in toxicity classification, yet remains challenging due to limited supervision and distributional skew. We propose ToxiGAN, a class-aware text augmentation framework that combines adversarial generation with semantic guidance from large language models (LLMs). To address common issues in GAN-based augmentation such as mode collapse and semantic drift, ToxiGAN introduces a two-step directional training strategy and leverages LLM-generated neutral texts as semantic ballast. Unlike prior work that treats LLMs as static generators, our approach dynamically selects neutral exemplars to provide balanced guidance. Toxic samples are explicitly optimized to diverge from these exemplars, reinforcing class-specific contrastive signals. Experiments on four hate speech benchmarks show that ToxiGAN achieves the strongest average performance in both macro-F1 and hate-F1, consistently outperforming traditional and LLM-based augmentation methods. Ablation and sensitivity analyses further confirm the benefits of semantic ballast and directional training in enhancing classifier robustness.

Title: The Anatomy of Conversational Scams: A Topic-Based Red Teaming Analysis of Multi-Turn Interactions in LLMs

Authors: Xiangzhe Yuan, Zhenhao Zhang, Haoming Tang, Siying Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03134
Pdf URL: https://arxiv.org/pdf/2601.03134
Copy Paste: [[2601.03134]] The Anatomy of Conversational Scams: A Topic-Based Red Teaming Analysis of Multi-Turn Interactions in LLMs(https://arxiv.org/abs/2601.03134)
Keywords: llm, agent
Abstract: As LLMs gain persuasive agentic capabilities through extended dialogues, they introduce novel risks in multi-turn conversational scams that single-turn safety evaluations fail to capture. We systematically study these risks using a controlled LLM-to-LLM simulation framework across multi-turn scam scenarios. Evaluating eight state-of-the-art models in English and Chinese, we analyze dialogue outcomes and qualitatively annotate attacker strategies, defensive responses, and failure modes. Results reveal that scam interactions follow recurrent escalation patterns, while defenses employ verification and delay mechanisms. Furthermore, interactional failures frequently stem from safety guardrail activation and role instability. Our findings highlight multi-turn interactional safety as a critical, distinct dimension of LLM behavior.

Title: Self-Verification is All You Need To Pass The Japanese Bar Examination

Authors: Andrew Shin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.03144
Pdf URL: https://arxiv.org/pdf/2601.03144
Copy Paste: [[2601.03144]] Self-Verification is All You Need To Pass The Japanese Bar Examination(https://arxiv.org/abs/2601.03144)
Keywords: language model, llm, agent
Abstract: Despite rapid advances in large language models (LLMs), achieving reliable performance on highly professional and structured examinations remains a significant challenge. The Japanese bar examination is a particularly demanding benchmark, requiring not only advanced legal reasoning but also strict adherence to complex answer formats that involve joint evaluation of multiple propositions. While recent studies have reported improvements by decomposing such questions into simpler true-false judgments, these approaches have not been systematically evaluated under the original exam format and scoring scheme, leaving open the question of whether they truly capture exam-level competence. In this paper, we present a self-verification model trained on a newly constructed dataset that faithfully replicates the authentic format and evaluation scale of the exam. Our model is able to exceed the official passing score when evaluated on the actual exam scale, marking the first demonstration, to our knowledge, of an LLM passing the Japanese bar examination without altering its original question structure or scoring rules. We further conduct extensive comparisons with alternative strategies, including multi-agent inference and decomposition-based supervision, and find that these methods fail to achieve comparable performance. Our results highlight the importance of format-faithful supervision and consistency verification, and suggest that carefully designed single-model approaches can outperform more complex systems in high-stakes professional reasoning tasks. Our dataset and code are publicly available.

Title: Decoupling the Effect of Chain-of-Thought Reasoning: A Human Label Variation Perspective

Authors: Beiduo Chen, Tiancheng Hu, Caiqi Zhang, Robert Litschko, Anna Korhonen, Barbara Plank
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03154
Pdf URL: https://arxiv.org/pdf/2601.03154
Copy Paste: [[2601.03154]] Decoupling the Effect of Chain-of-Thought Reasoning: A Human Label Variation Perspective(https://arxiv.org/abs/2601.03154)
Keywords: llm, chain-of-thought
Abstract: Reasoning-tuned LLMs utilizing long Chain-of-Thought (CoT) excel at single-answer tasks, yet their ability to model Human Label Variation, which requires capturing probabilistic ambiguity rather than resolving it, remains underexplored. We investigate this through systematic disentanglement experiments on distribution-based tasks, employing Cross-CoT experiments to isolate the effect of reasoning text from intrinsic model priors. We observe a distinct "decoupled mechanism": while CoT improves distributional alignment, final accuracy is dictated by CoT content (99% variance contribution), whereas distributional ranking is governed by model priors (over 80%). Step-wise analysis further shows that while CoT's influence on accuracy grows monotonically during the reasoning process, distributional structure is largely determined by the LLM's intrinsic priors. These findings suggest that long CoT serves as a decisive LLM decision-maker for the top option but fails to function as a granular distribution calibrator for ambiguous tasks.

Title: WebAnchor: Anchoring Agent Planning to Stabilize Long-Horizon Web Reasoning

Authors: Yu Xinmiao, Zhang Liwen, Feng Xiaocheng, Jiang Yong, Qin Bing, Xie Pengjun, Zhou Jingren
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03164
Pdf URL: https://arxiv.org/pdf/2601.03164
Copy Paste: [[2601.03164]] WebAnchor: Anchoring Agent Planning to Stabilize Long-Horizon Web Reasoning(https://arxiv.org/abs/2601.03164)
Keywords: language model, llm, agent
Abstract: Large Language Model (LLM)-based agents have shown strong capabilities in web information seeking, with reinforcement learning (RL) becoming a key optimization paradigm. However, planning remains a bottleneck, as existing methods struggle with long-horizon strategies. Our analysis reveals a critical phenomenon, plan anchor, where the first reasoning step disproportionately impacts downstream behavior in long-horizon web reasoning tasks. Current RL algorithms fail to account for this because they distribute rewards uniformly across the trajectory. To address this, we propose Anchor-GRPO, a two-stage RL framework that decouples planning and execution. In Stage 1, the agent optimizes its first-step planning using fine-grained rubrics derived from self-play experiences and human calibration. In Stage 2, execution is aligned with the initial plan through sparse rewards, ensuring stable and efficient tool usage. We evaluate Anchor-GRPO on four benchmarks: BrowseComp, BrowseComp-Zh, GAIA, and XBench-DeepSearch. Across models from 3B to 30B, Anchor-GRPO outperforms baseline GRPO and First-step GRPO, improving task success and tool efficiency. Notably, WebAnchor-30B achieves 46.0% pass@1 on BrowseComp and 76.4% on GAIA. Anchor-GRPO also demonstrates strong scalability, achieving higher accuracy as model size and context length increase.
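The uniform reward distribution that the abstract criticizes can be illustrated with a minimal sketch of a GRPO-style group-normalized advantage. This is the baseline behavior only, not Anchor-GRPO itself; the function names, the use of the population standard deviation, and the `eps` constant are assumptions made for illustration.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Normalize each trajectory's scalar reward against its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mean) / (std + eps) for r in rewards]

def broadcast_to_steps(advantage, n_steps):
    """Uniform credit assignment: every step of a trajectory gets the same advantage."""
    return [advantage] * n_steps

# Four sampled trajectories for one task, scored 1.0 (success) or 0.0 (failure).
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_advantages(rewards)

# The first (planning) step and all later steps receive identical credit.
per_step = broadcast_to_steps(adv[0], n_steps=5)
print(adv)
print(per_step)
```

Because the planning step is weighted exactly like every later step here, a scheme that wants to treat the first step specially has to change the credit assignment, which is the gap the two-stage design is described as targeting.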

Title: Maximizing Local Entropy Where It Matters: Prefix-Aware Localized LLM Unlearning

Authors: Naixin Zhai, Pengyang Shao, Binbin Zheng, Fei Shen, Long Bai, Xun Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03190
Pdf URL: https://arxiv.org/pdf/2601.03190
Copy Paste: [[2601.03190]] Maximizing Local Entropy Where It Matters: Prefix-Aware Localized LLM Unlearning(https://arxiv.org/abs/2601.03190)
Keywords: language model, llm
Abstract: Machine unlearning aims to forget sensitive knowledge from Large Language Models (LLMs) while maintaining general utility. However, existing approaches typically treat all tokens in a response indiscriminately and enforce uncertainty over the entire vocabulary. This global treatment results in unnecessary utility degradation and extends optimization to content-agnostic regions. To address these limitations, we propose PALU (Prefix-Aware Localized Unlearning), a framework driven by a local entropy maximization objective across both temporal and vocabulary dimensions. PALU reveals that (i) suppressing the sensitive prefix alone is sufficient to sever the causal generation link, and (ii) flattening only the top-$k$ logits is adequate to maximize uncertainty in the critical subspace. These findings allow PALU to avoid redundant optimization across the full vocabulary and parameter space while minimizing collateral damage to general model performance. Extensive experiments validate that PALU achieves superior forgetting efficacy and utility preservation compared to state-of-the-art baselines.
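Finding (ii) can be illustrated numerically with a minimal sketch. It is not the PALU objective or the authors' code: the function names and the choice to renormalize the softmax over only the top-k entries are assumptions. The sketch shows that the entropy of the top-k sub-distribution reaches its maximum, log k, once those k logits are made equal.

```python
import math

def topk_local_entropy(logits, k):
    """Entropy (in nats) of the softmax restricted to the k largest logits."""
    top = sorted(logits, reverse=True)[:k]
    m = max(top)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in top]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

def flatten_topk(logits, k):
    """Return a copy in which the k largest logits are replaced by their mean."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    mean = sum(logits[i] for i in order) / k
    out = list(logits)
    for i in order:
        out[i] = mean
    return out

logits = [5.0, 2.0, 1.0, -3.0, -4.0]  # toy next-token logits
k = 3
before = topk_local_entropy(logits, k)
after = topk_local_entropy(flatten_topk(logits, k), k)
print(before, after, math.log(k))
```

The logits outside the top-k set are left untouched, which is the sense in which the treatment is local to the vocabulary dimension.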

Title: MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory

Authors: Shengtao Zhang, Jiaqian Wang, Ruiwen Zhou, Junwei Liao, Yuchen Feng, Weinan Zhang, Ying Wen, Zhiyu Li, Feiyu Xiong, Yutao Qi, Bo Tang, Muning Wen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03192
Pdf URL: https://arxiv.org/pdf/2601.03192
Copy Paste: [[2601.03192]] MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory(https://arxiv.org/abs/2601.03192)
Keywords: language model, llm, agent
Abstract: The hallmark of human intelligence is the ability to master new skills through Constructive Episodic Simulation: retrieving past experiences to synthesize solutions for novel tasks. While Large Language Models possess strong reasoning capabilities, they struggle to emulate this self-evolution: fine-tuning is computationally expensive and prone to catastrophic forgetting, while existing memory-based methods rely on passive semantic matching that often retrieves noise. To address these challenges, we propose MemRL, a framework that enables agents to self-evolve via non-parametric reinforcement learning on episodic memory. MemRL explicitly separates the stable reasoning of a frozen LLM from the plastic, evolving memory. Unlike traditional methods, MemRL employs a Two-Phase Retrieval mechanism that filters candidates by semantic relevance and then selects them based on learned Q-values (utility). These utilities are continuously refined via environmental feedback in a trial-and-error manner, allowing the agent to distinguish high-value strategies from similar noise. Extensive experiments on HLE, BigCodeBench, ALFWorld, and Lifelong Agent Bench demonstrate that MemRL significantly outperforms state-of-the-art baselines. Our analysis experiments confirm that MemRL effectively reconciles the stability-plasticity dilemma, enabling continuous runtime improvement without weight updates.
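The Two-Phase Retrieval described above can be sketched as follows. The abstract does not give the details, so several choices are assumptions: cosine similarity as the relevance measure, a fixed shortlist size, greedy selection by Q-value, and an incremental update with step size `alpha`. All names are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    embedding: list
    q_value: float = 0.0  # learned utility, refined from feedback

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def two_phase_retrieve(query_emb, bank, shortlist_size=3, top_k=1):
    # Phase 1: keep the memories most semantically similar to the query.
    shortlist = sorted(bank, key=lambda m: cosine(query_emb, m.embedding),
                       reverse=True)[:shortlist_size]
    # Phase 2: among those, prefer the memories with the highest learned utility.
    return sorted(shortlist, key=lambda m: m.q_value, reverse=True)[:top_k]

def update_q(memory, reward, alpha=0.5):
    # Move the stored utility toward the observed reward; no model weights change.
    memory.q_value += alpha * (reward - memory.q_value)

bank = [
    Memory("strategy A", [1.0, 0.0], q_value=0.2),
    Memory("strategy B", [0.9, 0.1], q_value=0.8),
    Memory("unrelated", [0.0, 1.0], q_value=0.9),
]
chosen = two_phase_retrieve([1.0, 0.0], bank, shortlist_size=2, top_k=1)[0]
print(chosen.text)
update_q(chosen, reward=0.0)  # the retrieved strategy failed this time
print(chosen.q_value)
```

In this toy bank the unrelated memory has the highest utility but is removed by the semantic filter, while the utility ranking breaks the tie between the two similar strategies.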

Title: X-MuTeST: A Multilingual Benchmark for Explainable Hate Speech Detection and A Novel LLM-consulted Explanation Framework

Authors: Mohammad Zia Ur Rehman, Sai Kartheek Reddy Kasu, Shashivardhan Reddy Koppula, Sai Rithwik Reddy Chirra, Shwetank Shekhar Singh, Nagendra Kumar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03194
Pdf URL: https://arxiv.org/pdf/2601.03194
Copy Paste: [[2601.03194]] X-MuTeST: A Multilingual Benchmark for Explainable Hate Speech Detection and A Novel LLM-consulted Explanation Framework(https://arxiv.org/abs/2601.03194)
Keywords: language model, llm
Abstract: Hate speech detection on social media faces challenges in both accuracy and explainability, especially for underexplored Indic languages. We propose a novel explainability-guided training framework, X-MuTeST (eXplainable Multilingual haTe Speech deTection), for hate speech detection that combines high-level semantic reasoning from large language models (LLMs) with traditional attention-enhancing techniques. We extend this research to Hindi and Telugu alongside English by providing benchmark human-annotated rationales for each word to justify the assigned class label. The X-MuTeST explainability method computes the difference between the prediction probabilities of the original text and those of unigrams, bigrams, and trigrams. Final explanations are computed as the union of LLM explanations and X-MuTeST explanations. We show that leveraging human rationales during training enhances both classification performance and explainability. Moreover, combining human rationales with our explainability method to refine the model attention yields further improvements. We evaluate explainability using Plausibility metrics such as Token-F1 and IOU-F1 and Faithfulness metrics such as Comprehensiveness and Sufficiency. By focusing on under-resourced languages, our work advances hate speech detection across diverse linguistic contexts. Our dataset includes token-level rationale annotations for 6,004 Hindi, 4,492 Telugu, and 6,334 English samples. Data and code are publicly available.

Title: DIP: Dynamic In-Context Planner For Diffusion Language Models

Authors: Yang Li, Han Meng, Chenan Wang, Haipeng Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.03199
Pdf URL: https://arxiv.org/pdf/2601.03199
Copy Paste: [[2601.03199]] DIP: Dynamic In-Context Planner For Diffusion Language Models(https://arxiv.org/abs/2601.03199)
Keywords: language model, prompt
Abstract: Diffusion language models (DLMs) have shown strong potential for general natural language tasks with in-context examples. However, due to the bidirectional attention mechanism, DLMs incur substantial computational cost as context length increases. This work addresses this issue with a key discovery: unlike the sequential generation in autoregressive language models (ARLMs), the diffusion generation paradigm in DLMs allows efficient dynamic adjustment of the context during generation. Building on this insight, we propose Dynamic In-Context Planner (DIP), a context-optimization method that dynamically selects and inserts in-context examples during generation, rather than providing all examples in the prompt upfront. Results show DIP maintains generation quality while achieving up to 12.9× inference speedup over standard inference and 1.17× over KV cache-enhanced inference.

Title: UltraLogic: Enhancing LLM Reasoning through Large-Scale Data Synthesis and Bipolar Float Reward

Authors: Yile Liu, Yixian Liu, Zongwei Li, Yufei Huang, Xinhua Feng, Zhichao Hu, Jinglu Hu, Jianfeng Yan, Fengzong Lian, Yuhong Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.03205
Pdf URL: https://arxiv.org/pdf/2601.03205
Copy Paste: [[2601.03205]] UltraLogic: Enhancing LLM Reasoning through Large-Scale Data Synthesis and Bipolar Float Reward(https://arxiv.org/abs/2601.03205)
Keywords: language model, llm
Abstract: While Large Language Models (LLMs) have demonstrated significant potential in natural language processing, complex general-purpose reasoning requiring multi-step logic, planning, and verification remains a critical bottleneck. Although Reinforcement Learning with Verifiable Rewards (RLVR) has succeeded in specific domains, the field lacks large-scale, high-quality, and difficulty-calibrated data for general reasoning. To address this, we propose UltraLogic, a framework that decouples the logical core of a problem from its natural language expression through a Code-based Solving methodology to automate high-quality data production. The framework comprises hundreds of unique task types and an automated calibration pipeline across ten difficulty levels. Furthermore, to mitigate binary reward sparsity and the Non-negative Reward Trap, we introduce the Bipolar Float Reward (BFR) mechanism, utilizing graded penalties to effectively distinguish perfect responses from those with logical flaws. Our experiments demonstrate that task diversity is the primary driver for reasoning enhancement, and that BFR, combined with a difficulty matching strategy, significantly improves training efficiency, guiding models toward global logical optima.

Title: MalruleLib: Large-Scale Executable Misconception Reasoning with Step Traces for Modeling Student Thinking in Mathematics

Authors: Xinghe Chen, Naiming Liu, Shashank Sonkar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03217
Pdf URL: https://arxiv.org/pdf/2601.03217
Copy Paste: [[2601.03217]] MalruleLib: Large-Scale Executable Misconception Reasoning with Step Traces for Modeling Student Thinking in Mathematics(https://arxiv.org/abs/2601.03217)
Keywords: language model
Abstract: Student mistakes in mathematics are often systematic: a learner applies a coherent but wrong procedure and repeats it across contexts. We introduce MalruleLib, a learning-science-grounded framework that translates documented misconceptions into executable procedures, drawing on 67 learning-science and mathematics education sources, and generates step-by-step traces of malrule-consistent student work. We formalize a core student-modeling problem as Malrule Reasoning Accuracy (MRA): infer a misconception from one worked mistake and predict the student's next answer under cross-template rephrasing. Across nine language models (4B-120B), accuracy drops from 66% on direct problem solving to 40% on cross-template misconception prediction. MalruleLib encodes 101 malrules over 498 parameterized problem templates and produces paired dual-path traces for both correct reasoning and malrule-consistent student reasoning. Because malrules are executable and templates are parameterizable, MalruleLib can generate over one million instances, enabling scalable supervision and controlled evaluation. Using MalruleLib, we observe cross-template degradations of 10-21%, while providing student step traces improves prediction by 3-15%. We release MalruleLib as infrastructure for educational AI that models student procedures across contexts, enabling diagnosis and feedback that targets the underlying misconception.
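The idea of an executable malrule with paired dual-path traces can be sketched with one widely discussed fraction misconception: adding numerators and denominators separately. This example is illustrative only; it is not taken from MalruleLib, and the function names and trace format are hypothetical.

```python
from fractions import Fraction

def add_correct(a, b, c, d):
    """Correct path for a/b + c/d, returning the answer and a step trace."""
    steps = [
        f"common denominator: {b}*{d} = {b * d}",
        f"rewrite: {a * d}/{b * d} + {c * b}/{b * d}",
        f"add numerators: {a * d + c * b}/{b * d}",
    ]
    return Fraction(a * d + c * b, b * d), steps

def add_malrule(a, b, c, d):
    """Malrule path: the student adds numerators and denominators separately."""
    steps = [
        f"add numerators: {a}+{c} = {a + c}",
        f"add denominators: {b}+{d} = {b + d}",
    ]
    return Fraction(a + c, b + d), steps

# One worked mistake: the student answers 1/2 + 1/3 with 2/5 instead of 5/6.
correct, _ = add_correct(1, 2, 1, 3)
observed, trace = add_malrule(1, 2, 1, 3)
print(correct, observed)

# Because the malrule is executable, it predicts the student's next answer
# on a new instance of the same problem type.
predicted_next, _ = add_malrule(2, 3, 1, 4)
print(predicted_next)
```

Varying the integer parameters yields many instances from a single template, which is the kind of parameterization the abstract credits for generating over one million instances.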

Title: Multi-RADS Synthetic Radiology Report Dataset and Head-to-Head Benchmarking of 41 Open-Weight and Proprietary Language Models

Authors: Kartik Bose, Abhinandan Kumar, Raghuraman Soundararajan, Priya Mudgil, Samonee Ralmilay, Niharika Dutta, Manphool Singhal, Arun Kumar, Saugata Sen, Anurima Patra, Priya Ghosh, Abanti Das, Amit Gupta, Ashish Verma, Dipin Sudhakaran, Ekta Dhamija, Himangi Unde, Ishan Kumar, Krithika Rangarajan, Prerna Garg, Rachel Sequeira, Sudhin Shylendran, Taruna Yadav, Tej Pal, Pankaj Gupta
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.03232
Pdf URL: https://arxiv.org/pdf/2601.03232
Copy Paste: [[2601.03232]] Multi-RADS Synthetic Radiology Report Dataset and Head-to-Head Benchmarking of 41 Open-Weight and Proprietary Language Models(https://arxiv.org/abs/2601.03232)
Keywords: language model, gpt, llm, prompt
Abstract: Background: Reporting and Data Systems (RADS) standardize radiology risk communication but automated RADS assignment from narrative reports is challenging because of guideline complexity, output-format constraints, and limited benchmarking across RADS frameworks and model sizes. Purpose: To create RXL-RADSet, a radiologist-verified synthetic multi-RADS benchmark, and compare validity and accuracy of open-weight small language models (SLMs) with a proprietary model for RADS assignment. Materials and Methods: RXL-RADSet contains 1,600 synthetic radiology reports across 10 RADS (BI-RADS, CAD-RADS, GB-RADS, LI-RADS, Lung-RADS, NI-RADS, O-RADS, PI-RADS, TI-RADS, VI-RADS) and multiple modalities. Reports were generated by LLMs using scenario plans and simulated radiologist styles and underwent two-stage radiologist verification. We evaluated 41 quantized SLMs (12 families, 0.135-32B parameters) and GPT-5.2 under a fixed guided prompt. Primary endpoints were validity and accuracy; a secondary analysis compared guided versus zero-shot prompting. Results: Under guided prompting GPT-5.2 achieved 99.8% validity and 81.1% accuracy (1,600 predictions). Pooled SLMs (65,600 predictions) achieved 96.8% validity and 61.1% accuracy; top SLMs in the 20-32B range reached ~99% validity and mid-to-high 70% accuracy. Performance scaled with model size (inflection between <1B and >=10B) and declined with RADS complexity primarily due to classification difficulty rather than invalid outputs. Guided prompting improved validity (99.2% vs 96.7%) and accuracy (78.5% vs 69.6%) compared with zero-shot. Conclusion: RXL-RADSet provides a radiologist-verified multi-RADS benchmark; large SLMs (20-32B) can approach proprietary-model performance under guided prompting, but gaps remain for higher-complexity schemes.
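The prediction counts in the Results follow from the study design: 41 SLMs each scored on 1,600 reports gives 65,600 pooled predictions. The sketch below checks that arithmetic and shows one plausible way to compute the two primary endpoints. The abstract does not define them, so treating validity as the share of outputs that fall within the allowed category set, and accuracy as the share of all reports answered correctly, is an assumption, as are the function name and the toy labels.

```python
N_REPORTS = 1600  # synthetic reports in RXL-RADSet
N_SLMS = 41       # quantized small language models evaluated

pooled_predictions = N_SLMS * N_REPORTS
print(pooled_predictions)

def validity_and_accuracy(predictions, gold, allowed):
    """Assumed definitions: an invalid output also counts as incorrect."""
    n = len(gold)
    valid = sum(p in allowed for p in predictions)
    correct = sum(p == g for p, g in zip(predictions, gold))
    return valid / n, correct / n

# Toy example with hypothetical category labels.
allowed = {"1", "2", "3", "4", "5"}
predictions = ["1", "2", "n/a", "4"]
gold = ["1", "3", "3", "4"]
validity, accuracy = validity_and_accuracy(predictions, gold, allowed)
print(validity, accuracy)
```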

Title: STReasoner: Empowering LLMs for Spatio-Temporal Reasoning in Time Series via Spatial-Aware Reinforcement Learning

Authors: Juntong Ni, Shiyu Wang, Ming Jin, Qi He, Wei Jin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03248
Pdf URL: https://arxiv.org/pdf/2601.03248
Copy Paste: [[2601.03248]] STReasoner: Empowering LLMs for Spatio-Temporal Reasoning in Time Series via Spatial-Aware Reinforcement Learning(https://arxiv.org/abs/2601.03248)
Keywords: llm, agent
Abstract: Spatio-temporal reasoning in time series involves the explicit synthesis of temporal dynamics, spatial dependencies, and textual context. This capability is vital for high-stakes decision-making in systems such as traffic networks, power grids, and disease propagation. However, the field remains underdeveloped because most existing works prioritize predictive accuracy over reasoning. To address the gap, we introduce ST-Bench, a benchmark consisting of four core tasks, including etiological reasoning, entity identification, correlation reasoning, and in-context forecasting, developed via a network SDE-based multi-agent data synthesis pipeline. We then propose STReasoner, which empowers LLMs to integrate time series, graph structure, and text for explicit reasoning. To promote spatially grounded logic, we introduce S-GRPO, a reinforcement learning algorithm that rewards performance gains specifically attributable to spatial information. Experiments show that STReasoner achieves average accuracy gains between 17% and 135% at only 0.004X the cost of proprietary models and generalizes robustly to real-world data.

Title: Automated Semantic Rules Detection (ASRD) for Emergent Communication Interpretation

Authors: Bastien Vanderplaetse, Xavier Siebert, Stéphane Dupont
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.03254
Pdf URL: https://arxiv.org/pdf/2601.03254
Copy Paste: [[2601.03254]] Automated Semantic Rules Detection (ASRD) for Emergent Communication Interpretation(https://arxiv.org/abs/2601.03254)
Keywords: agent
Abstract: The field of emergent communication within multi-agent systems examines how autonomous agents can independently develop communication strategies, without explicit programming, and adapt them to varied environments. However, few studies have focused on the interpretability of emergent languages. The research presented in this paper proposes an Automated Semantic Rules Detection (ASRD) algorithm, which extracts relevant patterns in messages exchanged by agents trained with two different datasets on the Lewis Game, which is often studied in the context of emergent communication. ASRD aids the interpretation of the emergent communication by relating the extracted patterns to specific attributes of the input data, thereby considerably simplifying subsequent analysis.