2025-12-30

Title: Open-Source Multimodal Moxin Models with Moxin-VLM and Moxin-VLA

Authors: Pu Zhao, Xuan Shen, Zhenglun Kong, Yixin Shen, Sung-En Chang, Arash Akbari, Timothy Rupprecht, Lei Lu, Enfu Nan, Changdi Yang, Yumei He, Weiyan Shi, Xingchen Xu, Yu Huang, Wei Jiang, Wei Wang, Yue Chen, Yong He, Yanzhi Wang
Subjects: cs.CL, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.22208
Pdf URL: https://arxiv.org/pdf/2512.22208
Copy Paste: [[2512.22208]] Open-Source Multimodal Moxin Models with Moxin-VLM and Moxin-VLA(https://arxiv.org/abs/2512.22208)
Keywords: language model, gpt, llm
Abstract: Recently, Large Language Models (LLMs) have undergone a significant transformation, marked by a rapid rise in both their popularity and capabilities. Leading this evolution are proprietary LLMs like GPT-4 and GPT-o1, which have captured widespread attention in the AI community due to their remarkable performance and versatility. Simultaneously, open-source LLMs, such as LLaMA and Mistral, have contributed greatly to the ever-increasing popularity of LLMs because they are easy to customize and deploy across diverse applications. Moxin 7B is introduced as a fully open-source LLM developed in accordance with the Model Openness Framework, which moves beyond the simple sharing of model weights to embrace complete transparency in training, datasets, and implementation details, thus fostering a more inclusive and collaborative research environment that can sustain a healthy open-source ecosystem. To further equip Moxin with various capabilities in different tasks, we develop three variants based on Moxin, namely Moxin-VLM, Moxin-VLA, and Moxin-Chinese, which target vision-language, vision-language-action, and Chinese capabilities, respectively. Experiments show that our models achieve superior performance in various evaluations. We adopt an open-source framework and open data for training. We release our models, along with the available data and code to derive these models.

Title: Hierarchical Geometry of Cognitive States in Transformer Embedding Spaces

Authors: Sophie Zhao
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2512.22227
Pdf URL: https://arxiv.org/pdf/2512.22227
Copy Paste: [[2512.22227]] Hierarchical Geometry of Cognitive States in Transformer Embedding Spaces(https://arxiv.org/abs/2512.22227)
Keywords: language model
Abstract: Recent work has shown that transformer-based language models learn rich geometric structure in their embedding spaces, yet the presence of higher-level cognitive organization within these representations remains underexplored. In this work, we investigate whether sentence embeddings encode a graded, hierarchical structure aligned with human-interpretable cognitive or psychological attributes. We construct a dataset of 480 natural-language sentences annotated with continuous ordinal energy scores and discrete tier labels spanning seven ordered cognitive categories. Using fixed sentence embeddings from multiple transformer models, we evaluate the recoverability of these annotations via linear and shallow nonlinear probes. Across models, both continuous scores and tier labels are reliably decodable, with shallow nonlinear probes providing consistent performance gains over linear probes. Lexical TF-IDF baselines perform substantially worse, indicating that the observed structure is not attributable to surface word statistics alone. Nonparametric permutation tests further confirm that probe performance exceeds chance under label-randomization nulls. Qualitative analyses using UMAP visualizations and confusion matrices reveal smooth low-to-high gradients and predominantly adjacent-tier confusions in embedding space. Taken together, these results provide evidence that transformer embedding spaces exhibit a hierarchical geometric organization aligned with human-defined cognitive attributes, while remaining agnostic to claims of internal awareness or phenomenology.
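The label-randomization check mentioned in the abstract can be illustrated with a small sketch. It is a simplified stand-in, not the author's procedure: it shuffles the tier labels against fixed probe predictions instead of retraining a probe on each shuffle, and the function name `permutation_p_value` and the toy seven-tier data are hypothetical.

```python
import random


def permutation_p_value(y_true, y_pred, n_perm=2000, seed=0):
    """Return (observed accuracy, permutation p-value) for fixed predictions."""
    rng = random.Random(seed)

    def accuracy(labels):
        return sum(a == b for a, b in zip(labels, y_pred)) / len(labels)

    observed = accuracy(y_true)
    shuffled = list(y_true)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the link between labels and predictions
        if accuracy(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


# Toy data (assumption): 70 sentences over 7 tiers, with 14 probe errors.
y_true = [i % 7 for i in range(70)]
y_pred = list(y_true)
for i in range(0, 70, 5):
    y_pred[i] = (y_pred[i] + 1) % 7

observed, p_value = permutation_p_value(y_true, y_pred)
print(f"accuracy={observed:.2f}, p={p_value:.4f}")
```

A small p-value here only says that the agreement between predictions and labels is unlikely under shuffled labels; it says nothing about why the probe succeeds.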

Title: SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents

Authors: Shaofei Cai, Yulei Qin, Haojia Lin, Zihan Xu, Gang Li, Yuchen Shi, Zongyi Li, Yong Mao, Siqi Cai, Xiaoyu Tan, Yitao Liang, Ke Li, Xing Sun
Subjects: cs.CL, cs.AI, cs.CV, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2512.22322
Pdf URL: https://arxiv.org/pdf/2512.22322
Copy Paste: [[2512.22322]] SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents(https://arxiv.org/abs/2512.22322)
Keywords: llm, agent
Abstract: Agentic reinforcement learning (RL) holds great promise for the development of autonomous agents under complex GUI tasks, but its scalability remains severely hampered by the verification of task completion. Existing task verification is treated as a passive, post-hoc process: a verifier (e.g., a rule-based scoring script, a reward or critic model, or an LLM-as-a-Judge) analyzes the agent's entire interaction trajectory to determine if the agent succeeds. Such processing of verbose context that contains irrelevant, noisy history poses challenges to the verification protocols and therefore leads to prohibitive cost and low reliability. To overcome this bottleneck, we propose SmartSnap, a paradigm shift from this passive, post-hoc verification to proactive, in-situ self-verification by the agent itself. We introduce the Self-Verifying Agent, a new type of agent designed with dual missions: to not only complete a task but also to prove its accomplishment with curated snapshot evidence. Guided by our proposed 3C Principles (Completeness, Conciseness, and Creativity), the agent leverages its access to the online environment to perform self-verification on a minimal, decisive set of snapshots. Such evidence is provided as the sole material for a general LLM-as-a-Judge verifier to determine its validity and relevance. Experiments on mobile tasks across model families and scales demonstrate that our SmartSnap paradigm allows training LLM-driven agents in a scalable manner, bringing performance gains of up to 26.08% and 16.66% to 8B and 30B models, respectively. The synergy between solution finding and evidence seeking facilitates the cultivation of efficient, self-verifying agents with competitive performance against DeepSeek V3.1 and Qwen3-235B-A22B.

Title: Towards Efficient Post-Training via Fourier-Driven Adapter Architectures

Authors: Donggyun Bae, Jongil Park
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.22378
Pdf URL: https://arxiv.org/pdf/2512.22378
Copy Paste: [[2512.22378]] Towards Efficient Post-Training via Fourier-Driven Adapter Architectures(https://arxiv.org/abs/2512.22378)
Keywords: language model
Abstract: We propose a novel framework, termed Fourier-Activated Adapter (FAA), for parameter-efficient fine-tuning of large pre-trained language models. By incorporating random Fourier features into lightweight adapter modules, FAA decomposes intermediate representations into complementary low- and high-frequency components, enabling frequency-aware modulation of semantic information. This design allows the model to selectively emphasize informative frequency bands during adaptation while preserving the representational capacity of the frozen backbone. Extensive experiments on GLUE, E2E NLG, and instruction-tuning benchmarks demonstrate that FAA consistently achieves competitive or superior performance compared to existing parameter-efficient fine-tuning methods, while maintaining low computational and memory overhead. Ablation studies further verify the effectiveness of frequency-aware activation and adaptive weighting mechanisms, highlighting FAA as a robust and efficient approach for post-training large language models.
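The abstract does not spell out the adapter architecture, so the sketch below shows only the generic building block it names: a classic random Fourier feature map, z(x) = sqrt(2/D) * cos(Wx + b), whose inner products approximate a Gaussian kernel. The function name `make_rff`, the feature count, and the toy vectors are assumptions for illustration, not part of FAA.

```python
import math
import random


def make_rff(dim, n_features, seed=0):
    """Build a random Fourier feature map for a unit-bandwidth Gaussian kernel."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def features(x):
        return [
            scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            for w, b in zip(weights, phases)
        ]

    return features


features = make_rff(dim=3, n_features=2000)
x = [0.2, -0.1, 0.4]
y = [0.0, 0.3, 0.1]

# The feature inner product should be close to exp(-||x - y||^2 / 2).
approx = sum(a * b for a, b in zip(features(x), features(y)))
exact = math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / 2.0)
print(f"approx={approx:.3f}, exact={exact:.3f}")
```

Rows of the random weight matrix with a small norm act as low-frequency components and rows with a large norm as high-frequency ones, which is one plausible reading of the low/high split the abstract describes.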

Title: LLM-Guided Exemplar Selection for Few-Shot Wearable-Sensor Human Activity Recognition

Authors: Elsen Ronando, Sozo Inoue
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2512.22385
Pdf URL: https://arxiv.org/pdf/2512.22385
Copy Paste: [[2512.22385]] LLM-Guided Exemplar Selection for Few-Shot Wearable-Sensor Human Activity Recognition(https://arxiv.org/abs/2512.22385)
Keywords: llm
Abstract: In this paper, we propose an LLM-Guided Exemplar Selection framework to address a key limitation in state-of-the-art Human Activity Recognition (HAR) methods: their reliance on large labeled datasets and purely geometric exemplar selection, which often fail to distinguish similar wearable-sensor activities such as walking, walking upstairs, and walking downstairs. Our method incorporates semantic reasoning via an LLM-generated knowledge prior that captures feature importance, inter-class confusability, and exemplar budget multipliers, and uses it to guide exemplar scoring and selection. These priors are combined with margin-based validation cues, PageRank centrality, hubness penalization, and facility-location optimization to obtain a compact and informative set of exemplars. Evaluated on the UCI-HAR dataset under strict few-shot conditions, the framework achieves a macro F1-score of 88.78%, outperforming classical approaches such as random sampling, herding, and k-center. The results show that LLM-derived semantic priors, when integrated with structural and geometric cues, provide a stronger foundation for selecting representative sensor exemplars in few-shot wearable-sensor HAR.
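For context on the purely geometric baselines the abstract compares against, here is a minimal sketch of greedy k-center selection. It illustrates the classical baseline only, not the proposed LLM-guided method; the function name `k_center_greedy` and the 2-D toy points are hypothetical.

```python
import math


def k_center_greedy(points, k, first=0):
    """Pick k indices so that each new pick is farthest from those already chosen."""
    chosen = [first]
    # Distance from every point to its nearest chosen exemplar so far.
    min_dist = [math.dist(p, points[first]) for p in points]
    while len(chosen) < min(k, len(points)):
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen


# Toy feature vectors (assumption): two tight pairs plus one isolated point.
points = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 5)]
selected = k_center_greedy(points, k=3)
print(selected)
```

Because the rule looks only at distances, it cannot tell apart classes that overlap in feature space, which is the gap the semantic prior is meant to address.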

Title: Hallucination Detection and Evaluation of Large Language Model

Authors: Chenggong Zhang, Haopeng Wang
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2512.22416
Pdf URL: https://arxiv.org/pdf/2512.22416
Copy Paste: [[2512.22416]] Hallucination Detection and Evaluation of Large Language Model(https://arxiv.org/abs/2512.22416)
Keywords: language model, llm, hallucination
Abstract: Hallucinations in Large Language Models (LLMs) pose a significant challenge, generating misleading or unverifiable content that undermines trust and reliability. Existing evaluation methods, such as KnowHalu, employ multi-stage verification but suffer from high computational costs. To address this, we integrate the Hughes Hallucination Evaluation Model (HHEM), a lightweight classification-based framework that operates independently of LLM-based judgments, significantly improving efficiency while maintaining high detection accuracy. We conduct a comparative analysis of hallucination detection methods across various LLMs, evaluating True Positive Rate (TPR), True Negative Rate (TNR), and Accuracy on question-answering (QA) and summarization tasks. Our results show that HHEM reduces evaluation time from 8 hours to 10 minutes, while HHEM with non-fabrication checking achieves the highest accuracy (82.2%) and TPR (78.9%). However, HHEM struggles with localized hallucinations in summarization tasks. To address this, we introduce segment-based retrieval, improving detection by verifying smaller text components. Additionally, our cumulative distribution function (CDF) analysis indicates that larger models (7B-9B parameters) generally exhibit fewer hallucinations, while intermediate-sized models show higher instability. These findings highlight the need for structured evaluation frameworks that balance computational efficiency with robust factual validation, enhancing the reliability of LLM-generated content.
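The three metrics named in the abstract follow directly from confusion counts. The sketch below shows the arithmetic on made-up labels, where 1 marks a hallucinated answer; the function name `detection_metrics` and the data are hypothetical and unrelated to the paper's results.

```python
def detection_metrics(labels, preds):
    """Compute TPR, TNR and accuracy for binary hallucination detection."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    return {
        "TPR": tp / (tp + fn) if tp + fn else 0.0,  # hallucinations caught
        "TNR": tn / (tn + fp) if tn + fp else 0.0,  # faithful answers passed
        "Accuracy": (tp + tn) / len(labels),
    }


# Toy labels (assumption): 4 hallucinated and 4 faithful answers.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
preds = [1, 0, 1, 0, 0, 1, 0, 1]
metrics = detection_metrics(labels, preds)
print(metrics)
```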

Title: HiFi-RAG: Hierarchical Content Filtering and Two-Pass Generation for Open-Domain RAG

Authors: Cattalyya Nuengsigkapian
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2512.22442
Pdf URL: https://arxiv.org/pdf/2512.22442
Copy Paste: [[2512.22442]] HiFi-RAG: Hierarchical Content Filtering and Two-Pass Generation for Open-Domain RAG(https://arxiv.org/abs/2512.22442)
Keywords: retrieval-augmented generation, agent
Abstract: Retrieval-Augmented Generation (RAG) in open-domain settings faces significant challenges regarding irrelevant information in retrieved documents and the alignment of generated answers with user intent. We present HiFi-RAG (Hierarchical Filtering RAG), the winning closed-source system in the Text-to-Text static evaluation of the MMU-RAGent NeurIPS 2025 Competition. Our approach moves beyond standard embedding-based retrieval via a multi-stage pipeline. We leverage the speed and cost-efficiency of Gemini 2.5 Flash (4-6x cheaper than Pro) for query formulation, hierarchical content filtering, and citation attribution, while reserving the reasoning capabilities of Gemini 2.5 Pro for final answer generation. On the MMU-RAGent validation set, our system outperformed the baseline, improving ROUGE-L to 0.274 (+19.6%) and DeBERTaScore to 0.677 (+6.2%). On Test2025, our custom dataset evaluating questions that require post-cutoff knowledge (post January 2025), HiFi-RAG outperforms the parametric baseline by 57.4% in ROUGE-L and 14.9% in DeBERTaScore.

Title: Exploring the Vertical-Domain Reasoning Capabilities of Large Language Models

Authors: Jie Zhou, Xin Chen, Jie Zhang, Zhe Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.22443
Pdf URL: https://arxiv.org/pdf/2512.22443
Copy Paste: [[2512.22443]] Exploring the Vertical-Domain Reasoning Capabilities of Large Language Models(https://arxiv.org/abs/2512.22443)
Keywords: language model, gpt, llm, prompt
Abstract: Large Language Models (LLMs) are reshaping learning paradigms, cognitive processes, and research methodologies across a wide range of domains. Integrating LLMs with professional fields and redefining the relationship between LLMs and domain-specific applications has become a critical challenge for promoting enterprise digital transformation and broader social development. To effectively integrate LLMs into the accounting domain, it is essential to understand their domain-specific reasoning capabilities. This study introduces the concept of vertical-domain accounting reasoning and establishes evaluation criteria by analyzing the training data characteristics of representative GLM-series models. These criteria provide a foundation for subsequent research on reasoning paradigms and offer benchmarks for improving accounting reasoning performance. Based on this framework, we evaluate several representative models, including GLM-6B, GLM-130B, GLM-4, and OpenAI GPT-4, on a set of accounting reasoning tasks. Experimental results show that different prompt engineering strategies lead to varying degrees of performance improvement across models, with GPT-4 achieving the strongest accounting reasoning capability. However, current LLMs still fall short of real-world application requirements. In particular, further optimization is needed for deployment in enterprise-level accounting scenarios to fully realize the potential value of LLMs in this domain.

Title: Learning When Not to Attend Globally

Authors: Xuan Luo, Kailai Zhang, Xifeng Yan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.22562
Pdf URL: https://arxiv.org/pdf/2512.22562
Copy Paste: [[2512.22562]] Learning When Not to Attend Globally(https://arxiv.org/abs/2512.22562)
Keywords: language model, llm
Abstract: When reading books, humans focus primarily on the current page, flipping back to recap prior context only when necessary. Similarly, we demonstrate that Large Language Models (LLMs) can learn to dynamically determine when to attend to global context. We propose All-or-Here Attention (AHA), which utilizes a binary router per attention head to dynamically toggle between full attention and local sliding window attention for each token. Our results indicate that with a window size of 256 tokens, up to 93% of the original full attention operations can be replaced by sliding window attention without performance loss. Furthermore, by evaluating AHA across various window sizes, we identify a long-tail distribution in context dependency, where the necessity for full attention decays rapidly as the local window expands. By decoupling local processing from global access, AHA reveals that full attention is largely redundant, and that efficient inference requires only on-demand access to the global context.
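The per-token toggle can be pictured as choosing which key positions a query may see. The sketch below is a toy model of that choice for a single head under causal masking; the gate values are hard-coded here, whereas the abstract describes a learned binary router, and the names `attended_positions` and `attention_cost` are hypothetical.

```python
def attended_positions(t, use_global, window):
    """Key positions visible to query position t under causal masking."""
    start = 0 if use_global else max(0, t - window + 1)
    return list(range(start, t + 1))


def attention_cost(gates, window):
    """Total number of query-key pairs scored for one head."""
    return sum(len(attended_positions(t, g, window)) for t, g in enumerate(gates))


# Toy routing decisions (assumption): only token 5 asks for global context.
gates = [False] * 8
gates[5] = True

full_cost = attention_cost([True] * 8, window=3)
mixed_cost = attention_cost(gates, window=3)
print(full_cost, mixed_cost)
```

With longer sequences the gap widens, since a local row costs at most `window` pairs while a global row grows with the position index.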

Title: Structured Prompting and LLM Ensembling for Multimodal Conversational Aspect-based Sentiment Analysis

Authors: Zhiqiang Gao, Shihao Gao, Zixing Zhang, Yihao Guo, Hongyu Chen, Jing Han
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.22603
Pdf URL: https://arxiv.org/pdf/2512.22603
Copy Paste: [[2512.22603]] Structured Prompting and LLM Ensembling for Multimodal Conversational Aspect-based Sentiment Analysis(https://arxiv.org/abs/2512.22603)
Keywords: language model, llm, prompt
Abstract: Understanding sentiment in multimodal conversations is a complex yet crucial challenge toward building emotionally intelligent AI systems. The Multimodal Conversational Aspect-based Sentiment Analysis (MCABSA) Challenge invited participants to tackle two demanding subtasks: (1) extracting a comprehensive sentiment sextuple, including holder, target, aspect, opinion, sentiment, and rationale from multi-speaker dialogues, and (2) detecting sentiment flipping, that is, identifying dynamic sentiment shifts and their underlying triggers. For Subtask-I, in the present paper, we designed a structured prompting pipeline that guided large language models (LLMs) to sequentially extract sentiment components with refined contextual understanding. For Subtask-II, we further leveraged the complementary strengths of three LLMs through ensembling to robustly identify sentiment transitions and their triggers. Our system achieved a 47.38% average score on Subtask-I and a 74.12% exact match F1 on Subtask-II, showing the effectiveness of step-wise refinement and ensemble strategies in rich, multimodal sentiment analysis tasks.

Title: Chain-of-thought Reviewing and Correction for Time Series Question Answering

Authors: Chen Su, Yuanhe Tian, Yan Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.22627
Pdf URL: https://arxiv.org/pdf/2512.22627
Copy Paste: [[2512.22627]] Chain-of-thought Reviewing and Correction for Time Series Question Answering(https://arxiv.org/abs/2512.22627)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: With the advancement of large language models (LLMs), diverse time series analysis tasks are reformulated as time series question answering (TSQA) through a unified natural language interface. However, existing LLM-based approaches largely adopt general natural language processing techniques and are prone to reasoning errors when handling complex numerical sequences. Different from purely textual tasks, time series data are inherently verifiable, enabling consistency checking between reasoning steps and the original input. Motivated by this property, we propose T3LLM, which performs multi-step reasoning with an explicit correction mechanism for time series question answering. The T3LLM framework consists of three LLMs, namely, a worker, a reviewer, and a student, that are responsible for generation, review, and reasoning learning, respectively. Within this framework, the worker generates step-wise chains of thought (CoTs) under structured prompts, while the reviewer inspects the reasoning, identifies erroneous steps, and provides corrective comments. The collaboratively generated, corrected CoTs are used to fine-tune the student model, internalizing multi-step reasoning and self-correction into its parameters. Experiments on multiple real-world TSQA benchmarks demonstrate that T3LLM achieves state-of-the-art performance over strong LLM-based baselines.

Title: M2G-Eval: Enhancing and Evaluating Multi-granularity Multilingual Code Generation

Authors: Fanglin Xu, Wei Zhang, Jian Yang, Guo Chen, Aishan Liu, Zhoujun Li, Xianglong Liu, Bryan Dai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.22628
Pdf URL: https://arxiv.org/pdf/2512.22628
Copy Paste: [[2512.22628]] M2G-Eval: Enhancing and Evaluating Multi-granularity Multilingual Code Generation(https://arxiv.org/abs/2512.22628)
Keywords: language model, llm
Abstract: The rapid advancement of code large language models (LLMs) has sparked significant research interest in systematically evaluating their code generation capabilities, yet existing benchmarks predominantly assess models at a single structural granularity and focus on limited programming languages, obscuring fine-grained capability variations across different code scopes and multilingual scenarios. We introduce M2G-Eval, a multi-granularity, multilingual framework for evaluating code generation in large language models (LLMs) across four levels: Class, Function, Block, and Line. Spanning 18 programming languages, M2G-Eval includes 17K+ training tasks and 1,286 human-annotated, contamination-controlled test instances. We develop M2G-Eval-Coder models by training Qwen3-8B with supervised fine-tuning and Group Relative Policy Optimization. Evaluating 30 models (28 state-of-the-art LLMs plus our two M2G-Eval-Coder variants) reveals three main findings: (1) an apparent difficulty hierarchy, with Line-level tasks easiest and Class-level most challenging; (2) widening performance gaps between full- and partial-granularity languages as task complexity increases; and (3) strong cross-language correlations, suggesting that models learn transferable programming concepts. M2G-Eval enables fine-grained diagnosis of code generation capabilities and highlights persistent challenges in synthesizing complex, long-form code.

Title: On the Role of Discreteness in Diffusion LLMs

Authors: Ziqi Jin, Bin Wang, Xiang Lin, Lidong Bing, Aixin Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.22630
Pdf URL: https://arxiv.org/pdf/2512.22630
Copy Paste: [[2512.22630]] On the Role of Discreteness in Diffusion LLMs(https://arxiv.org/abs/2512.22630)
Keywords: language model, llm
Abstract: Diffusion models offer appealing properties for language generation, such as parallel decoding and iterative refinement, but the discrete and highly structured nature of text challenges the direct application of diffusion principles. In this paper, we revisit diffusion language modeling from the view of diffusion process and language modeling, and outline five properties that separate diffusion mechanics from language-specific requirements. We first categorize existing approaches into continuous diffusion in embedding space and discrete diffusion over tokens. We then show that each satisfies only part of the five essential properties and therefore reflects a structural trade-off. Through analyses of recent large diffusion language models, we identify two central issues: (i) uniform corruption does not respect how information is distributed across positions, and (ii) token-wise marginal training cannot capture multi-token dependencies during parallel decoding. These observations motivate diffusion processes that align more closely with the structure of text, and encourage future work toward more coherent diffusion language models.

Title: Evaluating GRPO and DPO for Faithful Chain-of-Thought Reasoning in LLMs

Authors: Hadi Mohammadi, Tamas Kozak, Anastasia Giachanou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.22631
Pdf URL: https://arxiv.org/pdf/2512.22631
Copy Paste: [[2512.22631]] Evaluating GRPO and DPO for Faithful Chain-of-Thought Reasoning in LLMs(https://arxiv.org/abs/2512.22631)
Keywords: language model, llm, chain-of-thought
Abstract: Chain-of-thought (CoT) reasoning has emerged as a powerful technique for improving the problem-solving capabilities of large language models (LLMs), particularly for tasks requiring multi-step reasoning. However, recent studies show that CoT explanations often fail to reflect the model's actual reasoning process, as models may produce coherent yet misleading justifications or modify answers without acknowledging external cues. Such discrepancies undermine the reliability of CoT-based methods for safety supervision and alignment monitoring, as models can generate plausible but deceptive rationales for incorrect answers. To better understand this limitation, we evaluate two optimization methods, Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO), in their ability to improve CoT faithfulness. Our experiments show that GRPO achieves higher performance than DPO in larger models, with the Qwen2.5-14B-Instruct model attaining the best results across all evaluation metrics. Both approaches exhibit positive correlations between model size and performance, but GRPO shows greater potential for improving faithfulness metrics, albeit with less stable behavior at smaller scales. These results suggest that GRPO offers a promising direction for developing more transparent and trustworthy reasoning in LLMs.
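As background on what distinguishes GRPO from DPO, GRPO scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows that group normalization only; the use of the population standard deviation, the `eps` term, the function name, and the toy rewards are assumptions, and the abstract does not describe the authors' reward design.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Toy faithfulness rewards (assumption) for four sampled chains of thought.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])
```

When all responses in a group receive the same reward, every advantage is zero and the group contributes no learning signal, which may be one source of the instability at smaller scales that the abstract mentions.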

Title: Conformal Prediction Sets for Next-Token Prediction in Large Language Models: Balancing Coverage Guarantees with Set Efficiency

Authors: Yoshith Roy Kotla, Varshith Roy Kotla
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.22682
Pdf URL: https://arxiv.org/pdf/2512.22682
Copy Paste: [[2512.22682]] Conformal Prediction Sets for Next-Token Prediction in Large Language Models: Balancing Coverage Guarantees with Set Efficiency(https://arxiv.org/abs/2512.22682)
Keywords: language model, llm
Abstract: Deploying large language models (LLMs) in high-stakes domains requires rigorous uncertainty quantification, yet standard softmax probabilities are often poorly calibrated. We present a systematic study of Adaptive Prediction Sets (APS) applied to next-token prediction in transformer-based models with large vocabularies (greater than 250,000 tokens). Our central contribution is the identification of a coverage-efficiency tradeoff: while naive conformal prediction achieves valid coverage, it produces prediction sets of hundreds of tokens, rendering them uninformative. We propose Vocabulary-Aware Conformal Prediction (VACP), a framework that leverages semantic masking and temperature-adjusted scoring to reduce the effective prediction space while provably maintaining marginal coverage. Experiments on Gemma-2B using SQuAD and WikiText benchmarks demonstrate that VACP achieves 89.7 percent empirical coverage (90 percent target) while reducing the mean prediction set size from 847 tokens to 4.3 tokens, a 197x improvement in efficiency. We provide a theoretical analysis of vocabulary reduction and release our implementation for reproducibility.
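
The baseline that the abstract calls naive conformal prediction with Adaptive Prediction Sets can be sketched in a few lines. This is a minimal, non-randomized APS illustration in pure Python, not the authors' released implementation: the function names and toy probabilities are assumptions, and VACP's semantic masking and temperature-adjusted scoring are not reproduced.

```python
import math


def aps_score(probs, true_idx):
    """Probability mass accumulated, in descending order, up to and including the true token."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    total = 0.0
    for i in order:
        total += probs[i]
        if i == true_idx:
            return total
    raise ValueError("true_idx is not a valid token index")


def conformal_threshold(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # 1-based rank of the quantile
    if k > n:
        return float("inf")  # too few calibration points: return the whole vocabulary
    return sorted(scores)[k - 1]


def prediction_set(probs, qhat):
    """Add tokens in descending probability until the accumulated mass reaches qhat."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    chosen, total = [], 0.0
    for i in order:
        chosen.append(i)
        total += probs[i]
        if total >= qhat:
            break
    return chosen


# Toy usage: a 3-token vocabulary and a threshold assumed to come from calibration.
print(prediction_set([0.5, 0.3, 0.2], qhat=0.7))
```

With a flat next-token distribution over a very large vocabulary, many tokens are needed before the accumulated mass reaches the threshold, which is the set-size problem the abstract describes.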

Title: Beg to Differ: Understanding Reasoning-Answer Misalignment Across Languages

Authors: Anaelia Ovalle, Candace Ross, Sebastian Ruder, Adina Williams, Karen Ullrich, Mark Ibrahim, Levent Sagun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.22712
Pdf URL: https://arxiv.org/pdf/2512.22712
Copy Paste: [[2512.22712]] Beg to Differ: Understanding Reasoning-Answer Misalignment Across Languages(https://arxiv.org/abs/2512.22712)
Keywords: language model, prompt, chain-of-thought
Abstract: Large language models demonstrate strong reasoning capabilities through chain-of-thought prompting, but whether this reasoning quality transfers across languages remains underexplored. We introduce a human-validated framework to evaluate whether model-generated reasoning traces logically support their conclusions across languages. Analyzing 65k reasoning traces from GlobalMMLU questions across 6 languages and 6 frontier models, we uncover a critical blind spot: while models achieve high task accuracy, their reasoning can fail to support their conclusions. Reasoning traces in non-Latin scripts show at least twice as much misalignment between reasoning and conclusions as those in Latin scripts. We develop an error taxonomy through human annotation to characterize these failures, finding they stem primarily from evidential errors (unsupported claims, ambiguous facts) followed by illogical reasoning steps. Our findings demonstrate that current multilingual evaluation practices provide an incomplete picture of model reasoning capabilities and highlight the need for reasoning-aware evaluation frameworks.

Title: Mitigating Social Desirability Bias in Random Silicon Sampling

Authors: Sashank Chapala, Maksym Mironov, Songgaojun Deng
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2512.22725
Pdf URL: https://arxiv.org/pdf/2512.22725
Copy Paste: [[2512.22725]] Mitigating Social Desirability Bias in Random Silicon Sampling(https://arxiv.org/abs/2512.22725)
Keywords: language model, gpt, llm, prompt
Abstract: Large Language Models (LLMs) are increasingly used to simulate population responses, a method known as "Silicon Sampling". However, responses to socially sensitive questions frequently exhibit Social Desirability Bias (SDB), diverging from real human data toward socially acceptable answers. Existing studies on social desirability bias in LLM-based sampling remain limited. In this work, we investigate whether minimal, psychologically grounded prompt wording can mitigate this bias and improve alignment between silicon and human samples. We conducted a study using data from the American National Election Study (ANES) on three LLMs from two model families: the open-source Llama-3.1 series and GPT-4.1-mini. We first replicate a baseline silicon sampling study, confirming the persistent Social Desirability Bias. We then test four prompt-based mitigation methods: reformulated (neutral, third-person phrasing), reverse-coded (semantic inversion), and two meta-instructions, priming and preamble, respectively encouraging analytics and sincerity. Alignment with ANES is evaluated using Jensen-Shannon Divergence with bootstrap confidence intervals. Our results demonstrate that reformulated prompts most effectively improve alignment by reducing distribution concentration on socially acceptable answers and achieving distributions closer to ANES. Reverse-coding produced mixed results across eligible items, while the priming and preamble instructions encouraged response uniformity and showed no systematic benefit for bias mitigation. Our findings validate the efficacy of prompt-based framing controls in mitigating inherent Social Desirability Bias in LLMs, providing a practical path toward more representative silicon samples.
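
The alignment measure named in the abstract, Jensen-Shannon divergence with bootstrap confidence intervals, can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the base-2 logarithm, the percentile bootstrap, and the toy responses are all assumptions.

```python
import math
import random


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so in [0, 1]) between two discrete distributions."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def distribution(responses, categories):
    """Relative frequency of each answer category in a list of responses."""
    n = len(responses)
    return [sum(r == c for r in responses) / n for c in categories]


def bootstrap_jsd_ci(human, silicon, categories, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the JSD between two response samples."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        h = rng.choices(human, k=len(human))      # resample with replacement
        s = rng.choices(silicon, k=len(silicon))
        stats.append(js_divergence(distribution(h, categories),
                                   distribution(s, categories)))
    stats.sort()
    lower = stats[int((1 - level) / 2 * n_boot)]
    upper = stats[min(n_boot - 1, int((1 + level) / 2 * n_boot))]
    return lower, upper


# Toy survey item with hypothetical human and LLM-simulated answers.
cats = ["agree", "neutral", "disagree"]
human = ["agree"] * 40 + ["neutral"] * 30 + ["disagree"] * 30
silicon = ["agree"] * 70 + ["neutral"] * 20 + ["disagree"] * 10
print(js_divergence(distribution(human, cats), distribution(silicon, cats)))
print(bootstrap_jsd_ci(human, silicon, cats))
```

A silicon sample that piles up on the socially acceptable answer shows a larger divergence from the human distribution; a mitigation helps if it moves the divergence, and its interval, toward zero.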

Title: WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference

Authors: Aiwei Liu, Minghua He, Shaoxun Zeng, Sijun Zhang, Linhao Zhang, Chuhan Wu, Wei Jia, Yuan Liu, Xiao Zhou, Jie Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.22737
Pdf URL: https://arxiv.org/pdf/2512.22737
Copy Paste: [[2512.22737]] WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference(https://arxiv.org/abs/2512.22737)
Keywords: language model, llm
Abstract: Autoregressive (AR) generation is the standard decoding paradigm for Large Language Models (LLMs), but its token-by-token nature limits parallelism at inference time. Diffusion Language Models (DLLMs) offer parallel decoding by recovering multiple masked tokens per step; however, in practice they often fail to translate this parallelism into deployment speed gains over optimized AR engines (e.g., vLLM). A key reason is that many DLLMs rely on bidirectional attention, which breaks standard prefix KV caching and forces repeated contextualization, undermining efficiency. We propose WeDLM, a diffusion decoding framework built entirely on standard causal attention to make parallel generation prefix-cache friendly. The core idea is to let each masked position condition on all currently observed tokens while keeping a strict causal mask, achieved by Topological Reordering that moves observed tokens to the physical prefix while preserving their logical positions. Building on this property, we introduce a streaming decoding procedure that continuously commits confident tokens into a growing left-to-right prefix and maintains a fixed parallel workload, avoiding the stop-and-wait behavior common in block diffusion methods. Experiments show that WeDLM preserves the quality of strong AR backbones while delivering substantial speedups, approaching 3x on challenging reasoning benchmarks and up to 10x in low-entropy generation regimes; critically, our comparisons are against AR baselines served by vLLM under matched deployment settings, demonstrating that diffusion-style decoding can outperform an optimized AR engine in practice.
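
The reordering idea in the abstract can be illustrated with a toy sketch: observed tokens move to the front of the physical sequence, masked slots follow, and every token keeps its logical position id. Everything here is inferred from the abstract alone; the function names and the mask symbol are hypothetical, and WeDLM's actual implementation may differ.

```python
MASK = "<mask>"  # hypothetical placeholder for a not-yet-decoded token


def topological_reorder(tokens):
    """Return (physical_tokens, position_ids) with observed tokens first.

    position_ids[j] is the logical position of the token stored at physical slot j,
    so positional information is preserved even though storage order changes.
    """
    observed = [i for i, t in enumerate(tokens) if t != MASK]
    masked = [i for i, t in enumerate(tokens) if t == MASK]
    order = observed + masked
    return [tokens[i] for i in order], order


def causal_visible(physical_index):
    """Physical slots visible to a query at physical_index under a strict causal mask."""
    return list(range(physical_index + 1))


tokens = ["The", MASK, "sat", MASK, "mat"]
physical, position_ids = topological_reorder(tokens)
print(physical)
print(position_ids)
```

Because all observed tokens sit in the physical prefix, each masked slot sees every observed token through an ordinary causal mask, which is what keeps the layout compatible with prefix KV caching.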

Title: Harnessing Large Language Models for Biomedical Named Entity Recognition

Authors: Jian Chen, Leilei Su, Cong Sun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.22738
Pdf URL: https://arxiv.org/pdf/2512.22738
Copy Paste: [[2512.22738]] Harnessing Large Language Models for Biomedical Named Entity Recognition(https://arxiv.org/abs/2512.22738)
Keywords: language model, llm
Abstract: Background and Objective: Biomedical Named Entity Recognition (BioNER) is a foundational task in medical informatics, crucial for downstream applications like drug discovery and clinical trial matching. However, adapting general-domain Large Language Models (LLMs) to this task is often hampered by their lack of domain-specific knowledge and the performance degradation caused by low-quality training data. To address these challenges, we introduce BioSelectTune, a highly efficient, data-centric framework for fine-tuning LLMs that prioritizes data quality over quantity. Methods and Results: BioSelectTune reformulates BioNER as a structured JSON generation task and leverages our novel Hybrid Superfiltering strategy, a weak-to-strong data curation method that uses a homologous weak model to distill a compact, high-impact training dataset. Conclusions: Through extensive experiments, we demonstrate that BioSelectTune achieves state-of-the-art (SOTA) performance across multiple BioNER benchmarks. Notably, our model, trained on only 50% of the curated positive data, not only surpasses the fully-trained baseline but also outperforms powerful domain-specialized models like BioMedBERT.
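
Treating BioNER as structured JSON generation means the model's output must be parsed and checked before it can be scored. The sketch below shows one way this could look; the schema (an "entities" list of objects with "text" and "type" keys), the function name, and the sample output are hypothetical, since the abstract does not give BioSelectTune's actual format.

```python
import json


def parse_entities(model_output, text):
    """Parse a JSON entity list and keep entities whose mention occurs verbatim in text."""
    try:
        data = json.loads(model_output)
    except json.JSONDecodeError:
        return []  # malformed generations contribute no entities
    items = data.get("entities") if isinstance(data, dict) else None
    if not isinstance(items, list):
        return []
    entities = []
    for item in items:
        if not isinstance(item, dict):
            continue
        mention, label = item.get("text"), item.get("type")
        # Dropping mentions absent from the input guards against hallucinated spans.
        if isinstance(mention, str) and isinstance(label, str) and mention in text:
            entities.append((mention, label))
    return entities


sentence = "Aspirin reduces the risk of myocardial infarction."
output = ('{"entities": [{"text": "Aspirin", "type": "Chemical"}, '
          '{"text": "myocardial infarction", "type": "Disease"}, '
          '{"text": "stroke", "type": "Disease"}]}')
print(parse_entities(output, sentence))
```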

Title: Text-Routed Sparse Mixture-of-Experts Model with Explanation and Temporal Alignment for Multi-Modal Sentiment Analysis

Authors: Dongning Rao, Yunbiao Zeng, Zhihua Jiang, Jujian Lv
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.22741
Pdf URL: https://arxiv.org/pdf/2512.22741
Copy Paste: [[2512.22741]] Text-Routed Sparse Mixture-of-Experts Model with Explanation and Temporal Alignment for Multi-Modal Sentiment Analysis(https://arxiv.org/abs/2512.22741)
Keywords: language model, llm
Abstract: Human-interaction-involved applications underscore the need for Multi-modal Sentiment Analysis (MSA). Although many approaches have been proposed to address the subtle emotions in different modalities, the power of explanations and temporal alignments is still underexplored. Thus, this paper proposes the Text-routed sparse mixture-of-Experts model with eXplanation and Temporal alignment for MSA (TEXT). TEXT first augments explanations for MSA via Multi-modal Large Language Models (MLLM), and then aligns the representations of audio and video through a novel temporality-oriented neural network block. TEXT aligns different modalities with explanations and facilitates a new text-routed sparse mixture-of-experts with gate fusion. Our temporal alignment block merges the benefits of Mamba and temporal cross-attention. As a result, TEXT achieves the best performance across four datasets among all tested models, including three recently proposed approaches and three MLLMs. TEXT wins on at least four of the six metrics. For example, TEXT decreases the mean absolute error to 0.353 on the CH-SIMS dataset, a 13.5% reduction compared with recently proposed approaches.

Title: Fake News Classification in Urdu: A Domain Adaptation Approach for a Low-Resource Language

Authors: Muhammad Zain Ali, Bernhard Pfahringer, Tony Smith
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.22778
Pdf URL: https://arxiv.org/pdf/2512.22778
Copy Paste: [[2512.22778]] Fake News Classification in Urdu: A Domain Adaptation Approach for a Low-Resource Language(https://arxiv.org/abs/2512.22778)
Keywords: language model
Abstract: Misinformation on social media is a widely acknowledged issue, and researchers worldwide are actively engaged in its detection. However, low-resource languages such as Urdu have received limited attention in this domain. An obvious approach is to utilize a multilingual pretrained language model and fine-tune it for a downstream classification task, such as misinformation detection. However, these models struggle with domain-specific terms, leading to suboptimal performance. To address this, we investigate the effectiveness of domain adaptation before fine-tuning for fake news classification in Urdu, employing a staged training approach to optimize model generalization. We evaluate two widely used multilingual models, XLM-RoBERTa and mBERT, and apply domain-adaptive pretraining using a publicly available Urdu news corpus. Experiments on four publicly available Urdu fake news datasets show that domain-adapted XLM-R consistently outperforms its vanilla counterpart, while domain-adapted mBERT exhibits mixed results.

Title: CNSight: Evaluation of Clinical Note Segmentation Tools

Authors: Risha Surana, Adrian Law, Sunwoo Kim, Rishab Sridhar, Angxiao Han, Peiyu Hong
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.22795
Pdf URL: https://arxiv.org/pdf/2512.22795
Copy Paste: [[2512.22795]] CNSight: Evaluation of Clinical Note Segmentation Tools(https://arxiv.org/abs/2512.22795)
Keywords: language model, gpt
Abstract: Clinical notes are often stored in unstructured or semi-structured formats after extraction from electronic medical record (EMR) systems, which complicates their use for secondary analysis and downstream clinical applications. Reliable identification of section boundaries is a key step toward structuring these notes, as sections such as history of present illness, medications, and discharge instructions each provide distinct clinical contexts. In this work, we evaluate rule-based baselines, domain-specific transformer models, and large language models for clinical note segmentation using a curated dataset of 1,000 notes from MIMIC-IV. Our experiments show that large API-based models achieve the best overall performance, with GPT-5-mini reaching a best average F1 of 72.4 across sentence-level and freetext segmentation. Lightweight baselines remain competitive on structured sentence-level tasks but falter on unstructured freetext. Our results provide guidance for method selection and lay the groundwork for downstream tasks such as information extraction, cohort identification, and automated summarization.

Title: AutoForge: Automated Environment Synthesis for Agentic Reinforcement Learning

Authors: Shihao Cai, Runnan Fang, Jialong Wu, Baixuan Li, Xinyu Wang, Yong Jiang, Liangcai Su, Liwen Zhang, Wenbiao Yin, Zhen Zhang, Fuli Feng, Pengjun Xie, Xiaobin Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.22857
Pdf URL: https://arxiv.org/pdf/2512.22857
Copy Paste: [[2512.22857]] AutoForge: Automated Environment Synthesis for Agentic Reinforcement Learning(https://arxiv.org/abs/2512.22857)
Keywords: agent
Abstract: Conducting reinforcement learning (RL) in simulated environments offers a cost-effective and highly scalable way to enhance language-based agents. However, previous work has been limited to semi-automated environment synthesis or tasks lacking sufficient difficulty, offering little breadth or depth. In addition, the instability of simulated users integrated into these environments, along with the heterogeneity across simulated environments, poses further challenges for agentic RL. In this work, we propose: (1) a unified pipeline for automated and scalable synthesis of simulated environments associated with high-difficulty but easily verifiable tasks; and (2) an environment level RL algorithm that not only effectively mitigates user instability but also performs advantage estimation at the environment level, thereby improving training efficiency and stability. Comprehensive evaluations on agentic benchmarks, including tau-bench, tau2-Bench, and VitaBench, validate the effectiveness of our proposed method. Further in-depth analyses underscore its out-of-domain generalization.

Title: Diversity or Precision? A Deep Dive into Next Token Prediction

Authors: Haoyuan Wu, Hai Wang, Jiajia Wu, Jinxiang Ou, Keyao Wang, Weile Chen, Zihao Zheng, Bei Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.22955
Pdf URL: https://arxiv.org/pdf/2512.22955
Copy Paste: [[2512.22955]] Diversity or Precision? A Deep Dive into Next Token Prediction(https://arxiv.org/abs/2512.22955)
Keywords: language model, llm
Abstract: Recent advancements have shown that reinforcement learning (RL) can substantially improve the reasoning abilities of large language models (LLMs). The effectiveness of such RL training, however, depends critically on the exploration space defined by the pre-trained model's token-output distribution. In this paper, we revisit the standard cross-entropy loss, interpreting it as a specific instance of policy gradient optimization applied within a single-step episode. To systematically study how the pre-trained distribution shapes the exploration potential for subsequent RL, we propose a generalized pre-training objective that adapts on-policy RL principles to supervised learning. By framing next-token prediction as a stochastic decision process, we introduce a reward-shaping strategy that explicitly balances diversity and precision. Our method employs a positive reward scaling factor to control probability concentration on ground-truth tokens and a rank-aware mechanism that treats high-ranking and low-ranking negative tokens asymmetrically. This allows us to reshape the pre-trained token-output distribution and investigate how to provide a more favorable exploration space for RL, ultimately enhancing end-to-end reasoning performance. Contrary to the intuition that higher distribution entropy facilitates effective exploration, we find that imposing a precision-oriented prior yields a superior exploration space for RL.
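
The reading of cross-entropy as a single-step policy gradient rests on a textbook identity: for a softmax policy, the gradient of the cross-entropy loss with respect to the logits is the negative of the REINFORCE term for taking the ground-truth token as the action with reward 1. The sketch below checks this numerically. It illustrates only that identity; the function names are assumptions, and the paper's generalized objective (reward scaling, rank-aware treatment of negative tokens) is not reproduced.

```python
import math


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]


def ce_grad(z, y):
    """Gradient of -log softmax(z)[y] with respect to the logits z."""
    p = softmax(z)
    return [p[i] - (1.0 if i == y else 0.0) for i in range(len(z))]


def reinforce_grad(z, action, reward):
    """reward * d log pi(action) / dz for a softmax policy over the logits z."""
    p = softmax(z)
    return [reward * ((1.0 if i == action else 0.0) - p[i]) for i in range(len(z))]


logits, target = [0.2, -1.0, 0.5], 2
print(ce_grad(logits, target))
print(reinforce_grad(logits, target, reward=1.0))
```

Minimizing cross-entropy therefore ascends the same direction as this one-step policy-gradient term; changing the reward assigned to the ground-truth and negative tokens is where an objective of the kind the abstract describes would depart from plain cross-entropy.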

Title: Prompt engineering does not universally improve Large Language Model performance across clinical decision-making tasks

Authors: Mengdi Chai, Ali R. Zomorrodi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.22966
Pdf URL: https://arxiv.org/pdf/2512.22966
Copy Paste: [[2512.22966]] Prompt engineering does not universally improve Large Language Model performance across clinical decision-making tasks(https://arxiv.org/abs/2512.22966)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Large Language Models (LLMs) have demonstrated promise in medical knowledge assessments, yet their practical utility in real-world clinical decision-making remains underexplored. In this study, we evaluated the performance of three state-of-the-art LLMs (ChatGPT-4o, Gemini 1.5 Pro, and Llama 3.3 70B) in clinical decision support across the entire clinical reasoning workflow of a typical patient encounter. Using 36 case studies, we first assessed the LLMs' out-of-the-box performance across five key sequential clinical decision-making tasks under two temperature settings (default vs. zero): differential diagnosis, essential immediate steps, relevant diagnostic testing, final diagnosis, and treatment recommendation. All models showed high variability by task, achieving near-perfect accuracy in final diagnosis, poor performance in relevant diagnostic testing, and moderate performance in the remaining tasks. Furthermore, ChatGPT performed better at zero temperature, whereas Llama showed stronger performance at the default temperature. Next, we assessed whether prompt engineering could enhance LLM performance by applying variations of the MedPrompt framework, incorporating targeted and random dynamic few-shot learning. The results demonstrate that prompt engineering is not a one-size-fits-all solution. While it significantly improved performance on the task with the lowest baseline accuracy (relevant diagnostic testing), it was counterproductive for others. Another key finding was that targeted dynamic few-shot prompting did not consistently outperform random selection, indicating that the presumed benefits of closely matched examples may be counterbalanced by the loss of broader contextual diversity. These findings suggest that the impact of prompt engineering is highly model- and task-dependent, highlighting the need for tailored, context-aware strategies for integrating LLMs into healthcare.

Title: Improving Generalization in LLM Structured Pruning via Function-Aware Neuron Grouping

Authors: Tao Yu, Yongqi An, Kuan Zhu, Guibo Zhu, Ming Tang, Jinqiao Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.23014
Pdf URL: https://arxiv.org/pdf/2512.23014
Copy Paste: [[2512.23014]] Improving Generalization in LLM Structured Pruning via Function-Aware Neuron Grouping(https://arxiv.org/abs/2512.23014)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) demonstrate impressive performance across natural language tasks but incur substantial computational and storage costs due to their scale. Post-training structured pruning offers an efficient solution. However, when few-shot calibration sets fail to adequately reflect the pretraining data distribution, existing methods exhibit limited generalization to downstream tasks. To address this issue, we propose Function-Aware Neuron Grouping (FANG), a post-training pruning framework that alleviates calibration bias by identifying and preserving neurons critical to specific functions. FANG groups neurons with similar functions based on the type of semantic context they process and prunes each group independently. During importance estimation within each group, tokens that strongly correlate with the functional role of the neuron group are given higher weighting. FANG also preserves neurons that contribute across multiple context types. To achieve a better trade-off between sparsity and performance, it allocates sparsity to each block adaptively based on its functional complexity. Experiments show that FANG improves downstream accuracy while preserving language modeling performance. It achieves state-of-the-art (SOTA) results when combined with FLAP and OBC, two representative pruning methods. Specifically, FANG outperforms FLAP and OBC by 1.5% to 8.5% in average accuracy under 30% and 40% sparsity.

Title: LENS: LLM-Enabled Narrative Synthesis for Mental Health by Aligning Multimodal Sensing with Language Models

Authors: Wenxuan Xu, Arvind Pillai, Subigya Nepal, Amanda C Collins, Daniel M Mackin, Michael V Heinz, Tess Z Griffin, Nicholas C Jacobson, Andrew Campbell
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.23025
Pdf URL: https://arxiv.org/pdf/2512.23025
Copy Paste: [[2512.23025]] LENS: LLM-Enabled Narrative Synthesis for Mental Health by Aligning Multimodal Sensing with Language Models(https://arxiv.org/abs/2512.23025)
Keywords: language model, llm
Abstract: Multimodal health sensing offers rich behavioral signals for assessing mental health, yet translating these numerical time-series measurements into natural language remains challenging. Current LLMs cannot natively ingest long-duration sensor streams, and paired sensor-text datasets are scarce. To address these challenges, we introduce LENS, a framework that aligns multimodal sensing data with language models to generate clinically grounded mental-health narratives. LENS first constructs a large-scale dataset by transforming Ecological Momentary Assessment (EMA) responses related to depression and anxiety symptoms into natural-language descriptions, yielding over 100,000 sensor-text QA pairs from 258 participants. To enable native time-series integration, we train a patch-level encoder that projects raw sensor signals directly into an LLM's representation space. Our results show that LENS outperforms strong baselines on standard NLP metrics and task-specific measures of symptom-severity accuracy. A user study with 13 mental-health professionals further indicates that LENS-produced narratives are comprehensive and clinically meaningful. Ultimately, our approach advances LLMs as interfaces for health sensing, providing a scalable path toward models that can reason over raw behavioral signals and support downstream clinical decision-making.

Title: Is Chain-of-Thought Really Not Explainability? Chain-of-Thought Can Be Faithful without Hint Verbalization

Authors: Kerem Zaman, Shashank Srivastava
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.23032
Pdf URL: https://arxiv.org/pdf/2512.23032
Copy Paste: [[2512.23032]] Is Chain-of-Thought Really Not Explainability? Chain-of-Thought Can Be Faithful without Hint Verbalization(https://arxiv.org/abs/2512.23032)
Keywords: prompt, chain-of-thought
Abstract: Recent work, using the Biasing Features metric, labels a CoT as unfaithful if it omits a prompt-injected hint that affected the prediction. We argue this metric confuses unfaithfulness with incompleteness, the lossy compression needed to turn distributed transformer computation into a linear natural language narrative. On multi-hop reasoning tasks with Llama-3 and Gemma-3, many CoTs flagged as unfaithful by Biasing Features are judged faithful by other metrics, exceeding 50% in some models. With a new faithful@k metric, we show that larger inference-time token budgets greatly increase hint verbalization (up to 90% in some settings), suggesting much apparent unfaithfulness is due to tight token limits. Using Causal Mediation Analysis, we further show that even non-verbalized hints can causally mediate prediction changes through the CoT. We therefore caution against relying solely on hint-based evaluations and advocate a broader interpretability toolkit, including causal mediation and corruption-based metrics.

Title: Accelerating Language Model Workflows with Prompt Choreography

Authors: TJ Bai, Jason Eisner
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.23049
Pdf URL: https://arxiv.org/pdf/2512.23049
Copy Paste: [[2512.23049]] Accelerating Language Model Workflows with Prompt Choreography(https://arxiv.org/abs/2512.23049)
Keywords: language model, llm, prompt, agent
Abstract: Large language models are increasingly deployed in multi-agent workflows. We introduce Prompt Choreography, a framework that efficiently executes LLM workflows by maintaining a dynamic, global KV cache. Each LLM call can attend to an arbitrary, reordered subset of previously encoded messages. Parallel calls are supported. Though caching messages' encodings sometimes gives different results from re-encoding them in a new context, we show in diverse settings that fine-tuning the LLM to work with the cache can help it mimic the original results. Prompt Choreography significantly reduces per-message latency (2.0 to 6.2x faster time-to-first-token) and achieves substantial end-to-end speedups (>2.2x) in some workflows dominated by redundant computation.

Title: Reservoir Computing inspired Matrix Multiplication-free Language Model

Authors: Takumi Shiratsuchi, Yuichiro Tanaka, Hakaru Tamukoh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.23145
Pdf URL: https://arxiv.org/pdf/2512.23145
Copy Paste: [[2512.23145]] Reservoir Computing inspired Matrix Multiplication-free Language Model(https://arxiv.org/abs/2512.23145)
Keywords: language model, llm
Abstract: Large language models (LLMs) have achieved state-of-the-art performance in natural language processing; however, their high computational cost remains a major bottleneck. In this study, we target computational efficiency by focusing on a matrix multiplication free language model (MatMul-free LM) and further reducing the training cost through an architecture inspired by reservoir computing. Specifically, we partially fix and share the weights of selected layers in the MatMul-free LM and insert reservoir layers to obtain rich dynamic representations without additional training overhead. Additionally, several operations are combined to reduce memory accesses. Experimental results show that the proposed architecture reduces the number of parameters by up to 19%, training time by 9.9%, and inference time by 8.0%, while maintaining comparable performance to the baseline model.

Title: Not too long do read: Evaluating LLM-generated extreme scientific summaries

Authors: Zhuoqi Lyu, Qing Ke
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.23206
Pdf URL: https://arxiv.org/pdf/2512.23206
Copy Paste: [[2512.23206]] Not too long do read: Evaluating LLM-generated extreme scientific summaries(https://arxiv.org/abs/2512.23206)
Keywords: language model, llm
Abstract: High-quality extreme scientific summaries (TLDRs) facilitate effective science communication. How do large language models (LLMs) perform in generating them? How are LLM-generated summaries different from those written by human experts? Answering these questions is hindered by the lack of a comprehensive, high-quality scientific TLDR dataset, which hampers both the development and evaluation of LLMs' summarization ability. To address this, we propose a novel dataset, BiomedTLDR, containing a large sample of researcher-authored summaries of scientific papers, which leverages the common practice of including authors' comments alongside bibliography items. We then test popular open-weight LLMs on generating TLDRs from abstracts. Our analysis reveals that, although some of them successfully produce human-like summaries, LLMs generally exhibit a greater affinity for the original text's lexical choices and rhetorical structures, and hence tend to be more extractive than abstractive compared to humans. Our code and datasets are available at this https URL (Lyu and Ke, 2025).

Title: Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process

Authors: Zhijun Chen, Zeyu Ji, Qianren Mao, Junhang Cheng, Bangjie Qin, Hao Wu, Zhuoran Li, Jingzheng Li, Kai Sun, Zizhe Wang, Yikun Ban, Zhu Sun, Xiangyang Ji, Hailong Sun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.23213
Pdf URL: https://arxiv.org/pdf/2512.23213
Copy Paste: [[2512.23213]] Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process(https://arxiv.org/abs/2512.23213)
Keywords: language model, llm
Abstract: We propose LLM-PeerReview, an unsupervised LLM Ensemble method that selects the best response from multiple LLM-generated candidates for each query, harnessing the collective wisdom of multiple models with diverse strengths. LLM-PeerReview is built on a novel, peer-review-inspired framework that offers a clear and interpretable mechanism, while remaining fully unsupervised for flexible adaptability and generalization. Specifically, it operates in three stages: for scoring, we use the emerging LLM-as-a-Judge technique to evaluate each response by reusing the multiple LLMs at hand; for reasoning, we apply either a principled graphical-model-based truth inference algorithm or a straightforward averaging strategy to aggregate the multiple scores into a final score for each response; finally, the highest-scoring response is selected as the ensemble output. LLM-PeerReview is conceptually simple and empirically powerful. The two variants of the proposed approach obtain strong results across four datasets, including outperforming the recent advanced model Smoothie-Global by 6.9 and 7.3 percentage points, respectively.
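The three stages named in the LLM-PeerReview abstract (scoring, aggregation, selection) can be illustrated with a minimal Python sketch of the straightforward averaging variant. The score layout, the function name `select_best_response`, and the example numbers are hypothetical and not taken from the paper; the graphical-model truth inference variant is not shown.

```python
from statistics import mean

def select_best_response(scores):
    """Pick the candidate with the highest mean judge score.

    scores[j][c] is the score judge j gave candidate c (hypothetical layout).
    Returns (index of the best candidate, list of per-candidate mean scores).
    """
    num_candidates = len(scores[0])
    # Aggregate: average each candidate's scores over all judges.
    final = [mean(row[c] for row in scores) for c in range(num_candidates)]
    # Select: the highest aggregated score wins (ties go to the lowest index).
    best = max(range(num_candidates), key=lambda c: final[c])
    return best, final

# Illustrative scores from three judge LLMs for three candidate responses.
scores = [
    [7, 9, 6],
    [6, 8, 8],
    [8, 9, 5],
]
best, final = select_best_response(scores)
print(best, final)
```

In this toy input the second candidate (index 1) has the highest average and is selected. A real system would obtain `scores` by prompting each LLM to judge every candidate.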

Title: Anka: A Domain-Specific Language for Reliable LLM Code Generation

Authors: Saif Khalfan Saif Al Mazrouei
Subjects: cs.CL, cs.LG, cs.PL, cs.SE
Abstract URL: https://arxiv.org/abs/2512.23214
Pdf URL: https://arxiv.org/pdf/2512.23214
Copy Paste: [[2512.23214]] Anka: A Domain-Specific Language for Reliable LLM Code Generation(https://arxiv.org/abs/2512.23214)
Keywords: language model, gpt, llm, prompt
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, yet they exhibit systematic errors on complex, multi-step programming tasks. We hypothesize that these errors stem from the flexibility of general-purpose languages, which permits multiple valid approaches and requires implicit state management. To test this hypothesis, we introduce Anka, a domain-specific language (DSL) for data transformation pipelines designed with explicit, constrained syntax that reduces ambiguity in code generation. Despite having zero prior training exposure to Anka, Claude 3.5 Haiku achieves 99.9% parse success and 95.8% overall task accuracy across 100 benchmark problems. Critically, Anka demonstrates a 40 percentage point accuracy advantage over Python on multi-step pipeline tasks (100% vs. 60%), where Python's flexible syntax leads to frequent errors in operation sequencing and variable management. Cross-model validation with GPT-4o-mini confirms this advantage (+26.7 percentage points on multi-step tasks). Our results demonstrate that: (1) LLMs can learn novel DSLs entirely from in-context prompts, achieving near-native accuracy; (2) constrained syntax significantly reduces errors on complex tasks; and (3) domain-specific languages purposefully designed for LLM generation can outperform general-purpose languages on which the LLM has extensive training. We release the complete language implementation, benchmark suite, and evaluation framework to facilitate further research.

Title: Interpretable Safety Alignment via SAE-Constructed Low-Rank Subspace Adaptation

Authors: Dianyun Wang, Qingsen Ma, Yuhu Shang, Zhifeng Lu, Lechen Ning, Zhenbo Xu, Huijia Wu, Zhaofeng He
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.23260
Pdf URL: https://arxiv.org/pdf/2512.23260
Copy Paste: [[2512.23260]] Interpretable Safety Alignment via SAE-Constructed Low-Rank Subspace Adaptation(https://arxiv.org/abs/2512.23260)
Keywords: language model
Abstract: Parameter-efficient fine-tuning has become the dominant paradigm for adapting large language models to downstream tasks. Low-rank adaptation methods such as LoRA operate under the assumption that task-relevant weight updates reside in a low-rank subspace, yet this subspace is learned implicitly from data in a black-box manner, offering no interpretability or direct control. We hypothesize that this difficulty stems from polysemanticity--individual dimensions encoding multiple entangled concepts. To address this, we leverage pre-trained Sparse Autoencoders (SAEs) to identify task-relevant features in a disentangled feature space, then construct an explicit, interpretable low-rank subspace to guide adapter initialization. We provide theoretical analysis proving that under monosemanticity assumptions, SAE-based subspace identification achieves arbitrarily small recovery error, while direct identification in polysemantic space suffers an irreducible error floor. On safety alignment, our method achieves up to 99.6% safety rate--exceeding full fine-tuning by 7.4 percentage points and approaching RLHF-based methods--while updating only 0.19-0.24% of parameters. Crucially, our method provides interpretable insights into the learned alignment subspace through the semantic grounding of SAE features. Our work demonstrates that incorporating mechanistic interpretability into the fine-tuning process can simultaneously improve both performance and transparency.

Title: Chinese Morph Resolution in E-commerce Live Streaming Scenarios

Authors: Jiahao Zhu, Jipeng Qiang, Ran Bai, Chenyu Liu, Xiaoye Ouyang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.23280
Pdf URL: https://arxiv.org/pdf/2512.23280
Copy Paste: [[2512.23280]] Chinese Morph Resolution in E-commerce Live Streaming Scenarios(https://arxiv.org/abs/2512.23280)
Keywords: language model, llm
Abstract: E-commerce live streaming in China, particularly on platforms like Douyin, has become a major sales channel, but hosts often use morphs to evade scrutiny and engage in false advertising. This study introduces the Live Auditory Morph Resolution (LiveAMR) task to detect such violations. Unlike previous morph research focused on text-based evasion in social media and underground industries, LiveAMR targets pronunciation-based evasion in health and medical live streams. We constructed the first LiveAMR dataset with 86,790 samples and developed a method to transform the task into a text-to-text generation problem. By leveraging large language models (LLMs) to generate additional training data, we improved performance and demonstrated that morph resolution significantly enhances live streaming regulation.

Title: AI4Reading: Chinese Audiobook Interpretation System Based on Multi-Agent Collaboration

Authors: Minjiang Huang, Jipeng Qiang, Yi Zhu, Chaowei Zhang, Xiangyu Zhao, Kui Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.23300
Pdf URL: https://arxiv.org/pdf/2512.23300
Copy Paste: [[2512.23300]] AI4Reading: Chinese Audiobook Interpretation System Based on Multi-Agent Collaboration(https://arxiv.org/abs/2512.23300)
Keywords: language model, llm, agent
Abstract: Audiobook interpretations are attracting increasing attention, as they provide accessible and in-depth analyses of books that offer readers practical insights and intellectual inspiration. However, their manual creation process remains time-consuming and resource-intensive. To address this challenge, we propose AI4Reading, a multi-agent collaboration system leveraging large language models (LLMs) and speech synthesis technology to generate podcast-like audiobook interpretations. The system is designed to meet three key objectives: accurate content preservation, enhanced comprehensibility, and a logical narrative structure. To achieve these goals, we develop a framework composed of 11 specialized agents, including topic analysts, case analysts, editors, a narrator, and proofreaders, that work in concert to explore themes, extract real-world cases, refine content organization, and synthesize natural spoken language. A comparison of expert interpretations with our system's output shows that, although AI4Reading still has a gap in speech generation quality, the generated interpretative scripts are simpler and more accurate.

Title: AI Meets Brain: Memory Systems from Cognitive Neuroscience to Autonomous Agents

Authors: Jiafeng Liang, Hao Li, Chang Li, Jiaqi Zhou, Shixin Jiang, Zekun Wang, Changkai Ji, Zhihao Zhu, Runxuan Liu, Tao Ren, Jinlan Fu, See-Kiong Ng, Xia Liang, Ming Liu, Bing Qin
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2512.23343
Pdf URL: https://arxiv.org/pdf/2512.23343
Copy Paste: [[2512.23343]] AI Meets Brain: Memory Systems from Cognitive Neuroscience to Autonomous Agents(https://arxiv.org/abs/2512.23343)
Keywords: llm, agent
Abstract: Memory serves as the pivotal nexus bridging past and future, providing both humans and AI systems with invaluable concepts and experience to navigate complex tasks. Recent research on autonomous agents has increasingly focused on designing efficient memory workflows by drawing on cognitive neuroscience. However, constrained by interdisciplinary barriers, existing works struggle to assimilate the essence of human memory mechanisms. To bridge this gap, we systematically synthesize interdisciplinary knowledge of memory, connecting insights from cognitive neuroscience with LLM-driven agents. Specifically, we first elucidate the definition and function of memory along a progressive trajectory from cognitive neuroscience through LLMs to agents. We then provide a comparative analysis of memory taxonomy, storage mechanisms, and the complete management lifecycle from both biological and artificial perspectives. Subsequently, we review the mainstream benchmarks for evaluating agent memory. Additionally, we explore memory security from the dual perspectives of attack and defense. Finally, we envision future research directions, with a focus on multimodal memory systems and skill acquisition.

Title: A Stepwise-Enhanced Reasoning Framework for Large Language Models Based on External Subgraph Generation

Authors: Xin Zhang, Yang Cao, Baoxing Wu, Xinyi Chen, Kai Song, Siying Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.23356
Pdf URL: https://arxiv.org/pdf/2512.23356
Copy Paste: [[2512.23356]] A Stepwise-Enhanced Reasoning Framework for Large Language Models Based on External Subgraph Generation(https://arxiv.org/abs/2512.23356)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have achieved strong performance across a wide range of natural language processing tasks in recent years, including machine translation, text generation, and question answering. As their applications extend to increasingly complex scenarios, however, LLMs continue to face challenges in tasks that require deep reasoning and logical inference. In particular, models trained on large-scale textual corpora may incorporate noisy or irrelevant information during generation, which can lead to incorrect predictions or outputs that are inconsistent with factual knowledge. To address this limitation, we propose a stepwise reasoning enhancement framework for LLMs based on external subgraph generation, termed SGR. The proposed framework dynamically constructs query-relevant subgraphs from external knowledge bases and leverages their semantic structure to guide the reasoning process. By performing reasoning in a step-by-step manner over structured subgraphs, SGR reduces the influence of noisy information and improves reasoning accuracy. Specifically, the framework first generates an external subgraph tailored to the input query, then guides the model to conduct multi-step reasoning grounded in the subgraph, and finally integrates multiple reasoning paths to produce the final answer. Experimental results on multiple benchmark datasets demonstrate that SGR consistently outperforms strong baselines, indicating its effectiveness in enhancing the reasoning capabilities of LLMs.

Title: Entropy-Guided Token Dropout: Training Autoregressive Language Models with Limited Domain Data

Authors: Jiapeng Wang, Yiwen Hu, Yanzipeng Gao, Haoyu Wang, Shuo Wang, Hongyu Lu, Jiaxin Mao, Wayne Xin Zhao, Junyi Li, Xiao Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.23422
Pdf URL: https://arxiv.org/pdf/2512.23422
Copy Paste: [[2512.23422]] Entropy-Guided Token Dropout: Training Autoregressive Language Models with Limited Domain Data(https://arxiv.org/abs/2512.23422)
Keywords: language model, llm
Abstract: As access to high-quality, domain-specific data grows increasingly scarce, multi-epoch training has become a practical strategy for adapting large language models (LLMs). However, autoregressive models often suffer from performance degradation under repeated data exposure, where overfitting leads to a marked decline in model capability. Through empirical analysis, we trace this degradation to an imbalance in learning dynamics: predictable, low-entropy tokens are learned quickly and come to dominate optimization, while the model's ability to generalize on high-entropy tokens deteriorates with continued training. To address this, we introduce EntroDrop, an entropy-guided token dropout method that functions as structured data regularization. EntroDrop selectively masks low-entropy tokens during training and employs a curriculum schedule to adjust regularization strength in alignment with training progress. Experiments across model scales from 0.6B to 8B parameters show that EntroDrop consistently outperforms standard regularization baselines and maintains robust performance throughout extended multi-epoch training. These findings underscore the importance of aligning regularization with token-level learning dynamics when training on limited data. Our approach offers a promising pathway toward more effective adaptation of LLMs in data-constrained domains.
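The idea in the EntroDrop abstract, masking predictable low-entropy tokens with a strength that follows a curriculum, can be sketched as below. This is only an illustration under stated assumptions: the entropy threshold, the linear schedule, the choice to drop tokens from the loss, and all function names are hypothetical, since the abstract does not specify them.

```python
import numpy as np

def token_entropy(probs):
    """Shannon entropy (in nats) of each row of a (tokens, vocab) probability array."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def drop_rate(step, total_steps, max_rate=0.5):
    """Assumed linear curriculum: the drop rate grows from 0 to max_rate."""
    return max_rate * min(step / total_steps, 1.0)

def entropy_guided_keep_mask(probs, step, total_steps,
                             threshold=0.5, max_rate=0.5, rng=None):
    """Return a boolean mask: True = token kept in the loss, False = dropped.

    Only tokens whose predictive entropy is below `threshold` are candidates
    for dropping; high-entropy tokens are always kept.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    low_entropy = token_entropy(probs) < threshold
    rate = drop_rate(step, total_steps, max_rate)
    dropped = low_entropy & (rng.random(len(probs)) < rate)
    return ~dropped

# Toy predictive distributions for three tokens over a three-word vocabulary.
probs = np.array([
    [0.98, 0.01, 0.01],   # confident prediction, low entropy
    [0.34, 0.33, 0.33],   # uncertain prediction, high entropy
    [0.90, 0.05, 0.05],   # confident prediction, low entropy
])
print(entropy_guided_keep_mask(probs, step=0, total_steps=100))
print(entropy_guided_keep_mask(probs, step=100, total_steps=100, max_rate=1.0))
```

At step 0 the assumed schedule drops nothing. At the final step with `max_rate=1.0` every low-entropy token is dropped and only the uncertain one remains in the loss.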

Title: C2PO: Diagnosing and Disentangling Bias Shortcuts in LLMs

Authors: Xuan Feng, Bo An, Tianlong Gu, Liang Chang, Fengrui Hao, Peipeng Yu, Shuai Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.23430
Pdf URL: https://arxiv.org/pdf/2512.23430
Copy Paste: [[2512.23430]] C2PO: Diagnosing and Disentangling Bias Shortcuts in LLMs(https://arxiv.org/abs/2512.23430)
Keywords: language model, llm, chat
Abstract: Bias in Large Language Models (LLMs) poses significant risks to trustworthiness, manifesting primarily as stereotypical biases (e.g., gender or racial stereotypes) and structural biases (e.g., lexical overlap or position preferences). However, prior paradigms typically address these in isolation, often mitigating one at the expense of exacerbating the other. To address this, we conduct a systematic exploration of these reasoning failures and identify a primary inducement: the latent spurious feature correlations within the input that drive these erroneous reasoning shortcuts. Driven by these findings, we introduce Causal-Contrastive Preference Optimization (C2PO), a unified alignment framework designed to tackle these specific failures by simultaneously discovering and suppressing these correlations directly within the optimization process. Specifically, C2PO leverages causal counterfactual signals to isolate bias-inducing features from valid reasoning paths, and employs a fairness-sensitive preference update mechanism to dynamically evaluate logit-level contributions and suppress shortcut features. Extensive experiments across multiple benchmarks covering stereotypical bias (BBQ, Unqover), structural bias (MNLI, HANS, Chatbot, MT-Bench), out-of-domain fairness (StereoSet, WinoBias), and general utility (MMLU, GSM8K) demonstrate that C2PO effectively mitigates stereotypical and structural biases while preserving robust general reasoning capabilities.

Title: ClinDEF: A Dynamic Evaluation Framework for Large Language Models in Clinical Reasoning

Authors: Yuqi Tang, Jing Yu, Zichang Su, Kehua Feng, Zhihui Zhu, Libin Wang, Lei Liang, Qiang Zhang, Keyan Ding, Huajun Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.23440
Pdf URL: https://arxiv.org/pdf/2512.23440
Copy Paste: [[2512.23440]] ClinDEF: A Dynamic Evaluation Framework for Large Language Models in Clinical Reasoning(https://arxiv.org/abs/2512.23440)
Keywords: language model, llm, agent
Abstract: Clinical diagnosis begins with doctor-patient interaction, during which physicians iteratively gather information, decide on examinations, and refine the differential diagnosis based on patients' responses. This dynamic clinical-reasoning process is poorly represented by existing LLM benchmarks that focus on static question-answering. To mitigate these gaps, recent methods explore dynamic medical frameworks involving interactive clinical dialogues. Although effective, they often rely on limited, contamination-prone datasets and lack granular, multi-level evaluation. In this work, we propose ClinDEF, a dynamic framework for assessing clinical reasoning in LLMs through simulated diagnostic dialogues. Grounded in a disease knowledge graph, our method dynamically generates patient cases and facilitates multi-turn interactions between an LLM-based doctor and an automated patient agent. Our evaluation protocol goes beyond diagnostic accuracy by incorporating fine-grained efficiency analysis and rubric-based assessment of diagnostic quality. Experiments show that ClinDEF effectively exposes critical clinical reasoning gaps in state-of-the-art LLMs, offering a more nuanced and clinically meaningful evaluation paradigm.

Title: Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss

Authors: Ang Lv, Jin Ma, Yiyuan Ma, Siyuan Qiao
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2512.23447
Pdf URL: https://arxiv.org/pdf/2512.23447
Copy Paste: [[2512.23447]] Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss(https://arxiv.org/abs/2512.23447)
Keywords: llm
Abstract: Mixture-of-Experts (MoE) models lack explicit constraints to ensure the router's decisions align well with the experts' capabilities, which ultimately limits model performance. To address this, we propose expert-router coupling (ERC) loss, a lightweight auxiliary loss that tightly couples the router's decisions with expert capabilities. Our approach treats each expert's router embedding as a proxy token for the tokens assigned to that expert, and feeds perturbed router embeddings through the experts to obtain internal activations. The ERC loss enforces two constraints on these activations: (1) Each expert must exhibit higher activation for its own proxy token than for the proxy tokens of any other expert. (2) Each proxy token must elicit stronger activation from its corresponding expert than from any other expert. These constraints jointly ensure that each router embedding faithfully represents its corresponding expert's capability, while each expert specializes in processing the tokens actually routed to it. The ERC loss is computationally efficient, operating only on n^2 activations, where n is the number of experts. This represents a fixed cost independent of batch size, unlike prior coupling methods that scale with the number of tokens (often millions per batch). Through pre-training MoE-LLMs ranging from 3B to 15B parameters and extensive analysis on trillions of tokens, we demonstrate the effectiveness of the ERC loss. Moreover, the ERC loss offers flexible control and quantitative tracking of expert specialization levels during training, providing valuable insights into MoEs.
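The two constraints that the ERC abstract places on the n x n activation matrix can be illustrated with a small sketch. The hinge form, the `margin` parameter, and the function name `erc_style_penalty` are assumptions made for illustration; the abstract states the constraints but not the exact loss formula.

```python
import numpy as np

def erc_style_penalty(acts, margin=0.0):
    """Hinge-style penalty on an (n, n) activation matrix (assumed form).

    acts[i, j] = activation strength of expert j on the proxy token of expert i.
    The penalty is zero when every diagonal entry dominates its row and column.
    """
    acts = np.asarray(acts, dtype=float)
    diag = np.diag(acts)
    off_diagonal = ~np.eye(len(acts), dtype=bool)
    # Constraint (2): proxy token i should activate expert i most (row-wise).
    row_violation = np.maximum(0.0, acts - diag[:, None] + margin)
    # Constraint (1): expert j should respond most to its own proxy token (column-wise).
    col_violation = np.maximum(0.0, acts - diag[None, :] + margin)
    return float(row_violation[off_diagonal].sum() + col_violation[off_diagonal].sum())

# Two experts: a well-coupled case and a case where token 0 prefers expert 1.
coupled = [[2.0, 1.0], [0.5, 3.0]]
miscoupled = [[1.0, 2.0], [0.5, 3.0]]
print(erc_style_penalty(coupled), erc_style_penalty(miscoupled))
```

The input has n^2 entries regardless of how many tokens are in the batch, which matches the fixed-cost property the abstract describes.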

Title: Semantic Tree Inference on Text Corpa using a Nested Density Approach together with Large Language Model Embeddings

Authors: Thomas Haschka, Joseph Bakarji
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.23471
Pdf URL: https://arxiv.org/pdf/2512.23471
Copy Paste: [[2512.23471]] Semantic Tree Inference on Text Corpa using a Nested Density Approach together with Large Language Model Embeddings(https://arxiv.org/abs/2512.23471)
Keywords: language model, llm
Abstract: Semantic text classification has undergone significant advances in recent years due to the rise of large language models (LLMs) and their high-dimensional embeddings. While LLM embeddings are frequently used to store and retrieve text by semantic similarity in vector databases, the global structure of semantic relationships in text corpora often remains opaque. Herein we propose a nested density clustering approach to infer hierarchical trees of semantically related texts. The method starts by identifying texts of strong semantic similarity as it searches for dense clusters in LLM embedding space. As the density criterion is gradually relaxed, these dense clusters merge into more diffuse clusters, until the whole dataset is represented by a single cluster, the root of the tree. By embedding dense clusters into increasingly diffuse ones, we construct a tree structure that captures hierarchical semantic relationships among texts. As a case study, we outline how this approach can be used to classify the text of scientific abstracts. This enables the data-driven discovery of research areas and their subfields without predefined categories. To evaluate the general applicability of the method, we further apply it to established benchmark datasets such as 20 Newsgroups and IMDB 50k Movie Reviews, demonstrating its robustness across domains. Finally, we discuss possible applications in scientometrics and topic evolution, highlighting how nested density trees can reveal semantic structure and evolution in textual datasets.
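The nesting behaviour described in the abstract, where dense clusters merge into more diffuse ones until a single root remains, can be illustrated with a simple stand-in. The sketch below is not the authors' algorithm: it approximates the density criterion with a distance radius, links points closer than that radius, and takes connected components at each level. The function names and toy points are hypothetical.

```python
import numpy as np

def clusters_at_radius(points, radius):
    """Connected components of the graph linking points within `radius`."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    for i in range(n):
        for j in range(i + 1, n):
            if dists[i, j] <= radius:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

def nested_levels(points, radii):
    """Cluster at increasingly relaxed radii; each level nests the previous one."""
    return [clusters_at_radius(points, r) for r in sorted(radii)]

# Toy 2-D "embeddings": two tight pairs and one distant point.
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [20.0, 0.0]])
levels = nested_levels(points, radii=[0.5, 8.0, 20.0])
for level in levels:
    print(level)
```

Reading the levels from last to first gives the tree from the root down: one cluster containing everything, then the split into a four-point group and an outlier, then the two tight pairs.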

Title: Single LLM Debate, MoLaCE: Mixture of Latent Concept Experts Against Confirmation Bias

Authors: Hazel Kim, Philip Torr
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.23518
Pdf URL: https://arxiv.org/pdf/2512.23518
Copy Paste: [[2512.23518]] Single LLM Debate, MoLaCE: Mixture of Latent Concept Experts Against Confirmation Bias(https://arxiv.org/abs/2512.23518)
Keywords: language model, llm, prompt, agent
Abstract: Large language models (LLMs) are highly vulnerable to input confirmation bias. When a prompt implies a preferred answer, models often reinforce that bias rather than explore alternatives. This phenomenon remains underexplored, yet it is already harmful in base models and poses an even greater risk in multi-agent debate, where echo chambers reinforce bias instead of correcting it. We introduce Mixture of Latent Concept Experts (MoLaCE), a lightweight inference-time framework that addresses confirmation bias by mixing experts instantiated as different activation strengths over latent concepts that shape model responses. Our key insight is that, due to the compositional nature of language, differently phrased prompts reweight latent concepts in prompt-specific ways that affect factual correctness, so no single fixed intervention can be applied universally across inputs. This design enables a single LLM to emulate the benefits of debate internally while remaining computationally efficient and scalable. It can also be integrated into multi-agent debate frameworks to diversify perspectives and reduce correlated errors. We empirically show that it consistently reduces confirmation bias, improves robustness, and matches or surpasses multi-agent debate while requiring only a fraction of the computation.

Title: Lie to Me: Knowledge Graphs for Robust Hallucination Self-Detection in LLMs

Authors: Sahil Kale, Antonio Luca Alfeo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.23547
Pdf URL: https://arxiv.org/pdf/2512.23547
Copy Paste: [[2512.23547]] Lie to Me: Knowledge Graphs for Robust Hallucination Self-Detection in LLMs(https://arxiv.org/abs/2512.23547)
Keywords: language model, gpt, llm, hallucination
Abstract: Hallucinations, the generation of apparently convincing yet false statements, remain a major barrier to the safe deployment of LLMs. Building on the strong performance of self-detection methods, we examine the use of structured knowledge representations, namely knowledge graphs, to improve hallucination self-detection. Specifically, we propose a simple yet powerful approach that enriches hallucination self-detection by (i) converting LLM responses into knowledge graphs of entities and relations, and (ii) using these graphs to estimate the likelihood that a response contains hallucinations. We evaluate the proposed approach using two widely used LLMs, GPT-4o and Gemini-2.5-Flash, across two hallucination detection datasets. To support more reliable future benchmarking, one of these datasets has been manually curated and enhanced and is released as a secondary outcome of this work. Compared to standard self-detection methods and SelfCheckGPT, a state-of-the-art approach, our method achieves up to 16% relative improvement in accuracy and 20% in F1-score. Our results show that LLMs can better analyse atomic facts when they are structured as knowledge graphs, even when initial outputs contain inaccuracies. This low-cost, model-agnostic approach paves the way toward safer and more trustworthy language models.

Title: Instruction-Following Evaluation of Large Vision-Language Models

Authors: Daiki Shiono, Shumpei Miyawaki, Ryota Tanaka, Jun Suzuki
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2512.23572
Pdf URL: https://arxiv.org/pdf/2512.23572
Copy Paste: [[2512.23572]] Instruction-Following Evaluation of Large Vision-Language Models(https://arxiv.org/abs/2512.23572)
Keywords: language model, llm
Abstract: Following the initial flourishing of large language models (LLMs), there has been a surge in proposed large vision-language models (LVLMs) that integrate LLMs with vision capabilities. However, it has been observed that LVLMs, after tuning to visual instruction using commonly used training datasets, often fail to exhibit the instruction-following ability that was present in the LLM before integration, leading to results in which they do not follow task instructions as expected. This study quantitatively demonstrates that LVLMs' instruction-following ability declines after fine-tuning and analyzes its underlying causes. In particular, we constructed new training datasets highlighting whether the output format is specified. Then, we investigated how explicitly indicating the output format during fine-tuning affects LVLMs' instruction-following ability. Our quantitative evaluation confirmed that LVLMs' instruction-following ability declines after fine-tuning with commonly used datasets. Furthermore, we found that LVLMs trained with datasets, including instructions on output format, tend to follow instructions more accurately than models that do not. These findings suggest that including samples with instructions on output format during (visual) instruction tuning may help mitigate the decline in instruction-following abilities.

Title: Style Amnesia: Investigating Speaking Style Degradation and Mitigation in Multi-Turn Spoken Language Models

Authors: Yu-Xiang Lin, Cheng-Han Chiang, Hung-yi Lee
Subjects: cs.CL, cs.SD
Abstract URL: https://arxiv.org/abs/2512.23578
Pdf URL: https://arxiv.org/pdf/2512.23578
Copy Paste: [[2512.23578]] Style Amnesia: Investigating Speaking Style Degradation and Mitigation in Multi-Turn Spoken Language Models(https://arxiv.org/abs/2512.23578)
Keywords: language model, prompt
Abstract: In this paper, we show that when spoken language models (SLMs) are instructed to speak in a specific speaking style at the beginning of a multi-turn conversation, they cannot maintain the required speaking styles after several turns of interaction; we refer to this as the style amnesia of SLMs. We focus on paralinguistic speaking styles, including emotion, accent, volume, and speaking speed. We evaluate three proprietary and two open-source SLMs, demonstrating that none of these models can maintain a consistent speaking style when instructed to do so. We further show that when SLMs are asked to recall the style instruction in later turns, they can recall the style instruction, but they fail to express it throughout the conversation. We also show that explicitly asking the model to recall the style instruction can partially mitigate style amnesia. In addition, we examine various prompting strategies and find that SLMs struggle to follow the required style when the instruction is placed in system messages rather than user messages, which contradicts the intended function of system prompts.

Title: Close the Loop: Synthesizing Infinite Tool-Use Data via Multi-Agent Role-Playing

Authors: Yuwen Li, Wei Zhang, Zelong Huang, Mason Yang, Jiajun Wu, Shawn Guo, Huahao Hu, Lingyi Sun, Jian Yang, Mingjie Tang, Byran Dai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.23611
Pdf URL: https://arxiv.org/pdf/2512.23611
Copy Paste: [[2512.23611]] Close the Loop: Synthesizing Infinite Tool-Use Data via Multi-Agent Role-Playing(https://arxiv.org/abs/2512.23611)
Keywords: language model, llm, agent
Abstract: Enabling Large Language Models (LLMs) to reliably invoke external tools remains a critical bottleneck for autonomous agents. Existing approaches suffer from three fundamental challenges: expensive human annotation for high-quality trajectories, poor generalization to unseen tools, and quality ceilings inherent in single-model synthesis that perpetuate biases and coverage gaps. We introduce InfTool, a fully autonomous framework that breaks these barriers through self-evolving multi-agent synthesis. Given only raw API specifications, InfTool orchestrates three collaborative agents (User Simulator, Tool-Calling Assistant, and MCP Server) to generate diverse, verified trajectories spanning single-turn calls to complex multi-step workflows. The framework establishes a closed loop: synthesized data trains the model via Group Relative Policy Optimization (GRPO) with gated rewards, the improved model generates higher-quality data targeting capability gaps, and this cycle iterates without human intervention. Experiments on the Berkeley Function-Calling Leaderboard (BFCL) demonstrate that InfTool transforms a base 32B model from 19.8% to 70.9% accuracy (+258%), surpassing models 10x larger and rivaling Claude-Opus, and entirely from synthetic data without human annotation.
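The "+258%" in the abstract above is a relative gain over the 19.8% baseline, not a difference in percentage points. A quick check of that arithmetic, using only the two accuracies quoted in the abstract (the function name `relative_gain` is ours, for illustration):

```python
def relative_gain(before: float, after: float) -> float:
    """Return the relative change from `before` to `after`, in percent."""
    if before == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (after - before) / before * 100.0


baseline_acc = 19.8  # BFCL accuracy of the base 32B model, from the abstract
trained_acc = 70.9   # BFCL accuracy after InfTool training, from the abstract

absolute_points = trained_acc - baseline_acc
relative_percent = relative_gain(baseline_acc, trained_acc)

print(f"absolute gain: {absolute_points:.1f} points")
print(f"relative gain: {relative_percent:.0f}%")
```

The absolute gain is about 51 points; dividing it by the baseline gives the relative figure reported in the abstract.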

Title: Nested Browser-Use Learning for Agentic Information Seeking

Authors: Baixuan Li, Jialong Wu, Wenbiao Yin, Kuan Li, Zhongwang Zhang, Huifeng Yin, Zhengwei Tao, Liwen Zhang, Pengjun Xie, Jingren Zhou, Yong Jiang
Subjects: cs.CL, cs.AI, cs.IR, cs.MA
Abstract URL: https://arxiv.org/abs/2512.23647
Pdf URL: https://arxiv.org/pdf/2512.23647
Copy Paste: [[2512.23647]] Nested Browser-Use Learning for Agentic Information Seeking(https://arxiv.org/abs/2512.23647)
Keywords: agent
Abstract: Information-seeking (IS) agents have achieved strong performance across a range of wide and deep search tasks, yet their tool use remains largely restricted to API-level snippet retrieval and URL-based page fetching, limiting access to the richer information available through real browsing. While full browser interaction could unlock deeper capabilities, its fine-grained control and verbose page content returns introduce substantial complexity for ReAct-style function-calling agents. To bridge this gap, we propose Nested Browser-Use Learning (NestBrowse), which introduces a minimal and complete browser-action framework that decouples interaction control from page exploration through a nested structure. This design simplifies agentic reasoning while enabling effective deep-web information acquisition. Empirical results on challenging deep IS benchmarks demonstrate that NestBrowse offers clear benefits in practice. Further in-depth analyses underscore its efficiency and flexibility.

Title: Less is more: Probabilistic reduction is best explained by small-scale predictability measures

Authors: Cassandra L. Jacobs, Andrés Buxó-Lugo, Anna K. Taylor, Marie Leopold-Hooke
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.23659
Pdf URL: https://arxiv.org/pdf/2512.23659
Copy Paste: [[2512.23659]] Less is more: Probabilistic reduction is best explained by small-scale predictability measures(https://arxiv.org/abs/2512.23659)
Keywords: language model
Abstract: The primary research questions of this paper center on defining the amount of context that is necessary and/or appropriate when investigating the relationship between language model probabilities and cognitive phenomena. We investigate whether whole utterances are necessary to observe probabilistic reduction and demonstrate that n-gram representations suffice as cognitive units of planning.

Title: Multilingual Hidden Prompt Injection Attacks on LLM-Based Academic Reviewing

Authors: Panagiotis Theocharopoulos, Ajinkya Kulkarni, Mathew Magimai.-Doss
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.23684
Pdf URL: https://arxiv.org/pdf/2512.23684
Copy Paste: [[2512.23684]] Multilingual Hidden Prompt Injection Attacks on LLM-Based Academic Reviewing(https://arxiv.org/abs/2512.23684)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are increasingly considered for use in high-impact workflows, including academic peer review. However, LLMs are vulnerable to document-level hidden prompt injection attacks. In this work, we construct a dataset of approximately 500 real academic papers accepted to ICML and evaluate the effect of embedding hidden adversarial prompts within these documents. Each paper is injected with semantically equivalent instructions in four different languages and reviewed using an LLM. We find that prompt injection induces substantial changes in review scores and accept/reject decisions for English, Japanese, and Chinese injections, while Arabic injections produce little to no effect. These results highlight the susceptibility of LLM-based reviewing systems to document-level prompt injection and reveal notable differences in vulnerability across languages.

Title: PROFASR-BENCH: A Benchmark for Context-Conditioned ASR in High-Stakes Professional Speech

Authors: Deepak Babu Piskala
Subjects: cs.CL, cs.SD
Abstract URL: https://arxiv.org/abs/2512.23686
Pdf URL: https://arxiv.org/pdf/2512.23686
Copy Paste: [[2512.23686]] PROFASR-BENCH: A Benchmark for Context-Conditioned ASR in High-Stakes Professional Speech(https://arxiv.org/abs/2512.23686)
Keywords: language model, prompt
Abstract: Automatic Speech Recognition (ASR) in professional settings faces challenges that existing benchmarks underplay: dense domain terminology, formal register variation, and near-zero tolerance for critical entity errors. We present ProfASR-Bench, a professional-talk evaluation suite for high-stakes applications across finance, medicine, legal, and technology. Each example pairs a natural-language prompt (domain cue and/or speaker profile) with an entity-rich target utterance, enabling controlled measurement of context-conditioned recognition. The corpus supports conventional ASR metrics alongside entity-aware scores and slice-wise reporting by accent and gender. Using representative families Whisper (encoder-decoder ASR) and Qwen-Omni (audio language models) under matched no-context, profile, domain+profile, oracle, and adversarial conditions, we find a consistent pattern: lightweight textual context produces little to no change in average word error rate (WER), even with oracle prompts, and adversarial prompts do not reliably degrade performance. We term this the context-utilization gap (CUG): current systems are nominally promptable yet underuse readily available side information. ProfASR-Bench provides a standardized context ladder, entity- and slice-aware reporting with confidence intervals, and a reproducible testbed for comparing fusion strategies across model families. The dataset and code are linked from the abstract page.
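The headline metric in the abstract above is word error rate (WER). As a reminder of what it measures, here is a minimal sketch of the standard definition (word-level edit distance divided by reference length). It is not the benchmark's own scoring code: the lowercasing and whitespace tokenization are simplifying assumptions, and the example utterances are invented.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev is the previous row of the Levenshtein table: prev[j] is the edit
    # distance between the reference words processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical utterances for illustration only.
reference = "administer five milligrams of warfarin daily"
hypothesis = "administer five milligrams of war for in daily"
print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

A single mangled entity ("warfarin") dominates the score here, which is the kind of error the abstract's entity-aware metrics are meant to isolate from the average.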

Title: Fine-Tuning LLMs with Fine-Grained Human Feedback on Text Spans

Authors: Sky CH-Wang, Justin Svegliato, Helen Appel, Jason Eisner
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.23693
Pdf URL: https://arxiv.org/pdf/2512.23693
Copy Paste: [[2512.23693]] Fine-Tuning LLMs with Fine-Grained Human Feedback on Text Spans(https://arxiv.org/abs/2512.23693)
Keywords: language model, llm
Abstract: We present a method and dataset for fine-tuning language models with preference supervision using feedback-driven improvement chains. Given a model response, an annotator provides fine-grained feedback by marking "liked" and "disliked" spans and specifying what they liked or disliked about them. The base model then rewrites the disliked spans accordingly, proceeding from left to right, forming a sequence of incremental improvements. We construct preference pairs for direct alignment from each adjacent step in the chain, enabling the model to learn from localized, targeted edits. We find that our approach outperforms direct alignment methods based on standard A/B preference ranking or full contrastive rewrites, demonstrating that structured, revision-based supervision leads to more efficient and effective preference tuning.
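The pair-construction step described in the abstract above can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes the later response in each adjacent pair is the preferred one, skips revisions that changed nothing, and uses invented names (`PreferencePair`, `pairs_from_chain`) and invented example text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str    # the later, revised response
    rejected: str  # the response just before that revision


def pairs_from_chain(prompt: str, chain: List[str]) -> List[PreferencePair]:
    """Pair each response in an improvement chain with its successor.

    chain[0] is the original model response; chain[i + 1] is chain[i]
    after one more disliked span has been rewritten.
    """
    pairs = []
    for earlier, later in zip(chain, chain[1:]):
        if earlier != later:  # assumption: skip no-op revisions
            pairs.append(PreferencePair(prompt, chosen=later, rejected=earlier))
    return pairs


# Hypothetical example data, not taken from the paper's dataset.
chain = [
    "The meeting is at some point next week, I guess.",
    "The meeting is on Tuesday next week, I guess.",
    "The meeting is on Tuesday next week.",
]
pairs = pairs_from_chain("When is the meeting?", chain)
for p in pairs:
    print(f"rejected: {p.rejected!r}\nchosen:   {p.chosen!r}\n")
```

Because adjacent steps differ by one rewritten span, each pair isolates a single localized edit, in contrast to an A/B pair of two unrelated full responses.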

Title: Eliciting Behaviors in Multi-Turn Conversations

Authors: Jing Huang, Shujian Zhang, Lun Wang, Andrew Hard, Rajiv Mathews, John Lambert
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2512.23701
Pdf URL: https://arxiv.org/pdf/2512.23701
Copy Paste: [[2512.23701]] Eliciting Behaviors in Multi-Turn Conversations(https://arxiv.org/abs/2512.23701)
Keywords: language model, llm, prompt
Abstract: Identifying specific and often complex behaviors from large language models (LLMs) in conversational settings is crucial for their evaluation. Recent work proposes novel techniques to find natural language prompts that induce specific behaviors from a target model, yet they are mainly studied in single-turn settings. In this work, we study behavior elicitation in the context of multi-turn conversations. We first offer an analytical framework that categorizes existing methods into three families based on their interactions with the target model: those that use only prior knowledge, those that use offline interactions, and those that learn from online interactions. We then introduce a generalized multi-turn formulation of the online method, unifying single-turn and multi-turn elicitation. We evaluate all three families of methods on automatically generating multi-turn test cases. We investigate the efficiency of these approaches by analyzing the trade-off between the query budget, i.e., the number of interactions with the target model, and the success rate, i.e., the discovery rate of behavior-eliciting inputs. We find that online methods can achieve an average success rate of 45/19/77% with just a few thousand queries over three tasks where static methods from existing multi-turn conversation benchmarks find few or even no failure cases. Our work highlights a novel application of behavior elicitation methods in multi-turn conversation evaluation and the need for the community to move towards dynamic benchmarks.
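The trade-off between query budget and success rate in the abstract above can be made concrete with a small sketch. It assumes a per-query log of whether each interaction with the target model elicited the behavior, and reads "success rate" as the fraction of successful queries within the budget. The function name and the log are hypothetical, and the paper may define the rate differently.

```python
from typing import Dict, Sequence


def success_rate_at_budgets(
    outcomes: Sequence[bool], budgets: Sequence[int]
) -> Dict[int, float]:
    """Fraction of successful queries among the first b queries, per budget b."""
    rates = {}
    for b in budgets:
        if b <= 0 or b > len(outcomes):
            raise ValueError(f"budget {b} is outside 1..{len(outcomes)}")
        rates[b] = sum(outcomes[:b]) / b
    return rates


# Hypothetical log: True means the query elicited the target behavior.
outcomes = [False, False, True, False, True, True, False, True]
rates = success_rate_at_budgets(outcomes, budgets=[2, 4, 8])
for budget, rate in rates.items():
    print(f"budget={budget:>2}  success rate={rate:.2f}")
```

In this toy log the rate rises with the budget, as one would expect from a method that learns from earlier interactions; a static test set would give a flat rate.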