2026-01-13

Title: TeleMem: Building Long-Term and Multimodal Memory for Agentic AI

Authors: Chunliang Chen, Ming Guan, Xiao Lin, Jiaxu Li, Qiyi Wang, Xiangyu Chen, Jixiang Luo, Changzhi Sun, Dell Zhang, Xuelong Li
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2601.06037
Pdf URL: https://arxiv.org/pdf/2601.06037
Copy Paste: [[2601.06037]] TeleMem: Building Long-Term and Multimodal Memory for Agentic AI(https://arxiv.org/abs/2601.06037)
Keywords: language model, llm, hallucination, retrieval-augmented generation, agent
Abstract: Large language models (LLMs) excel at many NLP tasks but struggle to sustain long-term interactions due to limited attention over extended dialogue histories. Retrieval-augmented generation (RAG) mitigates this issue but lacks reliable mechanisms for updating or refining stored memories, leading to schema-driven hallucinations, inefficient write operations, and minimal multimodal support. To address these challenges, we propose TeleMem, a unified long-term and multimodal memory system that maintains coherent user profiles through narrative dynamic extraction, ensuring that only dialogue-grounded information is preserved. TeleMem further introduces a structured writing pipeline that batches, retrieves, clusters, and consolidates memory entries, substantially improving storage efficiency, reducing token usage, and accelerating memory operations. Additionally, a multimodal memory module combined with ReAct-style reasoning equips the system with a closed-loop observe, think, and act process that enables accurate understanding of complex video content in long-term contexts. Experimental results show that TeleMem surpasses the state-of-the-art Mem0 baseline with 19% higher accuracy, 43% fewer tokens, and a 2.1x speedup on the ZH-4O long-term role-play gaming benchmark.

Title: Operation Veja: Fixing Fundamental Concepts Missing from Modern Roleplaying Training Paradigms

Authors: Yueze Liu, Ajay Nagi Reddy Kumdam, Ronit Kanjilal, Hao Yang, Yichi Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.06039
Pdf URL: https://arxiv.org/pdf/2601.06039
Copy Paste: [[2601.06039]] Operation Veja: Fixing Fundamental Concepts Missing from Modern Roleplaying Training Paradigms(https://arxiv.org/abs/2601.06039)
Keywords: llm, retrieval-augmented generation, agent
Abstract: Modern roleplaying models are increasingly sophisticated, yet they consistently struggle to capture the essence of believable, engaging characters. We argue this failure stems from training paradigms that overlook the dynamic interplay of a character's internal world. Current approaches, including Retrieval-Augmented Generation (RAG), fact-based priming, literature-based learning, and synthetic data generation, exhibit recurring limitations in modeling the deliberative, value-conflicted reasoning that defines human interaction. In this paper, we identify four core concepts essential for character authenticity: Values, Experiences, Judgments, and Abilities (VEJA). We propose the VEJA framework as a new paradigm for data curation that addresses these systemic limitations. To illustrate the qualitative ceiling enabled by our framework, we present a pilot study comparing a manually curated, VEJA-grounded dataset against a state-of-the-art synthetic baseline. Using an LLM-as-judge evaluation, our findings demonstrate a significant quality gap, suggesting that a shift toward conceptually grounded data curation, as embodied by VEJA, is necessary for creating roleplaying agents with genuine depth and narrative continuity. The full dataset is available at this https URL

Title: Reinforcement Learning for Chain of Thought Compression with One-Domain-to-All Generalization

Authors: Hanyu Li, Jiangshan Duo, Bofei Gao, Hailin Zhang, Sujian Li, Xiaotie Deng, Liang Zhao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.06052
Pdf URL: https://arxiv.org/pdf/2601.06052
Copy Paste: [[2601.06052]] Reinforcement Learning for Chain of Thought Compression with One-Domain-to-All Generalization(https://arxiv.org/abs/2601.06052)
Keywords: language model, chain-of-thought
Abstract: Chain-of-thought reasoning in large language models often creates an "overthinking trap," leading to excessive computational cost and latency for unreliable accuracy gains. Prior work has typically relied on global, static controls that risk penalizing necessary reasoning. We introduce a sample-level, soft reinforcement learning compression method that penalizes inefficiently long rollouts, but only on problems that the model has already mastered and for which it has already produced a more concise rollout. Our experiments show that this method reduces average response length by 20-40% with comparable or higher accuracy. Crucially, the compression exhibits strong cross-domain generalization; a model trained on math spontaneously shortens responses on unseen tasks like code, instruction following, and general knowledge QA, with stable or improved accuracy. We demonstrate a stable post-training curriculum (accuracy-compression-accuracy) that can ultimately produce models that are more accurate and reason more concisely, arguing that such a compression method should be a standard phase in developing efficient reasoning models.
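The gating idea in the abstract above (penalize length only on mastered problems where a shorter correct rollout exists) can be illustrated with a minimal sketch. The reward form, the `mastery_threshold` and `penalty_weight` parameters, and the function name are all assumptions made for illustration; the abstract does not specify the authors' actual formulation.

```python
from typing import List


def shaped_rewards(
    correct: List[bool],
    lengths: List[int],
    mastery_threshold: float = 1.0,  # hypothetical: fraction of correct rollouts needed
    penalty_weight: float = 0.5,     # hypothetical: maximum length penalty
) -> List[float]:
    """Return one reward per rollout sampled for a single problem.

    A correct rollout is penalized for length only when the problem counts
    as mastered and a shorter correct rollout exists in the same group.
    """
    if not correct or len(correct) != len(lengths):
        raise ValueError("correct and lengths must be non-empty and equal in size")

    accuracy = sum(correct) / len(correct)
    mastered = accuracy >= mastery_threshold
    correct_lengths = [n for ok, n in zip(correct, lengths) if ok]

    rewards = []
    for ok, n in zip(correct, lengths):
        reward = 1.0 if ok else 0.0
        if ok and mastered:
            shortest = min(correct_lengths)
            if n > shortest:
                # Soft penalty: grows with the relative excess length,
                # and stays below penalty_weight.
                reward -= penalty_weight * (n - shortest) / n
        rewards.append(reward)
    return rewards
```

With this shaping, a problem the model still gets wrong some of the time receives plain correctness rewards, so long but necessary reasoning is not discouraged there.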

Title: A Multi-Stage Workflow for the Review of Marketing Content with Reasoning Large Language Models

Authors: Alberto Purpura, Emily Chen, Swapnil Shinde
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.06054
Pdf URL: https://arxiv.org/pdf/2601.06054
Copy Paste: [[2601.06054]] A Multi-Stage Workflow for the Review of Marketing Content with Reasoning Large Language Models(https://arxiv.org/abs/2601.06054)
Keywords: language model, llm
Abstract: Reasoning Large Language Models (LLMs) have shown promising results when tasked with solving complex problems. In this paper, we propose and evaluate a multi-stage workflow that leverages the capabilities of fine-tuned reasoning LLMs to assist in the review process of marketing content, making sure it complies with a given list of requirements. The contributions of this paper are the following: (i) we present a novel approach -- that does not rely on any external knowledge representation -- for the automatic identification of compliance issues in textual content; (ii) we compare the effectiveness of different fine-tuning strategies like Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) in training models to solve this problem; (iii) we evaluate the effectiveness of training small LLMs to generate reasoning tokens before providing their final response; (iv) we evaluate how the choice and combination of different reward functions affect the performance of a model trained with GRPO.
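As background for contribution (iv), the sketch below shows how several reward functions might be combined into one scalar and then turned into group-relative advantages, which is the normalization GRPO is generally described as using. The weighted-sum combination, the reward names, and the weights are hypothetical; the abstract does not say how the authors combine rewards.

```python
from statistics import mean, pstdev
from typing import Dict, List


def combine_rewards(parts: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of named reward components (hypothetical combination rule)."""
    return sum(weights[name] * value for name, value in parts.items())


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps). The
    population standard deviation is an assumption; implementations differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Usage: four sampled responses scored by two hypothetical reward functions.
weights = {"format": 0.5, "correctness": 1.0}
scored = [
    {"format": 1.0, "correctness": 1.0},
    {"format": 1.0, "correctness": 0.0},
    {"format": 0.0, "correctness": 1.0},
    {"format": 0.0, "correctness": 0.0},
]
totals = [combine_rewards(parts, weights) for parts in scored]
advantages = group_relative_advantages(totals)
```

Changing the weights changes which responses end up above the group mean, which is one way the choice of reward functions can alter what the policy is pushed toward.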

Title: AzeroS: Extending LLM to Speech with Self-Generated Instruction-Free Tuning

Authors: Yiwen Shao, Wei Liu, Jiahong Li, Tianzi Wang, Kun Wei, Meng Yu, Dong Yu
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2601.06086
Pdf URL: https://arxiv.org/pdf/2601.06086
Copy Paste: [[2601.06086]] AzeroS: Extending LLM to Speech with Self-Generated Instruction-Free Tuning(https://arxiv.org/abs/2601.06086)
Keywords: language model, llm, chat
Abstract: Extending large language models (LLMs) to the speech domain has recently gained significant attention. A typical approach connects a pretrained LLM with an audio encoder through a projection module and trains the resulting model on large-scale, task-specific instruction-tuning datasets. However, curating such instruction-tuning data for specific requirements is time-consuming, and models trained in this manner often generalize poorly to unseen tasks. In this work, we first formulate that the strongest generalization of a speech-LLM is achieved when it is trained with Self-Generated Instruction-Free Tuning (SIFT), in which supervision signals are generated by a frozen LLM using textual representations of speech as input. Our proposed SIFT paradigm eliminates the need for collecting task-specific question-answer pairs and yields the theoretically best generalization to unseen tasks. Building upon this paradigm, we introduce AZeroS (Auden Zero-instruction-tuned Speech-LLM), which is trained on speech-text pairs derived from publicly available corpora, including approximately 25,000 hours of speech with ASR transcripts and 3,000 hours of speech with paralinguistic labels. Built upon Qwen2.5-7B-Instruct, the model updates only two lightweight projection modules (23.8 million parameters each), while keeping both the LLM and audio encoders frozen. Despite the minimal training cost and modest data scale, AZeroS achieves state-of-the-art performance on both semantic and paralinguistic benchmarks, including VoiceBench, AIR-Bench Foundation (Speech), and AIR-Bench Chat (Speech).

Title: Is Sanskrit the most token-efficient language? A quantitative study using GPT, Gemini, and SentencePiece

Authors: Anshul Kumar
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.06142
Pdf URL: https://arxiv.org/pdf/2601.06142
Copy Paste: [[2601.06142]] Is Sanskrit the most token-efficient language? A quantitative study using GPT, Gemini, and SentencePiece(https://arxiv.org/abs/2601.06142)
Keywords: language model, gpt, llm
Abstract: Tokens are the basic units of Large Language Models (LLMs). LLMs rely on tokenizers to segment text into these tokens, and tokenization is the primary determinant of computational and inference cost. Sanskrit, one of the oldest languages, is hypothesized to express more meaning per token due to its morphology and grammar rules; however, no prior work has quantified this. We use a dataset of 701 parallel verses of the Bhagavad Gita, which comprises three languages (Sanskrit, English, and Hindi) along with a transliteration of Sanskrit into English. We test tokenizers including SentencePiece (SPM), older GPT models, and the latest generation tokenizers from Gemini and GPT. We use metrics of token count, characters per token (token efficiency), and tokens per character (token cost). Results show a ~2x difference in token counts between Sanskrit and English/Hindi under the unbiased SPM baseline. English/Hindi translations of Sanskrit commentary resulted in an approximately 20x increase in token count. GPT o200k_base (latest, used by GPT-4o) and Gemini (latest) reduce bias by a significant degree compared to GPT cl100k_base (used until GPT-4), but still fail to fully capture Sanskrit's compactness. This matters because there might be a penalty bias for non-English users, which inflates the token count. This research provides a foundation for improving future tokenizer design and shows the potential of Sanskrit for highly compact encoding, saving on cost while speeding up training and inference. The code and dataset are available at this https URL
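The three metrics named in the abstract above (token count, characters per token, tokens per character) are simple ratios. The sketch below computes them for any tokenizer passed in as a function; the whitespace tokenizer is only a stand-in so the example runs without external libraries, and it is not one of the tokenizers evaluated in the paper.

```python
from typing import Callable, Dict, List


def token_metrics(text: str, tokenize: Callable[[str], List[str]]) -> Dict[str, float]:
    """Compute token count, characters per token, and tokens per character."""
    tokens = tokenize(text)
    if not text or not tokens:
        raise ValueError("text and its tokenization must be non-empty")
    n_chars = len(text)
    n_tokens = len(tokens)
    return {
        "token_count": n_tokens,
        "chars_per_token": n_chars / n_tokens,  # token efficiency
        "tokens_per_char": n_tokens / n_chars,  # token cost
    }


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): split on whitespace."""
    return text.split()


metrics = token_metrics("a bb ccc", whitespace_tokenize)
```

To compare languages, the same function would be applied to each parallel verse with a real tokenizer substituted for `whitespace_tokenize`. Note that character counts depend on the script and on Unicode normalization, so the two ratio metrics are best compared alongside raw token counts.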

Title: Amory: Building Coherent Narrative-Driven Agent Memory through Agentic Reasoning

Authors: Yue Zhou, Xiaobo Guo, Belhassen Bayar, Srinivasan H. Sengamedu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.06282
Pdf URL: https://arxiv.org/pdf/2601.06282
Copy Paste: [[2601.06282]] Amory: Building Coherent Narrative-Driven Agent Memory through Agentic Reasoning(https://arxiv.org/abs/2601.06282)
Keywords: agent
Abstract: Long-term conversational agents face a fundamental scalability challenge as interactions extend over time: repeatedly processing entire conversation histories becomes computationally prohibitive. Current approaches attempt to solve this through memory frameworks that predominantly fragment conversations into isolated embeddings or graph representations and retrieve relevant ones in a RAG style. While computationally efficient, these methods often treat memory formation minimally and fail to capture the subtlety and coherence of human memory. We introduce Amory, a working memory framework that actively constructs structured memory representations through enhancing agentic reasoning during offline time. Amory organizes conversational fragments into episodic narratives, consolidates memories with momentum, and semanticizes peripheral facts into semantic memory. At retrieval time, the system employs coherence-driven reasoning over narrative structures. Evaluated on the LOCOMO benchmark for long-term reasoning, Amory achieves considerable improvements over previous state-of-the-art, with performance comparable to full context reasoning while reducing response time by 50%. Analysis shows that momentum-aware consolidation significantly enhances response quality, while coherence-driven retrieval provides superior memory coverage compared to embedding-based approaches.

Title: How well can off-the-shelf LLMs elucidate molecular structures from mass spectra using chain-of-thought reasoning?

Authors: Yufeng Wang, Lu Wei, Lin Liu, Hao Xu, Haibin Ling
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.06289
Pdf URL: https://arxiv.org/pdf/2601.06289
Copy Paste: [[2601.06289]] How well can off-the-shelf LLMs elucidate molecular structures from mass spectra using chain-of-thought reasoning?(https://arxiv.org/abs/2601.06289)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: Mass spectrometry (MS) is a powerful analytical technique for identifying small molecules, yet determining complete molecular structures directly from tandem mass spectra (MS/MS) remains a long-standing challenge due to complex fragmentation patterns and the vast diversity of chemical space. Recent progress in large language models (LLMs) has shown promise for reasoning-intensive scientific tasks, but their capability for chemical interpretation is still unclear. In this work, we introduce a Chain-of-Thought (CoT) prompting framework and benchmark that evaluate how LLMs reason about mass spectral data to predict molecular structures. We formalize expert chemists' reasoning steps, such as double bond equivalent (DBE) analysis, neutral loss identification, and fragment assembly, into structured prompts and assess multiple state-of-the-art LLMs (Claude-3.5-Sonnet, GPT-4o-mini, and Llama-3 series) in a zero-shot setting using the MassSpecGym dataset. Our evaluation across metrics of SMILES validity, formula consistency, and structural similarity reveals that while LLMs can produce syntactically valid and partially plausible structures, they fail to achieve chemical accuracy or link reasoning to correct molecular predictions. These findings highlight both the interpretive potential and the current limitations of LLM-based reasoning for molecular elucidation, providing a foundation for future work that combines domain knowledge and reinforcement learning to achieve chemically grounded AI reasoning.
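Of the reasoning steps listed in the abstract above, double bond equivalent (DBE) analysis is the most mechanical. The sketch below applies the standard textbook formula DBE = C - (H + X)/2 + N/2 + 1, where X counts halogens. It assumes standard valences and a simple formula string without brackets or charges, and it is not taken from the paper's prompts.

```python
import re
from typing import Dict

HALOGENS = {"F", "Cl", "Br", "I"}


def parse_formula(formula: str) -> Dict[str, int]:
    """Parse a flat molecular formula such as 'C8H10N4O2' into element counts."""
    counts: Dict[str, int] = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def double_bond_equivalents(formula: str) -> float:
    """Rings plus pi bonds implied by a formula, assuming standard valences.

    Divalent atoms such as O and S do not change the result.
    """
    counts = parse_formula(formula)
    carbons = counts.get("C", 0)
    hydrogens = counts.get("H", 0)
    nitrogens = counts.get("N", 0)
    halogens = sum(counts.get(x, 0) for x in HALOGENS)
    return carbons - (hydrogens + halogens) / 2 + nitrogens / 2 + 1


benzene_dbe = double_bond_equivalents("C6H6")        # one ring + three C=C
caffeine_dbe = double_bond_equivalents("C8H10N4O2")
```

A DBE of 4 or more is often read as a hint that an aromatic ring may be present, which is the kind of intermediate conclusion a structured prompt can ask a model to state before proposing a structure.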

Title: AMEND++: Benchmarking Eligibility Criteria Amendments in Clinical Trials

Authors: Trisha Das, Mandis Beigi, Jacob Aptekar, Jimeng Sun
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.06300
Pdf URL: https://arxiv.org/pdf/2601.06300
Copy Paste: [[2601.06300]] AMEND++: Benchmarking Eligibility Criteria Amendments in Clinical Trials(https://arxiv.org/abs/2601.06300)
Keywords: language model, llm
Abstract: Clinical trial amendments frequently introduce delays, increased costs, and administrative burden, with eligibility criteria being the most commonly amended component. We introduce eligibility criteria amendment prediction, a novel NLP task that aims to forecast whether the eligibility criteria of an initial trial protocol will undergo future amendments. To support this task, we release AMEND++, a benchmark suite comprising two datasets: AMEND, which captures eligibility-criteria version histories and amendment labels from public clinical trials, and AMEND_LLM, a refined subset curated using an LLM-based denoising pipeline to isolate substantive changes. We further propose Change-Aware Masked Language Modeling (CAMLM), a revision-aware pretraining strategy that leverages historical edits to learn amendment-sensitive representations. Experiments across diverse baselines show that CAMLM consistently improves amendment prediction, enabling more robust and cost-effective clinical trial design.

Title: Why LoRA Fails to Forget: Regularized Low-Rank Adaptation Against Backdoors in Language Models

Authors: Hoang-Chau Luong, Lingwei Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.06305
Pdf URL: https://arxiv.org/pdf/2601.06305
Copy Paste: [[2601.06305]] Why LoRA Fails to Forget: Regularized Low-Rank Adaptation Against Backdoors in Language Models(https://arxiv.org/abs/2601.06305)
Keywords: language model
Abstract: Low-Rank Adaptation (LoRA) is widely used for parameter-efficient fine-tuning of large language models, but it is notably ineffective at removing backdoor behaviors from poisoned pretrained models when fine-tuning on a clean dataset. Contrary to the common belief that this weakness is caused primarily by low rank, we show that LoRA's vulnerability is fundamentally spectral. Our analysis identifies two key factors: LoRA updates (i) possess insufficient spectral strength, with singular values far below those of pretrained weights, and (ii) exhibit unfavorable spectral alignment, weakly matching clean-task directions while retaining overlap with trigger-sensitive subspaces. We further establish a critical scaling threshold beyond which LoRA can theoretically suppress trigger-induced activations, and we show empirically that standard LoRA rarely reaches this regime. We introduce Regularized Low-Rank Adaptation (RoRA), which improves forgetting by increasing spectral strength and correcting alignment through clean-strengthened regularization, trigger-insensitive constraints, and post-training spectral rescaling. Experiments across multiple NLP benchmarks and attack settings show that RoRA substantially reduces attack success rates while maintaining clean accuracy.

Title: A Rising Tide Lifts All Boats: MTQE Rewards for Idioms Improve General Translation Quality

Authors: Ishika Agarwal, Zhenlin He, Dhruva Patil, Dilek Hakkani-Tür
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.06307
Pdf URL: https://arxiv.org/pdf/2601.06307
Copy Paste: [[2601.06307]] A Rising Tide Lifts All Boats: MTQE Rewards for Idioms Improve General Translation Quality(https://arxiv.org/abs/2601.06307)
Keywords: llm
Abstract: Non-compositional expressions (e.g., idioms, proverbs, and metaphors) pose significant challenges for neural machine translation systems because their meanings cannot be derived from individual words alone. These expressions encode rich cultural meaning and have both figurative and literal meanings, making accurate translation difficult. Because models are fairly good at translating compositional text, we investigate GRPO-style fine-tuning using Machine Translation Quality Estimation (MTQE) models as reward functions to train models to better translate idioms. Using Chinese and Hindi idiom datasets, we find that idiom translation abilities improve by ~14 points, general, non-idiomatic translation implicitly improves by ~8 points, and cross-lingual translation abilities (trained on one language, evaluated on another) improve by ~6 points. Overall, our work quantifies the non-compositional translation gap and offers insights for developing LLMs with stronger cross-cultural and figurative language understanding.

Title: Annotating Dimensions of Social Perception in Text: The First Sentence-Level Dataset of Warmth and Competence

Authors: Mutaz Ayesh, Saif M. Mohammad, Nedjma Ousidhoum
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.06316
Pdf URL: https://arxiv.org/pdf/2601.06316
Copy Paste: [[2601.06316]] Annotating Dimensions of Social Perception in Text: The First Sentence-Level Dataset of Warmth and Competence(https://arxiv.org/abs/2601.06316)
Keywords: language model, llm
Abstract: Warmth (W) (often further broken down into Trust (T) and Sociability (S)) and Competence (C) are central dimensions along which people evaluate individuals and social groups (Fiske, 2018). While these constructs are well established in social psychology, they are only starting to get attention in NLP research through word-level lexicons, which do not completely capture their contextual expression in larger text units and discourse. In this work, we introduce Warmth and Competence Sentences (W&C-Sent), the first sentence-level dataset annotated for warmth and competence. The dataset includes over 1,600 English sentence-target pairs annotated along three dimensions: trust and sociability (components of warmth), and competence. The sentences in W&C-Sent are from social media and often express attitudes and opinions about specific individuals or social groups (the targets of our annotations). We describe the data collection, annotation, and quality-control procedures in detail, and evaluate a range of large language models (LLMs) on their ability to identify trust, sociability, and competence in text. W&C-Sent provides a new resource for analyzing warmth and competence in language and supports future research at the intersection of NLP and computational social science.

Title: On the Fallacy of Global Token Perplexity in Spoken Language Model Evaluation

Authors: Jeff Chan-Jan Sju, Liang-Hsuan Tseng, Yi-Cheng Lin, Yen-Chun Kuo, Ju-Chieh Chou, Kai-Wei Chang, Hung-yi Lee, Carlos Busso
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.06329
Pdf URL: https://arxiv.org/pdf/2601.06329
Copy Paste: [[2601.06329]] On the Fallacy of Global Token Perplexity in Spoken Language Model Evaluation(https://arxiv.org/abs/2601.06329)
Keywords: language model, prompt
Abstract: Generative spoken language models pretrained on large-scale raw audio can continue a speech prompt with appropriate content while preserving attributes like speaker and emotion, serving as foundation models for spoken dialogue. In prior literature, these models are often evaluated using "global token perplexity", which directly applies the text perplexity formulation to speech tokens. However, this practice overlooks fundamental differences between speech and text modalities, possibly leading to an underestimation of the speech characteristics. In this work, we propose a variety of likelihood- and generative-based evaluation methods that serve in place of naive global token perplexity. We demonstrate that the proposed evaluations more faithfully reflect perceived generation quality, as evidenced by stronger correlations with human-rated mean opinion scores (MOS). When assessed under the new metrics, the relative performance landscape of spoken language models is reshaped, revealing a significantly reduced gap between the best-performing model and the human topline. Together, these results suggest that appropriate evaluation is critical for accurately assessing progress in spoken language modeling.
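For reference, the text perplexity formulation that the abstract above describes as being applied directly to speech tokens is the exponential of the average negative log-likelihood per token. The sketch below computes it from a list of per-token log-probabilities. It illustrates only the standard definition; the exact variant used in prior literature and the alternative metrics proposed in the paper are not specified in the abstract.

```python
import math
from typing import Sequence


def token_perplexity(log_probs: Sequence[float]) -> float:
    """Perplexity over all tokens: exp of the mean negative log-probability.

    log_probs holds natural-log probabilities the model assigned to each
    observed token (text tokens or discrete speech tokens alike).
    """
    if not log_probs:
        raise ValueError("log_probs must be non-empty")
    return math.exp(-sum(log_probs) / len(log_probs))


# A model that assigns probability 0.25 to every token has perplexity 4.
uniform_ppl = token_perplexity([math.log(0.25)] * 4)
```

Because every token contributes equally to the average, the score makes no distinction between what different speech tokens encode, which is consistent with the abstract's concern that the text formulation overlooks differences between the two modalities.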

Title: AfriqueLLM: How Data Mixing and Model Architecture Impact Continued Pre-training for African Languages

Authors: Hao Yu, Tianyi Xu, Michael A. Hedderich, Wassim Hamidouche, Syed Waqas Zamir, David Ifeoluwa Adelani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.06395
Pdf URL: https://arxiv.org/pdf/2601.06395
Copy Paste: [[2601.06395]] AfriqueLLM: How Data Mixing and Model Architecture Impact Continued Pre-training for African Languages(https://arxiv.org/abs/2601.06395)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly multilingual, yet open models continue to underperform relative to proprietary systems, with the gap most pronounced for African languages. Continued pre-training (CPT) offers a practical route to language adaptation, but improvements on demanding capabilities such as mathematical reasoning often remain limited. This limitation is driven in part by the uneven domain coverage and missing task-relevant knowledge that characterize many low-resource language corpora. We present AfriqueLLM, a suite of open LLMs adapted to 20 African languages through CPT on 26B tokens. We perform a comprehensive empirical study across five base models spanning sizes and architectures, including Llama 3.1, Gemma 3, and Qwen 3, and systematically analyze how CPT data composition shapes downstream performance. In particular, we vary mixtures that include math, code, and synthetic translated data, and evaluate the resulting models on a range of multilingual benchmarks. Our results identify data composition as the primary driver of CPT gains. Adding math, code, and synthetic translated data yields consistent improvements, including on reasoning-oriented evaluations. Within a fixed architecture, larger models typically improve performance, but architectural choices dominate scale when comparing across model families. Moreover, strong multilingual performance in the base model does not reliably predict post-CPT outcomes; robust architectures coupled with task-aligned data provide a more dependable recipe. Finally, our best models improve long-context performance, including document-level translation. Models have been released on Hugging Face (this https URL).

Title: MITRA: A Large-Scale Parallel Corpus and Multilingual Pretrained Language Model for Machine Translation and Semantic Retrieval for Pāli, Sanskrit, Buddhist Chinese, and Tibetan

Authors: Sebastian Nehrdich, Kurt Keutzer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.06400
Pdf URL: https://arxiv.org/pdf/2601.06400
Copy Paste: [[2601.06400]] MITRA: A Large-Scale Parallel Corpus and Multilingual Pretrained Language Model for Machine Translation and Semantic Retrieval for Pāli, Sanskrit, Buddhist Chinese, and Tibetan(https://arxiv.org/abs/2601.06400)
Keywords: language model
Abstract: Ancient Buddhist literature features frequent, yet often unannotated, textual parallels spread across diverse languages: Sanskrit, Pāli, Buddhist Chinese, Tibetan, and more. The scale of this material makes manual examination prohibitive. We present the MITRA framework, which consists of a novel pipeline for multilingual parallel passage mining, MITRA-parallel, a large-scale corpus of 1.74 million parallel sentence pairs between Sanskrit, Chinese, and Tibetan, and the development of the domain-specific pretrained language model Gemma 2 MITRA. We present Gemma 2 MITRA-MT, a version of this base model fine-tuned on machine translation tasks, reaching state-of-the-art performance for machine translation of these languages into English and outperforming even much larger open-source models. We also present Gemma 2 MITRA-E, a semantic embedding model that shows state-of-the-art performance on a novel, detailed semantic embedding benchmark. We make the parallel dataset, model weights, and semantic similarity benchmark openly available to aid both NLP research and philological studies in Buddhist and classical Asian literature.
摘要：古代佛教文献的特点是频繁但往往未经注释的文本相似之处分布在多种语言中：梵语、巴利语、佛教汉语、藏语等。这种材料的规模使得手动检查变得令人望而却步。我们提出了 MITRA 框架，其中包括用于多语言并行通道挖掘的新颖管道 MITRA-parallel、梵文、中文和藏文之间 174 万个并行句子对的大型语料库，以及特定领域预训练语言模型 Gemma 2 MITRA 的开发。我们推出了 Gemma 2 MITRA-MT，这是该基本模型的一个版本，针对机器翻译任务进行了微调，在将这些语言机器翻译成英语方面达到了最先进的性能，甚至超越了更大的开源模型。我们还推出了 Gemma 2 MITRA-E，这是一种语义嵌入模型，它在新颖、详细的语义嵌入基准上展示了最先进的性能。我们公开提供并行数据集、模型权重和语义相似性基准，以帮助佛教和古典亚洲文学中的 NLP 研究和语言学研究。

Title: Steer Model beyond Assistant: Controlling System Prompt Strength via Contrastive Decoding

Authors: Yijiang River Dong, Tiancheng Hu, Zheng Hui, Nigel Collier
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.06403
Pdf URL: https://arxiv.org/pdf/2601.06403
Copy Paste: [[2601.06403]] Steer Model beyond Assistant: Controlling System Prompt Strength via Contrastive Decoding(https://arxiv.org/abs/2601.06403)
Keywords: language model, prompt
Abstract: Large language models excel at complex instructions yet struggle to deviate from their helpful assistant persona, as post-training instills strong priors that resist conflicting instructions. We introduce system prompt strength, a training-free method that treats prompt adherence as a continuous control. By contrasting logits from target and default system prompts, we isolate and amplify the behavioral signal unique to the target persona by a scalar factor alpha. Across five diverse benchmarks spanning constraint satisfaction, behavioral control, pluralistic alignment, capability modulation, and stylistic control, our method yields substantial improvements: up to +8.5 strict accuracy on IFEval, +45pp refusal rate on OffTopicEval, and +13% steerability on Prompt-Steering. Our approach enables practitioners to modulate system prompt strength, providing dynamic control over model behavior without retraining.
摘要：大型语言模型擅长复杂的指令，但很难偏离其乐于助人的助理角色，因为训练后会灌输强大的先验知识，以抵制相互冲突的指令。我们引入了系统提示强度，这是一种无需训练的方法，将提示坚持视为一种持续控制。通过对比目标和默认系统提示的逻辑，我们通过标量因子 alpha 隔离并放大目标角色特有的行为信号。在跨越约束满足、行为控制、多元对齐、能力调节和风格控制的五个不同基准中，我们的方法产生了实质性的改进：IFEval 的严格准确度高达 +8.5，OffTopicEval 的拒绝率高达 +45pp，Prompt-Steering 的可操纵性高达 +13%。我们的方法使从业者能够调节系统提示强度，提供对模型行为的动态控制，而无需重新训练。
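
The contrast described in this abstract can be illustrated with a small sketch. The abstract does not give the exact combination rule, so the formula below (default logits plus alpha times the target-minus-default difference) is an assumption, as are the function names and the toy three-token vocabulary.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def steer_logits(target_logits, default_logits, alpha):
    """Assumed rule: default + alpha * (target - default).

    alpha = 0 reproduces the default-prompt logits, alpha = 1 the
    target-prompt logits, and alpha > 1 amplifies the difference.
    """
    return [d + alpha * (t - d) for t, d in zip(target_logits, default_logits)]


# Toy next-token logits under the two system prompts (made-up numbers).
default_logits = [2.0, 1.0, 0.0]
target_logits = [1.0, 2.0, 0.0]

probs_plain = softmax(steer_logits(target_logits, default_logits, 1.0))
probs_strong = softmax(steer_logits(target_logits, default_logits, 2.0))
```

With these toy numbers, raising alpha from 1 to 2 shifts more probability mass onto the token that the target prompt favors over the default prompt.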

Title: Value of Information: A Framework for Human-Agent Communication

Authors: Yijiang River Dong, Tiancheng Hu, Zheng Hui, Caiqi Zhang, Ivan Vulić, Andreea Bobu, Nigel Collier
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.06407
Pdf URL: https://arxiv.org/pdf/2601.06407
Copy Paste: [[2601.06407]] Value of Information: A Framework for Human-Agent Communication(https://arxiv.org/abs/2601.06407)
Keywords: language model, llm, agent
Abstract: Large Language Model (LLM) agents deployed for real-world tasks face a fundamental dilemma: user requests are underspecified, yet agents must decide whether to act on incomplete information or interrupt users for clarification. Existing approaches either rely on brittle confidence thresholds that require task-specific tuning, or fail to account for the varying stakes of different decisions. We introduce a decision-theoretic framework that resolves this trade-off through the Value of Information (VoI), enabling agents to dynamically weigh the expected utility gain from asking questions against the cognitive cost imposed on users. Our inference-time method requires no hyperparameter tuning and adapts seamlessly across contexts, from casual games to medical diagnosis. Experiments across four diverse domains (20 Questions, medical diagnosis, flight booking, and e-commerce) show that VoI consistently matches or exceeds the best manually-tuned baselines, achieving up to 1.36 utility points higher in high-cost settings. This work provides a parameter-free framework for adaptive agent communication that explicitly balances task risk, query ambiguity, and user effort.
摘要：为现实世界任务部署的大型语言模型（LLM）代理面临着一个基本的困境：用户请求未指定，但代理必须决定是对不完整的信息采取行动还是打断用户以进行澄清。现有方法要么依赖于需要针对特定任务进行调整的脆弱置信阈值，要么无法考虑不同决策的不同风险。我们引入了一个决策理论框架，通过信息价值（VoI）解决这种权衡，使代理能够动态权衡提出问题的预期效用增益与强加给用户的认知成本。我们的推理时间方法不需要超参数调整，并且可以无缝适应从休闲游戏到医疗诊断的各种环境。跨越四个不同领域（20 个问题、医疗诊断、航班预订和电子商务）的实验表明，VoI 始终匹配或超过最佳手动调整基线，在高成本环境中实现高达 1.36 个效用点。这项工作为自适应代理通信提供了一个无参数框架，可以显式地平衡任务风险、查询模糊性和用户工作量。
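
The ask-or-act trade-off in this abstract can be sketched with the textbook definition of Value of Information: the expected best utility after hearing the answer, minus the best utility achievable now. This is a minimal illustration, not the paper's formulation; the seat-booking states, utilities, and cost values are hypothetical.

```python
def expected_utility(belief, utility, action):
    """Expected utility of one action under a belief over hidden states."""
    return sum(p * utility[action][state] for state, p in belief.items())


def best_expected_utility(belief, utility):
    """Utility of the best action the agent could take right now."""
    return max(expected_utility(belief, utility, a) for a in utility)


def value_of_information(belief, utility, answer_likelihood):
    """VoI of a question; answer_likelihood[answer][state] = P(answer | state)."""
    baseline = best_expected_utility(belief, utility)
    after = 0.0
    for likelihood in answer_likelihood.values():
        p_answer = sum(likelihood[s] * belief[s] for s in belief)
        if p_answer == 0:
            continue  # this answer can never occur under the current belief
        posterior = {s: likelihood[s] * belief[s] / p_answer for s in belief}
        after += p_answer * best_expected_utility(posterior, utility)
    return after - baseline


def should_ask(belief, utility, answer_likelihood, ask_cost):
    """Ask only when the expected gain exceeds the cost imposed on the user."""
    return value_of_information(belief, utility, answer_likelihood) > ask_cost


# Hypothetical flight-booking example: the user's seat preference is unknown.
belief = {"aisle": 0.5, "window": 0.5}
utility = {
    "book_aisle": {"aisle": 1.0, "window": 0.0},
    "book_window": {"aisle": 0.0, "window": 1.0},
}
# A fully informative question: the user's answer reveals the state.
answer_likelihood = {
    "says_aisle": {"aisle": 1.0, "window": 0.0},
    "says_window": {"aisle": 0.0, "window": 1.0},
}
voi = value_of_information(belief, utility, answer_likelihood)
```

In this toy case the question is worth 0.5 utility, so the agent asks when the interruption cost is 0.2 and acts directly when it is 0.8.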

Title: Structured Episodic Event Memory

Authors: Zhengxuan Lu, Dongfang Li, Yukun Shi, Beilun Wang, Longyue Wang, Baotian Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.06411
Pdf URL: https://arxiv.org/pdf/2601.06411
Copy Paste: [[2601.06411]] Structured Episodic Event Memory(https://arxiv.org/abs/2601.06411)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: Current approaches to memory in Large Language Models (LLMs) predominantly rely on static Retrieval-Augmented Generation (RAG), which often results in scattered retrieval and fails to capture the structural dependencies required for complex reasoning. For autonomous agents, these passive and flat architectures lack the cognitive organization necessary to model the dynamic and associative nature of long-term interaction. To address this, we propose Structured Episodic Event Memory (SEEM), a hierarchical framework that synergizes a graph memory layer for relational facts with a dynamic episodic memory layer for narrative progression. Grounded in cognitive frame theory, SEEM transforms interaction streams into structured Episodic Event Frames (EEFs) anchored by precise provenance pointers. Furthermore, we introduce an agentic associative fusion and Reverse Provenance Expansion (RPE) mechanism to reconstruct coherent narrative contexts from fragmented evidence. Experimental results on the LoCoMo and LongMemEval benchmarks demonstrate that SEEM significantly outperforms baselines, enabling agents to maintain superior narrative coherence and logical consistency.
摘要：当前大型语言模型（LLM）中的记忆方法主要依赖于静态检索增强生成（RAG），这通常会导致分散的检索，并且无法捕获复杂推理所需的结构依赖性。对于自主代理来说，这些被动且扁平的架构缺乏对长期交互的动态和关联性质进行建模所需的认知组织。为了解决这个问题，我们提出了结构化情景事件记忆（SEEM），这是一种分层框架，它将用于关系事实的图形记忆层与用于叙事进展的动态情景记忆层相结合。 SEEM 以认知框架理论为基础，将交互流转换为由精确的来源指针锚定的结构化情景事件框架 (EEF)。此外，我们引入了代理关联融合和反向来源扩展（RPE）机制，以从碎片证据中重建连贯的叙事上下文。 LoCoMo 和 LongMemEval 基准的实验结果表明，SEEM 显着优于基线，使代理能够保持卓越的叙述连贯性和逻辑一致性。

Title: Can a Unimodal Language Agent Provide Preferences to Tune a Multimodal Vision-Language Model?

Authors: Sazia Tabasum Mim, Jack Morris, Manish Dhakal, Yanming Xiu, Maria Gorlatova, Yi Ding
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.06424
Pdf URL: https://arxiv.org/pdf/2601.06424
Copy Paste: [[2601.06424]] Can a Unimodal Language Agent Provide Preferences to Tune a Multimodal Vision-Language Model?(https://arxiv.org/abs/2601.06424)
Keywords: language model, llm, agent
Abstract: To explore a more scalable path for adding multimodal capabilities to existing LLMs, this paper addresses a fundamental question: Can a unimodal LLM, relying solely on text, reason about its own informational needs and provide effective feedback to optimize a multimodal model? To answer this, we propose a method that enables a language agent to give feedback to a vision-language model (VLM) to adapt text generation to the agent's preferences. Our results from different experiments affirm this hypothesis, showing that LLM preference feedback significantly enhances VLM descriptions. Using our proposed method, we find that the VLM can generate multimodal scene descriptions to help the LLM better understand multimodal context, leading to improvements of up to 13% in absolute accuracy compared to the baseline multimodal approach. Furthermore, a human study validated our AI-driven feedback, showing a 64.6% preference alignment rate between the LLM's choices and human judgments. Extensive experiments provide insights on how and why the method works and its limitations.
摘要：为了探索一条更具可扩展性的途径，为现有法学硕士添加多模式功能，本文解决了一个基本问题：单模式法学硕士能否仅依靠文本来推理其自身的信息需求并提供有效的反馈来优化多模式模型？为了回答这个问题，我们提出了一种方法，使语言代理能够向视觉语言模型（VLM）提供反馈，以使文本生成适应代理的偏好。我们不同实验的结果证实了这一假设，表明 LLM 偏好反馈显着增强了 VLM 描述。使用我们提出的方法，我们发现 VLM 可以生成多模态场景描述，以帮助 LLM 更好地理解多模态上下文，与基线多模态方法相比，绝对精度最多提高 13%。此外，一项人类研究验证了我们人工智能驱动的反馈，显示法学硕士的选择与人类判断之间的偏好一致率为 64.6%。大量的实验提供了关于该方法如何、为何起作用及其局限性的见解。

Title: NC-Bench: An LLM Benchmark for Evaluating Conversational Competence

Authors: Robert J. Moore, Sungeun An, Farhan Ahmed, Jay Pankaj Gala
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.06426
Pdf URL: https://arxiv.org/pdf/2601.06426
Copy Paste: [[2601.06426]] NC-Bench: An LLM Benchmark for Evaluating Conversational Competence(https://arxiv.org/abs/2601.06426)
Keywords: language model, llm, retrieval-augmented generation
Abstract: The Natural Conversation Benchmark (NC-Bench) introduces a new approach to evaluating the general conversational competence of large language models (LLMs). Unlike prior benchmarks that focus on the content of model behavior, NC-Bench focuses on the form and structure of natural conversation. Grounded in the IBM Natural Conversation Framework (NCF), NC-Bench comprises three distinct sets. The Basic Conversation Competence set evaluates fundamental sequence management practices, such as answering inquiries, repairing responses, and closing conversational pairs. The RAG set applies the same sequence management patterns as the first set but incorporates retrieval-augmented generation (RAG). The Complex Request set extends the evaluation to complex requests involving more intricate sequence management patterns. Each benchmark tests a model's ability to produce contextually appropriate conversational actions in response to characteristic interaction patterns. Initial evaluations across 6 open-source models and 14 interaction patterns show that models perform well on basic answering tasks, struggle more with repair tasks (especially repeat), have mixed performance on closing sequences, and find complex multi-turn requests most challenging, with Qwen models excelling on the Basic set and Granite models on the RAG set and the Complex Request set. By operationalizing fundamental principles of human conversation, NC-Bench provides a lightweight, extensible, and theory-grounded framework for assessing and improving the conversational abilities of LLMs beyond topical or task-specific benchmarks.
摘要：自然对话基准 (NC-Bench) 引入了一种新方法来评估大型语言模型 (LLM) 的一般对话能力。与之前关注模型行为内容的基准测试不同，NC-Bench 关注自然对话的形式和结构。 NC-Bench 以 IBM 自然对话框架 (NCF) 为基础，由三个不同的集合组成。基本对话能力集评估基本的序列管理实践，例如回答查询、修复响应和结束对话对。 RAG 集应用与第一集相同的序列管理模式，但合并了检索增强生成 (RAG)。复杂请求集将评估扩展到涉及更复杂的序列管理模式的复杂请求。每个基准测试都会测试模型根据特定的交互模式生成适合上下文的对话操作的能力。对 6 个开源模型和 14 种交互模式的初步评估表明，模型在基本回答任务上表现良好，在修复任务（尤其是重复）上表现不佳，在结束序列上表现参差不齐，并发现复杂的多轮请求最具挑战性，其中 Qwen 模型在基本集上表现出色，花岗岩模型在 RAG 集和复杂请求集上表现出色。通过实施人类对话的基本原则，NC-Bench 提供了一个轻量级、可扩展且以理论为基础的框架，用于评估和提高法学硕士的对话能力，超越主题或特定任务的基准。

Title: Time Travel Engine: A Shared Latent Chronological Manifold Enables Historical Navigation in Large Language Models

Authors: Jingmin An, Wei Liu, Qian Wang, Fang Fang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.06437
Pdf URL: https://arxiv.org/pdf/2601.06437
Copy Paste: [[2601.06437]] Time Travel Engine: A Shared Latent Chronological Manifold Enables Historical Navigation in Large Language Models(https://arxiv.org/abs/2601.06437)
Keywords: language model, llm, prompt
Abstract: Time functions as a fundamental dimension of human cognition, yet the mechanisms by which Large Language Models (LLMs) encode chronological progression remain opaque. We demonstrate that temporal information in their latent space is organized not as discrete clusters but as a continuous, traversable geometry. We introduce the Time Travel Engine (TTE), an interpretability-driven framework that projects diachronic linguistic patterns onto a shared chronological manifold. Unlike surface-level prompting, TTE directly modulates latent representations to induce coherent stylistic, lexical, and conceptual shifts aligned with target eras. By parameterizing diachronic evolution as a continuous manifold within the residual stream, TTE enables fluid navigation through period-specific "zeitgeists" while restricting access to future knowledge. Furthermore, experiments across diverse architectures reveal topological isomorphism between the temporal subspaces of Chinese and English, indicating that distinct languages share a universal geometric logic of historical evolution. These findings bridge historical linguistics with mechanistic interpretability, offering a novel paradigm for controlling temporal reasoning in neural networks.
摘要：时间是人类认知的基本维度，但大型语言模型 (LLM) 编码时间顺序的机制仍然不透明。我们证明，潜在空间中的时间信息不是被组织为离散的簇，而是被组织为连续的、可遍历的几何结构。我们引入了时间旅行引擎（TTE），这是一个可解释性驱动的框架，它将历时语言模式投射到共享的时间流形上。与表面提示不同，TTE 直接调节潜在表征，以诱导与目标时代一致的连贯文体、词汇和概念转变。通过将历时演化参数化为残余流中的连续流形，TTE 能够在特定时期的“时代精神”中进行流体导航，同时限制对未来知识的访问。此外，跨不同架构的实验揭示了中文和英语的时间子空间之间的拓扑同构，这表明不同的语言共享历史演化的普遍几何逻辑。这些发现将历史语言学与机械可解释性联系起来，为控制神经网络中的时间推理提供了一种新颖的范式。

Title: LitVISTA: A Benchmark for Narrative Orchestration in Literary Text

Authors: Mingzhe Lu, Yiwen Wang, Yanbing Liu, Qi You, Chong Liu, Ruize Qin, Haoyu Dong, Wenyu Zhang, Jiarui Zhang, Yue Hu, Yunpeng Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.06445
Pdf URL: https://arxiv.org/pdf/2601.06445
Copy Paste: [[2601.06445]] LitVISTA: A Benchmark for Narrative Orchestration in Literary Text(https://arxiv.org/abs/2601.06445)
Keywords: language model, gpt, llm
Abstract: Computational narrative analysis aims to capture rhythm, tension, and emotional dynamics in literary texts. Existing large language models can generate long stories but overly focus on causal coherence, neglecting the complex story arcs and orchestration inherent in human narratives. This creates a structural misalignment between model- and human-generated narratives. We propose VISTA Space, a high-dimensional representational framework for narrative orchestration that unifies human and model narrative perspectives. We further introduce LitVISTA, a structurally annotated benchmark grounded in literary texts, enabling systematic evaluation of models' narrative orchestration capabilities. We conduct oracle evaluations on a diverse selection of frontier LLMs, including GPT, Claude, Grok, and Gemini. Results reveal systematic deficiencies: existing models fail to construct a unified global narrative view, struggling to jointly capture narrative function and structure. Furthermore, even advanced thinking modes yield only limited gains for such literary narrative understanding.
摘要：计算叙事分析旨在捕捉文学文本中的节奏、张力和情感动态。现有的大型语言模型可以生成长故事，但过度关注因果连贯性，忽略了人类叙事固有的复杂故事情节和编排。这造成了模型生成的叙述和人类生成的叙述之间的结构性错位。我们提出了 VISTA Space，一种用于叙事编排的高维表征框架，统一了人类和模型的叙事视角。我们进一步引入了 LitVISTA，这是一个基于文学文本的结构注释基准，可以系统地评估模型的叙事编排能力。我们对多种前沿法学硕士进行了预言机评估，包括 GPT、Claude、Grok 和 Gemini。结果揭示了系统性缺陷：现有模型未能构建统一的全球叙事观，难以共同捕捉叙事功能和结构。此外，即使是先进的思维模式对于这种文学叙事理解也只能带来有限的收获。

Title: PRISP: Privacy-Safe Few-Shot Personalization via Lightweight Adaptation

Authors: Junho Park, Dohoon Kim, Taesup Moon
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.06471
Pdf URL: https://arxiv.org/pdf/2601.06471
Copy Paste: [[2601.06471]] PRISP: Privacy-Safe Few-Shot Personalization via Lightweight Adaptation(https://arxiv.org/abs/2601.06471)
Keywords: language model, llm
Abstract: Large language model (LLM) personalization aims to adapt general-purpose models to individual users. Most existing methods, however, are developed under data-rich and resource-abundant settings, often incurring privacy risks. In contrast, realistic personalization typically occurs after deployment under (i) extremely limited user data, (ii) constrained computational resources, and (iii) strict privacy requirements. We propose PRISP, a lightweight and privacy-safe personalization framework tailored to these constraints. PRISP leverages a Text-to-LoRA hypernetwork to generate task-aware LoRA parameters from task descriptions, and enables efficient user personalization by optimizing a small subset of task-aware LoRA parameters together with minimal additional modules using few-shot user data. Experiments on a few-shot variant of the LaMP benchmark demonstrate that PRISP achieves strong overall performance compared to prior approaches, while reducing computational overhead and eliminating privacy risks.
摘要：大语言模型（LLM）个性化旨在使通用模型适应个人用户。然而，大多数现有方法是在数据丰富和资源丰富的环境下开发的，通常会带来隐私风险。相比之下，现实的个性化通常在以下情况下发生：(i) 极其有限的用户数据、(ii) 有限的计算资源和 (iii) 严格的隐私要求。我们提出 PRISP，这是一个针对这些限制量身定制的轻量级且隐私安全的个性化框架。 PRISP 利用文本到 LoRA 超网络从任务描述生成任务感知 LoRA 参数，并通过使用少量用户数据优化一小部分任务感知 LoRA 参数以及最少的附加模块，实现高效的用户个性化。对 LaMP 基准测试的几个样本变体进行的实验表明，与之前的方法相比，PRISP 实现了强大的整体性能，同时减少了计算开销并消除了隐私风险。

Title: IndRegBias: A Dataset for Studying Indian Regional Biases in English and Code-Mixed Social Media Comments

Authors: Debasmita Panda, Akash Anil, Neelesh Kumar Shukla
Subjects: cs.CL, cs.CY, cs.SI
Abstract URL: https://arxiv.org/abs/2601.06477
Pdf URL: https://arxiv.org/pdf/2601.06477
Copy Paste: [[2601.06477]] IndRegBias: A Dataset for Studying Indian Regional Biases in English and Code-Mixed Social Media Comments(https://arxiv.org/abs/2601.06477)
Keywords: language model, llm
Abstract: Warning: This paper consists of examples representing regional biases in Indian regions that might be offensive towards a particular region. While social biases corresponding to gender, race, socio-economic conditions, etc., have been extensively studied in the major applications of Natural Language Processing (NLP), biases corresponding to regions have garnered less attention. This is mainly because of (i) difficulty in the extraction of regional bias datasets, (ii) disagreements in annotation due to inherent human biases, and (iii) regional biases being studied in combination with other types of social biases and often being under-represented. This paper focuses on creating a dataset IndRegBias, consisting of regional biases in an Indian context reflected in users' comments on popular social media platforms, namely Reddit and YouTube. We carefully selected 25,000 comments appearing on various threads in Reddit and videos on YouTube discussing trending topics on regional issues in India. Furthermore, we propose a multilevel annotation strategy to annotate the comments describing the severity of regional biased statements. To detect the presence of regional bias and its severity in IndRegBias, we evaluate open-source Large Language Models (LLMs) and Indic Language Models (ILMs) using zero-shot, few-shot, and fine-tuning strategies. We observe that zero-shot and few-shot approaches show lower accuracy in detecting regional biases and severity in the majority of the LLMs and ILMs. However, the fine-tuning approach significantly enhances the performance of the LLM in detecting Indian regional bias along with its severity.
摘要：警告：本文包含的示例代表了印度地区的区域偏见，这些偏见可能会冒犯特定地区。虽然与性别、种族、社会经济条件等相对应的社会偏见在自然语言处理（NLP）的主要应用中得到了广泛的研究，但与地区相对应的偏见却很少受到关注。这主要是因为（i）难以提取区域偏见数据集，（ii）由于人类固有偏见而在注释方面存在分歧，以及（iii）区域偏见与其他类型的社会偏见结合起来进行研究，并且往往代表性不足。本文重点创建一个数据集 IndRegBias，该数据集包含印度背景下的区域偏见，反映在流行社交媒体平台（即 Reddit 和 YouTube）上的用户评论中。我们精心挑选了出现在 Reddit 各种帖子和 YouTube 视频上的 25,000 条评论，讨论印度地区问题的热门话题。此外，我们提出了一种多级注释策略来注释描述区域偏见陈述的严重性的评论。为了检测 IndRegBias 中区域偏差的存在及其严重性，我们使用零样本、少样本和微调策略评估开源大型语言模型 (LLM) 和印度语言模型 (ILM)。我们观察到，零样本和少样本方法在检测大多数 LLM 和 ILM 中的区域偏差和严重性方面显示出较低的准确性。然而，微调方法显着提高了法学硕士在检测印度地区偏见及其严重程度方面的表现。

Title: Spec-o3: A Tool-Augmented Vision-Language Agent for Rare Celestial Object Candidate Vetting via Automated Spectral Inspection

Authors: Minghui Jia, Qichao Zhang, Ali Luo, Linjing Li, Shuo Ye, Hailing Lu, Wen Hou, Dongbin Zhao
Subjects: cs.CL, astro-ph.IM
Abstract URL: https://arxiv.org/abs/2601.06498
Pdf URL: https://arxiv.org/pdf/2601.06498
Copy Paste: [[2601.06498]] Spec-o3: A Tool-Augmented Vision-Language Agent for Rare Celestial Object Candidate Vetting via Automated Spectral Inspection(https://arxiv.org/abs/2601.06498)
Keywords: chain-of-thought, agent
Abstract: Due to the limited generalization and interpretability of deep learning classifiers, the final vetting of rare celestial object candidates still relies on expert visual inspection, a manually intensive process. In this process, astronomers leverage specialized tools to analyze spectra and construct reliable catalogs. However, this practice has become the primary bottleneck, as it is fundamentally incapable of scaling with the data deluge from modern spectroscopic surveys. To bridge this gap, we propose Spec-o3, a tool-augmented vision-language agent that performs astronomer-aligned spectral inspection via interleaved multimodal chain-of-thought reasoning. Spec-o3 is trained with a two-stage post-training recipe: cold-start supervised fine-tuning on expert inspection trajectories followed by outcome-based reinforcement learning on rare-type verification tasks. Evaluated on five rare-object identification tasks from LAMOST, Spec-o3 establishes a new state of the art, boosting the macro-F1 score from 28.3 to 76.5 with a 7B-parameter base model and outperforming both proprietary VLMs and specialized deep models. Crucially, the agent demonstrates strong generalization to unseen inspection tasks across survey shifts (from LAMOST to SDSS/DESI). Expert evaluations confirm that its reasoning traces are coherent and physically consistent, supporting transparent and trustworthy decision-making. Code, data, and models are available at the project homepage (this https URL).
摘要：由于深度学习分类器的泛化性和可解释性有限，稀有天体候选者的最终审查仍然依赖于专家的目视检查，这是一个人工密集的过程。在此过程中，天文学家利用专门的工具来分析光谱并构建可靠的目录。然而，这种做法已成为主要瓶颈，因为它从根本上无法适应现代光谱调查的海量数据。为了弥补这一差距，我们提出了 Spec-o3，这是一种工具增强的视觉语言代理，它通过交错的多模态思维链推理来执行天文学家对齐的光谱检查。 Spec-o3 采用两阶段后训练方法进行训练：对专家检查轨迹进行冷启动监督微调，然后对罕见类型验证任务进行基于结果的强化学习。通过对 LAMOST 的五项稀有物体识别任务进行评估，Spec-o3 建立了新的最先进技术，通过 7B 参数基础模型将宏观 F1 分数从 28.3 提高到 76.5，并且性能优于专有的 VLM 和专用深度模型。至关重要的是，该代理对跨调查班次（从 LAMOST 到 SDSS/DESI）中看不见的检查任务表现出很强的泛化能力。专家评估证实其推理痕迹连贯且物理一致，支持透明且值得信赖的决策。代码、数据和模型可在 \href{此 https URL}{项目主页} 获取。

Title: MedRAGChecker: Claim-Level Verification for Biomedical Retrieval-Augmented Generation

Authors: Yuelyu Ji, Min Gu Kwak, Hang Zhang, Xizhi Wu, Chenyu Li, Yanshan Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.06519
Pdf URL: https://arxiv.org/pdf/2601.06519
Copy Paste: [[2601.06519]] MedRAGChecker: Claim-Level Verification for Biomedical Retrieval-Augmented Generation(https://arxiv.org/abs/2601.06519)
Keywords: llm, retrieval-augmented generation
Abstract: Biomedical retrieval-augmented generation (RAG) can ground LLM answers in medical literature, yet long-form outputs often contain isolated unsupported or contradictory claims with safety implications. We introduce MedRAGChecker, a claim-level verification and diagnostic framework for biomedical RAG. Given a question, retrieved evidence, and a generated answer, MedRAGChecker decomposes the answer into atomic claims and estimates claim support by combining evidence-grounded natural language inference (NLI) with biomedical knowledge-graph (KG) consistency signals. Aggregating claim decisions yields answer-level diagnostics that help disentangle retrieval and generation failures, including faithfulness, under-evidence, contradiction, and safety-critical error rates. To enable scalable evaluation, we distill the pipeline into compact biomedical models and use an ensemble verifier with class-specific reliability weighting. Experiments on four biomedical QA benchmarks show that MedRAGChecker reliably flags unsupported and contradicted claims and reveals distinct risk profiles across generators, particularly on safety-critical biomedical relations.
摘要：生物医学检索增强生成（RAG）可以在医学文献中为法学硕士的答案提供基础，但长篇输出通常包含孤立的不受支持或相互矛盾的主张，具有安全隐患。我们推出 MedRAGChecker，这是一种针对生物医学 RAG 的声明级验证和诊断框架。给定问题、检索到的证据和生成的答案，MedRAGChecker 将答案分解为原子声明，并通过将基于证据的自然语言推理 (NLI) 与生物医学知识图 (KG) 一致性信号相结合来估计声明支持。汇总索赔决策可产生答案级诊断，有助于解决检索和生成失败的问题，包括忠实性、证据不足、矛盾和安全关键错误率。为了实现可扩展的评估，我们将管道提炼成紧凑的生物医学模型，并使用具有特定类别可靠性加权的集成验证器。对四个生物医学 QA 基准的实验表明，MedRAGChecker 可靠地标记了不受支持和矛盾的主张，并揭示了生成器之间不同的风险状况，特别是在安全关键的生物医学关系方面。
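
The aggregation step mentioned in this abstract (claim decisions rolled up into answer-level diagnostics) can be sketched as follows. The label names, the `safety_critical` flag, and the exact rate definitions are assumptions made for illustration; the abstract names the four diagnostics but does not define them.

```python
from collections import Counter


def answer_diagnostics(claims):
    """Aggregate per-claim verdicts into answer-level rates.

    Each claim is a dict with a 'label' in {'supported', 'unsupported',
    'contradicted'} and a boolean 'safety_critical' flag (hypothetical schema).
    """
    n = len(claims)
    if n == 0:
        return {"faithfulness": 0.0, "under_evidence": 0.0,
                "contradiction": 0.0, "safety_critical_error": 0.0}
    counts = Counter(c["label"] for c in claims)
    # Assumed definition: a safety-critical claim that is not supported.
    safety_errors = sum(
        1 for c in claims if c["safety_critical"] and c["label"] != "supported"
    )
    return {
        "faithfulness": counts["supported"] / n,
        "under_evidence": counts["unsupported"] / n,
        "contradiction": counts["contradicted"] / n,
        "safety_critical_error": safety_errors / n,
    }


# Four made-up atomic claims extracted from one generated answer.
claims = [
    {"label": "supported", "safety_critical": False},
    {"label": "supported", "safety_critical": True},
    {"label": "unsupported", "safety_critical": False},
    {"label": "contradicted", "safety_critical": True},
]
report = answer_diagnostics(claims)
```

Separating the unsupported rate from the contradicted rate is what lets a report of this kind hint at whether retrieval (missing evidence) or generation (conflict with evidence) is the likelier failure.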

Title: Exposía: Academic Writing Assessment of Exposés and Peer Feedback

Authors: Dennis Zyska, Alla Rozovskaya, Ilia Kuznetsov, Iryna Gurevych
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.06536
Pdf URL: https://arxiv.org/pdf/2601.06536
Copy Paste: [[2601.06536]] Exposía: Academic Writing Assessment of Exposés and Peer Feedback(https://arxiv.org/abs/2601.06536)
Keywords: language model, llm, prompt
Abstract: We present Exposía, the first public dataset that connects writing and feedback assessment in higher education, enabling research on educationally grounded approaches to academic writing evaluation. Exposía includes student research project proposals and peer and instructor feedback consisting of comments and free-text reviews. The dataset was collected in the "Introduction to Scientific Work" course of the Computer Science undergraduate program that focuses on teaching academic writing skills and providing peer feedback on academic writing. Exposía reflects the multi-stage nature of the academic writing process that includes drafting, providing and receiving feedback, and revising the writing based on the feedback received. Both the project proposals and peer feedback are accompanied by human assessment scores based on a fine-grained, pedagogically-grounded schema for writing and feedback assessment that we develop. We use Exposía to benchmark state-of-the-art open-source large language models (LLMs) for two tasks: automated scoring of (1) the proposals and (2) the student reviews. The strongest LLMs attain high agreement on scoring aspects that require little domain knowledge but degrade on dimensions evaluating content, in line with human agreement values. We find that LLMs align better with the human instructors giving high scores. Finally, we establish that a prompting strategy that scores multiple aspects of the writing together is the most effective, an important finding for classroom deployment.
摘要：我们推出了 Exposía，这是第一个将高等教育中的写作和反馈评估联系起来的公共数据集，从而能够研究以教育为基础的学术写作评估方法。 Exposía 包括学生研究项目提案以及由评论和自由文本评论组成的同行和教师反馈。该数据集是在计算机科学本科课程的“科学工作概论”课程中收集的，该课程侧重于教授学术写作技巧并提供学术写作的同行反馈。 Exposía 反映了学术写作过程的多阶段性质，包括起草、提供和接收反馈以及根据收到的反馈修改写作。项目提案和同行反馈都附有人工评估分数，该分数基于我们开发的细粒度、以教学为基础的写作和反馈评估模式。我们使用 Exposía 对最先进的开源大语言模型 (LLM) 进行基准测试，以完成两项任务：(1) 提案和 (2) 学生评论的自动评分。最强的法学硕士在评分方面获得了高度一致，这些方面需要很少的领域知识，但在评估内容的维度上却下降了，这与人类的一致值一致。我们发现法学硕士与给出高分的人类讲师更加一致。最后，我们确定，对写作的多个方面进行评分的提示策略是最有效的，也是课堂部署的重要发现。

Title: SimLLM: Fine-Tuning Code LLMs for SimPy-Based Queueing System Simulation

Authors: Jun-Qi Chen, Kun Zhang, Rui Zheng, Ying Zhong
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.06543
Pdf URL: https://arxiv.org/pdf/2601.06543
Copy Paste: [[2601.06543]] SimLLM: Fine-Tuning Code LLMs for SimPy-Based Queueing System Simulation(https://arxiv.org/abs/2601.06543)
Keywords: language model, gpt, llm
Abstract: The Python package SimPy is widely used for modeling queueing systems due to its flexibility, simplicity, and smooth integration with modern data analysis and optimization frameworks. Recent advances in large language models (LLMs) have shown strong ability in generating clear and executable code, making them powerful and suitable tools for writing SimPy queueing simulation code. However, directly employing closed-source models like GPT-4o to generate such code may lead to high computational costs and raise data privacy concerns. To address this, we fine-tune two open-source LLMs, Qwen-Coder-7B and DeepSeek-Coder-6.7B, on curated SimPy queueing data, which enhances their code-generating performance in executability, output-format compliance, and instruction-code consistency. In particular, we propose a multi-stage fine-tuning framework comprising two stages of supervised fine-tuning (SFT) and one stage of direct preference optimization (DPO), progressively enhancing the model's ability in SimPy-based queueing simulation code generation. Extensive evaluations demonstrate that both fine-tuned models achieve substantial improvements in executability, output-format compliance, and instruction-code consistency. These results confirm that domain-specific fine-tuning can effectively transform compact open-source code models into reliable SimPy simulation generators that provide a practical alternative to closed-source LLMs for education, research, and operational decision support.
摘要：Python 包 SimPy 因其灵活性、简单性以及与现代数据分析和优化框架的平滑集成而被广泛用于排队系统建模。大型语言模型 (LLM) 的最新进展显示出生成清晰可执行代码的强大能力，使其成为编写 SimPy 排队模拟代码的强大且合适的工具。然而，直接采用 GPT-4o 等闭源模型来生成此类代码可能会导致高昂的计算成本并引发数据隐私问题。为了解决这个问题，我们在精心策划的 SimPy 排队数据上对两个开源 LLM Qwen-Coder-7B 和 DeepSeek-Coder-6.7B 进行了微调，从而增强了它们在可执行性、输出格式合规性和指令代码一致性方面的代码生成性能。特别是，我们提出了一种多阶段微调框架，包括两阶段监督微调（SFT）和一阶段直接偏好优化（DPO），逐步增强模型基于 SimPy 的排队模拟代码生成能力。广泛的评估表明，这两种微调模型在可执行性、输出格式合规性和指令一致性方面都取得了实质性改进。这些结果证实，特定领域的微调可以有效地将紧凑的开源代码模型转换为可靠的 SimPy 仿真生成器，为教育、研究和运营决策支持提供闭源 LLM 的实用替代方案。

Title: CSR-RAG: An Efficient Retrieval System for Text-to-SQL on the Enterprise Scale

Authors: Rajpreet Singh, Novak Boškov, Lawrence Drabeck, Aditya Gudal, Manzoor A. Khan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.06564
Pdf URL: https://arxiv.org/pdf/2601.06564
Copy Paste: [[2601.06564]] CSR-RAG: An Efficient Retrieval System for Text-to-SQL on the Enterprise Scale(https://arxiv.org/abs/2601.06564)
Keywords: language model, llm, retrieval augmented generation
Abstract: Natural language to SQL translation (Text-to-SQL) is one of the long-standing problems that has recently benefited from advances in Large Language Models (LLMs). While most academic Text-to-SQL benchmarks request schema description as a part of natural language input, enterprise-scale applications often require table retrieval before SQL query generation. To address this need, we propose a novel hybrid Retrieval Augmented Generation (RAG) system consisting of contextual, structural, and relational retrieval (CSR-RAG) to achieve computationally efficient yet sufficiently accurate retrieval for enterprise-scale databases. Through extensive enterprise benchmarks, we demonstrate that CSR-RAG achieves up to 40% precision and over 80% recall while incurring a negligible average query generation latency of only 30ms on commodity data center hardware, which makes it appropriate for modern LLM-based enterprise-scale systems.
摘要：自然语言到 SQL 的翻译（文本到 SQL）是长期存在的问题之一，最近受益于大型语言模型 (LLM) 的进步。虽然大多数学术文本到 SQL 基准测试要求架构描述作为自然语言输入的一部分，但企业级应用程序通常需要在 SQL 查询生成之前进行表检索。为了满足这一需求，我们提出了一种新颖的混合检索增强生成（RAG）系统，由上下文检索、结构检索和关系检索（CSR-RAG）组成，以实现企业级数据库的计算高效且足够准确的检索。通过广泛的企业基准测试，我们证明 CSR-RAG 可以实现高达 40% 的精度和超过 80% 的召回率，同时在商用数据中心硬件上产生的平均查询生成延迟仅为 30 毫秒，可以忽略不计，这使其适合基于 LLM 的现代企业规模系统。

Title: EVM-QuestBench: An Execution-Grounded Benchmark for Natural-Language Transaction Code Generation

Authors: Pei Yang, Wanyi Chen, Ke Wang, Lynn Ai, Eric Yang, Tianyu Shi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.06565
Pdf URL: https://arxiv.org/pdf/2601.06565
Copy Paste: [[2601.06565]] EVM-QuestBench: An Execution-Grounded Benchmark for Natural-Language Transaction Code Generation(https://arxiv.org/abs/2601.06565)
Keywords: language model
Abstract: Large language models are increasingly applied to various development scenarios. However, in on-chain transaction scenarios, even a minor error can cause irreversible loss for users. Existing evaluations often overlook execution accuracy and safety. We introduce EVM-QuestBench, an execution-grounded benchmark for natural-language transaction-script generation on EVM-compatible chains. The benchmark employs dynamic evaluation: instructions are sampled from template pools, numeric parameters are drawn from predefined intervals, and validators verify outcomes against these instantiated values. EVM-QuestBench contains 107 tasks (62 atomic, 45 composite). Its modular architecture enables rapid task development. The runner executes scripts on a forked EVM chain with snapshot isolation; composite tasks apply step-efficiency decay. We evaluate 20 models and find large performance gaps, with split scores revealing persistent asymmetry between single-action precision and multi-step workflow completion. Code: this https URL.
摘要：大型语言模型越来越多地应用于各种开发场景。然而，在链上交易场景中，即使是一个微小的错误也可能给用户带来不可挽回的损失。现有的评估常常忽视执行的准确性和安全性。我们推出了 EVM-QuestBench，这是一个基于执行的基准，用于在 EVM 兼容链上生成自然语言交易脚本。该基准采用动态评估：从模板池中采样指令，从预定义的间隔中提取数字参数，验证器根据这些实例化值验证结果。 EVM-QuestBench 包含 107 个任务（62 个原子任务，45 个复合任务）。其模块化架构可实现快速任务开发。运行器在具有快照隔离的分叉 EVM 链上执行脚本；复合任务应用阶跃效率衰减。我们评估了 20 个模型并发现了巨大的性能差距，分割分数揭示了单操作精度和多步骤工作流程完成之间持续的不对称性。代码：此 https URL。
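
The dynamic evaluation loop in this abstract (sample a template, draw a numeric parameter from an interval, validate against the instantiated value, apply step-efficiency decay) can be sketched in a few lines. Everything concrete here is hypothetical: the templates, the interval, the placeholder address `0xRecipient`, the balance-based validator, and the geometric decay rule are assumptions, not the benchmark's implementation.

```python
import random

# Hypothetical template pool and parameter interval for one atomic task.
TEMPLATES = [
    "Send {amount} tokens to {to}",
    "Transfer {amount} tokens to address {to}",
]
AMOUNT_RANGE = (1, 100)


def instantiate_task(rng):
    """Sample an instruction template and a numeric parameter."""
    amount = rng.randint(*AMOUNT_RANGE)
    template = rng.choice(TEMPLATES)
    return {
        "instruction": template.format(amount=amount, to="0xRecipient"),
        "amount": amount,
    }


def validate(task, balance_before, balance_after):
    """Check the observed outcome against the instantiated value."""
    return balance_after - balance_before == task["amount"]


def step_decay(base_score, steps_used, optimal_steps, decay=0.9):
    """Assumed decay: multiply the score by `decay` per step beyond the optimum."""
    extra = max(0, steps_used - optimal_steps)
    return base_score * decay ** extra


rng = random.Random(0)  # fixed seed so the sampled task is reproducible
task = instantiate_task(rng)
```

Because the validator compares against the sampled value rather than a fixed constant, a script that hard-codes an amount would pass only by coincidence.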

Title: Are Emotions Arranged in a Circle? Geometric Analysis of Emotion Representations via Hyperspherical Contrastive Learning

Authors: Yusuke Yamauchi, Akiko Aizawa
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.06575
Pdf URL: https://arxiv.org/pdf/2601.06575
Copy Paste: [[2601.06575]] Are Emotions Arranged in a Circle? Geometric Analysis of Emotion Representations via Hyperspherical Contrastive Learning(https://arxiv.org/abs/2601.06575)
Keywords: language model
Abstract: Psychological research has long utilized circumplex models to structure emotions, placing similar emotions adjacently and opposing ones diagonally. Although frequently used to interpret deep learning representations, these models are rarely directly incorporated into the representation learning of language models, leaving their geometric validity unexplored. This paper proposes a method to induce circular emotion representations within language model embeddings via contrastive learning on a hypersphere. We show that while this circular alignment offers superior interpretability and robustness against dimensionality reduction, it underperforms compared to conventional designs in high-dimensional settings and fine-grained classification. Our findings elucidate the trade-offs involved in applying psychological circumplex models to deep learning architectures.
摘要：心理学研究长期以来一直利用环状模型来构建情绪，将相似的情绪相邻放置，将相反的情绪对角放置。尽管这些模型经常用于解释深度学习表示，但很少直接纳入语言模型的表示学习中，从而导致其几何有效性未被探索。本文提出了一种通过超球面上的对比学习在语言模型嵌入中引入循环情感表示的方法。我们表明，虽然这种圆形对齐提供了卓越的可解释性和针对降维的鲁棒性，但与高维设置和细粒度分类中的传统设计相比，它的表现较差。我们的研究结果阐明了将心理循环模型应用于深度学习架构所涉及的权衡。

Title: Stylistic Evolution and LLM Neutrality in Singlish Language

Authors: Linus Tze En Foo, Weihan Angela Ng, Wenkai Li, Lynnette Hui Xian Ng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.06580
Pdf URL: https://arxiv.org/pdf/2601.06580
Copy Paste: [[2601.06580]] Stylistic Evolution and LLM Neutrality in Singlish Language(https://arxiv.org/abs/2601.06580)
Keywords: llm, prompt
Abstract: Singlish is a creole rooted in Singapore's multilingual environment and continues to evolve alongside social and technological change. This study investigates the evolution of Singlish over a decade of informal digital text messages. We propose a stylistic similarity framework that compares lexico-structural, pragmatic, psycholinguistic, and encoder-derived features across years to quantify temporal variation. Our analysis reveals notable diachronic changes in tone, expressivity and sentence construction over the years. Conversely, while some LLMs were able to generate superficially realistic Singlish messages, they do not produce temporally neutral outputs, and residual temporal signals remain detectable despite prompting and fine-tuning. Our findings highlight the dynamic evolution of Singlish, as well as the capabilities and limitations of current LLMs in modeling sociolectal and temporal variations in the colloquial language.
摘要：新加坡式英语是一种植根于新加坡多语言环境的克里奥尔语，并随着社会和技术变革而不断发展。这项研究调查了新加坡式英语十多年来非正式数字短信的演变。我们提出了一种风格相似性框架，该框架可以比较多年来的词汇结构、语用、心理语言学和编码器衍生的特征，以量化时间变化。我们的分析揭示了多年来语气、表达力和句子结构的显着历时变化。相反，虽然一些 LLM 能够生成表面上真实的新加坡式英语消息，但它们不会产生时间中性的输出，并且尽管进行了提示和微调，残留的时间信号仍然是可检测到的。我们的研究结果强调了新加坡式英语的动态演变，以及当前 LLM 在模拟口语语言的社会方言和时间变化方面的能力和局限性。

Title: Detecting LLM-Generated Text with Performance Guarantees

Authors: Hongyi Zhou, Jin Zhu, Ying Yang, Chengchun Shi
Subjects: cs.CL, cs.LG, stat.AP, stat.ML
Abstract URL: https://arxiv.org/abs/2601.06586
Pdf URL: https://arxiv.org/pdf/2601.06586
Copy Paste: [[2601.06586]] Detecting LLM-Generated Text with Performance Guarantees(https://arxiv.org/abs/2601.06586)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) such as GPT, Claude, Gemini, and Grok have been deeply integrated into our daily life. They now support a wide range of tasks -- from dialogue and email drafting to assisting with teaching and coding, serving as search engines, and much more. However, their ability to produce highly human-like text raises serious concerns, including the spread of fake news, the generation of misleading governmental reports, and academic misconduct. To address this practical problem, we train a classifier to determine whether a piece of text is authored by an LLM or a human. Our detector is deployed on an online CPU-based platform this https URL, and contains three novelties over existing detectors: (i) it does not rely on auxiliary information, such as watermarks or knowledge of the specific LLM used to generate the text; (ii) it more effectively distinguishes between human- and LLM-authored text; and (iii) it enables statistical inference, which is largely absent in the current literature. Empirically, our classifier achieves higher classification accuracy compared to existing detectors, while maintaining type-I error control, high statistical power, and computational efficiency.
摘要：GPT、Claude、Gemini、Grok 等大型语言模型（LLM）已经深入融入我们的日常生活。它们现在支持广泛的任务——从对话和电子邮件起草到协助教学和编码、充当搜索引擎等等。然而，它们生成高度类人文本的能力引起了严重关注，包括假新闻的传播、误导性政府报告的生成以及学术不端行为。为了解决这个实际问题，我们训练一个分类器来确定一段文本是由 LLM 还是人类撰写的。我们的检测器部署在基于 CPU 的在线平台上（此 https URL），并且与现有检测器相比包含三个新颖之处：（i）它不依赖于辅助信息，例如水印或用于生成文本的特定 LLM 的知识；（ii）它更有效地区分人类和 LLM 撰写的文本；（iii）它能够进行统计推断，而这在当前的文献中基本上是不存在的。实验表明，与现有检测器相比，我们的分类器实现了更高的分类精度，同时保持第一类错误控制、高统计功效和计算效率。

Title: How Context Shapes Truth: Geometric Transformations of Statement-level Truth Representations in LLMs

Authors: Shivam Adarsh, Maria Maistro, Christina Lioma
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.06599
Pdf URL: https://arxiv.org/pdf/2601.06599
Copy Paste: [[2601.06599]] How Context Shapes Truth: Geometric Transformations of Statement-level Truth Representations in LLMs(https://arxiv.org/abs/2601.06599)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) often encode whether a statement is true as a vector in their residual stream activations. These vectors, also known as truth vectors, have been studied in prior work; however, how they change when context is introduced remains unexplored. We study this question by measuring (1) the directional change ($\theta$) between the truth vectors with and without context and (2) the relative magnitude of the truth vectors upon adding context. Across four LLMs and four datasets, we find that (1) truth vectors are roughly orthogonal in early layers, converge in middle layers, and may stabilize or continue increasing in later layers; (2) adding context generally increases the truth vector magnitude, i.e., the separation between true and false representations in the activation space is amplified; (3) larger models distinguish relevant from irrelevant context mainly through directional change ($\theta$), while smaller models show this distinction through magnitude differences. We also find that context conflicting with parametric knowledge produces larger geometric changes than parametrically aligned context. To the best of our knowledge, this is the first work that provides a geometric characterization of how context transforms the truth vector in the activation space of LLMs.
摘要：大型语言模型 (LLM) 通常将语句是否正确编码为残差流激活中的向量。这些向量，也称为真值向量，已经在之前的工作中进行了研究，但是当引入上下文时它们如何变化仍有待探索。我们通过测量（1）有上下文和无上下文的真值向量之间的方向变化（$\theta$）和（2）添加上下文后真值向量的相对大小来研究这个问题。在四个 LLM 和四个数据集中，我们发现（1）真值向量在早期层大致正交，在中间层收敛，并且在后面层可能稳定或继续增加；（2）添加上下文通常会增加真值向量的大小，即激活空间中真假表示之间的分离被放大；（3）较大的模型主要通过方向变化（$\theta$）来区分相关和不相关的上下文，而较小的模型通过幅度差异来显示这种区别。我们还发现，与参数化知识相冲突的上下文比参数化对齐的上下文产生更大的几何变化。据我们所知，这是第一个提供上下文如何在 LLM 激活空间中变换真值向量的几何特征的工作。
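
The two measurements named in this abstract, the directional change $\theta$ and the relative magnitude, can be sketched as follows. The abstract does not say how a truth vector is extracted, so the difference-of-means construction in `truth_vector` is an assumption made only for illustration; the function names are hypothetical as well.

```python
import numpy as np


def truth_vector(true_acts: np.ndarray, false_acts: np.ndarray) -> np.ndarray:
    """Assumed construction: mean activation of true statements minus
    mean activation of false statements (rows are statements)."""
    return true_acts.mean(axis=0) - false_acts.mean(axis=0)


def directional_change_deg(v_with: np.ndarray, v_without: np.ndarray) -> float:
    """Angle theta, in degrees, between the truth vectors with and without context."""
    cos = np.dot(v_with, v_without) / (
        np.linalg.norm(v_with) * np.linalg.norm(v_without)
    )
    # Clip guards against values slightly outside [-1, 1] from rounding.
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))


def relative_magnitude(v_with: np.ndarray, v_without: np.ndarray) -> float:
    """Norm of the with-context vector relative to the no-context vector."""
    return float(np.linalg.norm(v_with) / np.linalg.norm(v_without))
```

Under this reading, a relative magnitude above 1 corresponds to the amplified separation between true and false representations that the abstract reports when context is added.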

Title: Probing Multimodal Large Language Models on Cognitive Biases in Chinese Short-Video Misinformation

Authors: Jen-tse Huang, Chang Chen, Shiyang Lai, Wenxuan Wang, Michelle R. Kaufman, Mark Dredze
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.06600
Pdf URL: https://arxiv.org/pdf/2601.06600
Copy Paste: [[2601.06600]] Probing Multimodal Large Language Models on Cognitive Biases in Chinese Short-Video Misinformation(https://arxiv.org/abs/2601.06600)
Keywords: language model, llm
Abstract: Short-video platforms have become major channels for misinformation, where deceptive claims frequently leverage visual experiments and social cues. While Multimodal Large Language Models (MLLMs) have demonstrated impressive reasoning capabilities, their robustness against misinformation entangled with cognitive biases remains under-explored. In this paper, we introduce a comprehensive evaluation framework using a high-quality, manually annotated dataset of 200 short videos spanning four health domains. This dataset provides fine-grained annotations for three deceptive patterns (experimental errors, logical fallacies, and fabricated claims), each verified by evidence such as national standards and academic literature. We evaluate eight frontier MLLMs across five modality settings. Experimental results demonstrate that Gemini-2.5-Pro achieves the highest performance in the multimodal setting with a belief score of 71.5/100, while o3 performs the worst at 35.2. Furthermore, we investigate social cues that induce false beliefs in videos and find that models are susceptible to biases like authoritative channel IDs.
摘要：短视频平台已成为错误信息的主要渠道，欺骗性言论经常利用视觉实验和社交线索。虽然多模态大语言模型（MLLM）已经表现出令人印象深刻的推理能力，但它们针对与认知偏差纠缠在一起的错误信息的鲁棒性仍有待探索。在本文中，我们介绍了一个综合评估框架，该框架使用高质量、手动注释的数据集，其中包含跨越四个健康领域的 200 个短视频。该数据集为三种欺骗性模式、实验错误、逻辑谬误和捏造的主张提供了细粒度的注释，每种模式都经过国家标准和学术文献等证据的验证。我们评估了五种模式设置中的八个前沿 MLLM。实验结果表明，Gemini-2.5-Pro 在多模态设置中实现了最高性能，置信度得分为 71.5/100，而 o3 的表现最差，为 35.2。此外，我们调查了导致视频中错误信念的社交线索，发现模型很容易受到权威频道 ID 等偏见的影响。

Title: N2N-GQA: Noise-to-Narrative for Graph-Based Table-Text Question Answering Using LLMs

Authors: Mohamed Sharafath, Aravindh Annamalai, Ganesh Murugan, Aravindakumar Venugopalan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.06603
Pdf URL: https://arxiv.org/pdf/2601.06603
Copy Paste: [[2601.06603]] N2N-GQA: Noise-to-Narrative for Graph-Based Table-Text Question Answering Using LLMs(https://arxiv.org/abs/2601.06603)
Keywords: llm, retrieval-augmented generation
Abstract: Multi-hop question answering over hybrid table-text data requires retrieving and reasoning across multiple evidence pieces from large corpora, but standard Retrieval-Augmented Generation (RAG) pipelines process documents as flat ranked lists, causing retrieval noise to obscure reasoning chains. We introduce N2N-GQA. To our knowledge, it is the first zero-shot framework for open-domain hybrid table-text QA that constructs dynamic evidence graphs from noisy retrieval outputs. Our key insight is that multi-hop reasoning requires understanding relationships between evidence pieces: by modeling documents as graph nodes with semantic relationships as edges, we identify bridge documents connecting reasoning steps, a capability absent in list-based retrieval. On OTT-QA, graph-based evidence curation provides a 19.9-point EM improvement over strong baselines, demonstrating that organizing retrieval results as structured graphs is critical for multi-hop reasoning. N2N-GQA achieves 48.80 EM, matching fine-tuned retrieval models (CORE: 49.0 EM) and approaching heavily optimized systems (COS: 56.9 EM) without any task-specific training. This establishes graph-structured evidence organization as essential for scalable, zero-shot multi-hop QA systems and demonstrates that simple, interpretable graph construction can rival sophisticated fine-tuned approaches.
摘要：针对混合表文本数据的多跳问答需要对大型语料库中的多个证据片段进行检索和推理，但标准检索增强生成 (RAG) 管道将文档作为平面排名列表进行处理，从而导致检索噪声模糊推理链。我们介绍 N2N-GQA。据我们所知，它是第一个用于开放域混合表文本 QA 的零样本框架，可以从嘈杂的检索输出中构建动态证据图。我们的主要见解是，多跳推理需要理解证据之间的关系：通过将文档建模为图形节点，将语义关系作为边缘，我们识别出连接推理步骤的桥梁文档，这是基于列表的检索中所缺乏的功能。在 OTT-QA 上，基于图的证据管理比强基线提供了 19.9 点的 EM 改进，这表明将检索结果组织为结构化图对于多跳推理至关重要。 N2N-GQA 达到 48.80 EM，匹配微调检索模型（CORE：49.0 EM）并接近高度优化的系统（COS：56.9 EM），而无需任何特定任务的训练。这使得图结构证据组织对于可扩展、零样本多跳 QA 系统至关重要，并证明简单、可解释的图结构可以与复杂的微调方法相媲美。

Title: Pragya: An AI-Based Semantic Recommendation System for Sanskrit Subhasitas

Authors: Tanisha Raorane, Prasenjit Kole
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.06607
Pdf URL: https://arxiv.org/pdf/2601.06607
Copy Paste: [[2601.06607]] Pragya: An AI-Based Semantic Recommendation System for Sanskrit Subhasitas(https://arxiv.org/abs/2601.06607)
Keywords: llm, retrieval-augmented generation
Abstract: Sanskrit Subhasitas encapsulate centuries of cultural and philosophical wisdom, yet remain underutilized in the digital age due to linguistic and contextual barriers. In this work, we present Pragya, a retrieval-augmented generation (RAG) framework for semantic recommendation of Subhasitas. We curate a dataset of 200 verses annotated with thematic tags such as motivation, friendship, and compassion. Using sentence embeddings (IndicBERT), the system retrieves top-k verses relevant to user queries. The retrieved results are then passed to a generative model (Mistral LLM) to produce transliterations, translations, and contextual explanations. Experimental evaluation demonstrates that semantic retrieval significantly outperforms keyword matching in precision and relevance, while user studies highlight improved accessibility through generated summaries. To our knowledge, this is the first attempt at integrating retrieval and generation for Sanskrit Subhasitas, bridging cultural heritage with modern applied AI.
摘要：梵文 Subhasitas 浓缩了几个世纪的文化和哲学智慧，但由于语言和语境障碍，在数字时代仍未得到充分利用。在这项工作中，我们提出了 Pragya，一种用于 Subhasitas 语义推荐的检索增强生成（RAG）框架。我们整理了一个包含 200 节经文的数据集，并附有动机、友谊和同情心等主题标签。使用句子嵌入 (IndicBERT)，系统检索与用户查询相关的前 k 节经文。然后，检索到的结果将传递到生成模型 (Mistral LLM) 以生成音译、翻译和上下文解释。实验评估表明，语义检索在精度和相关性方面显着优于关键字匹配，而用户研究则强调通过生成的摘要提高了可访问性。据我们所知，这是首次尝试将梵文 Subhasitas 的检索和生成结合起来，将文化遗产与现代应用人工智能联系起来。
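
The top-k retrieval step mentioned in this abstract can be illustrated with plain cosine similarity over precomputed embeddings. This is a minimal sketch under stated assumptions: the vectors stand in for sentence embeddings (no IndicBERT call is made here), and `top_k_verses` is a hypothetical helper, not the authors' implementation.

```python
import numpy as np


def top_k_verses(query_vec: np.ndarray, verse_vecs: np.ndarray, k: int = 3):
    """Return (index, cosine similarity) pairs for the k most similar verses.

    query_vec: shape (d,); verse_vecs: shape (n, d), one row per verse.
    """
    q = query_vec / np.linalg.norm(query_vec)
    v = verse_vecs / np.linalg.norm(verse_vecs, axis=1, keepdims=True)
    scores = v @ q  # cosine similarity of each verse with the query
    order = np.argsort(-scores)[:k]  # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]
```

In a pipeline like the one described, the returned indices would select the verses passed on to the generative model.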

Title: Labels have Human Values: Value Calibration of Subjective Tasks

Authors: Mohammed Fayiz Parappan, Ricardo Henao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.06631
Pdf URL: https://arxiv.org/pdf/2601.06631
Copy Paste: [[2601.06631]] Labels have Human Values: Value Calibration of Subjective Tasks(https://arxiv.org/abs/2601.06631)
Keywords: chat
Abstract: Building NLP systems for subjective tasks requires one to ensure their alignment to contrasting human values. We propose the MultiCalibrated Subjective Task Learner framework (MC-STL), which clusters annotations into identifiable human value clusters by three approaches (similarity of annotator rationales, expert-value taxonomies or rater's sociocultural descriptors) and calibrates predictions for each value cluster by learning cluster-specific embeddings. We demonstrate MC-STL on several subjective learning settings, including ordinal, binary, and preference learning predictions, and evaluate it on multiple datasets covering toxic chatbot conversations, offensive social media posts, and human preference alignment. The results show that MC-STL consistently outperforms the baselines that ignore the latent value structure of the annotations, delivering gains in discrimination, value-specific calibration, and disagreement-aware metrics.
摘要：为主观任务构建 NLP 系统需要确保其与对比鲜明的人类价值观保持一致。我们提出了多重校准主观任务学习器框架（MC-STL），该框架通过三种方法（注释者基本原理的相似性、专家价值分类法或评估者的社会文化描述符）将注释聚类为可识别的人类价值簇，并通过学习特定于簇的嵌入来校准每个值簇的预测。我们在几种主观学习设置（包括序数、二元和偏好学习预测）上演示了 MC-STL，并在涵盖有毒聊天机器人对话、攻击性社交媒体帖子和人类偏好对齐的多个数据集上对其进行评估。结果表明，MC-STL 始终优于忽略注释的潜在价值结构的基线，在区分度、特定价值校准和分歧感知指标方面取得了进展。

Title: MedEinst: Benchmarking the Einstellung Effect in Medical LLMs through Counterfactual Differential Diagnosis

Authors: Wenting Chen, Zhongrui Zhu, Guolin Huang, Wenxuan Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.06636
Pdf URL: https://arxiv.org/pdf/2601.06636
Copy Paste: [[2601.06636]] MedEinst: Benchmarking the Einstellung Effect in Medical LLMs through Counterfactual Differential Diagnosis(https://arxiv.org/abs/2601.06636)
Keywords: llm, agent
Abstract: Despite achieving high accuracy on medical benchmarks, LLMs exhibit the Einstellung Effect in clinical diagnosis: they rely on statistical shortcuts rather than patient-specific evidence, causing misdiagnosis in atypical cases. Existing benchmarks fail to detect this critical failure mode. We introduce MedEinst, a counterfactual benchmark with 5,383 paired clinical cases across 49 diseases. Each pair contains a control case and a "trap" case with altered discriminative evidence that flips the diagnosis. We measure susceptibility via the Bias Trap Rate, the probability of misdiagnosing traps despite correctly diagnosing controls. Extensive evaluation of 17 LLMs shows that frontier models achieve high baseline accuracy but severe bias trap rates. Thus, we propose ECR-Agent, which aligns LLM reasoning with Evidence-Based Medicine standards via two components: (1) Dynamic Causal Inference (DCI) performs structured reasoning through dual-pathway perception, dynamic causal graph reasoning across three levels (association, intervention, counterfactual), and evidence audit for final diagnosis; (2) Critic-Driven Graph and Memory Evolution (CGME) iteratively refines the system by storing validated reasoning paths in an exemplar base and consolidating disease-specific knowledge into evolving illness graphs. Source code is to be released.
摘要：尽管 LLM 在医学基准上取得了很高的准确性，但它们在临床诊断中表现出“定势效应”（Einstellung Effect）——依赖统计捷径而不是患者特定的证据，导致非典型病例的误诊。现有的基准测试无法检测到这种关键故障模式。我们引入了 MedEinst，这是一个反事实基准，包含 49 种疾病的 5,383 个配对临床病例。每对都包含一个对照病例和一个“陷阱”病例，后者的判别性证据被改动，从而使诊断结果翻转。我们通过偏差陷阱率来测量易感性，即在正确诊断对照病例的情况下误诊陷阱病例的概率。对 17 个 LLM 的广泛评估表明，前沿模型实现了较高的基线精度，但存在严重的偏差陷阱率。因此，我们提出 ECR-Agent，通过两个组成部分将 LLM 推理与循证医学标准对齐：（1）动态因果推理（DCI）通过双路径感知、跨三个层面（关联、干预、反事实）的动态因果图推理以及最终诊断的证据审核来执行结构化推理；（2）批评驱动图和记忆进化（CGME）通过在示例库中存储经过验证的推理路径并将特定疾病的知识整合到不断发展的疾病图中，迭代地完善系统。源代码即将发布。
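
One plausible reading of the Bias Trap Rate described in this abstract is a conditional rate: among pairs whose control case is diagnosed correctly, the fraction whose trap case is misdiagnosed. The sketch below encodes that reading as an assumption; the paper's exact definition may differ, and `bias_trap_rate` is a hypothetical helper.

```python
def bias_trap_rate(pairs):
    """pairs: iterable of (control_correct, trap_correct) booleans, one per case pair.

    Assumed definition: P(trap misdiagnosed | control diagnosed correctly).
    Returns NaN when no control case was diagnosed correctly.
    """
    control_ok = [(c, t) for c, t in pairs if c]
    if not control_ok:
        return float("nan")
    trapped = sum(1 for _, t in control_ok if not t)
    return trapped / len(control_ok)
```

Conditioning on a correct control diagnosis separates susceptibility to the altered evidence from plain inability to diagnose the disease at all.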

Title: Do Language Models Reason Across Languages?

Authors: Yan Meng, Wafaa Mohammed, Christof Monz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.06644
Pdf URL: https://arxiv.org/pdf/2601.06644
Copy Paste: [[2601.06644]] Do Language Models Reason Across Languages?(https://arxiv.org/abs/2601.06644)
Keywords: language model, prompt
Abstract: Real-world information sources are inherently multilingual, which naturally raises the question of whether language models can synthesize information across languages. In this paper, we introduce a simple two-hop question answering setting, where answering a question requires making inferences over two multilingual documents. We find that language models are more sensitive to language variation in answer-span documents than in those providing bridging information, despite the equal importance of both documents for answering a question. Under a step-by-step sub-question evaluation, we further show that in up to 33% of multilingual cases, models fail to infer the bridging information in the first step yet still answer the overall question correctly. This indicates that reasoning in language models, especially in multilingual settings, does not follow a faithful step-by-step decomposition. Subsequently, we show that the absence of reasoning decomposition leads to around 18% composition failure, where models answer both sub-questions correctly but fail on the final two-hop question. To mitigate this, we propose a simple three-stage SUBQ prompting method to guide the multi-step reasoning with sub-questions, which boosts accuracy from 10.1% to 66.5%.
摘要：现实世界的信息源本质上是多语言的，这自然引发了一个问题：语言模型是否可以跨语言合成信息。在本文中，我们介绍了一种简单的两跳问答设置，其中回答问题需要对两个多语言文档进行推断。我们发现，语言模型对答案跨度文档中的语言变化比对提供桥接信息的文档中的语言变化更敏感，尽管这两个文档对于回答问题同样重要。在逐步的子问题评估下，我们进一步表明，在高达 33% 的多语言案例中，模型未能在第一步中推断出桥接信息，但仍能正确回答整个问题。这表明语言模型中的推理，特别是在多语言环境中，并不遵循忠实的逐步分解。随后，我们表明，缺乏推理分解会导致大约 18% 的组合失败，即两个子问题都被正确回答，但最终的两跳问题回答错误。为了缓解这个问题，我们提出了一种简单的三阶段 SUBQ 提示方法，用子问题来引导多步推理，将准确率从 10.1% 提高到 66.5%。

Title: What makes for an enjoyable protagonist? An analysis of character warmth and competence

Authors: Hannes Rosenbusch
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.06658
Pdf URL: https://arxiv.org/pdf/2601.06658
Copy Paste: [[2601.06658]] What makes for an enjoyable protagonist? An analysis of character warmth and competence(https://arxiv.org/abs/2601.06658)
Keywords: gpt, llm
Abstract: Drawing on psychological and literary theory, we investigated whether the warmth and competence of movie protagonists predict IMDb ratings, and whether these effects vary across genres. Using 2,858 films and series from the Movie Scripts Corpus, we identified protagonists via AI-assisted annotation and quantified their warmth and competence with the LLM_annotate package ([1]; human-LLM agreement: r = .83). Preregistered Bayesian regression analyses revealed theory-consistent but small associations between both warmth and competence and audience ratings, while genre-specific interactions did not meaningfully improve predictions. Male protagonists were slightly less warm than female protagonists, and movies with male leads received higher ratings on average (an association that was multiple times stronger than the relationships between movie ratings and warmth/competence). These findings suggest that, although audiences tend to favor warm, competent characters, the effects on movie evaluations are modest, indicating that character personality is only one of many factors shaping movie ratings. AI-assisted annotation with LLM_annotate and gpt-4.1-mini proved effective for large-scale analyses but occasionally fell short of manually generated annotations.
摘要：借鉴心理学和文学理论，我们研究了电影主角的热情和能力是否可以预测 IMDb 评分，以及这些影响是否因类型而异。我们使用电影剧本语料库中的 2,858 部电影和连续剧，通过 AI 辅助注释识别主角，并使用 LLM_annotate 包量化他们的热情和能力（[1]；人类与 LLM 的一致性：r = .83）。预先注册的贝叶斯回归分析揭示了热情和能力与观众评分之间理论一致但较小的关联，而特定类型的交互项并没有有意义地改善预测。男性主角的热情程度略低于女性主角，而男性主演的电影平均评分更高（这种关联比电影评分与热情/能力之间的关系强数倍）。这些发现表明，尽管观众倾向于青睐热情、有能力的角色，但对电影评价的影响不大，表明角色个性只是影响电影评分的众多因素之一。事实证明，使用 LLM_annotate 和 gpt-4.1-mini 的人工智能辅助注释对于大规模分析是有效的，但有时达不到手动生成的注释的水平。

Title: InFi-Check: Interpretable and Fine-Grained Fact-Checking of LLMs

Authors: Yuzhuo Bai, Shuzheng Si, Kangyang Luo, Qingyi Wang, Wenhao Li, Gang Chen, Fanchao Qi, Maosong Sun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.06666
Pdf URL: https://arxiv.org/pdf/2601.06666
Copy Paste: [[2601.06666]] InFi-Check: Interpretable and Fine-Grained Fact-Checking of LLMs(https://arxiv.org/abs/2601.06666)
Keywords: language model, llm
Abstract: Large language models (LLMs) often hallucinate, yet most existing fact-checking methods treat factuality evaluation as a binary classification problem, offering limited interpretability and failing to capture fine-grained error types. In this paper, we introduce InFi-Check, a framework for interpretable and fine-grained fact-checking of LLM outputs. Specifically, we first propose a controlled data synthesis pipeline that generates high-quality data featuring explicit evidence, fine-grained error type labels, justifications, and corrections. Based on this, we further construct large-scale training data and a manually verified benchmark InFi-Check-FG for fine-grained fact-checking of LLM outputs. Building on these high-quality training data, we further propose InFi-Checker, which can jointly provide supporting evidence, classify fine-grained error types, and produce justifications along with corrections. Experiments show that InFi-Checker achieves state-of-the-art performance on InFi-Check-FG and strong generalization across various downstream tasks, significantly improving the utility and trustworthiness of factuality evaluation.
摘要：大型语言模型（LLM）经常产生幻觉，但大多数现有的事实检查方法将事实性评估视为二元分类问题，提供有限的可解释性，并且无法捕获细粒度的错误类型。在本文中，我们介绍了 InFi-Check，这是一个用于对 LLM 输出进行可解释和细粒度事实检查的框架。具体来说，我们首先提出一个受控数据合成管道，该管道生成具有明确证据、细粒度错误类型标签、理由和更正的高质量数据。在此基础上，我们进一步构建大规模训练数据和手动验证的基准InFi-Check-FG，用于LLM输出的细粒度事实检查。基于这些高质量的训练数据，我们进一步提出了 InFi-Checker，它可以共同提供支持证据，对细粒度错误类型进行分类，并提供理由和纠正。实验表明，InFi-Checker 在 InFi-Check-FG 上实现了最先进的性能，并在各种下游任务中实现了强大的泛化能力，显着提高了事实性评估的实用性和可信度。

Title: Evaluating Cross-Lingual Unlearning in Multilingual Language Models

Authors: Tyler Lizzo, Larry Heck
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.06675
Pdf URL: https://arxiv.org/pdf/2601.06675
Copy Paste: [[2601.06675]] Evaluating Cross-Lingual Unlearning in Multilingual Language Models(https://arxiv.org/abs/2601.06675)
Keywords: language model, llm
Abstract: We present the first comprehensive evaluation of cross-lingual unlearning in multilingual LLMs. Using translated TOFU benchmarks in seven language/script variants, we test major unlearning algorithms and show that most fail to remove facts outside the training language, even when utility remains high. However, subspace-projection consistently outperforms the other methods, achieving strong cross-lingual forgetting with minimal degradation. Analysis of learned task subspaces reveals a shared interlingua structure: removing this shared subspace harms all languages, while removing language-specific components selectively affects one. These results demonstrate that multilingual forgetting depends on geometry in weight space, motivating subspace-based approaches for future unlearning systems.
摘要：我们首次对多语言 LLM 中的跨语言遗忘进行全面评估。使用七种语言/文字变体的翻译版 TOFU 基准测试，我们测试了主要的遗忘算法，并表明大多数算法无法删除训练语言之外的事实，即使实用性仍然很高。然而，子空间投影始终优于其他方法，以最小的性能退化实现了强大的跨语言遗忘。对学习到的任务子空间的分析揭示了一种共享的中间语（interlingua）结构：删除这个共享子空间会损害所有语言，而删除特定于语言的组件会选择性地影响一种语言。这些结果表明，多语言遗忘取决于权重空间中的几何结构，这为未来的遗忘系统采用基于子空间的方法提供了依据。

Title: IDRBench: Interactive Deep Research Benchmark

Authors: Yingchaojie Feng, Qiang Huang, Xiaoya Xie, Zhaorui Yang, Jun Yu, Wei Chen, Anthony K. H. Tung
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2601.06676
Pdf URL: https://arxiv.org/pdf/2601.06676
Copy Paste: [[2601.06676]] IDRBench: Interactive Deep Research Benchmark(https://arxiv.org/abs/2601.06676)
Keywords: language model, llm, agent
Abstract: Deep research agents powered by Large Language Models (LLMs) can perform multi-step reasoning, web exploration, and long-form report generation. However, most existing systems operate in an autonomous manner, assuming fully specified user intent and evaluating only final outputs. In practice, research goals are often underspecified and evolve during exploration, making sustained interaction essential for robust alignment. Despite its importance, interaction remains largely invisible to existing deep research benchmarks, which neither model dynamic user feedback nor quantify its costs. We introduce IDRBench, the first benchmark for systematically evaluating interactive deep research. IDRBench combines a modular multi-agent research framework with on-demand interaction, a scalable reference-grounded user simulator, and an interaction-aware evaluation suite that jointly measures interaction benefits (quality and alignment) and costs (turns and tokens). Experiments across seven state-of-the-art LLMs show that interaction consistently improves research quality and robustness, often outweighing differences in model capacity, while revealing substantial trade-offs in interaction efficiency.
摘要：由大型语言模型 (LLM) 提供支持的深度研究代理可以执行多步骤推理、网络探索和长格式报告生成。然而，大多数现有系统以自主方式运行，假设用户意图已完全指定，并且仅评估最终输出。在实践中，研究目标往往不明确，并且在探索过程中不断变化，因此持续的交互对于稳健的对齐至关重要。尽管交互很重要，但它在现有的深度研究基准中基本上仍不可见，这些基准既不对动态用户反馈进行建模，也不量化其成本。我们推出 IDRBench，这是第一个系统评估交互式深度研究的基准。IDRBench 将模块化多智能体研究框架与按需交互、可扩展的基于参考的用户模拟器以及交互感知评估套件相结合，共同测量交互收益（质量和对齐）和成本（回合和令牌）。在七个最先进的 LLM 上的实验表明，交互持续提高研究质量和稳健性，其作用通常超过模型能力的差异，同时揭示了交互效率方面的巨大权衡。

Title: Characterising Toxicity in Generative Large Language Models

Authors: Zhiyao Zhang, Yazan Mash'Al, Yuhan Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.06700
Pdf URL: https://arxiv.org/pdf/2601.06700
Copy Paste: [[2601.06700]] Characterising Toxicity in Generative Large Language Models(https://arxiv.org/abs/2601.06700)
Keywords: language model, llm, prompt
Abstract: In recent years, the advent of the attention mechanism has significantly advanced the field of natural language processing (NLP), revolutionizing text processing and text generation. This has come about through transformer-based decoder-only architectures, which have become ubiquitous in NLP due to their impressive text processing and generation capabilities. Despite these breakthroughs, language models (LMs) remain susceptible to generating undesired outputs: inappropriate, offensive, or otherwise harmful responses. We will collectively refer to these as "toxic" outputs. Although methods like reinforcement learning from human feedback (RLHF) have been developed to align model outputs with human values, these safeguards can often be circumvented through carefully crafted prompts. Therefore, this paper examines the extent to which LLMs generate toxic content when prompted, as well as the linguistic factors -- both lexical and syntactic -- that influence the production of such outputs in generative models.
摘要：近年来，注意力机制的出现极大地推进了自然语言处理（NLP）领域的发展，彻底改变了文本处理和文本生成。这是通过基于 Transformer 的纯解码器架构实现的，该架构由于其令人印象深刻的文本处理和生成能力而在 NLP 中变得无处不在。尽管取得了这些突破，语言模型 (LM) 仍然容易产生不需要的输出：不恰当的、冒犯性的或其他有害的响应。我们将这些统称为“有毒”输出。尽管已经开发了诸如基于人类反馈的强化学习 (RLHF) 等方法来使模型输出与人类价值观保持一致，但这些保护措施通常可以通过精心设计的提示来规避。因此，本文研究了 LLM 在提示时产生有毒内容的程度，以及影响生成模型中此类输出产生的语言因素（词汇和句法）。

Title: GRASP LoRA: GRPO Guided Adapter Sparsity Policy for Cross Lingual Transfer

Authors: Besher Hassan, Xiuying Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.06702
Pdf URL: https://arxiv.org/pdf/2601.06702
Copy Paste: [[2601.06702]] GRASP LoRA: GRPO Guided Adapter Sparsity Policy for Cross Lingual Transfer(https://arxiv.org/abs/2601.06702)
Keywords: llm
Abstract: Parameter-efficient fine-tuning is a way to adapt LLMs to new languages when compute or data are limited, yet adapter pipelines usually choose a global prune ratio by grid search. This practice is computationally expensive and development-set intensive, since it repeats training, freezes sparsity, and misses fractional optima. We introduce GRASP LoRA (GRPO Guided Adapter Sparsity Policy), which treats global sparsity as a learnable control variable. A GRPO controller interleaves with training, periodically probing candidate prune ratios on a small micro development set and updating a single global prune ratio online from its reward signal. It operates on merged source and target LoRA adapters on a frozen backbone and replaces grid search with one controller run that learns a prune ratio, followed by a single final merge-and-prune fine-tuning run with pruning fixed to that ratio. On cross-lingual transfer from English into Arabic and Chinese, including XL-Sum summarization and MLQA extractive question answering with Llama 3 8B, GRASP LoRA improves semantic faithfulness, content coverage, and answer quality over strong target-only and merge-and-prune baselines. It reduces end-to-end runtime by multiple times relative to grid search, lowers reliance on large development sets, and makes adapter reuse practical for low-resource deployment.
摘要：当计算或数据有限时，参数高效微调是使 LLM 适应新语言的一种方法，但适配器管道通常通过网格搜索选择全局剪枝比。这种做法的计算成本很高且开发集密集，因为它会重复训练、冻结稀疏性并错过分数最优值。我们引入了GRASP LoRA（GRPO引导适配器稀疏性策略），它将全局稀疏性视为可学习的控制变量。 GRPO 控制器与训练交叉，定期探测小型微开发集上的候选剪枝率，并根据其奖励信号在线更新单个全局剪枝率。它在冻结主干上的合并源和目标 LoRA 适配器上运行，并用一次学习修剪比率的控制器运行取代网格搜索，然后进行一次最终合并和修剪微调运行，并将修剪固定为该比率。在从英语到阿拉伯语和中文的跨语言转换中，包括 XL-Sum 总结和使用 Llama 3 8B 的 MLQA 提取问答，GRASP LoRA 比仅强目标以及合并和修剪基线提高了语义忠实度、内容覆盖率和答案质量。相对于网格搜索，它减少了端到端运行时间数倍，降低了对大型开发集的依赖，并使适配器重用适用于低资源部署。

Title: Evaluating Accounting Reasoning Capabilities of Large Language Models

Authors: Jie Zhou, Xin Chen, Jie Zhang, Hai Li, Jie Wang, Zhe Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.06707
Pdf URL: https://arxiv.org/pdf/2601.06707
Copy Paste: [[2601.06707]] Evaluating Accounting Reasoning Capabilities of Large Language Models(https://arxiv.org/abs/2601.06707)
Keywords: language model, gpt, prompt
Abstract: Large language models are transforming learning, cognition, and research across many fields. Effectively integrating them into professional domains, such as accounting, is a key challenge for enterprise digital transformation. To address this, we define vertical domain accounting reasoning and propose evaluation criteria derived from an analysis of the training data characteristics of representative GLM models. These criteria support systematic study of accounting reasoning and provide benchmarks for performance improvement. Using this framework, we evaluate GLM-6B, GLM-130B, GLM-4, and OpenAI GPT-4 on accounting reasoning tasks. Results show that prompt design significantly affects performance, with GPT-4 demonstrating the strongest capability. Despite these gains, current models remain insufficient for real-world enterprise accounting, indicating the need for further optimization to unlock their full practical value.
摘要：大型语言模型正在改变许多领域的学习、认知和研究。将它们有效地融入会计等专业领域是企业数字化转型的关键挑战。为了解决这个问题，我们定义了垂直领域会计推理，并提出了从代表性 GLM 模型的训练数据特征分析中得出的评估标准。这些标准支持会计推理的系统研究，并为性能改进提供基准。使用该框架，我们在会计推理任务上评估 GLM-6B、GLM-130B、GLM-4 和 OpenAI GPT-4。结果表明，提示设计显著影响性能，其中 GPT-4 表现出最强的能力。尽管取得了这些成果，但当前模型仍然不足以满足现实世界的企业会计要求，这表明需要进一步优化以释放其全部实用价值。

Title: MTMCS-Bench: Evaluating Contextual Safety of Multimodal Large Language Models in Multi-Turn Dialogues

Authors: Zheyuan Liu, Dongwhi Kim, Yixin Wan, Xiangchi Yuan, Zhaoxuan Tan, Fengran Mo, Meng Jiang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.06757
Pdf URL: https://arxiv.org/pdf/2601.06757
Copy Paste: [[2601.06757]] MTMCS-Bench: Evaluating Contextual Safety of Multimodal Large Language Models in Multi-Turn Dialogues(https://arxiv.org/abs/2601.06757)
Keywords: language model, llm
Abstract: Multimodal large language models (MLLMs) are increasingly deployed as assistants that interact through text and images, making it crucial to evaluate contextual safety when risk depends on both the visual scene and the evolving dialogue. Existing contextual safety benchmarks are mostly single-turn and often miss how malicious intent can emerge gradually or how the same scene can support both benign and exploitative goals. We introduce the Multi-Turn Multimodal Contextual Safety Benchmark (MTMCS-Bench), a benchmark of realistic images and multi-turn conversations that evaluates contextual safety in MLLMs under two complementary settings, escalation-based risk and context-switch risk. MTMCS-Bench offers paired safe and unsafe dialogues with structured evaluation. It contains over 30 thousand multimodal (image+text) and unimodal (text-only) samples, with metrics that separately measure contextual intent recognition, safety-awareness on unsafe cases, and helpfulness on benign ones. Across eight open-source and seven proprietary MLLMs, we observe persistent trade-offs between contextual safety and utility, with models tending to either miss gradual risks or over-refuse benign dialogues. Finally, we evaluate five current guardrails and find that they mitigate some failures but do not fully resolve multi-turn contextual risks.
摘要：多模态大语言模型 (MLLM) 越来越多地被部署为通过文本和图像进行交互的助手，因此当风险取决于视觉场景和不断发展的对话时，评估上下文安全性至关重要。现有的上下文安全基准大多是单轮的，并且经常忽略恶意意图如何逐渐出现，或者同一场景如何支持良性和剥削性目标。我们引入了多回合多模态上下文安全基准 (MTMCS-Bench)，这是一个真实图像和多回合对话的基准，用于评估 MLLM 在两种互补设置（基于升级的风险和上下文切换风险）下的上下文安全性。 MTMCS-Bench 提供配对的安全和不安全对话以及结构化评估。它包含超过 30,000 个多模态（图像+文本）和单模态（纯文本）样本，其指标分别衡量上下文意图识别、对不安全情况的安全意识以及对良性情况的帮助。在八个开源和七个专有的 MLLM 中，我们观察到上下文安全性和实用性之间持续存在的权衡，模型往往要么错过渐进的风险，要么过度拒绝良性对话。最后，我们评估了当前的五种防护措施，发现它们可以缓解一些故障，但不能完全解决多回合上下文风险。

Title: GanitLLM: Difficulty-Aware Bengali Mathematical Reasoning through Curriculum-GRPO

Authors: Shubhashis Roy Dipta, Khairul Mahbub, Nadia Najjar
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.06767
Pdf URL: https://arxiv.org/pdf/2601.06767
Copy Paste: [[2601.06767]] GanitLLM: Difficulty-Aware Bengali Mathematical Reasoning through Curriculum-GRPO(https://arxiv.org/abs/2601.06767)
Keywords: llm
Abstract: We present a Bengali mathematical reasoning model called GanitLLM (named after the Bangla word for mathematics, "Ganit"), together with a new difficulty-aware Bengali math corpus and a curriculum-based GRPO pipeline. Bengali is one of the world's most widely spoken languages, yet existing LLMs either reason in English and then translate, or simply fail on multi-step Bengali math, in part because reinforcement learning recipes are tuned for high-resource languages and collapse under reward sparsity in low-resource settings. To address this, we construct Ganit, a rigorously filtered and decontaminated Bengali math dataset with automatic difficulty tags derived from the pass@k of a strong evaluator model. Building on this dataset, we propose Curriculum-GRPO, which combines multi-stage training (SFT + GRPO) with difficulty-aware sampling and verifiable rewards for format, numerical correctness, and Bengali reasoning. On Bn-MGSM and Bn-MSVAMP, GanitLLM-4B improves over its Qwen3-4B base by +8 and +7 accuracy points, respectively, while increasing the percentage of Bengali reasoning tokens from 14% to over 88% and reducing average solution length from 943 to 193 words.
摘要：我们提出了一个名为 GanitLLM 的孟加拉语数学推理模型（以孟加拉语中的数学词“Ganit”命名），以及一个新的难度感知孟加拉语数学语料库和一个基于课程的 GRPO 管道。孟加拉语是世界上使用最广泛的语言之一，但现有的 LLM 要么用英语推理然后翻译，要么只是在多步骤孟加拉语数学上失败，部分原因是强化学习方案是针对高资源语言进行调整的，并在资源匮乏的环境中因奖励稀疏而崩溃。为了解决这个问题，我们构建了 Ganit，一个经过严格过滤和去污染的孟加拉语数学数据集，具有从强大评估器模型的 pass@k 派生的自动难度标签。在此数据集的基础上，我们提出了 Curriculum-GRPO，它将多阶段训练 (SFT + GRPO) 与难度感知采样以及格式、数值正确性和孟加拉语推理的可验证奖励相结合。在 Bn-MGSM 和 Bn-MSVAMP 上，GanitLLM-4B 比其 Qwen3-4B 基础模型分别提高了 8 和 7 个准确率点，同时将孟加拉语推理标记的百分比从 14% 增加到 88% 以上，并将平均解决方案长度从 943 个单词减少到 193 个单词。

Title: Multi-Stage Evolutionary Model Merging with Meta Data Driven Curriculum Learning for Sentiment-Specialized Large Language Modeling

Authors: Keito Inoshita, Xiaokang Zhou, Akira Kawai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.06780
Pdf URL: https://arxiv.org/pdf/2601.06780
Copy Paste: [[2601.06780]] Multi-Stage Evolutionary Model Merging with Meta Data Driven Curriculum Learning for Sentiment-Specialized Large Language Modeling(https://arxiv.org/abs/2601.06780)
Keywords: language model, llm
Abstract: The emergence of large language models (LLMs) has significantly transformed natural language processing (NLP), enabling more generalized models to perform various tasks with minimal training. However, traditional sentiment analysis methods, which focus on individual tasks such as sentiment classification or aspect-based analysis, are not practical for real-world applications that usually require handling multiple tasks. While offering flexibility, LLMs in sentiment-specific tasks often fall short of the required accuracy. Techniques like fine-tuning and evolutionary model merging help integrate models into a unified framework, which can improve the learning performance while reducing computational costs. The use of task meta-data and curriculum learning to optimize learning processes remains underexplored, while sentiment analysis is a critical task in NLP that requires high accuracy and scalability across multiple subtasks. In this study, we propose a hybrid learning model called Multi-stage Evolutionary Model Merging with Meta data driven Curriculum Learning (MEM-MCL), to enhance the sentiment analysis in large language modeling. In particular, expert models are created through instruction tuning for specific sentiment tasks and then merged using evolutionary algorithms to form a unified model. The merging process is optimized with weak data to enhance performance across tasks. The curriculum learning is incorporated to provide a learning sequence based on task difficulty, improving knowledge extraction from LLMs. Experiment results demonstrate that the proposed MEM-MCL model outperforms conventional LLMs in a majority of sentiment analysis tasks, achieving superior results across various subtasks.
摘要：大型语言模型 (LLM) 的出现极大地改变了自然语言处理 (NLP)，使更通用的模型能够以最少的训练执行各种任务。然而，传统的情感分析方法专注于单个任务，例如情感分类或基于方面的分析，对于通常需要处理多个任务的现实应用程序来说并不实用。在提供灵活性的同时，LLM 在特定情感任务中往往达不到所需的准确性。微调和进化模型合并等技术有助于将模型集成到统一的框架中，从而提高学习性能，同时降低计算成本。使用任务元数据和课程学习来优化学习过程的探索尚未充分，而情感分析是 NLP 中的一项关键任务，需要跨多个子任务的高精度和可扩展性。在本研究中，我们提出了一种称为多阶段进化模型合并与元数据驱动课程学习（MEM-MCL）的混合学习模型，以增强大语言建模中的情感分析。特别是，专家模型是通过针对特定情感任务的指令调整来创建的，然后使用进化算法进行合并以形成统一的模型。使用弱数据优化合并过程，以提高跨任务的性能。课程学习被纳入，以提供基于任务难度的学习顺序，改善 LLM 的知识提取。实验结果表明，所提出的 MEM-MCL 模型在大多数情感分析任务中都优于传统的 LLM，在各个子任务中取得了优异的结果。

Title: EpiCaR: Knowing What You Don't Know Matters for Better Reasoning in LLMs

Authors: Jewon Yeom, Jaewon Sok, Seonghyeon Park, Jeongjae Park, Taesup Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.06786
Pdf URL: https://arxiv.org/pdf/2601.06786
Copy Paste: [[2601.06786]] EpiCaR: Knowing What You Don't Know Matters for Better Reasoning in LLMs(https://arxiv.org/abs/2601.06786)
Keywords: language model, llm
Abstract: Improving the reasoning abilities of large language models (LLMs) has largely relied on iterative self-training with model-generated data. While effective at boosting accuracy, existing approaches primarily reinforce successful reasoning paths, incurring a substantial calibration cost: models become overconfident and lose the ability to represent uncertainty. This failure has been characterized as a form of model collapse in alignment, where predictive distributions degenerate toward low-variance point estimates. We address this issue by reframing reasoning training as an epistemic learning problem, in which models must learn not only how to reason, but also when their reasoning should be trusted. We propose epistemically-calibrated reasoning (EpiCaR) as a training objective that jointly optimizes reasoning performance and calibration, and instantiate it within an iterative supervised fine-tuning framework using explicit self-evaluation signals. Experiments on Llama-3 and Qwen-3 families demonstrate that our approach achieves Pareto-superiority over standard baselines in both accuracy and calibration, particularly in models with sufficient reasoning capacity (e.g., 3B+). This framework generalizes effectively to OOD mathematical reasoning (GSM8K) and code generation (MBPP). Ultimately, our approach enables a 3X reduction in inference compute, matching the K=30 performance of STaR with only K=10 samples in capable models.
摘要：提高大型语言模型（LLM）的推理能力在很大程度上依赖于使用模型生成的数据进行迭代自我训练。虽然可以有效提高准确性，但现有方法主要是强化成功的推理路径，从而产生大量的校准成本：模型变得过度自信并失去表示不确定性的能力。这种失败被描述为模型对齐崩溃的一种形式，其中预测分布向低方差点估计退化。我们通过将推理训练重新定义为认知学习问题来解决这个问题，其中模型不仅必须学习如何推理，而且还必须学习何时应该信任它们的推理。我们提出认知校准推理（EpiCaR）作为联合优化推理性能和校准的训练目标，并使用显式自我评估信号在迭代监督微调框架内将其实例化。 Llama-3 和 Qwen-3 系列的实验表明，我们的方法在准确性和校准方面均优于标准基线，特别是在具有足够推理能力的模型（例如 3B+）中。该框架有效地推广到 OOD 数学推理 (GSM8K) 和代码生成 (MBPP)。最终，我们的方法使推理计算量减少了 3 倍，将 STaR 的 K=30 性能与功能模型中仅 K=10 个样本相匹配。

Title: Garbage Attention in Large Language Models: BOS Sink Heads and Sink-aware Pruning

Authors: Jaewon Sok, Jewon Yeom, Seonghyeon Park, Jeongjae Park, Taesup Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.06787
Pdf URL: https://arxiv.org/pdf/2601.06787
Copy Paste: [[2601.06787]] Garbage Attention in Large Language Models: BOS Sink Heads and Sink-aware Pruning(https://arxiv.org/abs/2601.06787)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are known to contain significant redundancy, yet a systematic explanation for why certain components, particularly in higher layers, are more redundant has remained elusive. In this work, we identify the BOS sink phenomenon as a key mechanism driving this layer-wise sensitivity. We show that attention heads with high BOS sink scores are strongly associated with functional redundancy: such heads, especially in deeper layers, contribute little to predictive performance and effectively serve as "dumping grounds" for superfluous attention weights. This provides a concrete functional explanation for the structural redundancy reported in prior studies. Leveraging this insight, we introduce a simple pruning strategy that removes high-BOS sink heads. Experiments on Gemma-3, Llama-3.1, and Qwen3 demonstrate that this approach identifies redundant transformer components more reliably than weight- or activation-based criteria, while preserving performance close to dense baselines even under aggressive pruning. Moreover, we find that the behavior of sink heads remains stable across different sequence lengths. Overall, our results suggest that structural properties of attention offer a more intuitive and robust basis for model compression than magnitude-based methods.
摘要：众所周知，大型语言模型 (LLM) 包含大量冗余，但对于为什么某些组件（尤其是高层组件）冗余程度更高的系统解释仍然难以捉摸。在这项工作中，我们将 BOS 汇现象确定为驱动这种分层敏感性的关键机制。我们表明，具有高 BOS 汇分的注意力头与功能冗余密切相关：这样的头，尤其是在更深的层中，对预测性能贡献不大，并且有效地充当了多余注意力权重的“倾销场”。这为先前研究中报告的结构冗余提供了具体的功能解释。利用这一见解，我们引入了一种简单的修剪策略，可以去除高 BOS 汇头。 Gemma-3、Llama-3.1 和 Qwen3 上的实验表明，这种方法比基于权重或基于激活的标准更可靠地识别冗余 Transformer 组件，同时即使在积极修剪的情况下也能保持接近密集基线的性能。此外，我们发现汇头的行为在不同的序列长度上保持稳定。总的来说，我们的结果表明，注意力的结构特性为模型压缩提供了比基于幅度的方法更直观、更稳健的基础。

Title: CIRAG: Construction-Integration Retrieval and Adaptive Generation for Multi-hop Question Answering

Authors: Zili Wei, Xiaocui Yang, Yilin Wang, Zihan Wang, Weidong Bao, Shi Feng, Daling Wang, Yifei Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.06799
Pdf URL: https://arxiv.org/pdf/2601.06799
Copy Paste: [[2601.06799]] CIRAG: Construction-Integration Retrieval and Adaptive Generation for Multi-hop Question Answering(https://arxiv.org/abs/2601.06799)
Keywords: retrieval-augmented generation
Abstract: Triple-based Iterative Retrieval-Augmented Generation (iRAG) mitigates document-level noise for multi-hop question answering. However, existing methods still face limitations: (i) greedy single-path expansion, which propagates early errors and fails to capture parallel evidence from different reasoning branches, and (ii) granularity-demand mismatch, where a single evidence representation struggles to balance noise control with contextual sufficiency. In this paper, we propose the Construction-Integration Retrieval and Adaptive Generation model, CIRAG. It introduces an Iterative Construction-Integration module that constructs candidate triples and history-conditionally integrates them to distill core triples and generate the next-hop query. This module mitigates the greedy trap by preserving multiple plausible evidence chains. Besides, we propose an Adaptive Cascaded Multi-Granularity Generation module that progressively expands contextual evidence based on the problem requirements, from triples to supporting sentences and full passages. Moreover, we introduce Trajectory Distillation, which distills the teacher model's integration policy into a lightweight student, enabling efficient and reliable long-horizon reasoning. Extensive experiments demonstrate that CIRAG achieves superior performance compared to existing iRAG methods.
摘要：基于三重的迭代检索增强生成 (iRAG) 可减轻多跳问答的文档级噪声。然而，现有方法仍然面临局限性：（i）贪婪的单路径扩展，它传播早期错误并且无法从不同的推理分支捕获并行证据，以及（ii）粒度需求不匹配，其中单个证据表示难以平衡噪声控制与上下文充分性。在本文中，我们提出了构建-集成检索和自适应生成模型，CIRAG。它引入了迭代构建集成模块，该模块构建候选三元组并有条件地历史集成它们以提取核心三元组并生成下一跳查询。该模块通过保留多个看似合理的证据链来减轻贪婪陷阱。此外，我们提出了一种自适应级联多粒度生成模块，可根据问题要求逐步扩展上下文证据，从三元组到支持句子和完整段落。此外，我们引入了轨迹蒸馏，它将教师模型的集成策略蒸馏为轻量级学生模型，从而实现高效可靠的长视野推理。大量实验表明，与现有 iRAG 方法相比，CIRAG 具有卓越的性能。

Title: Forest Before Trees: Latent Superposition for Efficient Visual Reasoning

Authors: Yubo Wang, Juntian Zhang, Yichen Wu, Yankai Lin, Nils Lukas, Yuhan Liu
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2601.06803
Pdf URL: https://arxiv.org/pdf/2601.06803
Copy Paste: [[2601.06803]] Forest Before Trees: Latent Superposition for Efficient Visual Reasoning(https://arxiv.org/abs/2601.06803)
Keywords: language model, chain-of-thought
Abstract: While Chain-of-Thought empowers Large Vision-Language Models with multi-step reasoning, explicit textual rationales suffer from an information bandwidth bottleneck, where continuous visual details are discarded during discrete tokenization. Recent latent reasoning methods attempt to address this challenge, but often fall prey to premature semantic collapse due to rigid autoregressive objectives. In this paper, we propose Laser, a novel paradigm that reformulates visual deduction via Dynamic Windowed Alignment Learning (DWAL). Instead of forcing a point-wise prediction, Laser aligns the latent state with a dynamic validity window of future semantics. This mechanism enforces a "Forest-before-Trees" cognitive hierarchy, enabling the model to maintain a probabilistic superposition of global features before narrowing down to local details. Crucially, Laser maintains interpretability via decodable trajectories while stabilizing unconstrained learning via Self-Refined Superposition. Extensive experiments on 6 benchmarks demonstrate that Laser achieves state-of-the-art performance among latent reasoning methods, surpassing the strong baseline Monet by 5.03% on average. Notably, it achieves these gains with extreme efficiency, reducing inference tokens by more than 97%, while demonstrating robust generalization to out-of-distribution domains.
摘要：虽然思维链为大型视觉语言模型提供了多步推理，但显式的文本推理依据却受到信息带宽瓶颈的影响，在离散标记化过程中，连续的视觉细节被丢弃。最近的潜在推理方法试图解决这一挑战，但由于严格的自回归目标，常常会陷入过早的语义崩溃。在本文中，我们提出了 Laser，这是一种通过动态窗口对齐学习（DWAL）重新制定视觉演绎的新颖范式。 Laser 不是强制进行逐点预测，而是将潜在状态与未来语义的动态有效性窗口对齐。这种机制强制执行“先森林后树”的认知层次结构，使模型能够在缩小到局部细节之前保持全局特征的概率叠加。至关重要的是，Laser 通过可解码的轨迹保持可解释性，同时通过自我完善的叠加稳定无约束的学习。对 6 个基准测试的大量实验表明，Laser 在潜在推理方法中实现了最先进的性能，平均超过强基准 Monet 5.03%。值得注意的是，它以极高的效率实现了这些收益，将推理标记减少了 97% 以上，同时展示了对分布外域的强大泛化能力。

Title: AgentHallu: Benchmarking Automated Hallucination Attribution of LLM-based Agents

Authors: Xuannan Liu, Xiao Yang, Zekun Li, Peipei Li, Ran He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.06818
Pdf URL: https://arxiv.org/pdf/2601.06818
Copy Paste: [[2601.06818]] AgentHallu: Benchmarking Automated Hallucination Attribution of LLM-based Agents(https://arxiv.org/abs/2601.06818)
Keywords: gpt, llm, hallucination, agent
Abstract: As LLM-based agents operate over sequential multi-step reasoning, hallucinations arising at intermediate steps risk propagating along the trajectory, thus degrading overall reliability. Unlike hallucination detection in single-turn responses, diagnosing hallucinations in multi-step workflows requires identifying which step causes the initial divergence. To fill this gap, we propose a new research task, automated hallucination attribution of LLM-based agents, aiming to identify the step responsible for the hallucination and explain why. To support this task, we introduce AgentHallu, a comprehensive benchmark with: (1) 693 high-quality trajectories spanning 7 agent frameworks and 5 domains, (2) a hallucination taxonomy organized into 5 categories (Planning, Retrieval, Reasoning, Human-Interaction, and Tool-Use) and 14 sub-categories, and (3) multi-level annotations curated by humans, covering binary labels, hallucination-responsible steps, and causal explanations. We evaluate 13 leading models, and results show the task is challenging even for top-tier models (like GPT-5, Gemini-2.5-Pro). The best-performing model achieves only 41.1% step localization accuracy, where tool-use hallucinations are the most challenging at just 11.6%. We believe AgentHallu will catalyze future research into developing robust, transparent, and reliable agentic systems.
摘要：由于基于 LLM 的智能体通过顺序多步骤推理进行操作，因此中间步骤产生的幻觉有沿着轨迹传播的风险，从而降低了整体可靠性。与单轮响应中的幻觉检测不同，在多步骤工作流程中诊断幻觉需要确定哪个步骤导致初始分歧。为了填补这一空白，我们提出了一项新的研究任务，即基于 LLM 的智能体的自动幻觉归因，旨在确定导致幻觉的步骤并解释原因。为了支持这项任务，我们引入了 AgentHallu，这是一个综合基准测试，其中包含：（1）跨越 7 个智能体框架和 5 个领域的 693 个高质量轨迹，（2）分为 5 个类别（规划、检索、推理、人类交互和工具使用）和 14 个子类别的幻觉分类法，以及（3）由人类策划的多级注释，涵盖二元标签、导致幻觉的步骤和因果解释。我们评估了 13 个领先模型，结果表明即使对于顶级模型（如 GPT-5、Gemini-2.5-Pro），这项任务也具有挑战性。性能最佳的模型仅实现 41.1% 的步骤定位准确率，其中工具使用幻觉最具挑战性，仅为 11.6%。我们相信 AgentHallu 将促进未来研究开发强大、透明和可靠的智能体系统。

Title: PDR: A Plug-and-Play Positional Decay Framework for LLM Pre-training Data Detection

Authors: Jinhan Liu, Yibo Yang, Ruiying Lu, Piotr Piekos, Yimeng Chen, Peng Wang, Dandan Guo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.06827
Pdf URL: https://arxiv.org/pdf/2601.06827
Copy Paste: [[2601.06827]] PDR: A Plug-and-Play Positional Decay Framework for LLM Pre-training Data Detection(https://arxiv.org/abs/2601.06827)
Keywords: language model, llm
Abstract: Detecting pre-training data in Large Language Models (LLMs) is crucial for auditing data privacy and copyright compliance, yet it remains challenging in black-box, zero-shot settings where computational resources and training data are scarce. While existing likelihood-based methods have shown promise, they typically aggregate token-level scores using uniform weights, thereby neglecting the inherent information-theoretic dynamics of autoregressive generation. In this paper, we hypothesize and empirically validate that memorization signals are heavily skewed towards the high-entropy initial tokens, where model uncertainty is highest, and decay as context accumulates. To leverage this linguistic property, we introduce Positional Decay Reweighting (PDR), a training-free and plug-and-play framework. PDR explicitly reweights token-level scores to amplify distinct signals from early positions while suppressing noise from later ones. Extensive experiments show that PDR acts as a robust prior and can usually enhance a wide range of advanced methods across multiple benchmarks.
摘要：检测大型语言模型 (LLM) 中的预训练数据对于审核数据隐私和版权合规性至关重要，但在计算资源和训练数据稀缺的黑盒、零样本环境中仍然具有挑战性。虽然现有的基于可能性的方法已经显示出希望，但它们通常使用统一的权重来聚合令牌级别的分数，从而忽略了自回归生成的固有信息论动态。在本文中，我们假设并凭经验验证，记忆信号严重偏向高熵初始标记，其中模型不确定性最高，并且随着上下文的积累而衰减。为了利用这种语言特性，我们引入了位置衰减重新加权（PDR），这是一种免训练且即插即用的框架。 PDR 明确地重新加权令牌级别分数，以放大来自早期位置的不同信号，同时抑制来自后面位置的噪音。大量实验表明，PDR 可以作为强大的先验，通常可以在多个基准测试中增强各种先进方法。
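
As a rough illustration of the reweighting idea described in this abstract, the sketch below applies position-dependent weights to per-token log-probabilities before aggregating them. The exponential decay form, the `alpha` parameter, and all function names are assumptions made for illustration; the abstract does not state which decay function the authors use.

```python
import math
from typing import List


def decay_weights(n: int, alpha: float = 0.1) -> List[float]:
    """Hypothetical exponential decay: earlier positions get larger weights."""
    raw = [math.exp(-alpha * t) for t in range(n)]
    total = sum(raw)
    return [w / total for w in raw]  # normalize so the weights sum to 1


def uniform_score(token_logprobs: List[float]) -> float:
    """Baseline aggregation: plain mean of token log-probabilities."""
    return sum(token_logprobs) / len(token_logprobs)


def reweighted_score(token_logprobs: List[float], alpha: float = 0.1) -> float:
    """Position-weighted mean that emphasizes early tokens."""
    weights = decay_weights(len(token_logprobs), alpha)
    return sum(w * lp for w, lp in zip(weights, token_logprobs))
```

With `alpha = 0` the weights are uniform and the two scores coincide, so the decay rate acts as a single knob on how strongly early positions dominate.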

Title: Explainable Multimodal Aspect-Based Sentiment Analysis with Dependency-guided Large Language Model

Authors: Zhongzheng Wang, Yuanhe Tian, Hongzhi Wang, Yan Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.06848
Pdf URL: https://arxiv.org/pdf/2601.06848
Copy Paste: [[2601.06848]] Explainable Multimodal Aspect-Based Sentiment Analysis with Dependency-guided Large Language Model(https://arxiv.org/abs/2601.06848)
Keywords: language model, llm, prompt
Abstract: Multimodal aspect-based sentiment analysis (MABSA) aims to identify aspect-level sentiments by jointly modeling textual and visual information, which is essential for fine-grained opinion understanding in social media. Existing approaches mainly rely on discriminative classification with complex multimodal fusion, yet lack explicit sentiment explainability. In this paper, we reformulate MABSA as a generative and explainable task, proposing a unified framework that simultaneously predicts aspect-level sentiment and generates natural language explanations. Based on multimodal large language models (MLLMs), our approach employs a prompt-based generative paradigm, jointly producing sentiment and explanation. To further enhance aspect-oriented reasoning capabilities, we propose a dependency-syntax-guided sentiment cue strategy. This strategy prunes and textualizes the aspect-centered dependency syntax tree, guiding the model to distinguish different sentiment aspects and enhancing its explainability. To enable explainability, we use MLLMs to construct new datasets with sentiment explanations for fine-tuning. Experiments show that our approach not only achieves consistent gains in sentiment classification accuracy, but also produces faithful, aspect-grounded explanations.
摘要：基于多模态方面的情感分析（MABSA）旨在通过联合建模文本和视觉信息来识别方面级别的情感，这对于社交媒体中细粒度的意见理解至关重要。现有方法主要依赖于具有复杂多模态融合的判别分类，但缺乏明确的情感可解释性。在本文中，我们将 MABSA 重新表述为一项生成且可解释的任务，提出了一个统一的框架，可以同时预测方面级别的情感并生成自然语言解释。基于多模态大语言模型（MLLM），我们的方法采用基于提示的生成范式，共同产生情感和解释。为了进一步增强面向方面的推理能力，我们提出了一种依存语法引导的情感线索策略。该策略对以方面为中心的依存语法树进行修剪和文本化，指导模型区分不同的情感方面并增强其可解释性。为了实现可解释性，我们使用 MLLM 构建新的数据集，并通过情感解释进行微调。实验表明，我们的方法不仅在情感分类准确性方面取得了一致的进步，而且还产生了忠实的、基于方面的解释。

Title: †DAGGER: Distractor-Aware Graph Generation for Executable Reasoning in Math Problems

Authors: Zabir Al Nazi, Shubhashis Roy Dipta, Sudipta Kar
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.06853
Pdf URL: https://arxiv.org/pdf/2601.06853
Copy Paste: [[2601.06853]] †DAGGER: Distractor-Aware Graph Generation for Executable Reasoning in Math Problems(https://arxiv.org/abs/2601.06853)
Keywords: prompt, chain-of-thought
Abstract: Chain-of-Thought (CoT) prompting is widely adopted for mathematical problem solving, including in low-resource languages, yet its behavior under irrelevant context remains underexplored. To systematically study this challenge, we introduce DISTRACTMATH-BN, a Bangla benchmark that augments MGSM and MSVAMP with semantically coherent but computationally irrelevant information. Evaluating seven models ranging from 3B to 12B parameters, we observe substantial performance degradation under distractors: standard models drop by up to 41 points, while reasoning-specialized models decline by 14 to 20 points despite consuming five times more tokens. We propose †DAGGER, which reformulates mathematical problem solving as executable computational graph generation with explicit modeling of distractor nodes. Fine-tuning Gemma-3 models using supervised fine-tuning followed by Group Relative Policy Optimization achieves comparable weighted accuracy on augmented benchmarks while using 89 percent fewer tokens than reasoning models. Importantly, this robustness emerges without explicit training on distractor-augmented examples. Our results suggest that enforcing structured intermediate representations improves robustness and inference efficiency in mathematical reasoning compared to free-form approaches, particularly in noisy, low-resource settings.
摘要：思维链 (CoT) 提示被广泛用于解决数学问题，包括在低资源语言中，但其在不相关上下文下的行为仍未得到充分探索。为了系统地研究这一挑战，我们引入了 DISTRACTMATH-BN，这是一个孟加拉语基准，它通过语义一致但计算上不相关的信息增强了 MGSM 和 MSVAMP。通过评估从 3B 到 12B 参数的 7 个模型，我们观察到干扰因素下性能大幅下降：标准模型最多下降 41 点，而推理专用模型尽管消耗了 5 倍的 token，仍下降了 14 到 20 点。我们提出 †DAGGER，它将数学问题解决重新表述为具有干扰节点显式建模的可执行计算图生成。使用监督微调和组相对策略优化对 Gemma-3 模型进行微调，可在增强基准上实现可比的加权精度，同时使用比推理模型少 89% 的 token。重要的是，这种鲁棒性是在没有对干扰增强示例进行明确训练的情况下出现的。我们的结果表明，与自由形式方法相比，实施结构化中间表示可以提高数学推理的鲁棒性和推理效率，特别是在嘈杂、资源匮乏的环境中。

Title: BiasLab: A Multilingual, Dual-Framing Framework for Robust Measurement of Output-Level Bias in Large Language Models

Authors: William Guey, Wei Zhang, Pei-Luen Patrick Rau, Pierrick Bougault, Vitor D. de Moura, Bertan Ucar, Jose O. Gomes
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.06861
Pdf URL: https://arxiv.org/pdf/2601.06861
Copy Paste: [[2601.06861]] BiasLab: A Multilingual, Dual-Framing Framework for Robust Measurement of Output-Level Bias in Large Language Models(https://arxiv.org/abs/2601.06861)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are increasingly deployed in high-stakes contexts where their outputs influence real-world decisions. However, evaluating bias in LLM outputs remains methodologically challenging due to sensitivity to prompt wording, limited multilingual coverage, and the lack of standardized metrics that enable reliable comparison across models. This paper introduces BiasLab, an open-source, model-agnostic evaluation framework for quantifying output-level (extrinsic) bias through a multilingual, robustness-oriented experimental design. BiasLab constructs mirrored probe pairs under a strict dual-framing scheme: an affirmative assertion favoring Target A and a reverse assertion obtained by deterministic target substitution favoring Target B, while preserving identical linguistic structure. To reduce dependence on prompt templates, BiasLab performs repeated evaluation under randomized instructional wrappers and enforces a fixed-choice Likert response format to maximize comparability across models and languages. Responses are normalized into agreement labels using an LLM-based judge, aligned for polarity consistency across framings, and aggregated into quantitative bias indicators with descriptive statistics including effect sizes and neutrality rates. The framework supports evaluation across diverse bias axes, including demographic, cultural, political, and geopolitical topics, and produces reproducible artifacts such as structured reports and comparative visualizations. BiasLab contributes a standardized methodology for cross-lingual and framing-sensitive bias measurement that complements intrinsic and dataset-based audits, enabling researchers and institutions to benchmark robustness and make better-informed deployment decisions.
摘要：大型语言模型 (LLM) 越来越多地部署在高风险环境中，其输出影响现实世界的决策。然而，由于对提示措辞的敏感性、有限的多语言覆盖范围以及缺乏能够跨模型进行可靠比较的标准化指标，评估 LLM 输出的偏差在方法上仍然具有挑战性。本文介绍了 BiasLab，这是一个与模型无关的开源评估框架，用于通过多语言、稳健性导向的实验设计来量化输出水平（外在）偏差。 BiasLab 在严格的双框架方案下构建镜像探针对：有利于目标 A 的肯定断言和通过确定性目标替换获得的有利于目标 B 的反向断言，同时保留相同的语言结构。为了减少对提示模板的依赖，BiasLab 在随机化的指令包装下执行重复评估，并强制执行固定选择 Likert 响应格式，以最大限度地提高模型和语言之间的可比性。使用基于 LLM 的评判模型将回答标准化为一致标签，跨框架对齐极性一致性，并汇总为具有描述性统计数据（包括效应大小和中性率）的定量偏差指标。该框架支持跨不同偏见轴的评估，包括人口、文化、政治和地缘政治主题，并生成可重复的工件，例如结构化报告和比较可视化。 BiasLab 为跨语言和框架敏感的偏差测量提供了一种标准化方法，该方法补充了内在审计和基于数据集的审计，使研究人员和机构能够对稳健性进行基准测试并做出更明智的部署决策。

Title: Paraphrasing Adversarial Attack on LLM-as-a-Reviewer

Authors: Masahiro Kaneko
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.06884
Pdf URL: https://arxiv.org/pdf/2601.06884
Copy Paste: [[2601.06884]] Paraphrasing Adversarial Attack on LLM-as-a-Reviewer(https://arxiv.org/abs/2601.06884)
Keywords: language model, llm, prompt
Abstract: The use of large language models (LLMs) in peer review systems has attracted growing attention, making it essential to examine their potential vulnerabilities. Prior attacks rely on prompt injection, which alters manuscript content and conflates injection susceptibility with evaluation robustness. We propose the Paraphrasing Adversarial Attack (PAA), a black-box optimization method that searches for paraphrased sequences yielding higher review scores while preserving semantic equivalence and linguistic naturalness. PAA leverages in-context learning, using previous paraphrases and their scores to guide candidate generation. Experiments across five ML and NLP conferences with three LLM reviewers and five attacking models show that PAA consistently increases review scores without changing the paper's claims. Human evaluation confirms that generated paraphrases maintain meaning and naturalness. We also find that attacked papers exhibit increased perplexity in reviews, offering a potential detection signal, and that paraphrasing submissions can partially mitigate attacks.
摘要：在同行评审系统中使用大型语言模型（LLM）引起了越来越多的关注，因此检查其潜在漏洞至关重要。先前的攻击依赖于提示注入，这会改变手稿内容并将注入敏感性与评估稳健性混为一谈。我们提出了释义对抗攻击（PAA），这是一种黑盒优化方法，可以搜索释义序列，产生更高的评审分数，同时保持语义等价性和语言自然性。 PAA 利用上下文学习，使用之前的释义及其分数来指导候选生成。在五场 ML 和 NLP 会议上，三名 LLM 审稿人和五种攻击模型进行的实验表明，PAA 在不改变论文主张的情况下持续提高了评审分数。人类评估证实生成的释义保持了意义和自然性。我们还发现，受到攻击的论文在评审中表现出更高的困惑度，提供了潜在的检测信号，并且对提交的稿件进行释义可以部分减轻攻击。

Title: Distributional Clarity: The Hidden Driver of RL-Friendliness in Large Language Models

Authors: Shaoning Sun, Mingzhu Cai, Huang He, Bingjin Chen, Siqi Bao, Yujiu Yang, Hua Wu, Haifeng Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.06911
Pdf URL: https://arxiv.org/pdf/2601.06911
Copy Paste: [[2601.06911]] Distributional Clarity: The Hidden Driver of RL-Friendliness in Large Language Models(https://arxiv.org/abs/2601.06911)
Keywords: language model
Abstract: Language model families exhibit striking disparity in their capacity to benefit from reinforcement learning: under identical training, models like Qwen achieve substantial gains, while others like Llama yield limited improvements. Complementing data-centric approaches, we reveal that this disparity reflects a hidden structural property: distributional clarity in probability space. Through a three-stage analysis (from phenomenon to mechanism to interpretation), we uncover that RL-friendly models exhibit intra-class compactness and inter-class separation in their probability assignments to correct vs. incorrect responses. We quantify this clarity using the Silhouette Coefficient (S) and demonstrate that (1) high S correlates strongly with RL performance; (2) low S is associated with severe logic errors and reasoning instability. To confirm this property, we introduce a Silhouette-Aware Reweighting strategy that prioritizes low-S samples during training. Experiments across six mathematical benchmarks show consistent improvements across all model families, with gains up to 5.9 points on AIME24. Our work establishes distributional clarity as a fundamental, trainable property underlying RL-Friendliness.
摘要：语言模型家族在从强化学习中受益的能力方面表现出惊人的差异：在相同的训练下，像 Qwen 这样的模型取得了巨大的进步，而像 Llama 这样的其他模型则取得了有限的进步。作为对以数据为中心的方法的补充，我们揭示了这种差异反映了一个隐藏的结构属性：概率空间中的分布清晰度。通过从现象到机制到解释的三阶段分析，我们发现强化学习友好模型在正确与错误响应的概率分配中表现出类内紧凑性和类间分离。我们使用轮廓系数 (S) 量化这种清晰度，并证明 (1) 高 S 与 RL 性能密切相关； (2) 低 S 与严重的逻辑错误和推理不稳定有关。为了确认这一属性，我们引入了一种轮廓感知重加权策略，该策略在训练期间优先考虑低 S 样本。六个数学基准的实验表明，所有模型系列都有一致的改进，在 AIME24 上的提升高达 5.9 分。我们的工作将分布清晰度确立为强化学习友好性的基本可训练属性。
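
To make the Silhouette Coefficient mentioned in this abstract concrete, here is a minimal sketch that computes the standard mean silhouette for two groups of scalar values. Treating the inputs as sequence log-probabilities of correct and incorrect responses is an assumption for illustration, as is the function name; the abstract does not specify the exact features or distance the authors use. Both groups are assumed to be non-empty.

```python
from typing import List


def silhouette_two_groups(correct: List[float], incorrect: List[float]) -> float:
    """Mean silhouette coefficient for two non-empty groups of 1-D values."""

    def mean_dist(x: float, others: List[float]) -> float:
        return sum(abs(x - o) for o in others) / len(others)

    scores: List[float] = []
    for own, other in ((correct, incorrect), (incorrect, correct)):
        for i, x in enumerate(own):
            rest = own[:i] + own[i + 1:]
            if not rest:
                scores.append(0.0)  # singleton group: silhouette is defined as 0
                continue
            a = mean_dist(x, rest)   # intra-class compactness
            b = mean_dist(x, other)  # inter-class separation
            denom = max(a, b)
            scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical log-probabilities: well separated vs. interleaved groups.
clear = silhouette_two_groups([-0.1, -0.2], [-5.0, -5.2])
mixed = silhouette_two_groups([-0.1, -5.0], [-0.2, -5.2])
```

Values near 1 indicate compact, well-separated groups, while values near 0 or below indicate overlap between the correct and incorrect groups.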

Title: TreePS-RAG: Tree-based Process Supervision for Reinforcement Learning in Agentic RAG

Authors: Tianhua Zhang, Kun Li, Junan Li, Yunxiang Li, Hongyin Luo, Xixin Wu, James Glass, Helen Meng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.06922
Pdf URL: https://arxiv.org/pdf/2601.06922
Copy Paste: [[2601.06922]] TreePS-RAG: Tree-based Process Supervision for Reinforcement Learning in Agentic RAG(https://arxiv.org/abs/2601.06922)
Keywords: retrieval-augmented generation, agent
Abstract: Agentic retrieval-augmented generation (RAG) formulates question answering as a multi-step interaction between reasoning and information retrieval, and has recently been advanced by reinforcement learning (RL) with outcome-based supervision. While effective, relying solely on sparse final rewards limits step-wise credit assignment and provides weak guidance for intermediate reasoning and actions. Recent efforts explore process-level supervision, but typically depend on offline constructed training data, which risks distribution shift, or require costly intermediate annotations. We present TreePS-RAG, an online, tree-based RL framework for agentic RAG that enables step-wise credit assignment while retaining standard outcome-only rewards. Our key insight is to model agentic RAG reasoning as a rollout tree, where each reasoning step naturally maps to a node. This tree structure allows step utility to be estimated via Monte Carlo estimation over its descendant outcomes, yielding fine-grained process advantages without requiring intermediate labels. To make this paradigm practical, we introduce an efficient online tree construction strategy that preserves exploration diversity under a constrained computational budget. With a rollout cost comparable to strong baselines like Search-R1, experiments on seven multi-hop and general QA benchmarks across multiple model scales show that TreePS-RAG consistently and significantly outperforms both outcome-supervised and leading process-supervised RL methods.
摘要：代理检索增强生成（RAG）将问答制定为推理和信息检索之间的多步骤交互，并且最近通过基于结果的监督的强化学习（RL）得到了推进。虽然有效，但仅仅依靠稀疏的最终奖励限制了逐步的信用分配，并且为中间推理和行动提供了薄弱的指导。最近的努力探索了过程级监督，但通常依赖于离线构建的训练数据，这存在分布变化的风险，或者需要昂贵的中间注释。我们提出了 TreePS-RAG，这是一种用于代理 RAG 的基于树的在线 RL 框架，它可以实现逐步的信用分配，同时保留标准的仅结果奖励。我们的主要见解是将代理 RAG 推理建模为 rollout 树，其中每个推理步骤自然映射到一个节点。这种树结构允许通过蒙特卡洛估计来估计其后代结果的步骤效用，从而产生细粒度的过程优势，而不需要中间标签。为了使这个范例变得实用，我们引入了一种有效的在线树构建策略，该策略可以在有限的计算预算下保留探索多样性。其推出成本与 Search-R1 等强大基线相当，跨多个模型规模的七个多跳和通用 QA 基准的实验表明，TreePS-RAG 始终显着优于结果监督和领先的过程监督 RL 方法。
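
The abstract describes estimating step utility by Monte Carlo estimation over a node's descendant outcomes. The sketch below illustrates that idea on a toy rollout tree. The `Node` class, the averaging of leaf rewards, and the definition of advantage as child value minus parent value are assumptions for illustration, not the paper's exact formulation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str
    reward: Optional[float] = None  # outcome reward, set only on leaves
    children: List["Node"] = field(default_factory=list)


def leaf_rewards(node: Node) -> List[float]:
    """Collect the outcome rewards of all leaves below (or at) this node."""
    if not node.children:
        return [node.reward]
    rewards: List[float] = []
    for child in node.children:
        rewards.extend(leaf_rewards(child))
    return rewards


def value(node: Node) -> float:
    """Monte Carlo value estimate: mean outcome over descendant leaves."""
    rewards = leaf_rewards(node)
    return sum(rewards) / len(rewards)


def step_advantages(node: Node, out: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Assumed advantage of a step: value(child) - value(parent)."""
    if out is None:
        out = {}
    for child in node.children:
        out[child.name] = value(child) - value(node)
        step_advantages(child, out)
    return out


# Toy tree: two first steps, each followed by two final answers.
root = Node("root", children=[
    Node("A", children=[Node("a1", reward=1.0), Node("a2", reward=1.0)]),
    Node("B", children=[Node("b1", reward=0.0), Node("b2", reward=1.0)]),
])
advantages = step_advantages(root)
```

Only final outcomes carry rewards here, yet every intermediate step receives its own signal, which is the property the abstract highlights.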

Title: X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests

Authors: Jie Wu, Haoling Li, Xin Zhang, Jiani Guo, Jane Luo, Steven Liu, Yangyu Huang, Ruihang Chu, Scarlett Li, Yujiu Yang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.06953
Pdf URL: https://arxiv.org/pdf/2601.06953
Copy Paste: [[2601.06953]] X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests(https://arxiv.org/abs/2601.06953)
Keywords: llm
Abstract: Competitive programming presents great challenges for Code LLMs due to its intensive reasoning demands and high logical complexity. However, current Code LLMs still rely heavily on real-world data, which limits their scalability. In this paper, we explore a fully synthetic approach: training Code LLMs with entirely generated tasks, solutions, and test cases, to empower code reasoning models without relying on real-world data. To support this, we leverage feature-based synthesis to propose a novel data synthesis pipeline called SynthSmith. SynthSmith shows strong potential in producing diverse and challenging tasks, along with verified solutions and tests, supporting both supervised fine-tuning and reinforcement learning. Based on the proposed synthetic SFT and RL datasets, we introduce the X-Coder model series, which achieves a notable pass rate of 62.9 avg@8 on LiveCodeBench v5 and 55.8 on v6, outperforming DeepCoder-14B-Preview and AReal-boba2-14B despite having only 7B parameters. In-depth analysis reveals that scaling laws hold on our synthetic dataset, and we explore which dimensions are more effective to scale. We further provide insights into code-centric reinforcement learning and highlight the key factors that shape performance through detailed ablations and analysis. Our findings demonstrate that scaling high-quality synthetic data and adopting staged training can greatly advance code reasoning, while mitigating reliance on real-world coding data.
摘要：竞争性编程由于其密集的推理要求和较高的逻辑复杂性，给代码法硕士带来了巨大的挑战。然而，当前的代码法学硕士仍然严重依赖现实世界的数据，这限制了它们的可扩展性。在本文中，我们探索了一种完全综合的方法：使用完全生成的任务、解决方案和测试用例来训练代码法学硕士，以在不依赖真实世界数据的情况下增强代码推理模型的能力。为了支持这一点，我们利用基于特征的合成提出了一种名为 SynthSmith 的新型数据合成管道。 SynthSmith 在生成多样化且具有挑战性的任务以及经过验证的解决方案和测试方面显示出强大的潜力，支持监督微调和强化学习。基于所提出的合成 SFT 和 RL 数据集，我们引入了 X-Coder 模型系列，该模型在 LiveCodeBench v5 上实现了 62.9 avg@8 的通过率，在 v6 上实现了 55.8 的通过率，尽管只有 7B 个参数，但其性能优于 DeepCoder-14B-Preview 和 AReal-boba2-14B。深入分析表明，缩放定律适用于我们的合成数据集，并且我们探索哪些维度更有效缩放。我们进一步提供了对以代码为中心的强化学习的见解，并通过详细的消融和分析突出了影响性能的关键因素。我们的研究结果表明，扩展高质量的合成数据和采用分阶段训练可以极大地推进代码推理，同时减轻对现实世界编码数据的依赖。

Title: RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction

Authors: Haonan Bian, Zhiyuan Yao, Sen Hu, Zishan Xu, Shaolei Zhang, Yifu Guo, Ziliang Yang, Xueran Han, Huacan Wang, Ronghao Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.06966
Pdf URL: https://arxiv.org/pdf/2601.06966
Copy Paste: [[2601.06966]] RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction(https://arxiv.org/abs/2601.06966)
Keywords: language model, llm, agent
Abstract: As Large Language Models (LLMs) evolve from static dialogue interfaces to autonomous general agents, effective memory is paramount to ensuring long-term consistency. However, existing benchmarks primarily focus on casual conversation or task-oriented dialogue, failing to capture "long-term project-oriented" interactions where agents must track evolving goals. To bridge this gap, we introduce RealMem, the first benchmark grounded in realistic project scenarios. RealMem comprises over 2,000 cross-session dialogues across eleven scenarios, utilizing natural user queries for evaluation. We propose a synthesis pipeline that integrates Project Foundation Construction, Multi-Agent Dialogue Generation, and Memory and Schedule Management to simulate the dynamic evolution of memory. Experiments reveal that current memory systems face significant challenges in managing the long-term project states and dynamic context dependencies inherent in real-world projects. Our code and datasets are available at this https URL.
摘要：随着大型语言模型（LLM）从静态对话接口发展到自治通用代理，有效的记忆对于确保长期一致性至关重要。然而，现有的基准主要侧重于随意对话或面向任务的对话，未能捕获“面向长期项目”交互，其中代理必须跟踪不断变化的目标。为了弥补这一差距，我们引入了 RealMem，这是第一个基于实际项目场景的基准测试。 RealMem 包含 11 个场景的 2,000 多个跨会话对话，利用自然用户查询进行评估。我们提出了一个综合管道，集成了项目基础构建、多智能体对话生成以及内存和进度管理来模拟内存的动态演化。实验表明，当前的记忆系统在管理长期项目状态和现实项目中固有的动态上下文依赖性方面面临着重大挑战。我们的代码和数据集可在此 https URL 处获取。

Title: Categorize Early, Integrate Late: Divergent Processing Strategies in Automatic Speech Recognition

Authors: Nathan Roll, Pranav Bhalerao, Martijn Bartelds, Arjun Pawar, Yuka Tatsumi, Tolulope Ogunremi, Chen Shani, Calbert Graham, Meghan Sumner, Dan Jurafsky
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.06972
Pdf URL: https://arxiv.org/pdf/2601.06972
Copy Paste: [[2601.06972]] Categorize Early, Integrate Late: Divergent Processing Strategies in Automatic Speech Recognition(https://arxiv.org/abs/2601.06972)
Keywords: language model
Abstract: In speech language modeling, two architectures dominate the frontier: the Transformer and the Conformer. However, it remains unknown whether their comparable performance stems from convergent processing strategies or distinct architectural inductive biases. We introduce Architectural Fingerprinting, a probing framework that isolates the effect of architecture on representation, and apply it to a controlled suite of 24 pre-trained encoders (39M-3.3B parameters). Our analysis reveals divergent hierarchies: Conformers implement a "Categorize Early" strategy, resolving phoneme categories 29% earlier in depth and speaker gender by 16% depth. In contrast, Transformers "Integrate Late," deferring phoneme, accent, and duration encoding to deep layers (49-57%). These fingerprints suggest design heuristics: Conformers' front-loaded categorization may benefit low-latency streaming, while Transformers' deep integration may favor tasks requiring rich context and cross-utterance normalization.
摘要：在语音语言建模中，两种架构主导着前沿：Transformer 和 Conformer。然而，目前尚不清楚它们的可比性能是否源于收敛处理策略或独特的架构归纳偏差。我们引入了架构指纹识别（Architectural Fingerprinting），这是一种探测框架，可以隔离架构对表示的影响，并将其应用于由 24 个预训练编码器（39M-3.3B 参数）组成的受控套件。我们的分析揭示了不同的层次结构：遵从者实施“早期分类”策略，提前 29% 深度解析音素类别，提前 16% 深度解析说话者性别。相比之下，变形金刚“后期集成”，将音素、口音和持续时间编码推迟到深层（49-57%）。这些指纹表明了设计启发：Conformers 的前置分类可能有利于低延迟流，而 Transformers 的深度集成可能有利于需要丰富上下文和跨话语规范化的任务。

Title: LLMs Can't Play Hangman: On the Necessity of a Private Working Memory for Language Agents

Authors: Davide Baldelli, Ali Parviz, Amal Zouaq, Sarath Chandar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.06973
Pdf URL: https://arxiv.org/pdf/2601.06973
Copy Paste: [[2601.06973]] LLMs Can't Play Hangman: On the Necessity of a Private Working Memory for Language Agents(https://arxiv.org/abs/2601.06973)
Keywords: llm, chat, agent
Abstract: As LLMs move from text completion toward autonomous agents, they remain constrained by the standard chat interface, which lacks private working memory. This raises a fundamental question: can agents reliably perform interactive tasks that depend on hidden state? We define Private State Interactive Tasks (PSITs), which require agents to generate and maintain hidden information while producing consistent public responses. We show theoretically that any agent restricted to the public conversation history cannot simultaneously preserve secrecy and consistency in PSITs, yielding an impossibility theorem. To empirically validate this limitation, we introduce a self-consistency testing protocol that evaluates whether agents can maintain a hidden secret across forked dialogue branches. Standard chat-based LLMs and retrieval-based memory baselines fail this test regardless of scale, demonstrating that semantic retrieval does not enable true state maintenance. To address this, we propose a novel architecture incorporating an explicit private working memory; we demonstrate that this mechanism restores consistency, establishing private state as a necessary component for interactive language agents.
摘要：随着法学硕士从文本完成转向自主代理，他们仍然受到标准聊天界面的限制，而该界面缺乏私人工作记忆。这就提出了一个基本问题：代理能否可靠地执行依赖于隐藏状态的交互任务？我们定义了私有状态交互任务（PSIT），它要求代理生成和维护隐藏信息，同时产生一致的公共响应。我们从理论上证明，任何仅限于公共对话历史的智能体都无法同时保持 PSIT 的保密性和一致性，从而产生了不可能定理。为了凭经验验证这一限制，我们引入了一种自我一致性测试协议，该协议评估代理是否可以跨分叉对话分支维护隐藏的秘密。无论规模如何，标准的基于聊天的 LLM 和基于检索的内存基线都无法通过此测试，这表明语义检索无法实现真实状态维护。为了解决这个问题，我们提出了一种包含显式私有工作内存的新颖架构；我们证明这种机制恢复了一致性，将私有状态建立为交互式语言代理的必要组成部分。
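
The self-consistency test across forked dialogue branches can be pictured with a toy Hangman host. The sketch below is illustrative only: `PrivateMemoryHost`, `consistent_words`, and the vocabulary are hypothetical, and the paper's actual architecture and protocol are not specified in the abstract.

```python
from typing import List, Tuple

History = List[Tuple[str, List[int]]]


class PrivateMemoryHost:
    """Hangman host whose secret lives outside the public transcript."""

    def __init__(self, secret: str) -> None:
        self._secret = secret              # private working memory
        self.public_history: History = []  # what the guesser can see

    def guess(self, letter: str) -> List[int]:
        positions = [i for i, ch in enumerate(self._secret) if ch == letter]
        self.public_history.append((letter, positions))
        return positions

    def fork(self) -> "PrivateMemoryHost":
        """Branch the dialogue; the private state is carried over."""
        clone = PrivateMemoryHost(self._secret)
        clone.public_history = list(self.public_history)
        return clone

    def reveal(self) -> str:
        return self._secret


def consistent_words(history: History, vocabulary: List[str], length: int) -> List[str]:
    """Words a public-history-only agent could still claim as its secret."""
    def matches(word: str) -> bool:
        return len(word) == length and all(
            [i for i, ch in enumerate(word) if ch == letter] == positions
            for letter, positions in history
        )
    return [w for w in vocabulary if matches(w)]


vocabulary = ["crane", "crate", "grape", "brave"]
host = PrivateMemoryHost("crane")
host.guess("a")
host.guess("r")
branch_a, branch_b = host.fork(), host.fork()
candidates = consistent_words(host.public_history, vocabulary, length=5)
```

Both forks reveal the same word because the secret is stored privately. The public history alone still admits every word in `candidates`, so an agent restricted to it has nothing that pins down a single secret across branches.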

Title: MedTutor: A Retrieval-Augmented LLM System for Case-Based Medical Education

Authors: Dongsuk Jang, Ziyao Shangguan, Kyle Tegtmeyer, Anurag Gupta, Jan Czerminski, Sophie Chheang, Arman Cohan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.06979
Pdf URL: https://arxiv.org/pdf/2601.06979
Copy Paste: [[2601.06979]] MedTutor: A Retrieval-Augmented LLM System for Case-Based Medical Education(https://arxiv.org/abs/2601.06979)
Keywords: llm, retrieval-augmented generation
Abstract: The learning process for medical residents presents significant challenges, demanding both the ability to interpret complex case reports and the rapid acquisition of accurate medical knowledge from reliable sources. Residents typically study case reports and engage in discussions with peers and mentors, but finding relevant educational materials and evidence to support their learning from these cases is often time-consuming and challenging. To address this, we introduce MedTutor, a novel system designed to augment resident training by automatically generating evidence-based educational content and multiple-choice questions from clinical case reports. MedTutor leverages a Retrieval-Augmented Generation (RAG) pipeline that takes clinical case reports as input and produces targeted educational materials. The system's architecture features a hybrid retrieval mechanism that synergistically queries a local knowledge base of medical textbooks and academic literature (using PubMed, Semantic Scholar APIs) for the latest related research, ensuring the generated content is both foundationally sound and current. The retrieved evidence is filtered and ordered using a state-of-the-art reranking model, and then an LLM generates the final long-form output describing the main educational content regarding the case report. We conduct a rigorous evaluation of the system. First, three radiologists assessed the quality of outputs, finding them to be of high clinical and educational value. Second, we perform a large-scale evaluation using an LLM-as-a-Judge to understand if LLMs can be used to evaluate the output of the system. Our analysis using correlation between LLM outputs and human expert judgments reveals a moderate alignment and highlights the continued necessity of expert oversight.
摘要：住院医师的学习过程面临着巨大的挑战，既需要解释复杂病例报告的能力，又需要从可靠来源快速获取准确的医学知识。住院医生通常会研究案例报告并与同伴和导师进行讨论，但寻找相关的教育材料和证据来支持他们从这些案例中学习通常既耗时又具有挑战性。为了解决这个问题，我们推出了 MedTutor，这是一种新颖的系统，旨在通过自动生成基于证据的教育内容和临床病例报告中的多项选择题来增强住院医师培训。 MedTutor 利用检索增强生成 (RAG) 管道，将临床病例报告作为输入并生成有针对性的教育材料。该系统的架构采用混合检索机制，可协同查询医学教科书和学术文献的本地知识库（使用 PubMed、Semantic Scholar API）以获取最新的相关研究，确保生成的内容既基本合理又最新。使用最先进的重新排序模型对检索到的证据进行过滤和排序，然后法学硕士生成最终的长格式输出，描述有关病例报告的主要教育内容。我们对系统进行严格的评估。首先，三名放射科医生评估了输出的质量，发现它们具有很高的临床和教育价值。其次，我们使用法学硕士作为法官进行大规模评估，以了解法学硕士是否可用于评估系统的输出。我们使用法学硕士输出与人类专家判断之间的相关性进行的分析揭示了适度的一致性，并强调了专家监督的持续必要性。

Title: TurkBench: A Benchmark for Evaluating Turkish Large Language Models

Authors: Çağrı Toraman, Ahmet Kaan Sever, Ayse Aysu Cengiz, Elif Ecem Arslan, Görkem Sevinç, Mete Mert Birdal, Yusuf Faruk Güldemir, Ali Buğra Kanburoğlu, Sezen Felekoğlu, Osman Gürlek, Sarp Kantar, Birsen Şahin Kütük, Büşra Tufan, Elif Genç, Serkan Coşkun, Gupse Ekin Demir, Muhammed Emin Arayıcı, Olgun Dursun, Onur Gungor, Susan Üsküdarlı, Abdullah Topraksoy, Esra Darıcı
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.07020
Pdf URL: https://arxiv.org/pdf/2601.07020
Copy Paste: [[2601.07020]] TurkBench: A Benchmark for Evaluating Turkish Large Language Models(https://arxiv.org/abs/2601.07020)
Keywords: language model
Abstract: With the recent surge in the development of large language models, the need for comprehensive and language-specific evaluation benchmarks has become critical. While significant progress has been made in evaluating English language models, benchmarks for other languages, particularly those with unique linguistic characteristics such as Turkish, remain less developed. Our study introduces TurkBench, a comprehensive benchmark designed to assess the capabilities of generative large language models in the Turkish language. TurkBench involves 8,151 data samples across 21 distinct subtasks. These are organized under six main categories of evaluation: Knowledge, Language Understanding, Reasoning, Content Moderation, Turkish Grammar and Vocabulary, and Instruction Following. The diverse range of tasks and the culturally relevant data would provide researchers and developers with a valuable tool for evaluating their models and identifying areas for improvement. We further publish our benchmark for online submissions at this https URL
摘要：随着最近大型语言模型发展的激增，对全面且针对特定语言的评估基准的需求变得至关重要。尽管在评估英语语言模型方面取得了重大进展，但其他语言，特别是土耳其语等具有独特语言特征的语言的基准仍然不够发达。我们的研究引入了 TurkBench，这是一个综合基准测试，旨在评估土耳其语生成大型语言模型的能力。 TurkBench 涉及 21 个不同子任务的 8,151 个数据样本。这些评估分为六个主要类别：知识、语言理解、推理、内容审核、土耳其语语法和词汇以及指令遵循。多样化的任务和文化相关的数据将为研究人员和开发人员提供有价值的工具来评估他们的模型并确定需要改进的领域。我们在此 https URL 进一步发布在线提交的基准

Title: Solar Open Technical Report

Authors: Sungrae Park, Sanghoon Kim, Jungho Cho, Gyoungjin Gim, Dawoon Jung, Mikyoung Cha, Eunhae Choo, Taekgyu Hong, Minbyul Jeong, SeHwan Joo, Minsoo Khang, Eunwon Kim, Minjeong Kim, Sujeong Kim, Yunsu Kim, Hyeonju Lee, Seunghyun Lee, Sukyung Lee, Siyoung Park, Gyungin Shin, Inseo Song, Wonho Song, Seonghoon Yang, Seungyoun Yi, Sanghoon Yoon, Jeonghyun Ko, Seyoung Song, Keunwoo Choi, Hwalsuk Lee, Sunghun Kim, Du-Seong Chang, Kyunghyun Cho, Junsuk Choe, Hwaran Lee, Jae-Gil Lee, KyungTae Lim, Alice Oh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.07022
Pdf URL: https://arxiv.org/pdf/2601.07022
Copy Paste: [[2601.07022]] Solar Open Technical Report(https://arxiv.org/abs/2601.07022)
Keywords: language model, llm
Abstract: We introduce Solar Open, a 102B-parameter bilingual Mixture-of-Experts language model for underserved languages. Solar Open demonstrates a systematic methodology for building competitive LLMs by addressing three interconnected challenges. First, to train effectively despite data scarcity for underserved languages, we synthesize 4.5T tokens of high-quality, domain-specific, and RL-oriented data. Second, we coordinate this data through a progressive curriculum jointly optimizing composition, quality thresholds, and domain coverage across 20 trillion tokens. Third, to enable reasoning capabilities through scalable RL, we apply our proposed framework SnapPO for efficient optimization. Across benchmarks in English and Korean, Solar Open achieves competitive performance, demonstrating the effectiveness of this methodology for underserved language AI development.
摘要：我们推出 Solar Open，这是一种针对服务不足的语言的 102B 参数双语专家混合语言模型。 Solar Open 展示了通过解决三个相互关联的挑战来构建有竞争力的法学硕士的系统方法。首先，为了在服务不足的语言缺乏数据的情况下进行有效训练，我们合成了 4.5T 高质量、特定领域和面向 RL 的数据令牌。其次，我们通过渐进式课程来协调这些数据，共同优化 20 万亿代币的组成、质量阈值和领域覆盖范围。第三，为了通过可扩展的强化学习实现推理能力，我们应用我们提出的框架 SnapPO 进行高效优化。在英语和韩语基准测试中，Solar Open 取得了有竞争力的表现，证明了这种方法对于服务不足的语言人工智能开发的有效性。

Title: Codified Foreshadowing-Payoff Text Generation

Authors: Longfei Yun, Kun Zhou, Yupeng Hou, Letian Peng, Jingbo Shang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.07033
Pdf URL: https://arxiv.org/pdf/2601.07033
Copy Paste: [[2601.07033]] Codified Foreshadowing-Payoff Text Generation(https://arxiv.org/abs/2601.07033)
Keywords: language model, llm, prompt
Abstract: Foreshadowing and payoff are ubiquitous narrative devices through which authors introduce commitments early in a story and resolve them through concrete, observable outcomes. However, despite advances in story generation, large language models (LLMs) frequently fail to bridge these long-range narrative dependencies, often leaving "Chekhov's guns" unfired even when the necessary context is present. Existing evaluations largely overlook this structural failure, focusing on surface-level coherence rather than the logical fulfillment of narrative setups. In this paper, we introduce Codified Foreshadowing-Payoff Generation (CFPG), a novel framework that reframes narrative quality through the lens of payoff realization. Recognizing that LLMs struggle to intuitively grasp the "triggering mechanism" of a foreshadowed event, CFPG transforms narrative continuity into a set of executable causal predicates. By mining and encoding Foreshadow-Trigger-Payoff triples from the BookSum corpus, we provide structured supervision that ensures foreshadowed commitments are not only mentioned but also temporally and logically fulfilled. Experiments demonstrate that CFPG significantly outperforms standard prompting baselines in payoff accuracy and narrative alignment. Our findings suggest that explicitly codifying narrative mechanics is essential for moving LLMs from surface-level fluency to genuine narrative competence.
摘要：伏笔和回报是无处不在的叙事手段，作者通过它们在故事的早期引入承诺，并通过具体的、可观察的结果来解决它们。然而，尽管故事生成方面取得了进步，但大型语言模型 (LLM) 经常无法弥合这些长期的叙事依赖性，即使存在必要的上下文，也常常导致“契诃夫之枪”无法开火。现有的评估在很大程度上忽视了这种结构性失败，关注的是表面层面的连贯性，而不是叙事设置的逻辑实现。在本文中，我们介绍了编码化预示-回报生成（CFPG），这是一种通过回报实现的视角重新构建叙事质量的新颖框架。认识到法学硕士很难直观地掌握预示事件的“触发机制”，CFPG 将叙事连续性转变为一组可执行的因果谓词。通过从 BookSum 语料库中挖掘和编码 Foreshadow-Trigger-Payoff 三元组，我们提供结构化监督，确保预示的承诺不仅被提及，而且在时间和逻辑上得到履行。实验表明，CFPG 在支付准确性和叙述一致性方面显着优于标准提示基线。我们的研究结果表明，明确地编纂叙事机制对于将法学硕士从表面流畅性转变为真正的叙事能力至关重要。

Title: Mid-Think: Training-Free Intermediate-Budget Reasoning via Token-Level Triggers

Authors: Wang Yang, Debargha Ganguly, Xinpeng Li, Chaoda Song, Shouren Wang, Vikash Singh, Vipin Chaudhary, Xiaotian Han
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.07036
Pdf URL: https://arxiv.org/pdf/2601.07036
Copy Paste: [[2601.07036]] Mid-Think: Training-Free Intermediate-Budget Reasoning via Token-Level Triggers(https://arxiv.org/abs/2601.07036)
Keywords: language model, prompt
Abstract: Hybrid reasoning language models are commonly controlled through high-level Think/No-think instructions to regulate reasoning behavior, yet we found that such mode switching is largely driven by a small set of trigger tokens rather than the instructions themselves. Through attention analysis and controlled prompting experiments, we show that a leading "Okay" token induces reasoning behavior, while the newline pattern following "" suppresses it. Based on this observation, we propose Mid-Think, a simple training-free prompting format that combines these triggers to achieve intermediate-budget reasoning, consistently outperforming fixed-token and prompt-based baselines in terms of the accuracy-length trade-off. Furthermore, applying Mid-Think to RL training after SFT reduces training time by approximately 15% while improving final performance of Qwen3-8B on AIME from 69.8% to 72.4% and on GPQA from 58.5% to 61.1%, demonstrating its effectiveness for both inference-time control and RL-based reasoning training.
摘要：混合推理语言模型通常通过高级思考/不思考指令来控制以调节推理行为，但我们发现这种模式切换很大程度上是由一小组触发令牌而不是指令本身驱动的。通过注意力分析和受控提示实验，我们表明领先的“Okay”标记会诱发推理行为，而“”后面的换行模式会抑制推理行为。基于这一观察，我们提出了 Mid-Think，这是一种简单的免训练提示格式，它结合了这些触发器来实现中间预算推理，在准确性与长度权衡方面始终优于固定令牌和基于提示的基线。此外，在 SFT 之后将 Mid-Think 应用于 RL 训练可减少约 15% 的训练时间，同时将 Qwen3-8B 在 AIME 上的最终性能从 69.8% 提高到 72.4%，在 GPQA 上从 58.5% 提高到 61.1%，证明了其对于推理时间控制和基于 RL 的推理训练的有效性。

Title: When Abundance Conceals Weakness: Knowledge Conflict in Multilingual Models

Authors: Jiaqi Zhao, Qiang Huang, Haodong Chen, Xiaoxing You, Jun Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.07041
Pdf URL: https://arxiv.org/pdf/2601.07041
Copy Paste: [[2601.07041]] When Abundance Conceals Weakness: Knowledge Conflict in Multilingual Models(https://arxiv.org/abs/2601.07041)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) encode vast world knowledge across multiple languages, yet their internal beliefs are often unevenly distributed across linguistic spaces. When external evidence contradicts these language-dependent memories, models encounter cross-lingual knowledge conflict, a phenomenon largely unexplored beyond English-centric settings. We introduce CLEAR, a Cross-Lingual knowlEdge conflict evAluation fRamework that systematically examines how multilingual LLMs reconcile conflicting internal beliefs and multilingual external evidence. CLEAR decomposes conflict resolution into four progressive scenarios, from multilingual parametric elicitation to competitive multi-source cross-lingual induction, and systematically evaluates model behavior across two complementary QA benchmarks with distinct task characteristics. We construct multilingual versions of ConflictQA and ConflictingQA covering 10 typologically diverse languages and evaluate six representative LLMs. Our experiments reveal a task-dependent decision dichotomy. In reasoning-intensive tasks, conflict resolution is dominated by language resource abundance, with high-resource languages exerting stronger persuasive power. In contrast, for entity-centric factual conflicts, linguistic affinity, not resource scale, becomes decisive, allowing low-resource but linguistically aligned languages to outperform distant high-resource ones.
摘要：大型语言模型（LLM）跨多种语言编码大量的世界知识，但它们的内部信念通常在语言空间中分布不均匀。当外部证据与这些依赖于语言的记忆相矛盾时，模型就会遇到跨语言知识冲突，这种现象在以英语为中心的环境之外基本上未被探索过。我们引入 CLEAR（Cross-Lingual knowlEdge conflict evAluation fRamework），这一框架系统地研究多语言 LLM 如何调和相互冲突的内部信念和多语言外部证据。 CLEAR 将冲突解决分解为四个渐进场景，从多语言参数启发到竞争性多源跨语言归纳，并系统地评估两个具有不同任务特征的互补 QA 基准的模型行为。我们构建了 ConflictQA 和 ConflictingQA 的多语言版本，涵盖 10 种不同类型的语言，并评估了六种代表性的 LLM。我们的实验揭示了任务相关的决策二分法。在推理密集型任务中，冲突解决以语言资源丰富程度为主，资源丰富的语言具有更强的说服力。相比之下，对于以实体为中心的事实冲突，起决定性作用的是语言亲和力，而不是资源规模，这使得资源匮乏但语言一致的语言能够胜过相距遥远的高资源语言。

Title: Engineering of Hallucination in Generative AI: It's not a Bug, it's a Feature

Authors: Tim Fingscheidt, Patrick Blumenberg, Björn Möller
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.07046
Pdf URL: https://arxiv.org/pdf/2601.07046
Copy Paste: [[2601.07046]] Engineering of Hallucination in Generative AI: It's not a Bug, it's a Feature(https://arxiv.org/abs/2601.07046)
Keywords: language model, gpt, hallucination, prompt, chat
Abstract: Generative artificial intelligence (AI) is conquering our lives at lightning speed. Large language models such as ChatGPT answer our questions or write texts for us; large computer vision models such as GAIA-1 generate videos on the basis of text descriptions or continue prompted videos. These neural network models are trained using large amounts of text or video data, strictly according to the real data employed in training. However, there is a surprising observation: When we use these models, they only function satisfactorily when they are allowed a certain degree of fantasy (hallucination). While hallucination usually has a negative connotation in generative AI (after all, ChatGPT is expected to give a fact-based answer!), this article recapitulates some simple means of probability engineering that can be used to encourage generative AI to hallucinate to a limited extent and thus lead to the desired results. We have to ask ourselves: Is hallucination in generative AI probably not a bug, but rather a feature?
摘要：生成式人工智能 (AI) 正在以闪电般的速度征服我们的生活。 ChatGPT 等大型语言模型回答我们的问题或为我们编写文本，GAIA-1 等大型计算机视觉模型根据文本描述生成视频或继续提示视频。这些神经网络模型是使用大量文本或视频数据进行训练的，严格按照训练中使用的真实数据进行训练。然而，有一个令人惊讶的观察结果：当我们使用这些模型时，只有当它们被允许一定程度的幻想（幻觉）时，它们才能令人满意地发挥作用。虽然幻觉在生成人工智能中通常具有负面含义 - 毕竟，ChatGPT 有望给出基于事实的答案！ - 本文概括了概率工程的一些简单方法，可用于鼓励生成式人工智能在有限程度上产生幻觉，从而产生预期的结果。我们必须问自己：生成人工智能中的幻觉可能不是一个错误，而是一个特性吗？
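
The abstract refers to "simple means of probability engineering" without naming them. One widely used knob of this kind is softmax temperature, shown below purely as an assumed example; the article itself may cover different or additional techniques, and the logits are invented.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn logits into probabilities; higher temperature flattens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical next-token scores
sharp = softmax_with_temperature(logits, 0.5)  # close to greedy decoding
flat = softmax_with_temperature(logits, 2.0)   # more room for unlikely tokens
```

A low temperature concentrates probability on the top-scoring token, while a high temperature gives less likely tokens a larger share, which is one way to allow a limited degree of "fantasy" during sampling.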

Title: Fine-Tuning vs. RAG for Multi-Hop Question Answering with Novel Knowledge

Authors: Zhuoyi Yang, Yurun Song, Iftekhar Ahmed, Ian Harris
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.07054
Pdf URL: https://arxiv.org/pdf/2601.07054
Copy Paste: [[2601.07054]] Fine-Tuning vs. RAG for Multi-Hop Question Answering with Novel Knowledge(https://arxiv.org/abs/2601.07054)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Multi-hop question answering is widely used to evaluate the reasoning capabilities of large language models (LLMs), as it requires integrating multiple pieces of supporting knowledge to arrive at a correct answer. While prior work has explored different mechanisms for providing knowledge to LLMs, such as finetuning and retrieval-augmented generation (RAG), their relative effectiveness for multi-hop question answering remains insufficiently understood, particularly when the required knowledge is temporally novel. In this paper, we systematically compare parametric and non-parametric knowledge injection methods for open-domain multi-hop question answering. We evaluate unsupervised fine-tuning (continual pretraining), supervised fine-tuning, and retrieval-augmented generation across three 7B-parameter open-source LLMs. Experiments are conducted on two benchmarks: QASC, a standard multi-hop science question answering dataset, and a newly constructed dataset of over 10,000 multi-hop questions derived from Wikipedia events in 2024, designed to test knowledge beyond the models' pretraining cutoff. Our results show that unsupervised fine-tuning provides only limited gains over base models, suggesting that continual pretraining alone is insufficient for improving multi-hop reasoning accuracy. In contrast, retrieval-augmented generation yields substantial and consistent improvements, particularly when answering questions that rely on temporally novel information. Supervised fine-tuning achieves the highest overall accuracy across models and datasets. These findings highlight fundamental differences in how knowledge injection mechanisms support multi-hop question answering and underscore the importance of retrieval-based methods when external or compositional knowledge is required.
摘要：多跳问答广泛用于评估大型语言模型（LLM）的推理能力，因为它需要整合多个支持知识才能得出正确答案。虽然之前的工作已经探索了为法学硕士提供知识的不同机制，例如微调和检索增强生成（RAG），但它们对于多跳问答的相对有效性仍然没有得到充分理解，特别是当所需知识暂时新颖时。在本文中，我们系统地比较了开放域多跳问答的参数化和非参数化知识注入方法。我们评估了三个 7B 参数开源 LLM 的无监督微调（持续预训练）、有监督微调和检索增强生成。实验在两个基准上进行：QASC，一个标准的多跳科学问答数据集，以及一个新构建的数据集，包含源自 2024 年维基百科事件的 10,000 多个多跳问题，旨在测试超出模型预训练截止值的知识。我们的结果表明，无监督微调仅比基本模型提供有限的增益，这表明仅持续预训练不足以提高多跳推理准确性。相比之下，检索增强生成产生了实质性且一致的改进，特别是在回答依赖于暂时新颖信息的问题时。有监督的微调可实现模型和数据集的最高整体准确性。这些发现强调了知识注入机制如何支持多跳问答的根本差异，并强调了在需要外部或组合知识时基于检索的方法的重要性。

Title: The Need for a Socially-Grounded Persona Framework for User Simulation

Authors: Pranav Narayanan Venkit, Yu Li, Yada Pruksachatkun, Chien-Sheng Wu
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2601.07110
Pdf URL: https://arxiv.org/pdf/2601.07110
Copy Paste: [[2601.07110]] The Need for a Socially-Grounded Persona Framework for User Simulation(https://arxiv.org/abs/2601.07110)
Keywords: language model, llm, prompt
Abstract: Synthetic personas are widely used to condition large language models (LLMs) for social simulation, yet most personas are still constructed from coarse sociodemographic attributes or summaries. We revisit persona creation by introducing SCOPE, a socially grounded framework for persona construction and evaluation, built from a 141-item, two-hour sociopsychological protocol collected from 124 U.S.-based participants. Across seven models, we find that demographic-only personas are a structural bottleneck: demographics explain only ~1.5% of variance in human response similarity. Adding sociopsychological facets improves behavioral prediction and reduces over-accentuation, and non-demographic personas based on values and identity achieve strong alignment with substantially lower bias. These trends generalize to SimBench (441 aligned questions), where SCOPE personas outperform default prompting and NVIDIA Nemotron personas, and SCOPE augmentation improves Nemotron-based personas. Our results indicate that persona quality depends on sociopsychological structure rather than demographic templates or summaries.
摘要：合成角色被广泛用于调节大型语言模型（LLM）以进行社会模拟，但大多数角色仍然是根据粗略的社会人口统计属性或摘要构建的。我们通过引入 SCOPE 来重新审视角色创建，SCOPE 是一个基于社会的角色构建和评估框架，根据从 124 名美国参与者收集的 141 项、时长两小时的社会心理学协议构建。在七个模型中，我们发现仅人口统计的角色是一个结构性瓶颈：人口统计只能解释人类反应相似性中约 1.5% 的方差。添加社会心理学方面可以改善行为预测并减少过度强调，并且基于价值观和身份的非人口角色可以实现强烈的一致性，并大大降低偏见。这些趋势推广到 SimBench（441 个一致的问题），其中 SCOPE 角色优于默认提示和 NVIDIA Nemotron 角色，并且 SCOPE 增强改进了基于 Nemotron 的角色。我们的结果表明，角色质量取决于社会心理结构，而不是人口统计模板或摘要。

Title: ReMIND: Orchestrating Modular Large Language Models for Controllable Serendipity A REM-Inspired System Design for Emergent Creative Ideation

Authors: Makoto Sato
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.07121
Pdf URL: https://arxiv.org/pdf/2601.07121
Copy Paste: [[2601.07121]] ReMIND: Orchestrating Modular Large Language Models for Controllable Serendipity A REM-Inspired System Design for Emergent Creative Ideation(https://arxiv.org/abs/2601.07121)
Keywords: language model, llm
Abstract: Large language models (LLMs) are used not only for problem solving but also for creative ideation; however, eliciting serendipitous insights that are both novel and internally coherent remains difficult. While stochastic sampling promotes novelty, it often degrades consistency. Here, we propose ReMIND, a REM-inspired modular framework for ideation. ReMIND consists of four stages: wake, which generates a stable low-temperature semantic baseline; dream, which performs high-temperature exploratory generation; judge, which applies coarse evaluation to filter incoherent outputs and extract candidate ideas; and re-wake, which re-articulates selected ideas into coherent final outputs. By instantiating each stage as an independent LLM, ReMIND enables functional separation between exploration and consolidation. Parameter sweeps show that ReMIND reliably induces semantic exploration while preserving downstream stability. Embedding-based analyses confirm substantial semantic displacement during the dream phase, whereas external evaluations reveal that high-quality ideas emerge sporadically rather than as extrema along any single metric. These results suggest that serendipitous ideation in LLMs is a rare-event process best approached through system level design that shapes the conditions under which valuable ideas can emerge and be stabilized. ReMIND provides a general framework for studying the computational basis of serendipity and illustrates how modular LLM orchestration can bridge exploration and stabilization.
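Summary: Large language models (LLMs) are used not only for problem solving but also for creative ideation; however, eliciting serendipitous insights that are both novel and internally coherent remains difficult. Although stochastic sampling promotes novelty, it often reduces consistency. Here we propose ReMIND, a REM-inspired modular framework for ideation. ReMIND consists of four stages: wake, which generates a stable low-temperature semantic baseline; dream, which performs high-temperature exploratory generation; judge, which applies coarse evaluation to filter out incoherent outputs and extract candidate ideas; and re-wake, which re-articulates the selected ideas into coherent final outputs. By instantiating each stage as an independent LLM, ReMIND achieves functional separation between exploration and consolidation. Parameter sweeps show that ReMIND reliably induces semantic exploration while preserving downstream stability. Embedding-based analyses confirm substantial semantic displacement during the dream phase, whereas external evaluations show that high-quality ideas appear sporadically rather than as extrema of any single metric. These results suggest that serendipitous ideation in LLMs is a rare-event process best approached through system-level design that shapes the conditions under which valuable ideas can emerge and stabilize. ReMIND provides a general framework for studying the computational basis of serendipity and illustrates how modular LLM orchestration can bridge exploration and stabilization.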
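
Illustrative sketch (not from the paper): the four stages named in the abstract can be read as a simple pipeline. The function names, the temperature values, the number of dream samples, and the choice to seed the dream stage with the wake output are all assumptions made for illustration. The abstract instantiates each stage as an independent LLM, whereas this sketch reuses one `generate` callable to stay short.

```python
from typing import Callable, List

def remind(prompt: str,
           generate: Callable[[str, float], str],
           judge: Callable[[str], bool],
           low_temp: float = 0.2,    # assumed value
           high_temp: float = 1.5,   # assumed value
           n_dreams: int = 4) -> List[str]:
    # wake: stable low-temperature semantic baseline
    baseline = generate(prompt, low_temp)
    # dream: high-temperature exploration, seeded by the baseline (assumption)
    dreams = [generate(baseline, high_temp) for _ in range(n_dreams)]
    # judge: coarse filter that keeps only candidate ideas
    candidates = [d for d in dreams if judge(d)]
    # re-wake: re-articulate each surviving idea at low temperature
    return [generate(c, low_temp) for c in candidates]

# Hypothetical stand-ins for real model calls, so the sketch runs offline.
def fake_generate(text: str, temperature: float) -> str:
    return f"{text} -> t={temperature}"

def fake_judge(text: str) -> bool:
    return "t=1.5" in text

ideas = remind("seed idea", fake_generate, fake_judge)
```

Passing the model calls in as parameters keeps exploration and consolidation separable, so each stage could be swapped for a different model without touching the control flow.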

Title: Measuring Iterative Temporal Reasoning with TimePuzzles

Authors: Zhengxiang Wang, Zeyu Dong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.07148
Pdf URL: https://arxiv.org/pdf/2601.07148
Copy Paste: [[2601.07148]] Measuring Iterative Temporal Reasoning with TimePuzzles(https://arxiv.org/abs/2601.07148)
Keywords: gpt, llm
Abstract: We introduce TimePuzzles, a constraint-based date inference task for evaluating iterative temporal reasoning. Each puzzle combines factual temporal anchors with (cross-cultural) calendar relations, admits one or multiple valid solution dates, and is algorithmically generated for controlled, dynamic, and continual evaluation. Across 13 diverse LLMs, TimePuzzles clearly distinguishes their iterative temporal reasoning capabilities and remains challenging without tools: GPT-5 reaches only 49.3% accuracy and all other models stay below 31%, despite the dataset's simplicity. Web search consistently yields substantial gains and using a code interpreter shows mixed effects, but all models perform much better when constraints are rewritten with explicit dates, revealing a gap in reliable tool use. Overall, TimePuzzles presents a simple, cost-effective diagnostic for tool-augmented iterative temporal reasoning.
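Summary: We introduce TimePuzzles, a constraint-based date inference task for evaluating iterative temporal reasoning. Each puzzle combines factual temporal anchors with (cross-cultural) calendar relations, admits one or more valid solution dates, and is generated algorithmically for controlled, dynamic, and continual evaluation. Across 13 different LLMs, TimePuzzles clearly distinguishes their iterative temporal reasoning abilities and remains challenging without tools: GPT-5 reaches only 49.3% accuracy and all other models stay below 31%, even though the dataset is simple. Web search consistently brings substantial gains and using a code interpreter shows mixed effects, but all models perform much better when the constraints are rewritten with explicit dates, revealing a gap in reliable tool use. Overall, TimePuzzles offers a simple, cost-effective diagnostic for tool-augmented iterative temporal reasoning.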
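
Illustrative sketch (not from the paper): a constraint-based date puzzle of the kind the abstract describes can be checked by brute force once every constraint is written as a predicate over explicit dates. The puzzle below (a Friday the 13th in 1969, the year of the first crewed Moon landing) and the `solve` helper are hypothetical examples, not items from the TimePuzzles dataset.

```python
from datetime import date, timedelta

def solve(constraints, start, end):
    """Return every date in [start, end] that satisfies all constraints."""
    solutions = []
    day = start
    while day <= end:
        if all(check(day) for check in constraints):
            solutions.append(day)
        day += timedelta(days=1)
    return solutions

# Hypothetical puzzle: the factual anchor fixes the year, the calendar
# relations fix the weekday and the day of the month.
constraints = [
    lambda d: d.year == 1969,      # year of the first crewed Moon landing
    lambda d: d.weekday() == 4,    # Monday is 0, so 4 means Friday
    lambda d: d.day == 13,
]

solutions = solve(constraints, date(1960, 1, 1), date(1979, 12, 31))
print(solutions)  # [datetime.date(1969, 6, 13)]
```

The search itself is trivial; the hard part for a model is resolving anchors such as "the year of the Moon landing" into explicit dates, which is consistent with the abstract's finding that accuracy rises when constraints are rewritten with explicit dates.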

Title: Can Large Language Models Understand, Reason About, and Generate Code-Switched Text?

Authors: Genta Indra Winata, David Anugraha, Patrick Amadeus Irawan, Anirban Das, Haneul Yoo, Paresh Dashore, Shreyas Kulkarni, Ruochen Zhang, Haruki Sakajo, Frederikus Hudi, Anaelia Ovalle, Syrielle Montariol, Felix Gaschi, Michael Anugraha, Rutuj Ravindra Puranik, Zawad Hayat Ahmed, Adril Putra Merin, Emmanuele Chersoni
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.07153
Pdf URL: https://arxiv.org/pdf/2601.07153
Copy Paste: [[2601.07153]] Can Large Language Models Understand, Reason About, and Generate Code-Switched Text?(https://arxiv.org/abs/2601.07153)
Keywords: language model, llm
Abstract: Code-switching is a pervasive phenomenon in multilingual communication, yet the robustness of large language models (LLMs) in mixed-language settings remains insufficiently understood. In this work, we present a comprehensive evaluation of LLM capabilities in understanding, reasoning over, and generating code-switched text. We introduce CodeMixQA, a novel benchmark with high-quality human annotations, comprising 16 diverse parallel code-switched language-pair variants that span multiple geographic regions and code-switching patterns and include both original scripts and their transliterated forms. Using this benchmark, we analyze the reasoning behavior of LLMs on code-switched question-answering tasks, shedding light on how models process and reason over mixed-language inputs. We further conduct a systematic evaluation of LLM-generated synthetic code-switched text, focusing on both naturalness and semantic fidelity, and uncover key limitations in current generation capabilities. Our findings reveal persistent challenges in both reasoning and generation under code-switching conditions and provide actionable insights for building more robust multilingual LLMs. We release the dataset and code as open source.
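Summary: Code-switching is pervasive in multilingual communication, yet the robustness of large language models (LLMs) in mixed-language settings is still not well understood. In this work, we comprehensively evaluate the ability of LLMs to understand, reason over, and generate code-switched text. We introduce CodeMixQA, a novel benchmark with high-quality human annotations that contains 16 different parallel code-switched language-pair variants spanning multiple geographic regions and code-switching patterns, and that includes both original scripts and their transliterated forms. Using this benchmark, we analyze the reasoning behavior of LLMs on code-switched question-answering tasks, revealing how models process and reason over mixed-language input. We further systematically evaluate LLM-generated synthetic code-switched text, focusing on naturalness and semantic fidelity, and uncover key limitations of current generation capabilities. Our findings reveal persistent challenges in reasoning and generation under code-switching conditions and provide actionable insights for building more robust multilingual LLMs. We release the dataset and code as open source.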

Title: Structured Reasoning for Large Language Models

Authors: Jinyi Han, Zixiang Di, Zishang Jiang, Ying Liao, Jiaqing Liang, Yongqi Wang, Yanghua Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.07180
Pdf URL: https://arxiv.org/pdf/2601.07180
Copy Paste: [[2601.07180]] Structured Reasoning for Large Language Models(https://arxiv.org/abs/2601.07180)
Keywords: language model, llm
Abstract: Large language models (LLMs) achieve strong performance by generating long chains of thought, but longer traces always introduce redundant or ineffective reasoning steps. One typical behavior is that they often perform unnecessary verification and revisions even if they have reached the correct answers. This limitation stems from the unstructured nature of reasoning trajectories and the lack of targeted supervision for critical reasoning abilities. To address this, we propose Structured Reasoning (SCR), a framework that decouples reasoning trajectories into explicit, evaluable, and trainable components. We mainly implement SCR using a Generate-Verify-Revise paradigm. Specifically, we construct structured training data and apply Dynamic Termination Supervision to guide the model in deciding when to terminate reasoning. To avoid interference between learning signals for different reasoning abilities, we adopt a progressive two-stage reinforcement learning strategy: the first stage targets initial generation and self-verification, and the second stage focuses on revision. Extensive experiments on three backbone models show that SCR substantially improves reasoning efficiency and self-verification. Besides, compared with existing reasoning paradigms, it reduces output token length by up to 50%.
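Summary: Large language models (LLMs) achieve strong performance by generating long chains of thought, but longer traces always introduce redundant or ineffective reasoning steps. A typical behavior is that they often perform unnecessary verification and revision even after reaching the correct answer. This limitation stems from the unstructured nature of reasoning trajectories and the lack of targeted supervision for critical reasoning abilities. To address this, we propose Structured Reasoning (SCR), a framework that decouples reasoning trajectories into explicit, evaluable, and trainable components. We mainly implement SCR with a Generate-Verify-Revise paradigm. Specifically, we construct structured training data and apply Dynamic Termination Supervision to guide the model in deciding when to stop reasoning. To avoid interference between the learning signals for different reasoning abilities, we adopt a progressive two-stage reinforcement learning strategy: the first stage targets initial generation and self-verification, and the second stage focuses on revision. Extensive experiments on three backbone models show that SCR substantially improves reasoning efficiency and self-verification. In addition, compared with existing reasoning paradigms, it reduces output token length by up to 50%.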

Title: Relink: Constructing Query-Driven Evidence Graph On-the-Fly for GraphRAG

Authors: Manzong Huang, Chenyang Bu, Yi He, Xingrui Zhuo, Xindong Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.07192
Pdf URL: https://arxiv.org/pdf/2601.07192
Copy Paste: [[2601.07192]] Relink: Constructing Query-Driven Evidence Graph On-the-Fly for GraphRAG(https://arxiv.org/abs/2601.07192)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Graph-based Retrieval-Augmented Generation (GraphRAG) mitigates hallucinations in Large Language Models (LLMs) by grounding them in structured knowledge. However, current GraphRAG methods are constrained by a prevailing build-then-reason paradigm, which relies on a static, pre-constructed Knowledge Graph (KG). This paradigm faces two critical challenges. First, the KG's inherent incompleteness often breaks reasoning paths. Second, the graph's low signal-to-noise ratio introduces distractor facts, presenting query-relevant but misleading knowledge that disrupts the reasoning process. To address these challenges, we argue for a reason-and-construct paradigm and propose Relink, a framework that dynamically builds a query-specific evidence graph. To tackle incompleteness, Relink instantiates required facts from a latent relation pool derived from the original text corpus, repairing broken paths on the fly. To handle misleading or distractor facts, Relink employs a unified, query-aware evaluation strategy that jointly considers candidates from both the KG and latent relations, selecting those most useful for answering the query rather than relying on their pre-existence. This empowers Relink to actively discard distractor facts and construct the most faithful and precise evidence path for each query. Extensive experiments on five Open-Domain Question Answering benchmarks show that Relink achieves significant average improvements of 5.4% in EM and 5.2% in F1 over leading GraphRAG baselines, demonstrating the superiority of our proposed framework.
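Summary: Graph-based Retrieval-Augmented Generation (GraphRAG) reduces hallucinations in Large Language Models (LLMs) by grounding them in structured knowledge. However, current GraphRAG methods are limited by the prevailing build-then-reason paradigm, which relies on a static, pre-constructed Knowledge Graph (KG). This paradigm faces two key challenges. First, the inherent incompleteness of the KG often breaks reasoning paths. Second, the graph's low signal-to-noise ratio introduces distractor facts, presenting knowledge that is relevant to the query but misleading, which disrupts the reasoning process. To address these challenges, we argue for a reason-and-construct paradigm and propose Relink, a framework that dynamically builds a query-specific evidence graph. To tackle incompleteness, Relink instantiates the required facts from a latent relation pool derived from the original text corpus, repairing broken paths on the fly. To handle misleading or distractor facts, Relink uses a unified, query-aware evaluation strategy that jointly considers candidates from both the KG and the latent relations, selecting those most useful for answering the query rather than relying on their pre-existence. This allows Relink to actively discard distractor facts and build the most faithful and precise evidence path for each query. Extensive experiments on five open-domain question answering benchmarks show that Relink achieves average improvements of 5.4% in EM and 5.2% in F1 over leading GraphRAG baselines, demonstrating the superiority of the proposed framework.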

Title: MI-PRUN: Optimize Large Language Model Pruning via Mutual Information

Authors: Hao Zhang, Zhibin Zhang, Guangxin Wu, He Chen, Jiafeng Guo, Xueqi Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.07212
Pdf URL: https://arxiv.org/pdf/2601.07212
Copy Paste: [[2601.07212]] MI-PRUN: Optimize Large Language Model Pruning via Mutual Information(https://arxiv.org/abs/2601.07212)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have become indispensable across various domains, but this comes at the cost of substantial computational and memory resources. Model pruning addresses this by removing redundant components from models. In particular, block pruning can achieve significant compression and inference acceleration. However, existing block pruning methods are often unstable and struggle to attain globally optimal solutions. In this paper, we propose MI-PRUN, a mutual-information-based pruning method for LLMs. Specifically, we leverage mutual information to identify redundant blocks by evaluating transitions in hidden states. Additionally, we incorporate the Data Processing Inequality (DPI) to reveal the relationship between the importance of entire contiguous blocks and that of individual blocks. Moreover, we develop the Fast-Block-Select algorithm, which iteratively updates block combinations to achieve a globally optimal solution while significantly improving efficiency. Extensive experiments across various models and datasets demonstrate the stability and effectiveness of our method.
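Summary: Large Language Models (LLMs) have become indispensable in many domains, but at the cost of substantial computational and memory resources. Model pruning addresses this by removing redundant components from the model. In particular, block pruning can achieve significant compression and inference acceleration. However, existing block pruning methods are often unstable and struggle to reach globally optimal solutions. In this paper, we propose MI-PRUN, a mutual-information-based pruning method for LLMs. Specifically, we use mutual information to identify redundant blocks by evaluating transitions in hidden states. In addition, we incorporate the Data Processing Inequality (DPI) to reveal the relationship between the importance of an entire run of contiguous blocks and that of individual blocks. We also develop the Fast-Block-Select algorithm, which iteratively updates block combinations to reach a globally optimal solution while significantly improving efficiency. Extensive experiments across various models and datasets demonstrate the stability and effectiveness of our method.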

Title: The Roots of Performance Disparity in Multilingual Language Models: Intrinsic Modeling Difficulty or Design Choices?

Authors: Chen Shani, Yuval Reif, Nathan Roll, Dan Jurafsky, Ekaterina Shutova
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.07220
Pdf URL: https://arxiv.org/pdf/2601.07220
Copy Paste: [[2601.07220]] The Roots of Performance Disparity in Multilingual Language Models: Intrinsic Modeling Difficulty or Design Choices?(https://arxiv.org/abs/2601.07220)
Keywords: language model
Abstract: Multilingual language models (LMs) promise broader NLP access, yet current systems deliver uneven performance across the world's languages. This survey examines why these gaps persist and whether they reflect intrinsic linguistic difficulty or modeling artifacts. We organize the literature around two questions: do linguistic disparities arise from representation and allocation choices (e.g., tokenization, encoding, data exposure, parameter sharing) rather than inherent complexity; and which design choices mitigate inequities across typologically diverse languages. We review linguistic features, such as orthography, morphology, lexical diversity, syntax, information density, and typological distance, linking each to concrete modeling mechanisms. Gaps often shrink when segmentation, encoding, and data exposure are normalized, suggesting much apparent difficulty stems from current modeling choices. We synthesize these insights into design recommendations for tokenization, sampling, architectures, and evaluation to support more balanced multilingual LMs.
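Summary: Multilingual language models (LMs) promise broader access to NLP, yet current systems perform unevenly across the world's languages. This survey examines why these gaps persist and whether they reflect intrinsic linguistic difficulty or modeling artifacts. We organize the literature around two questions: whether linguistic disparities arise from representation and allocation choices (e.g., tokenization, encoding, data exposure, parameter sharing) rather than inherent complexity, and which design choices mitigate inequities across typologically diverse languages. We review linguistic features such as orthography, morphology, lexical diversity, syntax, information density, and typological distance, linking each to concrete modeling mechanisms. Gaps often shrink when segmentation, encoding, and data exposure are normalized, suggesting that much of the apparent difficulty stems from current modeling choices. We synthesize these insights into design recommendations for tokenization, sampling, architectures, and evaluation to support more balanced multilingual LMs.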

Title: ActiShade: Activating Overshadowed Knowledge to Guide Multi-Hop Reasoning in Large Language Models

Authors: Huipeng Ma, Luan Zhang, Dandan Song, Linmei Hu, Yuhang Tian, Jun Yang, Changzhi Zhou, Chenhao Li, Yizhou Jin, Xudong Li, Meng Lin, Mingxing Zhang, Shuhao Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.07260
Pdf URL: https://arxiv.org/pdf/2601.07260
Copy Paste: [[2601.07260]] ActiShade: Activating Overshadowed Knowledge to Guide Multi-Hop Reasoning in Large Language Models(https://arxiv.org/abs/2601.07260)
Keywords: language model, llm, retrieval-augmented generation
Abstract: In multi-hop reasoning, multi-round retrieval-augmented generation (RAG) methods typically rely on LLM-generated content as the retrieval query. However, these approaches are inherently vulnerable to knowledge overshadowing - a phenomenon where critical information is overshadowed during generation. As a result, the LLM-generated content may be incomplete or inaccurate, leading to irrelevant retrieval and causing error accumulation during the iteration process. To address this challenge, we propose ActiShade, which detects and activates overshadowed knowledge to guide large language models (LLMs) in multi-hop reasoning. Specifically, ActiShade iteratively detects the overshadowed keyphrase in the given query, retrieves documents relevant to both the query and the overshadowed keyphrase, and generates a new query based on the retrieved documents to guide the next-round iteration. By supplementing the overshadowed knowledge during the formulation of next-round queries while minimizing the introduction of irrelevant noise, ActiShade reduces the error accumulation caused by knowledge overshadowing. Extensive experiments show that ActiShade outperforms existing methods across multiple datasets and LLMs.
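Summary: In multi-hop reasoning, multi-round retrieval-augmented generation (RAG) methods typically rely on LLM-generated content as the retrieval query. However, these approaches are inherently vulnerable to knowledge overshadowing, a phenomenon in which critical information is overshadowed during generation. As a result, the LLM-generated content may be incomplete or inaccurate, leading to irrelevant retrieval and error accumulation over the iterations. To address this challenge, we propose ActiShade, which detects and activates overshadowed knowledge to guide large language models (LLMs) in multi-hop reasoning. Specifically, ActiShade iteratively detects the overshadowed keyphrase in the given query, retrieves documents relevant to both the query and the overshadowed keyphrase, and generates a new query based on the retrieved documents to guide the next iteration. By supplementing the overshadowed knowledge when formulating the next-round query while minimizing the introduction of irrelevant noise, ActiShade reduces the error accumulation caused by knowledge overshadowing. Extensive experiments show that ActiShade outperforms existing methods across multiple datasets and LLMs.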

Title: The Confidence Dichotomy: Analyzing and Mitigating Miscalibration in Tool-Use Agents

Authors: Weihao Xuan, Qingcheng Zeng, Heli Qi, Yunze Xiao, Junjue Wang, Naoto Yokoya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.07264
Pdf URL: https://arxiv.org/pdf/2601.07264
Copy Paste: [[2601.07264]] The Confidence Dichotomy: Analyzing and Mitigating Miscalibration in Tool-Use Agents(https://arxiv.org/abs/2601.07264)
Keywords: language model, llm, agent
Abstract: Autonomous agents based on large language models (LLMs) are rapidly evolving to handle multi-turn tasks, but ensuring their trustworthiness remains a critical challenge. A fundamental pillar of this trustworthiness is calibration, which refers to an agent's ability to express confidence that reliably reflects its actual performance. While calibration is well-established for static models, its dynamics in tool-integrated agentic workflows remain underexplored. In this work, we systematically investigate verbalized calibration in tool-use agents, revealing a fundamental confidence dichotomy driven by tool type. Specifically, our pilot study identifies that evidence tools (e.g., web search) systematically induce severe overconfidence due to inherent noise in retrieved information, while verification tools (e.g., code interpreters) can ground reasoning through deterministic feedback and mitigate miscalibration. To robustly improve calibration across tool types, we propose a reinforcement learning (RL) fine-tuning framework that jointly optimizes task accuracy and calibration, supported by a holistic benchmark of reward designs. We demonstrate that our trained agents not only achieve superior calibration but also exhibit robust generalization from local training environments to noisy web settings and to distinct domains such as mathematical reasoning. Our results highlight the necessity of domain-specific calibration strategies for tool-use agents. More broadly, this work establishes a foundation for building self-aware agents that can reliably communicate uncertainty in high-stakes, real-world deployments.
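Summary: Autonomous agents based on large language models (LLMs) are evolving rapidly to handle multi-turn tasks, but ensuring their trustworthiness remains a critical challenge. A fundamental pillar of this trustworthiness is calibration, meaning an agent's ability to express confidence that reliably reflects its actual performance. While calibration is well established for static models, its dynamics in tool-integrated agentic workflows remain underexplored. In this work, we systematically study verbalized calibration in tool-use agents and reveal a fundamental confidence dichotomy driven by tool type. Specifically, our pilot study finds that evidence tools (e.g., web search) systematically induce severe overconfidence because of the inherent noise in retrieved information, whereas verification tools (e.g., code interpreters) can ground reasoning through deterministic feedback and mitigate miscalibration. To robustly improve calibration across tool types, we propose a reinforcement learning (RL) fine-tuning framework that jointly optimizes task accuracy and calibration, supported by a holistic benchmark of reward designs. We show that our trained agents not only achieve superior calibration but also generalize robustly from local training environments to noisy web settings and to distinct domains such as mathematical reasoning. Our results highlight the need for domain-specific calibration strategies for tool-use agents. More broadly, this work lays a foundation for building self-aware agents that can reliably communicate uncertainty in high-stakes, real-world deployments.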
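
Illustrative sketch (not from the paper): the abstract defines calibration as confidence that reliably reflects actual performance. One common way to quantify the mismatch is Expected Calibration Error (ECE) with equal-width confidence bins. Whether the paper reports ECE is not stated in the abstract, so the metric choice, the function name, and the bin count are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: the weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin rather than overflowing.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(avg_conf - accuracy)
    return ece

# Hypothetical agent that states 90% confidence but is right 3 times out of 4.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                           [True, True, True, False])
```

Here `overconfident` is about 0.15, the gap between the stated confidence (0.90) and the observed accuracy (0.75). A perfectly calibrated agent would score 0.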

Title: Document-Level Zero-Shot Relation Extraction with Entity Side Information

Authors: Mohan Raj Chanthran, Soon Lay Ki, Ong Huey Fang, Bhawani Selvaretnam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.07271
Pdf URL: https://arxiv.org/pdf/2601.07271
Copy Paste: [[2601.07271]] Document-Level Zero-Shot Relation Extraction with Entity Side Information(https://arxiv.org/abs/2601.07271)
Keywords: language model, llm
Abstract: Document-Level Zero-Shot Relation Extraction (DocZSRE) aims to predict unseen relation labels in text documents without prior training on specific relations. Existing approaches rely on Large Language Models (LLMs) to generate synthetic data for unseen labels, which poses challenges for low-resource languages like Malaysian English. These challenges include the incorporation of local linguistic nuances and the risk of factual inaccuracies in LLM-generated data. This paper introduces Document-Level Zero-Shot Relation Extraction with Entity Side Information (DocZSRE-SI) to address limitations in the existing DocZSRE approach. The DocZSRE-SI framework leverages Entity Side Information, such as Entity Mention Descriptions and Entity Mention Hypernyms, to perform ZSRE without depending on LLM-generated synthetic data. The proposed low-complexity model achieves an average improvement of 11.6% in the macro F1-Score compared to baseline models and existing benchmarks. By utilizing Entity Side Information, DocZSRE-SI offers a robust and efficient alternative to error-prone, LLM-based methods, demonstrating significant advancements in handling low-resource languages and linguistic diversity in relation extraction tasks. This research provides a scalable and reliable solution for ZSRE, particularly in contexts like Malaysian English news articles, where traditional LLM-based approaches fall short.
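Summary: Document-Level Zero-Shot Relation Extraction (DocZSRE) aims to predict unseen relation labels in text documents without prior training on the specific relations. Existing approaches rely on Large Language Models (LLMs) to generate synthetic data for unseen labels, which poses challenges for low-resource languages such as Malaysian English. These challenges include incorporating local linguistic nuances and the risk of factual inaccuracies in LLM-generated data. This paper introduces Document-Level Zero-Shot Relation Extraction with Entity Side Information (DocZSRE-SI) to address the limitations of the existing DocZSRE approach. The DocZSRE-SI framework uses Entity Side Information, such as Entity Mention Descriptions and Entity Mention Hypernyms, to perform ZSRE without depending on LLM-generated synthetic data. The proposed low-complexity model achieves an average improvement of 11.6% in macro F1-Score over baseline models and existing benchmarks. By using Entity Side Information, DocZSRE-SI offers a robust and efficient alternative to error-prone LLM-based methods, demonstrating significant progress in handling low-resource languages and linguistic diversity in relation extraction tasks. This research provides a scalable and reliable solution for ZSRE, particularly in contexts such as Malaysian English news articles, where traditional LLM-based approaches fall short.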

Title: Towards Comprehensive Semantic Speech Embeddings for Chinese Dialects

Authors: Kalvin Chang, Yiwen Shao, Jiahong Li, Dong Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.07274
Pdf URL: https://arxiv.org/pdf/2601.07274
Copy Paste: [[2601.07274]] Towards Comprehensive Semantic Speech Embeddings for Chinese Dialects(https://arxiv.org/abs/2601.07274)
Keywords: language model, llm
Abstract: Despite having hundreds of millions of speakers, Chinese dialects lag behind Mandarin in speech and language technologies. Most varieties are primarily spoken, making dialect-to-Mandarin speech-LLMs (large language models) more practical than dialect LLMs. Building dialect-to-Mandarin speech-LLMs requires speech representations with cross-dialect semantic alignment between Chinese dialects and Mandarin. In this paper, we achieve such a cross-dialect semantic alignment by training a speech encoder with ASR (automatic speech recognition)-only data, as demonstrated by speech-to-speech retrieval on a new benchmark of spoken Chinese varieties that we contribute. Our speech encoder further demonstrates state-of-the-art ASR performance on Chinese dialects. Together, our Chinese dialect benchmark, semantically aligned speech representations, and speech-to-speech retrieval evaluation lay the groundwork for future Chinese dialect speech-LLMs. We release the benchmark at this https URL.
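Summary: Despite having hundreds of millions of speakers, Chinese dialects lag behind Mandarin in speech and language technologies. Most varieties are primarily spoken, which makes dialect-to-Mandarin speech-LLMs (large language models) more practical than dialect LLMs. Building dialect-to-Mandarin speech-LLMs requires speech representations with cross-dialect semantic alignment between Chinese dialects and Mandarin. In this paper, we achieve such cross-dialect semantic alignment by training a speech encoder on ASR (automatic speech recognition) data only, as demonstrated by speech-to-speech retrieval on a new benchmark of spoken Chinese varieties that we contribute. Our speech encoder also demonstrates state-of-the-art ASR performance on Chinese dialects. Together, our Chinese dialect benchmark, semantically aligned speech representations, and speech-to-speech retrieval evaluation lay the groundwork for future Chinese dialect speech-LLMs. We release the benchmark at this https URL.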

Title: ReasonTabQA: A Comprehensive Benchmark for Table Question Answering from Real World Industrial Scenarios

Authors: Changzai Pan, Jie Zhang, Kaiwen Wei, Chenshuo Pan, Yu Zhao, Jingwang Huang, Jian Yang, Zhenhe Wu, Haoyang Zeng, Xiaoyan Gu, Weichao Sun, Yanbo Zhai, Yujie Mao, Zhuoru Jiang, Jiang Zhong, Shuangyong Song, Yongxiang Li, Zhongjiang He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.07280
Pdf URL: https://arxiv.org/pdf/2601.07280
Copy Paste: [[2601.07280]] ReasonTabQA: A Comprehensive Benchmark for Table Question Answering from Real World Industrial Scenarios(https://arxiv.org/abs/2601.07280)
Keywords: language model, llm
Abstract: Recent advancements in Large Language Models (LLMs) have significantly catalyzed table-based question answering (TableQA). However, existing TableQA benchmarks often overlook the intricacies of industrial scenarios, which are characterized by multi-table structures, nested headers, and massive scales. These environments demand robust table reasoning through deep structured inference, presenting a significant challenge that remains inadequately addressed by current methodologies. To bridge this gap, we present ReasonTabQA, a large-scale bilingual benchmark encompassing 1,932 tables across 30 industry domains such as energy and automotive. ReasonTabQA provides high-quality annotations for both final answers and explicit reasoning chains, supporting both thinking and no-thinking paradigms. Furthermore, we introduce TabCodeRL, a reinforcement learning method that leverages table-aware verifiable rewards to guide the generation of logical reasoning paths. Extensive experiments on ReasonTabQA and 4 TableQA datasets demonstrate that while TabCodeRL yields substantial performance gains on open-source LLMs, the persistent performance gap on ReasonTabQA underscores the inherent complexity of real-world industrial TableQA.
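Summary: Recent advances in Large Language Models (LLMs) have greatly accelerated table-based question answering (TableQA). However, existing TableQA benchmarks often overlook the intricacies of industrial scenarios, which are characterized by multi-table structures, nested headers, and massive scale. These environments demand robust table reasoning through deep structured inference, posing a significant challenge that current methods do not adequately address. To bridge this gap, we present ReasonTabQA, a large-scale bilingual benchmark covering 1,932 tables across 30 industry domains such as energy and automotive. ReasonTabQA provides high-quality annotations for both final answers and explicit reasoning chains, supporting both thinking and no-thinking paradigms. Furthermore, we introduce TabCodeRL, a reinforcement learning method that uses table-aware verifiable rewards to guide the generation of logical reasoning paths. Extensive experiments on ReasonTabQA and 4 TableQA datasets show that although TabCodeRL yields substantial performance gains on open-source LLMs, the persistent performance gap on ReasonTabQA underscores the inherent complexity of real-world industrial TableQA.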

Title: PsyCLIENT: Client Simulation via Conversational Trajectory Modeling for Trainee Practice and Model Evaluation in Mental Health Counseling

Authors: Huachuan Qiu, Zhaoming Chen, Yuqian Chen, Yuan Xie, Yu Lu, Zhenzhong Lan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.07312
Pdf URL: https://arxiv.org/pdf/2601.07312
Copy Paste: [[2601.07312]] PsyCLIENT: Client Simulation via Conversational Trajectory Modeling for Trainee Practice and Model Evaluation in Mental Health Counseling(https://arxiv.org/abs/2601.07312)
Keywords: llm
Abstract: LLM-based client simulation has emerged as a promising tool for training novice counselors and evaluating automated counseling systems. However, existing client simulation approaches face three key challenges: (1) limited diversity and realism in client profiles, (2) the lack of a principled framework for modeling realistic client behaviors, and (3) a scarcity in Chinese-language settings. To address these limitations, we propose PsyCLIENT, a novel simulation framework grounded in conversational trajectory modeling. By conditioning LLM generation on predefined real-world trajectories that incorporate explicit behavior labels and content constraints, our approach ensures diverse and realistic interactions. We further introduce PsyCLIENT-CP, the first open-source Chinese client profile dataset, covering 60 distinct counseling topics. Comprehensive evaluations involving licensed professional counselors demonstrate that PsyCLIENT significantly outperforms baselines in terms of authenticity and training effectiveness. Notably, the simulated clients are nearly indistinguishable from human clients, achieving an expert confusion rate of about 95% in discrimination tasks. These findings indicate that conversational trajectory modeling effectively bridges the gap between theoretical client profiles and dynamic, realistic simulations, offering a robust solution for mental health education and research. Code and data will be released to facilitate future research in mental health counseling.
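Summary: LLM-based client simulation has become a promising tool for training novice counselors and evaluating automated counseling systems. However, existing client simulation approaches face three key challenges: (1) limited diversity and realism in client profiles, (2) the lack of a principled framework for modeling realistic client behavior, and (3) scarcity in Chinese-language settings. To address these limitations, we propose PsyCLIENT, a novel simulation framework based on conversational trajectory modeling. By conditioning LLM generation on predefined real-world trajectories that include explicit behavior labels and content constraints, our approach ensures diverse and realistic interactions. We further introduce PsyCLIENT-CP, the first open-source Chinese client profile dataset, covering 60 distinct counseling topics. Comprehensive evaluations involving licensed professional counselors show that PsyCLIENT significantly outperforms baselines in authenticity and training effectiveness. Notably, the simulated clients are nearly indistinguishable from human clients, reaching an expert confusion rate of about 95% in discrimination tasks. These findings indicate that conversational trajectory modeling effectively bridges the gap between theoretical client profiles and dynamic, realistic simulations, offering a robust solution for mental health education and research. Code and data will be released to facilitate future research in mental health counseling.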

Title: BayesRAG: Probabilistic Mutual Evidence Corroboration for Multimodal Retrieval-Augmented Generation

Authors: Xuan Li, Yining Wang, Haocai Luo, Shengping Liu, Jerry Liang, Ying Fu, Weihuang, Jun Yu, Junnan Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.07329
Pdf URL: https://arxiv.org/pdf/2601.07329
Copy Paste: [[2601.07329]] BayesRAG: Probabilistic Mutual Evidence Corroboration for Multimodal Retrieval-Augmented Generation(https://arxiv.org/abs/2601.07329)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) has become a pivotal paradigm for Large Language Models (LLMs), yet current approaches struggle with visually rich documents by treating text and images as isolated retrieval targets. Existing methods relying solely on cosine similarity often fail to capture the semantic reinforcement provided by cross-modal alignment and layout-induced coherence. To address these limitations, we propose BayesRAG, a novel multimodal retrieval framework grounded in Bayesian inference and Dempster-Shafer evidence theory. Unlike traditional approaches that rank candidates strictly by similarity, BayesRAG models the intrinsic consistency of retrieved candidates across modalities as probabilistic evidence to refine retrieval confidence. Specifically, our method computes the posterior association probability for combinations of multimodal retrieval results, prioritizing text-image pairs that mutually corroborate each other in terms of both semantics and layout. Extensive experiments demonstrate that BayesRAG significantly outperforms state-of-the-art (SOTA) methods on challenging multimodal benchmarks. This study establishes a new paradigm for multimodal retrieval fusion that effectively resolves the isolation of heterogeneous modalities through an evidence fusion mechanism and enhances the robustness of retrieval outcomes. Our code is available at this https URL.
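Summary: Retrieval-Augmented Generation (RAG) has become a pivotal paradigm for Large Language Models (LLMs), yet current approaches struggle with visually rich documents because they treat text and images as isolated retrieval targets. Existing methods that rely solely on cosine similarity often fail to capture the semantic reinforcement provided by cross-modal alignment and layout-induced coherence. To address these limitations, we propose BayesRAG, a novel multimodal retrieval framework grounded in Bayesian inference and Dempster-Shafer evidence theory. Unlike traditional approaches that rank candidates strictly by similarity, BayesRAG models the intrinsic consistency of retrieved candidates across modalities as probabilistic evidence to refine retrieval confidence. Specifically, our method computes the posterior association probability for combinations of multimodal retrieval results, prioritizing text-image pairs that corroborate each other in both semantics and layout. Extensive experiments show that BayesRAG significantly outperforms state-of-the-art (SOTA) methods on challenging multimodal benchmarks. This study establishes a new paradigm for multimodal retrieval fusion that resolves the isolation of heterogeneous modalities through an evidence fusion mechanism and improves the robustness of retrieval results. Our code is available at this https URL.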
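
Illustrative sketch (not from the paper): the abstract names Dempster-Shafer evidence theory but does not say how BayesRAG applies it. The block below shows only the textbook Dempster combination rule for two mass functions over a hypothetical two-element frame {relevant, irrelevant}. The mass values and the idea of treating text and image retrievers as the two evidence sources are assumptions.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions (dicts: frozenset -> mass) with Dempster's rule."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb          # mass assigned to the empty set
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    # Renormalize by the non-conflicting mass.
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

R = frozenset({"relevant"})
I = frozenset({"irrelevant"})
U = R | I                                 # "unsure": mass on the whole frame

text_evidence = {R: 0.6, U: 0.4}          # hypothetical text-retriever belief
image_evidence = {R: 0.5, I: 0.2, U: 0.3} # hypothetical image-retriever belief

fused = dempster_combine(text_evidence, image_evidence)
```

In this example the fused mass on "relevant" (about 0.77) exceeds what either source assigned alone, which is the kind of mutual corroboration the abstract describes.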

Title: Beyond Literal Mapping: Benchmarking and Improving Non-Literal Translation Evaluation

Authors: Yanzhi Tian, Cunxiang Wang, Zeming Liu, Heyan Huang, Wenbo Yu, Dawei Song, Jie Tang, Yuhang Guo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.07338
Pdf URL: https://arxiv.org/pdf/2601.07338
Copy Paste: [[2601.07338]] Beyond Literal Mapping: Benchmarking and Improving Non-Literal Translation Evaluation(https://arxiv.org/abs/2601.07338)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) have significantly advanced Machine Translation (MT) and are now applied to linguistically complex domains such as Social Network Services and literature. In these scenarios, translations often require handling non-literal expressions, which makes MT metrics inaccurate. To systematically investigate the reliability of MT metrics, we first curate a meta-evaluation dataset focused on non-literal translations, namely MENT. MENT encompasses four non-literal translation domains and features source sentences paired with translations from diverse MT systems, with 7,530 human-annotated scores on translation quality. Experimental results reveal the inaccuracies of traditional MT metrics and the limitations of LLM-as-a-Judge, particularly the knowledge cutoff and score inconsistency problems. To mitigate these limitations, we propose RATE, a novel agentic translation evaluation framework centered on a reflective Core Agent that dynamically invokes specialized sub-agents. Experimental results indicate the efficacy of RATE, which improves the meta score by at least 3.2 compared with current metrics. Further experiments demonstrate the robustness of RATE to general-domain MT evaluation. Code and dataset are available at: this https URL.
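Summary: Large Language Models (LLMs) have significantly advanced Machine Translation (MT) and are applied to linguistically complex domains such as social network services and literature. In these scenarios, translation often requires handling non-literal expressions, which makes MT metrics inaccurate. To systematically study the reliability of MT metrics, we first curate MENT, a meta-evaluation dataset focused on non-literal translation. MENT covers four non-literal translation domains and pairs source sentences with translations from diverse MT systems, with 7,530 human-annotated translation quality scores. Experimental results reveal the inaccuracy of traditional MT metrics and the limitations of LLM-as-a-Judge, particularly the knowledge cutoff and score inconsistency problems. To mitigate these limitations, we propose RATE, a novel agentic translation evaluation framework centered on a reflective Core Agent that dynamically invokes specialized sub-agents. Experimental results show the effectiveness of RATE, which improves the meta score by at least 3.2 compared with current metrics. Further experiments demonstrate the robustness of RATE on general-domain MT evaluation. Code and dataset are available at: this https URL.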

Title: DiffER: Diffusion Entity-Relation Modeling for Reversal Curse in Diffusion Large Language Models

Authors: Shaokai He, Kaiwen Wei, Xinyi Zeng, Xiang Chen, Xue Yang, Zhenyang Li, Jiang Zhong, Yu Tian
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.07347
Pdf URL: https://arxiv.org/pdf/2601.07347
Copy Paste: [[2601.07347]] DiffER: Diffusion Entity-Relation Modeling for Reversal Curse in Diffusion Large Language Models(https://arxiv.org/abs/2601.07347)
Keywords: language model, llm
Abstract: The "reversal curse" refers to the phenomenon where large language models (LLMs) exhibit predominantly unidirectional behavior when processing logically bidirectional relationships. Prior work attributed this to autoregressive training -- predicting the next token inherently favors left-to-right information flow over genuine bidirectional knowledge associations. However, we observe that Diffusion LLMs (DLLMs), despite being trained bidirectionally, also suffer from the reversal curse. To investigate the root causes, we conduct systematic experiments on DLLMs and identify three key reasons: 1) entity fragmentation during training, 2) data asymmetry, and 3) missing entity relations. Motivated by the analysis of these reasons, we propose Diffusion Entity-Relation Modeling (DiffER), which addresses the reversal curse through entity-aware training and balanced data construction. Specifically, DiffER introduces whole-entity masking, which mitigates entity fragmentation by predicting complete entities in a single step. DiffER further employs distribution-symmetric and relation-enhanced data construction strategies to alleviate data asymmetry and missing relations. Extensive experiments demonstrate that DiffER effectively alleviates the reversal curse in Diffusion LLMs, offering new perspectives for future research.
摘要：“逆转诅咒”是指大型语言模型（LLM）在处理逻辑双向关系时主要表现出单向行为的现象。之前的工作将此归因于自回归训练——预测下一个标记本质上有利于从左到右的信息流，而不是真正的双向知识关联。然而，我们观察到扩散法学硕士（DLLM）尽管是双向训练的，但也遭受了逆转诅咒。为了调查根本原因，我们对 DLLM 进行了系统实验，并确定了三个关键原因：1）训练过程中的实体碎片，2）数据不对称，3）实体关系缺失。在分析这些原因的动机下，我们提出了扩散实体关系模型（DiffER），它通过实体感知训练和平衡数据构建来解决逆转诅咒。具体来说，DiffER 引入了整体实体屏蔽，它通过一步预测完整实体来减轻实体碎片。 DiffER进一步采用分布对称和关系增强的数据构建策略来减轻数据不对称和缺失关系。大量实验表明，DiffER 有效缓解了 Diffusion LLM 中的逆转诅咒，为未来的研究提供了新的视角。
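
The whole-entity masking idea in the DiffER abstract above can be illustrated with a small sketch. It is not the authors' implementation: the function name `whole_entity_mask`, the `[MASK]` string, and the assumption that entity spans arrive as non-overlapping `(start, end)` index pairs are all hypothetical. The only point is that one random draw covers every token of an entity, so an entity is never partially visible.

```python
import random

def whole_entity_mask(tokens, entity_spans, mask_prob, rng, mask_token="[MASK]"):
    """Mask tokens so that each entity span is masked entirely or not at all.

    entity_spans: non-overlapping (start, end) index pairs, end exclusive.
    """
    masked = list(tokens)
    in_entity = set()
    for start, end in entity_spans:
        in_entity.update(range(start, end))
        # One random draw decides the fate of the whole entity.
        if rng.random() < mask_prob:
            for i in range(start, end):
                masked[i] = mask_token
    # Tokens outside any entity are masked independently.
    for i in range(len(tokens)):
        if i not in in_entity and rng.random() < mask_prob:
            masked[i] = mask_token
    return masked

tokens = ["Tom", "Cruise", "is", "the", "son", "of", "Mary", "Lee", "Pfeiffer"]
spans = [(0, 2), (6, 9)]
print(whole_entity_mask(tokens, spans, 0.5, random.Random(0)))
```

With ordinary token-level masking, "Tom" could be hidden while "Cruise" stays visible, which is the fragmentation the abstract refers to.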

Title: Controlled Self-Evolution for Algorithmic Code Optimization

Authors: Tu Hu, Ronghao Chen, Shuo Zhang, Jianghao Yin, Mou Xiao Feng, Jingping Liu, Shaolei Zhang, Wenqi Jiang, Yuqi Fang, Sen Hu, Yi Xu, Huacan Wang
Subjects: cs.CL, cs.AI, cs.NE
Abstract URL: https://arxiv.org/abs/2601.07348
Pdf URL: https://arxiv.org/pdf/2601.07348
Copy Paste: [[2601.07348]] Controlled Self-Evolution for Algorithmic Code Optimization(https://arxiv.org/abs/2601.07348)
Keywords: llm
Abstract: Self-evolution methods enhance code generation through iterative "generate-verify-refine" cycles, yet existing approaches suffer from low exploration efficiency, failing to discover solutions with superior complexity within limited budgets. This inefficiency stems from initialization bias trapping evolution in poor solution regions, uncontrolled stochastic operations lacking feedback guidance, and insufficient experience utilization across tasks. To address these bottlenecks, we propose Controlled Self-Evolution (CSE), which consists of three key components. Diversified Planning Initialization generates structurally distinct algorithmic strategies for broad solution space coverage. Genetic Evolution replaces stochastic operations with feedback-guided mechanisms, enabling targeted mutation and compositional crossover. Hierarchical Evolution Memory captures both successful and failed experiences at inter-task and intra-task levels. Experiments on EffiBench-X demonstrate that CSE consistently outperforms all baselines across various LLM backbones. Furthermore, CSE achieves higher efficiency from early generations and maintains continuous improvement throughout evolution. Our code is publicly available at this https URL.
摘要：自进化方法通过迭代的“生成-验证-细化”循环来增强代码生成，但现有方法的探索效率较低，无法在有限的预算内发现具有较高复杂性的解决方案。这种低效率源于初始化偏差在较差的解决方案区域中捕获进化、缺乏反馈指导的不受控制的随机操作以及跨此http URL的经验利用率不足，为了解决这些瓶颈，我们提出了受控自我进化（CSE），它由三个关键组件组成。多样化规划初始化生成结构不同的算法策略，以实现广泛的解决方案空间覆盖。遗传进化用反馈引导机制取代了随机操作，从而实现了有针对性的突变和组合交叉。分层进化内存捕获任务间和任务内的成功和失败经验，EffiBench-X 上的此 http URL 表明 CSE 始终优于各种 LLM 主干的所有基线。此外，CSE 从早期几代开始就实现了更高的效率，并在整个进化过程中保持持续改进。我们的代码可通过此 https URL 公开获取。

Title: Beyond Hard Masks: Progressive Token Evolution for Diffusion Language Models

Authors: Linhao Zhong, Linyu Wu, Bozhen Fang, Tianjian Feng, Chenchen Jing, Wen Wang, Jiaheng Zhang, Hao Chen, Chunhua Shen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.07351
Pdf URL: https://arxiv.org/pdf/2601.07351
Copy Paste: [[2601.07351]] Beyond Hard Masks: Progressive Token Evolution for Diffusion Language Models(https://arxiv.org/abs/2601.07351)
Keywords: language model
Abstract: Diffusion Language Models (DLMs) offer a promising alternative for language modeling by enabling parallel decoding through iterative refinement. However, most DLMs rely on hard binary masking and discrete token assignments, which hinder the revision of early decisions and underutilize intermediate probabilistic representations. In this paper, we propose EvoToken-DLM, a novel diffusion-based language modeling approach that replaces hard binary masks with evolving soft token distributions. EvoToken-DLM enables a progressive transition from masked states to discrete outputs, supporting revisable decoding. To effectively support this evolution, we introduce continuous trajectory supervision, which aligns training objectives with iterative probabilistic updates. Extensive experiments across multiple benchmarks show that EvoToken-DLM consistently achieves superior performance, outperforming strong diffusion-based and masked DLM baselines. Project webpage: this https URL.
摘要：扩散语言模型 (DLM) 通过迭代细化实现并行解码，为语言建模提供了一种有前途的替代方案。然而，大多数 DLM 依赖于硬二进制屏蔽和离散令牌分配，这阻碍了早期决策的修改，并且没有充分利用中间概率表示。在本文中，我们提出了 EvoToken-DLM，这是一种新颖的基于扩散的语言建模方法，用不断发展的软令牌分布取代硬二进制掩码。 EvoToken-DLM 可实现从屏蔽状态到离散输出的渐进过渡，支持可修改的解码。为了有效支持这种演变，我们引入了连续轨迹监督，它将训练目标与迭代概率更新保持一致。跨多个基准的大量实验表明，EvoToken-DLM 始终实现卓越的性能，优于强大的基于扩散和屏蔽的 DLM 基线。项目网页：此 https URL。

Title: TALON: Confidence-Aware Speculative Decoding with Adaptive Token Trees

Authors: Tianyu Liu, Qitan Lv, Yuhao Shen, Xiao Sun, Xiaoyan Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.07353
Pdf URL: https://arxiv.org/pdf/2601.07353
Copy Paste: [[2601.07353]] TALON: Confidence-Aware Speculative Decoding with Adaptive Token Trees(https://arxiv.org/abs/2601.07353)
Keywords: llm
Abstract: Speculative decoding (SD) has become a standard technique for accelerating LLM inference without sacrificing output quality. Recent advances in speculative decoding have shifted from sequential chain-based drafting to tree-structured generation, where the draft model constructs a tree of candidate tokens to explore multiple possible drafts in parallel. However, existing tree-based SD methods typically build a fixed-width, fixed-depth draft tree, which fails to adapt to the varying difficulty of tokens and contexts. As a result, the draft model cannot dynamically adjust the tree structure to stop early on difficult tokens and extend generation for simple ones. To address these challenges, we introduce TALON, a training-free, budget-driven adaptive tree expansion framework that can be plugged into existing tree-based methods. Unlike static methods, TALON constructs the draft tree iteratively until a fixed token budget is met, using a hybrid expansion strategy that adaptively allocates the node budget to each layer of the draft tree. This framework naturally shapes the draft tree into a "deep-and-narrow" form for deterministic contexts and a "shallow-and-wide" form for uncertain branches, effectively optimizing the trade-off between exploration width and generation depth under a given budget. Extensive experiments across 5 models and 6 datasets demonstrate that TALON consistently outperforms state-of-the-art EAGLE-3, achieving up to 5.16x end-to-end speedup over auto-regressive decoding.
摘要：推测性解码 (SD) 已成为在不牺牲输出质量的情况下加速 LLM 推理的标准技术。推测性解码的最新进展已经从基于顺序链的起草转向树结构生成，其中草稿模型构建候选标记树以并行探索多个可能的草稿。然而，现有的基于树的SD方法通常构建固定宽度、固定深度的草案树，它无法适应令牌和上下文的不同难度。因此，草稿模型无法动态调整树结构以提前停止困难令牌并延长简单令牌的生成。为了应对这些挑战，我们引入了 TALON，这是一种免训练、预算驱动的自适应树扩展框架，可以插入现有的基于树的方法中。与静态方法不同，TALON 使用混合扩展策略迭代构建草图树，直到满足固定的令牌预算，该策略自适应地将节点预算分配到草图树的每一层。该框架自然地将草图树塑造为确定性上下文的“深而窄”的形式和不确定性分支的“浅而宽”的形式，有效地优化了给定预算下探索宽度和生成深度之间的权衡。跨 5 个模型和 6 个数据集的大量实验表明，TALON 的性能始终优于最先进的 EAGLE-3，与自回归解码相比，端到端加速高达 5.16 倍。
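
To make the "deep-and-narrow" versus "shallow-and-wide" behaviour in the TALON abstract concrete, here is a minimal sketch of budget-driven tree growth. It uses plain best-first expansion by cumulative path probability, which is a generic stand-in and not TALON's hybrid expansion strategy (the abstract does not specify it). The names `build_draft_tree` and `next_probs`, and the toy draft models, are assumptions for illustration.

```python
import heapq

def build_draft_tree(next_probs, budget, top_k=3):
    """Grow a token tree best-first until `budget` nodes have been added.

    next_probs(path) -> dict mapping token to probability (toy draft model).
    Returns the added paths (tuples of tokens) in order of addition.
    """
    tree = []
    frontier = []  # heap of (-cumulative probability, path)

    def push_children(path, path_prob):
        best = sorted(next_probs(path).items(), key=lambda kv: -kv[1])[:top_k]
        for token, p in best:
            heapq.heappush(frontier, (-(path_prob * p), path + (token,)))

    push_children((), 1.0)
    while frontier and len(tree) < budget:
        neg_prob, path = heapq.heappop(frontier)
        tree.append(path)
        push_children(path, -neg_prob)
    return tree

# Toy draft models: one confident, one nearly uniform.
confident = lambda path: {"a": 0.9, "b": 0.1}
uncertain = lambda path: {"a": 0.34, "b": 0.33, "c": 0.33}

print(build_draft_tree(confident, budget=4))  # a single chain, four levels deep
print(build_draft_tree(uncertain, budget=4))  # three siblings plus one child
```

Under the same node budget, the confident model yields a chain while the uncertain one spreads across siblings, which is the qualitative shape the abstract describes.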

Title: Semantic Compression of LLM Instructions via Symbolic Metalanguages

Authors: Ernst van Gassen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.07354
Pdf URL: https://arxiv.org/pdf/2601.07354
Copy Paste: [[2601.07354]] Semantic Compression of LLM Instructions via Symbolic Metalanguages(https://arxiv.org/abs/2601.07354)
Keywords: gpt, llm, prompt, chat
Abstract: We introduce MetaGlyph, a symbolic language for compressing prompts by encoding instructions as mathematical symbols rather than prose. Unlike systems requiring explicit decoding rules, MetaGlyph uses symbols like $\in$ (membership) and $\Rightarrow$ (implication) that models already understand from their training data. We test whether these symbols work as "instruction shortcuts" that models can interpret without additional teaching. We evaluate eight models across two dimensions relevant to practitioners: scale (3B-1T parameters) and accessibility (open-source for local deployment vs. proprietary APIs). MetaGlyph achieves 62-81% token reduction across all task types. For API-based deployments, this translates directly to cost savings; for local deployments, it reduces latency and memory pressure. Results vary by model. Gemini 2.5 Flash achieves 75% semantic equivalence between symbolic and prose instructions on selection tasks, with 49.9% membership operator fidelity. Kimi K2 reaches 98.1% fidelity for implication ($\Rightarrow$) and achieves perfect (100%) accuracy on selection tasks with symbolic prompts. GPT-5.2 Chat shows the highest membership fidelity observed (91.3%), though with variable parse success across task types. Claude Haiku 4.5 achieves 100% parse success with 26% membership fidelity. Among mid-sized models, Qwen 2.5 7B shows 62% equivalence on extraction tasks. Mid-sized open-source models (7B-12B) show near-zero operator fidelity, suggesting a U-shaped relationship where sufficient scale overcomes instruction-tuning biases.
摘要：我们引入了 MetaGlyph，一种符号语言，用于通过将指令编码为数学符号而不是散文来压缩提示。与需要显式解码规则的系统不同，MetaGlyph 使用 $\in$ （成员资格）和 $\Rightarrow$ （暗示）等符号，模型已经从训练数据中理解了这些符号。我们测试这些符号是否可以作为模型无需额外教学即可解释的“指令快捷方式”。我们在与从业者相关的两个维度上评估了八个模型：规模（3B-1T 参数）和可访问性（本地部署的开源与专有 API）。 MetaGlyph 在所有任务类型中实现了 62-81% 的标记减少。对于基于 API 的部署，这可以直接转化为成本节省；对于本地部署，它可以减少延迟和内存压力。结果因型号而异。 Gemini 2.5 Flash 在选择任务中的符号指令和散文指令之间实现了 75% 的语义等效性，以及 49.9% 的成员运算符保真度。 Kimi K2 的暗示保真度 ($\Rightarrow$) 达到 98.1%，并在带有符号提示的选择任务中达到完美 (100%) 的准确性。 GPT-5.2 Chat 显示了观察到的最高成员保真度 (91.3%)，尽管跨任务类型的变量解析成功。 Claude Haiku 4.5 实现了 100% 的解析成功率和 26% 的成员保真度。在中型模型中，Qwen 2.5 7B 在提取任务上显示出 62% 的等效性。中型开源模型 (7B-12B) 显示出接近于零的算子保真度，表明存在 U 形关系，其中足够的规模克服了指令调整偏差。

Title: Interpretable Text Classification Applied to the Detection of LLM-generated Creative Writing

Authors: Minerva Suvanto, Andrea McGlinchey, Mattias Wahde, Peter J Barclay
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.07368
Pdf URL: https://arxiv.org/pdf/2601.07368
Copy Paste: [[2601.07368]] Interpretable Text Classification Applied to the Detection of LLM-generated Creative Writing(https://arxiv.org/abs/2601.07368)
Keywords: llm
Abstract: We consider the problem of distinguishing human-written creative fiction (excerpts from novels) from similar text generated by an LLM. Our results show that, while human observers perform poorly (near chance levels) on this binary classification task, a variety of machine-learning models achieve accuracy in the range 0.93 - 0.98 over a previously unseen test set, even using only short samples and single-token (unigram) features. We therefore employ an inherently interpretable (linear) classifier (with a test accuracy of 0.98), in order to elucidate the underlying reasons for this high accuracy. In our analysis, we identify specific unigram features indicative of LLM-generated text, one of the most important being that the LLM tends to use a larger variety of synonyms, thereby skewing the probability distributions in a manner that is easy to detect for a machine learning classifier, yet very difficult for a human observer. Four additional explanation categories were also identified, namely, temporal drift, Americanisms, foreign language usage, and colloquialisms. As identification of the AI-generated text depends on a constellation of such features, the classification appears robust, and therefore not easy to circumvent by malicious actors intent on misrepresenting AI-generated text as human work.
摘要：我们考虑将人类撰写的创意小说（小说摘录）与法学硕士生成的类似文本区分开来的问题。我们的结果表明，虽然人类观察者在此二元分类任务上表现不佳（接近机会水平），但各种机器学习模型在以前未见过的测试集上实现了 0.93 - 0.98 范围内的准确度，即使仅使用短样本和单标记（一元）特征。因此，我们采用本质上可解释的（线性）分类器（测试精度为 0.98），以阐明这种高精度的根本原因。在我们的分析中，我们识别了表示 LLM 生成文本的特定一元特征，其中最重要的一个是 LLM 倾向于使用更多种类的同义词，从而以机器学习分类器易于检测但对于人类观察者来说非常困难的方式扭曲概率分布。还确定了四个额外的解释类别，即时间漂移、美国语、外语用法和口语。由于人工智能生成的文本的识别取决于一系列此类特征，因此分类看起来很稳健，因此不容易被意图将人工智能生成的文本歪曲为人类作品的恶意行为者规避。

Title: Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models

Authors: Xin Cheng, Wangding Zeng, Damai Dai, Qinyu Chen, Bingxuan Wang, Zhenda Xie, Kezhao Huang, Xingkai Yu, Zhewen Hao, Yukun Li, Han Zhang, Huishuai Zhang, Dongyan Zhao, Wenfeng Liang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.07372
Pdf URL: https://arxiv.org/pdf/2601.07372
Copy Paste: [[2601.07372]] Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models(https://arxiv.org/abs/2601.07372)
Keywords: language model
Abstract: While Mixture-of-Experts (MoE) scales capacity via conditional computation, Transformers lack a native primitive for knowledge lookup, forcing them to inefficiently simulate retrieval through computation. To address this, we introduce conditional memory as a complementary sparsity axis, instantiated via Engram, a module that modernizes classic $N$-gram embedding for O(1) lookup. By formulating the Sparsity Allocation problem, we uncover a U-shaped scaling law that optimizes the trade-off between neural computation (MoE) and static memory (Engram). Guided by this law, we scale Engram to 27B parameters, achieving superior performance over a strictly iso-parameter and iso-FLOPs MoE baseline. Most notably, while the memory module is expected to aid knowledge retrieval (e.g., MMLU +3.4; CMMLU +4.0), we observe even larger gains in general reasoning (e.g., BBH +5.0; ARC-Challenge +3.7) and code/math domains (HumanEval +3.0; MATH +2.4). Mechanistic analyses reveal that Engram relieves the backbone's early layers from static reconstruction, effectively deepening the network for complex reasoning. Furthermore, by delegating local dependencies to lookups, it frees up attention capacity for global context, substantially boosting long-context retrieval (e.g., Multi-Query NIAH: 84.2 to 97.0). Finally, Engram establishes infrastructure-aware efficiency: its deterministic addressing enables runtime prefetching from host memory, incurring negligible overhead. We envision conditional memory as an indispensable modeling primitive for next-generation sparse models.
摘要：虽然专家混合 (MoE) 通过条件计算扩展容量，但 Transformer 缺乏用于知识查找的本机原语，迫使它们通过计算低效地模拟检索。为了解决这个问题，我们引入条件记忆作为补充稀疏轴，通过 Engram 实例化，Engram 是一个现代化用于 O(1) 查找的经典 $N$-gram 嵌入的模块。通过制定稀疏分配问题，我们发现了一个 U 形缩放定律，该定律优化了神经计算 (MoE) 和静态内存 (Engram) 之间的权衡。在这一定律的指导下，我们将 Engram 扩展到 27B 参数，在严格的 iso 参数和 iso-FLOPs MoE 基线上实现了卓越的性能。最值得注意的是，虽然内存模块预计有助于知识检索（例如，MMLU +3.4；CMMLU +4.0），但我们观察到在一般推理（例如，BBH +5.0；ARC-Challenge +3.7）和代码/数学领域〜（HumanEval +3.0；MATH +2.4）方面取得了更大的进步。机制分析表明，Engram 使骨干网的早期层免于静态重建，有效加深了网络的复杂推理能力。此外，通过将本地依赖项委托给查找，它可以释放全局上下文的注意力能力，从而大大提高长上下文检索（例如，多查询 NIAH：84.2 到 97.0）。最后，Engram 建立了基础设施感知效率：其确定性寻址支持运行时从主机内存预取，产生的开销可以忽略不计。我们将条件记忆视为下一代稀疏模型不可或缺的建模原语。
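
The O(1) lookup and deterministic addressing mentioned in the Engram abstract can be pictured with a hashed N-gram index. This is a rough sketch only: the abstract does not describe Engram's hashing scheme, so the CRC32 hash, the function name `ngram_slots`, and the table size are assumptions chosen for illustration.

```python
import zlib

def ngram_slots(token_ids, n, table_size):
    """Map each position's trailing n-gram to a table slot with a deterministic hash.

    Positions with fewer than n tokens of context get None.
    """
    slots = []
    for i in range(len(token_ids)):
        if i + 1 < n:
            slots.append(None)  # not enough left context yet
            continue
        gram = token_ids[i + 1 - n : i + 1]
        key = ",".join(map(str, gram)).encode("utf-8")
        slots.append(zlib.crc32(key) % table_size)
    return slots

# The slot indices depend only on the input token ids, so they can be computed
# (and the corresponding embedding rows fetched) before the forward pass runs.
print(ngram_slots([5, 7, 5, 7], n=2, table_size=1024))
```

Because the address is a pure function of the input tokens, a serving system could in principle prefetch the needed rows from host memory, which is the property the abstract highlights.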

Title: GROKE: Vision-Free Navigation Instruction Evaluation via Graph Reasoning on OpenStreetMap

Authors: Farzad Shami, Subhrasankha Dey, Nico Van de Weghe, Henrikki Tenkanen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.07375
Pdf URL: https://arxiv.org/pdf/2601.07375
Copy Paste: [[2601.07375]] GROKE: Vision-Free Navigation Instruction Evaluation via Graph Reasoning on OpenStreetMap(https://arxiv.org/abs/2601.07375)
Keywords: llm, agent
Abstract: The evaluation of navigation instructions remains a persistent challenge in Vision-and-Language Navigation (VLN) research. Traditional reference-based metrics such as BLEU and ROUGE fail to capture the functional utility of spatial directives, specifically whether an instruction successfully guides a navigator to the intended destination. Although existing VLN agents could serve as evaluators, their reliance on high-fidelity visual simulators introduces licensing constraints and computational costs, and perception errors further confound linguistic quality assessment. This paper introduces GROKE (Graph-based Reasoning over OSM Knowledge for instruction Evaluation), a vision-free, training-free, hierarchical LLM-based framework for evaluating navigation instructions using OpenStreetMap data. Through systematic ablation studies, we demonstrate that structured JSON and textual formats for spatial information substantially outperform grid-based and visual graph representations. Our hierarchical architecture combines sub-instruction planning with topological graph navigation, reducing navigation error by 68.5% compared to heuristic and sampling baselines on the Map2Seq dataset. The agent's execution success, trajectory fidelity, and decision patterns serve as proxy metrics for functional navigability given OSM-visible landmarks and topology, establishing a scalable and interpretable evaluation paradigm without visual dependencies. Code and data are available at this https URL.
摘要：导航指令的评估仍然是视觉和语言导航（VLN）研究中持续存在的挑战。 BLEU 和 ROUGE 等传统的基于参考的指标无法捕捉空间指令的功能效用，特别是指令是否成功引导导航器到达预期目的地。尽管现有的 VLN 代理可以充当评估器，但它们对高保真视觉模拟器的依赖引入了许可限制和计算成本，并且感知错误进一步混淆了语言质量评估。本文介绍了 GROKE（基于 OSM 知识的图形推理指令评估），这是一种无需视觉、无需训练、基于 LLM 的分层框架，用于使用 OpenStreetMap 数据评估导航指令。通过系统的消融研究，我们证明空间信息的结构化 JSON 和文本格式大大优于基于网格和可视化图形表示。我们的分层架构将子指令规划与拓扑图导航相结合，与 Map2Seq 数据集上的启发式和采样基线相比，将导航错误减少了 68.5%。代理的执行成功、轨迹保真度和决策模式作为给定 OSM 可见地标和拓扑的功能导航性的代理指标，建立了一个可扩展且可解释的评估范例，无需视觉依赖性。代码和数据可从此 https URL 获取。

Title: Outcome-Grounded Advantage Reshaping for Fine-Grained Credit Assignment in Mathematical Reasoning

Authors: Ziheng Li, Liu Kang, Feng Xiao, Luxi Xing, Qingyi Si, Zhuoran Li, Weikang Gong, Deqing Yang, Yanghua Xiao, Hongcheng Guo
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.07408
Pdf URL: https://arxiv.org/pdf/2601.07408
Copy Paste: [[2601.07408]] Outcome-Grounded Advantage Reshaping for Fine-Grained Credit Assignment in Mathematical Reasoning(https://arxiv.org/abs/2601.07408)
Keywords: llm
Abstract: Group Relative Policy Optimization (GRPO) has emerged as a promising critic-free reinforcement learning paradigm for reasoning tasks. However, standard GRPO employs a coarse-grained credit assignment mechanism that propagates group-level rewards uniformly to every token in a sequence, neglecting the varying contribution of individual reasoning steps. We address this limitation by introducing Outcome-grounded Advantage Reshaping (OAR), a fine-grained credit assignment mechanism that redistributes advantages based on how much each token influences the model's final answer. We instantiate OAR via two complementary strategies: (1) OAR-P, which estimates outcome sensitivity through counterfactual token perturbations, serving as a high-fidelity attribution signal; (2) OAR-G, which uses an input-gradient sensitivity proxy to approximate the influence signal with a single backward pass. These importance signals are integrated with a conservative Bi-Level advantage reshaping scheme that suppresses low-impact tokens and boosts pivotal ones while preserving the overall advantage mass. Empirical results on extensive mathematical reasoning benchmarks demonstrate that while OAR-P sets the performance upper bound, OAR-G achieves comparable gains with negligible computational overhead, both significantly outperforming a strong GRPO baseline, pushing the boundaries of critic-free LLM reasoning.
摘要：组相对策略优化（GRPO）已成为一种有前景的无批评强化学习范式，用于推理任务。然而，标准 GRPO 采用粗粒度的信用分配机制，将组级奖励均匀地传播到序列中的每个令牌，忽略各个推理步骤的不同贡献。我们通过引入基于结果的优势重塑（OAR）来解决这一限制，这是一种细粒度的信用分配机制，可以根据每个令牌对模型最终答案的影响程度来重新分配优势。我们通过两种互补策略实例化 OAR：（1）OAR-P，它通过反事实令牌扰动估计结果敏感性，充当高保真归因信号； (2) OAR-G，它使用输入梯度灵敏度代理通过单次向后传递来近似影响信号。这些重要性信号与保守的双层优势重塑方案相结合，该方案抑制低影响力的代币并增强关键的代币，同时保留整体优势质量。广泛数学推理基准的实证结果表明，虽然 OAR-P 设置了性能上限，但 OAR-G 以可忽略的计算开销实现了相当的增益，两者都显着优于强大的 GRPO 基线，突破了无批评的 LLM 推理的界限。
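
The OAR abstract describes redistributing a sequence-level advantage across tokens while preserving the overall advantage mass. The sketch below shows one way such a two-level reweighting could look. The thresholding at the mean importance, the `low` and `high` weights, and the function name are all assumptions, since the abstract does not give the actual Bi-Level scheme.

```python
def reshape_advantage(advantage, importance, low=0.5, high=1.5):
    """Spread one sequence-level advantage over tokens using two weight levels.

    Tokens whose importance is at or above the mean get weight `high`, the rest
    get `low`. Weights are rescaled so the per-token advantages sum to
    advantage * len(importance), the total under uniform assignment.
    Assumes `importance` is non-empty.
    """
    n = len(importance)
    mean_imp = sum(importance) / n
    weights = [high if s >= mean_imp else low for s in importance]
    scale = n / sum(weights)
    return [advantage * w * scale for w in weights]

per_token = reshape_advantage(2.0, [0.1, 0.9, 0.1, 0.1])
print(per_token)       # the second token receives the largest share
print(sum(per_token))  # total matches uniform assignment up to rounding: 2.0 * 4
```

Standard GRPO corresponds to every weight being 1, so each token would simply receive the advantage 2.0 in this example.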

Title: Two Pathways to Truthfulness: On the Intrinsic Encoding of LLM Hallucinations

Authors: Wen Luo, Guangyue Peng, Wei Li, Shaohang Wei, Feifan Song, Liang Wang, Nan Yang, Xingxing Zhang, Jing Jin, Furu Wei, Houfeng Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.07422
Pdf URL: https://arxiv.org/pdf/2601.07422
Copy Paste: [[2601.07422]] Two Pathways to Truthfulness: On the Intrinsic Encoding of LLM Hallucinations(https://arxiv.org/abs/2601.07422)
Keywords: language model, llm, hallucination
Abstract: Despite their impressive capabilities, large language models (LLMs) frequently generate hallucinations. Previous work shows that their internal states encode rich signals of truthfulness, yet the origins and mechanisms of these signals remain unclear. In this paper, we demonstrate that truthfulness cues arise from two distinct information pathways: (1) a Question-Anchored pathway that depends on question-answer information flow, and (2) an Answer-Anchored pathway that derives self-contained evidence from the generated answer itself. First, we validate and disentangle these pathways through attention knockout and token patching. Afterwards, we uncover notable and intriguing properties of these two mechanisms. Further experiments reveal that (1) the two mechanisms are closely associated with LLM knowledge boundaries; and (2) internal representations are aware of their distinctions. Finally, building on these insightful findings, two applications are proposed to enhance hallucination detection performance. Overall, our work provides new insight into how LLMs internally encode truthfulness, offering directions for more reliable and self-aware generative systems.
摘要：尽管大型语言模型 (LLM) 的功能令人印象深刻，但它们经常会产生幻觉。先前的研究表明，它们的内部状态编码了丰富的真实信号，但这些信号的起源和机制仍不清楚。在本文中，我们证明真实性线索来自两种不同的信息路径：（1）依赖于问答信息流的问题锚定路径，以及（2）从生成的答案本身得出独立证据的答案锚定路径。首先，我们通过注意力剔除和令牌修补来验证和理清这些路径。之后，我们发现了这两种机制的显着且有趣的特性。进一步的实验表明：（1）这两种机制与LLM知识边界密切相关； (2) 内部代表意识到自己的区别。最后，基于这些富有洞察力的发现，提出了两种应用来增强幻觉检测性能。总的来说，我们的工作为法学硕士如何内部编码真实性提供了新的见解，为更可靠和自我意识的生成系统提供了方向。

Title: KALE: Enhancing Knowledge Manipulation in Large Language Models via Knowledge-aware Learning

Authors: Qitan Lv, Tianyu Liu, Qiaosheng Zhang, Xingcheng Xu, Chaochao Lu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.07430
Pdf URL: https://arxiv.org/pdf/2601.07430
Copy Paste: [[2601.07430]] KALE: Enhancing Knowledge Manipulation in Large Language Models via Knowledge-aware Learning(https://arxiv.org/abs/2601.07430)
Keywords: language model, llm
Abstract: Despite the impressive performance of large language models (LLMs) pretrained on vast knowledge corpora, advancing their knowledge manipulation (the ability to effectively recall, reason, and transfer relevant knowledge) remains challenging. Existing methods mainly leverage Supervised Fine-Tuning (SFT) on labeled datasets to enhance LLMs' knowledge manipulation ability. However, we observe that SFT models still exhibit the "known&incorrect" phenomenon, where they explicitly possess relevant knowledge for a given question but fail to leverage it for correct answers. To address this challenge, we propose KALE (Knowledge-Aware LEarning), a post-training framework that leverages knowledge graphs (KGs) to generate high-quality rationales and enhance LLMs' knowledge manipulation ability. Specifically, KALE first introduces a Knowledge-Induced (KI) data synthesis method that efficiently extracts multi-hop reasoning paths from KGs to generate high-quality rationales for question-answer pairs. Then, KALE employs a Knowledge-Aware (KA) fine-tuning paradigm that enhances knowledge manipulation by internalizing rationale-guided reasoning through minimizing the KL divergence between predictions with and without rationales. Extensive experiments on eight popular benchmarks across six different LLMs demonstrate the effectiveness of KALE, achieving accuracy improvements of up to 11.72% and an average of 4.18%.
摘要：尽管在大量知识语料库上预训练的大型语言模型（LLM）表现出色，但提高其知识操作（有效回忆、推理和转移相关知识的能力）仍然具有挑战性。现有方法主要利用标记数据集上的监督微调（SFT）来增强法学硕士的知识操纵能力。然而，我们观察到 SFT 模型仍然表现出已知且不正确的现象，即它们明确拥有给定问题的相关知识，但无法利用它来获得正确答案。为了应对这一挑战，我们提出了KALE（知识感知学习）——一种利用知识图（KG）生成高质量原理并增强法学硕士知识操纵能力的培训后框架。具体来说，KALE 首先引入了知识诱导（KI）数据合成方法，该方法可以有效地从知识图谱中提取多跳推理路径，从而为问答对生成高质量的基本原理。然后，KALE 采用知识感知 (KA) 微调范式，通过最小化有理由和无理由的预测之间的 KL 差异，内化理由引导推理，从而增强知识操作。对六个不同法学硕士的八个流行基准进行的广泛实验证明了 KALE 的有效性，准确率提高了高达 11.72%，平均提高了 4.18%。
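
The KALE abstract's objective, minimizing the KL divergence between predictions with and without rationales, can be illustrated on a single next-token distribution. The logits below are made up, and the direction of the divergence (rationale-conditioned prediction as the target) is an assumption, since the abstract does not state it.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits over three answer options.
with_rationale = softmax([2.0, 0.0, -1.0])     # prompt includes the rationale
without_rationale = softmax([0.5, 0.4, 0.3])   # prompt is the question alone

# A training loss of this form pushes the rationale-free prediction toward
# the rationale-conditioned one.
print(kl_divergence(with_rationale, without_rationale))
```

The divergence is zero only when the two distributions coincide, that is, when the model answers the bare question as it would with the rationale in context.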

Title: Judging Against the Reference: Uncovering Knowledge-Driven Failures in LLM-Judges on QA Evaluation

Authors: Dongryeol Lee, Yerin Hwang, Taegwan Kang, Minwoo Lee, Younhyung Chae, Kyomin Jung
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.07506
Pdf URL: https://arxiv.org/pdf/2601.07506
Copy Paste: [[2601.07506]] Judging Against the Reference: Uncovering Knowledge-Driven Failures in LLM-Judges on QA Evaluation(https://arxiv.org/abs/2601.07506)
Keywords: language model, llm, prompt
Abstract: While large language models (LLMs) are increasingly used as automatic judges for question answering (QA) and other reference-conditioned evaluation tasks, little is known about their ability to adhere to a provided reference. We identify a critical failure mode of such reference-based LLM QA evaluation: when the provided reference conflicts with the judge model's parametric knowledge, the resulting scores become unreliable, substantially degrading evaluation fidelity. To study this phenomenon systematically, we introduce a controlled swapped-reference QA framework that induces reference-belief conflicts. Specifically, we replace the reference answer with an incorrect entity and construct diverse pairings of original and swapped references with correspondingly aligned candidate answers. Surprisingly, grading reliability drops sharply under swapped references across a broad set of judge models. We empirically show that this vulnerability is driven by judges' over-reliance on parametric knowledge, leading judges to disregard the given reference under conflict. Finally, we find that this failure persists under common prompt-based mitigation strategies, highlighting a fundamental limitation of LLM-as-a-judge evaluation and motivating reference-based protocols that enforce stronger adherence to the provided reference.
摘要：虽然大型语言模型 (LLM) 越来越多地用作问答 (QA) 和其他参考条件评估任务的自动判断器，但人们对其遵守所提供参考的能力知之甚少。我们确定了这种基于参考的 LLM QA 评估的一个关键失败模式：当提供的参考与判断模型的参数知识相冲突时，所得分数变得不可靠，大大降低了评估保真度。为了系统地研究这种现象，我们引入了一种受控交换参考 QA 框架，该框架会引发参考信念冲突。具体来说，我们用不正确的实体替换参考答案，并用相应对齐的候选答案构建原始和交换参考的不同配对。令人惊讶的是，在广泛的判断模型中交换参考的情况下，评分可靠性急剧下降。我们的经验表明，这种脆弱性是由法官过度依赖参数知识造成的，导致法官在冲突下忽视给定的参考。最后，我们发现这种失败在常见的基于提示的缓解策略下仍然存在，凸显了法学硕士作为法官评估的根本局限性，并激发了基于参考的协议，强制更严格地遵守所提供的参考。

Title: High-Rank Structured Modulation for Parameter-Efficient Fine-Tuning

Authors: Yongkang Liu, Xing Li, Mengjie Zhao, Shanru Zhang, Zijing Wang, Qian Li, Shi Feng, Feiliang Ren, Daling Wang, Hinrich Schütze
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.07507
Pdf URL: https://arxiv.org/pdf/2601.07507
Copy Paste: [[2601.07507]] High-Rank Structured Modulation for Parameter-Efficient Fine-Tuning(https://arxiv.org/abs/2601.07507)
Keywords: language model
Abstract: As the number of model parameters increases, parameter-efficient fine-tuning (PEFT) has become the go-to choice for tailoring pre-trained large language models. Low-rank Adaptation (LoRA) uses a low-rank update method to simulate full parameter fine-tuning, which is widely used to reduce resource requirements. However, decreasing the rank encounters challenges with limited representational capacity when compared to full parameter fine-tuning. We present SMoA, a high-rank Structured MOdulation Adapter that uses fewer trainable parameters while maintaining a higher rank, thereby improving the model's representational capacity and offering improved performance potential. The core idea is to freeze the original pretrained weights and selectively amplify or suppress important features of the original weights across multiple subspaces. The subspace mechanism provides an efficient way to increase the capacity and complexity of a model. We conduct both theoretical analyses and empirical studies on various tasks. Experimental results show that SMoA outperforms LoRA and its variants on 10 tasks, with extensive ablation studies validating its effectiveness.
摘要：随着模型参数数量的增加，参数高效微调（PEFT）已成为定制预训练大型语言模型的首选。低秩适应（LoRA）使用低秩更新方法来模拟全参数微调，广泛用于减少资源需求。然而，与全参数微调相比，降低排名会遇到表示能力有限的挑战。我们提出了 \textbf{SMoA}，一种高秩 \textbf{S} 结构 \textbf{MO}dulation \textbf{A} 适配器，它使用更少的可训练参数，同时保持较高的秩，从而提高模型的表示能力并提供改进的性能潜力。其核心思想是冻结原始的预训练权重，并在多个子空间上选择性地放大或抑制原始权重的重要特征。子空间机制提供了一种增加模型容量和复杂性的有效方法。我们对各种任务进行理论分析和实证研究。实验结果表明，SMoA 在 10 项任务上优于 LoRA 及其变体，广泛的消融研究验证了其有效性。

Title: Controlling Multimodal Conversational Agents with Coverage-Enhanced Latent Actions

Authors: Yongqi Li, Hao Lang, Tieyun Qian, Yongbin Li
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.07516
Pdf URL: https://arxiv.org/pdf/2601.07516
Copy Paste: [[2601.07516]] Controlling Multimodal Conversational Agents with Coverage-Enhanced Latent Actions(https://arxiv.org/abs/2601.07516)
Keywords: language model, agent
Abstract: Vision-language models are increasingly employed as multimodal conversational agents (MCAs) for diverse conversational tasks. Recently, reinforcement learning (RL) has been widely explored for adapting MCAs to various human-AI interaction scenarios. Despite showing great enhancement in generalization performance, fine-tuning MCAs via RL still faces challenges in handling the extremely large text token space. To address this, we learn a compact latent action space for RL fine-tuning instead. Specifically, we adopt the learning-from-observation mechanism to construct the codebook for the latent action space, where future observations are leveraged to estimate current latent actions that could further be used to reconstruct future observations. However, the scarcity of paired image-text data hinders learning a codebook with sufficient coverage. Thus, we leverage both paired image-text data and text-only data to construct the latent action space, using a cross-modal projector for transforming text embeddings into image-text embeddings. We initialize the cross-modal projector on paired image-text data, and further train it on massive text-only data with a novel cycle consistency loss to enhance its robustness. We show that our latent-action-based method outperforms competitive baselines on two conversation tasks across various RL algorithms.
摘要：视觉语言模型越来越多地用作多模式会话代理（MCA）来执行不同的会话任务。最近，强化学习 (RL) 已被广泛探索，用于使 MCA 适应各种人机交互场景。尽管在泛化性能方面显示出巨大的增强，但通过 RL 微调 MCA 在处理极大的文本标记空间方面仍然面临挑战。为了解决这个问题，我们学习了一个紧凑的潜在动作空间来进行强化学习微调。具体来说，我们采用从观察中学习的机制来构建潜在动作空间的密码本，其中利用未来的观察来估计当前的潜在动作，这些动作可以进一步用于重建未来的观察。然而，图像-文本配对数据的稀缺阻碍了学习具有足够覆盖范围的码本。因此，我们利用成对的图像文本数据和纯文本数据来构建潜在动作空间，使用跨模式投影仪将文本嵌入转换为图像文本嵌入。我们在成对的图像文本数据上初始化跨模态投影仪，并在大量纯文本数据上进一步训练它，并使用新颖的循环一致性损失来增强其鲁棒性。我们证明，我们基于潜在动作的方法在各种 RL 算法的两个对话任务上优于竞争基线。

Title: Thinking Before Constraining: A Unified Decoding Framework for Large Language Models

Authors: Ngoc Trinh Hung Nguyen, Alonso Silva, Laith Zumot, Liubov Tupikina, Armen Aghasaryan, Mehwish Alam
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.07525
Pdf URL: https://arxiv.org/pdf/2601.07525
Copy Paste: [[2601.07525]] Thinking Before Constraining: A Unified Decoding Framework for Large Language Models(https://arxiv.org/abs/2601.07525)
Keywords: language model, llm
Abstract: Natural generation allows Language Models (LMs) to produce free-form responses with rich reasoning, but the lack of guaranteed structure makes outputs difficult to parse or verify. Structured generation, or constrained decoding, addresses this drawback by producing content in standardized formats such as JSON, ensuring consistency and guaranteed-parsable outputs, but it can inadvertently restrict the model's reasoning capabilities. In this work, we propose a simple approach that combines the advantages of both natural and structured generation. By allowing LLMs to reason freely until specific trigger tokens are generated, and then switching to structured generation, our method preserves the expressive power of natural language reasoning while ensuring the reliability of structured outputs. We further evaluate our approach on several datasets, covering both classification and reasoning tasks, to demonstrate its effectiveness, achieving a substantial gain of up to 27% in accuracy compared to natural generation, while requiring only a small overhead of 10-20 extra tokens.
摘要：自然生成允许语言模型（LM）产生具有丰富推理的自由形式响应，但缺乏有保证的结构使得输出难以解析或验证。结构化生成或约束解码通过以 JSON 等标准化格式生成内容来解决此缺点，确保一致性和有保证的可解析输出，但它可能会无意中限制模型的推理能力。在这项工作中，我们提出了一种简单的方法，结合了自然生成和结构化生成的优点。通过允许LLM自由推理直到生成特定的触发标记，然后切换到结构化生成，我们的方法保留了自然语言推理的表达能力，同时确保了结构化输出的可靠性。我们进一步在多个数据集上评估我们的方法，涵盖分类和推理任务，以证明其有效性，与自然生成相比，准确度大幅提高高达 27%，同时只需要 10-20 个额外标记的少量开销。
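
The switch from free reasoning to structured generation described in the abstract above can be sketched with a toy greedy decoder. Everything here is illustrative: the trigger string `<answer>`, the function name, and the per-step ranked candidate lists (standing in for model logits) are assumptions, and a real system would typically mask logits with a grammar or schema rather than filter a short candidate list.

```python
def think_then_constrain(ranked_candidates, allowed, trigger="<answer>"):
    """Decode greedily; once `trigger` is emitted, only `allowed` tokens may follow.

    ranked_candidates: per-step lists of tokens, best first (toy stand-in for logits).
    allowed: set of tokens permitted in the structured segment.
    """
    output, constrained = [], False
    for candidates in ranked_candidates:
        if constrained:
            # Structured phase: best-ranked token that the format permits.
            token = next((t for t in candidates if t in allowed), None)
            if token is None:
                raise ValueError("no allowed token among candidates")
        else:
            # Free phase: plain greedy choice.
            token = candidates[0]
        output.append(token)
        if token == trigger:
            constrained = True
    return output

steps = [
    ["The", "yes"], ["review", "no"], ["is", "x"], ["glowing", "y"],
    ["<answer>", "so"],
    ["probably", "positive", "negative"],  # top choice is not a valid label
    ["!", "</answer>"],
]
print(think_then_constrain(steps, allowed={"positive", "negative", "</answer>"}))
```

In the structured phase the decoder skips "probably" and "!" even though they rank first, so the final segment stays parseable while the reasoning before the trigger is left untouched.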

Title: From RAG to Agentic RAG for Faithful Islamic Question Answering

Authors: Gagan Bhatia, Hamdy Mubarak, Mustafa Jarrar, George Mikros, Fadi Zaraket, Mahmoud Alhirthani, Mutaz Al-Khatib, Logan Cochrane, Kareem Darwish, Rashid Yahiaoui, Firoj Alam
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.07528
Pdf URL: https://arxiv.org/pdf/2601.07528
Copy Paste: [[2601.07528]] From RAG to Agentic RAG for Faithful Islamic Question Answering(https://arxiv.org/abs/2601.07528)
Keywords: llm, hallucination, agent
Abstract: LLMs are increasingly used for Islamic question answering, where ungrounded responses may carry serious religious consequences. Yet standard MCQ/MRC-style evaluations do not capture key real-world failure modes, notably free-form hallucinations and whether models appropriately abstain when evidence is lacking. To shed light on this aspect, we introduce ISLAMICFAITHQA, a 3,810-item bilingual (Arabic/English) generative benchmark with atomic single-gold answers, which enables direct measurement of hallucination and abstention. We additionally develop an end-to-end grounded Islamic modelling suite consisting of (i) 25K Arabic text-grounded SFT reasoning pairs, (ii) 5K bilingual preference samples for reward-guided alignment, and (iii) a verse-level Qur'an retrieval corpus of $\sim$6k atomic verses (ayat). Building on these resources, we develop an agentic Qur'an-grounding framework (agentic RAG) that uses structured tool calls for iterative evidence seeking and answer revision. Experiments across Arabic-centric and multilingual LLMs show that retrieval improves correctness and that agentic RAG yields the largest gains beyond standard RAG, achieving state-of-the-art performance and stronger Arabic-English robustness even with a small model (i.e., Qwen3 4B). We will make the experimental resources and datasets publicly available for the community.
摘要：法学硕士越来越多地用于伊斯兰问答，其中不接地气的回答可能会带来严重的宗教后果。然而，标准的 MCQ/MRC 式评估并没有捕捉到关键的现实世界失败模式，特别是自由形式的幻觉以及模型在缺乏证据时是否适当放弃。为了阐明这一点，我们引入了 ISLAMICFAITHQA，这是一个包含 3,810 项双语（阿拉伯语/英语）生成基准，具有原子单金答案，可以直接测量幻觉和戒心。我们还开发了一个端到端的伊斯兰建模套件，其中包括 (i) 25K 个基于阿拉伯语文本的 SFT 推理对，(ii) 5K 个用于奖励引导对齐的双语偏好样本，以及 (iii) 一个包含 $\sim$6k 原子经文 (ayat) 的经文级别古兰经检索语料库。在这些资源的基础上，我们开发了一个代理古兰经基础框架（代理 RAG），该框架使用结构化工具调用进行迭代证据寻求和答案修订。以阿拉伯语为中心的多语言法学硕士的实验表明，检索提高了正确性，并且代理 RAG 比标准 RAG 产生了最大的收益，即使使用小型模型（即 Qwen3 4B）也能实现最先进的性能和更强的阿拉伯语-英语鲁棒性。我们将向社区公开提供实验资源和数据集。

Title: A Unified Framework for Emotion Recognition and Sentiment Analysis via Expert-Guided Multimodal Fusion with Large Language Models

Authors: Jiaqi Qiao, Xiujuan Xu, Xinran Li, Yu Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.07565
Pdf URL: https://arxiv.org/pdf/2601.07565
Copy Paste: [[2601.07565]] A Unified Framework for Emotion Recognition and Sentiment Analysis via Expert-Guided Multimodal Fusion with Large Language Models(https://arxiv.org/abs/2601.07565)
Keywords: language model, llm, prompt
Abstract: Multimodal emotion understanding requires effective integration of text, audio, and visual modalities for both discrete emotion recognition and continuous sentiment analysis. We present EGMF, a unified framework combining expert-guided multimodal fusion with large language models. Our approach features three specialized expert networks (a fine-grained local expert for subtle emotional nuances, a semantic correlation expert for cross-modal relationships, and a global context expert for long-range dependencies), adaptively integrated through hierarchical dynamic gating for context-aware feature selection. Enhanced multimodal representations are integrated with LLMs via pseudo token injection and prompt-based conditioning, enabling a single generative framework to handle both classification and regression through natural language generation. We employ LoRA fine-tuning for computational efficiency. Experiments on bilingual benchmarks (MELD, CHERMA, MOSEI, SIMS-V2) demonstrate consistent improvements over state-of-the-art methods, with superior cross-lingual robustness revealing universal patterns in multimodal emotional expressions across English and Chinese. We will release the source code publicly.
摘要：多模态情感理解需要有效集成文本、音频和视觉模态，以实现离散情感识别和连续情感分析。我们提出了 EGMF，这是一个将专家指导的多模态融合与大型语言模型相结合的统一框架。我们的方法具有三个专门的专家网络 - 一个负责微妙情感细微差别的细粒度本地专家，一个负责跨模式关系的语义相关专家，以及一个负责远程依赖关系的全局上下文专家 - 通过分层动态门控进行自适应集成，以进行上下文感知特征选择。增强的多模态表示通过伪令牌注入和基于提示的调节与 LLM 集成，使单个生成框架能够通过自然语言生成处理分类和回归。我们采用 LoRA 微调来提高计算效率。双语基准（MELD、CHERMA、MOSEI、SIMS-V2）的实验表明，与最先进的方法相比，取得了一致的改进，卓越的跨语言鲁棒性揭示了英语和汉语多模态情感表达的普遍模式。我们将公开发布源代码。

Title: ES-Mem: Event Segmentation-Based Memory for Long-Term Dialogue Agents

Authors: Huhai Zou, Tianhao Sun, Chuanjiang He, Yu Tian, Zhenyang Li, Li Jin, Nayu Liu, Jiang Zhong, Kaiwen Wei
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.07582
Pdf URL: https://arxiv.org/pdf/2601.07582
Copy Paste: [[2601.07582]] ES-Mem: Event Segmentation-Based Memory for Long-Term Dialogue Agents(https://arxiv.org/abs/2601.07582)
Keywords: agent
Abstract: Memory is critical for dialogue agents to maintain coherence and enable continuous adaptation in long-term interactions. While existing memory mechanisms offer basic storage and retrieval capabilities, they are hindered by two primary limitations: (1) rigid memory granularity often disrupts semantic integrity, resulting in fragmented and incoherent memory units; (2) prevalent flat retrieval paradigms rely solely on surface-level semantic similarity, neglecting the structural cues of discourse required to navigate and locate specific episodic contexts. To mitigate these limitations, drawing inspiration from Event Segmentation Theory, we propose ES-Mem, a framework incorporating two core components: (1) a dynamic event segmentation module that partitions long-term interactions into semantically coherent events with distinct boundaries; (2) a hierarchical memory architecture that constructs multi-layered memories and leverages boundary semantics to anchor specific episodic memory for precise context localization. Evaluations on two memory benchmarks demonstrate that ES-Mem yields consistent performance gains over baseline methods. Furthermore, the proposed event segmentation module exhibits robust applicability on dialogue segmentation datasets.
摘要：记忆对于对话代理保持连贯性并在长期交互中实现持续适应至关重要。虽然现有的记忆机制提供了基本的存储和检索能力，但它们受到两个主要限制的阻碍：（1）严格的记忆粒度经常破坏语义完整性，导致记忆单元碎片化和不连贯； (2)流行的平面检索范式仅依赖于表面语义相似性，忽略了导航和定位特定情景上下文所需的话语结构线索。为了减轻这些限制，从事件分割理论中汲取灵感，我们提出了 ES-Mem，一个包含两个核心组件的框架：（1）动态事件分割模块，将长期交互划分为具有不同边界的语义连贯事件；（2）分层记忆架构，构建多层记忆并利用边界语义锚定特定情景记忆以实现精确的上下文定位。对两个内存基准测试的评估表明，ES-Mem 比基准方法具有一致的性能增益。此外，所提出的事件分割模块在对话分割数据集上表现出强大的适用性。

Title: Proof of Time: A Benchmark for Evaluating Scientific Idea Judgments

Authors: Bingyang Ye, Shan Chen, Jingxuan Tu, Chen Liu, Zidi Xiong, Samuel Schmidgall, Danielle S. Bitterman
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.07606
Pdf URL: https://arxiv.org/pdf/2601.07606
Copy Paste: [[2601.07606]] Proof of Time: A Benchmark for Evaluating Scientific Idea Judgments(https://arxiv.org/abs/2601.07606)
Keywords: language model, prompt, agent
Abstract: Large language models are increasingly being used to assess and forecast research ideas, yet we lack scalable ways to evaluate the quality of models' judgments about these scientific ideas. Towards this goal, we introduce PoT, a semi-verifiable benchmarking framework that links scientific idea judgments to downstream signals that become observable later (e.g., citations and shifts in researchers' agendas). PoT freezes a pre-cutoff snapshot of evidence in an offline sandbox and asks models to forecast post-cutoff outcomes, enabling verifiable evaluation when ground truth arrives, scalable benchmarking without exhaustive expert annotation, and analysis of human-model misalignment against signals such as peer-review awards. In addition, PoT provides a controlled testbed for agent-based research judgments that evaluate scientific ideas, comparing tool-using agents to non-agent baselines under prompt ablations and budget scaling. Across 30,000+ instances spanning four benchmark domains, we find that, compared with non-agent baselines, higher interaction budgets generally improve agent performance, while the benefit of tool use is strongly task-dependent. By combining time-partitioned, future-verifiable targets with an offline sandbox for tool use, PoT supports scalable evaluation of agents on future-facing scientific idea judgment tasks.
摘要：大型语言模型越来越多地用于评估和预测研究想法，但我们缺乏可扩展的方法来评估模型对这些科学想法的判断质量。为了实现这一目标，我们引入了 PoT，这是一种半可验证的基准测试框架，它将科学思想判断与随后可观察到的下游信号（例如，研究人员议程的引用和变化）联系起来。 PoT 将截断前的证据快照冻结在离线沙箱中，并要求模型预测截断后的结果，从而在地面真相到达时进行可验证的评估，无需详尽的专家注释即可进行可扩展的基准测试，并根据同行评审奖项等信号分析人类模型的偏差。此外，PoT 为基于主体的研究判断提供了一个受控测试平台，用于评估科学思想，在迅速消融和预算调整的情况下将使用工具的主体与非主体基线进行比较。在跨越四个基准域的 30,000 多个实例中，我们发现，与非代理基准相比，较高的交互预算通常会提高代理性能，而工具使用的好处强烈依赖于任务。通过将时间分区、未来可验证的目标与用于工具使用的离线沙箱相结合，PoT 支持对面向未来的科学思想判断任务的代理进行可扩展的评估。

Title: PlaM: Training-Free Plateau-Guided Model Merging for Better Visual Grounding in MLLMs

Authors: Zijing Wang, Yongkang Liu, Mingyang Wang, Ercong Nie, Deyuan Chen, Zhengjie Zhao, Shi Feng, Daling Wang, Xiaocui Yang, Yifei Zhang, Hinrich Schütze
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.07645
Pdf URL: https://arxiv.org/pdf/2601.07645
Copy Paste: [[2601.07645]] PlaM: Training-Free Plateau-Guided Model Merging for Better Visual Grounding in MLLMs(https://arxiv.org/abs/2601.07645)
Keywords: language model, llm
Abstract: Multimodal Large Language Models (MLLMs) rely on strong linguistic reasoning inherited from their base language models. However, multimodal instruction fine-tuning paradoxically degrades this text-based reasoning capability, undermining multimodal performance. To address this issue, we propose a training-free framework that mitigates this degradation. Through layer-wise vision token masking, we reveal a common three-stage pattern in multimodal large language models: early-modal separation, mid-modal alignment, and late-modal degradation. By analyzing the behavior of MLLMs at different stages, we propose a plateau-guided model merging method that selectively injects base language model parameters into MLLMs. Experimental results based on five MLLMs on nine benchmarks demonstrate the effectiveness of our method. Attention-based analysis further reveals that merging shifts attention from diffuse, scattered patterns to focused localization on task-relevant visual regions. Our repository is linked from the arXiv abstract page.
摘要：多模态大语言模型 (MLLM) 依赖于从其基础语言模型继承的强大语言推理。然而，多模态指令微调反而降低了本文的推理能力，从而损害了多模态性能。为了解决这个问题，我们提出了一个免培训框架来缓解这种退化。通过分层视觉标记掩码，我们揭示了多模态大语言模型中常见的三阶段模式：早期模态分离、中期模态对齐和后期模态退化。通过分析 MLLM 在不同阶段的行为，我们提出了一种平台引导模型合并方法，选择性地将基础语言模型参数注入 MLLM。基于五个 MLLM 在九个基准上的实验结果证明了我们方法的有效性。基于注意力的分析进一步揭示，合并将注意力从分散的、分散的模式转移到与任务相关的视觉区域的集中定位。我们的存储库位于此 https URL 上。

Title: Order in the Evaluation Court: A Critical Analysis of NLG Evaluation Trends

Authors: Jing Yang, Nils Feldhus, Salar Mohtaj, Leonhard Hennig, Qianli Wang, Eleni Metheniti, Sherzod Hakimov, Charlott Jakob, Veronika Solopova, Konrad Rieck, David Schlangen, Sebastian Möller, Vera Schmitt
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.07648
Pdf URL: https://arxiv.org/pdf/2601.07648
Copy Paste: [[2601.07648]] Order in the Evaluation Court: A Critical Analysis of NLG Evaluation Trends(https://arxiv.org/abs/2601.07648)
Keywords: llm
Abstract: Despite advances in Natural Language Generation (NLG), evaluation remains challenging. Although various new metrics and LLM-as-a-judge (LaaJ) methods have been proposed, human judgment persists as the gold standard. To systematically review how NLG evaluation has evolved, we employ an automatic information extraction scheme to gather key information from NLG papers, focusing on different evaluation methods (metrics, LaaJ, and human evaluation). With metadata extracted from 14,171 papers across four major conferences (ACL, EMNLP, NAACL, and INLG) over the past six years, we reveal several critical findings: (1) Task Divergence: While Dialogue Generation demonstrates a rapid shift toward LaaJ (>40% in 2025), Machine Translation remains locked into n-gram metrics, and Question Answering exhibits a substantial decline in the proportion of studies conducting human evaluation. (2) Metric Inertia: Despite the development of semantic metrics, general-purpose metrics (e.g., BLEU, ROUGE) continue to be widely used across tasks without empirical justification, often lacking the discriminative power to distinguish between specific quality criteria. (3) Human-LaaJ Divergence: Our association analysis challenges the assumption that LLMs act as mere proxies for humans; LaaJ and human evaluations prioritize very different signals, and explicit validation is scarce (<8% of papers compare the two), with only moderate to low correlation. Based on these observations, we derive practical recommendations to improve the rigor of future NLG evaluation.
摘要：尽管自然语言生成（NLG）取得了进步，但评估仍然具有挑战性。尽管提出了各种新的衡量标准和法学硕士作为法官（LaaJ）方法，但人类判断仍然是黄金标准。为了系统地回顾 NLG 评估的演变过程，我们采用自动信息提取方案从 NLG 论文中收集关键信息，重点关注不同的评估方法（指标、LaaJ 和人工评估）。通过过去六年从四个主要会议（ACL、EMNLP、NAACL 和 INLG）的 14,171 篇论文中提取的元数据，我们揭示了几个关键发现：(1) 任务分歧：虽然对话生成显示出向 LaaJ 的快速转变（2025 年超过 40%），但机器翻译仍然锁定在 n-gram 指标中，问答显示进行人工评估的研究比例大幅下降。 (2) 度量惯性：尽管语义度量得到了发展，但通用度量（例如 BLEU、ROUGE）仍然在没有经验依据的情况下在任务中广泛使用，通常缺乏区分特定质量标准的区分能力。 (3) Human-LaaJ Divergence：我们的关联分析挑战了 LLM 仅充当人类代理的假设； LaaJ 和人类评估优先考虑非常不同的信号，并且明确的验证很少（比较两者的论文的<8%），只有中度到低度的相关性。基于这些观察，我们提出了切实可行的建议，以提高未来 NLG 评估的严谨性。

Title: Adaptive Layer Selection for Layer-Wise Token Pruning in LLM Inference

Authors: Rei Taniguchi, Yuyang Dong, Makoto Onizuka, Chuan Xiao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.07667
Pdf URL: https://arxiv.org/pdf/2601.07667
Copy Paste: [[2601.07667]] Adaptive Layer Selection for Layer-Wise Token Pruning in LLM Inference(https://arxiv.org/abs/2601.07667)
Keywords: language model, llm
Abstract: Due to the prevalence of large language models (LLMs), key-value (KV) cache reduction for LLM inference has received remarkable attention. Among the numerous works proposed in recent years, layer-wise token pruning approaches, which select a subset of tokens at particular layers to retain in the KV cache and prune the others, are among the most popular schemes. They primarily adopt a set of pre-defined layers at which tokens are selected. Such a design is inflexible in the sense that accuracy varies significantly across tasks and deteriorates on harder tasks such as KV retrieval. In this paper, we propose ASL, a training-free method that adaptively chooses the selection layer for KV cache reduction, exploiting the variance of token ranks ordered by attention score. The proposed method balances performance across different tasks while meeting the user-specified KV budget requirement. ASL operates during the prefilling stage and can be used jointly with existing KV cache reduction methods such as SnapKV to optimize the decoding stage. Through evaluations on the InfiniteBench, RULER, and NIAH benchmarks, we show that, when equipped with one-shot token selection, where tokens are selected at a layer and propagated to deeper layers, ASL outperforms state-of-the-art layer-wise token selection methods in accuracy while maintaining decoding speed and KV cache reduction.
摘要：由于大型语言模型 (LLM) 的盛行，LLM 推理的键值 (KV) 缓存减少受到了广泛关注。在近年来提出的众多工作中，分层令牌修剪方法是最流行的方案之一，该方法选择特定层的令牌子集保留在 KV 缓存中并修剪其他令牌。他们主要采用一组预定义的层，在这些层上选择令牌。这种设计是不灵活的，因为不同任务的准确率差异很大，并且在 KV 检索等更困难的任务中会恶化。在本文中，我们提出了 ASL，一种免训练的方法，它利用按注意力分数排序的 token 排名的方差，自适应地选择选择层来减少 KV 缓存。所提出的方法平衡了不同任务的性能，同时满足用户指定的 KV 预算要求。 ASL 在预填充阶段运行，可以与 SnapKV 等现有的 KV 缓存缩减方法联合使用，以优化解码阶段。通过对 InfiniteBench、RULER 和 NIAH 基准的评估，我们表明，配备一次性令牌选择（即在一层选择令牌并将其传播到更深的层），ASL 在准确性方面优于最先进的逐层令牌选择方法，同时保持解码速度和 KV 缓存减少。
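
The one-shot token selection mentioned in this abstract (pick tokens at one layer, reuse that choice in deeper layers) can be sketched as a budgeted top-k over attention scores. This is an illustrative sketch only: it does not reproduce ASL's adaptive, rank-variance-based choice of the selection layer, and the function names, the list-based KV layout, and the choice to prune from the selection layer onward are all assumptions.

```python
from typing import List, Sequence


def select_tokens(attn_scores: Sequence[float], budget: int) -> List[int]:
    """Indices of the `budget` highest-scoring tokens, in original order.

    attn_scores[i] is assumed to be an aggregate attention score for
    token i at the selection layer; how it is aggregated is left open.
    """
    n = len(attn_scores)
    k = max(0, min(budget, n))
    ranked = sorted(range(n), key=lambda i: attn_scores[i], reverse=True)
    return sorted(ranked[:k])


def prune_kv(kv_per_layer: Sequence[Sequence[str]],
             selection_layer: int,
             keep: Sequence[int]) -> List[List[str]]:
    """Keep all tokens before selection_layer; keep only `keep` from there on."""
    pruned = []
    for layer_idx, kv in enumerate(kv_per_layer):
        if layer_idx < selection_layer:
            pruned.append(list(kv))
        else:
            pruned.append([kv[i] for i in keep])
    return pruned


# Toy example: 5 tokens, 3 layers, KV budget of 2 tokens.
scores = [0.10, 0.50, 0.05, 0.30, 0.05]
keep = select_tokens(scores, budget=2)
cache = prune_kv([["a", "b", "c", "d", "e"]] * 3, selection_layer=1, keep=keep)
```

In this toy layout each "KV entry" is just a letter, which is enough to see that the deeper layers store only the selected positions.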

Title: Exploring the Meta-level Reasoning of Large Language Models via a Tool-based Multi-hop Tabular Question Answering Task

Authors: Nick Ferguson, Alan Bundy, Kwabena Nuamah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.07696
Pdf URL: https://arxiv.org/pdf/2601.07696
Copy Paste: [[2601.07696]] Exploring the Meta-level Reasoning of Large Language Models via a Tool-based Multi-hop Tabular Question Answering Task(https://arxiv.org/abs/2601.07696)
Keywords: language model, llm, prompt
Abstract: Recent advancements in Large Language Models (LLMs) are increasingly focused on "reasoning" ability, a concept with many overlapping definitions in the LLM discourse. We take a more structured approach, distinguishing meta-level reasoning (denoting the process of reasoning about intermediate steps required to solve a task) from object-level reasoning (which concerns the low-level execution of the aforementioned steps). We design a novel question answering task based on the values of geopolitical indicators for various countries over various years. Questions require breaking down into intermediate steps, retrieval of data, and mathematical operations over that data. The meta-level reasoning ability of LLMs is analysed by examining the selection of appropriate tools for answering questions. To bring greater depth to the analysis of LLMs beyond final answer accuracy, our task contains 'essential actions' against which we can compare the tool call output of LLMs to infer the strength of reasoning ability. We find that LLMs demonstrate good meta-level reasoning on our task, yet are flawed in some aspects of task understanding. We find that n-shot prompting has little effect on accuracy and that error messages encountered rarely degrade performance, and we provide additional evidence for the poor numeracy of LLMs. Finally, we discuss the generalisation and limitation of our findings to other task domains.
摘要：大型语言模型 (LLM) 的最新进展越来越关注“推理”能力，这是一个在 LLM 论述中具有许多重叠定义的概念。我们采用更加结构化的方法，将元级推理（表示解决任务所需的中间步骤的推理过程）与对象级推理（涉及上述步骤的低级执行）区分开来。我们设计了一种新颖的问答任务，该任务基于各国多年来的地缘政治指标值。问题需要分解为中间步骤、数据检索以及对该数据的数学运算。通过检查回答问题的适当工具的选择来分析法学硕士的元水平推理能力。为了更深入地分析法学硕士，超越最终答案的准确性，我们的任务包含“基本操作”，我们可以比较法学硕士的工具调用输出，以推断推理能力的强度。我们发现法学硕士在我们的任务中展示了良好的元级推理，但在任务理解的某些方面存在缺陷。我们发现n次提示对准确率影响很小；遇到的错误消息通常不会降低性能；并为法学硕士的计算能力较差提供额外的证据。最后，我们讨论了我们的研究结果对其他任务领域的概括和局限性。

Title: Emotional Support Evaluation Framework via Controllable and Diverse Seeker Simulator

Authors: Chaewon Heo, Cheyon Jin, Yohan Jo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.07698
Pdf URL: https://arxiv.org/pdf/2601.07698
Copy Paste: [[2601.07698]] Emotional Support Evaluation Framework via Controllable and Diverse Seeker Simulator(https://arxiv.org/abs/2601.07698)
Keywords: chat
Abstract: As emotional support chatbots have recently gained significant traction across both research and industry, a common evaluation strategy has emerged: use help-seeker simulators to interact with supporter chatbots. However, current simulators suffer from two critical limitations: (1) they fail to capture the behavioral diversity of real-world seekers, often portraying them as overly cooperative, and (2) they lack the controllability required to simulate specific seeker profiles. To address these challenges, we present a controllable seeker simulator driven by nine psychological and linguistic features that underpin seeker behavior. Using authentic Reddit conversations, we train our model via a Mixture-of-Experts (MoE) architecture, which effectively differentiates diverse seeker behaviors into specialized parameter subspaces, thereby enhancing fine-grained controllability. Our simulator achieves superior profile adherence and behavioral diversity compared to existing approaches. Furthermore, evaluating 7 prominent supporter models with our system uncovers previously obscured performance degradations. These findings underscore the utility of our framework in providing a more faithful and stress-tested evaluation for emotional support chatbots.
摘要：随着情感支持聊天机器人最近在研究和行业中获得了巨大的关注，出现了一种常见的评估策略：使用寻求帮助者模拟器与支持者聊天机器人进行交互。然而，当前的模拟器存在两个关键局限性：（1）它们无法捕捉现实世界搜寻者的行为多样性，通常将他们描绘成过度合作；（2）它们缺乏模拟特定搜寻者概况所需的可控性。为了应对这些挑战，我们提出了一个可控的探索者模拟器，该模拟器由支撑探索者行为的九种心理和语言特征驱动。使用真实的 Reddit 对话，我们通过专家混合 (MoE) 架构训练我们的模型，该架构有效地将不同的搜索者行为区分为专门的参数子空间，从而增强细粒度的可控性。与现有方法相比，我们的模拟器实现了卓越的个人资料依从性和行为多样性。此外，使用我们的系统评估 7 个著名的支持者模型，发现了之前被掩盖的性能下降问题。这些发现强调了我们的框架在为情感支持聊天机器人提供更忠实和经过压力测试的评估方面的实用性。

Title: Is Agentic RAG worth it? An experimental comparison of RAG approaches

Authors: Pietro Ferrazzi, Milica Cvjeticanin, Alessio Piraccini, Davide Giannuzzi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.07711
Pdf URL: https://arxiv.org/pdf/2601.07711
Copy Paste: [[2601.07711]] Is Agentic RAG worth it? An experimental comparison of RAG approaches(https://arxiv.org/abs/2601.07711)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: Retrieval-Augmented Generation (RAG) systems are usually defined by the combination of a generator and a retrieval component that extracts textual context from a knowledge base to answer user queries. However, such basic implementations exhibit several limitations, including noisy or suboptimal retrieval, misuse of retrieval for out-of-scope queries, weak query-document matching, and variability or cost associated with the generator. These shortcomings have motivated the development of "Enhanced" RAG, where dedicated modules are introduced to address specific weaknesses in the workflow. More recently, the growing self-reflective capabilities of Large Language Models (LLMs) have enabled a new paradigm, which we refer to as "Agentic" RAG. In this approach, the LLM orchestrates the entire process, deciding which actions to perform, when to perform them, and whether to iterate, thereby reducing reliance on fixed, manually engineered modules. Despite the rapid adoption of both paradigms, it remains unclear which approach is preferable under which conditions. In this work, we conduct an extensive, empirically driven evaluation of Enhanced and Agentic RAG across multiple scenarios and dimensions. Our results provide practical insights into the trade-offs between the two paradigms, offering guidance on selecting the most effective RAG design for real-world applications, considering both costs and performance.
摘要：检索增强生成（RAG）系统通常由生成器和检索组件的组合来定义，该组件从知识库中提取文本上下文以回答用户查询。然而，这种基本实现表现出一些局限性，包括嘈杂或次优检索、误用范围外查询的检索、弱查询文档匹配以及与生成器相关的可变性或成本。这些缺点促使“增强型”RAG 的开发，其中引入了专用模块来解决工作流程中的特定弱点。最近，大型语言模型 (LLM) 不断增强的自我反思能力催生了一种新的范式，我们将其称为“代理”RAG。在这种方法中，法学硕士协调整个流程——决定执行哪些操作、何时执行以及是否迭代——从而减少对固定的、手动设计的模块的依赖。尽管这两种范式都被迅速采用，但仍不清楚在何种条件下哪种方法更可取。在这项工作中，我们跨多个场景和维度对增强型和代理 RAG 进行了广泛的、实证驱动的评估。我们的结果为两种范式之间的权衡提供了实用的见解，为在考虑成本和性能的情况下为实际应用选择最有效的 RAG 设计提供了指导。

Title: Structure First, Reason Next: Enhancing a Large Language Model using Knowledge Graph for Numerical Reasoning in Financial Documents

Authors: Aryan Mishra, Akash Anil
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.07754
Pdf URL: https://arxiv.org/pdf/2601.07754
Copy Paste: [[2601.07754]] Structure First, Reason Next: Enhancing a Large Language Model using Knowledge Graph for Numerical Reasoning in Financial Documents(https://arxiv.org/abs/2601.07754)
Keywords: language model, llm
Abstract: Numerical reasoning is an important task in the analysis of financial documents. It helps in understanding financial texts and in producing numerical predictions with logical conclusions for a given query. Recently, Large Language Models (LLMs) have shown promising results in multiple Question-Answering (Q-A) systems with the capability of logical reasoning. As documents related to finance often consist of long and complex financial contexts, LLMs appear well-suited for building high-quality automated financial question-answering systems. However, LLMs often face challenges in accurately processing the various numbers within financial reports. Extracting numerical data from unstructured text and semi-structured tables, and reliably performing accurate calculations, remains a significant bottleneck for numerical reasoning in most state-of-the-art LLMs. Recent studies have shown that structured data augmentations, such as Knowledge Graphs (KGs), have notably improved the predictions of LLMs along with logical explanations. Thus, it is important to consider the inherent structured information in financial reports when using LLMs for financial analytics. This paper proposes a framework that incorporates structured information using KGs along with LLM predictions for numerical reasoning tasks. The KGs are extracted directly from the document being processed, using a proposed schema. We evaluated the proposed framework on the FinQA benchmark, using an open-source LLM, namely Llama 3.1 8B Instruct. We observed that the proposed framework improved execution accuracy by approximately 12% relative to the vanilla LLM.
摘要：数值推理是财务文件分析中的一项重要任务。它有助于理解和执行数值预测，并为给定的查询从金融文本中寻求答案提供逻辑结论。最近，大型语言模型（LLM）在具有逻辑推理能力的多个问答（Q-A）系统中显示出了可喜的结果。由于与金融相关的文档通常包含冗长而复杂的金融背景，因此法学硕士似乎非常适合构建高质量的自动化金融问答系统。然而，法学硕士在准确处理财务报告中的各种数字方面经常面临挑战。从非结构化文本和半结构化表格中提取数值数据，并可靠地执行准确计算，仍然是大多数最先进的法学硕士中数值推理的重大瓶颈。最近的研究表明，知识图谱（KG）等结构化数据增强显着改善了法学硕士的预测以及逻辑解释。因此，在使用法学硕士进行各种财务分析时，考虑财务报告中固有的结构化信息是一个重要的要求。本文提出了一个框架，使用知识图谱将结构化信息与 LLM 预测结合起来，用于数值推理任务。知识图谱是使用从正在处理的文档中固有的提议模式提取的。我们使用开源 LLM（即 Llama 3.1 8B Instruct）根据基准数据 FinQA 评估了我们提出的框架。我们观察到，相对于普通 LLM，所提出的框架将执行准确性提高了约 12%。

Title: Contrastive Learning with Narrative Twins for Modeling Story Salience

Authors: Igor Sterner, Alex Lascarides, Frank Keller
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.07765
Pdf URL: https://arxiv.org/pdf/2601.07765
Copy Paste: [[2601.07765]] Contrastive Learning with Narrative Twins for Modeling Story Salience(https://arxiv.org/abs/2601.07765)
Keywords: llm, prompt
Abstract: Understanding narratives requires identifying which events are most salient for a story's progression. We present a contrastive learning framework for modeling narrative salience that learns story embeddings from narrative twins: stories that share the same plot but differ in surface form. Our model is trained to distinguish a story from both its narrative twin and a distractor with similar surface features but different plot. Using the resulting embeddings, we evaluate four narratologically motivated operations for inferring salience (deletion, shifting, disruption, and summarization). Experiments on short narratives from the ROCStories corpus and longer Wikipedia plot summaries show that contrastively learned story embeddings outperform a masked-language-model baseline, and that summarization is the most reliable operation for identifying salient sentences. If narrative twins are not available, random dropout can be used to generate the twins from a single story. Effective distractors can be obtained either by prompting LLMs or, in long-form narratives, by using different parts of the same story.
摘要：理解叙事需要确定哪些事件对故事的进展最重要。我们提出了一个用于建模叙事显着性的对比学习框架，该框架从叙事双胞胎中学习故事嵌入：共享相同情节但表面形式不同的故事。我们的模型经过训练，可以将一个故事与它的叙事双胞胎和表面特征相似但情节不同的干扰因素区分开来。使用由此产生的嵌入，我们评估了四种出于叙事动机的操作来推断显着性（删除、移动、破坏和摘要）。对 ROCStories 语料库的简短叙述和较长的维基百科情节摘要进行的实验表明，对比学习的故事嵌入优于掩码语言模型基线，并且摘要是识别显着句子的最可靠操作。如果叙事双胞胎不可用，则可以使用随机退出来从单个故事生成双胞胎。有效的干扰因素可以通过提示法学硕士或在长篇叙述中使用同一故事的不同部分来获得。
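
The training signal described in this abstract (pull a story toward its narrative twin, push it away from a distractor) can be written as a two-candidate contrastive loss. The sketch below is an assumption-laden illustration, not the paper's objective: the InfoNCE-style form, cosine similarity, and the `temperature` value are choices made here, and the embeddings are plain Python lists instead of model outputs.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def twin_contrastive_loss(anchor: Sequence[float],
                          twin: Sequence[float],
                          distractor: Sequence[float],
                          temperature: float = 0.1) -> float:
    """-log softmax probability of the twin among {twin, distractor}."""
    pos = cosine(anchor, twin) / temperature
    neg = cosine(anchor, distractor) / temperature
    # Numerically stable log-sum-exp over the two candidates.
    m = max(pos, neg)
    log_denominator = m + math.log(math.exp(pos - m) + math.exp(neg - m))
    return log_denominator - pos


# Toy embeddings: the twin is close to the anchor, the distractor is not.
story = [1.0, 0.0]
twin = [0.9, 0.1]
distractor = [0.0, 1.0]
loss_good = twin_contrastive_loss(story, twin, distractor)
loss_swapped = twin_contrastive_loss(story, distractor, twin)
```

The loss is small when the twin is the more similar candidate and large when the roles are swapped. It equals log 2 when the two candidates are equally similar to the anchor.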

Title: Enhancing Self-Correction in Large Language Models through Multi-Perspective Reflection

Authors: Mariana Costa, Alberlucia Rafael Soarez, Daniel Kim, Camila Ferreira
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.07780
Pdf URL: https://arxiv.org/pdf/2601.07780
Copy Paste: [[2601.07780]] Enhancing Self-Correction in Large Language Models through Multi-Perspective Reflection(https://arxiv.org/abs/2601.07780)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: While Chain-of-Thought (CoT) prompting advances LLM reasoning, challenges persist in consistency, accuracy, and self-correction, especially for complex or ethically sensitive tasks. Existing single-dimensional reflection methods offer insufficient improvements. We propose MyGO Poly-Reflective Chain-of-Thought (PR-CoT), a novel methodology employing structured multi-perspective reflection. After initial CoT, PR-CoT guides the LLM to self-assess its reasoning across multiple predefined angles: logical consistency, information completeness, biases/ethics, and alternative solutions. Implemented purely via prompt engineering, this process refines the initial CoT into a more robust and accurate final answer without model retraining. Experiments across arithmetic, commonsense, ethical decision-making, and logical puzzles, using GPT-3.5 and GPT-4 models, demonstrate PR-CoT's superior performance. It significantly outperforms traditional CoT and existing reflection methods in logical consistency and error correction, with notable gains in nuanced domains like ethical decision-making. Ablation studies, human evaluations, and qualitative analyses further validate the contribution of each reflection perspective and the overall efficacy of our poly-reflective paradigm in fostering more reliable LLM reasoning.
摘要：虽然思想链 (CoT) 促进了 LLM 推理的发展，但在一致性、准确性和自我纠正方面仍然存在挑战，特别是对于复杂或道德敏感的任务。现有的一维反射方法没有提供足够的改进。我们提出了 MyGO Poly-Reflective Chain-of-Thought (PR-CoT)，这是一种采用结构化多视角反射的新颖方法。在初始 CoT 之后，PR-CoT 指导法学硕士从多个预定义角度自我评估其推理：逻辑一致性、信息完整性、偏见/道德和替代解决方案。该过程纯粹通过即时工程实施，将初始 CoT 细化为更稳健、更准确的最终答案，无需模型重新训练。使用 GPT-三点五和 GPT-四模型进行的算术、常识、道德决策和逻辑难题的实验证明了 PR-CoT 的卓越性能。它在逻辑一致性和纠错方面显着优于传统的 CoT 和现有的反射方法，并且在道德决策等微妙领域取得了显着的进步。消融研究、人类评估和定性分析进一步验证了每个反思视角的贡献以及我们的多反思范式在促进更可靠的法学硕士推理方面的整体功效。
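
The prompt-only pipeline described in this abstract (initial CoT, reflection from four predefined angles, then a refined answer) can be sketched as a short orchestration function. The four perspectives come from the abstract. The prompt wording, the `llm` callable, and the single refinement pass are assumptions for illustration and not the authors' prompts.

```python
from typing import Callable, List

# The four reflection angles named in the abstract.
PERSPECTIVES = [
    "logical consistency",
    "information completeness",
    "biases/ethics",
    "alternative solutions",
]


def pr_cot(question: str, llm: Callable[[str], str]) -> str:
    """Draft with CoT, critique from each perspective, then revise once."""
    draft = llm(f"Question: {question}\nLet's think step by step.")

    critiques: List[str] = []
    for perspective in PERSPECTIVES:
        critiques.append(llm(
            f"Question: {question}\n"
            f"Draft reasoning:\n{draft}\n"
            f"Review the draft for {perspective}. List concrete problems, if any."
        ))

    notes = "\n".join(
        f"- [{perspective}] {critique}"
        for perspective, critique in zip(PERSPECTIVES, critiques)
    )
    return llm(
        f"Question: {question}\n"
        f"Draft reasoning:\n{draft}\n"
        f"Reviewer notes:\n{notes}\n"
        "Write a revised final answer that addresses the notes."
    )


# Usage with a stand-in model that just records the prompts it receives.
calls: List[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"reply{len(calls)}"


answer = pr_cot("Is 17 prime?", fake_llm)
```

With any real chat-completion client, `llm` would wrap a single request. The sketch makes six model calls per question (one draft, four critiques, one revision), which shows the cost side of the method.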

Title: Beyond Single-Shot: Multi-step Tool Retrieval via Query Planning

Authors: Wei Fang, James Glass
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2601.07782
Pdf URL: https://arxiv.org/pdf/2601.07782
Copy Paste: [[2601.07782]] Beyond Single-Shot: Multi-step Tool Retrieval via Query Planning(https://arxiv.org/abs/2601.07782)
Keywords: llm, agent
Abstract: LLM agents operating over massive, dynamic tool libraries rely on effective retrieval, yet standard single-shot dense retrievers struggle with complex requests. These failures primarily stem from the disconnect between abstract user goals and technical documentation, and the limited capacity of fixed-size embeddings to model combinatorial tool compositions. To address these challenges, we propose TOOLQP, a lightweight framework that models retrieval as iterative query planning. Instead of single-shot matching, TOOLQP decomposes instructions into sub-tasks and dynamically generates queries to interact with the retriever, effectively bridging the semantic gap by targeting the specific sub-tasks required for composition. We train TOOLQP using synthetic query trajectories followed by optimization via Reinforcement Learning with Verifiable Rewards (RLVR). Experiments demonstrate that TOOLQP achieves state-of-the-art performance, exhibiting superior zero-shot generalization, robustness across diverse retrievers, and significant improvements in downstream agentic execution.
摘要：在庞大的动态工具库上运行的 LLM 代理依赖于有效的检索，但标准的单次密集检索器却难以应对复杂的请求。这些失败主要源于抽象用户目标和技术文档之间的脱节，以及固定大小嵌入对组合工具组合进行建模的能力有限。为了应对这些挑战，我们提出了 TOOLQP，这是一个轻量级框架，将检索建模为迭代查询规划。 TOOLQP 不是单次匹配，而是将指令分解为子任务并动态生成查询以与检索器交互，通过针对组合所需的特定子任务来有效地弥合语义差距。我们使用合成查询轨迹来训练 TOOLQP，然后通过具有可验证奖励的强化学习 (RLVR) 进行优化。实验表明，TOOLQP 实现了最先进的性能，表现出卓越的零样本泛化能力、跨不同检索器的鲁棒性以及下游代理执行的显着改进。
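
The contrast this abstract draws between single-shot matching and iterative query planning can be shown with a toy retriever. This is a sketch under stated assumptions, not TOOLQP itself: the word-overlap retriever, the rule-based `rule_planner` that splits on " and ", and the tool names and descriptions are all hypothetical stand-ins for a learned planner and a dense retriever.

```python
from typing import Callable, Dict, List, Optional


def make_retriever(tools: Dict[str, str]) -> Callable[[str, int], List[str]]:
    """Toy retriever: rank tools by word overlap with the query."""
    def retrieve(query: str, k: int = 1) -> List[str]:
        words = set(query.lower().split())
        ranked = sorted(
            tools,
            key=lambda name: (-len(words & set(tools[name].lower().split())), name),
        )
        return ranked[:k]
    return retrieve


def rule_planner(instruction: str, issued: List[str]) -> Optional[str]:
    """Hypothetical planner: one sub-query per ' and '-separated clause."""
    parts = [p.strip() for p in instruction.split(" and ")]
    return parts[len(issued)] if len(issued) < len(parts) else None


def iterative_retrieve(instruction: str,
                       next_query: Callable[[str, List[str]], Optional[str]],
                       retrieve: Callable[[str, int], List[str]],
                       k: int = 1, max_steps: int = 5) -> List[str]:
    """Issue sub-queries one at a time and accumulate distinct tools."""
    issued: List[str] = []
    found: List[str] = []
    for _ in range(max_steps):
        query = next_query(instruction, issued)
        if query is None:
            break
        issued.append(query)
        for tool in retrieve(query, k):
            if tool not in found:
                found.append(tool)
    return found


TOOLS = {  # hypothetical tool library
    "weather_api": "get the weather forecast for a city",
    "flight_search": "search flights between two airports",
    "currency_convert": "convert an amount between currencies",
}
retrieve = make_retriever(TOOLS)
task = "check the weather forecast in Paris and search flights to Paris"
single_shot = retrieve(task, 1)
planned = iterative_retrieve(task, rule_planner, retrieve)
```

In this toy setup the single query surfaces only one of the two tools the task needs, while the planned sub-queries recover both.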

Title: Kinship Data Benchmark for Multi-hop Reasoning

Authors: Tianda Sun, Dimitar Kazakov
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.07794
Pdf URL: https://arxiv.org/pdf/2601.07794
Copy Paste: [[2601.07794]] Kinship Data Benchmark for Multi-hop Reasoning(https://arxiv.org/abs/2601.07794)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly evaluated on their ability to perform multi-hop reasoning, i.e., to combine multiple pieces of information into a coherent inference. We introduce KinshipQA, a benchmark designed to probe this capability through reasoning over kinship relations. The central contribution of our work is a generative pipeline that produces, on demand, large-scale, realistic, and culture-specific genealogical data: collections of interconnected family trees that satisfy explicit marriage constraints associated with different kinship systems. This allows task difficulty, cultural assumptions, and relational depth to be systematically controlled and varied. From these genealogies, we derive textual inference tasks that require reasoning over implicit relational chains. We evaluate the resulting benchmark using six state-of-the-art LLMs, spanning both open-source and closed-source models, under a uniform zero-shot protocol with deterministic decoding. Performance is measured using exact-match and set-based metrics. Our results demonstrate that KinshipQA yields a wide spread of outcomes and exposes systematic differences in multi-hop reasoning across models and cultural settings.
摘要：大型语言模型 (LLM) 越来越多地根据其执行多跳推理的能力进行评估，即将多条信息组合成连贯的推理。我们引入了 KinshipQA，这是一个旨在通过亲属关系推理来考察这种能力的基准。我们工作的核心贡献是一个生成管道，可以按需生成大规模、逼真且特定于文化的家谱数据：相互关联的家谱树集合，满足与不同亲属制度相关的明确婚姻约束。这使得任务难度、文化假设和关系深度能够得到系统的控制和调整。从这些家谱中，我们推导出需要对隐式关系链进行推理的文本推理任务。我们使用六个最先进的 LLM（涵盖开源和闭源模型），在采用确定性解码的统一零样本协议下评估该基准。性能使用精确匹配和基于集合的指标来衡量。我们的结果表明，KinshipQA 产生了差异悬殊的结果，并揭示了不同模型和文化环境下多跳推理的系统性差异。

Title: Learning Through Dialogue: Unpacking the Dynamics of Human-LLM Conversations on Political Issues

Authors: Shaz Furniturewala, Gerard Christopher Yeo, Kokil Jaidka
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2601.07796
Pdf URL: https://arxiv.org/pdf/2601.07796
Copy Paste: [[2601.07796]] Learning Through Dialogue: Unpacking the Dynamics of Human-LLM Conversations on Political Issues(https://arxiv.org/abs/2601.07796)
Keywords: language model, llm, chat
Abstract: Large language models (LLMs) are increasingly used as conversational partners for learning, yet the interactional dynamics supporting users' learning and engagement are understudied. We analyze the linguistic and interactional features from both LLM and participant chats across 397 human-LLM conversations about socio-political issues to identify the mechanisms and conditions under which LLM explanations shape changes in political knowledge and confidence. Mediation analyses reveal that LLM explanatory richness partially supports confidence by fostering users' reflective insight, whereas its effect on knowledge gain operates entirely through users' cognitive engagement. Moderation analyses show that these effects are highly conditional and vary by political efficacy. Confidence gains depend on how high-efficacy users experience and resolve uncertainty. Knowledge gains depend on high-efficacy users' ability to leverage extended interaction, with longer conversations benefiting primarily reflective users. In summary, we find that learning from LLMs is an interactional achievement, not a uniform outcome of better explanations. The findings underscore the importance of aligning LLM explanatory behavior with users' engagement states to support effective learning in designing Human-AI interactive systems.
摘要：大型语言模型 (LLM) 越来越多地用作学习的对话伙伴，但支持用户学习和参与的交互动态尚未得到充分研究。我们分析了 397 场有关社会政治问题的人类与 LLM 对话中 LLM 和参与者双方发言的语言和互动特征，以确定 LLM 的解释影响政治知识和信心变化的机制和条件。中介分析表明，LLM 的解释丰富性通过培养用户的反思性洞察力来部分支持信心，而其对知识获取的影响完全通过用户的认知参与来发挥作用。调节分析表明，这些影响是高度有条件的，并且因政治效能感而异。信心的提升取决于高效能感用户如何体验和解决不确定性。知识获取取决于高效能感用户利用较长交互的能力，较长的对话主要使反思型用户受益。总之，我们发现向 LLM 学习是一种互动中达成的成果，而不是更好的解释带来的统一结果。研究结果强调了在设计人机交互系统时，将 LLM 的解释行为与用户的参与状态保持一致以支持有效学习的重要性。

Title: The Confidence Trap: Gender Bias and Predictive Certainty in LLMs

Authors: Ahmed Sabir, Markus Kängsepp, Rajesh Sharma
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.07806
Pdf URL: https://arxiv.org/pdf/2601.07806
Copy Paste: [[2601.07806]] The Confidence Trap: Gender Bias and Predictive Certainty in LLMs(https://arxiv.org/abs/2601.07806)
Keywords: language model, llm
Abstract: The increased use of Large Language Models (LLMs) in sensitive domains leads to growing interest in how their confidence scores correspond to fairness and bias. This study examines the alignment between LLM-predicted confidence and human-annotated bias judgments. Focusing on gender bias, the research investigates probability confidence calibration in contexts involving gendered pronoun resolution. The goal is to evaluate if calibration metrics based on predicted confidence scores effectively capture fairness-related disparities in LLMs. The results show that, among the six state-of-the-art models, Gemma-2 demonstrates the worst calibration according to the gender bias benchmark. The primary contribution of this work is a fairness-aware evaluation of LLMs' confidence calibration, offering guidance for ethical deployment. In addition, we introduce a new calibration metric, Gender-ECE, designed to measure gender disparities in resolution tasks.
摘要：在敏感领域中越来越多地使用大型语言模型 (LLM)，使人们越来越关注其置信度分数如何与公平性和偏见相对应。本研究检验了 LLM 预测的置信度与人工标注的偏见判断之间的一致性。该研究以性别偏见为重点，考察了涉及性别代词消解的情境下的概率置信度校准。目标是评估基于预测置信度分数的校准指标能否有效捕获 LLM 中与公平性相关的差异。结果表明，在六个最先进的模型中，根据性别偏见基准，Gemma-2 的校准效果最差。这项工作的主要贡献是对 LLM 的置信度校准进行考虑公平性的评估，为合乎伦理的部署提供指导。此外，我们还引入了一种新的校准指标 Gender-ECE，旨在衡量消解任务中的性别差异。

Title: Reference Games as a Testbed for the Alignment of Model Uncertainty and Clarification Requests

Authors: Manar Ali, Judith Sieker, Sina Zarrieß, Hendrik Buschmeier
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.07820
Pdf URL: https://arxiv.org/pdf/2601.07820
Copy Paste: [[2601.07820]] Reference Games as a Testbed for the Alignment of Model Uncertainty and Clarification Requests(https://arxiv.org/abs/2601.07820)
Keywords: language model
Abstract: In human conversation, both interlocutors play an active role in maintaining mutual understanding. When addressees are uncertain about what speakers mean, for example, they can request clarification. It is an open question for language models whether they can assume a similar addressee role, recognizing and expressing their own uncertainty through clarification. We argue that reference games are a good testbed to approach this question as they are controlled, self-contained, and make clarification needs explicit and measurable. To test this, we evaluate three vision-language models comparing a baseline reference resolution task to an experiment where the models are instructed to request clarification when uncertain. The results suggest that even in such simple tasks, models often struggle to recognize internal uncertainty and translate it into adequate clarification behavior. This demonstrates the value of reference games as testbeds for interaction qualities of (vision and) language models.
摘要：在人类对话中，对话双方在维持相互理解方面都发挥着积极作用。例如，当听话者不确定说话者的意思时，他们可以要求澄清。语言模型能否承担类似的听话者角色，即识别自身的不确定性并通过请求澄清来表达它，仍是一个悬而未决的问题。我们认为指称游戏是研究这个问题的一个很好的测试平台，因为它们是受控的、自成一体的，并且使澄清需求明确且可衡量。为了检验这一点，我们评估了三个视觉语言模型，将基线指称消解任务与一项实验进行比较，在该实验中模型被指示在不确定时请求澄清。结果表明，即使在如此简单的任务中，模型也常常难以识别内部不确定性并将其转化为恰当的澄清行为。这证明了指称游戏作为（视觉和）语言模型交互质量测试平台的价值。